## Datasport Public Page Analytics

This notebook queries the Datasport public pages and retrieves the rankings and athletes to do some basic analytics, such as the number of athletes and their average birth year.

To run the notebook on macOS, open a terminal, install the `notebook` package and run `jupyter`:

    pip install notebook
    jupyter notebook
    
Once the notebook is open, execute the blocks from top to bottom to get some data.

### Requirements

The following blocks install the requirements and import the necessary libraries.

In [None]:
# Install requests
!pip2 install requests

In [None]:
# Import requirements
import requests
import string
import errno  # used by the makedirs guard in analyse_datasport_url
import os
import re

### Datasport public page query

We query the public ranking pages of Datasport based on a year, a run name and a letter.
Thus we use a URL like the following:

```python
run = 'escalade'
year = 2016
letter = 'a'
url = 'https://services.datasport.com/{}/lauf/{}/alfa{}.htm'.format(year, run, letter)

# > https://services.datasport.com/2016/lauf/escalade/alfaa.htm
```
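Each run and year has one such page per letter. As a minimal sketch of how the letter pages fit together, the snippet below builds all 26 URLs for one run and year. The helper name `build_urls` and the constant `URL_TEMPLATE` are hypothetical and not part of the notebook; the URL pattern is the one shown above.

```python
import string

# URL pattern taken from the example above
URL_TEMPLATE = 'https://services.datasport.com/{}/lauf/{}/alfa{}.htm'

def build_urls(run, year):
    """Return one ranking URL per letter a-z (hypothetical helper)."""
    return [URL_TEMPLATE.format(year, run, letter)
            for letter in string.ascii_lowercase]

urls = build_urls('escalade', 2016)
print(urls[0])    # https://services.datasport.com/2016/lauf/escalade/alfaa.htm
print(len(urls))  # 26
```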

The following block defines the functions required to run the queries.

In [None]:
def parse_athletes_regex(data, year=2016):
    """Parse athlete lines with a regex and return a list of dicts"""
    athletes = []
    if year >= 2000:
        # Columns: category, rank, name, birth year, city (ends at two spaces)
        regex = r"(^|>)(?P<category>[-/\w]+)\s+(?P<rank>[\d-]+)\.?\s+(?P<name>.+?)\s+(?P<year>\d{4}|\d{2}|\?{4})\s+(?P<city>.+?)\s\s.+$"
        matches = re.finditer(regex, data, re.MULTILINE)

        for match in matches:
            # Build a fresh dict per match; a single shared dict would make
            # every list entry point at the last athlete parsed
            ath = {}
            ath['category'] = match.group('category')
            ath['rank'] = match.group('rank')
            ath['name'] = match.group('name')
            ath['city'] = match.group('city')
            try:
                ath['year'] = int(match.group('year'))
            except ValueError:
                # Unknown birth year ('????')
                ath['year'] = 0

            # Default to 'CH' unless the city starts with a one-letter
            # prefix followed by '-'
            ath['country'] = 'CH'
            if ath['city'][1:2] == '-':
                ath['country'] = ath['city'][0:1]

            athletes.append(ath)

        # print('Matched {} athletes in data.'.format(len(athletes)))

    return athletes

def analyse_datasport_url(run, year, filn, url, store_local=False):
    """Analyse a datasport URL"""

    filename = '{}/{}/{}'.format(run, year, filn)
    
    if store_local and os.path.exists(filename):
        with open(filename, 'r') as fd:
            # print 'Reading from local file...',
            data = fd.read()
    else:
        # print 'Querying Datasport website...',
        r = requests.get(url)
        data = r.text

        if store_local:
            if not os.path.exists(os.path.dirname(filename)):
                try:
                    os.makedirs(os.path.dirname(filename))
                except OSError as exc: # Guard against race condition
                    if exc.errno != errno.EEXIST:
                        raise
            with open(filename, 'wb') as fd:
                for chunk in r.iter_content(chunk_size=128):
                    fd.write(chunk)
    
    return parse_athletes_regex(data, year)

def category_to_group(category):
    if category in ['PousF-A9', 'PousF-B6', 'PousF-B7', 'Cad-A-F']:
        return 'Ecolières'
    if category in ['PousM-B6', 'Cad-A-M']:
        return 'Ecoliers'
    
    if category in ['Mix2-H', 'Mix3-H']:
        return 'Hommes'
    
    if category in ['Mix2-F', 'Mix3-F']:
        return 'Femmes'
    
    if category in ['Walk-Adu']:
        return 'Walk'
    
def print_stats(run, year, parsed):
    year_sum = sum(a['year'] for a in parsed)
    year_count = len([x for x in parsed if x['year']>0])
    if year_count == 0:
        average = 0
    else:
        average = year_sum / year_count
    
    print('For {}, year {}: Average {}, Participants {}'.format(run, year, average, len(parsed)))

def get_datasport(run, from_year, to_year):
    """Query datasport years"""
    dd = {}
    for year in range(from_year,to_year):
        print('Querying letters a-z for {} and year {}'.format(run, year))
        dd[year] = []
        parsed_data = []
        for letter in string.ascii_lowercase:
            # print 'Querying letter {} for {}.'.format(letter, year),
            filn = 'alfa{}.htm'.format(letter)
            url = 'https://services.datasport.com/{}/lauf/{}/alfa{}.htm'.format(year, run, letter)
            # Query the data
            try:
                new_data = analyse_datasport_url(run, year, filn, url, False)
                # append to array
                parsed_data = parsed_data + new_data
            except Exception as e:
                print(e)
        dd[year] = parsed_data

    return dd

def print_datasport(dd, from_year, to_year):
    """Print datasport years"""
    # Note: uses the global `run` set in the execution cell below
    for year in range(from_year,to_year):
        print_stats(run, year, dd[year])
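
To illustrate what the regular expression in `parse_athletes_regex` captures, here is a minimal sketch that applies the same pattern to a single line. The sample line is invented for illustration and real Datasport pages may be laid out differently. Note that the lazy `name` and `city` groups rely on the birth year and on a run of at least two spaces to know where they end.

```python
import re

# Same pattern as in parse_athletes_regex above
REGEX = (r"(^|>)(?P<category>[-/\w]+)\s+(?P<rank>[\d-]+)\.?\s+(?P<name>.+?)\s+"
         r"(?P<year>\d{4}|\d{2}|\?{4})\s+(?P<city>.+?)\s\s.+$")

# Invented sample line (assumption): category, rank, name, birth year, city, time
sample = "Mix2-H     12. Muster Hans        1984 F-Annemasse      38.15,2"

match = re.search(REGEX, sample, re.MULTILINE)
athlete = match.groupdict()

# Same country rule as the notebook: one-letter prefix before '-', else 'CH'
athlete['country'] = 'CH'
if athlete['city'][1:2] == '-':
    athlete['country'] = athlete['city'][0:1]

print(athlete['category'])  # Mix2-H
print(athlete['rank'])      # 12
print(athlete['name'])      # Muster Hans
print(athlete['year'])      # 1984
print(athlete['city'])      # F-Annemasse
print(athlete['country'])   # F
```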

### Execute queries

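The following block executes the queries and runs the analytics.  
Change the run name (see below for other runs) and the start and end year as needed. The end year is exclusive, because the years are iterated with `range(from_year, to_year)`.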

In [None]:
run = 'zuerich'
from_year = 2000
to_year = 2017

dd = get_datasport(run, from_year, to_year)
print('-------------------------------')
print('RESULTS:')
print_datasport(dd, from_year, to_year)
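
Once `dd` is filled, further statistics can be derived from it. The following is a minimal sketch under stated assumptions: `sample_dd` is a small hand-made dictionary with invented values in the same shape as the result of `get_datasport` (a year mapped to a list of athlete dicts), and the helper `average_birth_year` is hypothetical. It ignores unknown birth years (stored as 0) and returns a float instead of relying on integer division.

```python
from collections import Counter

# Hand-made sample in the shape returned by get_datasport (values invented)
sample_dd = {
    2016: [
        {'name': 'Muster Hans', 'year': 1984, 'country': 'CH'},
        {'name': 'Dupont Marie', 'year': 1990, 'country': 'F'},
        {'name': 'Unknown Runner', 'year': 0, 'country': 'CH'},
    ],
}

def average_birth_year(athletes):
    """Mean birth year over athletes with a known year, or None."""
    years = [a['year'] for a in athletes if a['year'] > 0]
    if not years:
        return None
    return float(sum(years)) / len(years)

for year in sorted(sample_dd):
    athletes = sample_dd[year]
    # Count participants per country code
    countries = Counter(a['country'] for a in athletes)
    print('{}: average birth year {}, countries {}'.format(
        year, average_birth_year(athletes), dict(countries)))
```

With the real data, `sample_dd` would simply be replaced by the `dd` returned from `get_datasport`.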

#### Other runs

The following other runs were tested as well and should work with the notebook:

In [None]:
runs = ['escalade', 'morat', 'lamara', '20km', 'kerzers', 
        'gurten', 'frauenlauf', 'zinal', 'zuerich', 'greifenseelauf',
        'trotteuse', 'winterthur']