# Parsing Data from [datasport.com](https://www.datasport.com/en/)

We use Postman to inspect the parameters of the URL requests needed for this exercise.

(Note that equivalent tools exist for other browsers; for Firefox, see
http://stackoverflow.com/questions/28997326/postman-addons-like-in-firefox)

In [1]:
# important modules for this HW
import bs4 # doc: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import requests as rq 
import re

# previous useful modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')

In [2]:
form_source = rq.get("https://www.datasport.com/en/")
form_soup = bs4.BeautifulSoup(form_source.text, "html.parser")
# print(form_soup.prettify())

Let's get all the `select` menus of the page using the `find_all` method of *BeautifulSoup*, which searches for all tags of a given type.

In [3]:
selectors = form_soup.find_all('select')
print(len(selectors))

4


Most importantly, we can find out what each tag is about by printing its `name` attribute:

In [4]:
for num, s in enumerate(selectors):
    print("Select n°{} : {}".format(num, s.attrs['name'])) # wild french appears...

Select n°0 : etyp
Select n°1 : eventmonth
Select n°2 : eventyear
Select n°3 : eventlocation


In [5]:
for s in selectors:
    options = s.find_all('option')
    options_desc_values = [(o.text, o.attrs['value']) for o in options]
    print(s.attrs['name'] + ':')
    for (d,v) in options_desc_values:
        print("- {} [{}]".format(d,v)) # more french

etyp:
- All [all]
- ---- [all]
- Cross-Country-Skiing [Cross-Country-Skiing]
- Cycling [Cycling]
- Cycling,MTB [Cycling,MTB]
- Cycling,Others [Cycling,Others]
- Duathlon [Duathlon]
- Inline [Inline]
- MTB [MTB]
- MTB,Cycling [MTB,Cycling]
- MTB,Cycling,Others [MTB,Cycling,Others]
- MTB,Others [MTB,Others]
- MTB,X-Hours [MTB,X-Hours]
- Others [Others]
- Others,Inline,Running,MTB [Others,Inline,Running,MTB]
- Running [Running]
- Running,Inline [Running,Inline]
- Running,MTB [Running,MTB]
- Running,MTB,Others [Running,MTB,Others]
- Running,Skiing/Snowboard [Running,Skiing/Snowboard]
- Running,Waffenlauf [Running,Waffenlauf]
- Running,Walking [Running,Walking]
- Running,Walking,MTB [Running,Walking,MTB]
- Running,Walking,Others [Running,Walking,Others]
- Running,X-Hours [Running,X-Hours]
- Skiing/Snowboard [Skiing/Snowboard]
- Triathlon [Triathlon]
- Triathlon,Duathlon [Triathlon,Duathlon]
- Triathlon,Others [Triathlon,Others]
- Waffenlauf [Waffenlauf]
- Walking [Walking]
- X-Hours [X-Hour
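
The two steps above (list the `select` tags, then read each `option`) can also be tried offline on a small HTML string. This is only a minimal sketch: the markup is a made-up fragment that imitates the structure of the real form, and the `menus` dictionary is a name introduced here for illustration.

```python
import bs4

# Made-up fragment imitating the structure of the search form (assumption)
html = """
<form>
  <select name="etyp">
    <option value="all">All</option>
    <option value="Running">Running</option>
  </select>
  <select name="eventyear">
    <option value="2016">2016</option>
  </select>
</form>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Map each menu name to a {description: value} dictionary of its options
menus = {
    s["name"]: {o.get_text(strip=True): o["value"] for o in s.find_all("option")}
    for s in soup.find_all("select")
}
print(menus)
```

Keeping the options in a dictionary keyed by menu name makes it easy to look up the value to send for a given description.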

## Let's get some data

To get started, we collect the results of the Lausanne Marathon, one of the main events in Switzerland.

First we inspect the HTML of the main page and __extract the relevant parameters__ to query:

In [6]:
laus_mar_url = 'https://services.datasport.com/2016/lauf/lamara/'
result_html = rq.get(laus_mar_url)

# use BS to get the categories into which the data is divided:

result_soup = bs4.BeautifulSoup(result_html.text, "lxml")
result_font = result_soup.find_all('font')

print('number of categories (age/sex/overall) in the main page:', len(result_font))

number of categories (age/sex/overall) in the main page: 119


In [7]:
# we look for the entries containing 'Overall',
# as they are the most general categories

# 'Overall' is probably a general keyword, since it is also found in
# events in other languages,
# like https://services.datasport.com/2016/lauf/ascona-locarno-marathon/

good_fonts_num = []

for n_font, font in enumerate(result_font):
    
    if 'Overall' in font.findChild().get_text():
            
        good_fonts_num.append(n_font)
        print(font.findChild().get_text())
        
        
good_fonts_num = np.asarray(good_fonts_num)        
        
# PROBLEM: 'Marathon Hommes Overall' is not found by this loop,
# because its HTML does not have the same shape as the other categories

Marathon Dames Overall (220 classées)
Semi Marathon Hommes Overall (2934 classés)
Semi Marathon Dames Overall (1480 classées)
10km Hommes Overall (2769 classés)
10km Dames Overall (2747 classées)


In [8]:
good_fonts_num

array([ 3,  5,  7,  9, 11])

In [9]:
# we have to collect every link whose href contains 'RANG'

rang_to_query = []

# NOTE: range(len(...) - 1) leaves out the last 'Overall' category (10km Dames)
for i in range(len(good_fonts_num)-1):
        
    my_font = result_font[good_fonts_num[i] + 1]
    a_tag = my_font.find_all('a')
    
    for t in a_tag:
    
        if 'RANG' in t['href']:
            
            rang_to_query.append(t['href'])
            
#             print(t['href'])
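
The selection logic of the last cells can be checked on a toy page. The sketch below assumes, as the code above does, that the links of a category sit in the `font` tag right after its title; the markup and the `RANG…` file names are invented for the example, and `get_text()` is called on the `font` tag directly instead of on its first child.

```python
import bs4

# Invented markup: a title <font> followed by a <font> holding the links
html = """
<font size="3"><b>Marathon Dames Overall (220 classées)</b></font>
<font size="2"><a href="RANG012.HTM">A-Z</a> <a href="ALFAA.HTM">index</a></font>
<font size="3"><b>Marathon Dames 42-D20</b></font>
<font size="2"><a href="RANG013.HTM">A-Z</a></font>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
fonts = soup.find_all("font")

# Positions of the titles that mention 'Overall'
overall_idx = [i for i, f in enumerate(fonts) if "Overall" in f.get_text()]

# For each of them, keep the 'RANG' links found in the next <font> tag
links = [
    a["href"]
    for i in overall_idx
    if i + 1 < len(fonts)
    for a in fonts[i + 1].find_all("a")
    if "RANG" in a.get("href", "")
]
print(overall_idx, links)
```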

Query datasport.com with the right parameters and finally get the __tables__:

In [10]:
base_url = "https://services.datasport.com/2016/lauf/lamara"
full_url = base_url + '/' + rang_to_query[0]
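
As a side note, the standard library's `urllib.parse.urljoin` is one possible alternative to concatenating strings by hand. The sketch below uses a hypothetical page name (`RANG012.HTM`) and shows why the trailing slash of the base URL matters.

```python
from urllib.parse import urljoin

href = "RANG012.HTM"  # hypothetical page name, for illustration only

# With a trailing slash, the page name is appended to the event directory
with_slash = urljoin("https://services.datasport.com/2016/lauf/lamara/", href)

# Without it, the last path segment ('lamara') is replaced instead
without_slash = urljoin("https://services.datasport.com/2016/lauf/lamara", href)

print(with_slash)
print(without_slash)
```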

In [11]:
# full_url already contains the page name, so no extra params are passed
result_html = rq.get(full_url)
result_soup = bs4.BeautifulSoup(result_html.text, "lxml")

data = result_soup.find_all('font')
len(data)

3

Code to get the column names:

In [81]:
col_list = data[0].get_text()
col_list  = re.split(' +',col_list)[1:12]
col_list

['rang',
 'nom',
 'nat',
 'an',
 'lieu',
 'équipe',
 'pénalité',
 'temps',
 'retard',
 'doss',
 'cat/rang']
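
To see why the slice starts at index 1, here is a small sketch on a header line typed by hand (the exact spacing of the real page is an assumption). Splitting on runs of spaces yields an empty first element whenever the line starts with a space.

```python
import re

# Hand-typed header imitating the one on the results page (spacing is assumed)
header = " rang nom              nat an   lieu        équipe      pénalité   temps    retard  doss  cat/rang"

parts = re.split(" +", header)
print(parts[0] == "")   # the leading space produces an empty first element

col_list = parts[1:12]  # same slice as in the cell above
print(col_list)

# str.split() without arguments drops leading whitespace by itself
print(header.split() == col_list)
```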

Code to get the rows
(we __neglect__ 'dossard', 'rank in his/her category', 'team', ...):

(note that the relevant information is in __data[2].contents__, at positions 1, 2 and 8 of each block of 8 entries)

In [126]:
# rows =  data[2].find_all('span')
# len(rows)

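# each result row spans 8 consecutive entries of data[2].contents
for i in range(len(data[2].contents)//8):

    name = data[2].contents[8*i+1].get_text()  # extracted but not printed

    # splitting on runs of spaces assumes one word per field: a multi-word
    # city or team shifts the later fields, so tot_time is sometimes wrong
    country_age_city_time = re.split(' +',data[2].contents[8*i+2])

    country = country_age_city_time[1]
    age = country_age_city_time[2]  # this is actually the year of birth
    city = country_age_city_time[3]
    tot_time = country_age_city_time[5]

    cat_catrank = re.split(' +',data[2].contents[8*i+8].split('¦')[0])
    category = cat_catrank[1] # we keep only the age/sex category

    print(country,age,city,tot_time,category)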

BEL 1976 Bern 2:42.41,0 42-D40
SUI 1972 Cernier 2:51.45,8 42-D40
CRO 1976 CRO-Zagreb Maksimir 42-D40
SUI 1969 GB-Penzance 2:53.43,2 42-D40
SUI 1977 Ecublens ----- 42-D30
FRA 1981 F-Lille METROPOLE 42-D30
SUI 1986 Cheiry 3:08.08,4 42-D30
SUI 1976 Portalban ----- 42-D40
FRA 1990 F-Besancon val 42-D20
ITA 1991 Genève Runners 42-D20
FRA 1979 F-Marnay 3:17.45,2 42-D30
SUI 1974 Zuchwil Langenthal 42-D40
SUI 1981 Lausanne Ecublens 42-D30
FRA 1972 F-Yvre MAMERS 42-D40
BEL 1989 B-Bertrix 3:27.03,3 42-D20
SUI 1978 Etoy 3:27.14,1 42-D30
AUT 1983 Genève 3:27.33,6 42-D30
SUI 1980 Bern 3:28.55,1 42-D30
IRL 1993 Basel 3:34.19,1 42-D20
SUI 1988 Zürich 3:35.23,6 42-D20
FRA 1980 F-Montjean Loire 42-D30
SUI 1991 Zürich 3:35.35,7 42-D20
JPN 1954 J-Osaka Global 42-D60
SUI 1972 Ringgenberg X-Bionic 42-D40
SUI 1980 Lausanne 3:38.37,7 42-D30
FRA 1990 F-Thonon Bains 42-D20
SUI 1956 Gippingen 3:40.03,9 42-D60
SUI 1969 Daillens ----- 42-D40
SUI 1975 Rothenburg 3:40.36,9 42-D40
SUI 1966 Le ----- 42-D50
SUI 1978 P
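
Several rows above show a team name or `-----` where the time should be. A possible workaround, sketched below, is to locate the time by its pattern rather than by its position. The two sample rows are invented, and the pattern `h:mm.ss,d` is inferred from the times printed above; it would also match any other field with the same format that comes earlier in the row.

```python
import re

# Invented rows imitating the text between the name and the category
rows = [
    " SUI 1972  Cernier                 -----                  2:51.45,8",
    " SUI 1977  Ecublens VD             -----                  3:01.10,2",
]

# A multi-word city shifts the fields when splitting on spaces
print(re.split(" +", rows[0])[5])  # the time
print(re.split(" +", rows[1])[5])  # not the time any more

# Time format inferred from the output above, e.g. 2:42.41,0 (assumption)
TIME_RE = re.compile(r"\d+:\d{2}\.\d{2},\d")


def parse_row(text):
    """Return country, birth year and time; the time is found by pattern."""
    tokens = text.split()
    match = TIME_RE.search(text)
    return {
        "country": tokens[0],
        "year": int(tokens[1]),
        "time": match.group(0) if match else None,
    }


parsed = [parse_row(r) for r in rows]
print(parsed)
```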

## Old code

The cells below appear to be left over from an IS-Academia scraping exercise; they were not run in this notebook (`In [None]`).

In [None]:
df = pd.read_html(result_table.decode())[0]
df.head()

In [None]:
df.columns = df.loc[1]                # use row 2 as column names
df = df.drop([0, 1])                  # drop useless first rows
df = df.drop([np.nan], axis=1)        # drop useless nan column
df.index = df['No Sciper']            # use sciper column as index

# Drop some columns
df = df.drop(['Orientation Bachelor', 'Orientation Master', 'Filière opt.', 'Type Echange', 'Ecole Echange'], axis=1)

# Do some renaming
df.index.name = 'sciper'
df.columns = ['gender', 'full_name', 'specialization', 'minor', 'status', 'sciper']

# Map gender to more standard names
dict_gender = {'Monsieur': 'male','Madame': 'female'}
df['gender'] = df['gender'].replace(dict_gender)   # assign back instead of replacing in place
df.head()

## Some tools

We can define a helper function which, given a base URL and a dictionary of parameters, will fetch the data and fill a DataFrame with it.

In [None]:
def get_data(base_url, params_dict):
    """Get data from IS-Academia in a pandas DataFrame"""
    
    # Same sequence of operations as above, with a check for an empty result_table
    
    result_html = rq.get(base_url,params=params_dict)
    result_soup = bs4.BeautifulSoup(result_html.text, "lxml")
    result_table = result_soup.find_all('table')[0]
    
    if (result_table.text == ''):
        # Return empty dataframe
        df = pd.DataFrame()
    else:
        # Build a DataFrame containing the data, with SCIPER as index
        df = pd.read_html(result_table.decode())[0]
        try:
            df.columns = df.loc[1]                # use 2nd row as column names
            df = df.drop([0, 1])                  # drop useless first rows
            df = df.drop([np.nan], axis=1)        # drop useless nan column
            df.index = df['No Sciper']            # use sciper column as index
        
            # Drop some columns
            df = df.drop(['Orientation Bachelor', 'Orientation Master', 'Filière opt.', 'Type Echange', 'Ecole Echange'], axis=1)
            # Do some renaming
            df.index.name = 'sciper'
            df.columns = ['gender', 'full_name', 'specialization', 'minor', 'status', 'sciper']
            # Map gender to more standard names
            dict_gender = {'Monsieur': 'male','Madame': 'female'}
            df['gender'] = df['gender'].replace(dict_gender)
        except Exception:
            # unexpected table layout: fall back to an empty DataFrame
            df = pd.DataFrame()
    
    return df

The following lines test this function with hardcoded values:

In [None]:
base_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?"
params_dict = {
    'ww_x_GPS': 2021043255,
    'ww_i_reportModel': 133685247,
    'ww_i_reportModelXsl': 133685270,
    'ww_x_UNITE_ACAD': 249847,
    'ww_x_PERIODE_ACAD': 355925344,
    'ww_x_PERIODE_PEDAGO': 249108,
    'ww_x_HIVERETE':2936286
}

get_data(base_url, params_dict).head()
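
To get a feeling for what is sent to the server, the parameters dictionary can be turned into a query string by hand. The sketch below uses `urllib.parse.urlencode` from the standard library as a stand-in; it should be close to what `requests` builds from `params`, but that equivalence is an assumption here, not something checked against the server.

```python
from urllib.parse import urlencode

# A shortened version of the parameters dictionary used above
params_dict = {
    "ww_x_GPS": -1,
    "ww_i_reportModel": 133685247,
}

# Each key/value pair becomes key=value, joined with '&'
query = urlencode(params_dict)
print(query)
```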

Finally let's get all the possible values in a cleaner way and keep them in variables that we will use throughout this notebook.

In [None]:
acad_period = {}
level = {}
semester = {}
acad_unit = {}

for s in selectors:
    options = s.find_all('option')
    options_desc_values = [(o.text, o.attrs['value']) for o in options]
    s_name = s.attrs['name']
    choices = {d: int(v) for (d,v) in options_desc_values if d!=''}
    
    if s_name == 'ww_x_PERIODE_ACAD':
        acad_period = choices
    elif s_name == 'ww_x_PERIODE_PEDAGO':
        level = choices
    elif s_name == 'ww_x_HIVERETE':
        for (d,v) in options_desc_values:
            if 'automne' in d:
                semester['automne'] = int(v)
            elif 'printemps' in d:
                semester['printemps'] =int(v)
    elif s_name == 'ww_x_UNITE_ACAD':
        acad_unit = choices

# Example of result
acad_period
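
The filter `d != ''` in the comprehension matters because `int(v)` fails on non-numeric values. A small sketch with hypothetical option pairs (the numbers and the `'null'` placeholder are made up):

```python
# Hypothetical (description, value) pairs; the first one mimics an empty option
options_desc_values = [("", "null"), ("2007-2008", "978181"), ("2008-2009", "978187")]

# Entries with an empty description are skipped before int() is applied
choices = {d: int(v) for (d, v) in options_desc_values if d != ""}
print(choices)

# Without the filter, the conversion of 'null' raises a ValueError
try:
    {d: int(v) for (d, v) in options_desc_values}
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)
```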

### Store data locally

In [None]:
# Get bachelor data for every year and store it if it's not empty
import os
local_dir = '.local-data'
try:
    os.mkdir(local_dir)
except FileExistsError:
    # directory exists
    print("Using existing '" + local_dir + "' directory")
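
An alternative to the `try`/`except` above is `pathlib.Path.mkdir` with `exist_ok=True`, which does nothing if the directory is already there. The sketch below runs inside a temporary directory so that it leaves nothing behind; only the `.local-data` name comes from the cell above.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    local_dir = Path(tmp) / ".local-data"

    local_dir.mkdir(exist_ok=True)  # creates the directory
    local_dir.mkdir(exist_ok=True)  # second call is a no-op, no exception

    created = local_dir.is_dir()
    print(created)
```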

In [None]:
# Fixed values
params_dict = {
    'ww_x_GPS': -1,
    'ww_i_reportModel': 133685247,
    'ww_i_reportModelXsl': 133685270,
    'ww_x_UNITE_ACAD': acad_unit['Informatique']
}

# Iterate over all the varying params and keep only data for bachelors
for year_key, year_value in acad_period.items():
    for level_key, level_value in level.items():
        for semester_key, semester_value in semester.items():
            if 'bachelor' in level_key.lower():
                params_dict['ww_x_PERIODE_ACAD'] = year_value
                params_dict['ww_x_PERIODE_PEDAGO'] = level_value
                params_dict['ww_x_HIVERETE'] = semester_value
                
                df = get_data(base_url, params_dict)
                if not df.empty:
                    # Persist dataframe locally with pickle
                    filename = year_key + '-' + level_key.replace(' ', '-').lower() + '-' + semester_key
                    df.to_pickle(local_dir + '/' + filename)
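
The three nested loops can also be written with `itertools.product`. The sketch below only builds the file names and does not download anything; the dictionaries are tiny made-up stand-ins for `acad_period`, `level` and `semester`.

```python
from itertools import product

# Made-up stand-ins for the dictionaries built earlier (values are arbitrary)
acad_period = {"2007-2008": 1}
level = {"Bachelor semestre 6": 2, "Master semestre 1": 3}
semester = {"automne": 4, "printemps": 5}

filenames = []
for (year_key, _), (level_key, _), (semester_key, _) in product(
    acad_period.items(), level.items(), semester.items()
):
    # keep only bachelor levels, as in the cell above
    if "bachelor" not in level_key.lower():
        continue
    filenames.append(
        "-".join([year_key, level_key.replace(" ", "-").lower(), semester_key])
    )

print(filenames)
```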

In [None]:
# the previous cell should download 60 files, as you can check with this command:
print(len(os.listdir(local_dir)))

Here is an example of a DataFrame loaded from one of the previously downloaded files:

In [None]:
df_example = pd.read_pickle(local_dir + '/2007-2008-bachelor-semestre-6-printemps')
df_example.head()