# Parsing Data from [datasport.com](https://www.datasport.com/en/)

We use Postman to inspect the parameters sent with the URL request, as the exercise asks.

(Equivalent tools exist for other browsers; for Firefox, see
http://stackoverflow.com/questions/28997326/postman-addons-like-in-firefox)

In [99]:
# important modules for this HW
import bs4 # doc: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import requests as rq 
import re

# previous useful modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')

In [2]:
form_source = rq.get("https://www.datasport.com/en/")
form_soup = bs4.BeautifulSoup(form_source.text, "html.parser")
# print(form_soup.prettify())

Let's get all the `select` menus of the page, using the `find_all` method of *BeautifulSoup*, which searches for all tags of a given type.

In [3]:
selectors = form_soup.find_all('select')
print(len(selectors))

4


Most importantly, we can find out what each tag is about by printing its `name` attribute:

In [4]:
for num, s in enumerate(selectors):
    print("Select n°{} : {}".format(num, s.attrs['name'])) # wild french appears...

Select n°0 : etyp
Select n°1 : eventmonth
Select n°2 : eventyear
Select n°3 : eventlocation


In [5]:
for s in selectors:
    options = s.find_all('option')
    options_desc_values = [(o.text, o.attrs['value']) for o in options]
    print(s.attrs['name'] + ':')
    for (d,v) in options_desc_values:
        print("- {} [{}]".format(d,v)) # more french

etyp:
- All [all]
- ---- [all]
- Cross-Country-Skiing [Cross-Country-Skiing]
- Cycling [Cycling]
- Cycling,MTB [Cycling,MTB]
- Cycling,Others [Cycling,Others]
- Duathlon [Duathlon]
- Inline [Inline]
- MTB [MTB]
- MTB,Cycling [MTB,Cycling]
- MTB,Cycling,Others [MTB,Cycling,Others]
- MTB,Others [MTB,Others]
- MTB,X-Hours [MTB,X-Hours]
- Others [Others]
- Others,Inline,Running,MTB [Others,Inline,Running,MTB]
- Running [Running]
- Running,Inline [Running,Inline]
- Running,MTB [Running,MTB]
- Running,MTB,Others [Running,MTB,Others]
- Running,Skiing/Snowboard [Running,Skiing/Snowboard]
- Running,Waffenlauf [Running,Waffenlauf]
- Running,Walking [Running,Walking]
- Running,Walking,MTB [Running,Walking,MTB]
- Running,Walking,Others [Running,Walking,Others]
- Running,X-Hours [Running,X-Hours]
- Skiing/Snowboard [Skiing/Snowboard]
- Triathlon [Triathlon]
- Triathlon,Duathlon [Triathlon,Duathlon]
- Triathlon,Others [Triathlon,Others]
- Waffenlauf [Waffenlauf]
- Walking [Walking]
- X-Hours [X-Hour
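
The option values listed above are what a form submission would carry. As a rough sketch, the standard library can show how such choices turn into a query string. The endpoint, the use of a GET request, and the `eventyear` value below are assumptions for illustration only, not confirmed behaviour of the site.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real form target may differ.
SEARCH_URL = "https://www.datasport.com/en/"


def build_query_url(base_url, **choices):
    """Append the chosen select values to base_url as a query string."""
    return base_url + "?" + urlencode(choices)


# 'etyp' and 'eventyear' are select names found on the page;
# the value given to 'eventyear' is an assumed example.
url = build_query_url(SEARCH_URL, etyp="Running,Walking", eventyear="2016")
print(url)
```

Note that `urlencode` percent-encodes the comma inside combined event types such as `Running,Walking`.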

## Let's get some data

To get started, we collect the results of the Lausanne Marathon, one of the main yearly events in Switzerland.

Understand the HTML of the main page, and __extract the relevant parameters__ to query:

### Get all the pages links

In [397]:
laus_mar_url = 'https://services.datasport.com/2016/lauf/lamara/'
fri_half_url = 'https://services.datasport.com/2013/lauf/semi-marathon-fribourg/'
german_mar_url='https://services.datasport.com/2014/lauf/grmarathon/'
kapoag_url='https://services.datasport.com/2013/lauf/kapoag/'
base_url=kapoag_url



result_html = rq.get(base_url)

# use BS to get the classes in which the data is divided:

result_soup = bs4.BeautifulSoup(result_html.text, "lxml")
result_font = result_soup.find_all('font')

print('number of categories in the main page:', len(result_font))

number of categories in the main page: 22


In [398]:
# we look for the ones containing 
# '*** Overall ***', as they are the most general categories 

# this is probably a GENERAL KEYWORD, as it is also found in
# events in other languages, 
# like https://services.datasport.com/2016/lauf/ascona-locarno-marathon/

base_url_lausanne = "https://services.datasport.com/2016/lauf/lamara"

good_fonts_num = []

links=[]
for n_font, font in enumerate(result_font):
#     print(font.findChild())
    if font.get('size')=='3':
        links_to_process=font.findAll('a')
        for link in links_to_process:
            link=str(link)
            try:
                link=link.split('"')[1]
                if link[:4]=='ALFA':
                    links.append(base_url+'/'+link)
            except:
                pass
        break
links
#     if 'Overall' in font.findChild().get_text():
            
#         good_fonts_num.append(n_font)
#         print(font.findChild().get_text())
        
        
# good_fonts_num = np.asarray(good_fonts_num)        
        
# PROBLEM: the men's marathon pages do not have
# the same HTML structure as the others.

['https://services.datasport.com/2013/lauf/kapoag//ALFAA.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAB.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAC.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAD.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAE.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAF.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAG.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAH.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAI.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAJ.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAK.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAL.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAM.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAN.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFAO.HTM',
 'https://services.datasport.com/2013/lauf/kapoag//ALFA
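
The doubled slash in the links above comes from concatenating `'/'` onto a base URL that already ends with `/`. A minimal sketch of a safer join, using only the standard library, could look like this; `page_url` is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import urljoin


def page_url(base_url, page):
    """Join a results page name onto an event URL with exactly one slash."""
    # urljoin replaces the last path segment when the base lacks a trailing
    # slash, so normalise the base first.
    base = base_url.rstrip("/") + "/"
    return urljoin(base, page)


print(page_url("https://services.datasport.com/2013/lauf/kapoag/", "ALFAA.HTM"))
print(page_url("https://services.datasport.com/2016/lauf/lamara", "ALFAA.HTM"))
```

Both forms of the base URL used in this notebook (with and without the trailing slash) then give a well-formed link.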

### Get the tables

Query datasport.com with the right parameters and finally get the __tables__:

In [399]:
# The tables contain more fields than these; only the ones that matter are listed.
# The structure of each table must be checked automatically,
# since manually checking every table of every event is impossible.
header_fields_french=[['catégorie'],['rang'],['nom'],['an'],['lieu','pays/lieu'],['équipe'],['pénalité'],['temps'],['retard']]
optional_french=['pénalité']
first_excluded_field_french='doss'
header_fields_german=[['Kategorie'],['Rang'],['Name/Ort','Name'],['Jg'],['Team/Ortschaft','Land/Ort'],['Team'],['Zeit'],['Rückstand']]
optional_german=['Team']
first_excluded_field_german='Stnr'

In [400]:
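def parse_time(time):
    """Split a time such as '1:30.31,8' into (hours, minutes, seconds, fraction)."""
    # Every valid time carries a decimal comma; anything else is rejected
    if ',' not in time:
        raise ValueError('not a time: {!r}'.format(time))
    parts=re.split("[:.,]+",time)
    # Pad missing leading fields (e.g. hours) with zeros
    while len(parts)<4:
        parts=['0']+parts
    hours,minutes,seconds,fraction=[float(x) for x in parts]
    
    return (hours,minutes,seconds,fraction)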
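
For later analysis, a single number is often easier to handle than a tuple. The sketch below converts a time string to seconds, assuming the format `[h:]mm.ss,t` seen in the results (for example `48.26,6` or `1:30.31,8`), where the digit after the comma is taken to be tenths of a second. That reading of the format, and the name `time_to_seconds`, are assumptions.

```python
import re

# Assumed layout: optional hours, minutes, two-digit seconds, one tenth digit.
TIME_RE = re.compile(r"^(?:(\d+):)?(\d{1,2})\.(\d{2}),(\d)$")


def time_to_seconds(text):
    """Convert a time such as '1:30.31,8' to a number of seconds."""
    match = TIME_RE.match(text.strip())
    if match is None:
        raise ValueError("not a time: {!r}".format(text))
    hours, minutes, seconds, tenths = match.groups()
    return (int(hours or 0) * 3600 + int(minutes) * 60
            + int(seconds) + int(tenths) / 10)


print(time_to_seconds("48.26,6"))
print(time_to_seconds("1:30.31,8"))
```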

In [401]:
def process_legend(legend):
    penalty=True
    
    legend=str(legend).split('¦')[0]
    legend=re.sub('<[^>]+>', ' ', legend)
    legend=legend.lstrip()
    
    # check language
    if legend.startswith(header_fields_french[0][0]):
        language='French'
        header_fields=header_fields_french
        first_excluded=first_excluded_field_french
        optional=optional_french
    elif legend.startswith(header_fields_german[0][0]):
        language='German'
        header_fields=header_fields_german
        first_excluded=first_excluded_field_german
        optional=optional_german
    else:
        print(legend)
        raise ValueError('Error, problems in language detection')
    
    # Check if all words are present
    for words in header_fields:
        found=False
        for word in words:
            if legend.startswith(word):
                legend=legend.split(word)[1]
                legend=legend.lstrip()
                found=True
                break
            
        if found==False:
            if words[0] in optional:
                penalty=False
            else:
                print(words)
                print(legend)
                raise ValueError('Error, word not known')
    legend_first_excluded=legend.split(' ')[0]
    if legend_first_excluded != first_excluded:
        print(legend_first_excluded)
        print(legend)
        raise ValueError('First excluded element not good')
    
    
    return language,False
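
The header check can be tried in isolation on a made-up legend line. The sketch below consumes the expected headers in order and skips optional ones. The legend string is invented for illustration and `match_headers` is a hypothetical name; only the German field lists come from the cell above.

```python
GERMAN_FIELDS = [["Kategorie"], ["Rang"], ["Name/Ort", "Name"], ["Jg"],
                 ["Team/Ortschaft", "Land/Ort"], ["Team"], ["Zeit"], ["Rückstand"]]
GERMAN_OPTIONAL = {"Team"}


def match_headers(legend, fields, optional):
    """Consume the expected headers in order; return (found, remaining text)."""
    rest = legend.lstrip()
    found = []
    for alternatives in fields:
        for word in alternatives:
            if rest.startswith(word):
                found.append(word)
                rest = rest[len(word):].lstrip()
                break
        else:
            # No alternative matched: acceptable only for optional headers.
            if alternatives[0] not in optional:
                raise ValueError("missing header: {}".format(alternatives[0]))
    return found, rest


# Invented legend line, not copied from a real results page.
LEGEND = "Kategorie Rang Name Jg Land/Ort Zeit Rückstand Stnr"
found, rest = match_headers(LEGEND, GERMAN_FIELDS, GERMAN_OPTIONAL)
print(found)
print(rest)
```

Listing the longer alternative first (`Name/Ort` before `Name`) matters, because the match is done with `startswith`.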

### Hypothesis

IMPORTANT - TODO

Remove the walking data: the rank (rang) is not present in these datasets.
This can break the splitting if the name (nom) is not a link.

*Fields* - standard fields for each language:
1. catégorie (0)
2. rang (1) (CAN BE MERGED WITH NOM)
3. nom (2) (CAN BE MERGED WITH RANG)
4. an (3)
5. lieu (3)
6. équipe  (4) (MAYBE MISSING)
7. pénalité (5) (NOT ALWAYS PRESENT)
8. temps (6)
9. retard (7)

*Only* fields 1, 2, 3, 4, 5, 8 and 9 are parsed.

- After field 5, each remaining field is kept only if it parses as a time.
- If more than 2 times are found, an error is raised.
- If 1 time is found, it is assumed to be the final time, not the delay.
- If 0 times are found, the runner is skipped and printed.

The presence of these fields is checked automatically. They must appear in this order; if they do not, an error is raised.

Other possible problems:
1. temps and retard must be formatted so that `parse_time()` can parse them.
2. The other fields must be formatted the same way as for the Lausanne Marathon.
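
The rule for the trailing time fields can be sketched on its own. The regular expression below is an assumed description of the time format, and `pick_times` is a hypothetical helper; the `'----'` placeholder for a missing delay is the one used in this notebook.

```python
import re

# Assumed time layout: [h:]mm.ss,t
TIME_RE = re.compile(r"^(?:\d+:)?\d{1,2}\.\d{2},\d$")


def pick_times(fields, placeholder="----"):
    """Return (time, delay) from the trailing fields, or None if no time is found."""
    times = [f for f in fields if TIME_RE.match(f)]
    if not times:
        return None                      # runner is skipped
    if len(times) == 1:
        return times[0], placeholder     # a single time is the final time
    if len(times) == 2:
        return times[0], times[1]
    raise ValueError("more than 2 times found: {}".format(times))


print(pick_times(["Mepo", "48.26,6", "7.44,6"]))
print(pick_times(["Mepo", "48.26,6"]))
print(pick_times(["Mepo"]))
```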

In [402]:
def process_fields(runner_splitted):
    fields_processed=[]
    # The first 3 elements are the same
    fields_processed.append(runner_splitted[0])
    # Check if splitting second element
    splitted=runner_splitted[1].split('.')
    first_to_check=3
    if len(splitted)!=1 and splitted[1]!='': #rang and nom are merged
        splitted[1]=splitted[1].lstrip()
        fields_processed+=splitted
        first_to_check=2
    else:
        fields_processed.append(splitted[0])
        fields_processed.append(runner_splitted[2])
        
    # Split the an-lieu element
    fields_processed+=runner_splitted[first_to_check].split(' ',1)
    first_to_check+=1
    
    # Sanity check: the place (last element so far) should not look like a time.
    # NOTE: the bare except below also swallows the error raised inside the try,
    # so this check currently never fails.
    try:
        parse_time(fields_processed[-1])
        raise ValueError('It should not be a time')
    except:
        pass
        #del fields_processed[-1]
    
    # Insert all times found after the year (if they are not 2 raise an error)
    added_fields=0
    for i in range(first_to_check,len(runner_splitted)):
        try:
            parse_time(runner_splitted[i])
            fields_processed.append(runner_splitted[i])
            added_fields+=1
        except:
            pass
    if added_fields==0:
        return []
    if added_fields==1:
        fields_processed.append('----')
        added_fields=2
    if added_fields!=2:
        print(added_fields)
        print(runner_splitted)
        raise ValueError('Added fields not equal to 2')
    
    
#     team_field_present=True
#     field_to_check=4+penalty
#     try:
#         parse_time(runner_splitted[field_to_check]) # It should raise error if field not missing
#         fields_processed.append('')
#         team_field_present=False
#     except:
#         pass
   
#     fields_processed+=runner_splitted[4:6+penalty+team_field_present]
    
    
    return fields_processed
    

In [403]:
final_list=[]
for link in links:
    # Get raw HTML response
    result_html = rq.get(link)#, params=rang_to_query[0])

    # Parse the page with BeautifulSoup
    result_soup = bs4.BeautifulSoup(result_html.text, "lxml")

    results=result_soup.findAll('font')  # Search for all fonts
    language,errors=process_legend(results[0])
    print(language,errors)
    del results[0]    # This is the legend
    for table in results:
        if table.get('size')=='2': # If size is 1 it stores the split times, not interesting
            
            # NOT TRUE IN GENERAL!
            runner_list=str(table).split('\n')         # One entry per line
            for k,runner in enumerate(runner_list):
                start_runner=runner[:]
                runner=re.sub('<[^>]+>', ' ', runner) # Replace every <...> tag with a space
                runner=re.sub('  +','#@$&',runner)       # Replace each run of 2+ spaces with the separator #@$&
                
                runner=runner.replace('\n','')        # Remove any remaining \n
                runner=runner.replace(' \r','')       # Remove \r preceded by a space
                runner=runner.replace('\r','')        # Remove any remaining \r
                runner=runner.lstrip()                 # The first athlete starts with a space
#                 print(repr(runner))
#                 print()
                # The team can be empty, check:
                counter=runner.count('#@$&')
                start=runner[:3]
                if start=='10-' or start=='21-' or start=='42-':
                    runner2=runner.split('#@$&')          # Split the fields
                    
                    # It works ONLY if the number of fields are the same for different languages
                    runner=process_fields(runner2) 
                    if len(runner)!=0:
                        final_list.append(runner)         # Append to the final list  
                    else:
                        print(start_runner)
                        print(runner2)
                        
    

German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
German False
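
The row splitting used above can be tried on a single line. This sketch strips the tags and splits directly on runs of two or more spaces with `re.split`, instead of going through a sentinel string. The sample line is invented to resemble one result row, and `split_row` is a hypothetical name.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def split_row(raw):
    """Strip HTML tags, then split the line on runs of 2+ spaces."""
    text = TAG_RE.sub(" ", raw).strip()
    return re.split(r" {2,}", text)


# Invented sample line; the real markup may differ.
SAMPLE = ('<a href="x.htm">10-S1</a>   16. Aellig Hansruedi      '
          '1967 Mepo       48.26,6     7.44,6')
print(split_row(SAMPLE))
```

Rank and name, and year and place, stay together after this step because they are separated by a single space; `process_fields` is what splits them further.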


In [404]:
df = pd.DataFrame(final_list)
df.shape

(312, 7)

In [405]:
df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 10-S1 | 16 | Aellig Hansruedi | 1967 | Mepo | 48.26,6 | 7.44,6 |
| 1 | 10-S1 | 14 | Aeschlimann Hanspeter | 1968 | Mepo | 47.25,3 | 6.43,3 |
| 2 | 10-WD | 12 | Affolter Verena | 1956 | Repol LAR | 1:30.31,8 | 17.27,3 |
| 3 | 10-WH | 15 | Altermatt Philip | 1982 | Kapo West | 1:17.07,4 | 7.16,5 |
| 4 | 10-H | 61 | Amsler Dominik | 1987 | Mepo | 49.17,9 | 12.24,4 |
| 5 | 10-GH | 52 | Antoniazzi Franco | 1964 | StA Baden | 53.25,8 | 17.34,6 |
| 6 | 10-GH | 44 | Aufdenblatten Dominik | 1964 | StA Baden | 50.58,2 | 15.07,0 |
| 7 | 10-GH | 64 | Bachmann Pascal | 1982 | Stapo Aarau | 57.30,1 | 21.38,9 |
| 8 | 10-D | 10 | Baumann Evelin | 1975 | Kapo Ost | 53.52,1 | 8.36,2 |
| 9 | 10-D | 6 | Baumann Nadia | 1989 | Pol S | 52.49,7 | 7.33,8 |


# Old code

In [None]:
df = pd.read_html(result_table.decode())[0]
df.head()

In [None]:
df.columns = df.loc[1]                # use row 2 as column names
df = df.drop([0, 1])                  # drop useless first rows
df = df.drop([np.nan], axis=1)        # drop useless nan column
df.index = df['No Sciper']            # use sciper column as index

# Drop some columns
df = df.drop(['Orientation Bachelor', 'Orientation Master', 'Filière opt.', 'Type Echange', 'Ecole Echange'], axis=1)

# Do some renaming
df.index.name = 'sciper'
df.columns = ['gender', 'full_name', 'specialization', 'minor', 'status', 'sciper']

# Map gender to more standard names
dict_gender = {'Monsieur': 'male','Madame': 'female'}
df['gender'] = df['gender'].replace(dict_gender)
df.head()

## Some tools

We can define a helper function which, given a base URL and a dictionary of parameters, will fetch the data and fill a DataFrame with it.

In [None]:
def get_data(base_url, params_dict):
    """Get data from IS-Academia in a pandas DataFrame"""
    
    # Same sequence of operations as above, plus a check for an empty result_table
    
    result_html = rq.get(base_url,params=params_dict)
    result_soup = bs4.BeautifulSoup(result_html.text, "lxml")
    result_table = result_soup.find_all('table')[0]
    
    if (result_table.text == ''):
        # Return empty dataframe
        df = pd.DataFrame()
    else:
        # Build a DataFrame containing the data, with SCIPER as index
        df = pd.read_html(result_table.decode())[0]
        try:
            df.columns = df.loc[1]                # use 2nd row as column names
            df = df.drop([0, 1])                  # drop useless first rows
            df = df.drop([np.nan], axis=1)        # drop useless nan column
            df.index = df['No Sciper']            # use sciper column as index
        
            # Drop some columns
            df = df.drop(['Orientation Bachelor', 'Orientation Master', 'Filière opt.', 'Type Echange', 'Ecole Echange'], axis=1)
            # Do some renaming
            df.index.name = 'sciper'
            df.columns = ['gender', 'full_name', 'specialization', 'minor', 'status', 'sciper']
            # Map gender to more standard names
            dict_gender = {'Monsieur': 'male','Madame': 'female'}
            df['gender'] = df['gender'].replace(dict_gender)
        except Exception:
            # Unexpected table layout: fall back to an empty DataFrame
            df = pd.DataFrame()
    
    return df

The following lines test this function with hardcoded values:

In [None]:
base_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?"
params_dict = {
    'ww_x_GPS': 2021043255,
    'ww_i_reportModel': 133685247,
    'ww_i_reportModelXsl': 133685270,
    'ww_x_UNITE_ACAD': 249847,
    'ww_x_PERIODE_ACAD': 355925344,
    'ww_x_PERIODE_PEDAGO': 249108,
    'ww_x_HIVERETE':2936286
}

get_data(base_url, params_dict).head()

Finally let's get all the possible values in a cleaner way and keep them in variables that we will use throughout this notebook.

In [None]:
acad_period = {}
level = {}
semester = {}
acad_unit = {}

for s in selectors:
    options = s.find_all('option')
    options_desc_values = [(o.text, o.attrs['value']) for o in options]
    s_name = s.attrs['name']
    choices = {d: int(v) for (d,v) in options_desc_values if d!=''}
    
    if s_name == 'ww_x_PERIODE_ACAD':
        acad_period = choices
    elif s_name == 'ww_x_PERIODE_PEDAGO':
        level = choices
    elif s_name == 'ww_x_HIVERETE':
        for (d,v) in options_desc_values:
            if 'automne' in d:
                semester['automne'] = int(v)
            elif 'printemps' in d:
                semester['printemps'] =int(v)
    elif s_name == 'ww_x_UNITE_ACAD':
        acad_unit = choices

# Example of result
acad_period

### Store data locally

In [None]:
# Get bachelor data for every year and store it if it's not empty
import os
local_dir = '.local-data'
try:
    os.mkdir(local_dir)
except FileExistsError:
    # directory exists
    print("Using existing '" + local_dir + "' directory")

In [None]:
# Fixed values
params_dict = {
    'ww_x_GPS': -1,
    'ww_i_reportModel': 133685247,
    'ww_i_reportModelXsl': 133685270,
    'ww_x_UNITE_ACAD': acad_unit['Informatique']
}

# Iterate over all the varying params and keep only data for bachelors
for year_key, year_value in acad_period.items():
    for level_key, level_value in level.items():
        for semester_key, semester_value in semester.items():
            if 'bachelor' in level_key.lower():
                params_dict['ww_x_PERIODE_ACAD'] = year_value
                params_dict['ww_x_PERIODE_PEDAGO'] = level_value
                params_dict['ww_x_HIVERETE'] = semester_value
                
                df = get_data(base_url, params_dict)
                if not df.empty:
                    # Persist dataframe locally with pickle
                    filename = year_key + '-' + level_key.replace(' ', '-').lower() + '-' + semester_key
                    df.to_pickle(local_dir + '/' + filename)
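
The file naming used in the loop can be checked without downloading anything. The sketch below builds the same file name with `pathlib` and creates the directory idempotently. The key values in the usage lines are assumptions inferred from the example file name loaded at the end of this notebook, and `result_path` and `ensure_dir` are hypothetical helper names.

```python
import tempfile
from pathlib import Path


def result_path(local_dir, year_key, level_key, semester_key):
    """Build the path of the pickle file for one (year, level, semester)."""
    name = "-".join([year_key, level_key.replace(" ", "-").lower(), semester_key])
    return Path(local_dir) / name


def ensure_dir(local_dir):
    """Create the directory if needed; do nothing if it already exists."""
    path = Path(local_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Assumed example keys, run in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target_dir = ensure_dir(Path(tmp) / ".local-data")
    example = result_path(target_dir, "2007-2008", "Bachelor semestre 6", "printemps")
    print(example.name)
```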

In [None]:
# The previous cell should download 60 files, as you can check with this command:
print(len(os.listdir(local_dir)))

Here is an example of a DataFrame loaded from the previously downloaded files:

In [None]:
df_example = pd.read_pickle(local_dir + '/2007-2008-bachelor-semestre-6-printemps')
df_example.head()