# Global Parsing of Data from [datasport.com](https://www.datasport.com/en/)

In this notebook, we extract general information about every competition we want to analyse. We restrict ourselves to:
* **running** races only,
* held in **Switzerland**.

We consider every race from the first one recorded in *Datasport*, in **1999, up to the end of 2015**, so that we do not have to deal with competitions that have not yet taken place or whose results are not yet available.

For each race, we extract the following information:
* **Date** of the race,
* **Name** of the race,
* **Place** where the race has been organised,
* **URL** of the page where the rankings are given.

To find this information, we use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the pages and Postman to explore the API behind the *Datasport* main page, opened in [*expert mode*](https://www.datasport.com/en/calendar/). The *expert mode* contains more information than the *simple mode*, and in particular all the information that we need. Since each request through the API returns at most a fixed number of results per page, we make one request for each month of each year in the period of interest (from January 1999 to December 2015). This simplifies the parsing, because all the running races organised in a given month fit on a single page.
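
The one-request-per-month scheme can be sketched as a small generator of form payloads. This is only an illustrative sketch: the parameter names and fixed values are the ones used in the request code of this notebook, while the helper name `monthly_payloads` is hypothetical and no request is actually sent here.

```python
from itertools import product

# Fixed request parameters: running races, in Switzerland, all event services.
FIXED = {'etyp': 'Running', 'eventlocation': 'CCH', 'eventservice': 'all'}

def monthly_payloads(first_year=1999, last_year=2015):
    """Yield one form payload per (year, month); both bounds are inclusive.

    Hypothetical helper, written for illustration only.
    """
    years = range(first_year, last_year + 1)
    months = range(1, 13)
    for year, month in product(years, months):
        payload = dict(FIXED)
        payload['eventyear'] = str(year)
        payload['eventmonth'] = str(month).zfill(2)  # '01' ... '12'
        yield payload

payloads = list(monthly_payloads())
print(len(payloads))  # 17 years * 12 months = 204 requests
```

Each payload could then be passed as the `data` argument of `requests.post`, one call per month.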

We study how the HTML of these pages is structured and extract the wanted data from it. Once all the data have been parsed and assembled into a pandas DataFrame, we save it to the file *links2runs.csv* so that this code, which takes some time to run, does not need to be run more than once.

First, we import the required modules and libraries.

In [1]:
import pandas as pd
import bs4 # doc: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import requests as rq 

#%matplotlib inline
#import numpy as np
#import matplotlib.pyplot as plt
#import seaborn as sns
#sns.set_context('notebook')

Then, we begin the parsing.

In [None]:
# Lists that will contain the information we are looking for, per category (URL, 
# date, name, place)
list_url = []
list_dates = []
list_names = []
list_places = []

# Fixed request parameters: running races, in Switzerland, all types of running events.
etyp = 'Running'
eventlocation = 'CCH'
eventservice = 'all'

# Specify the 12 months as two-digit strings, and all the years from 1999 to 2015
# (range(1999, 2016) excludes 2016). We will loop on those two lists: these are the
# variable request parameters.
eventmonth = []
for month in range(12):
    eventmonth.append(str(month+1).zfill(2))
eventyear = []
for year in range(1999,2016):
    eventyear.append(str(year).zfill(4))

# Debugging counters, to be sure that for each event, we have exactly one date, one URL
# where we can find the rankings, one name and one place. No fewer, no more.
yes_date = 0
yes_rank = 0
yes_name = 0
yes_place = 0

In [14]:
# Loop on all the years and all the months of each year, and make a request.
for year in eventyear:
    for month in eventmonth:
        d = {'etyp': etyp, 'eventlocation': eventlocation, 
             'eventmonth': month, 'eventservice': eventservice,
             'eventyear': year}
        post_source = rq.post('https://www.datasport.com/fr/calendrier/',data=d)
        form = bs4.BeautifulSoup(post_source.text, "html.parser")
        
        # Each <tr> containing a 'class' attribute corresponds to an event, let us find 
        # all of them.
        find_tr = form.find_all('tr')
        for tr in find_tr:
            if (tr.has_attr('class') and (tr['class'][0]=='even' 
                                          or tr['class'][0]=='odd')):
                # For each event, the 5th <td> contains the URL where to find the 
                # rankings. This URL corresponds to the 'href' value of some <a>. 
                all_td = tr.find_all('td')
                find_a = all_td[4].find_all('a')
                for a in find_a:
                    # Select only the 'href' of the <a> tags that contain the URLs that
                    # interest us, i.e. the ones pointing to a datasport page. We also
                    # handle three exceptions: we do not want any pdf (no parsing can be
                    # done on them), nor the link about "Course des pavées" ('pavees')
                    # that returns a 404 error, nor the 'mmc' event that takes place in
                    # Liechtenstein and not in Switzerland (website error).
                    if (a['href'].startswith('http://services.datasport.com/')
                        and not a['href'].endswith('.pdf') 
                        and not a['href'].endswith('pavees')
                        and not a['href'].endswith('mmc')):
                        list_url.append(a['href'])
                        yes_rank += 1
                
                if(yes_rank > 0):
                    # If a ranking URL has been found, we can analyse the data. Let us
                    # look for the other attributes. The 15th <td> of each event contains
                    # the URL link to the place of the race on Google Maps. From it, we 
                    # extract the name of the city. 
                    find_a2 = all_td[14].find_all('a')
                    for a2 in find_a2:
                        if (a2['href'].startswith('http://maps.google.ch/')):
                            list_places.append(a2['href'].split('=')[-1].split(',')[0])
                            yes_place += 1
                    
                    # In the 2nd <td> of each event, in its first <a>, we extract the name
                    # of the event. 
                    find_a = all_td[1].find_all('a')
                    list_names.append(find_a[0].contents[0])
                    yes_name += 1
                    
                    # In the 1st <td> of each event, in the <span> whose 'class'
                    # attribute is empty, we extract the date of the event. If the event
                    # spans several days (indicated by a '+' or ' bis'), we only keep the
                    # date of the first day of the event.
                    find_date = all_td[0].find_all('span')
                    for date in find_date:
                        if (date.has_attr('class') and date['class'][0]==''):
                            the_date = date.contents[0]
                            if the_date[-1]=='+':
                                the_date = the_date[:-1]
                            if the_date[-4:]==' bis':
                                the_date = the_date[:-4]
                            list_dates.append(the_date)
                            yes_date += 1
                
                # Debugging step: report any event whose counters disagree.
                if yes_date != yes_name or yes_date != yes_rank or yes_date != yes_place:
                    print(yes_rank, yes_place, month, year)
                    print(list_url[-1])
                    print(list_dates[-1])
                yes_rank = 0
                yes_name = 0
                yes_date = 0
                yes_place = 0
    print(year, 'parsed.')

1999 parsed.
2000 parsed.
2001 parsed.
2002 parsed.
2003 parsed.
2004 parsed.
2005 parsed.
2006 parsed.
2007 parsed.
2008 parsed.
2009 parsed.
2010 parsed.
2011 parsed.
2012 parsed.
2013 parsed.
2014 parsed.
2015 parsed.
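
The two string manipulations used in the loop above (trimming the multi-day marker from a date, and reading the place out of a Google Maps link) can be isolated as small helpers. This is a minimal sketch: the function names `clean_date` and `place_from_maps_url` are hypothetical, and the sample inputs are made up for illustration, so the real pages may use slightly different formats.

```python
def clean_date(raw):
    """Drop a trailing '+' or ' bis' so that only the first day remains."""
    text = raw.strip()
    if text.endswith('+'):
        text = text[:-1]
    if text.endswith(' bis'):
        text = text[:-len(' bis')]
    return text.strip()

def place_from_maps_url(href):
    """Return the text after the last '=' and before the first ','."""
    return href.split('=')[-1].split(',')[0]

# Hypothetical inputs, not taken from the Datasport pages.
print(clean_date('sam. 24.04.1999+'))
print(place_from_maps_url('http://maps.google.ch/maps?q=Lausanne,Switzerland'))
```

Keeping these steps in separate functions makes them easy to check on a few strings before running the full, slow parsing loop.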


Let us double-check that we have collected the same number of values for each of the attributes we are interested in.

In [20]:
print('Answer:', len(list_url)==len(list_names) and
      len(list_names)==len(list_dates) and
      len(list_dates)==len(list_places))

Answer: True


Now, we assemble all of this information into a pandas DataFrame.

In [16]:
all_runs_df = pd.DataFrame({ 'Name' : list_names,
                    'Date' : list_dates,
                    'Place' : list_places,
                    'URL' : list_url })

The DataFrame looks like the following:

In [17]:
all_runs_df.head()

Unnamed: 0,Date,Name,Place,URL
0,sam. 27.03.1999,Männedörfler Waldlauf,Männedorf,http://services.datasport.com/1999/zkb/maennedorf
1,sam. 20.03.1999,Kerzerslauf,Kerzers,http://services.datasport.com/1999/lauf/kerzers
2,sam. 24.04.1999,Luzerner Stadtlauf,Luzern,http://services.datasport.com/1999/lauf/luzern
3,sam. 24.04.1999,20km de Lausanne,Lausanne,http://services.datasport.com/1999/lauf/km20
4,sam. 24.04.1999,"Chäsitzerlouf, Kehrsatz",Kehrsatz,http://services.datasport.com/1999/lauf/kehrsatz


Finally, we save the DataFrame to a CSV file so that we can re-use this data without having to re-run all this code.

In [18]:
all_runs_df.to_csv('links2runs.csv')
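
When the file is read back later, the index written by `to_csv` comes back as an extra column named `Unnamed: 0` unless `index_col=0` is passed to `pd.read_csv`. The sketch below illustrates this with an in-memory buffer instead of the real *links2runs.csv*; the one-row frame `runs` is a hypothetical stand-in for `all_runs_df`.

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for all_runs_df.
runs = pd.DataFrame({'Date': ['sam. 20.03.1999'],
                     'Name': ['Kerzerslauf'],
                     'Place': ['Kerzers'],
                     'URL': ['http://services.datasport.com/1999/lauf/kerzers']})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
runs.to_csv(buffer)
csv_text = buffer.getvalue()

# Without index_col, the saved index shows up as an extra column.
with_extra_column = pd.read_csv(io.StringIO(csv_text))
# With index_col=0, the first column is used as the index again.
reloaded = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra_column.columns))
print(list(reloaded.columns))
```

An alternative would be to call `to_csv` with `index=False`, which avoids writing the index in the first place.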