# Parsing Data from [datasport.com](https://www.datasport.com/en/)

We use Postman to inspect the parameters of the URL requests needed for this exercise.

(Equivalent tools exist for other browsers; for Firefox, see
http://stackoverflow.com/questions/28997326/postman-addons-like-in-firefox)

In [2]:
# important modules for this HW
import bs4 # doc: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import requests as rq 
import re
import time
# previous useful modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')

## Let's get some data

To get started, we collect the results of the Lausanne Marathon, one of the main yearly events in Switzerland.  

Understand the html of the main page, and __extract the relevant parameters__ to query:

# Load the main pages of all the runs
Load the CSV file links2runs.csv, which lists the date, the name and the main result page of each run.

In [4]:
links2runs=pd.read_csv('links2runs.csv')
del links2runs['Unnamed: 0']

In [5]:
links2runs.head(3)

Unnamed: 0,Date,Name,URL
0,sam. 27.03.1999,Männedörfler Waldlauf,http://services.datasport.com/1999/zkb/maennedorf
1,sam. 20.03.1999,Kerzerslauf,http://services.datasport.com/1999/lauf/kerzers
2,sam. 24.04.1999,Luzerner Stadtlauf,http://services.datasport.com/1999/lauf/luzern


In [6]:
links2runs.shape

(2014, 3)

### Pages links

Nine test pages are listed below.
Change only base_url to choose which page to parse.

In [76]:
laus_mar_url = 'https://services.datasport.com/2016/lauf/lamara/'
fri_half_url = 'https://services.datasport.com/2013/lauf/semi-marathon-fribourg/'
german_mar_url='https://services.datasport.com/2014/lauf/grmarathon/'
kapoag_url='https://services.datasport.com/2013/lauf/kapoag/'
laufen_url='https://services.datasport.com/2010/lauf/laufen/'
sommer_url='https://services.datasport.com/2014/lauf/sommer-gommer/'
emme_url='https://services.datasport.com/2010/lauf/emme/'
biel_url='https://services.datasport.com/2009/lauf/bielercross/'
lugano_url='https://services.datasport.com/2010/lauf/stralugano/'
# PARSED PAGE
base_url=laus_mar_url

result_html = rq.get(base_url)

# use BS to get the font tags in which the data is divided:

result_soup = bs4.BeautifulSoup(result_html.text, "lxml")
result_font = result_soup.find_all('font')

print('number of categories in the main page:', len(result_font))

number of categories in the main page: 119


In [77]:
# We look for the alphabetical rankings (classements par ordre alphabétique)

# Known limitation: the pages below are NOT handled correctly
# (the category has to be read from the category field, not from the pace)
# https://services.datasport.com/2016/lauf/ascona-locarno-marathon/
# https://services.datasport.com/2010/lauf/emme/alfaa.htm

def get_links(base_url):
    result_html = rq.get(base_url)
    result_soup = bs4.BeautifulSoup(result_html.text, "lxml")
    result_font = result_soup.find_all('font')

    
    links=[] # It contains all the tables to be parsed
    for n_font, font in enumerate(result_font):
        if font.get('size')=='3':
            links_to_process=font.findAll('a')
            alfa_found=False
            for link in links_to_process:
                link=str(link)
                try:
                    link=link.split('"')[1]
                    if link[:4]=='ALFA':
                        links.append(base_url+'/'+link)
                        alfa_found=True
                    elif alfa_found:
                        break
                except IndexError: # the tag contains no quoted attribute
                    pass
            break
    print('links found:', len(links))

    return links

links=get_links(base_url)
links

links found: 26


['https://services.datasport.com/2016/lauf/lamara//ALFAA.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAB.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAC.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAD.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAE.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAF.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAG.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAH.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAI.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAJ.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAK.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAL.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAM.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAN.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAO.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFA
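
The links above contain a double slash because base_url already ends with `/`, while the URLs in links2runs.csv do not. A minimal sketch of one way to join both kinds of base URL is shown below; the helper name `join_result_link` is hypothetical and not part of this notebook.

```python
from urllib.parse import urljoin

def join_result_link(base_url, page):
    """Hypothetical helper: join a race base URL and a result page name."""
    # urljoin replaces the last path segment unless the base ends with '/',
    # so make sure exactly one trailing slash is present first.
    base = base_url.rstrip('/') + '/'
    return urljoin(base, page)

print(join_result_link('https://services.datasport.com/2016/lauf/lamara/', 'ALFAA.HTM'))
print(join_result_link('http://services.datasport.com/1999/lauf/kerzers', 'ALFAA.HTM'))
```

Both calls give a URL with a single slash before the page name, whether or not the base URL ends with one.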

## Get the tables

Query datasport.com with the right parameters and finally get the __tables__.

Several checks verify that the table structure is standard.

### Table format check
The table has to contain these fields, whose names depend on the language.

In [78]:
# There are more fields than these; these are the only ones that matter.
# They are needed to detect automatically whether a table is structured differently:
# it is impossible to check manually all the tables of all the races.
header_fields_french=[['catégorie'],['rang'],['nom et prénom','nom/lieu','nom'],['an'],['équipe/lieu','lieu','pays/lieu'],['équipe'],['pénalité'],['temps'],['retard']]
optional_french=['pénalité','équipe','retard']
first_excluded_field_french='doss'
last_field_french=['moyenne','Ø/km','km/h']
header_fields_german=[['Kategorie'],['Rang'],['Name und Vorname','Name/Ort','Name'],['Jg'],['Team/Ortschaft','Land/Ort'],['S Start'],['Team'],['Nat'],['Zeit'],['Rückstand']]
optional_german=['Team','Rückstand','S Start','Nat']
first_excluded_field_german='Stnr'
last_field_german=['Schnitt','Ø/km','km/h']
header_fields_italian=[['categoria'],['posto'],['nome/località','nome'],['anno','an'],['squadra/località','località'],['squadra'],['tempo'],['ritardo']]
optional_italian=['nazione','squadra','ritardo']
first_excluded_field_italian='pett'
last_field_italian=['media','Ø/km','km/h']

parse_time() will be used both to parse the time fields and to check if a field is a time field or not

In [79]:
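def parse_time(time,check_only=False,split=True):
    ''' Parse a time string such as '1:45.28,4'.
    Return the tuple (hours, minutes, seconds, digits after the comma),
    or None if check_only is True.
    Raise an exception if the string cannot be parsed as a time.
    '''
    if split:
        time=time.split(' ')[0]
    if time.count(',')==0 and not check_only:
        raise ValueError('Not a time: '+time)
    time=re.split("[:.,]+",time)
    while len(time)<4:
        time=[0]+time
    hours,minutes,seconds,mseconds=[float(x) for x in time]
    
    if not check_only:
        return (hours,minutes,seconds,mseconds)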
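
To illustrate the format parse_time() expects, the sketch below splits a time such as `1:45.28,4` on `:`, `.` and `,` and converts it to seconds. The helper name `time_to_seconds` is hypothetical, and reading the last component as tenths of a second is an assumption based on the sample values shown in this notebook.

```python
import re

def time_to_seconds(text):
    """Hypothetical helper: convert 'H:MM.SS,t' (or 'MM.SS,t') to seconds."""
    parts = re.split(r'[:.,]+', text.strip())
    # Left-pad with zeros so that hours and minutes are optional.
    parts = ['0'] * (4 - len(parts)) + parts
    # Unpacking fails with ValueError if there are more than 4 components.
    hours, minutes, seconds, tenths = (int(p) for p in parts)
    # Assumption: the digit after the comma is a tenth of a second.
    return hours * 3600 + minutes * 60 + seconds + tenths / 10

print(time_to_seconds('1:45.28,4'))
print(time_to_seconds('7.12,2'))
```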

process_legend() is a function to check if the table has a standard format

In [80]:
def process_legend(legend):
    ''' Check if the legend is in a compatible format and find the language of the legend
    @return language, if pace is available
    
    The pace is necessary to get the distance of the run if it is not available in the description.
    '''
    legend_start=str(legend)
    legend=str(legend).split('\n')[0]
    if 'TdCN' in legend or 'Waffenlauf' in legend or 'DATASPORT' in legend:
        legend=legend_start.split('\n')[1]
    legend=legend.split('¦')[0]
    legend=re.sub('<[^>]+>', ' ', legend)
    legend=legend.lstrip()
    # check language
    if legend.startswith(header_fields_french[0][0]):
        language='French'
        header_fields=header_fields_french
        first_excluded=first_excluded_field_french
        optional=optional_french
        last_field=last_field_french
    elif legend.startswith(header_fields_german[0][0]):
        language='German'
        header_fields=header_fields_german
        first_excluded=first_excluded_field_german
        optional=optional_german
        last_field=last_field_german
    elif legend.startswith(header_fields_italian[0][0]):
        language='Italian'
        header_fields=header_fields_italian
        first_excluded=first_excluded_field_italian
        optional=optional_italian
        last_field=last_field_italian
    else:
        print(legend)
        raise ValueError('Error, problems in language detection')
    
    # Check if all words are present
    for words in header_fields:
        found=False
        for word in words:
            if legend.startswith(word):
                legend=legend.split(word)[1]
                legend=legend.lstrip()
                found=True
                break
            
        if not found:
            if words[0] in optional:
                pass
            else:
                print(words)
                print(legend)
                raise ValueError('Error, word not known')
    legend_splitted=legend.split(' ')
    legend_first_excluded=legend_splitted[0]
    if legend_first_excluded != first_excluded:
        print(legend_first_excluded)
        print(legend)
        raise ValueError('First excluded element not good')
    legend_splitted=[x.lstrip() for x in legend_splitted]
    legend_splitted=[x for x in legend_splitted if x!='' ]
    last=legend_splitted[-1]

    for word in last_field:
        if last.startswith(word):
            if word=='km/h':
                return language,word
            return language,True
    
    
    
    return language,False
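
The language detection step of process_legend() can be pictured as a lookup on the first word of the legend. The sketch below assumes the legend has already been stripped of HTML tags; the names `FIRST_WORD_TO_LANGUAGE` and `detect_language` are hypothetical and only meant as an illustration.

```python
# First header word of the legend in each supported language
# (taken from the header_fields_* lists defined earlier).
FIRST_WORD_TO_LANGUAGE = {
    'catégorie': 'French',
    'Kategorie': 'German',
    'categoria': 'Italian',
}

def detect_language(legend_text):
    """Hypothetical helper: return the language of a cleaned legend line."""
    text = legend_text.lstrip()
    for word, language in FIRST_WORD_TO_LANGUAGE.items():
        if text.startswith(word):
            return language
    raise ValueError('Unknown legend language: ' + text[:30])

print(detect_language('  catégorie rang nom et prénom an lieu'))
```

A dictionary keeps the per-language settings in one place, so adding a language would only need one more entry.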

### Hypothesis

*Fields* - standard fields (French names; the other languages use the equivalent words). The number in parentheses is the position in the raw line, where an and lieu share one field:
1. catégorie (0)
2. rang (1) (can be merged with nom)
3. nom (2) (can be merged with rang)
4. an (3)
5. lieu (3)
6. équipe  (4) (may be missing)
7. pénalité (5) (not always present)
8. temps (6)
9. retard (7)

*Only* fields 1, 2, 3, 4, 5, 8 and 9 are parsed.
After field 5, each remaining field is tested to see whether it is a time. Fields that are not times are ignored.
If more than 2 times are found, a message is printed and only the first two are kept.
If 1 time is found, it is assumed to be the final time, not the delay.
If 0 times are found, the runner is skipped and the line is printed.

The presence of these fields is checked automatically in process_legend(). They have to appear in this order.
If they do not, an error is raised.
Other possible problems:
1. temps and retard have to be formatted in a way that parse_time() can parse
2. The other fields also have to be formatted as they are for the Lausanne Marathon
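
The rule on the number of times can be sketched as follows. This is a simplified illustration, not the code used below: the regular expression `TIME_PATTERN` and the helper `pick_times` are hypothetical, and the pattern only covers the time format seen in the sample output of this notebook.

```python
import re

# Assumed time format: optional hours, then minutes.seconds,fraction
TIME_PATTERN = re.compile(r'(\d+:)?\d+\.\d+,\d+')

def pick_times(fields):
    """Hypothetical helper: return (time, delay) or None if no time is found."""
    times = [f.strip() for f in fields if TIME_PATTERN.fullmatch(f.strip())]
    if not times:
        return None                  # the runner is skipped
    if len(times) == 1:
        return times[0], '----'      # a single time is the final time
    return times[0], times[1]        # extra times are dropped

print(pick_times(['-----', '1:45.28,4', '25.56,8', '(5082)', 'diplôme']))
```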

### Parsing of category/sex/length
We are not interested in the specific category of the race: it can be deduced from the year.

We are strongly interested in:
1. Sex
2. Length of the race

This information is not easy to parse.

TO BE VERIFIED

It seems that *sex* is always encoded in some way in the category. The lists below contain the words, looked for in the second part of the category string, that indicate the sex.

A trailing 'W' is not used to detect the sex: it can stand for walking and would cause confusion.

In [81]:
men_category=['Hommes','Herren','Boys','Hom','Gar']
men_category_starting_ending_word=['H','M']
women_category=['Femmes','Damen','Girls','Dam','Fam','Fille']
women_category_starting_ending_word=['D','F']
women_category_only_starting_word=['W']

In [82]:
def number_or_majuscule(letter):
    return (letter.isdigit() or letter.isupper())

### Fields parsing
The fields are parsed in process_fields.

In [83]:
def process_category(category):  
    split=category.split('-')
    if len(split)==2:
        first,second=split
    elif len(split)==1:
        second=split[0]
        first=False
    else: 
        print(category)
        raise ValueError('Category not expected: '+category)
    # Category retrieval: keep the first part only if it is a number
    try:
        float(first)
    except (TypeError, ValueError):
        first=False
    
    # Sex retrieval
    sex=False
    for word in men_category:
        if word in second:
            sex='M'
            break
    for word in men_category_starting_ending_word:
        if (second.startswith(word) and number_or_majuscule(second[len(word):])) or second.endswith(word):
            sex='M'
            break
    for word in women_category:
        if word in second:
            if sex=='M':
                print('Double sex detected:', category)
                sex=False
                return first,sex
#                 raise('Double sex detected')
            sex='F'
            break
    for word in women_category_starting_ending_word:
        if (second.startswith(word) and number_or_majuscule(second[len(word):])): 
            if sex=='M':
                print('Double sex detected:', category)
                sex=False
                return first,sex
#                 raise('Double sex detected')
            sex='F'
            break
    for word in women_category_only_starting_word:
        if second.startswith(word): 
            if sex=='M':
                print('Double sex detected:', category)
                sex=False
                return first,sex
#                 raise('Double sex detected')
            sex='F'
            break
    return first,sex
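
As a small illustration of the first step of process_category(), the sketch below separates the numeric prefix from the label. The helper name `split_category` is hypothetical; unlike the function above, it splits on the first hyphen only and returns None instead of False when there is no numeric prefix.

```python
def split_category(category):
    """Hypothetical helper: return (numeric prefix or None, label)."""
    prefix, sep, label = category.partition('-')
    if not sep:
        # No hyphen: the whole string is the label.
        return None, category
    try:
        return float(prefix), label
    except ValueError:
        # Non-numeric prefix such as 'M' or '10W'.
        return None, label

for cat in ['21-H50', 'M-Fille3', 'DM', '10W-NW']:
    print(cat, split_category(cat))
```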

In [84]:
def process_fields(runner_splitted,pace):
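    ''' @ parameters
            runner_splitted is the list of raw fields of one runner. It is created by the for loop in the Parsing section.
                It is not well formatted: some fields can be merged together. Check Hypothesis.
            pace is the second value returned by process_legend()
        @ returns
            the list of fields that will be directly imported in the database,
            or a one-element list with an error label if the runner cannot be parsed
    '''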
    print(runner_splitted)
    fields_processed=[]
    # The first element is the category - process it
    fields_processed+=process_category(runner_splitted[0])
    print(runner_splitted)
    # Check if splitting second element
    try:
        splitted=runner_splitted[1].split('.')
    except:
        print(runner_splitted)
        raise
    try:
        splitted[0]=int(splitted[0])
    except:
#         print('Bad rank')
        return ['Bad rank']
    
    if len(splitted)!=1 and splitted[1]!='': #rang and nom are merged
        splitted[1]=splitted[1].lstrip()
        splitted[1]='.'.join(splitted[1:])
        fields_processed+=splitted[:2]
        first_to_check=2
    else:
        fields_processed.append(splitted[0])
        fields_processed.append(runner_splitted[2])
        first_to_check=3
    
    if len(fields_processed)!=4:
        print('before')
        print(fields_processed,runner_splitted)
        raise ValueError('Unexpected number of fields after rang and nom')
        

    # Check if nom is merged with an-lieu
    try:
        parse_time(runner_splitted[first_to_check])
        splitted_name_an=fields_processed[-1].split(' ')
        added_year=False
        for i,word in enumerate(splitted_name_an):
            try:
                int(word)
                fields_processed[-1]=fields_processed[-1].split(word)[0]
                fields_processed.append(word)
                fields_processed.append(' '.join(splitted_name_an[i+1:]))
                added_year=True
                break
            except:
                if word=='??':
                    print('Added ?? as year:',runner_splitted)
                    fields_processed[-1]=fields_processed[-1].split(word)[0]
                    fields_processed.append(word)
                    fields_processed.append(' '.join(splitted_name_an[i+1:]))
                    added_year=True
                    break
        if not added_year:
            runner_splitted.append('---')
    except:        
        # Split the an-lieu element
        fields_processed+=runner_splitted[first_to_check].split(' ',1)
        first_to_check+=1
        if len(fields_processed)<6:
            try:
                parse_time(runner_splitted[first_to_check])
            except:
                del fields_processed[4:]
                fields_processed+=runner_splitted[first_to_check].split(' ',1)
                first_to_check+=1
        # Add if they are not present
        while len(fields_processed)<6:
            fields_processed.append('---')
            print('Added an-lieu:',fields_processed,runner_splitted)


        # Sanity check: the last field (lieu) should not be a time.
        # NOTE: the bare except below also swallows the error raised here,
        # so this check currently has no effect.
        try:
            parse_time(fields_processed[-1])
            raise ValueError('It should not be a time')
        except:
            pass
            #del fields_processed[-1]
    if len(fields_processed)!=6:
        if runner_splitted[1].split('.')[1]=='':
#             print('Missing name:',fields_processed,runner_splitted)
            return ['Missing name']
        print(fields_processed,runner_splitted)
        raise ValueError('Unexpected number of fields after an and lieu')
    
        
    # Insert all times found after the year (only the first 2 are kept, see below)
    added_fields=0
    for i in range(first_to_check,len(runner_splitted)):
        try:
            parse_time(runner_splitted[i])
            fields_processed.append(runner_splitted[i].split(' ')[0])
            added_fields+=1
        except:
            pass

    if added_fields==0:
        print('No added fields')
        print(runner_splitted[first_to_check:])
        return ['No added fields']
    if added_fields==1:
        fields_processed.append('----')
        added_fields=2
    if added_fields!=2:
        if added_fields!=3 or pace!='km/h':
            print('More than 2 added fields:',runner_splitted)
        for i in range(2,added_fields):
            del fields_processed[-1]
#         print(added_fields)
#         print(runner_splitted)
#         raise('Added fields not equal to 2')
    
    # Add pace if present
    if pace:
        try:
            parse_time(runner_splitted[-1],check_only=True)
            if pace=='km/h':
                # Convert the speed in km/h to a pace in min.sec per km
                ms=float(runner_splitted[-1].replace(',','.'))/3.6
                sm=1000/ms
                minutes=int(sm/60)
                sec=int(sm%60)
                # Zero-pad the seconds so that 5 min 7 s gives '5.07', not '5.7'
                runner_splitted[-1]=str(minutes)+'.'+str(sec).zfill(2)
            fields_processed.append(runner_splitted[-1])
        except:
            return ['pace not present']
    else:
        fields_processed.append(False)
        
    return fields_processed
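
The conversion from a speed in km/h to a pace in minutes and seconds per kilometre can be checked in isolation. The sketch below uses a hypothetical helper name, `kmh_to_min_per_km`, and computes the seconds per kilometre as 3600 divided by the speed, which is the same quantity as in the function above.

```python
def kmh_to_min_per_km(speed_kmh):
    """Hypothetical helper: convert km/h to a 'M.SS' pace string (min/km)."""
    seconds_per_km = 3600 / speed_kmh
    minutes, seconds = divmod(int(seconds_per_km), 60)
    # Zero-pad the seconds so that 5 min 7 s is written '5.07'.
    return f'{minutes}.{seconds:02d}'

print(kmh_to_min_per_km(12.0))
print(kmh_to_min_per_km(10.0))
```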
    

## Parsing

In [85]:
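def do_parse(runner):
    # Currently every runner is parsed. Remove the next line to keep
    # only the categories starting with 10-, 21- or 42-.
    return True
    start=runner[:3]
    if start=='10-' or start=='21-' or start=='42-':
        return True
    return False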

In [86]:
links

['https://services.datasport.com/2016/lauf/lamara//ALFAA.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAB.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAC.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAD.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAE.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAF.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAG.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAH.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAI.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAJ.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAK.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAL.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAM.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAN.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFAO.HTM',
 'https://services.datasport.com/2016/lauf/lamara//ALFA

In [87]:
final_list=[]
for link in links:
        # Get raw HTML response
        result_html = rq.get(link)

        # Parse the page with BeautifulSoup
        result_soup = bs4.BeautifulSoup(result_html.text, "lxml")

        results=result_soup.findAll('font')  # The legend and the result tables are stored in <font> tags
        try:
            if 'DATASPORT Diplom Service für den Schweizer Frauenlauf' in str(results[0]):
                while 'Kategorie' not in str(results[0]):
                    del results[0]
            language,pace=process_legend(results[0])
        except:
            print('Link not working')
            continue
#         print(language,pace)
        del results[0]    # This is the legend
        for table in results:
            if table.get('size')=='2': # Size 1 is assumed to hold the split times, which are not interesting
                # WARNING: this assumption does not hold for every page
                runner_list=str(table).split('\n')         # Each line is delimited by \n
                for k,runner in enumerate(runner_list):
                    runner=runner.split('¦')[0] # The part on the right of ¦ holds the partial times, if present
                    start_runner=runner[:]
                    runner=re.sub('<[^>]+>', ' ', runner) # Replace every HTML tag with a space
                    runner=re.sub('  +','#@$&',runner)       # Replace every run of 2 or more spaces with the separator #@$&

                    runner=runner.replace('\n','')        # Remove any remaining \n
                    runner=runner.replace(' \r','')       # Remove the \r, with the space before it if present
                    runner=runner.replace('\r','')
                    runner=runner.lstrip()                 # The first athlete starts with a space

                    # The team can be empty, check:
                    start=runner.split('#@$&')[0]
                    if do_parse(start):
                        runner2=runner.split('#@$&') # Split the fields
                        if len(runner2)==1:
                            continue

                        # It works ONLY if the number of fields are the same for different languages
                        runner=process_fields(runner2,pace=pace) 
                        if len(runner)==9:
                            final_list.append(runner)         # Append to the final list  
                        else:
                            try:
                                if runner[0]=='Bad rank':
                                    pass
                                elif runner[0]=='pace not present':
                                    print('No pace:',runner2)
                                elif runner[0]=='Missing name':
                                    print('No name:',runner2)
                                else:
                                    print("Bad PF:",runner2)
                            except:
                                print(runner)
                                raise

['21-H50', '147.', 'Abaidia Jilani', '1966 St-Légier-La Chiésaz', '-----', '1:45.28,4', '25.56,8', '(5082)', 'diplôme', 'foto', 'video', '21-Hom', '1241.', '4.59 ']
['21-H50', '147.', 'Abaidia Jilani', '1966 St-Légier-La Chiésaz', '-----', '1:45.28,4', '25.56,8', '(5082)', 'diplôme', 'foto', 'video', '21-Hom', '1241.', '4.59 ']
['21-D40', '81.', 'Abaidia Sandrine', '1972 St-Légier', '-----', '1:49.40,8', '24.09,5', '(5080)', 'diplôme', 'foto', 'video', '21-Fem', '289.', '5.11 ']
['21-D40', '81.', 'Abaidia Sandrine', '1972 St-Légier', '-----', '1:49.40,8', '24.09,5', '(5080)', 'diplôme', 'foto', 'video', '21-Fem', '289.', '5.11 ']
['M-Fille3', '33.', 'Abaidia Selma', '2006 St-Légier-La Chiésaz', '-----', '7.12,2', '1.36,3', '(22010)', 'diplôme', 'foto', 'video', '---', '4.48 ']
['M-Fille3', '33.', 'Abaidia Selma', '2006 St-Légier-La Chiésaz', '-----', '7.12,2', '1.36,3', '(22010)', 'diplôme', 'foto', 'video', '---', '4.48 ']
['10W-NW', '---', 'Abansir Florence', '1971 Lausanne', '-----'
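
The cleaning steps of the loop above can be tried on a single line. The sketch below is a simplified version, not the notebook's code: the helper name `split_fields` is hypothetical, the sample line is assembled by hand from fields shown in the output above, and empty fields are dropped, which the loop does not do.

```python
import re

SEP = '#@$&'  # same placeholder separator as in the loop above

def split_fields(raw_line):
    """Hypothetical helper: split one raw result line into its fields."""
    line = raw_line.split('¦')[0]          # drop the split times, if any
    line = re.sub(r'<[^>]+>', ' ', line)   # replace every HTML tag with a space
    line = re.sub(r'  +', SEP, line)       # 2 or more spaces mark a field boundary
    line = line.replace('\r', '').strip()
    return [field for field in line.split(SEP) if field]

# Hand-made sample line (assumption: fields are separated by 2+ spaces)
sample = '21-H50   147. Abaidia Jilani  1966 St-Légier   1:45.28,4 ¦ 10.00'
print(split_fields(sample))
```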

In [74]:
# for i,link in enumerate(['http://services.datasport.com/2002/lauf/biel']):
t1=time.time()
final_list=[]

for i,link in enumerate(links2runs.URL):
    if i==87: # Not working - distance not available in any case
        continue
    if i==120: # https://services.datasport.com/2000/lauf/jungfrau/
        continue
    if i==176: # https://services.datasport.com/2001/lauf/zuerimeitli/ - no time
        continue
    if i==257: # https://services.datasport.com/2002/lauf/zuerimeitli/ - no time
        continue 
        
#     if i!=227:
#         continue
    
    if i<267:
        continue
    if i==300:
        break
    print(i,link)
    links=get_links(link)

    
    for link in links:
        # Get raw HTML response
        result_html = rq.get(link)

        # Parse the page with BeautifulSoup
        result_soup = bs4.BeautifulSoup(result_html.text, "lxml")

        results=result_soup.findAll('font')  # The legend and the result tables are stored in <font> tags
        try:
#             print(repr(results[0])
            if 'DATASPORT Diplom Service für den Schweizer Frauenlauf' in str(results[0]):
                while 'Kategorie' not in str(results[0]):
                    del results[0]
            language,pace=process_legend(results[0])
        except:
            print('Link not working')
            continue
#         print(language,pace)
        del results[0]    # This is the legend
        for table in results:
            if table.get('size')=='2': # Size 1 is assumed to hold the split times, which are not interesting
                # WARNING: this assumption does not hold for every page
                runner_list=str(table).split('\n')         # Each line is delimited by \n
                for k,runner in enumerate(runner_list):
                    runner=runner.split('¦')[0] # The part on the right of ¦ holds the partial times, if present
                    start_runner=runner[:]
                    runner=re.sub('<[^>]+>', ' ', runner) # Replace every HTML tag with a space
                    runner=re.sub('  +','#@$&',runner)       # Replace every run of 2 or more spaces with the separator #@$&

                    runner=runner.replace('\n','')        # Remove any remaining \n
                    runner=runner.replace(' \r','')       # Remove the \r, with the space before it if present
                    runner=runner.replace('\r','')
                    runner=runner.lstrip()                 # The first athlete starts with a space

                    # The team can be empty, check:
                    start=runner.split('#@$&')[0]
                    if do_parse(start):
                        runner2=runner.split('#@$&') # Split the fields
                        if len(runner2)==1:
                            continue

                        # It works ONLY if the number of fields are the same for different languages
                        runner=process_fields(runner2,pace=pace) 
                        if len(runner)==9:
                            final_list.append(runner)         # Append to the final list  
                        else:
                            try:
                                if runner[0]=='Bad rank':
                                    pass
                                elif runner[0]=='pace not present':
                                    print('No pace:',runner2)
                                elif runner[0]=='Missing name':
                                    print('No name:',runner2)
                                else:
                                    print("Bad PF:",runner2)
                            except:
                                print(runner)
                                raise
print('Time: ',time.time()-t1)
    

267 http://services.datasport.com/2002/lauf/defi
links found: 22
['DM', '47. Abbet Pascal', '62 Vessy', '8:23.07,8', '2:51.26,9', '(402)', '']
Double sex detected: DM
['DM', '47. Abbet Pascal', '62 Vessy', '8:23.07,8', '2:51.26,9', '(402)', '']
['', '2:39.30', '42.']
['', '2:39.30', '42.']
['', 'DM', '15. Ackermann Martin', '57 WOHLEN', '7:02.19,3', '1:30.38,4', '(100)', '']
['', 'DM', '15. Ackermann Martin', '57 WOHLEN', '7:02.19,3', '1:30.38,4', '(100)', '']
['', '2:25.30', '22.']
['', '2:25.30', '22.']
['', 'DW', '4. Aeschlimann Heidi', '56 Gippingen', '7:14.57,3', '46.10,5', '(101)', '']
['', 'DW', '4. Aeschlimann Heidi', '56 Gippingen', '7:14.57,3', '46.10,5', '(101)', '']
['', '3:08.30', '14.']
['', '3:08.30', '14.']
['', 'DVM', '3. Aeschlimann Ulrich', '51 Gippingen', '6:37.50,5', '56.59,7', '(14)', '']
['', 'DVM', '3. Aeschlimann Ulrich', '51 Gippingen', '6:37.50,5', '56.59,7', '(14)', '']
['', '2:14.30', '3.']
['', '2:14.30', '3.']
['', 'SJM', '3. Allmendinger Rémy', '84 Yverd

IndexError: list index out of range

In [88]:
df = pd.DataFrame(final_list)
df = df.rename(columns={0:'cat',1:'sex',2:'rang',3:'nom',4:'an',5:'lieu',6:'temps',7:'retard',8:'pace'})

In [89]:
df

Unnamed: 0,cat,sex,rang,nom,an,lieu,temps,retard,pace
0,21,M,147,Abaidia Jilani,1966,St-Légier-La Chiésaz,"1:45.28,4","25.56,8",4.59
1,21,F,81,Abaidia Sandrine,1972,St-Légier,"1:49.40,8","24.09,5",5.11
2,False,F,33,Abaidia Selma,2006,St-Légier-La Chiésaz,"7.12,2","1.36,3",4.48
3,21,M,103,Abb Jochen,1948,Ernen,"2:50.40,7","1:21.28,7",8.05
4,10,M,426,Abbas Dhia,1961,Lausanne,"1:13.04,1","38.13,0",7.18
5,21,M,640,Abbet Florian,1982,Pully,"1:56.01,7","47.33,8",5.29
6,10,F,517,Abdala Maria Lucia,1979,Lausanne,"1:01.30,8","27.09,1",6.09
7,10,M,152,Abdela Esa,1992,Pully,"42.44,1","14.26,0",4.16
8,21,M,67,Abdelaziem Ahmed Ramy Bac,1992,Lausanne,"1:29.06,1","20.32,9",4.13
9,False,M,2,Abdelmoumène Eden,2003,F-Evian les Bains,"10.47,6","0.46,9",2.34


In [90]:
df.to_csv('lausanne_marathon_2016.csv')

In [91]:
pd.read_csv('lausanne_marathon_2016.csv')

Unnamed: 0.1,Unnamed: 0,cat,sex,rang,nom,an,lieu,temps,retard,pace
0,0,21,M,147,Abaidia Jilani,1966,St-Légier-La Chiésaz,"1:45.28,4","25.56,8",4.59
1,1,21,F,81,Abaidia Sandrine,1972,St-Légier,"1:49.40,8","24.09,5",5.11
2,2,False,F,33,Abaidia Selma,2006,St-Légier-La Chiésaz,"7.12,2","1.36,3",4.48
3,3,21,M,103,Abb Jochen,1948,Ernen,"2:50.40,7","1:21.28,7",8.05
4,4,10,M,426,Abbas Dhia,1961,Lausanne,"1:13.04,1","38.13,0",7.18
5,5,21,M,640,Abbet Florian,1982,Pully,"1:56.01,7","47.33,8",5.29
6,6,10,F,517,Abdala Maria Lucia,1979,Lausanne,"1:01.30,8","27.09,1",6.09
7,7,10,M,152,Abdela Esa,1992,Pully,"42.44,1","14.26,0",4.16
8,8,21,M,67,Abdelaziem Ahmed Ramy Bac,1992,Lausanne,"1:29.06,1","20.32,9",4.13
9,9,False,M,2,Abdelmoumène Eden,2003,F-Evian les Bains,"10.47,6","0.46,9",2.34
