# Data source

The data for this notebook comes from two pages: the men's 100m all-time list at https://www.alltime-athletics.com/m_100ok.htm and the list of doping cases at https://en.wikipedia.org/wiki/List_of_doping_cases_in_athletics.

Unfortunately, the 100m data is not provided in any downloadable format, so I have just copied their table to `mens-100m.txt`.

Neither the 100m data nor the doping data is intended to be definitive proof or an exhaustive list; this is just a little side project to sate my curiosity.

# Converting to CSV for future convenience

In [1]:
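import csv
import os
import re

def create_csv(input_path, output_path):
    """Convert the whitespace-aligned text table into a fully quoted CSV file."""
    with open(input_path, 'r') as data, open(output_path, 'w', newline='') as file:
        # csv.writer escapes any quote characters that appear inside a field
        writer = csv.writer(file, quoting=csv.QUOTE_ALL)
        for line in data:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Columns are separated by runs of two or more whitespace characters
            writer.writerow(re.split(r'\s{2,}', line))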

ALWAYS_REMAKE = False
input_path = 'mens-100m.txt'
output_path = 'mens-100m.csv'
if ALWAYS_REMAKE or not os.path.isfile(output_path):
    print('Creating CSV File...')
    create_csv(input_path, output_path)
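
To make the splitting step concrete, here is a small sketch of what `re.split(r'\s{2,}', ...)` does to a single row. The sample line is an assumption about how a row is laid out in `mens-100m.txt` (the values are those of the first row of the table; the exact spacing in the real file may differ).

```python
import re

# Assumed layout of one raw row: columns separated by runs of two or more spaces.
sample_line = "1      9.58   +0.9   Usain Bolt    JAM   21.08.86   1   Berlin   16.08.2009\n"

# Split on two or more whitespace characters, so the single space
# inside "Usain Bolt" stays part of the name field.
columns = re.split(r'\s{2,}', sample_line.strip())
print(columns)

# Quote every field, as the CSV conversion does.
csv_row = '"' + '","'.join(columns) + '"'
print(csv_row)
```

The trade-off of this approach is that two columns separated by only one space in the copied text would be merged into a single field, so it is worth checking that every row ends up with the same number of columns.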

# Loading the data

In [2]:
import pandas as pd

top_sprinters = pd.read_csv('mens-100m.csv', header=None, quotechar='"')
top_sprinters.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,1,9.58,+0.9,Usain Bolt,JAM,21.08.86,1,Berlin,16.08.2009
1,2,9.63,+1.5,Usain Bolt,JAM,21.08.86,1,London,05.08.2012
2,3,9.69,±0.0,Usain Bolt,JAM,21.08.86,1,Beijing,16.08.2008
3,3,9.69,+2.0,Tyson Gay,USA,09.08.82,1,Shanghai,20.09.2009
4,3,9.69,-0.1,Yohan Blake,JAM,26.12.89,1,Lausanne,23.08.2012


In [3]:
doping_cases = pd.read_csv('List of doping cases.csv')
doping_cases.head()

Unnamed: 0,Name,Country,Event,Date of violation,Banned substance(s)/Anti-doping rule violation,Sanction,Reference(s)
0,Ahmed Abd El Raouf,Egypt,Hammer throw,2008,Norandrosterone,2 years,[2][3]
1,Inga Abitova,Russia,Long distance,2009,Biological passport anomalies,2 years,[4][5][6]
2,Folashade Abugan,Nigeria,Sprinting,2010,Testosterone prohormone,2 years,[7][8][9]
3,Ibrahim Mohamed Aden,Somalia,Middle distance,1999,Ephedrine,Public warning,[10][11][12]
4,Tosin Adeloye,Nigeria,Sprinting,2012\n2015,Metenolone\nExogenous steroids,2 years\n8 years,[13][14]\n[15][16]


# Creating list of sprinters

In [4]:
doping_sprinters = doping_cases[doping_cases['Event'] == 'Sprinting']['Name'].unique()

# TODO handle alternate names
# For now, just list the names containing anything other than word characters, whitespace, or hyphens
print([name for name in doping_sprinters if re.search(r'[^\w\s-]', name)])

['Yekaterina Grigoryeva\n(Yekaterina Leshcheva)', 'Gloria Kemasuode\n(Gloria Ubiebor)']
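
One possible way to handle the TODO above is to split each `Name\n(Alternate Name)` entry into two separate names, so that either spelling can be matched later. The helper `expand_alternate_names` is hypothetical (it is not part of the notebook), and the sketch assumes every alternate name appears in parentheses after the primary name, as in the two entries printed above.

```python
import re

def expand_alternate_names(names):
    """Return a set with each primary name plus any parenthesised alternate name."""
    expanded = set()
    for name in names:
        # Entries with an alternate look like 'Primary Name\n(Alternate Name)'.
        match = re.match(r'^(.*?)\s*\((.*?)\)\s*$', name, flags=re.DOTALL)
        if match:
            expanded.add(match.group(1).strip())
            expanded.add(match.group(2).strip())
        else:
            # Plain names are kept unchanged.
            expanded.add(name.strip())
    return expanded

example = [
    'Yekaterina Grigoryeva\n(Yekaterina Leshcheva)',
    'Gloria Kemasuode\n(Gloria Ubiebor)',
    'Tyson Gay',
]
expanded_names = expand_alternate_names(example)
print(sorted(expanded_names))
```

Returning a set also makes later membership checks (`name in expanded_names`) cheap.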


# Combining data

In [5]:
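doped_or_not = []
# Columns 0, 1, 3, 4 and 8 hold the place, time, name, country and race date
for _, row in top_sprinters.iterrows():
    entry = {'Place': row[0], 'Time': row[1], 'Name': row[3], 'Country': row[4], 'Date': row[8]}
    # Exact string match against the names of sanctioned sprinters
    entry['Doped'] = row[3] in doping_sprinters
    doped_or_not.append(entry)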

doped_or_not = pd.DataFrame(doped_or_not)
doped_or_not.head()

Unnamed: 0,Place,Time,Name,Country,Date,Doped
0,1,9.58,Usain Bolt,JAM,16.08.2009,False
1,2,9.63,Usain Bolt,JAM,05.08.2012,False
2,3,9.69,Usain Bolt,JAM,16.08.2008,False
3,3,9.69,Tyson Gay,USA,20.09.2009,True
4,3,9.69,Yohan Blake,JAM,23.08.2012,True
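
The row-by-row loop above could also be written as a column selection plus `Series.isin`, which is the more common pandas idiom. The sketch below is an alternative and not the notebook's own code. It rebuilds a three-row stand-in for `top_sprinters` from the rows shown earlier, and uses a stand-in `doping_sprinters` list containing only the two names flagged `True` in the output above.

```python
import pandas as pd

# Three rows copied from the loaded table, with the same integer column labels.
top_sprinters = pd.DataFrame([
    [1, 9.58, '+0.9', 'Usain Bolt', 'JAM', '21.08.86', 1, 'Berlin', '16.08.2009'],
    [3, 9.69, '+2.0', 'Tyson Gay', 'USA', '09.08.82', 1, 'Shanghai', '20.09.2009'],
    [3, 9.69, '-0.1', 'Yohan Blake', 'JAM', '26.12.89', 1, 'Lausanne', '23.08.2012'],
])

# Stand-in for the real array of sanctioned sprinters (assumption for this sketch).
doping_sprinters = ['Tyson Gay', 'Yohan Blake']

# Keep only the columns of interest and give them readable names.
doped_or_not = top_sprinters[[0, 1, 3, 4, 8]].rename(
    columns={0: 'Place', 1: 'Time', 3: 'Name', 4: 'Country', 8: 'Date'}
)

# Flag every row whose name exactly matches a sanctioned sprinter.
doped_or_not['Doped'] = doped_or_not['Name'].isin(doping_sprinters)
print(doped_or_not)
```

Both versions rely on exact name matches, so any spelling difference between the two sources would leave a sprinter unflagged.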
