## EDA / Data Preprocessing
This notebook scrapes event results from Olympic.org. It currently covers only the men's track events, because the website's server went down while the results were being scraped.
The notebook starts by loading the libraries needed for exploratory data analysis and for reading .pdf/.csv files.

I then create two lists, one holding the names in the athlete dataframe and one holding the names in the doping dataframe. One complete dataframe is then created with an added feature, 'flagged', indicating whether or not the athlete has tested positive for PED use.

From there, I scrape the result tables for the individual track events and merge the results with the complete dataframe by matching names. Rows are dropped primarily because the athlete did not start or finish the race. This may be due to an injury, a false start, or a disqualification for PED use.

In [2713]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2714]:
import pandas as pd
import numpy as np
import pdfplumber
import tabula
import seaborn as sns
from Olympic_PED_use.src import functions as fn
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

from bs4 import BeautifulSoup
import certifi
import urllib3
import re
from csv import DictReader, DictWriter
import datetime as dt

import glob

pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)

In [2715]:
athlete_df = fn.create_athlete_df()

finished


In [2716]:
athlete_df = athlete_df.drop_duplicates('name')
len(athlete_df)

4719

In [2717]:
doping_df = fn.create_wiki_doping()

In [2718]:
athlete_names = list(athlete_df.name)
doping_names = list(doping_df.name)

In [2719]:
matched = []
for x in range(len(doping_names)):
    if doping_names[x] in athlete_names:
        matched.append(doping_names[x])
print('Found {} matches\n\n'.format(len(matched)))

Found 247 matches




In [2720]:
len(doping_names)

579
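
Only 247 of the 579 doping names match an athlete name exactly. One possible reason, which is an assumption here and not something checked in this notebook, is that the two sources format names differently (case, accents, extra spaces). The sketch below shows one way to compare names on a normalized key. `normalize_name` and the toy name lists are hypothetical and are not part of `functions.py`.

```python
import unicodedata

def normalize_name(name):
    """Hypothetical helper: strip accents, lower-case, collapse whitespace."""
    decomposed = unicodedata.normalize('NFKD', name)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ' '.join(no_accents.casefold().split())

# Toy stand-ins for athlete_names and doping_names (invented names)
athlete_names_demo = ['Ana María López', 'John Smith', 'Pat Runner']
doping_names_demo = ['john  SMITH', 'Ana Maria Lopez', 'Someone Else']

# A set gives constant-time membership tests instead of scanning a list
athlete_keys = {normalize_name(n) for n in athlete_names_demo}
matched_demo = [n for n in doping_names_demo if normalize_name(n) in athlete_keys]
print('Found {} matches'.format(len(matched_demo)))
```

Normalized matching can also create false positives when two different athletes share a name, so any extra matches it finds would still need a manual check.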

#### Inserting the 'flagged' column to hold binary values indicating PED use

In [2721]:
doping_df.insert(4, 'flagged', 1)

In [2722]:
athlete_df = pd.merge(athlete_df, doping_df[['name', 'flagged']], how='left', on='name')

In [2723]:
athlete_df.flagged = athlete_df.flagged.fillna(value=0)
athlete_df.flagged.value_counts()

0.0    4474
1.0     247
Name: flagged, dtype: int64
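
The two counts above sum to 4,721, which is two more than the 4,719 rows reported after `drop_duplicates`. That would be consistent with a few names appearing more than once in `doping_df`, since a left merge repeats the left row for every matching right row. This is an inference, not something verified here. A minimal sketch of the effect and one way to guard against it, using toy data:

```python
import pandas as pd

athletes = pd.DataFrame({'name': ['A', 'B', 'C']})
# 'B' is listed twice, as could happen with repeat sanctions (toy data)
doping = pd.DataFrame({'name': ['B', 'B'], 'flagged': [1, 1]})

# A plain left merge duplicates the row for 'B'
naive = pd.merge(athletes, doping, how='left', on='name')
print(len(naive))

# De-duplicate the right side first; validate raises if its keys are not unique
doping_unique = doping.drop_duplicates('name')
safe = pd.merge(athletes, doping_unique, how='left', on='name',
                validate='many_to_one')
safe['flagged'] = safe['flagged'].fillna(0).astype(int)
print(safe['flagged'].value_counts())
```

With `validate='many_to_one'`, pandas raises a `MergeError` when the right-hand keys are duplicated, so the row count cannot grow silently.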

### Merging Event Results
Below I will begin to scrape the results of the men's track events and merge them with the athlete dataframe on matching names.

#### Scraping the results from the men's 100m dash 2004-Athens

In [2724]:
mens_100m_results = []

In [2725]:
athens_100m_men = fn.olympic_query('athens', '2004', '100m-men')
mens_100m_04 = fn.olympic_scraper(athens_100m_men)
mens_100m_04 = fn.content_cleaner(mens_100m_04)
mens_100m_04.columns = ['rank', 'name', 'mens_100m_04']
mens_100m_04.insert(3, "event_x", "Athletics Men's 100 metres")
mens_100m_04.mens_100m_04 = [str(x) for x in mens_100m_04.mens_100m_04]
mens_100m_04.name = [x.strip() for x in mens_100m_04.name]
mens_100m_04.mens_100m_04 = [x.strip() for x in mens_100m_04.mens_100m_04]
mens_100m_results.append(mens_100m_04.mens_100m_04)
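
The same cleaning steps are repeated for every event and year below. As a sketch only, they could be collected into one helper. `tidy_results` is a hypothetical name and does not exist in `functions.py`; the input here is a toy frame standing in for the output of `fn.content_cleaner`.

```python
import pandas as pd

def tidy_results(raw, time_col, event, strip_chars=None):
    """Hypothetical helper mirroring the per-event cleaning steps."""
    df = raw.copy()
    df.columns = ['rank', 'name', time_col]
    df.insert(3, 'event_x', event)
    df['name'] = df['name'].str.strip()
    # Cast to str first so numeric and string results are handled alike
    df[time_col] = df[time_col].astype(str).str.strip()
    if strip_chars is not None:
        df[time_col] = df[time_col].str.strip(strip_chars)
    return df

# Toy three-column frame (invented values)
raw_demo = pd.DataFrame([[1, ' Runner One ', ' 9.85 '], [2, 'Runner Two', 9.9]])
demo = tidy_results(raw_demo, 'mens_100m_04', "Athletics Men's 100 metres")
print(demo)
```

Using the vectorized `.str` accessor avoids the list comprehensions while doing the same work.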

#### Merging the results with the athlete dataframe

In [2727]:
athlete_df = pd.merge(athlete_df, mens_100m_04[['name', 'mens_100m_04']], how='left', left_on=['name'], right_on='name')

#### Scraping the results from the men's 100m dash 2008-Beijing

In [2728]:
beijing_100m_men = fn.olympic_query('beijing', '2008', '100m-men')
mens_100m_08 = fn.olympic_scraper(beijing_100m_men)
mens_100m_08 = fn.content_cleaner(mens_100m_08)
mens_100m_08.columns = ['rank', 'name', 'mens_100m_08']
mens_100m_08.insert(3, "event_x", "Athletics Men's 100 metres")
mens_100m_08.mens_100m_08 = [str(x) for x in mens_100m_08.mens_100m_08]
mens_100m_08.mens_100m_08 = [x.strip() for x in mens_100m_08.mens_100m_08]
mens_100m_08.name = [x.strip() for x in mens_100m_08.name]
mens_100m_08.mens_100m_08 = [x.strip('DPG /') for x in mens_100m_08.mens_100m_08]
mens_100m_results.append(mens_100m_08.mens_100m_08)
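
Note that `str.strip('DPG /')` treats its argument as a set of characters to remove from both ends, not as a literal substring. The sketch below illustrates this on invented sample strings, and shows a regex alternative that assumes the annotation appears as a trailing `/ DPG`. The exact form of the annotation on the site is an assumption.

```python
import re

samples = ['9.69 / DPG', '10.01', 'DNS', 'DQ']

# Character-set stripping: also eats the leading 'D' of status codes
stripped = [s.strip('DPG /') for s in samples]
print(stripped)  # ['9.69', '10.01', 'NS', 'Q']

def drop_suffix(value, pattern=r'\s*/\s*DPG\s*$'):
    """Hypothetical helper: remove only a trailing '/ DPG' annotation."""
    return re.sub(pattern, '', value).strip()

cleaned = [drop_suffix(s) for s in samples]
print(cleaned)  # ['9.69', '10.01', 'DNS', 'DQ']
```

The suffix-based version leaves status codes such as `DNS` intact, which makes them easier to identify later.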

In [2730]:
athlete_df = pd.merge(athlete_df, mens_100m_08[['name', 'mens_100m_08']], how='left', on='name', suffixes=(None, '_100m_mens_beijing'))

#### Scraping the results from the men's 100m dash 2012-London
Removing rows with missing values. These are due either to an update on doping sanctions or to the athlete not starting or finishing the race.

In [2731]:
london_100m_men = fn.olympic_query('london', '2012', '100m-men')
mens_100m_12 = fn.olympic_scraper(london_100m_men)
mens_100m_12 = fn.content_cleaner(mens_100m_12)
mens_100m_12.columns = ['rank', 'name', 'mens_100m_12']
mens_100m_12.insert(3, "event_x", "Athletics Men's 100 metres")
mens_100m_12.mens_100m_12 = [str(x) for x in mens_100m_12.mens_100m_12]
mens_100m_12.name = [x.strip() for x in mens_100m_12.name]
mens_100m_12.mens_100m_12 = [x.strip() for x in mens_100m_12.mens_100m_12]
mens_100m_12 = mens_100m_12.drop([31,86,87])
mens_100m_results.append(mens_100m_12.mens_100m_12)
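
Dropping rows by hard-coded index labels depends on the scraped table keeping the same order. As an alternative sketch, rows without a usable time can be filtered by trying to parse the result as a number. The data below is invented, and the approach assumes results are plain seconds (as in the sprints).

```python
import pandas as pd

results = pd.DataFrame({
    'name': ['Runner One', 'Runner Two', 'Runner Three', 'Runner Four'],
    'mens_100m_12': ['9.63', 'DNS', '', '10.04'],
})

# Anything that is not a number (status codes, blanks) becomes NaN
times = pd.to_numeric(results['mens_100m_12'], errors='coerce')
finished = results[times.notna()]
print(finished)
```

Results written as `m:ss.xx`, as in the longer events, would not parse this way and would need converting to seconds first.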

In [2732]:
athlete_df = pd.merge(athlete_df, mens_100m_12[['name', 'mens_100m_12']], how='left', on='name', suffixes=(None, '_100m_mens_london'))


#### Scraping the results from the men's 100m dash 2016-Rio

In [2733]:
rio_100m_men = fn.olympic_query('rio', '2016', '100m-men')
mens_100m_16 = fn.olympic_scraper(rio_100m_men)
mens_100m_16 = fn.content_cleaner(mens_100m_16)
mens_100m_16.columns = ['rank', 'name', 'mens_100m_16']
mens_100m_16.insert(3, "event_x", "Athletics Men's 100 metres")
mens_100m_16.mens_100m_16 = [str(x) for x in mens_100m_16.mens_100m_16]
mens_100m_16.name = [x.strip() for x in mens_100m_16.name]
mens_100m_16.mens_100m_16 = [x.strip() for x in mens_100m_16.mens_100m_16]
mens_100m_results.append(mens_100m_16.mens_100m_16)

In [2734]:
athlete_df = pd.merge(athlete_df, mens_100m_16[['name', 'mens_100m_16']], how='left', on='name', suffixes=(None, '_100m_mens_rio'))
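
Each merge above follows the same pattern: a left merge on `name` that adds one result column. As a sketch under that assumption, the merges can be applied in sequence with `functools.reduce`. The frames below are toy data, not the scraped tables.

```python
from functools import reduce
import pandas as pd

base = pd.DataFrame({'name': ['Runner One', 'Runner Two', 'Runner Three']})
event_frames = [
    pd.DataFrame({'name': ['Runner One'], 'mens_100m_04': ['9.85']}),
    pd.DataFrame({'name': ['Runner Two', 'Runner One'],
                  'mens_100m_08': ['9.69', '9.95']}),
]

# Fold every event frame into the base frame with a left merge on 'name'
merged = reduce(
    lambda left, right: pd.merge(left, right, how='left', on='name'),
    event_frames,
    base,
)
print(merged)
```

Because every result column has a unique name and `name` is the only shared column, no suffixes are needed in this sketch.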

#### Scraping the results from the men's 200m dash 2004-Athens

In [2735]:
athens_200m_men = fn.olympic_query('athens', '2004', '200m-men')
mens_200m_04 = fn.olympic_scraper(athens_200m_men)
mens_200m_04 = fn.content_cleaner(mens_200m_04)
mens_200m_04.columns = ['rank', 'name', 'mens_200m_04']
mens_200m_04.insert(3, "event_x", "Athletics Men's 200 metres")
mens_200m_04.mens_200m_04 = [str(x) for x in mens_200m_04.mens_200m_04]
mens_200m_04.mens_200m_04 = [x.strip() for x in mens_200m_04.mens_200m_04]
mens_200m_04.name = [x.strip() for x in mens_200m_04.name]


In [2736]:
athlete_df = pd.merge(athlete_df, mens_200m_04[['name', 'mens_200m_04']], how='left', on='name', suffixes=(None, '_200m_mens_athens'))

#### Scraping the results from the men's 200m dash 2008-Beijing
dropping rows with missing values

In [2737]:
beijing_200m_men = fn.olympic_query('beijing', '2008', '200m-men')
mens_200m_08 = fn.olympic_scraper(beijing_200m_men)
mens_200m_08 = fn.content_cleaner(mens_200m_08)
mens_200m_08.columns = ['rank', 'name', 'mens_200m_08']
mens_200m_08.insert(3, "event_x", "Athletics Men's 200 metres")
mens_200m_08.mens_200m_08 = [str(x) for x in mens_200m_08.mens_200m_08]
mens_200m_08.mens_200m_08 = [x.strip() for x in mens_200m_08.mens_200m_08]
mens_200m_08.mens_200m_08 = [x[:5] for x in mens_200m_08.mens_200m_08]
mens_200m_08.name = [x.strip() for x in mens_200m_08.name]
mens_200m_08 = mens_200m_08.drop([6,7,67,23,39,71,103,166,
                                 167,168,169,178,219,227,235])


In [2738]:
athlete_df = pd.merge(athlete_df, mens_200m_08[['name', 'mens_200m_08']], how='left', on='name', suffixes=(None, '_200m_mens_beijing'))

#### Scraping the results from the men's 200m dash 2012-London

In [2739]:
london_200m_men = fn.olympic_query('london', '2012', '200m-men')
mens_200m_12 = fn.olympic_scraper(london_200m_men)
mens_200m_12 = fn.content_cleaner(mens_200m_12)
mens_200m_12.columns = ['rank', 'name', 'mens_200m_12']
mens_200m_12.insert(3, "event_x", "Athletics Men's 200 metres")
mens_200m_12.mens_200m_12 = [str(x) for x in mens_200m_12.mens_200m_12]
mens_200m_12.mens_200m_12 = [x.strip() for x in mens_200m_12.mens_200m_12]
mens_200m_12.name = [x.strip() for x in mens_200m_12.name]
mens_200m_12 = mens_200m_12.drop([30,31,85,86])


In [2740]:
athlete_df = pd.merge(athlete_df, mens_200m_12[['name', 'mens_200m_12']], how='left', on='name', suffixes=(None, '_200m_mens_london'))

#### Scraping the results from the men's 200m dash 2016-Rio

In [2741]:
rio_200m_men = fn.olympic_query('rio', '2016', '200m-men')
mens_200m_16 = fn.olympic_scraper(rio_200m_men)
mens_200m_16 = fn.content_cleaner(mens_200m_16)
mens_200m_16.columns = ['rank', 'name', 'mens_200m_16']
mens_200m_16.insert(3, "event_x", "Athletics Men's 200 metres")
mens_200m_16.mens_200m_16 = [str(x) for x in mens_200m_16.mens_200m_16]
mens_200m_16.mens_200m_16 = [x.strip() for x in mens_200m_16.mens_200m_16]
mens_200m_16.name = [x.strip() for x in mens_200m_16.name]


In [2742]:
athlete_df = pd.merge(athlete_df, mens_200m_16[['name', 'mens_200m_16']], how='left', on='name', suffixes=(None, '_200m_mens_rio'))

#### Scraping the results from the men's 400m dash 2004-Athens

In [2743]:
athens_400m_men = fn.olympic_query('athens', '2004', '400m-men')
mens_400m_04 = fn.olympic_scraper(athens_400m_men)
mens_400m_04 = fn.content_cleaner(mens_400m_04)
mens_400m_04.columns = ['rank', 'name', 'mens_400m_04']
mens_400m_04.insert(3, "event_x", "Athletics Men's 400 metres")
mens_400m_04.mens_400m_04 = [str(x) for x in mens_400m_04.mens_400m_04]
mens_400m_04.mens_400m_04 = [x.strip() for x in mens_400m_04.mens_400m_04]
mens_400m_04.name = [x.strip() for x in mens_400m_04.name]


In [2744]:
athlete_df = pd.merge(athlete_df, mens_400m_04[['name', 'mens_400m_04']], how='left', on='name', suffixes=(None, '_400m_mens_athens'))

#### Scraping the results from the men's 400m dash 2008-Beijing
dropping rows with missing values

In [2745]:
beijing_400m_men = fn.olympic_query('beijing', '2008', '400m-men')
mens_400m_08 = fn.olympic_scraper(beijing_400m_men)
mens_400m_08 = fn.content_cleaner(mens_400m_08)
mens_400m_08.columns = ['rank', 'name', 'mens_400m_08']
mens_400m_08.insert(3, "event_x", "Athletics Men's 400 metres")
mens_400m_08.mens_400m_08 = [str(x) for x in mens_400m_08.mens_400m_08]
mens_400m_08.mens_400m_08 = [x.strip() for x in mens_400m_08.mens_400m_08]
mens_400m_08.mens_400m_08 = [x[:5] for x in mens_400m_08.mens_400m_08]
mens_400m_08.name = [x.strip() for x in mens_400m_08.name]
mens_400m_08 = mens_400m_08.drop([113,114,122,146])


In [2746]:
athlete_df = pd.merge(athlete_df, mens_400m_08[['name', 'mens_400m_08']], how='left', on='name', suffixes=(None, '_400m_mens_beijing'))

#### Scraping the results from the men's 400m dash 2012-London
dropping rows with missing values

In [2747]:
london_400m_men = fn.olympic_query('london', '2012', '400m-men')
mens_400m_12 = fn.olympic_scraper(london_400m_men)
mens_400m_12 = fn.content_cleaner(mens_400m_12)
mens_400m_12.columns = ['rank', 'name', 'mens_400m_12']
mens_400m_12.insert(3, "event_x", "Athletics Men's 400 metres")
mens_400m_12.mens_400m_12 = [str(x) for x in mens_400m_12.mens_400m_12]
mens_400m_12.mens_400m_12 = [x.strip() for x in mens_400m_12.mens_400m_12]
mens_400m_12.name = [x.strip() for x in mens_400m_12.name]
mens_400m_12 = mens_400m_12.drop([30,78,79,80,81,82])


In [2748]:
athlete_df = pd.merge(athlete_df, mens_400m_12[['name', 'mens_400m_12']], how='left', on='name', suffixes=(None, '_400m_mens_london'))

#### Scraping the results from the men's 400m dash 2016-Rio

In [2749]:
rio_400m_men = fn.olympic_query('rio', '2016', '400m-men')
mens_400m_16 = fn.olympic_scraper(rio_400m_men)
mens_400m_16 = fn.content_cleaner(mens_400m_16)
mens_400m_16.columns = ['rank', 'name', 'mens_400m_16']
mens_400m_16.insert(3, "event_x", "Athletics Men's 400 metres")
mens_400m_16.mens_400m_16 = [str(x) for x in mens_400m_16.mens_400m_16]
mens_400m_16.mens_400m_16 = [x.strip() for x in mens_400m_16.mens_400m_16]
mens_400m_16.name = [x.strip() for x in mens_400m_16.name]


In [2750]:
athlete_df = pd.merge(athlete_df, mens_400m_16[['name', 'mens_400m_16']], how='left', on='name', suffixes=(None, '_400m_mens_rio'))

#### Scraping the results from the men's 800m run 2004-Athens

In [2751]:
athens_800m_men = fn.olympic_query('athens', '2004', '800m-men')
mens_800m_04 = fn.olympic_scraper(athens_800m_men)
mens_800m_04 = fn.content_cleaner(mens_800m_04)
mens_800m_04.columns = ['rank', 'name', 'mens_800m_04']
mens_800m_04.insert(3, "event_x", "Athletics Men's 800 metres")
mens_800m_04.mens_800m_04 = [str(x) for x in mens_800m_04.mens_800m_04]
mens_800m_04.mens_800m_04 = [x.strip() for x in mens_800m_04.mens_800m_04]
mens_800m_04.name = [x.strip() for x in mens_800m_04.name]



In [2752]:
athlete_df = pd.merge(athlete_df, mens_800m_04[['name', 'mens_800m_04']], how='left', on='name', suffixes=(None, '_800m_mens_athens'))

#### Scraping the results from the men's 800m run 2008-Beijing
dropping rows with missing values

In [2753]:
beijing_800m_men = fn.olympic_query('beijing', '2008', '800m-men')
mens_800m_08 = fn.olympic_scraper(beijing_800m_men)
mens_800m_08 = fn.content_cleaner(mens_800m_08)
mens_800m_08.columns = ['rank', 'name', 'mens_800m_08']
mens_800m_08.insert(3, "event_x", "Athletics Men's 800 metres")
mens_800m_08.mens_800m_08 = [str(x) for x in mens_800m_08.mens_800m_08]
mens_800m_08.mens_800m_08 = [x.strip() for x in mens_800m_08.mens_800m_08]
mens_800m_08.name = [x.strip() for x in mens_800m_08.name]
mens_800m_08 = mens_800m_08.drop([114,115,116,130,138,154])


In [2754]:
athlete_df = pd.merge(athlete_df, mens_800m_08[['name', 'mens_800m_08']], how='left', on='name', suffixes=(None, '_800m_mens_beijing'))

#### Scraping the results from the men's 800m run 2012-London
dropping rows with missing values

In [2755]:
london_800m_men = fn.olympic_query('london', '2012', '800m-men')
mens_800m_12 = fn.olympic_scraper(london_800m_men)
mens_800m_12 = fn.content_cleaner(mens_800m_12)
mens_800m_12.columns = ['rank', 'name', 'mens_800m_12']
mens_800m_12.insert(3, "event_x", "Athletics Men's 800 metres")
mens_800m_12.mens_800m_12 = [str(x) for x in mens_800m_12.mens_800m_12]
mens_800m_12.mens_800m_12 = [x.strip() for x in mens_800m_12.mens_800m_12]
mens_800m_12.name = [x.strip() for x in mens_800m_12.name]
mens_800m_12.mens_800m_12 = [x.strip('/ DPG') for x in mens_800m_12.mens_800m_12]
mens_800m_12 = mens_800m_12.drop([32,84,85,86,87])


In [2756]:
athlete_df = pd.merge(athlete_df, mens_800m_12[['name', 'mens_800m_12']], how='left', on='name', suffixes=(None, '_800m_mens_london'))

#### Scraping the results from the men's 800m run 2016-Rio

In [2757]:
rio_800m_men = fn.olympic_query('rio', '2016', '800m-men')
mens_800m_16 = fn.olympic_scraper(rio_800m_men)
mens_800m_16 = fn.content_cleaner(mens_800m_16)
mens_800m_16.columns = ['rank', 'name', 'mens_800m_16']
mens_800m_16.insert(3, "event_x", "Athletics Men's 800 metres")
mens_800m_16.mens_800m_16 = [str(x) for x in mens_800m_16.mens_800m_16]
mens_800m_16.mens_800m_16 = [x.strip() for x in mens_800m_16.mens_800m_16]
mens_800m_16.name = [x.strip() for x in mens_800m_16.name]


In [2758]:
athlete_df = pd.merge(athlete_df, mens_800m_16[['name', 'mens_800m_16']], how='left', on='name', suffixes=(None, '_800m_mens_rio'))

#### Scraping the results from the men's 1500m run 2004-Athens

In [2759]:
athens_1500m_men = fn.olympic_query('athens', '2004', '1500m-men')
mens_1500m_04 = fn.olympic_scraper(athens_1500m_men)
mens_1500m_04 = fn.content_cleaner(mens_1500m_04)
mens_1500m_04.columns = ['rank', 'name', 'mens_1500m_04']
mens_1500m_04.insert(3, "event_x", "Athletics Men's 1500 metres")
mens_1500m_04.mens_1500m_04 = [str(x) for x in mens_1500m_04.mens_1500m_04]
mens_1500m_04.mens_1500m_04 = [x.strip() for x in mens_1500m_04.mens_1500m_04]
mens_1500m_04.name = [x.strip() for x in mens_1500m_04.name]



In [2760]:
athlete_df = pd.merge(athlete_df, mens_1500m_04[['name', 'mens_1500m_04']], how='left', on='name', suffixes=(None, '_1500m_mens_athens'))

#### Scraping the results from the men's 1500m run 2008-Beijing
dropping rows with missing values

In [2761]:
beijing_1500m_men = fn.olympic_query('beijing', '2008', '1500m-men')
mens_1500m_08 = fn.olympic_scraper(beijing_1500m_men)
mens_1500m_08 = fn.content_cleaner(mens_1500m_08)
mens_1500m_08.columns = ['rank', 'name', 'mens_1500m_08']
mens_1500m_08.insert(3, "event_x", "Athletics Men's 1500 metres")
mens_1500m_08.mens_1500m_08 = [str(x) for x in mens_1500m_08.mens_1500m_08]
mens_1500m_08.mens_1500m_08 = [x.strip() for x in mens_1500m_08.mens_1500m_08]
mens_1500m_08.name = [x.strip() for x in mens_1500m_08.name]
mens_1500m_08.mens_1500m_08 = [x.strip('/ DPG') for x in mens_1500m_08.mens_1500m_08]
mens_1500m_08 = mens_1500m_08.drop([107,108,121,134])


In [2762]:
athlete_df = pd.merge(athlete_df, mens_1500m_08[['name', 'mens_1500m_08']], how='left', on='name', suffixes=(None, '_1500m_mens_beijing'))

#### Scraping the results from the men's 1500m run 2012-London
dropping row with missing value

In [2763]:
london_1500m_men = fn.olympic_query('london', '2012', '1500m-men')
mens_1500m_12 = fn.olympic_scraper(london_1500m_men)
mens_1500m_12 = fn.content_cleaner(mens_1500m_12)
mens_1500m_12.columns = ['rank', 'name', 'mens_1500m_12']
mens_1500m_12.insert(3, "event_x", "Athletics Men's 1500 metres")
mens_1500m_12.mens_1500m_12 = [str(x) for x in mens_1500m_12.mens_1500m_12]
mens_1500m_12.mens_1500m_12 = [x.strip() for x in mens_1500m_12.mens_1500m_12]
mens_1500m_12.name = [x.strip() for x in mens_1500m_12.name]
mens_1500m_12.mens_1500m_12 = [x.strip('/ DPG') for x in mens_1500m_12.mens_1500m_12]
mens_1500m_12 = mens_1500m_12.drop([79])


In [2764]:
athlete_df = pd.merge(athlete_df, mens_1500m_12[['name', 'mens_1500m_12']], how='left', on='name', suffixes=(None, '_1500m_mens_london'))

#### Scraping the results from the men's 1500m run 2016-Rio

In [2765]:
rio_1500m_men = fn.olympic_query('rio', '2016', '1500m-men')
mens_1500m_16 = fn.olympic_scraper(rio_1500m_men)
mens_1500m_16 = fn.content_cleaner(mens_1500m_16)
mens_1500m_16.columns = ['rank', 'name', 'mens_1500m_16']
mens_1500m_16.insert(3, "event_x", "Athletics Men's 1500 metres")
mens_1500m_16.mens_1500m_16 = [str(x) for x in mens_1500m_16.mens_1500m_16]
mens_1500m_16.mens_1500m_16 = [x.strip() for x in mens_1500m_16.mens_1500m_16]
mens_1500m_16.name = [x.strip() for x in mens_1500m_16.name]


In [2766]:
athlete_df = pd.merge(athlete_df, mens_1500m_16[['name', 'mens_1500m_16']], how='left', on='name', suffixes=(None, '_1500m_mens_rio'))

#### Scraping the results from the men's 5000m run 2004-Athens
dropping row with missing value

In [2767]:
athens_5000m_men = fn.olympic_query('athens', '2004', '5000m-men')
mens_5000m_04 = fn.olympic_scraper(athens_5000m_men)
mens_5000m_04 = fn.content_cleaner(mens_5000m_04)
mens_5000m_04.columns = ['rank', 'name', 'mens_5000m_04']
mens_5000m_04.insert(3, "event_x", "Athletics Men's 5000 metres")
mens_5000m_04.mens_5000m_04 = [str(x) for x in mens_5000m_04.mens_5000m_04]
mens_5000m_04.mens_5000m_04 = [x.strip() for x in mens_5000m_04.mens_5000m_04]
mens_5000m_04.name = [x.strip() for x in mens_5000m_04.name]
mens_5000m_04 = mens_5000m_04.drop(35)


In [2768]:
athlete_df = pd.merge(athlete_df, mens_5000m_04[['name', 'mens_5000m_04']], how='left', on='name', suffixes=(None, '_5000m_mens_athens'))

#### Scraping the results from the men's 5000m run 2008-Beijing
dropping rows with missing values

In [2769]:
beijing_5000m_men = fn.olympic_query('beijing', '2008', '5000m-men')
mens_5000m_08 = fn.olympic_scraper(beijing_5000m_men)
mens_5000m_08 = fn.content_cleaner(mens_5000m_08)
mens_5000m_08.columns = ['rank', 'name', 'mens_5000m_08']
mens_5000m_08.insert(3, "event_x", "Athletics Men's 5000 metres")
mens_5000m_08.mens_5000m_08 = [str(x) for x in mens_5000m_08.mens_5000m_08]
mens_5000m_08.mens_5000m_08 = [x.strip() for x in mens_5000m_08.mens_5000m_08]
mens_5000m_08.name = [x.strip() for x in mens_5000m_08.name]
mens_5000m_08.mens_5000m_08 = [x.strip('/ DPG') for x in mens_5000m_08.mens_5000m_08]
mens_5000m_08 = mens_5000m_08.drop([14,54,55,56,57,85,98,99,100])


In [2770]:
athlete_df = pd.merge(athlete_df, mens_5000m_08[['name', 'mens_5000m_08']], how='left', on='name', suffixes=(None, '_5000m_mens_beijing'))

#### Scraping the results from the men's 5000m run 2012-London
dropping row with missing value

In [2771]:
london_5000m_men = fn.olympic_query('london', '2012', '5000m-men')
mens_5000m_12 = fn.olympic_scraper(london_5000m_men)
mens_5000m_12 = fn.content_cleaner(mens_5000m_12)
mens_5000m_12.columns = ['rank', 'name', 'mens_5000m_12']
mens_5000m_12.insert(3, "event_x", "Athletics Men's 5000 metres")
mens_5000m_12.mens_5000m_12 = [str(x) for x in mens_5000m_12.mens_5000m_12]
mens_5000m_12.mens_5000m_12 = [x.strip() for x in mens_5000m_12.mens_5000m_12]
mens_5000m_12.name = [x.strip() for x in mens_5000m_12.name]
mens_5000m_12.mens_5000m_12 = [x.strip('/ DPG') for x in mens_5000m_12.mens_5000m_12]
mens_5000m_12 = mens_5000m_12.drop([56])


In [2772]:
athlete_df = pd.merge(athlete_df, mens_5000m_12[['name', 'mens_5000m_12']], how='left', on='name', suffixes=(None, '_5000m_mens_london'))

#### Scraping the results from the men's 5000m run 2016-Rio
dropping rows with missing values

In [2773]:
rio_5000m_men = fn.olympic_query('rio', '2016', '5000m-men')
mens_5000m_16 = fn.olympic_scraper(rio_5000m_men)
mens_5000m_16 = fn.content_cleaner(mens_5000m_16)
mens_5000m_16.columns = ['rank', 'name', 'mens_5000m_16']
mens_5000m_16.insert(3, "event_x", "Athletics Men's 5000 metres")
mens_5000m_16.mens_5000m_16 = [str(x) for x in mens_5000m_16.mens_5000m_16]
mens_5000m_16.mens_5000m_16 = [x.strip() for x in mens_5000m_16.mens_5000m_16]
mens_5000m_16.name = [x.strip() for x in mens_5000m_16.name]
mens_5000m_16.mens_5000m_16 = [x.strip('DPG/') for x in mens_5000m_16.mens_5000m_16]
mens_5000m_16 = mens_5000m_16.drop([15,66])


In [2774]:
athlete_df = pd.merge(athlete_df, mens_5000m_16[['name', 'mens_5000m_16']], how='left', on='name', suffixes=(None, '_5000m_mens_rio'))

#### Scraping the results from the men's 110m hurdles 2004-Athens
dropping row with missing value

In [2775]:
athens_110H_men = fn.olympic_query('athens', '2004', '110m-hurdles-men')
mens_110H_04 = fn.olympic_scraper(athens_110H_men)
mens_110H_04 = fn.content_cleaner(mens_110H_04)
mens_110H_04.columns = ['rank', 'name', 'mens_110H_04']
mens_110H_04.insert(3, "event_x", "Athletics Men's 110m Hurdles")
mens_110H_04.mens_110H_04 = [str(x) for x in mens_110H_04.mens_110H_04]
mens_110H_04.mens_110H_04 = [x.strip() for x in mens_110H_04.mens_110H_04]
mens_110H_04.mens_110H_04 = [x.strip('/ DPG') for x in mens_110H_04.mens_110H_04]
mens_110H_04.name = [x.strip() for x in mens_110H_04.name]
mens_110H_04 = mens_110H_04.drop(15)


In [2776]:
athlete_df = pd.merge(athlete_df, mens_110H_04[['name', 'mens_110H_04']], how='left', on='name', suffixes=(None, '_110H_mens_athens'))

#### Scraping the results from the men's 110m hurdles 2008-Beijing
dropping rows with missing values

In [2777]:
beijing_110H_men = fn.olympic_query('beijing', '2008', '110m-hurdles-men')
mens_110H_08 = fn.olympic_scraper(beijing_110H_men)
mens_110H_08 = fn.content_cleaner(mens_110H_08)
mens_110H_08.columns = ['rank', 'name', 'mens_110H_08']
mens_110H_08.insert(3, "event_x", "Athletics Men's 110m Hurdles")
mens_110H_08.mens_110H_08 = [str(x) for x in mens_110H_08.mens_110H_08]
mens_110H_08.mens_110H_08 = [x[:7] for x in mens_110H_08.mens_110H_08]
mens_110H_08.mens_110H_08 = [x.strip() for x in mens_110H_08.mens_110H_08]
mens_110H_08.mens_110H_08 = [x.strip('/ DPG') for x in mens_110H_08.mens_110H_08]
mens_110H_08.name = [x.strip() for x in mens_110H_08.name]
mens_110H_08 = mens_110H_08.drop([70,71,95,103,144,
                                 145,146,181,182,189])


In [2778]:
athlete_df = pd.merge(athlete_df, mens_110H_08[['name', 'mens_110H_08']], how='left', on='name', suffixes=(None, '_110H_mens_beijing'))

#### Scraping the results from the men's 110m hurdles 2012-London
dropping rows with missing values

In [2779]:
london_110H_men = fn.olympic_query('london', '2012', '110m-hurdles-men')
mens_110H_12 = fn.olympic_scraper(london_110H_men)
mens_110H_12 = fn.content_cleaner(mens_110H_12)
mens_110H_12.columns = ['rank', 'name', 'mens_110H_12']
mens_110H_12.insert(3, "event_x", "Athletics Men's 110m Hurdles")
mens_110H_12.mens_110H_12 = [str(x) for x in mens_110H_12.mens_110H_12]
mens_110H_12.mens_110H_12 = [x[:7] for x in mens_110H_12.mens_110H_12]
mens_110H_12.mens_110H_12 = [x.strip() for x in mens_110H_12.mens_110H_12]
mens_110H_12.mens_110H_12 = [x.strip('/ DPG') for x in mens_110H_12.mens_110H_12]
mens_110H_12.name = [x.strip() for x in mens_110H_12.name]
mens_110H_12 = mens_110H_12.drop([7,31,32,78,79,80,81,
                                 82,83,84,85,86])


In [2780]:
athlete_df = pd.merge(athlete_df, mens_110H_12[['name', 'mens_110H_12']], how='left', on='name', suffixes=(None, '_110H_mens_london'))

#### Scraping the results from the men's 110m hurdles 2016-Rio
dropping row with missing value

In [2781]:
rio_110H_men = fn.olympic_query('rio', '2016', '110m-hurdles-men')
mens_110H_16 = fn.olympic_scraper(rio_110H_men)
mens_110H_16 = fn.content_cleaner(mens_110H_16)
mens_110H_16.columns = ['rank', 'name', 'mens_110H_16']
mens_110H_16.insert(3, "event_x", "Athletics Men's 110m Hurdles")
mens_110H_16.mens_110H_16 = [str(x) for x in mens_110H_16.mens_110H_16]
mens_110H_16.mens_110H_16 = [x[:7] for x in mens_110H_16.mens_110H_16]
mens_110H_16.mens_110H_16 = [x.strip() for x in mens_110H_16.mens_110H_16]
mens_110H_16.mens_110H_16 = [x.strip('/ DPG') for x in mens_110H_16.mens_110H_16]
mens_110H_16.name = [x.strip() for x in mens_110H_16.name]
mens_110H_16 = mens_110H_16.drop(7)


In [2782]:
athlete_df = pd.merge(athlete_df, mens_110H_16[['name', 'mens_110H_16']], how='left', on='name', suffixes=(None, '_110H_mens_rio'))

#### Scraping the results from the men's 400m hurdles 2004-Athens

In [2783]:
athens_400H_men = fn.olympic_query('athens', '2004', '400m-hurdles-men')
mens_400H_04 = fn.olympic_scraper(athens_400H_men)
mens_400H_04 = fn.content_cleaner(mens_400H_04)
mens_400H_04.columns = ['rank', 'name', 'mens_400H_04']
mens_400H_04.insert(3, "event_x", "Athletics Men's 400m Hurdles")
mens_400H_04.mens_400H_04 = [str(x) for x in mens_400H_04.mens_400H_04]
mens_400H_04.mens_400H_04 = [x.strip() for x in mens_400H_04.mens_400H_04]
mens_400H_04.mens_400H_04 = [x.strip('/ DPG') for x in mens_400H_04.mens_400H_04]
mens_400H_04.name = [x.strip() for x in mens_400H_04.name]


In [2784]:
athlete_df = pd.merge(athlete_df, mens_400H_04[['name', 'mens_400H_04']], how='left', on='name', suffixes=(None, '_400H_mens_athens'))

#### Scraping the results from the men's 400m hurdles 2008-Beijing
dropping rows with missing values

In [2785]:
beijing_400H_men = fn.olympic_query('beijing', '2008', '400m-hurdles-men')
mens_400H_08 = fn.olympic_scraper(beijing_400H_men)
mens_400H_08 = fn.content_cleaner(mens_400H_08)
mens_400H_08.columns = ['rank', 'name', 'mens_400H_08']
mens_400H_08.insert(3, "event_x", "Athletics Men's 400m Hurdles")
mens_400H_08.mens_400H_08 = [str(x) for x in mens_400H_08.mens_400H_08]
mens_400H_08.mens_400H_08 = [x[:7] for x in mens_400H_08.mens_400H_08]
mens_400H_08.mens_400H_08 = [x.strip() for x in mens_400H_08.mens_400H_08]
mens_400H_08.mens_400H_08 = [x.strip('/ DPG') for x in mens_400H_08.mens_400H_08]
mens_400H_08.name = [x.strip() for x in mens_400H_08.name]
mens_400H_08 = mens_400H_08.drop([65,91])


In [2786]:
athlete_df = pd.merge(athlete_df, mens_400H_08[['name', 'mens_400H_08']], how='left', on='name', suffixes=(None, '_400H_mens_beijing'))

#### Scraping the results from the men's 400m hurdles 2012-London
dropping rows with missing values

In [2787]:
london_400H_men = fn.olympic_query('london', '2012', '400m-hurdles-men')
mens_400H_12 = fn.olympic_scraper(london_400H_men)
mens_400H_12 = fn.content_cleaner(mens_400H_12)
mens_400H_12.columns = ['rank', 'name', 'mens_400H_12']
mens_400H_12.insert(3, "event_x", "Athletics Men's 400m Hurdles")
mens_400H_12.mens_400H_12 = [str(x) for x in mens_400H_12.mens_400H_12]
mens_400H_12.mens_400H_12 = [x[:7] for x in mens_400H_12.mens_400H_12]
mens_400H_12.mens_400H_12 = [x.strip() for x in mens_400H_12.mens_400H_12]
mens_400H_12.mens_400H_12 = [x.strip('/ DPG') for x in mens_400H_12.mens_400H_12]
mens_400H_12.name = [x.strip() for x in mens_400H_12.name]
mens_400H_12 = mens_400H_12.drop([30,31,78,79,80,81])


In [2788]:
athlete_df = pd.merge(athlete_df, mens_400H_12[['name', 'mens_400H_12']], how='left', on='name', suffixes=(None, '_400H_mens_london'))

#### Scraping the results from the men's 400m hurdles 2016-Rio
dropping row with missing value

In [2789]:
rio_400H_men = fn.olympic_query('rio', '2016', '400m-hurdles-men')
mens_400H_16 = fn.olympic_scraper(rio_400H_men)
mens_400H_16 = fn.content_cleaner(mens_400H_16)
mens_400H_16.columns = ['rank', 'name', 'mens_400H_16']
mens_400H_16.insert(3, "event_x", "Athletics Men's 400m Hurdles")
mens_400H_16.mens_400H_16 = [str(x) for x in mens_400H_16.mens_400H_16]
mens_400H_16.mens_400H_16 = [x[:7] for x in mens_400H_16.mens_400H_16]
mens_400H_16.mens_400H_16 = [x.strip() for x in mens_400H_16.mens_400H_16]
mens_400H_16.mens_400H_16 = [x.strip('/ DPG') for x in mens_400H_16.mens_400H_16]
mens_400H_16.name = [x.strip() for x in mens_400H_16.name]
mens_400H_16 = mens_400H_16.drop(7)

In [2790]:
athlete_df = pd.merge(athlete_df, mens_400H_16[['name', 'mens_400H_16']], how='left', on='name', suffixes=(None, '_400H_mens_rio'))

#### Scraping the results from the men's 10000m run 2004-Athens

In [2791]:
athens_10000m_men = fn.olympic_query('athens', '2004', '10000m-men')
mens_10000m_04 = fn.olympic_scraper(athens_10000m_men)
mens_10000m_04 = fn.content_cleaner(mens_10000m_04)
mens_10000m_04.columns = ['rank', 'name', 'mens_10000m_04']
mens_10000m_04.insert(3, "event_x", "Athletics Men's 10000 metres")
mens_10000m_04.mens_10000m_04 = [str(x) for x in mens_10000m_04.mens_10000m_04]
mens_10000m_04.mens_10000m_04 = [x.strip() for x in mens_10000m_04.mens_10000m_04]
mens_10000m_04.name = [x.strip() for x in mens_10000m_04.name]


In [2792]:
athlete_df = pd.merge(athlete_df, mens_10000m_04[['name', 'mens_10000m_04']], how='left', on='name', suffixes=(None, '_10000m_mens_athens'))

#### Scraping the results from the men's 10000m run 2008-Beijing
dropping rows with missing values

In [2793]:
beijing_10000m_men = fn.olympic_query('beijing', '2008', '10000m-men')
mens_10000m_08 = fn.olympic_scraper(beijing_10000m_men)
mens_10000m_08 = fn.content_cleaner(mens_10000m_08)
mens_10000m_08.columns = ['rank', 'name', 'mens_10000m_08']
mens_10000m_08.insert(3, "event_x", "Athletics Men's 10000 metres")
mens_10000m_08.mens_10000m_08 = [str(x) for x in mens_10000m_08.mens_10000m_08]
mens_10000m_08.mens_10000m_08 = [x.strip() for x in mens_10000m_08.mens_10000m_08]
mens_10000m_08.name = [x.strip() for x in mens_10000m_08.name]
mens_10000m_08.mens_10000m_08 = [x.strip('/ DPG') for x in mens_10000m_08.mens_10000m_08]
mens_10000m_08 = mens_10000m_08.drop([35,36,37,38])


In [2794]:
athlete_df = pd.merge(athlete_df, mens_10000m_08[['name', 'mens_10000m_08']], how='left', on='name', suffixes=(None, '_10000m_mens_beijing'))

#### Scraping the results from the men's 10000m run 2012-London
dropping rows with missing values

In [2795]:
london_10000m_men = fn.olympic_query('london', '2012', '10000m-men')
mens_10000m_12 = fn.olympic_scraper(london_10000m_men)
mens_10000m_12 = fn.content_cleaner(mens_10000m_12)
mens_10000m_12.columns = ['rank', 'name', 'mens_10000m_12']
mens_10000m_12.insert(3, "event_x", "Athletics Men's 10000 metres")
mens_10000m_12.mens_10000m_12 = [str(x) for x in mens_10000m_12.mens_10000m_12]
mens_10000m_12.mens_10000m_12 = [x.strip() for x in mens_10000m_12.mens_10000m_12]
mens_10000m_12.name = [x.strip() for x in mens_10000m_12.name]
mens_10000m_12 = mens_10000m_12.drop([26,27,28])


In [2796]:
athlete_df = pd.merge(athlete_df, mens_10000m_12[['name', 'mens_10000m_12']], how='left', on='name', suffixes=(None, '_10000m_mens_london'))

#### Scraping the results from the men's 10000m run 2016-Rio
dropping rows with missing values

In [2797]:
rio_10000m_men = fn.olympic_query('rio', '2016', '10000m-men')
mens_10000m_16 = fn.olympic_scraper(rio_10000m_men)
mens_10000m_16 = fn.content_cleaner(mens_10000m_16)
mens_10000m_16.columns = ['rank', 'name', 'mens_10000m_16']
mens_10000m_16.insert(3, "event_x", "Athletics Men's 10000 metres")
mens_10000m_16.mens_10000m_16 = [str(x) for x in mens_10000m_16.mens_10000m_16]
mens_10000m_16.mens_10000m_16 = [x.strip() for x in mens_10000m_16.mens_10000m_16]
mens_10000m_16.name = [x.strip() for x in mens_10000m_16.name]
mens_10000m_16 = mens_10000m_16.drop([32,33])


In [2798]:
athlete_df = pd.merge(athlete_df, mens_10000m_16[['name', 'mens_10000m_16']], how='left', on='name', suffixes=(None, '_10000m_mens_rio'))

#### Scraping the results from the men's 20km walk 2004-Athens


In [2799]:
athens_mens_20km = fn.olympic_query('athens', '2004', '20km-walk-men')
mens_20km = fn.olympic_scraper(athens_mens_20km)
mens_20km = fn.content_cleaner(mens_20km)
mens_20km.columns = ['rank', 'name', 'mens_20km']
mens_20km.insert(3, "event_x", "Athletics Men's 20km walk")
mens_20km.mens_20km = [str(x) for x in mens_20km.mens_20km]
mens_20km.mens_20km = [x.strip() for x in mens_20km.mens_20km]
mens_20km.mens_20km = [x.strip('/ DPG') for x in mens_20km.mens_20km]
mens_20km.mens_20km = [x.replace('h', '.') for x in mens_20km.mens_20km]
mens_20km.mens_20km = [x.replace(':', '.') for x in mens_20km.mens_20km]
mens_20km.name = [x.strip() for x in mens_20km.name]


In [2800]:
athlete_df = pd.merge(athlete_df, mens_20km[['name', 'mens_20km']], how='left', on='name', suffixes=(None, '_20km_mens_athens'))

#### Scraping the results from the men's 20km walk 2008-Beijing
dropping rows with missing values

In [2801]:
beijing_mens_20km = fn.olympic_query('beijing', '2008', '20km-walk-men')
mens_20km_08 = fn.olympic_scraper(beijing_mens_20km)
mens_20km_08 = fn.content_cleaner(mens_20km_08)
mens_20km_08.columns = ['rank', 'name', 'mens_20km_08']
mens_20km_08.insert(3, "event_x", "Athletics Men's 20km walk")
mens_20km_08.mens_20km_08 = [str(x) for x in mens_20km_08.mens_20km_08]
mens_20km_08.mens_20km_08 = [x.strip() for x in mens_20km_08.mens_20km_08]
mens_20km_08.mens_20km_08 = [x.strip('/ DPG') for x in mens_20km_08.mens_20km_08]
mens_20km_08.mens_20km_08 = [x.replace('h', '.') for x in mens_20km_08.mens_20km_08]
mens_20km_08.mens_20km_08 = [x.replace(':', '.') for x in mens_20km_08.mens_20km_08]
mens_20km_08.name = [x.strip() for x in mens_20km_08.name]
mens_20km_08 = mens_20km_08.drop([49,50])


In [2802]:
athlete_df = pd.merge(athlete_df, mens_20km_08[['name', 'mens_20km_08']], how='left', on='name', suffixes=(None, '_20km_mens_beijing'))

#### Scraping the results from the men's 20km walk 2012-London
dropping rows with missing values

In [2803]:
london_mens_20km = fn.olympic_query('london', '2012', '20km-walk-men')
mens_20km_12 = fn.olympic_scraper(london_mens_20km)
mens_20km_12 = fn.content_cleaner(mens_20km_12)
mens_20km_12.columns = ['rank', 'name', 'mens_20km_12']
mens_20km_12.insert(3, "event_x", "Athletics Men's 20km walk")
mens_20km_12.mens_20km_12 = [str(x) for x in mens_20km_12.mens_20km_12]
mens_20km_12.mens_20km_12 = [x.strip() for x in mens_20km_12.mens_20km_12]
mens_20km_12.mens_20km_12 = [x.strip('/ DPG') for x in mens_20km_12.mens_20km_12]
mens_20km_12.mens_20km_12 = [x.replace('h', '.') for x in mens_20km_12.mens_20km_12]
mens_20km_12.mens_20km_12 = [x.replace(':', '.') for x in mens_20km_12.mens_20km_12]
mens_20km_12.name = [x.strip() for x in mens_20km_12.name]
mens_20km_12 = mens_20km_12.drop([48,49,50,51,52,53,54,55])


In [2804]:
athlete_df = pd.merge(athlete_df, mens_20km_12[['name', 'mens_20km_12']], how='left', on='name', suffixes=(None, '_20km_mens_london'))

#### Scraping the results from the men's 20km walk 2016-Rio
dropping rows with missing values

In [2805]:
rio_mens_20km = fn.olympic_query('rio', '2016', '20km-walk-men')
mens_20km_16 = fn.olympic_scraper(rio_mens_20km)
mens_20km_16 = fn.content_cleaner(mens_20km_16)
mens_20km_16.columns = ['rank', 'name', 'mens_20km_16']
mens_20km_16.insert(3, "event_x", "Athletics Men's 20km walk")
mens_20km_16.mens_20km_16 = [str(x) for x in mens_20km_16.mens_20km_16]
mens_20km_16.mens_20km_16 = [x.strip() for x in mens_20km_16.mens_20km_16]
mens_20km_16.mens_20km_16 = [x.strip('/ DPG') for x in mens_20km_16.mens_20km_16]
mens_20km_16.mens_20km_16 = [x.replace('h', '.') for x in mens_20km_16.mens_20km_16]
mens_20km_16.mens_20km_16 = [x.replace(':', '.') for x in mens_20km_16.mens_20km_16]
mens_20km_16.name = [x.strip() for x in mens_20km_16.name]
mens_20km_16 = mens_20km_16.drop(range(63,74))


In [2806]:
athlete_df = pd.merge(athlete_df, mens_20km_16[['name', 'mens_20km_16']], how='left', on='name', suffixes=(None, '_20km_mens_rio'))

#### Scraping the results from the men's 50km walk 2004-Athens


In [2807]:
athens_mens_50km = fn.olympic_query('athens', '2004', '50km-walk-men')
mens_50km_04 = fn.olympic_scraper(athens_mens_50km)
mens_50km_04 = fn.content_cleaner(mens_50km_04)
mens_50km_04.columns = ['rank', 'name', 'mens_50km_04']
mens_50km_04.insert(3, "event_x", "Athletics Men's 50km Walk")
mens_50km_04.mens_50km_04 = [str(x) for x in mens_50km_04.mens_50km_04]
mens_50km_04.mens_50km_04 = [x.strip() for x in mens_50km_04.mens_50km_04]
mens_50km_04.mens_50km_04 = [x.strip('/ DPG') for x in mens_50km_04.mens_50km_04]
mens_50km_04.mens_50km_04 = [x.replace('h', '.') for x in mens_50km_04.mens_50km_04]
mens_50km_04.mens_50km_04 = [x.replace(':', '.') for x in mens_50km_04.mens_50km_04]
mens_50km_04.name = [x.strip() for x in mens_50km_04.name]


In [2808]:
athlete_df = pd.merge(athlete_df, mens_50km_04[['name', 'mens_50km_04']], how='left', on='name', suffixes=(None, '_50km_mens_athens'))

#### Scraping the results from the men's 50km walk 2008-Beijing
dropping rows with missing values

In [2809]:
beijing_mens_50km = fn.olympic_query('beijing', '2008', '50km-walk-men')
mens_50km_08 = fn.olympic_scraper(beijing_mens_50km)
mens_50km_08 = fn.content_cleaner(mens_50km_08)
mens_50km_08.columns = ['rank', 'name', 'mens_50km_08']
mens_50km_08.insert(3, "event_x", "Athletics Men's 50km walk")
mens_50km_08.mens_50km_08 = [str(x) for x in mens_50km_08.mens_50km_08]
mens_50km_08.mens_50km_08 = [x.strip() for x in mens_50km_08.mens_50km_08]
mens_50km_08.mens_50km_08 = [x.strip('/ DPG') for x in mens_50km_08.mens_50km_08]
mens_50km_08.mens_50km_08 = [x.replace('h', '.') for x in mens_50km_08.mens_50km_08]
mens_50km_08.mens_50km_08 = [x.replace(':', '.') for x in mens_50km_08.mens_50km_08]
mens_50km_08.name = [x.strip() for x in mens_50km_08.name]
mens_50km_08 = mens_50km_08.drop(range(47,61))


In [2810]:
athlete_df = pd.merge(athlete_df, mens_50km_08[['name', 'mens_50km_08']], how='left', on='name', suffixes=(None, '_50km_mens_beijing'))

#### Scraping the results from the men's 50km walk 2012-London
dropping rows with missing values

In [2811]:
london_mens_50km = fn.olympic_query('london', '2012', '50km-walk-men')
mens_50km_12 = fn.olympic_scraper(london_mens_50km)
mens_50km_12 = fn.content_cleaner(mens_50km_12)
mens_50km_12.columns = ['rank', 'name', 'mens_50km_12']
mens_50km_12.insert(3, "event_x", "Athletics Men's 50km walk")
mens_50km_12.mens_50km_12 = [str(x) for x in mens_50km_12.mens_50km_12]
mens_50km_12.mens_50km_12 = [x.strip() for x in mens_50km_12.mens_50km_12]
mens_50km_12.mens_50km_12 = [x.strip('/ DPG') for x in mens_50km_12.mens_50km_12]
mens_50km_12.mens_50km_12 = [x.replace('h', '.') for x in mens_50km_12.mens_50km_12]
mens_50km_12.mens_50km_12 = [x.replace(':', '.') for x in mens_50km_12.mens_50km_12]
mens_50km_12.name = [x.strip() for x in mens_50km_12.name]
mens_50km_12 = mens_50km_12.drop(range(48,60))


In [2812]:
athlete_df = pd.merge(athlete_df, mens_50km_12[['name', 'mens_50km_12']], how='left', on='name', suffixes=(None, '_50km_mens_london'))

#### Scraping the results from the men's 50km walk 2016-Rio
dropping rows with missing values

In [2813]:
rio_mens_50km = fn.olympic_query('rio', '2016', '50km-walk-men')
mens_50km_16 = fn.olympic_scraper(rio_mens_50km)
mens_50km_16 = fn.content_cleaner(mens_50km_16)
mens_50km_16.columns = ['rank', 'name', 'mens_50km_16']
mens_50km_16.insert(3, "event_x", "Athletics Men's 50km walk")
mens_50km_16.mens_50km_16 = [str(x) for x in mens_50km_16.mens_50km_16]
mens_50km_16.mens_50km_16 = [x.strip() for x in mens_50km_16.mens_50km_16]
mens_50km_16.mens_50km_16 = [x.strip('/ DPG') for x in mens_50km_16.mens_50km_16]
mens_50km_16.mens_50km_16 = [x.replace('h', '.') for x in mens_50km_16.mens_50km_16]
mens_50km_16.mens_50km_16 = [x.replace(':', '.') for x in mens_50km_16.mens_50km_16]
mens_50km_16.name = [x.strip() for x in mens_50km_16.name]
mens_50km_16 = mens_50km_16.drop(range(49,80))

In [2814]:
athlete_df = pd.merge(athlete_df, mens_50km_16[['name', 'mens_50km_16']], how='left', on='name', suffixes=(None, '_50km_mens_Rio'))
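
The cleaning steps repeated in each cell above could be gathered into one function. The sketch below is only an illustration: `clean_results` is a hypothetical helper that is not part of `fn`, and the toy table stands in for the scraper output with made-up values. It assumes the scraped table has exactly three columns (rank, name, result).

```python
import pandas as pd

def clean_results(raw: pd.DataFrame, result_col: str, walk: bool = False) -> pd.DataFrame:
    """Hypothetical helper: tidy a three-column (rank, name, result) table."""
    out = raw.copy()
    out.columns = ['rank', 'name', result_col]
    out['name'] = out['name'].astype(str).str.strip()
    result = out[result_col].astype(str).str.strip()
    if walk:
        # Walk results look like '1h19:40'; drop trailing flag characters
        # and normalise both separators to dots.
        result = (result.str.strip('/ DPG')
                        .str.replace('h', '.', regex=False)
                        .str.replace(':', '.', regex=False))
    out[result_col] = result
    return out

# Toy stand-in for the scraper output (made-up values, not real results).
raw = pd.DataFrame({0: ['1.', '2.'],
                    1: [' Athlete A ', 'Athlete B\n'],
                    2: [' 1h19:40 ', '1h20:02 / DPG']})
cleaned = clean_results(raw, 'mens_20km_04', walk=True)
print(cleaned)
```

Each event cell would then reduce to a query, a scrape, one call to the helper, and the merge.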

#### These are the results that I will use for this model until I can get more results from Olympedia.org. Their server is currently down and I do not have the event results for the women.

## Preparing dataframe for modeling

#### Removing the duplicate rows from the dataframe. A few athletes appear more than once because they competed in more than one event. Since I matched the event results above on the name column, some athletes had results added twice. I can remove the duplicates here because each result time is stored in its own column.

In [2815]:
athlete_df.duplicated().sum()

1204

In [2816]:
athlete_df = athlete_df.drop_duplicates()
athlete_df = athlete_df.drop_duplicates('name')
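
To see why merging on name can add rows, here is a minimal sketch with made-up names and times. It assumes the extra rows come from a name appearing more than once in a results table, which is one way a left merge can grow the dataframe.

```python
import pandas as pd

athletes = pd.DataFrame({'name': ['Athlete A', 'Athlete B']})
# Made-up results table in which one name appears twice.
results = pd.DataFrame({'name': ['Athlete A', 'Athlete A', 'Athlete B'],
                        'mens_800m_04': ['1.45.00', '1.45.00', '1.46.10']})

# The left merge emits one row per matching pair, so 'Athlete A' is repeated.
merged = pd.merge(athletes, results, how='left', on='name')
print(len(athletes), len(merged))

# De-duplicating the right-hand table first keeps one row per athlete.
safe = pd.merge(athletes, results.drop_duplicates('name'), how='left', on='name')
print(len(safe))
```

Dropping duplicates after all the merges, as done above, reaches the same end state as long as the repeated rows are identical.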

#### Removing the missing values in the era column so I can convert the values to integers.

In [2817]:
athlete_df[athlete_df['era'] == ''].index

Int64Index([627, 628, 2556, 2608, 3833], dtype='int64')

In [2818]:
athlete_df = athlete_df.drop([627, 628, 2556, 2608, 3833])

#### Setting the era column to the first year shown in each value. Some athletes have a range while others only have a start year, possibly because many of the athletes are still competing. Since we know that these athletes competed in the Summer Olympic Games, we can keep the first year value.

In [2819]:
athlete_df.era = [x[:4] for x in athlete_df.era]
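
As a small illustration of the slicing step, the sketch below uses made-up era strings and assumes each value starts with a four-digit year. The integer conversion at the end is shown only as one possible follow-up.

```python
import pandas as pd

# Made-up era values: some are ranges, some only a start year.
era = pd.Series(['2004-2012', '2016', '2008-'])

# Keep the first four characters (the start year), then convert to integers.
first_year = era.str[:4].astype(int)
print(first_year.tolist())
```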

#### Defining the columns that have NA values since the athletes in those rows did not compete in that event, I am going to fill the missing values with 0 for now. I plan on removing the athletes with no results from the dataframe.

#### Subsetting the dataframe to only the event results. I am going to insert a column that counts the number of results per athlete. If there are no results I will drop the athlete from the dataframe for now. Since I only have the men's results I am expecting to drop a significant number of rows.

In [2820]:
cols = list(athlete_df.columns[6:])

In [2821]:
results_df = athlete_df[cols]

In [2822]:
results_df.insert(44, 'events', 0)

The cell below may raise a 'setting with copy' warning because results_df was created as a slice of athlete_df, but if you run it twice it works. I am counting the non-null values across the columns of each row and subtracting 1 (for the events column itself) to see which athletes need to be dropped for now.

In [2823]:
results_df.events = results_df[list(results_df.columns)].count(axis=1)-1
results_df.events.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


0    4105
1     477
2     102
3      24
4       4
6       1
5       1
Name: events, dtype: int64
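
One common way to avoid the warning is to take an explicit copy of the subset before adding a column. The sketch below uses a made-up three-row frame rather than the real data. The column names are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up frame: NaN means the athlete has no result for that event.
df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'mens_100m_04': ['10.01', np.nan, np.nan],
                   'mens_200m_04': ['20.10', '20.35', np.nan]})
result_cols = ['mens_100m_04', 'mens_200m_04']

# .copy() makes the subset an independent frame, so the assignment is unambiguous.
results = df[result_cols].copy()

# count(axis=1) returns the number of non-null values in each row.
results['events'] = results[result_cols].count(axis=1)
print(results['events'].tolist())  # [2, 1, 0]
```

Because the count is taken over the result columns only, there is no need to subtract 1 for the events column.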

#### Setting the index of the results_df dataframe with the index of the athlete_df dataframe so I can merge them on the name values

In [2824]:
results_df.index = athlete_df.name
results_df.reset_index(inplace=True)

In [2825]:
len(athlete_df)

4714

In [2826]:
athlete_df = pd.merge(athlete_df, results_df[['name', 'events']], how='left', on='name')
athlete_df.events.value_counts()

0    4105
1     477
2     102
3      24
4       4
6       1
5       1
Name: events, dtype: int64

In [2827]:
# Convert every result column to stripped strings (NaN becomes the string 'nan').
for col in cols:
    athlete_df[col] = [str(v) for v in athlete_df[col]]
    athlete_df[col] = [v.strip('\r\n') for v in athlete_df[col]]
    athlete_df[col] = [v.strip() for v in athlete_df[col]]


#### Defining the events whose times are in the format minutes:seconds.fraction so I can convert the values to seconds. This is done in two for loops because columns with times in other formats sit between these two groups of columns.

In [2828]:
to_seconds_1 = athlete_df.columns[18:30]

In [2829]:
to_seconds_2 = athlete_df.columns[38:50]

In [2830]:
for x in athlete_df[to_seconds_1]:
    athlete_df[x] = [x.replace('nan', '0.00.00') for x in athlete_df[x]]


In [2831]:
for x in athlete_df[to_seconds_2]:
    athlete_df[x] = [x.replace('nan', '0.00.00') for x in athlete_df[x]]


#### Converting the times to seconds

In [2832]:
val_to_replace = athlete_df.mens_800m_04[325][1]

In [2833]:
for x in athlete_df[to_seconds_1]:
    athlete_df[x] = [x.replace(val_to_replace, '.') for x in athlete_df[x]]

In [2834]:
for x in athlete_df[to_seconds_2]:
    athlete_df[x] = [x.replace(val_to_replace, '.') for x in athlete_df[x]]

In [2835]:
for x in athlete_df[to_seconds_1]:
    # a = minutes, b = whole seconds, c = decimal part of the seconds
    athlete_df[x] = [int(a) * 60 + float(f'{b}.{c}') for a, b, c in athlete_df[x].str.split('.')]

In [2836]:
for x in athlete_df[to_seconds_2]:
    # a = minutes, b = whole seconds, c = decimal part of the seconds
    athlete_df[x] = [int(a) * 60 + float(f'{b}.{c}') for a, b, c in athlete_df[x].str.split('.')]
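
The conversion can be sketched as a small standalone function. This is a sketch under one assumption: the last field is the decimal part of the seconds (for example hundredths), so '1.44.45' means 1 minute and 44.45 seconds. `to_seconds` is a hypothetical name and the sample times are made up.

```python
def to_seconds(value: str) -> float:
    """Convert a 'minutes.seconds.fraction' string to seconds."""
    minutes, seconds, fraction = value.split('.')
    # Rebuild 'seconds.fraction' so the fraction keeps its place value
    # regardless of how many digits it has.
    return int(minutes) * 60 + float(f'{seconds}.{fraction}')

samples = ['1.44.45', '27.05.10', '0.00.00']
converted = [to_seconds(s) for s in samples]
print(converted)
```

Rebuilding the decimal string avoids having to know whether the fraction is in tenths, hundredths, or thousandths.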

In [2837]:
len(athlete_df[athlete_df.events == 0])

4105

There are 4105 athletes that I do not have results for yet. This is not ideal and may cause poor model performance, but I am going to move forward with removing these athletes so the model will have values to train on. The dataframe will not be this small in the next iteration of this project, when I can get the results from the women's events as well.

In [2838]:
athlete_df = athlete_df[athlete_df.events != 0]

#### Since the athletes did not compete in the events where there are 'nan' values, I will fill them with 0.00 for now. In the first model iterations I combined all results into one column, but that only covered the 2004 Games. With results from 2004-2016, I cannot combine them without misrepresenting the data. In the next iteration I plan to engineer features for the percentage speed increase across heats as well as across Games.

I need to convert the remaining event result column values to floats in order to input them into a model.

In [2839]:
to_float_1 = athlete_df.columns[6:18]

In [2840]:
to_float_2 = athlete_df.columns[30:38]

In [2841]:
df_copy = athlete_df.copy()

In [2842]:
athlete_df = df_copy

#### I missed a few NaN values when loading in the event results. Here I am again stripping the strings that I found to be common in the result tables so I can convert the values to floats.

In [2843]:
for x in athlete_df[to_float_1]:
    athlete_df[x] = [x.strip('DPG /') for x in athlete_df[x]]
    athlete_df[x] = [x.strip('NF,-') for x in athlete_df[x]]
    athlete_df[x] = [x.strip('DNF,-') for x in athlete_df[x]]
    athlete_df[x] = [x.strip('DQ,-0') for x in athlete_df[x]]
    athlete_df[x] = [x.strip() for x in athlete_df[x]]
    athlete_df[x] = [x.replace('nan', '0.0') for x in athlete_df[x]]

In [2844]:
for x in athlete_df[to_float_2]:
    athlete_df[x] = [x.strip('DPG /') for x in athlete_df[x]]
    athlete_df[x] = [x.strip('NF,-') for x in athlete_df[x]]
    athlete_df[x] = [x.strip('DNF,-') for x in athlete_df[x]]
    athlete_df[x] = [x.strip('DQ,-0') for x in athlete_df[x]]
    athlete_df[x] = [x.strip() for x in athlete_df[x]]
    athlete_df[x] = [x.replace('nan', '0.00') for x in athlete_df[x]]
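
One thing to keep in mind with the cells above is that `str.strip` treats its argument as a set of characters to remove from both ends, not as a substring. The sketch below shows that behaviour and one alternative way to reach floats, using `Series.str.extract` and `pd.to_numeric`. The sample values are made up and the regular expression assumes results look like digits, a dot, then digits.

```python
import pandas as pd

# strip() removes any of the listed characters from the ends, so the 'D' remains.
print('DNF'.strip('NF,-'))  # D

# Made-up result strings, including non-finishers and a flagged time.
raw = pd.Series(['10.05', 'DNF', 'DQ', '9.98 / DPG', 'nan'])

# Pull out the first decimal number; rows without one become NaN.
numbers = raw.str.extract(r'(\d+\.\d+)', expand=False)

# Convert to floats and fill the missing results with 0.0, as done above.
as_float = pd.to_numeric(numbers, errors='coerce').fillna(0.0)
print(as_float.tolist())
```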

#### Saving the dataframe as a .csv file for modeling

In [2846]:
athlete_df.to_csv('../data/model_df_2.csv', index=False)