## Madrid 10K Run

#### Short presentation

I am Eva Donaque and I am currently enrolled in the part-time data analytics course at Ironhack. I really enjoy seeing the progress I have made in just a few months. It is even funny to see how things that seemed impossible at first now feel like a "piece of cake". Please have a look below at my data visualization project on the popular 10K race "San Silvestre Vallecana".

#### Introduction

Every December 31st the city of Madrid wakes up early to enjoy the last day of the year. What better way to do so than with a 10K run? It's the perfect way to leave the previous year behind and kick-start the new one on the right foot (literally). The race is called "San Silvestre Vallecana", and this project uses the data available from 2019. 

The "San Silvestre Vallecana" is run by people from 16 to 88+ years old. Given the popularity of the run, the profiles of the runners vary. Some runners just do it for fun, while others compete and try to beat their personal records. 

The dataset contains information on all 23K participants, including: id number, overall position of the runner, position of the runner within his/her category, age category, gender, and the seconds elapsed at 2.5 km, 5 km, 7.5 km and 10 km. 

Please refer to the bottom of this notebook to find the Machine Learning section.

# Let's run it! (literally)

Import all necessary libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib.patches as patches

Download the `madrid_10k` dataset from [here](https://drive.google.com/drive/folders/1GunsUDvEUbkKIfkdSBBDVWYSsOaPVWBM) and place it in the data folder.
Load and save your dataset in a variable called `madrid_10k`.

In [None]:
madrid_10k = pd.read_csv('../data/madrid_10k.csv')
madrid_10k = madrid_10k.rename(columns=lambda x: x.strip()) 
madrid_10k
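The `rename(columns=lambda x: x.strip())` call above removes stray whitespace around the column names. Here is a minimal sketch of the same idea on a tiny in-memory CSV. The two columns and their values are made up for illustration and are not taken from the race data.

```python
import io

import pandas as pd

# Hypothetical CSV whose header cells carry stray spaces
raw_csv = io.StringIO(" place , total_seconds \n1,1800\n2,1815\n")
demo = pd.read_csv(raw_csv)

# Before cleaning, the padded names are the real column labels
print(list(demo.columns))

# Stripping the column index has the same effect as the lambda rename
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

With clean names, columns can be selected as `demo['total_seconds']` without having to remember the padding.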

Explore the `madrid_10k` dataset using the Pandas `dtypes` attribute and the `describe` method.

In [None]:
madrid_10k.describe()

In [None]:
madrid_10k.dtypes

Check for any missing values. 

In [None]:
madrid_10k.isnull().sum()

What to do with missing values?

In [None]:
# I decided to drop the missing values because the missing data accounts for
# only 9.88% of the dataset. This is quite a low percentage that is unlikely
# to have a big impact on the final results.
madrid_10k = madrid_10k.dropna()
madrid_10k
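Before dropping anything, it can help to check how many rows would be lost. The sketch below shows one possible way to compute that share. The frame and its values are invented stand-ins for the race data, so the printed percentage is not the notebook's figure.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the race data (values are illustrative only)
demo = pd.DataFrame({
    "total_seconds": [1800.0, np.nan, 2400.0, 3000.0],
    "5km_seconds": [900.0, 950.0, np.nan, 1500.0],
})

# dropna() removes a row as soon as ANY of its columns is missing
rows_with_gaps = demo.isnull().any(axis=1)
share_dropped = rows_with_gaps.mean() * 100
print(f"{share_dropped:.2f}% of the rows would be dropped")

cleaned = demo.dropna()
print(len(cleaned), "rows remain")
```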

What is the mean `total_seconds` of the whole run?

In [None]:
madrid_10k['total_seconds'].mean()
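A mean expressed in seconds is hard to read as a race time. As a small convenience, a helper like the one below could turn it into hours, minutes and seconds. `format_seconds` is a hypothetical name of my own, not something defined elsewhere in this notebook.

```python
def format_seconds(seconds: float) -> str:
    """Return a duration given in seconds as H:MM:SS, rounded to the nearest second."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Example with an arbitrary value, not the actual mean of the race
print(format_seconds(3725.4))
```

It could then be applied to the result of the cell above, for example `format_seconds(madrid_10k['total_seconds'].mean())`.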

What is the mean `total_seconds` by `sex`?

In [None]:
mean_sex=madrid_10k.groupby('sex')['total_seconds'].mean().reset_index()
plt.subplots(figsize=(5,4))
plt.bar(mean_sex['sex'],mean_sex['total_seconds'])
plt.title('Average total seconds by sex')
plt.xlabel('Sex')
plt.ylabel('Average total seconds')
plt.show()

In [None]:
number_sex=madrid_10k.groupby('sex')['id_number'].count().reset_index()
plt.subplots(figsize=(5,4))
plt.bar(number_sex['sex'],number_sex['id_number'])
plt.title('Number of runners by sex')
plt.xlabel('Sex')
plt.ylabel('Total Runners')
plt.show()

In [None]:
madrid_10k.groupby('sex')['id_number'].count()
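The raw counts can also be read as shares of the field. Here is a minimal sketch using `value_counts(normalize=True)` on a made-up sample. The `'F'` label is an assumption on my side, since only `'M'` is confirmed later in this notebook.

```python
import pandas as pd

# Hypothetical sample; the 'F' label is assumed
demo = pd.DataFrame({"sex": ["M", "M", "F", "M"]})

# normalize=True returns proportions instead of counts
share_by_sex = demo["sex"].value_counts(normalize=True) * 100
print(share_by_sex)
```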

What is the mean `total_seconds` by `age_category`?

In [None]:
mean_age_category=madrid_10k.groupby('age_category')['total_seconds'].mean().reset_index()
plt.subplots(figsize=(5,4))
plt.bar(mean_age_category['age_category'],mean_age_category['total_seconds'])
plt.title('Average total seconds by age category')
plt.xlabel('Age Category')
plt.ylabel('Average total seconds')
plt.show()

What is the mean `total_seconds` per `sex` and `age_category`? Make a bar chart.

In [None]:
# Map each age category to an ordinal number so the categories can be sorted
age_category_order = {'16-19': 1, '20-22': 2, '23-34': 3, '35-44': 4, '45-54': 5, '55+': 6}
madrid_10k['age_category_number'] = madrid_10k['age_category'].map(age_category_order)
madrid_10k = madrid_10k.sort_values(['age_category_number']).reset_index(drop=True)
sns.barplot(x='age_category', y='total_seconds', hue='sex', data=madrid_10k)
plt.title('Total Seconds per Sex and Age Category')
plt.xlabel('Age Category')
plt.ylabel('Total Seconds')
plt.show()
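An alternative to a separate helper column is an ordered categorical, which carries the sort order with the labels themselves. This is only a sketch on invented rows. The six category labels come from this notebook, but the times are made up.

```python
import pandas as pd

age_order = ["16-19", "20-22", "23-34", "35-44", "45-54", "55+"]

# Made-up rows for illustration
demo = pd.DataFrame({
    "age_category": ["35-44", "16-19", "55+", "23-34"],
    "total_seconds": [3100, 2900, 3600, 3000],
})

# An ordered categorical sorts by the given order, not alphabetically
demo["age_category"] = pd.Categorical(
    demo["age_category"], categories=age_order, ordered=True
)
demo = demo.sort_values("age_category").reset_index(drop=True)

# cat.codes holds the 0-based position of each label, so +1 gives 1..6
demo["age_category_number"] = demo["age_category"].cat.codes + 1
print(demo)
```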

What is the mean `total_seconds` per `sex` and `age_category`? Make a line chart.

In [None]:
madrid_10k_pivot= madrid_10k.pivot_table(index='age_category',columns='sex',values='total_seconds',aggfunc='mean')
madrid_10k_pivot.plot()
plt.title('Comparison Age Category and Sex')
plt.xlabel('Age Category')
plt.ylabel('Total seconds')
plt.show()

Summary statistic of the `age_category`.

In [None]:
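# Assign the result back; pd.to_numeric returns a new Series and does not modify the column in place
madrid_10k['age_category_number'] = pd.to_numeric(madrid_10k['age_category_number'], errors='coerce')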
sns.boxplot(x='age_category_number', data=madrid_10k)
plt.title('Age Category Distribution')
plt.xlabel('Age Category')
plt.show()

~~~~
From this boxplot we see that the median age category is 4, which corresponds to runners aged 35-44.  
We can also see that most of the runners are between 23 and 54 years old. 
~~~~
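What is read off the boxplot can also be checked numerically. The sketch below computes the median category and the share of runners in categories 3 to 5 on a made-up series of category numbers, so the figures it prints are not the race's.

```python
import pandas as pd

labels = {1: "16-19", 2: "20-22", 3: "23-34", 4: "35-44", 5: "45-54", 6: "55+"}

# Invented category numbers for ten runners
codes = pd.Series([3, 4, 4, 5, 2, 4, 3, 6, 4, 5])

# The median category, as marked by the line inside the box
median_code = codes.median()
print("Median category:", labels.get(int(median_code)))

# Share of runners in categories 3 to 5 (23 to 54 years old)
share_23_54 = codes.between(3, 5).mean()
print(f"Share aged 23-54: {share_23_54:.0%}")
```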

Distribution of `age_category`.

In [None]:
# Pass the column as a keyword; recent seaborn versions no longer accept it positionally
sns.violinplot(x='age_category_number', data=madrid_10k)
plt.title('Age Category Distribution')
plt.xlabel('Age Category')
plt.show()

~~~~
From this violin plot we again see the median (white dot) in category 4 (35-44). We can also see that the data is concentrated in categories 3 to 5, which correspond to runners between 23 and 54 years old. 
~~~~

Distribution of `age_category`.

In [None]:
madrid_10k.hist('age_category_number')
plt.title('Age Category Distribution')
plt.xlabel('Age Category')
plt.ylabel('Number of Runners')
plt.show()

Compare the 4 stages of the run. Does the average speed change throughout the different milestones?

In [None]:
madrid_10k['seconds_5km']=madrid_10k['5km_seconds'] - madrid_10k['2.5km_seconds']
madrid_10k['seconds_7.5km']=madrid_10k['7.5km_seconds'] - madrid_10k['5km_seconds']
madrid_10k['seconds_10km']=madrid_10k['total_seconds'] - madrid_10k['7.5km_seconds']
activity = madrid_10k[['2.5km_seconds','seconds_5km', 'seconds_7.5km','seconds_10km', 'place', 'sex']]

# Create a figure of a fixed size with one axis per stage
fig, axs = plt.subplots(1, 4, figsize=(20, 5))

# Draw one scatter plot per stage: finishing place against seconds spent in that stage
for ax, column in zip(axs, activity.columns[:4]):
    ax.scatter(activity['place'], activity[column])
    ax.set_title(column)
    ax.set_xlabel('Place')
    ax.set_ylabel('Seconds')

plt.show()

~~~~
From these scatter plots we can see that runners tended to start fast and then slow down after the 2.5 km mark. Speed picked up again after the 5 km mark, and runners tended to finish with a final sprint. 
~~~~
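Since each stage covers 2.5 km, the seconds per stage can be converted into an average speed, which makes the stages easier to compare. The following is a sketch on two invented runners. The column names mirror the ones used in this notebook, but the times are hypothetical.

```python
import pandas as pd

SEGMENT_KM = 2.5  # distance covered in each of the four stages

# Hypothetical cumulative times (in seconds) for two made-up runners
demo = pd.DataFrame({
    "2.5km_seconds": [600, 750],
    "5km_seconds": [1225, 1500],
    "7.5km_seconds": [1825, 2250],
    "total_seconds": [2400, 3000],
})

# Seconds spent in each stage, derived from the cumulative times
stage_seconds = pd.DataFrame({
    "0-2.5 km": demo["2.5km_seconds"],
    "2.5-5 km": demo["5km_seconds"] - demo["2.5km_seconds"],
    "5-7.5 km": demo["7.5km_seconds"] - demo["5km_seconds"],
    "7.5-10 km": demo["total_seconds"] - demo["7.5km_seconds"],
})

# speed (km/h) = distance (km) / time (h)
speed_kmh = SEGMENT_KM / (stage_seconds / 3600)
print(speed_kmh.round(2))
```

A higher value means a faster stage, so a drop between the first and second columns corresponds to a runner slowing down after the 2.5 km mark.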

How was the performance of the top 10 performers for every milestone?

In [None]:
milestones = ['2.5km_seconds', 'seconds_5km', 'seconds_7.5km', 'seconds_10km']
# Keep the 10 fastest finishers and plot the seconds they spent in each stage
top_10_splits = madrid_10k.groupby('place')[milestones + ['total_seconds']].sum().nsmallest(10, 'total_seconds')
top_10_splits[milestones].plot.barh()
plt.title('Top 10 runners & number of seconds per milestone')
plt.xlabel('Seconds per milestone')
plt.ylabel('Place in the race')
plt.show()

~~~~
From this bar chart we can see that most runners kept a similar speed throughout the first three stages. However, all of them slowed down during the last quarter of the run.  
~~~~

Which `age_category` and `sex` did the top 10 performers of the run belong to?

In [None]:
top_10=madrid_10k.sort_values('place').head(10)
top_10

In [None]:
top_10_pivot= top_10.pivot_table(index='place',columns=['sex', 'age_category'],values='total_seconds')
top_10_pivot.plot(kind='hist', figsize= (5,5))
plt.title('Top 10 runners')
plt.show()

~~~~
The top performers were all male and fell within four age categories: 20-22, 23-34, 35-44 and 45-54.
~~~~

# Machine Learning

- Is it possible to build a model that can accurately predict the final result of a runner given his/her splits up to 7.5 km and his/her demographics? 
- Is it possible to build a model that can accurately predict whether the runner is a woman or a man?


In [None]:
#Import the required libraries; the categorical values are turned into dummy variables in the next cells
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import r2_score

#Work on a copy so the original dataframe is left untouched
madrid_10k_fixed=madrid_10k.copy()

In [None]:
#Create 6 dummy variables for age_category 
age_category=['16-19','20-22','23-34','35-44','45-54', '55+']
for x in age_category:
    madrid_10k_fixed[x]=0
madrid_10k_fixed

In [None]:
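for x in age_category:
    #regex=False matches the label literally; otherwise the '+' in '55+' is read as a regex quantifier
    madrid_10k_fixed[x]=madrid_10k_fixed['age_category'].str.contains(x, regex=False).astype(int)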

In [None]:
#Create 1 dummy variable for sex
sex=['M']
for x in sex:
    madrid_10k_fixed[x]=0
for x in sex:
    madrid_10k_fixed[x]=madrid_10k_fixed['sex'].str.contains(x).astype(int)
madrid_10k_fixed
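The dummy variables built by hand above can also be produced with `pd.get_dummies`. This is a sketch of that alternative on three invented rows, not the approach used in this notebook. The `'F'` label is an assumption.

```python
import pandas as pd

# Made-up rows; the 'F' label is assumed
demo = pd.DataFrame({
    "age_category": ["16-19", "55+", "23-34"],
    "sex": ["M", "F", "M"],
})

# One 0/1 column per age category that appears in the data
age_dummies = pd.get_dummies(demo["age_category"], dtype=int)

# A single 0/1 column is enough for a two-valued variable
demo["M"] = (demo["sex"] == "M").astype(int)

encoded = pd.concat([demo, age_dummies], axis=1)
print(encoded)
```

One difference worth noting: `get_dummies` only creates columns for categories present in the data, while the loop above always creates all six.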

In [None]:
#Take total_seconds as dependent variable
y=madrid_10k_fixed['total_seconds']
X= madrid_10k_fixed[['16-19','20-22','23-34','35-44','45-54','55+','M', '2.5km_seconds', 'seconds_5km', 'seconds_7.5km']]

In [None]:
#Split the data with test size=0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [None]:
#Run Linear Regression
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
from math import sqrt
from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test)
regression_error = y_pred - y_test
print('R square of regression...',model.score(X_test,y_test))
print('RMSE of regression...',sqrt(mean_squared_error(y_test, y_pred)))
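To make the two metrics printed above less of a black box, here is a sketch that computes RMSE and R square by hand with NumPy. The four actual and predicted times are invented, so the numbers only illustrate the formulas.

```python
import numpy as np

# Invented actual and predicted finishing times, in seconds
y_true = np.array([2400.0, 3000.0, 3600.0, 4200.0])
y_hat = np.array([2410.0, 2990.0, 3620.0, 4180.0])

residuals = y_hat - y_true

# RMSE: square root of the mean squared residual, in the same unit as y (seconds)
rmse = np.sqrt(np.mean(residuals ** 2))

# R square: 1 minus the ratio of unexplained variation to total variation
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("RMSE:", rmse)
print("R square:", r2)
```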

In [None]:
# histplot replaces the deprecated distplot; it draws no KDE curve by default
sns.histplot(regression_error, bins=np.linspace(-40,40,200))
plt.xlim([-40,40])
plt.xlabel('Error in Seconds')
plt.ylabel('Number of Runners')

In [None]:
pd.DataFrame({'test':y_test, 'predicted':y_pred})

In [None]:
df = pd.DataFrame({'test':y_test, 'predicted':y_pred})
df = df.reset_index(drop=True)
df = df.reset_index()
data = pd.melt(df, id_vars=['index'], value_vars=['test', 'predicted'])

In [None]:
plt.figure(figsize=(50,10))
sns.lineplot(x="index", y="value", hue="variable", data=data)

In [None]:
#Run Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
y_pred = model.predict(X_test)
decision_tree_regression_error = y_pred - y_test
print('R square of regression...',model.score(X_test,y_test))
print('RMSE of regression...',sqrt(mean_squared_error(y_test, y_pred)))

In [None]:
# histplot replaces the deprecated distplot; it draws no KDE curve by default
sns.histplot(decision_tree_regression_error, bins=np.linspace(-40,40,200))
plt.xlim([-40,40])
plt.xlabel('Error in Seconds')
plt.ylabel('Number of Runners')

In [None]:
pd.DataFrame({'test':y_test, 'predicted':y_pred})

In [None]:
df = pd.DataFrame({'test':y_test, 'predicted':y_pred})
df = df.reset_index(drop=True)
df = df.reset_index()
data = pd.melt(df, id_vars=['index'], value_vars=['test', 'predicted'])

In [None]:
plt.figure(figsize=(50,10))
sns.lineplot(x="index", y="value", hue="variable", data=data)

In [None]:
#Comparison between Linear Regression and Decision Tree
bins = np.linspace(-40,40,200)
# Label each histogram directly so the legend entries match the right colors
sns.histplot(regression_error, bins=bins, label='Linear Regression')
sns.histplot(decision_tree_regression_error, bins=bins, label='Decision Tree')
plt.xlabel('Error in seconds')
plt.legend(loc = 2)

In [None]:
#Sex becomes the dependent variable
y=madrid_10k_fixed['M']
X= madrid_10k_fixed[['16-19','20-22','23-34','35-44','45-54','55+','2.5km_seconds', 'seconds_5km', 'seconds_7.5km', 'seconds_10km','total_seconds']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)


In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc

In [None]:
#Run Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=1000)
rfc.fit(X_train,y_train)
y_pred_rf = rfc.predict(X_test)
y_score_rf = rfc.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_score_rf)
print('Random Forest ROC AUC', auc(fpr_rf, tpr_rf))
print(classification_report(y_test, y_pred_rf, target_names=['Female','Male']))

In [None]:
#Run Gradient Boosting Classifier
gbc = GradientBoostingClassifier(n_estimators=1000)
gbc.fit(X_train,y_train)
y_pred_gb = gbc.predict(X_test)
y_score_gb = gbc.predict_proba(X_test)[:,1]
fpr_gb, tpr_gb, _ = roc_curve(y_test, y_score_gb)
print('Gradient Boosting ROC AUC', auc(fpr_gb, tpr_gb))
print(classification_report(y_test, y_pred_gb, target_names=['Female','Male']))

In [None]:
#Comparison between Random Forest Classifier and Gradient Boosting Classifier
fig, ax = plt.subplots()
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_gb, tpr_gb, label='GB')
ax.set_aspect('equal')
plt.legend()
plt.show()
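The ROC AUC printed for both classifiers has a simple interpretation: it is the probability that a randomly chosen positive case (here, a male runner) gets a higher score than a randomly chosen negative one. The sketch below computes it that way on four invented scores. `pairwise_auc` is a hypothetical helper of my own, not a scikit-learn function.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return correct / (len(pos) * len(neg))


# Invented labels (1 = male) and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))
```

A value of 0.5 corresponds to the dashed diagonal in the plot above, and 1.0 to a classifier that ranks every positive above every negative.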

# Conclusions

- The Linear Regression model gives the best results when predicting total seconds. Its margin of error is up to 40 seconds, which is quite low but could definitely make a difference in a short run. 
- Both models that predict a runner's sex from his/her performance were quite accurate. Sex seems to be a factor that influences the final time of the race, as shown in the distribution of total seconds per sex. It is important to bear in mind that the two machine learning models explored produced good results despite the limited number of features in the data.
