## Madrid 10K Run

#### Short presentation

I am Eva Donaque and I am currently enrolled in the part-time data analytics course at Ironhack. I really enjoy seeing the progress I have made in just a few months. It is even funny how things that seemed impossible at first can now be considered a "piece of cake". Please have a look below at my data visualization project on the popular 10K race "San Silvestre Vallecana".

#### Introduction

Every December 31st the city of Madrid wakes up early to enjoy the last day of the year. What better way to do so than with a 10K run? It's the perfect way to leave the previous year behind and kick-start the new one on the right foot (literally). The name of this race is "San Silvestre Vallecana", and for this project we will be using the data available from 2019.

The "San Silvestre Vallecana" is run by people from 16 to 88+ years old. Given the popularity of the run, the profiles of the runners vary. Some runners just do it for fun while others try to compete and beat personal records.

The dataset contains information on all 23K participants, including: ID number, overall position of the runner, position of each runner in his/her category, category by age, gender, and seconds elapsed at 2.5 km, 5 km, 7.5 km and 10 km.

In [None]:
#Paolo: good intro. You could add a 'goal' section to specify, if you already have one, what the goal of your research is:
# what is the question you are trying to answer? Here the focus is mainly on visualizations, but in general it is useful to
# have this section so that the reader can focus on the question you are asking.

# Let's run it! (literally)

Import all necessary libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib.patches as patches

Download the `madrid_10k` dataset from [here](https://drive.google.com/drive/folders/1D1iseKNOy50mqo31FkuQbID-zfCyijZD) and place it in the data folder.
Load and save your dataset in a variable called `madrid_10k`.

In [None]:
madrid_10k = pd.read_csv('../data/madrid_10k.csv')
madrid_10k = madrid_10k.rename(columns=lambda x: x.strip()) 
madrid_10k

Explore the `madrid_10k` dataset using the pandas `dtypes` attribute and `describe` method.

In [None]:
madrid_10k.describe()

In [None]:
madrid_10k.dtypes

Check for any missing values. 

In [None]:
madrid_10k.isnull().sum()

What to do with missing values?

In [None]:
# I decided to drop the rows with missing values because the missing data accounts for only 9.88% of the dataset.
# This is a fairly low percentage that is unlikely to have a big impact on the end results.
madrid_10k = madrid_10k.dropna()
madrid_10k
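
Before dropping rows, it can help to check how much data would be lost. The sketch below uses a small made-up frame, not the race data, and a hypothetical helper named `missing_row_share` to compute the percentage of rows with at least one missing value.

```python
import numpy as np
import pandas as pd


def missing_row_share(df: pd.DataFrame) -> float:
    """Return the percentage of rows that contain at least one missing value."""
    return df.isnull().any(axis=1).mean() * 100


# Toy data (not the race data): 1 of the 4 rows has a missing value.
toy = pd.DataFrame({
    "place": [1, 2, 3, 4],
    "total_seconds": [1800.0, 1850.0, np.nan, 1990.0],
})

share = missing_row_share(toy)
print(f"{share:.2f}% of rows would be removed by dropna()")
```

Running the same helper on `madrid_10k` before calling `dropna()` would give the figure to quote in the comment.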

What is the mean `total_seconds` of the whole run?

In [None]:
madrid_10k['total_seconds'].mean()

What is the mean `total_seconds` by `sex`?

In [None]:
mean_sex=madrid_10k.groupby('sex')['total_seconds'].mean().reset_index()
plt.subplots(figsize=(5,4))
plt.bar(mean_sex['sex'],mean_sex['total_seconds'])
plt.title('Average total seconds by sex')
plt.xlabel('Sex')
plt.ylabel('Average total seconds')
plt.show()

In [None]:
#Paolo: I was thinking, maybe also add a bar chart of the male and female participants: do more men or more women take part?
#Also, given it is a semimarathon, I would have used hours instead of seconds. It is the natural unit of measure for these
#competitions, no?
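
One possible way to follow up on both suggestions is sketched below: it counts participants per sex and reports the mean finishing time in minutes and seconds rather than raw seconds. The data is made up, the `'M'`/`'F'` labels are assumed (the real file may use other codes), and `format_minutes` is a hypothetical helper.

```python
import pandas as pd


def format_minutes(seconds: float) -> str:
    """Format a duration given in seconds as 'MM:SS'."""
    total = int(round(seconds))
    minutes, secs = divmod(total, 60)
    return f"{minutes:02d}:{secs:02d}"


# Toy data; the sex labels are an assumption, not taken from the dataset.
toy = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "F"],
    "total_seconds": [2400, 3000, 3300, 2700, 3600],
})

# Number of participants per sex
participants = toy["sex"].value_counts()

# Mean finishing time per sex, formatted as minutes and seconds
mean_time = toy.groupby("sex")["total_seconds"].mean().map(format_minutes)

print(participants)
print(mean_time)
```

The `participants` series can be passed directly to `plt.bar(participants.index, participants.values)` to get the suggested bar chart.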

What is the mean `total_seconds` by `age_category`?

In [None]:
mean_age_category=madrid_10k.groupby('age_category')['total_seconds'].mean().reset_index()
plt.subplots(figsize=(5,4))
plt.bar(mean_age_category['age_category'],mean_age_category['total_seconds'])
plt.title('Average total seconds by age category')
plt.xlabel('Age Category')
plt.ylabel('Average total seconds')
plt.show()

What is the mean `total_seconds` per `sex` and `age_category`? Make a bar chart.

In [None]:
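# Map each age category to a number so the categories can be sorted by age.
# Assigning the result avoids an in-place replace on a column selection, which pandas may not apply.
madrid_10k['age_category_number'] = madrid_10k['age_category'].replace(
    {'16-19': 1, '20-22': 2, '23-34': 3, '35-44': 4, '45-54': 5, '55+': 6})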
madrid_10k=madrid_10k.sort_values(['age_category_number']).reset_index(drop=True)
sns.barplot(x='age_category', y='total_seconds', hue='sex', data=madrid_10k)
plt.title('Total Seconds per Sex and Age Category')
plt.xlabel('Age Category')
plt.ylabel('Total Seconds')
plt.show()
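
An alternative to the numeric helper column is an ordered categorical, which lets pandas sort the age groups in their natural order. The sketch below uses made-up rows; the category labels are the ones used in the cell above.

```python
import pandas as pd

# Age categories in ascending order, as listed in the notebook
AGE_ORDER = ["16-19", "20-22", "23-34", "35-44", "45-54", "55+"]

# Toy data (not the race data), deliberately out of order
toy = pd.DataFrame({
    "age_category": ["55+", "23-34", "16-19", "35-44"],
    "total_seconds": [3900, 3000, 3100, 3200],
})

# Declare the order once; sorting and grouping then follow it
toy["age_category"] = pd.Categorical(
    toy["age_category"], categories=AGE_ORDER, ordered=True
)

sorted_toy = toy.sort_values("age_category").reset_index(drop=True)
print(sorted_toy)
```

With this approach the original labels stay on the axis and no separate `age_category_number` column is needed for ordering.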

What is the mean `total_seconds` per `sex` and `age_category`? Make a line chart.

In [None]:
madrid_10k_pivot= madrid_10k.pivot_table(index='age_category',columns='sex',values='total_seconds',aggfunc='mean')
madrid_10k_pivot.plot()
plt.title('Comparison Age Category and Sex')
plt.xlabel('Age Category')
plt.ylabel('Total seconds')
plt.show()

Summary statistics of `age_category`.

In [None]:
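# Assign the result back; pd.to_numeric returns a new Series and does not modify the column in place
madrid_10k['age_category_number'] = pd.to_numeric(madrid_10k['age_category_number'], errors='coerce')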
sns.boxplot(x='age_category_number', data=madrid_10k)
plt.title('Age Category Distribution')
plt.xlabel('Age Category')
plt.show()

~~~~
From this boxplot we can see that the median age category is 4, which corresponds to 35-44 years old.  
We can also see that most of the runners are between 23 and 54 years old.
~~~~

Distribution of `age_category`.

In [None]:
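# Pass the column as a keyword argument; recent seaborn versions no longer accept it positionally
sns.violinplot(x='age_category_number', data=madrid_10k)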
plt.title('Age Category Distribution')
plt.xlabel('Age Category')
plt.show()

~~~~
From this violin plot we again see the median (white dot) in category 4 (35-44). We can also see that most of the data falls within categories 3 to 5, which corresponds to runners between 23 and 54 years old.
~~~~

Distribution of `age_category`.

In [None]:
madrid_10k.hist('age_category_number')
plt.title('Age Category Distribution')
plt.xlabel('Age Category')
plt.ylabel('Number of Runners')
plt.show()
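
Because `age_category` is categorical, a plain count per category is another way to read its distribution. The sketch below, on made-up rows, tabulates the number and percentage of runners per category in age order.

```python
import pandas as pd

# Age categories in ascending order, as listed in the notebook
AGE_ORDER = ["16-19", "20-22", "23-34", "35-44", "45-54", "55+"]

# Toy data (not the race data)
toy = pd.DataFrame({
    "age_category": ["23-34", "35-44", "35-44", "45-54", "16-19", "35-44"],
})

# Count runners per category, keeping empty categories and the age order
counts = toy["age_category"].value_counts().reindex(AGE_ORDER, fill_value=0)
shares = (counts / counts.sum() * 100).round(1)

summary = pd.DataFrame({"runners": counts, "percent": shares})
print(summary)
```

A table like this gives the exact figures behind the histogram bars.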

Make a comparison between the 4 stages of the run. Does the average speed change throughout the different milestones?

In [None]:
# Time spent on each 2.5 km segment (the original split columns are cumulative)
madrid_10k['seconds_5km']=madrid_10k['5km_seconds'] - madrid_10k['2.5km_seconds']
madrid_10k['seconds_7.5km']=madrid_10k['7.5km_seconds'] - madrid_10k['5km_seconds']
madrid_10k['seconds_10km']=madrid_10k['total_seconds'] - madrid_10k['7.5km_seconds']
segments = ['2.5km_seconds','seconds_5km', 'seconds_7.5km','seconds_10km']
activity = madrid_10k[segments + ['place']]

# Create a figure of a fixed size with one axes per segment
fig, axs = plt.subplots(1,4, figsize = (20,5))

# Draw one scatter plot per segment: finishing place vs. seconds spent on that segment
for ax, segment in zip(axs, segments):
    ax.scatter(activity['place'], activity[segment])
    ax.set_title(segment)
    ax.set_xlabel('Place')
    ax.set_ylabel('Seconds')

plt.show()

~~~~
From these scatter plots we can see that runners tended to start fast and then slowed down after passing the 2.5 km mark. Speed picked up again after the 5 km mark, and runners tended to finish with a final sprint.
~~~~
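
The scatter plots show seconds per segment, so speed has to be read inversely. As a rough sketch of the arithmetic, the code below converts each 2.5 km segment time into km/h and averages it over runners. The column names follow the notebook, the rows are made up, and `mean_segment_speed` is a hypothetical helper.

```python
import pandas as pd

SEGMENT_KM = 2.5  # each checkpoint is 2.5 km after the previous one


def mean_segment_speed(df: pd.DataFrame) -> pd.Series:
    """Mean speed in km/h over each 2.5 km segment, from cumulative split times."""
    segments = pd.DataFrame({
        "0-2.5 km": df["2.5km_seconds"],
        "2.5-5 km": df["5km_seconds"] - df["2.5km_seconds"],
        "5-7.5 km": df["7.5km_seconds"] - df["5km_seconds"],
        "7.5-10 km": df["total_seconds"] - df["7.5km_seconds"],
    })
    # speed = distance / time, with time converted from seconds to hours
    speed_kmh = SEGMENT_KM / (segments / 3600)
    return speed_kmh.mean()


# Toy data (not the race data): two runners with cumulative split times
toy = pd.DataFrame({
    "2.5km_seconds": [600, 900],
    "5km_seconds": [1200, 1800],
    "7.5km_seconds": [1800, 2700],
    "total_seconds": [2300, 3600],
})

speeds = mean_segment_speed(toy)
print(speeds)
```

Plotting `speeds` as a line or bar chart would answer the question about average speed per stage with a single figure.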

In [None]:
#Paolo: Interesting plot: in the first and last plots it looks like there are two almost distinct groups of data.
# Do you know what they are? You could add a visualization that colors the dots by whether they correspond to male or female runners. Maybe that could
# add an interesting insight.
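
A minimal sketch of the suggested coloring is shown below. It draws one scatter series per sex so each group gets its own color and legend entry. The rows are made up, the `'M'`/`'F'` labels are assumed, and `scatter_by_sex` is a hypothetical helper.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_by_sex(df: pd.DataFrame, column: str, ax):
    """Scatter `column` against finishing place, one color per sex."""
    for sex, group in df.groupby("sex"):
        ax.scatter(group["place"], group[column], s=5, alpha=0.5, label=sex)
    ax.set_title(column)
    ax.set_xlabel("Place")
    ax.set_ylabel("Seconds")
    ax.legend(title="Sex")
    return ax


# Toy data (not the race data); the sex labels are an assumption
toy = pd.DataFrame({
    "place": [1, 2, 3, 4, 5, 6],
    "sex": ["M", "M", "F", "M", "F", "F"],
    "2.5km_seconds": [450, 470, 500, 520, 560, 600],
})

fig, ax = plt.subplots(figsize=(5, 4))
scatter_by_sex(toy, "2.5km_seconds", ax)
plt.show()
```

Calling the helper once per segment column on a row of four axes would reproduce the figure above with the two groups distinguished.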

How was the performance of the top 10 performers for every milestone?

In [None]:
milestones = ['2.5km_seconds', 'seconds_5km', 'seconds_7.5km', 'seconds_10km']
# Keep the 10 runners with the lowest total time and plot their seconds per milestone
top_10_milestones = (madrid_10k.groupby('place')[milestones + ['total_seconds']].agg('sum')
                     .nsmallest(10, 'total_seconds')[milestones])
top_10_milestones.plot.barh()
plt.title('Top 10 runners & number of seconds per milestone')
plt.xlabel('Seconds per milestone')
plt.ylabel('Place in the race')
plt.show()

~~~~
From this bar chart we can see that most runners kept a similar speed throughout the first 3 milestones. However, all of them reduced their speed during the last quarter of the run.  
~~~~
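
To check this reading numerically, one option is to find the slowest segment for each top runner. The sketch below uses made-up times with the segment column names from the notebook; `idxmax(axis=1)` returns, for each row, the column holding the largest value.

```python
import pandas as pd

SEGMENTS = ["2.5km_seconds", "seconds_5km", "seconds_7.5km", "seconds_10km"]

# Toy data (not the race data): seconds spent on each segment by three runners
toy = pd.DataFrame({
    "place": [1, 2, 3],
    "2.5km_seconds": [440, 445, 450],
    "seconds_5km": [442, 450, 452],
    "seconds_7.5km": [441, 448, 470],
    "seconds_10km": [460, 465, 455],
}).set_index("place")

# For each runner, the segment with the most seconds is the slowest one
slowest = toy[SEGMENTS].idxmax(axis=1)
print(slowest.value_counts())
```

If most of the top runners show `seconds_10km` as their slowest segment, that supports the observation above.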

Within which `age_category` and `sex` were the top 10 performers of the run?

In [None]:
# head(10) keeps the ten best-placed runners
top_10=madrid_10k.sort_values('place').head(10)
top_10

In [None]:
#Paolo: maybe you could look at the top 10 for men and women separately. Also maybe at the strategy they use.
# A question could be: do the top male and female runners use similar strategies?
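
A possible starting point for that comparison is sketched below: it selects the fastest finishers within each sex. The rows are made up, the `'M'`/`'F'` labels are assumed, and `top_n_by_sex` is a hypothetical helper.

```python
import pandas as pd


def top_n_by_sex(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n fastest finishers within each sex, fastest first."""
    return (
        df.sort_values("total_seconds")
          .groupby("sex")
          .head(n)
          .sort_values(["sex", "total_seconds"])
          .reset_index(drop=True)
    )


# Toy data (not the race data); the sex labels are an assumption
toy = pd.DataFrame({
    "place": [1, 2, 3, 4, 5, 6],
    "sex": ["M", "M", "F", "M", "F", "F"],
    "total_seconds": [1700, 1750, 1900, 1950, 2000, 2100],
})

top = top_n_by_sex(toy, n=2)
print(top)
```

The per-segment columns of the resulting rows could then be compared between the two groups to look at pacing strategy.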

In [None]:
top_10_pivot= top_10.pivot_table(index='place',columns=['sex', 'age_category'],values='total_seconds')
top_10_pivot.plot(kind='hist', figsize= (5,5))
plt.title('Top 10 runners')
plt.show()

In [None]:
#Paolo: I am not sure I understand this plot, what is the frequency?

~~~~
Top performers were all male and fell within 4 age categories: 23-34, 35-44, 45-54, 20-22.
~~~~
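
A count table may be easier to read than a histogram for this question. The sketch below, on made-up rows, uses `pd.crosstab` to count top finishers per `sex` and `age_category`.

```python
import pandas as pd

# Toy data (not the race data); the sex labels are an assumption
toy_top = pd.DataFrame({
    "place": [1, 2, 3, 4, 5],
    "sex": ["M", "M", "M", "M", "M"],
    "age_category": ["23-34", "23-34", "35-44", "20-22", "23-34"],
})

# Rows: sex, columns: age category, values: number of runners
counts = pd.crosstab(toy_top["sex"], toy_top["age_category"])
print(counts)
```

Applied to `top_10`, each cell would show how many of the top runners fall in that combination, which makes the "frequency" explicit.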

# Conclusions

It was interesting to discover new insights within the running world. Some key findings are the following:
- From what we see, males appear to be faster, especially those between 23 and 54 years old
- Runners appear to start fast, then slow down, and speed up again towards the end
- The best performers appear to have reduced their speed over the last milestone

What else can be done?
- Find out if there is a significant difference between male and female running styles
- Find the main differences between more junior and more senior runners
- Compare the top runners of this race with their other runs in order to discover if there is any trend.


In [None]:
#Paolo: In general, a good story and good visualizations. Also good organization of the notebook. Next time you could experiment
# with interactive plots, if useful. The notebook runs without problems, and the data are correctly stored and easily
# retrievable.
#Are you running it next year?