
In [1]:
from warnings import filterwarnings
filterwarnings(action='ignore', category=FutureWarning)

In [2]:
import pandas as pd

GAMES = '/kaggle/input/olympics-legacy-1896-2020/all_athlete_games.csv'
df = pd.read_csv(filepath_or_buffer=GAMES).drop(columns=['Entry ID', 'NOC', 'Event'])
df['Medal rank'] = df['Medal'].map({'Gold': 1, 'Silver': 2, 'Bronze': 3})
df.head()

,Name,Gender,Age,Team,Year,Season,City,Sport,Medal,Medal rank
0,A Dijiang,Male,24.0,China,1992,Summer,Barcelona,Basketball,,
1,A Lamusi,Male,23.0,China,2012,Summer,London,Judo,,
2,Gunnar Nielsen Aaby,Male,24.0,Denmark,1920,Summer,Antwerpen,Football,,
3,Edgar Lindenau Aabye,Male,34.0,Denmark/Sweden,1900,Summer,Paris,Tug-Of-War,Gold,1.0
4,"Cornelia ""Cor"" Aalten (-Strannood)",Female,18.0,Netherlands,1932,Summer,Los Angeles,Athletics,,


This dataset has no obvious prediction target, but it supports a fair amount of exploratory data analysis (EDA).

How many athletes of each sex are in our dataset?

In [3]:
from plotly.express import histogram
# drop duplicates so each name is counted once per season, however many events it was entered in
histogram(data_frame=df[['Name', 'Gender', 'Season']].drop_duplicates(ignore_index=True), x='Gender', color='Season')

What are the ages of Olympic athletes?

In [4]:
# histogram was already imported above; one panel per season, colored by gender
histogram(data_frame=df[['Name', 'Age', 'Gender', 'Season']].drop_duplicates(ignore_index=True), x='Age', color='Gender', facet_col='Season')

For the most part men's ages and women's ages appear to be similarly distributed, and athletes' ages are similarly distributed across summer and winter games. The real surprises, to the extent there are any, are the age extremes for medalists.
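One way to inspect those extremes directly is to pull the youngest and oldest medalist rows. The sketch below uses a small made-up frame with the same `Name`, `Age`, and `Medal` columns as `df`; the names and ages are placeholders, not values from the dataset, and `age_extremes` is a hypothetical helper.

```python
import pandas as pd

# Illustrative rows only: placeholder names and ages, not values from the dataset.
toy = pd.DataFrame({
    'Name': ['Athlete A', 'Athlete B', 'Athlete C', 'Athlete D'],
    'Age': [13.0, 25.0, 61.0, None],
    'Medal': ['Bronze', None, 'Gold', 'Silver'],
})

def age_extremes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the youngest and oldest medalist rows, skipping missing ages."""
    medalists = frame.dropna(subset=['Medal', 'Age'])
    # idxmin/idxmax return index labels, which .loc then looks up
    return medalists.loc[[medalists['Age'].idxmin(), medalists['Age'].idxmax()]]

extremes = age_extremes(toy)
print(extremes)
```

Calling `age_extremes(df)` on the real data should work the same way, assuming the index labels are unique as they are after `read_csv`.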

How has the number of medals varied over time?

In [5]:
histogram(data_frame=df[['Year', 'Medal', 'Medal rank']].dropna(subset='Medal').sort_values(by=['Medal rank'], ascending=False), x='Year', color='Medal')

Here we see:
* The growth in the number of medals awarded, and so of events, over time
* The gaps from games being canceled due to the world wars
* The shift from holding the summer and winter games in the same year to holding them two years apart
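The last point can also be checked without a chart by counting how many distinct seasons occur in each year. This is a minimal sketch on a toy frame with one row per games; on the real data the same `groupby` would run on `df` with its `Year` and `Season` columns.

```python
import pandas as pd

# Toy frame with the same 'Year' and 'Season' columns as df, one row per games.
toy = pd.DataFrame({
    'Year': [1988, 1988, 1992, 1992, 1994, 1996],
    'Season': ['Summer', 'Winter', 'Summer', 'Winter', 'Winter', 'Summer'],
})

# Count distinct seasons per year: 2 means both games were held that year.
seasons_per_year = toy.groupby('Year')['Season'].nunique()
shared_years = seasons_per_year[seasons_per_year == 2].index.tolist()
print(shared_years)
```

Years missing from `shared_years` hosted only one of the two games.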


How do athlete counts break down by team? We have too many teams to visualize them all meaningfully; let's start by looking at the top 50 countries in total athletes.

In [6]:
histogram(data_frame=df['Team'].value_counts().head(n=50).reset_index(), x='Team', y='count')

Let's do the same but just looking at medalists.

In [7]:
histogram(data_frame=df[~df['Medal'].isna()]['Team'].value_counts().head(n=50).reset_index(), x='Team', y='count')

Let's join this data together and see which countries medal more often (on a per-athlete basis).

In [8]:
# row counts per team: all entries, then entries that won a medal
athletes = df['Team'].value_counts().reset_index()
medalists = df[df['Medal'].notna()]['Team'].value_counts().reset_index()
# inner merge keeps only teams with at least one medal; both 'count' columns are renamed below
team_medal_df = athletes.merge(right=medalists, on='Team')
team_medal_df.columns = ['Team', 'athlete', 'medal']
team_medal_df['per athlete'] = team_medal_df['medal'] / team_medal_df['athlete']
histogram(team_medal_df[team_medal_df['athlete'] > 100].sort_values(by='per athlete', ascending=False).head(n=50), x='Team', y='per athlete')

Our data is noisy, which most of the time doesn't matter, but it really shows up here. We have filtered out teams with 100 or fewer athlete entries, but we still have issues with teams whose names end in -1 and -2, which are counted as separate teams.
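One possible cleanup is to strip the numeric suffix before counting. The sketch below assumes that a trailing `-<digits>` marks an additional squad entered under the same team name, which the notebook does not verify. The rows are placeholders and `Team clean` is a hypothetical column name.

```python
import pandas as pd

# Illustrative rows only; team names mimic the suffixed pattern, the rows are made up.
toy = pd.DataFrame({
    'Team': ['Country A', 'Country A-1', 'Country A-2', 'Country B', 'Country B'],
    'Medal': ['Gold', 'Silver', None, None, 'Bronze'],
})

# Assumption: a trailing "-<digits>" denotes an extra squad of the same team.
toy['Team clean'] = toy['Team'].str.replace(r'-\d+$', '', regex=True)

# size counts every row, count skips rows with a missing medal
rates = toy.groupby('Team clean')['Medal'].agg(athlete='size', medal='count')
rates['per athlete'] = rates['medal'] / rates['athlete']
print(rates)
```

As in the cell above, each row is an athlete entry rather than a unique athlete, so the ratio is medals per entry.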

What is the mean age of an Olympic medalist?

In [9]:
# select the Age column directly; positional [0] on a labeled Series is deprecated
print('The mean age of an Olympic medalist is {:5.1f} years'.format(df.loc[df['Medal'].notna(), 'Age'].mean()))

The mean age of an Olympic medalist is  26.0 years


In [10]:
# same selection as for the mean, avoiding the deprecated positional [0] lookup
print('The median age of an Olympic medalist is {:5.1f} years'.format(df.loc[df['Medal'].notna(), 'Age'].median()))

The median age of an Olympic medalist is  25.0 years


In [11]:
# we can break this out by season
df[~df['Medal'].isna()][['Age', 'Season']].groupby(by='Season').median()

Season,Age
Summer,25.0
Winter,26.0


In [12]:
# or we can break it out by sex
df[~df['Medal'].isna()][['Age', 'Gender']].groupby(by='Gender').median()

Gender,Age
Female,24.0
Male,25.0


The median Olympic medalist is 24 to 26 years old no matter how we slice the data, as we expected from the age histograms above.
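The two breakdowns can also be combined into a single season-by-sex table. This is a minimal sketch on made-up rows that share the `Age`, `Gender`, `Season`, and `Medal` columns of `df`; the ages are placeholders, not dataset values.

```python
import pandas as pd

# Illustrative rows only; ages and medals are placeholders.
toy = pd.DataFrame({
    'Age': [22.0, 24.0, 26.0, 28.0, 30.0, 21.0],
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
    'Season': ['Summer', 'Winter', 'Summer', 'Summer', 'Winter', 'Summer'],
    'Medal': ['Gold', 'Silver', 'Bronze', 'Gold', None, 'Bronze'],
})

medalists = toy[toy['Medal'].notna()]
# group on both keys, then pivot Gender into columns for a compact table
median_age = medalists.groupby(['Season', 'Gender'])['Age'].median().unstack('Gender')
print(median_age)
```

Cells with no medalists in a given season and sex come out as `NaN`.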

We can also summarize participation by sport and year; it is fascinating to see how sports have gone in and out of fashion at the Olympic level over time.

In [13]:
from plotly.express import line
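# name the size column explicitly instead of relying on the default integer label 0
line(data_frame=df[['Sport', 'Year']].groupby(by=['Sport', 'Year']).size().reset_index(name='count'), x='Year', y='count', color='Sport', height=1500, log_y=True)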

In [14]:
histogram(data_frame=df['City'].value_counts().reset_index(), x='City', y='count')