<p style="color: red; font-size: 16pt; font-weight: bold; text-align:center;">Change the name of this notebook before you edit!</p>

# Olympic Athletes

https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results/data

## Context
This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.

Note that the Winter and Summer Games were held in the same year up until 1992. After that, they were staggered so that the Winter Games occur on a four-year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake when analyzing this data is to assume that the Summer and Winter Games have always been staggered.
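
The staggering described above can be checked from the `Year` and `Season` columns alone. The following is a minimal sketch, assuming a small hand-made frame called `toy` (a hypothetical stand-in for the full dataset, containing only the two columns needed):

```python
import pandas as pd

# Hypothetical miniature stand-in for athlete_events.csv (illustration only).
toy = pd.DataFrame({
    "Year":   [1988, 1988, 1992, 1992, 1994, 1996, 1998],
    "Season": ["Summer", "Winter", "Summer", "Winter", "Winter", "Summer", "Winter"],
})

# Count how many distinct seasons occur in each year.
seasons_per_year = toy.groupby("Year")["Season"].nunique()

# Years in which both Summer and Winter Games took place.
shared_years = seasons_per_year[seasons_per_year == 2].index.tolist()
print(shared_years)
```

Running the same two lines on the real `df` should list every year that hosted both seasons, which makes the pre-1994 overlap easy to account for when grouping by `Year`.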

## Content
The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:

```
ID - Unique number for each athlete
Name - Athlete's name
Sex - M or F
Age - Integer
Height - In centimeters
Weight - In kilograms
Team - Team name
NOC - National Olympic Committee 3-letter code
Games - Year and season
Year - Integer
Season - Summer or Winter
City - Host city
Sport - Sport
Event - Event
Medal - Gold, Silver, Bronze, or NA
```
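
Because `Medal` uses the literal string `NA` for entries without a medal, pandas' default CSV parsing is expected to read those cells as missing values (`NaN`) rather than as text. A small sketch of that behaviour, using an in-memory CSV with made-up rows (the names `csv_text` and `sample` are assumptions for this illustration):

```python
import io
import pandas as pd

# Made-up rows in the same spirit as athlete_events.csv (illustration only).
csv_text = """ID,Name,Medal
1,Athlete A,NA
2,Athlete B,Gold
3,Athlete C,NA
"""

sample = pd.read_csv(io.StringIO(csv_text))

# "NA" is among pandas' default missing-value markers, so these cells become NaN.
n_missing = sample["Medal"].isna().sum()

# dropna=False keeps the missing entries in the tally.
medal_counts = sample["Medal"].value_counts(dropna=False)
print(n_missing)
print(medal_counts)
```

In practice this means that `df.Medal.notna()` can serve as a "won a medal" mask.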

## Acknowledgements
The Olympic data on www.sports-reference.com is the result of an incredible amount of research by a group of Olympic history enthusiasts and self-proclaimed 'statistorians'. Check out their blog for more information. All I did was consolidate their decades of work into a convenient format for data analysis.

## Inspiration
This dataset provides an opportunity to ask questions about how the Olympics have evolved over time, including questions about the participation and performance of women, different nations, and different sports and events.

# Setup Libraries and Functions

In [1]:
%reload_ext autoreload
%autoreload

In [2]:
import os
import sys
import re
import json
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load Data

In [3]:
! ls -lh /data/IFI8410/olympics/

total 40M
-rw-rw-r--. 1 pmolnar ifi8410_instructor  40M Oct 23  2023 athlete_events.csv
-rw-rw-r--. 1 pmolnar ifi8410_instructor 3.6K Oct 23  2023 noc_regions.csv
-rw-rw-r--. 1 pmolnar ifi8410_instructor  114 Oct 23  2023 README.md


In [4]:
data_file = '/data/IFI8410/olympics/athlete_events.csv'
df = pd.read_csv(data_file)
print(f"Number of rows: {df.shape[0]:,}\n")
display(df.dtypes)

Number of rows: 271,116



ID          int64
Name       object
Sex        object
Age       float64
Height    float64
Weight    float64
Team       object
NOC        object
Games      object
Year        int64
Season     object
City       object
Sport      object
Event      object
Medal      object
dtype: object

# Simple Questions about the data

In [5]:
df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


## How many athletes are listed in this database?

How many unique IDs?

In [6]:
type(df['ID'])

pandas.core.series.Series

In [7]:
#df['ID']
id_col = df.ID
id_col_uniq = id_col.drop_duplicates()
id_col_uniq.count()

135571

In [8]:
N_unique_athletes = df.ID \
    .drop_duplicates().count()
print(f"The database contains {N_unique_athletes:,} athletes.")

The database contains 135,571 athletes.
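
An equivalent and slightly shorter way to count distinct athletes is `Series.nunique()`. The sketch below uses a hypothetical toy frame to show that it agrees with the `drop_duplicates().count()` approach, and that the row count is larger because one athlete can appear in several events:

```python
import pandas as pd

# Hypothetical IDs: athlete 2 has two entries, athlete 3 has three.
toy = pd.DataFrame({"ID": [1, 2, 2, 3, 3, 3]})

n_rows = len(toy)                      # athlete-event rows
n_athletes = toy["ID"].nunique()       # distinct athletes
n_athletes_alt = toy["ID"].drop_duplicates().count()

print(f"{n_rows} rows, {n_athletes} athletes")
```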


## What was the youngest age? What was the oldest age?


In [9]:
df.Age.min(), df.Age.max(), df.Age.mean()

(10.0, 97.0, 25.556898357297374)

Who was the youngest? Who was the oldest?

In [10]:
df.sort_values('Age').head(1)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
142882,71691,Dimitrios Loundras,M,10.0,,,Ethnikos Gymnastikos Syllogos,GRE,1896 Summer,1896,Summer,Athina,Gymnastics,"Gymnastics Men's Parallel Bars, Teams",Bronze


In [11]:
df.sort_values('Age', ascending=False).head(1)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
257054,128719,John Quincy Adams Ward,M,97.0,,,United States,USA,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Sculpturing, Statues",
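
Sorting the whole frame works, but the row holding the extreme value can also be looked up directly with `idxmin()` and `idxmax()`, which skip missing ages by default. A minimal sketch on a hypothetical toy frame (the names and rows are invented for illustration):

```python
import pandas as pd

# Invented rows; one age is missing to show that NaN is skipped.
toy = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D"],
    "Age":  [10.0, 34.0, None, 97.0],
})

# idxmin/idxmax return the index label of the smallest/largest value.
youngest = toy.loc[toy["Age"].idxmin()]
oldest = toy.loc[toy["Age"].idxmax()]

print(youngest["Name"], youngest["Age"])
print(oldest["Name"], oldest["Age"])
```

Both calls return a single row, so ties are not shown; `sort_values` followed by `head(n)` remains the simpler choice when several rows are wanted.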


## Aggregation: Average Age, Height, Weight etc.

In [12]:
df.Height.mean()

175.33896987366376

In [13]:
df.Weight.mean()

70.70239290053351

In [14]:
df.Age.mean()

25.556898357297374

How about by gender?

In [15]:
'M' == 'F'

False

In [16]:
male_athlete_mask = df.Sex =='M'
female_athlete_mask = df.Sex =='F'

summer_mask = df.Season == 'Summer'
df[female_athlete_mask | summer_mask].shape

(237631, 15)

In [17]:
df[female_athlete_mask]

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
5,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres",
6,5,Christine Jacoba Aaftink,F,25.0,185.0,82.0,Netherlands,NED,1992 Winter,1992,Winter,Albertville,Speed Skating,Speed Skating Women's 500 metres,
7,5,Christine Jacoba Aaftink,F,25.0,185.0,82.0,Netherlands,NED,1992 Winter,1992,Winter,Albertville,Speed Skating,"Speed Skating Women's 1,000 metres",
8,5,Christine Jacoba Aaftink,F,27.0,185.0,82.0,Netherlands,NED,1994 Winter,1994,Winter,Lillehammer,Speed Skating,Speed Skating Women's 500 metres,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271080,135553,Galina Ivanovna Zybina (-Fyodorova),F,33.0,168.0,80.0,Soviet Union,URS,1964 Summer,1964,Summer,Tokyo,Athletics,Athletics Women's Shot Put,Bronze
271099,135560,Stavroula Zygouri,F,36.0,171.0,63.0,Greece,GRE,2004 Summer,2004,Summer,Athina,Wrestling,"Wrestling Women's Middleweight, Freestyle",
271102,135563,Olesya Nikolayevna Zykina,F,19.0,171.0,64.0,Russia,RUS,2000 Summer,2000,Summer,Sydney,Athletics,Athletics Women's 4 x 400 metres Relay,Bronze
271103,135563,Olesya Nikolayevna Zykina,F,23.0,171.0,64.0,Russia,RUS,2004 Summer,2004,Summer,Athina,Athletics,Athletics Women's 4 x 400 metres Relay,Silver


In [18]:
df[df.Sex=='F'][['Name', 'Year', 'Season', 'City', 'Sport', 'Event']].head()

Unnamed: 0,Name,Year,Season,City,Sport,Event
4,Christine Jacoba Aaftink,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres
5,Christine Jacoba Aaftink,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres"
6,Christine Jacoba Aaftink,1992,Winter,Albertville,Speed Skating,Speed Skating Women's 500 metres
7,Christine Jacoba Aaftink,1992,Winter,Albertville,Speed Skating,"Speed Skating Women's 1,000 metres"
8,Christine Jacoba Aaftink,1994,Winter,Lillehammer,Speed Skating,Speed Skating Women's 500 metres


In [19]:
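# Note: the male average age below is restricted to Summer Games; every other figure covers both seasons.
print(f"""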
Males:
    avg. age    = {df[(df.Sex=='M') & (df.Season=='Summer')].Age.mean():8.3f}
    avg. weight = {df[df.Sex=='M'].Weight.mean():8.3f}
    avg. height = {df[df.Sex=='M'].Height.mean():8.3f}
""")
print(f"""
Females:
    avg. age    = {df[df.Sex=='F'].Age.mean():8.3f}
    avg. weight = {df[df.Sex=='F'].Weight.mean():8.3f}
    avg. height = {df[df.Sex=='F'].Height.mean():8.3f}
""")



Males:
    avg. age    =   26.444
    avg. weight =   75.744
    avg. height =  178.858


Females:
    avg. age    =   23.733
    avg. weight =   60.021
    avg. height =  167.840
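
The per-gender averages can also be computed in a single step with `groupby`, which avoids repeating the boolean masks. This is a sketch on a hypothetical toy frame, not the course's reference solution:

```python
import pandas as pd

# Invented measurements; missing values are ignored by mean().
toy = pd.DataFrame({
    "Sex":    ["M", "M", "F", "F"],
    "Age":    [24.0, 26.0, 21.0, 25.0],
    "Height": [180.0, 170.0, 185.0, None],
    "Weight": [80.0, 60.0, 82.0, None],
})

# One row per gender, one column per averaged measure.
by_sex = toy.groupby("Sex")[["Age", "Height", "Weight"]].mean()
print(by_sex.round(3))
```

Applied to the real data, `df.groupby('Sex')[['Age', 'Height', 'Weight']].mean()` averages over all seasons for both genders.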



## Other statistical measures: min, max, standard deviation, etc.

In [20]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,271116.0,68248.954396,39022.286345,1.0,34643.0,68205.0,102097.25,135571.0
Age,261642.0,25.556898,6.393561,10.0,21.0,24.0,28.0,97.0
Height,210945.0,175.33897,10.518462,127.0,168.0,175.0,183.0,226.0
Weight,208241.0,70.702393,14.34802,25.0,60.0,70.0,79.0,214.0
Year,271116.0,1978.37848,29.877632,1896.0,1960.0,1988.0,2002.0,2016.0


# Grouping and Aggregation

## How many athletes per Games (year, season)?

In [21]:
grouped_df = df.groupby(['Year', 'Season']).agg({'ID': 'count', 'Age': ['min', 'mean', 'max']})

In [22]:
grouped_df.head()

,,ID,Age,Age,Age
,,count,min,mean,max
Year,Season,,,,
1896,Summer,380,10.0,23.580645,40.0
1900,Summer,1936,13.0,29.034031,71.0
1904,Summer,1301,14.0,26.69815,71.0
1906,Summer,1733,13.0,27.125253,54.0
1908,Summer,3101,14.0,26.970228,61.0


In [23]:
grouped_df.shape

(51, 4)

In [24]:
grouped_df.reset_index().sort_values('Year', ascending=False).head()

,Year,Season,ID,Age,Age,Age
,,,count,min,mean,max
50,2016,Summer,13688,13.0,26.207919,62.0
49,2014,Winter,4891,15.0,25.987324,55.0
48,2012,Summer,12920,13.0,25.961378,71.0
47,2010,Winter,4402,15.0,26.124262,51.0
46,2008,Summer,13602,12.0,25.734118,67.0
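
Note that `'ID': 'count'` counts rows, i.e. athlete-event entries, so an athlete who competes in several events is counted several times. To count distinct athletes per Games, `nunique` can be used instead. A minimal sketch with a hypothetical toy frame and named aggregation (the column names `entries` and `athletes` are chosen here for illustration):

```python
import pandas as pd

# Invented rows: athlete 5 enters two events at each Games, athlete 7 enters one.
toy = pd.DataFrame({
    "ID":     [5, 5, 5, 5, 7],
    "Year":   [1988, 1988, 1992, 1992, 1992],
    "Season": ["Winter"] * 5,
})

per_games = toy.groupby(["Year", "Season"]).agg(
    entries=("ID", "count"),      # rows (athlete-event entries)
    athletes=("ID", "nunique"),   # distinct athletes
)
print(per_games)
```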


In [25]:
# grouped_df['Season']

## Table of "top" 3 athletes per game

E.g. the three oldest athletes (the example here groups by `Season` only, not by individual Games).

In [26]:
def pick_oldest(tempdf):
    return tempdf.sort_values('Age', ascending=False).head(3)

In [27]:
df.groupby(['Season']).apply(pick_oldest)

,,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
Season,,,,,,,,,,,,,,,,
Summer,257054,128719,John Quincy Adams Ward,M,97.0,,,United States,USA,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Sculpturing, Statues",
Summer,98118,49663,Winslow Homer,M,96.0,,,United States,USA,1932 Summer,1932,Summer,Los Angeles,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
Summer,60862,31173,Thomas Cowperthwait Eakins,M,88.0,,,United States,USA,1932 Summer,1932,Summer,Los Angeles,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
Winter,127505,64263,Carl August Verner Kronlund,M,58.0,,,Sweden,SWE,1924 Winter,1924,Winter,Chamonix,Curling,Curling Men's Curling,Silver
Winter,30323,15656,Charles Granville Bruce,M,57.0,,,Great Britain,GBR,1924 Winter,1924,Winter,Chamonix,Alpinism,Alpinism Mixed Alpinism,Gold
Winter,254305,127321,Hubertus Rudolph von Frstenberg-von Hohenlohe-...,M,55.0,183.0,77.0,Mexico,MEX,2014 Winter,2014,Winter,Sochi,Alpine Skiing,Alpine Skiing Men's Slalom,
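
A "top n per group" table can also be built without `apply`, by sorting once and then taking the first rows of each group with `groupby(...).head(n)`. The sketch below uses a hypothetical toy frame with invented names:

```python
import pandas as pd

# Invented rows: four Summer entries and two Winter entries.
toy = pd.DataFrame({
    "Name":   ["A", "B", "C", "D", "E", "F"],
    "Season": ["Summer", "Summer", "Summer", "Summer", "Winter", "Winter"],
    "Age":    [97.0, 40.0, 88.0, 96.0, 58.0, 55.0],
})

# Sort by age (oldest first), then keep at most three rows per season.
oldest3 = (
    toy.sort_values("Age", ascending=False)
       .groupby("Season")
       .head(3)
)
print(oldest3)
```

Grouping by `['Year', 'Season']` instead of `'Season'` would give the three oldest entries for each individual Games. The same athlete can still appear more than once if they entered several events.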


# Pivot Tables

In [28]:
pd.pivot_table(df, index='Year', columns=['Season', 'Sex'], values='ID', aggfunc='count', fill_value=0)

Season,Summer,Summer,Winter,Winter
Sex,F,M,F,M
Year,,,,
1896,0,380,0,0
1900,33,1903,0,0
1904,16,1285,0,0
1906,11,1722,0,0
1908,47,3054,0,0
1912,87,3953,0,0
1920,134,4158,0,0
1924,244,4989,17,443
1928,404,4588,33,549
1932,347,2622,22,330
