<p style="color: red; font-size: 16pt; font-weight: bold; text-align:center;">Change the name of this notebook before you edit!</p>

# Olympic Athletes

https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results/data

## Context
This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.

Note that the Winter and Summer Games were held in the same year up until 1992. After that, they were staggered so that the Winter Games occur on a four-year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake when analyzing this data is to assume that the Summer and Winter Games have always been staggered.
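
One way to check this in the data is to count how many distinct seasons appear in each year. The sketch below uses a small hand-made frame with the same `Year` and `Season` column names as `athlete_events.csv`; the rows and the names `sample`, `seasons_per_year`, and `shared_years` are illustrative assumptions, not taken from the file.

```python
import pandas as pd

# Illustrative rows only; the real file has one row per athlete-event.
sample = pd.DataFrame({
    "Year":   [1992, 1992, 1994, 1996, 1998],
    "Season": ["Summer", "Winter", "Winter", "Summer", "Winter"],
})

# Number of distinct seasons held in each year.
seasons_per_year = sample.groupby("Year")["Season"].nunique()

# Years in which both Summer and Winter Games took place.
shared_years = seasons_per_year[seasons_per_year == 2].index.tolist()
print(shared_years)
```

Applied to the full data, the same grouping should list only years up to 1992, which is why grouping by `Year` alone can mix two different Games.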

## Content
The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:

```
ID - Unique number for each athlete
Name - Athlete's name
Sex - M or F
Age - Integer
Height - In centimeters
Weight - In kilograms
Team - Team name
NOC - National Olympic Committee 3-letter code
Games - Year and season
Year - Integer
Season - Summer or Winter
City - Host city
Sport - Sport
Event - Event
Medal - Gold, Silver, Bronze, or NA
```
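
Because non-medalists are recorded as `NA`, it helps to know how pandas reads that value. The sketch below uses two made-up rows (the athlete names are placeholders) to show that `pd.read_csv` treats the string `NA` as a missing value by default, so medal winners can be selected with `dropna`.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the Name and Medal columns.
csv_text = "Name,Medal\nAthlete A,Gold\nAthlete B,NA\n"

sample = pd.read_csv(io.StringIO(csv_text))

# pandas converts the string "NA" to NaN when reading the file.
print(sample["Medal"].isna().tolist())

# Keep only rows where a medal was won.
medalists = sample.dropna(subset=["Medal"])
print(medalists)
```

The same default explains why a column such as `Age` loads as `float64` rather than an integer type: any missing entry forces pandas to use a float column.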

## Acknowledgements
The Olympic data on www.sports-reference.com is the result of an incredible amount of research by a group of Olympic history enthusiasts and self-proclaimed 'statistorians'. Check out their blog for more information. All I did was consolidate their decades of work into a convenient format for data analysis.

## Inspiration
This dataset provides an opportunity to ask questions about how the Olympics have evolved over time, including questions about the participation and performance of women, different nations, and different sports and events.

# Setup Libraries and Functions

In [3]:
%reload_ext autoreload
%autoreload

In [4]:
import os
import sys
import re
import json
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load Data

In [5]:
! ls -lh /data/IFI8410/olympics/

total 40M
-rw-rw-r--. 1 pmolnar pmolnar  114 Oct 23 10:57 README.md
-rw-rw-r--. 1 pmolnar pmolnar  40M Oct 23 10:42 athlete_events.csv
-rw-rw-r--. 1 pmolnar pmolnar 3.6K Oct 23 10:42 noc_regions.csv


In [6]:
data_file = '/data/IFI8410/olympics/athlete_events.csv'
df = pd.read_csv(data_file)
print(f"Number of rows: {df.shape[0]:,}\n")
display(df.dtypes)

Number of rows: 271,116



ID          int64
Name       object
Sex        object
Age       float64
Height    float64
Weight    float64
Team       object
NOC        object
Games      object
Year        int64
Season     object
City       object
Sport      object
Event      object
Medal      object
dtype: object

# Simple Questions about the data

## How many athletes are listed in this database?

How many unique IDs?

In [7]:
N_unique_athletes = df.ID.nunique()  # count distinct athlete IDs
print(f"The database contains {N_unique_athletes:,} athletes.")

The database contains 135,571 athletes.


## What was the youngest age? What was the oldest age?


In [11]:
df.Age.min(), df.Age.max()

(10.0, 97.0)

Who was the youngest? Who was the oldest?

In [12]:
df.sort_values('Age').head(1)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
142882,71691,Dimitrios Loundras,M,10.0,,,Ethnikos Gymnastikos Syllogos,GRE,1896 Summer,1896,Summer,Athina,Gymnastics,"Gymnastics Men's Parallel Bars, Teams",Bronze


In [13]:
df.sort_values('Age', ascending=False).head(1)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
257054,128719,John Quincy Adams Ward,M,97.0,,,United States,USA,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Sculpturing, Statues",
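
As an alternative to sorting the whole frame twice, the row labels of the extremes can be looked up directly. This is a minimal sketch on made-up rows; only the column names `Name` and `Age` match the real data, and `extremes` is a name chosen for illustration.

```python
import pandas as pd

# Made-up rows; one age is missing, as happens in the real data.
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age":  [24.0, 10.0, None, 97.0],
})

# idxmin/idxmax skip missing ages and return the row labels of the extremes.
extremes = sample.loc[[sample["Age"].idxmin(), sample["Age"].idxmax()]]
print(extremes)
```

Like `head(1)` after sorting, this returns a single row per extreme, so athletes tied for the youngest or oldest age are not all shown.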


## Aggregation: Average Age, Height, Weight etc.

In [14]:
df.Height.mean()

175.33896987366376

In [15]:
df.Weight.mean()

70.70239290053351

In [16]:
df.Age.mean()

25.556898357297374

How about by gender?

In [21]:
print(f"""
Males:
    avg. age    = {df[df.Sex=='M'].Age.mean():8.3f}
    avg. weight = {df[df.Sex=='M'].Weight.mean():8.3f}
    avg. height = {df[df.Sex=='M'].Height.mean():8.3f}
""")
print(f"""
Females:
    avg. age    = {df[df.Sex=='F'].Age.mean():8.3f}
    avg. weight = {df[df.Sex=='F'].Weight.mean():8.3f}
    avg. height = {df[df.Sex=='F'].Height.mean():8.3f}
""")



Males:
    avg. age    =   26.278
    avg. weight =   75.744
    avg. height =  178.858


Females:
    avg. age    =   23.733
    avg. weight =   60.021
    avg. height =  167.840
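
The per-sex averages above repeat the same filter for every value. A `groupby` computes them in one pass; the sketch below assumes the same column names and uses a few made-up rows, so its numbers are not the ones from the real data.

```python
import pandas as pd

# Made-up rows with the same column names as the real data.
sample = pd.DataFrame({
    "Sex":    ["M", "M", "F", "F"],
    "Age":    [30.0, 20.0, 22.0, None],
    "Weight": [80.0, 70.0, 60.0, 58.0],
    "Height": [180.0, 178.0, 168.0, 166.0],
})

# One row per sex, one column per measure; missing values are ignored by mean().
by_sex = sample.groupby("Sex")[["Age", "Weight", "Height"]].mean()
print(by_sex.round(3))
```

On the full data the equivalent call would be `df.groupby("Sex")[["Age", "Weight", "Height"]].mean()`.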



## Other statistical measures: min, max, standard deviation, etc.

In [27]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,271116.0,68248.954396,39022.286345,1.0,34643.0,68205.0,102097.25,135571.0
Age,261642.0,25.556898,6.393561,10.0,21.0,24.0,28.0,97.0
Height,210945.0,175.33897,10.518462,127.0,168.0,175.0,183.0,226.0
Weight,208241.0,70.702393,14.34802,25.0,60.0,70.0,79.0,214.0
Year,271116.0,1978.37848,29.877632,1896.0,1960.0,1988.0,2002.0,2016.0


# Grouping and Aggregation

## How many athletes per Games (year, season)?
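
One possible approach, sketched on made-up rows: group by `Year` and `Season` and count distinct `ID` values, since each athlete has one row per event. The name `athletes_per_games` and the output column `Athletes` are assumptions made for this example.

```python
import pandas as pd

# Made-up rows: athlete 1 competes in two events at the same Games.
sample = pd.DataFrame({
    "ID":     [1, 1, 2, 3, 3],
    "Year":   [1992, 1992, 1992, 1994, 1994],
    "Season": ["Summer", "Summer", "Summer", "Winter", "Winter"],
})

# Count distinct athlete IDs per Games rather than rows.
athletes_per_games = (
    sample.groupby(["Year", "Season"])["ID"]
          .nunique()
          .rename("Athletes")
          .reset_index()
)
print(athletes_per_games)
```

Counting rows with `size()` instead would give the number of athlete-event entries, which is larger whenever an athlete enters several events.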

## Table of top 3 athletes per Games
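
The heading does not say how "top" is measured. The sketch below assumes it means the athletes with the most medals at each Games, and it runs on made-up rows with hypothetical athlete IDs; `medal_counts` and `top3` are names chosen for illustration.

```python
import pandas as pd

# Hypothetical athlete ID -> number of medals won at one Games.
medals_won = {1: 3, 2: 1, 3: 4, 4: 2}
rows = [
    ("1992 Summer", athlete_id, "Gold")
    for athlete_id, n in medals_won.items()
    for _ in range(n)
]
rows.append(("1992 Summer", 5, None))  # competed without winning a medal
sample = pd.DataFrame(rows, columns=["Games", "ID", "Medal"])

# Medals per athlete per Games; rows without a medal are dropped first.
medal_counts = (
    sample.dropna(subset=["Medal"])
          .groupby(["Games", "ID"])
          .size()
          .rename("Medals")
          .reset_index()
)

# Sort within each Games by medal count and keep the first three rows.
top3 = (
    medal_counts.sort_values(["Games", "Medals"], ascending=[True, False])
                .groupby("Games")
                .head(3)
)
print(top3)
```

Grouping on `ID` rather than `Name` avoids merging different athletes who share a name. Ties at the cut-off are broken arbitrarily by `head(3)`.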

# Pivot Tables
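
As a starting point, a pivot table can cross-tabulate two categorical columns. The sketch below counts medal entries by `NOC` and `Medal` on made-up rows; `medal_table` is a name chosen for illustration.

```python
import pandas as pd

# Made-up rows; None marks an entry without a medal.
sample = pd.DataFrame({
    "NOC":   ["USA", "USA", "GRE", "GRE", "USA"],
    "Medal": ["Gold", "Silver", "Bronze", None, "Gold"],
    "ID":    [1, 2, 3, 4, 5],
})

# Rows: NOC, columns: medal type, cells: number of entries.
# Entries with a missing Medal are left out of the table.
medal_table = sample.pivot_table(
    index="NOC", columns="Medal", values="ID", aggfunc="count", fill_value=0
)
print(medal_table)
```

Note that this counts athlete-event rows, so a medal in a team event is counted once per team member rather than once per team.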