# Olympic Games Data Exploration and Analysis (1894 - Present)

## Introduction

For more than a century, the modern Olympic Games have been a cornerstone of entertainment, culture, and international collaboration. Inspired by ancient Greek traditions, the International Olympic Committee (IOC), founded in 1894, hosted the first modern games in 1896. Since then, the games have evolved to include the Paralympic Games for athletes with disabilities, greater gender equality with an increase in female athletes, and alternating years for the summer and winter games. By adapting to changes in culture, technology, and economies, the Olympic Games have strengthened their relevance and impact in our modern world and remain an iconic symbol of the relationships and camaraderie that exist across political borders.

In the following project, equipped with near-comprehensive data from all the modern Olympic games, we will explore various facets of the games, including:
- gender
- types of athletes
- Olympic medalists
- participating countries
- summer vs. winter games
- trends over time

Throughout these sections, we will visualize the patterns, changes, and composition of the games from these different angles. We will attempt to predict an athlete's gender, as well as their likelihood of placing in their event, based on their height, weight, and age data. By mapping the proportion of medals achieved by participating athletes, we will illustrate which countries experience the most success at the Olympic Games. Through this work, we develop a foundational understanding of the structure and evolution of the Olympics and set up the possibility of extensive future work with this data.

Below, we start this process by loading, scrubbing, and saving the data as a more usable dataframe for the exploration and analysis in the following sections.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import keras

Using TensorFlow backend.


In [2]:
event_data = pd.read_csv('athlete_events.csv')
region_data = pd.read_csv('noc_regions.csv')

In [3]:
event_data.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [4]:
region_data.head()

Unnamed: 0,NOC,region,notes
0,AFG,Afghanistan,
1,AHO,Curacao,Netherlands Antilles
2,ALB,Albania,
3,ALG,Algeria,
4,AND,Andorra,


In [5]:
event_data.isna().sum()

ID             0
Name           0
Sex            0
Age         9474
Height     60171
Weight     62875
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64

Based on our initial data upload, we notice thousands of missing values in just four columns: Age, Height, Weight, and Medal.

In [6]:
len(event_data)

271116

In [7]:
62875/len(event_data)

0.23191180159046312

The dataset has just over 270,000 rows. Weight, the column with the most gaps, is missing in about 23% of them. If the missing Age and Height values fall largely in those same rows, then roughly 23% of the rows are affected by missing values in Age, Height, or Weight.
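The overlap between the missing values is an assumption here, and it can be checked directly instead. The following is a minimal sketch of that check: it counts the rows that are missing at least one of the three columns. The helper name `share_rows_missing_any` and the toy frame are hypothetical stand-ins; in this notebook the helper would be called on `event_data`.

```python
import numpy as np
import pandas as pd

def share_rows_missing_any(df, cols):
    """Fraction of rows with a missing value in at least one of `cols`."""
    return df[cols].isna().any(axis=1).mean()

# Toy frame standing in for event_data (values are illustrative only)
toy = pd.DataFrame({
    "Age":    [24.0, np.nan, 30.0, 21.0],
    "Height": [180.0, np.nan, np.nan, 185.0],
    "Weight": [80.0, np.nan, np.nan, 82.0],
})

cols = ["Age", "Height", "Weight"]
per_column = toy[cols].isna().mean()             # share missing, column by column
any_missing = share_rows_missing_any(toy, cols)  # share of rows with any gap
print(per_column)
print(any_missing)
```

If the any-column share is close to the largest per-column share, the gaps mostly overlap. If it is much larger, they are spread across different rows.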

In [8]:
event_data.Medal.value_counts()

Gold      13372
Bronze    13295
Silver    13116
Name: Medal, dtype: int64

The current values in the Medal column only indicate whether a given athlete received a gold, silver, or bronze medal. Thus, the missing values must indicate athletes who did not win a medal in their event. Below, we replace these null values with the string "None" to indicate this lack of a medal.

In [9]:
event_data["Medal"] = event_data["Medal"].fillna("None")

In [10]:
event_data.Medal.value_counts()

None      231333
Gold       13372
Bronze     13295
Silver     13116
Name: Medal, dtype: int64
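The replacement above can be double-checked by tallying the column with the missing entries included. The sketch below uses a small hypothetical Series rather than the real `Medal` column. It relies on `value_counts(dropna=False)` and on the assignment form of `fillna`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for event_data.Medal (values are illustrative only)
medal = pd.Series(["Gold", np.nan, np.nan, "Silver", np.nan], name="Medal")

# dropna=False keeps the missing entries in the tally
before = medal.value_counts(dropna=False)

# "None" is a plain string label here, not Python's None object
filled = medal.fillna("None")
after = filled.value_counts()

print(before)
print(after)
```

The count of the new "None" label should equal the number of missing entries reported before the fill.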

To fill the null values for age, height, and weight, we must account for the significant differences based on gender. For height and weight, we will fill the null values with the median for that athlete's gender, to avoid changing these distributions in a way that affects our upcoming data exploration. The median ages for males and females (25 and 23, as shown below) are close, so we will fill the missing ages with the overall median age.

In [45]:
event_data.groupby('Sex')["Height"].median()

Sex
F    168.0
M    179.0
Name: Height, dtype: float64

In [46]:
event_data.groupby('Sex')["Weight"].median()

Sex
F    59.0
M    74.0
Name: Weight, dtype: float64

In [44]:
event_data.groupby('Sex')["Age"].median()

Sex
F    23.0
M    25.0
Name: Age, dtype: float64

In [43]:
# Fill missing Height and Weight with the median for each athlete's sex.
# Assigning through .loc on the DataFrame itself avoids chained assignment,
# which would raise SettingWithCopyWarning and may not update event_data.
medians = {"Height": {"F": 168.0, "M": 179.0},
           "Weight": {"F": 59.0, "M": 74.0}}
for col, by_sex in medians.items():
    for sex, value in by_sex.items():
        mask = event_data[col].isna() & (event_data["Sex"] == sex)
        event_data.loc[mask, col] = value


In [48]:
event_data["Age"] = event_data["Age"].fillna(event_data["Age"].median())
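The height and weight fill hard-codes the per-sex medians printed earlier. As a possible alternative, the medians can be computed and applied in one step with `groupby(...).transform("median")`. This is a minimal sketch under that assumption, not the method used in this notebook. The helper `fill_with_group_median` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_group_median(df, group_col, value_cols):
    """Return a copy of df with NaNs in value_cols replaced by the per-group median."""
    out = df.copy()
    for col in value_cols:
        # transform returns a Series aligned with out, holding each row's group median
        group_median = out.groupby(group_col)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    return out

# Toy frame standing in for event_data (values are illustrative only)
toy = pd.DataFrame({
    "Sex":    ["F", "F", "F", "M", "M", "M"],
    "Height": [160.0, 170.0, np.nan, 175.0, 185.0, np.nan],
    "Weight": [55.0, np.nan, 65.0, 70.0, np.nan, 80.0],
})

filled = fill_with_group_median(toy, "Sex", ["Height", "Weight"])
print(filled)
```

Because the helper works on a copy, the input frame is left unchanged, which makes it easy to compare the distributions before and after the fill.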

Following this process, we see that the data has no more missing values. Thus we can proceed with describing and saving the data for future use.

In [49]:
event_data.isna().sum()

ID        0
Name      0
Sex       0
Age       0
Height    0
Weight    0
Team      0
NOC       0
Games     0
Year      0
Season    0
City      0
Sport     0
Event     0
Medal     0
dtype: int64

In [50]:
event_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
ID        271116 non-null int64
Name      271116 non-null object
Sex       271116 non-null object
Age       271116 non-null float64
Height    271116 non-null float64
Weight    271116 non-null float64
Team      271116 non-null object
NOC       271116 non-null object
Games     271116 non-null object
Year      271116 non-null int64
Season    271116 non-null object
City      271116 non-null object
Sport     271116 non-null object
Event     271116 non-null object
Medal     271116 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB


In [51]:
event_data.describe()

Unnamed: 0,ID,Age,Height,Weight,Year
count,271116.0,271116.0,271116.0,271116.0,271116.0
mean,68248.954396,25.502493,175.861639,71.038308,1978.37848
std,39022.286345,6.287361,9.478962,12.811563,29.877632
min,1.0,10.0,127.0,25.0,1896.0
25%,34643.0,22.0,170.0,62.0,1960.0
50%,68205.0,24.0,178.0,73.0,1988.0
75%,102097.25,28.0,180.0,75.0,2002.0
max,135571.0,97.0,226.0,214.0,2016.0


In [54]:
len(event_data.Event.value_counts())

765

In [67]:
event_data.to_csv(r'cleaned_data.csv')

With our data fully scrubbed, we are ready to begin data exploration in the following sections. Because the DataFrame is saved as 'cleaned_data.csv', we can easily load and use it in each section.
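Two details can matter when the saved file is loaded again. Writing the index adds an extra unnamed column on reload, and depending on the pandas version, `read_csv` may parse the literal string "None" as a missing value by default. The sketch below shows one way to guard against both. It uses an in-memory buffer and a toy frame as stand-ins for 'cleaned_data.csv' and `event_data`, and the chosen `read_csv` options are an assumption about what later sections need.

```python
import io
import pandas as pd

# Toy stand-in for the cleaned event_data
toy = pd.DataFrame({
    "ID": [1, 4],
    "Name": ["A Dijiang", "Edgar Lindenau Aabye"],
    "Medal": ["None", "Gold"],
})

buffer = io.StringIO()            # stands in for the file on disk
toy.to_csv(buffer, index=False)   # index=False: no extra index column on reload
buffer.seek(0)

# Treat only empty fields as missing so the "None" label survives as text
reloaded = pd.read_csv(buffer, keep_default_na=False, na_values=[""])
print(reloaded)
```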