# Dataset Analysis: **120 Years of Olympic Games**

## **Understanding the Problem**

**Context & Objective:** We're going to explore and analyze a historical dataset on the modern Olympic Games, covering every Games from Athens 1896 to Rio 2016, both Winter and Summer. The data was scraped from www.sports-reference.com in May 2018.

**Dataset:** https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results


---


**Content:** Two datasets are involved: *athlete_events.csv* and *noc_regions.csv*.

The first one, *athlete_events.csv*, contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event. The columns are:

1. **ID** - unique number for each athlete
2. **Name** - athlete's name
3. **Sex** - athlete's gender - male (M) or female (F)
4. **Age** - athlete's age (integer)
5. **Height** - athlete's height, in centimeters (*cm*)
6. **Weight** - athlete's weight, in kilograms (*kg*)
7. **Team** - team name
8. **NOC** - National Olympic Committee 3-letter code
9. **Games** - year (integer) and season (Summer/Winter) of the Olympic Games
10. **Year** - year of the Olympic Games (integer)
11. **Season** - Summer or Winter
12. **City** - host city
13. **Sport** - sport
14. **Event** - event (sport category)
15. **Medal** - Gold, Silver, Bronze, or NaN (no medal).
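
As a quick sanity check on the column descriptions above, the sketch below rebuilds **Games** from **Year** and **Season**. It uses a small hand-made frame (`sample`, a hypothetical name) with values from three rows of the dataset rather than the full CSV. It only illustrates the relationship and does not validate the whole file.

```python
import pandas as pd

# Hypothetical mini-frame: three columns from three rows of athlete_events.csv
sample = pd.DataFrame({
    "Year": [1992, 2012, 1920],
    "Season": ["Summer", "Summer", "Summer"],
    "Games": ["1992 Summer", "2012 Summer", "1920 Summer"],
})

# Games should be the year followed by a space and the season
rebuilt = sample["Year"].astype(str) + " " + sample["Season"]
games_consistent = bool((rebuilt == sample["Games"]).all())
print(games_consistent)
```

Running the same comparison on the full frame would be one way to confirm that **Games** carries no information beyond **Year** and **Season**.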

#### **Importing libraries:**

In [4]:
# dataset importing and manipulation
import pandas as pd

# data visualization
import missingno   # https://github.com/ResidentMario/missingno
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## **Data Visualization and Analysis**

### **First Inspection**

In [3]:
df_athlete = pd.read_csv('athlete_events.csv')
type(df_athlete)

pandas.core.frame.DataFrame

In [9]:
df_athlete.head(3)

,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,


In [10]:
df_athlete.tail(3)

,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
271113,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Team",
271114,135571,Tomasz Ireneusz ya,M,30.0,185.0,96.0,Poland,POL,1998 Winter,1998,Winter,Nagano,Bobsleigh,Bobsleigh Men's Four,
271115,135571,Tomasz Ireneusz ya,M,34.0,185.0,96.0,Poland,POL,2002 Winter,2002,Winter,Salt Lake City,Bobsleigh,Bobsleigh Men's Four,


In [13]:
df_athlete.columns

Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal'],
      dtype='object')

In [14]:
df_athlete.shape

(271116, 15)

In [12]:
df_athlete.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
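
The non-null counts in this output can be turned into missing-value counts by subtracting them from the total number of rows. The short sketch below does that arithmetic in plain Python. The numbers are copied from the output above, and the variable names are only illustrative.

```python
# Non-null counts copied from the df_athlete.info() output
total_rows = 271116
non_null = {"Age": 261642, "Height": 210945, "Weight": 208241}

# Missing entries = total rows minus non-null entries
missing = {col: total_rows - n for col, n in non_null.items()}

# Share of missing entries per column, as a percentage
missing_pct = {col: round(100 * m / total_rows, 2) for col, m in missing.items()}

for col in non_null:
    print(f"{col}: {missing[col]} missing ({missing_pct[col]}%)")
```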


The `info` method shows that **Age**, which we'd expect to be an integer (`int64`), is stored as `float64`. This happens because the column contains missing values, which pandas represents as NaN, a floating-point value. We can also see that three features have missing data: **Age**, **Height**, and **Weight**. (NOTE: since NaN is an expected value in **Medal**, we won't consider it as missing data.)
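
A minimal sketch of how these observations could be checked in code, assuming a small hand-made frame (`sample`, a hypothetical name) with the values from the first three rows shown by `head(3)`. It counts missing entries per column with `isna().sum()` and converts a column containing NaN to pandas' nullable `Int64` dtype. The same conversion would presumably work for **Age** in the full frame, provided all recorded ages are whole numbers.

```python
import pandas as pd

# Hypothetical mini-frame; values taken from the rows shown by df_athlete.head(3)
sample = pd.DataFrame({
    "Age": [24.0, 23.0, 24.0],
    "Height": [180.0, 170.0, None],
    "Weight": [80.0, 60.0, None],
})

# Count missing entries per column
missing_counts = sample.isna().sum()
print(missing_counts)

# A missing value forces a float dtype; the nullable "Int64" extension dtype
# stores whole numbers as integers and keeps the gap as <NA>
height_int = sample["Height"].astype("Int64")
print(height_int.dtype)
```

Whether to convert at all is a design choice: `float64` with NaN is the pandas default and works with most numeric tooling, while `Int64` matches the integer meaning of the column more closely.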
