# Olympic Games Exploratory Data Analysis

Before we begin, let's configure a couple of pandas display options:
- Max number of columns to be displayed = 100
- Max number of rows to be displayed = 100

In [26]:
import pandas as pd

pd.set_option('display.max_columns', 100) 
pd.set_option('display.max_rows', 100) 

### First step: read and glimpse the dataset

In this EDA, we'll use the ["120 years of Olympic history: athletes and results"](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results) Kaggle dataset, available locally in this repo at `raw_data/athlete_events.csv`.

Let's first read the dataset:

In [8]:
df = pd.read_csv("raw_data/athlete_events.csv")

### Q0: How many rows and columns are there in this dataset?


In [14]:
print(df.shape)

(271116, 15)


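Over 271 thousand rows covering 120 years of Olympics! Wow! Keep in mind that each row is one athlete's entry in one event, so an athlete who entered several events appears more than once.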
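
Because a row is an athlete-event entry, the row count and the number of distinct athletes can differ. The sketch below shows one way to compare the two, assuming the `ID` column identifies an athlete uniquely. The helper name `count_rows_and_athletes` and the toy DataFrame are made up for illustration and are not part of the dataset.

```python
import pandas as pd


def count_rows_and_athletes(frame):
    """Return (number of rows, number of distinct athlete IDs)."""
    return len(frame), frame["ID"].nunique()


# Toy data, invented for illustration: athlete 1 entered two events.
toy = pd.DataFrame({
    "ID": [1, 1, 2],
    "Event": ["100 metres", "200 metres", "Judo"],
})

rows, athletes = count_rows_and_athletes(toy)
print(rows, athletes)
```

On the real data, the same call would be `count_rows_and_athletes(df)`.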



Let's get some basic info on the available data:

In [15]:
df.info()  # info() prints the summary itself and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
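
The non-null counts above show that some columns have gaps. A quick way to see how large those gaps are is to compute the share of missing values per column. This is only a sketch: the helper `missing_share` and the toy DataFrame are hypothetical, introduced here for illustration.

```python
import pandas as pd


def missing_share(frame):
    """Fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Age": [24.0, None, 21.0, 30.0],
    "Medal": [None, None, None, "Gold"],
})

shares = missing_share(toy)
print(shares)
```

On the real data, `missing_share(df)` would rank all 15 columns the same way.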


Lots of information available! Note that `Age`, `Height`, `Weight`, and `Medal` have missing values. Let's take a look at the actual data:

In [16]:
df.head()

|   | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |


Each row represents a competitor in a specific event at a specific Olympic Games. Interesting, very interesting.

### Q1: Which are the oldest Olympic Summer and Winter Games with data available in the dataset?

To solve this one, we may resort to the `np.sort()` function:


In [34]:
import numpy as np

np.sort(df['Year'].unique())  # .unique() keeps a single occurrence of each Olympic year

array([1896, 1900, 1904, 1906, 1908, 1912, 1920, 1924, 1928, 1932, 1936,
       1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988,
       1992, 1994, 1996, 1998, 2000, 2002, 2004, 2006, 2008, 2010, 2012,
       2014, 2016], dtype=int64)

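The earliest Games with data available are in fact the first of the modern era: the 1896 Summer Olympics in Athens. The sorted list mixes both seasons, though, so on its own it does not reveal the first Winter Games; answering that half of the question requires splitting the data by `Season`.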
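
Since the question asks about Summer and Winter Games separately, one option is to group by the `Season` column and take the minimum `Year` in each group. The sketch below illustrates the idea. The helper name `first_year_by_season` and the toy rows are made up for illustration, so the printed years are not the dataset's answer.

```python
import pandas as pd


def first_year_by_season(frame):
    """Earliest year present in the data for each season."""
    return frame.groupby("Season")["Year"].min()


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Season": ["Summer", "Winter", "Summer", "Winter"],
    "Year": [1992, 1992, 1988, 1994],
})

first = first_year_by_season(toy)
print(first)
```

Running `first_year_by_season(df)` on the real data would give one earliest year for `Summer` and one for `Winter`.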



### Q2: Which Games had the greatest number of registered competitors?

To answer this one, we may resort to `Series.value_counts()`:

In [29]:
df['Year'].value_counts()

1992    16413
1988    14676
2000    13821
1996    13780
2016    13688
2008    13602
2004    13443
2012    12920
1972    11959
1984    11588
1976    10502
1968    10479
1964     9480
1952     9358
1960     9235
1980     8937
1948     7480
1936     7401
1956     6434
1924     5693
1928     5574
2014     4891
2010     4402
2006     4382
1920     4292
2002     4109
1912     4040
1998     3605
1932     3321
1994     3160
1908     3101
1900     1936
1906     1733
1904     1301
1896      380
Name: Year, dtype: int64

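Well, the year with the most entries was not one of the most recent, but rather 1992! Very interesting! Two caveats apply: until 1992 the Summer and Winter Games were held in the same year, so this figure combines both 1992 Games, and each row is an event entry, so athletes who entered several events are counted more than once.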
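
Counting by `Year` merges Summer and Winter Games that share a year, and counts one athlete several times if they entered several events. A possible refinement is to group by the `Games` column and count distinct `ID` values alongside the raw entries. This is a sketch under the assumption that `ID` identifies an athlete. The helper `participation_by_games` and the toy rows are hypothetical.

```python
import pandas as pd


def participation_by_games(frame):
    """Entries and distinct athletes per Games, most athletes first."""
    grouped = frame.groupby("Games")["ID"]
    summary = pd.DataFrame({
        "entries": grouped.size(),      # rows, i.e. athlete-event entries
        "athletes": grouped.nunique(),  # distinct athlete IDs
    })
    return summary.sort_values("athletes", ascending=False)


# Toy data, invented for illustration: athlete 1 entered two events in 1992.
toy = pd.DataFrame({
    "Games": ["1992 Summer", "1992 Summer", "1992 Summer",
              "1992 Winter", "2016 Summer", "2016 Summer", "2016 Summer"],
    "ID": [1, 1, 2, 3, 4, 5, 6],
})

summary = participation_by_games(toy)
print(summary)
```

Applied to the real data, `participation_by_games(df).head()` would show whether the ranking changes once seasons are separated and repeat entries are collapsed.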



### Q3: In which year did Brazilians start competing?

To solve this one, we now have to resort to filtering techniques:

In [33]:
brazil_competitions = np.sort(df.query('Team=="Brazil"')['Year'].unique())
print(brazil_competitions)

[1900 1920 1924 1932 1936 1948 1952 1956 1960 1964 1968 1972 1976 1980
 1984 1988 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014
 2016]


Brazil started really early, in fact at the second modern Olympic Games, in 1900! Nice!
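
The `Team` column is not always a plain country name (the sample rows include `Denmark/Sweden`, for instance), so an exact match on `Team` may miss some entries. An alternative worth checking is to filter on the `NOC` code instead, assuming `BRA` is the code used for Brazil. The helper `years_for_noc` and the toy rows, including the team name `Brazil-1`, are made up for illustration.

```python
import numpy as np
import pandas as pd


def years_for_noc(frame, noc):
    """Sorted array of the years in which the given NOC code appears."""
    return np.sort(frame.loc[frame["NOC"] == noc, "Year"].unique())


# Toy data, invented for illustration: "Brazil-1" would be missed by Team == "Brazil".
toy = pd.DataFrame({
    "Team": ["Brazil", "Brazil-1", "Denmark/Sweden"],
    "NOC": ["BRA", "BRA", "DEN"],
    "Year": [1920, 1900, 1900],
})

brazil_years = years_for_noc(toy, "BRA")
print(brazil_years)
```

On the real data, comparing `years_for_noc(df, "BRA")` with the `Team`-based result above would show whether the two filters agree.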