# Exploratory Data Analysis of Olympic Medals Data

In [1]:
import pandas as pd
import numpy as np

We first load the data file (obtained from Wikipedia) and do some basic cleaning: renaming the columns and adding a column of country codes.

In [2]:
olympic_data = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

for col in olympic_data.columns:  #renaming the medal columns
    if col[:2]=='01':
        olympic_data.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        olympic_data.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        olympic_data.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        olympic_data.rename(columns={col:'#'+col[1:]}, inplace=True)
        
names_ids = olympic_data.index.str.split(r'\s\(')  # raw-string regex: split each label at whitespace followed by '('

olympic_data.index = names_ids.str[0]

olympic_data['ID'] = names_ids.str[1].str[:3]

olympic_data = olympic_data.drop('Totals')

olympic_data.head()

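| | # Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 | AFG |
| Algeria | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 | ALG |
| Argentina | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 | ARG |
| Armenia | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 | ARM |
| Australasia | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 | ANZ |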
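
To make the index-splitting step concrete, here is a minimal sketch on two hand-written labels. It assumes the raw index labels have the form `Name (CODE)`; the sample labels are illustrative and are not read from `olympics.csv`.

```python
import pandas as pd

# Two illustrative labels in the assumed "Name (CODE)" form.
raw_index = pd.Index(['Afghanistan (AFG)', 'Algeria (ALG)'], dtype=object)

# Split each label at whitespace followed by '(':
# 'Afghanistan (AFG)' becomes ['Afghanistan', 'AFG)'].
parts = raw_index.str.split(r'\s\(')

names = parts.str[0]          # text before ' ('
codes = parts.str[1].str[:3]  # first three characters after '('

print(list(names))
print(list(codes))
```

Slicing with `.str[:3]` is what discards the closing parenthesis and anything else that follows the three-letter code.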


### To return all the data for a country, we can use the `iloc` (position-based) or `loc` (label-based) indexer. Here we return the data for the 3rd country and then for India.

In [3]:
country_three = olympic_data.iloc[2]
country_three

# Summer           23
Gold               18
Silver             24
Bronze             28
Total              70
# Winter           18
Gold.1              0
Silver.1            0
Bronze.1            0
Total.1             0
# Games            41
Gold.2             18
Silver.2           24
Bronze.2           28
Combined total     70
ID                ARG
Name: Argentina, dtype: object

In [4]:
india_data = olympic_data.loc['India']
india_data

# Summer           23
Gold                9
Silver              6
Bronze             11
Total              26
# Winter            9
Gold.1              0
Silver.1            0
Bronze.1            0
Total.1             0
# Games            32
Gold.2              9
Silver.2            6
Bronze.2           11
Combined total     26
ID                IND
Name: India, dtype: object
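
The difference between the two lookups can be sketched on a small frame. The frame `toy`, its labels, and its values are hypothetical and serve only to show that `iloc` selects by position while `loc` selects by index label.

```python
import pandas as pd

# Hypothetical toy frame; labels and values are made up for illustration.
toy = pd.DataFrame({'Gold': [1, 5, 3]}, index=['A', 'B', 'C'])

third_row = toy.iloc[2]  # position-based: the third row, whatever its label
by_label = toy.loc['B']  # label-based: the row whose index label is 'B'

print(third_row.name, third_row['Gold'])
print(by_label.name, by_label['Gold'])
```

Both calls return a `Series` whose `name` is the row's index label, which is why the outputs above end with `Name: Argentina` and `Name: India`.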

### Find the country with the most summer gold medals

There are a couple of ways to do this. We can sort by the `Gold` column in descending order and take the first row, or we can call the `idxmax` method, which returns the index label of the maximum value in a column. Sorting with `sort_values` returns a new, sorted dataframe and leaves the original unchanged; to sort the original in place we would need to pass `inplace=True`.

In [5]:
olympic_data.sort_values("Gold", ascending=False).iloc[0]

# Summer            26
Gold               976
Silver             757
Bronze             666
Total             2399
# Winter            22
Gold.1              96
Silver.1           102
Bronze.1            84
Total.1            282
# Games             48
Gold.2            1072
Silver.2           859
Bronze.2           750
Combined total    2681
ID                 USA
Name: United States, dtype: object

In [6]:
olympic_data.Gold.idxmax()

'United States'
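
As a rough check that the two approaches agree, and that sorting does not modify the original frame unless `inplace=True` is passed, here is a sketch on a hypothetical three-row frame (`toy` and its values are made up).

```python
import pandas as pd

# Hypothetical toy frame; labels and values are made up for illustration.
toy = pd.DataFrame({'Gold': [3, 10, 7]}, index=['A', 'B', 'C'])

# Approach 1: sort descending and take the label of the first row.
sorted_toy = toy.sort_values('Gold', ascending=False)
top_by_sort = sorted_toy.index[0]

# Approach 2: ask directly for the label of the maximum.
top_by_idxmax = toy['Gold'].idxmax()

print(top_by_sort, top_by_idxmax)
print(list(toy.index))  # original row order is untouched
```

`idxmax` scans the column once, so it avoids sorting the whole frame when only the top entry is needed.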

### Find the country with the biggest difference between its summer and winter gold medal counts

We can either add a column holding the difference in gold medals and call `idxmax` on it, or call `idxmax` directly on the computed difference without storing it.

In [7]:
def difference_max(df):
    df['Gold_Difference'] = df['Gold'] - df['Gold.1']  # vectorized subtraction; adds the column to df in place
    return df.Gold_Difference.idxmax()

difference_max(olympic_data)

'United States'

In [8]:
(olympic_data['Gold'] - olympic_data['Gold.1']).idxmax()

'United States'
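
One practical difference between the two versions is that the function adds a `Gold_Difference` column to the frame it receives, while the one-line expression leaves the frame untouched. The sketch below illustrates this on a hypothetical frame (`toy` and its values are made up); the function is rewritten here with a plain column subtraction.

```python
import pandas as pd

# Hypothetical toy frame; labels and values are made up for illustration.
toy = pd.DataFrame({'Gold': [10, 4, 6], 'Gold.1': [2, 3, 6]},
                   index=['A', 'B', 'C'])

def difference_max(df):
    # Side effect: stores the difference as a new column on df.
    df['Gold_Difference'] = df['Gold'] - df['Gold.1']
    return df['Gold_Difference'].idxmax()

working_copy = toy.copy()  # copy so that toy itself stays unchanged
winner_function = difference_max(working_copy)

# The direct expression computes the same answer without adding a column.
winner_direct = (toy['Gold'] - toy['Gold.1']).idxmax()

print(winner_function, winner_direct)
print(list(working_copy.columns), list(toy.columns))
```

Note that both versions use the signed difference (summer minus winter), so a country with more winter than summer golds would get a negative value.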

### Find the country with the biggest difference between its summer and winter gold medal counts relative to its total gold medal count

$$\frac{\text{Summer Gold} - \text{Winter Gold}}{\text{Total Gold}}$$

For this we include only countries that have won at least 1 gold medal in both the summer and the winter games. We first create a new dataframe called `gold_data`, which contains only those countries, and then use the `idxmax` method.

In [9]:
gold_data = olympic_data[(olympic_data['Gold']>0) & (olympic_data['Gold.1']>0)] 

In [10]:
len(gold_data)

36

In [12]:
((gold_data['Gold'] - gold_data['Gold.1'])/gold_data['Gold.2']).idxmax()

'Bulgaria'
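
To see why the filter matters, consider a hypothetical three-country frame (`toy` and its values are made up). A country with no winter golds always scores exactly 1 on this ratio, so without the filter it would win automatically.

```python
import pandas as pd

# Hypothetical toy frame: summer golds, winter golds, combined golds.
toy = pd.DataFrame(
    {'Gold': [9, 5, 4], 'Gold.1': [1, 5, 0], 'Gold.2': [10, 10, 4]},
    index=['X', 'Y', 'Z'],
)

# Keep only countries with at least one gold in both summer and winter.
both = toy[(toy['Gold'] > 0) & (toy['Gold.1'] > 0)]

# (summer - winter) / total for the filtered and unfiltered frames.
ratio_filtered = (both['Gold'] - both['Gold.1']) / both['Gold.2']
ratio_unfiltered = (toy['Gold'] - toy['Gold.1']) / toy['Gold.2']

print(ratio_filtered.idxmax())    # best among countries with both kinds of gold
print(ratio_unfiltered.idxmax())  # dominated by the country with no winter golds
```

Here `X` scores (9 - 1) / 10 = 0.8 and `Y` scores 0, while the excluded `Z` would score (4 - 0) / 4 = 1.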

### We next make a performance measure for each country. In this measure each gold medal (`Gold.2`) counts for 3 points, each silver medal (`Silver.2`) for 2 points, and each bronze medal (`Bronze.2`) for 1 point.

In [13]:
def Point_data(df):
    df['Points'] = 3*df['Gold.2'] + 2*df['Silver.2'] + df['Bronze.2']
    return df['Points']
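
As a quick sanity check of the weighting, the sketch below applies the same formula to the combined medal counts of the first three countries shown in the `head()` output earlier. The small frame `sample` is rebuilt by hand here rather than sliced from `olympic_data`, and the result is kept in a separate Series instead of being stored as a column.

```python
import pandas as pd

# Combined medal counts copied from the head() output for three countries.
sample = pd.DataFrame(
    {'Gold.2': [0, 5, 18], 'Silver.2': [0, 2, 24], 'Bronze.2': [2, 8, 28]},
    index=['Afghanistan', 'Algeria', 'Argentina'],
)

# 3 points per gold, 2 per silver, 1 per bronze.
points = 3 * sample['Gold.2'] + 2 * sample['Silver.2'] + sample['Bronze.2']

print(points)
```

For Algeria, for instance, this gives 3·5 + 2·2 + 8 = 27.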

In [14]:
Point_data(olympic_data)

Afghanistan                            2
Algeria                               27
Argentina                            130
Armenia                               16
Australasia                           22
Australia                            923
Austria                              569
Azerbaijan                            43
Bahamas                               24
Bahrain                                1
Barbados                               1
Belarus                              154
Belgium                              276
Bermuda                                1
Bohemia                                5
Botswana                               2
Brazil                               184
British West Indies                    2
Bulgaria                             411
Burundi                                3
Cameroon                              12
Canada                               846
Chile                                 24
China                               1120
Colombia        