### This dataset lists the athletes who competed at the 2016 Rio Olympics.

### This dataset supports several prediction tasks.  Our first pass will try to classify an athlete's country, using only athletes who earned a medal.  Which features might be important?

In [10]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('athletes.csv')

In [3]:
df.head()

Unnamed: 0,id,name,nationality,sex,dob,height,weight,sport,gold,silver,bronze
0,736041664,A Jesus Garcia,ESP,male,10/17/69,1.72,64.0,athletics,0,0,0
1,532037425,A Lam Shin,KOR,female,9/23/86,1.68,56.0,fencing,0,0,0
2,435962603,Aaron Brown,CAN,male,5/27/92,1.98,79.0,athletics,0,0,1
3,521041435,Aaron Cook,MDA,male,1/2/91,1.83,80.0,taekwondo,0,0,0
4,33922579,Aaron Gate,NZL,male,11/26/90,1.81,71.0,cycling,0,0,0


In [4]:
df.shape

(11538, 11)

In [12]:
df.isnull().sum()

id               0
name             0
nationality      0
sex              0
dob              1
height         330
weight         659
sport            0
gold             0
silver           0
bronze           0
dtype: int64

### There are nulls, but not many.  In the interest of time, let's drop the affected rows.

In [13]:
df.dropna(inplace=True)

In [16]:
df.shape

(10858, 11)
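
The two `df.shape` outputs show what the drop cost. A quick sketch of the arithmetic, using only the row counts reported in this notebook (the variable names are introduced here for illustration):

```python
# Row counts taken from the df.shape outputs before and after dropna()
rows_before = 11538
rows_after = 10858

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(f"Dropped {rows_dropped} rows ({share_dropped:.1%} of the data)")
```

That is 680 rows, or roughly 5.9% of the athletes, which supports treating the nulls as "not many".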

In [18]:
df.dtypes

id               int64
name            object
nationality     object
sex             object
dob             object
height         float64
weight         float64
sport           object
gold             int64
silver           int64
bronze           int64
dtype: object

### Let's engineer a feature that holds the total number of medals each athlete won.  Some athletes won more than one medal, and this may be a good indicator of nationality.

In [51]:
# Total medals per athlete (gold + silver + bronze); 0 means no medal
df['medal_or_nm'] = df['gold'] + df['silver'] + df['bronze']

In [52]:
df_medals = df[df.medal_or_nm >= 1]
df_medals.shape

(1753, 12)

In [53]:
df_medals.head()

Unnamed: 0,id,name,nationality,sex,dob,height,weight,sport,gold,silver,bronze,medal_or_nm
2,435962603,Aaron Brown,CAN,male,5/27/92,1.98,79.0,athletics,0,0,1,1
6,266237702,Aaron Russell,USA,male,6/4/93,2.05,98.0,volleyball,0,0,1,1
14,162792594,Abbey Weitzeil,USA,female,12/3/96,1.78,68.0,aquatics,1,1,0,2
48,962468808,Abdoulrazak Issoufou Alfaga,NIG,male,12/26/94,2.07,98.0,taekwondo,0,1,0,1
55,969824503,Abdullah Alrashidi,IOA,male,8/21/63,1.83,84.0,shooting,0,0,1,1


In [67]:
df_medals.groupby('nationality')['medal_or_nm'].count()

nationality
ALG      1
ARG     22
ARM      4
AUS     71
AUT      2
AZE     16
BAH      3
BDI      1
BEL     21
BLR     12
BRA     48
BRN      2
BUL      7
CAN     62
CHN     96
CIV      1
COL      6
CRO     23
CUB      5
CZE     14
DEN     39
DOM      1
EGY      3
ESP     43
EST      4
ETH      7
FIJ     13
FRA     86
GBR    126
GEO      7
      ... 
NED     45
NIG      1
NZL     35
PHI      1
POL     16
POR      1
PRK      7
PUR      1
QAT      1
ROU     17
RSA     21
RUS     96
SIN      1
SLO      4
SRB     53
SUI     11
SVK      8
SWE     26
THA      6
TJK      1
TPE      5
TTO      1
TUN      3
TUR      8
UAE      1
UKR     13
USA    205
UZB      6
VEN      2
VIE      1
Name: medal_or_nm, Length: 82, dtype: int64
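
Because each row is one athlete, `count()` here tallies medal-winning athletes per country rather than medals. A small sketch of the difference, assuming a made-up three-row frame (the rows and the `XXX`/`YYY` codes are illustrative, not taken from `athletes.csv`):

```python
import pandas as pd

# Illustrative toy data: one row per medal-winning athlete
toy = pd.DataFrame({
    "nationality": ["XXX", "XXX", "YYY"],
    "medal_or_nm": [1, 2, 1],
})

grouped = toy.groupby("nationality")["medal_or_nm"]
medalists = grouped.count()  # number of athletes with at least one medal
medals = grouped.sum()       # total medals won by those athletes

print(pd.DataFrame({"medalists": medalists, "medals": medals}))
```

For `XXX`, the two athletes give a count of 2 but a medal total of 3, so the two aggregations answer different questions.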

### This is good to see.  Restricting the data to athletes who won a medal still leaves us with 1,753 observations.

### A thought: to give the classifier enough examples per class, we will keep only countries that reach a minimum medal count.  We'll say 15 for now.  Let's see how many countries that leaves us.

In [50]:
df_medals.groupby('nationality')['medal_or_nm'].agg('sum')

nationality
ARG     22
ARM      4
AUS     78
AUT      2
AZE     16
BAH      3
BDI      1
BEL     21
BLR     12
BRA     50
BRN      2
BUL      7
CAN     65
CHN     97
CIV      1
COL      6
CRO     23
CUB      5
CZE     15
DEN     39
DOM      1
EGY      3
ESP     45
EST      4
ETH      8
FIJ     13
FRA     87
GBR    126
GEO      7
GER    149
      ... 
NED     46
NIG      1
NZL     36
PHI      1
POL     16
POR      1
PRK      7
PUR      1
QAT      1
ROU     17
RSA     20
RUS     99
SIN      1
SLO      4
SRB     53
SUI     11
SVK      8
SWE     28
THA      6
TJK      1
TPE      5
TTO      1
TUN      3
TUR      8
UAE      1
UKR     15
USA    235
UZB      6
VEN      2
VIE      2
Name: medal_or_nm, Length: 81, dtype: int64
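
The 15-medal cutoff proposed above could be applied along these lines. This is only a sketch: the totals are a handful of values copied from the output above, the four athletes come from `df_medals.head()`, and `MIN_MEDALS`, `keep`, `sample`, and `filtered` are names introduced here for illustration.

```python
import pandas as pd

MIN_MEDALS = 15  # threshold proposed in the text; easy to change later

# A few per-country medal totals copied from the groupby output
medal_totals = pd.Series({"ARM": 4, "AZE": 16, "CAN": 65, "NIG": 1, "USA": 235})

# Countries that meet the threshold
keep = medal_totals[medal_totals >= MIN_MEDALS].index

# Four medalists from df_medals.head(), reduced to the relevant columns
sample = pd.DataFrame({
    "name": ["Aaron Brown", "Aaron Russell", "Abbey Weitzeil",
             "Abdoulrazak Issoufou Alfaga"],
    "nationality": ["CAN", "USA", "USA", "NIG"],
})

# Keep only athletes whose country meets the threshold
filtered = sample[sample["nationality"].isin(keep)]

print(list(keep))
print(filtered)
```

On the full data, the same pattern would start from `df_medals.groupby('nationality')['medal_or_nm'].sum()` and filter `df_medals` with `isin(keep)`; `len(keep)` then answers how many countries remain.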

In [None]:
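# Stratifying needs at least 2 rows per nationality, so countries with a single medalist must be filtered out of df_medals first
train, test = train_test_split(df_medals, test_size=0.3, random_state=123, stratify=df_medals['nationality'])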
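
Stratifying on `nationality` keeps each country's share roughly equal in the training and test sets. A minimal sketch of that behaviour on synthetic data (the `AAA`/`BBB`/`CCC` codes and the `height` values are placeholders, not real athletes):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered medalists: 3 countries, 10 athletes each
toy = pd.DataFrame({
    "nationality": ["AAA"] * 10 + ["BBB"] * 10 + ["CCC"] * 10,
    "height": [1.60 + 0.01 * i for i in range(30)],
})

train, test = train_test_split(
    toy, test_size=0.3, random_state=123, stratify=toy["nationality"]
)

# Every country should appear in both splits in similar proportions
print(train["nationality"].value_counts().sort_index())
print(test["nationality"].value_counts().sort_index())
```

Without `stratify`, a random 30% sample could leave a small country underrepresented in, or missing from, one of the splits.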