### This dataset consists of competitors at the Rio Olympics in 2016.

### This dataset supports several prediction tasks.  As a first pass, we'll try to classify an athlete's country, using only athletes who earned a medal.  Which features might be important?

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('athletes.csv')

In [3]:
df.head()

Unnamed: 0,id,name,nationality,sex,dob,height,weight,sport,gold,silver,bronze
0,736041664,A Jesus Garcia,ESP,male,10/17/69,1.72,64.0,athletics,0,0,0
1,532037425,A Lam Shin,KOR,female,9/23/86,1.68,56.0,fencing,0,0,0
2,435962603,Aaron Brown,CAN,male,5/27/92,1.98,79.0,athletics,0,0,1
3,521041435,Aaron Cook,MDA,male,1/2/91,1.83,80.0,taekwondo,0,0,0
4,33922579,Aaron Gate,NZL,male,11/26/90,1.81,71.0,cycling,0,0,0


In [4]:
df.shape

(11538, 11)

In [5]:
df.isnull().sum()

id               0
name             0
nationality      0
sex              0
dob              1
height         330
weight         659
sport            0
gold             0
silver           0
bronze           0
dtype: int64

### There are some nulls, but relatively few.  In the interest of time, let's drop the rows that contain them.

In [6]:
df.dropna(inplace=True)

In [7]:
df.shape

(10858, 11)
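
### A side note on the counts: the per-column null totals overlap (one athlete can be missing both height and weight), so they do not add up to the number of rows that `dropna` removes.  The sketch below shows one way to measure that directly.  It runs on a small made-up frame, not on `athletes.csv`, and the names `toy`, `rows_with_null`, `n_dropped`, and `share_dropped` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the athletes table (values are made up for illustration).
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "height": [1.72, np.nan, 1.98, 1.83, np.nan],
    "weight": [64.0, np.nan, 79.0, np.nan, 71.0],
})

# dropna() with default arguments removes a row if ANY column is null,
# so flag rows the same way before dropping.
rows_with_null = toy.isnull().any(axis=1)
n_dropped = int(rows_with_null.sum())
share_dropped = n_dropped / len(toy)

# Non-inplace variant: returns a new frame and leaves `toy` untouched.
cleaned = toy.dropna()

print(n_dropped, round(share_dropped, 2), cleaned.shape)
```

Applied to `df` before the drop, the same two lines would report how much of the dataset is lost, which is worth checking before deciding whether imputation is needed instead.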

In [8]:
df.dtypes

id               int64
name            object
nationality     object
sex             object
dob             object
height         float64
weight         float64
sport           object
gold             int64
silver           int64
bronze           int64
dtype: object

### Let's engineer a feature that holds the total number of medals each athlete won.  Some athletes won more than one medal, and this may be a useful indicator of nationality.

In [9]:
df['medal_or_nm'] = df['gold'] + df['silver'] + df['bronze']

In [10]:
df_medals = df[df.medal_or_nm >= 1]
df_medals.shape

(1753, 12)

In [11]:
df_medals.head()

Unnamed: 0,id,name,nationality,sex,dob,height,weight,sport,gold,silver,bronze,medal_or_nm
2,435962603,Aaron Brown,CAN,male,5/27/92,1.98,79.0,athletics,0,0,1,1
6,266237702,Aaron Russell,USA,male,6/4/93,2.05,98.0,volleyball,0,0,1,1
14,162792594,Abbey Weitzeil,USA,female,12/3/96,1.78,68.0,aquatics,1,1,0,2
48,962468808,Abdoulrazak Issoufou Alfaga,NIG,male,12/26/94,2.07,98.0,taekwondo,0,1,0,1
55,969824503,Abdullah Alrashidi,IOA,male,8/21/63,1.83,84.0,shooting,0,0,1,1


In [12]:
df_medals.groupby('nationality')['medal_or_nm'].count()

nationality
ALG      1
ARG     22
ARM      4
AUS     71
AUT      2
AZE     16
BAH      3
BDI      1
BEL     21
BLR     12
BRA     48
BRN      2
BUL      7
CAN     62
CHN     96
CIV      1
COL      6
CRO     23
CUB      5
CZE     14
DEN     39
DOM      1
EGY      3
ESP     43
EST      4
ETH      7
FIJ     13
FRA     86
GBR    126
GEO      7
      ... 
NED     45
NIG      1
NZL     35
PHI      1
POL     16
POR      1
PRK      7
PUR      1
QAT      1
ROU     17
RSA     21
RUS     96
SIN      1
SLO      4
SRB     53
SUI     11
SVK      8
SWE     26
THA      6
TJK      1
TPE      5
TTO      1
TUN      3
TUR      8
UAE      1
UKR     13
USA    205
UZB      6
VEN      2
VIE      1
Name: medal_or_nm, Length: 82, dtype: int64
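
### Note that `.count()` counts rows, so the figures above are the number of medal-winning athletes per country, not the number of medals.  Summing the column gives the medal total instead.  A minimal sketch of the difference, using a made-up frame (the name `toy_medals` and its values are hypothetical):

```python
import pandas as pd

# Toy medalists: one row per athlete, medal_or_nm = medals won by that athlete.
toy_medals = pd.DataFrame({
    "nationality": ["CAN", "CAN", "USA", "USA", "USA"],
    "medal_or_nm": [1, 3, 2, 1, 1],
})

grouped = toy_medals.groupby("nationality")["medal_or_nm"]

# count(): number of non-null rows per group, i.e. medal-winning athletes.
athletes_per_country = grouped.count()

# sum(): total medals won by those athletes.
medals_per_country = grouped.sum()

print(athletes_per_country)
print(medals_per_country)
```

The two only agree when every medalist in a country won exactly one medal.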

### This is good to see.  Restricting to athletes who won at least one medal still leaves us with 1,753 observations.

### A thought: in order to classify, we will keep only countries whose medalists won more than a minimum number of medals in total.  We'll say more than 50 for now.  Let's see how many countries that leaves us.

In [13]:
country_count = pd.DataFrame(df_medals.groupby('nationality')['medal_or_nm'].agg('sum'))
country_count.columns = ['country_count']

In [14]:
df_medals = df_medals.merge(country_count, on='nationality')

In [15]:
df_medals.head(10)

Unnamed: 0,id,name,nationality,sex,dob,height,weight,sport,gold,silver,bronze,medal_or_nm,country_count
0,435962603,Aaron Brown,CAN,male,5/27/92,1.98,79.0,athletics,0,0,1,1,69
1,769580282,Akeem Haynes,CAN,male,3/11/92,1.68,71.0,athletics,0,0,1,1,69
2,373002185,Allison Beveridge,CAN,female,6/1/93,1.69,62.0,cycling,0,0,1,1,69
3,686662012,Allysha Chapman,CAN,female,1/25/89,1.6,58.0,football,0,0,1,1,69
4,857846421,Andre de Grasse,CAN,male,11/10/94,1.76,70.0,athletics,0,1,2,3,69
5,790221508,Ashley Lawrence,CAN,female,6/11/95,1.64,60.0,football,0,0,1,1,69
6,405218283,Ashley Steacy,CAN,female,6/28/87,1.58,64.0,rugby sevens,0,0,1,1,69
7,644567979,Bianca Farella,CAN,female,4/10/92,1.73,73.0,rugby sevens,0,0,1,1,69
8,321655820,Brendon Rodney,CAN,male,4/9/92,1.95,80.0,athletics,0,0,1,1,69
9,542571086,Brianne Theisen Eaton,CAN,female,12/18/88,1.75,64.0,athletics,0,0,1,1,69


In [16]:
df_medals = df_medals[df_medals.country_count > 50]

In [17]:
df_medals.shape

(1072, 13)

In [18]:
df_medals.nationality.nunique()

11
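
### As an aside, the aggregate-then-merge steps above can also be written with `groupby(...).transform`, which returns one value per original row and so avoids the merge.  This is only a sketch of an alternative, shown on made-up data; `toy_medals`, `min_medals`, and `country_total` are hypothetical names, and the threshold of 3 is chosen to suit the toy values.

```python
import pandas as pd

# Toy medalists (made-up values for illustration).
toy_medals = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "nationality": ["CAN", "CAN", "USA", "USA", "FIJ"],
    "medal_or_nm": [1, 3, 2, 1, 1],
})

min_medals = 3  # hypothetical threshold for the toy data

# transform('sum') broadcasts each country's medal total back onto its rows,
# so the result is aligned with toy_medals and can be used as a mask directly.
country_total = toy_medals.groupby("nationality")["medal_or_nm"].transform("sum")

# Strict inequality, matching the `> 50` comparison used in the notebook.
kept = toy_medals[country_total > min_medals]

print(kept)
print(kept["nationality"].nunique())
```

Unlike the merge, this keeps the original row index, which can make it easier to trace rows back to `df`.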

### We now have 11 countries to use as class labels for our athletes.  Let's split the data and then explore the training set.

In [19]:
train, test = train_test_split(df_medals, test_size=.3, random_state=123, stratify=df_medals[['nationality']])
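
### Because the split is stratified on nationality, each country should make up roughly the same share of the train and test sets.  A quick way to check this is to compare normalized value counts.  The sketch below uses a made-up two-country frame rather than `df_medals`; the names `toy`, `toy_train`, `toy_test`, `train_share`, and `test_share` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 60 athletes from one country, 40 from another (made-up values).
toy = pd.DataFrame({
    "nationality": ["USA"] * 60 + ["GBR"] * 40,
    "height": [1.80] * 100,
})

# Same arguments as the notebook's split, stratified on the label column.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=123, stratify=toy["nationality"]
)

# Share of each country in each split; stratification keeps these close.
train_share = toy_train["nationality"].value_counts(normalize=True)
test_share = toy_test["nationality"].value_counts(normalize=True)

print(train_share)
print(test_share)
```

Running the same comparison on `train` and `test` would confirm that none of the 11 countries is over- or under-represented in the held-out set.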