# HW4 - Applied ML

There are four aspects:
* First, we process the data and explore the dataset (0).
* Then, we classify labels. (1).
* We plot learning curves (Bonus)
* Finally, we do clustering (2).

## 0. Data Pre-Processing and Visualization
Understanding the dataset, cleaning the variables. This is important because later we will need clean data for supervised and unsupervised learning.

Using http://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb, we will clean the dataset. Then, we will aggregate information. Finally, we will clean up the features.

### Pre-Processing Tasks and Outline

1) Dataset cleaning
- exclude interactions by  ref who feature in fewer than 22 diyads

- drop NAs with no race values, averaging race values

2) Aggregate to player data

- combine games, referee and bias information using variable statistics

3) Cleaning features

- take out unnecessary features — player ID, photo ID
- turning date into proper format

- convert categorical variables into dummy variables


In [1]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('CrowdstormingDataJuly1st.csv')

In [3]:
orig_num_dyads = df.shape[0]
print('number of dyads:', orig_num_dyads)

number of dyads: 146028


## 1. Dataset Cleaning

### Exclude Interactions

In [4]:
#This line defines a new dataframe based on our >21 games filter
all_refs = df.refNum.value_counts()
good_refs = all_refs[all_refs>21]

df=df[df['refNum'].isin(good_refs.index.values)]

In [5]:
print('percentage of dyads left:', 100 * df.shape[0] / orig_num_dyads)

percentage of dyads left: 91.4215082039061


### Getting Race Values

Get the real "race value" by taking the average of the two. Then drop the ones that don't have rating.

In [6]:
df['race_val'] = df['rater1'] + df['rater2']/2

In [7]:
df.dropna(subset=['race_val'],inplace=True)

In [8]:
# Percentage of data left...
print('percentage of dyads left:', 100 * df.shape[0] / orig_num_dyads)

percentage of dyads left: 77.9727175610157


In [10]:
df.drop(['rater1','rater2'],axis=1, inplace=True)

## 2. Aggregating into Players

We have three types of data, games, referee reference, and bias scores.

We sum up the games. For referee, we need to be more careful.

* refCountry cannot be averaged, but can take the mode.
* Alpha_3 is just the ref country, so we take the mode as well
* meanIAT and meanExp  are averaged
* nIAT and nExp are summed
* seIAT and seExp are done using functions

In [11]:
# demonstrating the columns of a single player
df[df['player'] == 'John Utaka'].ix[:,0:13].head(1)

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,ties,defeats,goals
973,john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,0,1,0


In [12]:
df[df['player'] == 'John Utaka'].ix[:,13:28].head(1)

Unnamed: 0,yellowCards,yellowReds,redCards,photoID,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,race_val
973,0,0,0,1663.jpg,66,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752,1.125


In [13]:
# each set of columns will be aggregated differently
player_cols = ['playerShort'] # basis
game_cols = ['games','victories','ties','defeats','goals','yellowCards','yellowReds','redCards'] #sum
ref_cols = ['refCountry','Alpha_3'] #mode
bias_mean_cols = ['meanIAT','meanExp'] #mean
bias_n_cols = ['nIAT','nExp'] #sum
bias_se_cols = ['seIAT', 'seExp'] #special function

### Aggregated DataFrame

* df_aggregated -- contains all the columns transformed by the groupby
* df_player -- combines player columns with df_aggregated

In [14]:
df_aggregated = df.groupby(player_cols)[game_cols].sum().reset_index()
df_aggregated = df_aggregated.merge(df.groupby(player_cols)[ref_cols].agg(lambda x: x.value_counts().index[0]).reset_index()) # refs
df_aggregated = df_aggregated.merge(df.groupby(player_cols)[bias_mean_cols].mean().reset_index()) # bias mean
df_aggregated = df_aggregated.merge(df.groupby(player_cols)[bias_n_cols].sum().reset_index()) #bias n
df_aggregated = df_aggregated.merge(df.groupby(player_cols)[bias_se_cols].std().reset_index()) #bias std

In [15]:
df_aggregated.head()

Unnamed: 0,playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,refCountry,Alpha_3,meanIAT,meanExp,nIAT,nExp,seIAT,seExp
0,aaron-hughes,641,243,176,222,9,19,0,0,44,ENGL,0.344759,0.487879,3133820.0,3281187.0,0.000707,0.003308
1,aaron-hunt,329,140,70,119,59,39,0,1,8,DEU,0.349332,0.453989,2553329.0,2627685.0,0.000508,0.002356
2,aaron-lennon,412,200,97,115,31,11,0,0,44,ENGL,0.345893,0.491482,2144721.0,2246113.0,0.00122,0.008723
3,aaron-ramsey,254,145,42,67,39,31,0,1,44,ENGL,0.34679,0.51165,3975720.0,4124639.0,0.001406,0.009682
4,abdelhamid-el-kaoutari,124,41,40,43,1,8,4,2,7,FRA,0.3316,0.335587,104797.0,109292.0,0.006216,0.023134


In [16]:
player_cols = ['playerShort','player','club','leagueCountry','birthday','height','weight','position'] # basis

In [17]:
df_player = df[player_cols].drop_duplicates().merge(df_aggregated,how='inner',on='playerShort')

## Cleaning Features
Remove:
- playerID and photoID - already removed in the aggregation process
- refCountry - it is bijective with Alpha_3.
- playerShort, player - because they are just names

Process:
- birthday - convert to date

Dummy Variables:
- club
- leagueCountry
- position
- refCountry

In [18]:
# creating a new dataframe for machine learning later
df_ml = df_player

### Remove irrelevant features

In [19]:
df_ml.drop(['playerShort','player','refCountry'],1,inplace=True)

### Process date

In [20]:
# convert into datetime format
df_ml['birthay'] = pd.to_datetime(df_ml['birthday'])

### Turn categorical variables into dummy variables

In [21]:
club = pd.get_dummies(df_ml['club'])
leagueCountry = pd.get_dummies(df_ml['leagueCountry'])
position = pd.get_dummies(df_ml['position'])
refCountry = pd.get_dummies(df_ml['Alpha_3'])

In [22]:
dummy_variables = [df_ml, club, leagueCountry, position, refCountry]

In [23]:
df_ml = pd.concat(dummy_variables, axis=1)

In [24]:
df_ml.head()

Unnamed: 0,club,leagueCountry,birthday,height,weight,position,games,victories,ties,defeats,...,FIN,FRA,GRC,HUN,ISL,ITA,KOR,NLD,PRT,SCOT
0,Fulham FC,England,08.11.1979,182.0,71.0,Center Back,641,243,176,222,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Manchester City,England,10.11.1985,187.0,80.0,Left Fullback,275,134,55,86,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,Norwich City,England,04.04.1986,180.0,68.0,Defensive Midfielder,198,80,50,68,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Manchester United,England,13.04.1984,193.0,80.0,Goalkeeper,78,40,15,23,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1899 Hoffenheim,Germany,13.03.1987,180.0,70.0,Right Fullback,294,124,71,99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 1. Classification
### Training Random Forest
### Parameter Fitting
### Important Features

## Bonus 
Learning Curves: Cross Validation 

## 2. Clustering
### Feature Removal

# Stuff People have done/Discussions on Slack:

## First Discussion -- Input features
Ismail Bensouda Koraichi	[7:48 PM]  
Hello, I would like to clarify something. In the first question of the homework : " [...] given a soccer player description outputs his skin color", what does "soccer player description" mean ? I mean what should be the input ? A dyad ? An aggregation of multiple dyads ?

Gael Lederrey	[9:24 PM]  
@ismail64 I think that we'll have to aggregate the data by players.. But then, we loose some information (the IAT and Exp for examples).. Therefore, it can be interesting to create a new feature taking into account the removed feature..

But maybe there's another way to deal with the data without aggregating them.. =)

Dunai Fuentes Hitos	[11:31 PM]  
I went with the aggregation by player.
The "smart" way to deal with those features (IAT, Exp, etc.) would be to make a correction upon the number of yellow and red cards received by the player, as this is the only link players have to the referee...
But it all looks a bit far-fetched, so I particularly decided to remove all this data under a presumption of honesty (the referee's honesty) (edited)

Gael Lederrey	[11:57 PM]  
@dunai That's what I did too.. ^^ 
But I'm pretty sure racism exists everywhere. So I just created a weighted sum for the cards, the referees' "racism" values, and the number of times they encountered a player.. This gives you sort of a score linking the _IAT_ and the _Exp_ to all the referees who encounter a player..

At the end, I'm pretty sure that it's important to keep these values.. Because the other values (goals, height, weight, games, etc.) can't be linked to the color of a player.. So, there's only these two "racism" values linked to the number of cards that can maybe give some information..

If someone has another idea, I'd be glad to read about it.. =) (edited)



## Second Discussion - Aggregating Features

Jonas Racine	[3:39 PM]  
hey guys, is anyone able to get much more than ~70% accuracy on the homework? (exercise 1)

bojan.petrovski	[3:42 PM]  
I got something like 78-80

[3:43]  
But I'm aggregating the results by player

Gael Lederrey	[3:53 PM]  
@bojan.petrovski Did you create new features using the ones we loose (Like the IAT and Exp)?

bojan.petrovski	[4:14 PM]  
Yeah for the IATmean and EXPmean I averaged them and for IATstd and EXPstd I calculated the Stadndard deviation on the samples that I used for the new means, because you can't just average standard deviations

## Third Discussion - Dealing with Categorical Features

Paul Nicolet	[10:01 AM]  
Hey guys I’m curious to know if you have a good way to deal with the categorical features in the homework, as `scikit-learn` doesn’t accept them as strings. I figured out a way to encode them using the `OneHotEncoder()` class, in order to get a vector representation of each category, but I have some issues to deal with these vectors now, given a DataFrame doesn’t accept vectors as values… The easy way would be to encode them as integers but it’s risky because it could be interpreted as continuous and ordered data right ?

arnaudmiribel	[10:05 AM]  
We’ve used `LabelEncoder()` which naïvely binds labels to integers values. There are some issues doing this as two successive cells will be numerically _closer_ than two far apart cells (whereas there is no reason for this), so I’d also welcome any hint on this

Baptiste Billardon	[10:08 AM]  
From what I read, there are no implementation of random forest in sklearn that handles categorical features. I also labelEncoded the categorical features then oneHotEncoded them

Ondine Chanon	[11:20 AM]  
And what about dummy variables? It increases the number of attributes, but at least each cell has equal importance and there is no ordering issue.

Ciprian Tomoiaga	[3:53 PM]  
@bojan.petrovski what would be the reason for averaging IATmean (or EXPmean) ?

[3:55]  
I mean, it's great that you get a good score, but it doesn't make much sense to me to average independent values. I mean, they are not related to the player, but to the referee

Alexis Semple	[4:25 PM]  
Hey guys, I wonder how you understood the first sentence of the description for exercise 2 in this homework:
> Aggregate the _referee information_ grouping by soccer player
Maybe I'm missing something, but it seems to me that this indicates that we should not use features relevant only to the player himself (e.g. height, weight, position...) (edited)

Gianrocco Lazzari	[4:49 PM]  
well, in principle I don’t see what you should exclude them…it’s about reducing the variability across referees (the way I understand it) - it might  be that  player-dependent-features are still relevant (edited)

bojan.petrovski	[5:19 PM]  
@cipri_tom well in an ideal world the distribution of red and yellow cards should only depend on the position of the player, i.e. defenders are more likely to make a serious offence. So the cards alone should not help in any way to distinguish between black and white players. My idea is that some referees have a bias so the distribution is not the same for black and white players. And the average of the IAT just gives you a very crude idea of how much the player was discriminated against. For example if the average IAT is high and the player is black I would expect the distribution of cards for that player to be very different and if the IAT is low than you would expect the difference to be small in relation to the distribution for white players. I saw some people mention doing a weighted average based on the number of cards, but I think that in that way you are skewing the average because the lack of a card is also valuable information. (edited)

[5:20]  
But again it's hard to tell if the model is actually doing this kind of prediction or is picking up on some other factors