# ADA / Applied Data Analysis
<h2 style="color:#a8a8a8">Homework 4 - Applied Machine Learning<br>
Aimée Montero, Alfonso Peterssen, Cyriaque Brousse</h2>

## Assignment description
In this homework we will gain experience with applied machine learning by exploring an interesting dataset about soccer players and referees.
You can find all the data in the `CrowdstormingDataJuly1st.csv` file, while you can read a thorough [dataset description here](DATA.md).
Given that the focus of this homework is Machine Learning, I recommend that you first take a look at [this notebook](http://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb)
containing solid work on pre-processing and visualization of the given dataset. You are *not* allowed to just copy/paste the pre-processing steps
performed by the notebook authors -- you are still supposed to perform your own data analysis for the homework. Still, I'm confident that first consulting
the work done by expert data analysts will tangibly speed up your effort (i.e., they have already found many glitches in the data for you :)


### Assignment
1. Train a `sklearn.ensemble.RandomForestClassifier` that given a soccer player description outputs his skin color. Show how different parameters 
passed to the Classifier affect the overfitting issue. Perform cross-validation to mitigate the overfitting of your model. Once you have assessed your model,
inspect the `feature_importances_` attribute and discuss the obtained results. With different assumptions on the data (e.g., dropping certain features even
before feeding them to the classifier), can you obtain a substantially different `feature_importances_` attribute?

  *BONUS*: plot the learning curves against at least 2 different sets of parameters passed to your Random Forest. To obtain smooth curves, partition
your data in at least 20 folds. Can you find a set of parameters that leads to high bias, and one which does not?

2. Aggregate the referee information grouping by soccer player, and use an unsupervised learning technique to cluster the soccer players in 2 disjoint
clusters. Remove features iteratively, and at each step perform again the clustering and compute the silhouette score -- can you find a configuration of features with high silhouette
score where players with dark and light skin colors belong to different clusters? Discuss the obtained results.
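
As a rough sketch of the workflow both tasks ask for, assuming scikit-learn is installed and using synthetic data from `make_classification` in place of the real player features (the sample size, `n_estimators`, and `max_depth` values are arbitrary illustrations, not recommended settings):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the player features and skin-color labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Task 1: compare a shallow and an unconstrained forest with 5-fold cross-validation.
for max_depth in (2, None):
    clf = RandomForestClassifier(n_estimators=50, max_depth=max_depth, random_state=0)
    scores = cross_validate(clf, X, y, cv=5, return_train_score=True)
    # A large gap between train and validation accuracy suggests overfitting.
    print(max_depth, scores['train_score'].mean(), scores['test_score'].mean())

# Feature importances of a forest fitted on all the data (they sum to 1).
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.feature_importances_)

# Task 2: cluster into 2 groups and score the separation with the silhouette.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))
```

Comparing the train and validation scores for each parameter set is one simple way to see overfitting; the silhouette score ranges from -1 to 1, with higher values indicating better-separated clusters.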

## Data description

From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3147) that these players played under in their professional career. We created a dataset of player–referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.
 
Player photos were available from the source for 1586 out of 2053 players. Players’ skin tone was coded by two independent raters blind to the research question who, based on their profile photo, categorized players on a 5-point scale ranging from “very light skin” to “very dark skin” with “neither dark nor light skin” as the center value. 

Additionally, implicit bias scores for each referee country were calculated using a race implicit association test (IAT), with higher values corresponding to faster white | good, black | bad associations. Explicit bias scores for each referee country were calculated using a racial thermometer task, with higher values corresponding to greater feelings of warmth toward whites versus blacks. Both these measures were created by aggregating data from many online users in referee countries taking these tests on [Project Implicit](http://projectimplicit.net).

In all, the dataset has a total of 146028 dyads of players and referees. A detailed description of all variables in the dataset can be seen in the list below.

### Variables:

- *playerShort* - short player ID
- *player* - player name
- *club* - player club
- *leagueCountry* - country of player club (England, Germany, France, and Spain)
- *birthday* - player birthday
- *height* - player height (in cm)
- *weight* - player weight (in kg)
- *position* - detailed player position
- *games* - number of games in the player-referee dyad
- *victories* - victories in the player-referee dyad
- *ties* - ties in the player-referee dyad
- *defeats* - losses in the player-referee dyad
- *goals* - goals scored by a player in the player-referee dyad
- *yellowCards* - number of yellow cards player received from referee
- *yellowReds* - number of yellow-red cards player received from referee
- *redCards* - number of red cards player received from referee
- *photoID* - ID of player photo (if available)
- *rater1* - skin rating of photo by rater 1 (5-point scale ranging from “very light skin” to “very dark skin”)
- *rater2* - skin rating of photo by rater 2 (5-point scale ranging from “very light skin” to “very dark skin”)
- *refNum* - unique referee ID number (referee name removed for anonymizing purposes)
- *refCountry* - unique referee country ID number (country name removed for anonymizing purposes)
- *meanIAT* - mean implicit bias score (using the race IAT) for referee country, higher values correspond to faster white | good, black | bad associations
- *nIAT* - sample size for race IAT in that particular country
- *seIAT* - standard error for mean estimate of race IAT
- *meanExp* - mean explicit bias score (using a racial thermometer task) for referee country, higher values correspond to greater feelings of warmth toward whites versus blacks
- *nExp* - sample size for explicit bias in that particular country
- *seExp* - standard error for mean estimate of explicit bias measure

## Part 1 - Familiarizing ourselves with the data

Let's import the required libraries:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
%matplotlib inline

And import the data from the CSV source:

In [3]:
data = pd.read_csv('CrowdstormingDataJuly1st.csv')
data.sample(5)

| | playerShort | player | club | leagueCountry | birthday | height | weight | position | games | victories | ... | rater2 | refNum | refCountry | Alpha_3 | meanIAT | nIAT | seIAT | meanExp | nExp | seExp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16033 | mathieu-coutadeur | Mathieu Coutadeur | FC Lorient | France | 20.06.1986 | 174.0 | 66.0 | Defensive Midfielder | 2 | 0 | ... | 0.0 | 343 | 7 | FRA | 0.334684 | 2882.0 | 0.000151 | 0.336101 | 3011.0 | 0.000586 |
| 41839 | milan-petrzela | Milan Petržela | FC Augsburg | Germany | 19.06.1983 | 175.0 | 65.0 | Right Midfielder | 1 | 0 | ... | 0.0 | 759 | 8 | DEU | 0.336628 | 7749.0 | 5.5e-05 | 0.335967 | 7974.0 | 0.000225 |
| 53229 | franco-zuculini | Franco Zuculini | Real Zaragoza | Spain | 05.09.1990 | 174.0 | 69.0 | Defensive Midfielder | 1 | 0 | ... | 0.25 | 1054 | 49 | ARG | 0.379422 | 1038.0 | 0.000403 | 0.632988 | 1158.0 | 0.002096 |
| 97975 | roque-santa-cruz | Roque Santa Cruz | Málaga CF | Spain | 16.08.1981 | 191.0 | 83.0 | Center Forward | 9 | 4 | ... | 0.5 | 2080 | 44 | ENGL | 0.32669 | 44791.0 | 1e-05 | 0.356446 | 46916.0 | 3.7e-05 |
| 140592 | kenwyne-jones | Kenwyne Jones | Stoke City | England | 05.10.1984 | 188.0 | 78.0 | Right Winger | 2 | 1 | ... | 1.0 | 2986 | 44 | ENGL | 0.32669 | 44791.0 | 1e-05 | 0.356446 | 46916.0 | 3.7e-05 |


First, we need to reconcile the `rater1` and `rater2` columns: we want a single target value against which to evaluate the model. We use the mean as the aggregation function:

In [4]:
# Average the two ratings; the result is NaN when either rating is missing
data['skin'] = (data.rater1 + data.rater2) / 2
data = data.drop(['rater1', 'rater2'], axis=1)
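
To illustrate how this averaging behaves, here is a minimal sketch on a hypothetical toy frame (the values are made up for illustration, not taken from the dataset). The mean is only defined when both ratings are present, so a player without a rated photo ends up with a missing `skin` value and cannot be used as a labelled example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; NaN marks a player without a rated photo.
toy = pd.DataFrame({
    'rater1': [0.0, 0.25, 1.0, np.nan],
    'rater2': [0.0, 0.50, 1.0, np.nan],
})

# Same aggregation as above: arithmetic mean of the two ratings.
toy['skin'] = (toy.rater1 + toy.rater2) / 2

# Rows without a rating cannot serve as labels, so we drop them here.
labelled = toy.dropna(subset=['skin'])
print(labelled.skin.tolist())
```

Whether to drop such rows or handle them differently is a modelling choice; dropping is only the simplest option.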

Let's also drop the `photoID` column, since we don't have access to the photos anyway:

In [7]:
data = data.drop('photoID', axis=1)

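We notice that the `refCountry` and `Alpha_3` columns are in one-to-one correspondence: each numeric country ID maps to exactly one country code, so the two columns carry the same information and one of them can be dropped.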
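
One possible way to check such a correspondence is sketched below. The helper `is_one_to_one` is a hypothetical name introduced for illustration, and the toy frame reuses a few ID/code pairs as they appear in the sample output above:

```python
import pandas as pd

# Toy frame mimicking the two columns (a handful of pairs, for illustration only).
toy = pd.DataFrame({
    'refCountry': [7, 8, 49, 44, 44],
    'Alpha_3': ['FRA', 'DEU', 'ARG', 'ENGL', 'ENGL'],
})

def is_one_to_one(df, a, b):
    """Return True if each value of `a` pairs with exactly one value of `b`, and vice versa."""
    # Keep each distinct (a, b) pair once, then check neither side repeats.
    pairs = df[[a, b]].drop_duplicates()
    return bool(pairs[a].is_unique and pairs[b].is_unique)

print(is_one_to_one(toy, 'refCountry', 'Alpha_3'))
```

Running the same helper on the full `data` frame would confirm (or refute) the observation before either column is dropped.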

In [None]:
# Either column could be dropped; here we choose to keep the numeric refCountry
data = data.drop('Alpha_3', axis=1)