# HW4 - Applied ML

This homework has four parts:
* First, we pre-process and explore the dataset (0).
* Then, we train a classifier to predict the labels (1).
* As a bonus, we plot learning curves (Bonus).
* Finally, we do clustering (2).

## 0. Data Pre-Processing and Visualization
We first get to know the dataset and clean its variables. This matters because both the supervised and the unsupervised learning steps later on need clean data.

Following http://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb, we first clean the dataset, then aggregate the information per player, and finally clean up the features.

### Pre-Processing Tasks and Outline

1) Dataset cleaning

- exclude interactions by referees who feature in fewer than 22 dyads
- average the two race ratings, then drop dyads that have no race value

2) Aggregate to player data

- combine games, referee and bias information using summary statistics of each variable

3) Cleaning features

- take out unnecessary features: player ID, photo ID
- convert the date into a proper format
- convert categorical variables into dummy variables
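
Step 2 of the outline can be sketched as follows. This is a minimal illustration on an invented three-row frame, not the aggregation used in this notebook; summing the counts and taking the first value of per-player attributes is an assumption about what a reasonable aggregation looks like.

```python
import pandas as pd

# Toy dyads (invented values): two rows for player "a", one for player "b".
dyads = pd.DataFrame({
    "playerShort": ["a", "a", "b"],
    "height": [180.0, 180.0, 175.0],
    "games": [3, 5, 2],
    "yellowCards": [1, 0, 1],
    "redCards": [0, 1, 0],
    "race_val": [0.25, 0.25, 0.75],
})

# Assumption: counts are summed over referees, while attributes that are
# constant for a player (height, race_val) are taken from the first row.
players = dyads.groupby("playerShort").agg(
    height=("height", "first"),
    games=("games", "sum"),
    yellowCards=("yellowCards", "sum"),
    redCards=("redCards", "sum"),
    race_val=("race_val", "first"),
)
print(players)
```

The result has one row per player, which is the shape a per-player classifier needs.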


In [65]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
%matplotlib inline

In [45]:
df = pd.read_csv('Data/CrowdstormingDataJuly1st.csv')

In [46]:
orig_num_dyads = df.shape[0]
print('Number of dyads:', orig_num_dyads)

Number of dyads: 146028


## 0.1 Dataset Cleaning

### Exclude Interactions

In [47]:
# Keep only the referees who appear in more than 21 dyads
all_refs = df.refNum.value_counts()
good_refs = all_refs[all_refs>21]

df=df[df['refNum'].isin(good_refs.index.values)]

In [48]:
print('percentage of dyads left:', 100 * df.shape[0] / orig_num_dyads)

percentage of dyads left: 91.4215082039061


### Getting Race Values

Get the actual "race value" by averaging the scores of the two raters (`rater1` and `rater2`). Then drop the dyads that have no rating.

In [49]:
df['race_val'] = (df['rater1'] + df['rater2'])/2

In [50]:
df.dropna(subset=['race_val'],inplace=True)

In [51]:
# Percentage of data left...
print('percentage of dyads left:', 100 * df.shape[0] / orig_num_dyads)

percentage of dyads left: 77.9727175610157


In [52]:
# Drop rater1, rater2, player, photoID, Alpha_3 and birthday

In [59]:
df_new=df.copy()

In [60]:
df_new.drop(['rater1','rater2','player','photoID','Alpha_3','birthday'],axis=1, inplace=True)

In [61]:
df_new.isnull().any()

playerShort      False
club             False
leagueCountry    False
height            True
weight            True
position          True
games            False
victories        False
ties             False
defeats          False
goals            False
yellowCards      False
yellowReds       False
redCards         False
refNum           False
refCountry       False
meanIAT           True
nIAT              True
seIAT             True
meanExp           True
nExp              True
seExp             True
race_val         False
dtype: bool

In [74]:
df_new.playerShort.unique().shape

(1584,)

In [63]:
# Get the categorical values into a 1D numpy array
train_categorical_values = np.array(df_new['playerShort'])

In [90]:
train_categorical_values

array([    0.,    44.,    63., ...,  1552.,  1560.,  1572.])

In [75]:
# Encode the player names as integer labels
enc_label = LabelEncoder()
train_data = enc_label.fit_transform(train_categorical_values)

In [77]:
train_categorical_values = train_data.astype(float)

In [79]:
train_categorical_values

array([    0.,    44.,    63., ...,  1552.,  1560.,  1572.])

In [80]:
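# NOTE: the one-hot encoder is created here but never applied. The next line
# re-runs the label encoder, so the result is still a column of integer codes.
enc_onehot = OneHotEncoder()
train_cat_data = enc_label.fit_transform(train_categorical_values)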

In [88]:
train_cat_data_df = pd.DataFrame(np.asarray(train_cat_data))

In [89]:
train_cat_data_df

     0
0    0
1   44
2   63
3   88
4  102
5  130
6  150
7  186
8  276
9  277
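
The output above is still a single column of integer codes, because only `LabelEncoder` was applied. A minimal sketch of what an actual one-hot step could look like is given here, on an invented array of player names standing in for `df_new['playerShort']`. `OneHotEncoder` expects a 2D input, hence the reshape, and returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Invented stand-in for df_new['playerShort'] (not real player names).
names = np.array(["player-a", "player-b", "player-a", "player-c"])

enc_onehot = OneHotEncoder()
# OneHotEncoder needs shape (n_samples, n_features), so reshape to one column.
onehot = enc_onehot.fit_transform(names.reshape(-1, 1))

# The result is a SciPy sparse matrix; densify it only for small inputs.
dense = onehot.toarray()
print(enc_onehot.categories_[0])
print(dense)
```

With the 1584 distinct players found above, this would produce 1584 columns, so keeping the matrix sparse is probably the safer default.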


## 1. Classification
### Training Random Forest
### Parameter Fitting
### Important Features
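
The three steps named in these headings could be sketched as follows. This is a rough outline on synthetic data from `make_classification`, not on the player table, and the parameter grid is an arbitrary placeholder rather than a tuned choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the aggregated player table (not the real data).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Small, arbitrary grid; the values are placeholders.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print("best parameters:", search.best_params_)
# Impurity-based importances of the refitted best model; they sum to 1.
importances = search.best_estimator_.feature_importances_
print("feature importances:", importances.round(3))
```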

## Bonus 
Learning Curves: Cross Validation 
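
One way to obtain cross-validated learning curves is scikit-learn's `learning_curve` helper. The sketch below again uses synthetic data as a stand-in for the player table; the forest size, the five training sizes and the five folds are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the aggregated player table (not the real data).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=30, random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
)

# Each row is one training size, each column one cross-validation fold.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):4d} samples  train={tr:.3f}  validation={va:.3f}")
```

Plotting the two mean curves against `train_sizes` shows whether more data or a simpler model is likely to help.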

## 2. Clustering
### Feature Removal

# Stuff People Have Done / Discussions on Slack

## First Discussion -- Input features
Ismail Bensouda Koraichi	[7:48 PM]  
Hello, I would like to clarify something. In the first question of the homework : " [...] given a soccer player description outputs his skin color", what does "soccer player description" mean ? I mean what should be the input ? A dyad ? An aggregation of multiple dyads ?

Gael Lederrey	[9:24 PM]  
@ismail64 I think that we'll have to aggregate the data by players.. But then, we lose some information (the IAT and Exp for example).. Therefore, it can be interesting to create a new feature taking into account the removed feature..

But maybe there's another way to deal with the data without aggregating them.. =)

Dunai Fuentes Hitos	[11:31 PM]  
I went with the aggregation by player.
The "smart" way to deal with those features (IAT, Exp, etc.) would be to make a correction upon the number of yellow and red cards received by the player, as this is the only link players have to the referee...
But it all looks a bit far-fetched, so I particularly decided to remove all this data under a presumption of honesty (the referee's honesty) (edited)

Gael Lederrey	[11:57 PM]  
@dunai That's what I did too.. ^^ 
But I'm pretty sure racism exists everywhere. So I just created a weighted sum for the cards, the referees' "racism" values, and the number of times they encountered a player.. This gives you sort of a score linking the _IAT_ and the _Exp_ to all the referees who encounter a player..

At the end, I'm pretty sure that it's important to keep these values.. Because the other values (goals, height, weight, games, etc.) can't be linked to the color of a player.. So, there's only these two "racism" values linked to the number of cards that can maybe give some information..

If someone has another idea, I'd be glad to read about it.. =) (edited)
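
Note: the message above does not give the exact formula for the weighted sum. One possible reading is sketched below on invented numbers: each referee's `meanIAT` is weighted by the cards that referee gave the player, and the total is normalised by the number of games. The column names `cards`, `weighted` and `iat_card_score` are hypothetical.

```python
import pandas as pd

# Toy dyads (invented values), two referees per player.
dyads = pd.DataFrame({
    "playerShort": ["a", "a", "b", "b"],
    "games": [4, 6, 5, 5],
    "yellowCards": [1, 0, 1, 0],
    "redCards": [0, 1, 0, 0],
    "meanIAT": [0.30, 0.40, 0.35, 0.25],
})

# Hypothetical weighting: cards received from a referee, scaled by that
# referee's meanIAT, then normalised by the games played with all referees.
dyads["cards"] = dyads["yellowCards"] + dyads["redCards"]
dyads["weighted"] = dyads["cards"] * dyads["meanIAT"]

sums = dyads.groupby("playerShort")[["weighted", "games"]].sum()
iat_card_score = sums["weighted"] / sums["games"]
print(iat_card_score)
```

Under this weighting, dyads in which no card was given contribute nothing to the numerator.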



## Second Discussion - Aggregating Features

Jonas Racine	[3:39 PM]  
hey guys, is anyone able to get much more than ~70% accuracy on the homework? (exercise 1)

bojan.petrovski	[3:42 PM]  
I got something like 78-80

[3:43]  
But I'm aggregating the results by player

Gael Lederrey	[3:53 PM]  
@bojan.petrovski Did you create new features using the ones we lose (Like the IAT and Exp)?

bojan.petrovski	[4:14 PM]  
Yeah for the IATmean and EXPmean I averaged them and for IATstd and EXPstd I calculated the standard deviation on the samples that I used for the new means, because you can't just average standard deviations
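
Note: the aggregation described in the last message could look roughly like this. The sketch uses invented values, and the output column names (`IAT_mean`, `IAT_std`, and so on) are hypothetical. The spread is computed from the same per-dyad values that enter the new mean, instead of averaging the per-referee standard errors.

```python
import pandas as pd

# Toy dyads (invented values): three referees for player "a", two for "b".
dyads = pd.DataFrame({
    "playerShort": ["a", "a", "a", "b", "b"],
    "meanIAT": [0.30, 0.34, 0.38, 0.33, 0.35],
    "meanExp": [0.40, 0.50, 0.60, 0.45, 0.55],
})

per_player = dyads.groupby("playerShort").agg(
    IAT_mean=("meanIAT", "mean"),
    IAT_std=("meanIAT", "std"),   # sample std (ddof=1) of the averaged values
    Exp_mean=("meanExp", "mean"),
    Exp_std=("meanExp", "std"),
)
print(per_player)
```

A player seen by a single referee gets `NaN` for the standard deviation, which then has to be filled or dropped before training.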

## Third Discussion - Dealing with Categorical Features

Paul Nicolet	[10:01 AM]  
Hey guys I’m curious to know if you have a good way to deal with the categorical features in the homework, as `scikit-learn` doesn’t accept them as strings. I figured out a way to encode them using the `OneHotEncoder()` class, in order to get a vector representation of each category, but I have some issues to deal with these vectors now, given a DataFrame doesn’t accept vectors as values… The easy way would be to encode them as integers but it’s risky because it could be interpreted as continuous and ordered data right ?

arnaudmiribel	[10:05 AM]  
We’ve used `LabelEncoder()` which naïvely binds labels to integers values. There are some issues doing this as two successive cells will be numerically _closer_ than two far apart cells (whereas there is no reason for this), so I’d also welcome any hint on this

Baptiste Billardon	[10:08 AM]  
From what I read, there is no implementation of random forest in sklearn that handles categorical features. I also labelEncoded the categorical features then oneHotEncoded them

Ondine Chanon	[11:20 AM]  
And what about dummy variables? It increases the number of attributes, but at least each cell has equal importance and there is no ordering issue.
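
Note: the dummy-variable suggestion can be tried directly in pandas, which avoids storing vectors inside DataFrame cells. The sketch below uses an invented three-row frame; the category values are made up for illustration.

```python
import pandas as pd

# Toy player table (invented values).
players = pd.DataFrame({
    "position": ["Goalkeeper", "Center Back", "Goalkeeper"],
    "leagueCountry": ["England", "Spain", "Spain"],
    "height": [188.0, 185.0, 191.0],
})

# One indicator column per category; numeric columns pass through untouched.
encoded = pd.get_dummies(players, columns=["position", "leagueCountry"])
print(encoded.columns.tolist())
```

Each category becomes its own column, so no artificial ordering is introduced, at the cost of a wider table.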

Ciprian Tomoiaga	[3:53 PM]  
@bojan.petrovski what would be the reason for averaging IATmean (or EXPmean) ?

[3:55]  
I mean, it's great that you get a good score, but it doesn't make much sense to me to average independent values. I mean, they are not related to the player, but to the referee

Alexis Semple	[4:25 PM]  
Hey guys, I wonder how you understood the first sentence of the description for exercise 2 in this homework:
> Aggregate the _referee information_ grouping by soccer player
Maybe I'm missing something, but it seems to me that this indicates that we should not use features relevant only to the player himself (e.g. height, weight, position...) (edited)

Gianrocco Lazzari	[4:49 PM]  
well, in principle I don’t see why you should exclude them…it’s about reducing the variability across referees (the way I understand it) - it might be that player-dependent features are still relevant (edited)

bojan.petrovski	[5:19 PM]  
@cipri_tom well in an ideal world the distribution of red and yellow cards should only depend on the position of the player, i.e. defenders are more likely to make a serious offence. So the cards alone should not help in any way to distinguish between black and white players. My idea is that some referees have a bias so the distribution is not the same for black and white players. And the average of the IAT just gives you a very crude idea of how much the player was discriminated against. For example if the average IAT is high and the player is black I would expect the distribution of cards for that player to be very different and if the IAT is low then you would expect the difference to be small in relation to the distribution for white players. I saw some people mention doing a weighted average based on the number of cards, but I think that in that way you are skewing the average because the lack of a card is also valuable information. (edited)

[5:20]  
But again it's hard to tell if the model is actually doing this kind of prediction or is picking up on some other factors