# Homework 04 - Assignement 2

_Goal_ :

**We want to apply _unsupervised learning_ to the player-referee dyads dataset aggregated by player, to cluster the players in $n=2$ clusters. We will use `KMeans` technique to do so.**

_Tools_ :

**The tools used for this homework are :**
* Pandas
* Scikit Learn

_Contents_ :

* [1 - Importing data](#1---Importing-data)
* [2 - Players clustering](#2---Players-clustering)

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime # needed for df.birthday
import seaborn as sns
sns.set_context('notebook')

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import silhouette_score

# 1 - Importing data

## Loading and cleaning dataset

As usual, we begin by loading and cleaning a bit our dataset.

In [2]:
# Read dataset
df = pd.read_csv('./CrowdstormingDataJuly1st.csv')

# Remove redondant and useless features
df = df.drop(['player','Alpha_3','photoID'],axis=1)

# Drop NA
df = df.dropna()

print(df.shape)
df.head()

(115457, 25)


Unnamed: 0,playerShort,club,leagueCountry,birthday,height,weight,position,games,victories,ties,...,rater1,rater2,refNum,refCountry,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,lucas-wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,0,...,0.25,0.5,1,1,0.326391,712.0,0.000564,0.396,750.0,0.002696
1,john-utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,0,...,0.75,0.75,2,2,0.203375,40.0,0.010875,-0.204082,49.0,0.061504
5,aaron-hughes,Fulham FC,England,08.11.1979,182.0,71.0,Center Back,1,0,0,...,0.25,0.0,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752
6,aleksandar-kolarov,Manchester City,England,10.11.1985,187.0,80.0,Left Fullback,1,1,0,...,0.0,0.25,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752
7,alexander-tettey,Norwich City,England,04.04.1986,180.0,68.0,Defensive Midfielder,1,0,0,...,1.0,1.0,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752


## Encoding categorical features

In order to handle categorical features, we need to encode them. Here we choose a simple technique using a `LabelEncoder` that maps them to integer values. More advanced techniques include for example a `OneHotEncoder`.

In [3]:
# Reference :
# https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/

# Useful for aggregation step
categorical_features = []

# Instanciate the encoder
le = LabelEncoder()

# Iterate over columns, encode the categorical ones
for col in df.columns.values:
    if df[col].dtype == 'object':
        # Remember it
        categorical_features.append(col)
        # Encode it
        le.fit(df[col].values)
        # Replace it
        df[col] = le.transform(df[col])

print('Encoded categorical features :', categorical_features)
df.head()

Encoded categorical features : ['playerShort', 'club', 'leagueCountry', 'birthday', 'position']


Unnamed: 0,playerShort,club,leagueCountry,birthday,height,weight,position,games,victories,ties,...,rater1,rater2,refNum,refCountry,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,790,66,3,1246,177.0,72.0,0,1,0,0,...,0.25,0.5,1,1,0.326391,712.0,0.000564,0.396,750.0,0.002696
1,651,48,1,286,179.0,82.0,11,1,0,0,...,0.75,0.75,2,2,0.203375,40.0,0.010875,-0.204082,49.0,0.061504
5,0,33,0,322,182.0,71.0,1,1,0,0,...,0.25,0.0,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752
6,39,45,0,400,187.0,80.0,6,1,1,0,...,0.0,0.25,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752
7,54,51,0,142,180.0,68.0,4,1,0,0,...,1.0,1.0,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752


## Aggregating by player

Since we want to cluster *players*, we need to aggregate our data. The canonical *pandas* way to do that is to use `group_by` followed by an application of functions on the grouped features. The last step is done with `agg` and by passing a dictionary of functions corresponding to features.

In [4]:
# Function that given a 'group' feature, keeps the first one
# In our case, the categorical features are unique / player
def keep_first(x):
    return x.unique()[0]

# Define which operation to apply after grouping by player
aggregation_fun = {c: np.mean for c in df.columns.values}
aggregation_fun.pop('playerShort')
for feat in categorical_features:
    aggregation_fun[feat] = keep_first

# Group by player
player_group = df.groupby('playerShort').agg(aggregation_fun)
player_group.head()

Unnamed: 0_level_0,meanExp,defeats,seIAT,redCards,games,refNum,leagueCountry,birthday,nIAT,nExp,...,height,goals,ties,yellowReds,yellowCards,victories,playerShort,meanIAT,rater2,refCountry
playerShort,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.494575,1.373494,0.000652,0.0,3.939759,1612.656627,0,322,19710.156627,20637.277108,...,182.0,0.054217,1.078313,0.0,0.114458,1.487952,0,0.346459,0.0,43.921687
1,0.44922,1.232323,0.000219,0.010101,3.393939,1662.515152,2,160,26104.292929,26864.454545,...,183.0,0.626263,0.737374,0.0,0.424242,1.424242,1,0.348818,0.25,25.070707
2,0.491482,1.138614,0.000367,0.0,4.079208,1598.871287,0,641,21234.861386,22238.742574,...,165.0,0.306931,0.960396,0.0,0.108911,1.980198,2,0.345893,0.25,42.772277
3,0.514693,0.653846,0.003334,0.009615,2.5,1668.5,0,1077,38285.826923,39719.980769,...,178.0,0.375,0.403846,0.0,0.298077,1.442308,3,0.346821,0.0,45.067308
4,0.335587,1.162162,0.001488,0.054054,3.351351,1610.891892,1,677,2832.351351,2953.837838,...,180.0,0.027027,1.081081,0.108108,0.216216,1.108108,4,0.3316,0.25,17.189189


# 2 - Players clustering

In order to create 2 clusters of players, we will use here the unsupervised learning method k-means. In a first part we run a "simple" k-means clustering and discuss a bit the results, and then we iteratively remove features to see how the clustering evolves. 

## KMeans

In [5]:
# Preparing data (with 'usual' variable names)
X = player_group.drop(['rater1', 'rater2'], axis=1)
features = X.columns.values
X = scale(X)
y = round((player_group.rater1 + player_group.rater2)/2)

In [6]:
# Instanciate and fit KMeans
km = KMeans(n_clusters=2)
km.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

Let's compute the silhouette score :

In [7]:
silhouette_score(X, km.labels_)

0.13329014184911325

From Scikit doc about silhouette score :
> The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

This means that here we have somewhat overlapping clusters.

Let's inspect a bit the clusters.

In [8]:
# Number of labels...
len(km.labels_)

1419

In [9]:
# ... which is indeed the number of players
player_group.shape

(1419, 25)

In [10]:
# Indices of elements in each cluster
idx_0 = km.labels_ == 0
idx_1 = km.labels_ == 1

In [11]:
# Number of elements in the cluster '0'
sum(idx_0)

474

In [12]:
# Number of elements in the cluster '1'
sum(idx_1)

945

Remembering our objective, we compute the percentage of players labeled as having a "dark" skin in each cluster :

In [13]:
# Percentage of 'dark skin' in cluster 0
sum(y[idx_0])/len(y[idx_0])

0.10548523206751055

In [14]:
# Percentage of 'dark skin' in cluster 1
sum(y[idx_1])/len(y[idx_1])

0.1873015873015873

The two last percentages seem to indicate that there is no class with a lot more black / white people than the other.

## Iteratively removing features

Now, we remove features iteratively, and at each step perform again the clustering and compute the silhouette score.

In [15]:
# Total number of features
num_features = len(features)

# Instanciate KMeans
km = KMeans(n_clusters=2)

# Iteratively remove features
for feature_num in range(num_features,1,-1):
    # Fit KMeans
    km.fit(X[:,1:feature_num])
    
    # Compute silhouette score
    silhouette = silhouette_score(X[:,1:feature_num], km.labels_)
    
    # Indices of elements in each cluster
    idx_0 = km.labels_ == 0
    idx_1 = km.labels_ == 1
    
    # Percentage of 'dark skin' in cluster 0
    dark0 = sum(y[idx_0])/len(y[idx_0])
    # Percentage of 'dark skin' in cluster 1
    dark1 = sum(y[idx_1])/len(y[idx_1])
    
    # Print summary
    print("Features :", features[:feature_num])
    print("\t silhouette score =", silhouette)
    print("\t % of 'dark' skin in cluster 0 :", dark0)
    print("\t % of 'dark' skin in cluster 1 :", dark1)
    print("\n")

Features : ['meanExp' 'defeats' 'seIAT' 'redCards' 'games' 'refNum' 'leagueCountry'
 'birthday' 'nIAT' 'nExp' 'club' 'seExp' 'position' 'weight' 'height'
 'goals' 'ties' 'yellowReds' 'yellowCards' 'victories' 'playerShort'
 'meanIAT' 'refCountry']
	 silhouette score = 0.139038852625
	 % of 'dark' skin in cluster 0 : 0.181818181818
	 % of 'dark' skin in cluster 1 : 0.117768595041


Features : ['meanExp' 'defeats' 'seIAT' 'redCards' 'games' 'refNum' 'leagueCountry'
 'birthday' 'nIAT' 'nExp' 'club' 'seExp' 'position' 'weight' 'height'
 'goals' 'ties' 'yellowReds' 'yellowCards' 'victories' 'playerShort'
 'meanIAT']
	 silhouette score = 0.138277690799
	 % of 'dark' skin in cluster 0 : 0.121863799283
	 % of 'dark' skin in cluster 1 : 0.184668989547


Features : ['meanExp' 'defeats' 'seIAT' 'redCards' 'games' 'refNum' 'leagueCountry'
 'birthday' 'nIAT' 'nExp' 'club' 'seExp' 'position' 'weight' 'height'
 'goals' 'ties' 'yellowReds' 'yellowCards' 'victories' 'playerShort']
	 silhouette score = 

The last (two or three) iterations have a much higher silhouette score than the others... however player with dark and light skin colors don't really belong to different clusters !