# 2. Unsupervised Learning
In this part, we explore whether our data set can be partitioned using the given features, and then check whether the resulting clusters correlate with skin color.
## 2.1 Data cleaning & preprocessing

In [156]:
from itertools import combinations

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

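from pprint import pprint as pp  # assumption: pp() used in a later cell is pprint

# get_tuple is called in later cells; we assume it is defined in HelperFunctions as well
from HelperFunctions import get_kmeans_result, get_tuple, sil_score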

# We read the CSV and drop rows if *rater1* or *rater2* is NaN
df = pd.read_csv('CrowdstormingDataJuly1st.csv').dropna(how='any', subset=['rater1', 'rater2'])

# Then we build a new column from the mean of these two, so that we only deal with a single feature
df['skin'] = df[['rater1', 'rater2']].mean(axis=1)

# And we don't need these two anymore.
df = df.drop(['rater1', 'rater2'], axis=1)

# We keep only the birth year (the last dot-separated part of the date string)
df['birthday'] = df['birthday'].map(lambda x: int(str(x).split('.')[-1]))

Some columns still contain NaN values, which we need to fill if we want to use those rows. We fill them with the *mean* of each column.

In [None]:
for _col_name_ in ['height', 'weight','meanIAT','nIAT','seIAT']:
    df[_col_name_] = df[_col_name_].fillna(df[_col_name_].mean())

In [None]:
df_grouped = df.groupby(['playerShort'])

# Here we prepare a different DataFrame to use after KMeans experiments.
# We binarize the skin color as 0 and 1, to match them with clusters, later.

df_skin = df_grouped.agg({'skin': 'mean'})
df_skin['actual_skin'] = np.where(df_skin['skin'] > 0.5, 1,0)

In [103]:
f = {
    'club':'first',
    'birthday':'first',
    'height':'first', 
    'weight':'first',
    'games': 'sum', 
    'victories':'sum',
    'ties': 'sum',
    'defeats': 'sum', 
    'goals': 'sum', 
    'yellowCards': 'sum', 
    'yellowReds': 'sum', 
    'redCards': 'sum',
    'position':'first',
    #'refNum':'first',       # We won't use refNum: it is an ID, so aggregating it makes no sense
    #'refCountry':'first',   # Similarly, we exclude refCountry, since it is also an ID (one that maps to countries)
    'meanIAT':'mean',
    'nIAT':'mean', 
    'seIAT':'mean',
    'meanExp': 'mean',
    'nExp':'mean', 
    'seExp':'mean',
    #'skin':'mean',          # We will be doing Unsupervised learning so we should exclude what we are testing against.
    'leagueCountry':'first'
}

_df_aggregated = df_grouped.agg(f)

Looking at the columns of the aggregated data, we noticed the categorical features **club**, **leagueCountry** and **position**. Mixing categorical and numerical features in KMeans is tricky, because the distance function is not meaningful for categories. For example, it makes little sense to compute a distance between the names (strings) of two football teams, and even if we encode them with *OneHotEncoder*, the result does not look like a useful feature.
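
To illustrate why one-hot encoded club names add little to a distance-based method, here is a small sketch. It uses plain Python rather than the notebook's dummy-column pipeline, the helper name `one_hot` is ours, and the three clubs are just sample values from the data. Every pair of distinct clubs ends up exactly the same distance apart, so the encoding carries no notion of "similar" clubs.

```python
from itertools import combinations
from math import dist  # Python 3.8+

clubs = ["Arsenal FC", "Chelsea FC", "FC Barcelona"]

def one_hot(value, categories):
    """Return a 0/1 indicator vector with a single 1 at the position of value."""
    return [1.0 if value == c else 0.0 for c in categories]

encoded = {club: one_hot(club, clubs) for club in clubs}

# Euclidean distance between every pair of distinct clubs
pair_distances = {
    (a, b): dist(encoded[a], encoded[b]) for a, b in combinations(clubs, 2)
}

for pair, d in pair_distances.items():
    print(pair, round(d, 4))
```

Each printed distance equals the square root of 2, whichever two clubs are compared.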

Nevertheless, we tried using **club** alone (encoded as dummy features) to see what the clustering looks like.

In [91]:
pp(get_kmeans_result(_df_aggregated,['club']))

{'cluster_0_blacks': 268,
 'cluster_0_whites': 1290,
 'cluster_1_blacks': 1,
 'cluster_1_whites': 26,
 'features': 'club_1. FC Nürnberg,club_1. FSV Mainz 05,club_1899 '
             'Hoffenheim,club_AC Ajaccio,club_AS Nancy,club_AS '
             'Saint-Étienne,club_Arsenal FC,club_Arsenal FC (R),club_Aston '
             'Villa,club_Athletic Bilbao,club_Atlético Madrid,club_Bayer '
             'Leverkusen,club_Bayern München,club_Blackburn Rovers,club_Bolton '
             'Wanderers,club_Bor. Mönchengladbach,club_Borussia '
             'Dortmund,club_Bristol City,club_CA Osasuna,club_CF '
             'Badalona,club_Celta Vigo,club_Chelsea FC,club_Crewe '
             'Alexandra,club_Deportivo La Coruña,club_ESTAC '
             'Troyes,club_Eintracht Frankfurt,club_Espanyol '
             'Barcelona,club_Everton FC,club_FC Augsburg,club_FC '
             'Barcelona,club_FC Lorient,club_FC Schalke 04,club_FC '
             'Sochaux,club_Fortuna Düsseldorf,club_Fulham FC,club_Getafe

As we can see, the partitioning is terrible: cluster 1 contains only 27 of the 1,585 players. The silhouette score (cut off in the truncated output above) was also negative :)
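
For reference, a negative silhouette means that, on average, points sit closer to the other cluster than to their own. The notebook obtains its score through the `HelperFunctions` module (not shown here), presumably via scikit-learn's `silhouette_score`. The sketch below instead re-implements the textbook definition in plain Python on a made-up one-dimensional data set, purely to illustrate the sign. The function name `silhouette` and the toy points are our own.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points, with |p - q| as the distance."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the two obvious groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

print(silhouette(points, good_labels))  # close to 1
print(silhouette(points, bad_labels))   # below 0
```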

Still, we wanted to add **leagueCountry** as a feature by encoding it as 4 indicator columns, one for each distinct value (i.e. each country name).

After this decision, we settled on the aggregation functions for the columns we are going to use. For player-specific features such as *height*, we took the first value within each group. For dyad-specific count features (the **number of XYZ** kind), we summed them. Finally, we took the mean of all statistical features such as **meanIAT**, since summing them makes no sense.

In [141]:
# Below, these are the features we want to consider initially.

features_to_combine = [
    'height',
    'weight',
    'games',
    'victories',
    'ties',
    'defeats',
    'goals',
    'yellowCards',
    'yellowReds', 
    'redCards',
    'position',
    'meanIAT',
    'nIAT', 
    'seIAT',
    'meanExp',
    'nExp', 
    'leagueCountry',
    'seExp',
]

## 2.2 Applying K-Means

We first test growing subsets of our features, starting from the end of the list and adding one feature at a time, and build a table that might give us some insight.

In [151]:
initial_rows = []

for i in range(1,len(features_to_combine)+1):
    combs_selected = list(reversed(features_to_combine))[:i]
    initial_rows.append(get_tuple(combs_selected,get_kmeans_result(_df_aggregated, list(combs_selected))))

initial_pd = pd.DataFrame(initial_rows,columns=['comb','blacks in 0', 'blacks in 1', 'silhouette', 'cluster 0 total',
                                                  'cluster 1 total'])
#three_comb[((three_comb['blacks in 0'] > 0.4) | (three_comb['blacks in 1'] > 0.4)) & three_comb['silhouette'] > 0.4]
initial_pd

| | comb | blacks in 0 | blacks in 1 | silhouette | cluster 0 total | cluster 1 total |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | seExp | 0.140162 | 0.60396 | 0.842142 | 1484 | 101 |
| 1 | seExp,nExp | 0.139615 | 0.503817 | 0.72814 | 1454 | 131 |
| 2 | seExp,nExp,meanExp | 0.227766 | 0.088989 | 0.355691 | 922 | 663 |
| 3 | seExp,nExp,meanExp,seIAT | 0.139378 | 0.588785 | 0.625466 | 1478 | 107 |
| 4 | seExp,nExp,meanExp,seIAT,nIAT | 0.529412 | 0.135956 | 0.627188 | 136 | 1449 |
| 5 | seExp,nExp,meanExp,seIAT,nIAT,meanIAT | 0.223132 | 0.083882 | 0.339329 | 977 | 608 |
| 6 | seExp,nExp,meanExp,seIAT,nIAT,meanIAT,position | 0.225235 | 0.084665 | 0.089295 | 959 | 626 |
| 7 | seExp,nExp,meanExp,seIAT,nIAT,meanIAT,position... | 0.229348 | 0.087218 | 0.085169 | 920 | 665 |
| 8 | seExp,nExp,meanExp,seIAT,nIAT,meanIAT,position... | 0.226333 | 0.091592 | 0.085404 | 919 | 666 |
| 9 | seExp,nExp,meanExp,seIAT,nIAT,meanIAT,position... | 0.218889 | 0.105109 | 0.088508 | 900 | 685 |


Above, we noticed that the silhouette is sometimes high even though the cluster sizes are far from balanced. After experimenting with combinations of different sizes, we only found good silhouette scores when the combination is relatively small (fewer than 5 features), so we decided to test more configurations, this time only combinations of 4, 3, 2 and 1 of the **features_to_combine**. This takes a few minutes to finish.
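
The run time follows from the number of K-Means fits involved. As a quick check, the number of feature subsets of size 1 to 4 drawn from the 18 entries of `features_to_combine` can be computed with the standard library (`math.comb` requires Python 3.8 or later). The variable names below are ours.

```python
from math import comb

n_features = 18  # length of features_to_combine

# Number of subsets for each combination size we test
subsets_per_size = {k: comb(n_features, k) for k in (4, 3, 2, 1)}
total_fits = sum(subsets_per_size.values())

print(subsets_per_size)  # {4: 3060, 3: 816, 2: 153, 1: 18}
print(total_fits)        # 4047
```

That is about four thousand separate clusterings, each presumably followed by a silhouette computation inside the helper.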

## 2.3 Computing and examining results of different combinations

In [158]:
comb_rows = []

# Test every combination of 4, 3, 2 and 1 features, in that order
for size in (4, 3, 2, 1):
    for combination in combinations(features_to_combine, size):
        comb_rows.append(get_tuple(combination, get_kmeans_result(_df_aggregated, list(combination))))

    print("Combinations of {}, done".format(size))



Combinations of 4, done
Combinations of 3, done
Combinations of 2, done
Combinations of 1, done


In [159]:
comb_df = pd.DataFrame(comb_rows,columns=['comb','blacks perc. in 0', 'blacks perc. in 1', 'silhouette', 'cluster 0 total',
                                                  'cluster 1 total'])
# Each comparison needs its own parentheses, because & binds more tightly than >
filtered_comb_dff = comb_df[((comb_df['blacks perc. in 0'] > 0.4) | (comb_df['blacks perc. in 1'] > 0.4)) & (comb_df['silhouette'] > 0.4)]
filtered_comb_dff

| | comb | blacks perc. in 0 | blacks perc. in 1 | silhouette | cluster 0 total | cluster 1 total |
| --- | --- | --- | --- | --- | --- | --- |
| 192 | height,games,seIAT,seExp | 0.138889 | 0.587156 | 0.569874 | 1476 | 109 |
| 270 | height,victories,seIAT,seExp | 0.138889 | 0.587156 | 0.578324 | 1476 | 109 |
| 336 | height,ties,seIAT,seExp | 0.139378 | 0.588785 | 0.571123 | 1478 | 107 |
| 391 | height,defeats,seIAT,seExp | 0.139472 | 0.583333 | 0.567296 | 1477 | 108 |
| 436 | height,goals,seIAT,seExp | 0.138889 | 0.587156 | 0.594391 | 1476 | 109 |
| 472 | height,yellowCards,seIAT,seExp | 0.139378 | 0.588785 | 0.574599 | 1478 | 107 |
| 500 | height,yellowReds,seIAT,seExp | 0.139378 | 0.588785 | 0.585520 | 1478 | 107 |
| 521 | height,redCards,seIAT,seExp | 0.139378 | 0.588785 | 0.582341 | 1478 | 107 |
| 546 | height,meanIAT,seIAT,seExp | 0.138211 | 0.596330 | 0.585061 | 1476 | 109 |
| 552 | height,nIAT,seIAT,seExp | 0.581197 | 0.136921 | 0.624037 | 117 | 1468 |


## 2.4 Conclusion

From the table above, we can observe that some configurations combine a reasonably high silhouette value with a high percentage of black players in one of the clusters. Yet black players are a minority in our dataset overall, and the cluster sizes are still very unequal (roughly 110 players versus roughly 1,475). The percentages alone therefore do not mean much, and we cannot claim that this clustering is related to skin color.
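
To make this caveat concrete, the sketch below recovers head counts from one row of the filtered table (row 192, `height,games,seIAT,seExp`) and checks how many of the players labeled black actually land in the small cluster. The variable names are ours, and the counts are rounded reconstructions from the printed percentages, not values read from the data.

```python
# Row 192 of the filtered table: height, games, seIAT, seExp
cluster_sizes = {0: 1476, 1: 109}
black_share = {0: 0.138889, 1: 0.587156}

# Recover approximate head counts from the printed percentages
black_counts = {c: round(cluster_sizes[c] * black_share[c]) for c in cluster_sizes}
total_players = sum(cluster_sizes.values())
total_black = sum(black_counts.values())

base_rate = total_black / total_players   # share of black players overall
captured = black_counts[1] / total_black  # share of them in the small cluster

print("black players per cluster:", black_counts)
print("overall share of black players: {:.3f}".format(base_rate))
print("share of black players in the small cluster: {:.3f}".format(captured))
```

Under these reconstructed counts, roughly three quarters of the black players still sit in the large cluster, so the small cluster does not isolate skin color.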