In [None]:
import pandas as pd             
import numpy as np 
import matplotlib.pyplot as plt   
%matplotlib inline

In [None]:
from sklearn.ensemble import RandomForestClassifier
from scipy.stats.mstats import mode
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.preprocessing import Imputer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

# Data Preprocessing

First, we do some data preprocessing to make the dataset ready for training. Reading through the example notebook and analysing the data further reveals some inconsistencies that would be problematic during the training of the random forest.

In [None]:
# Load dataset
df = pd.read_csv('CrowdstormingDataJuly1st.csv', index_col=0)
df.head(3)

The first step is to remove all the rows from which we cannot infer the answer to the question asked. We need to predict the skin color of a soccer player from the other features at our disposal. We therefore first look at the columns '_rater1_' and '_rater2_'. Looking at the dataset, we can directly see that some rows have no rating and will not help us during training: we do not have the output labels for the classifier. This is only a real problem when using supervised learning.

We only have ratings for ~85% of the dataset. Each sample is also unique.

In [None]:
# Good news: we always have either both ratings or none
sum(~(df.rater1.isnull() == df.rater2.isnull()))

In [None]:
data = df.dropna(subset=['rater1', 'rater2']).copy()
print('Total available', len(df))
print('Total with rating', len(data), "({}%)".format(round(len(data)*100/len(df),3)))
print('Number of samples with disagreement:', sum(~(data.rater1 == data.rater2)))

In [None]:
data['skin'] = data[['rater1', 'rater2']].mean(axis=1)
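
To make the effect of this averaging concrete, here is a minimal sketch on toy data. It assumes the raters use a 5-point scale from 0 to 1 in steps of 0.25; the `toy` frame and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical ratings, assuming a 5-point scale: 0, 0.25, 0.5, 0.75, 1
toy = pd.DataFrame({
    'rater1': [0.0, 0.25, 0.5, 1.0],
    'rater2': [0.0, 0.50, 0.5, 0.75],
})

# Same operation as above: row-wise mean of the two ratings
toy['skin'] = toy[['rater1', 'rater2']].mean(axis=1)
print(toy['skin'].tolist())  # [0.0, 0.375, 0.5, 0.875]
```

When the two raters disagree, the mean falls between two points of the scale (0.375 and 0.875 here), so the averaged label can take more distinct values than the original scale.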

We can also look at how many "real" unique samples we have. A player is most certainly present multiple times. As we can see, most of the players have several entries.

In [None]:
print("Number of unique players", len(data.player.unique()))
fig, ax = plt.subplots()
ax.plot(data.player.value_counts().tolist())
ax.set_title('Appearance count per player')
ax.set_xlabel('Player')
ax.set_ylabel('Appearance count')
#data.player.value_counts().tolist().plot()

Next, we aggregate the data to work with a single sample per player. Some players are present only once and others more than one hundred times.

Also, a classifier has no notion of strings as input, so we need to deal with the columns with text features and encode them differently. We could either use a numbering encoding (clubX = 1, clubY = 2, etc.) or just dummy encode the column. We will use the dummy encoding.
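
The two encodings mentioned above can be compared on a toy column. This is only an illustrative sketch: the `club` column and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'club': ['clubX', 'clubY', 'clubX', 'clubZ']})

# Numbering encoding: one integer per category (implies an arbitrary order)
codes = toy['club'].astype('category').cat.codes
print(codes.tolist())  # [0, 1, 0, 2]

# Dummy encoding: one indicator column per category, no implied order
dummies = pd.get_dummies(toy['club'])
print(list(dummies.columns))  # ['clubX', 'clubY', 'clubZ']
```

The numbering encoding is more compact, but it suggests that clubZ is "greater" than clubX. The dummy encoding avoids that at the cost of one column per category.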

We also deal with the missing data (_nan_ values).

##### Fill NaN values

Let's look at the missing values. We have to deal with all of them before feeding anything to the classifier. As we can see, some of the columns have several missing values.

In [None]:
data.isnull().sum()

In [None]:
# Height & Weight : Use mean
data['height'] = data['height'].fillna(data['height'].mean())
data['weight'] = data['weight'].fillna(data['weight'].mean())

# Position: add a new label
data['position'] = data['position'].fillna('UNKNOWN')

# We decided to drop all rows containing NaN in the rest of the columns. It would be difficult to
# decide which value to fill in, as these values are specific to each player-referee dyad
data.dropna(subset=['Alpha_3', 'meanIAT', 'nIAT','seIAT','meanExp','nExp','seExp'],axis=0, how='any', inplace=True)
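
The three strategies used above (mean imputation, a sentinel label, and dropping rows) can be seen side by side on a small example. The `toy` frame below is hypothetical and only reuses a few column names from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'height': [180.0, np.nan, 190.0],
    'position': ['Goalkeeper', None, 'Center Back'],
    'meanIAT': [0.3, 0.4, np.nan],
})

# Numeric column: replace missing values by the column mean
toy['height'] = toy['height'].fillna(toy['height'].mean())

# Text column: replace missing values by an explicit sentinel label
toy['position'] = toy['position'].fillna('UNKNOWN')

# Remaining columns: drop rows that still contain a missing value
toy = toy.dropna(subset=['meanIAT'])
print(toy)
```

Only the first two rows survive: the second one with an imputed height of 185.0 and the position 'UNKNOWN', while the third is dropped because `meanIAT` is missing.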

##### Aggregation

We want to work only with information about players, not player-referee dyads. We need to aggregate the information about each player into a single sample.

In [None]:
#TODO Add age

In [None]:
most_present = lambda x: x.value_counts().index[0]

players = data.groupby(level=0).agg({
    'leagueCountry': most_present,
    'position': most_present,
    'height': 'mean', 
    'weight': 'mean', 

    'meanIAT':'mean', 
    'meanExp':'mean', 
    'seIAT':'mean', 
    'seExp':'mean',

    'games':'sum', 

    'victories':'sum',
    'defeats':'sum', 
    'ties': 'sum', 

    'goals':'sum', 

    'redCards':'sum', 
    'yellowReds': 'sum', 
    'yellowCards':'sum',

    'skin': most_present,
})
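
As a sanity check of this aggregation pattern, the sketch below applies the same idea to a toy frame: the most frequent value for text columns, the mean for player attributes, and the sum for counts. The index name `player_id` and all values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'position': ['Striker', 'Striker', 'Winger', 'Keeper'],
        'height': [180.0, 180.0, 180.0, 190.0],
        'games': [3, 5, 2, 10],
    },
    # One row per dyad; the index identifies the player (hypothetical name)
    index=pd.Index(['player_a', 'player_a', 'player_a', 'player_b'], name='player_id'),
)

# Most frequent value of a column within a group
most_common = lambda x: x.value_counts().index[0]

toy_players = toy.groupby(level=0).agg({
    'position': most_common,
    'height': 'mean',
    'games': 'sum',
})
print(toy_players)
```

The three rows of `player_a` collapse into a single row with position 'Striker', height 180.0 and 10 games.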

##### Simplification of the task


The problem in all its glory is to determine the skin color within the same 5 categories. Because there is some disagreement between raters, new "categories" have been created that lie in between the official ones. Let's look at the distribution of those categories. As we can see below, the categories are skewed to the right (to the "white" side of the categories).

In [None]:
fig = players['skin'].value_counts(sort=False).sort_index().plot(kind='bar')
fig.set_ylabel('Number of players')
fig.set_xlabel('Skin "category"')
fig.set_title('Skin category by players')

We decided to reframe the problem as a classification that decides whether the player has light skin or dark skin, even though we will first try a simple model on the whole range of skin categories.

In [None]:
players['skin_binary'] = pd.cut(players['skin'], [0, 0.5, 1.01], labels=['light', 'dark'], right=False)
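
To check where the boundary falls, the same `pd.cut` call can be applied to a few hand-picked scores. The `scores` values are hypothetical; the bins and labels are the ones used above.

```python
import pandas as pd

scores = pd.Series([0.0, 0.25, 0.375, 0.5, 0.75, 1.0])

# right=False makes the bins [0, 0.5) and [0.5, 1.01)
binary = pd.cut(scores, [0, 0.5, 1.01], labels=['light', 'dark'], right=False)
print(binary.tolist())  # ['light', 'light', 'light', 'dark', 'dark', 'dark']
```

A score of exactly 0.5 is assigned to 'dark', and the upper edge of 1.01 ensures that a score of 1.0 is still included.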

We will thus work on classifying into two categories with the following distribution.

In [None]:
fig = players['skin_binary'].value_counts(sort=False).plot(kind='bar')
fig.set_ylabel('Number of players')
fig.set_xlabel('Skin "category"')
fig.set_title('Skin category by players')

---------------

##### Final preparation

The models cannot work on text features such as the position. We will dummy encode them.

In [None]:
X = players.copy()
X.drop(['skin', 'skin_binary' ], axis=1, inplace=True)
X = pd.get_dummies(X)

y_full = players.copy()['skin']
y = players.copy()['skin_binary']

# Create a simple struct-like object to hold the dataset
from types import SimpleNamespace
dataset = SimpleNamespace(X=X, y_full=y_full, y=y)

---------

# Exercise 1

We split our dataset into train and test sets. The model is defined and trained using the train set, and the models are compared on their classification of the test set. We do not look at the test set until we compare the different models.



In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, dataset.y, test_size=0.10)
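
Because the classes are imbalanced, one optional variant is a stratified, seeded split. The sketch below is an assumption, not what the cell above does: it uses toy arrays, and the `stratify` and `random_state` arguments are added for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 15 'light' and 5 'dark' samples
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(['light'] * 15 + ['dark'] * 5)

# stratify keeps the class proportions in both splits,
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)
print(len(X_tr), len(X_te), (y_te == 'dark').sum())
```

With 20 samples and `test_size=0.2`, the test set holds 4 samples, one of which is 'dark', which matches the 25% share in the full toy set.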

### Baseline Model

First, we define our baseline. We start by selecting the skin color at random.

In [None]:
# Draw random labels from the same label set as y_test ('light' / 'dark')
y_random = np.random.choice(['light', 'dark'], size=y_test.shape)
accuracy_score(y_test, y_random)

We can improve the score just by selecting the most frequent skin color.

In [None]:
most_present = y_train.value_counts().idxmax() # Model is defined based on the train set
y_most = np.full(y_test.shape, most_present)
accuracy_score(y_test, y_most)
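
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`. This is a minimal sketch on toy labels, not the notebook's data: the arrays below are hypothetical, and the features are placeholders since the dummy model ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (hypothetical): 'light' is the majority class
y_train_toy = np.array(['light'] * 8 + ['dark'] * 2)
y_test_toy = np.array(['light', 'light', 'dark', 'light'])

# Placeholder features: the dummy model never looks at them
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)
baseline_acc = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(baseline_acc)  # 0.75
```

Using an estimator for the baseline means it can be passed to the same evaluation helpers (for example `cross_val_score`) as the real model.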

### Random Forest

Now, let's try to develop a classifier that will improve the accuracy.

In [None]:
n_estimators = 30
max_depth = 20

clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, max_features=None)
scores_cross_val = cross_val_score(clf, X_train, y_train, cv=5)
clf.fit(X_train, y_train)
scores_test = accuracy_score(y_test, clf.predict(X_test))

print("Cross validation score:", np.mean(scores_cross_val))
print("Test set validation score:", scores_test)

It seems that going straight to the goal without much thinking is not going to work. We obtain only a small gain compared to selecting the most frequent class. Let's see where we stand regarding overfitting.

In [None]:
accuracy_score(y_train, clf.predict(X_train))

Well... We overfit _quite_ a bit.
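
One way to quantify this is to compare the train accuracy with the cross-validated accuracy for different tree depths. The sketch below does so on synthetic data generated with `make_classification`, so the numbers say nothing about the players dataset; the depths 2 and 20 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, noisy binary problem (assumption: stands in for the real data)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)

gaps = {}
for depth in (2, 20):
    model = RandomForestClassifier(n_estimators=30, max_depth=depth, random_state=0)
    # Accuracy on held-out folds
    cv_acc = cross_val_score(model, X_toy, y_toy, cv=5).mean()
    # Accuracy on the very data the model was fitted on
    train_acc = model.fit(X_toy, y_toy).score(X_toy, y_toy)
    gaps[depth] = train_acc - cv_acc
    print(depth, round(train_acc, 3), round(cv_acc, 3))
```

A large gap between the two accuracies is the symptom observed above. Deeper trees typically widen it, which suggests limiting `max_depth` as one possible next step.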