In [None]:
import pandas as pd             
import numpy as np 
import matplotlib.pyplot as plt   
%matplotlib inline

In [None]:
from sklearn.ensemble import RandomForestClassifier
from scipy.stats.mstats import mode
from sklearn import preprocessing

#from sklearn.model_selection import cross_val_score, KFold

# Data Preprocessing

First we do some data preprocessing to make the dataset usable for training. Reading through the example notebook and analysing the data further reveals some inconsistencies that would be problematic during the training of the random forest.

In [None]:
# Load dataset
df = pd.read_csv('CrowdstormingDataJuly1st.csv', index_col=0)
df['rater1'] = df['rater1'].astype('category')
df['rater2'] = df['rater2'].astype('category')

df.head(3)

The first step is to remove all the rows from which we cannot infer the answer to the question asked. We need to find the skin color of the soccer player based on the other features at our disposal. We will thus first look at the columns '_rater1_' and '_rater2_'. Looking at the dataset, we can directly see that some rows have no "rating" and won't be able to help us during training: we don't have the output labels for the classifier.

We only have ratings for ~85% of the dataset. Also, each sample is unique.

In [None]:
# Good news: each row has either both ratings or none (this count should be 0)
sum(~(df.rater1.isnull() == df.rater2.isnull()))

In [None]:
data = df[~df.rater1.isnull()].copy()
print('Total available', len(df))
print('Total with rating', len(data), "({}%)".format(round(len(data)*100/len(df),3)))
print('Number of samples with disagreement:', sum(~(data.rater1 == data.rater2)))

We can also look at how many "real" unique samples we have. Indeed, a player is most certainly present multiple times. As we can see, most of the players have several entries.

In [None]:
print("Number of unique players", len(data.player.unique()))
rows_per_player = data.player.value_counts()
rows_per_player.hist(bins=28, range=(0, 280))

In [None]:
pd.get_dummies(data.head())

Next we aggregate the data to work with only one sample per player. Some players are present only once and others more than one hundred times. Feeding the samples in right away would not be a good idea: the data is skewed, so the model would not learn to correctly classify unseen data that might be closer to a player present only once.

We will deal with the missing data (_nan_ values) later, by imputing them just before the model is trained.
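
As a minimal sketch of this per-player aggregation, the toy frame below uses made-up player identifiers and values (they are assumptions, not rows from the dataset). Static attributes keep their first value, while counters are summed.

```python
import pandas as pd

# Toy stand-in for the real data: one row per (player, referee) pair,
# indexed by a hypothetical player identifier.
toy = pd.DataFrame(
    {
        "height": [180.0, 180.0, 175.0],
        "games": [3, 5, 2],
        "redCards": [0, 1, 0],
    },
    index=["player-a", "player-a", "player-b"],
)

# Static attributes: keep the first value. Counters: add them up.
per_player = toy.groupby(level=0).agg(
    {"height": "first", "games": "sum", "redCards": "sum"}
)
print(per_player)
```

After aggregation each player contributes exactly one row, whatever the number of entries it had originally.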

In [None]:
X = data.groupby(level=0).agg({
        'leagueCountry': 'first',
        'position':'first',
        'height':'first', 
        'weight':'first', 
        'games':'sum', 
        'victories':'sum',
        'defeats':'sum', 
        'ties': 'sum', 
        'goals':'sum', 
        'redCards':'sum', 
        'yellowReds': 'sum', 
        'yellowCards':'sum'
    })

# Create a simple struct-like object (attributes can be set on a function object)
dataset_raw = lambda:0
dataset_raw.X = pd.get_dummies(X)
dataset_raw.y = data.groupby(level=0)['rater1'].apply(lambda x: mode(x, axis=None)[0][0])

In [None]:
skin_color = preprocessing.LabelEncoder()

dataset = lambda:0
dataset.X = dataset_raw.X.values  # as_matrix() was removed from recent pandas versions
dataset.y = skin_color.fit_transform(dataset_raw.y)
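
To make the label encoding step concrete, here is a small sketch of what `LabelEncoder` does. The rating values below are hypothetical and only illustrate the mapping from sorted distinct values to integer class ids.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical ratings (assumed values, not taken from the dataset).
ratings = np.array([0.0, 0.25, 1.0, 0.25, 0.5])

encoder = LabelEncoder()
codes = encoder.fit_transform(ratings)      # integer class ids, ordered by value
decoded = encoder.inverse_transform(codes)  # back to the original ratings

print(codes)
print(encoder.classes_)
```

Keeping the fitted encoder around makes it possible to turn predicted class ids back into ratings with `inverse_transform`.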

---------

# Exercise 1

In [None]:
from sklearn import model_selection
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

We split our dataset into train and test sets. The model is "defined" using the train set, and the models are compared on how well they classify the test set. We don't look at the test set until we compare the different models.

In [None]:
# Replace missing values with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(dataset.X)
X = imputer.transform(dataset.X)

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, dataset.y, test_size=0.10)
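
A possible refinement of the split, shown here only as a sketch on synthetic data: passing `stratify` keeps the class proportions the same in both parts, and `random_state` makes the split reproducible. The toy arrays and the seed are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (assumed shapes and class counts).
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 4))
y_toy = np.repeat([0, 1, 2, 3], [120, 40, 30, 10])

# stratify preserves the class proportions; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.10, stratify=y_toy, random_state=0
)

print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```

With skewed classes this keeps the rarest class from being left out of the small test set.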

First, we define our baseline. We start by selecting the skin color at random.

In [None]:
y_random = np.random.randint(5, size=y_test.shape)
accuracy_score(y_test, y_random)

We can improve the score just by always selecting the most frequent skin color.

In [None]:
most_present = mode(y_train)[0][0] # Model is defined based on the train set
y_most = np.full(y_test.shape, most_present, dtype=int)
accuracy_score(y_test, y_most)
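
The same most-frequent baseline can also be expressed with scikit-learn's `DummyClassifier`. The sketch below uses small made-up label arrays (assumptions for illustration). The features are ignored by this strategy, so a column of zeros is enough.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels; class 0 is the most frequent one in the train part.
y_train_toy = np.array([0, 0, 0, 1, 2, 0, 1])
y_test_toy = np.array([0, 1, 0, 2])

# The "most_frequent" strategy never looks at the features.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(baseline_accuracy)
```

Because it shares the estimator interface, such a baseline can be evaluated with exactly the same code as the real classifier.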

Now, let's try to develop a classifier that will improve the accuracy.

In [None]:
n_estimators = 30
max_depth = 20

clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
clf.fit(X_train, y_train)
accuracy_score(y_test, clf.predict(X_test))

It seems that going straight for the goal without much thought is not going to work. We obtain only a small gain compared to selecting the most frequent class. Let's see where we stand on overfitting and test the accuracy of our classifier on the train dataset.

In [None]:
accuracy_score(y_train, clf.predict(X_train))

Well... We overfit _quite_ a bit.
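
One way to look for a less overfitted setting is to compare cross-validated accuracy across a few tree depths. The sketch below is only illustrative: it runs on synthetic data from `make_classification`, and the depth values, `cv=5` and the seeds are assumptions rather than settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and labels.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Mean 5-fold cross-validated accuracy for a few candidate depths.
cv_scores = {}
for depth in (2, 5, 20):
    clf_cv = RandomForestClassifier(n_estimators=30, max_depth=depth, random_state=0)
    scores = cross_val_score(clf_cv, X_syn, y_syn, cv=5)
    cv_scores[depth] = scores.mean()
    print(depth, round(cv_scores[depth], 3))
```

Each score is computed on folds the model did not see during fitting, so it is a fairer guide for choosing `max_depth` than the accuracy on the train set.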