## Applied Machine Learning

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# sklearn.learning_curve and sklearn.cross_validation were removed in
# scikit-learn 0.20; their contents now live in sklearn.model_selection
from sklearn.model_selection import learning_curve, train_test_split, cross_val_score
from sklearn import preprocessing

import matplotlib.pyplot as plt
plt.style.use('ggplot')
pd.options.mode.chained_assignment = None  # default='warn'
%load_ext autoreload
%autoreload 2


# Data

In [None]:
path = 'CrowdstormingDataJuly1st.csv'
raw_data = pd.read_csv(path)
print(raw_data.shape)
raw_data.T

# Data cleaning


In this section we make a first pass at cleaning the data. We expect many missing values, string columns and similar issues, and we handle them here before doing anything else with the data.


It is quite a big dataframe. Are there duplicates in it?

In [None]:
raw_data[['playerShort', 'refNum']].duplicated().value_counts()
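To make the check above concrete, here is a small sketch on toy data (the rows and values are invented for illustration). `duplicated()` marks every repeat of an earlier (playerShort, refNum) pair, so a result containing only `False` means each dyad appears once.

```python
import pandas as pd

# Toy stand-in for the real table: one row per (player, referee) dyad.
toy = pd.DataFrame({
    "playerShort": ["a-one", "a-one", "b-two", "b-two"],
    "refNum": [1, 2, 1, 1],
    "games": [3, 1, 2, 2],
})

# duplicated() flags every repeat of an earlier (playerShort, refNum) pair;
# the first occurrence of each pair is not flagged.
dup_mask = toy[["playerShort", "refNum"]].duplicated()
dup_counts = dup_mask.value_counts()
n_duplicates = int(dup_mask.sum())
print(dup_counts)
```

Here only the last row repeats an earlier pair, so exactly one row is flagged.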

The answer is no. Second, where are the missing values?

In [None]:
raw_data.count()

Recall that out of the 2053 players, only 1586 had a photo available to rate. This is why the rater1 and rater2 columns have only 124621 values out of the 146028 dyads. We start by dropping the unrated rows.

In [None]:
data = raw_data.dropna(how='all', subset=['rater1', 'rater2'])
data.shape[0]

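We get the same value, 124621. Since how='all' keeps every row with at least one rating and each rater column also holds 124621 values, every remaining row has both ratings: the two raters rated exactly the same players.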
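The argument above depends on what `how='all'` does. A minimal sketch on invented toy ratings shows the difference from `how='any'`: the first drops a row only when both ratings are missing, the second drops it as soon as one is missing.

```python
import numpy as np
import pandas as pd

# Toy ratings: row 0 has both, rows 1 and 3 have one, row 2 has none.
toy = pd.DataFrame({
    "rater1": [0.25, np.nan, np.nan, 0.75],
    "rater2": [0.25, 0.50, np.nan, np.nan],
})

# how="all": drop a row only if BOTH ratings are missing.
kept_all = toy.dropna(how="all", subset=["rater1", "rater2"])
# how="any": drop a row if AT LEAST ONE rating is missing.
kept_any = toy.dropna(how="any", subset=["rater1", "rater2"])

print(len(kept_all), len(kept_any))
```

If the two counts agree on the real data, no row has exactly one rating.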

Next we parse the birthdate and keep only the year: the day and month carry little information, while attitudes may shift over the years. We also cannot keep raw strings, since the models cannot use them. We drop the player and club columns now (we could try to map each club to a city, but that would be extra work). We drop photoID because it carries no information. The Alpha_3 column duplicates refCountry as a string.

In [None]:
data["year"] = data["birthday"].apply(lambda x: pd.to_datetime(x).year)

In [None]:
data.drop(['player', 'club', 'photoID', 'birthday', 'Alpha_3'],axis=1, inplace=True)

### Convert string values to float values where possible

We do not like losing all the string data, but our ML methods cannot handle non-numeric values. In this subsection we try to extract something numeric from the strings.

First, we split the "playerShort" column into first and last names and store their lengths as numeric values (the length of a first or last name may still carry some usable information). We assign IDs to countries and positions, since there are only 4 countries and a dozen positions; this would not be practical for the clubs, of which there are hundreds. We merge the YellowRed and Red columns as they both describe a field exclusion.

In [None]:

# expand=True keeps data's index; rebuilding a DataFrame from a list would reset it
names = data["playerShort"].str.split('-', expand=True)
data["firstname"] = names[0].fillna('').str.len()
data["lastname"] = names[1].fillna('').str.len()  # 0 when there is no second part
data.drop(["playerShort"], axis=1, inplace=True)
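One thing to watch when deriving columns from a filtered frame is that pandas aligns on index labels during assignment. The sketch below uses invented names and a deliberately non-contiguous index (as `dropna` leaves behind) to show how a result that keeps the index differs from one rebuilt from a plain list.

```python
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by dropna().
toy = pd.DataFrame(
    {"playerShort": ["anna-lee", "bob-stone", "cy-marsh"]},
    index=[0, 5, 9],
)

# expand=True returns a DataFrame that keeps the original index labels,
# so the new columns line up with the right rows.
parts = toy["playerShort"].str.split("-", expand=True)
toy["firstname"] = parts[0].str.len()
toy["lastname"] = parts[1].str.len()

# Rebuilding from a plain list resets the index to 0..n-1. Aligning that
# result back onto labels [0, 5, 9] only matches label 0.
rebuilt = pd.DataFrame(toy["playerShort"].str.split("-").tolist())
misaligned = rebuilt[0].reindex(toy.index)
print(misaligned)
```

Only the first row survives the alignment in the rebuilt version; the other two become NaN.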


In [None]:
#data.isnull().sum()

In [None]:
mapping_countries = {'England': 0, 'France': 1, 'Germany': 2, 'Spain': 3}

In [None]:
data.replace({'leagueCountry': mapping_countries}, inplace=True)

In [None]:
mapping_roles = {'Center Back': 0, 'Center Forward': 1, 'Defensive Midfielder': 2,
                 'Goalkeeper': 3, 'Attacking Midfielder': 4, 'Left Fullback': 5,
                 'Right Fullback': 6, 'Left Midfielder': 7, 'Right Winger': 8,
                 'Center Midfielder': 9, 'Right Midfielder': 10, 'Left Winger': 11}
data.replace({'position': mapping_roles}, inplace=True)
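The merge of the two exclusion columns mentioned earlier is not shown in the cells above. A minimal sketch of one way to do it follows; the column names `yellowReds` and `redCards` are assumptions about the CSV header, and `exclusions` is a hypothetical name for the merged column.

```python
import pandas as pd

# Toy card counts; the column names are assumed, not taken from the notebook.
toy = pd.DataFrame({"yellowReds": [0, 1, 0], "redCards": [1, 0, 0]})

# Both columns count events that send the player off the field,
# so their sum is the total number of exclusions per dyad.
toy["exclusions"] = toy["yellowReds"] + toy["redCards"]  # hypothetical column name
toy = toy.drop(columns=["yellowReds", "redCards"])
print(toy)
```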

Given that the standard deviation is 0.1, it makes sense to use bins of width 0.25 (2 × 0.1 plus a safety margin), and we can keep 

In [None]:
clean_data = data.copy()

## Processing the data

### Skin color

The two raters do not always agree on skin color, so we start by adding derived features: the mean, the squares and the cross term of the two ratings.

In [None]:
data["meanrating"] = clean_data[["rater2", "rater1"]].mean(axis=1)
data["rater1_squared"] = clean_data["rater1"]**2
data["rater2_squared"] = clean_data["rater2"]**2
data["cross_rates"] = clean_data["rater1"]*clean_data["rater2"]

### Standardization

In [None]:
data = data.fillna(data.mean())

In [None]:
features = data.columns
x = data.values  # NumPy array of all feature values
min_max_scaler = preprocessing.MinMaxScaler()  # rescales each column to [0, 1]
x_scaled = min_max_scaler.fit_transform(x)
std_data = pd.DataFrame(x_scaled, columns=features)
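Note that `MinMaxScaler` rescales each column to the range [0, 1], which is usually called normalization, whereas standardization in the strict sense means zero mean and unit variance (`StandardScaler`). The following sketch on a toy column shows the difference; it is only an illustration, not a claim about which one suits this dataset better.

```python
import numpy as np
from sklearn import preprocessing

# One toy feature column with evenly spaced values.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-max scaling: (x - min) / (max - min), so values land in [0, 1].
min_max = preprocessing.MinMaxScaler().fit_transform(x)

# Standard scaling: (x - mean) / std, so the column has mean 0 and std 1.
standard = preprocessing.StandardScaler().fit_transform(x)

print(min_max.ravel())
print(standard.ravel())
```

Tree-based models such as random forests are largely insensitive to either rescaling, so the choice matters more for distance-based methods like KMeans.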


# Extract the labels

The goal of this section is to find which candidate label will be easiest to predict. We keep the option of trusting only one rater instead of both, since after all maybe the other one was drunk.

In [None]:
focus_cols = ["rater1", "rater2", "rater1_squared", "rater2_squared", "cross_rates", "meanrating"]
std_data.corr().filter(focus_cols).drop(focus_cols).describe()

cross_rates looks like the best label: across the remaining features, its correlations have the largest mean, the smallest minimum and the largest maximum.
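The selection step can be sketched on synthetic data as follows. All column names here are invented. One caveat this toy example illustrates: positive and negative correlations can cancel in the mean of signed values, so summarising absolute correlations may give a clearer ranking. This is a suggestion, not what the cell above does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
feat_a = rng.normal(size=n)
feat_b = rng.normal(size=n)

toy = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_b,
    # Depends positively on feat_a and negatively on feat_b.
    "label_strong": feat_a - feat_b + 0.1 * rng.normal(size=n),
    # Unrelated to both features.
    "label_noise": rng.normal(size=n),
})

candidates = ["label_strong", "label_noise"]

# Rows: features; columns: candidate labels (same filter/drop pattern as above).
corr = toy.corr().filter(candidates).drop(candidates)

# Summarise the strength of the correlations regardless of sign.
summary = corr.abs().describe()
best = summary.loc["mean"].idxmax()
print(summary)
print("best candidate:", best)
```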

In [None]:
# Cast the products to 6-byte strings so each distinct value becomes a class label
y = np.asarray(std_data['cross_rates'], dtype="|S6")
X = std_data.drop(focus_cols, axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

In [None]:
# Train the model on the training set
# instantiate the model (using the default parameters)
rfc = RandomForestClassifier(n_jobs=-1)
# fit the model with data
rfc.fit(X_train, y_train)

In [None]:
def print_score(classifier, X_train, y_train, X_test, y_test):
    print('Train set score :', classifier.score(X_train, y_train))
    print('Test set score :', classifier.score(X_test, y_test))

In [None]:
print_score(rfc, X_train, y_train, X_test, y_test)
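A single train/test split gives one number that depends on the split. Since `cross_val_score` is already imported, a k-fold estimate is a natural complement. The sketch below runs on synthetic data from `make_classification` (not on the notebook's features), and the forest size and fold count are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X, y); the real features are not used here.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0)

# 5-fold cross-validation: fit on 4 folds, score on the held-out fold, 5 times.
scores = cross_val_score(clf, X_toy, y_toy, cv=5)
mean_score, std_score = scores.mean(), scores.std()
print('CV accuracy: %.3f +/- %.3f' % (mean_score, std_score))
```

A large gap between the train score and the cross-validated score would point to overfitting.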