## Applied Machine Learning

In [3]:
%matplotlib inline
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.learning_curve import learning_curve
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn import preprocessing

import matplotlib.pyplot as plt
plt.style.use('ggplot')
pd.options.mode.chained_assignment = None  # default='warn'
%load_ext autoreload
%autoreload 2

# Data

In [4]:
path = 'CrowdstormingDataJuly1st.csv'
raw_data  = pd.read_csv(path)
print(raw_data .shape)
raw_data.T

(146028, 28)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,146018,146019,146020,146021,146022,146023,146024,146025,146026,146027
playerShort,lucas-wilchez,john-utaka,abdon-prats,pablo-mari,ruben-pena,aaron-hughes,aleksandar-kolarov,alexander-tettey,anders-lindegaard,andreas-beck,...,slobodan-rajkovic,steven-taylor,timmy-simons,titus-bramble,tom-huddlestone,tomas-rosicky,winston-reid,xherdan-shaqiri,yassine-el-ghanassi,zdenk-pospch
player,Lucas Wilchez,John Utaka,Abdón Prats,Pablo Marí,Rubén Peña,Aaron Hughes,Aleksandar Kolarov,Alexander Tettey,Anders Lindegaard,Andreas Beck,...,Slobodan Rajković,Steven Taylor,Timmy Simons,Titus Bramble,Tom Huddlestone,Tomáš Rosický,Winston Reid,Xherdan Shaqiri,Yassine El Ghanassi,Zdeněk Pospěch
club,Real Zaragoza,Montpellier HSC,RCD Mallorca,RCD Mallorca,Real Valladolid,Fulham FC,Manchester City,Norwich City,Manchester United,1899 Hoffenheim,...,Hamburger SV,Newcastle United,1. FC Nürnberg,Sunderland AFC,Tottenham Hotspur,Arsenal FC,West Ham United,Bayern München,West Bromwich Albion,1. FSV Mainz 05
leagueCountry,Spain,France,Spain,Spain,Spain,England,England,England,England,Germany,...,Germany,England,Germany,England,England,England,England,Germany,England,Germany
birthday,31.08.1983,08.01.1982,17.12.1992,31.08.1993,18.07.1991,08.11.1979,10.11.1985,04.04.1986,13.04.1984,13.03.1987,...,03.02.1989,23.01.1986,11.12.1976,21.07.1981,28.12.1986,04.10.1980,03.07.1988,10.10.1991,12.07.1990,14.12.1978
height,177,179,181,191,172,182,187,180,193,180,...,191,188,186,187,188,178,190,169,173,174
weight,72,82,79,87,70,71,80,68,80,70,...,88,81,79,87,80,67,87,72,,72
position,Attacking Midfielder,Right Winger,,Center Back,Right Midfielder,Center Back,Left Fullback,Defensive Midfielder,Goalkeeper,Right Fullback,...,Center Back,,Defensive Midfielder,Center Back,Defensive Midfielder,Attacking Midfielder,Center Back,Left Midfielder,Left Winger,Right Fullback
games,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
victories,0,0,0,1,1,0,1,0,0,1,...,1,1,1,1,0,1,0,1,0,0


# Data cleaning


In this section we will try to do a first attempt at cleaning the data. A lot of missing values, strings and so on can be expected and we will treat them here before doing anything of it.


It is quite a big dataframe. Are there duplicates in it?

In [5]:
raw_data[['playerShort', 'refNum']].duplicated().value_counts()

False    146028
dtype: int64

The answer is no. Second, where are the missing values?

In [6]:
raw_data.count()

playerShort      146028
player           146028
club             146028
leagueCountry    146028
birthday         146028
height           145765
weight           143785
position         128302
games            146028
victories        146028
ties             146028
defeats          146028
goals            146028
yellowCards      146028
yellowReds       146028
redCards         146028
photoID          124621
rater1           124621
rater2           124621
refNum           146028
refCountry       146028
Alpha_3          146027
meanIAT          145865
nIAT             145865
seIAT            145865
meanExp          145865
nExp             145865
seExp            145865
dtype: int64

We remember that out ouf the 2053 players, only 1586 had photos available to rate. This is the reason the raters 1 and 2 have only 124621 values out of the 146028 dyads. We'll start by letting these go.

In [7]:
data = raw_data.dropna(how='all', subset=['rater1', 'rater2'])
data.shape[0]

124621

Seeing the same value 124621, we know judges rated the exact same people.

We can then parse the birthdate to keep only the year. The day and month will be meaningless, while mentalities may evolve over the years. Moreover, we can't keep strings as we won't be able to use them. We will drop the player, club columns right now (could try to map the club to a city, but it would be additional).  We drop the photoID because it is meaningless. The Alpha_3 column is the same as refCountry but in strings.

In [8]:
data["year"] = data["birthday"].apply(lambda x: pd.to_datetime(x).year)

In [9]:
data.drop(['player', 'club', 'photoID', 'birthday', 'Alpha_3'],axis=1, inplace=True)

### Convert string values to float values where possible

We do not like loosing all string data, but our ML learning methods can't deal with non numeric values. In this subsection, we try to get something out of the strings.

First, we split the "playerShort" column into First and Last names, and save the lengths to have a numeric value (I guess the length of a first/last name still gives an information we can use). We give IDs to countries and positions since there are only 4 countries and a dozen positions (and thus we can do that, as opposed to the clubs (there were like hundreds of them)). We merge the YellowRed and Red columns as they both describe a field exclusion.

In [10]:

data["firstname"] = pd.DataFrame(data["playerShort"].str.split('-').tolist())[0].apply(lambda x: len(str(x)))
data["lastname"] = pd.DataFrame(data["playerShort"].str.split('-').tolist())[1].apply(lambda x: len(str(x)))
data.drop(["playerShort"], axis=1, inplace=True)


In [11]:
#data.isnull().sum()

In [12]:
mapping_countries = {'England': 0, 'France': 1, 'Germany': 2, 'Spain': 3}

In [13]:
data.replace({'leagueCountry': mapping_countries}, inplace=True)

In [14]:
mapping_roles = {'Center Back': 0, 'Center Forward': 1, 'Defensive Midfielder': 2, 'Goalkeeper': 3, "Attacking Midfielder": 4,
                    "Left Fullback": 5, "Right Fullback" :6, "Left Midfielder" :7, "Right Winger" :8, "Center Midfielder" :9, 
                    "Right Midfielder": 10, "Left Winger" :11}
data.replace({'position': mapping_roles}, inplace=True)

Given the Standard Deviation is .1, it makes sense to have bins of width .25 (2 * .1 plus safety) and we can keep 

In [15]:
clean_data = data.copy()

## Processing the data

### Skin color

We start by adding features since the 2 raters give different skin colors. We add the squared, cross and mean terms.

In [16]:
data["meanrating"] = clean_data[["rater2", "rater1"]].mean(axis=1)
data["rater1_squared"] = clean_data["rater1"]*2
data["rater2_squared"] = clean_data["rater2"]*2
data["cross_rates"] = clean_data["rater1"]*clean_data["rater2"]

### Standardization

In [17]:
data = data.fillna(data.mean())

In [18]:
features = data.columns
x = data.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
std_data = pd.DataFrame(x_scaled, columns=features)


# Extract the labels

The objective of this section is to get which feature will be easier to predict. We keep the possibility of trusting only 1 rater instead of the two, since after all maybe the other one was drunk

In [19]:
focus_cols = ["rater1", "rater2", "rater1_squared", "rater2_squared", "cross_rates", "meanrating"]
std_data.corr().filter(focus_cols).drop(focus_cols).describe()

Unnamed: 0,rater1,rater2,rater1_squared,rater2_squared,cross_rates,meanrating
count,23.0,23.0,23.0,23.0,23.0,23.0
mean,-0.006575,-0.004941,-0.006575,-0.004941,-0.006575,-0.005875
std,0.044192,0.04208,0.044192,0.04208,0.048553,0.04371
min,-0.133586,-0.135832,-0.133586,-0.135832,-0.160892,-0.137358
25%,-0.018643,-0.014856,-0.018643,-0.014856,-0.01689,-0.013924
50%,0.000958,0.000376,0.000958,0.000376,-0.000873,0.000681
75%,0.009484,0.008963,0.009484,0.008963,0.013739,0.009406
max,0.06635,0.064943,0.06635,0.064943,0.06961,0.066943


It seems the better label would be the cross_rates, since it has maximal mean, minimum min and maximal max.

In [20]:
y = np.asarray(std_data['cross_rates'], dtype="|S6")
X = std_data.drop(focus_cols, axis=1)

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

In [22]:
# STEP 2: train the model on the training set
# instantiate the model (using the default parameters)
rfc = RandomForestClassifier(n_jobs=-1)
# fit the model with data
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [23]:
def print_score(classifier, X_train, y_train, X_test, y_test):
    print('Train set score :', classifier.score(X_train, y_train))
    print('Test set score :', classifier.score(X_test, y_test))

In [24]:
print_score(rfc, X_train, y_train, X_test, y_test)

Train set score : 0.998381747178
Test set score : 0.904812533852
