# Week 3 - Training a model using scikit-learn
In this example we will train a model using scikit-learn. 

First we will prepare the data for scikit-learn. We will focus on:
- Replacing or removing NaNs
- Removing columns not needed when training the model

For the purpose of this example we will not do any feature engineering.
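Before applying these steps to the real data, it can help to see them on a tiny table. The sketch below uses a hypothetical four-row DataFrame (`toy` and its values are made up for illustration and are not taken from the Titanic file) to count NaNs per column, fill one column, drop rows for another, and remove an unneeded column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few of the Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Embarked": ["S", "C", None, "S"],
    "Ticket": ["A/5 21171", "PC 17599", "113803", "373450"],
})

# Count missing values per column to decide between filling and dropping.
nan_counts = toy.isna().sum()
print(nan_counts)

cleaned = toy.copy()
# Replace NaNs in Age with the column mean.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].mean())
# Remove rows where Fare is missing.
cleaned = cleaned.dropna(subset=["Fare"])
# Remove a column that is not needed for training.
cleaned = cleaned.drop(columns=["Ticket"])
print(cleaned)
```

Filling keeps the row at the cost of an invented value, while dropping keeps only observed values at the cost of losing the row. Which is preferable depends on how many values are missing.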

In [1]:
import pandas as pd
import numpy as np

## Load the data

In [2]:
df_train = pd.read_csv("data/titanic.csv", encoding='utf-8-sig')
df_train.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | False | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | True | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | True | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | True | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | False | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Remove columns

In [3]:
to_drop = ['PassengerId', 'Name', 'Ticket', 'Cabin']
df_train.drop(to_drop, inplace=True, axis=1)

## Prepare remaining data for scikit-learn

In [4]:
df_train.dtypes

Survived       bool
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

In [5]:
# Replace strings with integers
df_train['Sex'] = df_train['Sex'].map({'male': 0, 'female': 1})

In [6]:
# Replace NaNs with the mean age, truncated to an integer
# (mean() skips NaNs by default, so no dropna() is needed)
age_mean = int(df_train['Age'].mean())
df_train['Age'] = df_train['Age'].fillna(age_mean)
df_train['Age'] = df_train['Age'].astype(int)

In [7]:
df_train['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [8]:
# Fill missing ports with 'Other', then replace strings with integers
df_train['Embarked'] = df_train['Embarked'].fillna('Other')
df_train['Embarked'] = df_train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2, 'Other': 3})
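The order of `fillna` and `map` matters here, because `Series.map` with a dict returns NaN for any value that has no key in the dict. A minimal sketch on a hypothetical five-element Series (not the real column) shows the difference:

```python
import pandas as pd

# Hypothetical sample of the Embarked column, including one missing value.
embarked = pd.Series(["S", "C", "Q", None, "S"])
mapping = {"S": 0, "C": 1, "Q": 2, "Other": 3}

# Without fillna, the missing value has no key and stays NaN.
unfilled = embarked.map(mapping)

# With fillna first, every value has a key and the result is all integers.
encoded = embarked.fillna("Other").map(mapping)

print(unfilled.tolist())
print(encoded.tolist())
```

Checking `encoded.isna().sum()` after a mapping like this is a cheap way to catch categories that were left out of the dict.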

In [9]:
df_train.dtypes

Survived       bool
Pclass        int64
Sex           int64
Age           int64
SibSp         int64
Parch         int64
Fare        float64
Embarked      int64
dtype: object

In [10]:
# Remove rows where Fare is NaN
df_train.dropna(subset=['Fare'], inplace=True)

In [11]:
# index=False keeps the row index out of the saved file
df_train.to_csv('data/titanic_prepped.csv', index=False)

## Training a model
In this example we will train a Random Forest classifier.

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

Split the data into training and validation sets:

In [13]:
predictors = df_train.drop("Survived", axis=1)
target = df_train["Survived"]

x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size = 0.20, random_state = 0)
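To make the two arguments concrete: `test_size=0.20` holds out 20% of the rows for validation, and a fixed `random_state` makes the shuffle repeatable. The sketch below checks both on a hypothetical ten-row array (`X` and `y` here are placeholders, not the Titanic data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 20% of 10 rows gives 2 validation rows and 8 training rows.
x_tr, x_va, y_tr, y_va = train_test_split(X, y, test_size=0.20, random_state=0)

# Repeating the call with the same random_state yields the same split.
x_tr2, x_va2, y_tr2, y_va2 = train_test_split(X, y, test_size=0.20, random_state=0)

print(len(x_tr), len(x_va))
print(np.array_equal(x_va, x_va2))
```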

Train the model:

In [14]:
randomforest = RandomForestClassifier(n_estimators=100)
randomforest.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

## Evaluating the model
Let's try the `score` method of the model.

In [15]:
round(randomforest.score(predictors, target) * 100, 2)

95.06

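Looks too good to be true. The score above was computed on the full dataset, 80% of which the model already saw during training. Let's evaluate the accuracy on the held-out validation set instead.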

In [16]:
y_pred = randomforest.predict(x_val)
# accuracy_score expects the true labels first, then the predictions
round(accuracy_score(y_val, y_pred) * 100, 2)

84.36

So from doing almost nothing, we get an accuracy of about 84%.
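The gap between the two accuracy figures comes from scoring on rows the model was trained on. A minimal sketch of the same effect, assuming synthetic data from `make_classification` as a stand-in (the dataset, its parameters, and the fixed `random_state` are illustrative choices, not part of the original notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic data (an assumption, not the real dataset).
# flip_y=0.2 adds label noise so the task cannot be solved perfectly.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.20, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # rows the model has already seen
val_acc = model.score(X_va, y_va)    # held-out rows
full_acc = model.score(X, y)         # mix of both

print(f"train={train_acc:.3f}  validation={val_acc:.3f}  full={full_acc:.3f}")
```

Because the full dataset is just the training and validation rows combined, its accuracy always lies between the other two, which is why it overstates how the model does on unseen passengers.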

A model like the Random Forest classifier actually computes a probability for each class, and `predict` simply picks the more probable one, which for two classes amounts to a threshold of 0.5. Accuracy therefore hides how confident the model was. A better way of evaluating the model is to compute these probabilities and use the log loss metric, which penalizes confident wrong predictions more heavily than hesitant ones.
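For two classes, log loss is the average of `-(y * log(p) + (1 - y) * log(1 - p))`, where `p` is the predicted probability of the positive class. The sketch below, on four hypothetical predictions (the numbers are made up), shows how a threshold turns probabilities into labels and checks a hand-written log loss against `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities [P(class 0), P(class 1)].
y_true = np.array([0, 1, 1, 0])
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
p_pos = proba[:, 1]

# The chosen threshold decides which probabilities count as a positive prediction.
labels_at_half = (p_pos >= 0.5).astype(int)
labels_at_03 = (p_pos >= 0.3).astype(int)

# Log loss computed by hand from the formula above.
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

# The same quantity from scikit-learn.
from_sklearn = log_loss(y_true, proba)

print(labels_at_half, labels_at_03)
print(round(manual, 4), round(from_sklearn, 4))
```

Unlike accuracy, the log loss does not change when the threshold moves, because it is computed from the probabilities directly.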

In [17]:
y_pred_proba = randomforest.predict_proba(x_val)
y_pred_proba

array([[0.72966667, 0.27033333],
       [0.77116667, 0.22883333],
       [0.99      , 0.01      ],
       [0.04      , 0.96      ],
       [0.73      , 0.27      ],
       [0.65316667, 0.34683333],
       [0.07      , 0.93      ],
       [0.25      , 0.75      ],
       [0.4       , 0.6       ],
       [0.21541667, 0.78458333],
       [0.84      , 0.16      ],
       [0.495     , 0.505     ],
       [0.79624636, 0.20375364],
       [0.01      , 0.99      ],
       [0.02      , 0.98      ],
       [0.4       , 0.6       ],
       [1.        , 0.        ],
       [0.7       , 0.3       ],
       [0.98      , 0.02      ],
       [0.375     , 0.625     ],
       [0.91738095, 0.08261905],
       [0.1       , 0.9       ],
       [0.96677114, 0.03322886],
       [0.699     , 0.301     ],
       [0.62      , 0.38      ],
       [0.01      , 0.99      ],
       [0.82783333, 0.17216667],
       [0.25      , 0.75      ],
       [0.07333333, 0.92666667],
       [0.87      , 0.13      ],
       ...

In [18]:
round(log_loss(y_val, y_pred_proba), 4)

0.5738
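To put a log loss value in context, it helps to compare it with a reference point: a model that always predicts 50/50 scores ln(2), roughly 0.693, and lower is better. The sketch below uses hypothetical labels and probabilities (not the notebook's predictions) to compute that baseline and to show how a single confident miss can dominate the average.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels.
y_true = np.array([0, 1, 0, 1])

# A model that always answers 50/50 scores ln(2).
uninformed = np.full((4, 2), 0.5)
baseline = log_loss(y_true, uninformed)

# Three good predictions and one confident miss on the last sample.
mostly_right = np.array([[0.9, 0.1],
                         [0.1, 0.9],
                         [0.9, 0.1],
                         [0.99, 0.01]])
with_miss = log_loss(y_true, mostly_right)

print(round(baseline, 4), round(with_miss, 4))
```

In this made-up case the second model is right three times out of four, yet its log loss is worse than the 50/50 baseline because of the one prediction that was confidently wrong.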