# Cross Validation

We're revisiting the Titanic. [The data set is located here](https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv). Using this data set:

* You will perform a supervised classification on the `Survived` column: make a train-test split with `train_test_split` (without fixing the `random_state`), then fit a classifier of your choice (k-NN, decision tree, logistic regression, etc.). You have to clearly show the accuracy score on the training set and the accuracy score on the test set.

* In a new block of code, you will perform exactly the same steps as before. As you have not set the `random_state`, the scores should be different.

* With the same classifier, you will run a cross-validation with 6 folds. Does the cross-validation strengthen your confidence in this prediction?

* You will calculate the mean and standard deviation of the 6 scores obtained.

In [19]:
# Importing the base libraries
import numpy as np
import pandas as pd

from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split, cross_validate, KFold
from sklearn.ensemble import RandomForestClassifier

In [3]:
# Loading the data to the pandas DataFrame
df = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")

# Let's have a quick look
df

,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.2500
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.9250
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1000
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.0500
...,...,...,...,...,...,...,...,...
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0000
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0000
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.4500
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0000


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Survived                 887 non-null    int64  
 1   Pclass                   887 non-null    int64  
 2   Name                     887 non-null    object 
 3   Sex                      887 non-null    object 
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64  
 6   Parents/Children Aboard  887 non-null    int64  
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB


## Task 1


* You will perform a supervised classification on the `Survived` column: make a train-test split with `train_test_split` (without fixing the `random_state`), then fit a classifier of your choice (k-NN, decision tree, logistic regression, etc.). You have to clearly show the accuracy score on the training set and the accuracy score on the test set.

In [23]:
# Selecting features
# "Name" is dropped: sex is already available as its own feature,
# and guessing age from the title in "Name" is skipped to keep the exercise simple.
# The first name and surname are not treated as informative for survival here.
features = ["Pclass", "Sex", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare"]
target   = "Survived"

# Making the train-test-split
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.20)

# Getting the sizes
print(f"Shape of training set = {X_train.shape}")
print(f"Shape of testing set  = {X_test.shape}")

# Since we need to do some preprocessing too,
# let's make a column transformer including the OrdinalEncoder and
# later a pipeline

# Column transformer passes through everything
# except "Sex", which is ordinally encoded
column_transformer = make_column_transformer(
    ("passthrough", ["Pclass"]),
    (OrdinalEncoder(), ["Sex"]),
    remainder="passthrough")

# The pipeline consists of the column transformer followed by a RandomForestClassifier (RFC)
pipe = make_pipeline(column_transformer, RandomForestClassifier())

# Let's fit the pipe
pipe.fit(X_train, y_train)

# Let's compute the prediction scores (default = "accuracy")
scores_train = pipe.score(X_train, y_train)
scores_test = pipe.score(X_test, y_test)

# Output
print(f"RFC accuracy on the training set = {round(scores_train * 100, 2)}%")
print(f"RFC accuracy on the testing set  = {round(scores_test * 100, 2)}%")

Shape of training set = (709, 6)
Shape of testing set  = (178, 6)
RFC accuracy on the training set = 98.73%
RFC accuracy on the testing set  = 79.78%
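
To make the preprocessing step above more concrete, here is a minimal sketch of what the `OrdinalEncoder` inside a column transformer does to the `Sex` column. The `toy` frame is a hypothetical three-row sample built from the first rows displayed earlier, not the full data set, and the transformer is a simplified variant of the one in the cell above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame: three rows copied from the table shown earlier
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# Encode "Sex" as integers and pass every other column through unchanged
transformer = make_column_transformer(
    (OrdinalEncoder(), ["Sex"]),
    remainder="passthrough",
)

encoded = transformer.fit_transform(toy)

# Explicitly transformed columns come first, then the remainder in original order
print(encoded)
```

With the default settings the encoder sorts the categories, so `female` becomes 0 and `male` becomes 1. A random forest only needs the two values to be distinguishable, which is why an ordinal encoding is enough for a binary column like this one.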


## Task 2

* In a new block of code, you will perform exactly the same steps as before. As you have not set the `random_state`, the scores should be different.

In [24]:
# Doing everything once again

# Selecting features
features = ["Pclass", "Sex", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare"]
target   = "Survived"

# Making the train-test-split
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.20)

# Getting the sizes
print(f"Shape of training set = {X_train.shape}")
print(f"Shape of testing set  = {X_test.shape}")

# Making column transformer
column_transformer = make_column_transformer(
    ("passthrough", ["Pclass"]),
    (OrdinalEncoder(), ["Sex"]),
    remainder="passthrough")

# Making pipeline
pipe = make_pipeline(column_transformer, RandomForestClassifier())

# Fitting the pipe
pipe.fit(X_train, y_train)

# Getting the prediction scores (default = "accuracy")
scores_train = pipe.score(X_train, y_train)
scores_test = pipe.score(X_test, y_test)

# Output
print(f"RFC accuracy on the training set = {round(scores_train * 100, 2)}%")
print(f"RFC accuracy on the testing set  = {round(scores_test * 100, 2)}%")

Shape of training set = (709, 6)
Shape of testing set  = (178, 6)
RFC accuracy on the training set = 98.45%
RFC accuracy on the testing set  = 82.02%
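
The two runs differ because both the split and the random forest are randomised. As a rough sketch of how large that spread can be, the loop below repeats the split-and-fit step ten times. It uses synthetic data from `make_classification` as a stand-in for the Titanic features (an assumption, so that the example runs without downloading anything), and it fixes the seeds only to make the sketch reproducible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): same number of rows and features as the Titanic frame
X, y = make_classification(
    n_samples=887, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

test_scores = []
for seed in range(10):
    # A different seed gives a different split and a different forest
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    test_scores.append(model.score(X_te, y_te))

test_scores = np.array(test_scores)
print(f"min = {test_scores.min():.3f}, max = {test_scores.max():.3f}, std = {test_scores.std():.3f}")
```

A single test score is one draw from this distribution, which is the motivation for the cross-validation in the next task.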


## Tasks 3 and 4

* With the same classifier, you will run a cross-validation with 6 folds. Does the cross-validation strengthen your confidence in this prediction?
* You will calculate the mean and standard deviation of the 6 scores obtained.


In [31]:
# Setting up cross-validation strategy
cv = KFold(n_splits=6, shuffle=True)

# Running the cross-validation
results = cross_validate(pipe, X_train, y_train, scoring="accuracy", cv=cv, return_train_score=True)

# Output
print(f"Mean time to fit the model             = {round(results['fit_time'].mean(), 2)} secs")
print(f"Mean time to score the model           = {round(results['score_time'].mean(), 2)} secs")
print(f"Mean accuracy of the model (train set) = {round(results['train_score'].mean() * 100, 2)}% ±{round(results['train_score'].std() * 100, 2)}%")
print(f"Mean accuracy of the model (test set)  = {round(results['test_score'].mean() * 100, 2)}% ±{round(results['test_score'].std() * 100, 2)}%")

Mean time to fit the model             = 0.22 secs
Mean time to score the model           = 0.02 secs
Mean accuracy of the model (train set) = 98.79% ±0.25%
Mean accuracy of the model (test set)  = 79.69% ±3.87%
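
For intuition about the `KFold(n_splits=6, shuffle=True)` strategy used above, the sketch below splits 709 indices (the size of the training set reported earlier) and checks that every sample lands in exactly one validation fold. The fixed `random_state` is an assumption added for reproducibility; the notebook cell itself leaves it unset.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 709  # size of the training set printed earlier
cv = KFold(n_splits=6, shuffle=True, random_state=0)

fold_sizes = []
times_validated = np.zeros(n_samples, dtype=int)

for train_idx, val_idx in cv.split(np.arange(n_samples)):
    fold_sizes.append(len(val_idx))
    times_validated[val_idx] += 1  # count how often each sample is held out

print(fold_sizes)
print((times_validated == 1).all())
```

Each of the 6 scores is therefore computed on roughly 118 held-out samples, and together the folds cover the whole training set once.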


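__Answer__:  
The large gap between the mean training accuracy (98.79%) and the mean validation accuracy (79.69%) suggests the model is _overfitting_. The validation accuracy varies by about ±3.87 percentage points across the 6 folds, and both single-split test scores (79.78% and 82.02%) fall inside that range, so the cross-validated estimate looks _stable_ enough and deserves more confidence than any single split.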
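
The arithmetic behind this answer can be sketched in a few lines. The numbers are copied from the printed outputs above; the variable names are only illustrative.

```python
# Values copied from the outputs above (accuracy in percent)
cv_mean, cv_std = 79.69, 3.87        # cross-validated validation accuracy
train_mean = 98.79                   # cross-validated training accuracy
single_split_scores = [79.78, 82.02] # test accuracy from Task 1 and Task 2

# One-standard-deviation band around the cross-validated mean
low, high = cv_mean - cv_std, cv_mean + cv_std

for score in single_split_scores:
    inside = low <= score <= high
    print(f"{score:.2f}% inside [{low:.2f}%, {high:.2f}%]: {inside}")

# Gap between training and validation accuracy, a rough overfitting indicator
gap = train_mean - cv_mean
print(f"train/validation gap = {gap:.2f} percentage points")
```

Both single-split scores sit inside the band, while the gap of roughly 19 percentage points is what points to overfitting.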