In [16]:
import numpy as np
import pandas as pd
import seaborn as sns

# Scikit-Learn Advanced: Titanic dataset
<img src="images/04_sklearn_advanced/titanic.png" style="display: block;margin-left: auto;margin-right: auto;width: 300px"/>

## About the data
In this notebook we will work with the [Titanic dataset](https://www.kaggle.com/c/titanic/data), containing historical data about Titanic passengers and whether they survived the wreck.


| Variable | Definition                                  | Key                                            |
| -------- | ------------------------------------------- | ---------------------------------------------- |
| survival | Survival                                    | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                                | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                         |                                                |
| age      | Age in years                                |                                                |
| sibsp    | \# of siblings / spouses aboard the Titanic |                                                |
| parch    | \# of parents / children aboard the Titanic |                                                |
| fare     | Passenger fare                              |                                                |
| cabin    | Cabin number                                |                                                |
| embarked | Port of Embarkation                         | C = Cherbourg, Q = Queenstown, S = Southampton |

## Titanic disaster survivor classification
In this notebook, we will demonstrate how to model whether a given passenger aboard the Titanic is expected to have survived the disaster with scikit-learn. 

Due to the nature of the data, this dataset requires a little more elaborate processing before we can get started. This allows us to introduce:
* **Preprocessing** with scikit-learn transformers
*  **Pipelines**
* **Cross-validation** 
* **Grid-search** for hyperparameter tuning


## 1. Loading in the data

The data is loaded from a package called _seaborn_. Seaborn was imported in the first cell (`import seaborn as sns`) and has a `.load_dataset` function that provides us with the Titanic dataset as a Pandas DataFrame. 

In [17]:
titanic_df = sns.load_dataset('titanic')
titanic_df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


## 2. Exploratory Data Analysis

Take some time to examine the dataset. Below are some suggestions for what you may want to investigate. Note that this is not an exhaustive list; see if you can find something interesting!

- How much data do we have? How many features?
- What do the features represent?
- What datatypes does it contain? Are there any missing values?

- Investigate how many different values some of the categorical features contain
- Is there any redundant information?

- Produce some summary statistics for the different features
- Are any of the features correlated?
- Group the data by the survived column and compare statistics
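
Most of the suggestions above map onto a handful of pandas calls. The following is a minimal sketch only: `toy_df` is a small hand-made DataFrame with made-up values (an assumption, so that the example runs without downloading anything), and in the notebook you would apply the same calls to `titanic_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking a few Titanic columns (values are made up)
toy_df = pd.DataFrame({
    'survived': [0, 1, 1, 0, 1, 0],
    'pclass': [3, 1, 3, 1, 2, 3],
    'class': ['Third', 'First', 'Third', 'First', 'Second', 'Third'],
    'age': [22.0, 38.0, np.nan, 35.0, 27.0, np.nan],
    'fare': [7.25, 71.28, 7.93, 53.10, 13.00, 8.05],
})

print(toy_df.shape)                        # number of rows and columns
print(toy_df.dtypes)                       # datatype of each column

missing_per_column = toy_df.isna().sum()   # count of missing values per column
print(missing_per_column)

print(toy_df['class'].nunique())           # distinct values of a categorical feature
print(toy_df.describe())                   # summary statistics of the numeric columns
print(toy_df.select_dtypes('number').corr())  # pairwise correlations of numeric columns

# Compare statistics between survivors and non-survivors
mean_by_survival = toy_df.groupby('survived')[['age', 'fare']].mean()
print(mean_by_survival)
```

Comparing `pclass` with `class` in such an overview is one way to spot columns that carry the same information.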

In [18]:
# Your EDA code here. 

Think about it:
- What do you imagine a machine learning algorithm will do with missing values? 
- Do you think duplicate columns will improve or hurt performance? 
- Do you imagine a machine learning algorithm can deal with categorical values (strings)? 

Clearly some issues need to be resolved before we can do machine learning on this dataset.


## 3. Preprocessing

Raw datasets are often not suitable for machine learning algorithms. For example, the dataset may contain categorical features or have missing values. Preprocessing the dataset to ensure that machine learning is feasible is therefore an important phase of a project.

In the dataset we found plenty of interesting variables, though quite a few need preprocessing. Let's first remove the redundant columns and drop rows with missing values\*.

*\* With Scikit-Learn we can actually impute missing values, but for now this approach will be sufficient.*
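
For reference, imputation could look roughly like the sketch below. It uses `SimpleImputer` from `sklearn.impute` on a made-up column (`toy_ages` is hypothetical, not the real data), and the median strategy is just one possible choice, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with gaps (values are made up)
toy_ages = pd.DataFrame({'age': [22.0, np.nan, 26.0, 35.0, np.nan]})

# Replace each missing value with the median of the observed values
imputer = SimpleImputer(strategy='median')
imputed_ages = imputer.fit_transform(toy_ages)

print(imputed_ages.ravel())
```

Unlike `.dropna()`, this keeps every row, at the cost of filling in values that were never observed.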

In [19]:
titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [20]:
# Remove irrelevant features & missing values.
# Later we will discuss alternatives to this.
titanic_processed = (
    titanic_df
    .drop(['embarked', 'sex', 'adult_male','deck','alive', 'class'], axis=1)
    .dropna()
)

# Get the feature matrix & target vector
titanic_features = (
    titanic_processed
    .drop(['survived'], axis=1)
)
titanic_labels = titanic_processed['survived']

# Notice that we now use Pandas dataframe & series rather than numpy arrays - this is also possible with sklearn!
print(titanic_features.shape, titanic_labels.shape)

(712, 8) (712,)


In [21]:
titanic_features.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,who,embark_town,alone
0,3,22.0,1,0,7.25,man,Southampton,False
1,1,38.0,1,0,71.2833,woman,Cherbourg,False
2,3,26.0,0,0,7.925,woman,Southampton,True
3,1,35.0,1,0,53.1,woman,Southampton,False
4,3,35.0,0,0,8.05,man,Southampton,True


In [22]:
titanic_labels.head()

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64

### 3.1 Variable encoding

It would still be rather difficult to work with this dataset. One reason is that it contains categorical values that are represented as strings, e.g. the `who` column comes as text: "man", "woman" and "child". Most machine learning models cannot handle strings; they require the input to be numeric. Therefore, we need to transform these categorical variables into numeric values that represent the same information.

In [29]:
cat_features = ['who','embark_town']
titanic_features[cat_features]

Unnamed: 0,who,embark_town
0,man,Southampton
1,woman,Cherbourg
2,woman,Southampton
3,woman,Southampton
4,man,Southampton
...,...,...
885,woman,Queenstown
886,man,Southampton
887,woman,Southampton
889,man,Cherbourg


In [33]:
def make_dummy_cols(df, cols):
    return(
        df
        .join(pd.get_dummies(df[cols]))
        .drop(columns=cols)
    )

make_dummy_cols(titanic_features, cat_features)

Unnamed: 0,pclass,age,sibsp,parch,fare,alone,who_child,who_man,who_woman,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,3,22.0,1,0,7.2500,False,0,1,0,0,0,1
1,1,38.0,1,0,71.2833,False,0,0,1,1,0,0
2,3,26.0,0,0,7.9250,True,0,0,1,0,0,1
3,1,35.0,1,0,53.1000,False,0,0,1,0,0,1
4,3,35.0,0,0,8.0500,True,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
885,3,39.0,0,5,29.1250,False,0,0,1,0,1,0
886,2,27.0,0,0,13.0000,True,0,1,0,0,0,1
887,1,19.0,0,0,30.0000,True,0,0,1,0,0,1
889,1,26.0,0,0,30.0000,True,0,1,0,1,0,0


## 5. Validation

When training our models, we should reserve a separate set of data for comparing model performance and picking hyperparameters, e.g. should I use order 2 or order 3 polynomial features?

The test set should only be used to give a final evaluation of the model.


 <img src="images/04_sklearn_advanced/validation.png" style="display: block;margin-left: auto;margin-right: auto;height: 100px"/>
 
Otherwise, information from the test set leaks into our training process, invalidating our metrics.
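
One common way to obtain such a three-way split is to call `train_test_split` twice. This is a sketch on random toy data; the names `X_toy`, `X_tr`, `X_val`, `X_te` and the 60/20/20 proportions are assumptions for illustration, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features (random values)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# First set aside the test set (20% of all data) ...
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# ... then split the remainder into train and validation (25% of 80 = 20 samples)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(X_tr.shape, X_val.shape, X_te.shape)
```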

### 5.1 Cross-Validation

If we only have a small dataset, we may not have enough data to set aside a separate validation set. In addition, our validation results would still depend on the particular train/validation split we happened to choose. Cross-validation is used to address these issues.

 <img src="images/04_sklearn_advanced/crossvalidation.png" style="display: block;margin-left: auto;margin-right: auto;height: 300px"/>

K-fold cross validation splits can be acquired using `KFold()` from `sklearn.model_selection`. Then, for each split, the model can be fit and relevant metrics can be computed.
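
Such a manual loop might look as follows. This is a sketch on random toy data with a `LogisticRegression` stand-in model; both are assumptions for illustration and not the pipeline used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical toy data: the label depends on the sign of the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kfold.split(X_toy):
    # Fit a fresh model on the training part of this fold
    model = LogisticRegression()
    model.fit(X_toy[train_idx], y_toy[train_idx])
    # Evaluate on the held-out part of this fold (accuracy)
    fold_scores.append(model.score(X_toy[val_idx], y_toy[val_idx]))

print(np.round(fold_scores, 3))
print(round(float(np.mean(fold_scores)), 3))
```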

However, it may be more practical to use `cross_val_score()` from `sklearn.model_selection`, which returns an array of cross-validated scores for a model/pipeline on the data.

Let's see how our current model performs on the different validation splits:

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(titanic_pipeline,
                         X_train,
                         y_train,
                         cv = 5,
                         scoring = 'accuracy'
                        )

print(f'Accuracy per fold: {scores.round(3)}')
print(f'Accuracy mean over folds: {scores.mean().round(3)}')

In [None]:
X_train.shape

**Exercise**

Think about it: 
1. What would be the advantage of a _higher_ number of folds for your cross-validation?
    - What happens if the number of folds = 1? 
    - What happens if the number of folds = the number of data points in your train set (i.e. 569 for the Titanic dataset)? 
    
2. What would be the disadvantage of a higher number of folds? 

### 5.2 Parameter tuning

To find the best parameters, we often want to check how each combination of parameters performs. This process is known as a grid search.

We could manually collect the cross-validation scores for each parameter combination. However, in Scikit-Learn this can be implemented directly.

`GridSearchCV()` from `sklearn.model_selection` allows tuning of a model's parameters using cross-validated grid search. 

It requires a dictionary that maps parameter names to the options it should compare. These do not have to be model parameters only; parts of the preprocessing can be tuned too. We simply need to reference the correct pipeline step using `'step_name__parameter_name'` as the dictionary key (note the double underscore between the step and parameter name). In the end, the best set of parameters will be selected based on the provided metric.

In [None]:
titanic_pipeline.get_params()

Pay attention to the way the params dictionary is written:
- A key is a string made of the name of the pipeline step + 2 underscores + the specific parameter to search, for example `classifier__criterion`
- A value is a list/tuple of all the parameter values you want to search

This allows us to search possible parameters for any of the steps in the initial Pipeline (e.g. preprocessing too!) 
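
To make the naming scheme concrete, here is a self-contained sketch. The pipeline `toy_pipeline`, its step names `scaler` and `classifier`, the decision tree model and the random data are all assumptions chosen for illustration; they are not the `titanic_pipeline` from this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (random values)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Hypothetical pipeline; the step names are chosen here for illustration
toy_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=0)),
])

toy_grid = {
    'scaler': [StandardScaler(), MinMaxScaler()],   # swap out a whole step
    'classifier__criterion': ['gini', 'entropy'],   # step name + __ + parameter
    'classifier__max_depth': [2, 4],
}

search = GridSearchCV(toy_pipeline, toy_grid, cv=3, scoring='accuracy')
search.fit(X_toy, y_toy)

print(len(search.cv_results_['params']))  # 2 * 2 * 2 = 8 combinations
print(search.best_params_)
```

Every key in the grid must appear in `toy_pipeline.get_params()`, which is a quick way to check the spelling of a step or parameter name.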

Note that with large datasets, many parameters and many cross validation folds, `GridSearchCV()` may take considerable time.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Parameter grid: keys reference pipeline steps, values list the options to try
parameters_svm = {
    'PFeatures__degree': (1, 2),
    'scaler': [MinMaxScaler(), StandardScaler(), RobustScaler()],
}

# Run the cross-validated grid search
clf_svm = GridSearchCV(titanic_pipeline, parameters_svm, cv=5, scoring='accuracy')
clf_svm = clf_svm.fit(X_train, y_train)

print(f'CV accuracy score of the best SVM is: {clf_svm.best_score_:.3f}')
print(f'Best parameters were: {clf_svm.best_params_}')

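In addition to being easier to implement than a manual grid search with cross-validation, `GridSearchCV()` helps us to avoid __information leaks__: because the whole pipeline is refit within each fold, preprocessing steps such as the scaler only ever see the training part of that fold.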

## 6. Results

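Once we've established the best parameters, we should use them to retrain on the *WHOLE* training set. With the default `refit=True`, `GridSearchCV()` has already done this for us and stores the result in `best_estimator_`: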

In [None]:
best_model = clf_svm.best_estimator_

In [None]:
print(f'Accuracy on the test set: {best_model.score(X_test, y_test):.3f}')

However, it is usually a good idea not to take the "best" model blindly, but to manually compare different models from the grid search.

In [None]:
cv_results = pd.DataFrame(clf_svm.cv_results_)
cv_results.sort_values('rank_test_score').head(6)

After you have finished, it will likely be helpful to save your model for later.

In [None]:
# from joblib import dump, load

# Save the model
# dump(best_model, 'best_model.joblib')

# Load a saved model
# best_model = load('best_model.joblib')
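
A round trip through `joblib` can be sketched as follows. The toy model, the random data and the file name `toy_model.joblib` are assumptions for illustration; a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data and model (stand-ins for best_model)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)
toy_model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'toy_model.joblib')  # hypothetical file name
    dump(toy_model, model_path)        # save the fitted model
    restored_model = load(model_path)  # load it back into memory

# The restored model should predict exactly like the original
same_predictions = np.array_equal(
    toy_model.predict(X_toy), restored_model.predict(X_toy))
print(same_predictions)
```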

## Conclusions

Scikit-Learn can be a very powerful tool to deal with machine learning problems including data splitting, preprocessing, model training and model selection. Its simple interface and detailed documentation allow it to be used even by users with little experience.

* A huge variety of implemented models with sensible defaults, such as KMeans, SVC or RandomForest
* Supports both pandas and numpy as data inputs
* Data preprocessing techniques such as Scaler, OneHotEncoding, etc. 
* Validation tools such as train_test_split or cross validation
* Pipelines to evaluate not only your model and its hyperparameters, but also your preprocessing steps.
* `GridSearchCV` for hyperparameter tuning and model selection
* Consistent implementation and API which makes it easy to extend with your own implemented building blocks (scorers, preprocessing techniques and even models)

Caveat: Scikit-Learn lacks tools to work with deep learning. For this purpose you would rather use deep learning libraries such as PyTorch or TensorFlow/Keras.