# Scikit-Learn Advanced

### In this notebook we will review and practice with other scikit-learn features

- Preprocessing   
- Creating and fitting Pipelines  
- Cross-Validation 
- Grid Search
- and more...



In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

## The data

In this notebook we will work with the [Titanic dataset](https://www.kaggle.com/c/titanic/data), containing historical data about Titanic passengers and whether they survived the wreck.

<img src="images/04_sklearn_advanced/titanic.png" style="display: block;margin-left: auto;margin-right: auto;width: 600px"/>

We will first focus on sklearn's preprocessing features, but before we can go ahead, let's load the dataset and get it ready for it.

In [None]:
titanic_df = sns.load_dataset('titanic')
print(titanic_df.shape)
titanic_df.head()

### Initial analysis

Clearly some issues need to be resolved before we can do machine learning on this dataset.

1) Take some time to examine the dataset. Below are some suggestions for what you may want to investigate. Note that this is not an exhaustive list; see if you can find something interesting!

    - How much data do we have? How many features?
    - What do the features represent?
    - What datatypes does it contain? Are there any missing values?
    
    - Investigate how many different values some of the categorical features contain
    - Is there any redundant information?
    
    - Produce some summary statistics for the different features
    - Are any of the features correlated?
    - Group the data by the survived column and compare statistics
    
2) Which ML models would you consider for this problem? Why have you made this choice?

In [None]:
## Tool for some automatic analysis of a pandas DataFrame
# from pandas_profiling import ProfileReport
# profile = ProfileReport(titanic_df, minimal=True)
# profile.to_file('Titanic.html')

Let's create our feature matrix and target vector.

## Preprocessing:

Raw datasets are often not suitable for machine learning algorithms. For example, the dataset may contain categorical features or have missing values. Preprocessing the dataset to ensure that machine learning is feasible is therefore an important phase of a project.

In the dataset we found plenty of interesting variables, though quite a few need preprocessing. Let's first remove the redundant columns and drop rows with missing values*.

**With Scikit-Learn we can actually impute missing values but for now this approach will be sufficient*

In [None]:
titanic_df.head()

In [None]:
# Remove irrelevant features & missing values.
# Later we will discuss alternatives to this.
titanic_processed = (
    titanic_df
    .drop(['embarked', 'sex', 'adult_male','deck','alive', 'class'], axis=1)
    .dropna()
)

# Get the feature matrix & target vector
titanic_features = (
    titanic_processed
    .drop(['survived'], axis=1)
)
titanic_labels = titanic_processed['survived']

# Notice that we now use Pandas dataframe & series rather than numpy arrays - this is also possible with sklearn!
print(titanic_features.shape, titanic_labels.shape)

In [None]:
titanic_features.head()

In [None]:
titanic_labels.head()

## Variable encoding

It would still be rather difficult to work with the raw dataset. One reason is that it contains categorical values, e.g. passenger class comes as text: "First", "Second" and "Third". However, it is possible to encode these categorical variables into numeric ones. Fortunately, scikit-learn makes it easy for us to do so.

### Encoding the target vector

We can numerically encode a column using the `LabelEncoder` from Scikit-Learn.

Normally this is used for the target or label column (although in our case this is not necessary). However, it can also be used for columns in the feature matrix.

In [None]:
from sklearn.preprocessing import LabelEncoder

Much like the machine learning estimators, in Scikit-Learn, the preprocessing algorithms are implemented as Python objects. They are referred to as *transformers*.

Once you have picked the transformer algorithm you will use, you instantiate it.

In [None]:
label_encoder = LabelEncoder()

Transformers have a `.fit()` method implemented so that they can learn the parameters needed to transform data. The `.transform()` method is then used to perform the transformation. Let's demonstrate this on the `alone` column of the dataset:

In [None]:
print('Before transformation:')
print(titanic_features['alone'])

label_encoder.fit(titanic_features['alone'])
new_col = label_encoder.transform(titanic_features['alone'])

print('\n\nAfter transformation:')
print(new_col)

They also have a `.fit_transform()` method implemented so that the transformation can be performed directly after the parameters have been learnt.

In [None]:
print('Before transformation:')
print(titanic_features['alone'])

new_col = label_encoder.fit_transform(titanic_features['alone'])

print('\n\nAfter transformation:')
print(new_col)
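A fitted `LabelEncoder` also remembers which categories it has seen (in `classes_`) and can map the integer codes back with `.inverse_transform()`. The following is a minimal sketch on a small made-up column; the names `toy_col`, `toy_encoder` and `toy_codes` are purely illustrative and not part of the Titanic data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column, invented for illustration only
toy_col = pd.Series(["yes", "no", "no", "yes"])

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_col)

# Categories learnt during fit (stored in sorted order)
print(toy_encoder.classes_)
# Integer code assigned to each row
print(toy_codes)
# Map the codes back to the original labels
toy_decoded = toy_encoder.inverse_transform(toy_codes)
print(toy_decoded)
```

Being able to invert the encoding is handy when predictions need to be reported using the original labels.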

### Ordinally encoding features

For the feature matrix it would be more practical to encode all categorical features at once (rather than one column at a time). 

One way of doing this could be to use the `OrdinalEncoder`, which can ordinally encode multiple features at the same time.

However, we do not want to encode *all* of the features, only those that are categorical. Rather than doing this manually, we can use `ColumnTransformer` to achieve exactly that.

In [None]:
titanic_features.head()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer 


categorical_columns = [ 'who', 'embark_town', 'alone']

ct = ColumnTransformer(
    [
    ("ordinal", OrdinalEncoder(), categorical_columns)
    ], remainder="passthrough") 

# The output of fit_transform is no longer a pandas df, but now a numpy matrix. 
feature_matrix = ct.fit_transform(titanic_features)

print(feature_matrix[0:5])

Note that the ColumnTransformer will affect the order of your columns!

### One-hot encoding features

What issues could we get from using an `OrdinalEncoder()` on some of the categorical values?

This way of encoding categorical values may work for problems where there is a natural ordinal relationship between the categories, and in turn the integer values, such as the temperature labels ‘cold’, ‘warm’ and ‘hot’.

There may be problems when there is no ordinal relationship: allowing the representation to lean on any such relationship might be damaging to learning to solve the problem. An example might be the labels ‘red’, ‘blue’ and ‘green’.

In these cases we would like to give the algorithm more expressive power, so we require [one-hot encoding](https://gdd.li/04_sklearn).

Fortunately, we can use sklearn's `OneHotEncoder()` for this purpose. This can also help us encode multiple features at the same time.

Again, we do not want to encode *all* of the features. We only want to pass some of the columns into the `OneHotEncoder()`. Like before, we use the `ColumnTransformer()` to do this.

In [None]:
print(titanic_features.shape)

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer 

ct = ColumnTransformer(
    [
    ("onehot", OneHotEncoder(drop='first'), [ 'who', 'embark_town', 'alone'])
    ], remainder="passthrough") 

# The output of fit_transform is no longer a pandas df, but now a numpy matrix. 
feature_matrix = ct.fit_transform(titanic_features)

print(feature_matrix[0:5])

print(feature_matrix.shape)

**Keep in mind:**

We use the `drop='first'` option in OneHotEncoder, which ensures that we keep *n-1* dummies per categorical variable.

`ColumnTransformer` can also simultaneously apply a different transformation to other columns if we specify an extra step in its list of transformers.
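The effect of `drop='first'` can be checked on a tiny example. This is a sketch on a made-up colour column (the array `colours` is an assumption for illustration); `.toarray()` is used because `OneHotEncoder` returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy colour column with three categories, invented for illustration
colours = np.array([["red"], ["blue"], ["green"], ["blue"]])

# Full one-hot encoding: one column per category
full = OneHotEncoder().fit_transform(colours).toarray()

# drop='first' removes the first category in sorted order ('blue' here)
reduced = OneHotEncoder(drop="first").fit_transform(colours).toarray()

print(full.shape)
print(reduced.shape)
print(reduced)
```

Rows belonging to the dropped category are encoded as all zeros, so no information is lost while one redundant column is removed.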

## Feature Scaling

Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume. 

However, many machine learning algorithms perform better when numerical input variables are scaled to a standard range, [for example](https://gdd.li/04_sklearn) algorithms that use distance measures, like k-nearest neighbors.

We want to ensure our variables share a similar scale. However, it is important to be aware of when to do feature scaling to avoid data leakage.


### *Data leakage warning!!!*

We want to avoid any information from the test set leaking into the training set; it is important our ML algorithm learns only from the training data! 

If information from the test set *does* leak into the training data, it may cause our metrics to overestimate our model's performance.

Therefore we must do the train-test split *before* we do feature scaling (and other forms of preprocessing).

In [None]:
# Rename to X and y for consistency. 
X = feature_matrix
y = titanic_labels

print(type(X))
print(type(y))

In [None]:
#train-test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111, stratify=y)

When performing classification, we can set `stratify=y` so that the training and test sets keep the same class proportions as the full dataset. This can give a more reliable score for your model, especially when the classes are imbalanced.
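The effect of `stratify` can be illustrated with a small sketch on made-up, imbalanced labels (`toy_X` and `toy_y` are assumptions for illustration, not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80% class 0, 20% class 1 (invented for illustration)
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=111, stratify=toy_y
)

# Share of class 1 in each split
print(toy_y_train.mean(), toy_y_test.mean())
```

With stratification both splits keep the 20% share of the minority class; without it the share in a small test set can drift noticeably from split to split.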

### Feature scaling with sklearn

`StandardScaler()` and `MinMaxScaler()` from `sklearn.preprocessing` allow us to scale the data using the same `.fit()`, `.transform()` and `.fit_transform()` methods we saw for the encoders.

Remember to always first separate the data into train/test (and validation) sets and only then use feature scaling. If you take the mean and variance of the whole dataset, you'll be introducing information from the test set into the training process.

In [None]:
from sklearn.preprocessing import StandardScaler

The `StandardScaler()` transforms features by removing the mean and scaling to unit variance.

In [None]:
sc_X = StandardScaler()

After instantiating the transformer, we learn *how* to transform the dataset on the training set with `.fit()`.

We then transform the train set with `.transform()`.

These two steps can be performed at once with `.fit_transform()`.

In [None]:
X_train_scaled = sc_X.fit_transform(X_train)
X_train_scaled.mean(axis = 0)

By using `.transform()` (and not `.fit_transform()`) on the test set, we ensure it is scaled with the mean and variance learnt from the train set.

In [None]:
X_test_scaled = sc_X.transform(X_test)
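To make explicit what the scaler learns, here is a minimal sketch on made-up arrays (`toy_train` and `toy_test` are assumptions for illustration). It compares `StandardScaler` with the equivalent manual computation using the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test arrays, invented for illustration
toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_test = np.array([[2.5], [10.0]])

# The scaler learns the mean and standard deviation from the training data only
toy_scaler = StandardScaler().fit(toy_train)
scaled_test = toy_scaler.transform(toy_test)

# Manual equivalent: subtract the training mean, divide by the training std
manual_test = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(toy_scaler.mean_, toy_scaler.scale_)
print(scaled_test)
print(manual_test)
```

Note that the scaled test set will generally not have mean 0 and variance 1 itself, because the statistics come from the training data.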

The `MinMaxScaler()` transforms features to lie within a given range, by default between 0 and 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
sc_X = MinMaxScaler()
X_train_scaled = sc_X.fit_transform(X_train)

In [None]:
X_train_scaled.mean(axis = 0)

In [None]:
X_test_scaled = sc_X.transform(X_test)

There are other scaling options to use too. For example, the `RobustScaler`.

In [None]:
from sklearn.preprocessing import RobustScaler
help(RobustScaler)
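With its default settings, `RobustScaler` centres each feature on its median and scales it by the interquartile range, which makes it less sensitive to outliers than `StandardScaler`. A small sketch on a made-up feature with one extreme value (`toy_feature` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy feature with one extreme value, invented for illustration
toy_feature = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median / interquartile-range scaling (default quantile_range=(25, 75))
robust = RobustScaler().fit_transform(toy_feature)

# Mean / standard-deviation scaling, for comparison
standard = StandardScaler().fit_transform(toy_feature)

print(robust.ravel())
print(standard.ravel())
```

In the robust version the four ordinary values stay nicely spread out around 0, whereas with `StandardScaler` the outlier inflates the standard deviation and squeezes them close together.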

### Preprocessing with Scikit-Learn

Further preprocessing tools from `sklearn.preprocessing` could also be applied to our dataset. For example,
- Binarization: `Binarizer()`

    A common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.
    

- Imputing Missing Values: `SimpleImputer()` (from `sklearn.impute`)

    Address missing values in a column by imputing the mean, median, mode or a constant value.
    

- Generating Polynomial Features: `PolynomialFeatures()`

    Generate a new feature matrix consisting of all polynomial combinations of the features. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]. This can help the model to learn more complex relationships.
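The three tools listed above can be tried out on tiny arrays. The sketch below uses made-up values (`toy`, `imputed`, `binary` and `poly` are illustrative names) and imports the imputer from `sklearn.impute`, where `SimpleImputer` lives in current scikit-learn versions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Binarizer, PolynomialFeatures

# Toy matrix with one missing value, invented for illustration
toy = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])

# Replace the missing value by the mean of its column
imputed = SimpleImputer(strategy="mean").fit_transform(toy)

# Values strictly greater than the threshold become 1, all others 0
binary = Binarizer(threshold=3.0).fit_transform(imputed)

# Degree-2 polynomial features of [a, b] = [2, 3]: [1, a, b, a^2, ab, b^2]
poly = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0, 3.0]]))

print(imputed)
print(binary)
print(poly)
```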
    

### Scikit-Learn transformers

The diagrams below are visualisations of a Scikit-Learn transformer.

`.fit()`

<img src="images/04_sklearn_advanced/transform.png" style="display: block;margin-left: auto;margin-right: auto;width: 600px"/>

`.fit_transform()`

<img src="images/04_sklearn_advanced/fit_transform.png" style="display: block;margin-left: auto;margin-right: auto;width: 600px"/>

## Pipelines
> **"If you aren’t using pipelines you’re probably doing [Scikit-Learn] wrong."** - [Andreas Muller, Core Developer of Scikit-learn ](https://towardsdatascience.com/want-to-truly-master-scikit-learn-2-essential-tips-from-the-official-developer-himself-dada6ff56b99)

<img src="images/04_sklearn_advanced/sklearn-pipe.png" style="display: block;margin-left: auto;margin-right: auto;width: 600px"/>

#### What is a pipeline?

Whilst the statement above was probably an exaggeration, pipelines are a great way to keep your code clean, consistent and mistake-free. 

Pipelines encapsulate all the preprocessing steps (feature selections, scaling, encoding of variables and so on), as well as the final model, into a single Scikit-Learn estimator. 

- Pipelines simplify and automate many steps in preprocessing and model training. 
- They give your workflow order and make it easier to read and understand. Later we will see how they can also be very useful during model optimization. 
- In addition, by including preprocessing as part of our model pipeline we can **avoid information leaks**.

#### Getting started with a simple pipeline for preprocessing

An sklearn `Pipeline` simply requires us to specify a number of steps and what should happen at each of them. In our case we are going to add interaction terms for our features using `PolynomialFeatures()` and then scale the resulting data with `MinMaxScaler()` (we are not going to use `X_train_scaled` from the previous step).


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


preprocess_pipeline = Pipeline(steps=[
    ('PFeatures', PolynomialFeatures(interaction_only=True)),
    ('scaler', MinMaxScaler())])



In [None]:
preprocess_pipeline.fit_transform(X_train, y_train)

In [None]:
preprocess_pipeline.transform(X_test)

#### A simple ML pipeline

We can further extend this pipeline by adding a last step containing our machine learning model.

We shall demonstrate this with an SVM classifier.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

titanic_pipeline = Pipeline(steps=[
    ('PFeatures', PolynomialFeatures(2)),
    ('scaler', MinMaxScaler()),
    ('model', SVC(kernel='linear'))])

titanic_pipeline.fit(X_train, y_train)
y_pred = titanic_pipeline.predict(X_test)
print("accuracy: ", round(accuracy_score(y_test, y_pred),2))

The beauty of sklearn pipelines is that they learn how to scale the data from the `X_train` features only and then automatically apply the same scaling when we get predictions for `X_test`. In practice this saves us some preprocessing effort and removes an unnecessary source of mistakes.

The diagram below is a visualisation of a Scikit-Learn Pipeline.
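This behaviour can be inspected through the pipeline's `named_steps` attribute. The following is a minimal sketch on made-up, one-dimensional data (all `toy_*` names and values are assumptions for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Toy data, invented for illustration: small values are class 0, large values class 1
toy_X_train = np.array([[0.0], [1.0], [2.0], [3.0], [8.0], [9.0], [10.0], [11.0]])
toy_y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_X_test = np.array([[20.0]])

toy_pipeline = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("model", SVC(kernel="linear")),
])
toy_pipeline.fit(toy_X_train, toy_y_train)

# The scaler inside the pipeline was fitted on the training data only
fitted_scaler = toy_pipeline.named_steps["scaler"]
print(fitted_scaler.data_min_, fitted_scaler.data_max_)

# predict() first applies the fitted scaler, then the model
toy_pred = toy_pipeline.predict(toy_X_test)
print(toy_pred)
```

The test value 20 lies outside the range seen during training, yet the scaler's minimum and maximum remain those of the training data.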

<img src="images/04_sklearn_advanced/pipe-transform.png" style="display: block;margin-left: auto;margin-right: auto;width: 600px"/>

Once a pipeline has been specified, model training is not much different from training a regular model. 

This also means we can easily evaluate whether certain preprocessing settings help our model's performance.

## Validation

When training our models, we should reserve a separate set of data for comparing model performance and picking hyperparameters, e.g. should I use order 2 polynomial features or order 3?

The test set should only be used to give a final evaluation of the model.


 <img src="images/04_sklearn_advanced/validation.png" style="display: block;margin-left: auto;margin-right: auto;height: 100px"/>
 
Otherwise information from the test set leaks into our training process invalidating our metrics.
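One common way to obtain such a three-way split is to call `train_test_split` twice. The sketch below uses made-up data and a 60/20/20 split; the proportions and all `*_toy` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 100 samples, 2 features, balanced labels
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([0, 1] * 50)

# First hold out the test set (20% of all data)
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

# Then split the remainder into train and validation (25% of 80% = 20% of all data)
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))
```

Models are then compared on the validation set, and the test set is touched only once at the very end.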

## Cross-Validation

If we only have a small dataset, we may not have enough data to create a separate validation set. In addition, our validation results would still depend on the particular train/validation split we happened to pick. Cross-validation is used to address these issues.

 <img src="images/04_sklearn_advanced/crossvalidation.png" style="display: block;margin-left: auto;margin-right: auto;height: 300px"/>

K-fold cross validation splits can be acquired using `KFold()` from `sklearn.model_selection`. Then, for each split, the model can be fit and relevant metrics can be computed.
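The manual approach can be sketched as follows. The data comes from `make_classification` and the model is a plain `LogisticRegression`; both are stand-ins chosen for illustration rather than the Titanic pipeline used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic data, generated for illustration only
toy_X, toy_y = make_classification(n_samples=100, n_features=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, val_idx in kfold.split(toy_X):
    # Fit a fresh model on the training part of this fold
    fold_model = LogisticRegression()
    fold_model.fit(toy_X[train_idx], toy_y[train_idx])

    # Evaluate it on the held-out part of this fold
    fold_pred = fold_model.predict(toy_X[val_idx])
    manual_scores.append(accuracy_score(toy_y[val_idx], fold_pred))

print(np.round(manual_scores, 3))
print(round(float(np.mean(manual_scores)), 3))
```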

However, it is often more practical to use `cross_val_score()` from `sklearn.model_selection`. It returns an array of cross-validated scores for a model/pipeline on the data.

Let's see how our current model performs on the different validation splits:

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(titanic_pipeline,
                         X_train,
                         y_train,
                         cv = 5,
                         scoring = 'accuracy'
                        )

print(f'Accuracy per fold: {scores.round(3)}')
print(f'Accuracy mean over folds: {scores.mean().round(3)}')

## Parameter tuning

To find the best parameters, we often want to check how each combination of parameters performs. This process is known as a grid search.

We could manually collect the cross-validation scores for each parameter combination. However, in Scikit-Learn this can be implemented directly.

### Grid Search

`GridSearchCV()` from `sklearn.model_selection` allows tuning of model's parameters using cross-validated grid search. 

It requires a dictionary of parameters and the options it should compare. This does not only have to be model parameters, but parts of preprocessing too. We simply need to reference the correct pipeline step using 'step_name__parameter_name' as one of the dictionary keys (note the double underscore between the step and parameter name). In the end, the best set of parameters will be selected based on the provided metric.

In [None]:
titanic_pipeline.get_params()

Pay attention to the way the params dictionary is written:
- A key is a string made up of the name of the pipeline step + 2 underscores + the specific parameter to search, for example `PFeatures__degree`. Using only the step name as the key (e.g. `scaler`) swaps out the whole step.
- A value is a list/tuple of all the parameter values you want to search

This allows us to search possible parameters for any of the steps in the initial Pipeline (e.g. preprocessing too!) 

Note that with large datasets, many parameters and many cross validation folds, `GridSearchCV()` may take considerable time.
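A quick way to estimate the cost up front is to count the combinations with `ParameterGrid` from `sklearn.model_selection`. In this sketch the scaler options are replaced by placeholder strings, and the fold count is an assumption chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid; the strings are placeholders standing in for scaler objects
toy_grid = {
    "PFeatures__degree": [1, 2],
    "scaler": ["minmax", "standard", "robust"],
}

n_combinations = len(ParameterGrid(toy_grid))
n_folds = 5

# One fit per combination and fold, plus one final refit (refit=True is the default)
n_fits = n_combinations * n_folds + 1

print(n_combinations, n_fits)
```

Because the number of combinations is the product of the number of options per parameter, adding a single extra parameter with a handful of values can multiply the runtime several times over.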

In [None]:
from sklearn.model_selection import GridSearchCV

#providing the list of parameters
parameters_svm = {'PFeatures__degree': (1,2),
                 'scaler': [MinMaxScaler(),StandardScaler(),RobustScaler() ]}

#running grid search 
clf_svm = GridSearchCV(titanic_pipeline, parameters_svm, cv=5, scoring='accuracy')
clf_svm = clf_svm.fit(X_train, y_train)

print(f'CV accuracy score of the best SVM is: {clf_svm.best_score_:.3f}')
print(f'Best parameters were: {clf_svm.best_params_}')
#parameter importance

In addition to being easier to implement than a manual grid search cross-validation, `GridSearchCV()` helps us to avoid __information leaks__.

## Results

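Once we've established the best parameters, the model should be retrained with them on the *whole* training set. `GridSearchCV()` does this automatically when `refit=True` (the default) and exposes the refitted model as `best_estimator_`:

In [None]:
# With refit=True (the default), GridSearchCV has already refit the best
# parameter combination on the whole training set.
best_model = clf_svm.best_estimator_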

In [None]:
print(f'Accuracy on the test set: {best_model.score(X_test, y_test):.3f}')

It is however usually a good idea to not just blindly take the "best" model, but manually compare different models from the Grid search. Higher accuracy often comes at the price of over-fitting!

In [None]:
cv_results = pd.DataFrame(clf_svm.cv_results_)
cv_results.sort_values('rank_test_score').head(6)

After you have finished, it will likely be helpful to save your model for later.

In [None]:
# from joblib import dump, load
# # Save the model
# dump(best_model, 'best_model.joblib') 

# Load a saved model
#best_model = load('best_model.joblib') 
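A runnable round trip might look like the sketch below. It uses a small stand-in model and a temporary directory so that nothing is left on disk; the file name `toy_model.joblib` and the toy data are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Small stand-in model, trained on invented data for illustration
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")
    dump(toy_model, model_path)     # save the fitted model
    restored_model = load(model_path)  # load it back

# The restored model should give the same predictions as the original
same_predictions = np.array_equal(
    toy_model.predict(toy_X), restored_model.predict(toy_X)
)
print(same_predictions)
```

Saved models are best loaded with the same scikit-learn version that created them, and joblib files should only be loaded from trusted sources since loading can execute arbitrary code.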

## Challenge

1. Create a new ML pipeline with (somewhat) different preprocessing steps and a different ML model
2. Get the cross-validated F1 score (or accuracy) for that pipeline
3. Play around with Grid search (change some other parameters in the param_dict) 

Can you create a model better than this current one? What were your reasons for picking the ML model you chose?

## Conclusions

Scikit-Learn can be a very powerful tool to deal with machine learning problems including data splitting, preprocessing, model training and model selection. Its simple interface and detailed documentation allow it to be used even by users with little experience.

* A huge variety of implemented models with sensible defaults, such as KMeans, SVC or RandomForest
* Supports both pandas and numpy as data inputs
* Data preprocessing techniques such as `StandardScaler`, `OneHotEncoder`, etc. 
* Validation tools such as `train_test_split` or cross-validation
* Pipelines to evaluate not only your model and its hyperparameters, but also your preprocessing steps.
* `GridSearchCV` for parameter tuning
* Consistent implementation and API which makes it easy to extend with your own implemented building blocks (scorers, preprocessing techniques and even models)
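As a sketch of the last point, a custom preprocessing step only needs `.fit()` and `.transform()` methods to slot into a `Pipeline`. The class `Log1pTransformer` below is hypothetical and written for illustration; it is not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stateless transformer that applies log(1 + x)."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Toy data, invented for illustration
toy_X = np.array([[0.0], [9.0], [99.0]])

toy_pipe = Pipeline(steps=[
    ("log", Log1pTransformer()),
    ("scaler", StandardScaler()),
])
toy_out = toy_pipe.fit_transform(toy_X)
print(toy_out)
```

`TransformerMixin` supplies `.fit_transform()` for free, and `BaseEstimator` provides `get_params()`/`set_params()` so the step can also take part in a grid search.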

Caveat: scikit-learn lacks tools to work with deep learning. For this purpose you would rather use deep learning libraries such as PyTorch or TensorFlow/Keras.