In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Pipelines: Palmer Penguins continued

![](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png) 

## Penguin classification
In this notebook, we will continue improving on the model we made to classify the penguin species. We will do this by using some of the transformers available to us within the scikit-learn API. We will cover the following aspects:

1. Loading the data
2. Preparing the data for sklearn
3. Model creation & evaluation
4. Data pre-processing
5. Pipelines

## 1. Loading our data

We load the data from the data folder.

In [None]:
penguins = pd.read_csv('data/penguins.csv')
penguins.head()

## 2. Preparing the data for scikit-learn

Here we drop the rows that contain missing values; the required fields are selected in the next step.

In [None]:
penguins = penguins.dropna()
penguins.head()

Using our knowledge of Pandas we create our feature matrix *X* and target vector *y*.

In [None]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

X = penguins.loc[:, feature_columns]
y = penguins.loc[:, 'species']

Now we split the data into train & test.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
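
The split above is random, so the scores further down will vary from run to run. As a minimal sketch, assuming a made-up toy frame rather than the penguins data, `random_state` makes the split reproducible and `stratify` keeps the class proportions similar in both parts. The names `toy`, `X_toy` and `y_toy` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the penguins data: 20 made-up rows with imbalanced classes
toy = pd.DataFrame({
    "flipper_length_mm": range(180, 200),
    "species": ["Adelie"] * 10 + ["Gentoo"] * 6 + ["Chinstrap"] * 4,
})
X_toy = toy[["flipper_length_mm"]]
y_toy = toy["species"]

# random_state fixes the shuffle; stratify keeps the class proportions
# the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Whether to stratify is a judgment call; it tends to matter most when some classes are rare.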

## 3. Model creation and evaluation

Now we're ready to create our machine learning model! 

Scikit-learn has a rich collection of algorithms readily available. Depending on the case you are working on, scikit-learn most likely has a model that will suit your purposes. 

#### Scikit-Learn API usage steps when training a model
1. Choosing a model class and importing that model 
2. Choosing the model hyperparameters by instantiating this class with desired values.
3. Fitting the model to the preprocessed training data by calling the `fit()` method of the model instance.
4. Evaluating the model's performance using the available metrics.

In [None]:
# Step 1: import the chosen algorithm 
from sklearn.tree import DecisionTreeClassifier

# Step 2: instantiate the model with the chosen hyperparameters
model = DecisionTreeClassifier(max_depth=2)

# Step 3: train the model with the training data
model.fit(X_train, y_train)

We have now trained a model that can be used to make predictions on new data. Remember our test set? That's new, unseen data to the model that we can now create predictions on. 

In [None]:
y_pred = model.predict(X_test)
y_pred[0:10]

Scikit-learn provides ready-made implementations of common evaluation metrics, such as accuracy.

$\text{accuracy} = \frac{\text{correct}}{\text{total}}$

In [None]:
from sklearn.metrics import accuracy_score

# Metric functions take the true labels first, then the predictions
accuracy_score(y_test, y_pred)

Pretty good! 

But accuracy is not the only metric you could be interested in. Alternatives are, for example, _precision_ and _recall_. 

* _Precision_ is the proportion of positive identifications that were actually correct. 
* _Recall_ is the proportion of actual positives that were identified correctly.
* _F1 score_ is a function of precision and recall, that you use when you seek a balance between precision and recall. 

Precision, recall and F1 are also all available with scikit-learn.
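
To make these definitions concrete, here is a small sketch that computes the three metrics by hand and compares them with scikit-learn. The labels are made up for a binary "is this penguin a Gentoo?" question, and `y_true` and `y_hat` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = Gentoo, 0 = not Gentoo
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = np.sum((y_hat == 1) & (y_true == 1))  # true positives
fp = np.sum((y_hat == 1) & (y_true == 0))  # false positives
fn = np.sum((y_hat == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn expects the true labels first, then the predictions
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```

Swapping the two arguments would swap precision and recall, so the argument order matters for these metrics.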

In [None]:
from sklearn.metrics import classification_report

# True labels come first; swapping the arguments would swap precision and recall
report = classification_report(y_test, y_pred)
print(report)

--- 

# Pre-processing the data

A decision tree, the algorithm we used, works pretty well out of the box without requiring much pre-processing to the data, given that all the data is numeric. However, this is not the case for all machine learning algorithms. There are also some chunks of the data that we've omitted to allow for this model to be built.

### Encoding of categorical variables

We've initially omitted two possible features: _sex_ and _island_. Many of the popular machine learning algorithms require the data to be numeric. This will have to be taken care of before we can include these features.

In [None]:
penguins['sex'].unique()

In [None]:
penguins['island'].unique()

In [None]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'island']

X = penguins.loc[:, feature_columns]
y = penguins.loc[:, 'species']

print(f'The shape of feature matrix X is: {X.shape}')
print(f'The shape of target vector y is: {y.shape}')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [None]:
categorical_columns = ['sex', 'island']
X_train[categorical_columns].head()

Much like the machine learning estimators, in Scikit-Learn, the preprocessing algorithms are implemented as Python objects. They are referred to as _transformers_.

Once you have picked the transformer algorithm you will use, you instantiate it.

In [None]:
import sklearn

sklearn.__version__

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()

Transformers have a `.fit()` method implemented so that they can learn the parameters needed to transform data. They also have a `.transform()` method implemented that can apply the learned transformation to new data. A shortcut is `.fit_transform()`, which performs the transformation directly after the parameters have been learnt.

In [None]:
encoder.fit(X_train[categorical_columns])
output = encoder.transform(X_train[categorical_columns])
output[:5]

In [None]:
output = encoder.fit_transform(X_train[categorical_columns])
output[:5]
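
To see what "learning the parameters" means for an encoder, the sketch below fits an `OrdinalEncoder` on a made-up island column and inspects its `categories_` attribute. The frames `toy_train` and `toy_new` are hypothetical and only serve as an illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({"island": ["Torgersen", "Biscoe", "Dream", "Biscoe"]})
toy_new = pd.DataFrame({"island": ["Dream", "Torgersen"]})

toy_encoder = OrdinalEncoder()
toy_encoder.fit(toy_train)

# The learned categories are sorted; each value is mapped to its position
print(toy_encoder.categories_)

# New data is encoded with the mapping learned during fit
encoded = toy_encoder.transform(toy_new)
print(encoded)

# The mapping can also be reversed
print(toy_encoder.inverse_transform(encoded))
```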

For the feature matrix it would be more practical to encode all categorical features at once. One way of doing this could be to use the OrdinalEncoder, like we saw, which can ordinally encode multiple features at the same time.

However, we do not want to encode all of the features, only those that are categorical. Rather than doing this manually, we can use `ColumnTransformer` to achieve exactly that.

In [None]:
from sklearn.compose import ColumnTransformer 

ct = ColumnTransformer(
    [
    ("ordinal", OrdinalEncoder(handle_unknown='error'), categorical_columns)
    ], remainder="passthrough") 

# The output of fit_transform is no longer a pandas DataFrame, but a NumPy array.
X_train_encoded = ct.fit_transform(X_train)

print(X_train_encoded[0:5])
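
One detail that is easy to miss: the `ColumnTransformer` output puts the transformed columns first and appends the passthrough columns at the end, so the column order differs from the input. A minimal sketch on a made-up two-column frame (the name `toy` is hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Input order: numeric column first, categorical column second
toy = pd.DataFrame({
    "body_mass_g": [3750.0, 5000.0, 3800.0],
    "sex": ["MALE", "FEMALE", "FEMALE"],
})

toy_ct = ColumnTransformer(
    [("ordinal", OrdinalEncoder(), ["sex"])],
    remainder="passthrough",
)

# Output order: encoded sex first, then the passthrough body mass
toy_encoded = toy_ct.fit_transform(toy)
print(toy_encoded)
```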

By using the `.transform()` method on the ColumnTransformer we can ensure the test set is encoded in the same way as the train set.

In [None]:
X_test_encoded = ct.transform(X_test)

There is a potential issue with ordinally encoding categorical features: it assumes an inherent order between the categories.

Whilst this may be acceptable for some categorical features, e.g. medals in the Olympics, it does not make sense for categories such as penguin species or islands. Indeed, for a number of machine learning algorithms this can be detrimental.

An alternative approach is to use one-hot or dummy encoding. This type of encoding can be obtained with the `OneHotEncoder`, which transforms each categorical feature with $n$ possible values into $n$ binary features, with one of them 1, and all others 0.

*i.e. the species column would be transformed as follows*
```
Adelie     -> [1, 0, 0]
Chinstrap  -> [0, 1, 0]
Gentoo     -> [0, 0, 1]
```
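
The mapping above can be reproduced on a small made-up column. This is only a sketch; `toy` is a hypothetical frame, not the penguins data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"]})

toy_onehot = OneHotEncoder()

# The encoder returns a sparse matrix by default; .toarray() makes it dense
toy_encoded = toy_onehot.fit_transform(toy).toarray()

# One output column per learned category, in this order
print(toy_onehot.categories_[0])
print(toy_encoded)
```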

In [None]:
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    [
    ("onehot", OneHotEncoder(drop=None, handle_unknown='error'), categorical_columns)
    ], remainder="passthrough") 

X_train_encoded = ct.fit_transform(X_train)
X_test_encoded = ct.transform(X_test)

In [None]:
X_train_encoded.shape

It is also possible to encode a column with $n$ possible values into $n - 1$ columns, instead of $n$ columns, by using the `drop` parameter. This is useful to avoid collinearity in the input matrix for some classifiers. 

*i.e. when dropping the first column, the species column would be transformed as follows*
```
Adelie     -> [0, 0]
Chinstrap  -> [1, 0]
Gentoo     -> [0, 1]
```
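
To illustrate the collinearity point, the sketch below compares the rank of a design matrix (an intercept column plus the one-hot columns) with and without `drop="first"`. The data is made up, and adding an explicit intercept column is an assumption that mimics what a linear model does internally.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"species": ["Adelie", "Chinstrap", "Gentoo", "Adelie", "Gentoo"]})
intercept = np.ones((len(toy), 1))

full = OneHotEncoder().fit_transform(toy).toarray()
reduced = OneHotEncoder(drop="first").fit_transform(toy).toarray()

# With all n columns, the one-hot columns sum to the intercept column,
# so the matrix has 4 columns but only rank 3
design_full = np.hstack([intercept, full])
rank_full = np.linalg.matrix_rank(design_full)
print(design_full.shape[1], rank_full)

# After dropping one column, the 3 remaining columns are linearly independent
design_reduced = np.hstack([intercept, reduced])
rank_reduced = np.linalg.matrix_rank(design_reduced)
print(design_reduced.shape[1], rank_reduced)
```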

In [None]:
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    [
    ("onehot", OneHotEncoder(drop="first", handle_unknown='error'), categorical_columns)
    ], remainder="passthrough") 

# The output of fit_transform is no longer a pandas DataFrame, but a NumPy array.
X_train_encoded = ct.fit_transform(X_train)

X_test_encoded = ct.transform(X_test)

In [None]:
print("number of features before:", X_train.shape[1])
print("number of features after:", X_train_encoded.shape[1])

Now we're ready to plug in our new, extended data with the two additional features into our decision tree classifier. 

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train_encoded, y_train)
y_pred = model.predict(X_test_encoded)
accuracy_score(y_test, y_pred)

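And the score has improved! Bear in mind that this tree was also trained without the `max_depth=2` limit and on a fresh random split, so the extra features are not the only thing that changed.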

### Scaling of the data

Dealing with missing values and converting the categorical variables to numeric values was sufficient preprocessing for the decision tree. However, you might want to use another machine learning algorithm. 


Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume, or - such as in this case - measurements in grams and millimeters. Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

model.fit(X_train_encoded, y_train)
y_pred = model.predict(X_test_encoded)
report = classification_report(y_test, y_pred)
print(f'Model accuracy: {model.score(X_test_encoded, y_test)}')
print(report)

k-Nearest Neighbors, for instance, does not seem to perform very well on the unscaled data.
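
One way to see why distance-based models are sensitive to scale: which neighbour counts as "nearest" can depend purely on the units. The three penguins below are made up, given as (flipper length in mm, body mass), and the sketch only changes the mass unit from grams to kilograms.

```python
import numpy as np

# Made-up penguins: (flipper_length_mm, body_mass_g)
a = np.array([180.0, 3700.0])
b = np.array([230.0, 3750.0])  # very different flipper, similar mass
c = np.array([181.0, 4300.0])  # similar flipper, different mass

# In grams, the mass difference dominates the Euclidean distance
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
print(dist_ab < dist_ac)  # b is the nearest neighbour of a

# The same penguins with mass expressed in kilograms
to_kg = np.array([1.0, 0.001])
dist_ab_kg = np.linalg.norm((a - b) * to_kg)
dist_ac_kg = np.linalg.norm((a - c) * to_kg)
print(dist_ac_kg < dist_ab_kg)  # now c is the nearest neighbour of a
```

Scaling every feature to a comparable range removes this dependence on arbitrary units.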

The `StandardScaler` is a transformer that transforms features by removing the mean and scaling to unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_train_scaled[0:5]

In [None]:
X_test_scaled = scaler.transform(X_test_encoded)
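
As a sanity check on what the `StandardScaler` does, the sketch below reproduces it by hand on made-up numbers. It also shows that the test rows are scaled with the mean and standard deviation learned from the training data, not their own. The arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (flipper_length_mm, body_mass_g) rows
train = np.array([[180.0, 3700.0], [190.0, 3900.0], [200.0, 4100.0], [210.0, 4300.0]])
test = np.array([[195.0, 4000.0]])

toy_scaler = StandardScaler().fit(train)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of NumPy's std
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(manual, toy_scaler.transform(train)))

# The test row equals the training mean, so it is mapped to zeros
test_scaled = toy_scaler.transform(test)
print(test_scaled)
```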

Our data has now been scaled and we can retry the k-nearest neighbors classifier. 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
report = classification_report(y_test, y_pred)
print(f'Model accuracy: {model.score(X_test_scaled, y_test)}')
print(report)

A huge improvement! 

## Pipelines

It was a bit cumbersome to manually make sure we use the transformers appropriately, on the right dataset, and run the classifier. In addition to scaling and encoding, you might have other preprocessing steps as well, such as imputing your missing values, creating polynomial features, and so on. 


Pipelines allow us to encapsulate all the preprocessing steps (imputing, scaling, encoding of variables and so on), as well as the final model, into a single Scikit-Learn estimator. 

> **"If you aren’t using pipelines you’re probably doing [Scikit-Learn] wrong."** - [Andreas Muller, Core Developer of Scikit-learn ](https://towardsdatascience.com/want-to-truly-master-scikit-learn-2-essential-tips-from-the-official-developer-himself-dada6ff56b99)

#### Why pipelines?

Whilst the statement above is probably an exaggeration, pipelines are a great way to keep your code clean, consistent and less error-prone. 

- Pipelines simplify and automate many steps in preprocessing and model training. 
- They give your workflow order and make it easier to read and understand. Later we will see how they can also be very useful during model optimization. 
- In addition, by including preprocessing as part of our model pipeline we can **avoid information leaks**: the transformers are only ever fitted on the training data.
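
The leakage point becomes concrete with cross-validation. In the sketch below, which uses synthetic data from `make_classification` as a stand-in for the penguins, the whole pipeline is re-fitted on each training fold, so the scaler never sees the held-out rows. The names `X_demo`, `y_demo` and `demo_pipeline` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the penguins)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier()),
])

# For each of the 5 folds, the pipeline is cloned and fitted on the
# training part only, then scored on the held-out part
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(scores)
print(scores.mean())
```

Scaling the full dataset first and cross-validating only the classifier would let statistics from the held-out rows influence the training folds.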

To demonstrate this, we will repeat the preprocessing steps we tried on the penguins dataset, this time using a pipeline.

We will first redefine the column transformer as specified above.


In [None]:
from sklearn.compose import ColumnTransformer 

ct = ColumnTransformer(
    [
    ("onehot", OneHotEncoder(drop='first', handle_unknown='error'), categorical_columns)
    ], remainder="passthrough") 

Using Scikit-Learn's `Pipeline`, we can create our own transformer that encodes the categorical features and then scales the values.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocessor = Pipeline(steps=[
      ('encoding', ct),
      ('scaler', StandardScaler())
])

In [None]:
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
X_test_transformed[0:5]

Our pipeline now acts as a single transformer, much like the scaler and the encoder, that combines multiple steps. 
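
To check that a preprocessing pipeline really is just the individual steps chained together, the sketch below compares it with manual chaining on a made-up frame. The helper `make_encoder` and the frame `toy` are hypothetical names introduced for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 4675.0],
})

def make_encoder():
    # Fresh, unfitted column transformer for each approach
    return ColumnTransformer(
        [("onehot", OneHotEncoder(drop="first"), ["sex"])],
        remainder="passthrough",
    )

# Manual chaining: fit and apply each step in turn
encoded = make_encoder().fit_transform(toy)
manual = StandardScaler().fit_transform(encoded)

# The same two steps wrapped in a pipeline
toy_preprocessor = Pipeline(steps=[
    ("encoding", make_encoder()),
    ("scaler", StandardScaler()),
])
piped = toy_preprocessor.fit_transform(toy)

print(np.allclose(manual, piped))

# Fitted steps stay accessible by name
print(toy_preprocessor.named_steps["scaler"].mean_)
```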

We can also include the model in a pipeline, such that preprocessing and classification can be performed in one step.

In [None]:
penguins_pipeline = Pipeline(steps=[
    ('encoding', ct),
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier())
])

In [None]:
penguins_pipeline.fit(X_train, y_train)

The pipeline can be trained, much like a regular model.

In [None]:
y_pred  = penguins_pipeline.predict(X_test)
y_pred.shape

And we can even use the pipeline object to make predictions on new data. Notice how we do not have to process our `X_test` manually.

In [None]:
accuracy_score(y_test, y_pred)

Pipelines are an incredibly useful tool to organise all your preprocessing and modelling steps into one simple object. This not only helps prevent data leakage and removes the need to manually execute all the transformer steps on the right datasets, but also allows you to tune your hyperparameters as a combination of preprocessing steps & chosen algorithm.
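
As a rough sketch of that last point, a grid search can tune a preprocessing choice and a model hyperparameter together. Parameters of a step are addressed as `<step name>__<parameter>`, and a whole step can be swapped by using its name as the key. The data is synthetic, and the grid values are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (not the penguins)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier()),
])

param_grid = {
    # Swap the whole scaler step
    "scaler": [StandardScaler(), MinMaxScaler()],
    # Tune a hyperparameter of the model step
    "model__n_neighbors": [3, 5, 7],
}

# 2 scalers x 3 values = 6 candidates, each evaluated with 5-fold CV
search = GridSearchCV(demo_pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```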

### **Exercise**

Create a pipeline with the following things: 
* **Model**: choose a classifier.
* **Scaling**: try a different scaler, e.g. [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)  or [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html). When do you imagine you'd use these instead of the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)? 
* **Encoding**: encode the categorical variables.
* **Impute missing values**: we will introduce some missing values into the data. Replace them with the [Simple Imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).

Use `X_train_nan` to train your model and `X_test_nan` for testing.

In [None]:
def add_missing_values(X_orig, p=0.05): 
    '''
    Add missing values (NaN) randomly to the columns in the dataframe.
    
    X_orig: Pandas DataFrame. Feature matrix.
    p: float. Approximate fraction of values in each column to set to NaN. 
    '''
    X = X_orig.copy()
    for col in X.columns: 
        # Series.mask sets entries to NaN where the condition is True.
        # Unlike assigning np.nan into a NumPy string array (which stores
        # the text 'nan'), this also yields real NaNs in the text columns.
        X[col] = X[col].mask(np.random.rand(len(X)) < p)
    return X

X_train_nan = add_missing_values(X_train) 
X_test_nan = add_missing_values(X_test)

X_train_nan.head(10)

In [None]:
# Your code here. 


# Summary

Scikit-learn is an excellent, resourceful tool for machine learning in Python. We've seen how we can split a dataset with `train_test_split` into a train and test set, create and train a model, use the trained model to create predictions, and how to use the tools from `sklearn.metrics` to evaluate how good the model is. We have also seen preprocessing techniques like scaling and encoding your categorical variables, and the use of pipelines. 