**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.** 



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
DATA_HOME = "https://github.com/michalgregor/ml_notebooks/blob/main/data/{}?raw=1"

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(DATA_HOME.format("titanic.zip"), directory="data/titanic")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Preprocessing Data Using Pipelines

In the previous notebook we showed how to apply data preprocessing in a reproducible way. That method was correct, but rather laborious (and therefore error-prone). We will now introduce a more practical approach to reproducible preprocessing: **pipelines**, which are also provided by the `scikit-learn` package.

### Loading the Data

As the first step we will again load the [Titanic](https://www.kaggle.com/c/titanic) dataset and split it into training and testing data.



In [None]:
df = pd.read_csv("data/titanic/train.csv")
df_train, df_test = train_test_split(df, test_size=0.25,
                     stratify=df["Survived"], random_state=4)

In [None]:
df_train.head()

Let's recall what each column stands for:



In [None]:
with open("data/titanic/description", "r") as file:
    print(file.read())

### Column Selection

As we already know, our dataset has a number of columns. Some of them are categorical and others are numerical. As we saw in the previous notebook, we will want to apply slightly different kinds of preprocessing to each kind.

There are likely to be some columns that we will not want to use at all, because the information they contain is either not useful or we are not yet able to extract it. Column `PassengerId`, for instance, contains a unique numeric identifier for each record. It is probably not a good idea to use it as an input, because it does not contain any generalizable information. In our case the identifiers should have been assigned at random, so they should not carry any information content.

Columns `Name`, `Cabin` and others might be found to contain generalizable information, if we were able to extract it using suitable preprocessing (e.g. the names contain titles, which could carry generalizable information; also, the cabin number could indicate which part of the ship the cabin was in etc.). However, since we do not know how to do such preprocessing yet, we will simply drop such columns.

We will split the remaining columns into two groups based on whether they are numeric or categorical. Column `Survived` represents the desired output: we will not preprocess it along with the other columns, but by itself (also, it already takes values 0 and 1, so no actual preprocessing is even necessary).



In [None]:
categorical_inputs = ["Pclass", "Sex", "Embarked"]
numeric_inputs = ["Age", "SibSp", "Parch", "Fare"]

output = "Survived"
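
When deciding which columns to keep, a quick per-column summary can help. The following is only a sketch on a hypothetical toy table (the column names mimic the Titanic ones, but the values are made up, and `summary` and `id_like` are illustrative names). It lists each column's dtype, number of distinct values and number of missing values, and flags columns in which every row has a different value.

```python
import pandas as pd

# Hypothetical toy data: the column names mimic the Titanic ones,
# but the values are made up for illustration.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Sex": ["male", "female", "female", "male", "female"],
    "Fare": [7.25, 71.28, 7.25, None, 8.05],
})

# One row per column: its dtype, number of distinct values and missing values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "n_missing": toy.isnull().sum(),
})
print(summary)

# Columns in which every row has a different value behave like identifiers.
id_like = summary.index[summary["n_unique"] == len(toy)].tolist()
print(id_like)
```

This is only a hint, not a rule: a continuous numeric column can also have all-distinct values, so the final decision still needs to take the meaning of the column into account.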

### Constructing the Pipeline and Preprocessing the Data

Given that numeric columns need to be preprocessed differently from categorical columns, we will use the built-in `make_column_transformer` function, which allows us to specify different pipelines for different columns. The columns that we do not list at all will be dropped (this is the default `remainder="drop"` behaviour). To reproduce the preprocessing from the previous notebook using pipelines, we can use the following code:



In [None]:
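input_preproc = make_column_transformer(
    # categorical columns: fill gaps with the most frequent value,
    # then encode each category as an integer code
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    # numeric columns: fill gaps with the column mean (the default strategy),
    # then standardize to zero mean and unit variance
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)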
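
To see what such a column transformer produces, it may help to run the same construction on a tiny table. The sketch below uses hypothetical toy data (the columns `colour`, `height` and `label_id` are made up and are not part of the Titanic dataset). It shows that the unlisted column is dropped and that the output is a plain numeric array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data (not the Titanic dataset).
toy = pd.DataFrame({
    "colour": ["red", "blue", np.nan, "red"],  # categorical, one missing value
    "height": [1.0, 2.0, 3.0, np.nan],         # numeric, one missing value
    "label_id": [10, 11, 12, 13],              # not listed below, so it is dropped
})

toy_preproc = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"), OrdinalEncoder()),
     ["colour"]),
    (make_pipeline(SimpleImputer(), StandardScaler()),
     ["height"]),
)

toy_X = toy_preproc.fit_transform(toy)
print(toy_X.shape)  # one column per listed input: encoded colour, scaled height
print(toy_X)
```

The output columns follow the order in which the transformers were listed, not the column order of the original table.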

We will first use the `fit_transform` method to fit our new pipeline object and preprocess the training data in a single step. We will also extract the column of desired outputs from the dataset and make sure that it is a 1-dimensional array, since this is what our `KNeighborsClassifier` expects.



In [None]:
X_train = input_preproc.fit_transform(df_train[categorical_inputs+numeric_inputs])
Y_train = df_train[output].values.reshape(-1)

To preprocess the testing data we will use the same pipeline object.

**Let us keep in mind that we now use the `transform` method and not the `fit_transform` method, because we do not want to fit our pipeline again. We only want to transform the testing data in the same way as the training data.** 



In [None]:
X_test = input_preproc.transform(df_test[categorical_inputs+numeric_inputs])
Y_test = df_test[output].values.reshape(-1)
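
The difference between `transform` and `fit_transform` can be illustrated on a single `StandardScaler`. This is a minimal sketch with made-up numbers (the arrays `train` and `test` are hypothetical): it contrasts reusing the statistics learned on the training data with mistakenly refitting on the testing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-column data sets.
train = np.array([[0.0], [2.0], [4.0]])  # mean 2
test = np.array([[10.0], [12.0]])        # mean 11

scaler = StandardScaler()
scaler.fit(train)

# Correct: reuse the mean and scale learned on the training data.
test_ok = scaler.transform(test)

# Incorrect: a scaler fitted on the testing data uses different statistics,
# so the same raw value would be mapped to a different number.
test_refit = StandardScaler().fit_transform(test)

print(scaler.mean_)        # mean learned from the training data
print(test_ok.ravel())     # test values expressed relative to the training data
print(test_refit.ravel())  # test values centred on their own mean
```

With the refitted scaler, the testing data appear perfectly centred even though they lie far from the training data, which is exactly the information a model would need to see.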

### Training

Finally, everything is ready for training the model itself. We can again use the `KNeighborsClassifier`, which we already know from one of our previous notebooks.



In [None]:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, Y_train)
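
Since pipelines can end in a model, the preprocessing and the classifier can also be chained into a single object. This is not what the notebook does above; it is only a sketch of the option, using hypothetical toy data and illustrative names (`toy_model`, `toy_X_train`, `toy_Y_train`).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two well-separated clusters.
toy_X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                        [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
toy_Y_train = np.array([0, 0, 0, 1, 1, 1])

# The last step of the pipeline is the model; the earlier steps are transformers.
toy_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# fit() fits the scaler, transforms the data and then fits the classifier.
toy_model.fit(toy_X_train, toy_Y_train)

# predict() applies the already fitted scaler before asking the classifier.
toy_pred = toy_model.predict(np.array([[0.1, 0.1], [5.1, 5.1]]))
print(toy_pred)
```

With this arrangement there is no separate `transform` call to remember for the testing data, because `predict` runs the fitted preprocessing steps automatically.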

### Testing

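Next, we can test our model on the testing data. Note that `y_test` (lowercase) holds the model's predictions, whereas `Y_test` holds the actual labels.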



In [None]:
y_test = model.predict(X_test)

We will display the confusion matrix and compute the accuracy.



In [None]:
cm = pd.crosstab(Y_test, y_test,
                 rownames=['actual'],
                 colnames=['predicted'])
print(cm)

In [None]:
acc = accuracy_score(Y_test, y_test)
print("Accuracy = {}".format(acc))
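
The accuracy and the confusion matrix are closely related: the accuracy is the sum of the diagonal of the confusion matrix divided by the total number of samples. The sketch below checks this on hypothetical labels (the arrays `actual` and `predicted` are made up).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical actual and predicted labels.
actual = np.array([0, 0, 1, 1, 1, 0])
predicted = np.array([0, 1, 1, 1, 0, 0])

toy_cm = pd.crosstab(actual, predicted,
                     rownames=["actual"],
                     colnames=["predicted"])
print(toy_cm)

# Correct predictions lie on the diagonal of the confusion matrix.
toy_acc_from_cm = np.trace(toy_cm.values) / toy_cm.values.sum()
toy_acc = accuracy_score(actual, predicted)
print(toy_acc_from_cm, toy_acc)
```

Unlike the single accuracy number, the off-diagonal cells also show which kind of error the model makes more often.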

### Keeping Track of Which Values Were Missing

When considering columns with missing values – whether numeric or categorical – in addition to imputing the missing values, it can be a useful practice to keep track of which values were missing. Our imputation procedure may, for instance, systematically over- or underestimate the missing values. If our model knows which values were missing it can learn to compensate for that.

We can automatically identify the columns with missing values and apply the `MissingIndicator` transformer to them: this will produce new binary columns indicating whether a particular value was missing or not. Naturally, there is no guarantee that doing this will always improve the results – it may depend on the dataset and on the machine learning method.



In [None]:
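# columns of df_train that contain at least one missing value; this is computed
# over all columns, so it may include ones we otherwise drop (e.g. `Cabin`)
has_missing = df_train.isnull().any()
for_missing_tracking = has_missing[has_missing].keys()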
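
To make the effect of `MissingIndicator` concrete, here is a minimal sketch on a hypothetical toy table (the columns `age` and `fare` and their values are made up). It selects the columns with missing values in the same way as above and prints the resulting binary flags.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Hypothetical toy data: only `age` has missing values.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.3, 8.05, 53.1],
})

# Same idiom as above: keep the names of columns with at least one missing value.
toy_has_missing = toy.isnull().any()
toy_tracked = list(toy_has_missing[toy_has_missing].keys())

indicator = MissingIndicator()
flags = indicator.fit_transform(toy[toy_tracked])

print(toy_tracked)  # names of the tracked columns
print(flags)        # True where the original value was missing
```

The indicator does not fill in anything by itself; it only records where the gaps were, so it is used alongside an imputer rather than instead of one.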

In [None]:
tracking_input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs),
    
    # ---------------------
    (MissingIndicator(),
     for_missing_tracking)
    # ---------------------
)
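
A transformer of this kind has to be fitted on a table that contains every column it refers to, including the tracked ones. The sketch below shows the whole construction end to end on a hypothetical toy table (the columns `sex`, `age` and `fare` and all `toy_` names are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data with gaps in one categorical and one numeric column.
toy = pd.DataFrame({
    "sex": ["male", "female", np.nan, "female"],
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.3, 8.05, 53.1],
})
toy_categorical = ["sex"]
toy_numeric = ["age", "fare"]

# Columns with at least one missing value: here "sex" and "age".
toy_has_missing = toy.isnull().any()
toy_tracked = list(toy_has_missing[toy_has_missing].keys())

toy_tracking_preproc = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"), OrdinalEncoder()),
     toy_categorical),
    (make_pipeline(SimpleImputer(), StandardScaler()),
     toy_numeric),
    (MissingIndicator(), toy_tracked),
)

# Pass the whole table so that every referenced column is available.
toy_X = toy_tracking_preproc.fit_transform(toy)

# 1 encoded + 2 scaled + 2 indicator columns
print(toy_X.shape)
print(toy_X[:, -2:])  # the indicator columns, in the order of toy_tracked
```

The indicator columns are appended after the preprocessed inputs, so the rest of the workflow (fitting, `transform` on the testing data, training the model) stays the same, only with a wider input array.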