# Pipelines or how to preprocess your data in a clean and organized fashion

![](https://media1.faz.net/ppmedia/aktuell/83311481/1.1703919/default-retina/der-untergang-der-titanic-1912.jpg)

[Pipelines](https://scikit-learn.org/stable/modules/compose.html#pipeline) are a useful tool for going through a whole sequence of data processing and modeling steps in the right order and offer three main advantages:

- **Convenience and encapsulation** 

    You only have to call `.fit()` and `.predict()` once each to run the whole sequence of processing steps.
- **Joint hyperparameter tuning** 

    Grid search can tune the hyperparameters of every step in the pipeline at once.
- **Safety** 

    Pipelines help avoid leaking statistics from your test data into model training. 
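
A minimal sketch of the safety point, using made-up numbers rather than the Titanic data: the scaler inside a pipeline learns its mean only from the rows passed to `.fit()`, so the test rows cannot influence it. The toy arrays and the step names `scaler` and `model` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): one feature, binary target
X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train_toy = np.array([0, 0, 1, 1])
X_test_toy = np.array([[100.0], [200.0]])

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() learns the scaling statistics from the training rows only
toy_pipe.fit(X_train_toy, y_train_toy)

# The stored mean equals the training mean; the test rows played no part
learned_mean = toy_pipe.named_steps["scaler"].mean_[0]
print(learned_mean, X_train_toy.mean())

# predict() reuses the stored statistics to scale the test rows
toy_predictions = toy_pipe.predict(X_test_toy)
```

Scaling the full dataset before splitting would instead mix the test rows into the mean, which is exactly the kind of leak the pipeline prevents.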
    



We will show how to create and use a pipeline on the Titanic dataset, starting with loading the relevant packages and the data. Since you've already worked with this dataset, we will skip the data exploration part. This notebook focuses on how to build a pipeline that preprocesses this dataset effectively and how to tune the hyperparameters using grid search.

## Import of packages and dataset

In [2]:
# Import of relevant packages
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.pipeline import Pipeline # <--- new function
from sklearn.compose import ColumnTransformer # <--- new function
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, recall_score, precision_score

from sklearn.linear_model import LogisticRegression

from sklearn import set_config
set_config(transform_output="pandas")

# Set random seed 
RSEED = 42



warnings.filterwarnings("ignore")

In [3]:
# Loading the titanic dataset
df = pd.read_csv('data/titanic.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


We have now loaded the full dataset with all of its columns.

**Variable Description:**

|Variable|Definition   | Key  |  Type |
|---|---|---|---|
| Survived | Survival   |   0 = No, 1 = Yes | dichotomous | 
|Pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|ordinal|
|Sex|Sex||dichotomous|
|Age|Age|in years|ratio|
|SibSp|# of siblings / spouses aboard the Titanic|	|ratio|
|Parch|# of parents / children aboard the Titanic|  |ratio|
|Ticket|Ticket number||nominal|
|Fare|Passenger fare||ratio|
|Cabin|Cabin number||nominal|
|Embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|nominal|  

---
## Data

Before we begin to build our pipeline let's have a quick look at the data to refresh our memory.

In [4]:
# Getting an idea of the dimension
print('Number of rows and columns of train: ', df.shape)

Number of rows and columns of train:  (891, 12)


In [6]:
# Inspecting the type of features
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
# Having a look at some simple, descriptive statistics 
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [10]:
# How many unique entries do the features have?
df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [16]:
# Number of missing values per column
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
df.isnull().sum().to_frame()

In [23]:
# Checking for missing values: absolute counts (column 0) and share of all rows in percent
missing = df.isnull().sum().to_frame()
missing["Percentage %"] = (missing[0] / df.shape[0] * 100).round(2)
missing

Unnamed: 0,0,Percentage %
PassengerId,0,0.0
Survived,0,0.0
Pclass,0,0.0
Name,0,0.0
Sex,0,0.0
Age,177,19.87
SibSp,0,0.0
Parch,0,0.0
Ticket,0,0.0
Fare,0,0.0


There are 3 features with missing values:

* **Age**  
* **Cabin**
* **Embarked**

---
## Building a Preprocessing Pipeline

To simplify the modeling part we will concentrate on a few promising features and drop **PassengerId**, **Name**, **Cabin** and **Ticket**:
 * **PassengerId** does not contain helpful information.
 * **Cabin** has over 77% of its values missing.
 * **Name** and **Ticket** might contain helpful information, but it would have to be extracted via feature engineering.
 
 Feel free to play around with those: maybe you can create new features which will further improve your models. But for now we'll stick to the remaining ones. 

In [24]:
# Dropping the unnecessary columns 
df.drop(['PassengerId', 'Name', 'Cabin', 'Ticket'], axis=1, inplace=True)
df.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')

### Categorical vs numerical variables

Before we start building our pipeline we list the features we want to use for the modeling process. Since categorical and numerical features need to be preprocessed differently, we split the features into two lists: one for categorical and one for numerical features. 

In [25]:
# Change Pclass, Sex , Embarked to category pandas datatype 
df = df.astype({"Pclass":"category", "Sex":"category", "Embarked":"category"})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Survived  891 non-null    int64   
 1   Pclass    891 non-null    category
 2   Sex       891 non-null    category
 3   Age       714 non-null    float64 
 4   SibSp     891 non-null    int64   
 5   Parch     891 non-null    int64   
 6   Fare      891 non-null    float64 
 7   Embarked  889 non-null    category
dtypes: category(3), float64(2), int64(3)
memory usage: 37.9 KB


In [28]:
# categorical mask
cat_mask = df.dtypes == "category"
cat_mask

Survived    False
Pclass       True
Sex          True
Age         False
SibSp       False
Parch       False
Fare        False
Embarked     True
dtype: bool

In [31]:
df.columns[cat_mask].tolist()

['Pclass', 'Sex', 'Embarked']

In [32]:
## Creating list for categorical predictors/features 
# (dates are also objects so if you have them in your data you would deal with them first)
cat_features = df.columns[cat_mask].tolist()
cat_features

['Pclass', 'Sex', 'Embarked']

In [50]:
num_mask = df.dtypes != "category"
num_mask

Survived     True
Pclass      False
Sex         False
Age          True
SibSp        True
Parch        True
Fare         True
Embarked    False
dtype: bool

In [51]:
# Creating list for numerical predictors/features
# Since 'Survived' is our target variable we will remove this feature 
# from this list of numerical predictors 
num_features = df.columns[num_mask].tolist()
num_features.remove("Survived")
num_features

['Age', 'SibSp', 'Parch', 'Fare']
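
As an alternative to the boolean masks, the two lists can also be built with `DataFrame.select_dtypes`. The sketch below uses a small made-up frame (`toy`) rather than the Titanic data, so the column values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame (made up) with the same dtype mix as the Titanic columns
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": pd.Categorical(["male", "female", "female"]),
    "Age": [22.0, 38.0, None],
})

# select_dtypes picks columns by dtype, so no boolean mask is needed
toy_cat_features = toy.select_dtypes(include="category").columns.tolist()
toy_num_features = (
    toy.select_dtypes(include="number")
    .columns.drop("Survived")  # the target is not a predictor
    .tolist()
)
print(toy_cat_features, toy_num_features)
```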

In [36]:
# A boolean mask over the columns has to be applied along the column axis;
# df[num_mask] would try to filter rows instead and raise an IndexingError
df.loc[:, num_mask]

### Train-Test-Split

Let's split the data set into a training and test set. Using the training set and cross validation we will train our model and find the best hyperparameter combination. In the end the test set will be used for the final evaluation of our best model. 

In [53]:
# Define predictors X (features) and target variable y
X = df.drop('Survived', axis=1)
y = df['Survived']

print(f"We have {X.shape[0]} observations in our dataset and {X.shape[1]} features")
print(f"Our target vector has also {y.shape[0]} values")

We have 891 observations in our dataset and 7 features
Our target vector has also 891 values


In [54]:
# Split into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, stratify=y, random_state=RSEED)


In [39]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

X_train shape: (623, 7)
X_test shape: (268, 7)
y_train shape: (623,)
y_test shape: (268,)
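
The `stratify=y` argument used in the split keeps the share of each class roughly equal in both parts. A small sketch with a made-up target (30 positives among 100 rows; all names here are illustrative) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target (made up): 30 positives among 100 rows
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# With stratify, both parts keep close to the original share of positives
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of positives in a small test set can drift noticeably from one random split to the next.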


## Preprocessing Pipeline

![](images/sk_pipeline.png)

Building a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) always follows the same syntax. In our case we create one pipeline for our numerical features and one for our categorical features. 

Missing values of the numerical features are filled with the median of each feature; afterwards each feature is scaled using the StandardScaler.

Missing values of the categorical features are replaced with the most frequent value; afterwards all categorical features are encoded as a dummy/one-hot numeric array. 


Finally, both pipelines are combined into one transformer called "preprocessor" using [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) from scikit-learn.

In [41]:
#from sklearn.pipeline import Pipeline

# Pipeline for numerical features
# Initiating Pipeline and calling one step after another
# the steps are given as a list of (name, transform) tuples
# name is the name of the processing step
# transform is a transformer/estimator object (processing step)
num_pipeline = Pipeline([
    ("num_imputation",SimpleImputer(strategy="median")),
    ("num_scaler", StandardScaler())
])
num_pipeline

In [43]:
Pipeline([
    ("cat_imputation", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder())
])

In [61]:
# Pipeline for categorical features 
cat_pipeline = Pipeline([
    ("cat_imputation", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(drop="first",sparse_output=False))
])
cat_pipeline

In [62]:
from sklearn.compose import ColumnTransformer

# Complete pipeline for numerical and categorical features
# 'ColumnTransformer' applies transformers (num_pipeline/ cat_pipeline)
# to specific columns of an array or DataFrame (num_features/cat_features)
preprocessor = ColumnTransformer([
    ("num_processor", num_pipeline, num_features),
    ("cat_processor", cat_pipeline, cat_features)
])
preprocessor

In [63]:
preprocessor.fit_transform(X_train)

Unnamed: 0,num_processor__Age,num_processor__SibSp,num_processor__Parch,num_processor__Fare,cat_processor__Pclass_2,cat_processor__Pclass_3,cat_processor__Sex_male,cat_processor__Embarked_Q,cat_processor__Embarked_S
748,-0.831890,0.562957,-0.448665,0.465738,0.0,0.0,1.0,0.0,1.0
45,-0.064873,-0.474682,-0.448665,-0.478269,0.0,1.0,1.0,0.0,1.0
28,-0.064873,-0.474682,-0.448665,-0.481848,0.0,1.0,0.0,1.0,0.0
633,-0.064873,-0.474682,-0.448665,-0.646954,0.0,0.0,1.0,0.0,1.0
403,-0.141574,0.562957,-0.448665,-0.314823,0.0,1.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
476,0.318636,0.562957,-0.448665,-0.206906,1.0,0.0,1.0,0.0,1.0
190,0.165232,-0.474682,-0.448665,-0.374544,1.0,0.0,0.0,0.0,1.0
736,1.392460,0.562957,3.119650,0.073362,0.0,1.0,0.0,0.0,1.0
462,1.315758,-0.474682,-0.448665,0.159800,0.0,0.0,1.0,0.0,1.0
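
Once fitted, the preprocessor keeps everything it learned, and the fitted steps can be looked up by the names they were given. The following sketch rebuilds a reduced version of the preprocessor on a made-up four-row frame; the names `toy`, `num`, `cat`, `impute`, `scale` and `ohe` are assumptions for illustration and differ from the step names used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (made up) with one numerical and one categorical column
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 50.0],
    "Sex": ["male", "female", "female", "male"],
})

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["Age"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(drop="first")),
    ]), ["Sex"]),
])
toy_transformed = toy_preprocessor.fit_transform(toy)

# Fitted sub-pipelines are reachable by the names given above
num_fitted = toy_preprocessor.named_transformers_["num"]
learned_median = num_fitted.named_steps["impute"].statistics_[0]
print(learned_median)  # median of the observed ages 20, 30, 50

cat_fitted = toy_preprocessor.named_transformers_["cat"]
print(cat_fitted.named_steps["ohe"].categories_)
```

Inspecting `statistics_` and `categories_` like this is a quick way to check which values the imputation and the encoding will apply to new data.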


## Predictive Modelling using Pipelines and Grid Search

### Logistic Regression
Now that we have a preprocessing pipeline we can add a model on top (this sequence will also be handled by a Pipeline) and see how it performs using cross validation. 

In [65]:
# Building a full pipeline with our preprocessor and a LogisticRegression Classifier
pipe_logreg = Pipeline([
    ("feature_engi", preprocessor),
    ("log_regr", LogisticRegression(max_iter=1000,class_weight="balanced"))
])
pipe_logreg

In [66]:
# Making predictions on the training set using cross validation 
# cross_val_predict expects an estimator (model), X, y and nr of cv-splits (cv)
y_train_predicted = cross_val_predict(pipe_logreg, X_train, y_train, cv = 5, n_jobs=-1)

In [68]:
pipe_logreg.fit(X_train,y_train)

In [69]:
y_test_predict = pipe_logreg.predict(X_test)
y_test_predict

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 1])

In [67]:
# Calculating accuracy, recall and precision for the LogisticRegression Classifier 
print('Cross validation scores:')
print('-------------------------')
print("Accuracy: {:.2f}".format(accuracy_score(y_train, y_train_predicted)))
print("Recall: {:.2f}".format(recall_score(y_train, y_train_predicted)))
print("Precision: {:.2f}".format(precision_score(y_train, y_train_predicted)))

Cross validation scores:
-------------------------
Accuracy: 0.78
Recall: 0.75
Precision: 0.70
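
A possible alternative to `cross_val_predict` plus separate metric calls is `cross_validate`, which scores several metrics in one go. Note that it averages per-fold scores instead of scoring the pooled predictions, so the numbers can differ slightly. The sketch runs on synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`; the data and the names `syn_pipe` and `cv_results` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up), standing in for X_train / y_train
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=42)

syn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("log_regr", LogisticRegression(max_iter=1000)),
])

# The whole pipeline is refitted on each training fold, so the scaler
# never sees the rows of the fold it is evaluated on
cv_results = cross_validate(
    syn_pipe, X_syn, y_syn, cv=5,
    scoring=["accuracy", "recall", "precision"],
)
for metric in ["accuracy", "recall", "precision"]:
    print(metric, round(cv_results[f"test_{metric}"].mean(), 2))
```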


### Optimizing via Grid Search

In order to optimize our model we will use grid search. First we define the parameter space in which we want to search for the best parameter combination. Then we instantiate the grid search via GridSearchCV. The last step is to call its fit method with our training data as input. 

In [75]:
pipe_logreg.get_params()

{'memory': None,
 'steps': [('feature_engi',
   ColumnTransformer(transformers=[('num_processor',
                                    Pipeline(steps=[('num_imputation',
                                                     SimpleImputer(strategy='median')),
                                                    ('num_scaler',
                                                     StandardScaler())]),
                                    ['Age', 'SibSp', 'Parch', 'Fare']),
                                   ('cat_processor',
                                    Pipeline(steps=[('cat_imputation',
                                                     SimpleImputer(strategy='most_frequent')),
                                                    ('ohe',
                                                     OneHotEncoder(drop='first',
                                                                   sparse_output=False))]),
                                    ['Pclass', 'Sex', 'Embarked'])])),
  ('log

In [76]:
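# Defining parameter space for grid-search. Since we want to access the classifier step (called 'log_regr') in our pipeline 
# we have to add 'log_regr__' in front of the corresponding hyperparameters. 
# Note: the default 'lbfgs' solver does not support the 'l1' penalty, so those candidates fail to fit 
# and get a NaN score; to really compare 'l1', the solver would have to be e.g. 'liblinear' or 'saga'.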
param_logreg = {'log_regr__penalty':('l1','l2'),
                'log_regr__C': [0.001, 0.01, 0.1, 1, 10],
               }

grid_logreg = GridSearchCV(
    pipe_logreg,
    param_grid=param_logreg,
    scoring="accuracy",
    cv=5
)
grid_logreg

In [77]:
# we fit the grid_logreg on train data
grid_logreg.fit(X_train,y_train)

In [78]:
# Show best parameters
print('Best score:\n{:.2f}'.format(grid_logreg.best_score_))
print("Best parameters:\n{}".format(grid_logreg.best_params_))

Best score:
0.79
Best parameters:
{'log_regr__C': 0.01, 'log_regr__penalty': 'l2'}
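
The same double-underscore convention reaches into the preprocessing steps, which is what makes it possible to tune all hyperparameters of a pipeline at once. The sketch below reuses the step names from this notebook on a small made-up dataset (`toy_X`, `toy_y`) and additionally searches over the imputation strategy; the data and the chosen grid values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): one numerical feature with some missing values
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({"Age": rng.normal(30, 10, size=60)})
toy_X.iloc[::7, 0] = np.nan
toy_y = np.tile([0, 1], 30)

toy_pipe = Pipeline([
    ("feature_engi", ColumnTransformer([
        ("num_processor", Pipeline([
            ("num_imputation", SimpleImputer()),
            ("num_scaler", StandardScaler()),
        ]), ["Age"]),
    ])),
    ("log_regr", LogisticRegression(max_iter=1000)),
])

# Each level of nesting adds one '<name>__' prefix to the parameter name
toy_grid = {
    "feature_engi__num_processor__num_imputation__strategy": ["mean", "median"],
    "log_regr__C": [0.1, 1, 10],
}
toy_search = GridSearchCV(toy_pipe, param_grid=toy_grid, scoring="accuracy", cv=3)
toy_search.fit(toy_X, toy_y)
print(toy_search.best_params_)
```

The valid parameter names for any pipeline can be read off the keys of `get_params()`, as shown for `pipe_logreg` above.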


In [79]:
# Save best estimator (including fitted preprocessing steps) as best_model 
best_model = grid_logreg.best_estimator_
best_model

### Final Evaluation

We now have a tuned model. Let's see how it performs in the final evaluation on the test data. For this the test set has to be prepared in the same way as the training data. Thanks to our pipeline this is done in a blink, and because every preprocessing step was fitted on training data only, no statistics from the test set leaked into training.

When we saved the best model in the cell above, we saved not only the trained model but also the fitted preprocessing pipeline. Calling the `.predict` method on `best_model` therefore transforms the test data in the same way as the training data before predicting.

In [80]:
# Calculating the accuracy, recall and precision for the test set with the optimized model
y_test_predicted = best_model.predict(X_test)

print("Accuracy: {:.2f}".format(accuracy_score(y_test, y_test_predicted)))
print("Recall: {:.2f}".format(recall_score(y_test, y_test_predicted)))
print("Precision: {:.2f}".format(precision_score(y_test, y_test_predicted)))

Accuracy: 0.78
Recall: 0.72
Precision: 0.70


## Additional Information

### Customized Transformers

Sometimes you might want to transform your features in a very specific way that is not implemented in scikit-learn yet. In those cases you can create your own custom transformers. To work seamlessly with everything scikit-learn provides, the class needs the three methods `.fit()`, `.transform()` and `.fit_transform()`. You only have to write the first two yourself: `.fit_transform()` is provided when the class inherits from `TransformerMixin`.      
Two useful base classes on which you can construct your personal transformer can be imported with the following command:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
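
A minimal sketch of such a custom transformer, assuming we want a family-size feature built from `SibSp` and `Parch`. The class name `FamilySizeAdder` and the new column `FamilySize` are hypothetical and not part of scikit-learn or of this notebook's pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


# The mixin is listed first, which (as far as we know) is the order the
# scikit-learn developer guide recommends
class FamilySizeAdder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: adds FamilySize = SibSp + Parch + 1."""

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so calls can be chained
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        X["FamilySize"] = X["SibSp"] + X["Parch"] + 1
        return X


# Toy rows (made up) with the two columns the transformer relies on
toy = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2]})

# fit_transform() is inherited from TransformerMixin
toy_with_family = FamilySizeAdder().fit_transform(toy)
print(toy_with_family)
```

Because the class follows the fit/transform interface, an instance can be used as a step in a `Pipeline` like any built-in transformer.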

If you want to learn more about building your own transformers or pipelines in general, I recommend having a look at the following books:

* **Introduction to Machine Learning with Python by Müller and Guido (2017), Chapter 6**
* **Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Géron (2019), Chapter 2**