# Pipelines or how to preprocess your data in a blink

We will show how to create and use a pipeline on the Titanic dataset. We start by loading the relevant packages and the dataset. Since you've already worked with this dataset, we will skip the data exploration part. This notebook focuses on how to build a pipeline for effectively preprocessing this dataset and on tuning the hyperparameters using grid search.

## Import of packages and dataset

In [1]:
# Import of relevant packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate
from sklearn.metrics import roc_curve, confusion_matrix, accuracy_score, recall_score, precision_score

from sklearn.linear_model import LogisticRegression

# Set random seed 
RSEED = 42

In [2]:
# Loading the titanic dataset
df = pd.read_csv('titanic.csv')
df.head(2)

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1|0|3|Braund, Mr. Owen Harris|male|22.0|1|0|A/5 21171|7.25| |S|
|1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C|


**Variable Description:**

|Variable|Definition|Key|Type|
|---|---|---|---|
|Survived|Survival|0 = No, 1 = Yes|dichotomous|
|Pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|ordinal|
|Sex|Sex||dichotomous|
|Age|Age|in years|ratio|
|SibSp|# of siblings / spouses aboard the Titanic||ratio|
|Parch|# of parents / children aboard the Titanic||ratio|
|Ticket|Ticket number||nominal|
|Fare|Passenger fare||ratio|
|Cabin|Cabin number||nominal|
|Embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|nominal|

## Data

Before we begin to build our pipeline let's have a quick look at the data to refresh our memory.

In [3]:
# Getting an idea of the dimension (this is the full dataset, the train-test split comes later)
print('Number of rows and columns: ', df.shape)

Number of rows and columns:  (891, 12)


In [4]:
# Checking the tail of the dataset
df.tail(2)

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|889|890|1|1|Behr, Mr. Karl Howell|male|26.0|0|0|111369|30.0|C148|C|
|890|891|0|3|Dooley, Mr. Patrick|male|32.0|0|0|370376|7.75| |Q|


In [5]:
# Inspecting the type of features
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
# How many unique entries do the features have?
df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [7]:
# Checking for missing values
missing = pd.DataFrame(df.isnull().sum(), columns=["Amount"])
missing['Percentage'] = round((missing['Amount']/df.shape[0])*100, 2)
missing[missing['Amount'] != 0]

| |Amount|Percentage|
|---|---|---|
|Age|177|19.87|
|Cabin|687|77.1|
|Embarked|2|0.22|


There are 3 features with missing values.

* **Age**  
* **Cabin**
* **Embarked**

In [8]:
# Having a look at some simple, descriptive statistics 
df.describe().round(2)

| |PassengerId|Survived|Pclass|Age|SibSp|Parch|Fare|
|---|---|---|---|---|---|---|---|
|count|891.0|891.0|891.0|714.0|891.0|891.0|891.0|
|mean|446.0|0.38|2.31|29.7|0.52|0.38|32.2|
|std|257.35|0.49|0.84|14.53|1.1|0.81|49.69|
|min|1.0|0.0|1.0|0.42|0.0|0.0|0.0|
|25%|223.5|0.0|2.0|20.12|0.0|0.0|7.91|
|50%|446.0|0.0|3.0|28.0|0.0|0.0|14.45|
|75%|668.5|1.0|3.0|38.0|1.0|0.0|31.0|
|max|891.0|1.0|3.0|80.0|8.0|6.0|512.33|


## Building a Preprocessing Pipeline

To simplify the modelling part we will concentrate on a few promising features. We will drop the features **PassengerId**, **Name**, **Cabin** and **Ticket**. **PassengerId** does not contain helpful information, and for **Cabin** over 77% of the values are missing. **Name** and **Ticket** might contain helpful information, but we would need to extract it via feature engineering. Feel free to play around with those: maybe you can create new features which will further improve your models. But for now we'll stick to the remaining ones.

Before we start building our pipeline, we create lists of the features we want to use for the modelling process. Since categorical and numerical features need to be preprocessed differently, we split the features into two lists: one for categorical and one for numerical features.

In [9]:
# Dropping the unnecessary columns 
df.drop(['PassengerId', 'Name', 'Cabin', 'Ticket'], axis=1, inplace=True)
df.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')

In [10]:
# Creating list for categorical predictors/features 
cat_features = list(df.columns[df.dtypes==object])
cat_features

['Sex', 'Embarked']

In [11]:
# Creating list for numerical predictors/features
# Since 'Survived' is our target variable we will exclude this feature from this list of numerical predictors 
num_features = list(df.columns[df.dtypes!=object])
num_features.remove('Survived')
num_features

['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

### Train-Test-Split

Let's split the data set into a training and test set. Using the training set and cross validation we will train our model and find the best hyperparameter combination. In the end the test set will be used for the final evaluation of our best model. 

In [12]:
# Define predictors and target variable
X = df.drop('Survived', axis=1)
y = df['Survived']
print(X.shape)
print(y.shape)

(891, 7)
(891,)


In [13]:
# Split into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RSEED)

In [14]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

X_train shape: (712, 7)
X_test shape: (179, 7)
y_train shape: (712,)
y_test shape: (179,)
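
The split above is purely random. As an optional variation (not used in this notebook), `train_test_split` accepts a `stratify` argument that keeps the class proportions of the target the same in both subsets. The following is a minimal sketch on made-up toy data; the names `X_toy` and `y_toy` are placeholders and not part of the Titanic workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 60 samples of class 0, 40 samples of class 1 (made-up data)
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

Whether stratification is worth it depends on how imbalanced the target is; with roughly 38% survivors in our data a random split is usually acceptable.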


### Preprocessing Pipeline

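Building a pipeline always follows the same syntax: a list of `(name, transformer)` steps that are applied in order. In our case we create one pipeline for our numerical features and one for our categorical features. In the end a `ColumnTransformer` called "preprocessor" applies each pipeline to its own columns and combines the results.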

In [15]:
from sklearn.pipeline import Pipeline

# Pipeline for numerical features
num_pipeline = Pipeline([
    ('imputer_num', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

# Pipeline for categorical features 
cat_pipeline = Pipeline([
    ('imputer_cat', SimpleImputer(strategy='constant', fill_value='missing')),
    ('1hot', OneHotEncoder(handle_unknown='ignore'))
])

In [16]:
from sklearn.compose import ColumnTransformer

# Complete pipeline
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])
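
To get a feeling for what such a preprocessor returns, it can help to run it on a handful of rows. The sketch below rebuilds the same two pipelines on a small made-up DataFrame (the values and the names `toy`, `toy_num`, `toy_cat` and `toy_preprocessor` are invented for illustration) and checks that the output has no missing values left.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that mimic some Titanic columns, including missing values
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})
toy_num = ["Age", "Fare"]
toy_cat = ["Sex", "Embarked"]

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer_num", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ]), toy_num),
    ("cat", Pipeline([
        ("imputer_cat", SimpleImputer(strategy="constant", fill_value="missing")),
        ("1hot", OneHotEncoder(handle_unknown="ignore")),
    ]), toy_cat),
])

toy_out = toy_preprocessor.fit_transform(toy)
# The result may be a SciPy sparse matrix; convert it to a dense array if so
toy_dense = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)

print(toy_dense.shape)             # 2 scaled columns + 2 for Sex + 3 for Embarked
print(np.isnan(toy_dense).any())   # no missing values should remain
```

Note that the missing `Embarked` value becomes its own category `'missing'`, which is why that column is encoded into three indicator columns instead of two.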

## Predictive Modelling using Pipelines and Grid Search

### Logistic Regression
Now that we have a preprocessing pipeline we can add a model on top and see how it performs using cross validation. 

In [17]:
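# Building a full pipeline with our preprocessor and a LogisticRegression Classifier
# solver='liblinear' supports both the 'l1' and the 'l2' penalty used in the grid search further down
# (the default 'lbfgs' solver of recent scikit-learn versions does not support 'l1')
pipe_logreg = Pipeline([
    ('preprocessor', preprocessor),
    ('logreg', LogisticRegression(solver='liblinear', max_iter=1000))
])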

In [18]:
# Making predictions on the training set using cross validation
y_train_predicted = cross_val_predict(pipe_logreg, X_train, y_train, cv=5)

In [19]:
# Calculating accuracy, recall and precision for the LogisticRegression Classifier
print('Cross validation scores:')
print('-------------------------')
print("Accuracy: {:.2f}".format(accuracy_score(y_train, y_train_predicted)))
print("Recall: {:.2f}".format(recall_score(y_train, y_train_predicted)))
print("Precision: {:.2f}".format(precision_score(y_train, y_train_predicted)))

Cross validation scores:
-------------------------
Accuracy: 0.79
Recall: 0.68
Precision: 0.75
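
An alternative to combining `cross_val_predict` with the metric functions is `cross_validate`, which scores several metrics per fold in one call. The sketch below uses synthetic stand-in data from `make_classification` and a simplified pipeline (`demo_pipe` is a placeholder name), so the numbers it prints are not comparable to the Titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_pipe = Pipeline([
    ("std_scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# One call returns a dict with one array of per-fold scores for each metric
cv_results = cross_validate(
    demo_pipe, X_demo, y_demo, cv=5,
    scoring=["accuracy", "recall", "precision"],
)

for metric in ["accuracy", "recall", "precision"]:
    fold_scores = cv_results["test_" + metric]
    print("{}: {:.2f} (+/- {:.2f})".format(metric, fold_scores.mean(), fold_scores.std()))
```

The averages of per-fold scores can differ slightly from metrics computed on the pooled out-of-fold predictions, but they additionally show how much the score varies between folds.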


### Optimizing via Grid Search

In order to optimize our model we will use grid search. First we define the parameter space in which we want to search for the best parameter combination. Then we set up the grid search with `GridSearchCV`. The last step is to call the `fit` method with our training data as input.

In [20]:
# Defining parameter space for grid search. Since we want to access the classifier step in our pipeline,
# we have to add the prefix 'logreg__' (step name plus two underscores) in front of the corresponding hyperparameters.
param_logreg = {'logreg__penalty':('l1','l2'),
                'logreg__C': [0.01, 0.1, 1, 10, 100]
               }

grid_logreg = GridSearchCV(pipe_logreg, param_grid=param_logreg, cv=3, scoring='accuracy', 
                           verbose=5, n_jobs=-1)
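
If you are unsure which parameter names a pipeline accepts, `get_params()` lists them all. The following sketch uses a simplified stand-in pipeline (`demo_pipe` is a placeholder name, not the pipeline from this notebook) to show the `<step name>__<parameter>` naming scheme.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ("std_scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Every tunable hyperparameter is listed as '<step name>__<parameter name>'
logreg_params = sorted(p for p in demo_pipe.get_params() if p.startswith("logreg__"))
print(logreg_params)

# The same names work with set_params
demo_pipe.set_params(logreg__C=0.1)
print(demo_pipe.named_steps["logreg"].C)
```

For nested steps the prefixes are simply chained, so a parameter of a transformer inside a `ColumnTransformer` gets one prefix per level.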

In [21]:
grid_logreg.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    5.4s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    5.4s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('preprocessor',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('num',
                                                                         Pipeline(memory=None,
                                                                                  steps=[('imputer_num',
                                                                                          SimpleImputer(add_indicator=False,
                                                                                                        copy=True,
                

In [22]:
# Show best parameters
print('Best score:\n{:.2f}'.format(grid_logreg.best_score_))
print("Best parameters:\n{}".format(grid_logreg.best_params_))

Best score:
0.80
Best parameters:
{'logreg__C': 0.1, 'logreg__penalty': 'l2'}


In [23]:
# Save the fitted LogisticRegression step of the best pipeline as best_model
best_model = grid_logreg.best_estimator_['logreg']

### Final Evaluation

Finally we have a good model. Let's see if it also passes the final evaluation on the test data. To do so, we have to prepare the test set in the same way as the training data. Thanks to our pipeline it's done in a blink. :)

In [24]:
# Preparing the test set: fit the preprocessor on the training data only, then transform the test set
preprocessor.fit(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

In [25]:
# Calculating the accuracy, recall and precision for the test set with the optimized model
y_test_predicted = best_model.predict(X_test_preprocessed)

print("Accuracy: {:.2f}".format(accuracy_score(y_test, y_test_predicted)))
print("Recall: {:.2f}".format(recall_score(y_test, y_test_predicted)))
print("Precision: {:.2f}".format(precision_score(y_test, y_test_predicted)))

Accuracy: 0.80
Recall: 0.73
Precision: 0.77
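
Because `GridSearchCV` refits the best pipeline on the whole training set by default (`refit=True`), the fitted pipeline in `best_estimator_` can also preprocess and predict raw test data in a single call. The sketch below compares both routes on synthetic stand-in data; all names starting with `demo_` and the simplified pipeline are assumptions for illustration, not the notebook's objects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
demo_grid = GridSearchCV(demo_pipe, param_grid={"logreg__C": [0.1, 1, 10]}, cv=3)
demo_grid.fit(X_tr, y_tr)

# Route 1: the refitted best pipeline preprocesses and predicts in one call
pred_pipeline = demo_grid.best_estimator_.predict(X_te)

# Route 2: transform manually, then call the extracted classifier
fitted_prep = demo_grid.best_estimator_["preprocessor"]
fitted_clf = demo_grid.best_estimator_["logreg"]
pred_manual = fitted_clf.predict(fitted_prep.transform(X_te))

print(np.array_equal(pred_pipeline, pred_manual))
```

Route 1 is less error-prone, since there is no way to forget a preprocessing step or to fit it on the wrong data.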


## Additional Information

### Customized Transformers

Sometimes you might want to transform your features in a very specific way that is not implemented in scikit-learn yet. In those cases you can create your very own custom transformers. In order to work seamlessly with everything scikit-learn provides, you need to create a class that offers the three methods `.fit()`, `.transform()` and `.fit_transform()`. If your class inherits from `TransformerMixin`, you only have to write `.fit()` and `.transform()` yourself, because the mixin supplies `.fit_transform()`. `BaseEstimator` adds `.get_params()` and `.set_params()`, which grid search relies on.      
Two useful base classes on which you can construct your personal transformer can be imported with the following command:

In [26]:
from sklearn.base import BaseEstimator, TransformerMixin
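
As an illustration, here is a minimal sketch of a custom transformer built on these two base classes. The class `FamilySizeAdder`, its `drop_originals` parameter and the `FamilySize` feature are hypothetical examples and not part of scikit-learn or of the pipeline above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FamilySizeAdder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: adds FamilySize = SibSp + Parch + 1."""

    def __init__(self, drop_originals=False):
        # Constructor arguments are stored unchanged so get_params/set_params work
        self.drop_originals = drop_originals

    def fit(self, X, y=None):
        # Nothing to learn from the data, so fit just returns self
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's DataFrame
        X["FamilySize"] = X["SibSp"] + X["Parch"] + 1
        if self.drop_originals:
            X = X.drop(columns=["SibSp", "Parch"])
        return X

toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})
# fit_transform comes from TransformerMixin; it calls fit and then transform
toy_out = FamilySizeAdder().fit_transform(toy)
print(toy_out["FamilySize"].tolist())  # → [2, 1, 6]
```

A transformer like this expects a pandas DataFrame as input, so it would have to come before any step that returns a NumPy array.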

If you want to learn more about building your own transformers or pipelines in general, I recommend having a look at the following books:

* **Introduction to Machine Learning with Python by Müller and Guido (2017), Chapter 6**
* **Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Géron (2019), Chapter 2**