# Preprocessing and Pipelines

In this module, you will learn how to apply some simple preprocessing steps to your data. We will also look at how scikit-learn's pipelines help prevent data leakage.

<b>Functions and attributes in this lecture: </b>
- `pandas:` - Pandas package with alias `pd`
  - `.mean()` - Get the mean value of a dataframe
  - `.replace()` - Replaces values in a series with new values
  - `.drop()` - Drop certain columns or rows
  - `.dropna()` - Drop the rows with missing values
  - `.fillna()` - Fill in the missing values with a specific value
- `sklearn.preprocessing` - Submodule for preprocessing data
  - `StandardScaler()` - Scale the data
    - `.fit()` - Training the scaler on the data
    - `.transform()` - Transforms data by scaling it
    - `.mean_` - Get the mean for the scaling
    - `.var_` - Get the variance for the scaling
- `sklearn.pipeline` - Submodule for assembling pipelines
  - `Pipeline()` - Basic constructor for setting up a pipeline
    - `.fit()` - Training the pipeline on the data
    - `.predict()` - Predict values on new data
    - `.score()` - Get the score determined by the last model in the pipeline
    - `.named_steps` - Get the components of the pipeline

In [1]:
# Non-sklearn packages
import numpy as np
import pandas as pd

# Sklearn packages
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Importing the Titanic Dataset

In this section we will import the famous Titanic dataset and clean some of the missing values in it!

In [2]:
# The titanic dataset is inside the seaborn package
from seaborn import load_dataset

# Load the Titanic data set
titanic = load_dataset("titanic")
titanic.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


In [3]:
# Checking summary data of the dataset
titanic.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [4]:
# Information about the columns
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [5]:
# Remove the "deck" feature
titanic.drop(columns="deck", inplace=True)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


In [6]:
# Fill in the mean age for those with missing age
mean_age = titanic["age"].mean()
titanic["age"] = titanic["age"].fillna(mean_age)
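
As a small illustration of the two missing-value strategies used in this section, the sketch below applies `.fillna()` and `.dropna()` to a made-up toy dataframe. The column names and values are hypothetical and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age and one missing town
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "town": ["A", "B", None],
})

# Fill the missing age with the mean of the observed ages (20 and 40)
toy_filled = toy.copy()
toy_filled["age"] = toy_filled["age"].fillna(toy_filled["age"].mean())

# Drop any rows that still contain a missing value (here: the missing town)
toy_clean = toy_filled.dropna()

print(toy_filled["age"].tolist())
print(len(toy_clean))
```

Filling keeps every row but replaces the unknown value with an estimate, while dropping keeps only fully observed rows.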

In [7]:
# Check that the value has been filled in
titanic.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True
5,0,3,male,29.699118,0,0,8.4583,Q,Third,man,True,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,Cherbourg,yes,False


In [8]:
# Drop the remaining two rows with missing values
titanic.dropna(inplace=True)

In [9]:
# Check that we have no more missing values
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     889 non-null    int64   
 1   pclass       889 non-null    int64   
 2   sex          889 non-null    object  
 3   age          889 non-null    float64 
 4   sibsp        889 non-null    int64   
 5   parch        889 non-null    int64   
 6   fare         889 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        889 non-null    category
 9   who          889 non-null    object  
 10  adult_male   889 non-null    bool    
 11  embark_town  889 non-null    object  
 12  alive        889 non-null    object  
 13  alone        889 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(5)
memory usage: 86.1+ KB


### Choosing Relevant Features

Not all the features you are presented with are necessarily useful for predicting the `survived` feature. We will now exclude some of the features and keep only those we believe affect the `survived` column significantly.

In [10]:
# Checking our dataset
titanic.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True
5,0,3,male,29.699118,0,0,8.4583,Q,Third,man,True,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,Cherbourg,yes,False


In [11]:
# Removing duplicate information
titanic.drop(columns=["embarked", "class", "who", "adult_male", "alive"], inplace=True)
titanic.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone
0,0,3,male,22.0,1,0,7.25,Southampton,False
1,1,1,female,38.0,1,0,71.2833,Cherbourg,False
2,1,3,female,26.0,0,0,7.925,Southampton,True
3,1,1,female,35.0,1,0,53.1,Southampton,False
4,0,3,male,35.0,0,0,8.05,Southampton,True
5,0,3,male,29.699118,0,0,8.4583,Queenstown,True
6,0,1,male,54.0,0,0,51.8625,Southampton,True
7,0,3,male,2.0,3,1,21.075,Southampton,False
8,1,3,female,27.0,0,2,11.1333,Southampton,False
9,1,2,female,14.0,1,0,30.0708,Cherbourg,False


In [12]:
# Encode the sex as 0 for female and 1 for male
titanic["ismale"] = titanic["sex"].replace({"female": 0, "male": 1})
titanic.drop(columns="sex", inplace=True)
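
The same kind of encoding can also be written with `Series.map`, which looks up each value in a dictionary. This is only a sketch on a hypothetical toy series, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy series standing in for the "sex" column
sex_toy = pd.Series(["female", "male", "male"])

# Map each category to an integer code: 0 for female, 1 for male
ismale_toy = sex_toy.map({"female": 0, "male": 1})

print(ismale_toy.tolist())
```

Note that `map` turns any value missing from the dictionary into `NaN`, so it is worth checking for unexpected categories first.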

In [13]:
# Can look at the correlation matrix of the numeric columns (embark_town is text, so it is not present!)
titanic.corr(numeric_only=True)

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,alone,ismale
survived,1.0,-0.335549,-0.074673,-0.03404,0.083151,0.25529,-0.206207,-0.541585
pclass,-0.335549,1.0,-0.327954,0.081656,0.016824,-0.548193,0.138553,0.127741
age,-0.074673,-0.327954,1.0,-0.231875,-0.178232,0.088604,0.177712,0.089434
sibsp,-0.03404,0.081656,-0.231875,1.0,0.414542,0.160887,-0.584186,-0.116348
parch,0.083151,0.016824,-0.178232,0.414542,1.0,0.217532,-0.583112,-0.247508
fare,0.25529,-0.548193,0.088604,0.160887,0.217532,1.0,-0.274079,-0.179958
alone,-0.206207,0.138553,0.177712,-0.584186,-0.583112,-0.274079,1.0,0.306985
ismale,-0.541585,0.127741,0.089434,-0.116348,-0.247508,-0.179958,0.306985,1.0


In [14]:
# Drop the low-correlation columns and the embark town column
titanic.drop(columns=["age", "sibsp", "parch", "embark_town"], inplace=True)

In [15]:
# Our dataset 
titanic.head(10)

Unnamed: 0,survived,pclass,fare,alone,ismale
0,0,3,7.25,False,1
1,1,1,71.2833,False,0
2,1,3,7.925,True,0
3,1,1,53.1,False,0
4,0,3,8.05,True,1
5,0,3,8.4583,True,1
6,0,1,51.8625,True,1
7,0,3,21.075,False,1
8,1,3,11.1333,False,0
9,1,2,30.0708,False,0


## Standardizing the Values

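It is useful to standardize the values before passing them into machine learning models. Standardizing means rescaling each feature to have mean 0 and variance 1. This matters little for some models, such as decision trees, but it is important for many others, for example models that rely on distances between points or on regularization, where features on very different scales (like `pclass` and `fare`) would otherwise be weighted unevenly.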
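
To make the idea concrete, the sketch below checks on a hypothetical toy array that `StandardScaler` computes z = (x - mean) / std for each column, where the mean and standard deviation are taken from the data the scaler was fitted on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

# Fit the scaler and transform the same data
scaler_toy = StandardScaler().fit(X_toy)
scaled = scaler_toy.transform(X_toy)

# Manual standardization, column by column (population standard deviation)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.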

In [16]:
# Dividing into training sets and testing sets
y = titanic["survived"]
X = titanic.drop(columns="survived")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [17]:
# Importing and initializing a standard scaler estimator
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [18]:
# Fitting the estimator on the training set
scaler.fit(X_train)

StandardScaler()

In [31]:
# Getting the mean and variance of the training set
print(scaler.mean_)
print(scaler.var_)

[ 2.32773109 32.76836067  0.59327731  0.6605042 ]
[6.87550314e-01 2.64921948e+03 2.41299343e-01 2.24238401e-01]


In [20]:
# Scaling the training set and testing set in the same way
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [21]:
# Checking the output
X_train_scaled

array([[-1.60124536,  0.37097199,  0.82798092,  0.71693438],
       [-0.39524411, -0.38407115,  0.82798092, -1.39482779],
       [-1.60124536,  0.95374775, -1.2077573 ,  0.71693438],
       ...,
       [-1.60124536, -0.13287517,  0.82798092, -1.39482779],
       [ 0.81075714,  0.03121472, -1.2077573 , -1.39482779],
       [ 0.81075714, -0.46850387,  0.82798092,  0.71693438]])

In [22]:
# Training a logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

LogisticRegression()

In [23]:
# Get the predictions and accuracy score
y_pred = log_reg.predict(X_test_scaled)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))

Accuracy Score:  0.7993197278911565


In [24]:
# This is not so good, since the data is unbalanced: always predicting 0 would already be right about 62% of the time!
titanic["survived"].value_counts()/len(titanic["survived"])

0    0.617548
1    0.382452
Name: survived, dtype: float64
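
One way to put an accuracy score in context is to compare it with a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels that roughly mimic the class balance shown above. The toy labels and the placeholder feature array are assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with roughly the class balance above (62% / 38%)
y_toy = np.array([0] * 62 + [1] * 38)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features, ignored by the dummy model

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)
```

A model is only useful to the extent that it beats this kind of baseline.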

## Creating a Pipeline for Our Data

We will now put our scaling and logistic regression into a pipeline so that they are easier to manage.

In [25]:
# Importing the Pipeline object
from sklearn.pipeline import Pipeline

In [26]:
# Creating a pipeline
pipeline = Pipeline(
    [("scaler", StandardScaler()),
    ("log_reg", LogisticRegression())],
    verbose=True
)

In [27]:
# Fitting the pipeline
pipeline.fit(X_train, y_train)

[Pipeline] ............ (step 1 of 2) Processing scaler, total=   0.0s
[Pipeline] ........... (step 2 of 2) Processing log_reg, total=   0.0s


Pipeline(steps=[('scaler', StandardScaler()),
                ('log_reg', LogisticRegression())],
         verbose=True)

In [28]:
# The fitted steps are available through the named_steps attribute
pipeline.named_steps["scaler"].mean_

array([ 2.32773109, 32.76836067,  0.59327731,  0.6605042 ])

In [29]:
# Can now use predict and X_test gets automatically scaled
pipeline.predict(X_test)

array([0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0], dtype=int64)

In [30]:
# Can use score to get the accuracy score for logistic regression
pipeline.score(X_test, y_test)

0.7993197278911565
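
Because a pipeline behaves like a single estimator, it can also be passed to cross-validation. The sketch below assumes synthetic data from `make_classification` rather than the Titanic data, and shows that the scaler is then refitted on the training part of every fold, so the held-out part never influences the scaling. This is one way pipelines help guard against data leakage.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Same structure as the pipeline above: scaling followed by logistic regression
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression()),
])

# For each of the 5 folds, the whole pipeline (scaler included) is fitted
# on the training part only and then scored on the held-out part
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(scores.mean())
```

Scaling the full dataset before splitting would instead let statistics from the held-out data leak into training.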