# `Pipelines`, `ColumnTransformers`, and `pickles`

This tutorial demonstrates the use of `Pipelines`, `ColumnTransformers`, and `pickle` files.  These are tools you will need as you start moving toward real-world applications of machine learning, including putting models into production.

We will use the classic Titanic dataset (sorry!) to predict which passengers survived (label) its sinking based on various aspects (features) of each passenger.  Most of the features require preprocessing, and we will see how to carry out that preprocessing using `Pipelines` and `ColumnTransformers`.  We will also see how to combine the preprocessing with the actual machine learning technique into a single `Pipeline`, creating an object that can take in the raw data and then `.fit()` the model in one line of code.  This makes for much more organized and DRY code.

Finally, we will save our pipelined structure (model) to disk as a `pickle` file.  This will allow us to use the model later on new/unobserved data, with the added benefit that the new data will not require any separate preprocessing, because the preprocessing is built into the pipelined model object.

### Reading and Examining the Historical Data

There are two data sets in question for this tutorial: *historical/observed* data that will be used for fitting our model, and *new/unobserved* data that the model will be used on to generate predictions.

Let's start by reading in the historical data and taking a quick look (the new data is in the same format, minus the label).  The label is `Survived` and the rest of the columns are potential features.

In [1]:
import pandas as pd

df_historical = pd.read_csv('titanic_historical.csv')
df_historical.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| PassengerId | 1 | 2 | 3 | 4 | 5 |
| Survived | 0 | 1 | 1 | 1 | 0 |
| Pclass | 3 | 1 | 3 | 1 | 3 |
| Name | Braund, Mr. Owen Harris | Cumings, Mrs. John Bradley (Florence Briggs Th... | Heikkinen, Miss. Laina | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Allen, Mr. William Henry |
| Sex | male | female | female | female | male |
| Age | 22.0 | 38.0 | 26.0 | 35.0 | 35.0 |
| SibSp | 1 | 1 | 0 | 1 | 0 |
| Parch | 0 | 0 | 0 | 0 | 0 |
| Ticket | A/5 21171 | PC 17599 | STON/O2. 3101282 | 113803 | 373450 |
| Fare | 7.25 | 71.2833 | 7.925 | 53.1 | 8.05 |


We'll also do a bit of inspection to better understand our data.  (For details on this data set check out: https://www.kaggle.com/competitions/titanic/data.)

In [2]:
df_historical.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


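As the counts below show, the label isn't too imbalanced: roughly 62% of the passengers did not survive and 38% did.  We also take a quick look at how `Embarked` and `Pclass` are distributed.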

In [3]:
df_historical['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [4]:
df_historical['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [5]:
df_historical['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

Let's see how many null values are in each column.

In [6]:
df_historical.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Let's now separate our features from our labels.  Notice that at this stage we're keeping all the feature columns; we will select the ones we want inside the `Pipeline` we create below.

In [7]:
df_X = df_historical.drop(columns=['Survived'])
df_y = df_historical['Survived']

Let's identify the features that we want to keep for making predictions; this `list` will be fed into our `Pipeline`.

In [8]:
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

### Creating a Custom Transformer: `FeatureSelector`

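The first step in the `Pipeline` is selecting only the features that we want to keep.  A `ColumnTransformer` drops any columns it is not told about by default, but an explicit selection step makes the intent clear and is a nice excuse to learn how custom transformers work, so we'll create one to do the job.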

This amounts to creating a custom class called `FeatureSelector` which inherits from `BaseEstimator` and `TransformerMixin` - this is standard for creating custom transformers.  Implementing this inheritance gives us the `.fit_transform()` method for free.

In [9]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[self.columns]

Let's verify that our `FeatureSelector` is working.

In [10]:
FeatureSelector(features).fit_transform(df_X).head()

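| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |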
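
To make the "for free" part concrete, here is a minimal sketch on a hypothetical three-column toy `DataFrame` (the column names `a`, `b`, `c` are made up for illustration).  It repeats the class definition so that it runs on its own, and shows that `TransformerMixin` supplies `.fit_transform()` while `BaseEstimator` supplies `.get_params()`, even though we wrote neither.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Keep only the named columns of a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn: selection depends only on the column names.
        return self

    def transform(self, X, y=None):
        return X[self.columns]


# Hypothetical toy data, not part of the Titanic files.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

selector = FeatureSelector(columns=["a", "c"])
selected = selector.fit_transform(toy)   # inherited from TransformerMixin
params = selector.get_params()           # inherited from BaseEstimator

print(list(selected.columns))
print(params)
```

The `.get_params()` method is what lets scikit-learn clone the transformer, which tools such as cross-validation rely on; it works here because `__init__` stores its argument under the same name, `columns`.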


### Building the Data Pre-processing Pipeline

This section is where we do most of the heavy lifting.  In particular, we define the preprocessing `Pipelines` for the various columns. 

To instantiate a `Pipeline` you give it a `list` of `steps` that are executed in order.  The `list` consists of 2-`tuples` where the first element is a `string` name and the second element is a *transformer* object or an *estimator* object - don't worry too much about what that means at the moment. 

We start with the numeric features `Age`, `Fare`, `SibSp`, `Parch`.  For these columns we will fill the `NaN` values with the median of each column and then standardize; so, this `Pipeline` consists of two built-in transformers.
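
As a rough illustration of the `(name, object)` structure, the sketch below builds a two-step `Pipeline` on a tiny made-up array (`X_toy` and `y_toy` are hypothetical).  Every step except the last must be a transformer (it has `.fit()` and `.transform()`), while the last step may be an estimator with `.fit()` and `.predict()`.  The names are how you reach an individual step after fitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, two classes.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),            # transformer
    ("classifier", LogisticRegression()),    # final estimator
])
toy_pipeline.fit(X_toy, y_toy)

# Steps are stored in order and can be looked up by name.
step_names = [name for name, _ in toy_pipeline.steps]
scaler_mean = toy_pipeline.named_steps["scaler"].mean_

print(step_names)
print(scaler_mean)
```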

In [11]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler 

# Age, Fare, SibSp, Parch
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
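
To see what these two steps do, here is a small sketch that applies the same kind of `Pipeline` to a made-up single column containing one `NaN` (the array `toy_numeric` is hypothetical).  The median of the observed values 1, 2, 3 is 2, so the missing entry becomes 2 before the column is standardized to mean 0 and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with one missing value.
toy_numeric = np.array([[1.0], [2.0], [np.nan], [3.0]])

toy_numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# After imputation the column is [1, 2, 2, 3]; scaling then centers it on 0.
toy_scaled = toy_numerical_transformer.fit_transform(toy_numeric)

print(toy_scaled.ravel())
```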

`Sex` and `Embarked` are both categorical features, so we fill the `NaN` values with the mode of each column and then one-hot encode each of them.  This `Pipeline` also consists of two built-in transformers.

In [12]:
from sklearn.preprocessing import OneHotEncoder

# Sex, Embarked
categorical_transformer_1 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])
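
The `handle_unknown='ignore'` argument matters once the model meets new data.  The sketch below uses made-up values (`toy_train`, `toy_new`, and the category `'X'` are hypothetical) to show that a category never seen during fitting is encoded as a row of all zeros instead of raising an error.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column: 'S' is the mode, one value is missing.
toy_train = np.array([["S"], ["C"], ["S"], [np.nan]], dtype=object)
# Hypothetical new data containing a category ('X') not seen in training.
toy_new = np.array([["S"], ["X"]], dtype=object)

toy_categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
toy_categorical_transformer.fit(toy_train)

# The encoder returns a sparse matrix by default, so convert it for display.
encoded_new = toy_categorical_transformer.transform(toy_new).toarray()

# Columns correspond to the fitted categories 'C' and 'S', in that order.
print(encoded_new)
```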

Finally, we address `Pclass`.  Notice that it is an ordinal categorical feature, but it is already encoded as integer values, so the only thing we will do is fill any `NaN` values.  This `Pipeline` consists of a single built-in transformer.

In [13]:
# Pclass
categorical_transformer_2 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

Now we can put these processing `Pipelines` together in a `ColumnTransformer` object.  We need to use a `ColumnTransformer` because we are performing different transformations on different features.  To instantiate a `ColumnTransformer` you give it a `list` of 3-`tuples` whose elements are as follows:

1. a `string` name
2. a transformer object
3. a list of the columns to perform the transform on.

The `ColumnTransformer` below consists of the three transformer `Pipelines` that we created above.

In [14]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[
    ('numerical', numerical_transformer, ['Age', 'Fare', 'SibSp', 'Parch']),
    ('categorical_1', categorical_transformer_1, ['Sex', 'Embarked']),
    ('categorical_2', categorical_transformer_2, ['Pclass'])],
)

### Testing the Output of Our Preprocessing

For testing purposes, let's create a `Pipeline` that we will call `testing_output`; it will consist of our custom `FeatureSelector` and our `ColumnTransformer` which we called `preprocessor`.  The reason that we are doing this is to make sure that feeding in `df_X` yields a reasonable output.  

Notice that there were originally 7 columns, and now there are 10 columns.  The three extra columns result from the one-hot encoding of `Sex` (two columns) and `Embarked` (three columns).  The numerical columns seem scaled and reasonable.  (I'll leave it to the reader to check that there are no `NaNs`.)

Of course the resulting `DataFrame` is hard to decipher, because in this situation the `.fit_transform()` method returns an `np.ndarray` with no column names, which we then wrap in a `DataFrame` with `pd.DataFrame()`.  But by inspection the result seems reasonable and that will suffice for our purposes.

In [15]:
testing_output = Pipeline(steps=[
    ('feature_selector', FeatureSelector(features)),
    ('preprocessor', preprocessor)
])
pd.DataFrame(testing_output.fit_transform(df_X))

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.565736 | -0.502445 | 0.432793 | -0.473674 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 1 | 0.663861 | 0.786845 | 0.432793 | -0.473674 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | -0.258337 | -0.488854 | -0.474545 | -0.473674 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 3 | 0.433312 | 0.420730 | 0.432793 | -0.473674 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 4 | 0.433312 | -0.486337 | -0.474545 | -0.473674 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | -0.181487 | -0.386671 | -0.474545 | -0.473674 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 887 | -0.796286 | -0.044381 | -0.474545 | -0.473674 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 888 | -0.104637 | -0.176263 | 0.432793 | 2.008933 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 889 | -0.258337 | -0.044381 | -0.474545 | -0.473674 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |


### Creating the Final `Pipeline`

With the hard work we did in the previous section, we are now ready to put together the final `Pipeline`.  All that remains to be added is the `LogisticRegression()` estimator.  Thus, this final `Pipeline` has three sequential components:

1. A custom `FeatureSelector` that grabs only the columns that we will use for our predictions.

2. A `ColumnTransformer` object called `preprocessor` which itself consists of three different `Pipelines`, which in turn consist of one or more built-in transformers.

3. An estimator object `LogisticRegression()`.

In [16]:
from sklearn.linear_model import LogisticRegression

model = Pipeline(steps=[
    ('feature_selector', FeatureSelector(features)),
    ('preprocessor', preprocessor),
    ('logistic_regression', LogisticRegression())
])

### Fitting the Pipeline (Model)

We can now fit our entire `Pipeline` to the original data.  Notice that we're not doing any separate preprocessing on the original data; the preprocessing is all built into the `Pipeline`.

In [17]:
model.fit(df_X, df_y)

Let's take a look at the in-sample accuracy as a sanity check that we're on the right track.  The performance looks good, but of course this in-sample accuracy overstates out-of-sample performance.

In [18]:
model.score(df_X, df_y)

0.7991021324354658
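
For a less optimistic estimate, one common approach is cross-validation.  The sketch below is not part of the original workflow: it uses synthetic data from `make_classification` and a simplified two-step `Pipeline` so that it runs on its own.  Because the whole `Pipeline` is passed to `cross_val_score`, the preprocessing is refit on the training portion of each fold, so nothing leaks from the held-out portion.  With the tutorial's objects the analogous call would be `cross_val_score(model, df_X, df_y, cv=5)`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Titanic data).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression()),
])

# Five folds: each score is the accuracy on data the fold's model never saw.
cv_scores = cross_val_score(toy_model, X_toy, y_toy, cv=5, scoring="accuracy")

print(cv_scores)
print(cv_scores.mean())
```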

### Saving the Model to Disk

Next, let's use the `joblib` package to save our fitted model to disk as a `pickle` file, so that it can be reused later without refitting.

In [19]:
import joblib

joblib.dump(model, 'titanic_model_2.pkl')

['titanic_model_2.pkl']

### Loading the Model from Disk

Now let's read in our saved model object using the `joblib` package.

In [20]:
saved_model = joblib.load('titanic_model_2.pkl')
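
As a quick check that nothing is lost in the round trip, the sketch below saves and reloads a small stand-in `Pipeline` (fit on made-up data, written to a temporary directory) and confirms that the reloaded object gives the same predictions.  Two caveats apply to the real model as well: a pickle stores custom classes such as `FeatureSelector` by reference, so the class must be defined or importable in the session that loads the file, and pickle files should only be loaded from sources you trust.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data and model standing in for the Titanic pipeline.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression()),
])
toy_model.fit(X_toy, y_toy)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)
    restored_model = joblib.load(path)

same_predictions = np.array_equal(
    toy_model.predict(X_toy), restored_model.predict(X_toy)
)
print(same_predictions)
```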

### Making New Predictions

Finally, let's make predictions on the new data with our saved model object.  

In [21]:
df_new = pd.read_csv('titanic_new.csv')

In particular, we will append the predictions to `df_new` as a new column, so that each prediction sits alongside the features it was generated from.  The last column of `df_new` then consists of the predictions generated by `saved_model`. 

In [22]:
df_new['prediction'] = saved_model.predict(df_new)
df_new.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 408 | 409 | 410 | 411 | 412 | 413 | 414 | 415 | 416 | 417 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 892 | 893 | 894 | 895 | 896 | 897 | 898 | 899 | 900 | 901 | ... | 1300 | 1301 | 1302 | 1303 | 1304 | 1305 | 1306 | 1307 | 1308 | 1309 |
| Pclass | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | ... | 3 | 3 | 3 | 1 | 3 | 3 | 1 | 3 | 3 | 3 |
| Name | Kelly, Mr. James | Wilkes, Mrs. James (Ellen Needs) | Myles, Mr. Thomas Francis | Wirz, Mr. Albert | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | Svensson, Mr. Johan Cervin | Connolly, Miss. Kate | Caldwell, Mr. Albert Francis | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | Davies, Mr. John Samuel | ... | Riordan, Miss. Johanna Hannah"" | Peacock, Miss. Treasteall | Naughton, Miss. Hannah | Minahan, Mrs. William Edward (Lillian E Thorpe) | Henriksson, Miss. Jenny Lovisa | Spector, Mr. Woolf | Oliva y Ocana, Dona. Fermina | Saether, Mr. Simon Sivertsen | Ware, Mr. Frederick | Peter, Master. Michael J |
| Sex | male | female | male | male | female | male | female | male | female | male | ... | female | female | female | female | female | male | female | male | male | male |
| Age | 34.5 | 47.0 | 62.0 | 27.0 | 22.0 | 14.0 | 30.0 | 26.0 | 18.0 | 21.0 | ... | NaN | 3.0 | NaN | 37.0 | 28.0 | NaN | 39.0 | 38.5 | NaN | NaN |
| SibSp | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 2 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| Parch | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Ticket | 330911 | 363272 | 240276 | 315154 | 3101298 | 7538 | 330972 | 248738 | 2657 | A/4 48871 | ... | 334915 | SOTON/O.Q. 3101315 | 365237 | 19928 | 347086 | A.5. 3236 | PC 17758 | SOTON/O.Q. 3101262 | 359309 | 2668 |
| Fare | 7.8292 | 7.0 | 9.6875 | 8.6625 | 12.2875 | 9.225 | 7.6292 | 29.0 | 7.2292 | 24.15 | ... | 7.7208 | 13.775 | 7.75 | 90.0 | 7.775 | 8.05 | 108.9 | 7.25 | 8.05 | 22.3583 |
| Cabin | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | C78 | NaN | NaN | C105 | NaN | NaN | NaN |
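
The reason `saved_model.predict(df_new)` can be handed the whole `DataFrame`, unused columns and all, is that the first step of the `Pipeline` selects the columns it needs by name.  The sketch below illustrates this with made-up data and a stripped-down `Pipeline` (the toy frames and their values are hypothetical); it also shows `predict_proba`, which `LogisticRegression` provides, as one way to attach a probability alongside each hard prediction.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Keep only the named columns of a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.columns]


# Hypothetical training data with an extra column the model never uses.
toy_train = pd.DataFrame({
    "Fare": [5.0, 8.0, 60.0, 90.0],
    "Ticket": ["a", "b", "c", "d"],
})
toy_labels = pd.Series([0, 0, 1, 1])

toy_model = Pipeline(steps=[
    ("feature_selector", FeatureSelector(["Fare"])),
    ("logistic_regression", LogisticRegression()),
])
toy_model.fit(toy_train, toy_labels)

# New data: columns in a different order, extra column still present.
toy_new = pd.DataFrame({"Ticket": ["e", "f"], "Fare": [6.0, 80.0]})

# Probability of the positive class, then the hard 0/1 prediction.
toy_probabilities = toy_model.predict_proba(toy_new)[:, 1]
toy_new["prediction"] = toy_model.predict(toy_new)
toy_new["positive_probability"] = toy_probabilities

print(toy_new)
```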


**References:**

https://www.youtube.com/watch?v=URdnFlZnlaE

https://medium.com/p/a27721fdff1b

https://towardsdatascience.com/creating-custom-transformers-for-sklearn-pipelines-d3d51852ecc1

https://towardsdatascience.com/step-by-step-tutorial-of-sci-kit-learn-pipeline-62402d5629b6