# `Pipelines` and `pickles`

This tutorial is an introduction to `Pipelines` and `pickle` files.  These are tools you will need as you start moving towards real-world applications of machine learning, including putting models into production.

We will use the classic Titanic data set (sorry!) to predict which passengers survived its sinking (the label) based on four numerical features of each passenger.  The features require preprocessing, and we will see how to carry out that preprocessing using `Pipelines`.  We will also see how to combine the preprocessing with the actual machine learning estimator (e.g. `LogisticRegression`) into a `Pipeline` which can then be fitted in a single line of code.

Finally, we will save our fitted `Pipeline` (the model) to disk as a `pickle` file. This will allow us to use the model at a later time on new/unobserved data.

### Reading and Examining the Historical Data

There are two data sets in this tutorial: historical/observed data that will be used for fitting our model, and new/unobserved data for which the model will generate predictions.

Let's start by reading in the historical data and taking a quick look (the new data is in the same format). The label is `Survived` and the rest of the columns are potential features.

In [1]:
import pandas as pd

df_historical = pd.read_csv('titanic_historical.csv')
df_historical.head().T

| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| PassengerId | 1 | 2 | 3 | 4 | 5 |
| Survived | 0 | 1 | 1 | 1 | 0 |
| Pclass | 3 | 1 | 3 | 1 | 3 |
| Name | Braund, Mr. Owen Harris | Cumings, Mrs. John Bradley (Florence Briggs Th... | Heikkinen, Miss. Laina | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Allen, Mr. William Henry |
| Sex | male | female | female | female | male |
| Age | 22.0 | 38.0 | 26.0 | 35.0 | 35.0 |
| SibSp | 1 | 1 | 0 | 1 | 0 |
| Parch | 0 | 0 | 0 | 0 | 0 |
| Ticket | A/5 21171 | PC 17599 | STON/O2. 3101282 | 113803 | 373450 |
| Fare | 7.25 | 71.2833 | 7.925 | 53.1 | 8.05 |


We will also check the data type of each of the columns.  (For details on this data set check out: https://www.kaggle.com/competitions/titanic/data.)

In [2]:
df_historical.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Let's retain four of the numeric columns (`Age`, `SibSp`, `Parch` and `Fare`) as our features and separate out our labels.

In [3]:
df_X = df_historical[['Age', 'SibSp', 'Parch', 'Fare']]
df_y = df_historical['Survived']

Finally, we note that `Age` has some missing values.

In [4]:
df_X.isnull().sum()

Age      177
SibSp      0
Parch      0
Fare       0
dtype: int64

### Creating the `Pipeline`

We are using four numeric features `Age`, `SibSp`, `Parch` and `Fare` to make predictions.  We see from above that `Age` has missing values, which we'll need to address.  We will also want to standardize all the  variables so they are on the same order of magnitude. 

To instantiate a `Pipeline`, you give it a `list` of steps that are executed in order. The `list` consists of 2-tuples where the first element is a string name and the second element is a *transformer* object or an *estimator* object - don't worry too much about what that means at the moment.

Our `Pipeline` will consist of two transformers, followed by an estimator.  In particular:

1. `SimpleImputer()` - a built-in transformer that fills in `NaN` values; this particular instance fills them with the column's median value.

2. `StandardScaler()` - a built-in transformer that scales each column by subtracting the mean and dividing by the standard deviation.

3. `LogisticRegression()` - the estimator that we are familiar with.
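
To make the first two steps concrete, here is a minimal sketch of what the imputer and the scaler do to a single column. The `ages` array is made-up toy data (not taken from the Titanic file), and the hand-written NumPy line is only there to check the arithmetic described in the list above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (made-up numbers, not Titanic data).
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

# Median imputation: the median of the observed values 20, 30, 40 is 30,
# so the missing entry is replaced by 30.
imputed = SimpleImputer(strategy='median').fit_transform(ages)

# Standardization: subtract the column mean, divide by the column standard deviation.
scaled = StandardScaler().fit_transform(imputed)

# The same arithmetic done by hand with NumPy.
by_hand = (imputed - imputed.mean(axis=0)) / imputed.std(axis=0)

print(imputed.ravel())
print(np.allclose(scaled, by_hand))
```

After scaling, the column has mean 0 and standard deviation 1, which is what puts features with very different ranges (such as `Age` and `Fare`) on a comparable footing.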

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression

model = Pipeline(steps=[
    ('fill_na', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])
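
As a rough illustration of what the `Pipeline` saves us from writing, the sketch below fits the same three steps twice: once through a `Pipeline` and once by hand, chaining `fit_transform` and `fit` calls. The data here is synthetic (generated with NumPy purely for illustration), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration, not the Titanic set).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
X[::10, 0] = np.nan  # knock out some values in the first column

pipe = Pipeline(steps=[
    ('fill_na', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])
pipe.fit(X, y)

# The same work done step by step, without a Pipeline.
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
clf = LogisticRegression()
X_prepared = scaler.fit_transform(imputer.fit_transform(X))
clf.fit(X_prepared, y)
manual_pred = clf.predict(scaler.transform(imputer.transform(X)))

# Both routes should give the same predictions.
same = np.array_equal(pipe.predict(X), manual_pred)
print(same)

# Fitted steps can be looked up by the string names given in the tuples.
fitted_medians = pipe.named_steps['fill_na'].statistics_
print(fitted_medians)
```

The string names in the 2-tuples are what make `named_steps` lookups like the last one possible, so it pays to choose descriptive names.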


(**Note:** our transformers are being applied to all four columns. If we wanted different transformers to be applied to different columns we would need to use a `ColumnTransformer` object.  We will cover those in a later tutorial.)
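
For the curious, here is a small preview sketch of a `ColumnTransformer`, assuming we wanted to impute and scale `Age` but only scale `Fare`. The `toy` frame holds made-up rows that merely borrow the tutorial's column names, and the step names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows that only borrow the tutorial's column names.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 30.0, 40.0],
    'Fare': [5.0, 50.0, 10.0, 100.0],
})

# Impute and then scale `Age`; only scale `Fare`.
age_steps = Pipeline(steps=[
    ('fill_na', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
preprocess = ColumnTransformer(transformers=[
    ('age', age_steps, ['Age']),
    ('fare', StandardScaler(), ['Fare']),
])

# The result is a 2-column array with no missing values.
prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```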

### Fitting the Pipeline (Model)

We can now fit our entire `Pipeline` to `df_X`. Notice that other than selecting variables, we're not doing any preprocessing on the original data; the preprocessing is all built into the `Pipeline`.

In [6]:
model.fit(df_X, df_y)

Let's take a look at the in-sample accuracy as a sanity check that we're on the right track.  The performance looks decent, but of course this in-sample accuracy overstates out-of-sample performance.

In [7]:
model.score(df_X, df_y)

0.691358024691358
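
One common way to get a less optimistic estimate is cross-validation. The sketch below is an illustration on synthetic data (an assumption, so that it runs without the Titanic file); it passes a whole `Pipeline` to `cross_val_score`, which refits the imputer and scaler on each training fold so that no information from the held-out rows leaks into the preprocessing.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the Titanic set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
X[::10, 0] = np.nan  # some missing values, as in the `Age` column

pipe = Pipeline(steps=[
    ('fill_na', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])

# Five-fold cross-validation: each fold is scored on rows the pipeline
# did not see while it was being fitted.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

With the tutorial's objects, the analogous call would be `cross_val_score(model, df_X, df_y, cv=5)`.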

### Saving the Model to Disk

Next, let's use the `joblib` package to save our fitted model to disk as a `pickle` file, so that we can reuse it later without refitting.

In [8]:
import joblib

joblib.dump(model, 'titanic_model_1.pkl')

['titanic_model_1.pkl']
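
`joblib` builds on Python's pickling machinery, so it may help to see the underlying idea with the standard-library `pickle` module. This is a minimal sketch in which `toy_model` is a hypothetical stand-in (a plain dictionary with made-up values) for a fitted model.

```python
import pickle

# A stand-in for a fitted model: any picklable Python object works.
toy_model = {'name': 'toy_model', 'weights': [0.5, -1.0, 2.0]}

payload = pickle.dumps(toy_model)   # object -> bytes
restored = pickle.loads(payload)    # bytes -> a new, equal object

print(type(payload).__name__)
print(restored == toy_model)
```

Two practical cautions: unpickling can execute arbitrary code, so only load `pickle` files from sources you trust, and a saved scikit-learn model is best loaded with the same library versions that were used to save it.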

### Loading the Model from Disk

Now let's read in our saved model `pickle` file using the `joblib` package.

In [9]:
saved_model = joblib.load('titanic_model_1.pkl')
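
A quick sanity check after loading is to confirm that the reloaded model predicts exactly what the original did. The sketch below does this on synthetic data, writing to a temporary directory so that it leaves no files behind; the names `original`, `reloaded` and `toy_model.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] > 0).astype(int)

original = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
]).fit(X, y)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(original, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions.
round_trip_ok = np.array_equal(original.predict(X), reloaded.predict(X))
print(round_trip_ok)
```

In this tutorial the equivalent check would compare `model.predict(df_X)` with `saved_model.predict(df_X)`.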

### Making New Predictions

Finally, let's make predictions on the new data with our saved model object.

In [10]:
df_new = pd.read_csv('titanic_new.csv')

In order to make predictions with our model, we first need to select only the columns that the model was fitted on.  This is a form of preprocessing that really should be taken care of in the `Pipeline`.  We will take care of this in a subsequent tutorial.

In [11]:
df_new = df_new[['Age', 'SibSp', 'Parch', 'Fare']].copy()
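
As a preview of that idea, one possible approach (a sketch, not necessarily the one a later tutorial will use) is to put a `ColumnTransformer` at the front of the `Pipeline` that passes the wanted columns through and drops everything else. The `toy` frame and its labels are made up for illustration, and the step names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_columns = ['Age', 'SibSp', 'Parch', 'Fare']

# Made-up rows with an extra column the model should ignore.
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Age': [20.0, 40.0, np.nan, 35.0, 50.0, 5.0],
    'SibSp': [1, 1, 0, 1, 0, 3],
    'Parch': [0, 0, 0, 0, 0, 1],
    'Fare': [5.0, 70.0, 8.0, 50.0, 55.0, 20.0],
})
toy_labels = [0, 1, 1, 1, 0, 0]

select_columns = ColumnTransformer(
    transformers=[('keep', 'passthrough', feature_columns)],
    remainder='drop',  # every column not listed is discarded
)

model_with_selection = Pipeline(steps=[
    ('select', select_columns),
    ('fill_na', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])

# The full DataFrame, `Name` included, can be passed in directly.
model_with_selection.fit(toy, toy_labels)
predictions = model_with_selection.predict(toy)
print(len(predictions))
```

With the selection inside the model, the saved `pickle` file carries the list of required columns with it, so code that loads the model does not need to repeat it.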

We will append the predictions to `df_new` as a new column.  Notice that the last column of `df_new` now consists of the predictions generated by `saved_model`.

In [12]:
df_new['prediction'] = saved_model.predict(df_new)
df_new

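| | Age | SibSp | Parch | Fare | prediction |
| --- | --- | --- | --- | --- | --- |
| 0 | 34.5 | 0 | 0 | 7.8292 | 0 |
| 1 | 47.0 | 1 | 0 | 7.0000 | 0 |
| 2 | 62.0 | 0 | 0 | 9.6875 | 0 |
| 3 | 27.0 | 0 | 0 | 8.6625 | 0 |
| 4 | 22.0 | 1 | 1 | 12.2875 | 0 |
| ... | ... | ... | ... | ... | ... |
| 413 | NaN | 0 | 0 | 8.0500 | 0 |
| 414 | 39.0 | 0 | 0 | 108.9000 | 1 |
| 415 | 38.5 | 0 | 0 | 7.2500 | 0 |
| 416 | NaN | 0 | 0 | 8.0500 | 0 |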

