# II.- Travel Marketing Machine Learning Pipeline: Feature Engineering

There will be a notebook for each one of the Machine Learning Pipeline steps:

1. Data Analysis
2. Feature Engineering
3. Model Building

**This is the notebook for step 2: Feature Engineering**

**The purpose of these notebooks is to provide an idea of the steps that must be covered when preparing a machine learning model for deployment.**

===================================================================================================

## Predicting Repeat Customers in the Travel Business

The aim of the project is to build a machine learning model to predict which customers of a travel agency are going to be repeat customers.

### Why is this important? 

The travel agency is giving out too many discounted packages without seeing a return on investment, so it wants to send discounted offers only to customers who are likely to repeat. It also wants to reduce churn by sending targeted marketing to customers who are likely to defect.

### What is the objective of the machine learning model?

We aim to identify customers that will repeat using data describing each customer's socioeconomic status and interests. 

====================================================================================================

## Travel marketing dataset: Feature Engineering

In the following cells we will pre-process the variables of the **travelChurn_20k.csv** dataset. We will engineer the variables so that we tackle:

1. Missing values
2. Encoding of categorical variables

In this notebook we will persist the resulting data transformation pipeline for later reuse. Note that not all of the insights obtained from the exploratory data analysis are incorporated into this pipeline, as the idea of these notebooks is to illustrate the process to follow when writing production code.


### Setting the seed

It is important to note that we are engineering variables with the idea of deploying the model. Therefore, from now on, for each step that includes some element of randomness, it is extremely important that we **set the seed**. This way, we can obtain reproducibility between our research and our development code.
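As a small illustration of what setting the seed buys us, the sketch below splits a toy array twice with the same `random_state` and checks that both splits select the same rows. The toy data and the variable names are assumptions made for this example; only the seed value matches the one used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 801  # same value this notebook assigns to RANDOM_STATE

# toy data standing in for the real dataset (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# two independent splits that share the same seed
_, X_test_a, _, y_test_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=SEED)
_, X_test_b, _, y_test_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=SEED)

# with a fixed seed both splits pick exactly the same rows
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)
```

Without a fixed `random_state`, the two splits would generally differ, and results could not be reproduced between research and production code.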

# Imports

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to divide train and test set
from sklearn.model_selection import train_test_split

# categorical feature encoding
from sklearn.preprocessing import OneHotEncoder

# to build the model
from sklearn.ensemble import RandomForestClassifier

# to assess model performance
from sklearn.metrics import log_loss
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_recall_curve

# maximum number of dataframe rows and columns displayed
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
# silence the pandas warning about chained assignment
pd.options.mode.chained_assignment = None

# to create customised preprocessing pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# to perform imputation
from sklearn.impute import SimpleImputer

# to persist preprocessing pipeline
import joblib

# custom transformers
from custom_transformers import CustomImputer, CustomEncoder

# random seed
RANDOM_STATE = 801

import warnings
warnings.simplefilter(action='ignore')

# 1.- Read data

In [2]:
# load dataset
data0 = pd.read_csv('travelChurn_20k.csv')
print(data0.shape)
data0.head()

(20000, 18)


Unnamed: 0,repeat,Gender,Age_Range,Income_Range,Occupation,Household_Type,Length_of_Residence,Home_Value_Range,Wealth_Rank,Mail_Buyer,Ecommerce_Behav_Rank,Upscale_Retail_Shopper,Premium_Bank_Card,Books_Behav,Family_Behav,Health_Magazine,Personal_Travel,Sporting_Goods_Interest
0,0,M,45-54 Years Old,"$100,000 - $124,999",Executive/Administrator,Adult Male & Adult Female Present,In the 6th Year,"$250,000 - $300,000",8,,9,Y,Y,1,,0,Y,U
1,0,M,45-54 Years Old,"$75,000 - $99,999",Unknown,Adult Male & Adult Female Present,In the 14th Year,"$150,000 - $200,000",8,,5,U,U,0,,0,U,U
2,0,M,24-34 Years Old,"$100,000 - $124,999",Unknown,Adult Male & Adult Female Present,In the 2nd Year,"$600,000 - $650,000",9,,8,,,2,,1,,
3,1,M,24-34 Years Old,"$75,000 - $99,999",Unknown,Adult Male & Adult Female Present,In the 6th Year,"$100,000 - $150,000",8,,5,U,U,3,,2,U,U
4,0,M,55-64 Years Old,"$125,000 - $149,999",Unknown,Unknown,In the 1st Year,Unknown,3,,10,,,0,,0,,


In [3]:
data0.dtypes

repeat                      int64
Gender                     object
Age_Range                  object
Income_Range               object
Occupation                 object
Household_Type             object
Length_of_Residence        object
Home_Value_Range           object
Wealth_Rank                object
Mail_Buyer                 object
Ecommerce_Behav_Rank       object
Upscale_Retail_Shopper     object
Premium_Bank_Card          object
Books_Behav                object
Family_Behav               object
Health_Magazine            object
Personal_Travel            object
Sporting_Goods_Interest    object
dtype: object

## 1.1.- Separate dataset into train and test

It is important to separate our data into training and testing sets before we start to engineer the features. **When we engineer features, some techniques learn parameters from data. It is important to learn these parameters only from the train set to avoid overfitting and leaking information from the test set.**

**Separating the data into train and test involves randomness, therefore, we need to set the seed.**
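To make the point about learning parameters only from the train set concrete, here is a minimal sketch on made-up data. It uses a most-frequent imputer purely for illustration, because that strategy actually learns a value from the data; this notebook itself uses constant imputation later on. The toy frames and variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up train and test frames (illustrative only)
toy_train = pd.DataFrame({"Wealth_Rank": [8.0, 8.0, 3.0, np.nan]})
toy_test = pd.DataFrame({"Wealth_Rank": [np.nan, 9.0, 9.0]})

toy_imputer = SimpleImputer(strategy="most_frequent")
toy_imputer.fit(toy_train)                      # parameter learned from train only
test_filled = toy_imputer.transform(toy_test)   # reused on test, never refitted

# the fill value comes from the train set, not from the test set
learned_value = toy_imputer.statistics_[0]
print(learned_value, test_filled[0, 0])
```

Had the imputer been refitted on the test frame, the missing test value would have been filled with 9.0, a value learned from data the model is not supposed to see.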

In [4]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(data0,
                                                    data0['repeat'],
                                                    test_size=0.1,
                                                    # we are setting the seed here:
                                                    random_state=RANDOM_STATE)  # the data is being shuffled by default

X_train.shape, X_test.shape

((18000, 18), (2000, 18))

In [5]:
X_train.columns

Index(['repeat', 'Gender', 'Age_Range', 'Income_Range', 'Occupation',
       'Household_Type', 'Length_of_Residence', 'Home_Value_Range',
       'Wealth_Rank', 'Mail_Buyer', 'Ecommerce_Behav_Rank',
       'Upscale_Retail_Shopper', 'Premium_Bank_Card', 'Books_Behav',
       'Family_Behav', 'Health_Magazine', 'Personal_Travel',
       'Sporting_Goods_Interest'],
      dtype='object')

In [6]:
X_train.head()

Unnamed: 0,repeat,Gender,Age_Range,Income_Range,Occupation,Household_Type,Length_of_Residence,Home_Value_Range,Wealth_Rank,Mail_Buyer,Ecommerce_Behav_Rank,Upscale_Retail_Shopper,Premium_Bank_Card,Books_Behav,Family_Behav,Health_Magazine,Personal_Travel,Sporting_Goods_Interest
18239,0,M,45-54 Years Old,"$100,000 - $124,999",Executive/Administrator,Adult Male & Adult Female Present With Children,In the 11th Year,"$200,000 - $250,000",,2,,Y,U,,0,,U,U
6282,0,F,45-54 Years Old,"$100,000 - $124,999",Unknown,Adult Male & Adult Female Present,15+ Years,"$150,000 - $200,000",7.0,2,8.0,Y,U,2.0,0,2.0,U,U
18641,0,F,55-64 Years Old,"$40,000 - $49,999",Unknown,Adult Male & Adult Female Present With Children,15+ Years,"$150,000 - $200,000",,2,,Y,U,,4,,U,U
7354,0,M,45-54 Years Old,"$125,000 - $149,999",Executive/Administrator,Adult Male & Adult Female Present,15+ Years,"$200,000 - $250,000",8.0,1,10.0,Y,U,5.0,2,4.0,Y,U
5691,0,F,55-64 Years Old,"$50,000 - $74,999",Unknown,Adult Male & Adult Female Present,15+ Years,"$50,000 - $100,000",3.0,2,5.0,U,U,0.0,0,0.0,U,U


In [7]:
# we drop the target from the test and train sets
X_train = X_train.drop('repeat', axis=1)
X_test = X_test.drop('repeat', axis=1)

In [8]:
X_train.shape

(18000, 17)

In [9]:
X_test.shape

(2000, 17)

# 2.- Typecasting & missing value imputation

As most of the features are read from the data as strings, missing values show up as blank strings, which we will replace with `NaN`.

In [10]:
# replace blanks with nan
X_train = X_train.replace(r'^\s*$', np.nan, regex=True)
X_test = X_test.replace(r'^\s*$', np.nan, regex=True)
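The effect of the regular expression `^\s*$` can be checked on a small, made-up frame: it matches empty strings and strings made only of whitespace, and leaves every other value alone. The frame below is an assumption for illustration, not a sample of the real data.

```python
import numpy as np
import pandas as pd

# made-up frame with empty and whitespace-only entries
toy = pd.DataFrame(
    {
        "Premium_Bank_Card": ["Y", "", "   ", "U"],
        "Wealth_Rank": ["8", " ", "3", "9"],
    },
    dtype=object,
)

# same replacement as in the cell above
cleaned = toy.replace(r'^\s*$', np.nan, regex=True)

# percentage of missing values per column
missing_pct = cleaned.isnull().mean() * 100
print(missing_pct)
```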

In [11]:
# print percentage of missing values per variable
X_train.isnull().mean()*100

Gender                      0.000000
Age_Range                   0.000000
Income_Range                0.000000
Occupation                  0.000000
Household_Type              0.000000
Length_of_Residence         0.000000
Home_Value_Range            0.000000
Wealth_Rank                 9.322222
Mail_Buyer                 10.244444
Ecommerce_Behav_Rank        9.322222
Upscale_Retail_Shopper     30.744444
Premium_Bank_Card          30.744444
Books_Behav                 9.322222
Family_Behav               10.244444
Health_Magazine             9.322222
Personal_Travel            30.744444
Sporting_Goods_Interest    30.744444
dtype: float64

In [12]:
# print percentage of missing values per variable
X_test.isnull().mean()*100

Gender                      0.00
Age_Range                   0.00
Income_Range                0.00
Occupation                  0.00
Household_Type              0.00
Length_of_Residence         0.00
Home_Value_Range            0.00
Wealth_Rank                 9.05
Mail_Buyer                 10.40
Ecommerce_Behav_Rank        9.05
Upscale_Retail_Shopper     31.20
Premium_Bank_Card          31.20
Books_Behav                 9.05
Family_Behav               10.40
Health_Magazine             9.05
Personal_Travel            31.20
Sporting_Goods_Interest    31.20
dtype: float64

As seen above, ten of the 17 features have missing values, and four of them are missing roughly 31% of their entries.

## 2.1.- Typecasting

We observe that some categorical variables are nominal and others are ordinal; however, all features other than the target 'repeat' are read as strings. For this reason we need to typecast the ordinal features to integers.

In [13]:
CATEGORICAL_NOMINAL = ['Gender', 'Age_Range', 'Income_Range', 'Occupation', 'Household_Type', 'Length_of_Residence', 'Home_Value_Range', 'Upscale_Retail_Shopper', 
                       'Premium_Bank_Card', 'Personal_Travel', 'Sporting_Goods_Interest']

In [14]:
CATEGORICAL_ORDINAL = ['Wealth_Rank', 'Mail_Buyer', 'Ecommerce_Behav_Rank', 'Books_Behav', 'Family_Behav', 
                   'Health_Magazine']

In [15]:
X_train[CATEGORICAL_ORDINAL] = X_train[CATEGORICAL_ORDINAL].astype(float).astype(pd.Int32Dtype())

In [16]:
X_test[CATEGORICAL_ORDINAL] = X_test[CATEGORICAL_ORDINAL].astype(float).astype(pd.Int32Dtype())

In [17]:
X_train.dtypes

Gender                     object
Age_Range                  object
Income_Range               object
Occupation                 object
Household_Type             object
Length_of_Residence        object
Home_Value_Range           object
Wealth_Rank                 Int32
Mail_Buyer                  Int32
Ecommerce_Behav_Rank        Int32
Upscale_Retail_Shopper     object
Premium_Bank_Card          object
Books_Behav                 Int32
Family_Behav                Int32
Health_Magazine             Int32
Personal_Travel            object
Sporting_Goods_Interest    object
dtype: object

In [18]:
X_test.dtypes

Gender                     object
Age_Range                  object
Income_Range               object
Occupation                 object
Household_Type             object
Length_of_Residence        object
Home_Value_Range           object
Wealth_Rank                 Int32
Mail_Buyer                  Int32
Ecommerce_Behav_Rank        Int32
Upscale_Retail_Shopper     object
Premium_Bank_Card          object
Books_Behav                 Int32
Family_Behav                Int32
Health_Magazine             Int32
Personal_Travel            object
Sporting_Goods_Interest    object
dtype: object
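The two-step cast used above (string to float, then float to the nullable `Int32` dtype) can be tried on a toy series. Going through `float` first lets missing entries pass through as `NaN`, and the nullable integer dtype then stores them as `<NA>` instead of forcing the column back to float. The series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# made-up ordinal column read as strings, with one missing entry
ranks = pd.Series(["8", np.nan, "3"], dtype=object)

# string -> float keeps NaN; float -> Int32 turns NaN into <NA>
ranks_int = ranks.astype(float).astype(pd.Int32Dtype())

print(ranks_int.dtype)
print(ranks_int.isna().sum())
```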

## 2.2.- Missing value imputation

For the sake of simplicity we will impute missing values using 'Missing' for nominal features and -1 for ordinal features.  

In [19]:
ordinal_imputer = SimpleImputer(strategy='constant', fill_value=-1)

In [20]:
X_train[CATEGORICAL_ORDINAL] = ordinal_imputer.fit_transform(X_train[CATEGORICAL_ORDINAL])
# reuse the imputer fitted on the train set; do not refit on the test set
X_test[CATEGORICAL_ORDINAL] = ordinal_imputer.transform(X_test[CATEGORICAL_ORDINAL])

In [21]:
nominal_imputer = SimpleImputer(strategy='constant', fill_value='Missing')

In [22]:
X_train[CATEGORICAL_NOMINAL] = nominal_imputer.fit_transform(X_train[CATEGORICAL_NOMINAL])
# reuse the imputer fitted on the train set; do not refit on the test set
X_test[CATEGORICAL_NOMINAL] = nominal_imputer.transform(X_test[CATEGORICAL_NOMINAL])

In [23]:
# print percentage of missing values per variable
X_train.isnull().mean()*100

Gender                     0.0
Age_Range                  0.0
Income_Range               0.0
Occupation                 0.0
Household_Type             0.0
Length_of_Residence        0.0
Home_Value_Range           0.0
Wealth_Rank                0.0
Mail_Buyer                 0.0
Ecommerce_Behav_Rank       0.0
Upscale_Retail_Shopper     0.0
Premium_Bank_Card          0.0
Books_Behav                0.0
Family_Behav               0.0
Health_Magazine            0.0
Personal_Travel            0.0
Sporting_Goods_Interest    0.0
dtype: float64

In [24]:
# print percentage of missing values per variable
X_test.isnull().mean()*100

Gender                     0.0
Age_Range                  0.0
Income_Range               0.0
Occupation                 0.0
Household_Type             0.0
Length_of_Residence        0.0
Home_Value_Range           0.0
Wealth_Rank                0.0
Mail_Buyer                 0.0
Ecommerce_Behav_Rank       0.0
Upscale_Retail_Shopper     0.0
Premium_Bank_Card          0.0
Books_Behav                0.0
Family_Behav               0.0
Health_Magazine            0.0
Personal_Travel            0.0
Sporting_Goods_Interest    0.0
dtype: float64
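The constant-fill strategy can also be seen in isolation on two tiny, made-up frames: a nominal column receives the placeholder category `'Missing'`, and an ordinal column receives the sentinel value `-1`. The frames and variable names are assumptions for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up nominal and ordinal columns with one missing entry each
toy_nominal = pd.DataFrame({"Personal_Travel": ["Y", np.nan, "U"]}, dtype=object)
toy_ordinal = pd.DataFrame({"Books_Behav": [1.0, np.nan, 3.0]})

# constant imputation: the fill value is fixed up front, not learned
nominal_filled = SimpleImputer(strategy="constant", fill_value="Missing").fit_transform(toy_nominal)
ordinal_filled = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(toy_ordinal)

print(nominal_filled.ravel())
print(ordinal_filled.ravel())
```

Using a value that cannot occur naturally (`'Missing'`, `-1`) keeps "was missing" available as a signal for the model instead of hiding it.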

In [25]:
# we copy the imputed data for testing purposes
X_train_imputed1 = X_train.copy()
X_test_imputed1 = X_test.copy()

In [26]:
X_train.shape

(18000, 17)

In [27]:
X_test.shape

(2000, 17)

# 3.- Encoding of categorical variables

Next, we will transform categorical variables using one-hot encoding. This is done for simplicity, as different encoding strategies could have been chosen.

In [28]:
# instantiate encoder, unknown labels will be ignored
enc = OneHotEncoder(handle_unknown='ignore')

# encode train and test sets
X_train_encoded = enc.fit_transform(X_train)
X_test_encoded = enc.transform(X_test)

In [29]:
# let's have a look at the categories that have been encoded
enc.categories_

[array(['F', 'M', 'U'], dtype=object),
 array(['18-24 Years Old', '24-34 Years Old', '25-34 Years Old',
        '35-44 Years Old', '45-54 Years Old', '55-64 Years Old',
        '65-74 Years Old', '75+ Years Old', 'Unknown'], dtype=object),
 array(['$100,000 - $124,999', '$125,000 - $149,999',
        '$150,000 - $174,999', '$175,000 - $199,999', '$20,000 - $29,999',
        '$200,000 - $249,999', '$250,000+', '$30,000 - $39,999',
        '$40,000 - $49,999', '$50,000 - $74,999', '$75,000 - $99,999',
        'Under $20,000'], dtype=object),
 array(['Accountants/CPA', 'Attorneys',
        'Beauty (Cosmetologist, Barber, Manicurist, Nails)',
        'Civil Servant', 'Clergy', 'Clerical/Office',
        'Computer Professional', 'Counselors', 'Dentist/Dental Hygienist',
        'Doctors/Physicians/Surgeons', 'Engineers',
        'Executive/Administrator', 'Financial Services', 'Health Services',
        'Homemaker', 'Middle Management', 'Military', 'Nurses',
        'Pharmacist', 'Professio
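To see what `handle_unknown='ignore'` does in practice, the sketch below fits a one-hot encoder on a made-up column and then transforms a value that never appeared during fitting. The frames and the label `"X"` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# made-up train and test columns; "X" never appears in the train data
toy_train = pd.DataFrame({"Gender": ["F", "M", "U"]}, dtype=object)
toy_test = pd.DataFrame({"Gender": ["M", "X"]}, dtype=object)

toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(toy_train)

# the encoder returns a sparse matrix; densify it to inspect the values
encoded = toy_encoder.transform(toy_test).toarray()
print(encoded)
```

The unseen label is encoded as a row of zeros instead of raising an error, which is what allows the fitted encoder to be applied safely to the test set and to new data.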

# 4.- Save target, train & test sets

Let's save the target as well as the encoded train and test sets for the next notebook.

In [30]:
# converts sparse array to dataframe
df_train1 = pd.DataFrame.sparse.from_spmatrix(X_train_encoded)
df_test1 = pd.DataFrame.sparse.from_spmatrix(X_test_encoded)
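As a quick check of what `pd.DataFrame.sparse.from_spmatrix` produces, the sketch below converts a tiny SciPy sparse matrix and inspects the result. The matrix is made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# made-up one-hot style matrix
dense = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
toy_matrix = sparse.csr_matrix(dense)

# each column becomes a sparse pandas column, labelled 0, 1, 2, ...
toy_df = pd.DataFrame.sparse.from_spmatrix(toy_matrix)

print(toy_df.shape)
print(list(toy_df.columns))
```

Note that the resulting columns are plain integer positions, so the original feature names are not carried over.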

In [31]:
# saves test & train sets
df_train1.to_csv('X_train_engineered.csv', index=False)
df_test1.to_csv('X_test_engineered.csv', index=False)

In [32]:
df_train1.shape

(18000, 168)

In [33]:
df_test1.shape

(2000, 168)

In [34]:
# saves target
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

# 5.- Assemble pre-processing pipeline

## 5.1.- Custom imputer

We have created a custom imputer to package the typecasting and imputation steps into a single object; this custom imputer can be found in the module _custom_transformers.py_.

We will test this custom imputer by splitting data into train and test sets, applying the imputer to these sets, and comparing the result with the one previously obtained.
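The source of `CustomImputer` is not shown in this notebook. As a rough idea of what such a transformer could look like, the sketch below defines a hypothetical `SketchImputer` that follows the same steps as section 2: blanks to `NaN`, ordinal features cast to numbers and filled with -1, nominal features filled with `'Missing'`. The class name and its simplifications (plain `int` instead of the nullable `Int32` dtype, pandas `fillna` instead of `SimpleImputer`) are assumptions; this is not the code in _custom_transformers.py_.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SketchImputer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for CustomImputer (illustrative only)."""

    def __init__(self, ordinal_cols, nominal_cols):
        self.ordinal_cols = ordinal_cols
        self.nominal_cols = nominal_cols

    def fit(self, X, y=None):
        # constant imputation learns nothing from the data
        return self

    def transform(self, X):
        # blanks and whitespace-only strings become NaN (on a copy)
        X = X.replace(r'^\s*$', np.nan, regex=True)
        # ordinal features: string -> float -> int, with -1 for missing
        X[self.ordinal_cols] = X[self.ordinal_cols].astype(float).fillna(-1).astype(int)
        # nominal features: explicit 'Missing' category
        X[self.nominal_cols] = X[self.nominal_cols].fillna("Missing")
        return X


# usage on a made-up frame
toy = pd.DataFrame({"Wealth_Rank": ["8", " ", "3"], "Gender": ["M", "", "F"]}, dtype=object)
toy_out = SketchImputer(["Wealth_Rank"], ["Gender"]).fit_transform(toy)
print(toy_out)
```

Inheriting from `TransformerMixin` provides `fit_transform` for free, and inheriting from `BaseEstimator` is what lets the object sit inside a scikit-learn `Pipeline`.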

In [35]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(data0,
                                                    data0['repeat'],
                                                    test_size=0.1,
                                                    # we are setting the seed here:
                                                    random_state=RANDOM_STATE)  # the data is being shuffled by default

X_train.shape, X_test.shape

((18000, 18), (2000, 18))

In [36]:
# drop target from train and test sets
X_train = X_train.drop('repeat', axis=1)
X_test = X_test.drop('repeat', axis=1)

In [37]:
# instantiate the custom imputer
imputer = CustomImputer(CATEGORICAL_ORDINAL, CATEGORICAL_NOMINAL)

In [38]:
# imputer is applied to the training set
X_train_imputed2 = imputer.fit_transform(X_train)

In [39]:
# verify if the custom imputer works as intended
X_train_imputed2.equals(X_train_imputed1)

True

In [40]:
# the already fitted imputer is applied to the test set (no refitting)
X_test_imputed2 = imputer.transform(X_test)

In [41]:
# verify if the custom imputer works as intended
X_test_imputed2.equals(X_test_imputed1)

True

We are satisfied that the custom imputer performs as intended.

## 5.2.- Custom encoder

A custom encoder has been created to perform one-hot encoding and to convert the resulting sparse array into a dataframe; this custom encoder can be found in the module _custom_transformers.py_.

Just as we did for the custom imputer, we will test this custom encoder by comparing its output with previously obtained results.
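As with the imputer, the source of `CustomEncoder` is not shown here. The sketch below is a hypothetical `SketchEncoder` that wraps `OneHotEncoder` and returns a sparse dataframe, mirroring sections 3 and 4. The class name and the internal attribute `encoder_` are assumptions; the real implementation in _custom_transformers.py_ may differ.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class SketchEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for CustomEncoder (illustrative only)."""

    def fit(self, X, y=None):
        # categories are learned here, from the data passed to fit only
        self.encoder_ = OneHotEncoder(handle_unknown="ignore").fit(X)
        return self

    def transform(self, X):
        # one-hot encode, then wrap the sparse matrix in a dataframe
        return pd.DataFrame.sparse.from_spmatrix(self.encoder_.transform(X))


# usage on made-up, already imputed frames
toy_train = pd.DataFrame({"Gender": ["F", "M"], "Wealth_Rank": [1, -1]}, dtype=object)
toy_test = pd.DataFrame({"Gender": ["X"], "Wealth_Rank": [1]}, dtype=object)

toy_encoder = SketchEncoder()
encoded_train = toy_encoder.fit_transform(toy_train)
encoded_test = toy_encoder.transform(toy_test)  # reuses the train categories
print(encoded_train.shape, encoded_test.shape)
```

Because the categories are fixed at fit time, the train and test outputs always have the same number of columns, even when the test data contains unseen labels.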

In [42]:
# instantiate encoder
encoder = CustomEncoder()

In [43]:
# apply encoder to imputed training set
df_train2 = encoder.fit_transform(X_train_imputed2)

In [44]:
# verify that the encoder works as intended
df_train2.equals(df_train1)

True

In [45]:
# apply encoder to imputed testing set
df_test2 = encoder.transform(X_test_imputed2)

In [46]:
# verify that the encoder works as intended
df_test2.equals(df_test1)

True

We are satisfied that the encoder performs as intended.

## 5.3.- Assemble and save pre-processing pipeline

Now we will use the custom transformers we have defined to create a single pre-processing pipeline; we will then persist this pipeline for production use.

Once again, we will test our pipeline by creating training and testing sets to which we will apply the pipeline; the results will be compared with the ones previously obtained.

In [47]:
# assemble pipeline
preprocessing_pipeline = Pipeline([
        ('imputer', CustomImputer(CATEGORICAL_ORDINAL, CATEGORICAL_NOMINAL)),
        ('encoder', CustomEncoder())
])

In [48]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(data0,
                                                    data0['repeat'],
                                                    test_size=0.1,
                                                    # we are setting the seed here:
                                                    random_state=RANDOM_STATE)  # the data is being shuffled by default

X_train.shape, X_test.shape

((18000, 18), (2000, 18))

In [49]:
# drop target from training and test sets
X_train = X_train.drop('repeat', axis=1)
X_test = X_test.drop('repeat', axis=1)

In [50]:
# apply pipeline to training set
df_train3 = preprocessing_pipeline.fit_transform(X_train)

In [51]:
# verify pipeline performs as intended
df_train3.equals(df_train1)

True

In [52]:
# apply pipeline to test set
df_test3 = preprocessing_pipeline.transform(X_test)

In [53]:
# verify pipeline performs as intended
df_test3.equals(df_test1)

True

In [54]:
# let's inspect the output of the pipeline
preprocessing_pipeline.fit_transform(X_train)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17995,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
17996,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
17997,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
17998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [55]:
# persist pipeline
joblib.dump(preprocessing_pipeline, 'preprocessing.pkl')

['preprocessing.pkl']
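For completeness, the sketch below shows the round trip of a persisted pipeline with `joblib.dump` and `joblib.load`. To stay self-contained it uses a toy pipeline built from scikit-learn's own transformers and a temporary file; the toy data, step names, and file name are assumptions. When loading the real `preprocessing.pkl`, the module _custom_transformers.py_ must be importable, because pickled objects only store a reference to their class.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# made-up data and a toy pipeline standing in for the real one
toy = pd.DataFrame({"Gender": ["F", np.nan, "M"]}, dtype=object)
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
before = toy_pipeline.fit_transform(toy).toarray()

# persist the fitted pipeline, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_preprocessing.pkl")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# the restored pipeline transforms data without being refitted
after = restored.transform(toy).toarray()
round_trip_ok = np.array_equal(before, after)
print(round_trip_ok)
```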

This concludes the feature engineering section for this project.