# Pipelines - Basic Template

Before building a machine learning model, we must first work through several important steps: data cleaning (e.g. dealing with missing values), data preparation (e.g. encoding categorical variables), and finally instantiating and fitting the model itself.

A pipeline object is a sequence of instructions that ties these processes together, automating the machine learning workflow whenever new data is available. For example, a pipeline can be saved and used inside a web application; any time new data comes in to be classified or predicted by our model, all of the appropriate preprocessing steps are applied to the data and a prediction is made.

In [1]:
# Libraries

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [2]:
# Import sample data

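# A raw string (r"...") stops the backslashes in the Windows path from being read as escape sequences
df = pd.read_csv(r"Desktop\Learning\DATA SCIENCE INFINITY\Machine Learning\Model Building\data\pipeline_data.csv")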

In [3]:
# Split data into input and output objects

X = df.drop(["purchase"], axis = 1)
y = df["purchase"]

In [4]:
# Split data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

Numeric and categorical columns are handled differently. In this example, missing values in numeric columns will be imputed using the mean, and the columns will then be scaled. Scaling and mean imputation do not apply to categorical columns; there, missing values will be replaced with a constant and the columns will be one-hot encoded.

The procedures for each column type will need to be stored in their own object before bringing them together in one large pipeline. These objects are known as transformers and they are basically mini pipelines waiting to be combined.

Let's start by storing the names of the numeric columns in a list and then do the same for the categorical columns.

In [5]:
# Specify numeric and categorical features

numeric_features = ["age", "credit_score"]
categorical_features = ["gender"]

## Set up Pipelines

Using the objects created above, we can *set up a transformer for each type of column*. This is done using the Pipeline class from sklearn. Inside the parentheses we specify the preprocessing steps we want to apply, and we do this using the *steps* parameter. This parameter takes a list of tuples, each pairing the **what** (a name we choose for the step) with the **how** (the object that carries the step out).

For our numeric columns, we first want to impute any missing values using SimpleImputer(), whose default strategy is the mean. Then, we want to scale the data using StandardScaler().

In [6]:
# Numerical Feature Transformer

numeric_transformer = Pipeline(steps = [("imputer", SimpleImputer()), # "imputer" is the name we give this step; SimpleImputer() is the object that performs it
                                        ("scaler", StandardScaler())])
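
To make the effect of the two steps concrete, here is a minimal sketch that runs the same kind of numeric transformer on a tiny, made-up frame. The data and the `toy_` names are assumptions for illustration only; they are not part of the sample dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the tutorial's dataset); one age is missing
toy_numeric = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                            "credit_score": [300.0, 500.0, 700.0]})

# Same structure as the numeric transformer: impute with the mean, then scale
toy_numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer()),
                                          ("scaler", StandardScaler())])

toy_scaled = toy_numeric_transformer.fit_transform(toy_numeric)

# Each step can be looked up by the name given in its tuple
learned_means = toy_numeric_transformer.named_steps["imputer"].statistics_

print(learned_means)  # column means learned by the imputer
print(toy_scaled)     # imputed and scaled values
```

The missing age is filled with the mean of the observed ages, so after scaling that row sits exactly at zero in the age column.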

The procedure for categorical columns differs slightly because we want to replace missing values with a constant instead of using the mean like we did for numeric columns. We will also one-hot encode the data instead of scaling it. The *handle_unknown* parameter of OneHotEncoder() is set to "ignore", which means that a category that shows up in new data but was not seen during fitting will not raise an error; instead, all of the one-hot columns for that feature are set to zero for that row.

In [7]:
# Categorical Feature Transformer

categorical_transformer = Pipeline(steps = [("imputer", SimpleImputer(strategy = "constant", fill_value = "U")),
                                        ("ohe", OneHotEncoder(handle_unknown = "ignore"))])
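
The behaviour of *handle_unknown* is easiest to see on a small example. The sketch below fits the same kind of categorical transformer on made-up data and then transforms a row containing a category ("X") that was never seen during fitting. The frames and the `toy_` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; object dtype keeps the strings and the NaN together
toy_train = pd.DataFrame({"gender": pd.Series(["M", "F", np.nan], dtype=object)})
toy_new = pd.DataFrame({"gender": pd.Series(["F", "X"], dtype=object)})

toy_categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="U")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
toy_categorical_transformer.fit(toy_train)

# Categories learned during fitting, including the "U" placeholder for missing values
learned_categories = list(toy_categorical_transformer.named_steps["ohe"].categories_[0])

# OneHotEncoder returns a sparse result by default, so convert it for display
encoded_new = toy_categorical_transformer.transform(toy_new).toarray()

print(learned_categories)
print(encoded_new)
```

The row with "F" gets a single 1 in the column for that category, while the row with the unseen "X" is encoded as all zeros rather than causing an error.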

Now that a transformer object has been specified for each column type, we can pass both of them into one overall pipeline using the ColumnTransformer(). The procedure is similar to the above transformers, but we also need to provide the columns that we want to apply the logic to. We have already stored the different column types into a couple of lists, so we will be using those.

In [8]:
# Preprocessing Pipeline

preprocessing_pipeline = ColumnTransformer(transformers = [("numeric" , numeric_transformer, numeric_features),
                                                          ("categorical", categorical_transformer, categorical_features)])

## Apply the Pipeline

To apply the preprocessing pipeline, we create another, larger pipeline object that receives the data, passes it through the preprocessing pipeline, and then passes the result on to a specific model. We will do this for a logistic regression model and a random forest model. In doing so, we will see how a pipeline can help us compare the performance of models on exactly the same data.

In [9]:
# Logistic Regression

clf = Pipeline(steps = [("preprocessing_pipeline", preprocessing_pipeline),
                       ("classifier", LogisticRegression(random_state = 42))])

Our classifier pipeline is now ready to go. Let's train the model and see exactly what is going to happen when the data gets passed through that pipeline. We can then carry out the predictions on the test set and get an accuracy score.

In [10]:
clf.fit(X_train, y_train)

Pipeline(steps=[('preprocessing_pipeline',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'credit_score']),
                                                 ('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='U',
                                                                                 strategy='constant')),
                                                                  ('ohe',
                                                                   OneHotE

In [11]:
y_pred_class = clf.predict(X_test)
accuracy_score(y_test, y_pred_class)

0.85

Notice that the pipeline received the test data unprocessed: if we take a look at our X_test object, we will see that it is still in its original state. The data was passed through the preprocessing pipeline, where numerical and categorical columns were dealt with accordingly, and then into the trained classifier, which gave us the predictions.
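
One way to check this for yourself is sketched below on synthetic data: it confirms that the input frame is left untouched by `predict`, and uses pipeline slicing (`toy_clf[:-1]`) to look at what the classifier actually receives. The synthetic data, the target rule and the `toy_` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data standing in for the sample dataset
rng = np.random.default_rng(42)
n_rows = 60
toy_X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n_rows).astype(float),
    "credit_score": rng.integers(100, 900, size=n_rows).astype(float),
    "gender": pd.Series(rng.choice(["M", "F"], size=n_rows), dtype=object),
})
toy_X.loc[0, "age"] = np.nan     # inject a missing numeric value
toy_X.loc[1, "gender"] = np.nan  # inject a missing categorical value
toy_y = (toy_X["credit_score"] > 500).astype(int)  # made-up target rule

toy_preprocessor = ColumnTransformer(transformers=[
    ("numeric",
     Pipeline(steps=[("imputer", SimpleImputer()),
                     ("scaler", StandardScaler())]),
     ["age", "credit_score"]),
    ("categorical",
     Pipeline(steps=[("imputer", SimpleImputer(strategy="constant", fill_value="U")),
                     ("ohe", OneHotEncoder(handle_unknown="ignore"))]),
     ["gender"]),
])
toy_clf = Pipeline(steps=[("preprocessing_pipeline", toy_preprocessor),
                          ("classifier", LogisticRegression(random_state=42))])
toy_clf.fit(toy_X, toy_y)

# predict() works on the raw frame and does not modify it
toy_X_before = toy_X.copy()
toy_predictions = toy_clf.predict(toy_X)
input_unchanged = toy_X.equals(toy_X_before)

# Slicing off the final step gives only the preprocessing part of the pipeline
toy_processed = toy_clf[:-1].transform(toy_X)

print(input_unchanged)
print(toy_processed.shape)
```

The processed array has no missing values and one column per scaled numeric feature plus one per gender category, while the original frame still contains its NaN entries.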

Let's repeat this for the Random Forest model.

In [12]:
# Random Forest

clf = Pipeline(steps = [("preprocessing_pipeline", preprocessing_pipeline),
                       ("classifier", RandomForestClassifier(random_state = 42))])

clf.fit(X_train, y_train)
y_pred_class = clf.predict(X_test)
accuracy_score(y_test, y_pred_class)

0.85

Coincidentally, on our sample data both models got the same accuracy score, but we were indeed applying separate types of models to the data. Moreover, since we are passing in unprocessed data, we could pass in any new data as well! This brings us to the next part where we will be passing in some brand new data to our pipeline.
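
The comparison above can also be written as a loop over candidate models. The following is a sketch on synthetic data, not the tutorial's dataset; the data, the target rule and the names `candidate_models` and `scores` are assumptions. It uses `clone` so that each pipeline gets its own unfitted copy of the preprocessing step instead of sharing one object.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data standing in for pipeline_data.csv
rng = np.random.default_rng(0)
n_rows = 200
toy_X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n_rows).astype(float),
    "credit_score": rng.integers(100, 900, size=n_rows).astype(float),
    "gender": pd.Series(rng.choice(["M", "F"], size=n_rows), dtype=object),
})
toy_y = (toy_X["credit_score"] > 500).astype(int)  # made-up target rule

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y)

toy_preprocessor = ColumnTransformer(transformers=[
    ("numeric",
     Pipeline(steps=[("imputer", SimpleImputer()),
                     ("scaler", StandardScaler())]),
     ["age", "credit_score"]),
    ("categorical",
     Pipeline(steps=[("imputer", SimpleImputer(strategy="constant", fill_value="U")),
                     ("ohe", OneHotEncoder(handle_unknown="ignore"))]),
     ["gender"]),
])

candidate_models = {
    "logistic_regression": LogisticRegression(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in candidate_models.items():
    # clone() gives every pipeline a fresh, unfitted copy of the preprocessing
    candidate = Pipeline(steps=[("preprocessing_pipeline", clone(toy_preprocessor)),
                                ("classifier", model)])
    candidate.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, candidate.predict(X_te))

print(scores)
```

Because every candidate sees the same split and the same preprocessing definition, any difference in the scores comes from the model itself.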

## Save the Pipeline

Without pipelines, in order to process any new data in the same way that the training data had been processed, we would potentially have to save several objects, including the model itself, an imputer object, an encoding object and so on. With a pipeline, however, we have one clean object, which makes it very easy to save and move to another location, even somewhere like a web application. When new data comes in from a user and needs to be classified or predicted by our model, all of the appropriate preprocessing steps are applied to it and the prediction is made.

Let's go through a simple illustration.

In [13]:
import joblib
joblib.dump(clf, r"Desktop\Learning\DATA SCIENCE INFINITY\Machine Learning\Model Building\data\model.joblib")

['Desktop\\Learning\\DATA SCIENCE INFINITY\\Machine Learning\\Model Building\\data\\model.joblib']

### Import pipeline object and predict on new data

For demonstration purposes, the kernel was restarted and the relevant libraries were loaded before importing the pipeline.

In [1]:
# Libraries

import pandas as pd
import numpy as np
import joblib

In [2]:
# Import pipeline

clf = joblib.load(r"Desktop\Learning\DATA SCIENCE INFINITY\Machine Learning\Model Building\data\model.joblib")

In [3]:
# Create new data

new_data = pd.DataFrame({"age" : [25,np.nan,50],
                        "gender" : ["M", "F", np.nan],
                        "credit_score" : [200,100,500]})

new_data

| | age | gender | credit_score |
|---|---|---|---|
| 0 | 25.0 | M | 200 |
| 1 | NaN | F | 100 |
| 2 | 50.0 | NaN | 500 |


Let's see what we need to do to get some predictions on the new data. 

In [4]:
# Pass new data in and receive predictions

clf.predict(new_data)

array([1, 0, 0], dtype=int64)

That's it! The model has received the input data, passed it through the preprocessing pipeline, then through the trained Random Forest classifier model, and then provided us with the predictions.
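
If you want to convince yourself that nothing is lost when saving, a round trip can be sketched as follows. This is a self-contained illustration under assumptions: the data is made up, only numeric columns are used to keep it short, and the file is written to a temporary directory rather than the path used above.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the tutorial's dataset)
toy_X = pd.DataFrame({"age": [25.0, np.nan, 50.0, 38.0, 61.0, 19.0],
                      "credit_score": [200.0, 100.0, 500.0, 350.0, 640.0, 120.0]})
toy_y = pd.Series([0, 0, 1, 0, 1, 0])

toy_clf = Pipeline(steps=[("imputer", SimpleImputer()),
                          ("scaler", StandardScaler()),
                          ("classifier", LogisticRegression(random_state=42))])
toy_clf.fit(toy_X, toy_y)

# Save to a temporary location, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(toy_clf, model_path)
    restored_clf = joblib.load(model_path)

# The restored pipeline should behave exactly like the original one
same_predictions = np.array_equal(toy_clf.predict(toy_X),
                                  restored_clf.predict(toy_X))
print(same_predictions)
```

In practice the environment that loads the file should have the same library versions as the one that saved it, since the saved object relies on the scikit-learn code being available when it is loaded.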

Pipelines are super clean and slick but they do take away the ease of a "step-by-step" understanding which is key at the start of the process. It may be best to work it all out manually when preprocessing and building the model so we can keep a closer eye on what the data looks like at each step. When done, we can tie it all together into a nice, neat pipeline.
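
To illustrate that the manual route and the pipeline do the same work, the sketch below applies the imputer and the scaler one step at a time, printing the intermediate result, and then compares the final output with that of an equivalent pipeline. The toy data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value per column
toy_numeric = pd.DataFrame({"age": [25.0, np.nan, 50.0, 38.0],
                            "credit_score": [200.0, 100.0, np.nan, 350.0]})

# Manual route: fit and apply each step ourselves, inspecting the data in between
manual_imputer = SimpleImputer()
imputed = manual_imputer.fit_transform(toy_numeric)
print(imputed)  # missing values replaced by column means

manual_scaler = StandardScaler()
manual_result = manual_scaler.fit_transform(imputed)

# Pipeline route: the same two steps tied together in one object
piped_result = Pipeline(steps=[("imputer", SimpleImputer()),
                               ("scaler", StandardScaler())]).fit_transform(toy_numeric)

results_match = np.allclose(manual_result, piped_result)
print(results_match)
```

Working through the manual version first makes it easier to spot problems in an individual step; once the results agree, the pipeline version is the one to keep.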