# Building Pipelines (Simplifying Model Deployment)

#### Here we will demonstrate how to build pipeline for our first setup i.e. STMFWI using our best model with most optimum parameters and predicting the outcome variable

In [1]:
# Load Necessary Libraries

import pandas as pd
import numpy as np

from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import RobustScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error 
from sklearn.svm import SVR

import joblib

seed = 123

In [2]:
## Load split file

Xy_train = pd.read_csv('./forest_fires_deploy_train.csv', sep=',')
Xy_test = pd.read_csv('./forest_fires_deploy_test.csv', sep=',')

In [3]:
features = list(Xy_train.drop(columns = 'area').columns)

In [4]:
# Separating outcome and explanatory variables into their resp. datasets

X_train = pd.DataFrame(Xy_train[features])
y_train = pd.DataFrame(Xy_train['area'])
X_test = pd.DataFrame(Xy_test[features])
y_test = pd.DataFrame(Xy_test['area'])

In [5]:
# Converting the outcome variable since our model algorithms expects it as 1-d array

y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
y_train.shape

(359,)

# Pipeline (Part 1)

## Dropping Columns

#### As we will be dropping some columns in our data set, we will create a custom function transformer 'AttributeDeleter' that can be fed to our pipeline.

#### Note: The same operation can be performed using FunctionTransformer as well but for demonstration purpose I have used the base class here. This method is generally useful when we have to perform both fit and transform, e.g. when performing scaling, box-cox transformations etc

In [6]:
class AttributeDeleter():
    "Enter list of columns you want to delete"
    def __init__(self,columns):
        self.columns = columns
    
    def fit(self, X, y = None):
        return self 
        
    def transform(self, X, y = None):
        return X.drop(self.columns, axis = 1)

#### We check from our transformation section for the test dataset (Analysis First Part/File) that we are removing 'rain' column from our dataset.

#### Note: We can check and confirm for any discrepancy(which columns were removed) by comparing the columns of the initial file and the final wrangled file. 
#### We don't remove 'area' column using the transformer as here we are assuming that for any new data we won't be having 'area' variable column, and thus for our train and test datasets its better to remove manually before fitting them in our pipeline.

## Log (1+x) Transformation

#### Next we perform log (1+x) transformation which is a stateless transformation. Also, we had created a custom function to transform positively and negatively skewed columns differently. To be able to fit it to our pipeline, we will make use of 'FunctionTransformer' function which will take our custom 'log1p_tranform' function

In [7]:
# Calculating skewness of dataset

skew_limit = 0.75
skew_vals = X_train.select_dtypes(exclude = 'object').skew().drop(['X','Y','rain'])
skew_cols = skew_vals[abs(skew_vals) > skew_limit].sort_values(ascending = False).index.values

pos_skew = X_train[skew_cols].agg(['skew']).transpose().query('skew > 0').index
neg_skew = X_train[skew_cols].agg(['skew']).transpose().query('skew < 0').index

print(pos_skew)
print(neg_skew)

Index(['RH'], dtype='object')
Index(['DC', 'FFMC'], dtype='object')


In [8]:
def log1p_transform(X):
    """ Log(1+x) transformation for positive and negatively skewed data
    with skew limit set to 0.75
    """    
    pos_skew = ['RH']
    neg_skew = ['DC', 'FFMC']
    
    for cols in pos_skew:
        X[cols] = np.log1p(X[cols])

    # Reversing distribution before applying log transform
    for cols in neg_skew:
        X[cols] = np.log1p(max(X[cols] + 1) - X[cols])
        
    return X

##  Cyclical Encoding

#### We then had performed cyclical encoding for our 'day' and 'month' variables. Again we will make use of 'FunctionTransformer' function which will take our custom 'cyclical_encoding' function

In [9]:
def cyclical_encoding(X):    
    cleanup_nums = { "day": {'fri': 5, 'tue': 2, 'sat': 6, 'sun': 7, 'mon': 1, 'wed': 3, 'thu': 2},
                "month": {'mar': 3, 'oct': 10, 'aug': 8, 'sep': 9, 'apr': 4, 'jun': 5, 'jul': 6,
                          'feb': 2, 'jan': 1, 'dec': 12, 'may': 5, 'nov': 11}
               }
    X = X.replace(cleanup_nums)
    
    X['day_sin'] = np.sin((X.day) * (2 * np.pi/7))
    X['day_cos'] = np.cos((X.day) * (2 * np.pi/7))
    X['month_sin'] = np.sin((X.month) * (2 * np.pi/12))
    X['month_cos'] = np.cos((X.month) * (2 * np.pi/12))
    X = X.drop(columns = ['day' ,'month'])
    return X

#### We can build our first pipeline for the first part now, keeping in mind the order where relevant/important of different operations performed

In [10]:
wrangled_pipeline = Pipeline([('del_attributes', AttributeDeleter(['rain'])), 
                              ('log1p', FunctionTransformer(log1p_transform)),
                              ('cyc_enc', FunctionTransformer(cyclical_encoding))])

#### As we are removing our outcome variable 'area' manually here before fitting the data in pipeline, we shouldn't forget about any transformation/operation performed on our test variables earlier which we will have to now separately.

In [11]:
y_train = np.log1p(y_train)
y_test = np.log1p(y_test)

# Pipeline (Part 2)

## Scaling

#### In the second part, we then performed Robust Scaling to normalize our data features. We can select the specific columns we want to scale with the help of ColumnTransformer and then add it to our pipeline. 

In [12]:
scale_cols = ['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind']

preprocess = ColumnTransformer(remainder = 'passthrough', #passthough features not listed
                               transformers = [('rb_scaler', RobustScaler(), scale_cols)])

## Modelling

#### Now our best model was SVR(Gaussian-RBF kernel). We can load our saved model, extract the best parameters and fit SVR model to the pipeline with best parameters.

In [13]:
model_SVR = joblib.load('best_model_SVR_1')

In [18]:
modelling_pipeline = Pipeline([('rb_scaler', preprocess), 
                              ('SVR_model', SVR(**model_SVR.best_params_))])

# Full Pipeline

#### Our full pipeline can be made by combining the two other pipelines

In [19]:
full_pipeline = Pipeline([('wrangled', wrangled_pipeline), 
                          ('modelling', modelling_pipeline)])

# Training

#### After our pipeline is  set, we can finally train our model by using fit method.

In [20]:
train = full_pipeline.fit(X_train, y_train)

In [21]:
train

Pipeline(steps=[('wrangled',
                 Pipeline(steps=[('del_attributes',
                                  <__main__.AttributeDeleter object at 0x7f802f41a8b0>),
                                 ('log1p',
                                  FunctionTransformer(func=<function log1p_transform at 0x7f802f4248b0>)),
                                 ('cyc_enc',
                                  FunctionTransformer(func=<function cyclical_encoding at 0x7f802f424ca0>))])),
                ('modelling',
                 Pipeline(steps=[('rb_scaler',
                                  ColumnTransformer(remainder='passthrough',
                                                    transformers=[('rb_scaler',
                                                                   RobustScaler(),
                                                                   ['FFMC',
                                                                    'DMC', 'DC',
                                                 

# Testing

#### Now our model is ready and we can finally start predicting test observations.

In [22]:
test = full_pipeline.predict(X_test)

In [23]:
test[:10]

array([ 0.64368803,  0.41610555,  0.80751326,  0.7841918 ,  0.36785279,
        0.53324918,  1.51292203, -0.13628372,  0.43089379,  1.19911324])

#### We can make a function to calcualte 'MAE' and 'RMSE' after predicting values using pipeline which can then be sent back to the user 

In [24]:
pred = dict()
pred_metrics = dict()

pred['MAE'] = mean_absolute_error(y_test, test)
pred['RMSE'] = mean_squared_error(y_test, test, squared = False)
                
pred_metrics['SVR'] = pd.Series(pred)
pd.DataFrame(pred_metrics)

Unnamed: 0,SVR
MAE,1.031718
RMSE,1.413421


# Model Deployment

#### After the pipeline is set we can serve our model into production. There are different approaches each having its own benefits and tradeoffs -> Batch, Realtime (Database Trigger, Pub/Sub, Web-service, inApp).
#### Here we are using sklearn pipeline, so we will deploy a flask, django or fastAPI application through a docker container.