# Practical 8: Production and Testing

### In this practical note
1. [Training a basic model]
2. [Data pipeline]
3. [Serialisation]
4. [Other tips]
---

### Important Changelog:

The practical note for this week introduces you to the practical aspect of Python machine learning models in production environment. Production environment is where a model gets new data and makes predictions to help make decision. There are differences in focus between models in training phase compared to models in production phase. This topic is a seldom one to discuss in an university unit, but I believe it is equally as important as other theoretical components.

We have learned how to perform preprocessing and cleaning on a dataset, building and validating model, selecting important features and conducting both supervised and unsupervised data mining tasks. This practical will introduce you on concepts beyond the training, what happened after your model is finished and how to ensure your model is production ready.

To ensure sufficient time to finish your workload, anything discussed in this week will not be graded unit components (assignment/exam).

## 1. Training a basic model

To start this practical, we will train a LogisticRegression model on the `veteran.csv` dataset that you are very familiar with. Follow the next cells for step-by-step instructions.

In [1]:
# import required libraries
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV

from dm_tools import data_prep

# preprocessing step
df = data_prep()

# set the random seed - consistent
rs = 10

# train test split
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)
X_mat = X.as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.3, stratify=y, random_state=rs)

# initialise a standard scaler object
scaler = StandardScaler()

# learn the mean and std.dev of variables from training data
# then use the learned values to transform training data
X_train = scaler.fit_transform(X_train, y_train)
X_test = scaler.transform(X_test)

# initial model
init_model = LogisticRegression(random_state=rs)

# fit it to training data
init_model.fit(X_train, y_train)

# training and test accuracy
print("Train accuracy:", init_model.score(X_train, y_train))
print("Test accuracy:", init_model.score(X_test, y_test))

# classification report on test data
y_pred = init_model.predict(X_test)
print(classification_report(y_test, y_pred))

Train accuracy: 0.593510324484
Test accuracy: 0.562284927736
             precision    recall  f1-score   support

          0       0.56      0.57      0.57      1453
          1       0.56      0.55      0.56      1453

avg / total       0.56      0.56      0.56      2906



As we have seen previously, the model performed on acceptable level on both training and test data. Assume that after many experiments, we have concluded that this model performs the best on all settings. We will use this model as our deployment model.

## 2. Data Pipeline

An important step before deploying our model into production system is to ensure all steps performed in data preparation and feature engineering are consistent and complete. With many steps performed in before the data even gets to the model, it is common to make mistake and lose important actions through data leakage heading somewhere.

There are three major classes used in building `sklearn` data pipelines.
1. `BaseEstimator` and `TransformerMixin`, used as parent classes to build Transformers. Transformers are classes responsible for applying transformation steps onto the dataset.
1. `Pipeline`. Pipelines are a method to pack each step in your logic sequentially into one function call, making it easier to perform training and predictions. Given 3 transformers A, B and C, a Pipeline will run them in order of A->B->C. With a Pipeline, you can ensure consistent steps are applied into to training, validation, test and future data. Pipeline class also implements fit, transform and prediction functions, automating many of the function calls.
2. `FeatureUnion`. While pipelines run each step sequentially, `FeatureUnion` join each step in paralel. Instead of A->B->C, FeatureUnion will return a larger set of results from [A, B, C].

To start with data pipeline, import classes above as follows:

In [2]:
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion, make_union
from sklearn.base import TransformerMixin, BaseEstimator, clone

To help with applying transformation steps onto our dataframe, we will create a helper transformer as follows.

In [3]:
class SimpleTransformer(BaseEstimator, TransformerMixin):
    """Apply given transformation."""
    def __init__(self, trans_func, untrans_func, columns):
        self.transform_func = trans_func
        self.inverse_transform_func = untrans_func
        self.cols = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x = self._get_selection(x)
        return self.transform_func(x) if callable(self.transform_func) else x

    def inverse_transform(self, x):
        return self.inverse_transform_func(x) \
            if callable(self.inverse_transform_func) else x

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.cols

After that, we could start joining transformation operations together. Find the code and their explaination as follows:

In [5]:
from sklearn.preprocessing import StandardScaler, Imputer  # import imputer and scaler

# get the original dataset as reference
df2 = pd.read_csv('datasets/veteran.csv')

# drop unnecessary columns
dropped = df2.drop(['ID', 'TargetD', 'TargetB'], axis=1)

all_columns = dropped.columns.tolist()
obj_columns = dropped.columns[dropped.dtypes == object].tolist()  # separate categorical and numerical columns
trans_columns = ['DemCluster', 'DemHomeOwner', 'DemMedIncome']  # set aside columns with special processing steps

int_columns = list(set(all_columns) - set(obj_columns) - set(trans_columns))
obj_columns = list(set(obj_columns) - set(trans_columns))

# create your first pipeline, transforming `DemCluster` into categorical and running one hot encoding on all categorical columns
string_pipe = make_pipeline(
    FeatureUnion(
        [('demcluster', SimpleTransformer(lambda x: pd.get_dummies(x.astype(str)), None, ['DemCluster'])),
        ('identity', SimpleTransformer(lambda x: pd.get_dummies(x), None, obj_columns))]
    )
)

# help to remove noise in `DemMedIncome`
noise_pipe = SimpleTransformer(lambda x: x.replace(0, np.nan), None, ['DemMedIncome'])

# union both pipes above, impute and scale the dataframe into a numpy array
pipeline = make_pipeline(
    FeatureUnion([('stringpipe', string_pipe), ('noisepipe', noise_pipe), ('ints', SimpleTransformer(None, None, int_columns))]),
    Imputer(),
    StandardScaler(),
    LogisticRegression(random_state=rs))

With a pipeline set up, you could easily apply similar preprocessing steps for both training and test data. Apply the pipeline into `df2` and see how it applies all data preprocessing steps and training for the data.

In [6]:
# create X and Y
df_X = df2.drop(['TargetB'], axis=1)
df_y = df2['TargetB']

# notice that df_X, df_X_train and df_X_test are still pandas DataFrame instead of numpy array
df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(df_X, df_y, stratify=df_y, random_state=rs)

# apply transformation and train model in one go
pipeline.fit(df_X, df_y)

# score pipeline on training and test data
print("Pipeline training score: ", pipeline.score(df_X_train, df_y_train))
print("Pipeline test score: ", pipeline.score(df_X_test, df_y_test))

Pipeline training score:  0.588656387665
Pipeline test score:  0.571841453344


## 3. Serialisation

Once you have your pipeline ready, you could serialise the model and all pipeline components into a binary file. Binary file allow easy versioning and easy deployment through file transfer. `sklearn` official documentation recommends using `pickle` to serialise pipeline and the model.

In [11]:
import pickle

# show model coefficient before pickled
print(init_model.coef_)

# dump to file named model.pkl, written in binary mode
pickle.dump(file=open('model.pkl', 'wb'), obj=init_model)

[[  8.55294819e-02   5.72905765e-02   9.60262361e-02  -7.68268037e-02
   -5.03317638e-02  -3.99429982e-02   7.37564612e-02  -5.93071375e-02
   -1.68332723e-01   2.80447649e-01  -1.39864836e-01   5.06139983e-02
    1.97820127e-01   4.77841914e-02   9.49498242e-02  -4.38512559e-01
    1.10047871e-01   5.51011404e-02   2.70004445e-02   1.25856209e-01
    4.53598078e-02  -2.50445045e-02  -2.30582374e-02   7.76478322e-02
   -2.66611375e-02  -7.17073881e-03   2.82414971e-02   5.45915187e-05
    4.95292495e-02   2.50351269e-02  -5.18782525e-02   2.46976545e-02
    2.03661586e-03   4.34903010e-03  -1.69243519e-04  -1.34790895e-03
   -2.41957340e-02   1.54527250e-02   2.44627356e-02  -5.96265462e-04
    9.41790072e-03   1.79343011e-02  -3.78651082e-02  -1.01814670e-02
   -1.30084774e-02   1.83655032e-02  -7.87823668e-03   8.94038392e-04
   -3.89726384e-04   1.77516004e-02   1.19268189e-02   2.45853350e-02
   -5.11888957e-02   1.11627368e-02  -4.77942174e-02  -1.00081318e-02
   -3.03296908e-03  

Once the model/pipeline is serialise, you could reload it using this code below.

In [12]:
pickled_model = pickle.load(file=open('model.pkl', 'rb'))

print(pickled_model.coef_)

[[  8.55294819e-02   5.72905765e-02   9.60262361e-02  -7.68268037e-02
   -5.03317638e-02  -3.99429982e-02   7.37564612e-02  -5.93071375e-02
   -1.68332723e-01   2.80447649e-01  -1.39864836e-01   5.06139983e-02
    1.97820127e-01   4.77841914e-02   9.49498242e-02  -4.38512559e-01
    1.10047871e-01   5.51011404e-02   2.70004445e-02   1.25856209e-01
    4.53598078e-02  -2.50445045e-02  -2.30582374e-02   7.76478322e-02
   -2.66611375e-02  -7.17073881e-03   2.82414971e-02   5.45915187e-05
    4.95292495e-02   2.50351269e-02  -5.18782525e-02   2.46976545e-02
    2.03661586e-03   4.34903010e-03  -1.69243519e-04  -1.34790895e-03
   -2.41957340e-02   1.54527250e-02   2.44627356e-02  -5.96265462e-04
    9.41790072e-03   1.79343011e-02  -3.78651082e-02  -1.01814670e-02
   -1.30084774e-02   1.83655032e-02  -7.87823668e-03   8.94038392e-04
   -3.89726384e-04   1.77516004e-02   1.19268189e-02   2.45853350e-02
   -5.11888957e-02   1.11627368e-02  -4.77942174e-02  -1.00081318e-02
   -3.03296908e-03  

The model is serialised and stored perfectly.

## 4. Other tips

This section discusses various performance tips that you could use to help your model works better in production setting.

### 4.1. Batch prediction

In many production systems, real-time atomic predictions is not required. Often prediction tasks can be stashed and delayed to be then processed in batch basis. Batch processing allows the model to take advantage of vectorised computations and SIMD instructions (single instruction, multiple data), which speeds up computational performance significantly.

To refer a real-life example, GoCardless, a YCombinator funded startup which focuses on cashless payments, uses machine learning models to perform fraud detection activities. Most payment transactions in the company take 1-2 days, therefore, batch prediction is very suitable for their business case. In the end, the fraud detection is run a nightly-cron job (every 24 hours).

### 4.2. Horisontal scaling

Horisontal scaling refers to the process of accomodating growing volume of computing tasks by spreading it over large number of commodity hardware. This methodology is the key to scale your infrastructure in an efficient and effective manner. Horisontal scaling fits data mining model deployments really well as many predictive tasks are stateless (e.g. each prediction/row are considered to be unrelated from one to another), which mean large number of predictions can easily be distributed over many models.

### 4.1. Testing and retraining

After a deployment, a model must be monitored periodically. As training often took place using a snapshot of training data, the model prediction quality typically degrades as the environment shifts away from the conditions that were originally captured. After a certain timeframe, the error could grow to be significant, thus the model has to be retrained or replaced.

In this condition, there are two common methods to update the model. First, you could periodically collect newer data and train a "challenger" model. Both old model and the new challenger are then tested on new stream of data. If the challenger manages to outperform the older model, this new model and its pipeline are then deployed.

The second paradigm is to use online training. In online training, newer data that becomes available in a sequential order is used to train and update model, fifferent from batch learning which waits for a large set of training data. Online learning is commonly used for training a model over a very large dataset that does not fit the memory. Online trained models are also capable of adapting to newer patterns in data, leading to common usage in time-related tasks such as stock price predictions.