# Practical 8: Production and Testing

### In this practical note
1. [Training a basic model]()
2. [Data pipeline]()
3. [Serialisation]()
4. [Deployment and working with other technologies]()
5. [Other tips]()
---

This week's practical note introduces the practical side of running Python machine learning models in a production environment. A production environment is where a model receives new data and makes predictions that support decision making. We will discuss the differences between a model in its training phase and in its production phase, how to build a data pipeline, how to serialise models, and other tips related to production. This topic is seldom discussed in a university unit, but I believe it is as important as the theoretical components.

We have learned how to preprocess and clean a dataset, build and validate models, select important features, and conduct both supervised and unsupervised data mining tasks. This practical introduces concepts beyond training: what happens after your model is finished, and how to ensure it is production ready.

To ensure you have sufficient time to finish your workload, nothing discussed this week will be part of the graded unit components (assignment/exam).

## 1. Training a basic model

As a foundation for this practical, we will first train a `LogisticRegression` model. The dataset is `veteran.csv`, which you are already very familiar with (used in practicals 1-5). We will not explain much at this stage as it closely follows the steps in practicals 1-3; please read the comments in the code for step-by-step instructions.

In [1]:
# import required libraries
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV

from dm_tools import data_prep

# preprocessing step
df = data_prep()

# set the random seed - consistent
rs = 10

# train test split
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)
X_mat = X.values  # as_matrix() is deprecated and was removed in pandas 1.0
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.3, stratify=y, random_state=rs)

# initialise a standard scaler object
scaler = StandardScaler()

# learn the mean and std.dev of variables from training data only,
# then use the learned values to transform both training and test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# initial model
init_model = LogisticRegression(random_state=rs)

# fit it to training data
init_model.fit(X_train, y_train)

# training and test accuracy
print("Train accuracy:", init_model.score(X_train, y_train))
print("Test accuracy:", init_model.score(X_test, y_test))

# classification report on test data
y_pred = init_model.predict(X_test)
print(classification_report(y_test, y_pred))

Train accuracy: 0.593510324484
Test accuracy: 0.562284927736
             precision    recall  f1-score   support

          0       0.56      0.57      0.57      1453
          1       0.56      0.55      0.56      1453

avg / total       0.56      0.56      0.56      2906



As we have seen previously, the model performs at an acceptable level on both training and test data. **Assume** that after many experiments with different models and data preprocessing steps, we found that this model produces the best accuracy on validation data. We have decided to use this model for deployment.

## 2. Data Pipeline

An important step before deploying our model into a production system is to ensure that all steps performed in data preparation and feature engineering are consistent and complete. With many replacement, imputation, transformation and dropping steps performed before a dataset is clean, it is common to make mistakes, lose important actions and introduce "data leakage". `sklearn` provides a way to consolidate all steps into one object through its `Pipeline` class.

In building an `sklearn` pipeline, there are 4 important classes used:
1. `BaseEstimator` and `TransformerMixin`, used as parent classes to build transformers. Transformers are classes responsible for applying a preparation or transformation step to a dataset. A transformer is the smallest component in a `Pipeline`.
2. `Pipeline`. A pipeline packs the transformers in your logic sequentially into one function call, making it easier to perform training and predictions. Given 3 transformers A, B and C, a Pipeline runs the input dataset through A, passes the output to B, and then passes that output to C (in the order A->B->C). With a Pipeline, you can ensure consistent steps are applied to training, validation, test and future data. The Pipeline class also implements fit, transform and predict functions, automating many of the function calls commonly used with a model. You can build a Pipeline through `Pipeline()` class initialisation or the `make_pipeline` function. We will mainly use `make_pipeline` as it is simpler to call.
3. `FeatureUnion`. FeatureUnions are quite similar to pipelines. While pipelines run each step sequentially, a `FeatureUnion` applies each transformer to the same input in parallel and joins the results. Instead of A->B->C, a FeatureUnion returns a wider set of results from [A, B, C], concatenated side by side. As with Pipeline, we will use the equivalent helper function, `make_union`.
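
To make the difference between sequential and parallel composition concrete, here is a minimal sketch on a tiny numpy array. The helper functions `double` and `add_one` are toy stand-ins invented for this illustration; they are not part of the practical's dataset code.

```python
import numpy as np
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer


def double(a):
    """Toy transformer A: multiply every value by 2."""
    return a * 2


def add_one(a):
    """Toy transformer B: add 1 to every value."""
    return a + 1


X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Pipeline: A -> B, the output of `double` is fed into `add_one`
sequential = make_pipeline(FunctionTransformer(double), FunctionTransformer(add_one))
seq_out = sequential.fit_transform(X)

# FeatureUnion: [A, B], both see the original X and their outputs are joined column-wise
parallel = make_union(FunctionTransformer(double), FunctionTransformer(add_one))
par_out = parallel.fit_transform(X)

print(seq_out.shape)  # same number of columns as X
print(par_out.shape)  # columns from A followed by columns from B
```

The pipeline keeps the number of columns unchanged because each step replaces the previous output, while the union doubles it because both outputs are kept.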

To start building your pipeline, import classes above as follows:

In [2]:
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion, make_union
from sklearn.base import TransformerMixin, BaseEstimator

Review your `data_prep` function from the `dm_tools` module. It reads the dataset from a `.csv` file into a `pandas` DataFrame and returns a DataFrame too. `sklearn` APIs, however, generally take numpy arrays as input and output. To help us move from pandas DataFrames to numpy arrays/matrices and apply transformation steps, we will create a helper transformer as follows.

In [3]:
class SimpleTransformer(BaseEstimator, TransformerMixin):
    """Apply given transformation into a subset of DataFrame and return numpy array.
    Accepts 3 parameters:
    1. trans_func: transformative function to be applied to selected columns of DataFrame
    2. untrans_func: inverse transformative function to be applied to numpy array (mostly unused)
    3. columns: columns in this DataFrame where the trans_func will be applied
    """
    def __init__(self, trans_func, untrans_func, columns):
        self.transform_func = trans_func
        self.inverse_transform_func = untrans_func
        self.cols = columns

    def fit(self, x, y=None):
        return self

    
    def transform(self, x):
        x = self._get_selection(x)
        return self.transform_func(x) if callable(self.transform_func) else x

    def inverse_transform(self, x):
        return self.inverse_transform_func(x) \
            if callable(self.inverse_transform_func) else x

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.cols
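
One detail worth knowing when you write your own transformers: `BaseEstimator.get_params` (which the printed representation of a pipeline relies on) looks up attributes using the constructor argument names. `SimpleTransformer` stores its arguments under different names (`transform_func`, `cols`, ...), which is likely why the pipeline printouts later in this note show `columns=None, trans_func=None`. The sketch below uses a hypothetical `ColumnSelector` class, not part of `dm_tools`, to show the convention of keeping attribute names identical to argument names.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: select a subset of columns from a DataFrame."""

    def __init__(self, columns=None):
        # attribute name matches the constructor argument name
        self.columns = columns

    def fit(self, X, y=None):
        # nothing to learn, selection is stateless
        return self

    def transform(self, X):
        return X[self.columns]


selector = ColumnSelector(columns=["a"])
params = selector.get_params()  # reads self.columns via the argument name

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
selected = selector.fit_transform(frame)
print(params)
```

Following this convention keeps `get_params`, `set_params` and the printed representation in agreement with how the object was actually constructed.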

After that, we can start building some transformations. In the `data_prep` function we applied transformations to a number of columns (in logical order):
1. Dropped **ID, TargetD, TargetB** from the dataframe.
2. Transformed **DemCluster** into a string datatype.
3. Replaced erroneous values in **DemHomeOwner** and **DemMedIncome**.
4. One hot encoded all categorical columns.
5. Applied mean imputation and standard scaling to all numerical columns.

The first step in this process is to recognise which columns are to be transformed. The code and its explanation are as follows:

In [4]:
# get the original dataset as reference
df2 = pd.read_csv('datasets/veteran.csv')

# drop unnecessary columns - step 1
dropped = df2.drop(['ID', 'TargetD', 'TargetB'], axis=1)

# get list of all columns
all_columns = dropped.columns.tolist()

# set aside columns with special processing steps - step 2 and 3
trans_columns = ['DemCluster', 'DemHomeOwner', 'DemMedIncome']

# get list of all categorical columns - for step 4
obj_columns = dropped.columns[dropped.dtypes == object].tolist() 
int_columns = list(set(all_columns) - set(obj_columns) - set(trans_columns))  # get int columns
obj_columns = list(set(obj_columns) - set(trans_columns))  # get obj columns without special columns

*A (bracketed number) refers to the matching comment in the code below.*

Once you have set aside all columns for processing, we can start applying transformations to the categorical columns. We will create two different transformers, one for `DemCluster` (1) and one for all other categorical columns (2). (1) converts `DemCluster` into strings and one hot encodes it, while (2) simply one hot encodes all columns passed to it. Results from both transformers are joined using a FeatureUnion.

After you have fixed the object columns, you can use similar steps for `DemMedIncome` to remove noisy 0 values (3) and for `DemHomeOwner` to change it into a binary variable (4). Once all of the above operations are performed, we join the outputs of (1-4), together with the remaining numeric columns, into one using a FeatureUnion (5). The joined data is then mean imputed using an Imputer object (6) and scaled (7).

One advantage of a Pipeline is that you can attach a model at the end of it. This way, when you call the `fit` or `predict` function on the Pipeline, it applies all transformation steps before training the model or making predictions. We will fit a LogisticRegression model at the end of this Pipeline, the same model as our initial model (8).

The code is as follows:

In [5]:
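from sklearn.preprocessing import StandardScaler  # import scaler
try:
    # scikit-learn >= 0.20: Imputer was replaced by SimpleImputer (mean imputation by default)
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    # older scikit-learn releases
    from sklearn.preprocessing import Imputer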

# create your feature union
# transforming `DemCluster` into categorical and running one hot encoding on all categorical columns
string_pipe = make_pipeline(
    make_union(
        SimpleTransformer(lambda x: pd.get_dummies(x.astype(str)), None, ['DemCluster']),  # (1)
        SimpleTransformer(lambda x: pd.get_dummies(x), None, obj_columns)  # (2)
    )
)

# remove noise in `DemMedIncome` and change `DemHomeOwner` into binary
noise_pipe = make_pipeline(
    make_union(
        SimpleTransformer(lambda x: x.replace(0, np.nan), None, ['DemMedIncome']),  # (3)
        SimpleTransformer(lambda x: x.replace(['U', 'H'], [0, 1]), None, ['DemHomeOwner'])  # (4)
    )
)

# union both pipes above with the untouched numeric columns, then impute, scale and fit the model
pipeline = make_pipeline(
    make_union(string_pipe,
               noise_pipe,
               SimpleTransformer(None, None, int_columns)),  # (5)
    Imputer(),  # (6)
    StandardScaler(),  # (7)
    LogisticRegression(random_state=rs))  # (8)

With a pipeline set up, you can easily apply the same preprocessing steps to both training and test data. Fit the pipeline on `df2` and see how it applies all data preprocessing steps and trains the model in one go.

In [6]:
# create X and Y
df_X = df2.drop(['TargetB'], axis=1)
df_y = df2['TargetB']

# split X and Y into train and test for model validation purposes
# notice that df_X, df_X_train and df_X_test are still pandas DataFrame instead of numpy array
df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(df_X, df_y, test_size=0.3,
                                                                stratify=df_y, random_state=rs)

# apply transformation and train model in one go
pipeline.fit(df_X_train, df_y_train)

# score pipeline on training and test data
print("Pipeline training score: ", pipeline.score(df_X_train, df_y_train))
print("Pipeline test score: ", pipeline.score(df_X_test, df_y_test))
print(pipeline)

Pipeline training score:  0.593510324484
Pipeline test score:  0.562284927736
Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline-1', Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('simpletransformer-1', SimpleTransformer(columns=None, trans_func=None, untrans_func=None)), ('simpletransformer-...alty='l2', random_state=10, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])


The pipeline-fitted model performs exactly the same as the initial model on both the training and test datasets.
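
One caveat before relying on this pipeline for future data: `pd.get_dummies` is stateless, so it only creates columns for the categories present in the data it is given. If a batch of new data is missing a category (or contains an unseen one), the encoded columns no longer line up with what the scaler and model were trained on. The sketch below illustrates the problem and one possible remedy on a toy column named `Gender`, which is made up for this example; it is not a change to the pipeline above.

```python
import pandas as pd

train = pd.DataFrame({"Gender": ["F", "M", "U"]})
new_data = pd.DataFrame({"Gender": ["F", "M"]})  # category "U" never appears

train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new_data)

print(list(train_dummies.columns))
print(list(new_dummies.columns))  # one column short of the training layout

# align new data to the training layout: missing columns are added and filled with 0
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(aligned.columns))
```

Remembering the training columns and reindexing is one option; a stateful encoder that learns its categories during `fit` is another.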

## 3. Serialisation

As `sklearn` models and pipelines are stored in memory during training and experiments, you need a way to save them in a persistent format if you are to deploy your model in production. One option is to serialise the model and all pipeline components into a binary file. Binary files allow easy versioning and easy deployment through file transfer. The official `sklearn` documentation recommends using `pickle` to serialise the pipeline and the model; however, in this practical we will use an alternative to `pickle` called `dill`. `dill` allows you to save not only the model, but also the relevant functions used throughout the pipeline, such as the `lambda` functions passed to our transformers.
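
To see why plain `pickle` is not enough for a pipeline like ours, the short sketch below tries to pickle a `lambda` using only the standard library. The `square` function and the `weights` dictionary are toy values invented for this illustration.

```python
import pickle

# same kind of anonymous function that we pass to SimpleTransformer
square = lambda x: x * x

try:
    pickle.dumps(square)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as err:
    # pickle stores functions by name, and a lambda has no importable name
    lambda_pickled = False
    print("pickle could not serialise the lambda:", err)

# plain data, such as a dict of weights, round-trips without trouble
weights = {"coef": [0.1, -0.2], "intercept": 0.5}
restored = pickle.loads(pickle.dumps(weights))
print(lambda_pickled, restored == weights)
```

Because `pickle` records functions by reference to their name, anonymous functions cannot be looked up again on load; `dill` serialises the function body itself instead.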

You can install dill using `pip install dill` command in your Anaconda prompt.

Use the following code to serialise our pipeline through `dill`.

In [7]:
import dill

# show model score before serialising
print("Pipeline properties before dilling\n========\n", pipeline)
print("Model score before dilling: ", pipeline.score(df_X_test, df_y_test))

# dump to file named model.pkl, written in binary mode
with open('model.pkl', 'wb') as f:  # the file is closed automatically after writing
    dill.dump(pipeline, f)

Pipeline properties before dilling
 Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline-1', Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('simpletransformer-1', SimpleTransformer(columns=None, trans_func=None, untrans_func=None)), ('simpletransformer-...alty='l2', random_state=10, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
Model score before dilling:  0.562284927736


Once the model/pipeline is serialised, you can easily transport it anywhere. Provided you have installed `sklearn`, `pandas` and the other key libraries on your target machine, you can reload the pipeline using the code below.

In [8]:
with open('model.pkl', 'rb') as f:  # the file is closed automatically after reading
    dilled_model = dill.load(f)

print("Pipeline properties after dilling\n========\n", dilled_model)
print("Model score after dilling: ", dilled_model.score(df_X_test, df_y_test))

Pipeline properties after dilling
 Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline-1', Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('simpletransformer-1', SimpleTransformer(columns=None, trans_func=None, untrans_func=None)), ('simpletransformer-...alty='l2', random_state=10, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
Model score after dilling:  0.562284927736


The reloaded pipeline produces the same test score as before, so the model was serialised and restored correctly.

## 4. Deployment and working with other technologies

While we have been using Python to build and test our model, your model will often be deployed in a production system built with other languages/platforms, such as Java, Go or Scala. In this case, you need a way for your model and the production system to communicate with each other.

One option is to deploy the model in Python as a web API. With this setup, components built with other languages/platforms in the production system can make predictions by sending data through HTTP requests. All data preprocessing steps are handled by Python. This option is quite popular as it is simple and requires the least amount of work from both the production and data scientist/analyst teams. A major downside of this technique is speed: every prediction pays for an HTTP round trip, and Python is quite slow compared to C, Java or Go. For applications that send large amounts of information for prediction (e.g. pictures for image recognition) or require time-sensitive predictions, this technique might not be suitable.

In this practical, we have prepared a simple web server prototype in `prac08_prediction_server_flask.ipynb`. This web server is built using Flask, a Python-based web framework, and it receives POST requests for predictions at the `http://127.0.0.1:5000/predict` endpoint.

Please run the web server and follow the code in the cell below to send prediction requests to it.

In [10]:
# import requests library to help making POST requests
import requests

# use the first 1000 rows of X_test, serialise it to JSON
post_data = df_X_test.iloc[:1000].to_json()

# post it to the running server
result = requests.post(url='http://127.0.0.1:5000/predict', json=post_data)
print(result.status_code)

200


A status code of 200 means the request was successful. You can see the result of the prediction through `result.text`.

In [11]:
print(result.text)

[1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 
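
A note on the request format used above: `to_json()` already returns a string, and the `json=` argument of `requests.post` JSON-encodes whatever it is given, so the request body ends up being a JSON string that itself contains JSON. The sketch below reproduces this with the standard library only. The payload is a toy stand-in with a made-up column `DemAge`, and the two-step decoding is an assumption about what a receiving server would need to do, not a description of the prepared notebook.

```python
import json

# stand-in for df_X_test.iloc[:1000].to_json(), which returns a str
post_data = json.dumps({"DemAge": {"0": 45, "1": 62}})

# requests.post(..., json=post_data) serialises its argument once more
request_body = json.dumps(post_data)

# the receiving side therefore has to decode twice
first_pass = json.loads(request_body)   # back to the original JSON string
second_pass = json.loads(first_pass)    # finally a dict that can become a DataFrame

print(type(first_pass).__name__, type(second_pass).__name__)
```

On the client side, `result.json()` parses the response body into a Python list, which is usually more convenient than working with `result.text`.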

Another option is to replicate the data processing steps and export the model weights to another language. This is commonly used where high performance/low prediction latency is necessary. As we have learned in previous practicals and lectures, once training is finished, many data mining models can be represented as a series of weights and connections (LogisticRegression being the best example). These weights can be easily exported/serialised and used in other technologies/languages. A clear disadvantage of this deployment option is that far more work is required, as the production team has to re-implement the model exactly as it was during training. With many complicated operations and data preprocessing steps in a pipeline, this re-implementation can be prone to bugs.
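
As a rough illustration of this option, the sketch below trains a scaler and a logistic regression on synthetic data, exports the learned numbers as JSON, and re-implements the prediction with plain loops, the way a production team might in another language. The data, the JSON field names and the `predict_from_weights` function are all assumptions made for this example.

```python
import json
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# synthetic training data, only for illustration
rng = np.random.RandomState(10)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

scaler = StandardScaler().fit(X)
model = LogisticRegression(random_state=10).fit(scaler.transform(X), y)

# export everything needed to reproduce scaling and prediction
exported = json.dumps({
    "mean": scaler.mean_.tolist(),
    "scale": scaler.scale_.tolist(),
    "coef": model.coef_[0].tolist(),
    "intercept": float(model.intercept_[0]),
})


def predict_from_weights(payload, rows):
    """Re-implementation using only the exported numbers (no sklearn needed)."""
    p = json.loads(payload)
    predictions = []
    for row in rows:
        z = p["intercept"]
        for x, m, s, w in zip(row, p["mean"], p["scale"], p["coef"]):
            z += w * (x - m) / s  # standard scaling followed by the linear term
        predictions.append(1 if z > 0 else 0)  # z > 0 means probability > 0.5
    return predictions


manual = predict_from_weights(exported, X.tolist())
agreement = np.mean(np.array(manual) == model.predict(scaler.transform(X)))
print("Agreement with sklearn predictions:", agreement)
```

Comparing the re-implementation against the original model on the same rows, as in the last lines, is a cheap way to catch the bugs mentioned above.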

## 5. Other tips

This section discusses various performance tips that you can use to help your model work better in a production setting.

### 5.1. Batch prediction

In many production systems, real-time, one-at-a-time predictions are not required. Often prediction tasks can be queued and delayed, then processed in batches. Batch processing allows the model to take advantage of vectorised computations and SIMD (single instruction, multiple data) instructions, which can speed up computation significantly.
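
The sketch below contrasts scoring queued requests one at a time with scoring them as a single batch. The weights and the queued rows are random placeholders rather than anything learned from `veteran.csv`; the point is only that both approaches give the same answers while the batch version hands the whole computation to numpy in one call.

```python
import numpy as np

rng = np.random.RandomState(10)
weights = rng.normal(size=5)              # hypothetical trained coefficients
intercept = 0.1
queued_rows = rng.normal(size=(1000, 5))  # prediction requests collected over time


def predict_one(row):
    """Score a single request with a Python-level loop."""
    z = intercept
    for w, x in zip(weights, row):
        z += w * x
    return int(z > 0)


def predict_batch(rows):
    """Score every queued request with one vectorised matrix-vector product."""
    return (rows @ weights + intercept > 0).astype(int)


one_by_one = np.array([predict_one(row) for row in queued_rows])
batched = predict_batch(queued_rows)
print("Share of identical predictions:", (one_by_one == batched).mean())
```

You could wrap each approach in `%timeit` in a notebook to measure the difference on your own machine.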

As a real-life example, GoCardless, a Y Combinator-funded startup that focuses on cashless payments, uses machine learning models to perform fraud detection. Most payment transactions in the company take 1-2 days, so batch prediction is very suitable for their business case. In the end, the fraud detection is run as a nightly cron job (every 24 hours).

### 5.2. Vertical scaling

Vertical scaling refers to the process of accommodating a growing volume of computing tasks by increasing the computing power available to each task. Typically, vertical scaling is implemented by buying a faster, better computer. Vertical scaling can allow a production system to handle more data mining task requests by running each task faster serially or by using multithreading/multiprocessing.

### 5.3. Horizontal scaling

Unlike vertical scaling, horizontal scaling spreads a large number of requests over a large number of commodity machines. This methodology is the key to scaling your infrastructure in an efficient and effective manner. Horizontal scaling fits data mining model deployments really well because many predictive tasks are stateless (i.e. each prediction/row is considered unrelated to the others), which means a large number of predictions can easily be distributed over many copies of the model.
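
Statelessness is what makes this distribution safe: splitting the rows into chunks and scoring each chunk independently gives the same result as scoring them all in one place. The sketch below imitates this on a single machine with a thread pool, where each worker stands in for a separate server holding a copy of the model. The `predict` rule and the data are toy placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def predict(row):
    """Hypothetical stateless model: the result depends only on the row itself."""
    return 1 if 0.8 * row[0] - 0.3 * row[1] > 0 else 0


def predict_chunk(chunk):
    """Score one chunk of rows, as a single worker would."""
    return [predict(row) for row in chunk]


def split(rows, n_workers):
    """Cut rows into at most n_workers contiguous chunks."""
    size = -(-len(rows) // n_workers)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


rows = [(i % 7, i % 5) for i in range(100)]

# everything scored in one place
serial = predict_chunk(rows)

# the same rows spread over 4 workers; map() returns chunk results in order
with ThreadPoolExecutor(max_workers=4) as pool:
    chunk_results = list(pool.map(predict_chunk, split(rows, 4)))

distributed = [p for chunk in chunk_results for p in chunk]
print(distributed == serial)
```

In a real deployment the workers would be separate processes or machines behind a load balancer, but the chunk-and-merge logic stays the same.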

### 5.4. Testing and retraining

After a deployment, a model must be monitored periodically. As training usually takes place on a snapshot of training data, the model's prediction quality typically degrades as the environment shifts away from the conditions that were originally captured. After a certain timeframe, the error can grow to be significant, and the model has to be retrained or replaced.

In this situation, there are two common methods to update the model. First, you can periodically collect newer data and train a "challenger" model. Both the old model and the new challenger are then tested on a new stream of data. If the challenger outperforms the older model, the new model and its pipeline are deployed.
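
The comparison step can be sketched in a few lines. The two "models" below are toy threshold rules standing in for the deployed pipeline and the newly trained one, and `choose_model` with its `min_gain` margin is a hypothetical helper written for this illustration.

```python
def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return correct / len(labels)


def choose_model(champion, challenger, rows, labels, min_gain=0.0):
    """Keep the champion unless the challenger beats it by more than min_gain."""
    champion_acc = accuracy(champion, rows, labels)
    challenger_acc = accuracy(challenger, rows, labels)
    if challenger_acc > champion_acc + min_gain:
        return "challenger", challenger_acc
    return "champion", champion_acc


# toy stand-ins for the deployed model and the newly trained one
champion = lambda row: 1 if row[0] > 5 else 0
challenger = lambda row: 1 if row[0] > 3 else 0

# freshly collected data where the true decision boundary has drifted to 3
new_rows = [(x,) for x in range(10)]
new_labels = [1 if x > 3 else 0 for x in range(10)]

winner, winner_acc = choose_model(champion, challenger, new_rows, new_labels)
print(winner, winner_acc)
```

Setting `min_gain` above zero avoids redeploying for improvements that are too small to matter.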

The second approach is to use online training. In online training, newer data that becomes available in sequential order is used to train and update the model, unlike batch learning, which waits for a large set of training data. Online learning is also commonly used for training a model over a very large dataset that does not fit in memory. Online-trained models are capable of adapting to newer patterns in data, leading to common usage in time-related tasks such as stock price prediction.
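
In `sklearn`, some estimators support this style of training through a `partial_fit` method. The sketch below uses `SGDClassifier` on synthetic mini-batches; the data generation is made up for this example, and each loop iteration stands in for a new batch of labelled data arriving from production.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(10)
online_model = SGDClassifier(random_state=10)
classes = np.array([0, 1])  # all possible labels must be declared on the first call

for _ in range(20):
    # a new mini-batch becomes available
    X_batch = rng.normal(size=(50, 2))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    # update the existing weights instead of retraining from scratch
    online_model.partial_fit(X_batch, y_batch, classes=classes)

# check the incrementally trained model on unseen data
X_check = rng.normal(size=(200, 2))
y_check = (X_check[:, 0] + X_check[:, 1] > 0).astype(int)
check_accuracy = online_model.score(X_check, y_check)
print("Accuracy after 20 mini-batches:", check_accuracy)
```

Only one mini-batch is held in memory at a time, which is why the same method also suits datasets that are too large to load at once.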