### Machine Learning models as APIs

#### Introduction



#### Environment Setup & Flask Basics

- Create a virtual environment using `Anaconda`. If you want to build your workflows in Python, keep dependencies isolated, or share environment settings, `Anaconda` distributions are a great option.
    * You'll find a miniconda installation for Python [here](https://conda.io/miniconda.html)
    * `wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh`
    * `bash Miniconda3-latest-Linux-x86_64.sh`
    * Follow the sequence of questions.
    * `source ~/.bashrc`
    * If you run: `conda`, you should be able to get the list of commands & help.
    * To create a new environment, run: `conda create --name <environment-name> python=3.6`
    * Follow the steps & once done run: `source activate <environment-name>`
    * Install the Python packages you need; the two important ones are `flask` and `gunicorn`.
    
    
- We'll try out a simple `Flask` Hello-World application and serve it using `gunicorn`:

```python

"""Filename: hello-world.py
"""

from flask import Flask

app = Flask(__name__)

@app.route('/users/<string:username>')
def hello_world(username=None):
    
    return("Hello {}!".format(username))

```

- To serve it, run: `gunicorn --bind 0.0.0.0:8000 hello-world:app` on your terminal.

![Hello World](./images/flaskapp1.png)

- In your browser, try out: `http://localhost:8000/users/any-name`

![Browser](./images/flaskapp2.png)

With a few simple steps, we were able to create web-endpoints that can be accessed locally. 

Using `Flask`, we can wrap our Machine Learning models and serve them as Web APIs easily. And if we want to create more complex web applications (ones that include JavaScript `*gasps*`), we need only a few modifications.
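
Before starting `gunicorn`, the route can also be exercised in-process with Flask's built-in test client. This is a minimal sketch that assumes the same route as `hello-world.py`; the username `any-name` is just an example value:

```python
from flask import Flask

app = Flask(__name__)


@app.route('/users/<string:username>')
def hello_world(username=None):
    return "Hello {}!".format(username)


# The test client sends requests to the app without opening a network port.
client = app.test_client()
resp = client.get('/users/any-name')

status = resp.status_code
body = resp.get_data(as_text=True)
print(status, body)
```

A check like this is handy in unit tests, since it needs neither a running server nor a browser.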

#### Creating a Machine Learning Model

- We'll use the [Loan Prediction Competition](https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii) as our Machine Learning problem. The main objective is to set up a pre-processing pipeline and build ML models in a way that makes predictions easy at deployment time.



In [3]:
import os 
import json
import numpy as np
import pandas as pd
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier

from sklearn.pipeline import make_pipeline

import warnings
warnings.filterwarnings("ignore")

- Saving the datasets in a folder:

In [4]:
!ls /home/pratos/Side-Project/av_articles/data/

test.csv  training.csv


In [5]:
data = pd.read_csv('./data/training.csv')

In [8]:
list(data.columns)

['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History',
 'Property_Area',
 'Loan_Status']

In [9]:
data.shape

(614, 13)

- Finding out the `null/NaN` values in the columns:

In [18]:
for col in data.columns:
    print("The number of null values in:{} == {}".format(col, data[col].isnull().sum()))

The number of null values in:Loan_ID == 0
The number of null values in:Gender == 13
The number of null values in:Married == 3
The number of null values in:Dependents == 15
The number of null values in:Education == 0
The number of null values in:Self_Employed == 32
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 22
The number of null values in:Loan_Amount_Term == 14
The number of null values in:Credit_History == 50
The number of null values in:Property_Area == 0
The number of null values in:Loan_Status == 0
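
The same per-column counts can be obtained without a loop. The sketch below uses a small made-up frame (not the competition data) to show `DataFrame.isnull().sum()`, which returns one count per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data; the values are made up.
toy = pd.DataFrame({
    'Gender': ['Male', None, 'Female'],
    'LoanAmount': [120.0, np.nan, np.nan],
    'Education': ['Graduate', 'Graduate', 'Not Graduate'],
})

# One missing-value count per column, returned as a Series.
null_counts = toy.isnull().sum()
print(null_counts)

# Keep only the columns that actually have missing values.
only_missing = null_counts[null_counts > 0]
print(only_missing)
```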


There are `22 instances` in `LoanAmount` that don't have values. To keep things practical, we'll drop those rows; alternatively, you could impute the values and check whether that affects the final model.

Similarly, we'll drop the rows where `Gender`, `Married` or `Credit_History` is `NaN/null`.

In [19]:
data = data.dropna(subset=['Gender', 'Married', 'Credit_History', 'LoanAmount'])
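
To compare the two options on a small scale, here is a sketch on made-up data that contrasts dropping rows with imputing the median. The column names mirror the dataset, but the values and the choice of the median are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the loan dataset.
toy = pd.DataFrame({
    'LoanAmount': [100.0, np.nan, 140.0, 200.0],
    'Loan_Status': ['Y', 'N', 'Y', 'N'],
})

# Option 1: drop the rows with a missing LoanAmount.
dropped = toy.dropna(subset=['LoanAmount'])

# Option 2: keep every row and fill the gap with the median.
median_amount = toy['LoanAmount'].median()
imputed = toy.assign(LoanAmount=toy['LoanAmount'].fillna(median_amount))

print(len(dropped), len(imputed))
```

In a real workflow the fill value should be computed on the training split only, so that nothing leaks from the test data.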

- Next step is creating `training` and `testing` datasets:

In [20]:
pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome',\
            'LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']

X_train, X_test, y_train, y_test = train_test_split(data[pred_var], data['Loan_Status'], \
                                                    test_size=0.25, random_state=42)

- To make sure that the `pre-processing steps` are followed religiously even after we are done experimenting, and that we do not miss them at prediction time, we'll create a __custom pre-processing Scikit-learn `estimator`__.

To follow the process on how we ended up with this `estimator`, read up on [this notebook]()

In [21]:
class PreProcessing(BaseEstimator, TransformerMixin):
    """Custom Pre-Processing estimator for our use-case
    """

    def __init__(self):
        pass

    def transform(self, df):
        """Regular transform() that is a help for training, validation & testing datasets
           (NOTE: The operations performed here are the ones that we did prior to this cell)
        """
        
        # .copy() lets us assign to columns below without a SettingWithCopyWarning
        df = df.dropna(subset=['Gender', 'Married', 'Credit_History', 'LoanAmount']).copy() #For Testing
        
        df['Dependents'] = df['Dependents'].fillna(0)
        df['Self_Employed'] = df['Self_Employed'].fillna('No')
        df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(self.mean_)
        
        gender_values = {'Female' : 0, 'Male' : 1} 
        married_values = {'No' : 0, 'Yes' : 1}
        education_values = {'Graduate' : 0, 'Not Graduate' : 1}
        employed_values = {'No' : 0, 'Yes' : 1}
        property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
        dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
        df = df.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values,
                         'Self_Employed': employed_values, 'Property_Area': property_values,
                         'Dependents': dependent_values})
        
        return df.values  # as_matrix() was removed in pandas 1.0

    def fit(self, df, y=None, **fit_params):
        """Fitting the Training dataset & calculating the required values from train
           e.g: We will need the mean of X_train['Loan_Amount_Term'] that will be used in
                transformation of X_test
        """
        
        self.mean_ = df['Loan_Amount_Term'].mean()
        return self
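
The split between `fit()` and `transform()` is easier to see on a stripped-down example. The sketch below uses a hypothetical `MeanFiller` transformer (not part of the tutorial's code) and made-up values: the mean is learned from the training frame only and then reused on the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MeanFiller(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fills NaNs in one column with the training mean."""

    def __init__(self, column='Loan_Amount_Term'):
        self.column = column

    def fit(self, df, y=None):
        # Learn the statistic from the data passed to fit() only.
        self.mean_ = df[self.column].mean()
        return self

    def transform(self, df):
        # Work on a copy so the caller's frame is left untouched.
        df = df.copy()
        df[self.column] = df[self.column].fillna(self.mean_)
        return df


# Made-up values for illustration.
train = pd.DataFrame({'Loan_Amount_Term': [360.0, 180.0, np.nan]})
test = pd.DataFrame({'Loan_Amount_Term': [np.nan, 120.0]})

filler = MeanFiller().fit(train)
filled_test = filler.transform(test)
print(filler.mean_, filled_test['Loan_Amount_Term'].tolist())
```

Because the statistic lives on the fitted object, pickling the estimator later also preserves the value needed to transform new data.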

- Convert `y_train` & `y_test` to `np.array`:

In [22]:
y_train = y_train.replace({'Y':1, 'N':0}).values  # as_matrix() was removed in pandas 1.0
y_test = y_test.replace({'Y':1, 'N':0}).values

We'll create a `pipeline` so that all the preprocessing steps and the model are bundled into a single `scikit-learn estimator`.

In [23]:
pipe = make_pipeline(PreProcessing(),
                    RandomForestClassifier())

In [24]:
pipe

Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

To search for the best `hyper-parameters` (`n_estimators`, `max_depth`, `max_leaf_nodes` & `min_impurity_split` for `RandomForestClassifier`), we'll do a `Grid Search`:

- Defining `param_grid`:

In [25]:
param_grid = {"randomforestclassifier__n_estimators" : [10, 20, 30],
             "randomforestclassifier__max_depth" : [None, 6, 8, 10],
             "randomforestclassifier__max_leaf_nodes": [None, 5, 10, 20], 
             "randomforestclassifier__min_impurity_split": [0.1, 0.2, 0.3]}

- Setting up the `Grid Search`:

In [26]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
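
The `randomforestclassifier__` prefix in `param_grid` comes from the step names that `make_pipeline` generates: each step is named after its lowercased class name, and `<step>__<parameter>` addresses a parameter of that step. A small sketch, using `StandardScaler` purely as a stand-in for the custom pre-processing step:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler is only a placeholder first step for this illustration.
pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

# Step names are derived from the class names.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Every key accepted in param_grid appears in get_params().
valid_keys = pipe.get_params().keys()
print('randomforestclassifier__n_estimators' in valid_keys)
```

Checking `pipe.get_params().keys()` is a quick way to catch a misspelled key before launching a long search.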

- Fitting the training data on the `pipeline estimator`:

In [27]:
grid.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impu..._jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestclassifier__n_estimators': [10, 20, 30], 'randomforestclassifier__min_impurity_split': [0.1, 0.2, 0.3], 'randomforestclassifier__max_leaf_nodes': [None, 5, 10, 20], 'randomforestclassifier__max_depth': [None, 6, 8, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

- Let's see which parameters the Grid Search selected:

In [28]:
print("Best parameters: {}".format(grid.best_params_))

Best parameters: {'randomforestclassifier__n_estimators': 20, 'randomforestclassifier__min_impurity_split': 0.2, 'randomforestclassifier__max_leaf_nodes': None, 'randomforestclassifier__max_depth': None}


- Let's score:

In [29]:
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))

Test set score: 0.77
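
When no `scoring` argument is given, `GridSearchCV.score()` falls back to the estimator's own `score()` method, which for a classifier is the mean accuracy. The sketch below checks this on synthetic data from `make_classification`; the dataset and the tiny grid are assumptions for illustration, not the loan data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={'n_estimators': [5, 10]}, cv=3)
grid.fit(X_train, y_train)

# With scoring=None, score() reports accuracy of the refit best estimator.
score = grid.score(X_test, y_test)
accuracy = accuracy_score(y_test, grid.predict(X_test))
print(score, accuracy)
```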


Our `pipeline` looks decent enough to move on to the most important step of the tutorial: __Serialize the Machine Learning Model__

#### Saving the ML Model: Serialization & De-serialization

>In computer science, in the context of data storage, serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.

In Python, `pickling` is a standard way to store objects and retrieve them in their original state. To give a simple example:

In [30]:
list_to_pickle = [1, 'here', 123, 'walker']

#Pickling the list
import pickle

list_pickle = pickle.dumps(list_to_pickle)

In [31]:
list_pickle

b'\x80\x03]q\x00(K\x01X\x04\x00\x00\x00hereq\x01K{X\x06\x00\x00\x00walkerq\x02e.'

When we load the pickle back:

In [32]:
loaded_pickle = pickle.loads(list_pickle)

In [33]:
loaded_pickle

[1, 'here', 123, 'walker']

We can save the `pickled object` to a file as well and use it. This method is similar to creating `.rda` files for folks who are familiar with `R Programming`. 
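
A minimal sketch of the file-based variant, using a temporary directory and a hypothetical file name `list.pkl`:

```python
import os
import pickle
import tempfile

list_to_pickle = [1, 'here', 123, 'walker']

# Hypothetical location: a file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'list.pkl')

# Pickle files are binary, so open them with 'wb' / 'rb'.
with open(path, 'wb') as f:
    pickle.dump(list_to_pickle, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored)
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.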

__NOTE:__ Some people also argue against using `pickle` for serialization[(1)](#no1). `h5py` could also be an alternative.

The `grid object` is the one that needs to be pickled in our case. Instead of `pickle`, you could also use the `joblib` module to work with `pickled objects`:

In [34]:
filename = 'model_v1.pk'

In [35]:
joblib.dump(grid, os.path.join(os.getcwd(), filename))

['/home/pratos/Side-Project/av_articles/model_v1.pk']

So our model will be saved in the location above. Now that the model is `pickled`, the next step is to create a `Flask` wrapper around it.
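
Before building the wrapper, it is worth confirming that a saved model loads back and predicts identically. The sketch below does this with a small stand-in classifier trained on synthetic data and written to a temporary directory; these are assumptions for illustration, not the tutorial's `grid` object:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data.
X, y = make_classification(n_samples=50, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# os.path.join builds the path portably; the directory is temporary.
model_path = os.path.join(tempfile.mkdtemp(), 'model_v1.pk')
saved_files = joblib.dump(model, model_path)

# Load the model back and compare predictions with the original.
loaded_model = joblib.load(model_path)
same_predictions = bool((loaded_model.predict(X) == model.predict(X)).all())
print(saved_files, same_predictions)
```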

Since the `preprocessing` steps required for new incoming data are already part of the `pipeline`, we just have to run `predict()`. When working with `scikit-learn`, `pipelines` make this easy. If a part of your preprocessing, or anything else in the ML workflow, isn't already implemented as a class, do try out [TransformerMixin](http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) to create your own estimators.

`Estimators` and `pipelines` save you time and headaches, even if the initial implementation seems ridiculous. A stitch in time saves nine!

#### Creating an API using Flask

We'll keep the folder structure as simple as possible:

![Folder Struct](./images/flaskapp3.png)

There are three important parts in constructing our wrapper function, `apicall()`:

- Getting the `request` data (for which predictions are to be made)

- Loading our `pickled estimator`

- `jsonify` our predictions and send the response back with `status code: 200`

HTTP messages are made of a header and a body. By convention, most body content is sent in `json` format. We'll send the incoming data as a batch (`POST url-endpoint/`) to get predictions.

(__NOTE:__ You can send plain text, XML, csv or image directly but for the sake of interchangeability of the format, it is advisable to use `json`)

```python
"""Filename: server.py
"""

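import os
from io import StringIO

import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)


def bad_request(message='No data received'):
    """Minimal helper (an assumption; adapt the payload to your API):
    builds a JSON response with status code 400.
    """
    response = jsonify(error=message)
    response.status_code = 400
    return response


@app.route('/predict', methods=['POST'])
def apicall():
    try:
        # The client sends df.to_json(orient='split') wrapped in json.dumps(),
        # so get_json() returns a JSON string that pandas parses with the
        # same orient.
        test_json = request.get_json()
        test = pd.read_json(StringIO(test_json), orient='split')
    except Exception as e:
        raise e

    clf = 'model_v1.pk'

    if test.empty:
        return bad_request()

    # Load the saved model
    loaded_model = joblib.load(os.path.join(os.getcwd(), 'models', clf))
    predictions = loaded_model.predict(test)

    prediction_series = pd.Series(predictions)

    # We can be as creative as we like in sending the responses,
    # but we need to send the response codes as well.
    # The flask module provides "jsonify" to help create secure
    # top-level objects.
    responses = jsonify(predictions=prediction_series.to_json())
    responses.status_code = 200

    return responses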

```

Once done, run: `gunicorn --bind 0.0.0.0:8000 server:app`
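
The request/response contract can also be checked without a trained model or a running server. The sketch below is a simplified variant of the endpoint: `StubModel` is a hypothetical stand-in for the un-pickled estimator, the payload is a plain list of records rather than a DataFrame, and the field values are made up.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class StubModel:
    """Hypothetical stand-in for the un-pickled estimator."""

    def predict(self, rows):
        # Always predicts class 1, one prediction per input row.
        return [1 for _ in rows]


model = StubModel()


@app.route('/predict', methods=['POST'])
def apicall():
    # silent=True returns None instead of raising on a malformed body.
    rows = request.get_json(silent=True)
    if not rows:
        resp = jsonify(error='No rows to score')
        resp.status_code = 400
        return resp
    return jsonify(predictions=model.predict(rows))


client = app.test_client()
ok = client.post('/predict', json=[{'LoanAmount': 120}, {'LoanAmount': 66}])
bad = client.post('/predict', json=[])
print(ok.status_code, ok.get_json(), bad.status_code)
```

Returning an explicit `400` for an empty batch keeps client errors distinguishable from server failures.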

Let's generate some prediction data and query the API running locally at `http://0.0.0.0:8000/predict`

In [27]:
import json
import requests

In [66]:
"""Setting the headers to send and accept json responses
"""
header = {'Content-Type': 'application/json', \
                  'Accept': 'application/json'}

"""Reading test batch
"""
df = pd.read_csv('./flask_api/test_data/X_test.csv', header=None)

"""Converting Pandas Dataframe to json
"""
data = df.to_json(orient='split')

In [67]:
data

'{"columns":[0,1,2,3,4,5,6,7,8,9,10,11,12],"index":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],"data":[[0.11425,0.0,13.89,1.0,0.55,6.373,92.4,3.3633,5.0,276.0,16.4,393.74,10.5],[24.8017,0.0,18.1,0.0,0.693,5.349,96.0,1.7028,24.0,666.0,20.2,396.9,19.77],[0.85204,0.0,8.14,0.0,0.538,5.965,89.2,4.0123,4.0,307.0,21.0,392.53,13.83],[0.19657,22.0,5.86,0.0,0.431,6.226,79.2,8.0555,7.0,330.0,19.1,376.14,10.15],[0.10612,30.0,4.93,0.0,0.428,6.095,65.1,6.3361,6.0,300.0,16.6,394.62,12.4],[0.09178,0.0,4.05,0.0,0.51,6.416,84.1,2.6463,5.0,296.0,16.6,395.5,9.04],[4.22239,0.0,18.1,1.0,0.77,5.803,89.0,1.9047,24.0,666.0,20.2,353.04,14.64],[8.49213,0.0,18.1,0.0,0.584,6.348,86.1,2.0527,24.0,666.0,20.2,83.45,17.64],[0.17899,0.0,9.69,0.0,0.585,5.67,28.8,2.7986,6.0,391.0,19.2,393.29,17.6],[0.537,0.0,6.2,0.0,0.504,5.981,68.1,3.6715,8.0,307.0,17.4,378.35,11.65],[0.09378,12.5,7.87,0.0,0.524,5.889,39.0,5.4509,5.0,311.0,15.2,390.5,15.71],[0.06162,0.0,4.39,0.0,0.442,5.898,52.3,8.0136,3.0,352.0,18.8

In [71]:
"""POST <url>/predict
"""
resp = requests.post("http://0.0.0.0:8000/predict", \
                    data = json.dumps(data),\
                    headers= header)
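
Note that `df.to_json()` already returns a JSON string, so wrapping it in `json.dumps()` encodes it a second time. On the server, `request.get_json()` undoes only the outer layer and yields a string, which is then handed to `pd.read_json`. A standard-library sketch of the two layers, with a made-up payload standing in for `df.to_json(orient='split')`:

```python
import json

# Made-up payload; stands in for the string returned by df.to_json(orient='split').
data = json.dumps({'columns': ['a', 'b'], 'index': [0], 'data': [[1, 2]]})

# What the client puts in the request body: a JSON-encoded JSON string.
body = json.dumps(data)

# First decode (what get_json() does): still a string.
first_pass = json.loads(body)

# Second decode: the actual structure.
second_pass = json.loads(first_pass)
print(type(first_pass).__name__, second_pass['columns'])
```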

In [72]:
resp.status_code

200

In [73]:
"""The final response we get is as follows:
"""
resp.json()

{'predictions': '{"0":24.8240379255,"1":11.8654385585,"2":17.1561912854,"3":20.0458076619,"4":19.2367642081,"5":25.8492084053,"6":22.9006678292,"7":15.9280178003,"8":21.4350486365,"9":22.023735047,"10":21.2012363118,"11":18.5667463494,"12":15.8056937572,"13":21.9883614457,"14":28.400726322,"15":42.9079207048,"16":43.1399770618,"17":24.3422675202,"18":26.679499558,"19":19.185684845,"20":10.3816598089,"21":19.9347015253,"22":27.2866216946,"23":18.8497692623}'}
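
The value under `predictions` is itself a JSON string (the output of `Series.to_json()`), so it needs one more decoding step on the client. A small sketch with made-up values in the same shape:

```python
import json

# Made-up response body in the same shape as the one returned by the API.
response_body = {'predictions': '{"0":1,"1":0,"2":1}'}

# Decode the inner JSON string into a dict keyed by row index.
decoded = json.loads(response_body['predictions'])

# Keys are strings, so sort them numerically to restore row order.
ordered = [decoded[key] for key in sorted(decoded, key=int)]
print(ordered)
```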

#### Final Thoughts

We have half the battle won here: a working API that serves predictions takes us one step closer to integrating our ML solutions right into our products. This is a very basic API that will help with prototyping a data product; turning it into a fully functional, production-ready API requires a few more additions that are outside the scope of Machine Learning.



__Sources & Links:__

[1]. <a id='no1' href="http://www.benfrederickson.com/dont-pickle-your-data/">Don't Pickle your data.</a>

[2]. <a id='no2' href="http://www.dreisbach.us/articles/building-scikit-learn-compatible-transformers/">Building Scikit Learn compatible transformers.</a>

[3]. <a id='no3' href="http://flask.pocoo.org/docs/0.10/security/#json-security">Using jsonify in Flask.</a>