### ML Model

In [1]:
%config IPCompleter.greedy=True

In [191]:
import os
import re
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

Download the training and test data for the practice problem, then take a quick look at the dataset:

In [192]:
data = pd.read_csv('C:/Users/datta/Desktop/flask_api/data/training.csv')

In [193]:
data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [194]:
print("Shape of the data is:{}".format(data.shape))

Shape of the data is:(614, 13)


In [195]:
print("List of columns is: {}".format(list(data.columns)))

List of columns is: ['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status']


Here, `Loan_Status` is our `target variable` and the rest are `predictor variables`. `Loan_ID` is only an identifier and carries no information about `defaulters`, so we leave it out of the final model.

Counting the null (`NaN`) values in each column:

In [196]:
for col in data.columns:
    print("The number of null values in:{} == {}".format(col, data[col].isnull().sum()))

The number of null values in:Loan_ID == 0
The number of null values in:Gender == 13
The number of null values in:Married == 3
The number of null values in:Dependents == 15
The number of null values in:Education == 0
The number of null values in:Self_Employed == 32
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 22
The number of null values in:Loan_Amount_Term == 14
The number of null values in:Credit_History == 50
The number of null values in:Property_Area == 0
The number of null values in:Loan_Status == 0
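
The per-column loop above can also be written as one vectorized call. A minimal sketch, assuming a small made-up frame (`toy` is illustrative, not the loan data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data.
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female"],
    "LoanAmount": [128.0, np.nan, np.nan],
    "Education": ["Graduate", "Graduate", "Not Graduate"],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = toy.isnull().sum()

# Show only the columns that actually contain missing values.
print(null_counts[null_counts > 0])
```

Filtering on `null_counts > 0` is one way to build a list like `missing_pred` without typing the column names by hand.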


We'll check out the values (labels) for the columns having missing values:

In [197]:
missing_pred = ['Dependents', 'Self_Employed', 'Loan_Amount_Term', 'Gender', 'Married']

for _ in missing_pred:
    print("List of unique labels for {}:::{}".format(_, set(data[_])))

List of unique labels for Dependents:::{'2', nan, '0', '1', '3+'}
List of unique labels for Self_Employed:::{nan, 'Yes', 'No'}
List of unique labels for Loan_Amount_Term:::{nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 12.0, 36.0, 300.0, 180.0, 60.0, 84.0, 480.0, 360.0, 240.0, 120.0}
List of unique labels for Gender:::{nan, 'Male', 'Female'}
List of unique labels for Married:::{nan, 'Yes', 'No'}


We'll fill the missing values under the following assumptions:

- `Dependents`: assume there are no dependents
- `Self_Employed`: assume the applicant is not self-employed
- `Loan_Amount_Term`: use the mean loan amount term
- `LoanAmount`: use the mean loan amount
- `Credit_History`: assume the person has a credit history
- `Married`: if nothing is specified, assume the applicant is not married
- `Gender`: assume Male for the missing values
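
Assumptions like these can be collected in a single dictionary and applied with one `fillna` call. The sketch below uses a made-up frame (`toy`) and a hypothetical `fill_values` mapping. In a real workflow, the column statistic should be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few of the loan columns.
toy = pd.DataFrame({
    "Dependents": ["0", np.nan, "3+"],
    "Self_Employed": [np.nan, "Yes", "No"],
    "Credit_History": [1.0, np.nan, 0.0],
    "Loan_Amount_Term": [360.0, np.nan, 120.0],
})

# One default per column; the term default is a statistic of the column itself.
fill_values = {
    "Dependents": "0",
    "Self_Employed": "No",
    "Credit_History": 1,
    "Loan_Amount_Term": toy["Loan_Amount_Term"].mean(),
}

# fillna with a dict fills each listed column with its own value.
filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the defaults in one mapping makes each assumption easy to review and change.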

Before that, we'll split the dataset into training and test sets:

In [198]:
from sklearn.model_selection import train_test_split

In [199]:
list(data.columns)

['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History',
 'Property_Area',
 'Loan_Status']

In [200]:
pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome',\
            'LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']

In [201]:
X_train, X_test, y_train, y_test = train_test_split(data[pred_var], data['Loan_Status'], \
                                                    test_size=0.25, random_state=42)


We'll first work through the `pre-processing` steps one by one, and later bundle them into a custom `estimator`.

In [202]:
X_train['Dependents'] = X_train['Dependents'].fillna('0')
X_train['Self_Employed'] = X_train['Self_Employed'].fillna('No')
X_train['Loan_Amount_Term'] = X_train['Loan_Amount_Term'].fillna(X_train['Loan_Amount_Term'].mean())

In [203]:
X_train['Credit_History'] = X_train['Credit_History'].fillna(1)
X_train['Married'] = X_train['Married'].fillna('No')
X_train['Gender'] = X_train['Gender'].fillna('Male')

In [204]:
X_train['LoanAmount'] = X_train['LoanAmount'].fillna(X_train['LoanAmount'].mean())

Several columns hold `string` labels: `Gender`, `Married`, `Education`, `Self_Employed`, `Property_Area` and `Dependents`. We list the labels and then map them to integers:

In [205]:
label_columns = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Dependents']

for _ in label_columns:
    print("List of unique labels {}:{}".format(_, set(X_train[_])))

List of unique labels Gender:{'Male', 'Female'}
List of unique labels Married:{'Yes', 'No'}
List of unique labels Education:{'Graduate', 'Not Graduate'}
List of unique labels Self_Employed:{'Yes', 'No'}
List of unique labels Property_Area:{'Urban', 'Rural', 'Semiurban'}
List of unique labels Dependents:{'2', '1', '0', '3+'}


In [206]:
gender_values = {'Female' : 0, 'Male' : 1} 
married_values = {'No' : 0, 'Yes' : 1}
education_values = {'Graduate' : 0, 'Not Graduate' : 1}
employed_values = {'No' : 0, 'Yes' : 1}
property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
X_train.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values, \
                'Self_Employed': employed_values, 'Property_Area': property_values, 'Dependents': dependent_values}\
                , inplace=True)
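
One caveat with dictionary-based encoding is that `replace` silently leaves any label missing from the dictionary unchanged. As a sketch of one way to surface such labels, `Series.map` turns them into `NaN`. The `"Suburban"` label here is made up for illustration.

```python
import pandas as pd

property_values = {"Rural": 0, "Urban": 1, "Semiurban": 2}

# "Suburban" is a hypothetical label that the mapping does not know about.
toy = pd.Series(["Urban", "Rural", "Semiurban", "Suburban"])

# map() returns NaN for every label that is not a key of the dictionary.
encoded = toy.map(property_values)

# Collect the labels that could not be encoded.
unseen = toy[encoded.isnull()].unique().tolist()
print(unseen)
```

A check like this is cheap to run before handing new data to a model.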

In [207]:
X_train.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
92,1,1,2,1,0,3273,1820.0,81.0,360.0,1.0,1
304,1,0,0,0,0,4000,2500.0,140.0,360.0,1.0,0
68,1,1,3,1,1,7100,0.0,125.0,60.0,1.0,1
15,1,0,0,0,0,4950,0.0,125.0,360.0,1.0,1
211,1,1,3,0,0,3430,1250.0,128.0,360.0,0.0,2


In [208]:
X_train.dtypes

Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
dtype: object

In [209]:
for _ in X_train.columns:
    print("The number of null values in:{} == {}".format(_, X_train[_].isnull().sum()))

The number of null values in:Gender == 0
The number of null values in:Married == 0
The number of null values in:Dependents == 0
The number of null values in:Education == 0
The number of null values in:Self_Employed == 0
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 0
The number of null values in:Loan_Amount_Term == 0
The number of null values in:Credit_History == 0
The number of null values in:Property_Area == 0


In [95]:
y_train 

92     1
304    1
68     1
15     1
211    0
268    0
88     1
514    0
117    1
395    1
419    1
33     1
0      1
607    1
177    0
356    1
399    0
499    0
259    0
22     0
456    1
116    1
89     1
583    0
443    1
18     0
380    1
446    1
144    1
290    1
      ..
58     1
474    1
560    1
252    1
21     1
313    1
459    0
160    1
276    1
191    0
385    1
413    1
491    1
343    1
308    0
130    1
99     1
372    1
87     1
458    1
330    1
214    1
466    0
121    1
20     0
71     1
106    1
270    1
435    1
102    1
Name: Loan_Status, Length: 460, dtype: int64

In [90]:
X_train

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
92,1,1,2,1,0,3273,1820.0,81.000000,360.0,1.0,1
304,1,0,0,0,0,4000,2500.0,140.000000,360.0,1.0,0
68,1,1,3,1,1,7100,0.0,125.000000,60.0,1.0,1
15,1,0,0,0,0,4950,0.0,125.000000,360.0,1.0,1
211,1,1,3,0,0,3430,1250.0,128.000000,360.0,0.0,2
268,0,0,0,0,0,3418,0.0,135.000000,360.0,1.0,0
88,1,0,0,0,0,8566,0.0,210.000000,360.0,1.0,1
514,1,0,0,0,0,5815,3666.0,311.000000,360.0,1.0,0
117,1,1,1,0,0,2214,1398.0,85.000000,360.0,1.0,1
395,1,1,2,0,0,3276,484.0,135.000000,360.0,1.0,2



Converting the pandas dataframes to numpy arrays:

In [211]:
# as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
X_train = X_train.values

In [22]:
X_train.shape
X_train

array([[  1.,   1.,   2., ..., 360.,   1.,   1.],
       [  1.,   0.,   0., ..., 360.,   1.,   0.],
       [  1.,   1.,   3., ...,  60.,   1.,   1.],
       ...,
       [  0.,   0.,   0., ..., 360.,   1.,   1.],
       [  0.,   0.,   0., ..., 240.,   1.,   2.],
       [  1.,   1.,   0., ..., 360.,   1.,   1.]])

We'll create a custom `pre-processing estimator` that will help us write cleaner pipelines and simplify future deployments:

In [212]:
from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessing(BaseEstimator, TransformerMixin):
    """Custom Pre-Processing estimator for our use-case
    """

    def __init__(self):
        pass

    def transform(self, df):
        """Apply the same imputation and label encoding to training, validation and test data
           (NOTE: These are the operations that we performed step by step prior to this cell)
        """
        pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome',\
            'LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']
        
        # Copy so the caller's frame is left untouched (also avoids SettingWithCopyWarning)
        df = df[pred_var].copy()
        
        df['Dependents'] = df['Dependents'].fillna('0')
        df['Self_Employed'] = df['Self_Employed'].fillna('No')
        df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(self.term_mean_)
        df['Credit_History'] = df['Credit_History'].fillna(1)
        df['Married'] = df['Married'].fillna('No')
        df['Gender'] = df['Gender'].fillna('Male')
        df['LoanAmount'] = df['LoanAmount'].fillna(self.amt_mean_)
        
        gender_values = {'Female' : 0, 'Male' : 1} 
        married_values = {'No' : 0, 'Yes' : 1}
        education_values = {'Graduate' : 0, 'Not Graduate' : 1}
        employed_values = {'No' : 0, 'Yes' : 1}
        property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
        dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
        df = df.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values,
                         'Self_Employed': employed_values, 'Property_Area': property_values,
                         'Dependents': dependent_values})
        
        # .values replaces as_matrix(), which was removed in pandas 1.0
        return df.values

    def fit(self, df, y=None, **fit_params):
        """Fitting the Training dataset & calculating the required values from train
           e.g: We will need the mean of X_train['Loan_Amount_Term'] that will be used in
                transformation of X_test
        """
        
        self.term_mean_ = df['Loan_Amount_Term'].mean()
        self.amt_mean_ = df['LoanAmount'].mean()
        return self

To make sure that this works, let's do a test run for it:

In [213]:
X_train, X_test, y_train, y_test = train_test_split(data[pred_var], data['Loan_Status'], \
                                                    test_size=0.25, random_state=42)

In [25]:
X_train.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
92,Male,Yes,2,Not Graduate,No,3273,1820.0,81.0,360.0,1.0,Urban
304,Male,No,0,Graduate,No,4000,2500.0,140.0,360.0,1.0,Rural
68,Male,Yes,3+,Not Graduate,Yes,7100,0.0,125.0,60.0,1.0,Urban
15,Male,No,0,Graduate,No,4950,0.0,125.0,360.0,1.0,Urban
211,Male,Yes,3+,Graduate,No,3430,1250.0,128.0,360.0,0.0,Semiurban


In [214]:
for col in X_train.columns:
    print("The number of null values in:{} == {}".format(col, X_train[col].isnull().sum()))

The number of null values in:Gender == 11
The number of null values in:Married == 1
The number of null values in:Dependents == 11
The number of null values in:Education == 0
The number of null values in:Self_Employed == 20
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 16
The number of null values in:Loan_Amount_Term == 11
The number of null values in:Credit_History == 36
The number of null values in:Property_Area == 0


In [215]:
preprocess = PreProcessing()

In [216]:
preprocess

PreProcessing()

In [217]:
# Fit on the training split only, so the imputation means come from the training data
preprocess.fit(X_train)

PreProcessing()

In [218]:
X_train_transformed = preprocess.transform(X_train)

In [219]:
X_train_transformed.shape

(460, 11)

So our small experiment with a custom `estimator` worked. It will be useful later, when we build the pipeline.
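
The fit/transform split is what makes this kind of estimator safe to reuse: statistics are learned in `fit()` from the training data and only applied in `transform()`. Below is a minimal sketch with a hypothetical single-column `MeanImputer` (not the `PreProcessing` class above) and made-up numbers:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical imputer: fills one column with the mean seen during fit()."""

    def __init__(self, column="LoanAmount"):
        self.column = column

    def fit(self, df, y=None):
        # Learn the statistic from whatever data fit() receives.
        self.mean_ = df[self.column].mean()
        return self

    def transform(self, df):
        # Work on a copy so the input frame is not modified.
        out = df.copy()
        out[self.column] = out[self.column].fillna(self.mean_)
        return out


train = pd.DataFrame({"LoanAmount": [100.0, 200.0, np.nan]})
test = pd.DataFrame({"LoanAmount": [np.nan, 1000.0]})

# Fit on train only; the test gap is then filled with the training mean.
imputer = MeanImputer().fit(train)
test_filled = imputer.transform(test)
print(test_filled)
```

Calling `fit()` again on the test frame would overwrite `mean_` with a test-set statistic, which is why the transformer should be fitted on the training split only.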

In [220]:
X_test_transformed = preprocess.transform(X_test)

In [221]:
X_test_transformed.shape

(154, 11)

In [222]:
y_test = y_test.replace({'Y':1, 'N':0}).values

In [223]:
y_train = y_train.replace({'Y':1, 'N':0}).values

In [224]:
param_grid = {"randomforestclassifier__n_estimators" : [10, 20, 30],
             "randomforestclassifier__max_depth" : [None, 6, 8, 10],
             "randomforestclassifier__max_leaf_nodes": [None, 5, 10, 20], 
             "randomforestclassifier__min_impurity_split": [0.1, 0.2, 0.3]}

In [225]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

pipe = make_pipeline(PreProcessing(),
                    RandomForestClassifier())

In [226]:
pipe

Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [227]:
from sklearn.model_selection import train_test_split, GridSearchCV

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
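
The keys of `param_grid` follow the pattern `<step name>__<parameter>`, where `make_pipeline` derives each step name from the lowercased class name. A small sketch on synthetic data shows the naming. `StandardScaler`, the toy dataset and the reduced grid are stand-ins chosen for illustration, not part of the loan workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=60, n_features=4, random_state=0)

toy_pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

# Step names are generated automatically from the class names.
step_names = [name for name, _ in toy_pipe.steps]

# Grid keys address a parameter of a step as "<step name>__<parameter>".
toy_grid = GridSearchCV(
    toy_pipe,
    param_grid={"randomforestclassifier__n_estimators": [5, 10]},
    cv=3,
)
toy_grid.fit(X, y)
print(step_names, toy_grid.best_params_)
```

Because the whole pipeline is the estimator being searched, the preprocessing step is re-fitted inside every cross-validation fold.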

In [228]:
grid

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impu..._jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestclassifier__n_estimators': [10, 20, 30], 'randomforestclassifier__max_depth': [None, 6, 8, 10], 'randomforestclassifier__max_leaf_nodes': [None, 5, 10, 20], 'randomforestclassifier__min_impurity_split': [0.1, 0.2, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [229]:
X_train, X_test, y_train, y_test = train_test_split(data[pred_var], data['Loan_Status'], \
                                                    test_size=0.25, random_state=42)

In [258]:
import joblib  # older scikit-learn releases: from sklearn.externals import joblib

joblib.dump(X_train, "./training_data.pkl")
joblib.dump(y_train, "./training_labels.pkl")

['./training_labels.pkl']

In [230]:
grid.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impu..._jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestclassifier__n_estimators': [10, 20, 30], 'randomforestclassifier__max_depth': [None, 6, 8, 10], 'randomforestclassifier__max_leaf_nodes': [None, 5, 10, 20], 'randomforestclassifier__min_impurity_split': [0.1, 0.2, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [233]:
print("Best parameters: {}".format(grid.best_params_))

Best parameters: {'randomforestclassifier__max_depth': None, 'randomforestclassifier__max_leaf_nodes': 20, 'randomforestclassifier__min_impurity_split': 0.2, 'randomforestclassifier__n_estimators': 10}


In [234]:
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))

Test set score: 0.77


- Load the test set:

In [235]:
test_df = pd.read_csv('C:/Users/datta/Desktop/flask_api/data/test.csv', encoding="utf-8-sig")
test_df = test_df[:]

In [236]:
y_test[20:25]

437    Y
361    Y
228    Y
296    Y
509    Y
Name: Loan_Status, dtype: object

In [237]:
len(test_df)

367

In [238]:
grid.predict(test_df)

array(['Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y',
       'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y',
       'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y',
       'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'N',
       'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N',
       'Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y

### Saving the model with pickle

In [70]:
import pickle
filename = 'model_v2.pk'

In [71]:
with open('C:/Users/datta/Desktop/flask_api/notebooks/'+filename, 'wb') as file:
    pickle.dump(grid, file)

So our model will be saved in the location above. Now that the model is `pickled`, the next step is to create a `Flask` wrapper around it.

Before that, to be sure that our `pickled` file works fine -- let's load it back and do a prediction:

In [72]:
with open('C:/Users/datta/Desktop/flask_api/notebooks/'+filename ,'rb') as f:
    loaded_model = pickle.load(f)

In [74]:
test_df[20:25]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
20,LP001121,Male,Yes,1,Not Graduate,No,1888,1620,48.0,360.0,1.0,Urban
21,LP001124,Female,No,3+,Not Graduate,No,2083,0,28.0,180.0,1.0,Urban
22,LP001128,,No,0,Graduate,No,3909,0,101.0,360.0,1.0,Urban
23,LP001135,Female,No,0,Not Graduate,No,3765,0,125.0,360.0,1.0,Urban
24,LP001149,Male,Yes,0,Graduate,No,5400,4380,290.0,360.0,1.0,Urban


In [75]:
loaded_model.predict(test_df[20:25])

array(['Y', 'Y', 'Y', 'Y', 'Y'], dtype=object)
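
The same save-and-reload check can be sketched with only the standard library. Here an `OrderedDict` stands in for the fitted model, and the file name is made up. The last line illustrates that a pickle records which class to rebuild (module and name) rather than the class's code.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

# Stand-in for a fitted model: any picklable Python object behaves the same way.
toy_model = OrderedDict(threshold=3000, labels=["N", "Y"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pk")  # hypothetical file name

    # Save ...
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)

    # ... and load it back.
    with open(path, "rb") as f:
        restored = pickle.load(f)

round_trip_ok = restored == toy_model

# The payload names the class to rebuild; it does not contain the class's code.
class_named_in_payload = b"OrderedDict" in pickle.dumps(toy_model)
print(round_trip_ok, class_named_in_payload)
```

For a pipeline containing a custom class such as `PreProcessing`, this means the class definition has to be importable in the process that loads the file.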

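### Using cPickle

`cPickle` exists only in Python 2. In Python 3 it was merged into `pickle`, which uses the C implementation automatically, so a plain `import cPickle` fails there with `ModuleNotFoundError: No module named 'cPickle'`. The import below works on both:

In [53]:
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# save the classifier
with open('./my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(grid, fid)

# load it again
with open('./my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)

# predict with the object we just loaded
gnb_loaded.predict(test_df)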

### Using Joblib

In [249]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
# save the model to disk
filename = './finalized_model.pkl'
joblib.dump(grid, filename)
 
# some time later...
 
# load the model from disk
loaded_model = joblib.load(filename)
loaded_model.predict(test_df)
#result = loaded_model.score(X_test, Y_test)

array(['Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y',
       'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y',
       'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y',
       'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'N',
       'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N',
       'Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y

In [240]:
training_data = joblib.load('./training_data.pkl')

In [242]:
#training_data

### Calling web API

- Since the `preprocessing` steps required for new incoming data are already part of the `pipeline`, we only have to call `predict()`. With `scikit-learn`, working with `pipelines` keeps this simple.

- `Estimators` and `pipelines` save you time and headaches, even if the initial implementation seems like overkill. A stitch in time saves nine!

In [243]:
import json
import requests

In [250]:
# Set the headers to send and accept JSON
header = {'Content-Type': 'application/json', 'Accept': 'application/json'}

# Read a test batch
df = pd.read_csv('C:\\Users\\datta\\Desktop\\flask_api\\data\\test.csv', encoding="utf-8-sig")
df = df[20:25]

# Convert the pandas DataFrame to JSON
data = df.to_json(orient='records')

In [245]:
#data = {"Loan_ID":"LP001121","Gender":"Male","Married":"Yes","Dependents":"1","Education":"Not Graduate","Self_Employed":"No","ApplicantIncome":1888,"CoapplicantIncome":1620,"LoanAmount":48.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"}
data

'[{"Loan_ID":"LP001121","Gender":"Male","Married":"Yes","Dependents":"1","Education":"Not Graduate","Self_Employed":"No","ApplicantIncome":1888,"CoapplicantIncome":1620,"LoanAmount":48.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001124","Gender":"Female","Married":"No","Dependents":"3+","Education":"Not Graduate","Self_Employed":"No","ApplicantIncome":2083,"CoapplicantIncome":0,"LoanAmount":28.0,"Loan_Amount_Term":180.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001128","Gender":null,"Married":"No","Dependents":"0","Education":"Graduate","Self_Employed":"No","ApplicantIncome":3909,"CoapplicantIncome":0,"LoanAmount":101.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001135","Gender":"Female","Married":"No","Dependents":"0","Education":"Not Graduate","Self_Employed":"No","ApplicantIncome":3765,"CoapplicantIncome":0,"LoanAmount":125.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_

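In [251]:
# POST <url>/predict
# `data` is already a JSON string, so json.dumps() encodes it a second time;
# the server presumably decodes the body accordingly.
resp = requests.post("http://127.0.0.1:5001/predict", data = json.dumps(data), headers= header)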

In [252]:
resp.status_code

200

In [253]:
# The final response we get is as follows:
resp.json()

{'predictions': '[{"0":"LP001121","1":"Y"},{"0":"LP001124","1":"Y"},{"0":"LP001128","1":"Y"},{"0":"LP001135","1":"Y"},{"0":"LP001149","1":"Y"}]'}
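
Note that the value under `'predictions'` is itself a JSON-encoded string, so it needs a second decode before the records can be used. A short sketch using the response shown above follows. Reading key `"0"` as the loan ID and key `"1"` as the predicted status is an interpretation of that output.

```python
import json

# Response body copied from the output above.
body = {
    "predictions": '[{"0":"LP001121","1":"Y"},{"0":"LP001124","1":"Y"},'
                   '{"0":"LP001128","1":"Y"},{"0":"LP001135","1":"Y"},'
                   '{"0":"LP001149","1":"Y"}]'
}

# Second decode: turn the embedded JSON string into a list of dicts.
records = json.loads(body["predictions"])

# Index the predictions by loan ID for easy lookup.
by_loan_id = {rec["0"]: rec["1"] for rec in records}
print(by_loan_id)
```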

There are a few things to keep in mind when adopting an API-first approach:

- Creating APIs out of spaghetti code is next to impossible, so approach your machine learning workflow as if a clean, usable API were the deliverable. It will save you from jumping through hoops later.

- Use version control for both the models and the API code; Flask itself offers no support for versioning. Saving and keeping track of ML models is difficult, so find the least messy way that suits you. [This article](https://medium.com/towards-data-science/how-to-version-control-your-machine-learning-task-cad74dce44c4) talks about ways to do it.

- Specific to sklearn models (as done in this article): if you use custom estimators for preprocessing or any related task, keep the estimator and the training code together so that the estimator class is available whenever the pickled model is loaded.
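
As one lightweight way to keep track of saved models, the file name can carry a date and a short content hash. This is a sketch of a hypothetical naming scheme, not a substitute for a proper model registry. `versioned_name` and the dictionary standing in for a model are made up for illustration.

```python
import hashlib
import pickle
from datetime import date


def versioned_name(obj, prefix="model", day=None):
    """Return (file name, pickled bytes); the name embeds a date and a content hash."""
    payload = pickle.dumps(obj)
    # The first 8 hex characters of the SHA-256 digest identify this exact payload.
    digest = hashlib.sha256(payload).hexdigest()[:8]
    day = day or date.today()
    return "{}_{}_{}.pk".format(prefix, day.isoformat(), digest), payload


# Hypothetical stand-in for a fitted model.
toy_model = {"n_estimators": 10, "max_leaf_nodes": 20}

name, payload = versioned_name(toy_model, day=date(2024, 1, 31))
print(name)
```

Writing `payload` to a file called `name` keeps older models on disk instead of overwriting them, and the hash changes whenever the pickled content does.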

## Retrain the model

In [254]:
# .copy() so that adding a column does not trigger SettingWithCopyWarning
X_test = X_test[:10].copy()
y_test = y_test[:10]
X_test['Loan_Status'] = y_test

In [173]:
X_test = pd.concat([X_test, X_test])  # DataFrame.append was removed in pandas 2.0

In [269]:
training_data = joblib.load("C:/Users/datta/Desktop/flask_api/notebooks/training_data.pkl")
training_data

Unnamed: 0,ApplicantIncome,CoapplicantIncome,Credit_History,Dependents,Education,Gender,LoanAmount,Loan_Amount_Term,Married,Property_Area,Self_Employed
92,3273,1820.0,1.0,2,Not Graduate,Male,81.0,360.0,Yes,Urban,No
304,4000,2500.0,1.0,0,Graduate,Male,140.0,360.0,No,Rural,No
68,7100,0.0,1.0,3+,Not Graduate,Male,125.0,60.0,Yes,Urban,Yes
15,4950,0.0,1.0,0,Graduate,Male,125.0,360.0,No,Urban,No
211,3430,1250.0,0.0,3+,Graduate,Male,128.0,360.0,Yes,Semiurban,No
268,3418,0.0,1.0,0,Graduate,Female,135.0,360.0,No,Rural,
88,8566,0.0,1.0,0,Graduate,Male,210.0,360.0,No,Urban,No
514,5815,3666.0,1.0,0,Graduate,Male,311.0,360.0,No,Rural,No
117,2214,1398.0,,1,Graduate,Male,85.0,360.0,Yes,Urban,No
395,3276,484.0,,2,Graduate,Male,135.0,360.0,Yes,Semiurban,No


In [270]:
(
    (training_data['Gender'] == "Male")
    & (training_data['Married'] == "Yes")
    & (training_data['Dependents'] == "0")
    & (training_data['Education'] == "Graduate")
    & (training_data['Self_Employed'] == "No")
    & (training_data['ApplicantIncome'] == 9083)
    & (training_data['CoapplicantIncome'] == 0.0)
    & (training_data['LoanAmount'] == 228.0)
    & (training_data['Loan_Amount_Term'] == 360.0)
    & (training_data['Credit_History'] == 1.0)
    & (training_data['Property_Area'] == 'Semiurban')
).any()

True
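
The membership check above can be made reusable by building the mask from a dictionary of column values. A minimal sketch, with a hypothetical helper `contains_row` and a made-up frame:

```python
import pandas as pd

# Hypothetical toy frame with a few of the loan columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "ApplicantIncome": [9083, 2900],
    "Property_Area": ["Semiurban", "Rural"],
})


def contains_row(df, row):
    """Return True if some row of df matches every column/value pair in `row`."""
    mask = pd.Series(True, index=df.index)
    for column, value in row.items():
        # Narrow the mask one column at a time.
        mask &= df[column] == value
    return bool(mask.any())


found = contains_row(
    toy, {"Gender": "Male", "ApplicantIncome": 9083, "Property_Area": "Semiurban"}
)
missing = contains_row(toy, {"Gender": "Male", "ApplicantIncome": 2900})
print(found, missing)
```

Since `NaN` never compares equal to anything, a record with missing fields cannot be matched this way without extra handling.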

In [262]:
data = X_test.to_json(orient='records')

In [263]:
data

'[{"Gender":"Male","Married":"Yes","Dependents":"0","Education":"Graduate","Self_Employed":"No","ApplicantIncome":9083,"CoapplicantIncome":0.0,"LoanAmount":228.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Semiurban","Loan_Status":"Y"},{"Gender":"Male","Married":"Yes","Dependents":"0","Education":"Graduate","Self_Employed":"No","ApplicantIncome":4310,"CoapplicantIncome":0.0,"LoanAmount":130.0,"Loan_Amount_Term":360.0,"Credit_History":null,"Property_Area":"Semiurban","Loan_Status":"Y"},{"Gender":"Male","Married":"Yes","Dependents":"2","Education":"Graduate","Self_Employed":"No","ApplicantIncome":4167,"CoapplicantIncome":1447.0,"LoanAmount":158.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Rural","Loan_Status":"Y"},{"Gender":"Female","Married":"No","Dependents":"0","Education":"Graduate","Self_Employed":"No","ApplicantIncome":2900,"CoapplicantIncome":0.0,"LoanAmount":71.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Rural","Loan_Sta

In [267]:
header = {'Content-Type': 'application/json', 'Accept': 'application/json'}
resp = requests.post("http://127.0.0.1:5001/retrain", data = json.dumps(data), headers= header)
resp.json()

'Retrained model successfully.'

In [177]:
X_test.drop_duplicates(subset=None, keep='first', inplace=True)

In [178]:
X_test

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
350,Male,Yes,0,Graduate,No,9083,0.0,228.0,360.0,1.0,Semiurban,Y
377,Male,Yes,0,Graduate,No,4310,0.0,130.0,360.0,,Semiurban,Y
163,Male,Yes,2,Graduate,No,4167,1447.0,158.0,360.0,1.0,Rural,Y
609,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
132,Male,No,0,Graduate,No,2718,0.0,70.0,360.0,1.0,Semiurban,Y
578,Male,Yes,1,Graduate,No,1782,2232.0,107.0,360.0,1.0,Rural,Y
316,Male,Yes,2,Graduate,No,3717,0.0,120.0,360.0,1.0,Semiurban,Y
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
340,Male,Yes,3+,Not Graduate,No,2647,1587.0,173.0,360.0,1.0,Rural,N
77,Male,Yes,1,Graduate,Yes,1000,3022.0,110.0,360.0,1.0,Urban,N
