Steps

0. `Fields & Objectives Spreadsheet` : create a spreadsheet detailing each field, its problems and opportunities, and the work steps
0. `Fields relationship` : document how each field relates to the target field
1. `Fillna` : groupby fillna(median), or most common value
2. `Categories` : determine categorical variables and binning [50-100], get dummies or replace
3. `Text Features` : get distinct characteristics, or replace to aggregate similar texts
4. `Outliers` : substitute outliers above the 0.99 percentile with the 0.99 percentile value
5. `Data types` : all numeric and as matrix
6. `Normalize` : transform most vars to 0-100, or scaling
7. `Var importance` : rank importance and drop correlated variables with low importance (below random)
8. `Data leakage` : determine the format to predict and evaluate models
9. `Pipelines` : create process for automation raw_data -> clean -> features -> predict -> evaluate
10. `Review Best practices` : first impressions matter, consistency, description and communication, explanation
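
The `Outliers` step is not implemented later in this notebook, so here is a minimal sketch of one way it could look. The helper name `cap_at_percentile` and the toy income values are assumptions made for illustration only.

```python
import pandas as pd

def cap_at_percentile(series: pd.Series, q: float = 0.99) -> pd.Series:
    """Replace values above the q-th quantile with the quantile value itself."""
    upper = series.quantile(q)
    return series.clip(upper=upper)

# Toy data (hypothetical): one extreme income among otherwise similar values
incomes = pd.Series([3000, 3200, 4100, 4500, 5000, 81000], name="ApplicantIncome")
capped = cap_at_percentile(incomes, q=0.99)

print(capped)
```

As with the imputation values, the quantile would normally be computed on the training split only and then reused on the test split.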

In [1]:
import os 
import json
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

In [2]:
path="/home/pedro/repos/ml_web_api/flask_api"

In [3]:
data = pd.read_csv(path+'/data/training.csv')

# 1. Diagnose data

In [4]:
data.head(2)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N


In [5]:
for _ in data.columns:
    print("The number of null values in:{} == {}".format(_, data[_].isnull().sum()))

The number of null values in:Loan_ID == 0
The number of null values in:Gender == 13
The number of null values in:Married == 3
The number of null values in:Dependents == 15
The number of null values in:Education == 0
The number of null values in:Self_Employed == 32
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 22
The number of null values in:Loan_Amount_Term == 14
The number of null values in:Credit_History == 50
The number of null values in:Property_Area == 0
The number of null values in:Loan_Status == 0


In [6]:
var_w_missing = [c for c in data.columns if data[c].isnull().sum() > 0]
var_w_missing

['Gender',
 'Married',
 'Dependents',
 'Self_Employed',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History']

In [7]:
for _ in var_w_missing:
    print("List of unique labels for {}:::{}".format(_, list(set(data[_].dropna()))[:10]))

List of unique labels for Gender:::['Female', 'Male']
List of unique labels for Married:::['No', 'Yes']
List of unique labels for Dependents:::['2', '3+', '1', '0']
List of unique labels for Self_Employed:::['No', 'Yes']
List of unique labels for LoanAmount:::[9.0, 17.0, 25.0, 26.0, 30.0, 35.0, 36.0, 40.0, 42.0, 44.0]
List of unique labels for Loan_Amount_Term:::[480.0, 36.0, 360.0, 300.0, 12.0, 240.0, 180.0, 84.0, 120.0, 60.0]
List of unique labels for Credit_History:::[0.0, 1.0]


In [8]:
# Fields with no missing data
[c for c in data.columns if c not in var_w_missing]

['Loan_ID',
 'Education',
 'ApplicantIncome',
 'CoapplicantIncome',
 'Property_Area',
 'Loan_Status']

#### Assumptions for missing data

Could also fillna with the most common value within a cohort (groupby on the fields that have no missing data)
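
A minimal sketch of that cohort idea, assuming `Education` (a field with no missing values) as the grouping key. The toy frame and the helper `most_common` are hypothetical and only illustrate the pattern; they are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in one categorical and one numeric field
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "Self_Employed": ["No", "No", np.nan, "Yes", "Yes", np.nan],
    "LoanAmount": [100.0, 140.0, np.nan, 60.0, 80.0, np.nan],
})

def most_common(s: pd.Series):
    """Most frequent non-null value of s, or NaN if s has no values."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Categorical field: fill with the most common value inside each cohort
toy["Self_Employed"] = (
    toy.groupby("Education")["Self_Employed"]
       .transform(lambda s: s.fillna(most_common(s)))
)

# Numeric field: fill with the cohort median
toy["LoanAmount"] = (
    toy.groupby("Education")["LoanAmount"]
       .transform(lambda s: s.fillna(s.median()))
)

print(toy)
```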

For the rest of the missing values:

- `Dependents`: Assumption that there are no dependents
- `Self_Employed`: Assumption that the applicant is not self-employed
- `Loan_Amount_Term`: Assumption that the loan amount term is median value
- `Credit_History`: Assumption that the person has a credit history
- `Married`: If nothing specified, applicant is not married
- `Gender`: Assuming the gender is Male for the missing values

Before applying these assumptions, we split the dataset into train and test sets

# A. Clean Data

# B. Feature eng & Var Importance w Correlation

# 2. Split into test & train

In [9]:
X_pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome',\
            'LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']

y_pred_var = 'Loan_Status'

X_train, X_test, y_train, y_test = train_test_split(data[X_pred_var], data[y_pred_var], \
                                                    test_size=0.25, random_state=42)

## 2B Cleaning example

In [10]:
X_train = X_train.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
X_train['Dependents'] = X_train['Dependents'].fillna('0')
X_train['Self_Employed'] = X_train['Self_Employed'].fillna('No')
# The mean is used here; the median named in the assumptions is an alternative
X_train['Loan_Amount_Term'] = X_train['Loan_Amount_Term'].fillna(X_train['Loan_Amount_Term'].mean())

X_train['Credit_History'] = X_train['Credit_History'].fillna(1)
X_train['Married'] = X_train['Married'].fillna('No')
X_train['Gender'] = X_train['Gender'].fillna('Male')

# The mean is used here as well; the median is an alternative
X_train['LoanAmount'] = X_train['LoanAmount'].fillna(X_train['LoanAmount'].mean())

# 2C Replace keys

In [11]:
label_columns = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Dependents']

for _ in label_columns:
    print("List of unique labels {}:{}".format(_, set(X_train[_])))

List of unique labels Gender:{'Female', 'Male'}
List of unique labels Married:{'No', 'Yes'}
List of unique labels Education:{'Not Graduate', 'Graduate'}
List of unique labels Self_Employed:{'No', 'Yes'}
List of unique labels Property_Area:{'Urban', 'Rural', 'Semiurban'}
List of unique labels Dependents:{'2', '3+', '1', '0'}


In [12]:
gender_values = {'Female' : 0, 'Male' : 1} 
married_values = {'No' : 0, 'Yes' : 1}
education_values = {'Graduate' : 0, 'Not Graduate' : 1}
employed_values = {'No' : 0, 'Yes' : 1}
property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
X_train.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values, \
                'Self_Employed': employed_values, 'Property_Area': property_values, 'Dependents': dependent_values}\
                , inplace=True)
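
Mapping `Property_Area` to 0, 1, 2 gives the three areas an arbitrary order. The steps list mentions "get dummies" as the alternative; a minimal sketch of that option on a toy column follows (the frame `areas` is hypothetical and the notebook itself keeps the integer mapping).

```python
import pandas as pd

# Toy column (hypothetical rows) using the three labels seen in the data
areas = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})

# One indicator column per label, so no ordering is implied between areas
dummies = pd.get_dummies(areas, columns=["Property_Area"], dtype=int)

print(dummies)
```

Tree-based models usually tolerate integer codes, so this is a design choice rather than a required fix.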

In [13]:
X_train.head(2)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
92,1,1,2,1,0,3273,1820.0,81.0,360.0,1.0,1
304,1,0,0,0,0,4000,2500.0,140.0,360.0,1.0,0


In [14]:
# Check the type of each field
X_train.dtypes

Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
dtype: object

In [15]:
# Check that no nulls remain
for _ in X_train.columns:
    print("The number of null values in:{} == {}".format(_, X_train[_].isnull().sum()))

The number of null values in:Gender == 0
The number of null values in:Married == 0
The number of null values in:Dependents == 0
The number of null values in:Education == 0
The number of null values in:Self_Employed == 0
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 0
The number of null values in:Loan_Amount_Term == 0
The number of null values in:Credit_History == 0
The number of null values in:Property_Area == 0


In [16]:
X_train.head(2)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
92,1,1,2,1,0,3273,1820.0,81.0,360.0,1.0,1
304,1,0,0,0,0,4000,2500.0,140.0,360.0,1.0,0


In [17]:
X_train.shape

(460, 11)

# 2D test -> predict data format

In [18]:
dftest = pd.read_csv(path+'/data/test.csv', encoding="utf-8-sig")
dftest.head(2)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban


In [19]:
"""Converting Pandas Dataframe to json
"""
dftest.loc[0:1]#.to_json(orient='records')

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban


In [20]:
dftest.loc[0:1].to_json(orient='records')

'[{"Loan_ID":"LP001015","Gender":"Male","Married":"Yes","Dependents":"0","Education":"Graduate","Self_Employed":"No","ApplicantIncome":5720,"CoapplicantIncome":0,"LoanAmount":110.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"},{"Loan_ID":"LP001022","Gender":"Male","Married":"Yes","Dependents":"1","Education":"Graduate","Self_Employed":"No","ApplicantIncome":3076,"CoapplicantIncome":1500,"LoanAmount":126.0,"Loan_Amount_Term":360.0,"Credit_History":1.0,"Property_Area":"Urban"}]'

In [21]:
type(dftest.loc[0:1].to_json(orient='records'))

str
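
Since `to_json(orient='records')` returns a plain string, the receiving side has to rebuild a DataFrame before calling `predict`. A minimal sketch of that reverse step, assuming a shortened payload with only a few of the fields shown above:

```python
import json
import pandas as pd

# Shortened payload (assumption: a real request would carry every model field)
payload = ('[{"Loan_ID":"LP001015","Gender":"Male","Dependents":"0",'
           '"ApplicantIncome":5720,"LoanAmount":110.0}]')

records = json.loads(payload)      # list of dicts, one per row
request_df = pd.DataFrame(records)  # back to a DataFrame

print(request_df.dtypes)
```

Note that `Dependents` comes back as a string (`"0"`), which is the form the replacement dictionary for that field expects.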

# 3. Preprocessing Class

Gather all the cleaning steps into a single class

In [22]:
from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessing(BaseEstimator, TransformerMixin):
    """Custom Pre-Processing estimator for our use-case
    """

    def __init__(self):
        pass

    def transform(self, df):
        """Regular transform() that is a help for training, validation & testing datasets
           (NOTE: The operations performed here are the ones that we did prior to this cell)
        """
        pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome',\
            'LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']
        
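        df = df[pred_var].copy()  # copy so the assignments below do not touch the caller's frame
        
        df['Dependents'] = df['Dependents'].fillna('0')  # string '0', as in the cleaning example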
        df['Self_Employed'] = df['Self_Employed'].fillna('No')
        df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(self.term_mean_)
        df['Credit_History'] = df['Credit_History'].fillna(1)
        df['Married'] = df['Married'].fillna('No')
        df['Gender'] = df['Gender'].fillna('Male')
        df['LoanAmount'] = df['LoanAmount'].fillna(self.amt_mean_)
        
        gender_values = {'Female' : 0, 'Male' : 1} 
        married_values = {'No' : 0, 'Yes' : 1}
        education_values = {'Graduate' : 0, 'Not Graduate' : 1}
        employed_values = {'No' : 0, 'Yes' : 1}
        property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
        dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
        df.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values, \
                    'Self_Employed': employed_values, 'Property_Area': property_values, \
                    'Dependents': dependent_values}, inplace=True)
        
        return df.values  # .as_matrix() was removed in pandas 1.0

    def fit(self, df, y=None, **fit_params):
        """Fitting the Training dataset & calculating the required values from train
           e.g: We will need the mean of X_train['Loan_Amount_Term'] that will be used in
                transformation of X_test
        """
        
        self.term_mean_ = df['Loan_Amount_Term'].mean()
        self.amt_mean_ = df['LoanAmount'].mean()
        return self

In [23]:
# Make sure it works
X_pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome',\
            'LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']

y_pred_var = 'Loan_Status'

X_train, X_test, y_train, y_test = train_test_split(data[X_pred_var], data[y_pred_var], \
                                                    test_size=0.25, random_state=42)

In [24]:
X_train.head(2)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
92,Male,Yes,2,Not Graduate,No,3273,1820.0,81.0,360.0,1.0,Urban
304,Male,No,0,Graduate,No,4000,2500.0,140.0,360.0,1.0,Rural


In [25]:
for _ in X_train.columns:
    print("The number of null values in:{} == {}".format(_, X_train[_].isnull().sum()))

The number of null values in:Gender == 11
The number of null values in:Married == 1
The number of null values in:Dependents == 11
The number of null values in:Education == 0
The number of null values in:Self_Employed == 20
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 16
The number of null values in:Loan_Amount_Term == 11
The number of null values in:Credit_History == 36
The number of null values in:Property_Area == 0


In [26]:
# Declare class
preprocess = PreProcessing()

In [27]:
preprocess

PreProcessing()

In [28]:
preprocess.fit(X_train)

PreProcessing()

In [29]:
X_train_transformed = preprocess.transform(X_train)

In [30]:
X_train_transformed.shape

(460, 11)

In [31]:
X_train_transformed[:2,:6]

array([[  1.00000000e+00,   1.00000000e+00,   2.00000000e+00,
          1.00000000e+00,   0.00000000e+00,   3.27300000e+03],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   4.00000000e+03]])

In [32]:
# Do same for test dataset
X_test_transformed = preprocess.transform(X_test)
X_test_transformed.shape

(154, 11)

In [33]:
X_test_transformed[0]

array([  1.00000000e+00,   1.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   9.08300000e+03,
         0.00000000e+00,   2.28000000e+02,   3.60000000e+02,
         1.00000000e+00,   2.00000000e+00])

# 4. Format Y dataset

In [34]:
## Clean y dataset
y_test.head(2)

350    Y
377    Y
Name: Loan_Status, dtype: object

In [35]:
y_test = y_test.replace({'Y':1, 'N':0}).values  # .as_matrix() was removed in pandas 1.0
y_test[:4]

array([1, 1, 1, 1])

In [36]:
y_train = y_train.replace({'Y':1, 'N':0}).values  # .as_matrix() was removed in pandas 1.0

# 5. Set ML classes : pipe and grid

In [37]:
param_grid = {"randomforestclassifier__n_estimators" : [10, 20, 30],
             "randomforestclassifier__max_depth" : [None, 6, 8, 10],
             "randomforestclassifier__max_leaf_nodes": [None, 5, 10, 20], 
             "randomforestclassifier__min_impurity_decrease": [0.1, 0.2, 0.3]}

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

pipe = make_pipeline(PreProcessing(),
                     RandomForestClassifier())

from sklearn.model_selection import train_test_split, GridSearchCV

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)

In [38]:
pipe

Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [39]:
grid

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impu..._jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestclassifier__n_estimators': [10, 20, 30], 'randomforestclassifier__max_depth': [None, 6, 8, 10], 'randomforestclassifier__max_leaf_nodes': [None, 5, 10, 20], 'randomforestclassifier__min_impurity_decrease': [0.1, 0.2, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

# 6. Deploy ML on data

In [40]:
# Make sure it works
X_pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome',\
            'LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']

y_pred_var = 'Loan_Status'

X_train, X_test, y_train, y_test = train_test_split(data[X_pred_var], data[y_pred_var], \
                                                    test_size=0.25, random_state=42)

grid.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impu..._jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestclassifier__n_estimators': [10, 20, 30], 'randomforestclassifier__max_depth': [None, 6, 8, 10], 'randomforestclassifier__max_leaf_nodes': [None, 5, 10, 20], 'randomforestclassifier__min_impurity_decrease': [0.1, 0.2, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [41]:
print("Best parameters: {}".format(grid.best_params_))

Best parameters: {'randomforestclassifier__max_depth': 10, 'randomforestclassifier__max_leaf_nodes': None, 'randomforestclassifier__min_impurity_decrease': 0.1, 'randomforestclassifier__n_estimators': 20}


In [42]:
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))

Test set score: 0.77
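
With `scoring=None`, `grid.score` reports the classifier's mean accuracy. Accuracy alone can hide a model that mostly predicts the majority class, so a confusion matrix is a useful companion. The sketch below uses hypothetical label arrays, not the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true = ['Y', 'Y', 'N', 'Y', 'N', 'Y']
y_pred = ['Y', 'Y', 'Y', 'Y', 'N', 'Y']

acc = accuracy_score(y_true, y_pred)
# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=['N', 'Y'])

print("Accuracy: {:.2f}".format(acc))
print(cm)
```

In the notebook the same calls could be applied to `y_test` and `grid.predict(X_test)`.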


# 7. Predict

In [43]:
# X_test is in its original format
# but because we use pipe in the grid class, the preprocessing is run before the predict method
X_test[:1]

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
350,Male,Yes,0,Graduate,No,9083,0.0,228.0,360.0,1.0,Semiurban


In [44]:
y_hat_predict = grid.predict(X_test)
y_hat_predict[:5]

array(['Y', 'Y', 'Y', 'Y', 'Y'], dtype=object)

# 8. Variable Importance

In [45]:
y_train_b = y_train.replace({"Y": 1, "N":0})

In [46]:
y_train_b[:5]

92     1
304    1
68     1
15     1
211    0
Name: Loan_Status, dtype: int64

In [47]:
## X clean
X_train = X_train.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
X_train['Dependents'] = X_train['Dependents'].fillna('0')
X_train['Self_Employed'] = X_train['Self_Employed'].fillna('No')
# The mean is used here; the median is an alternative
X_train['Loan_Amount_Term'] = X_train['Loan_Amount_Term'].fillna(X_train['Loan_Amount_Term'].mean())

X_train['Credit_History'] = X_train['Credit_History'].fillna(1)
X_train['Married'] = X_train['Married'].fillna('No')
X_train['Gender'] = X_train['Gender'].fillna('Male')

# The mean is used here as well; the median is an alternative
X_train['LoanAmount'] = X_train['LoanAmount'].fillna(X_train['LoanAmount'].mean())

gender_values = {'Female' : 0, 'Male' : 1} 
married_values = {'No' : 0, 'Yes' : 1}
education_values = {'Graduate' : 0, 'Not Graduate' : 1}
employed_values = {'No' : 0, 'Yes' : 1}
property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
X_train.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values, \
                'Self_Employed': employed_values, 'Property_Area': property_values, 'Dependents': dependent_values}\
                , inplace=True)

In [69]:
X_train.head(2)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,randNumCol
92,1,1,2,1,0,3273,1820.0,81.0,360.0,1.0,1,4
304,1,0,0,0,0,4000,2500.0,140.0,360.0,1.0,0,3


In [70]:
X_train['randNumCol_1'] = np.random.randint(1, 6, X_train.shape[0])
X_train['randNumCol_2'] = np.random.randint(10, 16, X_train.shape[0])

In [71]:
X_train.head(2)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,randNumCol,randNumCol_1,randNumCol_2
92,1,1,2,1,0,3273,1820.0,81.0,360.0,1.0,1,4,3,11
304,1,0,0,0,0,4000,2500.0,140.0,360.0,1.0,0,3,3,10


In [72]:
#import numpy as np
#import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X_train, y_train_b)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

#for f in range(X_train.shape[1]):
#    print("%d. feature %d - %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

for rank, idx in enumerate(indices, start=1):
    print("%d. feature (%s) - %d (%f)" % (rank, X_train.columns[idx], idx, importances[idx]))
    
print(range(X_train.shape[1]), importances[indices])

Feature ranking:
1. feature (Credit_History) - 9 (0.277882)
2. feature (LoanAmount) - 7 (0.097886)
3. feature (ApplicantIncome) - 5 (0.097279)
4. feature (CoapplicantIncome) - 6 (0.076965)
5. feature (randNumCol) - 11 (0.071769)
6. feature (randNumCol_1) - 12 (0.071368)
7. feature (randNumCol_2) - 13 (0.070383)
8. feature (Dependents) - 2 (0.051462)
9. feature (Property_Area) - 10 (0.050755)
10. feature (Loan_Amount_Term) - 8 (0.042975)
11. feature (Married) - 1 (0.025952)
12. feature (Education) - 3 (0.023144)
13. feature (Gender) - 0 (0.022846)
14. feature (Self_Employed) - 4 (0.019335)
range(0, 14) [ 0.27788186  0.09788591  0.09727933  0.07696494  0.07176898  0.07136777
  0.07038274  0.05146174  0.05075474  0.04297483  0.025952    0.02314442
  0.02284609  0.01933466]
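
The random columns act as a baseline: a feature that ranks below them carries no more signal than noise in this run. A minimal sketch of that "below random" rule, using the importances copied from the ranking printed above (the variable names `baseline`, `keep` and `drop_candidates` are assumptions):

```python
import pandas as pd

# Importances copied from the ExtraTreesClassifier ranking printed above
importances = pd.Series({
    "Credit_History": 0.277882, "LoanAmount": 0.097886,
    "ApplicantIncome": 0.097279, "CoapplicantIncome": 0.076965,
    "randNumCol": 0.071769, "randNumCol_1": 0.071368, "randNumCol_2": 0.070383,
    "Dependents": 0.051462, "Property_Area": 0.050755,
    "Loan_Amount_Term": 0.042975, "Married": 0.025952,
    "Education": 0.023144, "Gender": 0.022846, "Self_Employed": 0.019335,
})

random_cols = [c for c in importances.index if c.startswith("randNumCol")]
baseline = importances[random_cols].max()  # strongest purely random column

keep = importances[importances > baseline].index.tolist()
drop_candidates = [c for c in importances.index
                   if c not in keep and c not in random_cols]

print("Keep:", keep)
print("Candidates to drop:", drop_candidates)
```

Importances vary between runs and models, so the drop list is best treated as a set of candidates to test rather than a final decision.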


## New Variable Importance

In [73]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
clf = clf.fit(X_train, y_train_b)

In [74]:
features = pd.DataFrame()
features['feature'] = X_train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)

In [75]:
features.sort_values(by="importance", ascending=False)

feature,importance
Credit_History,0.239514
ApplicantIncome,0.145982
LoanAmount,0.132748
CoapplicantIncome,0.090221
randNumCol_1,0.066078
randNumCol_2,0.065571
randNumCol,0.056143
Property_Area,0.048346
Loan_Amount_Term,0.039701
Dependents,0.039187


In [68]:
X_train.shape

(460, 12)

In [51]:
X_train.head(2)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,randNumCol
92,1,1,2,1,0,3273,1820.0,81.0,360.0,1.0,1,4
304,1,0,0,0,0,4000,2500.0,140.0,360.0,1.0,0,3


In [52]:
X_train.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'randNumCol'],
      dtype='object')

In [53]:
X_train.columns[indices][0]

'Credit_History'

# 8B Correlation Matrix

In [54]:
X_train.corr()[(X_train.corr()>0.4) | (X_train.corr() < -0.4)]

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,randNumCol
Gender,1.0,,,,,,,,,,,
Married,,1.0,,,,,,,,,,
Dependents,,,1.0,,,,,,,,,
Education,,,,1.0,,,,,,,,
Self_Employed,,,,,1.0,,,,,,,
ApplicantIncome,,,,,,1.0,,0.537121,,,,
CoapplicantIncome,,,,,,,1.0,,,,,
LoanAmount,,,,,,0.537121,,1.0,,,,
Loan_Amount_Term,,,,,,,,,1.0,,,
Credit_History,,,,,,,,,,1.0,,
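
The masked matrix above repeats every pair twice and includes the diagonal. A compact alternative is to list each correlated pair once; a minimal sketch follows, with a hypothetical helper `correlated_pairs` and toy values chosen only to demonstrate the output shape.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.4) -> pd.Series:
    """Return each column pair once, keeping those with |correlation| > threshold."""
    corr = df.corr()
    # Boolean mask for the strict upper triangle: drops the diagonal and mirrored pairs
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() > threshold]

# Toy frame (hypothetical values): the first two columns move together
toy = pd.DataFrame({
    "ApplicantIncome": [1, 2, 3, 4, 5],
    "LoanAmount": [2, 4, 6, 8, 11],
    "randNumCol": [5, 1, 4, 2, 3],
})

pairs = correlated_pairs(toy, threshold=0.4)
print(pairs)
```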


# 9. Save Pickle Model

In [55]:
import dill as pickle
#filename = 'model_v1.pk'
filename = 'model_v2.pk'

In [56]:
# 1. Save model
with open(path+'/flask_api/models/'+filename, 'wb') as file:
    pickle.dump(grid, file)

In [57]:
# 2. Read model
with open(path+'/flask_api/models/'+filename ,'rb') as f:
    loaded_model = pickle.load(f)

In [58]:
# 3. Test model
loaded_model.predict(X_test)[:5]

array(['Y', 'Y', 'Y', 'Y', 'Y'], dtype=object)
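
Comparing the first five predictions by eye works, but the check can be made explicit. The sketch below is a self-contained illustration of the round-trip test. It uses the standard-library `pickle` and a tiny model on made-up data instead of `dill` and the fitted `grid`; both substitutions are assumptions made so the example runs on its own.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up dataset, for illustration only
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [3, 3]])
y = np.array(['N', 'N', 'Y', 'Y', 'Y', 'N'])

model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Serialize to bytes and restore, mirroring the dump/load cells above
restored = pickle.loads(pickle.dumps(model))

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same)
```

In the notebook, the equivalent check would compare `grid.predict(X_test)` with `loaded_model.predict(X_test)`.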