### Analytics Vidhya: Practice Problem (Approach)

In [447]:
import os
import re
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [448]:
!ls /home/pratos/Side-Project/av_articles/data/

test.csv  training.csv


Download the training and test data from the practice problem page. We'll start with a quick look at the dataset:

In [449]:
data = pd.read_csv('./data/training.csv')

In [450]:
data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [451]:
print("Shape of the data is:{}".format(data.shape))

Shape of the data is:(614, 13)


In [452]:
print("List of columns is: {}".format(list(data.columns)))

List of columns is: ['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status']


Here, `Loan_Status` is our `target variable` and the rest are `predictor variables`. `Loan_ID` is only an identifier and wouldn't help in making predictions about `defaulters`, so we won't include it in our final model.

Counting the `null/NaN` values in each column:

In [453]:
for _ in data.columns:
    print("The number of null values in:{} == {}".format(_, data[_].isnull().sum()))

The number of null values in:Loan_ID == 0
The number of null values in:Gender == 13
The number of null values in:Married == 3
The number of null values in:Dependents == 15
The number of null values in:Education == 0
The number of null values in:Self_Employed == 32
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 22
The number of null values in:Loan_Amount_Term == 14
The number of null values in:Credit_History == 50
The number of null values in:Property_Area == 0
The number of null values in:Loan_Status == 0


There are `22 instances` in `LoanAmount` that don't have values. We'll drop those rows here; you could instead impute the values and check whether that has any effect on the final model.

Similarly, we'll drop the rows that have `NaN/null` values in `Gender`, `Married` or `Credit_History`.

In [454]:
data = data.dropna(subset=['Gender', 'Married', 'Credit_History', 'LoanAmount'])

In [455]:
data.shape

(529, 13)
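
As a point of comparison for the imputation alternative mentioned above, here is a minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset). It assumes median imputation is an acceptable choice; that is an assumption of this sketch, not something this notebook evaluates.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data; values are made up for illustration.
toy = pd.DataFrame({
    "LoanAmount": [128.0, np.nan, 66.0, 120.0, np.nan],
    "Loan_Status": ["N", "Y", "Y", "Y", "N"],
})

# Option 1 (the one used in this notebook): drop rows with a missing LoanAmount.
dropped = toy.dropna(subset=["LoanAmount"])

# Option 2: keep every row and fill the gaps with the column median.
imputed = toy.assign(
    LoanAmount=toy["LoanAmount"].fillna(toy["LoanAmount"].median())
)

print(dropped.shape, imputed.shape)
```

Dropping keeps only fully observed rows, while imputing keeps the sample size at the cost of introducing estimated values.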

We'll check out the values (labels) for the rest of the columns having missing values:

In [456]:
missing_pred = ['Dependents', 'Self_Employed', 'Loan_Amount_Term']

for _ in missing_pred:
    print("List of unique labels for {}:::{}".format(_, set(data[_])))

List of unique labels for Dependents:::{'3+', '0', '2', '1', nan}
List of unique labels for Self_Employed:::{'No', 'Yes', nan}
List of unique labels for Loan_Amount_Term:::{nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 36.0, 300.0, 180.0, 60.0, 84.0, 480.0, 360.0, 240.0, 120.0}


For the rest of the missing values:

- `Dependents`: Assumption that there are no dependents
- `Self_Employed`: Assumption that the applicant is not self-employed
- `Loan_Amount_Term`: Assumption that the loan amount term is the mean value (computed on the training set)

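Before filling those in, we'll split the dataset into train and test sets, so that the fill value for `Loan_Amount_Term` is computed from the training data only: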

In [457]:
from sklearn.model_selection import train_test_split

In [458]:
list(data.columns)

['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History',
 'Property_Area',
 'Loan_Status']

In [459]:
pred_var = ['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome',\
            'LoanAmount','Loan_Amount_Term','Credit_History','Property_Area']

In [460]:
X_train, X_test, y_train, y_test = train_test_split(data[pred_var], data['Loan_Status'], \
                                                    test_size=0.25, random_state=42)

We'll first run the `pre-processing` steps by hand on the training set, and later bundle them into a custom `estimator`.

In [461]:
X_train['Dependents'] = X_train['Dependents'].fillna(0)
X_train['Self_Employed'] = X_train['Self_Employed'].fillna('No')
X_train['Loan_Amount_Term'] = X_train['Loan_Amount_Term'].fillna(X_train['Loan_Amount_Term'].mean())
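
The point of taking the statistic from `X_train` is that the same value can later be reused on unseen data. A small sketch with made-up numbers (the two series below are hypothetical stand-ins for the train and test columns):

```python
import numpy as np
import pandas as pd

# Hypothetical train/test columns; values are illustrative only.
train_terms = pd.Series([360.0, 180.0, np.nan, 360.0])
test_terms = pd.Series([np.nan, 120.0])

# Learn the fill value from the training data only (NaN is skipped by mean()).
fill_value = train_terms.mean()

# Reuse the same value for both sets, so nothing is learned from the test data.
train_filled = train_terms.fillna(fill_value)
test_filled = test_terms.fillna(fill_value)

print(fill_value)
```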

We have a lot of `string` labels in the `Gender`, `Married`, `Education`, `Self_Employed`, `Property_Area` & `Dependents` columns, which we'll map to integers.

In [462]:
label_columns = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Dependents']

for _ in label_columns:
    print("List of unique labels {}:{}".format(_, set(X_train[_])))

List of unique labels Gender:{'Female', 'Male'}
List of unique labels Married:{'No', 'Yes'}
List of unique labels Education:{'Graduate', 'Not Graduate'}
List of unique labels Self_Employed:{'No', 'Yes'}
List of unique labels Property_Area:{'Rural', 'Urban', 'Semiurban'}
List of unique labels Dependents:{0, '3+', '2', '0', '1'}


In [463]:
gender_values = {'Female' : 0, 'Male' : 1} 
married_values = {'No' : 0, 'Yes' : 1}
education_values = {'Graduate' : 0, 'Not Graduate' : 1}
employed_values = {'No' : 0, 'Yes' : 1}
property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
X_train.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values, \
                'Self_Employed': employed_values, 'Property_Area': property_values, 'Dependents': dependent_values}\
                , inplace=True)
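
One property of `replace` worth keeping in mind is that labels missing from the mapping are left unchanged. The sketch below contrasts this with `Series.map`, which turns unmapped labels into `NaN`; the label `'Other'` is hypothetical and only there to show the difference.

```python
import pandas as pd

gender_values = {"Female": 0, "Male": 1}

# 'Other' is a hypothetical label that the mapping does not cover.
labels = pd.Series(["Male", "Female", "Other"], dtype=object)

# replace(): unmapped labels pass through unchanged.
replaced = labels.replace(gender_values)

# map(): unmapped labels become NaN, which makes them easy to detect.
mapped = labels.map(gender_values)

print(replaced.tolist())
print(mapped.isnull().sum())
```

If unseen labels are a concern at prediction time, counting the `NaN` values produced by `map` is one simple way to catch them.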

In [464]:
X_train.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
21,1,1,1,0,0,5955,5625.0,315.0,360.0,1.0,1
409,1,1,3,0,0,81000,0.0,360.0,360.0,0.0,0
64,0,0,0,0,0,4166,0.0,116.0,360.0,0.0,2
599,1,1,2,0,0,5780,0.0,192.0,360.0,1.0,1
459,1,1,0,0,0,8334,0.0,160.0,360.0,1.0,2


In [465]:
X_train.dtypes

Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
dtype: object

Converting the pandas dataframes to numpy arrays:

In [466]:
X_train = X_train.values  # as_matrix() was removed in pandas 1.0; .values returns the same NumPy array

In [467]:
X_train.shape

(396, 11)

We'll create a custom `pre-processing estimator` that would help us in writing better pipelines and in future deployments:

In [481]:
from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessing(BaseEstimator, TransformerMixin):
    """Custom Pre-Processing estimator for our use-case
    """

    def __init__(self):
        pass

    def transform(self, df):
        """Regular transform() that is a help for training, validation & testing datasets
           (NOTE: The operations performed here are the ones that we did prior to this cell)
        """
        
        df = df.dropna(subset=['Gender', 'Married', 'Credit_History', 'LoanAmount']) #For Testing
        
        df['Dependents'] = df['Dependents'].fillna(0)
        df['Self_Employed'] = df['Self_Employed'].fillna('No')
        df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(self.mean_)
        
        gender_values = {'Female' : 0, 'Male' : 1} 
        married_values = {'No' : 0, 'Yes' : 1}
        education_values = {'Graduate' : 0, 'Not Graduate' : 1}
        employed_values = {'No' : 0, 'Yes' : 1}
        property_values = {'Rural' : 0, 'Urban' : 1, 'Semiurban' : 2}
        dependent_values = {'3+': 3, '0': 0, '2': 2, '1': 1}
        df.replace({'Gender': gender_values, 'Married': married_values, 'Education': education_values, \
                    'Self_Employed': employed_values, 'Property_Area': property_values, \
                    'Dependents': dependent_values}, inplace=True)
        
        return df.values  # as_matrix() was removed in pandas 1.0

    def fit(self, df, y=None, **fit_params):
        """Fitting the Training dataset & calculating the required values from train
           e.g: We will need the mean of X_train['Loan_Amount_Term'] that will be used in
                transformation of X_test
        """
        
        self.mean_ = df['Loan_Amount_Term'].mean()
        return self
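
The pattern in the class above is that `fit` learns a statistic from the training data and `transform` reuses it on any dataset. Below is a reduced sketch of that pattern for a single column. The class name `TermImputer` is hypothetical, and the explicit `copy()` is an addition of this sketch so that the caller's frame is not modified.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TermImputer(BaseEstimator, TransformerMixin):
    """Hypothetical, reduced version of PreProcessing with one learned value."""

    def fit(self, df, y=None):
        # Learn the fill value from whatever data fit() is called on.
        self.mean_ = df["Loan_Amount_Term"].mean()
        return self

    def transform(self, df):
        # Work on a copy so the caller's frame is left untouched.
        out = df.copy()
        out["Loan_Amount_Term"] = out["Loan_Amount_Term"].fillna(self.mean_)
        return out.values


# Made-up data standing in for the train and test splits.
train = pd.DataFrame({"Loan_Amount_Term": [360.0, np.nan, 180.0]})
test = pd.DataFrame({"Loan_Amount_Term": [np.nan, 120.0]})

imputer = TermImputer().fit(train)
test_out = imputer.transform(test)
print(test_out)
```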

To make sure that this works, let's do a test run for it:

In [482]:
X_train, X_test, y_train, y_test = train_test_split(data[pred_var], data['Loan_Status'], \
                                                    test_size=0.25, random_state=42)

In [483]:
X_train.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
21,Male,Yes,1,Graduate,No,5955,5625.0,315.0,360.0,1.0,Urban
409,Male,Yes,3+,Graduate,No,81000,0.0,360.0,360.0,0.0,Rural
64,Female,No,0,Graduate,No,4166,0.0,116.0,360.0,0.0,Semiurban
599,Male,Yes,2,Graduate,No,5780,0.0,192.0,360.0,1.0,Urban
459,Male,Yes,0,Graduate,No,8334,0.0,160.0,360.0,1.0,Semiurban


In [484]:
for _ in X_train.columns:
    print("The number of null values in:{} == {}".format(_, data[_].isnull().sum()))

The number of null values in:Gender == 0
The number of null values in:Married == 0
The number of null values in:Dependents == 11
The number of null values in:Education == 0
The number of null values in:Self_Employed == 26
The number of null values in:ApplicantIncome == 0
The number of null values in:CoapplicantIncome == 0
The number of null values in:LoanAmount == 0
The number of null values in:Loan_Amount_Term == 14
The number of null values in:Credit_History == 0
The number of null values in:Property_Area == 0


In [485]:
preprocess = PreProcessing()

In [486]:
preprocess

PreProcessing()

In [487]:
preprocess.fit(X_train)

PreProcessing()

In [488]:
X_train_transformed = preprocess.transform(X_train)

In [489]:
X_train_transformed.shape

(396, 11)

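So our small experiment to write a custom `estimator` worked: the transformed training set has the same shape as the one we built by hand. The fitted `preprocess` object can now be reused on the test set: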

In [494]:
X_test_transformed = preprocess.transform(X_test)

In [495]:
X_test_transformed.shape

(133, 11)

In [496]:
y_test = y_test.replace({'Y':1, 'N':0}).values

In [497]:
y_train = y_train.replace({'Y':1, 'N':0}).values

In [498]:
param_grid = {"randomforestclassifier__n_estimators" : [10, 20, 30],
             "randomforestclassifier__max_depth" : [None, 6, 8, 10],
             "randomforestclassifier__max_leaf_nodes": [None, 5, 10, 20], 
             "randomforestclassifier__min_impurity_split": [0.1, 0.2, 0.3]}

In [501]:
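from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(PreProcessing(),
                    RandomForestClassifier())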

In [502]:
pipe

Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])
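
The `randomforestclassifier__` prefix used in `param_grid` comes from the step names that `make_pipeline` generates (the lowercased class names, visible in the repr above); parameters are then addressed as `<step>__<parameter>`. A quick sketch for listing the valid keys, using `StandardScaler` as a stand-in first step (an assumption of this sketch, so that it runs without the custom class):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler is only a placeholder for the custom PreProcessing step.
pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

# Step names are derived from the lowercased class names.
step_names = [name for name, _ in pipe.steps]

# Every tunable parameter of the forest, as it must be spelled in param_grid.
rf_keys = sorted(
    key for key in pipe.get_params()
    if key.startswith("randomforestclassifier__")
)

print(step_names)
print(rf_keys)
```

Checking `param_grid` against this list is also a quick way to see which parameters the installed scikit-learn version accepts; for instance, `min_impurity_split` is no longer available in recent releases, where `min_impurity_decrease` is used instead.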

In [503]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)

In [504]:
grid

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impu..._jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestclassifier__max_depth': [None, 6, 8, 10], 'randomforestclassifier__min_impurity_split': [0.1, 0.2, 0.3], 'randomforestclassifier__n_estimators': [10, 20, 30], 'randomforestclassifier__max_leaf_nodes': [None, 5, 10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [505]:
grid.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('preprocessing', PreProcessing()), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impu..._jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestclassifier__max_depth': [None, 6, 8, 10], 'randomforestclassifier__min_impurity_split': [0.1, 0.2, 0.3], 'randomforestclassifier__n_estimators': [10, 20, 30], 'randomforestclassifier__max_leaf_nodes': [None, 5, 10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [506]:
print("Best parameters: {}".format(grid.best_params_))

Best parameters: {'randomforestclassifier__max_depth': 10, 'randomforestclassifier__min_impurity_split': 0.3, 'randomforestclassifier__max_leaf_nodes': None, 'randomforestclassifier__n_estimators': 10}


In [507]:
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))

Test set score: 0.79
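
Since no `scoring` argument was passed to `GridSearchCV`, `grid.score` falls back to the classifier's default metric, which is mean accuracy. A small sketch on synthetic data (randomly generated, unrelated to the loan dataset) showing that the two agree:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Synthetic data: the label depends only on the first feature.
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = (X[:, 0] > 0.5).astype(int)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=3,
)
grid.fit(X[:40], y[:40])

# score() on the held-out rows versus accuracy computed by hand.
score = grid.score(X[40:], y[40:])
manual = accuracy_score(y[40:], grid.predict(X[40:]))
print(score, manual)
```

Accuracy can be misleading when the classes are imbalanced, so it may be worth passing an explicit `scoring` argument for this kind of problem.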
