### Michelle Kouba
### Using GridSearch and Pipeline to Find the Best Model for Predicting Loan Default in Future Borrowers

In [None]:
# Import libraries
import pandas as pd
import string
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import and inspect dataframe
df = pd.read_csv('Loan_Train.csv')
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


## Data Preparation

In [None]:
# Dropping the Loan_ID column (the first column in the dataframe).
del df[df.columns[0]]
df.shape

(614, 12)

In [None]:
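# Previewing the data with any rows containing missing values dropped.
# The result is not assigned, so df itself is unchanged by this cell.
df.dropna()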

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...
609,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [None]:
# Dropping rows with missing data leaves 12 columns and 492 respondents
# (122 respondents lost due to missing data).

In [None]:
df.Credit_History.unique()
# Should this be changed to categorical? It only takes the values
# 1 and 0 (plus missing), so leaving it numeric should be fine.

array([ 1.,  0., nan])

In [None]:
# Creating dummy columns for all the categorical variables
df = pd.get_dummies(df)
# Dropping any rows with missing data.
df = df.dropna()
df.isnull().sum()

ApplicantIncome            0
CoapplicantIncome          0
LoanAmount                 0
Loan_Amount_Term           0
Credit_History             0
Gender_Female              0
Gender_Male                0
Married_No                 0
Married_Yes                0
Dependents_0               0
Dependents_1               0
Dependents_2               0
Dependents_3+              0
Education_Graduate         0
Education_Not Graduate     0
Self_Employed_No           0
Self_Employed_Yes          0
Property_Area_Rural        0
Property_Area_Semiurban    0
Property_Area_Urban        0
Loan_Status_N              0
Loan_Status_Y              0
dtype: int64
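
One detail worth keeping in mind here: with its default `dummy_na=False`, `pd.get_dummies` encodes a missing category as all zeros rather than `NaN`, so a `dropna()` that runs after the encoding only removes rows with missing values in the numeric columns. A small sketch on hypothetical data (the `toy` frame is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row is missing Gender, one is missing LoanAmount.
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female"],
    "LoanAmount": [128.0, 66.0, np.nan],
})

# Encoding first: the row with a missing Gender gets zeros in both dummy columns.
encoded = pd.get_dummies(toy)
rows_kept_after_encoding = len(encoded.dropna())

# Dropping first: every incomplete row is removed before encoding.
strict = pd.get_dummies(toy.dropna())
rows_kept_when_dropping_first = len(strict)

print(rows_kept_after_encoding, rows_kept_when_dropping_first)
```

Which order is appropriate depends on whether an unknown category should be kept as its own "all zeros" case or treated as missing data.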

In [None]:
# Creating the X (features) and y (target) variables for analysis.
X = df.drop(['Loan_Status_Y', 'Loan_Status_N'],axis=1)
y = df['Loan_Status_Y']

In [None]:
# Splitting the dataset into the training set and the test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## KNN Classifier Pipeline with and without Grid Search for Comparative Analysis

In [None]:
# Create a min-max scaler, a KNN classifier, and a pipeline combining them.
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Rescale the original training features to the [0, 1] range
standardizer = MinMaxScaler()
features_standardized = standardizer.fit_transform(X_train)
# Create a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
# Create a pipeline (defined here and used later by the grid search)
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])
# Train the knn model on the rescaled training data
knn.fit(features_standardized, y_train)
# Predict and score on the test data.
# Note: X_test is passed unscaled here; pipe.fit(X_train, y_train) followed by
# pipe.score(X_test, y_test) would apply the same scaling to the test data.
knn.predict(X_test)
print('The accuracy of the knn model with MinMax scaler and a pipeline is', knn.score(X_test, y_test))

The accuracy of the knn model with MinMax scaler and a pipeline is 0.660377358490566




The accuracy of the first model without a Grid Search is roughly 66%.
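
For comparison, a pipeline can be fit and scored directly, so that the scaler learned on the training data is also applied to the test data before prediction. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, since the loan data is not reproduced here), so its score says nothing about the loan results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical, not the loan dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_pipe = Pipeline([
    ("standardizer", MinMaxScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# fit() learns the min/max on the training rows only, then trains KNN on the scaled rows.
demo_pipe.fit(X_tr, y_tr)

# score() reuses the training min/max to scale the test rows before predicting.
demo_score = demo_pipe.score(X_te, y_te)
print(f"Pipeline accuracy on the synthetic test set: {demo_score:.2f}")
```

Keeping the scaler inside the pipeline means there is no separate scaled copy of the features to keep in sync with the model.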

In [None]:
# Create a search space of candidate values from 1 to 10 for knn nearest neighbors
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]
# Create a grid search; the pipeline rescales the features itself,
# so it is fit on the original training features
from sklearn.model_selection import GridSearchCV
classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(X_train, y_train)
# Best neighborhood size (k)
classifier.best_estimator_.get_params()["knn__n_neighbors"]

The grid search selects k = 8 neighbors as the best fit.
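
A fitted `GridSearchCV` also exposes the winning setting and its mean cross-validated score through `best_params_` and `best_score_`. A minimal sketch on synthetic stand-in data (the data and the `demo_` names are assumptions made for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical, not the loan dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_pipe = Pipeline([
    ("standardizer", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# "knn__n_neighbors" addresses the n_neighbors parameter of the "knn" step.
demo_space = {"knn__n_neighbors": list(range(1, 11))}

demo_search = GridSearchCV(demo_pipe, demo_space, cv=5)
demo_search.fit(X_demo, y_demo)

best_k = demo_search.best_params_["knn__n_neighbors"]
print("best k:", best_k)
print("mean cross-validated accuracy for that k:", round(demo_search.best_score_, 3))
```

The `cv_results_` attribute holds the per-candidate scores if the full comparison across k is of interest.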

In [None]:
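# Cross-validate the grid search itself on the test set: for each of the 5 folds,
# the search is re-fit on the remaining folds and scored on the held-out fold.
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=classifier, X=X_test, y=y_test, cv=5)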

In [None]:
print(all_accuracies)

[0.63636364 0.66666667 0.66666667 0.71428571 0.71428571]


In [None]:
print('The average accuracy of grid search is:',all_accuracies.mean())

The average accuracy of grid search is: 0.6796536796536797


The first run with a standard KNN classifier had an accuracy of roughly 66%. With the grid search, the cross-validated accuracy is similar, at nearly 68%.

## KNN/Logistic Regression/Random Forest

In [None]:
# Comparing KNN with two other model types: logistic regression and random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Candidate models: KNN with 8 neighbors (the best k from the grid search above),
# random forest, and logistic regression, each otherwise using default settings
classifiers = [
    KNeighborsClassifier(8),
    RandomForestClassifier(),
    LogisticRegression()
    ]
# Fit each model on the unscaled training data and score it on the test set
for clf in classifiers:
    pipe = Pipeline(steps=[('classifier', clf)])
    pipe.fit(X_train, y_train)
    print(clf)
    print("model score: %.3f" % pipe.score(X_test, y_test))

KNeighborsClassifier(n_neighbors=8)
model score: 0.604
RandomForestClassifier()
model score: 0.811
LogisticRegression()
model score: 0.802




The best model score was the random forest with an accuracy of 81%. Logistic regression followed closely with 80%, and KNN was the lowest with 60%. The best parameter for KNN was 8 neighbors. Overall, scaling the data appeared to help the KNN model, which scored 66% with the min-max scaler versus 60% without it, although the two runs also used different values of k. When we used the grid search, the accuracy validated with cross-validation did not improve much (from 66% to 68%), but it did improve a little.
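
The comparison above fits each model once with fixed settings. One way to fold the choice of model into a single grid search is to treat the `classifier` step of the pipeline as a parameter in its own right. The sketch below is only an illustration under stated assumptions: the data is synthetic, and the step names, candidate values, and `max_iter=1000` are hypothetical choices, not settings from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical, not the loan dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

model_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", KNeighborsClassifier()),  # placeholder; the grid swaps this step
])

# Each dict names one model for the "classifier" step plus that model's own parameters.
model_space = [
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [5, 8]},
    {"classifier": [LogisticRegression(max_iter=1000)],
     "classifier__C": [0.1, 1.0, 10.0]},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [50, 100]},
]

model_search = GridSearchCV(model_pipe, model_space, cv=5)
model_search.fit(X_demo, y_demo)

best_model = model_search.best_estimator_.named_steps["classifier"]
print("best model type:", type(best_model).__name__)
print("mean cross-validated accuracy:", round(model_search.best_score_, 3))
```

Because every candidate passes through the same scaler and the same folds, the resulting scores are directly comparable across model types.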