# Week 9 Exercises  

***Karlie Schwartzwald  
DSC 550 Fall 2022  
Bellevue University***

**Change Control Log:**  

Change#: 1  
Change(s) Made:  Assignment started and completed.  
Date of Change:  10/29/2022  
Author: Karlie Schwartzwald   
Change Approved by: Karlie Schwartzwald  
Date Moved to Production: 10/30/2022  

### Best Model Selection and Hyperparameter Tuning

In this exercise, you will work with the Loan_Train.csv dataset, which can be downloaded from Kaggle: [Loan Approval Data Set](https://www.kaggle.com/datasets/granjithkumar/loan-approval-data-set)

In [1]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

1. Import the dataset and ensure that it loaded properly.

In [2]:
loan_df = pd.read_csv('Loan_Train.csv')

In [3]:
loan_df

| Unnamed: 0 | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | LP002978 | Female | No | 0 | Graduate | No | 2900 | 0.0 | 71.0 | 360.0 | 1.0 | Rural | Y |
| 610 | LP002979 | Male | Yes | 3+ | Graduate | No | 4106 | 0.0 | 40.0 | 180.0 | 1.0 | Rural | Y |
| 611 | LP002983 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | Y |
| 612 | LP002984 | Male | Yes | 2 | Graduate | No | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | Urban | Y |


2. Prepare the data for modeling by performing the following steps:  

- Drop the column “Loan_ID.”  

In [4]:
loan_df.drop('Loan_ID', axis=1, inplace=True)

- Drop any rows with missing data.  

In [5]:
loan_df.dropna(inplace=True)

- Convert the categorical features into dummy variables.  

In [6]:
dummies_df = pd.get_dummies(loan_df, columns=['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area'])
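
The three preparation steps above can also be chained without `inplace=True`, which leaves the original frame untouched. The following is only a sketch on a made-up toy frame: `toy_df`, `prepared_df`, and all of the values are hypothetical stand-ins, not the loan data.

```python
import pandas as pd

# Toy stand-in for the loan data; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Loan_ID": ["A1", "A2", "A3", "A4"],
    "Gender": ["Male", "Female", None, "Male"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
    "Loan_Status": ["Y", "N", "Y", "N"],
})

# Chain the three steps: drop the ID column, drop rows with missing
# values, then one-hot encode the categorical column.
prepared_df = (
    toy_df
    .drop(columns="Loan_ID")
    .dropna()
    .pipe(pd.get_dummies, columns=["Gender"])
)

print(prepared_df.columns.tolist())
```

Because nothing is modified in place, `toy_df` still holds all four rows and the ID column, so the cell can be rerun safely.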

3. Split the data into a training and test set, where the “Loan_Status” column is the target.

In [7]:
x = dummies_df.drop('Loan_Status', axis=1)
y = dummies_df['Loan_Status']

In [8]:
x_train, x_test, y_train, y_test = train_test_split(x, y)
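
The split above uses the defaults, so each run produces a different partition. A minimal sketch of a reproducible, stratified split follows, assuming synthetic stand-in data; `x_demo`, `y_demo`, and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, imbalanced binary target.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = np.array(["Y"] * 70 + ["N"] * 30)

# random_state fixes the shuffle; stratify keeps the Y/N ratio roughly
# equal in both partitions. test_size=0.25 matches the default.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(x_tr.shape, x_te.shape)
```

With a fixed seed, rerunning the notebook yields the same training and test rows, which makes accuracy figures comparable between runs.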

4. Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).

In [9]:
# Create scaler
scaler = MinMaxScaler()

In [10]:
# Scale features
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [11]:
# Create a knn classifier
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

In [12]:
# Create a pipeline
pipe = Pipeline([("scaler", scaler), ("knn", knn)])

5. Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.

In [13]:
#fit the pipe
pipe.fit(x_train_scaled,y_train)

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('knn', KNeighborsClassifier(n_jobs=-1))])

In [14]:
#score the knn model accuracy on the testing data
pipe.score(x_test_scaled,y_test)

0.725
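
Since the pipeline already contains a `MinMaxScaler` step, it can in principle be fit on the unscaled features, with the scaling handled internally. The sketch below illustrates that pattern on synthetic data from `make_classification`; `demo_pipe` and the other `demo`-prefixed names are hypothetical, and the resulting accuracy says nothing about the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared loan features.
x_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=0)

# The pipeline fits the scaler on the training data only, then applies
# the same transformation to whatever is passed to score().
demo_pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
demo_pipe.fit(x_tr, y_tr)

demo_accuracy = demo_pipe.score(x_te, y_te)
print(demo_accuracy)
```

Passing raw features also matters during cross-validation, because the scaler is then refit inside each fold rather than seeing the full training set up front.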

6. Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).

In [15]:
# Create space of candidate values
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

7. Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.

In [16]:
# Create grid search
classifier = GridSearchCV(
 pipe, search_space, cv=5, verbose=0).fit(x_train_scaled, y_train)

In [17]:
# Best neighborhood size (k)
best = classifier.best_estimator_.get_params()["knn__n_neighbors"]
best

9

8. Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.

In [18]:
# Create a knn classifier
knn_best = KNeighborsClassifier(n_neighbors=best, n_jobs=-1)

In [19]:
# Create a pipeline
pipe_best = Pipeline([("scaler", scaler), ("knn_best", knn_best)])

In [20]:
#fit the pipe
pipe_best.fit(x_train_scaled,y_train)

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('knn_best', KNeighborsClassifier(n_jobs=-1, n_neighbors=9))])

In [21]:
#score the knn model accuracy on the testing data
pipe_best.score(x_test_scaled,y_test)

0.7416666666666667
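
Rebuilding the pipeline with the best k is one option. Another, sketched here, relies on `GridSearchCV` refitting the best configuration on the whole training set by default (`refit=True`), so the fitted search object can be scored directly. The data are synthetic and the `demo`-prefixed names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared loan features.
x_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=0)

demo_pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])

# Candidate values 1..10 for n_neighbors, evaluated with 5-fold CV.
demo_search = GridSearchCV(demo_pipe, {"knn__n_neighbors": list(range(1, 11))}, cv=5)
demo_search.fit(x_tr, y_tr)

best_k = demo_search.best_params_["knn__n_neighbors"]
# score() delegates to the refit best estimator.
test_accuracy = demo_search.score(x_te, y_te)
print(best_k, test_accuracy)
```

`best_params_` returns only the tuned values, which is a little more direct than reading them out of `best_estimator_.get_params()`.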

9. Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.

In [22]:
# Create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])

In [23]:
# Create dictionary with candidate learning algorithms and their hyperparameters
search_space = [{"classifier": [LogisticRegression()],
 "classifier__penalty": ['l2'],
 "classifier__C": np.logspace(0, 4, 10)},
 {"classifier": [RandomForestClassifier()],
 "classifier__n_estimators": [10, 100, 1000],
 "classifier__max_features": [1, 2, 3]},
 {"classifier": [KNeighborsClassifier()],
 "classifier__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

In [24]:
# Create grid search
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0)

In [25]:
best_model = gridsearch.fit(x_train_scaled,y_train)

10. What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.

In [26]:
# View best model
model = best_model.best_estimator_.get_params()["classifier"]
model

LogisticRegression()

In [27]:
# Train model
model = model.fit(x_train_scaled, y_train)

In [28]:
# Mean cross-validated accuracy; note that this refits the model on folds of the test set
cross_val_score(model, x_test_scaled, y_test, scoring = "accuracy").mean()

0.7833333333333333
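
For comparison, a held-out test score and the winning hyperparameters can be read straight from the fitted search object. The sketch below assumes synthetic data and a reduced search space so that it runs quickly; `demo_space`, the `max_iter` setting, the seeds, and the smaller `n_estimators` grid are illustrative choices, not the values used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared loan features.
x_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=0)

# The "classifier" step is a placeholder that the search space swaps out.
demo_pipe = Pipeline([("scaler", MinMaxScaler()), ("classifier", LogisticRegression())])
demo_space = [
    {"classifier": [LogisticRegression(max_iter=1000)],
     "classifier__C": np.logspace(0, 4, 5)},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10, 50],
     "classifier__max_features": [1, 2, 3]},
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": list(range(1, 11))},
]

demo_search = GridSearchCV(demo_pipe, demo_space, cv=5).fit(x_tr, y_tr)

# Winning model and hyperparameters, then accuracy on unseen test rows.
print(demo_search.best_params_)
held_out_accuracy = demo_search.score(x_te, y_te)
print(held_out_accuracy)
```

Here the test rows are never used for fitting, so the score is a plain held-out accuracy. Giving the random forest a fixed `random_state` also keeps the selected model stable between runs.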

11. Summarize your results.

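*In summary, the default KNN classifier (k = 5) performed worst, with a test accuracy of 0.725. Tuning the “n_neighbors” hyperparameter with a grid search selected k = 9 and raised the test accuracy to about 0.742. When the search space was expanded to include logistic regression and random forest models, the grid search selected logistic regression as the best model. Its accuracy of about 0.783 was the highest of any model tested, although that figure is a cross-validated mean computed on the test set rather than a single held-out score, so it is not strictly comparable with the two KNN results. I also noted that rerunning the grid search yielded a different best model each time, most likely because neither train_test_split nor RandomForestClassifier was given a fixed random_state.*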