# DSC 550 Data Mining (DSC550-T302 2227-1)
## Bellevue University
## 8.2 Exercise: Best Model Selection and Hyperparameter Tuning
## Author: Jake Meyer
## Date: 7/31/2022

## Assignment Instructions:
In this exercise, you will work with the Loan_Train.csv dataset, which can be downloaded from this link: [Loan Approval Data Set](https://www.kaggle.com/datasets/granjithkumar/loan-approval-data-set)

<ol>
    <li> Import the dataset and ensure that it loaded properly.
    <li> Prepare the data for modeling by performing the following steps:<br>
$\bullet$ Drop the column “Loan_ID.”<br>
$\bullet$ Drop any rows with missing data. <br>
$\bullet$ Convert the categorical features into dummy variables. <br>
    <li> Split the data into a training and test set, where the “Loan_Status” column is the target.
    <li> Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).
    <li> Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.
    <li> Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).
    <li> Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.
    <li> Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.
    <li> Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.
    <li> What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.
    <li> Summarize your results.
</ol>

In [1]:
'''
Import the necessary libraries to complete Exercise 8.2.
'''
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score

In [2]:
'''
Check the versions of the packages.
'''
print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)
print('matplotlib version:', mpl.__version__)
print('seaborn:', sns.__version__)
print('sklearn:', sklearn.__version__)

numpy version: 1.20.3
pandas version: 1.3.4
matplotlib version: 3.4.3
seaborn: 0.11.2
sklearn: 0.24.2


### Part 1 - Import the dataset and ensure that it loaded properly.

In [3]:
'''
Import the Loan Data from Loan_Train.csv.
Note: A copy of the CSV file was placed into the same directory as this notebook.
'''
df = pd.read_csv('Loan_Train.csv')

In [4]:
'''
Show the data has been loaded successfully into the data frame 
by printing the first 10 rows with head().
'''
df.head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
7,LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360.0,0.0,Semiurban,N
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Urban,Y
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,Semiurban,N


In [5]:
'''
Understand the shape of the data frame.
'''
print('There are {} rows and {} columns in this data frame.'.format(df.shape[0], df.shape[1]))

There are 614 rows and 13 columns in this data frame.


In [6]:
'''
Display the total number of elements (rows x columns) in this data frame.
'''
print('This data frame contains {} elements.'.format(df.size))

This data frame contains 7982 elements.


In [7]:
'''
Find the type of data within each column initially.
'''
df.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [8]:
'''
Understand if there are any missing values in the data frame.
'''
df.isna().sum().sort_values(ascending = False)

Credit_History       50
Self_Employed        32
LoanAmount           22
Dependents           15
Loan_Amount_Term     14
Gender               13
Married               3
Loan_ID               0
Education             0
ApplicantIncome       0
CoapplicantIncome     0
Property_Area         0
Loan_Status           0
dtype: int64

In [9]:
'''
Understand how many missing values are in the dataset initially.
'''
df.isna().sum().sum()

149

### Part 2 - Prepare the data for modeling by performing the following steps: <br>
$\bullet$ Drop the column “Loan_ID.”<br>
$\bullet$ Drop any rows with missing data. <br>
$\bullet$ Convert the categorical features into dummy variables. <br>

In [10]:
'''
Drop the column Loan_ID with drop().
'''
df.drop('Loan_ID', axis =1, inplace = True)

In [11]:
'''
Drop any rows with missing data with dropna().
'''
df.dropna(how = "any", axis = 0, inplace = True)

In [12]:
'''
Identify all the categorical columns in the dataframe.  
'''
category_columns = df.select_dtypes(include = 'object').columns.tolist()
print(category_columns)

['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']


In [13]:
'''
Convert the categorical features into dummy variables with get_dummies().
'''
df = pd.get_dummies(df, columns = category_columns)

In [14]:
'''
Understand the shape of the dataframe after dropping Loan_ID Column, rows with any missing data, and converting categorical
features into dummy variables.
'''
print('There are {} rows and {} columns in this data frame.'.format(df.shape[0], df.shape[1]))

There are 480 rows and 22 columns in this data frame.


In [15]:
'''
Understand any missing values in the data frame after dropping the values
and creating the dummy variables.
'''
df.isna().sum().sort_values(ascending = False)

ApplicantIncome            0
CoapplicantIncome          0
Loan_Status_N              0
Property_Area_Urban        0
Property_Area_Semiurban    0
Property_Area_Rural        0
Self_Employed_Yes          0
Self_Employed_No           0
Education_Not Graduate     0
Education_Graduate         0
Dependents_3+              0
Dependents_2               0
Dependents_1               0
Dependents_0               0
Married_Yes                0
Married_No                 0
Gender_Male                0
Gender_Female              0
Credit_History             0
Loan_Amount_Term           0
LoanAmount                 0
Loan_Status_Y              0
dtype: int64

In [16]:
'''
View the first ten rows of the data frame.
'''
df.head(10)

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_0,...,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status_N,Loan_Status_Y
1,4583,1508.0,128.0,360.0,1.0,0,1,0,1,0,...,0,1,0,1,0,1,0,0,1,0
2,3000,0.0,66.0,360.0,1.0,0,1,0,1,1,...,0,1,0,0,1,0,0,1,0,1
3,2583,2358.0,120.0,360.0,1.0,0,1,0,1,1,...,0,0,1,1,0,0,0,1,0,1
4,6000,0.0,141.0,360.0,1.0,0,1,1,0,1,...,0,1,0,1,0,0,0,1,0,1
5,5417,4196.0,267.0,360.0,1.0,0,1,0,1,0,...,0,1,0,0,1,0,0,1,0,1
6,2333,1516.0,95.0,360.0,1.0,0,1,0,1,1,...,0,0,1,1,0,0,0,1,0,1
7,3036,2504.0,158.0,360.0,0.0,0,1,0,1,0,...,1,1,0,1,0,0,1,0,1,0
8,4006,1526.0,168.0,360.0,1.0,0,1,0,1,0,...,0,1,0,1,0,0,0,1,0,1
9,12841,10968.0,349.0,360.0,1.0,0,1,0,1,0,...,0,1,0,1,0,0,1,0,1,0
10,3200,700.0,70.0,360.0,1.0,0,1,0,1,0,...,0,1,0,1,0,0,0,1,0,1


### Part 3 - Split the data into a training and test set, where the "Loan_Status" column is the target.

In [17]:
'''
Define the target variable as y (Loan_Status_N and Loan_Status_Y) and independent variables as X. 
'''
X = df.drop(['Loan_Status_N','Loan_Status_Y'], axis = 1)
y = df[['Loan_Status_N','Loan_Status_Y']]
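Because `y` above holds two complementary dummy columns, scikit-learn treats it as a multilabel target, and `score()` then reports subset accuracy (a row counts as correct only when both columns are predicted correctly). A common alternative is to encode the target as a single 0/1 column and dummy-encode only the predictors. The sketch below illustrates that idea on a small made-up frame; the `toy` data and the names `y_single` and `X_toy` are assumptions for illustration, and applying this to the loan data would likely change the accuracy values reported in this notebook.

```python
import pandas as pd

# Toy frame standing in for the loan data (values are illustrative only).
toy = pd.DataFrame({
    "ApplicantIncome": [4583, 3000, 2583, 6000],
    "Property_Area": ["Rural", "Urban", "Urban", "Semiurban"],
    "Loan_Status": ["N", "Y", "Y", "Y"],
})

# Encode the target as one 0/1 column instead of two dummy columns.
y_single = (toy["Loan_Status"] == "Y").astype(int)

# Dummy-encode only the predictor columns, leaving the target out.
X_toy = pd.get_dummies(toy.drop("Loan_Status", axis=1), columns=["Property_Area"])

print(y_single.tolist())
print(X_toy.shape)
```

A one-dimensional target of this kind is also what `LogisticRegression` expects, which matters for the expanded search in Part 9.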

In [18]:
'''
Split the data with train_test_split from sklearn.
Use test_size = 0.2 to split the data into 80% training and 20% testing data.
'''
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [19]:
'''
Print out the shape of the resulting datasets for training and testing.
'''
print("The original data set shape was {} rows and {} columns.".format(df.shape[0],df.shape[1]))
print("The X_train shape is {} rows and {} columns.".format(X_train.shape[0],X_train.shape[1]))
print("The y_train shape is {} rows and {} columns.".format(y_train.shape[0],y_train.shape[1]))
print("The X_test shape is {} rows and {} columns.".format(X_test.shape[0],X_test.shape[1]))
print("The y_test shape is {} rows and {} columns.".format(y_test.shape[0],y_test.shape[1]))

The original data set shape was 480 rows and 22 columns.
The X_train shape is 384 rows and 20 columns.
The y_train shape is 384 rows and 2 columns.
The X_test shape is 96 rows and 20 columns.
The y_test shape is 96 rows and 2 columns.


### Part 4 - Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).

In [20]:
'''
Create the scaler using MinMaxScaler() from sklearn.preprocessing.
'''
scaler = MinMaxScaler()

In [21]:
'''
Create a KNN classifier to put into the pipeline. k=5 is the scikit-learn default for n_neighbors and is set explicitly here as the starting point.
'''
classifier_knn = KNeighborsClassifier(n_neighbors=5, n_jobs = -1)

In [22]:
'''
Create the pipeline with the min-max scaler and a KNN classifier.
'''
pipe = Pipeline([("scaler", scaler), ("classifier_knn", classifier_knn)])

### Part 5 - Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.

In [23]:
'''
Fit the pipe to the training data.
'''
pipe.fit(X_train, y_train)

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('classifier_knn', KNeighborsClassifier(n_jobs=-1))])

In [24]:
'''
Determine the accuracy of the default KNN classifier on the test set. Print the results.
'''
accuracy_score = pipe.score(X_test, y_test)*100
print('Accuracy Score (k = 5): {}%'.format(accuracy_score))

Accuracy Score (k = 5): 78.125%


### Part 6 - Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).

In [25]:
'''
Create the search space for the KNN classifier for n_neighbors between 1-10.
'''
search_space = [{"classifier_knn__n_neighbors":[1,2,3,4,5,6,7,8,9,10]}]
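The key in the search space follows the `<step name>__<parameter name>` convention, so it has to match the step name used when the pipeline was built. A minimal sketch of how that can be checked is shown below; `demo_pipe` and `demo_space` are hypothetical names that mirror the pipeline and search space above, and `range()` is used as a shorter way to write the values 1 to 10.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Same two-step structure as the notebook's pipeline (names are illustrative).
demo_pipe = Pipeline([("scaler", MinMaxScaler()),
                      ("classifier_knn", KNeighborsClassifier())])

# range(1, 11) yields the integers 1 through 10.
demo_space = [{"classifier_knn__n_neighbors": list(range(1, 11))}]

# Every grid key should be a parameter name that the pipeline exposes.
valid_keys = demo_pipe.get_params().keys()
print(all(key in valid_keys for grid in demo_space for key in grid))
```

If a key does not match a pipeline parameter, `GridSearchCV` raises an error when fitting, so a check like this can catch a misspelled step name early.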

### Part 7 - Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.

In [26]:
'''
Create and fit the grid search with the pipeline, search_space, and 5-fold cross-validation.
Start by creating the grid search.
'''
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose = 0)

In [27]:
'''
Now fit the grid search classifier to the training data.
'''
gridsearch.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                       ('classifier_knn',
                                        KNeighborsClassifier(n_jobs=-1))]),
             param_grid=[{'classifier_knn__n_neighbors': [1, 2, 3, 4, 5, 6, 7,
                                                          8, 9, 10]}])

In [29]:
'''
Understand what dict_keys are available for the knn classifier with the code below.
'''
classifier_knn.get_params().keys()

dict_keys(['algorithm', 'leaf_size', 'metric', 'metric_params', 'n_jobs', 'n_neighbors', 'p', 'weights'])

In [30]:
'''
Determine the best neighborhood size (k).
'''
gridsearch.best_estimator_.get_params()['classifier_knn__n_neighbors']

3

The grid search selected k=3, the value with the highest mean 5-fold cross-validated accuracy on the training data among k = 1 to 10.
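To see how close the candidate values of k are to each other, the per-candidate cross-validation scores can be read from `cv_results_`. The sketch below runs on synthetic data from `make_classification` rather than the loan data, so its numbers are not the notebook's results; `X_demo`, `y_demo`, `demo_grid`, and `cv_table` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the loan data set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipe = Pipeline([("scaler", MinMaxScaler()),
                      ("classifier_knn", KNeighborsClassifier())])
demo_grid = GridSearchCV(demo_pipe,
                         {"classifier_knn__n_neighbors": list(range(1, 11))},
                         cv=5)
demo_grid.fit(X_demo, y_demo)

# One row per candidate k, sorted so the best-ranked candidate comes first.
cv_table = pd.DataFrame(demo_grid.cv_results_)[
    ["param_classifier_knn__n_neighbors", "mean_test_score",
     "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(cv_table.to_string(index=False))
print(demo_grid.best_params_, round(demo_grid.best_score_, 3))
```

When the mean scores of several candidates differ by less than their standard deviations, the selected k is not strongly preferred, which is one reason the test accuracy may not improve.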

### Part 8 - Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.

In [31]:
'''
Find the accuracy of the best model based on k=3.
Start by creating a revised KNN classifier to put into the pipeline.
'''
classifier_knn_2 = KNeighborsClassifier(n_neighbors = 3, n_jobs = -1)

In [32]:
'''
Create a similar pipeline with the min-max scaler and the revised KNN classifier.
'''
pipe2 = Pipeline([("scaler", scaler), ("classifier_knn_2", classifier_knn_2)])

In [33]:
'''
Fit the pipe2 to the training data.
'''
pipe2.fit(X_train, y_train)

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('classifier_knn_2',
                 KNeighborsClassifier(n_jobs=-1, n_neighbors=3))])

In [34]:
'''
Determine the accuracy of the revised KNN classifier on the test set. Print the results.
'''
accuracy_score2 = round(pipe2.score(X_test, y_test)*100,2)
print('Revised KNN Accuracy Score (k=3): {}%'.format(accuracy_score2))

Revised KNN Accuracy Score (k=3): 76.04%
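Rebuilding the pipeline by hand is one way to score the selected model. Since `GridSearchCV` refits the best candidate on the full training set by default (`refit=True`), the fitted search object can also be scored directly. The sketch below compares both routes on synthetic data; all names in it are hypothetical and the data is not the loan data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, split 80/20 like the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

demo_pipe = Pipeline([("scaler", MinMaxScaler()),
                      ("classifier_knn", KNeighborsClassifier())])
demo_grid = GridSearchCV(demo_pipe,
                         {"classifier_knn__n_neighbors": list(range(1, 11))},
                         cv=5)
demo_grid.fit(X_tr, y_tr)

# Route 1: score the refit best estimator through the search object.
grid_accuracy = demo_grid.score(X_te, y_te)

# Route 2: rebuild a pipeline with the selected k and fit it manually.
best_k = demo_grid.best_params_["classifier_knn__n_neighbors"]
manual_pipe = Pipeline([("scaler", MinMaxScaler()),
                        ("classifier_knn", KNeighborsClassifier(n_neighbors=best_k))])
manual_accuracy = manual_pipe.fit(X_tr, y_tr).score(X_te, y_te)

print(abs(grid_accuracy - manual_accuracy) < 1e-12)
```

Scoring through the search object avoids hard-coding the selected k, so the cell keeps working if the data or the search space changes.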


### Part 9 - Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.

In [35]:
'''
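Create a pipeline with the same structure as before, using a generic 'classifier' step name. The KNN classifier here is a placeholder that the grid search replaces with the KNN, logistic regression, and random forest models.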
'''
classifier = KNeighborsClassifier()
pipe3 = Pipeline([('scaler', MinMaxScaler()), ('classifier', classifier)])

In [36]:
'''
Create a dictionary with the learning algorithms and hyperparameters.
'''
search_space_2 = [{'classifier' : [KNeighborsClassifier()],
                   'classifier__n_neighbors' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
                  {'classifier' : [LogisticRegression()],
                   'classifier__penalty' : ['l1', 'l2'],
                   'classifier__C' : np.logspace(0, 4, 10)},
                  {'classifier' : [RandomForestClassifier()],
                   'classifier__n_estimators' : [10, 100, 1000],
                   'classifier__max_features' : [1, 2, 3]}
                 ]
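One caveat about the logistic regression part of this search space: the default `lbfgs` solver does not support the `'l1'` penalty, and `LogisticRegression` expects a one-dimensional target. Candidates that fail to fit are recorded with a NaN score by default, and with warnings suppressed those failures can go unnoticed. The sketch below is one possible variant, not the notebook's code: it assumes the `liblinear` solver (which supports both penalties) and a single-column synthetic target, and then counts failed candidates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with a 1-D binary target (not the loan data set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Assumption: liblinear is chosen because it accepts both 'l1' and 'l2'.
demo_pipe = Pipeline([("scaler", MinMaxScaler()),
                      ("classifier", LogisticRegression(solver="liblinear"))])
demo_space = {"classifier__penalty": ["l1", "l2"],
              "classifier__C": np.logspace(0, 4, 10)}
demo_grid = GridSearchCV(demo_pipe, demo_space, cv=5).fit(X_demo, y_demo)

# Candidates whose fits failed would show up as NaN mean scores.
n_candidates = len(demo_grid.cv_results_["params"])
n_failed = int(np.isnan(demo_grid.cv_results_["mean_test_score"]).sum())
print("Candidates:", n_candidates, "failed:", n_failed)
```

Running the same NaN count on the notebook's own `cv_results_` would show whether the logistic regression candidates were actually evaluated in the comparison.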

In [37]:
'''
Create the revised grid search. Use the pipeline pipe3, 5-fold cross-validation, and the revised search space.
'''
gridsearch_2 = GridSearchCV(pipe3,search_space_2,cv=5,verbose=0)

In [38]:
'''
Fit the revised grid search.
'''
best_model = gridsearch_2.fit(X_train,y_train)

### Part 10 - What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.

In [40]:
'''
Show the best model and hyperparameters found in the grid search.
'''
best_model.best_estimator_.get_params()['classifier']

RandomForestClassifier(max_features=3, n_estimators=1000)

In [46]:
'''
Determine the accuracy of the best model classifier on the test set. Print the results.
'''
accuracy_score3 = round(best_model.score(X_test, y_test)*100,2)
print('Best Model Accuracy Score: {}% \n'
      'Model(Parameters): {}'.format(accuracy_score3,best_model.best_estimator_.get_params()['classifier']))

Best Model Accuracy Score: 82.29% 
Model(Parameters): RandomForestClassifier(max_features=3, n_estimators=1000)
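Beyond the single winner, it can be useful to see the best cross-validation score reached by each model family. The sketch below groups `cv_results_` by estimator class on synthetic data with a deliberately small grid; the data, the grid values, and names such as `best_by_model` are assumptions for illustration and do not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the loan data set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipe = Pipeline([("scaler", MinMaxScaler()),
                      ("classifier", KNeighborsClassifier())])
# Small illustrative grid; the values are not those from the textbook.
demo_space = [
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [3, 5]},
    {"classifier": [LogisticRegression(solver="liblinear")],
     "classifier__C": [1.0, 10.0]},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10, 50]},
]
demo_grid = GridSearchCV(demo_pipe, demo_space, cv=5).fit(X_demo, y_demo)

# Track the highest mean cross-validated accuracy per estimator class.
best_by_model = {}
for params, score in zip(demo_grid.cv_results_["params"],
                         demo_grid.cv_results_["mean_test_score"]):
    name = type(params["classifier"]).__name__
    best_by_model[name] = max(score, best_by_model.get(name, float("-inf")))

for name, score in best_by_model.items():
    print("{}: {:.3f}".format(name, score))
print(demo_grid.best_params_)
```

A comparison like this shows how large the gap between model families is, which helps judge whether the selected model is clearly better or only marginally ahead.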


### Part 11 - Summarize your results.

This week's assignment covered model selection and hyperparameter tuning. The data set contains loan application data used to predict whether a loan would be granted, based on features such as gender, education level, marital status, income, loan amount, loan term, and more. The main results of the analysis are summarized below:<br>
$\bullet$ The 'Loan_ID' column and rows with missing data were dropped from the initial data set. <br>
$\bullet$ The categorical features were converted into dummy variables. <br>
$\bullet$ The data was split into 80% training and 20% test subsets. <br>
$\bullet$ A pipeline was created with a min-max scaler and a KNN classifier (with k=5). This pipeline was then fit to the training data and reached an accuracy of 78% on the test set. <br>
$\bullet$ A grid search with 5-fold cross-validation over k-values from 1 to 10 selected k=3 as the best KNN classifier. However, the test-set accuracy of the revised KNN model decreased to 76%. <br>
$\bullet$ The search space was revised to include KNN, Logistic Regression and Random Forest models with hyperparameter values predefined for C, penalty, n_estimators, and max_features for each respective method. (values specified from text, page 214).<br>
$\bullet$ The best model found by the expanded grid search was a Random Forest Classifier with the hyperparameters max_features = 3 and n_estimators = 1000. <br>
$\bullet$ This Random Forest Classifier model had an accuracy score of 82% on the test set. <br>