# Week 9: Best Model Selection and Hyperparameter Tuning

## Part 1: Best Model Selection and Hyperparameter Tuning

***Instructions)***

In this exercise, you will work with the Loan_Train.csv dataset, which can be downloaded from the Loan Approval Data Set link.

1. Import the dataset and ensure that it loaded properly.
2. Prepare the data for modeling by performing the following steps:
    - Drop the column “Loan_ID.”
    - Drop any rows with missing data.
    - Convert the categorical features into dummy variables.
3. Split the data into a training and test set, where the “Loan_Status” column is the target.
4. Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).
5. Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. *Note*: Fitting a pipeline model works just like fitting a regular model.
6. Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).
7. Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.
8. Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.
9. Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.
10. What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.
11. Summarize your results.


***Answer)***

### 1. Reading the Loan Approval Training Data

In [1]:
# importing libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [2]:
# reading the data
file_path = 'C:/Users/ivan2/gitLocal/DSC550-WINTER2023/Loan_Approval_Dataset_Train.csv'


df_loan = pd.read_csv(file_path)
df_loan.head(5)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
df_loan.shape

(614, 13)

In [4]:
df_loan.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

### 2. Prepare the data for modeling by performing the required steps

**1. Drop the column “Loan_ID.”**

In [5]:
# dropping the "Loan_ID" column
df_loan = df_loan.drop(['Loan_ID'], axis=1)

df_loan.head(3)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


**2. Drop any rows with missing data.**

In [6]:
# checking for NaN values
df_loan.isnull().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [7]:
# checking original dimensions
df_loan.shape

(614, 12)

In [8]:
# dropping NaN values
df_loanV2 = df_loan.dropna()
df_loanV2.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [9]:
# checking new dimensions
df_loanV2.shape

(480, 12)
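
The per-column counts above add up to 149 missing cells, yet only 614 - 480 = 134 rows were dropped, because some rows are missing more than one value. A minimal sketch of that distinction, using a small made-up frame (`toy` and its values are hypothetical, not the loan data):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; row 1 is missing two cells at once
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", None],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, 0.0, 1.0],
})

missing_cells = int(toy.isnull().sum().sum())            # total NaN cells
rows_with_missing = int(toy.isnull().any(axis=1).sum())  # rows that dropna() removes
rows_kept = len(toy.dropna())

print(missing_cells, rows_with_missing, rows_kept)
```

Running `df_loan.isnull().any(axis=1).sum()` on the real data should give the number of rows that `dropna()` removes.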

**3. Convert the categorical features into dummy variables.**

In [10]:
# selecting categorical columns
categorical_cols = df_loanV2.columns[df_loanV2.dtypes == 'object']

# converting categorical columns to dummy variables
dummy_vars = pd.get_dummies(df_loanV2[categorical_cols])


In [11]:
# concatenate dummy variables with the original DataFrame
# drop the original categorical columns to avoid redundancy
df_loanV3 = pd.concat([df_loanV2.drop(categorical_cols, axis=1), dummy_vars], axis=1)

In [16]:
# check
df_loanV3.shape

(480, 22)

In [19]:
df_loanV3.head(3)

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_0,...,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status_N,Loan_Status_Y
1,4583,1508.0,128.0,360.0,1.0,0,1,0,1,0,...,0,1,0,1,0,1,0,0,1,0
2,3000,0.0,66.0,360.0,1.0,0,1,0,1,1,...,0,1,0,0,1,0,0,1,0,1
3,2583,2358.0,120.0,360.0,1.0,0,1,0,1,1,...,0,0,1,1,0,0,0,1,0,1


In [15]:
for col in df_loanV3:
    print(col)

ApplicantIncome
CoapplicantIncome
LoanAmount
Loan_Amount_Term
Credit_History
Gender_Female
Gender_Male
Married_No
Married_Yes
Dependents_0
Dependents_1
Dependents_2
Dependents_3+
Education_Graduate
Education_Not Graduate
Self_Employed_No
Self_Employed_Yes
Property_Area_Rural
Property_Area_Semiurban
Property_Area_Urban
Loan_Status_N
Loan_Status_Y
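
The listing above shows that `pd.get_dummies` creates one column per category level, so two-level features such as `Married` end up as two perfectly redundant columns. One possible alternative, sketched here on a made-up frame (`toy` is hypothetical), is `drop_first=True`, which keeps one fewer column per feature. This is an optional variation, not what the notebook does:

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring three of the loan columns
toy = pd.DataFrame({
    "Married": ["No", "Yes", "Yes"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "ApplicantIncome": [5849, 4583, 3000],
})

full = pd.get_dummies(toy)                      # one column per category level
reduced = pd.get_dummies(toy, drop_first=True)  # drops the first level of each feature

print(list(full.columns))
print(list(reduced.columns))
```

For KNN the redundancy mostly affects distance weighting; for unregularized linear models it can matter more.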


### 3. Split the data into a training and test set, where the “Loan_Status” column is the target.

In [21]:
from sklearn.model_selection import train_test_split

# separate features and target
X = df_loanV3.drop(['Loan_Status_Y', 'Loan_Status_N'], axis=1) # features
y = df_loanV3['Loan_Status_Y'] # target

In [22]:
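# split the data into training (80%) and test (20%) sets
# note: no random_state is set, so the split and the accuracies reported below change between runs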
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
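
If repeatable numbers are wanted, the split can be seeded and stratified. The sketch below uses synthetic arrays (`X_demo`, `y_demo`, and the seed 42 are assumptions for illustration) to show that a fixed `random_state` reproduces the same split and that `stratify` preserves the class ratio in the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 70% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 70 + [0] * 30)

def seeded_split():
    # Same seed and stratification on every call
    return train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
    )

_, _, _, y_te_a = seeded_split()
_, _, _, y_te_b = seeded_split()

same_split = bool(np.array_equal(y_te_a, y_te_b))
positives_in_test = int(y_te_a.sum())
print(same_split, positives_in_test, len(y_te_a))
```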

### 4. Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).

In [29]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [30]:
# defining the pipeline
pipeline = Pipeline([
    ('min_max_scaler', MinMaxScaler()),  # Step 1: Min-Max Scaler
    ('knn_classifier', KNeighborsClassifier())  # Step 2: KNN Classifier
])

### 5. Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.

In [31]:
# fitting the pipeline to the training data
pipeline.fit(X_train, y_train)

In [32]:
# predicting the target value for the test set
y_pred = pipeline.predict(X_test)

In [34]:
# calculating accuracy
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.67
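
An accuracy figure is easier to interpret next to a majority-class baseline. The class balance of the loan data is not shown in this notebook, so the sketch below uses synthetic labels (70% positive, an assumption) with scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 70% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Applying the same two lines to `X_train`, `y_train`, `X_test`, and `y_test` would show how much the KNN pipeline gains over always predicting the majority class.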


### 6. Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).

In [35]:
# define the parameter grid
param_grid = {
    'knn_classifier__n_neighbors': list(range(1, 11))  # 1 to 10
}

# creating a GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
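
The key `knn_classifier__n_neighbors` follows scikit-learn's `<step name>__<parameter>` convention for nested parameters. A small sketch of how to list the names a pipeline accepts (`demo_pipeline` is a stand-in that mirrors the step names used above):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in pipeline with the same step names as the notebook
demo_pipeline = Pipeline([
    ("min_max_scaler", MinMaxScaler()),
    ("knn_classifier", KNeighborsClassifier()),
])

# Nested parameters are exposed as "<step>__<param>"
tunable = sorted(name for name in demo_pipeline.get_params() if "__" in name)
print(tunable)

# The same naming works with set_params
demo_pipeline.set_params(knn_classifier__n_neighbors=3)
print(demo_pipeline.named_steps["knn_classifier"].n_neighbors)
```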

### 7. Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.

In [36]:
# fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)


In [37]:
# Best parameters found
print("Best parameters found: ", grid_search.best_params_)

# Best score (accuracy) from GridSearchCV
print("Best score (accuracy): ", grid_search.best_score_)

Best parameters found:  {'knn_classifier__n_neighbors': 5}
Best score (accuracy):  0.718626110731374
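
`best_score_` is the mean cross-validated accuracy of the winning candidate, not a test-set score. To see how close the other values of `n_neighbors` came, `cv_results_` can be loaded into a DataFrame. A sketch on synthetic data (`X_demo`, `y_demo`, and `demo_grid` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_pipeline = Pipeline([
    ("min_max_scaler", MinMaxScaler()),
    ("knn_classifier", KNeighborsClassifier()),
])
demo_grid = GridSearchCV(
    demo_pipeline,
    {"knn_classifier__n_neighbors": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
cv_table = (
    pd.DataFrame(demo_grid.cv_results_)[
        ["param_knn_classifier__n_neighbors", "mean_test_score",
         "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(cv_table.head())
```

When several candidates sit within one standard deviation of each other, the choice of "best" value is not very stable.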


### 8. Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.

In [38]:
# Evaluate the best model found by the grid search on the test set
test_accuracy = grid_search.score(X_test, y_test)

print(f"Accuracy of the best model on the test set: {test_accuracy:.2f}")

Accuracy of the best model on the test set: 0.67
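
The test accuracy matches the default model because the grid search selected `n_neighbors = 5`, which is also the default for `KNeighborsClassifier`. The refit best model is therefore the same model as in step 5. A quick check of the default:

```python
from sklearn.neighbors import KNeighborsClassifier

# Read the default value straight from an untouched estimator
default_k = KNeighborsClassifier().get_params()["n_neighbors"]
print(default_k)
```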


### 9. Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.

**Step 6**

First, I will adjust the pipeline to use a generic classifier step, which allows different classifiers to be swapped in during the grid search.

In [41]:
# Create a pipeline with a generic 'classifier' step
pipeline = Pipeline([
    ('min_max_scaler', MinMaxScaler()),
    ('classifier', None)  # Placeholder for the classifier
])

Then I will define the search space to include the additional models.

In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [43]:
# creating the expanded search space
search_space = [
    # KNN Classifier parameters
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": list(range(1, 11))},  # 1 to 10
    # Logistic Regression parameters
    {"classifier": [LogisticRegression(max_iter=500, solver='liblinear')],
     "classifier__penalty": ['l1', 'l2'],
     "classifier__C": np.logspace(0, 4, 10)},
    # Random Forest parameters
    {"classifier": [RandomForestClassifier()],
     "classifier__n_estimators": [10, 100, 1000],
     "classifier__max_features": [1, 2, 3]}
]
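
The size of this search space can be checked before fitting anything. `ParameterGrid` expands a list of dictionaries the same way `GridSearchCV` does, so the sketch below (which repeats the search space so that it runs on its own) counts the candidates: 10 for KNN, 2 × 10 for logistic regression, and 3 × 3 for random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier

# Same search space as above, repeated to keep the example self-contained
search_space = [
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": list(range(1, 11))},
    {"classifier": [LogisticRegression(max_iter=500, solver="liblinear")],
     "classifier__penalty": ["l1", "l2"],
     "classifier__C": np.logspace(0, 4, 10)},
    {"classifier": [RandomForestClassifier()],
     "classifier__n_estimators": [10, 100, 1000],
     "classifier__max_features": [1, 2, 3]},
]

n_candidates = len(ParameterGrid(search_space))
n_fits = n_candidates * 5  # 5-fold cross-validation
print(n_candidates, n_fits)
```

These counts should agree with the message that `GridSearchCV` prints when it is fitted with `verbose=1`.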

Using GridSearchCV with the updated pipeline and search space:

In [45]:
# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline, search_space, cv=5, scoring='accuracy', verbose=1)

**Step 7**

In [46]:
# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 39 candidates, totalling 195 fits


In [47]:
# Evaluate the best model found by the grid search on the test set
test_accuracy = grid_search.score(X_test, y_test)

print(f"Best model parameters: {grid_search.best_params_}")
print(f"Accuracy of the best model on the test set: {test_accuracy:.2f}")

Best model parameters: {'classifier': LogisticRegression(C=2.7825594022071245, max_iter=500, penalty='l1',
                   solver='liblinear'), 'classifier__C': 2.7825594022071245, 'classifier__penalty': 'l1'}
Accuracy of the best model on the test set: 0.79


### 10. What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.

The best model found in the grid search was logistic regression with an L1 penalty and C ≈ 2.78 (the liblinear solver was fixed in advance rather than searched).

In [48]:
# evaluating the best model found by the grid search on the test set
best_model_accuracy = grid_search.score(X_test, y_test)

print(f"Accuracy of the Logistic Regression model on the test set: {best_model_accuracy:.2f}")

Accuracy of the Logistic Regression model on the test set: 0.79
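
Beyond the single winner, it can be useful to pull the fitted classifier out of the pipeline and compare the best cross-validated score per model family. The following is a sketch on synthetic data with a deliberately small search space; `X_demo`, `y_demo`, `pipe`, `space`, and `search` are illustrative names, and the placeholder classifier in `pipe` is an arbitrary choice that the search overrides:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

pipe = Pipeline([
    ("min_max_scaler", MinMaxScaler()),
    ("classifier", KNeighborsClassifier()),  # placeholder, replaced during the search
])
space = [
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [3, 5, 7]},
    {"classifier": [LogisticRegression(max_iter=500)],
     "classifier__C": [0.1, 1.0, 10.0]},
]
search = GridSearchCV(pipe, space, cv=5, scoring="accuracy").fit(X_tr, y_tr)

# The refit winner, extracted from the pipeline
best_clf = search.best_estimator_.named_steps["classifier"]
print(type(best_clf).__name__, search.score(X_te, y_te))

# Best mean CV accuracy per model family
results = pd.DataFrame(search.cv_results_)
results["model"] = results["params"].map(lambda p: type(p["classifier"]).__name__)
per_model = results.groupby("model")["mean_test_score"].max()
print(per_model)
```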


### 11. Summarize your results.

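After dropping rows with missing values, 480 of the original 614 records remained, and 20% of them (96 rows) were held out as the test set. The default KNN pipeline reached 67% accuracy on the test set. Tuning n_neighbors from 1 to 10 with 5-fold cross-validation selected n_neighbors = 5 (cross-validated accuracy of about 72%), which is also the scikit-learn default, so the test accuracy stayed at 67%. Expanding the search to 39 candidates across KNN, logistic regression, and random forest selected logistic regression with an L1 penalty and C ≈ 2.78, which reached 79% on the test set. In this run, changing the model family improved accuracy more than tuning KNN alone, which shows the value of searching over models as well as hyperparameters. Because the split was not seeded and the test set is small, the exact figures will vary from run to run.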