# Module 20 challenge
In this Challenge, you’ll use various techniques to train and evaluate a model based on loan risk. You’ll use a dataset of historical lending activity from a peer-to-peer lending services company to build a model that can identify the creditworthiness of borrowers.

In [52]:
#Import the required modules
import numpy as np
import pandas as pd 
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

# Split the data into test and training sets

In [53]:
# Read the lending data CSV file into a Pandas DataFrame
lending_data_df = pd.read_csv(Path('Documents/lending_data.csv'))

# Review the DataFrame
display(lending_data_df.head())
display(lending_data_df.tail())

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700,7.672,52800,0.431818,5,1,22800,0
1,8400,6.692,43600,0.311927,3,0,13600,0
2,9000,6.963,46100,0.349241,3,0,16100,0
3,10700,7.664,52700,0.43074,5,1,22700,0
4,10800,7.698,53000,0.433962,5,1,23000,0


Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
77531,19100,11.261,86600,0.65358,12,2,56600,1
77532,17700,10.662,80900,0.629172,11,2,50900,1
77533,17600,10.595,80300,0.626401,11,2,50300,1
77534,16300,10.068,75300,0.601594,10,2,45300,1
77535,15600,9.742,72300,0.585062,9,2,42300,1


In [54]:
# Create the X and y data by separating the features from the labels

# Separate the y variable, the labels
y = lending_data_df['loan_status']

# Separate the X variable, the features
X = lending_data_df.drop(columns = 'loan_status')

In [55]:
#Review the Y variable
display(y.head())
display(y.tail())

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

77531    1
77532    1
77533    1
77534    1
77535    1
Name: loan_status, dtype: int64

In [56]:
#Review the X variable 
display(X.head())
display(X.tail())

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700,7.672,52800,0.431818,5,1,22800
1,8400,6.692,43600,0.311927,3,0,13600
2,9000,6.963,46100,0.349241,3,0,16100
3,10700,7.664,52700,0.43074,5,1,22700
4,10800,7.698,53000,0.433962,5,1,23000


Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
77531,19100,11.261,86600,0.65358,12,2,56600
77532,17700,10.662,80900,0.629172,11,2,50900
77533,17600,10.595,80300,0.626401,11,2,50300
77534,16300,10.068,75300,0.601594,10,2,45300
77535,15600,9.742,72300,0.585062,9,2,42300


In [57]:
#Check the balance of the labels variable (y) by using the value_counts function.
y.value_counts()

loan_status
0    75036
1     2500
Name: count, dtype: int64
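The counts above show a strong class imbalance. As a rough illustration of why plain accuracy can mislead on this dataset, the sketch below uses only the two counts printed above to compute the high-risk share and the accuracy of a trivial model that always predicts 0. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
healthy, high_risk = 75036, 2500
total = healthy + high_risk

# Share of the minority (high-risk) class in the full dataset
minority_share = high_risk / total

# Accuracy of a trivial model that labels every loan as healthy (0)
majority_baseline_accuracy = healthy / total

print(f"High-risk share: {minority_share:.4f}")
print(f"Always-healthy accuracy: {majority_baseline_accuracy:.4f}")
```

A model that never flags a single high-risk loan would still score close to 97% accuracy here, which is why the evaluation later in the notebook relies on balanced accuracy and per-class recall.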

In [58]:
# Split the data into training and testing datasets by using train_test_split
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    random_state = 1
)
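Because no `test_size` is passed, `train_test_split` falls back to its default of a 25% test set. The arithmetic below is a small sketch of the resulting split sizes, assuming the 77,536 rows implied by the label counts. It does not call scikit-learn, and the variable names are illustrative.

```python
import math

n_samples = 75036 + 2500  # total rows, from the label counts
test_size = 0.25          # scikit-learn's default when no size is given

# The test set takes the ceiling of the fraction; the rest is training data
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"Training rows: {n_train}, testing rows: {n_test}")
```

The testing row count should agree with the total support shown in the classification report. With classes this imbalanced, passing `stratify=y` to `train_test_split` is one option for keeping the high-risk share equal in both subsets, although the notebook does not do so.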

# Create a Logistic Regression Model with the Original Data

In [59]:
#Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
LR_model = LogisticRegression(random_state = 1)

# Fit the model using training data
LR_model.fit(X_train, y_train)

In [60]:
# Save the predictions on the testing data labels by using the testing feature data (X_test) and the fitted model.
# Make a prediction using the testing data
LR_predictions = LR_model.predict(X_test)

Evaluate the model’s performance by doing the following:

- Calculate the balanced accuracy score of the model.
- Generate a confusion matrix.
- Print the classification report.

In [61]:
# Print the balanced_accuracy score of the model
balanced_accuracy_score(y_test, LR_predictions)

0.9520479254722232

In [62]:
# Generate a confusion matrix for the model
cm_imbalanced = confusion_matrix(y_test, LR_predictions)
cm_imbalanced_df = pd.DataFrame(
    cm_imbalanced,
    index = ['Actual Healthy Loans (low-risk)', 'Actual Non-Healthy Loans (high-risk)'],
    columns = ['Predicted Healthy Loans (low-risk)', 'Predicted Non-Healthy Loans (high-risk)']
)
cm_imbalanced_df

Unnamed: 0,Predicted Healthy Loans (low-risk),Predicted Non-Healthy Loans (high-risk)
Actual Healthy Loans (low-risk),18663,102
Actual Non-Healthy Loans (high-risk),56,563
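The balanced accuracy score reported earlier can be recovered from this confusion matrix, since balanced accuracy is the mean of the per-class recalls. The sketch below hard-codes the four cell values from the table above; the variable names are illustrative.

```python
# Cells of the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 18663, 102  # actual healthy loans
fn, tp = 56, 563     # actual high-risk loans

# Recall for each class: correct predictions divided by actual members
recall_healthy = tn / (tn + fp)
recall_high_risk = tp / (tp + fn)

# Balanced accuracy is the unweighted mean of the two recalls
balanced_accuracy = (recall_healthy + recall_high_risk) / 2

# Plain accuracy, for comparison, is dominated by the majority class
plain_accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"Balanced accuracy: {balanced_accuracy:.4f}")
print(f"Plain accuracy:    {plain_accuracy:.4f}")
```

The gap between the two numbers comes almost entirely from the high-risk class, where roughly 9% of the actual high-risk loans are missed.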


In [63]:
# Print the classification report for the model
print(classification_report(y_test, LR_predictions, target_names=('Healthy Loans','High Risk Loans')))

                 precision    recall  f1-score   support

  Healthy Loans       1.00      0.99      1.00     18765
High Risk Loans       0.85      0.91      0.88       619

       accuracy                           0.99     19384
      macro avg       0.92      0.95      0.94     19384
   weighted avg       0.99      0.99      0.99     19384



# Question: How well does the logistic regression model predict both the 0 (healthy loan) and 1 (high-risk loan) labels?

Answer: The model performs well overall, but the headline numbers are inflated by class imbalance. Healthy (low-risk) loans far outnumber high-risk loans (18,765 versus 619 in the test set), so the model identifies healthy loans more reliably than high-risk ones.
For healthy loans, precision is 1.00 and recall is 0.99. For high-risk loans, precision is 0.85 and recall is 0.91: 85% of the loans flagged as high-risk really were high-risk, and the model caught 91% of the actual high-risk loans.
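The high-risk row of the classification report can be checked by hand against the confusion matrix. The sketch below recomputes precision, recall, and F1 for the high-risk class from the matrix cells; the variable names are illustrative.

```python
# Confusion matrix cells for the high-risk (positive) class
tp = 563  # high-risk loans correctly flagged
fp = 102  # healthy loans wrongly flagged as high-risk
fn = 56   # high-risk loans wrongly labeled healthy

# Precision: of the loans flagged as high-risk, how many really were
precision = tp / (tp + fp)

# Recall: of the actual high-risk loans, how many were flagged
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these values match the 0.85, 0.91, and 0.88 shown in the report.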

# Predict a Logistic Regression Model with Resampled Training Data

In [None]:
# Import the RandomOverSampler module from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
ros = RandomOverSampler(random_state=1)

# Fit the original training data to the random_oversampler model
# Resample only the training split so the test set stays untouched
X_ros_model, y_ros_model = ros.fit_resample(X_train, y_train)

In [None]:
# Count the distinct values of the resampled labels data
print(y_ros_model.value_counts())
print("The resampled labels in y_ros_model are evenly split between the two classes")
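To make the idea behind random oversampling concrete, here is a minimal plain-Python sketch: rows from the smaller class are duplicated at random, with replacement, until every class is as large as the largest one. This is not imbalanced-learn's implementation. The `random_oversample` helper and the toy rows are hypothetical and exist only for illustration.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows that belong to this class
        pool = [row for row, label in zip(rows, labels) if label == cls]
        # Draw the missing number of rows with replacement (none for the largest class)
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data (hypothetical): four healthy loans and one high-risk loan
rows = [[10700, 7.672], [8400, 6.692], [9000, 6.963], [10800, 7.698], [19100, 11.261]]
labels = [0, 0, 0, 0, 1]

res_rows, res_labels = random_oversample(rows, labels)
print(Counter(res_labels))
```

Because the added rows are exact copies of existing ones, oversampling adds no new information. It only changes how much weight the minority class carries during fitting, which is why it should be applied to the training split alone.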

# Use the LogisticRegression classifier and the resampled data to fit the model and make predictions.

In [None]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
classifier = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using the resampled training data
classifier.fit(X_ros_model, y_ros_model)

# Make a prediction using the testing data
predictions = classifier.predict(X_test)
pd.DataFrame({'Predictions': predictions, 'Actual': y_test})

In [None]:
# Print the balanced_accuracy score of the model
print(f"The balanced accuracy score of the model is: {balanced_accuracy_score(y_test, predictions)}")

In [None]:
# Generate a confusion matrix for the model
cf_matrix = confusion_matrix(y_test, predictions)
cf_matrix

In [None]:
# Print the classification report for the model
report = classification_report(y_test, predictions)
print(report)

# Question: How well does the logistic regression model, fit with oversampled data, predict both the 0 (healthy loan) and 1 (high-risk loan) labels?

Answer: The oversampled model generated an accuracy score of 99%, which is higher than the score of the model fitted with the imbalanced data. The oversampled model performs better because it makes fewer of the most serious mistakes: labeling non-healthy (high-risk) loans as healthy (low-risk).

A lending company would likely prefer a model with higher recall on the high-risk class. When a healthy loan is flagged as non-healthy, the company may lose a customer, but it loses no money because no funds were ever lent. When a non-healthy loan is labeled healthy, the company lends funds that it may never recover, which makes this the more costly error overall.
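The cost argument above can be made concrete with a small calculation. The sketch below takes the error counts from the first model's confusion matrix and applies per-error costs that are purely hypothetical. The two dollar figures are assumptions chosen for illustration, not values from the dataset.

```python
# Error counts from the first model's confusion matrix
false_negatives = 56   # high-risk loans labeled healthy (funds lent, may be lost)
false_positives = 102  # healthy loans labeled high-risk (customer turned away)

# Hypothetical per-error costs, assumed only for this illustration
cost_per_false_negative = 10_000  # assumed average loss on a defaulted loan
cost_per_false_positive = 500     # assumed value of a lost customer

# Total cost contributed by each kind of error
fn_cost = false_negatives * cost_per_false_negative
fp_cost = false_positives * cost_per_false_positive
total_cost = fn_cost + fp_cost

print(f"False-negative cost: {fn_cost}")
print(f"False-positive cost: {fp_cost}")
print(f"Total cost:          {total_cost}")
```

Under these assumed costs, the missed high-risk loans dominate the total even though they are the less frequent error, which is the reasoning behind favoring recall on the high-risk class. Different cost assumptions could shift that balance.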