In [119]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [120]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
df = pd.read_csv(
    Path('lending_data.csv')
)

# Review the DataFrame
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


### Step 2: Create the labels set (`y`)  from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [121]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df['loan_status']

# Separate the X variable, the features
X = df.drop(columns='loan_status')

In [122]:
# Review the y variable Series
y.head(10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: loan_status, dtype: int64

In [123]:
# Review the X variable DataFrame
X.head(10)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000
5,10100.0,7.438,50600,0.407115,4,1,20600
6,10300.0,7.49,51100,0.412916,4,1,21100
7,8800.0,6.857,45100,0.334812,3,0,15100
8,9300.0,7.096,47400,0.367089,3,0,17400
9,9700.0,7.248,48800,0.385246,4,0,18800


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [124]:
# Check the balance of our target values
print('A value of 0 in the “loan_status” column means that the loan is healthy. A value of 1 means that the loan has a high risk of defaulting.\n') 
y.value_counts()

A value of 0 in the “loan_status” column means that the loan is healthy. A value of 1 means that the loan has a high risk of defaulting.



loan_status
0    75036
1     2500
Name: count, dtype: int64

Note: With 75,036 healthy loans (`0`) and only 2,500 high-risk loans (`1`), the `loan_status` classes are severely imbalanced.
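To put a number on the imbalance, the short sketch below works only from the two counts printed by `y.value_counts()`. The variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Class counts copied from the y.value_counts() output above
healthy_count = 75036      # loan_status == 0
high_risk_count = 2500     # loan_status == 1

total = healthy_count + high_risk_count
high_risk_share = high_risk_count / total           # fraction of loans labelled 1
imbalance_ratio = healthy_count / high_risk_count   # healthy loans per high-risk loan

print(f"Total loans: {total}")
print(f"High-risk share: {high_risk_share:.2%}")
print(f"Healthy loans per high-risk loan: {imbalance_ratio:.1f}")
```

Roughly 3% of the loans are high-risk, which is why plain accuracy can look strong even for a model that rarely predicts class `1`.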

### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [125]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [126]:
# X_train is a DataFrame (rows and columns), so X_train.shape returns a tuple of the form (n_samples, n_features)
X_train.shape

(58152, 7)

In [127]:
# X_test is a DataFrame (rows and columns), so X_test.shape returns a tuple of the form (n_samples, n_features)
X_test.shape

(19384, 7)
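The two shapes are consistent with the default behaviour of `train_test_split`, which holds out 25% of the rows for testing when no size is specified. The sketch below reproduces the row counts with plain arithmetic. Rounding the test size up is an assumption here, and it makes no difference for this dataset because the product is already a whole number.

```python
import math

n_samples = 77536          # 75036 healthy + 2500 high-risk rows
default_test_size = 0.25   # default test fraction when neither size is given

# Assumption: the test set size is rounded up; the training set gets the rest
n_test = math.ceil(n_samples * default_test_size)
n_train = n_samples - n_test

print(f"Training rows: {n_train}")
print(f"Testing rows:  {n_test}")
```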

---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [128]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
# Set max_iter to a larger value (with the default, a warning indicated that the LBFGS solver failed to converge within the allowed number of iterations)
lr_model = LogisticRegression(random_state=1, max_iter=1000)


# Fit the model using training data
lr_model.fit(X_train,y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [129]:
# Make a prediction using the testing data
predictions = lr_model.predict(X_test)
predictions_df = pd.DataFrame({"Prediction": predictions, "Actual": y_test})
predictions_df.tail(20)

Unnamed: 0,Prediction,Actual
68997,0,0
56698,0,0
44442,0,0
63469,0,0
20244,0,0
26834,0,0
7128,0,0
77237,0,1
3080,0,0
54852,0,0


### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [130]:
# Print the accuracy and balanced accuracy scores of the model
from sklearn.metrics import accuracy_score

print(f"The accuracy score is: {accuracy_score(y_test, predictions)}")
print(f"The balanced accuracy score is: {balanced_accuracy_score(y_test, predictions)}")

The accuracy score is: 0.9925711927362774
The balanced accuracy score is: 0.9672620331306306


***Notes:*** 

**accuracy_score:**

The accuracy_score function computes the accuracy, which is the ratio of the number of correct predictions to the total number of predictions. It is expressed as a value between 0 and 1, where 1 means 100% accuracy.

*Accuracy Score Formula:* 
$$ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $$

It is a useful metric when the class distribution is even, meaning each class has roughly the same number of instances. However, it can be misleading when dealing with imbalanced datasets, where one or more classes are under- or over-represented.




**balanced_accuracy_score:**

The balanced_accuracy_score function computes the balanced accuracy, which is the average of the recall obtained on each class. Because every class contributes equally to that average regardless of how many samples it has, the metric takes class imbalance into account and is suitable for imbalanced datasets.

*Balanced Accuracy Formula:* 
$$ \text{Balanced Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i} $$ 

Where $TP_i$ is the number of true positives for class $i$, $FN_i$ is the number of false negatives for class $i$, and $N$ is the number of classes.

Like normal accuracy, balanced accuracy yields a value between 0 and 1, but because it averages the class-wise recalls, it avoids giving inflated results on imbalanced datasets where a model might simply predict the most frequent class most of the time.

Using the two metrics together gives a better understanding of model performance when classes are imbalanced. A high accuracy score might indicate good performance, but on an imbalanced dataset the balanced accuracy score provides a more honest assessment of how well the model does across all classes. If the classes are balanced, the balanced accuracy should be close to the regular accuracy score.
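As a rough illustration of the difference, consider a hypothetical baseline that labels every loan in the test set as healthy. The class sizes below are the test-set sizes in this notebook (18,765 healthy and 619 high-risk loans). The baseline itself is an assumption made for illustration and is not one of the models fitted here.

```python
# Test-set class sizes for this notebook
n_healthy = 18765     # actual class 0
n_high_risk = 619     # actual class 1

# Hypothetical baseline: predict class 0 for every loan
correct_healthy = n_healthy    # every healthy loan is labelled correctly
correct_high_risk = 0          # every high-risk loan is missed

# Accuracy: correct predictions over all predictions
accuracy = (correct_healthy + correct_high_risk) / (n_healthy + n_high_risk)

# Balanced accuracy: mean of the per-class recalls
recall_healthy = correct_healthy / n_healthy
recall_high_risk = correct_high_risk / n_high_risk
balanced_accuracy = (recall_healthy + recall_high_risk) / 2

print(f"Accuracy:          {accuracy:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
```

The baseline scores about 0.97 on accuracy while never identifying a single high-risk loan, whereas its balanced accuracy is only 0.5.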

In [131]:
# Generate a confusion matrix for the model
cm = confusion_matrix(y_test, predictions)

cm_df = pd.DataFrame(cm,
                     index = ["Actual Healthy Loans (low-risk)", "Actual Non-Healthy Loans (high-risk)"],
                     columns = ["Predicted Healthy Loans (low-risk)", "Predicted Non-Healthy Loans (high-risk)"]
                     )

cm_df

Unnamed: 0,Predicted Healthy Loans (low-risk),Predicted Non-Healthy Loans (high-risk)
Actual Healthy Loans (low-risk),18658,107
Actual Non-Healthy Loans (high-risk),37,582


In [132]:
# Print the classification report for the model
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.94      0.89       619

    accuracy                           0.99     19384
   macro avg       0.92      0.97      0.94     19384
weighted avg       0.99      0.99      0.99     19384
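The figures in the classification report can be recomputed by hand from the four confusion matrix counts. The sketch below does this for both classes; the helper name `precision_recall_f1` is illustrative and not part of scikit-learn.

```python
# Counts from the confusion matrix above, with class 1 (high-risk) as the positive class
tn, fp = 18658, 107   # actual healthy:   predicted healthy, predicted high-risk
fn, tp = 37, 582      # actual high-risk: predicted healthy, predicted high-risk

def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return precision, recall, and F1 for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1: positives are high-risk loans
p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)
# Class 0: roles swap, so a missed high-risk loan counts as a false positive for class 0
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)
balanced_accuracy = (r0 + r1) / 2

print(f"Class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"Class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"Accuracy={accuracy:.4f} balanced accuracy={balanced_accuracy:.4f}")
```

Note that the F1 score is the harmonic mean of precision and recall, not an accuracy measure.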



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:**
For '0' (healthy loans), the model achieves 100% precision, 99% recall, and an F1 score of 100% (after rounding). For '1' (unhealthy loans), it achieves 84% precision, 94% recall, and an F1 score of 89%. Based on these metrics, the model is noticeably better at predicting healthy loans than unhealthy loans. This gap is most likely due to the class imbalance in the training dataset, which contains about 30 times as many healthy observations to learn from as unhealthy ones.



---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [112]:
# Import the RandomOverSampler class from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
ros_model = RandomOverSampler(random_state = 1)

# Fit the original training data to the random_oversampler model
X_oversampled, y_oversampled = ros_model.fit_resample(X_train, y_train)

In [114]:
# Count the distinct values of the resampled labels data
y_oversampled.value_counts()

loan_status
0    56271
1    56271
Name: count, dtype: int64
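The equal counts follow from how random oversampling works in principle: minority-class rows are drawn at random, with replacement, until both classes are the same size. The pure-Python sketch below illustrates the idea on labels only. It is a simplified illustration under that assumption, not the imbalanced-learn implementation, and it will not select the same rows as `RandomOverSampler`.

```python
import random
from collections import Counter

# Training labels before resampling: 56271 healthy and 58152 - 56271 = 1881 high-risk
labels = [0] * 56271 + [1] * 1881

class_counts = Counter(labels)
majority_size = max(class_counts.values())
minority_class = min(class_counts, key=class_counts.get)

# Draw existing minority rows with replacement until the classes are the same size
rng = random.Random(1)
minority_idx = [i for i, label in enumerate(labels) if label == minority_class]
n_extra = majority_size - len(minority_idx)
extra_idx = rng.choices(minority_idx, k=n_extra)

resampled_labels = labels + [labels[i] for i in extra_idx]
counts = Counter(resampled_labels)
print(counts)
```

Because the added rows are duplicates of existing high-risk loans, oversampling changes how the classes are weighted during training without adding new information.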

### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [115]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
LR_oversampled_model = LogisticRegression(random_state = 1)

# Fit the model using the resampled training data
LR_oversampled_model.fit(X_oversampled, y_oversampled)

# Make a prediction using the testing data
LR_oversampled_pred = LR_oversampled_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [116]:
# Print the balanced_accuracy score of the model 
balanced_accuracy_score(y_test, LR_oversampled_pred)

0.9935981855334257

In [117]:
# Generate a confusion matrix for the model
cm_oversampled = confusion_matrix(y_test, LR_oversampled_pred)
cm_oversampled_df = pd.DataFrame(cm_oversampled, 
                                index = ['Actual Healthy Loans (low-risk)', 'Actual Non-Healthy Loans (high-risk)'], 
                                columns = ['Predicted Healthy Loans (low-risk)', 'Predicted Non-Healthy Loans (high-risk)']
                              )
cm_oversampled_df

Unnamed: 0,Predicted Healthy Loans (low-risk),Predicted Non-Healthy Loans (high-risk)
Actual Healthy Loans (low-risk),18646,119
Actual Non-Healthy Loans (high-risk),4,615


In [118]:
# Print the classification report for the model
print(classification_report(y_test, LR_oversampled_pred))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The logistic regression model fitted with the oversampled data improves on the original model, mainly for the high-risk class.
- The balanced accuracy score rose from 0.9673 to 0.9936, an increase of about 2.63 percentage points.
- For '0' (healthy loans), precision remained at 100%, recall at 99%, and the F1 score at 100%.
- For '1' (unhealthy loans), precision remained at 84%, recall rose from 94% to 99%, and the F1 score rose from 89% to 91%.

Overall, the model still predicts healthy loans better than unhealthy loans, but its ability to identify unhealthy loans improved after oversampling: it misses only 4 of the 619 high-risk loans in the test set instead of 37, at the cost of flagging 119 healthy loans as high-risk instead of 107.
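The comparison can be checked directly from the two confusion matrices reported in this notebook. In the sketch below, the dictionary layout and the helper name `balanced_accuracy` are illustrative choices.

```python
# Confusion matrix counts for both models, with class 1 (high-risk) as positive
original = {"tn": 18658, "fp": 107, "fn": 37, "tp": 582}
oversampled = {"tn": 18646, "fp": 119, "fn": 4, "tp": 615}

def balanced_accuracy(cm):
    """Mean of the recall on class 0 and the recall on class 1."""
    recall_0 = cm["tn"] / (cm["tn"] + cm["fp"])
    recall_1 = cm["tp"] / (cm["tp"] + cm["fn"])
    return (recall_0 + recall_1) / 2

# Change in balanced accuracy, expressed in percentage points
gain_points = (balanced_accuracy(oversampled) - balanced_accuracy(original)) * 100
fewer_missed = original["fn"] - oversampled["fn"]          # high-risk loans no longer missed
extra_false_alarms = oversampled["fp"] - original["fp"]    # additional healthy loans flagged

print(f"Balanced accuracy gain: {gain_points:.2f} percentage points")
print(f"Fewer missed high-risk loans: {fewer_missed}")
print(f"Additional healthy loans flagged as high-risk: {extra_false_alarms}")
```

Whether catching 33 more high-risk loans is worth 12 additional false alarms depends on the relative cost of each kind of error, which the notebook does not specify.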
