In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
df_lending_data = pd.read_csv(
    Path("./Resources/lending_data.csv")
)

# Review the DataFrame
df_lending_data.head()

|   | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df_lending_data['loan_status']

# Separate the X variable, the features
X = df_lending_data.drop(columns=['loan_status'])

In [4]:
# Review the y variable Series
print("Labels (y):")
print(y.head()) 

Labels (y):
0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64


In [5]:
# Review the X variable DataFrame
X.head()

|   | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
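
If the labels are imbalanced, a purely random split can leave the training and testing sets with slightly different class ratios. The sketch below shows the optional `stratify` argument of `train_test_split` on made-up data; `X_demo` and `y_demo` are hypothetical stand-ins, and the split used in this notebook does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 3% of them labeled 1 (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 970 + [1] * 30)

# stratify=y_demo keeps the share of each class the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Compare the share of class 1 in the training and testing labels
print(f"train share of 1s: {y_tr.mean():.3f}")
print(f"test share of 1s:  {y_te.mean():.3f}")
```

Whether stratifying matters here depends on the data; with tens of thousands of rows a random split is usually close to proportional anyway.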

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [7]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
model = LogisticRegression(random_state=1)

# Fit the model using the training data
model.fit(X_train, y_train)

# The model is now trained and ready for predictions
print("Logistic Regression model has been successfully trained!")

Logistic Regression model has been successfully trained!


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [8]:
# Make a prediction using the testing data
y_pred = model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [11]:
# Generate a confusion matrix for the model
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion)

Confusion Matrix:
[[14924    77]
 [   31   476]]


In [14]:
# Print the classification report for the model
report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)
print('The counts of Healthy Loan Status (0) and Loans with a high likelihood of default (1) are:')

# Show the final counts of the y array to show how many 0s and how many 1s were counted
print(y.value_counts())


Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     15001
           1       0.86      0.94      0.90       507

    accuracy                           0.99     15508
   macro avg       0.93      0.97      0.95     15508
weighted avg       0.99      0.99      0.99     15508

The counts of Healthy Loan Status (0) and Loans with a high likelihood of default (1) are:
loan_status
0    75036
1     2500
Name: count, dtype: int64


### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The confusion matrix reveals how well the logistic regression model performed in predicting the loan statuses (healthy vs. high-risk). Here’s a breakdown:

Confusion Matrix:

True Negatives (TN): 14,924
- The model correctly identified 14,924 healthy loans (0).

False Positives (FP): 77
- The model incorrectly classified 77 healthy loans (0) as high-risk (1).

False Negatives (FN): 31
- The model incorrectly classified 31 high-risk loans (1) as healthy (0).

True Positives (TP): 476
- The model correctly identified 476 high-risk loans (1).

Insights:
Healthy Loans (Label 0):

- With 14,924 true negatives and only 77 false positives, the model is excellent at identifying healthy loans.

High-Risk Loans (Label 1):

- While the model accurately identified 476 high-risk loans, it missed 31 high-risk loans (false negatives).
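
As a cross-check, the four counts above can be turned into the rates that the classification report prints. This is a minimal sketch using the counts copied from the printed matrix; with the NumPy array returned by `confusion_matrix`, the same unpacking could be done with `confusion.ravel()`.

```python
# Counts copied from the printed confusion matrix
# (rows = actual class, columns = predicted class)
confusion = [[14924, 77],
             [31, 476]]
(tn, fp), (fn, tp) = confusion

# Class 1 (high-risk) metrics
precision_1 = tp / (tp + fp)   # share of predicted high-risk loans that are high-risk
recall_1 = tp / (tp + fn)      # share of actual high-risk loans that were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (healthy) metrics
precision_0 = tn / (tn + fn)   # share of predicted healthy loans that are healthy
recall_0 = tn / (tn + fp)      # share of actual healthy loans that were kept as healthy

accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}, f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```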

Here’s a breakdown of the Classification Report and what it tells us about the model’s performance:

Classification Report Insights:

Class 0 (Healthy Loans):

Precision: 1.00
- Nearly every loan the model predicts as healthy really is healthy (14,924 of 14,955, about 0.998, which rounds to 1.00); only 31 high-risk loans were labeled healthy.

Recall: 0.99
- The model correctly identifies 99% of the healthy loans in the test set (14,924 of 15,001).

F1-Score: 1.00
- The harmonic mean of precision and recall rounds to 1.00, reflecting near-perfect performance for this class.

Class 1 (High-Risk Loans):

Precision: 0.86

- 86% of the loans predicted as high-risk are truly high-risk (476 of 553); the remaining 14% are healthy loans flagged as high-risk (false positives).

Recall: 0.94

- The model successfully identifies 94% of all high-risk loans, showing good sensitivity toward this class.

F1-Score: 0.90

- Indicates strong performance overall for high-risk loans but leaves room for improvement in precision.

Overall Accuracy: 0.99

- The model correctly classifies 99% of loans overall.

Macro Average:

- Precision: 0.93

- Recall: 0.97

- F1-Score: 0.95

This average treats both classes equally, showing that the model performs well on both classes, with a slight edge for healthy loans.

Weighted Average:

- Precision, Recall, F1-Score: 0.99 each

These metrics are weighted by the number of samples in each class, favoring the more frequent healthy loans (class 0).
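
The difference between the two averages can be reproduced by hand. The sketch below recomputes them from the per-class counts in the confusion matrix and the `support` column of the report; the helper names `macro_avg` and `weighted_avg` are illustrative, not scikit-learn functions.

```python
# Per-class values derived from the confusion matrix counts
support = {0: 15001, 1: 507}                       # test-set rows per class
precision = {0: 14924 / (14924 + 31), 1: 476 / (476 + 77)}
recall = {0: 14924 / 15001, 1: 476 / 507}

def macro_avg(per_class):
    """Plain mean: every class counts equally, regardless of size."""
    return sum(per_class.values()) / len(per_class)

def weighted_avg(per_class, support):
    """Mean weighted by support: large classes dominate the result."""
    total = sum(support.values())
    return sum(per_class[c] * support[c] for c in per_class) / total

print(f"macro precision:    {macro_avg(precision):.2f}")
print(f"macro recall:       {macro_avg(recall):.2f}")
print(f"weighted precision: {weighted_avg(precision, support):.2f}")
print(f"weighted recall:    {weighted_avg(recall, support):.2f}")
```

Because class 0 makes up almost 97% of the test rows, the weighted averages sit very close to the class 0 values, while the macro averages are pulled down by class 1.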

Observations:
The dataset is heavily imbalanced: 75,036 healthy loans (class 0) versus 2,500 high-risk loans (class 1), so only about 3% of loans are high-risk. Despite this imbalance, the model performs well on both classes.

Class 1 (high-risk loans) has lower precision (0.86) than recall (0.94), so false positives are its main weakness: some healthy loans are flagged as high-risk, and better differentiation between the two classes would reduce them.

The weighted metrics and the overall accuracy are extremely high largely because healthy loans dominate the dataset.
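
To put the 0.99 accuracy in context, it helps to compare it with a trivial baseline that labels every loan as healthy. A minimal sketch using the class counts printed by `y.value_counts()`:

```python
# Class counts from y.value_counts() in the output above
counts = {0: 75036, 1: 2500}
total = sum(counts.values())

# Accuracy of a "model" that always predicts the majority class (0)
baseline_accuracy = max(counts.values()) / total

# Share of the dataset that is high-risk (class 1)
minority_share = counts[1] / total

print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"share of high-risk loans:         {minority_share:.3f}")
```

The baseline already reaches roughly 0.968 while catching no high-risk loans at all, which is why the per-class precision and recall are more informative here than accuracy alone.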

Recommendations:

For Deployment:

The model is highly effective overall and can be recommended for deployment, especially if catching high-risk loans is the priority (recall of 0.94 for class 1). The trade-off is that about 14% of the loans it flags as high-risk are actually healthy.

For Improvement:

- Consider raising the decision threshold to improve precision for class 1, at some cost in recall. Resampling techniques such as oversampling the minority class (e.g., SMOTE) are also worth testing, although they tend to help recall more than precision.
- Explore alternative models such as Random Forest or Gradient Boosting to see whether they perform better on high-risk loans.
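
To illustrate the threshold idea: for a binary `LogisticRegression`, `predict` is equivalent to a 0.5 cutoff on the class 1 probability, and the probabilities themselves are available through `model.predict_proba(X_test)[:, 1]`. The sketch below uses a small set of hypothetical probabilities and labels (not values from this notebook) to show how a higher cutoff trades recall for precision.

```python
# Hypothetical true labels and predicted class 1 probabilities (illustrative only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
proba_1 = [0.05, 0.10, 0.20, 0.55, 0.65, 0.30, 0.60, 0.70, 0.90, 0.95]

def precision_recall(y_true, proba, threshold):
    """Return (precision, recall) for class 1 at the given probability cutoff."""
    y_pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Compare the default cutoff with a stricter one
for threshold in (0.5, 0.7):
    precision, recall = precision_recall(y_true, proba_1, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data the stricter cutoff removes the false positives but misses one high-risk case, so the right threshold depends on which error is more costly for the lender.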

---