In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

str_path_file = "Resources/lending_data.csv"

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
lending_data_df = pd.read_csv(str_path_file)

# Review the DataFrame
lending_data_df.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`)  from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = lending_data_df['loan_status']

# Separate the X variable, the features
X = lending_data_df.drop('loan_status', axis=1)

# Display the first few rows of y for verification
print("Labels (y):")
print(y.head())

# Display the first few rows of X for verification
print("\nFeatures (X):")
X.head(5)

Labels (y):
0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

Features (X):


| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 |


In [4]:
# Review the y variable Series
summary_stats = y.describe()
print("Statistical summary of y:")
summary_stats

Statistical summary of y:


count    77536.000000
mean         0.032243
std          0.176646
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: loan_status, dtype: float64
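
Because `loan_status` holds only 0 and 1, its mean is the share of high-risk loans. The following is a minimal sketch that uses the `count` and `mean` values printed above; the variable names are illustrative and not part of the notebook, and the result is approximate because the printed mean is rounded.

```python
# Values copied from the y.describe() output above
n_loans = 77536
mean_status = 0.032243  # mean of a 0/1 column = fraction of 1s

# Approximate number of high-risk (1) and healthy (0) loans
n_high_risk = round(n_loans * mean_status)
n_healthy = n_loans - n_high_risk

print(f"High-risk loans: {n_high_risk} ({mean_status:.1%})")
print(f"Healthy loans:   {n_healthy} ({1 - mean_status:.1%})")
```

In the notebook itself, `y.value_counts()` gives the exact counts directly. The imbalance matters later when reading the accuracy figure.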

In [5]:
# Review the X variable DataFrame
summary_stats = X.describe()
print("Statistical summary of x:")
summary_stats

Statistical summary of x:


| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| count | 77536.0 | 77536.0 | 77536.0 | 77536.0 | 77536.0 | 77536.0 | 77536.0 |
| mean | 9805.562577 | 7.292333 | 49221.949804 | 0.377318 | 3.82661 | 0.392308 | 19221.949804 |
| std | 2093.223153 | 0.889495 | 8371.635077 | 0.081519 | 1.904426 | 0.582086 | 8371.635077 |
| min | 5000.0 | 5.25 | 30000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 8700.0 | 6.825 | 44800.0 | 0.330357 | 3.0 | 0.0 | 14800.0 |
| 50% | 9500.0 | 7.172 | 48100.0 | 0.376299 | 4.0 | 0.0 | 18100.0 |
| 75% | 10400.0 | 7.528 | 51400.0 | 0.416342 | 4.0 | 1.0 | 21400.0 |
| max | 23800.0 | 13.235 | 105200.0 | 0.714829 | 16.0 | 3.0 | 75200.0 |


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
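
As a rough check on the split, the sketch below works out how many rows land in each set for `test_size=0.2`. It assumes that scikit-learn rounds the test count up when `test_size` is a fraction, which agrees with the 15,508 test rows reported later in this notebook. The names `n_rows`, `n_test`, and `n_train` are illustrative.

```python
import math

n_rows = 77536    # row count from the describe() output
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

In the notebook, `X_train.shape` and `X_test.shape` show the actual sizes.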

---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [7]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logreg_model = LogisticRegression(random_state=1)

# Fit the model using training data
logreg_model.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [8]:
# Make a prediction using the testing data
y_pred = logreg_model.predict(X_test)

df_results = pd.DataFrame({'Predictions': y_pred[:10], 'Actual values': y_test[:10]})
df_results.head(10)

| | Predictions | Actual values |
|---|---|---|
| 60914 | 0 | 0 |
| 36843 | 0 | 0 |
| 1966 | 0 | 0 |
| 70137 | 0 | 0 |
| 27237 | 0 | 0 |
| 40013 | 0 | 0 |
| 43107 | 0 | 0 |
| 61988 | 0 | 0 |
| 57437 | 0 | 0 |
| 46757 | 0 | 0 |


In [9]:
# Create a DataFrame to combine predictions and actual values
df_results = pd.DataFrame({'Predictions': y_pred, 'Actual values': y_test})

# Group by 'Predictions' and 'Actual values', and calculate the count of each group
df_results_counts = df_results.groupby(['Predictions', 'Actual values']).size().reset_index(name='Count')

# Display the result sorted by 'Predictions' for better readability
df_results_counts = df_results_counts.sort_values(by='Predictions')

# Print the result
print(df_results_counts)

   Predictions  Actual values  Count
0            0              0  14924
1            0              1     31
2            1              0     77
3            1              1    476


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [10]:
# Generate a confusion matrix for the model
cm = confusion_matrix(y_test, y_pred)

# Display the confusion matrix
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[14924    77]
 [   31   476]]


In [11]:
labels = ['Actual 0', 'Actual 1']
columns = ['Predicted 0', 'Predicted 1']

# Create a DataFrame for the confusion matrix with labels
df_cm = pd.DataFrame(cm, index=labels, columns=columns)

# Display the confusion matrix DataFrame
print("Confusion Matrix:")
print(df_cm)

Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0        14924           77
Actual 1           31          476


In [12]:
# Print the classification report for the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     15001
           1       0.86      0.94      0.90       507

    accuracy                           0.99     15508
   macro avg       0.93      0.97      0.95     15508
weighted avg       0.99      0.99      0.99     15508



### Step 4: Analysis Using Logistic Regression Output and ChatGPT

* How well does the logistic regression model predict the `0` (healthy loan) and `1` (high-risk loan) labels?
# Confusion Matrix
|            | Predicted 0 | Predicted 1 |
|------------|--------------|-------------|
| **Actual 0** | 14924       | 77          |
| **Actual 1** | 31          | 476         |

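- True negatives: 14,924 healthy loans were correctly predicted as class 0.
- False negatives: 31 high-risk loans were incorrectly predicted as class 0.
- True positives: 476 high-risk loans were correctly predicted as class 1.
- False positives: 77 healthy loans were incorrectly predicted as class 1.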
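
The four counts can be read straight from the matrix. The sketch below hard-codes the matrix printed above as a nested list (in the notebook, `cm` is a NumPy array with the same layout) and computes the row and column totals. The row totals are the support values that appear in the classification report.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
cm_values = [[14924, 77],
             [31, 476]]

# Unpack in row order: true neg, false pos, false neg, true pos
(tn, fp), (fn, tp) = cm_values

actual_0 = tn + fp      # all truly healthy loans
actual_1 = fn + tp      # all truly high-risk loans
predicted_0 = tn + fn   # everything the model labelled 0
predicted_1 = fp + tp   # everything the model labelled 1

print(actual_0, actual_1, predicted_0, predicted_1)
```

With the NumPy array in the notebook, `tn, fp, fn, tp = cm.ravel()` performs the same unpacking.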

# Classification Report

### Precision: true positives / total predicted positives (computed per class).
##### Precision for the negative class (0) rounds to 100%. Of the 14,955 loans predicted as 0, 14,924 were actually 0.
##### Precision for the positive class (1) is 86%. Of the 553 loans predicted as 1, 476 were actually 1.
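
The precision values in the report can be reproduced from the confusion matrix counts. This is a small sketch with the four counts typed in by hand; the variable names are illustrative.

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 14924, 77, 31, 476

# Precision = correct predictions of a class / all predictions of that class
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

print(f"Precision (0): {precision_0:.2f}")
print(f"Precision (1): {precision_1:.2f}")
```

Rounded to two decimals, these match the `precision` column of the classification report, so the two outputs are consistent.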

### Recall/Sensitivity: true positives / total actual positives (computed per class).
##### Recall for class 0 is 99%. Of the 15,001 observations that were truly class 0, 14,924 were predicted as 0.
##### Recall for class 1 is 94%. Of the 507 observations that were truly class 1, 476 were predicted as 1.
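
Recall uses the same counts as precision but divides by the row totals (actual class sizes) rather than the column totals. A minimal sketch, again with hand-typed counts and illustrative names:

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 14924, 77, 31, 476

# Recall = correct predictions of a class / all actual members of that class
recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)

print(f"Recall (0): {recall_0:.2f}")
print(f"Recall (1): {recall_1:.2f}")
```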

### F1-score: the harmonic mean of precision and recall.
##### F1-score for class 0 rounds to 100%.
##### F1-score for class 1 is 90%.
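
The F1-score combines precision and recall into one number per class. The sketch below defines a small helper, `f1_score_from`, which is a hypothetical name used only for this illustration and is not a scikit-learn function.

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 14924, 77, 31, 476


def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 0 treats "0" as the class of interest, class 1 treats "1" as such
f1_0 = f1_score_from(tn / (tn + fn), tn / (tn + fp))
f1_1 = f1_score_from(tp / (tp + fp), tp / (tp + fn))

print(f"F1 (0): {f1_0:.2f}")
print(f"F1 (1): {f1_1:.2f}")
```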

### Support: the number of actual instances of each class in the test set.
##### Support for class 0 is 15,001 instances.
##### Support for class 1 is 507 instances.

### Accuracy:
##### Accuracy is 99%. The model predicted the correct label for 15,400 of the 15,508 loans in the test set.
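
Accuracy is the share of all test loans that were labelled correctly. For context, the sketch below also computes the accuracy of a hypothetical baseline that always predicts class 0; that baseline is an illustration added here and is not part of the notebook's model.

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 14924, 77, 31, 476
total = tn + fp + fn + tp

# Accuracy = all correct predictions / all predictions
accuracy = (tn + tp) / total

# Hypothetical baseline: label every loan as healthy (0)
baseline_accuracy = (tn + fp) / total

print(f"Model accuracy:    {accuracy:.3f}")
print(f"Always-0 baseline: {baseline_accuracy:.3f}")
```

Because most loans are healthy, the baseline already scores high, which is one reason to read accuracy alongside the class 1 precision and recall.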

### Macro Average and Weighted Average:
##### Macro Avg is 0.93 (Precision), 0.97 (Recall), 0.95 (F1-score). This is the unweighted mean of the per-class metrics, so both classes count equally regardless of their size.
##### Weighted Avg is 0.99 (Precision), 0.99 (Recall), 0.99 (F1-score). This is the mean of the per-class metrics weighted by the support (number of instances) of each class.
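
The two averages differ only in how the per-class values are combined. Here is a sketch for precision and recall, using the counts from the confusion matrix; the helper names `macro_avg` and `weighted_avg` are illustrative.

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 14924, 77, 31, 476

support = [tn + fp, fn + tp]                 # actual class sizes: [15001, 507]
precision = [tn / (tn + fn), tp / (tp + fp)]
recall = [tn / (tn + fp), tp / (tp + fn)]


def macro_avg(values):
    """Plain mean: every class counts equally."""
    return sum(values) / len(values)


def weighted_avg(values, weights):
    """Mean weighted by class support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


macro_precision = macro_avg(precision)
macro_recall = macro_avg(recall)
weighted_precision = weighted_avg(precision, support)
weighted_recall = weighted_avg(recall, support)

print(f"Macro:    {macro_precision:.2f} {macro_recall:.2f}")
print(f"Weighted: {weighted_precision:.2f} {weighted_recall:.2f}")
```

The weighted averages sit close to the class 0 values because class 0 supplies most of the test rows.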

# Summary of Credit Risk Classification model:
- The model performs extremely well for class 0 (healthy loans): precision, recall, and F1-score are all 99% or higher, so it identifies class 0 correctly in nearly all cases.
- For class 1 (high-risk loans), precision (86%) is lower than recall (94%). The model catches most high-risk loans but also flags some healthy loans as high-risk. The F1-score of 90% still indicates good performance.
- Overall accuracy is 99%, but it is dominated by class 0, which makes up 15,001 of the 15,508 test instances. The class 1 metrics are the better guide to how well the model detects high-risk loans.

I needed help from ChatGPT for the analysis above, and also to understand why the confusion matrix and the classification report look different; at first they seemed inconsistent. The explanation below does not help much. I think the best approach is to focus on precision and recall, together with the confusion matrix. The classification report contains a great deal of detail, but it is best to keep the interpretation as simple as possible. There may be cases where the details matter, such as health care analysis, but most of the time simplicity is the best approach.

### Confusion Matrix

The confusion matrix is a tabular representation that shows the performance of a classification algorithm. It compares the actual target values with those predicted by the model. Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class.

(Actual versus predicted values for the classification algorithm.)
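
To make the row and column layout concrete, the sketch below builds a 2x2 confusion matrix by hand from a short list of labels. The labels are a hypothetical toy example, not taken from the lending data.

```python
from collections import Counter

# Hypothetical toy labels, not taken from the lending data
actual = [0, 0, 0, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 1, 0, 0, 1, 0]

# Count how often each (actual, predicted) pair occurs
pair_counts = Counter(zip(actual, predicted))

# Rows = actual class, columns = predicted class
matrix = [[pair_counts[(a, p)] for p in (0, 1)] for a in (0, 1)]

for row in matrix:
    print(row)
```

This follows the same layout as the matrix that `confusion_matrix(y_test, y_pred)` printed earlier in the notebook: one row per actual class and one column per predicted class.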

### Classification Report

The classification report provides a more detailed summary of the classifier's performance, presenting precision, recall, F1-score, and support for each class. These metrics help understand the effectiveness of the model in a more granular way.



---