In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
df_lending_data = pd.read_csv(
    Path("Resources/lending_data.csv")
)

# Review the DataFrame
df_lending_data.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Step 2: Create the labels set (`y`) from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the y variable (labels) from the loan_status column
# The target variable `y` is what we are trying to predict; in this case, it is `loan_status`.
# In the dataset, `loan_status` is 0 or 1: 0 means the loan is healthy, and 1 means the loan has a high risk of defaulting.

# Separate the X variable (features)
# Features are the input variables that will help the model make predictions.
# Here, we want to exclude the 'loan_status' column because that's what we are predicting.

In [4]:
# Separate the data into labels and features
y = df_lending_data['loan_status']
X = df_lending_data.drop(columns='loan_status')



In [5]:

# Review the y variable (target labels)
# Check that the y variable was extracted correctly
# by displaying its first few rows.
print("y (loan_status) target variable:")
print(y.head())

# Review the X variable (features)
# Inspect the feature set to confirm it contains every column except 'loan_status'
# by displaying its first few rows.
print("\nX (features):")
print(X.head())


y (loan_status) target variable:
0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

X (features):
   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
# train_test_split from sklearn.model_selection randomly divides the dataset into training and testing subsets.
# The training data is used to fit the model, and the testing data is used to evaluate its performance.
from sklearn.model_selection import train_test_split


In [7]:
# The train_test_split function will split the data into four parts: 
# X_train (training features), X_test (testing features), y_train (training labels), and y_test (testing labels).
# The test_size=0.25 parameter means that 25% of the data will be used for testing, and 75% for training.
# The random_state=1 ensures that the split is reproducible; it guarantees that every time the code runs, the same data split occurs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [8]:
# Review the shapes of the resulting datasets
# It is good practice to check the shapes (the number of rows and columns) of the datasets after the split.
# This confirms that the split produced the expected 75% / 25% proportions and that features and labels have matching row counts.
print("Training features (X_train) shape:", X_train.shape)
print("Testing features (X_test) shape:", X_test.shape)
print("Training labels (y_train) shape:", y_train.shape)
print("Testing labels (y_test) shape:", y_test.shape)

Training features (X_train) shape: (58152, 7)
Testing features (X_test) shape: (19384, 7)
Training labels (y_train) shape: (58152,)
Testing labels (y_test) shape: (19384,)
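The effect of `test_size` and `random_state` can be checked on a small stand-in dataset. The sketch below uses hypothetical toy data (`X_demo`, `y_demo`), not the lending data. The `stratify` argument shown at the end is an optional variation that this notebook does not use; it is included only because it keeps the class proportions the same in both subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows with 4% positive labels (not the lending data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.normal(size=1000),
})
y_demo = pd.Series([1] * 40 + [0] * 960, name="label")

# Same arguments as the notebook: 25% of rows held out, fixed seed.
split_1 = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)
split_2 = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

# With an identical seed, the same rows land in the test set on every run.
same_rows = split_1[1].index.equals(split_2[1].index)
print("Train rows:", len(split_1[0]), "Test rows:", len(split_1[1]))
print("Same test rows on both runs:", same_rows)

# Optional variation (an assumption, not used in the notebook):
# stratify keeps the share of each label roughly equal across subsets.
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)
print("Positive share overall:", y_demo.mean())
print("Positive share in stratified test set:", y_te_strat.mean())
```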


---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [9]:
# Import the LogisticRegression class
# LogisticRegression is a classification algorithm commonly used for binary classification tasks.
# It models the probability of a binary outcome (loan status in this case) as a function of the features.
from sklearn.linear_model import LogisticRegression

# Instantiate the logistic regression model
# random_state=1 fixes the seed for solvers that shuffle the data ('sag', 'saga', 'liblinear');
# with the default 'lbfgs' solver it has no effect, so setting it here is harmless.
# max_iter is raised from the default of 100 to 1000 to give the solver more iterations to converge.
logistic_model = LogisticRegression(random_state=1, max_iter=1000)

# Fit the model using the training data
# The fit method trains the model on the training features (X_train) and training labels (y_train).
# The model learns the relationship between the features (loan size, interest rate, etc.) and the loan status (0 or 1).
logistic_model.fit(X_train, y_train)


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [10]:
# Make predictions using the testing data (X_test)
# The predict method applies the fitted model to data it did not see during training (X_test).
# These predictions are based on the patterns the model learned from the training data.
# The result is an array of predicted loan statuses (0 or 1).
y_pred = logistic_model.predict(X_test)

# Review the predictions
# Printing a sample is a quick sanity check on the output format;
# it does not measure performance, which requires comparing y_pred with y_test.
print("Predicted loan statuses:")
print(y_pred[:50])  # Display the first 50 predictions


Predicted loan statuses:
[0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0]
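Because logistic regression models a probability, each 0/1 prediction corresponds to a predicted probability on one side of 0.5. The sketch below illustrates this on hypothetical synthetic data generated with `make_classification`; the names `X_toy`, `y_toy`, and `toy_model` are illustrative and are not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data standing in for the lending features and labels.
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=1)

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(X_toy, y_toy)

# predict_proba returns one column per class, ordered as in toy_model.classes_.
proba = toy_model.predict_proba(X_toy)

# For a binary problem, predict() matches thresholding the class-1 probability at 0.5.
labels_from_proba = (proba[:, 1] > 0.5).astype(int)
labels_from_predict = toy_model.predict(X_toy)

print("Classes:", toy_model.classes_)
print("First three probability rows:\n", proba[:3])
print("Thresholded probabilities match predict():",
      np.array_equal(labels_from_proba, labels_from_predict))
```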



### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

A value of 0 in the “loan_status” column means that the loan is healthy. <br>A value of 1 means that the loan has a high risk of defaulting.

In [12]:
y.value_counts()

loan_status
0    75036
1     2500
Name: count, dtype: int64

There are 75,036 healthy loans (label 0).<br>
There are 2,500 high-risk loans (label 1).<br>
Healthy loans outnumber high-risk loans by roughly 30 to 1.<br>

Because most loans in the dataset are healthy (75,036 of 77,536), the model could become biased toward predicting loans as healthy. If this is not addressed, the model might perform poorly at identifying high-risk loans.

A model with high accuracy might still be bad at predicting high-risk loans. For example, a model that always predicted 0 (healthy) would be right about 96.8% of the time (75,036 of 77,536 loans are healthy), yet it would never identify a high-risk loan (label 1).

This is why precision, recall, and f1-score, which the classification report later in this notebook provides, are crucial for evaluating models on imbalanced datasets. These metrics give a more complete picture of how well the model performs on both classes.
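The arithmetic behind the always-predict-healthy example can be written out directly. This is a minimal sketch that uses only the class counts from the `y.value_counts()` output; the "baseline" is a hypothetical model, not one fitted in this notebook.

```python
# Class counts taken from the y.value_counts() output.
healthy = 75036
high_risk = 2500
total = healthy + high_risk

# A hypothetical model that always predicts 0 (healthy):
# every healthy loan counts as correct, and no high-risk loan is ever flagged.
baseline_accuracy = healthy / total
baseline_recall_high_risk = 0 / high_risk

print(f"Total loans: {total}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall for label 1: {baseline_recall_high_risk:.4f}")
```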
**How to Handle Class Imbalance?**<br>
In the future, you may need to address the class imbalance to improve the model’s ability to predict high-risk loans. Some techniques include:

* Resampling: You can oversample the minority class (high-risk loans) or undersample the majority class (healthy loans) to create a more balanced dataset.

* Class weights: Some machine learning algorithms, including logistic regression, let you give more weight to the minority class so that the model focuses more on predicting those cases correctly.

**In Summary:**<br>

* The dataset has 75,036 healthy loans (label 0) and 2,500 high-risk loans (label 1), which is a significant class imbalance.

* This imbalance can affect the model's ability to correctly predict high-risk loans.

* Supervised learning trains the model on both features (input variables) and labels (output variables) so that it learns patterns in the data.
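The two techniques can be sketched as follows. This is an illustration under stated assumptions, not part of the notebook's workflow: `y_counts` is rebuilt from the class counts above, `weighted_model` is a hypothetical unfitted estimator, and `toy` is a made-up 100-row frame used only to show oversampling with pandas.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the class counts reported by y.value_counts().
y_counts = np.array([0] * 75036 + [1] * 2500)

# "balanced" weights follow n_samples / (n_classes * count_per_class).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_counts)
print("Balanced class weights:", dict(zip([0, 1], weights.round(4))))

# Class weights: the same rule can be requested directly from the estimator.
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Oversampling: draw minority rows with replacement until both classes match.
toy = pd.DataFrame({"loan_status": [0] * 95 + [1] * 5})  # hypothetical toy frame
majority = toy[toy["loan_status"] == 0]
minority = toy[toy["loan_status"] == 1]
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=1),
])
print(oversampled["loan_status"].value_counts())
```

If resampling is used, it is usually applied to the training split only, so that the test set still reflects the original class proportions.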









In [13]:

# The confusion_matrix function will help evaluate how well the model performed by showing how many predictions were correct or incorrect.
# The classification_report function will generate a detailed report showing precision, recall, f1-score, and accuracy of the model.

# Generate the confusion matrix
# A confusion matrix is a summary of prediction results. 
# It shows the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
cm = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
# This prints out the confusion matrix so it can be reviewed to understand the model’s performance.
print("Confusion Matrix:")
print(cm)



Confusion Matrix:
[[18658   107]
 [   37   582]]


In [14]:
# Confusion matrix (rows are actual labels, columns are predicted labels)
# True Negatives (TN): 18,658. Healthy loans (label 0) correctly predicted as healthy.
# False Positives (FP): 107. Healthy loans (label 0) incorrectly predicted as high-risk.
# False Negatives (FN): 37. High-risk loans (label 1) incorrectly predicted as healthy.
# True Positives (TP): 582. High-risk loans (label 1) correctly predicted as high-risk.
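The four counts can be unpacked from the matrix and turned into the headline metrics by hand. The sketch below hard-codes the matrix printed above as `cm_values` (an illustrative name) so that it runs on its own.

```python
import numpy as np

# Confusion matrix copied from the printed output
# (rows: actual labels, columns: predicted labels).
cm_values = np.array([[18658, 107],
                      [37, 582]])

# For a 2x2 matrix, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm_values.ravel()

# Metrics for label 1 (high-risk loans).
precision_high_risk = tp / (tp + fp)   # share of flagged loans that are truly high-risk
recall_high_risk = tp / (tp + fn)      # share of high-risk loans that were flagged
accuracy = (tp + tn) / cm_values.sum()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision (label 1): {precision_high_risk:.2f}")
print(f"Recall (label 1): {recall_high_risk:.2f}")
print(f"Accuracy: {accuracy:.2f}")
```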

In [15]:
# Generate the classification report
# The classification report provides metrics like precision, recall, and f1-score for each class (0 and 1 in this case).
# Precision: Out of the predicted positive results, how many were actually positive?
# Recall: Out of the actual positive results, how many did the model identify correctly?
# F1-score: The harmonic mean of precision and recall.
# Support: The number of true instances for each class.
target_names = ["loan is healthy", "loan has a high risk of defaulting."]
report = classification_report(y_test, y_pred, target_names=target_names)

# Print the classification report
print("Classification Report:")
print(report)


Classification Report:
                                     precision    recall  f1-score   support

                    loan is healthy       1.00      0.99      1.00     18765
loan has a high risk of defaulting.       0.84      0.94      0.89       619

                           accuracy                           0.99     19384
                          macro avg       0.92      0.97      0.94     19384
                       weighted avg       0.99      0.99      0.99     19384
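The f1-score column and the two averaged rows can be reproduced from the confusion matrix counts. This is a worked sketch using the counts printed earlier; the variable names are illustrative.

```python
# Counts from the confusion matrix printed earlier.
tn, fp, fn, tp = 18658, 107, 37, 582

# F1 = 2*TP / (2*TP + FP + FN). For label 0, the true negatives play the role of TP.
f1_high_risk = 2 * tp / (2 * tp + fp + fn)
f1_healthy = 2 * tn / (2 * tn + fn + fp)

# Support is the number of actual samples in each class.
support_healthy = tn + fp
support_high_risk = fn + tp

# Macro average: plain mean of the per-class scores.
macro_f1 = (f1_healthy + f1_high_risk) / 2

# Weighted average: per-class scores weighted by support.
weighted_f1 = (
    f1_healthy * support_healthy + f1_high_risk * support_high_risk
) / (support_healthy + support_high_risk)

print(f"F1 (label 0): {f1_healthy:.2f}, F1 (label 1): {f1_high_risk:.2f}")
print(f"Macro avg F1: {macro_f1:.2f}")
print(f"Weighted avg F1: {weighted_f1:.2f}")
```

The macro average treats both classes equally, which is why it sits below the weighted average: the weighted figure is dominated by the 18,765 healthy loans.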



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 

**Classification Report**<br>
**Observations:**<br>
Label 0 (Healthy Loan):<br>

Precision: 1.00. Of all loans predicted as healthy, nearly all were actually healthy (18,658 of 18,695, which rounds to 1.00).<br>
Recall: 0.99. Of all truly healthy loans, 99% were correctly identified by the model.<br>
F1-score: 1.00. Precision and recall are both near-perfect for healthy loans.<br>

Label 1 (High-risk Loan):<br>

Precision: 0.84. Of all loans predicted as high-risk, 84% were actually high-risk (582 of 689).<br>
Recall: 0.94. Of all actual high-risk loans, 94% were correctly identified by the model (582 of 619).<br>
F1-score: 0.89. The model balances precision and recall reasonably well for high-risk loans.<br>

Overall Performance:<br>

Accuracy: 99%. The model correctly predicted the loan status for 99% of all test samples.<br>
Macro Avg F1-score: 0.94. The unweighted average of the two per-class F1-scores shows that the model performs well on both classes.<br>
Weighted Avg F1-score: 0.99. This figure is dominated by the correct predictions of healthy loans, because there are many more healthy loans than high-risk loans.<br>

**Conclusion:**<br>
The logistic regression model performs very well overall, especially on healthy loans, where both precision and recall are close to 1.00. It also identifies most high-risk loans (recall of 0.94), which is a strong result given the class imbalance. The weakest figure is precision for high-risk loans (0.84): 107 healthy loans were flagged as high-risk. This could be improved with techniques that handle class imbalance and by tuning the model parameters.

Overall, the model appears effective for its intended purpose on this test set, providing predictions that can aid financial decision-making and risk management.








---