In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path('Resources/lending_data.csv')
data = pd.read_csv(file_path)

# Review the DataFrame
print(data.head())

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  loan_status  
0                 1       22800            0  
1                 0       13600            0  
2                 0       16100            0  
3                 1       22700            0  
4                 1       23000            0  


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels, from the loan_status column
y = data['loan_status']

# Separate the X variable, the features, from the remaining columns
X = data.drop('loan_status', axis=1)

In [4]:
# Review the y variable Series
print(y.head())

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64


In [5]:
# Review the X variable DataFrame
print(X.head())

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [6]:
# Check the balance of our target values
print(y.value_counts())

0    75036
1     2500
Name: loan_status, dtype: int64
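
The counts above are heavily skewed toward healthy loans. As a rough illustration of how skewed, the sketch below turns those two counts into class shares and a ratio. The counts are copied by hand from the output rather than read from the CSV, and the variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 75036, 1: 2500}, name="loan_status")

# Share of each class in the full dataset
proportions = counts / counts.sum()

# Ratio of healthy loans (0) to high-risk loans (1)
imbalance_ratio = counts.loc[0] / counts.loc[1]

print(proportions.round(4))
print(f"Healthy-to-high-risk ratio: {imbalance_ratio:.1f} to 1")
```

On the real DataFrame, `y.value_counts(normalize=True)` should give the same shares directly.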


### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
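
The split above does not stratify, so the share of high-risk loans in the training and testing sets is left to chance. With classes this imbalanced, one optional variation is to pass `stratify`, which `train_test_split` supports. The sketch below uses synthetic stand-in data (not the lending dataset) and hypothetical variable names, purely to show the effect; it is not what this notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the lending dataset): 1,000 rows, 5% positives
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 50 + [0] * 950)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print("Train positive share:", y_tr.mean())
print("Test positive share:", y_te.mean())
```

Applying this to the lending data would change which rows land in the test set, so the scores reported below would differ somewhat.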

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_model = LogisticRegression(random_state=1)

# Fit the model using training data
logistic_model.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data
y_pred = logistic_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the balanced accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [10]:
# Print the balanced_accuracy score of the model
from sklearn.metrics import balanced_accuracy_score

balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
print(f'Balanced Accuracy Score: {balanced_accuracy}')

Balanced Accuracy Score: 0.9521352751368186


In [11]:
# Generate a confusion matrix for the model
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)

Confusion Matrix:
[[14926    75]
 [   46   461]]
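
A bare 2x2 array is easy to misread. In scikit-learn's `confusion_matrix`, rows are the actual labels and columns are the predicted labels. The sketch below wraps the printed values in a labelled DataFrame and unpacks the four cells, treating `1` (high-risk) as the positive class. The values are copied by hand from the output above, and the label strings are illustrative.

```python
import numpy as np
import pandas as pd

# Values copied from the confusion matrix printed above
conf_matrix = np.array([[14926, 75], [46, 461]])

# Rows are actual labels, columns are predicted labels
labelled = pd.DataFrame(
    conf_matrix,
    index=["Actual 0 (healthy)", "Actual 1 (high-risk)"],
    columns=["Predicted 0 (healthy)", "Predicted 1 (high-risk)"],
)
print(labelled)

# With 1 as the positive class, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf_matrix.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```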


In [12]:
# Print the classification report for the model
from sklearn.metrics import classification_report

class_report = classification_report(y_test, y_pred)
print('Classification Report:')
print(class_report)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     15001
           1       0.86      0.91      0.88       507

    accuracy                           0.99     15508
   macro avg       0.93      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508



**Balanced Accuracy Score:** A metric for assessing classification models on imbalanced datasets. It is the average of the recall obtained on each class, so the minority class counts as much as the majority class regardless of how many samples each has.

**Precision:** The proportion of samples predicted as a given class that actually belong to that class.

Precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives.

**Recall:** The proportion of samples actually belonging to a given class that were correctly predicted as that class.

Recall = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives.

**F1-score:** The harmonic mean of precision and recall.

F1-score = 2 * (Precision * Recall) / (Precision + Recall)
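
As a worked check of the definitions above, the sketch below recomputes the class `1` metrics by hand from the four cells of the confusion matrix printed earlier (rows are actual labels, columns are predicted labels). The counts are typed in manually, and balanced accuracy is taken as the mean of the two per-class recalls.

```python
# Counts from the confusion matrix above, treating 1 (high-risk) as positive
tp, fp, fn, tn = 461, 75, 46, 14926

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # recall for class 1
specificity = tn / (tn + fp)   # recall for class 0
f1 = 2 * precision * recall / (precision + recall)

# Balanced accuracy is the mean of the per-class recalls
balanced_accuracy = (recall + specificity) / 2

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
```

These values should agree, after rounding, with the class `1` row of the classification report and with the balanced accuracy score printed earlier.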

### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 
1. The balanced accuracy score is 0.952, so the model performs well on both classes on average; the shortfall from 1.0 comes almost entirely from the high-risk class.
2. The confusion matrix (rows are actual labels, columns are predicted labels) shows the model's predictions on the test set:
   * Of the 15,001 healthy loans (`0`), 14,926 were classified correctly and 75 were wrongly flagged as high-risk.
   * Of the 507 high-risk loans (`1`), 461 were classified correctly and 46 were wrongly classified as healthy.
3. The classification report provides more detailed metrics:
   * For the `0` class, precision, recall, and F1-score all round to 1.00, with a support of 15,001.
   * For the `1` class, precision is 0.86, recall is 0.91, and F1-score is 0.88, with a support of 507.
4. Weighted average precision and recall are both 0.99. These averages are dominated by the majority class (15,001 of 15,508 test samples), so they say little about performance on high-risk loans.

The model predicts healthy loans (label `0`) almost perfectly but is weaker on high-risk loans (label `1`): it misses about 9% of them (recall 0.91), and about 14% of the loans it flags as high-risk are actually healthy (precision 0.86). Overall accuracy is high at 0.99, yet there is room for improvement on the minority class.

---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [13]:
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
ros = RandomOverSampler(random_state=1)

# Fit the oversampler on the original training data and resample it
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

In [14]:
from collections import Counter

# Count the distinct values of the resampled labels data
counter_resampled = Counter(y_resampled)
print(counter_resampled)

Counter({0: 60035, 1: 60035})
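
Both classes now have 60,035 rows because the minority class was sampled with replacement until it matched the majority class. The sketch below illustrates that idea with plain NumPy on a toy label array. It is not imbalanced-learn's actual implementation, and every name in it is hypothetical.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Toy labels (not the lending data): 8 majority rows, 2 minority rows
y_toy = np.array([0] * 8 + [1] * 2)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # one feature: the row id

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority rows with replacement until both classes are the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

keep_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_bal, y_bal = X_toy[keep_idx], y_toy[keep_idx]

print(Counter(y_bal.tolist()))
```

No new information is created: the extra minority rows are exact copies of existing ones, which is why only the training data is resampled and the test set is left untouched.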


### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [15]:
# LogisticRegression was already imported in an earlier cell

# Instantiate the Logistic Regression model
logistic_model_resampled = LogisticRegression(random_state=1)

# Fit the model using the resampled training data
logistic_model_resampled.fit(X_resampled, y_resampled)

# Make a prediction using the testing data
y_pred_resampled = logistic_model_resampled.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the balanced accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [17]:
# Calculate and print the balanced accuracy score of the model using y_test and y_pred_resampled
balanced_accuracy_resampled = balanced_accuracy_score(y_test, y_pred_resampled)
print(f'Balanced Accuracy Score (Resampled): {balanced_accuracy_resampled}')

Balanced Accuracy Score (Resampled): 0.9941749445500477


In [18]:
# Generate a confusion matrix for the model
conf_matrix_resampled = confusion_matrix(y_test, y_pred_resampled)
print('Confusion Matrix (Resampled):')
print(conf_matrix_resampled)

Confusion Matrix (Resampled):
[[14915    86]
 [    3   504]]


In [19]:
# Print the classification report for the model
class_report_resampled = classification_report(y_test, y_pred_resampled)
print('Classification Report (Resampled):')
print(class_report_resampled)

Classification Report (Resampled):
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     15001
           1       0.85      0.99      0.92       507

    accuracy                           0.99     15508
   macro avg       0.93      0.99      0.96     15508
weighted avg       1.00      0.99      0.99     15508



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:**
1. The balanced accuracy score of 0.994 (up from 0.952 for the model fit on the original training data) shows that the model performs very well on both the `0` (healthy loan) and `1` (high-risk loan) labels despite the class imbalance.
2. Confusion matrix (rows are actual labels, columns are predicted labels):
   * Of the 15,001 healthy loans (`0`), 14,915 were classified correctly and 86 were wrongly flagged as high-risk.
   * Of the 507 high-risk loans (`1`), 504 were classified correctly and only 3 were wrongly classified as healthy.
3. Classification report:
   * For the `0` class, precision, recall, and F1-score are 1.00, 0.99, and 1.00 respectively, indicating excellent performance in identifying healthy loans.
   * For the `1` class, precision is 0.85, recall is 0.99, and F1-score is 0.92. The lower precision reflects the 86 healthy loans that were flagged as high-risk.

The logistic regression model trained on oversampled data predicts both classes well. It catches nearly all high-risk loans, missing only 3 of 507 compared with 46 for the model fit on the original data, at the cost of slightly more healthy loans flagged as high-risk (86 compared with 75). Precision on the high-risk class is therefore essentially unchanged (0.85 versus 0.86), while recall rises from 0.91 to 0.99.
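
To put the two models side by side, the sketch below derives the class `1` error counts, recall, and precision from the two confusion matrices printed above. The matrices are copied by hand from those outputs, and the row and column labels are illustrative.

```python
import numpy as np
import pandas as pd

# Confusion matrices copied from the two outputs above
# (rows are actual labels, columns are predicted labels)
matrices = {
    "original": np.array([[14926, 75], [46, 461]]),
    "oversampled": np.array([[14915, 86], [3, 504]]),
}

rows = {}
for name, cm in matrices.items():
    # With 1 as the positive class, ravel() yields TN, FP, FN, TP
    tn, fp, fn, tp = cm.ravel()
    rows[name] = {
        "missed high-risk (FN)": fn,
        "false alarms (FP)": fp,
        "recall (1)": tp / (tp + fn),
        "precision (1)": tp / (tp + fp),
    }

# One row per model, one column per metric
comparison = pd.DataFrame(rows).T
print(comparison.round(3))
```

Whether trading a few more false alarms for far fewer missed high-risk loans is worthwhile depends on the relative cost of each error, which this notebook does not quantify.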