In [51]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [52]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
df_lending = pd.read_csv(Path("Resources/lending_data.csv"))
# Review the DataFrame
df_lending.sample(10)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
59921,10500.0,7.607,52200,0.425287,4,1,22200,0
8483,9600.0,7.19,48300,0.378882,4,0,18300,0
49892,8600.0,6.788,44500,0.325843,3,0,14500,0
9629,7400.0,6.278,39700,0.244332,2,0,9700,0
10980,9800.0,7.3,49300,0.391481,4,0,19300,0
56634,10000.0,7.372,50000,0.4,4,0,20000,0
76560,16100.0,9.971,74400,0.596774,9,2,44400,1
40563,9800.0,7.27,49000,0.387755,4,0,19000,0
37767,9500.0,7.161,48000,0.375,4,0,18000,0
24238,8600.0,6.798,44600,0.327354,3,0,14600,0


In [53]:
df_lending.columns

Index(['loan_size', 'interest_rate', 'borrower_income', 'debt_to_income',
       'num_of_accounts', 'derogatory_marks', 'total_debt', 'loan_status'],
      dtype='object')

### Step 2: Create the labels set (`y`)  from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [54]:
# Separate the data into labels and features

# Separate the X variable, the features
X = df_lending.copy()
X.drop("loan_status", axis=1, inplace=True)

# Separate the y variable, the labels
y = df_lending["loan_status"].values.reshape(-1, 1)

In [55]:
# Review the y variable (a NumPy array of labels)
y[:5]

array([[0],
       [0],
       [0],
       [0],
       [0]], dtype=int64)

In [56]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [57]:
# Check the balance of our target values
loan_status_count = df_lending['loan_status'].value_counts()
print(loan_status_count)

0    75036
1     2500
Name: loan_status, dtype: int64
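To put the imbalance in proportion, the counts printed above can be turned into class shares. This is a minimal sketch that hard-codes the two counts from the output rather than re-reading the CSV; the names `counts`, `shares`, and `imbalance_ratio` are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts output above
counts = {0: 75036, 1: 2500}

total = sum(counts.values())

# Share of each class in the full dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"class {label}: {share:.2%}")

# Number of healthy loans for every high-risk loan
imbalance_ratio = counts[0] / counts[1]
print(f"healthy loans per high-risk loan: {imbalance_ratio:.1f}")
```

On a DataFrame, `df_lending['loan_status'].value_counts(normalize=True)` should give the same shares directly.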


### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [58]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [59]:
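# Import StandardScaler and scale the features. Scaling is not part of the directions, but it is a standard preprocessing step.
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then transform both sets
# Note: the models below are fitted on the unscaled X_train, not on these scaled arrays
scaler = StandardScaler()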
X_scaler = scaler.fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
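As a rough illustration of what the scaling cell computes, the sketch below standardizes a single feature by hand: the mean and standard deviation are learned from the training values only and then reused for the test values. The helper names are hypothetical, and the numbers are a handful of `loan_size` values borrowed from the outputs above, not an actual train/test split.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


# Illustrative loan sizes (assumption: these stand in for a train/test split)
train_loan_size = [10700.0, 8400.0, 9000.0, 10700.0, 10800.0]
test_loan_size = [9800.0, 16100.0]

# Learn the statistics from the training values only
mean, std = fit_standardizer(train_loan_size)

# Reuse the same statistics for both sets so no test information leaks into the fit
train_scaled = standardize(train_loan_size, mean, std)
test_scaled = standardize(test_loan_size, mean, std)
```

The population standard deviation is used here because `StandardScaler` also divides by the biased (population) estimate.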

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [60]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_regression_model = LogisticRegression(random_state=1)

# Fit the model using training data
# ravel() flattens y_train from shape (n, 1) to (n,), the shape scikit-learn expects for labels
lr_model = logistic_regression_model.fit(X_train, y_train.ravel())


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [61]:
# Make a prediction using the testing data
testing_predictions = lr_model.predict(X_test)
testing_predictions[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [62]:
# Print the balanced_accuracy score of the model
accuracy = balanced_accuracy_score(y_test, testing_predictions)
print(accuracy)

0.9520479254722232


In [63]:
# Generate a confusion matrix for the model
test_matrix = confusion_matrix(y_test, testing_predictions)
print(test_matrix)

[[18663   102]
 [   56   563]]
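The balanced accuracy score and the plain accuracy can both be recovered from the confusion matrix, which helps show why they differ on imbalanced data. The sketch below hard-codes the four cells printed above (rows are true labels, columns are predictions); the variable names are illustrative.

```python
# Cells of the test confusion matrix printed above
tn, fp = 18663, 102   # true class 0: correctly healthy, wrongly flagged high-risk
fn, tp = 56, 563      # true class 1: missed high-risk, correctly flagged high-risk

# Recall of each class: share of that class's loans that were labelled correctly
recall_healthy = tn / (tn + fp)
recall_high_risk = tp / (tp + fn)

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = (recall_healthy + recall_high_risk) / 2

# Plain accuracy weights every loan equally, so the majority class dominates
plain_accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"plain accuracy:    {plain_accuracy:.4f}")
```

The balanced figure is lower than the plain one because the weaker recall on the small high-risk class counts for half of it.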


In [64]:
# Print the TESTING classification report for the model
testing_report = classification_report(y_test, testing_predictions)

# Print the testing classification report
print(testing_report)

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



In [65]:
# Create and save the confusion matrix for the training data
training_predictions = lr_model.predict(X_train)
training_matrix = confusion_matrix(y_train, training_predictions)

# Print the confusion matrix for the training data
print(training_matrix)

# Print the TRAINING classification report for the model
training_report = classification_report(y_train, training_predictions)
print(training_report)

[[55994   277]
 [  181  1700]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56271
           1       0.86      0.90      0.88      1881

    accuracy                           0.99     58152
   macro avg       0.93      0.95      0.94     58152
weighted avg       0.99      0.99      0.99     58152



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 

The model predicts healthy loans (class "0") almost perfectly, with precision 1.00, recall 0.99, and F1-score 1.00. It is weaker on high-risk loans (class "1"), with precision 0.85, recall 0.91, and F1-score 0.88, so it misses about 9% of the high-risk loans in the test set (56 of 619). The overall accuracy is 0.99, but because about 97% of the test loans are healthy, that figure mostly reflects class "0"; the balanced accuracy of 0.95 is a fairer summary of performance across both classes.

---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` module from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points. 

In [66]:
# Import the RandomOverSampler module from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
random_state_model = RandomOverSampler(random_state=1)

# Fit the original training data to the random_oversampler model
ro_model = random_state_model.fit_resample(X_train, y_train)

X_train_rs = ro_model[0]
y_train_rs = ro_model[1]

In [67]:
# Count the distinct values of the resampled features data
distinct_values_X = np.unique(X_train_rs)
count_X = len(distinct_values_X)
print("Number of distinct values in X_train_rs:", count_X)

# Count the distinct values of the resampled labels data
distinct_values_y = np.unique(y_train_rs)
count_y = len(distinct_values_y)
print("Number of distinct values in y_train_rs:", count_y)

Number of distinct values in X_train_rs: 5980
Number of distinct values in y_train_rs: 2
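Counting distinct values does not by itself confirm that the two labels have an equal number of data points; counting rows per label does. The sketch below is a simplified, pure-Python stand-in for random oversampling together with that check. The helper `random_oversample` is hypothetical and is not the imbalanced-learn implementation, and the five rows are illustrative `(loan_size, interest_rate)` pairs.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of each smaller class until all classes
    have as many rows as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, n in counts.items():
        # Rows that belong to this class
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        # Draw the missing rows with replacement (zero draws for the largest class)
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Illustrative data: four healthy loans and one high-risk loan
rows = [[10700.0, 7.672], [8400.0, 6.692], [9000.0, 6.963],
        [10800.0, 7.698], [16100.0, 9.971]]
labels = [0, 0, 0, 0, 1]

rows_rs, labels_rs = random_oversample(rows, labels)

# Confirm that both labels now have the same number of rows
label_counts = Counter(labels_rs)
print(label_counts)
```

For the notebook's own arrays, `np.unique(y_train_rs, return_counts=True)` would be one way to run the same check.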


### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [69]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_regression_model_rs = LogisticRegression(random_state=1)

# Fit the model using the resampled training data
lr_model_rs = logistic_regression_model_rs.fit(X_train_rs, y_train_rs)

# Make a prediction using the testing data
testing_predictions_rs = lr_model_rs.predict(X_test)


### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [70]:
# Print the balanced_accuracy score of the model 
accuracy_rs = balanced_accuracy_score(y_test, testing_predictions_rs)
print(accuracy_rs)

0.9936781215845847


In [71]:
# Generate a confusion matrix for the model
testing_matrix_rs = confusion_matrix(y_test, testing_predictions_rs)
print(testing_matrix_rs)


[[18649   116]
 [    4   615]]
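Comparing this matrix with the one from the original model shows where the resampled model's errors moved. The sketch below hard-codes both test confusion matrices from the outputs above; `error_summary` is an illustrative helper, not part of the notebook.

```python
def error_summary(matrix):
    """Summarize the errors of a 2x2 confusion matrix laid out as [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = matrix
    return {
        "false_positives": fp,   # healthy loans flagged as high-risk
        "false_negatives": fn,   # high-risk loans that were missed
        "recall_high_risk": tp / (tp + fn),
        "precision_high_risk": tp / (tp + fp),
    }


original = error_summary([[18663, 102], [56, 563]])
resampled = error_summary([[18649, 116], [4, 615]])

for name, summary in (("original", original), ("resampled", resampled)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

Reading the two summaries side by side, the resampled model misses far fewer high-risk loans at the cost of a few more false alarms.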


In [72]:
# Print the classification report for the model
testing_report_rs = classification_report(y_test, testing_predictions_rs)
print(testing_report_rs)

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.99      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** Similar to the first model, and based on the classification report and the confusion matrix, the oversampled model predicts healthy loans (class "0") almost perfectly, with precision 1.00, recall 0.99, and F1-score 1.00. The main change is for high-risk loans (class "1"): recall rises from 0.91 to 0.99, so only 4 of the 619 high-risk loans in the test set are missed instead of 56, while precision dips slightly from 0.85 to 0.84 (116 false positives instead of 102). The F1-score for class "1" improves from 0.88 to 0.91, the balanced accuracy rises from 0.952 to 0.994, and the overall accuracy stays at 0.99.

### Report

The purpose of the analysis was to develop machine learning models to predict the loan status based on various financial information provided in the dataset. The data consisted of loan size, interest rate, borrower income, debt-to-income ratio, the number of accounts, any derogatory marks on the account, the total debt, and finally, the loan status.

The goal of the analysis was to predict the loan status, which in our dataset, is a binary variable indicating whether a loan is classified as a "0" (healthy loan) or a "1" (high risk loan).

To understand the distribution of the loan status variable, we used the value_counts function. This provided a count of the different loan statuses in the dataset. The analysis found 75,036 healthy loans (class "0") and 2,500 high-risk loans (class "1") in the full dataset; the test split holds 18,765 and 619 of them, respectively. The dataset is heavily imbalanced: only about 3% of the loans are high-risk.

The stages of the machine learning process involved the following steps:

Data Preparation: The dataset was loaded into a Pandas DataFrame. The features (independent variables) were separated into the variable X, while the target variable (loan_status) was separated into the variable y.

Exploratory Data Analysis: Basic exploratory analysis was performed to gain insights into the dataset, such as reviewing the DataFrame, checking the balance of the loan_status labels using value_counts, and examining the distribution of the variables.

Model Training: The dataset was split into training and testing sets using the train_test_split function. Logistic regression models were trained on the training data using the LogisticRegression class from SKLearn.

Model Evaluation: The trained models were used to make predictions on the testing data. The performance of the models was evaluated using metrics such as balanced accuracy score, confusion matrix, and classification report. These metrics provided insights into the models' accuracy, precision, recall, and F1-score for both the "0" and "1" classes.

Resampling: In the subsequent analysis, resampling techniques were applied to address the class imbalance in the dataset. The RandomOverSampler from the imbalanced-learn library was used to oversample the minority class ("1" class) to achieve a more balanced dataset.

Model Training and Evaluation with Resampled Data: The logistic regression model was trained on the resampled training data, and predictions were made on the testing data. The performance of the model was evaluated using metrics such as balanced accuracy score, confusion matrix, and classification report.

The main method used in this analysis was logistic regression (LogisticRegression) for binary classification. Additionally, the RandomOverSampler from the imbalanced-learn library was used to address the class imbalance issue by oversampling the minority class.

Overall, the analysis involved data preparation, model training, evaluation, and optional resampling to develop a predictive model for loan status classification.

### Results

**Machine Learning Model 1:**

With the metrics from our classification reports, **precision** focuses on the accuracy of positive predictions, **recall** focuses on the ability to find all positive instances, and **F1-score** combines both metrics to give an overall evaluation of the model's performance.
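These three definitions can be written out as a small function. The sketch below applies it to the first model's test confusion matrix `[[18663, 102], [56, 563]]`, once with class "1" as the positive class and once with class "0"; the function name and variable names are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score for one class treated as positive."""
    precision = tp / (tp + fp)   # of the loans predicted as this class, how many are right
    recall = tp / (tp + fn)      # of the loans truly in this class, how many are found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
    return precision, recall, f1


# Class "1" (high-risk) as positive: 563 hits, 102 false alarms, 56 misses
p1, r1, f1_1 = precision_recall_f1(tp=563, fp=102, fn=56)

# Class "0" (healthy) as positive: the roles of the error cells swap
p0, r0, f1_0 = precision_recall_f1(tp=18663, fp=56, fn=102)

print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of precision and recall than a simple average would.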

For the "0" class (healthy loan):

**Precision:** The model achieves a precision of 1.00, indicating that when it predicts a loan as healthy, it is correct 100% of the time.\
**Recall:** The model has a recall of 0.99, suggesting that it correctly identifies 99% of the actual healthy loans.\
**F1-score:** The F1-score, which combines precision and recall, is 1.00. This score indicates a very high overall performance for the "0" class.

For the "1" class (high-risk loan):

**Precision:** The precision for the high-risk loans is 0.85, indicating that when the model predicts a loan as high-risk, it is correct approximately 85% of the time.\
**Recall:** The recall is 0.91, implying that the model correctly identifies 91% of the actual high-risk loans.\
**F1-score:** The F1-score for the "1" class is 0.88, reflecting a reasonably good overall performance.

**Accuracy:** The balanced accuracy score of the model is 0.952, the average of the recall for the two classes. The overall accuracy is 0.99, meaning the model correctly labels about 99% of the loans in the test set.


              Precision    Recall  F1-score   Support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    Accuracy                           0.99     19384


**Machine Learning Model 2:**

For the "0" class (healthy loan):

**Precision:** The precision is 1.00, indicating that when the model predicts a loan as healthy, it is correct 100% of the time.\
**Recall:** The recall is 0.99, suggesting that it correctly identifies 99% of the actual healthy loans.\
**F1-score:** The F1-score is 1.00, reflecting a very high overall performance for the "0" class.

For the "1" class (high-risk loan):

**Precision:** The precision for the high-risk loans is 0.84, indicating that when the model predicts a loan as high-risk, it is correct approximately 84% of the time.\
**Recall:** The recall is 0.99, implying that the model correctly identifies 99% of the actual high-risk loans.\
**F1-score:** The F1-score for the "1" class is 0.91, reflecting an excellent overall performance.

**Accuracy:** The balanced accuracy score of the model is 0.994, the average of the recall for the two classes. The overall accuracy is 0.99, meaning the model correctly labels about 99% of the loans in the test set.


              Precision    Recall  F1-score   Support

           0       1.00      0.99      1.00     18765
           1       0.84      0.99      0.91       619

    Accuracy                           0.99     19384




### Summary

Based on the results, both the resampled model and the original model reach the same overall accuracy (0.99) and predict both classes of loan risk well. The resampled model has the higher balanced accuracy (0.994 compared to 0.952). For the minority class ("1" class), it achieves much higher recall (0.99 compared to 0.91) and a higher F1-score (0.91 compared to 0.88), at the cost of slightly lower precision (0.84 compared to 0.85), indicating a better ability to identify high-risk loans.

When determining which model is the best, if the main objective is balanced performance for both classes, the resampled model is recommended. It matches the original model's overall accuracy and improves recall and F1-score for the high-risk loans, with only a small drop in precision. This model would be suitable when both classes ("0" and "1") are equally important.

The resampled model is also the better choice when correctly identifying high-risk loans ("1" class) is of paramount importance. It misses only 4 of the 619 high-risk loans in the test set, compared with 56 for the original model, while flagging 14 more healthy loans as high-risk (116 compared to 102). Considering the higher balanced accuracy and the improved recall for the minority class, the resampled model appears to be the preferred choice.

Lastly, we should consider that we are relying on resampling rather than on a dataset with more high-risk examples. Random oversampling only duplicates existing high-risk loans, so it balances the training data artificially without adding new information, and it does not address the underlying class imbalance in the real world. We would likely get a more reliable model if we started with a more balanced dataset. The model's performance may not generalize well to unseen data from the same problem domain, where class imbalance persists. Finding a more balanced dataset from the start can help mitigate these limitations.

It is important to carefully consider the trade-offs between resampling and obtaining a more balanced dataset. If possible, collecting or acquiring a balanced dataset in the first place is preferred. However, if obtaining a balanced dataset is not feasible, resampling techniques can be valuable tools to address class imbalance and improve model performance, while being mindful of their limitations.