In [20]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [21]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
lending_data_df = pd.read_csv(Path('Resources/lending_data.csv'))

# Review the DataFrame
display(lending_data_df.head())

FileNotFoundError: [Errno 2] No such file or directory: 'Resources/lending_data.csv'

### Step 2: Create the labels set (`y`)  from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [None]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = lending_data_df['loan_status']

# Separate the X variable, the features
X = lending_data_df.drop(columns='loan_status')

In [None]:
# Review the y variable Series
display(y.head())

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [None]:
# Review the X variable DataFrame
display(X.head())

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [None]:
# Import the train_test_learn module
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y)

---

## Create a Logistic Regression Model with the Original Data

###  Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [None]:
# Import the LogisticRegression module from SKLearn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
model = LogisticRegression(random_state=1)

# Fit the model using training data
model.fit(X_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [None]:
# Make a prediction using the testing data
y_pred = model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [None]:
# Generate a confusion matrix for the model
confusion_matrix(y_test, y_pred)

In [22]:
# Print the classification report for the model
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18779
           1       0.84      0.91      0.87       605

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.93     19384
weighted avg       0.99      0.99      0.99     19384



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The model has a high accuracy rate of 0.99.  However, considering the imbalance between the two label classes (given that there are more healthy loans, i.e. =true cases, than high-risk loans in the dataset), there may be room for a model's bias for true cases to go unnoticed. From the standpoint of healthy loans = true cases , such bias (i.e., the tendency to classify loans as healthy in a predominantly healthy loans' dataset) may include False-Positives (high-risk loans being classified as healthy), which would be relatively a smaller number compared to the total True-Positives, and hence go unnoticed. Further, the False-Negatives (an healthy loan is classified as high-risk), and the True-Negative (a high-risk loan classified as high-risk), would be minimized if the model is biased towards classifying loans as healthy in a predominantly healthy loans' dataset, thereby resulting in high accuracy rates (TP + TN) / (TP + TN + FP + FN). 

Hence, more than the accuracy metric, we need to look at the class prediction metric (TP / (TP + FP)) to evaluate the model's performance with predicting the high-risk loans correctly. 

The model's high precision rate of 100% for healthy loans suggest that there are no false-positive (i.e., the loans classified as healthy were all actually healthy).  With reference to the high-risk loans, the model predicted comparatively higher levels of false-positives (i.e., the loans classified as high-risk were not actually high-risk for 16% of the observations).  This implies the model to be conservative in nature from the lender's standpoint, albeit also indicating missed lending opportunities. 

Considering that the recall metric (TP / (TP + FN)) is upwards of 0.9 for both healthy and high-risk loans, although the levels of false-negatives were higher for the high-risk loans. In  other words, of all the actual healthy loans in the dataset, the model classified such loans as healthy for 99% of the observations, implying that for 1% of the observation, an healthy loan was classified as high-risk (which reflects the conservative nature of the model from the lender's standpoint).  

Similarly, from the sample of actual high-risk loans, 9% of the times an actually high-risk loan was classified as an healthy loan.  This reflects that the credit quality of the portfolio could be negatively impacted by such false-negatives related to the high-risk dataset. 

Therefore, from the perspective of missed lending opportunities due to the conservative nature of the model, there is a need to reduce the 16% of the false-positives in classifying high-risk loans; as well as the 1% of the times in classifying an healthy loan as high-risk. This can be addressed by having an additional filter to include a step in the process for having credit risk experts to review the borderline (based on a certain percentile or decile or quantile of the total high-risk, i.e. 605 cases) cases comprising of the least high-risk applications, and either confirm or deny whether these are truly high-risk.
 
Similarly, from the perspective of reducing the credit risk of the portfolio, such 9% cases of high-risk loans being classified as healthy needs to be lowered. Therefore, a filter could be added to look at the least healthy loans, where the approval process is made more stringent. 

Overall, the model certainly looks good, with excellent ability to identify healthy loans, while the ability to classify the high-risk loans could be improved by optimizing the model and adding more high-risk cases to the dataset. This aspect of the model could be mitigated by adding a filter such as manual credit review processes focused on the cusp of healthy and high-risk loans' quality / credit risk rating.  


 

---