
In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import classification_report_imbalanced

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
lending_data_df = pd.read_csv(
    Path('lending_data.csv')
)

# Review the DataFrame
lending_data_df.sample(10)

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 26371 | 7700.0 | 6.38 | 40600 | 0.261084 | 2 | 0 | 10600 | 0 |
| 33013 | 8700.0 | 6.802 | 44600 | 0.327354 | 3 | 0 | 14600 | 0 |
| 2032 | 10000.0 | 7.366 | 49900 | 0.398798 | 4 | 0 | 19900 | 0 |
| 41782 | 9800.0 | 7.282 | 49100 | 0.389002 | 4 | 0 | 19100 | 0 |
| 74352 | 8300.0 | 6.637 | 43100 | 0.303944 | 2 | 0 | 13100 | 0 |
| 19332 | 7800.0 | 6.452 | 41300 | 0.273608 | 2 | 0 | 11300 | 0 |
| 27814 | 8700.0 | 6.84 | 45000 | 0.333333 | 3 | 0 | 15000 | 0 |
| 76373 | 17200.0 | 10.414 | 78600 | 0.618321 | 10 | 2 | 48600 | 1 |
| 55848 | 9700.0 | 7.246 | 48800 | 0.385246 | 4 | 0 | 18800 | 0 |
| 8977 | 9800.0 | 7.287 | 49200 | 0.390244 | 4 | 0 | 19200 | 0 |


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = lending_data_df['loan_status']

# Separate the X variable, the features
x = lending_data_df.drop(columns = 'loan_status')

In [4]:
# Review the y variable Series
y.sample(10)

2028     0
57627    0
17680    0
15875    0
60826    0
27821    0
42690    0
59177    0
25817    0
7835     0
Name: loan_status, dtype: int64
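
A random sample of ten labels can easily show only one class when the other class is rare, as it does here. A quick way to see the actual class distribution is `value_counts`. The sketch below uses a small made-up Series (`labels`, with 97 zeros and 3 ones) rather than the real `y`, so the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical stand-in for y: 97 healthy loans (0) and 3 high-risk loans (1)
labels = pd.Series([0] * 97 + [1] * 3, name="loan_status")

# Absolute count of each class
counts = labels.value_counts()

# Share of each class (fractions that sum to 1)
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

Running the same two calls on `y` would show how many loans fall in each class before any model is trained.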

In [5]:
# Review the X variable DataFrame
x.sample(10)

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 28569 | 8900.0 | 6.89 | 45400 | 0.339207 | 3 | 0 | 15400 |
| 41956 | 10600.0 | 7.633 | 52400 | 0.427481 | 5 | 1 | 22400 |
| 76497 | 16100.0 | 9.981 | 74500 | 0.597315 | 10 | 2 | 44500 |
| 66974 | 10700.0 | 7.666 | 52700 | 0.43074 | 5 | 1 | 22700 |
| 40726 | 9300.0 | 7.071 | 47100 | 0.363057 | 3 | 0 | 17100 |
| 29990 | 8100.0 | 6.564 | 42400 | 0.292453 | 2 | 0 | 12400 |
| 53585 | 10300.0 | 7.509 | 51300 | 0.415205 | 4 | 1 | 21300 |
| 74814 | 8200.0 | 6.595 | 42700 | 0.297424 | 2 | 0 | 12700 |
| 17133 | 9400.0 | 7.113 | 47500 | 0.368421 | 3 | 0 | 17500 |
| 438 | 7700.0 | 6.389 | 40700 | 0.262899 | 2 | 0 | 10700 |


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    random_state = 1
)
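
By default, `train_test_split` holds out 25% of the rows for testing and shuffles without regard to class. When one class is rare, passing `stratify` keeps the class ratio the same in both splits. The following is a minimal sketch on synthetic arrays (`features` and `labels` are made-up names and data, not the lending dataset); the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 rows with a 4% positive class (assumption for illustration)
features = np.arange(1000).reshape(-1, 1)
labels = np.array([1] * 40 + [0] * 960)

# stratify=labels preserves the positive rate in both splits
f_train, f_test, l_train, l_test = train_test_split(
    features,
    labels,
    random_state=1,
    stratify=labels,
)

print(len(l_train), len(l_test))      # sizes of the train and test splits
print(l_train.mean(), l_test.mean())  # positive rate in each split
```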

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
LogReg_model = LogisticRegression(random_state = 1)

# Fit the model using training data
LogReg_model.fit(x_train, y_train)

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data
LogReg_predictions = LogReg_model.predict(x_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [12]:
# Generate a confusion matrix for the model
confusion_m = confusion_matrix(y_test, LogReg_predictions)
confusion_m_df = pd.DataFrame(
    confusion_m,
    index = ['Actual Healthy Loans (low-risk)',
             'Actual Non-Healthy Loans (high-risk)'],
    columns = ['Predicted Healthy Loans (low-risk)',
               'Predicted Non-Healthy Loans (high-risk)']
)
confusion_m_df

| | Predicted Healthy Loans (low-risk) | Predicted Non-Healthy Loans (high-risk) |
|---|---|---|
| Actual Healthy Loans (low-risk) | 18663 | 102 |
| Actual Non-Healthy Loans (high-risk) | 56 | 563 |
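
The per-class precision and recall in a classification report follow directly from these four counts. As a sketch, the arithmetic below hard-codes the values from the matrix (the variable names are illustrative) and treats high-risk loans as the positive class.

```python
# Counts copied from the confusion matrix
tn, fp = 18663, 102   # actual healthy:   predicted healthy / predicted high-risk
fn, tp = 56, 563      # actual high-risk: predicted healthy / predicted high-risk

# High-risk class (1)
precision_1 = tp / (tp + fp)   # share of flagged loans that are truly high-risk
recall_1 = tp / (tp + fn)      # share of high-risk loans that were flagged

# Healthy class (0)
precision_0 = tn / (tn + fn)   # share of predicted-healthy loans that are healthy
recall_0 = tn / (tn + fp)      # share of healthy loans predicted as healthy

# Overall accuracy across both classes
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```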


In [14]:
# Print the classification report for the model
print(classification_report_imbalanced(y_test, LogReg_predictions))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.99      0.91      1.00      0.95      0.91     18765
          1       0.85      0.91      0.99      0.88      0.95      0.90       619

avg / total       0.99      0.99      0.91      0.99      0.95      0.91     19384
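
In the imbalanced report, `spe` is specificity (the recall of the other class) and `geo` is the geometric mean of recall and specificity. A rough check for the high-risk row, reusing the confusion-matrix counts from this notebook's output:

```python
import math

# Confusion-matrix counts from the notebook output (positive class = 1, high-risk)
tn, fp, fn, tp = 18663, 102, 56, 563

recall = tp / (tp + fn)        # 'rec': sensitivity for class 1
specificity = tn / (tn + fp)   # 'spe': true-negative rate for class 1
geo_mean = math.sqrt(recall * specificity)  # 'geo': geometric mean of the two

print(round(recall, 2), round(specificity, 2), round(geo_mean, 2))
```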



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:**
The model performs well overall, with a balanced accuracy of about 95% (the mean of the per-class recalls, 0.99 and 0.91). The test set is heavily imbalanced, though: it contains 18,765 healthy loans versus 619 high-risk loans, so aggregate scores are dominated by the healthy class. For healthy loans (`0`), precision rounds to 1.00 and recall is 0.99, so almost every healthy loan is classified correctly. For high-risk loans (`1`), precision is 0.85 and recall is 0.91: 102 healthy loans were wrongly flagged as high-risk, and 56 of the 619 high-risk loans were missed. The model is therefore more reliable at identifying healthy loans than high-risk ones.
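
To see why plain accuracy is a weak yardstick on data this imbalanced, it helps to compare the model with a baseline that labels every loan as healthy. The sketch below rebuilds label arrays from the confusion-matrix counts reported earlier (the array construction and the names `y_true`, `y_model`, and `y_baseline` are assumptions for illustration) and uses `balanced_accuracy_score`, which In [1] imports.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Rebuild test labels from the reported counts: 18,765 healthy, 619 high-risk
y_true = np.array([0] * 18765 + [1] * 619)

# Model predictions consistent with the confusion matrix
y_model = np.array([0] * 18663 + [1] * 102 + [0] * 56 + [1] * 563)

# Naive baseline: predict "healthy" for every loan
y_baseline = np.zeros_like(y_true)

for name, y_pred in [("model", y_model), ("all-healthy baseline", y_baseline)]:
    acc = accuracy_score(y_true, y_pred)
    bal = balanced_accuracy_score(y_true, y_pred)
    print(f"{name}: accuracy={acc:.3f}, balanced accuracy={bal:.3f}")
```

The all-healthy baseline scores roughly 0.968 on accuracy but only 0.5 on balanced accuracy, while the model reaches about 0.952 on balanced accuracy. That gap is the part of the result that reflects real skill on the high-risk class.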

---