# Task 3: Credit risk analysis

This is Task 3. The dataset contains loan borrower data in tabular format: each row describes one borrower, including their income, total loans outstanding, and a few other metrics. There is also a column indicating whether the borrower has previously defaulted on a loan. I must use this data to build a model that, given the details of any such loan, predicts the probability that the borrower will default (PD, the probability of default). Assuming a recovery rate of 10%, the PD can then be used to compute the expected loss on a loan.
- I should produce a function that can take in the properties of a loan and output the expected loss.
- I can explore any technique ranging from a simple regression or a decision tree to something more advanced. I can also use multiple methods and provide a comparative analysis.
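The expected loss mentioned above is usually written as PD × (1 − recovery rate) × outstanding amount. A minimal sketch of that arithmetic follows; the function name `expected_loss_from_pd` and the example PD and loan amount are invented for illustration and do not come from the dataset.

```python
def expected_loss_from_pd(prob_default, exposure, recovery_rate=0.1):
    """Expected loss = PD * loss given default * exposure."""
    loss_given_default = 1.0 - recovery_rate
    return prob_default * loss_given_default * exposure

# Hypothetical loan: 5% PD on 10,000 outstanding, 10% recovery rate
example_el = expected_loss_from_pd(0.05, 10_000)
print(round(example_el, 2))  # 450.0
```

With a 10% recovery rate, 90% of the outstanding amount is lost on default, so the expected loss is 90% of the amount weighted by the PD.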

## Import Modules

First, import the required modules.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

## Load the Data

Then load the dataset that will be modeled.

In [2]:
file_path = 'Task_3_and_4_Loan_Data.csv'
loan_data = pd.read_csv(file_path)
loan_data.head()

|   | customer_id | credit_lines_outstanding | loan_amt_outstanding | total_debt_outstanding | income | years_employed | fico_score | default |
|---|---|---|---|---|---|---|---|---|
| 0 | 8153374 | 0 | 5221.545193 | 3915.471226 | 78039.38546 | 5 | 605 | 0 |
| 1 | 7442532 | 5 | 1958.928726 | 8228.75252 | 26648.43525 | 2 | 572 | 1 |
| 2 | 2256073 | 0 | 3363.009259 | 2027.83085 | 65866.71246 | 4 | 602 | 0 |
| 3 | 4885975 | 0 | 4766.648001 | 2501.730397 | 74356.88347 | 5 | 612 | 0 |
| 4 | 4700614 | 1 | 1345.827718 | 1768.826187 | 23448.32631 | 6 | 631 | 0 |


## Check for Missing Values

Then check for missing values.

In [3]:
loan_data.isnull().sum()

customer_id                 0
credit_lines_outstanding    0
loan_amt_outstanding        0
total_debt_outstanding      0
income                      0
years_employed              0
fico_score                  0
default                     0
dtype: int64

There are no missing values, so I can proceed to the next stage.

## Split the Data & Scale the Numeric Features

Next, I split the data into features (`X`) and target (`y`). The `customer_id` column is dropped from the features because it is an identifier rather than a predictor.

In [4]:
# Split the data into features (X) and target (y)
X = loan_data.drop(columns=['customer_id', 'default'])
y = loan_data['default']

Then scale the numeric features with `StandardScaler`.

In [5]:
# Scale the numeric features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Finally, split the dataset into training and testing sets, holding out 20% of the rows for testing.

In [6]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
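In the cells above the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the array contents and the names `X_demo`, `y_demo`, and `demo_scaler` are invented for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features (values are invented)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows play no part in fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std on train only
X_te_scaled = demo_scaler.transform(X_te)      # reuse the train statistics

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero per column
```

Tree-based models such as a random forest split on thresholds, so they are largely insensitive to this kind of rescaling; the ordering matters more for models that depend on feature magnitudes.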

## RandomForestClassifier

Train a `RandomForestClassifier` on the training set.

In [7]:
# Train a RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

Test the model on the held-out set before using it in the expected-loss function.

In [8]:
# Make predictions
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]

Testing uses two methods, `.predict` and `.predict_proba`. The `.predict` method returns a hard class label (0 or 1) for each row of `X_test`. The `.predict_proba` method returns the predicted probability of each class for each row of `X_test`; the slice `[:, 1]` keeps only the second column, which is the probability of class 1 (default).

Then evaluate the model. The classification report is computed from the hard labels in `y_pred`, while the ROC AUC score is computed from the probabilities in `y_pred_prob`.
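The relationship between the two methods can be sketched on synthetic data. The data and the names `clf`, `X_demo`, and `y_demo` below are invented for illustration; the sketch only assumes the standard scikit-learn behaviour that the columns of `predict_proba` follow the order of `classes_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (invented for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:10])  # shape (10, 2), one column per class
labels = clf.predict(X_demo[:10])       # hard 0/1 labels

# Column order follows clf.classes_, so look up the column for class 1
default_col = list(clf.classes_).index(1)
p_default = proba[:, default_col]

# predict returns the class with the larger predicted probability
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print((labels == labels_from_proba).all())
```

Looking the column up through `classes_` rather than hard-coding the index guards against a different label encoding.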

In [9]:
# Evaluate the model
print(classification_report(y_test, y_pred))
print('ROC AUC Score:', roc_auc_score(y_test, y_pred_prob))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1652
           1       0.99      0.98      0.98       348

    accuracy                           0.99      2000
   macro avg       0.99      0.99      0.99      2000
weighted avg       0.99      0.99      0.99      2000

ROC AUC Score: 0.9996547201580808
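ROC AUC is computed from scores rather than hard labels because it measures ranking: it equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as half. The sketch below computes it that way on a tiny invented example; `pairwise_auc`, `toy_labels`, and `toy_scores` are hypothetical names, and the quadratic loop is for illustration only.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(positives) * len(negatives))

# Invented toy example: 2 defaulters, 3 non-defaulters
toy_labels = [0, 0, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```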


## Define the Function

Next, define the function that estimates the probability of default and calculates the expected loss.

In [10]:
# Define the function to estimate the probability of default and calculate the expected loss
def expected_loss(loan_features, model, scaler, recovery_rate=0.1):
    # Scale the features
    loan_features_scaled = scaler.transform([loan_features])
    
    # Predict the probability of default
    pd = model.predict_proba(loan_features_scaled)[0, 1]
    
    # Calculate the expected loss
    el = loan_features[1] * pd * (1 - recovery_rate)
    
    return el

The `expected_loss` function calculates the expected loss from a loan based on its features, a predictive model, and a recovery rate.
- `loan_features`: An array or list of the loan's feature values, in the same order as the training columns: `credit_lines_outstanding`, `loan_amt_outstanding`, `total_debt_outstanding`, `income`, `years_employed`, `fico_score`.
- `model`: A pre-trained machine learning model used to predict the probability of default (PD).
- `scaler`: The fitted scaler used to standardize the loan features before the model makes its prediction. Because the function applies it internally, the features passed in should be raw (unscaled) values.
- `recovery_rate`: The fraction of the loan that is expected to be recovered if a default occurs. The default value is 0.1 (10%).

The loan features are scaled with the fitted scaler so that they match the representation the model was trained on. The model then predicts the probability of default: `predict_proba` returns probabilities for both classes (non-default and default), so `[0, 1]` extracts the default probability for the single input row. The expected loss is the product of the loan amount (taken from `loan_features[1]`, which is `loan_amt_outstanding`), the probability of default (`pd`), and `1 - recovery_rate`.
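Because the function reads the loan amount by position, the order of `loan_features` has to match the training columns exactly. One way to make that explicit is to pass the features by name. The sketch below is an assumption-laden variant, not the notebook's function: `FEATURE_ORDER`, `expected_loss_named`, `demo_scaler`, `demo_model`, and the synthetic training data are all hypothetical and exist only so the example runs end to end.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Column order of X in this notebook (after dropping customer_id and default)
FEATURE_ORDER = [
    "credit_lines_outstanding", "loan_amt_outstanding",
    "total_debt_outstanding", "income", "years_employed", "fico_score",
]

def expected_loss_named(loan, model, scaler, recovery_rate=0.1):
    """Expected loss for a loan given as a dict of raw, unscaled features."""
    row = np.array([[loan[name] for name in FEATURE_ORDER]], dtype=float)
    prob_default = model.predict_proba(scaler.transform(row))[0, 1]
    return loan["loan_amt_outstanding"] * prob_default * (1 - recovery_rate)

# Invented training data, only so the sketch runs end to end
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 1, size=(200, 6)) * [5, 8000, 20000, 90000, 10, 850]
y_raw = (X_raw[:, 0] > 2.5).astype(int)
demo_scaler = StandardScaler().fit(X_raw)
demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(demo_scaler.transform(X_raw), y_raw)

loan = dict(zip(FEATURE_ORDER, [5, 2000.0, 8000.0, 26000.0, 2, 570]))
el_demo = expected_loss_named(loan, demo_model, demo_scaler)
print(el_demo)
```

Here the loan amount is read from the raw dict rather than from a positional index, so the result stays in the same units as `loan_amt_outstanding`.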

## Example Usage

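The following example applies the function to a single loan. Note that rows of `X_test` are already standardized, whereas `expected_loss` applies the scaler itself: the row is therefore scaled a second time and `loan_features[1]` holds a standardized amount, so the figure printed below is not in currency units. Passing the raw feature values of a loan keeps the result in the units of `loan_amt_outstanding`.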

In [11]:
# Example usage
sample_loan = X_test[1]  # Use array indexing
el = expected_loss(sample_loan, model, scaler)
print('Expected Loss:', el)

Expected Loss: 0.23143372552586883
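The effect of standardizing a value twice can be sketched with plain arithmetic. The example below uses the five `loan_amt_outstanding` values from the `head()` output above, rounded to two decimals; the helper `standardize` and the variable names are hypothetical, and the mean and standard deviation of five rows are of course not those of the full dataset.

```python
from statistics import mean, pstdev

# Loan amounts from the first five rows shown earlier, rounded
amounts = [5221.55, 1958.93, 3363.01, 4766.65, 1345.83]
mu, sigma = mean(amounts), pstdev(amounts)  # StandardScaler uses population std

def standardize(value):
    """Apply the same (value - mean) / std transform as a fitted scaler."""
    return (value - mu) / sigma

raw_amount = amounts[0]
scaled_once = standardize(raw_amount)    # what a row of X_test holds
scaled_twice = standardize(scaled_once)  # what the model sees if scaled again

print(raw_amount, round(scaled_once, 3), round(scaled_twice, 3))
```

A value scaled once is a small number near zero, and scaling it again pushes it far below the range the model was trained on, which is why the function should receive raw feature values.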


