# 🔎 Overview

* **Background** <br>
**Evaluating whether a borrower will repay a loan** is one of the most difficult tasks for financial institutions. To address this, they want to build a model that predicts whether a borrower is able to repay.

* **Goal** <br>
The main objective is to **predict whether a borrower will pay back a loan** based on historical data from financial institutions.

**Key Features**

| Feature | Description | Type |
| ------- | ----------- | ---- |
| `id` | Unique identifier of the borrower | int |
| `annual_income` | Annual income of the borrower | float |
| `debt_to_income_ratio` | Ratio of the borrower's debt to income (DTI) | float |
| `credit_score` | Credit score of the borrower | int |
| `loan_amount` | Total amount of the loan | float |
| `interest_rate` | Interest rate of the loan | float |
| `gender` | Gender of the borrower | string |
| `marital_status` | Marital status of the borrower | string |
| `education_level` | 

# 1. Import Libraries

In [12]:
# Read data
import pandas as pd

# Visualize
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# Feature engineering
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Model predict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

# Metric to evaluate
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay

# Other
import numpy as np

In [5]:
class Config:
    figsize = (12, 6)
    train_file = r"dataset\train.csv"
    test_file = r"dataset\test.csv"

# 2. Load data

In [6]:
train_df = pd.read_csv(Config.train_file)
test_df = pd.read_csv(Config.test_file)

# 3. Data Preview and Info

In [10]:
print(f"[i] Shape of train data: {train_df.shape}")
print(f"[i] Shape of test data: {test_df.shape}")

[i] Shape of train data: (593994, 13)
[i] Shape of test data: (254569, 12)


In [13]:
print("[i] First 5 rows in train data:")
display(train_df.head())
print("[i] First 5 rows in test data:")
display(test_df.head())

[i] First 5 rows in train data:


Unnamed: 0,id,annual_income,debt_to_income_ratio,credit_score,loan_amount,interest_rate,gender,marital_status,education_level,employment_status,loan_purpose,grade_subgrade,loan_paid_back
0,0,29367.99,0.084,736,2528.42,13.67,Female,Single,High School,Self-employed,Other,C3,1.0
1,1,22108.02,0.166,636,4593.1,12.92,Male,Married,Master's,Employed,Debt consolidation,D3,0.0
2,2,49566.2,0.097,694,17005.15,9.76,Male,Single,High School,Employed,Debt consolidation,C5,1.0
3,3,46858.25,0.065,533,4682.48,16.1,Female,Single,High School,Employed,Debt consolidation,F1,1.0
4,4,25496.7,0.053,665,12184.43,10.21,Male,Married,High School,Employed,Other,D1,1.0


[i] First 5 rows in test data:


Unnamed: 0,id,annual_income,debt_to_income_ratio,credit_score,loan_amount,interest_rate,gender,marital_status,education_level,employment_status,loan_purpose,grade_subgrade
0,593994,28781.05,0.049,626,11461.42,14.73,Female,Single,High School,Employed,Other,D5
1,593995,46626.39,0.093,732,15492.25,12.85,Female,Married,Master's,Employed,Other,C1
2,593996,54954.89,0.367,611,3796.41,13.29,Male,Single,Bachelor's,Employed,Debt consolidation,D1
3,593997,25644.63,0.11,671,6574.3,9.57,Female,Single,Bachelor's,Employed,Debt consolidation,C3
4,593998,25169.64,0.081,688,17696.89,12.8,Female,Married,PhD,Employed,Business,C1


In [20]:
print("[i] Information of train data:\n")
train_df.info()
print("[i] Information of test data:\n")
test_df.info()

[i] Information of train data:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593994 entries, 0 to 593993
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    593994 non-null  int64  
 1   annual_income         593994 non-null  float64
 2   debt_to_income_ratio  593994 non-null  float64
 3   credit_score          593994 non-null  int64  
 4   loan_amount           593994 non-null  float64
 5   interest_rate         593994 non-null  float64
 6   gender                593994 non-null  object 
 7   marital_status        593994 non-null  object 
 8   education_level       593994 non-null  object 
 9   employment_status     593994 non-null  object 
 10  loan_purpose          593994 non-null  object 
 11  grade_subgrade        593994 non-null  object 
 12  loan_paid_back        593994 non-null  float64
dtypes: float64(5), int64(2), object(6)
memory usage: 58.9+ MB
[i] Informatio

* **Dataset Size**
    * The **training dataset** has `593994` rows and `13` columns, including the target variable `loan_paid_back`.
    * The **testing dataset** has `254569` rows and `12` columns; it does not include the target variable.

* **Feature Overview**
    * The features fall into two types:
        * **Numerical features:** `id`, `annual_income`, `debt_to_income_ratio`, `credit_score`, `loan_amount`, `interest_rate`
        * **Categorical features:** `gender`, `marital_status`, `education_level`, `employment_status`, `loan_purpose`, `grade_subgrade`

* **Data consistency**
    * Data types:
        * **Numerical:** `int64` and `float64`
        * **Categorical:** `object` (string)
    * The `id` feature carries no meaning for the analysis, so we remove it.

In [21]:
train_df.drop(columns='id', inplace=True)
test_df.drop(columns='id', inplace=True)
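
The grouping into numerical and categorical features listed above can also be derived from the column dtypes instead of being typed by hand. The following is a minimal sketch on a toy frame: `toy_df` is a hypothetical stand-in built from two of the preview rows, not the real `train_df`, and it only assumes that pandas' `select_dtypes` is available.

```python
import pandas as pd

# Hypothetical toy frame with a few of the columns shown in the preview
toy_df = pd.DataFrame({
    "annual_income": [29367.99, 22108.02],
    "credit_score": [736, 636],
    "gender": ["Female", "Male"],
    "grade_subgrade": ["C3", "D3"],
})

# Numeric dtypes (int64, float64) are treated as numerical features
numerical_cols = toy_df.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as a categorical feature
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

On the real data, the target `loan_paid_back` is stored as `float64`, so it would land in the numerical group and would need to be excluded separately.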

## Description of numerical features

In [None]:
print("[i] Describe of numerical train data:")
display(train_df.drop(columns='loan_paid_back').describe().T.style.background_gradient())
print("[i] Describe of numerical test data:")
display(test_df.describe().T.style.background_gradient())

[i] Describe of train data:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
annual_income,593994.0,48212.202976,26711.942078,6002.43,27934.4,46557.68,60981.32,393381.74
debt_to_income_ratio,593994.0,0.120696,0.068573,0.011,0.072,0.096,0.156,0.627
credit_score,593994.0,680.916009,55.424956,395.0,646.0,682.0,719.0,849.0
loan_amount,593994.0,15020.297629,6926.530568,500.09,10279.62,15000.22,18858.58,48959.95
interest_rate,593994.0,12.356345,2.008959,3.2,10.99,12.37,13.68,20.99


[i] Describe of test data:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
annual_income,254569.0,48233.080193,26719.65858,6011.77,27950.3,46528.98,61149.44,380653.94
debt_to_income_ratio,254569.0,0.120583,0.068582,0.011,0.072,0.096,0.156,0.627
credit_score,254569.0,681.037691,55.624118,395.0,646.0,683.0,719.0,849.0
loan_amount,254569.0,15016.753484,6922.165182,500.05,10248.58,15000.22,18831.46,48959.26
interest_rate,254569.0,12.352323,2.017602,3.2,10.98,12.37,13.69,21.29


* **Analysis**

| Feature | Review |
| ------- | ------ |
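| `annual_income` | The mean is about $48,000, but the standard deviation is high (above $26,000) and the maximum is about $393,000, so the distribution is right-skewed |
| `debt_to_income_ratio` | The mean is about 0.12 and the median 0.096. The spread is small (standard deviation about 0.07), although the maximum of 0.627 indicates a mild right tail |
| `credit_score` | The mean is about 680, close to the median (682). The distribution is slightly skewed, but far less so than `annual_income` |
| `loan_amount` | The mean is about $15,020, close to the median (about $15,000); the maximum (about $48,960) points to a long right tail |
| `interest_rate` | The mean is about 12% |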
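
The skewness remarks in the table are read off the summary statistics. One way to check them numerically is to compare mean and median and to compute the sample skewness with `DataFrame.skew()`. The sketch below uses synthetic data only: the `symmetric` and `right_skewed` columns are illustrative assumptions, not the loan features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (NOT the loan data): one symmetric, one right-skewed column
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=680, scale=55, size=10_000),
    "right_skewed": rng.lognormal(mean=10.5, sigma=0.6, size=10_000),
})

# A mean well above the median and a clearly positive skew both indicate a right tail
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary.round(3))
```

Applied to the notebook's data, the same three columns could be computed on `train_df` restricted to its numerical features; a skew near 0 suggests a roughly symmetric distribution.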

## Description of categorical features

In [45]:
# Convert to category
category_columns = ["gender", "marital_status", "education_level", "employment_status", "loan_purpose", "grade_subgrade"]

train_df[category_columns] = train_df[category_columns].astype('category')
test_df[category_columns] = test_df[category_columns].astype('category')

print("[i] Describe of categorical train data:")
display(train_df.describe(include='category').T)
print("[i] Describe of categorical test data:")
display(test_df.describe(include='category').T)

[i] Describe of categorical train data:


Unnamed: 0,count,unique,top,freq
gender,593994,3,Female,306175
marital_status,593994,4,Single,288843
education_level,593994,5,Bachelor's,279606
employment_status,593994,5,Employed,450645
loan_purpose,593994,8,Debt consolidation,324695
grade_subgrade,593994,30,C3,58695


[i] Describe of categorical test data:


Unnamed: 0,count,unique,top,freq
gender,254569,3,Female,131480
marital_status,254569,4,Single,123686
education_level,254569,5,Bachelor's,119924
employment_status,254569,5,Employed,193207
loan_purpose,254569,8,Debt consolidation,138963
grade_subgrade,254569,30,C3,25410


* **Analysis**

* Both datasets have the same number of categories for each feature, and the most frequent category is the same in both.

| Feature | Review |
| ------- | ------ |
| `gender` | 3 types: `Male`, `Female`, `Other`. Top is **Female** |
| `marital_status` | 4 types: `Single`, `Married`, `Divorced`, `Widowed`. Top is **Single** |
| `education_level` | 5 types: `High School`, `Master's`, `Bachelor's`, `PhD`, `Other`. Top is **Bachelor's** |
| `employment_status` | 5 types: `Self-employed`, `Employed`, `Unemployed`, `Retired`, `Student`. Top is **Employed** |
| `loan_purpose` | 8 types: `Debt consolidation`, `Home`, `Education`, `Vacation`, `Car`, `Medical`, `Business`, `Other`. Top is **Debt consolidation** |
| `grade_subgrade` | 30 types: From `A1` to `F5`. Top is **C3** |
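
Matching category counts do not by themselves guarantee that both datasets contain the same category values. A small check can list values that appear in the test data but not in the training data, which matters for any encoder fitted on the training set. This is a sketch under assumptions: `unseen_categories`, `train_toy`, and `test_toy` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_df and test_df
train_toy = pd.DataFrame({
    "gender": ["Female", "Male", "Other"],
    "marital_status": ["Single", "Married", "Single"],
})
test_toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "marital_status": ["Single", "Divorced", "Married"],
})

def unseen_categories(train: pd.DataFrame, test: pd.DataFrame, columns: list) -> dict:
    """Return, per column, the values that occur in test but not in train."""
    return {
        col: sorted(set(test[col].unique()) - set(train[col].unique()))
        for col in columns
    }

result = unseen_categories(train_toy, test_toy, ["gender", "marital_status"])
print(result)
```

An empty list for every column means the test set introduces no category the training set has not seen.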

# 4. Data Quality Checks

## 4.1 Missing Value

In [81]:
def is_missing_value(df: pd.DataFrame, name: str):
    # Count missing values per column
    missing_df = df.isna().sum().reset_index()
    missing_df.rename({missing_df.columns[-1]: "Total_missing"}, axis=1, inplace=True)

    # Keep only the columns that contain at least one missing value
    missing_cols = missing_df.loc[missing_df['Total_missing'] > 0]

    if missing_cols.empty:
        print(f"✅ {name} dataset is not having missing values.")
    else:
        print(f"❌ {name} is having missing values.")
        display(missing_cols)

print("❗ Checking train dataset.....")
is_missing_value(train_df, "train")
print("❗ Checking test dataset.....")
is_missing_value(test_df, "test")

❗ Checking train dataset.....
✅ train dataset is not having missing values.
❗ Checking test dataset.....
✅ test dataset is not having missing values.


## 4.2 Duplicate Value

In [87]:
def is_duplicate(df: pd.DataFrame, name: str):
    # Count rows that are exact copies of an earlier row
    duplicate_count = df.duplicated().sum()
    total_rows = df.shape[0]

    if duplicate_count == 0:
        print(f"✅ {name} data is not duplicated.")
    else:
        print(f"❌ {name} data is duplicated.")
        print(f"Rows duplicated: {duplicate_count}/{total_rows}")


print("❗ Checking train dataset.....")
is_duplicate(train_df, "train")
print("❗ Checking test dataset.....")
is_duplicate(test_df, "test")

❗ Checking train dataset.....
✅ train data is not duplicated.
❗ Checking test dataset.....
✅ test data is not duplicated.


## 4.3 Outlier Value