# **Loan Default Prediction**


## **Problem Definition**

### **The Context:**

- Why is this problem important to solve?

### **The objective:**

- What is the intended goal?

### **The key questions:**

- What are the key questions that need to be answered?

### **The problem formulation**:

- What is it that we are trying to solve using data science?


## **Data Description:**

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (about 20 percent). Twelve input variables were recorded for each applicant.

- **BAD:** 1 = Client defaulted on loan, 0 = loan repaid

- **LOAN:** Amount of loan approved.

- **MORTDUE:** Amount due on the existing mortgage.

- **VALUE:** Current value of the property.

- **REASON:** Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, which means taking out a new loan to pay off other liabilities and consumer debts).

- **JOB:** The loan applicant's job type, such as manager or self-employed.

- **YOJ:** Years at present job.

- **DEROG:** Number of major derogatory reports (each indicates a serious delinquency or late payment).

- **DELINQ:** Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

- **CLAGE:** Age of the oldest credit line in months.

- **NINQ:** Number of recent credit inquiries.

- **CLNO:** Number of existing credit lines.

- **DEBTINC:** Debt-to-income ratio (all monthly debt payments divided by gross monthly income). Lenders use this number as one measure of a borrower's ability to manage the monthly payments on the money they plan to borrow.


## **Import the necessary libraries and Data**


In [None]:
import warnings
from sklearn.model_selection import GridSearchCV
import scipy.stats as stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()


warnings.filterwarnings("ignore")

## **Data Overview**


- Reading the dataset
- Understanding the shape of the dataset
- Checking the data types
- Checking for missing values
- Checking for duplicated values


In [None]:
hm = pd.read_csv("hmeq.csv")
# Copying data to another variable to avoid any changes to original data
data = hm.copy()

In [None]:
# Assuming 'data' is a pandas DataFrame containing your dataset

# Understanding the shape of the dataset
print("Shape of the dataset:")
print(data.shape)
print("\n")

# Checking the data types
print("Data types of each column:")
print(data.dtypes)
print("\n")

# Checking for missing values
print("Missing values in each column:")
print(data.isnull().sum())
print("\n")

# Checking for duplicated values
print("Number of duplicated records:")
print(data.duplicated().sum())

In [None]:
cols = data.select_dtypes(["object"]).columns.tolist()

# Adding target variable to this list as this is a classification problem and the target variable is categorical
cols.append("BAD")

# Changing the data type of object type column to category using astype() function
for i in cols:
    data[i] = data[i].astype("category")

# Checking the info again and the datatype of different variables
data.info()

## Summary Statistics


- Observations from Summary Statistics


## **Exploratory Data Analysis (EDA) and Visualization**


- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.


**Leading Questions**:

1. What is the range of values for the loan amount variable "LOAN"?
2. How does the distribution of years at present job "YOJ" vary across the dataset?
3. How many unique categories are there in the REASON variable?
4. What is the most common category in the JOB variable?
5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
7. Is there a correlation between the value of the property and the loan default rate?
8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?
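
Most of the leading questions above map onto a handful of pandas calls. The sketch below is illustrative only: `answer_leading_questions` is a hypothetical helper that is not part of this notebook, and the small `toy` frame is invented so that the snippet runs on its own. The group medians only describe the data; judging whether a difference is *significant* (questions 6 to 8) would still need a formal test.

```python
import pandas as pd


def answer_leading_questions(df: pd.DataFrame) -> dict:
    """Hypothetical helper: descriptive answers to the leading EDA questions."""
    bad = df["BAD"].astype(int)  # BAD may be stored as a category; cast for arithmetic
    return {
        "loan_range": (df["LOAN"].min(), df["LOAN"].max()),  # Q1
        "yoj_summary": df["YOJ"].describe(),  # Q2
        "reason_categories": df["REASON"].nunique(),  # Q3 (missing values not counted)
        "most_common_job": df["JOB"].mode().iloc[0],  # Q4
        "default_rate_by_reason": bad.groupby(df["REASON"]).mean(),  # Q5
        "median_by_outcome": df.groupby(bad)[["LOAN", "VALUE", "MORTDUE"]].median(),  # Q6-Q8
    }


# Toy rows invented for illustration; they are not taken from hmeq.csv
toy = pd.DataFrame(
    {
        "BAD": [1, 0, 0, 1, 0, 0],
        "LOAN": [1100, 1300, 1500, 1700, 2000, 2400],
        "MORTDUE": [25860.0, 70053.0, 13500.0, 97800.0, 30548.0, 64536.0],
        "VALUE": [39025.0, 68400.0, 16700.0, 112000.0, 40320.0, 87400.0],
        "REASON": ["HomeImp", "HomeImp", "DebtCon", "DebtCon", "DebtCon", None],
        "JOB": ["Other", "Other", "Office", "Mgr", "Other", "Sales"],
        "YOJ": [10.5, 7.0, 4.0, 3.0, 9.0, 5.0],
    }
)

answers = answer_leading_questions(toy)
print(answers["loan_range"])
print(answers["default_rate_by_reason"])
```

On the real data the same helper could be called as `answer_leading_questions(data)`.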


### **Univariate Analysis**


In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.describe(include="all")

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


def histogram_boxplot(data, feature, figsize=(15, 7), bins=None):
    """
    Custom function for plotting a histogram and a boxplot for a numerical variable,
    with a vertical line indicating the mean and its value.
    """
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (0.15, 0.85)}, figsize=figsize)
    sns.boxplot(data=data, x=feature, ax=ax_box)
    ax_box.set(xlabel="")
    if bins:
        sns.histplot(data=data, x=feature, ax=ax_hist, bins=bins, kde=True)
    else:
        sns.histplot(data=data, x=feature, ax=ax_hist, kde=True)
    ax_hist.set(ylabel="Frequency")

    # Mean value
    mean_value = data[feature].mean()

    # Add a vertical line for the mean
    ax_hist.axvline(mean_value, color="r", linestyle="--")
    ax_box.axvline(mean_value, color="r", linestyle="--")

    # Annotate the mean value on the plot
    ax_hist.annotate(
        f"Mean: {mean_value:.2f}",
        xy=(mean_value, 0),
        xycoords=("data", "axes fraction"),
        xytext=(0, -30),
        textcoords="offset points",
        va="top",
        ha="center",
        color="red",
    )

    # Put the title on the top (boxplot) axes so it does not sit between the two panels
    ax_box.set_title(f"Univariate Analysis of {feature} (Numerical)")
    plt.show()


def univariate_analysis(data):
    """
    Performs univariate analysis with appropriate plots for numerical and categorical variables.
    - data: DataFrame containing the data
    """
    sns.set(style="whitegrid")

    # Separate the dataset into numerical and categorical data
    numerical_data = data.select_dtypes(include=["int64", "float64"])
    categorical_data = data.select_dtypes(include=["object", "category"])

    # Numerical Data Analysis
    for column in numerical_data.columns:
        histogram_boxplot(data, column)  # Use the custom function for numerical variables

    # Categorical Data Analysis
    for column in categorical_data.columns:
        plt.figure(figsize=(10, 6))
        total = float(len(data[column]))
        ax = sns.countplot(
            x=column, data=categorical_data, palette="Set2", order=categorical_data[column].value_counts().index
        )
        plt.title(f"Univariate Analysis of {column} (Categorical)")
        plt.xticks(rotation=45)

        # Adding percentage annotations
        for p in ax.patches:
            percentage = "{:.1f}%".format(100 * p.get_height() / total)
            x = p.get_x() + p.get_width() / 2
            y = p.get_height()
            ax.annotate(percentage, (x, y), ha="center", va="bottom")

        plt.show()

In [None]:
univariate_analysis(data)

### **Bivariate Analysis**


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np


def bivariate_analysis(data, target="BAD"):
    sns.set(style="whitegrid")

    # Continuous Variables
    continuous_columns = data.select_dtypes(include=["int64", "float64", "float32"]).columns
    continuous_columns = continuous_columns.drop(target) if target in continuous_columns else continuous_columns

    for column in continuous_columns:
        plt.figure(figsize=(10, 6))
        sns.boxplot(x=target, y=column, data=data)
        plt.title(f"{column} vs. {target}")
        plt.show()

    # Categorical Variables
    categorical_columns = data.select_dtypes(include=["object", "category"]).columns
    if target not in categorical_columns:
        categorical_columns = categorical_columns.union([target])

    for column in categorical_columns:
        if column == target:
            continue
        plt.figure(figsize=(10, 6))
        ax = sns.countplot(x=column, hue=target, data=data)
        plt.title(f"{column} vs. {target}")
        plt.xticks(rotation=45)

        # Calculate percentages and add annotations
        total = len(data[column])
        for p in ax.patches:
            percentage = "{:.1f}%".format(100 * p.get_height() / total)
            x = p.get_x() + p.get_width() / 2
            y = p.get_height()
            ax.annotate(percentage, (x, y), ha="center", va="bottom")

        plt.show()

    # Example for Continuous Variables Comparison with Regression Line
    # lmplot is a figure-level function that creates its own figure,
    # so a preceding plt.figure() call would only leave an empty figure behind
    sns.lmplot(x="LOAN", y="MORTDUE", hue=target, data=data, aspect=1.5)
    plt.title("LOAN vs. MORTDUE with Regression Line")
    plt.show()


# bivariate_analysis(data, 'BAD')

In [None]:
# Example usage
bivariate_analysis(data, "BAD")

In [None]:
data.head()

### **Multivariate Analysis**


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


def multivariate_analysis(data, target="BAD"):
    sns.set(style="white")

    # Correlation Heatmap for Numerical Variables
    numerical_data = data.select_dtypes(include=["int64", "float64", "float32"])
    plt.figure(figsize=(12, 10))
    correlation_matrix = numerical_data.corr()
    sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
    plt.title("Correlation Heatmap")
    plt.show()

    # Pairplot for the dataset with 'BAD' as hue
    # A 10% sample keeps the plot fast and readable; select fewer columns if it is still too crowded
    sampled_data = data.sample(frac=0.1, random_state=42)
    g = sns.pairplot(sampled_data, hue=target, vars=numerical_data.columns)
    # pairplot returns a grid of axes, so the title belongs on the figure rather than on a single subplot
    g.fig.suptitle("Pairplot with BAD as hue", y=1.02)
    plt.show()

In [None]:
# Example usage
multivariate_analysis(data, "BAD")

## Treating Outliers


In [None]:
import numpy as np


def treat_outliers(data, method="cap"):
    """
    Treat outliers in the numerical columns of the dataset based on the IQR method.

    Parameters:
    - data: pandas DataFrame containing the data.
    - method: 'cap' to cap outliers with threshold values or 'remove' to drop rows with outliers.

    Returns:
    - The DataFrame with outliers treated.
    """
    treated_data = data.copy()
    for column in treated_data.select_dtypes(include=["float64", "float32", "int64"]).columns:
        Q1 = treated_data[column].quantile(0.25)
        Q3 = treated_data[column].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        if method == "cap":
            # clip() caps values outside the IQR fences and leaves missing values untouched
            treated_data[column] = treated_data[column].clip(lower=lower_bound, upper=upper_bound)
        elif method == "remove":
            # Keep rows inside the fences; rows with a missing value in this column are also kept
            # so that they are handled by the missing-value step instead of being dropped silently
            in_range = treated_data[column].between(lower_bound, upper_bound)
            treated_data = treated_data[in_range | treated_data[column].isna()]

    return treated_data


# Example usage
# data_treated = treat_outliers(data, method='cap')
# or
# data_treated = treat_outliers(data, method='remove')

In [None]:
data = treat_outliers(data, method="cap")
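
As a sanity check on the IQR rule used above, the following sketch computes the fences for a small made-up series and caps it. The numbers are invented and `iqr_fences` is a hypothetical helper, not part of this notebook. It also illustrates an edge case worth checking on count-like columns: when more than 75% of the values are identical, the IQR is 0 and capping collapses the column to a single constant.

```python
import pandas as pd


def iqr_fences(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Q1 = 12.5 and Q3 = 17.5, so IQR = 5 and the fences are 5.0 and 25.0
values = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 100.0])
lower, upper = iqr_fences(values)
capped = values.clip(lower=lower, upper=upper)
print(lower, upper)     # 5.0 25.0
print(capped.tolist())  # the 100.0 entry is capped to 25.0

# Edge case: most values are 0, so Q1 = Q3 = 0 and both fences are 0
mostly_zero = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 3])
zero_fences = iqr_fences(mostly_zero)
print(zero_fences)
print(mostly_zero.clip(*zero_fences).nunique())  # only one distinct value remains
```

If a column in the real data behaves like `mostly_zero`, it may be preferable to leave it out of the capping step.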

## Treating Missing Values
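
No imputation step appears in this section, so the following is only one possible sketch of a missing-value treatment, not the method used in the original analysis. It assumes median imputation for numeric columns and most-frequent-value imputation for categorical columns, with optional indicator columns that record where values were missing. `impute_missing` is a hypothetical helper and the toy frame is invented.

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame, add_flags: bool = True) -> pd.DataFrame:
    """Hypothetical helper: median/mode imputation with optional missing-value flags."""
    out = df.copy()
    for col in df.columns:
        if not df[col].isna().any():
            continue  # nothing to impute in this column
        if add_flags:
            # Record where values were missing, since missingness itself can be informative
            out[f"{col}_missing"] = df[col].isna().astype(int)
        if pd.api.types.is_numeric_dtype(df[col]):
            out[col] = df[col].fillna(df[col].median())
        else:
            out[col] = df[col].fillna(df[col].mode().iloc[0])
    return out


# Toy rows invented for illustration
toy = pd.DataFrame(
    {
        "DEBTINC": [34.8, np.nan, 36.9, 29.0],
        "REASON": ["DebtCon", "HomeImp", None, "DebtCon"],
        "LOAN": [1100, 1300, 1500, 1700],
    }
)
toy_imputed = impute_missing(toy)
print(toy_imputed)
```

Under these assumptions, `data_transformed_df = impute_missing(data)` would produce a frame like the one inspected later in this section. Computing the medians and modes on the training split only would avoid leaking information from the test set.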


In [None]:
# Understanding the shape of the dataset
print("Shape of the dataset:")
print(data.shape)
print("\n")

# Checking the data types
print("Data types of each column:")
print(data.dtypes)
print("\n")

# Checking for missing values
print("Missing values in each column:")
print(data.isnull().sum())
print("\n")

# Checking for duplicated values
print("Number of duplicated records:")
print(data.duplicated().sum())

In [None]:
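# NOTE: `data_transformed_df` is not defined in any cell above. It is assumed to be the
# missing-value-treated copy of `data`; create it before running this cell to avoid a NameError.

# Understanding the shape of the dataset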
print("Shape of the dataset:")
print(data_transformed_df.shape)
print("\n")

# Checking the data types
print("Data types of each column:")
print(data_transformed_df.dtypes)
print("\n")

# Checking for missing values
print("Missing values in each column:")
print(data_transformed_df.isnull().sum())
print("\n")

# Checking for duplicated values
print("Number of duplicated records:")
print(data_transformed_df.duplicated().sum())

## **Important Insights from EDA**

What are the most important observations and insights from the data based on the EDA performed?


## **Model Building - Approach**

- Data preparation
- Partition the data into train and test set
- Build the model
- Fit on the train data
- Tune the model
- Test the model on test set
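
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the notebook's final pipeline: the `demo` frame is synthetic (column names mirror HMEQ, values are random), logistic regression stands in for "the model", tuning is omitted, and recall on the default class is used as the metric only as an assumed example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data; values are random, not from hmeq.csv
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame(
    {
        "DEBTINC": rng.normal(34, 8, n),
        "DELINQ": rng.poisson(0.5, n),
        "REASON": rng.choice(["DebtCon", "HomeImp"], n),
    }
)
# Make the synthetic target depend on DELINQ and DEBTINC so there is signal to learn
logit = -2.0 + 0.9 * demo["DELINQ"] + 0.05 * (demo["DEBTINC"] - 34)
demo["BAD"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Data preparation: separate the target and one-hot encode categorical inputs
X = pd.get_dummies(demo.drop(columns="BAD"), drop_first=True, dtype=float)
y = demo["BAD"]

# Partition: stratify so that both splits keep roughly the same default rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Build the model and fit it on the training data only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Test the model on the held-out set
test_recall = recall_score(y_test, model.predict(X_test))
print(f"Test recall for class 1: {test_recall:.2f}")
```

Stratifying the split matters here because only about one in five loans is a default; an unstratified split can leave the test set with a noticeably different class balance.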


### Logistic Regression


### Decision Tree


### **Decision Tree - Hyperparameter Tuning**

- Hyperparameter tuning is tricky in the sense that **there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model**, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.
- **Grid search is a tuning technique that looks for the best-performing hyperparameter values among the candidates you specify.**
- **It is an exhaustive search**: every combination of the specified parameter values is evaluated.
- The parameters of the estimator/model used to apply these methods are **optimized by cross-validated grid-search** over a parameter grid.

**Criterion {“gini”, “entropy”}**

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
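
To make the two criteria concrete, the sketch below computes both impurity measures for a single node from its class proportions, using the standard definitions: Gini impurity is 1 minus the sum of squared proportions, and entropy is the negative sum of each proportion times its base-2 logarithm. The helper names `gini` and `entropy` are illustrative and are not scikit-learn functions.

```python
import numpy as np


def gini(p):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p**2)


def entropy(p):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping zero proportions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


node = [0.8, 0.2]  # e.g. a node with 80% repaid and 20% defaulted loans
print(round(gini(node), 4))     # 0.32
print(round(entropy(node), 4))  # 0.7219
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed, which is why a split that lowers them is considered a better split.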

**max_depth**

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

**min_samples_leaf**

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

You can learn about more hyperparameters at this link and try to tune them.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
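
A cross-validated grid search over the three hyperparameters described above might look like the sketch below. It is illustrative only: the data come from `make_classification` rather than HMEQ, the candidate values in `param_grid` are arbitrary, and `scoring="recall"` is an assumed choice of metric.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared features (roughly 80/20 classes)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Candidate values are arbitrary examples; 2 * 4 * 3 = 24 combinations are tried
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 20],
}

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid=param_grid,
    scoring="recall",  # assumed metric; replace with the metric chosen for the project
    cv=5,  # 5-fold cross-validation on the training split
)
grid.fit(X_train, y_train)

print(grid.best_params_)
best_tree = grid.best_estimator_  # refit on the whole training split by default
print(f"Test accuracy of the tuned tree: {best_tree.score(X_test, y_test):.2f}")
```

Because the search only ever sees the training split, the test set remains available for an unbiased final check.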


### **Building a Random Forest Classifier**

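**Random Forest is a bagging algorithm where the base models are Decision Trees.** Bootstrap samples are drawn from the training data, a decision tree is fit to each sample (considering a random subset of the features at each split), and each tree then makes its own prediction.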

**The results from all the decision trees are combined together and the final prediction is made using voting or averaging.**
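
The sketch below fits a small forest on synthetic data and then reproduces its prediction by hand, to show how the individual trees are combined. In scikit-learn, `RandomForestClassifier` averages the class probabilities predicted by its trees. The data and parameter values are illustrative assumptions, not the notebook's final settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the prepared features
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees
    bootstrap=True,       # each tree is fit on a bootstrap sample of the training data
    max_features="sqrt",  # random subset of features considered at each split
    oob_score=True,       # score each sample using only trees that did not see it
    random_state=1,
)
forest.fit(X_train, y_train)

# Combine the trees by hand: average their predicted class probabilities
manual_proba = np.mean([t.predict_proba(X_test) for t in forest.estimators_], axis=0)
print(np.allclose(manual_proba, forest.predict_proba(X_test)))
print(f"Out-of-bag accuracy: {forest.oob_score_:.2f}")
```

The out-of-bag score is a by-product of bootstrapping and gives a validation-style estimate without setting aside extra data.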


### **Random Forest Classifier Hyperparameter Tuning**


**1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):**

- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
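
One way to organize such a comparison is a single table with one row per model and one column per metric. The sketch below is a hypothetical illustration on synthetic data: the three models, their settings, and the four metrics are assumptions, and the numbers it prints say nothing about performance on HMEQ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared features
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

rows = []
for name, model in models.items():
    # Fit on the training split, then score every model on the same held-out test split
    pred = model.fit(X_train, y_train).predict(X_test)
    rows.append(
        {
            "model": name,
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
        }
    )

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

Scoring every model on the same split keeps the rows directly comparable.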


**2. Refined insights:**

- What are the most meaningful insights relevant to the problem?


**3. Proposal for the final solution design:**

- What model do you propose to be adopted? Why is this the best solution to adopt?
