# Project 2: Investigating borrower reliability

## Introduction

The client is a bank's credit department, and the objective is to determine whether a client's marital status and the number of children they have impact the timely repayment of loans. The bank provides input data consisting of clients' creditworthiness statistics.

The results of this study will be incorporated into the development of a credit scoring model—a specialized system that evaluates a potential borrower's ability to repay a loan to the bank.

**Research goals** - to test four hypotheses:

1. Is there a correlation between having children and repaying a loan on time?
2. Is there a correlation between marital status and repaying a loan on time?
3. Is there a correlation between income level and repaying a loan on time?
4. How do different loan purposes affect the timely repayment of loans?

**Data description**

The table consists of 12 columns:

1. `children` — the number of children in the family
2. `days_employed` — total employment history in days
3. `dob_years` — client's age in years
4. `education` — client's education level
5. `education_id` — identifier for the education level
6. `family_status` — marital status
7. `family_status_id` — identifier for the marital status
8. `gender` — client's gender
9. `income_type` — type of employment
10. `debt` — whether the client had debt related to loan repayment (yes/no)
11. `total_income` — monthly income
12. `purpose` — purpose of the loan

Each row in the dataset represents a client. The data includes both categorical variables (such as `education`, `income_type`) and quantitative variables (like `total_income`, `children`). The data appears sufficient for testing the stated hypotheses.

In the subsequent analysis, we will examine if there are any artifacts in the data that require attention. For now, it is worth noting that all column names are already recorded in the snake_case format.

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from caseconverter import snakecase
from collections import defaultdict
from IPython.display import display

In [3]:
FIG_WIDTH = 10 * 100
FIG_HEIGHT = 5 * 100

In [4]:
def get_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """
    Generate summary statistics for each column in a DataFrame.

    Args:
        df (pd.DataFrame): Input DataFrame.

    Returns:
        pd.DataFrame: DataFrame with column statistics, including column name, data type,
                      count of unique values, a sample of unique values, null count, and
                      percentage of null values.
    """
    rows_total = len(df)
    
    stats_list = [
        {
            'column_name': f"'{column}'",
            'data_type': str(df[column].dtype),
            'count_unique': df[column].nunique(),
            'sample_values': (
                pd.Series(df[column].dropna().unique())
                .sample(min(5, len(df[column].dropna().unique())))
                .apply(lambda x: round(x, 2) if pd.api.types.is_numeric_dtype(df[column]) else x)
                .tolist()
            ),
            'count_null': df[column].isnull().sum(),
            'pct_null': round((df[column].isnull().sum() / rows_total) * 100, 0),
        }
        for column in df.columns
    ]

    print(f"Dataframe size: {df.shape[0]} rows x {df.shape[1]} columns")
    print(f"Full duplicate rows: {df.duplicated().sum()}")

    # Convert the list of statistics to a DataFrame
    return pd.DataFrame(stats_list)

In [5]:
def plot_unique_counts(df: pd.DataFrame, columns: list, n_cols: int = 2, fig_size: tuple = (1200, 800), top_n: int = 10):
    """
    Generate a Plotly figure with multiple subplots, each showing unique value counts for a selected column.

    Args:
        df (pd.DataFrame): Input DataFrame.
        columns (list): List of column names to analyze for unique counts.
        n_cols (int): Number of columns in the subplot grid. Defaults to 2.
        fig_size (tuple): Size of the figure (width, height). Defaults to (1200, 800).
        top_n (int): Number of top unique values to display per column. Defaults to 10.

    Returns:
        None: Displays the Plotly figure.
    """
    n_rows = -(-len(columns) // n_cols)  # Calculate number of rows dynamically
    fig = make_subplots(rows=n_rows, cols=n_cols, subplot_titles=columns)

    row, col = 1, 1  # Track subplot position
    for column in columns:
        # Compute value counts and take the top N
        df_temp = (
            df[column]
            .value_counts()
            .reset_index()
            .set_axis(['values', 'ucount'], axis=1)
            .head(top_n)
        )

        # Create bar plot for the current column
        bar_fig = px.bar(
            df_temp.sort_values('ucount', ascending=True),
            x='ucount',
            y='values',
            title=f"Top {top_n} unique values in {column}",
            orientation='h'
        )

        # Extract traces and add to the main figure
        for trace in bar_fig['data']:
            fig.add_trace(trace, row=row, col=col)

        # Move to the next subplot
        col += 1
        if col > n_cols:
            col = 1
            row += 1

    # Update layout
    fig.update_layout(
        title_text="Unique value counts per column",
        width=fig_size[0], height=fig_size[1],
        showlegend=False,
        template='plotly_white'
    )

    fig.show()

In [6]:
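try:
    raw_score = pd.read_csv('/datasets/data.csv')
except FileNotFoundError:
    # Fall back to a local copy when the platform path is unavailable
    raw_score = pd.read_csv('raw_credit_scoring.csv')

# Later cells refer to the same raw data as `raw_df`
raw_df = raw_score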

## Data preprocessing

Initially, we need to examine the available data. Our areas of interest include:

1. Artifacts in the data
2. Missing (NaN) values in columns
3. Duplicate rows, non-standard category names, and data types in columns
4. Lookup tables
5. Classification of `total_income` and `purpose`

We will address these points one-by-one.

In [7]:
get_statistics(raw_score)

Dataframe size: 21525 rows x 12 columns
Full duplicate rows: 54


Unnamed: 0,column_name,data_type,count_unique,sample_values,count_null,pct_null
0,'children',int64,8,"[20, 5, 4, 1, 0]",0,0.0
1,'days_employed',float64,19351,"[373939.24, -5698.81, -3561.5, -631.94, -7973.76]",2174,10.0
2,'dob_years',int64,58,"[74, 44, 58, 55, 68]",0,0.0
3,'education',object,15,"[СРЕДНЕЕ, начальное, НЕОКОНЧЕННОЕ ВЫСШЕЕ, учен...",0,0.0
4,'education_id',int64,5,"[0, 2, 1, 3, 4]",0,0.0
5,'family_status',object,5,"[гражданский брак, в разводе, Не женат / не за...",0,0.0
6,'family_status_id',int64,5,"[1, 2, 4, 0, 3]",0,0.0
7,'gender',object,3,"[XNA, F, M]",0,0.0
8,'income_type',object,8,"[сотрудник, в декрете, компаньон, безработный,...",0,0.0
9,'debt',int64,2,"[0, 1]",0,0.0


The first observation is that the categorical columns have `object` data type, representing text fields. Most of the quantitative columns have `float` data type, with the exception of `children` and `dob_years`, which have `int` data type (this makes sense, as the number of children cannot be a fraction).

The second observation is that not every column is fully populated: `days_employed` has 2,174 missing values (about 10% of the rows), and the `describe()` summary further below shows the same non-null count (19,351) for `total_income`. This suggests that some records (rows) are missing these values.

Lastly, it appears that the columns `education_id` and `family_status_id` contain identifiers (keys) for different statuses in the `education` and `family_status` columns.

The `total_income` column has the same number of missing values as `days_employed` (2,174), which suggests that both values are missing in the same rows. Thus, we can work with missing values on a row-by-row basis.

We will check the proportion of affected rows after a first look at the data.
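
The claim that both columns are missing in the same rows can be checked directly rather than inferred from matching counts. A minimal sketch on a toy frame (the values are made up for illustration; on the real data the same masks would be built from the loaded DataFrame):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "days_employed": [-8437.7, np.nan, -4024.8, np.nan],
    "total_income": [253875.6, np.nan, 112080.0, np.nan],
    "children": [1, 0, 1, 2],
})

# Boolean masks marking where each column is missing.
missing_days = toy["days_employed"].isna()
missing_income = toy["total_income"].isna()

# If the two masks are identical, the gaps occur in exactly the same rows.
same_rows = bool((missing_days == missing_income).all())
share_missing = round(missing_income.mean() * 100, 1)

print("Gaps coincide row by row:", same_rows)
print("Share of rows with gaps, %:", share_missing)
```

Equal counts alone do not prove that the gaps overlap, so comparing the masks is the safer check.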

In [8]:
display(raw_score.head(10))

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу
5,0,-926.185831,27,высшее,0,гражданский брак,1,M,компаньон,0,255763.565419,покупка жилья
6,0,-2879.202052,43,высшее,0,женат / замужем,0,F,компаньон,0,240525.97192,операции с жильем
7,0,-152.779569,50,СРЕДНЕЕ,1,женат / замужем,0,M,сотрудник,0,135823.934197,образование
8,2,-6929.865299,35,ВЫСШЕЕ,0,гражданский брак,1,F,сотрудник,0,95856.832424,на проведение свадьбы
9,0,-2188.756445,41,среднее,1,женат / замужем,0,M,сотрудник,0,144425.938277,покупка жилья для семьи


In [9]:
plot_unique_counts(
    raw_score, 
    ['education', 'family_status', 'gender', 'income_type', 'purpose'],
    n_cols=1,
    top_n=10,
    fig_size=(FIG_WIDTH, 2*FIG_HEIGHT)
)

From the first 10 rows, it is clear that the data needs to be processed before analysis. For example, in the `education` column, the data is not standardized: a value can be `СРЕДНЕЕ` or `Среднее`. In the `days_employed` column, we also encounter unusual values: either negative or greater than 300,000.

In [11]:
columns_values = ['children', 'days_employed', 'dob_years', 'total_income']

columns_category = [
    'education',
    'education_id',
    'family_status',
    'family_status_id',
    'gender',
    'income_type',
    'debt',
    'purpose'
]

round(raw_score[columns_values].describe().T, 1)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
children,21525.0,0.5,1.4,-1.0,0.0,0.0,1.0,20.0
days_employed,19351.0,63046.5,140827.3,-18388.9,-2747.4,-1203.4,-291.1,401755.4
dob_years,21525.0,43.3,12.6,0.0,33.0,42.0,53.0,75.0
total_income,19351.0,167422.3,102971.6,20667.3,103053.2,145017.9,203435.1,2265604.0


We're not very fortunate: the data contains `children` values of -1 and 20, applicants with an age of zero, and work experience that is either negative or more than a thousand years long.

We will correct these artifacts based on a few assumptions:

1. `children = -1` and `children = 20` - likely due to some error correction logic in the system. It seems that upon input error, the database records -1. And 20 is an upper limit (also confirmed by the absence of values between 5 and 20). We will replace them with the average value for the corresponding `income_type` group. For example, a retiree is more likely to have several children, while a student has none. This assumption is a stretch for `children = 20`, but given the tight deadlines, I have no other options.

2. We will do the same for `dob_years = 0`, since age also depends on employment type: it is unlikely that we will see 60-year-old students or 20-year-old retirees.

3. `days_employed` is slightly more complicated. We have two issues here: negative values and millennium-long experience. The negative values can be fixed with the `abs()` function (they are possibly due to data import errors). The extreme values might represent work experience in hours rather than days. However, since this column will not be used in hypothesis testing, we will simply remove it.

Let's run some checks and then carry out these adjustments.
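
The hours hypothesis for `days_employed` can be sanity-checked with simple arithmetic on the extremes from the `describe()` output above. The sketch below only converts units; the constant names are illustrative, and it does not prove how the values were actually recorded:

```python
# Extremes of `days_employed` taken from the describe() output above.
max_days_employed = 401755.4
min_days_employed = -18388.9

DAYS_PER_YEAR = 365.25
HOURS_PER_DAY = 24

# Read literally as days, the maximum corresponds to more than a millennium.
years_if_days = max_days_employed / DAYS_PER_YEAR

# Read as hours instead, the same number becomes a plausible career length.
years_if_hours = max_days_employed / HOURS_PER_DAY / DAYS_PER_YEAR

# The most negative value, with its sign dropped, is also a plausible tenure.
years_from_negative = abs(min_days_employed) / DAYS_PER_YEAR

print(round(years_if_days), round(years_if_hours, 1), round(years_from_negative, 1))
```

Interpreted as hours, the maximum comes to roughly 46 years, which is at least consistent with the hypothesis.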

In [7]:
print(
    'Empty rows in the dataset:',
    round(raw_df.total_income.isna().sum() / raw_df.children.count() * 100, 1), '%'
)
display(raw_df[raw_df.total_income.isna()].head())

Empty rows in the dataset: 10.1 %


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,среднее,1,гражданский брак,1,M,пенсионер,0,,сыграть свадьбу
26,0,,41,среднее,1,женат / замужем,0,M,госслужащий,0,,образование
29,0,,63,среднее,1,Не женат / не замужем,4,F,пенсионер,0,,строительство жилой недвижимости
41,0,,50,среднее,1,женат / замужем,0,F,госслужащий,0,,сделка с подержанным автомобилем
55,0,,54,среднее,1,гражданский брак,1,F,пенсионер,1,,сыграть свадьбу


This time we are fortunate: the share of missing values is relatively small (about 10%). The gaps appear to be random: it is unlikely that a male civil servant with secondary education has neither work experience nor income (see row 26). Therefore, we can replace these values with the mean within each employment type. This approach should not significantly distort the overall picture, given that income ranges may differ widely between groups.

### Duplicate rows, non-standard category names, and column data types

It's time to look for possible duplicates. Since we don't have unique record identifiers in the dataset, I will approach the problem directly: if there are rows where all values are the same, we will consider them duplicates.

The task suggests standardizing letter case and data types first. I propose the opposite order: remove duplicates first and then proceed with standardization. First, `total_income` has many decimal places (perhaps the value was converted from one currency to another), which increases the "uniqueness" of a row. Second, if we assume that the different letter cases in `education` are a result of manual data entry by a person, keeping the original spelling also helps avoid deleting extra rows.
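
The effect of this ordering can be seen on a toy frame: rows that differ only in letter case are not exact duplicates, but they become duplicates once the text is lowercased. This is a minimal sketch with made-up rows, not the notebook's actual data:

```python
import pandas as pd

# Toy rows: the first two differ only in the letter case of `education`.
toy = pd.DataFrame({
    "education": ["среднее", "СРЕДНЕЕ", "высшее"],
    "family_status": ["женат / замужем", "женат / замужем", "гражданский брак"],
    "children": [0, 0, 1],
})

# Exact-match duplicates before any standardization.
dups_raw = int(toy.duplicated().sum())

# Duplicates after lowercasing the text column.
normalized = toy.assign(education=toy["education"].str.lower())
dups_normalized = int(normalized.duplicated().sum())

print(dups_raw, dups_normalized)
```

Whichever order is chosen, the number of rows removed depends on it, so it is worth recording the duplicate count at both stages.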

In [8]:
print('Number of duplicate rows:', raw_df.duplicated().sum())

display(raw_df[raw_df.duplicated()].sort_values(by=['income_type', 'dob_years']).head())

Number of duplicate rows: 54


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
18349,1,,30,высшее,0,женат / замужем,0,F,госслужащий,0,,покупка жилья для семьи
14432,2,,36,высшее,0,женат / замужем,0,F,госслужащий,0,,получение образования
13878,1,,31,среднее,1,женат / замужем,0,F,компаньон,0,,покупка жилья
19387,0,,38,высшее,0,гражданский брак,1,F,компаньон,0,,на проведение свадьбы
10697,0,,40,среднее,1,гражданский брак,1,F,компаньон,0,,сыграть свадьбу


It appears that the duplicate rows are coincidental: in these applications, the `days_employed` and `total_income` values were not provided, so the rows match only on a handful of categorical fields. As we don't know the meaning of NaN in this data's context (whether the value is unspecified due to omission or because it equals zero), a straightforward solution would be to remove these rows from the dataset.

Let's create a clean version of the dataset `df` for analysis.

In [9]:
df = (
    raw_df.copy()
    .assign(
        children=lambda df: (
            df.children
            .replace({-1:np.nan, 20:np.nan})
            .fillna(df.groupby('income_type').children.transform('mean'))
            .astype('int64')  # truncates fractional group means toward zero
        ),
        dob_years=lambda df: (
            df.dob_years
            .replace(0, np.nan)
            .fillna(df.groupby('income_type').dob_years.transform('mean'))
            .astype('int64')
        ),
        total_income=lambda df: (
            df.total_income
            .fillna(df.groupby('income_type').total_income.transform('mean'))
            .astype('int64')
        ),
        education=lambda df: df.education.str.lower(),
        family_status=lambda df: df.family_status.str.lower()   
    )
    .loc[lambda df: df.gender != 'XNA']
    .drop('days_employed', axis=1)
    .drop_duplicates()
)
 
columns_values.remove('days_employed')

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21453 entries, 0 to 21524
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   children          21453 non-null  int64 
 1   dob_years         21453 non-null  int64 
 2   education         21453 non-null  object
 3   education_id      21453 non-null  int64 
 4   family_status     21453 non-null  object
 5   family_status_id  21453 non-null  int64 
 6   gender            21453 non-null  object
 7   income_type       21453 non-null  object
 8   debt              21453 non-null  int64 
 9   total_income      21453 non-null  int64 
 10  purpose           21453 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB
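
The group-mean imputation used in the cleaning cell above can be sketched on a toy frame. Note that this is a variation, not the notebook's exact logic: here the artifact values are excluded before the group mean is computed, and the result is rounded rather than truncated. The toy values and the `children_fixed` column name are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Toy data; -1 and 20 play the role of the artifact values in `children`.
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "retiree", "retiree"],
    "children": [3, 1, 20, 0, -1],
})

# Step 1: turn artifact values into NaN so they are treated as missing.
children_clean = toy["children"].replace({-1: np.nan, 20: np.nan})

# Step 2: compute the group mean from the cleaned values only.
group_mean = children_clean.groupby(toy["income_type"]).transform("mean")

# Step 3: fill the gaps, round to the nearest whole number, cast to int.
toy["children_fixed"] = children_clean.fillna(group_mean).round().astype("int64")

print(toy)
```

Computing the mean after the replacement keeps the artifacts themselves from inflating or deflating the fill value.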


### Lookup tables

In our dataset, there are two columns that can be used as keys - `education_id` and `family_status_id`. Let's extract them into lookup tables.

In [10]:
lut_education = df[['education_id', 'education']].drop_duplicates().reset_index(drop=True)

lut_family_status = df[['family_status_id', 'family_status']].drop_duplicates().reset_index(drop=True)
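
Before relying on a lookup table, it may be worth confirming that each identifier maps to exactly one label. A minimal sketch on toy data (the rows are illustrative; `lut` and `restored` are hypothetical names):

```python
import pandas as pd

# Toy extract of the two education columns after lowercasing (illustrative values).
toy = pd.DataFrame({
    "education_id": [0, 1, 1, 0, 2],
    "education": ["высшее", "среднее", "среднее", "высшее", "неоконченное высшее"],
})

lut = toy[["education_id", "education"]].drop_duplicates().reset_index(drop=True)

# A valid lookup table has exactly one label per id and one id per label.
ids_unique = lut["education_id"].is_unique
labels_unique = lut["education"].is_unique

# With the lookup table in place, labels can be restored from ids by a merge.
restored = toy[["education_id"]].merge(lut, on="education_id", how="left")

print(ids_unique, labels_unique, restored["education"].equals(toy["education"]))
```

If the labels had not been lowercased first, the same id would map to several spellings and `ids_unique` would be false.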

### Classification of `total_income`

Based on the ranges specified below, let's create a column `total_income_category` with the following categories:
- **E**: 0 – 30,000
- **D**: 30,001 – 50,000
- **C**: 50,001 – 200,000
- **B**: 200,001 – 1,000,000
- **A**: 1,000,001 and above.

In [11]:
def fun_categorizer_total_income(value_for_categorization: float) -> str:
    """
    This function allocates a category for a variable based on its value. 

    Args:
        value_for_categorization ([float or int]): value for categorization

    Returns:
        - 'E' if value_for_categorization <= 30,000
        - 'D' if value_for_categorization = [30,001 : 50,000]
        - 'C' if value_for_categorization = [50,001 : 200,000]
        - 'B' if value_for_categorization = [200,001 : 1,000,000]
        - 'A' if value_for_categorization >= 1,000,001
    """
    if value_for_categorization <= 30e3:
        return 'E'
    elif 30e3 < value_for_categorization <= 50e3:
        return 'D'
    elif 50e3 < value_for_categorization <= 200e3:
        return 'C'
    elif 200e3 < value_for_categorization <= 1e6:
        return 'B'
    else:
        return 'A'

In [12]:
df['total_income_category'] = df.total_income.apply(fun_categorizer_total_income)
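
As an alternative to applying a Python function row by row, the same ranges could be expressed with `pd.cut`. This is a sketch of an equivalent binning under the ranges listed above, not the notebook's implementation; the sample incomes are made up to hit the boundaries:

```python
import numpy as np
import pandas as pd

# Bin edges follow the ranges listed above; intervals are closed on the right,
# so 30,000 falls into 'E' and 30,001 into 'D'.
bins = [-np.inf, 30_000, 50_000, 200_000, 1_000_000, np.inf]
labels = ["E", "D", "C", "B", "A"]

incomes = pd.Series([25_000, 30_000, 30_001, 145_017, 200_001, 2_265_604])
categories = pd.cut(incomes, bins=bins, labels=labels, right=True)

print(categories.tolist())
```

The result is a categorical column, which also keeps the bin order available for sorting.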

### Classification of `purpose`

Finally, it's time to add standardized categories to the `purpose` column.

In [13]:
def fun_categorizer_purpose(string_for_categorization: str) -> str:
    """
    This function categorizes a string based on the presence of a particular set of characters.

    Args:
        string_for_categorization (str): string in which the sets of characters are checked

    Returns:
        - 'car' if string contains 'авто'
        - 'wedding' if string contains 'свад'
        - 'education' if string contains 'образ'
        - 'property' otherwise (expected for strings containing 'нед' or 'жил')
    """
    if 'авто' in string_for_categorization:
        return 'car'
    elif 'свад' in string_for_categorization:
        return 'wedding'
    elif 'образ' in string_for_categorization:
        return 'education'
    else:
        return 'property'

In [14]:
df['purpose_category'] = df.purpose.apply(fun_categorizer_purpose)
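
Because the function above treats every unmatched string as `property`, an unexpected purpose would be misclassified silently. A possible variation, sketched below, matches the property stems explicitly and returns a separate label otherwise. The names `PURPOSE_STEMS` and `categorize_purpose`, the `'other'` label, and the last example string are assumptions for illustration:

```python
# Keyword stems mapped to categories; the stems mirror those in the docstring above.
PURPOSE_STEMS = {
    "авто": "car",
    "свад": "wedding",
    "образ": "education",
    "жил": "property",
    "нед": "property",
}


def categorize_purpose(purpose: str) -> str:
    """Return the category of the first matching stem, or 'other' if none match."""
    text = purpose.lower()
    for stem, category in PURPOSE_STEMS.items():
        if stem in text:
            return category
    # Hypothetical fallback label: makes unmatched purposes visible
    # instead of silently counting them as 'property'.
    return "other"


examples = ["покупка жилья", "сыграть свадьбу", "приобретение автомобиля", "ремонт"]
print([categorize_purpose(p) for p in examples])
```

Counting the rows labelled `other` afterwards would show whether the four categories really cover every purpose in the data.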

At this point, we can wrap up the data preprocessing. Now it's time to move on to the analysis!

## Hypothesis testing

We need to test 4 hypotheses:

1. Is there a relationship between having children and repaying a loan on time?
2. Is there a relationship between marital status and repaying a loan on time?
3. Is there a relationship between income level and repaying a loan on time?
4. How do different loan purposes affect timely repayment?

The hypothesis testing process will be straightforward: for each of these questions, we will create a pivot table. The rows of the table will correspond to the categories, and the columns will represent the total number of people in the category, the number of people who repaid the loan on time, and the number of people who did not repay the loan on time. Based on these columns, we will calculate the corresponding proportions and then draw conclusions.

Let's assume that if the repayment rates of two categories differ by more than 3 percentage points, the difference is meaningful, which means there is a dependency. Conversely, if the difference is 3 percentage points or less, there is no dependency. This is a rule of thumb rather than a test of statistical significance: I do not have experience in statistics, and I suspect that in a more serious exercise the threshold would be better justified.

In [15]:
def fun_get_summary(data: pd.DataFrame, rows: object) -> pd.DataFrame:
    """
    Calculates the counts and percentages of people with and without debt based on the given data.

    Args:
        data (DataFrame): The input data containing information about individuals.
        rows (str or list): The column(s) to be used as the row index(es) in the pivot table.

    Returns:
        DataFrame: A pivot table with counts and percentages of people with and without debt, sorted by 'people_with_debt_per' and 'people_total' columns.

    Raises:
        KeyError: If the specified columns 'gender' or 'debt' are not found in the data.
    """

    pivot = (
        pd.pivot_table(
            data,
            values='gender',
            index=rows,
            columns='debt',
            aggfunc='count',
            fill_value=0
        )
        .rename(
            columns={0:'people_without_debt_no', 1:'people_with_debt_no'}, level=0
        )
        .assign(
            people_total=lambda df: df.people_without_debt_no + df.people_with_debt_no,
            people_without_debt_per=lambda df: round(df.people_without_debt_no / df.people_total * 100, 2),
            people_with_debt_per=lambda df: round(df.people_with_debt_no / df.people_total * 100, 2)
        )
    )

    return pivot.sort_values(by=['people_with_debt_per', 'people_total'])
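
The same summary can also be obtained without a pivot table, because `debt` is a 0/1 flag: its sum is the number of clients with debt and its mean is their share. A minimal sketch on toy data (the rows are made up; the column names follow the function above, and `debt_share` is a temporary helper column):

```python
import pandas as pd

# Toy data: `debt` is 1 when the client had a repayment delay, 0 otherwise.
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "debt":     [0, 0, 0, 1, 0, 1, 0, 0],
})

summary = (
    toy.groupby("children")["debt"]
    .agg(people_total="count", people_with_debt_no="sum", debt_share="mean")
    .assign(people_with_debt_per=lambda t: (t["debt_share"] * 100).round(2))
    .drop(columns="debt_share")
    .sort_values("people_with_debt_per")
)

print(summary)
```

This form does not depend on an unrelated column such as `gender` being present and non-null for the counts to be correct.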


### Dependency between having children and timely loan repayment

Let's check what we have for the first hypothesis.

In [16]:
display(fun_get_summary(df, 'children' ))

debt,people_without_debt_no,people_with_debt_no,people_total,people_without_debt_per,people_with_debt_per
children,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,9,0,9,100.0,0.0
0,13141,1072,14213,92.46,7.54
3,303,27,330,91.82,8.18
1,4364,444,4808,90.77,9.23
2,1858,194,2052,90.55,9.45
4,37,4,41,90.24,9.76


The first noticeable observation is that not all categories are sufficiently large to draw meaningful conclusions. For instance, only 330, 41, and 9 individuals have 3, 4, and 5 children respectively, while the groups with 0, 1, or 2 children are at least six times larger (2,052 to 14,213 people).

**Conclusion:** Based on our criterion, there does not seem to be a significant relationship between having children and timely loan repayment. Leaving aside the nine clients with 5 children, the repayment rate ranges from 90.2% to 92.5% across categories.
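
To make the sample-size caveat concrete, one option is to attach an uncertainty interval to each share. The sketch below uses the Wilson score interval, which is a technique swapped in here for illustration and not part of the original analysis; the counts are copied from the table above:

```python
import math


def wilson_interval(events: int, total: int, z: float = 1.96):
    """Wilson score interval (about 95% for z = 1.96) for a proportion."""
    p_hat = events / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Clients with debt and group sizes, copied from the table above.
for label, events, total in [("0 children", 1072, 14213), ("4 children", 4, 41)]:
    low, high = wilson_interval(events, total)
    print(f"{label}: {low * 100:.1f}% to {high * 100:.1f}%")
```

The interval for the small group is far wider than for the large one, which is why shares from small categories should be compared with caution.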

### Dependency between marital status and timely loan repayment

Let's check what we have for the second hypothesis.

In [17]:
display(fun_get_summary(df, 'family_status'))

debt,people_without_debt_no,people_with_debt_no,people_total,people_without_debt_per,people_with_debt_per
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
вдовец / вдова,896,63,959,93.43,6.57
в разводе,1110,85,1195,92.89,7.11
женат / замужем,11408,931,12339,92.45,7.55
гражданский брак,3762,388,4150,90.65,9.35
не женат / не замужем,2536,274,2810,90.25,9.75


In this case, we have a more evenly distributed sample compared to the previous hypothesis. Therefore, let's examine all the categories.

**Conclusion:** In this scenario, we observe a slightly more pronounced dependency. Being in a civil partnership or never having been married is associated with a lower likelihood of loan repayment (although it still remains around 90%) compared to the other categories (92.5% to 93.4%). It is interesting to note that the most reliable category is widowed individuals, although this may be a result of the relatively small sample size.

### Dependency between income level and timely loan repayment

Let's check what we have for the third hypothesis.

In [18]:
display(fun_get_summary(df, 'total_income_category'))

debt,people_without_debt_no,people_with_debt_no,people_total,people_without_debt_per,people_with_debt_per
total_income_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
D,329,21,350,94.0,6.0
B,5156,386,5542,93.04,6.96
A,23,2,25,92.0,8.0
C,14184,1330,15514,91.43,8.57
E,20,2,22,90.91,9.09


Once again, some categories have a very small sample size: only 25 individuals in the 'A' income category (as expected) and 22 in category 'E'.

**Conclusion:** In this case, we also observe a dependency, although not exactly as anticipated. Individuals in the lowest income category 'E' have the lowest repayment rate (about 91%), although this group is very small. Among the large categories, 'B' (93.0%) repays more reliably than 'C' (91.4%). Interestingly, category 'D' has the highest repayment rate (94.0%) even though it is the second-lowest income band.

### Dependency between loan purposes and timely loan repayment

Let's check what we have for the last hypothesis.

In [19]:
display(fun_get_summary(df, 'purpose_category'))

debt,people_without_debt_no,people_with_debt_no,people_total,people_without_debt_per,people_with_debt_per
purpose_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
property,10028,782,10810,92.77,7.23
wedding,2138,186,2324,92.0,8.0
education,3643,370,4013,90.78,9.22
car,3903,403,4306,90.64,9.36


**Conclusion:** This is an unexpected result. It appears that loans for real estate operations and wedding expenses have a higher repayment rate (92.8% and 92.0%) than loans for education and car-related operations (90.8% and 90.6%). Although the difference is below our threshold (only around 2 percentage points), it is noticeable. It is also interesting that the categories fall into two pairs with similar rates: property and wedding on one side, education and car on the other.
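
As a cross-check on the 3-point rule of thumb, the debt shares of two categories can be compared with a two-proportion z-test. This is a sketch of a standard test that was not part of the original analysis; the function name is hypothetical, and the counts are copied from the table above:

```python
import math


def two_proportion_z_test(events_a: int, total_a: int, events_b: int, total_b: int):
    """Two-sided z-test for the difference between two proportions."""
    p_a = events_a / total_a
    p_b = events_b / total_b
    # Pooled proportion under the null hypothesis of equal rates.
    p_pool = (events_a + events_b) / (total_a + total_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts of clients with debt and group totals, copied from the table above.
z, p_value = two_proportion_z_test(782, 10810, 403, 4306)  # property vs car
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

With groups of several thousand clients, a gap of about two percentage points can be statistically detectable even though it falls under a fixed 3-point threshold, so the two criteria need not agree.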

## Conclusions

In this project, I have completed the entire data analysis cycle, from data loading to hypothesis testing. In the first part, I performed data cleaning tasks, including standardizing the data, handling missing values, removing duplicates, and creating distinct categories. This was the most time-consuming aspect of the project. In the second part, I conducted a relatively straightforward analysis and compared the probability of timely loan repayment across different data segments.

Ultimately, the following conclusions can be drawn from the analysis:

- The number of children has a weak influence on the probability of loan repayment, with a slight trend indicating that a higher number of children corresponds to a lower repayment probability. This can be attributed to the additional financial responsibilities associated with raising children.

- Marital status has the most pronounced impact: individuals who are or have been in a registered marriage (married, divorced, or widowed) are more likely to repay their loans on time than those in a civil partnership or never married. This may be attributed to the increased sense of financial stability and commitment that comes with being in a formal relationship.

- Income level also plays a role in loan repayment probability, with the lowest income category showing the lowest likelihood of repayment. However, category 'D' stands out as the most reliable, and the income groups are very uneven in size, so these differences should be read with caution.

- Loan purposes also influence the repayment probability. Surprisingly, loans taken for wedding expenses tend to have a higher repayment rate compared to loans for education. This could be attributed to the shared responsibility and commitment involved in wedding-related expenses, whereas education loans are typically taken by individuals.

By analyzing these factors, we gain insights into the dependencies between different variables and their impact on the likelihood of timely loan repayment. These findings can be valuable for financial institutions and lenders in assessing loan applications and managing credit risk.