# GROUP PROJECT: LOAN APPROVAL PREDICTION 

## Members:
- Manuela N
- Daniela S
- Samir MS
- Ibrahima T
- Adam C

## PROJECT STEPS
1. Data importation and exploration
2. Data cleaning
3. Data visualisation
4. Data encoding
5. Selection of sensitive data and fairness metrics
6. Training of the model
7. Evaluation of the model
8. Fairness evaluation
9. Bias mitigation
10. Conclusion

## DATA DESCRIPTION
- Demographics:
  - Gender (Female, Male)
  - Married (Yes, No)
  - Dependents (0, 1, 2, 3+)
  - Education (Graduate, Not Graduate)
  - Self_Employed (No, Yes)
  - Property_Area (Urban, Rural, Semiurban)
- Financial information:
  - ApplicantIncome
  - CoapplicantIncome
  - Credit_History
- Loan details:
  - LoanAmount
  - Loan_Amount_Term
- Target variable: Loan_Status (Y, N)

# IMPORTING LIBRARIES

In [1]:
# Remove warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Generic
import pandas as pd
import numpy as np
from copy import copy

# Dataset
from ucimlrepo import fetch_ucirepo

# Visualisation
import plotly.express as px
import plotly.graph_objs as go
import plotly.io as pio
from plotly.subplots import make_subplots

# ML
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Fairness
import shap

import dalex as dx
from dalex.fairness import resample
from dalex.fairness import roc_pivot

from fairlearn.reductions import ExponentiatedGradient
from fairlearn.reductions import TruePositiveRateParity, DemographicParity, EqualizedOdds
from fairlearn.postprocessing import ThresholdOptimizer, plot_threshold_optimizer

# DATA EXPLORATION

In [2]:
# Load the dataset
data = pd.read_csv('project-data.csv')
data.head(4)

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |


In [3]:
# Summary statistics of the numeric variables
data.describe()

| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
|---|---|---|---|---|---|
| count | 614.0 | 614.0 | 592.0 | 600.0 | 564.0 |
| mean | 5403.459283 | 1621.245798 | 146.412162 | 342.0 | 0.842199 |
| std | 6109.041673 | 2926.248369 | 85.587325 | 65.12041 | 0.364878 |
| min | 150.0 | 0.0 | 9.0 | 12.0 | 0.0 |
| 25% | 2877.5 | 0.0 | 100.0 | 360.0 | 1.0 |
| 50% | 3812.5 | 1188.5 | 128.0 | 360.0 | 1.0 |
| 75% | 5795.0 | 2297.25 | 168.0 | 360.0 | 1.0 |
| max | 81000.0 | 41667.0 | 700.0 | 480.0 | 1.0 |


In [4]:
# Information about the columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [5]:
# Check for missing values
data.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
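
The counts above can also be read as shares of the 614 rows: `Credit_History`, for instance, is missing in 50 rows, roughly 8% of the data. A minimal sketch of such a summary follows. The helper `missing_summary` and the toy frame are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Most affected columns first
    return summary.sort_values("missing", ascending=False)


# Toy frame with hypothetical values (not the project data)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [np.nan, 128.0, 66.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, the same call would be `missing_summary(data)`.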

# DATA CLEANING

In [6]:
# Drop rows that contain at least one missing value
data.dropna(inplace=True)
data.isna().sum().sum()

0

### After dropping incomplete rows, we will work with a dataset of 480 entries
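
Note that `dropna` removes a row as soon as any one of its columns is missing, and a single row can be missing several values, so the per-column counts cannot simply be added up to predict how many rows are lost. The sketch below assumes a hypothetical helper `drop_incomplete_rows` and a toy frame rather than the project data. It returns a cleaned copy together with the number of rows removed, instead of modifying the frame in place.

```python
import numpy as np
import pandas as pd


def drop_incomplete_rows(df: pd.DataFrame):
    """Return a copy without incomplete rows and the number of rows removed."""
    cleaned = df.dropna()
    return cleaned, len(df) - len(cleaned)


# Toy frame with hypothetical values (not the project data)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
})
cleaned, removed = drop_incomplete_rows(toy)
print(f"kept {len(cleaned)} of {len(toy)} rows, removed {removed}")
```

Keeping the original frame untouched makes it easy to compare the distributions before and after cleaning.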

# DATA VISUALISATION

In [7]:
# Function to generate visualizations for categorical variables
def plot_categorical_variable(df, column):
    # Pie chart
    counts = df[column].value_counts()
    fig_pie = px.pie(
        names=counts.index, 
        values=counts.values, 
        title=f'Distribution of {column}',
        hole=0.3,
        color_discrete_sequence=px.colors.qualitative.Pastel
    )
    fig_pie.update_traces(textinfo='percent+label')
    
    return fig_pie

# Generate visualizations for each categorical variable
categorical_vars = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
visualizations = {var: plot_categorical_variable(data, var) for var in categorical_vars}

# Function to analyze Loan_Status based on other variables
def plot_loan_status_breakdown(df, vars_to_analyze):
    # Create a grid of subplots
    fig = make_subplots(
        rows=len(vars_to_analyze), 
        cols=1, 
        subplot_titles=[f'Distribution of Loan Status by {var}' for var in vars_to_analyze],
        vertical_spacing=0.1
    )
    
    # Colors for Loan_Status
    colors = {'Y': 'green', 'N': 'red'}
    
    # For each variable
    for i, var in enumerate(vars_to_analyze, 1):
        # Group data by variable and loan status
        grouped = df.groupby([var, 'Loan_Status']).size().unstack(fill_value=0)
        
        # Create stacked bar charts
        for status in ['Y', 'N']:
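            trace = go.Bar(
                x=grouped.index, 
                y=grouped[status], 
                name=f'Loan Status {status}', 
                marker_color=colors[status],
                text=grouped[status],
                textposition='inside',
                # Show one legend entry per status instead of one per subplot
                legendgroup=status,
                showlegend=(i == 1)
            )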
            fig.add_trace(trace, row=i, col=1)
        
        # Update axes layout
        fig.update_xaxes(title_text=var, row=i, col=1)
        fig.update_yaxes(title_text='Number of Loans', row=i, col=1)
    
    # Update layout
    fig.update_layout(
        height=300 * len(vars_to_analyze), 
        title_text='Loan Status Breakdown by Variables',
        barmode='stack'
    )
    
    return fig

# Variables to analyze
vars_to_analyze = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']

# Generate the loan status breakdown chart
loan_status_breakdown = plot_loan_status_breakdown(data, vars_to_analyze)

print("### Detailed Statistics ###")
for var in categorical_vars:
    print(f"\n{var} :")
    print(data[var].value_counts())
    print(data[var].value_counts(normalize=True).map('{:.2%}'.format))
    
    # Display pie chart visualization
    print(f"\nVisualization of {var} :")
    visualizations[var].show()

# Display loan status breakdown chart
print("\nLoan Status Breakdown by Variables:")
loan_status_breakdown.show()


### Detailed Statistics ###

Gender :
Gender
Male      394
Female     86
Name: count, dtype: int64
Gender
Male      82.08%
Female    17.92%
Name: proportion, dtype: object

Visualization of Gender :



Married :
Married
Yes    311
No     169
Name: count, dtype: int64
Married
Yes    64.79%
No     35.21%
Name: proportion, dtype: object

Visualization of Married :



Dependents :
Dependents
0     274
2      85
1      80
3+     41
Name: count, dtype: int64
Dependents
0     57.08%
2     17.71%
1     16.67%
3+     8.54%
Name: proportion, dtype: object

Visualization of Dependents :



Education :
Education
Graduate        383
Not Graduate     97
Name: count, dtype: int64
Education
Graduate        79.79%
Not Graduate    20.21%
Name: proportion, dtype: object

Visualization of Education :



Self_Employed :
Self_Employed
No     414
Yes     66
Name: count, dtype: int64
Self_Employed
No     86.25%
Yes    13.75%
Name: proportion, dtype: object

Visualization of Self_Employed :



Property_Area :
Property_Area
Semiurban    191
Urban        150
Rural        139
Name: count, dtype: int64
Property_Area
Semiurban    39.79%
Urban        31.25%
Rural        28.96%
Name: proportion, dtype: object

Visualization of Property_Area :



Loan_Status :
Loan_Status
Y    332
N    148
Name: count, dtype: int64
Loan_Status
Y    69.17%
N    30.83%
Name: proportion, dtype: object

Visualization of Loan_Status :



Loan Status Breakdown by Variables:


## Analysis for Gender
In our dataset, 82.08% of applicants are male and 17.92% are female.

Among female applicants, 62.79% had their loan approved.

Among male applicants, 70.55% had their loan approved.
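
The per-category rates quoted in these analysis sections could be reproduced along the following lines. This is a sketch, not necessarily how the figures were obtained: `approval_rate_by_group` is a hypothetical helper, and the toy frame uses counts reconstructed from the gender figures (54 of 86 women and 278 of 394 men with a positive `Loan_Status`). Results are rounded to two decimals, so the last digit can differ by 0.01 from the figures in the text.

```python
import pandas as pd


def approval_rate_by_group(df: pd.DataFrame, column: str) -> pd.Series:
    """Percentage of rows with Loan_Status == 'Y' within each category of `column`."""
    approved = df["Loan_Status"].eq("Y")
    return approved.groupby(df[column]).mean().mul(100).round(2)


# Assumption: counts reconstructed from the reported gender percentages
toy = pd.DataFrame({
    "Gender": ["Female"] * 86 + ["Male"] * 394,
    "Loan_Status": ["Y"] * 54 + ["N"] * 32 + ["Y"] * 278 + ["N"] * 116,
})
rates = approval_rate_by_group(toy, "Gender")
print(rates)
```

The same helper can be applied to each name in `vars_to_analyze` on the cleaned `data` frame.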

## Analysis for Married
In our dataset, 64.79% of applicants are married and 35.21% are not.

Among married applicants, 72.99% had their loan approved.

Among unmarried applicants, 62.13% had their loan approved.

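## Analysis for Dependents
In our dataset, 57.08% of applicants have 0 dependents, 17.71% have 2, 16.67% have 1 and 8.54% have 3+.

Among applicants with 0 dependents, 68.25% had their loan approved (187 of 274, consistent with the 332 approved loans overall).

Among applicants with 1 dependent, 65% had their loan approved.

Among applicants with 2 dependents, 76.47% had their loan approved.

Among applicants with 3+ dependents, 68.29% had their loan approved.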

## Analysis for Education
In our dataset, 79.79% of applicants are graduates and 20.21% are not.

Among graduates, 70.75% had their loan approved.

Among non-graduates, 62.88% had their loan approved.

## Analysis for Self_Employed
In our dataset, 13.75% of applicants are self-employed and 86.25% are not.

Among self-employed applicants, 65.15% had their loan approved.

Among applicants who are not self-employed, 69.80% had their loan approved.

## Analysis for Property_Area
In our dataset, 39.79% of properties are in a Semiurban area, 31.25% in an Urban area and 28.96% in a Rural area.

For Semiurban properties, 78.01% of loans were approved.

For Urban properties, 65.33% of loans were approved.

For Rural properties, 61.15% of loans were approved.

## Analysis for Loan_Status
In our dataset, 69.17% of loans were approved (Y) and 30.83% were not (N).

### The target classes are therefore imbalanced
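
One practical consequence of this imbalance, offered here as a hedged illustration: a model that always predicts the majority class already reaches an accuracy equal to that class's share, so accuracy alone can look better than it is. The sketch below uses the class counts printed in the detailed statistics (332 `Y`, 148 `N`); `majority_baseline` is a hypothetical helper, not part of the project code.

```python
import pandas as pd


def majority_baseline(target: pd.Series):
    """Return the most frequent class and the accuracy of always predicting it."""
    shares = target.value_counts(normalize=True)
    return shares.idxmax(), float(shares.max())


# Class counts as printed in the detailed statistics: 332 'Y' and 148 'N'
loan_status = pd.Series(["Y"] * 332 + ["N"] * 148, name="Loan_Status")
label, accuracy = majority_baseline(loan_status)
print(f"Always predicting '{label}' is right {accuracy:.2%} of the time")
```

A trained model should beat this baseline before its accuracy is taken as meaningful.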

# DATA ENCODING