# Loan Prediction

### Developed by:

1. Tiago Pinheiro - 202205295
2. Tiago Rocha    - 202005428
3. Vasco Melo     - 202207564

### The problem

This project's goal is to predict whether an applicant is approved for a loan.

### The dataset 

To accomplish our goal, we used the Loan Approval Prediction dataset from Kaggle. It contains about 32600 entries, each with 12 attributes, of which 8 are numeric and 4 are categorical. The attributes are as follows:
- person_age:
    - Age of the loan applicant in years. 
    - If the applicant's age is at either extreme, too old or too young, their loan is more likely to be refused.
- person_income:
    - Annual income of the applicant in currency units. 
    - A higher income strongly correlates with a loan being approved, as the applicant will have a higher repayment capacity.
- person_home_ownership:
    - Housing status of applicant, categorized with four different options, those being MORTGAGE, RENT, OWN and OTHER. 
    - Home ownership can directly correlate to the financial stability of the applicant while also providing potential collateral, thus facilitating a loan's approval.
- person_emp_length:
    - Number of years the applicant has been employed at their current job. 
    - As with housing status, employment length can reflect the applicant's income and financial stability: the longer the tenure, the more stable their finances are likely to be.
- loan_intent:
    - The stated purpose for the loan, categorized with six different options, those being VENTURE, EDUCATION, DEBTCONSOLIDATION, HOMEIMPROVEMENT, MEDICAL and PERSONAL. 
    - Loan purpose affects risk assessment: for example, education or home improvement is likely to lead to a higher earning capacity or asset value, while venture or personal loans are riskier and more prone to repayment failure.
- loan_grade: 
    - The credit quality grade assigned to the loan, ranging from A to G, best to worst.
    - Loan grade acts as a proxy for approval likelihood, representing the lender's internal credit risk assessment. The better the grade, the lower the interest rate and the higher the approval rate, and vice versa.
- loan_amnt:
    - The requested loan amount.
    - Larger loan amounts represent higher absolute risk for lenders. As a rule, the higher the loan amount, the stricter the approval criteria, requiring stronger compensating factors such as higher income and a better credit history.
- loan_int_rate:
    - The annual interest rate charged on the loan.
    - Interest rates reflect risk assessment, as higher rates likely indicate higher perceived risk.
- loan_status:
    - The target variable, 1 being approved and 0 not approved.
    - This is the outcome variable the model will predict.
- loan_percent_income:
    - The percentage of applicant's income represented by the loan payment.
    - This is a critical debt-to-income component, as higher percentages represent greater financial strain. Values of 50% and above face significantly higher rejection rates.
- cb_person_default_on_file: 
    - Credit bureau record of whether the person has defaulted before, 'Y' for yes and 'N' for no.
    - Previous defaults dramatically reduce approval chances, as they are strong negative indicators of repayment capability.
- cb_person_cred_hist_length:
    - Length of the person's credit history in years.
    - Longer credit histories allow better risk assessment and generally improve approval chances.
    
### The solution

To solve this problem, we used a supervised learning model trained on the Kaggle dataset. The model's performance was measured using accuracy, the percentage of correct predictions out of all predictions made by the model. In other words, it shows how often the model correctly classified whether a loan was approved or not.
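
As a minimal sketch of how the accuracy metric is computed, the snippet below uses a handful of hypothetical labels (not taken from the dataset) and the variable names `y_true_demo` and `y_pred_demo`, which are illustrative only.

```python
import numpy as np

# Hypothetical labels for illustration only (1 = approved, 0 = not approved)
y_true_demo = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Accuracy = number of correct predictions / total number of predictions
correct = int((y_true_demo == y_pred_demo).sum())
demo_accuracy = correct / len(y_true_demo)

print(f"{correct} of {len(y_true_demo)} predictions correct, accuracy = {demo_accuracy:.2f}")
```

Note that accuracy alone can look favorable on imbalanced targets, so it is worth reading alongside the class distribution.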



_____________________________________________________________

### Pre analysis
To start, we imported the pandas library and inspected the dataset to get a better understanding of the data and to find possible outliers.

In [None]:
# Load the dataset
import pandas as pd


dataset = pd.read_csv('data/credit_risk_dataset.csv')

print(dataset.describe())

##### Data Cleaning
Person age
- Some entries list ages above 120 years, which we treated as invalid.

Person employment length
- An applicant cannot have been employed for longer than they have been alive.

Note: a total of 6 rows were removed in this step.

In [None]:
removed_age_entries = dataset[dataset['person_age'] > 120]
print("Entries with person_age > 120:")
print(removed_age_entries)

# Find entries where person_emp_length > person_age
removed_emp_length_entries = dataset[dataset['person_emp_length'] > dataset['person_age']]
print("\nEntries with person_emp_length > person_age:")
print(removed_emp_length_entries)

# Combine all removed entries for reference
all_removed_entries = pd.concat([removed_age_entries, removed_emp_length_entries]).drop_duplicates()
print("\nAll entries to be removed:")
print(all_removed_entries)

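# Remove invalid entries from the dataset.
# Negated masks are used because comparisons with NaN evaluate to False:
# a "<=" filter would also drop rows with a missing person_emp_length,
# which are handled in the missing-value step instead.
dataset = dataset[~(dataset['person_age'] > 120)]
dataset = dataset[~(dataset['person_emp_length'] > dataset['person_age'])]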

# Display the updated dataset
print("\nDataset after removing invalid entries:")
print(dataset.describe())

# Display total rows removed
print(f"\nTotal rows removed: {len(all_removed_entries)}")
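
When filtering on a comparison between columns, it helps to remember that pandas treats any comparison with a missing value as False. The sketch below illustrates this on a hypothetical four-row frame (the values and the names `toy`, `kept_with_le`, and `kept_with_mask` are made up for illustration): a `<=` filter discards the row with a missing employment length, while negating an "invalid" mask keeps it.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the second row has a missing employment length
toy = pd.DataFrame({
    'person_age': [25, 30, 22, 144],
    'person_emp_length': [3.0, np.nan, 123.0, 4.0],
})

# Rows that are provably invalid: implausible age, or employed longer than alive
invalid = (toy['person_age'] > 120) | (toy['person_emp_length'] > toy['person_age'])

# A "<=" filter also drops the NaN row, because NaN <= 30 evaluates to False
kept_with_le = toy[(toy['person_age'] <= 120)
                   & (toy['person_emp_length'] <= toy['person_age'])]

# Negating the invalid mask keeps the NaN row for a later missing-value step
kept_with_mask = toy[~invalid]

print(f"Rows kept with '<=' filter: {len(kept_with_le)}")
print(f"Rows kept with negated mask: {len(kept_with_mask)}")
```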


To complete the cleaning, we removed any incomplete rows.

Note: a total of 3943 rows were removed in this step.

In [None]:
# Find incomplete data (missing values)
print("Incomplete data (missing values) in the dataset:")

# Check for missing values in each column
missing_data = dataset.isnull().sum()

# Display columns with missing values
missing_data = missing_data[missing_data > 0]
if not missing_data.empty:
    print(missing_data)
else:
    print("No missing values found in the dataset.")

In [None]:
rows_before = len(dataset)

# Remove rows with missing values in the dataset
dataset = dataset.dropna()

# Print how many rows were removed
rows_after = len(dataset)
rows_removed = rows_before - rows_after
print(f"\nTotal rows removed due to missing values: {rows_removed}")


# Verify that there are no more missing values
print("Dataset after removing rows with missing values:")
print(dataset.isnull().sum())

##### Data Encoding
After cleaning the dataset, we converted the categorical (non-numeric) columns into numerical format, as most machine learning algorithms require numerical input. This process, known as encoding, lets the model interpret qualitative information such as loan intent, loan grade, or home ownership. Each category was mapped to an integer code; for categories with a meaningful order, such as loan_grade, the codes follow that order (A = 0 through G = 6).

In [None]:
# Map to convert 'person_home_ownership' to numeric values
home_ownership_map = {
    'MORTGAGE': 0,
    'RENT': 1,
    'OWN': 2,
    'OTHER': 3
}

# Apply the mapping to the 'person_home_ownership' column
dataset['person_home_ownership'] = dataset['person_home_ownership'].map(home_ownership_map)

# Verify the transformation
print("Transformed 'person_home_ownership' column:")
print(dataset['person_home_ownership'].head())

In [None]:
# Map to convert 'loan_intent' to numeric values
loan_intent_map = {
    'VENTURE': 0,
    'EDUCATION': 1,
    'DEBTCONSOLIDATION': 2,
    'HOMEIMPROVEMENT': 3,
    'MEDICAL': 4,
    'PERSONAL': 5
}

# Apply the mapping to the 'loan_intent' column
dataset['loan_intent'] = dataset['loan_intent'].map(loan_intent_map)

# Verify the transformation
print("Transformed 'loan_intent' column:")
print(dataset['loan_intent'].head())

In [None]:
# Map to convert 'loan_grade' to numeric values
loan_grade_map = {
    'A': 0,
    'B': 1,
    'C': 2,
    'D': 3,
    'E': 4,
    'F': 5,
    'G': 6
}

# Apply the mapping to the 'loan_grade' column
dataset['loan_grade'] = dataset['loan_grade'].map(loan_grade_map)

# Verify the transformation
print("Transformed 'loan_grade' column:")
print(dataset['loan_grade'].head())

In [None]:
# Map to convert 'cb_person_default_on_file' to numeric values
cb_person_default_map = {
    'Y': 1,
    'N': 0
}

# Apply the mapping to the 'cb_person_default_on_file' column
dataset['cb_person_default_on_file'] = dataset['cb_person_default_on_file'].map(cb_person_default_map)

# Verify the transformation
print("Transformed 'cb_person_default_on_file' column:")
print(dataset['cb_person_default_on_file'].head())
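
Integer codes impose an order on categories that have none, such as home ownership. Tree-based models usually tolerate this, but as an alternative, the sketch below shows one-hot encoding with `pd.get_dummies` next to an ordinal mapping for `loan_grade`. It is not the approach used in this notebook; the rows in `sample` are hypothetical and only reuse the category labels described earlier.

```python
import pandas as pd

# Hypothetical sample rows using category labels from the dataset description
sample = pd.DataFrame({
    'person_home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'OTHER'],
    'loan_grade': ['A', 'C', 'B', 'G'],
})

# Ordinal mapping suits loan_grade, which has a natural order (A best, G worst)
grade_order = {grade: rank for rank, grade in enumerate('ABCDEFG')}
sample['loan_grade'] = sample['loan_grade'].map(grade_order)

# One-hot encoding suits person_home_ownership, which has no natural order:
# each category becomes its own 0/1 column
encoded = pd.get_dummies(sample, columns=['person_home_ownership'], dtype=int)

print(encoded)
```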

##### Data Analysis

After completing the data preprocessing steps, we conducted an exploratory data analysis to understand the relationships between various features and the loan approval status. This analysis aimed to identify patterns and correlations that could inform our predictive modeling.

Key observations:

- loan_int_rate: higher interest rates are more commonly associated with approved loans. Lenders may be more inclined to approve loans with higher interest rates as they offer greater returns, potentially offsetting the risk associated with the borrower.
- loan_percent_income: loans constituting a higher percentage of the borrower's income tend to have higher approval rates. This could indicate that lenders are willing to approve loans that represent a significant portion of the borrower's income, possibly due to confidence in the borrower's repayment capacity or other compensating factors.

##### Feature Importance Analysis:
To quantitatively assess the impact of each feature on loan approval, we employed a Decision Tree Classifier to evaluate feature importance. The results indicated that:

High Importance Features: loan_int_rate, loan_percent_income, and person_income emerged as the most influential predictors.

Low Importance Features: person_home_ownership and loan_grade showed minimal impact on the model's predictive power.

These findings align with the observations from our exploratory analysis, reinforcing the significance of financial metrics over demographic factors in loan approval decisions.
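
A minimal sketch of how feature importances can be read from a fitted Decision Tree follows. It runs on synthetic stand-in data rather than the loan dataset: the column names mirror three of the real features, but the values, the target rule, and the names `X_demo`, `y_demo`, and `demo_tree` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: column names mirror the dataset, values are random
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    'loan_int_rate': rng.uniform(5, 23, n),
    'loan_percent_income': rng.uniform(0, 0.8, n),
    'person_age': rng.integers(20, 70, n),
})

# Synthetic target that depends only on the first two columns
y_demo = ((X_demo['loan_int_rate'] > 14) | (X_demo['loan_percent_income'] > 0.4)).astype(int)

demo_tree = DecisionTreeClassifier(random_state=1)
demo_tree.fit(X_demo, y_demo)

# feature_importances_ sums to 1; a larger value means the feature accounted
# for more of the impurity reduction across the tree's splits
importances = pd.Series(
    demo_tree.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances)
```

On the real data, the same pattern would apply to the encoded feature columns with `loan_status` as the target.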



In [None]:
import seaborn as sb
import matplotlib.pyplot as plt

# Plot every column except an 'id' column, if one is present
columns_to_plot = [col for col in dataset.columns if col != 'id']

sb.pairplot(dataset[columns_to_plot].dropna(), hue='loan_status')
plt.show()

The scatter plot of loan_percent_income and loan_int_rate is the most effective for explaining loan approval decisions. This plot reveals a clear separation between approved and denied applications, forming visible clusters that reflect different approval patterns. It visually captures the combined influence of how much of a borrower's income is allocated to the loan and the interest rate they are offered, making it an ideal representation for identifying trends and building intuitive decision boundaries.

______________________________________________________________________


To further explore the relationship between each feature and the loan status, we used violin plots. These plots display the distribution and density of values for each feature, grouped by loan status. This visualization helps highlight how features like income, interest rate, or credit history length vary with the outcome. By organizing the plots in a grid, we can efficiently compare these distributions across multiple variables.

In [None]:
plt.figure(figsize=(15, 15))

columns_to_plot = [col for col in dataset.columns if col != 'loan_status']

num_columns = len(columns_to_plot)
rows = (num_columns + 1) // 2  

for column_index, column in enumerate(columns_to_plot):
    plt.subplot(rows, 2, column_index + 1) 
    sb.violinplot(x='loan_status', y=column, data=dataset)

plt.tight_layout()  
plt.show()
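
A numeric summary can complement the violin plots. The sketch below groups a small hypothetical frame by `loan_status` and reports the median and mean of each feature; the values and the names `demo` and `summary` are invented for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration; column names mirror the dataset
demo = pd.DataFrame({
    'loan_status': [0, 0, 0, 1, 1, 1],
    'loan_int_rate': [7.5, 9.0, 10.5, 13.0, 15.5, 16.0],
    'loan_percent_income': [0.10, 0.15, 0.20, 0.30, 0.35, 0.45],
})

# Median and mean of every feature, computed separately for each loan_status value
summary = demo.groupby('loan_status').agg(['median', 'mean'])

print(summary)
```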

#### Some conclusions from the graphs

🔹 person_age: 
Older individuals are less likely to have their loan approved, as lenders might consider life expectancy and financial independence when assessing the likelihood of full repayment over the loan term.

🔹 person_income: 
Applicants with higher incomes are more likely to be approved because they demonstrate a stronger ability to repay the loan without financial strain.

🔹 person_home_ownership: 
Owning a home can increase approval chances, as it indicates financial stability and may provide collateral, reducing the lender's risk.

🔹 person_emp_length: 
Longer employment history is typically viewed positively, as it suggests job stability and a consistent income source, which are important for loan repayment.

🔹 loan_intent: 
The purpose of the loan can influence approval, as lenders may consider some intents (like medical or personal expenses) riskier than others (like home improvement or education).

🔹 loan_grade: 
Loan grade reflects the applicant’s creditworthiness; lower grades are associated with higher risk and therefore a greater likelihood of rejection.

🔹 loan_amnt: 
Larger loan amounts may reduce the chances of approval, since they represent a greater financial risk for the lender if the borrower defaults.

🔹 loan_int_rate: 
Higher interest rates may increase the likelihood of loan approval, as lenders are more willing to take on higher-risk borrowers if they are compensated with greater returns.

🔹 loan_percent_income: 
Higher loan-to-income ratios are associated with higher approval rates, possibly indicating that lenders are more flexible when the borrower is willing to commit a larger portion of their income to repayment.

🔹 cb_person_default_on_file: 
Applicants with a history of default are much less likely to be approved, as past defaults are strong indicators of future risk and potential non-payment.

🔹 cb_person_cred_hist_length: 
A longer credit history gives lenders more information to evaluate credit behavior, which can increase the chances of approval due to a more established financial track record.

##### Data Splitting
We split the dataset into training and testing sets using stratified sampling, so that the proportion of approved and denied loans stays the same in both sets. This keeps the target variable balanced across the split and avoids bias during model training and evaluation. Holding out a test set mirrors real-world use, where a deployed model makes predictions on data it has not seen before. Because the test data simulates such unseen cases, keeping it separate lets us fairly evaluate the model's generalization ability. Stratified sampling also avoids distortions in class distribution, which could otherwise lead to misleading performance metrics or poorly trained models.



In [None]:
# Split the dataset into training and testing sets with the same distribution of loan_status
from sklearn.model_selection import train_test_split

# Perform stratified sampling based on 'loan_status'
train_dataset, test_dataset = train_test_split(
    dataset, 
    test_size=0.25,  # 25% for testing
    random_state=1,  # For reproducibility
    stratify=dataset['loan_status']  # Maintain the same distribution of 'loan_status'
)

original_percentage = (dataset['loan_status'].value_counts(normalize=True) * 100).loc[1]
print("Original dataset distribution:")
print(f"Percentage of 1 in original dataset: {original_percentage:.2f}%")

# Print the percentage of 1 in the 'loan_status' column for the training dataset
train_percentage = (train_dataset['loan_status'].value_counts(normalize=True) * 100).loc[1]
print(f"Percentage of 1 in training dataset: {train_percentage:.2f}%")

# Print the percentage of 1 in the 'loan_status' column for the testing dataset
test_percentage = (test_dataset['loan_status'].value_counts(normalize=True) * 100).loc[1]
print(f"Percentage of 1 in testing dataset: {test_percentage:.2f}%")

In [None]:
# Save the training and testing datasets to CSV files
train_dataset.to_csv('data/train.csv', index=False)
test_dataset.to_csv('data/test.csv', index=False)

print("Training dataset saved as 'data/train.csv'.")
print("Testing dataset saved as 'data/test.csv'.")

##### Model Training and Evaluation
We trained a Decision Tree classifier to predict the loan status, using the cleaned and encoded dataset. The model was trained on the training set and evaluated on the test set, achieving an accuracy of 91%.

In [None]:
# Train a Decision Tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Separate features (X) and target (y) for training and testing datasets
X_train = train_dataset.drop(columns=['loan_status'])
y_train = train_dataset['loan_status']
X_test = test_dataset.drop(columns=['loan_status'])
y_test = test_dataset['loan_status']

# Initialize the Decision Tree Classifier
model = DecisionTreeClassifier(random_state=1)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test dataset
y_pred = model.predict(X_test)

# Calculate and display the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Decision Tree model: {accuracy:.2f}")
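
Accuracy summarizes performance in a single number, so a per-class breakdown can be a useful companion. The sketch below is not part of the original evaluation: it uses hypothetical labels (`y_true_demo`, `y_hat_demo`) to show how a confusion matrix, precision, and recall can be obtained with scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only (1 = positive class, 0 = negative class)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_demo, y_hat_demo)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true_demo, y_hat_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_hat_demo)        # TP / (TP + FN)

print(cm)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

In the notebook, the same calls would take `y_test` and `y_pred` from the cell above.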

To assess the stability of the Decision Tree model, we ran it 1000 times using different train-test splits while preserving the class distribution. We recorded the accuracy for each run and visualized the distribution with a histogram, allowing us to observe the consistency and variability in model performance.

In [None]:
# Run the Decision Tree model 1000 times with different splits and display a histogram of accuracies
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

# Store accuracies for each run
accuracies = []

# Run the model 1000 times
for i in range(1000):
    # Split the dataset into training and testing sets with stratified sampling
    train_dataset, test_dataset = train_test_split(
        dataset,
        test_size=0.25,  # 25% for testing
        random_state=i,  # Change random state for each iteration
        stratify=dataset['loan_status']  # Maintain the same distribution of 'loan_status'
    )
    
    # Separate features (X) and target (y) for training and testing datasets
    X_train = train_dataset.drop(columns=['loan_status'])
    y_train = train_dataset['loan_status']
    X_test = test_dataset.drop(columns=['loan_status'])
    y_test = test_dataset['loan_status']
    
    # Initialize the Decision Tree Classifier
    model = DecisionTreeClassifier(random_state=i)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions on the test dataset
    y_pred = model.predict(X_test)
    
    # Calculate and store the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Calculate and display the average accuracy over 1000 runs
average_accuracy = np.mean(accuracies)
print(f"Average accuracy over 1000 runs: {average_accuracy:.2f}")

# Plot a histogram of the accuracies
plt.hist(accuracies, bins=20, edgecolor='black')
plt.title('Histogram of Model Accuracies')
plt.xlabel('Accuracy')
plt.ylabel('Frequency')
plt.show()
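
The repeated-split procedure can also be written more compactly with scikit-learn's `StratifiedShuffleSplit` and `cross_val_score`. The sketch below is an alternative formulation, not the code used above. It runs on synthetic imbalanced data from `make_classification`, uses only 20 splits to stay quick, and keeps the classifier's `random_state` fixed rather than varying it per run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)

# 20 stratified random splits, each holding out 25% of the rows for testing
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=0)

# One accuracy score per split
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X_demo, y_demo,
    cv=splitter, scoring='accuracy',
)

print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean gives a compact measure of the variability that the histogram shows visually.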

##### Data Preparation
To prepare the data for analysis, we selected a subset of relevant features from the dataset, including demographic, financial, and loan-related variables. These features were extracted into a matrix of inputs, while the loan status was stored separately as the target variable. 

In [None]:
# Prepare the data for analysis
credit_risk_dataset = pd.read_csv('data/credit_risk_dataset.csv')
all_inputs = credit_risk_dataset[
    [
        'person_age',
        'person_income',
        'person_home_ownership',
        'person_emp_length',
        'loan_intent',
        'loan_grade',
        'loan_amnt',
        'loan_int_rate',
        'loan_percent_income',
        'cb_person_cred_hist_length'
    ]
].values

loan_status = credit_risk_dataset['loan_status'].values
all_inputs[:5]
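
One thing to keep in mind when extracting `.values` from a frame that mixes numeric and text columns, as the raw CSV does, is that the resulting NumPy array has dtype `object`. A minimal sketch on a hypothetical two-row frame (the names `demo_frame`, `mixed`, and `numeric_only` are illustrative):

```python
import pandas as pd

# Hypothetical two-row frame mixing a numeric and a text column
demo_frame = pd.DataFrame({
    'person_age': [22, 35],
    'loan_grade': ['B', 'A'],
})

# Mixed column types fall back to a generic object array
mixed = demo_frame[['person_age', 'loan_grade']].values

# A purely numeric selection keeps a numeric dtype
numeric_only = demo_frame[['person_age']].values

print(mixed.dtype)
print(numeric_only.dtype)
```

Most scikit-learn estimators expect numeric input, so text columns would need encoding before such a matrix is passed to a model.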