# Loan Approval Classification

The dataset chosen for this MLP implementation can be found [here](https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data).

In [3]:
import os
import shutil
import subprocess
import sys

def download_kaggle_dataset(dataset_slug: str, target_dir: str):
    """
    Downloads a dataset from Kaggle to the target directory.
    
    Args:
        dataset_slug: e.g. "taweilo/loan-approval-classification-data"
        target_dir: local path to save dataset
    """
    # Ensure the kaggle CLI is on PATH. The download below calls the executable,
    # so check for it directly instead of `import kaggle`, which can fail for
    # reasons other than a missing package (e.g. missing API credentials).
    if shutil.which("kaggle") is None:
        print("kaggle CLI not found. Installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "kaggle"])
    
    # Make sure the target directory exists
    os.makedirs(target_dir, exist_ok=True)
    
    # Use kaggle API to download
    cmd = [
        "kaggle", "datasets", "download",
        "-d", dataset_slug,
        "-p", target_dir,
        "--unzip"
    ]
    print("Running:", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print("Error downloading dataset:")
        print(result.stderr)
        sys.exit(1)
    else:
        print("Dataset downloaded successfully to", target_dir)

if __name__ == "__main__":
    # e.g., use current working directory’s “data” subfolder
    dataset = "taweilo/loan-approval-classification-data"
    out_dir = "./data/loan_approval"
    
    download_kaggle_dataset(dataset, out_dir)



Running: kaggle datasets download -d taweilo/loan-approval-classification-data -p ./data/loan_approval --unzip
Dataset downloaded successfully to ./data/loan_approval
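
Once the download finishes, a quick sanity check can confirm the record count and the class balance of the target before any modelling. The following is a minimal sketch using only the standard library. `summarize_target` is a hypothetical helper, and the file name `loan_data.csv` is an assumption about what the archive contains, so it is left commented out.

```python
import csv
from collections import Counter


def summarize_target(csv_path: str, target: str = "loan_status") -> tuple[int, Counter]:
    """Return (number of rows, count of each value in the target column)."""
    counts: Counter = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row[target]] += 1  # values are read as strings, e.g. "0" / "1"
    return sum(counts.values()), counts


# Assumed path: adjust the file name to whatever the archive actually contains.
# n_rows, counts = summarize_target("./data/loan_approval/loan_data.csv")
# print(n_rows, dict(counts))
```

A strongly imbalanced target would suggest looking at metrics other than plain accuracy.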


# Loan Approval Classification Dataset

## Dataset Overview

This dataset is a synthetic version inspired by the original Credit Risk dataset, enriched with additional variables based on Financial Risk data for Loan Approval. It contains 45,000 records and is designed for classification tasks, specifically to predict the approval or rejection of a loan.

The dataset represents a collection of information about loan applicants, which can be used to train a machine learning model to predict whether a loan should be approved or not.

## Features

### Numerical Features

- **Age:** The applicant's age.
- **Annual Income:** The applicant's annual income.
- **Employment Experience:** The applicant's work experience.
- **Loan Amount:** The amount of the loan requested.
- **Loan Interest Rate:** The interest rate of the loan.
- **Loan Amount as a Percentage of Annual Income:** The loan amount relative to the applicant's annual income.
- **Length of Credit History:** The length of the applicant's credit history.
- **Credit Score:** The applicant's credit score.
- **Previous Loan Defaults on File:** Indicator of whether the applicant has previous loan defaults on record.

### Categorical Features

- **Gender:** The applicant's gender.
- **Education:** The applicant's education level.
- **Home Ownership:** The applicant's home ownership status (for example, renting, owning, or paying a mortgage).
- **Loan Intent:** The purpose of the loan.

### Target Variable

The target variable is `loan_status`, which is a binary variable:
- **1:** Loan Approved
- **0:** Loan Rejected
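
Because an MLP expects purely numeric input, the two feature groups above are usually prepared differently: numerical columns are standardized and categorical columns are one-hot encoded. The sketch below illustrates both steps with NumPy. The helper names and the toy values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np


def standardize(x: np.ndarray) -> np.ndarray:
    """Scale each column to zero mean and unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (x - mean) / std


def one_hot(values: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """One-hot encode a 1-D array of labels; returns (encoded matrix, sorted categories)."""
    categories, inverse = np.unique(values, return_inverse=True)
    return np.eye(len(categories))[inverse], categories


# Toy example with made-up values
numeric = np.array([[22.0, 30000.0], [35.0, 80000.0], [48.0, 55000.0]])  # age, annual income
intent = np.array(["EDUCATION", "MEDICAL", "EDUCATION"])  # loan intent

encoded, categories = one_hot(intent)
X = np.hstack([standardize(numeric), encoded])
print(X.shape)  # (3, 4)
```

In a real pipeline the mean and standard deviation would be computed on the training split only and reused for validation and test data, to avoid leaking information.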

## Domain Knowledge

In the context of loan approval, some financial terms are important:

- **Credit Score:** A numerical score that represents an individual's creditworthiness. A higher score generally indicates a lower risk of default.
- **Loan Intent:** The reason why the loan is being requested (e.g., for education, medical expenses, home improvements). The purpose can influence the risk assessment.
- **Default:** The failure to meet the obligation to repay a loan. Having a history of defaults significantly increases the risk for the lender.

## The Challenge and Real-World Relevance

Even though this is a synthetic dataset, this project simulates one of the most classic and high-impact challenges in the financial and technology sectors: credit risk assessment. The relevance of solving this problem efficiently and fairly is immense, both for financial institutions and for society.

### The Core Challenge

The central challenge is to build a model that accurately balances two competing objectives:

1.  **Minimizing the Risk of Default (False Positives):** Approving a loan for someone who will not pay it back results in a direct financial loss for the institution. The model must be rigorous in identifying bad payers.
2.  **Maximizing Business Opportunity (False Negatives):** Rejecting a loan for someone who would have repaid it means lost revenue (unearned interest) and the potential loss of a customer to a competitor. The model cannot be overly conservative.
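
The trade-off between the two objectives can be made concrete by counting each error type and weighting it by a cost. With `loan_status = 1` meaning approved, a false positive is an approval that should have been a rejection, and a false negative is a rejection that should have been an approval. The sketch below uses made-up labels, and the cost values are hypothetical placeholders, not figures from the dataset.

```python
def count_errors(y_true: list[int], y_pred: list[int]) -> tuple[int, int]:
    """Return (false_positives, false_negatives) for binary labels where 1 = approved."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn


def total_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Weight each error type by its (assumed) business cost."""
    return fp * cost_fp + fn * cost_fn


y_true = [1, 0, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0]
fp, fn = count_errors(y_true, y_pred)
print(fp, fn)  # 1 1
print(total_cost(fp, fn, cost_fp=5.0, cost_fn=1.0))  # 6.0
```

Setting `cost_fp` higher than `cost_fn` encodes the assumption that a bad approval hurts more than a missed customer. Comparing this total across decision thresholds is one way to choose an operating point.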

Furthermore, the challenge deepens when considering fairness and interpretability. A model must not discriminate based on sensitive data, and in many countries, institutions are legally required to explain why a loan was denied. Therefore, the model needs to be not only accurate but also transparent and fair.
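
One simple first check on the fairness requirement is to compare predicted approval rates across groups of a sensitive attribute, such as the gender column. This gap is often called the demographic parity difference. The sketch below is only an illustration with made-up predictions, and `approval_rate_by_group` is a hypothetical helper. A small gap on this one metric does not by itself establish that a model is fair.

```python
from collections import defaultdict


def approval_rate_by_group(groups: list[str], y_pred: list[int]) -> dict[str, float]:
    """Return the share of predicted approvals (1) within each group."""
    totals: dict[str, int] = defaultdict(int)
    approved: dict[str, int] = defaultdict(int)
    for g, p in zip(groups, y_pred):
        totals[g] += 1
        approved[g] += p
    return {g: approved[g] / totals[g] for g in totals}


groups = ["female", "male", "female", "male", "male", "female"]
y_pred = [1, 1, 0, 1, 1, 0]
rates = approval_rate_by_group(groups, y_pred)
gap = max(rates.values()) - min(rates.values())  # demographic parity difference
print(rates, gap)
```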

### Real-World Relevance

Working with this dataset, even though it is artificial, offers practical training for real problems faced daily by banks, fintechs, and credit unions.

- **Foundation for Credit Scoring Systems:** The models created here are the basis of credit scoring systems that determine the financial health of millions of people and companies.
- **Automation and Scalability:** In a digital world, manually analyzing every loan application is unfeasible. Machine Learning models allow companies to make fast, consistent, and large-scale decisions, enabling everything from the approval of an online credit card in minutes to car financing.
- **Financial Inclusion:** Well-built models can identify good payers who might be overlooked by traditional analyses, allowing more people to access credit fairly.

Therefore, solving this challenge is not just a technical exercise, but a direct simulation of how data science is applied to make critical financial decisions that affect people's lives and the economic health of businesses.