# Loan Default Prediction with Fairness Constraints

## 📌 Problem Statement
You are given historical loan application data. Each row represents one applicant along with financial and demographic information.

Your task is to build a **machine learning pipeline** that predicts whether a customer will **default on a loan**.

Unlike simple prediction problems, you must also:
- Compare multiple algorithms taught in class
- Understand why some models work better than others
- Analyze whether your final model behaves **fairly** across different groups

This notebook provides **guided hints** to help you progress step by step. You are still expected to write the code yourself.

---
## 📂 Dataset Information
You will be working with a real-world loan dataset (`loan_data.csv`).

**Target Column:**
- `Loan_Status` → Loan default indicator (binary)

**Helpful Hint:**
- This dataset contains both **numerical** and **categorical** features
- Some features may need encoding before modeling

---
## 1️⃣ Import Required Libraries

**What you need to do:**
- Import libraries for:
  - Data handling
  - Visualization
  - Machine learning models

**Hint:**
- You will almost certainly need `pandas`, `numpy`, and `sklearn` modules
- Avoid importing deep learning libraries

---
## 2️⃣ Load and Inspect the Dataset

**What you need to do:**
- Load `loan_data.csv` into a DataFrame
- View the first 5 rows
- Check dataset shape and column names

**Hints:**
- Use `.head()` and `.shape` (`.shape` is an attribute, so it takes no parentheses)
- Use `.info()` to identify data types
- Look for missing values early
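
The inspection steps above can be sketched as follows. This is a minimal illustration, not a solution: the real `loan_data.csv` is replaced by a tiny in-memory CSV, and every column name except `Loan_Status` is a made-up assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for loan_data.csv. Only `Loan_Status` is named in the
# assignment; the other columns and all values are invented for illustration.
csv_text = """ApplicantIncome,LoanAmount,Gender,Loan_Status
5000,120,Male,Y
3000,,Female,N
4200,95,Female,Y
6100,200,Male,N
2500,80,,Y
7000,150,Male,Y
"""

# With the real file this would be: df = pd.read_csv("loan_data.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())           # first 5 rows
print(df.shape)            # (rows, columns)
print(list(df.columns))    # column names
df.info()                  # dtypes and non-null counts per column

# Count missing values per column so they can be handled before modeling
missing = df.isna().sum()
print(missing)
```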

---
## 3️⃣ Exploratory Data Analysis (EDA)

**What you need to do:**
- Understand how numerical features are distributed
- Compare distributions for default vs non-default cases

**Hints:**
- Plot histograms for features like income and loan amount
- Use groupby on `Loan_Status` to compare averages
- Do not overdo visualization; clarity matters
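
One possible shape for this step is sketched below on synthetic data. The column names `ApplicantIncome` and `LoanAmount`, and the 0/1 coding of `Loan_Status`, are assumptions made for the example and may not match the real dataset.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumed columns; Loan_Status assumed 0/1)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "ApplicantIncome": rng.normal(5000, 1500, n).round(),
    "LoanAmount": rng.normal(150, 40, n).round(),
    "Loan_Status": rng.integers(0, 2, n),
})

# One panel per feature, with overlaid histograms for each class
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, ["ApplicantIncome", "LoanAmount"]):
    for status, group in df.groupby("Loan_Status"):
        ax.hist(group[col], bins=20, alpha=0.5, label=f"Loan_Status={status}")
    ax.set_title(col)
    ax.legend()
fig.tight_layout()

# Average of each numerical feature per class
group_means = df.groupby("Loan_Status")[["ApplicantIncome", "LoanAmount"]].mean()
print(group_means)
```

Two overlaid histograms per feature are usually enough to see whether a feature separates the classes at all.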

---
## 4️⃣ Data Cleaning and Preprocessing

**What you need to do:**
- Handle missing values
- Convert categorical features into numerical form
- Separate features (X) and target (y)

**Hints:**
- Simple strategies (mean/mode) are acceptable
- `pd.get_dummies()` is sufficient for encoding
- Do not forget to encode the target variable
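
A minimal sketch of these preprocessing steps on a toy frame is shown below. The feature columns and the `"Y"`/`"N"` labels are assumptions; check how `Loan_Status` is actually coded in your data, and which value means "default", before copying the mapping.

```python
import numpy as np
import pandas as pd

# Toy frame with assumed columns; values are invented for illustration
df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, np.nan, 6100, 2500, 7000],
    "Property_Area": ["Urban", "Rural", "Urban", None, "Semiurban", "Urban"],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "Y"],
})

num_cols = ["ApplicantIncome"]
cat_cols = ["Property_Area"]

# Simple imputation: mean for numerical columns, mode for categorical ones
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Encode the target (assumed labels "Y"/"N") and separate it from the features
y = df["Loan_Status"].map({"N": 0, "Y": 1})

# One-hot encode categorical features; drop_first avoids a redundant column
X = pd.get_dummies(df.drop(columns="Loan_Status"), columns=cat_cols, drop_first=True)

print(X)
print(y.tolist())
```

Strictly speaking, imputation statistics computed on the full dataset also see the rows that later end up in the test set. For this assignment that is usually acceptable, but computing them on the training split only is the cleaner option.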

---
## 5️⃣ Train–Test Split

**What you need to do:**
- Split the data into training and testing sets
- Keep test data completely unseen during training

**Hints:**
- Use an 80–20 split
- Set `random_state` for reproducibility
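
The split might look like the sketch below, shown on synthetic features. The `stratify=y` argument is an optional addition not required by the hints; it keeps the class ratio the same in both splits, which helps when defaults are rare.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (70 vs 30 rows)
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(np.repeat([0, 1], [70, 30]), name="Loan_Status")

# 80-20 split, reproducible, with the class ratio preserved in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # share of class 1 in each split
```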

---
## 6️⃣ Linear Regression (Why It Fails)

**What you need to do:**
- Train a Linear Regression model
- Generate predictions on test data

**Think About:**
- Are predictions limited to 0 and 1?
- Does the model output probabilities?

**Hint:**
- Print a few predicted values and inspect them
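
To see what "inspect the predictions" means in practice, here is a small sketch on synthetic data with a 0/1 target. It does not use the loan dataset; the point is only that the outputs are arbitrary real numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic binary target driven by two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit ordinary least squares directly on the 0/1 labels
lin = LinearRegression().fit(X_train, y_train)
preds = lin.predict(X_test)

print(preds[:5])                 # continuous values, not class labels
print(preds.min(), preds.max())  # check whether any fall outside [0, 1]
```

The predictions are unbounded real numbers, so they typically stray below 0 or above 1 and cannot be read as probabilities without further work.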

---
## 7️⃣ Logistic Regression (Baseline Classifier)

**What you need to do:**
- Train a Logistic Regression model
- Evaluate using accuracy, precision, and recall
- Examine predicted probabilities

**Hints:**
- Use `classification_report`
- Look at `.predict_proba()` output
- This model will serve as your baseline
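
A baseline along these lines could be sketched as follows, using `make_classification` as a stand-in for the prepared loan features. The `max_iter=1000` setting is an assumption to avoid convergence warnings on unscaled data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded loan features
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = log_reg.predict(X_test)       # hard 0/1 labels
proba = log_reg.predict_proba(X_test)  # one column per class; rows sum to 1

print(classification_report(y_test, y_pred))  # precision, recall, F1, accuracy
print(proba[:5])
```

Unlike the linear regression outputs, every row of `predict_proba` lies in [0, 1] and sums to 1, so column 1 can be read as the estimated probability of class 1.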

---
## 8️⃣ Decision Tree Classifier

**What you need to do:**
- Train a Decision Tree model
- Control its depth
- Compare performance with Logistic Regression

**Hints:**
- Start with a small `max_depth` (e.g., 3–5)
- Deeper trees may overfit
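
The effect of depth can be illustrated with the sketch below, again on synthetic data with some label noise (`flip_y=0.1` is an assumption chosen to make overfitting visible). It compares training and test accuracy for a few depths.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% randomly flipped labels
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Map each max_depth to (train accuracy, test accuracy); None means unlimited
results = {}
for depth in (3, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy for the unlimited tree is the usual sign of overfitting.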

---
## 9️⃣ Random Forest Classifier

**What you need to do:**
- Train a Random Forest model
- Compare it with Decision Tree
- Identify important features

**Hints:**
- Start with 100–200 trees
- Feature importance can be accessed from the model
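
As a rough sketch, feature importances are exposed on a fitted forest through the `feature_importances_` attribute. The feature names below are placeholders; with the real data you would use `X.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# Impurity-based importances; they are non-negative and sum to 1
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

These impurity-based scores tend to favor features with many distinct values, so treat the ranking as a starting point for discussion rather than a final verdict.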

---
## 🔟 Hyperparameter Tuning

**What you need to do:**
- Tune one model (Decision Tree or Random Forest)
- Compare tuned vs default performance

**Hints:**
- Tune parameters related to depth or number of trees
- Use cross-validation to avoid overfitting
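
One way to combine tuning with cross-validation is `GridSearchCV`, sketched here for a Decision Tree on synthetic data. The grid values are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Untuned reference model with default hyperparameters
baseline = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Illustrative grid over depth and leaf size (values are assumptions)
param_grid = {
    "max_depth": [2, 3, 5, 8, None],
    "min_samples_leaf": [1, 5, 20],
}

# 5-fold cross-validation on the training data only; the test set stays unseen
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
tuned_acc = search.best_estimator_.score(X_test, y_test)

print("best params:", search.best_params_)
print(f"default: {baseline_acc:.3f}  tuned: {tuned_acc:.3f}")
```

The test set is touched only once at the end, for the final default-versus-tuned comparison.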

---
## 1️⃣1️⃣ K-Means Clustering (Customer Segmentation)

**What you need to do:**
- Apply K-Means on feature data (without target)
- Assign cluster labels to customers
- Compare default rates across clusters

**Hints:**
- Choose a small number of clusters (e.g., 3 or 4)
- Scaling features before clustering can help
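
The clustering steps could be sketched as below. The data is synthetic, the column names are assumptions, and `Loan_Status` is assumed to be coded so that 1 means default, which makes the group mean a default rate.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed columns; Loan_Status assumed 1 = default)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "ApplicantIncome": rng.normal(5000, 1500, n),
    "LoanAmount": rng.normal(150, 40, n),
    "Loan_Status": rng.integers(0, 2, n),
})

# Cluster on features only; the target is left out on purpose
features = df.drop(columns="Loan_Status")

# Scale first so that income does not dominate the distance computation
X_scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X_scaled)

# Share of defaults within each cluster
default_rate = df.groupby("cluster")["Loan_Status"].mean()
print(default_rate)
```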

---
## 1️⃣2️⃣ Fairness Analysis

**What you need to do:**
- Compare model predictions across demographic groups
- Check whether one group is disadvantaged

**Hints:**
- Use groupby on sensitive features
- Compare error rates, not just accuracy
- There is no single correct answer; reasoning matters
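
The hint about error rates can be made concrete with a tiny hand-made example. The `Gender` column, the group labels, and the predictions are all invented; 1 is assumed to mean "default". Both groups have the same accuracy, yet the kinds of mistakes differ.

```python
import pandas as pd

# Invented labels and predictions for two hypothetical groups
results = pd.DataFrame({
    "Gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "y_true": [1, 0, 0, 1, 1, 0, 0, 1],
    "y_pred": [1, 0, 1, 1, 0, 0, 0, 1],
})


def error_rates(group):
    """Accuracy and error rates for one group (needs both classes present)."""
    tp = int(((group["y_true"] == 1) & (group["y_pred"] == 1)).sum())
    fn = int(((group["y_true"] == 1) & (group["y_pred"] == 0)).sum())
    fp = int(((group["y_true"] == 0) & (group["y_pred"] == 1)).sum())
    tn = int(((group["y_true"] == 0) & (group["y_pred"] == 0)).sum())
    return {
        "accuracy": (tp + tn) / len(group),
        "fpr": fp / (fp + tn),  # non-defaulters wrongly flagged as defaults
        "fnr": fn / (fn + tp),  # defaulters the model missed
        "predicted_default_rate": (tp + fp) / len(group),
    }


# One row per group, one column per metric
rates = pd.DataFrame(
    {name: error_rates(g) for name, g in results.groupby("Gender")}
).T
print(rates)
```

Here one group bears the false positives and the other the false negatives, which accuracy alone would hide. Which of these gaps matters more depends on the cost of each mistake, and that is the part to argue in your write-up.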