# Decision Tree Classification: Loan Approval Prediction
✅ Predict if a loan application will be Approved (1) or Rejected (0) based on applicant information.

| Column             | Description                          |
| ------------------ | ------------------------------------ |
| `applicant_income` | Income in ₹000s                      |
| `credit_score`     | Credit score (300–900)               |
| `loan_amount`      | Loan requested in ₹000s              |
| `loan_term`        | Duration in months                   |
| `dependents`       | Number of dependents (0–3)           |
| `married`          | Marital status (1=Yes, 0=No)         |
| `education`        | Graduate? (1=Yes, 0=No)              |
| `approved`         | Target: Loan approved? (1=Yes, 0=No) |


In [1]:
# Data Cleaning
import pandas as pd
import numpy as np

np.random.seed(42)

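# Simulate a small synthetic dataset (30 rows). Every column, including the
# target 'approved', is drawn at random, so the features carry no real signal.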
data = pd.DataFrame({
    'applicant_income': np.random.randint(20, 100, 30),
    'credit_score': np.random.randint(300, 900, 30),
    'loan_amount': np.random.randint(50, 300, 30),
    'loan_term': np.random.choice([12, 24, 36, 60], 30),
    'dependents': np.random.randint(0, 4, 30),
    'married': np.random.choice([0, 1], 30),
    'education': np.random.choice([0, 1], 30),
    'approved': np.random.choice([0, 1], 30)
})

# Check for missing values
print("Missing Values:\n", data.isnull().sum())




Missing Values:
 applicant_income    0
credit_score        0
loan_amount         0
loan_term           0
dependents          0
married             0
education           0
approved            0
dtype: int64


This synthetic data has no missing values. With real data, handle gaps before modeling: use `.fillna()` to impute them or `.dropna()` to drop the affected rows.
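
As a rough illustration of the two options, the sketch below builds a small hypothetical frame with gaps (the column values and the median-imputation choice are assumptions, not part of this notebook's data), drops rows with a missing target, and fills the remaining feature gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the notebook's synthetic data has none.
raw = pd.DataFrame({
    "applicant_income": [45.0, np.nan, 80.0, 62.0],
    "credit_score": [710.0, 640.0, np.nan, 580.0],
    "approved": [1, 0, 1, np.nan],
})

# Rows with a missing target cannot be used for supervised training, so drop them.
cleaned = raw.dropna(subset=["approved"]).copy()

# Fill numeric feature gaps with the column median (one simple, common choice).
feature_cols = ["applicant_income", "credit_score"]
cleaned[feature_cols] = cleaned[feature_cols].fillna(cleaned[feature_cols].median())

print(cleaned.isnull().sum())
```

In a real pipeline the medians would normally be computed on the training split only, so that no information from the test rows leaks into the imputation.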


In [2]:
# Data Preprocessing
# No categorical encoding is needed here, since all columns are numeric or binary.
# Optional: check data types
print(data.dtypes)


applicant_income    int64
credit_score        int64
loan_amount         int64
loan_term           int64
dependents          int64
married             int64
education           int64
approved            int64
dtype: object


In [3]:
# Feature Engineering
# You can add derived features if needed:
data['income_to_loan_ratio'] = data['applicant_income'] / data['loan_amount']


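We'll keep it simple here and add only `income_to_loan_ratio`. In real cases, further derived features could be useful, such as:

- `loan_amount / applicant_income` (the inverse ratio)

- `credit_score_bucket` (credit score grouped into bands)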
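
A credit score bucket could be derived along these lines with `pd.cut`. The bucket edges and labels below are illustrative assumptions on the 300–900 scale, not a lending standard, and the sample scores are hypothetical.

```python
import pandas as pd

# Hypothetical sample of scores on the 300-900 scale used in this notebook.
scores = pd.DataFrame({"credit_score": [320, 580, 650, 749, 750, 900]})

# Bucket edges and labels are illustrative assumptions.
bins = [300, 580, 670, 750, 900]
labels = ["poor", "fair", "good", "excellent"]

# Intervals are right-closed, so 580 falls in "poor" and 750 in "good";
# include_lowest=True keeps a score of exactly 300 in the first bucket.
scores["credit_score_bucket"] = pd.cut(
    scores["credit_score"], bins=bins, labels=labels, include_lowest=True
)

print(scores)
```

The result is an ordered categorical column. The models used later expect numeric input, so it would need encoding first, for example with `.cat.codes`.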

In [4]:
# Train-Test Split
from sklearn.model_selection import train_test_split

X = data.drop('approved', axis=1)
y = data['approved']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
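
With only 30 rows, a plain random split can leave the 6-row test set with a skewed class mix. One option is a stratified split, sketched below on hypothetical stand-in data (the `demo` frame and its 18/12 class balance are assumptions made for the example).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's data: 30 rows, 18 approved, 12 rejected.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "credit_score": rng.integers(300, 900, 30),
    "approved": [1] * 18 + [0] * 12,
})

X_demo = demo.drop("approved", axis=1)
y_demo = demo["approved"]

# stratify=y_demo keeps the approved/rejected ratio roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train approval rate:", y_tr.mean())
print("Test approval rate:", y_te.mean())
```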


In [5]:
# Model Training (Logistic Regression, Random Forest, XGBoost)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric='logloss')
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"✅ {name} trained.")



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


✅ Logistic Regression trained.
✅ Random Forest trained.
✅ XGBoost trained.


Parameters: { "use_label_encoder" } are not used.
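
The two warnings above suggest follow-ups. The XGBoost message says `use_label_encoder` is not used by the installed version, so that argument can probably be dropped. For the logistic regression message, the warning itself recommends scaling the data or raising `max_iter`. A minimal sketch of that second fix, on hypothetical stand-in data rather than the notebook's `X_train`, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with features on very different scales.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "applicant_income": rng.integers(20, 100, 30),
    "credit_score": rng.integers(300, 900, 30),
    "loan_amount": rng.integers(50, 300, 30),
})
y_demo = np.array([0, 1] * 15)  # alternating labels, purely illustrative

# Standardize the features, then fit. max_iter=1000 raises the solver's
# iteration cap above scikit-learn's default of 100.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
print("Iterations used:", scaled_logreg[-1].n_iter_)
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training data only and reapplied automatically at prediction time.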



In [6]:
# Evaluation (All Models)

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"\n📌 {name} Results:")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))



📌 Logistic Regression Results:
Accuracy: 0.6666666666666666
Confusion Matrix:
 [[1 1]
 [1 3]]
Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.50      0.50         2
           1       0.75      0.75      0.75         4

    accuracy                           0.67         6
   macro avg       0.62      0.62      0.62         6
weighted avg       0.67      0.67      0.67         6


📌 Random Forest Results:
Accuracy: 0.8333333333333334
Confusion Matrix:
 [[2 0]
 [1 3]]
Classification Report:
               precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.75      0.86         4

    accuracy                           0.83         6
   macro avg       0.83      0.88      0.83         6
weighted avg       0.89      0.83      0.84         6


📌 XGBoost Results:
Accuracy: 0.3333333333333333
Confusion Matrix:
 [[0 2]
 [2 2]]
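Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.50      0.50      0.50         4

    accuracy                           0.33         6
   macro avg       0.25      0.25      0.25         6
weighted avg       0.33      0.33      0.33         6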

In [7]:
# Hyperparameter Tuning (Random Forest)

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 4, 5],
    'n_estimators': [50, 100],
    'min_samples_split': [2, 4]
}

grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

best_model = grid.best_estimator_
y_pred_best = best_model.predict(X_test)
print("Tuned Accuracy:", accuracy_score(y_test, y_pred_best))



Best Parameters: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 50}
Tuned Accuracy: 0.6666666666666666
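
The tuned accuracy (0.67) is lower than the untuned Random Forest result (0.83). Part of the reason may be that `RandomForestClassifier()` has no fixed `random_state`, so each fit differs, and a 6-row test set is very noisy. The sketch below, on hypothetical stand-in data with a smaller illustrative grid, fixes the seed and reads the cross-validated score from the search instead of relying on the test set alone.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data: 30 rows, balanced labels in random order.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "applicant_income": rng.integers(20, 100, 30),
    "credit_score": rng.integers(300, 900, 30),
    "loan_amount": rng.integers(50, 300, 30),
})
y_demo = rng.permutation([0] * 15 + [1] * 15)

# A smaller grid than the notebook's, to keep the example quick.
param_grid = {"max_depth": [3, 5], "n_estimators": [25, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),  # fixed seed makes reruns comparable
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the best combination.
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))

# cv_results_ holds one row per combination; sort by rank to compare them.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```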



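📌 Best Performing Model:
- On this run, Random Forest had the highest test accuracy (0.83) and the strongest precision/recall, ahead of Logistic Regression (0.67) and XGBoost (0.33). The tuned Random Forest scored 0.67.
- The test set has only 6 rows and the labels were generated at random, so these differences are mostly noise. Treat the ranking as an illustration of the workflow, not as evidence that one model is better.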
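
Because a single 6-row test set gives a very unstable estimate, a cross-validated comparison is usually more informative. The following is a minimal sketch on hypothetical stand-in data. XGBoost is left out only to keep the example dependency-free, and the fold count and model settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 30 rows, balanced labels in random order.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "applicant_income": rng.integers(20, 100, 30),
    "credit_score": rng.integers(300, 900, 30),
    "loan_amount": rng.integers(50, 300, 30),
})
y_demo = rng.permutation([0] * 15 + [1] * 15)

# Five stratified folds, so every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```

The spread across folds gives a sense of how much of the gap between models is noise.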

| Model                   | Pros                                                     | Cons                                                 |
| ----------------------- | -------------------------------------------------------- | ---------------------------------------------------- |
| **Logistic Regression** | Interpretable, fast, works well for linear relationships | Poor for non-linear data, assumes linearity          |
| **Random Forest**       | Robust, handles non-linearity, feature importance        | Slower, black-box model, may overfit without tuning  |
| **XGBoost**             | High accuracy, handles imbalances, regularized           | Complex to tune, not easy to explain to stakeholders |
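
The "feature importance" advantage listed for Random Forest can be inspected through the fitted model's `feature_importances_` attribute. The sketch below uses hypothetical stand-in data in which the label is deliberately tied to `credit_score` (an assumption made only so the importances have something to show).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data; the label depends only on credit_score.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "applicant_income": rng.integers(20, 100, 200),
    "credit_score": rng.integers(300, 900, 200),
    "loan_amount": rng.integers(50, 300, 200),
})
y_demo = (X_demo["credit_score"] > 600).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, summing to 1.
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances can be biased toward features with many distinct values, so they are best read as a rough guide.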


Real-world Uses of Loan Approval Classification:

- Banks (HDFC, SBI) for personal loans

- NBFCs for education/vehicle loans

- Credit scoring by FinTech apps

