## Random Forests

**Random Forest**
- Random Forest is an **ensemble model** built from many decision trees.  
- Each tree is trained on a bootstrap sample of the rows ("bagging") and considers only a random subset of the features at each split ("feature randomness").  
- The final prediction is made by **majority vote** (classification) or **averaging** (regression).  
- This randomness makes the model more robust and reduces overfitting compared to a single decision tree.
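
The two sources of randomness above can be sketched by hand with plain decision trees. This is a minimal illustration on synthetic data, not the Caravan dataset; the names `X_demo`, `y_demo`, and `n_trees` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the Caravan dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 vote can never tie
trees = []
for i in range(n_trees):
    # Bagging: bootstrap sample of the rows (drawn with replacement)
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Feature randomness: each split considers only sqrt(n_features) candidates
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean over trees is the share voting for class 1
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_trees, n_samples)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch mirrors the idea and not the exact implementation.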

**Why use it?**
- Handles **non-linear relationships** and **feature interactions** automatically.  
- Works well on mixed or categorical-like features and needs no feature scaling, because tree splits depend only on the ordering of values.  
- More powerful than Logistic Regression for complex patterns, and less sensitive to outliers.  
- Built-in support for **imbalanced classes** using `class_weight="balanced"`.  
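
To make the `class_weight="balanced"` option concrete, here is a small sketch of the weights it produces. The labels are hypothetical and only reuse the class counts (1643 non-buyers, 104 buyers) that appear in the validation reports in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a ~6% positive rate (1643 negatives, 104 positives)
y_demo = np.array([0] * 1643 + [1] * 104)

# "balanced" weights follow n_samples / (n_classes * np.bincount(y))
manual = len(y_demo) / (2 * np.bincount(y_demo))

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

print("manual :", manual.round(3))
print("sklearn:", weights.round(3))
```

The minority class gets a weight roughly sixteen times larger than the majority class, so misclassifying a buyer costs the trees about as much as misclassifying sixteen non-buyers.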

It will help us see whether a non-linear ensemble improves on the Logistic Regression results.


In [3]:
import os, re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from scripts.data_loader import load_caravan
from scripts.metrics import evaluate_model

In [4]:
train, test, X, y, TARGET = load_caravan(data_dir="../data")
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

In [5]:
from sklearn.ensemble import RandomForestClassifier

# Class imbalance: use class_weight="balanced_subsample" so each tree sees reweighted classes
rf = RandomForestClassifier(
    n_estimators=400,           # number of trees
    max_depth=None,             # allow trees to grow fully (we regularize via min_samples_leaf)
    min_samples_leaf=2,         # small leaf size to reduce overfitting a bit
    min_samples_split=5,        # avoid very tiny splits
    class_weight="balanced_subsample",  # like "balanced", but weights are recomputed from each tree's bootstrap sample
    n_jobs=-1,                  # use all CPU cores
    random_state=42
)

rf.fit(X_train, y_train)

proba_val = rf.predict_proba(X_val)[:, 1]
preds_val = rf.predict(X_val)

results = evaluate_model("random forest", y_val, proba_val)


=== random forest ===
ROC-AUC: 0.7161 | PR-AUC: 0.1425
Best-F1 threshold: 0.156
At best-F1: Precision=0.155, Recall=0.500, F1=0.237
              precision    recall  f1-score   support

           0       0.96      0.83      0.89      1643
           1       0.16      0.50      0.24       104

    accuracy                           0.81      1747
   macro avg       0.56      0.66      0.56      1747
weighted avg       0.92      0.81      0.85      1747



## Random Forest Results

**Global metrics**
- **ROC-AUC = 0.716** → The model has a reasonable ability to rank buyers above non-buyers.  
- **PR-AUC = 0.143** → On this imbalanced dataset (6% positives), this is about 2–3× better than random guessing (≈0.06).  
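
The random-guessing baseline for PR-AUC is the positive rate, which can be checked empirically. The sketch below is an illustration with made-up labels that reuse the validation class counts from the report, scored with uniform random numbers instead of a model.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Labels with the class counts reported for the validation split (1643 / 104)
y_demo = np.array([0] * 1643 + [1] * 104)
prevalence = y_demo.mean()

# Mean PR-AUC of uninformative (uniform random) scores over several draws
rng = np.random.default_rng(0)
random_ap = np.mean([
    average_precision_score(y_demo, rng.random(len(y_demo)))
    for _ in range(20)
])

# Reported PR-AUC of the model relative to the baseline
lift = 0.1425 / prevalence

print(f"prevalence={prevalence:.4f}  random PR-AUC={random_ap:.4f}  lift={lift:.2f}x")
```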

**Performance at best-F1 threshold (0.156)**
- **Precision = 0.155** → About 16% of predicted buyers are actual buyers.  
- **Recall = 0.500** → The model successfully identifies 50% of the true buyers.  
- **F1 = 0.237** → Harmonic mean of precision and recall, balancing the two.  
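
For reference, a best-F1 threshold can be found by scanning the precision-recall curve. The function below is a sketch of that idea; it is an assumption about what such a search looks like, not the actual code of `evaluate_model`, and `best_f1_threshold` is a hypothetical name.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, proba):
    """Return (threshold, precision, recall, f1) at the F1-maximizing threshold."""
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    # The final (precision=1, recall=0) point has no threshold attached; drop it
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    # Guard against 0/0 where both precision and recall are zero
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    i = int(np.argmax(f1))
    return thresholds[i], precision[i], recall[i], f1[i]

# Toy scores (hypothetical; not from the Caravan data)
y_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p_toy = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.80, 0.90])
thr, prec, rec, f1 = best_f1_threshold(y_toy, p_toy)
print(f"threshold={thr:.2f}  precision={prec:.2f}  recall={rec:.2f}  F1={f1:.3f}")
```

Because the threshold is chosen on the same validation data it is reported on, the resulting F1 is somewhat optimistic.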

**Classification report (per class)**
- **Class 0 (non-buyers)**: Precision 0.96, Recall 0.83 → the model is strong at detecting non-buyers.  
- **Class 1 (buyers)**: Precision 0.16, Recall 0.50 → the model catches half of the buyers, but with many false positives.  

**Accuracy = 0.81**  
- Looks decent, but less informative here because the dataset is highly imbalanced.  
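
A quick way to see why accuracy misleads here is to score a classifier that always predicts the majority class. The sketch uses made-up labels with the validation class counts from the report and a placeholder feature matrix, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels with the class counts from the report (1643 non-buyers, 104 buyers)
y_demo = np.array([0] * 1643 + [1] * 104)
X_dummy = np.zeros((len(y_demo), 1))  # placeholder features, ignored by the dummy

dummy = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_demo)
pred = dummy.predict(X_dummy)

acc = accuracy_score(y_demo, pred)
rec = recall_score(y_demo, pred)
print(f"always-predict-0: accuracy={acc:.3f}, buyer recall={rec:.3f}")
```

The trivial baseline reaches about 0.94 accuracy while finding no buyers at all, so the 0.81 reported above says little about how useful the model is.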

**Interpretation**  
- Random Forest captures **half of the true buyers** but at the cost of many false alarms.  
- It performs better than random, but **precision is low**, meaning a marketing campaign would still contact many uninterested customers.  
- Compared to Logistic Regression and Gradient Boosting, this untuned model is **weaker in PR-AUC and F1**, so it is not the best choice overall.
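
To put the false alarms in campaign terms, the reported precision and recall can be turned into approximate counts. The figures below are back-of-the-envelope estimates from rounded metrics, not values read from an actual confusion matrix.

```python
# Reported values at the best-F1 threshold (validation split)
n_pos = 104                      # true buyers (support of class 1)
recall, precision = 0.500, 0.155

tp = round(recall * n_pos)               # buyers correctly flagged
predicted_pos = round(tp / precision)    # customers the model would contact
fp = predicted_pos - tp                  # contacted non-buyers

print(f"contacts={predicted_pos}, buyers reached={tp}, false alarms={fp}")
print(f"contacts per buyer reached: {predicted_pos / tp:.1f}")
```

Under these approximations, the campaign contacts roughly six to seven customers for every buyer it reaches.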


In [6]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV

rf_base = RandomForestClassifier(
    class_weight="balanced_subsample",
    n_jobs=-1,
    random_state=42
)

param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 3]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf_grid = GridSearchCV(
    estimator=rf_base,
    param_grid=param_grid,
    scoring="average_precision",  # PR-AUC
    cv=cv,
    n_jobs=-1,
    refit=True,
    return_train_score=False
)

rf_grid.fit(X_train, y_train)

print("Best RF params:", rf_grid.best_params_)
print("Best CV PR-AUC:", round(rf_grid.best_score_, 4))

rf_best = rf_grid.best_estimator_

# Evaluate tuned model on validation
proba_val_best = rf_best.predict_proba(X_val)[:, 1]
preds_val_best  = rf_best.predict(X_val)

results = evaluate_model("random forest (tuned) ", y_val, proba_val_best)


Best RF params: {'max_depth': 10, 'min_samples_leaf': 1, 'n_estimators': 200}
Best CV PR-AUC: 0.1628

=== random forest (tuned)  ===
ROC-AUC: 0.7606 | PR-AUC: 0.1767
Best-F1 threshold: 0.526
At best-F1: Precision=0.259, Recall=0.288, F1=0.273
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      1643
           1       0.26      0.29      0.27       104

    accuracy                           0.91      1747
   macro avg       0.61      0.62      0.61      1747
weighted avg       0.91      0.91      0.91      1747



## Random Forest (Tuned) Results

- **ROC-AUC = 0.761** → The model can reasonably rank buyers vs. non-buyers.  
- **PR-AUC = 0.177** → About 3× the random baseline (~0.06), but still modest in absolute terms.  

- **Best-F1 threshold = 0.526**  
  - **Precision = 0.259** → ~26% of predicted buyers are correct.  
  - **Recall = 0.288** → ~29% of actual buyers are captured.  
  - **F1 = 0.273** → Balanced score, better than Logistic Regression.  

- **Class 0 (non-buyers)**: Very strong performance (precision/recall ≈ 0.95).  
- **Class 1 (buyers)**: Weak recall and precision, reflecting the challenge of imbalance.  
- **Accuracy = 0.91** looks high, but is dominated by the majority class and not a reliable indicator.  

**Interpretation:**  
Random Forest performs better than random guessing and provides a balanced baseline, but it struggles with recall on the minority class. Its overall performance (PR-AUC and F1) is better than Logistic Regression's, but it can be improved on with Gradient Boosting.
