<p align="center">
  <h1 align="center">🍳 Cookbook 02: ML Feature Engineering & VIF</h1>
  <p align="center">
    <strong>GradTracer for XGBoost, LightGBM, and Tabular Feature Selection</strong>
  </p>
</p>

---

While PyTorch FlowTracker analyzes backpropagation, GradTracer also supports **Tree Ensembles (XGBoost/LightGBM)** through the `FeatureAnalyzer` and `TreeDynamicsTracker`.

In this recipe, we use a classic tabular dataset (California Housing, loaded through scikit-learn) to demonstrate how GradTracer detects multicollinearity (VIF) and identifies non-linear feature interactions (Synergy) that standard permutation importance does not surface on its own.

## 1. Setup & Load Dataset

In [None]:
# !pip install gradtracer xgboost lightgbm scikit-learn pandas statsmodels

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from gradtracer import FeatureAnalyzer, TreeDynamicsTracker

# Load California Housing Data
cali = fetch_california_housing()
X = pd.DataFrame(cali.data, columns=cali.feature_names)
y = cali.target

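# Introduce an artificial correlated feature to trigger the VIF warning.
# A seeded generator keeps the injected noise (and the reported RMSE) reproducible.
rng = np.random.default_rng(42)
X['Fake_Income'] = X['MedInc'] * 1.5 + rng.normal(0, 0.5, len(X))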

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 2. Train XGBoost Model with Tree Dynamics Tracker

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    'objective': 'reg:squarederror',
    'max_depth': 6,
    'eta': 0.1,
    'eval_metric': 'rmse'
}

# Attach TreeDynamicsTracker to monitor Node Split Gains over time
tree_tracker = TreeDynamicsTracker()

print("Training XGBoost Model...")
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dtest, 'eval')],
    early_stopping_rounds=10,
    callbacks=[tree_tracker.as_xgb_callback()],
    verbose_eval=False
)

preds = bst.predict(dtest)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"\n✅ Initial XGBoost RMSE: {rmse:.4f}")

## 3. Tree Dynamics Report (Learning Curves)

In [None]:
tree_tracker.report()

## 4. GradTracer FeatureAnalyzer (VIF + Synergy)
We pass the trained XGBoost model to `FeatureAnalyzer`. It flags the highly correlated `Fake_Income` and `MedInc` features using VIF, and it reports synergy pairs, i.e. features that are non-linearly coupled.
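
For intuition about what a VIF threshold of 10 means, here is a minimal NumPy sketch of the textbook definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. It is not GradTracer's implementation. The helper name `vif_scores` and the synthetic `income` column (a stand-in for `MedInc` with an assumed mean and spread) are illustrative only.

```python
import numpy as np

def vif_scores(X):
    """Textbook VIF for each column: 1 / (1 - R_j^2). Illustrative helper."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on every other column plus an intercept.
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        # Guard against division by zero for perfectly collinear columns.
        scores[j] = 1.0 / max(1.0 - r_squared, 1e-12)
    return scores

rng = np.random.default_rng(0)
n = 5000
income = rng.normal(3.9, 1.9, n)                     # assumed stand-in for MedInc
fake_income = income * 1.5 + rng.normal(0, 0.5, n)   # same recipe as Fake_Income
unrelated = rng.normal(0, 1, n)                      # independent control feature

vifs = vif_scores(np.column_stack([income, fake_income, unrelated]))
for name, score in zip(["income", "fake_income", "unrelated"], vifs):
    print(f"{name:12s} VIF = {score:.1f}")
```

Note that collinearity is mutual: in this sketch both `income` and `fake_income` land well above 10, while the independent column stays close to 1.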

In [None]:
print("Running Feature Diagnosis...")
analyzer = FeatureAnalyzer(bst, X_train, y_train, feature_names=X.columns.tolist())

vif_results = analyzer.multicollinearity(threshold=10.0)

print("\n--- Multicollinearity (VIF) Alerts ---")
for f in vif_results['warnings']:
    print(f"⚠️ Feature '{f['feature']}' has critical VIF ({f['vif']:.1f}). Consider dropping.")

print("\n--- Top Feature Synergy Interactions ---")
interactions = analyzer.interactions(top_k=5)
for i, item in enumerate(interactions):
    print(f"{i+1}. {item['feat_a']} × {item['feat_b']} (Synergy Score: {item['synergy_score']:.4f})")
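
To make the idea of a synergy pair concrete, the sketch below scores a pair by how much R² improves when a product term is added to a purely additive fit. This is an assumption-laden illustration of "non-linear coupling", not the score that `analyzer.interactions` computes. The names `synergy_score` and `r_squared` and the synthetic data are hypothetical.

```python
import numpy as np

def r_squared(design, y):
    """R^2 of an ordinary least-squares fit of y on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - residual.var() / y.var()

def synergy_score(a, b, y):
    """R^2 gained by adding the a*b product term to an additive fit."""
    ones = np.ones_like(a)
    additive = np.column_stack([ones, a, b])
    joint = np.column_stack([ones, a, b, a * b])
    return r_squared(joint, y) - r_squared(additive, y)

rng = np.random.default_rng(0)
n = 4000
a, b, c = rng.normal(size=(3, n))
# The target depends on a and b only through their product; c acts additively.
y = a * b + 0.5 * c + rng.normal(0, 0.1, n)

score_ab = synergy_score(a, b, y)   # pair that truly interacts
score_ac = synergy_score(a, c, y)   # pair with no interaction
print(f"a x b synergy: {score_ab:.3f}")
print(f"a x c synergy: {score_ac:.3f}")
```

Here neither `a` nor `b` explains the target on its own, yet the pair explains most of it jointly. That joint structure is what a pairwise score exposes and a per-feature ranking cannot show.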

## 5. Pruning and Retraining
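By dropping the artificially correlated feature flagged by GradTracer, we can retrain a simpler, more robust model with comparable performance. Inspect the list before dropping anything: collinearity is mutual, so a VIF check can flag both members of a correlated pair, and removing both would also discard the real signal.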

In [None]:
features_to_drop = [f['feature'] for f in vif_results['warnings']]
print(f"Dropping features: {features_to_drop}")

X_train_clean = X_train.drop(columns=features_to_drop)
X_test_clean = X_test.drop(columns=features_to_drop)

dtrain_clean = xgb.DMatrix(X_train_clean, label=y_train)
dtest_clean = xgb.DMatrix(X_test_clean, label=y_test)

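# Retrain with the same settings as the first run so the two RMSE values are comparable
bst_clean = xgb.train(
    params,
    dtrain_clean,
    num_boost_round=100,
    evals=[(dtest_clean, 'eval')],
    early_stopping_rounds=10,
    verbose_eval=False
)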
preds_clean = bst_clean.predict(dtest_clean)
rmse_clean = np.sqrt(mean_squared_error(y_test, preds_clean))

print(f"\n✅ Cleaned XGBoost RMSE: {rmse_clean:.4f} (initial: {rmse:.4f})")
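
One common alternative to dropping every flagged feature at once is to prune iteratively: remove the single worst feature, recompute VIF, and stop once everything is under the threshold. The sketch below shows that loop in plain NumPy under the textbook VIF definition. The helpers `vif_scores` and `prune_by_vif` are hypothetical and not part of GradTracer.

```python
import numpy as np

def vif_scores(X):
    """Textbook VIF for each column: 1 / (1 - R_j^2). Illustrative helper."""
    n_rows, n_cols = X.shape
    scores = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        design = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        scores[j] = 1.0 / max(1.0 - r_squared, 1e-12)
    return scores

def prune_by_vif(X, names, threshold=10.0):
    """Drop the highest-VIF column one at a time until all are <= threshold."""
    X = np.asarray(X, dtype=float)
    kept = list(names)
    dropped = []
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        dropped.append(kept.pop(worst))
        X = np.delete(X, worst, axis=1)
    return kept, dropped

rng = np.random.default_rng(0)
n = 5000
income = rng.normal(3.9, 1.9, n)                     # assumed stand-in for MedInc
fake_income = income * 1.5 + rng.normal(0, 0.5, n)
unrelated = rng.normal(0, 1, n)

kept, dropped = prune_by_vif(
    np.column_stack([income, fake_income, unrelated]),
    ["income", "fake_income", "unrelated"],
)
print("kept:", kept)
print("dropped:", dropped)
```

Only one member of the correlated pair is removed, because the survivor's VIF falls back toward 1 once its twin is gone. Which member goes is close to arbitrary when the two VIFs are nearly tied, so domain knowledge (here, knowing that `Fake_Income` is synthetic) should make the final call.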