# B2_log_ctx_opt – UIDAI Final Model (XGBoost)

Goal:
- Forecast UIDAI demand with high R² and low MAE.
- Ensure safety: at least 80% of state/segment groups have acceptable error.
- Use a clean, reproducible pipeline:
  data → time split → features → XGBoost tuning → evaluation → safety.

In [8]:
"""
PROMPT FOR COPILOT – Phase 1 (Path setup, imports, config)

You are configuring the FINAL UIDAI notebook for the model B2_log_ctx_opt.

Project structure:
- root/
    - src/
        - uidai_features.py      # feature engineering module
        - uidai_utils.py         # time-split, metrics, safety, save helpers
    - notebooks/
        - models/
            - 06_B2_log_ctx_opt_final.ipynb   # THIS notebook
    - data/
    - artifacts/

What this cell MUST do:

1) Python path setup
   - Add ../src to sys.path so that `uidai_features` and `uidai_utils`
     can be imported when the notebook runs from notebooks/models/.

2) Imports
   - Core libraries: pandas, numpy, matplotlib, seaborn.
   - Model: xgboost.XGBRegressor (regression).
   - Project helpers:
       from uidai_features import build_b2_log_ctx_opt_features
       from uidai_utils import (
           make_time_splits,
           compute_regression_metrics,
           compute_safety_report,
           save_metrics_and_safety,
       )

3) Global configuration
   - Define ROOT = project root (.. from this notebook).
   - Define:
       DATA_PATH   -> path to the main UIDAI monthly dataset
       TARGET_COL  -> name of the target column
       DATE_COL    -> name of the date column
       OUTPUT_DIR  -> directory for artifacts of B2_log_ctx_opt
       RANDOM_STATE -> fixed seed (e.g. 42) for reproducibility.

4) Safety / feedback
   - Print:
       - the SRC_DIR being used
       - a confirmation that imports worked
       - the OUTPUT_DIR path
   - If imports fail, DO NOT catch or hide the exception:
     let Python raise the error so we can fix paths or filenames.

Generate clean, readable Python code that follows these instructions.
"""

import sys
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBRegressor

# Add ../src to Python path
ROOT = Path("..").resolve()
SRC_DIR = ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

print("Using SRC_DIR:", SRC_DIR)

# Project imports
from uidai_features import build_b2_log_ctx_opt_features
from uidai_utils import (
    make_time_splits,
    compute_regression_metrics,
    compute_safety_report,
    save_metrics_and_safety,
)

print("✅ Imports OK: uidai_features and uidai_utils loaded.")

# Global config
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Data configuration (using actual UIDAI data paths)
DATA_PATH = ROOT / "data" / "processed" / "district_month_modeling.csv"
TARGET_COL = "total_enrolment"
DATE_COL = "month_date"

OUTPUT_DIR = ROOT / "artifacts" / "B2_log_ctx_opt"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print("OUTPUT_DIR:", OUTPUT_DIR)

✅ Phase 1 setup complete
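
The project layout above places this notebook two levels below the project root, so whether `Path("..")` resolves to the root depends on the working directory the kernel starts in. As a hedged alternative, the sketch below walks upward until it finds a directory that contains `src`. The helper name `find_project_root` and the `marker` argument are assumptions made for illustration; they are not part of `uidai_utils`.

```python
from pathlib import Path
import tempfile


def find_project_root(start, marker="src"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Demo on a throwaway directory tree that mimics root/src and root/notebooks/models.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "root"
    (root / "src").mkdir(parents=True)
    nb_dir = root / "notebooks" / "models"
    nb_dir.mkdir(parents=True)

    expected = root.resolve()
    found = find_project_root(nb_dir)
    print(found == expected)
```

In the notebook this would be called as `ROOT = find_project_root(Path.cwd())`, which gives the same result whether the kernel starts in `notebooks/` or `notebooks/models/`.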


In [9]:
"""
Phase 2 – Load UIDAI data and create time-based splits for B2_log_ctx_opt.

Context:
- This notebook is the FINAL pipeline for B2_log_ctx_opt.
- We are only improving this model; no new models are introduced here.
- Copilot is helping generate boilerplate code, but all logic (paths, dates, columns) must be checked and corrected by us if needed.

Tasks:
1) Use the config defined in Phase 1:
   - DATA_PATH: path to monthly UIDAI dataset (here ../data/processed/district_month_modeling.csv)
   - TARGET_COL: true target column name
   - DATE_COL: true date column name

2) Load the data:
   - Read CSV from DATA_PATH.
   - Parse DATE_COL as datetime.
   - Apply only LIGHT cleaning: type fixes, simple missing handling, obvious outliers.

3) Create time-based splits (no random split):
   - Choose real boundaries for train_end and val_end that match our previous experiments / hackathon rules.
   - Call make_time_splits(df, DATE_COL, train_end=..., val_end=..., test_end=None or a final date).
   - Ensure: train dates < val dates < test dates, no overlap.

4) Sanity checks:
   - Print head() and tail() of each split.
   - Print date ranges and shapes for train_df, val_df, test_df.
"""

# Load data
df = pd.read_csv(DATA_PATH, parse_dates=[DATE_COL])

# Light cleaning: drop rows with missing target
df = df.dropna(subset=[TARGET_COL])

# Time splits: Train (Apr-Sep), Val (Oct), Test (Nov-Dec)
TRAIN_END = "2025-09-30"
VAL_END = "2025-10-31"
TEST_END = None  # Use all remaining data as test

train_df, val_df, test_df = make_time_splits(
    df=df,
    date_col=DATE_COL,
    train_end=TRAIN_END,
    val_end=VAL_END,
    test_end=TEST_END,
)

print("Train:", train_df[DATE_COL].min(), "->", train_df[DATE_COL].max(), "rows:", len(train_df))
print("Val  :", val_df[DATE_COL].min(), "->", val_df[DATE_COL].max(), "rows:", len(val_df))
print("Test :", test_df[DATE_COL].min(), "->", test_df[DATE_COL].max(), "rows:", len(test_df))

Train: 2025-04-01 00:00:00 -> 2025-09-01 00:00:00 rows: 498
Val  : 2025-10-01 00:00:00 -> 2025-10-01 00:00:00 rows: 962
Test : 2025-11-01 00:00:00 -> 2025-12-01 00:00:00 rows: 1036
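
`make_time_splits` lives in `uidai_utils.py` and is not shown in this notebook. The sketch below is only a guess at its behaviour, written to make the split boundaries concrete: rows dated on or before `train_end` go to train, rows after that up to `val_end` go to validation, and everything later goes to test. The name `sketch_time_splits` and the toy frame are hypothetical.

```python
import pandas as pd


def sketch_time_splits(df, date_col, train_end, val_end, test_end=None):
    """Hypothetical stand-in for make_time_splits using inclusive cutoff dates."""
    dates = pd.to_datetime(df[date_col])
    train_end = pd.Timestamp(train_end)
    val_end = pd.Timestamp(val_end)

    train_mask = dates <= train_end
    val_mask = (dates > train_end) & (dates <= val_end)
    test_mask = dates > val_end
    if test_end is not None:
        test_mask = test_mask & (dates <= pd.Timestamp(test_end))

    return df[train_mask].copy(), df[val_mask].copy(), df[test_mask].copy()


# Toy monthly frame, April to December 2025 (illustrative values, not UIDAI data).
toy = pd.DataFrame({
    "month_date": pd.date_range("2025-04-01", "2025-12-01", freq="MS"),
    "total_enrolment": range(9),
})

train, val, test = sketch_time_splits(toy, "month_date", "2025-09-30", "2025-10-31")
print(len(train), len(val), len(test))  # 6 1 2
```

With the same cutoffs as this notebook, the toy frame splits into six training months, one validation month and two test months, matching the Apr-Sep / Oct / Nov-Dec ranges printed above.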


In [10]:
"""
Sanity checks for Phase 1 & Phase 2 (setup + time splits).

Goal:
- Confirm that our configuration and time-based splits are "hackathon-ready", not just basic scaffolding.
- This cell should:
  1) Print DATA_PATH, TARGET_COL, DATE_COL so we can visually confirm they are correct.
  2) Verify that uidai_features and uidai_utils were imported successfully.
  3) Check that train_df, val_df, test_df exist and are non-empty.
  4) Print min/max dates and row counts for each split.
  5) Assert that:
        max(train dates) < min(val dates) <= max(val dates) < min(test dates)
     to guarantee strictly ordered, non-overlapping time splits.
If any assertion fails, we will fix the config/split logic before going to Phase 3.
"""

# 1) Show core config
print("DATA_PATH:", DATA_PATH)
print("TARGET_COL:", TARGET_COL)
print("DATE_COL :", DATE_COL)

# 2) Basic import confirmation (types)
print("build_b2_log_ctx_opt_features:", build_b2_log_ctx_opt_features)
print("make_time_splits:", make_time_splits)

# 3) Check that splits exist and are non-empty
for name, df_part in [("train_df", train_df), ("val_df", val_df), ("test_df", test_df)]:
    assert df_part is not None, f"{name} is None"
    assert len(df_part) > 0, f"{name} is empty"

# 4) Print date ranges and row counts
for name, df_part in [("Train", train_df), ("Val", val_df), ("Test", test_df)]:
    print(
        f"{name}: {df_part[DATE_COL].min()} -> {df_part[DATE_COL].max()} | rows: {len(df_part)}"
    )

# 5) Assert ordering of splits
train_max = train_df[DATE_COL].max()
val_min = val_df[DATE_COL].min()
val_max = val_df[DATE_COL].max()
test_min = test_df[DATE_COL].min()

assert train_max < val_min, "Train and validation date ranges overlap or are not ordered"
assert val_max < test_min, "Validation and test date ranges overlap or are not ordered"

print("✅ Phase 1 & 2 configuration and time splits look consistent.")

DATA_PATH: ..\data\processed\district_month_modeling.csv
TARGET_COL: total_enrolment
DATE_COL : month_date
build_b2_log_ctx_opt_features: <function build_b2_log_ctx_opt_features at 0x000001E7E09AD6C0>
make_time_splits: <function make_time_splits at 0x000001E7E0FF6C00>
Train: 2025-04-01 00:00:00 -> 2025-09-01 00:00:00 | rows: 498
Val: 2025-10-01 00:00:00 -> 2025-10-01 00:00:00 | rows: 962
Test: 2025-11-01 00:00:00 -> 2025-12-01 00:00:00 | rows: 1036
✅ Phase 1 & 2 configuration and time splits look consistent.


In [11]:
"""
PROMPT FOR COPILOT – Phase 3 (Final features for B2_log_ctx_opt)

Context:
- Phase 1 and Phase 2 are passing (imports + time-based splits).
- We now have: train_df, val_df, test_df with correct date ranges.
- All feature logic must live in build_b2_log_ctx_opt_features in uidai_features.py.
- We are NOT creating a new model, only improving B2_log_ctx_opt.

Tasks for this cell:
1) Call build_b2_log_ctx_opt_features on each split:
   - (X_train, y_train) from train_df
   - (X_val,   y_val)   from val_df
   - (X_test,  y_test)  from test_df

2) Print shapes of all X_*/y_* for a quick overview.

3) Run sanity checks:
   - len(X_*) == len(y_*) for train/val/test
   - No NaNs in X_train, X_val, X_test
   - Optionally print first few feature columns so we see time / policy / segment features.

If any check fails, we will fix build_b2_log_ctx_opt_features in src/uidai_features.py and rerun.
"""

X_train, y_train = build_b2_log_ctx_opt_features(train_df, TARGET_COL, DATE_COL)
X_val, y_val = build_b2_log_ctx_opt_features(val_df, TARGET_COL, DATE_COL)
X_test, y_test = build_b2_log_ctx_opt_features(test_df, TARGET_COL, DATE_COL)

print("X_train:", X_train.shape, "y_train:", y_train.shape)
print("X_val  :", X_val.shape, "y_val  :", y_val.shape)
print("X_test :", X_test.shape, "y_test :", y_test.shape)

for name, X, y in [
    ("train", X_train, y_train),
    ("val", X_val, y_val),
    ("test", X_test, y_test),
]:
    assert len(X) == len(y), f"Length mismatch in {name} set"
    assert not pd.isna(X).any().any(), f"NaNs found in X_{name}"

print("\nFeature columns:", list(X_train.columns[:10]), "...")
print("✅ Phase 3 features ready for XGBoost B2_log_ctx_opt.")

X_train: (486, 31) y_train: (486,)
X_val  : (950, 31) y_val  : (950,)
X_test : (1024, 31) y_test : (1024,)

Feature columns: ['state', 'district', 'year_month', 'age_0_5', 'age_5_17', 'age_18_greater', 'demo_age_5_17', 'demo_age_17_', 'bio_age_5_17', 'bio_age_17_'] ...
✅ Phase 3 features ready for XGBoost B2_log_ctx_opt.
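
Each split has 12 fewer rows after feature building than before it (498 to 486, 962 to 950, 1036 to 1024). The notebook does not say which rows are removed or why. A small check like the one below could list them, assuming the feature builder keeps the original index labels of the rows it retains. That assumption is untested here, and `report_dropped_rows` is a hypothetical helper, not part of the project modules.

```python
import pandas as pd


def report_dropped_rows(raw_df, X, name):
    """Return the rows of raw_df whose index labels do not appear in X."""
    dropped = raw_df.loc[raw_df.index.difference(X.index)]
    print(f"{name}: {len(raw_df)} raw rows, {len(X)} feature rows, {len(dropped)} dropped")
    return dropped


# Toy example (illustrative values): pretend feature building kept rows 1 and 3 only.
raw = pd.DataFrame({"state": ["A", "A", "B", "B"], "total_enrolment": [10, 12, 7, 9]})
feats = raw.iloc[[1, 3]]

dropped = report_dropped_rows(raw, feats, "toy")
print(dropped)
```

In the notebook the call would look like `report_dropped_rows(test_df, X_test, "test")`. If the builder resets the index, the check has to be done on key columns such as state, district and month instead.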


In [13]:
"""
Phase 4 – Train and tune XGBoost (B2_log_ctx_opt).

Goal:
Train and tune the EXISTING B2_log_ctx_opt model using XGBoost on the Phase 3 features,
without creating any new model family. We want:
- Higher R², lower MAE/RMSE/MAPE on the validation split.
- A clean, simple training cell that judges can read.
- No data leakage (respect the time-based split from Phase 2).

Approach:
- Manual hyperparameter loop (fit on train, eval on val) – explicit and judge-friendly.
- Sample up to 30 combinations to keep runtime reasonable.
- Encode categorical columns to numeric for XGBoost compatibility.
"""

import random
from sklearn.preprocessing import LabelEncoder

# ── Encode categorical columns ─────────────────────────────────────────────
cat_cols = ["state", "district", "year_month", "volume_bucket"]
encoders = {}

X_train_enc = X_train.copy()
X_val_enc = X_val.copy()
X_test_enc = X_test.copy()

for col in cat_cols:
    if col in X_train_enc.columns:
        le = LabelEncoder()
        # Fit on all data to handle unseen categories in val/test
        all_values = pd.concat([X_train_enc[col].astype(str), 
                                 X_val_enc[col].astype(str), 
                                 X_test_enc[col].astype(str)]).unique()
        le.fit(all_values)
        
        X_train_enc[col] = le.transform(X_train_enc[col].astype(str))
        X_val_enc[col] = le.transform(X_val_enc[col].astype(str))
        X_test_enc[col] = le.transform(X_test_enc[col].astype(str))
        encoders[col] = le

print("Encoded categorical columns:", list(encoders.keys()))
print("X_train_enc dtypes:", X_train_enc.dtypes.value_counts().to_dict())

# ── Define hyperparameter search space ─────────────────────────────────────
param_grid = {
    "n_estimators": [300, 500, 700],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.03, 0.06],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "reg_lambda": [1.0, 5.0],
    "reg_alpha": [0.0, 0.5],
}

# Generate all combinations and sample 30
all_combos = [
    {
        "n_estimators": n,
        "max_depth": d,
        "learning_rate": lr,
        "subsample": ss,
        "colsample_bytree": cs,
        "reg_lambda": rl,
        "reg_alpha": ra,
    }
    for n in param_grid["n_estimators"]
    for d in param_grid["max_depth"]
    for lr in param_grid["learning_rate"]
    for ss in param_grid["subsample"]
    for cs in param_grid["colsample_bytree"]
    for rl in param_grid["reg_lambda"]
    for ra in param_grid["reg_alpha"]
]

random.seed(RANDOM_STATE)
sampled_combos = random.sample(all_combos, min(30, len(all_combos)))
print(f"\nTesting {len(sampled_combos)} hyperparameter combinations...")

# ── Manual tuning loop ─────────────────────────────────────────────────────
best_params = None
best_val_mae = float("inf")
best_model = None
results = []

for i, params in enumerate(sampled_combos, 1):
    full_params = {
        **params,
        "objective": "reg:squarederror",
        "random_state": RANDOM_STATE,
        "tree_method": "hist",
    }
    
    model = XGBRegressor(**full_params)
    model.fit(X_train_enc, y_train)
    
    y_pred_val = model.predict(X_val_enc)
    val_metrics = compute_regression_metrics(y_val, y_pred_val)
    val_mae = val_metrics["mae"]
    
    results.append({**params, "val_mae": val_mae, "val_r2": val_metrics["r2"]})
    
    if val_mae < best_val_mae:
        best_val_mae = val_mae
        best_params = full_params
        best_model = model
    
    if i % 10 == 0:
        print(f"  [{i}/{len(sampled_combos)}] Current best MAE: {best_val_mae:.2f}")

# ── Report best configuration ──────────────────────────────────────────────
print("\n" + "="*60)
print("BEST HYPERPARAMETERS:")
for k, v in best_params.items():
    if k not in ["objective", "random_state", "tree_method"]:
        print(f"  {k}: {v}")

# Recompute metrics for the best model
y_pred_train = best_model.predict(X_train_enc)
y_pred_val = best_model.predict(X_val_enc)

train_metrics = compute_regression_metrics(y_train, y_pred_train)
val_metrics = compute_regression_metrics(y_val, y_pred_val)

print("\nTRAIN METRICS:")
print(f"  MAE:  {train_metrics['mae']:.2f}")
print(f"  RMSE: {train_metrics['rmse']:.2f}")
print(f"  MAPE: {train_metrics['mape']:.2%}")
print(f"  R²:   {train_metrics['r2']:.4f}")

print("\nVALIDATION METRICS:")
print(f"  MAE:  {val_metrics['mae']:.2f}")
print(f"  RMSE: {val_metrics['rmse']:.2f}")
print(f"  MAPE: {val_metrics['mape']:.2%}")
print(f"  R²:   {val_metrics['r2']:.4f}")

print("\n✅ Phase 4 complete – best_model ready for test evaluation.")

Encoded categorical columns: ['state', 'district', 'year_month', 'volume_bucket']
X_train_enc dtypes: {dtype('float64'): 22, dtype('int32'): 5, dtype('int64'): 4}

Testing 30 hyperparameter combinations...
  [10/30] Current best MAE: 109.60
  [20/30] Current best MAE: 108.28
  [30/30] Current best MAE: 108.28

BEST HYPERPARAMETERS:
  n_estimators: 500
  max_depth: 3
  learning_rate: 0.03
  subsample: 0.8
  colsample_bytree: 1.0
  reg_lambda: 1.0
  reg_alpha: 0.5

TRAIN METRICS:
  MAE:  24.24
  RMSE: 32.44
  MAPE: 2854.10%
  R²:   0.9997

VALIDATION METRICS:
  MAE:  108.28
  RMSE: 190.03
  MAPE: 5842.57%
  R²:   0.9654

✅ Phase 4 complete – best_model ready for test evaluation.
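
The MAPE values above are in the thousands of percent while MAE is small and R² is high. One plausible explanation (not verified against this dataset) is that some target values are close to zero: MAPE divides each absolute error by the true value, so a handful of tiny targets can dominate the average. The sketch below illustrates the effect with a hypothetical `sketch_regression_metrics` that returns the same keys the notebook reads from `compute_regression_metrics`. The real helper in `uidai_utils.py` is not shown and may differ.

```python
import numpy as np


def sketch_regression_metrics(y_true, y_pred):
    """Hypothetical stand-in returning mae, rmse, mape (as a fraction) and r2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    # MAPE is undefined when y_true contains zeros; this toy data has none.
    mape = float(np.mean(np.abs(err) / np.abs(y_true)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}


# Toy targets (illustrative only): one near-zero value, constant error of 20.
y_true_toy = np.array([0.1, 100.0, 1000.0, 5000.0])
y_pred_toy = y_true_toy + 20.0

toy_metrics = sketch_regression_metrics(y_true_toy, y_pred_toy)
print(f"MAE:  {toy_metrics['mae']:.2f}")
print(f"MAPE: {toy_metrics['mape']:.2%}")
print(f"R2:   {toy_metrics['r2']:.4f}")
```

Here every prediction is off by the same 20 units, yet the single target of 0.1 pushes MAPE above 5000% while R² stays above 0.999. If this is what happens in the UIDAI data, MAE and RMSE are the more informative numbers to compare across runs.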


In [14]:
"""
Phase 4 Summary – Tuning Results Report

Context:
- The previous cell performed a manual hyperparameter search over XGBoost
  and created:
    - best_model    -> fitted XGBRegressor with best params
    - best_params   -> dict of best hyperparameters
    - best_val_mae  -> best validation MAE value
    - train_metrics -> regression metrics on (X_train, y_train)
    - val_metrics   -> regression metrics on (X_val, y_val)

Goal of THIS cell:
- Print a clean, judge-friendly summary of tuning results.
- Do NOT retrain the model or rerun the whole search; just report results.
"""

print("\n" + "=" * 60)
print("B2_log_ctx_opt – Phase 4 Tuning Summary (XGBoost)")
print("=" * 60)

print(f"\n📊 Best Validation MAE: {best_val_mae:.2f}")

print("\n🔧 Best Hyperparameters:")
print("-" * 40)
for k, v in best_params.items():
    if k not in ["objective", "random_state", "tree_method"]:
        print(f"  {k:20s}: {v}")

print("\n📈 Model Performance:")
print("-" * 40)
print(f"{'Metric':<10} {'Train':>12} {'Validation':>12}")
print("-" * 40)
for metric in ["mae", "rmse", "mape", "r2"]:
    train_val = train_metrics[metric]
    val_val = val_metrics[metric]
    if metric == "mape":
        print(f"{metric.upper():<10} {train_val:>11.2%} {val_val:>11.2%}")
    elif metric == "r2":
        print(f"{metric.upper():<10} {train_val:>12.4f} {val_val:>12.4f}")
    else:
        print(f"{metric.upper():<10} {train_val:>12.2f} {val_val:>12.2f}")

print("\n✅ Phase 4 complete – best_model, best_params, and metrics are ready for Phase 5/6.")


B2_log_ctx_opt – Phase 4 Tuning Summary (XGBoost)

📊 Best Validation MAE: 108.28

🔧 Best Hyperparameters:
----------------------------------------
  n_estimators        : 500
  max_depth           : 3
  learning_rate       : 0.03
  subsample           : 0.8
  colsample_bytree    : 1.0
  reg_lambda          : 1.0
  reg_alpha           : 0.5

📈 Model Performance:
----------------------------------------
Metric            Train   Validation
----------------------------------------
MAE               24.24       108.28
RMSE              32.44       190.03
MAPE          2854.10%    5842.57%
R2               0.9997       0.9654

✅ Phase 4 complete – best_model, best_params, and metrics are ready for Phase 5/6.
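
The Phase 4 cell collects every trial in a `results` list but only reports the winner. The sketch below shows one way to build the same search space with `itertools.product` and to rank all sampled trials in a DataFrame. The grid is copied from the Phase 4 cell. The scoring function is a made-up placeholder so the example runs without data; it does not reproduce any score from this notebook.

```python
import itertools
import random

import pandas as pd

# Same search space as the Phase 4 cell.
param_grid = {
    "n_estimators": [300, 500, 700],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.03, 0.06],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "reg_lambda": [1.0, 5.0],
    "reg_alpha": [0.0, 0.5],
}

# Cartesian product of all value lists: 3 * 3 * 2 * 2 * 2 * 2 * 2 = 288 combinations.
all_combos = [
    dict(zip(param_grid, values))
    for values in itertools.product(*param_grid.values())
]

rng = random.Random(42)  # local RNG, so the global random state is untouched
sampled_combos = rng.sample(all_combos, min(30, len(all_combos)))


def placeholder_score(params):
    """Made-up stand-in for 'fit on train, MAE on val'. NOT a real model score."""
    return 100.0 + 5.0 * params["max_depth"] + 100.0 * params["learning_rate"]


results_toy = [{**p, "val_mae": placeholder_score(p)} for p in sampled_combos]
results_df = pd.DataFrame(results_toy).sort_values("val_mae").reset_index(drop=True)

print(len(all_combos), len(sampled_combos))  # 288 30
print(results_df.head())
```

In the notebook itself, `pd.DataFrame(results).sort_values("val_mae")` on the real `results` list would show how close the runner-up configurations are to the best one, which helps judge whether the chosen hyperparameters are stable or a lucky draw.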


In [None]:
"""
Phase 5 & 6 – Final evaluation and safety.

Steps:
- Fit best XGBoost on train+val
- Evaluate on test (metrics)
- Compute safety report by state group / volume bucket
"""

# Refit the tuned configuration on train+val. best_params (Phase 4) already
# contains objective, random_state and tree_method, so nothing is passed twice.
X_trainval_enc = pd.concat([X_train_enc, X_val_enc], axis=0)
y_trainval = np.concatenate([np.asarray(y_train), np.asarray(y_val)])

final_model = XGBRegressor(**best_params)
final_model.fit(X_trainval_enc, y_trainval)

# Test predictions come from the refit model. Train/val metrics reuse the
# Phase 4 predictions (y_pred_train, y_pred_val from the train-only model),
# so the validation score stays out-of-sample.
y_pred_test = final_model.predict(X_test_enc)

metrics = {
    "train": compute_regression_metrics(y_train, y_pred_train),
    "val": compute_regression_metrics(y_val, y_pred_val),
    "test": compute_regression_metrics(y_test, y_pred_test),
}

# Group by state and volume_bucket. The columns are taken from X_test
# (unencoded) rather than test_df: feature building kept 1024 of the 1036
# test rows, and groups_df is assumed to align row-for-row with y_test.
safety_df, safety_summary = compute_safety_report(
    y_true=y_test,
    y_pred=y_pred_test,
    groups_df=X_test[["state", "volume_bucket"]],
    mae_factor_threshold=1.5,
    mape_threshold=None,
)

save_metrics_and_safety(metrics, safety_df, OUTPUT_DIR)

metrics, safety_summary
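
`compute_safety_report` is defined in `uidai_utils.py` and is not shown here. To make the "at least 80% of groups have acceptable error" goal concrete, the sketch below gives one possible reading of it: a group counts as safe when its MAE is at most `mae_factor_threshold` times the overall MAE, and the model passes when the share of safe groups reaches 80%. The function name, the `min_safe_share` parameter and the definition of "safe" are all assumptions. The real helper may use a different rule.

```python
import numpy as np
import pandas as pd


def sketch_safety_report(y_true, y_pred, groups_df,
                         mae_factor_threshold=1.5, min_safe_share=0.80):
    """Hypothetical group-level check: per-group MAE versus a multiple of overall MAE."""
    frame = groups_df.reset_index(drop=True).copy()
    group_cols = list(frame.columns)
    frame["abs_err"] = np.abs(
        np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    )

    overall_mae = float(frame["abs_err"].mean())
    per_group = (
        frame.groupby(group_cols)["abs_err"]
        .agg(mae="mean", n="size")
        .reset_index()
    )
    per_group["safe"] = per_group["mae"] <= mae_factor_threshold * overall_mae

    safe_share = float(per_group["safe"].mean())
    summary = {
        "overall_mae": overall_mae,
        "safe_share": safe_share,
        "passes": safe_share >= min_safe_share,
    }
    return per_group, summary


# Toy data (illustrative only): three groups, one with much larger errors.
toy_groups = pd.DataFrame({
    "state": ["A", "A", "A", "A", "B", "B"],
    "volume_bucket": ["low", "low", "high", "high", "low", "low"],
})
toy_true = np.array([100.0, 100.0, 200.0, 200.0, 50.0, 50.0])
toy_pred = toy_true + np.array([1.0, 1.0, 2.0, 2.0, 10.0, 10.0])

toy_safety_df, toy_summary = sketch_safety_report(toy_true, toy_pred, toy_groups)
print(toy_safety_df)
print(toy_summary)
```

In the toy data two of the three groups fall under the threshold, so the safe share is about 67% and the check fails. Comparing each group to the overall MAE is only one option; a fixed absolute tolerance per volume bucket would be another, and the choice changes which groups are flagged.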