# RABO AGB Model Analysis

Analysis of all exported AGB models from the RABO AGB Adoption folder.
There are 4 distinct models across two channels:
- **Outbound (Email):** BE_Email, PA_Email
- **Inbound (RBB/RBO):** BE_RBB, PA_RBO

Each model has multiple weekly snapshots over time.

This notebook:
1. Loads all exported JSON files and creates `ADMTreesModel` objects
2. Extracts CDH_ADM005-style metrics from each model via the `.metrics` property
3. Plots all metric trends over time per model, with Inbound/Outbound visually distinguished
4. Shows `tree_stats` for one model as an example

**Reference:** Pega CDH_ADM005 telemetry event specification (see Pega platform documentation).

**HTML export** (with interactive Plotly charts):
```bash
uv run python -m ipykernel install --user --name pega-datascientist-tools  # one-time setup
PLOTLY_RENDERER=notebook uv run python -m jupyter nbconvert --to html --execute --no-input \
  --ExecutePreprocessor.kernel_name=pega-datascientist-tools \
  examples/adm/RABO_AGB_Model_Analysis.ipynb
```

In [None]:
import re
import os
from pathlib import Path

import polars as pl
import plotly.express as px
import plotly.io as pio

from pdstools.adm.ADMTrees import ADMTreesModel

# For HTML export, set PLOTLY_RENDERER=notebook before running nbconvert.
# In VS Code, leave unset for native rendering.
if os.environ.get("PLOTLY_RENDERER"):
    pio.renderers.default = os.environ["PLOTLY_RENDERER"]

## Discover and load all exported model files

In [None]:
models_dir = (
    Path.home()
    / "Library/CloudStorage/OneDrive-PegasystemsInc"
    / "PRD - 1-1 Customer Engagement Alliance - Machine Learning"
    / "EasyPz/Adaptive gradient boosting/RABO AGB Adoption/Exported RABO GB models"
)

json_files = sorted(models_dir.glob("export_*.json"))
print(f"Found {len(json_files)} exported model files")
for f in json_files[:3]:
    print(f"  {f.name}")
# Only print an ellipsis when the listing above was actually truncated
if len(json_files) > 3:
    print("  ...")

## Load all models and extract metrics

For each JSON file, create an `ADMTreesModel` object and read its `.metrics` property to get
CDH_ADM005-style diagnostic metrics. The model name is parsed from the filename; the snapshot
time is taken from the `factory_update_time` metric rather than from the filename timestamp.
A **channel** column is derived from the model name: Email models → Outbound, RBB/RBO → Inbound.
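
A minimal sketch of how the filename convention can be parsed, using a hypothetical filename. `parse_export_name` is an illustrative helper, not part of `pdstools`, and the timestamp is decoded here only to show its format:

```python
import re
from datetime import datetime

# Same convention as the export files: export_<MODEL_NAME>_<TIMESTAMP>.json
FILENAME_PATTERN = re.compile(r"^export_(.+)_(\d{8}T\d{6})\.json$")


def parse_export_name(file_name):
    """Return (model_name, timestamp, channel), or None if the name does not match."""
    m = FILENAME_PATTERN.match(file_name)
    if m is None:
        return None
    model_name = m.group(1)
    stamp = datetime.strptime(m.group(2), "%Y%m%dT%H%M%S")
    # Email models are Outbound; everything else (RBB/RBO) is treated as Inbound
    channel = "Outbound (Email)" if "Email" in model_name else "Inbound (RBB/RBO)"
    return model_name, stamp, channel


# Hypothetical filename, for illustration only
print(parse_export_name("export_BE_Email_20240101T120000.json"))
```

Because the model name itself contains underscores, the greedy `(.+)` group relies on the fixed-width timestamp at the end of the name to find the right split point.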

In [None]:
# Pattern: export_<MODEL_NAME>_<TIMESTAMP>.json
# Only the model name (group 1) is used below; the timestamp group just validates the name.
filename_pattern = re.compile(r"^export_(.+)_(\d{8}T\d{6})\.json$")

records = []
first_trees = {}  # store first tree object per model for example below

for fp in json_files:
    m = filename_pattern.match(fp.name)
    if not m:
        print(f"Skipping {fp.name} - does not match expected pattern")
        continue
    model_name = m.group(1)

    trees = ADMTreesModel(str(fp))
    # Copy before adding keys so the object returned by `.metrics` is not modified
    metrics = dict(trees.metrics)
    metrics["model_name"] = model_name
    metrics["file_name"] = fp.name
    records.append(metrics)

    if model_name not in first_trees:
        first_trees[model_name] = trees

df = (
    pl.DataFrame(records)
    .with_columns(
        pl.col("factory_update_time")
        .str.to_datetime("%Y-%m-%dT%H:%M:%S%.fZ")
        .alias("snapshot_time"),
        # Classify channel: Email = Outbound, RBB/RBO = Inbound
        pl.when(pl.col("model_name").str.contains("Email"))
        .then(pl.lit("Outbound (Email)"))
        .otherwise(pl.lit("Inbound (RBB/RBO)"))
        .alias("channel"),
    )
    .sort("model_name", "snapshot_time")
)

print(f"Loaded {len(df)} model snapshots across {df['model_name'].n_unique()} models")
print(f"Models: {df['model_name'].unique().sort().to_list()}")
print(f"Channels: {df['channel'].unique().sort().to_list()}")

## Summary table

In [None]:
df.select(
    "model_name",
    "channel",
    "snapshot_time",
    "auc",
    "success_rate",
    "number_of_trees",
    "number_of_tree_nodes",
    "total_number_of_active_predictors",
    "response_positive_count",
    "response_negative_count",
)

## Metric trend plots

Plot all numeric CDH_ADM005 metrics over time, grouped by model.
- **Solid lines** = Inbound (RBB/RBO)
- **Dashed lines** = Outbound (Email)
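
The per-metric plots are drawn from a long-format table with one row per model, snapshot, and metric. As a dependency-free sketch of that reshaping (the `unpivot` helper and the numbers below are made up for illustration; the notebook itself uses the polars `unpivot` method):

```python
def unpivot(rows, index, on):
    """Turn wide rows into long rows with 'metric' and 'value' columns."""
    long_rows = []
    for column in on:
        for row in rows:
            # Keep the identifying columns, then add one metric/value pair
            record = {key: row[key] for key in index}
            record["metric"] = column
            record["value"] = row[column]
            long_rows.append(record)
    return long_rows


# Made-up values, for illustration only
wide = [
    {"model_name": "BE_Email", "auc": 0.61, "number_of_trees": 40},
    {"model_name": "PA_RBO", "auc": 0.72, "number_of_trees": 55},
]

long_rows = unpivot(wide, index=["model_name"], on=["auc", "number_of_trees"])
for record in long_rows:
    print(record)
```

Each metric can then be selected with a single filter on the `metric` column, which is what the plotting loop does.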

In [None]:
# Identify plottable metric columns (numeric dtype, excluding identifier and time columns)
exclude = {"model_name", "file_name", "snapshot_time", "factory_update_time", "channel"}
metric_cols = [
    c for c in df.columns
    if c not in exclude and df[c].dtype in (pl.Float64, pl.Int64, pl.UInt32)
]
print(f"Plotting {len(metric_cols)} metrics: {metric_cols}")

In [None]:
# Unpivot into long format (one row per model, snapshot, and metric) for per-metric plotting
df_long = df.select(
    "model_name", "channel", "snapshot_time", *metric_cols
).unpivot(
    index=["model_name", "channel", "snapshot_time"],
    on=metric_cols,
    variable_name="metric",
    value_name="value",
)

In [None]:
# Group metrics by category for cleaner plots
metric_groups = {
    "Model Performance": ["auc", "success_rate"],
    "Data Quality": ["response_positive_count", "response_negative_count"],
    "Model Complexity": [
        "number_of_trees",
        "number_of_tree_nodes",
        "tree_depth_max",
        "tree_depth_avg",
        "tree_depth_std",
        "number_of_stump_trees",
        "avg_leaves_per_tree",
    ],
    "Splits by Predictor Type": [
        "number_of_splits_on_ih_predictors",
        "number_of_splits_on_context_key_predictors",
        "number_of_splits_on_other_predictors",
    ],
    "Predictor Counts": [
        "total_number_of_active_predictors",
        "total_number_of_predictors",
        "number_of_active_ih_predictors",
        "total_number_of_ih_predictors",
        "number_of_active_context_key_predictors",
    ],
    "Predictor Types": [
        "number_of_active_symbolic_predictors",
        "total_number_of_symbolic_predictors",
        "number_of_active_numeric_predictors",
        "total_number_of_numeric_predictors",
    ],
    "Gain Distribution": [
        "total_gain",
        "mean_gain_per_split",
        "median_gain_per_split",
        "max_gain_per_split",
        "gain_std",
    ],
    "Leaf Scores": [
        "number_of_leaves",
        "leaf_score_mean",
        "leaf_score_std",
        "leaf_score_min",
        "leaf_score_max",
    ],
    "Split Types": [
        "number_of_numeric_splits",
        "number_of_symbolic_splits",
        "symbolic_split_fraction",
        "number_of_unique_splits",
        "split_reuse_ratio",
        "avg_symbolic_set_size",
    ],
    "Learning Convergence": [
        "mean_abs_score_first_10",
        "mean_abs_score_last_10",
        "score_decay_ratio",
        "mean_gain_first_half",
        "mean_gain_last_half",
    ],
    "Feature Importance Concentration": [
        "top_predictor_gain_share",
        "predictor_gain_entropy",
    ],
}

# Load metric descriptions from the library
from IPython.display import display, Markdown
descriptions = ADMTreesModel.metric_descriptions()

for group_name, cols in metric_groups.items():
    available = [c for c in cols if c in metric_cols]
    if not available:
        continue
    for metric_name in available:
        desc = descriptions.get(metric_name, "")
        display(Markdown(f"**{metric_name.replace('_', ' ').title()}** — {desc}"))
        subset = df_long.filter(pl.col("metric") == metric_name)
        title = f"{group_name}: {metric_name.replace('_', ' ')}"
        fig = px.line(
            subset.to_pandas(),
            x="snapshot_time",
            y="value",
            color="model_name",
            line_dash="channel",
            markers=True,
            title=title,
            labels={
                "snapshot_time": "Snapshot Time",
                "value": metric_name.replace("_", " "),
                "model_name": "Model",
            },
            template="none",
            category_orders={"channel": ["Inbound (RBB/RBO)", "Outbound (Email)"]},
        )
        fig.update_layout(height=350)
        fig.show()

## Individual metric: AUC over time

Solid = Inbound, Dashed = Outbound

In [None]:
fig = px.line(
    df.to_pandas(),
    x="snapshot_time",
    y="auc",
    color="model_name",
    line_dash="channel",
    markers=True,
    title="AUC over time per model",
    labels={"snapshot_time": "Snapshot Time", "auc": "AUC", "model_name": "Model"},
    template="none",
    category_orders={"channel": ["Inbound (RBB/RBO)", "Outbound (Email)"]},
)
fig.update_yaxes(range=[0.5, 1.0])
fig.update_layout(height=500)
fig.show()

## Example: Tree statistics for the first model

Show `metrics` and `tree_stats` for the first loaded snapshot of the alphabetically first model.
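
The "first" snapshot here is the first file encountered per model when the filenames are sorted. Because the timestamp in the name is fixed-width (`YYYYMMDDTHHMMSS`), sorting the names of one model lexicographically also sorts them chronologically. A small sketch with hypothetical filenames:

```python
# Hypothetical filenames, deliberately out of order
names = [
    "export_PA_RBO_20240108T000000.json",
    "export_BE_Email_20240108T000000.json",
    "export_BE_Email_20240101T000000.json",
    "export_PA_RBO_20240101T000000.json",
]

first_seen = {}
for name in sorted(names):
    # Drop the trailing "_<TIMESTAMP>.json" and the leading "export_"
    model = name.rsplit("_", 1)[0].removeprefix("export_")
    # setdefault keeps the first (earliest) file per model
    first_seen.setdefault(model, name)

print(first_seen)
```

This assumes Python 3.9 or later for `str.removeprefix`.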

In [None]:
first_model_name = sorted(first_trees.keys())[0]
example_trees = first_trees[first_model_name]

print(f"Model: {first_model_name}")
print(f"Number of trees: {len(example_trees.model)}")

In [None]:
example_trees.metrics

In [None]:
example_trees.tree_stats