In [0]:
!pip install kaggle
!pip install shap
!pip install -U scikit-learn
!pip install category_encoders
!pip install ydata-profiling

In [0]:
import pandas as pd
import numpy as np
import os
 
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, TargetEncoder, StandardScaler, FunctionTransformer, RobustScaler
from category_encoders import QuantileEncoder
from category_encoders.wrapper import NestedCVWrapper
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.feature_selection import VarianceThreshold, SelectFromModel
 
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor, StackingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error

import matplotlib.pyplot as plt
plt.rcParams.update({"figure.max_open_warning": 0, "figure.dpi": 100})
import shap

import joblib
import pickle

from pipeline_functions import CarDataCleaner, IndividualHierarchyImputer, CarFeatureEngineer, DebugTransformer, MajorityVoteSelectorTransformer, MutualInfoThresholdSelector, SpearmanRelevancyRedundancySelector, create_model_pipe, get_cv_results, model_hyperparameter_tuning, SetOutputCompatibleWrapper
from visualization_functions import plot_selector_agreement, plot_train_val_comparison

In [0]:
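# NOTE: the file name says "rf", but this section treats the loaded object as the
# tuned ExtraTrees pipeline; verify that the artifact matches before running.
et_tuned_pipe = joblib.load("rf_tuned_pipe.pkl")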

In [0]:
debug_preprocessor_pipe = joblib.load("debug_preprocessor_pipe.pkl")

### 10. Open-Ended Section

#### 10.1 SHAP Interpretability for Our Final Tree Model (Informative Only)

**a) Objective and motivation**

Once our end-to-end pipeline is fitted and tuned, we use **SHAP (SHapley Additive exPlanations)** to explain which features drive its predictions.

Goals:
- Identify the **most influential features** for the final tuned model (`et_tuned_pipe`).
- Validate whether feature effects are **plausible** (age, mileage, engine, etc.).
- Check how much **target encodings** and engineered interactions contribute.

Important: **SHAP does not change the model or feature set.** We do not build a new pipeline based on SHAP.

---

**b) Difficulty of the task**

This is non-trivial because SHAP must explain the model input **after** our preprocessing and feature selection:

- The model does not see raw columns. It sees engineered numeric features (e.g., interactions, relative features, logs), OHE columns, target-encoded columns, and only the reduced subset that survives **VT + majority voting**.
- We therefore reconstruct:
  - the exact **post-preprocess feature matrix**, and
  - aligned **feature names** after applying both selection masks (VT support + majority selector mask).
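- Because the model sits behind engineered preprocessing + selection, we explain the final estimator on the post-preprocessing matrix. The code tries `TreeExplainer` first and falls back to a model-agnostic explainer that treats `model.predict` as a **black box**; in our run this was a **PermutationExplainer** (robust but expensive).
- Runtime: SHAP is costly, so we explain only a **subsample** (`sample_size=1000`) with a small background set (at most 200 rows drawn from that subsample).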
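
The name-alignment step described above can be sketched with plain NumPy. This is a minimal illustration, not the project's implementation: the feature names, `vt_support`, and `vote_mask` are invented for the example, and in the notebook the final names are read from the fitted `fs` step instead.

```python
import numpy as np

# Hypothetical ColumnTransformer output names (illustrative only).
ct_names = np.array([
    "num__age", "num__constant_col", "log__mileage",
    "cat__transmission_Manual", "mean_te__model",
])

# Assumed boolean mask from the variance threshold (True = column kept).
vt_support = np.array([True, False, True, True, True])

# Assumed mask from the majority-vote selector. It is defined over the
# columns that survived the variance threshold, so it has 4 entries.
vote_mask = np.array([True, True, False, True])

# Apply the masks in the same order in which the selectors run.
names_after_vt = ct_names[vt_support]
final_names = names_after_vt[vote_mask]

# Applying the same two masks to the matrix keeps columns and names aligned.
X_ct = np.arange(15, dtype=float).reshape(3, 5)
X_final = X_ct[:, vt_support][:, vote_mask]

assert X_final.shape[1] == len(final_names)
print(final_names)
```

The key point is the order: the second mask indexes the already-reduced column set, so applying it to the original names would mislabel features.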

---

**c) Correctness and efficiency**

We kept the analysis correct and consistent with the production pipeline:

- **No leakage / no optimization loop:** SHAP is computed on the already-fitted `et_tuned_pipe` and used only for interpretation.
- **Exact alignment:** feature names come from the ColumnTransformer output and are then filtered by VT + majority voting masks.
- **Global SHAP importance:** features are ranked by mean absolute contribution:
  
  $$
  \mathrm{Importance}(\mathrm{feature}_j) = \frac{1}{N}\sum_{i=1}^{N} \left|\mathrm{SHAP}_{i,j}\right|
  $$

- **Efficient computation:** stable ranking via subsampling (PermutationExplainer on 1000 rows; runtime ~21 minutes in our run).
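
As a small worked example of the mean absolute contribution, the sketch below ranks three features from a toy SHAP matrix. The values and the names `feat_a`, `feat_b`, `feat_c` are invented for illustration and are unrelated to our results.

```python
import numpy as np

# Toy SHAP matrix: N = 4 rows (cars) x 3 features. Values are invented.
shap_matrix = np.array([
    [ 100.0, -20.0,  5.0],
    [-300.0,  40.0, -5.0],
    [ 200.0, -60.0,  0.0],
    [-400.0,  80.0, 10.0],
])
feature_names = ["feat_a", "feat_b", "feat_c"]  # hypothetical names

# Global importance: average of |SHAP| over rows, one value per feature.
importance = np.abs(shap_matrix).mean(axis=0)

# The signed mean is shown only for contrast: positive and negative
# contributions cancel, so it understates how much a feature matters.
signed_mean = shap_matrix.mean(axis=0)

# Rank features from most to least important.
order = np.argsort(importance)[::-1]
ranking = [(feature_names[j], float(importance[j])) for j in order]

print(ranking)
print(signed_mean)
```

Taking the absolute value before averaging is what makes this a magnitude-based ranking; the direction of each effect is read from the beeswarm plot instead.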

---

**d) Results and interpretation**

Model context:
- Final tuned model: `et_tuned_pipe` (**ExtraTrees**)
- Total features used after preprocessing + FS: **28**
- SHAP explainer used: **PermutationExplainer** (1001 iterations; ~21:34 total)

Top drivers (mean |SHAP|), excerpt:

| Feature | Importance | Interpretation |
|---|---:|---|
| `mean_te__model` | 1486.36 | Model-level mean target encoding (strong market-value proxy) |
| `median_te__model` | 1409.38 | Model-level median target encoding (strong market-value proxy) |
| `num__mpg_x_age` | 870.81 | Interaction capturing “efficiency x age” effects |
| `cat__transmission_Manual` | 846.46 | Manual transmission effect (dataset-dependent) |
| `num__age` | 764.79 | Direct age penalty / depreciation signal |
| `num__engineSize` | 655.15 | Engine size (segment/performance proxy) |
| `num__age_rel_brand` | 628.79 | Age relative to typical age inside the brand |
| `log__mileage` | 592.53 | Non-linear mileage effect (diminishing marginal impact) |
| `median_te__brand_trans` | 521.83 | Brand x transmission median target encoding |
| `num__age_rel_model` | 488.77 | Age relative to typical age within the model |
| `num__engine_per_mpg` | 303.93 | Performance/efficiency ratio proxy |
| `log__miles_per_year` | 223.94 | Usage intensity (mileage normalized by age) |

Key takeaways:
- **Target encodings still dominate** global importance (model-level mean/median TE). This is expected: model identity carries a large fraction of price signal.
- **Transmission became a top driver** in the ExtraTrees variant (`cat__transmission_Manual` ranks #4), suggesting stronger split usage on this categorical signal compared to the previous RF run.
- **Age and mileage remain major drivers**, and appear in intuitive forms (`num__age`, `log__mileage`, relative age features, and `log__miles_per_year`), supporting interpretability.
- **Engine/performance interactions matter** (`num__engineSize`, `num__engine_per_mpg`, `num__mpg_x_age`, `num__mpg_x_engine`), indicating feature engineering adds useful non-linear structure beyond raw variables.

Beeswarm plot (distribution of effects), main observations:
- **`mean_te__model` and `median_te__model` show the widest SHAP spread** → model identity (via target encoding) is the strongest pricing signal.
- **Manual transmission shows a clear directional pattern** in this tuned ExtraTrees model: `cat__transmission_Manual=1` tends to push predictions **down** (negative SHAP), while `=0` tends to push them **up** (positive SHAP), with heterogeneity explained by brand/model interactions.
- **Mileage is clearly non-linear** (`log__mileage`): low mileage contributes positively; high mileage pushes predictions down with diminishing marginal impact.
- **Age effects are consistent** (`num__age`, `num__age_rel_brand`, `num__age_rel_model`): being older (especially older than “typical” for brand/model) reduces predicted price.

---

**e) Alignment with objectives**

This section adds transparency without changing the modeling procedure:

- Feature selection stays **VT + majority voting** (robust and model-agnostic).
- SHAP is used **only** to explain the final tuned model (`et_tuned_pipe`).
- The resulting drivers (target encodings + age/mileage/engine + interactions + transmission) are consistent with domain logic and support trust in the final pipeline.


##### Functions

In [0]:
# Get Feature names aligned with X_proc (after preprocess incl. VT + majority voting)
def get_pipeline_feature_matrix(pipe, X, debug_preprocessor_pipe):
    """
    Given a fitted model pipeline with steps:
      'preprocess' -> 'model'
    where preprocess itself is a Pipeline:
      clean -> group_imputer -> fe -> ct -> fs(vt + selector)
    return:
      X_proc: 2D numpy array of features just before the model step
      feat_names: 1D np.array of feature names aligned with X_proc columns
    """
    pre = pipe.named_steps["preprocess"]

    # Transform X to the model-ready matrix; take the aligned feature names
    # from the 'fs' step of the separately fitted debug preprocessor
    X_proc = pre.transform(X)
    feat_names = debug_preprocessor_pipe.named_steps['fs'].get_feature_names_out()

    return X_proc, feat_names


In [0]:
# Compute SHAP Importance
def compute_shap_importance(
    pipe,
    X,
    sample_size=1000,
    seed=rs,
    model_name=None,
):
    """
    Compute global SHAP feature importances for a fitted pipeline (informative only).

    Fix:
      - TreeExplainer additivity check can fail for some sklearn tree implementations (incl. HGB).
        We disable it via check_additivity=False.
      - If TreeExplainer still fails, fall back to a model-agnostic SHAP explainer.
    """
    # Extract processed feature matrix and names
    X_proc, feat_names = get_pipeline_feature_matrix(pipe, X, debug_preprocessor_pipe)

    # Subsample rows for SHAP (for speed)
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(X_proc))
    idx = rng.choice(len(X_proc), n, replace=False)
    X_sample = X_proc[idx]

    # Underlying model (last step in pipeline)
    model = pipe.named_steps["model"]
    tag = model_name or model.__class__.__name__

    # Background for SHAP (small subset)
    bg_n = min(200, len(X_sample))
    bg_idx = rng.choice(len(X_sample), bg_n, replace=False)
    X_bg = X_sample[bg_idx]

    # Try TreeExplainer first (fast for tree models)
    try:
        explainer = shap.TreeExplainer(model, X_bg)
        shap_vals = explainer.shap_values(X_sample, check_additivity=False)

        # shap_vals can be list-like in some setups; regression should be 2D
        if isinstance(shap_vals, list):
            shap_vals = shap_vals[0]

        base_vals = getattr(explainer, "expected_value", 0.0)
        shap_values = shap.Explanation(
            values=shap_vals,
            base_values=np.full((len(X_sample),), base_vals) if np.isscalar(base_vals) else base_vals,
            data=X_sample,
            feature_names=feat_names,
        )

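    except Exception as e:
        # Fallback: model-agnostic explainer (slower but robust).
        # Report the failure instead of swallowing it silently.
        print(f"TreeExplainer failed for {tag} ({type(e).__name__}: {e}); using model-agnostic fallback.")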
        explainer = shap.Explainer(model.predict, X_bg, feature_names=feat_names)
        shap_values = explainer(X_sample)

    importance = np.abs(shap_values.values).mean(axis=0)

    shap_df = (
        pd.DataFrame({"feature": feat_names, "importance": importance})
        .sort_values("importance", ascending=False)
        .reset_index(drop=True)
    )

    print(f"Features by SHAP for {tag}:")
    print(shap_df.head(40).to_string(index=False))

    return shap_df, feat_names, shap_values, X_sample


In [0]:
# SHAP Plots
def plot_top_shap_bar(shap_df, model_name, top_k):
    """
    Horizontal bar plot of top_k features by mean |SHAP|.
    """
    top_df = shap_df.head(top_k).iloc[::-1]  # reverse for nicer barh order
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(top_df["feature"], top_df["importance"])
    ax.set_xlabel("Average |SHAP| value")
    ax.set_title(f"Top {top_k} features by SHAP – {model_name}")
    plt.tight_layout()
    plt.show()


def plot_shap_beeswarm(shap_values, X_sample, feat_names, model_name, max_display=20):
    """
    SHAP summary (beeswarm) plot for top features.
    """
    X_df = pd.DataFrame(X_sample, columns=feat_names)

    # Create one figure and tell SHAP not to auto-show
    plt.figure(figsize=(10, 6))
    shap.summary_plot(shap_values.values, X_df, max_display=max_display, show=False)

    plt.title(f"SHAP Beeswarm – {model_name}")
    plt.tight_layout()
    plt.show()


##### SHAP of Best Model

In [0]:
# Tuned ExtraTrees: feature-space report + SHAP
et_pipe = et_tuned_pipe

# Feature matrix + names after preprocess (clean+impute+fe+ct+fs)
X_proc_et, feat_names_et = get_pipeline_feature_matrix(et_pipe, X_train, debug_preprocessor_pipe)
n_features_total_et = X_proc_et.shape[1]

print("ExtraTrees (tuned pipe) - feature space info:")
print(f"Total features used: {n_features_total_et}")

shap_importance_et, feat_names_et, shap_vals_et, X_sample_et = compute_shap_importance(
    et_pipe,
    X_train,
    sample_size=1000,
    seed=rs,
    model_name="ExtraTrees",
)

plot_top_shap_bar(shap_importance_et, model_name="ExtraTrees", top_k=n_features_total_et)
plot_shap_beeswarm(shap_vals_et, X_sample_et, feat_names_et, model_name="ExtraTrees", max_display=n_features_total_et)

#### 10.2 Global vs Brand- and Model-Specific Models

**a) Objective and motivation**

We investigated how far Cars4You should specialize its pricing models:

1. **Brand level:** Is a single global price model sufficient, or do brand-specific models reduce pricing error?
2. **Brand–model level:** For frequent (brand, model) segments (e.g. “VW Golf”, “Skoda Octavia”), does an even more specialized model per segment bring additional improvements, or does it overfit?

Starting point is our final tuned production pipeline **`et_tuned_pipe`** (full preprocessing + tuned **ExtraTrees** regressor). We compare:

- **Global model:** trained on all cars, evaluated only on a given segment.
- **Brand-specific model:** same pipeline structure and hyperparameters, fitted only on cars of a given brand.
- **Brand–model-specific model:** same pipeline structure and hyperparameters, fitted only on cars of a given (brand, model) pair.

We measured **MAE** and **RMSE** per segment using **5-fold cross-validation**. This quantifies the gain/loss when moving from:

> one global model → several brand models → many brand–model models.

---

**b) Difficulty of the tasks**

This multi-level comparison is not easy because it requires leakage-free, segment-wise evaluation inside cross-validation:

- **Per-segment metrics inside CV (not a single score):** each fold must report MAE/RMSE *for specific brands/pairs* on the fold’s validation set.
- **Fair protocol for global vs specialized models (within each fold):**
  - global model is trained on all training rows, but evaluated only on validation rows belonging to the segment;
  - segment-specific model is trained and evaluated only on that segment’s rows.
- **Imbalanced / small segments:** data is heavily skewed across brands/models, so we enforce minimum segment sizes:
  - **brand level:** only brands with **≥ 500** samples overall;
  - **brand–model level:** only pairs with **≥ 80** samples overall;
  - plus per-fold minimum training-size checks to avoid unstable tiny fits.
- **Cleaning labels before grouping:** inconsistent text labels (casing/spacing/typos) can split real segments; we normalize `(brand, model)` once using the same cleaning logic as in the pipeline and use cleaned labels for masks.
- **Manual RMSE:** computed as `sqrt(MSE)` inside the CV loops, so the metric does not depend on which RMSE helper the installed scikit-learn version provides.

---

**c) Correctness and efficiency of implementation**

We kept the evaluation correct and efficient:

- **Leakage-free out-of-fold evaluation:** in each fold the pipeline is fitted only on that fold’s training rows; metrics are computed only on validation rows.
- **Single CV design reused everywhere:** identical KFold splits (`n_splits=5`, `shuffle=True`, fixed `random_state`) are reused for all comparisons, making deltas directly comparable.
- **Efficient global baseline:** the global model is fitted **once per fold**, then reused to compute metrics for many brands/pairs in that fold (instead of refitting per segment).
- **Segmentation labels cleaned once up front:** the `(brand, model)` columns are normalized once before the CV loops so that segment masks group rows correctly; all other columns of the fold rows are left untouched.
- **Guards for tiny segments:** segments/folds with insufficient training samples are skipped to avoid unstable conclusions.
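
The "fit once per fold, score many segments" idea can be sketched without the real pipeline. Everything below is an assumption made for illustration: the data are synthetic, the segments `A`, `B`, `C` are made up, and a least-squares line stands in for `et_tuned_pipe`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (invented): one numeric feature and three segments.
n = 300
segment = rng.choice(np.array(["A", "B", "C"]), size=n)
x = rng.uniform(0, 10, size=n)
y = 3.0 * x + rng.normal(0, 1, size=n)

# Fixed 5-fold split, built once and reused for every comparison.
folds = np.array_split(rng.permutation(n), 5)

def fit_line(x_tr, y_tr):
    """Stand-in model: least-squares line (replaces the real pipeline)."""
    A = np.column_stack([x_tr, np.ones_like(x_tr)])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return coef

def predict_line(coef, x_new):
    return coef[0] * x_new + coef[1]

rows = []
n_fits = 0
for fold_id, va_idx in enumerate(folds, start=1):
    tr_idx = np.setdiff1d(np.arange(n), va_idx)
    coef = fit_line(x[tr_idx], y[tr_idx])    # one global fit per fold
    n_fits += 1
    y_pred = predict_line(coef, x[va_idx])   # one prediction pass per fold
    for seg in ["A", "B", "C"]:
        mask = segment[va_idx] == seg        # reuse the same predictions
        if mask.sum() == 0:
            continue
        err = y[va_idx][mask] - y_pred[mask]
        mae = float(np.abs(err).mean())
        rmse = float(np.sqrt((err ** 2).mean()))
        rows.append((fold_id, seg, mae, rmse))

print(n_fits, len(rows))
```

The number of fits equals the number of folds regardless of how many segments are scored, which is what keeps the global baseline cheap compared with refitting per segment.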

---

**d) Discussion of results**

- **Candidate brands:** all brands
- **Candidate (brand, model) pairs:** ≥ 80 samples; 100 pairs


**Brand-level comparison (global vs brand-specific, using `et_tuned_pipe`)**

Across the main brands, brand-specific training is **not consistently beneficial**. Even when some brands can improve, the typical trade-off remains:

- **Potential benefit:** brand-specific models can capture brand-local price structure if it differs materially from the full population.
- **Common downside:** less data and reduced diversity often outweigh specialization; the global model already captures brand effects via features (including encodings), so restricting to one brand can hurt generalization.
- **Practical interpretation:** brand-level specialization should be treated as **optional** and justified only when it shows **stable negative ΔMAE/ΔRMSE** in CV.

*(The brand-level table is omitted here; the full brand-level comparison is produced by the code in Section 10.2.2. The recommendation below reflects the ExtraTrees pipeline results and the observed specialization risk patterns.)*


**Brand–model comparison (global vs brand–model-specific, `et_tuned_pipe`)**

Results are clearly **mixed** even after filtering to pairs with ≥ 80 samples: some segments improve, while others degrade substantially.

**Top improvements (largest negative ΔMAE; specialized better than global):**

| (brand, model) | n | MAE_global | MAE_pair | ΔMAE | RMSE_global | RMSE_pair | ΔRMSE |
|---|---:|---:|---:|---:|---:|---:|---:|
| Audi q7 | 268 | 3186.8 | 3044.2 | **-142.6** | 4282.0 | 4110.4 | **-171.6** |
| Audi tt | 222 | 1852.2 | 1723.0 | **-129.3** | 2597.2 | 2368.0 | **-229.2** |
| Ford edge | 137 | 1106.1 | 1030.3 | **-75.7** | 1556.6 | 1420.7 | **-135.9** |
| Hyundai ioniq | 203 | 1491.9 | 1429.6 | **-62.3** | 1921.5 | 1883.2 | **-38.2** |
| Ford s-max | 201 | 1278.5 | 1242.4 | **-36.1** | 1684.1 | 1656.7 | **-27.3** |
| VW sharan | 180 | 1584.9 | 1563.3 | **-21.6** | 2112.6 | 2099.1 | **-13.6** |
| Mercedes c class | 5288 | 1904.9 | 1885.7 | **-19.3** | 2813.0 | 2867.8 | **+54.8** |
| Skoda octavia | 1021 | 944.5 | 925.4 | **-19.1** | 1333.9 | 1307.2 | **-26.6** |

Notes:
- Several segments show **consistent wins** (negative ΔMAE and negative ΔRMSE), e.g. Audi q7/tt, Ford edge, Skoda octavia.
- Some show a **MAE improvement but RMSE worsening** (e.g. Mercedes c class: ΔMAE < 0 but ΔRMSE > 0), which suggests a lower average error but worse tail behavior (a few larger outlier errors).
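
A small numeric example shows how MAE can fall while RMSE rises. The error values below are invented for illustration and do not come from our data; `errors_global` and `errors_pair` are hypothetical names.

```python
import numpy as np

def mae_rmse_from_errors(errors):
    """Return (MAE, RMSE) for a list of prediction errors."""
    errors = np.asarray(errors, dtype=float)
    mae = float(np.abs(errors).mean())
    rmse = float(np.sqrt((errors ** 2).mean()))
    return mae, rmse

# Invented errors (GBP) for five cars in one segment.
errors_global = [1000, 1000, 1000, 1000, 1000]  # uniformly mediocre
errors_pair = [500, 500, 500, 500, 2500]        # mostly better, one large miss

mae_g, rmse_g = mae_rmse_from_errors(errors_global)
mae_p, rmse_p = mae_rmse_from_errors(errors_pair)

# MAE drops from 1000 to 900, while RMSE rises from 1000 to roughly 1204,
# because squaring gives the single large miss much more weight.
print(mae_p - mae_g, rmse_p - rmse_g)
```

This is why both metrics are reported side by side: a negative ΔMAE alone does not rule out a heavier error tail.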

**Top degradations (largest positive ΔMAE; specialized worse than global):**

| (brand, model) | n | MAE_global | MAE_pair | ΔMAE | RMSE_global | RMSE_pair | ΔRMSE |
|---|---:|---:|---:|---:|---:|---:|---:|
| Audi a_unknown | 94 | 2099.5 | 2705.5 | **+606.1** | 2832.7 | 3498.3 | **+665.7** |
| Mercedes s class | 134 | 3986.0 | 4469.3 | **+483.3** | 6122.6 | 6485.4 | **+362.8** |
| Mercedes cls class | 160 | 1791.0 | 2175.7 | **+384.7** | 3246.5 | 3983.2 | **+736.7** |
| VW amarok | 83 | 2326.4 | 2598.5 | **+272.1** | 3065.4 | 3424.4 | **+359.0** |
| Mercedes gle class | 327 | 2059.5 | 2259.4 | **+199.9** | 3142.4 | 3419.7 | **+277.3** |
| VW touareg | 244 | 2193.4 | 2367.2 | **+173.8** | 3393.5 | 3820.3 | **+426.8** |
| Audi q5 | 594 | 1736.5 | 1903.2 | **+166.8** | 2427.5 | 2781.8 | **+354.3** |
| BMW x5 | 313 | 2314.0 | 2476.0 | **+162.0** | 3074.4 | 3489.7 | **+415.3** |

Interpretation:
- Even with ≥ 80 samples, **pair-level specialization can overfit** and lose beneficial cross-segment learning.
- Degradations are often large in both MAE and RMSE, indicating not just noise but materially worse generalization for some pairs.

---

**e) Alignment with objectives**

This study directly answers whether additional specialization layers are justified for deployment under real data constraints, using the final tuned pipeline `et_tuned_pipe` and a consistent leakage-free CV protocol.

**Deployment recommendation (based on ExtraTrees results and the observed Δ patterns):**
- Keep a **single global model** as the default (most robust across segments).
- Consider **brand-level specialization only selectively**, and only where it demonstrates **repeatable negative ΔMAE and/or ΔRMSE** under the same CV protocol.
- Consider **brand–model specialization only as a gated option**:
  - only for pairs with sufficient data (≥ 80 samples + per-fold minimum training size),
  - only if the pair shows **stable negative ΔMAE and not unacceptable ΔRMSE** (watch tail risk),
  - and with monitoring, because many pairs **degrade** substantially (positive Δ).
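
The gating idea can be sketched as a simple filter over the comparison results. The rows below are copied from the tables in this section; the helper `passes_gate` and the tolerance `MAX_DELTA_RMSE` are assumptions made for illustration, and the sketch does not check per-fold stability, which a real gate would also need.

```python
# (segment, n, delta_mae, delta_rmse), values copied from the tables above.
results = [
    ("Audi q7", 268, -142.6, -171.6),
    ("Audi tt", 222, -129.3, -229.2),
    ("Mercedes c class", 5288, -19.3, 54.8),
    ("Skoda octavia", 1021, -19.1, -26.6),
    ("Audi a_unknown", 94, 606.1, 665.7),
    ("VW amarok", 83, 272.1, 359.0),
]

MIN_SAMPLES = 80       # segment-size threshold used in this section
MAX_DELTA_RMSE = 0.0   # assumed tolerance: no RMSE worsening allowed

def passes_gate(n, delta_mae, delta_rmse):
    """Hypothetical gate: enough data, lower MAE, and no worse RMSE."""
    return n >= MIN_SAMPLES and delta_mae < 0 and delta_rmse <= MAX_DELTA_RMSE

selected = [
    name for name, n, d_mae, d_rmse in results
    if passes_gate(n, d_mae, d_rmse)
]
print(selected)
```

Under this strict tolerance, Mercedes c class is rejected despite its negative ΔMAE because its ΔRMSE is positive; loosening `MAX_DELTA_RMSE` would trade tail risk for a small average gain.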

This demonstrates not only a tuned final model, but also an evidence-based assessment of whether “more specialized” models actually improve pricing accuracy versus increasing complexity and overfitting risk.


#### 10.2.1 Load final tuned pipeline, define key columns, brand frequency and metric helpers

In [0]:
# Use the squared-error variant of the tuned pipeline for both the global and the
# segment-specific fits: training with absolute_error is disproportionately expensive.
pipe_template = et_tuned_pipe_squared_err

# Required inputs
brand_col = "brand"
model_col = "model"

assert brand_col in X_train.columns, f"Missing column '{brand_col}' in X_train."
assert model_col in X_train.columns, f"Missing column '{model_col}' in X_train."

In [0]:
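# Locate the CarDataCleaner step inside the pipeline template
cleaner = next(v for v in pipe_template.get_params(deep=True).values()
               if v.__class__.__name__ == "CarDataCleaner")

# Normalize brand/model labels once so that segment masks group rows consistently
X_seg = X_train.copy()
tmp = cleaner.fit_transform(X_train.copy(), y_train)
X_seg[[brand_col, model_col]] = tmp[[brand_col, model_col]]

# From here on, X_train carries the cleaned brand/model labels
X_train = X_seg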


In [0]:
# Inspect brand frequencies
brand_counts = X_train[brand_col].value_counts()
display(brand_counts.head(15).to_frame("count"))


# Select candidate brands
#    - TOP_K: max number of brands to compare.
#    - MIN_SAMPLES: minimum number of rows per brand.
TOP_K = 8
MIN_SAMPLES = 500  # ensures enough data for stable per-brand estimates

candidate_brands = (
    brand_counts[brand_counts >= MIN_SAMPLES]
    .head(TOP_K)
    .index
    .tolist()
)

print("Candidate brands:", candidate_brands)

In [0]:
# Cross-validation: same folds reused everywhere for fairness and reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)
splits = list(cv.split(X_train, y_train))

def mae_rmse(y_true, y_pred):
    """Return MAE and RMSE (RMSE computed as sqrt(MSE) for sklearn-compatibility)."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return mae, rmse

#### 10.2.2 Brand-level comparison (Global vs Brand-specific)

In [0]:
def eval_global_by_brand(pipe_template, X, y, brand_col, brands, splits):
    """
    Global model evaluation per brand (out-of-fold):
    - Fit 1 global model per fold on ALL training rows.
    - Compute MAE/RMSE only on validation rows belonging to each brand.
    """
    rows = []

    for fold, (tr_idx, va_idx) in enumerate(splits, start=1):
        X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
        y_tr, y_va = y.iloc[tr_idx], y.iloc[va_idx]

        pipe = clone(pipe_template)
        pipe.fit(X_tr, y_tr)
        y_pred = pipe.predict(X_va)

        for b in brands:
            mask = (X_va[brand_col] == b)
            n = int(mask.sum())
            if n == 0:
                continue

            mae, rmse = mae_rmse(y_va[mask], y_pred[mask])
            rows.append({"fold": fold, "brand": b, "MAE": mae, "RMSE": rmse, "n": n})

    df = pd.DataFrame(rows)
    summary = (
        df.groupby("brand")
          .apply(lambda g: pd.Series({
              "MAE_mean": g["MAE"].mean(),
              "MAE_std":  g["MAE"].std(ddof=0),
              "RMSE_mean": g["RMSE"].mean(),
              "RMSE_std":  g["RMSE"].std(ddof=0),
              "n": int(g["n"].sum()),
          }))
          .reset_index()
    )
    return summary


In [0]:
def eval_brand_specific(pipe_template, X, y, brand_col, brands, splits, min_train_per_fold=50):
    """
    Brand-specific evaluation:
    - For each fold and brand: train the SAME pipeline structure only on that brand's training rows.
    - Evaluate only on that brand's validation rows.
    """
    rows = []

    for fold, (tr_idx, va_idx) in enumerate(splits, start=1):
        X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
        y_tr, y_va = y.iloc[tr_idx], y.iloc[va_idx]

        for b in brands:
            tr_mask = (X_tr[brand_col] == b)
            va_mask = (X_va[brand_col] == b)

            n_tr = int(tr_mask.sum())
            n_va = int(va_mask.sum())

            # These checks are necessary: some folds can have very few samples for a segment.
            if n_va == 0 or n_tr < min_train_per_fold:
                continue

            pipe = clone(pipe_template)
            pipe.fit(X_tr[tr_mask], y_tr[tr_mask])
            y_pred_b = pipe.predict(X_va[va_mask])

            mae, rmse = mae_rmse(y_va[va_mask], y_pred_b)
            rows.append({"fold": fold, "brand": b, "MAE": mae, "RMSE": rmse, "n": n_va})

    df = pd.DataFrame(rows)
    summary = (
        df.groupby("brand")
          .apply(lambda g: pd.Series({
              "MAE_mean": g["MAE"].mean(),
              "MAE_std":  g["MAE"].std(ddof=0),
              "RMSE_mean": g["RMSE"].mean(),
              "RMSE_std":  g["RMSE"].std(ddof=0),
              "n": int(g["n"].sum()),
          }))
          .reset_index()
    )
    return summary


In [0]:
df_global_brand = eval_global_by_brand(
    pipe_template, X_train, y_train, brand_col, candidate_brands, splits
).rename(columns={
    "MAE_mean": "MAE_mean_global", "MAE_std": "MAE_std_global",
    "RMSE_mean": "RMSE_mean_global", "RMSE_std": "RMSE_std_global",
    "n": "n_global"
})

df_brand_spec = eval_brand_specific(
    pipe_template, X_train, y_train, brand_col, candidate_brands, splits, min_train_per_fold=50
).rename(columns={
    "MAE_mean": "MAE_mean_brand", "MAE_std": "MAE_std_brand",
    "RMSE_mean": "RMSE_mean_brand", "RMSE_std": "RMSE_std_brand",
    "n": "n_brand"
})

df_compare_brand = df_global_brand.merge(df_brand_spec, on="brand", how="inner")
df_compare_brand["delta_MAE"] = df_compare_brand["MAE_mean_brand"] - df_compare_brand["MAE_mean_global"]
df_compare_brand["delta_RMSE"] = df_compare_brand["RMSE_mean_brand"] - df_compare_brand["RMSE_mean_global"]

df_compare_brand = df_compare_brand.sort_values("delta_MAE")
display(df_compare_brand)


In [0]:
plt.figure(figsize=(8, 4))
x = np.arange(len(df_compare_brand))
w = 0.35

plt.bar(x - w/2, df_compare_brand["MAE_mean_global"], w, label="Global")
plt.bar(x + w/2, df_compare_brand["MAE_mean_brand"],  w, label="Brand-specific")
plt.xticks(x, df_compare_brand["brand"], rotation=45, ha="right")
plt.ylabel("MAE (GBP)")
plt.title("Global vs Brand-specific (MAE)")
plt.legend()
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 3))
plt.bar(df_compare_brand["brand"], df_compare_brand["delta_MAE"])
plt.axhline(0, linestyle="--")
plt.xticks(rotation=45, ha="right")
plt.ylabel("Δ MAE (brand - global)")
plt.title("Effect of brand specialization (negative = MAE improvement)")
plt.tight_layout()
plt.show()


#### 10.2.3 Brand-Model comparison (Global vs Pair-specific)

In [0]:
# Frequent (brand, model) pairs only (avoid conclusions from tiny segments)
pair_counts = (
    X_train.groupby([brand_col, model_col])
    .size()
    .sort_values(ascending=False)
)

MIN_PAIR_SAMPLES = 80  # skip tiny segments, where estimates are too unstable to compare

candidate_pairs = pair_counts[pair_counts >= MIN_PAIR_SAMPLES]

print(f"Number of candidate pairs (n >= {MIN_PAIR_SAMPLES}): {len(candidate_pairs)}")

# Show the most frequent pairs for context (readable table with names)
display(candidate_pairs.head(500).reset_index(name="count"))


In [0]:
def eval_global_by_pair(pipe_template, X, y, brand_col, model_col, pairs, splits):
    """
    Global model evaluation per (brand, model):
    - Fit once per fold on ALL cars.
    - Score only on validation rows for each selected pair.
    """
    rows = []

    for fold, (tr_idx, va_idx) in enumerate(splits, start=1):
        X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
        y_tr, y_va = y.iloc[tr_idx], y.iloc[va_idx]

        pipe = clone(pipe_template)
        pipe.fit(X_tr, y_tr)
        y_pred = pipe.predict(X_va)

        for (b, m) in pairs:
            mask = (X_va[brand_col] == b) & (X_va[model_col] == m)
            n = int(mask.sum())
            if n == 0:
                continue

            mae, rmse = mae_rmse(y_va[mask], y_pred[mask])
            rows.append({"fold": fold, "brand": b, "model": m, "MAE": mae, "RMSE": rmse, "n": n})

    df = pd.DataFrame(rows)
    summary = (
        df.groupby(["brand", "model"])
          .apply(lambda g: pd.Series({
              "MAE_mean": g["MAE"].mean(),
              "MAE_std":  g["MAE"].std(ddof=0),
              "RMSE_mean": g["RMSE"].mean(),
              "RMSE_std":  g["RMSE"].std(ddof=0),
              "n": int(g["n"].sum()),
          }))
          .reset_index()
    )
    return summary


def eval_pair_specific(pipe_template, X, y, brand_col, model_col, pairs, splits, min_train_per_fold=40):
    """
    Pair-specific models:
    - For each fold and (brand, model): fit the pipeline only on that segment's training rows.
    - Evaluate only on that segment's validation rows.

    The min_train_per_fold guard is necessary because some folds can have too few samples
    even if the pair is frequent overall.
    """
    rows = []

    for fold, (tr_idx, va_idx) in enumerate(splits, start=1):
        X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
        y_tr, y_va = y.iloc[tr_idx], y.iloc[va_idx]

        for (b, m) in pairs:
            tr_mask = (X_tr[brand_col] == b) & (X_tr[model_col] == m)
            va_mask = (X_va[brand_col] == b) & (X_va[model_col] == m)

            n_tr = int(tr_mask.sum())
            n_va = int(va_mask.sum())

            if n_va == 0 or n_tr < min_train_per_fold:
                continue

            pipe = clone(pipe_template)
            pipe.fit(X_tr[tr_mask], y_tr[tr_mask])
            y_pred = pipe.predict(X_va[va_mask])

            mae, rmse = mae_rmse(y_va[va_mask], y_pred)
            rows.append({"fold": fold, "brand": b, "model": m, "MAE": mae, "RMSE": rmse, "n": n_va})

    df = pd.DataFrame(rows)
    summary = (
        df.groupby(["brand", "model"])
          .apply(lambda g: pd.Series({
              "MAE_mean": g["MAE"].mean(),
              "MAE_std":  g["MAE"].std(ddof=0),
              "RMSE_mean": g["RMSE"].mean(),
              "RMSE_std":  g["RMSE"].std(ddof=0),
              "n": int(g["n"].sum()),
          }))
          .reset_index()
    )
    return summary

In [0]:
pairs_list = list(candidate_pairs.index)  # list of (brand, model) tuples

df_global_pair = eval_global_by_pair(
    pipe_template, X_train, y_train, brand_col, model_col, pairs_list, splits
).rename(columns={
    "MAE_mean": "MAE_mean_global", "MAE_std": "MAE_std_global",
    "RMSE_mean": "RMSE_mean_global", "RMSE_std": "RMSE_std_global",
    "n": "n_global"
})

df_pair_spec = eval_pair_specific(
    pipe_template, X_train, y_train, brand_col, model_col, pairs_list, splits, min_train_per_fold=40
).rename(columns={
    "MAE_mean": "MAE_mean_pair", "MAE_std": "MAE_std_pair",
    "RMSE_mean": "RMSE_mean_pair", "RMSE_std": "RMSE_std_pair",
    "n": "n_pair"
})

df_compare_pair = df_global_pair.merge(df_pair_spec, on=["brand", "model"], how="inner")
df_compare_pair["delta_MAE"] = df_compare_pair["MAE_mean_pair"] - df_compare_pair["MAE_mean_global"]
df_compare_pair["delta_RMSE"] = df_compare_pair["RMSE_mean_pair"] - df_compare_pair["RMSE_mean_global"]

df_compare_pair = df_compare_pair.sort_values("delta_MAE")

# Full results (all frequent pairs) are here:
display(df_compare_pair)

# For the report: show the most improved + most harmed (readable subset)
display_cols = ["brand", "model", "n_global", "MAE_mean_global", "MAE_mean_pair", "delta_MAE",
                "RMSE_mean_global", "RMSE_mean_pair", "delta_RMSE"]

print("Top 15 improvements (most negative ΔMAE):")
display(df_compare_pair[display_cols].head(15).round(1))

print("Top 15 degradations (most positive ΔMAE):")
display(df_compare_pair[display_cols].tail(15).round(1))

# Plots (same as before): ΔMAE bar plot for stable segments + scatter vs size
MIN_PLOT_SAMPLES = 100
df_plot = df_compare_pair[df_compare_pair["n_global"] >= MIN_PLOT_SAMPLES].copy()

plt.figure(figsize=(10, 4))
x = np.arange(len(df_plot))
plt.bar(x, df_plot["delta_MAE"])
plt.axhline(0, linestyle="--")
plt.xticks(
    x,
    [f"{b} {m}" for b, m in zip(df_plot["brand"], df_plot["model"])],
    rotation=90, ha="right"
)
plt.ylabel("Δ MAE (pair - global)")
plt.title("Effect of (brand, model) specialization (negative = improvement)")
plt.tight_layout()
plt.show()

plt.figure(figsize=(6, 4))
plt.scatter(df_compare_pair["n_global"], df_compare_pair["delta_MAE"])
plt.axhline(0, linestyle="--")
plt.xlabel("Number of samples per (brand, model) (out-of-fold counted)")
plt.ylabel("Δ MAE (pair - global)")
plt.title("ΔMAE vs segment size")
plt.tight_layout()
plt.show()