**Your work will be evaluated according to the following criteria:**
- Project Structure and Notebook(s) Quality (4/20)
- Data Exploration & Initial Preprocessing (4/20)
- Regression Benchmarking and Optimization (7/20)
- Open-Ended Section (4/20)
- Deployment (1/20)
- Extra Point: Have Project Be Publicly Available on GitHub (1/20)


**Project Timeline**
- 22.11.: Preprocessing and Model Preparation
    - Finish clean preprocessing, fully contained in the pipeline
    - Finish clean hyperparameter tuning
- 29.11.: Feature Selection
    - Clean and structured approach to feature selection for all models (ideally one consistent approach)
- 29.11.: Regression Benchmarking and Optimization
    - Automate optimization (add something like MLflow)
- 06.12.: Open-Ended Section and Deployment
    - Add 4 open-ended experiments
    - Deployment
- 13.12.: Notebook Polish
    - Very clean notebook structure, similar to the lab notebooks by Ricardo
    - Show and explain the results of the different models clearly in markdown tables etc. (see the lab notebooks)
- 14.12.: Submission

In [0]:
# TODO Open End Section:
# Interface for new Car

<div style="
    background: rgba(25, 25, 25, 0.55);
    backdrop-filter: blur(16px) saturate(150%);
    -webkit-backdrop-filter: blur(16px) saturate(150%);
    border: 1px solid rgba(255, 255, 255, 0.12);
    border-radius: 18px;
    padding: 45px 30px;
    text-align: center;
    font-family: 'Inter', 'Segoe UI', 'Helvetica Neue', Arial, sans-serif;
    color: #e0e0e0;
    box-shadow: 0 0 30px rgba(0, 0, 0, 0.35);
    margin: 40px auto;
    max-width: 800px;
">

  <h1 style="
      font-size: 2.8em;
      font-weight: 700;
      margin: 0 0 8px 0;
      letter-spacing: -0.02em;
      background: linear-gradient(90deg, #00e0ff, #9c7eff);
      -webkit-background-clip: text;
      -webkit-text-fill-color: transparent;
  ">
      Machine Learning Project
  </h1>

  <h2 style="
      font-size: 1.6em;
      font-weight: 500;
      margin: 0 0 25px 0;
      color: #b0b0b0;
      letter-spacing: 0.5px;
  ">
      Cars 4 You - Predicting Car Prices
  </h2>

  <p style="
      font-size: 1.25em;
      font-weight: 500;
      color: #c0c0c0;
      margin-bottom: 6px;
  ">
      Group 5 - Lukas Belser, Samuel Braun, Elias Karle, Jan Thier
  </p>

  <p style="
      font-size: 1.05em;
      font-weight: 400;
      color: #8a8a8a;
      font-style: italic;
      letter-spacing: 0.5px;
  ">
      Machine Learning End Results · 22.12.2025
  </p>
</div>


### **Table of Contents**
 
- [1. Import Packages and Data](#1-import-packages-and-data)  
  - [1.1 Import Required Packages](#11-import-required-packages)  
  - [1.2 Load Datasets](#12-load-datasets)  
  - [1.3 Kaggle Setup](#13-kaggle-setup)  
- [2. Data Cleaning, Feature Engineering, Split & Preprocessing](#2-data-cleaning-feature-engineering-split--preprocessing)  
  - [2.1 Data Cleaning](#21-data-cleaning)  
  - [2.2 Feature Engineering](#22-feature-engineering)  
  - [2.3 (No) Data Split](#23-no-data-split)  
  - [2.4 Preprocessing](#24-preprocessing)  
- [3. Feature Selection](#3-feature-selection)  
- [4. Model Evaluation Metrics, Baselining, Setup](#4-model-evaluation-metrics-baselining-setup)  
- [5. Hyperparameter Tuning and Model Evaluation](#5-hyperparameter-tuning-and-model-evaluation)  
  - [5.1 ElasticNet](#51-elasticnet)  
  - [5.2 HistGradientBoost](#52-histgradientboost)  
  - [5.3 RandomForest](#53-randomforest)  
  - [5.4 ExtraTrees](#54-extratrees)  
- [6. Feature Importance of Tree Models (with SHAP)](#6-feature-importance-of-tree-models-with-shap)  
  - [6.1 HGB](#61-hgb)  
  - [6.2 RF](#62-rf)  
- [7. Kaggle Competition](#7-kaggle-competition)  

TODO: finish and update the table of contents at the end of the project

### 1. Import Packages and Data

#### 1.1 Import Required Packages

In [0]:
!pip install kaggle
!pip install shap
!pip install -U scikit-learn
!pip install category_encoders

In [0]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt; plt.rcParams.update({"figure.max_open_warning": 0, "figure.dpi": 100})
import joblib
import shap

from collections import Counter
from sklearn.feature_selection import VarianceThreshold, chi2, RFE
from scipy.stats import spearmanr, uniform, randint
from sklearn.metrics import mean_absolute_error
 
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, TargetEncoder, StandardScaler, FunctionTransformer
from sklearn.base import clone
 
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold, GridSearchCV, cross_validate
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor, StackingRegressor
from sklearn.svm import SVR
from sklearn.inspection import permutation_importance
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import SelectFromModel


from tqdm.auto import tqdm

from category_encoders import QuantileEncoder # used for median target encoding (sklearn only supports mean target encoding with their TargetEncoder class)
 
from car_functions import clean_car_dataframe
from pipeline_functions import GroupImputer, m_estimate_mean


#### 1.2 Load Datasets

In [0]:
df_cars_train = pd.read_csv("train.csv")
df_cars_test = pd.read_csv("test.csv")

#### 1.3 Kaggle Setup

In [0]:
# Kaggle API Connect

# Folder containing kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = "/Workspace/Users/20250355@novaims.unl.pt" #add your own kaggle.json api token

# Test
!echo $KAGGLE_CONFIG_DIR

### 2. Data Cleaning, Feature Engineering, Split & Preprocessing

#### 2.1 Data Cleaning

In [0]:
# Outlier handling, missing-value handling, and the justification of those decisions live in clean_car_dataframe (car_functions.py)
df_cars_train = clean_car_dataframe(df_cars_train)
df_cars_test = clean_car_dataframe(df_cars_test)

# Safety check: print the unique values of every column in df_cars_train and df_cars_test to verify that cleaning worked and no odd values remain
for col in df_cars_train.columns:
    print(col, df_cars_train[col].unique())
print("X"*150)
for col in df_cars_test.columns:
    print(col, df_cars_test[col].unique())

#### 2.2 Feature Engineering

**Base Feature Creation**

These are foundational features derived directly from the original data, often to create linear relationships or capture interactions.
- `age`: Calculated as `2020 - year`. Creates a simple linear feature representing the car's age. Newer cars (lower age) generally have higher prices.
- `miles_per_year`: Calculated as `mileage / age` (for cars with `age = 0`, the raw `mileage` is used). This normalizes the car's usage and reduces the correlation (multicollinearity) between `mileage` and `age`. A 3-year-old car with 60,000 miles has been driven twice as much per year as a 6-year-old car with 60,000 miles.
- `age_x_engine`: An interaction term `age * engineSize`. This helps the model capture non-linear relationships, such as the possibility that the value of cars with large engines might depreciate faster (or slower) than cars with small engines.
- `mpg_x_engine`: An interaction term `mpg * engineSize`. This captures the combined effect of fuel efficiency and engine power.
- `tax_per_engine`: Calculated as `tax / engineSize`. This feature represents the tax cost relative to the engine's power, which could be an indicator of overall running costs or vehicle class.
- `mpg_per_engine`: Calculated as `mpg / engineSize`. This creates an "efficiency" metric, representing how many miles per gallon the car achieves for each unit of engine size.
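
The base features above can be sketched on a toy table. The values below are made up for illustration and are not taken from the project data; the zero handling mirrors what the `CarFeatureEngineer` class in this section does.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values, not from the project dataset)
cars = pd.DataFrame({
    "year": [2020, 2017, 2014],
    "mileage": [5_000, 60_000, 60_000],
    "engineSize": [1.0, 2.0, 0.0],
    "mpg": [60.0, 50.0, 40.0],
})
ref_year = 2020

cars["age"] = ref_year - cars["year"]

# age 0 would divide by zero: turn it into NaN, then fall back to the raw mileage
cars["miles_per_year"] = (
    cars["mileage"] / cars["age"].replace({0: np.nan})
).fillna(cars["mileage"])

# engineSize 0 is treated as missing, so the ratio stays NaN for that row
cars["mpg_per_engine"] = cars["mpg"] / cars["engineSize"].replace({0: np.nan})

print(cars[["age", "miles_per_year", "mpg_per_engine"]])
```

The two 60,000-mile cars end up with different `miles_per_year` values (20,000 vs. 10,000), which is the distinction the feature is meant to expose.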


**Popularity & Demand Features**

These features attempt to quantify a car's popularity or market demand, which directly influences price.
- `model_freq`: The relative frequency (a share between 0 and 1) of each `model` in the training dataset; models not seen during training get 0. Popular, common models often have more stable and predictable pricing and demand.


**Price Anchor Features**

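These features "anchor" a car's price relative to its group. They provide a strong baseline price signal based on brand, model, and configuration. In the preprocessing pipeline below they are realized as smoothed median target encodings (`QuantileEncoder` with `quantile=0.5`) of `Brand`, `model`, `brand_fuel`, and `brand_trans`.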
- `brand_med_price`: The median price for the car's `Brand` (e.g., the typical price for a BMW vs. a Skoda). This captures overall brand positioning.
- `model_med_price`: The median price for the car's `model` (e.g., the typical price for a 3-Series vs. a 1-Series). This captures the model's positioning within the brand.
- `brand_fuel_med_price`: The median price for the car's specific `Brand` and `fuelType` combination (e.g., a Diesel BMW vs. a Petrol BMW).
- `brand_trans_med_price`: The median price for the `Brand` and `transmission` combination (e.g., an Automatic BMW vs. a Manual BMW).


**Normalized & Relative Features**

These features compare a car to its peers rather than using absolute values.
- `*_anchor` (e.g., `brand_med_price_anchor`): Created by dividing the median price features (from the Price Anchor Features above) by the `overall_mean_price`. This makes the feature dimensionless and represents the group's price *relative* to the entire market (e.g., "this brand is 1.5x the market average").
- `age_rel_brand`: Calculated as `age - brand_median_age`. This shows if a car is newer or older than the *typical* car for that specific brand, capturing relative age within its own group.
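
As a rough illustration of the two relative features, the sketch below computes them on a made-up training table. The column names follow the descriptions above; the data and the exact way the statistics are stored are assumptions. In a real pipeline the group statistics should be learned on the training data only and then mapped onto new rows.

```python
import pandas as pd

# Made-up training rows for illustration
train = pd.DataFrame({
    "Brand": ["BMW", "BMW", "Skoda", "Skoda"],
    "price": [30_000, 20_000, 10_000, 12_000],
    "age": [1, 5, 2, 4],
})

# Statistics learned from the training data
overall_mean_price = train["price"].mean()
brand_med_price = train.groupby("Brand")["price"].median()
brand_median_age = train.groupby("Brand")["age"].median()

# Dimensionless price anchor: brand median relative to the market average
train["brand_med_price_anchor"] = train["Brand"].map(brand_med_price) / overall_mean_price

# Relative age: negative means newer than the typical car of that brand
train["age_rel_brand"] = train["age"] - train["Brand"].map(brand_median_age)

print(train)
```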


**CV-Safe Target Encodings**

This is an advanced technique to encode categorical variables (like `model` or `Brand`) using information from the target variable (`price`) without causing data leakage.
- `*_te` (e.g., `model_te`): Represents the *average price* for that category (e.g., the average price for a "Fiesta").
- **Why is it "CV-Safe"?** Instead of just calculating the global average price for "Fiesta" and applying it to all rows (which leaks target information), this method uses K-Fold cross-validation. For each fold of the data, the target encoding is calculated *only* from the *other* folds. This ensures the encoding for any given row never includes its own price, preventing leakage and leading to a more robust model.
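
The out-of-fold idea can be sketched by hand as follows. `oof_mean_encode` is a hypothetical helper written only for this illustration, with made-up prices; scikit-learn's `TargetEncoder` (available from version 1.3) applies a similar cross-fitting scheme inside `fit_transform`, with additional smoothing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_mean_encode(categories, target, n_splits=5, random_state=42):
    """Out-of-fold mean target encoding (hypothetical helper, illustration only)."""
    cats = pd.Series(list(categories))
    y = pd.Series(list(target), dtype=float)
    encoded = pd.Series(np.nan, index=cats.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, enc_idx in kf.split(cats):
        # Category means come only from rows outside the fold being encoded
        fold_means = y.iloc[fit_idx].groupby(cats.iloc[fit_idx]).mean()
        # Fallback for categories that do not occur in the fit rows
        fallback = y.iloc[fit_idx].mean()
        encoded.iloc[enc_idx] = (
            cats.iloc[enc_idx].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded


# Made-up example data
models = ["fiesta"] * 6 + ["golf"] * 4
prices = [9_000, 9_500, 10_000, 10_500, 11_000, 11_500,
          18_000, 19_000, 20_000, 21_000]
model_te = oof_mean_encode(models, prices)

# Changing a row's own price must not change that row's encoding
prices_changed = prices.copy()
prices_changed[0] = 1_000_000
assert oof_mean_encode(models, prices_changed)[0] == model_te[0]
```

The final assertion is the point of the technique: a row's encoded value is computed without ever looking at that row's own price.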

In [0]:
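class CarFeatureEngineer(BaseEstimator, TransformerMixin):
    """Add engineered car features (age, usage ratios, interactions, model frequency).

    ref_year: reference year used to compute age; if None, the latest year seen in fit is used.
    Brand median age and model frequencies are learned in fit and reused in transform.
    """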
    def __init__(self, ref_year=None):
        self.ref_year = ref_year

    def fit(self, X, y=None):
        X_ = X.copy()
        if self.ref_year is None:
            self.ref_year_ = X_['year'].max()
        else:
            self.ref_year_ = self.ref_year
        self.brand_median_age_ = (
            (self.ref_year_ - X_['year'])
            .groupby(X_['Brand'])
            .median()
            .to_dict()
        )
        self.model_freq_ = X_['model'].value_counts(normalize=True).to_dict()
        return self

    def transform(self, X):
        X = X.copy()
        
        # 1. Base feature creation: car age (newer cars usually have higher prices; a linear age feature is easier for models to use)
        age = self.ref_year_ - X['year']
        X['age'] = age

        # Miles per Year: Normalizes mileage by age (solves multicollinearity between year and mileage)
        X['miles_per_year'] = X['mileage'] / age.replace({0: np.nan})
        X['miles_per_year'] = X['miles_per_year'].fillna(X['mileage']) # if age is 0, just use mileage because that's the mileage it has driven so far in that year

        # Interaction Terms: Capture non-linear effects between engine and other numeric features
        X['age_x_engine'] = X['age'] * X['engineSize']
        X['mpg_x_engine']  = X['mpg'] * X['engineSize']

        # tax per engine
        X['tax_per_engine'] = X['tax'] / X['engineSize'].replace({0: np.nan})

        # MPG per engineSize to represent the efficiency
        X['mpg_per_engine'] = X['mpg'] / X['engineSize'].replace({0: np.nan})

        # 2. Model Frequency: Popular models tend to have stable demand and prices
        X['model_freq'] = X['model'].map(self.model_freq_).fillna(0.0)

        # 3. Create Interaction Features for anchor (relative positioning within brand/model)
        X['brand_fuel'] = X['Brand'].astype(str) + "_" + X['fuelType'].astype(str)
        X['brand_trans'] = X['Brand'].astype(str) + "_" + X['transmission'].astype(str)
        
        # 4. Relative age (within brand): newer/older than the brand's median age
        X['age_rel_brand'] = X['age'] - X['Brand'].map(self.brand_median_age_)
        return X

In [0]:
df_cars_train.info()

In [0]:
df_cars_train.columns

#### 2.3 (No) Data Split

In [0]:
# Since we have an external hold-out set (Kaggle), an additional validation set is not necessary and would waste training data
X_train = df_cars_train.drop(columns='price')
y_train = df_cars_train['price']

#### 2.4 Preprocessing

In [0]:
# TODO: explain why we use this => also add it to the pipeline
def to_float_array(x):
    """Convert input to float array."""
    return np.array(x, dtype=float)

In [0]:
# Patch FunctionTransformer to expose feature names and create function to call it
class NamedFunctionTransformer(FunctionTransformer):
    def __init__(self, func=None, feature_names=None, **kwargs):
        # store as attribute so sklearn.get_params can access it
        self.feature_names = feature_names
        super().__init__(func=func, **kwargs)

    def get_feature_names_out(self, input_features=None):
        # if custom names specified, use them
        if self.feature_names is not None:
            return np.asarray(self.feature_names, dtype=object)
        # otherwise just pass through the input feature names
        return np.asarray(input_features, dtype=object)


def get_feature_names_from_preprocessor(pre):
    """Collect output feature names from a fitted ColumnTransformer (skips the remainder)."""
    feature_names = []
    for name, trans, cols in pre.transformers_:
        if name != 'remainder':
            if hasattr(trans, 'get_feature_names_out'):
                # e.g. expanded one-hot names; fall back to the input column names on failure
                try:
                    feature_names.extend(trans.get_feature_names_out(cols))
                except Exception:
                    feature_names.extend(cols)
            else:
                feature_names.extend(cols)
    return feature_names

We build two separate preprocessing pipelines, one on the original features and one on the engineered features, for initial model baselining and comparison.

In [0]:
# PIPELINE WITH preprocessor_orig CONTAINING ONLY ORIGINAL FEATURES

orig_numeric_features = [
    "year", "mileage", "tax", "mpg", "engineSize", "previousOwners"
]
orig_categorical_features = ["Brand", "model", "transmission", "fuelType"]

# left out columns: hasDamage (unsure what the two values 0 and NaN mean), paintQuality (only added by mechanic so not available for our predictions in production)

numeric_transformer_orig = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),    # simple global median imputation
    ("to_float", FunctionTransformer(to_float_array)),  # explicit float cast (was an identity step before)
    ("scaler", StandardScaler())
])

categorical_transformer_orig = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill by mode instead of Unknown
    ("encoder", OneHotEncoder(handle_unknown="ignore"))  # One-hot encoding
])

preprocessor_orig = ColumnTransformer([
    ("num", numeric_transformer_orig, orig_numeric_features),
    ("cat", categorical_transformer_orig, orig_categorical_features)
])

preprocessor_orig.fit(X_train)

In [0]:
# PIPELINE WITH preprocessor_fe CONTAINING ENGINEERED FEATURES

numeric_features = [
    "age", "age_rel_brand", "tax", "mpg", "engineSize", "previousOwners", "model_freq",
    "age_x_engine", "mpg_x_engine",
    "tax_per_engine", "mpg_per_engine"
]
log_features = ["mileage", "miles_per_year"]                                              
categorical_features = ["transmission", "fuelType"]
categorical_features_for_te = ["Brand", "model"]
categorical_median_te = ["Brand", "model", "brand_fuel", "brand_trans"]

# left out columns: year (age is better), hasDamage (unsure what the two values 0 and NaN mean), paintQuality (only added by mechanic so not available for our predictions in production)

numeric_transformer_fe = Pipeline([
    ("to_float", NamedFunctionTransformer(to_float_array, validate=False)),
    ("scaler", StandardScaler()),
])

log_transformer_fe = Pipeline([
    # Cast to float, then log-transform and scale (missing values are already imputed by the GroupImputer step of preprocessor_fe)
    ("to_float", NamedFunctionTransformer(to_float_array, validate=False)),
    ("log",    NamedFunctionTransformer(np.log1p, validate=False)),  # log1p handles zeros safely
    ("scaler", StandardScaler()),
])

categorical_transformer_fe_ohe = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

# Keep the mean target encoder in the code but don't use it for now: median TE seems more robust, and we use only one method for consistency
# categorical_transformer_fe_te = Pipeline([ 
#     ("encoder", TargetEncoder(target_type='continuous', cv=5, smooth='auto', random_state=42)), # Prevents data leakage with CV (e.g. for the samples in Fold 1, it calculates the target mean using the data from Folds 2, 3, 4, and 5) # TODO If it overfits test data too much, increasing the smoothing parameter can help
#     ("scaler", StandardScaler()),
# ])


# names for median-TE features (one per input column, since QuantileEncoder outputs 1 column per feature)
median_te_feature_names = [f"{col}__median_te" for col in categorical_median_te]

categorical_transformer_fe_median_te = Pipeline(steps=[
    ('median_encoder', QuantileEncoder(quantile=0.5, m=10.0)),  # no cols given, so all input columns are encoded; quantile=0.5 is the median; m is the smoothing strength (smoothing mitigates but does not eliminate leakage)  # TODO: tune m?
    ('scaler', StandardScaler()),
    ('name_wrapper', NamedFunctionTransformer(feature_names=median_te_feature_names,
                                              validate=False)),
])

# TODO: Currently the transformed features from categorical_transformer_fe_median_te don't have names, just numbers (see names in Section 5.3 Feature Importance for RF) -> fix that later if possible ~J

# ColumnTransformer that uses all engineered features
transformer_fe = ColumnTransformer([
    ("log", log_transformer_fe, log_features),
    ("num", numeric_transformer_fe, numeric_features),
    ("cat", categorical_transformer_fe_ohe, categorical_features),
    # ("mean_te", categorical_transformer_fe_te, categorical_features_for_te),
    ("median_te", categorical_transformer_fe_median_te, categorical_median_te)
])

# preprocessor_fe with group imputing and scaler_fe
preprocessor_fe = Pipeline([
    ("fe", CarFeatureEngineer(ref_year=2020)),
    ("group_imputer", GroupImputer(
        group_cols=("Brand", "model"),
        num_cols=numeric_features + log_features,
        cat_cols=categorical_features + categorical_median_te,
        fallback="__MISSING__",
    )),
    ("ct", transformer_fe),   # ColumnTransformer
])

preprocessor_fe.fit(X_train, y_train) # Fit here already to have scaled data for feature selection later (y_train is necessary for target encoder)

In [0]:
ct = preprocessor_fe.named_steps["ct"]
fe_names = get_feature_names_from_preprocessor(ct)

TODO: delete the following cell later. It only lets us check that the group imputer works, and it is unreviewed GPT-generated code.

In [0]:
brand = "VW"
model = "golf"

# 1) Get the fitted steps from preprocessor_fe
fe = preprocessor_fe.named_steps["fe"]              # CarFeatureEngineer
imp = preprocessor_fe.named_steps["group_imputer"]  # GroupImputer

# 2) Inspect GroupImputer internal numeric stats
pair_table = getattr(imp, "num_pair_", None)    # indexed by (_g0, _g1) = (Brand, model)
brand_table = getattr(imp, "num_first_", None)  # indexed by _g0 = Brand
global_med = getattr(imp, "num_global_", None)  # Series of global medians

print("Has pair-level medians table:",
      pair_table is not None and not getattr(pair_table, "empty", True))
print("Has brand-level medians table:",
      brand_table is not None and not getattr(brand_table, "empty", True))
print("Has global median:",
      global_med is not None and not global_med.empty)
print()

_g0 = brand
_g1 = model

# 2a) Pair-level
if pair_table is not None and (_g0, _g1) in pair_table.index:
    print(f"Pair-level median FOUND for ({brand}, {model}):")
    display(pair_table.loc[(_g0, _g1)])
else:
    print(f"No pair-level median for ({brand}, {model}).")
    if pair_table is not None and not pair_table.empty:
        print("Sample of pair-level medians (top 5):")
        display(pair_table.head())

# 2b) Brand-level
if brand_table is not None and _g0 in brand_table.index:
    print(f"\nBrand-level median for {brand}:")
    display(brand_table.loc[_g0])
else:
    print("\nNo brand-level median for", brand)
    if brand_table is not None and not brand_table.empty:
        print("Sample of brand-level medians (top 5):")
        display(brand_table.head())

# 2c) Global medians
print("\nGlobal median (fallback):")
display(global_med)

# 3) Apply CarFeatureEngineer + GroupImputer to VW Golf rows and compare
#    (GroupImputer was fitted after CarFeatureEngineer, so we must mimic that order)

# 3a) Feature engineering on full X_train
X_train_fe = fe.transform(X_train)

# 3b) Filter for VW Golf in the feature-engineered space
vw_golf = X_train_fe[(X_train_fe["Brand"] == brand) & (X_train_fe["model"] == model)].copy()

if vw_golf.empty:
    print("\nNo VW Golf rows found in X_train.")
else:
    print(f"\nFound {len(vw_golf)} VW Golf rows in X_train.")

    # 3c) GroupImputer expects the columns it saw at fit time
    cols_for_imp = imp.feature_names_in_
    vw_input = vw_golf.loc[:, cols_for_imp]

    vw_imp = imp.transform(vw_input)
    vw_imp_df = pd.DataFrame(vw_imp, columns=cols_for_imp, index=vw_golf.index)

    print("\nImputed data (first 8 rows):")
    display(vw_imp_df[["mpg", "mileage", "tax"]].head(8))

    # 4) Build comparison table (original vs imputed, for selected columns)
    comp = pd.DataFrame(index=vw_golf.index)
    comp["orig_mpg"] = vw_golf["mpg"]
    comp["imp_mpg"] = vw_imp_df["mpg"]
    comp["orig_tax"] = vw_golf["tax"]
    comp["imp_tax"] = vw_imp_df["tax"]
    comp["orig_mileage"] = vw_golf["mileage"]
    comp["imp_mileage"] = vw_imp_df["mileage"]

    print("\nOriginal vs imputed (first 12 rows):")
    display(comp.head(12))

    # 5) Determine imputation source per row
    def source_of_imputation(col):
        srcs = []
        for idx, row in comp.iterrows():
            val = row[f"imp_{col}"]
            src = "other"

            # Pair-level
            if pair_table is not None and (_g0, _g1) in pair_table.index and col in pair_table.columns:
                pair_val = pair_table.loc[(_g0, _g1), col]
                if pd.notna(pair_val) and pd.notna(val) and val == pair_val:
                    src = "pair"

            # Brand-level
            if src == "other" and brand_table is not None and _g0 in brand_table.index and col in brand_table.columns:
                brand_val = brand_table.loc[_g0, col]
                if pd.notna(brand_val) and pd.notna(val) and val == brand_val:
                    src = "brand"

            # Global
            if src == "other" and global_med is not None and col in global_med.index:
                glob_val = global_med[col]
                if pd.notna(glob_val) and pd.notna(val) and val == glob_val:
                    src = "global"

            srcs.append(src)
        return srcs

    comp["src_mpg"] = source_of_imputation("mpg")
    comp["src_tax"] = source_of_imputation("tax")
    comp["src_mileage"] = source_of_imputation("mileage")

    print("\nImputation sources for the shown rows:")
    display(comp.head(12))

    # 6) Summary counts: NaN before vs after imputation
    print("\nSummary counts: NaN before -> NaN after")
    before = vw_golf[["mpg", "mileage", "tax"]].isna().sum()
    after = pd.DataFrame({
        "mpg": comp["imp_mpg"],
        "mileage": comp["imp_mileage"],
        "tax": comp["imp_tax"],
    }).isna().sum()
    display(pd.DataFrame({"na_before": before, "na_after": after}))

### 3. Feature Selection

We use the same two-stage feature selection setup for every model and vary only how aggressively it prunes.

We first designed a domain-specific preprocessing pipeline (feature engineering + group-wise imputation + encoding).

Then we use:
1. `VarianceThreshold()` in every pipeline to remove constant features.
2. `SelectFromModel(RandomForest)` as a supervised, embedded feature selector inside the pipelines of the models that are more sensitive to high dimensionality and collinearity (ElasticNet, SVR).
    - Because the selector is a pipeline step, it is refit on the training folds within cross-validation, and both models share the same selector configuration. This ensures:
        - No data leakage,
        - Consistent feature selection logic,
        - Model-agnostic, non-linear evaluation of feature relevance.

For tree-based ensemble models (RandomForest, ExtraTrees, HistGradientBoosting, and the stacking ensemble), our experiments showed that explicit feature pruning did not improve performance. These models already cope well with redundant features through split selection and feature subsampling (`max_features`), and HistGradientBoosting additionally uses L2 regularization.

Therefore, for these models we use only a light `VarianceThreshold` filter.
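
A minimal sketch of this setup on synthetic data is shown below. The data, the small forest size, and the variable names are assumptions chosen to keep the example fast; the point is only that both selectors sit inside the pipeline, so cross-validation refits them on each training fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: two informative columns, three noise columns, one constant column
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 5] = 1.0
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

pipe = Pipeline([
    ("vt", VarianceThreshold(threshold=0.0)),  # drops the constant column
    ("fs", SelectFromModel(
        RandomForestRegressor(n_estimators=50, random_state=0),
        threshold="median",  # keep features with importance at or above the median
    )),
    ("model", ElasticNet(alpha=0.01)),
])

# The selectors are refit on every training fold, so no fold sees its own test rows
scores = cross_val_score(pipe, X, y, cv=3, scoring="neg_mean_absolute_error")
print("CV MAE:", -scores.mean())

pipe.fit(X, y)
n_after_vt = int(pipe.named_steps["vt"].get_support().sum())
n_after_fs = int(pipe.named_steps["fs"].get_support().sum())
print(n_after_vt, "features after VarianceThreshold,", n_after_fs, "after SelectFromModel")
```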

In [0]:
# TODO Samuel further review SKlearn pipeline FS techniques and see how to improve MAE again

In [0]:
# Base estimator used purely for feature selection
fs_estimator = RandomForestRegressor(
    n_estimators=400,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    bootstrap=True,
    n_jobs=-1,
    random_state=42,
)

# Global feature selector, used in *all* pipelines
feature_selector = SelectFromModel(
    estimator=fs_estimator,
    threshold="median",   # keep features with importance at or above the median; "mean" or a numeric value are alternatives to tune
    prefit=False,         # fit inside the pipeline
)


### 4. Model Evaluation Metrics, Baselining, Setup

#### 4.1 Model Evaluation Metrics

**MAE (Mean Absolute Error):**
- average absolute deviation between predicted and true car prices
- easy to interpret in pounds, same metric used by Kaggle competition

**RMSE (Root Mean Squared Error):**
- sensitive to outliers, helps identify large prediction errors

**R²:**
- Coefficient of determination: proportion of variance explained by the model
- 1.0 = perfect predictions, 0.0 = same as predicting mean, < 0.0 = worse than mean

=> We define the metrics in the function `print_metrics` in the file `car_functions.py` (TODO: update)
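
For reference, the three metrics can be computed by hand on a tiny made-up example. This is only an illustration of the definitions above, not the project's `print_metrics` implementation.

```python
import numpy as np

# Made-up true and predicted prices
y_true = np.array([10_000.0, 12_000.0, 20_000.0, 8_000.0])
y_pred = np.array([11_000.0, 12_000.0, 16_000.0, 8_000.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))          # average absolute error, in price units
rmse = np.sqrt(np.mean(errors ** 2))   # penalizes the single large error more strongly
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  {mae:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"R2:   {r2:.4f}")
```

Here the single 4,000 error lifts the RMSE (about 2,062) well above the MAE (1,250), which is why we report both.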

#### 4.2 Baseline (median)

In [0]:
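# Baseline: DummyRegressor using the median price as prediction
# Assumption: preprocessor_fe_clean is not defined earlier in this notebook, so we take it to be an unfitted copy of preprocessor_fe
preprocessor_fe_clean = clone(preprocessor_fe)

baseline_pipe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("model", DummyRegressor(strategy="median")),
])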

baseline_cv = cross_validate(
    baseline_pipe,
    X_train,
    y_train,
    cv=3,
    scoring={
        "neg_mae": "neg_mean_absolute_error",
        "neg_mse": "neg_mean_squared_error",
        "r2": "r2",
    },
    n_jobs=-2,
    verbose=0,
)

baseline_mae = -baseline_cv["test_neg_mae"].mean()
baseline_rmse = np.sqrt(-baseline_cv["test_neg_mse"].mean())
baseline_r2 = baseline_cv["test_r2"].mean()

print("Baseline (median) with engineered features & CV:")
print(f"MAE:  {baseline_mae:,.4f}")
print(f"RMSE: {baseline_rmse:,.4f}")
print(f"R2:   {baseline_r2:,.4f}")

# Baseline (median) with engineered features & CV:
# MAE:  6,801.8131
# RMSE: 9,981.0186
# R2:   -0.0508

#### 4.3 Pipeline Definitions (preprocessor + model)

In [0]:
### LINEAR MODEL (ElasticNet)

elastic_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("vt", VarianceThreshold(threshold=0.0)),  # remove constant features
    ("model", ElasticNet(
        alpha=0.01,
        l1_ratio=0.5,
        max_iter=30000,
        tol=1e-4,
        selection="cyclic",
        random_state=42,
    )),
])

elastic_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("vt", VarianceThreshold(threshold=0.0)),
    ("fs", SelectFromModel(fs_estimator, threshold="median")),  # shared selector
    ("model", ElasticNet(
        alpha=0.01,
        l1_ratio=0.5,
        max_iter=30000,
        tol=1e-4,
        selection="cyclic",
        random_state=42,
    )),
])


### TREE MODELS

# HistGradientBoostingRegressor

hgb_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("vt", VarianceThreshold(threshold=0.0)),
    ("model", HistGradientBoostingRegressor(
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        l2_regularization=0.5,
        random_state=42,
    )),
])

hgb_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("vt", VarianceThreshold(threshold=0.0)),
    ("model", HistGradientBoostingRegressor(
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        l2_regularization=0.5,
        random_state=42,
    )),
])


# RandomForestRegressor

rf_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("vt", VarianceThreshold(threshold=0.0)),
    ("model", RandomForestRegressor(
        n_estimators=300,
        max_depth=None,
        min_samples_split=3,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=True,
        n_jobs=-1,
        random_state=42,
    )),
])

rf_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("vt", VarianceThreshold(threshold=0.0)),
    ("model", RandomForestRegressor(
        n_estimators=300,
        max_depth=None,
        min_samples_split=3,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=True,
        n_jobs=-1,
        random_state=42,
    )),
])


# ExtraTreesRegressor

et_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("vt", VarianceThreshold(threshold=0.0)),
    ("model", ExtraTreesRegressor(
        n_estimators=400,
        max_depth=None,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=False,
        n_jobs=-1,
        random_state=42,
    )),
])

et_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("vt", VarianceThreshold(threshold=0.0)),
    ("model", ExtraTreesRegressor(
        n_estimators=400,
        max_depth=None,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=False,
        n_jobs=-1,
        random_state=42,
    )),
])


### KERNEL-BASED MODEL (SVR)

svr_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("vt", VarianceThreshold(threshold=0.0)),
    ("model", SVR(
        kernel="rbf",
        C=10,
        epsilon=0.1,
        gamma="scale",
    )),
])

svr_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("vt", VarianceThreshold(threshold=0.0)),
    ("fs", SelectFromModel(fs_estimator, threshold="median")),
    ("model", SVR(
        kernel="rbf",
        C=10,
        epsilon=0.1,
        gamma="scale",
    )),
])


### ENSEMBLE META MODEL (Stacking)

stack_pipe_orig = StackingRegressor(
    estimators=[
        ("elastic_orig", elastic_pipe_orig),
        ("hgb_orig", hgb_pipe_orig),
        ("rf_orig", rf_pipe_orig),
    ],
    final_estimator=HistGradientBoostingRegressor(
        learning_rate=0.05,
        max_depth=5,
        l2_regularization=0.5,
        random_state=42,
    ),
    passthrough=False,
    n_jobs=-1,
)

stack_pipe_fe = StackingRegressor(
    estimators=[
        ("hgb_fe", hgb_pipe_fe),
        ("rf_fe", rf_pipe_fe),
        ("et_fe", et_pipe_fe),
    ],
    final_estimator=HistGradientBoostingRegressor(
        learning_rate=0.05,
        max_depth=5,
        l2_regularization=0.5,
        random_state=42,
    ),
    passthrough=False,
    n_jobs=-1,
)

#### 4.4 First run of models

In [0]:
# # TODO uncomment (currently its commented to save time during experimentation)

# # First evaluation of metrics based on original and engineered feature pipeline to decide how to proceed


# models_orig = {
#     # "ElasticNet_orig": elastic_pipe_orig,
#     "HGB_orig": hgb_pipe_orig,
#     "RF_orig": rf_pipe_orig,
#     "ET_orig": et_pipe_orig,
#     "SVR_orig": svr_pipe_orig,
#     "Stack_orig": stack_pipe_orig,
# }

# models_fe = {
#     # "ElasticNet_fe": elastic_pipe_fe,
#     "HGB_fe": hgb_pipe_fe,
#     "RF_fe": rf_pipe_fe,
#     "ET_fe": et_pipe_fe,
#     "SVR_fe": svr_pipe_fe,
#     "Stack_fe": stack_pipe_fe,
# }

# results = []

# for name, model in {**models_orig, **models_fe}.items():
#     print(f"Fitting {name} with cross-validation...")
    
#     # Perform cross-validation on the entire training set
#     cv_results = cross_validate(
#         model, 
#         X_train, 
#         y_train,
#         cv=3,
#         scoring={
#             'neg_mae': 'neg_mean_absolute_error',
#             'neg_mse': 'neg_mean_squared_error',
#             'r2': 'r2'
#         },
#         return_train_score=False,
#         verbose=3,
#         n_jobs=-2
#     )
    
#     # Calculate mean metrics across folds
#     mae = -cv_results['test_neg_mae'].mean()
#     rmse = np.sqrt(-cv_results['test_neg_mse'].mean())
#     r2 = cv_results['test_r2'].mean()
    
#     results.append({
#         "model": name,
#         "feature_set": "original" if name.endswith("_orig") else "engineered",
#         "MAE": mae,
#         "RMSE": rmse,
#         "R2": r2,
#     })

# results_df = (
#     pd.DataFrame(results)
#       .sort_values(["feature_set", "MAE"])
#       .reset_index(drop=True)
# )

# print(results_df)

# # Long Duration (with orig ca 25mins VS without orig ca 6mins VS with CV ca 16mins VS with njobs=-1 ca )

# # Predicted on hold-out val set (20%); the RMSE column in this table holds MSE values (square root not applied):
# #       model feature_set          MAE          RMSE        R2
# # 0     RF_fe  engineered  1299.728938  4.509435e+06  0.950490
# # 1  Stack_fe  engineered  1321.130612  4.831609e+06  0.946953
# # 2     ET_fe  engineered  1328.051439  4.707534e+06  0.948315
# # 3    HGB_fe  engineered  1534.496164  5.609255e+06  0.938415
# # 4    SVR_fe  engineered  2955.064750  3.242891e+07  0.643956

# # Predicted using 3-fold CV on entire data:
# #       model feature_set          MAE         RMSE        R2
# # 0     RF_fe  engineered  1336.806163  2375.850617  0.940424
# # 1  Stack_fe  engineered  1357.266391  2505.029128  0.933786
# # 2     ET_fe  engineered  1364.212656  2399.654669  0.939223
# # 3    HGB_fe  engineered  1551.419964  2503.445871  0.933858
# # 4    SVR_fe  engineered  3068.524237  6130.420383  0.603579

In [0]:
# TODO the following markdown and reasoning for hyperparameter tuning has to be adjusted regarding new insights (e.g. ET is not underperforming anymore) ~J

After a first run comparing the original and the engineered feature pipelines for all models, we decided to focus on RandomForest and HistGradientBoost, which are the best-performing individual models so far.

StackingRegressor currently performs best, but since it only blends the existing models, we focus on tuning those base models first and re-evaluate stacking at the end.

Because ExtraTrees and SVR clearly underperformed in this first run, we decided not to tune their hyperparameters.
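
The comparison above rests on cross-validated MAE, RMSE and R². The following is a minimal sketch, on synthetic data rather than the car-price data, of two ways to aggregate RMSE across folds with `cross_validate`. The names `X_demo` and `y_demo` and the small RandomForest are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for X_train / y_train (assumption, not the project data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

cv_results = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_demo,
    y_demo,
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    scoring={
        "neg_mae": "neg_mean_absolute_error",
        "neg_mse": "neg_mean_squared_error",
        "neg_rmse": "neg_root_mean_squared_error",
    },
)

mae = -cv_results["test_neg_mae"].mean()
# Variant 1: square root of the mean fold MSE (as in the first-run cell)
rmse_from_mean_mse = np.sqrt(-cv_results["test_neg_mse"].mean())
# Variant 2: mean of the per-fold RMSE values (never larger than variant 1)
rmse_mean_of_folds = -cv_results["test_neg_rmse"].mean()

print(f"MAE: {mae:.2f}")
print(f"RMSE (sqrt of mean MSE): {rmse_from_mean_mse:.2f}")
print(f"RMSE (mean of fold RMSE): {rmse_mean_of_folds:.2f}")
```

Both variants are on the same scale as the target, unlike a raw MSE, so either one can be compared directly with MAE as long as the same variant is used for every model.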

### 5. Hyperparameter Tuning and Model Evaluation

In [0]:
# Reusable tuning function: used here and potentially again for a final hyperparameter tuning after feature selection
def model_hyperparameter_tuning(pipeline, param_dist, n_iter=100, splits=5):
    """Run RandomizedSearchCV on the global X_train / y_train and return (best_estimator, search)."""
    cv = KFold(n_splits=splits, shuffle=True, random_state=42) # shuffled K-fold; more folds give a more robust estimate

    # Randomized search setup
    model_random = RandomizedSearchCV(
        estimator=pipeline,
        param_distributions=param_dist,
        n_iter=n_iter,                      # number of different hyperparameter combinations that will be randomly sampled and evaluated (more iterations = more thorough search but longer runtime)
        scoring={
            'mae': 'neg_mean_absolute_error',
            'mse': 'neg_mean_squared_error',
            'r2': 'r2'
        },
        refit='mae', # Refit the best model based on MAE on the whole training set
        cv=cv,
        n_jobs=-2,
        random_state=42,
        verbose=3,
    )

    # Fit the search
    model_random.fit(X_train, y_train)

    mae = -model_random.cv_results_['mean_test_mae'][model_random.best_index_]
    mse = -model_random.cv_results_['mean_test_mse'][model_random.best_index_]
    rmse = np.sqrt(mse)
    r2 = model_random.cv_results_['mean_test_r2'][model_random.best_index_]

    print("Model Results (CV metrics):")
    print(f"MAE: {mae:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"R²: {r2:.4f}")
    print("Best Model params:", model_random.best_params_)

    return model_random.best_estimator_, model_random # return (best refit pipeline, fitted search object)

##### 5.1 ElasticNet

In [0]:
# TODO this cell is commented out because we don't evaluate ElasticNet for final performance (to save time)
# # Hyperparameter Tuning: ElasticNet

# elastic_param_grid = {
#     "model__alpha": [0.001],    # also tried 0.01, 0.05, 0.1, 0.5
#     "model__l1_ratio": [0.9]    # also tried 0.1, 0.3, 0.5, 0.7  
# }

# # CV: Calculate all metrics but use MAE for selecting best model
# elastic_grid = GridSearchCV(
#     elastic_pipe_fe, 
#     param_grid=elastic_param_grid,
#     cv=5,
#     scoring={
#         'mae': 'neg_mean_absolute_error',
#         'mse': 'neg_mean_squared_error',
#         'r2': 'r2'
#     },
#     refit='mae', # Refit the best model based on MAE on the whole training set
#     n_jobs=-2,
#     verbose=3,
#     return_train_score=False
# )
# elastic_grid.fit(X_train, y_train)

# # Get mean metrics across folds
# mae = -elastic_grid.cv_results_['mean_test_mae'][elastic_grid.best_index_]
# mse = -elastic_grid.cv_results_['mean_test_mse'][elastic_grid.best_index_]
# rmse = np.sqrt(mse)
# r2 = elastic_grid.cv_results_['mean_test_r2'][elastic_grid.best_index_]
# print("ElasticNet Results (CV on entire train set):")
# print(f"MAE: {mae:.4f}")
# print(f"RMSE: {rmse:.4f}")
# print(f"R²: {r2:.4f}")
# print("Best ElasticNet params:", elastic_grid.best_params_)

# elastic_best = elastic_grid.best_estimator_ # Final model trained on entire training set with best hyperparameters minimizing MAE

# # Long Duration (Before removal of OHE-categoricals interrupted kernel after 64mins VS after removal ca 1min -> now 15secs with njobs=-2)

# # ElasticNet Results (the RMSE value on the next line is an MSE; the square root was not applied):
# # MAE: 2353.9112 | RMSE: 13356867.7860 | R2: 0.8534
# # Best ElasticNet params: {'model__alpha': 0.001, 'model__l1_ratio': 0.9}

# # MAE: 2589.6100
# # RMSE: 4104.4515
# # R²: 0.8222
# # Best ElasticNet params: {'model__alpha': 0.001, 'model__l1_ratio': 0.9}

In [0]:
# # TODO this cell is commented because of time constraints
# # Use GridSearchCV for features_to_select
# # Base model: tuned ElasticNet from above
# en_base = clone(elastic_best.named_steps["model"])

# # Pipeline: clean preprocessing -> RFE -> model
# rfe_pipe_linear = Pipeline([
#     ("preprocess", preprocessor_fe_clean),
#     ("rfe", RFE(
#         estimator=en_base,
#         step=0.5,               # remove 50% of the features per iteration
#         importance_getter="auto"
#     )),
#     ("model", clone(en_base))
# ])

# # Try a few target feature counts (adjust as needed)
# number_of_all_features = preprocessor_fe_clean.transform(X_train).shape[1]
# rfe_param_grid = {
#     "rfe__n_features_to_select": [int(number_of_all_features*0.5)]# , int(number_of_all_features*1)] # use only these two extremes to save time ~J
# }

# rfe_grid = GridSearchCV(
#     rfe_pipe_linear,
#     param_grid=rfe_param_grid,
#     cv=2,
#     scoring="neg_mean_absolute_error",
#     n_jobs=-1,
#     verbose=3,
#     return_train_score=False,
# )

# rfe_grid.fit(X_train, y_train)

# print("Best n_features_to_select:", rfe_grid.best_params_["rfe__n_features_to_select"])
# print("MAE (CV):", -rfe_grid.best_score_)
# rfe_best = rfe_grid.best_estimator_

# # list kept features
# best_rfe = rfe_best.named_steps["rfe"]
# all_feats = rfe_best.named_steps["preprocess"].get_feature_names_out()
# kept = [f for f, keep in zip(all_feats, best_rfe.support_) if keep]
# print("Kept features:", kept)


**Reasoning**: We used 100 features as an initial, arbitrary cutoff for feature selection in the ElasticNet model. Preliminary experiments and insights from the EDA (see separate notebook) indicated that tree-based methods are likely to perform better. Therefore, we prioritized feature selection for the tree-based models based on SHAP values.
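
The commented-out RFE cell above can be illustrated with a minimal sketch on synthetic data. The feature names `f0` to `f9` and the data are hypothetical; only the `RFE` and `ElasticNet` calls mirror the cell. A float `step` in (0, 1) is read by scikit-learn as the fraction of features removed per iteration (rounded down, at least one).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import ElasticNet

# Synthetic stand-in data with 10 features (assumption, not the project data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=42
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

rfe = RFE(
    estimator=ElasticNet(alpha=0.001, l1_ratio=0.9),
    n_features_to_select=5,  # keep half of the features
    step=0.5,                # remove 50% of the features per iteration
)
rfe.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns
kept = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print("Kept features:", kept)
```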
 

##### 5.2 HistGradientBoost
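
The randomized searches in this section sample from `scipy.stats` distributions. As a small sketch of their semantics: `uniform(loc, scale)` covers the interval `[loc, loc + scale]`, and `randint(low, high)` excludes the upper bound. The variable names below are illustrative assumptions.

```python
from scipy.stats import randint, uniform

lr_dist = uniform(0.01, 0.15)   # loc=0.01, scale=0.15 -> interval [0.01, 0.16]
leaf_dist = randint(50, 150)    # integers 50..149 (upper bound excluded)

lr_samples = lr_dist.rvs(size=1000, random_state=42)
leaf_samples = leaf_dist.rvs(size=1000, random_state=42)

print(f"learning_rate samples: min={lr_samples.min():.4f}, max={lr_samples.max():.4f}")
print(f"max_leaf_nodes samples: min={leaf_samples.min()}, max={leaf_samples.max()}")
```

Reading the second argument of `uniform` as an upper bound is a common mistake; here it would suggest a range that ends at 0.15 instead of 0.16.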

In [0]:
# Old parameter distribution (kept for reference; overwritten by the fixed values below)
hgb_param_dist = {
    "vt__threshold": [0.0, 0.005, 0.01],
    "model__learning_rate": uniform(0.01, 0.15),       # uniform(loc, scale): samples from [0.01, 0.16]
    "model__max_leaf_nodes": randint(50, 150),         # integers from 50 to 149
    "model__min_samples_leaf": randint(2, 20),         # leaf sizes from 2 to 19
    "model__max_iter": randint(200, 900),              # 200 to 899 boosting iterations
    "model__l2_regularization": uniform(0.0, 1.0),     # samples from [0.0, 1.0]
    "model__early_stopping": [True],
    "model__validation_fraction": [0.1],
    "model__n_iter_no_change": [20],
    "model__random_state":[42]
}

# Best parameters from previous runs, fixed here to avoid repeating the long search
hgb_param_dist = {
    "vt__threshold": [0.005],
    "model__learning_rate": [0.06923222772633546],
    "model__max_leaf_nodes": [137],
    "model__min_samples_leaf": [12],
    "model__max_iter": [847],
    "model__l2_regularization": [0.4234014807063696],
    "model__early_stopping": [True],
    "model__validation_fraction": [0.1],
    "model__n_iter_no_change": [20],
    "model__random_state":[42]
}

hgb_best = model_hyperparameter_tuning(hgb_pipe_fe, hgb_param_dist
    # , n_iter=3, 
    # splits=5
) 

# Old preset hps (1min):
# MAE: 1289.7294
# RMSE: 2185.5006
# R²: 0.9498

# Reapplying RandomizedSearchCV (40mins):
# MAE: 1289.6713
# RMSE: 2181.5766
# R²: 0.9500
# Best Model params: {'model__early_stopping': True, 'model__l2_regularization': np.float64(0.14092422497476265), 'model__learning_rate': np.float64(0.08219772826786356), 'model__max_iter': 464, 'model__max_leaf_nodes': 108, 'model__min_samples_leaf': 8, 'model__n_iter_no_change': 20, 'model__random_state': 42, 'model__validation_fraction': 0.1}

# Using transmission and fuelType as OHE instead of TE (10mins):
# MAE: 1283.2876
# RMSE: 2184.9071
# R²: 0.9499
# Best Model params: {'model__early_stopping': True, 'model__l2_regularization': np.float64(0.4234014807063696), 'model__learning_rate': np.float64(0.06923222772633546), 'model__max_iter': 847, 'model__max_leaf_nodes': 137, 'model__min_samples_leaf': 12, 'model__n_iter_no_change': 20, 'model__random_state': 42, 'model__validation_fraction': 0.1}

# Removed manual TE to prevent data leakage (30 s):
# MAE: 1264.2084
# RMSE: 2152.0902
# R²: 0.9513
# [fixed] Best Model params: {'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}

# Fix (remove certain) fillnas in feature_engineering:
# MAE: 1259.7342
# RMSE: 2153.3145
# R²: 0.9513
# [fixed] Best Model params: {'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}

# Added sklearn targetencoder in pipeline:
# MAE: 1275.7318
# RMSE: 2171.0317
# R²: 0.9505
# Best Model params: {'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}

# Scale target encoded features:
# MAE: 1275.7318
# RMSE: 2171.0317
# R²: 0.9505
# Best Model params: {'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}


# Removed anchor in anchor features (no division by overall mean):
# MAE: 1275.7318
# RMSE: 2171.0317
# R²: 0.9505
# Best Model params: {'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}
# ==> Tree-based models do not need normalized features, so this is expected

# Fixed leakage in med_price_anchor (smoothing):
# MAE: 1268.8764
# RMSE: 2160.6397
# R²: 0.9510
# Best Model params: {'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}

# Use only mean te for 'Brand' and 'model' instead of mean and median te:
# MAE: 1273.0038
# RMSE: 2182.3061
# R²: 0.9500
# Best Model params: {'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}

# Use only median te for 'Brand' and 'model' instead of mean te:
# MAE: 1256.4969
# RMSE: 2134.2390
# R²: 0.9521
# Best Model params: {'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}


# Use cv=10 in TargetEncoder instead of default 5:

# Use GroupMedianImputer for categorical_transformer_fe_ohe and categorical_transformer_fe_te instead of SimpleImputer ():
 
# --> 

# categorical_features = ["transmission", "fuelType","Brand", "model"]
# Model Results (CV metrics):
# MAE: 1257.7883
# RMSE: 2154.7975
# R²: 0.9513

# categorical_features = ["transmission", "fuelType"]
# MAE: 1262.7020
# RMSE: 2148.1356
# R²: 0.9515
# Best Model params: {'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}

# Drop features that are highly correlated with other features (>0.9):
# MAE: 1293.6773
# RMSE: 2206.5226
# R²: 0.9488

# vt__threshold in pipeline
# MAE: 1260.7132
# RMSE: 2151.5396
# R²: 0.9514
# Best Model params: {'vt__threshold': 0.005, 'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}

# Added imputer to QuantileEncoder pipeline step (performance was like this before, so the minimal change comes from differences between Jan's and Elias's setups):
# MAE: 1259.5270
# RMSE: 2148.1023
# R²: 0.9515
# Best Model params: {'vt__threshold': 0.005, 'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}

# Use only median imputer
# MAE: 1256.7489
# RMSE: 2132.2187
# R²: 0.9522
# Best Model params: {'vt__threshold': 0.005, 'model__validation_fraction': 0.1, 'model__random_state': 42, 'model__n_iter_no_change': 20, 'model__min_samples_leaf': 12, 'model__max_leaf_nodes': 137, 'model__max_iter': 847, 'model__learning_rate': 0.06923222772633546, 'model__l2_regularization': 0.4234014807063696, 'model__early_stopping': True}

##### 5.3 RandomForest

In [0]:
# Old parameter distribution
rf_param_dist = {
    "vt__threshold": [0.0, 0.005, 0.01],
    "model__n_estimators": randint(200, 600),        # number of trees
    "model__max_depth": randint(5, 40),              # depth of each tree
    "model__min_samples_split": randint(2, 10),      # min samples to split an internal node
    "model__min_samples_leaf": randint(1, 8),        # min samples per leaf
    "model__max_features": ["sqrt"],           # feature sampling strategy (sqrt performed better than log2 and None in previous tests)
    "model__bootstrap": [False]                      # use bootstrapping or not (False performed better than True in previous tests)
}

# So far best parameter distribution based on previous runs to focus search space
# RF param grid incl. VarianceThreshold
rf_param_dist = {
    "vt__threshold": [0.005],
    "model__n_estimators": [328],
    "model__max_depth": [20],
    "model__min_samples_split": [5],
    "model__min_samples_leaf": [1],
    "model__max_features": ["sqrt"],
    "model__bootstrap": [False],
}

rf_best_rand = model_hyperparameter_tuning(rf_pipe_fe, rf_param_dist)

# joblib.dump(rf_best_rand, "rf_best_rand.pkl")


# Long Duration (~2min)

# MAE: 1275.1518
# RMSE: 2232.9070
# R²: 0.9477

# Reapplying RandomizedSearchCV (~120mins):
# MAE: 1272.4144
# RMSE: 2214.9228
# R²: 0.9486
# Best Model params: {'model__bootstrap': False, 'model__max_depth': 27, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 1, 'model__min_samples_split': 5, 'model__n_estimators': 328}

# Using transmission and fuelType as OHE instead of TE (140mins):
# Model Results (CV metrics):
# MAE: 1271.8784
# RMSE: 2223.6243
# R²: 0.9482
# Best Model params: {'model__bootstrap': False, 'model__max_depth': 20, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 1, 'model__min_samples_split': 5, 'model__n_estimators': 386}

# Removed manual TE to prevent data leakage (1min):
# MAE: 1270.0122
# RMSE: 2203.1101
# R²: 0.9491
# [fixed] Best Model params: {'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# # Fix (remove certain) fillnas in feature_engineering (1min):
# MAE: 1267.4191
# RMSE: 2199.6899
# R²: 0.9492
# [fixed] Best Model params: {'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# Added sklearn targetencoder in pipeline:
# MAE: 1254.2265
# RMSE: 2185.9107
# R²: 0.9499
# Best Model params: {'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# Scale target encoded features:
# MAE: 1254.2265
# RMSE: 2185.9107
# R²: 0.9499
# Best Model params: {'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# Removed anchor in anchor features (no division by overall mean):
# MAE: 1254.2265
# RMSE: 2185.9107
# R²: 0.9499
# Best Model params: {'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}
# ==> Tree-based models do not need normalized features, so this is expected

# Fixed leakage in med_price_anchor (smoothing):
# MAE: 1249.7952
# RMSE: 2185.0019
# R²: 0.9500
# Best Model params: {'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# Use only mean te for 'Brand' and 'model' instead of mean and median te:
# MAE: 1262.8323
# RMSE: 2215.5205
# R²: 0.9486
# Best Model params: {'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# Use only median te for 'Brand' and 'model' instead of mean te:
# MAE: 1262.6739
# RMSE: 2196.6384
# R²: 0.9494
# Best Model params: {'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# Use GroupMedianImputer for categorical_transformer_fe_ohe and categorical_transformer_fe_te instead of SimpleImputer ():
# --> 

# categorical_features = ["transmission", "fuelType","Brand", "model"]
# Model Results (CV metrics):
# MAE: 1301.7759
# RMSE: 2273.5359
# R²: 0.9458

# categorical_features = ["transmission", "fuelType"]
# MAE: 1249.6696
# RMSE: 2184.4328
# R²: 0.9500
# Best Model params: {'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# Drop features that are highly correlated with other features (>0.9):
# MAE: 1376.7839
# RMSE: 2399.4670
# R²: 0.9396

# vt__threshold in pipeline
# MAE: 1248.3166
# RMSE: 2179.9166
# R²: 0.9502
# Best Model params: {'vt__threshold': 0.005, 'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# added imputer to QuantileEncoder pipeline step
# MAE: 1248.3166
# RMSE: 2179.9166
# R²: 0.9502
# Best Model params: {'vt__threshold': 0.005, 'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

# Use only median imputer
# Model Results (CV metrics):
# MAE: 1265.6359
# RMSE: 2206.0571
# R²: 0.9490
# Best Model params: {'vt__threshold': 0.005, 'model__n_estimators': 328, 'model__min_samples_split': 5, 'model__min_samples_leaf': 1, 'model__max_features': 'sqrt', 'model__max_depth': 20, 'model__bootstrap': False}

In [0]:
pipe = rf_best_rand if hasattr(rf_best_rand, "named_steps") else rf_best_rand[0]
pre = pipe.named_steps["preprocess"]
vt = pipe.named_steps.get("vt")

# names after preprocessing, taken from the fitted preprocessor of the pipeline
feat_names = get_feature_names_from_preprocessor(pre)

# apply VT mask if present
if vt is not None:
    mask = vt.get_support()
    feat_names = np.array(feat_names)[mask]

importances = pipe.named_steps["model"].feature_importances_
feature_importance_df = pd.DataFrame(
    {"feature": feat_names, "importance": importances}
).sort_values("importance", ascending=False)

print("Feature Importances:")
for _, row in feature_importance_df.iterrows():
    print(f"{row['feature']:30s}: {row['importance']:.6f}")


In [0]:
#stop here # Fail the code here because we don't use SHAP values and don't need the StackingRegressor

##### 5.4 StackingRegressor

In [0]:
# Old parameter distribution
stack_param_dist = {
    "final_estimator__learning_rate": uniform(0.02, 0.1),
    "final_estimator__max_depth": randint(3, 10),
    "final_estimator__min_samples_leaf": randint(3, 20),
    "final_estimator__l2_regularization": uniform(0.0, 1.0),
}

# So far best parameter distribution based on previous runs to focus search space
stack_param_dist = {
    "final_estimator__learning_rate": [0.061135390505667866],
    "final_estimator__max_depth": [5],
    "final_estimator__min_samples_leaf": [10],
    "final_estimator__l2_regularization": [0.19438003399487302]
}

stack_best = model_hyperparameter_tuning(stack_pipe_fe, stack_param_dist, splits=3)
# joblib.dump(stack_best, "stack_best.pkl")


# Long Duration (~3mins)

# MAE: 1351.8682
# RMSE: 2498.2822
# R²: 0.9342

# After RandomizedSearchCV:
# MAE: 1350.4717
# RMSE: 2497.0474
# R²: 0.9343
# Best Model params: {'final_estimator__l2_regularization': np.float64(0.978892858275009), 'final_estimator__learning_rate': np.float64(0.06867421529594551), 'final_estimator__max_depth': 6, 'final_estimator__min_samples_leaf': 13}

# Removed ElasticNet from stacking due to poor performance compared to RF and HGB alone
# Run was canceled, but the CV scores up to that point didn't seem to show much improvement

# Using transmission and fuelType as OHE instead of TE():
# MAE: 1357.4291
# RMSE: 2516.5470
# R²: 0.9333
# Best Model params: {'final_estimator__l2_regularization': np.float64(0.19438003399487302), 'final_estimator__learning_rate': np.float64(0.061135390505667866), 'final_estimator__max_depth': 5, 'final_estimator__min_samples_leaf': 10}


# Removed fillna(0) in feature engineering for a_x_b and model_freq():
# was worse for hgb and rf so not tested for stacking

# ...

# implemented GroupModeImputer
# MAE: 1329.2379
# RMSE: 2453.0239
# R²: 0.9366
# Best Model params: {'final_estimator__min_samples_leaf': 10, 'final_estimator__max_depth': 5, 'final_estimator__learning_rate': 0.061135390505667866, 'final_estimator__l2_regularization': 0.19438003399487302}


### 6. Feature Importance of Tree Models (with SHAP)

**Problem:** Current feature selection targets linear models (ElasticNet), but we primarily use tree-based models (HGB, RandomForest).

**Solution:** Use SHAP (SHapley Additive exPlanations) to identify feature importance specifically for tree models.

**Why SHAP for Trees:**
- Provides exact feature importance values for tree-based models
- Tree models handle irrelevant features, but noise features still impact performance
- Enables data-driven selection rather than statistical filter methods
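
To make the ranking step concrete, here is a minimal sketch of how a matrix of SHAP values is reduced to one importance score per feature (mean absolute SHAP value) and then used to pick the top-N columns. The attribution matrix is synthetic and the feature names are hypothetical; the `shap` library itself is not called here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
feature_names = np.array(["feat_a", "feat_b", "feat_c", "feat_noise"])  # hypothetical names

# Synthetic attribution matrix: one row per sample, one column per feature.
# Columns are scaled so that feat_a contributes most and feat_noise least.
scales = np.array([3.0, 2.0, 1.0, 0.1])
shap_vals = rng.normal(size=(500, 4)) * scales

# Global importance = mean absolute attribution per feature
importance = np.abs(shap_vals).mean(axis=0)
shap_df = (
    pd.DataFrame({"feature": feature_names, "importance": importance})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)

# Column indices of the top-N features, in original column order
top_n = 2
top_features = shap_df.head(top_n)["feature"].tolist()
col_idx = [i for i, name in enumerate(feature_names) if name in top_features]

print(shap_df.to_string(index=False))
print("Selected column indices:", col_idx)
```

Taking the absolute value before averaging matters: positive and negative contributions of a strong feature would otherwise cancel out.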

In [0]:
# TODO probably use CV for SHAP
def calculate_shap_values(pipeline, X, sample_size=1000, seed=42, label=None):
    # accept either pipeline or (pipeline, search_obj) # TODO check what the parameter is and dont use if isinstance...
    pipe = pipeline[0] if isinstance(pipeline, tuple) else pipeline

    pre = pipe.named_steps["preprocess"]
    vt = pipe.named_steps.get("vt")
    model = pipe.named_steps["model"]

    X_proc = pre.transform(X)
    feat_names = pre.get_feature_names_out()

    # apply VT mask if present
    if vt is not None:
        mask = vt.get_support()
        X_proc = vt.transform(X_proc)
        feat_names = np.array(feat_names)[mask]

    rng = np.random.default_rng(seed)
    n = min(sample_size, len(X_proc))
    idx = rng.choice(len(X_proc), n, replace=False)

    explainer = shap.TreeExplainer(model)
    shap_vals = explainer.shap_values(X_proc[idx])
    importance = np.abs(shap_vals).mean(axis=0)

    shap_df = (pd.DataFrame({"feature": feat_names, "importance": importance})
               .sort_values("importance", ascending=False)
               .reset_index(drop=True))

    tag = label or model.__class__.__name__
    print(f"Most important features ({tag}):")
    print(shap_df.head(20).to_string(index=False))
    return shap_df, feat_names, X_proc

In [0]:
def train_model_on_best_features(shap_importance, pipeline, model, X, y,
                                    range_number_of_features, folds=5, seed=42):
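    """
    Select top-N features by SHAP, evaluate MAE via CV, and return best estimator and feature list.

    Note: X is transformed once with the already fitted preprocessor, so all CV folds share
    that fit and the SHAP ranking. The reported MAE may therefore be slightly optimistic.
    """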
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    pre = pipeline.named_steps["preprocess"]
    vt = pipeline.named_steps.get("vt")

    # preprocess once on full data
    X_proc = pre.transform(X)
    feat_names = pre.get_feature_names_out()
    if vt is not None:
        mask = vt.get_support()
        X_proc = vt.transform(X_proc)
        feat_names = np.array(feat_names)[mask]

    # helper to extract column indices for top-N features
    def select_cols(n):
        top_feats = shap_importance.head(n)["feature"].tolist()
        return [i for i, fname in enumerate(feat_names) if fname in top_feats]

    results = []
    for n in range_number_of_features:
        idx = select_cols(n)
        mae_folds = []
        for train_idx, val_idx in kf.split(X_proc):
            X_tr, X_val = X_proc[train_idx][:, idx], X_proc[val_idx][:, idx]
            y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
            est = clone(model)
            est.fit(X_tr, y_tr)
            mae_folds.append(mean_absolute_error(y_val, est.predict(X_val)))
        results.append({"n": n, "mae": np.mean(mae_folds), "idx": idx})

    best = min(results, key=lambda r: r["mae"])
    best_features = [feat_names[i] for i in best["idx"]]

    # fit final estimator on full data restricted to best features
    final_est = clone(model)
    final_est.fit(X_proc[:, best["idx"]], y)

    print("CV MAE by feature count:")
    for r in results:
        print(f"  n={r['n']:3d} | MAE={r['mae']:.2f}")
    print(f"Best: n={best['n']} | MAE={best['mae']:.2f}")

    return final_est, best_features

In [0]:
# Old Version
# # General function which can be called by the models to avoid redundant code and enable easy maintenance
# def train_model_on_best_features(baseline_mae, shap_importance, model, X_train_processed, X_val_processed, range_number_of_features, feature_names_all):
#     '''
#     We systematically test different numbers of top features to find the optimal subset:
#     We train the model with the same optimized hyperparameters but using only the most important features identified by SHAP
#     '''
#     # Track best model
#     results = []
#     best_model = None
#     best_mae = float("inf")
#     best_n = None
#     best_features = None

#     # Find best feature counts
#     for n_features in range_number_of_features:
#         # Select top N features
#         top_features = shap_importance.head(n_features)["feature"].tolist()
#         feature_indices = [i for i, fname in enumerate(feature_names_all) if fname in top_features]

#         X_train_subset = X_train_processed[:, feature_indices]
#         X_val_subset   = X_val_processed[:, feature_indices]

#         # Train and predict using selected amount of features (model uses tuned hyperparams)
#         model.fit(X_train_subset, y_train)
#         pred_subset = model.predict(X_val_subset)
#         mae_subset = mean_absolute_error(y_val, pred_subset)
#         results.append({"n_features": n_features, "mae": mae_subset})

#         # Check whether current mae is best so far
#         if mae_subset < best_mae:
#             best_mae = mae_subset
#             best_n = n_features
#             best_model = model
#             best_features = top_features

#         # Print MAE for each amount of features
#         if n_features in range_number_of_features:
#             improvement_rf = baseline_mae - mae_subset
#             print(f"Top {n_features:3d} features: MAE: {mae_subset:.1f} (Δ: {improvement_rf:+.1f})")


#     print(f"\nOptimal feature selection results:")
#     print(f"Best performance with {best_n} features: MAE: {best_mae:.2f}")
#     print(f"Improvement over baseline: {baseline_mae - best_mae:+.2f} MAE\n")

#     print(f"Optimal {best_n} features for production model:")
#     for i, feat in enumerate(best_features, start=1):
#         imp = shap_importance.loc[shap_importance['feature'] == feat, 'importance'].values[0]
#         print(f"{i:2d}. {feat:25s} ({imp:.1f})")
    
#     # Retrain a fresh final estimator on the full training set restricted to best_features (guarantees correct input dimension)
#     selected_idx = [i for i, fname in enumerate(feature_names_all) if fname in best_features]
#     final_est = clone(model)
#     final_est.fit(X_train_processed[:, selected_idx], y_train)

#     return final_est, best_features

In [0]:
def plot_top_shap(shap_df, model_name, top_k=20):
    top_df = shap_df.head(top_k).iloc[::-1]
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(top_df["feature"], top_df["importance"], color="#4C72B0")
    ax.set_xlabel("Average |SHAP| value")
    ax.set_title(f"Top {top_k} {model_name} features by SHAP")
    plt.tight_layout()
    plt.show()


#### 6.1 HGB

##### Step 1: Baseline Performance with Optimized Hyperparameters

In [0]:
# unpack tuple from model_hyperparameter_tuning
hgb_pipe, hgb_search = hgb_best  # hgb_pipe is the fitted pipeline

# CV metrics at best params
idx = hgb_search.best_index_
mae_cv  = -hgb_search.cv_results_['mean_test_mae'][idx]
rmse_cv = np.sqrt(-hgb_search.cv_results_['mean_test_mse'][idx])
r2_cv   = hgb_search.cv_results_['mean_test_r2'][idx]

# feature count after preprocess (+ vt)
X_proc = hgb_pipe.named_steps["preprocess"].transform(X_train)
vt = hgb_pipe.named_steps.get("vt")
if vt is not None:
    X_proc = vt.transform(X_proc)
n_features_total = X_proc.shape[1]

print("Baseline Performance of HGB (CV on train):")
print(f"MAE:  {mae_cv:.4f}")
print(f"RMSE: {rmse_cv:.4f}")
print(f"R²:   {r2_cv:.4f}")
print(f"Total features used: {n_features_total}")


In [0]:
# Old Version
# X_val_processed_hgb = hgb_best.named_steps["preprocess"].transform(X_val)
# hgb_val_pred = hgb_best.named_steps["model"].predict(X_val_processed_hgb)
# n_features_total = X_val_processed_hgb.shape[1]
# baseline_mae_hgb = mean_absolute_error(y_val, hgb_val_pred)

# print("Baseline Performance of HGB model after Hyperparameter Tuning:\n")
# print_metrics(y_val, hgb_val_pred)
# print(f"\nTotal features used: {n_features_total}")

##### Step 2: SHAP Feature Importance Analysis

In [0]:
shap_importance_df_hgb, feature_names_all_hgb, X_train_processed_hgb = calculate_shap_values(
    hgb_best[0],  # the fitted pipeline
    X_train,
    sample_size=1000,
    seed=42,
    label="HGB"
)

In [0]:
# Old Version
# #function
# shap_importance_df_hgb, feature_names_all_hgb, X_train_processed_hgb = calculate_shap_values(
#     hgb_best, X_train, log_features, numeric_features, categorical_features,
#     sample_size=1000, seed=42, label="HGB"
# )

In [0]:
# HGB SHAP bar plot
plot_top_shap(shap_importance_df_hgb, "HGB", top_k=20)

##### Step 3: Automated Feature Selection Optimization

In [0]:
# unpack tuned pipeline and base model
hgb_pipe = hgb_best[0]  # best_estimator_
hgb_model = hgb_pipe.named_steps["model"]

# feature count after preprocess (+ vt)
X_proc = hgb_pipe.named_steps["preprocess"].transform(X_train)
vt = hgb_pipe.named_steps.get("vt")
if vt is not None:
    X_proc = vt.transform(X_proc)
n_features_total = X_proc.shape[1]

# denser grid of feature counts (adjust step to your runtime budget)
range_number_of_features_hgb = list(range(20, 31))
# optionally include full set if not already in the list
if n_features_total not in range_number_of_features_hgb:
    range_number_of_features_hgb.append(n_features_total)

best_model_hgb, best_features_hgb = train_model_on_best_features(
    shap_importance_df_hgb,  # from your SHAP importance
    hgb_pipe,
    hgb_model,
    X_train,
    y_train,
    range_number_of_features_hgb,
    folds=5,
    seed=42,
)

In [0]:
# Old Version
# # Define model with the same hyperparams
# hgb_model = hgb_best.named_steps["model"]
# hgb_selected = HistGradientBoostingRegressor(**hgb_model.get_params())

# # Number of top SHAP features to try
# range_number_of_features_hgb = range(15, n_features_total + 1, 1) # After previous runs with higher step size, the range is now narrowed down

# # Train/evaluate on subsets of top features
# best_model_hgb, best_features_hgb = train_model_on_best_features(
#     baseline_mae_hgb, shap_importance_df_hgb,
#     hgb_selected,
#     X_train_processed_hgb, X_val_processed_hgb,
#     range_number_of_features_hgb,
#     feature_names_all_hgb
# )

# # Long Duration (ca 2mins)

# # start: Best performance with 17 features: MAE: 1288.12
# # removed ohe categoricals: Best performance with 19 features: MAE: 1282.81
# # After FS removal: Best performance with 19 features (all features bc deselection filtered the same features as SHAP): MAE: 1282.81

In [0]:
# Build the final pipeline with feature selection included
def select_best_features_hgb(X):
    # X is the output of the preprocessing step: matrix with all features after they have been scaled, encoded, and combined by the preprocessor
    idx = [i for i, fname in enumerate(feature_names_all_hgb) if fname in best_features_hgb]
    return X[:, idx]

hgb_final_pipe = Pipeline([
    ("preprocess", hgb_pipe.named_steps["preprocess"]),  # hgb_best is a (pipeline, search) tuple, so use the unpacked pipeline
    ("feature_selector", FunctionTransformer(select_best_features_hgb, validate=False)), # a flexible wrapper that applies a custom function to the data flow in a pipeline
    ("model", best_model_hgb)
])

# Save the best model for later use
joblib.dump(hgb_final_pipe, "hgb_best_feature.pkl")

#### 6.2 RF

##### Step 1: Baseline Performance with Optimized Hyperparameters

In [0]:
# unpack tuned RF pipeline and search object
rf_pipe, rf_search = rf_best_rand  # (fitted pipeline, search object) tuple from model_hyperparameter_tuning

# CV metrics at best params
idx = rf_search.best_index_
mae_cv  = -rf_search.cv_results_['mean_test_mae'][idx]
rmse_cv = np.sqrt(-rf_search.cv_results_['mean_test_mse'][idx])
r2_cv   = rf_search.cv_results_['mean_test_r2'][idx]

# feature count after preprocess (+ vt)
X_proc_rf = rf_pipe.named_steps["preprocess"].transform(X_train)
vt_rf = rf_pipe.named_steps.get("vt")
if vt_rf is not None:
    X_proc_rf = vt_rf.transform(X_proc_rf)
n_features_total_rf = X_proc_rf.shape[1]

print("Baseline Performance of RF (CV on train):")
print(f"MAE:  {mae_cv:.4f}")
print(f"RMSE: {rmse_cv:.4f}")
print(f"R²:   {r2_cv:.4f}")
print(f"Total features used: {n_features_total_rf}")


In [0]:
# Old Version
# # Use the tuned RF pipeline (rf_best_rand) and compute baseline on the validation set
# X_val_processed_rf = rf_best_rand.named_steps["preprocess"].transform(X_val)
# rf_val_pred = rf_best_rand.named_steps["model"].predict(X_val_processed_rf)
# n_features_total_rf = X_val_processed_rf.shape[1] # TODO cant we just use one val_processed and one n_features_total or why did we split that? ~J
# baseline_mae_rf = mean_absolute_error(y_val, rf_val_pred)

# print("Baseline Performance of RF model after Hyperparameter Tuning:\n")
# print_metrics(y_val, rf_val_pred)
# print(f"\nTotal features used: {n_features_total_rf}")

# # start: MAE: 1322.4418 | RMSE: 4630733.9922 | R2: 0.9492 (Total features used: 155)
# # after removing ohe categoricals: MAE: 1282.0634 | RMSE: 4317505.8622 | R2: 0.9526 (Total features used: 22)
# # After FS removal: MAE: 1275.2603 | RMSE: 4274727.2255 | R2: 0.9531 (Total features used: 19)

##### Step 2: SHAP Feature Importance Analysis

In [0]:
shap_importance_df_rf, feature_names_all_rf, X_train_processed_rf = calculate_shap_values(
    rf_best_rand[0],        # unpacked best RF pipeline
    X_train,
    sample_size=100,   # adjust if you need more precision
    seed=42,
    label="RF"
)

In [0]:
# RF SHAP bar plot
plot_top_shap(shap_importance_df_rf,  "RF",  top_k=20)

In [0]:
# Old Version
# shap_importance_df_rf, feature_names_all_rf, X_train_processed_rf = calculate_shap_values(
#     rf_best_rand, X_train, log_features, numeric_features, categorical_features,
#     sample_size=100, seed=42, label="RF"
# )

# # Long Duration (ca 4mins)

##### Step 3: Automated Feature Selection Optimization

In [0]:
# unpack tuned RF pipeline and base model
rf_pipe = rf_best_rand[0]              # best_estimator_ from tuning
rf_model = rf_pipe.named_steps["model"]

# feature count after preprocess (+ vt)
X_proc_rf = rf_pipe.named_steps["preprocess"].transform(X_train)
vt_rf = rf_pipe.named_steps.get("vt")
if vt_rf is not None:
    X_proc_rf = vt_rf.transform(X_proc_rf)
n_features_total_rf = X_proc_rf.shape[1]

# choose candidate feature counts (dense; adjust for runtime)
range_number_of_features_rf = list(range(20, 31))
if n_features_total_rf not in range_number_of_features_rf:
    range_number_of_features_rf.append(n_features_total_rf)

# CV-based subset search
best_model_rf, best_features_rf = train_model_on_best_features(
    shap_importance_df_rf,   # from shap_importance_cv / calculate_shap_values
    rf_pipe,
    rf_model,
    X_train,
    y_train,
    range_number_of_features_rf,
    folds=5,                 # or 3 if you need it faster
    seed=42,
)


In [0]:
# Old
# # Use the same processed validation data and reuse tuned RF hyperparameters
# rf_params = {k.replace("model__", ""): v for k, v in rf_random.best_params_.items()}
# rf_selected = RandomForestRegressor(random_state=42, n_jobs=-1, **rf_params)
# range_number_of_features_rf = range(16, n_features_total_rf + 1, 1) # After previous runs with higher step size, the range is now narrowed down

# best_model_rf, best_features_rf = train_model_on_best_features(baseline_mae_rf, shap_importance_df_rf, rf_selected, X_train_processed_rf, X_val_processed_rf, range_number_of_features_rf, feature_names_all_rf)

# # Long Duration (ca 1min)

# # start: Best performance with 26 features: MAE: 1277.16
# # removed ohe categoricals: Best performance with 20 features: MAE: 1273.89
# # After FS removal: Best performance with 19 features (all features bc deselection filtered the same features as SHAP): MAE: 1275.26

In [0]:
# Build the final RF pipeline with feature selection included
def select_best_features_rf(X):
    idx = [i for i, fname in enumerate(feature_names_all_rf) if fname in best_features_rf]
    return X[:, idx]

final_rf_pipe = Pipeline([
    ("preprocess", rf_pipe.named_steps["preprocess"]),  # rf_best_rand is a (pipeline, search) tuple, so use the unpacked pipeline
    ("feature_selector", FunctionTransformer(select_best_features_rf, validate=False)),
    ("model", best_model_rf)
])

# Save the best RF model for later use
joblib.dump(final_rf_pipe, "rf_best_feature.pkl")

#### 6.3 Build Final Stacking Regressor to mix tuned and feature selected HGB and RF

In [0]:
stack_pipe_final = StackingRegressor(
    estimators=[
        ("hgb_final", hgb_final_pipe),   # tuned HGB pipeline (preprocessor + feature selector + model)
        ("rf_final",  final_rf_pipe),    # tuned RF pipeline (preprocessor + feature selector + model)
    ],
    final_estimator=LinearRegression(),  # simple, perfect for 2 base preds
    passthrough=False,                   # meta model sees only base predictions
    cv=5,                                # proper OOF stacking
    n_jobs=1                             # no BrokenProcessPool on Databricks
)

stack_pipe_final.fit(X_train, y_train)
stack_val_pred = stack_pipe_final.predict(X_val)
print_metrics(y_val, stack_val_pred)

joblib.dump(stack_pipe_final, "stack_pipe.pkl")

# MAE: 1255.3112 | RMSE: 4157099.9081 | R2: 0.9544

# Kaggle Score submit 1274 !! OVERFITTED

# Long Duration (ca 3mins)

# start: MAE: 1256.5922 | RMSE: 4154147.3742 | R2: 0.9544
# removed ohe categoricals: MAE: 1252.2718 | RMSE: 4146270.8422 | R2: 0.9545
# After FS removal: MAE: 1252.9558 | RMSE: 4145097.0520 | R2: 0.9545

The final stacking regressor (SR), which combines the tuned HGB and RF models, improved over the best single HGB and RF models on the validation set (MAE 1258 vs. 1281/1289).

However, it appears to overfit: the Kaggle score is only 1274.

We therefore keep the RF and HGB models as candidates. With such a small difference in MAE, both of them and the stacking regressor need further evaluation.

### 7. Kaggle Competition

Extra Task (1 Point): Be in the Top 5 Groups on Kaggle

In [0]:
def predict_on_test(model_pipeline, model_name):
    # Load best model from Joblib and predict on validation set to verify
    pipe_best = joblib.load(model_pipeline)
    pred_loaded = pipe_best.predict(X_val)
    print(f"Loaded {model_name}-model MAE on validation set: {mean_absolute_error(y_val, pred_loaded):.2f}")

    # Predict on test set; keep the predictions in a separate Series so that df_cars_test
    # is not modified and can be reused unchanged for the next model
    test_pred = pd.Series(pipe_best.predict(df_cars_test), index=df_cars_test.index, name="price")
    test_pred.to_csv(f'Group05_{model_name}_Version10.csv', index=True)

In [0]:
predict_on_test("hgb_best_feature.pkl", "HGB")

In [0]:
predict_on_test("rf_best_feature.pkl", "RF")

In [0]:
predict_on_test("stack_pipe.pkl", "Stack")

In [0]:
# !kaggle competitions submit -c cars4you -f Group05_Version05.csv -m "Message" # Uncomment to submit to Kaggle

In [0]:
!kaggle competitions submissions -c cars4you

### 8. Open-Ended-Section

#### 8.1 Counterfactual Sensitivity Maps

**a) Objective and motivation (0.5v)**

We have a trained model that predicts a car's price from features (age, mileage, engine size, paint quality, mpg, etc.). For a single car, we want to answer:

_“What small changes to the car's features would be required to make the model predict an x% higher or lower price?”_ (we tried ±10%)

That is a counterfactual question: “If this car were different in X ways, would the model value it differently?”

Instead of only knowing which features are important overall, this tells us, for this exact car, what the model would need to see to change its valuation.

Examples of use:
- If the counterfactual says “increasing paintQuality by 5 raises the price by 10%”, a business could decide whether doing that repair is worth it.
- If the model requires impossible changes to reach +10%, the car should be flagged for manual inspection or considered non-upgradeable.

------------
**Limitations:**
- The greedy search is heuristic — it is fast and interpretable but not guaranteed to find the global minimal change. It is designed for practical interpretability rather than mathematical optimality.
- Categorical changes (e.g., changing `Brand` or `model`) are **not** attempted by default to avoid generating invalid combinations; they can be added later, given a realistic whitelist of allowed swaps.
- Some input features are physically immutable or economically infeasible to change; results must be interpreted by a domain expert.
-----


**b) Difficulty of tasks (1v)**

Implementing counterfactual sensitivity analysis in this project is non-trivial because:

1. **Complex Preprocessing Pipeline:** The model uses hierarchical imputers, log transforms, target encodings, scaling, and many engineered features.
All counterfactuals must operate in the raw input space, while still being compatible with the full sklearn pipeline.

2. **Non-Differentiable Model:** Our final model is tree-based and does not provide gradients. Therefore, we implemented a greedy local search algorithm that:
    - perturbs the most influential numeric features (ranked via permutation importance),
    - accepts only changes that move the prediction closer to the target,
    - shrinks step sizes adaptively when progress stalls,
    - keeps proposed values within the training min/max and derives step sizes from the training q10→q90 range.

3. **Handling Mixed Pandas Dtypes:** The real dataset uses Int64 nullable types, which break when assigning floats. All numeric fields were therefore converted to consistent float64 inputs before search.
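
The dtype issue in point 3 can be reproduced on a tiny frame. The following is a minimal sketch, assuming a made-up single-row instance and made-up training medians (`inst` and `train_median` here are illustrative, not the project's data): each numeric column is cast to `float64` first and only then filled and perturbed.

```python
import pandas as pd

# Hypothetical single-row instance with nullable-integer columns, as pandas
# produces when an integer column contains missing values.
inst = pd.DataFrame({
    "mpg": pd.array([55], dtype="Int64"),
    "tax": pd.array([pd.NA], dtype="Int64"),
})
train_median = {"mpg": 54.3, "tax": 145.0}  # assumed training medians

# Cast each numeric column to float64 first, then fill gaps with the median.
for col in ["mpg", "tax"]:
    as_float = pd.to_numeric(inst[col], errors="coerce").astype("float64")
    inst[col] = as_float.fillna(train_median[col])

# Fractional perturbations can now be written back without dtype errors.
inst.at[0, "mpg"] = inst.at[0, "mpg"] + 2.5
print(inst.dtypes.to_dict(), inst.iloc[0].to_dict())
```

Casting the whole column before any assignment avoids writing a fractional value into an `Int64` column, which is the step that fails.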

-----

**c) Correctness and efficiency of implementation (1v)**

We used a greedy, local search that operates in the input (raw) feature space. Our idea:

1. Pick one car (one row of features).
2. Compute the current predicted price using the full pipeline.
3. Decide a target: e.g., +10% (increase) or -10% (decrease).
4. Pick a short list of numeric features that the model tends to rely on (we rank features by permutation importance on a training sample).
5. Iteratively nudge each of those features by one step, upwards for a positive target and downwards for a negative one:
    - Step sizes are based on realistic feature variation from the training set (q10→q90 range).
    - After each proposed change we run the entire pipeline and get a new prediction.
    - We keep the change only if it moved the prediction closer to the target.
    - If nothing improves, we reduce step sizes and try again, for up to a maximum number of iterations.
6. Stop when the target is reached or we run out of iterations / step size becomes tiny.
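
The steps above can be sketched on a toy model. This is a minimal illustration, not the notebook's implementation: `toy_predict`, the feature values, step sizes and bounds are all made up. Unlike `find_counterfactual_greedy`, which always steps in the sign of the target, this variant tries an increase and a decrease for every feature and keeps whichever moves the prediction toward the target.

```python
def toy_predict(car):
    # Hypothetical stand-in for pipe.predict (not the project's model):
    # price rises with paint quality and falls with mileage.
    return 10_000 + 40 * car["paintQuality"] - 0.05 * car["mileage"]


def greedy_counterfactual(car, predict, steps, bounds, target_rel=0.10, max_iters=50):
    cur, steps = dict(car), dict(steps)
    cur_pred = predict(cur)
    target = cur_pred * (1 + target_rel)
    sign = 1 if target_rel > 0 else -1

    for _ in range(max_iters):
        if sign * (cur_pred - target) >= 0:       # target reached
            break
        progressed = False
        for f in steps:
            if sign * (cur_pred - target) >= 0:
                break
            lo, hi = bounds[f]
            for delta in (steps[f], -steps[f]):   # try an increase, then a decrease
                trial = dict(cur)
                trial[f] = min(hi, max(lo, cur[f] + delta))
                new_pred = predict(trial)
                if sign * (new_pred - cur_pred) > 1e-8:   # keep only improvements
                    cur, cur_pred, progressed = trial, new_pred, True
                    break
        if not progressed:                        # stalled: halve all step sizes
            steps = {f: s / 2 for f, s in steps.items()}
            if all(s < 1e-9 for s in steps.values()):
                break

    deltas = {f: cur[f] - car[f] for f in car if cur[f] != car[f]}
    return cur_pred, sign * (cur_pred - target) >= 0, deltas


car = {"paintQuality": 50, "mileage": 40_000}
final_pred, success, deltas = greedy_counterfactual(
    car, toy_predict,
    steps={"paintQuality": 5, "mileage": 5_000},
    bounds={"paintQuality": (0, 100), "mileage": (0, 200_000)},
)
print(final_pred, success, deltas)
```

Returning the differences from the original values (rather than the new values) makes the size of each suggested change directly readable.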

Important implementation details to make this robust:
- Coerce all numeric inputs to float (to avoid pandas nullable-int assignment errors).
- Fill missing numeric inputs with training medians before searching so the pipeline will not break.
- Bound proposed values within training min/max to avoid unrealistic values.
- Use a deterministic search (no random elements) so results are reproducible.
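
As an illustration of the step-size and bounding rules, here is a small sketch on synthetic data. The `train_mpg` column is randomly generated and the helper `propose` is hypothetical; only the 0.15 step fraction and the q10/q90 and min/max logic follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
train_mpg = rng.normal(55, 12, size=1_000)   # synthetic stand-in for a training column

# Step size: a fraction of the q10 -> q90 spread of the training data.
q10, q90 = np.quantile(train_mpg, [0.10, 0.90])
step = 0.15 * (q90 - q10)

# Bounds: the observed training minimum and maximum.
lo, hi = train_mpg.min(), train_mpg.max()


def propose(value, direction):
    """Move one step in `direction` (+1 or -1) and clip to the training range."""
    return float(np.clip(value + direction * step, lo, hi))


near_max = propose(hi - step / 2, +1)   # would overshoot, so it is clipped to hi
mid_down = propose(55.0, -1)            # an ordinary downward step
print(step, near_max, mid_down)
```

A proposal that is clipped back to its starting value leaves the prediction unchanged, so the search rejects it and halves that feature's step.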

------

**d) Discussion of results (1v)**

For each sampled test car (we sampled 30):
- orig_pred: the original model prediction for that car.
- target (implicit): the target price we tried to reach (orig_pred × 1.10 for +10%).
- final_pred: the model's prediction after applying the accepted changes.
- success: True if the greedy search reached (or exceeded) the target, False otherwise.
- iters: how many iterations the search ran.
- changes: dictionary of features the search changed and the new values it set for them.

We also saved open_ended_counterfactual_summary.csv with these results.

1. **Success Rates:** +10% target: 6.67%  // -10% target: 0.00%
    - For the sampled cars, only ~1 in 15 (6.7%) could reach a +10% increase by changing realistic features; none could be reduced by 10% using realistic changes.
    - That means most cars are already at the top (or bottom) of what the model considers plausible given their brand, age, engine size, etc. Small **realistic** tweaks cannot move the valuation.

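2. **Most used Features** by the Algorithm: miles_per_year: 37 times; mpg: 31 times; tax: 28 times
    - When the algorithm tries to change the prediction, it almost always does so by adjusting these three features. These are the model's “go-to” levers to move predictions locally.
    - Note that `miles_per_year` is excluded from the search in `"realistic"` mode, so these counts presumably come from a run over the full numeric feature set.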

3. **Average Values After Suggested Changes**: mpg: 59.4; miles_per_year: 7,698; tax: 119.6
    - The `changes` dictionary records the new value of each feature, not the difference from the original, so these are average end values (59 mpg, 7.7k miles/year, 120 tax units) and not increments. The actual size of each change has to be judged against the car's original values.
    - Increasing miles per year is not actionable, and it is implausible that more miles would increase a car's price. Wherever the search accepted such a step, the model has learnt an unrealistic relationship.
    - This tells the business: even where changes are suggested, they rarely reach the target, so the car's price is effectively locked by stronger features (brand, age, engine size).

4. **Successful Example** (index 100785): orig_pred = 10,783; final_pred = 12,014 (success True); changes: {'mpg': 54.7, 'tax': 164.5}  (1 iteration)
    - The model was able to push the price up by 10% in one step by dramatically increasing mpg and tax. But those required changes are huge and not realistic — this was a “success” only in the model's internal logic, not a practical recommendation.

5. **Failed example** (index 83321): orig_pred = 14,470; final_pred = 14,746  (success False); changes: {'mpg': 62.35, 'miles_per_year': 6,347.74}  (26 iterations)
    - Even after many iterations and very large changes, the model could not reach a +10% increase. This shows the model is constrained by other features (age/brand/engine) that we did not change and cannot realistically change.
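
When reading the `changes` column, keep in mind that the search stores the new value of each feature (`changes[f] = float(proposed)`), not the increment. A minimal sketch of how the actual deltas could be recovered, using entirely made-up numbers for one hypothetical car:

```python
# Hypothetical original feature values of one car (not taken from the dataset).
original = {"mpg": 55.0, "tax": 145.0, "paintQuality": 60.0}

# Hypothetical `changes` dict in the format returned by the search:
# feature -> new value, not the increment.
changes = {"mpg": 58.0, "tax": 150.0}

# Actual size of each change = new value minus original value.
deltas = {f: new - original[f] for f, new in changes.items()}
print(deltas)
```

Any threshold for "large" changes (for example on `mpg` or `tax`) should be applied to these deltas rather than to the stored new values.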

----    

**e) Alignment with objectives (0.5v)**

1. **Clear objective**: Our goal was simple: check what small, realistic changes to a car's features would be needed to move its predicted price by ±10%. This gives practical “what-if” insights instead of only global feature importance.
2. **Non-trivial task**: Although the question is simple, solving it was technically difficult because our model uses a complex preprocessing pipeline and non-linear algorithms. We had to design a custom search method that safely modifies input features and runs the entire pipeline each time.
3. **Correct and efficient implementation**: The solution only changes features that make sense, keeps values within realistic ranges, and automatically handles pipeline steps like scaling and imputing. It is deterministic, fast, and works directly with the real production pipeline.
4. **Meaningful findings**: The results showed: Most cars cannot be pushed ±10% with realistic feature changes. The model often requested unrealistic adjustments (e.g., huge mpg changes), revealing rigid pricing behaviour. Some features the model tried to use (like increasing miles_per_year) were unrealistic, helping us spot places where the model relies on odd patterns.
5. **Objective achieved**: We obtained exactly what we wanted: clear, car-specific explanations and simple recommendations (“small repairs” vs. “manual review”). The insights directly support the project's aim of understanding and improving price evaluations.


In [0]:
# Basic asserts so failures are explicit 
assert "pipe" in globals(), "pipeline object `pipe` not found - set `pipe` to your final Pipeline"
assert "X_train" in globals(), "X_train not found"
assert "y_train" in globals(), "y_train not found"
assert "df_cars_test" in globals(), "df_cars_test not found"

# Build the list of numeric features we will consider
num_candidates = []
if "numeric_features_clean" in globals():
    num_candidates += list(numeric_features_clean)
if "log_features_clean" in globals():
    num_candidates += list(log_features_clean)
if not num_candidates:
    if "numeric_features" in globals():
        num_candidates += list(numeric_features)
    if "log_features" in globals():
        num_candidates += list(log_features)

# keep only columns that actually exist in X_train (avoid KeyErrors)
num_features = [f for f in dict.fromkeys(num_candidates) if f in X_train.columns]
print(f"Numeric features detected ({len(num_features)}): {num_features}")

# Build robust summary statistics (numpy floats only) used for bounding and step sizes.
train_stats = {}
for c in num_features:
    arr = pd.to_numeric(X_train[c], errors="coerce").dropna().to_numpy(dtype="float64")
    if arr.size > 0:
        train_stats[c] = {
            "min": float(np.nanmin(arr)),
            "max": float(np.nanmax(arr)),
            "q10": float(np.nanquantile(arr, 0.10)),
            "q90": float(np.nanquantile(arr, 0.90)),
            "median": float(np.nanmedian(arr)),
            "mean": float(np.nanmean(arr)),
            "std": float(np.nanstd(arr, ddof=1) if arr.size > 1 else 0.0)
        }
    else:
        train_stats[c] = {"min": 0.0, "max": 0.0, "q10": 0.0, "q90": 0.0, "median": 0.0, "mean": 0.0, "std": 0.0}

# quick sanity output for first features
print("Example train stats (first 5 features):")
for i, (k, v) in enumerate(train_stats.items()):
    print(f"  {k}: q10={v['q10']}, median={v['median']}, q90={v['q90']}")
    if i >= 4:
        break


In [0]:
# Permutation importance gives a robust ranking even with encodings and OHE.

COMPUTE_PERM = True  # flip to False if you want to skip this step and just use the numeric_features ordering

perm_imp = None
if COMPUTE_PERM:
    sample_size = min(2000, len(X_train))
    X_sample = X_train.sample(sample_size, random_state=0)
    y_sample = y_train.loc[X_sample.index]
    print("Computing permutation importance and sampling training data")
    res = permutation_importance(pipe, X_sample, y_sample, n_repeats=6, random_state=42, n_jobs=1)
    # res.importances_mean corresponds to the columns in X_sample
    perm_imp = pd.Series(res.importances_mean, index=X_sample.columns).sort_values(ascending=False)
    # keep only numeric features for ranking
    perm_imp = perm_imp.loc[[c for c in perm_imp.index if c in num_features]]
    print("Top numeric features by permutation importance:\n", perm_imp.head(10))
else:
    print("Permutation importance disabled. Ranked features will default to numeric feature order.")

In [0]:
# Robust greedy counterfactual search
def find_counterfactual_greedy(
    instance: pd.DataFrame,
    pipe,
    X_train: pd.DataFrame,
    target_rel: float = 0.10,
    mode: str = "realistic",
    step_frac: float = 0.15,
    max_iters: int = 200,
    verbose: bool = False,
    perm_importance_series: pd.Series = None
):
    """
    Greedy, deterministic counterfactual search for a single-row `instance`.
    Returns a dict with original/final predictions, success flag, changed features, and trajectory.

    Key design decisions:
    - Work in input-space only (do not modify internals).
    - Coerce numeric inputs to float64, fill missing with train median.
    - Use training q10-q90 ranges to compute sensible step sizes.
    - 'realistic' mode excludes non-actionable features (e.g. miles_per_year).
    """
    assert instance.shape[0] == 1, "instance must be single-row DataFrame"

    # copy and coerce numeric columns to float64 first (nullable Int64 columns reject
    # fractional values), then fill missing values with the training median
    inst = instance.copy()
    numeric_present = [f for f in num_features if f in inst.columns]
    for f in numeric_present:
        col = pd.to_numeric(inst[f], errors="coerce").astype("float64")
        inst[f] = col.fillna(float(train_stats.get(f, {}).get("median", 0.0)))

    cur = inst.copy()
    original_pred = float(pipe.predict(cur)[0])
    target = original_pred * (1.0 + target_rel)
    direction = 1 if target_rel > 0 else -1

    # realistic allowed features: keep only features that can be reasonably changed
    # NOTE: miles_per_year is excluded because it's not actionable in reality.
    if mode == "realistic":
        allowed = [f for f in ["paintQuality", "tax", "mpg"] if f in num_features]
        # allowed = [f for f in ["paintQuality"] if f in num_features]
    else:
        allowed = [f for f in num_features if f in cur.columns]

    # rank allowed features by permutation importance if available
    if perm_importance_series is not None:
        ranked = [f for f in perm_importance_series.index if f in allowed]
    else:
        ranked = allowed.copy()

    # fallback
    if not ranked:
        ranked = [f for f in num_features if f in cur.columns]

    # compute step sizes from q10->q90
    steps = {}
    for f in ranked:
        s = train_stats.get(f)
        range_q = max(1e-8, (s["q90"] - s["q10"]) if s else step_frac)
        steps[f] = range_q * step_frac

    cur_pred = original_pred
    trajectory = [cur_pred]
    changes = {}
    feat_steps = {f: 0.0 for f in ranked}

    # quick exit if already at target (same keys as the regular return value)
    if direction * (cur_pred - target) >= 0:
        return {'index': cur.index[0], 'original_pred': original_pred, 'target': target,
                'final_pred': cur_pred, 'success': True, 'iters': 0, 'changes': {}, 'trajectory': trajectory,
                'feature_steps': feat_steps}

    it = 0
    while it < max_iters:
        progressed = False
        for f in ranked:
            old_raw = pd.to_numeric(cur.at[cur.index[0], f], errors="coerce")
            old_val = float(old_raw) if not pd.isna(old_raw) else float(train_stats.get(f, {}).get("median", 0.0))
            step = steps.get(f, 0.0)
            if abs(step) < 1e-12:
                continue

            proposed = old_val + direction * step
            bounds = train_stats.get(f, {"min": -1e12, "max": 1e12})
            proposed = max(float(bounds["min"]), min(float(bounds["max"]), float(proposed)))
            cur.at[cur.index[0], f] = float(proposed)

            try:
                new_pred = float(pipe.predict(cur)[0])
            except Exception:
                # revert and reduce step on failure
                cur.at[cur.index[0], f] = float(old_val)
                steps[f] *= 0.5
                continue

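            # accept only if it moves towards the target
            if direction * (new_pred - cur_pred) > 1e-8:
                changes[f] = float(proposed)  # new value of the feature, not the increment
                feat_steps[f] += float(proposed - old_val)  # applied change (smaller than the step if clipped)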
                cur_pred = new_pred
                trajectory.append(cur_pred)
                progressed = True
                if verbose:
                    print(f"it {it} | {f}: {old_val:.4f} -> {proposed:.4f} | pred {cur_pred:.2f}")
                if direction * (cur_pred - target) >= 0:
                    break
            else:
                # revert and shrink this feature's step
                cur.at[cur.index[0], f] = float(old_val)
                steps[f] *= 0.5

        if not progressed:
            # shrink all steps if no progress this iteration
            for k in steps:
                steps[k] *= 0.5

        if direction * (cur_pred - target) >= 0:
            break
        if all(abs(s) < 1e-12 for s in steps.values()):
            break
        it += 1

    success = (direction * (cur_pred - target) >= 0)
    return {
        'index': cur.index[0],
        'original_pred': original_pred,
        'target': target,
        'final_pred': cur_pred,
        'success': bool(success),
        'iters': it,
        'changes': changes,
        'trajectory': trajectory,
        'feature_steps': feat_steps
    }


In [0]:
# Run counterfactuals on a small test sample and save results

# Parameters adjustable for other samples:
SAMPLE_N = 30                 # how many test rows to run counterfactuals for
TARGET_RELS = [0.10, -0.10]   # compute +10% and -10% counterfactuals
MODE = "realistic"            # "realistic" or "full" features defined above in greedy counterfactual
STEP_FRAC = 0.15
MAX_ITERS = 200
VERBOSE = False

# sample deterministic indices from the provided test set
sample_idx = df_cars_test.sample(min(SAMPLE_N, len(df_cars_test)), random_state=0).index.tolist()

perm_imp_small = perm_imp.copy() if ('perm_imp' in globals() and perm_imp is not None) else None

cf_records = []
for idx in tqdm(sample_idx, desc="counterfactuals"):
    inst = df_cars_test.loc[[idx]].copy()
    for tr in TARGET_RELS:
        cf = find_counterfactual_greedy(
            inst, pipe, X_train,
            target_rel=tr,
            mode=MODE,
            step_frac=STEP_FRAC,
            max_iters=MAX_ITERS,
            verbose=VERBOSE,
            perm_importance_series=perm_imp_small
        )
        cf['target_rel'] = tr
        cf_records.append(cf)

# collect into DataFrame and save
rows = []
for r in cf_records:
    rows.append({
        'index': r['index'],
        'target_rel': r['target_rel'],
        'orig_pred': r['original_pred'],
        'final_pred': r['final_pred'],
        'success': r['success'],
        'iters': r['iters'],
        'changes': r['changes']
    })
df_cf_summary = pd.DataFrame(rows).set_index('index')
df_cf_summary.to_csv("open_ended_counterfactual_summary.csv")
print("Saved open_ended_counterfactual_summary.csv — sample of counterfactual attempts")


In [0]:
# Compact numeric summary results

# Success rates
success_rates = df_cf_summary.groupby("target_rel")["success"].mean().rename("success_rate")
print("Success rates (fraction of tested cars where the algorithm reached the target):")
for idx, v in success_rates.items():
    print(f"  {int(idx*100)}% target: {v:.3f} ({v*100:.1f}%)")
print()

# Which features were changed most often
feature_counter = Counter()
for changes in df_cf_summary["changes"]:
    for f in changes.keys():
        feature_counter[f] += 1
print("Top features changed (count):", feature_counter.most_common())

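# Average new value per changed feature
# Note: `changes` stores the new feature value set by the search, not the difference from the original value.
avg_change = {}
for f in feature_counter.keys():
    new_values = [changes[f] for changes in df_cf_summary["changes"] if f in changes]
    if new_values:
        avg_change[f] = float(np.mean(new_values))
print("Average new value of changed features:", avg_change)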


In [0]:
# Example plots and recommendation table

# helper: basic visualization for a single summary row
def visualize_counterfactual_summary(index, target_rel, df_summary=df_cf_summary, df_test=df_cars_test, pipe=pipe):
    rec = df_summary.loc[index]
    if isinstance(rec, pd.DataFrame):
        rec = rec.loc[rec['target_rel'] == target_rel].iloc[0]
    inst = df_test.loc[[index]]
    orig_pred = float(pipe.predict(inst)[0])
    final_pred = float(rec['final_pred'])
    changes = rec['changes']
    plt.figure(figsize=(6,3.5))
    plt.bar(['Original', f"After changes ({int(target_rel*100)}%)"], [orig_pred, final_pred], color=['#4C72B0', '#55A868'])
    plt.ylabel("Predicted price (GBP)")
    plt.title(f"Row {index} | Success: {rec['success']}")
    plt.tight_layout()
    plt.show()
    print(f"Original: £{orig_pred:,.0f} | After: £{final_pred:,.0f} | Steps: {rec['iters']} | Success: {rec['success']}")
    if changes:
        print("Changes applied (feature -> new value):")
        for k, v in changes.items():
            print(f"  - {k}: {v:,.2f}")
    print("---")

# show two illustrative examples if present (two indexes we use in markdown)
examples_to_show = [100785, 83321]
for idx in examples_to_show:
    if idx in df_cf_summary.index:
        try:
            visualize_counterfactual_summary(idx, 0.10)
        except Exception as e:
            print("visualization error for", idx, ":", e)

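# simple textual recommendation column
LARGE_CHANGE = {"mpg": 15, "miles_per_year": 2000, "tax": 50}

def simple_recommendation(row):
    if not row['success']:
        return "Cannot reach target - manual review"
    changes = row['changes']
    if not changes:
        return "No change needed"
    # `changes` holds new values, so compare against the original row to get the actual change
    orig_row = df_cars_test.loc[row['index']]

    def delta(k, v):
        orig = pd.to_numeric(orig_row.get(k), errors="coerce")
        if pd.isna(orig):
            orig = train_stats.get(k, {}).get("median", 0.0)  # search started from the median
        return abs(v - float(orig))

    large = any(k in LARGE_CHANGE and delta(k, v) > LARGE_CHANGE[k] for k, v in changes.items())
    return "Consider small repairs" if not large else "Likely unrealistic - manual review"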

df_readable = df_cf_summary.reset_index().copy()
df_readable['recommendation'] = df_readable.apply(simple_recommendation, axis=1)

# display a clean table
display_cols = ["index", "orig_pred", "final_pred", "target_rel", "success", "iters", "changes", "recommendation"]
display_df = df_readable.loc[:, display_cols].rename(columns={
    "index": "row_id",
    "orig_pred": "Original price (GBP)",
    "final_pred": "Final price after changes (GBP)",
    "target_rel": "Target change",
    "success": "Reached target?",
    "iters": "Search steps",
    "changes": "What changed (feature -> new value)"
})

# format for readability
display_df["Original price (GBP)"] = display_df["Original price (GBP)"].round(0)
display_df["Final price after changes (GBP)"] = display_df["Final price after changes (GBP)"].round(0)
display_df["Target change"] = display_df["Target change"].apply(lambda x: f"{int(x*100)}%")
display_df["Reached target?"] = display_df["Reached target?"].apply(lambda v: "Yes" if v else "No")

display(display_df.head(10))
