<div style="
    background: rgba(25, 25, 25, 0.55);
    backdrop-filter: blur(16px) saturate(150%);
    -webkit-backdrop-filter: blur(16px) saturate(150%);
    border: 1px solid rgba(255, 255, 255, 0.12);
    border-radius: 18px;
    padding: 45px 30px;
    text-align: center;
    font-family: 'Inter', 'Segoe UI', 'Helvetica Neue', Arial, sans-serif;
    color: #e0e0e0;
    box-shadow: 0 0 30px rgba(0, 0, 0, 0.35);
    margin: 40px auto;
    max-width: 800px;
">

  <h1 style="
      font-size: 2.8em;
      font-weight: 700;
      margin: 0 0 8px 0;
      letter-spacing: -0.02em;
      background: linear-gradient(90deg, #00e0ff, #9c7eff);
      -webkit-background-clip: text;
      -webkit-text-fill-color: transparent;
  ">
      Machine Learning Project
  </h1>

  <h2 style="
      font-size: 1.6em;
      font-weight: 500;
      margin: 0 0 25px 0;
      color: #b0b0b0;
      letter-spacing: 0.5px;
  ">
      Cars 4 You - Predicting Car Prices
  </h2>

  <p style="
      font-size: 1.25em;
      font-weight: 500;
      color: #c0c0c0;
      margin-bottom: 6px;
  ">
      Group 5 - Lukas Belser, Samuel Braun, Elias Karle, Jan Thier
  </p>

  <p style="
      font-size: 1.05em;
      font-weight: 400;
      color: #8a8a8a;
      font-style: italic;
      letter-spacing: 0.5px;
  ">
      Machine Learning End Results · 22.12.2025
  </p>
</div>


### **Table of Contents**
 
- [1. Import Packages and Data](#1-import-packages-and-data)  
  - [1.1 Import Required Packages](#11-import-required-packages)  
  - [1.2 Load Datasets](#12-load-datasets)  
  - [1.3 Kaggle Setup](#13-kaggle-setup)  
- [2. Preprocessing](#2-data-cleaning-feature-engineering-split--preprocessing)  
  - [2.1 Data Cleaning](#21-data-cleaning)  
  - [2.2 Feature Engineering](#22-feature-engineering)  
  - [2.3 (No) Data Split](#23-data-split)  
  - [2.4 Encoding, Transforming and Scaling](#24-preprocessing)  
  - [2.5. Feature Selection](#3-feature-selection)  
- [4. Model Evaluation Metrics, Baselining, Setup](#4-model-evaluation-metrics-baselining-setup)  
- [5. Hyperparameter Tuning and Model Evaluation](#5-hyperparameter-tuning-and-model-evaluation)  
  - [5.1 ElasticNet](#51-elasticnet)  
  - [5.2 HistGradientBoost](#52-histgradientboost)  
  - [5.3 RandomForest](#53-randomforest)  
  - [5.4 ExtraTrees](#54-extratrees)  
- [6. Feature Importance of Tree Models (with SHAP)](#6-feature-importance-of-tree-models-with-shap)  
  - [6.1 HGB](#61-hgb)  
  - [6.2 RF](#62-rf)  
- [7. Kaggle Competition](#7-kaggle-competition)  

TODO finish + update toc > at the end of project

In [None]:
# TODO

<img src="images/process_ML.png" alt="Drawing" style="width: 1000px;"/>

### 0. Outline

#### **Group Member Contribution**    
What part(s) of the work were done by each member and an estimated % contribution of each member towards the final work.

Jan: 
- EDA
- Pipeline structure (Data Split)
- Missing Values Handling
- Feature Engineering
- Encoding, Transforming and Scaling
- Feature Selection 
- Model Assessment Strategy
- Baseline Model Comparison
- Hyperparameter Tuning of Top Candidates
- Comparison of Tuned Models
- Deployment and Prediction on Test Set
- Visualizations and Insights of Best Model and the Data Preparation Pipeline
- Discussion and Outlook

Samu:
- Pipeline chart and Introduction Markdowns
- Data Cleaning
- Outlier Handling
- Feature Engineering
- Encoding, Transforming and Scaling
- Model Assessment Strategy
- Baseline Model Comparison
- Deployment and Prediction on Test Set
- Open End: SHAP Feature Importance
- Open End: Brand-specific model comparison

Elias:
- Open End: SHAP Feature Importance
- Open End: Analytical Interface
- Open End: Deep-Learning

Estimated percentage:
Jan: 25%
Samu: 25%
Elias: 25%
Lukas: 25%

#### Abstract

The Cars4You project aims to accelerate and standardize used-car price evaluations by replacing manual, subjective pricing with a production-ready machine learning pipeline. Our objective was to optimize predictive accuracy (MAE) on unseen cars while ensuring robustness to wrong inputs and a leakage-free evaluation.

Our EDA, comprising univariate, bivariate and multivariate analysis, showed three dominant challenges: (1) inconsistent raw entries (typos, invalid ranges, sparse categories), (2) strong segmentation effects by brand/model, and (3) heavy-tailed numeric distributions (notably mileage). 

We addressed these with a custom-engineered, reproducible sklearn pipeline. It follows a state-of-the-art pipeline architecture and consists of the following transformers: deterministic cleaning and category canonicalization with `CarDataCleaner` and hierarchical imputation with `IndividualHierarchyImputer`. We then added domain-informed feature engineering with `CarFeatureEngineer` to encode depreciation, usage intensity, efficiency/performance ratios, interaction effects, and relative positioning within brand/model segments.

Encoding and scaling were consolidated in a `ColumnTransformer` combining selective log transforms, `RobustScaler`, one-hot encoding, and median target encoding for high-signal categorical structure.

To reduce noise and improve generalization, we implemented automated feature selection as a dedicated pipeline stage: VarianceThreshold followed by majority voting across complementary selectors (Spearman relevance+redundancy, mutual information, and tree-based importance). SHAP was used strictly for interpretability and diagnostics in the end.

All model selection and tuning followed a consistent 5-fold cross-validation protocol. MAE was set as the primary evaluation metric at the beginning of the project; we also evaluated RMSE and R². After a first run of different models on original and engineered features, we decided to further tune the hyperparameters of the tree-based models HistGradientBoost and RandomForest. The final tuned RF pipeline improved substantially over a naive median baseline (MAE ≈ 6.8k), achieving approximately **£1.2k MAE** in cross-validation.

For detailed methodological reasoning, trade-offs and further findings please refer to the respective sections.

#### I. Identifying Business Needs

**Context**       
Cars 4 You is an online car-resale company that buys cars from many brands. Sellers submit car details online, and the company traditionally relies on a mechanic inspection before making a purchase offer. Company growth has increased waiting lists for inspections, which risks losing potential customers to competitors. The business need is therefore to **expedite pricing** by generating a reliable **pre-inspection price estimate** directly from user-provided inputs.

**Main goals**     
**Business goal:** provide a fast, consistent price estimate at intake to reduce inspection bottlenecks and improve conversion.

**ML goal:** train a supervised **regression** model to predict `price` (GBP) from features available at form submission time. Because `paintQuality%` is filled by a mechanic during evaluation, it is treated as **non-production** input and is excluded from the deployed prediction path.

**Overall process (end-to-end)**      
1. **Data intake:** use the 2020 training dataset (features + target `price`) to develop and validate models; use the provided test dataset (features only) for final predictions and Kaggle submission.
2. **EDA → preprocessing decisions:** identify data inconsistencies, missingness patterns, segmentation by brand/model, and heavy-tailed variables; translate insights into pipeline steps.
3. **Leakage-safe pipeline:** implement cleaning, outlier handling, hierarchical imputation, feature engineering, encoding/scaling, and feature selection as sklearn transformers inside a single `Pipeline` so that every step is fitted only on training folds.
4. **Model benchmarking & optimization:** compare candidate regressors on a consistent protocol; tune the most promising models; select the most generalizable pipeline based on validation performance.
5. **Deployment output:** fit the selected pipeline on the full training set and generate predictions for `test.csv` to produce the final submission `.csv`.

**Model assessment strategy:**     
We adopt **5-fold cross-validation (CV)** on the training set as the single assessment strategy used throughout benchmarking and tuning. The primary metric is **MAE** (business-interpretable error in GBP), with RMSE and R² reported as complementary diagnostics. The Kaggle test set functions as an **external holdout** for the final chosen pipeline (labels hidden; performance observed via leaderboard score).

For more details on the respective steps in the pipeline, refer to the specific part in the notebook below.
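
The assessment protocol can be sketched in a few lines. The snippet below is a minimal illustration on synthetic data (an assumption, not our car dataset): it scores a naive median baseline with 5-fold CV and reports MAE, RMSE and R². The names `X_demo`, `y_demo` and `cv_demo` are hypothetical and only serve this example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for features and prices (assumption, not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 15_000 + 3_000 * X_demo[:, 0] + rng.normal(scale=500, size=200)

# Same kind of splitter as used later in the notebook: 5 shuffled folds, fixed seed
cv_demo = KFold(n_splits=5, shuffle=True, random_state=5)

# sklearn maximizes scores, so error metrics are exposed with a negative sign
scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}

# Naive baseline: always predict the median price of the training fold
cv_out = cross_validate(
    DummyRegressor(strategy="median"), X_demo, y_demo, cv=cv_demo, scoring=scoring
)

baseline_mae = -cv_out["test_mae"].mean()
baseline_rmse = -cv_out["test_rmse"].mean()
baseline_r2 = cv_out["test_r2"].mean()
print(f"MAE={baseline_mae:.0f}  RMSE={baseline_rmse:.0f}  R2={baseline_r2:.3f}")
```

Any candidate pipeline can be passed in place of the `DummyRegressor`; because the splitter is shared, all candidates are scored on identical folds.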

In [None]:
%%html
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
 
<style>
  .mermaid {
    font-size: 22px;
    width: 1400px;
    max-width: 100%;
    margin: 0 auto;
  }
</style>
 
<div class="mermaid">
%%{init: {"theme":"base","themeVariables":{"primaryColor":"#B6D040","primaryBorderColor":"#6C7A1A","lineColor":"#6C7A1A","primaryTextColor":"#0f172a"}}}%%
flowchart LR
A["<b>Raw input</b><br/><br/>df_cars_train<br/><br/>"] --> B["<b>CarDataCleaner</b><br/><br/>Canonicalize categories,<br/>fix ranges, robust typing"]
B --> C["<b>IndividualHierarchyImputer</b><br/><br/>Hierarchical, segment-aware <br/>imputation"]
C --> D["<b>CarFeatureEngineer</b><br/><br/>Depreciation ratios & interactions,<br/>segment positioning"]
D --> E["<b>Preprocessing</b><br/><br/>ColumnTransformer<br/>Log for Skewed<br/>RobustScaler<br/>One-Hot-Encoding<br/>Mean/Median Target Encoding"]
E --> F["<b>Feature Selection</b><br/><br/>VarianceThreshold<br/>Majority Vote Selectors"]
F --> G["<b><br/>Model Setup</b><br/><br/>"]
 
classDef step fill:#B6D040,stroke:#6C7A1A,stroke-width:2px,color:#0f172a,rx:8,ry:8;
class A,B,C,D,E,F,G step;
linkStyle default stroke:#6C7A1A,stroke-width:2px;
</div>
 
<script>
  mermaid.initialize({
    startOnLoad: true,
    flowchart: { curve: "basis", nodeSpacing: 40, rankSpacing: 60 }
  });
</script>

#### II. Data Exploration and Preprocessing

**Data Exploration:** For the analysis of the original features including the consequences for preprocessing, refer to notebook `group05_exploratory_data_analysis.ipynb`. These insights are then used to clean and prepare the data.     

**Preprocessing:** The steps taken to clean and prepare the data based on exploration are described in the respective Subsections in Section 2.

**Top 3 EDA Key Insights:**
- The target is strongly correlated with year, mileage and engineSize, while other features (e.g. previousOwners) show no correlation
- There are also high correlations between features (e.g. Spearman correlation of roughly -0.8 for mileage and year)
- Some features, especially mileage and the target price, are right-skewed -> log-transform
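
To illustrate why a log-transform helps with right-skewed features, here is a small sketch on synthetic, log-normally distributed values (an assumption standing in for mileage, not our data). It compares the sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for mileage (assumption)
rng = np.random.default_rng(42)
mileage_demo = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=5000))

# Sample skewness: clearly positive for the raw values, close to zero after the log
skew_raw = mileage_demo.skew()
skew_log = np.log1p(mileage_demo).skew()
print(f"skew raw: {skew_raw:.2f} | skew after log1p: {skew_log:.2f}")
```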

#### III. Regression Benchmarking

**Explanation of model assessment strategy and metrics used:**


**Feature Selection Strategy and results:**


**Optimization efforts: presentation, results and discussion:**


**Comparison of performance between candidate models:**

#### IV. Open-Ended Section

**Objectives for the Section and description of the actions taken:**

**Results and discussion of main findings → key takeaways:**

#### V. Deployment
The final section of the notebook implements the pipeline to generate reliable predictions for new data. The pipeline is stored in the `et_tuned_pipe.pkl` file and the final output of the predicted test data is stored in `Group05_Version20.csv` (selected as best on Kaggle).

### 1. Import Packages and Data

#### 1.1 Import Required Packages

In [None]:
!pip install kaggle
!pip install shap
!pip install -U scikit-learn
!pip install category_encoders
!pip install ydata-profiling

In [None]:
import pandas as pd
import numpy as np
import os
 
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, TargetEncoder, StandardScaler, FunctionTransformer, RobustScaler
from category_encoders import QuantileEncoder
from category_encoders.wrapper import NestedCVWrapper
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.feature_selection import VarianceThreshold, SelectFromModel
 
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor, StackingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

from scipy.stats import uniform, randint
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.metrics import mean_absolute_error

import matplotlib.pyplot as plt; plt.rcParams.update({"figure.max_open_warning": 0, "figure.dpi": 100})
import shap

import joblib
import pickle

from pipeline_functions import CarDataCleaner, IndividualHierarchyImputer, CarFeatureEngineer, DebugTransformer, MajorityVoteSelectorTransformer, MutualInfoThresholdSelector, SpearmanRelevancyRedundancySelector, create_model_pipe, get_cv_results, model_hyperparameter_tuning
from visualization_functions import plot_selector_agreement, plot_val_mae_comparison, plot_train_val_comparison

#### 1.2 Load Datasets

In [None]:
df_cars_train = pd.read_csv("train.csv").rename(columns={"Brand": "brand",
                                                        "paintQuality%": "paintQuality"})
df_cars_test = pd.read_csv("test.csv").rename(columns={"Brand": "brand",
                                                       "paintQuality%": "paintQuality"})

# Check for duplicates in carID column
print(f"Number of duplicate carIDs in training data: {df_cars_train['carID'].duplicated().sum()}")
print(f"Total rows in training data: {len(df_cars_train)}")

#### 1.3 Kaggle Setup

In [None]:
# Folder containing kaggle.json (add kaggle.json api token)
os.environ['KAGGLE_CONFIG_DIR'] = "/Workspace/Users/X@novaims.unl.pt"

# Test
!echo $KAGGLE_CONFIG_DIR

### 2. Data Preparation

In this section we describe the approach taken to prepare the data. The inner workings of these processes are later visualized and discussed in Section 8.

#### 2.1 Data Split

**Train and Val**: We use `Cross-Validation` in the `sklearn pipeline` on the available training data to make use of all data while validating different approaches.    
-> We fix the random state everywhere so that all models are evaluated on the same splits, which ensures a fair model comparison

**Test**: We use the external hold-out set from Kaggle as the final test set (it remains completely unseen to avoid leakage)    
-> An additional validation set is therefore not necessary and would waste training data

Final setup:
1. **Training Set (n-1 folds from CV)**: Used to fit models.
2. **Validation Set (1 fold from CV)**: Used to evaluate performance of models and tune hyperparameters, detect overfitting. 
3. **Test Set (Kaggle)**: Used only once at the end of the entire process to evaluate final model performance. Not considered before to prevent leakage.



<u>Place in the pipe:</u> The split is decided here because the data has to be split before the preprocessing steps to avoid data leakage. All of the following steps are part of the sklearn pipeline while the CV is not an explicit part of the pipeline but rather the technique that calls the pipeline with its separate folds.

In [None]:
# Create CV (shuffle to ensure randomness in splits, random_state to make it reproducible and comparable across models)
rs = 5
cv = KFold(n_splits=5, shuffle=True, random_state=rs)
# => This cv will be passed for hyperparameter tuning later when training the models

# Split features and target
X_train = df_cars_train.drop(columns='price')
y_train = df_cars_train['price']

**Our findings:**
- CV achieves better results than using a hold-out set

**Consequences/Interpretation:**
- Usage of all available data is better for the model than 'wasting' training data for a hold-out set
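
As a quick sanity check of the fair-comparison argument, the following sketch (on a toy index array, which is an assumption) shows that two `KFold` splitters with the same `random_state` yield identical folds and that every row is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 20 rows identified by their position (assumption)
X_idx = np.arange(20).reshape(-1, 1)

# Two independent splitters with the same seed, as if created for two different models
cv_a = KFold(n_splits=5, shuffle=True, random_state=5)
cv_b = KFold(n_splits=5, shuffle=True, random_state=5)

folds_a = [val.tolist() for _, val in cv_a.split(X_idx)]
folds_b = [val.tolist() for _, val in cv_b.split(X_idx)]

# Identical validation folds -> every model sees the same train/val partition
same_folds = folds_a == folds_b

# Each row lands in exactly one validation fold
all_val = sorted(i for fold in folds_a for i in fold)
print(same_folds, all_val == list(range(20)))
```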

#### 2.2 Data Cleaning

**Our approach:**
- We `clean data inconsistencies` and data entry errors that we found in the EDA
- Affected values are `set to NaN` for that specific entry only, so that no rows are lost by dropping them
- Afterwards, these values are imputed (see Section 2.4)

____

**Numerical Features**

| **Feature** | **Allowed thresholds** | **Reasoning** | **# filtered below threshold** | **# filtered above threshold** |
| :--- | :--- | :--- | :---: | :---: |
| `year` | 1886 to 2020 | The first automobile is dated to 1886 (Benz Patent Motor Car), so earlier values are implausible; the dataset is from 2020, so newer model years are logically impossible. [1] | 0 | 358 |
| `mileage` | ≥ 0 | Negative mileage is not possible. | 369 | - |
| `tax` | ≥ 0 | Negative tax is not possible. | 378 | - |
| `mpg` | 5 to 150 | Lower bound 5 mpg is a conservative “sanity floor” below the least-efficient passenger car on FuelEconomy.gov’s list (Bugatti Mistral at 9 mpg combined), so we avoid removing valid low-efficiency cars while filtering implausible entries. [2] Upper bound 150 is a pragmatic cap to reduce leverage from extreme values and potential metric-mixing (e.g., MPGe-style values are defined as an energy-equivalent MPG for plug-in vehicles). [3] Reference point for high-efficiency non-EVs: Prius variants are ~50–56 mpg combined on FuelEconomy.gov. [4] | 49 | 221 |
| `engineSize` | 0.1 to 12.7 | Practical bounds: kei-class cars are capped at 660cc (0.66L), giving a grounded “small production car” reference point. [5] Very large historical production engines reach ~12.763L (Bugatti Type 41 / Royale). [6] Lower bound reduced to 0.1L as a conservative data-sanity floor (primarily to remove obvious errors) while avoiding unnecessary loss of potentially valid small-displacement entries. | 264 | 0 |
| `paintQuality` | 0 to 100 | Percentage values must be between 0 and 100. | 0 | 367 |
| `previousOwners` | ≥ 0 | Negative owner counts are not possible. | 371 | - |
| `hasDamage` | - | Only 0 and NaN values in the data -> no thresholding | . | . |


- [1] https://group.mercedes-benz.com/company/tradition/company-history/1885-1886.html
- [2] https://www.fueleconomy.gov/feg/best-worst.shtml
- [3] https://www.epa.gov/greenvehicles/fuel-economy-and-ev-range-testing
- [4] https://www.fueleconomy.gov/feg/PowerSearch.do?action=noform&baseModel=Prius&make=Toyota&path=1&srchtyp=ymm&year1=2020&year2=2020
- [5] https://www.motortrend.com/features/what-is-a-kei-car
- [6] https://www.bugatti-trust.co.uk/bugatti-type-41/
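
The range checks from the table can be sketched as a small pandas helper. This is a simplified illustration, not the actual `CarDataCleaner` code: the bounds are copied from the table, while the helper name `mask_out_of_range` and the demo rows are hypothetical.

```python
import pandas as pd

# (lower, upper) bounds per feature, taken from the table above; None = unbounded
BOUNDS = {
    "year": (1886, 2020),
    "mileage": (0, None),
    "tax": (0, None),
    "mpg": (5, 150),
    "engineSize": (0.1, 12.7),
    "paintQuality": (0, 100),
    "previousOwners": (0, None),
}

def mask_out_of_range(df, bounds=BOUNDS):
    """Set implausible values to NaN instead of dropping the whole row."""
    out = df.copy()
    for col, (low, high) in bounds.items():
        if col not in out.columns:
            continue
        values = pd.to_numeric(out[col], errors="coerce")
        invalid = pd.Series(False, index=out.index)
        if low is not None:
            invalid |= values < low
        if high is not None:
            invalid |= values > high
        out[col] = values.mask(invalid)
    return out

# Hypothetical demo rows: one clean car and two with entry errors
demo = pd.DataFrame({
    "year": [2015, 2023, 1850],
    "mileage": [42000, -10, 5000],
    "mpg": [55.4, 3.0, 470.0],
})
cleaned = mask_out_of_range(demo)
print(cleaned)
```

All three rows survive; only the offending cells become NaN and are left for the imputer.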

----

**Categorical Features**

We fix data entry noise (typos, truncations, inconsistent casing) in several steps:

**1) Deterministic canonicalization:**

We first apply **static mapping tables** (built from our EDA findings on the training data) that collapse variants into one canonical label:

- normalize formatting: lowercase + trim whitespace  
- map known variants/typos → one canonical category  
  - example: `"AUDI"`, `"udi"`, `"Aud"` → `Audi`  
  - example: `"semi-aut"`, `"emi-auto"` → `Semi-Auto`

This step is **fully deterministic**.

**2) Strict validity check:**

After canonicalization, we apply a **valid-category filter**:

- for high-cardinality `model`, we accept only **known valid model names** (our static canonical model vocabulary)
- any model token not in that vocabulary is considered unreliable and is set to **NaN**
- this prevents “garbage categories” (rare or malformed strings) from becoming real categories and harming generalization

Examples of what gets rejected (→ NaN):
- truncations like `"scirocc"` (intended: `scirocco`)
- incomplete single-letter tokens like `"a"`, `"q"`, `"x"` without enough context

**3) Special-rule buckets for ambiguous 1-letter tokens:**

Some one-letter model tokens are ambiguous and must not be guessed:

- if `brand == Audi` and `model == "a"` → set to `a_unknown`
- if `brand == Audi` and `model == "q"` → set to `q_unknown`
- if `brand == BMW` and `model == "x"` → set to `x_unknown`

If the brand is missing and the token is one of `{a, q, x}`, we do **not** guess and force it to **NaN**.

**4) Conservative fuzzy rescue (only for values that are still missing):**

Only after steps (1)–(3), we run a very strict “rescue step” to recover obvious typos:

- applied **only where the value is still NaN**
- candidate choices are restricted to **valid canonical categories** (never invent new labels)
- acceptance requires very high similarity (strict cutoff) and/or a unique prefix match  
  - example: `"pum"` → `puma` (unique prefix, safe)
  - example: `"sl clas"` → `sl class` (high similarity, safe)
- if no safe match exists, the value remains **NaN**

**5) Final handling of remaining missing categories:**

Any categorical value that is still missing after all steps remains **NaN** and is filled later by our **Imputer** (Section 2.4).
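
The steps above can be sketched for the `model` column with the standard library alone. This is a reduced illustration under stated assumptions: the vocabulary `VALID_MODELS`, the function `clean_model` and the cutoff of 0.9 are hypothetical, and the real `CarDataCleaner` uses larger mapping tables. For brevity, the one-letter rule is checked before the validity filter.

```python
import difflib

# Hypothetical miniature vocabulary of canonical model names
VALID_MODELS = ["puma", "fiesta", "scirocco", "sl class", "a3", "q3", "x5"]

# Ambiguous one-letter tokens get a bucket only when the brand is known
AMBIGUOUS = {
    ("audi", "a"): "a_unknown",
    ("audi", "q"): "q_unknown",
    ("bmw", "x"): "x_unknown",
}

def clean_model(raw, brand=None, cutoff=0.9):
    """Return a canonical model name, a special bucket, or None (= NaN)."""
    if raw is None:
        return None
    token = str(raw).strip().lower()  # normalize formatting
    if not token:
        return None
    if token in {"a", "q", "x"}:  # never guess one-letter tokens
        key = (str(brand).strip().lower(), token) if brand is not None else None
        return AMBIGUOUS.get(key)
    if token in VALID_MODELS:  # strict validity check
        return token
    # Conservative rescue 1: unique prefix of exactly one valid model
    prefix_hits = [m for m in VALID_MODELS if m.startswith(token)]
    if len(prefix_hits) == 1:
        return prefix_hits[0]
    # Conservative rescue 2: very high string similarity to a valid model
    close = difflib.get_close_matches(token, VALID_MODELS, n=1, cutoff=cutoff)
    return close[0] if close else None

for raw, brand in [(" Puma ", None), ("pum", None), ("sl clas", None),
                   ("a", "Audi"), ("a", None), ("zzz", None)]:
    print(repr(raw), brand, "->", clean_model(raw, brand))
```

Because candidates are always drawn from `VALID_MODELS`, the rescue step can never invent a new label; anything it cannot match safely stays missing.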

_____

**Leakage safety:**
All “data-driven vocabularies” used in the cleaner are learned inside `fit()` on the training fold only, so the pipeline remains leakage-safe under CV:
- the validation fold is never used to decide which categories are “valid”
- the same cleaning logic is applied consistently in train/validation/test
- the static mapping includes misspellings independent of the training fold and is included for every fold to prevent unnecessary pitfalls in model learning

In [None]:
# Visualize the data cleaning by running it on raw df and inspect uniques
cleaner = CarDataCleaner(handle_electric="other", set_carid_index=False, use_fuzzy=True)

df_cars_train_clean = cleaner.fit_transform(df_cars_train)
df_cars_test_clean  = cleaner.transform(df_cars_test)

# Print unique values for visualization after cleaning (more detailed inspection before and after Cleaning can be found in the EDA)
print("CLEANED TRAIN uniques")
for col in df_cars_train_clean.columns:
    print(col, df_cars_train_clean[col].unique())

**Our findings:**     
The findings are already included in the table above for easier overview and direct comparison:
- '# filtered below threshold'
- '# filtered above threshold'

==> In total, [TODO] values are identified as data errors in the available training data and are set to NaN

**Consequences/Interpretation:**     
Handling data errors is crucial for effective model training. To identify these data errors, an extensive EDA is indispensable.

#### 2.3 Outlier Handling

**Practical considerations:**

We tried different outlier handling techniques (see `unused_experiments.ipynb`) but decided not to use an explicit technique because they hurt our performance.
Instead of explicit outlier handling between cleaning and imputation, we rely on the [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) for scaling the features, which limits the influence of outliers by centering and scaling the data based on the median and IQR (refer to Section 2.6).
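
A toy comparison shows the effect of median/IQR scaling. The five mileage values below are made up (an assumption) and include one extreme entry; the sketch contrasts `RobustScaler` with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary mileages and one extreme value (made-up numbers)
mileage_demo = np.array([[10_000.0], [20_000.0], [30_000.0], [40_000.0], [1_000_000.0]])

robust = RobustScaler().fit(mileage_demo)      # centers on median, scales by IQR
standard = StandardScaler().fit(mileage_demo)  # centers on mean, scales by std

robust_scaled = robust.transform(mileage_demo).ravel()
standard_scaled = standard.transform(mileage_demo).ravel()

print("median:", robust.center_[0], "| IQR:", robust.scale_[0])
print("robust:  ", np.round(robust_scaled, 2))
print("standard:", np.round(standard_scaled, 2))
```

With the robust variant the four ordinary cars keep a usable spread, whereas mean and standard deviation are dominated by the single extreme value and compress them into a narrow band.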

<u>Place in the pipe</u>: 
- Explicit outlier handling would sit before imputation, so that outliers are identified on the original distribution (otherwise the imputed values would inflate the distributions)
- Imputation would then fill the original gaps based on a distribution that does not include the massive outliers (which skew the mean/median)    
  -> setting the outliers to NaN first makes the imputation cleaner for all rows

**Findings**:      

As mentioned, using outlier handling significantly **hurt our best MAE**. The techniques we tried are described in detail in the `unused_experiments.ipynb` file.

**Consequences/Interpretation:**   

For performance reasons we decided to use **no explicit outlier handling**.      
The worse performance with outlier handling can be explained by the **importance of extreme cars for the algorithm**. When winsorizing or imputing these extreme values, we remove valuable signals. The specific decision not to use outlier handling is based on our best-performing tree-based models, which are not as sensitive to outliers as other models. With approaches that are more sensitive to extreme values, outlier handling can indeed be a valuable technique to improve performance.     

#### 2.4 Missing Values Handling

**Individual Hierarchy per Feature**

To impute the NaNs in each cell, an **individual hierarchical approach** is applied for each feature. For this, we compute group statistics (median and mode) over features that are highly correlated with the feature of the missing cell. If one of the required features is not available, we fall back to the next level until finally the global dataset statistic (median or mode) is used as the last fallback.    

Examples (for an extensive explanation refer to the IndividualHierarchyImputer class in pipeline_functions.py):
- `brand`: we impute the brand using the model; if the model is NaN too, we impute using the grouped mode of (fuelType, transmission) because we identified an interaction between these two features and brand in the EDA.   
- `mileage`: we impute mileage with the median of the year because of the strong Spearman correlation identified in the EDA (roughly -0.8).

The group statistics are computed only on the respective training folds and then applied to the validation fold to prevent leakage.   
-> When refitting the final model, the imputer is fitted on the entire training set and the Kaggle test set is transformed using the fitted values
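
The fallback logic can be sketched with plain pandas. This is a simplified illustration, not the `IndividualHierarchyImputer` itself: it handles one numeric target with single-column levels, and the hierarchy `["year", "brand"]`, the helper names and the demo rows are assumptions.

```python
import numpy as np
import pandas as pd

def fit_levels(train, target, levels):
    """Learn one median lookup table per hierarchy level plus a global fallback."""
    tables = {key: train.groupby(key)[target].median() for key in levels}
    return tables, train[target].median()

def transform_levels(df, target, levels, tables, global_median):
    """Fill NaNs level by level; statistics come only from the fitted tables."""
    filled = df[target].copy()
    for key in levels:
        candidate = df[key].map(tables[key])  # NaN if key is missing or unseen
        filled = filled.fillna(candidate)
    return filled.fillna(global_median)

# Hypothetical training fold
train = pd.DataFrame({
    "year": [2015.0, 2015.0, 2016.0, 2016.0, 2017.0],
    "brand": ["Audi", "Audi", "BMW", "BMW", "Audi"],
    "mileage": [40000.0, 50000.0, 30000.0, 34000.0, 20000.0],
})
# Hypothetical validation fold with gaps at different hierarchy levels
val = pd.DataFrame({
    "year": [2015.0, np.nan, np.nan, 2016.0],
    "brand": ["Audi", "BMW", np.nan, "BMW"],
    "mileage": [np.nan, np.nan, np.nan, 12000.0],
})

levels = ["year", "brand"]
tables, global_median = fit_levels(train, "mileage", levels)
imputed = transform_levels(val, "mileage", levels, tables, global_median)
print(imputed.tolist())
```

The first row is filled from its year, the second falls back to the brand, the third to the global median, and the observed value in the last row is left untouched.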

<u>Place in the pipe:</u> The Imputation is decided here because the data has to be imputed on original values before engineering new features

**Our findings:**       
The IndividualHierarchyImputer improves performance by ~30 MAE on the best baseline models compared to the best of the other imputers we tried: SimpleImputer (mode and median), KNN and the custom GroupImputer.

**Consequences/Interpretation:**       
The improved performance over SimpleImputer and the custom GroupImputer shows that each feature is **best approximated by a different feature (combination)**. There is **no golden rule** of imputing everything by the median or mode of the model. Instead, the **interaction** (e.g. Spearman correlation) between the missing feature and the other features should be taken into account (e.g. impute mileage with the median of the year because of the strong Spearman correlation).

#### 2.5 Feature Engineering

**Our approach:**
- We implement feature engineering as an sklearn transformer (`CarFeatureEngineer`) **inside the pipeline**.
  - This makes the process **CV-safe / leakage-free**: all fold-specific statistics (e.g., model frequency, mean ages) are learned only on the training fold in `fit()` and applied to the validation fold in `transform()`.
- Important design notes: Interaction features use `(age + 1)` to avoid division by zero for cars in the reference year.
- We engineer features with two goals:
  1. **Inject domain structure** (age, usage intensity, efficiency, “big engine + old car” effects).
  2. **Create stronger signals for models** by expressing ratios and interactions that are difficult to learn reliably from raw variables.

**Input columns** (after cleaning + imputation):
- Numeric: `year`, `mileage`, `tax`, `mpg`, `engineSize`, `previousOwners`
- Categorical: `brand`, `model`, `transmission`, `fuelType`
- Boolean: `hasDamage`

---

| **New Feature** | **Calculation** | **Nature** | **Reasoning** |
| :--- | :--- | :--- | :--- |
| `age` | `ref_year - year` | Base | Captures depreciation; turns a calendar value into a meaningful pricing variable. |
| `mpg_x_engine` | `mpg * engineSize` | Interaction (product) | Joint signal for “performance vs efficiency” patterns (high engine + low mpg vs small engine + high mpg). |
| `engine_x_age` | `engineSize * (age + 1)` | Interaction (product) | Differentiates large engines in older cars vs newer cars; helps model capture age-dependent valuation of engine size. |
| `mpg_x_age` | `mpg * (age + 1)` | Interaction (product) | Captures age-dependent fuel-efficiency patterns (e.g., older fleets / technology differences) that can correlate with price. |
| `tax_x_age` | `tax * (age + 1)` | Interaction (product) | Models that tax effects can differ by car age (policy/regime + car segment composition). |
| `miles_per_year` | `mileage / (age + 1)` | Interaction (ratio) | Normalizes mileage by lifetime: 60k miles on a 3-year car is very different from 60k on a 10-year car; reduces collinearity between `mileage` and `age`. |
| `tax_per_mpg` | `tax / mpg` | Interaction (ratio) | “Cost pressure” proxy: high tax relative to efficiency can reflect segment / running cost patterns. |
| `engine_per_mpg` | `engineSize / mpg` | Interaction (ratio) | Performance-style signal: high engine with low mpg tends to indicate sporty/luxury configurations. |
| `brand_fuel` | `brand + "_" + fuelType` | Interaction (categorical) | Creates configuration groups for target encoding (e.g., Diesel BMW differs from Petrol BMW). |
| `brand_trans` | `brand + "_" + transmission` | Interaction (categorical) | Creates configuration groups for target encoding (e.g., Automatic Mercedes vs Manual Mercedes). |
| `model_freq` | `P(model)` from training fold | Popularity | Approximates market supply/demand stability: common models have more stable pricing; learned CV-safe in `fit()`. |
| `age_rel_brand` | `age - mean_age(brand)` | Relative / group-stat | Measures whether a car is newer/older than typical within its brand (brand-relative positioning). |
| `age_rel_model` | `age - mean_age(model)` | Relative / group-stat | Measures whether a car is newer/older than typical within its model (model-relative positioning). |
| `engine_rel_model` | `engineSize / mean_engineSize(model)` | Relative / group-stat | Captures whether a car is under-/over-engined relative to its model’s typical configuration. |

---

Legend (feature “Nature”)

- **Base Features**: derived from a single original variable (e.g. `age` from `year`)
- **Interaction Features**: combine multiple variables to capture non-additive effects
  - products (“amplifiers”) and ratios (“normalizers”)
- **Popularity Features**: learned from the training fold distribution (e.g. model frequency)
- **Relative / Group-stat Features**: compare a car to typical peers within `brand` or `model`
  - learned in `fit()` and applied in `transform()` to avoid leakage
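
The fit/transform split for popularity and group-stat features can be sketched as a tiny transformer. `GroupStatFeatures` is a hypothetical, reduced stand-in for the corresponding part of `CarFeatureEngineer`; in particular, the handling of unseen models (frequency 0, global mean age) is an assumption of this sketch.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupStatFeatures(BaseEstimator, TransformerMixin):
    """Learn model frequency and mean age per model on the training fold only."""

    def fit(self, X, y=None):
        self.model_freq_ = X["model"].value_counts(normalize=True)
        self.mean_age_model_ = X.groupby("model")["age"].mean()
        self.global_mean_age_ = X["age"].mean()
        return self

    def transform(self, X):
        out = X.copy()
        # Unseen models get frequency 0 and the global mean age (assumption)
        out["model_freq"] = out["model"].map(self.model_freq_).fillna(0.0)
        mean_age = out["model"].map(self.mean_age_model_).fillna(self.global_mean_age_)
        out["age_rel_model"] = out["age"] - mean_age
        return out

# Hypothetical training and validation folds
train_fold = pd.DataFrame({"model": ["golf", "golf", "golf", "polo"], "age": [2, 4, 6, 8]})
val_fold = pd.DataFrame({"model": ["golf", "tiguan"], "age": [5, 5]})

engineer = GroupStatFeatures().fit(train_fold)
val_out = engineer.transform(val_fold)
print(val_out)
```

The validation rows never contribute to the statistics: the golf is compared against the mean golf age of the training fold, and the unseen tiguan falls back to the global values.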

---

Relation to encoding (Target Encoded Features)

We also create categorical “group keys” (`brand_fuel`, `brand_trans`) specifically so that our later target-encoding step inside the preprocessing pipeline can learn stable, configuration-specific signals.  
This encoding is handled **after** feature engineering and is **CV-safe** because it is part of the pipeline.
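
The idea of median target encoding on such a group key can be sketched with plain pandas. This is a simplified illustration and not the encoder configuration used in our pipeline: the helper names and demo prices are assumptions, and the sketch omits the nested cross-fitting that a wrapper such as `NestedCVWrapper` adds to avoid encoding training rows with their own target.

```python
import pandas as pd

def fit_median_encoding(categories, target):
    """Learn the median target per category plus a global fallback."""
    per_category = target.groupby(categories).median()
    return per_category, target.median()

def apply_median_encoding(categories, per_category, fallback):
    """Replace each category by its learned median; unseen ones get the fallback."""
    return categories.map(per_category).fillna(fallback)

# Hypothetical training fold: group key and price
train_keys = pd.Series(["BMW_Diesel", "BMW_Diesel", "BMW_Petrol", "BMW_Petrol", "Audi_Diesel"])
train_price = pd.Series([20000.0, 24000.0, 15000.0, 17000.0, 30000.0])

# Hypothetical validation fold with one unseen configuration
val_keys = pd.Series(["BMW_Petrol", "Kia_Hybrid"])

per_category, fallback = fit_median_encoding(train_keys, train_price)
val_encoded = apply_median_encoding(val_keys, per_category, fallback)
print(val_encoded.tolist())
```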

---


**Our findings:**
The engineered features are the main drivers of the performance improvement. In addition to the improved MAE, the following findings support this insight:
- The correlation of the engineered features with the target shows that we engineered meaningful features (e.g. miles_per_year: spearman corr of TODO, while miles and age only have Y and Z). This is visible through using y-data-profiling in the debugging pipeline.
- The feature importance of the engineered features in the final model (e.g. mpg_x_age: TODO)
- All feature importances, and therefore also the impact of the engineered features, are analyzed in detail in the Open End Section

| Feature | Impact |
| :--- | :--- |
| age | ... |


**Consequences/Interpretation:**

Some relationships are better captured through engineered features than raw original features. For example miles_per_year captures the actual usage of the car.


#### 2.6 Encoding, Transforming and Scaling

**Our approach:**
- We separate features in their `groups of variables` and combine their different treatments in the ColumnTransformer
    - Numerics vs. Booleans vs. Categoricals
- We have one `baseline pipe` and one `optimized pipe` to compare basic preprocessing to optimized preprocessing
    - The baseline pipe does the bare minimum for the algorithms to work cleanly
    - The optimized pipe was adjusted iteratively through multiple experiments and trials during the process to optimize model performance

In [None]:
# Original features
orig_numeric_features = ["year", "mileage", "tax", "mpg", "engineSize", "previousOwners"]
orig_boolean = ["hasDamage"]
orig_categorical_features = ["brand", "model", "transmission", "fuelType"]

In [None]:
numeric_features = [
    "age", "tax", "mpg", "engineSize", "previousOwners",        
    "mpg_x_engine", "engine_x_age", "mpg_x_age", "tax_x_age",   
    "engine_per_mpg", "tax_per_mpg",                            
    "model_freq",
    "age_rel_brand", "age_rel_model", "engine_rel_model"
]
numeric_features_for_log = ["mileage", "miles_per_year"]
boolean_features = ["hasDamage"]
categorical_features_ohe = ["transmission", "fuelType"]
categorical_features_te_mean = ["brand", "model"]
categorical_features_te_median = ["brand", "model", "brand_fuel", "brand_trans"]
unused_columns = ["year"] # replaced by age

all_feature_names_before_encoding = set(numeric_features + numeric_features_for_log + boolean_features + categorical_features_ohe + categorical_features_te_median)
print(len(all_feature_names_before_encoding))

**Optimized Pipe:**
- `Log-Transforming` using [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) for skewed Numerics identified in the EDA
- `RobustScaler` ([docu](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)) for Numerics because it performed better than [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) and [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)     
    -> The scaler is fitted on the training data only to avoid data leakage; the validation set and later the test set are scaled with the scaler fitted on the training set     
- `Encoding` for Categoricals:
    - Low cardinality: [OHE](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) because of optimized performance on best models
    - High Cardinality: [Median TE](https://contrib.scikit-learn.org/category_encoders/quantile.html) and [Mean TE](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html) for optimized performance on best models

| **Feature** | Nature | Transformation | Encoding | Scaling |
| :--- | :--- | :--- | :--- | :--- |
| age | Numerical | - | - | Robust |
| ... | ... | ... | ... | ... |
| mileage | Numerical | Log | - | Robust |
| ... | ... | ... | ... | ... |
| hasDamage | Boolean | - | - | - |
| transmission | Categorical | - | OHE | - |
| ... | ... | ... | ... | ... |
| brand | Categorical | - | TE | Robust |

All operations are combined in a [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) which applies the different steps to the columns of the data in one unified pipeline (reproducible and prevents leakage)    
  -> outputs a combined feature matrix

In [None]:
enc_transf_scale = ColumnTransformer([
    ("log", Pipeline([
        ("log", FunctionTransformer(np.log1p, validate=False, feature_names_out="one-to-one")),  # log1p handles zeros safely
        ("scaler", RobustScaler()),
    ]), numeric_features_for_log),

    ("num", RobustScaler(), numeric_features),

    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features_ohe), # Use sparse_output=False to get dense array back (e.g. necessary for hgb)

    ("mean_te", Pipeline([ 
        ("encoder", TargetEncoder(target_type='continuous', cv=5, smooth='auto', random_state=rs)),
        ("scaler", RobustScaler()),
    ]), categorical_features_te_mean),

    # Smoothing (m) mitigates but doesn't eliminate leakage, so we use nested CV to work similarly to the sklearn TE
    ("median_te", Pipeline(steps=[
        ('median_encoder', NestedCVWrapper(QuantileEncoder(quantile=0.5, m=10.0), cv=cv, random_state=rs)), # not specifying the cols means it encodes all columns
        ('scaler', RobustScaler()),
    ]), categorical_features_te_median)
])

#### 2.7 Feature Selection

**Our approach:**     
We apply an automatic feature selection approach in addition to the previously removed features (data cleaning, feature engineering)
- year: dropped because replaced by derived feature 'age'
- paintQuality: dropped because it is filled in by the mechanic and is therefore not available in production, where the price prediction skips the mechanic

The goal is to create a very robust feature selection approach that finds features that are most likely actually irrelevant/redundant and therefore generate noise in the model that might lead to overfitting.     
To achieve that goal, we apply **two steps** inside the feature selection:
1) `Variance Threshold` (Filter) to filter constant variables ([docu](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html))    
(If using a different value than threshold 0, VarianceThreshold has to be applied before scaling, because e.g. standard scaling leads to std=1)
2) Majority voter:
    - `Spearman` handles the clean, obvious trends and cleans up redundancy ([docu](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html))
    - `MI` catches more complex relations that Spearman misses ([docu](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html))
    - `RF feature importance` to account for importance of the features ([docu](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#selectfrommodel))
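
To illustrate why the voters complement each other, the sketch below uses synthetic data with a symmetric, non-monotonic relationship (`y = x**2`). Spearman correlation is close to zero here, while mutual information clearly detects the dependency. The data and the seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2  # symmetric, non-monotonic relationship

spearman = pd.Series(x).corr(pd.Series(y), method="spearman")
mi = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=10, random_state=0)[0]
print(f"Spearman: {spearman:.3f}, MI: {mi:.3f}")
```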


==> The different voters capture different aspects:
| **Voter** | **Nature** | **Role & Responsibility** |
| :--- | :--- | :--- |
| **Spearman Voter** <br> *(SpearmanRedundancySelector)* | Filter | **Linear/Monotonic**<br>Captures obvious, strong relationships (e.g., "Newer cars are expensive"). Also handles redundancy by filtering out features that are (near-)duplicates of better ones. |
| **MI Voter** <br> *(MutualInfoThresholdSelector)* | Filter | **Non-Linear**<br>Captures complex "physics" and non-monotonic patterns that correlation misses. |
| **RF Voter** <br> *(SelectFromModel)* | Embedded | **Interactions**<br>Captures features that are only important in combination with others. |

==> The feature selection is performed inside the pipeline's cross-validation and is identical across all models, ensuring no data leakage and a consistent feature selection logic.
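
The voting logic itself can be sketched in a few lines. This is a simplified, hypothetical stand-in for the custom `MajorityVoteSelectorTransformer` (the function name `majority_vote` and the example masks are assumptions): each selector contributes a boolean mask over the features, and a feature is kept when at least `min_votes` selectors voted for it.

```python
import numpy as np


def majority_vote(masks, min_votes=2):
    """Keep a feature when at least `min_votes` selectors voted for it."""
    votes = np.sum(np.asarray(masks, dtype=int), axis=0)
    return votes >= min_votes


# One boolean mask per voter, one entry per feature (illustrative values).
spearman_mask = np.array([True, True, False, False])
mi_mask = np.array([True, False, True, False])
rf_mask = np.array([True, False, True, False])

keep = majority_vote([spearman_mask, mi_mask, rf_mask], min_votes=2)
print(keep)
```

With `min_votes=2` out of three voters, a single dissenting selector cannot remove a feature on its own.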

<u>Place in the pipe:</u> The Feature Selection is placed after the scaling to have the features on one scale (just like in the lab)

**Old markdown (background notes):** kept mainly for us to understand the techniques and to be able to explain them in the project defense

*Filter* methods to make an initial screening of the statistical properties of the data: 
- `Correlation Indices` to filter irrelevant and redundant features (Maximum Relevance, Minimum Redundancy (mRMR)-style pruning).     
    - Metric: We use Spearman because we want a single, unified pipeline step after encoding even though it treats binary OHE columns as "ranks," which is a fine but rough approximation. Spearman because not all features are normally distributed as it would be necessary for Pearson.
    - Irrelevant: Little correlation with the target
    - Redundant: Important because other methods like MI and RF will likely keep redundant features, since both of them look important if they contain valuable information. However, one of them should be eliminated for cleaner interpretation of tree models and for correct model building in models that work better without multicollinearity between features. Of the redundant features, we keep the one with the higher correlation with the target.
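
The relevance/redundancy pruning described above could look roughly like this. The function `spearman_select` is a hypothetical sketch, not the notebook's `SpearmanRelevancyRedundancySelector`; the thresholds mirror the ones used in the pipeline, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def spearman_select(X, y, relevance_threshold=0.05, redundancy_threshold=0.95):
    """Drop irrelevant features, then keep the more relevant one of each redundant pair."""
    relevance = X.corrwith(y, method="spearman").abs()
    candidates = relevance[relevance >= relevance_threshold].sort_values(ascending=False)
    pairwise = X.corr(method="spearman").abs()
    kept = []
    for col in candidates.index:  # most relevant first
        if all(pairwise.loc[col, k] < redundancy_threshold for k in kept):
            kept.append(col)
    return kept


rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(0, 15, n)
X = pd.DataFrame({
    "age": age,
    "age_noisy": age + rng.normal(0, 0.3, n),  # nearly redundant with age
    "noise": rng.normal(size=n),               # unrelated to the target
})
y = pd.Series(30000 - 1500 * age + rng.normal(0, 1000, n))

selected = spearman_select(X, y)
print(selected)
```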

*Wrapper* methods create multiple models and use their performance as a proxy for the relevance of the features instead of relying on statistical properties of the data by themselves
    - `RFECV` with Random Forest as the base estimator (removes the least important feature based on the feature importance of the base estimator) ([docu](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html)).

*Embedded* methods perform feature selection as part of the model training process itself -> FS is integrated into the model and is not a separate step (Train model on all features -> Get FI -> Select based on FI)
    - `Random Forest` (Tree-based method): Reduce impurity of the tree using sklearn SelectFromModel 

In [None]:
fs_pipe = Pipeline([
        ("vt", VarianceThreshold(threshold=0.0)), # Apply VT first to remove constant features (it serves as a "dictator" and not a "voter" in our pipeline)
        ('selector', MajorityVoteSelectorTransformer(
            selectors=[
                SpearmanRelevancyRedundancySelector(relevance_threshold=0.05, redundancy_threshold=0.95), # If we set redundancy threshold to 1.01, this becomes similar to just relevance filtering,
                MutualInfoThresholdSelector(threshold=0.01, n_neighbors=10), # Increasing n_neighbors makes the estimation more stable but computationally slower,
                SelectFromModel(RandomForestRegressor(n_estimators=100, max_depth=8, n_jobs=-1, random_state=rs), # max_depth not too low (misses interactions) and not too high (selects noise -> overfitting)
                                threshold='0.001*mean'), # relative threshold: importances sum to 1, so with many features each one gets a low importance even when it still matters
                # RFE currently unused because of high computational cost
                # rfecv_rf = RFECV(estimator = RandomForestRegressor(n_jobs=-1, max_depth=50), step=1, random_state=rs, cv=cv, scoring='neg_mean_absolute_error', min_features_to_select=5)
            ],
            min_votes=2))
        ])

**Our findings:**
- While trees are comparatively robust to unnecessary features, applying the feature selection pipeline improves the performance slightly (TODO add MAE difference here when including fs pipe vs. not including fs pipe)


**Consequences/Interpretation:**
- ...

________

SKlearn elements we also considered but decided not to use:
- Filter Methods (including SelectPercentile):
    - SelectFwe (Family-Wise Error Rate)   
    -> too strict and we don't want to be too conservative in our feature selection (we prefer to keep weak but useful signals)
    - SelectKBest     
    -> we didn't want to fix k (number of selected features)
- Wrapper Methods: 
    - RFECV   
    -> too expensive
    - SequentialFeatureSelector (forward, backward selection)   
    -> too expensive
- Embedded:
    - Regularization Method (Lasso)     
    -> considers only linear relationships so discarded

#### 2.8 Create Final Preprocessing Pipeline

**Our approach:**
- The [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) combines all steps into the preprocessing pipe (see table below)
- Through calling the pipeline for data preparation, we ensure that the data is preprocessed independently for each training fold    
-> prevent leakage while filling missing values, scaling, encoding, etc.
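
A minimal sketch of this per-fold behaviour, using generic scikit-learn components on synthetic data instead of the project's custom transformers (all names and data here are assumptions): `cross_validate` clones and refits the whole pipeline on each training fold, so imputation and scaling statistics never see the validation fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.05] = np.nan  # inject some missing values

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # medians learned per training fold
    ("scaler", StandardScaler()),                   # mean/std learned per training fold
    ("model", Ridge()),
])
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipe, X, y, cv=cv_demo, scoring="neg_mean_absolute_error")
print(-scores["test_score"].mean())
```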

##### Summary of data preparation: Baseline vs Optimized (including outliers)

| | **Baseline** | **Optimized** |
| :--- | :--- | :--- |
| **Data cleaning** | - | `CarDataCleaner` (Section 2.2) |
| **Outlier handling** | - | - |
| **Imputation** | SimpleImputer median/mode <br>(simplicity; median more robust than mean) | `IndividualHierarchyImputer` (Section 2.4) |
| **Feature engineering** | - | `CarFeatureEngineer` (Section 2.5) |
| **Transformation** | - | Log-Transform selected skewed numerics (Section 2.6) |
| **Scaling** | StandardScaler <br>(popularity) | RobustScaler (Section 2.6) |
| **Encoding** | OneHotEncoder <br>(simplicity, most straightforward) | OneHotEncoder + Target Encoding (Section 2.6) |
| **Feature selection** | - | VT + `Majority voting` (Section 2.7) |

In [None]:
preprocessor_orig = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), orig_numeric_features),
    ("bool", SimpleImputer(strategy="most_frequent"), orig_boolean),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)) # TODO maybe drop='first', (like in prac03) # Use sparse_output=False to get dense array back (e.g. necessary for hgb)
    ]), orig_categorical_features)
])

In [None]:
### The entire pipeline is built by incorporating the enc_transf_scale and fs_pipe defined in Sections 2.6 and 2.7 respectively
preprocessor_pipe = Pipeline([
    ("clean", CarDataCleaner(handle_electric="other", set_carid_index=False, use_fuzzy=True)),
    # [Unused Outlier Handling]
    ("imputer", IndividualHierarchyImputer()),
    ("fe", CarFeatureEngineer(ref_year=2020)),
    ("ct", enc_transf_scale),
    ("fs", fs_pipe),
])

# Save preprocessor for reuse in DL experiments
with open('preprocessor_pipe.pkl', 'wb') as f:
    pickle.dump(preprocessor_pipe, f)

### 3. Model Assessment Strategy and Metrics

**Performance Metrics**    
Comparison on train and val set to identify potential overfitting:
- **MAE** (Mean Absolute Error):     
    Used as primary metric because it is the metric used for evaluating in the Kaggle competition. It is easy to interpret in pounds.

- **MAE Std** (Standard Deviation):    
    Used to assess the variance of the MAE across folds. For the secondary metrics we do not report the std, as one std is enough to get a first understanding.

- **RMSE** (Root Mean Squared Error):    
    Used to identify large prediction errors which helps us understand model weaknesses (sensitive to outliers)

- **R-squared**:     
    Used to compare how well the models explain the underlying data compared to always predicting the mean (R-squared=0). It serves as additional information alongside the two "error metrics" MAE and RMSE.
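
A small worked example of the three metrics on made-up, right-skewed prices (the values are illustrative, not project data). It also shows why a median predictor can have a slightly negative R-squared while still achieving a lower MAE than the mean predictor: R-squared is measured against the mean, whereas the median minimizes the absolute error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([5_000.0, 8_000.0, 10_000.0, 12_000.0, 65_000.0])  # right-skewed prices


def report(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more strongly
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2


mean_pred = np.full_like(y_true, y_true.mean())
median_pred = np.full_like(y_true, np.median(y_true))

mae_mean, rmse_mean, r2_mean = report(y_true, mean_pred)
mae_median, rmse_median, r2_median = report(y_true, median_pred)
print(f"mean:   MAE={mae_mean:.0f} RMSE={rmse_mean:.0f} R2={r2_mean:.3f}")
print(f"median: MAE={mae_median:.0f} RMSE={rmse_median:.0f} R2={r2_median:.3f}")
```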


**Structured approach to identify production model**
1. Run Baseline Models with default parameters to get a first glance of their performance
2. Compare and discuss model performance
3. Select top candidates for further optimization
4. Run Hyperparameter tuning on top candidates (Section 5)
5. Select the best model and train it with the best hyperparameters while minimizing the `absolute_error` criterion (Section 5)

### 4. Baseline Model Comparison

`Log-transforming the target (price)` using [TransformedTargetRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html) because the EDA showed that it is heavily right-skewed. The model predicts the log-price and automatically converts it back to pounds at the end.

#### 4.1 Setup Default Models (Original vs. Optimized Preprocessing)

**Fair comparison:**    
- We use **default parameters** to get a first impression of each model's potential and to decide which ones to optimize further (hyperparameter tuning).    
- We only set the same random_state for reproducibility and n_jobs to speed up computations.    
- Use the rule of thumb provided in the lab for the MLP (the number of hidden neurons should lie between the size of the input layer and the size of the output layer)
- Baseline: DummyRegressor using the median price as prediction

The **log-transform** of the target is performed here because the TransformedTargetRegressor ([docu](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html)) is the most straightforward implementation. It applies the transformation before fitting and automatically applies the inverse to the predictions.
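
What the wrapper does can be checked on synthetic data (the data and the linear base model are assumptions for illustration): fitting through `TransformedTargetRegressor` with `log1p`/`expm1` gives the same predictions as manually fitting on `log1p(price)` and applying `expm1` to the output.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
price = np.exp(10 - 0.2 * X[:, 0] + rng.normal(0, 0.1, 300))  # synthetic skewed prices

# Wrapped version: transformation and inverse are handled automatically.
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
ttr.fit(X, price)

# Manual equivalent: fit on the log-price, invert the predictions by hand.
manual = LinearRegression().fit(X, np.log1p(price))
manual_pred = np.expm1(manual.predict(X))

print(np.allclose(ttr.predict(X), manual_pred))
```

Predictions come back in the original unit, so metrics such as MAE stay interpretable in pounds.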

In [None]:
### Median ###
baseline_median_pipe_orig = create_model_pipe(preprocessor_orig, DummyRegressor(strategy="median")) # Preprocessing does not matter for median

### Linear Models ###
linear_reg_default = LinearRegression()
linear_reg_default = TransformedTargetRegressor(
    regressor=linear_reg_default,
    func=np.log1p,
    inverse_func=np.expm1
)
linear_reg_pipe_orig = create_model_pipe(preprocessor_orig, linear_reg_default)
linear_reg_pipe_adjusted = create_model_pipe(preprocessor_pipe, linear_reg_default)


### Instance-Based ###
knn_default = KNeighborsRegressor(n_jobs=-3)
knn_default = TransformedTargetRegressor(
    regressor=knn_default,
    func=np.log1p,
    inverse_func=np.expm1
)
knn_pipe_orig = create_model_pipe(preprocessor_orig, knn_default)
knn_pipe_adjusted = create_model_pipe(preprocessor_pipe, knn_default)
# Long Duration (~4min)
# => Better performance than linear models but still worse than tree-based models -> not further optimized


### Neural Networks ###
# The number of hidden neurons should be between the size of the input layer (~10 for orig, ~25 for optimized) and the size of the output layer (1)
mlp_default = MLPRegressor(hidden_layer_sizes=(9,) , random_state=rs)
mlp_default = TransformedTargetRegressor(
    regressor=mlp_default,
    func=np.log1p,
    inverse_func=np.expm1
)
mlp_pipe_orig = create_model_pipe(preprocessor_orig, mlp_default)
mlp_pipe_adjusted = create_model_pipe(preprocessor_pipe, mlp_default)
# Long Duration (~4min)
# => Worse performance than KNN and tree-based models (notably, orig better than preprocessed)


### Tree-Based Models ###
hgb_default = HistGradientBoostingRegressor(random_state=rs, loss='squared_error')
hgb_default = TransformedTargetRegressor(
    regressor=hgb_default,
    func=np.log1p,
    inverse_func=np.expm1
)
hgb_pipe_orig = create_model_pipe(preprocessor_orig, hgb_default)
hgb_pipe_adjusted = create_model_pipe(preprocessor_pipe, hgb_default)
# Long Duration (~1mins)


rf_default = RandomForestRegressor(random_state=rs, n_jobs=-1, criterion='squared_error')
rf_default = TransformedTargetRegressor(
    regressor=rf_default,
    func=np.log1p,
    inverse_func=np.expm1
)
rf_pipe_orig = create_model_pipe(preprocessor_orig, rf_default)
rf_pipe_adjusted = create_model_pipe(preprocessor_pipe, rf_default)
# Long Duration (~5mins)
# Good performance -> further hyperparameter tuning

et_default = ExtraTreesRegressor(random_state=rs, n_jobs=-1, criterion='squared_error')
et_default = TransformedTargetRegressor(
    regressor=et_default,
    func=np.log1p,
    inverse_func=np.expm1
)
et_pipe_orig = create_model_pipe(preprocessor_orig, et_default)
et_pipe_adjusted = create_model_pipe(preprocessor_pipe, et_default)
# Long Duration (~7mins)
# Good performance -> further hyperparameter tuning

# ### Kernel-Based Models ###
# svr_default = SVR()
# svr_default = TransformedTargetRegressor(
#     regressor=svr_default,
#     func=np.log1p,
#     inverse_func=np.expm1
# )
# svr_pipe_orig = create_model_pipe(preprocessor_orig, svr_default)
# svr_pipe_adjusted = create_model_pipe(preprocessor_pipe, svr_default)
# # Long Duration (~12mins)
# # => Much worse performance than other models -> not further optimized


### Ensemble Meta Model ###
# The 'final_estimator' (Meta-Learner) looks at the predictions from the estimators and decides how to combine them.
raw_stack = StackingRegressor(
    estimators=[
        ('rf_main', rf_pipe_adjusted),
        ('linear_helper', linear_reg_pipe_orig)
    ],
    final_estimator=LinearRegression(), # A linear final estimator allows the prediction to go beyond bounds (extrapolate)
    n_jobs=-1
)
stacking_model = TransformedTargetRegressor(
    regressor=raw_stack,
    func=np.log1p,
    inverse_func=np.expm1
)

#### 4.2 Run the models

In [None]:
default_models_orig = {
    "Baseline_Median_orig": baseline_median_pipe_orig,
    "LinearReg_orig": linear_reg_pipe_orig,
    "KNN_orig": knn_pipe_orig,
    "MLP_orig": mlp_pipe_orig,
    "HGB_orig": hgb_pipe_orig,
    "RF_orig": rf_pipe_orig,
    "ET_orig": et_pipe_orig,
    # "SVR_orig": svr_pipe_orig,
}


default_orig_models_results_df = get_cv_results(default_models_orig, X_train, y_train, cv=cv, rs=rs)
display(default_orig_models_results_df)

# Long Duration (~10mins)


In [None]:
default_models = {
    "Baseline_Median_orig": baseline_median_pipe_orig,
    "LinearReg": linear_reg_pipe_adjusted,
    "KNN": knn_pipe_adjusted,
    "MLP": mlp_pipe_adjusted,
    "HGB": hgb_pipe_adjusted,
    "RF": rf_pipe_adjusted,
    "ET": et_pipe_adjusted,
    # "SVR": svr_pipe_adjusted,
    # "Stack": stacking_model,
}

default_models_results_df = get_cv_results(default_models, X_train, y_train, cv=cv, rs=rs)
display(default_models_results_df)

# model	preprocessing	    val_MAE	    std_MAE	val_RMSE	val_R2	train_MAE	train_std_MAE	train_RMSE	train_R2
# 0	ET	optimized	        1234.4953	10.3659	2129.6192	0.9522	446.4167	6.7407	886.0834	0.9917
# 1	RF	optimized	        1236.1696	12.1918	2170.3510	0.9504	674.2950	4.7058	1240.4772	0.9838
# 2	HGB	optimized	        1405.5125	15.7476	2369.1176	0.9409	1359.2691	5.8734	2258.9634	0.9462
# 3	KNN	optimized	        1441.2594	17.5903	2556.5699	0.9312	1157.0491	3.7397	2088.6082	0.9540
# 4	MLP	optimized	        1917.3194	76.4493	12393.8834	-0.6276	1859.8048	78.6396	3186.8379	0.8929
# 5	LinearReg	optimized	2242.0526	19.6574	3926.3012	0.8374	2238.3766	4.1489	3919.2553	0.8380
# 7	Baseline_Median_orig	original	6801.3202	15.0554	9976.8558	-0.0499	6801.1833	3.7168	9976.8186	-0.0499

# Use own cv in quantile_encoder instead of "5"


#### 4.3 Baseline Model Comparison Discussion of Results

Broad overview of performance, reasoning and next steps:

| **Model** | **Performance** | **Reasoning**  | **Next steps** |
| :--- | :--- | :--- | :--- |
| **Linear Regression** <br> *(Linear)* | ... | ... | Discard |
| **KNN** <br> *(Instance-based)* | ... | ... | ... |
| **RF** <br> *(Tree-based)* | ... | ... | Optimize |
| **ET** <br> *(Tree-based)* | ... | ... | ... |
| **HGB** <br> *(Tree-based)* | ... | ... | ... |
| **SVR** <br> *(Kernel-based)* | ... | ... | ... |

**1) Comparative performance under optimized preprocessing**

Under the optimized preprocessing pipeline, the ranking is dominated by **tree-based ensembles**:
 
- **ExtraTrees (ET, optimized)** is best overall with **val MAE = 1234.50 ± 10.37**, **val RMSE = 2129.62**, **val R² = 0.9522**.  

- **RandomForest (RF, optimized)** is a close second (**val MAE = 1236.17 ± 12.19**, **val RMSE = 2170.35**, **val R² = 0.9504**).
 
Both outperform the next tier (HGB and KNN), and they do so with very small fold-to-fold variability, indicating that the gain is not driven by a subset of “lucky” splits.
 
The remaining models behave as expected for this task:

- **HGB/KNN** improve with preprocessing but remain worse than ET/RF, consistent with residual bias (HGB) or limited representational flexibility (KNN).

- **Linear Regression** underperforms substantially, consistent with a pricing function that is not well-approximated by a global linear mapping in the engineered feature space.

- **MLP (optimized)** is unstable: despite a moderate MAE, the **RMSE explodes** and **R² becomes strongly negative**, which is characteristic of rare but extremely large errors (tail instability), making it unsuitable in this configuration.
 
---
 
**2) Effect size of preprocessing (optimized vs original)**

The optimized preprocessing delivers a large, consistent improvement for the models that can exploit nonlinear interactions:
 
- **ET:** val MAE improves from **1442.82 → 1234.50** (Δ ≈ -208; ≈ **-14.4%**) and val R² increases from **0.9305 → 0.9522**.  

- **RF:** val MAE improves from **1464.05 → 1236.17** (Δ ≈ -228; ≈ **-15.6%**) and val R² increases from **0.9276 → 0.9504**.  

- **HGB/KNN** show similarly strong absolute MAE reductions, but remain below ET/RF in final validation performance.
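
The effect sizes quoted for ET and RF follow directly from the reported validation MAEs; the short calculation below reproduces the absolute and relative changes.

```python
# (original, optimized) validation MAE as reported above
results = {"ET": (1442.82, 1234.50), "RF": (1464.05, 1236.17)}

improvement = {}
for model, (mae_orig, mae_opt) in results.items():
    delta = mae_opt - mae_orig             # negative = improvement
    pct = 100 * delta / mae_orig           # relative change in percent
    improvement[model] = (round(delta, 2), round(pct, 1))

print(improvement)
```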
 
The baseline median predictor (MAE ≈ 6801, negative R²) provides a clear reference: the optimized ET/RF pipelines are not incremental improvements but a substantial step-change in predictive quality.
 
An important nuance is that **Linear Regression becomes worse** under optimized preprocessing (MAE ≈ 2242 vs ≈ 1977 originally). This is consistent with preprocessing/feature engineering increasing nonlinear structure and interactions that benefit trees but do not translate into a better linear fit.
 
---
 
**3) Detailed focus: why ExtraTrees is the strongest model here**

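**ExtraTrees (optimized)** is the strongest model, but its margin over RF is small: the val MAE difference (≈ 1.7) lies within one fold standard deviation (≈ 10 to 12), so the MAE gap alone cannot be separated from noise. What supports ET is that the ranking is consistent across all reported metrics: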
 
- **Best mean performance** (lowest MAE and RMSE; highest R² among all candidates under the same preprocessing).

- **Highest stability** across folds (lowest std(MAE) among the strong models).
 
From a modelling perspective, ET’s advantage is plausible for this dataset: it aggregates many randomized decision trees, which can (i) capture complex non-linearities and heterogeneous feature interactions, (ii) naturally model threshold effects common in pricing (e.g., year, mileage, engine size, transmission interactions), and (iii) reduce variance via ensemble averaging. In other words, ET is well-matched to structured tabular data with mixed numeric/categorical signals and non-additive effects.
 
---
 
**4) Generalization and overfitting signals (train vs validation)**

Train–validation gaps provide an additional diagnostic:
 
- **ET/RF optimized** show materially lower training error than validation error (ET train MAE ≈ 446 vs val MAE ≈ 1234; RF train MAE ≈ 674 vs val MAE ≈ 1236). This indicates high capacity and some degree of overfitting, which is expected for ensembles. Crucially, this does **not** translate into unstable validation results: fold variability is low, suggesting the models generalize reliably despite the gap.
 
- **HGB optimized** has relatively close train/val errors (train MAE ≈ 1359 vs val MAE ≈ 1406), consistent with a more bias-limited model: it generalizes more “smoothly” but cannot reach the same accuracy ceiling as ET/RF.
 
- **MLP optimized** exhibits a clear robustness failure: the huge RMSE and negative R² indicate that a minority of cases are predicted extremely poorly. This is not merely “worse average performance”; it is a qualitatively different error profile (high-risk tail behaviour).
 
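A notable pattern appears for **ET_orig**: the reported **train MAE ≈ 4** and **train RMSE ≈ 89** are extraordinarily small compared to all other models and configurations. Near-zero training error is the expected behaviour of ExtraTrees with default settings (`bootstrap=False`, unlimited depth): every tree sees the full training set and can memorize it, so this reflects memorization rather than good generalization, as the far-from-perfect validation performance confirms. An artefact in how training metrics were computed in that configuration cannot be ruled out either. Under optimized preprocessing, ET yields a clearly higher training error while also improving validation performance; one plausible contributor is the cross-fitted target encoding, which encodes training rows differently during fitting than at prediction time. Taken together, this strengthens the credibility of the optimized pipeline as the final modelling choice.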
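
Near-zero training error of this kind can be reproduced on synthetic data (the data below is an assumption for illustration). With default settings, `ExtraTreesRegressor` uses `bootstrap=False` and grows unpruned trees, so every tree sees all training rows and can fit them exactly, while the validation error stays bounded by the noise in the target.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 1000 * X[:, 0] + rng.normal(0, 500, 500)  # noisy synthetic target
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# Defaults: bootstrap=False, max_depth=None, min_samples_leaf=1
et = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, et.predict(X_tr))
val_mae = mean_absolute_error(y_va, et.predict(X_va))
print(f"train MAE: {train_mae:.6f}, val MAE: {val_mae:.1f}")
```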
 
---
 
**5) Bottom-line conclusion from the evidence**

The results support a clear empirical conclusion:
 
- With optimized preprocessing, **ExtraTrees is the best-performing and most stable model** on validation metrics (MAE/RMSE/R²), with RandomForest as a very close runner-up.

- The optimized preprocessing is not cosmetic: it delivers **double-digit percentage improvements** in MAE for the strongest learners.

- Models outside tree ensembles either underfit (Linear Regression), fall short on accuracy (HGB/KNN), or exhibit unacceptable tail instability (MLP optimized).
 
Within this benchmark, **ET + optimized preprocessing** is the strongest overall solution by both accuracy and stability criteria.

 

In [None]:
merged_model_results_df = pd.concat([default_orig_models_results_df, default_models_results_df])
merged_model_results_df = merged_model_results_df.sort_values(["preprocessing", "val_MAE"])
display(merged_model_results_df)

`MAE Performance` of Models on original and optimized data preparation

In [None]:
# Create MAE_gap
merged_model_results_df["MAE_gap"] = merged_model_results_df["val_MAE"] - merged_model_results_df["train_MAE"]
merged_model_results_df["base_model"] = merged_model_results_df["model"].str.replace("_orig", "", regex=False)

# Original vs Optimized preprocessing (Validation MAE)
plt.figure(figsize=(12, 6))

for prep in ["original", "optimized"]:
    subset = merged_model_results_df[merged_model_results_df["preprocessing"] == prep]
    plt.scatter(subset["base_model"], subset["val_MAE"], label=prep)

plt.ylabel("Validation MAE")
plt.title("Original vs Optimized Preprocessing (Validation MAE)")
plt.legend()
plt.tight_layout()
plt.show()

`Train-Val-Gap` to analyze overfitting

In [None]:
# Validation vs Train MAE (grouped bar plot)
plt.figure(figsize=(12, 6))
x = range(len(merged_model_results_df))
plt.bar(x, merged_model_results_df["train_MAE"], width=0.4, label="Train MAE")
plt.bar(
    [i + 0.4 for i in x],
    merged_model_results_df["val_MAE"],
    width=0.4,
    label="Validation MAE"
)

plt.xticks([i + 0.2 for i in x], merged_model_results_df["model"], rotation=45, ha="right")
plt.ylabel("MAE")
plt.title("Train vs Validation MAE by Model")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Generalization gap (Val − Train MAE)
plt.figure(figsize=(12, 6))

plt.bar(merged_model_results_df["model"], merged_model_results_df["MAE_gap"])
plt.axhline(0, linestyle="--")

plt.ylabel("MAE Gap (Validation − Train)")
plt.title("Generalization Gap by Model")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

### 5. Hyperparameter Tuning of Top Candidates

**Tune Top candidates with Randomized Search CV:**     

After the first runs we only keep the **top candidates** for further **hyperparameter tuning** to focus on the most promising approaches and not waste computing power.     
We tune using [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) which calls the pipeline object for consistent preprocessing. An example by sklearn of calling the pipeline similar to this can be found [here](https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py).
During tuning, the parameter ranges were narrowed iteratively to decrease the size of the search space of the top candidates. This allowed faster runtimes and better final results compared to running one extensive search on a big hyperparameter grid.
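
The parameter naming used in the search spaces can be illustrated with a small stand-in pipeline on synthetic data. The step names `preprocess` and `model` match the prefixes used in this notebook's parameter grids, while the data, the scaler and the tiny search budget are assumptions. Each `__` descends one level: pipeline step, then the wrapped `regressor` of the `TransformedTargetRegressor`, then the estimator's own parameter.

```python
import numpy as np
from scipy.stats import randint
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = np.exp(1.0 + 0.5 * X[:, 0] + rng.normal(0, 0.1, 150))

pipe = Pipeline([
    ("preprocess", StandardScaler()),
    ("model", TransformedTargetRegressor(
        regressor=RandomForestRegressor(random_state=0),
        func=np.log1p, inverse_func=np.expm1)),
])
param_dist = {
    "model__regressor__n_estimators": randint(10, 20),  # upper bound is exclusive: 10..19
    "model__regressor__max_depth": randint(3, 6),       # samples 3, 4 or 5
}
search = RandomizedSearchCV(pipe, param_dist, n_iter=3, cv=3,
                            scoring="neg_mean_absolute_error", random_state=0)
search.fit(X, y)
print(search.best_params_)
```

Because the whole pipeline is passed to the search, preprocessing is refitted inside every CV split of every sampled configuration.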

#### 5.1 [Tree-Based] RandomForest

In [None]:
# Old parameter distribution
# Note: scipy's randint(low, high) excludes `high`, e.g. randint(1, 3) samples 1 or 2
rf_param_dist = {
    "model__regressor__criterion": ["squared_error"],       # optimizes way faster than "absolute_error"
    "model__regressor__n_estimators": randint(300, 350),    # number of trees
    "model__regressor__max_depth": randint(18, 22),         # depth of each tree
    "model__regressor__min_samples_split": randint(4, 6),   # min samples to split an internal node
    "model__regressor__min_samples_leaf": randint(1, 3),    # min samples per leaf (increase to reduce overfitting)
    "model__regressor__max_features": ["sqrt"],             # feature sampling strategy (sqrt performed better than log2 and None in previous tests)
    "model__regressor__bootstrap": [True],                  # True is default for RFs
}

# So far best parameter distribution based on previous runs to focus search space
# rf_param_dist = {
#     "preprocess__fs__vt__threshold": [0.0],
#     "model__regressor__criterion": ['squared_error'], # Use “absolute_error” to optimize for MAE but its significantly slower than when using “squared_error” (~5x) (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
#     "model__regressor__n_estimators": [341],
#     "model__regressor__max_depth": [20],
#     "model__regressor__min_samples_split": [4],
#     "model__regressor__min_samples_leaf": [1],
#     "model__regressor__max_features": ["sqrt"],
#     "model__regressor__oob_score": [False],
# }

rf_tuned_pipe, rf_random_search_object, rf_scores_dict = model_hyperparameter_tuning(X_train,
                                                                                     y_train,
                                                                                     cv,
                                                                                     rf_pipe_adjusted,
                                                                                     rf_param_dist,
                                                                                     n_iter=5,
                                                                                     verbose_features=[
                                                                                    ["model__regressor__n_estimators", "model__regressor__max_depth"],
                                                                                    ],
                                                                                     verbose_metric="mae",
                                                                                     verbose_plot=True,
                                                                                     verbose_top_n=20)

# Long Duration (~1min with squared_error, ~6min with absolute_error)
# Long Duration (30mins with 100 fits)

##### 5.2 [Tree-Based] Extra Trees

In [None]:
# Old parameter distribution
et_param_dist = {
    "model__regressor__criterion": ["squared_error"],           # optimizes way faster than "absolute_error"
    "model__regressor__n_estimators": randint(350, 400),        # number of trees
    "model__regressor__max_depth": randint(23, 25),             # depth of each tree
    "model__regressor__min_samples_split": randint(6, 8),       # min samples to split an internal node
    "model__regressor__min_samples_leaf": randint(1, 3),        # min samples per leaf
    "model__regressor__max_features": [0.8, 0.9],            # feature sampling strategy
    "model__regressor__bootstrap": [False]                      # default for ETs
}

# So far best parameter distribution based on previous runs to focus search space
# et_param_dist = {
#     "model__regressor__criterion": ["absolute_error"], # absolute_error for final best prediction
#     "model__regressor__n_estimators": [395],
#     "model__regressor__max_depth": [24],
#     "model__regressor__min_samples_split": [7],
#     "model__regressor__min_samples_leaf": [1],
#     "model__regressor__max_features": [0.8],
# }

et_tuned_pipe, et_random_search_object, et_scores_dict = model_hyperparameter_tuning(X_train, y_train, cv, et_pipe_adjusted, et_param_dist, n_iter=5, verbose_features=[
                                                                                    ["model__regressor__n_estimators", "model__regressor__max_depth"],
                                                                                ],
                                                                                     verbose_metric="mae",
                                                                                     verbose_plot=True,
                                                                                     verbose_top_n=20)

# Save here for deployment (see Section 7) because refit on entire train data already done inside model_hyperparameter_tuning
joblib.dump(et_tuned_pipe, "et_tuned_pipe.pkl")

# Long Duration (~20min with "absolute_error")

# Use "absolute_error" for final best performance
# MAE: 1187.1061
# RMSE: 2104.8590
# R²: 0.9533
# Best Model params: {'model__regressor__n_estimators': 395, 'model__regressor__min_samples_split': 7, 'model__regressor__min_samples_leaf': 1, 'model__regressor__max_features': 0.8, 'model__regressor__max_depth': 24, 'model__regressor__criterion': 'absolute_error'}

In [None]:
# # Old parameter distribution
# stack_param_dist = {
#     "final_estimator__learning_rate": uniform(0.02, 0.1),
#     "final_estimator__max_depth": randint(3, 10),
#     "final_estimator__min_samples_leaf": randint(3, 20),
#     "final_estimator__l2_regularization": uniform(0.0, 1.0),
# }

# # So far best parameter distribution based on previous runs to focus search space
# stack_param_dist = {
#     "final_estimator__learning_rate": [0.061135390505667866],
#     "final_estimator__max_depth": [5],
#     "final_estimator__min_samples_leaf": [10],
#     "final_estimator__l2_regularization": [0.19438003399487302]
# }

# stack_tuned_pipe, stack_random_search_object, stack_scores_dict = model_hyperparameter_tuning(X_train, y_train, cv, stacking_model, stack_param_dist, n_iter=50)
# # joblib.dump(stack_tuned_pipe, "stack_best.pkl")

### 6. Comparison of Tuned Models

**Our approach:**
- The performance metrics are compared on the same data split to ensure a fair comparison (same CV seed)
- We compare the performance of the preselected models based on 3 metrics with 1 primary metric
    - MAE
    - RMSE
    - R2
- We compare the mean results on the training and on the validation data to evaluate overfitting of the model
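
The train-versus-validation comparison can be reduced to a simple gap column. The sketch below uses placeholder numbers and hypothetical model names purely for illustration; only the column names `train_mae` and `val_mae` follow the score dictionaries used in this notebook.

```python
import pandas as pd

# Placeholder scores (not our results), one row per model
scores = pd.DataFrame(
    {"train_mae": [500.0, 900.0], "val_mae": [1300.0, 1250.0]},
    index=["model_a", "model_b"],
)

# Absolute and relative generalization gap: larger values indicate stronger overfitting
scores["mae_gap"] = scores["val_mae"] - scores["train_mae"]
scores["mae_ratio"] = scores["val_mae"] / scores["train_mae"]

# Rank by the primary metric, then inspect the gap as a secondary criterion
ranked = scores.sort_values("val_mae")
print(ranked)
```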

**Comparison of optimized model with previous models:**
- Hyperparameter tuning massively improves the performance:

| **Model** | **Performance** | **Biggest Change in HPs compared to default model** |
| :--- | :--- | :--- |
| **RF** <br> *(Tree-based)* | ... | ... |
| **ET** <br> *(Tree-based)* | ... | ... |

**Our findings:**
- Primary Metric val MAE:
    - RF performs best...
- Train vs. Val Score (Overfitting):
    - RF overfits significantly more...
- Secondary Metrics
    - RMSE:
    - R2:


==> Final decision: Use RF because it has the lowest MAE and our final goal is to minimize the MAE. Therefore, we accept the fact that RF is overfitting...

In [None]:
# Use object from randomizedsearch to retrieve the mean metrics of the best model (that was also refit on entire data for final predictions later)
model_scores = {
    "rf_tuned": rf_scores_dict,
    "et_tuned": et_scores_dict,
    # "stack_tuned": stack_scores_dict,
}

# Convert dictionary to DataFrame (transpose to have models as rows)
df_scores = pd.DataFrame(model_scores).T 
df_scores = df_scores[['val_mae', 'val_rmse', 'val_r2','train_mae', 'train_rmse', 'train_r2']]

# Sort by val_mae (primary metric)
df_scores = df_scores.sort_values(by='val_mae')

print("Model Comparison Table:")
display(df_scores)

In [None]:
plot_val_mae_comparison(df_scores)
plot_train_val_comparison(df_scores)

### 7. Deployment and Prediction on Test Set

This section deploys the final model to generate predictions for the test data. The output is the .csv file that we selected as our best submission on Kaggle.

In [None]:
# Use best parameter distribution and run it with "absolute_error" to minimize MAE even further
et_param_dist = {
    "model__regressor__criterion": ["absolute_error"], # absolute_error for final best prediction
    "model__regressor__n_estimators": [395],
    "model__regressor__max_depth": [24],
    "model__regressor__min_samples_split": [7],
    "model__regressor__min_samples_leaf": [1],
    "model__regressor__max_features": [0.8],
}

# We use the same function to get a final CV result with "absolute_error"; the model is then automatically refit on all data
et_tuned_pipe, et_random_search_object, et_scores_dict = model_hyperparameter_tuning(X_train, y_train, cv, 
                                                                                     et_pipe_adjusted,
                                                                                     et_param_dist,
                                                                                     n_iter=1,
                                                                                     verbose_features=[],
                                                                                     verbose_plot=False)

In [None]:
joblib.dump(et_tuned_pipe, "et_tuned_pipe.pkl")

In [None]:
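def predict_on_test(model_pipeline, model_name):
    """Load a fitted pipeline from disk, predict prices for the test set and write the submission file.

    model_pipeline: path to the joblib file of the fitted pipeline
    model_name: label of the model (currently unused, kept for the existing call signature)
    """
    # Load the fitted pipeline from the joblib file
    pipe_best = joblib.load(model_pipeline)

    # Predict on the test set and store the result in the global df_cars_test
    df_cars_test['price'] = pipe_best.predict(df_cars_test)
    df_cars_test[['carID', 'price']].to_csv('Group05_Version20.csv', index=False)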

In [None]:
predict_on_test("et_tuned_pipe.pkl", "ET")

In [None]:
# !kaggle competitions submit -c cars4you -f Group05_Version05.csv -m "Message" # Uncomment to submit to Kaggle

In [None]:
# !kaggle competitions submissions -c cars4you

**Findings:**     
The test score is close to the validation score, so the model should generalize well, assuming that the validation and test data are similarly distributed.

### 8. Visualizations and Insights of Best Model and the Data Preparation Pipeline

#### 8.1 Data Preparation Pipeline

**Preprocessing is independent of the model:**     
To ensure a consistent comparison, the preprocessing pipeline is independent of the model. Therefore, the insights are the same for all models.


**Visualize outputs** of each preprocessing step:     
We use the `DebugTransformer` to print the data shape and to run a ydata-profiling report after each step, for example to spot missing values in steps where we would not expect them. This speeds up experimentation considerably and helps to find errors in the pipeline quickly.     
For the visualizations we use ydata-profiling instead of our own plots from the EDA to keep the output concise.
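
The idea behind such a pass-through debugging step can be sketched as follows. This is not our `DebugTransformer` (which is defined earlier in the notebook and also triggers ydata-profiling); `ShapeLogger` and its attributes are hypothetical names for a stripped-down version that only records the shape and the number of missing values.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class ShapeLogger(TransformerMixin, BaseEstimator):
    """Pass-through step that reports shape and missing values, then returns X unchanged."""

    def __init__(self, name="STEP"):
        self.name = name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        frame = pd.DataFrame(X)
        self.last_shape_ = frame.shape
        self.last_missing_ = int(frame.isna().sum().sum())
        print(f"[{self.name}] shape={self.last_shape_}, missing={self.last_missing_}")
        return X


X_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

demo_pipe = Pipeline([
    ("debug_start", ShapeLogger("START")),
    ("imputer", SimpleImputer(strategy="median")),
    ("debug_after_impute", ShapeLogger("AFTER IMPUTATION")),
])
X_out = demo_pipe.fit_transform(X_demo)
```

Because the step returns `X` unchanged, it can be placed between any two pipeline steps without affecting the result.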


In [None]:
# Toggle verbosity of output by setting show_data or even y_data_profiling to True
show_data = True
y_data_profiling = True

# Set output to pandas DataFrames for easier inspection while we use numpy arrays for efficient model training (default)
enc_transf_scale.set_output(transform="pandas")
fs_pipe.set_output(transform="pandas")

##### 8.1.1 Start

In [None]:
debug_preprocessor_pipe = Pipeline([
    ('debug_start', DebugTransformer('START', show_data=show_data, y_data_profiling=y_data_profiling)),
])

print("Show outputs of each step in the preprocessing pipeline:")

# We call fit_transform on the entire training data here only to visualize the result. These insights are not used for any model decisions, so this is not leakage
X_result = debug_preprocessor_pipe.fit_transform(X_train, y_train)

##### 8.1.2 Cleaning

In [None]:
debug_preprocessor_pipe = Pipeline([
    ("clean", CarDataCleaner(handle_electric="other", set_carid_index=False, use_fuzzy=True, verbose=True)),
    ('debug_after_clean', DebugTransformer('AFTER CLEANING', show_data=show_data, y_data_profiling=y_data_profiling)),
])

# We call fit_transform on the entire training data here only to visualize the result. These insights are not used for any model decisions, so this is not leakage
X_result = debug_preprocessor_pipe.fit_transform(X_train, y_train)

##### 8.1.3 [Unused] Outlier Handling

In [None]:
# The code for debugging and visualizing the outlier handling is included in the unused_experiments.ipynb

##### 8.1.4 Missing Values Handling

In [None]:
debug_preprocessor_pipe = Pipeline([
    ("clean", CarDataCleaner(handle_electric="other", set_carid_index=False, use_fuzzy=True)),
    # [Unused Outlier Handling]
    # ("outliers", OutlierHandler(
    #     cols=[c for c in orig_numeric_features if c != "mileage"],      # only original numeric features here, no mileage because of log transform later
    #     methods=("iqr", "mod_z"),                                       # robust voting
    #     min_votes=2,                                                    # outlier if both methods agree
    #     iqr_k=1.5,
    #     z_thresh=3.5,
    #     action="clip",                                                   
    #     verbose=False,
    # )),
    ("imputer", IndividualHierarchyImputer()),
    ('debug_after_impute', DebugTransformer('AFTER IMPUTATION', show_data=show_data, y_data_profiling=y_data_profiling)),
])

# We call fit_transform on the entire training data here only to visualize the result. These insights are not used for any model decisions, so this is not leakage
X_result = debug_preprocessor_pipe.fit_transform(X_train, y_train)

##### 8.1.5 Feature Engineering

In [None]:
debug_preprocessor_pipe = Pipeline([
    ("clean", CarDataCleaner(handle_electric="other", set_carid_index=False, use_fuzzy=True)),
    # [Unused Outlier Handling]
    ("imputer", IndividualHierarchyImputer()),
    ("fe", CarFeatureEngineer(ref_year=2020, verbose=True)),
    ('debug_after_fe', DebugTransformer('AFTER FEATURE ENGINEERING', show_data=show_data, y_data_profiling=y_data_profiling)),
])

# We call fit_transform on the entire training data here only to visualize the result. These insights are not used for any model decisions, so this is not leakage
X_result = debug_preprocessor_pipe.fit_transform(X_train, y_train)

##### 8.1.6 Transformation, Scaling, Encoding

In [None]:
debug_preprocessor_pipe = Pipeline([
    ("clean", CarDataCleaner(handle_electric="other", set_carid_index=False, use_fuzzy=True)),
    # [Unused Outlier Handling]
    ("imputer", IndividualHierarchyImputer()),
    ("fe", CarFeatureEngineer(ref_year=2020, verbose=True)),
    ("ct", (enc_transf_scale)),
    ('debug_after_ct', DebugTransformer('AFTER COLUMN TRANSFORMER', show_data=show_data, y_data_profiling=y_data_profiling))
])

# We call fit_transform on the entire training data here only to visualize the result. These insights are not used for any model decisions, so this is not leakage
X_result = debug_preprocessor_pipe.fit_transform(X_train, y_train)

##### 8.1.7 Feature Selection

The votes of each contributor are shown, resulting in the final decision whether to keep the feature or not.
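
The voting logic can be illustrated with plain boolean masks. This is a minimal sketch under assumptions: `majority_vote`, the feature names and the votes are hypothetical, and our actual selector may weight or combine its contributors differently.

```python
import numpy as np


def majority_vote(votes, min_votes=None):
    """Keep a feature if at least `min_votes` selectors vote for it (default: strict majority)."""
    votes = np.asarray(votes, dtype=bool)      # shape: (n_selectors, n_features)
    if min_votes is None:
        min_votes = votes.shape[0] // 2 + 1
    return votes.sum(axis=0) >= min_votes


feature_names = np.array(["age", "mileage", "tax", "mpg"])
votes = [
    [True, True, False, False],   # selector 1
    [True, False, False, True],   # selector 2
    [True, True, False, False],   # selector 3
]

keep_mask = majority_vote(votes)
print(feature_names[keep_mask])
```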

**Rationale** for keeping certain features:     

In [None]:
debug_preprocessor_pipe = Pipeline([
    ("clean", CarDataCleaner(handle_electric="other", set_carid_index=False, use_fuzzy=True)),
    # [Unused Outlier Handling]
    ("imputer", IndividualHierarchyImputer()),
    ("fe", CarFeatureEngineer(ref_year=2020, verbose=True)),
    ("ct", (enc_transf_scale)),
    ("fs", (fs_pipe)),
    ('debug_after_fs', DebugTransformer('AFTER FEATURE SELECTION', show_data=show_data, y_data_profiling=y_data_profiling))
    
])

# We call fit_transform on the entire training data here only to visualize the result. These insights are not used for any model decisions, so this is not leakage
X_result = debug_preprocessor_pipe.fit_transform(X_train, y_train)

In [None]:
# Pass the feature names after VT because VT is applied before the majority voting to remove constant features
feature_names_after_vt = debug_preprocessor_pipe.named_steps['fs'].named_steps['vt'].get_feature_names_out()
plot_selector_agreement(
    majority_selector = debug_preprocessor_pipe.named_steps['fs'].named_steps['selector'], 
    feature_names = feature_names_after_vt
)

##### 8.1.8 Entire Pipeline with all outputs

In [None]:
debug_preprocessor_pipe = Pipeline([
    ('debug_start', DebugTransformer('START', show_data=show_data, y_data_profiling=y_data_profiling)),
    
    ("clean", CarDataCleaner(handle_electric="other", set_carid_index=False, use_fuzzy=True)),
    ('debug_after_clean', DebugTransformer('AFTER CLEANING', show_data=show_data, y_data_profiling=y_data_profiling)),

    # [Unused Outlier Handling]

    ("imputer", IndividualHierarchyImputer()),
    ('debug_after_impute', DebugTransformer('AFTER IMPUTATION', show_data=show_data, y_data_profiling=y_data_profiling)),

    ("fe", CarFeatureEngineer(ref_year=2020)),
    ('debug_after_fe', DebugTransformer('AFTER FEATURE ENGINEERING', show_data=show_data, y_data_profiling=y_data_profiling)),

    ("ct", (enc_transf_scale)),
    ('debug_after_ct', DebugTransformer('AFTER COLUMN TRANSFORMER', show_data=show_data, y_data_profiling=y_data_profiling)),
    
    ("fs", (fs_pipe)),
    ('debug_after_fs', DebugTransformer('AFTER FEATURE SELECTION', show_data=show_data, y_data_profiling=y_data_profiling))
])

print("Show outputs of each step in the preprocessing pipeline:")

# We call fit_transform on the entire training data here only to visualize the result. These insights are not used for any model decisions, so this is not leakage
X_result = debug_preprocessor_pipe.fit_transform(X_train, y_train)

#### 8.2 Model

**Inspection of the best model:**
- We use the feature importances (FI) of the fitted model to analyze the features

In [None]:
# Use the debug preprocessor pipeline to get final feature names by hierarchically accessing each step
feature_names_after_fs = debug_preprocessor_pipe.named_steps['fs'].get_feature_names_out()
feat_names = feature_names_after_fs
importances = rf_tuned_pipe.named_steps["model"].regressor_.feature_importances_
df_feat_importance_rf = pd.DataFrame({"feature": feat_names, "importance": importances}).sort_values("importance", ascending=False)

print("Feature Importances:")
for _, row in df_feat_importance_rf.iterrows():
    print(f"{row['feature']:30s}: {row['importance']:.6f}")

### 9. Discussion and Outlook

**Tree-based models performed best:**     
Regarding the baseline models we used in Section 5, it becomes clear that the tree-based models outperform the other models. This is probably due to ...

**Constraints:**     
However, the nature of tree-based models constrains the predictions to never be lower than the lowest or higher than the highest price in the train set. This is because tree-based models return the average price of the cars in the leaf node.
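
This constraint can be reproduced on synthetic data. The sketch below is an illustration only: the linear "price" relationship and all numbers are made up, and a plain `RandomForestRegressor` stands in for our tuned pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data: the target grows linearly with a single feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = 1000.0 * X_demo[:, 0]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Query points far outside the training range [0, 10]
X_new = np.array([[15.0], [50.0]])
pred_new = forest.predict(X_new)

print("max target in train:", y_demo.max())
print("predictions outside the range:", pred_new)
```

Both query points end up in the right-most leaf of every tree, so they receive the same prediction, which cannot exceed the largest target value seen during training.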

**Outlook:**     
A potential solution to this could be a **Stacking Regressor** that combines a tree-based model with another model that extrapolates better (e.g. RF + Ridge). A final meta-learner (also linear) can combine their predictions. If the RF predicts its maximum value but the linear model predicts a higher value, the meta-learner can follow the linear trend upward.
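
A minimal sketch of this idea with scikit-learn's `StackingRegressor` is shown below. It is an assumption-laden illustration, not part of our pipeline: the data is synthetic, the hyperparameters are arbitrary, and whether the effect carries over to the car data would still have to be tested.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with a linear trend plus noise (values are made up)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = 1000.0 * X_demo[:, 0] + rng.normal(0, 200, size=300)

# Tree-based model alone
rf_only = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# RF + Ridge as base models, combined by a linear meta-learner on out-of-fold predictions
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
).fit(X_demo, y_demo)

# Query point outside the training range [0, 10]
X_new = np.array([[20.0]])
rf_pred = rf_only.predict(X_new)[0]
stack_pred = stack.predict(X_new)[0]
print(f"RF only: {rf_pred:.0f}, stacked: {stack_pred:.0f}")
```

On this toy data the forest alone stays capped at the training maximum, while the stacked model can move above it because the meta-learner relies on the Ridge prediction.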

In [None]:
# e.g. maybe outlier handling per model (if sample size big enough) ~J
# "preprocess__fs__vt__threshold": [0.0],
    # 'fs__selector__redundancy_threshold': [0.85, 0.95, 1.01],   # TODO If this is 1.01, redundancy filtering is disabled -> hp-tuning will tell whether redundancy selection improves model performance

### 10. Open-Ended-Section

#### 10.1 SHAP Interpretability for Our Final Tree Model (Informative Only)



##### a) Objective and motivation (0.5v)

After building a strong pipeline (data cleaning → imputation → feature engineering → encoding/scaling → VT + majority-vote FS → tuned tree model), we use **SHAP (SHapley Additive exPlanations)** purely for **interpretability**.

Goals:
- Identify the **most influential features** for our final tuned tree model (`hgb_tuned_pipe`).
- Validate whether feature effects are **plausible** (age, mileage, engine, etc.).
- Understand whether **target encodings** dominate and how engineered interactions contribute.

Important: **SHAP does not change the model or feature set.** We do not build a new pipeline based on SHAP.

---

##### b) Difficulty of the task (1v)

This was non-trivial because SHAP must explain the model input **after** our preprocessing:

- The model does not see raw columns. It sees:
  - engineered features,
  - OHE columns,
  - median target-encoded columns,
  - and the reduced subset after **VT + majority voting**.
- We therefore implemented a helper to reconstruct:
  - the exact **post-preprocess feature matrix**, and
  - aligned **feature names** after applying both selection masks (VT support + majority selector mask).
- For `HistGradientBoostingRegressor`, SHAP’s **additivity check** can fail even with correct shapes. We handle this safely by disabling it (`check_additivity=False`) and keeping a robust fallback explainer if needed.
- Runtime: SHAP is expensive, so we compute explanations on a **subsample** (`sample_size=1000`) plus a small background set.

---

##### c) Correctness and efficiency (1v)

We kept the analysis correct and consistent with the production pipeline:

- **No leakage / no optimization loop:** SHAP is computed on the already fitted tuned model and used only to interpret it.
- **Exact alignment:** feature names are derived from the ColumnTransformer output and then filtered by VT + majority voting masks.
- **Global SHAP importance:** we rank features by mean absolute contribution:
  
  $$
  \text{Importance}(\text{feature}_j) = \frac{1}{N}\sum_{i=1}^{N} \left|\text{SHAP}_{i,j}\right|
  $$

- **Efficient computation:** stable ranking via subsampling.
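
As a worked example of the global importance above, the following sketch applies the mean absolute value per column to a small, made-up SHAP matrix (rows are samples, columns are features). The feature names are placeholders.

```python
import numpy as np

# Made-up SHAP values: 3 samples (rows) x 3 features (columns)
shap_matrix = np.array([
    [ 2.0, -1.0,  0.0],
    [-4.0,  1.0,  0.5],
    [ 3.0, -1.0, -0.5],
])
feature_names = np.array(["f0", "f1", "f2"])

# Mean absolute contribution per feature: (1/N) * sum_i |SHAP_ij|
importance = np.abs(shap_matrix).mean(axis=0)

# Rank features from most to least important
order = np.argsort(importance)[::-1]
for name, value in zip(feature_names[order], importance[order]):
    print(f"{name}: {value:.3f}")
```

Taking the absolute value first prevents positive and negative contributions of the same feature from cancelling each other out.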

---

##### d) Results and interpretation (1v)

Model context:
- Final tuned model: `hgb_tuned_pipe`
- Total features after preprocessing + FS: **26**

Top drivers (mean |SHAP|), excerpt:

| Feature | Importance | Interpretation |
|---|---:|---|
| `median_te__model_median_te` | 2850.28 | Model-level median target encoding (strong market-value proxy) |
| `num__mpg_x_age` | 1659.43 | Interaction: MPG × age |
| `num__engineSize` | 1367.63 | Engine size (segment/performance proxy) |
| `log__mileage` | 1151.75 | Log mileage (diminishing marginal effect) |
| `num__age_rel_model` | 642.22 | Age relative to typical age within the model |
| `median_te__brand_trans_median_te` | 500.44 | Brand × transmission median target encoding |
| `num__engine_per_mpg` | 497.18 | Performance/efficiency ratio |
| `cat__transmission_Manual` | 406.37 | Manual transmission effect |

Key takeaways:
- **Target encodings dominate** global importance, especially the model-level encoding. This is expected because model identity carries a large fraction of price signal.
- **Engineered interactions matter** (`mpg_x_age`, `engine_per_mpg`, `mpg_x_engine`), confirming that our feature engineering adds useful non-additive structure.
- **Mileage and age appear in strong, intuitive forms** (log mileage, relative age vs model/brand), supporting both predictive performance and interpretability.

Beeswarm plot (distribution of effects):
- **`median_te__model_median_te` dominates** and shows a wide SHAP spread → model identity (via median target encoding) is the strongest pricing signal.
- **Mileage effect is non-linear** (`log__mileage`): low mileage produces strong positive contributions; high mileage pushes predictions down, but with diminishing marginal impact (consistent with log-transform).
- **Engine/performance features matter across many cars** (`num__engineSize`, `num__engine_per_mpg`, `num__mpg_x_age`) and show heterogeneous spreads → effects differ by segment (e.g., sporty vs economy cars).
- **Relative positioning features stabilize predictions** (`num__age_rel_model`, `num__engine_rel_model`): the model compares a car to what is “typical” within its model/segment, not only absolute values.
- **Transmission signal is consistent** (`cat__transmission_Manual`): manual cars tend to shift predictions in one direction (dataset-dependent), but the spread indicates exceptions (model/brand interactions).


---

##### e) Alignment with objectives (0.5v)

This section adds transparency without changing the modeling procedure:

- Feature selection stays **VT + majority voting** (robust, leakage-safe, model-agnostic).
- SHAP is used **only** to explain the final tuned model.
- The resulting drivers (target encodings + age/mileage/engine + interactions) are consistent with domain logic and support trust in the final pipeline.

---


In [None]:
# Get Feature names aligned with X_proc (after preprocess incl. VT + majority voting)
def get_pipeline_feature_matrix(pipe, X):
    """
    Given a fitted model pipeline with steps:
      'preprocess' -> 'model'
    where preprocess itself is a Pipeline:
      clean -> group_imputer -> fe -> ct -> fs(vt + selector)
    return:
      X_proc: 2D numpy array of features just before the model step
      feat_names: 1D np.array of feature names aligned with X_proc columns
    """
    pre = pipe.named_steps["preprocess"]

    # 1) Transform to model-ready matrix
    X_proc = pre.transform(X)

    # 2) Reconstruct feature names: ct -> vt mask -> majority selector mask
    ct = pre.named_steps["ct"]
    feat_names = np.asarray(ct.get_feature_names_out(), dtype=object)

    fs = pre.named_steps.get("fs", None)
    if fs is not None:
        # VT (dictator) first
        vt = fs.named_steps.get("vt", None)
        if vt is not None and hasattr(vt, "get_support"):
            feat_names = feat_names[vt.get_support()]

        # Majority selector next
        sel = fs.named_steps.get("selector", None)
        if sel is not None and hasattr(sel, "support_mask_") and sel.support_mask_ is not None:
            feat_names = feat_names[sel.support_mask_]

    return X_proc, feat_names


In [None]:
# Compute SHAP Importance
def compute_shap_importance(
    pipe,
    X,
    sample_size=1000,
    seed=rs,
    model_name=None,
):
    """
    Compute global SHAP feature importances for a fitted pipeline (informative only).

    Fix:
      - TreeExplainer additivity check can fail for some sklearn tree implementations (incl. HGB).
        We disable it via check_additivity=False.
      - If TreeExplainer still fails, fall back to a model-agnostic SHAP explainer.
    """
    # Extract processed feature matrix and names
    X_proc, feat_names = get_pipeline_feature_matrix(pipe, X)

    # Subsample rows for SHAP (for speed)
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(X_proc))
    idx = rng.choice(len(X_proc), n, replace=False)
    X_sample = X_proc[idx]

    # Underlying model (last step in pipeline)
    model = pipe.named_steps["model"]
    tag = model_name or model.__class__.__name__

    # Background for SHAP (small subset)
    bg_n = min(200, len(X_sample))
    bg_idx = rng.choice(len(X_sample), bg_n, replace=False)
    X_bg = X_sample[bg_idx]

    # --- Try TreeExplainer first (fast for tree models) ---
    try:
        explainer = shap.TreeExplainer(model, X_bg)
        shap_vals = explainer.shap_values(X_sample, check_additivity=False)

        # shap_vals can be list-like in some setups; regression should be 2D
        if isinstance(shap_vals, list):
            shap_vals = shap_vals[0]

        base_vals = getattr(explainer, "expected_value", 0.0)
        shap_values = shap.Explanation(
            values=shap_vals,
            base_values=np.full((len(X_sample),), base_vals) if np.isscalar(base_vals) else base_vals,
            data=X_sample,
            feature_names=feat_names,
        )

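    except Exception as e:
        # --- Fallback: model-agnostic explainer (slower but robust) ---
        # Report why TreeExplainer failed instead of silently swallowing the exception
        print(f"TreeExplainer failed for {tag} ({type(e).__name__}: {e}); falling back to shap.Explainer")
        explainer = shap.Explainer(model.predict, X_bg, feature_names=feat_names)
        shap_values = explainer(X_sample)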

    importance = np.abs(shap_values.values).mean(axis=0)

    shap_df = (
        pd.DataFrame({"feature": feat_names, "importance": importance})
        .sort_values("importance", ascending=False)
        .reset_index(drop=True)
    )

    print(f"Top 20 features by SHAP for {tag}:")
    print(shap_df.head(20).to_string(index=False))

    return shap_df, feat_names, shap_values, X_sample


In [None]:
# SHAP Plots
def plot_top_shap_bar(shap_df, model_name, top_k):
    """
    Horizontal bar plot of top_k features by mean |SHAP|.
    """
    top_df = shap_df.head(top_k).iloc[::-1]  # reverse for nicer barh order
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(top_df["feature"], top_df["importance"])
    ax.set_xlabel("Average |SHAP| value")
    ax.set_title(f"Top {top_k} features by SHAP – {model_name}")
    plt.tight_layout()
    plt.show()


def plot_shap_beeswarm(shap_values, X_sample, feat_names, model_name, max_display=20):
    """
    SHAP summary (beeswarm) plot for top features.
    """
    X_df = pd.DataFrame(X_sample, columns=feat_names)

    # Create one figure and tell SHAP not to auto-show
    plt.figure(figsize=(10, 6))
    shap.summary_plot(shap_values.values, X_df, max_display=max_display, show=False)

    plt.title(f"SHAP Beeswarm – {model_name}")
    plt.tight_layout()
    plt.show()


##### RF SHAP

In [None]:
# RandomForest baseline report + SHAP
rf_pipe = rf_tuned_pipe

# Feature matrix + names after preprocess (clean+impute+fe+ct+fs)
X_proc_rf, feat_names_rf = get_pipeline_feature_matrix(rf_pipe, X_train)
n_features_total_rf = X_proc_rf.shape[1]

print("RandomForest (tuned pipe) – feature space info:")
print(f"Total features used: {n_features_total_rf}")

shap_importance_rf, feat_names_rf, shap_vals_rf, X_sample_rf = compute_shap_importance(
    rf_pipe,
    X_train,
    sample_size=1000,
    seed=rs,
    model_name="RandomForest",
)

plot_top_shap_bar(shap_importance_rf, model_name="RandomForest", top_k=20)
plot_shap_beeswarm(shap_vals_rf, X_sample_rf, feat_names_rf, model_name="RandomForest", max_display=20)


#### 10.2 Global vs Brand- and Model-Specific Models

##### a) Objective and motivation (0.5v)

We investigated how far Cars4You should specialize its pricing models:

1. **Brand level:** Is a single global price model for all brands sufficient, or do separate brand-specific models reduce pricing error?
2. **Brand–model level:** For frequent models (e.g. “Skoda Octavia”, “VW Golf”), does an even more specialized model per (brand, model) segment bring additional improvements, or does it overfit?

Concretely, we started from our final production pipeline `hgb_final_shap_pipe` (full preprocessing + SHAP-based feature selection + HGB regressor) and compared:

- **Global model:** trained on all cars, evaluated only on a given segment.
- **Brand-specific model:** same preprocessing and SHAP selector, but the regressor re-fitted only on cars of a given brand.
- **Brand–model-specific model:** same preprocessing and SHAP selector, but the regressor re-fitted only on cars of a given (brand, model) pair.

We measured mean absolute error (MAE) and root mean squared error (RMSE) per segment. This answers how much performance we gain by moving from:

> one global model → several brand models → many brand–model models.

---

##### b) Difficulty of tasks (1v)

Extending the existing solution to this multi-level comparison was non-trivial:

- **Complex pipeline with a custom SHAP selector**  
  The final pipeline contains a `ShapTopKColumnSelector` that is not clone-compatible. Standard `cross_val_score` + `clone` would fail. We therefore implemented manual cross-validation:
  - reuse the fitted preprocessing + SHAP selector from `hgb_final_shap_pipe`;
  - only re-fit the final regressor for each fold and segment.

- **Consistent and fair evaluation protocol**  
  We reused the same 5-fold KFold strategy (`n_splits`, `shuffle`, `random_state`) and the same target (`price`) as in the main project. For each fold and segment:
  - the global model is trained on all training rows but evaluated only on validation rows belonging to that segment;
  - the segment-specific model is trained and evaluated only on that segment’s rows.

- **Handling data imbalance**  
  Data is unevenly distributed across brands and models. We therefore:
  - restricted the analysis to brands with at least 500 training samples;
  - for brand–model analysis, kept only frequent pairs (e.g. Skoda Octavia, VW Golf) with a minimum sample threshold per segment;
  - enforced additional checks per fold (minimum training size) to avoid fits on a handful of cars.

- **Manual metric computation**  
  Due to an older `sklearn` version (no `squared=` parameter), RMSE had to be computed manually as `sqrt(MSE)` inside the CV loops instead of relying on built-in scorers.

Overall, the task required custom CV logic, careful reuse of the production pipeline, and multiple levels of segment-wise filtering.
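
The evaluation protocol can be sketched on synthetic data as follows. This is a simplified illustration under assumptions: a `LinearRegression` replaces the fixed preprocessor plus regressor, the three "brands" and their prices are made up, and `MIN_TRAIN` and `compare_segment` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data: three "brands" with different depreciation slopes (all values made up)
rng = np.random.default_rng(0)
brand = np.repeat(["A", "B", "C"], [350, 240, 10])   # "C" is a tiny segment
age = rng.uniform(0, 10, size=brand.size)
slope = pd.Series(brand).map({"A": -800.0, "B": -1500.0, "C": -1000.0}).to_numpy()
X_demo = pd.DataFrame({"age": age})
y_demo = 20000.0 + slope * age + rng.normal(0, 500, size=brand.size)

# One set of splits, reused for every comparison
splits = list(KFold(n_splits=5, shuffle=True, random_state=42).split(X_demo))
MIN_TRAIN = 50  # hypothetical per-fold minimum training size


def compare_segment(segment):
    """Mean per-fold MAE/RMSE of the global vs. the segment-specific model on one segment."""
    rows = []
    for train_idx, val_idx in splits:
        tr_seg = train_idx[brand[train_idx] == segment]
        va_seg = val_idx[brand[val_idx] == segment]
        if len(tr_seg) < MIN_TRAIN or len(va_seg) == 0:
            # Guard against tiny segments: record NaN and skip the fit
            rows.append({"mae_global": np.nan, "mae_segment": np.nan,
                         "rmse_global": np.nan, "rmse_segment": np.nan})
            continue
        # Global model: trained on all training rows; segment model: only on the segment's rows
        global_model = LinearRegression().fit(X_demo.iloc[train_idx], y_demo[train_idx])
        segment_model = LinearRegression().fit(X_demo.iloc[tr_seg], y_demo[tr_seg])
        y_val = y_demo[va_seg]
        pred_g = global_model.predict(X_demo.iloc[va_seg])
        pred_s = segment_model.predict(X_demo.iloc[va_seg])
        rows.append({
            "mae_global": mean_absolute_error(y_val, pred_g),
            "mae_segment": mean_absolute_error(y_val, pred_s),
            "rmse_global": np.sqrt(mean_squared_error(y_val, pred_g)),   # RMSE = sqrt(MSE)
            "rmse_segment": np.sqrt(mean_squared_error(y_val, pred_s)),
        })
    return pd.DataFrame(rows).mean()


summary = pd.DataFrame({seg: compare_segment(seg) for seg in ["A", "B", "C"]}).T.dropna()
summary["delta_mae"] = summary["mae_segment"] - summary["mae_global"]
print(summary.round(1))
```

Both models are always scored on the same validation rows of a segment, and segments that are too small are dropped via `dropna` rather than evaluated on a handful of cars.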

---

##### c) Correctness and efficiency of implementation (1v)

To keep the analysis correct and reasonably efficient we:

- **Reused the production pipeline as-is**  
  All preprocessing (imputation, scaling, encoding, price anchors) and SHAP-based feature selection are exactly the same as in the final model used on the test set. Only the last regressor is re-fit for segment-specific models.

- **Used a single CV design for all comparisons**  
  The same KFold splits (`splits = list(KFold(...).split(X_train, y_train))`) are reused for:
  - global per-brand evaluation;
  - brand-specific evaluation;
  - global per (brand, model) evaluation;
  - brand–model-specific evaluation.  
  This removes extra randomness and makes differences directly comparable.

- **Implemented clear separation between global and segment-specific training**  
  - For brands:  
    - global: fit on all brands, compute metrics only on that brand’s validation rows;  
    - brand-specific: use the fixed preprocessor, fit a fresh regressor only on that brand’s transformed data.
  - For (brand, model) pairs:  
    - global: fit on all cars, compute metrics only on that (brand, model) validation subset;  
    - brand–model-specific: fixed preprocessor + fresh regressor only on that pair.

- **Guarded against tiny segments**  
  Only segments with enough rows at dataset level and per fold are evaluated. Otherwise, metrics are set to NaN and those segments are excluded via `dropna`.

This design produces stable segment-wise estimates without changing the core production pipeline.
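
The guard against tiny segments can be illustrated with a small, self-contained sketch. The data, the threshold value and the `DummyRegressor` placeholder are assumptions for the example; only the pattern (skip unusable folds, return NaN, then `dropna`) mirrors the description above.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Illustrative data: one large segment and one tiny segment.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "brand": ["ford"] * 200 + ["rare"] * 6,
    "price": rng.normal(12000, 3000, 206),
})

# One shared list of splits, reused for every segment.
splits = list(KFold(n_splits=5, shuffle=True, random_state=0).split(data))
MIN_TRAIN_PER_FOLD = 50  # hypothetical threshold


def segment_cv_mae(segment):
    """Mean MAE over usable folds, or NaN if no fold is usable."""
    maes = []
    for train_idx, val_idx in splits:
        train, val = data.iloc[train_idx], data.iloc[val_idx]
        train_seg = train[train["brand"] == segment]
        val_seg = val[val["brand"] == segment]
        if len(val_seg) == 0 or len(train_seg) < MIN_TRAIN_PER_FOLD:
            continue  # skip folds that would fit or score on too few cars
        # Placeholder model: predicts the median training price of the segment.
        reg = DummyRegressor(strategy="median")
        reg.fit(np.zeros((len(train_seg), 1)), train_seg["price"])
        y_pred = reg.predict(np.zeros((len(val_seg), 1)))
        maes.append(mean_absolute_error(val_seg["price"], y_pred))
    return float(np.mean(maes)) if maes else np.nan


results = pd.DataFrame(
    [{"brand": seg, "MAE_mean": segment_cv_mae(seg)} for seg in ["ford", "rare"]]
)
summary = results.dropna(subset=["MAE_mean"])  # drops the segment without usable folds
print(summary)
```

Returning NaN instead of raising keeps the loop over segments simple, and the later `dropna` makes the exclusion explicit in one place.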

---

##### d) Discussion of results (1v)

#### Brand-level comparison

For the main brands, the final summary table (MAE in GBP) is:

| Brand    | MAE (global) | MAE (brand) | ΔMAE (brand – global) | n_samples |
|----------|--------------|-------------|------------------------|-----------|
| Ford     | 966.7        | 929.2       | -37.6                  | 16,371    |
| BMW      | 1,828.0      | 1,792.8     | -35.2                  | 7,540     |
| Mercedes | 1,968.7      | 1,934.6     | -34.1                  | 11,899    |
| VW       | 1,299.7      | 1,287.8     | -11.9                  | 10,572    |
| Audi     | 1,806.0      | 1,794.5     | -11.5                  | 7,456     |
| Skoda    | 1,174.6      | 1,165.9     | -8.7                   | 4,380     |
| Toyota   |   926.7      |   920.9     | -5.9                   | 4,714     |
| Opel     |   777.1      |   774.5     | -2.6                   | 9,530     |

Key observations:

- **Ford, BMW and Mercedes benefit the most from brand-specific models.**  
  Their MAE drops by about 34–38 GBP per car (≈ 3.9% relative improvement for Ford, ≈ 1.7–1.9% for BMW and Mercedes). This is meaningful at scale and based on large sample sizes.

- **Moderate gains for VW, Audi, Skoda, Toyota.**  
  MAE improvements are smaller (6–12 GBP, below 1% relative), but still consistent in sign.

- **Minimal benefit for Opel.**  
  The improvement for Opel (≈ 2.6 GBP) is negligible relative to its base MAE. The global model already captures Opel’s pricing patterns.

- **RMSE sometimes increases slightly for brand-specific models.**  
  For some brands, RMSE is marginally higher, indicating that brand-specific models reduce typical errors but can perform worse on rare/extreme cases, hinting at mild overfitting in the tails.

Overall, moving from a global to a brand-specific layer never increased MAE for the brands tested and clearly helps Ford, BMW and Mercedes, but the absolute gains are moderate.
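
The relative improvements can be recomputed directly from the brand-level summary table. The sketch below copies the rounded MAE values from that table, so the recomputed deltas may differ from the table's ΔMAE column by 0.1 GBP; the column names are chosen for this example only.

```python
import pandas as pd

# MAE values (GBP) copied from the brand-level summary table.
brand_mae = pd.DataFrame(
    {
        "brand": ["Ford", "BMW", "Mercedes", "VW", "Audi", "Skoda", "Toyota", "Opel"],
        "mae_global": [966.7, 1828.0, 1968.7, 1299.7, 1806.0, 1174.6, 926.7, 777.1],
        "mae_brand": [929.2, 1792.8, 1934.6, 1287.8, 1794.5, 1165.9, 920.9, 774.5],
    }
)

# Absolute change (negative = brand-specific model is better).
brand_mae["delta_mae"] = brand_mae["mae_brand"] - brand_mae["mae_global"]
# Relative change in percent of the global MAE.
brand_mae["delta_pct"] = 100 * brand_mae["delta_mae"] / brand_mae["mae_global"]

print(brand_mae.round(2).to_string(index=False))
```

With these inputs, Ford comes out at roughly -3.9%, BMW and Mercedes at just under -2%, and Opel at about -0.3%.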

#### Brand–model-level comparison

For frequent (brand, model) pairs, the analysis shows a more mixed picture. A selection of results (MAE in GBP):

| Brand   | Model        | MAE global | MAE seg | ΔMAE (seg – global) | n_samples |
|---------|--------------|-----------:|--------:|---------------------:|----------:|
| Skoda   | kamiq        | 1,418.6    | 1,107.1 | -311.5               | 109       |
| VW      | amarok       | 2,988.7    | 2,801.3 | -187.4               | 83        |
| Mercedes| x-class      | 3,592.8    | 3,448.9 | -144.0               | 59        |
| Skoda   | scala        | 1,175.7    | 1,100.5 | -75.2                | 147       |
| Ford    | b-max        |   640.2    |   578.1 | -62.1                | 248       |
| Skoda   | octavia      | 1,089.4    | 1,031.9 | -57.5                | 1,021     |
| Skoda   | fabia        |   845.8    |   795.1 | -50.6                | 1,069     |
| VW      | up           |   645.2    |   608.1 | -37.1                | 608       |
| BMW     | 1 series     | 1,158.2    | 1,130.0 | -28.1                | 1,358     |
| VW      | golf         | 1,151.0    | 1,155.8 |  +4.8                | 3,515     |
| Ford    | fiesta       |   753.3    |   762.7 |  +9.3                | 4,470     |
| Toyota  | aygo         |   557.1    |   576.7 | +19.6                | 1,381     |
| BMW     | 7 series     | 3,146.7    | 4,751.9 | +1,605.2             | 71        |
| Mercedes| gls class    | 3,295.8    | 5,906.4 | +2,610.7             | 54        |

Patterns:

- **Some compact, relatively frequent models benefit from model-level specialization.**  
  Examples: Skoda Kamiq, Scala, Octavia and Fabia; VW up; Ford B-MAX.  
  These segments see clear MAE reductions (roughly 35–310 GBP), and RMSE also tends to decrease. Here, the model-level regressor can exploit consistent, model-specific patterns.

- **For many common volume models, gains are small or negative.**  
  VW Golf, Ford Fiesta, Opel Corsa, Toyota Yaris, etc. often show small positive ΔMAE and/or higher RMSE. For these, splitting by model does not significantly improve typical error and can worsen extreme cases.

- **For rare, high-priced models, model-specific fits severely overfit.**  
  BMW 7 series, BMW X6, Mercedes GLS/S/SL/CLS class, VW Beetle, Toyota Avensis/Verso and others exhibit very large increases in MAE (hundreds to thousands of GBP) and often huge increases in RMSE.  
  These models have small sample sizes (often <100 cars), so a separate model per (brand, model) is clearly not robust.

In short:

- Moving from **global → brand** is often beneficial and relatively safe for high-volume brands.
- Moving further from **brand → brand–model** brings strong improvements only for a small subset of frequent models; for many others, especially rare premium models, it clearly overfits.

---

##### e) Alignment with objectives (0.5v)

This extended open-ended study:

- Directly addresses and expands a suggested topic (“global vs brand-specific models”), and pushes it one step further to **brand–model** specialization.
- Reuses the final production pipeline in full and applies a consistent CV protocol, so the conclusions are directly relevant for deployment.
- Provides a **clear design recommendation**:
  - Use a **single global model** as the base.
  - Optionally introduce **brand-level specialization** for a small set of high-volume brands (e.g. Ford, BMW, Mercedes) where MAE improvements are meaningful.
  - Avoid full **brand–model specialization** except potentially for a handful of very frequent models with demonstrated gains; for most models, especially rare and expensive ones, splitting further clearly overfits.

This shows that we not only tuned a strong model, but also explored the trade-off between model complexity and robustness in a structured, data-driven way.
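
One way the recommended design (global base model plus optional brand-level specialization) could look in code is sketched below. The class `BrandRoutedRegressor`, its parameters and the toy data are hypothetical and not part of the notebook; `LinearRegression` is only a lightweight stand-in for the production regressor.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


class BrandRoutedRegressor:
    """Global model with optional per-brand overrides (illustrative sketch)."""

    def __init__(self, base_estimator, feature_cols, brand_col="brand",
                 specialised_brands=()):
        self.base_estimator = base_estimator
        self.feature_cols = list(feature_cols)
        self.brand_col = brand_col
        self.specialised_brands = set(specialised_brands)

    def fit(self, X, y):
        # The global model always sees every row.
        self.global_model_ = clone(self.base_estimator).fit(X[self.feature_cols], y)
        self.brand_models_ = {}
        for brand in self.specialised_brands:
            mask = (X[self.brand_col] == brand).to_numpy()
            if mask.any():
                self.brand_models_[brand] = clone(self.base_estimator).fit(
                    X.loc[mask, self.feature_cols], y[mask]
                )
        return self

    def predict(self, X):
        # Start from global predictions, then overwrite specialised brands.
        pred = np.asarray(self.global_model_.predict(X[self.feature_cols]), dtype=float)
        for brand, model in self.brand_models_.items():
            mask = (X[self.brand_col] == brand).to_numpy()
            if mask.any():
                pred[mask] = model.predict(X.loc[mask, self.feature_cols])
        return pred


# Toy data: two brands with different depreciation slopes.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "brand": ["ford"] * 100 + ["opel"] * 100,
    "age": rng.uniform(0, 10, 200),
})
slope = np.where(X["brand"] == "ford", -1500.0, -700.0)
y = pd.Series(15000 + slope * X["age"] + rng.normal(0, 200, 200))

router = BrandRoutedRegressor(
    LinearRegression(), ["age"], specialised_brands=["ford"]
).fit(X, y)
pred = router.predict(X)
print(pred[:3], pred[-3:])
```

Brands without a dedicated model fall back to the global model automatically, so the set of specialised brands can stay small and be extended only where cross-validation shows a gain.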



In [None]:
# Load final production pipeline (preprocessing + SHAP + HGB)
# hgb_final_shap_pipe = load("hgb_final_shap_pipe.pkl")
# Use the same pipeline that is evaluated below (not the tuned random forest).
pipe_global = hgb_final_shap_pipe
brand_col = "brand"

# Fail early if the brand column is missing
assert brand_col in X_train.columns, (
    f"Brand column '{brand_col}' not found in X_train. "
    f"First columns: {X_train.columns.tolist()[:20]}"
)


In [None]:
# Inspect brand frequencies
brand_counts = X_train[brand_col].value_counts()
print("Top brands by count:")
print(brand_counts.head(15))

# Select candidate brands
#    - TOP_K: max number of brands to compare.
#    - MIN_SAMPLES: minimum number of rows per brand.

TOP_K = 8
MIN_SAMPLES = 500  # adjust if needed

candidate_brands = [
    b for b, cnt in brand_counts.items()
    if cnt >= MIN_SAMPLES
][:TOP_K]

print("\nCandidate brands used in the comparison:")
print(candidate_brands)


In [None]:
# Cross-validation setup: we reuse the same KFold splits for all evaluations
# to keep comparisons fair and to reduce randomness.
from copy import deepcopy

cv = KFold(n_splits=5, shuffle=True, random_state=rs)
# The seed is set on the KFold object; split() itself takes no random_state.
splits = list(cv.split(X_train, y_train))


def eval_global_for_brand(model, X, y, brand_col, brand, splits):
    """
    Evaluate the global pipeline for a single brand.

    The model is trained on all brands in each fold, but the error
    is computed only on validation rows belonging to the given brand.
    """
    maes, rmses = [], []
    n_obs = 0

    for train_idx, val_idx in splits:
        # Split data for this fold
        X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Train global model on ALL brands in this fold.
        # Fit a deep copy so the fitted pipeline passed in is not overwritten
        # (clone is not an option because of the custom SHAP selector).
        fold_model = deepcopy(model)
        fold_model.fit(X_tr, y_tr)
        y_pred = fold_model.predict(X_val)

        # Restrict metrics to the target brand in validation
        mask_b = (X_val[brand_col] == brand)
        if mask_b.sum() == 0:
            continue

        y_val_b = y_val[mask_b]
        y_pred_b = y_pred[mask_b]

        mae = mean_absolute_error(y_val_b, y_pred_b)
        mse = mean_squared_error(y_val_b, y_pred_b)
        rmse = float(np.sqrt(mse))

        maes.append(mae)
        rmses.append(rmse)
        n_obs += mask_b.sum()

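    # Return NaN (instead of a warning from np.mean on an empty list)
    # if the brand never appeared in any validation fold.
    return {
        "MAE_mean": float(np.mean(maes)) if maes else np.nan,
        "MAE_std":  float(np.std(maes))  if maes else np.nan,
        "RMSE_mean": float(np.mean(rmses)) if rmses else np.nan,
        "RMSE_std":  float(np.std(rmses))  if rmses else np.nan,
        "n": int(n_obs),
    }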


def eval_brand_specific(pipe_global, X, y, brand_col, brand, splits,
                        min_train_per_fold=50):
    """
    Evaluate a brand-specific model for a single brand.

    Preprocessing + SHAP selection are kept fixed (from pipe_global).
    In each fold:
      - Transform the brand's data with the fixed preprocessor.
      - Fit a fresh regressor (clone of the final step) only on that brand.
      - Evaluate on validation rows of that brand.
    """
    # Split the pipeline into:
    # - preproc: all steps except the final regressor
    # - base_reg: the final regressor template
    preproc = pipe_global[:-1]
    base_reg = pipe_global[-1]

    maes, rmses = [], []
    n_obs = 0

    for train_idx, val_idx in splits:
        X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Keep only this brand in train/val
        mask_tr = (X_tr[brand_col] == brand)
        mask_val = (X_val[brand_col] == brand)

        if mask_val.sum() == 0:
            # No validation examples of this brand in this fold
            continue
        if mask_tr.sum() < min_train_per_fold:
            # Too few training examples for a stable brand-specific fit
            continue

        X_tr_b, y_tr_b = X_tr[mask_tr], y_tr[mask_tr]
        X_val_b, y_val_b = X_val[mask_val], y_val[mask_val]

        # Do NOT refit the preprocessor; just transform with the fitted one
        X_tr_b_proc = preproc.transform(X_tr_b)
        X_val_b_proc = preproc.transform(X_val_b)

        # Fresh regressor for this fold
        reg = clone(base_reg)
        reg.fit(X_tr_b_proc, y_tr_b)

        y_pred_b = reg.predict(X_val_b_proc)

        mae = mean_absolute_error(y_val_b, y_pred_b)
        mse = mean_squared_error(y_val_b, y_pred_b)
        rmse = float(np.sqrt(mse))

        maes.append(mae)
        rmses.append(rmse)
        n_obs += len(y_val_b)

    return {
        "MAE_mean": float(np.mean(maes)) if maes else np.nan,
        "MAE_std":  float(np.std(maes))  if maes else np.nan,
        "RMSE_mean": float(np.mean(rmses)) if rmses else np.nan,
        "RMSE_std":  float(np.std(rmses))  if rmses else np.nan,
        "n": int(n_obs),
    }


In [None]:
# Evaluate both models for each candidate brand

pipe_global = hgb_final_shap_pipe 

global_results = []
brand_specific_results = []

for brand in candidate_brands:
    print("Evaluating brand:", brand)

    # 1) Global: train on all brands, measure only this brand in validation
    res_g = eval_global_for_brand(
        pipe_global,
        X_train,
        y_train,
        brand_col=brand_col,
        brand=brand,
        splits=splits,
    )
    res_g.update({
        "brand": brand,
        "model_type": "global",
    })
    global_results.append(res_g)

    # 2) Brand-specific: preproc fixed, regressor trained only on this brand
    res_b = eval_brand_specific(
        pipe_global,
        X_train,
        y_train,
        brand_col=brand_col,
        brand=brand,
        splits=splits,
    )
    res_b.update({
        "brand": brand,
        "model_type": "brand_specific",
    })
    brand_specific_results.append(res_b)

# Collect results into DataFrames
df_global = pd.DataFrame(global_results)
df_brand = pd.DataFrame(brand_specific_results)

print("\nGlobal model results per brand:")
display(df_global)

print("\nBrand-specific model results per brand:")
display(df_brand)


In [None]:
# Clean results and compute performance differences

# Drop any brands where evaluation failed (NaNs)
df_global = df_global.dropna(subset=["MAE_mean", "RMSE_mean"])
df_brand  = df_brand.dropna(subset=["MAE_mean", "RMSE_mean"])

# Merge global vs brand-specific results
df_compare = df_global.merge(
    df_brand,
    on="brand",
    suffixes=("_global", "_brand"),
)

# Compute deltas:
#   delta_MAE  < 0  -> brand-specific has lower MAE (better)
#   delta_RMSE < 0  -> brand-specific has lower RMSE (better)
df_compare["delta_MAE"]  = df_compare["MAE_mean_brand"]  - df_compare["MAE_mean_global"]
df_compare["delta_RMSE"] = df_compare["RMSE_mean_brand"] - df_compare["RMSE_mean_global"]

# Sort by delta_MAE (most improvement first)
df_compare_sorted = df_compare.sort_values("delta_MAE")

print("Per-brand comparison (most improvement first):")
display(df_compare_sorted)


In [None]:
# Visualizations: bar plots for MAE and ΔMAE

# Global vs Brand-specific MAE per brand
plt.figure(figsize=(8, 4))
x = np.arange(len(df_compare_sorted))
width = 0.35

plt.bar(
    x - width / 2,
    df_compare_sorted["MAE_mean_global"],
    width,
    label="Global model",
)
plt.bar(
    x + width / 2,
    df_compare_sorted["MAE_mean_brand"],
    width,
    label="Brand-specific model",
)

plt.xticks(x, df_compare_sorted["brand"], rotation=45, ha="right")
plt.ylabel("MAE (GBP)")
plt.title("Global vs Brand-specific models (MAE per brand)")
plt.legend()
plt.tight_layout()
plt.show()

# ΔMAE per brand (negative = improvement with specialization)
plt.figure(figsize=(8, 3))
plt.bar(df_compare_sorted["brand"], df_compare_sorted["delta_MAE"])
plt.axhline(0, linestyle="--")
plt.xticks(rotation=45, ha="right")
plt.ylabel("Δ MAE (brand - global)")
plt.title("Effect of model specialization per brand\n(negative = brand-specific MAE is lower)")
plt.tight_layout()
plt.show()


**Brand-Model Segmentation**
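
The cells below iterate over `candidate_pairs` and use `model_col`, neither of which is defined in this section. The following is one possible definition, given as a sketch under assumptions: the helper `select_candidate_pairs`, the column name `"model"` and the threshold are hypothetical, and the toy frame only demonstrates the behaviour.

```python
import pandas as pd


def select_candidate_pairs(X, brand_col, model_col, brands, min_samples):
    """Return row counts per (brand, model) pair with at least min_samples rows."""
    in_brands = X[brand_col].isin(brands)
    pair_counts = X.loc[in_brands].groupby([brand_col, model_col]).size()
    # Series with a (brand, model) MultiIndex, largest segments first.
    return pair_counts[pair_counts >= min_samples].sort_values(ascending=False)


# Toy demonstration (not the project data).
toy = pd.DataFrame({
    "brand": ["vw"] * 5 + ["skoda"] * 3 + ["kia"] * 4,
    "model": ["golf"] * 4 + ["up"] + ["octavia"] * 3 + ["rio"] * 4,
})
candidate_pairs = select_candidate_pairs(
    toy, "brand", "model", brands=["vw", "skoda"], min_samples=3
)

# .items() yields ((brand, model), count), the shape the evaluation loop expects.
for (brand, model_name), cnt in candidate_pairs.items():
    print(brand, model_name, cnt)
```

In the notebook this would be called with `X_train`, `brand_col`, an assumed `model_col = "model"` and `candidate_brands`; the minimum sample threshold is a tuning choice and is not stated in this section.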

In [None]:
# Evaluation helpers for brand–model segments
from copy import deepcopy


def eval_global_for_brand_model(model, X, y, brand_col, model_col,
                                brand, model_name, splits):
    """
    Evaluate the global pipeline for a specific (brand, model) pair.

    In each fold:
      - Train on all cars.
      - Compute MAE / RMSE only on validation rows where
        X[brand_col] == brand AND X[model_col] == model_name.
    """
    maes, rmses = [], []
    n_obs = 0

    for train_idx, val_idx in splits:
        X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Train global model on ALL brands and models.
        # Fit a deep copy so the fitted pipeline passed in is not overwritten.
        fold_model = deepcopy(model)
        fold_model.fit(X_tr, y_tr)
        y_pred = fold_model.predict(X_val)

        # Restrict to this brand–model in validation
        mask_seg = (
            (X_val[brand_col] == brand) &
            (X_val[model_col] == model_name)
        )
        if mask_seg.sum() == 0:
            continue

        y_val_seg = y_val[mask_seg]
        y_pred_seg = y_pred[mask_seg]

        mae = mean_absolute_error(y_val_seg, y_pred_seg)
        mse = mean_squared_error(y_val_seg, y_pred_seg)
        rmse = float(np.sqrt(mse))

        maes.append(mae)
        rmses.append(rmse)
        n_obs += mask_seg.sum()

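    # Return NaN if the segment never appeared in any validation fold.
    return {
        "MAE_mean": float(np.mean(maes)) if maes else np.nan,
        "MAE_std":  float(np.std(maes))  if maes else np.nan,
        "RMSE_mean": float(np.mean(rmses)) if rmses else np.nan,
        "RMSE_std":  float(np.std(rmses))  if rmses else np.nan,
        "n": int(n_obs),
    }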


def eval_brand_model_specific(pipe_global, X, y, brand_col, model_col,
                              brand, model_name, splits,
                              min_train_per_fold=40):
    """
    Evaluate a brand–model-specific regressor.

    Preprocessing + SHAP selection stay fixed (from pipe_global).
    In each fold:
      - Keep only rows with this (brand, model).
      - Transform them with the fixed preprocessor.
      - Fit a fresh regressor only on this segment.
      - Evaluate on validation rows of the same segment.
    """
    preproc = pipe_global[:-1]
    base_reg = pipe_global[-1]

    maes, rmses = [], []
    n_obs = 0

    for train_idx, val_idx in splits:
        X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Restrict to this brand–model in train/val
        mask_tr = (
            (X_tr[brand_col] == brand) &
            (X_tr[model_col] == model_name)
        )
        mask_val = (
            (X_val[brand_col] == brand) &
            (X_val[model_col] == model_name)
        )

        if mask_val.sum() == 0:
            continue
        if mask_tr.sum() < min_train_per_fold:
            continue

        X_tr_seg, y_tr_seg = X_tr[mask_tr], y_tr[mask_tr]
        X_val_seg, y_val_seg = X_val[mask_val], y_val[mask_val]

        # Transform with fixed preprocessor
        X_tr_seg_proc = preproc.transform(X_tr_seg)
        X_val_seg_proc = preproc.transform(X_val_seg)

        # Fresh regressor for this fold
        reg = clone(base_reg)
        reg.fit(X_tr_seg_proc, y_tr_seg)
        y_pred_seg = reg.predict(X_val_seg_proc)

        mae = mean_absolute_error(y_val_seg, y_pred_seg)
        mse = mean_squared_error(y_val_seg, y_pred_seg)
        rmse = float(np.sqrt(mse))

        maes.append(mae)
        rmses.append(rmse)
        n_obs += len(y_val_seg)

    return {
        "MAE_mean": float(np.mean(maes)) if maes else np.nan,
        "MAE_std":  float(np.std(maes))  if maes else np.nan,
        "RMSE_mean": float(np.mean(rmses)) if rmses else np.nan,
        "RMSE_std":  float(np.std(rmses))  if rmses else np.nan,
        "n": int(n_obs),
    }


In [None]:
# Run evaluation for each (brand, model) pair

bm_global_results = []
bm_specific_results = []

for (brand, model_name), cnt in candidate_pairs.items():
    print(f"Evaluating pair: {brand} / {model_name} (n={cnt})")

    # Global model on this brand–model segment
    res_g = eval_global_for_brand_model(
        pipe_global,
        X_train,
        y_train,
        brand_col=brand_col,
        model_col=model_col,
        brand=brand,
        model_name=model_name,
        splits=splits,
    )
    res_g.update({
        "brand": brand,
        "model": model_name,
        "segment_type": "global",
    })
    bm_global_results.append(res_g)

    # Brand–model-specific regressor
    res_bm = eval_brand_model_specific(
        pipe_global,
        X_train,
        y_train,
        brand_col=brand_col,
        model_col=model_col,
        brand=brand,
        model_name=model_name,
        splits=splits,
    )
    res_bm.update({
        "brand": brand,
        "model": model_name,
        "segment_type": "brand_model_specific",
    })
    bm_specific_results.append(res_bm)

df_bm_global = pd.DataFrame(bm_global_results)
df_bm_spec   = pd.DataFrame(bm_specific_results)

print("\nGlobal results per (brand, model):")
display(df_bm_global)

print("\nBrand–model-specific results:")
display(df_bm_spec)


In [None]:
# Compare global vs brand–model-specific performance

# Drop failed / NaN segments
df_bm_global = df_bm_global.dropna(subset=["MAE_mean", "RMSE_mean"])
df_bm_spec   = df_bm_spec.dropna(subset=["MAE_mean", "RMSE_mean"])

df_bm_compare = df_bm_global.merge(
    df_bm_spec,
    on=["brand", "model"],
    suffixes=("_global", "_bm"),
)

df_bm_compare["delta_MAE"]  = df_bm_compare["MAE_mean_bm"]  - df_bm_compare["MAE_mean_global"]
df_bm_compare["delta_RMSE"] = df_bm_compare["RMSE_mean_bm"] - df_bm_compare["RMSE_mean_global"]

df_bm_sorted = df_bm_compare.sort_values("delta_MAE")

print("Brand–model comparison (most improvement first):")
display(df_bm_sorted)

# Optional readable table
bm_display_cols = [
    "brand", "model",
    "MAE_mean_global", "MAE_mean_bm", "delta_MAE",
    "RMSE_mean_global", "RMSE_mean_bm", "delta_RMSE",
    "n_global",
]
df_bm_display = (
    df_bm_sorted[bm_display_cols]
    .copy()
    .rename(columns={"n_global": "n_samples"})
)

for c in df_bm_display.columns:
    if "MAE" in c or "RMSE" in c or "delta" in c:
        df_bm_display[c] = df_bm_display[c].round(1)

print("\nReadable brand–model summary:")
display(df_bm_display)


In [None]:
# Extra plots for brand–model specialization

# Focus on segments with at least 100 samples for more stable numbers
df_bm_plot = df_bm_display[df_bm_display["n_samples"] >= 100]

# Sort by delta_MAE (most improvement first)
df_bm_plot = df_bm_plot.sort_values("delta_MAE")

# 1) Bar plot of ΔMAE for brand–model segments (filtered)
plt.figure(figsize=(10, 4))
x = np.arange(len(df_bm_plot))
plt.bar(x, df_bm_plot["delta_MAE"])
plt.axhline(0, linestyle="--")
plt.xticks(x, [f"{b} {m}" for b, m in zip(df_bm_plot["brand"], df_bm_plot["model"])],
           rotation=90, ha="right")
plt.ylabel("Δ MAE (brand–model - global)")
plt.title("Effect of brand–model specialization\n(negative = lower MAE than global)")
plt.tight_layout()
plt.show()

# 2) Scatter plot: n_samples vs ΔMAE to visualise overfitting at low sample sizes
plt.figure(figsize=(6, 4))
plt.scatter(df_bm_display["n_samples"], df_bm_display["delta_MAE"])
plt.axhline(0, linestyle="--")
plt.xlabel("Number of samples per (brand, model)")
plt.ylabel("Δ MAE (brand–model - global)")
plt.title("ΔMAE vs segment size\n(negative = brand–model-specific is better)")
plt.tight_layout()
plt.show()


### 9. Ablation study

In [None]:
# TODO: run e.g. 3 otherwise identical pipelines in parallel with different scalers to see the difference (input from lab)
# TODO: compare different encodings: OHE, target, label, frequency (the frequency encoder is Ricardo's favourite)
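
A possible starting point for the scaler ablation is sketched below. It is only an illustration on synthetic data: the `Ridge` regressor, the choice of three scalers and the dataset from `make_regression` are assumptions, and the real study would swap the scaler inside the project's own pipeline while keeping everything else identical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in data.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Same folds for every variant so that only the scaler differs.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scalers = {
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "robust": RobustScaler(),
}

ablation = {}
for name, scaler in scalers.items():
    pipe = make_pipeline(scaler, Ridge(alpha=1.0))
    scores = cross_val_score(
        pipe, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
    )
    ablation[name] = -scores.mean()  # convert back to a positive MAE

for name, mae in ablation.items():
    print(f"{name:>8}: MAE = {mae:.2f}")
```

The same loop structure would apply to the encoder comparison: keep the folds and the regressor fixed and vary only the encoding step.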