**Your work will be evaluated according to the following criteria:**
- Project Structure and Notebook(s) Quality (4/20)
- Data Exploration & Initial Preprocessing (4/20)
- Regression Benchmarking and Optimization (7/20)
- Open-Ended Section (4/20)
- Deployment (1/20)
- Extra Point: Have Project Be Publicly Available on GitHub (1/20)


**Project Timeline**
- 22.11.: Preprocessing and Model Preparation
    - Finish clean preprocessing, with all steps included in the pipeline
    - Finish clean hyperparameter tuning
- 29.11.: Feature Selection
    - Clean and structured approach to feature selection for all models (ideally one consistent approach)
- 29.11.: Regression Benchmarking and Optimization
    - Automate optimization (add something like MLflow)
    - Show and explain the results of the different models clearly, as in the lab lectures
- 06.12.: Open-Ended Section and Deployment
    - Add 4 open-ended experiments
    - Deployment
- 13.12.: Notebook Polish
    - Very clean notebook structure, similar to the lab notebooks by Ricardo
- 14.12.: Submission

In [None]:
# TODO Brainstorm and implement ideas for the open-ended section (several can be explored):
#   1. Create a classification model that predicts whether a car will be a price outlier (outlier flag) (SAMUEL)
#   2. Jan:
#       - Use multiple encoding approaches to see how the models behave (e.g. target encoding AND OHE for the same categorical variable)
#   3. Elias: 
#   4. Lukas: 

<div style="
    background: rgba(25, 25, 25, 0.55);
    backdrop-filter: blur(16px) saturate(150%);
    -webkit-backdrop-filter: blur(16px) saturate(150%);
    border: 1px solid rgba(255, 255, 255, 0.12);
    border-radius: 18px;
    padding: 45px 30px;
    text-align: center;
    font-family: 'Inter', 'Segoe UI', 'Helvetica Neue', Arial, sans-serif;
    color: #e0e0e0;
    box-shadow: 0 0 30px rgba(0, 0, 0, 0.35);
    margin: 40px auto;
    max-width: 800px;
">

  <h1 style="
      font-size: 2.8em;
      font-weight: 700;
      margin: 0 0 8px 0;
      letter-spacing: -0.02em;
      background: linear-gradient(90deg, #00e0ff, #9c7eff);
      -webkit-background-clip: text;
      -webkit-text-fill-color: transparent;
  ">
      Machine Learning Project
  </h1>

  <h2 style="
      font-size: 1.6em;
      font-weight: 500;
      margin: 0 0 25px 0;
      color: #b0b0b0;
      letter-spacing: 0.5px;
  ">
      Cars 4 You - Predicting Car Prices
  </h2>

  <p style="
      font-size: 1.25em;
      font-weight: 500;
      color: #c0c0c0;
      margin-bottom: 6px;
  ">
      Group 5 - Lukas Belser, Samuel Braun, Elias Karle, Jan Thier
  </p>

  <p style="
      font-size: 1.05em;
      font-weight: 400;
      color: #8a8a8a;
      font-style: italic;
      letter-spacing: 0.5px;
  ">
      Machine Learning End Results · 22.12.2025
  </p>
</div>


## **Table of Contents**
 
- [1. Import Packages and Data](#1-import-packages-and-data)  
  - [1.1 Import Required Packages](#11-import-required-packages)  
  - [1.2 Load Datasets](#12-load-datasets)  
  - [1.3 Kaggle Setup](#13-kaggle-setup)  
- [2. Data Cleaning, Feature Engineering, Split & Preprocessing](#2-data-cleaning-feature-engineering-split--preprocessing)  
  - [2.1 Data Cleaning](#21-data-cleaning)  
  - [2.2 Feature Engineering](#22-feature-engineering)  
  - [2.3 Data Split](#23-data-split)  
  - [2.4 Preprocessing](#24-preprocessing)  
- [3. Feature Selection](#3-feature-selection)  
- [4. Model Evaluation Metrics, Baselining, Setup](#4-model-evaluation-metrics-baselining-setup)  
- [5. Hyperparameter Tuning and Model Evaluation](#5-hyperparameter-tuning-and-model-evaluation)  
  - [5.1 ElasticNet](#51-elasticnet)  
  - [5.2 HistGradientBoost](#52-histgradientboost)  
  - [5.3 RandomForest](#53-randomforest)  
  - [5.4 ExtraTrees](#54-extratrees)  
- [6. Feature Importance of Tree Models (with SHAP)](#6-feature-importance-of-tree-models-with-shap)  
  - [6.1 HGB](#61-hgb)  
  - [6.2 RF](#62-rf)  
- [7. Kaggle Competition](#7-kaggle-competition)  

TODO: finish and update the table of contents at the end of the project

### 1. Import Packages and Data

#### 1.1 Import Required Packages

In [None]:
%pip install kaggle

In [None]:
%pip install shap

In [None]:
%pip install -U scikit-learn

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import joblib
import shap

from sklearn.feature_selection import VarianceThreshold, chi2, RFE
from scipy.stats import spearmanr, uniform, randint
from sklearn.metrics import mean_absolute_error
 
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.base import clone
 
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold, GridSearchCV, cross_validate
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor, StackingRegressor
from sklearn.svm import SVR

 
from car_functions import GroupMedianImputer, clean_car_dataframe, cv_target_encode, print_metrics # add_price_anchor_features

#### 1.2 Load Datasets

In [None]:
df_cars_train = pd.read_csv("train.csv")
df_cars_test = pd.read_csv("test.csv")

#### 1.3 Kaggle Setup

In [None]:
# Kaggle API connection: every team member has to set this up individually with their own kaggle.json API token

# Folder containing kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = "/Workspace/Users/20250355@novaims.unl.pt"

# Test
!echo $KAGGLE_CONFIG_DIR

### 2. Data Cleaning, Feature Engineering, Split & Preprocessing

Task II (5 Points): Clean and preprocess the dataset 
- Missing Value handling, Outlier preprocessing + justify decisions -> in data_cleaning.py
- Review current features and create extra features if needed + explain -> in Feature Engineering
- Deal with categorical variables -> in One-Hot-Encoding 
- Perform data scaling, explain reasoning -> in Transforming

#### 2.1 Data Cleaning

In [None]:
# Outlier preprocessing and missing value handling happen here; the decisions are justified in clean_car_dataframe
df_cars_train = clean_car_dataframe(df_cars_train)
df_cars_test = clean_car_dataframe(df_cars_test)


# Sanity check: print the unique values of every column in df_cars_train and df_cars_test to verify the cleaning and spot remaining odd values
for col in df_cars_train.columns:
    print(col, df_cars_train[col].unique())
print("X"*150)
print("-"*150)
print("X"*150)
for col in df_cars_test.columns:
    print(col, df_cars_test[col].unique())

#### 2.2 Feature Engineering

**Base Feature Creation**

These are foundational features derived directly from the original data, often to create linear relationships or capture interactions.
- `age`: Calculated as `2020 - year`. Creates a simple linear feature representing the car's age. Newer cars (lower age) generally have higher prices.
- `miles_per_year`: Calculated as `mileage / age`; for cars with `age = 0` the raw `mileage` is used to avoid division by zero. This normalizes the car's usage and is intended to reduce the correlation (multicollinearity) between `mileage` and `age`. A 3-year-old car with 60,000 miles is different from a 6-year-old car with 60,000 miles.
- `age_x_engine`: An interaction term `age * engineSize`. This helps the model capture non-linear relationships, such as the possibility that the value of cars with large engines might depreciate faster (or slower) than cars with small engines.
- `mpg_x_engine`: An interaction term `mpg * engineSize`. This captures the combined effect of fuel efficiency and engine size.
- `tax_per_engine`: Calculated as `tax / engineSize` (falling back to `tax` when `engineSize` is 0). This feature represents the tax cost relative to the engine size, which could be an indicator of overall running costs or vehicle class.
- `mpg_per_engine`: Calculated as `mpg / engineSize`. This creates an "efficiency" metric, representing how many miles per gallon the car achieves for each unit of engine size.
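
The ratio features above share one pattern: divide two columns and fall back to the numerator when the denominator is zero. A minimal sketch of that pattern on made-up toy data follows; the helper name `safe_ratio` is hypothetical and not part of `car_functions.py`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
cars = pd.DataFrame({
    "year": [2017, 2014, 2020],
    "mileage": [60000, 60000, 1500],
    "engineSize": [2.0, 1.0, 0.0],
    "tax": [150, 30, 145],
})

REFERENCE_YEAR = 2020  # same reference year as in the feature description
cars["age"] = REFERENCE_YEAR - cars["year"]


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns; where the denominator is 0 or missing, use the numerator."""
    ratio = numerator / denominator.replace({0: np.nan})
    return ratio.fillna(numerator)


cars["miles_per_year"] = safe_ratio(cars["mileage"], cars["age"])
cars["tax_per_engine"] = safe_ratio(cars["tax"], cars["engineSize"])
print(cars[["age", "miles_per_year", "tax_per_engine"]])
```

The first two rows show the point made above: both cars have 60,000 miles, but their `miles_per_year` differs because their ages differ.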


**Popularity & Demand Features**

These features attempt to quantify a car's popularity or market demand, which directly influences price.
- `model_freq`: Calculates the frequency (percentage) of each `model` in the training dataset. Popular, common models often have more stable and predictable pricing and demand.


**Price Anchor Features**

These features "anchor" a car's price relative to its group. They provide a strong baseline price signal based on brand, model, and configuration.
- `brand_med_price`: The median price for the car's `Brand` (e.g., the typical price for a BMW vs. a Skoda). This captures overall brand positioning.
- `model_med_price`: The median price for the car's `model` (e.g., the typical price for a 3-Series vs. a 1-Series). This captures the model's positioning within the brand.
- `brand_fuel_med_price`: The median price for the car's specific `Brand` and `fuelType` combination (e.g., a Diesel BMW vs. a Petrol BMW).
- `brand_trans_med_price`: The median price for the `Brand` and `transmission` combination (e.g., an Automatic BMW vs. a Manual BMW).
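
As a rough illustration of how such anchors can be built, the sketch below learns group medians on a toy training frame and maps them onto a toy test frame. The data is made up, and a `merge` is used for the two-column group as one possible alternative to mapping tuples.

```python
import pandas as pd

# Toy data, made up for illustration only
train = pd.DataFrame({
    "Brand": ["BMW", "BMW", "BMW", "Skoda", "Skoda"],
    "fuelType": ["Diesel", "Diesel", "Petrol", "Petrol", "Petrol"],
    "price": [30000, 20000, 26000, 9000, 11000],
})
test = pd.DataFrame({
    "Brand": ["BMW", "Skoda", "Audi"],
    "fuelType": ["Petrol", "Petrol", "Diesel"],
})

# Medians are learned on the training data only
brand_median = train.groupby("Brand")["price"].median()
brand_fuel_median = (
    train.groupby(["Brand", "fuelType"])["price"]
    .median()
    .rename("brand_fuel_med_price")
    .reset_index()
)

# Single-column group: map the lookup table onto the test rows
test["brand_med_price"] = test["Brand"].map(brand_median)
# Two-column group: a left merge keeps all test rows in their original order
test = test.merge(brand_fuel_median, on=["Brand", "fuelType"], how="left")
print(test)
```

Groups that never occur in the training data (here the Audi row) end up as `NaN` and need to be imputed later in the pipeline.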


**Normalized & Relative Features**

These features compare a car to its peers rather than using absolute values.
- `*_anchor` (e.g., `brand_med_price_anchor`): Created by dividing the median price features (see "Price Anchor Features" above) by the `overall_mean_price`. This makes the feature dimensionless and represents the group's price *relative* to the entire market (e.g., "this brand is 1.5x the market average").
- `age_rel_brand`: Calculated as `age - brand_median_age`. This shows if a car is newer or older than the *typical* car for that specific brand, capturing relative age within its own group.


**CV-Safe Target Encodings**

This is an advanced technique to encode categorical variables (like `model` or `Brand`) using information from the target variable (`price`) without causing data leakage.
- `*_te` (e.g., `model_te`): Represents the *average price* for that category (e.g., the average price for a "Fiesta").
- **Why is it "CV-Safe"?** Instead of just calculating the global average price for "Fiesta" and applying it to all rows (which leaks target information), this method uses K-Fold cross-validation. For each fold of the data, the target encoding is calculated *only* from the *other* folds. This ensures the encoding for any given row never includes its own price, preventing leakage and leading to a more robust model.
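
The out-of-fold idea can be sketched as follows. This is not the implementation of `cv_target_encode` in `car_functions.py`; the helper names `smoothed_means` and `oof_target_encode` are hypothetical, and treating `m` as an additive smoothing weight towards the overall mean is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def smoothed_means(frame, col, ycol, m):
    """Per-category mean of ycol, shrunk towards the frame's overall mean with weight m."""
    prior = frame[ycol].mean()
    stats = frame.groupby(col)[ycol].agg(["sum", "count"])
    means = (stats["sum"] + m * prior) / (stats["count"] + m)
    return means, prior


def oof_target_encode(train, test, col, ycol, m=10, n_splits=5, seed=42):
    """Out-of-fold target encoding: each training row is encoded from the other folds only."""
    train_enc = pd.Series(np.nan, index=train.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(train):
        means, prior = smoothed_means(train.iloc[fit_idx], col, ycol, m)
        # Categories unseen in the fitting folds fall back to the prior
        encoded = train[col].iloc[enc_idx].map(means).fillna(prior)
        train_enc.iloc[enc_idx] = encoded.to_numpy()

    # Test rows are encoded with statistics from the full training set
    means, prior = smoothed_means(train, col, ycol, m)
    test_enc = test[col].map(means).fillna(prior)
    return train_enc, test_enc


# Toy data, made up for illustration only
train = pd.DataFrame({
    "model": ["Fiesta"] * 4 + ["Golf"] * 4,
    "price": [8000, 9000, 10000, 11000, 15000, 16000, 17000, 18000],
})
test = pd.DataFrame({"model": ["Fiesta", "Golf", "Polo"]})

train_te, test_te = oof_target_encode(train, test, "model", "price", m=2, n_splits=4)
print(train_te.round(1).tolist())
print(test_te.round(1).tolist())
```

Because a row is never part of the folds used to encode it, changing that row's own price does not change its encoded value. This is the leakage protection described above.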

In [None]:
# TODO split this into real "feature engineering" and an additional "encoding" section ~J

# 1. Base Feature Creation

# Car Age: Newer cars usually have higher prices, models prefer linear features
df_cars_train['age'] = 2020 - df_cars_train['year']
df_cars_test['age']  = 2020 - df_cars_test['year']

# Miles per Year: Normalizes mileage by age (solves multicollinearity between year and mileage)
df_cars_train['miles_per_year'] = df_cars_train['mileage'] / df_cars_train['age'].replace({0: np.nan})
df_cars_train['miles_per_year'] = df_cars_train['miles_per_year'].fillna(df_cars_train['mileage'])

df_cars_test['miles_per_year'] = df_cars_test['mileage'] / df_cars_test['age'].replace({0: np.nan})
df_cars_test['miles_per_year'] = df_cars_test['miles_per_year'].fillna(df_cars_test['mileage'])

# Interaction Terms: Capture non-linear effects between engine and other numeric features
df_cars_train['age_x_engine'] = df_cars_train['age'] * df_cars_train['engineSize'].fillna(0)
df_cars_test['age_x_engine']  = df_cars_test['age']  * df_cars_test['engineSize'].fillna(0)

df_cars_train['mpg_x_engine'] = df_cars_train['mpg'].fillna(0) * df_cars_train['engineSize'].fillna(0)
df_cars_test['mpg_x_engine']  = df_cars_test['mpg'].fillna(0)  * df_cars_test['engineSize'].fillna(0)

# tax per engine
df_cars_train['tax_per_engine'] = df_cars_train['tax'] / df_cars_train['engineSize'].replace({0: np.nan})
df_cars_train['tax_per_engine'] = df_cars_train['tax_per_engine'].fillna(df_cars_train['tax'])

df_cars_test['tax_per_engine'] = df_cars_test['tax'] / df_cars_test['engineSize'].replace({0: np.nan})
df_cars_test['tax_per_engine'] = df_cars_test['tax_per_engine'].fillna(df_cars_test['tax'])

# MPG per engineSize to represent the efficiency
df_cars_train['mpg_per_engine'] = df_cars_train['mpg'] / df_cars_train['engineSize'].replace({0: np.nan})
df_cars_train['mpg_per_engine'] = df_cars_train['mpg_per_engine'].fillna(df_cars_train['mpg'])

df_cars_test['mpg_per_engine'] = df_cars_test['mpg'] / df_cars_test['engineSize'].replace({0: np.nan})
df_cars_test['mpg_per_engine'] = df_cars_test['mpg_per_engine'].fillna(df_cars_test['mpg'])


# 2. Model Frequency: Popular models tend to have stable demand and prices
model_freq = df_cars_train['model'].value_counts(normalize=True).to_dict()

df_cars_train['model_freq'] = df_cars_train['model'].map(model_freq).fillna(0.0)
df_cars_test['model_freq']  = df_cars_test['model'].map(model_freq).fillna(0.0)


# 3. Brand and Model Anchors: Represent typical price levels (positioning)
overall_mean_price = df_cars_train['price'].mean()

# Brand median price: captures brand positioning (e.g., BMW > Skoda)
brand_median_price = df_cars_train.groupby('Brand')['price'].median().to_dict()
df_cars_train['brand_med_price'] = df_cars_train['Brand'].map(brand_median_price)
df_cars_test['brand_med_price']  = df_cars_test['Brand'].map(brand_median_price)

# Model median price: captures model hierarchy within brand (e.g., 3er > 1er)
model_median_price = df_cars_train.groupby('model')['price'].median().to_dict()
df_cars_train['model_med_price'] = df_cars_train['model'].map(model_median_price)
df_cars_test['model_med_price']  = df_cars_test['model'].map(model_median_price)

# Brand × Fuel median price: different fuels have different price segments
brand_fuel_median_price = df_cars_train.groupby(['Brand','fuelType'])['price'].median().to_dict()
df_cars_train['brand_fuel_med_price'] = list(zip(df_cars_train['Brand'], df_cars_train['fuelType']))
df_cars_train['brand_fuel_med_price'] = df_cars_train['brand_fuel_med_price'].map(brand_fuel_median_price)
df_cars_test['brand_fuel_med_price']  = list(zip(df_cars_test['Brand'], df_cars_test['fuelType']))
df_cars_test['brand_fuel_med_price']  = df_cars_test['brand_fuel_med_price'].map(brand_fuel_median_price)

# Brand × Transmission median price: automatic or manual may influence resale
brand_trans_median_price = df_cars_train.groupby(['Brand','transmission'])['price'].median().to_dict()
df_cars_train['brand_trans_med_price'] = list(zip(df_cars_train['Brand'], df_cars_train['transmission']))
df_cars_train['brand_trans_med_price'] = df_cars_train['brand_trans_med_price'].map(brand_trans_median_price)
df_cars_test['brand_trans_med_price']  = list(zip(df_cars_test['Brand'], df_cars_test['transmission']))
df_cars_test['brand_trans_med_price']  = df_cars_test['brand_trans_med_price'].map(brand_trans_median_price)


# 4. Normalized Anchors (dimensionless): relative position vs overall mean
for col in ['brand_med_price','model_med_price','brand_fuel_med_price','brand_trans_med_price']:
    df_cars_train[f'{col}_anchor'] = df_cars_train[col] / overall_mean_price
    df_cars_test[f'{col}_anchor']  = df_cars_test[col]  / overall_mean_price


# 5. Relative Age (within brand): newer/older than the brand's median age
brand_median_age = df_cars_train.groupby('Brand')['age'].median().to_dict()

df_cars_train['age_rel_brand'] = df_cars_train['age'] - df_cars_train['Brand'].map(brand_median_age)
df_cars_test['age_rel_brand']  = df_cars_test['age']  - df_cars_test['Brand'].map(brand_median_age)


# 6. CV-Safe Target Encoding on categorical columns
# TODO: check whether this mean target encoding is still necessary given features such as
#       brand_med_price_anchor, which use the median instead of the mean
for col, m in [('model', 100), ('Brand', 30), ('fuelType', 20), ('transmission', 20)]:
    tr_enc, te_enc = cv_target_encode(df_cars_train, df_cars_test, col, ycol='price', m=m)
    df_cars_train[f'{col}_te'] = tr_enc
    df_cars_test[f'{col}_te']  = te_enc

In [None]:
df_cars_train.columns

#### 2.3 Data Split

In [None]:
# Separate features and target. Stratification is not necessary because this is a regression problem and we use cross-validation later
X_train = df_cars_train.drop(columns='price')
y_train = df_cars_train['price']

# X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42, shuffle = True)
# ==> Since we have an external hold-out set (kaggle) an additional val set is not necessary and wastes training data

#### 2.4 Preprocessing

In [None]:
def to_float_array(x):
    """Convert input to float array."""
    return np.array(x, dtype=float)

TODO: Explain why we split into original and engineered features here

In [None]:
# PIPELINE WITH preprocessor_orig CONTAINING ONLY ORIGINAL FEATURES


orig_numeric_features = [
    "year", "mileage", "tax", "mpg", "engineSize", "paintQuality", "previousOwners"
]
orig_categorical_features = ["Brand", "model", "transmission", "fuelType"]

numeric_transformer_orig = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),    # simple global median imputation
    ("to_float", FunctionTransformer(to_float_array)),
    ("scaler", StandardScaler())
])

categorical_transformer_orig = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill by mode instead of Unknown
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))  # One-hot encoding
    # TODO use target-encoder here and remove manual target encoding from feature engineering ~J
])

preprocessor_orig = ColumnTransformer([
    ("num", numeric_transformer_orig, orig_numeric_features),
    ("cat", categorical_transformer_orig, orig_categorical_features)
])

preprocessor_orig.fit(X_train)

In [None]:
# PIPELINE WITH preprocessor_fe CONTAINING ENGINEERED FEATURES

# Custom Written GroupMedianImputer to get Brand and Model specific Medians
# Uses target-encoded Brand_te/model_te as grouping cols (present in numeric_features)
group_imputer = GroupMedianImputer(group_cols=["Brand_te", "model_te"])

numeric_features = [
    "age", "age_rel_brand", "tax", "mpg", "engineSize", "paintQuality", "previousOwners", "model_freq",
    "brand_med_price_anchor", "model_med_price_anchor", "brand_fuel_med_price_anchor", "brand_trans_med_price_anchor",
    "age_x_engine", "mpg_x_engine",
    "model_te", "Brand_te", # "fuelType_te", "transmission_te",
    "tax_per_engine", "mpg_per_engine"
]
log_features = ["mileage", "miles_per_year"]  # TODO other num columns here better?!
categorical_features = ["transmission", "fuelType"]  # previously: ["Brand", "model", "transmission", "fuelType"]
# TODO use target-encoder here and remove manual target encoding from feature engineering ~J
# TODO play around with some OHE (e.g. fuelType or transmission)

# left out columns: year (age is better), hasDamage (unsure what the two values 0 and NaN mean)

numeric_transformer_fe = Pipeline([
    ("group_impute", GroupMedianImputer(group_cols=["Brand_te", "model_te"])),
    ("to_float", FunctionTransformer(to_float_array)),
    ("scaler", StandardScaler())
])

log_transformer_fe = Pipeline([
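    # Hierarchical imputation on Brand_te/model_te, then log-transform
    # NOTE: ColumnTransformer passes only log_features to this pipeline, so Brand_te/model_te are not
    # among its input columns; check that GroupMedianImputer falls back to the global median in that case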
    ("group_impute", GroupMedianImputer(group_cols=["Brand_te", "model_te"])),
    ("to_float", FunctionTransformer(to_float_array)),
    ("log", FunctionTransformer(np.log1p, validate=False)),  # log1p handles zeros safely
    ("scaler", StandardScaler())
])

categorical_transformer_fe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill by mode instead of Unknown
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))  # One-hot encoding
])

# ColumnTransformer that uses all engineered features
preprocessor_fe = ColumnTransformer([
    ("log", log_transformer_fe, log_features),
    ("num", numeric_transformer_fe, numeric_features),
    ("cat", categorical_transformer_fe, categorical_features)
])

preprocessor_fe.fit(X_train) # Fit here already to have scaled data for feature selection later

# EXPLANATIONS:
# 1) Pipeline bundles preprocessing + model training:
#       > Ensures all preprocessing happens inside cross-validation folds (no data leakage)
#       > Keeps the entire workflow reproducible — scaling, encoding, and modeling are learned together
#       > After .fit(), the final model automatically knows how to preprocess new unseen data
#       > When saving with joblib, the entire preprocessing (imputers, scalers, encoders) and model are stored together

# 2) The ColumnTransformer applies different transformations to subsets of features:
#       > Numeric Features are handled by our custom GroupMedianImputer (domain-aware filling)
#           - Missing numeric values are imputed hierarchically:
#           1. By (Brand, model)
#           2. If missing model by Brand
#           3. If missing Brand by global median
#       > This approach captures brand/model-level patterns (e.g. BMWs have similar engine sizes)
#       > After imputation, StandardScaler standardizes all numeric features
#
#       > Log Features use the same group-median imputation, followed by log1p() transformation
#           - log1p() compresses large, right-skewed values (here: mileage and miles_per_year), stabilizing variance and helping linear models perform better
#           - StandardScaler then scales them to zero mean and unit variance
#
#       > Categorical Features are handled by SimpleImputer + OneHotEncoder
#           - SimpleImputer fills missing categorical values with the most frequent (mode) value of the whole column.
#             (Alternative would be "Unknown", but the mode keeps categories realistic. Note that this is the global
#             mode, not the mode per Brand or model.)
#           - OneHotEncoder converts each categorical label (here: transmission, fuelType) into binary dummy variables
#             This lets the model use category information numerically without implying order
#
# 3) Overall:
#       > The pipeline ensures consistent preprocessing across training, validation, and test data.
#       > It combines domain knowledge (brand/model-aware imputation) with numerical scaling.
#       > Linear models (ElasticNet, Ridge, Lasso) and tree models (HistGradientBoosting, RandomForest)
#           can now learn from the same standardized, clean, and information-rich feature space.

### 3. Feature Selection

Task III (3 Points): Define and Implement a clear and unambiguous strategy for feature selection. Use the methods **discussed in the course**. Present and justify your final selection 

Model-independent filter methods:
- Remove constant numerical variables identified by VarianceThreshold
- Check the Spearman correlation of numerical variables with the target and deselect features with low correlation (|ρ| ≤ 0.05)
- Chi² test for categorical variables (currently not applied: Chi² tests the relationship to a categorical target and does not suit the continuous target `price`)

Model-dependent wrapper methods:
- RFE with linear regression / SVR for the linear and kernel models: ElasticNet, SVR
- Feature importance for tree models: RandomForest, HistGradientBoosting (see section 6, Feature Importance of Tree Models (with SHAP))
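
The two filter steps that are applied can be sketched on synthetic data as below. The features `age`, `noise` and `constant` and the price formula are made up for illustration; only the thresholds mirror the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.feature_selection import VarianceThreshold

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
n = 20_000
age = rng.uniform(0, 15, n)
features = pd.DataFrame({
    "age": age,                    # strongly (monotonically) related to price
    "noise": rng.normal(size=n),   # unrelated to price
    "constant": np.ones(n),        # zero variance
})
price = 30_000 * np.exp(-0.15 * age) + rng.normal(0, 500, n)

# Step 1: drop features with zero variance
vt = VarianceThreshold(threshold=0.0).fit(features)
constant_cols = features.columns[~vt.get_support()].tolist()

# Step 2: flag features whose absolute Spearman correlation with the target is at most 0.05
low_corr_cols = []
for col in features.columns.difference(constant_cols):
    rho, _ = spearmanr(features[col], price)
    if abs(rho) <= 0.05:
        low_corr_cols.append(col)

print("constant:", constant_cols)
print("low correlation:", low_corr_cols)
```

Spearman correlation is rank-based, so the non-linear but monotonic relation between `age` and `price` is still picked up as a strong signal.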


#### 3.1 Relevancy Analysis

In [None]:
X_train_proc_before_fs = preprocessor_fe.transform(X_train)

feature_names_all = [] # TODO better and more readable way to get feature names? ~J
for name, trans, cols in preprocessor_fe.transformers_:
    if name != 'remainder':
        if hasattr(trans, 'get_feature_names_out'):
            # for categorical OHE
            try:
                feature_names_all.extend(trans.get_feature_names_out(cols))
            except Exception:  # fall back to the raw column names if the transformer cannot provide names
                feature_names_all.extend(cols)
        else:
            feature_names_all.extend(cols)
print(f"All feature names after preprocessing: {feature_names_all}")

X_df = pd.DataFrame(X_train_proc_before_fs, columns=feature_names_all)


# Variance Threshold (simple preliminary filter before applying more complex methods)
vt = VarianceThreshold(threshold=0.0) # TODO try different thresholds (e.g. 0.01) to capture quasi-constant features and evaluate impact on performance ~J
vt.fit(X_df)
vt_deselect = [f for f, keep in zip(feature_names_all, vt.get_support()) if not keep]
print("Features to deselect according to VarianceThreshold:", vt_deselect)


# Spearman correlation (numeric + log only) to identify features with low correlation to target
numeric_log = numeric_features + log_features
spearman_deselect = []
for f in numeric_log:
    if f in X_df.columns: # should always be the case: just to be safe to prevent the code from crashing
        corr, _ = spearmanr(X_df[f], y_train)
        if abs(corr) <= 0.05:
            spearman_deselect.append(f)
print("Features to deselect according to Spearman correlation:", spearman_deselect)


# # Chi2 (categorical only, must be non-negative) # TODO remove because Chi2 does not make sense to find relationships to continuous target ~J
# cat_cols = [c for c in X_df.columns if c not in numeric_log]
# X_cat = X_df[cat_cols].astype(float)
# chi2_vals, _ = chi2(X_cat, y_train)
# chi2_deselect = [f for f, val in zip(cat_cols, chi2_vals) if val <= 0]
# print("Features to deselect according to Chi²:", chi2_deselect)

irrelevant_features = list(set(vt_deselect + spearman_deselect)) # TODO deselect all the identified features for upcoming models instead of only removing them for ElasticNet ~J

#### 3.2 Redundancy Analysis

In [None]:
redundant_variables = []
# TODO Maybe add filter methods for redundant features (e.g., high inter-feature correlation) ~J
deselect_features = list(set(irrelevant_features + redundant_variables))

#### 3.3 Removal of identified features

We remove the identified features for all upcoming models. This keeps the feature set consistent across models and should make them more stable, even though tree-based models handle irrelevant features well natively.

In [None]:
# Create new, clean feature lists by removing the identified features
log_features_clean = [f for f in log_features if f not in deselect_features]
numeric_features_clean = [f for f in numeric_features if f not in deselect_features]
categorical_features_clean = [f for f in categorical_features if f not in deselect_features]

feature_names_all = log_features_clean + numeric_features_clean + categorical_features_clean # TODO refactor to feature_names_all_cleaned

# Rebuild the preprocessor with the cleaned feature lists to be used for upcoming models
preprocessor_fe_clean = ColumnTransformer([
    ("log", log_transformer_fe, log_features_clean),
    ("num", numeric_transformer_fe, numeric_features_clean),
    ("cat", categorical_transformer_fe, categorical_features_clean)
])

In [None]:
# TODO @Samu 2 questions regarding the following commented-out code:
# 1) can this be removed when we are using the preprocessor_fe_clean for the elasticnet model as well?
# 2) Why was log_transformer not included in the linear model preprocessing before? ~J



# # Numeric/log features for linear models
# linear_numeric_features = [f for f in numeric_features + log_features if f not in spearman_deselect]

# preprocessor_linear = ColumnTransformer([
#     ("num", numeric_transformer_fe, linear_numeric_features),
#     ("cat", categorical_transformer_fe, categorical_features)
# ], remainder="drop")

# # => use preprocessor_linear for linear model setup; since tree models are indifferent to irrelevant features

### 4. Model Evaluation Metrics, Baselining, Setup

TASK IV (4 Points): Build a simple model and assess the performance
- Identify the type of problem and select the relevant algorithms
- Select one Model Assessment Strategy to use throughout your work. Which metrics are you using to evaluate your model and why?


=> Tip from lecturer: Use `RandomizedSearchCV` instead of `GridSearchCV` and search over wider parameter ranges
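
A minimal sketch of that tip, assuming synthetic data from `make_regression` and illustrative search ranges (these are not the ranges used later in the notebook):

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

search = RandomizedSearchCV(
    ElasticNet(max_iter=30000),
    param_distributions={
        "alpha": loguniform(1e-4, 1e1),    # wide range, sampled on a log scale
        "l1_ratio": uniform(0.05, 0.9),    # uniform on [0.05, 0.95]
    },
    n_iter=20,                             # number of sampled parameter settings
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Sampling from distributions instead of enumerating a grid keeps the cost fixed at `n_iter` fits per fold, however wide the ranges are.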


#### 4.1 Model Evaluation Metrics

**MAE (Mean Absolute Error):**
- average absolute deviation between predicted and true car prices
- easy to interpret in pounds, same metric used by Kaggle competition

**RMSE (Root Mean Squared Error):**
- sensitive to outliers, helps identify large prediction errors

**R²:**
- Coefficient of determination: proportion of variance explained by the model
- 1.0 = perfect predictions, 0.0 = same as predicting mean, < 0.0 = worse than mean

=> We define the metrics in the method `print_metrics` in file `car_functions.py`
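
For reference, the three metrics can be computed with scikit-learn as in the sketch below. The helper name `report_metrics` is hypothetical and the toy prices are made up; the actual `print_metrics` in `car_functions.py` may differ.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def report_metrics(y_true, y_pred):
    """Print and return MAE, RMSE and R2 for a set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    print(f"MAE: {mae:.4f} | RMSE: {rmse:.4f} | R2: {r2:.4f}")
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Toy prices, made up for illustration only
y_true = np.array([10_000, 12_000, 20_000, 30_000])
y_pred = np.array([11_000, 12_000, 18_000, 34_000])
metrics = report_metrics(y_true, y_pred)
```

In this toy example the single error of 4,000 pulls the RMSE well above the MAE, which illustrates the outlier sensitivity mentioned above.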

#### 4.2 Baseline (mean and median)
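
One way to score a constant baseline with cross-validation is scikit-learn's `DummyRegressor`. The sketch below uses synthetic, right-skewed prices as a stand-in for the real target, so its numbers say nothing about the car data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate

# Synthetic, right-skewed "prices", made up for illustration only
rng = np.random.default_rng(42)
y = rng.lognormal(mean=9.5, sigma=0.5, size=1_000)
X = np.zeros((len(y), 1))  # DummyRegressor ignores the features

baseline_mae = {}
for strategy in ("mean", "median"):
    cv = cross_validate(
        DummyRegressor(strategy=strategy),
        X,
        y,
        cv=3,
        scoring="neg_mean_absolute_error",
    )
    # The constant is re-estimated on the training part of each fold
    baseline_mae[strategy] = -cv["test_score"].mean()

print(baseline_mae)
```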

In [None]:
# TODO: use CV here for evaluation too or remove the cell
# mean_pred = y_train.mean()
# median_pred = y_train.median()

# print("baseline mean predictor: ")
# print_metrics(y_val, [mean_pred]*len(y_val))
# # MAE: 6949.2397 | RMSE: 9544.0803 | R2: -0.0001

# print("-"*150)

# print("baseline median predictor: ") 
# print_metrics(y_val, [median_pred]*len(y_val))
# # MAE: 6714.2387 | RMSE: 9774.3098 | R2: -0.0489

#### 4.3 Pipeline Definitions (preprocessor + model)

In [None]:
##### Split definitions into original features and engineered features pipelines

### LINEAR MODEL (ElasticNet)

elastic_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("model", ElasticNet(
        alpha=0.01,
        l1_ratio=0.5,
        max_iter=30000,
        tol=1e-4,
        selection="cyclic",
        random_state=42
    ))
])

elastic_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean), 
    ("model", ElasticNet(
        alpha=0.01,
        l1_ratio=0.5,
        max_iter=30000,
        tol=1e-4,
        selection="cyclic",
        random_state=42
    ))
])


### TREE MODELS

# HistGradientBoostingRegressor

hgb_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("model", HistGradientBoostingRegressor(
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        l2_regularization=0.5,
        random_state=42
    ))
])

hgb_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("model", HistGradientBoostingRegressor(
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        l2_regularization=0.5,
        random_state=42
    ))
])


# RandomForestRegressor

rf_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("model", RandomForestRegressor(
        n_estimators=300,
        max_depth=None,
        min_samples_split=3,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=True,
        n_jobs=1,
        random_state=42
    ))
])

rf_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("model", RandomForestRegressor(
        n_estimators=300,
        max_depth=None,
        min_samples_split=3,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=True,
        n_jobs=1,
        random_state=42
    ))
])


# ExtraTreesRegressor

et_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("model", ExtraTreesRegressor(
        n_estimators=400,
        max_depth=None,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=False,
        n_jobs=1,
        random_state=42
    ))
])

et_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("model", ExtraTreesRegressor(
        n_estimators=400,
        max_depth=None,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=False,
        n_jobs=1,
        random_state=42
    ))
])


### KERNEL-BASED MODEL (SVR)

svr_pipe_orig = Pipeline([
    ("preprocess", preprocessor_orig),
    ("model", SVR(
        kernel="rbf",
        C=10,
        epsilon=0.1,
        gamma="scale"
    ))
])

svr_pipe_fe = Pipeline([
    ("preprocess", preprocessor_fe_clean),
    ("model", SVR(
        kernel="rbf",
        C=10,
        epsilon=0.1,
        gamma="scale"
    ))
])


### ENSEMBLE META MODEL (Stacking)

stack_pipe_orig = StackingRegressor(
    estimators=[
        ("elastic_orig", elastic_pipe_orig),
        ("hgb_orig", hgb_pipe_orig),
        ("rf_orig", rf_pipe_orig),
    ],
    final_estimator=HistGradientBoostingRegressor(
        learning_rate=0.05,
        max_depth=5,
        l2_regularization=0.5,
        random_state=42
    ),
    passthrough=False,   # <- was True: disable raw-X passthrough to avoid string->float error
    n_jobs=1
)

stack_pipe_fe = StackingRegressor(
    estimators=[
        # ("elastic_fe", elastic_pipe_fe),
        ("hgb_fe", hgb_pipe_fe),
        ("rf_fe", rf_pipe_fe),
    ],
    final_estimator=HistGradientBoostingRegressor(
        learning_rate=0.05,
        max_depth=5,
        l2_regularization=0.5,
        random_state=42
    ),
    passthrough=False,   # <- same here
    n_jobs=1
)

#### 4.4 First run of models

In [None]:
# First evaluation of metrics based on original and engineered feature pipeline to decide how to proceed


models_orig = {
    # "ElasticNet_orig": elastic_pipe_orig,
    "HGB_orig": hgb_pipe_orig,
    "RF_orig": rf_pipe_orig,
    "ET_orig": et_pipe_orig,
    "SVR_orig": svr_pipe_orig,
    "Stack_orig": stack_pipe_orig,
}

models_fe = {
    # "ElasticNet_fe": elastic_pipe_fe,
    "HGB_fe": hgb_pipe_fe,
    "RF_fe": rf_pipe_fe,
    "ET_fe": et_pipe_fe,
    "SVR_fe": svr_pipe_fe,
    "Stack_fe": stack_pipe_fe,
}

results = []

# for name, model in {**models_orig, **models_fe}.items():
# TODO remove this again and fit on original models as well (did this only to save time during testing) ~J
for name, model in {**models_fe}.items():
    print(f"Fitting {name} with cross-validation...")
    
    # Perform cross-validation on the entire training set
    cv_results = cross_validate(
        model, 
        X_train, 
        y_train,
        cv=3,
        scoring={
            'neg_mae': 'neg_mean_absolute_error',
            'neg_mse': 'neg_mean_squared_error',
            'r2': 'r2'
        },
        return_train_score=False,
        verbose=3,
        n_jobs=-2
    )
    
    # Calculate mean metrics across folds
    mae = -cv_results['test_neg_mae'].mean()
    rmse = np.sqrt(-cv_results['test_neg_mse'].mean())
    r2 = cv_results['test_r2'].mean()
    
    results.append({
        "model": name,
        "feature_set": "original" if name.endswith("_orig") else "engineered",
        "MAE": mae,
        "RMSE": rmse,
        "R2": r2,
    })

results_df = (
    pd.DataFrame(results)
      .sort_values(["feature_set", "MAE"])
      .reset_index(drop=True)
)

print(results_df)

# Long Duration (with orig ca 25mins VS without orig ca 6mins VS with CV ca 16mins VS with njobs=-1 ca )

# Predicted on hold-out val set (20%); the RMSE column of this older run appears to hold unrooted MSE values:
#       model feature_set          MAE          RMSE        R2
# 0     RF_fe  engineered  1299.728938  4.509435e+06  0.950490
# 1  Stack_fe  engineered  1321.130612  4.831609e+06  0.946953
# 2     ET_fe  engineered  1328.051439  4.707534e+06  0.948315
# 3    HGB_fe  engineered  1534.496164  5.609255e+06  0.938415
# 4    SVR_fe  engineered  2955.064750  3.242891e+07  0.643956

# Predicted using 3-fold CV on entire data:
#       model feature_set          MAE         RMSE        R2
# 0     RF_fe  engineered  1336.806163  2375.850617  0.940424
# 1  Stack_fe  engineered  1357.266391  2505.029128  0.933786
# 2     ET_fe  engineered  1364.212656  2399.654669  0.939223
# 3    HGB_fe  engineered  1551.419964  2503.445871  0.933858
# 4    SVR_fe  engineered  3068.524237  6130.420383  0.603579

In [None]:
# TODO the following markdown and reasoning for hyperparameter tuning has to be adjusted regarding new insights (e.g. ET is not underperforming anymore) ~J

After a first run comparing the original and the engineered feature pipelines across all models, we decided to focus on RandomForest and HistGradientBoosting, which show the best prediction performance so far.

StackingRegressor currently performs best, but because it only blends the existing models, we first tune its base models and re-evaluate the stack at the end.

Since ExtraTrees and SVR clearly underperform, we decided not to tune their hyperparameters.
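
When comparing the numbers recorded for this first run, it helps to check that the RMSE columns are on the same scale. The hold-out figures around 4.5e+06 look like unrooted MSE values, whereas the cross-validation figures (roughly 2,400) are rooted. A minimal sketch of the conversion, using only the standard library; the helper name `rmse_from_mse` and the per-fold values are made up for illustration:

```python
import math

def rmse_from_mse(mse):
    """Convert a mean squared error to a root mean squared error."""
    return math.sqrt(mse)

# Hold-out value recorded for RF_fe in the first results table
print(round(rmse_from_mse(4.509435e06), 2))

# With cross-validation there are two common ways to aggregate RMSE.
# The first-run loop roots the mean of the fold MSEs; rooting each fold
# first and averaging afterwards can never give a larger number.
fold_mse = [5.0e06, 5.6e06, 6.4e06]  # hypothetical per-fold MSE values
root_of_mean = math.sqrt(sum(fold_mse) / len(fold_mse))
mean_of_roots = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)
print(round(root_of_mean, 1), round(mean_of_roots, 1))
```

Whichever aggregation is chosen, it should be applied consistently before models from different runs are ranked against each other.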

### 5. Hyperparameter Tuning and Model Evaluation

In [None]:
# Define a function to use it here and potentially use it later for a final hyperparameter tuning after feature selection again
def model_hyperparameter_tuning(model_estimator, param_dist, n_iter=100, splits=5):
    
    cv = KFold(n_splits=splits, shuffle=True, random_state=42) # shuffled K-fold CV (5 folds by default) for a more robust estimate

    # Randomized search setup
    model_random = RandomizedSearchCV(
        estimator=model_estimator,
        param_distributions=param_dist,
        n_iter=n_iter,                      # number of hyperparameter combinations that are randomly sampled and evaluated (more iterations = more thorough search but longer runtime)
        scoring={
            'mae': 'neg_mean_absolute_error',
            'mse': 'neg_mean_squared_error',
            'r2': 'r2'
        },
        refit='mae', # Refit the best model based on MAE on the whole training set
        cv=cv,
        n_jobs=-2,
        random_state=42,
        verbose=3,
    )

    # Fit the search
    model_random.fit(X_train, y_train)

    mae = -model_random.cv_results_['mean_test_mae'][model_random.best_index_]
    mse = -model_random.cv_results_['mean_test_mse'][model_random.best_index_]
    rmse = np.sqrt(mse)
    r2 = model_random.cv_results_['mean_test_r2'][model_random.best_index_]

    print("Model Results (CV metrics):")
    print(f"MAE: {mae:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"R²: {r2:.4f}")
    print("Best Model params:", model_random.best_params_)

    return model_random.best_estimator_ # return the best model

##### 5.1 ElasticNet

In [None]:
# Hyperparameter Tuning: ElasticNet

elastic_param_grid = {
    "model__alpha": [0.001],    # also tried 0.01, 0.05, 0.1, 0.5
    "model__l1_ratio": [0.9]    # also tried 0.1, 0.3, 0.5, 0.7  
}

# CV: Calculate all metrics but use MAE for selecting best model
elastic_grid = GridSearchCV(
    elastic_pipe_fe, 
    param_grid=elastic_param_grid,
    cv=5,
    scoring={
        'mae': 'neg_mean_absolute_error',
        'mse': 'neg_mean_squared_error',
        'r2': 'r2'
    },
    refit='mae', # Refit the best model based on MAE on the whole training set
    n_jobs=-2,
    verbose=3,
    return_train_score=False
)
elastic_grid.fit(X_train, y_train)

# Get mean metrics across folds
mae = -elastic_grid.cv_results_['mean_test_mae'][elastic_grid.best_index_]
mse = -elastic_grid.cv_results_['mean_test_mse'][elastic_grid.best_index_]
rmse = np.sqrt(mse)
r2 = elastic_grid.cv_results_['mean_test_r2'][elastic_grid.best_index_]
print("ElasticNet Results (CV on entire train set):")
print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")
print("Best ElasticNet params:", elastic_grid.best_params_)

elastic_best = elastic_grid.best_estimator_ # Final model trained on entire training set with best hyperparameters minimizing MAE

# Long Duration (Before removal of OHE-categoricals interrupted kernel after 64mins VS after removal ca 1min -> now 15secs with njobs=-2)

# ElasticNet Results: 
# MAE: 2353.9112 | RMSE: 13356867.7860 | R2: 0.8534
# Best ElasticNet params: {'model__alpha': 0.001, 'model__l1_ratio': 0.9}

# MAE: 2589.6100
# RMSE: 4104.4515
# R²: 0.8222
# Best ElasticNet params: {'model__alpha': 0.001, 'model__l1_ratio': 0.9}

In [None]:
# TODO adjust this approach after removing ohe categoricals and adding feature selection in Section 3 -> We only have 19 features now and RFE is not filtering any because there are already preselected ~J

# # ElasticNet + RFE (no CV)
# # reuse the tuned ElasticNet from the best pipeline
# en_base = clone(elastic_best.named_steps["model"])
 
# rfe_pipe_linear = Pipeline([
#     ("preprocess", preprocessor_fe_clean),
#     ("rfe", RFE(
#         estimator=en_base,
#         n_features_to_select=100,
#         step=0.1,              # remove ~10% of features per iteration
#         importance_getter="auto"
#     )),
#     ("model", clone(en_base))
# ])
 
# # train / predict / evaluate
# rfe_pipe_linear.fit(X_train, y_train)
# val_pred_linear_rfe = rfe_pipe_linear.predict(X_val)
 
# print("ElasticNet with RFE (no CV):")
# print_metrics(y_val, val_pred_linear_rfe)
 
# # show which features were removed
# rfe = rfe_pipe_linear.named_steps["rfe"]
# feat_names = feature_names_all
# # (
# #     deselect_features + # Mention the previously removed features here too to have the full list of removed features
# #     list(rfe_pipe_linear.named_steps["preprocess"]
# #          .named_transformers_["cat"].named_steps["encoder"]
# #          .get_feature_names_out(categorical_features))
# # )
# removed = [n for n, keep in zip(feat_names, rfe.support_) if not keep]

# print(f"\nRemoved Features through RFE ({len(removed)}):")
# for feat in removed:
#     print(f"  - {feat}")

**Reasoning**: We used 100 features as an initial, arbitrary cutoff for feature selection in the ElasticNet model. Preliminary experiments and insights from the EDA (see separate notebook) indicated that tree-based methods are likely to perform better. Therefore, we prioritized feature selection for the tree-based models based on SHAP values.
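
For reference, the elimination schedule implied by such a setup can be worked out by hand. The sketch below assumes scikit-learn's documented behaviour for a fractional `step` (the fraction is taken of the original feature count and rounded down, and the last round is capped so the target is hit exactly). It uses the 155 features reported for the earlier one-hot-encoded run. `rfe_schedule` is a hypothetical helper, not part of scikit-learn.

```python
def rfe_schedule(n_features, n_features_to_select, step):
    """List the feature counts after each RFE round.

    Assumption: a float step in (0, 1) is converted once to
    int(max(1, step * n_features)), i.e. a fixed number of features
    per round rather than a percentage of the remaining ones.
    """
    per_round = int(max(1, step * n_features)) if 0 < step < 1 else int(step)
    counts = [n_features]
    while counts[-1] > n_features_to_select:
        # Never remove more than what is left above the target
        removed = min(per_round, counts[-1] - n_features_to_select)
        counts.append(counts[-1] - removed)
    return counts

print(rfe_schedule(155, 100, 0.1))
print(rfe_schedule(19, 100, 0.1))
```

With only 19 features the target of 100 is already met, so the schedule contains no elimination round, which is consistent with the TODO note above.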
 

##### 5.2 HistGradientBoost

In [None]:
hgb_param_dist = {
    "model__learning_rate": uniform(0.01, 0.15),       # uniform(loc, scale): samples from [0.01, 0.16]
    "model__max_leaf_nodes": randint(50, 150),         # integers 50 to 149 (upper bound is exclusive)
    "model__min_samples_leaf": randint(2, 20),         # leaf sizes 2 to 19
    "model__max_iter": randint(200, 900),              # upper limit of 200 to 899 iterations (early stopping may stop sooner)
    "model__l2_regularization": uniform(0.0, 1.0),     # samples regularization strengths from [0.0, 1.0]
    "model__early_stopping": [True],
    "model__validation_fraction": [0.1],
    "model__n_iter_no_change": [20],
    "model__random_state":[42]
}

# optimized the parameter distributions based on previous runs to focus search space
# hgb_param_dist = {
#     "model__learning_rate": [0.06389789198396824],
#     "model__max_leaf_nodes": [105],
#     "model__min_samples_leaf": [3],
#     "model__max_iter": [642],
#     "model__l2_regularization": [0.942853570557981],
#     "model__early_stopping": [True],
#     "model__validation_fraction": [0.1],
#     "model__n_iter_no_change": [20],
#     "model__random_state":[42]
# }

hgb_best = model_hyperparameter_tuning(hgb_pipe_fe, hgb_param_dist) 

# Old preset hps (1min):
# MAE: 1289.7294
# RMSE: 2185.5006
# R²: 0.9498

# Reapplying RandomizedSearchCV (40mins):
# MAE: 1289.6713
# RMSE: 2181.5766
# R²: 0.9500
# Best Model params: {'model__early_stopping': True, 'model__l2_regularization': np.float64(0.14092422497476265), 'model__learning_rate': np.float64(0.08219772826786356), 'model__max_iter': 464, 'model__max_leaf_nodes': 108, 'model__min_samples_leaf': 8, 'model__n_iter_no_change': 20, 'model__random_state': 42, 'model__validation_fraction': 0.1}

# Using transmission and fuelType as OHE instead of TE:

##### 5.3 RandomForest

In [None]:
# Old parameter distribution
rf_param_dist = {
    "model__n_estimators": randint(200, 600),        # number of trees
    "model__max_depth": randint(5, 40),              # depth of each tree
    "model__min_samples_split": randint(2, 10),      # min samples to split an internal node
    "model__min_samples_leaf": randint(1, 8),        # min samples per leaf
    "model__max_features": ["sqrt", None],           # feature sampling strategy (sqrt performed better than log2 in previous tests)
    "model__bootstrap": [False]                      # use bootstrapping or not (False performed better than True in previous tests)
}

# So far best parameter distribution based on previous runs to focus search space
# rf_param_dist = {
#     "model__n_estimators": [467],        
#     "model__max_depth": [32],              
#     "model__min_samples_split": [9],      
#     "model__min_samples_leaf": [1],        
#     "model__max_features": ["sqrt"],         
#     "model__bootstrap": [False]                
# }

rf_best_rand = model_hyperparameter_tuning(rf_pipe_fe, rf_param_dist)
# joblib.dump(rf_best_rand, "rf_best_rand.pkl")


# Long Duration (~2min)

# MAE: 1275.1518
# RMSE: 2232.9070
# R²: 0.9477

# Reapplying RandomizedSearchCV (~120mins):
# MAE: 1272.4144
# RMSE: 2214.9228
# R²: 0.9486
# Best Model params: {'model__bootstrap': False, 'model__max_depth': 27, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 1, 'model__min_samples_split': 5, 'model__n_estimators': 328}

# Using transmission and fuelType as OHE instead of TE:


##### 5.4 StackingRegressor

In [None]:
# Old parameter distribution
stack_param_dist = {
    "final_estimator__learning_rate": uniform(0.02, 0.1),    # uniform(loc, scale): samples from [0.02, 0.12]
    "final_estimator__max_depth": randint(3, 10),
    "final_estimator__min_samples_leaf": randint(3, 20),
    "final_estimator__l2_regularization": uniform(0.0, 1.0),
}

# So far best parameter distribution based on previous runs to focus search space
# stack_param_dist = {
#     "final_estimator__learning_rate": [0.0960571445127933],
#     "final_estimator__max_depth": [5],
#     "final_estimator__min_samples_leaf": [10],
#     "final_estimator__l2_regularization": [0.3745401188473625]
# }

stack_best = model_hyperparameter_tuning(stack_pipe_fe, stack_param_dist, splits=3)
# joblib.dump(stack_best, "stack_best.pkl")


# Long Duration (~3mins)

# MAE: 1351.8682
# RMSE: 2498.2822
# R²: 0.9342

# After RandomizedSearchCV:
# MAE: 1350.4717
# RMSE: 2497.0474
# R²: 0.9343
# Best Model params: {'final_estimator__l2_regularization': np.float64(0.978892858275009), 'final_estimator__learning_rate': np.float64(0.06867421529594551), 'final_estimator__max_depth': 6, 'final_estimator__min_samples_leaf': 13}

# Removed ElasticNet from stacking due to poor performance compared to RF and HGB alone


### 6. Feature Importance of Tree Models (with SHAP)

  **Problem:** Current feature selection targets linear models
  (ElasticNet), but we primarily use tree-based models (HGB,
  RandomForest).

  **Solution:** Use SHAP (SHapley Additive exPlanations) to
  identify feature importance specifically for tree models

  **Why SHAP for Trees:**
  - Provides exact feature importance values for tree-based
  models
  - Tree models handle irrelevant features, but noise features
  still impact performance
  - Enables data-driven selection rather than statistical filter
  methods
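
The ranking used in this section reduces a matrix of per-sample SHAP values to one score per feature by averaging absolute values. A dependency-free sketch of that aggregation follows. The feature names and SHAP values are made up for illustration, and `mean_abs_shap` is a hypothetical helper rather than part of the `shap` package.

```python
def mean_abs_shap(shap_matrix, feature_names):
    """Rank features by the mean absolute SHAP value over all samples."""
    n_samples = len(shap_matrix)
    scores = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_samples
        for j, name in enumerate(feature_names)
    }
    # Highest mean |SHAP| first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Rows are samples, columns are features (toy values, not real output)
feature_names = ["mileage", "year", "engineSize"]
shap_matrix = [
    [-400.0, 100.0, 50.0],
    [200.0, -100.0, 50.0],
]
ranking = mean_abs_shap(shap_matrix, feature_names)
print(ranking)
```

Taking absolute values first matters: the signed mean for `year` in this toy matrix is zero even though the feature shifts every prediction by 100.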

In [None]:
# Function to compute SHAP-based feature importance for any tree model
def calculate_shap_values(best_pipeline, X_train, log_features, numeric_features, categorical_features, sample_size=1000, seed=42, label=None):
    '''
    We use SHAP's TreeExplainer to calculate feature importance values. TreeExplainer is specifically optimized for tree-based models and provides exact Shapley values efficiently.
    '''
    # Preprocess training data with the pipeline’s preprocessor
    pre = best_pipeline.named_steps["preprocess"]
    X_train_proc = pre.transform(X_train)

    # Build feature names in ColumnTransformer order: log, numeric, one-hot(cat)
    # cat_names = pre.named_transformers_["cat"].named_steps["encoder"].get_feature_names_out(categorical_features) # TODO add cat_names back if trying multiple encodings ~J
    feature_names_all = list(log_features) + list(numeric_features) #+ list(cat_names) # TODO add cat_names back if trying multiple encodings ~J
    feature_names_all = [f for f in feature_names_all if f not in deselect_features]  # ensure deselected features are not included

    # SHAP for tree models
    model = best_pipeline.named_steps["model"]
    explainer = shap.TreeExplainer(model)

    # Sample for speed, reproducible
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(X_train_proc))
    idx = rng.choice(len(X_train_proc), n, replace=False)

    shap_values = explainer.shap_values(X_train_proc[idx])
    importance = np.abs(shap_values).mean(axis=0)

    shap_df = (pd.DataFrame({"feature": feature_names_all, "importance": importance})
               .sort_values("importance", ascending=False)
               .reset_index(drop=True))

    tag = label or model.__class__.__name__
    print(f"\nMost important features ({tag}):")
    print(shap_df.to_string(index=False))

    return shap_df, feature_names_all, X_train_proc


In [None]:
# General function which can be called by the models to avoid redundant code and enable easy maintenance
def train_model_on_best_features(baseline_mae, shap_importance, model, X_train_processed, X_val_processed, range_number_of_features, feature_names_all):
    '''
    We systematically test different numbers of top features to find the optimal subset:
    We train the model with the same optimized hyperparameters but using only the most important features identified by SHAP
    '''
    # Track best model
    results = []
    best_model = None
    best_mae = float("inf")
    best_n = None
    best_features = None

    # Find best feature counts
    for n_features in range_number_of_features:
        # Select top N features
        top_features = shap_importance.head(n_features)["feature"].tolist()
        feature_indices = [i for i, fname in enumerate(feature_names_all) if fname in top_features]

        X_train_subset = X_train_processed[:, feature_indices]
        X_val_subset   = X_val_processed[:, feature_indices]

        # Train and predict using selected amount of features (model uses tuned hyperparams)
        model.fit(X_train_subset, y_train)
        pred_subset = model.predict(X_val_subset)
        mae_subset = mean_absolute_error(y_val, pred_subset)
        results.append({"n_features": n_features, "mae": mae_subset})

        # Check whether current mae is best so far
        if mae_subset < best_mae:
            best_mae = mae_subset
            best_n = n_features
            best_model = model
            best_features = top_features

        # Print MAE for each number of features (positive Δ means better than the baseline)
        improvement = baseline_mae - mae_subset
        print(f"Top {n_features:3d} features: MAE: {mae_subset:.1f} (Δ: {improvement:+.1f})")


    print(f"\nOptimal feature selection results:")
    print(f"Best performance with {best_n} features: MAE: {best_mae:.2f}")
    print(f"Improvement over baseline: {baseline_mae - best_mae:+.2f} MAE\n")

    print(f"Optimal {best_n} features for production model:")
    for i, feat in enumerate(best_features, start=1):
        imp = shap_importance.loc[shap_importance['feature'] == feat, 'importance'].values[0]
        print(f"{i:2d}. {feat:25s} ({imp:.1f})")
    
    # Retrain a fresh final estimator on the full training set restricted to best_features (guarantees correct input dimension)
    selected_idx = [i for i, fname in enumerate(feature_names_all) if fname in best_features]
    final_est = clone(model)
    final_est.fit(X_train_processed[:, selected_idx], y_train)

    return final_est, best_features

In [None]:
def plot_top_shap(shap_df, model_name, top_k=20):
    top_df = shap_df.head(top_k).iloc[::-1]
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(top_df["feature"], top_df["importance"], color="#4C72B0")
    ax.set_xlabel("Average |SHAP| value")
    ax.set_title(f"Top {top_k} {model_name} features by SHAP")
    plt.tight_layout()
    plt.show()


#### 6.1 HGB

##### Step 1: Baseline Performance with Optimized Hyperparameters

In [None]:
X_val_processed_hgb = hgb_best.named_steps["preprocess"].transform(X_val)
hgb_val_pred = hgb_best.named_steps["model"].predict(X_val_processed_hgb)
n_features_total = X_val_processed_hgb.shape[1]
baseline_mae_hgb = mean_absolute_error(y_val, hgb_val_pred)

print("Baseline Performance of HGB model after Hyperparameter Tuning:\n")
print_metrics(y_val, hgb_val_pred)
print(f"\nTotal features used: {n_features_total}")

##### Step 2: SHAP Feature Importance Analysis

In [None]:
# Compute the SHAP-based feature importance for the tuned HGB pipeline
shap_importance_df_hgb, feature_names_all_hgb, X_train_processed_hgb = calculate_shap_values(
    hgb_best, X_train, log_features, numeric_features, categorical_features,
    sample_size=1000, seed=42, label="HGB"
)

##### Step 3: Automated Feature Selection Optimization

In [None]:
# Define model with the same hyperparams
hgb_model = hgb_best.named_steps["model"]
hgb_selected = HistGradientBoostingRegressor(**hgb_model.get_params())

# Number of top SHAP features to try
range_number_of_features_hgb = range(15, n_features_total + 1, 1) # After previous runs with higher step size, the range is now narrowed down

# Train/evaluate on subsets of top features
best_model_hgb, best_features_hgb = train_model_on_best_features(
    baseline_mae_hgb, shap_importance_df_hgb,
    hgb_selected,
    X_train_processed_hgb, X_val_processed_hgb,
    range_number_of_features_hgb,
    feature_names_all_hgb
)

# Long Duration (ca 2mins)

# Notes by Jan: (TODO to be removed)
# start: Best performance with 17 features: MAE: 1288.12
# removed ohe categoricals: Best performance with 19 features: MAE: 1282.81
# After FS removal: Best performance with 19 features (all features bc deselection filtered the same features as SHAP): MAE: 1282.81

In [None]:
# HGB SHAP bar plot
plot_top_shap(shap_importance_df_hgb, "HGB", top_k=20)

In [None]:
# Build the final pipeline with feature selection included
def select_best_features_hgb(X):
    # X is the output of the preprocessing step: Matrix with all features after they have been scaled, encoded, and combined by the preprocessor
    idx = [i for i, fname in enumerate(feature_names_all_hgb) if fname in best_features_hgb]
    return X[:, idx]

hgb_final_pipe = Pipeline([
    ("preprocess", hgb_best.named_steps["preprocess"]),
    ("feature_selector", FunctionTransformer(select_best_features_hgb, validate=False)), # a flexible wrapper that applies a custom function to the data flow in a pipeline
    ("model", best_model_hgb)
])

# Save the best model for later use
joblib.dump(hgb_final_pipe, "hgb_best_feature.pkl")

#### 6.2 RF

##### Step 1: Baseline Performance with Optimized Hyperparameters

In [None]:
# Use the tuned RF pipeline (rf_best_rand) and compute baseline on the validation set
X_val_processed_rf = rf_best_rand.named_steps["preprocess"].transform(X_val)
rf_val_pred = rf_best_rand.named_steps["model"].predict(X_val_processed_rf)
n_features_total_rf = X_val_processed_rf.shape[1] # TODO cant we just use one val_processed and one n_features_total or why did we split that? ~J
baseline_mae_rf = mean_absolute_error(y_val, rf_val_pred)

print("Baseline Performance of RF model after Hyperparameter Tuning:\n")
print_metrics(y_val, rf_val_pred)
print(f"\nTotal features used: {n_features_total_rf}")
# Notes by Jan: (TODO to be removed)
# start: MAE: 1322.4418 | RMSE: 4630733.9922 | R2: 0.9492 (Total features used: 155)
# after removing ohe categoricals: MAE: 1282.0634 | RMSE: 4317505.8622 | R2: 0.9526 (Total features used: 22)
# After FS removal: MAE: 1275.2603 | RMSE: 4274727.2255 | R2: 0.9531 (Total features used: 19)

##### Step 2: SHAP Feature Importance Analysis

In [None]:
shap_importance_df_rf, feature_names_all_rf, X_train_processed_rf = calculate_shap_values(
    rf_best_rand, X_train, log_features, numeric_features, categorical_features,
    sample_size=100, seed=42, label="RF"
)

# Long Duration (ca 4mins)

##### Step 3: Automated Feature Selection Optimization

In [None]:
# Use the same processed validation data and reuse tuned RF hyperparameters
# (model_hyperparameter_tuning returns the best pipeline, so the tuned params are taken from its "model" step)
rf_selected = clone(rf_best_rand.named_steps["model"]).set_params(n_jobs=-1)
range_number_of_features_rf = range(16, n_features_total_rf + 1, 1) # After previous runs with higher step size, the range is now narrowed down

best_model_rf, best_features_rf = train_model_on_best_features(baseline_mae_rf, shap_importance_df_rf, rf_selected, X_train_processed_rf, X_val_processed_rf, range_number_of_features_rf, feature_names_all_rf)

# Long Duration (ca 1min)

# Notes by Jan: (TODO to be removed)
# start: Best performance with 26 features: MAE: 1277.16
# removed ohe categoricals: Best performance with 20 features: MAE: 1273.89
# After FS removal: Best performance with 19 features (all features bc deselection filtered the same features as SHAP): MAE: 1275.26

In [None]:
# RF SHAP bar plot
plot_top_shap(shap_importance_df_rf,  "RF",  top_k=20)

In [None]:
# Save the best RF model for later use

# Build the final RF pipeline with feature selection included
def select_best_features_rf(X):
    idx = [i for i, fname in enumerate(feature_names_all_rf) if fname in best_features_rf]
    return X[:, idx]

final_rf_pipe = Pipeline([
    ("preprocess", rf_best_rand.named_steps["preprocess"]),
    ("feature_selector", FunctionTransformer(select_best_features_rf, validate=False)),
    ("model", best_model_rf)
])

joblib.dump(final_rf_pipe, "rf_best_feature.pkl")

#### 6.3 Build Final Stacking Regressor to mix tuned and feature selected HGB and RF

In [None]:
stack_pipe_final = StackingRegressor(
    estimators=[
        ("hgb_final", hgb_final_pipe),   # tuned HGB pipeline (preprocessor + feature selector + model)
        ("rf_final",  final_rf_pipe),    # tuned RF pipeline (preprocessor + feature selector + model)
    ],
    final_estimator=LinearRegression(),  # simple, perfect for 2 base preds
    passthrough=False,                   # meta model sees only base predictions
    cv=5,                                # proper OOF stacking
    n_jobs=1                             # no BrokenProcessPool on Databricks
)

stack_pipe_final.fit(X_train, y_train)
stack_val_pred = stack_pipe_final.predict(X_val)
print_metrics(y_val, stack_val_pred)

joblib.dump(stack_pipe_final, "stack_pipe.pkl")

# MAE: 1255.3112 | RMSE: 4157099.9081 | R2: 0.9544

# Kaggle Score submit 1274 !! OVERFITTED

# Long Duration (ca 3mins)

# Notes by Jan: (TODO to be removed)
# start: MAE: 1256.5922 | RMSE: 4154147.3742 | R2: 0.9544
# removed ohe categoricals: MAE: 1252.2718 | RMSE: 4146270.8422 | R2: 0.9545
# After FS removal: MAE: 1252.9558 | RMSE: 4145097.0520 | R2: 0.9545

The final StackingRegressor, built from the tuned HGB and RF models, improved over the best single HGB and RF models on the validation set (MAE 1258 vs. 1281/1289).

However, it seems to be overfitted: the Kaggle score is only 1274.

We therefore keep the RF and HGB models as candidates. With such a small difference in MAE, we need to evaluate both of them further, together with the stacking model.
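
One possible way to judge whether such a small MAE gap is meaningful is a paired bootstrap on the validation errors. This is a sketch under assumptions, not the evaluation used in this notebook: the prices and predictions are toy values, and `bootstrap_mae_diff` is a hypothetical helper.

```python
import random

def bootstrap_mae_diff(y_true, pred_a, pred_b, n_boot=2000, seed=42):
    """Return a 95% percentile interval for MAE(a) - MAE(b) via paired bootstrap."""
    rng = random.Random(seed)
    # Per-row difference in absolute error; pairing keeps both models on the same rows
    diffs = [abs(t - a) - abs(t - b) for t, a, b in zip(y_true, pred_a, pred_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample rows with replacement and record the mean difference
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_boot)]
    upper = means[int(0.975 * n_boot) - 1]
    return lower, upper

# Toy, made-up prices and predictions for two models
y_true = [10000, 12500, 8000, 15000, 9500, 20000]
pred_a = [10900, 11800, 8600, 14100, 9900, 21500]
pred_b = [10700, 11900, 8500, 14300, 9800, 21200]
low, high = bootstrap_mae_diff(y_true, pred_a, pred_b)
print(f"95% interval for MAE(a) - MAE(b): [{low:.1f}, {high:.1f}]")
```

If the interval contains zero, the validation set alone does not separate the two models, and the Kaggle score or a further cross-validation run is needed to choose between them.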

### 7. Kaggle Competition

Extra Task (1 Point): Be in the Top 5 Groups on Kaggle

In [None]:
def predict_on_test(model_pipeline, model_name):
    # Load best model from Joblib and predict on validation set to verify
    pipe_best = joblib.load(model_pipeline)
    pred_loaded = pipe_best.predict(X_val)
    print(f"Loaded {model_name}-model MAE on validation set: {mean_absolute_error(y_val, pred_loaded):.2f}")

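    # Predict on test set (drop a 'price' column left over from an earlier call so every model sees the same input)
    X_test = df_cars_test.drop(columns=["price"], errors="ignore")
    df_cars_test['price'] = pipe_best.predict(X_test)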
    df_cars_test['price'].to_csv(f'Group05_{model_name}_Version10.csv', index=True)

In [None]:
predict_on_test("hgb_best_feature.pkl", "HGB")

In [None]:
predict_on_test("rf_best_feature.pkl", "RF")

In [None]:
predict_on_test("stack_pipe.pkl", "Stack")

In [None]:
# !kaggle competitions submit -c cars4you -f Group05_Version05.csv -m "Message" # Uncomment to submit to Kaggle

In [None]:
!kaggle competitions submissions -c cars4you