_Clone before play/engineer with Features / Hyperparameters_

Process to test new features:

1. Add new Features in cell Feature Engineering
2. Go into Preprocessing and add the features in either num, log or cat
3. Run the RandomSearch // Hyperparameter Tuning for at least HGB and RF and see if MAE gets improved compared to previous results
4. If MAE gets not improved, comment in cell below Feature Engineering that you tested those features (+results?) and on which Models - and remove everything
5. If MAE gets improved, find out via Feature Importance (Shap Values) which Feature was responsible + document it, remove the other features that have negative impact
6. comment below the Hyperparameter Tuning cell of the model the new achieved results + all the features you used for that + the hyperparameters
7. save the results in a new model with joblib, name it correctly
8. push to kaggle
9. push to GIT + document everything

# Project Cars4you (Group 5)

### Import & load Data

In [1]:
# Import and load Data

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import joblib
import seaborn as sns

from sklearn.feature_selection import VarianceThreshold, RFE, chi2
from scipy.stats import spearmanr, uniform, loguniform, randint
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler, FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin


from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.linear_model import Ridge, Lasso, LassoCV, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor, StackingRegressor
from sklearn.svm import SVR

from data_cleaning import clean_car_dataframe

df_cars_train = pd.read_csv("train.csv")
df_cars_test = pd.read_csv("test.csv")

In [0]:
pip install kaggle

In [0]:
### Everyone has to do this himself, with his own kaggle.json -> get it from kaggle as api token
# Kaggle API Connect

# Folder containing kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = "/Workspace/Users/20250355@novaims.unl.pt"

# Test
!echo $KAGGLE_CONFIG_DIR

### Data Cleaning, Feature Engineering, Split & Preprocessing

Task II (5 Points): Clean and preprocess the dataset. 
- Missing Value handling, Outlier preprocessing + justify decisions -> in data_cleaning.py
- Review current features and create extra features if needed + explain -> in Feature Engineering
- Deal with categorical variables -> In One-Hot-Encoding 
- Perform data scaling, explain reasoning -> In Transforming

In [2]:
# Outlier Preprocessing, Missing Value Handling and Decision justifying happens here
df_cars_train = clean_car_dataframe(df_cars_train)
df_cars_test = clean_car_dataframe(df_cars_test)


# Safety Check: print unique values of all columns of df_cars_train // df_cars_test to see if data cleaning worked and if there are still odd values
for col in df_cars_train.columns:
    print(col, df_cars_train[col].unique())
print("X"*150)
print("-"*150)
print("X"*150)
for col in df_cars_test.columns:
    print(col, df_cars_test[col].unique())

Brand ['VW' 'Toyota' 'Audi' 'Ford' 'BMW' 'Skoda' 'Opel' 'Mercedes' 'Hyundai' nan]
model ['golf' 'yaris' 'q2' 'fiesta' '2 series' '3 series' 'a3' 'octavia'
 'passat' 'focus' 'insignia' 'a class' 'q3' 'fabia' 'ka+' 'glc class'
 'i30' 'c class' 'polo' 'e class' 'q5' 'up' 'c-hr' 'mokka' 'corsa' 'astra'
 'tt' '5 series' 'aygo' '4 series' nan 'ecosport' 'tucson' 'x-class'
 'cl class' 'ix20' 'i20' 'a1' 'auris' 'sharan' 'adam' 'x3' 'a8'
 'gls class' 'b-max' 'a4' 'kona' 'i10' 's-max' 'x2' 'crossland x' 'tiguan'
 'a5' 'gle class' 'zafira' 'ioniq' 'a6' 'mondeo' 'yeti' 'x1' 'scala'
 's class' '1 series' 'kamiq' 'kuga' 'q7' 'gla class' 'arteon' 'sl class'
 'santa fe' 'grandland x' 'rav4' 'touran' 'corolla' 'b class' 'kodiaq'
 'v class' 'superb' 'combo life' 'beetle' 'm3' 'x4' 'ix35' 'm4' 'z4' 'x5'
 'meriva' 'verso' 'cls class' 'c-max' 'puma' 'i40' '6 series' 'karoq' 'a7'
 'land cruiser' 'edge' 'x6' '8 series' 'scirocco' 'z3' 'hilux' 'amarok'
 '7 series' 'avensis' 'm class' 'r8' 'antara' 'q8' 'x7' '

In [3]:
# Feature Engineering and Explaination

# add column age: models can easier interpret linear numerical features
df_cars_train['age'] = 2025 - df_cars_train['year']
df_cars_test['age'] = 2025 - df_cars_test['year']


# miles per year: normalizes the total mileage by how old the car is
df_cars_train['miles_per_year'] = df_cars_train['mileage'] / df_cars_train['age'].replace({0: np.nan})
df_cars_train['miles_per_year'] = df_cars_train['miles_per_year'].fillna(df_cars_train['mileage'])

df_cars_test['miles_per_year'] = df_cars_test['mileage'] / df_cars_test['age'].replace({0: np.nan})
df_cars_test['miles_per_year'] = df_cars_test['miles_per_year'].fillna(df_cars_test['mileage'])


# model frequency: some models are more common, which means they can be cheaper (supply) or retain their values better (demand). freq shows their popularity
model_freq = df_cars_train['model'].value_counts(normalize=True).to_dict()
df_cars_train['model_freq'] = df_cars_train['model'].map(model_freq)
df_cars_test['model_freq'] = df_cars_test['model'].map(model_freq)


# brand median price (only train): shows brand positioning (e.g. BMW > KIA)
brand_median_price = df_cars_train.groupby('Brand')['price'].median()
df_cars_train['brand_med_price'] = df_cars_train['Brand'].map(brand_median_price)


# model median price (only train): shows model positioning (e.g. 3er > 1er)
model_med_price = df_cars_train.groupby('model')['price'].median()
df_cars_train['model_med_price'] = df_cars_train['model'].map(model_med_price)


# brand anchor (market position) 
brand_median_price = df_cars_train.groupby("Brand")["price"].median().to_dict()
overall_mean_price = df_cars_train["price"].mean()

df_cars_train["brand_anchor"] = df_cars_train["Brand"].map(brand_median_price) / overall_mean_price
df_cars_test["brand_anchor"]  = df_cars_test["Brand"].map(brand_median_price) / overall_mean_price

In [4]:
# Split Data, Stratify not necessary due to regression problem and Cross Validation later
X = df_cars_train.drop(columns='price')
y = df_cars_train['price']

X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.3, 
                                                  random_state = 42, 
                                                  shuffle = True)

In [6]:
# Write custom GroupMedianImputer to impute missing values on a model, brand level and not only global (with SimpleImputer)
class GroupMedianImputer(BaseEstimator, TransformerMixin):
    """
    Impute missing numeric values using hierarchical medians:
    1. By (Brand, model)
    2. If model missing → by Brand
    3. Fallback → global median
    """

    def __init__(self, group_cols=["Brand", "model"]):
        self.group_cols = group_cols

    def fit(self, X, y=None):
        X = pd.DataFrame(X).copy()
        self.feature_names_in_ = X.columns

        # Step 1 — model-level medians
        if all(c in X.columns for c in self.group_cols):
            self.medians_ = X.groupby(self.group_cols).median(numeric_only=True)
        else:
            self.medians_ = pd.DataFrame()

        # Step 2 — brand-level medians
        if "Brand" in X.columns:
            self.brand_medians_ = X.groupby("Brand").median(numeric_only=True)
        else:
            self.brand_medians_ = pd.DataFrame()

        # Step 3 — global medians
        self.global_median_ = X.median(numeric_only=True)
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col in X.columns:
            if X[col].isna().sum() == 0:
                continue

            # (Brand, model) level
            if all(c in X.columns for c in self.group_cols):
                X[col] = X.apply(
                    lambda r: self.medians_.loc[(r["Brand"], r["model"]), col]
                    if (r["Brand"], r["model"]) in self.medians_.index and pd.isna(r[col])
                    else r[col],
                    axis=1
                )

            # Brand-only fallback
            if "Brand" in X.columns:
                X[col] = X.apply(
                    lambda r: self.brand_medians_.loc[r["Brand"], col]
                    if r["Brand"] in self.brand_medians_.index and pd.isna(r[col])
                    else r[col],
                    axis=1
                )

            # Global fallback
            X[col] = X[col].fillna(self.global_median_[col])

        return X.values  # sklearn expects ndarray

In [7]:
# Preprocessing: with sklearn Pipeline & Column Transformer

group_imputer = GroupMedianImputer(group_cols=["Brand", "model"])

numeric_features = ["age", "tax", "mpg", "engineSize", "paintQuality", "previousOwners", "brand_anchor"]
log_features = ["mileage", "miles_per_year", "model_freq", "brand_med_price", "model_med_price"] # could try to test previousOwners, age here; tax, mpg didnt work
categorical_features = ["Brand", "model", "transmission", "fuelType"]

# left out columns: year (age is better), hasDamage (unsure what the two values 0 and NaN mean)


log_transformer = Pipeline([
    ("group_impute", GroupMedianImputer(group_cols=["Brand", "model"])),  # Handling of missing numerical values with GroupMedianImputer
    ("to_float", FunctionTransformer(lambda x: np.array(x, dtype=float))),
    ("log", FunctionTransformer(np.log1p, validate=False)), # log1p handles zeros safely
    ("scaler", StandardScaler()) #  # Data Scaling with sklearn StandardScaler
])

numeric_transformer = Pipeline([
    ("group_impute", GroupMedianImputer(group_cols=["Brand", "model"])),
    ("to_float", FunctionTransformer(lambda x: np.array(x, dtype=float))),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")), # fill by mode instead of Unknown (a diesel 3er BMW is probably a diesel)
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)) # Deal with Categorical Variables with sklearn OneHotEncoder
])

# Apply the preprocessing steps to the data with ColumnTransformer
preprocessor = ColumnTransformer([
    ("log", log_transformer, log_features),
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])


# Fit preprocessor on training data - avoid data leakage
preprocessor.fit(X_train)



""" EXPLAINATIONS """
# 1) Pipeline bundles preprocessing + model training:
#       > Ensures all preprocessing happens inside cross-validation folds (no data leakage)
#       > Keeps the entire workflow reproducible — scaling, encoding, and modeling are learned together
#       > After .fit(), the final model automatically knows how to preprocess new unseen data
#       > When saving with joblib, the entire preprocessing (imputers, scalers, encoders) and model are stored together

# 2) The ColumnTransformer applies different transformations to subsets of features:
#       > Numeric Features arehandled by our custom GroupMedianImputer (domain-aware filling)
#           - Missing numeric values are imputed hierarchically:
#           1. By (Brand, model)
#           2. If missing model by Brand
#           3. If missing Brand by global median
#       > This approach captures brand/model-level patterns (e.g. BMWs have similar engine sizes)
#       > After imputation, StandardScaler standardizes all numeric features
#
#       > Log Features use the same group-median imputation, followed by log1p() transformation
#           - log1p() compresses large, skewed values (like mileage or price-related features), stabilizing variance and helping linear models perform better
#           - StandardScaler then scales them to zero mean and unit variance
#
#       > Categorical Features are handled by SimpleImputer + OneHotEncoder
#           - SimpleImputer fills missing categorical values with the most frequent (mode) value.
#             (Alternative would be “Unknown”, but mode keeps categories realistic, e.g. most cars in a model share the same transmission)
#           - OneHotEncoder converts each categorical label (Brand, model, etc.) into binary dummy variables
#             This lets the model use category information numerically without implying order
#
# 3) Overall:
#       > The pipeline ensures consistent preprocessing across training, validation, and test data.
#       > It combines domain knowledge (brand/model-aware imputation) with robust numerical scaling.
#       > Linear models (ElasticNet, Ridge, Lasso) and tree models (HistGradientBoosting, RandomForest)
#           can now learn from the same standardized, clean, and information-rich feature space.

' EXPLAINATIONS '

### Feature Selection

Task III (3 Points): Define and Implement a clear and unambiguous strategy for feature selection. Use the methods **discussed in the course**. Present and justify your final selection 

Model independent Filter Methods:
- Remove constant numerical variables with VarianceThreshold (manual)
- Check highly correlated numerical variables and keep one with Spearman (manual)
- Remove unindependent categorical variables with Chi2

Model dependent Wrapper Methods:
- RFE LR / RFE SVR for linear Models: Ridge, Lasso, ElasticNet, SVM
- Feature Importance for tree Models: DecisionTrees, RandomForest, GradientBoosting => trees are unsensitive to irrelevant features but doing feature importance and remove some can reduce dimensionality
- L1 Regularization for Neural Networks: MLP


In [8]:
X_train_proc = preprocessor.transform(X_train)

feature_names_all = []
for name, trans, cols in preprocessor.transformers_:
    if name != 'remainder':
        if hasattr(trans, 'get_feature_names_out'):
            # for categorical OHE
            try:
                feature_names_all.extend(trans.get_feature_names_out(cols))
            except:
                feature_names_all.extend(cols)
        else:
            feature_names_all.extend(cols)

X_df = pd.DataFrame(X_train_proc, columns=feature_names_all)


# Variance Threshold
vt = VarianceThreshold(threshold=0.0)
vt.fit(X_df)
vt_deselect = [f for f, keep in zip(feature_names_all, vt.get_support()) if not keep]
print("Features to deselect according to VarianceThreshold:", vt_deselect)


# Spearman correlation (numeric + log only)
numeric_log = numeric_features + log_features
spearman_deselect = []
for f in numeric_log:
    if f in X_df.columns:
        corr, _ = spearmanr(X_df[f], y_train)
        if abs(corr) <= 0.05:
            spearman_deselect.append(f)
print("Features to deselect according to Spearman correlation:", spearman_deselect)


# Chi2 (categorical only, must be non-negative)
cat_cols = [c for c in X_df.columns if c not in numeric_log]
X_cat = X_df[cat_cols].astype(float)
chi2_vals, _ = chi2(X_cat, y_train)
chi2_deselect = [f for f, val in zip(cat_cols, chi2_vals) if val <= 0]
print("Features to deselect according to Chi²:", chi2_deselect)


Features to deselect according to VarianceThreshold: []
Features to deselect according to Spearman correlation: ['paintQuality', 'previousOwners']
Features to deselect according to Chi²: []


In [9]:
# Numeric/log features for linear models
linear_numeric_features = [f for f in numeric_features + log_features if f not in spearman_deselect]

preprocessor_linear = ColumnTransformer([
    ("num", numeric_transformer, linear_numeric_features),
    ("cat", categorical_transformer, categorical_features)
], remainder="drop")

# => use preprocessor_linear for linear model setup; since tree models are indifferent to irrelevant features

### Tree Model Feature Selection with SHAP

  **Problem:** Current feature selection targets linear models
  (ElasticNet), but we primarily use tree-based models (HGB,
  RandomForest).

  **Solution:** Use SHAP (SHapley Additive exPlanations) to
  identify feature importance specifically for tree models and
  optimize our 145-feature dataset.

  **Why SHAP for Trees:**
  - Provides exact feature importance values for tree-based
  models
  - Tree models handle irrelevant features, but noise features
  still impact performance
  - Enables data-driven selection rather than statistical filter
  methods
  - Automated and reproducible process for production models

  **Methodology:** Train baseline → Calculate SHAP importance →
  Test feature subsets → Recommend optimal configuration

#### Step 1: Baseline Performance with Optimized Hyperparameters

In [18]:
# Import SHAP and define helper function
import shap

def print_metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"MAE: {mae:.4f} | RMSE: {rmse:.4f} | R2: {r2:.4f}")
    return

# Train baseline model with optimized hyperparameters
print("=== BASELINE PERFORMANCE WITH OPTIMIZED HYPERPARAMETERS ===")

# Use best hyperparameters from previous hyperparameter tuning
# sklearn 1.6.1
best_params = {
      'l2_regularization': 0.942853570557981,
      'learning_rate': 0.06389789198396824,
      'max_iter': 642,
      'max_leaf_nodes': 105,
      'min_samples_leaf': 3,
      'random_state': 42
  }

# sklearn 1.4.2
# best_params = {
#     'learning_rate': 0.05704595464437946,
#     'max_leaf_nodes': 109,
#     'min_samples_leaf': 11,
#     'max_iter': 414,
#     'l2_regularization': 0.49379559636439074,
#     'random_state': 42
# }

hgb_baseline = HistGradientBoostingRegressor(**best_params)

# Fit baseline model
X_train_processed = preprocessor.transform(X_train)
X_val_processed = preprocessor.transform(X_val)

hgb_baseline.fit(X_train_processed, y_train)
baseline_pred = hgb_baseline.predict(X_val_processed)

print("Baseline HGB Performance (optimized hyperparameters, all features):")
print_metrics(y_val, baseline_pred)
print(f"Using {X_train_processed.shape[1]} features")

# Store baseline for comparison
baseline_mae = mean_absolute_error(y_val, baseline_pred)

=== BASELINE PERFORMANCE WITH OPTIMIZED HYPERPARAMETERS ===
Baseline HGB Performance (optimized hyperparameters, all features):
MAE: 1323.7920 | RMSE: 4484463.9519 | R2: 0.9517
Using 145 features


#### Step 2: SHAP Feature Importance Analysis

We use SHAP's TreeExplainer to calculate feature importance values. TreeExplainer is specifically optimized for tree-based models and provides exact Shapley values efficiently.

In [20]:
# Calculate SHAP values for feature importance analysis
print("=== CALCULATING SHAP VALUES ===")

# Set seeds for reproducibility
np.random.seed(42)

# Use TreeExplainer for fast computation with tree models
explainer = shap.TreeExplainer(hgb_baseline)

# Calculate SHAP values on a sample (for speed) - REPRODUCIBLE
sample_size = min(1000, len(X_train_processed))
sample_indices = np.random.choice(len(X_train_processed), sample_size, replace=False)
X_sample = X_train_processed[sample_indices]

print(f"Computing SHAP values for {sample_size} samples...")
shap_values = explainer.shap_values(X_sample)

# Calculate feature importance (mean absolute SHAP values)
feature_importance = np.abs(shap_values).mean(axis=0)

# Create feature importance DataFrame
shap_importance_df = pd.DataFrame({
    'feature': feature_names_all,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

print("Top 20 most important features:")
print(shap_importance_df.head(20))

print(f"\nBottom 10 least important features:")
print(shap_importance_df.tail(10))

=== CALCULATING SHAP VALUES ===
Computing SHAP values for 1000 samples...
Top 20 most important features:
                    feature   importance
4           model_med_price  2798.250661
5                       age  2271.378566
136     transmission_Manual  1656.190012
0                   mileage  1557.877704
8                engineSize  1511.617460
7                       mpg   798.939048
3           brand_med_price   781.774598
1            miles_per_year   191.858515
2                model_freq   145.477741
6                       tax   123.331385
144         fuelType_Petrol   118.094618
142         fuelType_Hybrid   106.727948
62              model_focus   102.024200
14               Brand_Ford    76.184813
140         fuelType_Diesel    46.629148
138  transmission_Semi-Auto    32.958380
12               Brand_Audi    31.773527
135  transmission_Automatic    30.133471
18              Brand_Skoda    26.802926
57            model_e class    25.986541

Bottom 10 least important featur

#### Step 3: Automated Feature Selection Optimization

Now we systematically test different numbers of top features to find the optimal subset. We train the model with the same optimized hyperparameters but using only the most important features identified by SHAP.

In [21]:
# Evaluate different feature subset sizes (5-145)
feature_counts = [5, 10, 15, 20, 30, 50, 70, 100, 145]
results = []

print("=== AUTOMATED FEATURE SELECTION ANALYSIS ===")

for n_features in feature_counts:
    if n_features == 145:
        # Use baseline result for all features
        mae_subset = baseline_mae
        print(f"All {n_features} features → MAE: {mae_subset:.1f} (baseline)")
    else:
        # Select top N features based on SHAP importance
        top_features = shap_importance_df.head(n_features)["feature"].tolist()
        feature_indices = [i for i, fname in enumerate(feature_names_all) if fname in top_features]

        # Create subset datasets
        X_train_subset = X_train_processed[:, feature_indices]
        X_val_subset = X_val_processed[:, feature_indices]

        # Train model with selected features using same optimized hyperparameters
        hgb_selected = HistGradientBoostingRegressor(**best_params)
        hgb_selected.fit(X_train_subset, y_train)
        pred_subset = hgb_selected.predict(X_val_subset)
        mae_subset = mean_absolute_error(y_val, pred_subset)

        improvement = baseline_mae - mae_subset
        print(f"Top {n_features:2d} features → MAE: {mae_subset:.1f} (change: {improvement:+.1f})")
    
    # Store results for optimization analysis
    results.append({
        'n_features': n_features,
        'mae': mae_subset,
        'improvement': baseline_mae - mae_subset
    })
    

=== AUTOMATED FEATURE SELECTION ANALYSIS ===
Top  5 features → MAE: 1615.8 (change: -292.0)
Top 10 features → MAE: 1360.6 (change: -36.8)
Top 15 features → MAE: 1299.8 (change: +23.9)
Top 20 features → MAE: 1302.3 (change: +21.5)
Top 30 features → MAE: 1316.7 (change: +7.1)
Top 50 features → MAE: 1318.8 (change: +5.0)
Top 70 features → MAE: 1316.5 (change: +7.3)
Top 100 features → MAE: 1364.1 (change: -40.3)
All 145 features → MAE: 1323.8 (baseline)


In [22]:
# Evaluate different feature subset sizes in detail (10-30)
feature_counts = [10, 11, 12, 13, 14, 15, 16, 17,
  18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
results = []

print("=== AUTOMATED FEATURE SELECTION ANALYSIS ===")

for n_features in feature_counts:
    if n_features == 145:
        # Use baseline result for all features
        mae_subset = baseline_mae
        print(f"All {n_features} features → MAE: {mae_subset:.1f} (baseline)")
    else:
        # Select top N features based on SHAP importance
        top_features = shap_importance_df.head(n_features)["feature"].tolist()
        feature_indices = [i for i, fname in enumerate(feature_names_all) if fname in top_features]

        # Create subset datasets
        X_train_subset = X_train_processed[:, feature_indices]
        X_val_subset = X_val_processed[:, feature_indices]

        # Train model with selected features using same optimized hyperparameters
        hgb_selected = HistGradientBoostingRegressor(**best_params)
        hgb_selected.fit(X_train_subset, y_train)
        pred_subset = hgb_selected.predict(X_val_subset)
        mae_subset = mean_absolute_error(y_val, pred_subset)

        improvement = baseline_mae - mae_subset
        print(f"Top {n_features:2d} features → MAE: {mae_subset:.1f} (change: {improvement:+.1f})")
    
    # Store results for optimization analysis
    results.append({
        'n_features': n_features,
        'mae': mae_subset,
        'improvement': baseline_mae - mae_subset
    })

# Find optimal feature configuration
results_df = pd.DataFrame(results)
best_result = results_df.loc[results_df['mae'].idxmin()]

print(f"\n=== OPTIMAL FEATURE SELECTION RESULTS ===")
print(f"Best performance: {best_result['n_features']:.0f} features")
print(f"Best MAE: {best_result['mae']:.1f}")
print(f"Improvement over baseline: {best_result['improvement']:.1f} MAE")

# Display optimal feature set
optimal_n = int(best_result['n_features'])
optimal_features = shap_importance_df.head(optimal_n)['feature'].tolist()

print(f"\nOptimal {optimal_n} features for production model:")
for i, feat in enumerate(optimal_features[:15]):
    importance = shap_importance_df.iloc[i]['importance']
    print(f"{i+1:2d}. {feat:20s} (importance: {importance:.1f})")
if optimal_n > 15:
    print(f"... and {optimal_n-15} more features")

print(f"\n=== FINAL RECOMMENDATION ===")
print(f"Current baseline (145 features): {baseline_mae:.1f} MAE")
print(f"Optimized model ({optimal_n} features): {best_result['mae']:.1f} MAE")

if best_result['improvement'] > 0:
    print(f"Improvement: {best_result['improvement']:.1f} MAE better")
    print(f"Recommendation: Use top {optimal_n} features for final submission")
else:
    print(f"Result: Feature selection shows minimal impact")
    print(f"Recommendation: Use all features (preprocessing already well-optimized)")
    

=== AUTOMATED FEATURE SELECTION ANALYSIS ===
Top 10 features → MAE: 1360.6 (change: -36.8)
Top 11 features → MAE: 1320.4 (change: +3.4)
Top 12 features → MAE: 1310.6 (change: +13.2)
Top 13 features → MAE: 1317.3 (change: +6.5)
Top 14 features → MAE: 1346.4 (change: -22.7)
Top 15 features → MAE: 1299.8 (change: +23.9)
Top 16 features → MAE: 1322.9 (change: +0.9)
Top 17 features → MAE: 1303.3 (change: +20.5)
Top 18 features → MAE: 1302.1 (change: +21.7)
Top 19 features → MAE: 1295.2 (change: +28.6)
Top 20 features → MAE: 1302.3 (change: +21.5)
Top 21 features → MAE: 1299.7 (change: +24.1)
Top 22 features → MAE: 1335.6 (change: -11.8)
Top 23 features → MAE: 1312.1 (change: +11.7)
Top 24 features → MAE: 1343.0 (change: -19.2)
Top 25 features → MAE: 1337.2 (change: -13.4)
Top 26 features → MAE: 1343.5 (change: -19.7)
Top 27 features → MAE: 1334.8 (change: -11.0)
Top 28 features → MAE: 1356.5 (change: -32.7)
Top 29 features → MAE: 1332.2 (change: -8.5)
Top 30 features → MAE: 1316.7 (change: 

In [None]:
# Put all engineered features that didn't positively affect MAE for HGB or RF at all here:



TASK IV (4 Points): Build a simple model and assess the performance
- Identify the type of problem and select the relevant algorithms
- Select one Model Assessment Strategy to use throughout your work. Which metrics are you using to evaluate your model and why?


=> Tip from lecturer: Use RandomSearch instead of GridSearchCV, set a wider Range


In [None]:
# Model evaluation metrics used throughout this analysis:
#
#   > MAE: Mean Absolute Error - average absolute deviation between predicted and true car prices
#          Easy to interpret in pounds, same metric used by Kaggle competition
#   > RMSE: Root Mean Squared Error - sensitive to outliers, helps identify large prediction errors  
#   > R²: Coefficient of determination - proportion of variance explained by the model
#         1.0 = perfect predictions, 0.0 = same as predicting mean, < 0.0 = worse than mean
#
# These metrics are appropriate for regression problems predicting continuous variables (car prices)

In [131]:
# Absolute basic baselining with the mean and median

mean_pred = y_train.mean()
median_pred = y_train.median()

print("baseline mean predictor: ")
print_metrics(y_val, [mean_pred]*len(y_val))
# MAE: 6976.3626 | RMSE: 92839550.2849 | R2: -0.0000

print("-"*150)

print("baseline median predictor: ") 
print_metrics(y_val, [median_pred]*len(y_val))
# MAE: 6751.1604 | RMSE: 97557866.6363 | R2: -0.0508

baseline mean predictor: 
MAE: 6976.3626 | RMSE: 92839550.2849 | R2: -0.0000
------------------------------------------------------------------------------------------------------------------------------------------------------
baseline median predictor: 
MAE: 6751.1604 | RMSE: 97557866.6363 | R2: -0.0508


In [138]:
# Models Setup (inkl. Prepro in Pipeline)

### LINEAR MODEL

# ElasticNet
elastic_pipe = Pipeline([
    ("preprocess", preprocessor_linear),
    ("model", ElasticNet(
        alpha=0.01,            # mild regularization to stabilize if many features
        l1_ratio=0.5,          # balanced L1/L2, can grid-search
        max_iter=30000,        # allow more convergence iterations
        tol=1e-4,              # stricter tolerance often improves accuracy
        selection="cyclic",    # usually converges faster than random
        random_state=42
    ))
])


### TREE MODELS

# HistGradientBoostingRegressor: modern and very fast, handles missing values natively (no imputation needed!). often matches or beats XGBoost/LightGBM 
hgb_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", HistGradientBoostingRegressor(
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        l2_regularization=0.5, # regularize slightly to prevent overfit, > 0.5 does not seem to work
        random_state=42  
    ))
])


# RandomForestRegressor: excellent general baseline ensemble, handles non-linearities well, doesn’t overfit easily but can be slow for large data
rf_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(
        n_estimators=300,
        max_depth=None,
        min_samples_split=3,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=True,
        n_jobs=-1,
        random_state=42
    ))
])

# ExtraTreesRegressor: similar to RandomForest but with more randomization => often better generalization
et_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", ExtraTreesRegressor(
        n_estimators=400,          
        max_depth=None,
        min_samples_leaf=2,
        max_features="sqrt",
        bootstrap=False,
        n_jobs=-1,
        random_state=42
    ))
])


### KERNEL BASED MODEL

# SVR: powerful, but slow on large data, sensitive to scaling => already handled in preprocessing
svr_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", SVR(
        kernel="rbf",
        C=10,
        epsilon=0.1,     # slightly tighter margin
        gamma="scale"    # default: 1 / (n_features * X.var())
    ))
])


### ENSEMBLE META MODEL

# StackingRegressor: stacks/blends multiple models => typically gives a small but consistent boost in leaderboard competitions
stack_pipe = StackingRegressor(
    estimators=[
        ("elastic", elastic_pipe),
        ("hgb", hgb_pipe),
        ("rf", rf_pipe),
    ],
    final_estimator=HistGradientBoostingRegressor(
        learning_rate=0.05,
        max_depth=5,
        l2_regularization=0.5,
        random_state=42
    ),
    passthrough=True,     # allow meta-model to see raw inputs too
    n_jobs=-1
)


### Hyperparameter Tuning, Evaluation and Feature Importance

##### ElasticNet

In [None]:
# Hyperparameter Tuning: ElasticNet

elastic_param_grid = {
    "model__alpha": [0.001],    # also tried 0.01, 0.05, 0.1, 0.5
    "model__l1_ratio": [0.9]    # also tried 0.1, 0.3, 0.5, 0.7  
}

elastic_grid = GridSearchCV(
    elastic_pipe, 
    param_grid=elastic_param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
    n_jobs=-1,
    verbose=1
)

elastic_grid.fit(X_train, y_train)
elastic_best = elastic_grid.best_estimator_
elastic_val_pred = elastic_best.predict(X_val)


print("ElasticNet Results: ")
print_metrics(y_val, elastic_val_pred)
print("Best ElasticNet params:", elastic_grid.best_params_)

#ElasticNet Results: 
#MAE: 2543.7302 | RMSE: 16690880.0888 | R2: 0.8202
#Best ElasticNet params: {'model__alpha': 0.001, 'model__l1_ratio': 0.9}

##### HistGradientBoost

In [140]:
# Hyperparameter Tuning: HistGradientBoost

hgb_param_dist = {
    "model__learning_rate": uniform(0.01, 0.09),       # samples values between 0.01–0.10
    "model__max_leaf_nodes": randint(20, 120),         # tries between 20–120 leaves
    "model__min_samples_leaf": randint(2, 20),         # samples leaf sizes between 2–20
    "model__max_iter": randint(400, 1000),             # tries 400–1000 iterations
    "model__l2_regularization": uniform(0.0, 1.0)      # samples small regularization values
}

cv = KFold(n_splits=3, shuffle=True, random_state=42)  # 3-fold for faster runtime

# Randomized search setup
hgb_random = RandomizedSearchCV(
    estimator=hgb_pipe,
    param_distributions=hgb_param_dist,
    n_iter=30,                         # number of random combinations to try
    scoring="neg_mean_absolute_error", # optimize for MAE
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=2
)

# Fit the search
hgb_random.fit(X_train, y_train)

# Get best model
hgb_best_rand = hgb_random.best_estimator_
print("Best Params:", hgb_random.best_params_)

# Evaluate on validation set
hgb_val_pred = hgb_random.best_estimator_.predict(X_val)
print_metrics(y_val, hgb_val_pred)

# Save model for later use
#joblib.dump(hgb_XYZ, "hgb_XYZ") # neuen namen vergeben und speichern wenn besser als hgb_best_1

Fitting 3 folds for each of 30 candidates, totalling 90 fits
[CV] END model__l2_regularization=0.3745401188473625, model__learning_rate=0.09556428757689245, model__max_iter=506, model__max_leaf_nodes=91, model__min_samples_leaf=8; total time=  14.5s
[CV] END model__l2_regularization=0.3745401188473625, model__learning_rate=0.09556428757689245, model__max_iter=506, model__max_leaf_nodes=91, model__min_samples_leaf=8; total time=  15.5s
[CV] END model__l2_regularization=0.3745401188473625, model__learning_rate=0.09556428757689245, model__max_iter=506, model__max_leaf_nodes=91, model__min_samples_leaf=8; total time=  18.3s
[CV] END model__l2_regularization=0.14286681792194078, model__learning_rate=0.06857996256539675, model__max_iter=708, model__max_leaf_nodes=21, model__min_samples_leaf=13; total time=  18.5s
[CV] END model__l2_regularization=0.14286681792194078, model__learning_rate=0.06857996256539675, model__max_iter=708, model__max_leaf_nodes=21, model__min_samples_leaf=13; total tim

In [0]:
# Results of current best HGB (hgb_best_1.pk)

# Params: {'model__l2_regularization': 0.49379559636439074, 'model__learning_rate': 0.05704595464437946, 'model__max_iter': 414, 'model__max_leaf_nodes': 109, 'model__min_samples_leaf': 11}
# MAE: 1326.7973 | RMSE: 4663930.5059 | R2: 0.9498

# Features:
# numeric_features = ["age", "tax", "mpg", "engineSize", "paintQuality", "previousOwners", "brand_anchor"]
# log_features = ["mileage", "miles_per_year", "model_freq", "brand_med_price", "model_med_price"] # could try to test previousOwners, age here; tax, mpg didnt work
# categorical_features = ["Brand", "model", "transmission", "fuelType"]

###############################################################################################################################################################################



In [0]:
# OUTDATED HT: HistGradientBoost - with GridSearch - but good to quickly check something

hgb_param_grid = {
    "model__learning_rate": [0.06], # also tried: 0.02, 0.04, 0.05, 0.1
    "model__max_leaf_nodes": [50], # also tried: 15, 25, 31, 60
    "model__min_samples_leaf": [5], # also tried: 8, 10, 15, 20
    "model__max_iter": [800] # also tried: 500, 1000
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)

hgb_grid = GridSearchCV(
    estimator=hgb_pipe,
    param_grid=hgb_param_grid,
    cv=cv,
    scoring="neg_mean_absolute_error",  # optimize MAE
    n_jobs=-1,
    verbose=2
)

hgb_grid.fit(X_train, y_train)
hgb_best_2 = hgb_grid.best_estimator_

hgb_val_pred = hgb_best_2.predict(X_val)
print_metrics(y_val, hgb_val_pred)

# Best Parameters: {'model__learning_rate': 0.06, 'model__max_iter': 800, 'model__max_leaf_nodes': 50, 'model__min_samples_leaf': 5}
# MAE: 1304.7611 | RMSE: 4503446.5247 | R2: 0.9515 
# joblib.dump(hgb_best_1, "hgb_best_1.pkl") => save current best model


# Save model for later use
#joblib.dump(hgb_XYZ, "hgb_XYZ") # neuen namen vergeben und speichern wenn besser als hgb_best_1

##### RandomForest

In [0]:
# Hyperparameter Tuning: RandomForest

rf_param_dist = {
    "model__n_estimators": randint(200, 600),        # number of trees
    "model__max_depth": randint(5, 40),              # depth of each tree
    "model__min_samples_split": randint(2, 10),      # min samples to split an internal node
    "model__min_samples_leaf": randint(1, 8),        # min samples per leaf
    "model__max_features": ["sqrt", "log2"],         # feature sampling strategy
    "model__bootstrap": [True, False]                # use bootstrapping or not
}

cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Randomized search setup
rf_random = RandomizedSearchCV(
    estimator=rf_pipe,
    param_distributions=rf_param_dist,
    n_iter=30,                         # number of random combinations
    scoring="neg_mean_absolute_error", 
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=2
)

# Fit the search
rf_random.fit(X_train, y_train)

# Best model
rf_best_rand = rf_random.best_estimator_
print("Best Params:", rf_random.best_params_)

# Evaluate on validation set
rf_val_pred = rf_best_rand.predict(X_val)
print_metrics(y_val, rf_val_pred)

# Save best model
# joblib.dump(rf_best_rand, "rf_best.pkl")


# Best Params: {'model__bootstrap': False, 'model__max_depth': 32, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 1, 'model__min_samples_split': 9, 'model__n_estimators': 467}
# MAE: 1403.6263 | RMSE: 5529283.1606 | R2: 0.9404

In [0]:
# Results of current best RF > Samuel

# Best Params: {'model__bootstrap': False, 'model__max_depth': 32, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 1, 'model__min_samples_split': 9, 'model__n_estimators': 467}
# MAE: 1403.6263 | RMSE: 5529283.1606 | R2: 0.9404

# Features:
# numeric_features = ["age", "tax", "mpg", "engineSize", "paintQuality", "previousOwners", "brand_anchor"]
# log_features = ["mileage", "miles_per_year", "model_freq", "brand_med_price", "model_med_price"] # could try to test previousOwners, age here; tax, mpg didnt work
# categorical_features = ["Brand", "model", "transmission", "fuelType"]

###############################################################################################################################################################################



##### ExtraTrees

In [0]:
# Hyperparameter Tuning: ExtraTrees

et_param_dist = {
    "model__n_estimators": randint(200, 600),
    "model__max_depth": randint(5, 40),
    "model__min_samples_split": randint(2, 10),
    "model__min_samples_leaf": randint(1, 8),
    "model__max_features": ["sqrt", "log2"],
    "model__bootstrap": [True, False]
}

et_random = RandomizedSearchCV(
    estimator=et_pipe,
    param_distributions=et_param_dist,
    n_iter=30,
    scoring="neg_mean_absolute_error",
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=2
)

et_random.fit(X_train, y_train)

et_best = et_random.best_estimator_
print("ExtraTrees Best Params:", et_random.best_params_)

et_val_pred = et_best.predict(X_val)
print_metrics(y_val, et_val_pred)

# joblib.dump(et_best, "et_best.pkl")

In [0]:
# Results of current best ET > Samuel

# ExtraTrees Best Params: {'model__bootstrap': False, 'model__max_depth': 32, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 1, 'model__min_samples_split': 9, 
# 'model__n_estimators': 467}
# MAE: 1631.3539 | RMSE: 7549459.7557 | R2: 0.9187

# Features:
# numeric_features = ["age", "tax", "mpg", "engineSize", "paintQuality", "previousOwners", "brand_anchor"]
# log_features = ["mileage", "miles_per_year", "model_freq", "brand_med_price", "model_med_price"] # could try to test previousOwners, age here; tax, mpg didnt work
# categorical_features = ["Brand", "model", "transmission", "fuelType"]

###############################################################################################################################################################################



##### SupportVectorRegressor

In [0]:
# Hyperparameter Tuning: SupportVectorRegressor

svr_param_dist = {
    "model__C": loguniform(1e-1, 1e3),          # wide search over regularization
    "model__epsilon": uniform(0.05, 0.3),       # small-margin tolerance
    "model__kernel": ["rbf"],                   # best general-purpose kernel
    "model__gamma": ["scale", "auto"]
}

svr_random = RandomizedSearchCV(
    estimator=svr_pipe,
    param_distributions=svr_param_dist,
    n_iter=25,
    scoring="neg_mean_absolute_error",
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=2
)

svr_random.fit(X_train, y_train)

svr_best = svr_random.best_estimator_
print("SVR Best Params:", svr_random.best_params_)

svr_val_pred = svr_best.predict(X_val)
print_metrics(y_val, svr_val_pred)

# joblib.dump(svr_best, "svr_best.pkl")

# MAE: 1683.6556 | RMSE: 10282674.8643 | R2: 0.8892

In [0]:
# Results of current best SVR > Samuel

# MAE: 1683.6556 | RMSE: 10282674.8643 | R2: 0.8892

# Features:
# numeric_features = ["age", "tax", "mpg", "engineSize", "paintQuality", "previousOwners", "brand_anchor"]
# log_features = ["mileage", "miles_per_year", "model_freq", "brand_med_price", "model_med_price"] # could try to test previousOwners, age here; tax, mpg didnt work
# categorical_features = ["Brand", "model", "transmission", "fuelType"]

###############################################################################################################################################################################



##### StackingRegressor

In [0]:
# Hyperparameter Tuning: StackingRegressor

stack_param_dist = {
    "final_estimator__learning_rate": uniform(0.02, 0.08),
    "final_estimator__max_depth": randint(3, 7),
    "final_estimator__min_samples_leaf": randint(3, 15),
    "final_estimator__l2_regularization": uniform(0.0, 1.0)
}

stack_random = RandomizedSearchCV(
    estimator=stack_pipe,
    param_distributions=stack_param_dist,
    n_iter=25,
    scoring="neg_mean_absolute_error",
    cv=3,                 # low CV because stacking is slow
    n_jobs=-1,
    random_state=42,
    verbose=2
)

stack_random.fit(X_train, y_train)

stack_best = stack_random.best_estimator_
print("StackingRegressor Best Params:", stack_random.best_params_)

stack_val_pred = stack_best.predict(X_val)
print_metrics(y_val, stack_val_pred)

# joblib.dump(stack_best, "stack_best.pkl")

In [0]:
# Results of current best SVR



###############################################################################################################################################################################



### Kaggle Competition

Extra Task (1 Point): Be in the Top 5 Groups on Kaggle

In [0]:
# Load best Models from Joblib

# hgb_best_1
hgb_best_1 = joblib.load("hgb_best_1.pkl")

In [0]:
# Pick best model and predict on test:
df_cars_test['price'] = hgb_best_1.predict(df_cars_test)

df_cars_test['price'].to_csv('Group05_VersionXX.csv', index=True) # currently version 3


In [0]:
!kaggle competitions submit -c cars4you -f Group05_VersionXX.csv -m "Message"

In [0]:
!kaggle competitions submissions -c cars4you