# Homework Group 49 - Car Price Regression

# Table of contents

* [0. Importing required libraries](#0-importing-required-libraries)

* [1. Loading dataset](#1-loading-dataset)
    * [1.1 Loading dataset](#11-loading-dataset)
    * [1.2 Streamline column names](#12-streamline-column-names)

* [2. Implementing model pipeline components](#2-implementing-model-pipeline-components)
    * [2.1 Cleaning and Imputation](#21-cleaning-and-imputation)
    * [2.2 Feature Engineering](#22-feature-engineering)
    * [2.3 Feature Definition and Scoring Configuration](#23-feature-definition-and-scoring-configuration)

* [3. Feature Selection](#3-feature-selection)
    * [3.0 Create Baseline Pipeline](#30-create-baseline-pipeline)
    * [3.1 Filter Methods](#31-filter-methods)
        * [3.1.1 Variance Thresholding](#311-variance-thresholding)
        * [3.1.2 Correlations](#312-correlations)
        * [3.1.3 F_Regression and Mutual Information](#313-f_regression-and-mutual-information)
    * [3.2 Wrapper Methods](#32-wrapper-methods)
    * [3.3 Comparing feature selection results](#33-comparing-feature-selection-results)

* [4. Model training](#4-model-training)

* [5. Tuning Hyperparameters](#5-tuning-hyperparameters)

* [6. Training and utilizing final model](#6-training-and-utilizing-final-model)
    * [6.1 Training final model](#61-training-final-model)
    * [6.2 Making final predictions](#62-making-final-predictions)

# 0. Importing required libraries

In [0]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.model_selection import cross_validate, KFold, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression, mutual_info_regression 

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

---

# 1. Loading dataset

## 1.1 Loading dataset

In [0]:
train_raw = pd.read_csv("train.csv")
test_raw = pd.read_csv("test.csv")

## 1.2 Streamline column names
To make the columns easier to work with, we standardise the column names of both datasets to snake_case.

In [0]:
train_raw = train_raw.rename(
    columns={
        "carID": "car_id",
        "Brand": "brand",
        "model": "model",
        "year": "year",
        "price": "price",
        "transmission": "transmission",
        "mileage": "mileage",
        "fuelType": "fuel_type",
        "tax": "tax",
        "mpg": "mpg",
        "engineSize": "engine_size",
        "paintQuality%": "paint_quality",
        "previousOwners": "previous_owners",
        "hasDamage": "has_damage",
    }
)

test_raw = test_raw.rename(
    columns={
        "carID": "car_id",
        "Brand": "brand",
        "model": "model",
        "year": "year",
        "price": "price",
        "transmission": "transmission",
        "mileage": "mileage",
        "fuelType": "fuel_type",
        "tax": "tax",
        "mpg": "mpg",
        "engineSize": "engine_size",
        "paintQuality%": "paint_quality",
        "previousOwners": "previous_owners",
        "hasDamage": "has_damage",
    }
)
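
Both rename calls use the same mapping, so it could be defined once and applied to each frame. The following is a minimal sketch, not the notebook's actual code: the helper name `standardise_columns` and the two tiny demo frames are assumptions made for illustration. By default `DataFrame.rename` ignores mapping keys that are absent from the frame, so a frame without a `price` column is handled by the same call.

```python
import pandas as pd

# Single mapping that lists only the columns whose names actually change.
COLUMN_RENAMES = {
    "carID": "car_id",
    "Brand": "brand",
    "fuelType": "fuel_type",
    "engineSize": "engine_size",
    "paintQuality%": "paint_quality",
    "previousOwners": "previous_owners",
    "hasDamage": "has_damage",
}


def standardise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with snake_case column names (hypothetical helper)."""
    return df.rename(columns=COLUMN_RENAMES)


# Tiny illustrative frames, not the real data.
train_demo = pd.DataFrame({"carID": [1], "Brand": ["audi"], "price": [15000]})
test_demo = pd.DataFrame({"carID": [2], "Brand": ["bmw"]})

train_demo = standardise_columns(train_demo)
test_demo = standardise_columns(test_demo)
print(list(train_demo.columns), list(test_demo.columns))
```

Keeping one mapping avoids the two dictionaries drifting apart when a column is added or renamed.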

---

# 2. Implementing model pipeline components

In this part, we implement custom pipeline classes. We rely on custom components because several preprocessing steps are highly domain-specific and must integrate with the pipeline paradigm proposed by sklearn.
The pipeline classes are:
- CleanerTransformer()
- GroupImputer()
- FeatureEngineer()
- TargetMeanEncoder()

Because every step is part of the pipeline, it is re-fitted on the training folds only, so no data leakage can occur during cross-validation or hyperparameter tuning.
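
The leakage argument can be illustrated with a small, self-contained sketch. It does not use the notebook's classes: `MedianFiller` is a hypothetical stand-in for a learned preprocessing step, and the data is synthetic. The point is only that `cross_validate` clones the whole pipeline and calls `fit` on the training part of each split, so the medians never see the held-out rows.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline


class MedianFiller(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fills NaN with column medians learned in fit."""

    def fit(self, X, y=None):
        # Statistics come only from the data passed to fit (the training fold).
        self.medians_ = np.nanmedian(X, axis=0)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float, copy=True)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.medians_[cols]
        return X


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=60)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

pipe = Pipeline([("impute", MedianFiller()), ("model", Ridge())])
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_validate(pipe, X, y, cv=cv, scoring="neg_mean_absolute_error")
print(scores["test_score"])
```

Fitting `MedianFiller` on the full data before splitting would instead let validation rows influence the medians, which is the leakage the pipeline design avoids.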

## 2.1 Cleaning and Imputation

**Cleaner Transformer**

CleanerTransformer standardizes raw car data and removes any logically impossible values before any modeling or imputation. It applies deterministic string normalization and plausibility checks.

Purpose: The goal is to make semantically identical cars comparable across records and to remove values that cannot be real (for example negative mileage). This avoids false values in the columns such as misspelled fuel types and avoids unrealistic numeric values leaking into downstream steps.

Steps: 
1. Text normalization: Columns 'brand', 'model', 'transmission', 'fuel_type' are lowercased and stripped of extra whitespace to enable consistent matching.
2. String alignment / mapping:
    - 'fuel_type' is mapped using a lookup. Common typos are fixed. Values like "other", "unknown", and "electric" are set to NaN because they are either non-informative catch-alls or rare singletons that would create unstable categories.
    - 'transmission' is mapped using a lookup. Misspellings are corrected. "unknown" and "other" are set to NaN.
    - 'model' is mapped to a standardized canonical model name via a dictionary lookup (model_to_clean) that also covers truncated spellings.
    - 'brand' is recomputed from the cleaned model (model_to_brand_clean) to fix cars that were assigned to the wrong brand.
3. Plausibility and logical checks on numerical columns:
    - 'year': values > 2020 are set to NaN because the dataset was collected in 2020 so future cars cannot exist. Values with decimals are set to NaN since model years are discrete.
    - 'mileage': negative mileage is set to NaN. Mileage entries with decimals are set to NaN because mileage is expected as an integer reading and only a minority of rows use floats.
    - 'tax': tax must be a non-negative integer. Negative values and decimal values are set to NaN.
    - 'mpg': negative mpg is set to NaN. Values with many decimal places are set to NaN.
    - 'engine_size': values < 1 liter are set to NaN as implausible for this dataset. Values with many decimal places are set to NaN.
    - 'paint_quality': must be a percentage from 0 to 100. Values > 100 and values with decimals are set to NaN.
    - 'previous_owners': negative or decimal counts are set to NaN because owner count must be a non-negative integer.
4. Column removal: 'has_damage' is dropped because it only contains 0 or NaN and therefore carries no information.

Notice: After this step, more values are NaN than before. The GroupImputer that runs afterwards handles these cases.
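
The numeric checks in step 3 combine a range test with a test for decimals. The sketch below applies the same string-based decimal test that the cleaner uses to a small made-up series of years, next to a numeric alternative (`% 1`) that is shown only for comparison and is not part of the notebook.

```python
import pandas as pd

# Made-up example values: a whole year, a year with decimals,
# a whole year ending in zero, and a year in the future.
year = pd.Series([2017.0, 2018.5, 2010.0, 2023.0])

# String-based test: strip trailing zeros from the text form; if digits
# remain after the dot, the value is not a whole number.
has_decimals = year.astype(str).str.rstrip("0").str.contains(r"\.\d+")

# Numeric alternative (an assumption, not the notebook's approach).
# With missing values present it would need an extra `year.notna()` guard,
# because NaN % 1 != 0 evaluates to True.
has_decimals_numeric = (year % 1) != 0

# Combine with the plausibility rule "no model year after 2020".
cleaned = year.mask((year > 2020) | has_decimals)
print(cleaned)
```

Both tests flag only the second value here; the range rule additionally removes the last one.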

In [0]:
class CleanerTransformer(BaseEstimator, TransformerMixin): 
    
    def __init__(self) -> None:
        """Initialize the CleanerTransformer."""
        pass

    def fit(self, X, y=None): 
        """Fit method - does nothing as no fitting is required."""
        return self
    
    def transform(self, X): 
        """
        Clean and standardize the input DataFrame X.

        Operations:
        - Lowercases and strips the text columns.
        - Standardizes fuel types and transmission types via lookup tables.
        - Maps model names to canonical model names.
        - Recomputes the brand from the cleaned model name.
        - Sets invalid or out-of-range numerical values to NaN.
        - Drops the 'has_damage' column.

        Args:
            X (pd.DataFrame): Input DataFrame containing car data.

        Returns:
            pd.DataFrame: Cleaned and standardized DataFrame.
        """


        df = X.copy()

        # Transmission
        transmission_alignment = {
            # Semi-Automatic
            'semi-auto': 'semi-automatic',
            'semi-aut': 'semi-automatic',
            'emi-auto': 'semi-automatic',
            'emi-aut': 'semi-automatic',
            # Automatic
            'automatic': 'automatic',
            'automati': 'automatic',
            'utomatic': 'automatic',
            'utomati': 'automatic',
            # Manual
            'manual': 'manual',
            'manua': 'manual',
            'anual': 'manual',
            'anua': 'manual',
            # Unknown
            'unknown': np.nan,
            'unknow': np.nan,
            'nknown': np.nan,
            'nknow': np.nan,
            # Other
            'other': np.nan,
            # Missing values
            np.nan: np.nan
        }

        # Model
        model_mapping = {
                    'audi': {
                        'a': 'a series (unspecified)',
                        'a1': 'a1',
                        'a2': 'a2',
                        'a3': 'a3',
                        'a4': 'a4',
                        'a5': 'a5',
                        'a6': 'a6',
                        'a7': 'a7',
                        'a8': 'a8',
                        'q': 'q series (unspecified)',
                        'q2': 'q2',
                        'q3': 'q3',
                        'q5': 'q5',
                        'q7': 'q7',
                        'q8': 'q8',
                        'r8': 'r8',
                        'rs': 'rs series (unspecified)',
                        'rs3': 'rs3',
                        'rs4': 'rs4',
                        'rs5': 'rs5',
                        'rs6': 'rs6',
                        'rs7': 'rs7',
                        's3': 's3',
                        's4': 's4',
                        's5': 's5',
                        's8': 's8',
                        'sq5': 'sq5',
                        'sq7': 'sq7',
                        't': 'tt',
                        'tt': 'tt'
                    },
                    'bmw': {
                        'i': 'i (unspecified)',
                        '1 serie': '1 series',
                        '1 series': '1 series',
                        '2 serie': '2 series',
                        '2 series': '2 series',
                        '3 serie': '3 series',
                        '3 series': '3 series',
                        '4 serie': '4 series',
                        '4 series': '4 series',
                        '5 serie': '5 series',
                        '5 series': '5 series',
                        '6 serie': '6 series',
                        '6 series': '6 series',
                        '7 serie': '7 series',
                        '7 series': '7 series',
                        '8 serie': '8 series',
                        '8 series': '8 series',
                        'i3': 'i3',
                        'i8': 'i8',
                        'm': 'm series (unspecified)',
                        'm2': 'm2',
                        'm3': 'm3',
                        'm4': 'm4',
                        'm5': 'm5',
                        'm6': 'm6',
                        'x': 'x series (unspecified)',
                        'x1': 'x1',
                        'x2': 'x2',
                        'x3': 'x3',
                        'x4': 'x4',
                        'x5': 'x5',
                        'x6': 'x6',
                        'x7': 'x7',
                        'z': 'z series (unspecified)',
                        'z3': 'z3',
                        'z4': 'z4'
                    },
                    'ford': {
                        'amica': np.nan,
                        'b-ma': 'b-max',
                        'b-max': 'b-max',
                        'c-ma': 'c-max',
                        'c-max': 'c-max',
                        'ecospor': 'ecosport',
                        'ecosport': 'ecosport',
                        'edg': 'edge',
                        'edge': 'edge',
                        'escort': 'escort',
                        'fiest': 'fiesta',
                        'fiesta': 'fiesta',
                        'focu': 'focus',
                        'focus': 'focus',
                        'fusion': 'fusion',
                        'galax': 'galaxy',
                        'galaxy': 'galaxy',
                        'grand c-ma': 'grand c-max',
                        'grand c-max': 'grand c-max',
                        'grand tourneo connec': 'grand tourneo connect',
                        'grand tourneo connect': 'grand tourneo connect',
                        'k': 'ka',
                        'ka': 'ka',
                        'ka+': 'ka+',
                        'kug': 'kuga',
                        'kuga': 'kuga',
                        'monde': 'mondeo',
                        'mondeo': 'mondeo',
                        'mustang': 'mustang',
                        'puma': 'puma',
                        'pum': 'puma',
                        'ranger': 'ranger',
                        's-ma': 's-max',
                        's-max': 's-max',
                        'streetka': 'streetka',
                        'tourneo connect': 'tourneo connect',
                        'tourneo custo': 'tourneo custom',
                        'tourneo custom': 'tourneo custom',
                        'turneo custom': 'tourneo custom',
                        'transit tourneo': np.nan
                    },
                    'hyundai': {
                        'accent': 'accent',
                        'getz': 'getz',
                        'i1': 'i10',
                        'i10': 'i10',
                        'i2': 'i20',
                        'i20': 'i20',
                        'i3': 'i30',
                        'i4': np.nan,
                        'i30': 'i30',
                        'i40': 'i40',
                        'i80': 'i800',
                        'i800': 'i800',
                        'ioni': 'ioniq',
                        'ioniq': 'ioniq',
                        'ix2': 'ix20',
                        'ix20': 'ix20',
                        'ix35': 'ix35',
                        'kon': 'kona',
                        'kona': 'kona',
                        'santa f': 'santa fe',
                        'santa fe': 'santa fe',
                        'terracan': 'terracan',
                        'tucso': 'tucson',
                        'tucson': 'tucson',
                        'veloste': 'veloster'
                    },
                    'mercedes': {
                        '180': np.nan,
                        '200': '200',
                        '220': '220',
                        '230': '230',
                        'a clas': 'a class',
                        'a class': 'a class',
                        'b clas': 'b class',
                        'b class': 'b class',
                        'c clas': 'c class',
                        'c class': 'c class',
                        'cl clas': 'cl class',
                        'cl class': 'cl class',
                        'cla class': 'cla class',
                        'cla clas': 'cla class',
                        'clc class': 'clc class',
                        'clk': 'clk',
                        'cls clas': 'cls class',
                        'cls class': 'cls class',
                        'e clas': 'e class',
                        'e class': 'e class',
                        'g class': 'g class',
                        'g clas': 'g class',
                        'gl class': 'gl class',
                        'gl clas': 'gl class',
                        'gla clas': 'gla class',
                        'gla class': 'gla class',
                        'glb class': 'glb class',
                        'glc clas': 'glc class',
                        'glc class': 'glc class',
                        'gle clas': 'gle class',
                        'gle class': 'gle class',
                        'gls clas': 'gls class',
                        'gls class': 'gls class',
                        'm clas': 'm class',
                        'm class': 'm class',
                        'r class': np.nan,
                        's clas': 's class',
                        's class': 's class',
                        'sl': 'sl class',
                        'sl clas': 'sl class',
                        'sl class': 'sl class',
                        'slk': 'slk',
                        'v clas': 'v class',
                        'v class': 'v class',
                        'x-clas': 'x class',
                        'x-class': 'x class'
                    },
                    'opel': {
                        'ada': 'adam',
                        'adam': 'adam',
                        'agila': 'agila',
                        'ampera': 'ampera',
                        'antara': 'antara',
                        'astr': 'astra',
                        'astra': 'astra',
                        'cascada': 'cascada',
                        'combo lif': 'combo life',
                        'combo life': 'combo life',
                        'cors': 'corsa',
                        'corsa': 'corsa',
                        'crossland': 'crossland',
                        'crossland x': 'crossland x',
                        'grandland': 'grandland x',
                        'grandland x': 'grandland x',
                        'gtc': 'gtc',
                        'insigni': 'insignia',
                        'insignia': 'insignia',
                        'meriv': 'meriva',
                        'meriva': 'meriva',
                        'mokk': 'mokka',
                        'mokka': 'mokka',
                        'mokka x': 'mokka x',
                        'tigra': 'tigra',
                        'vectra': 'vectra',
                        'viv': 'viva',
                        'viva': 'viva',
                        'vivaro': 'vivaro',
                        'zafir': 'zafira',
                        'zafira': 'zafira',
                        'zafira toure': 'zafira tourer',
                        'zafira tourer': 'zafira tourer',
                        'kadjar': np.nan
                    },
                    'skoda': {
                        'citig': 'citigo',
                        'citigo': 'citigo',
                        'fabi': 'fabia',
                        'fabia': 'fabia',
                        'kami': 'kamiq',
                        'kamiq': 'kamiq',
                        'karo': 'karoq',
                        'karoq': 'karoq',
                        'kodia': 'kodiaq',
                        'kodiaq': 'kodiaq',
                        'octavi': 'octavia',
                        'octavia': 'octavia',
                        'rapi': 'rapid',
                        'rapid': 'rapid',
                        'roomste': 'roomster',
                        'roomster': 'roomster',
                        'scal': 'scala',
                        'scala': 'scala',
                        'super': 'superb',
                        'superb': 'superb',
                        'yet': 'yeti',
                        'yeti': 'yeti',
                        'yeti outdoo': 'yeti outdoor',
                        'yeti outdoor': 'yeti outdoor'
                    },
                    'toyota': {
                        'auri': 'auris',
                        'auris': 'auris',
                        'avensis': 'avensis',
                        'ayg': 'aygo',
                        'aygo': 'aygo',
                        'c-h': 'c-hr',
                        'c-hr': 'c-hr',
                        'camry': 'camry',
                        'cam': 'camry',
                        'camr': 'camry',
                        'coroll': 'corolla',
                        'corolla': 'corolla',
                        'gt86': 'gt86',
                        'hilu': 'hilux',
                        'hilux': 'hilux',
                        'iq': 'iq',
                        'land cruise': 'land cruiser',
                        'land cruiser': 'land cruiser',
                        'prius': 'prius',
                        'proace verso': 'proace verso',
                        'rav': 'rav4',
                        'rav4': 'rav4',
                        'supra': 'supra',
                        'urban cruise': 'urban cruiser',
                        'urban cruiser': 'urban cruiser',
                        'vers': 'verso',
                        'verso': 'verso',
                        'verso-s': 'verso-s',
                        'yari': 'yaris',
                        'yaris': 'yaris'
                    },
                    'volkswagen': {
                        'amaro': 'amarok',
                        'amarok': 'amarok',
                        'arteo': 'arteon',
                        'arteon': 'arteon',
                        'beetl': 'beetle',
                        'beetle': 'beetle',
                        'caddy': 'caddy',
                        'caddy life': 'caddy life',
                        'caddy maxi': 'caddy maxi',
                        'caddy maxi lif': 'caddy maxi life',
                        'caddy maxi life': 'caddy maxi life',
                        'california': 'california',
                        'californi': 'california',
                        'caravell': 'caravelle',
                        'caravelle': 'caravelle',
                        'cc': 'cc',
                        'eos': 'eos',
                        'fox': 'fox',
                        'gol': 'golf',
                        'golf': 'golf',
                        'golf s': 'golf sv',
                        'golf sv': 'golf sv',
                        'jetta': 'jetta',
                        'passa': 'passat',
                        'passat': 'passat',
                        'pol': 'polo',
                        'polo': 'polo',
                        'scirocc': 'scirocco',
                        'scirocco': 'scirocco',
                        'shara': 'sharan',
                        'sharan': 'sharan',
                        'shuttle': 'shuttle',
                        't-cros': 't-cross',
                        't-cross': 't-cross',
                        't-ro': 't-roc',
                        't-roc': 't-roc',
                        'tigua': 'tiguan',
                        'tiguan': 'tiguan',
                        'tiguan allspac': 'tiguan allspace',
                        'tiguan allspace': 'tiguan allspace',
                        'touare': 'touareg',
                        'touareg': 'touareg',
                        'toura': 'touran',
                        'touran': 'touran',
                        'u': 'up',
                        'up': 'up'
            }
        }

        # Fuel_type
        alignment_fuel = {
            # Petrol
            'petrol': 'petrol',
            'petro': 'petrol',
            'etrol': 'petrol',
            'etro': 'petrol',

            # Diesel
            'diesel': 'diesel',
            'diese': 'diesel',
            'iesel': 'diesel',
            'iese': 'diesel',

            # Hybrid
            'hybrid': 'hybrid',
            'hybri': 'hybrid',
            'ybrid': 'hybrid',
            'ybri': 'hybrid',

            # Electric
            'electric': np.nan,

            # Other
            'other': np.nan,
            'othe': np.nan,
            'ther': np.nan,

            # Missing values
            np.nan: np.nan
        }

        for col in ['brand', 'model', 'transmission', 'fuel_type']:
            df[col] = df[col].str.lower().str.strip()

        df['fuel_type'] = df['fuel_type'].replace(alignment_fuel)
        df['transmission'] = df['transmission'].replace(transmission_alignment)

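        # Flatten the brand-specific mappings into a single raw -> clean lookup.
        # Note: raw keys that occur under several brands (e.g. 'i3' under both
        # 'bmw' and 'hyundai') resolve to the entry of the brand listed last.
        model_to_clean = {}

        for brand, models in model_mapping.items():
            for key, value in models.items():
                model_to_clean[key] = value

        df['model'] = df['model'].map(model_to_clean)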

        model_to_brand_clean = {}
        for brand, model in model_mapping.items():
            for key, value in model.items():
                if pd.notna(value):
                    model_to_brand_clean[value] = brand

        df['brand'] = df['model'].map(model_to_brand_clean)


        ############################################################################################################################################

        # Decimal check used below: after stripping trailing zeros from the
        # string form, r"\.\d+" matches any remaining decimals, while
        # r"\.\d{2,}" matches only values with two or more decimal places.

        # year
        df.loc[df["year"] > 2020, "year"] = np.nan
        df.loc[df["year"].astype(str).str.rstrip("0").str.contains(r"\.\d+"), "year"] = np.nan

        # mileage
        df.loc[df["mileage"] < 0, "mileage"] = np.nan
        df.loc[df["mileage"].astype(str).str.rstrip("0").str.contains(r"\.\d+"), "mileage"] = np.nan

        # tax
        df.loc[df["tax"] < 0, "tax"] = np.nan
        df.loc[df["tax"].astype(str).str.rstrip("0").str.contains(r"\.\d+"), "tax"] = np.nan

        # mpg
        df.loc[df["mpg"] < 0, "mpg"] = np.nan
        df.loc[df["mpg"].astype(str).str.rstrip("0").str.contains(r"\.\d{2,}"), "mpg"] = np.nan

        # engine size
        df.loc[df["engine_size"] < 1, "engine_size"] = np.nan
        df.loc[df["engine_size"].astype(str).str.rstrip("0").str.contains(r"\.\d{2,}"), "engine_size"] = np.nan

        # paint quality
        df.loc[df["paint_quality"] > 100, "paint_quality"] = np.nan
        df.loc[df["paint_quality"].astype(str).str.rstrip("0").str.contains(r"\.\d+"), "paint_quality"] = np.nan

        # previous owners
        df.loc[df["previous_owners"] < 0, "previous_owners"] = np.nan
        df.loc[df["previous_owners"].astype(str).str.rstrip("0").str.contains(r"\.\d+"), "previous_owners"] = np.nan

        # has_damage
        df.drop(columns=["has_damage"], inplace=True)

        return df
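
One property of the flattened model lookup built in `transform` is worth illustrating: when the same raw key appears under two brands, the brand listed later overwrites the earlier one. The sketch below reproduces this with a two-brand excerpt of the mapping and shows one possible alternative, a lookup keyed by `(brand, raw_model)` that falls back to the flat lookup when the brand is missing. The alternative is an assumption for illustration, not what the notebook does, and it presumes the raw brand is reliable whenever it is present.

```python
import pandas as pd

# Two-brand excerpt of the mapping; 'i3' is a raw key under both brands.
model_mapping = {
    "bmw": {"i3": "i3", "x1": "x1"},
    "hyundai": {"i3": "i30", "i30": "i30"},
}

# Flattened lookup: later brands overwrite earlier ones for shared keys.
flat = {raw: clean for models in model_mapping.values() for raw, clean in models.items()}

# Hypothetical brand-aware lookup keyed by (brand, raw_model).
pair_lookup = {
    (brand, raw): clean
    for brand, models in model_mapping.items()
    for raw, clean in models.items()
}

# Made-up rows: two cars with raw model 'i3' and one car without a brand.
df = pd.DataFrame({"brand": ["bmw", "hyundai", None], "model": ["i3", "i3", "x1"]})

# Prefer the brand-aware entry; fall back to the flat lookup otherwise.
df["model_clean"] = [
    pair_lookup.get((brand, raw), flat.get(raw))
    for brand, raw in zip(df["brand"], df["model"])
]
print(flat["i3"])
print(df)
```

With the flat lookup alone, the BMW row would be mapped to `i30`; whether that matters depends on how often such shared keys occur in the data.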

**GroupImputer**

The GroupImputer class imputes missing categorical and numerical values using subgroup-based statistics instead of global ones. It preserves within-group consistency such as brand–model relationships and avoids loss of rows due to missing data.

Purpose: The transformer fills NaN values in both categorical and numerical features by using learned medians or modes within logical subgroups. This creates more realistic imputations that respect the internal structure of the dataset. It is applied after CleanerTransformer to ensure standardized group identifiers.

Steps: 
1. Fit Phase:
    - For each target column in cat_specs and num_specs, compute the lookup tables based on hierarchical groupings.
    - Categorical variables use the mode (most frequent value) per group.
    - Numerical variables use the median per group.
    - Global fallback statistics (overall mode or median) are stored per column. 
2. Transform Phase:
    - Sequentially apply lookup tables from most specific to most general grouping.
    - For each missing value, search for matches in the hierarchy.
    - If no match found, use the global fallback.
    - Columns defined in int_cols are converted back to integer type.

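Data Leakage Prevention: Lookup tables and fallback values are learned during fitting on the training fold only, using rows where the target and its grouping columns are not NaN. During transform they are applied to the rows with missing values, so no statistics from validation or test data enter the imputation.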

Notice: Must run after CleanerTransformer. 
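
The fit and transform phases described above can be sketched on a toy example. This is a simplified illustration with made-up data and a fixed two-level hierarchy (`brand` + `model`, then `brand`, then the global median); the class below generalises the same idea to arbitrary hierarchies and to modes for categorical columns.

```python
import numpy as np
import pandas as pd

# Made-up training and validation folds.
train_fold = pd.DataFrame({
    "brand": ["audi", "audi", "audi", "bmw", "bmw"],
    "model": ["a3", "a3", "a4", "x1", "x1"],
    "mpg": [50.0, 54.0, 40.0, 45.0, np.nan],
})
valid_fold = pd.DataFrame({
    "brand": ["audi", "audi", "ford"],
    "model": ["a3", "a5", "ka"],
    "mpg": [np.nan, np.nan, np.nan],
})

# "fit": learn the statistics from the training fold only.
known = train_fold.dropna(subset=["mpg"])
by_model = known.groupby(["brand", "model"])["mpg"].median().reset_index()
by_brand = known.groupby("brand")["mpg"].median()
global_median = train_fold["mpg"].median()

# "transform": try the most specific grouping first, then fall back.
from_model = (
    valid_fold[["brand", "model"]]
    .merge(by_model, how="left", on=["brand", "model"])["mpg"]
    .to_numpy()
)
filled = valid_fold["mpg"].fillna(pd.Series(from_model, index=valid_fold.index))
filled = filled.fillna(valid_fold["brand"].map(by_brand))
filled = filled.fillna(global_median)
print(filled.tolist())
```

The first row is filled from its brand and model group, the second from the brand level because `a5` was never seen in training, and the third from the global median because the brand is unknown.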

In [0]:
class GroupImputer(BaseEstimator, TransformerMixin):
    def __init__(self, cat_specs={}, num_specs={}, int_cols=None):
        # cat_specs: dict mapping each categorical column to its grouping hierarchy
        # num_specs: dict mapping each numerical column to its grouping hierarchy
        #            (fit iterates over `specs.items()`, so both must be dicts, not lists)
        # int_cols: list of columns that are converted back to integers after imputation
        # tables_: learned lookup tables, filled during fit
        # fallbacks_: global statistic per column, filled during fit
        self.cat_specs = cat_specs
        self.num_specs = num_specs
        self.int_cols = int_cols
        self.tables_ = {}
        self.fallbacks_ = {}

    @staticmethod
    def _series_mode(s):
        """
        Helper function to compute the mode (most frequent non-NA value) of a pandas Series.

        Args:
            s (pandas.Series): Input series for which the mode should be calculated.

        Returns:
            scalar or numpy.nan: The most frequent non-missing value in the series. If multiple modes exist, the first one is returned. 
            If the series is empty or contains only missing values, return numpy.nan.
        """
        
        mode = s.mode(dropna=True)
        
        if mode.empty:
            return np.nan
        else:  
            return mode.iloc[0]


    @staticmethod
    def _reduce_by_groups(df, target, groups, reducer):
        """
        Creates lookup tables for imputation by aggregating values across different grouping combinations.
        
        The function iterates through various grouping column combinations and computes an aggregated
        value (mode or median) of the target variable for each group. Only rows without missing values
        in both the grouping columns and the target variable are used for aggregation.
        
        Args:
            df: DataFrame containing the data
            target: Name of the column for which values should be aggregated
            groups: List of column combinations to use for grouping
            reducer: Aggregation method ('mode' for categorical, 'median' for numerical data)
        
        Returns:
            List of tuples, where each tuple contains:
            - List of grouping columns used
            - DataFrame with aggregated values as lookup table
        """
        if target not in df.columns:
            return []
        
        out = []
        
        for cols in groups:
            grouping_cols = []
            
            for c in cols:
                if c in df.columns and c != target:
                    grouping_cols.append(c)
            
            if len(grouping_cols) == 0:
                continue
            
            columns_to_check = grouping_cols + [target]
            valid_rows = df.dropna(subset=columns_to_check)
            
            if valid_rows.empty:
                continue
            
            if reducer == "mode":
                agg = valid_rows.groupby(grouping_cols)[target].agg(GroupImputer._series_mode)
            else:
                agg = valid_rows.groupby(grouping_cols)[target].median()
            
            agg = agg.dropna()
            
            if isinstance(agg, pd.Series):
                agg = agg.to_frame(name=target)
            
            if agg.empty:
                continue
            
            out.append((grouping_cols, agg.reset_index()))
        
        return out

    
    @staticmethod
    def _apply_tables_inplace(df, target, tables, fallback):
        """
        Applies lookup tables to impute missing values in the target column in-place.
        
        The function iterates through the provided lookup tables and fills missing values 
        in the target column by merging rows with matching grouping columns. If multiple 
        tables are provided, they are applied sequentially until no more missing values 
        can be filled. Finally, any remaining missing values are replaced with the fallback value.
        
        Args:
            df: DataFrame to be modified
            target: Name of the column for which missing values should be imputed
            tables: List of tuples containing grouping columns and lookup tables
            fallback: Default value to fill remaining missing values after table application
        
        Returns:
            None (DataFrame is modified in-place)
        """

        if target not in df.columns:
            return
        
        for grouping_cols, lookup_table in tables:
            
            missing_target_mask = df[target].isna()
            
            if not missing_target_mask.any():
                break
            
            eligible_rows = missing_target_mask
            
            for col in grouping_cols:
                if col not in df.columns:
                    eligible_rows = eligible_rows & False
                else:
                    eligible_rows = eligible_rows & df[col].notna()
            
            if not eligible_rows.any():
                continue
            
            rows_to_fill = df.loc[eligible_rows, grouping_cols]
            
            merged_data = rows_to_fill.merge(lookup_table, how="left", on=grouping_cols)
            merged_data.index = rows_to_fill.index
            
            successfully_filled = merged_data.index[merged_data[target].notna()]
            
            if len(successfully_filled) == 0:
                continue
            
            df.loc[successfully_filled, target] = merged_data.loc[successfully_filled, target].values
        
        if fallback is not None:
            df[target] = df[target].fillna(fallback)

    def fit(self, X, y=None):
        """
        Learn group-based imputation tables and fallback values from the given DataFrame.

        The method builds lookup tables for the categorical and numerical target columns
        specified in `cat_specs` and `num_specs`. Each table stores aggregated values (mode for
        categorical, median for numerical targets) of the target column for one grouping
        combination. These tables are later used by `transform` to impute missing values.
        
        Args: 
            X: Input DataFrame containing the data to learn from.
            y: Not used; present for compatibility with scikit-learn's TransformerMixin API.
        
        Returns: 
            self: Instance of fitted GroupImputer.
        """

        X = X.copy()
        self.tables_ = {}
        self.fallbacks_ = {}

        def _fit_block(specs, reducer, fb_func):
            for tgt, groups in specs.items():
                tables = self._reduce_by_groups(X, tgt, groups, reducer)
                if tgt in X.columns:
                    fb = fb_func(X[tgt])
                else:
                    fb = None
                self.tables_[tgt] = tables
                self.fallbacks_[tgt] = fb

        _fit_block(self.cat_specs, "mode", self._series_mode)
        _fit_block(self.num_specs, "median", lambda s: s.median() if s.notna().any() else None)

        return self


    def transform(self, X):
        """
        Impute missing values in a DataFrame using group-based lookup tables and fallback values.

        This method applies the lookup tables and global fallback values learned during `fit()`
        to fill missing entries in each target column defined in `cat_specs` and `num_specs`.
        For each column, the method sequentially uses the stored lookup tables—from the most
        specific to the most general grouping—to infer missing values. If no match is found in
        any table, a global fallback value is used.

        Example: 
            Hierarchical order: 
            ["transmission", "fuel_type", "tax", "year", "engine_size"],
            ["year", "engine_size", "tax"],                             
            ["engine_size", "tax"],
            ["engine_size"],
            ["tax"],
            ["year"],
            
            Filling the brand and model: Search for cars with identical transmission, fuel type, tax, year, and engine size -> fill with most frequent value in this group
            If no match found, search for cars with identical year, engine size, and tax -> fill with most frequent value in this group
            ...
            

        Args: 
            X: Input data containing missing values to impute.

        Returns: 
            df: Input DataFrame with missing values imputed.
        """
        
        df = X.copy()
        for tgt, tables in self.tables_.items():
            fb = self.fallbacks_.get(tgt, None)
            self._apply_tables_inplace(df, tgt, tables, fb)

        if self.int_cols is not None:
            for c in self.int_cols:
                if c in df.columns:
                    df[c] = pd.to_numeric(df[c], errors="coerce").round().astype("Int64")

        return df

**Define groups for imputation**

When filling in missing values, we group similar observations so that imputed values come from comparable data points instead of the whole dataset.

For example, consider the Audi Q7. One version might be a manual diesel from 2010 with a small engine and low tax. Another Audi Q7 might be an automatic petrol car from 2019 with a larger engine and higher tax. We would expect the second car to be more expensive than the first, so a missing value for one should not simply be filled from the other. To account for this, we impute within groups, starting with the most specific grouping and falling back to more general ones.
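
The fallback idea can be sketched on a toy table. This is a simplified illustration, not the GroupImputer itself: the data, the helper name `impute_by_groups`, and the two grouping levels are made up for the example, and only a numeric target with a median reducer is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two cars have a missing mpg value.
cars = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "bmw", "bmw"],
    "model": ["q7", "q7", "x5", "x5", "3 series", "z4"],
    "mpg":   [30.0, 34.0, 28.0, np.nan, 50.0, np.nan],
})

# Grouping levels, ordered from most specific to most general.
levels = [["brand", "model"], ["brand"]]

def impute_by_groups(df, target, levels):
    """Fill missing `target` values with group medians, level by level."""
    out = df.copy()
    for cols in levels:
        # Medians are always computed from the original, observed values.
        group_median = df.groupby(cols)[target].transform("median")
        out[target] = out[target].fillna(group_median)
    # Global median as the last resort.
    return out.fillna({target: df[target].median()})

imputed = impute_by_groups(cars, "mpg", levels)
print(imputed)
```

The missing x5 value is filled from the other x5, because a (brand, model) match exists. The z4 has no observed mpg at that level, so it falls back to the median over all bmw rows.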

In [0]:
BRAND_MODEL_GROUPS = [
    ["transmission", "fuel_type", "tax", "year", "engine_size"],
    ["year", "engine_size", "tax"],
    ["engine_size", "tax"],
    ["engine_size"],
    ["tax"],
    ["year"],
]

TRANS_FUEL_GROUPS = [
    ["brand", "model", "engine_size", "tax", "year"],
    ["brand", "model", "engine_size", "tax"],
    ["brand", "model", "engine_size"],
    ["brand", "model"],
    ["brand"],
]

NUMERIC_GROUPS = [
    ["brand", "model", "transmission", "fuel_type", "tax", "year", "engine_size"],
    ["brand", "model", "transmission", "fuel_type", "year", "engine_size"],
    ["brand", "model", "fuel_type", "year", "engine_size"],
    ["brand", "model", "engine_size", "year"],
    ["brand", "model", "engine_size"],
    ["brand", "model"],
    ["brand"],
]

NUMERIC_COLUMNS = [
    "mpg",
    "tax",
    "previous_owners",
    "engine_size",
    "paint_quality",
    "mileage",
    "year",
]

cat_specs = {
    "brand": BRAND_MODEL_GROUPS,
    "model": BRAND_MODEL_GROUPS,
    "transmission": TRANS_FUEL_GROUPS,
    "fuel_type": TRANS_FUEL_GROUPS
}

num_specs = {col: NUMERIC_GROUPS for col in NUMERIC_COLUMNS}

int_cols_for_imputer = ["year", "previous_owners", "tax", "paint_quality", "mileage"]

## 2.2 Feature Engineering

**FeatureEngineer**

The FeatureEngineer class derives additional features intended to improve the model's interpretability and predictive power. It converts raw variables into more meaningful representations of car characteristics and relationships.

Purpose: The class generates the engineered features inside the pipeline, which keeps training and validation data strictly isolated.

Steps: 
1. Store the reference year (2020) during fitting.
2. Derive new features during transformation:
    - vehicle_age: Quantifies the depreciation of a car. As the dataset is from 2020, the age of the vehicle is measured relative to that year. We expect this feature to capture that newer vehicles tend to be priced higher than older ones.
    - mileage_per_year: Normalizes the total mileage by vehicle age to reveal annual usage intensity.
    - tax_per_engine: Quantifies the tax burden relative to the engine size, i.e. the taxation efficiency of the vehicle. We hope that the model can use it to identify vehicles that are expensive to own and operate, as this can affect their market value.
    - mpg_per_engine: Measures fuel efficiency normalized by engine size. It helps to distinguish large engines that are relatively efficient from smaller engines that underperform.
    - log_mileage: Applies a logarithmic transformation, log(1 + x), to the mileage, compressing the scale and reducing the impact of extreme outliers.
    - log_tax: Applies the same logarithmic transformation to the tax, compressing the scale and reducing the impact of extreme outliers.
    - efficiency_index: Divides engine size by mpg to create a composite measure of overall engine efficiency. It helps to identify cars that deliver high performance with lower fuel consumption.
    - is_premium_brand: Binary flag that is 1 for the brands audi, bmw, and mercedes and 0 otherwise. It marks brands that are typically luxurious and more expensive.

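Notice: The constant 0.00001 is added to each denominator to avoid division by zero. As a consequence, a denominator of exactly zero (for example vehicle_age for a car from 2020) yields a very large value instead of an error.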
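
To make the effect of the constant concrete, here is a small sketch on two made-up cars. The column names follow the notebook; the values and the names `EPS` and `REF_YEAR` are illustrative assumptions.

```python
import pandas as pd

EPS = 0.00001    # same constant as in the notice above
REF_YEAR = 2020  # reference year of the dataset

# Hypothetical cars: one from 2015, one from the reference year itself.
cars = pd.DataFrame({"year": [2015, 2020], "mileage": [50_000, 5_000]})

cars["vehicle_age"] = (REF_YEAR - cars["year"]).clip(lower=0)
cars["mileage_per_year"] = cars["mileage"] / (cars["vehicle_age"] + EPS)

print(cars)
```

For the 2015 car the constant is negligible and the result is close to 10,000 miles per year. For the 2020 car the denominator is only the constant, so the ratio becomes extremely large. Whether this matters depends on the downstream scaler and model, so it is worth checking the distribution of the ratio features after engineering.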

In [0]:
class FeatureEngineer(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        """
        Fit method - only the reference year is stored.
        """ 
        self.ref_year_ = 2020
        return self
    
    def transform(self, df): 
        """
        Creates new features based on the existing columns.
        
        Args: 
            df: Input DataFrame
        
        Returns: 
            df: DataFrame with new features added.    
        """
        
        df = df.copy()

        df_year = df['year']
        df['vehicle_age'] = (self.ref_year_ - df_year).clip(lower=0)
        df['mileage_per_year'] = df['mileage'] / (df['vehicle_age']+0.00001)
        df['tax_per_engine'] = df['tax'] / (df['engine_size']+0.00001)
        df['mpg_per_engine'] = df['mpg'] / (df['engine_size']+0.00001)
        df['log_mileage'] = np.log1p(df['mileage'])
        df['log_tax'] = np.log1p(df['tax'])
        df['efficiency_index'] = df['engine_size'] / (df['mpg']+0.00001)
        
        premium_brands = ['audi', 'bmw', 'mercedes']
        df['is_premium_brand'] = df['brand'].isin(premium_brands).astype(int)

        return df


**TargetMeanEncoder**

The TargetMeanEncoder class replaces categorical values with the mean of the target variable (price) for each category. It handles high-cardinality categorical features efficiently by encoding them into single numerical columns instead of many one-hot columns.

Purpose: This transformer allows the model to capture relationships between category levels and the target variable, improving predictive strength while avoiding dimensionality explosion. It is particularly effective for columns such as brand or model that contain many unique values.

Steps: 
1. Fit Phase: For each column, compute the mean of the target per category and store it as a mapping. The global target mean is stored as a fallback.
2. Fit-transform Phase (out-of-fold encoding):
    - Split the data into K folds using cross-validation.
    - For each fold, compute category means using only the training data and apply them to the validation data.
    - Combine the fold outputs into one out-of-fold dataset.
    - Refit on the full dataset to store final category mappings and global means.
3. Transform Phase:
    - Replace each category with its stored mean from training.
    - Replace unseen categories with the global mean.

Data Leakage Prevention: Out-of-fold encoding prevents leakage by ensuring that the target means for validation rows are computed only from the training rows of each fold. The encoding of a row therefore never contains its own target value. When applied to test data, the mappings are fixed and use only statistics learned from training.

Notice: For training data, always use fit_transform(). Calling fit() followed by transform() on the same data would encode each row with a mean that includes its own target. The global mean is only stored for inference on unseen categories and does not cause leakage.
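
The out-of-fold step can be sketched in isolation for a single column. This is a simplified illustration under assumed toy data; the helper name `oof_target_mean` is hypothetical and not part of the class below.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data: one categorical column and a price-like target.
brand = pd.Series(["audi", "bmw"] * 5)
price = pd.Series(np.arange(10, dtype=float) * 1000)

def oof_target_mean(cat, y, n_splits=5, random_state=0):
    """Encode `cat` with target means computed on the other folds only."""
    oof = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, val_idx in kf.split(cat):
        # Category means from the training part of this fold.
        means = y.iloc[train_idx].groupby(cat.iloc[train_idx]).mean()
        # Apply them to the held-out rows.
        oof.iloc[val_idx] = cat.iloc[val_idx].map(means).to_numpy()
    # Categories missing from a training part fall back to the global mean.
    return oof.fillna(y.mean())

oof = oof_target_mean(brand, price)
print(pd.DataFrame({"brand": brand, "price": price, "brand_tgt": oof}))
```

Rows of the same brand receive slightly different encodings, because each one is computed without the rows of its own fold.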


In [0]:
class TargetMeanEncoder(BaseEstimator, TransformerMixin):
    
    def __init__(self, cols, n_splits=5, random_state=0):
        """
        Configure the encoder by specifying the categorical columns to encode and the cross-validation parameters.
    
        Args: 
            cols: List of categorical columns to be target mean encoded
            n_splits: Number of splits for K-Fold cross validation (default=5)
            random_state: Random seed for reproducibility
        """
        self.cols = cols
        self.n_splits = n_splits
        self.random_state = random_state

    def fit(self, X, y):
        """
        Computes target mean mappings for the specified categorical columns.

        This method trains the encoder by calculating the mean of the target variable
        for each category in the given columns. The computed mappings are stored
        for later use during transformation.

        Args:
            X: Input data containing the categorical columns to be encoded.
            y: Target variable used to compute mean values per category.

        Returns:
            self: Fitted TargetMeanEncoder instance.
        """
        X = pd.DataFrame(X).reset_index(drop=True)
        y = pd.Series(y).reset_index(drop=True)
        self.col_maps = {}
        self.global_means = {}
        for column in self.cols:
            if column in X.columns:
                grp = y.groupby(X[column]).mean()
                self.col_maps[column] = grp
                self.global_means[column] = float(y.mean())
        self.feature_names_out = [f"{column}_tgt" for column in self.cols if column in X.columns]
        return self

    def fit_transform(self, X, y):
        """
        Computes out-of-fold target mean encodings for the specified categorical columns and returns the transformed DataFrame.
        During each fold of K-Fold cross-validation, mean target values are computed using only the training portion, then applied to the validation portion to prevent target leakage.
        After all folds are processed, the encoder is fitted on the full dataset, and missing values are replaced with the global mean of the target.
        The output includes the original features and new encoded columns with the suffix `_tgt`.

        Args: 
            X: Input data containing the categorical columns to be encoded.
            y: Target variable used to compute mean values per category.

        Returns: 
            Z: Transformed DataFrame with original features and new target mean encoded columns.
        
        """
        # Make sure X and y are a DataFrame and a Series with matching, freshly reset indices
        X = pd.DataFrame(X).reset_index(drop=True)
        y = pd.Series(y).reset_index(drop=True)
        
        # Initialize DataFrame that will hold out-of-fold encodings later
        # Initially, all values are set to NaN
        oof = pd.DataFrame(index=X.index)
        for column in self.cols:
            oof[column] = np.nan
        
        # Create k-Fold cross-validator
        kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=self.random_state)
        
        # Extract train and validation indices for each fold
        for train, validation in kf.split(X):

            # Extract training and validation sets
            Xtr, ytr = X.loc[train], y.loc[train]

            # For each categorical column, compute mean target value per category (only based on ytr and therefore train data)
            for column in self.cols:
                # Calculate the means
                means = ytr.groupby(Xtr[column]).mean()

                # Apply the computed means to the validation set
                oof.loc[validation, column] = X.loc[validation, column].map(means)
        
        # Fit the encoder on the full dataset after cross-validation
        # This stores the final mappings and global means for use in transform()
        self.fit(X, y)
        
        # Create a copy of X and add the out-of-fold encodings
        Z = X.copy()
        for column in self.cols:

            # Replace any NaN (and therefore unseen category) with the global mean of y (Fallback)
            # Store final columns with suffix '_tgt'
            val = oof[column].fillna(self.global_means[column])
            Z[f"{column}_tgt"] = val.values
        
        return Z

    def transform(self, X):
        """
        Transforms the given data using target mean mappings learned during fit.
        Categories that have not been seen during training are replaced with the global mean of the target variable.

        Args: 
            X: Input data containing the categorical columns to be encoded.

        Returns: 
            Z: Transformed DataFrame with original features and new target mean encoded columns.
        """

        X = pd.DataFrame(X).reset_index(drop=True)
        Z = X.copy()
        for column in self.cols:
            Z[f"{column}_tgt"] = X[column].map(self.col_maps[column]).fillna(self.global_means[column]).values
        return Z

## 2.3 Feature Definition and Scoring Configuration

In [0]:
# Define columns that should be target encoded 
target_encoding_cols = ["brand", "model", "is_premium_brand"]

# Define numeric features
numeric_features = [
    "year", "mileage", "tax", "mpg", "engine_size", "paint_quality", 
    "previous_owners", "vehicle_age", "mileage_per_year", "tax_per_engine", 
    "mpg_per_engine", "log_mileage", "log_tax", "efficiency_index"
]

# Add the target encoded columns to numeric stack
numeric_features = numeric_features + [f"{col}_tgt" for col in target_encoding_cols]

# Define categorical features
categorical_features = ["transmission", "fuel_type"]

# Split data into X and y
X = train_raw.drop("price", axis=1)
y = train_raw["price"]
X_test = test_raw

# Define the scoring metrics that should be used
scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2"
}

---

# 3. Feature Selection

In this section, we apply several feature selection methods to find out whether selecting a subset of features benefits the model.

The following techniques are used and compared: 
- Filter methods
    - Variance Thresholding
    - Correlations (feature to target and feature to feature)
    - F Regression and Mutual Information test
- Wrapper methods
    - Recursive Feature Elimination with cross-validation, based on RandomForestRegressor feature importances

In the end, the features selected by all methods are compared in order to choose and discuss the best feature selection across methods.

## 3.0 Create Baseline Pipeline

To apply the feature selection methods, we need a preprocessed base dataset. We obtain it by applying the preprocessing steps implemented above.

We use the RobustScaler for numerical features and OneHotEncoder for categorical features.
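
As a brief illustration of what the RobustScaler does, the sketch below scales a made-up column containing one outlier. With default settings, scikit-learn's RobustScaler subtracts the median and divides by the interquartile range, which the manual computation reproduces.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical column with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual equivalent: (x - median) / (75th percentile - 25th percentile).
median = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
manual = (x - median) / (q75 - q25)

print(scaled.ravel())
```

Because median and interquartile range are barely affected by the outlier, the four regular values stay on a comparable scale, while the outlier remains clearly visible.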

In [0]:
# Initialize preprocessor transformer, which scales numeric features and one-hot encodes categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ("num", RobustScaler(), numeric_features),
        ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=True), categorical_features)
    ],
    remainder="drop"
)

# Initialize pipeline used for feature selection later
base_feature_selection_pipe = Pipeline(steps=[
    ('cleaner', CleanerTransformer()),
    ('imputer', GroupImputer(cat_specs=cat_specs, num_specs=num_specs, int_cols=int_cols_for_imputer)),
    ('feature_engineer', FeatureEngineer()),
    ('target_encoder', TargetMeanEncoder(cols=target_encoding_cols, n_splits=5, random_state=0)),
    ('preprocessor', preprocessor),
], verbose=True)

# Fitting the pipeline to the training data and transforming it
X_processed = base_feature_selection_pipe.fit_transform(X, y)

# Extracting feature names after preprocessing
feature_names = base_feature_selection_pipe.named_steps['preprocessor'].get_feature_names_out()

if hasattr(X_processed, "toarray"):
    X_processed = X_processed.toarray()

X_processed_df = pd.DataFrame(X_processed, columns=feature_names)

print(f'Number of initial features: {len(X_processed_df.columns)}')

## 3.1 Filter Methods

### 3.1.1 Variance Thresholding

Variance Thresholding identifies features with variance below a specified threshold. We assume that features with near-zero variance carry little information, since their values barely change across observations.
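
A minimal sketch of the mechanism on made-up data, using the same threshold of 0.01 as the cell below. The column names are illustrative only.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features: one varies, one is almost constant.
toy = pd.DataFrame({
    "varied": [0.0, 1.0, 2.0, 3.0],
    "almost_constant": [1.0, 1.0, 1.0, 1.01],
})

vt_toy = VarianceThreshold(threshold=0.01).fit(toy)
kept = list(toy.columns[vt_toy.get_support()])

print(kept)
```

One detail to keep in mind: VarianceThreshold uses the population variance (dividing by n), whereas pandas `DataFrame.var()` defaults to the sample variance (dividing by n - 1). For large datasets the difference is negligible, but the displayed variances and the selector's internal values are not exactly identical.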

In [0]:
# Constructing a DataFrame and calculating variances 
variances = pd.DataFrame({
    'Feature': X_processed_df.columns,
    'Variance': X_processed_df.var().values
}).sort_values(by='Variance', ascending=False).reset_index(drop=True)

display(variances.style.format(precision=4, thousands=".", decimal=","))

vt = VarianceThreshold(threshold=0.01)
vt.fit(X_processed_df)

# Features to keep are those with variance above the threshold. The threshold was chosen heuristically and is open to discussion
features_to_keep_vt = X_processed_df.columns[vt.get_support()]
features_to_eliminate_vt = X_processed_df.columns[~vt.get_support()]

print(f'Features to keep: {list(features_to_keep_vt)}')
print(f'Features to eliminate: {list(features_to_eliminate_vt)}')

**Results**
The variance threshold shows that all features have sufficiently high variance, so none is eliminated.

### 3.1.2 Correlations

In this section, we analyze correlations. We split the analysis into two parts: feature-to-target correlation and feature-to-feature correlation.

By doing this, we can identify 1. which features correlate with the target and therefore potentially have a high impact (relevance) and 2. which features are similar to each other (redundancy).

We chose the Spearman correlation coefficient because it is rank-based and captures monotonic relationships even when they are non-linear. According to the exploratory data analysis, such relationships are predominant in our data.
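
The following sketch, on synthetic data, illustrates what Spearman does and does not capture. The variables are made up for the example.

```python
import numpy as np
import pandas as pd

x = np.linspace(1, 10, 50)
toy = pd.DataFrame({
    "x": x,
    "monotonic": np.exp(x),        # strictly increasing, strongly non-linear
    "u_shaped": (x - 5.5) ** 2,    # non-monotonic
})

pearson = toy.corr(method="pearson")["x"]
spearman = toy.corr(method="spearman")["x"]

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

For the monotonic relationship, Spearman reaches 1 while Pearson stays clearly below it. For the U-shaped relationship, both coefficients are close to zero, so a low Spearman value does not rule out a non-monotonic dependency. This is one reason to complement the correlation analysis with mutual information.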

In [0]:
# Creating a deep copy and adding the target variable back again to feature dataset to analyse feature-target correlations
X_with_target = X_processed_df.copy()
X_with_target['price'] = y.values

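# Calculating Spearman correlation between features and target variable. Absolute values are taken to consider both positive and negative correlations.
# The target's correlation with itself (always 1) is dropped so that 'price' does not end up in the list of features to keep.
corr_spearman_features_target = X_with_target.corr(method='spearman')['price'].drop('price').abs().reset_index()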
corr_spearman_features_target = corr_spearman_features_target.rename(columns={'index': 'Feature', 'price': 'Corr'})
corr_spearman_features_target = corr_spearman_features_target.sort_values(by='Corr', ascending=False)

# Selecting features with correlation >= 0.1 to target variable. The threshold was chosen heuristically and is open to discussion
features_to_keep_corr = corr_spearman_features_target[corr_spearman_features_target['Corr'] >= 0.1]['Feature'].values
features_to_eliminate_corr = corr_spearman_features_target[corr_spearman_features_target['Corr'] < 0.1]['Feature'].values

print(f'Correlation matrix features-target (Spearman):')
display(corr_spearman_features_target.style.format(precision=4, thousands=".", decimal=",").background_gradient(axis=0))

print(f'Features to keep: {list(features_to_keep_corr)}')

# Calculating spearman correlation matrix between features to identify highly correlated features
corr_matrix_spearman_feature_feature = X_processed_df.corr(method='spearman').abs()

print(f'\n Correlation matrix features-features (Spearman):')
display(corr_matrix_spearman_feature_feature.style.format(precision=4, thousands=".", decimal=",").background_gradient(axis=0))

# Calculating pairs of features with correlation > 0.8
high_corr = []
for i in range(len(corr_matrix_spearman_feature_feature.columns)):
    for j in range(i):
        if corr_matrix_spearman_feature_feature.iloc[i, j] > 0.8:
            high_corr.append((corr_matrix_spearman_feature_feature.columns[i], corr_matrix_spearman_feature_feature.columns[j]))

display(high_corr)

**Results**
- The feature-target correlation with a threshold of 0.1 eliminates num__tax_per_engine, num__mileage_per_year, num__paint_quality, and num__previous_owners.
- The features that correlate highly with each other are mainly mileage, year, and vehicle age, as well as one one-hot-encoded pair. We expected this, since for example vehicle_age is derived directly from year.


### 3.1.3 F_Regression and Mutual Information 

In this section, we filter the features based on F_Regression and Mutual Information.

F_Regression: 
Tests the null hypothesis that a feature has no linear relationship with the target. For each feature, it computes an F statistic from its correlation with price; a higher F statistic and a smaller p-value indicate stronger linear explanatory power.

Mutual Information: 
We also calculate the mutual information because F_Regression can only capture linear relationships. In short, mutual information measures how much information a feature provides about the target. A value clearly above zero therefore suggests that the feature reduces uncertainty about the target.

Result interpretation: A high p-value in the F-test combined with a low MI value indicates that a feature has limited predictive power.
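
The difference between the two scores can be sketched on synthetic data. The three features and the target below are made up: one feature enters the target linearly, one quadratically, and one not at all.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
n = 500

linear = rng.normal(size=n)
quadratic = rng.uniform(-2, 2, size=n)
noise = rng.normal(size=n)

# Synthetic target: linear in the first feature, quadratic in the second.
target = linear + 3 * quadratic**2 + 0.1 * rng.normal(size=n)

X_toy = np.column_stack([linear, quadratic, noise])
f_scores, p_values = f_regression(X_toy, target)
mi_scores = mutual_info_regression(X_toy, target, random_state=0)

for name, f, p, mi in zip(["linear", "quadratic", "noise"], f_scores, p_values, mi_scores):
    print(f"{name:>9}: F={f:8.2f}  p={p:.4f}  MI={mi:.3f}")
```

The F-test picks up the linear feature but has little power for the quadratic one, because its linear correlation with the target is close to zero. Mutual information assigns the quadratic feature a clearly higher score than the pure noise feature. Combining both criteria, as in the cell below, therefore avoids discarding features with a purely non-linear effect.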

In [0]:
# Calculating f scores for all features
selector_f = SelectKBest(score_func=f_regression, k="all")
selector_f.fit(X_processed_df, y)

# Calculating mutual information for all features
selector_mi = SelectKBest(score_func=mutual_info_regression, k="all")
selector_mi.fit(X_processed_df, y)

# Create DataFrame to hold the results
select_k_results = pd.DataFrame({
    'Feature': X_processed_df.columns,
    'F_Score': selector_f.scores_,
    'P_Value': selector_f.pvalues_,
    'MI_Score': selector_mi.scores_
}).sort_values(by=['F_Score', 'MI_Score'], ascending=False).reset_index(drop=True)

# Display the results with styling configuration to highlight results
display(select_k_results.style.format(precision=4, thousands=".", decimal=",").background_gradient(axis=0))

# Determine features to keep based on p-value and mutual information score. The thresholds were chosen heuristically and are open to discussion
features_to_keep_f_mi = select_k_results[(select_k_results['P_Value'] < 0.05) | (select_k_results['MI_Score'] > 0.01)]['Feature'].values
features_to_eliminate_f_mi = list(set(X_processed_df.columns) - set(features_to_keep_f_mi))
print(f'Features to keep: {list(features_to_keep_f_mi)}')
print(f'Features to eliminate: {features_to_eliminate_f_mi}')

**Results**
The p-values for num__paint_quality and num__previous_owners are very high and their MI scores are below 0.005, indicating that they have limited predictive power.

## 3.2 Wrapper Methods

Wrapper methods evaluate feature subsets by model performance. They wrap the learning algorithm and assess which features improve predictive accuracy.

1. We apply Recursive Feature Elimination (RFE) with five-fold cross-validation, using the RandomForestRegressor as the estimator. The Random Forest was our initially preferred model, which is why we run the RFECV with it. We are aware that this might bias the later model evaluation towards the model that the feature selection was based on.
2. After the initial feature selection and model evaluation, we found that the Random Forest is the best model for our current setup. We therefore repeat the feature selection with the tuned hyperparameters. This should improve the feature selection for our final Random Forest.
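
The elimination loop itself can be sketched on synthetic data before it is embedded in cross-validation. This is a reduced illustration under assumed data: the feature names are made up, and the validation scoring per step is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: two informative features and three pure noise features.
X_toy = pd.DataFrame(
    rng.normal(size=(200, 5)),
    columns=["signal_a", "signal_b", "noise_1", "noise_2", "noise_3"],
)
y_toy = 3 * X_toy["signal_a"] + 2 * X_toy["signal_b"] + rng.normal(scale=0.1, size=200)

remaining = list(X_toy.columns)
eliminated = []

# Repeatedly refit the model and drop the least important feature.
while len(remaining) > 1:
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_toy[remaining], y_toy)
    worst = remaining[int(np.argmin(model.feature_importances_))]
    remaining.remove(worst)
    eliminated.append(worst)

print("Elimination order:", eliminated)
print("Last remaining feature:", remaining)
```

In the full procedure below, each step additionally records the validation MAE per fold, so that the number of features with the lowest average error can be chosen afterwards.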

In [0]:
# Initialize preprocessing pipeline, which includes cleaning, imputation, feature engineering, target mean encoding, and scaling/encoding
preprocessing_pipeline = Pipeline(steps=[
    ('cleaner', CleanerTransformer()),
    ('imputer', GroupImputer(cat_specs=cat_specs, num_specs=num_specs, int_cols=int_cols_for_imputer)),
    ('feature_engineer', FeatureEngineer()),
    ('target_encoder', TargetMeanEncoder(cols=target_encoding_cols, n_splits=5, random_state=0)),
    ('preprocessor', ColumnTransformer(
        transformers=[
            ("num", RobustScaler(), numeric_features),
            ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features)
        ],
        remainder="drop"
    ))
])

# Initialize models
base_models = {
    'Default RandomForestRegressor': RandomForestRegressor(random_state=0, n_jobs=-1),
    'Tuned RandomForestRegressor': RandomForestRegressor(
        n_estimators=360,
        max_depth=22,
        min_samples_leaf=1,
        max_features=0.4,
        random_state=0,
        n_jobs=-1
    )
}

# Initialize K-Fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Iterate over each model for RFE
for model_name, base_model in base_models.items(): 
    print(f'Starting RFE for model: {model_name}')
    
    # Store results for analysis
    all_results = []

    # Perform Recursive Feature Elimination with Cross-Validation
    # fold: Current fold number
    # train_fold_idx: Indices for training data in the current fold
    # val_fold_idx: Indices for validation data in the current fold
    for fold, (train_fold_idx, val_fold_idx) in enumerate(cv.split(X, y)):
        
        # Split data into training and validation sets for the current fold
        X_train_fold, X_val_fold = X.iloc[train_fold_idx], X.iloc[val_fold_idx]
        y_train_fold, y_val_fold = y.iloc[train_fold_idx], y.iloc[val_fold_idx]

        # Process the training and validation data
        X_train_processed = preprocessing_pipeline.fit_transform(X_train_fold, y_train_fold)
        X_val_processed = preprocessing_pipeline.transform(X_val_fold)

        # Get processed feature names    
        processed_feature_names = list(preprocessing_pipeline.named_steps['preprocessor'].get_feature_names_out())

        # Convert processed data to DataFrames for easier feature manipulation
        X_train_processed_df = pd.DataFrame(X_train_processed, columns=processed_feature_names).reset_index(drop=True)

        # Convert validation data to DataFrame
        X_val_processed_df = pd.DataFrame(X_val_processed, columns=processed_feature_names).reset_index(drop=True)

        # Initialize features to keep; by start of a new fold, all features are kept
        features_to_keep = list(processed_feature_names)
        
        # Recursive Feature Elimination loop
        for i in range(len(processed_feature_names)):
            # Get the current number of features
            num_features = len(features_to_keep)
            
            # Clone the base model to ensure a fresh model for each iteration
            model = clone(base_model)

            # Fit the model on the training data with the current set of features
            model.fit(X_train_processed_df[features_to_keep], y_train_fold)
            
            # Predict on the validation set and compute MAE
            y_pred = model.predict(X_val_processed_df[features_to_keep])
            score = mean_absolute_error(y_val_fold, y_pred)
            
            print(f'Fold: {fold}, Features: {num_features}, MAE: {score}')

            # If only one feature remains, record the result and exit the loop
            if num_features == 1:
                all_results.append({
                    'Fold': fold,
                    'Num_Features': num_features,
                    'MAE': score,
                    'Eliminated_Feature': None
                })
                break

            # Identify the least important feature to eliminate
            importances = model.feature_importances_
            worst_feature_index = np.argmin(importances)
            worst_feature_name = features_to_keep.pop(worst_feature_index)
            
            # Record the results of this iteration
            all_results.append({
                'Fold': fold,
                'Num_Features': num_features,
                'MAE': score,
                'Eliminated_Feature': worst_feature_name
            })

    results_df = pd.DataFrame(all_results)

    plt.figure(figsize=(10, 6))
    plt.title(model_name)
    # errorbar= replaces the deprecated ci= parameter (requires seaborn >= 0.12)
    sns.lineplot(data=results_df, x='Num_Features', y='MAE', errorbar='sd')

    feature_analysis = results_df[results_df['Eliminated_Feature'].notna()].groupby('Eliminated_Feature').agg({'Num_Features': 'mean'}).reset_index()

    feature_analysis.columns = ['Feature', 'Avg Elimination Position']

    feature_analysis = feature_analysis.sort_values('Avg Elimination Position', ascending=False)

    display(feature_analysis)


    optimal_n = results_df.groupby('Num_Features')['MAE'].mean().idxmin()

    final_features = set(processed_feature_names)

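    for fold in results_df['Fold'].unique():
        # A feature recorded at Num_Features == n was removed to go from n to n-1 features,
        # so only eliminations recorded at Num_Features > optimal_n are applied
        eliminated = set(results_df[(results_df['Fold'] == fold) & (results_df['Num_Features'] > optimal_n)]['Eliminated_Feature'].dropna())
        final_features = final_features - eliminated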

    features_to_keep_rfe = sorted(final_features)
    features_to_eliminate_rfe = sorted(set(processed_feature_names) - final_features)
    print(f"Final selected features for {model_name} (based on lowest average MAE):{features_to_keep_rfe}")

    print("\n")

**Results**
1. Iteration: Default Random Forest
    - With the default Random Forest hyperparameters in RFECV, no feature should be eliminated.
2. Iteration: Tuned Random Forest
    - With the tuned Random Forest hyperparameters in RFECV, only 12 features should be kept.

**Interpretation**: The effect on the final model is noticeable (the MAE improves by 40). Rerunning the feature selection with the already tuned model yields a feature subset that suits that model better.

## 3.3 Comparing feature selection results

In [0]:
vt_keep = features_to_keep_vt 
vt_elim = features_to_eliminate_vt

corr_keep = features_to_keep_corr
corr_elim = features_to_eliminate_corr

fmi_keep = features_to_keep_f_mi
fmi_elim = features_to_eliminate_f_mi

rfecv_keep = features_to_keep_rfe
rfecv_elim = features_to_eliminate_rfe

all_features = list(X_processed_df.columns)

results = []
for feature in all_features:
    results.append({'Method': 'Variance Threshold', 'Feature': feature, 'Kept': feature in vt_keep})
    results.append({'Method': 'Correlation', 'Feature': feature, 'Kept': feature in corr_keep})
    results.append({'Method': 'F-Test/MI', 'Feature': feature, 'Kept': feature in fmi_keep})
    results.append({'Method': 'RFECV', 'Feature': feature, 'Kept': feature in rfecv_keep})

comparison_df = pd.DataFrame(results).pivot(index='Feature', columns='Method', values='Kept').reset_index()

display(comparison_df.style.applymap(lambda x: 'background-color: green' if x else 'background-color: red'))

**Results**
The table of kept and eliminated features per selection method shows:
- RFECV with the tuned parameters eliminates the most features.
- Every other method confirms at least a few of these eliminations. Only the correlation filter additionally flags num__tax_per_engine for elimination.
- Since RFECV was run with the same Random Forest model that is later used for car price prediction, we only take its feature selection into account.

The following features are eliminated: num__is_premium_brand_tgt, num__log_tax, num__mileage_per_year, num__mpg, num__paint_quality, num__previous_owners, num__tax, ohe__fuel_type_diesel, ohe__fuel_type_hybrid, ohe__fuel_type_petrol, ohe__transmission_automatic, ohe__transmission_semi-automatic.

# 4. Model training

**Overview**:
We train and compare a diverse set of regressors to select the most accurate model for the final prediction. (Note: The notebook heavily favors the forests, since the preprocessing includes no normalization or scaling steps tailored to the linear regressors.) The preprocessing is varied to quantify the impact of cleaning, feature engineering, encoding, and feature selection.

**Pipelines**
1. Pipeline: Clean + Impute + Encoding + Scaling -> Build a baseline model.
2. Pipeline: Pipeline 1 + Feature Engineering -> Test FE impact on baseline model.
3. Pipeline: Pipeline 2 + Feature Selection -> Test FS impact on model performance. The tuned Random Forest is used in RFE.

**Models Evaluated**
- Linear: LinearRegression, Ridge
- Trees / Ensembles: DecisionTreeRegressor, RandomForestRegressor, HistGradientBoostingRegressor
- Neural Network: MLPRegressor

**Assessment protocol**
We use a 5-fold CV strategy throughout the project, because a simple holdout validation is too simplistic and allows too much variability. Averaging over the folds reduces this variability and gives a better estimate of generalization. We chose k = 5 so that each fold contains 20% of the data (around 15k values), which is sufficiently large. The folds are created with shuffling enabled to ensure that the data is randomly distributed across the folds and that there are no systematic biases.
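
The fold layout described above can be checked with a small sketch. The sample count of 1000 is a hypothetical stand-in for the real training set, and `X_demo` is a placeholder array.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1000  # hypothetical size, not the real dataset
X_demo = np.arange(n_samples).reshape(-1, 1)

# Same splitter configuration as used in the notebook
cv = KFold(n_splits=5, shuffle=True, random_state=0)

test_sizes = []
seen_indices = []
for train_idx, test_idx in cv.split(X_demo):
    test_sizes.append(len(test_idx))
    seen_indices.extend(test_idx)

# Each validation fold holds 20% of the rows
print(test_sizes)

# Every row lands in exactly one validation fold
covered_once = np.array_equal(np.sort(seen_indices), np.arange(n_samples))
print(covered_once)
```

With shuffling enabled, the assignment of rows to folds is random but reproducible through `random_state`.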

**Metrics**
- MAE, RMSE, and R² are reported. MAE is the primary metric for ranking and evaluating the models.
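
As a minimal sketch of how these three metrics can be collected in one `cross_validate` call: the dictionary below is an assumption about what the `scoring` object from section 2.3 might look like (its keys are inferred from the `test_mae`, `test_rmse`, and `test_r2` lookups in the cell below), and the Ridge model on synthetic data is only a placeholder.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

# Assumed scoring configuration; the real one is defined in section 2.3
scoring_demo = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores_demo = cross_validate(
    Ridge(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring=scoring_demo,
)

# Error metrics come back negated (higher is better), so flip the sign
mae = -scores_demo["test_mae"].mean()
rmse = -scores_demo["test_rmse"].mean()
r2 = scores_demo["test_r2"].mean()
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R2: {r2:.3f}")
```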

In [0]:
# Selecting the features that we will use in the first base case. Here, only raw features without feature engineering or target encoding are used.
numeric_features_pipe_1 = ["year", "mileage", "tax", "mpg", "engine_size", "paint_quality", "previous_owners"]
categorical_features_pipe_1 = ["brand", "model", "transmission", "fuel_type"]

# Constructing the first pipeline, which only includes data cleaning, imputation, and preprocessing
pipe_1 = Pipeline(steps=[
    ('cleaner', CleanerTransformer()),
    ('imputer', GroupImputer(cat_specs=cat_specs, num_specs=num_specs, int_cols=int_cols_for_imputer)),
    ('preprocessor', ColumnTransformer(
        transformers=[
            ("num", RobustScaler(), numeric_features_pipe_1),
            ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features_pipe_1)
        ],
        remainder="drop"
    ))
])

# Selecting the features that will be used in the second pipeline, which includes the raw features plus feature engineering and target encoding
numeric_features_pipe_2 = [
    "year", "mileage", "tax", "mpg", "engine_size", "paint_quality", 
    "previous_owners", "vehicle_age", "mileage_per_year", "tax_per_engine", 
    "mpg_per_engine", "log_mileage", "log_tax", "efficiency_index",
    "brand_tgt", "model_tgt", "is_premium_brand_tgt"
]

categorical_features_pipe_2 = ["transmission", "fuel_type"]

# Constructing the second pipeline, which includes data cleaning, imputation, feature engineering, target encoding and preprocessing
pipe_2 = Pipeline(steps=[
    ('cleaner', CleanerTransformer()),
    ('imputer', GroupImputer(cat_specs=cat_specs, num_specs=num_specs, int_cols=int_cols_for_imputer)),
    ('feature_engineer', FeatureEngineer()),
    ('target_encoder', TargetMeanEncoder(cols=target_encoding_cols, n_splits=5, random_state=0)),
    ('preprocessor', ColumnTransformer(
        transformers=[
            ("num", RobustScaler(), numeric_features_pipe_2),
            ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features_pipe_2)
        ],
        remainder="drop"
    ))
])

# Selecting the features that will be used in the third pipeline, which includes only the selected features after feature selection
numeric_features_pipe_3 = [
    "brand_tgt", "efficiency_index", "engine_size", "log_mileage", "mileage", "model_tgt", "mpg_per_engine", "tax_per_engine", "vehicle_age", "year"
]

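# Note: with an empty column list, the ColumnTransformer skips the "ohe" branch below, so no
# transmission indicator reaches the model here (section 6.1 passes ["transmission"] instead).
categorical_features_pipe_3 = []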

# Constructing the third pipeline, which includes data cleaning, imputation, feature engineering, target encoding, preprocessing and feature selection
pipe_3 = Pipeline(steps=[
    ('cleaner', CleanerTransformer()),
    ('imputer', GroupImputer(cat_specs=cat_specs, num_specs=num_specs, int_cols=int_cols_for_imputer)),
    ('feature_engineer', FeatureEngineer()),
    ('target_encoder', TargetMeanEncoder(cols=target_encoding_cols, n_splits=5, random_state=0)),
    ('preprocessor', ColumnTransformer(
        transformers=[
            ("num", RobustScaler(), numeric_features_pipe_3),
            ("ohe", Pipeline([
                ('encoder', OneHotEncoder(handle_unknown="ignore", sparse_output=False)),

                # Special case: We have identified in the feature selection phase, that only manual transmission cars are relevant. Therefore, we will only keep that category.
                ('selector', ColumnTransformer([
                    ('transmission_manual_only', 'passthrough', [1])
                ], remainder='drop'))
            ]), categorical_features_pipe_3)],
        remainder="drop"
    )),
])

# Define the models which will be used in the comparison
models = {
    'LinearRegressor': LinearRegression(),
    'RidgeRegressor': Ridge(random_state=0),

    
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=0),
    # Remark: n_jobs is used to optimize CPU usage
    'RandomForestRegressor': RandomForestRegressor(random_state=0, n_jobs=-1),
    
    'HistGradientBoostingRegressor': HistGradientBoostingRegressor(random_state=0),
    'MLPRegressor': MLPRegressor(random_state=0),
}

# Create dictionary of pipelines to iterate over
pipelines_to_compare = {
    'Pipeline 1 (Base)': pipe_1,
    'Pipeline 2 (+ Feature Engineering)': pipe_2,
    'Pipeline 3 (+ Feature Selection)': pipe_3
}

# Initialize empty list to store comparison results
comparison_results = []

# Create loop that will iterate over each pipeline
for pipeline_name, pipeline in pipelines_to_compare.items():
    print(f"Pipeline: {pipeline_name}")
    
    # Create loop that will iterate over each model
    for model_name, model in models.items():
        print(f"\n Training {model_name}")
        
        # Add the current model to the pipeline
        curr_model = [('model', model)]
        full_pipeline = Pipeline(steps=pipeline.steps + curr_model)

        # Initialize K-Fold cross-validation
        cv = KFold(n_splits=5, shuffle=True, random_state=0)
        
        # Calculate the cross-validated scores
        scores = cross_validate(
            full_pipeline, 
            X, 
            y, 
            cv=cv, 
            scoring=scoring,
            n_jobs=-1
        )
        
        # Extract and store results of training
        # Remark: sklearn stores error metrics as negative values for maximization. Therefore, we have to negate them back
        mean_mae = -scores['test_mae'].mean()
        std_mae = scores['test_mae'].std()
        mean_rmse = -scores['test_rmse'].mean()
        std_rmse = scores['test_rmse'].std()
        mean_r2 = scores['test_r2'].mean()
        std_r2 = scores['test_r2'].std()
        
        comparison_results.append({
            'Pipeline': pipeline_name,
            'Model': model_name,
            'MAE Mean': mean_mae,
            'MAE Std': std_mae,
            'RMSE Mean': mean_rmse,
            'RMSE Std': std_rmse,
            'R2 Mean': mean_r2,
            'R2 Std': std_r2
        })
        
        print(f" MAE: {mean_mae} , Std: {std_mae}")
        print(f" RMSE: {mean_rmse}, Std: {std_rmse}")
        print(f" R^2: {mean_r2} , Std: {std_r2}")

# Create DataFrame from comparison results and display it
results_df = pd.DataFrame(comparison_results)
display(results_df)

**Results**
1. Pipeline:
    - Linear models show higher MAE/RMSE and lower R² than the trees/ensembles. This indicates non-linear interactions and structure in the data.
    - RandomForest has the lowest errors and the most stable folds. It is the primary candidate for refinement.
2. Pipeline:
    - Feature engineering helps HistGradientBoosting across all metrics.
    - It hurts the linear models and the decision tree.
    - Random Forest shows a small MAE improvement but slightly worse RMSE and R².
3. Pipeline:
    - With feature selection, all models get worse.
    - It will be interesting to see how the tuned Random Forest performs later.

Note: For this evaluation we expected the Random Forest to be the best model, since the feature selection was done explicitly for the Random Forest and no specific steps were taken for the linear models.
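
To read a results table like the one above by MAE, it can help to sort it and to pivot it into a model-by-pipeline grid. The sketch below uses made-up placeholder rows (`ModelA`, `ModelB`, and their numbers are not results from this notebook) in the same column layout as `comparison_results`.

```python
import pandas as pd

# Placeholder rows only; names and values are invented for illustration
demo_results = pd.DataFrame([
    {"Pipeline": "Pipeline 1", "Model": "ModelA", "MAE Mean": 3.0},
    {"Pipeline": "Pipeline 1", "Model": "ModelB", "MAE Mean": 2.0},
    {"Pipeline": "Pipeline 2", "Model": "ModelA", "MAE Mean": 2.5},
    {"Pipeline": "Pipeline 2", "Model": "ModelB", "MAE Mean": 1.5},
])

# Rank all pipeline/model combinations by MAE (lower is better)
ranked = demo_results.sort_values("MAE Mean").reset_index(drop=True)
best = ranked.iloc[0]

# One row per model, one column per pipeline, for side-by-side comparison
mae_table = demo_results.pivot(index="Model", columns="Pipeline", values="MAE Mean")

print(ranked)
print(mae_table)
```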

# 5. Tuning Hyperparameters

In this section, we focus on the best-performing model from the previous analysis, the RandomForestRegressor, and tune its hyperparameters to further enhance its performance. To stabilise the results, we will rely on cross-validation in this section as well.

Remark: Through multiple testing iterations, we identified a subset of well-performing parameters. To demonstrate the tuning process, this subset is used as the parameter grid. Consequently, the tuning does not cover the entire parameter space, which would be too expensive computationally. Instead, we refine the search within the region that performed well. To find the best configuration in this narrowed space, we use the exhaustive GridSearchCV, which is less efficient but more thorough than RandomizedSearchCV. For more extensive testing in the future, more advanced and efficient optimization algorithms would be used.
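
For contrast, the sampling-based alternative named in the remark could look roughly like this. It is a minimal sketch on synthetic data with a deliberately small, assumed search space; it is not the configuration used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Assumed toy search space (3 * 3 * 2 * 3 = 54 possible combinations)
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 2],
    "max_features": [0.2, 0.4, 0.6],
}

# Only n_iter randomly drawn combinations are evaluated instead of all 54
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=2,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(f"Lowest MAE among sampled candidates: {-search.best_score_:.2f}")
print(search.best_params_)
```

The trade-off is that the best sampled candidate is not guaranteed to be the best configuration in the space, which is why an exhaustive search can be preferable once the space is already narrow.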

In [0]:
# Initialize preprocessor transformer, which scales numeric features and one-hot encodes categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ("num", RobustScaler(), numeric_features),
        ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=True), categorical_features)
    ],
    remainder="drop"
)

# Initialize the full pipeline for hyperparameter tuning
tuning_pipeline = Pipeline(steps=[
    ('cleaner', CleanerTransformer()),
    ('imputer', GroupImputer(cat_specs=cat_specs, num_specs=num_specs, int_cols=int_cols_for_imputer)),
    ('feature_engineer', FeatureEngineer()),
    ('target_encoder', TargetMeanEncoder(cols=target_encoding_cols, n_splits=5, random_state=0)),
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=0, n_jobs=1)),
])

# Define parameter grid for hyperparameter tuning. This grid will be searched during GridSearchCV. 
# In total, 4*4*2*3 = 96 combinations will be tested. Each of the 96 combinations is evaluated using 2-fold cross-validation, resulting in 192 model fits (plus one final refit of the best configuration on all data).
param_grid = {
    'model__n_estimators': [300, 320, 340, 360],
    'model__max_depth': [18, 22, 26, 30],
    'model__min_samples_leaf': [1, 2],
    'model__max_features': [0.2, 0.4, 0.6],
}

# Define GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(
    estimator=tuning_pipeline,
    param_grid=param_grid,
    cv=2,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    verbose=2
)

# Fit GridSearchCV to the training data to find the best hyperparameters
grid_search.fit(X, y)

# Output the best score and parameters found during hyperparameter tuning
print(f"Lowest MAE: {-grid_search.best_score_}")
print("Best Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"{param}: {value}")

# Create DataFrame to display all results from the grid search
results_df = pd.DataFrame(grid_search.cv_results_)
results_df['mae'] = -results_df['mean_test_score']

results = results_df[[
    'param_model__n_estimators',
    'param_model__max_depth',
    'param_model__min_samples_leaf',
    'param_model__max_features',
    'mae',
    'std_test_score',
]].reset_index(drop=True)

display(results)

- Impact of parameter tuning:
    - Tuning the RandomForest does indeed improve the performance by a fair margin.
    - The optimal parameters are:
        - model__max_depth: 22
        - model__max_features: 0.4
        - model__min_samples_leaf: 1
        - model__n_estimators: 360

    - Default parameters:
        - n_estimators: 100
        - min_samples_leaf: 1
        - max_depth: None
        - max_features: 1.0

    - The relatively high number of estimators and the depth of the trees suggest that the regression problem is non-trivial, with non-linear relationships.
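
The grid size stated in the tuning cell above can be verified with `ParameterGrid`. The sketch copies the parameter grid from that cell and only counts combinations; it does not fit any model.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
param_grid = {
    'model__n_estimators': [300, 320, 340, 360],
    'model__max_depth': [18, 22, 26, 30],
    'model__min_samples_leaf': [1, 2],
    'model__max_features': [0.2, 0.4, 0.6],
}

n_combinations = len(ParameterGrid(param_grid))  # 4 * 4 * 2 * 3
n_folds = 2                                      # cv=2 in GridSearchCV
n_fits = n_combinations * n_folds

print(f"{n_combinations} combinations, {n_fits} cross-validation fits")
```

With the default `refit=True`, GridSearchCV additionally refits the best configuration once on the full training data.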

# 6. Training and utilizing final model

## 6.1 Training final model

In this final section, we will utilize all optimizations done before to initialize one final, tuned model. This model will be used to make the final predictions. 

In [0]:
# Define the columns that will be target encoded in the final model
target_encoding_cols = ["brand", "model"]

# Define the final numeric features that will be used in the model
numeric_features_final = [
    "mpg",
    "tax_per_engine",
    "log_mileage",
    "engine_size",
    "mileage",
    "vehicle_age",
    "efficiency_index",
    "year",
    "mpg_per_engine"
]
 
# Add the target encoded columns to numeric stack
numeric_features_final = numeric_features_final + [f"{col}_tgt" for col in target_encoding_cols]

# Define the final categorical features that will be used in the model
categorical_features_final = ["transmission"]

# Initialize the final model with the best hyperparameters found during tuning
model = RandomForestRegressor(
        n_estimators=360,
        max_depth=22,
        min_samples_leaf=1,
        max_features=0.4,
        random_state=0,
        n_jobs=-1
    )

results = {}
fold_results = {}

# Construct the final pipeline
final_pipeline = Pipeline(steps=[
    ('cleaner', CleanerTransformer()),
    ('imputer', GroupImputer(cat_specs=cat_specs, num_specs=num_specs, int_cols=int_cols_for_imputer)),
    ('feature_engineer', FeatureEngineer()),
    ('target_encoder', TargetMeanEncoder(cols=target_encoding_cols, n_splits=5, random_state=0)),
    ('preprocessor', ColumnTransformer(
        transformers=[
            ("num", RobustScaler(), numeric_features_final),
            ("ohe", Pipeline([
                ('encoder', OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
                ('selector', ColumnTransformer([
                    ('transmission_manual_only', 'passthrough', [1])
                ], remainder='drop'))
            ]), categorical_features_final)],
        remainder="drop"
    )),
    ('model', model),
])
 
# Initialize K-Fold cross-validation
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Calculate the cross-validated scores
scores = cross_validate(
    final_pipeline,
    X,
    y,
    cv=cv,
    scoring=scoring,
    n_jobs=1
)
 
# Extract and store results of training
mean_mae = -scores['test_mae'].mean()
std_mae = scores['test_mae'].std()
mean_rmse = -scores['test_rmse'].mean()
std_rmse = scores['test_rmse'].std()
mean_r2 = scores['test_r2'].mean()
std_r2 = scores['test_r2'].std()
 
fold_results = pd.DataFrame({
    'Fold': range(1, len(scores['test_mae']) + 1),
    'MAE': -scores['test_mae'],
    'RMSE': -scores['test_rmse'],
    'R2': scores['test_r2']
})
 
results = {
    'MAE Mean': mean_mae,
    'MAE Std': std_mae,
    'RMSE Mean': mean_rmse,
    'RMSE Std': std_rmse,
    'R2 Mean': mean_r2,
    'R2 Std': std_r2
}

display(results)
display(fold_results)

## 6.2 Making final predictions

In [0]:
# Fitting the final pipeline on the entire training data
final_pipeline.fit(X, y)

# Making predictions on the test data
predictions = final_pipeline.predict(X_test)

# Creating submission DataFrame
submission = pd.DataFrame({
    "carID": X_test["car_id"],
    "Price": predictions
})

# Saving submission to CSV file
submission.to_csv("Group49_Version96.csv", index=False)

# Displaying the first few rows of the submission DataFrame
display(submission.head())
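
Before uploading, a few sanity checks on the submission file can catch common problems. This is a minimal sketch: `check_submission` is a hypothetical helper, not part of this notebook, and the demo rows are placeholders. Only the column names `carID` and `Price` are taken from the cell above.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Return a list of problems found in a submission DataFrame (hypothetical helper)."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["carID"].duplicated().any():
        problems.append("duplicate carID values")
    if df["Price"].isna().any():
        problems.append("missing Price values")
    if (df["Price"] <= 0).any():
        problems.append("non-positive Price values")
    return problems

# Placeholder rows, not real predictions
good_demo = pd.DataFrame({"carID": [101, 102, 103], "Price": [12500.0, 8300.5, 21999.0]})
bad_demo = pd.DataFrame({"carID": [101, 101], "Price": [12500.0, None]})

print(check_submission(good_demo, expected_rows=3))
print(check_submission(bad_demo, expected_rows=3))
```

In the notebook, the call would be `check_submission(submission, expected_rows=len(X_test))`.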

In [0]:
# Kaggle score: 1194.33751 (Public score, "Group49_Version96.csv")