# Machine Learning Project

## Table of Contents
- [Import Data](#import-data)
  - [Import Data Summary](#import-data-summary)
- [Data Exploration](#data-exploration)
  - [Boolean Features](#boolean-features)
    - [Boolean Features Analysis](#boolean-features-analysis)
  - [Categorical Features](#categorical-features)
    - [Check Categorical Features Consistency](#check-categorical-features-consistency)
    - [Categorical Features Summary](#categorical-features-summary)
  - [Numerical Features](#numerical-features)
    - [Numerical Plots](#plots)
    - [Analysis of Numerical Distributions](#analysis-of-numerical-distributions)
- [Pre-processing](#pre-processing)
  - [Functions](#functions)
  - [Data Preparation](#data-preparation)
    - [Correlation Analysis](#correlation-analysis)
- [Model Training](#model-training)
  - [Model Selection with Cross-Validation](#model-selection-with-cv)
  - [Quick Baseline Model](#quick-baseline-model)
  - [Experiment Algorithms](#experiment-algorithms)
- [Predictions](#predictions)

<a id="import-data"></a>
## Import Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from collections import Counter
from scipy.stats import loguniform, randint
import nltk


SEED = 42

warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('data/train.csv').set_index('carID')
df.head()

In [None]:
df.info()

In [None]:
num_duplicated_ids = df.index.duplicated().sum()
print(f'Number of duplicated carIDs: {num_duplicated_ids}')

<a id="import-data-summary"></a>
#### Import Data Summary
- Dataset loaded successfully with `carID` as the index
- There are no duplicate entries in carID
- The dataset contains information about cars including both numerical features (price, mileage, tax, etc.) and categorical features (brand, model, transmission, etc.)
- Initial inspection shows multiple features that will require preprocessing:
  - Numerical features that need cleaning (negative values, outliers)
  - Categorical features that need standardization
  - Presence of missing values in several columns
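
The missing-value observation above can be quantified per column. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` and the toy values are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        'n_missing': counts,
        'pct_missing': (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values('n_missing', ascending=False)

# Toy frame standing in for the real data (values are invented)
toy = pd.DataFrame({
    'mileage': [12000.0, np.nan, 45000.0, -50.0],
    'tax': [145.0, 30.0, np.nan, np.nan],
    'Brand': ['Ford', 'ford', None, 'BMW'],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(df)`.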

<a id="data-exploration"></a>
## Data Exploration

<a id="boolean-features"></a>
### Boolean Features

In [None]:
df['hasDamage'].value_counts(dropna=False)

<a id="boolean-features-analysis"></a>
#### Boolean Features Analysis

Key observations about `hasDamage` feature:
- Contains only the value 0 and NaN
- No instances of value 1 found, suggesting potential data collection issues
- May indicate:
  - Cars with damage not being listed
  - System default setting of 0 for non-damaged cars
  - Incomplete damage assessment process
- Requires special handling in preprocessing:
  - Consider treating NaN as a separate category
  - Validate if 0 truly represents "no damage"
  - May need to be treated as a categorical rather than boolean feature
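
One way to treat NaN as its own category is to map the column to labelled levels. This is only a sketch of that option on invented values; the label names are assumptions, and the cleaning function later in this notebook takes a different route (it fills NaN with 1).

```python
import numpy as np
import pandas as pd

# Invented values mirroring the observed pattern: only 0 and NaN
has_damage = pd.Series([0.0, np.nan, 0.0, np.nan, 0.0], name='hasDamage')

# Map known values to labels, then give NaN its own 'unknown' level
damage_cat = has_damage.map({0.0: 'no_damage', 1.0: 'damage'}).fillna('unknown')
counts = damage_cat.value_counts()
print(counts)
```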

<a id="categorical-features"></a>
### Categorical Features

<a id="check-categorical-features-consistency"></a>
#### Check Categorical Features Consistency

In [None]:
# List of categorical features
cat_cols = ['Brand', 'model', 'fuelType', 'transmission']

# Collect the 10 rarest values per categorical feature (likely typos or variants)
cat_outliers_examples = {
    col: pd.Series(df[col].value_counts().tail(10).index) for col in cat_cols
}

# Display the rare values; Series pad with NaN when a column has fewer than 10
pd.DataFrame(cat_outliers_examples)

<a id="categorical-features-summary"></a>
#### Categorical Features Summary
- Initial analysis reveals significant data quality issues across all categorical columns
- No standardization in categorical features, with multiple variations of the same values (different spellings, capitalizations)
- Solution: We will standardize these features with edit-distance matching (`nltk.edit_distance`), mapping likely typos to the nearest frequent category
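
String-distance matching can be illustrated without any dependency. The sketch below implements Levenshtein distance in plain Python and maps raw values to a hypothetical list of clean brands; `closest_category` and `standard_brands` are illustrative names, not the notebook's own function.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def closest_category(value: str, standard: list[str], max_distance: int = 2) -> str:
    """Return the nearest standard category, or 'other' if none is close enough."""
    distances = [edit_distance(value, cat) for cat in standard]
    best = min(range(len(standard)), key=distances.__getitem__)
    return standard[best] if distances[best] <= max_distance else 'other'

standard_brands = ['toyota', 'ford', 'bmw']  # hypothetical clean list
raw_values = ['toyta', 'forrd', 'bmw', 'mercedes']
mapped = {raw: closest_category(raw, standard_brands) for raw in raw_values}
print(mapped)
```

A threshold of 2 is generous for short names: `'vw'` is within distance 2 of `'bmw'`, so it would be merged into it. A threshold of 1 may be safer for short categories.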

<a id="numerical-features"></a>
### Numerical Features

In [None]:
# Dict of numerical features
num_cols = {
    'price': 'continuous',
    'mileage': 'continuous',
    'tax': 'continuous',
    'mpg': 'continuous',
    'paintQuality%': 'continuous',
    'engineSize': 'continuous',
    'year': 'discrete',
    'previousOwners': 'discrete'
}

<a id="plots"></a>
#### Numerical Plots

In [None]:
# Plot figures for numerical features and the target variable (price)
plt.figure(figsize=(16, 10))
for i, (col, var_type) in enumerate(num_cols.items(), 1):
    plt.subplot(4, 2, i)

    # Plot based on variable type
    if var_type == 'continuous':
        sns.histplot(data=df, x=col, kde=True, color="lightcoral", bins=30)
        plt.title(f"Distribution of {col}", fontsize=11)
    elif var_type == 'discrete':
        sns.countplot(data=df, x=col, color="lightcoral")
        plt.title(f"Distribution of {col}", fontsize=11)
        plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

In [None]:
# Boxplots for continuous numerical features and the target variable (price)
continuous_cols = [col for col, var_type in num_cols.items() if var_type == 'continuous']
plt.figure(figsize=(16, 10))
for i, col in enumerate(continuous_cols, 1):
    plt.subplot(3, 2, i)
    sns.boxplot(data=df, x=col, color="lightblue")
    plt.title(f"Distribution of {col}", fontsize=11)

plt.tight_layout()
plt.show()

<a id="analysis-of-numerical-distributions"></a>
#### Analysis of Numerical Distributions

Key observations from the plots:
- **Target Variable (Price)**:
  - Highly right-skewed distribution
  - Contains a significant number of outliers in the upper range
  - Most cars are concentrated in the lower price range

- **Mileage**:
  - Right-skewed distribution
  - Large range from nearly new cars to high-mileage vehicles
  - Some outliers in upper range suggesting possible data entry errors

- **Tax**:
  - Multiple peaks suggesting different tax bands
  - Contains negative values which require investigation (possible tax benefits/rebates)
  - Large number of outliers on both ends of the distribution

- **MPG (Miles Per Gallon)**:
  - Approximately normal distribution with slight right skew
  - Some unrealistic extreme values that need cleaning
  - Reasonable median around typical car efficiency ranges

- **Paint Quality %**:
  - Contains values above 100% which are logically impossible
  - Left-skewed distribution suggesting optimistic ratings
  - Requires standardization to 0-100 range

- **Engine Size**:
  - Some engine sizes are zero, which is unrealistic for combustion engines (may indicate electric vehicles)
  - Some unusual patterns that need investigation
  - Contains outliers that may represent specialty vehicles

- **Year**:
  - Should be discrete but contains decimal values

- **Previous Owners**:
  - Should be integer but contains float values
  - Right-skewed distribution as expected
  - Maximum values need validation (unusually high number of previous owners)
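
Several of these observations can be turned into counts rather than read off the plots. A minimal sketch on an invented toy frame (the column names match the dataset, the values do not):

```python
import numpy as np
import pandas as pd

# Invented rows, each column containing exactly one problematic value
toy = pd.DataFrame({
    'tax': [145.0, -20.0, 30.0, np.nan],
    'paintQuality%': [80.0, 120.0, 55.0, 99.0],
    'year': [2017.0, 2018.5, 2015.0, np.nan],
    'previousOwners': [1.0, 2.0, 2.7, 0.0],
})

issue_counts = pd.Series({
    'negative tax': (toy['tax'] < 0).sum(),
    'paintQuality% above 100': (toy['paintQuality%'] > 100).sum(),
    # A non-zero remainder modulo 1 means the value has a decimal part
    'non-integer year': (toy['year'].dropna() % 1 != 0).sum(),
    'non-integer previousOwners': (toy['previousOwners'].dropna() % 1 != 0).sum(),
})
print(issue_counts)
```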

<a id="pre-processing"></a>
## Pre-processing

<a id="functions"></a>
### Functions

In [None]:
def general_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    """Perform general data cleaning on the DataFrame.
    
    This function handles logical inconsistencies and data quality issues that
    don't require statistical calculations (mean, median, etc.) to prevent data
    leakage between training and validation sets.
    
    Args:
        df (pd.DataFrame): The input DataFrame containing car data with columns:
            Brand, model, year, transmission, fuelType, mileage, tax, mpg, 
            engineSize, paintQuality%, previousOwners, hasDamage.
        
    Returns:
        pd.DataFrame: The cleaned DataFrame with logical inconsistencies resolved.
    """

    df = df.copy()

    # Set negative values to NaN for features that shouldn't be negative
    for col in ['previousOwners', 'mileage', 'tax', 'mpg', 'engineSize']:
        df.loc[df[col] < 0, col] = np.nan

    for col in ['Brand', 'model', 'transmission', 'fuelType']:
        df[col] = df[col].str.lower()
        df[col] = df[col].replace('', np.nan)

    # Remove decimal part from 'year'
    df['year'] = np.floor(df['year']).astype('Int64')

    # Remove decimal part from 'previousOwners'
    df['previousOwners'] = np.floor(df['previousOwners']).astype('Int64')

    # Ensure 'paintQuality%' is within 0-100
    df.loc[(df['paintQuality%'] < 0) | (df['paintQuality%'] > 100), 'paintQuality%'] = np.nan

    # Fill missing 'hasDamage' with 1
    df['hasDamage'] = df['hasDamage'].fillna(1).astype('Int64')

    return df

In [None]:
def standardize_categorical_col(series: pd.Series, 
                                standardised_cats: list[str], 
                                distance_threshold: int = 2) -> pd.Series:
    """Standardizes a categorical column using edit distance with a threshold.

    1. Maps values to a standard category if they are a likely typo
       (i.e., within the edit distance_threshold).
    2. Keeps values that are already in the standard list.
    3. Groups all other values that don't match into an 'other' bin.
    
    Args:
        series (pd.Series): The categorical column to standardize.
        standardised_cats (list[str]): The list of "good" categories to match against.
        distance_threshold (int): The max edit distance to consider something a typo.
                                  A value of 1 or 2 is recommended.
                                  
    Returns:
        pd.Series: The standardized categorical column.
    """
    
    # 1. Get all unique non-null values from the series
    unique_values = series.dropna().unique()
    
    # 2. Build the mapping dictionary
    mapping = {}
    
    for x in unique_values:
        x_str = str(x)
        
        # Check if it's already a perfect match
        if x_str in standardised_cats:
            mapping[x] = x_str
            continue

        # Find the closest match and its distance
        distances = [nltk.edit_distance(x_str, cat) for cat in standardised_cats]
        min_distance = np.min(distances)
        
        if min_distance <= distance_threshold:
            closest_cat = standardised_cats[np.argmin(distances)]
            mapping[x] = closest_cat
        else:
            mapping[x] = 'other'
            
    return series.map(mapping)

In [None]:
def get_categories_high_freq(series: pd.Series, percent_threshold: float = 0.02) -> list[str]:
    """Get categories that appear more than a dynamic percentage threshold.
    
    Args:
        series (pd.Series): The categorical series to analyze.
        percent_threshold (float): The minimum percentage of total rows a category
                                   must have to be included (e.g., 0.01 for 1%).
                                   
    Returns:
        list[str]: List of categories with frequency above the dynamic threshold.
    """
    
    # Calculate the dynamic count threshold based on the percentage
    dynamic_count_threshold = len(series) * percent_threshold
    
    value_counts = series.value_counts()
    
    # Keep only the categories whose count exceeds the threshold
    high_freq_cats = value_counts[value_counts > dynamic_count_threshold].index.tolist()
    
    return high_freq_cats
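
A small illustration of the frequency threshold, using invented model names: with 100 rows and a 2% threshold, a category must appear in more than 2 rows to be kept, so one-off misspellings drop out of the reference list.

```python
import pandas as pd

# Invented column: two frequent models and two one-off misspellings
models = pd.Series(['focus'] * 60 + ['fiesta'] * 38 + ['focsu'] + ['fiestaa'])

threshold = len(models) * 0.02  # 100 rows * 2% = 2.0 rows
counts = models.value_counts()
kept = counts[counts > threshold].index.tolist()
print(kept)
```

The comparison is strict, so a category with exactly 2 rows would also be excluded here.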

In [None]:
def calculate_upper_bound(series: pd.Series) -> float:
    """Calculates the upper outlier bound for a pandas Series."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    return Q3 + (1.5 * IQR)
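
As a worked example of the IQR rule on invented numbers: for the values below, pandas' default linear interpolation gives Q1 = 12.5 and Q3 = 17.5, so the IQR is 5 and the upper bound is 17.5 + 1.5 × 5 = 25.

```python
import pandas as pd

prices = pd.Series([10, 12, 14, 16, 18, 100])  # invented values

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

# Values above the bound are the ones the rule treats as outliers
flagged = prices[prices > upper].tolist()
print(q1, q3, upper, flagged)
```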

In [None]:
def clean_outliers(series: pd.Series, 
                   upper_bound: float,
                   lower_bound: float = 0.0, 
                   return_missing: bool = True) -> pd.Series:
    """Clean outliers in the Series based on specified bounds.

    Out-of-bound values are either set to NaN or clipped to the bounds,
    depending on `return_missing`.
    
    Args:
        series (pd.Series): The input Series containing numerical data.
        upper_bound (float): The upper bound for valid values.
        lower_bound (float): The lower bound for valid values.
        return_missing (bool): If True, set out-of-bound values to NaN.
                              If False, clip values to the bounds.
    
    Returns:
        pd.Series: The cleaned Series with outliers handled.
    """
    cleaned = series.copy()
    
    if return_missing:
        # Set out-of-bound values to NaN
        cleaned[(cleaned < lower_bound) | (cleaned > upper_bound)] = np.nan
    else:
        # Clip values to the specified bounds
        cleaned = cleaned.clip(lower=lower_bound, upper=upper_bound)
    
    return cleaned
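
The two modes behave differently downstream: setting values to NaN hands them to the later median imputation, while clipping keeps the row's "very high" signal at the bound. A sketch with invented mileage values and an assumed bound of 150,000:

```python
import pandas as pd

mileage = pd.Series([-5.0, 20_000.0, 80_000.0, 400_000.0])  # invented values
lower, upper = 0.0, 150_000.0  # assumed bounds for illustration

# Mode 1: out-of-bound values become NaN (between() is inclusive)
as_missing = mileage.where(mileage.between(lower, upper))

# Mode 2: out-of-bound values are pulled back to the nearest bound
clipped = mileage.clip(lower=lower, upper=upper)

print(pd.DataFrame({'raw': mileage, 'as_missing': as_missing, 'clipped': clipped}))
```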


In [None]:
def preprocess_data(X, cat_cols, num_cols, artifacts=None, fit=True):
    """
    Preprocess data using consistent transformations.
    
    Args:
        X (pd.DataFrame): Features to preprocess
        cat_cols (list): Categorical column names
        num_cols (dict): Numerical column names with types
        artifacts (dict): Preprocessing artifacts (if fit=False)
        fit (bool): If True, fit transformers; if False, use provided artifacts
        
    Returns:
        tuple: (X_processed, artifacts) if fit=True, else X_processed
    """
    X = X.copy()
    continuous_cols = [col for col, var_type in num_cols.items() if var_type == 'continuous']
    
    if fit:
        # Fit preprocessing on training data
        high_freq_cats = {col: get_categories_high_freq(X[col]) for col in cat_cols}
        mileage_upper = X['mileage'].quantile(0.95)
        outlier_bounds = {col: calculate_upper_bound(X[col]) for col in continuous_cols}
        medians = {col: X[col].median() for col in num_cols}
        
        artifacts = {
            'high_freq_cats': high_freq_cats,
            'mileage_upper': mileage_upper,
            'outlier_bounds': outlier_bounds,
            'medians': medians,
            'cat_cols': cat_cols,
            'num_cols': num_cols
        }
    else:
        high_freq_cats = artifacts['high_freq_cats']
        mileage_upper = artifacts['mileage_upper']
        outlier_bounds = artifacts['outlier_bounds']
        medians = artifacts['medians']
    
    # 1. Categorical preprocessing
    for col in cat_cols:
        X[col] = standardize_categorical_col(X[col], high_freq_cats[col])
        X[col] = X[col].fillna('other')
    
    # 2. Numerical outliers
    X['mileage'] = clean_outliers(X['mileage'], mileage_upper, 0, return_missing=False)
    
    for col in continuous_cols:
        if col != 'mileage':
            X[col] = clean_outliers(X[col], outlier_bounds[col])
    
    # 3. Fill missing values
    for col in num_cols:
        # Cast to float first: nullable Int64 columns reject a non-integer median
        X[col] = X[col].astype(float).fillna(medians[col])
    
    # 4. One-hot encoding
    if fit:
        ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
        ohe_data = pd.DataFrame(
            ohe.fit_transform(X[cat_cols]),
            columns=ohe.get_feature_names_out(cat_cols),
            index=X.index
        )
        artifacts['encoder'] = ohe
    else:
        ohe_data = pd.DataFrame(
            artifacts['encoder'].transform(X[cat_cols]),
            columns=artifacts['encoder'].get_feature_names_out(cat_cols),
            index=X.index
        )
    
    X = pd.concat([X.drop(columns=cat_cols), ohe_data], axis=1)
    
    # 5. Normalize numerical features
    numerical_cols = list(num_cols.keys())
    if fit:
        scaler = StandardScaler()
        X[numerical_cols] = scaler.fit_transform(X[numerical_cols])
        artifacts['scaler'] = scaler
    else:
        X[numerical_cols] = artifacts['scaler'].transform(X[numerical_cols])
    
    return (X, artifacts) if fit else X

In [None]:
def sample_hyperparameters(param_distributions, n_iter, seed):
    """
    Sample hyperparameters from distributions.
    
    Args:
        param_distributions (dict): Dictionary with parameter names as keys and
                                   distributions/lists as values
        n_iter (int): Number of parameter combinations to sample
        seed (int): Random seed for reproducibility
        
    Returns:
        list[dict]: List of parameter dictionaries
    """
    # One shared generator: each draw advances its state, so parameters sampled
    # in the same iteration are not tied to an identical seed.
    rng = np.random.RandomState(seed)
    param_list = []
    
    for _ in range(n_iter):
        params = {}
        for param_name, param_values in param_distributions.items():
            # Check if it's a scipy distribution (has .rvs method)
            if hasattr(param_values, 'rvs'):
                params[param_name] = param_values.rvs(random_state=rng)
            # Check if it's a list (discrete choices)
            elif isinstance(param_values, list):
                # Index into the list so Python types (None, bool) are preserved
                params[param_name] = param_values[rng.randint(len(param_values))]
            else:
                raise ValueError(f"Unknown parameter type for {param_name}")
        
        param_list.append(params)
    
    return param_list


In [None]:
def cross_validate_with_tuning(X_raw, y_raw, cat_cols_list, num_cols_dict, model_config, k=3, seed=42):
    """
    Perform k-fold cross-validation with manual hyperparameter search.
    Preprocessing is done within each fold to prevent data leakage.
    
    Args:
        X_raw (pd.DataFrame): Raw training features (not preprocessed)
        y_raw (pd.Series): Raw training target (not log-transformed)
        cat_cols_list (list): List of categorical column names
        num_cols_dict (dict): Dictionary of numerical columns with types
        model_config (dict): Configuration dictionary with keys:
            - 'model_class': sklearn model class (e.g., Ridge, Lasso, RandomForestRegressor)
            - 'param_distributions': dict of parameter distributions for sampling
            - 'n_iter': number of parameter settings to sample (default: 20)
        k (int): Number of CV folds (default: 3)
        seed (int): Random seed for reproducibility
        
    Returns:
        dict: Results with best_params, best_estimator, CV scores, and preprocessing artifacts
    """
    # Setup CV
    kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
    
    # Sample hyperparameters once (will be used across all folds)
    n_iter = model_config.get('n_iter', 20)
    param_combinations = sample_hyperparameters(
        model_config['param_distributions'], 
        n_iter, 
        seed
    )
    
    # Storage for results
    fold_results = []
    best_score_overall = float('inf')
    best_params_overall = None
    best_estimator_overall = None
    best_artifacts = None
    
    print(f"\n{'='*60}")
    print(f"Starting {k}-Fold Cross-Validation with Hyperparameter Tuning")
    print(f"{'='*60}")
    print(f"Model: {model_config['model_class'].__name__}")
    print(f"Parameter space: {model_config['param_distributions']}")
    print(f"Hyperparameter combinations: {n_iter}")
    print(f"Total model fits: {k * n_iter}")
    print(f"{'='*60}\n")
    
    # Perform manual CV with preprocessing in each fold
    for fold_idx, (train_idx, val_idx) in enumerate(kfold.split(X_raw), 1):
        print(f"Fold {fold_idx}/{k}")
        print(f"{'-'*40}")
        
        # Split data for this fold
        X_train_fold = X_raw.iloc[train_idx].copy()
        X_val_fold = X_raw.iloc[val_idx].copy()
        y_train_fold = y_raw.iloc[train_idx].copy()
        y_val_fold = y_raw.iloc[val_idx].copy()
        
        # Preprocess data for this fold
        X_train_processed, fold_artifacts = preprocess_data(
            X_train_fold, cat_cols_list, num_cols_dict, fit=True
        )
        X_val_processed = preprocess_data(
            X_val_fold, cat_cols_list, num_cols_dict, 
            artifacts=fold_artifacts, fit=False
        )
        
        # Log-transform target
        y_train_log = np.log1p(y_train_fold)
        
        # Hyperparameter tuning: try each parameter combination
        best_fold_score = float('inf')
        best_fold_params = None
        best_fold_model = None
        
        for param_idx, params in enumerate(param_combinations, 1):
            # Train model with these parameters
            model = model_config['model_class'](**params)
            model.fit(X_train_processed, y_train_log)
            
            # Predict on validation fold
            y_val_pred_log = model.predict(X_val_processed)
            y_val_pred = np.expm1(y_val_pred_log)
            
            # Calculate MAE
            fold_mae = mean_absolute_error(y_val_fold, y_val_pred)
            
            # Track best for this fold
            if fold_mae < best_fold_score:
                best_fold_score = fold_mae
                best_fold_params = params
                best_fold_model = model
            
            # Progress indicator every 5 iterations
            if param_idx % 5 == 0 or param_idx == n_iter:
                print(f"  Evaluated {param_idx}/{n_iter} parameter combinations...")
        
        # Calculate train performance for best model
        y_train_pred_log = best_fold_model.predict(X_train_processed)
        y_train_pred = np.expm1(y_train_pred_log)
        train_mae = mean_absolute_error(y_train_fold, y_train_pred)
        
        print(f"  Best params: {best_fold_params}")
        print(f"  Train MAE: £{train_mae:.2f}")
        print(f"  Validation MAE: £{best_fold_score:.2f}\n")
        
        # Store fold results
        fold_results.append({
            'fold': fold_idx,
            'best_params': best_fold_params,
            'train_mae': train_mae,
            'val_mae': best_fold_score,
            'best_model': best_fold_model,
            'artifacts': fold_artifacts
        })
        
        # Track overall best across all folds
        if best_fold_score < best_score_overall:
            best_score_overall = best_fold_score
            best_params_overall = best_fold_params
            best_estimator_overall = best_fold_model
            best_artifacts = fold_artifacts
    
    # Calculate mean and std of CV scores
    cv_scores = [r['val_mae'] for r in fold_results]
    mean_cv_score = np.mean(cv_scores)
    std_cv_score = np.std(cv_scores)
    
    print(f"{'='*60}")
    print(f"Cross-Validation Results:")
    print(f"{'='*60}")
    print(f"Mean CV MAE: £{mean_cv_score:.2f} ± £{std_cv_score:.2f}")
    print(f"Best Fold MAE: £{best_score_overall:.2f}")
    print(f"Best Parameters: {best_params_overall}")
    print(f"{'='*60}\n")
    
    # Train final model on all data using best parameters
    print("Training final model on all data with best parameters...")
    X_all_processed, final_artifacts = preprocess_data(
        X_raw, cat_cols_list, num_cols_dict, fit=True
    )
    y_all_log = np.log1p(y_raw)
    
    final_model = model_config['model_class'](**best_params_overall)
    final_model.fit(X_all_processed, y_all_log)
    
    print("Final model trained successfully!\n")
    
    return {
        'best_params': best_params_overall,
        'best_estimator': final_model,
        'mean_cv_score': mean_cv_score,
        'std_cv_score': std_cv_score,
        'best_fold_score': best_score_overall,
        'fold_results': fold_results,
        'preprocessing_artifacts': final_artifacts
    }

In [None]:
def preprocess_test_data(test_df, artifacts):
    """
    Preprocess test data using artifacts from training.
    
    Args:
        test_df (pd.DataFrame): Raw test dataframe
        artifacts (dict): Preprocessing artifacts from cross_validate_with_tuning
        
    Returns:
        pd.DataFrame: Preprocessed test data ready for prediction
    """
    # General cleaning
    test_cleaned = general_cleaning(test_df)
    
    # Apply preprocessing using artifacts
    test_processed = preprocess_data(
        test_cleaned, 
        artifacts['cat_cols'], 
        artifacts['num_cols'], 
        artifacts=artifacts, 
        fit=False
    )
    
    return test_processed

#### Summary of Preprocessing Pipeline

The pipeline is split across three functions:

1. **`preprocess_data()`** - Preprocesses a single dataset
   - Handles categorical features (standardization, encoding)
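   - Handles numerical outliers (mileage is clipped at the training 95th percentile; other continuous features above the IQR upper bound are set to NaN)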
   - Imputes missing values with medians
   - One-hot encodes categorical features
   - Normalizes numerical features with StandardScaler
   - Can fit transformers (fit=True) or use existing ones (fit=False)

2. **`cross_validate_with_tuning()`** - Performs CV with hyperparameter tuning
   - Takes **raw data** (after general_cleaning)
   - Applies preprocessing **separately for each fold** (prevents data leakage)
   - Performs manual hyperparameter search by sampling from parameter distributions
   - Evaluates each combination on validation fold and tracks train/validation performance
   - Refits a final model on all data using the parameters from the best-scoring fold, and returns it with the preprocessing artifacts

3. **`preprocess_test_data()`** - Preprocesses test data
   - Uses artifacts from CV to ensure consistency
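
`cross_validate_with_tuning()` keeps the parameters from whichever single fold produced the lowest validation MAE. A common alternative, sketched here with invented scores, averages each parameter combination's MAE across folds and picks the lowest mean; this tends to be less sensitive to one favourable split. The score matrix below is hypothetical and the sketch is not the notebook's implementation.

```python
import numpy as np

# Hypothetical validation MAEs: rows = folds, columns = parameter combinations
val_mae = np.array([
    [2100.0, 1950.0, 2300.0],
    [2050.0, 2000.0, 1900.0],
    [2150.0, 1980.0, 2400.0],
])

# Rule A: lowest mean MAE across folds
mean_per_combo = val_mae.mean(axis=0)
best_combo = int(np.argmin(mean_per_combo))

# Rule B: lowest MAE in any single fold (what a best-fold rule selects)
best_single_fold = np.unravel_index(np.argmin(val_mae), val_mae.shape)

print(f"Mean MAE per combination: {mean_per_combo.round(1)}")
print(f"Best by mean: combination {best_combo}")
print(f"Best single score: fold {best_single_fold[0]}, combination {best_single_fold[1]}")
```

In this toy matrix the two rules disagree: combination 2 wins one fold but has the worst average, while combination 1 is the most consistent.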

<a id="data-preparation"></a>
### Data Preparation

In [None]:
# Prepare cleaned data for cross-validation
df_cleaned = general_cleaning(df)
X = df_cleaned.drop(columns=["price"])
y = df_cleaned["price"]

# Remove 'price' from num_cols since it's the target
# (pop with a default keeps this cell safe to re-run)
num_cols.pop('price', None)

print(f"Dataset size: {X.shape}")
print(f"Target range: £{y.min():.2f} - £{y.max():.2f}")
print(f"\nReady for cross-validation!")

<a id="correlation-analysis"></a>
#### Correlation Analysis

Before model training, let's examine correlations between numerical features to understand their relationships.

In [None]:
# Correlation matrix for numerical features
fig = plt.figure(figsize=(10, 8))
corr = X[list(num_cols.keys())].corr(method="pearson")
sns.heatmap(data=corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()

<a id="model-training"></a>
## Model Training

<a id="model-selection-with-cv"></a>
### Model Selection with Cross-Validation

We'll use cross-validation with hyperparameter tuning to select the best model. Configure your model using a dictionary with the model class, parameter distributions, and number of iterations.

<a id="quick-baseline-model"></a>
### Quick Baseline Model

Before running extensive CV, let's train a simple baseline model for quick reference.

In [None]:
# Quick train/val split for baseline
X_train_baseline, X_val_baseline, y_train_baseline, y_val_baseline = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

# Preprocess baseline data
X_train_processed, baseline_artifacts = preprocess_data(X_train_baseline, cat_cols, num_cols, fit=True)
X_val_processed = preprocess_data(X_val_baseline, cat_cols, num_cols, artifacts=baseline_artifacts, fit=False)

# Train simple Ridge model
baseline_model = Ridge(alpha=1.0, fit_intercept=True)
baseline_model.fit(X_train_processed, np.log1p(y_train_baseline))

# Evaluate
y_train_pred = np.expm1(baseline_model.predict(X_train_processed))
y_val_pred = np.expm1(baseline_model.predict(X_val_processed))

mae_train = mean_absolute_error(y_train_baseline, y_train_pred)
mae_val = mean_absolute_error(y_val_baseline, y_val_pred)
r2_val = r2_score(y_val_baseline, y_val_pred)

print(f"Baseline Ridge (alpha=1.0):")
print(f"  Train MAE: £{mae_train:.2f}")
print(f"  Val MAE:   £{mae_val:.2f}")
print(f"  Val R²:    {r2_val:.4f}")
print(f"\nThis gives us a reference point before hyperparameter tuning with CV.")
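
The baseline fits on `np.log1p(price)` and maps predictions back with `np.expm1`, so the MAE is reported in pounds rather than log units. A minimal round-trip check on invented prices shows the two functions are inverses, including at zero:

```python
import numpy as np

prices = np.array([0.0, 500.0, 12_000.0, 150_000.0])  # invented values

log_prices = np.log1p(prices)     # log(1 + x), defined at x = 0
recovered = np.expm1(log_prices)  # exp(y) - 1, the exact inverse

print(log_prices.round(3))
print(recovered.round(2))
```

The log scale compresses the long right tail seen in the price distribution, which is the usual motivation for fitting a linear model on the transformed target.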

<a id="experiment-algorithms"></a>
### Experiment Algorithms

Now we'll experiment with different algorithms using cross-validation with hyperparameter tuning.

In [None]:
# Example 1: Ridge Regression with hyperparameter tuning
ridge_config = {
    'model_class': Ridge,
    'param_distributions': {
        'alpha': loguniform(1e-3, 1e2),
        'fit_intercept': [True, False]
    },
    'n_iter': 20
}

ridge_results = cross_validate_with_tuning(X, y, cat_cols, num_cols, ridge_config, k=3, seed=SEED)

In [None]:
# Example 2: Lasso Regression
lasso_config = {
    'model_class': Lasso,
    'param_distributions': {
        'alpha': loguniform(1e-3, 1e2),
        'fit_intercept': [True, False]
    },
    'n_iter': 20
}

lasso_results = cross_validate_with_tuning(X, y, cat_cols, num_cols, lasso_config, k=3, seed=SEED)

In [None]:
# Example 3: Random Forest
rf_config = {
    'model_class': RandomForestRegressor,
    'param_distributions': {
        'n_estimators': randint(50, 200),
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': randint(2, 10),
        'min_samples_leaf': randint(1, 5)
    },
    'n_iter': 4
}

rf_results = cross_validate_with_tuning(X, y, cat_cols, num_cols, rf_config, k=3, seed=SEED)

In [None]:
# Select best model based on CV results
best_result = ridge_results  # Choose: ridge_results, lasso_results, or rf_results
best_model = best_result['best_estimator']
preprocessing_artifacts = best_result['preprocessing_artifacts']

print(f"\nSelected Model: {best_model.__class__.__name__}")
print(f"CV Performance: MAE = £{best_result['mean_cv_score']:.2f} ± £{best_result['std_cv_score']:.2f}")
print(f"Best Parameters: {best_result['best_params']}")

<a id="predictions"></a>
## Predictions

In [None]:
# Load and preprocess test data
test_df = pd.read_csv('data/test.csv').set_index('carID')
test_processed = preprocess_test_data(test_df, preprocessing_artifacts)

# Make predictions
test_predictions = np.expm1(best_model.predict(test_processed))

# Save predictions
predictions_df = pd.DataFrame({'price': test_predictions}, index=test_df.index)
predictions_df.to_csv('data/test_predictions.csv')

print(f"Predictions saved for {len(test_predictions)} test samples")
print(f"Predicted price range: £{test_predictions.min():.2f} - £{test_predictions.max():.2f}")