# Machine Learning Project

## Table of Contents
- [Import Data](#import-data)
  - [Import Data Summary](#import-data-summary)
- [Data Exploration](#data-exploration)
  - [Boolean Features](#boolean-features)
    - [Boolean Features Analysis](#boolean-features-analysis)
  - [Categorical Features](#categorical-features)
    - [Check Categorical Features Consistency](#check-categorical-features-consistency)
    - [Categorical Features Summary](#categorical-features-summary)
  - [Numerical Features](#numerical-features)
    - [Numerical Plots](#plots)
    - [Analysis of Numerical Distributions](#analysis-of-numerical-distributions)
- [Pre-processing](#pre-processing)
  - [Summary of Preprocessing Pipeline](#preprocessing-pipeline-summary)
  - [Data Preparation](#data-preparation)
    - [Correlation Analysis](#correlation-analysis)
- [Model Training](#model-training)
  - [Model Selection with Cross-Validation](#model-selection-with-cv)
  - [Quick Baseline Model](#quick-baseline-model)
  - [Feature Importance Analysis](#feature-importance-analysis)
  - [Experiment Algorithms](#experiment-algorithms)
- [Predictions](#predictions)

# Abstract

### Context:
This project addresses car price prediction—a fundamental regression task in machine learning. The dataset includes features spanning categorical attributes (brand, model, transmission type, fuel type) and numerical characteristics (mileage, engine size, tax, MPG, paint quality). This problem is relevant for automotive valuations, insurance pricing, and market analysis.

### Objectives:
The primary goals were to:
1.  Systematically explore and preprocess complex, real-world automotive data containing missing values, outliers, and inconsistencies.
2.  Develop a robust preprocessing pipeline that prevents data leakage through proper fold-wise application.
3.  Benchmark multiple regression algorithms with hyperparameter tuning via cross-validation.
4.  Identify the most influential features through importance analysis.

### Methodology:
Data exploration revealed categorical inconsistencies (typos, spacing variations) and numerical anomalies (negative values, out-of-range percentages). The preprocessing pipeline incorporated:
* General cleaning
* Categorical standardization using edit distance
* Outlier handling via IQR method
* Imputation with training-set medians
* One-hot encoding
* Feature normalization


### Main Results:
A **Ridge regression** model trained on log-transformed targets served as the baseline and was scored by MAE on a held-out validation split. Feature importance analysis pointed to `mileage`, `year`, and `engineSize` as the primary predictors, while `hasDamage` and certain catch-all (overflow) categorical levels contributed minimal signal. Hyperparameter tuning produced a best configuration for each algorithm, with **ensemble methods** generally outperforming linear approaches. Cross-validation tracked both training and validation metrics to detect overfitting.

### Conclusions:
The project demonstrates that systematic preprocessing and ensemble approaches significantly improve prediction accuracy. Feature engineering and selection based on importance analysis reduced model complexity while preserving predictive power, supporting the principle that data quality and feature selection are as critical as algorithm choice in regression tasks.

<a id="import-data"></a>
## Import Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, HistGradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import loguniform, randint
from model_training_utils import (
    general_cleaning,
    preprocess_data,
    cross_validate_with_tuning,
    preprocess_test_data,
    get_feature_importance
)

SEED = 42
LOG_TARGET = True

warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('data/train.csv').set_index('carID')
df.head()

In [None]:
df.info()

In [None]:
num_duplicated_ids = df.index.duplicated().sum()
print(f'Number of duplicated carIDs: {num_duplicated_ids}')

<a id="import-data-summary"></a>
#### Import Data Summary
- Dataset loaded successfully with `carID` as the index
- There are no duplicate entries in carID
- The dataset contains information about cars including both numerical features (price, mileage, tax, etc.) and categorical features (brand, model, transmission, etc.)
- Initial inspection shows multiple features that will require preprocessing:
  - Numerical features that need cleaning (negative values, outliers)
  - Categorical features that need standardization
  - Presence of missing values in several columns

<a id="data-exploration"></a>
## Data Exploration

<a id="boolean-features"></a>
### Boolean Features

In [None]:
df['hasDamage'].value_counts(dropna=False)

<a id="boolean-features-analysis"></a>
#### Boolean Features Analysis

Key observations about `hasDamage` feature:
- Contains only the value 0 and NaN
- No instances of value 1 found, suggesting potential data collection issues
- May indicate:
  - Cars with damage not being listed
  - System default setting of 0 for non-damaged cars
  - Incomplete damage assessment process
- Requires special handling in preprocessing:
  - Consider treating NaN as a separate category
  - Validate if 0 truly represents "no damage"
  - May need to be treated as a categorical rather than boolean feature
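
One way to act on the first suggestion is sketched below in plain Python: keep the observed value and add a separate missing indicator, so a model can learn from missingness itself. The helper name `encode_has_damage` and the sample values are illustrative assumptions, not part of the project code.

```python
import math


def encode_has_damage(values):
    """Split a 0/NaN column into (value, was_missing) pairs.

    Hypothetical helper for illustration; NaN and None both count as missing.
    """
    encoded = []
    for v in values:
        missing = v is None or (isinstance(v, float) and math.isnan(v))
        # Missing entries get a placeholder value of 0 plus a flag of 1
        encoded.append((0 if missing else int(v), int(missing)))
    return encoded


sample = [0.0, float("nan"), 0.0, None]  # made-up values
encoded_sample = encode_has_damage(sample)
print(encoded_sample)
```

Since the observed column is constant here, the indicator would carry whatever signal the feature has; whether that signal is useful is an empirical question for the importance analysis.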

<a id="categorical-features"></a>
### Categorical Features

<a id="check-categorical-features-consistency"></a>
#### Check Categorical Features Consistency

In [None]:
# List of categorical features
cat_cols = ['Brand', 'model', 'fuelType', 'transmission']

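# Collect the 10 rarest values per categorical feature, where typos and variants tend to show up
# (wrapping in a Series pads with NaN if a column has fewer than 10 distinct values)
cat_outliers_examples = {col: pd.Series(df[col].value_counts().tail(10).index) for col in cat_cols}

# Display the rare-value examples side by side
pd.DataFrame(cat_outliers_examples)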

<a id="categorical-features-summary"></a>
#### Categorical Features Summary
- Initial analysis reveals significant data quality issues across all categorical columns
- No standardization in categorical features, with multiple variations of the same values (different spellings, capitalizations)
- Solution: We will implement string distance-based standardization using the `nltk` library to clean and standardize these features
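
The idea behind distance-based standardization can be sketched without any dependencies. The snippet below is a minimal illustration, not the project's implementation: the canonical labels, the `max_distance` threshold, and both function names are assumptions. In practice, `edit_distance` from `nltk.metrics.distance` could replace the hand-written `levenshtein`.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def standardize(value, canonical, max_distance=2):
    """Map a raw string to the closest canonical label.

    Returns the cleaned string unchanged when no label is within max_distance.
    """
    cleaned = value.strip().lower()
    best = min(canonical, key=lambda c: levenshtein(cleaned, c.lower()))
    return best if levenshtein(cleaned, best.lower()) <= max_distance else cleaned


canonical_fuels = ["Petrol", "Diesel", "Hybrid", "Electric"]  # assumed labels
raw = [" petrol", "DIESEL", "Hybrd", "etrol", "steam"]        # made-up inputs
standardized = [standardize(v, canonical_fuels) for v in raw]
print(standardized)
```

The threshold is the main design choice: too small and typos survive, too large and distinct short labels get merged.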

<a id="numerical-features"></a>
### Numerical Features

In [None]:
# Dict of numerical features
num_cols = {
    'price': 'continuous',
    'mileage': 'continuous',
    'tax': 'continuous',
    'mpg': 'continuous',
    'paintQuality%': 'continuous',
    'engineSize': 'continuous',
    'year': 'discrete',
    'previousOwners': 'discrete'
}

<a id="plots"></a>
#### Numerical Plots

In [None]:
# Plot figures for numerical features and the target variable (price)
plt.figure(figsize=(16, 10))
for i, (col, var_type) in enumerate(num_cols.items(), 1):
    plt.subplot(4, 2, i)

    # Plot based on variable type
    if var_type == 'continuous':
        sns.histplot(data=df, x=col, kde=True, color="lightcoral", bins=30)
        plt.title(f"Distribution of {col}", fontsize=11)
    elif var_type == 'discrete':
        sns.countplot(data=df, x=col, color="lightcoral")
        plt.title(f"Distribution of {col}", fontsize=11)
        plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

In [None]:
# Boxplots for continuous numerical features and the target variable (price)
continuous_cols = [col for col, var_type in num_cols.items() if var_type == 'continuous']
plt.figure(figsize=(16, 10))
for i, col in enumerate(continuous_cols, 1):
    plt.subplot(3, 2, i)
    sns.boxplot(data=df, x=col, color="lightblue")
    plt.title(f"Distribution of {col}", fontsize=11)

plt.tight_layout()
plt.show()

<a id="analysis-of-numerical-distributions"></a>
#### Analysis of Numerical Distributions

Key observations from the plots:
- **Target Variable (Price)**:
  - Highly right-skewed distribution
  - Contains a significant number of outliers in the upper range
  - Most cars are concentrated in the lower price range

- **Mileage**:
  - Right-skewed distribution
  - Large range from nearly new cars to high-mileage vehicles
  - Some outliers in upper range suggesting possible data entry errors

- **Tax**:
  - Multiple peaks suggesting different tax bands
  - Contains negative values which require investigation (possible tax benefits/rebates)
  - Large number of outliers on both ends of the distribution

- **MPG (Miles Per Gallon)**:
  - Approximately normal distribution with slight right skew
  - Some unrealistic extreme values that need cleaning
  - Reasonable median around typical car efficiency ranges

- **Paint Quality %**:
  - Contains values above 100% which are logically impossible
  - Left-skewed distribution suggesting optimistic ratings
  - Requires standardization to 0-100 range

- **Engine Size**:
  - Some engine sizes are zero, which is unrealistic for combustion engines (might indicate electric vehicles)
  - Some unusual patterns that need investigation
  - Contains outliers that may represent specialty vehicles

- **Year**:
  - Should be discrete but contains decimal values

- **Previous Owners**:
  - Should be integer but contains float values
  - Right-skewed distribution as expected
  - Maximum values need validation (unusually high number of previous owners)
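
A cleaning pass along these lines might flag impossible values as missing so that later imputation can handle them. The rules below are assumptions made for illustration (for instance, negative tax could be a legitimate rebate and deserve different treatment), and `clean_row` is a hypothetical helper, not the project's `general_cleaning`.

```python
def clean_row(row):
    """Return a copy of `row` with impossible values replaced by None.

    Illustrative rules only; the thresholds and columns are assumptions.
    """
    cleaned = dict(row)
    # Negative tax or mileage is treated as a data-entry error
    for key in ("tax", "mileage"):
        if cleaned.get(key) is not None and cleaned[key] < 0:
            cleaned[key] = None
    # A percentage must lie in [0, 100]
    pq = cleaned.get("paintQuality%")
    if pq is not None and not 0 <= pq <= 100:
        cleaned["paintQuality%"] = None
    # Year and owner counts are whole numbers stored as floats
    for key in ("year", "previousOwners"):
        if cleaned.get(key) is not None:
            cleaned[key] = int(round(cleaned[key]))
    return cleaned


example = {"tax": -30.0, "mileage": 42000.0, "paintQuality%": 125.0,
           "year": 2017.0, "previousOwners": 2.0}  # made-up record
cleaned_example = clean_row(example)
print(cleaned_example)
```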

<a id="pre-processing"></a>
## Pre-processing

<a id="preprocessing-pipeline-summary"></a>
### Summary of Preprocessing Pipeline

The preprocessing is split into three functions:

1. **`preprocess_data()`** - Preprocesses a single dataset
   - Standardizes categorical labels (typos, spacing, capitalization)
   - Handles numerical outliers using IQR method
   - Imputes missing values with medians
   - One-hot encodes categorical features
   - Normalizes numerical features with StandardScaler
   - Can fit transformers (fit=True) or use existing ones (fit=False)

2. **`cross_validate_with_tuning()`** - Performs CV with hyperparameter tuning
   - Takes **raw data** (after general_cleaning)
   - Applies preprocessing **separately for each fold** (prevents data leakage)
   - Performs manual hyperparameter search by sampling from parameter distributions
   - Evaluates each combination on validation fold and tracks train/validation performance
   - Returns best model configurations

3. **`preprocess_test_data()`** - Preprocesses test data
   - Uses artifacts from CV to ensure consistency

<a id="data-preparation"></a>
### Data Preparation

In [None]:
# Prepare cleaned data for cross-validation
df_cleaned = general_cleaning(df)
X = df_cleaned.drop(columns=["price"])
y = df_cleaned["price"]

# Remove 'price' from num_cols since it's the target
# (pop with a default keeps this cell safe to re-run)
num_cols.pop('price', None)

print(f"Dataset size: {X.shape}")
print(f"Target range: £{y.min():.2f} - £{y.max():.2f}")


<a id="correlation-analysis"></a>
#### Correlation Analysis

Before model training, let's examine correlations between numerical features to understand their relationships.

In [None]:
# Correlation matrix for numerical features
fig = plt.figure(figsize=(10, 8))
corr = X[list(num_cols.keys())].corr(method="pearson")
sns.heatmap(data=corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()

<a id="model-training"></a>
## Model Training

<a id="model-selection-with-cv"></a>
### Model Selection with Cross-Validation

We'll use cross-validation with hyperparameter tuning to select the best model. Configure your model using a dictionary with the model class, parameter distributions, and number of iterations.
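
To make the configuration format concrete, here is a dependency-free sketch of how a random search might draw candidates from such a dictionary and build the folds. It is not the code of `cross_validate_with_tuning`: `LogUniform` is a simplified stand-in for `scipy.stats.loguniform` with a different `rvs` signature, and `sample_params` and `k_fold_splits` are hypothetical helpers.

```python
import math
import random


class LogUniform:
    """Tiny stand-in for scipy.stats.loguniform (simplified, hypothetical)."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def rvs(self, rng):
        # Sample uniformly in log space, then map back
        return math.exp(rng.uniform(math.log(self.low), math.log(self.high)))


def sample_params(param_distributions, rng):
    """Draw one configuration: sample objects with .rvs, pick uniformly from lists."""
    return {
        name: spec.rvs(rng) if hasattr(spec, "rvs") else rng.choice(spec)
        for name, spec in param_distributions.items()
    }


def k_fold_splits(n_samples, k, rng):
    """Yield (train_indices, val_indices) for k shuffled folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


rng = random.Random(42)
config = {
    "param_distributions": {
        "alpha": LogUniform(1e-3, 1e2),
        "fit_intercept": [True, False],
    },
    "n_iter": 3,
}
candidates = [sample_params(config["param_distributions"], rng)
              for _ in range(config["n_iter"])]
splits = list(k_fold_splits(n_samples=9, k=3, rng=rng))
for params in candidates:
    # In a real pipeline, preprocessing would be fitted on each training fold here
    print(params, [len(val) for _, val in splits])
```

Writing fixed settings as single-element lists keeps them compatible with this kind of sampler, since every entry is either a list or a distribution.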

<a id="quick-baseline-model"></a>
### Quick Baseline Model

Before running extensive CV, let's train a simple baseline model for quick reference.

In [None]:
# Quick train/val split for baseline
X_train_baseline, X_val_baseline, y_train_baseline, y_val_baseline = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

# Preprocess baseline data
X_train_processed, baseline_artifacts = preprocess_data(X_train_baseline, cat_cols, num_cols, fit=True)
X_val_processed = preprocess_data(X_val_baseline, cat_cols, num_cols, artifacts=baseline_artifacts, fit=False)

# Train simple Ridge model
baseline_model = Ridge(alpha=1.0, fit_intercept=True)

if LOG_TARGET:
    baseline_model.fit(X_train_processed, np.log1p(y_train_baseline))
    y_train_pred = np.expm1(baseline_model.predict(X_train_processed))
    y_val_pred = np.expm1(baseline_model.predict(X_val_processed))
else:
    baseline_model.fit(X_train_processed, y_train_baseline)
    y_train_pred = baseline_model.predict(X_train_processed)
    y_val_pred = baseline_model.predict(X_val_processed)

mae_train = mean_absolute_error(y_train_baseline, y_train_pred)
mae_val = mean_absolute_error(y_val_baseline, y_val_pred)
r2_val = r2_score(y_val_baseline, y_val_pred)

print("Baseline Ridge (alpha=1.0):")
print(f"  Train MAE: £{mae_train:.2f}")
print(f"  Val MAE:   £{mae_val:.2f}")
print(f"  Val R²:    {r2_val:.4f}")
print("\nThis gives us a reference point before hyperparameter tuning with CV.")

<a id="feature-importance-analysis"></a>
### Feature Importance Analysis

In this section, we will train 4 different models (Ridge, Random Forest, Gradient Boosting, and Extra Trees) to analyze feature importance. Based on the results, we will select a subset of features to use for the final hyperparameter tuning.

In [None]:
# Models for Feature Importance Analysis
# We use default parameters or a small search space for quick analysis

# 1. Ridge (Linear)
print("--- Ridge Feature Importance ---")
fi_ridge_config = {
    'model_class': Ridge, 
    'param_distributions': {'alpha': [1.0]}, 
    'n_iter': 1
}
fi_ridge_results = cross_validate_with_tuning(
    X, y, cat_cols, num_cols, fi_ridge_config, k=3, seed=SEED, verbose=False
)

# 2. Random Forest
print("\n--- Random Forest Feature Importance ---")
fi_rf_config = {
    'model_class': RandomForestRegressor, 
    'param_distributions': {'n_estimators': [100], 'max_depth': [10]}, 
    'n_iter': 1
}
fi_rf_results = cross_validate_with_tuning(
    X, y, cat_cols, num_cols, fi_rf_config, k=3, seed=SEED, verbose=False
)

# 3. Gradient Boosting
print("\n--- Gradient Boosting Feature Importance ---")
fi_gb_config = {
    'model_class': GradientBoostingRegressor, 
    'param_distributions': {'n_estimators': [100], 'max_depth': [5]}, 
    'n_iter': 1
}
fi_gb_results = cross_validate_with_tuning(
    X, y, cat_cols, num_cols, fi_gb_config, k=3, seed=SEED, verbose=False
)

# 4. Extra Trees
print("\n--- Extra Trees Feature Importance ---")
fi_et_config = {
    'model_class': ExtraTreesRegressor, 
    'param_distributions': {'n_estimators': [100], 'max_depth': [10]}, 
    'n_iter': 1
}
fi_et_results = cross_validate_with_tuning(
    X, y, cat_cols, num_cols, fi_et_config, k=3, seed=SEED, verbose=False
)

In [None]:
# Plot Feature Importance for Ridge
get_feature_importance(fi_ridge_results['best_estimator'], 
                       preprocess_data(X, cat_cols, num_cols, artifacts=fi_ridge_results['final_artifacts'], fit=False))

In [None]:
# Plot Feature Importance for Random Forest
get_feature_importance(fi_rf_results['best_estimator'], 
                       preprocess_data(X, cat_cols, num_cols, artifacts=fi_rf_results['final_artifacts'], fit=False))

In [None]:
# Plot Feature Importance for Gradient Boosting
get_feature_importance(fi_gb_results['best_estimator'], 
                       preprocess_data(X, cat_cols, num_cols, artifacts=fi_gb_results['final_artifacts'], fit=False))

In [None]:
# Plot Feature Importance for Extra Trees
get_feature_importance(fi_et_results['best_estimator'], 
                       preprocess_data(X, cat_cols, num_cols, artifacts=fi_et_results['final_artifacts'], fit=False))

In [None]:
# Get the list of all processed feature names from one of the artifacts
all_features = list(preprocess_data(X, cat_cols, num_cols, artifacts=fi_ridge_results['final_artifacts'], fit=False).columns)

print("Original Feature Count:", len(all_features))

# Collect feature importance from all models
feature_importance_df = pd.DataFrame(index=all_features)

models_fi = {
    'Ridge': fi_ridge_results,
    'Random Forest': fi_rf_results,
    'Gradient Boosting': fi_gb_results,
    'Extra Trees': fi_et_results
}

for name, result in models_fi.items():
    # Reconstruct processed X to match columns
    X_processed = preprocess_data(X, cat_cols, num_cols, artifacts=result['final_artifacts'], fit=False)
    
    # Get importance dataframe (plot=False to avoid duplicate plots)
    imp_df = get_feature_importance(result['best_estimator'], X_processed, plot=False)
    
    # Map values to the main dataframe
    # imp_df has columns: Feature, Value, Method
    # We set the index to Feature and extract Value
    imp_series = imp_df.set_index('Feature')['Value']
    feature_importance_df[name] = imp_series

# Normalize each column to 0-1 range
scaler = MinMaxScaler()
feature_importance_normalized = pd.DataFrame(
    scaler.fit_transform(feature_importance_df.fillna(0)), 
    columns=feature_importance_df.columns, 
    index=feature_importance_df.index
)

# Calculate mean importance
feature_importance_normalized['Mean_Importance'] = feature_importance_normalized.mean(axis=1)
feature_importance_normalized = feature_importance_normalized.sort_values('Mean_Importance', ascending=False)

print("\nTop 15 Features by Mean Importance:")
print(feature_importance_normalized['Mean_Importance'].head(15))

# Selection strategy: keep features with mean normalized importance > 0.05
threshold = 0.05
selected_features = feature_importance_normalized[feature_importance_normalized['Mean_Importance'] > threshold].index.tolist()

print(f"\nSelected {len(selected_features)} features (Mean Importance > {threshold}).")
print(f"Dropped {len(all_features) - len(selected_features)} features: {list(set(all_features) - set(selected_features))}")

<a id="experiment-algorithms"></a>
### Experiment Algorithms

Now we'll experiment with different algorithms using cross-validation with hyperparameter tuning.

In [None]:
# Example 1: Ridge Regression with hyperparameter tuning
ridge_config = {
    'model_class': Ridge,
    'param_distributions': {
        'alpha': loguniform(1e-3, 1e2),
        'fit_intercept': [True, False]
    },
    'n_iter': 20
}

ridge_results = cross_validate_with_tuning(
    X, 
    y, 
    cat_cols, 
    num_cols, 
    ridge_config, 
    k=3, 
    seed=SEED, 
    selected_features=selected_features, 
    log_target=LOG_TARGET
)

In [None]:
# Example 2: Lasso Regression
lasso_config = {
    'model_class': Lasso,
    'param_distributions': {
        'alpha': loguniform(1e-3, 1e2),
        'fit_intercept': [True, False]
    },
    'n_iter': 20
}

lasso_results = cross_validate_with_tuning(
    X, 
    y, 
    cat_cols, 
    num_cols, 
    lasso_config, 
    k=3, 
    seed=SEED, 
    selected_features=selected_features, 
    log_target=LOG_TARGET
)

In [None]:
# Example 3: Random Forest
rf_config = {
    'model_class': RandomForestRegressor,
    'param_distributions': {
        'n_estimators': randint(50, 200),
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10),
        'max_features': ['sqrt', 0.5],
        'random_state': [SEED]  # fixed value, given as a single-element list
    },
    'n_iter': 10
}

rf_results = cross_validate_with_tuning(
    X, 
    y, 
    cat_cols, 
    num_cols, 
    rf_config, 
    k=3, 
    seed=SEED, 
    selected_features=selected_features, 
    log_target=LOG_TARGET
)

In [None]:
# Example 4: Hist Gradient Boosting Regressor
hgb_config = {
    'model_class': HistGradientBoostingRegressor,
    'param_distributions': {
        'learning_rate': loguniform(0.01, 0.3),
        'max_depth': randint(3, 10),
        'max_leaf_nodes': randint(15, 63),
        'l2_regularization': loguniform(1e-3, 10),
        'max_iter': [100, 200],
        'loss': ['absolute_error'],
        'random_state': [SEED]  # fixed value, given as a single-element list like 'loss'
    },
    'n_iter': 20
}

hgb_results = cross_validate_with_tuning(
    X, 
    y, 
    cat_cols, 
    num_cols, 
    hgb_config, 
    k=3, 
    seed=SEED, 
    selected_features=selected_features, 
    log_target=False
)

In [None]:
# Example 5: Gradient Boosting Regressor (Standard)
gb_config = {
    'model_class': GradientBoostingRegressor,
    'param_distributions': {
        'n_estimators': randint(50, 200),
        'learning_rate': loguniform(0.01, 0.2),
        'max_depth': randint(3, 6),
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10),
        'subsample': [0.8, 0.9, 1.0],
        'random_state': [SEED]  # fixed value, given as a single-element list
    },
    'n_iter': 10
}

gb_results = cross_validate_with_tuning(
    X, 
    y, 
    cat_cols, 
    num_cols, 
    gb_config, 
    k=3, 
    seed=SEED, 
    selected_features=selected_features, 
    log_target=LOG_TARGET
)

In [None]:
# Example 6: Extra Trees Regressor
et_config = {
    'model_class': ExtraTreesRegressor,
    'param_distributions': {
        'n_estimators': randint(50, 200),
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10),
        'max_features': ['sqrt', 0.5],
        'random_state': [SEED]  # fixed value, given as a single-element list
    },
    'n_iter': 15
}

et_results = cross_validate_with_tuning(
    X, 
    y, 
    cat_cols, 
    num_cols, 
    et_config, 
    k=3, 
    seed=SEED, 
    selected_features=selected_features, 
    log_target=LOG_TARGET
)

In [None]:
# Example 7: MLP Regressor (Neural Network)
mlp_config = {
    'model_class': MLPRegressor,
    'param_distributions': {
        'hidden_layer_sizes': [(32,), (64,), (100,), (32, 16), (64, 32), (100, 50)],
        'activation': ['relu', 'tanh'],
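        # Fixed settings are given as single-element lists so a sampler never
        # iterates over a string such as 'adam' character by character
        'solver': ['adam'],
        'alpha': loguniform(1e-5, 1e-1),
        'learning_rate_init': loguniform(1e-4, 1e-2),
        'max_iter': [500],
        'early_stopping': [True],
        'random_state': [SEED]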
    },
    'n_iter': 10
}

mlp_results = cross_validate_with_tuning(
    X, 
    y, 
    cat_cols, 
    num_cols, 
    mlp_config, 
    k=3, 
    seed=SEED, 
    selected_features=selected_features, 
    log_target=LOG_TARGET
)

### Final Model Selection

We select the algorithm and hyperparameter configuration with the lowest mean cross-validated MAE, then use that model, fitted on all available training data, for the final predictions.

In [None]:
# Compare all models to find the best one
results = {
    'Ridge': ridge_results,
    'Lasso': lasso_results,
    'Random Forest': rf_results,
    'Hist Gradient Boosting': hgb_results,
    'Gradient Boosting': gb_results,
    'Extra Trees': et_results,
    'MLP': mlp_results
}

# Select the best model based on the lowest mean CV score (MAE)
best_model_name = min(results, key=lambda k: results[k]['mean_cv_score'])
best_result = results[best_model_name]

print(f"Best Model Selected: {best_model_name}")
print(f"Best CV MAE: £{best_result['mean_cv_score']:.2f} ± £{best_result['std_cv_score']:.2f}")
print(f"Best Parameters: {best_result['best_params']}\n")

# cross_validate_with_tuning returns the best model already fitted on ALL available data,
# together with the corresponding preprocessing artifacts
best_model = best_result['best_estimator']
final_artifacts = best_result['final_artifacts']

print(f"Final model ({best_model_name}) is ready and fitted on all data.")

# Visualize Feature Importance (if applicable)
try:
    # We need the processed feature names for the plot
    X_all_processed = preprocess_data(X, cat_cols, num_cols, artifacts=final_artifacts, fit=False)
    
    # Filter selected features if they were used during training
    if 'selected_features' in final_artifacts:
        X_all_processed = X_all_processed[final_artifacts['selected_features']]
        
    get_feature_importance(best_model, X_all_processed, model_class=type(best_model))
except Exception as e:
    print(f"Could not plot feature importance: {e}")

<a id="predictions"></a>
## Predictions

In [None]:
# Load and preprocess test data
test_df = pd.read_csv('data/test.csv').set_index('carID')

# Use the artifacts from the final fit on all data
test_processed = preprocess_test_data(test_df, final_artifacts)

# Make predictions
if final_artifacts.get('log_target', True):
    test_predictions = np.expm1(best_model.predict(test_processed))
else:
    test_predictions = best_model.predict(test_processed)

# Save predictions
predictions_df = pd.DataFrame({'price': test_predictions}, index=test_df.index)
predictions_df.to_csv('data/test_predictions.csv')

print(f"Predictions saved for {len(test_predictions)} test samples")
print(f"Predicted price range: £{test_predictions.min():.2f} - £{test_predictions.max():.2f}")