# Machine Learning Project

## Table of Contents
- [Import Data](#import-data)
  - [Import Data Summary](#import-data-summary)
- [Data Exploration](#data-exploration)
  - [Boolean Features](#boolean-features)
    - [Boolean Features Analysis](#boolean-features-analysis)
  - [Categorical Features](#categorical-features)
    - [Check Categorical Features Consistency](#check-categorical-features-consistency)
    - [Categorical Features Summary](#categorical-features-summary)
  - [Numerical Features](#numerical-features)
    - [Numerical Plots](#plots)
    - [Analysis of Numerical Distributions](#analysis-of-numerical-distributions)
- [Pre-processing](#pre-processing)
  - [Summary of Preprocessing Pipeline](#preprocessing-pipeline-summary)
  - [Data Preparation](#data-preparation)
    - [Correlation Analysis](#correlation-analysis)
- [Model Training](#model-training)
  - [Model Selection with Cross-Validation](#model-selection-with-cv)
  - [Quick Baseline Model](#quick-baseline-model)
  - [Experiment Algorithms](#experiment-algorithms)
- [Predictions](#predictions)

<a id="import-data"></a>
## Import Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from scipy.stats import loguniform, randint
from model_training_utils import (
    general_cleaning,
    preprocess_data,
    cross_validate_with_tuning,
    preprocess_test_data
)

SEED = 42

warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('data/train.csv').set_index('carID')
df.head()

In [None]:
df.info()

In [None]:
num_duplicated_ids = df.index.duplicated().sum()
print(f'Number of duplicated carIDs: {num_duplicated_ids}')

<a id="import-data-summary"></a>
### Import Data Summary
- Dataset loaded successfully with `carID` as the index
- There are no duplicate entries in carID
- The dataset contains information about cars including both numerical features (price, mileage, tax, etc.) and categorical features (brand, model, transmission, etc.)
- Initial inspection shows multiple features that will require preprocessing:
  - Numerical features that need cleaning (negative values, outliers)
  - Categorical features that need standardization
  - Presence of missing values in several columns

<a id="data-exploration"></a>
## Data Exploration

<a id="boolean-features"></a>
### Boolean Features

In [None]:
df['hasDamage'].value_counts(dropna=False)

<a id="boolean-features-analysis"></a>
#### Boolean Features Analysis

Key observations about the `hasDamage` feature:
- Contains only the value 0 and NaN
- No instances of value 1 found, suggesting potential data collection issues
- May indicate:
  - Cars with damage not being listed
  - System default setting of 0 for non-damaged cars
  - Incomplete damage assessment process
- Requires special handling in preprocessing:
  - Consider treating NaN as a separate category
  - Validate if 0 truly represents "no damage"
  - May need to be treated as a categorical rather than boolean feature
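
The handling options above can be sketched on a toy series rather than the real column. This is only an illustration: the level names (`no_damage`, `damaged`, `unknown`) are assumptions and are not taken from `model_training_utils`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['hasDamage']: only 0 and NaN occur, as in the training data.
has_damage = pd.Series([0.0, np.nan, 0.0, np.nan], name="hasDamage")

# Option A: recode into a categorical where NaN becomes its own explicit level.
damage_cat = has_damage.map({0.0: "no_damage", 1.0: "damaged"}).fillna("unknown")

# Option B: keep the column as-is and add a 0/1 missingness indicator next to it.
damage_missing = has_damage.isna().astype(int)

print(damage_cat.value_counts())
print(damage_missing.tolist())
```

Option A keeps the "not recorded" cases distinguishable from confirmed zeros, which matters if missingness itself carries information about price.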

<a id="categorical-features"></a>
### Categorical Features

<a id="check-categorical-features-consistency"></a>
#### Check Categorical Features Consistency

In [None]:
# List of categorical features
cat_cols = ['Brand', 'model', 'fuelType', 'transmission']

# Identify outlier examples in categorical features
cat_outliers_examples = {col: df[col].value_counts().tail(10).index for col in cat_cols}

# Display the outlier examples
pd.DataFrame(cat_outliers_examples)

<a id="categorical-features-summary"></a>
#### Categorical Features Summary
- Initial analysis reveals significant data quality issues across all categorical columns
- Values are not standardized: the same category appears under multiple spellings and capitalizations
- Solution: We will implement string distance-based standardization using the `nltk` library to clean and standardize these features
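
To make the idea of string distance-based standardization concrete, here is a minimal sketch. The project uses `nltk` for the distance computation; this example swaps in a small pure-Python Levenshtein function so it runs without that dependency. The canonical label list and the `max_distance` threshold are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def standardize(value, canonical, max_distance=2):
    """Return the closest canonical label, or None if nothing is close enough."""
    cleaned = value.strip().lower()
    best = min(canonical, key=lambda c: edit_distance(cleaned, c.lower()))
    return best if edit_distance(cleaned, best.lower()) <= max_distance else None


CANONICAL_FUEL = ["Petrol", "Diesel", "Hybrid", "Electric"]  # hypothetical label set
raw_values = ["petrol", " DIESEL", "Pertol", "hybrd", "???"]
cleaned_values = [standardize(v, CANONICAL_FUEL) for v in raw_values]
print(cleaned_values)
```

Returning `None` for values that are far from every label keeps unrecognizable entries visible as missing instead of silently forcing them into the nearest category.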

<a id="numerical-features"></a>
### Numerical Features

In [None]:
# Dict of numerical features
num_cols = {
    'price': 'continuous',
    'mileage': 'continuous',
    'tax': 'continuous',
    'mpg': 'continuous',
    'paintQuality%': 'continuous',
    'engineSize': 'continuous',
    'year': 'discrete',
    'previousOwners': 'discrete'
}

<a id="plots"></a>
#### Numerical Plots

In [None]:
# Plot figures for numerical features and the target variable (price)
plt.figure(figsize=(16, 10))
for i, (col, var_type) in enumerate(num_cols.items(), 1):
    plt.subplot(4, 2, i)

    # Plot based on variable type
    if var_type == 'continuous':
        sns.histplot(data=df, x=col, kde=True, color="lightcoral", bins=30)
        plt.title(f"Distribution of {col}", fontsize=11)
    elif var_type == 'discrete':
        sns.countplot(data=df, x=col, color="lightcoral")
        plt.title(f"Distribution of {col}", fontsize=11)
        plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

In [None]:
# Boxplots for continuous numerical features and the target variable (price)
continuous_cols = [col for col, var_type in num_cols.items() if var_type == 'continuous']
plt.figure(figsize=(16, 10))
for i, col in enumerate(continuous_cols, 1):
    plt.subplot(3, 2, i)
    sns.boxplot(data=df, x=col, color="lightblue")
    plt.title(f"Distribution of {col}", fontsize=11)

plt.tight_layout()
plt.show()

<a id="analysis-of-numerical-distributions"></a>
#### Analysis of Numerical Distributions

Key observations from the plots:
- **Target Variable (Price)**:
  - Highly right-skewed distribution
  - Contains significant number of outliers in the upper range
  - Most cars are concentrated in the lower price range

- **Mileage**:
  - Right-skewed distribution
  - Large range from nearly new cars to high-mileage vehicles
  - Some outliers in upper range suggesting possible data entry errors

- **Tax**:
  - Multiple peaks suggesting different tax bands
  - Contains negative values which require investigation (possible tax benefits/rebates)
  - Large number of outliers on both ends of the distribution

- **MPG (Miles Per Gallon)**:
  - Approximately normal distribution with slight right skew
  - Some unrealistic extreme values that need cleaning
  - Reasonable median around typical car efficiency ranges

- **Paint Quality %**:
  - Contains values above 100% which are logically impossible
  - Left-skewed distribution suggesting optimistic ratings
  - Requires standardization to 0-100 range

- **Engine Size**:
  - Some engines have a size of zero, which is not physically realistic (this might indicate electric vehicles)
  - Some unusual patterns that need investigation
  - Contains outliers that may represent specialty vehicles

- **Year**:
  - Should be discrete but contains decimal values

- **Previous Owners**:
  - Should be integer but contains float values
  - Right-skewed distribution as expected
  - Maximum values need validation (unusually high number of previous owners)
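
The anomalies listed above can be counted with simple boolean masks. The sketch below uses a made-up three-row frame with the same column names; it is not the logic of `general_cleaning`, whose implementation is not shown here.

```python
import pandas as pd

# Toy rows that reproduce the anomalies described above (values are invented).
toy = pd.DataFrame({
    "tax": [145.0, -20.0, 30.0],
    "paintQuality%": [80.0, 125.0, 60.0],
    "engineSize": [1.6, 0.0, 2.0],
    "year": [2017.0, 2019.5, 2015.0],
    "previousOwners": [1.0, 2.3, 0.0],
})

# One boolean mask per suspicious pattern.
checks = {
    "negative_tax": toy["tax"] < 0,
    "paint_above_100": toy["paintQuality%"] > 100,
    "zero_engine": toy["engineSize"] == 0,
    "fractional_year": toy["year"] % 1 != 0,
    "fractional_owners": toy["previousOwners"] % 1 != 0,
}

# Number of offending rows per check.
report = pd.DataFrame(checks).sum()
print(report)
```

Running the same masks on `df` gives a quick count of how many rows each cleaning rule would touch before deciding whether to clip, round, or set the values to missing.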

<a id="pre-processing"></a>
## Pre-processing

<a id="preprocessing-pipeline-summary"></a>
### Summary of Preprocessing Pipeline

The preprocessing is split across three functions:

1. **`preprocess_data()`** - Preprocesses a single dataset
   - Handles categorical features (standardization, encoding)
   - Handles numerical outliers using IQR method
   - Imputes missing values with medians
   - One-hot encodes categorical features
   - Normalizes numerical features with StandardScaler
   - Can fit transformers (fit=True) or use existing ones (fit=False)

2. **`cross_validate_with_tuning()`** - Performs CV with hyperparameter tuning
   - Takes **raw data** (after general_cleaning)
   - Applies preprocessing **separately for each fold** (prevents data leakage)
   - Performs manual hyperparameter search by sampling from parameter distributions
   - Evaluates each combination on validation fold and tracks train/validation performance
   - Returns best model configurations

3. **`preprocess_test_data()`** - Preprocesses test data
   - Uses artifacts from CV to ensure consistency
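
A minimal sketch of the fit/apply split described above, for a single numerical column. `preprocess_column` and the artifact keys are hypothetical; the real `preprocess_data` may order or implement these steps differently. The point it illustrates is that the IQR fences and the median are learned on the training fold only and then reused unchanged on the validation fold.

```python
import numpy as np


def preprocess_column(values, artifacts=None, fit=True):
    """Clip to IQR fences and impute the median; learn both only when fit=True."""
    values = np.asarray(values, dtype=float)
    if fit:
        q1, q3 = np.nanpercentile(values, [25, 75])
        iqr = q3 - q1
        artifacts = {"lower": q1 - 1.5 * iqr, "upper": q3 + 1.5 * iqr}
        # The median is computed after clipping, ignoring missing values.
        clipped_fit = np.clip(values, artifacts["lower"], artifacts["upper"])
        artifacts["median"] = float(np.nanmedian(clipped_fit))
    clipped = np.clip(values, artifacts["lower"], artifacts["upper"])
    filled = np.where(np.isnan(clipped), artifacts["median"], clipped)
    return filled, artifacts


# Inside one CV fold: statistics come from the training part only.
train_fold = np.array([10.0, 20.0, 30.0, 40.0, np.nan])
val_fold = np.array([1000.0, np.nan, 5.0])

train_out, art = preprocess_column(train_fold, fit=True)
val_out, _ = preprocess_column(val_fold, artifacts=art, fit=False)

print(art)
print(val_out)
```

Because the validation fold never contributes to the fences or the median, its extreme value and its missing value cannot influence the statistics used to transform it.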

<a id="data-preparation"></a>
### Data Preparation

In [None]:
# Prepare cleaned data for cross-validation
df_cleaned = general_cleaning(df)
X = df_cleaned.drop(columns=["price"])
y = df_cleaned["price"]

# Remove 'price' from num_cols since it's the target
# (pop with a default keeps this cell safe to re-run)
num_cols.pop('price', None)

print(f"Dataset size: {X.shape}")
print(f"Target range: £{y.min():.2f} - £{y.max():.2f}")
print(f"\nReady for cross-validation!")

<a id="correlation-analysis"></a>
#### Correlation Analysis

Before model training, let's examine correlations between numerical features to understand their relationships.

In [None]:
# Correlation matrix for numerical features
fig = plt.figure(figsize=(10, 8))
corr = X[list(num_cols.keys())].corr(method="pearson")
sns.heatmap(data=corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()

<a id="model-training"></a>
## Model Training

<a id="model-selection-with-cv"></a>
### Model Selection with Cross-Validation

We'll use cross-validation with hyperparameter tuning to select the best model. Configure your model using a dictionary with the model class, parameter distributions, and number of iterations.
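
As an illustration of how such a configuration dictionary can be turned into candidate parameter sets, here is a hedged sketch. `sample_params` is a hypothetical helper, not the code inside `cross_validate_with_tuning`; it assumes that entries with an `rvs` method are distributions and that everything else is a list of choices.

```python
import numpy as np
from scipy.stats import loguniform


def sample_params(param_distributions, n_iter, seed=42):
    """Draw n_iter parameter dicts: sample distributions, pick uniformly from lists."""
    rng = np.random.RandomState(seed)
    candidates = []
    for _ in range(n_iter):
        params = {}
        for name, spec in param_distributions.items():
            if hasattr(spec, "rvs"):
                # scipy frozen distribution: draw one value
                params[name] = spec.rvs(random_state=rng)
            else:
                # list of discrete choices: pick one at random
                params[name] = spec[rng.randint(len(spec))]
        candidates.append(params)
    return candidates


config = {
    "param_distributions": {
        "alpha": loguniform(1e-3, 1e2),
        "fit_intercept": [True, False],
    },
    "n_iter": 5,
}

candidates = sample_params(config["param_distributions"], config["n_iter"])
for params in candidates:
    print(params)
```

Under this convention, a parameter that should stay fixed is written as a single-element list, for example `[500]`.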

<a id="quick-baseline-model"></a>
### Quick Baseline Model

Before running extensive CV, let's train a simple baseline model for quick reference.

In [None]:
# Quick train/val split for baseline
X_train_baseline, X_val_baseline, y_train_baseline, y_val_baseline = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

# Preprocess baseline data
X_train_processed, baseline_artifacts = preprocess_data(X_train_baseline, cat_cols, num_cols, fit=True)
X_val_processed = preprocess_data(X_val_baseline, cat_cols, num_cols, artifacts=baseline_artifacts, fit=False)

# Train simple Ridge model
baseline_model = Ridge(alpha=1.0, fit_intercept=True)
baseline_model.fit(X_train_processed, np.log1p(y_train_baseline))

# Evaluate
y_train_pred = np.expm1(baseline_model.predict(X_train_processed))
y_val_pred = np.expm1(baseline_model.predict(X_val_processed))

mae_train = mean_absolute_error(y_train_baseline, y_train_pred)
mae_val = mean_absolute_error(y_val_baseline, y_val_pred)
r2_val = r2_score(y_val_baseline, y_val_pred)

print(f"Baseline Ridge (alpha=1.0):")
print(f"  Train MAE: £{mae_train:.2f}")
print(f"  Val MAE:   £{mae_val:.2f}")
print(f"  Val R²:    {r2_val:.4f}")
print(f"\nThis gives us a reference point before hyperparameter tuning with CV.")
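
The baseline is fitted on `np.log1p(price)` and its predictions are mapped back with `np.expm1` before computing MAE. The small sketch below, using invented prices and only the standard library, shows that the round trip is lossless and why a constant error in log space corresponds to a larger absolute error for more expensive cars.

```python
import math

prices = [1_500.0, 12_000.0, 85_000.0]           # made-up prices in £
log_prices = [math.log1p(p) for p in prices]      # scale the model is fitted on
restored = [math.expm1(v) for v in log_prices]    # back to £ for MAE / R²

# The same error of 0.1 in log space, converted back to £ at each price level.
log_error = 0.1
abs_errors = [math.expm1(v + log_error) - p for v, p in zip(log_prices, prices)]

for p, e in zip(prices, abs_errors):
    print(f"price £{p:,.0f}: error £{e:,.2f} ({e / p:.1%})")
```

An error of 0.1 on the log scale is roughly a 10.5% relative error at every price level, so training on the log target balances relative accuracy across cheap and expensive cars, while the reported MAE in £ is still dominated by the expensive ones.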

### Feature Importance

In [None]:
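def get_feature_importance(model_class, X_train, y_train):
    """Fit `model_class` with default hyperparameters and plot per-feature importance.

    Tree ensembles are fitted once per split criterion; gradient boosting uses
    `feature_importances_`; linear models use |coef_|; the MLP uses the L2 norm
    of each feature's first-layer weights.
    Note: these are default-parameter fits, not the tuned models from cross-validation.
    """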
    # Trees
    criteria_map = {
        RandomForestRegressor: ["squared_error", "absolute_error", "poisson"],
        ExtraTreesRegressor:   ["squared_error", "absolute_error", "poisson"],
    }

    results = []

    if model_class in criteria_map:
        for crit in criteria_map[model_class]:
            model = model_class(criterion=crit, random_state=SEED)
            model.fit(X_train, y_train)
            importance = model.feature_importances_
            results.append((crit, importance))

    # Gradient boosting
    elif model_class is GradientBoostingRegressor:
        model = model_class(random_state=SEED)
        model.fit(X_train, y_train)
        results.append(("feature_importances", model.feature_importances_))

    # Ridge / Lasso (based on coef)
    elif model_class in [Ridge, Lasso]:
        model = model_class()
        model.fit(X_train, y_train)
        coef = np.abs(model.coef_)
        if coef.ndim > 1:  # multioutput
            coef = coef.mean(axis=0)
        results.append(("coef", coef))

    # MLP 
    elif model_class is MLPRegressor:
        model = model_class(max_iter=2000, random_state=SEED)
        model.fit(X_train, y_train)

        # Importance based on first layer weights
        first_layer_weights = model.coefs_[0]     # shape = (n_features, n_hidden)
        importance = np.linalg.norm(first_layer_weights, axis=1)

        results.append(("mlp_weights", importance))

    else:
        raise ValueError(f"Model {model_class.__name__} not supported.")

    df_list = []
    for label, values in results:
        df_list.append(pd.DataFrame({
            "Feature": X_train.columns,
            "Value": values,
            "Method": label
        }))

    tidy = pd.concat(df_list)
    tidy.sort_values("Value", ascending=False, inplace=True)

    plt.figure(figsize=(15, 8))
    sns.barplot(data=tidy, y="Feature", x="Value", hue="Method")
    plt.title(f"Feature Importance — {model_class.__name__}")
    plt.tight_layout()
    plt.show()
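
The MLP branch above scores each input feature by the L2 norm of its outgoing first-layer weights. The toy example below shows what that computation does on a hand-written weight matrix (the numbers are invented, with one row per feature and one column per hidden unit).

```python
import numpy as np

# Shape (n_features, n_hidden), matching model.coefs_[0] in scikit-learn's MLP.
first_layer = np.array([
    [3.0, 4.0],   # feature 0: large weights into both hidden units
    [0.0, 0.0],   # feature 1: effectively ignored by the first layer
    [1.0, 0.0],   # feature 2: weakly connected to one hidden unit
])

# Row-wise L2 norm: one score per input feature.
importance = np.linalg.norm(first_layer, axis=1)

# Feature indices from most to least important.
ranking = np.argsort(importance)[::-1]

print(importance)
print(ranking)
```

This score ignores everything after the first layer, so it is best read as a rough proxy and not as a calibrated importance measure.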



<a id="experiment-algorithms"></a>
### Experiment Algorithms

Now we'll experiment with different algorithms using cross-validation with hyperparameter tuning.

In [None]:
# Example 1: Ridge Regression with hyperparameter tuning
ridge_config = {
    'model_class': Ridge,
    'param_distributions': {
        'alpha': loguniform(1e-3, 1e2),
        'fit_intercept': [True, False]
    },
    'n_iter': 20
}

ridge_results = cross_validate_with_tuning(X, y, cat_cols, num_cols, ridge_config, k=3, seed=SEED)
get_feature_importance(Ridge, X_train_processed, y_train_baseline)

In [None]:
# Example 2: Lasso Regression
lasso_config = {
    'model_class': Lasso,
    'param_distributions': {
        'alpha': loguniform(1e-3, 1e2),
        'fit_intercept': [True, False]
    },
    'n_iter': 20
}

lasso_results = cross_validate_with_tuning(X, y, cat_cols, num_cols, lasso_config, k=3, seed=SEED)
get_feature_importance(Lasso, X_train_processed, y_train_baseline)

In [None]:
# Example 3: Random Forest
rf_config = {
    'model_class': RandomForestRegressor,
    'param_distributions': {
        'n_estimators': randint(10, 50),
        'max_depth': randint(5, 15),
        'min_samples_split': randint(20, 50),
        'min_samples_leaf': randint(5, 15),
        'max_features': ['sqrt', 0.5],
    },
    'n_iter': 10
}

rf_results = cross_validate_with_tuning(X, y, cat_cols, num_cols, rf_config, k=3, seed=SEED)
get_feature_importance(RandomForestRegressor, X_train_processed, y_train_baseline)

In [None]:
# Example 4: Gradient Boosting Regressor
gb_config = {
    'model_class': GradientBoostingRegressor,
    'param_distributions': {
        'n_estimators': randint(50, 100),
        'learning_rate': loguniform(0.01, 0.2),
        'max_depth': randint(3, 8),
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10),
        'subsample': [0.8, 0.9, 1.0],
    },
    'n_iter': 20
}

gb_results = cross_validate_with_tuning(X, y, cat_cols, num_cols, gb_config, k=3, seed=SEED)
get_feature_importance(GradientBoostingRegressor, X_train_processed, y_train_baseline)

In [None]:
# Example 5: Extra Trees Regressor
et_config = {
    'model_class': ExtraTreesRegressor,
    'param_distributions': {
        'n_estimators': randint(1, 5),
        'max_depth': randint(3, 20),
        'min_samples_split': randint(10, 100),
        'min_samples_leaf': randint(1, 10),
    },
    'n_iter': 20
}

et_results = cross_validate_with_tuning(X, y, cat_cols, num_cols, et_config, k=3, seed=SEED)
get_feature_importance(ExtraTreesRegressor, X_train_processed, y_train_baseline)

In [None]:
# Example 6: MLP Regressor (Neural Network)
mlp_config = {
    'model_class': MLPRegressor,
    'param_distributions': {
        'hidden_layer_sizes': [(16,), (32,), (64,), (16, 8), (32, 16), (64, 32)],
        'activation': ['relu', 'tanh'],
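        # Fixed values are wrapped in single-element lists so they are sampled
        # like the other entries instead of being indexed as a string or scalar
        'solver': ['adam'],
        'alpha': loguniform(1e-5, 1e-1),
        'learning_rate_init': loguniform(1e-4, 1e-2),
        'max_iter': [500],
        'early_stopping': [True]
    },
    'n_iter': 5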
}

mlp_results = cross_validate_with_tuning(X, y, cat_cols, num_cols, mlp_config, k=3, seed=SEED)
get_feature_importance(MLPRegressor, X_train_processed, y_train_baseline)

### Final Model Selection

We select the algorithm and hyperparameters with the lowest mean cross-validated MAE, then retrain that model on all available training data.

In [None]:
# Compare all models to find the best one
results = {
    'Ridge': ridge_results,
    'Lasso': lasso_results,
    'Random Forest': rf_results,
    'Extra Trees': et_results,
    'MLP': mlp_results,
    'Gradient Boosting': gb_results
}

# Select the best model based on the lowest mean CV score (MAE)
best_model_name = min(results, key=lambda k: results[k]['mean_cv_score'])
best_result = results[best_model_name]

print(f"Best Model Selected: {best_model_name}")
print(f"Best CV MAE: £{best_result['mean_cv_score']:.2f} ± £{best_result['std_cv_score']:.2f}")
print(f"Best Parameters: {best_result['best_params']}\n")

# Retrain the best model on ALL available data
print(f"Retraining {best_model_name} on all available data...")

# 1. Preprocess the entire dataset
X_all_processed, preprocessing_artifacts = preprocess_data(X, cat_cols, num_cols, fit=True)

# 2. Prepare the target variable (log transform)
y_all_log = np.log1p(y)

# 3. Get the model with the best parameters
# The function returns an unfitted model with the best params, so we can use it directly
best_model = best_result['best_estimator']

# 4. Train the model on the full dataset
best_model.fit(X_all_processed, y_all_log)

print("Final model trained successfully on all data!")

<a id="predictions"></a>
## Predictions

In [None]:
# Load and preprocess test data
test_df = pd.read_csv('data/test.csv').set_index('carID')
test_processed = preprocess_test_data(test_df, preprocessing_artifacts)

# Make predictions
test_predictions = np.expm1(best_model.predict(test_processed))

# Save predictions
predictions_df = pd.DataFrame({'price': test_predictions}, index=test_df.index)
predictions_df.to_csv('data/test_predictions.csv')

print(f"Predictions saved for {len(test_predictions)} test samples")
print(f"Predicted price range: £{test_predictions.min():.2f} - £{test_predictions.max():.2f}")