# Deep Learning Experiments: Pre-training vs Training from Scratch

---

## 1. Objective and Motivation

#### Motivation

Deep neural networks often underperform tree-based models on tabular data due to heterogeneous feature types, relatively small sample sizes, and irregular, non-smooth decision boundaries that are well captured by tree ensembles (Grinsztajn et al., 2022).

Meanwhile, self-supervised pre-training has driven large gains in natural language processing and computer vision (e.g., BERT; Devlin et al., 2019), and has more recently been proposed for tabular data (Somepalli et al., 2021).  
This raises the question: **Can similar pre-training strategies help neural networks close the performance gap on tabular data?**

#### Approach
We compare two training strategies:

**Experiment A - From Scratch:**
- Initialize weights randomly
- Train end-to-end on supervised price prediction task
- Baseline for neural network performance

**Experiment B - Pre-trained:**
- **Phase 1:** Unsupervised pre-training using an autoencoder to learn feature representations
- **Phase 2:** Fine-tune the pre-trained encoder on the supervised price prediction task
- Tests whether unsupervised feature learning improves performance

---

## 2. Task Difficulty

- **Limited Training Data:** 60k samples is relatively small for deep learning

- **Heterogeneous Features:** Mixed numerical and categorical features require careful preprocessing

- **Non-Smooth Decision Boundaries:** Car prices exhibit discontinuities (e.g., brand premium jumps)

- **Training Dynamics:** Unsupervised pre-training may learn features irrelevant to price prediction  

- **Hyperparameter Sensitivity:** Learning rates, architecture depth, and regularization critically impact performance  


---

## 3. Implementation Details

### Architecture Design

#### Autoencoder:
```
Input: 26 features → Encoder → 12 compressed features → Decoder → 26 reconstructed features
```

**Encoder:**
- Linear layer: 26 → 12 dimensions (2.17× compression)
- Activation: ReLU (non-linearity, prevents negative values)
- Objective: Learn compressed representations that preserve information

**Decoder:**
- Linear layer: 12 → 26 dimensions (reconstruction)
- No activation: Allow negative values for normalized features
- Objective: Reconstruct original input from compressed representation
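
As a quick size check, each linear layer has `in × out + out` parameters (weights plus biases). The sketch below is plain Python and assumes only the 26 → 12 → 26 shapes listed above; `linear_params` is a helper name introduced here for illustration.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Parameters in a fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


encoder_params = linear_params(26, 12)   # 26*12 weights + 12 biases
decoder_params = linear_params(12, 26)   # 12*26 weights + 26 biases
autoencoder_params = encoder_params + decoder_params

print(f"Encoder: {encoder_params}, decoder: {decoder_params}, total: {autoencoder_params}")
print(f"Compression ratio: {26 / 12:.2f}x")
```

With well under a thousand parameters, the autoencoder is very small compared with the number of training rows, so capacity rather than data volume may be the tighter constraint here.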

**Pre-training Loss:** MSE (Mean Squared Error)
- Minimizes: ||x - decoder(encoder(x))||²
- Goal: Force encoder to learn informative 12-dimensional features
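
To make the reconstruction objective concrete, here is a minimal pure-Python sketch of the element-wise mean squared error. The vectors are made-up values, not rows from the dataset. Note that PyTorch's `nn.MSELoss` averages over all elements by default, so it differs from the squared norm above only by a constant factor.

```python
def mse(x, x_hat):
    """Mean squared error averaged over all elements."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


x = [1.0, -2.0, 0.5]       # hypothetical normalized input features
x_hat = [0.5, -1.0, 0.5]   # hypothetical reconstruction from the decoder
loss = mse(x, x_hat)       # (0.25 + 1.0 + 0.0) / 3

print(f"Reconstruction MSE: {loss:.4f}")
```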



#### Price Predictor:
```
Input: 26 features → Encoder (12) → Prediction Head (16) → Output: Price
```

**Encoder:** 
- Architecture: 26 → 12 (identical to autoencoder encoder)
- Initialization: Random (Experiment A) or Pre-trained (Experiment B)
- Training: Unfrozen - weights continue updating

**Prediction Head:**
- Layer 1: Linear(12 → 16) + ReLU + Dropout(0.3)
- Layer 2: Linear(16 → 1)
- Regularization: 30% dropout to reduce overfitting
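
The effect of the dropout layer can be sketched in plain Python. During training each activation is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, which is how `nn.Dropout` behaves; in evaluation mode the layer is the identity. The `dropout` function and the activation values below are illustrative stand-ins, not code from the notebook.

```python
import random


def dropout(values, p=0.3, training=True, rng=None):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


activations = [1.0] * 16   # hypothetical outputs of the 16-unit hidden layer
train_out = dropout(activations, p=0.3, rng=random.Random(0))
eval_out = dropout(activations, p=0.3, training=False)

print("train:", train_out)
print("eval: ", eval_out)
```

The rescaling keeps the expected activation the same in both modes, which is why no correction is needed at validation time.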

**Supervised Loss:** L1/MAE (Mean Absolute Error)
- Minimizes: |y_true - y_pred|
- More robust to outliers than MSE
- Directly optimizes evaluation metric
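
The robustness argument can be illustrated with a small, hedged example. The error values are invented for illustration; the point is only that a single large error inflates a squared-error metric far more than an absolute-error metric.

```python
def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def mse(errors):
    """Mean squared error."""
    return sum(e * e for e in errors) / len(errors)


typical = [500.0, -800.0, 300.0, -400.0]   # hypothetical prediction errors in £
with_outlier = typical + [20000.0]         # one badly mispriced car

for name, errs in [("typical", typical), ("with outlier", with_outlier)]:
    print(f"{name:>12}: MAE = {mae(errs):8.1f}, RMSE = {mse(errs) ** 0.5:8.1f}")
```

Under MAE every residual contributes a gradient of the same magnitude, whereas under MSE the gradient grows with the residual, so a few extreme prices can dominate the updates.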



### Training Configuration
- **Phase 1: Unsupervised Pre-training (Experiment B only):** 50 epochs, MSE loss, Adam (learning rate 0.001), batch size 256; learns feature representations without price labels
- **Phase 2: Supervised Training (both experiments):** 50 epochs, MAE loss, Adam (learning rate 0.001), batch size 256; optimizes price prediction performance
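
For a sense of the training budget, the number of optimizer updates per phase follows from the row count, batch size, and epoch count. The exact number of training rows is not stated here, so `n_train = 48_000` is an assumption (roughly 80% of about 60k rows). The calculation also assumes the last partial batch is kept, which is the `DataLoader` default.

```python
import math


def optimizer_steps(n_samples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer updates when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size) * epochs


n_train = 48_000   # assumption: about 80% of ~60k rows
steps_per_phase = optimizer_steps(n_train, batch_size=256, epochs=50)

print(f"Updates per 50-epoch phase: {steps_per_phase}")
```

Under these assumptions Experiment B performs about twice as many updates on the encoder as Experiment A, which is worth keeping in mind when comparing the two.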

---

## 4. Discussion

### Discussion Themes

**Performance Analysis:**
- How does pre-training affect validation MAE?

- Pre-training shows faster early convergence, with lower MAE during the initial epochs, but training from scratch ultimately achieves slightly better final performance after 50 epochs (MAE: £2,105 from scratch vs £2,125 pre-trained, a difference of about 1.0%). This suggests that while unsupervised pre-training provides a better weight initialization, the advantage diminishes as supervised training progresses. For our tabular car price dataset the final gap is small and, with a single seeded run per configuration, cannot be treated as statistically significant, indicating that unsupervised feature learning provides minimal long-term benefit over end-to-end supervised learning.
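
The size of the gap can be checked with the same formula the results cell uses, `(scratch - pretrained) / scratch * 100`. The sketch assumes the lower figure (£2,105) belongs to the from-scratch model, as the text implies; `relative_change_pct` is a name introduced here for illustration.

```python
def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage by which candidate improves on baseline (positive = lower MAE)."""
    return (baseline - candidate) / baseline * 100


mae_scratch = 2105.0      # assumed to be the from-scratch figure
mae_pretrained = 2125.0   # assumed to be the pre-trained figure
change = relative_change_pct(mae_scratch, mae_pretrained)

print(f"Pre-training vs scratch: {change:+.2f}%")
```

A negative value means pre-training ended with a higher MAE than the baseline, here by a little under one percent.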

**Feature Learning:**
- Does the autoencoder learn meaningful price-relevant features?

- While the autoencoder successfully learns to reconstruct input features (achieving low reconstruction loss), these learned representations do not translate to improved price prediction performance. This indicates a mismatch between reconstruction objectives and regression objectives: features optimized for minimizing MSE reconstruction error are not necessarily optimal for predicting car prices. The autoencoder captures general data structure rather than price-relevant patterns.

**Limitations**
- **Single architecture tested:** Results may not generalize to deeper networks
- **Fixed hyperparameters:** No extensive tuning performed
- **Modest dataset size:** 60k training samples may be insufficient for neural networks to show their full potential
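
The significance statement in the performance analysis rests on one seeded run per configuration. A hedged sketch of how repeated runs could be compared is shown below, pairing the two models on the same seeds and computing a paired t-statistic. The MAE lists are placeholders for illustration only and are not results from this notebook.

```python
import math
import statistics


def paired_summary(maes_a, maes_b):
    """Mean paired difference (a - b) and its t-statistic across matched seeds."""
    diffs = [a - b for a, b in zip(maes_a, maes_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t_stat = mean_diff / std_err if std_err > 0 else float("nan")
    return mean_diff, t_stat


# Placeholder values, one entry per seed; NOT measured results.
scratch_runs = [2100.0, 2130.0, 2095.0, 2120.0, 2110.0]
pretrained_runs = [2125.0, 2110.0, 2115.0, 2135.0, 2100.0]

mean_diff, t_stat = paired_summary(scratch_runs, pretrained_runs)
print(f"Mean difference: £{mean_diff:.1f}, t = {t_stat:.2f} (n = {len(scratch_runs)})")
```

With real per-seed MAEs in place of the placeholders, a t-statistic close to zero would support the reading that the two strategies are indistinguishable.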

---
## 5. Comparison to State-of-the-Art Tabular Pre-training

Recent literature proposes more sophisticated pre-training approaches than our simple reconstruction-based autoencoder.

#### Why Advanced Methods Were Not Implemented

**XTab (Zhu et al., 2023) - Cross-Table Pre-training:**
- **Concept:** Pre-train a single transformer on 50+ diverse datasets simultaneously
- **Benefit:** Learn general tabular patterns through cross-domain transfer
- **Why Not Feasible:** 
  - Requires 20-50+ datasets
  - Beyond project scope and computational resources

**SAINT (Somepalli et al., 2021) - Advanced Single-Dataset Pre-training:**
- **Concept:** Contrastive learning + intersample attention (samples attend to each other)
- **Benefit:** Better pre-training objective than reconstruction
- **Why Not Implemented:**
  - Implementation complexity (intersample attention mechanism)
  - We focused on a baseline comparison first
  - Time constraints

#### Key Insight from Literature

**Reconstruction loss (our approach):**
- Optimizes: Input similarity (minimize ||x - x̂||²)
- Learns: General data structure
- Problem: Not aligned with regression task

**Contrastive loss (SAINT approach):**
- Optimizes: Sample similarity (similar samples close, different samples far)
- Learns: Task-relevant feature representations
- Benefit: Better aligned with prediction tasks
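
To contrast the two objectives, the sketch below implements a generic InfoNCE-style contrastive loss for a single anchor in plain Python. It is a simplified illustration of the idea, not SAINT's exact objective, and the 2-dimensional embeddings are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss: negative log-softmax score of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    max_logit = max(logits)   # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
aligned = info_nce(anchor, positive=[0.9, 0.1], negatives=[[0.0, 1.0], [-1.0, 0.2]])
misaligned = info_nce(anchor, positive=[0.0, 1.0], negatives=[[0.9, 0.1], [-1.0, 0.2]])

print(f"aligned: {aligned:.3f}, misaligned: {misaligned:.3f}")
```

The loss is low when the anchor sits close to its positive and far from the negatives, so the encoder is rewarded for separating samples rather than for reproducing every input dimension.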

SAINT's success on some tabular benchmarks suggests that contrastive pre-training, rather than reconstruction, may be a more effective way to help neural networks on tabular data.  
Our simple autoencoder establishes the baseline: basic pre-training provides minimal benefit, motivating investigation into more sophisticated objectives such as contrastive learning.

#### References

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019).  
**BERT: Pre-training of deep bidirectional transformers for language understanding**.  
In *Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)* (pp. 4171–4186). Association for Computational Linguistics.  
https://doi.org/10.18653/v1/N19-1423

Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., & Goldstein, T. (2021).  
**SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training**.  
*arXiv preprint*. https://arxiv.org/abs/2106.01342

Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022).  
**Why do tree-based models still outperform deep learning on tabular data?**  
*arXiv preprint*. https://doi.org/10.48550/arXiv.2207.08815

Zhu, B., Shi, X., Erickson, N., Li, M., Karypis, G., & Shoaran, M. (2023).  
**XTab: Cross-table pretraining for tabular transformers**.  
In *Proceedings of the 40th International Conference on Machine Learning (ICML 2023)*.  
*Proceedings of Machine Learning Research, 202*. https://proceedings.mlr.press/v202/zhu23o.html


---

## Table of Contents
1. [Imports and Setup](#1-imports-and-setup)
2. [Data Loading and Preprocessing](#2-data-loading-and-preprocessing)
3. [Autoencoder Architecture](#3-autoencoder-architecture)
4. [Pre-training for Autoencoder](#4-pre-training-for-autoencoder)
5. [Price Prediction](#5-price-prediction)
6. [Supervised Training](#6-supervised-training)
7. [Experiments](#7-experiments)
   - [A: DL From Scratch](#a-dl-from-scratch)
   - [B: DL with pre-trained Autoencoder](#b-dl-with-pre-trained-autoencoder)
8. [Results and Comparison](#8-results-and-comparison)

---


## 1. Imports and Setup

In [None]:
!pip install torch

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# PyTorch libraries for deep learning
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Custom preprocessing pipeline
from pipeline_functions import CarDataCleaner
import pickle
 
# Reproducibility
import random
def set_seed(seed):
    """
    Set all random seeds for reproducible results across different libraries.
    This ensures that experiments can be replicated with identical results.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)

set_seed(42)
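
One caveat if this notebook is ever run on a GPU: with `torch.use_deterministic_algorithms(True)`, some cuBLAS operations on CUDA 10.2 or later raise a `RuntimeError` unless the `CUBLAS_WORKSPACE_CONFIG` environment variable is set to `:4096:8` or `:16:8`. The code here appears to run on CPU, so the following is an optional, hedged addition rather than part of the original setup.

```python
import os

# Must be set before the first CUDA call; harmless on CPU-only runs.
# setdefault keeps any value the user has already exported.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

print(os.environ["CUBLAS_WORKSPACE_CONFIG"])
```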

## 2. Data Loading and Preprocessing

In [None]:
# DATA LOADING
# Load the raw training dataset from CSV file
df_train = pd.read_csv('train.csv')

# Initialize the data cleaning pipeline
cleaner = CarDataCleaner(handle_electric="other", set_carid_index=False)

# Apply data cleaning transformations
df_train = cleaner.fit_transform(df_train)

# Separate features (X) from target variable (y)
X = df_train.drop(columns=['price'])
y = df_train['price']

# X = X.rename(columns={
#     'Brand': 'brand'
# })

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Data loaded and cleaned")
print(f"Features: {X_train.shape[1]}")
print(f"Training samples: {len(X_train)}")
print(f"Validation samples: {len(X_val)}")

In [None]:
# PREPROCESSING

# Load existing preprocessing pipeline
# Note: the main notebook has to be run to create this file.
# It is not included in the zip due to its file size.
with open('preprocessor_pipe.pkl', 'rb') as f:
    preprocessor = pickle.load(f)

# Transform training data & validation data
# Re-fits the loaded pipeline on the training split only, then transforms that split
X_train_transformed = preprocessor.fit_transform(X_train, y_train)

# Only applies the transformation without learning new parameters (prevents data leakage)
X_val_transformed = preprocessor.transform(X_val)

# Convert data to PyTorch tensors for neural network training
X_train_tensor = torch.FloatTensor(X_train_transformed)
y_train_tensor = torch.FloatTensor(y_train.values).reshape(-1, 1)
X_val_tensor = torch.FloatTensor(X_val_transformed)
y_val_tensor = torch.FloatTensor(y_val.values).reshape(-1, 1)

print("Data preprocessed")
print(f"Input dimension: {X_train_tensor.shape[1]}")
print(f"Train tensor shape: {X_train_tensor.shape}")
print(f"Validation tensor shape: {X_val_tensor.shape}")
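
The fit-on-train, transform-on-validation pattern used above can be illustrated without the pickled pipeline. `SimpleScaler` is a toy standardizer written for this example and is not a description of what `preprocessor` actually contains; the mileage values are made up.

```python
import statistics


class SimpleScaler:
    """Toy standardizer: learns mean/std in fit(), reuses them in transform()."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values) or 1.0   # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train_mileage = [10_000.0, 20_000.0, 30_000.0]   # hypothetical training column
val_mileage = [25_000.0, 40_000.0]               # hypothetical validation column

scaler = SimpleScaler().fit(train_mileage)       # statistics come from training rows only
train_scaled = scaler.transform(train_mileage)
val_scaled = scaler.transform(val_mileage)       # validation rows never influence mean_/std_

print(train_scaled)
print(val_scaled)
```

Because the validation rows are scaled with training statistics, their standardized values need not have zero mean, which is expected and is what prevents leakage.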

## 3. Autoencoder Architecture

In [None]:
class TabularAutoencoder(nn.Module):
    """
        An autoencoder is an unsupervised learning model that learns to compress data 
        into a lower-dimensional representation (encoding) and then reconstruct the 
        original data from this compressed representation (decoding).
    """
    def __init__(self, input_dim, encoding_dim=12):
        """
            Initialize the autoencoder architecture.
            Args: 
                input_dim (int): Number of input features (26)
                encoding_dim (int): Size of the compressed representation (12)
        """
        super(TabularAutoencoder, self).__init__()
        
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, encoding_dim),
            nn.ReLU() # ReLU activation adds non-linearity and prevents negative values
        )
        
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, input_dim),
            # No ReLU: Decoder needs to reconstruct negative values from normalized features
        )
    
    def forward(self, x):
        """
        Forward pass: data flows through the autoencoder to reconstruct input features.
        
        This method is called automatically when you use model(input_data).
        Data flows: input features → encoder → decoder → reconstructed features.
        
        Args:
            x: Input car features (batch_size, 26)
        Returns:
            decoded: Reconstructed car features (batch_size, 26)
        """
        encoded = self.encoder(x)       # 26 -> 12 
        decoded = self.decoder(encoded) # 12 -> 26
        return decoded                  # Returns decoded features
    

# Input dimension used when instantiating the autoencoder in the experiments below
input_dim = X_train_tensor.shape[1] # 26 features after preprocessing

print(f"Autoencoder class defined; input_dim={input_dim}")
print(f"Encoding dimension: 12 (compression from {input_dim})")
print(f"Compression ratio: {input_dim/12:.2f}x")

## 4. Pre-training for Autoencoder

In [None]:
# PRE-TRAINING FUNCTION
def pretrain_autoencoder(model, X_train, X_val, epochs=50, batch_size=256, lr=0.001):
    """
        Pre-training means training the autoencoder to reconstruct input data without using any target labels (prices). 
        This helps the encoder learn meaningful feature representations that can later be used for price prediction.

        Args:
            model: The autoencoder model to train
            X_train: Training data features
            X_val: Validation data features  
            epochs: Number of training epochs (full passes through the data)
            batch_size: Number of samples processed together in each batch
            lr: Learning rate for the optimizer
            
        Returns:
            train_losses: List of training losses for each epoch
            val_losses: List of validation losses for each epoch
    """
    
    
    # Set seeds for reproducibility
    set_seed(42)
    
    # Set up the optimizer and loss function
    optimizer = optim.Adam(model.parameters(), lr=lr) # Adam optimizer: Adaptive learning rate algorithm
    criterion = nn.MSELoss()
    
    # Create batches for efficient training
    # Note: with shuffle=False the batches are visited in the same order every epoch,
    # so the seeded generator does not change the batch order; it only matters if shuffle=True
    generator = torch.Generator().manual_seed(42)
    
    # For autoencoder, both input and target are the same
    train_dataset = TensorDataset(X_train, X_train)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False, generator=generator)
    
    val_dataset = TensorDataset(X_val, X_val)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    
    # Lists to store losses for plotting training progress
    train_losses = []
    val_losses = []
    
    # Main training loop
    for epoch in range(epochs):
        # TRAINING PHASE
        model.train() # Training mode (activates layers such as dropout; gradients are tracked by default)
        train_loss = 0.0 # Running sum of batch losses for this epoch

        # Process training data in batches
        for batch_X, batch_X_target in train_loader:
            optimizer.zero_grad()                           # Reset gradients from previous batch
            reconstructed = model(batch_X)                  # Forward pass: get reconstructed data
            loss = criterion(reconstructed, batch_X_target) # Calculate reconstruction error
            loss.backward()                                 # Compute gradients (how to improve)
            optimizer.step()                                # Update weights (actual learning)
            train_loss += loss.item()                       # Accumulate batch loss
        
        train_loss /= len(train_loader)                     # Average loss across all batches
        train_losses.append(train_loss)
        
        # VALIDATION PHASE
        model.eval() # Set model to evaluation mode (disables dropout, batch norm updates)
        val_loss = 0.0 # Running sum of batch losses for this epoch
        with torch.no_grad(): # Process validation data without computing gradients (saves memory)
            for batch_X, batch_X_target in val_loader:
                reconstructed = model(batch_X)
                loss = criterion(reconstructed, batch_X_target)
                val_loss += loss.item()  # Accumulate batch loss
        
        val_loss /= len(val_loader)  # Average validation loss
        val_losses.append(val_loss)
        
        # Print progress every 10 epochs
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} - Train Loss: {train_loss:.6f}, Val Loss: {val_loss:.6f}")
    
    return train_losses, val_losses

print("Pre-training function defined")

## 5. Price Prediction

In [None]:
class PricePredictor(nn.Module):
    """
        Neural network for car price prediction using encoder features.
        Uses an encoder to compress input features to 12 dimensions, then applies a prediction head to estimate car prices. 
    """
    def __init__(self, encoder, encoding_dim=12):
        """
            Initialize the price prediction model.
            Args:
                encoder: Encoder network (fresh or pre-trained from autoencoder)
                encoding_dim: Dimension of encoded feature vector (default: 12)
        """
        super(PricePredictor, self).__init__()
        
        self.encoder = encoder             # Store the Encoder
        
        # Build Prediction head: encoded features (12) → price prediction
        self.head = nn.Sequential(
            nn.Linear(encoding_dim, 16),  # Expand to 16 dimensions (Layer 1)
            nn.ReLU(),                    # Non-linear activation
            nn.Dropout(0.3),              # Regularization (30% dropout)
            nn.Linear(16, 1)              # Final price output (Layer 2)
          )
    
    def forward(self, x):
        """
        Forward pass: data flows through the network to predict car prices.
        
        This method is called automatically when you use model(input_data).
        Data flows: input features → encoder → prediction head → price output.
        
        Args: 
            x: Input car features (batch_size, 26)
        Returns: 
            price: Predicted car prices (batch_size, 1) 
        """
        encoded = self.encoder(x)    # Transform the data: (batch_size, 26) → (batch_size, 12)
        price = self.head(encoded)   # Prediction head estimates the price: (batch_size, 12) → (batch_size, 1)
        return price

print("PricePredictor class defined")

## 6. Supervised Training

In [None]:
def train_predictor(model, X_train, y_train, X_val, y_val, epochs=100, batch_size=256, lr=0.001):
    """
      Train the price prediction model using supervised learning.
      
      Performs the actual learning process: 
      repeatedly shows the model car features and their true prices, 
      then adjusts the model's weights to improve predictions.
      
      Args:
          model: PricePredictor model to train
          X_train: Training car features  
          y_train: True training prices
          X_val: Validation car features
          y_val: True validation prices
          epochs: Number of complete passes through training data
          batch_size: Number of samples processed together
          lr: Learning rate
          
      Returns:
          train_losses, val_losses, val_maes: Training progress metrics
      """
    
    # Set seeds for reproducibility
    set_seed(42)
    
    # Set up the optimizer and loss function
    optimizer = optim.Adam(model.parameters(), lr=lr) # Adam optimizer: Adaptive learning rate algorithm
    criterion = nn.L1Loss()

    # Create batches for efficient training
    generator = torch.Generator().manual_seed(42)
    
    # Training data: features → prices
    train_dataset = TensorDataset(X_train, y_train)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, generator=generator)
    
    # Validation data: features → prices
    val_dataset = TensorDataset(X_val, y_val)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    
    # Progress tracking: Store losses and metrics for each epoch
    train_losses = []
    val_losses = []
    val_maes = []
    
    
    # TRAINING LOOP
    for epoch in range(epochs):
        # Training Phase:
        model.train() # Training mode (dropout active; gradients are tracked by default)
        train_loss = 0.0 # Running sum of batch losses for this epoch
        for batch_X, batch_y in train_loader:       # Process training data in batches
            optimizer.zero_grad()                   # Clear previous gradients
            predictions = model(batch_X)            # Make predictions(calls forward)
            loss = criterion(predictions, batch_y)  # Calculate prediction error
            loss.backward()                         # Compute gradients (how to improve)
            optimizer.step()                        # Update weights (actual learning)
            train_loss += loss.item()               # Accumulate batch loss
        
        # Calculate average training loss for this epoch
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        # Validation Phase
        model.eval() # Evaluation mode (dropout off); gradients are disabled separately by torch.no_grad()
        val_loss = 0.0 # Running sum of batch losses for this epoch
        all_preds = [] # all predictions for MAE
        all_targets = [] # all true prices for MAE calculation
        with torch.no_grad(): # Process validation data without computing gradients (saves memory)
            for batch_X, batch_y in val_loader:
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)
                val_loss += loss.item()
                
                # Store predictions and true values for metrics
                all_preds.append(predictions.numpy())
                all_targets.append(batch_y.numpy())
        

        # Calculate validation metrics
        val_loss /= len(val_loader) # Average validation loss
        val_losses.append(val_loss)
        
        # Calculate Mean Absolute Error in pounds (more interpretable than loss)
        all_preds = np.concatenate(all_preds)
        all_targets = np.concatenate(all_targets)
        val_mae = mean_absolute_error(all_targets, all_preds)
        val_maes.append(val_mae)

        # Print progress every 5 epochs
        if (epoch + 1) % 5 == 0:
            print(f"Epoch {epoch+1}/{epochs} - Train Loss: {train_loss:.2f}, Val Loss: {val_loss:.2f}, Val MAE: £{val_mae:.2f}")
    
    return train_losses, val_losses, val_maes

print("Training function defined")

## 7. Experiments

### A: DL From Scratch

In [None]:
# EXPERIMENT A: TRAINING FROM SCRATCH (NO PRE-TRAINING)
print("EXPERIMENT A: FROM SCRATCH")

# Reproducible initialization
set_seed(42)

# Create autoencoder architecture with random initialization (no pre-training)
scratch_autoencoder = TabularAutoencoder(input_dim=input_dim, encoding_dim=12)

# Build price predictor using the untrained encoder
# Architecture: 26 input features → encoder (12 compressed features) → prediction head → price
model_scratch = PricePredictor(scratch_autoencoder.encoder, encoding_dim=12)

print("\nTraining from scratch...")
# Train the complete pipeline end-to-end using supervised learning
train_losses_scratch, val_losses_scratch, val_maes_scratch = train_predictor(
    model_scratch, 
    X_train_tensor, 
    y_train_tensor, 
    X_val_tensor, 
    y_val_tensor,
    epochs=50,
    batch_size=256,
    lr=0.001
)

# Extract best performance metrics for comparison
best_mae_scratch = min(val_maes_scratch)
best_epoch_scratch = np.argmin(val_maes_scratch) + 1
print(f"\nBest Validation MAE (From Scratch): £{best_mae_scratch:.2f} (Epoch {best_epoch_scratch})")

### B: DL with pre-trained Autoencoder

In [None]:
# EXPERIMENT B: PRE-TRAINING PHASE
print("EXPERIMENT B: WITH PRE-TRAINING")

# Reproducible initialization
set_seed(42)

# Create autoencoder with identical architecture to Experiment A
pretrained_autoencoder = TabularAutoencoder(input_dim=input_dim, encoding_dim=12)

print("\nPre-training autoencoder...")

# Unsupervised pre-training - learn feature representations
train_losses_pretrain, val_losses_pretrain = pretrain_autoencoder(
    pretrained_autoencoder,
    X_train_tensor,
    X_val_tensor,
    epochs=50,
    batch_size=256,
    lr=0.001
)

print("\nPre-training complete!")

In [None]:
# EXPERIMENT B: FINE-TUNING WITH PRE-TRAINED ENCODER
# Build price predictor using the pre-trained encoder
# The encoder weights are initialized with learned representations, not random values
model_pretrained = PricePredictor(pretrained_autoencoder.encoder, encoding_dim=12)


print("\nFine-tuning with pre-trained encoder...")
# Train the complete pipeline end-to-end using supervised learning
train_losses_pretrained, val_losses_pretrained, val_maes_pretrained = train_predictor(
    model_pretrained,
    X_train_tensor,
    y_train_tensor,
    X_val_tensor,
    y_val_tensor,
    epochs=50,
    batch_size=256,
    lr=0.001
)

# Extract best performance metrics for comparison
best_mae_pretrained = min(val_maes_pretrained)
best_epoch_pretrained = np.argmin(val_maes_pretrained) + 1
print(f"\nBest Validation MAE (Pre-trained): £{best_mae_pretrained:.2f} (Epoch {best_epoch_pretrained})")

## 8. Results and Comparison

In [None]:
# RESULTS SUMMARY

# Compile results from both experiments 
results = {
    'Model': ['From Scratch', 'Pre-trained'],
    'Validation MAE (£)': [best_mae_scratch, best_mae_pretrained],
    'Improvement vs Scratch (%)': [
        0.0,
        ((best_mae_scratch - best_mae_pretrained) / best_mae_scratch) * 100
    ]
}

# Create comparison table for clear result interpretation
results_df = pd.DataFrame(results)
print("FINAL RESULTS COMPARISON")
print(results_df.to_string(index=False))
print("\n")
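# Report the absolute change in validation MAE (positive = pre-training helped)
improvement = best_mae_scratch - best_mae_pretrained
if improvement > 0:
    print(f"Pre-training improved MAE by: £{improvement:.2f}")
else:
    print(f"Pre-training worsened MAE by: £{abs(improvement):.2f}")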

In [None]:
# VISUALIZATION: MODEL COMPARISON

fig, ax = plt.subplots(1, 1, figsize=(8, 6))

# Data
models = results_df['Model']
maes = results_df['Validation MAE (£)']
colors = ['#e74c3c', '#3498db']

# Create bars
x_pos = [0, 1]
bars = ax.bar(x_pos, maes, color=colors, alpha=0.85, width=0.6, edgecolor='black', linewidth=2)

# Styling
ax.set_ylabel('Validation MAE (£)', fontsize=13, fontweight='bold')
ax.set_title('Neural Network: Pre-training vs From Scratch', fontsize=15, fontweight='bold', pad=15)
ax.set_xticks(x_pos)
ax.set_xticklabels(models, fontsize=12, fontweight='bold')
ax.grid(axis='y', alpha=0.3, linestyle='--')
ax.set_ylim(0, maes.max() * 1.15)

# Add MAE values on bars
for bar, mae in zip(bars, maes):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + maes.max()*0.015,
            f'£{mae:.0f}',
            ha='center', va='bottom', fontsize=12, fontweight='bold')

# Add improvement info 
improvement = best_mae_scratch - best_mae_pretrained  
improvement_pct = ((best_mae_scratch - best_mae_pretrained) / best_mae_scratch) * 100 

# Highlight the improvement
mid_y = (maes[0] + maes[1]) / 2
if improvement > 0:
    # Arrow showing improvement
    ax.annotate('', xy=(1, maes[1]), xytext=(0, maes[0]),
                arrowprops=dict(arrowstyle='<->', color='green', lw=3, ls='--'))
    
    # Improvement text
    ax.text(0.5, mid_y, f'−£{improvement:.0f}\n({improvement_pct:.1f}% better)',
            ha='center', va='center', fontsize=11, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', 
                     edgecolor='darkgreen', linewidth=2, alpha=0.9))
else:
    # If pre-training made it worse
    ax.annotate('', xy=(0, maes[0]), xytext=(1, maes[1]),
                arrowprops=dict(arrowstyle='<->', color='red', lw=3, ls='--'))
    
    ax.text(0.5, mid_y, f'+£{abs(improvement):.0f}\n({abs(improvement_pct):.1f}% worse)',
            ha='center', va='center', fontsize=11, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.5', facecolor='lightcoral', 
                     edgecolor='darkred', linewidth=2, alpha=0.9))

plt.tight_layout()
plt.show()

In [None]:
# VISUALIZATION: TRAINING HISTORY

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Pre-training
axes[0].plot(train_losses_pretrain, label='Train Loss', color='#3498db', linewidth=2)
axes[0].plot(val_losses_pretrain, label='Val Loss', color='#e74c3c', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Reconstruction Loss (MSE)', fontsize=12)
axes[0].set_title('Autoencoder Pre-training', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Right plot: Fine-tuning
axes[1].plot(val_maes_scratch, label='From Scratch', color='#e74c3c', linewidth=2)
axes[1].plot(val_maes_pretrained, label='Pre-trained', color='#3498db', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Validation MAE (£)', fontsize=12)
axes[1].set_title('Supervised Training Comparison', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
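
To put a number on the convergence speed visible in the right-hand plot, one option is to record the first epoch at which each curve drops below a chosen MAE threshold. The helper below is a small sketch; `first_epoch_below` is a name introduced here, and the two curves are placeholders to be replaced with `val_maes_scratch` and `val_maes_pretrained`.

```python
def first_epoch_below(val_maes, threshold):
    """Return the 1-based epoch at which validation MAE first drops below threshold, else None."""
    for epoch, mae in enumerate(val_maes, start=1):
        if mae < threshold:
            return epoch
    return None


# Placeholder curves for illustration only; NOT measured results.
curve_a = [5000.0, 3500.0, 2800.0, 2400.0, 2200.0]
curve_b = [4000.0, 2900.0, 2500.0, 2350.0, 2250.0]

print(first_epoch_below(curve_a, 3000.0), first_epoch_below(curve_b, 3000.0))
```

Comparing these epoch counts at a few thresholds gives a compact summary of early convergence that complements the best-MAE comparison.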