# Fusion Strategy Comparison
This notebook compares different fusion strategies for multimodal learning (RGB + LiDAR).
Strategies evaluated:
1.  **Late Fusion**: Process modalities separately and combine at the end.
2.  **Intermediate Fusion**: Combine features at an intermediate layer using:
    *   Concatenation
    *   Addition
    *   Multiplication

The models are defined in `src/models.py` and training utilities in `src/training.py`.


## Setup
### Drive
This cell tries to use the data and files from Google Drive.
The paths may need to be changed for a different Drive layout. The cell assumes the structure `MyDrive/extended_assessments/Multimodal_Learning/{repo contents}` and that Drive is mounted at `/gdrive`.
If that does not work, we fall back to local files, which only works when there is a local copy of the files and code. This fallback was required because KDE, the desktop environment I am using, has no support for accessing Google Drive.

In [None]:
from pathlib import Path
import sys
import os
try:
    from google.colab import drive
    drive.mount('/gdrive')
    print("Mounted Google Drive")
    DATA_DIR = Path('/gdrive/MyDrive/extended_assessments/Multimodal_Learning/data')
    sys.path.append(os.path.abspath('/gdrive/MyDrive/extended_assessments/Multimodal_Learning'))
except Exception:  # not running on Colab, or mounting failed
    print("Running locally")
    DATA_DIR = Path('../data')
    sys.path.append(os.path.abspath('../.'))
print(f"Using {DATA_DIR} as data source")

### Imports

The `src.datasets`, `src.models` and `src.training` imports come from the files in `src/`. We need the `src.` prefix, since otherwise Python confuses our own `datasets` file with the Hugging Face `datasets` library.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import wandb

# Import dataset and models from src (sys and os are already imported in the Drive cell)
from src.datasets import CubesAndShperesDataset
from src.models import LateFusionModel, IntermediateFusionModel
from src.training import train_model, evaluate_model, count_parameters

# Configuration
BATCH_SIZE = 32
EPOCHS = 15
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
WANDB_ENTITY = "jan-kubeler-hpi"
WANDB_PROJECT = "clip-extended-assessment"

print(f"Using device: {DEVICE}")

### Reproducibility

This function sets all random seeds for reproducible results.
I copied it from the notebooks in https://github.com/andandandand/practical-computer-vision.

In [None]:
def set_seeds(seed=51):
    """
    Set seeds for complete reproducibility across all libraries and operations.

    Args:
        seed (int): Random seed value
    """
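    # Set environment variables. PYTHONHASHSEED is only read at interpreter startup,
    # so setting it here affects subprocesses started later, not this process.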
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

    # Python random module
    random.seed(seed)

    # NumPy
    np.random.seed(seed)

    # PyTorch CPU
    torch.manual_seed(seed)

    # PyTorch GPU (all devices)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # For multi-GPU setups

        # CUDA deterministic operations
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

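    # PyTorch deterministic algorithms (may impact performance).
    # Note: operations without a deterministic implementation raise a RuntimeError
    # later, when they are executed, not when this flag is set.
    try:
        torch.use_deterministic_algorithms(True)
    except RuntimeError:
        # Safeguard in case enabling deterministic mode itself fails
        print("Warning: Some operations may not be deterministic")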

    print(f"All random seeds set to {seed} for reproducibility")


set_seeds(51)
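
A quick way to check that seeding behaves as expected is to draw random numbers twice after setting the same seed. The sketch below is self-contained and only covers the PyTorch CPU generator; `sample_after_seed` is a helper name introduced here for illustration and is not part of the notebook.

```python
import torch


def sample_after_seed(seed: int) -> torch.Tensor:
    """Seed the global PyTorch generator and draw three random numbers."""
    torch.manual_seed(seed)
    return torch.rand(3)


first = sample_after_seed(51)
second = sample_after_seed(51)
other = sample_after_seed(52)

# Same seed gives the same draw; a different seed gives a different draw.
print(torch.equal(first, second))
print(torch.equal(first, other))
```

The same pattern can be used with `random` and `numpy` to check the other generators that `set_seeds` covers.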

## Datasets
The dataset class used is `CubesAndShperesDataset` from `src/datasets`. A noteworthy difference from the dataset in the NVIDIA Lab is that I normalized the LiDAR values to lie between 0 and 1 instead of between 0 and 50.
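
As a rough illustration of that normalization, the sketch below rescales raw values from the range 0 to 50 into the range 0 to 1. It assumes a plain division by the maximum range followed by clipping; the actual preprocessing lives in `src/datasets` and may differ. `normalize_lidar` and `LIDAR_MAX_RANGE` are names introduced here for illustration.

```python
import numpy as np

LIDAR_MAX_RANGE = 50.0  # assumed upper bound of the raw LiDAR values


def normalize_lidar(depth: np.ndarray, max_range: float = LIDAR_MAX_RANGE) -> np.ndarray:
    """Scale raw LiDAR values from [0, max_range] to [0, 1]."""
    return np.clip(depth / max_range, 0.0, 1.0).astype(np.float32)


raw = np.array([0.0, 12.5, 50.0])
normalized = normalize_lidar(raw)
print(normalized)
```

Keeping both modalities in a similar value range is a common choice when they are later combined by addition or multiplication.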

Since the CPU compute on Google Colab seems quite low and building the dataset took quite some time, I added the option to export the datasets and load them again, which avoids the cost of rebuilding them. The first cell below tries to load the precomputed dataset from `dataset_precomputed.pkl`. The second cell builds the datasets from scratch, and the third cell exports the created datasets into the `dataset_precomputed.pkl` file.

**If the precomputed file is available, only run the first of the three following cells!**
Otherwise this will take up to three hours!

The splitting logic is aligned with the NVIDIA Lab, which holds out the last 10 batches of 32 samples per class (320 samples per class) for validation. This splitting is not very sophisticated, but I wanted to maintain alignment with the original experiments, since the final evaluation uses the Lab numbers as well.

In [None]:
# load precomputed dataset
import pickle
with open("/gdrive/MyDrive/extended_assessments/Multimodal_Learning/notebooks/dataset_precomputed.pkl", "rb") as f:
    data = pickle.load(f)
train_indices = data["train_indices"]
val_indices = data["val_indices"]
train_dataset = data["train_dataset"]
val_dataset = data["val_dataset"]

In [None]:
# Build Datasets
full_dataset = CubesAndShperesDataset(DATA_DIR)

# Split into Train and Validation (No Test)
# Logic from 05_Assessment.ipynb: Last VALID_BATCHES * BATCH_SIZE are validation

total_len = len(full_dataset)
n_classes = 2
samples_per_class = total_len // n_classes

VALID_BATCHES = 10
valid_samples_per_class = VALID_BATCHES * BATCH_SIZE
train_samples_per_class = samples_per_class - valid_samples_per_class

train_indices = []
val_indices = []
for i in range(n_classes):
    start_idx = i * samples_per_class
    train_indices.extend(range(start_idx, start_idx + train_samples_per_class))
    val_indices.extend(range(start_idx + train_samples_per_class, start_idx + samples_per_class))

from torch.utils.data import Subset
train_dataset = Subset(full_dataset, train_indices)
val_dataset = Subset(full_dataset, val_indices)

In [None]:
import pickle
with open("dataset_precomputed.pkl", "wb") as f:
    pickle.dump({
        "train_indices": train_indices,
        "val_indices": val_indices,
        "train_dataset": train_dataset,
        "val_dataset": val_dataset
    }, f)

In [None]:
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

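# Computed from the splits, so this also works when only the precomputed dataset was loaded
print(f"Total samples: {len(train_dataset) + len(val_dataset)}")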
print(f"Train size: {len(train_dataset)}")
print(f"Val size: {len(val_dataset)}")

## Model Architectures

The models `LateFusionModel` and `IntermediateFusionModel` are defined in `src/models.py` and imported above.

### Late Fusion Model
Process modalities separately and combine outputs.

### Intermediate Fusion Model
Combine feature maps at an intermediate layer. We explore three variations:
*   Concatenation
*   Addition
*   Hadamard Product
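
A minimal sketch of the three fusion operations, assuming both branches produce feature maps of the same shape `(batch, channels, height, width)`. The function `fuse` and the tensor sizes are hypothetical and only illustrate the shapes involved; the actual layers are defined in `src/models.py`.

```python
import torch


def fuse(rgb_feat: torch.Tensor, lidar_feat: torch.Tensor, fusion_type: str) -> torch.Tensor:
    """Combine two feature maps of identical shape (hypothetical helper)."""
    if fusion_type == "concat":
        # Stack along the channel dimension: the channel count doubles
        return torch.cat([rgb_feat, lidar_feat], dim=1)
    if fusion_type == "add":
        # Element-wise sum: shape is unchanged
        return rgb_feat + lidar_feat
    if fusion_type == "multiply":
        # Hadamard (element-wise) product: shape is unchanged
        return rgb_feat * lidar_feat
    raise ValueError(f"Unknown fusion type: {fusion_type}")


rgb_feat = torch.randn(2, 8, 4, 4)    # made-up sizes for illustration
lidar_feat = torch.randn(2, 8, 4, 4)

shapes = {name: tuple(fuse(rgb_feat, lidar_feat, name).shape)
          for name in ("concat", "add", "multiply")}
print(shapes)
```

Only concatenation changes the shape, so any layer that follows it needs twice as many input channels as in the other two variants.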

## Experiments

This cell defines the hyperparameters: the learning rates, whether a cosine annealing scheduler is used, and the fusion strategies. The code performs a simple grid search over all combinations. Intermediate results are logged to W&B, and some final metrics are plotted at the end.

Each configuration is trained for the full number of epochs on the full training set. Since a run takes only about 2 minutes, I think that is fine.
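
To get a feeling for the size of the grid, the sketch below enumerates all combinations with `itertools.product` and builds the run names in the same format as the training cell. It only lists the runs and does not train anything.

```python
from itertools import product

strategy_ids = ["late-fusion", "intermediate-concat", "intermediate-add", "intermediate-multiply"]
learning_rates = [1e-4, 1e-5, 1e-6]
use_schedulers = [True, False]

# One run per (strategy, learning rate, scheduler) combination
run_names = [
    f"{strategy_id}_lr{lr}_{'cosine' if use_sched else 'constant'}"
    for strategy_id, lr, use_sched in product(strategy_ids, learning_rates, use_schedulers)
]

print(len(run_names))
print(run_names[0])
```

The grid contains 4 × 3 × 2 = 24 runs, which at about 2 minutes per run adds up to roughly 48 minutes in total.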

In [None]:
results = {}

# Hyperparameters to test
learning_rates = [1e-4, 1e-5, 1e-6]
use_schedulers = [True, False]

# Define models to test
# Format: (Name, ID, ModelClass, InitArgs)
strategies = [
    ("Late Fusion", "late-fusion", LateFusionModel, {}),
    ("Intermediate (Concat)", "intermediate-concat", IntermediateFusionModel, {'fusion_type': 'concat'}),
    ("Intermediate (Add)", "intermediate-add", IntermediateFusionModel, {'fusion_type': 'add'}),
    ("Intermediate (Multiply)", "intermediate-multiply", IntermediateFusionModel, {'fusion_type': 'multiply'}),
]
best_val_acc = 0.0
best_model_info = None
for name, strategy_id, model_cls, model_kwargs in strategies:
    for lr in learning_rates:
        for use_sched in use_schedulers:
            sched_str = "cosine" if use_sched else "constant"
            run_name = f"{strategy_id}_lr{lr}_{sched_str}"

            print(f"\n=== Training {name} | LR: {lr} | Scheduler: {sched_str} ===")

            wandb.init(project=WANDB_PROJECT, name=run_name, entity=WANDB_ENTITY, config={
                "fusion_strategy": strategy_id,
                "batch_size": BATCH_SIZE,
                "learning_rate": lr,
                "scheduler": sched_str,
                "epochs": EPOCHS
            })

            # Initialize model
            model = model_cls(**model_kwargs).to(DEVICE)
            model_params = count_parameters(model)

            optimizer = optim.Adam(model.parameters(), lr=lr)
            criterion = nn.CrossEntropyLoss()
            scheduler = None
            if use_sched:
                scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

            # Train
            train_loss, val_loss, val_acc, train_time = train_model(model, train_loader, val_loader, optimizer, criterion, epochs=EPOCHS, scheduler=scheduler, use_wandb=True)

            # Evaluate on Validation Set (since we don't have a separate test set)
            final_acc, final_f1, _, _ = evaluate_model(model, val_loader)

            results[run_name] = {
                'Validation Loss': val_loss[-1],
                'Parameters': model_params,
                'Training Time (s)': train_time,
                'Final Accuracy': final_acc,
                'Final F1': final_f1
            }

            print(f"Final Val Metrics: Acc={final_acc:.4f}, F1={final_f1:.4f}")
            
            if final_acc > best_val_acc:
                best_val_acc = final_acc
                best_model_info = {
                    "name": name,
                    "strategy_id": strategy_id,
                    "model_cls": model_cls,
                    "model_kwargs": model_kwargs,
                    "learning_rate": lr,
                    "scheduler": sched_str,
                    "accuracy": final_acc,
                    "f1_score": final_f1
                }
                torch.save(model.state_dict(), "/gdrive/MyDrive/extended_assessments/Multimodal_Learning/checkpoints/best_fusion_model.pth")
                with open("/gdrive/MyDrive/extended_assessments/Multimodal_Learning/checkpoints/best_fusion_model_info.txt", "w") as f:
                    f.write(str(best_model_info))

            wandb.finish()
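
With `T_max=EPOCHS` and the default `eta_min=0`, `CosineAnnealingLR` follows the closed form $\eta_t = \eta_{min} + \tfrac{1}{2}(\eta_0 - \eta_{min})(1 + \cos(\pi t / T_{max}))$. The sketch below evaluates that formula in plain Python for the largest learning rate in the grid. It assumes that `train_model` calls `scheduler.step()` once per epoch, which is not visible in this notebook; `cosine_lr` is a helper name introduced here for illustration.

```python
import math


def cosine_lr(base_lr: float, step: int, t_max: int, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing value after `step` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


EPOCHS = 15
schedule = [cosine_lr(1e-4, step, EPOCHS) for step in range(EPOCHS + 1)]

for step, lr in enumerate(schedule):
    print(f"after {step:2d} steps: lr = {lr:.2e}")
```

The learning rate starts at the configured value and decays to zero after the last step, so the final epochs train with very small updates. For the smallest learning rate in the grid this may leave the model undertrained within 15 epochs.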

## Comparison and Analysis

For an in-depth analysis, check out W&B. Here we plot the final accuracies and training times of the different hyperparameter combinations.

In [None]:
# Create Comparison Table
df_results = pd.DataFrame(results).T
print(df_results)

# Bar plot of Final Accuracy
plt.figure(figsize=(12, 6))
df_results['Final Accuracy'].plot(kind='bar')
plt.title('Final Validation Accuracy by Run')
plt.ylabel('Accuracy')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Bar plot of Training Time
plt.figure(figsize=(12, 6))
df_results['Training Time (s)'].plot(kind='bar', color='orange')
plt.title('Training Time by Run')
plt.ylabel('Time (s)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
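
Because each row of the results table is indexed by its run name, it can help to split that name back into strategy, learning rate and scheduler before grouping or filtering. The sketch below does this for the naming scheme `<strategy_id>_lr<lr>_<scheduler>` used in the training cell; `parse_run_name` is a helper introduced here for illustration.

```python
def parse_run_name(run_name: str) -> dict:
    """Split '<strategy_id>_lr<lr>_<scheduler>' into its three parts."""
    # Split from the right, so underscores in the strategy ID would not break parsing
    strategy_id, lr_part, scheduler = run_name.rsplit("_", 2)
    if not lr_part.startswith("lr"):
        raise ValueError(f"Unexpected run name: {run_name!r}")
    return {
        "strategy": strategy_id,
        "learning_rate": float(lr_part[2:]),
        "scheduler": scheduler,
    }


parsed = parse_run_name("intermediate-concat_lr1e-05_cosine")
print(parsed)
```

Applied to every entry of `df_results.index`, the parsed parts could be added as columns, which would allow, for example, averaging the accuracy per fusion strategy with `groupby`.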

### Analysis

#### Theoretical Thoughts

The main difference between Late and Intermediate Fusion is the level of independence of the two modalities. In Late Fusion, the model has to focus more on the two separate modalities, potentially allowing for a better, more detailed understanding of both, and only combines them at the end. In real-world applications this also allows for better reuse of pretrained models, since they can just be "plugged in" and used.

Intermediate Fusion, on the other hand, aims at a better combination of both modalities. Since we combine them earlier in the intermediate architectures, the model is more capable of finding correlations between the two modalities, which should theoretically lead to better performance when there are strong correlations between the data points of the different modalities.
* The **Concatenation** approach has the benefit of fully maintaining the features of both modalities, but comes at the cost of greater dimensionality, resulting in a higher parameter count and longer training.
* The **Addition** approach keeps the dimensionality lower and gives the same importance to all features in both modalities.
* The **Multiplication** approach also keeps the dimensionality down, but puts more importance on features present in both modalities, since a multiplication results in a small product when at least one factor is small and in a larger product only when both factors are large. Ideally, this leads to a more sophisticated alignment of the two modalities.
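
To make these arguments concrete, the following sketch counts the parameters of a single fully connected layer placed directly after the fusion step and compares how sum and product react to one weak activation. The feature size `d = 128` and the layer width `h = 64` are made-up values for illustration, not the sizes used in `src/models.py`.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


d, h = 128, 64  # hypothetical feature size per modality and layer width

params_concat = linear_params(2 * d, h)   # concatenation doubles the input width
params_add_or_mul = linear_params(d, h)   # addition and multiplication keep it at d
print(params_concat, params_add_or_mul)

# One weak and one strong activation versus two strong activations
weak_strong = (0.1, 0.9)
strong_strong = (0.9, 0.9)

sum_ratio = sum(strong_strong) / sum(weak_strong)
prod_ratio = (strong_strong[0] * strong_strong[1]) / (weak_strong[0] * weak_strong[1])
print(round(sum_ratio, 2), round(prod_ratio, 2))
```

In this toy setting the layer after concatenation needs about twice as many parameters, and the product separates "both strong" from "one weak" far more sharply than the sum does.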

#### Practical Results
| Metric        | Late Fusion | IF-Add | IF-Multiply | IF-Concat |
|---------------|-------------|--------|-------------|-----------|
| Val Loss      |             |        |             |           |
| Parameters    |             |        |             |           |
| Training Time |             |        |             |           |
| GPU Memory    |             |        |             |           |

Training time here covers a full 15-epoch training run on a T4 Colab system.