# Final Individual Model Notebook

This notebook is a **simplified, production-ready version** of our experimental notebooks. It is designed to:
- Present the way we selected the best individual model trained during experimentation.
- Run predictions on the full training and test sets.
- Explore and select the optimal threshold for classification.
- Generate a submission file for the test set.

All code here is streamlined for clarity and reproducibility.

In [None]:
# Standard library imports
import os
import time

# Third-party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image

# PyTorch and related imports
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader, Subset, random_split
import torchvision.transforms as transforms

# ML utilities
import timm
from sklearn.metrics import f1_score

# Work around the "OMP: Error #15" crash that occurs when more than one OpenMP runtime is loaded
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

## Environment Setup

- **Device selection**: Automatically uses GPU if available, otherwise falls back to CPU.
- **Reproducibility**: A seeded `torch.Generator` makes the train/validation split and the shuffling order repeatable across runs.
- **Directory paths**: Set up paths for training images, test images and label file.


In [None]:
# Print available CUDA devices and select device for computation
print(f"CUDA Devices: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")

# Seeded generator, used for the train/val split and DataLoader shuffling
random_state_42 = torch.Generator().manual_seed(42)

In [None]:
images_train_dir = 'images_train'
images_test_dir = 'images_test'
labels_dir = 'train_onehot.csv'

## Data Preparation

- **Dataset class**: Custom PyTorch Dataset for both training and test images. Handles label loading and image reading.
- **Transforms**: Includes augmentations for training and normalization for both train/test.
- **Augmentations**: Training images are augmented with color jitter, flips, blur, and random crops to improve generalization.
- **Normalization**: Both train and test images are normalized using dataset-specific mean and std values found through EDA.
- **Splitting**: The training set is split into train/validation subsets for model selection and threshold tuning.
- **Dataloaders**: For full training, validation, and test sets.
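
The normalization constants are described as coming from EDA, a step this notebook does not show. A minimal sketch of how per-channel statistics could be computed, assuming the images are already loaded as `H x W x 3` float arrays scaled to [0, 1]. The helper name `channel_mean_std` and the synthetic input are illustrative and not part of the original pipeline.

```python
import numpy as np

def channel_mean_std(images):
    """Per-channel mean and std over an iterable of HxWx3 arrays in [0, 1].

    Hypothetical helper: accumulates running sums so that images of
    different sizes can be streamed without stacking them in memory.
    """
    pixel_count = 0
    channel_sum = np.zeros(3, dtype=np.float64)
    channel_sq_sum = np.zeros(3, dtype=np.float64)
    for img in images:
        pixels = np.asarray(img, dtype=np.float64).reshape(-1, 3)
        pixel_count += pixels.shape[0]
        channel_sum += pixels.sum(axis=0)
        channel_sq_sum += (pixels ** 2).sum(axis=0)
    mean = channel_sum / pixel_count
    # Population variance E[x^2] - E[x]^2, clipped against rounding error
    var = np.clip(channel_sq_sum / pixel_count - mean ** 2, 0.0, None)
    return mean, np.sqrt(var)

# Synthetic stand-in for real images (values are random, not dataset statistics)
rng = np.random.default_rng(0)
demo_images = [rng.random((8, 8, 3)), rng.random((4, 6, 3))]
mean, std = channel_mean_std(demo_images)
print(mean.round(4), std.round(4))
```

The resulting vectors would be passed as `mean` and `std` to `transforms.Normalize`.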


In [None]:
image_dim_px = 224

class FoodDataset(Dataset):
    def __init__(self, img_dir, labels_csv = None, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        self.is_training = labels_csv is not None  # True if labels are provided (train/val), False for test

        if self.is_training:
            self.labels_df = pd.read_csv(labels_csv)
            self.filenames = self.labels_df.iloc[:, 0].values
            self.labels = self.labels_df.iloc[:, 1:].values.astype('float')
        else:
            self.filenames = sorted(os.listdir(img_dir))
            self.labels = None  # No labels for the test set

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.filenames[idx])
        image = Image.open(img_path).convert("RGB")
        if self.transform:
            image = self.transform(image)

        if self.is_training:
            label = torch.tensor(self.labels[idx])
            return image, label
        else:
            return image, self.filenames[idx]

# Define data augmentation and normalization for training
train_transform = transforms.Compose([
    transforms.Resize((image_dim_px, image_dim_px)),
    transforms.RandomApply([transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1)], p=0.3),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0))], p=0.2),
    transforms.RandomApply([transforms.RandomResizedCrop(image_dim_px, scale=(0.9, 1.0), ratio=(1.0, 1.0))], p=0.3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5944, 0.5082, 0.4259], std=[0.2128, 0.2213, 0.2308])
])

# Resize and normalization only (no augmentation) for test/validation
test_transform = transforms.Compose([
    transforms.Resize((image_dim_px, image_dim_px)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5944, 0.5082, 0.4259], std=[0.2128, 0.2213, 0.2308])
])

full_train_dataset = FoodDataset(images_train_dir, labels_dir, transform=train_transform)
test_dataset = FoodDataset(images_test_dir, labels_csv=None, transform=test_transform)

# Reproducible train/val split
val_ratio = 0.2
val_size = int(len(full_train_dataset) * val_ratio)
train_size = len(full_train_dataset) - val_size
train_indices, val_indices = random_split(range(len(full_train_dataset)), [train_size, val_size], generator=random_state_42)
train_dataset = Subset(FoodDataset(images_train_dir, labels_dir, transform=train_transform), train_indices)
val_dataset = Subset(FoodDataset(images_train_dir, labels_dir, transform=test_transform), val_indices)

# Dataloaders for all datasets
batch_size = 64
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=4, generator=random_state_42)
full_train_dataloader = DataLoader(full_train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=4, generator=random_state_42)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, pin_memory=True, num_workers=4)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, pin_memory=True, num_workers=4)

## Model Selection and Fine-Tuning

- **Model**: We use a Swin Transformer (swin_base_patch4_window7_224) from the timm library, pre-trained on ImageNet.
- **Layer Freezing**: All parameters are frozen first to retain pre-trained features; the last two stages (`layers[2]`, `layers[3]`) and the classification head are then unfrozen for fine-tuning.
- **Multi-GPU**: DataParallel is used for efficient training on multiple GPUs if available.

In [None]:
# Model definition and layer freezing/unfreezing

def initialize_model(model_name):
    model = timm.create_model(model_name, pretrained=True, num_classes=498)

    # Freeze all layers initially
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze later layers and head for fine-tuning
    for param in model.layers[2].parameters():
        param.requires_grad = True
    for param in model.layers[3].parameters():
        param.requires_grad = True
    for param in model.head.parameters():
        param.requires_grad = True
    
    model = nn.DataParallel(model)  # Enable multi-GPU training if available
    model = model.to(device)

    return model

model = initialize_model('swin_base_patch4_window7_224')

### Hyperparameters and Optimization

- **Loss Function**: Binary cross-entropy with logits, suitable for multi-label classification.
- **Optimizer**: Adam with per-group learning rates: 1e-3 for the head, 1e-4 for `layers[3]`, and 1e-5 for `layers[2]`.
- **Scheduler**: StepLR halves every learning rate after each 5 scheduler steps (`step_size=5`, `gamma=0.5`) to help convergence.
- **Threshold**: Initial threshold is set to 0.5, but will be optimized later.
- **Epochs**: 10 epochs for initial fine-tuning.
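
The effect of `StepLR` on the three parameter groups can be worked out by hand. The sketch below reproduces the schedule arithmetic in plain Python, assuming `scheduler.step()` is called exactly once per epoch. `lr_at_epoch` is an illustrative helper, not a PyTorch function.

```python
def lr_at_epoch(base_lr, epoch, step_size=5, gamma=0.5):
    """Learning rate in effect during a 1-indexed epoch under StepLR-style decay.

    Assumes the scheduler is stepped once at the end of every epoch, so
    (epoch - 1) steps have happened before the given epoch starts.
    """
    return base_lr * gamma ** ((epoch - 1) // step_size)

base_lrs = {"head": 1e-3, "layers[3]": 1e-4, "layers[2]": 1e-5}
for epoch in (1, 5, 6, 10):
    rates = {name: lr_at_epoch(lr, epoch) for name, lr in base_lrs.items()}
    print(epoch, rates)
```

Under that assumption, a 10-epoch run uses the base rates for epochs 1 to 5 and half of them for epochs 6 to 10.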

In [None]:
def initialize_hyperparameters(model):
    # This function is not strictly necessary, but can be used to encapsulate hyperparameter initialization
    loss_fn = nn.BCEWithLogitsLoss()  # this applies sigmoid inside the loss

    optimizer = torch.optim.Adam([ 
        {'params': model.module.head.parameters(), 'lr': 1e-3},
        {'params': model.module.layers[3].parameters(), 'lr': 1e-4},
        {'params': model.module.layers[2].parameters(), 'lr': 1e-5},
    ])

    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
   
    return loss_fn, optimizer, scheduler

loss_fn, optimizer, scheduler = initialize_hyperparameters(model)

classification_threshold = 0.5  # Used as a static benchmark value
num_epochs = 10 # 10 epochs is a good starting point for fine-tuning

### Threshold Optimization Utility

The `find_optimal_threshold` function sweeps thresholds from 0.20 to 0.50 in steps of 0.02 to:
- Find the threshold that maximizes micro-F1 score on the validation set.
- Report the mean and standard deviation of F1 across thresholds.

This helps select a threshold that balances precision and recall for multi-label classification.
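
Micro-F1 pools true positives, false positives, and false negatives over every (image, label) pair before computing `2*TP / (2*TP + FP + FN)`. A small NumPy sketch of the sweep on made-up probabilities follows. The arrays and the helper `micro_f1` are illustrative; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary indicator matrices of equal shape."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Toy data: 3 images, 4 labels (values invented for illustration)
labels = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
probs = np.array([[0.90, 0.10, 0.30, 0.40],
                  [0.20, 0.60, 0.10, 0.05],
                  [0.70, 0.35, 0.45, 0.10]])

for t in (0.25, 0.50):
    preds = (probs > t).astype(int)
    print(f"threshold={t:.2f} micro-F1={micro_f1(labels, preds):.3f}")
```

In this toy case the lower threshold trades two false positives for two recovered true positives, which is the kind of precision/recall trade-off the sweep is meant to expose.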

In [None]:
def find_optimal_threshold(all_probs, all_labels):
    # Sweep a range of thresholds to maximize micro-F1
    threshold_range = np.arange(0.20, 0.51, 0.02)  
    
    f1_scores = []
    best_f1 = 0
    best_thresh = 0

    for t in threshold_range:
        temp_preds = (all_probs > t).astype(np.float32)
        temp_f1 = f1_score(all_labels, temp_preds, average='micro')
        f1_scores.append(temp_f1)
        
        if temp_f1 > best_f1:
            best_f1 = temp_f1
            best_thresh = t

    f1_scores = np.array(f1_scores)
    mean_f1 = f1_scores.mean()
    std_f1 = f1_scores.std()

    return best_thresh, best_f1, mean_f1, std_f1


## Training and Validation Loop

- **Training**: For each epoch, the model is trained on the training set and evaluated on the validation set.
- **Metrics**: Tracks training/validation loss, F1 score at fixed and optimal thresholds, learning rate, and number of trainable parameters.
- **Threshold Search**: After each epoch, the best threshold is found for the current model state.
- **Progress**: ETA and timing information are printed for monitoring.

In [None]:
num_batches_train = len(train_dataloader)
num_batches_val = len(val_dataloader)
total_runtime = 0
history = {'epoch': [],'train_loss': [],'val_loss': [],'f1_at_fixed_thresh': [],'fixed_thresh': [],'best_f1': [],
    'best_thresh': [], 'mean_f1' : [], 'f1_std' : [],'lr': [], 'num_trainable_params' : [], 'epoch_time_sec': [],'cumulative_time_min': []}
previous_lr = 0

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}:")
    history['epoch'].append(epoch+1)
    epoch_start_time = time.time()   

    # Learning rate monitoring
    current_lr = optimizer.param_groups[0]['lr']
    history['lr'].append(current_lr)
    if current_lr != previous_lr:
        print(f"Scheduler updates base (head) learning rate to: {current_lr:.3e}")     
    previous_lr = current_lr
    
    # Trainable parameters monitoring
    num_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    history['num_trainable_params'].append(num_trainable_params)

    # ------ Training phase ------
    model.train()
    running_loss_train = 0
    print(f"Begin training {num_trainable_params} parameters.")

    for batch_number, (X, Y) in enumerate(train_dataloader):  # X: Image tensor, Y: Label tensor
        X, Y = X.to(device), Y.to(device)

        # Forward pass: Compute prediction and loss
        logits = model(X)
        loss = loss_fn(logits, Y)
        running_loss_train += loss.item()

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (batch_number + 1)%100==0 or batch_number==0:
            # Progress log
            print(f"\rTrained on batch {batch_number + 1}/{num_batches_train}. Current training loss: {loss.item():.4f}",
                  end="", flush=True)
    print(
        f"\nFinished training for epoch {epoch + 1}. Average training loss: {running_loss_train / num_batches_train:.4f}")
    history['train_loss'].append(running_loss_train / num_batches_train)

    # ------ Validation phase ------
    model.eval()
    running_loss_val = 0
    all_probs = []  # for post-processing
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch_number, (X, Y) in enumerate(val_dataloader):
            X, Y = X.to(device), Y.to(device)

            # Forward pass: Compute prediction and loss
            logits = model(X)       
            loss = loss_fn(logits, Y)
            running_loss_val += loss.item()

            # Compute binary predictions and collect labels
            probs = torch.sigmoid(logits).cpu().numpy()
            preds = (probs > classification_threshold).astype(np.float32)
            labels = Y.cpu().numpy()

            all_probs.append(probs)
            all_preds.append(preds)
            all_labels.append(labels)

            if (batch_number + 1)%25==0 or batch_number==0:
                # Progress log
                print(
                    f"\rValidated batch {batch_number + 1}/{num_batches_val}. Current validation loss: {loss.item():.4f}",
                    end="", flush=True)

    print(
        f"\nFinished validation for epoch {epoch + 1}. Average validation loss: {running_loss_val / num_batches_val:.4f}")
    history['val_loss'].append(running_loss_val / num_batches_val)

    # ------ Epoch grand result ------

    # Concatenate all batches and compute F1 score
    all_preds = np.vstack(all_preds)
    all_probs = np.vstack(all_probs)
    all_labels = np.vstack(all_labels)
    f1_micro = f1_score(all_labels, all_preds, average='micro')

    print(f"Epoch {epoch + 1} micro F1 score: {f1_micro:.5f} with threshold: {classification_threshold:.2f}.")  # For comparison across epochs
    history['f1_at_fixed_thresh'].append(f1_micro)
    history['fixed_thresh'].append(classification_threshold)  

    # Threshold optimization per epoch
    best_thresh, best_f1, mean_f1, std_f1 = find_optimal_threshold(all_probs, all_labels)    
    print(f"Epoch {epoch + 1} optimal micro F1 score: {best_f1:.5f} with threshold: {best_thresh:.2f}.")
    history['best_f1'].append(best_f1)
    history['best_thresh'].append(best_thresh)
    history['mean_f1'].append(mean_f1)
    history['f1_std'].append(std_f1)

    scheduler.step()

    # All done. Deal with temporal stuff
    epoch_end_time = time.time()
    epoch_duration = epoch_end_time - epoch_start_time
    total_runtime += epoch_duration
    avg_epoch_duration = total_runtime / (epoch+1)
    remaining_epochs = num_epochs - epoch - 1
    eta_seconds = avg_epoch_duration * remaining_epochs
    print(f"Epoch {epoch + 1}/{num_epochs} completed in {int(epoch_duration)} seconds. ETA: {round(eta_seconds / 60)} minutes.")
    history['epoch_time_sec'].append(epoch_duration)
    history['cumulative_time_min'].append(round(total_runtime / 60))
    print("------------------------------------------------------------------")


### Training History and Metrics

- **DataFrame**: All tracked metrics are stored in a DataFrame for easy analysis and visualization.
- **Interpretation**: We used this table to identify the best epoch, monitor overfitting, and compare loss/F1 trends.

In [None]:
df_history = pd.DataFrame(history)
print(df_history.round(4))

### Visualizing Training Progress

- **Loss Curves**: Compare training and validation loss to check for overfitting.
- **F1 Scores**: Track F1 at fixed and optimal thresholds to monitor model improvement.
- **Threshold Trends**: Observe how the best threshold evolves over epochs.
- **Best Epoch**: The epoch with the highest optimal-threshold micro-F1 is printed below the plots for model selection.

In [None]:
# Create subplots with 4 rows and 1 column
fig, axes = plt.subplots(4, 1, figsize=(12, 24))

# 1st subplot: Train and Validation Loss
axes[0].plot(df_history['epoch'], df_history['train_loss'], label='Train Loss', color='blue')
axes[0].plot(df_history['epoch'], df_history['val_loss'], label='Validation Loss', color='red')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Train and Validation Loss')
axes[0].legend()
axes[0].grid(True)

# 2nd subplot: F1 score at Fixed Threshold and the Threshold Line
axes[1].plot(df_history['epoch'], df_history['f1_at_fixed_thresh'], label='F1 at Fixed Threshold', color='green')
axes[1].plot(df_history['epoch'], df_history['fixed_thresh'], label='Fixed Threshold', color='orange', linestyle='--')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('F1 Score')
axes[1].set_title('F1 at Fixed Threshold and Threshold Line')
axes[1].legend()
axes[1].grid(True)

# 3rd subplot: Best F1 Score and Best Threshold
axes[2].plot(df_history['epoch'], df_history['best_f1'], label='Best F1 Score', color='purple')
axes[2].plot(df_history['epoch'], df_history['best_thresh'], label='Best Threshold', color='brown', linestyle='--')
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('Score / Threshold')
axes[2].set_title('Best F1 Score and Best Threshold')
axes[2].legend()
axes[2].grid(True)

# 4th subplot: Mean F1 and F1 Standard Deviation
axes[3].plot(df_history['epoch'], df_history['mean_f1'], label='Mean F1 across thresholds', color='teal')
axes[3].plot(df_history['epoch'], df_history['f1_std'], label='Std of F1 across thresholds', color='darkorange')
axes[3].set_xlabel('Epoch')
axes[3].set_ylabel('F1 Score')
axes[3].set_title('Mean and Std of F1 Across Thresholds')
axes[3].legend()
axes[3].grid(True)

# Adjust layout
plt.tight_layout()
plt.show()

# Print best epoch based on highest best_f1
best_epoch_index = df_history['best_f1'].idxmax()
best_row = df_history.loc[best_epoch_index]
print(f"Best epoch based on micro-F1: {int(best_row['epoch'])} "
      f"with threshold: {best_row['best_thresh']:.2f} "
      f"and micro F1: {best_row['best_f1']:.5f}")


## Final Training on Full Dataset

- **Why retrain?** After selecting the best hyperparameters and threshold, we retrain the model on the entire training set to maximize data usage.
- **Parameters**: The number of epochs (6) and the classification threshold (0.32) are taken from the validation run above.
- **Model Saving**: The final model is saved for future inference and ensembling.
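
Because `initialize_model` wraps the network in `nn.DataParallel`, the keys of the saved state dict start with `module.`. Loading the checkpoint into an unwrapped model requires removing that prefix first. A minimal sketch using plain dictionaries in place of real tensors follows; `strip_module_prefix` is a hypothetical helper and the example keys are invented.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading DataParallel prefix removed."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Stand-in for a loaded checkpoint: keys as DataParallel would save them
saved = {"module.head.weight": "W", "module.head.bias": "b"}
cleaned = strip_module_prefix(saved)
print(list(cleaned))
```

The cleaned dictionary can then be passed to `load_state_dict` on a model created without the `DataParallel` wrapper.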

In [None]:
model = initialize_model('swin_base_patch4_window7_224')  # Reinitialize to get a fresh start

loss_fn, optimizer, scheduler = initialize_hyperparameters(model) # Same

num_epochs = 6 # Based on validation
classification_threshold = 0.32  # Based on the epoch we chose

# We will save the model after full training, to use in an ensemble
model_save_name = "SwinV1(3-4)_v15.pth"
submission_name = "submission_nik_v15_0_1.csv"

### Full Training Loop

- **No Validation**: All data is used for training, so no validation metrics are computed.
- **Progress**: Training loss and ETA are printed for each epoch, nothing more.

In [None]:
num_batches = len(full_train_dataloader)
total_runtime = 0

for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}:")
    epoch_start_time = time.time()

    model.train()
    running_loss_train = 0.0    
   
    for batch_number, (X, Y) in enumerate(full_train_dataloader):     # X: Image tensor, Y: Label tensor
        X, Y = X.to(device), Y.to(device)

        logits = model(X)
        loss = loss_fn(logits, Y)
        running_loss_train += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (batch_number + 1)%100==0 or batch_number==0:
            # Progress log
            print(f"\rTrained on batch {batch_number+1}/{num_batches}. Current training loss: {loss.item():.4f}", end="", flush=True)

    print(f"\nFinished training for epoch {epoch+1}. Average training loss: {running_loss_train/num_batches:.4f}")

    scheduler.step()

    # All done. Deal with temporal stuff
    epoch_end_time = time.time()
    epoch_duration = epoch_end_time - epoch_start_time
    total_runtime += epoch_duration
    avg_epoch_duration = total_runtime / (epoch+1)
    remaining_epochs = num_epochs - epoch - 1
    eta_seconds = avg_epoch_duration * remaining_epochs
    print(f"Epoch {epoch + 1}/{num_epochs} completed in {int(epoch_duration)} seconds. ETA: {round(eta_seconds / 60)} minutes.")
    print("------------------------------------------------------------------")

# Save the fully trained model to use in the ensemble
torch.save(model.state_dict(), model_save_name)  # The model was trained on 2 GPUs via nn.DataParallel, so its state dict keys carry a "module." prefix
print(f"Model saved as {model_save_name}.")

## Test Set Prediction and Submission

- **Inference**: The trained model predicts labels for each test image.
- **Thresholding**: Predictions are binarized using the selected threshold.
- **Fallback**: If no label is assigned to an image, the most confident label is set to ensure every image has at least one label.
- **Submission**: Results are saved in the required CSV format for competition submission.
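
The thresholding and fallback steps can also be written without a Python loop. A sketch with NumPy, using invented probabilities and the hypothetical helper name `binarize_with_fallback`:

```python
import numpy as np

def binarize_with_fallback(probs, threshold):
    """Threshold probabilities; rows with no positive label get their argmax set."""
    preds = (probs > threshold).astype(int)
    empty_rows = np.flatnonzero(preds.sum(axis=1) == 0)
    # For each empty row, switch on the single most confident label
    preds[empty_rows, probs[empty_rows].argmax(axis=1)] = 1
    return preds, len(empty_rows)

probs = np.array([[0.10, 0.50, 0.20],   # one label above the threshold
                  [0.05, 0.10, 0.30],   # none above: falls back to index 2
                  [0.40, 0.45, 0.01]])  # two labels above the threshold
preds, n_fallback = binarize_with_fallback(probs, threshold=0.32)
print(preds, n_fallback)
```

The second return value plays the role of `times_fallback` for one batch.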

In [None]:
model.eval()
all_preds = []
all_filenames = []
times_fallback = 0

num_batches = len(test_dataloader)
with torch.no_grad():
    for batch_number, (X, filenames) in enumerate(test_dataloader):
        X = X.to(device)
        logits = model(X)
        probs = torch.sigmoid(logits).cpu().numpy()
        preds = (probs > classification_threshold).astype(int)  

        # Fallback: Ensure at least one label is set per sample
        for i in range(preds.shape[0]):
            if preds[i].sum() == 0: # if no labels are present for this image
                times_fallback += 1                
                max_idx = np.argmax(probs[i]) # set the highest probability label
                preds[i][max_idx] = 1
        
        all_preds.append(preds)
        all_filenames.extend(filenames)
        
        print(f"Predicted batch {batch_number+1}/{num_batches}.")

all_preds = np.vstack(all_preds)
num_test_images = len(all_filenames)
print(f"Total number of fallbacks (no labels set through thresholding): {times_fallback}/{num_test_images}. Fallback rate: {times_fallback / num_test_images:.2%}")

# Save submission. Good luck!
submission_df = pd.DataFrame(all_preds, columns=[str(i) for i in range(498)])
submission_df.insert(0, "Filename", all_filenames)
submission_df.to_csv(submission_name, index=False)
print(f"Submission saved as {submission_name}.")