# Noisy-Augmented Classifier Performance

This notebook trains a classifier on real data combined with a noise-augmented copy of the same data, for eight training-subset sizes, and evaluates each trained model on five validation sets over ten random seeds. The results (mean ± std of accuracy across seeds) serve as a reference for comparing noise-augmented models against the baseline.


## 1. Imports and Setup

Import the required libraries and dataset helpers, then set up the device and loss function.


In [None]:
import os
import sys
sys.path.insert(0, os.path.abspath('..'))

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, ConcatDataset

from pkldataset import PKLDataset, NoisyPKLDataset
from helpers import set_seed, get_model, train_model, eval_model

# Device and loss criterion
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()


## 2. Configuration

- **Train paths**: directories containing real data pickle files for various subset sizes  
- **Validation paths**: held-out datasets for evaluation  
- **Seeds**: random seeds; every configuration is trained once per seed to estimate run-to-run variability


In [None]:
# Training subsets (real data)
train_paths = [
    "../datasets/RPDC197/train_20",
    "../datasets/RPDC197/train_50",
    "../datasets/RPDC197/train_100",
    "../datasets/RPDC197/train_200",
    "../datasets/RPDC197/train_300",
    "../datasets/RPDC197/train_400",
    "../datasets/RPDC197/train_500",
    "../datasets/RPDC197/train_600",
]

# Validation sets
val_paths = [
    "../datasets/RPDC185/val_1000",
    "../datasets/RPDC188/val_1000",
    "../datasets/RPDC191/val_1000",
    "../datasets/RPDC194/val_1000",
    "../datasets/RPDC197/val_1000",
]

# Random seeds for reproducibility
seeds = [101, 202, 303, 404, 505, 606, 707, 808, 909, 1001]

# Results container: {train_path: {val_path: [accuracies]}}
results = {tp: {vp: [] for vp in val_paths} for tp in train_paths}


## 3. Training with Real + Noisy Data & Evaluation Loop

For each seed:
1. Set the random seed  
2. For each training subset:
   - Load the real dataset (`PKLDataset`) and its noise-injected counterpart (`NoisyPKLDataset`), then merge them with `ConcatDataset`. This doubles the number of training samples. The added variability comes from jitter, with magnitude scaling applied alongside it to increase diversity. For parameter details, refer to the `NoisyPKLDataset` class in `pkldataset.py`.
   - Instantiate the model, the Adam optimizer, and the `StepLR` scheduler  
   - Train for 100 epochs  
   - Evaluate the trained model on every validation set and record accuracy
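
The jitter and magnitude-scaling augmentation described above can be sketched as follows. This is only an illustration and not the code in `pkldataset.py`: the function names, the `(channels, time)` sample shape, and the noise levels `jitter_sigma=0.03` and `scale_sigma=0.1` are all assumptions made for the example. The actual parameters are defined in `NoisyPKLDataset`.

```python
import numpy as np

def jitter(x, sigma, rng):
    """Add zero-mean Gaussian noise with std `sigma` to every value."""
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)

def magnitude_scale(x, sigma, rng):
    """Multiply each channel by a single random factor drawn around 1.0."""
    factors = rng.normal(loc=1.0, scale=sigma, size=(x.shape[0], 1))
    return x * factors

def augment(x, rng, jitter_sigma=0.03, scale_sigma=0.1):
    """Hypothetical augmentation: scale the magnitude, then add jitter."""
    return jitter(magnitude_scale(x, scale_sigma, rng), jitter_sigma, rng)

# Assumed sample layout: (channels, time). The real data layout may differ.
rng = np.random.default_rng(0)
signal = np.ones((2, 8))

real = [signal]                               # stands in for PKLDataset
noisy = [augment(s, rng) for s in real]       # stands in for NoisyPKLDataset
combined = real + noisy                       # same idea as ConcatDataset

print(len(real), len(combined))
```

Concatenating the two lists mirrors what `ConcatDataset` does in the training cell: the combined set holds every original sample once plus one perturbed copy, so its length is twice that of the real set.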


In [None]:
if __name__ == "__main__":
    for seed in seeds:
        print(f"\n=== Seed {seed} ===")
        set_seed(seed)

        for tp in train_paths:
            print(f"-- Training on {tp} (real + noisy)")
            ds_real = PKLDataset(tp)
            ds_noisy = NoisyPKLDataset(tp)
            combined = ConcatDataset([ds_real, ds_noisy])
            train_loader = DataLoader(combined, batch_size=32, shuffle=True)

            model = get_model().to(device)
            optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
            scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

            model = train_model(
                model,
                train_loader,
                criterion,
                optimizer,
                scheduler,
                num_epochs=100,
                device=device
            )

            for vp in val_paths:
                val_loader = DataLoader(PKLDataset(vp), batch_size=64, shuffle=False)
                acc = eval_model(model, val_loader, device)
                results[tp][vp].append(acc)
                print(f"[{tp} -> {vp}] Seed {seed}: Acc = {acc:.2f}%")


## 4. Summary of Results

Compute the mean and the sample standard deviation (`ddof=1`) of accuracy across seeds for each (train → val) pair.
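
As a small worked example of the statistic reported here, the sketch below computes the mean and the sample standard deviation (dividing by n - 1, which is what `ddof=1` selects in NumPy) using only the standard library. The accuracy values are made up for illustration and are not results from this notebook, and `summarize` is a hypothetical helper name.

```python
import statistics

def summarize(accuracies):
    """Return (mean, sample std) of a list of accuracies.

    statistics.stdev divides by n - 1, matching np.std(..., ddof=1).
    With fewer than two values the sample std is undefined, so NaN is returned.
    """
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else float("nan")
    return mean, std

# Hypothetical accuracies (in percent) from three seeds.
example = [80.0, 82.0, 84.0]
mean, std = summarize(example)
print(f"Mean = {mean:.2f}%, Std = {std:.2f}%")  # Mean = 82.00%, Std = 2.00%
```

With ten seeds per (train → val) pair the sample standard deviation is well defined. If the seed list were ever reduced to a single entry, the n - 1 denominator would be zero and the spread could not be estimated.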


In [None]:
print("\n=== Summary across seeds ===")
for tp in train_paths:
    for vp in val_paths:
        arr = np.array(results[tp][vp])
        mean, std = arr.mean(), arr.std(ddof=1)
        print(f"{tp} -> {vp}: Mean = {mean:.2f}%, Std = {std:.2f}%")
