# Generator-Augmented Transfer Learning Performance

This notebook pre-trains a data generator on a source dataset. Then, for each training subset, it:
1. Generates synthetic samples
2. Combines them with real data
3. Trains a classifier and evaluates it on multiple validation sets across multiple seeds

Results are summarized as mean ± std for each (train → val) pair.

## 1. Imports and Setup

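Import the libraries and the project modules for data generation (`gen`) and formatting (`form`), then define the device, the loss function, the dataset paths, the seeds, and the results container.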

In [None]:
import os, sys, random
sys.path.insert(0, os.path.abspath('..'))

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, ConcatDataset

from pkldataset import PKLDataset
import form, gen
from helpers import set_seed, get_model, train_model, eval_model

# Device and loss
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

# Paths for subsets and validation
train_paths = [
    "../datasets/RPDC197/train_20",
    "../datasets/RPDC197/train_50",
    "../datasets/RPDC197/train_100",
    "../datasets/RPDC197/train_200",
    "../datasets/RPDC197/train_300",
    "../datasets/RPDC197/train_400",
    "../datasets/RPDC197/train_500",
    "../datasets/RPDC197/train_600",
]

val_paths = [
    "../datasets/RPDC185/val_1000",
    "../datasets/RPDC188/val_1000",
    "../datasets/RPDC191/val_1000",
    "../datasets/RPDC194/val_1000",
    "../datasets/RPDC197/val_1000",
]

seeds = [101, 202, 303, 404, 505, 606, 707, 808, 909, 1001]

# Results container
results = {tp: {vp: [] for vp in val_paths} for tp in train_paths}


## 2. Generator Pretraining

Train the generative model once on a primary dataset before the transfer experiments.
The call also produces 10 synthetic samples, but they are not used. The only goal of this step is to pretrain the generator and save its weights to `generator_model.pth`.

In [None]:
# Pretrain generator on a source dataset
train_dataset_1 = PKLDataset(r"C:\Users\gus07\Desktop\data hiwi\preprocessing\HC\T197\RP")
train_loader_1 = DataLoader(train_dataset_1, batch_size=64, shuffle=True)

gen.generate(
    train_loader_1,
    num_epochs=150,
    num_samples=10,
    save_new_generator_path="generator_model.pth"
)


## 3. Seeded Generator-Augmented Training & Evaluation

Synthetic data is class-balanced: the generator produces an equal number of samples per class.
Labels are randomly permuted before generation to avoid ordering bias. The generated inputs and
labels are saved to a pickle file (`generated data.pkl`), which `form.py` then splits into
individual samples.
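
As an illustration of the balancing and shuffling described above, the following sketch builds a label vector with an equal count per class and then permutes it. It is not the code in `gen.py`: the function name `balanced_permuted_labels` and its parameters are hypothetical, and NumPy is used only to keep the example short.

```python
import numpy as np

def balanced_permuted_labels(num_classes, samples_per_class, seed=None):
    """Return class indices, each repeated samples_per_class times, in random order."""
    rng = np.random.default_rng(seed)
    # Equal count per class: [0, 0, ..., 1, 1, ..., num_classes - 1, ...]
    labels = np.repeat(np.arange(num_classes), samples_per_class)
    # A random permutation removes the class ordering
    return rng.permutation(labels)

labels = balanced_permuted_labels(num_classes=4, samples_per_class=5, seed=0)
counts = np.bincount(labels, minlength=4)
print(counts)  # every class appears the same number of times
```

Shuffling changes only the order, so the per-class counts stay equal regardless of the seed.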

For each seed and each training subset:
1. Generate synthetic data
2. Format synthetic samples
3. Combine real + synthetic data
4. Train classifier for 50 epochs
5. Evaluate on validation sets
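
Step 5 reports top-1 accuracy. The evaluation loop in the cell below takes `argmax` over the label tensor, which suggests the labels are stored one-hot. A minimal NumPy sketch of the same arithmetic follows; the helper name `top1_accuracy` is hypothetical and the arrays are made up for illustration.

```python
import numpy as np

def top1_accuracy(logits, onehot_labels):
    """Percentage of rows whose highest-scoring class matches the one-hot label."""
    preds = logits.argmax(axis=1)            # predicted class index per sample
    targets = onehot_labels.argmax(axis=1)   # recover class index from one-hot rows
    return 100.0 * (preds == targets).mean()

# Made-up scores for 4 samples and 3 classes
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.9, 0.8, 0.1],
                   [0.1, 0.2, 3.0]])
onehot = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

acc = top1_accuracy(logits, onehot)
print(f"{acc:.2f}%")  # three of the four predictions match
```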

In [None]:
for seed in seeds:
    print(f"\n=== Seed {seed} ===")
    set_seed(seed)
    
    for train_path in train_paths:
        print(f"--- Transfer Learning on {train_path} ---")
        # Load real data
        ds_real = PKLDataset(train_path)
        loader_real = DataLoader(ds_real, batch_size=64, shuffle=True)

        # Generate synthetic data from the pretrained generator under the current seed
        gen.generate(
            loader_real,
            num_epochs=150,
            num_samples=20,
            pretrained_generator_path="generator_model.pth"
        )
        form.format()

        # Build combined dataset
        synth_ds = PKLDataset("synth_data/individual_samples")
        combined = ConcatDataset([ds_real, synth_ds])
        loader_comb = DataLoader(combined, batch_size=32, shuffle=True)

        # Train classifier
        model = get_model().to(device)
        optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
        scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

        model = train_model(
            model,
            loader_comb,
            criterion,
            optimizer,
            scheduler,
            num_epochs=50,
            device=device
        )

        # Evaluate
        model.eval()
        with torch.no_grad():
            for vp in val_paths:
                val_loader = DataLoader(PKLDataset(vp), batch_size=64, shuffle=False)
                correct = total = 0
                for X, Y in val_loader:
                    X, Y = X.to(device), Y.to(device)
                    y_idx = Y.argmax(dim=1)
                    preds = model(X).argmax(dim=1)
                    correct += (preds == y_idx).sum().item()
                    total += Y.size(0)
                acc = 100. * correct / total
                results[train_path][vp].append(acc)
                print(f"[{train_path} → {vp}] Seed {seed}: {acc:.2f}%")
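
One detail of the cell above is worth checking. `StepLR` multiplies the learning rate by `gamma` once every `step_size` scheduler steps. Assuming `train_model` calls `scheduler.step()` once per epoch (its implementation in `helpers` is not shown here), `step_size=50` combined with `num_epochs=50` means every epoch trains at the initial rate and the decay never takes effect during training. The closed form can be sketched in plain Python; `step_lr` is a hypothetical helper, not part of PyTorch.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate in effect during `epoch` (0-indexed), given one scheduler step per epoch."""
    return base_lr * gamma ** (epoch // step_size)

# Rates used during the 50 training epochs with the settings from the cell above
lrs = [step_lr(1e-3, 0.1, 50, e) for e in range(50)]
print(min(lrs), max(lrs))  # identical: no decay happens within 50 epochs

# The first decayed rate would only apply to a 51st epoch
lr_after = step_lr(1e-3, 0.1, 50, 50)
```

If a decay during training is intended, a smaller `step_size` would be needed; if not, the scheduler is effectively a no-op here.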


## 4. Summary of Results

Compute the mean and sample standard deviation (`ddof=1`) of accuracy across seeds for each (train → validation) pair.

In [None]:
print("\n=== Summary across seeds ===")
for tp in train_paths:
    for vp in val_paths:
        arr = np.array(results[tp][vp])
        mean, std = arr.mean(), arr.std(ddof=1)
        print(f"{tp} -> {vp}: Mean = {mean:.2f}%, Std = {std:.2f}%")
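
The `ddof=1` argument in the cell above selects the sample standard deviation, which divides by n − 1 rather than n. That is the usual choice when the seeds are treated as a sample of possible runs. The sketch below contrasts it with NumPy's default `ddof=0`; the accuracies are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Made-up accuracies (in %) standing in for one (train -> val) pair
accs = np.array([70.0, 72.0, 74.0, 76.0, 78.0])

mean = accs.mean()
pop_std = accs.std()           # ddof=0: divide by n
sample_std = accs.std(ddof=1)  # ddof=1: divide by n - 1

print(f"Mean = {mean:.2f}%, Std (ddof=0) = {pop_std:.2f}%, Std (ddof=1) = {sample_std:.2f}%")
```

With few seeds the two estimates differ noticeably, so it helps to state which one a reported ± value uses.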
