# Simple Transfer Learning Pipeline

This notebook pre-trains a model on a primary dataset split, then fine-tunes on multiple transfer sets, evaluating on several validation sets across multiple seeds. Results are summarized as mean ± std for each transfer → validation pair.

## 1. Imports and Setup

Load necessary libraries, dataset modules, and helper functions. Configure device and loss.

In [None]:
import os, sys, random
sys.path.insert(0, os.path.abspath('..'))

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

from pkldataset import PKLDataset
from helpers import set_seed, get_model, train_model, eval_model

# Device and loss
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()
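
The `set_seed` helper comes from `helpers`, whose source is not shown in this notebook. As a rough sketch of what such a function commonly does, assuming it seeds Python, NumPy, and PyTorch (the name `set_seed_sketch` is hypothetical and is not the notebook's actual helper):

```python
import random

import numpy as np
import torch


def set_seed_sketch(seed: int) -> None:
    """Seed the common RNGs (hypothetical stand-in for helpers.set_seed)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Safe to call without a GPU; it is silently ignored when CUDA is unavailable
    torch.cuda.manual_seed_all(seed)


# Re-seeding with the same value should reproduce the same draws
set_seed_sketch(101)
first = (random.random(), float(np.random.rand()), torch.rand(1).item())
set_seed_sketch(101)
second = (random.random(), float(np.random.rand()), torch.rand(1).item())
print(first == second)
```

Seeding alone may not make GPU training fully deterministic; whether the real helper also configures cuDNN flags is not known from this notebook.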


## 2. Configuration

- **Primary pretraining path**: path to dataset folder to split  
- **Transfer sets**: list of datasets for fine-tuning  
- **Validation sets**: list of held-out datasets for evaluation  
- **Seeds**: for reproducibility

In [None]:
# Primary dataset path (will be split)
train_path_1 = r"C:\Users\gus07\Desktop\data hiwi\preprocessing\HC\T197\RP"

# Transfer learning datasets
transfer_sets = [
    "../datasets/RPDC197/train_20",
    "../datasets/RPDC197/train_50",
    "../datasets/RPDC197/train_100",
    "../datasets/RPDC197/train_200",
    "../datasets/RPDC197/train_300",
    "../datasets/RPDC197/train_400",
    "../datasets/RPDC197/train_500",
    "../datasets/RPDC197/train_600",
]

# Validation sets
val_paths = [
    "../datasets/RPDC185/val_1000",
    "../datasets/RPDC188/val_1000",
    "../datasets/RPDC191/val_1000",
    "../datasets/RPDC194/val_1000",
    "../datasets/RPDC197/val_1000",
]

# Random seeds
seeds = [101, 202, 303, 404, 505, 606, 707, 808, 909, 1001]

# Prepare results container
results = {t: {vp: [] for vp in val_paths} for t in transfer_sets}
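
Because the full run covers 10 seeds and 8 transfer sets, a missing folder is cheaper to catch before training starts. A minimal sketch of such a check, assuming each configured entry is a path on disk (`missing_paths` is a hypothetical helper, not part of `helpers`):

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk (hypothetical helper)."""
    return [p for p in paths if not Path(p).exists()]


# Illustrative call; in the notebook this could be
# missing_paths([train_path_1, *transfer_sets, *val_paths])
example = [".", "this/path/should/not/exist_12345"]
print(missing_paths(example))
```

An empty list means every configured path was found.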


## 3. Phase 1: Pretraining on Source Dataset (HC)

For each seed:
1. Split the primary dataset into train/val (only the training split is used below)
2. Train for 10 epochs
3. Keep the pretrained weights in memory as a `state_dict`

## 4. Phase 2: Fine-Tuning on Transfer Sets

For each seed and each transfer set:
1. Load pretrained weights
2. Train for 100 epochs
3. Evaluate on validation sets

In [None]:
for seed in seeds:
    print(f"\n>>> Seed {seed} - Pretraining")
    set_seed(seed)

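    # Split the primary dataset; only the training split is used for pretraining
    train_ds, val_ds = PKLDataset.split_dataset(train_path_1)
    train_loader1 = DataLoader(train_ds, batch_size=64, shuffle=True)

    model = get_model().to(device)
    opt = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    # Note: with step_size=50 and only 10 epochs, this scheduler never decays the LR
    # (assuming train_model steps the scheduler once per epoch)
    sch = optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)

    model = train_model(model, train_loader1, criterion, opt, sch, num_epochs=10, device=device)
    # Keep the pretrained weights in memory (nothing is written to disk)
    pretrained_state = model.state_dict()

    # Transfer learning phase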
    for t in transfer_sets:
        print(f"--- Transfer on {t}")
        tl_model = get_model().to(device)
        tl_model.load_state_dict(pretrained_state)

        loader_t = DataLoader(PKLDataset(t), batch_size=64, shuffle=True)
        opt2 = optim.Adam(tl_model.parameters(), lr=1e-3, weight_decay=1e-5)
        sch2 = optim.lr_scheduler.StepLR(opt2, step_size=25, gamma=0.1)

        tl_model = train_model(tl_model, loader_t, criterion, opt2, sch2, num_epochs=100, device=device)

        # Evaluate on validation sets
        for vp in val_paths:
            acc = eval_model(tl_model, DataLoader(PKLDataset(vp), batch_size=64, shuffle=False), device)
            results[t][vp].append(acc)
            print(f"[Seed {seed}] {t} → {vp}: {acc:.2f}%")
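
The two phases use different `StepLR` settings, and the effect on the learning rate is easy to work out by hand. The sketch below reproduces the schedule arithmetically, assuming `train_model` steps the scheduler once per epoch (its source is not shown here); `step_lr` is a hypothetical helper that mirrors the closed form `base_lr * gamma ** (epoch // step_size)`.

```python
def step_lr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Learning rate in effect during `epoch` (0-based) under a StepLR-style schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Pretraining: 10 epochs with step_size=50, so the rate never decays
pretrain_lrs = [step_lr(1e-3, 0.1, 50, e) for e in range(10)]

# Fine-tuning: 100 epochs with step_size=25, so the rate decays three times
finetune_lrs = [step_lr(1e-3, 0.1, 25, e) for e in (0, 24, 25, 50, 75, 99)]

print(pretrain_lrs)
print(finetune_lrs)
```

Under that assumption, pretraining runs at a constant 1e-3, while fine-tuning drops by a factor of 10 at epochs 25, 50, and 75 and finishes near 1e-6.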


## 5. Summary of Results

Compute the mean and sample standard deviation (`ddof=1`) of accuracy across seeds for each transfer → validation pair.

In [None]:
print("\n=== Mean ± Std Dev over seeds ===")
for t in transfer_sets:
    for vp in val_paths:
        arr = np.array(results[t][vp])
        mean_acc = arr.mean()
        std_acc = arr.std(ddof=1)
        print(f"{t} → {vp}: mean = {mean_acc:.2f}%,  std = {std_acc:.2f}%")
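
The summary uses `ddof=1`, which gives the sample standard deviation (dividing by n - 1) rather than NumPy's default population form (dividing by n). With only 10 seeds the difference is noticeable. A small standard-library sketch of the distinction, using made-up accuracies that are purely illustrative and not results from this notebook:

```python
import statistics

# Illustrative accuracies only, not results from this notebook
accs = [90.0, 92.0, 94.0]

mean_acc = statistics.mean(accs)
sample_std = statistics.stdev(accs)       # divides by n - 1, like ddof=1
population_std = statistics.pstdev(accs)  # divides by n, like NumPy's default ddof=0

print(f"mean = {mean_acc:.2f}%, sample std = {sample_std:.2f}%, "
      f"population std = {population_std:.2f}%")
```

The sample form is the usual choice when the seeds are treated as a draw from a larger set of possible runs.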
