<a href="https://colab.research.google.com/github/marcvonrohr/DeepLearning/blob/main/meta_learning_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os
import time
import json
import random
from google.colab import drive

#################################################################
#  STEP 2.1: PREPARE LOCAL VM
#################################################################

# --- 1. Mount Google Drive ---
print("Connecting Google Drive...")
drive.mount('/content/drive')
print("...Google Drive connected.")

# --- 2. Define Key Paths ---
GDRIVE_ROOT = '/content/drive/MyDrive/'
PROJECT_DIR = os.path.join(GDRIVE_ROOT, 'Deep Learning')
DATASETS_ROOT_DIR = os.path.join(PROJECT_DIR, 'datasets')
INAT_ROOT_DIR = os.path.join(DATASETS_ROOT_DIR, 'inaturalist')

# Source: The COMPRESSED archives
ARCHIVES_DIR_ON_DRIVE = os.path.join(INAT_ROOT_DIR, 'archives')

# Target: The LOCAL VM fast disk
LOCAL_DATA_ROOT = '/content/data'
# This is the final path your PyTorch code will use:
FINAL_DATA_PATH = os.path.join(LOCAL_DATA_ROOT, 'inaturalist_unpacked')

# Define source/destination paths
TAR_FILES = {
    "2021_train_mini": {
        "src": os.path.join(ARCHIVES_DIR_ON_DRIVE, '2021_train_mini.tar.gz'),
        "dest_tar": os.path.join(LOCAL_DATA_ROOT, '2021_train_mini.tar.gz'),
        # The archive unpacks to 'train_mini' (see the directory listing in the output),
        # so the skip check must look for that folder name.
        "check_unpacked": os.path.join(FINAL_DATA_PATH, 'train_mini')
    },
    "2021_valid": {
        "src": os.path.join(ARCHIVES_DIR_ON_DRIVE, '2021_valid.tar.gz'),
        "dest_tar": os.path.join(LOCAL_DATA_ROOT, '2021_valid.tar.gz'),
        # The archive unpacks to 'val', not '2021_valid'.
        "check_unpacked": os.path.join(FINAL_DATA_PATH, 'val')
    }
}

# --- 3. Create Local Directories on VM ---
os.makedirs(LOCAL_DATA_ROOT, exist_ok=True)
os.makedirs(FINAL_DATA_PATH, exist_ok=True)
print(f"Local data directory created at: {FINAL_DATA_PATH}")

# --- 4. Copy, Unpack, and Clean up for each file ---
for name, paths in TAR_FILES.items():
    print(f"\n--- Processing {name} ---")

    if os.path.exists(paths["check_unpacked"]):
        print(f"'{name}' is already unpacked in local VM. Skipping.")
        continue

    # 4a. Copy .tar.gz from Drive to local VM
    print(f"Copying '{name}.tar.gz' from Drive to local VM...")
    start_time = time.time()
    !cp "{paths['src']}" "{paths['dest_tar']}"
    print(f"...Copy complete. Took {time.time() - start_time:.2f} seconds.")

    # 4b. Unpack the file on the local VM
    print(f"Unpacking '{name}.tar.gz' locally...")
    start_time = time.time()
    !tar -xzf "{paths['dest_tar']}" -C "{FINAL_DATA_PATH}"
    print(f"...Unpacking complete. Took {time.time() - start_time:.2f} seconds.")

    # 4c. Delete the local .tar.gz file to save VM space
    print(f"Deleting local tarball '{paths['dest_tar']}'...")
    !rm "{paths['dest_tar']}"
    print("...Local tarball deleted.")

# --- 5. Verify and Set Path for Training ---
print("\n--- Final Data Setup Verification ---")
print(f"Dataset is ready for training at: {FINAL_DATA_PATH}")
!ls -lh "{FINAL_DATA_PATH}"
print("\nLocal VM Disk Space Usage:")
!df -h

Connecting Google Drive...
Mounted at /content/drive
...Google Drive connected.
Local data directory created at: /content/data/inaturalist_unpacked

--- Processing 2021_train_mini ---
Copying '2021_train_mini.tar.gz' from Drive to local VM...
...Copy complete. Took 1155.01 seconds.
Unpacking '2021_train_mini.tar.gz' locally...
...Unpacking complete. Took 436.23 seconds.
Deleting local tarball '/content/data/2021_train_mini.tar.gz'...
...Local tarball deleted.

--- Processing 2021_valid ---
Copying '2021_valid.tar.gz' from Drive to local VM...
...Copy complete. Took 242.87 seconds.
Unpacking '2021_valid.tar.gz' locally...
...Unpacking complete. Took 78.99 seconds.
Deleting local tarball '/content/data/2021_valid.tar.gz'...
...Local tarball deleted.

--- Final Data Setup Verification ---
Dataset is ready for training at: /content/data/inaturalist_unpacked
total 2.5M
drwxrwxr-x 10002 1000 1000 1.3M Oct 13  2020 train_mini
drwxrwxr-x 10002 1000 1000 1.3M Oct 13  2020 val

Local VM Disk Spa
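
The copy, unpack, and delete loop in the cell above relies on IPython shell magics (`!cp`, `!tar`, `!rm`), which only run inside a notebook. As a minimal standard-library sketch of the same idea, the helper below stages one archive and skips the work if the unpacked folder already exists. `stage_archive` and its parameters are hypothetical names, not part of this notebook. The expected top-level folder is passed explicitly because the listing above shows that it differs from the archive name (`train_mini` comes out of `2021_train_mini.tar.gz`).

```python
import os
import shutil
import tarfile


def stage_archive(src_tar, work_dir, extract_to, expected_dir):
    """Copy a .tar.gz to fast local disk, unpack it, and delete the local copy.

    Hypothetical helper: returns True if it unpacked, False if it skipped.
    `expected_dir` is the top-level folder the archive is assumed to contain.
    """
    target = os.path.join(extract_to, expected_dir)
    if os.path.isdir(target):
        return False  # already unpacked, nothing to do

    os.makedirs(work_dir, exist_ok=True)
    os.makedirs(extract_to, exist_ok=True)
    local_tar = os.path.join(work_dir, os.path.basename(src_tar))

    shutil.copy(src_tar, local_tar)        # slow storage -> local disk
    try:
        with tarfile.open(local_tar, "r:gz") as tar:
            tar.extractall(extract_to)     # only use on archives you trust
    finally:
        os.remove(local_tar)               # free local disk space either way

    if not os.path.isdir(target):
        raise FileNotFoundError(f"Archive did not contain '{expected_dir}'")
    return True
```

Unlike the shell version, a failed copy or extraction raises an exception here instead of letting the loop continue silently.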

In [2]:
#################################################################
#  STEP 2.2: SCIENTIFIC DATA PARTITIONING
#################################################################
print("\n--- STEP 2.2: Loading/Creating Scientific Class Partition ---")

# --- 6. Define Paths for Partition File ---
# We create a 'project_meta' folder on GDrive to store helper files
META_DIR_ON_DRIVE = os.path.join(PROJECT_DIR, 'project_meta')
os.makedirs(META_DIR_ON_DRIVE, exist_ok=True)

PARTITION_FILE_PATH = os.path.join(META_DIR_ON_DRIVE, 'inat_class_split.json')
print(f"Looking for partition file at: {PARTITION_FILE_PATH}")


--- STEP 2.2: Loading/Creating Scientific Class Partition ---
Looking for partition file at: /content/drive/MyDrive/Deep Learning/project_meta/inat_class_split.json


In [3]:
# --- 7. Logic to Find Classes and Create Partition ---

# 7a. Identify the Dataset Root
# The unpacking might have created a subfolder (e.g., '2021_train_mini' or 'train_mini')
# or files might be directly in FINAL_DATA_PATH. We check common patterns.
possible_roots = [
    os.path.join(FINAL_DATA_PATH, '2021_train_mini'),
    os.path.join(FINAL_DATA_PATH, 'train_mini'),
    FINAL_DATA_PATH
]

DATASET_ROOT = None
for path in possible_roots:
    if os.path.exists(path):
        # Check if this path actually contains subdirectories
        if len([d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]) > 0:
            DATASET_ROOT = path
            break

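if DATASET_ROOT is None:
    # Fail early: os.walk(None) further down would raise a far less helpful error.
    raise FileNotFoundError(f"No class folders found under any of: {possible_roots}")

print(f"Dataset root identified as: {DATASET_ROOT}")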

# 7b. Load or Create the Partition
partition_data = {}
RANDOM_SEED = 42

if os.path.exists(PARTITION_FILE_PATH):
    print("Found existing partition file. Loading...")
    with open(PARTITION_FILE_PATH, 'r') as f:
        partition_data = json.load(f)
else:
    print("No partition file found. Scanning directories to create new partition...")
    print("This ensures independence from missing metadata files.")

    # --- Scan for Class Folders ---
    class_folders_rel = []

    # Walk through the directory tree
    # A "class" is any folder that contains image files (.jpg, .jpeg, .png)
    print("Scanning folders (this may take 1-2 minutes)...")
    for root, dirs, files in os.walk(DATASET_ROOT):
        # Check for images in this specific folder
        images = [f for f in files if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

        if len(images) > 0:
            # Get path relative to the dataset root (e.g., "Aves/Turdus_migratorius")
            rel_path = os.path.relpath(root, DATASET_ROOT)
            class_folders_rel.append(rel_path)

    # --- CRITICAL: Sort for Reproducibility ---
    # Sorting ensures that Index 0 is ALWAYS the same class on every machine/run
    class_folders_rel.sort()

    num_classes = len(class_folders_rel)
    print(f"Found {num_classes} classes containing images.")

    if num_classes < 9900:
        print("WARNING: Found significantly fewer than 10,000 classes. Check extraction.")

    # --- Assign IDs and Shuffle ---
    all_class_ids = list(range(num_classes))

    print(f"Shuffling {num_classes} class IDs with random seed {RANDOM_SEED}...")
    random.seed(RANDOM_SEED)
    random.shuffle(all_class_ids)

    # --- Split into Sets ---
    # 6000 Base (Train/Meta-Train), 2000 Val (Hyperparams), 2000 Novel (Test)
    c_base_ids = all_class_ids[:6000]
    c_val_ids = all_class_ids[6000:8000]
    c_novel_ids = all_class_ids[8000:]

    # --- Construct Data Structure ---
    # We save both the sets AND the mapping from ID -> Folder Path
    partition_data = {
        "sets": {
            'c_base': sorted(c_base_ids),
            'c_val': sorted(c_val_ids),
            'c_novel': sorted(c_novel_ids)
        },
        "id_to_path": {
            str(i): folder_path for i, folder_path in enumerate(class_folders_rel)
        }
    }

    # --- Save to Drive ---
    print(f"Saving new partition and mapping to: {PARTITION_FILE_PATH}")
    with open(PARTITION_FILE_PATH, 'w') as f:
        json.dump(partition_data, f, indent=4)

Dataset root identified as: /content/data/inaturalist_unpacked/train_mini
Found existing partition file. Loading...


In [4]:
# --- 8. Verification ---
print("\n--- Partitioning Complete ---")
sets = partition_data['sets']
print(f"Total C_base classes:  {len(sets['c_base'])}")
print(f"Total C_val classes:   {len(sets['c_val'])}")
print(f"Total C_novel classes: {len(sets['c_novel'])}")

# Check for overlaps (should be 0)
base_set = set(sets['c_base'])
val_set = set(sets['c_val'])
novel_set = set(sets['c_novel'])

overlap_bv = base_set & val_set
overlap_bn = base_set & novel_set
overlap_vn = val_set & novel_set

print(f"Overlap (Base-Val):    {len(overlap_bv)}")
print(f"Overlap (Base-Novel):  {len(overlap_bn)}")
print(f"Overlap (Val-Novel):   {len(overlap_vn)}")

if len(overlap_bv) + len(overlap_bn) + len(overlap_vn) == 0:
    print("\nSUCCESS: Classes are cleanly partitioned.")
else:
    print("\nCRITICAL ERROR: Overlaps detected in class sets!")


--- Partitioning Complete ---
Total C_base classes:  6000
Total C_val classes:   2000
Total C_novel classes: 2000
Overlap (Base-Val):    0
Overlap (Base-Novel):  0
Overlap (Val-Novel):   0

SUCCESS: Classes are cleanly partitioned.
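
The partition logic above (sort the class folders, shuffle the IDs with a fixed seed, cut the list into three slices) can be isolated into a small function. The sketch below is illustrative only: `split_class_ids` is a hypothetical name, and it uses a private `random.Random` instance instead of seeding the global RNG as the notebook does, so that other code drawing random numbers is unaffected.

```python
import random


def split_class_ids(num_classes, n_base, n_val, seed=42):
    """Shuffle class IDs with a private RNG and cut them into three disjoint sets."""
    if n_base + n_val > num_classes:
        raise ValueError("n_base + n_val exceeds the number of classes")
    ids = list(range(num_classes))
    rng = random.Random(seed)   # private RNG: global random state is untouched
    rng.shuffle(ids)
    return {
        "c_base": sorted(ids[:n_base]),
        "c_val": sorted(ids[n_base:n_base + n_val]),
        "c_novel": sorted(ids[n_base + n_val:]),   # everything that is left
    }


demo = split_class_ids(10, n_base=6, n_val=2, seed=42)
# Same seed, same split: the partition is reproducible.
assert demo == split_class_ids(10, n_base=6, n_val=2, seed=42)
```

Because the slices come from one shuffled list, the three sets are disjoint by construction; the overlap check in the verification cell then acts as a guard against a corrupted or hand-edited partition file.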


In [5]:
#################################################################
#  STEP 2.3: MODULAR DATA LOADERS (NO LEARN2LEARN DEPENDENCY)
#################################################################
import torch
import numpy as np
from PIL import Image
from torch.utils.data import Dataset, DataLoader, Subset
from torchvision import transforms

print("\n--- STEP 2.3: Initialize Custom Data Loaders (Native PyTorch) ---")

# --- SAFETY CHECK ---
# Ensure variables from Step 2.2 exist
required_vars = ['DATASET_ROOT', 'PARTITION_FILE_PATH']
if not all(v in globals() for v in required_vars):
    raise NameError("Missing variables from Step 2.2. Please run the previous cells first.")

print(f"Using Dataset Root: {DATASET_ROOT}")
print(f"Using Partition File: {PARTITION_FILE_PATH}")

# --- CONSTANTS ---
NORMALIZE_MEAN = [0.485, 0.456, 0.406]
NORMALIZE_STD = [0.229, 0.224, 0.225]


--- STEP 2.3: Initialize Custom Data Loaders (Native PyTorch) ---
Using Dataset Root: /content/data/inaturalist_unpacked/train_mini
Using Partition File: /content/drive/MyDrive/Deep Learning/project_meta/inat_class_split.json


In [6]:
# ==============================================================================
#  CORE COMPONENT: The Custom Dataset Class
# ==============================================================================
class MetaINatDataset(Dataset):
    """
    A custom PyTorch Dataset that enforces the scientific partition.
    """
    def __init__(self, root_dir, partition_file, split='c_base', transform=None):
        self.root_dir = root_dir
        self.transform = transform
        self.split = split

        with open(partition_file, 'r') as f:
            data = json.load(f)

        if split not in data['sets']:
            raise ValueError(f"Invalid split '{split}'. Available: {list(data['sets'].keys())}")

        self.allowed_ids = data['sets'][split]
        self.id_to_path = data['id_to_path']

        # Map original ID -> 0..N-1
        self.label_map = {orig: new for new, orig in enumerate(self.allowed_ids)}

        self.samples = []
        for original_id in self.allowed_ids:
            rel_path = self.id_to_path[str(original_id)]
            abs_path = os.path.join(self.root_dir, rel_path)
            if os.path.exists(abs_path):
                # os.listdir order is arbitrary; sort so sample indices are reproducible
                for img in sorted(os.listdir(abs_path)):
                    if img.lower().endswith(('.jpg', '.jpeg', '.png')):
                        self.samples.append({
                            'path': os.path.join(abs_path, img),
                            'label': self.label_map[original_id]
                        })

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        image = Image.open(sample['path']).convert('RGB')
        label = sample['label']
        if self.transform:
            image = self.transform(image)
        return image, label
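
`MetaINatDataset` remaps the original class IDs of a split to contiguous labels `0..N-1`, which is the label range an N-way classification head expects. The toy example below uses made-up IDs (not taken from the real partition file) to show the mapping and how to invert it when a prediction has to be traced back to its class folder.

```python
# Toy illustration with made-up class IDs (not from the real partition file).
allowed_ids = [3, 17, 42, 99]

# Same construction as in MetaINatDataset.__init__: list position becomes the label.
label_map = {orig: new for new, orig in enumerate(allowed_ids)}

# Inverse mapping: contiguous label -> original class ID.
inverse_map = {new: orig for orig, new in label_map.items()}

print(label_map)       # {3: 0, 17: 1, 42: 2, 99: 3}
print(inverse_map[2])  # 42
```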

In [7]:
# ==============================================================================
#  HELPER: Episodic Batch Generator (Replaces learn2learn)
# ==============================================================================
class EpisodicTaskGenerator:
    """
    Native PyTorch implementation of an N-Way K-Shot task sampler.
    Replaces learn2learn functionality without installation issues.
    """
    def __init__(self, dataset, ways, shots, query_shots):
        self.dataset = dataset
        self.ways = ways
        self.shots = shots
        self.query_shots = query_shots

        # Group all image indices by their label for fast sampling
        self.indices_by_label = {}
        for idx, sample in enumerate(dataset.samples):
            lbl = sample['label']
            if lbl not in self.indices_by_label:
                self.indices_by_label[lbl] = []
            self.indices_by_label[lbl].append(idx)

        self.classes = list(self.indices_by_label.keys())

    def __iter__(self):
        return self

    def __next__(self):
        # 1. Sample N random classes (Ways)
        selected_classes = random.sample(self.classes, self.ways)

        batch_images = []
        batch_labels = []

        # 2. Sample K + Q images from each class
        for local_label, global_label_idx in enumerate(selected_classes):
            indices = self.indices_by_label[global_label_idx]

            # Ensure we have enough images, otherwise sample with replacement
            needed = self.shots + self.query_shots
            if len(indices) >= needed:
                selected_indices = random.sample(indices, needed)
            else:
                selected_indices = random.choices(indices, k=needed)

            # 3. Load images and re-label them to 0..N-1 for the episode
            for idx in selected_indices:
                img, _ = self.dataset[idx] # dataset returns (img, global_label)
                batch_images.append(img)
                # Important: The label for the loss function must be 0..Ways-1
                batch_labels.append(local_label)

        # Stack into a single tensor: [Ways * (Shots+Query), C, H, W]
        data = torch.stack(batch_images)
        labels = torch.tensor(batch_labels)

        return data, labels

    def sample(self):
        # Compatibility method to look like learn2learn
        return self.__next__()
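
`EpisodicTaskGenerator` returns one stacked batch in which each class occupies a contiguous block of `shots + query_shots` samples, but it does not say which samples are support and which are query. A minimal sketch of one possible convention follows; `split_episode_indices` is a hypothetical helper, and the choice that the first `shots` samples of each block form the support set is an assumption, not something the generator enforces.

```python
def split_episode_indices(ways, shots, query_shots):
    """Return (support_idx, query_idx) for a class-contiguous episode batch.

    Assumption: within each class block, the first `shots` samples are used as
    support and the remaining `query_shots` samples as query.
    """
    per_class = shots + query_shots
    support_idx, query_idx = [], []
    for c in range(ways):
        start = c * per_class
        support_idx.extend(range(start, start + shots))
        query_idx.extend(range(start + shots, start + per_class))
    return support_idx, query_idx


support_idx, query_idx = split_episode_indices(ways=5, shots=1, query_shots=1)
print(support_idx)  # [0, 2, 4, 6, 8]
print(query_idx)    # [1, 3, 5, 7, 9]
```

With the tensors returned by `sample()`, `data[support_idx]` and `labels[support_idx]` would then select the support set. Note that when a class has fewer than `shots + query_shots` images, the generator samples with replacement, so support and query can contain the same image.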

In [8]:
# ==============================================================================
#  LOADER A: Standard Pre-Training Loader
# ==============================================================================
def get_standard_loader(split='c_base', batch_size=64, shuffle=True):
    print(f"\n[Loader A] Initializing Standard Loader for split '{split}'...")

    train_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(NORMALIZE_MEAN, NORMALIZE_STD)
    ])

    dataset = MetaINatDataset(DATASET_ROOT, PARTITION_FILE_PATH, split=split, transform=train_transforms)

    loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, num_workers=2, pin_memory=True)

    print(f" -> {len(dataset)} total images.")
    print(f" -> {len(dataset.allowed_ids)} classes.")
    return loader, len(dataset.allowed_ids)

In [9]:
# ==============================================================================
#  LOADER B: Episodic Task Loader (MAML) - NATIVE IMPLEMENTATION
# ==============================================================================
def get_episodic_taskset(split='c_base', ways=5, shots=1, query_shots=1, img_size=84):
    print(f"\n[Loader B] Initializing Episodic Generator for split '{split}'...")

    maml_transforms = transforms.Compose([
        transforms.Resize((img_size, img_size)),
        transforms.ToTensor(),
        transforms.Normalize(NORMALIZE_MEAN, NORMALIZE_STD)
    ])

    dataset = MetaINatDataset(DATASET_ROOT, PARTITION_FILE_PATH, split=split, transform=maml_transforms)

    # Use our native generator instead of learn2learn
    task_generator = EpisodicTaskGenerator(
        dataset,
        ways=ways,
        shots=shots,
        query_shots=query_shots
    )

    print(f" -> Configured {ways}-Way {shots}-Shot Tasks (Native PyTorch).")
    return task_generator

In [10]:
# ==============================================================================
#  LOADER C: Fixed Few-Shot Loader for FT/LoRA
# ==============================================================================
def get_fixed_few_shot_task(split='c_novel', ways=5, shots=1, query_shots=15, seed=None):
    print(f"\n[Loader C] Creating Fixed Few-Shot Task from '{split}'...")

    if seed is not None:  # 'if seed:' would silently ignore seed=0
        random.seed(seed)
        torch.manual_seed(seed)

    eval_transforms = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(NORMALIZE_MEAN, NORMALIZE_STD)
    ])

    dataset = MetaINatDataset(DATASET_ROOT, PARTITION_FILE_PATH, split=split, transform=eval_transforms)

    # Sorted so that the seeded random.sample below always sees the same ordering
    available_labels = sorted(set(s['label'] for s in dataset.samples))
    selected_classes = random.sample(available_labels, ways)

    class_indices = {c: [] for c in selected_classes}
    for idx, sample in enumerate(dataset.samples):
        if sample['label'] in selected_classes:
            class_indices[sample['label']].append(idx)

    support_indices = []
    query_indices = []

    for c in selected_classes:
        idxs = class_indices[c]
        random.shuffle(idxs)
        support_indices.extend(idxs[:shots])
        query_indices.extend(idxs[shots : shots+query_shots])

    support_loader = DataLoader(Subset(dataset, support_indices), batch_size=16, shuffle=True)
    query_loader = DataLoader(Subset(dataset, query_indices), batch_size=32, shuffle=False)

    print(f" -> Support Set: {len(support_indices)} images, Query Set: {len(query_indices)} images")
    return support_loader, query_loader
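
Loader C seeds the global `random` module to make the task reproducible, which also changes the random stream for everything that runs afterwards. As an alternative sketch (not the notebook's implementation), the selection step can use a private RNG and fail loudly when a class is too small to supply disjoint support and query sets. `pick_fixed_task` and `indices_by_class` are hypothetical names.

```python
import random


def pick_fixed_task(indices_by_class, ways, shots, query_shots, seed=None):
    """Choose classes and disjoint support/query indices with a private RNG.

    `indices_by_class` maps a class label to the list of sample indices of
    that class (hypothetical input, similar to `class_indices` in Loader C).
    """
    rng = random.Random(seed)  # seed=0 is honoured; seed=None is non-deterministic
    classes = rng.sample(sorted(indices_by_class), ways)
    support, query = [], []
    for c in classes:
        idxs = list(indices_by_class[c])
        if len(idxs) < shots + query_shots:
            raise ValueError(f"class {c} has only {len(idxs)} samples")
        rng.shuffle(idxs)
        support.extend(idxs[:shots])
        query.extend(idxs[shots:shots + query_shots])
    return support, query
```

The returned index lists could be passed to `Subset(dataset, ...)` in the same way Loader C does.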

In [11]:
# ==============================================================================
#  VERIFICATION
# ==============================================================================
print("\n--- Testing Loaders ---")

# Test A
try:
    l_std, n_cls = get_standard_loader(split='c_base', batch_size=4)
    print("Loader A (Standard) check: OK.")
except Exception as e:
    print(f"Loader A Failed: {e}")

# Test B (Now using Native Generator)
try:
    task_gen = get_episodic_taskset(split='c_base', ways=5, shots=1, query_shots=1)
    batch_data, batch_labels = task_gen.sample()
    # Expected shape: [Way*(Shot+Query), 3, 84, 84] -> [5*(1+1), 3, 84, 84] = [10, 3, 84, 84]
    print(f"Loader B (Episodic) check: OK. Batch shape: {batch_data.shape}")
    if batch_labels.max() >= 5:
        print("WARNING: Labels not properly remapped to 0..N-1")
except Exception as e:
    print(f"Loader B Failed: {e}")

# Test C
try:
    sup_dl, q_dl = get_fixed_few_shot_task(split='c_novel', ways=5, shots=5)
    print("Loader C (Fixed) check: OK.")
except Exception as e:
    print(f"Loader C Failed: {e}")

print("\nStep 2.3 Complete (Dependencies Fixed).")


--- Testing Loaders ---

[Loader A] Initializing Standard Loader for split 'c_base'...
 -> 300000 total images.
 -> 6000 classes.
Loader A (Standard) check: OK.

[Loader B] Initializing Episodic Generator for split 'c_base'...
 -> Configured 5-Way 1-Shot Tasks (Native PyTorch).
Loader B (Episodic) check: OK. Batch shape: torch.Size([10, 3, 84, 84])

[Loader C] Creating Fixed Few-Shot Task from 'c_novel'...
 -> Support Set: 25 images, Query Set: 75 images
Loader C (Fixed) check: OK.

Step 2.3 Complete (Dependencies Fixed).
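
The pre-training cell that follows (`train_model`) combines two pieces of bookkeeping inside its epoch loop: tracking the best validation accuracy and stopping after `PATIENCE` epochs without improvement. In isolation that logic can be sketched as below. `EarlyStopper` is a hypothetical class used only for illustration, and the accuracy values in the demo are made up.

```python
class EarlyStopper:
    """Track the best validation score and count epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def update(self, score):
        """Record a score; return True if it is a new best."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
            return True
        self.bad_epochs += 1
        return False

    @property
    def should_stop(self):
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=2)
for epoch, val_acc in enumerate([0.10, 0.15, 0.14, 0.15, 0.13]):
    is_best = stopper.update(val_acc)   # True would trigger saving the best model
    if stopper.should_stop:
        print(f"stop after epoch {epoch + 1}, best={stopper.best}")
        break
# prints: stop after epoch 4, best=0.15
```

A tie with the current best does not count as an improvement here, which matches the strict `val_acc > best_acc` comparison used for real training runs.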


In [19]:
#################################################################
#  PHASE 4: INTELLIGENT PRE-TRAINING (MAX PERF & MEMORY SAFE)
#################################################################
import os
import time
import shutil
import random
import gc  # <--- IMPORTANT for garbage collection
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision import models, transforms
from tqdm.notebook import tqdm
from torch.cuda.amp import autocast, GradScaler
from google.colab import drive

print("\n--- PHASE 4: Pipeline 0 - Base Model Pre-Training ---")

# --- 0. DRIVE & PATH SETUP ---
if not os.path.exists('/content/drive'):
    print("Mounting Google Drive...")
    drive.mount('/content/drive')

GDRIVE_ROOT = '/content/drive/MyDrive/'
PROJECT_DIR = os.path.join(GDRIVE_ROOT, 'Deep Learning')
MODELS_DIR = os.path.join(PROJECT_DIR, 'models', 'base_models')
os.makedirs(MODELS_DIR, exist_ok=True)

# --- 1. SEED SETUP ---
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_seed(42)

# --- 2. HARDWARE DETECTION (TUNED FOR A100) ---
def get_optimal_config():
    cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
    optimal_workers = min(cpu_count, 8)
    device_name = "CPU"
    batch_size = 16

    if torch.cuda.is_available():
        gpu_name = torch.cuda.get_device_name(0)
        device_name = gpu_name
        # --- TUNING ---
        if "A100" in gpu_name:
            batch_size = 512  # <--- More aggressive for the A100 (40 GB of VRAM easily allows this)
        elif "T4" in gpu_name:
            batch_size = 128
        else:
            batch_size = 64
    else:
        print("WARNING: No GPU detected!")

    return device_name, batch_size, optimal_workers

detected_device, auto_bs, auto_workers = get_optimal_config()

# --- 3. CONFIGURATION ---
CONFIG = {
    'ARCH': 'resnet34',

    # --- CONTROL CENTER ---
    'DRY_RUN': False,            # <--- REAL TRAINING
    'NUM_EPOCHS': 20,
    # ----------------------

    'BATCH_SIZE': auto_bs,
    'NUM_WORKERS': auto_workers,
    'DEVICE_NAME': detected_device,
    'LEARNING_RATE': 1e-3,
    'PATIENCE': 5,
    'SUBSETS': [0.25, 0.50, 1.0],

    'CHECKPOINT_DIR_LOC': '/content/checkpoints',
    'CHECKPOINT_DIR_DRIVE': MODELS_DIR
}

os.makedirs(CONFIG['CHECKPOINT_DIR_LOC'], exist_ok=True)

print(f"\nSystem Configuration:")
print(f" -> Hardware:    {CONFIG['DEVICE_NAME']}")
print(f" -> Batch Size:  {CONFIG['BATCH_SIZE']} (Optimized)")
print(f" -> Workers:     {CONFIG['NUM_WORKERS']}")
print(f" -> Mode:        {'DRY RUN' if CONFIG['DRY_RUN'] else 'REAL TRAINING'}")

# --- 4. MEMORY CLEANUP HELPER (NEW) ---
def cleanup_memory():
    """Forces Garbage Collection and clears GPU Cache."""
    gc.collect()
    torch.cuda.empty_cache()
    # Optional: Print stats to verify
    # print(f"   [Mem] Reserved: {torch.cuda.memory_reserved(0)/1e9:.2f} GB")


# --- 5. MODEL FACTORY ---
def get_base_model(arch_name, num_classes, pretrained=True):
    # Build a torchvision ResNet and replace its final layer with a num_classes-way head
    if arch_name == 'resnet34':
        model = models.resnet34(weights=models.ResNet34_Weights.DEFAULT if pretrained else None)
        in_features = model.fc.in_features
    elif arch_name == 'resnet18':
        model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT if pretrained else None)
        in_features = model.fc.in_features
    elif arch_name == 'resnet50':
        model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT if pretrained else None)
        in_features = model.fc.in_features
    else:
        raise ValueError("Arch not supported")
    model.fc = nn.Linear(in_features, num_classes)
    return model

# --- 6. DATA LOADER HELPER ---
def get_subset_loader(fraction):
    train_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(NORMALIZE_MEAN, NORMALIZE_STD)
    ])

    full_ds = MetaINatDataset(DATASET_ROOT, PARTITION_FILE_PATH, split='c_base', transform=train_transforms)

    total_base_classes = len(full_ds.allowed_ids)
    target_num = int(total_base_classes * fraction)
    subset_ids = full_ds.allowed_ids[:target_num]

    # Filter samples (Memory efficient filtering logic)
    # We recreate the list to drop references to unused samples
    new_samples = [s for s in full_ds.samples if s['label'] < target_num]
    full_ds.samples = new_samples
    full_ds.allowed_ids = subset_ids
    full_ds.label_map = {orig: new for new, orig in enumerate(subset_ids)}

    print(f"\n[Data] Subset {fraction*100}%: {len(new_samples)} images, {target_num} classes.")

    num_val = int(0.1 * len(full_ds))
    train_ds, val_ds = random_split(full_ds, [len(full_ds)-num_val, num_val],
                                    generator=torch.Generator().manual_seed(42))

    train_loader = DataLoader(train_ds, batch_size=CONFIG['BATCH_SIZE'], shuffle=True,
                              num_workers=CONFIG['NUM_WORKERS'], pin_memory=True)
    val_loader = DataLoader(val_ds, batch_size=CONFIG['BATCH_SIZE'], shuffle=False,
                            num_workers=CONFIG['NUM_WORKERS'], pin_memory=True)

    return train_loader, val_loader, target_num

# --- 7. ROBUST CHECKPOINTING ---
def safe_copy_to_drive(local_path, filename, max_retries=5):
    drive_path = os.path.join(CONFIG['CHECKPOINT_DIR_DRIVE'], filename)
    try:
        os.makedirs(CONFIG['CHECKPOINT_DIR_DRIVE'], exist_ok=True)
    except OSError as e:
        # Do not swallow the error silently; the copy attempts below will retry
        print(f"   [WARN] Could not create Drive checkpoint dir ({e}).")

    for attempt in range(1, max_retries + 1):
        try:
            shutil.copy(local_path, drive_path)
            if os.path.exists(drive_path) and os.path.getsize(drive_path) > 0:
                print(f"   -> Drive Copy: SUCCESS")
                return
        except Exception as e:
            wait_time = 3 * attempt
            print(f"   [Retry {attempt}] Copy failed ({e}). Waiting {wait_time}s...")
            time.sleep(wait_time)
    print(f"   [CRITICAL ERROR] Failed to copy {filename} to Drive.")

def save_checkpoint(state, filename):
    local_path = os.path.join(CONFIG['CHECKPOINT_DIR_LOC'], filename)
    torch.save(state, local_path)
    safe_copy_to_drive(local_path, filename)

def save_best_model(model, filename):
    local_path = os.path.join(CONFIG['CHECKPOINT_DIR_LOC'], filename)
    torch.save(model.state_dict(), local_path)
    safe_copy_to_drive(local_path, filename)

# --- 8. TRAINING ENGINE ---
def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, target_epochs, model_name):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    scaler = GradScaler()

    run_tag = "_dryrun" if CONFIG['DRY_RUN'] else ""
    ckpt_filename = f"{model_name}{run_tag}_checkpoint.pth"
    best_filename = f"{model_name}{run_tag}_best.pth"

    start_epoch = 0
    best_acc = -1.0

    # Resume Logic
    drive_ckpt_path = os.path.join(CONFIG['CHECKPOINT_DIR_DRIVE'], ckpt_filename)
    if os.path.exists(drive_ckpt_path):
        print(f"\n[RESUME] Found: {ckpt_filename}")
        try:
            checkpoint = torch.load(drive_ckpt_path, map_location=device)
            saved_epoch = checkpoint['epoch']

            if CONFIG['DRY_RUN']:
                print(f"   -> (Dry Run) Resetting loop despite found epoch {saved_epoch+1}.")
                start_epoch = 0
                best_acc = checkpoint.get('best_acc', -1.0)
            else:
                if saved_epoch >= (target_epochs - 1):
                    print(f"   -> Fully trained ({saved_epoch+1} epochs). Skipping.")
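                    # Load the trained weights so the returned model is not the untrained one
                    model.load_state_dict(checkpoint['model_state_dict'])
                    return model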
                start_epoch = saved_epoch + 1
                best_acc = checkpoint.get('best_acc', 0.0)

            model.load_state_dict(checkpoint['model_state_dict'])
            optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
            scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
            if 'scaler_state_dict' in checkpoint:
                scaler.load_state_dict(checkpoint['scaler_state_dict'])
            print(f"   -> Resuming with Best Acc: {best_acc:.4f}")
        except Exception as e:
            print(f"   [ERROR] Checkpoint corrupted ({e}). Fresh start.")
    else:
        print(f"\n[START] Fresh start for {model_name}.")

    effective_epochs = 2 if CONFIG['DRY_RUN'] else target_epochs
    patience_counter = 0

    for epoch in range(start_epoch, effective_epochs):
        print(f"\nEpoch {epoch+1}/{effective_epochs}")

        model.train()
        running_loss = 0.0
        running_corrects = 0
        limit_batches = 5 if CONFIG['DRY_RUN'] else None

        pbar = tqdm(train_loader, leave=False, desc="Training")

        for i, (inputs, labels) in enumerate(pbar):
            if limit_batches and i >= limit_batches: break
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()

            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, labels)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            _, preds = torch.max(outputs, 1)
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)
            pbar.set_postfix(loss=loss.item())

        iter_size = (limit_batches * CONFIG['BATCH_SIZE']) if limit_batches else len(train_loader.dataset)
        if iter_size == 0: iter_size = 1
        epoch_acc = running_corrects.double() / iter_size
        epoch_loss = running_loss / iter_size
        scheduler.step(epoch_loss)

        model.eval()
        val_corrects = 0
        val_limit = 5 if CONFIG['DRY_RUN'] else None
        val_count = 0
        for i, (inputs, labels) in enumerate(val_loader):
            if val_limit and i >= val_limit: break
            inputs, labels = inputs.to(device), labels.to(device)
            with torch.no_grad():
                outputs = model(inputs)
                _, preds = torch.max(outputs, 1)
            val_corrects += torch.sum(preds == labels.data)
            val_count += inputs.size(0)
        val_acc = val_corrects.double() / val_count if val_count > 0 else 0.0
        print(f"   Train Acc: {epoch_acc:.4f} | Val Acc: {val_acc:.4f}")

        # Save Checkpoint
        full_state = {
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict(),
            'scaler_state_dict': scaler.state_dict(),
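            # best_acc is only updated further down, so include this epoch's result here;
            # otherwise a resumed run would restore a stale best accuracy.
            'best_acc': max(float(best_acc), float(val_acc))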
        }
        save_checkpoint(full_state, ckpt_filename)

        # Save Best Model logic
        save_condition = False
        if val_acc > best_acc: save_condition = True
        elif CONFIG['DRY_RUN'] and val_acc >= best_acc: save_condition = True
        elif best_acc == -1.0: save_condition = True

        if save_condition:
            best_acc = val_acc
            save_best_model(model, best_filename)
            print(f"   [New Best] Saved {best_filename}")
            patience_counter = 0
        else:
            patience_counter += 1

        if not CONFIG['DRY_RUN'] and patience_counter >= CONFIG['PATIENCE']:
            print(f"   [Early Stopping] Reached patience limit.")
            break

    print(f"Training Finished. Final Best Acc: {best_acc:.4f}")
    return model


# --- 9. EXECUTION LOOP (WITH CLEANUP) ---
for fraction in CONFIG['SUBSETS']:
    subset_name = f"M_base_{int(fraction*100)}"
    print(f"\n{'='*40}\nRUN: {subset_name}\n{'='*40}")

    train_dl, val_dl, num_cls = get_subset_loader(fraction)
    model = get_base_model(CONFIG['ARCH'], num_classes=num_cls)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=CONFIG['LEARNING_RATE'])
    lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=3)

    # Train
    train_model(model, train_dl, val_dl, criterion, optimizer, lr_scheduler, CONFIG['NUM_EPOCHS'], subset_name)

    # --- MEMORY CLEANUP ---
    print(f"   [Cleanup] Clearing GPU memory after {subset_name}...")
    del model
    del optimizer
    del lr_scheduler  # holds a reference to the optimizer (and thus the model parameters)
    del criterion
    del train_dl
    del val_dl
    cleanup_memory() # Call helper to force GC and Empty Cache
    print("   [Cleanup] Done. Ready for next model.\n")

print("\nPHASE 4 COMPLETE.")
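
The best-model and patience bookkeeping inside `train_model` can be looked at in isolation. The following is a minimal pure-Python sketch, not the notebook's actual code: the class name `PatienceTracker` is hypothetical, and `patience=2` is an illustrative value chosen only to make the stop visible (the real `CONFIG['PATIENCE']` is not shown here). The accuracies are the validation values of epochs 13 to 18 from the M_base_25 run.

```python
from dataclasses import dataclass


@dataclass
class PatienceTracker:
    """Tracks the best validation accuracy and the epochs since it last improved."""
    patience: int
    best_acc: float = -1.0  # sentinel: any real accuracy beats it
    counter: int = 0

    def update(self, val_acc: float) -> bool:
        """Record one epoch and return True if val_acc is a new best."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.counter = 0
            return True
        self.counter += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.counter >= self.patience


# Validation accuracies of epochs 13-18 (M_base_25), illustrative patience of 2
history = [0.3997, 0.3927, 0.4225, 0.4329, 0.4273, 0.4201]
tracker = PatienceTracker(patience=2)
stopped_at = None
for epoch, acc in enumerate(history, start=13):
    is_best = tracker.update(acc)
    print(f"epoch {epoch}: val={acc:.4f} best={is_best} counter={tracker.counter}")
    if tracker.should_stop:
        stopped_at = epoch
        break
```

With this small patience the loop would stop at epoch 18, after two epochs without improvement. Keeping the bookkeeping in one object also makes it straightforward to store `best_acc` and `counter` together in a checkpoint if resuming is needed.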


--- PHASE 4: Pipeline 0 - Base Model Pre-Training ---

System Configuration:
 -> Hardware:    NVIDIA A100-SXM4-40GB
 -> Batch Size:  512 (Optimized)
 -> Workers:     8
 -> Mode:        REAL TRAINING

RUN: M_base_25

[Data] Subset 25.0%: 75000 images, 1500 classes.

[START] Fresh start for M_base_25.

Epoch 1/20



Training:   0%|          | 0/132 [00:00<?, ?it/s]


   Train Acc: 0.0016 | Val Acc: 0.0023
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 2/20


   Train Acc: 0.0066 | Val Acc: 0.0091
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 3/20


   Train Acc: 0.0251 | Val Acc: 0.0331
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 4/20


   Train Acc: 0.0790 | Val Acc: 0.0917
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 5/20


   Train Acc: 0.1567 | Val Acc: 0.1544
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 6/20


   Train Acc: 0.2387 | Val Acc: 0.2171
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 7/20


   Train Acc: 0.3030 | Val Acc: 0.2693
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 8/20


   Train Acc: 0.3585 | Val Acc: 0.3068
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 9/20


   Train Acc: 0.4045 | Val Acc: 0.3133
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 10/20


   Train Acc: 0.4368 | Val Acc: 0.3403
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 11/20


   Train Acc: 0.4669 | Val Acc: 0.3631
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 12/20


   Train Acc: 0.4922 | Val Acc: 0.3787
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 13/20


   Train Acc: 0.5176 | Val Acc: 0.3997
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 14/20


   Train Acc: 0.5369 | Val Acc: 0.3927
   -> Drive Copy: SUCCESS

Epoch 15/20


   Train Acc: 0.5566 | Val Acc: 0.4225
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 16/20


   Train Acc: 0.5714 | Val Acc: 0.4329
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 17/20


   Train Acc: 0.5897 | Val Acc: 0.4273
   -> Drive Copy: SUCCESS

Epoch 18/20


   Train Acc: 0.6001 | Val Acc: 0.4201
   -> Drive Copy: SUCCESS

Epoch 19/20


   Train Acc: 0.6163 | Val Acc: 0.4552
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 20/20


   Train Acc: 0.6323 | Val Acc: 0.4269
   -> Drive Copy: SUCCESS
Training Finished. Final Best Acc: 0.4552
   [Cleanup] Clearing GPU memory after M_base_25...
   [Cleanup] Done. Ready for next model.


RUN: M_base_50

[Data] Subset 50.0%: 150000 images, 3000 classes.

[START] Fresh start for M_base_50.

Epoch 1/20


Training:   0%|          | 0/264 [00:00<?, ?it/s]

   Train Acc: 0.0007 | Val Acc: 0.0015
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 2/20


   Train Acc: 0.0029 | Val Acc: 0.0031
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 3/20


   Train Acc: 0.0091 | Val Acc: 0.0115
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 4/20


   Train Acc: 0.0261 | Val Acc: 0.0303
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 5/20


   Train Acc: 0.0584 | Val Acc: 0.0606
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 6/20


   Train Acc: 0.1035 | Val Acc: 0.0992
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 7/20


   Train Acc: 0.1509 | Val Acc: 0.1336
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 8/20


   Train Acc: 0.1941 | Val Acc: 0.1515
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 9/20


   Train Acc: 0.2342 | Val Acc: 0.1915
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 10/20


   Train Acc: 0.2682 | Val Acc: 0.2043
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 11/20


   Train Acc: 0.2990 | Val Acc: 0.2261
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 12/20


   Train Acc: 0.3273 | Val Acc: 0.2591
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 13/20


   Train Acc: 0.3498 | Val Acc: 0.2699
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 14/20


   Train Acc: 0.3730 | Val Acc: 0.2724
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 15/20


   Train Acc: 0.3925 | Val Acc: 0.3043
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 16/20


   Train Acc: 0.4137 | Val Acc: 0.3003
   -> Drive Copy: SUCCESS

Epoch 17/20


   Train Acc: 0.4284 | Val Acc: 0.2990
   -> Drive Copy: SUCCESS

Epoch 18/20


   Train Acc: 0.4455 | Val Acc: 0.2995
   -> Drive Copy: SUCCESS

Epoch 19/20


   Train Acc: 0.4621 | Val Acc: 0.3140
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth

Epoch 20/20


   Train Acc: 0.4763 | Val Acc: 0.3199
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_50_best.pth
Training Finished. Final Best Acc: 0.3199
   [Cleanup] Clearing GPU memory after M_base_50...
   [Cleanup] Done. Ready for next model.


RUN: M_base_100

[Data] Subset 100.0%: 300000 images, 6000 classes.

[START] Fresh start for M_base_100.

Epoch 1/20


Training:   0%|          | 0/528 [00:00<?, ?it/s]

   Train Acc: 0.0008 | Val Acc: 0.0021
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 2/20


   Train Acc: 0.0078 | Val Acc: 0.0127
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 3/20


   Train Acc: 0.0349 | Val Acc: 0.0509
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 4/20


   Train Acc: 0.0864 | Val Acc: 0.1003
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 5/20


   Train Acc: 0.1508 | Val Acc: 0.1488
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 6/20


   Train Acc: 0.2112 | Val Acc: 0.1856
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 7/20


   Train Acc: 0.2617 | Val Acc: 0.2201
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 8/20


   Train Acc: 0.3057 | Val Acc: 0.2578
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 9/20


   Train Acc: 0.3429 | Val Acc: 0.2767
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 10/20


   Train Acc: 0.3751 | Val Acc: 0.2871
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 11/20


   Train Acc: 0.4032 | Val Acc: 0.3063
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 12/20


   Train Acc: 0.4284 | Val Acc: 0.3253
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 13/20


   Train Acc: 0.4502 | Val Acc: 0.3258
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 14/20


   Train Acc: 0.4715 | Val Acc: 0.3346
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 15/20


   Train Acc: 0.4874 | Val Acc: 0.3518
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 16/20


   Train Acc: 0.5086 | Val Acc: 0.3530
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 17/20


   Train Acc: 0.5247 | Val Acc: 0.3665
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 18/20


   Train Acc: 0.5392 | Val Acc: 0.3643
   -> Drive Copy: SUCCESS

Epoch 19/20


   Train Acc: 0.5532 | Val Acc: 0.3681
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth

Epoch 20/20


   Train Acc: 0.5692 | Val Acc: 0.3748
   -> Drive Copy: SUCCESS
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_100_best.pth
Training Finished. Final Best Acc: 0.3748
   [Cleanup] Clearing GPU memory after M_base_100...
   [Cleanup] Done. Ready for next model.


PHASE 4 COMPLETE.
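
To compare the runs without reading the log by eye, the per-epoch lines can be parsed into numbers. The following is a small sketch that assumes the log format printed above (`Epoch N/M` followed by `Train Acc: x | Val Acc: y`); the function name `parse_log` and the `sample` excerpt are illustrative and not part of the notebook.

```python
import re

EPOCH_RE = re.compile(r"^Epoch (\d+)/(\d+)")
ACC_RE = re.compile(r"Train Acc: ([0-9.]+) \| Val Acc: ([0-9.]+)")


def parse_log(text):
    """Return a list of (epoch, train_acc, val_acc) tuples found in the log text."""
    rows, epoch = [], None
    for line in text.splitlines():
        line = line.strip()
        m = EPOCH_RE.match(line)
        if m:
            epoch = int(m.group(1))
            continue
        m = ACC_RE.search(line)
        if m and epoch is not None:
            rows.append((epoch, float(m.group(1)), float(m.group(2))))
    return rows


# Excerpt: the last two epochs of the M_base_25 run
sample = """
Epoch 19/20
   Train Acc: 0.6163 | Val Acc: 0.4552
   -> Drive Copy: SUCCESS
   [New Best] Saved M_base_25_best.pth

Epoch 20/20
   Train Acc: 0.6323 | Val Acc: 0.4269
   -> Drive Copy: SUCCESS
"""

rows = parse_log(sample)
best = max(rows, key=lambda r: r[2])  # row with the highest validation accuracy
gap = best[1] - best[2]               # train/val gap at the best epoch
print(f"best epoch {best[0]}: val {best[2]:.4f}, train-val gap {gap:.4f}")
```

Applied to a full run, the same rows can be used to plot the train and validation curves or to check at which epoch the gap between them starts to widen.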
