# Phase 1: Training Orchestration (Google Colab & Kaggle)

This notebook orchestrates all training activities for **Google Colab or Kaggle execution** with GPU compute support.

## Important

- This notebook **executes training in Google Colab or Kaggle** (not on Azure ML)
- All computation happens on the platform's GPU
- **Storage & Persistence**:
  - **Google Colab**: Checkpoints are automatically saved to Google Drive for persistence across sessions
  - **Kaggle**: Outputs in `/kaggle/working/` are automatically persisted - no manual backup needed
- The notebook must be **re-runnable end-to-end**
- Uses the dataset path specified in the data config (from `config/data/*.yaml`), typically pointing to a local folder included in the repository
- **Session Management**:
  - **Colab**: Sessions time out after 12-24 hours (depending on the Colab plan). Checkpoints are saved to Drive automatically.
  - **Kaggle**: Sessions have time limits based on your plan. All outputs are automatically saved.
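
As a rough illustration of the persistence behaviour described above, the sketch below copies a checkpoint directory into a persistent location. The function name `persist_checkpoint` and its arguments are hypothetical and exist only for this example; the notebook itself delegates this work to the Drive backup helpers defined in Step 5.

```python
import shutil
from pathlib import Path


def persist_checkpoint(checkpoint_dir: Path, persistent_root: Path) -> Path:
    """Copy a checkpoint directory under persistent_root and return the copy's path.

    Hypothetical helper for illustration only; not part of the repository.
    """
    if not checkpoint_dir.is_dir():
        raise FileNotFoundError(f"Checkpoint directory not found: {checkpoint_dir}")
    target = persistent_root / checkpoint_dir.name
    # dirs_exist_ok=True lets a re-run overwrite an earlier copy (Python 3.8+).
    shutil.copytree(checkpoint_dir, target, dirs_exist_ok=True)
    return target
```

On Colab, `persistent_root` would typically be a folder under the mounted Drive. On Kaggle no copy is needed when the checkpoint is already written under `/kaggle/working/`.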


## Step 1: Environment Detection

The notebook automatically detects the execution environment (local, Google Colab, or Kaggle) and adapts its behavior accordingly.


In [1]:
# Use cached values or import functions
try:
    from common.shared.notebook_setup import (
        get_platform_vars,
        ensure_src_in_path,
        detect_notebook_environment,
    )
except ImportError:
    print("⚠ Repository not cloned yet. Run Repository Setup cell first.")
    # Minimal fallback - just detect platform
    import os
    from pathlib import Path
    
    def get_platform_vars():
        if "COLAB_GPU" in os.environ or "COLAB_TPU" in os.environ:
            return {"platform": "colab", "is_colab": True, "is_kaggle": False, "is_local": False, "base_dir": Path("/content"), "backup_enabled": True}
        if "KAGGLE_KERNEL_RUN_TYPE" in os.environ:
            return {"platform": "kaggle", "is_colab": False, "is_kaggle": True, "is_local": False, "base_dir": Path("/kaggle/working"), "backup_enabled": False}
        return {"platform": "local", "is_colab": False, "is_kaggle": False, "is_local": True, "base_dir": None, "backup_enabled": False}
    
    def ensure_src_in_path():
        return None
    
    def detect_notebook_environment():
        class Env:
            def __init__(self):
                pv = get_platform_vars()
                self.platform = pv["platform"]
                self.is_colab = pv["is_colab"]
                self.is_kaggle = pv["is_kaggle"]
                self.is_local = pv["is_local"]
                self.base_dir = pv["base_dir"]
                self.backup_enabled = pv["backup_enabled"]
        return Env()

# Use cached values or compute
if 'PLATFORM_VARS' not in globals():
    PLATFORM_VARS = get_platform_vars()
platform_vars = PLATFORM_VARS

if 'REPO_ROOT' not in globals():
    REPO_ROOT = ensure_src_in_path()
repo_root = REPO_ROOT

# Get environment info
env = detect_notebook_environment()
PLATFORM = env.platform
IN_COLAB = env.is_colab
IN_KAGGLE = env.is_kaggle
IS_LOCAL = env.is_local
BASE_DIR = env.base_dir
BACKUP_ENABLED = env.backup_enabled

if not repo_root and not IS_LOCAL:
    print("⚠ Repository not found. Run Repository Setup cell to clone.")

print(f"✓ Platform: {PLATFORM}")
print(f"✓ Base directory: {BASE_DIR if BASE_DIR else 'Current working directory'}")
print(f"✓ Backup enabled: {BACKUP_ENABLED}")


⚠ Repository not cloned yet. Run Repository Setup cell first.
✓ Platform: local
✓ Base directory: Current working directory
✓ Backup enabled: False


In [2]:
if 'PLATFORM_VARS' not in globals():
    PLATFORM_VARS = get_platform_vars()
platform_vars = PLATFORM_VARS

if 'REPO_ROOT' not in globals():
    REPO_ROOT = ensure_src_in_path()
repo_root = REPO_ROOT

if repo_root:
    try:
        from common.shared.notebook_setup import detect_notebook_environment
        env = detect_notebook_environment()
        PLATFORM = env.platform
        IN_COLAB = env.is_colab
        IN_KAGGLE = env.is_kaggle
        IS_LOCAL = env.is_local
        BASE_DIR = env.base_dir
        BACKUP_ENABLED = env.backup_enabled
    except ImportError:
        PLATFORM = platform_vars["platform"]
        IN_COLAB = platform_vars["is_colab"]
        IN_KAGGLE = platform_vars["is_kaggle"]
        IS_LOCAL = platform_vars["is_local"]
        BASE_DIR = platform_vars["base_dir"]
        BACKUP_ENABLED = platform_vars["backup_enabled"]
else:
    PLATFORM = platform_vars["platform"]
    IN_COLAB = platform_vars["is_colab"]
    IN_KAGGLE = platform_vars["is_kaggle"]
    IS_LOCAL = platform_vars["is_local"]
    BASE_DIR = platform_vars["base_dir"]
    BACKUP_ENABLED = platform_vars["backup_enabled"]
    if not IS_LOCAL:
        print("Repository not found. Run Repository Setup cell to clone.")

print(f"Platform: {PLATFORM}")


Platform: local


## Step 2: Repository Setup

**Note**: Repository setup is only needed for Colab/Kaggle environments. Local environments should already have the repository cloned.

### For Colab/Kaggle: Clone from Git or Upload Files

Choose one of the following options:

**Option A: Clone from Git (Recommended)**

If your repository is on GitHub/GitLab, clone it:

**For Google Colab:**
```python
!git clone -b gg_final_training_2 https://github.com/longdang193/resume-ner-azureml.git /content/resume-ner-azureml
```

**For Kaggle:**
```python
!git clone -b gg_final_training_2 https://github.com/longdang193/resume-ner-azureml.git /kaggle/working/resume-ner-azureml
```

**Option B: Upload Files**

**For Google Colab:**
1. Use the Colab file browser (folder icon on left sidebar)
2. Upload your project files to `/content/resume-ner-azureml/`
3. Ensure the directory structure matches: `src/`, `config/`, `notebooks/`, etc.

**For Kaggle:**
1. Use the Kaggle file browser (Data tab)
2. Upload your project files to `/kaggle/working/resume-ner-azureml/`
3. Ensure the directory structure matches: `src/`, `config/`, `notebooks/`, etc.

### For Local: Repository Already Exists

Local environments should have the repository already cloned. The notebook will automatically detect the repository location.
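
The next cell calls a `find_repo_root` helper when one is defined, but its implementation is not shown in this notebook. A minimal sketch of how such detection could work, assuming the repository root is the first ancestor directory that contains both `src/` and `config/`:

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Optional[Path] = None) -> Optional[Path]:
    """Walk upward from `start` until a directory with src/ and config/ is found.

    Hypothetical sketch; the marker directories are taken from the layout
    described in this notebook and may differ from the real helper.
    """
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / "src").is_dir() and (candidate / "config").is_dir():
            return candidate
    return None
```

Returning `None` instead of raising keeps the caller free to fall back to cloning, which matches how `REPO_ROOT` is handled in the cells of this notebook.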


In [3]:
from pathlib import Path

if 'PLATFORM_VARS' not in globals():
    PLATFORM_VARS = get_platform_vars()
platform_vars = PLATFORM_VARS

if 'REPO_ROOT' not in globals():
    REPO_ROOT = find_repo_root() if 'find_repo_root' in globals() else None
repo_root = REPO_ROOT

if not repo_root and not platform_vars["is_local"]:
    if platform_vars["is_kaggle"]:
        repo_path = Path("/kaggle/working/resume-ner-azureml")
        if not repo_path.exists():
            !git clone -b hpo_run_time_excl https://github.com/hoanglongvonguyen009/resume-ner-azureml.git /kaggle/working/resume-ner-azureml
    elif platform_vars["is_colab"]:
        repo_path = Path("/content/resume-ner-azureml")
        if not repo_path.exists():
            !git clone -b hpo_run_time_excl https://github.com/hoanglongvonguyen009/resume-ner-azureml.git /content/resume-ner-azureml


### Verify Repository Setup

Verify the repository structure exists:


In [4]:
# Get platform vars and repository root
if 'PLATFORM_VARS' not in globals():
    PLATFORM_VARS = get_platform_vars()
platform_vars = PLATFORM_VARS

if 'REPO_ROOT' not in globals():
    REPO_ROOT = ensure_src_in_path()
    # Try expected location if not found (for Colab/Kaggle after cloning)
    if not REPO_ROOT and not platform_vars["is_local"]:
        expected_path = platform_vars["base_dir"] / "resume-ner-azureml"
        if expected_path.exists() and (expected_path / "config").exists() and (expected_path / "src").exists():
            import sys
            src_dir = expected_path / "src"
            if str(src_dir) not in sys.path:
                sys.path.insert(0, str(src_dir))
            REPO_ROOT = expected_path
repo_root = REPO_ROOT

# Install mlflow for Colab/Kaggle before importing modules that depend on it
if repo_root and not platform_vars["is_local"]:
    try:
        import mlflow  # noqa: F401
    except ImportError:
        import subprocess
        import sys
        subprocess.check_call([sys.executable, "-m", "pip", "install", "mlflow", "--quiet"])

if not repo_root:
    raise FileNotFoundError("Repository not found. Run Repository Setup cell to clone.")

# Setup paths
from common.shared.notebook_setup import setup_notebook_paths
paths = setup_notebook_paths(root_dir=repo_root, add_src_to_path=True)

ROOT_DIR = paths.root_dir
CONFIG_DIR = paths.config_dir
SRC_DIR = paths.src_dir
NOTEBOOK_DIR = ROOT_DIR / "notebooks"

# Import path validation utility
import importlib.util
paths_validation_spec = importlib.util.spec_from_file_location(
    "paths_validation",
    SRC_DIR / "infrastructure" / "paths" / "validation.py"
)
paths_validation = importlib.util.module_from_spec(paths_validation_spec)
paths_validation_spec.loader.exec_module(paths_validation)
validate_path_before_mkdir = paths_validation.validate_path_before_mkdir


FileNotFoundError: Repository not found. Run Repository Setup cell to clone.

## Step 3: Install Dependencies

**For Local**: Use conda environment (instructions below).  
**For Colab/Kaggle**: Install packages via pip (automated below).

### Local Environment Setup

For local execution, create and activate a conda environment:

1. Open a terminal in the project root
2. Create the conda environment: `conda env create -f config/environment/conda.yaml`
3. Activate: `conda activate resume-ner-training`
4. Restart the kernel after activation

### Colab/Kaggle: Automated Installation

PyTorch is usually pre-installed in Colab/Kaggle, but we'll verify and install other required packages.
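
Before installing anything, it can help to see which packages are already present. The sketch below uses the standard-library `importlib.metadata` module; the helper name `installed_versions` and the package list are illustrative assumptions, not part of the repository.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Illustrative subset of the packages installed by the cells in this step.
report = installed_versions(["torch", "transformers", "mlflow"])
missing = [name for name, version in report.items() if version is None]
print("Missing packages:", missing or "none")
```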


In [None]:
if 'PLATFORM_VARS' not in globals():
    PLATFORM_VARS = get_platform_vars()
platform_vars = PLATFORM_VARS

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    device_count = torch.cuda.device_count()
    print(f"Visible GPUs: {device_count}")
    for i in range(device_count):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

torch_version = tuple(map(int, torch.__version__.split('.')[:2]))
if torch_version < (2, 6):
    print(f"Warning: PyTorch {torch.__version__} may not meet requirements (>=2.6.0)")
    if not platform_vars["is_local"]:
        # Quote the requirement so the shell does not treat ">=" as a redirect
        print('Consider upgrading: !pip install "torch>=2.6.0" --upgrade')


In [None]:
if 'PLATFORM_VARS' not in globals():
    PLATFORM_VARS = get_platform_vars()
platform_vars = PLATFORM_VARS

if platform_vars["is_local"]:
    print("For local environment, please:")
    print("1. Create conda environment: conda env create -f config/environment/conda.yaml")
    print("2. Activate: conda activate resume-ner-training")
    print("3. Restart kernel after activation")
    print("\nIf you've already done this, you can continue to the next cell.")
    print("\nInstalling Azure ML SDK (required for imports)...")
    # Install Azure ML packages even for local (in case conda env not activated)
    %pip install "azure-ai-ml>=1.0.0" --quiet
    %pip install "azure-identity>=1.12.0" --quiet
    %pip install azureml-defaults --quiet
    %pip install azureml-mlflow --quiet
else:
    # Core ML libraries
    %pip install "transformers>=4.35.0,<5.0.0" --quiet
    %pip install "safetensors>=0.4.0" --quiet
    %pip install "datasets>=2.12.0" --quiet

    # ML utilities
    %pip install "numpy>=1.24.0,<2.0.0" --quiet
    %pip install "pandas>=2.0.0" --quiet
    %pip install "scikit-learn>=1.3.0" --quiet

    # Utilities
    %pip install "pyyaml>=6.0" --quiet
    %pip install "tqdm>=4.65.0" --quiet
    %pip install "seqeval>=1.2.2" --quiet
    %pip install "sentencepiece>=0.1.99" --quiet

    # Experiment tracking
    %pip install mlflow --quiet
    %pip install optuna --quiet

    # Azure ML SDK (required for orchestration imports)
    %pip install "azure-ai-ml>=1.0.0" --quiet
    %pip install "azure-identity>=1.12.0" --quiet
    %pip install azureml-defaults --quiet
    %pip install azureml-mlflow --quiet

    # ONNX support
    %pip install onnxruntime --quiet
    %pip install "onnx>=1.16.0" --quiet
    %pip install "onnxscript>=0.1.0" --quiet

    print("✓ All dependencies installed")

## Step 4: Setup Paths and Import Paths

Python paths are already configured in Step 2. This section verifies the setup.
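
One simple way to confirm that the import path is in place is to check whether the repository's `src/` directory appears on `sys.path`. The helper name `src_on_sys_path` is hypothetical and only meant as a sketch:

```python
import sys
from pathlib import Path


def src_on_sys_path(src_dir: Path) -> bool:
    """Return True if src_dir (after resolving) is one of the sys.path entries."""
    resolved = src_dir.resolve()
    # Empty entries stand for the current directory and are skipped here.
    return any(Path(entry).resolve() == resolved for entry in sys.path if entry)
```

For example, `src_on_sys_path(SRC_DIR)` should return `True` once `setup_notebook_paths(..., add_src_to_path=True)` has run.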


In [None]:
# Environment detection and platform configuration
# This cell can be run independently to re-detect environment
# Useful if environment variables change during notebook execution

from common.shared.notebook_setup import detect_notebook_environment

# Re-detect environment (useful if env vars change)
env = detect_notebook_environment()
PLATFORM = env.platform
IN_COLAB = env.is_colab
IN_KAGGLE = env.is_kaggle
IS_LOCAL = env.is_local
BASE_DIR = env.base_dir
BACKUP_ENABLED = env.backup_enabled

print(f"✓ Detected environment: {PLATFORM.upper()}")
print(f"Platform: {PLATFORM}")


In [None]:
if 'PLATFORM_VARS' not in globals():
    PLATFORM_VARS = get_platform_vars()
platform_vars = PLATFORM_VARS

if 'REPO_ROOT' not in globals():
    REPO_ROOT = ensure_src_in_path()
repo_root = REPO_ROOT

if not repo_root:
    raise FileNotFoundError("Repository not found. Run Repository Setup cell to clone.")

from common.shared.notebook_setup import setup_notebook_paths

if platform_vars["is_local"]:
    paths = setup_notebook_paths(add_src_to_path=True)
else:
    expected_path = platform_vars["base_dir"] / "resume-ner-azureml"
    if expected_path.exists() and (expected_path / "config").exists() and (expected_path / "src").exists():
        paths = setup_notebook_paths(root_dir=expected_path, add_src_to_path=True)
    else:
        paths = setup_notebook_paths(add_src_to_path=True)

ROOT_DIR = paths.root_dir
CONFIG_DIR = paths.config_dir
SRC_DIR = paths.src_dir
NOTEBOOK_DIR = ROOT_DIR / "notebooks"


## Step 5: Mount Google Drive

Mount Google Drive to enable checkpoint persistence across Colab sessions. Checkpoints will be automatically saved to Drive after training completes.
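
The cells below rely on `create_colab_store` to set up the Drive-backed store. For reference, mounting Drive directly in Colab uses `google.colab.drive.mount`. The wrapper name `mount_drive_if_colab` is an assumption made for this sketch, which returns `None` outside Colab instead of failing:

```python
from pathlib import Path
from typing import Optional


def mount_drive_if_colab(mount_point: str = "/content/drive") -> Optional[Path]:
    """Mount Google Drive when running in Colab; return None elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return None
    drive.mount(mount_point)  # prompts for authorization on first use
    return Path(mount_point)
```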


In [None]:
# Google Drive backup/restore functionality
# Uses the DriveBackupStore returned by create_colab_store (infrastructure.storage.drive)
# The drive_store is created in the next cell (after mounting)

# Backward-compatible wrapper functions (delegate to drive_store)
# These maintain the old API for gradual migration
from pathlib import Path

# Note: run the next cell (Mount Google Drive) before calling these wrappers
# If drive_store is None, backup/restore operations are disabled

def backup_to_drive(source_path: Path, is_directory: bool = False) -> bool:
    """
    Backward-compatible wrapper for drive_store.backup().
    
    Note: Prefer using drive_store.backup() directly for better error handling.
    """
    if not BACKUP_ENABLED or drive_store is None:
        return False
    
    if not source_path.exists():
        print(f"⚠ Warning: Source path does not exist: {source_path}")
        return False
    
    # Map is_directory to expect parameter
    expect = "dir" if is_directory else "file"
    result = drive_store.backup(source_path, expect=expect)
    
    if result.ok:
        print(result)
    else:
        print(f"⚠ Warning: Backup failed: {result.reason}")
    
    return result.ok

def restore_from_drive(local_path: Path, is_directory: bool = False) -> bool:
    """
    Backward-compatible wrapper for drive_store.restore().
    
    Note: Prefer using drive_store.restore() directly for better error handling.
    """
    if not BACKUP_ENABLED or drive_store is None:
        return False
    
    # Map is_directory to expect parameter
    expect = "dir" if is_directory else "file"
    result = drive_store.restore(local_path, expect=expect)
    
    if result.ok:
        print(result)
    else:
        print(f"⚠ Warning: Restore failed: {result.reason}")
    
    return result.ok

def ensure_restored_from_drive(local_path: Path, is_directory: bool = False) -> bool:
    """
    Ensure file/directory exists locally, restoring from Drive if missing.
    
    This is the primary entry point for most use cases.
    """
    if not BACKUP_ENABLED or drive_store is None:
        return False
    
    # ensure_local() is called without an `expect` hint, so is_directory is
    # accepted only for signature compatibility with the other wrappers
    result = drive_store.ensure_local(local_path)
    
    if result.ok and result.action.value == "copied":
        print(result)
    
    return result.ok

print("✓ Backup/restore wrapper functions defined (using DriveBackupStore)")


In [None]:
from pathlib import Path
# Fix numpy/pandas compatibility before importing orchestration modules
try:
    from infrastructure.storage.drive import create_colab_store
except (ValueError, ImportError) as e:
    if "numpy.dtype size changed" in str(e) or "numpy" in str(e).lower():
        print("⚠ Numpy/pandas compatibility issue detected. Fixing...")
        import subprocess
        import sys
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "--force-reinstall", "--no-cache-dir", "numpy>=1.24.0,<2.0.0", "pandas>=2.0.0", "--quiet"])
        print("✓ Numpy/pandas reinstalled. Please restart the kernel and re-run this cell.")
        raise RuntimeError("Please restart kernel after numpy/pandas fix")
    else:
        raise

# Mount Google Drive and create backup store (Colab only - Kaggle doesn't need this)
# Uses centralized config from config/paths.yaml
DRIVE_BACKUP_DIR = None
drive_store = None

if IN_COLAB:
    drive_store = create_colab_store(ROOT_DIR, CONFIG_DIR)
    if drive_store:
        BACKUP_ENABLED = True
        DRIVE_BACKUP_DIR = drive_store.backup_root
        print(f"✓ Google Drive mounted")
        print(f"✓ Backup base directory: {DRIVE_BACKUP_DIR}")
        print(f"\nNote: All outputs/ will be mirrored to: {DRIVE_BACKUP_DIR / 'outputs'}")
    else:
        BACKUP_ENABLED = False
        print("⚠ Warning: Could not mount Google Drive. Backup to Google Drive will be disabled.")
elif IN_KAGGLE:
    print("✓ Kaggle environment detected - outputs are automatically persisted (no Drive mount needed)")
    BACKUP_ENABLED = False
else:
    print("⚠ Not running in Colab or Kaggle. Backup to Google Drive is disabled.")
    BACKUP_ENABLED = False


## Step P1-3.1: Load Centralized Configs

Load and validate all configuration files. Configs are immutable and will be logged with each job for reproducibility.

**Note**: 
- **Local**: Config files should already exist in the repository
- **Colab/Kaggle**: Config files are written by the cells below and overwritten if they already exist (useful for fresh environments)
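
To make the reproducibility idea concrete, the sketch below shows one plausible way to fingerprint a config and detect later mutation. The functions `hash_config` and `assert_unchanged` are hypothetical stand-ins; the repository's own loader functions may work differently.

```python
import copy
import hashlib
import json
from typing import Any, Dict


def hash_config(config: Dict[str, Any]) -> str:
    """Return a short, deterministic SHA-256 digest of a config dictionary."""
    # Sorting keys makes the digest independent of insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def assert_unchanged(current: Dict[str, Any], snapshot: Dict[str, Any]) -> None:
    """Raise ValueError if the live config differs from its snapshot."""
    if current != snapshot:
        raise ValueError("Config was mutated after loading")


# Illustrative values only.
example_configs = {"train": {"epochs": 1, "learning_rate": 2e-5}}
example_snapshot = copy.deepcopy(example_configs)
example_digest = hash_config(example_configs["train"])
assert_unchanged(example_configs, example_snapshot)
```

Logging the digest with each job gives a compact way to tell whether two runs used the same configuration.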


In [None]:
# Optional: Update repository from git (only for Colab/Kaggle if needed)
# Uncomment and run if you need to pull latest changes
# if not IS_LOCAL:
#     !cd {ROOT_DIR} && git fetch origin gg_final_training_2
#     !cd {ROOT_DIR} && git reset --hard origin/gg_final_training_2

In [None]:
# Write/override config files (useful for Colab/Kaggle where file editing is limited)
# Local environments should have configs already in the repo
if IS_LOCAL:
    print("✓ Local environment - assuming config files already exist in repository")
else:
    # Create the experiment config directory if it doesn't exist
    experiment_config_dir = CONFIG_DIR / "experiment"
    experiment_config_dir = validate_path_before_mkdir(experiment_config_dir, context="directory")
    experiment_config_dir.mkdir(parents=True, exist_ok=True)

    config_path = experiment_config_dir / "resume_ner_baseline.yaml"

    # Always write/override the config file (useful for Kaggle where editing is difficult)
    config_content = """
experiment_name: "resume_ner_baseline"

# Relative to the top-level config directory
data_config: "data/resume_tiny.yaml"
model_config: "model/distilbert.yaml"
train_config: "train.yaml"
hpo_config: "hpo/prod.yaml"      # default HPO config; stages can override if needed
env_config: "env/azure.yaml"
benchmark_config: "benchmark.yaml"

# High-level orchestration design:
# - Stages: smoke → hpo → training
# - Smoke and HPO stage backbones are controlled by the HPO config file (search_space.backbone.values)
# - Training stage can target specific backbones via stage config
# - AML experiment names are per-stage, optionally per-backbone

stages:
  smoke:
    # AML experiment base name for smoke tests
    aml_experiment: "resume-ner-smoke"
    # HPO config for smoke/dry run tests (uses smoke.yaml with reduced trials)
    hpo_config: "hpo/smoke.yaml"
    # Backbones are controlled by the HPO config file (hpo_config) via search_space.backbone.values

  hpo:
    # AML experiment base name for HPO sweeps
    aml_experiment: "resume-ner-hpo"
    # HPO config for the HPO stage (set to smoke.yaml here for quick runs; use hpo/prod.yaml for a production sweep)
    hpo_config: "hpo/smoke.yaml"
    # Backbones are controlled by the HPO config file (hpo_config) via search_space.backbone.values

  training:
    # AML experiment base name for final single-run training
    aml_experiment: "resume-ner-train"
    # Final production backbone(s); typically one chosen after HPO
    backbones:
      - "distilbert"

# Optional naming policy for how AML experiments are derived per backbone.
# If true, the orchestrator should build experiment_name as:
#   "<aml_experiment>-<backbone>"
# otherwise it should use "<aml_experiment>" directly and rely on tags
# (stage/backbone) for grouping in AML.
naming:
  include_backbone_in_experiment: true
"""

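    # Record whether the file existed before writing so the message is accurate
    config_existed = config_path.exists()
    config_path.write_text(config_content)

    if config_existed:
        print(f"✓ Config overridden at: {config_path}")
    else:
        print(f"✓ Config written to: {config_path}")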

In [None]:
# Write/override HPO config file (useful for Colab/Kaggle where file editing is limited)
# Local environments should have configs already in the repo
if IS_LOCAL:
    print("✓ Local environment - assuming HPO config files already exist in repository")
else:
    # Create the HPO config directory if it doesn't exist
    hpo_config_dir = CONFIG_DIR / "hpo"
    hpo_config_dir = validate_path_before_mkdir(hpo_config_dir, context="directory")
    hpo_config_dir.mkdir(parents=True, exist_ok=True)

    config_path = hpo_config_dir / "smoke.yaml"

    # Always write/override the config file (useful for Kaggle where editing is difficult)
    config_content = """
search_space:
  backbone:
    type: "choice"
    values: ["distilbert"]  # ["distilbert", "distilroberta"]
    # Note: "deberta" excluded from smoke tests due to CUDA/NVRTC issues on Windows
    # DeBERTa requires nvrtc-builtins64_129.dll which may not be available in all environments
  
  learning_rate:
    type: "loguniform"
    min: 1e-5
    max: 5e-5
  
  batch_size:
    type: "choice"
    values: [4]
  
  dropout:
    type: "uniform"
    min: 0.1
    max: 0.3
  
  weight_decay:
    type: "loguniform"
    min: 0.001
    max: 0.1

sampling:
  algorithm: "random"
  max_trials: 1
  timeout_minutes: 20

# Checkpoint configuration for HPO resume support
# Enables saving study state to SQLite database for resuming interrupted runs
checkpoint:
  enabled: true
  study_name: "hpo_{backbone}_smoke_test_3.67"
  storage_path: "{study_name}/study.db"
  auto_resume: true
  # Only save checkpoints for best trials locally (reduces storage from ~30 GB to ~300 MB)
  save_only_best: true

mlflow:
  # Log best trial checkpoint to MLflow after HPO completes
  # Set to false to disable MLflow checkpoint logging entirely
  log_best_checkpoint: true

early_termination:
  policy: "bandit"
  evaluation_interval: 1
  slack_factor: 0.2
  delay_evaluation: 2

objective:
  metric: "macro-f1"
  goal: "maximize"

# Selection strategy configuration for accuracy-speed tradeoff
selection:
  # Accuracy threshold for speed tradeoff (0.015 = 1.5% relative)
  # If two models are within this accuracy difference, prefer faster model
  # Set to null for accuracy-only selection (default behavior)
  accuracy_threshold: 0.015
  
  # Use relative threshold (percentage of best accuracy) vs absolute difference
  # Relative thresholds are more robust across different accuracy ranges
  # Default: true (recommended)
  use_relative_threshold: true
  
  # Minimum relative accuracy gain to justify slower model (optional)
  # If DeBERTa is < 2% better than DistilBERT, prefer DistilBERT
  # Set to null to disable this check
  min_accuracy_gain: 0.02

k_fold:
  enabled: true
  n_splits: 2
  random_seed: 42
  shuffle: true
  stratified: true

# Refit training configuration
# After HPO completes, train the best trial on the full training dataset
# This creates a canonical checkpoint for production use (instead of using arbitrary fold checkpoints)
refit:
  enabled: true  # Default: enabled. Set to false to skip refit training
  # Optional: Add timeout, max_epochs overrides if needed in the future

# Cleanup configuration for interrupted runs
# Controls automatic cleanup/marking of interrupted runs from previous sessions
cleanup:
  # Disable automatic MLflow cleanup (tagging interrupted runs with code.interrupted=true)
  # Default: true (disabled for speed). Set to false to enable automatic cleanup
  disable_auto_cleanup: true
  
  # Disable automatic Optuna marking (marking RUNNING trials as FAILED)
  # Default: false (enabled). Set to true to disable automatic Optuna state cleanup
  disable_auto_optuna_mark: false
"""

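    # Record whether the file existed before writing so the message is accurate
    config_existed = config_path.exists()
    config_path.write_text(config_content)

    if config_existed:
        print(f"✓ HPO config overridden at: {config_path}")
    else:
        print(f"✓ HPO config written to: {config_path}")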
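
The `selection` block in the HPO config describes an accuracy-speed tradeoff. The sketch below is one possible reading of `accuracy_threshold` and `use_relative_threshold`: among candidates whose accuracy is within the threshold of the best, pick the fastest. The `Candidate` class, the `select_model` function, and the numbers are made up for illustration, and `min_accuracy_gain` is not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    accuracy: float    # e.g. macro-F1, higher is better
    latency_ms: float  # lower is better


def select_model(
    candidates: List[Candidate],
    accuracy_threshold: float = 0.015,
    use_relative_threshold: bool = True,
) -> Candidate:
    """Pick the fastest candidate whose accuracy is within the threshold of the best."""
    if not candidates:
        raise ValueError("No candidates to select from")
    best = max(c.accuracy for c in candidates)

    def gap(c: Candidate) -> float:
        diff = best - c.accuracy
        # Relative mode expresses the gap as a fraction of the best accuracy.
        return diff / best if use_relative_threshold and best > 0 else diff

    eligible = [c for c in candidates if gap(c) <= accuracy_threshold]
    return min(eligible, key=lambda c: c.latency_ms)


# Made-up numbers: the relative gap is 0.01 / 0.80 = 0.0125, below 0.015.
chosen = select_model([
    Candidate("deberta", accuracy=0.80, latency_ms=40.0),
    Candidate("distilbert", accuracy=0.79, latency_ms=12.0),
])
print(chosen.name)  # distilbert
```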

In [None]:
# Write/override training config file (useful for Colab/Kaggle where file editing is limited)
# Local environments should have configs already in the repo
if IS_LOCAL:
    print("✓ Local environment - assuming training config file already exists in repository")
else:
    # Ensure config directory exists
    CONFIG_DIR = validate_path_before_mkdir(CONFIG_DIR, context="directory")
    CONFIG_DIR.mkdir(parents=True, exist_ok=True)

    config_path = CONFIG_DIR / "train.yaml"

    # Always write/override the config file (useful for Kaggle where editing is difficult)
    config_content = """
# Global Training Defaults
# Applied to all training runs

training:
  epochs: 1  # 5
  batch_size: 2  # 12 
  gradient_accumulation_steps: 2
  learning_rate: 2e-5
  weight_decay: 0.01
  warmup_steps: 500
  max_grad_norm: 1.0
  # Data splitting and model-specific settings
  val_split_divisor: 10  # Divide train set by this to create validation split if none exists
  deberta_max_batch_size: 8  # 16  # Maximum batch size for DeBERTa models (memory constraints)
  warmup_steps_divisor: 10  # Divide total steps by this to cap warmup steps
  
  # EDA-based metric selection
  metric: "macro-f1"  # Class imbalance requires macro-f1
  metric_mode: "max"  # Maximize macro-f1
  
  early_stopping:
    enabled: true
    patience: 3
    min_delta: 0.001

logging:
  log_interval: 100
  eval_interval: 500
  save_interval: 1000

# NOTE: Multi-GPU / DDP is optional and currently experimental. When enabled,
# the training code will use this section together with hardware detection to
# decide whether to run single-GPU vs multi-GPU. If no multiple GPUs or DDP
# backend are available, it will safely fall back to single-GPU.
distributed:
  enabled: false         # Set true to enable multi-GPU / DDP
  backend: "nccl"        # Typically 'nccl' for GPUs
  world_size: "auto"     # 'auto' = use all visible GPUs; or set an int
  init_method: "env://"  # Default init method; can be overridden if needed
  timeout_seconds: 1800  # Process group init timeout (in seconds)
"""

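    # Record whether the file existed before writing so the message is accurate
    config_existed = config_path.exists()
    config_path.write_text(config_content)

    if config_existed:
        print(f"✓ Training config overridden at: {config_path}")
    else:
        print(f"✓ Training config written to: {config_path}")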
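
Two of the training settings interact arithmetically: `batch_size` with `gradient_accumulation_steps`, and `warmup_steps` with `warmup_steps_divisor`. The sketch below works through both under an assumed interpretation (warmup is capped at total optimizer steps divided by the divisor, on a single device). The function names and the dataset size are hypothetical.

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of samples contributing to each optimizer step (single device)."""
    return batch_size * gradient_accumulation_steps


def capped_warmup_steps(
    num_samples: int,
    epochs: int,
    batch_size: int,
    gradient_accumulation_steps: int,
    warmup_steps: int,
    warmup_steps_divisor: int,
) -> int:
    """Cap warmup at total_steps // warmup_steps_divisor (assumed interpretation)."""
    per_step = effective_batch_size(batch_size, gradient_accumulation_steps)
    total_steps = math.ceil(num_samples / per_step) * epochs
    return min(warmup_steps, total_steps // warmup_steps_divisor)


# Settings from the train.yaml content above; num_samples=200 is a made-up dataset size.
print(effective_batch_size(2, 2))                  # 4
print(capped_warmup_steps(200, 1, 2, 2, 500, 10))  # 5
```

With these values the configured 500 warmup steps would exceed the 50 total optimizer steps, which is why a cap of this kind is useful on tiny datasets.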

### Define Constants

Define constants for file and directory names used throughout the notebook. Benchmark settings come from centralized config, not hard-coded here. These constants work across all environments.


In [None]:
# Import constants from centralized module
from common.constants import (
    STAGE_HPO,
    STAGE_TRAINING,
    METRICS_FILENAME,
    BENCHMARK_FILENAME,
    CHECKPOINT_DIRNAME,
    DEFAULT_RANDOM_SEED,
    DEFAULT_K_FOLDS,
)

# Import MLflow trackers from new location (migrated from orchestration.jobs.tracking.mlflow_tracker)
from infrastructure.tracking.mlflow.trackers import (
    MLflowSweepTracker,
    MLflowBenchmarkTracker,
    MLflowTrainingTracker,
    MLflowConversionTracker,
)


### Define Helper Functions

Reusable helper functions, following the DRY principle, for common operations. These functions work across all environments (local, Colab, Kaggle).


In [None]:
# Import helper functions from consolidated modules (DRY principle)
from pathlib import Path  # used in the type annotations below
from typing import List, Optional, Any
from infrastructure.naming.experiments import build_mlflow_experiment_name
from evaluation.benchmarking.utils import run_benchmarking
from infrastructure.tracking.mlflow.setup import setup_mlflow
from common.shared import verify_output_file

# Wrapper function for run_benchmarking that uses notebook-specific paths
def run_benchmarking_local(
    checkpoint_dir: Path,
    test_data_path: Path,
    output_path: Path,
    batch_sizes: List[int],
    iterations: int,
    warmup_iterations: int,
    max_length: int = 512,
    device: Optional[str] = None,
    tracker: Optional[Any] = None,
    backbone: Optional[str] = None,
    benchmark_source: str = "final_training",
    study_key_hash: Optional[str] = None,
    trial_key_hash: Optional[str] = None,
) -> bool:
    """
    Run benchmarking on a model checkpoint (local notebook wrapper).
    
    This is a thin wrapper around evaluation.benchmarking.utils.run_benchmarking
    that automatically passes the notebook's ROOT_DIR as project_root.
    
    Args:
        checkpoint_dir: Path to checkpoint directory.
        test_data_path: Path to test data JSON file.
        output_path: Path to output benchmark.json file.
        batch_sizes: List of batch sizes to test.
        iterations: Number of iterations per batch size.
        warmup_iterations: Number of warmup iterations.
        max_length: Maximum sequence length.
        device: Device to use (None = auto-detect).
        tracker: Optional MLflowBenchmarkTracker instance.
        backbone: Optional model backbone name.
        benchmark_source: Source of benchmark ("hpo_trial" or "final_training").
        study_key_hash: Optional study key hash for grouping tags.
        trial_key_hash: Optional trial key hash for grouping tags.
    
    Returns:
        True if successful, False otherwise.
    """
    return run_benchmarking(
        checkpoint_dir=checkpoint_dir,
        test_data_path=test_data_path,
        output_path=output_path,
        batch_sizes=batch_sizes,
        iterations=iterations,
        warmup_iterations=warmup_iterations,
        max_length=max_length,
        device=device,
        tracker=tracker,
        backbone=backbone,
        benchmark_source=benchmark_source,
        project_root=ROOT_DIR,
        study_key_hash=study_key_hash,
        trial_key_hash=trial_key_hash,
    )


In [None]:
from pathlib import Path
from typing import Any, Dict

from common.constants import EXPERIMENT_NAME
from infrastructure.config.loader import (
    ExperimentConfig,
    compute_config_hashes,
    create_config_metadata,
    load_all_configs,
    load_experiment_config,
    snapshot_configs,
    validate_config_immutability,
)

# P1-3.1: Load Centralized Configs (local-only)
# Mirrors the Azure orchestration notebook, but does not create an Azure ML client.

if not CONFIG_DIR.exists():
    raise FileNotFoundError(f"Config directory not found: {CONFIG_DIR}")

experiment_config: ExperimentConfig = load_experiment_config(CONFIG_DIR, EXPERIMENT_NAME)
configs: Dict[str, Any] = load_all_configs(experiment_config)
config_hashes = compute_config_hashes(configs)
config_metadata = create_config_metadata(configs, config_hashes)

# Immutable snapshots for runtime mutation checks
original_configs = snapshot_configs(configs)
validate_config_immutability(configs, original_configs)

print(f"Loaded experiment: {experiment_config.name}")
print("Loaded config domains:", sorted(configs.keys()))
print("Config hashes:", config_hashes)
print("Config metadata:", config_metadata)

# Get dataset path from data config (centralized configuration)
# The local_path in the data config is relative to the config directory
data_config = configs["data"]
local_path_str = data_config.get("local_path", "../dataset")
DATASET_LOCAL_PATH = (CONFIG_DIR / local_path_str).resolve()

# Handle the seed-based dataset layout (dataset_tiny stores each seed in its own subdirectory)
seed = data_config.get("seed")
if seed is not None and "dataset_tiny" in str(DATASET_LOCAL_PATH):
    DATASET_LOCAL_PATH = DATASET_LOCAL_PATH / f"seed{seed}"

print(f"Dataset path (from data config): {DATASET_LOCAL_PATH}")
if seed is not None:
    print(f"Using seed: {seed}")


## Step P1-3.2: Verify Local Dataset

Verify that the dataset directory (specified by `local_path` in the data config) exists and contains the required files. The dataset path is loaded from the centralized data configuration in Step P1-3.1.


In [None]:
# P1-3.2: Verify Local Dataset
# The dataset path comes from the data config's local_path field (loaded in Step P1-3.1).
# This ensures the dataset location is controlled by centralized configuration.
# Note: train.json is required, but validation.json is optional (matches training script behavior).

REQUIRED_FILE = "train.json"
OPTIONAL_FILE = "validation.json"

if not DATASET_LOCAL_PATH.exists():
    raise FileNotFoundError(
        f"Dataset directory not found: {DATASET_LOCAL_PATH}\n"
        f"This path comes from the data config's 'local_path' field.\n"
        f"If you need to create the dataset, run the notebook: notebooks/00_make_tiny_dataset.ipynb"
    )

# Check required file
train_file = DATASET_LOCAL_PATH / REQUIRED_FILE
if not train_file.exists():
    raise FileNotFoundError(
        f"Required dataset file not found: {train_file}\n"
        f"This path comes from the data config's 'local_path' field.\n"
        f"If you need to create it, run the notebook: notebooks/00_make_tiny_dataset.ipynb"
    )

# Check optional file
val_file = DATASET_LOCAL_PATH / OPTIONAL_FILE
has_validation = val_file.exists()

print(f"✓ Dataset directory found: {DATASET_LOCAL_PATH}")
print(f"  (from data config: {data_config.get('name', 'unknown')} v{data_config.get('version', 'unknown')})")

train_size = train_file.stat().st_size
print(f"  ✓ {REQUIRED_FILE} ({train_size:,} bytes)")

if has_validation:
    val_size = val_file.stat().st_size
    print(f"  ✓ {OPTIONAL_FILE} ({val_size:,} bytes)")
else:
    print(f"  ⚠ {OPTIONAL_FILE} not found (optional - training will proceed without validation set)")


## Step P1-3.2.1: Optional Train/Test Split

**Optional step**: Create a train/test split if `test.json` is missing. This is useful when you only have `train.json` and `validation.json` and want to create a separate test set.

**⚠ WARNING**: This will overwrite `train.json` with the split version. Only enable if you want to create a permanent train/test split.
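
For intuition, a plain (non-stratified) ratio split could look like the sketch below. `simple_ratio_split` is a hypothetical stand-in written for this note; it is not the repository's `split_train_test`, whose stratification and entity-type handling are not reproduced here.

```python
import random
from typing import Any, Dict, List, Tuple


def simple_ratio_split(
    dataset: List[Dict[str, Any]],
    train_ratio: float = 0.8,
    random_seed: int = 42,
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Shuffle a copy of the dataset with a fixed seed, then cut at train_ratio."""
    shuffled = list(dataset)  # copy so the caller's list is left untouched
    random.Random(random_seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Ten dummy samples with an 0.8 ratio give 8 training and 2 test samples.
samples = [{"id": i} for i in range(10)]
train_part, test_part = simple_ratio_split(samples)
print(len(train_part), len(test_part))  # 8 2
```

Because the seed is fixed, re-running the split on the same input yields the same partition, which is what makes overwriting `train.json` reproducible rather than arbitrary.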


In [None]:
# Optional: create train/test split if test.json is missing
# WARNING: This will overwrite train.json with the split version
# Only enable if you want to create a permanent train/test split
import json
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional

from data.loaders.dataset_loader import split_train_test, save_split_files

CREATE_TEST_SPLIT = False  # Set True to create test.json when absent (WARNING: overwrites train.json)

train_file = DATASET_LOCAL_PATH / "train.json"
val_file = DATASET_LOCAL_PATH / "validation.json"
test_file = DATASET_LOCAL_PATH / "test.json"

if CREATE_TEST_SPLIT and not test_file.exists():
    # Backup original train.json before overwriting
    backup_file = DATASET_LOCAL_PATH / "train.json.backup"
    if train_file.exists() and not backup_file.exists():
        import shutil
        shutil.copy2(train_file, backup_file)
        print(f"⚠ Backed up original train.json to {backup_file}")
    
    full_dataset = []
    # Start with train data; optionally include validation to maximize coverage
    with open(train_file, "r", encoding="utf-8") as f:
        full_dataset.extend(json.load(f))
    if val_file.exists():
        with open(val_file, "r", encoding="utf-8") as f:
            full_dataset.extend(json.load(f))

    split_cfg = configs.get("data", {}).get("splitting", {})
    train_ratio = split_cfg.get("train_test_ratio", 0.8)
    stratified = split_cfg.get("stratified", False)
    random_seed = split_cfg.get("random_seed", 42)
    entity_types = configs.get("data", {}).get("schema", {}).get("entity_types", [])

    print(f"Creating train/test split (train_ratio={train_ratio}, stratified={stratified})...")
    print(f"⚠ WARNING: This will overwrite train.json with {int(len(full_dataset) * train_ratio)} samples")
    
    new_train, new_test = split_train_test(
        dataset=full_dataset,
        train_ratio=train_ratio,
        stratified=stratified,
        random_seed=random_seed,
        entity_types=entity_types,
    )

    save_split_files(DATASET_LOCAL_PATH, new_train, new_test)
    print(f"✓ Wrote train.json ({len(new_train)}) and test.json ({len(new_test)})")
elif test_file.exists():
    print(f"✓ Found existing test.json at {test_file}")
else:
    print("⚠ test.json not found. Set CREATE_TEST_SPLIT=True to generate a split.")


## Step P1-3.3: Setup Local Environment

Verify GPU availability, set up MLflow tracking (local file store), and check that key dependencies are installed. This step ensures the local environment is ready for training.
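
As a lightweight complement to the checks in the cells that follow, the sketch below reports which packages are installed without importing them (importing `torch` or `transformers` can be slow). `installed_versions` is a hypothetical helper, and it assumes each import name matches its distribution name, which holds for the four packages used here.

```python
from importlib import metadata, util
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        if util.find_spec(name) is None:
            versions[name] = None  # not importable at all
            continue
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name
            versions[name] = "unknown"
    return versions


report = installed_versions(["torch", "transformers", "mlflow", "optuna"])
missing = [name for name, version in report.items() if version is None]
print("Missing packages:", missing or "none")
```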


In [None]:
import sys
import torch

DEFAULT_DEVICE = "cuda"

env_config = configs["env"]
device_type = env_config.get("compute", {}).get("device", DEFAULT_DEVICE)

# Fallback to CPU if CUDA is requested but not available
if device_type == "cuda" and not torch.cuda.is_available():
    print("⚠ Warning: CUDA device requested but not available. Falling back to CPU.")
    if not IS_LOCAL:
        print("  In Colab, select a GPU runtime: Runtime > Change runtime type > GPU")
        print("  On Kaggle, enable a GPU accelerator in the notebook settings")
    device_type = "cpu"


In [None]:
from pathlib import Path
import mlflow
from common.shared.mlflow_setup import setup_mlflow_from_config

# Setup MLflow from config (automatically uses Azure ML if enabled in config/mlflow.yaml)
# To enable Azure ML Workspace tracking:
# 1. Edit config/mlflow.yaml and set azure_ml.enabled: true
# 2. Set environment variables: AZURE_SUBSCRIPTION_ID and AZURE_RESOURCE_GROUP
setup_mlflow_from_config(
    experiment_name="placeholder",  # Will be set per HPO run
    config_dir=CONFIG_DIR,
)

# Read the tracking URI after setup so it reflects the configured backend
mlflow_tracking_uri = mlflow.get_tracking_uri()
if mlflow_tracking_uri:
    suffix = "..." if len(mlflow_tracking_uri) > 80 else ""
    print(f"MLflow tracking URI: {mlflow_tracking_uri[:80]}{suffix}")
else:
    print("Warning: MLflow tracking URI not set")

In [None]:
# For Kaggle only - install specific package versions required for Optuna checkpointing
if IN_KAGGLE:
    %pip install "SQLAlchemy<2.0.0" "alembic<1.13.0" "optuna<4.0.0" --quiet
else:
    print("Skipping Kaggle-specific package installation (not running on Kaggle)")


In [None]:
try:
    import mlflow
    import transformers
    import optuna
except ImportError as e:
    # Chain the original exception so the failing import stays visible in the traceback
    raise ImportError(f"Required package not installed: {e}") from e

REQUIRED_PACKAGES = {
    "torch": torch,
    "transformers": transformers,
    "mlflow": mlflow,
    "optuna": optuna,
}

for name, module in REQUIRED_PACKAGES.items():
    if not hasattr(module, "__version__"):
        raise ImportError(
            f"Required package '{name}' is not properly installed")

## Step P1-3.4: The Sweep (HPO) - Local with Optuna

Run the full hyperparameter optimization sweep using Optuna to systematically search for the best model configuration. Uses the production HPO configuration with more trials than the dry run.

**Note on K-Fold Cross-Validation:**
- When k-fold CV is enabled (`k_fold.enabled: true`), each trial trains **k models** (one per fold) and returns the **average metric** across folds
- The number of **trials** is controlled by `sampling.max_trials` (e.g., 2 trials in smoke.yaml)
- With k=5 folds and 2 trials: **2 trials × 5 folds = 10 model trainings total**
- K-fold CV provides more robust hyperparameter evaluation but increases compute time (k× per trial)
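
The arithmetic above can be checked with a few lines of Python. The per-fold scores are made-up numbers used only to illustrate how a trial's metric is averaged across folds.

```python
from statistics import mean

max_trials = 2  # sampling.max_trials in the example above
n_splits = 5    # number of folds (k)

# Each trial trains one model per fold.
total_trainings = max_trials * n_splits
print(total_trainings)  # 10

# Illustrative per-fold scores for a single trial (not real results)
fold_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
trial_score = mean(fold_scores)  # the value reported for the trial
print(f"Trial score: {trial_score:.2f}")
```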

**Note on Checkpoint and Resume:**
- When `checkpoint.enabled: true` is set in the HPO config, the system automatically saves the Optuna study state to a SQLite database
- This allows interrupted HPO runs to be resumed from the last checkpoint
- The checkpoint is automatically detected and loaded on the next run if `auto_resume: true` (default)
- Platform-specific paths are handled automatically (local, Colab, Kaggle)
- **Selective Checkpoint Saving**: When `checkpoint.save_only_best: true` is set, only best trial checkpoints are saved locally (reduces storage from ~30 GB to ~300 MB for 100 trials)
- **MLflow Checkpoint Logging**: When `mlflow.log_best_checkpoint: true` is set, the best trial checkpoint is automatically logged to MLflow after HPO completes (artifact path: `best_trial_checkpoint`)
- **Refit Training**: When `refit.enabled: true` is set (default), after HPO completes, the best trial is automatically retrained on the full training dataset. This produces a canonical checkpoint in `trial_<n>_<ts>/refit/checkpoint/` that is preferred over fold checkpoints for benchmarking and production use.
- See `docs/HPO_CHECKPOINT_RESUME.md` for detailed documentation
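
The sketch below gathers the flags from the notes above into a Python dict and shows how a per-backbone SQLite path can be derived from a `storage_path` template, following the placeholder substitution used later in this notebook. The nesting of `auto_resume` under `checkpoint`, the two template values, and the `distilbert` backbone name are assumptions made for illustration; `resolve_storage_path` is a hypothetical helper, not a repository function.

```python
from pathlib import Path

# Assumed shape of the relevant HPO config keys (normally read from YAML)
hpo_config_sketch = {
    "checkpoint": {
        "enabled": True,
        "auto_resume": True,  # assumed to live under "checkpoint"
        "save_only_best": True,
        "study_name": "hpo_{backbone}",           # hypothetical template
        "storage_path": "{study_name}/study.db",  # hypothetical template
    },
    "mlflow": {"log_best_checkpoint": True},
    "refit": {"enabled": True},
}


def resolve_storage_path(base_dir: Path, checkpoint_cfg: dict, backbone: str) -> Path:
    """Fill the {backbone} and {study_name} placeholders in storage_path."""
    study_name = (checkpoint_cfg.get("study_name") or "").replace("{backbone}", backbone)
    template = checkpoint_cfg.get("storage_path", "{backbone}/study.db")
    relative = template.replace("{backbone}", backbone)
    if study_name:
        relative = relative.replace("{study_name}", study_name)
    return base_dir / relative


db_path = resolve_storage_path(
    Path("outputs/hpo/colab/distilbert"),
    hpo_config_sketch["checkpoint"],
    backbone="distilbert",
)
print(db_path.as_posix())  # outputs/hpo/colab/distilbert/hpo_distilbert/study.db
```

With Optuna, a file like this is typically passed to `optuna.create_study` as a `sqlite:///...` storage URL together with `load_if_exists=True`, which is what allows an interrupted study to pick up its earlier trials. Whether `run_local_hpo_sweep` does exactly this is not shown in this notebook.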


In [None]:
from pathlib import Path
from common.constants import STAGE_HPO
from training.hpo import run_local_hpo_sweep

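# Use new paths module (orchestration.paths is deprecated)
from infrastructure.paths import resolve_output_path
# Imported here because validate_path_before_mkdir is used below in this cell
from infrastructure.paths.validation import validate_path_before_mkdir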

# Use centralized HPO root from paths.yaml (respects env_overrides / storage_env)
HPO_ROOT = resolve_output_path(ROOT_DIR, CONFIG_DIR, "hpo")

# Keep fold_splits as a study-level meta artifact, not mixed with trials
HPO_META_DIR = validate_path_before_mkdir(HPO_ROOT / "_meta", context="directory")
HPO_META_DIR.mkdir(parents=True, exist_ok=True)


In [None]:
# Resolve the HPO config for this stage.
# A stage-specific hpo_config override takes precedence; otherwise the default
# HPO config is reloaded from file instead of being read from the cached configs dict.
from infrastructure.naming.experiments import get_stage_config
from common.shared.yaml_utils import load_yaml

hpo_stage_config = get_stage_config(experiment_config, STAGE_HPO)
hpo_config_override = hpo_stage_config.get("hpo_config")

if hpo_config_override:
    # Load stage-specific HPO config override
    hpo_config_path = CONFIG_DIR / hpo_config_override
    hpo_config = load_yaml(hpo_config_path)
    print(f"✓ Using stage-specific HPO config for hpo: {hpo_config_override}")
else:
    # Use default HPO config from top-level experiment config
    # Always reload default HPO config from file (don't use cached configs["hpo"])
    # This ensures changes to the YAML file are picked up even if configs dict wasn't reloaded
    # This is especially important in Colab where configs might be cached in memory
    # after editing YAML files without restarting the kernel
    hpo_config_path = experiment_config.hpo_config
    hpo_config = load_yaml(hpo_config_path)
    print(f"✓ Using default HPO config (reloaded from file): {experiment_config.hpo_config.name}")
train_config = configs["train"]
backbone_values = hpo_config["search_space"]["backbone"]["values"]


### Setup K-Fold Splits and Google Drive Backup for HPO Trials

**K-Fold Cross-Validation Setup**: If k-fold CV is enabled in the HPO config, create and save fold splits before starting the sweep.
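
To make the idea of fold splits concrete, here is a minimal index-based sketch. `kfold_indices` is a hypothetical helper written for this note; it does not reproduce the stratification or entity-type logic of the repository's `create_kfold_splits`.

```python
import random
from typing import List, Tuple


def kfold_indices(
    n_samples: int,
    k: int,
    random_seed: int = 42,
    shuffle: bool = True,
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, val_indices) pairs covering every sample once as validation."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(random_seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, val_idx))
    return splits


splits = kfold_indices(n_samples=12, k=5)
print([len(val) for _, val in splits])  # [3, 3, 2, 2, 2]
```

Saving the splits once, before the sweep starts, means every trial is evaluated on the same folds, so differences between trials come from the hyperparameters rather than from the partitioning.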

**Colab-specific feature**: Configure automatic backup of each HPO trial to Google Drive immediately after completion. This prevents data loss if the Colab session disconnects during long-running hyperparameter optimization sweeps.

**Note on Checkpoint Backup:**
- If `checkpoint.save_only_best: true` is enabled, only best trial checkpoints are saved locally and backed up to Drive
- Each trial's `metrics.json` is always saved and backed up
- The best trial checkpoint is also automatically logged to MLflow (if `mlflow.log_best_checkpoint: true`)
- This reduces storage usage while ensuring the best model is always available
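
The backup policy described above can be sketched as follows. `backup_trial` is a hypothetical helper, and the trial layout (`metrics.json` next to a `checkpoint/` folder) is an assumption; the repository's own backup functions may organize files differently. In Colab, `backup_root` would point at a folder on the mounted Drive, while the demo below uses a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def backup_trial(trial_dir: Path, backup_root: Path, is_best: bool) -> Path:
    """Always copy metrics.json; copy the checkpoint folder only for the best trial."""
    target = backup_root / trial_dir.name
    target.mkdir(parents=True, exist_ok=True)

    metrics = trial_dir / "metrics.json"
    if metrics.exists():
        shutil.copy2(metrics, target / "metrics.json")

    checkpoint = trial_dir / "checkpoint"
    if is_best and checkpoint.is_dir():
        shutil.copytree(checkpoint, target / "checkpoint", dirs_exist_ok=True)
    return target


# Demo with a fake trial directory inside a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    trial = root / "hpo" / "trial_0"
    (trial / "checkpoint").mkdir(parents=True)
    (trial / "metrics.json").write_text('{"f1": 0.5}', encoding="utf-8")
    (trial / "checkpoint" / "weights.bin").write_text("stub", encoding="utf-8")

    copied = backup_trial(trial, root / "drive_backup", is_best=False)
    print(sorted(p.name for p in copied.iterdir()))  # ['metrics.json']
```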


In [None]:
from training.core.cv_utils import (
    create_kfold_splits,
    save_fold_splits,
    validate_splits,
)
from data.loaders import load_dataset
from infrastructure.paths import resolve_output_path

# Setup k-fold splits if enabled
k_fold_config = hpo_config.get("k_fold", {})
k_folds_enabled = k_fold_config.get("enabled", False)
fold_splits_file = None

if k_folds_enabled:
    n_splits = k_fold_config.get("n_splits", DEFAULT_K_FOLDS)
    random_seed = k_fold_config.get("random_seed", DEFAULT_RANDOM_SEED)
    shuffle = k_fold_config.get("shuffle", True)
    stratified = k_fold_config.get("stratified", False)
    entity_types = (
        configs.get("data", {})
        .get("schema", {})
        .get("entity_types", [])
    )

    print(f"Setting up {n_splits}-fold cross-validation splits...")

    full_dataset = load_dataset(str(DATASET_LOCAL_PATH))
    train_data = full_dataset.get("train", [])

    fold_splits = create_kfold_splits(
        dataset=train_data,
        k=n_splits,
        random_seed=random_seed,
        shuffle=shuffle,
        stratified=stratified,
        entity_types=entity_types,
    )

    # Optional validation to ensure rare entities appear across folds
    validate_splits(train_data, fold_splits, entity_types=entity_types)

    # Use centralized HPO root from paths.yaml (respects env_overrides / storage_env)
    HPO_ROOT = resolve_output_path(ROOT_DIR, CONFIG_DIR, "hpo")

    # Keep fold_splits as a study-level meta artifact
    HPO_META_DIR = validate_path_before_mkdir(
        HPO_ROOT / "_meta", context="directory"
    )
    HPO_META_DIR.mkdir(parents=True, exist_ok=True)

    fold_splits_file = HPO_META_DIR / "fold_splits.json"

    save_fold_splits(
        fold_splits,
        fold_splits_file,
        metadata={
            "k": n_splits,
            "random_seed": random_seed,
            "shuffle": shuffle,
            "stratified": stratified,
            "dataset_path": str(DATASET_LOCAL_PATH),
        },
    )

    print(f"✓ K-fold splits saved to: {fold_splits_file}")

else:
    print("K-fold CV disabled - using single train/validation split")

In [None]:
# Checkpoint functionality is now handled automatically by run_local_hpo_sweep
# when checkpoint.enabled: true is set in the HPO config.
# No manual backup callbacks are needed - SQLite persistence is built-in.


In [None]:
# # In a Kaggle notebook cell
# !cd /kaggle/working/resume-ner-azureml && git fetch origin gg_final_training_2 && git checkout origin/gg_final_training_2 -- src/train.py src/training/trainer.py

In [None]:
print("data_config:", configs.get("data"))
print("hpo_config keys:", hpo_config.keys())
print("train_config keys:", train_config.keys())

In [None]:
# Extract checkpoint configuration from HPO config
checkpoint_config = hpo_config.get("checkpoint", {})

hpo_studies = {}
k_folds_param = k_fold_config.get("n_splits", DEFAULT_K_FOLDS) if k_folds_enabled else None

# Use new centralized naming system for HPO
# Build base output directory: outputs/hpo/<env>/<model>/
# Trial-specific paths will be created by run_local_hpo_sweep as subdirectories

# Import required functions
from pathlib import Path
from common.constants import STAGE_HPO
from training.hpo import run_local_hpo_sweep
from infrastructure.naming.experiments import build_mlflow_experiment_name
from infrastructure.paths.validation import validate_path_before_mkdir
from common.shared.platform_detection import detect_platform

# Ensure environment is defined
environment = detect_platform()
print(f"Detected environment: {environment}")

for backbone in backbone_values:
    mlflow_experiment_name = build_mlflow_experiment_name(
        experiment_config.name, STAGE_HPO, backbone
    )
    
    # Short model name: text before the first "-" (the whole string if there is none)
    backbone_name = backbone.split("-")[0]
    
    # Build base HPO directory using new structure: outputs/hpo/<env>/<model>/
    backbone_output_dir = ROOT_DIR / "outputs" / "hpo" / environment / backbone_name
    backbone_output_dir = validate_path_before_mkdir(backbone_output_dir, context="directory")
    backbone_output_dir.mkdir(parents=True, exist_ok=True)
    
    print(f"✓ HPO output directory: {backbone_output_dir}")
    
    # Create restore function for HPO checkpoint if checkpointing enabled and BACKUP_ENABLED
    restore_fn = None
    if checkpoint_config.get("enabled", False) and BACKUP_ENABLED:
        # Resolve study_name from checkpoint_config (same logic as create_study_name)
        study_name_template = checkpoint_config.get("study_name") or hpo_config.get("study_name")
        study_name = None
        if study_name_template:
            study_name = study_name_template.replace("{backbone}", backbone)
        
        # Resolve storage_path with both {backbone} and {study_name} placeholders
        storage_path_template = checkpoint_config.get("storage_path", "{backbone}/study.db")
        storage_path_str = storage_path_template.replace("{backbone}", backbone)
        if study_name:
            storage_path_str = storage_path_str.replace("{study_name}", study_name)
        expected_checkpoint = backbone_output_dir / storage_path_str
        
        def make_restore_fn(checkpoint_path):
            def restore_fn_inner(path: Path) -> bool:
                # Only restore if path matches expected checkpoint
                if path == checkpoint_path:
                    return ensure_restored_from_drive(checkpoint_path, is_directory=False)
                return False
            return restore_fn_inner
        
        restore_fn = make_restore_fn(expected_checkpoint)
    
    # Use standard run_local_hpo_sweep with checkpoint_config
    # checkpoint.enabled persists the Optuna study via SQLite; the Drive backup below copies that state off the session disk
    study = run_local_hpo_sweep(
        dataset_path=str(DATASET_LOCAL_PATH),
        config_dir=CONFIG_DIR,
        backbone=backbone,
        hpo_config=hpo_config,
        train_config=train_config,
        output_dir=backbone_output_dir,
        mlflow_experiment_name=mlflow_experiment_name,
        k_folds=k_folds_param,
        fold_splits_file=fold_splits_file,
        checkpoint_config=checkpoint_config,
        restore_from_drive=restore_fn,
        data_config=configs.get("data"),
        benchmark_config=configs.get("benchmark"),
    )
    # Backup HPO study.db and study folder to Drive
    # Note: HPO backup function may still be in orchestration.jobs.hpo
    from orchestration.jobs.hpo.local.backup import backup_hpo_study_to_drive
    
    backup_hpo_study_to_drive(
        backbone=backbone,
        backbone_output_dir=backbone_output_dir,
        checkpoint_config=checkpoint_config,
        hpo_config=hpo_config,
        backup_to_drive=backup_to_drive,
        backup_enabled=BACKUP_ENABLED,
    )

    # Store study in hpo_studies dict (must be inside loop!)
    hpo_studies[backbone] = study



In [None]:
# Generate missing trial_meta.json files for existing trials
from training.hpo.trial.meta import generate_missing_trial_meta_for_all_studies

if BACKUP_ENABLED and "hpo_studies" in locals():
    total_created = generate_missing_trial_meta_for_all_studies(
        hpo_studies=hpo_studies,  # existence already checked by the enclosing if
        backbone_values=backbone_values,
        root_dir=ROOT_DIR,
        environment=environment,
        hpo_config=hpo_config,
        data_config=data_config if "data_config" in locals() else None,
        backup_enabled=BACKUP_ENABLED,
    )
    print(f"\n[OK] Total: Created {total_created} trial_meta.json files")
else:
    print("[INFO] Skipping trial_meta.json generation (BACKUP_ENABLED=False or hpo_studies not available)")



In [None]:
from evaluation.selection.study_summary import print_study_summaries
from common.shared.platform_detection import detect_platform

# Get environment if not already set
if 'environment' not in locals():
    environment = detect_platform()

# Print study summaries using the module
print_study_summaries(
    hpo_studies=hpo_studies if "hpo_studies" in locals() else None,
    backbone_values=backbone_values if "backbone_values" in locals() else [],
    hpo_config=hpo_config,
    root_dir=ROOT_DIR,
    environment=environment,
)
