# Phase 1: Training Orchestration (Google Colab & Kaggle)

This notebook orchestrates all training activities for **Google Colab or Kaggle execution** with GPU compute support.

## Important

- This notebook **executes training in Google Colab or Kaggle** (not on Azure ML)
- All computation happens on the platform's GPU
- **Storage & Persistence**:
  - **Google Colab**: Checkpoints are automatically saved to Google Drive for persistence across sessions
  - **Kaggle**: Outputs in `/kaggle/working/` are automatically persisted - no manual backup needed
- The notebook must be **re-runnable end-to-end**
- Uses the dataset path specified in the data config (from `config/data/*.yaml`), typically pointing to a local folder included in the repository
- **Session Management**:
  - **Colab**: Sessions timeout after 12-24 hours (depending on Colab plan). Checkpoints are saved to Drive automatically.
  - **Kaggle**: Sessions have time limits based on your plan. All outputs are automatically saved.


## Step 1: Environment Detection

The notebook automatically detects the execution environment (local, Google Colab, or Kaggle) and adapts its behavior accordingly.


In [1]:
import os
from pathlib import Path

# Detect execution environment
IN_COLAB = "COLAB_GPU" in os.environ or "COLAB_TPU" in os.environ
IN_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
IS_LOCAL = not IN_COLAB and not IN_KAGGLE

# Set platform-specific constants
if IN_COLAB:
    PLATFORM = "colab"
    BASE_DIR = Path("/content")
    BACKUP_ENABLED = True
elif IN_KAGGLE:
    PLATFORM = "kaggle"
    BASE_DIR = Path("/kaggle/working")
    BACKUP_ENABLED = False
else:
    PLATFORM = "local"
    BASE_DIR = None  # Will use Path.cwd() instead
    BACKUP_ENABLED = False

print(f"✓ Detected environment: {PLATFORM.upper()}")
print(f"Platform: {PLATFORM}")
if BASE_DIR:
    print(f"Base directory: {BASE_DIR}")
else:
    print(f"Base directory: Will use current working directory")
print(f"Backup enabled: {BACKUP_ENABLED}")


✓ Detected environment: LOCAL
Platform: local
Base directory: Will use current working directory
Backup enabled: False


## Step 2: Repository Setup

**Note**: Repository setup is only needed for Colab/Kaggle environments. Local environments should already have the repository cloned.

### For Colab/Kaggle: Clone from Git or Upload Files

Choose one of the following options:

**Option A: Clone from Git (Recommended)**

If your repository is on GitHub/GitLab, clone it:

**For Google Colab:**
```python
!git clone -b gg https://github.com/longdang193/resume-ner-azureml.git /content/resume-ner-azureml
```

**For Kaggle:**
```python
!git clone -b gg https://github.com/longdang193/resume-ner-azureml.git /kaggle/working/resume-ner-azureml
```

**Option B: Upload Files**

**For Google Colab:**
1. Use the Colab file browser (folder icon on left sidebar)
2. Upload your project files to `/content/resume-ner-azureml/`
3. Ensure the directory structure matches: `src/`, `config/`, `notebooks/`, etc.

**For Kaggle:**
1. Use the Kaggle file browser (Data tab)
2. Upload your project files to `/kaggle/working/resume-ner-azureml/`
3. Ensure the directory structure matches: `src/`, `config/`, `notebooks/`, etc.

### For Local: Repository Already Exists

Local environments should have the repository already cloned. The notebook will automatically detect the repository location.


In [2]:
# Repository setup - only needed for Colab/Kaggle
if not IS_LOCAL:
    if IN_KAGGLE:
        # For Kaggle
        !git clone -b gg https://github.com/longdang193/resume-ner-azureml.git /kaggle/working/resume-ner-azureml
    elif IN_COLAB:
        # For Google Colab
        !git clone -b gg https://github.com/longdang193/resume-ner-azureml.git /content/resume-ner-azureml
else:
    print("✓ Local environment detected - assuming repository already exists")

✓ Local environment detected - assuming repository already exists


### Verify Repository Setup

Verify the repository structure exists:


In [3]:
import sys
from pathlib import Path

# Unified path setup for all environments
if IS_LOCAL:
    # Local: assume notebook is in notebooks/ directory
    NOTEBOOK_DIR = Path.cwd()
    ROOT_DIR = NOTEBOOK_DIR.parent
else:
    # Colab/Kaggle: use fixed paths
    ROOT_DIR = BASE_DIR / "resume-ner-azureml"

SRC_DIR = ROOT_DIR / "src"
CONFIG_DIR = ROOT_DIR / "config"
NOTEBOOK_DIR = ROOT_DIR / "notebooks"

# Verify repository structure
if not ROOT_DIR.exists():
    if IS_LOCAL:
        raise FileNotFoundError(
            f"Repository not found at {ROOT_DIR}\n"
            f"Please ensure you're running this notebook from the notebooks/ directory of the repository."
        )
    else:
        raise FileNotFoundError(
            f"Repository not found at {ROOT_DIR}\n"
            f"Please run Step 2 to clone or upload the repository."
        )

required_dirs = ["src", "config", "notebooks"]
missing_dirs = [d for d in required_dirs if not (ROOT_DIR / d).exists()]

if missing_dirs:
    raise FileNotFoundError(
        f"Missing required directories: {missing_dirs}\n"
        f"Please ensure the repository structure is correct."
    )

# Add to Python path
sys.path.insert(0, str(ROOT_DIR))
sys.path.insert(0, str(SRC_DIR))

print(f"✓ Repository found at: {ROOT_DIR}")
print(f"✓ Required directories found: {required_dirs}")
print(f"Notebook directory: {NOTEBOOK_DIR}")
print(f"Project root: {ROOT_DIR}")
print(f"Source directory: {SRC_DIR}")
print(f"Config directory: {CONFIG_DIR}")


✓ Repository found at: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml
✓ Required directories found: ['src', 'config', 'notebooks']
Notebook directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\notebooks
Project root: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml
Source directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\src
Config directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config


## Step 3: Install Dependencies

**For Local**: Use conda environment (instructions below).  
**For Colab/Kaggle**: Install packages via pip (automated below).

### Local Environment Setup

For local execution, create and activate a conda environment:

1. Open a terminal in the project root
2. Create the conda environment: `conda env create -f config/environment/conda.yaml`
3. Activate: `conda activate resume-ner-training`
4. Restart the kernel after activation

### Colab/Kaggle: Automated Installation

PyTorch is usually pre-installed in Colab/Kaggle, but we'll verify and install other required packages.


In [4]:
import torch

# Check PyTorch version and GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    device_count = torch.cuda.device_count()
    print(f"Visible GPUs: {device_count}")
    for i in range(device_count):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

# Verify PyTorch version meets requirements (>=2.6.0)
torch_version = tuple(map(int, torch.__version__.split('.')[:2]))
if torch_version < (2, 6):
    print(f"⚠ Warning: PyTorch {torch.__version__} may not meet requirements (>=2.6.0)")
    if not IS_LOCAL:
        print("Consider upgrading: !pip install torch>=2.6.0 --upgrade")
else:
    print("✓ PyTorch version meets requirements")


PyTorch version: 2.9.1
CUDA available: True
Visible GPUs: 1
  GPU 0: Quadro T1000
✓ PyTorch version meets requirements


In [5]:
# Install required packages
if IS_LOCAL:
    print("For local environment, please:")
    print("1. Create conda environment: conda env create -f config/environment/conda.yaml")
    print("2. Activate: conda activate resume-ner-training")
    print("3. Restart kernel after activation")
    print("\nIf you've already done this, you can continue to the next cell.")
    print("\nInstalling Azure ML SDK (required for imports)...")
    # Install Azure ML packages even for local (in case conda env not activated)
    %pip install azure-ai-ml>=1.0.0 --quiet
    %pip install azureml-defaults --quiet
    %pip install azureml-mlflow --quiet
else:
    # Core ML libraries
    %pip install transformers>=4.35.0,<5.0.0 --quiet
    %pip install safetensors>=0.4.0 --quiet
    %pip install datasets>=2.12.0 --quiet

    # ML utilities
    %pip install numpy>=1.24.0,<2.0.0 --quiet
    %pip install pandas>=2.0.0 --quiet
    %pip install scikit-learn>=1.3.0 --quiet

    # Utilities
    %pip install pyyaml>=6.0 --quiet
    %pip install tqdm>=4.65.0 --quiet
    %pip install seqeval>=1.2.2 --quiet
    %pip install sentencepiece>=0.1.99 --quiet

    # Experiment tracking
    %pip install mlflow --quiet
    %pip install optuna --quiet

    # Azure ML SDK (required for orchestration imports)
    %pip install azure-ai-ml>=1.0.0 --quiet
    %pip install azureml-defaults --quiet
    %pip install azureml-mlflow --quiet

    # ONNX support
    %pip install onnxruntime --quiet
    %pip install onnx>=1.16.0 --quiet
    %pip install onnxscript>=0.1.0 --quiet

    print("✓ All dependencies installed")


For local environment, please:
1. Create conda environment: conda env create -f config/environment/conda.yaml
2. Activate: conda activate resume-ner-training
3. Restart kernel after activation

If you've already done this, you can continue to the next cell.

Installing Azure ML SDK (required for imports)...
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Step 4: Setup Paths and Import Paths

Python paths are already configured in Step 2. This section verifies the setup.


In [6]:
# Environment detection and platform configuration
# Note: This cell is a duplicate of Cell 2. If Cell 2 was already executed, these variables are already set.
# This cell ensures they're set even if Cell 2 was skipped.
import os
from pathlib import Path

# Detect execution environment
IN_COLAB = "COLAB_GPU" in os.environ or "COLAB_TPU" in os.environ
IN_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
IS_LOCAL = not IN_COLAB and not IN_KAGGLE

# Set platform-specific constants (only if not already set)
if 'PLATFORM' not in globals():
    if IN_COLAB:
        PLATFORM = "colab"
        BASE_DIR = Path("/content")
        BACKUP_ENABLED = True
        print("✓ Detected: Google Colab environment")
    elif IN_KAGGLE:
        PLATFORM = "kaggle"
        BASE_DIR = Path("/kaggle/working")
        BACKUP_ENABLED = False  # Kaggle outputs are automatically persisted
        print("✓ Detected: Kaggle environment")
    else:
        PLATFORM = "local"
        BASE_DIR = None  # Will use Path.cwd() instead
        BACKUP_ENABLED = False
        print("✓ Detected: Local environment")

if 'PLATFORM' in globals():
    print(f"Platform: {PLATFORM}")
    if BASE_DIR:
        print(f"Base directory: {BASE_DIR}")
    else:
        print(f"Base directory: Will use current working directory")
    print(f"Backup enabled: {BACKUP_ENABLED}")


Platform: local
Base directory: Will use current working directory
Backup enabled: False


In [7]:
import os
import sys
from pathlib import Path

# Setup paths (ROOT_DIR should be set in Cell 2)
# If not, set it here
if 'ROOT_DIR' not in globals():
    if IN_COLAB:
        ROOT_DIR = Path("/content/resume-ner-azureml")
    elif IN_KAGGLE:
        ROOT_DIR = Path("/kaggle/working/resume-ner-azureml")
    else:
        ROOT_DIR = Path("/content/resume-ner-azureml")  # Default to Colab path

SRC_DIR = ROOT_DIR / "src"
CONFIG_DIR = ROOT_DIR / "config"
NOTEBOOK_DIR = ROOT_DIR / "notebooks"

# Add to Python path
sys.path.insert(0, str(ROOT_DIR))
sys.path.insert(0, str(SRC_DIR))

print("Notebook directory:", NOTEBOOK_DIR)
print("Project root:", ROOT_DIR)
print("Source directory:", SRC_DIR)
print("Config directory:", CONFIG_DIR)
print("Platform:", PLATFORM if 'PLATFORM' in globals() else "unknown")
print("In Colab:", IN_COLAB if 'IN_COLAB' in globals() else False)
print("In Kaggle:", IN_KAGGLE if 'IN_KAGGLE' in globals() else False)


Notebook directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\notebooks
Project root: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml
Source directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\src
Config directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config
Platform: local
In Colab: False
In Kaggle: False


## Step 4: Mount Google Drive

Mount Google Drive to enable checkpoint persistence across Colab sessions. Checkpoints will be automatically saved to Drive after training completes.


In [8]:
# Google Drive backup/restore functionality
# Uses the DriveBackupStore from orchestration.drive_backup module
# The drive_store is created in Cell 15 (after mounting)

# Backward-compatible wrapper functions (delegate to drive_store)
# These maintain the old API for gradual migration
from pathlib import Path

# Note: drive_store is created in Cell 15 (Mount Google Drive)
# If drive_store is None, backup/restore operations are disabled

def backup_to_drive(source_path: Path, is_directory: bool = False) -> bool:
    """
    Backward-compatible wrapper for drive_store.backup().
    
    Note: Prefer using drive_store.backup() directly for better error handling.
    """
    if not BACKUP_ENABLED or drive_store is None:
        return False
    
    if not source_path.exists():
        print(f"⚠ Warning: Source path does not exist: {source_path}")
        return False
    
    # Map is_directory to expect parameter
    expect = "dir" if is_directory else "file"
    result = drive_store.backup(source_path, expect=expect)
    
    if result.ok:
        print(result)
    else:
        print(f"⚠ Warning: Backup failed: {result.reason}")
    
    return result.ok

def restore_from_drive(local_path: Path, is_directory: bool = False) -> bool:
    """
    Backward-compatible wrapper for drive_store.restore().
    
    Note: Prefer using drive_store.restore() directly for better error handling.
    """
    if not BACKUP_ENABLED or drive_store is None:
        return False
    
    # Map is_directory to expect parameter
    expect = "dir" if is_directory else "file"
    result = drive_store.restore(local_path, expect=expect)
    
    if result.ok:
        print(result)
    else:
        print(f"⚠ Warning: Restore failed: {result.reason}")
    
    return result.ok

def check_drive_backup_exists(local_path: Path) -> bool:
    """
    Backward-compatible wrapper for drive_store.backup_exists().
    """
    if not BACKUP_ENABLED or drive_store is None:
        return False
    
    return drive_store.backup_exists(local_path)

def restore_if_missing(local_path: Path, is_directory: bool = False) -> bool:
    """
    Restore from Drive only if local file/directory is missing.
    """
    if local_path.exists():
        return False
    
    return restore_from_drive(local_path, is_directory=is_directory)

def ensure_restored_from_drive(local_path: Path, is_directory: bool = False) -> bool:
    """
    Ensure file/directory exists locally, restoring from Drive if missing.
    
    This is the primary entry point for most use cases.
    """
    if not BACKUP_ENABLED or drive_store is None:
        return False
    
    # Map is_directory to expect parameter
    expect = "dir" if is_directory else "file"
    result = drive_store.ensure_local(local_path)
    
    if result.ok and result.action.value == "copied":
        print(result)
    
    return result.ok

print("✓ Backup/restore wrapper functions defined (using DriveBackupStore)")


✓ Backup/restore wrapper functions defined (using DriveBackupStore)


In [9]:
from pathlib import Path
from orchestration.drive_backup import create_colab_store

# Mount Google Drive and create backup store (Colab only - Kaggle doesn't need this)
# Uses centralized config from config/paths.yaml
DRIVE_BACKUP_DIR = None
drive_store = None

if IN_COLAB:
    drive_store = create_colab_store(ROOT_DIR, CONFIG_DIR)
    if drive_store:
        BACKUP_ENABLED = True
        DRIVE_BACKUP_DIR = drive_store.backup_root
        print(f"✓ Google Drive mounted")
        print(f"✓ Backup base directory: {DRIVE_BACKUP_DIR}")
        print(f"\nNote: All outputs/ will be mirrored to: {DRIVE_BACKUP_DIR / 'outputs'}")
    else:
        BACKUP_ENABLED = False
        print("⚠ Warning: Could not mount Google Drive. Backup to Google Drive will be disabled.")
elif IN_KAGGLE:
    print("✓ Kaggle environment detected - outputs are automatically persisted (no Drive mount needed)")
    BACKUP_ENABLED = False
else:
    print("⚠ Warning: Unknown environment. Backup to Google Drive will be disabled.")
    BACKUP_ENABLED = False


  from .autonotebook import tqdm as notebook_tqdm




## Step P1-3.1: Load Centralized Configs

Load and validate all configuration files. Configs are immutable and will be logged with each job for reproducibility.

**Note**: 
- **Local**: Config files should already exist in the repository
- **Colab/Kaggle**: Config files will be auto-created if missing (useful for fresh environments)


In [10]:
# Optional: Update repository from git (only for Colab/Kaggle if needed)
# Uncomment and run if you need to pull latest changes
# if not IS_LOCAL:
#     !cd {ROOT_DIR} && git fetch origin gg
#     !cd {ROOT_DIR} && git reset --hard origin/gg

In [11]:
# Write config files only if they don't exist (useful for Colab/Kaggle fresh environments)
# Local environments should have configs already in the repo
if IS_LOCAL:
    print("✓ Local environment - assuming config files already exist in repository")
else:
    # Create the experiment config directory if it doesn't exist
    experiment_config_dir = CONFIG_DIR / "experiment"
    experiment_config_dir.mkdir(parents=True, exist_ok=True)
    
    config_path = experiment_config_dir / "resume_ner_baseline.yaml"
    
    # Only write if file doesn't exist
    if not config_path.exists():
        config_content = """
experiment_name: "resume_ner_baseline"

# Relative to the top-level config directory
data_config: "data/resume_v1.yaml"
model_config: "model/distilbert.yaml"
train_config: "train.yaml"
hpo_config: "hpo/prod.yaml"      # default HPO config; stages can override if needed
env_config: "env/azure.yaml"
benchmark_config: "benchmark.yaml"

# High-level orchestration design:
# - Stages: smoke → hpo → training
# - Smoke and HPO stage backbones are controlled by the HPO config file (search_space.backbone.values)
# - Training stage can target specific backbones via stage config
# - AML experiment names are per-stage, optionally per-backbone

stages:
  smoke:
    # AML experiment base name for smoke tests
    aml_experiment: "resume-ner-smoke"
    # HPO config for smoke/dry run tests (uses smoke.yaml with reduced trials)
    hpo_config: "hpo/smoke.yaml"
    # Backbones are controlled by the HPO config file (hpo_config) via search_space.backbone.values

  hpo:
    # AML experiment base name for HPO sweeps
    aml_experiment: "resume-ner-hpo"
    # HPO config override for production HPO sweep (uses prod.yaml instead of default smoke.yaml)
    hpo_config: "hpo/prod.yaml"
    # Backbones are controlled by the HPO config file (hpo_config) via search_space.backbone.values

  training:
    # AML experiment base name for final single-run training
    aml_experiment: "resume-ner-train"
    # Final production backbone(s); typically one chosen after HPO
    backbones:
      - "distilbert"

# Optional naming policy for how AML experiments are derived per backbone.
# If true, the orchestrator should build experiment_name as:
#   "<aml_experiment>-<backbone>"
# otherwise it should use "<aml_experiment>" directly and rely on tags
# (stage/backbone) for grouping in AML.
naming:
  include_backbone_in_experiment: true
"""
        config_path.write_text(config_content)
        print(f"✓ Config file written to: {config_path}")
    else:
        print(f"✓ Config file already exists: {config_path}")

✓ Local environment - assuming config files already exist in repository


In [12]:
# Write HPO config file only if it doesn't exist (useful for Colab/Kaggle fresh environments)
# Local environments should have configs already in the repo
if IS_LOCAL:
    print("✓ Local environment - assuming HPO config files already exist in repository")
else:
    # Create the HPO config directory if it doesn't exist
    hpo_config_dir = CONFIG_DIR / "hpo"
    hpo_config_dir.mkdir(parents=True, exist_ok=True)
    
    config_path = hpo_config_dir / "prod.yaml"
    
    # Only write if file doesn't exist
    if not config_path.exists():
        config_content = """
search_space:
  backbone:
    type: "choice"
    values: ["distilroberta"]
  
  learning_rate:
    type: "loguniform"
    min: 1e-5
    max: 5e-5
  
  batch_size:
    type: "choice"
    values: [8, 16]
  
  dropout:
    type: "uniform"
    min: 0.1
    max: 0.3
  
  weight_decay:
    type: "loguniform"
    min: 0.001
    max: 0.1

sampling:
  algorithm: "random"
  max_trials: 20
  timeout_minutes: 960

early_termination:
  policy: "bandit"
  evaluation_interval: 1
  slack_factor: 0.2
  delay_evaluation: 2

objective:
  metric: "macro-f1"
  goal: "maximize"

# Selection strategy configuration for accuracy-speed tradeoff
selection:
  # Accuracy threshold for speed tradeoff (0.015 = 1.5% relative)
  # If two models are within this accuracy difference, prefer faster model
  # Set to null for accuracy-only selection (default behavior)
  accuracy_threshold: 0.015
  
  # Use relative threshold (percentage of best accuracy) vs absolute difference
  # Relative thresholds are more robust across different accuracy ranges
  # Default: true (recommended)
  use_relative_threshold: true
  
  # Minimum relative accuracy gain to justify slower model (optional)
  # If DeBERTa is < 2% better than DistilBERT, prefer DistilBERT
  # Set to null to disable this check
  min_accuracy_gain: 0.02

k_fold:
  enabled: true
  n_splits: 5
  random_seed: 42
  shuffle: true
  stratified: true

# Checkpoint configuration for HPO resume support
# Enables saving study state to SQLite database for resuming interrupted runs
checkpoint:
  enabled: true  # Set to true to enable checkpointing (useful for Colab/Kaggle)
  storage_path: "{backbone}/study.db"  # Relative to output_dir, {backbone} placeholder
  auto_resume: true  # Automatically resume if checkpoint exists (only if enabled=true)
"""
        config_path.write_text(config_content)
        print(f"✓ HPO config written to: {config_path}")
    else:
        print(f"✓ HPO config file already exists: {config_path}")

✓ Local environment - assuming HPO config files already exist in repository


In [13]:
# Write training config file only if it doesn't exist (useful for Colab/Kaggle fresh environments)
# Local environments should have configs already in the repo
if IS_LOCAL:
    print("✓ Local environment - assuming training config file already exists in repository")
else:
    # Ensure config directory exists
    CONFIG_DIR.mkdir(parents=True, exist_ok=True)
    
    config_path = CONFIG_DIR / "train.yaml"
    
    # Only write if file doesn't exist
    if not config_path.exists():
        config_content = """
# Global Training Defaults
# Applied to all training runs

training:
  epochs: 5
  batch_size: 12 
  gradient_accumulation_steps: 2
  learning_rate: 2e-5
  weight_decay: 0.01
  warmup_steps: 500
  max_grad_norm: 1.0
  # Data splitting and model-specific settings
  val_split_divisor: 10  # Divide train set by this to create validation split if none exists
  deberta_max_batch_size: 16  # Maximum batch size for DeBERTa models (memory constraints)
  warmup_steps_divisor: 10  # Divide total steps by this to cap warmup steps
  
  # EDA-based metric selection
  metric: "macro-f1"  # Class imbalance requires macro-f1
  metric_mode: "max"  # Maximize macro-f1
  
  early_stopping:
    enabled: true
    patience: 3
    min_delta: 0.001

logging:
  log_interval: 100
  eval_interval: 500
  save_interval: 1000

checkpointing:
  save_strategy: "steps"
  save_total_limit: 3
  load_best_model_at_end: true

# NOTE: Multi-GPU / DDP is optional and currently experimental. When enabled,
# the training code will use this section together with hardware detection to
# decide whether to run single-GPU vs multi-GPU. If no multiple GPUs or DDP
# backend are available, it will safely fall back to single-GPU.
distributed:
  enabled: true         # Set true to enable multi-GPU / DDP
  backend: "nccl"        # Typically 'nccl' for GPUs
  world_size: "auto"     # 'auto' = use all visible GPUs; or set an int
  init_method: "env://"  # Default init method; can be overridden if needed
  timeout_seconds: 1800  # Process group init timeout (in seconds)
"""
        config_path.write_text(config_content)
        print(f"✓ Training config written to: {config_path}")
    else:
        print(f"✓ Training config file already exists: {config_path}")

✓ Local environment - assuming training config file already exists in repository


### Define Constants

Define constants for file and directory names used throughout the notebook. Benchmark settings come from centralized config, not hard-coded here. These constants work across all environments.


In [14]:
# Import constants from centralized module
from orchestration import (
    STAGE_HPO,
    STAGE_TRAINING,
    METRICS_FILENAME,
    BENCHMARK_FILENAME,
    CHECKPOINT_DIRNAME,
    DEFAULT_RANDOM_SEED,
    DEFAULT_K_FOLDS,
)

from orchestration.jobs.tracking.mlflow_tracker import (
    MLflowSweepTracker,
    MLflowBenchmarkTracker,
    MLflowTrainingTracker,
    MLflowConversionTracker,
)


### Define Helper Functions

Reusable helper functions following DRY principle for common operations. These functions work across all environments (local, Colab, Kaggle).


In [15]:
# Import helper functions from consolidated modules (DRY principle)
from typing import List, Optional, Any
from orchestration import (
    build_mlflow_experiment_name,
    setup_mlflow_for_stage,
    run_benchmarking,
)
from shared import verify_output_file

# Wrapper function for run_benchmarking that uses notebook-specific paths
def run_benchmarking_local(
    checkpoint_dir: Path,
    test_data_path: Path,
    output_path: Path,
    batch_sizes: List[int],
    iterations: int,
    warmup_iterations: int,
    max_length: int = 512,
    device: Optional[str] = None,
    tracker: Optional[Any] = None,
    backbone: Optional[str] = None,
    benchmark_source: str = "final_training",
    study_key_hash: Optional[str] = None,
    trial_key_hash: Optional[str] = None,
) -> bool:
    """
    Run benchmarking on a model checkpoint (local notebook wrapper).
    
    This is a thin wrapper around orchestration.benchmark_utils.run_benchmarking
    that automatically uses the notebook's SRC_DIR and ROOT_DIR.
    
    Args:
        checkpoint_dir: Path to checkpoint directory.
        test_data_path: Path to test data JSON file.
        output_path: Path to output benchmark.json file.
        batch_sizes: List of batch sizes to test.
        iterations: Number of iterations per batch size.
        warmup_iterations: Number of warmup iterations.
        max_length: Maximum sequence length.
        device: Device to use (None = auto-detect).
        tracker: Optional MLflowBenchmarkTracker instance.
        backbone: Optional model backbone name.
        benchmark_source: Source of benchmark ("hpo_trial" or "final_training").
        study_key_hash: Optional study key hash for grouping tags.
        trial_key_hash: Optional trial key hash for grouping tags.
    
    Returns:
        True if successful, False otherwise.
    """
    return run_benchmarking(
        checkpoint_dir=checkpoint_dir,
        test_data_path=test_data_path,
        output_path=output_path,
        batch_sizes=batch_sizes,
        iterations=iterations,
        warmup_iterations=warmup_iterations,
        max_length=max_length,
        device=device,
        tracker=tracker,
        backbone=backbone,
        benchmark_source=benchmark_source,
        project_root=ROOT_DIR,
        study_key_hash=study_key_hash,
        trial_key_hash=trial_key_hash,
    )


In [16]:
from pathlib import Path
from typing import Any, Dict

from orchestration import EXPERIMENT_NAME
from orchestration.config_loader import (
    ExperimentConfig,
    compute_config_hashes,
    create_config_metadata,
    load_all_configs,
    load_experiment_config,
    snapshot_configs,
    validate_config_immutability,
)

# P1-3.1: Load Centralized Configs (local-only)
# Mirrors the Azure orchestration notebook, but does not create an Azure ML client.

if not CONFIG_DIR.exists():
    raise FileNotFoundError(f"Config directory not found: {CONFIG_DIR}")

experiment_config: ExperimentConfig = load_experiment_config(CONFIG_DIR, EXPERIMENT_NAME)
configs: Dict[str, Any] = load_all_configs(experiment_config)
config_hashes = compute_config_hashes(configs)
config_metadata = create_config_metadata(configs, config_hashes)

# Immutable snapshots for runtime mutation checks
original_configs = snapshot_configs(configs)
validate_config_immutability(configs, original_configs)

print(f"Loaded experiment: {experiment_config.name}")
print("Loaded config domains:", sorted(configs.keys()))
print("Config hashes:", config_hashes)
print("Config metadata:", config_metadata)

# Get dataset path from data config (centralized configuration)
# The local_path in the data config is relative to the config directory
data_config = configs["data"]
local_path_str = data_config.get("local_path", "../dataset")
DATASET_LOCAL_PATH = (CONFIG_DIR / local_path_str).resolve()

# Check if seed-based dataset structure (for dataset_tiny with seed subdirectories)
seed = data_config.get("seed")
if seed is not None and "dataset_tiny" in str(DATASET_LOCAL_PATH):
    DATASET_LOCAL_PATH = DATASET_LOCAL_PATH / f"seed{seed}"

print(f"Dataset path (from data config): {DATASET_LOCAL_PATH}")
if seed is not None:
    print(f"Using seed: {seed}")


Loaded experiment: resume_ner_baseline
Loaded config domains: ['benchmark', 'data', 'env', 'hpo', 'model', 'train']
Config hashes: {'data': 'e87b126b961fa20d', 'model': '5f90a66353401b44', 'train': '1f54e404fd78a76f', 'hpo': '1c47c1143764f7b2', 'env': '3e54b931c7640cf2', 'benchmark': '33da3b0fc59ff812'}
Config metadata: {'data_config_hash': 'e87b126b961fa20d', 'model_config_hash': '5f90a66353401b44', 'train_config_hash': '1f54e404fd78a76f', 'hpo_config_hash': '1c47c1143764f7b2', 'env_config_hash': '3e54b931c7640cf2', 'data_version': 'v3', 'model_backbone': 'distilbert-base-uncased'}
Dataset path (from data config): C:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\dataset_tiny\seed0
Using seed: 0


## Step P1-3.2: Verify Local Dataset

Verify that the dataset directory (specified by `local_path` in the data config) exists and contains the required files. The dataset path is loaded from the centralized data configuration in Step P1-3.1.


In [17]:
# P1-3.2: Verify Local Dataset
# The dataset path comes from the data config's local_path field (loaded in Step P1-3.1).
# This ensures the dataset location is controlled by centralized configuration.
# Note: train.json is required, but validation.json is optional (matches training script behavior).

REQUIRED_FILE = "train.json"
OPTIONAL_FILE = "validation.json"

if not DATASET_LOCAL_PATH.exists():
    raise FileNotFoundError(
        f"Dataset directory not found: {DATASET_LOCAL_PATH}\n"
        f"This path comes from the data config's 'local_path' field.\n"
        f"If you need to create the dataset, run the notebook: notebooks/00_make_tiny_dataset.ipynb"
    )

# Check required file
train_file = DATASET_LOCAL_PATH / REQUIRED_FILE
if not train_file.exists():
    raise FileNotFoundError(
        f"Required dataset file not found: {train_file}\n"
        f"This path comes from the data config's 'local_path' field.\n"
        f"If you need to create it, run the notebook: notebooks/00_make_tiny_dataset.ipynb"
    )

# Check optional file
val_file = DATASET_LOCAL_PATH / OPTIONAL_FILE
has_validation = val_file.exists()

print(f"✓ Dataset directory found: {DATASET_LOCAL_PATH}")
print(f"  (from data config: {data_config.get('name', 'unknown')} v{data_config.get('version', 'unknown')})")

train_size = train_file.stat().st_size
print(f"  ✓ {REQUIRED_FILE} ({train_size:,} bytes)")

if has_validation:
    val_size = val_file.stat().st_size
    print(f"  ✓ {OPTIONAL_FILE} ({val_size:,} bytes)")
else:
    print(f"  ⚠ {OPTIONAL_FILE} not found (optional - training will proceed without validation set)")


✓ Dataset directory found: C:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\dataset_tiny\seed0
  (from data config: resume-ner-data-tiny-short vv3)
  ✓ train.json (28,721 bytes)
  ⚠ validation.json not found (optional - training will proceed without validation set)


## Step P1-3.2.1: Optional Train/Test Split

**Optional step**: Create a train/test split if `test.json` is missing. This is useful when you only have `train.json` and `validation.json` and want to create a separate test set.

**⚠ WARNING**: This will overwrite `train.json` with the split version. Only enable if you want to create a permanent train/test split.


In [18]:
# Optional: create train/test split if test.json is missing
# WARNING: This will overwrite train.json with the split version
# Only enable if you want to create a permanent train/test split
import json
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional

from training.data import split_train_test, save_split_files

CREATE_TEST_SPLIT = False  # Set True to create test.json when absent (WARNING: overwrites train.json)

train_file = DATASET_LOCAL_PATH / "train.json"
val_file = DATASET_LOCAL_PATH / "validation.json"
test_file = DATASET_LOCAL_PATH / "test.json"

if CREATE_TEST_SPLIT and not test_file.exists():
    # Backup original train.json before overwriting
    backup_file = DATASET_LOCAL_PATH / "train.json.backup"
    if train_file.exists() and not backup_file.exists():
        import shutil
        shutil.copy2(train_file, backup_file)
        print(f"⚠ Backed up original train.json to {backup_file}")
    
    full_dataset = []
    # Start with train data; optionally include validation to maximize coverage
    with open(train_file, "r", encoding="utf-8") as f:
        full_dataset.extend(json.load(f))
    if val_file.exists():
        with open(val_file, "r", encoding="utf-8") as f:
            full_dataset.extend(json.load(f))

    split_cfg = configs.get("data", {}).get("splitting", {})
    train_ratio = split_cfg.get("train_test_ratio", 0.8)
    stratified = split_cfg.get("stratified", False)
    random_seed = split_cfg.get("random_seed", 42)
    entity_types = configs.get("data", {}).get("schema", {}).get("entity_types", [])

    print(f"Creating train/test split (train_ratio={train_ratio}, stratified={stratified})...")
    print(f"⚠ WARNING: This will overwrite train.json with {int(len(full_dataset) * train_ratio)} samples")
    
    new_train, new_test = split_train_test(
        dataset=full_dataset,
        train_ratio=train_ratio,
        stratified=stratified,
        random_seed=random_seed,
        entity_types=entity_types,
    )

    save_split_files(DATASET_LOCAL_PATH, new_train, new_test)
    print(f"✓ Wrote train.json ({len(new_train)}) and test.json ({len(new_test)})")
elif test_file.exists():
    print(f"✓ Found existing test.json at {test_file}")
else:
    print("⚠ test.json not found. Set CREATE_TEST_SPLIT=True to generate a split.")


✓ Found existing test.json at C:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\dataset_tiny\seed0\test.json


## Step P1-3.3: Setup Local Environment

Verify GPU availability, set up MLflow tracking (local file store), and check that key dependencies are installed. This step ensures the local environment is ready for training.


In [19]:
import sys
import torch

DEFAULT_DEVICE = "cuda"

env_config = configs["env"]
device_type = env_config.get("compute", {}).get("device", DEFAULT_DEVICE)

if device_type == "cuda" and not torch.cuda.is_available():
    raise RuntimeError(
        "CUDA device requested but not available. "
        "In Colab, ensure you've selected a GPU runtime: Runtime > Change runtime type > GPU"
    )


In [20]:
from pathlib import Path
import mlflow
# Check and install Azure ML packages if needed (for Azure ML tracking)
try:
    import azure.ai.ml
    import azure.identity
except ImportError:
    print("Azure ML packages not found. Installing...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "azure-ai-ml>=1.0.0", "azure-identity>=1.12.0", "azureml-defaults", "azureml-mlflow", "--quiet"])
    print("[OK] Azure ML packages installed")
from shared.mlflow_setup import setup_mlflow_from_config

# Get MLflow tracking URI for later use
mlflow_tracking_uri = mlflow.get_tracking_uri()
if mlflow_tracking_uri:
    print(f"MLflow tracking URI: {mlflow_tracking_uri[:80]}...")
else:
    print("Warning: MLflow tracking URI not set")

# Setup MLflow from config (automatically uses Azure ML if enabled in config/mlflow.yaml)
# To enable Azure ML Workspace tracking:
# 1. Edit config/mlflow.yaml and set azure_ml.enabled: true
# 2. Set environment variables: AZURE_SUBSCRIPTION_ID and AZURE_RESOURCE_GROUP
setup_mlflow_from_config(
    experiment_name="placeholder",  # Will be set per HPO run
    config_dir=CONFIG_DIR
)

2026-01-02 17:23:55,089 - shared.mlflow_setup - INFO - Azure ML enabled in config, attempting to connect...
2026-01-02 17:23:55,091 - shared.mlflow_setup - INFO - Environment variables not set, loading from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config.env
2026-01-02 17:23:55,092 - shared.mlflow_setup - INFO - Loaded subscription/resource group from config.env
2026-01-02 17:23:55,093 - shared.mlflow_setup - INFO - Using DefaultAzureCredential (trying multiple auth methods)


MLflow tracking URI: sqlite:///mlflow.db...


Class DeploymentTemplateOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
2026-01-02 17:23:57,845 - shared.mlflow_setup - INFO - Successfully connected to Azure ML workspace: resume-ner-ws


KeyboardInterrupt: 

In [None]:
# For Kaggle only - install specific package versions required for Optuna checkpointing
if IN_KAGGLE:
    %pip install "SQLAlchemy<2.0.0" "alembic<1.13.0" "optuna<4.0.0" --quiet
else:
    print("Skipping Kaggle-specific package installation (not running on Kaggle)")


Skipping Kaggle-specific package installation (not running on Kaggle)


In [None]:
try:
    import mlflow
    import transformers
    import optuna
except ImportError as e:
    raise ImportError(f"Required package not installed: {e}")

REQUIRED_PACKAGES = {
    "torch": torch,
    "transformers": transformers,
    "mlflow": mlflow,
    "optuna": optuna,
}

for name, module in REQUIRED_PACKAGES.items():
    if not hasattr(module, "__version__"):
        raise ImportError(
            f"Required package '{name}' is not properly installed")

## Step P1-3.4: The Sweep (HPO) - Local with Optuna

Run the full hyperparameter optimization sweep using Optuna to systematically search for the best model configuration. Uses the production HPO configuration with more trials than the dry run.

**Note on K-Fold Cross-Validation:**
- When k-fold CV is enabled (`k_fold.enabled: true`), each trial trains **k models** (one per fold) and returns the **average metric** across folds
- The number of **trials** is controlled by `sampling.max_trials` (e.g., 2 trials in smoke.yaml)
- With k=5 folds and 2 trials: **2 trials × 5 folds = 10 model trainings total**
- K-fold CV provides more robust hyperparameter evaluation but increases compute time (k× per trial)

**Note on Checkpoint and Resume:**
- When `checkpoint.enabled: true` is set in the HPO config, the system automatically saves the Optuna study state to a SQLite database
- This allows interrupted HPO runs to be resumed from the last checkpoint
- The checkpoint is automatically detected and loaded on the next run if `auto_resume: true` (default)
- Platform-specific paths are handled automatically (local, Colab, Kaggle)
- **Selective Checkpoint Saving**: When `checkpoint.save_only_best: true` is set, only best trial checkpoints are saved locally (reduces storage from ~30 GB to ~300 MB for 100 trials)
- **MLflow Checkpoint Logging**: When `mlflow.log_best_checkpoint: true` is set, the best trial checkpoint is automatically logged to MLflow after HPO completes (artifact path: `best_trial_checkpoint`)
- **Refit Training**: When `refit.enabled: true` is set (default), after HPO completes, the best trial is automatically retrained on the full training dataset. This produces a canonical checkpoint in `trial_<n>_<ts>/refit/checkpoint/` that is preferred over fold checkpoints for benchmarking and production use.
- See `docs/HPO_CHECKPOINT_RESUME.md` for detailed documentation


In [None]:
from pathlib import Path
from orchestration import STAGE_HPO
from orchestration.jobs import run_local_hpo_sweep

# Constants are imported from orchestration module
HPO_OUTPUT_DIR = ROOT_DIR / "outputs" / "hpo"
HPO_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


In [None]:
# Use HPO config already loaded in configs (from Step P1-3.1)
# Following DRY principle - don't reload configs that are already available
# Check for stage-specific hpo_config override
from orchestration.naming import get_stage_config
from shared.yaml_utils import load_yaml

hpo_stage_config = get_stage_config(experiment_config, STAGE_HPO)
hpo_config_override = hpo_stage_config.get("hpo_config")

if hpo_config_override:
    # Load stage-specific HPO config override
    hpo_config_path = CONFIG_DIR / hpo_config_override
    hpo_config = load_yaml(hpo_config_path)
    print(f"✓ Using stage-specific HPO config for hpo: {hpo_config_override}")
else:
    # Use default HPO config from top-level experiment config
    hpo_config = configs["hpo"]
    print(f"✓ Using default HPO config: {experiment_config.hpo_config.name}")
train_config = configs["train"]
backbone_values = hpo_config["search_space"]["backbone"]["values"]


✓ Using stage-specific HPO config for hpo: hpo/smoke.yaml


### Setup K-Fold Splits and Google Drive Backup for HPO Trials

**K-Fold Cross-Validation Setup**: If k-fold CV is enabled in the HPO config, create and save fold splits before starting the sweep.

**Colab-specific feature**: Configure automatic backup of each HPO trial to Google Drive immediately after completion. This prevents data loss if the Colab session disconnects during long-running hyperparameter optimization sweeps.

**Note on Checkpoint Backup:**
- If `checkpoint.save_only_best: true` is enabled, only best trial checkpoints are saved locally and backed up to Drive
- Each trial's `metrics.json` is always saved and backed up
- The best trial checkpoint is also automatically logged to MLflow (if `mlflow.log_best_checkpoint: true`)
- This reduces storage usage while ensuring the best model is always available


In [None]:
from training.cv_utils import create_kfold_splits, save_fold_splits, validate_splits
from training.data import load_dataset

# Setup k-fold splits if enabled
k_fold_config = hpo_config.get("k_fold", {})
k_folds_enabled = k_fold_config.get("enabled", False)
fold_splits_file = None

if k_folds_enabled:
    n_splits = k_fold_config.get("n_splits", DEFAULT_K_FOLDS)
    random_seed = k_fold_config.get("random_seed", DEFAULT_RANDOM_SEED)
    shuffle = k_fold_config.get("shuffle", True)
    stratified = k_fold_config.get("stratified", False)
    entity_types = configs.get("data", {}).get("schema", {}).get("entity_types", [])
    
    print(f"Setting up {n_splits}-fold cross-validation splits...")
    full_dataset = load_dataset(str(DATASET_LOCAL_PATH))
    train_data = full_dataset.get("train", [])
    
    fold_splits = create_kfold_splits(
        dataset=train_data,
        k=n_splits,
        random_seed=random_seed,
        shuffle=shuffle,
        stratified=stratified,
        entity_types=entity_types,
    )
    
    # Optional validation to ensure rare entities appear across folds
    validate_splits(train_data, fold_splits, entity_types=entity_types)
    
    HPO_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    fold_splits_file = HPO_OUTPUT_DIR / "fold_splits.json"
    save_fold_splits(
        fold_splits,
        fold_splits_file,
        metadata={
            "k": n_splits,
            "random_seed": random_seed,
            "shuffle": shuffle,
            "stratified": stratified,
            "dataset_path": str(DATASET_LOCAL_PATH),
        }
    )
    print(f"✓ K-fold splits saved to: {fold_splits_file}")
else:
    print("K-fold CV disabled - using single train/validation split")


Setting up 2-fold cross-validation splits...
[CV] Fold 0: {'SKILL': 154} | Missing: ['EDUCATION', 'DESIGNATION', 'EXPERIENCE', 'NAME', 'EMAIL', 'PHONE', 'LOCATION']
[CV] Fold 1: {'SKILL': 107, 'LOCATION': 4, 'DESIGNATION': 1, 'EXPERIENCE': 1, 'EDUCATION': 1} | Missing: ['NAME', 'EMAIL', 'PHONE']
✓ K-fold splits saved to: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo\fold_splits.json


In [None]:
# Checkpoint functionality is now handled automatically by run_local_hpo_sweep
# when checkpoint.enabled: true is set in the HPO config.
# No manual backup callbacks are needed - SQLite persistence is built-in.


In [None]:
# Checkpoint functionality is now handled automatically by run_local_hpo_sweep
# when checkpoint.enabled: true is set in the HPO config.
# No wrapper functions are needed - SQLite persistence is built-in.


In [None]:
# # In a Kaggle notebook cell
# !cd /kaggle/working/resume-ner-azureml && git fetch origin gg && git checkout origin/gg -- src/train.py src/training/trainer.py

In [None]:
# Extract checkpoint configuration from HPO config
checkpoint_config = hpo_config.get("checkpoint", {})

hpo_studies = {}
k_folds_param = k_fold_config.get("n_splits", DEFAULT_K_FOLDS) if k_folds_enabled else None

# Use new centralized naming system for HPO
# Build base output directory: outputs/hpo/<env>/<model>/
# Trial-specific paths will be created by run_local_hpo_sweep as subdirectories


# Ensure environment is defined
from shared.platform_detection import detect_platform
environment = detect_platform()
print(f"Detected environment: {environment}")

for backbone in backbone_values:
    mlflow_experiment_name = build_mlflow_experiment_name(
        experiment_config.name, STAGE_HPO, backbone
    )
    
    backbone_name = backbone.split("-")[0] if "-" in backbone else backbone
    
    # Build base HPO directory using new structure: outputs/hpo/<env>/<model>/
    backbone_output_dir = ROOT_DIR / "outputs" / "hpo" / environment / backbone_name
    backbone_output_dir.mkdir(parents=True, exist_ok=True)
    
    print(f"✓ HPO output directory: {backbone_output_dir}")
    
    # Create restore function for HPO checkpoint if checkpointing enabled and BACKUP_ENABLED
    restore_fn = None
    if checkpoint_config.get("enabled", False) and BACKUP_ENABLED:
        storage_path_template = checkpoint_config.get("storage_path", "{backbone}/study.db")
        storage_path_str = storage_path_template.replace("{backbone}", backbone)
        expected_checkpoint = backbone_output_dir / storage_path_str
        
        def make_restore_fn(checkpoint_path):
            def restore_fn_inner(path: Path) -> bool:
                # Only restore if path matches expected checkpoint
                if path == checkpoint_path:
                    return ensure_restored_from_drive(checkpoint_path, is_directory=False)
                return False
            return restore_fn_inner
        
        restore_fn = make_restore_fn(expected_checkpoint)
    
    # Use standard run_local_hpo_sweep with checkpoint_config
    # Checkpoint.enabled handles persistence via SQLite (better than manual Drive backup)
    study = run_local_hpo_sweep(
        dataset_path=str(DATASET_LOCAL_PATH),
        config_dir=CONFIG_DIR,
        backbone=backbone,
        hpo_config=hpo_config,
        train_config=train_config,
        output_dir=backbone_output_dir,
        mlflow_experiment_name=mlflow_experiment_name,
        k_folds=k_folds_param,
        fold_splits_file=fold_splits_file,
        checkpoint_config=checkpoint_config,
        restore_from_drive=restore_fn,
        data_config=configs.get("data"),
        benchmark_config=configs.get("benchmark"),
    )
    
    # Backup checkpoint to Drive after HPO completion
    if checkpoint_config.get("enabled", False) and BACKUP_ENABLED:
        # Backup study.db file (checkpoint database)
        storage_path_template = checkpoint_config.get("storage_path", "{backbone}/study.db")
        storage_path_str = storage_path_template.replace("{backbone}", backbone)
        checkpoint_path = backbone_output_dir / storage_path_str
        if checkpoint_path.exists():
            backup_to_drive(checkpoint_path, is_directory=False)
            print(f"✓ Backed up HPO checkpoint database to Drive: {checkpoint_path}")
        
        # Backup all trial directories (new structure: outputs/hpo/{env}/{model}/trial_*/...)
        # Find all directories that start with "trial_"
        trial_dirs = [d for d in backbone_output_dir.iterdir() 
                     if d.is_dir() and d.name.startswith("trial_")]
        
        if trial_dirs:
            print(f"Found {len(trial_dirs)} trial directory(ies) to backup...")
            for trial_dir in trial_dirs:
                result = backup_to_drive(trial_dir, is_directory=True)
                if result:
                    print(f"✓ Backed up trial directory to Drive: {trial_dir.name}")
                else:
                    print(f"⚠ Failed to backup trial directory: {trial_dir.name}")
        else:
            print("No trial directories found to backup")
    
    hpo_studies[backbone] = study


2026-01-02 14:24:07,609 - orchestration.jobs.hpo.local.study.manager - INFO - [HPO] Resuming optimization for distilbert from checkpoint...


Detected environment: local
✓ HPO output directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo\local\distilbert


2026-01-02 14:24:08,022 - orchestration.jobs.hpo.local.study.manager - INFO - ✓ HPO already completed and checkpoint uploaded (best trial: 0, completed: 2026-01-02T13:58:25.628118). Skipping HPO execution.


In [None]:
def extract_cv_statistics(best_trial):
    if not hasattr(best_trial, "user_attrs"):
        return None
    cv_mean = best_trial.user_attrs.get("cv_mean")
    cv_std = best_trial.user_attrs.get("cv_std")
    return (cv_mean, cv_std) if cv_mean is not None else None

objective_metric = hpo_config['objective']['metric']

for backbone, study in hpo_studies.items():
    if not study.trials:
        continue
    
    best_trial = study.best_trial
    cv_stats = extract_cv_statistics(best_trial)
    
    print(f"{backbone}: {len(study.trials)} trials completed")
    print(f"  Best {objective_metric}: {best_trial.value:.4f}")
    print(f"  Best params: {best_trial.params}")
    
    if cv_stats:
        cv_mean, cv_std = cv_stats
        print(f"  CV Statistics: Mean: {cv_mean:.4f} ± {cv_std:.4f}")


distilbert: 1 trials completed
  Best macro-f1: 0.4851
  Best params: {'learning_rate': 4.143783964517761e-05, 'batch_size': 4, 'dropout': 0.18800769496265501, 'weight_decay': 0.007134973962158688}
  CV Statistics: Mean: 0.4851 ± 0.0333


## Step P1-3.5: Benchmarking Best Trials

Benchmark the best trial from each backbone to measure actual inference performance. This provides real latency data that replaces parameter-count proxies in model selection, enabling more accurate speed comparisons.

**Workflow:**
1. Identify best trial per backbone (from HPO results)
2. Select checkpoint: prefers `refit/checkpoint/` (if refit training completed), otherwise uses best fold from `cv/foldN/checkpoint/`
3. Run benchmarking on each best trial checkpoint
4. Save benchmark results as `benchmark.json` in trial directories
5. Model selection will automatically use this data when available


In [None]:
from orchestration.jobs.local_selection import load_best_trial_from_disk, extract_best_config_from_study
from pathlib import Path
from shared.platform_detection import detect_platform

best_trials = {}
objective_metric = hpo_config['objective']['metric']
dataset_version = data_config.get("version", "unknown")  # Get dataset version from config

environment = detect_platform()

HPO_OUTPUT_DIR_NEW = ROOT_DIR / "outputs" / "hpo" / environment

for backbone in backbone_values:
    backbone_name = backbone.split("-")[0] if "-" in backbone else backbone
    
    # First, try to use the study from the current HPO run if available
    best_trial_info = None
    if 'hpo_studies' in locals() and backbone_name in hpo_studies:
        study = hpo_studies[backbone_name]
        if study and study.best_trial is not None:
            try:
                # Extract best trial from current study
                best_trial_config = extract_best_config_from_study(
                    study, backbone_name, dataset_version, objective_metric
                )
                
                # Get trial number and find the trial directory
                trial_number = study.best_trial.number
                
                # Use load_best_trial_from_disk which now handles refit checkpoints
                # It prefers refit/checkpoint/ over cv/foldN/checkpoint/
                hpo_backbone_dir = HPO_OUTPUT_DIR_NEW / backbone_name
                best_trial_from_disk = None
                if hpo_backbone_dir.exists():
                    best_trial_from_disk = load_best_trial_from_disk(
                        HPO_OUTPUT_DIR_NEW.parent,  # Pass parent to maintain compatibility
                        f"{environment}/{backbone_name}",
                        objective_metric
                    )
                
                # Build trial_info dict matching load_best_trial_from_disk format
                if best_trial_from_disk:
                    # Use checkpoint_dir from load_best_trial_from_disk (prefers refit)
                    best_trial_info = {
                        'backbone': backbone_name,
                        'trial_name': best_trial_from_disk['trial_name'],
                        'trial_dir': best_trial_from_disk['trial_dir'],
                        'checkpoint_dir': best_trial_from_disk.get('checkpoint_dir', str(Path(best_trial_from_disk['trial_dir']) / 'checkpoint')),
                        'checkpoint_type': best_trial_from_disk.get('checkpoint_type', 'unknown'),
                        'accuracy': best_trial_from_disk['accuracy'],
                        'metrics': best_trial_from_disk['metrics'],
                        'hyperparameters': best_trial_config.get('hyperparameters', {}),
                    }
                    print(f"{backbone}: Best trial from current HPO run is {best_trial_info['trial_name']} "
                          f"({objective_metric}={best_trial_info['accuracy']:.4f})")
            except Exception as e:
                import logging
                logger = logging.getLogger(__name__)
                logger.warning(f"Could not extract best trial from study for {backbone}: {e}")
                # Fall through to disk search
    
    # Fallback to disk search if study not available
    if best_trial_info is None:
        # Try new path structure first
        hpo_backbone_dir = HPO_OUTPUT_DIR_NEW / backbone_name
        if hpo_backbone_dir.exists():
            best_trial_info = load_best_trial_from_disk(
                HPO_OUTPUT_DIR_NEW.parent,  # Pass parent to maintain compatibility
                f"{environment}/{backbone_name}",  # Modified backbone path
                objective_metric
            )
        else:
            # Fallback to old structure for backward compatibility
            HPO_OUTPUT_DIR_OLD = ROOT_DIR / "outputs" / "hpo"
            best_trial_info = load_best_trial_from_disk(
                HPO_OUTPUT_DIR_OLD,
                backbone,
                objective_metric
            )
    
    if best_trial_info:
        best_trials[backbone] = best_trial_info
        if 'hpo_studies' not in locals() or backbone_name not in hpo_studies:
            # Only print if we used disk search (not already printed above)
            print(f"{backbone}: Best trial is {best_trial_info['trial_name']} "
                  f"({objective_metric}={best_trial_info['accuracy']:.4f})")


distilbert: Best trial from current HPO run is trial_0_20260101_163854 (macro-f1=0.3431)


In [None]:
# Run benchmarking on best trials
test_data_path = test_file  # Use test_file as test_data_path

# Load benchmark config (if available)
benchmark_config = configs.get("benchmark", {})
benchmark_settings = benchmark_config.get("benchmarking", {})

# Get benchmark parameters from config or use defaults
benchmark_batch_sizes = benchmark_settings.get("batch_sizes", [1, 8, 16])
benchmark_iterations = benchmark_settings.get("iterations", 100)
benchmark_warmup = benchmark_settings.get("warmup_iterations", 10)
benchmark_max_length = benchmark_settings.get("max_length", 512)
benchmark_device = benchmark_settings.get("device")

# Create MLflow tracker for benchmarking
from orchestration.jobs.tracking.mlflow_tracker import MLflowBenchmarkTracker
# Use benchmark experiment name (typically same as HPO experiment with -benchmark suffix)
benchmark_experiment_name = f"{experiment_config.name}-benchmark" if 'experiment_config' in locals() else "resume_ner_baseline-benchmark"
benchmark_tracker = MLflowBenchmarkTracker(benchmark_experiment_name)


if test_data_path and test_data_path.exists():
    benchmark_results = {}

    for backbone, trial_info in best_trials.items():
        # Use checkpoint_dir from trial_info if available (from load_best_trial_from_disk)
        # This handles new structure: refit/checkpoint/ or cv/foldN/checkpoint/
        if "checkpoint_dir" in trial_info:
            checkpoint_dir = Path(trial_info["checkpoint_dir"])
        else:
            # Fallback: construct from trial_dir (backward compatibility)
            trial_dir = Path(trial_info["trial_dir"])
            checkpoint_dir = trial_dir / CHECKPOINT_DIRNAME

        backbone_name = backbone.split("-")[0] if "-" in backbone else backbone

        trial_id_raw = trial_info.get(
            "trial_id") or trial_info.get("trial_name", "unknown")
        if trial_id_raw.startswith("trial_"):
            trial_id = trial_id_raw[6:]
        else:
            trial_id = trial_id_raw

        if 'environment' not in locals():
            from shared.platform_detection import detect_platform
            environment = detect_platform()

        from orchestration.naming_centralized import create_naming_context, build_output_path
        benchmarking_context = create_naming_context(
            process_type="benchmarking",
            model=backbone_name,
            trial_id=trial_id,
            environment=environment,
        )
        benchmarking_path = build_output_path(ROOT_DIR, benchmarking_context)
        benchmarking_path.mkdir(parents=True, exist_ok=True)

        benchmark_output = benchmarking_path / BENCHMARK_FILENAME

        if not checkpoint_dir.exists():
            print(
                f"Warning: Checkpoint not found for {backbone} "
                f"{trial_info['trial_name']}"
            )
            continue

        if ensure_restored_from_drive(benchmark_output, is_directory=False):
            print(
                f"✓ Restored benchmark results from Drive - "
                f"skipping benchmarking for {backbone}"
            )
            benchmark_results[backbone] = benchmark_output
            continue

        # Compute grouping tags locally (no MLflow lookup needed)
        # We have all the data needed: data_config, hpo_config, benchmark_config, backbone, hyperparameters
        study_key_hash = None
        trial_key_hash = None
        study_family_hash = None
        
        try:
            from orchestration.jobs.tracking.naming.hpo_keys import (
                build_hpo_study_key,
                build_hpo_study_key_hash,
                build_hpo_study_family_key,
                build_hpo_study_family_hash,
                build_hpo_trial_key,
                build_hpo_trial_key_hash,
            )
            
            # Get hyperparameters from trial_info
            hyperparameters = trial_info.get('hyperparameters', {})
            
            if hyperparameters and data_config and hpo_config:
                # Compute study_key_hash
                study_key = build_hpo_study_key(
                    data_config=data_config,
                    hpo_config=hpo_config,
                    model=backbone_name,
                    benchmark_config=benchmark_config if 'benchmark_config' in locals() else None,
                )
                study_key_hash = build_hpo_study_key_hash(study_key)
                
                # Compute study_family_hash (optional, for cross-model comparison)
                study_family_key = build_hpo_study_family_key(
                    data_config=data_config,
                    hpo_config=hpo_config,
                    benchmark_config=benchmark_config if 'benchmark_config' in locals() else None,
                )
                study_family_hash = build_hpo_study_family_hash(study_family_key)
                
                # Compute trial_key_hash
                trial_key = build_hpo_trial_key(
                    study_key_hash=study_key_hash,
                    hyperparameters=hyperparameters,
                )
                trial_key_hash = build_hpo_trial_key_hash(trial_key)
                
                print(f"  Computed grouping tags: study_key_hash={study_key_hash[:16]}..., trial_key_hash={trial_key_hash[:16]}...")
            else:
                missing = []
                if not hyperparameters:
                    missing.append("hyperparameters")
                if not data_config:
                    missing.append("data_config")
                if not hpo_config:
                    missing.append("hpo_config")
                print(f"  Warning: Cannot compute grouping tags (missing: {', '.join(missing)})")
        except Exception as e:
            import logging
            logger = logging.getLogger(__name__)
            logger.warning(f"Could not compute grouping tags locally: {e}", exc_info=True)
            # Continue with None values (backward compatible)

        print(f"\nBenchmarking {backbone} ({trial_info['trial_name']})...")

        success = run_benchmarking_local(
            checkpoint_dir=checkpoint_dir,
            test_data_path=test_data_path,
            output_path=benchmark_output,
            batch_sizes=benchmark_batch_sizes,
            iterations=benchmark_iterations,
            warmup_iterations=benchmark_warmup,
            max_length=benchmark_max_length,
            device=benchmark_device,
            tracker=benchmark_tracker,
            backbone=backbone,
            benchmark_source="hpo_trial",
            study_key_hash=study_key_hash,
            trial_key_hash=trial_key_hash,
        )

        if success:
            benchmark_results[backbone] = benchmark_output
            print(f"✓ Benchmark completed: {benchmark_output}")

            if BACKUP_ENABLED:
                backup_to_drive(benchmark_output, is_directory=False)
                print("✓ Backed up benchmark results to Drive")
        else:
            print(f"✗ Benchmark failed for {backbone}")

    print(
        f"\n✓ Benchmarking complete. "
        f"{len(benchmark_results)}/{len(best_trials)} trials benchmarked."
    )
else:
    print("Skipping benchmarking (test data not available)")

  Computed grouping tags: study_key_hash=350a79aa1e425060..., trial_key_hash=db84e525ec11af04...

Benchmarking distilbert (trial_0_20260101_163854)...


2026-01-02 14:25:31,167 - orchestration.benchmark_utils - INFO - [Benchmark Run Name] Extracted trial_id from path (fallback): trial_0_20260101_163854
2026-01-02 14:25:31,168 - orchestration.benchmark_utils - INFO - [Benchmark Run Name] Building run name: trial_id=trial_0_20260101_163854, root_dir=c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml, config_dir=c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config
2026-01-02 14:25:31,177 - orchestration.jobs.tracking.config.loader - INFO - [MLflow Config] Loaded config from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config\mlflow.yaml
2026-01-02 14:25:31,178 - orchestration.jobs.tracking.config.loader - INFO - [Auto-Increment Config] Loading from config_dir=c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config, raw_auto_inc_config={'enabled': True, 'processes': {'hpo': True, 'benchmarking': True}, 'format': '{base}.{version}'}
2026-01-02 14:25:31,179 - orchestration.jobs.tracking.config.loader - INFO - [Auto-I

🏃 View run benchmark_distilbert_trial_0_20260101_163854_3 at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/05e58c57-c15e-4023-b2a5-ccc206638c27/runs/07a3b8e1-18b8-4891-8d5a-b503a20a7cd3
🧪 View experiment at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/05e58c57-c15e-4023-b2a5-ccc206638c27
✓ Benchmark completed: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\benchmarking\local\distilbert\trial_0_20260101_163854\benchmark.json

✓ Benchmarking complete. 1/1 trials benchmarked.


## Step P1-3.6: Best Configuration Selection (Automated)

Programmatically select the best configuration from all HPO sweep runs across all backbone models. The best configuration is determined by the objective metric specified in the HPO config.


In [None]:
from pathlib import Path
import importlib.util
from shared.json_cache import save_json

# Import local_selection directly to avoid triggering Azure ML imports in __init__.py
local_selection_spec = importlib.util.spec_from_file_location(
    "local_selection", SRC_DIR / "orchestration" / "jobs" / "local_selection.py"
)
local_selection = importlib.util.module_from_spec(local_selection_spec)
local_selection_spec.loader.exec_module(local_selection)
select_best_configuration_across_studies = local_selection.select_best_configuration_across_studies



In [None]:
dataset_version = data_config.get("version", "unknown")

# Select best configuration with accuracy-speed tradeoff
# Supports both in-memory studies and disk-based selection
# Uses threshold from hpo_config["selection"] if configured

# Option 1: Use in-memory studies (if notebook still running)
if 'hpo_studies' in locals() and hpo_studies:
    best_configuration = select_best_configuration_across_studies(
        studies=hpo_studies,
        hpo_config=hpo_config,
        dataset_version=dataset_version,
        # Uses accuracy_threshold from hpo_config["selection"] if set
    )
else:
    # Option 2: Load from disk (works after notebook restart)
    HPO_OUTPUT_DIR = ROOT_DIR / "outputs" / "hpo"
    best_configuration = select_best_configuration_across_studies(
        studies=None,  # No in-memory studies
        hpo_config=hpo_config,
        dataset_version=dataset_version,
        hpo_output_dir=HPO_OUTPUT_DIR,  # Load from saved metrics.json files
        # Uses accuracy_threshold from hpo_config["selection"] if set
    )


In [None]:
from orchestration.paths import (
    resolve_output_path,
    save_cache_with_dual_strategy,
)
from datetime import datetime
from shared.json_cache import save_json

# Extract backbone name first (needed for spec_fp computation)
backbone = best_configuration.get('backbone', 'unknown')
backbone_name = backbone.split("-")[0] if "-" in backbone else backbone

# Use new centralized naming system with fingerprints
from orchestration.naming_centralized import create_naming_context, build_output_path
from orchestration.fingerprints import compute_spec_fp
from orchestration.config_loader import load_all_configs

# Compute spec_fp for best configuration
all_configs = load_all_configs(experiment_config)
spec_fp = compute_spec_fp(
    model_config=all_configs.get("model", {}),
    data_config=all_configs.get("data", {}),
    train_config=all_configs.get("train", {}),
    seed=configs.get("train", {}).get("training", {}).get("random_seed", 42)
)

# Create naming context for best configuration
best_config_context = create_naming_context(
    process_type="best_configurations",
    model=backbone_name,
    spec_fp=spec_fp
)

# Build path using new structure: outputs/cache/best_configurations/<model>/spec_<spec_fp>/
BEST_CONFIG_CACHE_DIR = build_output_path(ROOT_DIR, best_config_context)
BEST_CONFIG_CACHE_DIR.mkdir(parents=True, exist_ok=True)


# Generate timestamp and identifiers
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
trial_name = best_configuration.get('trial_name', 'unknown')
trial_id = best_configuration.get('trial_id', 'unknown')

# Generate stable training name from best configuration
stable_training_name = f"{backbone_name}_{trial_name}"

timestamped_file, latest_file, index_file = save_cache_with_dual_strategy(
    root_dir=ROOT_DIR,
    config_dir=CONFIG_DIR,
    cache_type="best_configurations",
    data=best_configuration,
    backbone=backbone,
    identifier=trial_name,
    timestamp=timestamp,
    additional_metadata={
        "experiment_name": experiment_config.name if 'experiment_config' in locals() else "unknown",
        "hpo_study_name": hpo_config.get('study_name', 'unknown') if 'hpo_config' in locals() else "unknown",
        "spec_fp": spec_fp
    }
)

# Also save directly to new fingerprint-based directory
new_timestamped_file = BEST_CONFIG_CACHE_DIR / f"best_config_{backbone_name}_{trial_name}_{timestamp}.json"
save_json(new_timestamped_file, best_configuration)

# Save latest and index to new directory
new_latest_file = BEST_CONFIG_CACHE_DIR / "latest_best_configuration.json"
save_json(new_latest_file, best_configuration)

# Update index in new directory
from orchestration.paths import get_cache_file_path, load_json
new_index_file = BEST_CONFIG_CACHE_DIR / "index.json"
index_data = load_json(new_index_file, default={"entries": []})
index_entry = {
    "backbone": backbone,
    "trial_name": trial_name,
    "timestamp": timestamp,
    "file": str(new_timestamped_file),
    "spec_fp": spec_fp
}
index_data.setdefault("entries", []).append(index_entry)
save_json(new_index_file, index_data)


print(f"Best configuration selected:")
print(f"  Backbone: {best_configuration.get('backbone', 'unknown')}")
print(f"  Trial: {trial_name}")
best_value = best_configuration.get('best_value', None)
if best_value is not None and isinstance(best_value, (int, float)):
    print(f"  Best macro-f1: {best_value:.4f}")
else:
    print(f"  Best macro-f1: {best_value}")
print(f"  Selection reason: {best_configuration.get('selection_reason', 'unknown')}")
print(f"\nAll candidates considered:")
for candidate in best_configuration.get('candidates', []):
    candidate_value = candidate.get('best_value', None)
    if candidate_value is not None and isinstance(candidate_value, (int, float)):
        print(f"  ✓ {candidate.get('backbone', 'unknown')}: acc={candidate_value:.4f}, speed={candidate.get('speed_factor', 1.0):.2f}x")
    else:
        print(f"  ✓ {candidate.get('backbone', 'unknown')}: acc={candidate_value}, speed={candidate.get('speed_factor', 1.0):.2f}x")

print(f"\n✓ Saved to: {BEST_CONFIG_CACHE_DIR}")


Best configuration selected:
  Backbone: distilbert
  Trial: trial_5
  Best macro-f1: None
  Selection reason: unknown

All candidates considered:

✓ Saved to: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\best_configurations\distilbert\spec_81710c3324325ad0


## Step P1-3.7: Final Training (Post-HPO, Single Run)

Train the final production model using the best configuration from HPO with stable, controlled conditions. This uses the full training epochs (no early stopping) and the best hyperparameters found during HPO.

**Note**: After training completes, the checkpoint will be automatically backed up to Google Drive for persistence.


In [None]:
from pathlib import Path
import os
import sys
import subprocess
import mlflow
from shared.json_cache import load_json, save_json
from orchestration import STAGE_TRAINING

# Define build_final_training_config locally to avoid importing Azure ML dependencies
# This function doesn't use Azure ML, so we can define it here
def build_final_training_config(
    best_config: dict,
    train_config: dict,
    random_seed: int = 42,
) -> dict:
    """
    Build final training configuration by merging best HPO config with train.yaml defaults.
    """
    hyperparameters = best_config.get("hyperparameters", {})
    training_defaults = train_config.get("training", {})
    
    return {
        "backbone": best_config["backbone"],
        "learning_rate": hyperparameters.get("learning_rate", training_defaults.get("learning_rate", 2e-5)),
        "dropout": hyperparameters.get("dropout", training_defaults.get("dropout", 0.1)),
        "weight_decay": hyperparameters.get("weight_decay", training_defaults.get("weight_decay", 0.01)),
        "batch_size": training_defaults.get("batch_size", 16),
        "epochs": training_defaults.get("epochs", 5),
        "random_seed": random_seed,
        "early_stopping_enabled": False,
        "use_combined_data": True,
        "use_all_data": True,
    }

DEFAULT_RANDOM_SEED = 42
FINAL_TRAINING_OUTPUT_DIR = ROOT_DIR / "outputs" / "final_training"


In [None]:
from orchestration.paths import load_cache_file
from orchestration.metadata_manager import (
    load_training_metadata,
    is_training_complete,
    are_training_artifacts_uploaded,
    save_training_metadata,
    get_training_checkpoint_path,
)

# Try loading from centralized cache first
best_configuration = load_cache_file(
    ROOT_DIR, CONFIG_DIR, "best_configurations", use_latest=True
)


if best_configuration is None:
    raise FileNotFoundError(
        f"Best configuration cache not found.\n"
        f"Please run Step P1-3.6: Best Configuration Selection first.\n"
        f"Cache directory: {resolve_output_path(ROOT_DIR, CONFIG_DIR, 'cache', subcategory='best_configurations')}"
    )


In [None]:
# Build final training configuration from best HPO configuration
# Use train_config from configs if available, otherwise load it
if 'train_config' not in locals():
    train_config = configs.get("train", {})

# Read seed from train_config (with fallback to 42)
random_seed = train_config.get("training", {}).get("random_seed", 42)

final_training_config = build_final_training_config(
    best_config=best_configuration,
    train_config=train_config,
    random_seed=random_seed,
)

print("Final training configuration:")
print(f"  Backbone: {final_training_config['backbone']}")
print(f"  Learning rate: {final_training_config['learning_rate']}")
print(f"  Batch size: {final_training_config['batch_size']}")
print(f"  Dropout: {final_training_config['dropout']}")
print(f"  Weight decay: {final_training_config['weight_decay']}")
print(f"  Epochs: {final_training_config['epochs']}")
print(f"  Random seed: {final_training_config['random_seed']}")
print(f"  Early stopping: {final_training_config['early_stopping_enabled']}")

Final training configuration:
  Backbone: distilbert
  Learning rate: 4.033407641328295e-05
  Batch size: 2
  Dropout: 0.2962144733204924
  Weight decay: 0.005813977793367053
  Epochs: 1
  Random seed: 42
  Early stopping: False


In [None]:
from orchestration.naming_centralized import (
    create_naming_context,
    build_output_path,
)
from orchestration.config_loader import load_all_configs
from orchestration.fingerprints import compute_spec_fp, compute_exec_fp
import subprocess
from datetime import datetime
mlflow_experiment_name = (
    f"{experiment_config.name}-{STAGE_TRAINING}-{final_training_config['backbone']}"
)


# Use new centralized naming system with fingerprints and variants
backbone = best_configuration.get("backbone", "unknown")
backbone_name = backbone.split("-")[0] if "-" in backbone else backbone

# Get git SHA for exec_fp computation
try:
    git_sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"],
        cwd=ROOT_DIR,
        stderr=subprocess.DEVNULL,
    ).decode().strip()
except Exception:
    git_sha = "unknown"

# Compute fingerprints for final training

all_configs = load_all_configs(experiment_config)
spec_fp = compute_spec_fp(
    model_config=all_configs.get("model", {}),
    data_config=all_configs.get("data", {}),
    train_config=all_configs.get("train", {}),
    seed=final_training_config.get("random_seed", 42),
)
exec_fp = compute_exec_fp(
    git_sha=git_sha,
    env_config=all_configs.get("env", {}),
)

print("✓ Computed fingerprints:")
print(f"  spec_fp: {spec_fp}")
print(f"  exec_fp: {exec_fp}")

# Create naming context for final training

# Ensure environment is defined
if "environment" not in locals():
    from shared.platform_detection import detect_platform
    environment = detect_platform()

# Check for existing training with same fingerprints (check variants)
# Use variant resolution to find next available variant or reuse existing
from orchestration.final_training_config import _compute_next_variant, _find_existing_variant, _is_variant_complete

# Check for existing variants
existing_variant = _find_existing_variant(
    ROOT_DIR, CONFIG_DIR, spec_fp, exec_fp, backbone_name
)

# Reuse existing variant if complete, otherwise use next available
if existing_variant and _is_variant_complete(ROOT_DIR, CONFIG_DIR, spec_fp, exec_fp, backbone_name, existing_variant):
    variant = existing_variant
    print(f"Reusing existing variant {variant}")
else:
    variant = _compute_next_variant(ROOT_DIR, CONFIG_DIR, spec_fp, exec_fp, backbone_name)
    print(f"Using new variant {variant}")
training_context = create_naming_context(
    process_type="final_training",
    model=backbone_name,
    spec_fp=spec_fp,
    exec_fp=exec_fp,
    environment=environment,
    variant=variant,
)

final_output_dir = build_output_path(ROOT_DIR, training_context)
final_output_dir.mkdir(parents=True, exist_ok=True)

print(f"✓ Final training output directory: {final_output_dir}")

# Check if training already exists at this path
checkpoint_path = final_output_dir / "checkpoint"
training_complete = (
    checkpoint_path.exists() and any(checkpoint_path.iterdir())
)

stable_training_name = (
    f"{backbone_name}_{best_configuration.get('trial_name', 'unknown')}"
)

if training_complete:
    print(f"✓ Training already completed at: {final_output_dir}")
    if checkpoint_path.exists():
        print(f"  Checkpoint: {checkpoint_path}")
    SKIP_TRAINING = True
else:
    print("Training not yet completed - will proceed with training")
    SKIP_TRAINING = False

mlflow.set_experiment(mlflow_experiment_name)


✓ Computed fingerprints:
  spec_fp: 81710c3324325ad0
  exec_fp: 8d244347b2eff67e
Using new variant 1
✓ Final training output directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\local\distilbert\spec_81710c3324325ad0_exec_8d244347b2eff67e\v1
Training not yet completed - will proceed with training


<Experiment: artifact_location='', creation_time=1766263257813, experiment_id='a5897b88-fd66-448c-ae65-1ef21bfc11dd', last_update_time=None, lifecycle_stage='active', name='resume_ner_baseline-training-distilbert', tags={}>

In [None]:
# Run training as a module (python -m training.train) to allow relative imports to work
# This requires src/ to be in PYTHONPATH (set in env below)
training_args = [
    sys.executable,
    "-m",
    "training.train",
    "--data-asset",
    str(DATASET_LOCAL_PATH),
    "--config-dir",
    str(CONFIG_DIR),
    "--backbone",
    final_training_config["backbone"],
    "--learning-rate",
    str(final_training_config["learning_rate"]),
    "--batch-size",
    str(final_training_config["batch_size"]),
    "--dropout",
    str(final_training_config["dropout"]),
    "--weight-decay",
    str(final_training_config["weight_decay"]),
    "--epochs",
    str(final_training_config["epochs"]),
    "--random-seed",
    str(final_training_config["random_seed"]),
    "--early-stopping-enabled",
    str(final_training_config["early_stopping_enabled"]).lower(),
    "--use-combined-data",
    str(final_training_config["use_combined_data"]).lower(),
]


In [None]:
training_env = os.environ.copy()
training_env["AZURE_ML_OUTPUT_checkpoint"] = str(final_output_dir)

# Add src directory to PYTHONPATH to allow relative imports in training.train
pythonpath = training_env.get("PYTHONPATH", "")
if pythonpath:
    training_env["PYTHONPATH"] = f"{str(SRC_DIR)}{os.pathsep}{pythonpath}"
else:
    training_env["PYTHONPATH"] = str(SRC_DIR)

mlflow_tracking_uri = mlflow.get_tracking_uri()
if mlflow_tracking_uri:
    training_env["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri
training_env["MLFLOW_EXPERIMENT_NAME"] = mlflow_experiment_name

            # Note: Run name is now set automatically by trainer.py using build_mlflow_run_name()
# Extract backbone name (e.g., "distilbert" from "distilbert-base-uncased" or use as-is if already short)
backbone_value = final_training_config["backbone"]
# If backbone contains hyphens, extract the first part (e.g., "distilbert" from "distilbert-base-uncased")
# Otherwise use as-is (e.g., "distilbert" stays "distilbert")
backbone_name = backbone_value.split("-")[0] if "-" in backbone_value else backbone_value


In [None]:
# Check if we should skip training
if 'SKIP_TRAINING' in locals() and SKIP_TRAINING:
    print("Skipping training - already completed")

    checkpoint_path = None

    # First, try new fingerprint-based location
    if 'final_output_dir' in locals() and final_output_dir:
        new_checkpoint = final_output_dir / "checkpoint"
        if new_checkpoint.exists() and any(new_checkpoint.iterdir()):
            checkpoint_path = new_checkpoint
        else:
            checkpoint_path = None

    # Third, try actual checkpoint location
    if not checkpoint_path:
        actual_checkpoint = ROOT_DIR / "outputs" / "checkpoint"
        if actual_checkpoint.exists() and any(actual_checkpoint.iterdir()):
            checkpoint_path = actual_checkpoint
            final_output_dir = actual_checkpoint.parent

    if checkpoint_path and checkpoint_path.exists():
        if 'artifacts_uploaded' not in locals():
            try:
                artifacts_uploaded = are_training_artifacts_uploaded(
                    ROOT_DIR, CONFIG_DIR, stable_training_name
                )
            except Exception:
                artifacts_uploaded = False

        if not artifacts_uploaded:
            metrics_file = checkpoint_path.parent / "metrics.json"
            if not metrics_file.exists():
                metrics_file = checkpoint_path.parent.parent / "metrics.json"

            metrics = None
            if metrics_file.exists():
                import json
                with open(metrics_file, "r") as f:
                    metrics = json.load(f)

            UPLOAD_ARTIFACTS = True
        else:
            UPLOAD_ARTIFACTS = False
    else:
        print("⚠ Warning: Checkpoint not found - proceeding without checkpoint")

        if 'final_output_dir' not in locals() or not final_output_dir:
            if 'training_context' in locals():
                from orchestration.naming_centralized import build_output_path
                final_output_dir = build_output_path(
                    ROOT_DIR, training_context
                )
            else:
                from orchestration.naming_centralized import (
                    create_naming_context,
                    build_output_path,
                )
                from orchestration.fingerprints import (
                    compute_spec_fp,
                    compute_exec_fp,
                )
                from orchestration.final_training_config import (
                    _compute_next_variant,
                )
                from orchestration.config_loader import load_all_configs
                from shared.platform_detection import detect_platform

                environment = detect_platform()
                backbone_name = final_training_config.get(
                    "backbone", "unknown"
                )
                if "-" in backbone_name:
                    backbone_name = backbone_name.split("-")[0]

                try:
                    from orchestration.metadata_manager import (
                        load_training_metadata,
                    )
                    old_meta = load_training_metadata(
                        ROOT_DIR, CONFIG_DIR, stable_training_name
                    )
                    spec_fp = old_meta.get("fingerprints", {}).get("spec_fp")
                    exec_fp = old_meta.get("fingerprints", {}).get("exec_fp")
                except Exception:
                    spec_fp = None
                    exec_fp = None

                if not spec_fp or not exec_fp:
                    all_configs = load_all_configs(experiment_config)
                    spec_fp = compute_spec_fp(
                        model_config=all_configs.get("model", {}),
                        data_config=all_configs.get("data", {}),
                        train_config=all_configs.get("train", {}),
                        seed=final_training_config.get("random_seed", 42),
                    )

                    try:
                        git_sha = subprocess.check_output(
                            ["git", "rev-parse", "HEAD"],
                            cwd=ROOT_DIR,
                            stderr=subprocess.DEVNULL,
                        ).decode().strip()
                    except Exception:
                        git_sha = None

                    exec_fp = compute_exec_fp(
                        git_sha=git_sha,
                        env_config=all_configs.get("env", {}),
                    )

                training_context = create_naming_context(
                    process_type="final_training",
                    model=backbone_name,
                    spec_fp=spec_fp,
                    exec_fp=exec_fp,
                    environment=environment,
                    variant=_compute_next_variant(
                        ROOT_DIR,
                        CONFIG_DIR,
                        spec_fp,
                        exec_fp,
                        backbone_name,
                    ),
                )
                final_output_dir = build_output_path(
                    ROOT_DIR, training_context
                )

        UPLOAD_ARTIFACTS = False

    result = type(
        "obj",
        (object,),
        {"returncode": 0, "stdout": "", "stderr": ""},
    )()

else:
    result = subprocess.run(
        training_args,
        cwd=ROOT_DIR,
        env=training_env,
        capture_output=True,
        text=True,
    )

if result.returncode != 0:
    raise RuntimeError(
        f"Final training failed with return code {result.returncode}"
    )
else:
    if result.stdout:
        print(result.stdout)

    if BACKUP_ENABLED:
        checkpoint_dir = final_output_dir / "checkpoint"
        if not checkpoint_dir.exists():
            actual_checkpoint = ROOT_DIR / "outputs" / "checkpoint"
            if actual_checkpoint.exists():
                checkpoint_dir = actual_checkpoint

        if checkpoint_dir.exists():
            backup_to_drive(checkpoint_dir, is_directory=True)
            print(
                f"Backed up final training checkpoint to Drive: {checkpoint_dir}"
            )

        metrics_file = final_output_dir / "metrics.json"
        if not metrics_file.exists():
            actual_metrics = ROOT_DIR / "outputs" / "metrics.json"
            if actual_metrics.exists():
                metrics_file = actual_metrics

        if metrics_file.exists():
            backup_to_drive(metrics_file, is_directory=False)

    if mlflow_tracking_uri:
        try:
            from orchestration.jobs.tracking.mlflow_tracker import (
                MLflowTrainingTracker,
            )
            training_tracker = MLflowTrainingTracker(
                mlflow_experiment_name
            )

            checkpoint_dir = final_output_dir / "checkpoint"
            if not checkpoint_dir.exists():
                actual_checkpoint = ROOT_DIR / "outputs" / "checkpoint"
                if actual_checkpoint.exists():
                    checkpoint_dir = actual_checkpoint

            metrics_json_path = final_output_dir / "metrics.json"
            if not metrics_json_path.exists():
                actual_metrics = ROOT_DIR / "outputs" / "metrics.json"
                if actual_metrics.exists():
                    metrics_json_path = actual_metrics

            if checkpoint_dir.exists() or metrics_json_path.exists():
                import mlflow
                # Use new systematic MLflow run finder
                from orchestration.jobs.tracking.finder.run_finder import find_mlflow_run

                # Find MLflow run using new finder
                report = find_mlflow_run(
                    experiment_name=mlflow_experiment_name,
                    context=training_context if 'training_context' in locals() else None,
                    output_dir=final_output_dir if 'final_output_dir' in locals() else None,
                    strict=True,  # Default: fail loud instead of attaching to wrong run
                    root_dir=ROOT_DIR,
                    config_dir=CONFIG_DIR if 'CONFIG_DIR' in locals() else None,
                )

                if report.found and report.run_id:
                    with mlflow.start_run(run_id=report.run_id):
                        training_tracker.log_training_artifacts(
                            checkpoint_dir=checkpoint_dir
                            if checkpoint_dir.exists()
                            else None,
                            metrics_json_path=metrics_json_path
                            if metrics_json_path.exists()
                            else None,
                        )
                        print(f"✓ Logged training artifacts to MLflow run {report.run_id}")
                        print(f"  Strategy used: {report.strategy_used}")
                else:
                    print(f"⚠ Could not find MLflow run for artifact upload")
                    print(f"  Experiment: {mlflow_experiment_name}")
                    if report.error:
                        print(f"  Error: {report.error}")
                    if report.strategies_attempted:
                        print(f"  Attempted strategies: {', '.join(report.strategies_attempted)}")
                    print(f"  Try checking the MLflow UI for the most recent run")
        except Exception as e:
            print(
                f"⚠ Failed to log training artifacts to MLflow: {e}"
            )

: 

In [None]:
import json
import shutil
from pathlib import Path
import os

# Check actual checkpoint location
# The training script may save to outputs/checkpoint instead of final_output_dir/checkpoint
actual_checkpoint = ROOT_DIR / "outputs" / "checkpoint"
actual_metrics = ROOT_DIR / "outputs" / METRICS_FILENAME
expected_checkpoint = final_output_dir / "checkpoint"
expected_metrics = final_output_dir / METRICS_FILENAME

print("Checking training completion...")
print(f"  Expected checkpoint: {expected_checkpoint} (exists: {expected_checkpoint.exists()})")
print(f"  Actual checkpoint: {actual_checkpoint} (exists: {actual_checkpoint.exists()})")
print(f"  Expected metrics: {expected_metrics} (exists: {expected_metrics.exists()})")
print(f"  Actual metrics: {actual_metrics} (exists: {actual_metrics.exists()})")

# Determine which checkpoint and metrics to use
checkpoint_source = None
metrics_file = None

if expected_checkpoint.exists() and any(expected_checkpoint.iterdir()):
    checkpoint_source = expected_checkpoint
    print(f"✓ Using expected checkpoint location: {checkpoint_source}")
elif actual_checkpoint.exists() and any(actual_checkpoint.iterdir()):
    checkpoint_source = actual_checkpoint
    print(f"✓ Using actual checkpoint location: {checkpoint_source}")
    # Update final_output_dir to match actual location
    final_output_dir = actual_checkpoint.parent

if expected_metrics.exists():
    metrics_file = expected_metrics
elif actual_metrics.exists():
    metrics_file = actual_metrics

# Load metrics if available
metrics = None
if metrics_file and metrics_file.exists():
    with open(metrics_file, "r") as f:
        metrics = json.load(f)
    print(f"✓ Metrics loaded from: {metrics_file}")
    print(f"  Metrics: {metrics}")
elif checkpoint_source:
    print(f"⚠ Warning: Metrics file not found, but checkpoint exists.")
    metrics = {"status": "completed", "checkpoint_found": True}

# Save training completion to metadata using new system with fingerprints
if 'training_context' in locals() and not training_complete:
    from orchestration.metadata_manager import save_metadata_with_fingerprints
    from orchestration.index_manager import update_index
    
    metadata_content = {
        "backbone": best_configuration.get('backbone', 'unknown'),
        "trial_name": best_configuration.get('trial_name', 'unknown'),
        "trial_id": best_configuration.get('trial_id', 'unknown'),
        "checkpoint_path": str(checkpoint_source) if checkpoint_source else None,
        "metrics": metrics,
    }
    
    # Save metadata with fingerprints
    metadata_file = save_metadata_with_fingerprints(
        ROOT_DIR,
        CONFIG_DIR,
        training_context,
        metadata_content,
        status_updates={
            "training": {
                "completed": True,
            }
        }
    )
    print(f"✓ Saved training completion to metadata: {metadata_file}")
    
    # Update index for fast lookup
    update_index(ROOT_DIR, CONFIG_DIR, training_context, metadata_content)
    print(f"✓ Updated training index")
    
    if 'stable_training_name' in locals():
        from datetime import datetime
        best_config_timestamp = best_configuration.get('cache_metadata', {}).get('saved_at', datetime.now().isoformat())
        if isinstance(best_config_timestamp, str) and 'T' in best_config_timestamp:
            best_config_timestamp = best_config_timestamp.split('T')[0].replace('-', '') + '_' + best_config_timestamp.split('T')[1].split('.')[0].replace(':', '')
        
elif checkpoint_source:
    # Checkpoint found but training_context condition not met - still valid
    pass
else:
    # No checkpoint found - raise error
    raise FileNotFoundError(
        f"Training completed but no checkpoint found.\n"
        f"  Expected: {expected_checkpoint}\n"
        f"  Actual: {actual_checkpoint}\n"
        f"  Please check training logs for errors."
    )


Checking training completion...
  Expected checkpoint: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\local\distilbert\spec_81710c3324325ad0_exec_8d244347b2eff67e\v1\checkpoint (exists: True)
  Actual checkpoint: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\checkpoint (exists: False)
  Expected metrics: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\local\distilbert\spec_81710c3324325ad0_exec_8d244347b2eff67e\v1\metrics.json (exists: True)
  Actual metrics: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\metrics.json (exists: False)
✓ Using expected checkpoint location: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\local\distilbert\spec_81710c3324325ad0_exec_8d244347b2eff67e\v1\checkpoint
✓ Metrics loaded from: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\local\distilbert\spec_81710c3324325ad0_exec_8d244347b2eff67e\v1\metrics.json
  M

In [None]:
from orchestration.paths import (
    resolve_output_path,
    save_cache_with_dual_strategy,
)
from datetime import datetime
import subprocess

# Ensure final_output_dir is defined
if 'final_output_dir' not in locals() or not final_output_dir:
    # Try to get from training_context
    if 'training_context' in locals():
        from orchestration.naming_centralized import build_output_path
        final_output_dir = build_output_path(ROOT_DIR, training_context)
    else:
        # Fallback: use expected path structure
        from orchestration.naming_centralized import (
            create_naming_context,
            build_output_path,
        )
        from orchestration.fingerprints import (
            compute_spec_fp,
            compute_exec_fp,
        )
        from orchestration.final_training_config import _compute_next_variant
        from orchestration.config_loader import load_all_configs
        from shared.platform_detection import detect_platform

        environment = detect_platform()
        backbone_name = final_training_config.get(
            "backbone", "unknown"
        )
        if "-" in backbone_name:
            backbone_name = backbone_name.split("-")[0]

        # Compute fingerprints (this should match the earlier computation)
        all_configs = load_all_configs(experiment_config)
        spec_fp = compute_spec_fp(
            model_config=all_configs.get("model", {}),
            data_config=all_configs.get("data", {}),
            train_config=all_configs.get("train", {}),
            seed=final_training_config.get("random_seed", 42),
        )

        try:
            git_sha = subprocess.check_output(
                ["git", "rev-parse", "HEAD"],
                cwd=ROOT_DIR,
                stderr=subprocess.DEVNULL,
            ).decode().strip()
        except Exception:
            git_sha = None

        exec_fp = compute_exec_fp(
            git_sha=git_sha,
            env_config=all_configs.get("env", {}),
        )

        training_context = create_naming_context(
            process_type="final_training",
            model=backbone_name,
            spec_fp=spec_fp,
            exec_fp=exec_fp,
            environment=environment,
            variant=_compute_next_variant(
                ROOT_DIR,
                CONFIG_DIR,
                spec_fp,
                exec_fp,
                backbone_name,
            ),
        )
        final_output_dir = build_output_path(ROOT_DIR, training_context)

# Prepare cache data
final_training_cache_data = {
    "output_dir": str(final_output_dir),
    "backbone": final_training_config["backbone"],
    "run_id": stable_training_name,
    "config": final_training_config,
    "metrics": metrics,
}

# Save using dual file strategy
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backbone = (
    final_training_config["backbone"]
    .replace("-", "_")
    .replace("/", "_")
)

# Use fingerprint-based identifier for cache naming
if (
    'training_context' in locals()
    and training_context.spec_fp
    and training_context.exec_fp
):
    from orchestration.naming_centralized import build_parent_training_id
    cache_identifier = build_parent_training_id(
        spec_fp=training_context.spec_fp,
        exec_fp=training_context.exec_fp,
        variant=training_context.variant,
    ).replace("/", "_")
else:
    cache_identifier = (
        stable_training_name.replace("-", "_").replace("/", "_")
    )

timestamped_file, latest_file, index_file = save_cache_with_dual_strategy(
    root_dir=ROOT_DIR,
    config_dir=CONFIG_DIR,
    cache_type="final_training",
    data=final_training_cache_data,
    backbone=backbone,
    identifier=cache_identifier,
    timestamp=timestamp,
    additional_metadata={
        "checkpoint_path": (
            str(checkpoint_source) if checkpoint_source else None
        ),
    },
)

print(f"✓ Saved timestamped final training cache: {timestamped_file}")
print(f"✓ Updated latest cache: {latest_file}")
print(f"✓ Updated index: {index_file}")

✓ Saved timestamped final training cache: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\final_training\final_training_distilbert_spec_81710c3324325ad0_exec_8d244347b2eff67e_v1_20251230_185958.json
✓ Updated latest cache: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\final_training\latest_final_training_cache.json
✓ Updated index: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\final_training\final_training_index.json


## Step P1-4: Model Conversion & Optimization

Convert the final training checkpoint to an optimized ONNX model (int8 quantized) for production inference.

**Platform Adapter Note**: The conversion script (`src/model_conversion/convert_to_onnx.py`) uses the platform adapter to automatically handle output paths and logging appropriately for local execution.

**Checkpoint Restoration**: 
- **Google Colab**: If the checkpoint is not found locally (e.g., after a session disconnect), it will be automatically restored from Google Drive.
- **Kaggle**: Checkpoints are automatically persisted in `/kaggle/working/` - no restoration needed.


In [None]:
from pathlib import Path
import os
import sys
import subprocess
import mlflow
import shutil
from shared.json_cache import load_json

CONVERSION_SCRIPT_PATH = SRC_DIR / "model_conversion" / "convert_to_onnx.py"
CONVERSION_OUTPUT_DIR = ROOT_DIR / "outputs" / "conversion"

from orchestration.metadata_manager import (
    is_conversion_complete,
    are_conversion_artifacts_uploaded,
    save_training_metadata,
    get_conversion_onnx_path,
)


In [None]:
from orchestration.paths import load_cache_file, resolve_output_path

# Try loading from centralized cache first
training_cache = load_cache_file(
    ROOT_DIR, CONFIG_DIR, "final_training", use_latest=True
)

# Restore cache from Drive if missing
cache_dir = resolve_output_path(ROOT_DIR, CONFIG_DIR, "cache", subcategory="final_training")
latest_cache_file = cache_dir / "latest_final_training_cache.json"
if latest_cache_file.exists() or ensure_restored_from_drive(latest_cache_file, is_directory=False):
    # Cache exists or was restored - load it
    if training_cache is None:
        from shared.json_cache import load_json
        training_cache = load_json(latest_cache_file, default=None)


In [None]:
# Extract checkpoint directory, backbone, and create conversion output directory
from datetime import datetime

# Get checkpoint directory from training cache
checkpoint_source = Path(training_cache.get("output_dir", "")) / "checkpoint"
if not checkpoint_source.exists():
    # Try alternative location
    checkpoint_source = Path(training_cache.get("output_dir", "")) / CHECKPOINT_DIRNAME
    if not checkpoint_source.exists():
        raise FileNotFoundError(
            f"Checkpoint not found in training cache output_dir: {training_cache.get('output_dir', '')}"
        )

checkpoint_dir = checkpoint_source

# Restore checkpoint from Drive if missing
if not checkpoint_dir.exists():
    # Try restoring checkpoint directory
    if ensure_restored_from_drive(checkpoint_dir, is_directory=True):
        print(f"Restored checkpoint from Drive: {checkpoint_dir}")
    else:
        # Try restoring entire output directory
        output_parent = checkpoint_dir.parent.parent
        if ensure_restored_from_drive(output_parent, is_directory=True):
            print(f"Restored output directory from Drive: {output_parent}")
        else:
            raise FileNotFoundError(
                f"Checkpoint not found locally or in Drive: {checkpoint_dir}\n"
                f"Please ensure final training completed successfully."
            )

print(f"Using checkpoint: {checkpoint_dir}")

# Extract backbone from training cache
backbone = training_cache.get("backbone", "unknown")
if backbone == "unknown":
    # Try to get from config
    backbone = training_cache.get("config", {}).get("backbone", "unknown")
    if backbone == "unknown":
        raise ValueError("Could not determine backbone from training cache")

# Extract backbone name (e.g., "distilbert" from "distilbert-base-uncased")
backbone_name = backbone.split("-")[0] if "-" in backbone else backbone

# Use new centralized naming system for conversion
# Build parent_training_id from training context (if available) or from cache
if 'training_context' in locals():
    from orchestration.naming_centralized import build_parent_training_id
    parent_training_id = build_parent_training_id(
        spec_fp=training_context.spec_fp,
        exec_fp=training_context.exec_fp,
        variant=training_context.variant
    )
else:
    # Fallback: construct from training cache or use directory name
    training_output_dir = Path(training_cache.get("output_dir", ""))
    # Try to extract from path: outputs/final_training/<env>/<model>/spec_<spec_fp>_exec_<exec_fp>/v<variant>
    parent_training_id = training_output_dir.name if training_output_dir.exists() else None
    if not parent_training_id or not parent_training_id.startswith("spec_"):
        # Fallback to old naming
        if 'stable_training_name' in locals():
            parent_training_id = stable_training_name
        else:
            parent_training_id = f"{backbone_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

# Create conversion context with fingerprints
from orchestration.naming_centralized import create_naming_context, build_output_path
from orchestration.fingerprints import compute_conv_fp

# Get conversion config (defaults if not specified)
conversion_config = {
    "opset": 18,  # Default from conversion script
    "quantization": "int8",
    "dynamic_axes": True,
}

# Compute conv_fp (requires parent spec_fp and exec_fp)
# Try to get from training context, otherwise compute from configs
if 'training_context' in locals():
    parent_spec_fp = training_context.spec_fp
    parent_exec_fp = training_context.exec_fp
else:
    # Compute from configs
    from orchestration.config_loader import load_all_configs
    all_configs = load_all_configs(experiment_config)
    parent_spec_fp = compute_spec_fp(
        model_config=all_configs.get("model", {}),
        data_config=all_configs.get("data", {}),
        train_config=all_configs.get("train", {}),
        seed=training_cache.get("config", {}).get("random_seed", 42)
    )
    try:
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], 
            cwd=ROOT_DIR, 
            stderr=subprocess.DEVNULL
        ).decode().strip()
    except Exception:
        git_sha = "unknown"
    parent_exec_fp = compute_exec_fp(
        git_sha=git_sha,
        env_config=all_configs.get("env", {})
    )

conv_fp = compute_conv_fp(
    parent_spec_fp=parent_spec_fp,
    parent_exec_fp=parent_exec_fp,
    conversion_config=conversion_config
)

print(f"✓ Computed conversion fingerprints:")

# Ensure environment is defined
if 'environment' not in locals():
    from shared.platform_detection import detect_platform
    environment = detect_platform()

# Create conversion context
conversion_context = create_naming_context(
    process_type="conversion",
    model=backbone_name,
    environment=environment,
    parent_training_id=parent_training_id,
    conv_fp=conv_fp
)

conversion_output_dir = build_output_path(ROOT_DIR, conversion_context)
conversion_output_dir.mkdir(parents=True, exist_ok=True)

print(f"✓ Conversion output directory: {conversion_output_dir}")

# Check if conversion already exists
onnx_model_path = conversion_output_dir / "model_int8.onnx"
if not onnx_model_path.exists():
    onnx_model_path = conversion_output_dir / "model.onnx"

conversion_complete = onnx_model_path.exists()

stable_training_name = f"{backbone_name}_{training_cache.get('config', {}).get('trial_name', 'unknown')}"

if conversion_complete:
    print(f"✓ Conversion already completed at: {conversion_output_dir}")
    print(f"  ONNX model: {onnx_model_path}")
    # Check if artifacts are already uploaded
    try:
        from orchestration.metadata_manager import are_conversion_artifacts_uploaded
        artifacts_uploaded = are_conversion_artifacts_uploaded(
            ROOT_DIR, CONFIG_DIR, stable_training_name
        )
        if artifacts_uploaded:
            print(f"  ✓ Artifacts already uploaded to MLflow")
            SKIP_CONVERSION = True
        else:
            print(f"  ⚠ Artifacts not yet uploaded - will upload after loading ONNX model")
            SKIP_CONVERSION = True  # Conversion done, just need to upload
    except Exception:
        SKIP_CONVERSION = True  # Conversion done, upload status unknown
else:
    print(f"Conversion not yet completed - will proceed with conversion")
    SKIP_CONVERSION = False


Using checkpoint: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\local\distilbert\spec_81710c3324325ad0_exec_8d244347b2eff67e\v1\checkpoint
✓ Computed conversion fingerprints:
✓ Conversion output directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\conversion\local\distilbert\spec_81710c3324325ad0_exec_8d244347b2eff67e\v1\conv_7f8c58314bae731e
Conversion not yet completed - will proceed with conversion


In [None]:
# Run conversion as a module (python -m model_conversion.convert_to_onnx) to allow relative imports to work
# This requires src/ to be in PYTHONPATH (set in env below)
conversion_args = [
    sys.executable,
    "-m",
    "model_conversion.convert_to_onnx",
    "--checkpoint-path",
    str(checkpoint_dir),
    "--config-dir",
    str(CONFIG_DIR),
    "--backbone",
    backbone,
    "--output-dir",
    str(conversion_output_dir),
    "--quantize-int8",
    "--run-smoke-test",
]


In [None]:
conversion_env = os.environ.copy()
conversion_env["AZURE_ML_OUTPUT_onnx_model"] = str(conversion_output_dir)

# Add src directory to PYTHONPATH to allow relative imports in model_conversion.convert_to_onnx
pythonpath = conversion_env.get("PYTHONPATH", "")
if pythonpath:
    conversion_env["PYTHONPATH"] = f"{str(SRC_DIR)}{os.pathsep}{pythonpath}"
else:
    conversion_env["PYTHONPATH"] = str(SRC_DIR)

mlflow_tracking_uri = mlflow.get_tracking_uri()
if mlflow_tracking_uri:
    conversion_env["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri


In [None]:
result = subprocess.run(
    conversion_args,
    cwd=ROOT_DIR,
    env=conversion_env,
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    print("Model conversion failed with the following output:")
    print("=" * 80)
    if result.stdout:
        print("STDOUT:")
        print(result.stdout)
    if result.stderr:
        print("STDERR:")
        print(result.stderr)
    print("=" * 80)
    raise RuntimeError(f"Model conversion failed with return code {result.returncode}")
else:

    # Log conversion artifacts to MLflow
    if mlflow_tracking_uri:
        try:
            from orchestration.jobs.tracking.mlflow_tracker import MLflowConversionTracker
            
            # Get MLflow experiment name for conversion
            conversion_experiment_name = f"{experiment_config.name}-conversion"
            conversion_tracker = MLflowConversionTracker(conversion_experiment_name)
            
            # Get ONNX model path
            onnx_model_path = conversion_output_dir / "model_int8.onnx"
            if not onnx_model_path.exists():
                onnx_model_path = conversion_output_dir / "model.onnx"
            
            # Get source training run ID (if available)
            source_training_run_id = None
            try:
                # Try to get from training cache
                training_cache = load_cache_file(
                    ROOT_DIR, CONFIG_DIR, "final_training", use_latest=True
                )
                if training_cache and "mlflow_run_id" in training_cache:
                    source_training_run_id = training_cache["mlflow_run_id"]
            except Exception:
                pass
            
            import mlflow
            from mlflow.tracking import MlflowClient
            
            # Create or find MLflow run for conversion
            # Since conversion runs via subprocess, we need to create a run here
            mlflow.set_experiment(conversion_experiment_name)
            
            # Create a new run for this conversion
            
            # Use systematic naming if conversion_context is available
            from orchestration.jobs.tracking.naming.run_names import build_mlflow_run_name
            
            if 'conversion_context' in locals():
                run_name = build_mlflow_run_name(
                    conversion_context, 
                    config_dir=CONFIG_DIR,
                    root_dir=ROOT_DIR,
                    output_dir=conversion_output_dir
                )
            else:
                # Fallback to manual construction if context not available
                from datetime import datetime
                conversion_run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
                run_name = f"{backbone}_conversion_{conversion_run_id}"
            with mlflow.start_run(run_name=run_name) as conversion_run:
                # Set tags
                mlflow.set_tag("conversion_type", "onnx_int8" if "int8" in str(onnx_model_path) else "onnx_fp32")
                if source_training_run_id:
                    mlflow.set_tag("source_training_run", source_training_run_id)
                
                # Log conversion parameters
                conversion_tracker.log_conversion_parameters(
                    checkpoint_path=str(checkpoint_dir),
                    conversion_target="onnx_int8" if "int8" in str(onnx_model_path) else "onnx_fp32",
                    quantization="int8" if "int8" in str(onnx_model_path) else "none",
                    opset_version=18,  # Hardcoded in conversion script
                    backbone=backbone,
                )
                
                # Log artifacts if ONNX model exists
                if onnx_model_path.exists():
                    # Calculate compression ratio if possible
                    original_checkpoint_size_mb = None
                    try:
                        import os
                        total_size = sum(f.stat().st_size for f in checkpoint_dir.rglob('*') if f.is_file())
                        original_checkpoint_size_mb = total_size / (1024 * 1024)
                    except Exception:
                        pass
                    
                    # Log conversion results
                    conversion_tracker.log_conversion_results(
                        conversion_success=True,
                        onnx_model_path=onnx_model_path,
                        original_checkpoint_size=original_checkpoint_size_mb,
                        smoke_test_passed=None,  # Could parse from output if needed
                        conversion_log_path=None,  # Could add if conversion script logs to file
                    )
                    print(f"✓ Logged conversion artifacts to MLflow run {conversion_run.info.run_id}")
                else:
                    print(f"⚠ ONNX model not found: {onnx_model_path}")
        except Exception as e:
            print(f"⚠ Failed to log conversion artifacts to MLflow: {e}")




✓ Logged conversion artifacts to MLflow run 2286bf78-a801-4408-a525-6647f42e7839
🏃 View run distilbert_conversion_20251230_190044 at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/713fa6a7-15e4-4b23-bf15-6f6ca0242b78/runs/2286bf78-a801-4408-a525-6647f42e7839
🧪 View experiment at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/713fa6a7-15e4-4b23-bf15-6f6ca0242b78


In [None]:
from shared.json_cache import save_json
import shutil

ONNX_MODEL_FILENAME = "model_int8.onnx"
FALLBACK_ONNX_MODEL_FILENAME = "model.onnx"

onnx_model_path = conversion_output_dir / ONNX_MODEL_FILENAME
if not onnx_model_path.exists():
    onnx_model_path = conversion_output_dir / FALLBACK_ONNX_MODEL_FILENAME

if not onnx_model_path.exists():
    raise FileNotFoundError(f"ONNX model not found in {conversion_output_dir}")

print(f"✓ Conversion completed. ONNX model: {onnx_model_path}")

# Backup ONNX model to Google Drive (mirrors local structure)
if onnx_model_path.exists():
    backup_to_drive(onnx_model_path, is_directory=False)
else:
    print(f"⚠ Warning: ONNX model not found for backup: {onnx_model_path}")

# Backup entire conversion output directory to Drive
backup_to_drive(conversion_output_dir, is_directory=True)

# Backup conversion cache file to Drive (if it's in outputs/, otherwise skip)
# If you want to backup cache files, move them to outputs/cache/ or backup manually


✓ Conversion completed. ONNX model: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\conversion\local\distilbert\spec_81710c3324325ad0_exec_8d244347b2eff67e\v1\conv_7f8c58314bae731e\model_int8.onnx


False