# Phase 1: Training Orchestration (Google Colab & Kaggle)

This notebook orchestrates all training activities for **Google Colab or Kaggle execution** with GPU compute support.

## Overview

- **Step 1**: Repository Setup & Environment Configuration
- **Step 2**: Load Centralized Configs
- **Step 3**: Verify Local Dataset (from data config)
- **Step 4**: Setup Local Environment
- **Step 5**: The Dry Run
- **Step 6**: The Sweep (HPO) - Local with Optuna
- **Step 5.5**: Benchmarking Best Trials (NEW)
- **Step 7**: Best Configuration Selection (Automated)
- **Step 8**: Final Training (Post-HPO, Single Run)
- **Step 9**: Model Conversion & Optimization

## Important

- This notebook **executes training in Google Colab or Kaggle** (not on Azure ML)
- All computation happens on the platform's GPU
- **Storage & Persistence**:
  - **Google Colab**: Checkpoints are automatically saved to Google Drive for persistence across sessions
  - **Kaggle**: Outputs in `/kaggle/working/` are automatically persisted - no manual backup needed
- The notebook must be **re-runnable end-to-end**
- Uses the dataset path specified in the data config (from `config/data/*.yaml`), typically pointing to a local folder included in the repository
- **Session Management**:
  - **Colab**: Sessions time out after 12-24 hours (depending on Colab plan). Checkpoints are saved to Drive automatically.
  - **Kaggle**: Sessions have time limits based on your plan. All outputs are automatically saved.


## Step 1: Environment Detection

The notebook automatically detects the execution environment (local, Google Colab, or Kaggle) and adapts its behavior accordingly.


In [1]:
import os
from pathlib import Path

# Detect execution environment
IN_COLAB = "COLAB_GPU" in os.environ or "COLAB_TPU" in os.environ
IN_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
IS_LOCAL = not IN_COLAB and not IN_KAGGLE

# Set platform-specific constants
if IN_COLAB:
    PLATFORM = "colab"
    BASE_DIR = Path("/content")
    BACKUP_ENABLED = True
elif IN_KAGGLE:
    PLATFORM = "kaggle"
    BASE_DIR = Path("/kaggle/working")
    BACKUP_ENABLED = False
else:
    PLATFORM = "local"
    BASE_DIR = None  # Will use Path.cwd() instead
    BACKUP_ENABLED = False

print(f"✓ Detected environment: {PLATFORM.upper()}")
print(f"Platform: {PLATFORM}")
if BASE_DIR:
    print(f"Base directory: {BASE_DIR}")
else:
    print(f"Base directory: Will use current working directory")
print(f"Backup enabled: {BACKUP_ENABLED}")


✓ Detected environment: LOCAL
Platform: local
Base directory: Will use current working directory
Backup enabled: False


## Step 2: Repository Setup

**Note**: Repository setup is only needed for Colab/Kaggle environments. Local environments should already have the repository cloned.

### For Colab/Kaggle: Clone from Git or Upload Files

Choose one of the following options:

**Option A: Clone from Git (Recommended)**

If your repository is on GitHub/GitLab, clone it:

**For Google Colab:**
```python
!git clone -b feature/google-colab-compute https://github.com/longdang193/resume-ner-azureml.git /content/resume-ner-azureml
```

**For Kaggle:**
```python
!git clone -b feature/google-colab-compute https://github.com/longdang193/resume-ner-azureml.git /kaggle/working/resume-ner-azureml
```

**Option B: Upload Files**

**For Google Colab:**
1. Use the Colab file browser (folder icon on left sidebar)
2. Upload your project files to `/content/resume-ner-azureml/`
3. Ensure the directory structure matches: `src/`, `config/`, `notebooks/`, etc.

**For Kaggle:**
1. Use the Kaggle file browser (Data tab)
2. Upload your project files to `/kaggle/working/resume-ner-azureml/`
3. Ensure the directory structure matches: `src/`, `config/`, `notebooks/`, etc.

### For Local: Repository Already Exists

Local environments should have the repository already cloned. The notebook will automatically detect the repository location.


In [2]:
# Repository setup - only needed for Colab/Kaggle
if not IS_LOCAL:
    if IN_KAGGLE:
        # For Kaggle
        !git clone -b feature/google-colab-compute https://github.com/longdang193/resume-ner-azureml.git /kaggle/working/resume-ner-azureml
    elif IN_COLAB:
        # For Google Colab
        !git clone -b feature/google-colab-compute https://github.com/longdang193/resume-ner-azureml.git /content/resume-ner-azureml
else:
    print("✓ Local environment detected - assuming repository already exists")

✓ Local environment detected - assuming repository already exists


### Verify Repository Setup

Verify the repository structure exists:


In [3]:
import sys
from pathlib import Path

# Unified path setup for all environments
if IS_LOCAL:
    # Local: assume notebook is in notebooks/ directory
    NOTEBOOK_DIR = Path.cwd()
    ROOT_DIR = NOTEBOOK_DIR.parent
else:
    # Colab/Kaggle: use fixed paths
    ROOT_DIR = BASE_DIR / "resume-ner-azureml"

SRC_DIR = ROOT_DIR / "src"
CONFIG_DIR = ROOT_DIR / "config"
NOTEBOOK_DIR = ROOT_DIR / "notebooks"

# Verify repository structure
if not ROOT_DIR.exists():
    if IS_LOCAL:
        raise FileNotFoundError(
            f"Repository not found at {ROOT_DIR}\n"
            f"Please ensure you're running this notebook from the notebooks/ directory of the repository."
        )
    else:
        raise FileNotFoundError(
            f"Repository not found at {ROOT_DIR}\n"
            f"Please run Step 2 to clone or upload the repository."
        )

required_dirs = ["src", "config", "notebooks"]
missing_dirs = [d for d in required_dirs if not (ROOT_DIR / d).exists()]

if missing_dirs:
    raise FileNotFoundError(
        f"Missing required directories: {missing_dirs}\n"
        f"Please ensure the repository structure is correct."
    )

# Add to Python path
sys.path.insert(0, str(ROOT_DIR))
sys.path.insert(0, str(SRC_DIR))

print(f"✓ Repository found at: {ROOT_DIR}")
print(f"✓ Required directories found: {required_dirs}")
print(f"Notebook directory: {NOTEBOOK_DIR}")
print(f"Project root: {ROOT_DIR}")
print(f"Source directory: {SRC_DIR}")
print(f"Config directory: {CONFIG_DIR}")


✓ Repository found at: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml
✓ Required directories found: ['src', 'config', 'notebooks']
Notebook directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\notebooks
Project root: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml
Source directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\src
Config directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config


## Step 3: Install Dependencies

**For Local**: Use conda environment (instructions below).  
**For Colab/Kaggle**: Install packages via pip (automated below).

### Local Environment Setup

For local execution, create and activate a conda environment:

1. Open a terminal in the project root
2. Create the conda environment: `conda env create -f config/environment/conda.yaml`
3. Activate: `conda activate resume-ner-training`
4. Restart the kernel after activation

### Colab/Kaggle: Automated Installation

PyTorch is usually pre-installed in Colab/Kaggle, but we'll verify and install other required packages.
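
Before installing anything, it can help to see which packages are actually missing. The following is a minimal sketch using only the standard library; `find_missing_packages` and the `REQUIRED` list are hypothetical names introduced here for illustration, and the list is only a subset of the packages this notebook installs.

```python
from importlib.metadata import PackageNotFoundError, version


def find_missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            version(name)  # raises PackageNotFoundError if the distribution is absent
        except PackageNotFoundError:
            missing.append(name)
    return missing


# Hypothetical subset of the notebook's dependencies; adjust as needed.
REQUIRED = ["transformers", "datasets", "seqeval", "optuna", "mlflow"]
print("Missing packages:", find_missing_packages(REQUIRED))
```

An empty list suggests the install cell can be skipped in that session. This check only confirms presence, not that the installed versions satisfy the required ranges.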


In [4]:
import torch

# Check PyTorch version and GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    device_count = torch.cuda.device_count()
    print(f"Visible GPUs: {device_count}")
    for i in range(device_count):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

# Verify PyTorch version meets requirements (>=2.6.0)
torch_version = tuple(map(int, torch.__version__.split('.')[:2]))
if torch_version < (2, 6):
    print(f"⚠ Warning: PyTorch {torch.__version__} may not meet requirements (>=2.6.0)")
    if not IS_LOCAL:
        # Quote the specifier so the shell does not treat ">" as a redirect
        print('Consider upgrading: !pip install "torch>=2.6.0" --upgrade')
else:
    print("✓ PyTorch version meets requirements")


PyTorch version: 2.9.1
CUDA available: True
Visible GPUs: 1
  GPU 0: Quadro T1000
✓ PyTorch version meets requirements


In [5]:
# Install required packages - only for Colab/Kaggle
if IS_LOCAL:
    print("For local environment, please:")
    print("1. Create conda environment: conda env create -f config/environment/conda.yaml")
    print("2. Activate: conda activate resume-ner-training")
    print("3. Restart kernel after activation")
    print("\nIf you've already done this, you can continue to the next cell.")
else:
    # Version specifiers are quoted so the shell does not treat ">" and "<" as redirects
    # Core ML libraries
    %pip install "transformers>=4.35.0,<5.0.0" --quiet
    %pip install "safetensors>=0.4.0" --quiet
    %pip install "datasets>=2.12.0" --quiet

    # ML utilities
    %pip install "numpy>=1.24.0,<2.0.0" --quiet
    %pip install "pandas>=2.0.0" --quiet
    %pip install "scikit-learn>=1.3.0" --quiet

    # Utilities
    %pip install "pyyaml>=6.0" --quiet
    %pip install "tqdm>=4.65.0" --quiet
    %pip install "seqeval>=1.2.2" --quiet
    %pip install "sentencepiece>=0.1.99" --quiet

    # Experiment tracking
    %pip install mlflow --quiet
    %pip install optuna --quiet

    # ONNX support
    %pip install onnxruntime --quiet
    %pip install "onnx>=1.16.0" --quiet
    %pip install "onnxscript>=0.1.0" --quiet

    print("✓ All dependencies installed")


For local environment, please:
1. Create conda environment: conda env create -f config/environment/conda.yaml
2. Activate: conda activate resume-ner-training
3. Restart kernel after activation

If you've already done this, you can continue to the next cell.


## Step 4: Setup Paths and Import Paths

Python paths are already configured in Step 2. This section verifies the setup.
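
Because more than one cell inserts the same directories into `sys.path`, re-running the notebook can accumulate duplicate entries. A minimal sketch of an idempotent alternative follows; `prepend_to_sys_path` is a hypothetical helper, not part of the repository.

```python
import sys
from pathlib import Path


def prepend_to_sys_path(path):
    """Put `path` at the front of sys.path, removing any earlier copy first."""
    entry = str(path)
    while entry in sys.path:
        sys.path.remove(entry)
    sys.path.insert(0, entry)


# Hypothetical usage; in this notebook the arguments would be ROOT_DIR and SRC_DIR.
prepend_to_sys_path(Path.cwd())
prepend_to_sys_path(Path.cwd())  # the second call does not add a duplicate
print(sys.path.count(str(Path.cwd())))
```

Duplicates are harmless for imports, so this is a tidiness measure rather than a required fix.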


In [6]:
# Environment detection and platform configuration
# Note: This cell repeats the environment detection from Step 1. If that cell was already executed, these variables are already set.
# This cell ensures they're set even if the Step 1 cell was skipped.
import os
from pathlib import Path

# Detect execution environment
IN_COLAB = "COLAB_GPU" in os.environ or "COLAB_TPU" in os.environ
IN_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
IS_LOCAL = not IN_COLAB and not IN_KAGGLE

# Set platform-specific constants (only if not already set)
if 'PLATFORM' not in globals():
    if IN_COLAB:
        PLATFORM = "colab"
        BASE_DIR = Path("/content")
        BACKUP_ENABLED = True
        print("✓ Detected: Google Colab environment")
    elif IN_KAGGLE:
        PLATFORM = "kaggle"
        BASE_DIR = Path("/kaggle/working")
        BACKUP_ENABLED = False  # Kaggle outputs are automatically persisted
        print("✓ Detected: Kaggle environment")
    else:
        PLATFORM = "local"
        BASE_DIR = None  # Will use Path.cwd() instead
        BACKUP_ENABLED = False
        print("✓ Detected: Local environment")

if 'PLATFORM' in globals():
    print(f"Platform: {PLATFORM}")
    if BASE_DIR:
        print(f"Base directory: {BASE_DIR}")
    else:
        print(f"Base directory: Will use current working directory")
    print(f"Backup enabled: {BACKUP_ENABLED}")


Platform: local
Base directory: Will use current working directory
Backup enabled: False


In [7]:
import os
import sys
from pathlib import Path

# Setup paths (ROOT_DIR should already be set by the "Verify Repository Setup" cell in Step 2)
# If not, set it here
if 'ROOT_DIR' not in globals():
    if IN_COLAB:
        ROOT_DIR = Path("/content/resume-ner-azureml")
    elif IN_KAGGLE:
        ROOT_DIR = Path("/kaggle/working/resume-ner-azureml")
    else:
        # Local: assume the notebook runs from the notebooks/ directory, as in Step 2
        ROOT_DIR = Path.cwd().parent

SRC_DIR = ROOT_DIR / "src"
CONFIG_DIR = ROOT_DIR / "config"
NOTEBOOK_DIR = ROOT_DIR / "notebooks"

# Add to Python path
sys.path.insert(0, str(ROOT_DIR))
sys.path.insert(0, str(SRC_DIR))

print("Notebook directory:", NOTEBOOK_DIR)
print("Project root:", ROOT_DIR)
print("Source directory:", SRC_DIR)
print("Config directory:", CONFIG_DIR)
print("Platform:", PLATFORM if 'PLATFORM' in globals() else "unknown")
print("In Colab:", IN_COLAB if 'IN_COLAB' in globals() else False)
print("In Kaggle:", IN_KAGGLE if 'IN_KAGGLE' in globals() else False)


Notebook directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\notebooks
Project root: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml
Source directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\src
Config directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config
Platform: local
In Colab: False
In Kaggle: False


## Step 4: Mount Google Drive

Mount Google Drive to enable checkpoint persistence across Colab sessions. Checkpoints will be automatically saved to Drive after training completes.
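
The backup and restore round trip can be tried without Drive. The sketch below uses temporary directories as stand-ins for the local checkpoint folder and the mounted Drive folder; `copy_dir` is a hypothetical helper, and unlike the notebook's backup helper (which deletes an existing backup before copying) it merges into an existing destination.

```python
import shutil
import tempfile
from pathlib import Path


def copy_dir(src: Path, dst: Path) -> None:
    """Copy directory `src` to `dst`, merging into `dst` if it already exists."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst, dirs_exist_ok=True)  # dirs_exist_ok needs Python 3.8+


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    checkpoint = tmp / "outputs" / "checkpoint"       # stand-in for a local checkpoint dir
    drive = tmp / "drive" / "resume-ner-checkpoints"  # stand-in for the mounted Drive folder
    checkpoint.mkdir(parents=True)
    (checkpoint / "model.bin").write_text("weights")

    copy_dir(checkpoint, drive / "checkpoint")  # backup
    shutil.rmtree(checkpoint)                   # simulate a fresh session
    copy_dir(drive / "checkpoint", checkpoint)  # restore
    restored = (checkpoint / "model.bin").read_text()
    print(restored)
```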


In [8]:
# Helper functions for checkpoint backup/restore (platform-aware)
import shutil
from pathlib import Path
from typing import Optional

def backup_to_drive(source_path: Path, backup_name: str, is_directory: bool = False) -> bool:
    """
    Backup a file or directory to Google Drive if available.
    
    Args:
        source_path: Path to the file or directory to backup
        backup_name: Name for the backup (will be placed in DRIVE_BACKUP_DIR)
        is_directory: True if backing up a directory, False for a file
    
    Returns:
        True if backup was successful, False if backup is not available or failed
    """
    if not BACKUP_ENABLED or not DRIVE_BACKUP_DIR:
        return False
    
    if not source_path.exists():
        print(f"⚠ Warning: Source path does not exist: {source_path}")
        return False
    
    backup_path = DRIVE_BACKUP_DIR / backup_name
    
    try:
        if is_directory:
            # Remove existing backup if it exists
            if backup_path.exists():
                shutil.rmtree(backup_path)
            shutil.copytree(source_path, backup_path)
        else:
            shutil.copy2(source_path, backup_path)
        
        print(f"✓ Backed up to Google Drive: {backup_path}")
        return True
    except Exception as e:
        print(f"⚠ Warning: Backup failed: {e}")
        return False

def restore_from_drive(backup_name: str, target_path: Path, is_directory: bool = False) -> bool:
    """
    Restore a file or directory from Google Drive if available.
    
    Args:
        backup_name: Name of the backup in DRIVE_BACKUP_DIR
        target_path: Path where the restored file/directory should be placed
        is_directory: True if restoring a directory, False for a file
    
    Returns:
        True if restore was successful, False if backup is not available or failed
    """
    if not BACKUP_ENABLED or not DRIVE_BACKUP_DIR:
        return False
    
    backup_path = DRIVE_BACKUP_DIR / backup_name
    
    if not backup_path.exists():
        return False
    
    try:
        if is_directory:
            # Create parent directory if needed
            target_path.parent.mkdir(parents=True, exist_ok=True)
            # dirs_exist_ok (Python 3.8+) lets a re-run restore over an existing directory
            shutil.copytree(backup_path, target_path, dirs_exist_ok=True)
        else:
            target_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(backup_path, target_path)
        
        print(f"✓ Restored from Google Drive: {target_path}")
        return True
    except Exception as e:
        print(f"⚠ Warning: Restore failed: {e}")
        return False

print("✓ Backup/restore helper functions defined")


✓ Backup/restore helper functions defined


In [9]:
from pathlib import Path

# Mount Google Drive (Colab only - Kaggle doesn't need this)
DRIVE_BACKUP_DIR = None

if IN_COLAB:
    try:
        from google.colab import drive
        drive.mount('/content/drive')
        DRIVE_BACKUP_DIR = Path("/content/drive/MyDrive/resume-ner-checkpoints")
        DRIVE_BACKUP_DIR.mkdir(parents=True, exist_ok=True)
        print("✓ Google Drive mounted")
        print(f"✓ Checkpoint backup directory: {DRIVE_BACKUP_DIR}")
        print("\nNote: Checkpoints will be automatically saved to this directory after training completes.")
    except ImportError:
        print("⚠ Warning: google.colab.drive not available. Backup to Google Drive will be disabled.")
        BACKUP_ENABLED = False
elif IN_KAGGLE:
    print("✓ Kaggle environment detected - outputs are automatically persisted (no Drive mount needed)")
else:
    print("✓ Local environment detected - backup to Google Drive is disabled.")




## Step P1-3.1: Load Centralized Configs

Load and validate all configuration files. Configs are immutable and will be logged with each job for reproducibility.

**Note**: 
- **Local**: Config files should already exist in the repository
- **Colab/Kaggle**: Config files will be auto-created if missing (useful for fresh environments)
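
One way such hashing and immutability checks could work is sketched below. This is an illustration under stated assumptions, not the repository's implementation: `hash_config` and `assert_unchanged` are hypothetical names, and the choice of SHA-256 over canonical JSON truncated to 16 hex characters is assumed.

```python
import copy
import hashlib
import json


def hash_config(config: dict, length: int = 16) -> str:
    """Hash a config dict via its canonical JSON form (sorted keys)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def assert_unchanged(configs: dict, snapshot: dict) -> None:
    """Raise if any config domain differs from its snapshot."""
    changed = [k for k in snapshot if hash_config(configs[k]) != hash_config(snapshot[k])]
    if changed:
        raise ValueError(f"Configs mutated at runtime: {changed}")


configs = {"train": {"epochs": 5, "learning_rate": 2e-5}}
snapshot = copy.deepcopy(configs)
assert_unchanged(configs, snapshot)  # passes: nothing changed

configs["train"]["epochs"] = 10      # simulate an accidental mutation
try:
    assert_unchanged(configs, snapshot)
except ValueError as err:
    print(err)
```

Sorting keys before hashing makes the hash independent of key order, so two logically identical configs produce the same identifier.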


In [10]:
# Optional: Update repository from git (only for Colab/Kaggle if needed)
# Uncomment and run if you need to pull latest changes
# if not IS_LOCAL:
#     !cd {ROOT_DIR} && git fetch origin feature/google-colab-compute
#     !cd {ROOT_DIR} && git reset --hard origin/feature/google-colab-compute

In [11]:
# Write config files only if they don't exist (useful for Colab/Kaggle fresh environments)
# Local environments should have configs already in the repo
if IS_LOCAL:
    print("✓ Local environment - assuming config files already exist in repository")
else:
    # Create the experiment config directory if it doesn't exist
    experiment_config_dir = CONFIG_DIR / "experiment"
    experiment_config_dir.mkdir(parents=True, exist_ok=True)
    
    config_path = experiment_config_dir / "resume_ner_baseline.yaml"
    
    # Only write if file doesn't exist
    if not config_path.exists():
        config_content = """
experiment_name: "resume_ner_baseline"

# Relative to the top-level config directory
data_config: "data/resume_v1.yaml"
model_config: "model/distilbert.yaml"
train_config: "train.yaml"
hpo_config: "hpo/prod.yaml"      # default HPO config; stages can override if needed
env_config: "env/azure.yaml"
benchmark_config: "benchmark.yaml"

# High-level orchestration design:
# - Stages: smoke → hpo → training
# - Smoke and HPO stage backbones are controlled by the HPO config file (search_space.backbone.values)
# - Training stage can target specific backbones via stage config
# - AML experiment names are per-stage, optionally per-backbone

stages:
  smoke:
    # AML experiment base name for smoke tests
    aml_experiment: "resume-ner-smoke"
    # HPO config for smoke/dry run tests (uses smoke.yaml with reduced trials)
    hpo_config: "hpo/smoke.yaml"
    # Backbones are controlled by the HPO config file (hpo_config) via search_space.backbone.values

  hpo:
    # AML experiment base name for HPO sweeps
    aml_experiment: "resume-ner-hpo"
    # HPO config for the production HPO sweep (prod.yaml, the same file as the top-level default)
    hpo_config: "hpo/prod.yaml"
    # Backbones are controlled by the HPO config file (hpo_config) via search_space.backbone.values

  training:
    # AML experiment base name for final single-run training
    aml_experiment: "resume-ner-train"
    # Final production backbone(s); typically one chosen after HPO
    backbones:
      - "distilbert"

# Optional naming policy for how AML experiments are derived per backbone.
# If true, the orchestrator should build experiment_name as:
#   "<aml_experiment>-<backbone>"
# otherwise it should use "<aml_experiment>" directly and rely on tags
# (stage/backbone) for grouping in AML.
naming:
  include_backbone_in_experiment: true
"""
        config_path.write_text(config_content)
        print(f"✓ Config file written to: {config_path}")
    else:
        print(f"✓ Config file already exists: {config_path}")

✓ Local environment - assuming config files already exist in repository


In [12]:
# Write HPO config file only if it doesn't exist (useful for Colab/Kaggle fresh environments)
# Local environments should have configs already in the repo
if IS_LOCAL:
    print("✓ Local environment - assuming HPO config files already exist in repository")
else:
    # Create the HPO config directory if it doesn't exist
    hpo_config_dir = CONFIG_DIR / "hpo"
    hpo_config_dir.mkdir(parents=True, exist_ok=True)
    
    config_path = hpo_config_dir / "prod.yaml"
    
    # Only write if file doesn't exist
    if not config_path.exists():
        config_content = """
search_space:
  backbone:
    type: "choice"
    values: ["distilroberta"]
  
  learning_rate:
    type: "loguniform"
    min: 1e-5
    max: 5e-5
  
  batch_size:
    type: "choice"
    values: [8, 16]
  
  dropout:
    type: "uniform"
    min: 0.1
    max: 0.3
  
  weight_decay:
    type: "loguniform"
    min: 0.001
    max: 0.1

sampling:
  algorithm: "random"
  max_trials: 20
  timeout_minutes: 960

early_termination:
  policy: "bandit"
  evaluation_interval: 1
  slack_factor: 0.2
  delay_evaluation: 2

objective:
  metric: "macro-f1"
  goal: "maximize"

# Selection strategy configuration for accuracy-speed tradeoff
selection:
  # Accuracy threshold for speed tradeoff (0.015 = 1.5% relative)
  # If two models are within this accuracy difference, prefer faster model
  # Set to null for accuracy-only selection (default behavior)
  accuracy_threshold: 0.015
  
  # Use relative threshold (percentage of best accuracy) vs absolute difference
  # Relative thresholds are more robust across different accuracy ranges
  # Default: true (recommended)
  use_relative_threshold: true
  
  # Minimum relative accuracy gain to justify slower model (optional)
  # If DeBERTa is < 2% better than DistilBERT, prefer DistilBERT
  # Set to null to disable this check
  min_accuracy_gain: 0.02

k_fold:
  enabled: true
  n_splits: 5
  random_seed: 42
  shuffle: true
  stratified: true

# Checkpoint configuration for HPO resume support
# Enables saving study state to SQLite database for resuming interrupted runs
checkpoint:
  enabled: true  # Set to true to enable checkpointing (useful for Colab/Kaggle)
  storage_path: "{backbone}/study.db"  # Relative to output_dir, {backbone} placeholder
  auto_resume: true  # Automatically resume if checkpoint exists (only if enabled=true)
"""
        config_path.write_text(config_content)
        print(f"✓ HPO config written to: {config_path}")
    else:
        print(f"✓ HPO config file already exists: {config_path}")

✓ Local environment - assuming HPO config files already exist in repository
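
The `selection` block in the HPO config above describes an accuracy-speed tradeoff: among trials whose accuracy is within a threshold of the best, prefer the fastest. A minimal sketch of that rule follows. `select_trial`, the trial dictionaries, and the `latency_ms` field are assumptions made for illustration, and the `min_accuracy_gain` check is not modeled here.

```python
from typing import Any, Dict, List, Optional


def select_trial(
    trials: List[Dict[str, Any]],
    accuracy_threshold: Optional[float] = 0.015,
    use_relative_threshold: bool = True,
) -> Dict[str, Any]:
    """Pick the fastest trial whose accuracy is within the threshold of the best."""
    best = max(trials, key=lambda t: t["accuracy"])
    if accuracy_threshold is None:
        return best  # accuracy-only selection
    # Relative: the allowed gap is a fraction of the best accuracy; absolute: a fixed gap.
    if use_relative_threshold:
        gap = accuracy_threshold * best["accuracy"]
    else:
        gap = accuracy_threshold
    candidates = [t for t in trials if best["accuracy"] - t["accuracy"] <= gap]
    return min(candidates, key=lambda t: t["latency_ms"])


# Made-up numbers for illustration only.
trials = [
    {"name": "deberta", "accuracy": 0.800, "latency_ms": 40.0},
    {"name": "distilbert", "accuracy": 0.792, "latency_ms": 12.0},
    {"name": "tiny", "accuracy": 0.700, "latency_ms": 5.0},
]
print(select_trial(trials)["name"])
```

With the relative threshold of 0.015, the allowed gap here is 0.012, so the second trial qualifies and wins on latency, while the third is too far behind to be considered.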


In [13]:
# Write training config file only if it doesn't exist (useful for Colab/Kaggle fresh environments)
# Local environments should have configs already in the repo
if IS_LOCAL:
    print("✓ Local environment - assuming training config file already exists in repository")
else:
    # Ensure config directory exists
    CONFIG_DIR.mkdir(parents=True, exist_ok=True)
    
    config_path = CONFIG_DIR / "train.yaml"
    
    # Only write if file doesn't exist
    if not config_path.exists():
        config_content = """
# Global Training Defaults
# Applied to all training runs

training:
  epochs: 5
  batch_size: 12 
  gradient_accumulation_steps: 2
  learning_rate: 2e-5
  weight_decay: 0.01
  warmup_steps: 500
  max_grad_norm: 1.0
  # Data splitting and model-specific settings
  val_split_divisor: 10  # Divide train set by this to create validation split if none exists
  deberta_max_batch_size: 16  # Maximum batch size for DeBERTa models (memory constraints)
  warmup_steps_divisor: 10  # Divide total steps by this to cap warmup steps
  
  # EDA-based metric selection
  metric: "macro-f1"  # Class imbalance requires macro-f1
  metric_mode: "max"  # Maximize macro-f1
  
  early_stopping:
    enabled: true
    patience: 3
    min_delta: 0.001

logging:
  log_interval: 100
  eval_interval: 500
  save_interval: 1000

checkpointing:
  save_strategy: "steps"
  save_total_limit: 3
  load_best_model_at_end: true

# NOTE: Multi-GPU / DDP is optional and currently experimental. When enabled,
# the training code will use this section together with hardware detection to
# decide whether to run single-GPU vs multi-GPU. If no multiple GPUs or DDP
# backend are available, it will safely fall back to single-GPU.
distributed:
  enabled: true         # Set true to enable multi-GPU / DDP
  backend: "nccl"        # Typically 'nccl' for GPUs
  world_size: "auto"     # 'auto' = use all visible GPUs; or set an int
  init_method: "env://"  # Default init method; can be overridden if needed
  timeout_seconds: 1800  # Process group init timeout (in seconds)
"""
        config_path.write_text(config_content)
        print(f"✓ Training config written to: {config_path}")
    else:
        print(f"✓ Training config file already exists: {config_path}")

✓ Local environment - assuming training config file already exists in repository
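
Two of the training defaults above interact arithmetically: `batch_size` with `gradient_accumulation_steps`, and `warmup_steps` with `warmup_steps_divisor`. The sketch below works through both. The reading of the divisor as a cap of `total_steps // divisor` is an assumption based on the config comment, and the dataset size is made up.

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Samples contributing to each optimizer step on a single device."""
    return batch_size * gradient_accumulation_steps


def capped_warmup_steps(warmup_steps: int, total_steps: int, divisor: int) -> int:
    """Cap warmup at total_steps // divisor (assumed reading of warmup_steps_divisor)."""
    return min(warmup_steps, total_steps // divisor)


# Config values from train.yaml; num_samples is a made-up example.
num_samples = 2400
epochs = 5
ebs = effective_batch_size(batch_size=12, gradient_accumulation_steps=2)
total_steps = math.ceil(num_samples / ebs) * epochs
warmup = capped_warmup_steps(warmup_steps=500, total_steps=total_steps, divisor=10)
print(ebs, total_steps, warmup)
```

In this example the configured 500 warmup steps would cover the whole run of 500 optimizer steps, so the cap reduces warmup to 50 steps.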


### Define Constants

Define constants for file and directory names used throughout the notebook. Benchmark settings come from centralized config, not hard-coded here. These constants work across all environments.


In [14]:
# Import constants from centralized module
from orchestration import (
    METRICS_FILENAME,
    BENCHMARK_FILENAME,
    CHECKPOINT_DIRNAME,
    OUTPUTS_DIRNAME,
    MLRUNS_DIRNAME,
    DEFAULT_RANDOM_SEED,
    DEFAULT_K_FOLDS,
)

# Note: Benchmark settings (batch_sizes, iterations, etc.) come from configs["benchmark"]




### Define Helper Functions

Reusable helper functions following DRY principle for common operations. These functions work across all environments (local, Colab, Kaggle).


In [15]:
# Import helper functions from consolidated modules (DRY principle)
from typing import List, Optional
from orchestration import (
    build_mlflow_experiment_name,
    setup_mlflow_for_stage,
    run_benchmarking,
)
from shared import verify_output_file

# Wrapper function for run_benchmarking that uses notebook-specific paths
def run_benchmarking_local(
    checkpoint_dir: Path,
    test_data_path: Path,
    output_path: Path,
    batch_sizes: List[int],
    iterations: int,
    warmup_iterations: int,
    max_length: int = 512,
    device: Optional[str] = None,
) -> bool:
    """
    Run benchmarking on a model checkpoint (local notebook wrapper).
    
    This is a thin wrapper around orchestration.benchmark_utils.run_benchmarking
    that automatically uses the notebook's SRC_DIR and ROOT_DIR.
    
    Args:
        checkpoint_dir: Path to checkpoint directory.
        test_data_path: Path to test data JSON file.
        output_path: Path to output benchmark.json file.
        batch_sizes: List of batch sizes to test.
        iterations: Number of iterations per batch size.
        warmup_iterations: Number of warmup iterations.
        max_length: Maximum sequence length.
        device: Device to use (None = auto-detect).
    
    Returns:
        True if successful, False otherwise.
    """
    return run_benchmarking(
        checkpoint_dir=checkpoint_dir,
        test_data_path=test_data_path,
        output_path=output_path,
        batch_sizes=batch_sizes,
        iterations=iterations,
        warmup_iterations=warmup_iterations,
        max_length=max_length,
        device=device,
        project_root=ROOT_DIR,
    )


In [16]:
from pathlib import Path
from typing import Any, Dict

from orchestration import EXPERIMENT_NAME
from orchestration.config_loader import (
    ExperimentConfig,
    compute_config_hashes,
    create_config_metadata,
    load_all_configs,
    load_experiment_config,
    snapshot_configs,
    validate_config_immutability,
)

# P1-3.1: Load Centralized Configs (local-only)
# Mirrors the Azure orchestration notebook, but does not create an Azure ML client.

if not CONFIG_DIR.exists():
    raise FileNotFoundError(f"Config directory not found: {CONFIG_DIR}")

experiment_config: ExperimentConfig = load_experiment_config(CONFIG_DIR, EXPERIMENT_NAME)
configs: Dict[str, Any] = load_all_configs(experiment_config)
config_hashes = compute_config_hashes(configs)
config_metadata = create_config_metadata(configs, config_hashes)

# Immutable snapshots for runtime mutation checks
original_configs = snapshot_configs(configs)
validate_config_immutability(configs, original_configs)

print(f"Loaded experiment: {experiment_config.name}")
print("Loaded config domains:", sorted(configs.keys()))
print("Config hashes:", config_hashes)
print("Config metadata:", config_metadata)

# Get dataset path from data config (centralized configuration)
# The local_path in the data config is relative to the config directory
data_config = configs["data"]
local_path_str = data_config.get("local_path", "../dataset")
DATASET_LOCAL_PATH = (CONFIG_DIR / local_path_str).resolve()

# Check if seed-based dataset structure (for dataset_tiny with seed subdirectories)
seed = data_config.get("seed")
if seed is not None and "dataset_tiny" in str(DATASET_LOCAL_PATH):
    DATASET_LOCAL_PATH = DATASET_LOCAL_PATH / f"seed{seed}"

print(f"Dataset path (from data config): {DATASET_LOCAL_PATH}")
if seed is not None:
    print(f"Using seed: {seed}")


Loaded experiment: resume_ner_baseline
Loaded config domains: ['benchmark', 'data', 'env', 'hpo', 'model', 'train']
Config hashes: {'data': 'e87b126b961fa20d', 'model': '5f90a66353401b44', 'train': '781de5190c9f6bcc', 'hpo': 'a55c5ddfff162498', 'env': '3e54b931c7640cf2', 'benchmark': '33da3b0fc59ff812'}
Config metadata: {'data_config_hash': 'e87b126b961fa20d', 'model_config_hash': '5f90a66353401b44', 'train_config_hash': '781de5190c9f6bcc', 'hpo_config_hash': 'a55c5ddfff162498', 'env_config_hash': '3e54b931c7640cf2', 'data_version': 'v3', 'model_backbone': 'distilbert-base-uncased'}
Dataset path (from data config): C:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\dataset_tiny\seed0
Using seed: 0


## Step P1-3.2: Verify Local Dataset

Verify that the dataset directory (specified by `local_path` in the data config) exists and contains the required files. The dataset path is loaded from the centralized data configuration in Step P1-3.1.


In [17]:
# P1-3.2: Verify Local Dataset
# The dataset path comes from the data config's local_path field (loaded in Step P1-3.1).
# This ensures the dataset location is controlled by centralized configuration.
# Note: train.json is required, but validation.json is optional (matches training script behavior).

REQUIRED_FILE = "train.json"
OPTIONAL_FILE = "validation.json"

if not DATASET_LOCAL_PATH.exists():
    raise FileNotFoundError(
        f"Dataset directory not found: {DATASET_LOCAL_PATH}\n"
        f"This path comes from the data config's 'local_path' field.\n"
        f"If you need to create the dataset, run the notebook: notebooks/00_make_tiny_dataset.ipynb"
    )

# Check required file
train_file = DATASET_LOCAL_PATH / REQUIRED_FILE
if not train_file.exists():
    raise FileNotFoundError(
        f"Required dataset file not found: {train_file}\n"
        f"This path comes from the data config's 'local_path' field.\n"
        f"If you need to create it, run the notebook: notebooks/00_make_tiny_dataset.ipynb"
    )

# Check optional file
val_file = DATASET_LOCAL_PATH / OPTIONAL_FILE
has_validation = val_file.exists()

print(f"✓ Dataset directory found: {DATASET_LOCAL_PATH}")
# The config's version string already carries its "v" prefix (e.g. "v3")
print(f"  (from data config: {data_config.get('name', 'unknown')} {data_config.get('version', 'unknown')})")

train_size = train_file.stat().st_size
print(f"  ✓ {REQUIRED_FILE} ({train_size:,} bytes)")

if has_validation:
    val_size = val_file.stat().st_size
    print(f"  ✓ {OPTIONAL_FILE} ({val_size:,} bytes)")
else:
    print(f"  ⚠ {OPTIONAL_FILE} not found (optional - training will proceed without validation set)")


✓ Dataset directory found: C:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\dataset_tiny\seed0
  (from data config: resume-ner-data-tiny-short v3)
  ✓ train.json (28,721 bytes)
  ⚠ validation.json not found (optional - training will proceed without validation set)


## Step P1-3.2.1: Optional Train/Test Split

**Optional step**: Create a train/test split if `test.json` is missing. This is useful when you only have `train.json` and `validation.json` and want to create a separate test set.

**⚠ WARNING**: This will overwrite `train.json` with the split version. Only enable if you want to create a permanent train/test split.
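
For intuition, a seeded, non-stratified split can be sketched as follows. This is a stand-in for illustration only: `simple_train_test_split` is a hypothetical function, and the repository's `split_train_test` (which also supports stratification by entity type) may behave differently.

```python
import random
from typing import Any, List, Tuple


def simple_train_test_split(
    dataset: List[Any], train_ratio: float = 0.8, random_seed: int = 42
) -> Tuple[List[Any], List[Any]]:
    """Shuffle a copy of `dataset` with a fixed seed and cut it at train_ratio."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(dataset)  # leave the caller's list untouched
    random.Random(random_seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


samples = [{"id": i} for i in range(10)]
train, test = simple_train_test_split(samples)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible across sessions, which matters here because the result replaces `train.json` on disk.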


In [18]:
# Optional: create train/test split if test.json is missing
# WARNING: This will overwrite train.json with the split version
# Only enable if you want to create a permanent train/test split
import json
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional

from training.data import split_train_test, save_split_files

CREATE_TEST_SPLIT = False  # Set True to create test.json when absent (WARNING: overwrites train.json)

train_file = DATASET_LOCAL_PATH / "train.json"
val_file = DATASET_LOCAL_PATH / "validation.json"
test_file = DATASET_LOCAL_PATH / "test.json"

if CREATE_TEST_SPLIT and not test_file.exists():
    # Backup original train.json before overwriting
    backup_file = DATASET_LOCAL_PATH / "train.json.backup"
    if train_file.exists() and not backup_file.exists():
        import shutil
        shutil.copy2(train_file, backup_file)
        print(f"⚠ Backed up original train.json to {backup_file}")
    
    full_dataset = []
    # Start with train data; optionally include validation to maximize coverage
    with open(train_file, "r", encoding="utf-8") as f:
        full_dataset.extend(json.load(f))
    if val_file.exists():
        with open(val_file, "r", encoding="utf-8") as f:
            full_dataset.extend(json.load(f))

    split_cfg = configs.get("data", {}).get("splitting", {})
    train_ratio = split_cfg.get("train_test_ratio", 0.8)
    stratified = split_cfg.get("stratified", False)
    random_seed = split_cfg.get("random_seed", 42)
    entity_types = configs.get("data", {}).get("schema", {}).get("entity_types", [])

    print(f"Creating train/test split (train_ratio={train_ratio}, stratified={stratified})...")
    print(f"⚠ WARNING: This will overwrite train.json with {int(len(full_dataset) * train_ratio)} samples")
    
    new_train, new_test = split_train_test(
        dataset=full_dataset,
        train_ratio=train_ratio,
        stratified=stratified,
        random_seed=random_seed,
        entity_types=entity_types,
    )

    save_split_files(DATASET_LOCAL_PATH, new_train, new_test)
    print(f"✓ Wrote train.json ({len(new_train)}) and test.json ({len(new_test)})")
elif test_file.exists():
    print(f"✓ Found existing test.json at {test_file}")
else:
    print("⚠ test.json not found. Set CREATE_TEST_SPLIT=True to generate a split.")


✓ Found existing test.json at C:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\dataset_tiny\seed0\test.json


## Step P1-3.3: Setup Local Environment

Verify GPU availability, set up MLflow tracking (local file store), and check that key dependencies are installed. This step ensures the local environment is ready for training.


In [19]:
import sys
import torch

DEFAULT_DEVICE = "cuda"

env_config = configs["env"]
device_type = env_config.get("compute", {}).get("device", DEFAULT_DEVICE)

if device_type == "cuda" and not torch.cuda.is_available():
    raise RuntimeError(
        "CUDA device requested but not available. "
        "In Colab, ensure you've selected a GPU runtime: Runtime > Change runtime type > GPU"
    )


In [20]:
from pathlib import Path
from shared.mlflow_setup import setup_mlflow_from_config

# Setup MLflow from config (automatically uses Azure ML if enabled in config/mlflow.yaml)
# To enable Azure ML Workspace tracking:
# 1. Edit config/mlflow.yaml and set azure_ml.enabled: true
# 2. Set environment variables: AZURE_SUBSCRIPTION_ID and AZURE_RESOURCE_GROUP
setup_mlflow_from_config(
    experiment_name="placeholder",  # Will be set per HPO run
    config_dir=CONFIG_DIR
)

2025-12-27 23:54:14,403 - shared.mlflow_setup - INFO - Azure ML enabled in config, attempting to connect...
2025-12-27 23:54:17,137 - shared.mlflow_setup - INFO - Environment variables not set, loading from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config.env
2025-12-27 23:54:17,138 - shared.mlflow_setup - INFO - Loaded credentials from config.env
Class DeploymentTemplateOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
2025-12-27 23:54:27,273 - shared.mlflow_setup - INFO - Using Azure ML workspace tracking


'azureml://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws'

In [21]:
# For Kaggle only - install specific package versions required for Optuna checkpointing
if IN_KAGGLE:
    %pip install "SQLAlchemy<2.0.0" "alembic<1.13.0" "optuna<4.0.0" --quiet
else:
    print("Skipping Kaggle-specific package installation (not running on Kaggle)")


Skipping Kaggle-specific package installation (not running on Kaggle)


In [22]:
try:
    import mlflow
    import transformers
    import optuna
except ImportError as e:
    # Chain the original exception so the traceback shows which import failed
    raise ImportError(f"Required package not installed: {e}") from e

REQUIRED_PACKAGES = {
    "torch": torch,
    "transformers": transformers,
    "mlflow": mlflow,
    "optuna": optuna,
}

for name, module in REQUIRED_PACKAGES.items():
    if not hasattr(module, "__version__"):
        raise ImportError(
            f"Required package '{name}' is not properly installed")
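
As a complement to the import check above, the following sketch reports installed versions without importing the packages, using the standard-library `importlib.metadata`. `installed_versions` is a hypothetical helper, and the names passed to it are distribution names as `pip` knows them, which is an assumption for any package whose import name differs.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(distributions: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


report = installed_versions(["torch", "transformers", "mlflow", "optuna"])
missing = [name for name, version in report.items() if version is None]
print(report)
print("missing:", missing)
```

This can be handy on Kaggle, where the cell above pins `optuna`, `SQLAlchemy`, and `alembic`, to confirm which versions actually ended up installed.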

## Step P1-3.4: The Sweep (HPO) - Local with Optuna

Run the full hyperparameter optimization sweep using Optuna to systematically search for the best model configuration. Uses the production HPO configuration with more trials than the dry run.

**Note on K-Fold Cross-Validation:**
- When k-fold CV is enabled (`k_fold.enabled: true`), each trial trains **k models** (one per fold) and returns the **average metric** across folds
- The number of **trials** is controlled by `sampling.max_trials` (e.g., 2 trials in smoke.yaml)
- With k=5 folds and 2 trials: **2 trials × 5 folds = 10 model trainings total**
- K-fold CV provides more robust hyperparameter evaluation but increases compute time (k× per trial)
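
The arithmetic above can be checked with a few lines. `total_model_trainings` is a hypothetical helper written for this note; it counts training runs for a single backbone, so the result should be multiplied by the number of backbones in the search space, since the sweep below loops over them.

```python
def total_model_trainings(max_trials: int, n_splits: int, k_fold_enabled: bool = True) -> int:
    """Number of training runs one sweep performs for ONE backbone."""
    folds_per_trial = n_splits if k_fold_enabled else 1
    return max_trials * folds_per_trial


print(total_model_trainings(max_trials=2, n_splits=5))                        # 10
print(total_model_trainings(max_trials=2, n_splits=5, k_fold_enabled=False))  # 2
```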

**Note on Checkpoint and Resume:**
- When `checkpoint.enabled: true` is set in the HPO config, the system automatically saves the Optuna study state to a SQLite database
- This allows interrupted HPO runs to be resumed from the last checkpoint
- The checkpoint is automatically detected and loaded on the next run if `auto_resume: true` (default)
- Platform-specific paths are handled automatically (local, Colab, Kaggle)
- See `docs/HPO_CHECKPOINT_RESUME.md` for detailed documentation
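
The resume idea can be illustrated with a toy example built on the standard-library `sqlite3` module. This is not how Optuna or `run_local_hpo_sweep` store their state; `run_resumable_trials` and its `trials` table are hypothetical and only show why persisting each finished trial lets a later session skip it. (Optuna itself exposes this through `optuna.create_study(..., storage=..., load_if_exists=True)`.)

```python
import os
import sqlite3
import tempfile
from typing import Callable, Dict, List


def run_resumable_trials(
    db_path: str,
    n_trials: int,
    objective: Callable[[int], float],
) -> Dict[int, float]:
    """Run trials 0..n_trials-1, skipping any whose result is already stored."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS trials (number INTEGER PRIMARY KEY, value REAL)"
        )
        done = dict(conn.execute("SELECT number, value FROM trials"))
        for number in range(n_trials):
            if number in done:
                continue  # finished in an earlier session
            value = objective(number)
            conn.execute(
                "INSERT INTO trials (number, value) VALUES (?, ?)", (number, value)
            )
            conn.commit()  # persist immediately so an interruption loses at most one trial
            done[number] = value
        return done
    finally:
        conn.close()


calls: List[int] = []


def toy_objective(trial_number: int) -> float:
    calls.append(trial_number)  # record which trials really ran
    return trial_number * 0.1


with tempfile.TemporaryDirectory() as tmp:
    db_file = os.path.join(tmp, "study.db")
    run_resumable_trials(db_file, 2, toy_objective)            # first session: trials 0 and 1
    results = run_resumable_trials(db_file, 4, toy_objective)  # resumed: only trials 2 and 3 run

print(calls)
```

Each trial number appears in `calls` exactly once, even though the second call asked for four trials: the first two were read back from the database instead of being recomputed.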


In [24]:
from pathlib import Path
from orchestration import STAGE_HPO
from orchestration.jobs import run_local_hpo_sweep

# Constants are imported from orchestration module
HPO_OUTPUT_DIR = ROOT_DIR / "outputs" / "hpo"
HPO_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


In [26]:
# Use HPO config already loaded in configs (from Step P1-3.1)
# Following DRY principle - don't reload configs that are already available
# Check for stage-specific hpo_config override
from orchestration.naming import get_stage_config
from shared.yaml_utils import load_yaml

hpo_stage_config = get_stage_config(experiment_config, STAGE_HPO)
hpo_config_override = hpo_stage_config.get("hpo_config")

if hpo_config_override:
    # Load stage-specific HPO config override
    hpo_config_path = CONFIG_DIR / hpo_config_override
    hpo_config = load_yaml(hpo_config_path)
    print(f"✓ Using stage-specific HPO config for hpo: {hpo_config_override}")
else:
    # Use default HPO config from top-level experiment config
    hpo_config = configs["hpo"]
    print(f"✓ Using default HPO config: {experiment_config.hpo_config.name}")
train_config = configs["train"]
backbone_values = hpo_config["search_space"]["backbone"]["values"]


✓ Using stage-specific HPO config for hpo: hpo/smoke.yaml


### Setup K-Fold Splits and Google Drive Backup for HPO Trials

**K-Fold Cross-Validation Setup**: If k-fold CV is enabled in the HPO config, create and save fold splits before starting the sweep.

**Colab-specific feature**: Configure automatic backup of each HPO trial to Google Drive immediately after completion. This prevents data loss if the Colab session disconnects during long-running hyperparameter optimization sweeps.

Each trial's results (including `metrics.json` and checkpoint) are automatically backed up to Google Drive as soon as the trial completes, ensuring no progress is lost even if the session times out.
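
For readers unfamiliar with how fold splits are built, the sketch below partitions sample indices into k folds. It is an illustration only, not the project's `create_kfold_splits` (which also supports stratification by entity type); `make_kfold_indices` is a hypothetical name, and the round-robin assignment is just one simple way to keep fold sizes balanced.

```python
import random
from typing import List, Tuple


def make_kfold_indices(
    n_samples: int, k: int, random_seed: int = 42, shuffle: bool = True
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, val_indices) pairs that partition range(n_samples)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    if shuffle:
        random.Random(random_seed).shuffle(indices)  # seeded for reproducibility
    folds = [indices[i::k] for i in range(k)]  # round-robin keeps fold sizes within 1
    splits = []
    for i, val in enumerate(folds):
        # Training set for fold i is every sample that is not in fold i
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits


example_splits = make_kfold_indices(n_samples=7, k=3)
for fold_id, (train_idx, val_idx) in enumerate(example_splits):
    print(fold_id, len(train_idx), len(val_idx))
```

Every sample lands in exactly one validation fold, so each trial's averaged metric covers the whole training set once.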


In [27]:
from training.cv_utils import create_kfold_splits, save_fold_splits, validate_splits
from training.data import load_dataset

# Setup k-fold splits if enabled
k_fold_config = hpo_config.get("k_fold", {})
k_folds_enabled = k_fold_config.get("enabled", False)
fold_splits_file = None

if k_folds_enabled:
    n_splits = k_fold_config.get("n_splits", DEFAULT_K_FOLDS)
    random_seed = k_fold_config.get("random_seed", DEFAULT_RANDOM_SEED)
    shuffle = k_fold_config.get("shuffle", True)
    stratified = k_fold_config.get("stratified", False)
    entity_types = configs.get("data", {}).get("schema", {}).get("entity_types", [])
    
    print(f"Setting up {n_splits}-fold cross-validation splits...")
    full_dataset = load_dataset(str(DATASET_LOCAL_PATH))
    train_data = full_dataset.get("train", [])
    
    fold_splits = create_kfold_splits(
        dataset=train_data,
        k=n_splits,
        random_seed=random_seed,
        shuffle=shuffle,
        stratified=stratified,
        entity_types=entity_types,
    )
    
    # Optional validation to ensure rare entities appear across folds
    validate_splits(train_data, fold_splits, entity_types=entity_types)
    
    HPO_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    fold_splits_file = HPO_OUTPUT_DIR / "fold_splits.json"
    save_fold_splits(
        fold_splits,
        fold_splits_file,
        metadata={
            "k": n_splits,
            "random_seed": random_seed,
            "shuffle": shuffle,
            "stratified": stratified,
            "dataset_path": str(DATASET_LOCAL_PATH),
        }
    )
    print(f"✓ K-fold splits saved to: {fold_splits_file}")
else:
    print("K-fold CV disabled - using single train/validation split")


Setting up 2-fold cross-validation splits...
[CV] Fold 0: {'SKILL': 154} | Missing: ['EDUCATION', 'DESIGNATION', 'EXPERIENCE', 'NAME', 'EMAIL', 'PHONE', 'LOCATION']
[CV] Fold 1: {'SKILL': 107, 'LOCATION': 4, 'DESIGNATION': 1, 'EXPERIENCE': 1, 'EDUCATION': 1} | Missing: ['NAME', 'EMAIL', 'PHONE']
✓ K-fold splits saved to: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo\fold_splits.json


In [None]:
# Checkpoint functionality is now handled automatically by run_local_hpo_sweep
# when checkpoint.enabled: true is set in the HPO config.
# No manual backup callbacks are needed - SQLite persistence is built-in.


In [None]:
# # In a Kaggle notebook cell
# !cd /kaggle/working/resume-ner-azureml && git fetch origin feature/google-colab-compute && git checkout origin/feature/google-colab-compute -- src/train.py src/training/trainer.py

In [28]:
# Extract checkpoint configuration from HPO config
checkpoint_config = hpo_config.get("checkpoint", {})

hpo_studies = {}
k_folds_param = k_fold_config.get("n_splits", DEFAULT_K_FOLDS) if k_folds_enabled else None

for backbone in backbone_values:
    mlflow_experiment_name = build_mlflow_experiment_name(
        experiment_config.name, STAGE_HPO, backbone
    )
    backbone_output_dir = HPO_OUTPUT_DIR / backbone
    
    # Use standard run_local_hpo_sweep with checkpoint_config
    # Checkpoint.enabled handles persistence via SQLite (better than manual Drive backup)
    study = run_local_hpo_sweep(
        dataset_path=str(DATASET_LOCAL_PATH),
        config_dir=CONFIG_DIR,
        backbone=backbone,
        hpo_config=hpo_config,
        train_config=train_config,
        output_dir=backbone_output_dir,
        mlflow_experiment_name=mlflow_experiment_name,
        k_folds=k_folds_param,
        fold_splits_file=fold_splits_file,
        checkpoint_config=checkpoint_config,
    )
    
    hpo_studies[backbone] = study


2025-12-27 23:56:48,609 - orchestration.jobs.hpo.local_sweeps - INFO - [HPO] Starting optimization for distilbert with checkpointing...
2025-12-27 23:56:51,029 - orchestration.jobs.tracking.mlflow_tracker - INFO - Using Azure ML Workspace for MLflow tracking
  0%|          | 0/2 [00:00<?, ?it/s]

🏃 View run trial_0 at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/4e8ebac4-d78a-46fc-9993-724858d8a72d/runs/e129db92-3add-4027-b389-975ac70b2bf8
🧪 View experiment at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/4e8ebac4-d78a-46fc-9993-724858d8a72d


2025-12-27 23:59:31,938 - orchestration.jobs.hpo.local_sweeps - INFO - 
2025-12-27 23:59:31,938 - orchestration.jobs.hpo.local_sweeps - INFO - [BEST]: trial_0
2025-12-27 23:59:31,939 - orchestration.jobs.hpo.local_sweeps - INFO -   Metrics: macro-f1=0.210576 | span=0.181818 | loss=2.074378 | entity_f1=0.060388 (8 entities)
2025-12-27 23:59:31,939 - orchestration.jobs.hpo.local_sweeps - INFO -   Params: learning_rate=1.85e-05 | batch_size=4 | dropout=0.184636 | weight_decay=0.036972 (Run ID: e129db92-3ad...)
Best trial: 0. Best value: 0.210576:  50%|█████     | 1/2 [02:13<02:13, 133.15s/it, 133.13/1200 seconds]

🏃 View run trial_1 at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/4e8ebac4-d78a-46fc-9993-724858d8a72d/runs/3ab1246b-3369-41c3-8241-56f5f19a809b
🧪 View experiment at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/4e8ebac4-d78a-46fc-9993-724858d8a72d


2025-12-28 00:01:43,949 - orchestration.jobs.hpo.local_sweeps - INFO - 
2025-12-28 00:01:43,951 - orchestration.jobs.hpo.local_sweeps - INFO - [Trial 1]: trial_1
2025-12-28 00:01:43,951 - orchestration.jobs.hpo.local_sweeps - INFO -   Metrics: macro-f1=0.141890 | span=0.021818 | loss=2.220258 | entity_f1=0.011132 (7 entities)
2025-12-28 00:01:43,952 - orchestration.jobs.hpo.local_sweeps - INFO -   Params: learning_rate=1.02e-05 | batch_size=4 | dropout=0.209284 | weight_decay=0.028925 (Run ID: 3ab1246b-336...)
Best trial: 0. Best value: 0.210576: 100%|██████████| 2/2 [04:25<00:00, 132.58s/it, 265.14/1200 seconds]
2025-12-28 00:01:57,926 - orchestration.jobs.tracking.mlflow_tracker - INFO - Found 2 total child runs for best trial search
2025-12-28 00:02:03,015 - orchestration.jobs.tracking.mlflow_tracker - INFO - Best trial: 0 (run ID: e129db92-3ad...)


🏃 View run hpo_distilbert_smoke_test_1.4 at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/4e8ebac4-d78a-46fc-9993-724858d8a72d/runs/38904fb6-e696-40e9-b5f4-1e62551593d2
🧪 View experiment at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/4e8ebac4-d78a-46fc-9993-724858d8a72d


In [29]:
def extract_cv_statistics(best_trial):
    if not hasattr(best_trial, "user_attrs"):
        return None
    cv_mean = best_trial.user_attrs.get("cv_mean")
    cv_std = best_trial.user_attrs.get("cv_std")
    return (cv_mean, cv_std) if cv_mean is not None else None

objective_metric = hpo_config['objective']['metric']

for backbone, study in hpo_studies.items():
    if not study.trials:
        continue
    
    best_trial = study.best_trial
    cv_stats = extract_cv_statistics(best_trial)
    
    print(f"{backbone}: {len(study.trials)} trials completed")
    print(f"  Best {objective_metric}: {best_trial.value:.4f}")
    print(f"  Best params: {best_trial.params}")
    
    if cv_stats:
        cv_mean, cv_std = cv_stats
        print(f"  CV Statistics: Mean: {cv_mean:.4f} ± {cv_std:.4f}")


distilbert: 2 trials completed
  Best macro-f1: 0.2106
  Best params: {'learning_rate': 1.850331567345489e-05, 'batch_size': 4, 'dropout': 0.1846357866460817, 'weight_decay': 0.03697191631420018}
  CV Statistics: Mean: 0.2106 ± 0.0441
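
The `cv_mean` and `cv_std` attributes read above summarize the per-fold metric values of a trial. A minimal sketch of such a summary follows; `summarize_fold_scores` is a hypothetical helper, the scores are made-up numbers, and the use of the population standard deviation is an assumption, since this notebook does not show which variant the sweep code computes.

```python
import statistics
from typing import Sequence, Tuple


def summarize_fold_scores(fold_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of per-fold metric values."""
    if not fold_scores:
        raise ValueError("fold_scores must not be empty")
    return statistics.fmean(fold_scores), statistics.pstdev(fold_scores)


# Made-up fold scores for illustration, not results from this run
example_mean, example_std = summarize_fold_scores([0.2, 0.4])
print(f"Mean: {example_mean:.4f} ± {example_std:.4f}")
```

A large standard deviation relative to the mean suggests the ranking between trials may depend heavily on how the folds were drawn, which is worth keeping in mind with very small datasets.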


## Step P1-3.5: Benchmarking Best Trials

Benchmark the best trial from each backbone to measure actual inference performance. This provides real latency data that replaces parameter-count proxies in model selection, enabling more accurate speed comparisons.

**Workflow:**
1. Identify best trial per backbone (from HPO results)
2. Run benchmarking on each best trial checkpoint
3. Save benchmark results as `benchmark.json` in trial directories
4. Model selection will automatically use this data when available
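
The measurement pattern behind step 2 (untimed warm-up runs followed by timed iterations) can be sketched as follows. This is not the project's `run_benchmarking_local`; `benchmark_latency` and `fake_model` are hypothetical, and only the `batch_<n>` / `mean_ms` keys mirror what a later cell reads from `benchmark.json`. The `p95_ms` field is an illustrative extra.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def benchmark_latency(
    run_batch: Callable[[int], None],
    batch_sizes: Sequence[int] = (1, 8, 16),
    iterations: int = 100,
    warmup_iterations: int = 10,
) -> Dict[str, Dict[str, float]]:
    """Time run_batch(batch_size) and report latency statistics in milliseconds."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    results: Dict[str, Dict[str, float]] = {}
    for batch_size in batch_sizes:
        for _ in range(warmup_iterations):
            run_batch(batch_size)  # warm-up calls are not timed
        timings_ms = []
        for _ in range(iterations):
            start = time.perf_counter()
            run_batch(batch_size)
            timings_ms.append((time.perf_counter() - start) * 1000.0)
        ordered = sorted(timings_ms)
        results[f"batch_{batch_size}"] = {
            "mean_ms": statistics.fmean(timings_ms),
            "p95_ms": ordered[int(0.95 * (len(ordered) - 1))],
        }
    return results


def fake_model(batch_size: int) -> None:
    # Stand-in workload; a real benchmark would run model inference here
    sum(range(1000 * batch_size))


latency_report = benchmark_latency(fake_model, batch_sizes=(1, 8), iterations=20, warmup_iterations=2)
print(sorted(latency_report))
```

When timing on a GPU, asynchronous kernel launches mean the clock should only be read after the device has finished (for PyTorch, by calling `torch.cuda.synchronize()` before each reading); the CPU-only sketch above leaves that out.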


In [34]:
from orchestration.jobs.local_selection import load_best_trial_from_disk
import json

# Load benchmark config (if available)
benchmark_config = configs.get("benchmark", {})
benchmark_settings = benchmark_config.get("benchmarking", {})

# Get benchmark parameters from config or use defaults
benchmark_batch_sizes = benchmark_settings.get("batch_sizes", [1, 8, 16])
benchmark_iterations = benchmark_settings.get("iterations", 100)
benchmark_warmup = benchmark_settings.get("warmup_iterations", 10)
benchmark_max_length = benchmark_settings.get("max_length", 512)
benchmark_device = benchmark_settings.get("device")

# Get test data path from benchmark config or data config
test_data_path_str = benchmark_settings.get("test_data")
if test_data_path_str:
    test_data_path = (CONFIG_DIR / test_data_path_str).resolve()
else:
    # Fallback to dataset directory
    test_data_path = DATASET_LOCAL_PATH / "test.json"

if not test_data_path.exists():
    print(f"Warning: Test data not found at {test_data_path}")
    print("Benchmarking will be skipped. Model selection will use parameter proxy.")
    test_data_path = None

# Identify best trials per backbone
objective_metric = hpo_config["objective"]["metric"]
best_trials = {}

for backbone in backbone_values:
    best_trial_info = load_best_trial_from_disk(
        HPO_OUTPUT_DIR,
        backbone,
        objective_metric
    )
    if best_trial_info:
        best_trials[backbone] = best_trial_info
        print(f"{backbone}: Best trial is {best_trial_info['trial_name']} "
              f"({objective_metric}={best_trial_info['accuracy']:.4f})")


distilbert: Best trial is trial_0_20251227_175111 (macro-f1=0.3840)


In [31]:
# Run benchmarking on best trials
if test_data_path and test_data_path.exists():
    benchmark_results = {}
    
    for backbone, trial_info in best_trials.items():
        trial_dir = Path(trial_info["trial_dir"])
        checkpoint_dir = trial_dir / CHECKPOINT_DIRNAME
        benchmark_output = trial_dir / BENCHMARK_FILENAME
        
        if not checkpoint_dir.exists():
            print(f"Warning: Checkpoint not found for {backbone} {trial_info['trial_name']}")
            continue
        
        print(f"\nBenchmarking {backbone} ({trial_info['trial_name']})...")
        
        success = run_benchmarking_local(
            checkpoint_dir=checkpoint_dir,
            test_data_path=test_data_path,
            output_path=benchmark_output,
            batch_sizes=benchmark_batch_sizes,
            iterations=benchmark_iterations,
            warmup_iterations=benchmark_warmup,
            max_length=benchmark_max_length,
            device=benchmark_device,
        )
        
        if success:
            benchmark_results[backbone] = benchmark_output
            print(f"✓ Benchmark completed: {benchmark_output}")
        else:
            print(f"✗ Benchmark failed for {backbone}")
    
    print(f"\n✓ Benchmarking complete. {len(benchmark_results)}/{len(best_trials)} trials benchmarked.")
else:
    print("Skipping benchmarking (test data not available)")



Benchmarking distilbert (trial_0_20251227_175111)...
✓ Benchmark completed: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo\distilbert\trial_0_20251227_175111_fold0\benchmark.json

✓ Benchmarking complete. 1/1 trials benchmarked.


In [33]:
# Verify benchmark files were created
if test_data_path and test_data_path.exists():
    for backbone, trial_info in best_trials.items():
        trial_dir = Path(trial_info["trial_dir"])
        benchmark_file = trial_dir / BENCHMARK_FILENAME
        
        if benchmark_file.exists():
            with open(benchmark_file, "r") as f:
                benchmark_data = json.load(f)
            batch_1_latency = benchmark_data.get("batch_1", {}).get("mean_ms", "N/A")
            print(f"{backbone}: benchmark.json exists (batch_1 latency: {batch_1_latency} ms)")
        else:
            print(f"{backbone}: benchmark.json not found (will use parameter proxy)")


distilbert: benchmark.json exists (batch_1 latency: 5.777592999365879 ms)


## Step P1-3.6: Best Configuration Selection (Automated)

Programmatically select the best configuration from all HPO sweep runs across all backbone models. The best configuration is determined by the objective metric specified in the HPO config.
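
The cells below mention an accuracy-speed tradeoff controlled by an `accuracy_threshold`. One plausible reading of that rule is sketched here: take the most accurate candidate, then prefer the fastest candidate whose accuracy is within the threshold of it. This is an assumption about the behavior, not the code of `select_best_configuration_across_studies`; `pick_configuration`, the `latency_ms` field, and the candidate values are all hypothetical.

```python
from typing import Any, Dict, List, Optional


def pick_configuration(
    candidates: List[Dict[str, Any]],
    accuracy_threshold: Optional[float] = None,
) -> Dict[str, Any]:
    """Pick the most accurate candidate, or the fastest one within the threshold."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    best = max(candidates, key=lambda c: c["accuracy"])
    if accuracy_threshold is None:
        return best  # no tradeoff configured: accuracy alone decides
    close_enough = [
        c for c in candidates if best["accuracy"] - c["accuracy"] <= accuracy_threshold
    ]
    return min(close_enough, key=lambda c: c["latency_ms"])


# Made-up candidates for illustration only
toy_candidates = [
    {"backbone": "model_a", "accuracy": 0.80, "latency_ms": 6.0},
    {"backbone": "model_b", "accuracy": 0.81, "latency_ms": 20.0},
]
print(pick_configuration(toy_candidates)["backbone"])                           # model_b
print(pick_configuration(toy_candidates, accuracy_threshold=0.02)["backbone"])  # model_a
```

With a single backbone, as in the smoke run recorded in this notebook, the threshold has no effect and the best trial is selected on accuracy alone.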


In [35]:
from pathlib import Path
import importlib.util
from shared.json_cache import save_json

# Import local_selection directly to avoid triggering Azure ML imports in __init__.py
local_selection_spec = importlib.util.spec_from_file_location(
    "local_selection", SRC_DIR / "orchestration" / "jobs" / "local_selection.py"
)
local_selection = importlib.util.module_from_spec(local_selection_spec)
local_selection_spec.loader.exec_module(local_selection)
select_best_configuration_across_studies = local_selection.select_best_configuration_across_studies

BEST_CONFIG_CACHE_FILE = ROOT_DIR / "notebooks" / "best_configuration_cache.json"


In [36]:
dataset_version = data_config.get("version", "unknown")

# Select best configuration with accuracy-speed tradeoff
# Supports both in-memory studies and disk-based selection
# Uses threshold from hpo_config["selection"] if configured

# Option 1: Use in-memory studies (if notebook still running)
if 'hpo_studies' in locals() and hpo_studies:
    best_configuration = select_best_configuration_across_studies(
        studies=hpo_studies,
        hpo_config=hpo_config,
        dataset_version=dataset_version,
        # Uses accuracy_threshold from hpo_config["selection"] if set
    )
else:
    # Option 2: Load from disk (works after notebook restart)
    HPO_OUTPUT_DIR = ROOT_DIR / "outputs" / "hpo"
    best_configuration = select_best_configuration_across_studies(
        studies=None,  # No in-memory studies
        hpo_config=hpo_config,
        dataset_version=dataset_version,
        hpo_output_dir=HPO_OUTPUT_DIR,  # Load from saved metrics.json files
        # Uses accuracy_threshold from hpo_config["selection"] if set
    )


In [37]:
from orchestration.paths import (
    resolve_output_path,
    save_cache_with_dual_strategy,
)
from datetime import datetime

# Use centralized path resolution
BEST_CONFIG_CACHE_DIR = resolve_output_path(
    ROOT_DIR,
    CONFIG_DIR,
    "cache",
    subcategory="best_configurations"
)

# Generate timestamp and identifiers
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backbone = best_configuration.get('backbone', 'unknown')
trial_name = best_configuration.get('trial_name', 'unknown')

# Save using dual file strategy
timestamped_file, latest_file, index_file = save_cache_with_dual_strategy(
    root_dir=ROOT_DIR,
    config_dir=CONFIG_DIR,
    cache_type="best_configurations",
    data=best_configuration,
    backbone=backbone,
    identifier=trial_name,
    timestamp=timestamp,
    additional_metadata={
        "experiment_name": experiment_config.name if 'experiment_config' in locals() else "unknown",
        "hpo_study_name": hpo_config.get('study_name', 'unknown') if 'hpo_config' in locals() else "unknown",
    }
)

# Also save to legacy location for backward compatibility
LEGACY_CACHE_FILE = ROOT_DIR / "notebooks" / "best_configuration_cache.json"
save_json(LEGACY_CACHE_FILE, best_configuration)

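print("Best configuration selected:")
print(f"  Backbone: {backbone}")
print(f"  Trial: {trial_name}")
# best_value may be missing; formatting None with :.4f would raise a TypeError
best_value = best_configuration.get('selection_criteria', {}).get('best_value')
best_value_text = f"{best_value:.4f}" if best_value is not None else "N/A"
print(f"  Best {hpo_config['objective']['metric']}: {best_value_text}")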

# Show selection reasoning (if available)
selection_criteria = best_configuration.get('selection_criteria', {})
if 'reason' in selection_criteria:
    print(f"  Selection reason: {selection_criteria['reason']}")
if 'accuracy_diff_from_best' in selection_criteria:
    print(f"  Accuracy difference from best: {selection_criteria['accuracy_diff_from_best']:.4f}")

# Show all candidates (if available)
if 'all_candidates' in selection_criteria:
    print(f"\nAll candidates considered:")
    for c in selection_criteria['all_candidates']:
        marker = "✓" if c['backbone'] == backbone else " "
        print(f"  {marker} {c['backbone']}: acc={c['accuracy']:.4f}, speed={c['speed_score']:.2f}x")

print(f"\n✓ Saved timestamped cache: {timestamped_file}")
print(f"✓ Updated latest cache: {latest_file}")
print(f"✓ Updated index: {index_file}")
print(f"✓ Saved legacy cache (backward compatibility): {LEGACY_CACHE_FILE}")
print(f"\n  Cache directory: {BEST_CONFIG_CACHE_DIR}")


Best configuration selected:
  Backbone: distilbert
  Trial: trial_0
  Best macro-f1: 0.2106
  Selection reason: Best accuracy (0.2106)

✓ Saved timestamped cache: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\best_configurations\best_config_distilbert_trial_0_20251228_000511.json
✓ Updated latest cache: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\best_configurations\latest_best_configuration.json
✓ Updated index: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\best_configurations\index.json
✓ Saved legacy cache (backward compatibility): c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\notebooks\best_configuration_cache.json

  Cache directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\best_configurations
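
The "dual file strategy" used above writes three files: a timestamped copy, a "latest" copy, and an index. A minimal sketch of that pattern is shown below. It is not the implementation of `save_cache_with_dual_strategy`; `save_with_latest`, the file names, and the index layout are hypothetical.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Tuple


def save_with_latest(
    cache_dir: Path, data: Dict[str, Any], label: str
) -> Tuple[Path, Path, Path]:
    """Write a timestamped file, refresh latest.json, and append to index.json."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    timestamped = cache_dir / f"{label}_{timestamp}.json"
    latest = cache_dir / "latest.json"
    index = cache_dir / "index.json"

    payload = json.dumps(data, indent=2)
    timestamped.write_text(payload, encoding="utf-8")  # history entry, kept per run
    latest.write_text(payload, encoding="utf-8")       # overwritten on every save

    entries = json.loads(index.read_text(encoding="utf-8")) if index.exists() else []
    entries.append({"file": timestamped.name, "timestamp": timestamp})
    index.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return timestamped, latest, index


with tempfile.TemporaryDirectory() as tmp:
    ts_file, latest_copy, index_copy = save_with_latest(
        Path(tmp), {"backbone": "distilbert"}, label="best_config"
    )
    print(ts_file.name, latest_copy.name, index_copy.name)
```

Readers that only need the most recent result open the "latest" file, while the timestamped copies and the index preserve the history of earlier selections.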


## Step P1-3.7: Final Training (Post-HPO, Single Run)

Train the final production model using the best configuration from HPO with stable, controlled conditions. This uses the full training epochs (no early stopping) and the best hyperparameters found during HPO.

**Note**: After training completes, the checkpoint will be automatically backed up to Google Drive for persistence.


In [43]:
from pathlib import Path
import os
import sys
import subprocess
import mlflow
from shared.json_cache import load_json, save_json
from orchestration import STAGE_TRAINING

# Define build_final_training_config locally to avoid importing Azure ML dependencies
# This function doesn't use Azure ML, so we can define it here
def build_final_training_config(
    best_config: dict,
    train_config: dict,
    random_seed: int = 42,
) -> dict:
    """
    Build final training configuration by merging best HPO config with train.yaml defaults.
    """
    hyperparameters = best_config.get("hyperparameters", {})
    training_defaults = train_config.get("training", {})
    
    return {
        "backbone": best_config["backbone"],
        "learning_rate": hyperparameters.get("learning_rate", training_defaults.get("learning_rate", 2e-5)),
        "dropout": hyperparameters.get("dropout", training_defaults.get("dropout", 0.1)),
        "weight_decay": hyperparameters.get("weight_decay", training_defaults.get("weight_decay", 0.01)),
        # NOTE: batch_size and epochs come from train.yaml only; a batch_size found by HPO is not applied here
        "batch_size": training_defaults.get("batch_size", 16),
        "epochs": training_defaults.get("epochs", 5),
        "random_seed": random_seed,
        "early_stopping_enabled": False,
        "use_combined_data": True,
        "use_all_data": True,
    }

DEFAULT_RANDOM_SEED = 42
BEST_CONFIG_CACHE_FILE = ROOT_DIR / "notebooks" / "best_configuration_cache.json"
FINAL_TRAINING_OUTPUT_DIR = ROOT_DIR / "outputs" / "final_training"


In [44]:
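# resolve_output_path is imported here too so this cell also works after a kernel restart
from orchestration.paths import load_cache_file, resolve_output_path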

# Try loading from centralized cache first
best_configuration = load_cache_file(
    ROOT_DIR, CONFIG_DIR, "best_configurations", use_latest=True
)

# Fallback to legacy location
if best_configuration is None:
    LEGACY_CACHE_FILE = ROOT_DIR / "notebooks" / "best_configuration_cache.json"
    best_configuration = load_json(LEGACY_CACHE_FILE, default=None)

if best_configuration is None:
    raise FileNotFoundError(
        f"Best configuration cache not found.\n"
        f"Please run Step P1-3.6: Best Configuration Selection first.\n"
        f"Cache directory: {resolve_output_path(ROOT_DIR, CONFIG_DIR, 'cache', subcategory='best_configurations')}"
    )


In [45]:
# Build final training configuration from best HPO configuration
# Use train_config from configs if available, otherwise load it
if 'train_config' not in locals():
    train_config = configs.get("train", {})

final_training_config = build_final_training_config(
    best_config=best_configuration,
    train_config=train_config,
    random_seed=DEFAULT_RANDOM_SEED,
)

print(f"Final training configuration:")
print(f"  Backbone: {final_training_config['backbone']}")
print(f"  Learning rate: {final_training_config['learning_rate']}")
print(f"  Batch size: {final_training_config['batch_size']}")
print(f"  Dropout: {final_training_config['dropout']}")
print(f"  Weight decay: {final_training_config['weight_decay']}")
print(f"  Epochs: {final_training_config['epochs']}")
print(f"  Random seed: {final_training_config['random_seed']}")
print(f"  Early stopping: {final_training_config['early_stopping_enabled']}")


Final training configuration:
  Backbone: distilbert
  Learning rate: 1.850331567345489e-05
  Batch size: 3
  Dropout: 0.1846357866460817
  Weight decay: 0.03697191631420018
  Epochs: 1
  Random seed: 42
  Early stopping: False


In [46]:
from datetime import datetime

# Generate unique run ID to prevent overwriting on reruns
final_training_run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
print(f"Final training Run ID: {final_training_run_id} (prevents overwriting on reruns)")

mlflow_experiment_name = f"{experiment_config.name}-{STAGE_TRAINING}-{final_training_config['backbone']}"
final_output_dir = FINAL_TRAINING_OUTPUT_DIR / f"{final_training_config['backbone']}_{final_training_run_id}"
final_output_dir.mkdir(parents=True, exist_ok=True)

mlflow.set_experiment(mlflow_experiment_name)


Final training Run ID: 20251228_000723 (prevents overwriting on reruns)


<Experiment: artifact_location='', creation_time=1766263257813, experiment_id='a5897b88-fd66-448c-ae65-1ef21bfc11dd', last_update_time=None, lifecycle_stage='active', name='resume_ner_baseline-training-distilbert', tags={}>

In [47]:
# Run training as a module (python -m training.train) to allow relative imports to work
# This requires src/ to be in PYTHONPATH (set in env below)
training_args = [
    sys.executable,
    "-m",
    "training.train",
    "--data-asset",
    str(DATASET_LOCAL_PATH),
    "--config-dir",
    str(CONFIG_DIR),
    "--backbone",
    final_training_config["backbone"],
    "--learning-rate",
    str(final_training_config["learning_rate"]),
    "--batch-size",
    str(final_training_config["batch_size"]),
    "--dropout",
    str(final_training_config["dropout"]),
    "--weight-decay",
    str(final_training_config["weight_decay"]),
    "--epochs",
    str(final_training_config["epochs"]),
    "--random-seed",
    str(final_training_config["random_seed"]),
    "--early-stopping-enabled",
    str(final_training_config["early_stopping_enabled"]).lower(),
    "--use-combined-data",
    str(final_training_config["use_combined_data"]).lower(),
]


In [48]:
training_env = os.environ.copy()
training_env["AZURE_ML_OUTPUT_checkpoint"] = str(final_output_dir)

# Add src directory to PYTHONPATH to allow relative imports in training.train
pythonpath = training_env.get("PYTHONPATH", "")
if pythonpath:
    training_env["PYTHONPATH"] = f"{str(SRC_DIR)}{os.pathsep}{pythonpath}"
else:
    training_env["PYTHONPATH"] = str(SRC_DIR)

mlflow_tracking_uri = mlflow.get_tracking_uri()
if mlflow_tracking_uri:
    training_env["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri
training_env["MLFLOW_EXPERIMENT_NAME"] = mlflow_experiment_name

# Set custom run name for easier searching: {backbone}_{run_id}
# Extract backbone name (e.g., "distilbert" from "distilbert-base-uncased" or use as-is if already short)
backbone_value = final_training_config["backbone"]
# If backbone contains hyphens, extract the first part (e.g., "distilbert" from "distilbert-base-uncased")
# Otherwise use as-is (e.g., "distilbert" stays "distilbert")
backbone_name = backbone_value.split("-")[0] if "-" in backbone_value else backbone_value
training_env["MLFLOW_RUN_NAME"] = f"{backbone_name}_{final_training_run_id}"
print(f"MLflow run name: {training_env['MLFLOW_RUN_NAME']}")


MLflow run name: distilbert_20251228_000723


In [49]:
result = subprocess.run(
    training_args,
    cwd=ROOT_DIR,
    env=training_env,
    capture_output=True,
    text=True,
)

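if result.returncode != 0:
    # capture_output=True hides the subprocess logs, so print them before raising
    print("STDOUT:", result.stdout, sep="\n")
    print("STDERR:", result.stderr, sep="\n")
    raise RuntimeError(f"Final training failed with return code {result.returncode}")
else:
    # Print output for successful runs too (helpful for debugging)
    if result.stdout:
        print(result.stdout)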


Attempted to log scalar metric param_learning_rate:
1.850331567345489e-05
Attempted to log scalar metric param_batch_size:
3
Attempted to log scalar metric param_dropout:
0.1846357866460817
Attempted to log scalar metric param_weight_decay:
0.03697191631420018
Attempted to log scalar metric param_epochs:
1
Attempted to log scalar metric param_backbone:
distilbert-base-uncased
Attempted to log scalar metric macro-f1:
0.2900154400411734
Attempted to log scalar metric macro-f1-span:
0.060606060606060615
Attempted to log scalar metric loss:
2.014418840408325
🏃 View run plucky_calypso_v3n8cx1y at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-802c-4fdf-9e59-e3d7969bcf31/resourceGroups/resume_ner_2025-12-14-13-17-35/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/a5897b88-fd66-448c-ae65-1ef21bfc11dd/runs/c8e83b94-82a3-4152-b649-152ee581d5f4
🧪 View experiment at: https://japanwest.api.azureml.ms/mlflow/v2.0/subscriptions/a23fa87c-

In [50]:
import json
import shutil
from pathlib import Path
import os

# Check actual checkpoint location
# The training script may save to outputs/checkpoint instead of final_output_dir/checkpoint
actual_checkpoint = ROOT_DIR / "outputs" / "checkpoint"
actual_metrics = ROOT_DIR / "outputs" / METRICS_FILENAME
expected_checkpoint = final_output_dir / "checkpoint"
expected_metrics = final_output_dir / METRICS_FILENAME

print("Checking training completion...")
print(f"  Expected checkpoint: {expected_checkpoint} (exists: {expected_checkpoint.exists()})")
print(f"  Actual checkpoint: {actual_checkpoint} (exists: {actual_checkpoint.exists()})")
print(f"  Expected metrics: {expected_metrics} (exists: {expected_metrics.exists()})")
print(f"  Actual metrics: {actual_metrics} (exists: {actual_metrics.exists()})")

# Determine which checkpoint and metrics to use
checkpoint_source = None
metrics_file = None

if expected_checkpoint.exists() and any(expected_checkpoint.iterdir()):
    checkpoint_source = expected_checkpoint
    print(f"✓ Using expected checkpoint location: {checkpoint_source}")
elif actual_checkpoint.exists() and any(actual_checkpoint.iterdir()):
    checkpoint_source = actual_checkpoint
    print(f"✓ Using actual checkpoint location: {checkpoint_source}")
    # Update final_output_dir to match actual location
    final_output_dir = actual_checkpoint.parent

if expected_metrics.exists():
    metrics_file = expected_metrics
elif actual_metrics.exists():
    metrics_file = actual_metrics

# Load metrics if available
metrics = None
if metrics_file and metrics_file.exists():
    with open(metrics_file, "r") as f:
        metrics = json.load(f)
    print(f"✓ Metrics loaded from: {metrics_file}")
    print(f"  Metrics: {metrics}")
elif checkpoint_source:
    print("⚠ Warning: Metrics file not found, but checkpoint exists.")
    metrics = {"status": "completed", "checkpoint_found": True}
else:
    raise FileNotFoundError(
        f"Training completed but no checkpoint found.\n"
        f"  Expected: {expected_checkpoint}\n"
        f"  Actual: {actual_checkpoint}\n"
        f"  Please check training logs for errors."
    )


Checking training completion...
  Expected checkpoint: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\distilbert_20251228_000723\checkpoint (exists: True)
  Actual checkpoint: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\checkpoint (exists: False)
  Expected metrics: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\distilbert_20251228_000723\metrics.json (exists: True)
  Actual metrics: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\metrics.json (exists: False)
✓ Using expected checkpoint location: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\distilbert_20251228_000723\checkpoint
✓ Metrics loaded from: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\distilbert_20251228_000723\metrics.json
  Metrics: {'macro-f1': 0.2900154400411734, 'macro-f1-span': 0.060606060606060615, 'loss': 2.014418840408325, 'per_entity': {'AME': {'precision': 0

In [51]:
from orchestration.paths import (
    resolve_output_path,
    save_cache_with_dual_strategy,
)
from datetime import datetime

# Prepare cache data
final_training_cache_data = {
    "output_dir": str(final_output_dir),
    "backbone": final_training_config["backbone"],
    "run_id": final_training_run_id,
    "config": final_training_config,
    "metrics": metrics,  # Include metrics if available
}

# Save using dual file strategy
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backbone = final_training_config["backbone"].replace('-', '_').replace('/', '_')
run_id = final_training_run_id.replace('-', '_')

timestamped_file, latest_file, index_file = save_cache_with_dual_strategy(
    root_dir=ROOT_DIR,
    config_dir=CONFIG_DIR,
    cache_type="final_training",
    data=final_training_cache_data,
    backbone=backbone,
    identifier=run_id,
    timestamp=timestamp,
    additional_metadata={
        "checkpoint_path": str(checkpoint_source) if checkpoint_source else None,
    }
)

# Also save to legacy location for backward compatibility
LEGACY_CACHE_FILE = ROOT_DIR / "notebooks" / "final_training_cache.json"
save_json(LEGACY_CACHE_FILE, final_training_cache_data)

print(f"✓ Saved timestamped final training cache: {timestamped_file}")
print(f"✓ Updated latest cache: {latest_file}")
print(f"✓ Updated index: {index_file}")
print(f"✓ Saved legacy cache: {LEGACY_CACHE_FILE}")


✓ Saved timestamped final training cache: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\final_training\final_training_distilbert_20251228_000723_20251228_001051.json
✓ Updated latest cache: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\final_training\latest_final_training_cache.json
✓ Updated index: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\cache\final_training\final_training_index.json
✓ Saved legacy cache: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\notebooks\final_training_cache.json
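
The cache layout used in this step (a timestamped snapshot, a `latest` file, and an index) can be illustrated with a minimal sketch. The function `save_cache_dual` below is hypothetical and is not the implementation of `save_cache_with_dual_strategy`; only the file naming is modelled on the paths this step prints, and the index format is an assumption.

```python
import json
from pathlib import Path


def save_cache_dual(cache_dir: Path, cache_type: str, identifier: str,
                    timestamp: str, data: dict) -> tuple[Path, Path, Path]:
    """Hypothetical sketch: write a timestamped snapshot, refresh a 'latest'
    copy, and append an entry to an index file."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    # 1. Timestamped snapshot that is never overwritten by later runs
    timestamped = cache_dir / f"{cache_type}_{identifier}_{timestamp}.json"
    timestamped.write_text(json.dumps(data, indent=2))

    # 2. "Latest" copy that later steps can load without knowing the run ID
    latest = cache_dir / f"latest_{cache_type}_cache.json"
    latest.write_text(json.dumps(data, indent=2))

    # 3. Index listing every snapshot, newest last (assumed format)
    index = cache_dir / f"{cache_type}_index.json"
    entries = json.loads(index.read_text()) if index.exists() else []
    entries.append({"file": timestamped.name, "timestamp": timestamp})
    index.write_text(json.dumps(entries, indent=2))

    return timestamped, latest, index
```

With this layout, a rerun adds a new snapshot and index entry while the `latest` file always reflects the most recent run, which is what lets downstream steps load the cache with `use_latest=True`.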


## Step P1-3.8: Continued Training (Optional)

Continue training with new data by loading a checkpoint from a previous training run. This enables fine-tuning and domain adaptation while preserving the model's learned knowledge.

**Use Cases:**
- Fine-tuning on new domain-specific data
- Incremental learning with additional training samples
- Domain adaptation to new data distributions

**Configuration:**
- Uses `config/train_continued.yaml` for continued training settings
- Supports multiple data combination strategies (new_only, combined, append)
- Automatically resolves checkpoint path from previous training cache or config
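
As a rough illustration of the checkpoint resolution described above, the sketch below tries an explicitly configured path first and then falls back to the previous run's cache. The function `resolve_checkpoint` is hypothetical, and the priority order is an assumption; the project's `resolve_checkpoint_path` may behave differently. The `output_dir` key and the `checkpoint` subfolder follow the cache written in the final training step.

```python
from pathlib import Path
from typing import Optional


def resolve_checkpoint(cache: Optional[dict],
                       configured_path: Optional[str] = None) -> Optional[Path]:
    """Hypothetical sketch: return a usable checkpoint directory, or None
    to signal that training should start from the plain backbone."""
    candidates = []
    if configured_path:
        # Assumed ordering: an explicit path in the config takes priority
        candidates.append(Path(configured_path))
    if cache and cache.get("output_dir"):
        # Otherwise fall back to the previous run recorded in the cache
        candidates.append(Path(cache["output_dir"]) / "checkpoint")

    for candidate in candidates:
        # Accept only a directory that exists and is non-empty
        if candidate.is_dir() and any(candidate.iterdir()):
            return candidate
    return None
```

Returning `None` rather than raising keeps the "no checkpoint found, create a new model from the backbone" path available to the caller.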


In [None]:
# Resolve checkpoint path and prepare dataset for continued training
if CONTINUED_EXPERIMENT_ENABLED:
    # Get previous training cache
    from orchestration.paths import load_cache_file

    # Try loading from centralized cache first
    previous_training = load_cache_file(
        ROOT_DIR, CONFIG_DIR, "final_training", use_latest=True
    )

    # Fallback to legacy location
    # Initialize so the name exists even when the centralized cache was found
    previous_cache_path = None
    if previous_training is None:
        previous_cache_path = ROOT_DIR / continued_training_config.get(
            "previous_training_cache",
            "notebooks/final_training_cache.json",
        )
        previous_training = load_json(previous_cache_path, default=None)

    if previous_training:
        # Get checkpoint directory from previous training
        previous_output_dir = Path(previous_training.get("output_dir", ""))
        if previous_output_dir.exists():
            previous_checkpoint_dir = previous_output_dir / "checkpoint"
        else:
            # Try to get from final_training_cache.json
            final_training_cache = load_json(
                ROOT_DIR / "notebooks" / "final_training_cache.json",
                default=None,
            )
            if final_training_cache:
                previous_output_dir = Path(
                    final_training_cache.get("output_dir", ""))
                previous_checkpoint_dir = previous_output_dir / "checkpoint"
            else:
                previous_checkpoint_dir = None
    else:
        previous_checkpoint_dir = None

    # Resolve checkpoint path using checkpoint loader
    backbone = (
        continued_configs["model"]["backbone"].split("-")[0]
        if "-" in continued_configs["model"]["backbone"]
        else continued_configs["model"]["backbone"]
    )
    run_id = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Add checkpoint config to training config for resolution
    continued_configs["training"]["checkpoint"] = checkpoint_config
    continued_configs["training"]["run_id"] = run_id
    continued_configs["_config_dir"] = CONFIG_DIR

    checkpoint_path = resolve_checkpoint_path(
        config=continued_configs,
        previous_cache_path=previous_cache_path if previous_training else None,
        backbone=backbone,
        run_id=run_id,
    )

    if checkpoint_path:
        print(f"✓ Resolved checkpoint: {checkpoint_path}")
    else:
        print("⚠ No checkpoint found. Will create new model from backbone.")
        checkpoint_path = None

    # Prepare dataset based on strategy
    data_strategy = data_config_continued.get("strategy", "combined")
    new_dataset_path_str = data_config_continued.get("new_dataset_path")

    if not new_dataset_path_str:
        # Try to get from experiment config
        new_data_config = continued_training_config.get("new_data", {})
        new_dataset_path_str = new_data_config.get("local_path")
        if new_dataset_path_str:
            new_dataset_path = (CONFIG_DIR / new_dataset_path_str).resolve()
        else:
            # Fallback to data config
            new_dataset_path = DATASET_LOCAL_PATH
    else:
        new_dataset_path = (
            (CONFIG_DIR / new_dataset_path_str).resolve()
            if not Path(new_dataset_path_str).is_absolute()
            else Path(new_dataset_path_str)
        )

    print(f"New dataset path: {new_dataset_path}")

    # Combine datasets based on strategy
    if data_strategy == "new_only":
        combined_dataset = load_dataset(str(new_dataset_path))
        print(
            f"✓ Using new dataset only ({len(combined_dataset.get('train', []))} samples)"
        )
    else:
        old_dataset_path_str = data_config_continued.get("old_dataset_path")
        if old_dataset_path_str:
            old_dataset_path = (
                (CONFIG_DIR / old_dataset_path_str).resolve()
                if not Path(old_dataset_path_str).is_absolute()
                else Path(old_dataset_path_str)
            )
        else:
            old_dataset_path = DATASET_LOCAL_PATH

        validation_ratio = data_config_continued.get("validation_ratio", 0.1)
        random_seed = data_config_continued.get("random_seed", 42)

        combined_dataset = combine_datasets(
            old_dataset_path=old_dataset_path,
            new_dataset_path=new_dataset_path,
            strategy=data_strategy,
            validation_ratio=validation_ratio,
            random_seed=random_seed,
        )
        print(f"✓ Combined datasets using '{data_strategy}' strategy")
        print(
            f"  Total training samples: {len(combined_dataset.get('train', []))}"
        )
        print(
            f"  Validation samples: {len(combined_dataset.get('validation', []))}"
        )

    # Save combined dataset
    combined_dataset_dir = (
        ROOT_DIR / "outputs" / "continued_training" / "combined_dataset"
    )
    combined_dataset_dir.mkdir(parents=True, exist_ok=True)

    import json

    with open(combined_dataset_dir / "train.json", "w") as f:
        json.dump(combined_dataset.get("train", []), f, indent=2)

    if combined_dataset.get("validation"):
        with open(combined_dataset_dir / "validation.json", "w") as f:
            json.dump(combined_dataset["validation"], f, indent=2)

    CONTINUED_DATASET_PATH = combined_dataset_dir
    print(f"✓ Combined dataset saved to: {CONTINUED_DATASET_PATH}")

else:
    print("Skipping continued training setup (disabled)")

## Step P1-4: Model Conversion & Optimization

Convert the final training checkpoint to an optimized ONNX model (int8 quantized) for production inference.

**Platform Adapter Note**: The conversion script (`src/model_conversion/convert_to_onnx.py`) uses the platform adapter to automatically handle output paths and logging appropriately for local execution.

**Checkpoint Restoration**: 
- **Google Colab**: If the checkpoint is not found locally (e.g., after a session disconnect), it will be automatically restored from Google Drive.
- **Kaggle**: Checkpoints are automatically persisted in `/kaggle/working/` - no restoration needed.
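
The restore-on-miss behaviour described for Colab can be sketched as follows. The helper `restore_if_missing` is hypothetical and is not the notebook's `restore_from_drive`; it assumes the backup is an ordinary file or folder on a mounted path.

```python
import shutil
from pathlib import Path


def restore_if_missing(local_path: Path, backup_path: Path) -> bool:
    """Hypothetical sketch: copy a file or directory back from a backup
    location when it is missing locally.

    Returns True if local_path exists afterwards, False if no backup exists.
    """
    if local_path.exists():
        return True  # nothing to restore
    if not backup_path.exists():
        return False  # no backup available
    local_path.parent.mkdir(parents=True, exist_ok=True)
    if backup_path.is_dir():
        shutil.copytree(backup_path, local_path)  # e.g. a checkpoint folder
    else:
        shutil.copy2(backup_path, local_path)  # e.g. a JSON cache file
    return True
```

On Kaggle such a helper would normally return on the first check, since outputs under `/kaggle/working/` are already persisted.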


In [55]:
from pathlib import Path
import os
import sys
import subprocess
import mlflow
import shutil
from shared.json_cache import load_json

CONVERSION_SCRIPT_PATH = SRC_DIR / "model_conversion" / "convert_to_onnx.py"
FINAL_TRAINING_CACHE_FILE = ROOT_DIR / "notebooks" / "final_training_cache.json"
CONVERSION_OUTPUT_DIR = ROOT_DIR / "outputs" / "conversion"

In [56]:
from orchestration.paths import load_cache_file

# Try loading from centralized cache first
training_cache = load_cache_file(
    ROOT_DIR, CONFIG_DIR, "final_training", use_latest=True
)

# Fallback to legacy location
if training_cache is None:
    LEGACY_CACHE_FILE = ROOT_DIR / "notebooks" / "final_training_cache.json"
    training_cache = load_json(LEGACY_CACHE_FILE, default=None)

# Try to restore from Google Drive if still not found
if training_cache is None:
    if restore_from_drive("final_training_cache.json", LEGACY_CACHE_FILE, is_directory=False):
        training_cache = load_json(LEGACY_CACHE_FILE, default=None)

if training_cache is None:
    raise FileNotFoundError(
        f"Final training cache not found locally or in backup.\n"
        f"Please run Step P1-3.7: Final Training first."
    )


In [57]:
# Extract checkpoint directory, backbone, and create conversion output directory
from datetime import datetime

# Get checkpoint directory from training cache
checkpoint_source = Path(training_cache.get("output_dir", "")) / "checkpoint"
if not checkpoint_source.exists():
    # Try alternative location
    checkpoint_source = Path(training_cache.get("output_dir", "")) / CHECKPOINT_DIRNAME
    if not checkpoint_source.exists():
        raise FileNotFoundError(
            f"Checkpoint not found in training cache output_dir: {training_cache.get('output_dir', '')}"
        )

checkpoint_dir = checkpoint_source
print(f"Using checkpoint: {checkpoint_dir}")

# Extract backbone from training cache
backbone = training_cache.get("backbone", "unknown")
if backbone == "unknown":
    # Try to get from config
    backbone = training_cache.get("config", {}).get("backbone", "unknown")
    if backbone == "unknown":
        raise ValueError("Could not determine backbone from training cache")

# Extract backbone name (e.g., "distilbert" from "distilbert-base-uncased")
backbone_name = backbone.split("-")[0] if "-" in backbone else backbone

# Generate conversion run ID
conversion_run_id = datetime.now().strftime("%Y%m%d_%H%M%S")

# Create conversion output directory
conversion_output_dir = CONVERSION_OUTPUT_DIR / f"{backbone_name}_{conversion_run_id}"
conversion_output_dir.mkdir(parents=True, exist_ok=True)

print(f"Conversion output directory: {conversion_output_dir}")
print(f"Backbone: {backbone}")
print(f"Conversion Run ID: {conversion_run_id}")


Using checkpoint: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\final_training\distilbert_20251228_000723\checkpoint
Conversion output directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\conversion\distilbert_20251228_001226
Backbone: distilbert
Conversion Run ID: 20251228_001226


In [58]:
# Run conversion as a module (python -m model_conversion.convert_to_onnx) to allow relative imports to work
# This requires src/ to be in PYTHONPATH (set in env below)
conversion_args = [
    sys.executable,
    "-m",
    "model_conversion.convert_to_onnx",
    "--checkpoint-path",
    str(checkpoint_dir),
    "--config-dir",
    str(CONFIG_DIR),
    "--backbone",
    backbone,
    "--output-dir",
    str(conversion_output_dir),
    "--quantize-int8",
    "--run-smoke-test",
]


In [59]:
conversion_env = os.environ.copy()
conversion_env["AZURE_ML_OUTPUT_onnx_model"] = str(conversion_output_dir)

# Add src directory to PYTHONPATH to allow relative imports in model_conversion.convert_to_onnx
pythonpath = conversion_env.get("PYTHONPATH", "")
if pythonpath:
    conversion_env["PYTHONPATH"] = f"{str(SRC_DIR)}{os.pathsep}{pythonpath}"
else:
    conversion_env["PYTHONPATH"] = str(SRC_DIR)

mlflow_tracking_uri = mlflow.get_tracking_uri()
if mlflow_tracking_uri:
    conversion_env["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri


In [60]:
result = subprocess.run(
    conversion_args,
    cwd=ROOT_DIR,
    env=conversion_env,
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    print("Model conversion failed with the following output:")
    print("=" * 80)
    if result.stdout:
        print("STDOUT:")
        print(result.stdout)
    if result.stderr:
        print("STDERR:")
        print(result.stderr)
    print("=" * 80)
    raise RuntimeError(f"Model conversion failed with return code {result.returncode}")
else:
    # Print output for successful runs too (helpful for debugging)
    if result.stdout:
        print(result.stdout)


In [61]:
from shared.json_cache import save_json
import shutil

ONNX_MODEL_FILENAME = "model_int8.onnx"
FALLBACK_ONNX_MODEL_FILENAME = "model.onnx"
CONVERSION_CACHE_FILE = ROOT_DIR / "notebooks" / "conversion_cache.json"

onnx_model_path = conversion_output_dir / ONNX_MODEL_FILENAME
if not onnx_model_path.exists():
    onnx_model_path = conversion_output_dir / FALLBACK_ONNX_MODEL_FILENAME

if not onnx_model_path.exists():
    raise FileNotFoundError(f"ONNX model not found in {conversion_output_dir}")

print(f"✓ Conversion completed. ONNX model: {onnx_model_path}")

save_json(CONVERSION_CACHE_FILE, {
    "onnx_model_path": str(onnx_model_path),
    "backbone": backbone,
    "checkpoint_dir": str(checkpoint_dir),
})

# Backup ONNX model to Google Drive (if available)
if onnx_model_path.exists():
    backup_to_drive(onnx_model_path, f"{backbone}_model.onnx", is_directory=False)
else:
    print(f"⚠ Warning: ONNX model not found for backup: {onnx_model_path}")

# Backup conversion cache file to Drive
backup_to_drive(CONVERSION_CACHE_FILE, "conversion_cache.json", is_directory=False)


✓ Conversion completed. ONNX model: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\conversion\distilbert_20251228_001226\model_int8.onnx


False