# Phase 1: Training Orchestration (Google Colab)

This notebook orchestrates all training activities for **Google Colab execution** with GPU compute support.

## Overview

- **Step 1**: Repository Setup
- **Step 2**: Install Dependencies
- **Step 3**: Setup Paths and Import Paths
- **Step 4**: Mount Google Drive
- **Step P1-3.1**: Load Centralized Configs
- **Step P1-3.2**: Verify Local Dataset (from data config)
- **Step P1-3.3**: Setup Local Environment
- **Step P1-3.4**: The Dry Run
- **Step P1-3.5**: The Sweep (HPO) - Local with Optuna
- **Step P1-3.6**: Best Configuration Selection (Automated)
- **Step P1-3.7**: Final Training (Post-HPO, Single Run)
- **Step P1-4**: Model Conversion & Optimization

## Important

- This notebook **executes training in Google Colab** (not on Azure ML)
- All computation happens on Google Colab GPU
- **Google Drive Integration**: After final training completes, the checkpoint is backed up to Google Drive so it persists across sessions
- The notebook must be **re-runnable end-to-end**
- Uses the dataset path specified in the data config (from `config/data/*.yaml`), typically pointing to a local folder included in the repository
- **Session Management**: Colab sessions time out after 12-24 hours (depending on the Colab plan). Because the checkpoint is backed up to Drive, Step P1-4 can restore it if the session disconnects after training.


## Step 1: Repository Setup

Set up the repository in Google Colab. Choose one of the following options:

### Option A: Clone from Git (Recommended)

If your repository is on GitHub/GitLab, clone it:

```python
!git clone <your-repository-url> /content/resume-ner-azureml
```

### Option B: Upload Files

If you prefer to upload files manually:
1. Use the Colab file browser (folder icon on left sidebar)
2. Upload your project files to `/content/resume-ner-azureml/`
3. Ensure the directory structure matches: `src/`, `config/`, `notebooks/`, etc.

### Verify Repository Setup

After cloning or uploading, verify the repository structure:


In [1]:
from pathlib import Path

# Set repository root directory
# Change this if you used a different path in Step 1
ROOT_DIR = Path("/content/resume-ner-azureml")

# Verify repository structure
if not ROOT_DIR.exists():
    raise FileNotFoundError(
        f"Repository not found at {ROOT_DIR}\n"
        f"Please run Step 1 to clone or upload the repository."
    )

required_dirs = ["src", "config", "notebooks"]
missing_dirs = [d for d in required_dirs if not (ROOT_DIR / d).exists()]

if missing_dirs:
    raise FileNotFoundError(
        f"Missing required directories: {missing_dirs}\n"
        f"Please ensure the repository structure is correct."
    )

print(f"✓ Repository found at: {ROOT_DIR}")
print(f"✓ Required directories found: {required_dirs}")



## Step 2: Install Dependencies

Install all required Python packages. PyTorch is usually pre-installed in Colab, but we'll verify the version.
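
The version check compares only the major and minor components of the version string, as integers. Comparing them as strings would rank `2.10` below `2.6`. A minimal sketch of that comparison, assuming version strings shaped like `2.6.0+cu124`; the helper names `major_minor` and `meets_minimum` are hypothetical and not part of the repository:

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) as integers from a string such as '2.6.0+cu124'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: Tuple[int, int] = (2, 6)) -> bool:
    """Return True when the version is at least the given (major, minor) pair."""
    return major_minor(version) >= minimum


# Tuples compare element by element, so 2.10 correctly ranks above 2.6
print(meets_minimum("2.10.1"), meets_minimum("2.5.1+cu121"))
```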


In [None]:
import torch

# Check PyTorch version (usually pre-installed in Colab)
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Verify PyTorch version meets requirements (>=2.6.0)
torch_version = tuple(map(int, torch.__version__.split('.')[:2]))
if torch_version < (2, 6):
    print(f"⚠ Warning: PyTorch {torch.__version__} may not meet requirements (>=2.6.0)")
    print('Consider upgrading: %pip install "torch>=2.6.0" --upgrade')
else:
    print("✓ PyTorch version meets requirements")


In [None]:
# Install required packages
# Version specifiers are quoted so the shell does not treat ">" and "<" as redirections
# Core ML libraries
%pip install "transformers>=4.35.0,<5.0.0" --quiet
%pip install "safetensors>=0.4.0" --quiet
%pip install "datasets>=2.12.0" --quiet

# ML utilities
%pip install "numpy>=1.24.0,<2.0.0" --quiet
%pip install "pandas>=2.0.0" --quiet
%pip install "scikit-learn>=1.3.0" --quiet

# Utilities
%pip install "pyyaml>=6.0" --quiet
%pip install "tqdm>=4.65.0" --quiet
%pip install "seqeval>=1.2.2" --quiet
%pip install "sentencepiece>=0.1.99" --quiet

# Experiment tracking
%pip install mlflow --quiet
%pip install optuna --quiet

# ONNX support
%pip install onnxruntime --quiet
%pip install "onnx>=1.16.0" --quiet
%pip install "onnxscript>=0.1.0" --quiet

print("✓ All dependencies installed")


## Step 3: Setup Paths and Import Paths

Configure Python paths and verify Colab environment detection.


In [None]:
import os
import sys
from pathlib import Path

# Verify Colab environment
IN_COLAB = "COLAB_GPU" in os.environ or "COLAB_TPU" in os.environ

# Setup paths (ROOT_DIR should already be set by the verification cell in Step 1)
# If not, set it here
if 'ROOT_DIR' not in globals():
    ROOT_DIR = Path("/content/resume-ner-azureml")

SRC_DIR = ROOT_DIR / "src"
CONFIG_DIR = ROOT_DIR / "config"
NOTEBOOK_DIR = ROOT_DIR / "notebooks"

# Add to Python path (skip entries already present so re-running the cell is harmless)
for path in (ROOT_DIR, SRC_DIR):
    if str(path) not in sys.path:
        sys.path.append(str(path))

print("Notebook directory:", NOTEBOOK_DIR)
print("Project root:", ROOT_DIR)
print("Source directory:", SRC_DIR)
print("Config directory:", CONFIG_DIR)
print("In Colab:", IN_COLAB)

if not IN_COLAB:
    print("⚠ Warning: Not detected as Colab environment. Some features may not work correctly.")


## Step 4: Mount Google Drive

Mount Google Drive to enable checkpoint persistence across Colab sessions. Checkpoints will be automatically saved to Drive after training completes.


In [2]:
from google.colab import drive
from pathlib import Path

# Mount Google Drive
drive.mount('/content/drive')

# Define backup directory in Google Drive
DRIVE_BACKUP_DIR = Path("/content/drive/MyDrive/resume-ner-checkpoints")
DRIVE_BACKUP_DIR.mkdir(parents=True, exist_ok=True)

print(f"✓ Google Drive mounted")
print(f"✓ Checkpoint backup directory: {DRIVE_BACKUP_DIR}")
print(f"\nNote: Checkpoints will be automatically saved to this directory after training completes.")



## Step P1-3.1: Load Centralized Configs

Load and validate all configuration files. Configs are immutable and will be logged with each job for reproducibility.
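
The hashing and immutability helpers live in `orchestration.config_loader`, and their implementations are not shown in this notebook. As a rough illustration of the idea only, a config can be hashed from a canonical JSON encoding and compared against a deep-copied snapshot. The names `hash_config` and `assert_unchanged` below are hypothetical stand-ins, not the repository's API:

```python
import copy
import hashlib
import json
from typing import Any, Dict


def hash_config(config: Dict[str, Any]) -> str:
    """Return a SHA-256 digest that does not depend on key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_unchanged(configs: Dict[str, Any], snapshot: Dict[str, Any]) -> None:
    """Raise if any config domain differs from its snapshot."""
    for domain, original in snapshot.items():
        if hash_config(configs[domain]) != hash_config(original):
            raise ValueError(f"Config domain '{domain}' was mutated at runtime")


# Illustrative values only; real configs come from the YAML files
configs_example = {"train": {"training": {"epochs": 5, "batch_size": 16}}}
snapshot_example = copy.deepcopy(configs_example)
assert_unchanged(configs_example, snapshot_example)  # no mutation yet, so no error
```

Sorting keys before hashing means two configs with the same content always produce the same digest, which is what makes the hash usable as a reproducibility record.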


In [None]:
from pathlib import Path
from typing import Any, Dict

from orchestration import EXPERIMENT_NAME
from orchestration.config_loader import (
    ExperimentConfig,
    compute_config_hashes,
    create_config_metadata,
    load_all_configs,
    load_experiment_config,
    snapshot_configs,
    validate_config_immutability,
)

# P1-3.1: Load Centralized Configs (local-only)
# Mirrors the Azure orchestration notebook, but does not create an Azure ML client.

if not CONFIG_DIR.exists():
    raise FileNotFoundError(f"Config directory not found: {CONFIG_DIR}")

experiment_config: ExperimentConfig = load_experiment_config(CONFIG_DIR, EXPERIMENT_NAME)
configs: Dict[str, Any] = load_all_configs(experiment_config)
config_hashes = compute_config_hashes(configs)
config_metadata = create_config_metadata(configs, config_hashes)

# Immutable snapshots for runtime mutation checks
original_configs = snapshot_configs(configs)
validate_config_immutability(configs, original_configs)

print(f"Loaded experiment: {experiment_config.name}")
print("Loaded config domains:", sorted(configs.keys()))
print("Config hashes:", config_hashes)
print("Config metadata:", config_metadata)

# Get dataset path from data config (centralized configuration)
# The local_path in the data config is relative to the config directory
data_config = configs["data"]
local_path_str = data_config.get("local_path", "../dataset")
DATASET_LOCAL_PATH = (CONFIG_DIR / local_path_str).resolve()

print(f"Dataset path (from data config): {DATASET_LOCAL_PATH}")


## Step P1-3.2: Verify Local Dataset

Verify that the dataset directory (specified by `local_path` in the data config) exists and contains the required files. The dataset path is loaded from the centralized data configuration in Step P1-3.1.


In [None]:
# P1-3.2: Verify Local Dataset
# The dataset path comes from the data config's local_path field (loaded in Step P1-3.1).
# This ensures the dataset location is controlled by centralized configuration.
# Note: train.json is required, but validation.json is optional (matches training script behavior).

REQUIRED_FILE = "train.json"
OPTIONAL_FILE = "validation.json"

if not DATASET_LOCAL_PATH.exists():
    raise FileNotFoundError(
        f"Dataset directory not found: {DATASET_LOCAL_PATH}\n"
        f"This path comes from the data config's 'local_path' field.\n"
        f"If you need to create the dataset, run the notebook: notebooks/00_make_tiny_dataset.ipynb"
    )

# Check required file
train_file = DATASET_LOCAL_PATH / REQUIRED_FILE
if not train_file.exists():
    raise FileNotFoundError(
        f"Required dataset file not found: {train_file}\n"
        f"This path comes from the data config's 'local_path' field.\n"
        f"If you need to create it, run the notebook: notebooks/00_make_tiny_dataset.ipynb"
    )

# Check optional file
val_file = DATASET_LOCAL_PATH / OPTIONAL_FILE
has_validation = val_file.exists()

print(f"✓ Dataset directory found: {DATASET_LOCAL_PATH}")
print(f"  (from data config: {data_config.get('name', 'unknown')} v{data_config.get('version', 'unknown')})")

train_size = train_file.stat().st_size
print(f"  ✓ {REQUIRED_FILE} ({train_size:,} bytes)")

if has_validation:
    val_size = val_file.stat().st_size
    print(f"  ✓ {OPTIONAL_FILE} ({val_size:,} bytes)")
else:
    print(f"  ⚠ {OPTIONAL_FILE} not found (optional - training will proceed without validation set)")


## Step P1-3.3: Setup Local Environment

Verify GPU availability, set up MLflow tracking (local file store), and check that key dependencies are installed. This step ensures the local environment is ready for training.
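
The local file store is addressed through a `file://` URI built with `Path.as_uri()`, which only accepts absolute paths. The sketch below shows that behavior with a temporary directory standing in for the repository root; the temporary directory is an assumption for illustration, and no MLflow calls are made:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for ROOT_DIR / "mlruns"
    tracking_dir = Path(tmp).resolve() / "mlruns"
    tracking_dir.mkdir(exist_ok=True)
    tracking_uri = tracking_dir.as_uri()  # e.g. file:///tmp/.../mlruns

# A relative path cannot be expressed as a file URI
try:
    Path("mlruns").as_uri()
    relative_ok = True
except ValueError:
    relative_ok = False
```

This is why the tracking path is derived from the absolute `ROOT_DIR` and not from a bare relative `mlruns` folder.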


In [None]:
import sys
import torch

DEFAULT_DEVICE = "cuda"

env_config = configs["env"]
device_type = env_config.get("compute", {}).get("device", DEFAULT_DEVICE)

if device_type == "cuda" and not torch.cuda.is_available():
    raise RuntimeError(
        "CUDA device requested but not available. "
        "In Colab, ensure you've selected a GPU runtime: Runtime > Change runtime type > GPU"
    )


In [None]:
import mlflow

MLFLOW_DIR = "mlruns"
mlflow_tracking_path = ROOT_DIR / MLFLOW_DIR
mlflow_tracking_path.mkdir(exist_ok=True)

# Convert path to file:// URI format for MLflow
mlflow_tracking_uri = mlflow_tracking_path.as_uri()
mlflow.set_tracking_uri(mlflow_tracking_uri)


In [None]:
try:
    import transformers
    import optuna
except ImportError as e:
    raise ImportError(f"Required package not installed: {e}")

REQUIRED_PACKAGES = {
    "torch": torch,
    "transformers": transformers,
    "mlflow": mlflow,
    "optuna": optuna,
}

for name, module in REQUIRED_PACKAGES.items():
    if not hasattr(module, "__version__"):
        raise ImportError(f"Required package '{name}' is not properly installed")


## Step P1-3.4: The Dry Run

Run a minimal HPO sweep to validate the training pipeline works correctly before launching the full HPO sweep. Uses the smoke HPO configuration with reduced trials.


In [None]:
from pathlib import Path
import importlib.util
from orchestration import STAGE_SMOKE

# Import local_sweeps directly to avoid triggering Azure ML imports in __init__.py
local_sweeps_spec = importlib.util.spec_from_file_location(
    "local_sweeps", SRC_DIR / "orchestration" / "jobs" / "local_sweeps.py"
)
local_sweeps = importlib.util.module_from_spec(local_sweeps_spec)
local_sweeps_spec.loader.exec_module(local_sweeps)
run_local_hpo_sweep = local_sweeps.run_local_hpo_sweep

TRAINING_SCRIPT_PATH = SRC_DIR / "train.py"
DRY_RUN_OUTPUT_DIR = ROOT_DIR / "outputs" / "dry_run"

if not TRAINING_SCRIPT_PATH.exists():
    raise FileNotFoundError(f"Training script not found: {TRAINING_SCRIPT_PATH}")


In [None]:
hpo_config = configs["hpo"]
train_config = configs["train"]
backbone_values = hpo_config["search_space"]["backbone"]["values"]

dry_run_studies = {}

for backbone in backbone_values:
    mlflow_experiment_name = f"{experiment_config.name}-{STAGE_SMOKE}-{backbone}"
    backbone_output_dir = DRY_RUN_OUTPUT_DIR / backbone
    
    study = run_local_hpo_sweep(
        dataset_path=str(DATASET_LOCAL_PATH),
        config_dir=CONFIG_DIR,
        backbone=backbone,
        hpo_config=hpo_config,
        train_config=train_config,
        output_dir=backbone_output_dir,
        mlflow_experiment_name=mlflow_experiment_name,
    )
    
    dry_run_studies[backbone] = study


In [None]:
for backbone, study in dry_run_studies.items():
    if study.trials:
        best_trial = study.best_trial
        print(f"{backbone}: {len(study.trials)} trials completed")
        print(
            f"  Best {hpo_config['objective']['metric']}: {best_trial.value:.4f}")
        print(f"  Best params: {best_trial.params}")
    else:
        print(f"{backbone}: No trials completed")


## Step P1-3.5: The Sweep (HPO) - Local with Optuna

Run the full hyperparameter optimization sweep using Optuna to systematically search for the best model configuration. Uses the production HPO configuration with more trials than the dry run.

**Note on K-Fold Cross-Validation:**
- When k-fold CV is enabled (`k_fold.enabled: true`), each trial trains **k models** (one per fold) and returns the **average metric** across folds
- The number of **trials** is controlled by `sampling.max_trials` (e.g., 2 trials in smoke.yaml)
- With k=5 folds and 2 trials: **2 trials × 5 folds = 10 model trainings total**
- K-fold CV provides more robust hyperparameter evaluation but increases compute time (k× per trial)
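
The arithmetic above can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation of `create_kfold_splits` in `training.cv_utils`, which is not shown here; `kfold_indices` and `total_trainings` are hypothetical names:

```python
import random
from typing import List, Tuple


def kfold_indices(
    n_samples: int, k: int, seed: int = 42, shuffle: bool = True
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, val_indices) pairs covering every sample once."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, val))
        start += size
    return splits


def total_trainings(max_trials: int, k: int) -> int:
    """Each trial trains one model per fold."""
    return max_trials * k


splits = kfold_indices(n_samples=10, k=5)
print(len(splits), total_trainings(max_trials=2, k=5))  # prints: 5 10
```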


In [None]:
from pathlib import Path
import importlib.util
from orchestration import STAGE_HPO
from shared.yaml_utils import load_yaml

# Import local_sweeps directly to avoid triggering Azure ML imports in __init__.py
local_sweeps_spec = importlib.util.spec_from_file_location(
    "local_sweeps", SRC_DIR / "orchestration" / "jobs" / "local_sweeps.py"
)
local_sweeps = importlib.util.module_from_spec(local_sweeps_spec)
local_sweeps_spec.loader.exec_module(local_sweeps)
run_local_hpo_sweep = local_sweeps.run_local_hpo_sweep

DEFAULT_K_FOLDS = 5
DEFAULT_RANDOM_SEED = 42
HPO_OUTPUT_DIR = ROOT_DIR / "outputs" / "hpo"

hpo_stage_config = experiment_config.stages.get(STAGE_HPO, {})
hpo_config_override = hpo_stage_config.get("hpo_config")
hpo_config_path = CONFIG_DIR / hpo_config_override if hpo_config_override else experiment_config.hpo_config

if not hpo_config_path.exists():
    raise FileNotFoundError(f"HPO config not found: {hpo_config_path}")


In [None]:
hpo_config = load_yaml(hpo_config_path)
train_config = configs["train"]
backbone_values = hpo_config["search_space"]["backbone"]["values"]


In [None]:
from training.cv_utils import create_kfold_splits, save_fold_splits
from training.data import load_dataset

k_fold_config = hpo_config.get("k_fold", {})
k_folds_enabled = k_fold_config.get("enabled", False)
fold_splits_file = None

if k_folds_enabled:
    n_splits = k_fold_config.get("n_splits", DEFAULT_K_FOLDS)
    random_seed = k_fold_config.get("random_seed", DEFAULT_RANDOM_SEED)
    shuffle = k_fold_config.get("shuffle", True)
    
    full_dataset = load_dataset(str(DATASET_LOCAL_PATH))
    train_data = full_dataset.get("train", [])
    
    fold_splits = create_kfold_splits(
        dataset=train_data,
        k=n_splits,
        random_seed=random_seed,
        shuffle=shuffle,
    )
    
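    # Ensure the output directory exists before writing the splits file
    HPO_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    fold_splits_file = HPO_OUTPUT_DIR / "fold_splits.json"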
    save_fold_splits(
        fold_splits,
        fold_splits_file,
        metadata={
            "k": n_splits,
            "random_seed": random_seed,
            "shuffle": shuffle,
            "dataset_path": str(DATASET_LOCAL_PATH),
        }
    )


In [None]:
def build_mlflow_experiment_name(experiment_name: str, stage: str, backbone: str) -> str:
    return f"{experiment_name}-{stage}-{backbone}"

hpo_studies = {}
k_folds_param = k_fold_config.get("n_splits", DEFAULT_K_FOLDS) if k_folds_enabled else None

for backbone in backbone_values:
    mlflow_experiment_name = build_mlflow_experiment_name(
        experiment_config.name, STAGE_HPO, backbone
    )
    backbone_output_dir = HPO_OUTPUT_DIR / backbone
    
    study = run_local_hpo_sweep(
        dataset_path=str(DATASET_LOCAL_PATH),
        config_dir=CONFIG_DIR,
        backbone=backbone,
        hpo_config=hpo_config,
        train_config=train_config,
        output_dir=backbone_output_dir,
        mlflow_experiment_name=mlflow_experiment_name,
        k_folds=k_folds_param,
        fold_splits_file=fold_splits_file,
    )
    
    hpo_studies[backbone] = study


In [None]:
def extract_cv_statistics(best_trial):
    if not hasattr(best_trial, "user_attrs"):
        return None
    cv_mean = best_trial.user_attrs.get("cv_mean")
    cv_std = best_trial.user_attrs.get("cv_std")
    return (cv_mean, cv_std) if cv_mean is not None else None

objective_metric = hpo_config['objective']['metric']

for backbone, study in hpo_studies.items():
    if not study.trials:
        continue
    
    best_trial = study.best_trial
    cv_stats = extract_cv_statistics(best_trial)
    
    print(f"{backbone}: {len(study.trials)} trials completed")
    print(f"  Best {objective_metric}: {best_trial.value:.4f}")
    print(f"  Best params: {best_trial.params}")
    
    if cv_stats:
        cv_mean, cv_std = cv_stats
        print(f"  CV Statistics: Mean: {cv_mean:.4f} ± {cv_std:.4f}")


## Step P1-3.6: Best Configuration Selection (Automated)

Programmatically select the best configuration from all HPO sweep runs across all backbone models. The best configuration is determined by the objective metric specified in the HPO config.
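
Conceptually, the selection reduces each backbone's study to its best objective value and then picks the winner across backbones. The sketch below uses plain dictionaries in place of Optuna studies. The function `select_best`, the `goal` parameter, and the example values are all hypothetical, and the real logic lives in `local_selection.py`:

```python
from typing import Any, Dict


def select_best(results: Dict[str, Dict[str, Any]], goal: str = "maximize") -> Dict[str, Any]:
    """Pick the backbone whose best trial has the best objective value."""
    if not results:
        raise ValueError("No study results to select from")
    pick = max if goal == "maximize" else min
    backbone = pick(results, key=lambda name: results[name]["best_value"])
    return {"backbone": backbone, **results[backbone]}


# Made-up numbers for illustration; these are not measured results
example_results = {
    "backbone_a": {"best_value": 0.81, "hyperparameters": {"learning_rate": 3e-5}},
    "backbone_b": {"best_value": 0.84, "hyperparameters": {"learning_rate": 2e-5}},
}
best = select_best(example_results)
print(best["backbone"])  # prints: backbone_b
```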


In [None]:
from pathlib import Path
import importlib.util
from shared.json_cache import save_json

# Import local_selection directly to avoid triggering Azure ML imports in __init__.py
local_selection_spec = importlib.util.spec_from_file_location(
    "local_selection", SRC_DIR / "orchestration" / "jobs" / "local_selection.py"
)
local_selection = importlib.util.module_from_spec(local_selection_spec)
local_selection_spec.loader.exec_module(local_selection)
select_best_configuration_across_studies = local_selection.select_best_configuration_across_studies

BEST_CONFIG_CACHE_FILE = ROOT_DIR / "notebooks" / "best_configuration_cache.json"


In [None]:
dataset_version = data_config.get("version", "unknown")

best_configuration = select_best_configuration_across_studies(
    studies=hpo_studies,
    hpo_config=hpo_config,
    dataset_version=dataset_version,
)


In [None]:
save_json(BEST_CONFIG_CACHE_FILE, best_configuration)

print(f"Best configuration selected:")
print(f"  Backbone: {best_configuration.get('backbone')}")
print(f"  Trial: {best_configuration.get('trial_name')}")
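best_value = best_configuration.get('selection_criteria', {}).get('best_value')
best_value_text = f"{best_value:.4f}" if best_value is not None else "n/a"
print(f"  Best {hpo_config['objective']['metric']}: {best_value_text}")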
print(f"  Hyperparameters: {best_configuration.get('hyperparameters')}")
print(f"\nSaved to: {BEST_CONFIG_CACHE_FILE}")


## Step P1-3.7: Final Training (Post-HPO, Single Run)

Train the final production model under stable, controlled conditions using the best configuration found during HPO. The run uses the full number of training epochs from the training config, with early stopping disabled.

**Note**: After training completes, the checkpoint will be automatically backed up to Google Drive for persistence.
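
The training command passes boolean options as lowercase strings (`"true"` / `"false"`). How `src/train.py` parses them is not shown in this notebook. The sketch below is one common approach, with a hypothetical `str_to_bool` converter. It matters because `argparse` with `type=bool` would treat the string `"false"` as `True`, since any non-empty string is truthy.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert 'true'/'false' style strings into real booleans."""
    normalized = value.strip().lower()
    if normalized in {"true", "1", "yes"}:
        return True
    if normalized in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean string, got {value!r}")


# Hypothetical parser mirroring two of the flags used in this notebook
parser = argparse.ArgumentParser()
parser.add_argument("--early-stopping-enabled", type=str_to_bool, default=True)
parser.add_argument("--use-combined-data", type=str_to_bool, default=False)

args = parser.parse_args([
    "--early-stopping-enabled", str(False).lower(),
    "--use-combined-data", str(True).lower(),
])
print(args.early_stopping_enabled, args.use_combined_data)  # prints: False True
```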


In [None]:
from pathlib import Path
import os
import sys
import subprocess
import mlflow
from shared.json_cache import load_json, save_json
from orchestration import STAGE_TRAINING

# Define build_final_training_config locally to avoid importing Azure ML dependencies
# This function doesn't use Azure ML, so we can define it here
def build_final_training_config(
    best_config: dict,
    train_config: dict,
    random_seed: int = 42,
) -> dict:
    """
    Build final training configuration by merging best HPO config with train.yaml defaults.
    """
    hyperparameters = best_config.get("hyperparameters", {})
    training_defaults = train_config.get("training", {})
    
    return {
        "backbone": best_config["backbone"],
        "learning_rate": hyperparameters.get("learning_rate", training_defaults.get("learning_rate", 2e-5)),
        "dropout": hyperparameters.get("dropout", training_defaults.get("dropout", 0.1)),
        "weight_decay": hyperparameters.get("weight_decay", training_defaults.get("weight_decay", 0.01)),
        "batch_size": training_defaults.get("batch_size", 16),
        "epochs": training_defaults.get("epochs", 5),
        "random_seed": random_seed,
        "early_stopping_enabled": False,
        "use_combined_data": True,
        "use_all_data": True,
    }

DEFAULT_RANDOM_SEED = 42
BEST_CONFIG_CACHE_FILE = ROOT_DIR / "notebooks" / "best_configuration_cache.json"
FINAL_TRAINING_OUTPUT_DIR = ROOT_DIR / "outputs" / "final_training"


In [None]:
best_configuration = load_json(BEST_CONFIG_CACHE_FILE, default=None)

if best_configuration is None:
    raise FileNotFoundError(
        f"Best configuration cache not found: {BEST_CONFIG_CACHE_FILE}\n"
        f"Please run Step P1-3.6: Best Configuration Selection first."
    )

final_training_config = build_final_training_config(
    best_config=best_configuration,
    train_config=configs["train"],
    random_seed=DEFAULT_RANDOM_SEED,
)


In [None]:
mlflow_experiment_name = f"{experiment_config.name}-{STAGE_TRAINING}-{final_training_config['backbone']}"
final_output_dir = FINAL_TRAINING_OUTPUT_DIR / final_training_config['backbone']
final_output_dir.mkdir(parents=True, exist_ok=True)

mlflow.set_experiment(mlflow_experiment_name)


In [None]:
training_script_path = SRC_DIR / "train.py"
training_args = [
    sys.executable,
    str(training_script_path),
    "--data-asset",
    str(DATASET_LOCAL_PATH),
    "--config-dir",
    str(CONFIG_DIR),
    "--backbone",
    final_training_config["backbone"],
    "--learning-rate",
    str(final_training_config["learning_rate"]),
    "--batch-size",
    str(final_training_config["batch_size"]),
    "--dropout",
    str(final_training_config["dropout"]),
    "--weight-decay",
    str(final_training_config["weight_decay"]),
    "--epochs",
    str(final_training_config["epochs"]),
    "--random-seed",
    str(final_training_config["random_seed"]),
    "--early-stopping-enabled",
    str(final_training_config["early_stopping_enabled"]).lower(),
    "--use-combined-data",
    str(final_training_config["use_combined_data"]).lower(),
]


In [None]:
training_env = os.environ.copy()
training_env["AZURE_ML_OUTPUT_checkpoint"] = str(final_output_dir)

mlflow_tracking_uri = mlflow.get_tracking_uri()
if mlflow_tracking_uri:
    training_env["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri
training_env["MLFLOW_EXPERIMENT_NAME"] = mlflow_experiment_name


In [None]:
result = subprocess.run(
    training_args,
    cwd=ROOT_DIR,
    env=training_env,
    capture_output=False,
    text=True,
)

if result.returncode != 0:
    raise RuntimeError(f"Final training failed with return code {result.returncode}")


In [None]:
import json
import shutil
from pathlib import Path
import os

METRICS_FILENAME = "metrics.json"
FINAL_TRAINING_CACHE_FILE = ROOT_DIR / "notebooks" / "final_training_cache.json"

# Check actual checkpoint location
# The training script may save to outputs/checkpoint instead of final_output_dir/checkpoint
actual_checkpoint = ROOT_DIR / "outputs" / "checkpoint"
actual_metrics = ROOT_DIR / "outputs" / METRICS_FILENAME
expected_checkpoint = final_output_dir / "checkpoint"
expected_metrics = final_output_dir / METRICS_FILENAME

print("Checking training completion...")
print(f"  Expected checkpoint: {expected_checkpoint} (exists: {expected_checkpoint.exists()})")
print(f"  Actual checkpoint: {actual_checkpoint} (exists: {actual_checkpoint.exists()})")
print(f"  Expected metrics: {expected_metrics} (exists: {expected_metrics.exists()})")
print(f"  Actual metrics: {actual_metrics} (exists: {actual_metrics.exists()})")

# Determine which checkpoint and metrics to use
checkpoint_source = None
metrics_file = None

if expected_checkpoint.exists() and any(expected_checkpoint.iterdir()):
    checkpoint_source = expected_checkpoint
    print(f"✓ Using expected checkpoint location: {checkpoint_source}")
elif actual_checkpoint.exists() and any(actual_checkpoint.iterdir()):
    checkpoint_source = actual_checkpoint
    print(f"✓ Using actual checkpoint location: {checkpoint_source}")
    # Update final_output_dir to match actual location
    final_output_dir = actual_checkpoint.parent

if expected_metrics.exists():
    metrics_file = expected_metrics
elif actual_metrics.exists():
    metrics_file = actual_metrics

# Load metrics if available
metrics = None
if metrics_file and metrics_file.exists():
    with open(metrics_file, "r") as f:
        metrics = json.load(f)
    print(f"✓ Metrics loaded from: {metrics_file}")
    print(f"  Metrics: {metrics}")
elif checkpoint_source:
    print(f"⚠ Warning: Metrics file not found, but checkpoint exists.")
    metrics = {"status": "completed", "checkpoint_found": True}
else:
    raise FileNotFoundError(
        f"Training completed but no checkpoint found.\n"
        f"  Expected: {expected_checkpoint}\n"
        f"  Actual: {actual_checkpoint}\n"
        f"  Please check training logs for errors."
    )

# Save cache file with actual paths
save_json(FINAL_TRAINING_CACHE_FILE, {
    "output_dir": str(final_output_dir),
    "backbone": final_training_config["backbone"],
    "config": final_training_config,
})

# Backup checkpoint to Google Drive
if checkpoint_source and checkpoint_source.exists() and any(checkpoint_source.iterdir()):
    checkpoint_backup = DRIVE_BACKUP_DIR / f"{final_training_config['backbone']}_checkpoint"
    
    # Remove existing backup if it exists
    if checkpoint_backup.exists():
        shutil.rmtree(checkpoint_backup)
    
    # Copy checkpoint to Drive
    shutil.copytree(checkpoint_source, checkpoint_backup)
    print(f"\n✓ Checkpoint backed up to Google Drive: {checkpoint_backup}")
    
    # Backup cache file to Drive
    cache_backup = DRIVE_BACKUP_DIR / "final_training_cache.json"
    shutil.copy2(FINAL_TRAINING_CACHE_FILE, cache_backup)
    print(f"✓ Cache file backed up to Google Drive: {cache_backup}")
else:
    raise FileNotFoundError(
        f"Checkpoint directory not found or empty.\n"
        f"  Expected: {expected_checkpoint}\n"
        f"  Actual: {actual_checkpoint}\n"
        f"Training may have failed. Please check the training output above."
    )


## Step P1-4: Model Conversion & Optimization

Convert the final training checkpoint to an optimized ONNX model (int8 quantized) for production inference.

**Platform Adapter Note**: The conversion script (`src/convert_to_onnx.py`) uses the platform adapter to automatically handle output paths and logging appropriately for local execution.

**Checkpoint Restoration**: If the checkpoint is not found locally (e.g., after a Colab session disconnect), it will be automatically restored from Google Drive.
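
The restoration fallback can be summarized as: use the local checkpoint if it is populated, otherwise copy it back from the backup. Below is a minimal sketch that uses temporary directories in place of the Colab filesystem and Google Drive; `is_populated` and `resolve_checkpoint` are hypothetical helpers, not functions from the repository:

```python
import shutil
import tempfile
from pathlib import Path


def is_populated(directory: Path) -> bool:
    """True when the directory exists and contains at least one entry."""
    return directory.is_dir() and any(directory.iterdir())


def resolve_checkpoint(local_dir: Path, backup_dir: Path) -> Path:
    """Return a usable checkpoint directory, restoring from backup if needed."""
    if is_populated(local_dir):
        return local_dir
    if not is_populated(backup_dir):
        raise FileNotFoundError(f"No checkpoint at {local_dir} or {backup_dir}")
    local_dir.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok (Python 3.8+) tolerates an existing but empty target directory
    shutil.copytree(backup_dir, local_dir, dirs_exist_ok=True)
    return local_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    backup = root / "drive" / "demo_checkpoint"  # stands in for the Drive backup
    backup.mkdir(parents=True)
    (backup / "weights.bin").write_bytes(b"\x00")
    local = root / "outputs" / "checkpoint"  # missing, as after a disconnect
    restored = resolve_checkpoint(local, backup)
    restored_files = sorted(p.name for p in restored.iterdir())
```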


In [None]:
from pathlib import Path
import os
import sys
import subprocess
import mlflow
import shutil
from shared.json_cache import load_json

CONVERSION_SCRIPT_PATH = SRC_DIR / "convert_to_onnx.py"
FINAL_TRAINING_CACHE_FILE = ROOT_DIR / "notebooks" / "final_training_cache.json"
CONVERSION_OUTPUT_DIR = ROOT_DIR / "outputs" / "conversion"


In [None]:
training_cache = load_json(FINAL_TRAINING_CACHE_FILE, default=None)

if training_cache is None:
    # Try to restore from Google Drive
    cache_backup = DRIVE_BACKUP_DIR / "final_training_cache.json"
    if cache_backup.exists():
        print(f"Restoring training cache from Google Drive...")
        shutil.copy2(cache_backup, FINAL_TRAINING_CACHE_FILE)
        training_cache = load_json(FINAL_TRAINING_CACHE_FILE, default=None)
        print(f"✓ Training cache restored from Google Drive")
    else:
        raise FileNotFoundError(
            f"Final training cache not found locally or in Google Drive.\n"
            f"Please run Step P1-3.7: Final Training first."
        )

# Try to find checkpoint in expected location or actual location
backbone = training_cache["backbone"]
expected_checkpoint_dir = Path(training_cache["output_dir"]) / "checkpoint"
actual_checkpoint_dir = ROOT_DIR / "outputs" / "checkpoint"

print(f"Looking for checkpoint...")
print(f"  Expected: {expected_checkpoint_dir} (exists: {expected_checkpoint_dir.exists()})")
print(f"  Actual: {actual_checkpoint_dir} (exists: {actual_checkpoint_dir.exists()})")

# Determine which checkpoint to use
checkpoint_dir = None
if expected_checkpoint_dir.exists() and any(expected_checkpoint_dir.iterdir()):
    checkpoint_dir = expected_checkpoint_dir
    print(f"✓ Using expected checkpoint location: {checkpoint_dir}")
elif actual_checkpoint_dir.exists() and any(actual_checkpoint_dir.iterdir()):
    checkpoint_dir = actual_checkpoint_dir
    print(f"✓ Using actual checkpoint location: {checkpoint_dir}")
else:
    # Try to restore from Google Drive
    checkpoint_backup = DRIVE_BACKUP_DIR / f"{backbone}_checkpoint"
    
    if checkpoint_backup.exists():
        print(f"Checkpoint not found locally. Restoring from Google Drive...")
        print(f"  From: {checkpoint_backup}")
        # Use expected location for restoration
        checkpoint_dir = expected_checkpoint_dir
        print(f"  To: {checkpoint_dir}")
        
        # Create parent directory if needed
        checkpoint_dir.parent.mkdir(parents=True, exist_ok=True)
        
        # Copy checkpoint from Drive (dirs_exist_ok handles an existing but empty target)
        shutil.copytree(checkpoint_backup, checkpoint_dir, dirs_exist_ok=True)
        print(f"✓ Checkpoint restored from Google Drive: {checkpoint_dir}")
    else:
        raise FileNotFoundError(
            f"Checkpoint directory not found locally or in Google Drive.\n"
            f"  Expected: {expected_checkpoint_dir}\n"
            f"  Actual: {actual_checkpoint_dir}\n"
            f"  Drive backup: {checkpoint_backup}\n"
            f"Please ensure training completed successfully and checkpoint was backed up."
        )

conversion_output_dir = CONVERSION_OUTPUT_DIR / backbone
conversion_output_dir.mkdir(parents=True, exist_ok=True)


In [None]:
conversion_args = [
    sys.executable,
    str(CONVERSION_SCRIPT_PATH),
    "--checkpoint-path",
    str(checkpoint_dir),
    "--config-dir",
    str(CONFIG_DIR),
    "--backbone",
    backbone,
    "--output-dir",
    str(conversion_output_dir),
    "--quantize-int8",
    "--run-smoke-test",
]


In [None]:
conversion_env = os.environ.copy()
conversion_env["AZURE_ML_OUTPUT_onnx_model"] = str(conversion_output_dir)

mlflow_tracking_uri = mlflow.get_tracking_uri()
if mlflow_tracking_uri:
    conversion_env["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri


In [None]:
result = subprocess.run(
    conversion_args,
    cwd=ROOT_DIR,
    env=conversion_env,
    capture_output=False,
    text=True,
)

if result.returncode != 0:
    raise RuntimeError(f"Model conversion failed with return code {result.returncode}")


In [None]:
from shared.json_cache import save_json
import shutil

ONNX_MODEL_FILENAME = "model_int8.onnx"
FALLBACK_ONNX_MODEL_FILENAME = "model.onnx"
CONVERSION_CACHE_FILE = ROOT_DIR / "notebooks" / "conversion_cache.json"

onnx_model_path = conversion_output_dir / ONNX_MODEL_FILENAME
if not onnx_model_path.exists():
    onnx_model_path = conversion_output_dir / FALLBACK_ONNX_MODEL_FILENAME

if not onnx_model_path.exists():
    raise FileNotFoundError(f"ONNX model not found in {conversion_output_dir}")

print(f"✓ Conversion completed. ONNX model: {onnx_model_path}")

save_json(CONVERSION_CACHE_FILE, {
    "onnx_model_path": str(onnx_model_path),
    "backbone": backbone,
    "checkpoint_dir": str(checkpoint_dir),
})

# Backup ONNX model to Google Drive
onnx_backup = DRIVE_BACKUP_DIR / f"{backbone}_model.onnx"
if onnx_model_path.exists():
    shutil.copy2(onnx_model_path, onnx_backup)
    print(f"✓ ONNX model backed up to Google Drive: {onnx_backup}")
else:
    print(f"⚠ Warning: ONNX model not found for backup: {onnx_model_path}")

# Backup conversion cache file to Drive
conversion_cache_backup = DRIVE_BACKUP_DIR / "conversion_cache.json"
shutil.copy2(CONVERSION_CACHE_FILE, conversion_cache_backup)
print(f"✓ Conversion cache backed up to Google Drive: {conversion_cache_backup}")
