# Phase 1: Best Configuration Selection (Local, Google Colab & Kaggle)

This notebook automates the selection of the best model configuration from MLflow
based on metrics and benchmarking results, then performs final training and model conversion.


## Workflow

**Prerequisites**: Run `01_orchestrate_training_colab.ipynb` first to:
- Train models via HPO
- Run benchmarking on best trials (using `evaluation.benchmarking.benchmark_best_trials`)

Then this notebook:

1. **Best Model Selection**: Query MLflow benchmark runs, join to training runs via grouping tags (`code.study_key_hash`, `code.trial_key_hash`), select best using normalized composite scoring
2. **Artifact Acquisition**: Download the best model's checkpoint using fallback strategy (local disk → drive restore → MLflow download)
3. **Final Training**: Optionally retrain with best config on full dataset (if not already final training)
4. **Model Conversion**: Convert the final model to ONNX format using canonical path structure
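
The fallback order in step 2 (local disk, then Drive restore, then MLflow download) can be sketched as a chain of strategies tried in sequence. This is only an illustration under stated assumptions: `acquire_checkpoint` and the three `from_*` functions are hypothetical names, not the repository's API, and the Drive and MLflow sources are replaced by placeholders.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical strategy type: takes the destination directory and returns
# the checkpoint path on success, or None if the source has nothing.
Strategy = Tuple[str, Callable[[Path], Optional[Path]]]


def acquire_checkpoint(dest: Path, strategies: Sequence[Strategy]) -> Tuple[str, Path]:
    """Try each strategy in order; return the name and path of the first success."""
    errors = []
    for name, fetch in strategies:
        try:
            result = fetch(dest)
        except Exception as exc:  # one failing source should not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if result is not None and result.exists():
            return name, result
        errors.append(f"{name}: not available")
    raise FileNotFoundError("No strategy produced a checkpoint: " + "; ".join(errors))


def from_local_disk(dest: Path) -> Optional[Path]:
    # Reuse the checkpoint if it is already on disk.
    return dest if dest.exists() else None


def from_drive(dest: Path) -> Optional[Path]:
    # Placeholder: no Drive is mounted in this sketch.
    return None


def from_mlflow(dest: Path) -> Optional[Path]:
    # Placeholder standing in for a real MLflow artifact download.
    dest.mkdir(parents=True, exist_ok=True)
    return dest


chain = [("local_disk", from_local_disk), ("drive", from_drive), ("mlflow", from_mlflow)]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "checkpoint"
    first_source, _ = acquire_checkpoint(target, chain)   # nothing cached yet
    second_source, _ = acquire_checkpoint(target, chain)  # now found on disk
```

Ordering the strategies from cheapest to most expensive means a re-run of the notebook reuses what is already on disk before touching the network.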


## Important

- This notebook **executes on Local, Google Colab, or Kaggle** (not on Azure ML compute)
- Requires MLflow tracking to be set up (Azure ML workspace or local SQLite)
- All computation happens on the platform's GPU (if available) or CPU
- **Storage & Persistence**:
  - **Local**: Outputs saved to `outputs/` directory in repository root
  - **Google Colab**: Checkpoints are automatically saved to Google Drive for persistence across sessions
  - **Kaggle**: Outputs in `/kaggle/working/` are automatically persisted - no manual backup needed
- The notebook must be **re-runnable end-to-end**
- Uses the dataset path specified in the data config (from `config/data/*.yaml`), typically pointing to a local folder included in the repository
- **Session Management**:
  - **Local**: No session limits, outputs persist in repository
  - **Colab**: Sessions time out after 12-24 hours (depending on Colab plan). Checkpoints are saved to Drive automatically.
  - **Kaggle**: Sessions have time limits based on your plan. All outputs are automatically saved.


## Step 1: Environment Detection

The notebook automatically detects the execution environment (local, Google Colab, or Kaggle) and adapts its behavior accordingly.


In [13]:
# Import environment detection functions
# Bootstrap: Try to import, fallback to minimal detection if repo not cloned yet
import os
from pathlib import Path

try:
    from common.shared.notebook_setup import (
        get_platform_vars,
        ensure_src_in_path,
        detect_notebook_environment,
        setup_notebook_paths,
    )
    
    # Get platform variables and repository root
    PLATFORM_VARS = get_platform_vars()
    REPO_ROOT = ensure_src_in_path()
    
    # Get environment info
    env = detect_notebook_environment()
    PLATFORM = env.platform
    IN_COLAB = env.is_colab
    IN_KAGGLE = env.is_kaggle
    IS_LOCAL = env.is_local
    BASE_DIR = env.base_dir
    BACKUP_ENABLED = env.backup_enabled
    
except ImportError:
    # Bootstrap fallback: Minimal environment detection without imports
    # This allows the notebook to work before repository is cloned
    print("⚠ Repository not cloned yet. Using bootstrap environment detection.")
    print("   Run the 'Repository Setup' cell to clone the repository.")
    
    # Minimal environment detection
    if "COLAB_GPU" in os.environ or "COLAB_TPU" in os.environ:
        PLATFORM = "colab"
        IN_COLAB = True
        IN_KAGGLE = False
        IS_LOCAL = False
        BASE_DIR = Path("/content")
        BACKUP_ENABLED = True
    elif "KAGGLE_KERNEL_RUN_TYPE" in os.environ:
        PLATFORM = "kaggle"
        IN_COLAB = False
        IN_KAGGLE = True
        IS_LOCAL = False
        BASE_DIR = Path("/kaggle/working")
        BACKUP_ENABLED = False
    else:
        PLATFORM = "local"
        IN_COLAB = False
        IN_KAGGLE = False
        IS_LOCAL = True
        BASE_DIR = Path.cwd()
        BACKUP_ENABLED = False
    
    # Create minimal PLATFORM_VARS dict for compatibility
    PLATFORM_VARS = {
        "platform": PLATFORM,
        "is_colab": IN_COLAB,
        "is_kaggle": IN_KAGGLE,
        "is_local": IS_LOCAL,
        "base_dir": BASE_DIR,
        "backup_enabled": BACKUP_ENABLED,
    }
    
    REPO_ROOT = None

print(f"✓ Detected environment: {PLATFORM.upper()}")
print(f"Platform: {PLATFORM}")
print(f"Base directory: {BASE_DIR if BASE_DIR else 'Current working directory'}")
print(f"Backup enabled: {BACKUP_ENABLED}")
if REPO_ROOT is None:
    print("⚠ Repository root not found. Please run the Repository Setup cell.")


✓ Detected environment: LOCAL
Platform: local
Base directory: Current working directory
Backup enabled: False


### Install Required Packages

Install required packages based on the execution environment.


In [14]:
# Install required packages
if IS_LOCAL:
    print("For local environment, please:")
    print("1. Create conda environment: conda env create -f config/environment/conda.yaml")
    print("2. Activate: conda activate resume-ner-training")
    print("3. Restart kernel after activation")
    print("\nIf you've already done this, you can continue to the next cell.")
    print("\nInstalling Azure ML SDK (required for imports)...")
    # Install Azure ML packages even for local (in case conda env not activated)
    %pip install "azure-ai-ml>=1.0.0" --quiet
    %pip install "azure-identity>=1.12.0" --quiet
    %pip install azureml-defaults --quiet
    %pip install azureml-mlflow --quiet
else:
    # Core ML libraries
    %pip install "transformers>=4.35.0,<5.0.0" --quiet
    %pip install "safetensors>=0.4.0" --quiet
    %pip install "datasets>=2.12.0" --quiet

    # ML utilities
    %pip install "numpy>=1.24.0,<2.0.0" --quiet
    %pip install "pandas>=2.0.0" --quiet
    %pip install "scikit-learn>=1.3.0" --quiet

    # Utilities
    %pip install "pyyaml>=6.0" --quiet
    %pip install "tqdm>=4.65.0" --quiet
    %pip install "seqeval>=1.2.2" --quiet
    %pip install "sentencepiece>=0.1.99" --quiet

    # Experiment tracking
    %pip install mlflow --quiet
    %pip install optuna --quiet

    # Azure ML SDK (required for orchestration imports)
    %pip install "azure-ai-ml>=1.0.0" --quiet
    %pip install "azure-identity>=1.12.0" --quiet
    %pip install azureml-defaults --quiet
    %pip install azureml-mlflow --quiet

    # ONNX support
    %pip install onnxruntime --quiet
    %pip install "onnx>=1.16.0" --quiet
    %pip install "onnxscript>=0.1.0" --quiet

    print("✓ All dependencies installed")


For local environment, please:
1. Create conda environment: conda env create -f config/environment/conda.yaml
2. Activate: conda activate resume-ner-training
3. Restart kernel after activation

If you've already done this, you can continue to the next cell.

Installing Azure ML SDK (required for imports)...
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Step 2: Repository Setup

**Note**: Repository setup is only needed for Colab/Kaggle environments. Local environments should already have the repository cloned.


In [15]:
# Repository setup - only needed for Colab/Kaggle
# IS_LOCAL, IN_KAGGLE, and IN_COLAB are set in the environment detection cell (Step 1)
if not IS_LOCAL:
    if IN_KAGGLE:
        !git clone -b hpo_run_time_excl https://github.com/hoanglongvonguyen009/resume-ner-azureml.git /kaggle/working/resume-ner-azureml
    elif IN_COLAB:
        !git clone -b hpo_run_time_excl https://github.com/hoanglongvonguyen009/resume-ner-azureml.git /content/resume-ner-azureml
else:
    print("✓ Local environment detected - detecting repository root...")


✓ Local environment detected - detecting repository root...


### Verify Repository Setup

Verify the repository structure exists:


In [16]:
# Resolve the repository root and set up notebook paths
# IS_LOCAL and BASE_DIR are set in the environment detection cell (Step 1)
import sys
from pathlib import Path

if 'REPO_ROOT' not in globals() or REPO_ROOT is None:
    # Try to get repo root using ensure_src_in_path if available
    if 'ensure_src_in_path' in globals():
        try:
            REPO_ROOT = ensure_src_in_path()
        except (ImportError, ValueError):
            REPO_ROOT = None
    
    # For local environments: manually detect repository root
    if not REPO_ROOT and IS_LOCAL:
        current_dir = Path.cwd()
        # Check current directory first
        if (current_dir / "config").exists() and (current_dir / "src").exists():
            REPO_ROOT = current_dir
        else:
            # Check parent directories (in case notebook is in notebooks/ subdirectory)
            for parent in current_dir.parents:
                if (parent / "config").exists() and (parent / "src").exists():
                    REPO_ROOT = parent
                    break
    
    # Try expected location if not found (for Colab/Kaggle after cloning)
    if not REPO_ROOT and not IS_LOCAL:
        expected_path = BASE_DIR / "resume-ner-azureml"
        if expected_path.exists() and (expected_path / "config").exists() and (expected_path / "src").exists():
            src_dir = expected_path / "src"
            if str(src_dir) not in sys.path:
                sys.path.insert(0, str(src_dir))
            REPO_ROOT = expected_path
    
    # Try to import and use setup functions if repo found
    if REPO_ROOT:
        try:
            # Add src/ to path first so we can import setup_notebook_paths
            src_dir = REPO_ROOT / "src"
            if str(src_dir) not in sys.path:
                sys.path.insert(0, str(src_dir))
            
            from common.shared.notebook_setup import setup_notebook_paths
            paths = setup_notebook_paths(root_dir=REPO_ROOT, add_src_to_path=True)
            ROOT_DIR = paths.root_dir
            CONFIG_DIR = paths.config_dir
            SRC_DIR = paths.src_dir
            print(f"✓ Repository: {ROOT_DIR} (config={CONFIG_DIR.name}, src={SRC_DIR.name})")
            print("✓ Repository structure verified")
        except (ImportError, ValueError) as e:
            print(f"⚠ Could not set up paths: {e}")
            REPO_ROOT = None

if 'REPO_ROOT' not in globals() or REPO_ROOT is None:
    print("⚠ Repository not found. Please run the Repository Setup cell (Step 2) to clone the repository.")
    print("   After cloning, re-run this cell to verify the repository structure.")
    ROOT_DIR = None
    CONFIG_DIR = None
    SRC_DIR = None


## Step 3: Load Configuration

Load experiment configuration and define experiment naming convention.


In [17]:
from infrastructure.config.loader import load_experiment_config
from common.constants import EXPERIMENT_NAME
from common.shared.yaml_utils import load_yaml
from infrastructure.naming.mlflow.tags_registry import load_tags_registry

# Load experiment config
experiment_config = load_experiment_config(CONFIG_DIR, EXPERIMENT_NAME)

# Load best model selection configs
tags_config = load_tags_registry(CONFIG_DIR)
selection_config = load_yaml(CONFIG_DIR / "best_model_selection.yaml")
conversion_config = load_yaml(CONFIG_DIR / "conversion.yaml")
acquisition_config = load_yaml(CONFIG_DIR / "artifact_acquisition.yaml")
benchmark_config = load_yaml(CONFIG_DIR / "benchmark.yaml")

print(f"✓ Loaded configs: experiment={experiment_config.name}, tags, selection, conversion, acquisition, benchmark")

# Define experiment names (discovery happens after MLflow setup in Step 4)
experiment_name = experiment_config.name
benchmark_experiment_name = f"{experiment_name}-benchmark"
training_experiment_name = f"{experiment_name}-training"  # For final training runs
conversion_experiment_name = f"{experiment_name}-conversion"

print(f"✓ Experiment names: benchmark={benchmark_experiment_name}, training={training_experiment_name}, conversion={conversion_experiment_name}")


✓ Loaded configs: experiment=resume_ner_baseline, tags, selection, conversion, acquisition, benchmark
✓ Experiment names: benchmark=resume_ner_baseline-benchmark, training=resume_ner_baseline-training, conversion=resume_ner_baseline-conversion


## Step 4: Setup MLflow

Set up MLflow tracking, falling back to local SQLite tracking if Azure ML is unavailable.
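
The fallback behaviour can be illustrated with a small stand-alone sketch. It assumes nothing about the internals of `setup_mlflow_from_config`; `resolve_tracking_uri` and its parameters are hypothetical names used only to show the pattern of trying a remote workspace first and otherwise returning a local SQLite URI.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def resolve_tracking_uri(
    connect_remote: Optional[Callable[[], str]],
    local_db: Path,
    fallback_to_local: bool = True,
) -> str:
    """Return the remote tracking URI, or a local SQLite URI if the remote is unavailable."""
    if connect_remote is not None:
        try:
            return connect_remote()
        except Exception as exc:
            if not fallback_to_local:
                raise
            print(f"Remote tracking unavailable ({exc}); falling back to SQLite")
    elif not fallback_to_local:
        raise RuntimeError("No remote tracking configured and fallback is disabled")
    # SQLite needs the parent directory to exist before the database is created.
    local_db.parent.mkdir(parents=True, exist_ok=True)
    return f"sqlite:///{local_db.resolve().as_posix()}"


def _offline() -> str:
    # Hypothetical remote connector that always fails, to exercise the fallback.
    raise ConnectionError("workspace unreachable")


with tempfile.TemporaryDirectory() as tmp:
    uri = resolve_tracking_uri(_offline, Path(tmp) / "mlruns" / "mlflow.db")
    remote_uri = resolve_tracking_uri(lambda: "azureml://example", Path(tmp) / "mlflow.db")
```

The returned string is the kind of value that can be passed to `mlflow.set_tracking_uri(...)`.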


In [18]:
# Check if azureml.mlflow is available
try:
    import azureml.mlflow  # noqa: F401
    print("✓ azureml.mlflow is available - Azure ML tracking will be used if configured")
except ImportError:
    print("⚠ azureml.mlflow is not available - will fall back to local SQLite tracking")
    print("  To use Azure ML tracking, install: pip install azureml-mlflow")
    print("  Then restart the kernel and re-run this cell")

from common.shared.mlflow_setup import setup_mlflow_from_config
import mlflow

# Setup MLflow tracking (use training experiment for setup - actual queries use discovered experiments)
tracking_uri = setup_mlflow_from_config(
    experiment_name=training_experiment_name,
    config_dir=CONFIG_DIR,
    fallback_to_local=True,
)

print(f"✓ MLflow tracking URI: {tracking_uri}")
print(f"✓ MLflow experiment: {training_experiment_name}")

# Discover HPO and benchmark experiments from MLflow (after setup)
# NOTE: This cell is the SINGLE SOURCE OF TRUTH for hpo_experiments and benchmark_experiment
# These variables are reused in:
#   - Cell 16 (Step 6: Benchmarking) - uses hpo_experiments and benchmark_experiment
#   - Cell 18 (Step 7: Best Model Selection) - uses hpo_experiments and benchmark_experiment
# Do not rebuild these variables elsewhere - always reference them from this cell.
from mlflow.tracking import MlflowClient
from evaluation.selection.experiment_discovery import discover_all_experiments

mlflow_client = MlflowClient()
experiments = discover_all_experiments(
    experiment_name=experiment_name,
    mlflow_client=mlflow_client,
    create_benchmark_if_missing=False,  # Don't auto-create, let benchmarking step handle it
)

hpo_experiments = experiments["hpo_experiments"]
benchmark_experiment = experiments["benchmark_experiment"]

hpo_backbones = ", ".join(hpo_experiments.keys())
print(f"✓ Experiments: {len(hpo_experiments)} HPO ({hpo_backbones}), benchmark={'found' if benchmark_experiment else 'not found'}, training={training_experiment_name}, conversion={conversion_experiment_name}")


2026-01-18 21:41:44,213 - common.shared.mlflow_setup - INFO - Azure ML enabled in config, attempting to connect...


✓ azureml.mlflow is available - Azure ML tracking will be used if configured


2026-01-18 21:41:44,217 - common.shared.mlflow_setup - INFO - Using Service Principal authentication (from config.env)
2026-01-18 21:41:44,408 - common.shared.mlflow_setup - INFO - Successfully connected to Azure ML workspace: resume-ner-ws
2026-01-18 21:43:45,591 - common.shared.mlflow_setup - INFO - Using Azure ML workspace tracking
2026-01-18 21:43:45,673 - evaluation.selection.experiment_discovery - INFO - Discovered 2 HPO experiment(s) for resume_ner_baseline
2026-01-18 21:43:45,776 - evaluation.selection.experiment_discovery - INFO - Found benchmark experiment: resume_ner_baseline-benchmark


✓ MLflow tracking URI: azureml://germanywestcentral.api.azureml.ms/mlflow/v2.0/subscriptions/50c06ef8-627b-46d5-b779-d07c9b398f75/resourceGroups/resume_ner_2026-01-02-16-47-05/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws
✓ MLflow experiment: resume_ner_baseline-training
✓ Experiments: 2 HPO (distilbert, distilroberta), benchmark=found, training=resume_ner_baseline-training, conversion=resume_ner_baseline-conversion


## Step 5: Drive Backup Setup (Colab Only)

Set up Google Drive backup/restore for Colab environments.


In [19]:
from pathlib import Path

# Fix numpy/pandas compatibility before importing infrastructure modules
try:
    from infrastructure.storage.drive import create_colab_store
except (ValueError, ImportError) as e:
    if "numpy.dtype size changed" in str(e) or "numpy" in str(e).lower():
        print("⚠ Numpy/pandas compatibility issue detected. Fixing...")
        import subprocess
        import sys
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "--force-reinstall", "--no-cache-dir", "numpy>=1.24.0,<2.0.0", "pandas>=2.0.0", "--quiet"])
        print("✓ Numpy/pandas reinstalled. Please restart the kernel and re-run this cell.")
        raise RuntimeError("Please restart kernel after numpy/pandas fix")
    else:
        raise

# Mount Google Drive and create backup store (Colab only - Kaggle doesn't need this)
DRIVE_BACKUP_DIR = None
drive_store = None
restore_from_drive = None

if IN_COLAB:
    drive_store = create_colab_store(ROOT_DIR, CONFIG_DIR)
    if drive_store:
        BACKUP_ENABLED = True
        DRIVE_BACKUP_DIR = drive_store.backup_root
        # Create restore function wrapper
        def restore_from_drive(local_path: Path, is_directory: bool = False) -> bool:
            """Restore file/directory from Drive backup."""
            try:
                expect = "dir" if is_directory else "file"
                result = drive_store.restore(local_path, expect=expect)
                return result.ok
            except Exception as e:
                print(f"⚠ Drive restore failed: {e}")
                return False
        
        # Create backup_to_drive wrapper function (standardized backup pattern)
        def backup_to_drive(source_path: Path, is_directory: bool = False) -> bool:
            """Backup file/directory to Drive using DriveBackupStore."""
            if not BACKUP_ENABLED or drive_store is None:
                return False
            # Map is_directory to expect parameter
            expect = "dir" if is_directory else "file"
            result = drive_store.backup(source_path, expect=expect)
            if result.ok:
                print(f"✓ Backed up: {source_path.name}")
            return result.ok
        
        print("✓ Google Drive mounted")
        print(f"✓ Backup base directory: {DRIVE_BACKUP_DIR}")
        print(f"\nNote: All outputs/ will be mirrored to: {DRIVE_BACKUP_DIR / 'outputs'}")
    else:
        BACKUP_ENABLED = False
        print("⚠ Warning: Could not mount Google Drive. Backup to Google Drive will be disabled.")
elif IN_KAGGLE:
    print("✓ Kaggle environment detected - outputs are automatically persisted (no Drive mount needed)")
    BACKUP_ENABLED = False
else:
    # Local environment
    print("✓ Local environment detected - outputs will be saved to repository (no Drive backup needed)")
    BACKUP_ENABLED = False


✓ Local environment detected - outputs will be saved to repository (no Drive backup needed)


## Step 6: Run Benchmarking on Champions (Optional)

**Optional Step**: If you haven't run benchmarking in `01_orchestrate_training_colab.ipynb`, you can run it here before selecting the best model. This step will:
1. Select champions (best trials) from HPO runs using Phase 2 selection logic
2. Run benchmarking on each champion to measure inference performance
3. Save benchmark results to MLflow for use in Step 7

**Note**: If benchmark runs already exist in MLflow, you can skip this step and proceed directly to Step 7.
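
To make "measure inference performance" concrete, here is a minimal latency-measurement sketch. It is not the repository's benchmarking code: `benchmark_latency`, its parameters, and the metric names are assumptions chosen for illustration, and a trivial callable stands in for the model.

```python
import math
import statistics
import time
from typing import Callable, Dict, Sequence


def benchmark_latency(
    predict: Callable[[Sequence[int]], object],
    batches: Sequence[Sequence[int]],
    warmup: int = 2,
    repeats: int = 5,
) -> Dict[str, float]:
    """Time `predict` over all batches and report mean and p95 latency in ms."""
    if not batches:
        raise ValueError("At least one batch is required")
    # Warm-up calls are excluded from timing (caches, lazy initialisation).
    for batch in batches[:warmup]:
        predict(batch)
    timings_ms = []
    for _ in range(repeats):
        for batch in batches:
            start = time.perf_counter()
            predict(batch)
            timings_ms.append((time.perf_counter() - start) * 1000.0)
    ordered = sorted(timings_ms)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "latency_mean_ms": statistics.fmean(timings_ms),
        "latency_p95_ms": ordered[p95_index],
        "num_calls": float(len(timings_ms)),
    }


# Stand-in "model": summing token ids instead of running a transformer.
stats = benchmark_latency(lambda batch: sum(batch), [[1, 2, 3], [4, 5, 6]], warmup=1, repeats=3)
```

Using `time.perf_counter()` rather than `time.time()` gives a monotonic, high-resolution clock, which matters when individual calls take only a few milliseconds.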


In [20]:
# Optional: Run benchmarking on champions if not already done
# Skip this cell if benchmark runs already exist in MLflow

RUN_BENCHMARKING = True  # Set to False to skip benchmarking

if RUN_BENCHMARKING:
    # Reload module to ensure latest function signature (in case kernel has cached version)
    import importlib
    import evaluation.selection.workflows.benchmarking_workflow
    importlib.reload(evaluation.selection.workflows.benchmarking_workflow)
    
    from evaluation.selection.workflows import run_benchmarking_workflow
    from infrastructure.tracking.mlflow.trackers import MLflowBenchmarkTracker
    from infrastructure.config.loader import load_all_configs
    
    # Load all configs
    configs = load_all_configs(experiment_config)
    data_config = configs.get("data", {})
    hpo_config = configs.get("hpo", {})
    
    # Use benchmark_experiment from the MLflow setup cell in Step 4 (single source of truth) if available
    benchmark_experiment_name = f"{experiment_name}-benchmark"
    if "benchmark_experiment" not in globals() or benchmark_experiment is None:
        from evaluation.selection.experiment_discovery import discover_benchmark_experiment
        benchmark_experiment = discover_benchmark_experiment(
            experiment_name=experiment_name,
            mlflow_client=mlflow_client,
            create_if_missing=True,
        )
    
    # Setup benchmark tracker
    benchmark_tracker = MLflowBenchmarkTracker(benchmark_experiment_name)
    
    # Run benchmarking workflow
    champions_to_benchmark = run_benchmarking_workflow(
        hpo_experiments=hpo_experiments,
        selection_config=selection_config,
        benchmark_config=benchmark_config,
        data_config=data_config,
        hpo_config=hpo_config,
        root_dir=ROOT_DIR,
        config_dir=CONFIG_DIR,
        experiment_name=experiment_name,
        mlflow_client=mlflow_client,
        benchmark_experiment=benchmark_experiment,
        benchmark_tracker=benchmark_tracker,
        backup_enabled=BACKUP_ENABLED,
        backup_to_drive=backup_to_drive if "backup_to_drive" in locals() else None,
        restore_from_drive=restore_from_drive if "restore_from_drive" in locals() else None,
        in_colab=IN_COLAB,
        platform=PLATFORM,
    )
    
    # Store benchmarked champions for checkpoint reuse in Step 7
    # Index by refit_run_id (primary) and (backbone, study_key_hash, trial_key_hash) (fallback)
    BENCHMARKED_CHAMPIONS_BY_REFIT = {}
    BENCHMARKED_CHAMPIONS_BY_KEYS = {}
    
    for backbone, champion_data in champions_to_benchmark.items():
        champion = champion_data.get("champion", {})
        refit_run_id = champion.get("refit_run_id")
        checkpoint_path = champion.get("checkpoint_path")
        
        if refit_run_id and checkpoint_path:
            # Primary index: refit_run_id (most reliable)
            BENCHMARKED_CHAMPIONS_BY_REFIT[refit_run_id] = {
                "checkpoint_path": Path(checkpoint_path),
                "backbone": backbone,
                "champion": champion,
            }
            
            # Fallback index: (backbone, study_key_hash, trial_key_hash)
            study_key_hash = champion.get("study_key_hash")
            trial_key_hash = champion.get("trial_key_hash")
            if study_key_hash and trial_key_hash:
                BENCHMARKED_CHAMPIONS_BY_KEYS[(backbone, study_key_hash, trial_key_hash)] = {
                    "checkpoint_path": Path(checkpoint_path),
                    "refit_run_id": refit_run_id,
                    "champion": champion,
                }
    
    if BENCHMARKED_CHAMPIONS_BY_REFIT:
        print(f"💾 Stored {len(BENCHMARKED_CHAMPIONS_BY_REFIT)} benchmarked champion(s) for checkpoint reuse")
else:
    print("⏭ Skipping benchmarking (RUN_BENCHMARKING=False).")
    print("   If benchmark runs don't exist, set RUN_BENCHMARKING=True or run benchmarking in notebook 01.")
    BENCHMARKED_CHAMPIONS_BY_REFIT = {}
    BENCHMARKED_CHAMPIONS_BY_KEYS = {}

2026-01-18 21:43:45,877 - evaluation.selection.workflows.benchmarking_workflow - INFO - Running benchmarking workflow on champions
2026-01-18 21:43:45,878 - evaluation.selection.workflows.benchmarking_workflow - INFO - Selecting champions for 2 backbone(s)
2026-01-18 21:43:46,570 - evaluation.selection.trial_finder.mlflow_queries - INFO - No runs found with stage='hpo_trial' for distilbert, trying legacy stage='hpo'
2026-01-18 21:43:47,076 - evaluation.selection.trial_finder.mlflow_queries - INFO - Found 38 runs with stage tag for distilbert (backbone=distilbert)
2026-01-18 21:43:47,077 - evaluation.selection.trial_finder.champion_selection - INFO - Filtered out 13 parent run(s) (only child/trial runs have metrics). 25 child runs remaining.
2026-01-18 21:43:47,078 - evaluation.selection.trial_finder.champion_selection - INFO - Grouped runs for distilbert: 0 v1 group(s), 2 v2 group(s)
2026-01-18 21:43:47,087 - evaluation.selection.trial_finder.champion_selection - INFO - Found 2 eligibl

## Step 7: Best Model Selection

Query MLflow benchmark runs (created by `01_orchestrate_training_colab.ipynb` or Step 6 above using `evaluation.benchmarking.benchmark_best_trials`), join to training runs via grouping tags, and select the best model using normalized composite scoring.

**Note**: Benchmark runs must exist in MLflow before running this step. If no benchmark runs are found, either:
- Set `RUN_BENCHMARKING=True` in Step 6 above, or
- Go back to `01_orchestrate_training_colab.ipynb` and run the benchmarking step.
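
The join and the scoring can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: min-max normalization is assumed (the source only says "normalized"), the 0.7/0.3 weights mirror the values printed in the selection log, and the run records and helper names are hypothetical.

```python
from typing import Dict, List

GROUPING_TAGS = ("code.study_key_hash", "code.trial_key_hash")


def join_on_tags(benchmark_runs: List[dict], training_runs: List[dict]) -> List[dict]:
    """Pair each benchmark run with the training run that shares both grouping tags."""
    index = {tuple(run["tags"].get(t) for t in GROUPING_TAGS): run for run in training_runs}
    joined = []
    for bench in benchmark_runs:
        key = tuple(bench["tags"].get(t) for t in GROUPING_TAGS)
        if None not in key and key in index:
            train = index[key]
            joined.append(
                {"run_id": train["run_id"], "f1": train["f1"], "latency_ms": bench["latency_ms"]}
            )
    return joined


def _min_max(values: List[float], higher_is_better: bool) -> List[float]:
    lo, hi = min(values), max(values)
    if hi == lo:  # all candidates tie on this metric
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def select_best(candidates: List[dict], f1_weight: float = 0.7, latency_weight: float = 0.3) -> Dict:
    """Score each candidate by weighted normalized F1 and (inverted) latency."""
    if not candidates:
        raise ValueError("No joined benchmark/training runs to select from")
    f1_scores = _min_max([c["f1"] for c in candidates], higher_is_better=True)
    latency_scores = _min_max([c["latency_ms"] for c in candidates], higher_is_better=False)
    scored = [
        dict(c, composite=f1_weight * f + latency_weight * l)
        for c, f, l in zip(candidates, f1_scores, latency_scores)
    ]
    return max(scored, key=lambda c: c["composite"])


# Hypothetical run records; the hashes and metric values are made up.
training = [
    {"run_id": "train-a", "f1": 0.52, "tags": {"code.study_key_hash": "s1", "code.trial_key_hash": "t1"}},
    {"run_id": "train-b", "f1": 0.50, "tags": {"code.study_key_hash": "s2", "code.trial_key_hash": "t2"}},
]
benchmarks = [
    {"latency_ms": 170.0, "tags": {"code.study_key_hash": "s1", "code.trial_key_hash": "t1"}},
    {"latency_ms": 250.0, "tags": {"code.study_key_hash": "s2", "code.trial_key_hash": "t2"}},
]
best = select_best(join_on_tags(benchmarks, training))
```

Under min-max normalization a candidate that is best on both metrics scores close to 1.0, and a single candidate always does, so a composite score near 1.0 does not by itself indicate a strong model.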


In [21]:
from evaluation.selection.workflows import run_selection_workflow
from training.execution import extract_lineage_from_best_model
from pathlib import Path

# Initialize dictionaries if not already created (e.g., if benchmarking was skipped)
if "BENCHMARKED_CHAMPIONS_BY_REFIT" not in globals():
    BENCHMARKED_CHAMPIONS_BY_REFIT = {}
if "BENCHMARKED_CHAMPIONS_BY_KEYS" not in globals():
    BENCHMARKED_CHAMPIONS_BY_KEYS = {}

# Run selection workflow
best_model, best_checkpoint_dir = run_selection_workflow(
    benchmark_experiment=benchmark_experiment,
    hpo_experiments=hpo_experiments,
    selection_config=selection_config,
    tags_config=tags_config,
    root_dir=ROOT_DIR,
    config_dir=CONFIG_DIR,
    experiment_name=experiment_name,
    acquisition_config=acquisition_config,
    platform=PLATFORM,
    backup_enabled=BACKUP_ENABLED,
    backup_to_drive=backup_to_drive if "backup_to_drive" in locals() else None,
    restore_from_drive=restore_from_drive if "restore_from_drive" in locals() else None,
    in_colab=IN_COLAB,
    benchmarked_champions_by_refit=BENCHMARKED_CHAMPIONS_BY_REFIT,  # guaranteed to exist by the initialization above
    benchmarked_champions_by_keys=BENCHMARKED_CHAMPIONS_BY_KEYS,
)

# Extract lineage from best model for final training
lineage = extract_lineage_from_best_model(best_model)

2026-01-18 21:43:48,693 - evaluation.selection.workflows.selection_workflow - INFO - Best Model Selection Mode: force_new
2026-01-18 21:43:48,695 - evaluation.selection.workflows.selection_workflow - INFO - Mode is 'force_new' - querying MLflow for fresh selection
2026-01-18 21:43:48,697 - evaluation.selection.mlflow_selection - INFO - Finding best model from MLflow
2026-01-18 21:43:48,698 - evaluation.selection.mlflow_selection - INFO -   Benchmark experiment: resume_ner_baseline-benchmark
2026-01-18 21:43:48,699 - evaluation.selection.mlflow_selection - INFO -   HPO experiments: 2
2026-01-18 21:43:48,699 - evaluation.selection.mlflow_selection - INFO -   Objective metric: macro-f1
2026-01-18 21:43:48,700 - evaluation.selection.mlflow_selection - INFO -   Composite weights: F1=0.70, Latency=0.30
2026-01-18 21:43:48,700 - evaluation.selection.mlflow_selection - INFO -   Latency aggregation: latest (from config file, applied when multiple benchmark runs exist with same benchmark_key)
20


✅ Best model selected:
   Run ID: 0c67e62a-a317-4c85-9628-1ee50e3bc07a
   Experiment: resume_ner_baseline-hpo-distilbert
   Backbone: distilbert
   F1 Score: 0.5247
   Latency: 172.19 ms
   Composite Score: 1.0000


2026-01-18 21:43:49,284 - evaluation.selection.artifact_unified.acquisition - INFO - Acquiring artifact from mlflow: run_id=0c67e62a-a31..., artifact_kind=checkpoint, backbone=distilbert
2026-01-18 21:43:49,286 - evaluation.selection.artifact_unified.acquisition - INFO - Downloading artifact from MLflow: run_id=0c67e62a-a31..., destination=/workspaces/resume-ner-azureml/outputs/best_model_selection/local/distilbert/checkpoint_c3659fea_18075ea4
Downloading artifacts: 100%|██████████| 1/1 [00:04<00:00,  4.98s/it]
2026-01-18 21:43:54,898 - evaluation.selection.artifact_unified.acquisition - INFO - Successfully downloaded artifact from MLflow to: /workspaces/resume-ner-azureml/outputs/best_model_selection/local/distilbert/checkpoint_c3659fea_18075ea4/best_trial_checkpoint.tar.gz
2026-01-18 21:43:58,870 - evaluation.selection.artifact_unified.acquisition - INFO - Directory is a valid checkpoint: /workspaces/resume-ner-azureml/outputs/best_model_selection/local/distilbert

In [22]:
# Check if selected run is already final training (skip retraining if so)
stage_tag = tags_config.key("process", "stage")
trained_on_full_data_tag = tags_config.key("training", "trained_on_full_data")

is_final_training = best_model["tags"].get(stage_tag) == "final_training"
used_full_data = (
    best_model["tags"].get(trained_on_full_data_tag) == "true" or
    best_model["params"].get("use_combined_data", "false").lower() == "true"
)

SKIP_FINAL_TRAINING = is_final_training and used_full_data

if SKIP_FINAL_TRAINING:
    final_checkpoint_dir = best_checkpoint_dir


## Step 8: Final Training

Run final training with best configuration if needed.


In [23]:
if not SKIP_FINAL_TRAINING:
    print("🔄 Starting final training with best configuration...")
    from training.execution import run_final_training_workflow
    # Execute final training (uses final_training.yaml via load_final_training_config)
    # Will automatically reuse existing complete runs if run.mode: reuse_if_exists in final_training.yaml
    # Backup to Drive is handled automatically by run_final_training_workflow() using standardized backup pattern
    final_checkpoint_dir = run_final_training_workflow(
        root_dir=ROOT_DIR,
        config_dir=CONFIG_DIR,
        best_model=best_model,
        experiment_config=experiment_config,
        lineage=lineage,
        training_experiment_name=training_experiment_name,
        platform=PLATFORM,
        backup_enabled=BACKUP_ENABLED,
        backup_to_drive=backup_to_drive if "backup_to_drive" in locals() else None,
        restore_from_drive=restore_from_drive if "restore_from_drive" in locals() else None,
        in_colab=IN_COLAB,
    )
else:
    print("✓ Skipping final training - using selected checkpoint")

2026-01-18 21:44:01,597 - training.execution.executor - INFO - Final training config loaded from final_training.yaml
2026-01-18 21:44:01,597 - training.execution.executor - INFO - Output directory: /workspaces/resume-ner-azureml/outputs/final_training/local/distilbert/spec-1e6acb58_exec-02136b6b/v1
2026-01-18 21:44:01,653 - infrastructure.naming.mlflow.config - INFO - [Auto-Increment Config] Loading from config_dir=/workspaces/resume-ner-azureml/config, raw_auto_inc_config={'enabled': True, 'processes': {'hpo': True, 'benchmarking': True}, 'format': '{base}.{version}'}
2026-01-18 21:44:01,653 - infrastructure.naming.mlflow.config - INFO - [Auto-Increment Config] Validated config: {'enabled': True, 'processes': {'hpo': True, 'benchmarking': True}, 'format': '{base}.{version}'}, process_type=final_training


🔄 Starting final training with best configuration...


2026-01-18 21:44:01,826 - training.execution.mlflow_setup - INFO - üèÉ View run local_distilbert_final_training_spec-1e6acb58_exec-02136b6b_v1 at: https://germanywestcentral.api.azureml.ms/mlflow/v2.0/subscriptions/50c06ef8-627b-46d5-b779-d07c9b398f75/resourceGroups/resume_ner_2026-01-02-16-47-05/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws/#/experiments/801daa4d-3a56-4952-a374-cf2c5a9c2846/runs/f3e57780-66ba-4bfc-beab-4f0516a65a8a
2026-01-18 21:44:01,826 - training.execution.mlflow_setup - INFO - Created MLflow run: local_distilbert_final_training_spec-1e6acb58_exec-02136b6b_v1 (f3e57780-66b...)
2026-01-18 21:44:01,827 - training.execution.executor - INFO - Created MLflow run: local_distilbert_final_training_spec-1e6acb58_exec-02136b6b_v1 (f3e57780-66b...)
2026-01-18 21:44:01,878 - training.execution.executor - INFO - Running final training...
2026-01-18 21:44:41,674 - training.execution.subprocess_runner - INFO - 🏃 View run local_distilbert_final_training_

## Step 9: Model Conversion & Optimization

Convert the final trained model to ONNX format with optimization.
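
The output directories logged in this step follow a regular layout: `outputs/conversion/<platform>/<backbone>/spec-<8 hex>_exec-<8 hex>/v<N>/conv-<8 hex>`. As a minimal sketch, assuming that layout holds in general (it is inferred only from this run's log lines, not from the project's naming code), a directory can be split back into its parts. `ConversionPath` and `parse_conversion_dir` are hypothetical names introduced for illustration.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class ConversionPath(NamedTuple):
    """Components of a conversion output directory (hypothetical helper type)."""
    platform: str
    backbone: str
    spec_fp8: str   # first 8 hex characters of the spec fingerprint
    exec_fp8: str   # first 8 hex characters of the exec fingerprint
    version: int
    conv_hash: str


# ASSUMPTION: pattern inferred from the logged output directory only.
# The project's own naming module is the source of truth and may differ.
_CONVERSION_DIR = re.compile(
    r"outputs/conversion/(?P<platform>[^/]+)/(?P<backbone>[^/]+)/"
    r"spec-(?P<spec>[0-9a-f]{8})_exec-(?P<exec>[0-9a-f]{8})/"
    r"v(?P<version>\d+)/conv-(?P<conv>[0-9a-f]{8})$"
)


def parse_conversion_dir(path: str) -> Optional[ConversionPath]:
    """Split a conversion output directory into its parts, or return None."""
    # as_posix() drops any trailing slash so the `$` anchor still matches.
    normalized = PurePosixPath(path).as_posix()
    match = _CONVERSION_DIR.search(normalized)
    if match is None:
        return None
    return ConversionPath(
        platform=match["platform"],
        backbone=match["backbone"],
        spec_fp8=match["spec"],
        exec_fp8=match["exec"],
        version=int(match["version"]),
        conv_hash=match["conv"],
    )
```

A helper like this can be handy for sanity-checking that a conversion directory belongs to the expected parent training run, for example by comparing `spec_fp8` and `exec_fp8` against the first eight characters of the fingerprints read from `metadata.json`.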

In [24]:
# Extract parent training information for conversion
from common.shared.json_cache import load_json
from pathlib import Path

# Load metadata from final training output directory
final_training_metadata_path = final_checkpoint_dir.parent / "metadata.json"

if not final_training_metadata_path.exists():
    raise FileNotFoundError(
        f"Metadata file not found: {final_training_metadata_path}\n"
        "Please ensure final training completed successfully."
    )

metadata = load_json(final_training_metadata_path)
parent_spec_fp = metadata.get("spec_fp")
parent_exec_fp = metadata.get("exec_fp")
parent_training_run_id = metadata.get("mlflow", {}).get("run_id")

if not parent_spec_fp or not parent_exec_fp:
    raise ValueError(
        f"Missing required fingerprints in metadata: spec_fp={parent_spec_fp}, exec_fp={parent_exec_fp}\n"
        "Please ensure final training completed successfully."
    )

if parent_training_run_id:
    print(f"✓ Parent training: spec_fp={parent_spec_fp[:8]}..., exec_fp={parent_exec_fp[:8]}..., run_id={parent_training_run_id[:12]}...")
else:
    print(f"✓ Parent training: spec_fp={parent_spec_fp[:8]}..., exec_fp={parent_exec_fp[:8]}... (run_id not found)")

# Get parent training output directory (checkpoint parent)
parent_training_output_dir = final_checkpoint_dir.parent

print("\n🔄 Starting model conversion...")
from deployment.conversion import run_conversion_workflow

# Execute conversion (uses conversion.yaml via load_conversion_config)
# Backup to Drive is handled automatically by run_conversion_workflow() using standardized backup pattern
conversion_output_dir = run_conversion_workflow(
    root_dir=ROOT_DIR,
    config_dir=CONFIG_DIR,
    parent_training_output_dir=parent_training_output_dir,
    parent_spec_fp=parent_spec_fp,
    parent_exec_fp=parent_exec_fp,
    experiment_config=experiment_config,
    conversion_experiment_name=conversion_experiment_name,
    platform=PLATFORM,
    parent_training_run_id=parent_training_run_id,  # May be None, that's OK
    backup_enabled=BACKUP_ENABLED,
    backup_to_drive=backup_to_drive if "backup_to_drive" in locals() else None,
    restore_from_drive=restore_from_drive if "restore_from_drive" in locals() else None,
    in_colab=IN_COLAB,
)

# Find ONNX model file (search recursively, as model may be in onnx_model/ subdirectory)
onnx_files = list(conversion_output_dir.rglob("*.onnx"))
if onnx_files:
    onnx_model_path = onnx_files[0]
    print("\n✓ Conversion completed successfully!")
    print(f"  ONNX model: {onnx_model_path}")
    print(f"  Model size: {onnx_model_path.stat().st_size / (1024 * 1024):.2f} MB")
else:
    print(f"\n⚠ Warning: No ONNX model file found in {conversion_output_dir} (searched recursively)")


2026-01-18 21:44:41,798 - deployment.conversion.orchestration - INFO - Output directory: /workspaces/resume-ner-azureml/outputs/conversion/local/distilbert/spec-1e6acb58_exec-02136b6b/v1/conv-3a36b9a8
2026-01-18 21:44:41,861 - infrastructure.naming.mlflow.config - INFO - [Auto-Increment Config] Loading from config_dir=/workspaces/resume-ner-azureml/config, raw_auto_inc_config={'enabled': True, 'processes': {'hpo': True, 'benchmarking': True}, 'format': '{base}.{version}'}
2026-01-18 21:44:41,862 - infrastructure.naming.mlflow.config - INFO - [Auto-Increment Config] Validated config: {'enabled': True, 'processes': {'hpo': True, 'benchmarking': True}, 'format': '{base}.{version}'}, process_type=conversion


✓ Parent training: spec_fp=1e6acb58..., exec_fp=02136b6b..., run_id=f3e57780-66b...

🔄 Starting model conversion...


2026-01-18 21:44:42,025 - deployment.conversion.orchestration - INFO - Created MLflow run: local_distilbert_conversion_spec-1e6acb58_exec-02136b6b_v1_conv-3a36b9a8 (71a5e2b8-7a5...)
2026-01-18 21:44:42,025 - deployment.conversion.orchestration - INFO - Running conversion: /opt/conda/envs/resume-ner-training/bin/python -m deployment.conversion.execution --checkpoint-path /workspaces/resume-ner-azureml/outputs/final_training/local/distilbert/spec-1e6acb58_exec-02136b6b/v1/checkpoint --config-dir /workspaces/resume-ner-azureml/config --backbone distilbert --output-dir /workspaces/resume-ner-azureml/outputs/conversion/local/distilbert/spec-1e6acb58_exec-02136b6b/v1/conv-3a36b9a8 --opset-version 18 --run-smoke-test
2026-01-18 21:45:03,839 - deployment.conversion.orchestration - INFO - 🏃 View run local_distilbert_conversion_spec-1e6acb58_exec-02136b6b_v1_conv-3a36b9a8 at: https://germanywestcentral.api.azureml.ms/mlflow/v2.0/subscriptions/50c06ef8-627b-46d5-b779-d07c9b398f75/resourceGroup


✓ Conversion completed successfully!
  ONNX model: /workspaces/resume-ner-azureml/outputs/conversion/local/distilbert/spec-1e6acb58_exec-02136b6b/v1/conv-3a36b9a8/onnx_model/model.onnx
  Model size: 253.29 MB
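
The cell above reports a single `.onnx` file found by a recursive search. If a conversion directory ever holds more than one such file, it may help to make the choice explicit and repeatable. The following is a minimal sketch under that assumption; `pick_onnx_model` and its `preferred_name` parameter are hypothetical and not part of `deployment.conversion`.

```python
from pathlib import Path
from typing import Optional


def pick_onnx_model(output_dir: Path, preferred_name: str = "model.onnx") -> Optional[Path]:
    """Return one .onnx file under output_dir, chosen deterministically.

    Hypothetical helper: prefers a file named `preferred_name`, otherwise
    falls back to the lexicographically first match. Returns None if no
    .onnx file exists.
    """
    # Sorting fixes the order, so repeated runs select the same file.
    candidates = sorted(p for p in output_dir.rglob("*.onnx") if p.is_file())
    if not candidates:
        return None
    for candidate in candidates:
        if candidate.name == preferred_name:
            return candidate
    return candidates[0]


def size_in_mib(path: Path) -> float:
    """File size in mebibytes, matching the 1024 * 1024 divisor used above."""
    return path.stat().st_size / (1024 * 1024)
```

In the notebook this could be called as `pick_onnx_model(conversion_output_dir)`, keeping the existing warning branch for the `None` case.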
