# Phase 1: Best Configuration Selection (Local, Google Colab & Kaggle)

This notebook automates the selection of the best model configuration from MLflow
based on metrics and benchmarking results, then performs final training and model conversion.


## Workflow

1. **Best Model Selection**: Query MLflow benchmark runs, join to training runs via grouping tags (`code.study_key_hash`, `code.trial_key_hash`), select best using normalized composite scoring
2. **Artifact Acquisition**: Download the best model's checkpoint using fallback strategy (local disk ‚Üí drive restore ‚Üí MLflow download)
3. **Final Training**: Optionally retrain with best config on full dataset (if not already final training)
4. **Model Conversion**: Convert the final model to ONNX format using canonical path structure


## Important

- This notebook **executes on Local, Google Colab, or Kaggle** (not on Azure ML compute)
- Requires MLflow tracking to be set up (Azure ML workspace or local SQLite)
- All computation happens on the platform's GPU (if available) or CPU
- **Storage & Persistence**:
  - **Local**: Outputs saved to `outputs/` directory in repository root
  - **Google Colab**: Checkpoints are automatically saved to Google Drive for persistence across sessions
  - **Kaggle**: Outputs in `/kaggle/working/` are automatically persisted - no manual backup needed
- The notebook must be **re-runnable end-to-end**
- Uses the dataset path specified in the data config (from `config/data/*.yaml`), typically pointing to a local folder included in the repository
- **Session Management**:
  - **Local**: No session limits, outputs persist in repository
  - **Colab**: Sessions timeout after 12-24 hours (depending on Colab plan). Checkpoints are saved to Drive automatically.
  - **Kaggle**: Sessions have time limits based on your plan. All outputs are automatically saved.


## Experiment Naming Convention

- HPO experiment: `{experiment_name}-hpo-{backbone}` (contains parent runs with `code.stage="hpo"` and refit runs with `code.stage="hpo_refit"`)
- Benchmark experiment: `{experiment_name}-benchmark` (single experiment, no backbone suffix)
- Training experiment: `{experiment_name}-training` (exists but currently unused - refit runs are in HPO experiments)
- Conversion experiment: `{experiment_name}-conversion`


## Step 1: Environment Detection

The notebook automatically detects the execution environment (local, Google Colab, or Kaggle) and adapts its behavior accordingly.


In [1]:
import os
from pathlib import Path


def validate_path_before_mkdir(path: Path, context: str = "directory") -> Path:
    """
    Validate path before creating directory to prevent creating invalid files.

    Args:
        path: Path to validate
        context: Context string for error messages

    Returns:
        Validated and resolved path

    Raises:
        ValueError: If path is invalid
    """
    if not path or not str(path):
        raise ValueError(f"Invalid {context} path: {path}")

    # Ensure path is absolute
    if not path.is_absolute():
        path = path.resolve()

    path_str = str(path)

    # Basic invalid cases
    if path_str in ("", ".", ".."):
        raise ValueError(
            f"Invalid {context} path (too short or relative): {path_str}"
        )

    # Split path
    path_parts = path_str.replace("\\", "/").split("/")

    # Check if last part looks like a version number (e.g. "1.0.0")
    import re

    if path_parts:
        last_part = path_parts[-1]

        if re.match(r"^[\d\.]+$", last_part):
            # Reject single-part paths like "1.0.0"
            if len(path_parts) == 1:
                raise ValueError(
                    f"Invalid {context} path (looks like version number): {path_str}"
                )

    # Validate path has reasonable structure
    if len(path_parts) < 2:
        raise ValueError(
            f"Invalid {context} path (too short, appears to be filename): {path_str}"
        )

    # Safety: path exists but is a file
    if path.exists() and path.is_file():
        import logging
        logger = logging.getLogger(__name__)
        logger.error(f"Path exists as file, not directory: {path}")
        raise ValueError(
            f"Cannot create {context}, path exists as file: {path}"
        )

    return path


# Detect execution environment
IN_COLAB = "COLAB_GPU" in os.environ or "COLAB_TPU" in os.environ
IN_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
IS_LOCAL = not IN_COLAB and not IN_KAGGLE

# Set platform-specific constants
if IN_COLAB:
    PLATFORM = "colab"
    BASE_DIR = Path("/content")
    BACKUP_ENABLED = True
elif IN_KAGGLE:
    PLATFORM = "kaggle"
    BASE_DIR = Path("/kaggle/working")
    BACKUP_ENABLED = False
else:
    PLATFORM = "local"
    BASE_DIR = None
    BACKUP_ENABLED = False

print(f"‚úì Detected environment: {PLATFORM.upper()}")
print(f"Platform: {PLATFORM}")
print(
    f"Base directory: {BASE_DIR if BASE_DIR else 'Current working directory'}")
print(f"Backup enabled: {BACKUP_ENABLED}")


‚úì Detected environment: LOCAL
Platform: local
Base directory: Current working directory
Backup enabled: False


## Step 2: Repository Setup

**Note**: Repository setup is only needed for Colab/Kaggle environments. Local environments should already have the repository cloned.


In [2]:
# Repository setup - only needed for Colab/Kaggle
if not IS_LOCAL:
    if IN_KAGGLE:
        # For Kaggle
        !git clone -b gg_hpo https://github.com/longdang193/resume-ner-azureml.git /kaggle/working/resume-ner-azureml
    elif IN_COLAB:
        # For Google Colab
        !git clone -b gg_hpo https://github.com/longdang193/resume-ner-azureml.git /content/resume-ner-azureml
else:
    print("‚úì Local environment detected - assuming repository already exists")

# Set up paths
if not IS_LOCAL:
    ROOT_DIR = BASE_DIR / "resume-ner-azureml"
else:
    # For local, try to find repo root
    ROOT_DIR = Path.cwd()
    # Try to find the repo root by looking for config directory
    while ROOT_DIR.parent != ROOT_DIR:
        if (ROOT_DIR / "config").exists() and (ROOT_DIR / "src").exists():
            break
        ROOT_DIR = ROOT_DIR.parent

CONFIG_DIR = ROOT_DIR / "config"
SRC_DIR = ROOT_DIR / "src"

# Add src to path
import sys
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

print(f"‚úì Repository root: {ROOT_DIR}")
print(f"‚úì Config directory: {CONFIG_DIR}")
print(f"‚úì Source directory: {SRC_DIR}")

# Verify repository structure
required_dirs = [CONFIG_DIR, SRC_DIR]
for dir_path in required_dirs:
    if not dir_path.exists():
        raise ValueError(f"Required directory not found: {dir_path}")
print("‚úì Repository structure verified")


‚úì Local environment detected - assuming repository already exists
‚úì Repository root: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml
‚úì Config directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config
‚úì Source directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\src
‚úì Repository structure verified


## Step 3: Load Configuration

Load experiment configuration and define experiment naming convention.


In [3]:
from orchestration.config_loader import load_experiment_config
from orchestration import EXPERIMENT_NAME
from shared.yaml_utils import load_yaml

# Load experiment config
experiment_config = load_experiment_config(CONFIG_DIR, EXPERIMENT_NAME)
print(f"‚úì Loaded experiment config: {experiment_config.name}")

# Load best model selection configs
tags_config = load_yaml(CONFIG_DIR / "tags.yaml")
selection_config = load_yaml(CONFIG_DIR / "best_model_selection.yaml")
conversion_config = load_yaml(CONFIG_DIR / "conversion.yaml")
acquisition_config = load_yaml(CONFIG_DIR / "artifact_acquisition.yaml")

print(f"‚úì Loaded tags config")
print(f"‚úì Loaded best model selection config")
print(f"‚úì Loaded conversion config")
print(f"‚úì Loaded artifact acquisition config")

# Define experiment names (discovery happens after MLflow setup in Cell 4)
experiment_name = experiment_config.name
benchmark_experiment_name = f"{experiment_name}-benchmark"
training_experiment_name = f"{experiment_name}-training"  # For final training runs
conversion_experiment_name = f"{experiment_name}-conversion"

print(f"\n‚úì Experiment names defined:")
print(f"  - Benchmark: {benchmark_experiment_name}")
print(f"  - Training: {training_experiment_name}")
print(f"  - Conversion: {conversion_experiment_name}")
print(f"\nNote: Experiment discovery will happen after MLflow setup (Cell 4)")

  from .autonotebook import tqdm as notebook_tqdm


‚úì Loaded experiment config: resume_ner_baseline
‚úì Loaded tags config
‚úì Loaded best model selection config
‚úì Loaded conversion config
‚úì Loaded artifact acquisition config

‚úì Experiment names defined:
  - Benchmark: resume_ner_baseline-benchmark
  - Training: resume_ner_baseline-training
  - Conversion: resume_ner_baseline-conversion

Note: Experiment discovery will happen after MLflow setup (Cell 4)


## Step 4: Setup MLflow

Setup MLflow tracking with fallback to local if Azure ML is unavailable.


In [4]:
from shared.mlflow_setup import setup_mlflow_from_config
import mlflow

# Setup MLflow tracking (use training experiment for setup - actual queries use discovered experiments)
tracking_uri = setup_mlflow_from_config(
    experiment_name=training_experiment_name,
    config_dir=CONFIG_DIR,
    fallback_to_local=True,
)

print(f"‚úì MLflow tracking URI: {tracking_uri}")
print(f"‚úì MLflow experiment: {training_experiment_name}")

# Discover HPO and benchmark experiments from MLflow (after setup)
from mlflow.tracking import MlflowClient

client = MlflowClient()
all_experiments = client.search_experiments()

# Find HPO experiments (format: {experiment_name}-hpo-{backbone})
hpo_experiments = {}
for exp in all_experiments:
    if exp.name.startswith(f"{experiment_name}-hpo-"):
        backbone = exp.name.replace(f"{experiment_name}-hpo-", "")
        hpo_experiments[backbone] = {
            "name": exp.name,
            "id": exp.experiment_id
        }

# Find benchmark experiment
benchmark_experiment = None
for exp in all_experiments:
    if exp.name == benchmark_experiment_name:
        benchmark_experiment = {
            "name": exp.name,
            "id": exp.experiment_id
        }
        break

print(f"\n‚úì Found {len(hpo_experiments)} HPO experiment(s):")
for backbone, exp_info in hpo_experiments.items():
    print(f"  - {exp_info['name']} (backbone: {backbone})")

if benchmark_experiment:
    print(f"‚úì Benchmark experiment: {benchmark_experiment['name']}")
else:
    print(f"‚ö† Benchmark experiment not found: {benchmark_experiment_name}")

print(f"‚úì Training experiment: {training_experiment_name} (for final training runs)")
print(f"‚úì Conversion experiment: {conversion_experiment_name}")


2026-01-04 17:19:45,853 - shared.mlflow_setup - INFO - Azure ML enabled in config, attempting to connect...
2026-01-04 17:19:48,689 - shared.mlflow_setup - INFO - Attempting to load credentials from config.env at: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config.env
2026-01-04 17:19:48,691 - shared.mlflow_setup - INFO - Loading credentials from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config.env
2026-01-04 17:19:48,692 - shared.mlflow_setup - INFO - Loaded subscription/resource group from config.env
2026-01-04 17:19:48,693 - shared.mlflow_setup - INFO - Loaded service principal credentials from config.env
2026-01-04 17:19:48,695 - shared.mlflow_setup - INFO - Using Service Principal authentication (from config.env)
Class DeploymentTemplateOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
2026-01-04 17:19:53,862 - shared.mlflow_setup - INFO - Successfully connected to Azure 

‚úì MLflow tracking URI: azureml://germanywestcentral.api.azureml.ms/mlflow/v2.0/subscriptions/50c06ef8-627b-46d5-b779-d07c9b398f75/resourceGroups/resume_ner_2026-01-02-16-47-05/providers/Microsoft.MachineLearningServices/workspaces/resume-ner-ws
‚úì MLflow experiment: resume_ner_baseline-training

‚úì Found 2 HPO experiment(s):
  - resume_ner_baseline-hpo-distilbert (backbone: distilbert)
  - resume_ner_baseline-hpo-distilroberta (backbone: distilroberta)
‚úì Benchmark experiment: resume_ner_baseline-benchmark
‚úì Training experiment: resume_ner_baseline-training (for final training runs)
‚úì Conversion experiment: resume_ner_baseline-conversion


## Step 5: Drive Backup Setup (Colab Only)

Setup Google Drive backup/restore for Colab environments.


In [5]:
from pathlib import Path

# Fix numpy/pandas compatibility before importing orchestration modules
try:
    from orchestration.drive_backup import create_colab_store
except (ValueError, ImportError) as e:
    if "numpy.dtype size changed" in str(e) or "numpy" in str(e).lower():
        print("‚ö† Numpy/pandas compatibility issue detected. Fixing...")
        import subprocess
        import sys
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "--force-reinstall", "--no-cache-dir", "numpy>=1.24.0,<2.0.0", "pandas>=2.0.0", "--quiet"])
        print("‚úì Numpy/pandas reinstalled. Please restart the kernel and re-run this cell.")
        raise RuntimeError("Please restart kernel after numpy/pandas fix")
    else:
        raise

# Mount Google Drive and create backup store (Colab only - Kaggle doesn't need this)
DRIVE_BACKUP_DIR = None
drive_store = None
restore_from_drive = None

if IN_COLAB:
    drive_store = create_colab_store(ROOT_DIR, CONFIG_DIR)
    if drive_store:
        BACKUP_ENABLED = True
        DRIVE_BACKUP_DIR = drive_store.backup_root
        # Create restore function wrapper
        def restore_from_drive(local_path: Path, is_directory: bool = False) -> bool:
            """Restore file/directory from Drive backup."""
            try:
                return drive_store.restore(local_path, is_directory=is_directory)
            except Exception as e:
                print(f"‚ö† Drive restore failed: {e}")
                return False
        print(f"‚úì Google Drive mounted")
        print(f"‚úì Backup base directory: {DRIVE_BACKUP_DIR}")
        print(f"\nNote: All outputs/ will be mirrored to: {DRIVE_BACKUP_DIR / 'outputs'}")
    else:
        BACKUP_ENABLED = False
        print("‚ö† Warning: Could not mount Google Drive. Backup to Google Drive will be disabled.")
elif IN_KAGGLE:
    print("‚úì Kaggle environment detected - outputs are automatically persisted (no Drive mount needed)")
    BACKUP_ENABLED = False
else:
    # Local environment
    print("‚úì Local environment detected - outputs will be saved to repository (no Drive backup needed)")
    BACKUP_ENABLED = False


‚úì Local environment detected - outputs will be saved to repository (no Drive backup needed)


## Step 6: Best Model Selection

Query MLflow benchmark runs, join to training runs via grouping tags, and select the best model using normalized composite scoring.


In [6]:
from orchestration.jobs.selection.mlflow_selection import find_best_model_from_mlflow
from orchestration.jobs.selection.artifact_acquisition import (
    acquire_best_model_checkpoint as acquire_checkpoint_improved,
)
from pathlib import Path
from typing import Optional, Callable, Dict, Any


def acquire_best_model_checkpoint(
    best_run_info: Dict[str, Any],
    root_dir: Path,
    config_dir: Path,
    restore_from_drive: Optional[Callable[[Path, bool], bool]] = None,
) -> Path:
    """Wrapper for artifact acquisition function."""
    return acquire_checkpoint_improved(
        best_run_info=best_run_info,
        root_dir=root_dir,
        config_dir=config_dir,
        acquisition_config=acquisition_config,
        selection_config=selection_config,
        platform=PLATFORM,
        restore_from_drive=restore_from_drive,
        in_colab=IN_COLAB,
    )


# Validate experiments
if benchmark_experiment is None:
    raise ValueError(f"Benchmark experiment '{benchmark_experiment_name}' not found. Run benchmark jobs first.")

if not hpo_experiments:
    raise ValueError(f"No HPO experiments found. Run HPO jobs first.")

# Find best model
best_model = find_best_model_from_mlflow(
    benchmark_experiment=benchmark_experiment,
    hpo_experiments=hpo_experiments,
    tags_config=tags_config,
    selection_config=selection_config,
    use_python_filtering=True,
)

if best_model is None:
    raise ValueError("Could not find best model from MLflow.")

# Acquire checkpoint
OUTPUT_DIR = ROOT_DIR / "outputs" / "best_model_selection"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

best_checkpoint_dir = acquire_best_model_checkpoint(
    best_run_info=best_model,
    root_dir=ROOT_DIR,
    config_dir=CONFIG_DIR,
    restore_from_drive=restore_from_drive if "restore_from_drive" in locals() else None,
)

print(f"\n‚úì Best model checkpoint available at: {best_checkpoint_dir}")

2026-01-04 17:19:57,794 - orchestration.jobs.selection.mlflow_selection - INFO - Finding best model from MLflow
2026-01-04 17:19:57,795 - orchestration.jobs.selection.mlflow_selection - INFO -   Benchmark experiment: resume_ner_baseline-benchmark
2026-01-04 17:19:57,796 - orchestration.jobs.selection.mlflow_selection - INFO -   HPO experiments: 2
2026-01-04 17:19:57,796 - orchestration.jobs.selection.mlflow_selection - INFO -   Objective metric: macro-f1
2026-01-04 17:19:57,797 - orchestration.jobs.selection.mlflow_selection - INFO -   Composite weights: F1=0.70, Latency=0.30
2026-01-04 17:19:57,797 - orchestration.jobs.selection.mlflow_selection - INFO - Querying benchmark runs...


üîç Finding best model from MLflow...
   Benchmark experiment: resume_ner_baseline-benchmark
   HPO experiments: 2
   Objective metric: macro-f1
   Composite weights: F1=0.70, Latency=0.30

üìä Querying benchmark runs...


2026-01-04 17:19:58,239 - orchestration.jobs.selection.mlflow_selection - INFO - Found 5 finished benchmark runs
2026-01-04 17:19:58,240 - orchestration.jobs.selection.mlflow_selection - INFO - Found 5 benchmark runs with required metrics and grouping tags
2026-01-04 17:19:58,241 - orchestration.jobs.selection.mlflow_selection - INFO - Preloading trial and refit runs from HPO experiments...


   Found 5 benchmark runs with required metrics and grouping tags

üîó Preloading trial runs (metrics) and refit runs (artifacts) from HPO experiments...
   resume_ner_baseline-hpo-distilbert: 45 finished runs, 18 trial runs, 9 refit runs


2026-01-04 17:19:58,785 - orchestration.jobs.selection.mlflow_selection - INFO - Built trial lookup with 10 unique (study_hash, trial_hash) pairs
2026-01-04 17:19:58,786 - orchestration.jobs.selection.mlflow_selection - INFO - Built refit lookup with 10 unique (study_hash, trial_hash) pairs
2026-01-04 17:19:58,786 - orchestration.jobs.selection.mlflow_selection - INFO - Joining benchmark runs with trial runs and refit runs...
2026-01-04 17:19:58,786 - orchestration.jobs.selection.mlflow_selection - INFO - Found 5 candidate(s) with both benchmark and training metrics
2026-01-04 17:19:58,787 - orchestration.jobs.selection.mlflow_selection - INFO - Computing composite scores...
2026-01-04 17:19:58,787 - orchestration.jobs.selection.mlflow_selection - INFO - Best model selected:
2026-01-04 17:19:58,788 - orchestration.jobs.selection.mlflow_selection - INFO -   Artifact Run ID: b0b34e9c-e0de-4d9c-8712-957c66d2781c
2026-01-04 17:19:58,788 - orchestration.jobs.selection.mlflow_selection - INF

   resume_ner_baseline-hpo-distilroberta: 5 finished runs, 2 trial runs, 1 refit runs
   Built trial lookup with 10 unique (study_hash, trial_hash) pairs
   Built refit lookup with 10 unique (study_hash, trial_hash) pairs

üîó Joining benchmark runs with trial runs (metrics) and refit runs (artifacts)...
   Found 5 candidate(s) with both benchmark and training metrics

‚úÖ Best model selected:
   Run ID: b0b34e9c-e0de-4d9c-8712-957c66d2781c
   Experiment: resume_ner_baseline-hpo-distilroberta
   Backbone: distilroberta
   F1 Score: 0.4422
   Latency: 4.15 ms
   Composite Score: 0.9206
[ACQUIRE] Acquiring checkpoint for run b0b34e9c...

[Strategy 1] Local disk selection...

[Strategy 3] MLflow download...


Downloading artifacts: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [04:09<00:00, 249.16s/it]


   [OK] Downloaded checkpoint from MLflow: "c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\best_model_checkpoint\best_trial_checkpoint\best_trial_checkpoint"

‚úì Best model checkpoint available at: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\best_model_checkpoint\best_trial_checkpoint\best_trial_checkpoint


In [None]:
# Check if selected run is final training using standardized tag (from config)
stage_tag = tags_config["process"]["stage"]
trained_on_full_data_tag = tags_config["training"]["trained_on_full_data"]

is_final_training = best_model["tags"].get(stage_tag) == "final_training"

# Check if used full data using standardized tag
used_full_data = (
    best_model["tags"].get(trained_on_full_data_tag) == "true" or
    best_model["params"].get("use_combined_data", "false").lower() == "true"
)

# Decision
SKIP_FINAL_TRAINING = is_final_training and used_full_data

print(f"Selected run is final_training: {is_final_training}")
print(f"Selected run used full data: {used_full_data}")
print(f"Decision: {'Skip final training' if SKIP_FINAL_TRAINING else 'Run final training'}")

if SKIP_FINAL_TRAINING:
    print("‚úì Using selected checkpoint directly (skipping retraining)")
    final_checkpoint_dir = best_checkpoint_dir
else:
    print("üîÑ Selected run is not final training - will proceed with final training")


## Step 9: Final Training (Conditional)

Run final training with best configuration if needed.


In [None]:
if not SKIP_FINAL_TRAINING:
    print("üîÑ Starting final training with best configuration...")
    
    from orchestration.final_training_config import create_final_training_config
    from orchestration.naming_centralized import create_naming_context, build_output_path
    from orchestration.fingerprints import compute_spec_fp, compute_exec_fp
    from orchestration.config_loader import load_all_configs
    from shared.platform_detection import detect_platform
    import subprocess
    import sys
    
    # Extract best config from MLflow run
    best_params = best_model["params"]
    
    # Create final training config
    all_configs = load_all_configs(experiment_config)
    environment = detect_platform()
    
    # Compute fingerprints
    spec_fp = compute_spec_fp(
        model_config=all_configs.get("model", {}),
        data_config=all_configs.get("data", {}),
        train_config=all_configs.get("train", {}),
        seed=int(best_params.get("random_seed", 42)),
    )
    
    try:
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=ROOT_DIR,
            stderr=subprocess.DEVNULL,
        ).decode().strip()
    except Exception:
        git_sha = None
    
    exec_fp = compute_exec_fp(
        git_sha=git_sha,
        env_config=all_configs.get("env", {}),
    )
    
    # Create training context
    backbone_name = best_params.get("backbone", "distilbert")
    if "-" in backbone_name:
        backbone_name = backbone_name.split("-")[0]
    
    from orchestration.final_training_config import _compute_next_variant
    variant = _compute_next_variant(
        ROOT_DIR,
        CONFIG_DIR,
        spec_fp,
        exec_fp,
        backbone_name,
    )
    
    training_context = create_naming_context(
        process_type="final_training",
        model=backbone_name,
        spec_fp=spec_fp,
        exec_fp=exec_fp,
        environment=environment,
        variant=variant,
    )
    
    final_output_dir = build_output_path(ROOT_DIR, training_context)
    
    # Create final training config with best hyperparameters
    final_training_config = {
        "backbone": best_params.get("backbone"),
        "learning_rate": float(best_params.get("learning_rate", 2e-5)),
        "batch_size": int(best_params.get("batch_size", 16)),
        "dropout": float(best_params.get("dropout", 0.1)),
        "weight_decay": float(best_params.get("weight_decay", 0.01)),
        "epochs": int(best_params.get("epochs", 10)),
        "random_seed": int(best_params.get("random_seed", 42)),
        "early_stopping_enabled": best_params.get("early_stopping_enabled", "true").lower() == "true",
        "use_combined_data": True,  # Use full dataset for final training
    }
    
    print(f"‚úì Final training config: {final_training_config}")
    print(f"‚úì Output directory: {final_output_dir}")
    
    # Load dataset path from config
    from shared.yaml_utils import load_yaml
    data_config = all_configs.get("data", {})
    dataset_path = data_config.get("dataset_path", "data/resume_tiny")
    DATASET_LOCAL_PATH = ROOT_DIR / dataset_path if not Path(dataset_path).is_absolute() else Path(dataset_path)
    
    # Run training as a module
    training_args = [
        sys.executable,
        "-m",
        "training.train",
        "--data-asset",
        str(DATASET_LOCAL_PATH),
        "--config-dir",
        str(CONFIG_DIR),
        "--backbone",
        final_training_config["backbone"],
        "--learning-rate",
        str(final_training_config["learning_rate"]),
        "--batch-size",
        str(final_training_config["batch_size"]),
        "--dropout",
        str(final_training_config["dropout"]),
        "--weight-decay",
        str(final_training_config["weight_decay"]),
        "--epochs",
        str(final_training_config["epochs"]),
        "--random-seed",
        str(final_training_config["random_seed"]),
        "--early-stopping-enabled",
        str(final_training_config["early_stopping_enabled"]).lower(),
        "--use-combined-data",
        str(final_training_config["use_combined_data"]).lower(),
    ]
    
    training_env = os.environ.copy()
    training_env["AZURE_ML_OUTPUT_checkpoint"] = str(final_output_dir)
    
    # Add src directory to PYTHONPATH
    pythonpath = training_env.get("PYTHONPATH", "")
    if pythonpath:
        training_env["PYTHONPATH"] = f"{str(SRC_DIR)}{os.pathsep}{pythonpath}"
    else:
        training_env["PYTHONPATH"] = str(SRC_DIR)
    
    mlflow_tracking_uri = mlflow.get_tracking_uri()
    if mlflow_tracking_uri:
        training_env["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri
    training_env["MLFLOW_EXPERIMENT_NAME"] = training_experiment_name
    
    print("üîÑ Running final training...")
    result = subprocess.run(
        training_args,
        cwd=ROOT_DIR,
        env=training_env,
        capture_output=True,
        text=True,
    )
    
    if result.returncode != 0:
        raise RuntimeError(
            f"Final training failed with return code {result.returncode}\n"
            f"STDOUT: {result.stdout}\n"
            f"STDERR: {result.stderr}"
        )
    else:
        if result.stdout:
            print(result.stdout)
    
    # Set final checkpoint directory
    final_checkpoint_dir = final_output_dir / "checkpoint"
    if not final_checkpoint_dir.exists():
        # Try actual checkpoint location
        actual_checkpoint = ROOT_DIR / "outputs" / "checkpoint"
        if actual_checkpoint.exists():
            final_checkpoint_dir = actual_checkpoint
    
    print(f"‚úì Final training completed. Checkpoint: {final_checkpoint_dir}")
    
    # Try to set tag in MLflow run
    try:
        from orchestration.jobs.tracking.finder.run_finder import find_mlflow_run
        from orchestration.jobs.tracking.trackers.training_tracker import MLflowTrainingTracker
        
        training_tracker = MLflowTrainingTracker(training_experiment_name)
        report = find_mlflow_run(
            experiment_name=training_experiment_name,
            context=training_context,
            output_dir=final_output_dir,
            strict=False,
            root_dir=ROOT_DIR,
            config_dir=CONFIG_DIR,
        )
        
        if report.found and report.run_id:
            with mlflow.start_run(run_id=report.run_id):
                mlflow.set_tag("code.trained_on_full_data", "true")
                print(f"‚úì Set code.trained_on_full_data tag in MLflow run {report.run_id}")
    except Exception as e:
        print(f"‚ö† Could not set MLflow tag: {e}")
else:
    print("‚úì Skipping final training - using selected checkpoint")


In [None]:
print("üîÑ Starting model conversion to ONNX...")

from orchestration.jobs.tracking.trackers.conversion_tracker import MLflowConversionTracker
from orchestration.naming_centralized import create_naming_context, build_output_path
from orchestration.fingerprints import build_parent_training_id, compute_conv_fp
from pathlib import Path

# Setup conversion tracker
conversion_tracker = MLflowConversionTracker(conversion_experiment_name)

# Determine checkpoint path
checkpoint_path = final_checkpoint_dir

if not checkpoint_path.exists():
    raise ValueError(f"Checkpoint not found at: {checkpoint_path}")

# Extract backbone from best model
backbone = best_model["backbone"]

# Conversion settings
# Load conversion settings from config
quantization = conversion_config["onnx"]["quantization"]
opset_version = conversion_config["onnx"]["opset_version"]
conversion_target = conversion_config["target"]["format"]

print(f"‚úì Checkpoint path: {checkpoint_path}")
print(f"‚úì Backbone: {backbone}")
print(f"‚úì Conversion target: {conversion_target}")
print(f"‚úì Quantization: {quantization}")
print(f"‚úì ONNX opset version: {opset_version}")

# Extract or compute fingerprints for conversion context
# Try to get from best model tags first
spec_fp = best_model["tags"].get("code.spec_fp")
exec_fp = best_model["tags"].get("code.exec_fp")
variant = int(best_model["tags"].get("code.variant", "1")) if best_model["tags"].get("code.variant") else 1

# If not in tags, compute from config
if not spec_fp or not exec_fp:
    from orchestration.fingerprints import compute_spec_fp, compute_exec_fp
    from orchestration.config_loader import load_all_configs
    import subprocess
    
    all_configs = load_all_configs(experiment_config)
    
    spec_fp = compute_spec_fp(
        model_config=all_configs.get("model", {}),
        data_config=all_configs.get("data", {}),
        train_config=all_configs.get("train", {}),
        seed=int(best_model["params"].get("random_seed", 42)),
    )
    
    try:
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=ROOT_DIR,
            stderr=subprocess.DEVNULL,
        ).decode().strip()
    except Exception:
        git_sha = None
    
    exec_fp = compute_exec_fp(
        git_sha=git_sha,
        env_config=all_configs.get("env", {}),
    )

# Build parent_training_id
parent_training_id = build_parent_training_id(spec_fp, exec_fp, variant)

# Compute conversion fingerprint
conv_fp = compute_conv_fp(backbone, quantization, opset_version)

# Create conversion context
conversion_context = create_naming_context(
    process_type="conversion",
    model=backbone.split("-")[0] if "-" in backbone else backbone,
    environment=PLATFORM,
    parent_training_id=parent_training_id,
    conv_fp=conv_fp,
)

# Build conversion output path using canonical structure
conversion_output_dir = build_output_path(ROOT_DIR, conversion_context)
conversion_output_dir.mkdir(parents=True, exist_ok=True)

print(f"‚úì Conversion output: {conversion_output_dir}")

# Run conversion
try:
    # Import conversion function
    from model_conversion.onnx_exporter import export_to_onnx
    
    # Start conversion run
    conversion_run_name = f"conversion_{backbone}_{conversion_target}"
    
    with conversion_tracker.start_conversion_run(
        run_name=conversion_run_name,
        conversion_type=conversion_target,
        source_training_run=best_model["run_id"],
        output_dir=conversion_output_dir,
    ) as conversion_handle:
        
        # Log conversion parameters
        conversion_tracker.log_conversion_parameters(
            checkpoint_path=str(checkpoint_path),
            conversion_target=conversion_target,
            quantization=quantization,
            opset_version=opset_version,
            backbone=backbone,
        )
        
        # Perform conversion
        onnx_model_path = export_to_onnx(
            checkpoint_dir=checkpoint_path,
            output_dir=conversion_output_dir,
            quantize_int8=(quantization == "int8"),
        )
        
        # Calculate original checkpoint size for compression ratio
        original_size = None
        if checkpoint_path.exists():
            total_size = sum(f.stat().st_size for f in checkpoint_path.rglob("*") if f.is_file())
            original_size = total_size / (1024 * 1024)  # MB
        
        # Log conversion results
        conversion_tracker.log_conversion_results(
            conversion_success=True,
            onnx_model_path=onnx_model_path,
            original_checkpoint_size=original_size,
            smoke_test_passed=True,  # Could add actual smoke test
        )
        
        print(f"‚úì Model converted successfully!")
        print(f"‚úì ONNX model saved to: {onnx_model_path}")
        
        if conversion_handle:
            print(f"‚úì Conversion logged to MLflow run: {conversion_handle.run_id}")
            
except Exception as e:
    print(f"‚ö† Conversion failed: {e}")
    import traceback
    traceback.print_exc()
    raise


## Step 11: Summary

Display summary of best model selection, artifact locations, and conversion status.


In [None]:
print("\n" + "="*80)
print("üìã SUMMARY")
print("="*80)

print(f"\n‚úì Best Model Selected:")
print(f"   Run ID: {best_model['run_id']}")
print(f"   Run Name: {best_model['run_name']}")
print(f"   Backbone: {best_model['backbone']}")
print(f"   Macro-F1: {best_model['primary_metric']:.4f}")
if best_model['latency_ms']:
    print(f"   Latency: {best_model['latency_ms']:.2f} ms")
if best_model['throughput']:
    print(f"   Throughput: {best_model['throughput']:.2f} docs/sec")
print(f"   Confidence: {best_model['confidence']}")
print(f"   Source: {best_model['source_type']}")

print(f"\n‚úì Artifacts:")
print(f"   Checkpoint: {final_checkpoint_dir}")

if 'onnx_model_path' in locals():
    print(f"   ONNX Model: {onnx_model_path}")

print(f"\n‚úì MLflow Tracking:")
print(f"   Training Experiment: {training_experiment_name}")
print(f"   Benchmark Experiment: {benchmark_experiment_name}")
print(f"   Conversion Experiment: {conversion_experiment_name}")

print(f"\n‚úì Final Training:")
print(f"   {'Skipped (already final training on full data)' if SKIP_FINAL_TRAINING else 'Completed'}")

print("\n‚úÖ Best model selection and conversion pipeline completed!")
