# Phase 1: Best Configuration Selection (Local, Google Colab & Kaggle)

This notebook automates the selection of the best model configuration from MLflow
based on metrics and benchmarking results, then performs final training and model conversion.


## Workflow

1. **Best Model Selection**: Query MLflow benchmark runs, join to training runs via grouping tags (`code.study_key_hash`, `code.trial_key_hash`), select best using normalized composite scoring
2. **Artifact Acquisition**: Download the best model's checkpoint using a fallback strategy (local disk → Drive restore → MLflow download)
3. **Final Training**: Optionally retrain with best config on full dataset (if not already final training)
4. **Model Conversion**: Convert the final model to ONNX format using canonical path structure
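
The fallback order in step 2 can be pictured as a chain of sources tried in sequence until one yields a checkpoint directory. The sketch below is illustrative only and is not the implementation of the notebook's acquisition helper; `acquire_with_fallback`, the `Strategy` alias, and the three lambda strategies are hypothetical names introduced for this example.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: a strategy returns a checkpoint directory, or None if it has nothing.
Strategy = Callable[[], Optional[Path]]


def acquire_with_fallback(strategies: Iterable[Tuple[str, Strategy]]) -> Tuple[str, Path]:
    """Try each named strategy in order and return the first existing directory."""
    failures = []
    for name, strategy in strategies:
        try:
            result = strategy()
        except Exception as exc:  # a failing source should not stop the chain
            failures.append(f"{name}: {exc}")
            continue
        if result is not None and Path(result).is_dir():
            return name, Path(result)
        failures.append(f"{name}: no checkpoint")
    raise FileNotFoundError("No checkpoint found; tried " + "; ".join(failures))


# Demo: the first two sources have nothing, the third one succeeds.
with tempfile.TemporaryDirectory() as tmp:
    source, checkpoint = acquire_with_fallback([
        ("local", lambda: None),
        ("drive", lambda: None),
        ("mlflow", lambda: Path(tmp)),
    ])
    print(source)  # mlflow
```

Ordering the sources from cheapest to most expensive means a network download only happens when no local or Drive copy exists.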


## Important

- This notebook **executes on Local, Google Colab, or Kaggle** (not on Azure ML compute)
- Requires MLflow tracking to be set up (Azure ML workspace or local SQLite)
- All computation happens on the platform's GPU (if available) or CPU
- **Storage & Persistence**:
  - **Local**: Outputs saved to `outputs/` directory in repository root
  - **Google Colab**: Checkpoints are automatically saved to Google Drive for persistence across sessions
  - **Kaggle**: Outputs in `/kaggle/working/` are automatically persisted - no manual backup needed
- The notebook must be **re-runnable end-to-end**
- Uses the dataset path specified in the data config (from `config/data/*.yaml`), typically pointing to a local folder included in the repository
- **Session Management**:
  - **Local**: No session limits, outputs persist in repository
  - **Colab**: Sessions time out after 12-24 hours (depending on Colab plan). Checkpoints are saved to Drive automatically.
  - **Kaggle**: Sessions have time limits based on your plan. All outputs are automatically saved.


## Step 1: Environment Detection

The notebook automatically detects the execution environment (local, Google Colab, or Kaggle) and adapts its behavior accordingly.


In [None]:
import os
from pathlib import Path
# Detect execution environment
IN_COLAB = "COLAB_GPU" in os.environ or "COLAB_TPU" in os.environ
IN_KAGGLE = "KAGGLE_KERNEL_RUN_TYPE" in os.environ
IS_LOCAL = not IN_COLAB and not IN_KAGGLE
# Set platform-specific constants
if IN_COLAB:
    PLATFORM = "colab"
    BASE_DIR = Path("/content")
    BACKUP_ENABLED = True
elif IN_KAGGLE:
    PLATFORM = "kaggle"
    BASE_DIR = Path("/kaggle/working")
    BACKUP_ENABLED = False
else:
    PLATFORM = "local"
    BASE_DIR = None
    BACKUP_ENABLED = False
print(f"✓ Detected environment: {PLATFORM.upper()}")
print(f"Platform: {PLATFORM}")
print(
    f"Base directory: {BASE_DIR if BASE_DIR else 'Current working directory'}")
print(f"Backup enabled: {BACKUP_ENABLED}")


### Install Required Packages

Install required packages based on the execution environment.


In [None]:
# Install required packages
if IS_LOCAL:
    print("For local environment, please:")
    print("1. Create conda environment: conda env create -f config/environment/conda.yaml")
    print("2. Activate: conda activate resume-ner-training")
    print("3. Restart kernel after activation")
    print("\nIf you've already done this, you can continue to the next cell.")
    print("\nInstalling Azure ML SDK (required for imports)...")
    # Install Azure ML packages even for local (in case conda env not activated)
    %pip install "azure-ai-ml>=1.0.0" --quiet
    %pip install "azure-identity>=1.12.0" --quiet
    %pip install azureml-defaults --quiet
    %pip install azureml-mlflow --quiet
else:
    # Core ML libraries
    %pip install "transformers>=4.35.0,<5.0.0" --quiet
    %pip install "safetensors>=0.4.0" --quiet
    %pip install "datasets>=2.12.0" --quiet

    # ML utilities
    %pip install "numpy>=1.24.0,<2.0.0" --quiet
    %pip install "pandas>=2.0.0" --quiet
    %pip install "scikit-learn>=1.3.0" --quiet

    # Utilities
    %pip install "pyyaml>=6.0" --quiet
    %pip install "tqdm>=4.65.0" --quiet
    %pip install "seqeval>=1.2.2" --quiet
    %pip install "sentencepiece>=0.1.99" --quiet

    # Experiment tracking
    %pip install mlflow --quiet
    %pip install optuna --quiet

    # Azure ML SDK (required for orchestration imports)
    %pip install "azure-ai-ml>=1.0.0" --quiet
    %pip install "azure-identity>=1.12.0" --quiet
    %pip install azureml-defaults --quiet
    %pip install azureml-mlflow --quiet

    # ONNX support
    %pip install onnxruntime --quiet
    %pip install "onnx>=1.16.0" --quiet
    %pip install "onnxscript>=0.1.0" --quiet

    print("✓ All dependencies installed")


## Step 2: Repository Setup

**Note**: Repository setup is only needed for Colab/Kaggle environments. Local environments should already have the repository cloned.


In [None]:
# Repository setup - only needed for Colab/Kaggle
if not IS_LOCAL:
    if IN_KAGGLE:
        !git clone -b gg_final_training_2 https://github.com/longdang193/resume-ner-azureml.git /kaggle/working/resume-ner-azureml
    elif IN_COLAB:
        !git clone -b gg_final_training_2 https://github.com/longdang193/resume-ner-azureml.git /content/resume-ner-azureml
else:
    print("✓ Local environment detected - detecting repository root...")

# Set up paths
if not IS_LOCAL:
    ROOT_DIR = BASE_DIR / "resume-ner-azureml"
else:
    # For local, detect repo root by searching for config/ and src/ directories
    # Start from current working directory and search up
    current_dir = Path.cwd()
    ROOT_DIR = None
    
    # Check current directory first
    if (current_dir / "config").exists() and (current_dir / "src").exists():
        ROOT_DIR = current_dir
    else:
        # Search up the directory tree
        for parent in current_dir.parents:
            if (parent / "config").exists() and (parent / "src").exists():
                ROOT_DIR = parent
                break
    
    if ROOT_DIR is None:
        raise ValueError(
            f"Could not find repository root. Searched from: {current_dir}\n"
            "Please ensure you're running from within the repository or a subdirectory."
        )

CONFIG_DIR = ROOT_DIR / "config"
SRC_DIR = ROOT_DIR / "src"

# Add src to path
import sys
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

print(f"✓ Repository: {ROOT_DIR} (config={CONFIG_DIR.name}, src={SRC_DIR.name})")

# Verify repository structure
required_dirs = [CONFIG_DIR, SRC_DIR]
for dir_path in required_dirs:
    if not dir_path.exists():
        raise ValueError(f"Required directory not found: {dir_path}")
print("✓ Repository structure verified")


## Step 3: Load Configuration

Load experiment configuration and define experiment naming convention.
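
The naming convention used throughout this notebook appends a stage suffix to the base experiment name (`-benchmark`, `-training`, `-conversion`), and HPO experiments follow `{experiment_name}-hpo-{backbone}`. A minimal sketch of that convention follows; `derive_experiment_names` and `backbone_from_hpo_name` are hypothetical helpers, and `demo` and `distilbert` are placeholder values, not names taken from the repository.

```python
from typing import Dict, Optional


def derive_experiment_names(experiment_name: str) -> Dict[str, str]:
    """Build the per-stage MLflow experiment names from the base name."""
    return {
        "benchmark": f"{experiment_name}-benchmark",
        "training": f"{experiment_name}-training",
        "conversion": f"{experiment_name}-conversion",
    }


def backbone_from_hpo_name(experiment_name: str, hpo_experiment_name: str) -> Optional[str]:
    """Return the backbone suffix of an HPO experiment name, or None if it does not match."""
    prefix = f"{experiment_name}-hpo-"
    if not hpo_experiment_name.startswith(prefix):
        return None
    # Slice off the prefix only once, so a backbone containing "-hpo-" stays intact.
    return hpo_experiment_name[len(prefix):] or None


names = derive_experiment_names("demo")  # placeholder base name
print(names["benchmark"])  # demo-benchmark
print(backbone_from_hpo_name("demo", "demo-hpo-distilbert"))  # distilbert
```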


In [None]:
from orchestration.config_loader import load_experiment_config
from orchestration import EXPERIMENT_NAME
from shared.yaml_utils import load_yaml
from orchestration.jobs.tracking.naming.tags_registry import load_tags_registry

# Load experiment config
experiment_config = load_experiment_config(CONFIG_DIR, EXPERIMENT_NAME)

# Load best model selection configs
tags_config = load_tags_registry(CONFIG_DIR)
selection_config = load_yaml(CONFIG_DIR / "best_model_selection.yaml")
conversion_config = load_yaml(CONFIG_DIR / "conversion.yaml")
acquisition_config = load_yaml(CONFIG_DIR / "artifact_acquisition.yaml")

print(f"✓ Loaded configs: experiment={experiment_config.name}, tags, selection, conversion, acquisition")

# Define experiment names (discovery happens after MLflow setup in Step 4)
experiment_name = experiment_config.name
benchmark_experiment_name = f"{experiment_name}-benchmark"
training_experiment_name = f"{experiment_name}-training"  # For final training runs
conversion_experiment_name = f"{experiment_name}-conversion"

print(f"✓ Experiment names: benchmark={benchmark_experiment_name}, training={training_experiment_name}, conversion={conversion_experiment_name}")


## Step 4: Setup MLflow

Setup MLflow tracking with fallback to local if Azure ML is unavailable.
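
The fallback idea can be sketched as a small URI-selection function: use the remote tracking URI when one is available, otherwise point MLflow at a local SQLite file. This is an assumption about the general pattern, not a description of how `setup_mlflow_from_config` works internally; `choose_tracking_uri` and the `azureml://example` placeholder are hypothetical.

```python
from pathlib import Path
from typing import Optional


def choose_tracking_uri(remote_uri: Optional[str], local_db: Path) -> str:
    """Prefer the remote tracking URI; otherwise fall back to a local SQLite store."""
    if remote_uri:
        return remote_uri
    # MLflow accepts SQLAlchemy-style SQLite URIs; an absolute POSIX path
    # therefore yields four slashes (sqlite:////abs/path/mlflow.db).
    return f"sqlite:///{local_db.resolve().as_posix()}"


print(choose_tracking_uri("azureml://example", Path("mlflow.db")))  # placeholder remote URI
print(choose_tracking_uri(None, Path("mlflow.db")))
```

The returned string can then be passed to `mlflow.set_tracking_uri(...)` before any experiment is created or queried.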


In [None]:
from shared.mlflow_setup import setup_mlflow_from_config
import mlflow

# Setup MLflow tracking (use training experiment for setup - actual queries use discovered experiments)
tracking_uri = setup_mlflow_from_config(
    experiment_name=training_experiment_name,
    config_dir=CONFIG_DIR,
    fallback_to_local=True,
)

print(f"✓ MLflow tracking URI: {tracking_uri}")
print(f"✓ MLflow experiment: {training_experiment_name}")

# Discover HPO and benchmark experiments from MLflow (after setup)
from mlflow.tracking import MlflowClient

client = MlflowClient()
all_experiments = client.search_experiments()

# Find HPO experiments (format: {experiment_name}-hpo-{backbone})
hpo_experiments = {}
for exp in all_experiments:
    if exp.name.startswith(f"{experiment_name}-hpo-"):
        # Strip only the leading prefix (str.replace would remove every occurrence)
        backbone = exp.name[len(f"{experiment_name}-hpo-"):]
        hpo_experiments[backbone] = {
            "name": exp.name,
            "id": exp.experiment_id
        }

# Find benchmark experiment
benchmark_experiment = None
for exp in all_experiments:
    if exp.name == benchmark_experiment_name:
        benchmark_experiment = {
            "name": exp.name,
            "id": exp.experiment_id
        }
        break

hpo_backbones = ", ".join(hpo_experiments.keys())
print(f"✓ Experiments: {len(hpo_experiments)} HPO ({hpo_backbones}), benchmark={'found' if benchmark_experiment else 'not found'}, training={training_experiment_name}, conversion={conversion_experiment_name}")


## Step 5: Drive Backup Setup (Colab Only)

Setup Google Drive backup/restore for Colab environments.
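
Mirroring means a file keeps its path relative to the repository root when it is copied under the Drive backup root. The sketch below shows only that path mapping; it is not the `drive_store` implementation, `mirrored_path` is a hypothetical helper, and the Drive folder in the demo is a made-up location.

```python
from pathlib import Path


def mirrored_path(local_path: Path, root_dir: Path, backup_root: Path) -> Path:
    """Map a path under root_dir to the same relative location under backup_root."""
    # relative_to raises ValueError when local_path is not inside root_dir,
    # which guards against backing up files from outside the repository.
    relative = Path(local_path).relative_to(root_dir)
    return Path(backup_root) / relative


root = Path("/content/resume-ner-azureml")
backup = Path("/content/drive/MyDrive/backup")  # hypothetical Drive folder
print(mirrored_path(root / "outputs" / "checkpoint", root, backup))
```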


In [None]:
from pathlib import Path

# Fix numpy/pandas compatibility before importing orchestration modules
try:
    from orchestration.drive_backup import create_colab_store
except (ValueError, ImportError) as e:
    if "numpy.dtype size changed" in str(e) or "numpy" in str(e).lower():
        print("⚠ Numpy/pandas compatibility issue detected. Fixing...")
        import subprocess
        import sys
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "--force-reinstall", "--no-cache-dir", "numpy>=1.24.0,<2.0.0", "pandas>=2.0.0", "--quiet"])
        print("✓ Numpy/pandas reinstalled. Please restart the kernel and re-run this cell.")
        raise RuntimeError("Please restart kernel after numpy/pandas fix")
    else:
        raise

# Mount Google Drive and create backup store (Colab only - Kaggle doesn't need this)
DRIVE_BACKUP_DIR = None
drive_store = None
restore_from_drive = None

if IN_COLAB:
    drive_store = create_colab_store(ROOT_DIR, CONFIG_DIR)
    if drive_store:
        BACKUP_ENABLED = True
        DRIVE_BACKUP_DIR = drive_store.backup_root
        # Create restore function wrapper
        def restore_from_drive(local_path: Path, is_directory: bool = False) -> bool:
            """Restore file/directory from Drive backup."""
            try:
                expect = "dir" if is_directory else "file"
                result = drive_store.restore(local_path, expect=expect)
                return result.ok
            except Exception as e:
                print(f"⚠ Drive restore failed: {e}")
                return False
        print(f"✓ Google Drive mounted")
        print(f"✓ Backup base directory: {DRIVE_BACKUP_DIR}")
        print(f"\nNote: All outputs/ will be mirrored to: {DRIVE_BACKUP_DIR / 'outputs'}")
    else:
        BACKUP_ENABLED = False
        print("⚠ Warning: Could not mount Google Drive. Backup to Google Drive will be disabled.")
elif IN_KAGGLE:
    print("✓ Kaggle environment detected - outputs are automatically persisted (no Drive mount needed)")
    BACKUP_ENABLED = False
else:
    # Local environment
    print("✓ Local environment detected - outputs will be saved to repository (no Drive backup needed)")
    BACKUP_ENABLED = False


## Step 6: Best Model Selection

Query MLflow benchmark runs, join to training runs via grouping tags, and select the best model using normalized composite scoring.
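
To make the idea concrete, here is a minimal sketch of a tag-based join followed by a normalized composite score. It is not the logic of `find_best_model_from_mlflow`: the min-max normalization, the two metrics (`f1`, `latency_ms`), the default weights, and the flat dictionary layout are all assumptions chosen for illustration, and the real metrics and weights come from the selection config.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (study_key_hash, trial_key_hash)


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; a constant list maps to all 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_best(benchmark_runs: List[dict], training_runs: List[dict],
                f1_weight: float = 0.7, latency_weight: float = 0.3) -> dict:
    """Join benchmark to training runs on the grouping tags, then rank by composite score."""
    training_by_key: Dict[Key, dict] = {
        (t["study_key_hash"], t["trial_key_hash"]): t for t in training_runs
    }
    joined = []
    for b in benchmark_runs:
        key = (b["study_key_hash"], b["trial_key_hash"])
        if key in training_by_key:  # drop benchmark runs with no matching training run
            joined.append({**training_by_key[key], "latency_ms": b["latency_ms"]})
    if not joined:
        raise ValueError("No benchmark run matches a training run")
    f1_scores = min_max([r["f1"] for r in joined])
    # Lower latency is better, so invert the normalized value.
    speed_scores = [1.0 - s for s in min_max([r["latency_ms"] for r in joined])]
    for run, f1_s, speed_s in zip(joined, f1_scores, speed_scores):
        run["score"] = f1_weight * f1_s + latency_weight * speed_s
    return max(joined, key=lambda r: r["score"])


# Made-up runs for illustration only.
benchmark = [
    {"study_key_hash": "s1", "trial_key_hash": "t1", "latency_ms": 10.0},
    {"study_key_hash": "s1", "trial_key_hash": "t2", "latency_ms": 40.0},
    {"study_key_hash": "s2", "trial_key_hash": "t9", "latency_ms": 5.0},  # no training match
]
training = [
    {"run_id": "a", "study_key_hash": "s1", "trial_key_hash": "t1", "f1": 0.80},
    {"run_id": "b", "study_key_hash": "s1", "trial_key_hash": "t2", "f1": 0.90},
]
print(select_best(benchmark, training)["run_id"])  # b
```

Normalizing each metric before weighting keeps a metric with a large numeric range, such as latency in milliseconds, from dominating one bounded in [0, 1].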


In [None]:
from orchestration.jobs.selection.mlflow_selection import find_best_model_from_mlflow
from orchestration.jobs.selection.artifact_acquisition import acquire_best_model_checkpoint
from pathlib import Path
from typing import Optional, Callable, Dict, Any

# Validate experiments
if benchmark_experiment is None:
    raise ValueError(f"Benchmark experiment '{benchmark_experiment_name}' not found. Run benchmark jobs first.")
if not hpo_experiments:
    raise ValueError("No HPO experiments found. Run HPO jobs first.")

# Check if we should reuse cached selection
run_mode = selection_config.get("run", {}).get("mode", "reuse_if_exists")
best_model = None
cache_data = None

print(f"\n📋 Best Model Selection Mode: {run_mode}")

if run_mode == "reuse_if_exists":
    from orchestration.jobs.selection.cache import load_cached_best_model
    
    tracking_uri = mlflow.get_tracking_uri()
    cache_data = load_cached_best_model(
        root_dir=ROOT_DIR,
        config_dir=CONFIG_DIR,
        experiment_name=experiment_name,
        selection_config=selection_config,
        tags_config=tags_config,
        benchmark_experiment_id=benchmark_experiment["id"],
        tracking_uri=tracking_uri,
    )
    
    if cache_data:
        best_model = cache_data["best_model"]
        # Success message already printed by load_cached_best_model
    else:
        print(f"\nℹ Cache not available or invalid - will query MLflow for fresh selection")
elif run_mode == "force_new":
    print(f"  Mode is 'force_new' - skipping cache, querying MLflow...")
else:
    print(f"  ⚠ Unknown run mode '{run_mode}', defaulting to querying MLflow...")

if best_model is None:
    # Find best model
    best_model = find_best_model_from_mlflow(
        benchmark_experiment=benchmark_experiment,
        hpo_experiments=hpo_experiments,
        tags_config=tags_config,
        selection_config=selection_config,
        use_python_filtering=True,
    )
    
    if best_model is None:
        raise ValueError("Could not find best model from MLflow.")
    
    # Save to cache
    from orchestration.jobs.selection.cache import save_best_model_cache
    
    tracking_uri = mlflow.get_tracking_uri()
    # Note: inputs_summary could be enhanced if find_best_model_from_mlflow returns it
    inputs_summary = {}
    
    timestamped_file, latest_file, index_file = save_best_model_cache(
        root_dir=ROOT_DIR,
        config_dir=CONFIG_DIR,
        best_model=best_model,
        experiment_name=experiment_name,
        selection_config=selection_config,
        tags_config=tags_config,
        benchmark_experiment=benchmark_experiment,
        hpo_experiments=hpo_experiments,
        tracking_uri=tracking_uri,
        inputs_summary=inputs_summary,
    )
    print(f"✓ Saved best model selection to cache")

# Extract lineage information from best_model for final training tags
from orchestration.jobs.final_training import extract_lineage_from_best_model
lineage = extract_lineage_from_best_model(best_model)

# Lineage extracted for final training tags

# Acquire checkpoint
best_checkpoint_dir = acquire_best_model_checkpoint(
    best_run_info=best_model,
    root_dir=ROOT_DIR,
    config_dir=CONFIG_DIR,
    acquisition_config=acquisition_config,
    selection_config=selection_config,
    platform=PLATFORM,
    restore_from_drive=restore_from_drive if "restore_from_drive" in locals() else None,
    drive_store=drive_store if "drive_store" in locals() else None,
    in_colab=IN_COLAB,
)

print(f"\n✓ Best model checkpoint available at: {best_checkpoint_dir}")


In [None]:
# Check if selected run is already final training (skip retraining if so)
stage_tag = tags_config.key("process", "stage")
trained_on_full_data_tag = tags_config.key("training", "trained_on_full_data")

is_final_training = best_model["tags"].get(stage_tag) == "final_training"
used_full_data = (
    best_model["tags"].get(trained_on_full_data_tag) == "true" or
    best_model["params"].get("use_combined_data", "false").lower() == "true"
)

SKIP_FINAL_TRAINING = is_final_training and used_full_data

if SKIP_FINAL_TRAINING:
    final_checkpoint_dir = best_checkpoint_dir


## Step 7: Final Training

Run final training with best configuration if needed.


In [None]:
if not SKIP_FINAL_TRAINING:
    print("🔄 Starting final training with best configuration...")
    from orchestration.jobs.final_training import execute_final_training
    # Execute final training (uses final_training.yaml via load_final_training_config)
    # Will automatically reuse existing complete runs if run.mode: reuse_if_exists in final_training.yaml
    final_checkpoint_dir = execute_final_training(
        root_dir=ROOT_DIR,
        config_dir=CONFIG_DIR,
        best_model=best_model,
        experiment_config=experiment_config,
        lineage=lineage,
        training_experiment_name=training_experiment_name,
        platform=PLATFORM,
    )
else:
    print("✓ Skipping final training - using selected checkpoint")

# Backup final checkpoint to Google Drive if in Colab
if IN_COLAB and drive_store and final_checkpoint_dir:
    try:
        checkpoint_path = Path(final_checkpoint_dir).resolve()
        print(f"\n📦 Backing up final training checkpoint to Google Drive...")
        result = drive_store.backup(checkpoint_path, expect="dir")
        if result.ok:
            print(f"✓ Successfully backed up final checkpoint to Google Drive")
            print(f"  Drive path: {result.dst}")
        else:
            print(f"⚠ Drive backup failed: {result.reason}")
            if result.error:
                print(f"  Error: {result.error}")
    except Exception as e:
        print(f"⚠ Drive backup error: {e}")
        print(f"  Checkpoint is still available locally at: {final_checkpoint_dir}")


## Step 8: Model Conversion & Optimization

Convert the final trained model to ONNX format with optimization.
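
After conversion, the output directory is searched recursively for the `.onnx` file. A small sketch of a deterministic version of that lookup is shown here; `find_onnx_model` and `size_mb` are hypothetical helpers, and picking the first path in sorted order is an assumption about how to choose when several files exist.

```python
from pathlib import Path
from typing import Optional


def find_onnx_model(output_dir: Path) -> Optional[Path]:
    """Return the first .onnx file under output_dir in sorted order, or None."""
    # Path.rglob yields files in arbitrary order, so sort for a repeatable choice.
    candidates = sorted(Path(output_dir).rglob("*.onnx"))
    return candidates[0] if candidates else None


def size_mb(path: Path) -> float:
    """Return the file size in mebibytes."""
    return path.stat().st_size / (1024 * 1024)
```

Calling `find_onnx_model(conversion_output_dir)` returns `None` rather than raising when nothing was produced, so the caller can decide whether a missing model is a warning or an error.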

In [None]:
# Extract parent training information for conversion
from shared.json_cache import load_json
from pathlib import Path

# Load metadata from final training output directory
# Wrap in Path in case the checkpoint directory was returned as a string
final_training_metadata_path = Path(final_checkpoint_dir).parent / "metadata.json"

if not final_training_metadata_path.exists():
    raise ValueError(
        f"Metadata file not found: {final_training_metadata_path}\n"
        "Please ensure final training completed successfully."
    )

metadata = load_json(final_training_metadata_path)
parent_spec_fp = metadata.get("spec_fp")
parent_exec_fp = metadata.get("exec_fp")
parent_training_run_id = metadata.get("mlflow", {}).get("run_id")

if not parent_spec_fp or not parent_exec_fp:
    raise ValueError(
        f"Missing required fingerprints in metadata: spec_fp={parent_spec_fp}, exec_fp={parent_exec_fp}\n"
        "Please ensure final training completed successfully."
    )

if parent_training_run_id:
    print(f"✓ Parent training: spec_fp={parent_spec_fp[:8]}..., exec_fp={parent_exec_fp[:8]}..., run_id={parent_training_run_id[:12]}...")
else:
    print(f"✓ Parent training: spec_fp={parent_spec_fp[:8]}..., exec_fp={parent_exec_fp[:8]}... (run_id not found)")

# Get parent training output directory (checkpoint parent)
parent_training_output_dir = Path(final_checkpoint_dir).parent

print(f"\n🔄 Starting model conversion...")
from orchestration.jobs.conversion import execute_conversion

# Execute conversion (uses conversion.yaml via load_conversion_config)
conversion_output_dir = execute_conversion(
    root_dir=ROOT_DIR,
    config_dir=CONFIG_DIR,
    parent_training_output_dir=parent_training_output_dir,
    parent_spec_fp=parent_spec_fp,
    parent_exec_fp=parent_exec_fp,
    experiment_config=experiment_config,
    conversion_experiment_name=conversion_experiment_name,
    platform=PLATFORM,
    parent_training_run_id=parent_training_run_id,  # May be None, that's OK
)

# Find ONNX model file (search recursively, as model may be in onnx_model/ subdirectory)
# Sort so the chosen file is deterministic when several .onnx files exist
onnx_files = sorted(Path(conversion_output_dir).rglob("*.onnx"))
if onnx_files:
    onnx_model_path = onnx_files[0]
    print(f"\n✓ Conversion completed successfully!")
    print(f"  ONNX model: {onnx_model_path}")
    print(f"  Model size: {onnx_model_path.stat().st_size / (1024 * 1024):.2f} MB")
else:
    print(f"\n⚠ Warning: No ONNX model file found in {conversion_output_dir} (searched recursively)")

# Backup conversion output to Google Drive if in Colab
if IN_COLAB and drive_store and conversion_output_dir:
    try:
        output_path = Path(conversion_output_dir).resolve()
        print(f"\n📦 Backing up conversion output to Google Drive...")
        result = drive_store.backup(output_path, expect="dir")
        if result.ok:
            print(f"✓ Successfully backed up conversion output to Google Drive")
            print(f"  Drive path: {result.dst}")
        else:
            print(f"⚠ Drive backup failed: {result.reason}")
            if result.error:
                print(f"  Error: {result.error}")
    except Exception as e:
        print(f"⚠ Drive backup error: {e}")
        print(f"  Output is still available locally at: {conversion_output_dir}")
