# Test HPO Pipeline with Tiny Datasets

This notebook validates that the HPO (Hyperparameter Optimization) pipeline works correctly with different tiny datasets created by `tests/00_make_tiny_dataset.ipynb`.

**Note**: This notebook is a thin wrapper around the test orchestrator in `tests/integration/orchestrators/test_orchestrator.py`. The tests can also be run directly:
- **CLI**: `python -m tests.integration.cli.main --seeds 0 1 2`
- **pytest**: `pytest tests/integration/test_hpo_pipeline.py -v -m integration`

## Purpose

- **Catch pipeline issues early**: Test HPO pipeline with small datasets before running on full data
- **Validate k-fold CV**: Ensure cross-validation works correctly with very small datasets (8 samples, 3 folds)
- **Test edge cases**: Verify pipeline handles minimal configurations gracefully
- **Compare datasets**: Test multiple random seed variants (seed0, seed1, seed2, ...)

## Test Areas

1. **HPO Pipeline Completion**: Verify HPO sweeps complete successfully with tiny datasets
2. **K-Fold Cross-Validation**: Test k-fold CV with small datasets and edge cases
3. **Edge Cases**: Test minimal k, small validation sets, batch size issues

## Prerequisites

- Run `tests/00_make_tiny_dataset.ipynb` first to create test datasets
- Ensure `dataset_tiny/seed{N}/` directories exist (where N is in `RANDOM_SEEDS_TO_TEST`)
- All datasets are seed-based: `dataset_tiny/seed0/`, `dataset_tiny/seed1/`, etc.
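
A quick pre-flight check can confirm that the seed directories are in place before any HPO trial starts. The sketch below uses a hypothetical helper, `find_missing_seed_dirs`, which is not part of the test suite. It assumes each seed directory holds a `train.json` file, which is the file name the k-fold check reports when it cannot find training data.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_seed_dirs(base_dir: Path, seeds: Iterable[int]) -> List[Path]:
    """Return the seed directories that are absent or lack train.json.

    Hypothetical helper for illustration; the layout
    <base_dir>/seed{N}/train.json is an assumption based on this notebook.
    """
    missing = []
    for seed in seeds:
        seed_dir = base_dir / f"seed{seed}"
        if not (seed_dir / "train.json").is_file():
            missing.append(seed_dir)
    return missing
```

Calling it with the `dataset_tiny` directory and the configured seeds should return an empty list when every dataset is ready; any path it returns points at a seed that still needs to be generated.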

## Configuration

All configuration is loaded from `config/test/hpo_pipeline.yaml`. The notebook automatically loads:
- **Random seeds to test**: From `datasets.random_seeds` (default: `[0]`)
- **HPO config path**: From `configs.hpo_config` (default: `hpo/smoke.yaml`)
- **Train config path**: From `configs.train_config` (default: `train.yaml`)
- **Output directory**: From `output.base_dir` (default: `outputs/hpo_tests`)
- **MLflow directory**: From `output.mlflow_dir` (default: `mlruns`)
- **Dataset base path**: From `datasets.deterministic_path` (default: `dataset_tiny`); used as the base directory for the seed-based datasets
- **Default constants**: From `defaults` section (backbone, random_seed, etc.)

To change configuration, edit `config/test/hpo_pipeline.yaml` rather than modifying notebook code.
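
The dotted names above map onto nested YAML sections. As a rough illustration of how a key such as `datasets.random_seeds` falls back to its default, here is a minimal sketch. `get_nested` is a hypothetical helper, and the real loader in `tests/fixtures/config/test_config_loader.py` may resolve keys differently.

```python
from typing import Any


def get_nested(config: dict, dotted_key: str, default: Any = None) -> Any:
    """Look up a dotted key such as 'datasets.random_seeds' in a nested dict.

    Hypothetical helper; returns `default` when any level is missing.
    """
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Example config: only random_seeds is overridden, everything else uses defaults.
example_config = {"datasets": {"random_seeds": [0, 1, 2]}}

seeds = get_nested(example_config, "datasets.random_seeds", default=[0])
hpo_config_path = get_nested(example_config, "configs.hpo_config", default="hpo/smoke.yaml")
print(seeds, hpo_config_path)
```

Here `seeds` comes from the example config, while `hpo_config_path` falls back to the documented default because the `configs` section is absent.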


## Setup: Import Dependencies and Configure Paths


In [1]:
import sys
from pathlib import Path

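# Assumes the notebook is launched from the tests/ directory,
# so the parent directory is the project root.
NOTEBOOK_DIR = Path.cwd()
ROOT_DIR = NOTEBOOK_DIR.parent

# Make the project root and src/ importable; skip entries that are
# already on sys.path so re-running the cell does not add duplicates.
for path in (ROOT_DIR, ROOT_DIR / "src"):
    if str(path) not in sys.path:
        sys.path.append(str(path))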

print("Notebook directory:", NOTEBOOK_DIR)
print("Project root:", ROOT_DIR)


Notebook directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\tests
Project root: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml


In [2]:
from tests.integration.orchestrators.test_orchestrator import (
    test_deterministic_hpo,
    test_deterministic_hpo_multiple_backbones,
    test_random_seed_variants,
    test_random_seed_variants_multiple_backbones,
    test_kfold_validation,
    test_edge_case_k_too_large,
    test_edge_cases_suite,
)
from tests.integration.aggregators.result_aggregator import (
    collect_test_results,
    build_test_details,
)
from tests.integration.comparators.result_comparator import compare_results
from tests.integration.setup.environment_setup import setup_test_environment
from tests.fixtures.config.test_config_loader import get_test_config, BACKBONES_LIST
from tests.fixtures.hpo_test_helpers import (
    DEFAULT_RANDOM_SEED,
    MINIMAL_K_FOLDS,
    DEFAULT_BACKBONE,
    METRIC_DECIMAL_PLACES,
    SEPARATOR_WIDTH,
    VERY_SMALL_VALIDATION_THRESHOLD,
    print_test_summary,
    print_comparison,
    print_hpo_results,
    print_kfold_results,
    print_edge_case_k_too_large_results,
    print_edge_case_results,
)

# Load configuration from config/test/hpo_pipeline.yaml
test_config = get_test_config(ROOT_DIR)
datasets_section = test_config.get("datasets", {})
RANDOM_SEEDS_TO_TEST = datasets_section.get("random_seeds", [0])
BACKBONES_TO_TEST = BACKBONES_LIST

print(f"Project root: {ROOT_DIR}")
print("Configuration loaded from: config/test/hpo_pipeline.yaml")
print(f"Random seeds to test (from config): {RANDOM_SEEDS_TO_TEST}")
print(f"Backbones to test (from config): {BACKBONES_TO_TEST}")
print(f"Default backbone (from config): {DEFAULT_BACKBONE}")
print(f"Default random seed (from config): {DEFAULT_RANDOM_SEED}")
print("\nNote: Individual test functions are imported from tests/integration/orchestrators/test_orchestrator.py")




Project root: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml
Configuration loaded from: config/test/hpo_pipeline.yaml
Random seeds to test (from config): [0, 1, 2, 3, 4]
Backbones to test (from config): ['distilbert', 'deberta']
Default backbone (from config): distilbert
Default random seed (from config): 42

Note: Individual test functions are imported from tests/integration/orchestrators/test_orchestrator.py


In [3]:
# Setup test environment: load configs, setup paths, initialize MLflow
# This function loads configuration from config/test/hpo_pipeline.yaml and applies it
env = setup_test_environment(root_dir=ROOT_DIR)

config_dir = env["config_dir"]
hpo_config = env["hpo_config"]
train_config = env["train_config"]
output_dir = env["output_dir"]
deterministic_dataset = env["deterministic_dataset"]
mlflow_tracking_uri = env["mlflow_tracking_uri"]

print("=" * 60)
print("Test Environment Configuration")
print("=" * 60)
print(f"Config directory: {config_dir}")
print(f"HPO config file: {hpo_config.get('_config_path', 'N/A')}")
print(f"  Max trials: {hpo_config.get('sampling', {}).get('max_trials', 'N/A')}")
print(f"  K-folds: {hpo_config.get('k_fold', {}).get('n_splits', 'N/A')}")
print(f"  Objective: {hpo_config.get('objective', {}).get('metric', 'N/A')} ({hpo_config.get('objective', {}).get('goal', 'N/A')})")
print(f"Train config file: {train_config.get('_config_path', 'N/A')}")
print(f"Output directory: {output_dir}")
print(f"MLflow tracking URI: {mlflow_tracking_uri}")
print(f"Dataset base path: {deterministic_dataset}")
print(f"Random seeds to test: {RANDOM_SEEDS_TO_TEST}")
print(f"Backbones to test: {BACKBONES_TO_TEST}")
print("Note: All datasets are seed-based (seed0, seed1, seed2, ...)")
print("=" * 60)


Test Environment Configuration
Config directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\config
HPO config file: config\hpo\smoke.yaml
  Max trials: 2
  K-folds: 3
  Objective: macro-f1 (maximize)
Train config file: config\train.yaml
Output directory: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests
MLflow tracking URI: file:///c:/Users/HOANG%20PHI%20LONG%20DANG/repos/resume-ner-azureml/mlruns
Dataset base path: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\dataset_tiny
Random seeds to test: [0, 1, 2, 3, 4]
Backbones to test: ['distilbert', 'deberta']
Note: All datasets are seed-based (seed0, seed1, seed2, ...)


## Test Suite 1: HPO with Seed0 Dataset (Multiple Backbones)

This test verifies that the HPO pipeline completes successfully on the seed0 dataset for every configured backbone.


In [None]:
# Use seed0 as the primary test dataset (all datasets are now seed-based)
seed0_dataset = deterministic_dataset / "seed0"

# Test all backbones configured in hpo_pipeline.yaml
deterministic_results_by_backbone = test_deterministic_hpo_multiple_backbones(
    dataset_path=seed0_dataset,
    config_dir=config_dir,
    hpo_config=hpo_config,
    train_config=train_config,
    output_dir=output_dir,
    backbones=BACKBONES_TO_TEST,
)

# Presentation: print results for each backbone
for backbone, results in deterministic_results_by_backbone.items():
    if results:
        print_hpo_results(results, f"Seed0 Dataset HPO Results ({backbone})")
        print()  # Empty line between backbones

# For backward compatibility, use first backbone's results as seed0_results
seed0_results = deterministic_results_by_backbone.get(BACKBONES_TO_TEST[0]) if BACKBONES_TO_TEST else None


[I 2025-12-20 14:49:12,229] A new study created in memory with name: hpo_distilbert
Best trial: 0. Best value: 0.168103:  50%|█████     | 1/2 [00:16<00:16, 16.94s/it, 16.94/1200 seconds]

Read macro-f1=0.16810344827586207 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\deterministic_distilbert\trial_0\metrics.json (modified: 1766238567.1637495)
[I 2025-12-20 14:49:29,164] Trial 0 finished with value: 0.16810344827586207 and parameters: {'learning_rate': 1.0321114140196525e-05, 'batch_size': 4, 'dropout': 0.1670006490533644, 'weight_decay': 0.04128504014838451}. Best is trial 0 with value: 0.16810344827586207.


Best trial: 1. Best value: 0.412844: 100%|██████████| 2/2 [00:34<00:00, 17.12s/it, 34.24/1200 seconds]
[I 2025-12-20 14:49:46,472] A new study created in memory with name: hpo_deberta


Read macro-f1=0.41284403669724773 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\deterministic_distilbert\trial_1\metrics.json (modified: 1766238585.246274)
[I 2025-12-20 14:49:46,466] Trial 1 finished with value: 0.41284403669724773 and parameters: {'learning_rate': 4.134240421019675e-05, 'batch_size': 4, 'dropout': 0.10319564833334856, 'weight_decay': 0.02711215175185821}. Best is trial 1 with value: 0.41284403669724773.


Best trial: 0. Best value: 0.396153:  50%|█████     | 1/2 [00:20<00:20, 20.70s/it, 20.70/1200 seconds]

Read macro-f1=0.3961531515260972 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\deterministic_deberta\trial_0\metrics.json (modified: 1766238605.989296)
[I 2025-12-20 14:50:07,170] Trial 0 finished with value: 0.3961531515260972 and parameters: {'learning_rate': 3.214437883557525e-05, 'batch_size': 4, 'dropout': 0.14632151345761005, 'weight_decay': 0.002674477895130168}. Best is trial 0 with value: 0.3961531515260972.


Best trial: 1. Best value: 0.414352: 100%|██████████| 2/2 [00:41<00:00, 20.66s/it, 41.32/1200 seconds]

Read macro-f1=0.4143518518518518 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\deterministic_deberta\trial_1\metrics.json (modified: 1766238626.5486965)
[I 2025-12-20 14:50:27,793] Trial 1 finished with value: 0.4143518518518518 and parameters: {'learning_rate': 4.3310675609091814e-05, 'batch_size': 4, 'dropout': 0.13558503140378794, 'weight_decay': 0.0017496794383519956}. Best is trial 1 with value: 0.4143518518518518.

Seed0 Dataset HPO Results (distilbert):
Success: True
Trials completed: 2
Trials failed: 0
Best trial: 1
Best value: 0.4128
Best params: {'learning_rate': 4.134240421019675e-05, 'batch_size': 4, 'dropout': 0.10319564833334856, 'weight_decay': 0.02711215175185821}


Seed0 Dataset HPO Results (deberta):
Success: True
Trials completed: 2
Trials failed: 0
Best trial: 1
Best value: 0.4144
Best params: {'learning_rate': 4.3310675609091814e-05, 'batch_size': 4, 'dropout': 0.13558503140378794, 'weight_decay': 0.0017496794383519956}






## Test Suite 2: HPO with Random Seed Variants

This test verifies that the HPO pipeline works with different random seed variants of the tiny dataset.
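
With so few samples per dataset, the best macro-F1 in this run varies widely between seeds, so the comparison says more about pipeline stability than about model quality. The sketch below summarizes a set of per-seed best values. The numbers are the distilbert best values from this notebook's run output, and `summarize_seed_scores` is a hypothetical helper, not the project's `compare_results`.

```python
from statistics import mean, pstdev
from typing import Dict


def summarize_seed_scores(best_values: Dict[int, float]) -> Dict[str, float]:
    """Summarize best objective values across seed variants (hypothetical helper)."""
    scores = list(best_values.values())
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "std": pstdev(scores),  # population standard deviation across seeds
    }


# Best macro-F1 per seed for distilbert, taken from this notebook's output.
distilbert_best = {0: 0.3837, 1: 0.1897, 2: 0.1727, 3: 0.5435, 4: 0.1735}
summary = summarize_seed_scores(distilbert_best)
print({name: round(value, 4) for name, value in summary.items()})
```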


In [None]:
# Test all backbones with random seed variants
random_seed_results_by_backbone = test_random_seed_variants_multiple_backbones(
    dataset_base_path=deterministic_dataset,
    seeds=RANDOM_SEEDS_TO_TEST,
    config_dir=config_dir,
    hpo_config=hpo_config,
    train_config=train_config,
    output_dir=output_dir,
    backbones=BACKBONES_TO_TEST,
)

# Presentation: print results for each backbone and seed
for backbone, backbone_results in random_seed_results_by_backbone.items():
    if backbone_results:
        print(f"\n{'=' * 60}")
        print(f"Backbone: {backbone}")
        print('=' * 60)
        for seed, seed_results in backbone_results.items():
            print_hpo_results(
                seed_results,
                f"Random Seed Variant (seed {seed}) HPO Results"
            )

# For backward compatibility, use first backbone's results as random_seed_results
random_seed_results = random_seed_results_by_backbone.get(BACKBONES_TO_TEST[0]) if BACKBONES_TO_TEST else None

# Presentation: compare seed0 vs other random seed results (for first backbone)
if seed0_results and random_seed_results:
    comparison = compare_results(seed0_results, random_seed_results)
    if comparison:
        print_comparison(comparison)

[I 2025-12-20 14:50:52,562] A new study created in memory with name: hpo_distilbert
  0%|          | 0/2 [00:00<?, ?it/s]2025/12/20 14:50:52 INFO mlflow.tracking.fluent: Experiment with name 'test-hpo-random-seed0' does not exist. Creating a new experiment.
Best trial: 0. Best value: 0.383663:  50%|█████     | 1/2 [00:17<00:17, 17.33s/it, 17.32/1200 seconds]

Read macro-f1=0.3836633663366337 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed0\trial_0\metrics.json (modified: 1766238668.1039436)
[I 2025-12-20 14:51:09,886] Trial 0 finished with value: 0.3836633663366337 and parameters: {'learning_rate': 1.4776966675680059e-05, 'batch_size': 4, 'dropout': 0.12769181279998454, 'weight_decay': 0.03293903673644672}. Best is trial 0 with value: 0.3836633663366337.


Best trial: 0. Best value: 0.383663: 100%|██████████| 2/2 [00:33<00:00, 16.87s/it, 33.73/1200 seconds]
[I 2025-12-20 14:51:26,301] A new study created in memory with name: hpo_distilbert


Read macro-f1=0.3617283950617284 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed0\trial_1\metrics.json (modified: 1766238685.195622)
[I 2025-12-20 14:51:26,295] Trial 1 finished with value: 0.3617283950617284 and parameters: {'learning_rate': 1.839693778204376e-05, 'batch_size': 4, 'dropout': 0.16111395701794523, 'weight_decay': 0.011126776787118415}. Best is trial 0 with value: 0.3836633663366337.


  0%|          | 0/2 [00:00<?, ?it/s]2025/12/20 14:51:26 INFO mlflow.tracking.fluent: Experiment with name 'test-hpo-random-seed1' does not exist. Creating a new experiment.
Best trial: 0. Best value: 0.179785:  50%|█████     | 1/2 [00:17<00:17, 17.02s/it, 17.01/1200 seconds]

Read macro-f1=0.17978494623655916 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed1\trial_0\metrics.json (modified: 1766238702.0926468)
[I 2025-12-20 14:51:43,314] Trial 0 finished with value: 0.17978494623655916 and parameters: {'learning_rate': 1.3585730611546055e-05, 'batch_size': 4, 'dropout': 0.2935797931455364, 'weight_decay': 0.0029265759002460976}. Best is trial 0 with value: 0.17978494623655916.


Best trial: 1. Best value: 0.189733: 100%|██████████| 2/2 [00:35<00:00, 17.54s/it, 35.07/1200 seconds]
[I 2025-12-20 14:52:01,378] A new study created in memory with name: hpo_distilbert


Read macro-f1=0.18973305954825462 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed1\trial_1\metrics.json (modified: 1766238720.0310605)
[I 2025-12-20 14:52:01,370] Trial 1 finished with value: 0.18973305954825462 and parameters: {'learning_rate': 3.7882648855248245e-05, 'batch_size': 4, 'dropout': 0.19088696978785835, 'weight_decay': 0.0019081764868729794}. Best is trial 1 with value: 0.18973305954825462.


  0%|          | 0/2 [00:00<?, ?it/s]2025/12/20 14:52:01 INFO mlflow.tracking.fluent: Experiment with name 'test-hpo-random-seed2' does not exist. Creating a new experiment.
Best trial: 0. Best value: 0.172707:  50%|█████     | 1/2 [00:15<00:15, 15.82s/it, 15.82/1200 seconds]

Read macro-f1=0.1727069351230425 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed2\trial_0\metrics.json (modified: 1766238736.047999)
[I 2025-12-20 14:52:17,193] Trial 0 finished with value: 0.1727069351230425 and parameters: {'learning_rate': 3.382277425589884e-05, 'batch_size': 4, 'dropout': 0.10810442043572231, 'weight_decay': 0.014115177379664546}. Best is trial 0 with value: 0.1727069351230425.


Best trial: 0. Best value: 0.172707: 100%|██████████| 2/2 [00:31<00:00, 15.96s/it, 31.91/1200 seconds]
[I 2025-12-20 14:52:33,294] A new study created in memory with name: hpo_distilbert


Read macro-f1=0.17193763919821828 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed2\trial_1\metrics.json (modified: 1766238751.933619)
[I 2025-12-20 14:52:33,287] Trial 1 finished with value: 0.17193763919821828 and parameters: {'learning_rate': 3.8715819228950395e-05, 'batch_size': 4, 'dropout': 0.18996186530645473, 'weight_decay': 0.0011358276629996847}. Best is trial 0 with value: 0.1727069351230425.


  0%|          | 0/2 [00:00<?, ?it/s]2025/12/20 14:52:33 INFO mlflow.tracking.fluent: Experiment with name 'test-hpo-random-seed3' does not exist. Creating a new experiment.
Best trial: 0. Best value: 0.420455:  50%|█████     | 1/2 [00:16<00:16, 16.02s/it, 16.02/1200 seconds]

Read macro-f1=0.42045454545454547 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed3\trial_0\metrics.json (modified: 1766238768.1839814)
[I 2025-12-20 14:52:49,316] Trial 0 finished with value: 0.42045454545454547 and parameters: {'learning_rate': 3.1985595622151474e-05, 'batch_size': 4, 'dropout': 0.24680558246273132, 'weight_decay': 0.016435689443727097}. Best is trial 0 with value: 0.42045454545454547.


Best trial: 1. Best value: 0.543485: 100%|██████████| 2/2 [00:31<00:00, 15.70s/it, 31.40/1200 seconds]
[I 2025-12-20 14:53:04,702] A new study created in memory with name: hpo_distilbert


Read macro-f1=0.543485251556699 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed3\trial_1\metrics.json (modified: 1766238783.5893004)
[I 2025-12-20 14:53:04,696] Trial 1 finished with value: 0.543485251556699 and parameters: {'learning_rate': 2.666741633882948e-05, 'batch_size': 4, 'dropout': 0.2973440134943125, 'weight_decay': 0.0010893282682592203}. Best is trial 1 with value: 0.543485251556699.


  0%|          | 0/2 [00:00<?, ?it/s]2025/12/20 14:53:04 INFO mlflow.tracking.fluent: Experiment with name 'test-hpo-random-seed4' does not exist. Creating a new experiment.
Best trial: 0. Best value: 0.147266:  50%|█████     | 1/2 [00:15<00:15, 15.70s/it, 15.70/1200 seconds]

Read macro-f1=0.14726643598615916 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed4\trial_0\metrics.json (modified: 1766238799.3067958)
[I 2025-12-20 14:53:20,405] Trial 0 finished with value: 0.14726643598615916 and parameters: {'learning_rate': 1.3931113588912513e-05, 'batch_size': 4, 'dropout': 0.2644198986672803, 'weight_decay': 0.02424485676076834}. Best is trial 0 with value: 0.14726643598615916.


Best trial: 1. Best value: 0.173451: 100%|██████████| 2/2 [00:33<00:00, 16.70s/it, 33.39/1200 seconds]
[I 2025-12-20 14:53:38,104] A new study created in memory with name: hpo_deberta


Read macro-f1=0.17345132743362832 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_distilbert\random_seed4\trial_1\metrics.json (modified: 1766238816.978101)
[I 2025-12-20 14:53:38,097] Trial 1 finished with value: 0.17345132743362832 and parameters: {'learning_rate': 3.36961892328147e-05, 'batch_size': 4, 'dropout': 0.26676953791326324, 'weight_decay': 0.003940034127547487}. Best is trial 1 with value: 0.17345132743362832.


Best trial: 0. Best value: 0.391924:  50%|█████     | 1/2 [00:20<00:20, 20.66s/it, 20.66/1200 seconds]

Read macro-f1=0.3919239904988124 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed0\trial_0\metrics.json (modified: 1766238837.5634422)
[I 2025-12-20 14:53:58,760] Trial 0 finished with value: 0.3919239904988124 and parameters: {'learning_rate': 3.142002884735033e-05, 'batch_size': 4, 'dropout': 0.28512463498882273, 'weight_decay': 0.0014535912902923192}. Best is trial 0 with value: 0.3919239904988124.


Best trial: 0. Best value: 0.391924: 100%|██████████| 2/2 [00:43<00:00, 21.96s/it, 43.91/1200 seconds]
[I 2025-12-20 14:54:22,022] A new study created in memory with name: hpo_deberta


Read macro-f1=0.36065573770491804 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed0\trial_1\metrics.json (modified: 1766238860.5005758)
[I 2025-12-20 14:54:22,016] Trial 1 finished with value: 0.36065573770491804 and parameters: {'learning_rate': 2.1702960842599027e-05, 'batch_size': 4, 'dropout': 0.2647474283618344, 'weight_decay': 0.02041799369415815}. Best is trial 0 with value: 0.3919239904988124.


Best trial: 0. Best value: 0.2125:  50%|█████     | 1/2 [00:21<00:21, 21.46s/it, 21.46/1200 seconds]

Read macro-f1=0.2125 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed1\trial_0\metrics.json (modified: 1766238882.2278357)
[I 2025-12-20 14:54:43,485] Trial 0 finished with value: 0.2125 and parameters: {'learning_rate': 3.884133117142114e-05, 'batch_size': 4, 'dropout': 0.18410462833758923, 'weight_decay': 0.039434909759804654}. Best is trial 0 with value: 0.2125.


Best trial: 0. Best value: 0.2125: 100%|██████████| 2/2 [00:42<00:00, 21.42s/it, 42.84/1200 seconds]
[I 2025-12-20 14:55:04,866] A new study created in memory with name: hpo_deberta


Read macro-f1=0.17589285714285713 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed1\trial_1\metrics.json (modified: 1766238903.6059089)
[I 2025-12-20 14:55:04,860] Trial 1 finished with value: 0.17589285714285713 and parameters: {'learning_rate': 3.748486524149794e-05, 'batch_size': 4, 'dropout': 0.2540398004284701, 'weight_decay': 0.002728126034237306}. Best is trial 0 with value: 0.2125.


Best trial: 0. Best value: 0.140535:  50%|█████     | 1/2 [00:21<00:21, 21.78s/it, 21.78/1200 seconds]

Read macro-f1=0.1405352798053528 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed2\trial_0\metrics.json (modified: 1766238925.4358475)
[I 2025-12-20 14:55:26,651] Trial 0 finished with value: 0.1405352798053528 and parameters: {'learning_rate': 1.1828932913875485e-05, 'batch_size': 4, 'dropout': 0.14444821652300557, 'weight_decay': 0.0010511504159694293}. Best is trial 0 with value: 0.1405352798053528.


Best trial: 0. Best value: 0.140535: 100%|██████████| 2/2 [00:42<00:00, 21.25s/it, 42.50/1200 seconds]
[I 2025-12-20 14:55:47,375] A new study created in memory with name: hpo_deberta


Read macro-f1=0.13130699088145897 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed2\trial_1\metrics.json (modified: 1766238946.1215396)
[I 2025-12-20 14:55:47,367] Trial 1 finished with value: 0.13130699088145897 and parameters: {'learning_rate': 1.2983767584960847e-05, 'batch_size': 4, 'dropout': 0.24320919819776202, 'weight_decay': 0.09089245437832538}. Best is trial 0 with value: 0.1405352798053528.


Best trial: 0. Best value: 0.0613208:  50%|█████     | 1/2 [00:19<00:19, 19.83s/it, 19.83/1200 seconds]

Read macro-f1=0.06132075471698114 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed3\trial_0\metrics.json (modified: 1766238966.0255678)
[I 2025-12-20 14:56:07,207] Trial 0 finished with value: 0.06132075471698114 and parameters: {'learning_rate': 1.1804432478297903e-05, 'batch_size': 4, 'dropout': 0.13308354570106568, 'weight_decay': 0.028294720067561986}. Best is trial 0 with value: 0.06132075471698114.


Best trial: 1. Best value: 0.117489: 100%|██████████| 2/2 [00:39<00:00, 19.68s/it, 39.36/1200 seconds] 
[I 2025-12-20 14:56:26,741] A new study created in memory with name: hpo_deberta


Read macro-f1=0.11748947569843093 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed3\trial_1\metrics.json (modified: 1766238985.5257633)
[I 2025-12-20 14:56:26,736] Trial 1 finished with value: 0.11748947569843093 and parameters: {'learning_rate': 1.0955426503749652e-05, 'batch_size': 4, 'dropout': 0.17274264365023, 'weight_decay': 0.00228575453683493}. Best is trial 1 with value: 0.11748947569843093.


Best trial: 0. Best value: 0.0869762:  50%|█████     | 1/2 [00:20<00:20, 20.88s/it, 20.88/1200 seconds]

Read macro-f1=0.0869761572690443 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed4\trial_0\metrics.json (modified: 1766239006.4434292)
[I 2025-12-20 14:56:47,623] Trial 0 finished with value: 0.0869761572690443 and parameters: {'learning_rate': 1.1384516669252764e-05, 'batch_size': 4, 'dropout': 0.1954501234222545, 'weight_decay': 0.003130041708182112}. Best is trial 0 with value: 0.0869761572690443.


Best trial: 1. Best value: 0.115202: 100%|██████████| 2/2 [00:40<00:00, 20.40s/it, 40.80/1200 seconds] 

Read macro-f1=0.11520223907547851 from c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\outputs\hpo_tests\random_deberta\random_seed4\trial_1\metrics.json (modified: 1766239026.3285778)
[I 2025-12-20 14:57:07,537] Trial 1 finished with value: 0.11520223907547851 and parameters: {'learning_rate': 2.8404808810620642e-05, 'batch_size': 4, 'dropout': 0.18002126175715644, 'weight_decay': 0.06256408800027316}. Best is trial 1 with value: 0.11520223907547851.

Backbone: distilbert

Random Seed Variant (seed 0) HPO Results:
Success: True
Trials completed: 2
Trials failed: 0
Best trial: 0
Best value: 0.3837
Best params: {'learning_rate': 1.4776966675680059e-05, 'batch_size': 4, 'dropout': 0.12769181279998454, 'weight_decay': 0.03293903673644672}

Random Seed Variant (seed 1) HPO Results:
Success: True
Trials completed: 2
Trials failed: 0
Best trial: 1
Best value: 0.1897
Best params: {'learning_rate': 3.7882648855248245e-05, 'batch_size': 4, 'dropout': 0.19088696978785835, 'weight_decay': 0




## Test Suite 3: K-Fold Cross-Validation Splits

This test validates that the k-fold CV splits are well formed: the splits are created, no fold is empty, and every sample is used.
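
The properties being checked can be stated precisely: no fold has an empty training or validation set, training and validation indices never overlap, and each sample index lands in exactly one validation set. A minimal sketch, assuming contiguous unshuffled folds (the project's own splitter may shuffle or stratify) and using hypothetical helper names:

```python
from typing import List, Tuple

Split = Tuple[List[int], List[int]]  # (train indices, validation indices)


def make_kfold_splits(n_samples: int, k: int) -> List[Split]:
    """Split range(n_samples) into k contiguous folds of near-equal size.

    The first n_samples % k folds receive one extra sample.
    """
    base, extra = divmod(n_samples, k)
    splits, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        splits.append((train, val))
        start += size
    return splits


def splits_are_valid(splits: List[Split], n_samples: int) -> bool:
    """Check for empty folds, train/val overlap, and full validation coverage."""
    for train, val in splits:
        if not train or not val or set(train) & set(val):
            return False
    seen = sorted(i for _, val in splits for i in val)
    return seen == list(range(n_samples))


splits = make_kfold_splits(n_samples=8, k=3)
print([len(val) for _, val in splits])  # [3, 3, 2]
print(splits_are_valid(splits, n_samples=8))  # True
```

With 8 samples and 3 folds the validation sets hold 3, 3, and 2 samples, which matches the validation set sizes reported by the edge case suite later in this notebook.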


In [None]:
# Use seed0 dataset for k-fold validation testing
seed0_dataset = deterministic_dataset / "seed0"

kfold_results = test_kfold_validation(
    dataset_path=seed0_dataset,
    hpo_config=hpo_config,
)

# Presentation: print results
if kfold_results:
    k_configured = hpo_config.get("k_fold", {}).get("n_splits", 3)
    print_kfold_results(kfold_results, k_configured)


K-Fold CV Validation (k=3):
Dataset: c:\Users\HOANG PHI LONG DANG\repos\resume-ner-azureml\dataset_tiny
Number of samples: 0
Splits created: 0
Splits valid: False
All folds non-empty: False
Success: False

Errors: ['K-fold validation failed: Training file not found: c:\\Users\\HOANG PHI LONG DANG\\repos\\resume-ner-azureml\\dataset_tiny\\train.json']


## Test Suite 4: Edge Case - k > n_samples

This test verifies that the pipeline gracefully handles the edge case where k (the number of folds) exceeds the number of samples.

**Configuration used**: `DEFAULT_RANDOM_SEED` from config, for reproducibility.
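
The expected behavior is a clear error raised up front rather than a crash deep inside training. A minimal sketch of such a guard follows. It assumes `ValueError` as the exception type (the notebook output shows only the message, not the type), and `check_fold_count` is a hypothetical function name.

```python
def check_fold_count(k: int, n_samples: int) -> None:
    """Raise a descriptive error when k folds cannot be built from n_samples.

    Hypothetical guard; the message mirrors the one in this notebook's output.
    """
    if k > n_samples:
        raise ValueError(
            f"Cannot create {k} folds with only {n_samples} samples. "
            "Reduce k or increase dataset size."
        )


try:
    check_fold_count(k=9, n_samples=8)
except ValueError as err:
    caught_message = str(err)
    print(f"Expected error caught: {caught_message}")
```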

In [8]:
# Use seed0 dataset for edge case testing
seed0_dataset = deterministic_dataset / "seed0"

k_too_large_results = test_edge_case_k_too_large(
    dataset_path=seed0_dataset,
)

# Presentation: print results
if k_too_large_results:
    print_edge_case_k_too_large_results(k_too_large_results)



Edge Case: k > n_samples (k=9, n_samples=8):
Success (expected error caught): True
Expected error: Cannot create 9 folds with only 8 samples. Reduce k or increase dataset size.


## Test Suite 5: Edge Cases Suite

This test checks for various edge cases: minimal k, batch size vs validation set size, and very small validation sets.

**Configuration used**: 
- Uses `MINIMAL_K_FOLDS` from config for minimal k test
- Uses `VERY_SMALL_VALIDATION_THRESHOLD` from config for small validation set detection
- Uses batch sizes from HPO config search space
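
The batch size and small validation set checks reduce to simple comparisons against the fold sizes. The sketch below reproduces them for an 8-sample, 3-fold split. `find_batch_size_issues` and `find_tiny_validation_sets` are hypothetical names, and the threshold of 2 is taken from the "≤2 samples" wording in this notebook's output rather than read from the config file.

```python
from typing import List


def find_batch_size_issues(batch_sizes: List[int], val_sizes: List[int]) -> List[str]:
    """Flag batch sizes that are at least as large as the smallest validation set."""
    min_val_size = min(val_sizes)
    return [
        f"batch_size={batch_size} >= min_val_size={min_val_size}"
        for batch_size in batch_sizes
        if batch_size >= min_val_size
    ]


def find_tiny_validation_sets(val_sizes: List[int], threshold: int) -> List[int]:
    """Return the validation set sizes at or below the threshold."""
    return [size for size in val_sizes if size <= threshold]


val_sizes = [3, 3, 2]  # 8 samples split into 3 folds
batch_issues = find_batch_size_issues(batch_sizes=[4], val_sizes=val_sizes)
tiny_sets = find_tiny_validation_sets(val_sizes, threshold=2)
print(batch_issues)  # ['batch_size=4 >= min_val_size=2']
print(tiny_sets)  # [2]
```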


In [9]:
# Use seed0 dataset for edge case testing
seed0_dataset = deterministic_dataset / "seed0"

edge_case_results = test_edge_cases_suite(
    dataset_path=seed0_dataset,
    hpo_config=hpo_config,
    train_config=train_config,
)

# Presentation: print results
if edge_case_results:
    print_edge_case_results(edge_case_results)



Edge Case Test Results:

1. Minimal k (k=2):
   Success: True
   Splits valid: True
   Fold 0: train=4, val=4
   Fold 1: train=4, val=4

2. Batch Size vs Validation Set Size:
   Batch sizes: [4]
   Validation set sizes: [3, 3, 2]
   Min val size: 2, Max val size: 3
   ⚠️  Potential issues:
      - batch_size=4 >= min_val_size=2 (may cause issues)

3. Very Small Validation Sets (≤2 samples):
   Count: 1
   Sizes: [2]
   ⚠️  Validation sets with 1-2 samples may produce unstable metrics

Overall success: False


## Summary: Test Results

Aggregate all test results and display the final summary.
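
One plausible reading of the overall status is a conjunction: the run passes only if every suite that actually ran passed. A sketch under that assumption, with a hypothetical `overall_status` helper that ignores suites whose result is `None` (the real `collect_test_results` may treat missing results differently):

```python
from typing import Dict, Optional


def overall_status(suite_passed: Dict[str, Optional[bool]]) -> bool:
    """Return True only if at least one suite ran and every suite that ran passed.

    A value of None marks a suite that was not run.
    """
    ran = {name: ok for name, ok in suite_passed.items() if ok is not None}
    return bool(ran) and all(ran.values())


status = overall_status({
    "seed0_hpo": None,  # not run
    "random_seed_hpo": True,
    "kfold_validation": False,
    "k_too_large": True,
    "edge_cases": False,
})
print("ALL TESTS PASSED" if status else "SOME TESTS FAILED")  # SOME TESTS FAILED
```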


In [11]:
# Aggregate all test results
# Note: seed0_results is defined in Cell 6. If that cell was not run, fall back to None.
if "seed0_results" not in globals():
    seed0_results = None
    print("Warning: seed0_results not found. Make sure Cell 6 (HPO with Seed0 Dataset) has been executed.")

# Results from the other suites may also be undefined if their cells were skipped; default them to None.
random_seed_results_val = globals().get("random_seed_results")
kfold_results_val = globals().get("kfold_results")
k_too_large_results_val = globals().get("k_too_large_results")
edge_case_results_val = globals().get("edge_case_results")

test_summary = collect_test_results(
    deterministic_results=seed0_results,  # seed0_results used for backward compatibility
    random_seed_results=random_seed_results_val,
    kfold_results=kfold_results_val,
    k_too_large_results=k_too_large_results_val,
    edge_case_results=edge_case_results_val,
)

test_details = build_test_details(
    deterministic_results=seed0_results,  # seed0_results used for backward compatibility
    random_seed_results=random_seed_results_val,
    kfold_results=kfold_results_val,
)

# Print final summary
print_test_summary(
    test_summary=test_summary,
    test_details=test_details,
)


HPO Pipeline Testing Summary

Test Results:
------------------------------------------------------------
2. HPO with random seed variant(s): ✓ PASS
   - Seeds tested: [0, 1, 2, 3, 4]
   - Total trials completed: 10
   - Best values: seed0: 0.3837, seed1: 0.1897, seed2: 0.1727, seed3: 0.5435, seed4: 0.1735
3. K-fold CV validation: ✗ FAIL
   - Splits valid: False
   - All folds non-empty: False
4. Edge case (k > n_samples): ✓ PASS
5. Edge cases overall: ✗ FAIL
------------------------------------------------------------

Overall Status: ✗ SOME TESTS FAILED

⚠️  Some tests failed. Review the detailed output above.
   Common issues:
   - Dataset files not found (run tests/00_make_tiny_dataset.ipynb)
   - K-fold CV issues with small datasets
   - Batch size >= validation set size


In [None]:
# Uncomment to run all tests at once:
# from tests.integration.orchestrators.test_orchestrator import run_all_tests
# 
# results = run_all_tests(
#     root_dir=ROOT_DIR,
#     random_seeds=RANDOM_SEEDS_TO_TEST,
# )
# 
# print_test_summary(
#     test_summary=results["test_summary"],
#     test_details=results["test_details"],
# )
