[llmd] Rework the test harness and its surrounding #69

Merged
kpouget merged 34 commits into openshift-psap:main from kpouget:llm-d
Jun 3, 2026
Conversation

@kpouget
Contributor

@kpouget kpouget commented Jun 2, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added HuggingFace model cache preparation with token authentication support
    • Added smoke request testing framework for inference validation
    • Enhanced LLM inference service deployment with gateway and scheduler profile support
    • Added GPU operator and Node Feature Discovery (NFD) bootstrapping tooling
    • Added Tekton pipeline notification controls and configurable artifact exports
  • Improvements

    • Refactored vault configuration to support mandatory and optional vault categories
    • Enhanced cluster resource cleanup with selective and comprehensive deletion modes
    • Improved diagnostics capture for inference services and cluster state

@openshift-ci

openshift-ci Bot commented Jun 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ashishkamra for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Warning

Review limit reached

@kpouget, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 54 minutes and 47 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8da4bdc6-76d4-4455-8a90-36f3c584cf87

📥 Commits

Reviewing files that changed from the base of the PR and between f501cbc and c54af5a.

📒 Files selected for processing (54)
  • docs/toolbox/dsl.md
  • fournos/gitops/base/workflows/pipeline-prepare-only.yaml
  • fournos/gitops/base/workflows/pipeline-prepare-test.yaml
  • fournos/gitops/base/workflows/task-forge-step.yaml
  • projects/cluster/toolbox/cluster_deploy_operator/main.py
  • projects/cluster/toolbox/deploy_custom_catalog/main.py
  • projects/cluster/toolbox/wait_for_crds/main.py
  • projects/core/ci_entrypoint/prepare_ci.py
  • projects/core/dsl/runtime.py
  • projects/core/dsl/shell.py
  • projects/core/dsl/task.py
  • projects/core/dsl/toolbox.py
  • projects/core/dsl/utils/k8s.py
  • projects/core/library/run.py
  • projects/core/orchestration/__init__.py
  • projects/core/orchestration/utils/__init__.py
  • projects/core/orchestration/utils/k8s.py
  • projects/gpu_operator/toolbox/bootstrap_gpu_clusterpolicy/main.py
  • projects/gpu_operator/toolbox/bootstrap_nfd_instance/main.py
  • projects/guidellm/toolbox/run_guidellm_benchmark/main.py
  • projects/guidellm/toolbox/run_guidellm_benchmark/templates/guidellm_copy_pod.yaml.j2
  • projects/guidellm/toolbox/run_guidellm_benchmark/templates/guidellm_job.yaml.j2
  • projects/guidellm/toolbox/run_guidellm_benchmark/templates/guidellm_pvc.yaml.j2
  • projects/guidellm/toolbox/run_guidellm_benchmark/utils.py
  • projects/guidellm/toolbox/run_smoke_request/main.py
  • projects/guidellm/toolbox/run_smoke_request/utils.py
  • projects/jump_ci/testing/test.py
  • projects/kserve/toolbox/deploy_llmisvc/main.py
  • projects/kserve/toolbox/deploy_llmisvc/utils.py
  • projects/kserve/toolbox/ensure_gateway/main.py
  • projects/kserve/toolbox/prepare_hf_model_cache/main.py
  • projects/kserve/toolbox/prepare_hf_model_cache/templates/verify_pod_override.yaml.j2
  • projects/kserve/toolbox/prepare_hf_model_cache/utils.py
  • projects/legacy/library/run.py
  • projects/llm_d/orchestration/ci.py
  • projects/llm_d/orchestration/cleanup_phase.py
  • projects/llm_d/orchestration/cli.py
  • projects/llm_d/orchestration/config.d/platform.yaml
  • projects/llm_d/orchestration/config.d/runtime.yaml
  • projects/llm_d/orchestration/config.d/workloads.yaml
  • projects/llm_d/orchestration/config.yaml
  • projects/llm_d/orchestration/manifests/datasciencecluster.yaml
  • projects/llm_d/orchestration/manifests/gpu-clusterpolicy.yaml
  • projects/llm_d/orchestration/manifests/nfd-nodefeaturediscovery.yaml
  • projects/llm_d/orchestration/phase_inputs.py
  • projects/llm_d/orchestration/prepare_phase.py
  • projects/llm_d/orchestration/runtime_config.py
  • projects/llm_d/orchestration/test_phase.py
  • projects/llm_d/orchestration/utils.py
  • projects/llm_d/toolbox/capture_prepare_state/main.py
  • projects/llm_d/toolbox/cleanup_test_resources/main.py
  • projects/rhoai/toolbox/apply_datasciencecluster/main.py
  • projects/rhoai/toolbox/wait_datasciencecluster_ready/main.py
  • pyproject.toml
📝 Walkthrough

Broad refactor: new DSL/k8s helpers, retry/logging tweaks, and entrypoint migration. Removed legacy llm_d runtime, added RHOAI/KServe/HF cache toolboxes, updated orchestration to runtime_config, and introduced Tekton prepare-only pipeline plus notification controls.

Changes

Unified orchestration and tooling refactor

| Layer / File(s) | Summary |
|---|---|
| End-to-end refactor: DSL, utilities, orchestration, and toolboxes<br>fournos/gitops/..., projects/core/..., projects/cluster/..., projects/gpu_operator/..., projects/guidellm/..., projects/kserve/..., projects/rhoai/..., projects/llm_d/..., vaults/psap-forge-hf.yaml | Adds Tekton prepare-only pipeline and task param; extends DSL logging/retry/shell; introduces k8s utils; migrates many toolboxes to @entrypoint; adds RHOAI/KServe/HF-cache flows; refactors llm_d orchestration to new runtime_config; removes legacy llm_d runtime/manifests/tests; updates configs and vault. |

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • openshift-psap/forge#35 — Also modifies DSL task execution/retry logic, overlapping with this PR’s retry/interrupt updates.
  • openshift-psap/forge#18 — Earlier work on vault framework extended here with strict-validation and new init parameters.
  • openshift-psap/forge#26 — Adjusts DSL runtime context handling, related to this PR’s execution metadata/logging changes.

Poem

A rabbit taps the Tekton drum—prepare, then stash the art,
New DSL wands wave “oc” spells, retries smart and tart.
KServe gates swing open wide, RHOAI lights the way,
HF caches hum and glow, pods dance in array.
With footprints small but swift—hop!—the cluster’s set today. 🐇✨


@kpouget
Contributor Author

kpouget commented Jun 2, 2026

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only
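
The three-line comment above is a set of slash directives that the CI bot later echoes back under "Test configuration". The bot's actual parser is not shown in this thread, so the following is only a minimal sketch of how such a comment could be split into directives. It assumes one directive per line, and `parse_directives` is a hypothetical helper name.

```python
def parse_directives(comment: str) -> dict[str, list[str]]:
    """Split a PR comment into {directive: [arguments]} (hypothetical helper)."""
    directives: dict[str, list[str]] = {}
    for line in comment.splitlines():
        line = line.strip()
        # Only lines that start with "/" are treated as directives.
        if not line.startswith("/"):
            continue
        parts = line[1:].split()
        if not parts:
            continue  # a lone "/" carries no directive name
        name, args = parts[0], parts[1:]
        directives[name] = args
    return directives


comment = """/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only"""
parsed = parse_directives(comment)
```

With this sketch, `parsed["pipeline"]` holds the pipeline name, which is the value that changes between the `forge-prepare-only`, `forge-test-only`, and `forge-prepare-test` runs in this thread.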

@psap-forge-bot

psap-forge-bot Bot commented Jun 2, 2026

🔴 Test of 'fournos_launcher submit' failed after 00 hours 02 minutes 11 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

• Failure indicator: Empty.
Execution logs

@kpouget
Contributor Author

kpouget commented Jun 2, 2026

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

@psap-forge-bot

psap-forge-bot Bot commented Jun 2, 2026

🔴 Test of 'llm_d export-artifacts' failed after 00 hours 00 minutes 00 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

ci_job.cluster: athena-fire
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260602-192840
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Failure indicator:

## /workspace/artifacts/001__export-artifacts/FAILURE 
--- 📍RuntimeError STACKTRACE ---
--- 📍VaultManager not initialized. Call vault.init() first.

   Traceback (most recent call last):
     File "/app/forge/projects/core/library/ci.py", line 100, in wrapper
       exit_code = command_func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/export.py", line 117, in caliper_export_command
       status = run_caliper_orchestration_export(artifact_directory=artifact_directory)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/export.py", line 101, in run_caliper_orchestration_export
       return run_from_orchestration_config(caliper_cfg)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/caliper/orchestration/export.py", line 76, in run_from_orchestration_config
       mlflow_secrets_path = vault_lib.get_vault_content_path(vault_name, vault_mlflow_secret)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/vault.py", line 471, in get_vault_content_path
       return get_vault_manager().get_vault_content_path(vault_name, content_name)
              ^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/vault.py", line 457, in get_vault_manager
       raise RuntimeError("VaultManager not initialized. Call vault.init() first.")
   RuntimeError: VaultManager not initialized. Call vault.init() first.

[...]

Execution logs
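
The traceback above shows a module-level accessor raising because the manager was never initialized before `export-artifacts` ran. The real code in `projects/core/library/vault.py` is not shown in this thread, so the block below is only a minimal sketch of that guard pattern. The simplified `VaultManager`, its constructor argument, and the `hf-token` content name are assumptions made for illustration.

```python
from pathlib import Path


class VaultManager:
    """Hypothetical stand-in, not the forge implementation."""

    def __init__(self, vault_dirs):
        # Map each vault name to the directory holding its content files.
        self._vault_dirs = {name: Path(path) for name, path in vault_dirs.items()}

    def get_vault_content_path(self, vault_name, content_name):
        return self._vault_dirs[vault_name] / content_name


_manager = None  # module-level singleton, unset until init() runs


def init(vault_dirs):
    """Create the singleton; each command entrypoint has to call this first."""
    global _manager
    _manager = VaultManager(vault_dirs)
    return _manager


def get_vault_manager():
    # Fail loudly rather than returning None when init() was skipped.
    if _manager is None:
        raise RuntimeError("VaultManager not initialized. Call vault.init() first.")
    return _manager
```

Under this pattern, any command whose entrypoint skips `init()` fails on its first vault lookup, which is consistent with the failure being reported after 0 seconds.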

@psap-forge-bot

psap-forge-bot Bot commented Jun 2, 2026

🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 30 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

• Failure indicator: Empty.
Execution logs

@psap-forge-bot

psap-forge-bot Bot commented Jun 2, 2026

🔴 Test of 'llm_d export-artifacts' failed after 00 hours 00 minutes 00 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

ci_job.cluster: athena-fire
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260602-192840
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Failure indicator:

## /workspace/artifacts/001__export-artifacts/FAILURE 
--- 📍RuntimeError STACKTRACE ---
--- 📍VaultManager not initialized. Call vault.init() first.

   Traceback (most recent call last):
     File "/app/forge/projects/core/library/ci.py", line 100, in wrapper
       exit_code = command_func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/export.py", line 117, in caliper_export_command
       status = run_caliper_orchestration_export(artifact_directory=artifact_directory)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/export.py", line 101, in run_caliper_orchestration_export
       return run_from_orchestration_config(caliper_cfg)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/caliper/orchestration/export.py", line 76, in run_from_orchestration_config
       mlflow_secrets_path = vault_lib.get_vault_content_path(vault_name, vault_mlflow_secret)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/vault.py", line 471, in get_vault_content_path
       return get_vault_manager().get_vault_content_path(vault_name, content_name)
              ^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/vault.py", line 457, in get_vault_manager
       raise RuntimeError("VaultManager not initialized. Call vault.init() first.")
   RuntimeError: VaultManager not initialized. Call vault.init() first.

[...]

Execution logs

@kpouget
Contributor Author

kpouget commented Jun 2, 2026

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

@psap-forge-bot

psap-forge-bot Bot commented Jun 2, 2026

🔴 Test of 'llm_d export-artifacts' failed after 00 hours 00 minutes 00 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

ci_job.cluster: athena-fire
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260602-194445
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Failure indicator:

## /workspace/artifacts/001__export-artifacts/FAILURE 
--- 📍RuntimeError STACKTRACE ---
--- 📍VaultManager not initialized. Call vault.init() first.

   Traceback (most recent call last):
     File "/app/forge/projects/core/library/ci.py", line 100, in wrapper
       exit_code = command_func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/export.py", line 117, in caliper_export_command
       status = run_caliper_orchestration_export(artifact_directory=artifact_directory)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/export.py", line 101, in run_caliper_orchestration_export
       return run_from_orchestration_config(caliper_cfg)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/caliper/orchestration/export.py", line 76, in run_from_orchestration_config
       mlflow_secrets_path = vault_lib.get_vault_content_path(vault_name, vault_mlflow_secret)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/vault.py", line 471, in get_vault_content_path
       return get_vault_manager().get_vault_content_path(vault_name, content_name)
              ^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/core/library/vault.py", line 457, in get_vault_manager
       raise RuntimeError("VaultManager not initialized. Call vault.init() first.")
   RuntimeError: VaultManager not initialized. Call vault.init() first.

[...]

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented Jun 2, 2026

🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 31 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

• Failure indicator: Empty.
Execution logs

@kpouget
Contributor Author

kpouget commented Jun 2, 2026

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

@psap-forge-bot

psap-forge-bot Bot commented Jun 2, 2026

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 01 minutes 52 seconds 🟢

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

Execution logs

@kpouget
Contributor Author

kpouget commented Jun 3, 2026

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

@psap-forge-bot

psap-forge-bot Bot commented Jun 3, 2026

🔴 Test of 'llm_d prepare' failed after 00 hours 00 minutes 08 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

ci_job.cluster: athena-fire
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260603-055615
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Failure indicator:

## /workspace/artifacts/000__prepare/FAILURE 
--- 📍KeyError STACKTRACE ---
--- 📍'wait_timeout_seconds'

   Traceback (most recent call last):
     File "/app/forge/projects/core/library/ci.py", line 100, in wrapper
       exit_code = command_func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/llm_d/orchestration/ci.py", line 249, in prepare
       return run_prepare_phase()
              ^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/llm_d/orchestration/ci.py", line 118, in run_prepare_phase
       return run_prepare_sequence()
              ^^^^^^^^^^^^^^^^^^^^^^
     File "/app/forge/projects/llm_d/orchestration/prepare_sequence.py", line 14, in run_prepare_sequence
       prepare_phase.prepare_nfd()
     File "/app/forge/projects/llm_d/orchestration/prepare_phase.py", line 132, in prepare_nfd
       timeout_seconds=operator_spec["wait_timeout_seconds"],
                       ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
   KeyError: 'wait_timeout_seconds'

[...]

Execution logs
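
The `KeyError` above comes from indexing `operator_spec["wait_timeout_seconds"]` on a spec that lacks that key. The thread does not show how the PR resolved it, so the following is only a sketch of one way to surface the problem with a clearer message. `resolve_wait_timeout` and the default value of 900 seconds are hypothetical and not taken from the PR.

```python
DEFAULT_WAIT_TIMEOUT_SECONDS = 900  # hypothetical default, not from the PR


def resolve_wait_timeout(operator_spec, operator_name="<unknown>", default=None):
    """Return the operator's wait timeout, or fail with a descriptive error."""
    if "wait_timeout_seconds" in operator_spec:
        return int(operator_spec["wait_timeout_seconds"])
    if default is not None:
        return default
    # Name the operator and list the keys that are present, so the
    # misconfigured entry can be located in the YAML config.
    raise ValueError(
        f"operator '{operator_name}' is missing 'wait_timeout_seconds'; "
        f"available keys: {sorted(operator_spec)}"
    )
```

Whether a missing timeout should fall back to a default or abort the prepare phase is a design choice; the sketch supports both by leaving `default` unset unless the caller opts in.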

@psap-forge-bot

psap-forge-bot Bot commented Jun 3, 2026

🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 29 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

• Failure indicator: Empty.
Execution logs

@kpouget
Contributor Author

kpouget commented Jun 3, 2026

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

@psap-forge-bot

psap-forge-bot Bot commented Jun 3, 2026

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 01 minutes 39 seconds 🟢

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-only

Execution logs

@kpouget kpouget force-pushed the llm-d branch 2 times, most recently from 013b344 to e72f893 June 3, 2026 16:41
@kpouget
Contributor Author

kpouget commented Jun 3, 2026

/test fournos llm_d
/cluster athena-fire
/pipeline forge-test-only

@psap-forge-bot

psap-forge-bot Bot commented Jun 3, 2026

🟢 Test of 'llm_d test' succeeded after 00 hours 11 minutes 56 seconds 🟢

• Link to the test results.

• No reports generated...

Test configuration:

ci_job.cluster: athena-fire
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260603-164257
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented Jun 3, 2026

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 13 minutes 19 seconds 🟢

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster athena-fire
/pipeline forge-test-only

Execution logs

@openshift-psap openshift-psap deleted a comment from openshift-ci Bot Jun 3, 2026
@openshift-psap openshift-psap deleted a comment from psap-forge-bot Bot Jun 3, 2026
@openshift-psap openshift-psap deleted a comment from psap-forge-bot Bot Jun 3, 2026
@kpouget
Contributor Author

kpouget commented Jun 3, 2026

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-test

@psap-forge-bot

psap-forge-bot Bot commented Jun 3, 2026

🟢 Test of 'llm_d test' succeeded after 00 hours 06 minutes 16 seconds 🟢

• Link to the test results.

• No reports generated...

Test configuration:

ci_job.cluster: athena-fire
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260603-192323
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented Jun 3, 2026

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 09 minutes 06 seconds 🟢

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster athena-fire
/pipeline forge-prepare-test

Execution logs

@openshift-psap openshift-psap deleted a comment from psap-forge-bot Bot Jun 3, 2026
@openshift-psap openshift-psap deleted a comment from psap-forge-bot Bot Jun 3, 2026
@kpouget kpouget merged commit 8db5c6b into openshift-psap:main Jun 3, 2026
6 of 7 checks passed
@kpouget kpouget deleted the llm-d branch June 3, 2026 20:22