
Adding Forge rhaiis test entrypoint #23

Open

MML-coder wants to merge 3 commits into openshift-psap:main from MML-coder:rhaiis-pr

Conversation

@MML-coder commented Apr 13, 2026 (Collaborator)

This PR only supports the Forge test workflow: it deploys vLLM, runs the benchmark, and cleans up. It does NOT include:

  • prepare workflow (operator installation) - stubbed only
  • cleanup workflow - basic implementation
  • llm-d integration - design doc included, implementation separate
How It Works

ConfigLoader resolves model config via inheritance chain:
defaults.yaml → accelerator[nvidia|amd] → model → model.accelerator_overrides
BenchmarkWorkflow executes steps sequentially:
DeployVLLM → WaitForReady → RunGuideLLM → (finally) CollectArtifacts → Cleanup
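The inheritance chain above boils down to repeated deep merges of nested dicts, with later layers winning. A minimal sketch (the `deep_merge` helper and the sample keys are illustrative assumptions, not the actual ConfigLoader API):

```python
# Sketch of the inheritance chain: defaults -> accelerator -> model -> overrides.
# deep_merge and the sample config keys are illustrative assumptions.
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge override into base, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


defaults = {"vllm_args": {"max-model-len": 4096}, "replicas": 1}
accelerator = {"vllm_args": {"dtype": "float16"}}
model = {"vllm_args": {"max-model-len": 8192}}
overrides = {"vllm_args": {"dtype": "bfloat16"}}

resolved = defaults
for layer in (accelerator, model, overrides):
    resolved = deep_merge(resolved, layer)

print(resolved)
# {'vllm_args': {'max-model-len': 8192, 'dtype': 'bfloat16'}, 'replicas': 1}
```

Each layer only overrides the keys it sets; nested dicts such as `vllm_args` are merged rather than replaced wholesale.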
```shell
# Single model + workload
KUBECONFIG=~/kubeconfigs/my-cluster \
PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model qwen-0.6b --workload balanced --accelerator nvidia

# Multiple workloads (deploy-once pattern)
PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model llama-3.3-70b-fp8 --workloads balanced,short,long-prompt
```

Summary

  • RHAIIS test entrypoint for Forge workflow engine
  • Updated implementation documentation with class interface diagrams
  • Added unit tests readme

Changes

  • forge rhaiis test entrypoint - Main orchestration entry point
  • IMPLEMENTATION.md - Class diagrams and architecture documentation
  • Unit tests documentation

Reopened from #13 (moved to fork)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive configuration system for models, workloads, and deployment defaults with multi-level inheritance and accelerator-specific overrides.
    • Introduced workflow orchestration engine supporting sequential step execution with cleanup/finalization support.
    • Added automated vLLM deployment and readiness validation on OpenShift via KServe.
    • Added GuideLLM benchmark execution as a workflow step with configurable workloads and performance metrics collection.
    • Added RHAIIS CLI tools for manual and CI-driven benchmarking orchestration (prepare, test, cleanup phases).
    • Added automatic OpenShift operator installation (NFD, GPU, RHOAI) for environment setup.
    • Added artifact collection, pod logging, and deployment cleanup automation.
  • Documentation

    • Added comprehensive implementation and extensibility guides for the RHAIIS benchmarking framework.
  • Tests

    • Added extensive test coverage for configuration loading, scenario generation, workflows, CLI tools, and all orchestration components.

@openshift-ci

openshift-ci bot commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mml-coder for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough


Introduces a comprehensive workflow orchestration system for RHAIIS benchmarking on OpenShift/Kubernetes, including: YAML configuration infrastructure (defaults, models, workloads), a reusable workflow engine with context/step/executor abstractions, a scenario generation system supporting matrix expansion and config inheritance, OpenShift CLI utilities with retry logic, concrete steps for vLLM deployment/monitoring and artifact collection, and RHAIIS-specific implementations combining these systems for prepare/benchmark/cleanup phases.
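The retry behavior described for the OpenShift CLI utilities can be sketched as follows; `run_oc`, the transient-error markers, and the linear backoff are assumptions for illustration, not the actual `projects/core/utils/oc.py` interface:

```python
# Sketch of retry-on-transient-failure around an `oc` invocation.
# run_oc, TRANSIENT_MARKERS, and the backoff policy are illustrative assumptions.
import subprocess
import time

TRANSIENT_MARKERS = ("connection refused", "tls handshake timeout", "i/o timeout")


def run_oc(args, retries=3, delay=2.0, run=subprocess.run):
    """Run `oc`, retrying only when stderr looks like a transient failure."""
    result = None
    for attempt in range(1, retries + 1):
        result = run(["oc", *args], capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if not any(m in result.stderr.lower() for m in TRANSIENT_MARKERS):
            break  # permanent error: do not retry
        if attempt < retries:
            time.sleep(delay * attempt)  # linear backoff between attempts
    return result


# Demo with an injected fake runner: first call fails transiently, second succeeds.
class FakeResult:
    def __init__(self, returncode, stderr=""):
        self.returncode = returncode
        self.stderr = stderr


attempts = []


def flaky(cmd, capture_output=True, text=True):
    attempts.append(cmd)
    if len(attempts) == 1:
        return FakeResult(1, "dial tcp: connection refused")
    return FakeResult(0)


result = run_oc(["get", "pods", "-n", "demo"], delay=0, run=flaky)
print(result.returncode, len(attempts))
# 0 2
```

Injecting the runner keeps the retry policy testable without a cluster; a permanent error (stderr not matching a transient marker) returns immediately instead of burning retries.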

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Configuration Registry**<br>`config/rhaiis/defaults.yaml`, `config/rhaiis/models.yaml`, `config/rhaiis/workloads.yaml` | New YAML configuration files defining deployment defaults, model registry with per-model vLLM settings and accelerator overrides, and workload profiles with GuideLLM parameters and constraints. |
| **CLI Entrypoint Update**<br>`projects/core/ci_entrypoint/run_ci.py` | Modified Click command decorator to ignore unknown options and disable interspersed args; added `type=click.UNPROCESSED` to args parameter for explicit token pass-through. |
| **Workflow Engine Foundation**<br>`projects/core/workflow/__init__.py`, `projects/core/workflow/context.py`, `projects/core/workflow/step.py`, `projects/core/workflow/workflow.py`, `projects/core/workflow/executor.py` | New workflow execution framework with context management, step abstraction, workflow definition, and sequential executor handling normal/finally step phases with result aggregation. |
| **Utilities & Infrastructure**<br>`projects/core/utils/__init__.py`, `projects/core/utils/oc.py` | New OpenShift CLI wrapper with retry configuration, error detection, and convenience methods; supports retries on transient failures and optional timeout retries. |
| **Core Step Implementations**<br>`projects/core/steps/__init__.py`, `projects/core/steps/artifacts.py`, `projects/core/steps/guidellm.py` | Workflow steps for collecting pod logs/events, cleaning up deployments, and running GuideLLM benchmarks; includes pod status monitoring and artifact collection via oc. |
| **Scenario Generation System**<br>`projects/core/scenarios/__init__.py`, `projects/core/scenarios/config.py`, `projects/core/scenarios/config_loader.py`, `projects/core/scenarios/generator.py` | Implements YAML-based benchmark scenario generation with multi-level config inheritance (global defaults → accelerator → model → model-accelerator overrides), matrix expansion across workload/routing/tensor-parallel, and deployment grouping with vLLM arg merging. |
| **RHAIIS Project Orchestration**<br>`projects/rhaiis/__init__.py`, `projects/rhaiis/orchestration/ci.py`, `projects/rhaiis/orchestration/cli.py`, `projects/rhaiis/orchestration/test_rhaiis.py`, `projects/rhaiis/orchestration/__init__.py` | CLI entry points and shared test orchestration logic supporting prepare/test/cleanup phases; test phase supports both single-model and multi-workload deployment-reuse patterns. |
| **RHAIIS Workflows**<br>`projects/rhaiis/workflows/__init__.py`, `projects/rhaiis/workflows/benchmark.py`, `projects/rhaiis/workflows/cleanup.py`, `projects/rhaiis/workflows/prepare.py` | Workflow definitions composing install/deploy/benchmark/cleanup steps; includes operator installation, vLLM deployment via KServe, and namespace cleanup. |
| **RHAIIS Step Implementations**<br>`projects/rhaiis/workflows/steps/__init__.py`, `projects/rhaiis/workflows/steps/operators.py`, `projects/rhaiis/workflows/steps/deploy.py`, `projects/rhaiis/workflows/steps/cleanup.py` | Concrete workflow steps for NFD/GPU/RHOAI operator installation, vLLM KServe deployment with health checks, namespace cleanup, and pod readiness polling. |
| **Comprehensive Test Suite**<br>`tests/__init__.py`, `tests/core/__init__.py`, `tests/core/scenarios/...`, `tests/core/steps/...`, `tests/core/utils/...`, `tests/core/workflow/...`, `tests/rhaiis/...` | Extensive unit tests covering config loading with inheritance, scenario expansion and determinism, workflow execution phases, step result handling, OpenShift CLI retry/error behavior, operator installation, vLLM deployment manifests, and CLI integration. |
| **Documentation**<br>`projects/rhaiis/IMPLEMENTATION.md`, `projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md` | Implementation guide describing workflow orchestration, config inheritance, artifact structure, and extensibility analysis for llm-d integration. |
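The matrix expansion described for the Scenario Generation System is essentially a Cartesian product over the configured axes. A sketch (the keys and the `scenario_id` format are illustrative assumptions, not the actual generator output):

```python
# Sketch of scenario matrix expansion across workload x routing x tensor_parallel.
# The axis keys and scenario_id scheme are illustrative assumptions.
from itertools import product

matrix = {
    "workload": ["balanced", "short"],
    "routing": ["default"],
    "tensor_parallel": [1, 2],
}

scenarios = [
    {"scenario_id": f"{w}-{r}-tp{tp}", "workload": w,
     "routing": r, "tensor_parallel": tp}
    for w, r, tp in product(matrix["workload"],
                            matrix["routing"],
                            matrix["tensor_parallel"])
]

print(len(scenarios))                # 2 workloads x 1 routing x 2 TP = 4
print(scenarios[-1]["scenario_id"])  # short-default-tp2
```

Because `product` iterates in a fixed order, the expansion is deterministic, which is what the test suite's determinism checks rely on.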

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Executor
    participant Workflow
    participant Step1
    participant Step2
    participant Finally1

    User->>Executor: execute(workflow)
    activate Executor

    Executor->>Workflow: steps (lazy init)
    activate Workflow
    Workflow->>Workflow: define_steps()
    Workflow-->>Executor: [Step1, Step2]
    deactivate Workflow

    Executor->>Step1: execute(ctx)
    activate Step1
    Step1-->>Executor: StepResult(success=true)
    deactivate Step1

    Executor->>Step2: execute(ctx)
    activate Step2
    Step2-->>Executor: StepResult(success=true)
    deactivate Step2

    Executor->>Workflow: finally_steps
    activate Workflow
    Workflow-->>Executor: [Finally1]
    deactivate Workflow

    Executor->>Finally1: execute(ctx)
    activate Finally1
    Finally1-->>Executor: StepResult(success=true)
    deactivate Finally1

    Executor->>Executor: aggregate results
    Executor-->>User: WorkflowResult
    deactivate Executor
```
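The executor flow in the diagram above can be sketched as follows (`StepResult` and `run_workflow` are illustrative names, not the actual Forge classes):

```python
# Minimal sketch of the executor flow: normal steps run in order, a failure
# skips the remaining normal steps, and finally steps always run afterwards.
# StepResult and run_workflow are illustrative, not the actual Forge API.
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    success: bool
    error: str = ""


def run_workflow(steps, finally_steps, ctx):
    results = []
    failed = False
    for name, fn in steps:
        if failed:
            continue  # skip remaining normal steps after a failure
        try:
            fn(ctx)
            results.append(StepResult(name, True))
        except Exception as exc:
            results.append(StepResult(name, False, str(exc)))
            failed = True
    for name, fn in finally_steps:  # cleanup runs regardless of failures
        try:
            fn(ctx)
            results.append(StepResult(name, True))
        except Exception as exc:
            results.append(StepResult(name, False, str(exc)))
    return results


def failing_benchmark(ctx):
    raise RuntimeError("pod failed")


ctx = {}
results = run_workflow(
    steps=[("deploy_vllm", lambda c: c.update(deployed=True)),
           ("run_guidellm", failing_benchmark),
           ("collect_metrics", lambda c: None)],
    finally_steps=[("cleanup", lambda c: c.update(cleaned=True))],
    ctx=ctx,
)
print([(r.name, r.success) for r in results])
# [('deploy_vllm', True), ('run_guidellm', False), ('cleanup', True)]
```

Note that `collect_metrics` never runs (and produces no result) because the preceding step failed, while the finally-phase `cleanup` still executes.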
```mermaid
sequenceDiagram
    participant App
    participant ScenarioGenerator
    participant ConfigLoader
    participant DefaultsYAML as defaults.yaml
    participant ModelsYAML as models.yaml
    participant WorkloadsYAML as workloads.yaml

    App->>ScenarioGenerator: __init__(config_dir, accelerator)
    activate ScenarioGenerator

    ScenarioGenerator->>ConfigLoader: __init__(config_dir, accelerator)
    activate ConfigLoader
    ConfigLoader->>DefaultsYAML: load (cached)
    ConfigLoader->>ModelsYAML: load (cached)
    ConfigLoader->>WorkloadsYAML: load (cached)
    deactivate ConfigLoader
    deactivate ScenarioGenerator

    App->>ScenarioGenerator: expand()
    activate ScenarioGenerator

    ScenarioGenerator->>ConfigLoader: load_model(model_key)
    activate ConfigLoader
    ConfigLoader->>ConfigLoader: deep_merge(global → accelerator → model → model.accelerator_overrides)
    ConfigLoader-->>ScenarioGenerator: ResolvedModelConfig
    deactivate ConfigLoader

    ScenarioGenerator->>ConfigLoader: load_workload(workload_key)
    activate ConfigLoader
    ConfigLoader->>ConfigLoader: merge guidellm + vllm_args
    ConfigLoader-->>ScenarioGenerator: ResolvedWorkloadConfig
    deactivate ConfigLoader

    ScenarioGenerator->>ScenarioGenerator: matrix product(workloads × routing × tensor_parallel)
    ScenarioGenerator->>ScenarioGenerator: compute scenario_id, merge runtime_args
    ScenarioGenerator-->>App: [ExpandedScenario, ...]
    deactivate ScenarioGenerator
```
```mermaid
sequenceDiagram
    participant CLI
    participant test_rhaiis
    participant ConfigLoader
    participant BenchmarkWorkflow
    participant DeployVLLMStep
    participant WaitForReadyStep
    participant RunGuideLLMStep
    participant OC as oc (CLI)

    CLI->>test_rhaiis: run_test(model, workloads, ...)
    activate test_rhaiis

    test_rhaiis->>ConfigLoader: load_model(model_key)
    activate ConfigLoader
    ConfigLoader-->>test_rhaiis: ResolvedModelConfig
    deactivate ConfigLoader

    alt Single Workload
        test_rhaiis->>BenchmarkWorkflow: __init__(model, workload, ...)
        test_rhaiis->>BenchmarkWorkflow: execute()
    else Multi-Workload (Deploy-Once)
        loop for each workload group
            test_rhaiis->>DeployVLLMStep: execute(ctx)
            activate DeployVLLMStep
            DeployVLLMStep->>OC: apply -f <kserve.yaml>
            DeployVLLMStep-->>test_rhaiis: StepResult
            deactivate DeployVLLMStep

            test_rhaiis->>WaitForReadyStep: execute(ctx)
            activate WaitForReadyStep
            loop poll InferenceService
                WaitForReadyStep->>OC: get inferenceservice
            end
            WaitForReadyStep->>OC: exec pod curl health check
            WaitForReadyStep-->>test_rhaiis: StepResult(service_url)
            deactivate WaitForReadyStep

            loop for each workload in group
                test_rhaiis->>RunGuideLLMStep: execute(ctx)
                activate RunGuideLLMStep
                RunGuideLLMStep->>OC: apply Pod manifest
                RunGuideLLMStep->>OC: exec pod logs (await BENCHMARK_COMPLETE)
                RunGuideLLMStep-->>test_rhaiis: StepResult
                deactivate RunGuideLLMStep
            end
        end
    end

    test_rhaiis-->>CLI: exit_code
    deactivate test_rhaiis
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 A workflow engine hops into place,

With scenarios dancing through config space,

Steps link together in sequence so neat,

While RHAIIS benchmarks run swift and fleet,

Orchestration magic—our fastest race! ✨

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Adding Forge rhaiis test entrypoint' is clear, concise, and accurately describes the main objective of the PR—introducing a new test entrypoint for the RHAIIS project within the Forge workflow engine. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 84.98%, which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment

Actionable comments posted: 20

🧹 Nitpick comments (19)
tests/rhaiis/test_workflows.py (1)

15-23: Consider consolidating duplicated fixtures.

temp_artifact_dir and context fixtures are repeated in three classes; moving them to module-level fixtures will reduce test boilerplate.

Also applies to: 80-87, 103-110

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_workflows.py` around lines 15 - 23, The temp_artifact_dir
and context fixtures are duplicated across test classes; move them to
module-level by defining pytest fixtures named temp_artifact_dir and context at
top of the test module and reuse them in tests instead of redefining inside each
class; ensure the context fixture still uses
WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir)) and
yields Path(tmpdir) semantics for temp_artifact_dir so existing tests
referencing those fixture names continue to work.
tests/core/steps/test_guidellm.py (1)

49-52: Use WorkflowContext API instead of rebuilding internal path format.

Line [49] manually mirrors artifact directory naming. Prefer context.get_step_artifact_dir("benchmark") to avoid brittle coupling.

💡 Suggested change
```diff
-        step_dir = context.artifact_dir / f"{context.step_number:03d}__{context.current_step_name}"
-        step_dir.mkdir(parents=True, exist_ok=True)
+        step_dir = context.get_step_artifact_dir("benchmark")
         output_file = step_dir / "guidellm_results.json"
         output_file.write_text('{"results": []}')
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/steps/test_guidellm.py` around lines 49 - 52, The test rebuilds
the artifact path manually using context.step_number and
context.current_step_name (step_dir/step_dir.mkdir/output_file), which couples
to internal formatting; replace that with the WorkflowContext API by calling
context.get_step_artifact_dir("benchmark") to obtain the correct step_dir, then
create the directory and write the results to that path (output_file) using the
returned Path; update references to step_dir/output_file accordingly (remove
manual formatting with
f"{context.step_number:03d}__{context.current_step_name}").
tests/core/scenarios/test_generator.py (2)

82-82: Consider using ASCII x instead of Unicode × in comment.

The comment uses the Unicode multiplication sign (×) which Ruff flags as potentially ambiguous. While this is purely cosmetic, using the ASCII x character improves consistency.

📝 Suggested fix
```diff
-        # 2 workloads × 1 routing × 2 TP = 4 scenarios
+        # 2 workloads x 1 routing x 2 TP = 4 scenarios
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/scenarios/test_generator.py` at line 82, Replace the Unicode
multiplication sign in the comment "# 2 workloads × 1 routing × 2 TP = 4
scenarios" with the ASCII "x" to avoid Ruff warnings; update the comment in
tests/core/scenarios/test_generator.py (the line containing that scenario count
comment) so it reads "# 2 workloads x 1 routing x 2 TP = 4 scenarios".

60-64: Temporary file created with delete=False is not explicitly cleaned up.

The temp file is created with delete=False and yielded, but there's no explicit cleanup. While pytest typically handles this, consider using tmp_path fixture or adding cleanup in a finally block for cleaner resource management.

♻️ Alternative using tmp_path
```diff
     @pytest.fixture
-    def sample_config_path(self):
+    def sample_config_path(self, tmp_path):
         """Create a sample scenarios.yaml file using new format."""
         config = { ... }
-        with tempfile.NamedTemporaryFile(
-            mode="w", suffix=".yaml", delete=False
-        ) as f:
-            yaml.safe_dump(config, f)
-            yield Path(f.name)
+        config_path = tmp_path / "scenarios.yaml"
+        config_path.write_text(yaml.safe_dump(config))
+        yield config_path
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/scenarios/test_generator.py` around lines 60 - 64, The temporary
YAML file created with tempfile.NamedTemporaryFile(delete=False) in the
test_generator (where yaml.safe_dump writes and the code yields Path(f.name))
isn't explicitly removed; either switch to pytest's tmp_path fixture to create
and write the file there, or wrap the yield in a try/finally that calls
Path(f.name).unlink() to ensure cleanup after the test completes; update the
block that creates the NamedTemporaryFile accordingly (or replace it with
tmp_path.joinpath and open/write) so the temp file is always removed.
tests/rhaiis/test_operators.py (1)

56-61: Test path construction relies on internal context state.

The test constructs the step directory path manually using context.step_number and context.current_step_name. This couples the test to internal implementation details of WorkflowContext. Consider using context.get_step_artifact_dir("install_nfd") directly to get the path, or capturing the return value from the fixture's call.

♻️ Suggested approach
```diff
     @pytest.fixture
     def context(self, temp_artifact_dir):
         ctx = WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir))
-        ctx.get_step_artifact_dir("install_nfd")
-        return ctx
+        step_dir = ctx.get_step_artifact_dir("install_nfd")
+        ctx._test_step_dir = step_dir  # Store for test access
+        return ctx

     # Then in test:
-        step_dir = context.artifact_dir / f"{context.step_number:03d}__{context.current_step_name}"
+        step_dir = context._test_step_dir
```

Or simply call get_step_artifact_dir again with the same name if it's idempotent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_operators.py` around lines 56 - 61, The test directly
constructs the step artifact path using context.step_number and
context.current_step_name which couples it to WorkflowContext internals; change
the assertion to obtain the step directory via the public API instead (call
context.get_step_artifact_dir("install_nfd") or use the fixture return value
that provided the path) and then check that (e.g., set step_dir =
context.get_step_artifact_dir("install_nfd") and use that to locate
"nfd-subscription.yaml" and assert its contents), leaving WorkflowContext
internals untouched.
projects/core/workflow/executor.py (2)

51-52: Unused variable step_artifact_dir.

The step_artifact_dir is computed on lines 52 and 87 but never used. If this is intended to ensure the directory exists as a side effect of get_step_artifact_dir(), consider adding a comment to clarify. Otherwise, remove the assignment.

♻️ Option 1: Add clarifying comment if side effect is intentional
```diff
             # Get artifact directory for this step
-            step_artifact_dir = ctx.get_step_artifact_dir(step_name)
+            # Creates the step artifact directory (side effect)
+            _ = ctx.get_step_artifact_dir(step_name)
```
♻️ Option 2: Remove if truly unused
```diff
-            # Get artifact directory for this step
-            step_artifact_dir = ctx.get_step_artifact_dir(step_name)
+            # Ensure step artifact directory exists
+            ctx.get_step_artifact_dir(step_name)
```

Also applies to: 87-87

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` around lines 51 - 52, The variable
step_artifact_dir assigned from ctx.get_step_artifact_dir(step_name) is unused;
either remove the assignment or make the intended side effect explicit—if
calling get_step_artifact_dir(...) is meant to ensure the directory exists,
replace the assignment with a call (ctx.get_step_artifact_dir(step_name)) and
add a clarifying comment, otherwise delete the line(s) that assign
step_artifact_dir (both occurrences) to eliminate the unused variable.

82-82: finally_errors list is populated but never used.

The finally_errors list accumulates exceptions from finally steps (lines 99, 112) but is never included in the WorkflowResult or used for any other purpose. Consider either using this data (e.g., logging a summary, including in result) or removing the collection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` at line 82, The finally_errors list in
executor.py (variable finally_errors) is being appended with exceptions from the
finally steps but never used; either remove the collection or surface it in the
result/logs: add a field to WorkflowResult (e.g., finally_errors:
list[Exception] or finally_error_summary) and populate it with the collected
finally_errors before returning from the executor function, or if you prefer not
to change the result type, log a concise summary (count and messages) of
finally_errors via the existing logger right before returning; update uses in
the function handling finally steps (where finally_errors is appended)
accordingly.
projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md (1)

22-34: Consider adding language identifiers to code blocks.

The directory structure code blocks on lines 22 and 37 are missing language specifiers. While markdown renders these fine, adding a language identifier (e.g., text or plaintext) improves linting compliance and editor syntax highlighting.

📝 Suggested fix
````diff
-```
+```text
 projects/core/
 ├── workflow/
````

Also applies to: 37-49

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md` around lines 22 - 34, The
markdown code blocks showing directory trees (the blocks starting with
"projects/core/" around the two sections in LLM_D_EXTENSIBILITY_REPORT.md) lack
language identifiers; update each triple-backtick fence to include a language
token like ```text or ```plaintext so linters and editors recognize these as
plain text (e.g., change ``` to ```text for both directory-structure blocks).
projects/rhaiis/IMPLEMENTATION.md (1)

7-34: Consider adding text language identifier to ASCII diagram blocks.

The ASCII art diagrams render correctly but lack language specifiers, which triggers markdownlint warnings. Adding text or plaintext as the language identifier would satisfy the linter without affecting rendering.

Also applies to: 118-180, 212-257, 261-286

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/IMPLEMENTATION.md` around lines 7 - 34, Add a language
specifier to the ASCII diagram code fences to silence markdownlint: change the
opening ``` to ```text (or ```plaintext) for the ASCII blocks that document the
orchestration and workflow (the block showing test_rhaiis.py, ConfigLoader,
BenchmarkWorkflow and the WorkflowStep implementations like DeployVLLM,
WaitForReady, RunGuideLLM, CollectArtifacts); apply the same change to the other
ASCII sections called out (lines ~118-180, 212-257, 261-286) so the diagrams
keep rendering but now include the text language identifier.
projects/rhaiis/workflows/steps/cleanup.py (1)

41-51: Consider adding additional resource types to the cleanup list.

The cleanup list may miss persistentvolumeclaim and job resources that could be created during benchmark runs. Consider expanding the list to ensure complete cleanup.

🧹 Proposed fix
```diff
         resource_types = [
             "inferenceservice",
             "servingruntime",
             "deployment",
             "service",
             "route",
             "configmap",
             "secret",
             "pod",
+            "job",
+            "persistentvolumeclaim",
         ]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/cleanup.py` around lines 41 - 51, The cleanup
resource_types list in cleanup.py currently omits PVCs and Jobs, so update the
resource_types variable (used in the cleanup step) to include
"persistentvolumeclaim" and "job" alongside the existing entries; ensure
ordering still deletes KServe types first if needed and that any callers of
resource_types (e.g., the cleanup loop/function in this module) handle these
additional resource kinds.
projects/core/steps/guidellm.py (3)

3-7: Unused import: json.

The json module is imported but never used in this file.

🧹 Proposed fix
```diff
-import json
 import logging
 import subprocess
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/steps/guidellm.py` around lines 3 - 7, The import list at the
top of guidellm.py includes an unused module `json`; remove the unused `json`
import from the imports block (the line containing "import json") so only the
used modules (logging, subprocess, time, uuid) remain and avoid linter warnings.

77-78: Inconsistent artifact directory construction.

The step manually constructs the artifact path using ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}" instead of using the context's get_step_artifact_dir() method. This duplicates logic and could diverge from the canonical implementation.

♻️ Proposed fix
```diff
     def execute(self, ctx: "WorkflowContext") -> StepResult:
         """Run GuideLLM benchmark as a pod."""
-        step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
-        step_dir.mkdir(parents=True, exist_ok=True)
+        step_dir = ctx.get_step_artifact_dir(self.name)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/steps/guidellm.py` around lines 77 - 78, Replace the manual
artifact directory construction that assigns step_dir = ctx.artifact_dir /
f"{ctx.step_number:03d}__{ctx.current_step_name}" with the canonical context
helper by calling ctx.get_step_artifact_dir() (or the appropriate context
method) to obtain the step artifact directory; ensure you then call
mkdir(parents=True, exist_ok=True) on the returned path so all existing callers
(e.g., the variable step_dir and any subsequent uses) keep the same behavior but
use the context-managed path.

273-275: Progress logging condition may not trigger reliably.

With poll_interval=10 and integer division, elapsed % 60 == 0 may not trigger exactly every minute due to timing drift. Consider tracking the last log time instead.

🛠️ Proposed fix
```diff
+        last_log_time = 0
+
         while time.monotonic() - start_time < timeout:
             try:
                 # ... polling logic ...

                 # Still running, no marker yet
                 elapsed = int(time.monotonic() - start_time)
-                if elapsed % 60 == 0:  # Print every minute
+                if elapsed - last_log_time >= 60:  # Print every minute
                     print(f"  GuideLLM running... ({elapsed}s elapsed, phase={phase})")
+                    last_log_time = elapsed
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/steps/guidellm.py` around lines 273 - 275, The current progress
log uses "elapsed = int(time.monotonic() - start_time)" and checks "if elapsed %
60 == 0" which can miss minute boundaries due to timing drift with
poll_interval; change to track the last logged time instead (e.g., initialize
last_log_time = 0 before the loop) and inside the loop check "if elapsed -
last_log_time >= 60" then print the same message (including phase and elapsed)
and set last_log_time = elapsed; update occurrences around the
elapsed/start_time/phase logging in the GuideLLM run loop to use this new
last_log_time logic so logs reliably occur roughly every 60 seconds regardless
of poll_interval.
tests/rhaiis/test_steps.py (1)

26-31: Context fixture pre-increments step counter but discards result.

The fixture calls ctx.get_step_artifact_dir("deploy") presumably to simulate workflow execution, but the return value is unused. A comment explaining this side-effect would improve clarity.

📝 Proposed fix
```diff
     @pytest.fixture
     def context(self, temp_artifact_dir):
         ctx = WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir))
-        # Pre-increment step number to simulate workflow execution
-        ctx.get_step_artifact_dir("deploy")
+        # Pre-increment step counter to simulate workflow execution context
+        # (return value intentionally discarded - we only need the side effect)
+        _ = ctx.get_step_artifact_dir("deploy")
         return ctx
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_steps.py` around lines 26 - 31, The context fixture is
calling ctx.get_step_artifact_dir("deploy") solely for its side-effect of
advancing the step counter but discards the return value; update the fixture to
either store the returned path (e.g., assign to a variable) or add a clarifying
comment explaining the intentional side-effect so future readers know
get_step_artifact_dir is being invoked to pre-increment the step counter;
reference the context fixture, WorkflowContext.from_environment, and
ctx.get_step_artifact_dir("deploy") when making the change.
projects/rhaiis/orchestration/ci.py (1)

24-24: Path construction using multiple parent traversals is fragile.

Path(__file__).parent.parent.parent.parent is brittle and hard to verify. Consider using a more explicit path resolution or a constant defined relative to the project root.

🛠️ Alternative approach
```diff
+# Find project root by looking for a marker file
+def _find_project_root() -> Path:
+    """Find project root by traversing up to find config directory."""
+    current = Path(__file__).resolve().parent
+    for _ in range(10):  # Limit traversal depth
+        if (current / "config" / "rhaiis").is_dir():
+            return current
+        current = current.parent
+    raise RuntimeError("Could not find project root with config/rhaiis directory")
+
 # Default config directory
-DEFAULT_CONFIG_DIR = Path(__file__).parent.parent.parent.parent / "config" / "rhaiis"
+DEFAULT_CONFIG_DIR = _find_project_root() / "config" / "rhaiis"
```

Alternatively, verify the path exists at module load time to fail fast if the structure changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/ci.py` at line 24, The DEFAULT_CONFIG_DIR uses
brittle parent chaining (Path(__file__).parent.parent.parent.parent / "config" /
"rhaiis"); replace it with a robust resolution strategy: derive the project root
from a single authoritative location (e.g., a PROJECT_ROOT constant or using
pathlib.Path(__file__).resolve().parents[n] with a named constant) or compute it
via an environment variable, then build DEFAULT_CONFIG_DIR = PROJECT_ROOT /
"config" / "rhaiis"; also add a runtime existence check for DEFAULT_CONFIG_DIR
(raising a clear exception or logging an error) so module import fails fast if
the expected path is missing; update references to DEFAULT_CONFIG_DIR
accordingly.
projects/core/workflow/workflow.py (1)

64-74: Duplicate step names would overwrite results in step_results dict.

If two steps have the same name, the executor will overwrite the first step's result in step_results. Consider validating uniqueness or using a list of tuples instead.

🛠️ Proposed fix - add uniqueness validation
     def add_step(self, step: WorkflowStep) -> None:
         """
         Add a step to the workflow.
 
         Steps run in order of registration. If a step fails,
         remaining steps are skipped and finally steps run.
 
         Args:
             step: WorkflowStep instance to add
+
+        Raises:
+            ValueError: If a step with the same name already exists
         """
+        existing_names = {s.name for s in self._steps} | {s.name for s in self._finally_steps}
+        if step.name in existing_names:
+            raise ValueError(f"Step with name '{step.name}' already exists")
         self._steps.append(step)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/workflow.py` around lines 64 - 74, The add_step method
currently appends WorkflowStep instances into self._steps allowing duplicate
step.name values which will later cause step_results to overwrite earlier
results; update add_step to validate that the incoming step.name is unique among
existing steps (inspect self._steps for matching WorkflowStep.name) and raise a
clear ValueError (or custom exception) if a duplicate is found, so consumers
must choose unique names before execution; reference add_step,
WorkflowStep.name, self._steps and step_results when making the change.
projects/rhaiis/workflows/benchmark.py (1)

117-122: K8s name sanitization may produce invalid names with leading/trailing hyphens.

If the model name starts or ends with /, ., or _, the resulting name could have leading or trailing hyphens, which are invalid in Kubernetes resource names. Consider stripping these characters.

🛠️ Proposed fix
     @staticmethod
     def _sanitize_name(name: str) -> str:
         """Sanitize model name for K8s resource naming."""
         name = name.split("/")[-1].lower()
         name = name.replace(".", "-").replace("_", "-")
+        name = name.strip("-")
         return name[:42]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/benchmark.py` around lines 117 - 122, The
_sanitize_name function can produce leading/trailing hyphens (invalid for K8s);
update it to trim unwanted characters before/after substitution: remove
leading/trailing "/" "." "_" from the original name (or perform replacements
then strip leading/trailing "-" characters), collapse repeated hyphens if
present, ensure the final string is trimmed to max 42 chars and not empty
(fallback to a safe default if empty); modify the _sanitize_name implementation
to perform these steps so Kubernetes names never begin or end with a hyphen.
projects/core/scenarios/config.py (1)

79-81: Sanitize name may produce consecutive hyphens.

When the input contains consecutive underscores or slashes (e.g., model__name), the result will have consecutive hyphens (model--name), which, while technically valid in Kubernetes, is unconventional.

🛠️ Proposed fix
+import re
+
     @staticmethod
     def sanitize_name(name: str, max_len: int = 42) -> str:
-        return (
-            name.lower().replace("/", "-").replace("_", "-").replace(".", "")
-        )[:max_len]
+        sanitized = name.lower().replace("/", "-").replace("_", "-").replace(".", "")
+        sanitized = re.sub(r"-+", "-", sanitized)  # Collapse consecutive hyphens
+        return sanitized.strip("-")[:max_len]

Note: re is already imported at line 3.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/config.py` around lines 79 - 81, The current
sanitization expression (the return using name.lower().replace("/",
"-").replace("_", "-").replace(".", "")) can produce consecutive hyphens;
replace that expression with a regex-based collapse: use re.sub to replace any
consecutive sequence of slashes, dots, or underscores with a single hyphen, then
strip leading/trailing hyphens and apply the existing max_len slice. Locate and
replace the expression shown (the
name.lower().replace(...).replace(...).replace(...)) in the config sanitization
function so the output never contains consecutive hyphens.
projects/core/utils/oc.py (1)

192-202: Separate “inherit default namespace” from “no namespace flag”.

The current API overloads None for both meanings. That makes cluster-scoped helpers like create_namespace() rely on ambiguous behavior, because namespace=None still falls back to self.namespace inside _build_cmd(). A private sentinel here would make the public API clearer and safer for future cluster-scoped commands.

Suggested refactor
+_INHERIT_NAMESPACE = object()
+
 class OC:
@@
-    def _build_cmd(self, args: list[str], namespace: str | None = None) -> list[str]:
+    def _build_cmd(
+        self,
+        args: list[str],
+        namespace: str | None | object = _INHERIT_NAMESPACE,
+    ) -> list[str]:
         """Build full oc command with namespace."""
         cmd = ["oc"]
 
         # Use provided namespace, fall back to instance default
-        ns = namespace if namespace is not None else self.namespace
+        ns = self.namespace if namespace is _INHERIT_NAMESPACE else namespace
         if ns:
             cmd.extend(["-n", ns])
@@
-    def run(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def run(
+        self,
+        *args: str,
+        namespace: str | None | object = _INHERIT_NAMESPACE,
+        timeout: int | None = None,
+        input: str | None = None,
+    ) -> OCResult:
@@
-    def apply(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def apply(
+        self,
+        *args: str,
+        namespace: str | None | object = _INHERIT_NAMESPACE,
+        timeout: int | None = None,
+        input: str | None = None,
+    ) -> OCResult:
@@
         return self.apply("-f", "-", input=yaml_content, namespace=None)

Also applies to: 300-313, 335-351, 514-529

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/utils/oc.py` around lines 192 - 202, Introduce a private
sentinel to distinguish “inherit default namespace” from “no namespace flag”:
define _NO_NAMESPACE = object() and change _build_cmd signature to accept
namespace: str | None | object = _NO_NAMESPACE; implement logic so that when
namespace is _NO_NAMESPACE it uses self.namespace (inherit), when namespace is
None it omits the -n flag (cluster-scoped), and when namespace is a str it uses
that value; then update callers (e.g., create_namespace() and the other helper
methods that call _build_cmd referenced in the review) to pass None when they
intend “no namespace flag” and leave the parameter unset or pass _NO_NAMESPACE
when they intend “inherit default”.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/rhaiis/defaults.yaml`:
- Around line 16-18: The default config currently sets the YAML key
"trust-remote-code" to true which enables remote model code execution globally;
change the default value of the "trust-remote-code" setting to false in the
defaults.yaml so remote code execution is opt-in, and update any code paths or
config-loading logic that assume a true default (e.g., places reading
"trust-remote-code") to honor per-model overrides or an explicit allowlist so
only explicitly trusted models can enable remote code execution.
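
A minimal sketch of the opt-in policy described above; the allowlist, config shapes, and function name are illustrative assumptions, not the repo's actual loader API:

```python
# Hypothetical explicit allowlist of models permitted to run remote code.
TRUSTED_MODELS = {"Qwen/Qwen3-0.6B"}

def resolve_trust_remote_code(defaults: dict, model_cfg: dict, model_id: str) -> bool:
    """Global default stays False; a per-model override only takes effect
    for models on the explicit allowlist."""
    if model_cfg.get("trust-remote-code", False):
        return model_id in TRUSTED_MODELS
    return bool(defaults.get("trust-remote-code", False))
```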

In `@config/rhaiis/workloads.yaml`:
- Around line 48-53: The heterogeneous workload contains a placeholder value for
guidellm.data ("ADD ME"); update the heterogeneous -> guidellm -> data field to
a valid dataset or config identifier used by the benchmark (or remove the
guidellm.data key if not applicable) so the "heterogeneous" workload no longer
contains an invalid placeholder and will not fail at runtime; locate the
heterogeneous section in workloads.yaml and replace "ADD ME" with the correct
dataset name expected by your runner.

In `@projects/core/ci_entrypoint/run_ci.py`:
- Around line 499-505: The CLI currently uses context_settings with
allow_interspersed_args=False while the command signature accepts a catch-all
args parameter typed click.UNPROCESSED, so options provided after positional
arguments (e.g., --dry-run) get captured into args and never parsed; fix this by
enabling interspersed option parsing (set allow_interspersed_args=True in the
click.command context_settings) so parent/global options are parsed regardless
of position, or alternatively stop using click.UNPROCESSED for the args
parameter and accept a parsed remainder (e.g., change handling of the args
parameter) so flags like --dry-run are recognized; update the
decorator/context_settings where allow_interspersed_args is declared and adjust
how the args parameter is defined/consumed accordingly.

In `@projects/core/scenarios/config_loader.py`:
- Around line 266-273: load_scenario() is currently building
data["_resolved_defaults"] by merging only get_global_defaults() and
scenario-local "defaults", which omits accelerator-specific defaults; modify the
merge so accelerator defaults from get_accelerator_defaults() are merged between
the global and scenario defaults (e.g., deep_merge(global_defaults,
accelerator_defaults, scenario_defaults)) so that get_accelerator_defaults(),
deep_merge, and data["_resolved_defaults"] are used in that order within
load_scenario().
- Around line 181-205: The deploy merge is missing accelerator-specific entries
so accelerator-level GPU/resource settings are lost; update the merge to read
accel_defaults.get("deploy", {}) and accel_overrides.get("deploy", {}) (e.g.,
accel_deploy and override_deploy) and include them when building final_deploy
(merge base_deploy, accel_deploy, model_deploy, then override_deploy) so
final_deploy reflects accelerator defaults and overrides (ensure the same
final_deploy variable used by ResolvedModelConfig.deploy/num_gpus).
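
The merge ordering both prompts describe can be sketched with a generic deep merge; the config values below are made-up examples, not the repo's real defaults:

```python
def deep_merge(*dicts: dict) -> dict:
    """Recursively merge dicts left to right; later values win per key."""
    result: dict = {}
    for d in dicts:
        for key, value in d.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = deep_merge(result[key], value)
            else:
                result[key] = value
    return result

# Assumed precedence: global defaults < accelerator defaults < model deploy.
global_defaults = {"deploy": {"replicas": 1, "gpu_resource": "nvidia.com/gpu"}}
accel_defaults = {"deploy": {"gpu_resource": "amd.com/gpu"}}
model_deploy = {"deploy": {"replicas": 2}}
resolved = deep_merge(global_defaults, accel_defaults, model_deploy)
```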

In `@projects/core/scenarios/generator.py`:
- Around line 309-326: The inline model merging code currently always converts
model dicts into ModelConfig via ModelConfig.from_dict (using symbols model_key,
model_data, ModelConfig.from_dict), which drops legacy-only fields like "matrix"
and "deploy" and later causes expand() to skip those entries (see the block that
checks isinstance(ModelConfig) around expand logic). Change the merge so that if
the incoming model_data contains legacy keys (e.g., "matrix", "deploy" or other
runtime-only fields) you preserve the raw dict (or create a distinct
LegacyModelConfig dataclass) instead of converting it with
ModelConfig.from_dict; update the later expansion loop (the code that currently
skips entries that are ModelConfig instances) to only skip real fully-normalized
ModelConfig instances and to handle the preserved legacy dict/LegacyModelConfig
by expanding its matrix/deploy fields as before. Ensure you reference/remove use
of ModelConfig.from_dict for legacy cases and add handling for legacy types in
the expand() path.

In `@projects/core/steps/artifacts.py`:
- Around line 203-246: The cleanup hides failures because _delete_resource(kind,
name, ns_args) only returns False on non-zero exit and execute() never records
error details in errors; change _delete_resource to return (bool, str|None)
where the bool is success and the str carries the subprocess output/error
(prefer result.stderr if present falling back to result.stdout) when returncode
!= 0, and update execute() to capture that tuple: when success append to deleted
as before, when failure append a descriptive message (include kind, name and the
returned error string) into the errors list so real oc delete failures (auth/API
errors) are surfaced in the StepResult data/message.
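
A sketch of the suggested (success, error_detail) return shape; the injectable `run` parameter is an assumption added here for testability, while the real step calls subprocess.run directly:

```python
import subprocess
from typing import Callable, Optional, Tuple

def delete_resource(kind: str, name: str, ns_args: list,
                    run: Callable = subprocess.run) -> Tuple[bool, Optional[str]]:
    """Delete a resource via oc; on failure, return stderr (or stdout)
    so the caller can record a descriptive error instead of a bare False."""
    result = run(
        ["oc", "delete", kind, name, *ns_args, "--ignore-not-found"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False, (result.stderr or result.stdout).strip()
    return True, None
```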

In `@projects/core/workflow/executor.py`:
- Around line 54-59: The timing fields are incorrect because result.start_time
is assigned after step.execute() returns; move the start timestamp to before
execution by setting result.start_time = datetime.now(timezone.utc) (or
equivalent) immediately after setting step_start and before calling
step.execute(ctx), then invoke result = step.execute(ctx) and compute
result.duration_seconds = time.monotonic() - step_start afterward so
result.start_time reflects the real start; update the block around step_start,
step.execute, result.start_time, and result.duration_seconds accordingly (refer
to step_start, step.execute(ctx), and result.start_time).
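
The corrected ordering can be sketched as below; `execute` stands in for step.execute and the dict-shaped result is an assumption:

```python
import time
from datetime import datetime, timezone

def run_step(execute, ctx):
    """Capture start_time BEFORE execution so the recorded timestamp
    reflects when the step actually began, not when it finished."""
    start_time = datetime.now(timezone.utc)  # before execute(), not after
    step_start = time.monotonic()
    result = execute(ctx)
    result["start_time"] = start_time
    result["duration_seconds"] = time.monotonic() - step_start
    return result
```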

In `@projects/rhaiis/IMPLEMENTATION.md`:
- Around line 469-472: Replace the hardcoded developer path used in the example
command (the string "cd /Users/memehta/workspace/forge") with a generic
placeholder or relative path; update the snippet to instruct users to "cd
<path-to-repo>" or "cd /path/to/forge" or simply "cd $REPO_ROOT" (or use "cd
$(pwd)" if running from repo) so it works for all devs, and keep the rest of the
commands (source ~/test_foo/python3_virt/bin/activate and PYTHONPATH=. python -m
pytest ...) unchanged.

In `@projects/rhaiis/orchestration/ci.py`:
- Around line 67-70: The test command currently instantiates WorkflowContext via
WorkflowContext.from_environment() and reads model =
os.environ.get("FORGE_MODEL") even for --dry-run and without validating the
model; modify the test command to (1) defer creating the WorkflowContext until
after the dry-run check (matching prepare/cleanup behavior) so no context is
created when --dry-run is true, and (2) validate model (the variable read from
FORGE_MODEL) before calling run_test — if model is None or empty, log/raise a
clear error and exit instead of passing None into run_test. Ensure you update
places that call run_test to only do so when model is present and
WorkflowContext has been created.

In `@projects/rhaiis/orchestration/cli.py`:
- Around line 154-160: Validate workload inputs before calling run_test(): if
both the single --workload (variable workload) and --workloads (variable
workloads) are provided, raise a clear error (when not dry_run) instead of
silently ignoring one; when using --workloads, build workload_list by splitting
on commas and normalizing entries with trimming and filtering out empty strings
(e.g., workload_list = [w.strip() for w in workloads.split(",") if w.strip()]);
if only the single workload is provided set workload_list = [workload]; keep the
existing skip-for-dry-run behavior (check dry_run first) and ensure all checks
happen prior to dispatching to run_test().
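
The validation and normalization steps above can be sketched as one helper; the function name is illustrative, the flag semantics mirror the CLI:

```python
def resolve_workloads(workload, workloads):
    """Normalize --workload/--workloads into one list before dispatch,
    rejecting conflicting flags and dropping empty comma-split entries."""
    if workload and workloads:
        raise ValueError("use either --workload or --workloads, not both")
    if workloads:
        return [w.strip() for w in workloads.split(",") if w.strip()]
    return [workload] if workload else []
```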

In `@projects/rhaiis/orchestration/test_rhaiis.py`:
- Around line 356-364: The code reads max_seconds from wl_config.guidellm which
is incorrect; max_seconds is a top-level field on the resolved workload object.
In the try block where wl_config = config_loader.load_workload(wl) is used,
change the lookup for wl_max_seconds to read wl_config.get("max_seconds", 300)
(or wl_config.max_seconds if the object exposes attributes) instead of
wl_config.guidellm.get("max_seconds", 300); keep wl_rates/wl_rate_str logic
as-is but ensure wl_rates still comes from wl_config.guidellm.get("rates", ...)
if rates are nested there.

In `@projects/rhaiis/workflows/benchmark.py`:
- Line 54: Change the environment variable key used when populating
self.vllm_image: currently the constructor sets self.vllm_image = vllm_image or
ctx.get_env("VLLM_IMAGE", ""), but tests/CLI use FORGE_VLLM_IMAGE; update the
ctx.get_env call to use "FORGE_VLLM_IMAGE" so the code reads the correct env var
(locate the assignment to self.vllm_image in the class constructor where
ctx.get_env is called).

In `@projects/rhaiis/workflows/cleanup.py`:
- Around line 19-40: The remove_operators flag is currently ignored: update
define_steps to respect self.remove_operators by either (a) raising a clear
NotImplementedError/ValueError when self.remove_operators is True to fail fast
until operator-removal is implemented, or (b) add the operator-removal step when
True; specifically modify the define_steps method to check self.remove_operators
and either raise an error with a descriptive message or append the
operator-removal step alongside the existing CleanupNamespaceStep (referencing
self.remove_operators and CleanupNamespaceStep to locate the change).
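
Option (a), failing fast, can be sketched as follows; the string step names are placeholders for the real CleanupNamespaceStep objects:

```python
class CleanupWorkflow:
    """Sketch: surface the unimplemented remove_operators option loudly
    instead of silently ignoring it."""

    def __init__(self, remove_operators: bool = False):
        self.remove_operators = remove_operators

    def define_steps(self):
        if self.remove_operators:
            raise NotImplementedError(
                "operator removal is not implemented; re-run without --remove-operators"
            )
        return ["cleanup_namespace"]  # placeholder for CleanupNamespaceStep
```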

In `@projects/rhaiis/workflows/steps/deploy.py`:
- Around line 358-367: The subprocess.run that populates url_result (and other
oc subprocess.run invocations later in the file) lack timeouts and can hang; add
a timeout argument (use a sensible value or a shared constant/the outer wait
deadline) to each subprocess.run call fetching oc resources (e.g., the run that
sets url_result and the similar calls around lines 419-433), and wrap them in
try/except for subprocess.TimeoutExpired to log the timeout (include
deployment_name and namespace) and fail/raise so the workflow’s outer timeout
contract is preserved.
- Around line 73-78: The generated vLLM argument list currently reads
"tensor-parallel-size" from self.runtime_args instead of using the override
stored in self.tensor_parallel, causing a mismatch between GPU requests and vLLM
startup; update the argument-building logic (the _build_args_lines method and
the other args-construction site that mirrors lines ~239-245) to set the
"tensor-parallel-size" CLI arg from self.tensor_parallel when it is provided
(falling back to self.runtime_args only if self.tensor_parallel is falsy),
leaving other runtime_args unchanged so the manifest and vLLM use the same
tensor-parallel size.
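
The bounded-subprocess pattern from the first deploy.py prompt can be sketched generically; the wrapper name is illustrative:

```python
import subprocess

def run_oc(args, timeout=30):
    """Run an oc-style command with a bounded timeout so a wedged API
    server cannot hang the workflow; surface timeouts as clear errors."""
    try:
        return subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(
            f"command timed out after {timeout}s: {' '.join(args)}"
        ) from exc
```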

In `@projects/rhaiis/workflows/steps/operators.py`:
- Around line 23-38: The code computes step_dir and writes YAML with
yaml_path.write_text(...) but never creates the per-step artifact directory,
causing FileNotFoundError; update the operator steps to use the canonical
ctx.get_step_artifact_dir(self.name) (or call step_dir.mkdir(parents=True,
exist_ok=True) after computing step_dir) before writing files so the directory
exists; apply this change in the blocks that define step_dir and write
nfd-subscription.yaml (and analogous blocks at the other occurrences noted) to
ensure consistent step numbering and avoid write errors.
- Around line 40-44: The code currently runs a dry-run namespace creation and
then applies a YAML but ignores both command outcomes and always returns
StepResult.ok; change the sequence in the method using _run_oc so the namespace
is actually created (either remove the "--dry-run=client -o yaml" or run the
output through "oc apply -f -" so creation is applied), capture and check the
return/result of both _run_oc calls (the namespace creation/apply and the
subsequent _run_oc(["apply", "-f", str(yaml_path)])), and if either fails return
a failure StepResult instead of StepResult.ok; look for the _run_oc invocations
and StepResult.ok call to update behavior.

In `@tests/rhaiis/test_ci.py`:
- Around line 57-59: The test currently uses `assert "balanced" in result.output
or "heterogeneous" in result.output`, which lets the test pass if only one
workload is printed; change this to assert both workloads are present by using
`and` (i.e., require both `"balanced"` and `"heterogeneous"` in `result.output`)
in the same test (the assertion referencing `result`/`result.output`).
- Around line 34-35: The tests currently hardcode
FORGE_ARTIFACT_DIR="/tmp/artifacts" which can cause cross-test interference;
update the test(s) that return CliRunner(env={"FORGE_ARTIFACT_DIR":
"/tmp/artifacts"}) to accept a per-test temporary directory (e.g., pytest's
tmp_path or a tempfile.TemporaryDirectory) and set FORGE_ARTIFACT_DIR to that
path (str(tmp_path) or the temp dir path) when constructing CliRunner; locate
the return site that calls CliRunner and any helper functions that build the
runner and replace the hardcoded string with the injected temp fixture so each
test gets an isolated artifact directory.

---

Nitpick comments:
In `@projects/core/scenarios/config.py`:
- Around line 79-81: The current sanitization expression (the return using
name.lower().replace("/", "-").replace("_", "-").replace(".", "")) can produce
consecutive hyphens; replace that expression with a regex-based collapse: use
re.sub to replace any consecutive sequence of slashes, dots, or underscores with
a single hyphen, then strip leading/trailing hyphens and apply the existing
max_len slice. Locate and replace the expression shown (the
name.lower().replace(...).replace(...).replace(...)) in the config sanitization
function so the output never contains consecutive hyphens.

In `@projects/core/steps/guidellm.py`:
- Around line 3-7: The import list at the top of guidellm.py includes an unused
module `json`; remove the unused `json` import from the imports block (the line
containing "import json") so only the used modules (logging, subprocess, time,
uuid) remain and avoid linter warnings.
- Around line 77-78: Replace the manual artifact directory construction that
assigns step_dir = ctx.artifact_dir /
f"{ctx.step_number:03d}__{ctx.current_step_name}" with the canonical context
helper by calling ctx.get_step_artifact_dir() (or the appropriate context
method) to obtain the step artifact directory; ensure you then call
mkdir(parents=True, exist_ok=True) on the returned path so all existing callers
(e.g., the variable step_dir and any subsequent uses) keep the same behavior but
use the context-managed path.
- Around line 273-275: The current progress log uses "elapsed =
int(time.monotonic() - start_time)" and checks "if elapsed % 60 == 0" which can
miss minute boundaries due to timing drift with poll_interval; change to track
the last logged time instead (e.g., initialize last_log_time = 0 before the
loop) and inside the loop check "if elapsed - last_log_time >= 60" then print
the same message (including phase and elapsed) and set last_log_time = elapsed;
update occurrences around the elapsed/start_time/phase logging in the GuideLLM
run loop to use this new last_log_time logic so logs reliably occur roughly
every 60 seconds regardless of poll_interval.
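
The last_log_time pattern can be sketched as below; the injectable clock/sleep/log parameters are assumptions added so the loop is testable without waiting:

```python
import time

def poll_with_progress(is_done, poll_interval=5.0, log_every=60.0,
                       log=print, clock=time.monotonic, sleep=time.sleep):
    """Log roughly every `log_every` seconds by tracking the last log time,
    instead of the drift-prone `elapsed % 60 == 0` check."""
    start = clock()
    last_log = 0.0
    while not is_done():
        elapsed = clock() - start
        if elapsed - last_log >= log_every:
            log(f"still waiting ({int(elapsed)}s elapsed)")
            last_log = elapsed
        sleep(poll_interval)
```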

In `@projects/core/utils/oc.py`:
- Around line 192-202: Introduce a private sentinel to distinguish “inherit
default namespace” from “no namespace flag”: define _NO_NAMESPACE = object() and
change _build_cmd signature to accept namespace: str | None | object =
_NO_NAMESPACE; implement logic so that when namespace is _NO_NAMESPACE it uses
self.namespace (inherit), when namespace is None it omits the -n flag
(cluster-scoped), and when namespace is a str it uses that value; then update
callers (e.g., create_namespace() and the other helper methods that call
_build_cmd referenced in the review) to pass None when they intend “no namespace
flag” and leave the parameter unset or pass _NO_NAMESPACE when they intend
“inherit default”.

In `@projects/core/workflow/executor.py`:
- Around line 51-52: The variable step_artifact_dir assigned from
ctx.get_step_artifact_dir(step_name) is unused; either remove the assignment or
make the intended side effect explicit—if calling get_step_artifact_dir(...) is
meant to ensure the directory exists, replace the assignment with a call
(ctx.get_step_artifact_dir(step_name)) and add a clarifying comment, otherwise
delete the line(s) that assign step_artifact_dir (both occurrences) to eliminate
the unused variable.
- Line 82: The finally_errors list in executor.py (variable finally_errors) is
being appended with exceptions from the finally steps but never used; either
remove the collection or surface it in the result/logs: add a field to
WorkflowResult (e.g., finally_errors: list[Exception] or finally_error_summary)
and populate it with the collected finally_errors before returning from the
executor function, or if you prefer not to change the result type, log a concise
summary (count and messages) of finally_errors via the existing logger right
before returning; update uses in the function handling finally steps (where
finally_errors is appended) accordingly.

In `@projects/core/workflow/workflow.py`:
- Around line 64-74: The add_step method currently appends WorkflowStep
instances into self._steps allowing duplicate step.name values which will later
cause step_results to overwrite earlier results; update add_step to validate
that the incoming step.name is unique among existing steps (inspect self._steps
for matching WorkflowStep.name) and raise a clear ValueError (or custom
exception) if a duplicate is found, so consumers must choose unique names before
execution; reference add_step, WorkflowStep.name, self._steps and step_results
when making the change.

In `@projects/rhaiis/IMPLEMENTATION.md`:
- Around line 7-34: Add a language specifier to the ASCII diagram code fences to
silence markdownlint: change the opening ``` to ```text (or ```plaintext) for
the ASCII blocks that document the orchestration and workflow (the block showing
test_rhaiis.py, ConfigLoader, BenchmarkWorkflow and the WorkflowStep
implementations like DeployVLLM, WaitForReady, RunGuideLLM, CollectArtifacts);
apply the same change to the other ASCII sections called out (lines ~118-180,
212-257, 261-286) so the diagrams keep rendering but now include the text
language identifier.

In `@projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md`:
- Around line 22-34: The markdown code blocks showing directory trees (the
blocks starting with "projects/core/" around the two sections in
LLM_D_EXTENSIBILITY_REPORT.md) lack language identifiers; update each
triple-backtick fence to include a language token like ```text or ```plaintext
so linters and editors recognize these as plain text (e.g., change ``` to
```text for both directory-structure blocks).

In `@projects/rhaiis/orchestration/ci.py`:
- Line 24: The DEFAULT_CONFIG_DIR uses brittle parent chaining
(Path(__file__).parent.parent.parent.parent / "config" / "rhaiis"); replace it
with a robust resolution strategy: derive the project root from a single
authoritative location (e.g., a PROJECT_ROOT constant or using
pathlib.Path(__file__).resolve().parents[n] with a named constant) or compute it
via an environment variable, then build DEFAULT_CONFIG_DIR = PROJECT_ROOT /
"config" / "rhaiis"; also add a runtime existence check for DEFAULT_CONFIG_DIR
(raising a clear exception or logging an error) so module import fails fast if
the expected path is missing; update references to DEFAULT_CONFIG_DIR
accordingly.

In `@projects/rhaiis/workflows/benchmark.py`:
- Around line 117-122: The _sanitize_name function can produce leading/trailing
hyphens (invalid for K8s); update it to trim unwanted characters before/after
substitution: remove leading/trailing "/" "." "_" from the original name (or
perform replacements then strip leading/trailing "-" characters), collapse
repeated hyphens if present, ensure the final string is trimmed to max 42 chars
and not empty (fallback to a safe default if empty); modify the _sanitize_name
implementation to perform these steps so Kubernetes names never begin or end
with a hyphen.

In `@projects/rhaiis/workflows/steps/cleanup.py`:
- Around line 41-51: The cleanup resource_types list in cleanup.py currently
omits PVCs and Jobs, so update the resource_types variable (used in the cleanup
step) to include "persistentvolumeclaim" and "job" alongside the existing
entries; ensure ordering still deletes KServe types first if needed and that any
callers of resource_types (e.g., the cleanup loop/function in this module)
handle these additional resource kinds.

In `@tests/core/scenarios/test_generator.py`:
- Line 82: Replace the Unicode multiplication sign in the comment "# 2 workloads
× 1 routing × 2 TP = 4 scenarios" with the ASCII "x" to avoid Ruff warnings;
update the comment in tests/core/scenarios/test_generator.py (the line
containing that scenario count comment) so it reads "# 2 workloads x 1 routing x
2 TP = 4 scenarios".
- Around line 60-64: The temporary YAML file created with
tempfile.NamedTemporaryFile(delete=False) in the test_generator (where
yaml.safe_dump writes and the code yields Path(f.name)) isn't explicitly
removed; either switch to pytest's tmp_path fixture to create and write the file
there, or wrap the yield in a try/finally that calls Path(f.name).unlink() to
ensure cleanup after the test completes; update the block that creates the
NamedTemporaryFile accordingly (or replace it with tmp_path.joinpath and
open/write) so the temp file is always removed.

In `@tests/core/steps/test_guidellm.py`:
- Around line 49-52: The test rebuilds the artifact path manually using
context.step_number and context.current_step_name
(step_dir/step_dir.mkdir/output_file), which couples to internal formatting;
replace that with the WorkflowContext API by calling
context.get_step_artifact_dir("benchmark") to obtain the correct step_dir, then
create the directory and write the results to that path (output_file) using the
returned Path; update references to step_dir/output_file accordingly (remove
manual formatting with
f"{context.step_number:03d}__{context.current_step_name}").

In `@tests/rhaiis/test_operators.py`:
- Around line 56-61: The test directly constructs the step artifact path using
context.step_number and context.current_step_name which couples it to
WorkflowContext internals; change the assertion to obtain the step directory via
the public API instead (call context.get_step_artifact_dir("install_nfd") or use
the fixture return value that provided the path) and then check that (e.g., set
step_dir = context.get_step_artifact_dir("install_nfd") and use that to locate
"nfd-subscription.yaml" and assert its contents), leaving WorkflowContext
internals untouched.

In `@tests/rhaiis/test_steps.py`:
- Around line 26-31: The context fixture is calling
ctx.get_step_artifact_dir("deploy") solely for its side-effect of advancing the
step counter but discards the return value; update the fixture to either store
the returned path (e.g., assign to a variable) or add a clarifying comment
explaining the intentional side-effect so future readers know
get_step_artifact_dir is being invoked to pre-increment the step counter;
reference the context fixture, WorkflowContext.from_environment, and
ctx.get_step_artifact_dir("deploy") when making the change.

In `@tests/rhaiis/test_workflows.py`:
- Around line 15-23: The temp_artifact_dir and context fixtures are duplicated
across test classes; move them to module-level by defining pytest fixtures named
temp_artifact_dir and context at top of the test module and reuse them in tests
instead of redefining inside each class; ensure the context fixture still uses
WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir)) and
yields Path(tmpdir) semantics for temp_artifact_dir so existing tests
referencing those fixture names continue to work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 028d9622-0f02-447a-aaa5-eaa381df5c94

📥 Commits

Reviewing files that changed from the base of the PR and between 75ad5c2 and 007ee06.

📒 Files selected for processing (51)
  • config/rhaiis/defaults.yaml
  • config/rhaiis/models.yaml
  • config/rhaiis/workloads.yaml
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/scenarios/__init__.py
  • projects/core/scenarios/config.py
  • projects/core/scenarios/config_loader.py
  • projects/core/scenarios/generator.py
  • projects/core/steps/__init__.py
  • projects/core/steps/artifacts.py
  • projects/core/steps/guidellm.py
  • projects/core/utils/__init__.py
  • projects/core/utils/oc.py
  • projects/core/workflow/__init__.py
  • projects/core/workflow/context.py
  • projects/core/workflow/executor.py
  • projects/core/workflow/step.py
  • projects/core/workflow/workflow.py
  • projects/rhaiis/IMPLEMENTATION.md
  • projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md
  • projects/rhaiis/__init__.py
  • projects/rhaiis/orchestration/__init__.py
  • projects/rhaiis/orchestration/ci.py
  • projects/rhaiis/orchestration/cli.py
  • projects/rhaiis/orchestration/test_rhaiis.py
  • projects/rhaiis/workflows/__init__.py
  • projects/rhaiis/workflows/benchmark.py
  • projects/rhaiis/workflows/cleanup.py
  • projects/rhaiis/workflows/prepare.py
  • projects/rhaiis/workflows/steps/__init__.py
  • projects/rhaiis/workflows/steps/cleanup.py
  • projects/rhaiis/workflows/steps/deploy.py
  • projects/rhaiis/workflows/steps/operators.py
  • tests/__init__.py
  • tests/core/__init__.py
  • tests/core/scenarios/__init__.py
  • tests/core/scenarios/test_config_loader.py
  • tests/core/scenarios/test_generator.py
  • tests/core/steps/__init__.py
  • tests/core/steps/test_artifacts.py
  • tests/core/steps/test_guidellm.py
  • tests/core/utils/__init__.py
  • tests/core/utils/test_oc.py
  • tests/core/workflow/__init__.py
  • tests/core/workflow/test_context.py
  • tests/core/workflow/test_executor.py
  • tests/rhaiis/__init__.py
  • tests/rhaiis/test_ci.py
  • tests/rhaiis/test_operators.py
  • tests/rhaiis/test_steps.py
  • tests/rhaiis/test_workflows.py

Comment on lines +16 to +18
gpu-memory-utilization: 0.9
trust-remote-code: true
disable-log-requests: true

⚠️ Potential issue | 🟠 Major

Defaulting trust-remote-code to true is a security risk.

Line [17] enables remote model code execution globally. This should default to false and be opt-in per trusted model.

🔐 Suggested change
   vllm_args:
     gpu-memory-utilization: 0.9
-    trust-remote-code: true
+    trust-remote-code: false
     disable-log-requests: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/rhaiis/defaults.yaml` around lines 16 - 18, The default config
currently sets the YAML key "trust-remote-code" to true which enables remote
model code execution globally; change the default value of the
"trust-remote-code" setting to false in the defaults.yaml so remote code
execution is opt-in, and update any code paths or config-loading logic that
assume a true default (e.g., places reading "trust-remote-code") to honor
per-model overrides or an explicit allowlist so only explicitly trusted models
can enable remote code execution.
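
The opt-in gating the prompt describes can be sketched in plain Python. This is a hypothetical illustration, not the actual config-loader API: `TRUSTED_MODELS` and `resolve_trust_remote_code` are invented names showing a default-deny policy with a per-model allowlist.

```python
# Hypothetical sketch of opt-in remote-code gating; names are illustrative.
TRUSTED_MODELS = {"my-org/vetted-model"}  # explicit allowlist (assumption)

def resolve_trust_remote_code(model_id: str, model_cfg: dict) -> bool:
    """Default to False; honor a per-model override only for allowlisted models."""
    requested = model_cfg.get("trust-remote-code", False)
    return bool(requested) and model_id in TRUSTED_MODELS

print(resolve_trust_remote_code("unknown/model", {"trust-remote-code": True}))        # False
print(resolve_trust_remote_code("my-org/vetted-model", {"trust-remote-code": True}))  # True
```

With this shape, a model that never sets the key stays at the safe default, and even an explicit `true` is ignored unless the model is allowlisted.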

Comment on lines +48 to +53
heterogeneous:
name: "Heterogeneous"
description: "Mixed token distributions simulating real traffic"
guidellm:
data: "ADD ME"
rate_type: "concurrent"

⚠️ Potential issue | 🟠 Major

heterogeneous workload still contains a placeholder payload.

Line [52] uses "ADD ME", which is not a valid benchmark dataset/config value and can cause runtime failure if selected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/rhaiis/workloads.yaml` around lines 48 - 53, The heterogeneous
workload contains a placeholder value for guidellm.data ("ADD ME"); update the
heterogeneous -> guidellm -> data field to a valid dataset or config identifier
used by the benchmark (or remove the guidellm.data key if not applicable) so the
"heterogeneous" workload no longer contains an invalid placeholder and will not
fail at runtime; locate the heterogeneous section in workloads.yaml and replace
"ADD ME" with the correct dataset name expected by your runner.

Comment on lines +499 to +505
@click.command(context_settings=dict(
ignore_unknown_options=True,
allow_interspersed_args=False,
))
@click.argument('project', required=False)
@click.argument('operation', required=False)
@click.argument('args', nargs=-1)
@click.argument('args', nargs=-1, type=click.UNPROCESSED)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Click 8.x, if a command uses context_settings={"allow_interspersed_args": False, "ignore_unknown_options": True} and click.argument("args", nargs=-1, type=click.UNPROCESSED), are options placed after positional args (for example: run project op --dry-run) parsed as top-level options or passed through as arguments?

💡 Result:

In Click 8.x, with context_settings={"allow_interspersed_args": False, "ignore_unknown_options": True} and a command using @click.argument("args", nargs=-1, type=click.UNPROCESSED), options placed after positional args (e.g., "run project op --dry-run") are passed through as arguments to the "args" parameter. The parser first processes all options until it encounters the first non-option argument ("project"). With allow_interspersed_args=False, it then stops processing options and treats all remaining tokens (including "--dry-run") as positional arguments for the nargs=-1 argument. The ignore_unknown_options=True ensures that any unknown options before the first positional are ignored and moved to largs (positionals), but since "--dry-run" appears after, it is not processed as an option and goes directly to args unprocessed.


🏁 Script executed:

# First, let's locate and examine the file
find . -name "run_ci.py" -type f

Repository: openshift-psap/forge

Length of output: 103


🏁 Script executed:

# Once we find the file, let's examine the lines around 499-505 and look for --dry-run definition
head -530 projects/core/ci_entrypoint/run_ci.py | tail -50

Repository: openshift-psap/forge

Length of output: 1563


🏁 Script executed:

# Let's search for where --dry-run is defined in this file
grep -n "dry.run\|dry-run" projects/core/ci_entrypoint/run_ci.py

Repository: openshift-psap/forge

Length of output: 544


🏁 Script executed:

# Let's also check the full Click command definition and its context
sed -n '450,550p' projects/core/ci_entrypoint/run_ci.py

Repository: openshift-psap/forge

Length of output: 3307


--dry-run can be silently ignored when placed after positional args.

With allow_interspersed_args=False (line 501) and UNPROCESSED argument type (line 505), a call like run project operation --dry-run passes --dry-run directly to the args parameter instead of parsing it as the parent option. The command executes with dry_run=False even though the user provided the flag, potentially running real operations when expecting a dry-run.

🔧 Proposed fix
 @click.command(context_settings=dict(
     ignore_unknown_options=True,
-    allow_interspersed_args=False,
+    allow_interspersed_args=True,
 ))
 @click.argument('project', required=False)
 @click.argument('operation', required=False)
 @click.argument('args', nargs=-1, type=click.UNPROCESSED)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/run_ci.py` around lines 499 - 505, The CLI
currently uses context_settings with allow_interspersed_args=False while the
command signature accepts a catch-all args parameter typed click.UNPROCESSED, so
options provided after positional arguments (e.g., --dry-run) get captured into
args and never parsed; fix this by enabling interspersed option parsing (set
allow_interspersed_args=True in the click.command context_settings) so
parent/global options are parsed regardless of position, or alternatively stop
using click.UNPROCESSED for the args parameter and accept a parsed remainder
(e.g., change handling of the args parameter) so flags like --dry-run are
recognized; update the decorator/context_settings where allow_interspersed_args
is declared and adjust how the args parameter is defined/consumed accordingly.

Comment on lines +181 to +205
# Merge accelerator defaults
accel_defaults = self.get_accelerator_defaults()
accel_vllm_args = accel_defaults.get("vllm_args", {})
accel_env_vars = accel_defaults.get("env_vars", {})

# Merge model config
model_deploy = raw_config.get("deploy", {})
model_vllm_args = raw_config.get("vllm_args", {})
model_env_vars = raw_config.get("env_vars", {})

# Merge accelerator overrides from model
accel_overrides = raw_config.get("accelerator_overrides", {}).get(self.accelerator, {})
override_vllm_args = accel_overrides.get("vllm_args", {})
override_env_vars = accel_overrides.get("env_vars", {})

# Build final config through inheritance chain
final_deploy = deep_merge(base_deploy, model_deploy)
final_vllm_args = deep_merge(
deep_merge(deep_merge(base_vllm_args, accel_vllm_args), model_vllm_args),
override_vllm_args,
)
final_env_vars = deep_merge(
deep_merge(accel_env_vars, model_env_vars),
override_env_vars,
)

⚠️ Potential issue | 🟠 Major

Merge accelerator-specific deploy settings into final_deploy.

load_model() applies accelerator inheritance for vllm_args and env_vars, but not for deploy. That drops accelerator-specific GPU/resource settings from both ResolvedModelConfig.deploy and ResolvedModelConfig.num_gpus, even though the module-level resolution order says they should win.

Suggested fix
         # Merge accelerator defaults
         accel_defaults = self.get_accelerator_defaults()
+        accel_deploy = accel_defaults.get("deploy", {})
         accel_vllm_args = accel_defaults.get("vllm_args", {})
         accel_env_vars = accel_defaults.get("env_vars", {})
@@
         # Merge accelerator overrides from model
         accel_overrides = raw_config.get("accelerator_overrides", {}).get(self.accelerator, {})
+        override_deploy = accel_overrides.get("deploy", {})
         override_vllm_args = accel_overrides.get("vllm_args", {})
         override_env_vars = accel_overrides.get("env_vars", {})
 
         # Build final config through inheritance chain
-        final_deploy = deep_merge(base_deploy, model_deploy)
+        final_deploy = deep_merge(
+            deep_merge(deep_merge(base_deploy, accel_deploy), model_deploy),
+            override_deploy,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/config_loader.py` around lines 181 - 205, The deploy
merge is missing accelerator-specific entries so accelerator-level GPU/resource
settings are lost; update the merge to read accel_defaults.get("deploy", {}) and
accel_overrides.get("deploy", {}) (e.g., accel_deploy and override_deploy) and
include them when building final_deploy (merge base_deploy, accel_deploy,
model_deploy, then override_deploy) so final_deploy reflects accelerator
defaults and overrides (ensure the same final_deploy variable used by
ResolvedModelConfig.deploy/num_gpus).
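
The four-level inheritance chain the fix restores can be demonstrated with a small self-contained sketch. This assumes `deep_merge` is a recursive dict merge where the second argument wins, matching how it is used in the review above; the example values are illustrative, not from the repository.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursive dict merge; values in `override` win (assumed semantics)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base_deploy = {"num_gpus": 1, "resources": {"memory": "32Gi"}}   # defaults.yaml
accel_deploy = {"resources": {"memory": "64Gi"}}                 # accelerator defaults
model_deploy = {"num_gpus": 4}                                   # model config
override_deploy = {"resources": {"memory": "128Gi"}}             # model.accelerator_overrides

final_deploy = deep_merge(
    deep_merge(deep_merge(base_deploy, accel_deploy), model_deploy),
    override_deploy,
)
print(final_deploy)  # {'num_gpus': 4, 'resources': {'memory': '128Gi'}}
```

Without the `accel_deploy` and `override_deploy` layers, the accelerator-specific memory setting would silently fall back to the global default, which is exactly the bug the comment flags.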

Comment on lines +266 to +273
# Merge scenario defaults with global defaults
global_defaults = self.get_global_defaults()
scenario_defaults = data.get("defaults", {})
data["_resolved_defaults"] = deep_merge(global_defaults, scenario_defaults)

# Add accelerator info
data["_accelerator"] = self.accelerator
data["_accelerator_config"] = self.get_accelerator_defaults()

⚠️ Potential issue | 🟠 Major

load_scenario() is skipping accelerator defaults.

_resolved_defaults currently merges only defaults.defaults and the scenario-local defaults. That means accelerator-specific defaults from defaults.yaml.accelerators[accelerator] never participate, which disagrees with the documented resolution order and can make scenario expansion differ from model resolution.

Suggested fix
         # Merge scenario defaults with global defaults
         global_defaults = self.get_global_defaults()
+        accelerator_defaults = self.get_accelerator_defaults()
         scenario_defaults = data.get("defaults", {})
-        data["_resolved_defaults"] = deep_merge(global_defaults, scenario_defaults)
+        data["_resolved_defaults"] = deep_merge(
+            deep_merge(global_defaults, accelerator_defaults),
+            scenario_defaults,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/config_loader.py` around lines 266 - 273,
load_scenario() is currently building data["_resolved_defaults"] by merging only
get_global_defaults() and scenario-local "defaults", which omits
accelerator-specific defaults; modify the merge so accelerator defaults from
get_accelerator_defaults() are merged between the global and scenario defaults
(e.g., deep_merge(global_defaults, accelerator_defaults, scenario_defaults)) so
that get_accelerator_defaults(), deep_merge, and data["_resolved_defaults"] are
used in that order within load_scenario().

Comment on lines +358 to +367
url_result = subprocess.run(
[
"oc", "get", "inferenceservice",
self.deployment_name,
"-n", self.namespace,
"-o", "jsonpath={.status.url}",
],
capture_output=True,
text=True,
)

⚠️ Potential issue | 🟠 Major

Add timeouts to the remaining oc subprocesses.

These calls have no timeout, so the step can hang indefinitely after readiness is reached or even after the main wait deadline expires while collecting debug info. That bypasses the outer timeout contract for the workflow.

Suggested fix
                     url_result = subprocess.run(
                         [
                             "oc", "get", "inferenceservice",
                             self.deployment_name,
                             "-n", self.namespace,
                             "-o", "jsonpath={.status.url}",
                         ],
                         capture_output=True,
                         text=True,
+                        timeout=15,
                     )
@@
             result = subprocess.run(
                 ["oc", "get", "inferenceservice", self.deployment_name, "-n", self.namespace, "-o", "yaml"],
                 capture_output=True,
                 text=True,
+                timeout=15,
             )
@@
             result = subprocess.run(
                 ["oc", "get", "pods", "-l", f"serving.kserve.io/inferenceservice={self.deployment_name}",
                  "-n", self.namespace],
                 capture_output=True,
                 text=True,
+                timeout=15,
             )

Also applies to: 419-433

🧰 Tools
🪛 Ruff (0.15.9)

[error] 358-358: subprocess call: check for execution of untrusted input

(S603)


[error] 359-364: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/deploy.py` around lines 358 - 367, The
subprocess.run that populates url_result (and other oc subprocess.run
invocations later in the file) lack timeouts and can hang; add a timeout
argument (use a sensible value or a shared constant/the outer wait deadline) to
each subprocess.run call fetching oc resources (e.g., the run that sets
url_result and the similar calls around lines 419-433), and wrap them in
try/except for subprocess.TimeoutExpired to log the timeout (include
deployment_name and namespace) and fail/raise so the workflow’s outer timeout
contract is preserved.
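
The bounded-subprocess pattern recommended here can be sketched as a small helper. `run_bounded` and the 15-second default are illustrative choices, not names from the PR; the `oc` invocation is shown only in a comment so the sketch stays runnable without a cluster.

```python
import subprocess
import sys

OC_TIMEOUT = 15  # seconds; illustrative default, not a value from the PR

def run_bounded(cmd: list[str], timeout: float = OC_TIMEOUT):
    """Run a command with a hard deadline; return stdout, or None on timeout."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # caller logs deployment_name/namespace and fails the step
    return result.stdout.strip()

# A real call would look like:
#   run_bounded(["oc", "get", "inferenceservice", name, "-n", ns,
#                "-o", "jsonpath={.status.url}"])
print(run_bounded([sys.executable, "-c", "print('ok')"]))                              # ok
print(run_bounded([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5))  # None
```

Centralizing the timeout in one helper also makes it easy to later wire the remaining budget from the workflow's outer deadline into each `oc` call.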

Comment on lines +23 to +38
step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

subscription_yaml = """
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: nfd
namespace: openshift-nfd
spec:
channel: stable
name: nfd
source: redhat-operators
sourceNamespace: openshift-marketplace
"""
yaml_path = step_dir / "nfd-subscription.yaml"
yaml_path.write_text(subscription_yaml)

⚠️ Potential issue | 🟠 Major

Create the per-step artifact directory before writing YAML.

These methods derive step_dir from mutable context state, but they never create it. yaml_path.write_text(...) will fail with FileNotFoundError unless some caller pre-created the numbered directory first. Reusing ctx.get_step_artifact_dir(self.name) here would also keep step numbering consistent with the rest of the workflow code.

Suggested fix
-        step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+        step_dir = ctx.get_step_artifact_dir(self.name)

Also applies to: 69-84, 113-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 23 - 38, The code
computes step_dir and writes YAML with yaml_path.write_text(...) but never
creates the per-step artifact directory, causing FileNotFoundError; update the
operator steps to use the canonical ctx.get_step_artifact_dir(self.name) (or
call step_dir.mkdir(parents=True, exist_ok=True) after computing step_dir)
before writing files so the directory exists; apply this change in the blocks
that define step_dir and write nfd-subscription.yaml (and analogous blocks at
the other occurrences noted) to ensure consistent step numbering and avoid write
errors.
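
The failure mode and its fix are easy to see in isolation. This minimal sketch mirrors the review's step-directory naming scheme (`003__install_nfd`) inside a throwaway temp dir; the key point is that `mkdir(parents=True, exist_ok=True)` must precede `write_text`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Without this mkdir, write_text below raises FileNotFoundError because
    # the numbered step directory does not exist yet.
    step_dir = Path(tmp) / "003__install_nfd"
    step_dir.mkdir(parents=True, exist_ok=True)

    yaml_path = step_dir / "nfd-subscription.yaml"
    yaml_path.write_text("kind: Subscription\n")
    print(yaml_path.exists())  # True
```

In the actual step code, calling `ctx.get_step_artifact_dir(self.name)` is preferable since it both creates the directory and keeps the step counter consistent.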

Comment on lines +40 to +44
# Create namespace first
self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
self._run_oc(["apply", "-f", str(yaml_path)])

return StepResult.ok("NFD operator subscription created")

⚠️ Potential issue | 🔴 Critical

Don’t return success when the NFD install commands were never applied.

Line 41 uses --dry-run=client -o yaml without the follow-up oc apply -f -, so the namespace creation is a no-op. Both command results are then ignored, and Line 44 still returns success. That can make the prepare phase look green even though NFD was not installed at all.

Suggested fix
-        # Create namespace first
-        self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
-        self._run_oc(["apply", "-f", str(yaml_path)])
-
-        return StepResult.ok("NFD operator subscription created")
+        if not self._run_oc(["create", "namespace", "openshift-nfd"]):
+            return StepResult.fail("Failed to create openshift-nfd namespace")
+        if not self._run_oc(["apply", "-f", str(yaml_path)]):
+            return StepResult.fail("Failed to apply NFD subscription")
+        return StepResult.ok("NFD operator subscription created")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 40 - 44, The code
currently runs a dry-run namespace creation and then applies a YAML but ignores
both command outcomes and always returns StepResult.ok; change the sequence in
the method using _run_oc so the namespace is actually created (either remove the
"--dry-run=client -o yaml" or run the output through "oc apply -f -" so creation
is applied), capture and check the return/result of both _run_oc calls (the
namespace creation/apply and the subsequent _run_oc(["apply", "-f",
str(yaml_path)])), and if either fails return a failure StepResult instead of
StepResult.ok; look for the _run_oc invocations and StepResult.ok call to update
behavior.
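
The fail-fast pattern the fix calls for boils down to checking each command result before reporting success. This sketch uses a generic `run_cmd` as a stand-in for `_run_oc` and a deliberately failing interpreter call in place of a real `oc` command, so it runs anywhere; the return strings are illustrative, not the actual `StepResult` API.

```python
import subprocess
import sys

def run_cmd(cmd: list[str]) -> bool:
    """Stand-in for _run_oc: True only when the command exits 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def install_step() -> str:
    # A failing command (exit 1) stands in for a real `oc create namespace` here.
    if not run_cmd([sys.executable, "-c", "raise SystemExit(1)"]):
        return "fail: namespace creation failed"
    return "ok: NFD operator subscription created"

print(install_step())  # fail: namespace creation failed
```

The original code's unconditional `StepResult.ok(...)` is exactly the opposite of this: it reports the final string no matter what the commands returned.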

Comment thread: tests/rhaiis/test_ci.py
Comment on lines +34 to +35
return CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"})


⚠️ Potential issue | 🟡 Minor

Avoid shared hardcoded artifact path in tests.

Line [34], Line [40], and Line [53] use a fixed /tmp/artifacts path, which can cause cross-test interference when tests run concurrently. Prefer a per-test temp path fixture.

💡 Suggested change
 class TestCiTest:
@@
     @pytest.fixture
-    def runner(self):
-        return CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"})
+    def runner(self, tmp_path):
+        return CliRunner(env={"FORGE_ARTIFACT_DIR": str(tmp_path / "artifacts")})
@@
-            env={"FORGE_MODEL": "test/model", "FORGE_ARTIFACT_DIR": "/tmp/artifacts"}
+            env={"FORGE_MODEL": "test/model", "FORGE_ARTIFACT_DIR": str(tmp_path / "artifacts")}
@@
-                "FORGE_ARTIFACT_DIR": "/tmp/artifacts",
+                "FORGE_ARTIFACT_DIR": str(tmp_path / "artifacts"),

Also applies to: 40-41, 50-54

🧰 Tools
🪛 Ruff (0.15.9)

[error] 34-34: Probable insecure usage of temporary file or directory: "/tmp/artifacts"

(S108)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_ci.py` around lines 34 - 35, The tests currently hardcode
FORGE_ARTIFACT_DIR="/tmp/artifacts" which can cause cross-test interference;
update the test(s) that return CliRunner(env={"FORGE_ARTIFACT_DIR":
"/tmp/artifacts"}) to accept a per-test temporary directory (e.g., pytest's
tmp_path or a tempfile.TemporaryDirectory) and set FORGE_ARTIFACT_DIR to that
path (str(tmp_path) or the temp dir path) when constructing CliRunner; locate
the return site that calls CliRunner and any helper functions that build the
runner and replace the hardcoded string with the injected temp fixture so each
test gets an isolated artifact directory.
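
The isolation property the fix provides can be shown without pytest. This sketch approximates what the `tmp_path` fixture does per test: every invocation gets a fresh directory, so two concurrently running tests can never collide on `/tmp/artifacts`. `make_artifact_env` is an illustrative helper, not code from the PR.

```python
import tempfile
from pathlib import Path

def make_artifact_env() -> dict:
    """Per-invocation artifact dir, analogous to pytest's tmp_path fixture."""
    tmp = Path(tempfile.mkdtemp())
    return {"FORGE_ARTIFACT_DIR": str(tmp / "artifacts")}

env_a, env_b = make_artifact_env(), make_artifact_env()
print(env_a["FORGE_ARTIFACT_DIR"] != env_b["FORGE_ARTIFACT_DIR"])  # True
```

In the tests themselves, injecting `tmp_path` into the fixture achieves the same thing and lets pytest handle cleanup.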

Comment thread: tests/rhaiis/test_ci.py
Comment on lines +57 to +59
assert result.exit_code == 0
assert "balanced" in result.output or "heterogeneous" in result.output


⚠️ Potential issue | 🟡 Minor

Strengthen workload assertion for full parsing coverage.

Line [58] uses or, so the test passes even if only one workload is surfaced. If the CLI is expected to reflect both inputs, assert both.

💡 Suggested change
-        assert "balanced" in result.output or "heterogeneous" in result.output
+        assert "balanced" in result.output
+        assert "heterogeneous" in result.output
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_ci.py` around lines 57 - 59, The test currently uses
`assert "balanced" in result.output or "heterogeneous" in result.output`, which
lets the test pass if only one workload is printed; change this to assert both
workloads are present by using `and` (i.e., require both `"balanced"` and
`"heterogeneous"` in `result.output`) in the same test (the assertion
referencing `result`/`result.output`).

@openshift-ci

openshift-ci bot commented Apr 18, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 18, 2026