
Adding Forge rhaiis test entrypoint#13

Closed
MML-coder wants to merge 3 commits into main from rhaiis-pr

Conversation

Collaborator

@MML-coder MML-coder commented Apr 4, 2026

This PR only supports the Forge test workflow: it deploys vLLM, runs the benchmark, and cleans up. It does NOT include:

  • prepare workflow (operator installation) - stubbed only
  • cleanup workflow - basic implementation
  • llm-d integration - design doc included, implementation separate

How It Works

  1. ConfigLoader resolves model config via inheritance chain:
    defaults.yaml → accelerator[nvidia|amd] → model → model.accelerator_overrides
  2. BenchmarkWorkflow executes steps sequentially:
    DeployVLLM → WaitForReady → RunGuideLLM → (finally) CollectArtifacts → Cleanup
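The inheritance chain in step 1 can be sketched as a recursive deep merge, where each later layer overrides scalar keys but combines nested dicts. This is an illustration only; deep_merge and the sample dicts are not the actual ConfigLoader code.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge override onto base; nested dicts combine, later scalars win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical layers mirroring defaults → accelerator → model → overrides
defaults = {"vllm": {"max_model_len": 4096, "dtype": "auto"}}
accelerator = {"vllm": {"dtype": "bfloat16"}}
model = {"vllm": {"max_model_len": 8192}}
overrides = {"vllm": {"gpu_memory_utilization": 0.9}}

resolved = defaults
for layer in (accelerator, model, overrides):
    resolved = deep_merge(resolved, layer)
# resolved["vllm"] == {"max_model_len": 8192, "dtype": "bfloat16",
#                      "gpu_memory_utilization": 0.9}
```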

Single model + workload:

```shell
export KUBECONFIG=~/kubeconfigs/my-cluster
PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model qwen-0.6b --workload balanced --accelerator nvidia
```

Multiple workloads (deploy-once pattern):

```shell
PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model llama-3.3-70b-fp8 --workloads balanced,short,long-prompt
```

Summary by CodeRabbit

  • New Features

    • Configuration-driven benchmarking: global defaults, accelerator and per-model overrides, model registry, and workload presets.
    • Declarative scenario generator with matrix expansion and deploy-once grouping.
    • End-to-end workflow engine and RHAIIS workflows: prepare, deploy, wait-for-ready, run GuideLLM benchmarks, collect artifacts, and cleanup.
    • CLI/CI commands to prepare, run, and cleanup benchmarks.
    • Kubernetes CLI wrapper with retry/backoff for robust cluster operations.
  • Documentation

    • Implementation guide and extensibility roadmap for llm-d.
  • Tests

    • Extensive unit tests covering configs, generator, workflows, steps, and utilities.
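The retry/backoff mentioned for the Kubernetes CLI wrapper might look roughly like the following. This is a hedged sketch; run_with_retry, its parameters, and the transient-error predicate are assumptions, not the wrapper's real API.

```python
import time

def run_with_retry(fn, attempts=3, base_delay=0.5, is_transient=lambda exc: True):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the final attempt or on non-transient errors
            if attempt == attempts or not is_transient(exc):
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```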

@openshift-ci

openshift-ci bot commented Apr 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign avasilevskii for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

Walkthrough

Added a declarative, YAML-driven RHAIIS benchmarking system: configuration registries, scenario generator, workflow engine and steps, Kubernetes/oc utilities, CLI entrypoints, orchestration runners, domain workflows/steps, and comprehensive unit tests.

Changes

  • Configuration (config/rhaiis/defaults.yaml, config/rhaiis/models.yaml, config/rhaiis/workloads.yaml):
    New YAML configs: global defaults, accelerator-specific blocks (nvidia/amd), a model registry with per-model vllm overrides and aliases, and multiple GuideLLM workload presets.
  • Scenario generation (projects/core/scenarios/: config.py, config_loader.py, generator.py, __init__.py):
    New ScenarioConfig dataclass, deep-merge loader resolving defaults→accelerator→model→model.accelerator_overrides, Resolved* dataclasses, ConfigLoader API, ScenarioGenerator with matrix/explicit expansion and deployment grouping.
  • Workflow engine core (projects/core/workflow/: context.py, step.py, workflow.py, executor.py, __init__.py):
    Introduced WorkflowContext, StepStatus/StepResult/WorkflowStep abstractions, Workflow base class, and SequentialExecutor implementing ordered execution with finally-step semantics.
  • OC/Kubernetes utilities (projects/core/utils/oc.py, projects/core/utils/__init__.py):
    OC wrapper with RetryConfig, OCResult, transient-error detection, retry/backoff logic, and convenience methods (run/get/apply/delete/logs/exec/wait/rsync/create_namespace).
  • Shared workflow steps (projects/core/steps/: artifacts.py, guidellm.py, __init__.py):
    Added CollectArtifactsStep, CleanupDeploymentStep, and RunGuideLLMStep for artifact collection, best-effort cleanup, and running GuideLLM benchmarks (pod creation, polling, log/result collection).
  • RHAIIS workflows (projects/rhaiis/workflows/: benchmark.py, prepare.py, cleanup.py, __init__.py):
    BenchmarkWorkflow (deploy → wait → benchmark with finally steps), PrepareWorkflow (operator installs), CleanupWorkflow (namespace/resource cleanup).
  • RHAIIS workflow steps (projects/rhaiis/workflows/steps/: deploy.py, operators.py, cleanup.py, __init__.py):
    DeployVLLMStep (KServe manifest generation + oc apply), WaitForReadyStep (ready condition + in-pod health check), CleanupNamespaceStep, and operator installation steps (NFD/GPU/RHOAI subscriptions).
  • Orchestration & CLI (projects/rhaiis/orchestration/: cli.py, ci.py, test_rhaiis.py, __init__.py):
    Click CLIs (prepare/test/cleanup), test_rhaiis orchestration runners with dry-run, multi-workload deploy-once grouping, resolved config loading, and ctx metadata writing.
  • CI entrypoint tweak (projects/core/ci_entrypoint/run_ci.py):
    Click command updated to ignore unknown options and pass trailing args unprocessed (click.UNPROCESSED, ignore_unknown_options, allow_interspersed_args=False).
  • Documentation (projects/rhaiis/IMPLEMENTATION.md, projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md, projects/rhaiis/__init__.py):
    Added implementation and extensibility docs describing architecture, config precedence, workflows, and llm-d extension proposals.
  • Tests (tests/..., many new files):
    Comprehensive unit tests for config loader, scenario generator, workflow engine, OC wrapper retry behavior, steps, workflows, and CLI; extensive mocking of subprocess/oc interactions.
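As a rough illustration of the matrix expansion attributed to ScenarioGenerator above, a Cartesian product over matrix fields might look like this. Field names (model, workload) are assumptions, not the real schema.

```python
from itertools import product

def expand_matrix(matrix: dict) -> list[dict]:
    """Expand a scenario matrix into one dict per concrete combination."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]

scenarios = expand_matrix({
    "model": ["qwen-0.6b", "llama-3.3-70b-fp8"],
    "workload": ["balanced", "short"],
})
# 2 models x 2 workloads -> 4 concrete scenarios
```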

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI (projects/rhaiis/orchestration/cli.py)
    participant Orchestrator as test_rhaiis.py
    participant ConfigLoader as ConfigLoader
    participant Generator as ScenarioGenerator
    participant Workflow as BenchmarkWorkflow
    participant Executor as SequentialExecutor
    participant OC as OC Utility
    participant K8s as Kubernetes/OpenShift

    CLI->>Orchestrator: run_test(...)
    Orchestrator->>ConfigLoader: load_model/load_workload
    ConfigLoader-->>Orchestrator: Resolved configs
    Orchestrator->>Generator: expand / group scenarios
    Generator-->>Orchestrator: ExpandedScenario(s)
    Orchestrator->>Workflow: instantiate BenchmarkWorkflow
    Workflow->>Executor: execute()
    Executor->>OC: apply KServe manifest
    OC->>K8s: kubectl/oc apply
    K8s-->>OC: apply result
    Executor->>OC: poll inferenceservice (get)
    OC->>K8s: check Ready
    K8s-->>OC: status
    Executor->>OC: apply benchmark pod
    OC->>K8s: create pod, stream logs
    Executor->>OC: rsync /benchmark-results
    OC->>K8s: copy results
    Executor->>OC: collect artifacts & cleanup (finally steps)
    OC->>K8s: delete resources
    Executor-->>Workflow: WorkflowResult
    Workflow-->>CLI: exit code/result
```
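The finally-step semantics shown in the diagram (artifacts are collected and cleanup runs even when a benchmark step fails) can be sketched as below. This is illustrative only; the real SequentialExecutor works with WorkflowStep/StepResult objects, not plain callables.

```python
def run_workflow(steps, finally_steps):
    """Run steps in order; always run finally_steps, even after a failure."""
    results = {}
    try:
        for step in steps:
            results[step.__name__] = step()
    finally:
        for step in finally_steps:
            try:
                results[step.__name__] = step()
            except Exception as exc:
                # Best-effort: a failing finally step must not mask the original error
                results[step.__name__] = f"failed: {exc}"
    return results
```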

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 Through YAML fields I hop and peep,

configs merged and scenarios deep.
Steps march forward, oc retries hum,
pods and logs and results to drum.
A tidy run — prepare, test, then sweep.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the PR title 'Adding Forge rhaiis test entrypoint' accurately describes the primary objective: adding a test workflow entrypoint for the rhaiis project)
  • Docstring Coverage: ✅ Passed (84.98% coverage; the required threshold is 80.00%)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (15)
projects/core/utils/oc.py-204-209 (1)

204-209: ⚠️ Potential issue | 🟡 Minor

Rename input to avoid the Ruff A002 failure.

input shadows the Python builtin in all three signatures, and Ruff is already flagging it here. A small rename like stdin_input keeps the API clear and avoids a lint blocker.

Proposed rename
-        input: str | None = None,
+        stdin_input: str | None = None,
...
-                    input=input,
+                    input=stdin_input,
...
-    def run(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def run(self, *args: str, namespace: str | None = None, timeout: int | None = None, stdin_input: str | None = None) -> OCResult:
...
-        return self._run_with_retry(cmd, timeout, input)
+        return self._run_with_retry(cmd, timeout, stdin_input)
...
-    def apply(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def apply(self, *args: str, namespace: str | None = None, timeout: int | None = None, stdin_input: str | None = None) -> OCResult:
...
-        return self.run("apply", *args, namespace=namespace, timeout=timeout, input=input)
+        return self.run("apply", *args, namespace=namespace, timeout=timeout, stdin_input=stdin_input)

Also applies to: 300-300, 335-335

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/utils/oc.py` around lines 204 - 209, The parameter name `input`
shadows the Python builtin and triggers Ruff A002 in OC._run_with_retry and the
two other method signatures; rename the parameter to `stdin_input` (update type
hints from `input: str | None` to `stdin_input: str | None`) in the function
definitions and replace all internal references to `input` with `stdin_input`
(and update any places that call these methods to pass the renamed argument) so
signatures, usages, and return behavior remain identical except for the
parameter name.
projects/core/workflow/context.py-83-99 (1)

83-99: ⚠️ Potential issue | 🟡 Minor

Avoid mutating step numbering from a getter.

Line 95 increments unconditionally, so a step that asks for its artifact directory twice gets two different folders and shifts the numbering for every later step. Cache the directory per step, or have the executor advance the counter once before the step runs.
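A cached variant along the suggested lines might look like this. Names such as step_number and _step_dirs mirror the review comment but are assumptions about the real class.

```python
from pathlib import Path

class WorkflowContext:
    """Sketch: per-step artifact directories with idempotent numbering."""

    def __init__(self, artifact_dir: Path):
        self.artifact_dir = artifact_dir
        self.step_number = 0
        self._step_dirs: dict[str, Path] = {}

    def get_step_artifact_dir(self, step_name: str) -> Path:
        # Increment only the first time a given step asks for its directory
        if step_name not in self._step_dirs:
            self.step_number += 1
            self._step_dirs[step_name] = (
                self.artifact_dir / f"{self.step_number:03d}__{step_name}"
            )
        return self._step_dirs[step_name]
```

Repeated calls now return the same path, so later steps keep stable numbering.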

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/context.py` around lines 83 - 99, The getter
get_step_artifact_dir currently mutates state (increments self.step_number and
sets self.current_step_name) every call; change it to be idempotent by caching
per-step directories or only incrementing when a new step is first requested:
add an internal mapping (e.g. self._step_dirs: dict[str, Path]) and in
get_step_artifact_dir check if step_name exists, return cached Path if present,
otherwise increment self.step_number once, set self.current_step_name, create
the new step_dir from self.artifact_dir /
f"{self.step_number:03d}__{step_name}", store it in self._step_dirs[step_name]
and return it; this preserves existing names (step_dir) and avoids changing
numbering on repeated calls.
projects/rhaiis/workflows/steps/deploy.py-504-506 (1)

504-506: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary f-string prefix.

Static analysis correctly identified an f-string without placeholders.

🐛 Fix
         except subprocess.TimeoutExpired:
-            click.echo(f"  Health check timeout, retrying...")
+            click.echo("  Health check timeout, retrying...")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/deploy.py` around lines 504 - 506, In the
except subprocess.TimeoutExpired block (around the health-check retry logic)
remove the unnecessary f-string prefix on the click.echo call: replace the
f-string literal used in click.echo(f"  Health check timeout, retrying...") with
a plain string click.echo("  Health check timeout, retrying..."); also scan the
surrounding exception handling in the same function for other f-strings that
have no placeholders and convert them to normal string literals (refer to the
except subprocess.TimeoutExpired handler and the click.echo call).
projects/core/steps/guidellm.py-158-179 (1)

158-179: ⚠️ Potential issue | 🟡 Minor

Potential command injection via args in shell command.

The guidellm_cmd is constructed by joining args and embedded directly in a shell script. If any runtime argument values contain shell metacharacters (e.g., ;, $, backticks), they could be interpreted by the shell.

Consider using JSON-escaped arguments or validating that values don't contain shell-special characters.

🛡️ Proposed fix using shlex.quote
+import shlex
+
 def _generate_pod_yaml(self, data: str) -> str:
     """Generate GuideLLM pod YAML."""
     # Build guidellm args
     args = [
         "--target", self.endpoint,
         "--model", self.model,
         # ...
     ]
     # ...
-    guidellm_cmd = f"python3 -m benchmark.main {' '.join(args)}"
+    guidellm_cmd = f"python3 -m benchmark.main {' '.join(shlex.quote(a) for a in args)}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/steps/guidellm.py` around lines 158 - 179, The constructed
shell command in guidellm_cmd is vulnerable because args are joined and
interpolated into a shell script; to fix, escape each argument with shlex.quote
(or equivalent) before joining or, better, avoid a shell script altogether by
setting the Pod container's command to ["/bin/sh","-c"] with a safely quoted
string or by using the container args/command list to pass arguments directly;
update the code that builds guidellm_cmd (and the Pod spec that embeds it) to
either quote every element from args via shlex.quote or switch to a structured
args array so no unescaped shell metacharacters from args can be interpreted.
projects/core/steps/artifacts.py-197-232 (1)

197-232: ⚠️ Potential issue | 🟡 Minor

Unused errors list is initialized but never populated.

The errors list on line 201 is never appended to, yet it's checked on line 225 and included in the result data on line 231. Either remove it or populate it when _delete_resource returns False.

🐛 Proposed fix to populate errors
     # Delete KServe resources first (they manage the underlying deployments)
     if self.use_kserve:
         if self._delete_resource("inferenceservice", self.deployment_name, ns_args):
             deleted.append(f"inferenceservice/{self.deployment_name}")
+        else:
+            errors.append(f"inferenceservice/{self.deployment_name}")
         if self._delete_resource("servingruntime", self.deployment_name, ns_args):
             deleted.append(f"servingruntime/{self.deployment_name}")
+        else:
+            errors.append(f"servingruntime/{self.deployment_name}")

     # Delete standard deployment (if not using KServe or as fallback)
     if self._delete_resource("deployment", self.deployment_name, ns_args):
         deleted.append(f"deployment/{self.deployment_name}")
+    else:
+        errors.append(f"deployment/{self.deployment_name}")

Or alternatively, remove the errors list entirely if tracking deletion failures is not needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/steps/artifacts.py` around lines 197 - 232, The errors list is
created in execute() but never populated; update execute to append a descriptive
error string to errors whenever _delete_resource(resource_type,
self.deployment_name, ns_args) returns False (for inferenceservice,
servingruntime, deployment, service, route attempts), e.g., append
f"{resource_type}/{self.deployment_name} failed" so the existing message
construction and StepResult data["errors"] are meaningful; alternatively, if you
prefer not to track failures, remove the errors variable and its usage in the
message and StepResult data (choose one approach and apply consistently across
the inferenceservice/servingruntime/deployment/service/route deletion checks).
config/rhaiis/workloads.yaml-48-55 (1)

48-55: ⚠️ Potential issue | 🟡 Minor

Incomplete configuration: placeholder value needs replacement.

The data field contains a placeholder "ADD ME" which will likely cause GuideLLM to fail or produce meaningless results if this workload is selected. Either provide a valid data specification or disable this workload until it's properly configured.

Would you like me to suggest a data specification for heterogeneous traffic patterns, or should I open an issue to track this?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/rhaiis/workloads.yaml` around lines 48 - 55, Replace the placeholder
in the heterogeneous workload's guidellm.data field (in the "heterogeneous"
block) with a valid data specification or temporarily disable the workload;
specifically, update guidellm.data to point to a real dataset or pattern
definition (for example a path, dataset name, or structured traffic spec) that
matches the existing fields rate_type, rates, and max_seconds, or remove/disable
the entire heterogeneous entry until a proper data value is provided so GuideLLM
won't receive "ADD ME".
projects/core/workflow/executor.py-89-93 (1)

89-93: ⚠️ Potential issue | 🟡 Minor

Inconsistency: start_time not set for finally steps.

Normal steps have result.start_time assigned (albeit incorrectly), but finally steps don't set it at all. Apply the same fix here for consistency.

🔧 Proposed fix
+            step_start_time = datetime.now(timezone.utc)
             step_start = time.monotonic()
             try:
                 result = step.execute(ctx)
                 result.duration_seconds = time.monotonic() - step_start
+                result.start_time = step_start_time
                 step_results[step_name] = result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` around lines 89 - 93, The finally-step
handling block currently records result.duration_seconds but never sets
result.start_time; update the finally step execution to capture a start
timestamp (e.g., start_time = time.monotonic()) immediately before calling
step.execute(ctx), assign that value to result.start_time, and then compute
result.duration_seconds = time.monotonic() - start_time, mirroring how normal
steps populate start_time/duration for consistency with step_results[step_name]
and the existing result fields.
projects/rhaiis/IMPLEMENTATION.md-7-34 (1)

7-34: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks.

These blocks are missing fence languages, which can trigger markdown lint failures.

Example fix
-```
+```text
 CLI/CI Entry Points
@@
-```
+```

-```
+```text
 config/
@@
-```
+```

Also applies to: 59-78

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/IMPLEMENTATION.md` around lines 7 - 34, The fenced code
blocks in IMPLEMENTATION.md that contain the ASCII diagrams (e.g., the block
showing "CLI/CI Entry Points → test_rhaiis.py", "ConfigLoader",
"BenchmarkWorkflow", and "WorkflowStep implementations") and the later similar
block (around the config/ section) are missing language identifiers; add a
language tag (use "text") to the opening triple-backtick for each fenced block
so markdown linters recognize them as code blocks (update the opening ``` to
```text for the diagram block(s) and any other plain-text fenced blocks
referenced in the doc).
tests/rhaiis/test_ci.py-34-35 (1)

34-35: ⚠️ Potential issue | 🟡 Minor

Avoid fixed /tmp/artifacts in tests.

Using a shared absolute temp path can cause cross-test interference and portability issues. Prefer a per-test temporary directory fixture.

Safer test setup
 class TestCiTest:
@@
     @pytest.fixture
-    def runner(self):
-        return CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"})
+    def runner(self, tmp_path):
+        artifact_dir = tmp_path / "artifacts"
+        return CliRunner(env={"FORGE_ARTIFACT_DIR": str(artifact_dir)})
@@
         result = runner.invoke(
             ci, ["test", "--dry-run"],
-            env={"FORGE_MODEL": "test/model", "FORGE_ARTIFACT_DIR": "/tmp/artifacts"}
+            env={"FORGE_MODEL": "test/model"}
         )
@@
                 "FORGE_MODEL": "test/model",
                 "FORGE_WORKLOADS": "balanced,heterogeneous",
-                "FORGE_ARTIFACT_DIR": "/tmp/artifacts",
             }
         )

Also applies to: 40-41, 53-54

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_ci.py` around lines 34 - 35, Replace the hardcoded
"/tmp/artifacts" used when creating CliRunner with a per-test temporary
directory (use pytest's tmp_path or tmp_path_factory) and set FORGE_ARTIFACT_DIR
to that temp path (e.g., str(tmp_path / "artifacts")); update all occurrences
that call CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"}) (including the
other instances mentioned at lines 40-41 and 53-54) so each test gets an
isolated directory and remove reliance on a shared absolute path.
tests/rhaiis/test_ci.py-58-58 (1)

58-58: ⚠️ Potential issue | 🟡 Minor

Assert both workloads, not either.

This test currently passes if only one workload is rendered, which can hide parsing/formatting regressions.

Tighten expectation
-        assert "balanced" in result.output or "heterogeneous" in result.output
+        assert "balanced" in result.output
+        assert "heterogeneous" in result.output
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_ci.py` at line 58, The test currently asserts that either
"balanced" or "heterogeneous" appears in result.output, which is too permissive;
change the assertion to require both strings be present (e.g., replace the
single-line check using "or" with two assertions or a combined condition using
"and") so that result.output contains both "balanced" and "heterogeneous" (refer
to the assertion referencing result.output in tests/rhaiis/test_ci.py).
tests/core/scenarios/test_generator.py-226-230 (1)

226-230: ⚠️ Potential issue | 🟡 Minor

Same temporary file cleanup issue as in sample_config_path.

Apply the same fix pattern here for explicit_runs_config fixture.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/scenarios/test_generator.py` around lines 226 - 230, The
explicit_runs_config fixture uses tempfile.NamedTemporaryFile(..., delete=False)
and yields Path(f.name) without removing the temp file; update the
explicit_runs_config fixture to mirror the sample_config_path fix by creating
and writing the temp YAML inside a try/finally (or context manager) and then
removing the file in the finally block (e.g., call os.remove on Path(f.name))
after the yield so the temporary file is always cleaned up; keep the existing
tempfile.NamedTemporaryFile, yaml.safe_dump and the yielded Path(f.name) but add
the cleanup logic surrounding the yield.
projects/rhaiis/orchestration/cli.py-48-59 (1)

48-59: ⚠️ Potential issue | 🟡 Minor

Inconsistent dry-run behavior: metadata is written before early return.

In prepare(), ctx.write_metadata() is called before the dry_run check, meaning even in dry-run mode, metadata files will be created. Same issue exists in cleanup(). Additionally, test() doesn't write metadata at all, which is inconsistent with the other commands.

Proposed fix for prepare()
 def prepare(rhoai_version: str, dry_run: bool):
     """Install operators (RHOAI, NFD, GPU) on OpenShift."""
-    ctx = WorkflowContext.from_environment()
-    ctx.write_metadata({"command": "prepare", "rhoai_version": rhoai_version})
-
     if dry_run:
+        ctx = WorkflowContext.from_environment()
         click.echo(f"[DRY-RUN] Would install RHOAI {rhoai_version}")
         click.echo(f"[DRY-RUN] Artifacts would be written to: {ctx.artifact_dir}")
         return

+    ctx = WorkflowContext.from_environment()
+    ctx.write_metadata({"command": "prepare", "rhoai_version": rhoai_version})
+
     exit_code = test_rhaiis.run_prepare(ctx, rhoai_version)
     sys.exit(exit_code)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/cli.py` around lines 48 - 59, The metadata is
being written even during dry runs and is missing for test(); update prepare(),
cleanup(), and test() to be consistent: move the ctx.write_metadata({"command":
...}) calls in prepare() and cleanup() to after the dry_run check (so dry-run
returns before any metadata/file writes), and add a similar
ctx.write_metadata({"command": "test", "rhoai_version": rhoai_version}) call in
test() (or the equivalent metadata payload used elsewhere) so all commands
record metadata only for real runs; refer to the prepare, cleanup, and test
functions and the WorkflowContext.write_metadata / ctx.write_metadata symbol
names to locate and adjust the code.
projects/rhaiis/orchestration/ci.py-57-92 (1)

57-92: ⚠️ Potential issue | 🟡 Minor

Inconsistent metadata handling and missing model validation in test command.

The test command differs from prepare and cleanup in two ways:

  1. It does not call workflow_ctx.write_metadata() to record command parameters
  2. It does not validate that model is provided (unlike cli.py which validates when not in dry-run)

If FORGE_MODEL is unset, model will be None and passed directly to run_test, which may cause confusing errors downstream.

Proposed fix
 @ci.command()
 @click.option("--dry-run", is_flag=True, help="Print what would be done without executing")
 @click.pass_context
 def test(ctx, dry_run: bool):
     """Test phase - Run vLLM benchmark.

     Configuration from config/ directory. FOURNOS sets:
         FORGE_MODEL - Model key (e.g., qwen-0.6b)
         FORGE_WORKLOADS - Comma-separated workloads (optional)
     """
-    workflow_ctx = WorkflowContext.from_environment()
-
     # Key parameters from FOURNOS
     model = os.environ.get("FORGE_MODEL")
     workloads_str = os.environ.get("FORGE_WORKLOADS")

     # Parse workloads
     workloads = None
     if workloads_str:
         workloads = [w.strip() for w in workloads_str.split(",")]

+    if dry_run:
+        click.echo("[DRY-RUN] Test phase")
+        click.echo(f"[DRY-RUN] Model: {model}")
+        if workloads:
+            click.echo(f"[DRY-RUN] Workloads: {workloads}")
+        return
+
+    if not model:
+        click.echo("Error: FORGE_MODEL environment variable must be set", err=True)
+        sys.exit(1)
+
+    workflow_ctx = WorkflowContext.from_environment()
+    workflow_ctx.write_metadata({
+        "command": "test",
+        "model": model,
+        "workloads": workloads,
+    })
+
     # Log
     click.echo("RHAIIS CI Test")
     click.echo(f"  Model: {model}")
     if workloads:
         click.echo(f"  Workloads: {workloads}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/ci.py` around lines 57 - 92, The test command
handler (test) must mirror prepare/cleanup by validating the model and recording
metadata: if not dry_run ensure the FORGE_MODEL-derived model is present and
abort/exit with a clear message when missing, and call
workflow_ctx.write_metadata(...) to persist the command parameters (e.g.,
command name, model, workloads, config_dir, dry_run) before invoking
test_rhaiis.run_test; update the test function to perform the model check and to
call WorkflowContext.write_metadata() on workflow_ctx prior to calling
test_rhaiis.run_test so metadata handling is consistent with the other commands.
tests/core/scenarios/test_generator.py-60-64 (1)

60-64: ⚠️ Potential issue | 🟡 Minor

Temporary file is not cleaned up after test.

NamedTemporaryFile(delete=False) creates a file that persists after the context manager exits. The yielded path should be cleaned up after the test completes to avoid leaving orphan files.

Proposed fix
+    import os
+
     `@pytest.fixture`
     def sample_config_path(self):
         """Create a sample scenarios.yaml file using new format."""
         config = {
             # ... config dict ...
         }

         with tempfile.NamedTemporaryFile(
             mode="w", suffix=".yaml", delete=False
         ) as f:
             yaml.safe_dump(config, f)
-            yield Path(f.name)
+            path = Path(f.name)
+        try:
+            yield path
+        finally:
+            os.unlink(path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/scenarios/test_generator.py` around lines 60 - 64, The temporary
YAML file created with tempfile.NamedTemporaryFile(delete=False) in the test
fixture is left on disk; after yielding Path(f.name) you must remove the file in
the fixture teardown. Update the generator fixture around
tempfile.NamedTemporaryFile / yaml.safe_dump to yield Path(f.name) and then call
Path(f.name).unlink() (or os.remove) in the code after the yield to ensure the
temporary file is cleaned up.
projects/core/scenarios/generator.py-612-612 (1)

612-612: ⚠️ Potential issue | 🟡 Minor

Useless if-else condition.

The ternary expression 1 if isinstance(model, ModelConfig) else 1 always evaluates to 1 regardless of the condition, making the check pointless.

🔧 Suggested fix
-        tp = run.get("tensor_parallel", 1 if isinstance(model, ModelConfig) else 1)
+        tp = run.get("tensor_parallel", 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/generator.py` at line 612, The ternary in the tp
assignment is redundant; replace the expression tp = run.get("tensor_parallel",
1 if isinstance(model, ModelConfig) else 1) with a single default value by using
run.get("tensor_parallel", 1) (keep references to the same variables run and
model and remove the pointless isinstance(ModelConfig) check); if there was an
intended different default for ModelConfig, clarify and implement that branch,
otherwise simply use run.get("tensor_parallel", 1) in generator.py where tp is
set.
🧹 Nitpick comments (21)
projects/core/workflow/step.py (1)

79-85: WorkflowStep.name is lowercased, not snake_case.

WaitForReadyStep currently becomes waitforready, not wait_for_ready, so the implementation doesn't match the contract in the comment/docstring. Either insert separators or document the actual behavior.

Small fix
+import re
...
-        # Convert to lowercase
-        return class_name.lower()
+        return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/step.py` around lines 79 - 85, The current name
property/method in the WorkflowStep class returns the class name lowercased
(e.g., WaitForReadyStep -> waitforready) instead of snake_case; update the
implementation that computes class_name from self.__class__.__name__ (and still
strips the "Step" suffix) to convert CamelCase to snake_case by inserting
underscores between word boundaries (or using a regex like (?<!^)(?=[A-Z]) to
split) and then lowercasing so WaitForReadyStep becomes wait_for_ready; keep the
"Step" suffix removal logic and ensure the symbol to change is the
WorkflowStep.name property/method that currently uses
class_name.endswith("Step") and class_name.lower().
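The suggested regex can be exercised standalone (the step class below is a stand-in). One caveat worth documenting: runs of consecutive capitals, e.g. a hypothetical DeployVLLMStep, would come out as deploy_v_l_l_m with this simple pattern.

```python
import re

def step_name(cls: type) -> str:
    """Strip a trailing 'Step' suffix, then insert '_' before each interior
    uppercase letter and lowercase the result (CamelCase -> snake_case)."""
    class_name = cls.__name__
    if class_name.endswith("Step"):
        class_name = class_name[: -len("Step")]
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()

class WaitForReadyStep:
    pass

print(step_name(WaitForReadyStep))  # -> wait_for_ready
```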
projects/rhaiis/workflows/steps/deploy.py (2)

320-328: Importing click inside method is unconventional.

While functional, importing inside a method is unusual. Consider moving the import to the top of the file since it's used in multiple methods.

♻️ Move import to top of file
 """RHAIIS vLLM deployment steps using KServe (ServingRuntime + InferenceService)."""

+import click
 import logging
 import subprocess
 # ...

 class WaitForReadyStep(WorkflowStep):
     # ...
     def execute(self, ctx: "WorkflowContext") -> StepResult:
         """Wait for InferenceService to be ready."""
-        import click
-
         click.echo(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/deploy.py` around lines 320 - 328, The import
of click inside the execute method is unconventional; move the import to the
module scope so it can be reused by other methods and avoids repeated local
imports. Remove the inline "import click" from the execute(self, ctx:
"WorkflowContext") -> StepResult method and add a top-level "import click" at
the top of the file (near other imports) so symbols like execute and any other
methods in this module use the shared module-level click import.

93-102: Namespace creation can be simplified.

The dry-run command output is not piped to the apply command. The current pattern runs two separate commands where the first one's output is discarded. Consider simplifying to a single idempotent apply.

♻️ Simplified namespace creation
-    # Ensure namespace exists
-    subprocess.run(
-        ["oc", "create", "namespace", self.namespace, "--dry-run=client", "-o", "yaml"],
-        capture_output=True,
-    )
-    subprocess.run(
-        ["oc", "apply", "-f", "-"],
-        input=f"apiVersion: v1\nkind: Namespace\nmetadata:\n  name: {self.namespace}\n",
-        capture_output=True,
-        text=True,
-    )
+    # Ensure namespace exists (idempotent)
+    subprocess.run(
+        ["oc", "create", "namespace", self.namespace, "--dry-run=client", "-o", "yaml"],
+        capture_output=True,
+        text=True,
+    )
+    # Apply is idempotent - will create if not exists or no-op if exists
+    subprocess.run(
+        ["oc", "create", "namespace", self.namespace],
+        capture_output=True,  # Ignore "already exists" error
+    )

Or use a shell pipe if the dry-run output is actually needed:

subprocess.run(
    "oc create namespace {ns} --dry-run=client -o yaml | oc apply -f -".format(ns=self.namespace),
    shell=True,
    capture_output=True,
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/deploy.py` around lines 93 - 102, Replace the
two subprocess.run calls that run a dry-run and then a separate apply (where
dry-run output is discarded) with a single idempotent apply invocation: call
subprocess.run(["oc","apply","-f","-"], input=the namespace YAML using
self.namespace, text=True, capture_output=True) so the namespace is
created/updated atomically; alternatively, if you prefer the dry-run output, run
a single shell command that pipes oc create --dry-run=client -o yaml into oc
apply -f - via subprocess.run(..., shell=True, capture_output=True). Ensure you
remove the prior dry-run subprocess.run and only use one subprocess.run that
uses self.namespace as the interpolated name.
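The single-apply option can be sketched as a small helper (assumes `oc` is on PATH; the helper name is illustrative, not the PR's code, and `check=True` surfaces CLI failures as exceptions):

```python
import subprocess

def ensure_namespace(namespace: str) -> None:
    """Idempotently ensure a namespace exists: render a client-side dry-run
    manifest, then pipe it into `oc apply -f -`."""
    manifest = subprocess.run(
        ["oc", "create", "namespace", namespace, "--dry-run=client", "-o", "yaml"],
        capture_output=True, text=True, check=True,
    ).stdout
    subprocess.run(
        ["oc", "apply", "-f", "-"],
        input=manifest, capture_output=True, text=True, check=True,
    )
```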
tests/rhaiis/test_operators.py (2)

48-61: Test asserts file exists but doesn't verify directory creation.

Once the step_dir.mkdir() fix is applied to operators.py, this test should pass. However, consider adding an explicit assertion that the step_dir was created to make the test more robust.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_operators.py` around lines 48 - 61, The test
test_creates_subscription_yaml currently asserts the YAML file exists but
doesn't explicitly assert the step directory was created; update the test to
include an assertion that step_dir (constructed from context.artifact_dir and
f"{context.step_number:03d}__{context.current_step_name}") exists before reading
the file, so it fails clearly if InstallNFDOperatorStep (or the directory
creation in operators.py) didn't create the folder; keep using the same step_dir
and yaml_file variables and assert step_dir.exists() (or is_dir()) prior to
checking yaml_file.exists() and its content.

37-46: Patch target should be module-specific for test isolation.

Patching subprocess.run globally may cause unintended side effects. Prefer patching at the module level where subprocess is imported.

♻️ Suggested fix
-    @patch("subprocess.run")
+    @patch("projects.rhaiis.workflows.steps.operators.subprocess.run")
     def test_execute_success(self, mock_run, context):

Apply the same change to all @patch("subprocess.run") decorators in this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_operators.py` around lines 37 - 46, The test currently
patches subprocess.run globally; update the `@patch` decorator to target
subprocess.run in the module that defines InstallNFDOperatorStep (i.e., patch
the subprocess imported into that module) so the mock is module-scoped and
isolated; apply the same change for all `@patch`("subprocess.run") decorators in
this test file (refer to test_execute_success and other tests that exercise
InstallNFDOperatorStep) so they patch
"<module_where_InstallNFDOperatorStep_is_defined>.subprocess.run" instead of the
top-level "subprocess.run".
projects/core/steps/guidellm.py (1)

272-276: Progress logging condition may not trigger reliably.

The condition elapsed % 60 == 0 with a 10-second poll interval and time.monotonic() precision may cause missed prints (e.g., if elapsed is 59.9 or 60.1). Consider tracking the last print time instead.

♻️ Suggested fix
 def _wait_for_pod_completion(self, timeout: int) -> dict:
     start_time = time.monotonic()
     poll_interval = 10
+    last_progress_print = 0

     print(f"Waiting for GuideLLM benchmark to complete (timeout: {timeout}s)...")

     while time.monotonic() - start_time < timeout:
         # ... existing logic ...

-            # Still running, no marker yet
-            elapsed = int(time.monotonic() - start_time)
-            if elapsed % 60 == 0:  # Print every minute
-                print(f"  GuideLLM running... ({elapsed}s elapsed, phase={phase})")
+            # Still running, no marker yet - print progress every 60s
+            elapsed = int(time.monotonic() - start_time)
+            if elapsed - last_progress_print >= 60:
+                print(f"  GuideLLM running... ({elapsed}s elapsed, phase={phase})")
+                last_progress_print = elapsed
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/steps/guidellm.py` around lines 272 - 276, The progress logging
uses `elapsed % 60 == 0` which can miss prints due to polling granularity;
replace that with a monotonic-based threshold by introducing a `last_print_time`
(initialized to `start_time`) and, inside the loop where `elapsed =
int(time.monotonic() - start_time)` is computed, check `if time.monotonic() -
last_print_time >= 60:` then do the print using `elapsed` and `phase` and set
`last_print_time = time.monotonic()` so prints occur reliably every 60 seconds
regardless of timing alignment.
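The last-print-time pattern can be shown in isolation; clock and sleep are injectable here purely so the sketch is testable (the real step would use time.monotonic and time.sleep directly):

```python
import time

def wait_with_progress(is_done, timeout: float, poll_interval: float = 10.0,
                       progress_every: float = 60.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll is_done() until it succeeds or timeout elapses, printing a
    progress line at most once per progress_every seconds by tracking the
    last print time instead of testing `elapsed % 60 == 0`."""
    start = clock()
    last_print = start
    while clock() - start < timeout:
        if is_done():
            return True
        now = clock()
        if now - last_print >= progress_every:
            print(f"  still running... ({int(now - start)}s elapsed)")
            last_print = now
        sleep(poll_interval)
    return False
```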
projects/core/workflow/executor.py (2)

82-82: Unused: finally_errors list is populated but never consumed.

Errors from finally steps are collected into finally_errors but the list is never included in the WorkflowResult, logged as a summary, or otherwise used. Consider either:

  • Including finally errors in the result for caller visibility
  • Logging a summary of finally failures
  • Removing the collection if not needed
♻️ Option: Include in result or log summary
+        if finally_errors:
+            logger.warning(f"Finally block encountered {len(finally_errors)} error(s)")
+
         end_time = datetime.now(timezone.utc)

Or extend WorkflowResult to include finally_errors if callers need this information.

Also applies to: 98-99, 112-112

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` at line 82, The finally_errors list
defined in executor.py (variable finally_errors) is being populated but never
used; update the code to either surface those errors to callers by adding a
finally_errors field to WorkflowResult and populating it before returning, or at
minimum log a concise summary of failures from finally_errors (e.g., count and
messages) before returning/raising; if finally errors are truly unnecessary,
remove the finally_errors collection and its appends (occurring around the
finally step handling) to avoid dead state. Ensure changes reference the
existing finally step handling and the WorkflowResult construction so callers
receive the final information.
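Surfacing the collected errors might look like the following sketch (the result type and helper are minimal stand-ins for the PR's WorkflowResult and executor logic):

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowResult:
    """Minimal stand-in for the PR's result type, extended with the
    finally_errors field the review suggests."""
    success: bool
    finally_errors: list = field(default_factory=list)

def run_finally_steps(steps, result: WorkflowResult) -> WorkflowResult:
    """Run each finally-step callable, collecting failures instead of
    silently dropping them, then log a summary for caller visibility."""
    for step in steps:
        try:
            step()
        except Exception as exc:  # a failing finally step must not abort the rest
            result.finally_errors.append(f"{getattr(step, '__name__', 'step')}: {exc}")
    if result.finally_errors:
        print(f"Finally block encountered {len(result.finally_errors)} error(s)")
    return result
```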

51-52: Unused variable: step_artifact_dir is fetched but never used.

The artifact directory is retrieved on lines 52 and 87 but not passed to the step or used otherwise. Either remove these lines or pass the directory to the step if it's needed.

♻️ Option 1: Remove unused code
-            # Get artifact directory for this step
-            step_artifact_dir = ctx.get_step_artifact_dir(step_name)
-
             step_start = time.monotonic()
♻️ Option 2: Pass to step if needed

If steps need access to their artifact directory, consider passing it as a parameter or storing it in context before execution.

Also applies to: 87-87

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` around lines 51 - 52, The local variable
step_artifact_dir (obtained via ctx.get_step_artifact_dir(step_name)) is never
used; either remove the unused retrieval calls or wire the directory into step
execution: remove the step_artifact_dir assignment at both sites if artifacts
aren’t needed, or pass it into the step invocation (e.g., add a parameter to the
step execution call or set it on the ctx/step object before calling
step.run/execute) so the symbol step_artifact_dir and the call
ctx.get_step_artifact_dir(step_name) are actually consumed; update the step API
(method name used for running the step) and any callers accordingly.
config/rhaiis/defaults.yaml (1)

19-19: Consider: uvicorn-log-level: debug may be verbose for production runs.

Debug-level logging can generate substantial output during benchmarks. If this is intentional for initial testing, consider adding a comment or parameterizing it for production use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/rhaiis/defaults.yaml` at line 19, The YAML default sets
uvicorn-log-level: debug which is noisy for production; change the default to a
less verbose level (e.g., info) or make the uvicorn-log-level configurable via
an environment-backed or CLI-overridable parameter and add an inline comment
explaining debug is only for development/benchmarks; update any
bootstrap/launcher code that reads the uvicorn-log-level key (search for
uvicorn-log-level) to prefer the env/CLI value when present.
projects/rhaiis/workflows/cleanup.py (1)

15-40: Unused parameter: remove_operators is stored but never used.

The remove_operators parameter is documented and stored on the instance (line 31), but define_steps() doesn't conditionally add operator removal steps when it's True. This creates an API contract that isn't fulfilled.

Consider either:

  1. Adding conditional operator removal steps when remove_operators=True
  2. Adding a TODO/FIXME comment if this is planned for later
  3. Removing the parameter until it's implemented
📝 Option: Add a TODO comment to track the incomplete feature
     def define_steps(self):
         """Define cleanup steps."""
+        # TODO: Add operator removal steps when remove_operators=True
         self.add_step(
             CleanupNamespaceStep(
                 namespace=self.namespace,
                 delete_namespace=False,  # Keep namespace, delete contents
             )
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/cleanup.py` around lines 15 - 40, The class stores
remove_operators but never uses it in define_steps; update define_steps to check
self.remove_operators and, when True, add a dedicated operator-cleanup step
(e.g., add RemoveOperatorsStep or CleanupOperatorsStep after the existing
CleanupNamespaceStep) so operator removal is performed; if an operator-cleanup
step class does not yet exist, either implement a small RemoveOperatorsStep and
add it conditionally, or at minimum add a clear TODO/FIXME in define_steps
referencing remove_operators to avoid the unused-parameter contract.
projects/core/scenarios/config.py (1)

100-103: Suffix stripping misses chained suffixes.

A model like foo-instruct-dynamic becomes foo-instruct today. Looping until no known suffix remains makes shortening more consistent.

Refinement
-        for suffix in ["-instruct", "-dynamic", "-chat"]:
-            if name.endswith(suffix):
-                name = name[: -len(suffix)]
+        changed = True
+        while changed:
+            changed = False
+            for suffix in ["-instruct", "-dynamic", "-chat"]:
+                if name.endswith(suffix):
+                    name = name[: -len(suffix)]
+                    changed = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/config.py` around lines 100 - 103, The current
suffix-stripping loop over ["-instruct","-dynamic","-chat"] only removes one
suffix, so names like "foo-instruct-dynamic" become "foo-instruct"; modify the
logic in projects/core/scenarios/config.py where the variable name is trimmed to
repeatedly strip known suffixes until none match (e.g., wrap the existing
for-loop in a while-loop or use a loop that breaks only when no suffix was
removed), ensuring you reference the same suffix list and update the variable
name in-place each removal.
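The loop-until-stable variant can be checked standalone (function name is illustrative; the suffix list matches the one in config.py):

```python
KNOWN_SUFFIXES = ["-instruct", "-dynamic", "-chat"]

def short_name(name: str) -> str:
    """Repeatedly strip known suffixes until none remain, so chained
    suffixes such as 'foo-instruct-dynamic' reduce all the way to 'foo'."""
    changed = True
    while changed:
        changed = False
        for suffix in KNOWN_SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                changed = True
    return name
```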
tests/core/steps/test_artifacts.py (1)

50-52: Prefer structured call-arg assertions over string matching.

Checking str(call_args_list) is fragile. Asserting specific positional args/flags from mock_run.call_args_list will make these tests more deterministic.

Also applies to: 96-98

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/steps/test_artifacts.py` around lines 50 - 52, Replace the fragile
string-based assertion on mock_run.call_args_list with structured checks against
each Call object's positional and keyword arguments: iterate
mock_run.call_args_list and assert that at least one Call has "logs" present in
the specific arg/flag you expect (e.g., check call.args[index] or
call.kwargs['argname']), and do the same for the other occurrence around lines
96-98; use mock_run.call_args_list directly rather than str(...) so the test
asserts the exact positional/keyword argument values.
tests/core/steps/test_guidellm.py (1)

71-73: Strengthen subprocess assertions in the success path.

The current check only verifies that at least one call happened. Adding expected call count (and optionally key command checks) will make this test catch flow regressions earlier.

Suggested test hardening
-        assert result.success
-        assert mock_run.called
+        assert result.success
+        assert mock_run.call_count == 5
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/steps/test_guidellm.py` around lines 71 - 73, The test currently
only asserts result.success and that mock_run was called; strengthen it by
asserting the exact expected number of subprocess invocations (e.g., assert
mock_run.call_count == 1) and also verify the arguments of the call(s) (e.g.,
use mock_run.assert_any_call(...) or inspect mock_run.call_args_list to ensure
the invoked command contains the expected executable/flags). Update the
assertions around result and mock_run in tests/core/steps/test_guidellm.py
(referencing result and mock_run) to check both call count and at least one
expected command pattern to prevent regressions.
projects/rhaiis/IMPLEMENTATION.md (1)

282-285: Replace user-local paths with portable commands.

The current example uses machine-specific directories, which makes the docs harder to follow for others.

Portable docs snippet
-# Activate venv and run from forge directory
-source ~/test_foo/python3_virt/bin/activate
-cd /Users/memehta/workspace/forge
+# Activate venv and run from repository root
+source .venv/bin/activate
+cd <path-to-forge-repo>
 PYTHONPATH=. python -m pytest tests/ --ignore=tests/llm_d -v
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/IMPLEMENTATION.md` around lines 282 - 285, Replace the
hardcoded activation and cd commands — specifically the lines referencing
"source ~/test_foo/python3_virt/bin/activate" and "cd
/Users/memehta/workspace/forge" — with portable instructions: instruct users to
create and activate a virtualenv inside the project (e.g., an .venv in the repo)
and to change directory to the repository root (recommend using a
repository-root discovery command like git rev-parse --show-toplevel or "cd to
project root"), and keep the pytest invocation as a relative, portable command
(using PYTHONPATH=. and the existing pytest flags and --ignore path) so the
snippet works across machines.
tests/rhaiis/test_steps.py (1)

21-31: Consider extracting duplicate fixtures to a conftest.py.

The temp_artifact_dir and context fixtures are duplicated across all three test classes (TestDeployVLLMStep, TestWaitForReadyStep, TestCleanupNamespaceStep). Extracting them to a conftest.py file would reduce duplication and improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_steps.py` around lines 21 - 31, Extract the duplicated
pytest fixtures into a shared conftest.py: move the temp_artifact_dir fixture
(the pytest.fixture that yields Path(tmpdir) from tempfile.TemporaryDirectory)
and the context fixture (which calls
WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir)) and
pre-increments via ctx.get_step_artifact_dir("deploy")) into conftest.py, remove
their definitions from tests/rhaiis/test_steps.py, and ensure test modules
import pytest normally so they automatically use the shared fixtures.
tests/core/scenarios/test_generator.py (1)

82-83: Consider using ASCII 'x' instead of Unicode '×' in comments.

The multiplication sign × (U+00D7) may cause issues in some editors or terminals. Using ASCII x is more portable.

Proposed fix
-        # 2 workloads × 1 routing × 2 TP = 4 scenarios
+        # 2 workloads x 1 routing x 2 TP = 4 scenarios
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/scenarios/test_generator.py` around lines 82 - 83, Replace the
Unicode multiplication sign in the inline comment within test_generator.py (the
comment string "# 2 workloads × 1 routing × 2 TP = 4 scenarios") with the ASCII
letter "x" to avoid encoding/editor issues; update the comment in the same
location (referencing the comment near the scenarios length assertion in the
test_generator test function) to read "# 2 workloads x 1 routing x 2 TP = 4
scenarios".
projects/rhaiis/orchestration/test_rhaiis.py (3)

25-36: Add type hint for ctx parameter.

The ctx parameter at line 35 lacks a type annotation, unlike other function signatures in this file.

✏️ Suggested fix
 def _dry_run_test(
     model: str | None,
     workload: str | None,
     workloads: list[str] | None,
     config_loader,
     accelerator: str,
     vllm_image: str | None,
     tensor_parallel: int | None,
     max_requests: int | None,
     namespace: str | None,
-    ctx,
+    ctx: WorkflowContext,
 ) -> int:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/test_rhaiis.py` around lines 25 - 36, The ctx
parameter in _dry_run_test is missing a type annotation; add a matching type
annotation (e.g., click.Context or the same context type used elsewhere in this
file) to the ctx parameter, update the function signature for _dry_run_test
accordingly, and add the required import (from click or typing import Any) if
not already present so the annotation resolves.

385-387: Consider making the drain time configurable.

The 5-second drain time is hardcoded. For clusters with higher latency or heavier in-flight request loads, this may be insufficient. Consider making it configurable via defaults.yaml or a parameter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/test_rhaiis.py` around lines 385 - 387, Replace
the hardcoded 5-second sleep in test_rhaiis.py (the block using idx,
group_workloads, click.echo and time.sleep) with a configurable drain timeout:
add a config key (e.g., drain_seconds) to defaults.yaml and read it in the test
(falling back to a sensible default like 5), or add a CLI/fixture parameter that
supplies the value; then use that variable in the click.echo message and
time.sleep call so clusters with higher latency can increase the drain time
without changing code.

278-278: Deployment name truncation may cause collisions for models with similar prefixes.

The deployment name is truncated to 42 characters after sanitization. For models with very similar long names, this could result in identical deployment names. Consider adding a hash suffix or documenting this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/test_rhaiis.py` at line 278, The
deployment_name assignment (using hf_model_id) can collide after truncation;
modify the logic that builds deployment_name so it appends a short deterministic
hash suffix derived from hf_model_id (e.g., first 6 hex chars of SHA-1/SHA-256)
after sanitization and truncation to keep the overall length within 42
characters; update the expression that sets deployment_name to compute
sanitized_base = hf_model_id.split("/")[-1].lower().replace(".",
"-").replace("_", "-") truncated to (42 - 1 - hash_len) characters, then append
"-" + hash_suffix to produce the final deployment_name.
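The hash-suffix scheme might be sketched as follows (an illustrative helper, not the PR's exact code; only the truncate-then-append-hash structure matters):

```python
import hashlib

MAX_NAME_LEN = 42  # deployment-name limit used in the PR

def deployment_name(hf_model_id: str, hash_len: int = 6) -> str:
    """Sanitize and truncate an HF model id into a deployment name, then
    append a short deterministic hash of the full id so similar long names
    no longer collide after truncation."""
    base = hf_model_id.split("/")[-1].lower().replace(".", "-").replace("_", "-")
    suffix = hashlib.sha256(hf_model_id.encode()).hexdigest()[:hash_len]
    return f"{base[: MAX_NAME_LEN - 1 - hash_len]}-{suffix}"
```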
projects/core/scenarios/generator.py (2)

1-20: Docstring uses Unicode multiplication sign.

The docstring at line 11 uses × (Unicode multiplication sign) which triggers static analysis warning RUF002. While this is cosmetic and doesn't affect functionality, using ASCII x would be more portable.

✏️ Suggested fix
-4. Matrix expansion: model × workloads × tensor_parallel
+4. Matrix expansion: model x workloads x tensor_parallel
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/generator.py` around lines 1 - 20, The docstring in
projects/core/scenarios/generator.py uses a Unicode multiplication sign in the
"Matrix expansion" example (the line showing "model × workloads ×
tensor_parallel"); replace the Unicode '×' characters with the ASCII letter 'x'
so the example reads "model x workloads x tensor_parallel" to satisfy static
analysis (RUF002) while keeping the docstring content unchanged.

104-112: Consider using hashlib for deterministic deployment IDs.

The deployment_id property uses hash(), which is non-deterministic across Python runs (due to PYTHONHASHSEED randomization). This means the same configuration will produce different deployment IDs in different invocations. While currently used only for display, this could cause confusion when comparing outputs or logs across runs. Using hashlib.md5() or similar would provide consistent IDs: override_hash = int(hashlib.md5(repr(sorted(self.vllm_args_override.items())).encode()).hexdigest()[:4], 16).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/generator.py` around lines 104 - 112, The
deployment_id property uses Python's non-deterministic built-in hash for
vllm_args_override; replace that with a deterministic hash using hashlib (e.g.,
hashlib.md5) by serializing the override items in a stable order (sort the items
or use tuple(sorted(...))) and encoding to bytes before hashing, then derive
override_hash from the first few hex digits (or as an int) and return
f"{base}_wl{override_hash}" so deployment_id is stable across runs; update the
logic in the deployment_id property (references: deployment_id,
vllm_args_override, override_hash, base).
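A deterministic replacement for the built-in hash() could look like this sketch (helper name is illustrative; items are sorted so dict insertion order does not matter, and a sorted repr is encoded because a frozenset has no .encode()):

```python
import hashlib

def override_hash(vllm_args_override: dict, digits: int = 4) -> int:
    """Deterministic short hash of an override mapping, stable across
    interpreter runs (built-in hash() is randomized by PYTHONHASHSEED)."""
    canonical = repr(sorted(vllm_args_override.items())).encode()
    return int(hashlib.md5(canonical).hexdigest()[:digits], 16)
```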

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8609a041-7915-493c-af3f-39712b66bee9

📥 Commits

Reviewing files that changed from the base of the PR and between 06371bf and a822096.

📒 Files selected for processing (51)
  • config/rhaiis/defaults.yaml
  • config/rhaiis/models.yaml
  • config/rhaiis/workloads.yaml
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/scenarios/__init__.py
  • projects/core/scenarios/config.py
  • projects/core/scenarios/config_loader.py
  • projects/core/scenarios/generator.py
  • projects/core/steps/__init__.py
  • projects/core/steps/artifacts.py
  • projects/core/steps/guidellm.py
  • projects/core/utils/__init__.py
  • projects/core/utils/oc.py
  • projects/core/workflow/__init__.py
  • projects/core/workflow/context.py
  • projects/core/workflow/executor.py
  • projects/core/workflow/step.py
  • projects/core/workflow/workflow.py
  • projects/rhaiis/IMPLEMENTATION.md
  • projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md
  • projects/rhaiis/__init__.py
  • projects/rhaiis/orchestration/__init__.py
  • projects/rhaiis/orchestration/ci.py
  • projects/rhaiis/orchestration/cli.py
  • projects/rhaiis/orchestration/test_rhaiis.py
  • projects/rhaiis/workflows/__init__.py
  • projects/rhaiis/workflows/benchmark.py
  • projects/rhaiis/workflows/cleanup.py
  • projects/rhaiis/workflows/prepare.py
  • projects/rhaiis/workflows/steps/__init__.py
  • projects/rhaiis/workflows/steps/cleanup.py
  • projects/rhaiis/workflows/steps/deploy.py
  • projects/rhaiis/workflows/steps/operators.py
  • tests/__init__.py
  • tests/core/__init__.py
  • tests/core/scenarios/__init__.py
  • tests/core/scenarios/test_config_loader.py
  • tests/core/scenarios/test_generator.py
  • tests/core/steps/__init__.py
  • tests/core/steps/test_artifacts.py
  • tests/core/steps/test_guidellm.py
  • tests/core/utils/__init__.py
  • tests/core/utils/test_oc.py
  • tests/core/workflow/__init__.py
  • tests/core/workflow/test_context.py
  • tests/core/workflow/test_executor.py
  • tests/rhaiis/__init__.py
  • tests/rhaiis/test_ci.py
  • tests/rhaiis/test_operators.py
  • tests/rhaiis/test_steps.py
  • tests/rhaiis/test_workflows.py

Comment thread config/rhaiis/models.yaml
Comment on lines +242 to +247
ministral-14b-fp8:
name: "Ministral-3-14B-Instruct-FP8"
hf_model_id: "RedHatAI/Ministral-3-14B-Instruct-2512"
vllm_args:
tensor-parallel-size: 4
supported_workloads: [balanced, short]

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

What is the correct HuggingFace model ID for RedHatAI Ministral-3-14B-Instruct FP8 quantized model?

💡 Result:

The correct Hugging Face model ID for the RedHatAI Ministral-3-14B-Instruct FP8 quantized model is RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic.


Fix the HuggingFace model ID for ministral-14b-fp8 to use the FP8-quantized variant.

The hf_model_id is missing the -FP8-dynamic suffix. Change "RedHatAI/Ministral-3-14B-Instruct-2512" to "RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic" to load the correct quantized model.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/rhaiis/models.yaml` around lines 242 - 247, Update the hf_model_id
value for the ministral-14b-fp8 model to point to the FP8-quantized variant:
find the ministral-14b-fp8 entry and change the hf_model_id
"RedHatAI/Ministral-3-14B-Instruct-2512" to
"RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic" so the correct FP8 model is
loaded.

Comment on lines +499 to +505
@click.command(context_settings=dict(
ignore_unknown_options=True,
allow_interspersed_args=False,
))
@click.argument('project', required=False)
@click.argument('operation', required=False)
@click.argument('args', nargs=-1)
@click.argument('args', nargs=-1, type=click.UNPROCESSED)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Click, when a command uses context_settings={"ignore_unknown_options": True, "allow_interspersed_args": False} with positional arguments followed by a variadic click.UNPROCESSED argument, are flags like --dry-run placed after the positional arguments still parsed as top-level options, or are they treated as plain arguments and forwarded?

💡 Result:

In Click, with context_settings={"ignore_unknown_options": True, "allow_interspersed_args": False} and positional arguments followed by a variadic click.UNPROCESSED argument, flags like --dry-run placed after the positional arguments are treated as plain arguments and forwarded in the UNPROCESSED tuple, not parsed as top-level options. This behavior occurs because allow_interspersed_args=False causes Click's parser to stop processing options upon encountering the first non-option (positional) argument. Any subsequent tokens resembling options (e.g., --dry-run) are not examined for option parsing; they are directly added to the argument list. With ignore_unknown_options=True, if any option-like tokens were processed before positionals, unknown ones would be ignored and forwarded, but due to allow_interspersed_args=False, tokens after positionals are never processed as potential options—they become plain arguments captured by the nargs=-1 UNPROCESSED argument. Official documentation confirms allow_interspersed_args=False prevents mixing options and arguments, stopping option processing at the first argument [3,4,6,19,23]. Examples and issues show that post-positional option-like strings end up in args when ignore_unknown_options=True and UNPROCESSED is used [6,7,9,10]. DeepWiki analysis of parser.py explicitly states: when False, parser stops option processing at first non-option [19].


🏁 Script executed:

```bash
# First, let's see the file content around the flagged lines
cat -n projects/core/ci_entrypoint/run_ci.py | sed -n '490,530p'
```

Repository: openshift-psap/forge
Length of output: 1589

🏁 Script executed:

```bash
# Search for --dry-run handling in the file
rg "\-\-dry\-run" projects/core/ci_entrypoint/run_ci.py -B 2 -A 2
```

Repository: openshift-psap/forge
Length of output: 556

🏁 Script executed:

```bash
# Check git history for this file to see if it changed recently
cd projects/core/ci_entrypoint && git log --oneline -20 run_ci.py 2>/dev/null | head -20
```

Repository: openshift-psap/forge
Length of output: 95

🏁 Script executed:

```bash
# Search for comments around the command decorator that might explain the design
cat -n projects/core/ci_entrypoint/run_ci.py | sed -n '485,510p'
```

Repository: openshift-psap/forge
Length of output: 1199

🏁 Script executed:

```bash
# Search for documentation or docstrings that explain flag usage
rg "dry.run|allow_interspersed" projects/core/ci_entrypoint/run_ci.py -B 3 -A 3
```

Repository: openshift-psap/forge
Length of output: 1706


The --dry-run flag cannot be used safely after positional arguments due to the Click parsing configuration.

With allow_interspersed_args=False, Click stops parsing options once the first positional argument is encountered. This means `run rhaiis test --dry-run` passes --dry-run as a plain argument into the variadic args instead of setting the --dry-run flag, so the dry-run guard never fires and the command executes for real instead of dry-running.

The --dry-run flag only works when placed before the positional arguments (e.g., `run --dry-run rhaiis test`), which is not intuitive and creates a safety regression: users expect the flag to work at the end of the command line.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/run_ci.py` around lines 499 - 505, The Click
command declaration in run_ci.py uses context_settings with
allow_interspersed_args=False which causes options like the global --dry-run
flag to be treated as positional arguments once any positional is seen; update
the decorator for the command that defines the CLI (the `@click.command`(...)
block that decorates the function handling 'project', 'operation', and 'args')
to set allow_interspersed_args=True (or remove that override) so Click will
continue to parse options interspersed with positional arguments and ensure
flags like --dry-run are recognized regardless of position.

Comment on lines +69 to +81
```python
def sanitize_name(name: str, max_len: int = 42) -> str:
    """
    Sanitize name for K8s resource naming.

    Rules:
    - Lowercase
    - Replace / and _ with -
    - Remove dots
    - Truncate to max_len
    """
    return (
        name.lower().replace("/", "-").replace("_", "-").replace(".", "")
    )[:max_len]
```

⚠️ Potential issue | 🟠 Major

sanitize_name can still produce invalid K8s resource names.

After truncation, names may end with -, and unsupported characters (beyond /, _, .) are not normalized. That can break resource creation at runtime.

Safer sanitizer

```diff
     def sanitize_name(name: str, max_len: int = 42) -> str:
 @@
-        return (
-            name.lower().replace("/", "-").replace("_", "-").replace(".", "")
-        )[:max_len]
+        name = name.lower().replace("/", "-").replace("_", "-").replace(".", "-")
+        name = re.sub(r"[^a-z0-9-]", "-", name)
+        name = re.sub(r"-{2,}", "-", name).strip("-")
+        name = name[:max_len].rstrip("-")
+        return name or "model"
```
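As a sanity check, a standalone version of the hardened sanitizer behaves like this (illustrative sketch mirroring the suggestion above, not merged code):

```python
import re

def sanitize_name(name: str, max_len: int = 42) -> str:
    """Hardened variant: lowercase, map separators to '-', drop anything
    outside [a-z0-9-], collapse '-' runs, trim edges, fall back to 'model'."""
    name = name.lower().replace("/", "-").replace("_", "-").replace(".", "-")
    name = re.sub(r"[^a-z0-9-]", "-", name)
    name = re.sub(r"-{2,}", "-", name).strip("-")
    name = name[:max_len].rstrip("-")
    return name or "model"

print(sanitize_name("meta-llama/Llama-3.3-70B_FP8"))  # meta-llama-llama-3-3-70b-fp8
print(sanitize_name("__weird..name!!v2"))             # weird-name-v2
```

Every output matches the RFC 1123 label shape K8s expects: starts and ends alphanumeric, only lowercase letters, digits, and hyphens in between.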
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/config.py` around lines 69 - 81, sanitize_name
currently only replaces "/", "_", and "." and then truncates, which can leave
trailing "-" or other invalid chars; update sanitize_name to: normalize all
characters to lowercase, replace any character not in [a-z0-9-] with "-",
collapse multiple "-" into a single "-", truncate to max_len, then strip
leading/trailing "-" and ensure the result starts/ends with an alphanumeric
character (if stripping yields empty, return a safe fallback like "resource");
apply these changes inside the sanitize_name function to guarantee produced
names conform to K8s resource name rules.

Comment on lines +56 to +58
```python
# Collect FORGE_* env vars
env_vars = {k: v for k, v in os.environ.items() if k.startswith("FORGE_")}
```


⚠️ Potential issue | 🟠 Major

Don't dump every FORGE_* value into metadata artifacts.

This snapshots all FORGE_* environment variables and writes them verbatim to _meta/metadata.yaml. That becomes a secret/PII leak the moment any token, password, key, or kubeconfig-style path is exposed under that prefix, so this should be allowlisted or redacted before writing.

Also applies to: 126-135

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/context.py` around lines 56 - 58, The code currently
captures all FORGE_* env vars into env_vars and writes them to metadata, which
can leak secrets; modify the collection so you either use an explicit allowlist
of safe FORGE_* keys (e.g., FORGE_ENV, FORGE_PROJECT) or sanitize values by
redacting/masking any key that matches sensitive patterns (TOKEN, SECRET,
PASSWORD, KEY, KUBE, CRED, PRIVATE) before assigning to env_vars; update the
creation sites that set env_vars (the dict comprehension assigning env_vars and
the similar logic around lines 126-135) to use the sanitized/allowlisted dict so
only safe entries or redacted values are written to _meta/metadata.yaml.
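A redaction pass along these lines can be sketched as follows (the sensitive-key patterns come from the comment above; the helper name is hypothetical):

```python
import os
import re

# Key-name patterns that should never be written verbatim to artifacts.
SENSITIVE = re.compile(r"TOKEN|SECRET|PASSWORD|KEY|KUBE|CRED|PRIVATE", re.IGNORECASE)

def collect_forge_env(environ=os.environ) -> dict[str, str]:
    """Snapshot FORGE_* vars, masking any key that looks sensitive."""
    return {
        k: ("***REDACTED***" if SENSITIVE.search(k) else v)
        for k, v in environ.items()
        if k.startswith("FORGE_")
    }

env = {"FORGE_PROJECT": "rhaiis", "FORGE_HF_TOKEN": "hf_abc", "PATH": "/usr/bin"}
print(collect_forge_env(env))
```

Non-FORGE keys are skipped entirely, and any FORGE_* key matching a sensitive pattern is masked before it can reach _meta/metadata.yaml.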

Comment on lines +54 to +59
```python
step_start = time.monotonic()
try:
    result = step.execute(ctx)
    result.duration_seconds = time.monotonic() - step_start
    result.start_time = datetime.now(timezone.utc)
    step_results[step_name] = result
```

⚠️ Potential issue | 🟠 Major

Bug: start_time is recorded after step execution completes.

Line 58 captures datetime.now() after step.execute() returns, which means result.start_time actually reflects the step's end time, not its start time. The timestamp should be captured before execution begins.

🐛 Proposed fix

```diff
             step_start = time.monotonic()
+            step_start_time = datetime.now(timezone.utc)
             try:
                 result = step.execute(ctx)
                 result.duration_seconds = time.monotonic() - step_start
-                result.start_time = datetime.now(timezone.utc)
+                result.start_time = step_start_time
                 step_results[step_name] = result
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` around lines 54 - 59, The code records
result.start_time after step.execute(), so start_time reflects the end time;
move the timestamp capture to before execution by setting result.start_time =
datetime.now(timezone.utc) (or a local variable) immediately after computing
step_start and before calling step.execute(ctx), then compute
result.duration_seconds = time.monotonic() - step_start after execute and
continue storing result in step_results[step_name]; update references to
step_start/result.start_time accordingly in executor.py around the step.execute
call.
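The intended pattern - monotonic clock for the duration, wall clock captured before execution for start_time - can be shown in isolation (hypothetical run_step helper, not the executor itself):

```python
import time
from datetime import datetime, timezone

def run_step(step, ctx):
    """Capture wall-clock start *before* execute; use monotonic for duration."""
    start_time = datetime.now(timezone.utc)
    start_mono = time.monotonic()
    result = step(ctx)
    result["start_time"] = start_time
    result["duration_seconds"] = time.monotonic() - start_mono
    return result

res = run_step(lambda ctx: time.sleep(0.05) or {}, ctx=None)
print(res["duration_seconds"] >= 0.05)
print(res["start_time"] <= datetime.now(timezone.utc))
```

Because start_time is taken before the step runs, it now genuinely marks when the step began rather than when it finished.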

Comment on lines +170 to +173
```yaml
modelLoadingTimeoutMillis: 300000
imagePullSecrets:
- name: npalaska-image-pull
containers:
```

⚠️ Potential issue | 🟠 Major

Hardcoded imagePullSecrets name appears to be a personal/dev credential.

The secret name npalaska-image-pull looks like a personal or development credential. This should likely be configurable or use a standard naming convention.

♻️ Suggested fix to make configurable

Add a parameter to __init__:

```diff
 def __init__(
     self,
     model: str,
     deployment_name: str,
     vllm_image: str,
     runtime_args: dict[str, Any],
     namespace: str = "forge",
+    image_pull_secret: str | None = None,
     # ... other params
 ):
+    self.image_pull_secret = image_pull_secret
```

Then in _generate_kserve_yaml:

```diff
+  image_pull_secrets = ""
+  if self.image_pull_secret:
+      image_pull_secrets = f"""
+  imagePullSecrets:
+  - name: {self.image_pull_secret}"""
+
   return f"""---
 apiVersion: serving.kserve.io/v1alpha1
 kind: ServingRuntime
 # ...
 spec:
   builtInAdapter:
     modelLoadingTimeoutMillis: 300000
-  imagePullSecrets:
-  - name: npalaska-image-pull
+{image_pull_secrets}
   containers:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/deploy.py` around lines 170 - 173, The
hardcoded imagePullSecrets name `npalaska-image-pull` should be made
configurable: add a new parameter (e.g., image_pull_secret or
image_pull_secret_name) to the class __init__ with a sensible default or None,
store it on self, and update _generate_kserve_yaml to render the
imagePullSecrets entry using self.image_pull_secret (skipping or omitting the
block when None). Also update any callers/instantiations to pass the secret
where needed and add a clear default value or validation error if the secret is
required.

Comment on lines +21 to +44
```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install NFD operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    subscription_yaml = """
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "nfd-subscription.yaml"
    yaml_path.write_text(subscription_yaml)

    # Create namespace first
    self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
    self._run_oc(["apply", "-f", str(yaml_path)])

    return StepResult.ok("NFD operator subscription created")
```

⚠️ Potential issue | 🔴 Critical

Missing step_dir.mkdir() before writing YAML file will cause FileNotFoundError.

The yaml_path.write_text() on line 38 will fail because step_dir is never created. Other steps in this PR (e.g., guidellm.py, deploy.py) call step_dir.mkdir(parents=True, exist_ok=True) before writing files.

🐛 Proposed fix

```diff
 def execute(self, ctx: "WorkflowContext") -> StepResult:
     """Install NFD operator."""
     step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+    step_dir.mkdir(parents=True, exist_ok=True)

     subscription_yaml = """
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 21 - 44, The
execute method creates step_dir but never ensures it exists, so
yaml_path.write_text(...) will raise FileNotFoundError; before writing the file,
call step_dir.mkdir(parents=True, exist_ok=True) (same pattern used in
guidellm.py/deploy.py) to create the directory, then proceed to write the
subscription_yaml and run the oc commands; update the execute function to create
the directory right after computing step_dir and before yaml_path.write_text.

Comment on lines +40 to +44
```python
# Create namespace first
self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
self._run_oc(["apply", "-f", str(yaml_path)])

return StepResult.ok("NFD operator subscription created")
```

⚠️ Potential issue | 🟠 Major

Return value of _run_oc is silently ignored.

The _run_oc calls on lines 41-42 return bool indicating success/failure, but the result is not checked. If namespace creation or YAML apply fails, the step still reports success. Consider checking the return value or at minimum logging the outcome.

🛡️ Proposed fix to check return values

```diff
     # Create namespace first
-    self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
-    self._run_oc(["apply", "-f", str(yaml_path)])
+    self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
+    # Apply subscription - use pipe pattern: dry-run | apply
+    result = subprocess.run(
+        ["oc", "apply", "-f", str(yaml_path)],
+        capture_output=True,
+        text=True,
+        timeout=60,
+    )
+    if result.returncode != 0:
+        return StepResult.fail(f"Failed to install NFD operator: {result.stderr}")

     return StepResult.ok("NFD operator subscription created")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 40 - 44, The two
calls to _run_oc (creating namespace and applying the YAML) return bool but are
ignored so StepResult.ok("NFD operator subscription created") is returned even
on failure; update the code in the method that contains these calls to capture
each _run_oc result (the call creating "namespace openshift-nfd" and the call
applying str(yaml_path)), check their boolean values, log failures via the
existing logger or processLogger, and return StepResult.fail(...) (or an
appropriate error StepResult) if either call returns False; only return
StepResult.ok(...) when both _run_oc calls succeed.

Comment on lines +67 to +84
```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install GPU operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    subscription_yaml = """
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: v24.6
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "gpu-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
```

⚠️ Potential issue | 🔴 Critical

Same issue: Missing step_dir.mkdir() in InstallGPUOperatorStep.

Apply the same fix as for InstallNFDOperatorStep.

🐛 Proposed fix

```diff
 def execute(self, ctx: "WorkflowContext") -> StepResult:
     """Install GPU operator."""
     step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+    step_dir.mkdir(parents=True, exist_ok=True)

     subscription_yaml = """
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 67 - 84, The
InstallGPUOperatorStep.execute method creates step_dir but never ensures the
directory exists before writing the GPU subscription file; modify
InstallGPUOperatorStep.execute to create the directory (use
step_dir.mkdir(parents=True, exist_ok=True)) before calling
yaml_path.write_text(subscription_yaml) so the file write cannot fail if the
directory is missing.

Comment on lines +111 to +131
```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install RHOAI operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    # Determine channel from version
    channel = f"stable-{self.version}"

    subscription_yaml = f"""
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: {channel}
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "rhoai-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
```

⚠️ Potential issue | 🔴 Critical

Same issue: Missing step_dir.mkdir() in InstallRHOAIOperatorStep.

Apply the same fix as for the other operator steps.

🐛 Proposed fix

```diff
 def execute(self, ctx: "WorkflowContext") -> StepResult:
     """Install RHOAI operator."""
     step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+    step_dir.mkdir(parents=True, exist_ok=True)

     # Determine channel from version
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 111 - 131, The
execute method of InstallRHOAIOperatorStep is writing files into step_dir
without ensuring the directory exists; update the execute method (the function
named execute in the InstallRHOAIOperatorStep class) to create the step_dir
before writing the YAML (e.g., call step_dir.mkdir(parents=True, exist_ok=True)
immediately after computing step_dir) so yaml_path.write_text(...) won't fail.

Contributor

@kpouget kpouget left a comment

nice,
I put some very preliminary comments,
we'll need to discuss it further to understand how all of that works!

we'll need to find the way to integrate the DSL toolbox in that, if that makes sense

Comment on lines +499 to +502
```python
@click.command(context_settings=dict(
    ignore_unknown_options=True,
    allow_interspersed_args=False,
))
```
Contributor

this file should stay untouched

Comment on lines +2 to +3
```yaml
# These define benchmark parameters - can be changed WITHOUT restarting vLLM
# Only GuideLLM needs to restart with new workload config
```
Contributor

this should live in the rhaiis project, projects/rhaiis/orchestration

Comment on lines +1 to +3
```yaml
# Global Defaults & Accelerator-Specific Settings
# These are merged with model configs at runtime using inheritance:
# defaults → accelerator → model → model.accelerator_overrides → scenario
```
Contributor

this should live in the rhaiis project, projects/rhaiis/orchestration

Comment thread config/rhaiis/models.yaml
Comment on lines +1 to +3
```yaml
# Model Registry
# Models only specify what's DIFFERENT from defaults.yaml
# Accelerator-specific settings go in accelerator_overrides section
```
Contributor

this should live in the rhaiis project, projects/rhaiis/orchestration

@@ -0,0 +1,14 @@
"""Shared workflow steps for all projects.
Contributor

this should live in the rhaiis project; we'll rediscuss how to turn it into a toolbox (projects/rhaiis/toolbox)
or a guidellm library project 🤔

@@ -0,0 +1,250 @@
"""Artifact collection step - shared by all projects."""
Contributor

this should live in the rhaiis project; we'll rediscuss how to turn it into a toolbox (projects/rhaiis/toolbox)
or a guidellm library project 🤔

@@ -0,0 +1,343 @@
"""GuideLLM benchmark step - runs as a pod on the cluster."""
Contributor

this should live in the rhaiis project; we'll rediscuss how to turn it into a toolbox (projects/rhaiis/toolbox)
or a guidellm library project 🤔

Comment thread projects/core/utils/oc.py
@@ -0,0 +1,529 @@
"""OpenShift CLI wrapper with built-in retry for transient failures.
Contributor

I think this can belong to the DSL directory, to be rediscussed

@@ -0,0 +1,129 @@
"""Base workflow class with step registration."""
Contributor

to be discussed in a meeting so that we understand what's this workflow exactly



```python
@ci.command()
@click.option("--dry-run", is_flag=True, help="Print what would be done without executing")
```
Contributor

I think you can define that in the def ci() function, so that all the entrypoints inherit it
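One way to do that - an assumed sketch of how the ci group could be wired, not the actual code - is to declare the flag on the group and pass it down via ctx.obj:

```python
import click
from click.testing import CliRunner

@click.group()
@click.option("--dry-run", is_flag=True, help="Print what would be done without executing")
@click.pass_context
def ci(ctx, dry_run):
    # Stash the flag on the context so every subcommand inherits it.
    ctx.obj = {"dry_run": dry_run}

@ci.command()
@click.pass_context
def run(ctx):
    click.echo(f"dry_run={ctx.obj['dry_run']}")

print(CliRunner().invoke(ci, ["--dry-run", "run"]).output, end="")
```

Each subcommand then reads the shared flag from ctx.obj instead of redeclaring --dry-run on every entrypoint.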

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/rhaiis/IMPLEMENTATION.md`:
- Line 7: Add explicit language identifiers to the fenced code blocks that are
missing them (lint MD040) — locate the triple-backtick blocks around the "CLI/CI
Entry Points" and "config/" sections and the other similar fences (the blocks
containing short directory or CLI snippets) and change ``` to fenced blocks like
```text, ```bash, ```yaml, or ```python as appropriate for the content so
markdownlint passes; ensure each fenced block (e.g., the block showing CLI/CI
commands and the block showing config directory listings) gets a suitable
language tag.
- Line 471: Replace the user-specific line "cd /Users/memehta/workspace/forge"
with a portable repo-relative instruction; locate that exact string in
IMPLEMENTATION.md and change it to a generic direction to change into the
repository root (for example instruct contributors to "cd" into the project
directory or to use a git/top-level discovery method) so the guidance works for
all environments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0742a73-09c0-4906-8120-102b37d2e724

📥 Commits

Reviewing files that changed from the base of the PR and between a822096 and 007ee06.

📒 Files selected for processing (1)
  • projects/rhaiis/IMPLEMENTATION.md


## Architecture


⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks to satisfy lint.

These fences are missing language tags (MD040). Please annotate them (text, bash, yaml, python, etc.) so markdownlint passes consistently.

Proposed fix pattern

````diff
-```
+```text
 CLI/CI Entry Points
 ...
-```
+```

-```
+```text
 config/
 ...
-```
+```
````

Also applies to: 59-59, 70-70, 118-118, 184-184, 212-212, 261-261, 376-376, 448-448

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/IMPLEMENTATION.md` at line 7, Add explicit language
identifiers to the fenced code blocks that are missing them (lint MD040) —
locate the triple-backtick blocks around the "CLI/CI Entry Points" and "config/"
sections and the other similar fences (the blocks containing short directory or
CLI snippets) and change ``` to fenced blocks like ```text, ```bash, ```yaml, or
```python as appropriate for the content so markdownlint passes; ensure each
fenced block (e.g., the block showing CLI/CI commands and the block showing
config directory listings) gets a suitable language tag.

```bash
# Activate venv and run from forge directory
source ~/test_foo/python3_virt/bin/activate
cd /Users/memehta/workspace/forge
```

⚠️ Potential issue | 🟡 Minor

Replace machine-specific absolute path with a portable instruction.

cd /Users/memehta/workspace/forge is user-specific and will not work for most contributors. Use a repo-relative placeholder.

Proposed fix

```diff
-cd /Users/memehta/workspace/forge
+cd /path/to/forge
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/IMPLEMENTATION.md` at line 471, Replace the user-specific
line "cd /Users/memehta/workspace/forge" with a portable repo-relative
instruction; locate that exact string in IMPLEMENTATION.md and change it to a
generic direction to change into the repository root (for example instruct
contributors to "cd" into the project directory or to use a git/top-level
discovery method) so the guidance works for all environments.

@MML-coder
Collaborator Author

Reopening from fork MML-coder/forge-1
