
Adding Forge rhaiis test entrypoint#13

Closed
MML-coder wants to merge 3 commits into main from rhaiis-pr

Conversation

Collaborator

@MML-coder MML-coder commented Apr 4, 2026

This PR only supports the Forge test workflow: it deploys vLLM, runs the benchmark, and cleans up. It does NOT include:

  • prepare workflow (operator installation) - stubbed only
  • cleanup workflow - basic implementation
  • llm-d integration - design doc included, implementation separate

How It Works

  1. ConfigLoader resolves model config via inheritance chain:
    defaults.yaml → accelerator[nvidia|amd] → model → model.accelerator_overrides
  2. BenchmarkWorkflow executes steps sequentially:
    DeployVLLM → WaitForReady → RunGuideLLM → (finally) CollectArtifacts → Cleanup
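The inheritance chain in step 1 can be sketched as a recursive deep merge, where each later layer overrides scalar keys but combines nested dicts. This is an illustration only; deep_merge and the sample dicts are not the actual ConfigLoader code.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge override onto base; nested dicts combine, later scalars win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical layers mirroring defaults → accelerator → model → overrides
defaults = {"vllm": {"max_model_len": 4096, "dtype": "auto"}}
accelerator = {"vllm": {"dtype": "bfloat16"}}
model = {"vllm": {"max_model_len": 8192}}
overrides = {"vllm": {"gpu_memory_utilization": 0.9}}

resolved = defaults
for layer in (accelerator, model, overrides):
    resolved = deep_merge(resolved, layer)
# resolved["vllm"] == {"max_model_len": 8192, "dtype": "bfloat16",
#                      "gpu_memory_utilization": 0.9}
```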

Single model + workload:

```shell
export KUBECONFIG=~/kubeconfigs/my-cluster
PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model qwen-0.6b --workload balanced --accelerator nvidia
```

Multiple workloads (deploy-once pattern):

```shell
PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model llama-3.3-70b-fp8 --workloads balanced,short,long-prompt
```

Summary by CodeRabbit

  • New Features

    • Configuration-driven benchmarking: global defaults, accelerator and per-model overrides, model registry, and workload presets.
    • Declarative scenario generator with matrix expansion and deploy-once grouping.
    • End-to-end workflow engine and RHAIIS workflows: prepare, deploy, wait-for-ready, run GuideLLM benchmarks, collect artifacts, and cleanup.
    • CLI/CI commands to prepare, run, and cleanup benchmarks.
    • Kubernetes CLI wrapper with retry/backoff for robust cluster operations.
  • Documentation

    • Implementation guide and extensibility roadmap for llm-d.
  • Tests

    • Extensive unit tests covering configs, generator, workflows, steps, and utilities.
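The retry/backoff mentioned for the Kubernetes CLI wrapper might look roughly like the following. This is a hedged sketch; run_with_retry, its parameters, and the transient-error predicate are assumptions, not the wrapper's real API.

```python
import time

def run_with_retry(fn, attempts=3, base_delay=0.5, is_transient=lambda exc: True):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the final attempt or on non-transient errors
            if attempt == attempts or not is_transient(exc):
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```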

@openshift-ci

openshift-ci bot commented Apr 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign avasilevskii for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

Walkthrough

Added a declarative, YAML-driven RHAIIS benchmarking system: configuration registries, scenario generator, workflow engine and steps, Kubernetes/oc utilities, CLI entrypoints, orchestration runners, domain workflows/steps, and comprehensive unit tests.

Changes

  • Configuration (config/rhaiis/defaults.yaml, config/rhaiis/models.yaml, config/rhaiis/workloads.yaml):
    New YAML configs: global defaults, accelerator-specific blocks (nvidia/amd), a model registry with per-model vllm overrides and aliases, and multiple GuideLLM workload presets.
  • Scenario generation (projects/core/scenarios/: config.py, config_loader.py, generator.py, __init__.py):
    New ScenarioConfig dataclass, deep-merge loader resolving defaults→accelerator→model→model.accelerator_overrides, Resolved* dataclasses, ConfigLoader API, ScenarioGenerator with matrix/explicit expansion and deployment grouping.
  • Workflow engine core (projects/core/workflow/: context.py, step.py, workflow.py, executor.py, __init__.py):
    Introduced WorkflowContext, StepStatus/StepResult/WorkflowStep abstractions, Workflow base class, and SequentialExecutor implementing ordered execution with finally-step semantics.
  • OC/Kubernetes utilities (projects/core/utils/oc.py, projects/core/utils/__init__.py):
    OC wrapper with RetryConfig, OCResult, transient-error detection, retry/backoff logic, and convenience methods (run/get/apply/delete/logs/exec/wait/rsync/create_namespace).
  • Shared workflow steps (projects/core/steps/: artifacts.py, guidellm.py, __init__.py):
    Added CollectArtifactsStep, CleanupDeploymentStep, and RunGuideLLMStep for artifact collection, best-effort cleanup, and running GuideLLM benchmarks (pod creation, polling, log/result collection).
  • RHAIIS workflows (projects/rhaiis/workflows/: benchmark.py, prepare.py, cleanup.py, __init__.py):
    BenchmarkWorkflow (deploy → wait → benchmark with finally steps), PrepareWorkflow (operator installs), CleanupWorkflow (namespace/resource cleanup).
  • RHAIIS workflow steps (projects/rhaiis/workflows/steps/: deploy.py, operators.py, cleanup.py, __init__.py):
    DeployVLLMStep (KServe manifest generation + oc apply), WaitForReadyStep (ready condition + in-pod health check), CleanupNamespaceStep, and operator installation steps (NFD/GPU/RHOAI subscriptions).
  • Orchestration & CLI (projects/rhaiis/orchestration/: cli.py, ci.py, test_rhaiis.py, __init__.py):
    Click CLIs (prepare/test/cleanup), test_rhaiis orchestration runners with dry-run, multi-workload deploy-once grouping, resolved config loading, and ctx metadata writing.
  • CI entrypoint tweak (projects/core/ci_entrypoint/run_ci.py):
    Click command updated to ignore unknown options and pass trailing args unprocessed (click.UNPROCESSED, ignore_unknown_options, allow_interspersed_args=False).
  • Documentation (projects/rhaiis/IMPLEMENTATION.md, projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md, projects/rhaiis/__init__.py):
    Added implementation and extensibility docs describing architecture, config precedence, workflows, and llm-d extension proposals.
  • Tests (tests/..., many new files):
    Comprehensive unit tests for config loader, scenario generator, workflow engine, OC wrapper retry behavior, steps, workflows, and CLI; extensive mocking of subprocess/oc interactions.
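As a rough illustration of the matrix expansion attributed to ScenarioGenerator above, a Cartesian product over matrix fields might look like this. Field names (model, workload) are assumptions, not the real schema.

```python
from itertools import product

def expand_matrix(matrix: dict) -> list[dict]:
    """Expand a scenario matrix into one dict per concrete combination."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]

scenarios = expand_matrix({
    "model": ["qwen-0.6b", "llama-3.3-70b-fp8"],
    "workload": ["balanced", "short"],
})
# 2 models x 2 workloads -> 4 concrete scenarios
```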

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI (projects/rhaiis/orchestration/cli.py)
    participant Orchestrator as test_rhaiis.py
    participant ConfigLoader as ConfigLoader
    participant Generator as ScenarioGenerator
    participant Workflow as BenchmarkWorkflow
    participant Executor as SequentialExecutor
    participant OC as OC Utility
    participant K8s as Kubernetes/OpenShift

    CLI->>Orchestrator: run_test(...)
    Orchestrator->>ConfigLoader: load_model/load_workload
    ConfigLoader-->>Orchestrator: Resolved configs
    Orchestrator->>Generator: expand / group scenarios
    Generator-->>Orchestrator: ExpandedScenario(s)
    Orchestrator->>Workflow: instantiate BenchmarkWorkflow
    Workflow->>Executor: execute()
    Executor->>OC: apply KServe manifest
    OC->>K8s: kubectl/oc apply
    K8s-->>OC: apply result
    Executor->>OC: poll inferenceservice (get)
    OC->>K8s: check Ready
    K8s-->>OC: status
    Executor->>OC: apply benchmark pod
    OC->>K8s: create pod, stream logs
    Executor->>OC: rsync /benchmark-results
    OC->>K8s: copy results
    Executor->>OC: collect artifacts & cleanup (finally steps)
    OC->>K8s: delete resources
    Executor-->>Workflow: WorkflowResult
    Workflow-->>CLI: exit code/result
```
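The finally-step semantics shown in the diagram (artifacts are collected and cleanup runs even when a benchmark step fails) can be sketched as below. This is illustrative only; the real SequentialExecutor works with WorkflowStep/StepResult objects, not plain callables.

```python
def run_workflow(steps, finally_steps):
    """Run steps in order; always run finally_steps, even after a failure."""
    results = {}
    try:
        for step in steps:
            results[step.__name__] = step()
    finally:
        for step in finally_steps:
            try:
                results[step.__name__] = step()
            except Exception as exc:
                # Best-effort: a failing finally step must not mask the original error
                results[step.__name__] = f"failed: {exc}"
    return results
```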

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 Through YAML fields I hop and peep,

configs merged and scenarios deep.
Steps march forward, oc retries hum,
pods and logs and results to drum.
A tidy run — prepare, test, then sweep.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the PR title 'Adding Forge rhaiis test entrypoint' accurately describes the primary objective: adding a test workflow entrypoint for the rhaiis project)
  • Docstring Coverage: ✅ Passed (84.98% coverage; the required threshold is 80.00%)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (15)
projects/core/utils/oc.py-204-209 (1)

204-209: ⚠️ Potential issue | 🟡 Minor

Rename input to avoid the Ruff A002 failure.

input shadows the Python builtin in all three signatures, and Ruff is already flagging it here. A small rename like stdin_input keeps the API clear and avoids a lint blocker.

Proposed rename
-        input: str | None = None,
+        stdin_input: str | None = None,
...
-                    input=input,
+                    input=stdin_input,
...
-    def run(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def run(self, *args: str, namespace: str | None = None, timeout: int | None = None, stdin_input: str | None = None) -> OCResult:
...
-        return self._run_with_retry(cmd, timeout, input)
+        return self._run_with_retry(cmd, timeout, stdin_input)
...
-    def apply(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def apply(self, *args: str, namespace: str | None = None, timeout: int | None = None, stdin_input: str | None = None) -> OCResult:
...
-        return self.run("apply", *args, namespace=namespace, timeout=timeout, input=input)
+        return self.run("apply", *args, namespace=namespace, timeout=timeout, stdin_input=stdin_input)

Also applies to: 300-300, 335-335

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/utils/oc.py` around lines 204 - 209, The parameter name `input`
shadows the Python builtin and triggers Ruff A002 in OC._run_with_retry and the
two other method signatures; rename the parameter to `stdin_input` (update type
hints from `input: str | None` to `stdin_input: str | None`) in the function
definitions and replace all internal references to `input` with `stdin_input`
(and update any places that call these methods to pass the renamed argument) so
signatures, usages, and return behavior remain identical except for the
parameter name.
projects/core/workflow/context.py-83-99 (1)

83-99: ⚠️ Potential issue | 🟡 Minor

Avoid mutating step numbering from a getter.

Line 95 increments unconditionally, so a step that asks for its artifact directory twice gets two different folders and shifts the numbering for every later step. Cache the directory per step, or have the executor advance the counter once before the step runs.
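A cached variant along the suggested lines might look like this. Names such as step_number and _step_dirs mirror the review comment but are assumptions about the real class.

```python
from pathlib import Path

class WorkflowContext:
    """Sketch: per-step artifact directories with idempotent numbering."""

    def __init__(self, artifact_dir: Path):
        self.artifact_dir = artifact_dir
        self.step_number = 0
        self._step_dirs: dict[str, Path] = {}

    def get_step_artifact_dir(self, step_name: str) -> Path:
        # Increment only the first time a given step asks for its directory
        if step_name not in self._step_dirs:
            self.step_number += 1
            self._step_dirs[step_name] = (
                self.artifact_dir / f"{self.step_number:03d}__{step_name}"
            )
        return self._step_dirs[step_name]
```

Repeated calls now return the same path, so later steps keep stable numbering.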

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/context.py` around lines 83 - 99, The getter
get_step_artifact_dir currently mutates state (increments self.step_number and
sets self.current_step_name) every call; change it to be idempotent by caching
per-step directories or only incrementing when a new step is first requested:
add an internal mapping (e.g. self._step_dirs: dict[str, Path]) and in
get_step_artifact_dir check if step_name exists, return cached Path if present,
otherwise increment self.step_number once, set self.current_step_name, create
the new step_dir from self.artifact_dir /
f"{self.step_number:03d}__{step_name}", store it in self._step_dirs[step_name]
and return it; this preserves existing names (step_dir) and avoids changing
numbering on repeated calls.
projects/rhaiis/workflows/steps/deploy.py-504-506 (1)

504-506: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary f-string prefix.

Static analysis correctly identified an f-string without placeholders.

🐛 Fix
         except subprocess.TimeoutExpired:
-            click.echo(f"  Health check timeout, retrying...")
+            click.echo("  Health check timeout, retrying...")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/deploy.py` around lines 504 - 506, In the
except subprocess.TimeoutExpired block (around the health-check retry logic)
remove the unnecessary f-string prefix on the click.echo call: replace the
f-string literal used in click.echo(f"  Health check timeout, retrying...") with
a plain string click.echo("  Health check timeout, retrying..."); also scan the
surrounding exception handling in the same function for other f-strings that
have no placeholders and convert them to normal string literals (refer to the
except subprocess.TimeoutExpired handler and the click.echo call).
projects/core/steps/guidellm.py-158-179 (1)

158-179: ⚠️ Potential issue | 🟡 Minor

Potential command injection via args in shell command.

The guidellm_cmd is constructed by joining args and embedded directly in a shell script. If any runtime argument values contain shell metacharacters (e.g., ;, $, backticks), they could be interpreted by the shell.

Consider using JSON-escaped arguments or validating that values don't contain shell-special characters.

🛡️ Proposed fix using shlex.quote
+import shlex
+
 def _generate_pod_yaml(self, data: str) -> str:
     """Generate GuideLLM pod YAML."""
     # Build guidellm args
     args = [
         "--target", self.endpoint,
         "--model", self.model,
         # ...
     ]
     # ...
-    guidellm_cmd = f"python3 -m benchmark.main {' '.join(args)}"
+    guidellm_cmd = f"python3 -m benchmark.main {' '.join(shlex.quote(a) for a in args)}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/steps/guidellm.py` around lines 158 - 179, The constructed
shell command in guidellm_cmd is vulnerable because args are joined and
interpolated into a shell script; to fix, escape each argument with shlex.quote
(or equivalent) before joining or, better, avoid a shell script altogether by
setting the Pod container's command to ["/bin/sh","-c"] with a safely quoted
string or by using the container args/command list to pass arguments directly;
update the code that builds guidellm_cmd (and the Pod spec that embeds it) to
either quote every element from args via shlex.quote or switch to a structured
args array so no unescaped shell metacharacters from args can be interpreted.
projects/core/steps/artifacts.py-197-232 (1)

197-232: ⚠️ Potential issue | 🟡 Minor

Unused errors list is initialized but never populated.

The errors list on line 201 is never appended to, yet it's checked on line 225 and included in the result data on line 231. Either remove it or populate it when _delete_resource returns False.

🐛 Proposed fix to populate errors
     # Delete KServe resources first (they manage the underlying deployments)
     if self.use_kserve:
         if self._delete_resource("inferenceservice", self.deployment_name, ns_args):
             deleted.append(f"inferenceservice/{self.deployment_name}")
+        else:
+            errors.append(f"inferenceservice/{self.deployment_name}")
         if self._delete_resource("servingruntime", self.deployment_name, ns_args):
             deleted.append(f"servingruntime/{self.deployment_name}")
+        else:
+            errors.append(f"servingruntime/{self.deployment_name}")

     # Delete standard deployment (if not using KServe or as fallback)
     if self._delete_resource("deployment", self.deployment_name, ns_args):
         deleted.append(f"deployment/{self.deployment_name}")
+    else:
+        errors.append(f"deployment/{self.deployment_name}")

Or alternatively, remove the errors list entirely if tracking deletion failures is not needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/steps/artifacts.py` around lines 197 - 232, The errors list is
created in execute() but never populated; update execute to append a descriptive
error string to errors whenever _delete_resource(resource_type,
self.deployment_name, ns_args) returns False (for inferenceservice,
servingruntime, deployment, service, route attempts), e.g., append
f"{resource_type}/{self.deployment_name} failed" so the existing message
construction and StepResult data["errors"] are meaningful; alternatively, if you
prefer not to track failures, remove the errors variable and its usage in the
message and StepResult data (choose one approach and apply consistently across
the inferenceservice/servingruntime/deployment/service/route deletion checks).
config/rhaiis/workloads.yaml-48-55 (1)

48-55: ⚠️ Potential issue | 🟡 Minor

Incomplete configuration: placeholder value needs replacement.

The data field contains a placeholder "ADD ME" which will likely cause GuideLLM to fail or produce meaningless results if this workload is selected. Either provide a valid data specification or disable this workload until it's properly configured.

Would you like me to suggest a data specification for heterogeneous traffic patterns, or should I open an issue to track this?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/rhaiis/workloads.yaml` around lines 48 - 55, Replace the placeholder
in the heterogeneous workload's guidellm.data field (in the "heterogeneous"
block) with a valid data specification or temporarily disable the workload;
specifically, update guidellm.data to point to a real dataset or pattern
definition (for example a path, dataset name, or structured traffic spec) that
matches the existing fields rate_type, rates, and max_seconds, or remove/disable
the entire heterogeneous entry until a proper data value is provided so GuideLLM
won't receive "ADD ME".
projects/core/workflow/executor.py-89-93 (1)

89-93: ⚠️ Potential issue | 🟡 Minor

Inconsistency: start_time not set for finally steps.

Normal steps have result.start_time assigned (albeit incorrectly), but finally steps don't set it at all. Apply the same fix here for consistency.

🔧 Proposed fix
+            step_start_time = datetime.now(timezone.utc)
             step_start = time.monotonic()
             try:
                 result = step.execute(ctx)
                 result.duration_seconds = time.monotonic() - step_start
+                result.start_time = step_start_time
                 step_results[step_name] = result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` around lines 89 - 93, The finally-step
handling block currently records result.duration_seconds but never sets
result.start_time; update the finally step execution to capture a start
timestamp (e.g., start_time = time.monotonic()) immediately before calling
step.execute(ctx), assign that value to result.start_time, and then compute
result.duration_seconds = time.monotonic() - start_time, mirroring how normal
steps populate start_time/duration for consistency with step_results[step_name]
and the existing result fields.
projects/rhaiis/IMPLEMENTATION.md-7-34 (1)

7-34: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks.

These blocks are missing fence languages, which can trigger markdown lint failures.

Example fix
-```
+```text
 CLI/CI Entry Points
@@
-```
+```

-```
+```text
 config/
@@
-```
+```

Also applies to: 59-78

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/IMPLEMENTATION.md` around lines 7 - 34, The fenced code
blocks in IMPLEMENTATION.md that contain the ASCII diagrams (e.g., the block
showing "CLI/CI Entry Points → test_rhaiis.py", "ConfigLoader",
"BenchmarkWorkflow", and "WorkflowStep implementations") and the later similar
block (around the config/ section) are missing language identifiers; add a
language tag (use "text") to the opening triple-backtick for each fenced block
so markdown linters recognize them as code blocks (update the opening ``` to
```text for the diagram block(s) and any other plain-text fenced blocks
referenced in the doc).
tests/rhaiis/test_ci.py-34-35 (1)

34-35: ⚠️ Potential issue | 🟡 Minor

Avoid fixed /tmp/artifacts in tests.

Using a shared absolute temp path can cause cross-test interference and portability issues. Prefer a per-test temporary directory fixture.

Safer test setup
 class TestCiTest:
@@
     @pytest.fixture
-    def runner(self):
-        return CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"})
+    def runner(self, tmp_path):
+        artifact_dir = tmp_path / "artifacts"
+        return CliRunner(env={"FORGE_ARTIFACT_DIR": str(artifact_dir)})
@@
         result = runner.invoke(
             ci, ["test", "--dry-run"],
-            env={"FORGE_MODEL": "test/model", "FORGE_ARTIFACT_DIR": "/tmp/artifacts"}
+            env={"FORGE_MODEL": "test/model"}
         )
@@
                 "FORGE_MODEL": "test/model",
                 "FORGE_WORKLOADS": "balanced,heterogeneous",
-                "FORGE_ARTIFACT_DIR": "/tmp/artifacts",
             }
         )

Also applies to: 40-41, 53-54

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_ci.py` around lines 34 - 35, Replace the hardcoded
"/tmp/artifacts" used when creating CliRunner with a per-test temporary
directory (use pytest's tmp_path or tmp_path_factory) and set FORGE_ARTIFACT_DIR
to that temp path (e.g., str(tmp_path / "artifacts")); update all occurrences
that call CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"}) (including the
other instances mentioned at lines 40-41 and 53-54) so each test gets an
isolated directory and remove reliance on a shared absolute path.
tests/rhaiis/test_ci.py-58-58 (1)

58-58: ⚠️ Potential issue | 🟡 Minor

Assert both workloads, not either.

This test currently passes if only one workload is rendered, which can hide parsing/formatting regressions.

Tighten expectation
-        assert "balanced" in result.output or "heterogeneous" in result.output
+        assert "balanced" in result.output
+        assert "heterogeneous" in result.output
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_ci.py` at line 58, The test currently asserts that either
"balanced" or "heterogeneous" appears in result.output, which is too permissive;
change the assertion to require both strings be present (e.g., replace the
single-line check using "or" with two assertions or a combined condition using
"and") so that result.output contains both "balanced" and "heterogeneous" (refer
to the assertion referencing result.output in tests/rhaiis/test_ci.py).
tests/core/scenarios/test_generator.py-226-230 (1)

226-230: ⚠️ Potential issue | 🟡 Minor

Same temporary file cleanup issue as in sample_config_path.

Apply the same fix pattern here for explicit_runs_config fixture.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/scenarios/test_generator.py` around lines 226 - 230, The
explicit_runs_config fixture uses tempfile.NamedTemporaryFile(..., delete=False)
and yields Path(f.name) without removing the temp file; update the
explicit_runs_config fixture to mirror the sample_config_path fix by creating
and writing the temp YAML inside a try/finally (or context manager) and then
removing the file in the finally block (e.g., call os.remove on Path(f.name))
after the yield so the temporary file is always cleaned up; keep the existing
tempfile.NamedTemporaryFile, yaml.safe_dump and the yielded Path(f.name) but add
the cleanup logic surrounding the yield.
projects/rhaiis/orchestration/cli.py-48-59 (1)

48-59: ⚠️ Potential issue | 🟡 Minor

Inconsistent dry-run behavior: metadata is written before early return.

In prepare(), ctx.write_metadata() is called before the dry_run check, meaning even in dry-run mode, metadata files will be created. Same issue exists in cleanup(). Additionally, test() doesn't write metadata at all, which is inconsistent with the other commands.

Proposed fix for prepare()
 def prepare(rhoai_version: str, dry_run: bool):
     """Install operators (RHOAI, NFD, GPU) on OpenShift."""
-    ctx = WorkflowContext.from_environment()
-    ctx.write_metadata({"command": "prepare", "rhoai_version": rhoai_version})
-
     if dry_run:
+        ctx = WorkflowContext.from_environment()
         click.echo(f"[DRY-RUN] Would install RHOAI {rhoai_version}")
         click.echo(f"[DRY-RUN] Artifacts would be written to: {ctx.artifact_dir}")
         return

+    ctx = WorkflowContext.from_environment()
+    ctx.write_metadata({"command": "prepare", "rhoai_version": rhoai_version})
+
     exit_code = test_rhaiis.run_prepare(ctx, rhoai_version)
     sys.exit(exit_code)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/cli.py` around lines 48 - 59, The metadata is
being written even during dry runs and is missing for test(); update prepare(),
cleanup(), and test() to be consistent: move the ctx.write_metadata({"command":
...}) calls in prepare() and cleanup() to after the dry_run check (so dry-run
returns before any metadata/file writes), and add a similar
ctx.write_metadata({"command": "test", "rhoai_version": rhoai_version}) call in
test() (or the equivalent metadata payload used elsewhere) so all commands
record metadata only for real runs; refer to the prepare, cleanup, and test
functions and the WorkflowContext.write_metadata / ctx.write_metadata symbol
names to locate and adjust the code.
projects/rhaiis/orchestration/ci.py-57-92 (1)

57-92: ⚠️ Potential issue | 🟡 Minor

Inconsistent metadata handling and missing model validation in test command.

The test command differs from prepare and cleanup in two ways:

  1. It does not call workflow_ctx.write_metadata() to record command parameters
  2. It does not validate that model is provided (unlike cli.py which validates when not in dry-run)

If FORGE_MODEL is unset, model will be None and passed directly to run_test, which may cause confusing errors downstream.

Proposed fix
 @ci.command()
 @click.option("--dry-run", is_flag=True, help="Print what would be done without executing")
 @click.pass_context
 def test(ctx, dry_run: bool):
     """Test phase - Run vLLM benchmark.

     Configuration from config/ directory. FOURNOS sets:
         FORGE_MODEL - Model key (e.g., qwen-0.6b)
         FORGE_WORKLOADS - Comma-separated workloads (optional)
     """
-    workflow_ctx = WorkflowContext.from_environment()
-
     # Key parameters from FOURNOS
     model = os.environ.get("FORGE_MODEL")
     workloads_str = os.environ.get("FORGE_WORKLOADS")

     # Parse workloads
     workloads = None
     if workloads_str:
         workloads = [w.strip() for w in workloads_str.split(",")]

+    if dry_run:
+        click.echo("[DRY-RUN] Test phase")
+        click.echo(f"[DRY-RUN] Model: {model}")
+        if workloads:
+            click.echo(f"[DRY-RUN] Workloads: {workloads}")
+        return
+
+    if not model:
+        click.echo("Error: FORGE_MODEL environment variable must be set", err=True)
+        sys.exit(1)
+
+    workflow_ctx = WorkflowContext.from_environment()
+    workflow_ctx.write_metadata({
+        "command": "test",
+        "model": model,
+        "workloads": workloads,
+    })
+
     # Log
     click.echo("RHAIIS CI Test")
     click.echo(f"  Model: {model}")
     if workloads:
         click.echo(f"  Workloads: {workloads}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/ci.py` around lines 57 - 92, The test command
handler (test) must mirror prepare/cleanup by validating the model and recording
metadata: if not dry_run ensure the FORGE_MODEL-derived model is present and
abort/exit with a clear message when missing, and call
workflow_ctx.write_metadata(...) to persist the command parameters (e.g.,
command name, model, workloads, config_dir, dry_run) before invoking
test_rhaiis.run_test; update the test function to perform the model check and to
call WorkflowContext.write_metadata() on workflow_ctx prior to calling
test_rhaiis.run_test so metadata handling is consistent with the other commands.
tests/core/scenarios/test_generator.py-60-64 (1)

60-64: ⚠️ Potential issue | 🟡 Minor

Temporary file is not cleaned up after test.

NamedTemporaryFile(delete=False) creates a file that persists after the context manager exits. The yielded path should be cleaned up after the test completes to avoid leaving orphan files.

Proposed fix
+    import os
+
     `@pytest.fixture`
     def sample_config_path(self):
         """Create a sample scenarios.yaml file using new format."""
         config = {
             # ... config dict ...
         }

         with tempfile.NamedTemporaryFile(
             mode="w", suffix=".yaml", delete=False
         ) as f:
             yaml.safe_dump(config, f)
-            yield Path(f.name)
+            path = Path(f.name)
+        try:
+            yield path
+        finally:
+            os.unlink(path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/scenarios/test_generator.py` around lines 60 - 64, The temporary
YAML file created with tempfile.NamedTemporaryFile(delete=False) in the test
fixture is left on disk; after yielding Path(f.name) you must remove the file in
the fixture teardown. Update the generator fixture around
tempfile.NamedTemporaryFile / yaml.safe_dump to yield Path(f.name) and then call
Path(f.name).unlink() (or os.remove) in the code after the yield to ensure the
temporary file is cleaned up.
projects/core/scenarios/generator.py-612-612 (1)

612-612: ⚠️ Potential issue | 🟡 Minor

Useless if-else condition.

The ternary expression 1 if isinstance(model, ModelConfig) else 1 always evaluates to 1 regardless of the condition, making the check pointless.

🔧 Suggested fix
-        tp = run.get("tensor_parallel", 1 if isinstance(model, ModelConfig) else 1)
+        tp = run.get("tensor_parallel", 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/generator.py` at line 612, The ternary in the tp
assignment is redundant; replace the expression tp = run.get("tensor_parallel",
1 if isinstance(model, ModelConfig) else 1) with a single default value by using
run.get("tensor_parallel", 1) (keep references to the same variables run and
model and remove the pointless isinstance(ModelConfig) check); if there was an
intended different default for ModelConfig, clarify and implement that branch,
otherwise simply use run.get("tensor_parallel", 1) in generator.py where tp is
set.
🧹 Nitpick comments (21)
projects/core/workflow/step.py (1)

79-85: WorkflowStep.name is lowercased, not snake_case.

WaitForReadyStep currently becomes waitforready, not wait_for_ready, so the implementation doesn't match the contract in the comment/docstring. Either insert separators or document the actual behavior.

Small fix
+import re
...
-        # Convert to lowercase
-        return class_name.lower()
+        return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/step.py` around lines 79 - 85, The current name
property/method in the WorkflowStep class returns the class name lowercased
(e.g., WaitForReadyStep -> waitforready) instead of snake_case; update the
implementation that computes class_name from self.__class__.__name__ (and still
strips the "Step" suffix) to convert CamelCase to snake_case by inserting
underscores between word boundaries (or using a regex like (?<!^)(?=[A-Z]) to
split) and then lowercasing so WaitForReadyStep becomes wait_for_ready; keep the
"Step" suffix removal logic and ensure the symbol to change is the
WorkflowStep.name property/method that currently uses
class_name.endswith("Step") and class_name.lower().
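The suggested regex can be exercised standalone (the step class below is a stand-in). One caveat worth documenting: runs of consecutive capitals, e.g. a hypothetical DeployVLLMStep, would come out as deploy_v_l_l_m with this simple pattern.

```python
import re

def step_name(cls: type) -> str:
    """Strip a trailing 'Step' suffix, then insert '_' before each interior
    uppercase letter and lowercase the result (CamelCase -> snake_case)."""
    class_name = cls.__name__
    if class_name.endswith("Step"):
        class_name = class_name[: -len("Step")]
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()

class WaitForReadyStep:
    pass

print(step_name(WaitForReadyStep))  # -> wait_for_ready
```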
projects/rhaiis/workflows/steps/deploy.py (2)

320-328: Importing click inside method is unconventional.

While functional, importing inside a method is unusual. Consider moving the import to the top of the file since it's used in multiple methods.

♻️ Move import to top of file
 """RHAIIS vLLM deployment steps using KServe (ServingRuntime + InferenceService)."""

+import click
 import logging
 import subprocess
 # ...

 class WaitForReadyStep(WorkflowStep):
     # ...
     def execute(self, ctx: "WorkflowContext") -> StepResult:
         """Wait for InferenceService to be ready."""
-        import click
-
         click.echo(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/deploy.py` around lines 320 - 328, The import
of click inside the execute method is unconventional; move the import to the
module scope so it can be reused by other methods and avoids repeated local
imports. Remove the inline "import click" from the execute(self, ctx:
"WorkflowContext") -> StepResult method and add a top-level "import click" at
the top of the file (near other imports) so symbols like execute and any other
methods in this module use the shared module-level click import.

93-102: Namespace creation can be simplified.

The dry-run command output is not piped to the apply command. The current pattern runs two separate commands where the first one's output is discarded. Consider simplifying to a single idempotent apply.

♻️ Simplified namespace creation
-    # Ensure namespace exists
-    subprocess.run(
-        ["oc", "create", "namespace", self.namespace, "--dry-run=client", "-o", "yaml"],
-        capture_output=True,
-    )
-    subprocess.run(
-        ["oc", "apply", "-f", "-"],
-        input=f"apiVersion: v1\nkind: Namespace\nmetadata:\n  name: {self.namespace}\n",
-        capture_output=True,
-        text=True,
-    )
+    # Ensure namespace exists (idempotent)
+    subprocess.run(
+        ["oc", "create", "namespace", self.namespace, "--dry-run=client", "-o", "yaml"],
+        capture_output=True,
+        text=True,
+    )
+    # Apply is idempotent - will create if not exists or no-op if exists
+    subprocess.run(
+        ["oc", "create", "namespace", self.namespace],
+        capture_output=True,  # Ignore "already exists" error
+    )

Or use a shell pipe if the dry-run output is actually needed:

subprocess.run(
    "oc create namespace {ns} --dry-run=client -o yaml | oc apply -f -".format(ns=self.namespace),
    shell=True,
    capture_output=True,
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/deploy.py` around lines 93 - 102, Replace the
two subprocess.run calls that run a dry-run and then a separate apply (where
dry-run output is discarded) with a single idempotent apply invocation: call
subprocess.run(["oc","apply","-f","-"], input=the namespace YAML using
self.namespace, text=True, capture_output=True) so the namespace is
created/updated atomically; alternatively, if you prefer the dry-run output, run
a single shell command that pipes oc create --dry-run=client -o yaml into oc
apply -f - via subprocess.run(..., shell=True, capture_output=True). Ensure you
remove the prior dry-run subprocess.run and only use one subprocess.run that
uses self.namespace as the interpolated name.
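The single-apply option can be sketched as a small helper (assumes `oc` is on PATH; the helper name is illustrative, not the PR's code, and `check=True` surfaces CLI failures as exceptions):

```python
import subprocess

def ensure_namespace(namespace: str) -> None:
    """Idempotently ensure a namespace exists: render a client-side dry-run
    manifest, then pipe it into `oc apply -f -`."""
    manifest = subprocess.run(
        ["oc", "create", "namespace", namespace, "--dry-run=client", "-o", "yaml"],
        capture_output=True, text=True, check=True,
    ).stdout
    subprocess.run(
        ["oc", "apply", "-f", "-"],
        input=manifest, capture_output=True, text=True, check=True,
    )
```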
tests/rhaiis/test_operators.py (2)

48-61: Test asserts file exists but doesn't verify directory creation.

Once the step_dir.mkdir() fix is applied to operators.py, this test should pass. However, consider adding an explicit assertion that the step_dir was created to make the test more robust.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_operators.py` around lines 48 - 61, The test
test_creates_subscription_yaml currently asserts the YAML file exists but
doesn't explicitly assert the step directory was created; update the test to
include an assertion that step_dir (constructed from context.artifact_dir and
f"{context.step_number:03d}__{context.current_step_name}") exists before reading
the file, so it fails clearly if InstallNFDOperatorStep (or the directory
creation in operators.py) didn't create the folder; keep using the same step_dir
and yaml_file variables and assert step_dir.exists() (or is_dir()) prior to
checking yaml_file.exists() and its content.

37-46: Patch target should be module-specific for test isolation.

Patching subprocess.run globally may cause unintended side effects. Prefer patching at the module level where subprocess is imported.

♻️ Suggested fix
-    @patch("subprocess.run")
+    @patch("projects.rhaiis.workflows.steps.operators.subprocess.run")
     def test_execute_success(self, mock_run, context):

Apply the same change to all @patch("subprocess.run") decorators in this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_operators.py` around lines 37 - 46, The test currently
patches subprocess.run globally; update the `@patch` decorator to target
subprocess.run in the module that defines InstallNFDOperatorStep (i.e., patch
the subprocess imported into that module) so the mock is module-scoped and
isolated; apply the same change for all `@patch`("subprocess.run") decorators in
this test file (refer to test_execute_success and other tests that exercise
InstallNFDOperatorStep) so they patch
"<module_where_InstallNFDOperatorStep_is_defined>.subprocess.run" instead of the
top-level "subprocess.run".
projects/core/steps/guidellm.py (1)

272-276: Progress logging condition may not trigger reliably.

The condition elapsed % 60 == 0 with a 10-second poll interval and time.monotonic() precision may cause missed prints (e.g., if elapsed is 59.9 or 60.1). Consider tracking the last print time instead.

♻️ Suggested fix
 def _wait_for_pod_completion(self, timeout: int) -> dict:
     start_time = time.monotonic()
     poll_interval = 10
+    last_progress_print = 0

     print(f"Waiting for GuideLLM benchmark to complete (timeout: {timeout}s)...")

     while time.monotonic() - start_time < timeout:
         # ... existing logic ...

-            # Still running, no marker yet
-            elapsed = int(time.monotonic() - start_time)
-            if elapsed % 60 == 0:  # Print every minute
-                print(f"  GuideLLM running... ({elapsed}s elapsed, phase={phase})")
+            # Still running, no marker yet - print progress every 60s
+            elapsed = int(time.monotonic() - start_time)
+            if elapsed - last_progress_print >= 60:
+                print(f"  GuideLLM running... ({elapsed}s elapsed, phase={phase})")
+                last_progress_print = elapsed
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/steps/guidellm.py` around lines 272 - 276, The progress logging
uses `elapsed % 60 == 0` which can miss prints due to polling granularity;
replace that with a monotonic-based threshold by introducing a `last_print_time`
(initialized to `start_time`) and, inside the loop where `elapsed =
int(time.monotonic() - start_time)` is computed, check `if time.monotonic() -
last_print_time >= 60:` then do the print using `elapsed` and `phase` and set
`last_print_time = time.monotonic()` so prints occur reliably every 60 seconds
regardless of timing alignment.
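The last-print-time pattern can be shown in isolation; clock and sleep are injectable here purely so the sketch is testable (the real step would use time.monotonic and time.sleep directly):

```python
import time

def wait_with_progress(is_done, timeout: float, poll_interval: float = 10.0,
                       progress_every: float = 60.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll is_done() until it succeeds or timeout elapses, printing a
    progress line at most once per progress_every seconds by tracking the
    last print time instead of testing `elapsed % 60 == 0`."""
    start = clock()
    last_print = start
    while clock() - start < timeout:
        if is_done():
            return True
        now = clock()
        if now - last_print >= progress_every:
            print(f"  still running... ({int(now - start)}s elapsed)")
            last_print = now
        sleep(poll_interval)
    return False
```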
projects/core/workflow/executor.py (2)

82-82: Unused: finally_errors list is populated but never consumed.

Errors from finally steps are collected into finally_errors but the list is never included in the WorkflowResult, logged as a summary, or otherwise used. Consider either:

  • Including finally errors in the result for caller visibility
  • Logging a summary of finally failures
  • Removing the collection if not needed
♻️ Option: Include in result or log summary
+        if finally_errors:
+            logger.warning(f"Finally block encountered {len(finally_errors)} error(s)")
+
         end_time = datetime.now(timezone.utc)

Or extend WorkflowResult to include finally_errors if callers need this information.

Also applies to: 98-99, 112-112

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` at line 82, The finally_errors list
defined in executor.py (variable finally_errors) is being populated but never
used; update the code to either surface those errors to callers by adding a
finally_errors field to WorkflowResult and populating it before returning, or at
minimum log a concise summary of failures from finally_errors (e.g., count and
messages) before returning/raising; if finally errors are truly unnecessary,
remove the finally_errors collection and its appends (occurring around the
finally step handling) to avoid dead state. Ensure changes reference the
existing finally step handling and the WorkflowResult construction so callers
receive the final information.
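Surfacing the collected errors might look like the following sketch (the result type and helper are minimal stand-ins for the PR's WorkflowResult and executor logic):

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowResult:
    """Minimal stand-in for the PR's result type, extended with the
    finally_errors field the review suggests."""
    success: bool
    finally_errors: list = field(default_factory=list)

def run_finally_steps(steps, result: WorkflowResult) -> WorkflowResult:
    """Run each finally-step callable, collecting failures instead of
    silently dropping them, then log a summary for caller visibility."""
    for step in steps:
        try:
            step()
        except Exception as exc:  # a failing finally step must not abort the rest
            result.finally_errors.append(f"{getattr(step, '__name__', 'step')}: {exc}")
    if result.finally_errors:
        print(f"Finally block encountered {len(result.finally_errors)} error(s)")
    return result
```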

51-52: Unused variable: step_artifact_dir is fetched but never used.

The artifact directory is retrieved on lines 52 and 87 but not passed to the step or used otherwise. Either remove these lines or pass the directory to the step if it's needed.

♻️ Option 1: Remove unused code
-            # Get artifact directory for this step
-            step_artifact_dir = ctx.get_step_artifact_dir(step_name)
-
             step_start = time.monotonic()
♻️ Option 2: Pass to step if needed

If steps need access to their artifact directory, consider passing it as a parameter or storing it in context before execution.

Also applies to: 87-87

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` around lines 51 - 52, The local variable
step_artifact_dir (obtained via ctx.get_step_artifact_dir(step_name)) is never
used; either remove the unused retrieval calls or wire the directory into step
execution: remove the step_artifact_dir assignment at both sites if artifacts
aren’t needed, or pass it into the step invocation (e.g., add a parameter to the
step execution call or set it on the ctx/step object before calling
step.run/execute) so the symbol step_artifact_dir and the call
ctx.get_step_artifact_dir(step_name) are actually consumed; update the step API
(method name used for running the step) and any callers accordingly.
config/rhaiis/defaults.yaml (1)

19-19: Consider: uvicorn-log-level: debug may be verbose for production runs.

Debug-level logging can generate substantial output during benchmarks. If this is intentional for initial testing, consider adding a comment or parameterizing it for production use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/rhaiis/defaults.yaml` at line 19, The YAML default sets
uvicorn-log-level: debug which is noisy for production; change the default to a
less verbose level (e.g., info) or make the uvicorn-log-level configurable via
an environment-backed or CLI-overridable parameter and add an inline comment
explaining debug is only for development/benchmarks; update any
bootstrap/launcher code that reads the uvicorn-log-level key (search for
uvicorn-log-level) to prefer the env/CLI value when present.
projects/rhaiis/workflows/cleanup.py (1)

15-40: Unused parameter: remove_operators is stored but never used.

The remove_operators parameter is documented and stored on the instance (line 31), but define_steps() doesn't conditionally add operator removal steps when it's True. This creates an API contract that isn't fulfilled.

Consider either:

  1. Adding conditional operator removal steps when remove_operators=True
  2. Adding a TODO/FIXME comment if this is planned for later
  3. Removing the parameter until it's implemented
📝 Option: Add a TODO comment to track the incomplete feature
     def define_steps(self):
         """Define cleanup steps."""
+        # TODO: Add operator removal steps when remove_operators=True
         self.add_step(
             CleanupNamespaceStep(
                 namespace=self.namespace,
                 delete_namespace=False,  # Keep namespace, delete contents
             )
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/cleanup.py` around lines 15 - 40, The class stores
remove_operators but never uses it in define_steps; update define_steps to check
self.remove_operators and, when True, add a dedicated operator-cleanup step
(e.g., add RemoveOperatorsStep or CleanupOperatorsStep after the existing
CleanupNamespaceStep) so operator removal is performed; if an operator-cleanup
step class does not yet exist, either implement a small RemoveOperatorsStep and
add it conditionally, or at minimum add a clear TODO/FIXME in define_steps
referencing remove_operators to avoid the unused-parameter contract.
projects/core/scenarios/config.py (1)

100-103: Suffix stripping misses chained suffixes.

A model like foo-instruct-dynamic becomes foo-instruct today. Looping until no known suffix remains makes shortening more consistent.

Refinement
-        for suffix in ["-instruct", "-dynamic", "-chat"]:
-            if name.endswith(suffix):
-                name = name[: -len(suffix)]
+        changed = True
+        while changed:
+            changed = False
+            for suffix in ["-instruct", "-dynamic", "-chat"]:
+                if name.endswith(suffix):
+                    name = name[: -len(suffix)]
+                    changed = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/config.py` around lines 100 - 103, The current
suffix-stripping loop over ["-instruct","-dynamic","-chat"] only removes one
suffix, so names like "foo-instruct-dynamic" become "foo-instruct"; modify the
logic in projects/core/scenarios/config.py where the variable name is trimmed to
repeatedly strip known suffixes until none match (e.g., wrap the existing
for-loop in a while-loop or use a loop that breaks only when no suffix was
removed), ensuring you reference the same suffix list and update the variable
name in-place each removal.
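The loop-until-stable variant can be checked standalone (function name is illustrative; the suffix list matches the one in config.py):

```python
KNOWN_SUFFIXES = ["-instruct", "-dynamic", "-chat"]

def short_name(name: str) -> str:
    """Repeatedly strip known suffixes until none remain, so chained
    suffixes such as 'foo-instruct-dynamic' reduce all the way to 'foo'."""
    changed = True
    while changed:
        changed = False
        for suffix in KNOWN_SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                changed = True
    return name
```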
tests/core/steps/test_artifacts.py (1)

50-52: Prefer structured call-arg assertions over string matching.

Checking str(call_args_list) is fragile. Asserting specific positional args/flags from mock_run.call_args_list will make these tests more deterministic.

Also applies to: 96-98

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/steps/test_artifacts.py` around lines 50 - 52, Replace the fragile
string-based assertion on mock_run.call_args_list with structured checks against
each Call object's positional and keyword arguments: iterate
mock_run.call_args_list and assert that at least one Call has "logs" present in
the specific arg/flag you expect (e.g., check call.args[index] or
call.kwargs['argname']), and do the same for the other occurrence around lines
96-98; use mock_run.call_args_list directly rather than str(...) so the test
asserts the exact positional/keyword argument values.
tests/core/steps/test_guidellm.py (1)

71-73: Strengthen subprocess assertions in the success path.

The current check only verifies that at least one call happened. Adding expected call count (and optionally key command checks) will make this test catch flow regressions earlier.

Suggested test hardening
-        assert result.success
-        assert mock_run.called
+        assert result.success
+        assert mock_run.call_count == 5
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/steps/test_guidellm.py` around lines 71 - 73, The test currently
only asserts result.success and that mock_run was called; strengthen it by
asserting the exact expected number of subprocess invocations (e.g., assert
mock_run.call_count == 1) and also verify the arguments of the call(s) (e.g.,
use mock_run.assert_any_call(...) or inspect mock_run.call_args_list to ensure
the invoked command contains the expected executable/flags). Update the
assertions around result and mock_run in tests/core/steps/test_guidellm.py
(referencing result and mock_run) to check both call count and at least one
expected command pattern to prevent regressions.
projects/rhaiis/IMPLEMENTATION.md (1)

282-285: Replace user-local paths with portable commands.

The current example uses machine-specific directories, which makes the docs harder to follow for others.

Portable docs snippet
-# Activate venv and run from forge directory
-source ~/test_foo/python3_virt/bin/activate
-cd /Users/memehta/workspace/forge
+# Activate venv and run from repository root
+source .venv/bin/activate
+cd <path-to-forge-repo>
 PYTHONPATH=. python -m pytest tests/ --ignore=tests/llm_d -v
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/IMPLEMENTATION.md` around lines 282 - 285, Replace the
hardcoded activation and cd commands — specifically the lines referencing
"source ~/test_foo/python3_virt/bin/activate" and "cd
/Users/memehta/workspace/forge" — with portable instructions: instruct users to
create and activate a virtualenv inside the project (e.g., an .venv in the repo)
and to change directory to the repository root (recommend using a
repository-root discovery command like git rev-parse --show-toplevel or "cd to
project root"), and keep the pytest invocation as a relative, portable command
(using PYTHONPATH=. and the existing pytest flags and --ignore path) so the
snippet works across machines.
tests/rhaiis/test_steps.py (1)

21-31: Consider extracting duplicate fixtures to a conftest.py.

The temp_artifact_dir and context fixtures are duplicated across all three test classes (TestDeployVLLMStep, TestWaitForReadyStep, TestCleanupNamespaceStep). Extracting them to a conftest.py file would reduce duplication and improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/rhaiis/test_steps.py` around lines 21 - 31, Extract the duplicated
pytest fixtures into a shared conftest.py: move the temp_artifact_dir fixture
(the pytest.fixture that yields Path(tmpdir) from tempfile.TemporaryDirectory)
and the context fixture (which calls
WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir)) and
pre-increments via ctx.get_step_artifact_dir("deploy")) into conftest.py, remove
their definitions from tests/rhaiis/test_steps.py, and ensure test modules
import pytest normally so they automatically use the shared fixtures.
tests/core/scenarios/test_generator.py (1)

82-83: Consider using ASCII 'x' instead of Unicode '×' in comments.

The multiplication sign × (U+00D7) may cause issues in some editors or terminals. Using ASCII x is more portable.

Proposed fix
-        # 2 workloads × 1 routing × 2 TP = 4 scenarios
+        # 2 workloads x 1 routing x 2 TP = 4 scenarios
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/scenarios/test_generator.py` around lines 82 - 83, Replace the
Unicode multiplication sign in the inline comment within test_generator.py (the
comment string "# 2 workloads × 1 routing × 2 TP = 4 scenarios") with the ASCII
letter "x" to avoid encoding/editor issues; update the comment in the same
location (referencing the comment near the scenarios length assertion in the
test_generator test function) to read "# 2 workloads x 1 routing x 2 TP = 4
scenarios".
projects/rhaiis/orchestration/test_rhaiis.py (3)

25-36: Add type hint for ctx parameter.

The ctx parameter at line 35 lacks a type annotation, unlike other function signatures in this file.

✏️ Suggested fix
 def _dry_run_test(
     model: str | None,
     workload: str | None,
     workloads: list[str] | None,
     config_loader,
     accelerator: str,
     vllm_image: str | None,
     tensor_parallel: int | None,
     max_requests: int | None,
     namespace: str | None,
-    ctx,
+    ctx: WorkflowContext,
 ) -> int:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/test_rhaiis.py` around lines 25 - 36, The ctx
parameter in _dry_run_test is missing a type annotation; add a matching type
annotation (e.g., click.Context or the same context type used elsewhere in this
file) to the ctx parameter, update the function signature for _dry_run_test
accordingly, and add the required import (from click or typing import Any) if
not already present so the annotation resolves.

385-387: Consider making the drain time configurable.

The 5-second drain time is hardcoded. For clusters with higher latency or heavier in-flight request loads, this may be insufficient. Consider making it configurable via defaults.yaml or a parameter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/test_rhaiis.py` around lines 385 - 387, Replace
the hardcoded 5-second sleep in test_rhaiis.py (the block using idx,
group_workloads, click.echo and time.sleep) with a configurable drain timeout:
add a config key (e.g., drain_seconds) to defaults.yaml and read it in the test
(falling back to a sensible default like 5), or add a CLI/fixture parameter that
supplies the value; then use that variable in the click.echo message and
time.sleep call so clusters with higher latency can increase the drain time
without changing code.

278-278: Deployment name truncation may cause collisions for models with similar prefixes.

The deployment name is truncated to 42 characters after sanitization. For models with very similar long names, this could result in identical deployment names. Consider adding a hash suffix or documenting this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/orchestration/test_rhaiis.py` at line 278, The
deployment_name assignment (using hf_model_id) can collide after truncation;
modify the logic that builds deployment_name so it appends a short deterministic
hash suffix derived from hf_model_id (e.g., first 6 hex chars of SHA-1/SHA-256)
after sanitization and truncation to keep the overall length within 42
characters; update the expression that sets deployment_name to compute
sanitized_base = hf_model_id.split("/")[-1].lower().replace(".",
"-").replace("_", "-") truncated to (42 - 1 - hash_len) characters, then append
"-" + hash_suffix to produce the final deployment_name.
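The hash-suffix scheme might be sketched as follows (an illustrative helper, not the PR's exact code; only the truncate-then-append-hash structure matters):

```python
import hashlib

MAX_NAME_LEN = 42  # deployment-name limit used in the PR

def deployment_name(hf_model_id: str, hash_len: int = 6) -> str:
    """Sanitize and truncate an HF model id into a deployment name, then
    append a short deterministic hash of the full id so similar long names
    no longer collide after truncation."""
    base = hf_model_id.split("/")[-1].lower().replace(".", "-").replace("_", "-")
    suffix = hashlib.sha256(hf_model_id.encode()).hexdigest()[:hash_len]
    return f"{base[: MAX_NAME_LEN - 1 - hash_len]}-{suffix}"
```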
projects/core/scenarios/generator.py (2)

1-20: Docstring uses Unicode multiplication sign.

The docstring at line 11 uses × (Unicode multiplication sign) which triggers static analysis warning RUF002. While this is cosmetic and doesn't affect functionality, using ASCII x would be more portable.

✏️ Suggested fix
-4. Matrix expansion: model × workloads × tensor_parallel
+4. Matrix expansion: model x workloads x tensor_parallel
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/generator.py` around lines 1 - 20, The docstring in
projects/core/scenarios/generator.py uses a Unicode multiplication sign in the
"Matrix expansion" example (the line showing "model × workloads ×
tensor_parallel"); replace the Unicode '×' characters with the ASCII letter 'x'
so the example reads "model x workloads x tensor_parallel" to satisfy static
analysis (RUF002) while keeping the docstring content unchanged.

104-112: Consider using hashlib for deterministic deployment IDs.

The deployment_id property uses hash(), which is non-deterministic across Python runs (due to PYTHONHASHSEED randomization). This means the same configuration will produce different deployment IDs in different invocations. While currently used only for display, this could cause confusion when comparing outputs or logs across runs. Using hashlib.md5() or similar would provide consistent IDs: override_hash = int(hashlib.md5(repr(sorted(self.vllm_args_override.items())).encode()).hexdigest()[:4], 16).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/generator.py` around lines 104 - 112, The
deployment_id property uses Python's non-deterministic built-in hash for
vllm_args_override; replace that with a deterministic hash using hashlib (e.g.,
hashlib.md5) by serializing the override items in a stable order (sort the items
or use tuple(sorted(...))) and encoding to bytes before hashing, then derive
override_hash from the first few hex digits (or as an int) and return
f"{base}_wl{override_hash}" so deployment_id is stable across runs; update the
logic in the deployment_id property (references: deployment_id,
vllm_args_override, override_hash, base).
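A deterministic replacement for the built-in hash() could look like this sketch (helper name is illustrative; items are sorted so dict insertion order does not matter, and a sorted repr is encoded because a frozenset has no .encode()):

```python
import hashlib

def override_hash(vllm_args_override: dict, digits: int = 4) -> int:
    """Deterministic short hash of an override mapping, stable across
    interpreter runs (built-in hash() is randomized by PYTHONHASHSEED)."""
    canonical = repr(sorted(vllm_args_override.items())).encode()
    return int(hashlib.md5(canonical).hexdigest()[:digits], 16)
```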

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8609a041-7915-493c-af3f-39712b66bee9

📥 Commits

Reviewing files that changed from the base of the PR and between 06371bf and a822096.

📒 Files selected for processing (51)
  • config/rhaiis/defaults.yaml
  • config/rhaiis/models.yaml
  • config/rhaiis/workloads.yaml
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/scenarios/__init__.py
  • projects/core/scenarios/config.py
  • projects/core/scenarios/config_loader.py
  • projects/core/scenarios/generator.py
  • projects/core/steps/__init__.py
  • projects/core/steps/artifacts.py
  • projects/core/steps/guidellm.py
  • projects/core/utils/__init__.py
  • projects/core/utils/oc.py
  • projects/core/workflow/__init__.py
  • projects/core/workflow/context.py
  • projects/core/workflow/executor.py
  • projects/core/workflow/step.py
  • projects/core/workflow/workflow.py
  • projects/rhaiis/IMPLEMENTATION.md
  • projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md
  • projects/rhaiis/__init__.py
  • projects/rhaiis/orchestration/__init__.py
  • projects/rhaiis/orchestration/ci.py
  • projects/rhaiis/orchestration/cli.py
  • projects/rhaiis/orchestration/test_rhaiis.py
  • projects/rhaiis/workflows/__init__.py
  • projects/rhaiis/workflows/benchmark.py
  • projects/rhaiis/workflows/cleanup.py
  • projects/rhaiis/workflows/prepare.py
  • projects/rhaiis/workflows/steps/__init__.py
  • projects/rhaiis/workflows/steps/cleanup.py
  • projects/rhaiis/workflows/steps/deploy.py
  • projects/rhaiis/workflows/steps/operators.py
  • tests/__init__.py
  • tests/core/__init__.py
  • tests/core/scenarios/__init__.py
  • tests/core/scenarios/test_config_loader.py
  • tests/core/scenarios/test_generator.py
  • tests/core/steps/__init__.py
  • tests/core/steps/test_artifacts.py
  • tests/core/steps/test_guidellm.py
  • tests/core/utils/__init__.py
  • tests/core/utils/test_oc.py
  • tests/core/workflow/__init__.py
  • tests/core/workflow/test_context.py
  • tests/core/workflow/test_executor.py
  • tests/rhaiis/__init__.py
  • tests/rhaiis/test_ci.py
  • tests/rhaiis/test_operators.py
  • tests/rhaiis/test_steps.py
  • tests/rhaiis/test_workflows.py

Comment thread config/rhaiis/models.yaml
Comment on lines +242 to +247
ministral-14b-fp8:
name: "Ministral-3-14B-Instruct-FP8"
hf_model_id: "RedHatAI/Ministral-3-14B-Instruct-2512"
vllm_args:
tensor-parallel-size: 4
supported_workloads: [balanced, short]

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

What is the correct HuggingFace model ID for RedHatAI Ministral-3-14B-Instruct FP8 quantized model?

💡 Result:

The correct Hugging Face model ID for the RedHatAI Ministral-3-14B-Instruct FP8 quantized model is RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic.


Fix the HuggingFace model ID for ministral-14b-fp8 to use the FP8-quantized variant.

The hf_model_id is missing the -FP8-dynamic suffix. Change "RedHatAI/Ministral-3-14B-Instruct-2512" to "RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic" to load the correct quantized model.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/rhaiis/models.yaml` around lines 242 - 247, Update the hf_model_id
value for the ministral-14b-fp8 model to point to the FP8-quantized variant:
find the ministral-14b-fp8 entry and change the hf_model_id
"RedHatAI/Ministral-3-14B-Instruct-2512" to
"RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic" so the correct FP8 model is
loaded.

Comment on lines +499 to +505
@click.command(context_settings=dict(
ignore_unknown_options=True,
allow_interspersed_args=False,
))
@click.argument('project', required=False)
@click.argument('operation', required=False)
@click.argument('args', nargs=-1)
@click.argument('args', nargs=-1, type=click.UNPROCESSED)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Click, when a command uses context_settings={"ignore_unknown_options": True, "allow_interspersed_args": False} with positional arguments followed by a variadic click.UNPROCESSED argument, are flags like --dry-run placed after the positional arguments still parsed as top-level options, or are they treated as plain arguments and forwarded?

💡 Result:

In Click, with context_settings={"ignore_unknown_options": True, "allow_interspersed_args": False} and positional arguments followed by a variadic click.UNPROCESSED argument, flags like --dry-run placed after the positional arguments are treated as plain arguments and forwarded in the UNPROCESSED tuple, not parsed as top-level options. This behavior occurs because allow_interspersed_args=False causes Click's parser to stop processing options upon encountering the first non-option (positional) argument. Any subsequent tokens resembling options (e.g., --dry-run) are not examined for option parsing; they are directly added to the argument list. With ignore_unknown_options=True, if any option-like tokens were processed before positionals, unknown ones would be ignored and forwarded, but due to allow_interspersed_args=False, tokens after positionals are never processed as potential options—they become plain arguments captured by the nargs=-1 UNPROCESSED argument. Official documentation confirms allow_interspersed_args=False prevents mixing options and arguments, stopping option processing at the first argument [3,4,6,19,23]. Examples and issues show that post-positional option-like strings end up in args when ignore_unknown_options=True and UNPROCESSED is used [6,7,9,10]. DeepWiki analysis of parser.py explicitly states: when False, parser stops option processing at first non-option [19].


🏁 Script executed:

```bash
# First, let's see the file content around the flagged lines
cat -n projects/core/ci_entrypoint/run_ci.py | sed -n '490,530p'
```

Repository: openshift-psap/forge
Length of output: 1589

🏁 Script executed:

```bash
# Search for --dry-run handling in the file
rg "\-\-dry\-run" projects/core/ci_entrypoint/run_ci.py -B 2 -A 2
```

Repository: openshift-psap/forge
Length of output: 556

🏁 Script executed:

```bash
# Check git history for this file to see if it changed recently
cd projects/core/ci_entrypoint && git log --oneline -20 run_ci.py 2>/dev/null | head -20
```

Repository: openshift-psap/forge
Length of output: 95

🏁 Script executed:

```bash
# Search for comments around the command decorator that might explain the design
cat -n projects/core/ci_entrypoint/run_ci.py | sed -n '485,510p'
```

Repository: openshift-psap/forge
Length of output: 1199

🏁 Script executed:

```bash
# Search for documentation or docstrings that explain flag usage
rg "dry.run|allow_interspersed" projects/core/ci_entrypoint/run_ci.py -B 3 -A 3
```

Repository: openshift-psap/forge
Length of output: 1706


The --dry-run flag cannot be used safely after positional arguments due to the Click parsing configuration.

With allow_interspersed_args=False, Click stops parsing options once the first positional argument is encountered. This means `run rhaiis test --dry-run` passes --dry-run as a plain argument into the variadic args instead of setting the --dry-run flag, so the dry-run guard never fires and the command executes for real instead of dry-running.

The --dry-run flag only works when placed before the positional arguments (e.g., `run --dry-run rhaiis test`), which is not intuitive and creates a safety regression: users expect the flag to work at the end of the command line.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/run_ci.py` around lines 499 - 505, The Click
command declaration in run_ci.py uses context_settings with
allow_interspersed_args=False which causes options like the global --dry-run
flag to be treated as positional arguments once any positional is seen; update
the decorator for the command that defines the CLI (the `@click.command`(...)
block that decorates the function handling 'project', 'operation', and 'args')
to set allow_interspersed_args=True (or remove that override) so Click will
continue to parse options interspersed with positional arguments and ensure
flags like --dry-run are recognized regardless of position.

Comment on lines +69 to +81
```python
def sanitize_name(name: str, max_len: int = 42) -> str:
    """
    Sanitize name for K8s resource naming.

    Rules:
    - Lowercase
    - Replace / and _ with -
    - Remove dots
    - Truncate to max_len
    """
    return (
        name.lower().replace("/", "-").replace("_", "-").replace(".", "")
    )[:max_len]
```

⚠️ Potential issue | 🟠 Major

sanitize_name can still produce invalid K8s resource names.

After truncation, names may end with -, and unsupported characters (beyond /, _, .) are not normalized. That can break resource creation at runtime.

Safer sanitizer

```diff
     def sanitize_name(name: str, max_len: int = 42) -> str:
 @@
-        return (
-            name.lower().replace("/", "-").replace("_", "-").replace(".", "")
-        )[:max_len]
+        name = name.lower().replace("/", "-").replace("_", "-").replace(".", "-")
+        name = re.sub(r"[^a-z0-9-]", "-", name)
+        name = re.sub(r"-{2,}", "-", name).strip("-")
+        name = name[:max_len].rstrip("-")
+        return name or "model"
```
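As a sanity check, a standalone version of the hardened sanitizer behaves like this (illustrative sketch mirroring the suggestion above, not merged code):

```python
import re

def sanitize_name(name: str, max_len: int = 42) -> str:
    """Hardened variant: lowercase, map separators to '-', drop anything
    outside [a-z0-9-], collapse '-' runs, trim edges, fall back to 'model'."""
    name = name.lower().replace("/", "-").replace("_", "-").replace(".", "-")
    name = re.sub(r"[^a-z0-9-]", "-", name)
    name = re.sub(r"-{2,}", "-", name).strip("-")
    name = name[:max_len].rstrip("-")
    return name or "model"

print(sanitize_name("meta-llama/Llama-3.3-70B_FP8"))  # meta-llama-llama-3-3-70b-fp8
print(sanitize_name("__weird..name!!v2"))             # weird-name-v2
```

Every output matches the RFC 1123 label shape K8s expects: starts and ends alphanumeric, only lowercase letters, digits, and hyphens in between.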
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/scenarios/config.py` around lines 69 - 81, sanitize_name
currently only replaces "/", "_", and "." and then truncates, which can leave
trailing "-" or other invalid chars; update sanitize_name to: normalize all
characters to lowercase, replace any character not in [a-z0-9-] with "-",
collapse multiple "-" into a single "-", truncate to max_len, then strip
leading/trailing "-" and ensure the result starts/ends with an alphanumeric
character (if stripping yields empty, return a safe fallback like "resource");
apply these changes inside the sanitize_name function to guarantee produced
names conform to K8s resource name rules.

Comment on lines +56 to +58
```python
# Collect FORGE_* env vars
env_vars = {k: v for k, v in os.environ.items() if k.startswith("FORGE_")}
```


⚠️ Potential issue | 🟠 Major

Don't dump every FORGE_* value into metadata artifacts.

This snapshots all FORGE_* environment variables and writes them verbatim to _meta/metadata.yaml. That becomes a secret/PII leak the moment any token, password, key, or kubeconfig-style path is exposed under that prefix, so this should be allowlisted or redacted before writing.

Also applies to: 126-135

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/context.py` around lines 56 - 58, The code currently
captures all FORGE_* env vars into env_vars and writes them to metadata, which
can leak secrets; modify the collection so you either use an explicit allowlist
of safe FORGE_* keys (e.g., FORGE_ENV, FORGE_PROJECT) or sanitize values by
redacting/masking any key that matches sensitive patterns (TOKEN, SECRET,
PASSWORD, KEY, KUBE, CRED, PRIVATE) before assigning to env_vars; update the
creation sites that set env_vars (the dict comprehension assigning env_vars and
the similar logic around lines 126-135) to use the sanitized/allowlisted dict so
only safe entries or redacted values are written to _meta/metadata.yaml.
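A redaction pass along these lines can be sketched as follows (the sensitive-key patterns come from the comment above; the helper name is hypothetical):

```python
import os
import re

# Key-name patterns that should never be written verbatim to artifacts.
SENSITIVE = re.compile(r"TOKEN|SECRET|PASSWORD|KEY|KUBE|CRED|PRIVATE", re.IGNORECASE)

def collect_forge_env(environ=os.environ) -> dict[str, str]:
    """Snapshot FORGE_* vars, masking any key that looks sensitive."""
    return {
        k: ("***REDACTED***" if SENSITIVE.search(k) else v)
        for k, v in environ.items()
        if k.startswith("FORGE_")
    }

env = {"FORGE_PROJECT": "rhaiis", "FORGE_HF_TOKEN": "hf_abc", "PATH": "/usr/bin"}
print(collect_forge_env(env))
```

Non-FORGE keys are skipped entirely, and any FORGE_* key matching a sensitive pattern is masked before it can reach _meta/metadata.yaml.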

Comment on lines +54 to +59
```python
step_start = time.monotonic()
try:
    result = step.execute(ctx)
    result.duration_seconds = time.monotonic() - step_start
    result.start_time = datetime.now(timezone.utc)
    step_results[step_name] = result
```

⚠️ Potential issue | 🟠 Major

Bug: start_time is recorded after step execution completes.

Line 58 captures datetime.now() after step.execute() returns, which means result.start_time actually reflects the step's end time, not its start time. The timestamp should be captured before execution begins.

🐛 Proposed fix

```diff
             step_start = time.monotonic()
+            step_start_time = datetime.now(timezone.utc)
             try:
                 result = step.execute(ctx)
                 result.duration_seconds = time.monotonic() - step_start
-                result.start_time = datetime.now(timezone.utc)
+                result.start_time = step_start_time
                 step_results[step_name] = result
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/workflow/executor.py` around lines 54 - 59, The code records
result.start_time after step.execute(), so start_time reflects the end time;
move the timestamp capture to before execution by setting result.start_time =
datetime.now(timezone.utc) (or a local variable) immediately after computing
step_start and before calling step.execute(ctx), then compute
result.duration_seconds = time.monotonic() - step_start after execute and
continue storing result in step_results[step_name]; update references to
step_start/result.start_time accordingly in executor.py around the step.execute
call.
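The intended pattern - monotonic clock for the duration, wall clock captured before execution for start_time - can be shown in isolation (hypothetical run_step helper, not the executor itself):

```python
import time
from datetime import datetime, timezone

def run_step(step, ctx):
    """Capture wall-clock start *before* execute; use monotonic for duration."""
    start_time = datetime.now(timezone.utc)
    start_mono = time.monotonic()
    result = step(ctx)
    result["start_time"] = start_time
    result["duration_seconds"] = time.monotonic() - start_mono
    return result

res = run_step(lambda ctx: time.sleep(0.05) or {}, ctx=None)
print(res["duration_seconds"] >= 0.05)
print(res["start_time"] <= datetime.now(timezone.utc))
```

Because start_time is taken before the step runs, it now genuinely marks when the step began rather than when it finished.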

Comment on lines +170 to +173
```yaml
modelLoadingTimeoutMillis: 300000
imagePullSecrets:
- name: npalaska-image-pull
containers:
```

⚠️ Potential issue | 🟠 Major

Hardcoded imagePullSecrets name appears to be a personal/dev credential.

The secret name npalaska-image-pull looks like a personal or development credential. This should likely be configurable or use a standard naming convention.

♻️ Suggested fix to make configurable

Add a parameter to __init__:

```diff
 def __init__(
     self,
     model: str,
     deployment_name: str,
     vllm_image: str,
     runtime_args: dict[str, Any],
     namespace: str = "forge",
+    image_pull_secret: str | None = None,
     # ... other params
 ):
+    self.image_pull_secret = image_pull_secret
```

Then in _generate_kserve_yaml:

```diff
+  image_pull_secrets = ""
+  if self.image_pull_secret:
+      image_pull_secrets = f"""
+  imagePullSecrets:
+  - name: {self.image_pull_secret}"""
+
   return f"""---
 apiVersion: serving.kserve.io/v1alpha1
 kind: ServingRuntime
 # ...
 spec:
   builtInAdapter:
     modelLoadingTimeoutMillis: 300000
-  imagePullSecrets:
-  - name: npalaska-image-pull
+{image_pull_secrets}
   containers:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/deploy.py` around lines 170 - 173, The
hardcoded imagePullSecrets name `npalaska-image-pull` should be made
configurable: add a new parameter (e.g., image_pull_secret or
image_pull_secret_name) to the class __init__ with a sensible default or None,
store it on self, and update _generate_kserve_yaml to render the
imagePullSecrets entry using self.image_pull_secret (skipping or omitting the
block when None). Also update any callers/instantiations to pass the secret
where needed and add a clear default value or validation error if the secret is
required.

Comment on lines +21 to +44
```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install NFD operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    subscription_yaml = """
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "nfd-subscription.yaml"
    yaml_path.write_text(subscription_yaml)

    # Create namespace first
    self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
    self._run_oc(["apply", "-f", str(yaml_path)])

    return StepResult.ok("NFD operator subscription created")
```

⚠️ Potential issue | 🔴 Critical

Missing step_dir.mkdir() before writing YAML file will cause FileNotFoundError.

The yaml_path.write_text() on line 38 will fail because step_dir is never created. Other steps in this PR (e.g., guidellm.py, deploy.py) call step_dir.mkdir(parents=True, exist_ok=True) before writing files.

🐛 Proposed fix

```diff
 def execute(self, ctx: "WorkflowContext") -> StepResult:
     """Install NFD operator."""
     step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+    step_dir.mkdir(parents=True, exist_ok=True)

     subscription_yaml = """
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 21 - 44, The
execute method creates step_dir but never ensures it exists, so
yaml_path.write_text(...) will raise FileNotFoundError; before writing the file,
call step_dir.mkdir(parents=True, exist_ok=True) (same pattern used in
guidellm.py/deploy.py) to create the directory, then proceed to write the
subscription_yaml and run the oc commands; update the execute function to create
the directory right after computing step_dir and before yaml_path.write_text.

Comment on lines +40 to +44
```python
# Create namespace first
self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
self._run_oc(["apply", "-f", str(yaml_path)])

return StepResult.ok("NFD operator subscription created")
```

⚠️ Potential issue | 🟠 Major

Return value of _run_oc is silently ignored.

The _run_oc calls on lines 41-42 return bool indicating success/failure, but the result is not checked. If namespace creation or YAML apply fails, the step still reports success. Consider checking the return value or at minimum logging the outcome.

🛡️ Proposed fix to check return values

```diff
     # Create namespace first
-    self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
-    self._run_oc(["apply", "-f", str(yaml_path)])
+    self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
+    # Apply subscription - use pipe pattern: dry-run | apply
+    result = subprocess.run(
+        ["oc", "apply", "-f", str(yaml_path)],
+        capture_output=True,
+        text=True,
+        timeout=60,
+    )
+    if result.returncode != 0:
+        return StepResult.fail(f"Failed to install NFD operator: {result.stderr}")

     return StepResult.ok("NFD operator subscription created")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 40 - 44, The two
calls to _run_oc (creating namespace and applying the YAML) return bool but are
ignored so StepResult.ok("NFD operator subscription created") is returned even
on failure; update the code in the method that contains these calls to capture
each _run_oc result (the call creating "namespace openshift-nfd" and the call
applying str(yaml_path)), check their boolean values, log failures via the
existing logger or processLogger, and return StepResult.fail(...) (or an
appropriate error StepResult) if either call returns False; only return
StepResult.ok(...) when both _run_oc calls succeed.

Comment on lines +67 to +84
```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install GPU operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    subscription_yaml = """
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: v24.6
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "gpu-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
```

⚠️ Potential issue | 🔴 Critical

Same issue: Missing step_dir.mkdir() in InstallGPUOperatorStep.

Apply the same fix as for InstallNFDOperatorStep.

🐛 Proposed fix

```diff
 def execute(self, ctx: "WorkflowContext") -> StepResult:
     """Install GPU operator."""
     step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+    step_dir.mkdir(parents=True, exist_ok=True)

     subscription_yaml = """
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 67 - 84, The
InstallGPUOperatorStep.execute method creates step_dir but never ensures the
directory exists before writing the GPU subscription file; modify
InstallGPUOperatorStep.execute to create the directory (use
step_dir.mkdir(parents=True, exist_ok=True)) before calling
yaml_path.write_text(subscription_yaml) so the file write cannot fail if the
directory is missing.

Comment on lines +111 to +131
```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install RHOAI operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    # Determine channel from version
    channel = f"stable-{self.version}"

    subscription_yaml = f"""
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: {channel}
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "rhoai-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
```

⚠️ Potential issue | 🔴 Critical

Same issue: Missing step_dir.mkdir() in InstallRHOAIOperatorStep.

Apply the same fix as for the other operator steps.

🐛 Proposed fix

```diff
 def execute(self, ctx: "WorkflowContext") -> StepResult:
     """Install RHOAI operator."""
     step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+    step_dir.mkdir(parents=True, exist_ok=True)

     # Determine channel from version
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/workflows/steps/operators.py` around lines 111 - 131, The
execute method of InstallRHOAIOperatorStep is writing files into step_dir
without ensuring the directory exists; update the execute method (the function
named execute in the InstallRHOAIOperatorStep class) to create the step_dir
before writing the YAML (e.g., call step_dir.mkdir(parents=True, exist_ok=True)
immediately after computing step_dir) so yaml_path.write_text(...) won't fail.

Contributor

@kpouget kpouget left a comment

nice,
I put some very preliminary comments,
we'll need to discuss it further to understand how all of that works!

we'll need to find the way to integrate the DSL toolbox in that, if that makes sense

Comment on lines +499 to +502
```python
@click.command(context_settings=dict(
    ignore_unknown_options=True,
    allow_interspersed_args=False,
))
```
Contributor

this file should stay untouched

Comment on lines +2 to +3
```yaml
# These define benchmark parameters - can be changed WITHOUT restarting vLLM
# Only GuideLLM needs to restart with new workload config
```
Contributor

this should live in the rhaiis project, projects/rhaiis/orchestration

Comment on lines +1 to +3
```yaml
# Global Defaults & Accelerator-Specific Settings
# These are merged with model configs at runtime using inheritance:
# defaults → accelerator → model → model.accelerator_overrides → scenario
```
Contributor

this should live in the rhaiis project, projects/rhaiis/orchestration

Comment thread config/rhaiis/models.yaml
Comment on lines +1 to +3
```yaml
# Model Registry
# Models only specify what's DIFFERENT from defaults.yaml
# Accelerator-specific settings go in accelerator_overrides section
```
Contributor

this should live in the rhaiis project, projects/rhaiis/orchestration

@@ -0,0 +1,14 @@
"""Shared workflow steps for all projects.
Contributor

this should live in the rhaiis project; we'll rediscuss how to turn it into a toolbox (projects/rhaiis/toolbox)
or a guidellm library project 🤔

@@ -0,0 +1,250 @@
"""Artifact collection step - shared by all projects."""
Contributor

this should live in the rhaiis project; we'll rediscuss how to turn it into a toolbox (projects/rhaiis/toolbox)
or a guidellm library project 🤔

@@ -0,0 +1,343 @@
"""GuideLLM benchmark step - runs as a pod on the cluster."""
Contributor

this should live in the rhaiis project; we'll rediscuss how to turn it into a toolbox (projects/rhaiis/toolbox)
or a guidellm library project 🤔

Comment thread projects/core/utils/oc.py
@@ -0,0 +1,529 @@
"""OpenShift CLI wrapper with built-in retry for transient failures.
Contributor

I think this can belong to the DSL directory, to be rediscussed

@@ -0,0 +1,129 @@
"""Base workflow class with step registration."""
Contributor

to be discussed in a meeting so that we understand what's this workflow exactly



```python
@ci.command()
@click.option("--dry-run", is_flag=True, help="Print what would be done without executing")
```
Contributor

I think you can define that in the def ci() function, so that all the entrypoints inherit it
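One way to do that - an assumed sketch of how the ci group could be wired, not the actual code - is to declare the flag on the group and pass it down via ctx.obj:

```python
import click
from click.testing import CliRunner

@click.group()
@click.option("--dry-run", is_flag=True, help="Print what would be done without executing")
@click.pass_context
def ci(ctx, dry_run):
    # Stash the flag on the context so every subcommand inherits it.
    ctx.obj = {"dry_run": dry_run}

@ci.command()
@click.pass_context
def run(ctx):
    click.echo(f"dry_run={ctx.obj['dry_run']}")

print(CliRunner().invoke(ci, ["--dry-run", "run"]).output, end="")
```

Each subcommand then reads the shared flag from ctx.obj instead of redeclaring --dry-run on every entrypoint.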

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/rhaiis/IMPLEMENTATION.md`:
- Line 7: Add explicit language identifiers to the fenced code blocks that are
missing them (lint MD040) — locate the triple-backtick blocks around the "CLI/CI
Entry Points" and "config/" sections and the other similar fences (the blocks
containing short directory or CLI snippets) and change ``` to fenced blocks like
```text, ```bash, ```yaml, or ```python as appropriate for the content so
markdownlint passes; ensure each fenced block (e.g., the block showing CLI/CI
commands and the block showing config directory listings) gets a suitable
language tag.
- Line 471: Replace the user-specific line "cd /Users/memehta/workspace/forge"
with a portable repo-relative instruction; locate that exact string in
IMPLEMENTATION.md and change it to a generic direction to change into the
repository root (for example instruct contributors to "cd" into the project
directory or to use a git/top-level discovery method) so the guidance works for
all environments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0742a73-09c0-4906-8120-102b37d2e724

📥 Commits

Reviewing files that changed from the base of the PR and between a822096 and 007ee06.

📒 Files selected for processing (1)
  • projects/rhaiis/IMPLEMENTATION.md


## Architecture


⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks to satisfy lint.

These fences are missing language tags (MD040). Please annotate them (text, bash, yaml, python, etc.) so markdownlint passes consistently.

Proposed fix pattern

````diff
-```
+```text
 CLI/CI Entry Points
 ...
-```
+```

-```
+```text
 config/
 ...
-```
+```
````

Also applies to: 59-59, 70-70, 118-118, 184-184, 212-212, 261-261, 376-376, 448-448

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/IMPLEMENTATION.md` at line 7, Add explicit language
identifiers to the fenced code blocks that are missing them (lint MD040) —
locate the triple-backtick blocks around the "CLI/CI Entry Points" and "config/"
sections and the other similar fences (the blocks containing short directory or
CLI snippets) and change ``` to fenced blocks like ```text, ```bash, ```yaml, or
```python as appropriate for the content so markdownlint passes; ensure each
fenced block (e.g., the block showing CLI/CI commands and the block showing
config directory listings) gets a suitable language tag.

```bash
# Activate venv and run from forge directory
source ~/test_foo/python3_virt/bin/activate
cd /Users/memehta/workspace/forge
```

⚠️ Potential issue | 🟡 Minor

Replace machine-specific absolute path with a portable instruction.

cd /Users/memehta/workspace/forge is user-specific and will not work for most contributors. Use a repo-relative placeholder.

Proposed fix

```diff
-cd /Users/memehta/workspace/forge
+cd /path/to/forge
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/rhaiis/IMPLEMENTATION.md` at line 471, Replace the user-specific
line "cd /Users/memehta/workspace/forge" with a portable repo-relative
instruction; locate that exact string in IMPLEMENTATION.md and change it to a
generic direction to change into the repository root (for example instruct
contributors to "cd" into the project directory or to use a git/top-level
discovery method) so the guidance works for all environments.

@MML-coder
Collaborator Author

Reopening from fork MML-coder/forge-1
