Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: …
The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
📝 Walkthrough

Added a declarative, YAML-driven RHAIIS benchmarking system: configuration registries, scenario generator, workflow engine and steps, Kubernetes/oc utilities, CLI entrypoints, orchestration runners, domain workflows/steps, and comprehensive unit tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI (projects/rhaiis/orchestration/cli.py)
    participant Orchestrator as test_rhaiis.py
    participant ConfigLoader as ConfigLoader
    participant Generator as ScenarioGenerator
    participant Workflow as BenchmarkWorkflow
    participant Executor as SequentialExecutor
    participant OC as OC Utility
    participant K8s as Kubernetes/OpenShift

    CLI->>Orchestrator: run_test(...)
    Orchestrator->>ConfigLoader: load_model/load_workload
    ConfigLoader-->>Orchestrator: Resolved configs
    Orchestrator->>Generator: expand / group scenarios
    Generator-->>Orchestrator: ExpandedScenario(s)
    Orchestrator->>Workflow: instantiate BenchmarkWorkflow
    Workflow->>Executor: execute()
    Executor->>OC: apply KServe manifest
    OC->>K8s: kubectl/oc apply
    K8s-->>OC: apply result
    Executor->>OC: poll inferenceservice (get)
    OC->>K8s: check Ready
    K8s-->>OC: status
    Executor->>OC: apply benchmark pod
    OC->>K8s: create pod, stream logs
    Executor->>OC: rsync /benchmark-results
    OC->>K8s: copy results
    Executor->>OC: collect artifacts & cleanup (finally steps)
    OC->>K8s: delete resources
    Executor-->>Workflow: WorkflowResult
    Workflow-->>CLI: exit code/result
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 14
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (15)
projects/core/utils/oc.py-204-209 (1)
204-209: ⚠️ Potential issue | 🟡 Minor
Rename `input` to avoid the Ruff A002 failure.
`input` shadows the Python builtin in all three signatures, and Ruff is already flagging it here. A small rename like `stdin_input` keeps the API clear and avoids a lint blocker.
Proposed rename

```diff
-        input: str | None = None,
+        stdin_input: str | None = None,
 ...
-            input=input,
+            input=stdin_input,
 ...
-    def run(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def run(self, *args: str, namespace: str | None = None, timeout: int | None = None, stdin_input: str | None = None) -> OCResult:
 ...
-        return self._run_with_retry(cmd, timeout, input)
+        return self._run_with_retry(cmd, timeout, stdin_input)
 ...
-    def apply(self, *args: str, namespace: str | None = None, timeout: int | None = None, input: str | None = None) -> OCResult:
+    def apply(self, *args: str, namespace: str | None = None, timeout: int | None = None, stdin_input: str | None = None) -> OCResult:
 ...
-        return self.run("apply", *args, namespace=namespace, timeout=timeout, input=input)
+        return self.run("apply", *args, namespace=namespace, timeout=timeout, stdin_input=stdin_input)
```

Also applies to: 300-300, 335-335
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `projects/core/utils/oc.py` around lines 204-209: the parameter name `input` shadows the Python builtin and triggers Ruff A002 in `OC._run_with_retry` and the two other method signatures; rename the parameter to `stdin_input` (update type hints from `input: str | None` to `stdin_input: str | None`) in the function definitions and replace all internal references to `input` with `stdin_input` (and update any places that call these methods to pass the renamed argument) so signatures, usages, and return behavior remain identical except for the parameter name.

projects/core/workflow/context.py-83-99 (1)
83-99: ⚠️ Potential issue | 🟡 Minor
Avoid mutating step numbering from a getter.
Line 95 increments unconditionally, so a step that asks for its artifact directory twice gets two different folders and shifts the numbering for every later step. Cache the directory per step, or have the executor advance the counter once before the step runs.
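A minimal sketch of the caching approach suggested here; the class and attribute names are assumed from the comment, not taken from the real implementation:

```python
from pathlib import Path

class WorkflowContext:
    """Sketch only: per-step cache makes the getter idempotent."""

    def __init__(self, artifact_dir: Path):
        self.artifact_dir = artifact_dir
        self.step_number = 0
        self._step_dirs: dict[str, Path] = {}

    def get_step_artifact_dir(self, step_name: str) -> Path:
        # Only the first request for a given step advances the counter;
        # repeated calls return the cached directory unchanged.
        if step_name not in self._step_dirs:
            self.step_number += 1
            self._step_dirs[step_name] = (
                self.artifact_dir / f"{self.step_number:03d}__{step_name}"
            )
        return self._step_dirs[step_name]

ctx = WorkflowContext(Path("/artifacts"))
first = ctx.get_step_artifact_dir("deploy")
second = ctx.get_step_artifact_dir("deploy")
assert first == second == Path("/artifacts/001__deploy")
assert ctx.get_step_artifact_dir("benchmark") == Path("/artifacts/002__benchmark")
```

Calling the getter twice for the same step now returns the same folder and leaves later steps' numbering intact.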
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `projects/core/workflow/context.py` around lines 83-99: the getter `get_step_artifact_dir` currently mutates state (increments `self.step_number` and sets `self.current_step_name`) every call; change it to be idempotent by caching per-step directories or only incrementing when a new step is first requested: add an internal mapping (e.g. `self._step_dirs: dict[str, Path]`) and in `get_step_artifact_dir` check if `step_name` exists, return the cached Path if present, otherwise increment `self.step_number` once, set `self.current_step_name`, create the new `step_dir` from `self.artifact_dir / f"{self.step_number:03d}__{step_name}"`, store it in `self._step_dirs[step_name]` and return it; this preserves existing names (`step_dir`) and avoids changing numbering on repeated calls.

projects/rhaiis/workflows/steps/deploy.py-504-506 (1)
504-506: ⚠️ Potential issue | 🟡 Minor
Remove unnecessary f-string prefix.
Static analysis correctly identified an f-string without placeholders.
🐛 Fix

```diff
 except subprocess.TimeoutExpired:
-    click.echo(f"  Health check timeout, retrying...")
+    click.echo("  Health check timeout, retrying...")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `projects/rhaiis/workflows/steps/deploy.py` around lines 504-506: in the `except subprocess.TimeoutExpired` block (around the health-check retry logic) remove the unnecessary f-string prefix on the `click.echo` call: replace `click.echo(f"  Health check timeout, retrying...")` with the plain string `click.echo("  Health check timeout, retrying...")`; also scan the surrounding exception handling in the same function for other f-strings that have no placeholders and convert them to normal string literals.

projects/core/steps/guidellm.py-158-179 (1)
158-179: ⚠️ Potential issue | 🟡 Minor
Potential command injection via `args` in shell command.
The `guidellm_cmd` is constructed by joining args and embedded directly in a shell script. If any runtime argument values contain shell metacharacters (e.g., `;`, `$`, backticks), they could be interpreted by the shell. Consider using JSON-escaped arguments or validating that values don't contain shell-special characters.
🛡️ Proposed fix using shlex.quote

```diff
+import shlex
+
 def _generate_pod_yaml(self, data: str) -> str:
     """Generate GuideLLM pod YAML."""
     # Build guidellm args
     args = [
         "--target", self.endpoint,
         "--model", self.model,
         # ...
     ]
     # ...
-    guidellm_cmd = f"python3 -m benchmark.main {' '.join(args)}"
+    guidellm_cmd = f"python3 -m benchmark.main {' '.join(shlex.quote(a) for a in args)}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `projects/core/steps/guidellm.py` around lines 158-179: the constructed shell command in `guidellm_cmd` is vulnerable because args are joined and interpolated into a shell script; to fix, escape each argument with `shlex.quote` (or equivalent) before joining or, better, avoid a shell script altogether by setting the Pod container's command to `["/bin/sh","-c"]` with a safely quoted string or by using the container args/command list to pass arguments directly; update the code that builds `guidellm_cmd` (and the Pod spec that embeds it) to either quote every element from args via `shlex.quote` or switch to a structured args array so no unescaped shell metacharacters from args can be interpreted.

projects/core/steps/artifacts.py-197-232 (1)
197-232: ⚠️ Potential issue | 🟡 Minor
Unused `errors` list is initialized but never populated.
The `errors` list on line 201 is never appended to, yet it's checked on line 225 and included in the result data on line 231. Either remove it or populate it when `_delete_resource` returns `False`.
🐛 Proposed fix to populate errors

```diff
 # Delete KServe resources first (they manage the underlying deployments)
 if self.use_kserve:
     if self._delete_resource("inferenceservice", self.deployment_name, ns_args):
         deleted.append(f"inferenceservice/{self.deployment_name}")
+    else:
+        errors.append(f"inferenceservice/{self.deployment_name}")
     if self._delete_resource("servingruntime", self.deployment_name, ns_args):
         deleted.append(f"servingruntime/{self.deployment_name}")
+    else:
+        errors.append(f"servingruntime/{self.deployment_name}")
 # Delete standard deployment (if not using KServe or as fallback)
 if self._delete_resource("deployment", self.deployment_name, ns_args):
     deleted.append(f"deployment/{self.deployment_name}")
+else:
+    errors.append(f"deployment/{self.deployment_name}")
```

Or alternatively, remove the `errors` list entirely if tracking deletion failures is not needed.
Verify each finding against the current code and only fix it if needed. In `projects/core/steps/artifacts.py` around lines 197-232: the `errors` list is created in `execute()` but never populated; update `execute` to append a descriptive error string to `errors` whenever `_delete_resource(resource_type, self.deployment_name, ns_args)` returns `False` (for the inferenceservice, servingruntime, deployment, service, and route attempts), e.g., append `f"{resource_type}/{self.deployment_name} failed"` so the existing message construction and `StepResult` `data["errors"]` are meaningful; alternatively, if you prefer not to track failures, remove the `errors` variable and its usage in the message and `StepResult` data (choose one approach and apply consistently across the inferenceservice/servingruntime/deployment/service/route deletion checks).

config/rhaiis/workloads.yaml-48-55 (1)
48-55: ⚠️ Potential issue | 🟡 Minor
Incomplete configuration: placeholder value needs replacement.
The `data` field contains a placeholder `"ADD ME"` which will likely cause GuideLLM to fail or produce meaningless results if this workload is selected. Either provide a valid data specification or disable this workload until it's properly configured.
Would you like me to suggest a data specification for heterogeneous traffic patterns, or should I open an issue to track this?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `config/rhaiis/workloads.yaml` around lines 48-55: replace the placeholder in the heterogeneous workload's `guidellm.data` field (in the "heterogeneous" block) with a valid data specification or temporarily disable the workload; specifically, update `guidellm.data` to point to a real dataset or pattern definition (for example a path, dataset name, or structured traffic spec) that matches the existing fields `rate_type`, `rates`, and `max_seconds`, or remove/disable the entire heterogeneous entry until a proper data value is provided so GuideLLM won't receive "ADD ME".

projects/core/workflow/executor.py-89-93 (1)
89-93: ⚠️ Potential issue | 🟡 Minor
Inconsistency: `start_time` not set for finally steps.
Normal steps have `result.start_time` assigned (albeit incorrectly), but finally steps don't set it at all. Apply the same fix here for consistency.
🔧 Proposed fix

```diff
+                step_start_time = datetime.now(timezone.utc)
                 step_start = time.monotonic()
                 try:
                     result = step.execute(ctx)
                     result.duration_seconds = time.monotonic() - step_start
+                    result.start_time = step_start_time
                     step_results[step_name] = result
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `projects/core/workflow/executor.py` around lines 89-93: the finally-step handling block currently records `result.duration_seconds` but never sets `result.start_time`; update the finally step execution to capture a start timestamp (e.g., `start_time = time.monotonic()`) immediately before calling `step.execute(ctx)`, assign that value to `result.start_time`, and then compute `result.duration_seconds = time.monotonic() - start_time`, mirroring how normal steps populate start_time/duration for consistency with `step_results[step_name]` and the existing result fields.

projects/rhaiis/IMPLEMENTATION.md-7-34 (1)
7-34: ⚠️ Potential issue | 🟡 Minor
Add language identifiers to fenced code blocks.
These blocks are missing fence languages, which can trigger markdown lint failures.
Example fix

````diff
-```
+```text
 CLI/CI Entry Points
@@
-```
+```
-```
+```text
 config/
@@
-```
+```
````

Also applies to: 59-78
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `projects/rhaiis/IMPLEMENTATION.md` around lines 7-34: the fenced code blocks that contain the ASCII diagrams (e.g., the block showing "CLI/CI Entry Points → test_rhaiis.py", "ConfigLoader", "BenchmarkWorkflow", and "WorkflowStep implementations") and the later similar block (around the config/ section) are missing language identifiers; add a language tag (use "text") to the opening triple-backtick for each fenced block so markdown linters recognize them as code blocks, and do the same for any other plain-text fenced blocks in the doc.

tests/rhaiis/test_ci.py-34-35 (1)
34-35: ⚠️ Potential issue | 🟡 Minor
Avoid fixed `/tmp/artifacts` in tests.
Using a shared absolute temp path can cause cross-test interference and portability issues. Prefer a per-test temporary directory fixture.
Safer test setup

```diff
 class TestCiTest:
 @@
     @pytest.fixture
-    def runner(self):
-        return CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"})
+    def runner(self, tmp_path):
+        artifact_dir = tmp_path / "artifacts"
+        return CliRunner(env={"FORGE_ARTIFACT_DIR": str(artifact_dir)})
 @@
         result = runner.invoke(
             ci,
             ["test", "--dry-run"],
-            env={"FORGE_MODEL": "test/model", "FORGE_ARTIFACT_DIR": "/tmp/artifacts"}
+            env={"FORGE_MODEL": "test/model"}
         )
 @@
                 "FORGE_MODEL": "test/model",
                 "FORGE_WORKLOADS": "balanced,heterogeneous",
-                "FORGE_ARTIFACT_DIR": "/tmp/artifacts",
             }
         )
```

Also applies to: 40-41, 53-54
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/rhaiis/test_ci.py` around lines 34-35: replace the hardcoded "/tmp/artifacts" used when creating CliRunner with a per-test temporary directory (use pytest's `tmp_path` or `tmp_path_factory`) and set FORGE_ARTIFACT_DIR to that temp path (e.g., `str(tmp_path / "artifacts")`); update all occurrences that call `CliRunner(env={"FORGE_ARTIFACT_DIR": "/tmp/artifacts"})` (including the other instances mentioned at lines 40-41 and 53-54) so each test gets an isolated directory and remove reliance on a shared absolute path.

tests/rhaiis/test_ci.py-58-58 (1)
58-58: ⚠️ Potential issue | 🟡 Minor
Assert both workloads, not either.
This test currently passes if only one workload is rendered, which can hide parsing/formatting regressions.
Tighten expectation

```diff
-        assert "balanced" in result.output or "heterogeneous" in result.output
+        assert "balanced" in result.output
+        assert "heterogeneous" in result.output
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/rhaiis/test_ci.py` at line 58: the test currently asserts that either "balanced" or "heterogeneous" appears in `result.output`, which is too permissive; change the assertion to require both strings be present (e.g., replace the single-line check using "or" with two assertions or a combined condition using "and") so that `result.output` contains both "balanced" and "heterogeneous".

tests/core/scenarios/test_generator.py-226-230 (1)
226-230: ⚠️ Potential issue | 🟡 Minor
Same temporary file cleanup issue as in `sample_config_path`.
Apply the same fix pattern here for the `explicit_runs_config` fixture.
explicit_runs_configfixture.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/core/scenarios/test_generator.py` around lines 226-230: the `explicit_runs_config` fixture uses `tempfile.NamedTemporaryFile(..., delete=False)` and yields `Path(f.name)` without removing the temp file; update the `explicit_runs_config` fixture to mirror the `sample_config_path` fix by creating and writing the temp YAML inside a try/finally (or context manager) and then removing the file in the finally block (e.g., call `os.remove` on `Path(f.name)`) after the yield so the temporary file is always cleaned up; keep the existing `tempfile.NamedTemporaryFile`, `yaml.safe_dump` and the yielded `Path(f.name)` but add the cleanup logic surrounding the yield.

projects/rhaiis/orchestration/cli.py-48-59 (1)
48-59: ⚠️ Potential issue | 🟡 Minor
Inconsistent dry-run behavior: metadata is written before early return.
In `prepare()`, `ctx.write_metadata()` is called before the `dry_run` check, meaning even in dry-run mode, metadata files will be created. The same issue exists in `cleanup()`. Additionally, `test()` doesn't write metadata at all, which is inconsistent with the other commands.
Proposed fix for prepare()
```diff
 def prepare(rhoai_version: str, dry_run: bool):
     """Install operators (RHOAI, NFD, GPU) on OpenShift."""
     ctx = WorkflowContext.from_environment()
-    ctx.write_metadata({"command": "prepare", "rhoai_version": rhoai_version})
     if dry_run:
         click.echo(f"[DRY-RUN] Would install RHOAI {rhoai_version}")
         click.echo(f"[DRY-RUN] Artifacts would be written to: {ctx.artifact_dir}")
         return
+    ctx.write_metadata({"command": "prepare", "rhoai_version": rhoai_version})
     exit_code = test_rhaiis.run_prepare(ctx, rhoai_version)
     sys.exit(exit_code)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `projects/rhaiis/orchestration/cli.py` around lines 48-59: the metadata is being written even during dry runs and is missing for `test()`; update `prepare()`, `cleanup()`, and `test()` to be consistent: move the `ctx.write_metadata({"command": ...})` calls in `prepare()` and `cleanup()` to after the `dry_run` check (so dry-run returns before any metadata/file writes), and add a similar `ctx.write_metadata({"command": "test", "rhoai_version": rhoai_version})` call in `test()` (or the equivalent metadata payload used elsewhere) so all commands record metadata only for real runs; refer to the prepare, cleanup, and test functions and the `WorkflowContext.write_metadata` / `ctx.write_metadata` symbol names to locate and adjust the code.

projects/rhaiis/orchestration/ci.py-57-92 (1)
57-92: ⚠️ Potential issue | 🟡 Minor
Inconsistent metadata handling and missing model validation in the `test` command.
The `test` command differs from `prepare` and `cleanup` in two ways:

- It does not call `workflow_ctx.write_metadata()` to record command parameters
- It does not validate that `model` is provided (unlike `cli.py`, which validates when not in dry-run)

If `FORGE_MODEL` is unset, `model` will be `None` and passed directly to `run_test`, which may cause confusing errors downstream.
Proposed fix
```diff
 @ci.command()
 @click.option("--dry-run", is_flag=True, help="Print what would be done without executing")
 @click.pass_context
 def test(ctx, dry_run: bool):
     """Test phase - Run vLLM benchmark.

     Configuration from config/ directory.

     FOURNOS sets:
       FORGE_MODEL - Model key (e.g., qwen-0.6b)
       FORGE_WORKLOADS - Comma-separated workloads (optional)
     """
-    workflow_ctx = WorkflowContext.from_environment()
-
     # Key parameters from FOURNOS
     model = os.environ.get("FORGE_MODEL")
     workloads_str = os.environ.get("FORGE_WORKLOADS")

     # Parse workloads
     workloads = None
     if workloads_str:
         workloads = [w.strip() for w in workloads_str.split(",")]

+    if dry_run:
+        click.echo("[DRY-RUN] Test phase")
+        click.echo(f"[DRY-RUN] Model: {model}")
+        if workloads:
+            click.echo(f"[DRY-RUN] Workloads: {workloads}")
+        return
+
+    if not model:
+        click.echo("Error: FORGE_MODEL environment variable must be set", err=True)
+        sys.exit(1)
+
+    workflow_ctx = WorkflowContext.from_environment()
+    workflow_ctx.write_metadata({
+        "command": "test",
+        "model": model,
+        "workloads": workloads,
+    })
+
     # Log
     click.echo("RHAIIS CI Test")
     click.echo(f"  Model: {model}")
     if workloads:
         click.echo(f"  Workloads: {workloads}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `projects/rhaiis/orchestration/ci.py` around lines 57-92: the `test` command handler must mirror prepare/cleanup by validating the model and recording metadata: if not dry_run, ensure the FORGE_MODEL-derived model is present and abort/exit with a clear message when missing, and call `workflow_ctx.write_metadata(...)` to persist the command parameters (e.g., command name, model, workloads, config_dir, dry_run) before invoking `test_rhaiis.run_test`; update the `test` function to perform the model check and to call `WorkflowContext.write_metadata()` on workflow_ctx prior to calling `test_rhaiis.run_test` so metadata handling is consistent with the other commands.

tests/core/scenarios/test_generator.py-60-64 (1)
60-64: ⚠️ Potential issue | 🟡 Minor
Temporary file is not cleaned up after test.
`NamedTemporaryFile(delete=False)` creates a file that persists after the context manager exits. The yielded path should be cleaned up after the test completes to avoid leaving orphan files.
Proposed fix

```diff
+import os
+
 @pytest.fixture
 def sample_config_path(self):
     """Create a sample scenarios.yaml file using new format."""
     config = {
         # ... config dict ...
     }
     with tempfile.NamedTemporaryFile(
         mode="w", suffix=".yaml", delete=False
     ) as f:
         yaml.safe_dump(config, f)
-    yield Path(f.name)
+    path = Path(f.name)
+    try:
+        yield path
+    finally:
+        os.unlink(path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/core/scenarios/test_generator.py` around lines 60-64: the temporary YAML file created with `tempfile.NamedTemporaryFile(delete=False)` in the test fixture is left on disk; after yielding `Path(f.name)` you must remove the file in the fixture teardown. Update the fixture around `tempfile.NamedTemporaryFile` / `yaml.safe_dump` to yield `Path(f.name)` and then call `Path(f.name).unlink()` (or `os.remove`) in the code after the yield to ensure the temporary file is cleaned up.

projects/core/scenarios/generator.py-612-612 (1)
612-612: ⚠️ Potential issue | 🟡 Minor
Useless if-else condition.
The ternary expression `1 if isinstance(model, ModelConfig) else 1` always evaluates to `1` regardless of the condition, making the check pointless.
🔧 Suggested fix

```diff
-    tp = run.get("tensor_parallel", 1 if isinstance(model, ModelConfig) else 1)
+    tp = run.get("tensor_parallel", 1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/scenarios/generator.py` at line 612, The ternary in the tp assignment is redundant; replace the expression tp = run.get("tensor_parallel", 1 if isinstance(model, ModelConfig) else 1) with a single default value by using run.get("tensor_parallel", 1) (keep references to the same variables run and model and remove the pointless isinstance(ModelConfig) check); if there was an intended different default for ModelConfig, clarify and implement that branch, otherwise simply use run.get("tensor_parallel", 1) in generator.py where tp is set.
🧹 Nitpick comments (21)
projects/core/workflow/step.py (1)
79-85: `WorkflowStep.name` is lowercased, not snake_case.
`WaitForReadyStep` currently becomes `waitforready`, not `wait_for_ready`, so the implementation doesn't match the contract in the comment/docstring. Either insert separators or document the actual behavior.
Small fix
```diff
+import re
 ...
-        # Convert to lowercase
-        return class_name.lower()
+        return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
```

🤖 Prompt for AI Agents
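To sanity-check the regex suggested in this fix, a small standalone sketch; the helper name is hypothetical, only the suffix-strip and regex mirror the suggestion:

```python
import re

def snake_step_name(class_name: str) -> str:
    """Sketch of the suggested fix: strip 'Step', CamelCase -> snake_case."""
    if class_name.endswith("Step"):
        class_name = class_name[: -len("Step")]
    # Insert "_" before every capital letter that is not at the start, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()

print(snake_step_name("WaitForReadyStep"))  # wait_for_ready
print(snake_step_name("DeployStep"))        # deploy
```

Note that consecutive capitals split per letter (e.g. an `NFD` acronym becomes `n_f_d`), which may or may not be desirable; document whichever behavior is chosen.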
Verify each finding against the current code and only fix it if needed. In `projects/core/workflow/step.py` around lines 79-85: the current `name` property in the `WorkflowStep` class returns the class name lowercased (e.g., WaitForReadyStep -> waitforready) instead of snake_case; update the implementation that computes `class_name` from `self.__class__.__name__` (and still strips the "Step" suffix) to convert CamelCase to snake_case by inserting underscores between word boundaries (or using a regex like `(?<!^)(?=[A-Z])` to split) and then lowercasing so WaitForReadyStep becomes wait_for_ready; keep the "Step" suffix removal logic and ensure the symbol to change is the `WorkflowStep.name` property that currently uses `class_name.endswith("Step")` and `class_name.lower()`.

projects/rhaiis/workflows/steps/deploy.py (2)
320-328: Importing `click` inside a method is unconventional. While functional, importing inside a method is unusual. Consider moving the import to the top of the file since it's used in multiple methods.
♻️ Move import to top of file

```diff
 """RHAIIS vLLM deployment steps using KServe (ServingRuntime + InferenceService)."""
+import click
 import logging
 import subprocess
 # ...

 class WaitForReadyStep(WorkflowStep):
     # ...
     def execute(self, ctx: "WorkflowContext") -> StepResult:
         """Wait for InferenceService to be ready."""
-        import click
-
         click.echo(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/workflows/steps/deploy.py` around lines 320 - 328, The import of click inside the execute method is unconventional; move the import to the module scope so it can be reused by other methods and avoids repeated local imports. Remove the inline "import click" from the execute(self, ctx: "WorkflowContext") -> StepResult method and add a top-level "import click" at the top of the file (near other imports) so symbols like execute and any other methods in this module use the shared module-level click import.
93-102: Namespace creation can be simplified. The dry-run command output is not piped to the apply command. The current pattern runs two separate commands where the first one's output is discarded. Consider simplifying to a single idempotent apply.
♻️ Simplified namespace creation

```diff
-        # Ensure namespace exists
-        subprocess.run(
-            ["oc", "create", "namespace", self.namespace, "--dry-run=client", "-o", "yaml"],
-            capture_output=True,
-        )
-        subprocess.run(
-            ["oc", "apply", "-f", "-"],
-            input=f"apiVersion: v1\nkind: Namespace\nmetadata:\n  name: {self.namespace}\n",
-            capture_output=True,
-            text=True,
-        )
+        # Ensure namespace exists (idempotent)
+        subprocess.run(
+            ["oc", "create", "namespace", self.namespace],
+            capture_output=True,  # Ignore "already exists" error
+        )
```

Or use a shell pipe if the dry-run output is actually needed:
```python
subprocess.run(
    "oc create namespace {ns} --dry-run=client -o yaml | oc apply -f -".format(ns=self.namespace),
    shell=True,
    capture_output=True,
)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `projects/rhaiis/workflows/steps/deploy.py` around lines 93-102: replace the two `subprocess.run` calls that run a dry-run and then a separate apply (where the dry-run output is discarded) with a single idempotent apply invocation: call `subprocess.run(["oc","apply","-f","-"], input=<namespace YAML using self.namespace>, text=True, capture_output=True)` so the namespace is created/updated atomically; alternatively, if you prefer the dry-run output, run a single shell command that pipes `oc create --dry-run=client -o yaml` into `oc apply -f -` via `subprocess.run(..., shell=True, capture_output=True)`. Ensure you remove the prior dry-run `subprocess.run` and only use one `subprocess.run` that uses `self.namespace` as the interpolated name.

tests/rhaiis/test_operators.py (2)
48-61: Test asserts the file exists but doesn't verify directory creation. Once the `step_dir.mkdir()` fix is applied to `operators.py`, this test should pass. However, consider adding an explicit assertion that the step_dir was created to make the test more robust.
Verify each finding against the current code and only fix it if needed. In `@tests/rhaiis/test_operators.py` around lines 48 - 61, The test test_creates_subscription_yaml currently asserts the YAML file exists but doesn't explicitly assert the step directory was created; update the test to include an assertion that step_dir (constructed from context.artifact_dir and f"{context.step_number:03d}__{context.current_step_name}") exists before reading the file, so it fails clearly if InstallNFDOperatorStep (or the directory creation in operators.py) didn't create the folder; keep using the same step_dir and yaml_file variables and assert step_dir.exists() (or is_dir()) prior to checking yaml_file.exists() and its content.
37-46: Patch target should be module-specific for test isolation. Patching `subprocess.run` globally may cause unintended side effects. Prefer patching at the module level where `subprocess` is imported.
♻️ Suggested fix
```diff
-    @patch("subprocess.run")
+    @patch("projects.rhaiis.workflows.steps.operators.subprocess.run")
     def test_execute_success(self, mock_run, context):
```

Apply the same change to all `@patch("subprocess.run")` decorators in this file.
Verify each finding against the current code and only fix it if needed. In `tests/rhaiis/test_operators.py` around lines 37-46: the test currently patches `subprocess.run` globally; update the `@patch` decorator to target `subprocess.run` in the module that defines InstallNFDOperatorStep (i.e., patch the subprocess imported into that module) so the mock is module-scoped and isolated; apply the same change for all `@patch("subprocess.run")` decorators in this test file (refer to test_execute_success and other tests that exercise InstallNFDOperatorStep) so they patch "<module_where_InstallNFDOperatorStep_is_defined>.subprocess.run" instead of the top-level "subprocess.run".

projects/core/steps/guidellm.py (1)
272-276: Progress logging condition may not trigger reliably. The condition `elapsed % 60 == 0` with a 10-second poll interval and `time.monotonic()` precision may cause missed prints (e.g., if elapsed is 59.9 or 60.1). Consider tracking the last print time instead.
♻️ Suggested fix
```diff
 def _wait_for_pod_completion(self, timeout: int) -> dict:
     start_time = time.monotonic()
     poll_interval = 10
+    last_progress_print = 0

     print(f"Waiting for GuideLLM benchmark to complete (timeout: {timeout}s)...")

     while time.monotonic() - start_time < timeout:
         # ... existing logic ...

-        # Still running, no marker yet
-        elapsed = int(time.monotonic() - start_time)
-        if elapsed % 60 == 0:  # Print every minute
-            print(f"  GuideLLM running... ({elapsed}s elapsed, phase={phase})")
+        # Still running, no marker yet - print progress every 60s
+        elapsed = int(time.monotonic() - start_time)
+        if elapsed - last_progress_print >= 60:
+            print(f"  GuideLLM running... ({elapsed}s elapsed, phase={phase})")
+            last_progress_print = elapsed
```

🤖 Prompt for AI Agents
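The threshold pattern from the suggested fix can be checked in isolation; the timestamps below are simulated, where the real loop would feed in `time.monotonic()` readings:

```python
def should_print(now: float, last_print: float, interval: float = 60.0) -> bool:
    # Fires once at least `interval` seconds have passed since the previous
    # print, however the 10s polls align (unlike `elapsed % 60 == 0`).
    return now - last_print >= interval

# Simulated poll timestamps from a 10s loop; exact multiples of 60 never occur.
last = 0.0
printed_at = []
for tick in [10.3, 20.1, 59.9, 70.2, 119.8, 130.4]:
    if should_print(tick, last):
        printed_at.append(tick)
        last = tick

print(printed_at)  # [70.2, 130.4]
```

With the modulo check, none of these ticks would have printed; the threshold version fires roughly once per minute regardless of drift.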
Verify each finding against the current code and only fix it if needed. In `projects/core/steps/guidellm.py` around lines 272-276: the progress logging uses `elapsed % 60 == 0`, which can miss prints due to polling granularity; replace that with a monotonic-based threshold by introducing a `last_print_time` (initialized to `start_time`) and, inside the loop where `elapsed = int(time.monotonic() - start_time)` is computed, check `if time.monotonic() - last_print_time >= 60:` then do the print using `elapsed` and `phase` and set `last_print_time = time.monotonic()` so prints occur reliably every 60 seconds regardless of timing alignment.

projects/core/workflow/executor.py (2)
82-82: Unused: `finally_errors` list is populated but never consumed.
Errors from finally steps are collected into `finally_errors` but the list is never included in the `WorkflowResult`, logged as a summary, or otherwise used. Consider either:

- Including finally errors in the result for caller visibility
- Logging a summary of finally failures
- Removing the collection if not needed

♻️ Option: Include in result or log summary
```diff
+        if finally_errors:
+            logger.warning(f"Finally block encountered {len(finally_errors)} error(s)")
+
         end_time = datetime.now(timezone.utc)
```

Or extend `WorkflowResult` to include `finally_errors` if callers need this information.

Also applies to: 98-99, 112-112
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/workflow/executor.py` at line 82, The finally_errors list defined in executor.py (variable finally_errors) is being populated but never used; update the code to either surface those errors to callers by adding a finally_errors field to WorkflowResult and populating it before returning, or at minimum log a concise summary of failures from finally_errors (e.g., count and messages) before returning/raising; if finally errors are truly unnecessary, remove the finally_errors collection and its appends (occuring around the finally step handling) to avoid dead state. Ensure changes reference the existing finally step handling and the WorkflowResult construction so callers receive the final information.
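The first option can be sketched in miniature; the `WorkflowResult` fields and the cleanup-runner below are illustrative assumptions, not the project's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowResult:
    """Illustrative result type; the real class will carry more fields."""
    success: bool
    finally_errors: list[str] = field(default_factory=list)


def run_finally_steps(steps) -> list[str]:
    """Run cleanup callables, collecting (not raising) their errors."""
    errors = []
    for step in steps:
        try:
            step()
        except Exception as exc:  # cleanup must not mask the main result
            errors.append(f"{getattr(step, '__name__', step)}: {exc}")
    return errors


def boom():
    raise RuntimeError("delete failed")


result = WorkflowResult(success=True, finally_errors=run_finally_steps([boom, lambda: None]))
print(len(result.finally_errors))  # 1
```

This keeps cleanup failures visible to callers without turning them into workflow failures.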
51-52: Unused variable: `step_artifact_dir` is fetched but never used.

The artifact directory is retrieved on lines 52 and 87 but not passed to the step or used otherwise. Either remove these lines or pass the directory to the step if it's needed.
♻️ Option 1: Remove unused code
```diff
-        # Get artifact directory for this step
-        step_artifact_dir = ctx.get_step_artifact_dir(step_name)
-
         step_start = time.monotonic()
```

♻️ Option 2: Pass to step if needed
If steps need access to their artifact directory, consider passing it as a parameter or storing it in context before execution.
Also applies to: 87-87
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/workflow/executor.py` around lines 51 - 52, The local variable step_artifact_dir (obtained via ctx.get_step_artifact_dir(step_name)) is never used; either remove the unused retrieval calls or wire the directory into step execution: remove the step_artifact_dir assignment at both sites if artifacts aren't needed, or pass it into the step invocation (e.g., add a parameter to the step execution call or set it on the ctx/step object before calling step.run/execute) so the symbol step_artifact_dir and the call ctx.get_step_artifact_dir(step_name) are actually consumed; update the step API (method name used for running the step) and any callers accordingly.

config/rhaiis/defaults.yaml (1)
19-19: Consider: `uvicorn-log-level: debug` may be verbose for production runs.

Debug-level logging can generate substantial output during benchmarks. If this is intentional for initial testing, consider adding a comment or parameterizing it for production use.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/rhaiis/defaults.yaml` at line 19, The YAML default sets uvicorn-log-level: debug which is noisy for production; change the default to a less verbose level (e.g., info) or make the uvicorn-log-level configurable via an environment-backed or CLI-overridable parameter and add an inline comment explaining debug is only for development/benchmarks; update any bootstrap/launcher code that reads the uvicorn-log-level key (search for uvicorn-log-level) to prefer the env/CLI value when present.

projects/rhaiis/workflows/cleanup.py (1)
15-40: Unused parameter: `remove_operators` is stored but never used.

The `remove_operators` parameter is documented and stored on the instance (line 31), but `define_steps()` doesn't conditionally add operator removal steps when it's `True`. This creates an API contract that isn't fulfilled.

Consider either:
- Adding conditional operator removal steps when `remove_operators=True`
- Adding a TODO/FIXME comment if this is planned for later
- Removing the parameter until it's implemented
📝 Option: Add a TODO comment to track the incomplete feature
```diff
 def define_steps(self):
     """Define cleanup steps."""
+    # TODO: Add operator removal steps when remove_operators=True
     self.add_step(
         CleanupNamespaceStep(
             namespace=self.namespace,
             delete_namespace=False,  # Keep namespace, delete contents
         )
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/workflows/cleanup.py` around lines 15 - 40, The class stores remove_operators but never uses it in define_steps; update define_steps to check self.remove_operators and, when True, add a dedicated operator-cleanup step (e.g., add RemoveOperatorsStep or CleanupOperatorsStep after the existing CleanupNamespaceStep) so operator removal is performed; if an operator-cleanup step class does not yet exist, either implement a small RemoveOperatorsStep and add it conditionally, or at minimum add a clear TODO/FIXME in define_steps referencing remove_operators to avoid the unused-parameter contract.

projects/core/scenarios/config.py (1)
100-103: Suffix stripping misses chained suffixes.

A model like `foo-instruct-dynamic` becomes `foo-instruct` today. Looping until no known suffix remains makes shortening more consistent.

Refinement
```diff
-    for suffix in ["-instruct", "-dynamic", "-chat"]:
-        if name.endswith(suffix):
-            name = name[: -len(suffix)]
+    changed = True
+    while changed:
+        changed = False
+        for suffix in ["-instruct", "-dynamic", "-chat"]:
+            if name.endswith(suffix):
+                name = name[: -len(suffix)]
+                changed = True
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/scenarios/config.py` around lines 100 - 103, The current suffix-stripping loop over ["-instruct","-dynamic","-chat"] only removes one suffix, so names like "foo-instruct-dynamic" become "foo-instruct"; modify the logic in projects/core/scenarios/config.py where the variable name is trimmed to repeatedly strip known suffixes until none match (e.g., wrap the existing for-loop in a while-loop or use a loop that breaks only when no suffix was removed), ensuring you reference the same suffix list and update the variable name in-place each removal.

tests/core/steps/test_artifacts.py (1)
50-52: Prefer structured call-arg assertions over string matching.

Checking `str(call_args_list)` is fragile. Asserting specific positional args/flags from `mock_run.call_args_list` will make these tests more deterministic.

Also applies to: 96-98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/steps/test_artifacts.py` around lines 50 - 52, Replace the fragile string-based assertion on mock_run.call_args_list with structured checks against each Call object's positional and keyword arguments: iterate mock_run.call_args_list and assert that at least one Call has "logs" present in the specific arg/flag you expect (e.g., check call.args[index] or call.kwargs['argname']), and do the same for the other occurrence around lines 96-98; use mock_run.call_args_list directly rather than str(...) so the test asserts the exact positional/keyword argument values.

tests/core/steps/test_guidellm.py (1)
71-73: Strengthen subprocess assertions in the success path.

The current check only verifies that at least one call happened. Adding expected call count (and optionally key command checks) will make this test catch flow regressions earlier.
Suggested test hardening
```diff
-    assert result.success
-    assert mock_run.called
+    assert result.success
+    assert mock_run.call_count == 5
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/steps/test_guidellm.py` around lines 71 - 73, The test currently only asserts result.success and that mock_run was called; strengthen it by asserting the exact expected number of subprocess invocations (e.g., assert mock_run.call_count == 1) and also verify the arguments of the call(s) (e.g., use mock_run.assert_any_call(...) or inspect mock_run.call_args_list to ensure the invoked command contains the expected executable/flags). Update the assertions around result and mock_run in tests/core/steps/test_guidellm.py (referencing result and mock_run) to check both call count and at least one expected command pattern to prevent regressions.

projects/rhaiis/IMPLEMENTATION.md (1)
282-285: Replace user-local paths with portable commands.

The current example uses machine-specific directories, which makes the docs harder to follow for others.
Portable docs snippet
```diff
-# Activate venv and run from forge directory
-source ~/test_foo/python3_virt/bin/activate
-cd /Users/memehta/workspace/forge
+# Activate venv and run from repository root
+source .venv/bin/activate
+cd <path-to-forge-repo>
 PYTHONPATH=. python -m pytest tests/ --ignore=tests/llm_d -v
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/IMPLEMENTATION.md` around lines 282 - 285, Replace the hardcoded activation and cd commands — specifically the lines referencing "source ~/test_foo/python3_virt/bin/activate" and "cd /Users/memehta/workspace/forge" — with portable instructions: instruct users to create and activate a virtualenv inside the project (e.g., an .venv in the repo) and to change directory to the repository root (recommend using a repository-root discovery command like git rev-parse --show-toplevel or "cd to project root"), and keep the pytest invocation as a relative, portable command (using PYTHONPATH=. and the existing pytest flags and --ignore path) so the snippet works across machines.

tests/rhaiis/test_steps.py (1)
21-31: Consider extracting duplicate fixtures to a conftest.py.

The `temp_artifact_dir` and `context` fixtures are duplicated across all three test classes (`TestDeployVLLMStep`, `TestWaitForReadyStep`, `TestCleanupNamespaceStep`). Extracting them to a `conftest.py` file would reduce duplication and improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/rhaiis/test_steps.py` around lines 21 - 31, Extract the duplicated pytest fixtures into a shared conftest.py: move the temp_artifact_dir fixture (the pytest.fixture that yields Path(tmpdir) from tempfile.TemporaryDirectory) and the context fixture (which calls WorkflowContext.from_environment(artifact_base=str(temp_artifact_dir)) and pre-increments via ctx.get_step_artifact_dir("deploy")) into conftest.py, remove their definitions from tests/rhaiis/test_steps.py, and ensure test modules import pytest normally so they automatically use the shared fixtures.

tests/core/scenarios/test_generator.py (1)
82-83: Consider using ASCII 'x' instead of Unicode '×' in comments.

The multiplication sign `×` (U+00D7) may cause issues in some editors or terminals. Using ASCII `x` is more portable.

Proposed fix
```diff
-    # 2 workloads × 1 routing × 2 TP = 4 scenarios
+    # 2 workloads x 1 routing x 2 TP = 4 scenarios
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/scenarios/test_generator.py` around lines 82 - 83, Replace the Unicode multiplication sign in the inline comment within test_generator.py (the comment string "# 2 workloads × 1 routing × 2 TP = 4 scenarios") with the ASCII letter "x" to avoid encoding/editor issues; update the comment in the same location (referencing the comment near the scenarios length assertion in the test_generator test function) to read "# 2 workloads x 1 routing x 2 TP = 4 scenarios".

projects/rhaiis/orchestration/test_rhaiis.py (3)
25-36: Add type hint for `ctx` parameter.

The `ctx` parameter at line 35 lacks a type annotation, unlike other function signatures in this file.

✏️ Suggested fix
```diff
 def _dry_run_test(
     model: str | None,
     workload: str | None,
     workloads: list[str] | None,
     config_loader,
     accelerator: str,
     vllm_image: str | None,
     tensor_parallel: int | None,
     max_requests: int | None,
     namespace: str | None,
-    ctx,
+    ctx: WorkflowContext,
 ) -> int:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/orchestration/test_rhaiis.py` around lines 25 - 36, The ctx parameter in _dry_run_test is missing a type annotation; add a matching type annotation (e.g., click.Context or the same context type used elsewhere in this file) to the ctx parameter, update the function signature for _dry_run_test accordingly, and add the required import (from click or typing import Any) if not already present so the annotation resolves.
385-387: Consider making the drain time configurable.

The 5-second drain time is hardcoded. For clusters with higher latency or heavier in-flight request loads, this may be insufficient. Consider making it configurable via defaults.yaml or a parameter.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/orchestration/test_rhaiis.py` around lines 385 - 387, Replace the hardcoded 5-second sleep in test_rhaiis.py (the block using idx, group_workloads, click.echo and time.sleep) with a configurable drain timeout: add a config key (e.g., drain_seconds) to defaults.yaml and read it in the test (falling back to a sensible default like 5), or add a CLI/fixture parameter that supplies the value; then use that variable in the click.echo message and time.sleep call so clusters with higher latency can increase the drain time without changing code.
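One way to wire this up can be sketched as follows; the `benchmark.drain_seconds` key and the `defaults` dict are hypothetical stand-ins for the real defaults.yaml schema:

```python
# Pretend this dict came from defaults.yaml via yaml.safe_load()
defaults = {"benchmark": {"drain_seconds": 5}}


def drain_between_groups(config: dict, fallback: int = 5) -> int:
    """Resolve the drain time from config, falling back to a safe default."""
    return int(config.get("benchmark", {}).get("drain_seconds", fallback))


drain = drain_between_groups(defaults)
print(f"Draining in-flight requests for {drain}s...")
# time.sleep(drain) would go here in the real runner
```

With this shape, clusters needing a longer drain only touch YAML, not code.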
278-278: Deployment name truncation may cause collisions for models with similar prefixes.

The deployment name is truncated to 42 characters after sanitization. For models with very similar long names, this could result in identical deployment names. Consider adding a hash suffix or documenting this limitation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/rhaiis/orchestration/test_rhaiis.py` at line 278, The deployment_name assignment (using hf_model_id) can collide after truncation; modify the logic that builds deployment_name so it appends a short deterministic hash suffix derived from hf_model_id (e.g., first 6 hex chars of SHA-1/SHA-256) after sanitization and truncation to keep the overall length within 42 characters; update the expression that sets deployment_name to compute sanitized_base = hf_model_id.split("/")[-1].lower().replace(".", "-").replace("_", "-") truncated to (42 - 1 - hash_len) characters, then append "-" + hash_suffix to produce the final deployment_name.

projects/core/scenarios/generator.py (2)
1-20: Docstring uses Unicode multiplication sign.

The docstring at line 11 uses `×` (Unicode multiplication sign) which triggers static analysis warning RUF002. While this is cosmetic and doesn't affect functionality, using ASCII `x` would be more portable.

✏️ Suggested fix
```diff
-4. Matrix expansion: model × workloads × tensor_parallel
+4. Matrix expansion: model x workloads x tensor_parallel
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/scenarios/generator.py` around lines 1 - 20, The docstring in projects/core/scenarios/generator.py uses a Unicode multiplication sign in the "Matrix expansion" example (the line showing "model × workloads × tensor_parallel"); replace the Unicode '×' characters with the ASCII letter 'x' so the example reads "model x workloads x tensor_parallel" to satisfy static analysis (RUF002) while keeping the docstring content unchanged.
104-112: Consider using `hashlib` for deterministic deployment IDs.

The `deployment_id` property uses `hash()` which is non-deterministic across Python runs (due to PYTHONHASHSEED randomization). This means the same configuration will produce different deployment IDs in different invocations. While currently used only for display, this could cause confusion when comparing outputs or logs across runs. Using `hashlib.md5()` or similar would provide consistent IDs, e.g. `override_hash = hashlib.md5(str(sorted(self.vllm_args_override.items())).encode()).hexdigest()[:4]`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/core/scenarios/generator.py` around lines 104 - 112, The deployment_id property uses Python's non-deterministic built-in hash for vllm_args_override; replace that with a deterministic hash using hashlib (e.g., hashlib.md5) by serializing the override items in a stable order (sort the items or use tuple(sorted(...))) and encoding to bytes before hashing, then derive override_hash from the first few hex digits (or as an int) and return f"{base}_wl{override_hash}" so deployment_id is stable across runs; update the logic in the deployment_id property (references: deployment_id, vllm_args_override, override_hash, base).
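A minimal demonstration of the deterministic alternative; `vllm_args_override` here is a stand-in dict, and sorting the items first keeps dict insertion order from affecting the ID:

```python
import hashlib


def override_suffix(vllm_args_override: dict) -> str:
    """Deterministic 4-hex-char suffix for a set of override args."""
    canonical = str(sorted(vllm_args_override.items()))  # stable across runs
    return hashlib.md5(canonical.encode()).hexdigest()[:4]


a = override_suffix({"max-num-seqs": 128, "gpu-memory-utilization": 0.9})
b = override_suffix({"gpu-memory-utilization": 0.9, "max-num-seqs": 128})
print(a == b)  # True: same args in any order yield the same suffix
```

Unlike the built-in `hash()`, this value is stable across interpreter invocations, so logs from different runs remain comparable.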
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8609a041-7915-493c-af3f-39712b66bee9
📒 Files selected for processing (51)
- config/rhaiis/defaults.yaml
- config/rhaiis/models.yaml
- config/rhaiis/workloads.yaml
- projects/core/ci_entrypoint/run_ci.py
- projects/core/scenarios/__init__.py
- projects/core/scenarios/config.py
- projects/core/scenarios/config_loader.py
- projects/core/scenarios/generator.py
- projects/core/steps/__init__.py
- projects/core/steps/artifacts.py
- projects/core/steps/guidellm.py
- projects/core/utils/__init__.py
- projects/core/utils/oc.py
- projects/core/workflow/__init__.py
- projects/core/workflow/context.py
- projects/core/workflow/executor.py
- projects/core/workflow/step.py
- projects/core/workflow/workflow.py
- projects/rhaiis/IMPLEMENTATION.md
- projects/rhaiis/LLM_D_EXTENSIBILITY_REPORT.md
- projects/rhaiis/__init__.py
- projects/rhaiis/orchestration/__init__.py
- projects/rhaiis/orchestration/ci.py
- projects/rhaiis/orchestration/cli.py
- projects/rhaiis/orchestration/test_rhaiis.py
- projects/rhaiis/workflows/__init__.py
- projects/rhaiis/workflows/benchmark.py
- projects/rhaiis/workflows/cleanup.py
- projects/rhaiis/workflows/prepare.py
- projects/rhaiis/workflows/steps/__init__.py
- projects/rhaiis/workflows/steps/cleanup.py
- projects/rhaiis/workflows/steps/deploy.py
- projects/rhaiis/workflows/steps/operators.py
- tests/__init__.py
- tests/core/__init__.py
- tests/core/scenarios/__init__.py
- tests/core/scenarios/test_config_loader.py
- tests/core/scenarios/test_generator.py
- tests/core/steps/__init__.py
- tests/core/steps/test_artifacts.py
- tests/core/steps/test_guidellm.py
- tests/core/utils/__init__.py
- tests/core/utils/test_oc.py
- tests/core/workflow/__init__.py
- tests/core/workflow/test_context.py
- tests/core/workflow/test_executor.py
- tests/rhaiis/__init__.py
- tests/rhaiis/test_ci.py
- tests/rhaiis/test_operators.py
- tests/rhaiis/test_steps.py
- tests/rhaiis/test_workflows.py
```yaml
ministral-14b-fp8:
  name: "Ministral-3-14B-Instruct-FP8"
  hf_model_id: "RedHatAI/Ministral-3-14B-Instruct-2512"
  vllm_args:
    tensor-parallel-size: 4
  supported_workloads: [balanced, short]
```
🧩 Analysis chain
🌐 Web query:
What is the correct HuggingFace model ID for RedHatAI Ministral-3-14B-Instruct FP8 quantized model?
💡 Result:
The correct Hugging Face model ID for the RedHatAI Ministral-3-14B-Instruct FP8 quantized model is RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic.
Citations:
- 1: https://huggingface.co/RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic
- 2: https://huggingface.co/RedHatAI/Ministral-3-14B-Instruct-2512
- 3: https://docs.redhat.com/es/documentation/red_hat_ai_inference_server/3.3/html/inference_serving_mistral_3_models/deploying-ministral-3-edge_mistral-3
Fix the HuggingFace model ID for ministral-14b-fp8 to use the FP8-quantized variant.
The hf_model_id is missing the -FP8-dynamic suffix. Change "RedHatAI/Ministral-3-14B-Instruct-2512" to "RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic" to load the correct quantized model.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/rhaiis/models.yaml` around lines 242 - 247, Update the hf_model_id
value for the ministral-14b-fp8 model to point to the FP8-quantized variant:
find the ministral-14b-fp8 entry and change the hf_model_id
"RedHatAI/Ministral-3-14B-Instruct-2512" to
"RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic" so the correct FP8 model is
loaded.
```diff
 @click.command(context_settings=dict(
     ignore_unknown_options=True,
     allow_interspersed_args=False,
 ))
 @click.argument('project', required=False)
 @click.argument('operation', required=False)
-@click.argument('args', nargs=-1)
+@click.argument('args', nargs=-1, type=click.UNPROCESSED)
```
🧩 Analysis chain
🌐 Web query:
In Click, when a command uses context_settings={"ignore_unknown_options": True, "allow_interspersed_args": False} with positional arguments followed by a variadic click.UNPROCESSED argument, are flags like --dry-run placed after the positional arguments still parsed as top-level options, or are they treated as plain arguments and forwarded?
💡 Result:
In Click, with context_settings={"ignore_unknown_options": True, "allow_interspersed_args": False} and positional arguments followed by a variadic click.UNPROCESSED argument, flags like --dry-run placed after the positional arguments are treated as plain arguments and forwarded in the UNPROCESSED tuple, not parsed as top-level options. This behavior occurs because allow_interspersed_args=False causes Click's parser to stop processing options upon encountering the first non-option (positional) argument. Any subsequent tokens resembling options (e.g., --dry-run) are not examined for option parsing; they are directly added to the argument list. With ignore_unknown_options=True, if any option-like tokens were processed before positionals, unknown ones would be ignored and forwarded, but due to allow_interspersed_args=False, tokens after positionals are never processed as potential options—they become plain arguments captured by the nargs=-1 UNPROCESSED argument. Official documentation confirms allow_interspersed_args=False prevents mixing options and arguments, stopping option processing at the first argument [3,4,6,19,23]. Examples and issues show that post-positional option-like strings end up in args when ignore_unknown_options=True and UNPROCESSED is used [6,7,9,10]. DeepWiki analysis of parser.py explicitly states: when False, parser stops option processing at first non-option [19].
Citations:
- 1: https://click.palletsprojects.com/en/8.0.x/arguments/
- 2: https://click.palletsprojects.com/en/stable/api/
- 3: https://click.palletsprojects.com/en/stable/advanced/
- 4: https://deepwiki.com/pallets/click/2.5-argument-parsing
- 5: https://github.com/pallets/click/blob/master/src/click/core.py
- 6: https://stackoverflow.com/questions/32944131/add-unspecified-options-to-cli-command-using-python-click
- 7: https://stackoverflow.com/questions/42457484/python-click-consume-options-as-arguments
- 8: Ignore option-like arguments after first positional argument pallets/click#785
🏁 Script executed:

```shell
# First, let's see the file content around the flagged lines
cat -n projects/core/ci_entrypoint/run_ci.py | sed -n '490,530p'
```

Repository: openshift-psap/forge
🏁 Script executed:

```shell
# Search for --dry-run handling in the file
rg "\-\-dry\-run" projects/core/ci_entrypoint/run_ci.py -B 2 -A 2
```
🏁 Script executed:

```shell
# Check git history for this file to see if it changed recently
cd projects/core/ci_entrypoint && git log --oneline -20 run_ci.py 2>/dev/null | head -20
```
🏁 Script executed:

```shell
# Search for comments around the command decorator that might explain the design
cat -n projects/core/ci_entrypoint/run_ci.py | sed -n '485,510p'
```
🏁 Script executed:

```shell
# Search for documentation or docstrings that explain flag usage
rg "dry.run|allow_interspersed" projects/core/ci_entrypoint/run_ci.py -B 3 -A 3
```
--dry-run flag cannot be used safely with positional arguments due to Click parsing configuration.
With allow_interspersed_args=False, Click stops parsing options once the first positional argument is encountered. This means run rhaiis test --dry-run will pass --dry-run as a plain argument to the variadic args instead of parsing it as the --dry-run flag, causing the execution prevention to fail and the command to run instead of dry-running.
The --dry-run flag will only work when placed before positional arguments (e.g., run --dry-run rhaiis test), which is not intuitive and creates a safety regression where users expect the flag to work at the end of the command line.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/core/ci_entrypoint/run_ci.py` around lines 499 - 505, The Click
command declaration in run_ci.py uses context_settings with
allow_interspersed_args=False which causes options like the global --dry-run
flag to be treated as positional arguments once any positional is seen; update
the decorator for the command that defines the CLI (the `@click.command`(...)
block that decorates the function handling 'project', 'operation', and 'args')
to set allow_interspersed_args=True (or remove that override) so Click will
continue to parse options interspersed with positional arguments and ensure
flags like --dry-run are recognized regardless of position.
```python
def sanitize_name(name: str, max_len: int = 42) -> str:
    """
    Sanitize name for K8s resource naming.

    Rules:
    - Lowercase
    - Replace / and _ with -
    - Remove dots
    - Truncate to max_len
    """
    return (
        name.lower().replace("/", "-").replace("_", "-").replace(".", "")
    )[:max_len]
```
sanitize_name can still produce invalid K8s resource names.
After truncation, names may end with -, and unsupported characters (beyond /, _, .) are not normalized. That can break resource creation at runtime.
Safer sanitizer

```diff
 def sanitize_name(name: str, max_len: int = 42) -> str:
 @@
-    return (
-        name.lower().replace("/", "-").replace("_", "-").replace(".", "")
-    )[:max_len]
+    name = name.lower().replace("/", "-").replace("_", "-").replace(".", "-")
+    name = re.sub(r"[^a-z0-9-]", "-", name)
+    name = re.sub(r"-{2,}", "-", name).strip("-")
+    name = name[:max_len].rstrip("-")
+    return name or "model"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/core/scenarios/config.py` around lines 69 - 81, sanitize_name
currently only replaces "/", "_", and "." and then truncates, which can leave
trailing "-" or other invalid chars; update sanitize_name to: normalize all
characters to lowercase, replace any character not in [a-z0-9-] with "-",
collapse multiple "-" into a single "-", truncate to max_len, then strip
leading/trailing "-" and ensure the result starts/ends with an alphanumeric
character (if stripping yields empty, return a safe fallback like "resource");
apply these changes inside the sanitize_name function to guarantee produced
names conform to K8s resource name rules.
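For reference, a self-contained version of the safer sanitizer; the `"model"` fallback for fully-stripped input is an assumption carried over from the suggestion, not a project convention:

```python
import re


def sanitize_name(name: str, max_len: int = 42) -> str:
    """Sanitize a string into a valid Kubernetes resource name."""
    name = name.lower().replace("/", "-").replace("_", "-").replace(".", "-")
    name = re.sub(r"[^a-z0-9-]", "-", name)        # map anything else to '-'
    name = re.sub(r"-{2,}", "-", name).strip("-")  # collapse runs, trim edges
    name = name[:max_len].rstrip("-")              # truncation can't leave a dash
    return name or "model"                         # never return an empty name


print(sanitize_name("RedHatAI/Llama-3.1_8B Instruct!"))  # redhatai-llama-3-1-8b-instruct
print(sanitize_name("???"))  # model
```

The trailing `rstrip("-")` after truncation matters because K8s names must end with an alphanumeric character.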
```python
# Collect FORGE_* env vars
env_vars = {k: v for k, v in os.environ.items() if k.startswith("FORGE_")}
```
Don't dump every FORGE_* value into metadata artifacts.
This snapshots all FORGE_* environment variables and writes them verbatim to _meta/metadata.yaml. That becomes a secret/PII leak the moment any token, password, key, or kubeconfig-style path is exposed under that prefix, so this should be allowlisted or redacted before writing.
Also applies to: 126-135
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/core/workflow/context.py` around lines 56 - 58, The code currently
captures all FORGE_* env vars into env_vars and writes them to metadata, which
can leak secrets; modify the collection so you either use an explicit allowlist
of safe FORGE_* keys (e.g., FORGE_ENV, FORGE_PROJECT) or sanitize values by
redacting/masking any key that matches sensitive patterns (TOKEN, SECRET,
PASSWORD, KEY, KUBE, CRED, PRIVATE) before assigning to env_vars; update the
creation sites that set env_vars (the dict comprehension assigning env_vars and
the similar logic around lines 126-135) to use the sanitized/allowlisted dict so
only safe entries or redacted values are written to _meta/metadata.yaml.
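A sketch of the redaction option; the `SENSITIVE` pattern list mirrors the patterns named in the prompt and is an assumption, not an exhaustive set:

```python
import re

SENSITIVE = re.compile(r"TOKEN|SECRET|PASSWORD|KEY|KUBE|CRED|PRIVATE", re.IGNORECASE)


def collect_forge_env(environ: dict) -> dict:
    """Snapshot FORGE_* vars, masking any key that looks secret-bearing."""
    return {
        k: ("***REDACTED***" if SENSITIVE.search(k) else v)
        for k, v in environ.items()
        if k.startswith("FORGE_")
    }


env = {"FORGE_PROJECT": "rhaiis", "FORGE_HF_TOKEN": "hf_abc123", "PATH": "/usr/bin"}
print(collect_forge_env(env))
```

The metadata still records that the variable was set (useful for debugging) without persisting its value.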
```python
step_start = time.monotonic()
try:
    result = step.execute(ctx)
    result.duration_seconds = time.monotonic() - step_start
    result.start_time = datetime.now(timezone.utc)
    step_results[step_name] = result
```
Bug: start_time is recorded after step execution completes.
Line 58 captures datetime.now() after step.execute() returns, which means result.start_time actually reflects the step's end time, not its start time. The timestamp should be captured before execution begins.
🐛 Proposed fix
```diff
 step_start = time.monotonic()
+step_start_time = datetime.now(timezone.utc)
 try:
     result = step.execute(ctx)
     result.duration_seconds = time.monotonic() - step_start
-    result.start_time = datetime.now(timezone.utc)
+    result.start_time = step_start_time
     step_results[step_name] = result
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
step_start = time.monotonic()
step_start_time = datetime.now(timezone.utc)
try:
    result = step.execute(ctx)
    result.duration_seconds = time.monotonic() - step_start
    result.start_time = step_start_time
    step_results[step_name] = result
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/core/workflow/executor.py` around lines 54 - 59, The code records
result.start_time after step.execute(), so start_time reflects the end time;
move the timestamp capture to before execution by setting result.start_time =
datetime.now(timezone.utc) (or a local variable) immediately after computing
step_start and before calling step.execute(ctx), then compute
result.duration_seconds = time.monotonic() - step_start after execute and
continue storing result in step_results[step_name]; update references to
step_start/result.start_time accordingly in executor.py around the step.execute
call.
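The corrected ordering generalizes to any step runner; `timed_call` below is a hypothetical helper illustrating the pattern (wall-clock timestamp captured before execution, monotonic clock reserved for duration math):

```python
import time
from datetime import datetime, timezone


def timed_call(fn):
    """Capture the start timestamp *before* running fn; measure duration monotonically."""
    start_time = datetime.now(timezone.utc)  # wall clock, taken up front
    t0 = time.monotonic()                    # monotonic clock for duration
    fn()
    return start_time, time.monotonic() - t0


started, duration = timed_call(lambda: time.sleep(0.05))
# 'started' now predates completion instead of equaling the end time
print(duration >= 0.05)  # True
```

Keeping the two clocks separate also avoids duration errors if the system clock is adjusted mid-step.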
```yaml
modelLoadingTimeoutMillis: 300000
imagePullSecrets:
  - name: npalaska-image-pull
containers:
```
Hardcoded imagePullSecrets name appears to be a personal/dev credential.
The secret name npalaska-image-pull looks like a personal or development credential. This should likely be configurable or use a standard naming convention.
♻️ Suggested fix to make configurable
Add a parameter to __init__:
```diff
 def __init__(
     self,
     model: str,
     deployment_name: str,
     vllm_image: str,
     runtime_args: dict[str, Any],
     namespace: str = "forge",
+    image_pull_secret: str | None = None,
     # ... other params
 ):
+    self.image_pull_secret = image_pull_secret
```

Then in _generate_kserve_yaml:
```diff
+        image_pull_secrets = ""
+        if self.image_pull_secret:
+            image_pull_secrets = f"""
+  imagePullSecrets:
+    - name: {self.image_pull_secret}"""
+
         return f"""---
 apiVersion: serving.kserve.io/v1alpha1
 kind: ServingRuntime
 # ...
 spec:
   builtInAdapter:
     modelLoadingTimeoutMillis: 300000
-  imagePullSecrets:
-    - name: npalaska-image-pull
+{image_pull_secrets}
   containers:
```
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/workflows/steps/deploy.py` around lines 170 - 173, The
hardcoded imagePullSecrets name `npalaska-image-pull` should be made
configurable: add a new parameter (e.g., image_pull_secret or
image_pull_secret_name) to the class __init__ with a sensible default or None,
store it on self, and update _generate_kserve_yaml to render the
imagePullSecrets entry using self.image_pull_secret (skipping or omitting the
block when None). Also update any callers/instantiations to pass the secret
where needed and add a clear default value or validation error if the secret is
required.
```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install NFD operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    subscription_yaml = """
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "nfd-subscription.yaml"
    yaml_path.write_text(subscription_yaml)

    # Create namespace first
    self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
    self._run_oc(["apply", "-f", str(yaml_path)])

    return StepResult.ok("NFD operator subscription created")
```
Missing step_dir.mkdir() before writing YAML file will cause FileNotFoundError.
The yaml_path.write_text() on line 38 will fail because step_dir is never created. Other steps in this PR (e.g., guidellm.py, deploy.py) call step_dir.mkdir(parents=True, exist_ok=True) before writing files.
🐛 Proposed fix

```diff
 def execute(self, ctx: "WorkflowContext") -> StepResult:
     """Install NFD operator."""
     step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+    step_dir.mkdir(parents=True, exist_ok=True)
     subscription_yaml = """
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install NFD operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
    step_dir.mkdir(parents=True, exist_ok=True)
    subscription_yaml = """
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "nfd-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
    # Create namespace first
    self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
    self._run_oc(["apply", "-f", str(yaml_path)])
    return StepResult.ok("NFD operator subscription created")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/workflows/steps/operators.py` around lines 21 - 44, The
execute method creates step_dir but never ensures it exists, so
yaml_path.write_text(...) will raise FileNotFoundError; before writing the file,
call step_dir.mkdir(parents=True, exist_ok=True) (same pattern used in
guidellm.py/deploy.py) to create the directory, then proceed to write the
subscription_yaml and run the oc commands; update the execute function to create
the directory right after computing step_dir and before yaml_path.write_text.
```python
# Create namespace first
self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
self._run_oc(["apply", "-f", str(yaml_path)])

return StepResult.ok("NFD operator subscription created")
```
Return value of _run_oc is silently ignored.
The _run_oc calls on lines 41-42 return bool indicating success/failure, but the result is not checked. If namespace creation or YAML apply fails, the step still reports success. Consider checking the return value or at minimum logging the outcome.
🛡️ Proposed fix to check return values

```diff
 # Create namespace first
 self._run_oc(["create", "namespace", "openshift-nfd", "--dry-run=client", "-o", "yaml"])
-self._run_oc(["apply", "-f", str(yaml_path)])
+# Apply subscription and surface failures instead of discarding the result
+result = subprocess.run(
+    ["oc", "apply", "-f", str(yaml_path)],
+    capture_output=True,
+    text=True,
+    timeout=60,
+)
+if result.returncode != 0:
+    return StepResult.fail(f"Failed to install NFD operator: {result.stderr}")
 return StepResult.ok("NFD operator subscription created")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/workflows/steps/operators.py` around lines 40 - 44, The two
calls to _run_oc (creating namespace and applying the YAML) return bool but are
ignored so StepResult.ok("NFD operator subscription created") is returned even
on failure; update the code in the method that contains these calls to capture
each _run_oc result (the call creating "namespace openshift-nfd" and the call
applying str(yaml_path)), check their boolean values, log failures via the
existing logger or processLogger, and return StepResult.fail(...) (or an
appropriate error StepResult) if either call returns False; only return
StepResult.ok(...) when both _run_oc calls succeed.
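The check-the-return-code pattern can be exercised without a cluster by parameterizing the command; a hedged sketch of a checked runner (not the PR's actual `_run_oc` signature):

```python
import subprocess

def run_checked(cmd: list[str], timeout: int = 60) -> tuple[bool, str]:
    """Run a CLI command and report failure instead of discarding it.
    Returns (ok, message) so callers can map failures to StepResult.fail()."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        return False, result.stderr.strip() or f"exit code {result.returncode}"
    return True, result.stdout.strip()
```

A step would then do `ok, msg = run_checked(["oc", "apply", "-f", str(yaml_path)])` and return a failing `StepResult` when `ok` is False.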
```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install GPU operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    subscription_yaml = """
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: v24.6
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "gpu-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
```
Same issue: Missing step_dir.mkdir() in InstallGPUOperatorStep.
Apply the same fix as for InstallNFDOperatorStep.
🐛 Proposed fix

```diff
 def execute(self, ctx: "WorkflowContext") -> StepResult:
     """Install GPU operator."""
     step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+    step_dir.mkdir(parents=True, exist_ok=True)
     subscription_yaml = """
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install GPU operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
    step_dir.mkdir(parents=True, exist_ok=True)
    subscription_yaml = """
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: v24.6
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "gpu-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/workflows/steps/operators.py` around lines 67 - 84, The
InstallGPUOperatorStep.execute method creates step_dir but never ensures the
directory exists before writing the GPU subscription file; modify
InstallGPUOperatorStep.execute to create the directory (use
step_dir.mkdir(parents=True, exist_ok=True)) before calling
yaml_path.write_text(subscription_yaml) so the file write cannot fail if the
directory is missing.
```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install RHOAI operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"

    # Determine channel from version
    channel = f"stable-{self.version}"

    subscription_yaml = f"""
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: {channel}
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "rhoai-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
```
Same issue: Missing step_dir.mkdir() in InstallRHOAIOperatorStep.
Apply the same fix as for the other operator steps.
🐛 Proposed fix

```diff
 def execute(self, ctx: "WorkflowContext") -> StepResult:
     """Install RHOAI operator."""
     step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
+    step_dir.mkdir(parents=True, exist_ok=True)
     # Determine channel from version
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
def execute(self, ctx: "WorkflowContext") -> StepResult:
    """Install RHOAI operator."""
    step_dir = ctx.artifact_dir / f"{ctx.step_number:03d}__{ctx.current_step_name}"
    step_dir.mkdir(parents=True, exist_ok=True)
    # Determine channel from version
    channel = f"stable-{self.version}"
    subscription_yaml = f"""
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: {channel}
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
"""
    yaml_path = step_dir / "rhoai-subscription.yaml"
    yaml_path.write_text(subscription_yaml)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/workflows/steps/operators.py` around lines 111 - 131, The
execute method of InstallRHOAIOperatorStep is writing files into step_dir
without ensuring the directory exists; update the execute method (the function
named execute in the InstallRHOAIOperatorStep class) to create the step_dir
before writing the YAML (e.g., call step_dir.mkdir(parents=True, exist_ok=True)
immediately after computing step_dir) so yaml_path.write_text(...) won't fail.
kpouget left a comment
nice,
I put some very preliminary comments,
we'll need to discuss it further to understand how all of that works!
we'll need to find the way to integrate the DSL toolbox in that, if that makes sense
```python
@click.command(context_settings=dict(
    ignore_unknown_options=True,
    allow_interspersed_args=False,
))
```
this file should stay untouched
```yaml
# These define benchmark parameters - can be changed WITHOUT restarting vLLM
# Only GuideLLM needs to restart with new workload config
```
this should live in the rhaiis project, projects/rhaii/orchestration
```yaml
# Global Defaults & Accelerator-Specific Settings
# These are merged with model configs at runtime using inheritance:
#   defaults → accelerator → model → model.accelerator_overrides → scenario
```
this should live in the rhaiis project, projects/rhaii/orchestration
```yaml
# Model Registry
# Models only specify what's DIFFERENT from defaults.yaml
# Accelerator-specific settings go in accelerator_overrides section
```
this should live in the rhaiis project, projects/rhaii/orchestration
```diff
@@ -0,0 +1,14 @@
+"""Shared workflow steps for all projects.
```
this should live in the rhaiis project, we'll rediscuss how to turn it into a toolbox projects/rhaii/toolbox
or in a guidellm library project 🤔
```diff
@@ -0,0 +1,250 @@
+"""Artifact collection step - shared by all projects."""
```
this should live in the rhaiis project, we'll rediscuss how to turn it into a toolbox projects/rhaii/toolbox
or in a guidellm library project 🤔
```diff
@@ -0,0 +1,343 @@
+"""GuideLLM benchmark step - runs as a pod on the cluster."""
```
this should live in the rhaiis project, we'll rediscuss how to turn it into a toolbox projects/rhaii/toolbox
or in a guidellm library project 🤔
```diff
@@ -0,0 +1,529 @@
+"""OpenShift CLI wrapper with built-in retry for transient failures.
```
I think this can belong to the DSL directory, to be rediscussed
```diff
@@ -0,0 +1,129 @@
+"""Base workflow class with step registration."""
```
to be discussed in a meeting so that we understand what's this workflow exactly
```python
@ci.command()
@click.option("--dry-run", is_flag=True, help="Print what would be done without executing")
```
I think you can define that in the def ci() function, so that all the entrypoints inherit it
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/rhaiis/IMPLEMENTATION.md`:
- Line 7: Add explicit language identifiers to the fenced code blocks that are
missing them (lint MD040) — locate the triple-backtick blocks around the "CLI/CI
Entry Points" and "config/" sections and the other similar fences (the blocks
containing short directory or CLI snippets) and change ``` to fenced blocks like
```text, ```bash, ```yaml, or ```python as appropriate for the content so
markdownlint passes; ensure each fenced block (e.g., the block showing CLI/CI
commands and the block showing config directory listings) gets a suitable
language tag.
- Line 471: Replace the user-specific line "cd /Users/memehta/workspace/forge"
with a portable repo-relative instruction; locate that exact string in
IMPLEMENTATION.md and change it to a generic direction to change into the
repository root (for example instruct contributors to "cd" into the project
directory or to use a git/top-level discovery method) so the guidance works for
all environments.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0742a73-09c0-4906-8120-102b37d2e724
📒 Files selected for processing (1)
projects/rhaiis/IMPLEMENTATION.md
|
````markdown
## Architecture

```
````
Add language identifiers to fenced code blocks to satisfy lint.
These fences are missing language tags (MD040). Please annotate them (text, bash, yaml, python, etc.) so markdownlint passes consistently.
Proposed fix pattern

````diff
-```
+```text
 CLI/CI Entry Points
 ...
 ```
-```
+```text
 config/
 ...
 ```
````

Also applies to: 59-59, 70-70, 118-118, 184-184, 212-212, 261-261, 376-376, 448-448
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 7-7: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/IMPLEMENTATION.md` at line 7, Add explicit language
identifiers to the fenced code blocks that are missing them (lint MD040) —
locate the triple-backtick blocks around the "CLI/CI Entry Points" and "config/"
sections and the other similar fences (the blocks containing short directory or
CLI snippets) and change ``` to fenced blocks like ```text, ```bash, ```yaml, or
```python as appropriate for the content so markdownlint passes; ensure each
fenced block (e.g., the block showing CLI/CI commands and the block showing
config directory listings) gets a suitable language tag.
```bash
# Activate venv and run from forge directory
source ~/test_foo/python3_virt/bin/activate
cd /Users/memehta/workspace/forge
```
Replace machine-specific absolute path with a portable instruction.
cd /Users/memehta/workspace/forge is user-specific and will not work for most contributors. Use a repo-relative placeholder.
Proposed fix

```diff
-cd /Users/memehta/workspace/forge
+cd /path/to/forge
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/rhaiis/IMPLEMENTATION.md` at line 471, Replace the user-specific
line "cd /Users/memehta/workspace/forge" with a portable repo-relative
instruction; locate that exact string in IMPLEMENTATION.md and change it to a
generic direction to change into the repository root (for example instruct
contributors to "cd" into the project directory or to use a git/top-level
discovery method) so the guidance works for all environments.
Reopening from fork MML-coder/forge-1
This PR only supports the Forge test workflow
It deploys vLLM, runs the benchmark, and cleans up. It does NOT include:
How It Works
defaults.yaml → accelerator[nvidia|amd] → model → model.accelerator_overrides
DeployVLLM → WaitForReady → RunGuideLLM → (finally) CollectArtifacts → Cleanup
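The inheritance chain above amounts to a right-biased deep merge of config layers; a minimal sketch, with the layer contents invented purely for illustration (field names are not the project's actual schema):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; later layers win on conflicts,
    nested dicts are merged key by key rather than replaced wholesale."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical layers: defaults.yaml -> accelerator -> model
defaults = {"replicas": 1, "runtime_args": {"max-model-len": 4096, "dtype": "auto"}}
nvidia = {"runtime_args": {"tensor-parallel-size": 1}}
model = {"runtime_args": {"max-model-len": 8192}}

resolved = deep_merge(deep_merge(defaults, nvidia), model)
```

Each layer only has to state what differs from the layers before it, which is exactly why the model registry files stay small.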
```bash
# Single model + workload
KUBECONFIG=~/kubeconfigs/my-cluster \
PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model qwen-0.6b --workload balanced --accelerator nvidia

# Multiple workloads (deploy-once pattern)
PYTHONPATH=. python3 projects/rhaiis/orchestration/cli.py test --model llama-3.3-70b-fp8 --workloads balanced,short,long-prompt
```
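The deploy-once behavior behind `--workloads` can be pictured as grouping expanded scenarios by model, so vLLM is deployed once per group and only the GuideLLM workload changes between runs; a sketch under that assumption (the dict shape is illustrative, not the generator's real schema):

```python
from itertools import groupby

def group_by_model(scenarios: list[dict]) -> dict[str, list[str]]:
    """Group scenario dicts by model so each model is deployed once and
    only the benchmark workload varies between runs of that deployment."""
    ordered = sorted(scenarios, key=lambda s: s["model"])  # stable sort keeps workload order
    return {
        model: [s["workload"] for s in grp]
        for model, grp in groupby(ordered, key=lambda s: s["model"])
    }

scenarios = [
    {"model": "llama-3.3-70b-fp8", "workload": "balanced"},
    {"model": "llama-3.3-70b-fp8", "workload": "short"},
    {"model": "qwen-0.6b", "workload": "balanced"},
]
grouped = group_by_model(scenarios)
```

For a 70B model this matters: redeploying vLLM per workload would repeat a multi-minute model load that the grouping avoids.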
Summary by CodeRabbit

- New Features
- Documentation
- Tests