
[forge] include some documentation pages #16

Merged
kpouget merged 3 commits into openshift-psap:main from kpouget:docs
Apr 7, 2026

Conversation

Contributor

@kpouget kpouget commented Apr 7, 2026

Summary by CodeRabbit

  • Documentation

    • Added comprehensive CI and CLI orchestration guides covering discovery, workflows, logging, PR integration, artifact layout, and troubleshooting.
  • New Features

    • Job submit-and-wait tool: generates manifests, submits FOURNOS jobs, polls completion, captures logs/status, and writes artifacts.
  • Improvements

    • CI entrypoints now require executability to be detected.
    • CI metadata consolidated under a dedicated metadata folder in artifacts.
    • Richer task/subprocess error reporting and structured CI failure summaries.
    • Default namespace configured for job submissions.


openshift-ci bot commented Apr 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mml-coder for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds CI/CLI orchestration docs; requires CI scripts be user-executable; relocates CI metadata and variable overrides to 000__ci_metadata/; introduces a FOURNOS submit-and-wait pipeline that creates manifests, submits FOURNOS jobs via oc, polls for completion, collects logs/status, and always cleans up; improves DSL/task and subprocess error logging.

Changes

Cohort / File(s) Summary
Documentation
docs/ci/run_ci.md, docs/orchestration/ci.md, docs/orchestration/cli.md
Added comprehensive docs for CI/CLI entrypoints, orchestration patterns, environment/setup, logging, PR integration, artifacts layout, and troubleshooting.
CI script discovery
projects/core/ci_entrypoint/run_ci.py
find_ci_script() now only returns scripts that both exist and are user-executable (os.X_OK), changing which orchestration scripts are considered runnable.
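The executability filter described above can be sketched as follows. This is a minimal illustration only: the function name `find_ci_script` and the `orchestration/<operation>.py` layout come from this summary, and the real signature in `run_ci.py` may differ.

```python
import os
from pathlib import Path

def find_ci_script(project_dir: Path, operation: str):
    """Return the orchestration script only if it exists AND is user-executable."""
    script = project_dir / "orchestration" / f"{operation}.py"
    if script.exists() and os.access(script, os.X_OK):
        return script
    return None  # missing or non-executable scripts are not considered runnable
```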
CI metadata & overrides
projects/core/ci_entrypoint/prepare_ci.py, projects/core/library/config.py, projects/core/notifications/send.py
CI metadata dir changed to 000__ci_metadata/ (new constant) and default VARIABLE_OVERRIDES_FILENAME updated to 000__ci_metadata/variable_overrides.yaml; notification discovery updated to match.
DSL/task & subprocess logging
projects/core/dsl/runtime.py, projects/core/dsl/task.py, projects/core/dsl/shell.py
Added TaskExecutionError, improved task failure context and logging (task name/description/args), moved execution.log to top-level ARTIFACT_DIR, and enhanced subprocess CalledProcessError messages with exit code and trimmed stdout/stderr.
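The enhanced subprocess error reporting could look roughly like this. This is a hypothetical helper, not the project's actual `shell.py` code; only the exit-code-plus-trimmed-output idea comes from the summary above.

```python
from subprocess import CalledProcessError

def describe_failure(exc: CalledProcessError, max_chars: int = 500) -> str:
    """Format a CalledProcessError with exit code and trimmed stdout/stderr."""
    def trim(stream):
        text = (stream or "").strip()
        # keep only the tail, where the actual error message usually is
        return text[-max_chars:]
    parts = [f"Command {exc.cmd!r} failed with exit code {exc.returncode}"]
    if trim(exc.stdout):
        parts.append(f"stdout (trimmed): {trim(exc.stdout)}")
    if trim(exc.stderr):
        parts.append(f"stderr (trimmed): {trim(exc.stderr)}")
    return "\n".join(parts)
```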
Fournos submission wiring
projects/fournos_launcher/orchestration/submit.py, projects/fournos_launcher/orchestration/ci.py, projects/fournos_launcher/orchestration/cli.py
Renamed submit() → submit_job() and updated CLI/CI commands to invoke submit_job(); submission now delegates to the new submit-and-wait pipeline, preserving KUBECONFIG handling.
Fournos config
projects/fournos_launcher/orchestration/config.yaml
Added top-level fournos block with namespace: psap-automation.
New submit-and-wait pipeline
projects/fournos_launcher/toolbox/submit_and_wait/main.py
Added executable pipeline implementing tasks: validate inputs, generate job name, ensure oc, create manifest, oc apply submit, poll job phase with retries/backoff, retrieve logs, capture final status YAML, and always cleanup manifest; includes CLI main() and run(...) entry.
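The poll-with-retries/backoff step of the pipeline can be illustrated with a clock-injectable loop. This is a sketch only; the real pipeline uses the project's `@task`/`@retry` DSL and shells out to `oc`, neither of which is shown here, and the function name is illustrative.

```python
import time

def poll_until_terminal(get_phase, timeout_s=7200, interval_s=30, backoff=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll get_phase() until Completed/Failed, sleeping with optional backoff."""
    deadline = clock() + timeout_s
    delay = interval_s
    while clock() < deadline:
        phase = get_phase()  # e.g. parsed from `oc get fournosjob ... -o jsonpath=...`
        if phase in ("Completed", "Failed"):
            return phase
        sleep(delay)
        delay *= backoff
    raise TimeoutError("job did not reach a terminal phase before the deadline")
```

Injecting `clock` and `sleep` keeps the loop unit-testable without waiting real time.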

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant CLI as "CLI / run()"
  participant Pipeline as "submit_and_wait.run()"
  participant FS as "Artifacts (ARTIFACT_DIR)"
  participant OC as "Kubernetes (`oc`)"
  participant Logs as "Job Logs"

  CLI->>Pipeline: invoke run(cluster, project, args...)
  Pipeline->>FS: write job manifest (manifest_file)
  Pipeline->>OC: oc apply -f manifest_file
  OC-->>Pipeline: accepted
  loop poll
    Pipeline->>OC: oc get fournosjob ... (phase)
    OC-->>Pipeline: phase (Pending/Running/Completed/Failed)
  end
  Pipeline->>OC: oc logs -l "fournos.job=<name>"
  OC-->>Logs: job logs (or empty)
  Pipeline->>FS: save logs and final status YAML
  Pipeline->>FS: cleanup manifest file (always)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰
I hopped through docs by lantern light,
Scripts must be executable to take flight,
000__ci_metadata/ keeps overrides snug and dry,
Jobs submit, wait, log — then flutter by,
A carrot cheer for artifacts piled high! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The PR title describes adding documentation pages, but the changeset includes significant code modifications beyond documentation (changes to run_ci.py, config.py, cli.py, submit.py, runtime.py, shell.py, task.py, ci.py, etc.). | Revise the title to accurately reflect that this PR adds documentation AND makes implementation changes to CI/orchestration systems (e.g., '[forge] add CI/orchestration documentation and refactor metadata paths and job submission'). |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor Author

kpouget commented Apr 7, 2026

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false


topsail-bot bot commented Apr 7, 2026

🟢 Test of 'fournos_launcher submit' succeeded after 00 hours 00 minutes 02 seconds 🟢

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
docs/orchestration/ci.md (1)

34-38: Command format inconsistent with other documentation.

The run_ci command here doesn't match the format in docs/ci/run_ci.md which uses ./run_ci.py. Consider aligning these examples:

 or from the CI launcher:
 
```diff
-run_ci llm_d ci pre-cleanup
-run_ci llm_d ci prepare
-run_ci llm_d ci test
+./run_ci.py llm_d ci pre-cleanup
+./run_ci.py llm_d ci prepare
+./run_ci.py llm_d ci test
```
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @docs/orchestration/ci.md around lines 34 - 38, Update the examples in
docs/orchestration/ci.md to match the other docs by replacing the bare run_ci
invocations with the script form ./run_ci.py; specifically change occurrences of
run_ci llm_d ci pre-cleanup, run_ci llm_d ci prepare, and run_ci llm_d ci test
to use ./run_ci.py so they align with docs/ci/run_ci.md and the project's
canonical command format.

projects/core/ci_entrypoint/run_ci.py (1)

`199-200`: **Good addition of executable check, but error message could be clearer.**

The executable permission check aligns with the filtering in `show_project_operations` (lines 275-276) and is consistent with the documentation. However, when a script exists but isn't executable, the error message at line 396 says "Expected: {project_dir}/orchestration/{operation}.py" without mentioning the executable requirement, which could confuse users.



💡 Suggested improvement for the error message

```diff
         click.echo(
             click.style(
                 f"❌ ERROR: No CI script found for project '{project}' operation '{operation}'.",
                 fg='red'
             ),
             err=True
         )
-        click.echo(f"🔍 Expected: {project_dir}/orchestration/{operation}.py")
+        click.echo(f"🔍 Expected: {project_dir}/orchestration/{operation}.py (must be executable)")
         sys.exit(1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/run_ci.py` around lines 199 - 200, The error
message raised when a script exists but isn't executable is unclear; update the
message emitted where you validate script_path (the branch that checks
script_path.exists() and os.access(script_path, os.X_OK)) to explicitly state
that the file must be executable and show the full expected path (e.g.,
"Expected executable: {project_dir}/orchestration/{operation}.py (file exists
but is not executable)"), so callers see both the path and the
executable-permission requirement; adjust the string emitted at the current
error location that currently reads "Expected:
{project_dir}/orchestration/{operation}.py" to this clearer, permission-aware
wording.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/ci/run_ci.md`:
- Around line 162-165: Update the documentation's project aliasing section to
match the actual mapping implemented in run_ci.py: change the described alias
from "forge → fournos" to reflect the code that maps "fournos" to
"fournos_launcher" (the if block checking project == "fournos" and setting
project = "fournos_launcher"), and also remove or correct any mention of "forge"
if it is not used; ensure the docs text and bullet list consistently state
"fournos → fournos_launcher" (and adjust related environment-specific/ artifact
management phrasing if it referenced the old alias).
- Line 15: Replace the redundant phrase "CLI interface" in the docs entry for
run_ci.py by changing the description for run_ci.py to use either "CLI" or the
expanded "Command Line Interface" (e.g., "Main entrypoint script offering a
unified CLI for CI operations" or "Main entrypoint script offering a unified
Command Line Interface for CI operations"); update the line referencing
run_ci.py accordingly.

In `@docs/orchestration/ci.md`:
- Line 9: The doc uses two different phase names—"cleanup" and "pre_cleanup"—so
pick one canonical phase name (e.g., pre_cleanup) and update every occurrence in
the CI docs and examples to match; specifically, rename the phase listed in the
Commands for CI phases section (the "cleanup" entry) to the chosen canonical
name, and update any headings, example commands, and referenced pipeline stage
names in the same document to use that exact identifier so all mentions (e.g.,
prepare, test, pre_cleanup) are consistent.

In `@projects/core/library/config.py`:
- Line 20: The constant VARIABLE_OVERRIDES_FILENAME was changed to
"_meta/variable_overrides.yaml", but code still hardcodes the old path (e.g.,
variable_overrides_file in projects/jump_ci/testing/test.py) and
projects/legacy/library/config.py defines the old constant, causing missed
overrides; update callers to use the canonical VARIABLE_OVERRIDES_FILENAME
constant instead of composing "variable_overrides.yaml" manually and update the
legacy module to export the same VARIABLE_OVERRIDES_FILENAME value (or import it
from projects/core/library/config.py) so all code reads ARTIFACT_DIR /
VARIABLE_OVERRIDES_FILENAME consistently.

---

Nitpick comments:
In `@docs/orchestration/ci.md`:
- Around line 34-38: Update the examples in docs/orchestration/ci.md to match
the other docs by replacing the bare run_ci invocations with the script form
./run_ci.py; specifically change occurrences of run_ci llm_d ci pre-cleanup,
run_ci llm_d ci prepare, and run_ci llm_d ci test to use ./run_ci.py so they
align with docs/ci/run_ci.md and the project's canonical command format.

In `@projects/core/ci_entrypoint/run_ci.py`:
- Around line 199-200: The error message raised when a script exists but isn't
executable is unclear; update the message emitted where you validate script_path
(the branch that checks script_path.exists() and os.access(script_path,
os.X_OK)) to explicitly state that the file must be executable and show the full
expected path (e.g., "Expected executable:
{project_dir}/orchestration/{operation}.py (file exists but is not
executable)"), so callers see both the path and the executable-permission
requirement; adjust the string emitted at the current error location that
currently reads "Expected: {project_dir}/orchestration/{operation}.py" to this
clearer, permission-aware wording.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 132470d2-94fe-40db-9134-a3d69ed12a7d

📥 Commits

Reviewing files that changed from the base of the PR and between c6d909e and ca5bc32.

📒 Files selected for processing (5)
  • docs/ci/run_ci.md
  • docs/orchestration/ci.md
  • docs/orchestration/cli.md
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/library/config.py

Contributor Author

kpouget commented Apr 7, 2026

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false


topsail-bot bot commented Apr 7, 2026

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 01 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false

Failure indicator:

/logs/artifacts/FAILURE | CI failed with exception:
Exception: BdbQuit: 

Full stack trace:
Traceback (most recent call last):
  File "/app/forge/projects/core/library/ci.py", line 50, in wrapper
    exit_code = command_func(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/forge/projects/fournos_launcher/orchestration/ci.py", line 37, in submit
    return submit_mod.submit_job()

[...]

Contributor Author

kpouget commented Apr 7, 2026

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false

Contributor Author

kpouget commented Apr 7, 2026

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false


topsail-bot bot commented Apr 7, 2026

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 02 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false

Failure indicator:

/logs/artifacts/000__submit_and_wait/FAILURE | CalledProcessError: Command 'oc apply -f /logs/artifacts/000__submit_and_wait/src/forge-llm_d-20260407-144340-manifest.yaml' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 103, in execute_tasks
    raise execution_error
  File "/app/forge/projects/core/dsl/runtime.py", line 67, in execute_tasks
    _execute_single_task(task_info, args)
  File "/app/forge/projects/core/dsl/runtime.py", line 138, in _execute_single_task
    task_status["ret"] = task_func(args)
                         ^^^^^^^^^^^^^^^
  File "/app/forge/projects/core/dsl/task.py", line 72, in wrapper

[...]

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py`:
- Line 51: Remove the blocking debugger hook by deleting the "import pdb;
pdb.set_trace()" statement in main.py so non-interactive submission paths no
longer stop; also remove any unused "import pdb" left behind and run the
submission flow (or unit tests) to confirm tasks proceed without manual
intervention.
- Around line 138-142: The code currently echoes the kubectl/oc command instead
of executing it; update the shell.run invocation in submit_and_wait.main to run
the real apply command (replace f"echo oc apply -f {args.manifest_file}" with
f"oc apply -f {args.manifest_file}") so the FournosJob is actually created, keep
checking result.success and keep the RuntimeError with result.stderr if it fails
(references: shell.run, result, args.manifest_file, RuntimeError).
- Around line 83-87: Normalize args.final_job_name so it is a Kubernetes-safe
DNS-1123 label: import re, convert args.project to lowercase, replace
underscores and invalid characters with hyphens (e.g. re.sub(r'[^a-z0-9-]+',
'-', project_lower)), collapse multiple hyphens, trim leading/trailing hyphens,
and truncate the combined "forge-{project}-{timestamp}" to <=253 chars
(preferably keeping timestamp intact); apply this transformation when assigning
args.final_job_name in the block that currently sets it from
args.project/timestamp (referencing args.final_job_name, args.project, and
timestamp) and ensure you add the import re at the top of the file.
- Around line 147-184: The wait_for_job_completion function currently relies on
a fixed `@retry` decorator (120×30s) and ignores args.timeout_minutes and
args.poll_interval_seconds; replace the retry-based flow with an explicit loop
using time.monotonic(): compute a deadline = time.monotonic() +
args.timeout_minutes * 60, then loop while time.monotonic() < deadline calling
the same shell.run status check and the existing status handling
(Completed/Failed/Running/Pending/other), sleeping args.poll_interval_seconds
between iterations; if the deadline is exceeded raise a TimeoutError with a
clear message about args.final_job_name; keep the failure-detail lookup
(jsonpath .status.message) and existing logging behavior.
- Around line 187-215: Add the `@always` decorator to both artifact-collector
tasks so they run even if prior tasks fail: annotate retrieve_job_logs and
capture_final_job_status with `@always` (above their existing `@task` decorators)
and ensure the always decorator is imported from the task framework (e.g., add
the import for always alongside task if missing) so logs and final-status YAML
are collected regardless of earlier exceptions.
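One way to avoid the first class of bug above (an echoed command silently standing in for the real one) is to make dry-run an explicit flag rather than hard-coding `echo` into the command string. This is a hypothetical helper built on `subprocess`, not the project's `shell.run` API.

```python
import subprocess

def apply_manifest(manifest_file: str, dry_run: bool = False) -> str:
    """Apply a manifest with oc; echoing only happens when dry_run is requested."""
    cmd = ["oc", "apply", "-f", str(manifest_file)]
    if dry_run:
        # never ships to production by accident: the caller must opt in
        return "DRY-RUN: " + " ".join(cmd)
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"oc apply failed: {result.stderr}")
    return result.stdout
```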

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8efc7105-c27a-445f-8bc8-bb42d091ba7b

📥 Commits

Reviewing files that changed from the base of the PR and between ca5bc32 and fb0346b.

📒 Files selected for processing (10)
  • docs/ci/run_ci.md
  • docs/orchestration/ci.md
  • docs/orchestration/cli.md
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/library/config.py
  • projects/fournos_launcher/orchestration/ci.py
  • projects/fournos_launcher/orchestration/cli.py
  • projects/fournos_launcher/orchestration/config.yaml
  • projects/fournos_launcher/orchestration/submit.py
  • projects/fournos_launcher/toolbox/submit_and_wait/main.py
✅ Files skipped from review due to trivial changes (4)
  • projects/fournos_launcher/orchestration/config.yaml
  • docs/orchestration/cli.md
  • docs/orchestration/ci.md
  • docs/ci/run_ci.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/library/config.py

Comment on lines +83 to +87
```python
if args.job_name:
    args.final_job_name = args.job_name
else:
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    args.final_job_name = f"forge-{args.project}-{timestamp}"
```

⚠️ Potential issue | 🔴 Critical

Normalize final_job_name to a Kubernetes-safe slug.

args.project is interpolated verbatim here. Values like llm_d produce forge-llm_d-..., which will be rejected as metadata.name.

Suggested fix

```diff
     if args.job_name:
         args.final_job_name = args.job_name
     else:
         timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
-        args.final_job_name = f"forge-{args.project}-{timestamp}"
+        project_slug = re.sub(r"[^a-z0-9-]+", "-", args.project.lower()).strip("-") or "job"
+        args.final_job_name = f"forge-{project_slug}-{timestamp}"
```

```python
# add with the other imports
import re
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 if args.job_name:
     args.final_job_name = args.job_name
 else:
     timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
-    args.final_job_name = f"forge-{args.project}-{timestamp}"
+    project_slug = re.sub(r"[^a-z0-9-]+", "-", args.project.lower()).strip("-") or "job"
+    args.final_job_name = f"forge-{project_slug}-{timestamp}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 83 -
87, Normalize args.final_job_name so it is a Kubernetes-safe DNS-1123 label:
import re, convert args.project to lowercase, replace underscores and invalid
characters with hyphens (e.g. re.sub(r'[^a-z0-9-]+', '-', project_lower)),
collapse multiple hyphens, trim leading/trailing hyphens, and truncate the
combined "forge-{project}-{timestamp}" to <=253 chars (preferably keeping
timestamp intact); apply this transformation when assigning args.final_job_name
in the block that currently sets it from args.project/timestamp (referencing
args.final_job_name, args.project, and timestamp) and ensure you add the import
re at the top of the file.
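The normalization the reviewer asks for can be packaged as a small pure function, along the lines of the suggested diff. The function name is illustrative; the real fix would live inline where `final_job_name` is assigned.

```python
import re

def k8s_job_name(project: str, timestamp: str, prefix: str = "forge") -> str:
    """Build a DNS-1123-safe job name: lowercase, hyphens only, length-capped."""
    slug = re.sub(r"[^a-z0-9-]+", "-", project.lower()).strip("-") or "job"
    slug = re.sub(r"-{2,}", "-", slug)  # collapse runs of hyphens
    return f"{prefix}-{slug}-{timestamp}"[:253]
```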

Comment on lines +147 to +184
```python
@task
@retry(attempts=120, delay=30, backoff=1.0)
def wait_for_job_completion(args):
    """Wait for FOURNOS job to complete"""

    # Check job status
    status_result = shell.run(
        f'oc get fournosjob {args.final_job_name} -n {args.namespace} -o jsonpath="{{.status.phase}}"',
        check=False,
        log_stdout=False
    )

    if not status_result.success:
        # Failed to get status, will retry
        print(f"Failed to get job status, retrying...")
        raise RuntimeError("Failed to get job status")

    status = status_result.stdout.strip()

    if status == "Completed":
        return f"Job {args.final_job_name} completed successfully"
    elif status == "Failed":
        # Get failure details
        failure_result = shell.run(
            f'oc get fournosjob {args.final_job_name} -n {args.namespace} -o jsonpath="{{.status.message}}"',
            check=False,
            log_stdout=False
        )
        failure_msg = failure_result.stdout.strip() if failure_result.success else "Unknown failure"
        raise RuntimeError(f"Job {args.final_job_name} failed: {failure_msg}")
    elif status in ["Running", "Pending"]:
        print(f"Job {args.final_job_name} status: {status}")
        # Job still running, will retry
        raise RuntimeError("Job still running")
    else:
        print(f"Job {args.final_job_name} status: {status}")
        # Unknown status, will retry
        raise RuntimeError(f"Unknown job status: {status}")
```

⚠️ Potential issue | 🟠 Major

Honor timeout_minutes and poll_interval_seconds in the wait loop.

The retry budget is fixed at 120 × 30s, so this stops after about an hour even though run() documents a 120-minute default. poll_interval_seconds is also unused.

Suggested direction

```diff
-@task
-@retry(attempts=120, delay=30, backoff=1.0)
+@task
 def wait_for_job_completion(args):
+    deadline = time.monotonic() + args.timeout_minutes * 60
+    while time.monotonic() < deadline:
+        ...
+        time.sleep(args.poll_interval_seconds)
+
+    raise TimeoutError(f"Timed out waiting for job {args.final_job_name}")
```
🧰 Tools
🪛 Ruff (0.15.9)

[error] 161-161: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 147 -
184, The wait_for_job_completion function currently relies on a fixed `@retry`
decorator (120×30s) and ignores args.timeout_minutes and
args.poll_interval_seconds; replace the retry-based flow with an explicit loop
using time.monotonic(): compute a deadline = time.monotonic() +
args.timeout_minutes * 60, then loop while time.monotonic() < deadline calling
the same shell.run status check and the existing status handling
(Completed/Failed/Running/Pending/other), sleeping args.poll_interval_seconds
between iterations; if the deadline is exceeded raise a TimeoutError with a
clear message about args.final_job_name; keep the failure-detail lookup
(jsonpath .status.message) and existing logging behavior.

Comment on lines +187 to +215
```python
@task
def retrieve_job_logs(args):
    """Retrieve and save job logs"""

    # Get job logs
    logs_result = shell.run(
        f'oc logs -l "fournos.job={args.final_job_name}" -n {args.namespace} --all-containers=true',
        check=False,
        stdout_dest=args.artifact_dir / f"{args.final_job_name}-logs.txt"
    )

    if logs_result.success:
        return f"Job logs saved to {args.final_job_name}-logs.txt"
    else:
        return "No logs available or failed to retrieve logs"


@task
def capture_final_job_status(args):
    """Capture final job status and details"""

    # Get full job details
    shell.run(
        f'oc get fournosjob {args.final_job_name} -n {args.namespace} -o yaml',
        stdout_dest=args.artifact_dir / f"{args.final_job_name}-final-status.yaml",
        check=False
    )

    return f"Final job status captured to {args.final_job_name}-final-status.yaml"
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
# First, let's check if the files exist and understand the structure
git ls-files | grep -E "(runtime\.py|main\.py)" | head -20
```

Repository: openshift-psap/forge
Length of output: 200

🏁 Script executed:

```shell
# Check the runtime.py file to understand task execution flow
cat projects/core/dsl/runtime.py | head -150
```

Repository: openshift-psap/forge
Length of output: 5137

🏁 Script executed:

```shell
# Check the main.py file to see the current state of the code
cat projects/fournos_launcher/toolbox/submit_and_wait/main.py | sed -n '180,220p'
```

Repository: openshift-psap/forge
Length of output: 1215

🏁 Script executed:

```shell
# Look for `@always` decorator definition
rg "@always" --type py -B 2 -A 5
```

Repository: openshift-psap/forge
Length of output: 682

🏁 Script executed:

```shell
# Check the full task ordering in main.py to understand execution sequence
rg "@task" projects/fournos_launcher/toolbox/submit_and_wait/main.py -B 1 -A 3
```

Repository: openshift-psap/forge
Length of output: 840


Add @always decorator to artifact collector tasks.

The task execution model stops normal tasks after the first exception. If wait_for_job_completion fails, retrieve_job_logs and capture_final_job_status never run, leaving no logs or final status for debugging. Both tasks should execute regardless of prior failures to ensure diagnostic data is captured.

Suggested fix

```diff
+@always
 @task
 def retrieve_job_logs(args):
     """Retrieve and save job logs"""
+@always
 @task
 def capture_final_job_status(args):
     """Capture final job status and details"""
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
+@always
 @task
 def retrieve_job_logs(args):
     """Retrieve and save job logs"""

     # Get job logs
     logs_result = shell.run(
         f'oc logs -l "fournos.job={args.final_job_name}" -n {args.namespace} --all-containers=true',
         check=False,
         stdout_dest=args.artifact_dir / f"{args.final_job_name}-logs.txt"
     )

     if logs_result.success:
         return f"Job logs saved to {args.final_job_name}-logs.txt"
     else:
         return "No logs available or failed to retrieve logs"


+@always
 @task
 def capture_final_job_status(args):
     """Capture final job status and details"""

     # Get full job details
     shell.run(
         f'oc get fournosjob {args.final_job_name} -n {args.namespace} -o yaml',
         stdout_dest=args.artifact_dir / f"{args.final_job_name}-final-status.yaml",
         check=False
     )

     return f"Final job status captured to {args.final_job_name}-final-status.yaml"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 187 -
215, Add the `@always` decorator to both artifact-collector tasks so they run even
if prior tasks fail: annotate retrieve_job_logs and capture_final_job_status
with `@always` (above their existing `@task` decorators) and ensure the always
decorator is imported from the task framework (e.g., add the import for always
alongside task if missing) so logs and final-status YAML are collected
regardless of earlier exceptions.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (4)
projects/fournos_launcher/toolbox/submit_and_wait/main.py (3)

160-160: Remove extraneous f prefix from string literal.

This f-string has no placeholders.

✨ Suggested fix
-        print(f"Failed to get job status, retrying...")
+        print("Failed to get job status, retrying...")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` at line 160, The
print call in main.py uses an unnecessary f-string without placeholders; locate
the print in the retry loop inside the submit_and_wait / get job status flow
(the print("Failed to get job status, retrying...") line) and remove the f
prefix so it becomes a plain string literal (replace print(f"...") with
print("...")) to avoid the redundant f-string.

245-247: Remove extraneous f prefixes from string literals.

These f-strings have no placeholders.

✨ Suggested fix
-        print(f"✅ FOURNOS job completed successfully")
+        print("✅ FOURNOS job completed successfully")
     except KeyboardInterrupt:
-        print(f"\n🚫 Operation interrupted by user")
+        print("\n🚫 Operation interrupted by user")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 245 -
247, Two print statements in submit_and_wait/main.py use unnecessary f-strings
with no placeholders; change print(f"✅ FOURNOS job completed successfully") and
print(f"\n🚫 Operation interrupted by user") to plain string literals (remove
the leading 'f') in the try/except block so they become print("✅ FOURNOS job
completed successfully") and print("\n🚫 Operation interrupted by user"); update
the prints where they appear in the submit_and_wait main flow (the
completion/KeyboardInterrupt handlers).

18-27: Use explicit | None type hints instead of implicit Optional.

PEP 484 prohibits implicit Optional. The args and variables_overrides parameters have default values of None but their type hints don't reflect this.

✨ Suggested fix
 def run(
     cluster_name: str,
     project: str,
-    args: list = None,
-    variables_overrides: dict = None,
+    args: list | None = None,
+    variables_overrides: dict | None = None,
     job_name: str = "",
     namespace: str = "fournos-jobs",
     timeout_minutes: int = 120,
     poll_interval_seconds: int = 30
 ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 18 -
27, The function signature for run uses default None for args and
variables_overrides but their type hints are implicit; update the signature to
use explicit None union types—change args: list to args: list | None = None and
variables_overrides: dict to variables_overrides: dict | None = None
(referencing the run function parameters args and variables_overrides) so the
annotations comply with PEP 484/PEP 604; no other logic changes needed.
projects/fournos_launcher/orchestration/submit.py (1)

33-46: Consider handling exceptions from submit_and_wait.

If submit_and_wait() raises an exception, the function will never return 0 and the exception will propagate. This may be intentional (letting the caller handle it), but the unconditional return 0 could mislead readers into thinking success is guaranteed.

Also, the f-string on line 34 has no placeholders.

✨ Suggested fix for f-string
 def submit_job():
-    logger.warning(f"Hello Fournos")
+    logger.warning("Hello Fournos")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/orchestration/submit.py` around lines 33 - 46, The
submit_job function currently logs a placeholder f-string and unconditionally
returns 0 while calling submit_and_wait; update submit_job to (1) change
logger.warning(f"Hello Fournos") to a normal string logger.warning("Hello
Fournos") since there are no placeholders, and (2) wrap the submit_and_wait(...)
call in a try/except around submit_job so that exceptions from submit_and_wait
are caught, logged (include the exception object), and cause submit_job to
return a non-zero status (e.g., return 1) or re-raise after logging — reference
submit_job and submit_and_wait to find the call site and logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 33-46: The submit_job function currently logs a placeholder
f-string and unconditionally returns 0 while calling submit_and_wait; update
submit_job to (1) change logger.warning(f"Hello Fournos") to a normal string
logger.warning("Hello Fournos") since there are no placeholders, and (2) wrap
the submit_and_wait(...) call in a try/except around submit_job so that
exceptions from submit_and_wait are caught, logged (include the exception
object), and cause submit_job to return a non-zero status (e.g., return 1) or
re-raise after logging — reference submit_job and submit_and_wait to find the
call site and logging.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py`:
- Line 160: The print call in main.py uses an unnecessary f-string without
placeholders; locate the print in the retry loop inside the submit_and_wait /
get job status flow (the print("Failed to get job status, retrying...") line)
and remove the f prefix so it becomes a plain string literal (replace
print(f"...") with print("...")) to avoid the redundant f-string.
- Around line 245-247: Two print statements in submit_and_wait/main.py use
unnecessary f-strings with no placeholders; change print(f"✅ FOURNOS job
completed successfully") and print(f"\n🚫 Operation interrupted by user") to
plain string literals (remove the leading 'f') in the try/except block so they
become print("✅ FOURNOS job completed successfully") and print("\n🚫 Operation
interrupted by user"); update the prints where they appear in the
submit_and_wait main flow (the completion/KeyboardInterrupt handlers).
- Around line 18-27: The function signature for run uses default None for args
and variables_overrides but their type hints are implicit; update the signature
to use explicit None union types—change args: list to args: list | None = None
and variables_overrides: dict to variables_overrides: dict | None = None
(referencing the run function parameters args and variables_overrides) so the
annotations comply with PEP 484/PEP 604; no other logic changes needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1033a0c2-ba42-4f5b-aa57-2b3b75482065

📥 Commits

Reviewing files that changed from the base of the PR and between fb0346b and 9a63947.

📒 Files selected for processing (9)
  • docs/orchestration/ci.md
  • docs/orchestration/cli.md
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/library/config.py
  • projects/fournos_launcher/orchestration/ci.py
  • projects/fournos_launcher/orchestration/cli.py
  • projects/fournos_launcher/orchestration/config.yaml
  • projects/fournos_launcher/orchestration/submit.py
  • projects/fournos_launcher/toolbox/submit_and_wait/main.py
✅ Files skipped from review due to trivial changes (3)
  • projects/fournos_launcher/orchestration/config.yaml
  • docs/orchestration/cli.md
  • docs/orchestration/ci.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • projects/fournos_launcher/orchestration/ci.py
  • projects/fournos_launcher/orchestration/cli.py
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/library/config.py

@kpouget
Contributor Author

kpouget commented Apr 7, 2026

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false

@topsail-bot

topsail-bot bot commented Apr 7, 2026

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 02 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false

Failure indicator:

/logs/artifacts/001__submit_and_wait/FAILURE | CalledProcessError: Command 'oc apply -f /logs/artifacts/001__submit_and_wait/src/forge-llm_d-20260407-150732-manifest.yaml' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 103, in execute_tasks
    raise execution_error
  File "/app/forge/projects/core/dsl/runtime.py", line 67, in execute_tasks
    _execute_single_task(task_info, args)
  File "/app/forge/projects/core/dsl/runtime.py", line 139, in _execute_single_task
    task_status["ret"] = task_func(args)
                         ^^^^^^^^^^^^^^^
  File "/app/forge/projects/core/dsl/task.py", line 72, in wrapper

[...]


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
projects/fournos_launcher/toolbox/submit_and_wait/main.py (3)

186-214: ⚠️ Potential issue | 🟠 Major

Add @always decorator to artifact collector tasks.

If wait_for_job_completion fails, retrieve_job_logs and capture_final_job_status won't run, leaving no logs or final status for debugging. These tasks should run regardless of prior failures to capture diagnostic data.

Proposed fix
+@always
 @task
 def retrieve_job_logs(args):
     """Retrieve and save job logs"""
+@always
 @task
 def capture_final_job_status(args):
     """Capture final job status and details"""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 186 -
214, The artifact-collector tasks retrieve_job_logs and capture_final_job_status
must run even if earlier tasks fail; add the `@always` decorator (from the task
framework used) above both the retrieve_job_logs and capture_final_job_status
function definitions so they execute regardless of prior task failures, leaving
their bodies unchanged and preserving stdout_dest/check args.

146-148: ⚠️ Potential issue | 🟠 Major

The timeout_minutes and poll_interval_seconds parameters are ignored.

The @retry(attempts=120, delay=30, backoff=1.0) decorator uses hardcoded values, resulting in a fixed ~60-minute timeout (120 × 30s) regardless of the timeout_minutes parameter (documented default: 120 minutes).

Suggested direction: Use explicit deadline loop

Replace the retry decorator with an explicit polling loop that respects the configured parameters:

@task
def wait_for_job_completion(args):
    """Wait for FOURNOS job to complete"""
    import time
    
    deadline = time.monotonic() + args.timeout_minutes * 60
    
    while time.monotonic() < deadline:
        # ... existing status check logic ...
        
        if status == "Completed":
            return f"Job {args.final_job_name} completed successfully"
        elif status == "Failed":
            # ... existing failure handling ...
        
        time.sleep(args.poll_interval_seconds)
    
    raise TimeoutError(f"Timed out waiting for job {args.final_job_name} after {args.timeout_minutes} minutes")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 146 -
148, The wait_for_job_completion function currently uses a hardcoded
`@retry(attempts=120, delay=30, backoff=1.0)` decorator so args.timeout_minutes
and args.poll_interval_seconds are ignored; remove the retry decorator on
wait_for_job_completion and replace it with an explicit polling loop that
computes a deadline using time.monotonic() + args.timeout_minutes*60, calls the
existing job status check and existing success/failure handling inside the loop,
sleeps using args.poll_interval_seconds between polls, and when the deadline is
exceeded raise a TimeoutError with a message referencing args.final_job_name and
args.timeout_minutes.

82-86: ⚠️ Potential issue | 🟠 Major

Normalize final_job_name to a Kubernetes-safe DNS-1123 label.

The args.project value (e.g., llm_d) is interpolated verbatim into the job name. Values containing underscores produce names like forge-llm_d-... which violate DNS-1123 subdomain naming (lowercase alphanumeric and hyphens only).

Proposed fix
+import re
+
 @task
 def generate_job_name(args):
     """Generate job name if not provided"""

     if args.job_name:
         args.final_job_name = args.job_name
     else:
         timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
-        args.final_job_name = f"forge-{args.project}-{timestamp}"
+        # Normalize project name to DNS-1123 safe format
+        project_slug = re.sub(r"[^a-z0-9-]+", "-", args.project.lower()).strip("-") or "job"
+        args.final_job_name = f"forge-{project_slug}-{timestamp}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 82 -
86, The generated job name may include characters invalid for DNS-1123 labels
(e.g., underscores from args.project); update the logic that assigns
args.final_job_name (the block using args.job_name, args.project, and
datetime.now().strftime) to normalize the interpolated name: lowercase it,
replace any character not a-z0-9 or hyphen with a hyphen (e.g., convert
underscores to hyphens), collapse consecutive hyphens, strip leading/trailing
hyphens, and truncate to 63 characters (DNS-1123 label max) while ensuring it is
not empty (provide a safe default like "forge"). Implement this normalization
using a simple regex-based sanitizer before assigning args.final_job_name.
🧹 Nitpick comments (7)
projects/fournos_launcher/orchestration/submit.py (1)

34-34: Remove unnecessary f-string prefix.

This f-string has no placeholders. Remove the f prefix.

Proposed fix
-    logger.warning(f"Hello Fournos")
+    logger.warning("Hello Fournos")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/orchestration/submit.py` at line 34, The
logger.warning call in submit.py uses an unnecessary f-string
(logger.warning(f"Hello Fournos")) even though there are no placeholders; change
it to a plain string by removing the `f` prefix (use logger.warning("Hello
Fournos")) to avoid misleading use of f-strings and minor overhead.
projects/fournos_launcher/toolbox/submit_and_wait/main.py (3)

160-160: Remove unnecessary f-string prefix.

This f-string has no placeholders.

Proposed fix
-        print(f"Failed to get job status, retrying...")
+        print("Failed to get job status, retrying...")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` at line 160, The
print call uses an unnecessary f-string prefix: replace print(f"Failed to get
job status, retrying...") with a regular string literal print("Failed to get job
status, retrying...") in the submit/wait logic (the print statement that logs
retrying on failed job status retrieval) to remove the redundant f prefix.

245-248: Remove unnecessary f-string prefixes.

These f-strings on lines 245 and 247 have no placeholders.

Proposed fix
-        print(f"✅ FOURNOS job completed successfully")
+        print("✅ FOURNOS job completed successfully")
     except KeyboardInterrupt:
-        print(f"\n🚫 Operation interrupted by user")
+        print("\n🚫 Operation interrupted by user")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 245 -
248, The two print statements use unnecessary f-string prefixes even though they
contain no placeholders; update the prints in the try/except block (the calls
currently written as print(f"✅ FOURNOS job completed successfully") and
print(f"\n🚫 Operation interrupted by user")) to plain string prints (remove the
leading f) so they become print("✅ FOURNOS job completed successfully") and
print("\n🚫 Operation interrupted by user").

21-22: Use explicit Optional type hints per PEP 484.

The type hints use implicit Optional (a default of None not reflected in the annotation), which PEP 484 prohibits.

Proposed fix
+from typing import Optional
+
 def run(
     cluster_name: str,
     project: str,
-    args: list = None,
-    variables_overrides: dict = None,
+    args: Optional[list] = None,
+    variables_overrides: Optional[dict] = None,
     job_name: str = "",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py` around lines 21 -
22, Replace the implicit Optional hints for the parameters args and
variables_overrides with explicit PEP 484 Optional types: import Optional (and
List/Dict/Any as needed) from typing and change the annotations from "args: list
= None" and "variables_overrides: dict = None" to something like "args:
Optional[List[Any]] = None" and "variables_overrides: Optional[Dict[str, Any]] =
None" in the function signature (the parameters named args and
variables_overrides in main.py), ensuring the typing imports are added at the
top of the file.
projects/core/ci_entrypoint/prepare_ci.py (1)

396-411: Static analysis: Consider using absolute paths for curl.

The static analysis flags curl as a partial executable path (S607). While this is low risk in a CI environment, using the full path or shutil.which('curl') would be more secure.

Also, the subprocess calls don't capture output, so if the curl command fails, the warning message won't include details about why.

Suggested improvement
+import shutil
+
+curl_path = shutil.which('curl')
+if not curl_path:
+    logger.warning("curl not found, skipping PR download")
+else:
     try:
         # Download PR data
         result = subprocess.run(
-            ["curl", "-sSf", pr_url, "-o", str(artifact_path / CI_METADATA_DIRNAME / "pull_request.json")],
-            timeout=30
+            [curl_path, "-sSf", pr_url, "-o", str(artifact_path / CI_METADATA_DIRNAME / "pull_request.json")],
+            timeout=30,
+            capture_output=True,
+            text=True
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/prepare_ci.py` around lines 396 - 411, Replace
the bare "curl" invocation in the subprocess.run calls with an absolute command
path resolved via shutil.which('curl') (and fail early with a clear logger.error
if None), and change the subprocess.run invocations that build the PR and
PR-comments files to pass capture_output=True and text=True so you can log
result.stderr/result.stdout when result.returncode != 0; update the two
locations that reference pr_url/pr_comments_url and artifact_path /
CI_METADATA_DIRNAME to use the resolved curl_path variable and include the
captured stderr in the logger.warning/error messages to surface why the download
failed.
projects/core/dsl/task.py (1)

79-83: Duplicate error logging with runtime.py may cause confusing output.

Both task.py (lines 80-81) and runtime.py (lines 146-147) now log similar error details when a task fails. Since the wrapper in task.py re-raises the exception and runtime.py catches it, the same failure will be logged twice with slightly different formats:

  • Here: ==> TASK FAILED: {task_name} - {e.__class__.__name__}: {e}
  • In runtime.py: ==> EXECUTION FAILED for {task_name}: {e.__class__.__name__}: {e}

Consider removing the error logging from one location to avoid duplicate messages in the execution log.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/dsl/task.py` around lines 79 - 83, The except block in the task
wrapper is duplicating error logs already emitted by runtime.py; in the except
handler that references task_name and func, remove (or demote to debug) the two
logger.error calls that emit "==> TASK FAILED: ..." and "==> Task description:
...", leaving the raise so runtime.py can perform the single centralized failure
log; locate the except Exception as e block in the task wrapper (the code using
task_name and func.__doc__) and delete or change those logger.error lines
accordingly.
projects/core/dsl/shell.py (1)

106-108: Overriding error.args is unconventional but achieves the goal.

The standard CalledProcessError.args is (returncode, cmd), but overriding it to (error_msg,) changes what str(error) returns. This works for logging purposes, but note that any code inspecting error.args[0] expecting an integer return code will get the string message instead.

Alternative: Subclass or use exception chaining

A cleaner approach would be to raise a new exception with the enhanced message while preserving the original:

-            error = subprocess.CalledProcessError(result.returncode, command, result.stdout, result.stderr)
-            error.args = (error_msg,)
-            raise error
+            error = subprocess.CalledProcessError(result.returncode, command, result.stdout, result.stderr)
+            raise RuntimeError(error_msg) from error

This preserves the original exception while providing the enhanced message.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/dsl/shell.py` around lines 106 - 108, Don't override
subprocess.CalledProcessError.args; instead preserve the original
CalledProcessError and raise a new exception that carries the enhanced message
while retaining the original as the cause. Specifically, keep the original
CalledProcessError created from result.returncode, command, result.stdout,
result.stderr (the variable named error) and then raise a new exception (e.g.,
RuntimeError or a small custom exception) with error_msg using "raise ... from
error" so callers can still inspect error.returncode/args via the original error
variable; reference subprocess.CalledProcessError, the local names error,
result, command, and error_msg to locate where to change the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/core/ci_entrypoint/prepare_ci.py`:
- Around line 382-383: system_prechecks() writes the FORGE git version into
artifact_path / CI_METADATA_DIRNAME / "forge.git_version" but the
CI_METADATA_DIRNAME directory may not exist yet (parse_and_save_pr_arguments()
creates it later), causing FileNotFoundError; fix by ensuring the directory
exists before writing (e.g., call mkdir(parents=True, exist_ok=True) on
artifact_path / CI_METADATA_DIRNAME) inside system_prechecks() (or in a shared
helper invoked before the write), so the write in system_prechecks() succeeds
regardless of prepare() call order.

In `@projects/core/dsl/shell.py`:
- Around line 98-109: The CalledProcessError is being constructed with string
stdout/stderr while downstream code expects bytes (it calls .decode()), causing
AttributeError; to fix, preserve text=True in subprocess.run but when creating
the CalledProcessError in shell.py (the block that builds error_msg and does
subprocess.CalledProcessError(...)), pass bytes for stdout/stderr (e.g., encode
result.stdout/result.stderr) so CalledProcessError.stderr/stdout are bytes, or
alternatively change the subprocess.run calls to text=False so outputs are bytes
consistently; update the CalledProcessError construction in the error-handling
path in shell.py (the raise of subprocess.CalledProcessError) to match whichever
approach you choose.
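The first option (keep `text=True`, encode at the raise site) could look like this; `raise_called_process_error` is a hypothetical helper name, and `result` mimics a `subprocess.run(..., text=True)` result:

```python
import subprocess

def raise_called_process_error(result, command):
    # result.stdout/result.stderr are str because subprocess.run was
    # called with text=True; encode them to bytes so downstream code
    # that calls .decode() on the exception's streams keeps working.
    stdout = result.stdout.encode() if isinstance(result.stdout, str) else result.stdout
    stderr = result.stderr.encode() if isinstance(result.stderr, str) else result.stderr
    raise subprocess.CalledProcessError(result.returncode, command, stdout, stderr)
```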

---

Duplicate comments:
In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py`:
- Around line 186-214: The artifact-collector tasks retrieve_job_logs and
capture_final_job_status must run even if earlier tasks fail; add the `@always`
decorator (from the task framework used) above both the retrieve_job_logs and
capture_final_job_status function definitions so they execute regardless of
prior task failures, leaving their bodies unchanged and preserving
stdout_dest/check args.
- Around line 146-148: The wait_for_job_completion function currently uses a
hardcoded `@retry(attempts=120, delay=30, backoff=1.0)` decorator so
args.timeout_minutes and args.poll_interval_seconds are ignored; remove the
retry decorator on wait_for_job_completion and replace it with an explicit
polling loop that computes a deadline using time.monotonic() +
args.timeout_minutes*60, calls the existing job status check and existing
success/failure handling inside the loop, sleeps using
args.poll_interval_seconds between polls, and when the deadline is exceeded
raise a TimeoutError with a message referencing args.final_job_name and
args.timeout_minutes.
- Around line 82-86: The generated job name may include characters invalid for
DNS-1123 labels (e.g., underscores from args.project); update the logic that
assigns args.final_job_name (the block using args.job_name, args.project, and
datetime.now().strftime) to normalize the interpolated name: lowercase it,
replace any character not a-z0-9 or hyphen with a hyphen (e.g., convert
underscores to hyphens), collapse consecutive hyphens, strip leading/trailing
hyphens, and truncate to 63 characters (DNS-1123 label max) while ensuring it is
not empty (provide a safe default like "forge"). Implement this normalization
using a simple regex-based sanitizer before assigning args.final_job_name.
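The full normalization described there (lowercase, replace invalid characters, collapse hyphen runs, strip edges, truncate to 63, fall back to a default) can be sketched as a single helper; `dns1123_label` is a hypothetical name:

```python
import re

def dns1123_label(name: str, default: str = "forge") -> str:
    # Lowercase and replace any run of characters outside [a-z0-9-]
    # (underscores included) with a single hyphen.
    slug = re.sub(r"[^a-z0-9-]+", "-", name.lower())
    # Collapse hyphen runs, strip edge hyphens, enforce the 63-char
    # DNS-1123 label limit, and never return an empty string.
    slug = re.sub(r"-{2,}", "-", slug).strip("-")
    return (slug or default)[:63].rstrip("-")
```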

---

Nitpick comments:
In `@projects/core/ci_entrypoint/prepare_ci.py`:
- Around line 396-411: Replace the bare "curl" invocation in the subprocess.run
calls with an absolute command path resolved via shutil.which('curl') (and fail
early with a clear logger.error if None), and change the subprocess.run
invocations that build the PR and PR-comments files to pass capture_output=True
and text=True so you can log result.stderr/result.stdout when result.returncode
!= 0; update the two locations that reference pr_url/pr_comments_url and
artifact_path / CI_METADATA_DIRNAME to use the resolved curl_path variable and
include the captured stderr in the logger.warning/error messages to surface why
the download failed.

In `@projects/core/dsl/shell.py`:
- Around line 106-108: Don't override subprocess.CalledProcessError.args;
instead preserve the original CalledProcessError and raise a new exception that
carries the enhanced message while retaining the original as the cause.
Specifically, keep the original CalledProcessError created from
result.returncode, command, result.stdout, result.stderr (the variable named
error) and then raise a new exception (e.g., RuntimeError or a small custom
exception) with error_msg using "raise ... from error" so callers can still
inspect error.returncode/args via the original error variable; reference
subprocess.CalledProcessError, the local names error, result, command, and
error_msg to locate where to change the code.

In `@projects/core/dsl/task.py`:
- Around line 79-83: The except block in the task wrapper is duplicating error
logs already emitted by runtime.py; in the except handler that references
task_name and func, remove (or demote to debug) the two logger.error calls that
emit "==> TASK FAILED: ..." and "==> Task description: ...", leaving the raise
so runtime.py can perform the single centralized failure log; locate the except
Exception as e block in the task wrapper (the code using task_name and
func.__doc__) and delete or change those logger.error lines accordingly.

In `@projects/fournos_launcher/orchestration/submit.py`:
- Line 34: The logger.warning call in submit.py uses an unnecessary f-string
(logger.warning(f"Hello Fournos")) even though there are no placeholders; change
it to a plain string by removing the `f` prefix (use logger.warning("Hello
Fournos")) to avoid misleading use of f-strings and minor overhead.

In `@projects/fournos_launcher/toolbox/submit_and_wait/main.py`:
- Line 160: The print call uses an unnecessary f-string prefix: replace
print(f"Failed to get job status, retrying...") with a regular string literal
print("Failed to get job status, retrying...") in the submit/wait logic (the
print statement that logs retrying on failed job status retrieval) to remove the
redundant f prefix.
- Around line 245-248: The two print statements use unnecessary f-string
prefixes even though they contain no placeholders; update the prints in the
try/except block (the calls currently written as print(f"✅ FOURNOS job completed
successfully") and print(f"\n🚫 Operation interrupted by user")) to plain string
prints (remove the leading f) so they become print("✅ FOURNOS job completed
successfully") and print("\n🚫 Operation interrupted by user").
- Around line 21-22: Replace the implicit Optional hints for the parameters args
and variables_overrides with explicit PEP 484 Optional types: import Optional
(and List/Dict/Any as needed) from typing and change the annotations from "args:
list = None" and "variables_overrides: dict = None" to something like "args:
Optional[List[Any]] = None" and "variables_overrides: Optional[Dict[str, Any]] =
None" in the function signature (the parameters named args and
variables_overrides in main.py), ensuring the typing imports are added at the
top of the file.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e282bfdd-6a78-4882-b4e4-75f099cd40ea

📥 Commits

Reviewing files that changed from the base of the PR and between 9a63947 and 872074c.

📒 Files selected for processing (15)
  • docs/ci/run_ci.md
  • docs/orchestration/ci.md
  • docs/orchestration/cli.md
  • projects/core/ci_entrypoint/prepare_ci.py
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/dsl/runtime.py
  • projects/core/dsl/shell.py
  • projects/core/dsl/task.py
  • projects/core/library/config.py
  • projects/core/notifications/send.py
  • projects/fournos_launcher/orchestration/ci.py
  • projects/fournos_launcher/orchestration/cli.py
  • projects/fournos_launcher/orchestration/config.yaml
  • projects/fournos_launcher/orchestration/submit.py
  • projects/fournos_launcher/toolbox/submit_and_wait/main.py
✅ Files skipped from review due to trivial changes (5)
  • projects/fournos_launcher/orchestration/config.yaml
  • projects/core/notifications/send.py
  • docs/orchestration/cli.md
  • docs/orchestration/ci.md
  • docs/ci/run_ci.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • projects/core/ci_entrypoint/run_ci.py
  • projects/fournos_launcher/orchestration/cli.py
  • projects/core/library/config.py
  • projects/fournos_launcher/orchestration/ci.py

Comment thread projects/core/ci_entrypoint/prepare_ci.py Outdated
Comment thread projects/core/dsl/shell.py Outdated
@kpouget
Contributor Author

kpouget commented Apr 7, 2026

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false

@topsail-bot

topsail-bot bot commented Apr 7, 2026

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 02 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos llm_d hello world
/cluster psap
/var something.is.wrong: false

Failure indicator:

/logs/artifacts/001__submit_and_wait/FAILURE | TaskExecutionError: Task 'submit_fournos_job' failed: CalledProcessError: Command 'oc apply -f /logs/artifacts/001__submit_and_wait/src/forge-llm_d-20260407-154640-manifest.yaml' returned non-zero exit status 1.
Task description: Submit the FOURNOS job
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 87, in execute_tasks
    _execute_single_task(task_info, args)
  File "/app/forge/projects/core/dsl/runtime.py", line 179, in _execute_single_task
    raise task_error from None
projects.core.dsl.runtime.TaskExecutionError: Task 'submit_fournos_job' failed: CalledProcessError: Command 'oc apply -f /logs/artifacts/001__submit_and_wait/src/forge-llm_d-20260407-154640-manifest.yaml' returned non-zero exit status 1.
Task description: Submit the FOURNOS job


@openshift-ci

openshift-ci bot commented Apr 7, 2026

@kpouget: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/fournos | 241e31c | link | true | /test fournos |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
projects/core/dsl/runtime.py (1)

26-26: Use explicit None union for optional parameter.

Per PEP 484, parameters with None default should use explicit union type annotation.

♻️ Proposed fix
-    def __init__(self, task_name: str, task_description: str, original_exception: Exception, task_args: dict = None):
+    def __init__(self, task_name: str, task_description: str, original_exception: Exception, task_args: dict | None = None):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/dsl/runtime.py` at line 26, The __init__ signature currently
sets task_args: dict = None which violates PEP 484 for optional defaults; update
the parameter annotation in the __init__ method to use an explicit None union
(e.g., task_args: Optional[dict] = None or task_args: dict | None = None) and
import typing.Optional if using Optional so the type clearly indicates the
parameter may be None; keep the parameter name and default value unchanged and
only change the type annotation in the __init__ definition.
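
A minimal sketch of the annotated signature, assuming the attribute assignments shown here (the signature is taken from the diff above; everything else is illustrative) — only the `Optional` annotation is the point:

```python
from typing import Optional


class TaskExecutionError(Exception):
    """Sketch of the exception with a PEP 484-compliant optional parameter."""

    def __init__(self, task_name: str, task_description: str,
                 original_exception: Exception,
                 task_args: Optional[dict] = None):  # explicit None union
        super().__init__(f"Task '{task_name}' failed: {original_exception}")
        self.task_name = task_name
        self.task_description = task_description
        self.original_exception = original_exception
        self.task_args = task_args or {}  # normalize the None default
```

On Python 3.10+, `task_args: dict | None = None` expresses the same union without the import.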
projects/core/library/ci.py (2)

12-15: Remove unused inspect import.

The inspect module is imported but not used anywhere in this file.

♻️ Proposed fix
 import traceback
 import logging
-import inspect
 import os
 import yaml
 from pathlib import Path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/library/ci.py` around lines 12 - 15, Remove the unused import
"inspect" from the top of the file: delete the "import inspect" statement so
only used modules (os, yaml, Path) remain; verify there are no other references
to the inspect symbol in this module and run linters/formatters to confirm no
unused-import warnings remain.

123-125: Move subprocess import to module level.

Imports inside functions add minor overhead and deviate from Python conventions. Move this to the top of the file with other imports.

♻️ Proposed fix

At top of file (around line 10):

import subprocess

Then remove line 124:

     # Only include original_error for exceptions we haven't structurally parsed
-    import subprocess
     if not isinstance(actual_exception, subprocess.CalledProcessError):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/library/ci.py` around lines 123 - 125, The subprocess import is
inside a function near the check "if not isinstance(actual_exception,
subprocess.CalledProcessError)"; move "import subprocess" up to the module-level
imports with the other imports, remove the local import statement inside the
function, and run a quick grep for any other local "subprocess" imports to avoid
duplicates or name conflicts so the isinstance check against
subprocess.CalledProcessError continues to work.
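
As a hedged illustration of why the module-level import matters for the `isinstance` check, the helper below is a simplified stand-in for the error-summary logic, not the actual `ci.py` function:

```python
import subprocess  # module-level, so the isinstance check below always works


def summarize_exception(exc: Exception) -> dict:
    """Summarize an exception; add the raw message only when it is not a
    CalledProcessError that has already been structurally parsed."""
    summary = {"type": type(exc).__name__}
    if isinstance(exc, subprocess.CalledProcessError):
        summary["command"] = exc.cmd
        summary["returncode"] = exc.returncode
    else:
        summary["original_error"] = str(exc)
    return summary
```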
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/core/dsl/runtime.py`:
- Around line 96-98: The except TaskExecutionError block currently re-raises
immediately and prevents the always-tasks block from running; change it to
capture and store the TaskExecutionError into the same exception variable used
for other errors (the one set in the other except path) instead of raising, let
the existing always-tasks block execute, and after that block re-raise the
stored TaskExecutionError so cleanup always runs; reference the
TaskExecutionError except handler and the always-tasks block in runtime.py when
applying this change.

In `@projects/core/library/ci.py`:
- Around line 200-206: The error summary written by _write_error_summary_to_file
is being clobbered later by postchecks, so change the write target to a
non-conflicting file (e.g., write the detailed per-operation summary to
"FAILURES_DETAILED" instead of "FAILURES") to preserve the structured summary;
update the function _write_error_summary_to_file to use the new filename and
ensure any consumers that need the consolidated view (postchecks) either
read/merge FAILURES_DETAILED or continue writing its own aggregated FAILURES
without overwriting the detailed file (alternatively, if you prefer to keep the
same name, modify postchecks to open FAILURES in append mode or merge existing
contents before writing).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 09d17a6d-f7a3-4ff2-a58e-bcedb16788de

📥 Commits

Reviewing files that changed from the base of the PR and between 872074c and 241e31c.

📒 Files selected for processing (4)
  • projects/core/dsl/runtime.py
  • projects/core/dsl/shell.py
  • projects/core/dsl/task.py
  • projects/core/library/ci.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • projects/core/dsl/task.py
  • projects/core/dsl/shell.py

Comment thread projects/core/dsl/runtime.py Outdated
Comment on lines +96 to +98
except TaskExecutionError:
# Re-raise the TaskExecutionError to preserve context
raise

⚠️ Potential issue | 🔴 Critical

Critical: TaskExecutionError bypasses always-tasks execution.

Re-raising TaskExecutionError at line 98 exits the function before the always-tasks block (lines 113-122) executes. This breaks the guaranteed cleanup pattern that other exceptions follow (lines 108-109 store the error so the always-tasks still run).

🐛 Proposed fix to preserve always-tasks execution
-        except TaskExecutionError:
-            # Re-raise the TaskExecutionError to preserve context
-            raise
+        except TaskExecutionError as e:
+            # Store the TaskExecutionError to preserve context after always tasks
+            execution_error = e
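
The store-then-reraise pattern the review asks for can be sketched as follows (a simplified stand-in for `execute_tasks`, not the actual runtime code):

```python
def execute_tasks(tasks, always_tasks):
    """Run tasks in order; guarantee always-tasks run even on failure."""
    execution_error = None
    for task in tasks:
        try:
            task()
        except Exception as e:  # stands in for TaskExecutionError
            execution_error = e  # store instead of raising immediately
            break
    for cleanup in always_tasks:
        cleanup()  # cleanup runs whether or not a task failed
    if execution_error is not None:
        raise execution_error  # re-raise after cleanup, context preserved
```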

Comment thread projects/core/library/ci.py Outdated
@kpouget
Contributor Author

kpouget commented Apr 7, 2026

merging the docs update

@kpouget kpouget merged commit 210aa00 into openshift-psap:main Apr 7, 2026
2 checks passed