
[core] notifications: new component #2

Merged
kpouget merged 18 commits into openshift-psap:main from kpouget:jump-ci
Mar 17, 2026

Conversation

@kpouget
Contributor

@kpouget kpouget commented Mar 16, 2026

  • Add GitHub notifications
  • Add Slack notifications
  • Add GA static checks (Python syntax)

Not covered:

  • Add secret support
  • Add config support
  • Make the jump-ci work

Summary by CodeRabbit

  • New Features

    • End-to-end notifications: GitHub PR comments and Slack messages for job completion (dry-run support); notifications reflect explicit finish reasons.
    • New toolbox & CI utilities: dynamic role runner, artifact export to S3, parallel runs, thread-local artifact directories, sizing helpers, and a YAML-driven config system.
  • CI / Workflows

    • Added Python syntax and repository consistency checks with a minimal Ubuntu prep step.
    • Jump CI: explicit lock/prepare/unlock lifecycle and updated PR test directive parsing.
  • Documentation

    • Added Notifications Integration guide.

@coderabbitai

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a TOPSAIL notifications subsystem (GitHub PR comments + Slack), JWT generation, notification orchestration and CI integration; new legacy library tooling (ansible toolbox, run/export/config/env/sizing), Jump CI lifecycle refactor, import path updates, and new CI workflows and helper action.

Changes

Cohort / File(s) Summary
GitHub notification code
projects/core/notifications/github/api.py, projects/core/notifications/github/gen_jwt.py
New GitHub client and JWT generator: build App JWT, fetch installation ID and installation access token, post PR comments, fetch PR metadata with fault-tolerant fallback, CLI/test entry paths, and error logging.
Slack notification code
projects/core/notifications/slack/api.py
New Slack helper: client init, paginated channel history search for anchor messages, threaded or top-level posting, CHANNEL_ID/MAX_CALLS constants, and SlackApiError handling.
Notification orchestration & messages
projects/core/notifications/send.py
New orchestrator: secret discovery, CI context detection (PR/periodic/Perflab), message builders for GitHub/Slack/CPT, send flows with dry-run support, and extensive validation/error handling.
CI integration & entrypoint
projects/core/ci_entrypoint/prepare_ci.py, projects/core/ci_entrypoint/run_ci.py
Integrates notifications into CI: adds FinishReason enum, NOTIFICATIONS_AVAILABLE flag, send_notification API, updates postchecks/run flow to use FinishReason and attempt non-fatal notifications.
Docs & package init
projects/core/notifications/README.md, projects/core/notifications/__init__.py, projects/core/notifications/github/__init__.py, projects/core/notifications/slack/__init__.py
Adds README describing integration and secrets; initializes notifications package namespaces.
Workflows & actions
.github/steps/prepare_minimal_ubuntu_image/action.yml, .github/workflows/check_consistency.yml, .github/workflows/check_python_syntax.yml
Adds composite action to prepare minimal Ubuntu image and workflows for repository consistency and Python syntax checks.
PR command handling
projects/core/ci_entrypoint/github/pr_args.py
Extends /test directive parsing for jump-ci: parses cluster, target_project, remaining args; updates documentation/examples.
Jump CI lifecycle & imports
projects/jump_ci/orchestration/ci.py, projects/jump_ci/testing/*, projects/jump_ci/toolbox/jump_ci.py
Refactors Jump CI orchestration to lifecycle methods delegating to external test utilities; multiple testing scripts change imports from core to legacy library.
Legacy library additions (large)
projects/legacy/library/__init__.py, .../ansible_toolbox.py, .../common.py, .../config.py, .../env.py, .../export.py, .../run.py, .../sizing.py, .../sizing.machines
Major new tooling: dynamic toolbox and RunAnsibleRole runner, YAML-driven Config with presets/overrides, thread-local ARTIFACT_DIR and TempArtifactDir, run-tooling and Parallel executor, export CLI, sizing utilities and machines catalog, logging helpers and merge_dicts utility. Significant new classes/functions across many files.

Sequence Diagram(s)

sequenceDiagram
    participant CI as CI (prepare_ci / run_ci)
    participant Orch as Orchestrator (send.py)
    participant JWT as JWT Generator (gen_jwt.py)
    participant GitHub as GitHub API (github/api.py)
    participant Slack as Slack API (slack/api.py)

    CI->>Orch: send_job_completion_notification(finish_reason, status,...)
    Orch->>Orch: determine org/repo, pr_number, secrets, build message
    alt GitHub enabled
        Orch->>JWT: generate_encoded_jwt(pem, client_id)
        JWT-->>Orch: app_jwt
        Orch->>GitHub: get_user_token(pem_file, client_id, org, repo)
        GitHub-->>Orch: user_token
        Orch->>GitHub: send_notification(org, repo, user_token, pr_number, message)
        GitHub-->>Orch: comment_posted
    end
    alt Slack enabled
        Orch->>Slack: init_client(token)
        Slack-->>Orch: client
        Orch->>Slack: search_channel_message(client, anchor)
        Slack-->>Orch: thread_ts
        Orch->>Slack: send_message(client, formatted_message, thread_ts)
        Slack-->>Orch: post_result
    end
    Orch-->>CI: notification_result (non-fatal)
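The "GitHub enabled" branch of the diagram hinges on the App JWT that gen_jwt.py produces. The sketch below illustrates only the three-part JWT structure and the claims GitHub Apps expect (`iat`, `exp`, `iss`); it is not the repository's actual code. Real GitHub App JWTs are RS256-signed with the App's PEM private key, whereas this illustration uses stdlib HMAC-SHA256 so it runs without any key file or third-party dependency.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url encoding for all three segments
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def generate_encoded_jwt(secret: bytes, client_id: str) -> str:
    header = {"alg": "HS256", "typ": "JWT"}  # real GitHub App JWTs use RS256
    now = int(time.time())
    payload = {
        "iat": now - 60,      # backdated to tolerate clock drift
        "exp": now + 9 * 60,  # GitHub caps App JWT lifetime at 10 minutes
        "iss": client_id,     # the App identifier
    }
    signing_input = (
        b64url(json.dumps(header).encode())
        + "."
        + b64url(json.dumps(payload).encode())
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)

token = generate_encoded_jwt(b"demo-secret", "12345")
```

The resulting token is then exchanged for an installation access token via the `/app/installations/{id}/access_tokens` endpoint, as the diagram's `get_user_token` step shows.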

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Poem

🐰 I hopped through secrets, tokens, and queues,
I stitched PR notes and threaded Slack news,
A JWT snug beneath my paw so neat,
Comments and threads now dance on repeat,
Hooray — notifications delivered with a hop and a beat!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 17.46%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title Check: ✅ Passed. The title '[core] notifications: new component' clearly describes the main change: the addition of a new notifications component to the core module.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (7)
projects/core/notifications/slack/api.py (2)

55-55: Use explicit Optional type annotation.

Per PEP 484, implicit Optional (using = None without the type hint) is prohibited. Use explicit union syntax.

Proposed fix
-def send_message(client, message: str, main_ts: str = None):
+def send_message(client, message: str, main_ts: str | None = None):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/slack/api.py` at line 55, The function signature
for send_message uses a default None without an explicit Optional type; update
the annotation for main_ts to be Optional[str] (or explicitly Union[str, None])
so it conforms to PEP 484. Modify the send_message definition to import
typing.Optional if not already imported and change the parameter annotation from
main_ts: str = None to main_ts: Optional[str] (or main_ts: Union[str, None])
while keeping the default value = None and preserving the function name
send_message and its parameter order.

9-10: Hardcoded channel ID limits flexibility.

Consider making CHANNEL_ID configurable (e.g., via environment variable or parameter) to support different channels for different notification types or environments.

CHANNEL_ID = os.environ.get("TOPSAIL_SLACK_CHANNEL_ID", "C07NS5TAKPA")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/slack/api.py` around lines 9 - 10, Replace the
hardcoded CHANNEL_ID constant with a configurable value: read CHANNEL_ID from an
environment variable (e.g., TOPSAIL_SLACK_CHANNEL_ID) with the existing value as
a sensible default, and ensure any functions that rely on CHANNEL_ID (or the
module-level constant) can accept an injected channel id if passed in; leave
MAX_CALLS untouched but update usages to reference the new CONFIGURABLE
CHANNEL_ID symbol or parameter where appropriate so different notification
types/environments can override the channel.
projects/core/notifications/send.py (5)

94-96: Function returns None on abort instead of False.

When abort is True, the function returns None (implicit), but the caller on line 60 checks the return value as a boolean. While this works (None is falsy), it's inconsistent with the explicit True/False returns elsewhere. Consider returning False explicitly.

     if abort:
         logging.error("github: Aborting due to previous error(s).")
-        return
+        return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 94 - 96, The abort branch
in the send function returns implicitly None which is inconsistent with the
function's explicit boolean returns; update the abort handling (the if abort:
block) to return False explicitly (replace the implicit return with return
False) so callers that expect a boolean from send (checked around the caller at
line 60) receive a consistent False value; ensure you still log the error with
logging.error("github: Aborting due to previous error(s).") before returning.

22-22: Multiple statements on one line.

Per static analysis, avoid placing multiple statements on a single line with a colon.

Proposed fix
     for secret_env_key in SECRET_ENV_KEYS:
-        if os.environ.get(secret_env_key): break
+        if os.environ.get(secret_env_key):
+            break
         warn.append(f"{secret_env_key} not defined, cannot access the Github secrets")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` at line 22, The single-line compound
statement "if os.environ.get(secret_env_key): break" should be split into two
lines to comply with style rules; locate the loop in
projects/core/notifications/send.py where secret_env_key is checked (the line
using os.environ.get(secret_env_key) and a break) and rewrite it so the if
condition is on its own line and the break is on the following indented line,
preserving the same logic and any surrounding control flow.

71-71: Inconsistent return value semantics.

The function returns True on line 57 when secrets are missing, and failed (boolean) on line 71. However, failed=True indicates failure while the early return also returns True. This makes the return value semantics confusing—does True mean success or failure?

Consider using consistent return semantics (e.g., return False for success, True for failure, or invert to return success boolean).
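A minimal sketch of one consistent convention (return value always means success). The function shape and parameter names are assumptions standing in for the real send.py state, not the actual code:

```python
# Sketch (assumed shape, not the real send.py function): pick one convention --
# here, the return value always means "success" -- and apply it on every path.
def send_github_notification(secrets_available: bool, failed: bool) -> bool:
    if not secrets_available:
        return False  # the early exit is a failure, not a success
    return not failed  # invert the internal "failed" flag on the normal path
```

With this convention, the caller's `if not send_github_notification(...)` check reads naturally on every exit path.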

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` at line 71, The function currently
returns True on the early exit when secrets are missing and later returns the
boolean variable failed (where True indicates failure), creating inconsistent
semantics; update the function so the return value meaning is consistent across
all paths (choose either success=True or failure=True), e.g., flip the early
return to return False when secrets are missing if you adopt failure=True for
failed, or invert the final return to return not failed if you adopt
success=True; adjust the early-return point that currently returns True and the
final "return failed" so both follow the same convention (refer to the variable
failed and the early secret-check return site to make the change).

101-103: Remove extraneous f prefixes from strings without placeholders.

These f-strings have no placeholders and should be regular strings.

-        logging.info(f"***")
-        logging.info(f"***")
-        logging.info(f"***\n")
+        logging.info("***")
+        logging.info("***")
+        logging.info("***\n")

Similar fixes needed at lines 149-151, 266, 280-282, 333, and 406.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 101 - 103, The log
statements in projects/core/notifications/send.py use unnecessary f-strings
(e.g., logging.info(f"***")) where there are no placeholders; change these to
plain strings (logging.info("***")) for the occurrences shown in the diff and
the similar sites referenced (lines ~149-151, 266, 280-282, 333, 406) to remove
the redundant f prefix while keeping the logging calls and message content
identical.

212-230: Unused variables org and repo.

The unpacked variables org and repo on line 215 are never used in get_slack_channel_message. Either remove the call or prefix with underscores if intentionally unused.

-    org, repo = get_org_repo()
+    _org, _repo = get_org_repo()  # Currently unused, retained for future use

Or simply remove the line if not needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 212 - 230, In
get_slack_channel_message, the unpacked variables org and repo from
get_org_repo() are unused; remove the unnecessary call or rename the variables
to indicate intentional unusedness (e.g., _org, _repo) to silence linter
warnings. Locate the get_slack_channel_message function and either delete the
"org, repo = get_org_repo()" line entirely or change it to "_org, _repo =
get_org_repo()" so the behavior remains the same but the unused-variable issue
is resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/core/notifications/github/api.py`:
- Line 25: The requests.get call that assigns installation_resp is missing a
timeout, causing potential indefinite hangs; update that call and the other
requests calls in this module (the ones around lines where installation_resp,
the subsequent requests.get/requests.post calls are used) to include a sensible
timeout parameter (e.g., timeout=5 or a configured constant) so network stalls
fail fast; ensure you add the same timeout argument to the requests calls
referenced in this file (installation_resp and the other request invocations
around the subsequent blocks at lines noted in the review) to consistently
prevent blocking.
- Around line 36-40: Ensure you validate the access token HTTP response before
reading the token: check access_token_resp.status_code (or call
access_token_resp.raise_for_status()) and verify the JSON contains the "token"
key, and if not raise/log a clear error that includes installation_id and the
response content; update the code around access_token_resp, user_token and
BASE_URL/installation_id usage to handle HTTP errors and missing "token" keys
gracefully (e.g., raise a descriptive exception or return an error).
- Around line 56-70: fetch_pr_data currently returns 0 on HTTP error which
becomes an oldest=0 in Slack searches and can match entire history; change
fetch_pr_data to return (None, None) for the timestamp when the request fails
(replace the 0 return), and update the caller that invokes
search_channel_message (the code that calls fetch_pr_data and then calls
search_channel_message) to pass not_before=None when the returned pr_created_at
is None (or explicitly use the current timestamp if you prefer a safer fallback)
so search_channel_message/Slack oldest is not set to unix epoch.

In `@projects/core/notifications/send.py`:
- Around line 259-264: When pr_number is falsy the variable pr_data is never
assigned yet the block that sets channel_message calls
get_slack_channel_message(anchor, pr_data) when is_periodic is False, causing an
UnboundLocalError; fix by ensuring pr_data is defined in all code paths (e.g.,
initialize pr_data = None before the pr_number conditional) or guard the call so
get_slack_channel_message is only invoked when pr_data is available (adjust the
channel_msg_ts/is_periodic logic to avoid referencing pr_data when pr_number is
falsy), referencing the variables pr_number, pr_data, channel_msg_ts,
is_periodic and the function get_slack_channel_message to locate the change.
- Line 468: The call assigns the whole tuple returned by slack_api.send_message
into channel_msg_ts, so main_ts becomes invalid; update the call to unpack the
tuple returned by slack_api.send_message (e.g., channel_msg_ts, channel_ok =
slack_api.send_message(client, message=channel_message) or assign only the first
element) so channel_msg_ts holds the timestamp (ts) value used later when
calling slack_api.send_message for the main thread.
- Around line 8-9: Update the incorrect import paths: replace imports of
projects.repo.toolbox.notifications.github.api and
projects.repo.toolbox.notifications.slack.api with the correct modules
projects.core.notifications.github.api and projects.core.notifications.slack.api
so that github_api and slack_api resolve properly (update the import statements
at the top of send.py where the github_api and slack_api modules are imported).
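The pr_data finding above can be sketched as follows. The variable and parameter names come from the review comment, not from the actual send.py; the point is only that initializing `pr_data` before the conditional makes every later branch safe:

```python
# Sketch of the UnboundLocalError fix (names from the review, not the actual
# send.py): define pr_data up front so no branch references an unbound name.
def build_channel_message(pr_number, is_periodic, anchor):
    pr_data = None  # defined on every code path
    if pr_number:
        pr_data = {"number": pr_number}  # stand-in for the real PR metadata fetch
    if not is_periodic and pr_data is not None:
        return f"{anchor} (PR #{pr_data['number']})"
    return anchor
```

When `pr_number` is falsy, the guard simply skips the PR-specific branch instead of raising.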

---

Nitpick comments:
In `@projects/core/notifications/send.py`:
- Around line 94-96: The abort branch in the send function returns implicitly
None which is inconsistent with the function's explicit boolean returns; update
the abort handling (the if abort: block) to return False explicitly (replace the
implicit return with return False) so callers that expect a boolean from send
(checked around the caller at line 60) receive a consistent False value; ensure
you still log the error with logging.error("github: Aborting due to previous
error(s).") before returning.
- Line 22: The single-line compound statement "if
os.environ.get(secret_env_key): break" should be split into two lines to comply
with style rules; locate the loop in projects/core/notifications/send.py where
secret_env_key is checked (the line using os.environ.get(secret_env_key) and a
break) and rewrite it so the if condition is on its own line and the break is on
the following indented line, preserving the same logic and any surrounding
control flow.
- Line 71: The function currently returns True on the early exit when secrets
are missing and later returns the boolean variable failed (where True indicates
failure), creating inconsistent semantics; update the function so the return
value meaning is consistent across all paths (choose either success=True or
failure=True), e.g., flip the early return to return False when secrets are
missing if you adopt failure=True for failed, or invert the final return to
return not failed if you adopt success=True; adjust the early-return point that
currently returns True and the final "return failed" so both follow the same
convention (refer to the variable failed and the early secret-check return site
to make the change).
- Around line 101-103: The log statements in projects/core/notifications/send.py
use unnecessary f-strings (e.g., logging.info(f"***")) where there are no
placeholders; change these to plain strings (logging.info("***")) for the
occurrences shown in the diff and the similar sites referenced (lines ~149-151,
266, 280-282, 333, 406) to remove the redundant f prefix while keeping the
logging calls and message content identical.
- Around line 212-230: In get_slack_channel_message, the unpacked variables org
and repo from get_org_repo() are unused; remove the unnecessary call or rename
the variables to indicate intentional unusedness (e.g., _org, _repo) to silence
linter warnings. Locate the get_slack_channel_message function and either delete
the "org, repo = get_org_repo()" line entirely or change it to "_org, _repo =
get_org_repo()" so the behavior remains the same but the unused-variable issue
is resolved.

In `@projects/core/notifications/slack/api.py`:
- Line 55: The function signature for send_message uses a default None without
an explicit Optional type; update the annotation for main_ts to be Optional[str]
(or explicitly Union[str, None]) so it conforms to PEP 484. Modify the
send_message definition to import typing.Optional if not already imported and
change the parameter annotation from main_ts: str = None to main_ts:
Optional[str] (or main_ts: Union[str, None]) while keeping the default value =
None and preserving the function name send_message and its parameter order.
- Around line 9-10: Replace the hardcoded CHANNEL_ID constant with a
configurable value: read CHANNEL_ID from an environment variable (e.g.,
TOPSAIL_SLACK_CHANNEL_ID) with the existing value as a sensible default, and
ensure any functions that rely on CHANNEL_ID (or the module-level constant) can
accept an injected channel id if passed in; leave MAX_CALLS untouched but update
usages to reference the new CONFIGURABLE CHANNEL_ID symbol or parameter where
appropriate so different notification types/environments can override the
channel.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a26e973-04ad-4177-87ab-c7af5dff451b

📥 Commits

Reviewing files that changed from the base of the PR and between a1e2031 and 16c19fc.

📒 Files selected for processing (4)
  • projects/core/notifications/github/api.py
  • projects/core/notifications/github/gen_jwt.py
  • projects/core/notifications/send.py
  • projects/core/notifications/slack/api.py

Comment thread projects/core/notifications/github/api.py
Comment on lines +36 to +40
access_token_resp = requests.post(f"{BASE_URL}/app/installations/{installation_id}/access_tokens", headers=headers)

user_token = access_token_resp.json()["token"]

return user_token

⚠️ Potential issue | 🟠 Major

Missing error handling for access token response.

The code handles errors when fetching the installation ID (lines 27-32) but doesn't check if access_token_resp was successful before accessing ["token"]. This will raise a KeyError or return unexpected data on API failures.

Proposed fix
     access_token_resp = requests.post(f"{BASE_URL}/app/installations/{installation_id}/access_tokens", headers=headers, timeout=30)
 
+    if not access_token_resp.ok:
+        msg = f"Failed to get access token: {access_token_resp.text}"
+        logging.error(msg)
+        raise RuntimeError(msg)
+
     user_token = access_token_resp.json()["token"]
 
     return user_token
🧰 Tools
🪛 Ruff (0.15.6)

[error] 36-36: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/github/api.py` around lines 36 - 40, Ensure you
validate the access token HTTP response before reading the token: check
access_token_resp.status_code (or call access_token_resp.raise_for_status()) and
verify the JSON contains the "token" key, and if not raise/log a clear error
that includes installation_id and the response content; update the code around
access_token_resp, user_token and BASE_URL/installation_id usage to handle HTTP
errors and missing "token" keys gracefully (e.g., raise a descriptive exception
or return an error).

Comment thread projects/core/notifications/send.py Outdated
Comment on lines +8 to +9
import projects.repo.toolbox.notifications.github.api as github_api
import projects.repo.toolbox.notifications.slack.api as slack_api

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the import paths exist
fd -t f "api.py" --full-path ".*notifications.*(github|slack).*" | head -20

Repository: openshift-psap/topsail-ng

Length of output: 189


🏁 Script executed:

#!/bin/bash
# Find all api.py files in the repository
find . -name "api.py" -type f | grep -i notif

# Also check the repository structure for notifications
find . -type d -name "notifications" | head -20

Repository: openshift-psap/topsail-ng

Length of output: 185


🏁 Script executed:

#!/bin/bash
# Check the actual imports in send.py
head -20 ./projects/core/notifications/send.py

# Check if projects.repo.toolbox directory exists
find . -type d -path "*/projects/repo/toolbox*" | head -10

# Also check the file structure of projects directory
ls -la ./projects/ | head -20

Repository: openshift-psap/topsail-ng

Length of output: 958


🏁 Script executed:

#!/bin/bash
# Check if projects.repo exists
ls -la ./projects/repo/ 2>/dev/null || echo "projects/repo directory does NOT exist"

# Also verify the actual module structure
ls -la ./projects/core/notifications/

Repository: openshift-psap/topsail-ng

Length of output: 392


Fix incorrect import paths.

Lines 8-9 import from projects.repo.toolbox.notifications, but this module path does not exist. The actual API modules are located at projects.core.notifications. Update imports to:

import projects.core.notifications.github.api as github_api
import projects.core.notifications.slack.api as slack_api
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 8 - 9, Update the incorrect
import paths: replace imports of projects.repo.toolbox.notifications.github.api
and projects.repo.toolbox.notifications.slack.api with the correct modules
projects.core.notifications.github.api and projects.core.notifications.slack.api
so that github_api and slack_api resolve properly (update the import statements
at the top of send.py where the github_api and slack_api modules are imported).

@kpouget
Contributor Author

kpouget commented Mar 17, 2026

/test jump-ci skeleton abcd 1234

1 similar comment


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
projects/core/ci_entrypoint/github/pr_args.py (1)

72-106: ⚠️ Potential issue | 🔴 Critical

jump-ci is parsing the documented directive shape incorrectly.

By the time this branch runs, the generic parser has already consumed cluster as project_name. For /test jump-ci cluster target_project, args only contains ["target_project"], so this path raises instead of populating jump_ci.{cluster,project,args}. Handle the jump-ci special case before the generic project_name pop, or read cluster from the already-popped value.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/github/pr_args.py` around lines 72 - 106, The
jump-ci branch is using project_name as if it were the cluster but the generic
parser already popped cluster into project_name; to fix, either handle the
jump-ci special-case before popping project_name or (simpler) treat project_name
as the cluster: when test_name == 'jump-ci' set cluster = project_name, require
args to have at least one element (target_project = args[0]), set remaining_args
= args[1:], and populate result['jump_ci.cluster'], result['jump_ci.project'],
and result['jump_ci.args'] accordingly (and adjust the ValueError and logging to
reflect the correct arity and values).
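The "simpler" option above can be sketched as a small standalone parser. This is a hypothetical illustration built from the names in the review, not the actual pr_args.py code:

```python
# Hypothetical sketch of the suggested fix: the generic parser has already
# popped the cluster name into project_name, so reuse it instead of popping
# a second token from args.
def parse_jump_ci_directive(project_name: str, args: list) -> dict:
    # Directive shape: /test jump-ci <cluster> <target_project> [extra args...]
    if not args:
        raise ValueError(
            "jump-ci expects: /test jump-ci <cluster> <target_project> [args...]"
        )
    return {
        "jump_ci.cluster": project_name,  # already consumed by the generic parser
        "jump_ci.project": args[0],
        "jump_ci.args": args[1:],
    }
```

For the PR's own test directive, `/test jump-ci skeleton abcd 1234`, this yields cluster `skeleton`, project `abcd`, and remaining args `["1234"]`.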
♻️ Duplicate comments (2)
projects/core/notifications/send.py (1)

8-9: ⚠️ Potential issue | 🔴 Critical

Use package-relative imports for the new notification modules.

send.py lives inside projects.core.notifications, but import github.api / import slack.api asks Python for top-level packages. The CI-side package import path will fail with ModuleNotFoundError here.

Suggested fix
-import github.api as github_api
-import slack.api as slack_api
+from .github import api as github_api
+from .slack import api as slack_api
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 8 - 9, The file send.py
uses top-level imports "github.api" and "slack.api" which will raise
ModuleNotFoundError in package contexts; update these to package-relative
imports inside projects.core.notifications (e.g., replace references to
github.api and slack.api in send.py with relative imports from the local package
such as importing the api modules from .github and .slack so the symbols used in
send.py continue to be referenced as github_api and slack_api).
projects/core/notifications/github/api.py (1)

25-38: ⚠️ Potential issue | 🟠 Major

Validate both GitHub API responses before reading id and token.

Neither request checks the HTTP status before indexing into the JSON payload. A 401/403/404 response will turn into a KeyError here, and the notification path loses the real GitHub failure reason.

Suggested fix
     installation_resp = requests.get(f"{BASE_URL}/repos/{org}/{repo}/installation", headers=headers)

-    try:
-        installation_id = installation_resp.json()["id"]
-    except Exception as e:
-        msg = f"Failed to get the user token: {e}"
-        logging.error(msg)
-        raise e
+    installation_resp.raise_for_status()
+    installation_id = installation_resp.json().get("id")
+    if not installation_id:
+        raise RuntimeError("GitHub installation response did not include an installation id")

     # Get the user token

     access_token_resp = requests.post(f"{BASE_URL}/app/installations/{installation_id}/access_tokens", headers=headers)

-    user_token = access_token_resp.json()["token"]
+    access_token_resp.raise_for_status()
+    user_token = access_token_resp.json().get("token")
+    if not user_token:
+        raise RuntimeError("GitHub access-token response did not include a token")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/github/api.py` around lines 25 - 38, The code
reads installation_resp.json()["id"] and access_token_resp.json()["token"]
without validating responses; update the logic in the GitHub installation/token
flow to check HTTP status and JSON keys before indexing: call
response.raise_for_status() or verify resp.ok for both installation_resp and
access_token_resp, log and raise a descriptive error including
response.status_code and response.text when the request fails, and guard
extraction of installation_id and user_token with KeyError handling (logging the
response JSON when expected keys are missing) so failures surface the real
GitHub error instead of producing a misleading KeyError.
🧹 Nitpick comments (1)
projects/core/ci_entrypoint/prepare_ci.py (1)

457-459: Log traceback for notification failures to keep postmortems actionable.

At Line 458, use logger.exception(...) instead of logger.error(...) inside the except block so failure root causes are preserved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/prepare_ci.py` around lines 457 - 459, In the
except Exception as e block that catches failures when sending notifications,
replace the logger.error(...) call with logger.exception(...) so the traceback
is recorded (keep the existing message or augment it for context); specifically
update the exception handling where logger is used in prepare_ci.py to call
logger.exception("Failed to send notifications", exc_info=True) or equivalent
via logger.exception(f"Failed to send notifications: {e}") to preserve the full
stacktrace for postmortems.
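As a quick standalone illustration of the difference (a sketch, not TOPSAIL code): `logger.exception(...)` records the active traceback, while `logger.error(...)` drops it.

```python
import io
import logging

# Capture log output in a buffer so the difference is visible.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
logger = logging.getLogger("notify-demo")
logger.addHandler(handler)
logger.setLevel(logging.ERROR)
logger.propagate = False  # keep the demo output out of the root logger

def send_notifications():
    raise RuntimeError("slack token missing")

try:
    send_notifications()
except Exception as e:
    # logger.error(f"...") would record only this one line;
    # logger.exception() also appends the full traceback.
    logger.exception(f"Failed to send notifications: {e}")

output = buffer.getvalue()
```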
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/check_python_syntax.yml:
- Around line 9-24: The workflow currently declares matrix.python-version but
never uses it, so the "Check that all the files can be compiled" step runs with
the runner's default Python; add a step that uses actions/setup-python (e.g.,
setup-python@v4) and set its python-version to the matrix value
(matrix.python-version) before the compile step so python -m compileall runs
under the intended Python interpreter; update the "Check that all the files can
be compiled" step to run after the setup-python step and keep using python -m
compileall to validate syntax for Python 3.9.

In `@projects/core/ci_entrypoint/prepare_ci.py`:
- Around line 425-433: The send_notification function currently types success as
str but uses it in a boolean context, causing truthy strings like "False" to be
treated as True; update the function signature (send_notification) to accept
success: bool (or coerce/parse the incoming success string to a boolean at the
start of the function) and then use that boolean in the notification_status
construction so that ("succeeded" if success else "failed") reflects the actual
boolean outcome; ensure any callers are adjusted to pass a bool or that you
explicitly convert known string values ("true"/"false", "1"/"0") to bool before
the conditional.
- Around line 444-450: The call to send_job_completion_notification uses an
invalid keyword `success=`; update the call to match the callee's signature by
replacing `success=success, status=notification_status` with `status=success,
reason=notification_status` while keeping github=github_notifications,
slack=slack_notifications, and dry_run=dry_run so the call to
send_job_completion_notification(status=..., reason=..., github=..., slack=...,
dry_run=...) matches the expected parameters.
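The truthy-string pitfall behind the first finding fits in a few lines; `parse_success` below is a hypothetical helper, not the actual prepare_ci.py code:

```python
def parse_success(value):
    """Coerce a CLI-style string into a real boolean.

    bool("False") is True because any non-empty string is truthy,
    so the value must be compared against known spellings instead.
    """
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "1", "yes")

def notification_status(success):
    # Mirrors the ("succeeded" if success else "failed") construction.
    return "succeeded" if parse_success(success) else "failed"
```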

In `@projects/core/notifications/send.py`:
- Around line 15-24: The bug is that SECRET_ENV_KEYS is defined as a string so
the for loop in get_secrets() iterates characters instead of environment
variable names; change SECRET_ENV_KEYS to a proper sequence (e.g., a singleton
tuple or list) containing "PSAP_TOPSAIL_JUMP_CI_SECRET_PATH" (e.g.,
SECRET_ENV_KEYS = ("PSAP_TOPSAIL_JUMP_CI_SECRET_PATH",) or
["PSAP_TOPSAIL_JUMP_CI_SECRET_PATH"]) so the loop over SECRET_ENV_KEYS in
get_secrets() checks the full env var name and properly sets
secret_env_key/warn.
- Around line 369-377: The get_slack_secrets function returns the raw file
contents which may include trailing newlines or whitespace causing Slack auth to
fail; change the return to strip surrounding whitespace from the token (e.g.,
use token_file.read_text().strip()) so the returned token has no
leading/trailing newlines or spaces before being passed to WebClient (reference
get_slack_secrets and SLACK_TOKEN_FILE).
- Around line 454-471: Check the result of the initial slack_api.send_message
call (the channel_ok boolean returned into channel_ok alongside channel_msg_ts)
and abort early if it failed: after calling slack_api.send_message in the block
that sets channel_msg_ts/channel_ok, if not channel_ok (and not dry_run) log a
fatal/error indicating the thread root could not be created and return True so
we do not attempt to post the summary as a separate top-level message; ensure
the later call that builds thread_message via get_slack_cpt_message and the
subsequent send_message with main_ts=channel_msg_ts only runs when channel_ok is
True (or in dry_run).
- Around line 87-94: The code currently calls
github_api.get_user_token(pem_file, client_id, org, repo) before honoring
dry_run, causing network/credential usage even in preview mode; move the dry_run
check to precede the token fetch so that if dry_run is True the function logs
the message via logging.info and returns True without calling
github_api.get_user_token (adjust surrounding control flow in the function that
contains user_token, github_api.get_user_token, dry_run, and logging.info).
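The SECRET_ENV_KEYS bug boils down to iterating a plain string; a minimal reproduction (variable names borrowed from the review, the lookup helper is hypothetical):

```python
# Buggy: a bare string iterates character by character.
SECRET_ENV_KEYS_BUGGY = "PSAP_TOPSAIL_JUMP_CI_SECRET_PATH"
buggy_keys = [key for key in SECRET_ENV_KEYS_BUGGY]

# Fixed: a singleton tuple iterates over whole names.
# (Note the trailing comma -- ("X") without it is still a string.)
SECRET_ENV_KEYS = ("PSAP_TOPSAIL_JUMP_CI_SECRET_PATH",)
fixed_keys = [key for key in SECRET_ENV_KEYS]

def get_secret_path(environ, keys=SECRET_ENV_KEYS):
    """Return the first defined secret-path variable, or None."""
    for key in keys:
        if key in environ:
            return environ[key]
    return None
```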

In `@projects/core/notifications/slack/api.py`:
- Around line 43-47: The current loop in
projects/core/notifications/slack/api.py uses a substring check (if
message_anchor not in message["text"]) which can falsely match "Thread for PR
`#12`" inside "Thread for PR `#123`"; change the match to require an exact
normalized header match or a boundary (e.g., check that
message["text"].strip().startswith(message_anchor) or that message_anchor
appears with a newline/prefix boundary) before returning message["ts"],
message["text"]; update the check around the variables message_anchor and
message["text"] in that loop so only exact/header-boundary matches are accepted.

In `@projects/jump_ci/orchestration/ci.py`:
- Around line 79-119: prepare/test/cleanup are currently calling
run_on_jump_ci("...") which returns a callable factory (per
testing.test.jump_ci) instead of executing the remote phase; update each method
to call the returned function with the emitted jump_ci.{cluster, project, args}
so the remote phase actually runs (e.g., replace ret = run_on_jump_ci("test_ci")
with ret_callable = run_on_jump_ci("test_ci"); ret =
ret_callable(jump_ci.cluster, jump_ci.project, jump_ci.args)), and ensure
cleanup uses "pre_cleanup_ci" similarly and returns the callable's integer
result instead of always returning 0.
- Around line 67-76: The method prepare_jump_ci(self) currently calls
prepare_jump_ci.prepare_jump_ci() which doesn't exist and will raise
AttributeError; update the call to invoke the real entrypoint prepare from the
module by replacing prepare_jump_ci.prepare_jump_ci() with
prepare_jump_ci.prepare(...) (or adjust the import so prepare refers to the
module's prepare function), keeping the surrounding try/except and logging
behavior intact so the correct function prepare in
projects/jump_ci/testing/prepare_jump_ci is executed.

---

Outside diff comments:
In `@projects/core/ci_entrypoint/github/pr_args.py`:
- Around line 72-106: The jump-ci branch is using project_name as if it were the
cluster but the generic parser already popped cluster into project_name; to fix,
either handle the jump-ci special-case before popping project_name or (simpler)
treat project_name as the cluster: when test_name == 'jump-ci' set cluster =
project_name, require args to have at least one element (target_project =
args[0]), set remaining_args = args[1:], and populate result['jump_ci.cluster'],
result['jump_ci.project'], and result['jump_ci.args'] accordingly (and adjust
the ValueError and logging to reflect the correct arity and values).

---

Duplicate comments:
In `@projects/core/notifications/github/api.py`:
- Around line 25-38: The code reads installation_resp.json()["id"] and
access_token_resp.json()["token"] without validating responses; update the logic
in the GitHub installation/token flow to check HTTP status and JSON keys before
indexing: call response.raise_for_status() or verify resp.ok for both
installation_resp and access_token_resp, log and raise a descriptive error
including response.status_code and response.text when the request fails, and
guard extraction of installation_id and user_token with KeyError handling
(logging the response JSON when expected keys are missing) so failures surface
the real GitHub error instead of producing a misleading KeyError.

In `@projects/core/notifications/send.py`:
- Around line 8-9: The file send.py uses top-level imports "github.api" and
"slack.api" which will raise ModuleNotFoundError in package contexts; update
these to package-relative imports inside projects.core.notifications (e.g.,
replace references to github.api and slack.api in send.py with relative imports
from the local package such as importing the api modules from .github and .slack
so the symbols used in send.py continue to be referenced as github_api and
slack_api).

---

Nitpick comments:
In `@projects/core/ci_entrypoint/prepare_ci.py`:
- Around line 457-459: In the except Exception as e block that catches failures
when sending notifications, replace the logger.error(...) call with
logger.exception(...) so the traceback is recorded (keep the existing message or
augment it for context); specifically update the exception handling where logger
is used in prepare_ci.py to call logger.exception("Failed to send
notifications", exc_info=True) or equivalent via logger.exception(f"Failed to
send notifications: {e}") to preserve the full stacktrace for postmortems.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cec69fed-65ac-4850-bb4b-7712c369e9de

📥 Commits

Reviewing files that changed from the base of the PR and between 16c19fc and 5c5c8f4.

📒 Files selected for processing (14)
  • .github/steps/prepare_minimal_ubuntu_image/action.yml
  • .github/workflows/check_consistency.yml
  • .github/workflows/check_python_syntax.yml
  • projects/core/ci_entrypoint/github/pr_args.py
  • projects/core/ci_entrypoint/prepare_ci.py
  • projects/core/notifications/README.md
  • projects/core/notifications/__init__.py
  • projects/core/notifications/github/__init__.py
  • projects/core/notifications/github/api.py
  • projects/core/notifications/github/gen_jwt.py
  • projects/core/notifications/send.py
  • projects/core/notifications/slack/__init__.py
  • projects/core/notifications/slack/api.py
  • projects/jump_ci/orchestration/ci.py
✅ Files skipped from review due to trivial changes (4)
  • .github/steps/prepare_minimal_ubuntu_image/action.yml
  • projects/core/notifications/github/__init__.py
  • projects/core/notifications/slack/__init__.py
  • projects/core/notifications/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • projects/core/notifications/github/gen_jwt.py

Comment on lines +9 to +24
    strategy:
      max-parallel: 4
      matrix:
        python-version: [3.9]
    steps:
      - name: Use checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Check that all the files can be compiled
        run: |
          set -o pipefail
          set -o errexit

          python -m compileall . | (grep -v -E '(^Listing|^Compiling)' || true)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

GitHub Actions matrix python-version without setup-python

💡 Result:

You can put python-version (or py) in a matrix without actions/setup-python, but it won’t automatically switch the python on PATH. Without setup-python, GitHub Actions just uses whatever Python is already on the runner’s PATH (and that default can vary by runner/image). On Windows in particular, GitHub only adds “a few” toolcache Pythons to PATH for out-of-the-box use—there’s no built-in matrix selector without setup-python. [1]

Practical options (no setup-python)

1) Use a container per matrix entry (cleanest)

This avoids installing/switching Python on the runner at all:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        py: ["3.10", "3.11", "3.12"]
    container: python:${{ matrix.py }}
    steps:
      - uses: actions/checkout@v5
      - run: python --version
      - run: pip install -r requirements.txt
      - run: pytest

2) Call a specific interpreter already present on the runner

On Ubuntu/macOS, multiple system Pythons may exist, but availability depends on the runner image:

- run: python${{ matrix.py }} --version   # e.g. python3.11

If that binary isn’t installed on that image, the job will fail. (GitHub provides many versions via the toolcache, but selecting/adding them to PATH is what setup-python is designed for.) [1]

3) Install/switch Python yourself (you’re re-implementing setup)

Example approaches: pyenv, OS packages (apt, brew), or other installers. GitHub’s Python builds are produced for runner images and exposed via the toolcache / setup-python flow, but without setup-python you must manage installation + PATH yourself. [2]

Bottom line: a matrix variable alone can’t change Python—either run each entry in a Python container, invoke a preinstalled python3.x binary explicitly, or install/switch Python yourself. [1][2]


🌐 Web query:

actions/setup-python@v5 latest version Python

💡 Result:

To install the latest stable CPython 3.x with actions/setup-python@v5, set:

- uses: actions/setup-python@v5
  with:
    python-version: '3.x'

3.x is an x-range that resolves to the latest stable Python 3 release available to the action. [1]

If you instead pin a minor (e.g. 3.13) and want the latest patch every run, add check-latest: true:

- uses: actions/setup-python@v5
  with:
    python-version: '3.13'
    check-latest: true

This makes the action prefer downloading a newer patch if the cached one isn’t the latest. [2]

Note: setup-python’s latest major is now v6 (not v5). [3]

Sources
[1] https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md
[2] https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md
[3] https://github.com/actions/setup-python


🏁 Script executed:

# Check if there are other workflow files using setup-python for reference
find .github/workflows -type f -name "*.yml" -o -name "*.yaml" | head -20

Repository: openshift-psap/topsail-ng

Length of output: 150


🏁 Script executed:

# Look at the current workflow file to understand its full context
cat -n .github/workflows/check_python_syntax.yml

Repository: openshift-psap/topsail-ng

Length of output: 758


Use actions/setup-python to actually validate Python 3.9 syntax.

matrix.python-version: [3.9] is never consumed, so python -m compileall runs with whatever Python is preinstalled on ubuntu-latest (typically 3.10+). This means syntax that is invalid on Python 3.9 can slip through.

Suggested fix
     steps:
       - name: Use checkout
         uses: actions/checkout@v4
         with:
           fetch-depth: 0

+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
       - name: Check that all the files can be compiled
         run: |
           set -o pipefail
           set -o errexit
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check_python_syntax.yml around lines 9 - 24, The workflow
currently declares matrix.python-version but never uses it, so the "Check that
all the files can be compiled" step runs with the runner's default Python; add a
step that uses actions/setup-python (e.g., setup-python@v4) and set its
python-version to the matrix value (matrix.python-version) before the compile
step so python -m compileall runs under the intended Python interpreter; update
the "Check that all the files can be compiled" step to run after the
setup-python step and keep using python -m compileall to validate syntax for
Python 3.9.

Comment thread projects/core/ci_entrypoint/prepare_ci.py Outdated
Comment thread projects/core/ci_entrypoint/prepare_ci.py
Comment thread projects/core/notifications/send.py
Comment on lines +87 to +94
    user_token = github_api.get_user_token(pem_file, client_id, org, repo)
    if dry_run:
        logging.info(f"Github notification:\n{message}")
        logging.info(f"***")
        logging.info(f"***")
        logging.info(f"***\n")

        return True

⚠️ Potential issue | 🟠 Major

Short-circuit dry_run before fetching a GitHub token.

Preview mode still performs the token exchange first, so a dry run requires valid credentials and network access. That defeats the point of dry_run and can fail before the message is even logged.

Suggested fix
-    user_token = github_api.get_user_token(pem_file, client_id, org, repo)
     if dry_run:
         logging.info(f"Github notification:\n{message}")
         logging.info(f"***")
         logging.info(f"***")
         logging.info(f"***\n")

         return True

+    user_token = github_api.get_user_token(pem_file, client_id, org, repo)
     resp = github_api.send_notification(org, repo, user_token, pr_number, message)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    user_token = github_api.get_user_token(pem_file, client_id, org, repo)
     if dry_run:
         logging.info(f"Github notification:\n{message}")
         logging.info(f"***")
         logging.info(f"***")
         logging.info(f"***\n")

         return True

+    user_token = github_api.get_user_token(pem_file, client_id, org, repo)
🧰 Tools
🪛 Ruff (0.15.6)

[error] 90-90: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 91-91: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 92-92: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 87 - 94, The code currently
calls github_api.get_user_token(pem_file, client_id, org, repo) before honoring
dry_run, causing network/credential usage even in preview mode; move the dry_run
check to precede the token fetch so that if dry_run is True the function logs
the message via logging.info and returns True without calling
github_api.get_user_token (adjust surrounding control flow in the function that
contains user_token, github_api.get_user_token, dry_run, and logging.info).

Comment on lines +369 to +377
def get_slack_secrets(secret_dir, secret_env_key):
    token_file = secret_dir / SLACK_TOKEN_FILE

    if not token_file.exists():
        logging.warning(f"{token_file.name} not found in {secret_env_key}. "
                        "Cannot send the Slack notification")
        return None

    return token_file.read_text()

⚠️ Potential issue | 🟠 Major

Strip the Slack token file contents.

read_text() preserves the trailing newline that secret files often contain. Passing that raw value to WebClient can make Slack auth fail even though the secret is otherwise correct.

Suggested fix
-    return token_file.read_text()
+    return token_file.read_text().strip()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 369 - 377, The
get_slack_secrets function returns the raw file contents which may include
trailing newlines or whitespace causing Slack auth to fail; change the return to
strip surrounding whitespace from the token (e.g., use
token_file.read_text().strip()) so the returned token has no leading/trailing
newlines or spaces before being passed to WebClient (reference get_slack_secrets
and SLACK_TOKEN_FILE).

Comment thread projects/core/notifications/send.py
Comment on lines +43 to +47
        for message in history:
            if message_anchor not in message["text"]:
                continue

            return message["ts"], message["text"]

⚠️ Potential issue | 🟠 Major

Match the Slack thread anchor exactly.

This substring check will treat `Thread for PR #12` as a hit inside `Thread for PR #123`, so later notifications can attach to the wrong conversation. Compare the normalized thread header exactly, or at least require the expected prefix/newline boundary.

Suggested fix
         for message in history:
-            if message_anchor not in message["text"]:
+            text = message["text"]
+            expected_prefix = f"🧵 {message_anchor}"
+            if text != expected_prefix and not text.startswith(f"{expected_prefix}\n"):
                 continue

-            return message["ts"], message["text"]
+            return message["ts"], text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/slack/api.py` around lines 43 - 47, The current
loop in projects/core/notifications/slack/api.py uses a substring check (if
message_anchor not in message["text"]) which can falsely match "Thread for PR
`#12`" inside "Thread for PR `#123`"; change the match to require an exact
normalized header match or a boundary (e.g., check that
message["text"].strip().startswith(message_anchor) or that message_anchor
appears with a newline/prefix boundary) before returning message["ts"],
message["text"]; update the check around the variables message_anchor and
message["text"] in that loop so only exact/header-boundary matches are accepted.
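The boundary issue is easy to unit-test in isolation; `is_thread_root` below is a hypothetical helper mirroring the suggested fix, not the api.py code:

```python
def is_thread_root(text, message_anchor):
    """True only when the message *is* the anchor header, or starts
    with it followed by a newline boundary.

    A plain substring test would match "Thread for PR #12" inside
    "Thread for PR #123"; requiring an exact match or a newline
    right after the anchor avoids that.
    """
    text = text.strip()
    return text == message_anchor or text.startswith(message_anchor + "\n")

anchor = "Thread for PR #12"
```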

Comment thread projects/jump_ci/orchestration/ci.py
Comment on lines +79 to +119
    def prepare(self):
        self.log("Starting prepare phase...")

        try:
            ret = run_on_jump_ci("test_ci")
            self.log("Jump CI environment prepared", "success" if not ret else "error")
            return ret

        except Exception as e:
            self.log(f"Prepare phase failed: {e}", "error")
            return 1

    def test(self):
        """
        Test phase - Execute the main testing logic.

        This is where your actual tests run:
        - Performance tests
        - Scale tests
        - Functional tests
        - Integration tests
        """
        self.log("Starting test phase...")

        # Example: Run functional tests
        if not self.execute_command(
                "echo 'Running functional tests...' && sleep 1 && echo 'All tests passed!'",
                "Running functional tests"
        ):
            self.log("Functional tests failed!", "error")
            return 1

        # Example: Run performance tests
        if not self.execute_command(
                "echo 'Running performance tests...' && sleep 2 && echo 'Performance metrics captured!'",
                "Running performance tests"
        ):
            self.log("Performance tests failed!", "error")
            return 1

        # Example: Run scale tests (if applicable)
        if not self.execute_command(
                "echo 'Running scale tests...' && sleep 1 && echo 'Scale targets achieved!'",
                "Running scale tests"
        ):
            self.log("Scale tests failed!", "error")
        try:
            ret = run_on_jump_ci("test_ci")
            self.log("Jump CI environment test completed", "success" if not ret else "error")
            return ret
        except Exception as e:
            self.log(f"Test phase failed: {e}", "error")
            return 1

        self.log("Test phase completed successfully!", "success")
        return 0

    def cleanup(self):
        """
        Cleanup phase - Clean up resources and finalize.

        This phase should clean up everything created during testing:
        - Remove temporary resources
        - Clean up data
        - Reset environment
        - Generate final reports
        """
        self.log("Starting cleanup phase...")

        # Example: Clean up test data
        self.execute_command(
            "echo 'Cleaning up test data and temporary files'",
            "Cleaning up test data"
        )

        # Example: Reset environment
        self.execute_command(
            "echo 'Resetting environment to initial state'",
            "Resetting environment"
        )

        # Example: Generate final report
        self.execute_command(
            "echo 'Generating final test report'",
            "Generating final report"
        )

        self.log("Cleanup phase completed!", "success")
        return 0
        try:
            # Clean up any temporary files or connections
            self.log("Cleaning up jump CI resources...")
            ret = run_on_jump_ci("pre_cleanup_ci")
            self.log("Cleanup phase completed!", "success" if not ret else "error")
            return 0
        except Exception as e:
            self.log(f"Cleanup phase failed: {e}", "error")
            return 1

⚠️ Potential issue | 🔴 Critical

These phases build a callable instead of executing Jump CI.

projects/jump_ci/testing/test.py defines jump_ci(command) as a factory. Here, ret becomes the generated function object, so prepare, test, and cleanup never run the remote phase; they just propagate a truthy callable up to sys.exit(...). You also need to pass the jump_ci.{cluster,project,args} values that pr_args.py now emits.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 87-87: Do not catch blind exception: Exception

(BLE001)


[warning] 101-101: Do not catch blind exception: Exception

(BLE001)


[warning] 117-117: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/jump_ci/orchestration/ci.py` around lines 79 - 119,
prepare/test/cleanup are currently calling run_on_jump_ci("...") which returns a
callable factory (per testing.test.jump_ci) instead of executing the remote
phase; update each method to call the returned function with the emitted
jump_ci.{cluster, project, args} so the remote phase actually runs (e.g.,
replace ret = run_on_jump_ci("test_ci") with ret_callable =
run_on_jump_ci("test_ci"); ret = ret_callable(jump_ci.cluster, jump_ci.project,
jump_ci.args)), and ensure cleanup uses "pre_cleanup_ci" similarly and returns
the callable's integer result instead of always returning 0.
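The factory-vs-result confusion can be reproduced with a stub; `run_on_jump_ci` here is a toy stand-in for testing.test.jump_ci, not the real implementation:

```python
def run_on_jump_ci(command):
    """Toy factory mimicking the reviewed API: it returns a function
    that runs the remote phase, rather than the phase result itself."""
    def runner(cluster, project, args):
        # A real implementation would execute `command` remotely;
        # here we just report success (0) for demonstration.
        return 0
    return runner

# Buggy pattern from the review: `ret` is the callable itself, which
# is truthy, so the phase looks non-zero ("failed") and never ran.
ret = run_on_jump_ci("test_ci")
buggy_ret_is_callable = callable(ret)

# Fixed pattern: call the returned function with the emitted
# jump_ci.{cluster, project, args} values.
runner = run_on_jump_ci("test_ci")
fixed_ret = runner("my-cluster", "my-project", ["--dry-run"])
```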

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 16

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/legacy/library/ansible_toolbox.py`:
- Around line 258-295: The code currently creates a temp file FD then
immediately removes the pathname (inventory_fd, path) and writes to the open FD
which produces a /proc/.../fd/* dependency; change this to create a real
temporary file on disk (e.g., use tempfile.NamedTemporaryFile(delete=False) or
mkstemp but do NOT call os.remove(path) immediately), write the inventory to
that pathname (use inventory_path variable instead of relying on /proc FD),
close the file so ansible-playbook can read it, pass the real pathname to
ansible-playbook when invoking remote execution, and ensure you delete the file
in a finally block to guarantee cleanup; apply the same fix to the other
remote-inventory creation block referenced in the comment (the block around
lines 344-362) and adjust references to inventory_fd/inventory_f/path to
inventory_path where needed.
- Around line 183-193: The code currently writes raw CLI args and vars to
artifact files (using self.py_command_args, self.py_command_name and
artifact_extra_logs_dir) which can leak secrets; add a sanitizer step before
writing any of the files (_python.gen.cmd, _python.args.yaml,
_ansible.play.yaml, _ansible.env, _python.cmd) that redacts or omits values
matching sensitive keys/patterns (e.g., token, secret, password, key,
kubeconfig, webhook, AWS/AKIA, *_TOKEN, *_KEY) and masks them (e.g., replace
with "<redacted>") — implement a helper like sanitize_dict/sanitize_mapping used
wherever args/env/vars are dumped (including the later block referenced at
329-341) and ensure shlex.quote is applied to the sanitized values when
generating _python.gen.cmd so no raw secrets are written to disk.
- Around line 164-175: The current count-then-create for ARTIFACT_EXTRA_LOGS_DIR
is racy; replace the len(list(artifact_dir.glob("*__*"))) approach with an
atomic create loop: compute artifact_base_dirname as before, then attempt to
create a candidate directory (build candidate name from an incrementing counter
or use tempfile.mkdtemp with dir=env["ARTIFACT_DIR"]) and call
pathlib.Path(...).mkdir(exist_ok=False) inside a try/except loop that catches
FileExistsError and retries with the next counter until creation succeeds; once
created set env["ARTIFACT_EXTRA_LOGS_DIR"] and then assign
artifact_extra_logs_dir = pathlib.Path(env["ARTIFACT_EXTRA_LOGS_DIR"]) and call
mkdir(parents=True, exist_ok=True) only if needed. Ensure you reference and
update the variables artifact_dir, previous_extra_count (or replace it), name,
ARTIFACT_EXTRA_LOGS_DIR, and artifact_extra_logs_dir when making this change.
- Around line 40-47: The loop currently imports modules before checking for
private names, causing _*.py and __init__.py to execute; move the underscore
checks to before building/importing the module: for each toolbox_file, first
skip if toolbox_file.name.startswith(".") or toolbox_file.name.startswith("_")
or toolbox_file.name == "__init__.py", then compute project_toolbox_module and
call importlib.import_module — this prevents importing private/helper files
(affecting importlib.import_module and the surrounding loop that initializes
Toolbox()) and resolves the inline continue/E701 issues.
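A sketch of the NamedTemporaryFile(delete=False) pattern the first prompt describes, with cleanup in a finally block (`run_with_inventory` and the playbook callback are hypothetical):

```python
import os
import tempfile

def run_with_inventory(inventory_text, run_playbook):
    """Write the inventory to a real on-disk path, hand that path to
    the playbook runner, and always delete the file afterwards.

    Unlinking the path right after creating it would leave only a
    /proc/<pid>/fd/* alias, which an external ansible-playbook
    process cannot reliably open."""
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".inventory",
                                      delete=False)
    try:
        tmp.write(inventory_text)
        tmp.close()  # flush and release so another process can read it
        return run_playbook(tmp.name)
    finally:
        os.unlink(tmp.name)  # guaranteed cleanup

# Demo: the runner sees a real path; the file is gone afterwards.
_seen = {}
def _fake_playbook(path):
    with open(path) as f:
        _seen["content"] = f.read()
    _seen["path"] = path
    return 0

_rc = run_with_inventory("[all]\nlocalhost\n", _fake_playbook)
```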
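The redaction helper the second prompt asks for might look like this (key patterns taken from the prompt; the helper name is an assumption):

```python
import re

# Case-insensitive patterns for keys whose values must never be
# written to artifact files (list taken from the review prompt).
SENSITIVE_KEY_RE = re.compile(
    r"(token|secret|password|key|kubeconfig|webhook)", re.IGNORECASE)

def sanitize_mapping(mapping):
    """Return a copy with sensitive values masked, recursing into
    nested dicts so secrets cannot hide one level down."""
    clean = {}
    for key, value in mapping.items():
        if isinstance(value, dict):
            clean[key] = sanitize_mapping(value)
        elif SENSITIVE_KEY_RE.search(str(key)):
            clean[key] = "<redacted>"
        else:
            clean[key] = value
    return clean

args = {"cluster": "jump-1", "SLACK_TOKEN": "xoxb-123",
        "nested": {"aws_secret_key": "hunter2", "region": "us-east-1"}}
clean = sanitize_mapping(args)
```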
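The atomic create loop from the third prompt can be sketched as follows (the NNN__name directory naming is an assumption based on the `*__*` glob mentioned there):

```python
import pathlib
import tempfile

def create_next_artifact_dir(artifact_dir, name, max_attempts=1000):
    """Atomically allocate the next NNN__<name> directory.

    mkdir(exist_ok=False) either creates the directory or raises
    FileExistsError, so two concurrent runs can never both claim the
    same index -- unlike counting the existing entries first."""
    artifact_dir = pathlib.Path(artifact_dir)
    for index in range(max_attempts):
        candidate = artifact_dir / f"{index:03d}__{name}"
        try:
            candidate.mkdir(parents=True, exist_ok=False)
            return candidate
        except FileExistsError:
            continue  # somebody else took this index; try the next one
    raise RuntimeError(f"no free artifact dir under {artifact_dir}")

# Demo: two successive calls never collide.
_base = tempfile.mkdtemp()
first = create_next_artifact_dir(_base, "run_toolbox")
second = create_next_artifact_dir(_base, "run_toolbox")
```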

In `@projects/legacy/library/config.py`:
- Around line 95-99: The except block is catching IndexError but get_config()
raises KeyError for missing JSONPath entries, so replace the except IndexError
in the ci_presets lookup with except KeyError to ensure the preset-specific
logging and re-raise run (i.e., update the exception handler around the call to
self.get_config(f'ci_presets["{name}"]', print=False) to catch KeyError and keep
the logging and raise behavior intact).
- Around line 75-91: The readback always calls self.get_config(...) without
preserving the handled_secretly flag, so secret overrides cause get_config to
raise or leak; fix by passing the same handled_secretly value when re-reading
the stored value (change actual_value = self.get_config(key, print=False) to
actual_value = self.get_config(key, print=False,
handled_secretly=handled_secretly)) and ensure logging respects handled_secretly
(avoid printing actual_value when handled_secretly is True); reference the loop
over variable_overrides and the get_config/set_config calls and the
handled_secretly variable in this change.
- Around line 145-171: The set_config function shadows the builtin print with
its parameter named "print" and also fails to stop execution in CI branches,
allowing state mutation from non-main threads; rename the parameter (e.g., to
do_print or log_output) and update all uses (the final if print: logging.info...
-> if do_print: ...) so the builtin print call (print(msg, file=f)) uses the
real built-in, and in the threading.guard CI branch (where OPENSHIFT_CI or
PERFLAB_CI is set) after logging and writing the marker file (env.ARTIFACT_DIR /
"SET_CONFIG_CALLED_FROM_THREAD") raise the same RuntimeError(msg) to prevent
further execution and shared-state mutation (use threading.current_thread().name
check and the existing msg variable and behavior).

In `@projects/legacy/library/env.py`:
- Around line 30-35: The module-level __getattr__ currently returns
globals()[name], which will raise KeyError for missing names; change it so that
after handling the special case for "ARTIFACT_DIR" (using
get_tls_artifact_dir()), you attempt to fetch the attribute from globals()
inside a try/except and convert any KeyError into an AttributeError (e.g., raise
AttributeError(f"module has no attribute {name}") from None) so callers and
hasattr()/introspection get the correct exception semantics.

In `@projects/legacy/library/export.py`:
- Around line 28-30: The wrapper currently calls init() and then invokes
fct(*args, **kwargs) but discards its return value; update the decorator's
wrapper so it returns the wrapped function's result (i.e., return the value from
fct(*args, **kwargs)) so that functions decorated with `@entrypoint()` propagate
their return values; ensure the decorator still returns the wrapper function.
- Around line 82-84: The current export_artifacts() call builds a shell command
string passed to run.run which is vulnerable to argument injection because
artifacts_dirname comes from argv; instead construct a safe invocation that does
not use shell interpolation: either call the AWS S3 copy using a subprocess call
with an arguments list (no shell=True) and pass the AWS_SHARED_CREDENTIALS_FILE
via the env mapping, or replace the shell invocation with boto3 S3 client code
that reads config.project.get_config("secrets.aws_credentials") and performs a
recursive upload of artifacts_dirname to export_dest while writing logs to
env.ARTIFACT_DIR / 'aws_s3_cp.log'; update references to run.run,
aws_creds_filename, artifacts_dirname, export_dest and ensure quoting/injection
is handled by the chosen API rather than by constructing a single shell string.
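
The argument-list variant can be sketched like this; the function names and log path are illustrative, while the `aws s3 cp --recursive` flags are standard CLI usage.

```python
import os
import subprocess

def build_export_command(artifacts_dirname, export_dest):
    # an argv list is never re-parsed by a shell, so a hostile
    # directory name stays a single, inert argument
    return ["aws", "s3", "cp", "--recursive",
            str(artifacts_dirname), str(export_dest)]

def export_artifacts(artifacts_dirname, export_dest,
                     aws_creds_filename, log_path):
    cmd = build_export_command(artifacts_dirname, export_dest)
    # credentials travel through the env mapping, not the command line
    env = dict(os.environ,
               AWS_SHARED_CREDENTIALS_FILE=str(aws_creds_filename))
    with open(log_path, "w") as log_f:
        return subprocess.run(cmd, env=env, stdout=log_f,
                              stderr=subprocess.STDOUT, check=True)
```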

In `@projects/legacy/library/run.py`:
- Around line 16-20: The current code calls os.killpg(0, ...) even if
os.setpgrp() failed and the module doesn't own the process group; modify the
import-time block around os.setpgrp() to set a flag (e.g., own_pgrp = True) only
when os.setpgrp() succeeds (catching exceptions as you already do), and then at
the shutdown/killer code path (where os.killpg(...) is invoked) check that flag
or verify ownership with os.getpgrp() == os.getpid() before calling os.killpg;
update references to os.setpgrp(), os.getpgrp(), and os.killpg(...) in run.py
accordingly so you only kill the group you own.
- Around line 227-237: The loop that applies overrides via for kv_entry in
itertools.product(*kv_list) must ensure the original iterable_kv values are
restored even if fct(...) raises: wrap the call and its per-iteration overrides
in a try/finally so that after running fct(*args, **kwargs) (or if it raises)
you always call config.project.set_config(iter_key, iter_values, print=False)
for each iter_key, iter_values in iterable_kv; modify the block around kv_entry,
fct and the restore so the restore logic lives in the finally clause
(referencing kv_entry, fct, config.project.set_config and iterable_kv).
- Around line 69-80: The function _dict_to_run_toolbox_args prepares a
serialized value in the local variable val but mistakenly interpolates the
original v, causing dicts/lists to be sent as Python reprs and leaving single
quotes unescaped for scalars; fix by using val in both formatted argument
branches (i.e., f"--{k}={...}" should embed val) and ensure the scalar branch
uses the escaped val (not v) when wrapping in single quotes so nested JSON and
embedded quotes are preserved for run_toolbox.py.
- Around line 122-125: The wrapper enables bash-specific options but may run
under /bin/sh; in run() before injecting "set -o pipefail;...errtrace" ensure
the shell is pinned to bash by setting the subprocess executable (e.g., set
args.setdefault("executable", "/bin/bash") when args.get("shell") is True) or
explicitly invoke bash (e.g., run with "bash -lc" semantics); update the block
that checks protect_shell and the subsequent subprocess.run call so the process
uses /bin/bash before adding pipefail/errtrace to the command.
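
The process-group ownership flag from the first run.py finding can be sketched as below; the shutdown path and signal choice are simplified stand-ins for the real cleanup code.

```python
import os
import signal

own_pgrp = False
try:
    os.setpgrp()      # try to become leader of a fresh process group
    own_pgrp = True
except OSError:
    pass              # e.g. we are already a session leader

def kill_own_process_group(sig=signal.SIGTERM):
    # only signal group 0 (our own group) if we created it, or if the
    # group id demonstrably matches our pid
    if own_pgrp or os.getpgrp() == os.getpid():
        os.killpg(0, sig)
```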

In `@projects/legacy/library/sizing.py`:
- Around line 57-83: Compute capacity per-node and base pod packing on
allocatable resources rather than aggregate-only math: first handle user_count
== 0 by setting machine_count = 0 and pods_per_machine = 0 and log/return early.
Otherwise compute allocatable_cpu = machine_size.cpu - RESERVED_CPU and
allocatable_mem = machine_size.memory - RESERVED_MEM, then compute per-node pod
capacity as pods_by_cpu = floor(allocatable_cpu / pod_size['cpu']) and
pods_by_mem = floor(allocatable_mem / pod_size['memory']); set pods_per_machine
= min(pods_by_cpu, pods_by_mem, MAX_POD_PER_NODE) (and if this becomes 0
raise/log an error). Finally set machine_count = ceil(user_count /
pods_per_machine). Use allocatable_cpu/allocatable_mem when computing
unallocated_cpu and unallocated_mem and update logging to reference these
allocatable values; adjust references to machine_count_cpu,
machine_count_memory, machine_exact_count only if still needed or remove them in
favor of per-node capacity logic.
- Around line 121-125: The current try/except passes the node count returned by
main(machine_type, user_count, cpu, memory) into sys.exit (which truncates
values >255) and exits with 0 on error; instead, call main(...) and capture its
integer result, print that integer to stdout (so full counts like 256 are
preserved), then exit normally (sys.exit(0)) for success; in the except block
log the failure with logging.error including the exception details (use
e.__class__.__name__ and e) and call sys.exit(1) to indicate failure — update
the try/except around main(...) accordingly (referencing the main function,
sys.exit, logging.error, and sys.argv).
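
The exit-code fix can be sketched as a thin CLI wrapper; `main()` is stubbed here to return a large node count, showing why the value must go to stdout rather than through sys.exit(), which truncates the status to one byte. The real entry point would end with `sys.exit(cli(sys.argv))`.

```python
import logging

def main(machine_type, user_count, cpu, memory):
    # stand-in for the real sizing computation
    return 256  # a value sys.exit() would truncate to 0

def cli(argv):
    try:
        machine_type, user_count, cpu, memory = argv[1:5]
        count = main(machine_type, int(user_count), float(cpu), float(memory))
        print(count)   # the full integer goes to stdout
        return 0       # the exit status only signals success/failure
    except Exception as e:
        logging.error("sizing failed: %s: %s", e.__class__.__name__, e)
        return 1
```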

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 622f617d-3946-47b8-a16c-3d66bc7e82ee

📥 Commits

Reviewing files that changed from the base of the PR and between 5c5c8f4 and 8f5fbac.

📒 Files selected for processing (9)
  • projects/legacy/library/__init__.py
  • projects/legacy/library/ansible_toolbox.py
  • projects/legacy/library/common.py
  • projects/legacy/library/config.py
  • projects/legacy/library/env.py
  • projects/legacy/library/export.py
  • projects/legacy/library/run.py
  • projects/legacy/library/sizing.machines
  • projects/legacy/library/sizing.py

Comment on lines +40 to +47
for toolbox_file in (TOPSAIL_DIR / "projects").glob("*/toolbox/*.py"):
if toolbox_file.name.startswith("."): continue

project_toolbox_module = str(toolbox_file.relative_to(TOPSAIL_DIR).with_suffix("")).replace(os.path.sep, ".")
mod = importlib.import_module(project_toolbox_module)
toolbox_name = toolbox_file.with_suffix("").name

if toolbox_name.startswith("_"): continue

⚠️ Potential issue | 🟠 Major

Skip private toolbox files before importing them.

_*.py and __init__.py still execute here because the underscore filter runs after importlib.import_module(). Any top-level side effect or missing dependency in a helper module can therefore break Toolbox() initialization before the file is skipped. Splitting the inline `continue` statements onto their own lines also clears the Ruff E701 failures.

💡 Proposed fix
         for toolbox_file in (TOPSAIL_DIR / "projects").glob("*/toolbox/*.py"):
-            if toolbox_file.name.startswith("."): continue
-
-            project_toolbox_module = str(toolbox_file.relative_to(TOPSAIL_DIR).with_suffix("")).replace(os.path.sep, ".")
-            mod = importlib.import_module(project_toolbox_module)
             toolbox_name = toolbox_file.with_suffix("").name
-
-            if toolbox_name.startswith("_"): continue
+            if toolbox_file.name.startswith("."):
+                continue
+            if toolbox_name.startswith("_"):
+                continue
+
+            project_toolbox_module = str(toolbox_file.relative_to(TOPSAIL_DIR).with_suffix("")).replace(os.path.sep, ".")
+            mod = importlib.import_module(project_toolbox_module)
🧰 Tools
🪛 Ruff (0.15.6)

[error] 41-41: Multiple statements on one line (colon)

(E701)


[error] 47-47: Multiple statements on one line (colon)

(E701)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 40 - 47, The loop
currently imports modules before checking for private names, causing _*.py and
__init__.py to execute; move the underscore checks to before building/importing
the module: for each toolbox_file, first skip if
toolbox_file.name.startswith(".") or toolbox_file.name.startswith("_") or
toolbox_file.name == "__init__.py", then compute project_toolbox_module and call
importlib.import_module — this prevents importing private/helper files
(affecting importlib.import_module and the surrounding loop that initializes
Toolbox()) and resolves the inline continue/E701 issues.

Comment on lines +164 to +175
if env.get("ARTIFACT_EXTRA_LOGS_DIR") is None:
artifact_base_dirname = f"{self.group}__{self.command}" if self.group and self.command \
else "__".join(sys.argv[1:3])

previous_extra_count = len(list(artifact_dir.glob("*__*")))

name = f"{previous_extra_count:03d}__{prefix}{artifact_base_dirname}{suffix}"

env["ARTIFACT_EXTRA_LOGS_DIR"] = str(pathlib.Path(env["ARTIFACT_DIR"]) / name)

artifact_extra_logs_dir = pathlib.Path(env["ARTIFACT_EXTRA_LOGS_DIR"])
artifact_extra_logs_dir.mkdir(parents=True, exist_ok=True)

⚠️ Potential issue | 🟠 Major

Make ARTIFACT_EXTRA_LOGS_DIR allocation atomic.

This is a count-then-create race. Two runs that share the same ARTIFACT_DIR can both pick the same NNN__... directory and then overwrite each other’s logs and failure markers. Use an atomic directory creator instead of len(glob(...)).

💡 Proposed fix
         if env.get("ARTIFACT_EXTRA_LOGS_DIR") is None:
             artifact_base_dirname = f"{self.group}__{self.command}" if self.group and self.command \
                 else "__".join(sys.argv[1:3])
-
-            previous_extra_count = len(list(artifact_dir.glob("*__*")))
-
-            name = f"{previous_extra_count:03d}__{prefix}{artifact_base_dirname}{suffix}"
-
-            env["ARTIFACT_EXTRA_LOGS_DIR"] = str(pathlib.Path(env["ARTIFACT_DIR"]) / name)
+            env["ARTIFACT_EXTRA_LOGS_DIR"] = tempfile.mkdtemp(
+                prefix=f"{prefix}{artifact_base_dirname}{suffix}__",
+                dir=env["ARTIFACT_DIR"],
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 164 - 175, The
current count-then-create for ARTIFACT_EXTRA_LOGS_DIR is racy; replace the
len(list(artifact_dir.glob("*__*"))) approach with an atomic create loop:
compute artifact_base_dirname as before, then attempt to create a candidate
directory (build candidate name from an incrementing counter or use
tempfile.mkdtemp with dir=env["ARTIFACT_DIR"]) and call
pathlib.Path(...).mkdir(exist_ok=False) inside a try/except loop that catches
FileExistsError and retries with the next counter until creation succeeds; once
created set env["ARTIFACT_EXTRA_LOGS_DIR"] and then assign
artifact_extra_logs_dir = pathlib.Path(env["ARTIFACT_EXTRA_LOGS_DIR"]) and call
mkdir(parents=True, exist_ok=True) only if needed. Ensure you reference and
update the variables artifact_dir, previous_extra_count (or replace it), name,
ARTIFACT_EXTRA_LOGS_DIR, and artifact_extra_logs_dir when making this change.

Comment on lines +183 to +193
if self.py_command_args:
with open(artifact_extra_logs_dir / "_python.gen.cmd", "w") as f:
print(f"{sys.argv[0]} {self.group} {self.command} \\", file=f)
for key, _value in self.py_command_args.items():
value = shlex.quote(str(_value))
print(f" --{key}={value} \\", file=f)
print(" --", file=f)

with open(artifact_extra_logs_dir / "_python.args.yaml", "w") as f:
print(yaml.dump({self.py_command_name: self.py_command_args}), file=f)


⚠️ Potential issue | 🟠 Major

Don’t dump raw args, vars, and env into artifacts.

_python.gen.cmd, _python.args.yaml, _ansible.play.yaml, _ansible.env, and _python.cmd all persist unredacted inputs. Any token, webhook, password, or kubeconfig passed via role vars, CLI args, or CI environment will be written back to disk; once secret support lands, this becomes an automatic leak path.

Also applies to: 329-341

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 183 - 193, The code
currently writes raw CLI args and vars to artifact files (using
self.py_command_args, self.py_command_name and artifact_extra_logs_dir) which
can leak secrets; add a sanitizer step before writing any of the files
(_python.gen.cmd, _python.args.yaml, _ansible.play.yaml, _ansible.env,
_python.cmd) that redacts or omits values matching sensitive keys/patterns
(e.g., token, secret, password, key, kubeconfig, webhook, AWS/AKIA, *_TOKEN,
*_KEY) and masks them (e.g., replace with "<redacted>") — implement a helper
like sanitize_dict/sanitize_mapping used wherever args/env/vars are dumped
(including the later block referenced at 329-341) and ensure shlex.quote is
applied to the sanitized values when generating _python.gen.cmd so no raw
secrets are written to disk.
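
A minimal version of such a sanitizer is sketched below; the key pattern list is illustrative and would need tuning against the real role-var and env names.

```python
import re

SENSITIVE_KEY_RE = re.compile(
    r"token|secret|password|key|kubeconfig|webhook|credential",
    re.IGNORECASE)

def sanitize_mapping(mapping):
    # redact values whose key looks sensitive; recurse into nested
    # dicts so role vars and env blocks are covered too
    sanitized = {}
    for k, v in mapping.items():
        if SENSITIVE_KEY_RE.search(str(k)):
            sanitized[k] = "<redacted>"
        elif isinstance(v, dict):
            sanitized[k] = sanitize_mapping(v)
        else:
            sanitized[k] = v
    return sanitized
```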

Comment on lines +258 to +295
remote_hostname = env.get("TOPSAIL_REMOTE_HOSTNAME")
if remote_hostname:
print(f"Using TOPSAIL_REMOTE_HOSTNAME={remote_hostname}") # value will be censored by OpenShift
if self.ansible_gather_facts:
# gather only env values
generated_play[0]["gather_facts"] = True
generated_play[0]["gather_subset"] = ['env','!all','!min']
else:
generated_play[0]["gather_facts"] = False

# run remotely
generated_play[0]["hosts"] = "remote"
inventory_fd, path = tempfile.mkstemp()
os.remove(path) # using only the FD. Ensures that the file disappears when this process terminates
inventory_f = os.fdopen(inventory_fd, 'w')

host_properties = []
if remote_username := env.get("TOPSAIL_REMOTE_USERNAME"):
print(f"Using TOPSAIL_REMOTE_USERNAME={remote_username}") # value will be censored by OpenShift

host_properties.append("ansible_user="+remote_username)

# Configure OS-specific Ansible variables
if remote_os := env.get("TOPSAIL_REMOTE_OS"):
config = ANSIBLE_OS_CONFIGURATIONS.get(remote_os.lower())
if not config:
raise ValueError(f"TOPSAIL Ansible OS configuration not found for TOPSAIL_REMOTE_OS={remote_os}")
host_properties.extend(config)

inventory_content = f"""
[all:vars]

[remote]
{remote_hostname} {" ".join(host_properties)}
"""

print(inventory_content, file=inventory_f)
inventory_f.flush()

⚠️ Potential issue | 🟠 Major

Use a real temp inventory file instead of /proc/.../fd/....

Remote execution currently depends on a Linux-only procfs path. That breaks controllers outside Linux and is brittle in restricted containers; keep the inventory file on disk, pass its pathname to ansible-playbook, and remove it in finally.

Also applies to: 344-362

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 258 - 295, The code
currently creates a temp file FD then immediately removes the pathname
(inventory_fd, path) and writes to the open FD which produces a /proc/.../fd/*
dependency; change this to create a real temporary file on disk (e.g., use
tempfile.NamedTemporaryFile(delete=False) or mkstemp but do NOT call
os.remove(path) immediately), write the inventory to that pathname (use
inventory_path variable instead of relying on /proc FD), close the file so
ansible-playbook can read it, pass the real pathname to ansible-playbook when
invoking remote execution, and ensure you delete the file in a finally block to
guarantee cleanup; apply the same fix to the other remote-inventory creation
block referenced in the comment (the block around lines 344-362) and adjust
references to inventory_fd/inventory_f/path to inventory_path where needed.
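
The create/pass-by-path/cleanup pattern can be sketched as below; the playbook invocation is simplified, only the inventory lifecycle is the point.

```python
import os
import subprocess
import tempfile

def run_with_inventory(inventory_content, playbook_path):
    fd, inventory_path = tempfile.mkstemp(suffix=".ini")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(inventory_content)   # a real file, readable by pathname
        cmd = ["ansible-playbook", "-i", inventory_path, playbook_path]
        return subprocess.run(cmd, check=True)
    finally:
        os.remove(inventory_path)        # cleanup on success and on failure
```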

Comment on lines +75 to +91
for key, value in variable_overrides.items():
MAGIC_DEFAULT_VALUE = object()
handled_secretly = True # current_value MUST NOT be printed below.
current_value = self.get_config(key, MAGIC_DEFAULT_VALUE, print=False, warn=False, handled_secretly=handled_secretly)
if current_value == MAGIC_DEFAULT_VALUE:
if ignore_not_found:
continue

if "." in key:
raise ValueError(f"Config key '{key}' does not exist, and cannot create it at the moment :/")

self.config[key] = None

self.set_config(key, value, print=False)
actual_value = self.get_config(key, print=False) # ensure that key has been set, raises an exception otherwise
if log:
logging.info(f"config override: {key} --> {actual_value}")

⚠️ Potential issue | 🟠 Major

Secret overrides fail in the unconditional readback path.

Line 89 always re-reads the stored value. If an override value carries the `*$@` secret marker, get_config() raises there because handled_secretly defaults to False, so apply_config_overrides() breaks even with log=False. Moving secret values through a masked branch avoids both the failure and an accidental leak.

Proposed fix
-            actual_value = self.get_config(key, print=False) # ensure that key has been set, raises an exception otherwise
-            if log:
-                logging.info(f"config override: {key} --> {actual_value}")
+            if isinstance(value, str) and value.startswith("*$@"):
+                self.get_config(key, print=False, handled_secretly=True) # ensure that key has been set
+                if log:
+                    logging.info("config override: %s --> <secret>", key)
+            else:
+                actual_value = self.get_config(key, print=False) # ensure that key has been set
+                if log:
+                    logging.info("config override: %s --> %s", key, actual_value)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for key, value in variable_overrides.items():
MAGIC_DEFAULT_VALUE = object()
handled_secretly = True # current_value MUST NOT be printed below.
current_value = self.get_config(key, MAGIC_DEFAULT_VALUE, print=False, warn=False, handled_secretly=handled_secretly)
if current_value == MAGIC_DEFAULT_VALUE:
if ignore_not_found:
continue
if "." in key:
raise ValueError(f"Config key '{key}' does not exist, and cannot create it at the moment :/")
self.config[key] = None
self.set_config(key, value, print=False)
actual_value = self.get_config(key, print=False) # ensure that key has been set, raises an exception otherwise
if log:
logging.info(f"config override: {key} --> {actual_value}")
for key, value in variable_overrides.items():
MAGIC_DEFAULT_VALUE = object()
handled_secretly = True # current_value MUST NOT be printed below.
current_value = self.get_config(key, MAGIC_DEFAULT_VALUE, print=False, warn=False, handled_secretly=handled_secretly)
if current_value == MAGIC_DEFAULT_VALUE:
if ignore_not_found:
continue
if "." in key:
raise ValueError(f"Config key '{key}' does not exist, and cannot create it at the moment :/")
self.config[key] = None
self.set_config(key, value, print=False)
if isinstance(value, str) and value.startswith("*$@"):
self.get_config(key, print=False, handled_secretly=True) # ensure that key has been set
if log:
logging.info("config override: %s --> <secret>", key)
else:
actual_value = self.get_config(key, print=False) # ensure that key has been set
if log:
logging.info("config override: %s --> %s", key, actual_value)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/config.py` around lines 75 - 91, The readback always
calls self.get_config(...) without preserving the handled_secretly flag, so
secret overrides cause get_config to raise or leak; fix by passing the same
handled_secretly value when re-reading the stored value (change actual_value =
self.get_config(key, print=False) to actual_value = self.get_config(key,
print=False, handled_secretly=handled_secretly)) and ensure logging respects
handled_secretly (avoid printing actual_value when handled_secretly is True);
reference the loop over variable_overrides and the get_config/set_config calls
and the handled_secretly variable in this change.

Comment on lines +69 to +80
def _dict_to_run_toolbox_args(args_dict):
args = []
for k, v in args_dict.items():
if isinstance(v, dict) or isinstance(v, list):
val = json.dumps(v)
arg = f"--{k}=\"{v}\""
else:
val = str(v).replace("'", "\'")
arg = f"--{k}='{v}'"
args.append(arg)

return " ".join(args)

⚠️ Potential issue | 🟠 Major

Use the serialized/escaped value when building toolbox args.

val is prepared on Lines 73-76, but Lines 74-77 still interpolate v. Dict/list values go out as Python reprs instead of JSON, and scalar values containing ' are left unescaped, so run_toolbox.py will receive malformed arguments as soon as a config value contains nested data or quotes.

🛠️ Proposed fix
+import shlex
...
 def _dict_to_run_toolbox_args(args_dict):
     args = []
     for k, v in args_dict.items():
-        if isinstance(v, dict) or isinstance(v, list):
-            val = json.dumps(v)
-            arg = f"--{k}=\"{v}\""
-        else:
-            val = str(v).replace("'", "\'")
-            arg = f"--{k}='{v}'"
+        if isinstance(v, (dict, list)):
+            val = json.dumps(v)
+        else:
+            val = str(v)
+        arg = f"--{k}={shlex.quote(val)}"
         args.append(arg)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/run.py` around lines 69 - 80, The function
_dict_to_run_toolbox_args prepares a serialized value in the local variable val
but mistakenly interpolates the original v, causing dicts/lists to be sent as
Python reprs and leaving single quotes unescaped for scalars; fix by using val
in both formatted argument branches (i.e., f"--{k}={...}" should embed val) and
ensure the scalar branch uses the escaped val (not v) when wrapping in single
quotes so nested JSON and embedded quotes are preserved for run_toolbox.py.

Comment on lines +122 to +125
if protect_shell:
command = f"set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;{command}"

proc = subprocess.run(command, **args)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find projects/legacy -name "run.py" -type f

Repository: openshift-psap/topsail-ng

Length of output: 99


🏁 Script executed:

head -150 projects/legacy/library/run.py | tail -50

Repository: openshift-psap/topsail-ng

Length of output: 1827


🏁 Script executed:

cat -n projects/legacy/library/run.py | sed -n '100,140p'

Repository: openshift-psap/topsail-ng

Length of output: 1809


🌐 Web query:

Python subprocess.run shell=True POSIX /bin/sh default behavior

💡 Result:

On POSIX (Linux/macOS), subprocess.run(..., shell=True) runs your command via the system’s POSIX shell, /bin/sh by default, effectively like:

  • if args is a string: ['/bin/sh', '-c', args]
  • if args is a sequence: ['/bin/sh', '-c', args[0], args[1], ...] (extra items become additional argv values for the shell script, i.e., positional parameters). [1][2]

If you want a different shell (e.g., bash), pass executable=...; on POSIX, executable replaces the default /bin/sh when shell=True. [1][2]

Note: /bin/sh is not necessarily bash (often it’s dash on Debian/Ubuntu), so bash-specific syntax may fail unless you set executable='/bin/bash'. [1][3]

Sources:
[1] Python docs (subprocess, shell=True, /bin/sh, replacement via executable) (docs.python.org)
[2] Python docs (same behavior, including Popen(['/bin/sh','-c', ...]) example) (docs.python.org)
[3] Discussion/example noting /bin/sh may be dash not bash (stackoverflow.com)

Citations:


🌐 Web query:

bash pipefail errtrace option dash shell compatibility

💡 Result:

  • set -o pipefail

    • In bash, pipefail makes a pipeline return failure if any element fails (rather than only the last command’s status).
    • Portability: pipefail was not historically POSIX, so /bin/sh scripts couldn’t assume it; ShellCheck still flags it as undefined in “POSIX sh”. GitHub discussions note dash has been a notable holdout in many environments. [1]
    • dash today: the current dash(1) man page documents set -o pipefail and describes the changed pipeline exit-status rules when it’s enabled. [2]
    • Practical compatibility advice: if you must run under unknown /bin/sh implementations, feature-detect:
      (set -o pipefail) 2>/dev/null && set -o pipefail
      (and/or run ShellCheck with -s dash / -s sh depending on your target). [1]
  • set -o errtrace / set -E

    • This is bash-specific: it makes the ERR trap be inherited in functions/subshell contexts where it otherwise wouldn’t be. The Bash manual states the ERR trap “is not inherited … unless the -o errtrace shell option has been enabled.” [3]
    • dash compatibility: dash does not implement bash’s ERR trap/errtrace behavior (and dash’s -E option refers to emacs line editing for interactive use, not “errtrace”). [2][3]

Bottom line: pipefail is increasingly common and even documented in modern dash, but it’s still risky to assume for generic /bin/sh; errtrace is a bash-only error-tracing feature—don’t use it in dash/POSIX sh scripts. [2][3]

Sources:
[1] ShellCheck issue on pipefail portability/SC3040: (github.com)
[2] dash(1) manual page (pipefail option and pipeline semantics): (man7.org)
[3] GNU Bash Reference Manual (errtrace and ERR trap inheritance): (gnu.org)

Citations:


Pin run() to Bash before enabling pipefail and errtrace.

With shell=True, Python uses /bin/sh on POSIX unless executable is set. On many systems (e.g., Debian/Ubuntu), /bin/sh is dash, which does not support the errtrace option; pipefail, while documented in modern dash, is still a non-POSIX extension that can fail on older /bin/sh implementations. This wrapper will therefore fail on dash-based runners.

🛠️ Proposed fix
     args["cwd"] = cwd
     args["shell"] = True

     if capture_stdout: args["stdout"] = subprocess.PIPE
     if capture_stderr: args["stderr"] = subprocess.PIPE
     if check: args["check"] = True
@@
     if protect_shell:
+        args["executable"] = "/bin/bash"
         command = f"set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;{command}"
🧰 Tools
🪛 Ruff (0.15.6)

[error] 125-125: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/run.py` around lines 122 - 125, The wrapper enables
bash-specific options but may run under /bin/sh; in run() before injecting "set
-o pipefail;...errtrace" ensure the shell is pinned to bash by setting the
subprocess executable (e.g., set args.setdefault("executable", "/bin/bash") when
args.get("shell") is True) or explicitly invoke bash (e.g., run with "bash -lc"
semantics); update the block that checks protect_shell and the subsequent
subprocess.run call so the process uses /bin/bash before adding
pipefail/errtrace to the command.

Comment on lines +227 to +237
for kv_entry in itertools.product(*kv_list):
for k, v in kv_entry:
config.project.set_config(k, v)

fct(*args, **kwargs)

for iter_key, iter_values in iterable_kv.items():
config.project.set_config(iter_key, iter_values, print=False)

for iter_key, iter_values in iterable_kv.items():
config.project.set_config(iter_key, iter_values, print=False)

⚠️ Potential issue | 🟠 Major

Restore iterated config values in a finally block.

If fct() raises for one combination, the restore code on Lines 233-237 never runs and the process keeps the last partial config override. That leaks state into later cleanup or caller code.

🛠️ Proposed fix
     kv_list = [[(key, v) for v in iterable_kv[key]] for key in iterable_kv]
     for kv_entry in itertools.product(*kv_list):
-        for k, v in kv_entry:
-            config.project.set_config(k, v)
-
-        fct(*args, **kwargs)
-
-        for iter_key, iter_values in iterable_kv.items():
-            config.project.set_config(iter_key, iter_values, print=False)
-
-    for iter_key, iter_values in iterable_kv.items():
-        config.project.set_config(iter_key, iter_values, print=False)
+        try:
+            for k, v in kv_entry:
+                config.project.set_config(k, v)
+
+            fct(*args, **kwargs)
+        finally:
+            for iter_key, iter_values in iterable_kv.items():
+                config.project.set_config(iter_key, iter_values, print=False)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for kv_entry in itertools.product(*kv_list):
for k, v in kv_entry:
config.project.set_config(k, v)
fct(*args, **kwargs)
for iter_key, iter_values in iterable_kv.items():
config.project.set_config(iter_key, iter_values, print=False)
for iter_key, iter_values in iterable_kv.items():
config.project.set_config(iter_key, iter_values, print=False)
for kv_entry in itertools.product(*kv_list):
try:
for k, v in kv_entry:
config.project.set_config(k, v)
fct(*args, **kwargs)
finally:
for iter_key, iter_values in iterable_kv.items():
config.project.set_config(iter_key, iter_values, print=False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/run.py` around lines 227 - 237, The loop that applies
overrides via for kv_entry in itertools.product(*kv_list) must ensure the
original iterable_kv values are restored even if fct(...) raises: wrap the call
and its per-iteration overrides in a try/finally so that after running
fct(*args, **kwargs) (or if it raises) you always call
config.project.set_config(iter_key, iter_values, print=False) for each iter_key,
iter_values in iterable_kv; modify the block around kv_entry, fct and the
restore so the restore logic lives in the finally clause (referencing kv_entry,
fct, config.project.set_config and iterable_kv).

Comment on lines +57 to +83
    machine_count_cpu = total_cpu_count / (machine_size.cpu - RESERVED_CPU)
    machine_count_memory = total_memory_count / (machine_size.memory - RESERVED_MEM)

    logging.info(f"Memory requirement: {machine_count_memory:.1f} x {machine_type}")
    logging.info(f"CPU requirement: {machine_count_cpu:.1f} x {machine_type}")
    logging.info("")

    machine_exact_count = max([machine_count_cpu, machine_count_memory])
    machine_count = math.ceil(machine_exact_count)

    pods_per_machine = math.ceil(user_count/machine_count)

    # ensure that the expected pod/machine
    if pods_per_machine > MAX_POD_PER_NODE:
        logging.info(f"Computation gives {pods_per_machine} Pods per node on {machine_count}. "
                     f"Increasing the node count to stay below {MAX_POD_PER_NODE} pods/node.")
        pods_per_machine = MAX_POD_PER_NODE
        machine_count = math.ceil(user_count/pods_per_machine)

    logging.info(f"Provisioning {machine_count} {machine_type} machines "
                 f"for running {user_count} users with the pod size cpu={cpu}, mem={memory}")
    unallocated_cpu = machine_size.cpu - pod_size['cpu'] * pods_per_machine
    unallocated_mem = machine_size.memory - pod_size['memory'] * pods_per_machine

    logging.info(f"Expecting {pods_per_machine:d} pods per node "
                 f"({unallocated_cpu:.3f} cpu and {unallocated_mem:.2f}Gi of memory "
                 f"not allocated per node)")

⚠️ Potential issue | 🟠 Major

Size from per-node pod capacity, not aggregate totals alone.

Line 67 redistributes pods with ceil(user_count / machine_count) after sizing only from aggregate CPU/memory. That can still overpack a node when pods do not divide evenly; e.g. 3 users at cpu=4 on an 8-vCPU machine with RESERVED_CPU=2 returns 2 nodes even though one node would need 8 CPU and exceed the 6 allocatable. This also leaves user_count == 0 falling into a divide-by-zero, and Lines 78-79 currently report spare capacity from raw instance size instead of allocatable size.

Proposed fix
-    machine_count_cpu = total_cpu_count / (machine_size.cpu - RESERVED_CPU)
-    machine_count_memory = total_memory_count / (machine_size.memory - RESERVED_MEM)
+    alloc_cpu = machine_size.cpu - RESERVED_CPU
+    alloc_mem = machine_size.memory - RESERVED_MEM
+    if user_count == 0:
+        return 0
+    if cpu <= 0 or memory <= 0:
+        raise ValueError("cpu and memory must be positive")
+    if alloc_cpu <= 0 or alloc_mem <= 0:
+        raise ValueError(f"{machine_type} has no allocatable CPU/memory after reservation")
+
+    machine_count_cpu = total_cpu_count / alloc_cpu
+    machine_count_memory = total_memory_count / alloc_mem
@@
-    machine_exact_count = max([machine_count_cpu, machine_count_memory])
-    machine_count = math.ceil(machine_exact_count)
-
-    pods_per_machine = math.ceil(user_count/machine_count)
+    max_pods_per_machine = min(
+        math.floor(alloc_cpu / cpu),
+        math.floor(alloc_mem / memory),
+        MAX_POD_PER_NODE,
+    )
+    if max_pods_per_machine <= 0:
+        raise ValueError(f"{machine_type} cannot fit a pod of size cpu={cpu}, mem={memory}")
+
+    machine_count = max(
+        math.ceil(max(machine_count_cpu, machine_count_memory)),
+        math.ceil(user_count / max_pods_per_machine),
+    )
+    pods_per_machine = math.ceil(user_count / machine_count)
@@
-    unallocated_cpu = machine_size.cpu - pod_size['cpu'] * pods_per_machine
-    unallocated_mem = machine_size.memory - pod_size['memory'] * pods_per_machine
+    unallocated_cpu = alloc_cpu - pod_size['cpu'] * pods_per_machine
+    unallocated_mem = alloc_mem - pod_size['memory'] * pods_per_machine
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` around lines 57 - 83, Compute capacity
per-node and base pod packing on allocatable resources rather than
aggregate-only math: first handle user_count == 0 by setting machine_count = 0
and pods_per_machine = 0 and log/return early. Otherwise compute allocatable_cpu
= machine_size.cpu - RESERVED_CPU and allocatable_mem = machine_size.memory -
RESERVED_MEM, then compute per-node pod capacity as pods_by_cpu =
floor(allocatable_cpu / pod_size['cpu']) and pods_by_mem = floor(allocatable_mem
/ pod_size['memory']); set pods_per_machine = min(pods_by_cpu, pods_by_mem,
MAX_POD_PER_NODE) (and if this becomes 0 raise/log an error). Finally set
machine_count = ceil(user_count / pods_per_machine). Use
allocatable_cpu/allocatable_mem when computing unallocated_cpu and
unallocated_mem and update logging to reference these allocatable values; adjust
references to machine_count_cpu, machine_count_memory, machine_exact_count only
if still needed or remove them in favor of per-node capacity logic.
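The per-node capacity math the review proposes can be checked in isolation. The constants and the `node_count` helper below are illustrative placeholders, not the project's actual values:

```python
import math

# Illustrative reservation constants (assumptions, not TOPSAIL's values).
RESERVED_CPU = 2
RESERVED_MEM = 4
MAX_POD_PER_NODE = 250

def node_count(user_count, cpu, memory, machine_cpu, machine_mem):
    # Size from per-node pod capacity: how many pods actually fit on one
    # node once the reserved resources are subtracted.
    if user_count == 0:
        return 0
    alloc_cpu = machine_cpu - RESERVED_CPU
    alloc_mem = machine_mem - RESERVED_MEM
    pods_per_node = min(
        math.floor(alloc_cpu / cpu),
        math.floor(alloc_mem / memory),
        MAX_POD_PER_NODE,
    )
    if pods_per_node <= 0:
        raise ValueError("pod does not fit on this machine type")
    return math.ceil(user_count / pods_per_node)

# The reviewer's counter-example: 3 users with cpu=4 on an 8-vCPU machine.
# Aggregate math gives ceil(12 / 6) = 2 nodes, but only one 4-CPU pod fits
# per node (6 allocatable vCPUs), so 3 nodes are actually required.
print(node_count(3, 4, 1, 8, 32))  # 3
```

This is the overpacking case the aggregate-only formula misses: dividing total demand by total allocatable capacity ignores that pods are indivisible.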

Comment on lines +121 to +125
    try:
        sys.exit(main(machine_type, user_count, cpu, memory)) # returns the number of nodes required
    except Exception as e:
        logging.error(f"'{' '.join(sys.argv)}' failed: {e.__class__.__name__}: {e}")
        sys.exit(0) # 0 means that an error occured

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import ast
import pathlib

path = pathlib.Path("projects/legacy/library/sizing.py")
source = path.read_text()
tree = ast.parse(source)

for node in ast.walk(tree):
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute) and node.func.attr == "exit":
        print(f"sys.exit call on line {node.lineno}: {source.splitlines()[node.lineno - 1].strip()}")
PY

Repository: openshift-psap/topsail-ng

Length of output: 298


🏁 Script executed:

sed -n '110,130p' projects/legacy/library/sizing.py

Repository: openshift-psap/topsail-ng

Length of output: 453


🏁 Script executed:

rg -A 3 'def main\(' projects/legacy/library/sizing.py

Repository: openshift-psap/topsail-ng

Length of output: 196


🏁 Script executed:

rg 'return' projects/legacy/library/sizing.py | grep -v '# ' | head -20

Repository: openshift-psap/topsail-ng

Length of output: 113


🏁 Script executed:

sed -n '1,100p' projects/legacy/library/sizing.py | grep -A 50 'def main'

Repository: openshift-psap/topsail-ng

Length of output: 2569


🏁 Script executed:

sed -n '45,65p' projects/legacy/library/sizing.py

Repository: openshift-psap/topsail-ng

Length of output: 960


Print the node count to stdout instead of encoding it in the exit status.

Line 122 passes the machine_count integer directly to sys.exit(), which truncates values exceeding 255 (e.g., 256 machines exits with code 0). Additionally, line 125 exits with 0 on error—inverting success semantics and making errors indistinguishable from large node counts. Print the count to stdout and reserve non-zero exit codes for failures.

Proposed fix
     try:
-        sys.exit(main(machine_type, user_count, cpu, memory)) # returns the number of nodes required
+        machine_count = main(machine_type, user_count, cpu, memory)
     except Exception as e:
         logging.error(f"'{' '.join(sys.argv)}' failed: {e.__class__.__name__}: {e}")
-        sys.exit(0) # 0 means that an error occured
+        sys.exit(1)
+
+    print(machine_count)
+    sys.exit(0)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    try:
-        sys.exit(main(machine_type, user_count, cpu, memory)) # returns the number of nodes required
-    except Exception as e:
-        logging.error(f"'{' '.join(sys.argv)}' failed: {e.__class__.__name__}: {e}")
-        sys.exit(0) # 0 means that an error occured
+    try:
+        machine_count = main(machine_type, user_count, cpu, memory)
+    except Exception as e:
+        logging.error(f"'{' '.join(sys.argv)}' failed: {e.__class__.__name__}: {e}")
+        sys.exit(1)
+
+    print(machine_count)
+    sys.exit(0)
🧰 Tools
🪛 Ruff (0.15.6)

[warning] 123-123: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` around lines 121 - 125, The current
try/except passes the node count returned by main(machine_type, user_count, cpu,
memory) into sys.exit (which truncates values >255) and exits with 0 on error;
instead, call main(...) and capture its integer result, print that integer to
stdout (so full counts like 256 are preserved), then exit normally (sys.exit(0))
for success; in the except block log the failure with logging.error including
the exception details (use e.__class__.__name__ and e) and call sys.exit(1) to
indicate failure — update the try/except around main(...) accordingly
(referencing the main function, sys.exit, logging.error, and sys.argv).
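The 8-bit truncation is easy to demonstrate: on POSIX systems a child's exit status is reported modulo 256, so exiting with the node count is lossy. This standalone check is independent of the project's code:

```python
import subprocess
import sys

# A child that exits with status 256 is indistinguishable from one that
# exited with 0: the status is truncated to 8 bits on POSIX.
truncated = subprocess.run([sys.executable, "-c", "import sys; sys.exit(256)"])
print(truncated.returncode)  # 0 on Linux, not 256

# Printing the count to stdout and reserving the exit status for errors
# preserves the full value.
fixed = subprocess.run(
    [sys.executable, "-c", "print(256)"],
    capture_output=True, text=True,
)
print(fixed.stdout.strip(), fixed.returncode)  # 256 0
```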

@kpouget
Contributor Author

kpouget commented Mar 17, 2026

/test jump-ci CLUSTER skeleton abcd 1234


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
projects/core/ci_entrypoint/run_ci.py (1)

389-402: ⚠️ Potential issue | 🔴 Critical

success variable is referenced but never defined — causes NameError.

After switching to finish_reason, the variable success is no longer defined, but lines 396, 399, 400, and 402 still reference it. This will raise NameError at runtime.

Proposed fix
         finish_reason = prepare_ci.FinishReason.SUCCESS if result.returncode == 0 \
             else prepare_ci.FinishReason.ERROR
+        success = finish_reason == prepare_ci.FinishReason.SUCCESS

         # Post-execution checks and status reporting
         if prepare_ci:
             status_message = prepare_ci.postchecks(project, operation, start_time, finish_reason)

             msg = click.style(status_message, fg='green' if success else 'red')
         else:
             # Fallback to simple messages if prepare_ci not available
             msg = click.style(f"✅ {project} {operation} completed successfully", fg='green') if success \
                 else  click.style(f"❌ {project} {operation} failed with exit code {result.returncode}", fg='red')

         click.echo(msg, err=not success)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/run_ci.py` around lines 389 - 402, The code
references an undefined variable success after switching to finish_reason;
define a success boolean from result.returncode (e.g., success =
result.returncode == 0) or derive it from prepare_ci.FinishReason (e.g., success
= finish_reason == prepare_ci.FinishReason.SUCCESS) and use that in the
click.style calls and click.echo(err=...) so all occurrences (where msg color
and err flag are set) use the defined success variable; keep the existing
finish_reason logic and ensure success is declared before it's used in the
postchecks/else branches and styling.
♻️ Duplicate comments (21)
projects/legacy/library/export.py (2)

28-30: ⚠️ Potential issue | 🟡 Minor

Propagate the wrapped command result from the decorator.

On Line 30, the wrapped function result is discarded, so decorated commands always return None.

Suggested fix
         def wrapper(*args, **kwargs):
             init()
-            fct(*args, **kwargs)
+            return fct(*args, **kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/export.py` around lines 28 - 30, The wrapper
decorator currently calls init() and then fct(*args, **kwargs) but discards
fct's return value; update the wrapper (the inner function named wrapper that
calls init() and fct) to return the result of fct so decorated commands
propagate their return values (i.e., change the call to capture and return fct's
return value from wrapper).

82-84: ⚠️ Potential issue | 🟠 Major

Avoid shell-string command construction for AWS copy.

Line 84 interpolates CLI-provided artifacts_dirname into a shell command. Quoting is not sufficient against injection/breakage; invoke AWS with an argument list instead.

Suggested fix
     logging.info(f"Exporting to {export_dest} ({export_url})")
     aws_creds_filename = config.project.get_config("secrets.aws_credentials")
-    run.run(f"AWS_SHARED_CREDENTIALS_FILE=\"$PSAP_ODS_SECRET_PATH/{aws_creds_filename}\" aws s3 cp --recursive \"{artifacts_dirname}\" \"{export_dest}\" --acl public-read &> {env.ARTIFACT_DIR / 'aws_s3_cp.log'}")
+    creds_path = pathlib.Path(os.environ["PSAP_ODS_SECRET_PATH"]) / aws_creds_filename
+    log_path = env.ARTIFACT_DIR / "aws_s3_cp.log"
+    with open(log_path, "w") as log_file:
+        subprocess.run(
+            ["aws", "s3", "cp", "--recursive", str(artifacts_dirname), export_dest, "--acl", "public-read"],
+            check=True,
+            stdout=log_file,
+            stderr=subprocess.STDOUT,
+            env={**os.environ, "AWS_SHARED_CREDENTIALS_FILE": str(creds_path)},
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/export.py` around lines 82 - 84, The current code
builds a single shell string for aws s3 cp using interpolated artifacts_dirname,
which risks injection and quoting errors; instead construct an argument list and
call a programmatically-invoked command (or use boto3) to avoid a shell. Replace
the run.run(...) call in export.py with a subprocess.run([...]) passing
['aws','s3','cp','--recursive', artifacts_dirname, export_dest,
'--acl','public-read'] (or perform the upload via boto3 S3 client), set env to
include
AWS_SHARED_CREDENTIALS_FILE=os.path.join(os.environ['PSAP_ODS_SECRET_PATH'],
aws_creds_filename), redirect stdout/stderr to env.ARTIFACT_DIR /
'aws_s3_cp.log', and use check=True to surface errors; keep references to
aws_creds_filename, artifacts_dirname, export_dest, env.ARTIFACT_DIR and run.run
for locating the change.
projects/core/notifications/github/api.py (1)

25-38: ⚠️ Potential issue | 🟠 Major

Check the GitHub responses before indexing id and token.

Both requests parse JSON fields unconditionally. On any non-2xx from GitHub, Lines 28 and 38 turn the real API error into a KeyError/JSON failure, and projects/core/notifications/send.py:64-101 calls this helper without a guard. Validate installation_resp and access_token_resp before reading those fields.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/github/api.py` around lines 25 - 38, The code
reads installation_resp.json()["id"] and access_token_resp.json()["token"]
without checking GitHub's HTTP status, which can turn API errors into
KeyError/JSON errors; update the logic around installation_resp and
access_token_resp to verify response.status_code (or use
response.raise_for_status()) and handle non-2xx responses by logging the status
and response.text and raising a clear exception before attempting to access
["id"] or ["token"]; ensure the failure paths include descriptive messages so
callers (e.g., the code that calls this helper) receive meaningful errors
instead of JSON/key errors.
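The validate-before-indexing pattern can be sketched without hitting GitHub; `FakeResponse` mimics the relevant slice of a `requests.Response`, and `extract_field` is a hypothetical helper, not the project's API:

```python
import json

class FakeResponse:
    """Minimal stand-in for a requests.Response, for illustration only."""
    def __init__(self, status_code, body):
        self.status_code = status_code
        self.text = json.dumps(body)

    def json(self):
        return json.loads(self.text)

def extract_field(resp, field, context):
    # Surface the API error instead of letting a KeyError mask it.
    if not 200 <= resp.status_code < 300:
        raise RuntimeError(f"{context} failed ({resp.status_code}): {resp.text}")
    payload = resp.json()
    if field not in payload:
        raise RuntimeError(f"{context}: no '{field}' in response: {resp.text}")
    return payload[field]

print(extract_field(FakeResponse(200, {"id": 42}), "id", "installation lookup"))

try:
    extract_field(FakeResponse(404, {"message": "Not Found"}), "id", "installation lookup")
except RuntimeError as e:
    print(e)  # carries the status code and body, not a bare KeyError
```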
projects/legacy/library/run.py (4)

122-125: ⚠️ Potential issue | 🟠 Major

Pin run() to Bash before enabling pipefail and errtrace.

With shell=True, Python uses /bin/sh by default. On dash-based runners set -o errtrace is invalid, so protected commands fail before the real command runs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/run.py` around lines 122 - 125, The patch should
ensure run() pins the shell to Bash before enabling pipefail/errtrace: when
protect_shell is true and you prepend the "set -o ..." to command (in the block
that references protect_shell and command), add or set the subprocess.run
executable to a Bash binary (e.g., '/bin/bash' or resolved bash path) so the
shell used for subprocess.run(command, **args) supports 'set -o errtrace';
update the args passed into subprocess.run (or set shell=True with
executable='/bin/bash') so the protected command runs under Bash rather than the
system /bin/sh.
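Pinning the interpreter is a one-line change to the `subprocess.run` call; this standalone sketch shows `pipefail` behaving as intended once `executable` points at Bash (assuming `/bin/bash` exists on the runner):

```python
import subprocess

# Under dash-based /bin/sh, `set -o pipefail` fails before the real
# command runs; pinning executable to bash makes the option reliable.
protected = "set -o pipefail; false | true; echo rc=$?"
result = subprocess.run(
    protected, shell=True, executable="/bin/bash",
    capture_output=True, text=True,
)
print(result.stdout.strip())  # rc=1: the pipeline failure is visible
```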

227-237: ⚠️ Potential issue | 🟠 Major

Always restore iterable config values in finally.

If fct(*args, **kwargs) raises for one combination, the restore block never executes and the mutated config leaks into later cleanup or caller code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/run.py` around lines 227 - 237, The loop mutates
project config via config.project.set_config for each kv_entry then calls
fct(*args, **kwargs) but restores iterable_kv only after the call; if fct raises
the restore never runs and mutated config leaks. Wrap the call and
per-combination restore in a try/finally: after applying each kv_entry (the for
kv_entry in itertools.product(*kv_list) block and calls to
config.project.set_config), invoke fct inside try and move the existing
restoration loop that resets iterable_kv into the finally clause so
config.project.set_config(iter_key, iter_values, print=False) always runs
regardless of exceptions; keep the final outer restore (the last for
iter_key...) only if it’s required, otherwise remove duplicate restore to avoid
double-setting.

16-20: ⚠️ Potential issue | 🔴 Critical

Only call killpg() if this module actually created the process group.

If os.setpgrp() fails at import time, Line 170 still kills process group 0—the inherited group. That can terminate unrelated sibling processes on the runner instead of just the workers started here.

Also applies to: 167-170

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/run.py` around lines 16 - 20, The code
unconditionally calls os.killpg later even if os.setpgrp() failed, risking
killing the inherited group; modify the module to track whether this process
actually created a new process group: initialize a flag (e.g., created_pgrp =
False), set created_pgrp = True only if os.setpgrp() succeeds in the try block,
and then guard the later call to os.killpg (the cleanup logic that currently
uses killpg/pgid 0) so it only runs when created_pgrp is True; reference
os.setpgrp and the killpg/cleanup site when making the change.
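The guard pattern reduces to a flag checked before signalling. In the real module the flag would be set after a successful `os.setpgrp()` and the callback would be `os.killpg(0, signal.SIGTERM)`; here both are simulated so the sketch is safe to run:

```python
def kill_workers(created_pgrp, killpg):
    # Only signal the process group if this module actually created it;
    # otherwise killpg(0, ...) would hit the inherited group and could
    # take down unrelated sibling processes on the runner.
    if not created_pgrp:
        return "skipped"
    killpg()
    return "killed"

calls = []
print(kill_workers(False, lambda: calls.append("killpg")))  # skipped
print(kill_workers(True, lambda: calls.append("killpg")))   # killed
```

When `os.setpgrp()` fails at import time, `created_pgrp` stays `False` and cleanup never signals the inherited group.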

69-80: ⚠️ Potential issue | 🟠 Major

Build toolbox args from the serialized value, not the original object.

val is computed but never used, so dict/list inputs go out as Python reprs and quoted scalars are not escaped. The first nested config or single quote in a value will generate malformed CLI args.

Possible fix
+import shlex
@@
-        if isinstance(v, dict) or isinstance(v, list):
-            val = json.dumps(v)
-            arg = f"--{k}=\"{v}\""
-        else:
-            val = str(v).replace("'", "\'")
-            arg = f"--{k}='{v}'"
+        if isinstance(v, (dict, list)):
+            val = json.dumps(v)
+        else:
+            val = str(v)
+        arg = f"--{k}={shlex.quote(val)}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/run.py` around lines 69 - 80, In
_dict_to_run_toolbox_args, the computed variable val is never used so nested
dict/list values and quotes are mishandled; change the function to use val when
building the arg: for dict/list set val = json.dumps(v) and build the arg using
that serialized val (e.g. f'--{k}="{val}"'), and for scalars set val = str(v)
with single quotes escaped (e.g. val = str(v).replace("'", "\\'")) and build the
arg with single quotes (e.g. f"--{k}='{val}'"); ensure you reference
_dict_to_run_toolbox_args and the local variables k, v, val when applying the
fix.
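The serialize-then-quote approach can be demonstrated standalone; `dict_to_toolbox_args` below is a hypothetical reimplementation of the pattern, not the project's `_dict_to_run_toolbox_args`:

```python
import json
import shlex

def dict_to_toolbox_args(params):
    # Serialize each value first, then let shlex.quote handle embedded
    # quotes so the generated CLI arguments survive the shell unmangled.
    args = []
    for k, v in params.items():
        val = json.dumps(v) if isinstance(v, (dict, list)) else str(v)
        args.append(f"--{k}={shlex.quote(val)}")
    return " ".join(args)

# Nested config goes out as JSON, and a single quote in a scalar no
# longer breaks the quoting.
print(dict_to_toolbox_args({"name": "it's a test", "opts": {"a": 1}}))
```

Round-tripping the result through `shlex.split` confirms the arguments come back intact.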
projects/legacy/library/config.py (2)

88-91: ⚠️ Potential issue | 🟠 Major

Don't re-read secret overrides through the non-secret path.

After set_config(), Line 89 calls get_config() without handled_secretly=True. Any *$@... override will fail in resolve_reference() here, and once that is fixed this logging branch would still expose the resolved secret.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/config.py` around lines 88 - 91, After calling
set_config(key, value, print=False) do not re-read the value via the non-secret
path; change the subsequent get_config call to preserve secrecy (e.g., call
get_config(key, print=False, handled_secretly=True)) or avoid resolving secrets
at all and log a redacted placeholder instead; ensure the change prevents
resolve_reference() from being invoked on secret overrides and that logging uses
a masked value like "<secret hidden>" when key is secret.

145-153: ⚠️ Potential issue | 🔴 Critical

Raise immediately after recording threaded set_config() misuse.

Inside this function, print is the boolean parameter, so print(msg, file=f) throws TypeError instead of writing the marker file. Even after fixing that, the CI branch still falls through and mutates shared config from the worker thread.

Possible fix
         if threading.current_thread().name != "MainThread":
             msg = f"set_config({jsonpath}, {value}) cannot be called from a thread, to avoid race conditions."
             if os.environ.get("OPENSHIFT_CI") or os.environ.get("PERFLAB_CI"):
                 logging.error(msg)
                 with open(env.ARTIFACT_DIR / "SET_CONFIG_CALLED_FROM_THREAD", "a") as f:
-                    print(msg, file=f)
-            else:
-                raise RuntimeError(msg)
+                    f.write(f"{msg}\n")
+            raise RuntimeError(msg)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/config.py` around lines 145 - 153, In set_config, the
boolean parameter named print shadows the built-in print leading to a TypeError
when writing the marker file and the CI branch currently falls through and
continues execution; fix by (a) avoiding the shadowed name (rename the parameter
to e.g. do_print or print_flag) or call builtins.print explicitly, and use
file.write(msg + "\n") (or builtins.print(msg, file=f)) to record the marker to
env.ARTIFACT_DIR / "SET_CONFIG_CALLED_FROM_THREAD"; and (b) after
recording/logging the misuse in the OPENSHIFT_CI / PERFLAB_CI branch,
immediately raise RuntimeError(msg) (do not let the function continue) so
set_config does not mutate shared config from a worker thread. Ensure changes
are applied in the set_config method and reference the OPENSHIFT_CI / PERFLAB_CI
checks and env.ARTIFACT_DIR marker write.
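The main-thread guard itself is small; this sketch uses a plain dict in place of the project's config store and raises unconditionally, as the review asks:

```python
import threading

def set_config_guarded(store, key, value):
    # Refuse config mutation outside the main thread to avoid races.
    if threading.current_thread() is not threading.main_thread():
        raise RuntimeError(f"set_config({key}, ...) cannot be called from a thread")
    store[key] = value

store = {}
set_config_guarded(store, "mode", "ci")  # fine: main thread

errors = []
def worker():
    try:
        set_config_guarded(store, "mode", "thread")
    except RuntimeError as e:
        errors.append(str(e))

t = threading.Thread(target=worker)
t.start()
t.join()
print(store["mode"], len(errors))  # the worker never mutated the store
```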
projects/core/notifications/slack/api.py (1)

43-47: ⚠️ Potential issue | 🟠 Major

Use an exact/header-boundary match for the thread anchor.

A substring check will treat `Thread for PR #12` as a hit inside `Thread for PR #123`, so later notifications can attach to the wrong Slack thread.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/slack/api.py` around lines 43 - 47, The loop
currently uses a substring check (message_anchor in message["text"]) which
causes false positives; change it to match the anchor as a header/boundary —
e.g., compare against the message's first line or use a bounded regex so only an
exact header line matches (use message["text"].splitlines()[0] == message_anchor
or re.match(r'(^|\n)'+re.escape(message_anchor)+r'($|\n)', message["text"])).
Update the loop that iterates over history and the variables message_anchor,
message["text"], and the return (message["ts"], message["text"]) to only return
when the exact/header-boundary match succeeds.
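The boundary-anchored match can be exercised against a fake message history; `find_thread_root` is a hypothetical helper shaped like the lookup in `api.py`, not its actual code:

```python
import re

def find_thread_root(history, anchor):
    # Match the anchor as a full header line, not a substring, so
    # "PR #12" does not match inside "PR #123".
    pattern = re.compile(r"(^|\n)" + re.escape(anchor) + r"($|\n)")
    for message in history:
        if pattern.search(message["text"]):
            return message["ts"]
    return None

history = [
    {"ts": "1", "text": "Thread for PR #123\ndetails"},
    {"ts": "2", "text": "Thread for PR #12\ndetails"},
]
print(find_thread_root(history, "Thread for PR #12"))  # "2", not "1"
```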
projects/core/notifications/send.py (4)

87-94: ⚠️ Potential issue | 🟠 Major

Short-circuit dry_run before fetching the GitHub token.

The get_user_token call on line 87 still executes before the dry_run check. A dry run should not require valid credentials or network access.

Proposed fix
-    user_token = github_api.get_user_token(pem_file, client_id, org, repo)
     if dry_run:
         logging.info(f"Github notification:\n{message}")
-        logging.info(f"***")
-        logging.info(f"***")
-        logging.info(f"***\n")
+        logging.info("***")
+        logging.info("***")
+        logging.info("***\n")

         return True

+    user_token = github_api.get_user_token(pem_file, client_id, org, repo)
     resp = github_api.send_notification(org, repo, user_token, pr_number, message)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 87 - 94, Move the dry-run
short-circuit to execute before calling github_api.get_user_token so
credentials/network are not required for dry runs: check the dry_run flag at the
start of the notification flow (the block that currently logs the message and
returns True) and return immediately if set, then only call
github_api.get_user_token (and assign user_token) when dry_run is False; update
references to user_token/github_api.get_user_token in send.py accordingly.

454-465: ⚠️ Potential issue | 🟠 Major

Abort if the CPT thread root could not be created.

When send_message() fails on line 459, channel_ok is unpacked but never checked. The code continues to post thread_message with an invalid channel_msg_ts, which will either fail or create a top-level message instead of a threaded reply.

Proposed fix
         else:
             channel_msg_ts, channel_ok = slack_api.send_message(client, message=channel_message)
+            if not channel_ok:
+                logging.error("Failed to create Slack thread root message")
+                return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 454 - 465, The code posts a
thread reply even when sending the channel root failed because channel_ok
returned by slack_api.send_message is never checked; update the flow after
send_message (the call that sets channel_msg_ts, channel_ok) to abort (log error
and return False/appropriate failure) if channel_ok is falsy or channel_msg_ts
is missing before calling get_slack_cpt_message or posting the thread_message,
and ensure the error log references send_message failure; key symbols:
slack_api.send_message, channel_msg_ts, channel_ok, get_slack_cpt_message,
thread_message.

369-377: ⚠️ Potential issue | 🟠 Major

Strip the Slack token file contents.

read_text() preserves trailing newlines that secret files often contain. Passing raw content to WebClient can cause authentication failures.

Proposed fix
-    return token_file.read_text()
+    return token_file.read_text().strip()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 369 - 377,
get_slack_secrets currently returns the raw contents of the SLACK_TOKEN_FILE via
token_file.read_text(), which can include trailing newlines and break
authentication when passed to WebClient; change get_slack_secrets to strip
whitespace from the file contents (e.g., call .strip() on the read_text()
result) before returning the token so callers using WebClient receive a clean
token string while preserving the existing existence check for token_file.

15-24: ⚠️ Potential issue | 🔴 Critical

SECRET_ENV_KEYS is a string, not a tuple — the loop iterates over characters.

The parentheses without a trailing comma create a string, so the for loop in get_secrets() iterates "P", "S", "A", ... instead of the full env var name. The secret path will never be discovered.

Proposed fix
-    SECRET_ENV_KEYS = ("PSAP_TOPSAIL_JUMP_CI_SECRET_PATH")
+    SECRET_ENV_KEYS = ("PSAP_TOPSAIL_JUMP_CI_SECRET_PATH",)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 15 - 24, In get_secrets(),
SECRET_ENV_KEYS is defined as a string due to missing trailing comma, causing
the for loop to iterate characters instead of environment variable names; change
SECRET_ENV_KEYS to an actual iterable of names (e.g., add a trailing comma to
make it a tuple or use a list) so the loop in get_secrets() checks the full
"PSAP_TOPSAIL_JUMP_CI_SECRET_PATH" key correctly and the subsequent
os.environ.get(secret_env_key) can succeed.
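The trap is worth seeing in two lines: parentheses alone do not make a tuple in Python, only the trailing comma does.

```python
# ("X") is just a parenthesized string; iterating it yields characters.
as_string = ("PSAP_TOPSAIL_JUMP_CI_SECRET_PATH")
as_tuple = ("PSAP_TOPSAIL_JUMP_CI_SECRET_PATH",)

print(type(as_string).__name__)  # str
print(type(as_tuple).__name__)   # tuple
print(list(as_string)[:4])       # ['P', 'S', 'A', 'P'] -- single characters
print(list(as_tuple))            # the full env var name, as intended
```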
projects/legacy/library/ansible_toolbox.py (4)

270-295: ⚠️ Potential issue | 🟠 Major

Use a real temp inventory file instead of /proc/.../fd/....

The current approach depends on Linux-only procfs paths. This breaks on non-Linux controllers and is fragile in restricted containers. Use a named temp file and clean it up in finally.

Proposed fix
-            inventory_fd, path = tempfile.mkstemp()
-            os.remove(path) # using only the FD. Ensures that the file disappears when this process terminates
-            inventory_f = os.fdopen(inventory_fd, 'w')
+            inventory_file = tempfile.NamedTemporaryFile(
+                mode='w',
+                prefix='ansible_inventory_',
+                suffix='.ini',
+                delete=False
+            )
+            inventory_path = inventory_file.name
             # ... write inventory content ...
-            print(inventory_content, file=inventory_f)
-            inventory_f.flush()
+            print(inventory_content, file=inventory_file)
+            inventory_file.close()
             # ... later in the command construction ...
-            cmd += ["--inventory-file", f"/proc/{os.getpid()}/fd/{inventory_fd}"]
+            cmd += ["--inventory-file", inventory_path]
             # ... in finally block ...
+            try:
+                os.remove(inventory_path)
+            except FileNotFoundError:
+                pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 270 - 295, The temp
inventory is created with tempfile.mkstemp() then os.remove(path) which forces
use of /proc/.../fd/... paths; change to create a real named temp file (e.g. use
tempfile.NamedTemporaryFile(delete=False) or keep the mkstemp path and open that
path) and write inventory_content to that filename instead of relying on the
FD-to-proc path (affecting the inventory_fd/inventory_f usage and the
inventory_content write), then ensure the file is explicitly removed in a
finally block to guarantee cleanup; update the code around inventory_fd, path,
inventory_f, inventory_content and the host_properties assembly so the inventory
is written to a real filesystem path and removed afterwards.

40-47: ⚠️ Potential issue | 🟠 Major

Skip private toolbox files before importing them.

The underscore check on line 47 runs after importlib.import_module() on line 44. Any top-level side effect or missing dependency in _*.py helper modules can break Toolbox() initialization before the file is skipped.

Proposed fix
         for toolbox_file in (TOPSAIL_DIR / "projects").glob("*/toolbox/*.py"):
-            if toolbox_file.name.startswith("."): continue
-
-            project_toolbox_module = str(toolbox_file.relative_to(TOPSAIL_DIR).with_suffix("")).replace(os.path.sep, ".")
-            mod = importlib.import_module(project_toolbox_module)
             toolbox_name = toolbox_file.with_suffix("").name
-
-            if toolbox_name.startswith("_"): continue
+            if toolbox_file.name.startswith("."):
+                continue
+            if toolbox_name.startswith("_"):
+                continue
+
+            project_toolbox_module = str(toolbox_file.relative_to(TOPSAIL_DIR).with_suffix("")).replace(os.path.sep, ".")
+            mod = importlib.import_module(project_toolbox_module)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 40 - 47, The loop
currently calls importlib.import_module(project_toolbox_module) before checking
for private files, which lets _*.py side effects run; change the order so you
compute toolbox_name (e.g., toolbox_file.with_suffix("").name) or check
toolbox_file.name.startswith("_") first and continue for private files, and only
then build project_toolbox_module and call importlib.import_module; update the
loop around TOPSAIL_DIR / "projects" glob to skip files starting with "_" before
importlib.import_module is invoked.

164-175: ⚠️ Potential issue | 🟠 Major

Make ARTIFACT_EXTRA_LOGS_DIR allocation atomic.

The count-then-create pattern is racy. Two concurrent runs sharing the same ARTIFACT_DIR can both pick the same NNN__... directory and overwrite each other's logs.

Proposed fix using atomic creation
         if env.get("ARTIFACT_EXTRA_LOGS_DIR") is None:
             artifact_base_dirname = f"{self.group}__{self.command}" if self.group and self.command \
                 else "__".join(sys.argv[1:3])
-
-            previous_extra_count = len(list(artifact_dir.glob("*__*")))
-
-            name = f"{previous_extra_count:03d}__{prefix}{artifact_base_dirname}{suffix}"
-
-            env["ARTIFACT_EXTRA_LOGS_DIR"] = str(pathlib.Path(env["ARTIFACT_DIR"]) / name)
+            # Use atomic directory creation to avoid race conditions
+            for counter in itertools.count():
+                name = f"{counter:03d}__{prefix}{artifact_base_dirname}{suffix}"
+                candidate = pathlib.Path(env["ARTIFACT_DIR"]) / name
+                try:
+                    candidate.mkdir(parents=True, exist_ok=False)
+                    env["ARTIFACT_EXTRA_LOGS_DIR"] = str(candidate)
+                    break
+                except FileExistsError:
+                    continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 164 - 175, The
current count-then-create logic for ARTIFACT_EXTRA_LOGS_DIR is racy; change it
to allocate the NNN__... directory atomically by looping and attempting to
create the candidate path with mkdir(exist_ok=False) until it succeeds (catch
FileExistsError and increment the counter), then set
env["ARTIFACT_EXTRA_LOGS_DIR"] to the successful path; update the code around
variables artifact_dir, previous_extra_count/name generation and
artifact_extra_logs_dir so you only call artifact_extra_logs_dir.mkdir once
inside the try and do not overwrite an existing directory chosen by another
process.

183-193: ⚠️ Potential issue | 🟠 Major

Artifact files may leak secrets.

_python.gen.cmd, _python.args.yaml, _ansible.play.yaml, _ansible.env, and _python.cmd write unredacted inputs. Any token, password, or sensitive value passed via CLI args or environment will be persisted to disk.

Consider adding a sanitization step that redacts values matching sensitive patterns (e.g., keys containing token, secret, password, key, kubeconfig).

Also applies to: 336-341
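A minimal sketch of such a sanitizer, assuming a helper named redact_sensitive_map and an illustrative (not exhaustive) set of key patterns — neither exists in the repository yet:

```python
import re

# Illustrative pattern list; extend it to match the project's real secret key names.
SENSITIVE_KEY_RE = re.compile(r"token|secret|password|key|kubeconfig", re.IGNORECASE)

def redact_sensitive_map(data):
    """Return a copy of `data` with values of sensitive-looking keys redacted."""
    redacted = {}
    for key, value in data.items():
        if SENSITIVE_KEY_RE.search(str(key)):
            redacted[key] = "<REDACTED>"
        elif isinstance(value, dict):
            # Recurse so nested config maps are sanitized too.
            redacted[key] = redact_sensitive_map(value)
        else:
            redacted[key] = value
    return redacted
```

Calling this on self.py_command_args (and on the env dict) right before the YAML/file writers would keep the artifacts useful while dropping credential values.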

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 183 - 193, The
artifact writers (when self.py_command_args is present and other places writing
_ansible.play.yaml, _ansible.env, _python.cmd) currently persist raw CLI args
and env values which may contain secrets; add a sanitization step before writing
by implementing and invoking a helper like redact_sensitive_map(data: dict) that
iterates keys/values and replaces values with "<REDACTED>" when keys or string
values match sensitive patterns (e.g., key names containing token, secret,
password, key, kubeconfig, api[_-]?key) or when values match common secret
formats; call this sanitizer on self.py_command_args and on any dicts being
dumped to YAML or printed to files (references: self.py_command_args,
artifact_extra_logs_dir, and the filenames "_python.gen.cmd",
"_python.args.yaml", "_ansible.play.yaml", "_ansible.env", "_python.cmd") so
only redacted data is written to disk, leaving non-sensitive fields untouched.
projects/legacy/library/env.py (1)

30-35: ⚠️ Potential issue | 🟡 Minor

Raise AttributeError for unknown module attributes.

globals()[name] turns a miss into KeyError, which breaks normal module __getattr__ semantics for callers like hasattr() and import/introspection code. Convert the miss to AttributeError instead.

🛠️ Suggested fix
 def __getattr__(name):
 
     if name == "ARTIFACT_DIR":
         return get_tls_artifact_dir()
 
-    return globals()[name]
+    try:
+        return globals()[name]
+    except KeyError:
+        raise AttributeError(name) from None
(Per PEP 562, a missing attribute in a module-level __getattr__ should raise AttributeError, not KeyError.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/env.py` around lines 30 - 35, The module-level
__getattr__ currently returns globals()[name] which raises KeyError for unknown
attributes; change it so unknown names raise AttributeError instead—e.g., in
__getattr__ handle the ARTIFACT_DIR special-case (using get_tls_artifact_dir())
and for other names either check name in globals() and return globals()[name] or
catch KeyError and re-raise as AttributeError(f"module has no attribute
{name}"); ensure callers like hasattr() behave correctly rather than getting
KeyError.
projects/jump_ci/orchestration/ci.py (2)

79-85: ⚠️ Potential issue | 🔴 Critical

Invoke the callable returned by run_on_jump_ci().

These branches treat run_on_jump_ci(...) like an exit code, but it returns a callable. prepare, test, and cleanup therefore never execute the remote phase, and cleanup() also drops the real status by always returning 0.

Verify that projects/jump_ci/testing/test.py implements jump_ci(...) as a factory and compare it with the current call sites; expected result: jump_ci(...) returns a nested callable, while ci.py only invokes the factory.

#!/bin/bash
set -euo pipefail
echo "=== jump_ci helper ==="
sed -n '1,220p' projects/jump_ci/testing/test.py
echo
echo "=== current ci.py call sites ==="
sed -n '79,119p' projects/jump_ci/orchestration/ci.py

Also applies to: 97-100, 111-116

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/jump_ci/orchestration/ci.py` around lines 79 - 85, The code treats
run_on_jump_ci(...) as an exit code but it returns a callable (factory) that
must be invoked; update prepare, test, and cleanup to call the returned function
(e.g., fn = run_on_jump_ci("test_ci"); status = fn()) and use that status for
logging and return values so the remote phase actually runs and cleanup
preserves the real status; apply the same pattern wherever run_on_jump_ci(...)
is currently used (including the other two call sites) and ensure cleanup
returns the invoked callable's integer status instead of always 0.

67-71: ⚠️ Potential issue | 🔴 Critical

Call the exported prepare() entrypoint.

This call site targets prepare_jump_ci(), but the testing module exposes prepare(...). As written, this phase fails with AttributeError instead of preparing the remote environment.

🛠️ Suggested fix
-            prepare_jump_ci_mod.prepare_jump_ci()
+            prepare_jump_ci_mod.prepare()

Verify that projects/jump_ci/testing/prepare_jump_ci.py exports prepare(...) and not prepare_jump_ci(...); expected result: a def prepare( definition and no def prepare_jump_ci(.

#!/bin/bash
set -euo pipefail
echo "=== prepare_jump_ci.py entrypoints ==="
rg -nP '^\s*def\s+prepare(_jump_ci)?\s*\(' projects/jump_ci/testing/prepare_jump_ci.py -A2 -B2
echo
echo "=== current call site ==="
sed -n '67,75p' projects/jump_ci/orchestration/ci.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/jump_ci/orchestration/ci.py` around lines 67 - 71, The call
currently invokes prepare_jump_ci.prepare_jump_ci(), but the testing module
exports a prepare(...) entrypoint; update the call site to invoke
prepare_jump_ci.prepare(...) instead of prepare_jump_ci.prepare_jump_ci(), and
run a quick check that projects/jump_ci/testing/prepare_jump_ci.py defines def
prepare(...) (no def prepare_jump_ci(...)) so the import/module symbol matches
the call.
🧹 Nitpick comments (11)
.github/workflows/check_consistency.yml (2)

13-20: Consider adding timeout-minutes to prevent indefinite job runs.

If any validation script hangs, the job will run until GitHub's 6-hour limit. Adding a reasonable timeout improves CI reliability.

Suggested change
 jobs:
   check_topsail_consistency:
     runs-on: ubuntu-latest
+    timeout-minutes: 15

     steps:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check_consistency.yml around lines 13 - 20, Add a
job-level timeout to prevent indefinite runs by setting the GitHub Actions
`timeout-minutes` key on the `check_topsail_consistency` job; update the job
block that defines `check_topsail_consistency` (the job name in the workflow) to
include a reasonable `timeout-minutes` value (e.g., 10-30) so the job will be
terminated if validation scripts hang.

4-11: Consider adding workflow_dispatch for manual triggering.

Adding workflow_dispatch would allow maintainers to manually trigger this workflow for debugging or ad-hoc validation without waiting for a push or scheduled run.

Suggested change
 on:
   # Triggers the workflow on push or pull request events but only for the main branch
   pull_request:
     branches: [main]
   push:
     branches: [main]
+  workflow_dispatch:
   schedule:
     - cron: '0 */8 * * *'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check_consistency.yml around lines 4 - 11, The workflow
currently triggers only on pull_request, push, and schedule; add manual trigger
support by adding the workflow_dispatch event under the top-level on: block so
maintainers can manually start the workflow (i.e., add a workflow_dispatch key
alongside pull_request, push, and schedule in the on: section).
projects/legacy/library/sizing.py (2)

3-4: Module-level logging configuration may affect importers.

Setting logging.getLogger().setLevel(logging.INFO) at module scope configures the root logger globally, which can unintentionally change log levels for other modules that import this one. Consider moving this into the if __name__ == "__main__" block or using a named logger.

♻️ Suggested improvement
 import logging
-logging.getLogger().setLevel(logging.INFO)
+
+logger = logging.getLogger(__name__)

Then replace all logging.info(...) / logging.error(...) calls with logger.info(...) / logger.error(...), and configure the root logger only in the CLI block:

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` around lines 3 - 4, The module currently
calls logging.getLogger().setLevel(logging.INFO) at import time which globally
alters root logging; change this by creating a module-level named logger (e.g.,
logger = logging.getLogger(__name__)) and replace all uses of
logging.info()/logging.error() with logger.info()/logger.error(); remove the
global setLevel call and instead configure the root logger in a CLI block using
if __name__ == "__main__": logging.basicConfig(level=logging.INFO) so importers
are not affected.

38-38: Unused constant EXTRA_USERS.

EXTRA_USERS is defined but never referenced in the code. The comment mentions "+10% of users" but the value is 1, which adds confusion. Either remove this dead code or implement the intended buffer logic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` at line 38, EXTRA_USERS is dead code and
either should be removed or turned into a real buffer used by the sizing
calculations; decide whether you want an absolute extra-user constant or a
percentage buffer, then implement accordingly: either delete EXTRA_USERS and
remove any references, or change EXTRA_USERS to a meaningful value (e.g., 0.10
for 10% or an integer delta) and apply it in the user-count calculation(s)
(where base user totals are computed) by adjusting the computed user count
(e.g., multiply by (1 + EXTRA_USERS) for percent semantics or add EXTRA_USERS
for absolute semantics) and update the comment to match the chosen semantics.
projects/legacy/library/export.py (1)

75-76: Normalize path_prefix before building S3 destination/URL.

Line 75 and Line 76 can produce double slashes when path_prefix is empty. Normalizing once makes links consistent.

Suggested refactor
-    export_dest = f"s3://{bucket}/{path_prefix}/{run_id}"
-    export_url = f"https://{bucket}.s3.amazonaws.com/index.html#{path_prefix}/{run_id}/"
+    normalized_prefix = path_prefix.strip("/")
+    prefix_part = f"{normalized_prefix}/" if normalized_prefix else ""
+    export_dest = f"s3://{bucket}/{prefix_part}{run_id}"
+    export_url = f"https://{bucket}.s3.amazonaws.com/index.html#{prefix_part}{run_id}/"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/export.py` around lines 75 - 76, Normalize
path_prefix before constructing S3 paths to avoid double slashes: trim
leading/trailing slashes and treat an empty path_prefix as omitted when building
export_dest and export_url. Update the logic in export.py around the export_dest
and export_url assignments (variables export_dest, export_url, and inputs
bucket, path_prefix, run_id) so you compute a cleaned_path (e.g.,
path_prefix.strip("/")) and only include it in the f-strings when non-empty,
ensuring resulting URLs and S3 destinations never contain duplicate slashes.
projects/core/notifications/slack/api.py (1)

9-10: Make the Slack destination configurable.

Both read and write paths are pinned to CHANNEL_ID, and projects/core/notifications/send.py:255-279 does not provide an override. That forces every environment into the same destination and makes testing, forks, and staged rollouts hard to isolate.

Also applies to: 25-27, 58-60
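One possible shape for the override, sketched with an assumed environment variable name (TOPSAIL_SLACK_CHANNEL_ID) and a placeholder standing in for the current hard-coded value:

```python
import os

# "C0123456789" is a placeholder for the hard-coded CHANNEL_ID constant.
DEFAULT_CHANNEL_ID = "C0123456789"

def get_channel_id():
    """Prefer an explicit override so forks and staging can target their own channel."""
    return os.environ.get("TOPSAIL_SLACK_CHANNEL_ID", DEFAULT_CHANNEL_ID)
```

Both the read and write paths in api.py could then call get_channel_id() instead of referencing CHANNEL_ID directly.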

projects/core/notifications/send.py (2)

201-209: Unused unpacked variables org and repo.

Both variables are unpacked but never used in get_slack_channel_message. Prefix with underscore to indicate intentional discard.

Proposed fix
-    org, repo = get_org_repo()
+    _org, _repo = get_org_repo()

Or simply remove the line if get_org_repo() has no side effects.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 201 - 209, In
get_slack_channel_message, get_org_repo() unpacks org and repo into unused
variables; either prefix them with underscores (e.g., _org, _repo) to indicate
intentional discard or remove the unpacking entirely if get_org_repo() has no
side effects—update the line in the get_slack_channel_message function
accordingly to eliminate the unused variable warning.

89-92: Remove extraneous f prefix from strings without placeholders.

These f-strings contain no placeholders.

Proposed fix
-        logging.info(f"***")
-        logging.info(f"***")
-        logging.info(f"***\n")
+        logging.info("***")
+        logging.info("***")
+        logging.info("***\n")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/send.py` around lines 89 - 92, The logging calls
in send.py use f-strings with no placeholders (the logging.info calls that log
"Github notification:\n{message}" and the "***" lines); remove the unnecessary
leading f from the string literals (e.g., change logging.info(f"***") to
logging.info("***") and the other non-interpolated strings), but keep any
f-string where actual interpolation occurs (e.g., if message needs formatting,
ensure it's a proper f-string or use logging formatting). This targets the
logging.info calls in the module.
projects/core/ci_entrypoint/prepare_ci.py (2)

468-480: Stale docstring: refers to success parameter that no longer exists.

The docstring at line 476 mentions success: False for failure, "True" for normal completion, but the parameter is now finish_reason: FinishReason.

Proposed fix
     Args:
         project: Project name that was executed
         operation: Operation that was executed
         start_time: Unix timestamp when execution started (None if unknown)
-        success: False for failure, "True" for normal completion
+        finish_reason: FinishReason indicating how the operation completed

     Returns:
         Status message string
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/prepare_ci.py` around lines 468 - 480, The
docstring for postchecks is stale: remove the outdated `success` parameter
description and replace it with a correct description for the `finish_reason:
FinishReason` parameter (e.g., explain that it indicates why the run finished
such as SUCCESS, FAILURE, or CANCELLED), and update any related wording in the
Args section to reflect finish_reason rather than success; keep the rest of the
docstring intact and ensure the Returns description still matches the function
behavior.

464-466: Remove extraneous f prefix and log the exception properly.

The f-string has no placeholders. Also, logger.exception() already logs the exception traceback, so the message doesn't need {e}.

Proposed fix
     except Exception as e:
-        logger.exception(f"Failed to send notifications")
+        logger.exception("Failed to send notifications")
         # Don't fail the entire job if notifications fail
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/prepare_ci.py` around lines 464 - 466, In the
except block that currently does "except Exception as e:
logger.exception(f\"Failed to send notifications\")", remove the unnecessary
f-string and log directly with logger.exception("Failed to send notifications")
(keep the exception variable e if needed elsewhere or remove it if unused); this
uses logger.exception to include the traceback without trying to interpolate the
exception into the message. Ensure you update the code around logger.exception
and the except clause accordingly.
projects/core/notifications/README.md (1)

51-59: Add language specifier to fenced code block.

The architecture diagram code block should specify a language (e.g., text or plaintext) for better rendering.

Proposed fix
-```
+```text
 run_ci.py
 ├── prepare_ci.prepare() (before execution)
 ├── [CI operation execution]
 └── prepare_ci.postchecks() (after execution)
     └── send_job_completion_notification()
         ├── GitHub: send_job_completion_notification_to_github()
         └── Slack: send_job_completion_notification_to_slack()
 ```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/README.md` around lines 51 - 59, Add a language
specifier to the fenced code block in the README so the architecture diagram
renders correctly; update the block wrapping the run_ci.py diagram (the block
that shows prepare_ci.prepare(), prepare_ci.postchecks(),
send_job_completion_notification(),
send_job_completion_notification_to_github(), and
send_job_completion_notification_to_slack()) to start with a language like
```text (or ```plaintext) instead of a bare ``` so the renderer treats it as
plain text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fd9e94bb-443a-4b71-894d-dab095208738

📥 Commits

Reviewing files that changed from the base of the PR and between 8f5fbac and b6db104.

📒 Files selected for processing (31)
  • .github/steps/prepare_minimal_ubuntu_image/action.yml
  • .github/workflows/check_consistency.yml
  • .github/workflows/check_python_syntax.yml
  • projects/core/ci_entrypoint/github/pr_args.py
  • projects/core/ci_entrypoint/prepare_ci.py
  • projects/core/ci_entrypoint/run_ci.py
  • projects/core/notifications/README.md
  • projects/core/notifications/__init__.py
  • projects/core/notifications/github/__init__.py
  • projects/core/notifications/github/api.py
  • projects/core/notifications/github/gen_jwt.py
  • projects/core/notifications/send.py
  • projects/core/notifications/slack/__init__.py
  • projects/core/notifications/slack/api.py
  • projects/jump_ci/orchestration/ci.py
  • projects/jump_ci/testing/config.yaml
  • projects/jump_ci/testing/prepare_jump_ci.py
  • projects/jump_ci/testing/run
  • projects/jump_ci/testing/test.py
  • projects/jump_ci/testing/tunnelling.py
  • projects/jump_ci/testing/utils.py
  • projects/jump_ci/toolbox/jump_ci.py
  • projects/legacy/library/__init__.py
  • projects/legacy/library/ansible_toolbox.py
  • projects/legacy/library/common.py
  • projects/legacy/library/config.py
  • projects/legacy/library/env.py
  • projects/legacy/library/export.py
  • projects/legacy/library/run.py
  • projects/legacy/library/sizing.machines
  • projects/legacy/library/sizing.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • .github/workflows/check_python_syntax.yml
  • projects/core/notifications/__init__.py
  • .github/steps/prepare_minimal_ubuntu_image/action.yml
  • projects/core/notifications/slack/__init__.py
  • projects/legacy/library/sizing.machines
  • projects/core/ci_entrypoint/github/pr_args.py

with:
fetch-depth: 0

- uses: ./.github/steps/prepare_minimal_ubuntu_image

⚠️ Potential issue | 🟡 Minor

Fix missing description in composite action metadata.

Static analysis (actionlint) reports that the composite action at .github/steps/prepare_minimal_ubuntu_image/action.yml is missing the required description field in its metadata. While the workflow may still run, this violates the GitHub Actions specification and could cause issues with tooling or future GitHub updates.

The fix should be applied in .github/steps/prepare_minimal_ubuntu_image/action.yml:

name: 'Prepare TOPSAIL minimal ubuntu image'
description: 'Prepares a minimal Ubuntu image for TOPSAIL toolbox execution'
# ... rest of the action
🧰 Tools
🪛 actionlint (1.7.11)

[error] 22-22: description is required in metadata of "Prepare TOPSAIL minimal ubuntu image" action at "/home/jailuser/git/.github/steps/prepare_minimal_ubuntu_image/action.yml"

(action)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check_consistency.yml at line 22, Add the required
description metadata to the composite action's action.yml for the composite
action named "Prepare TOPSAIL minimal ubuntu image": open the action.yml for
that composite action and add a top-level description field (e.g., "Prepares a
minimal Ubuntu image for TOPSAIL toolbox execution") alongside the existing name
field so the action metadata conforms to GitHub Actions schema and actionlint.

Comment thread projects/core/ci_entrypoint/run_ci.py
Comment on lines +67 to +70
pr_data = response.json()
created_at = pr_data["created_at"]

return datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").timestamp(), pr_data

⚠️ Potential issue | 🟠 Major

Parse created_at as UTC, not local time.

created_at ends with Z, but datetime.strptime(...).timestamp() on a naive datetime uses the runner's local timezone. On non-UTC executors that shifts the Slack oldest filter and can miss or mis-thread the notification lookup.

Possible fix
-from datetime import datetime
+from datetime import datetime, timezone
@@
-    return datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").timestamp(), pr_data
+    created_ts = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc).timestamp()
+    return created_ts, pr_data
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/github/api.py` around lines 67 - 70, The
timestamp conversion currently parses created_at as a naive datetime then calls
.timestamp(), which uses the local timezone; update the return in the function
that builds pr_data/created_at so created_at is parsed as an aware UTC datetime
before calling .timestamp() (e.g., parse created_at and attach UTC tzinfo or use
a UTC-aware parser) so the returned timestamp is in UTC rather than local time.

Comment on lines +73 to +84
if __name__ == "__main__":
PEM_FILE = "topsail-bot.2024-09-18.private-key.pem"
CLIENT_ID_FILE = "client_id"
ORG = "openshift-psap"
REPO = "topsail"

with open(CLIENT_ID_FILE) as f:
client_id = f.read().strip()

user_token = get_user_token(PEM_FILE, client_id, ORG, REPO)

print(user_token)

⚠️ Potential issue | 🟠 Major

Don't print the installation token from the CLI path.

Line 84 emits a live GitHub credential to stdout. That's easy to leak into CI logs or shell history during debugging; if this helper stays, print only masked metadata or require an explicit opt-in flag.
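A masked-metadata variant could look like the following sketch; the helper name is illustrative, not part of the repository:

```python
def mask_token(token, keep=4):
    """Show only the first/last few characters of a credential for debugging."""
    if len(token) <= 2 * keep:
        return "<REDACTED>"
    return f"{token[:keep]}…{token[-keep:]} (len={len(token)})"
```

Replacing print(user_token) with print(mask_token(user_token)) preserves enough to confirm a token was minted without putting the credential itself in CI logs.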

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/github/api.py` around lines 73 - 84, The script's
__main__ block prints a live GitHub installation token (user_token) to stdout
which can leak credentials; change the behavior in the __main__ section that
uses PEM_FILE, CLIENT_ID_FILE, ORG, REPO and get_user_token so it does NOT
directly print the token. Instead either (a) print only masked metadata (e.g.,
token length plus first/last 4 chars with the middle redacted) or (b) require an
explicit opt-in (a CLI flag or environment variable like --print-token or
PRINT_TOKEN=true) before emitting the full token; update the call site that
currently does print(user_token) to implement one of these safer options and add
a short comment explaining the chosen policy.

Comment on lines +15 to +23
payload = {
# Issued at time
'iat': int(time.time()),
# JWT expiration time (10 minutes maximum)
'exp': int(time.time()) + 600,

# GitHub App's client ID
'iss': client_id
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "gen_jwt.py" -type f

Repository: openshift-psap/topsail-ng

Length of output: 116


🏁 Script executed:

cat -n projects/core/notifications/github/gen_jwt.py

Repository: openshift-psap/topsail-ng

Length of output: 1465


🏁 Script executed:

rg -n "gen_jwt\|payload" projects/core/notifications/github/ -A 3 -B 3

Repository: openshift-psap/topsail-ng

Length of output: 51


🌐 Web query:

GitHub App JWT authentication time claims iat exp clock skew tolerance

💡 Result:

For GitHub App JWTs (used to call GET /app and to mint installation access tokens), GitHub’s docs specify these time-claim rules:

  • iat (issued at): required. To handle clock drift, GitHub recommends setting iat to 60 seconds in the past (e.g., now - 60) and keeping your server time accurate (e.g., NTP). [1]
  • exp (expires at): required. Must be no more than 10 minutes in the future (max TTL ≈ 600 seconds). [1]

GitHub does not document an explicit “allowed clock skew” window beyond the above guidance; the practical approach is to backdate iat by ~60s and keep exp within the 10-minute maximum. [1]

Source
[1] GitHub Docs — “Generating a JSON Web Token (JWT) for a GitHub App” (docs.github.com)


Backdate iat by 60 seconds for GitHub App JWT reliability.

GitHub's JWT validation accommodates clock drift by recommending iat be set 60 seconds in the past. The current code calls time.time() separately for iat and exp, causing potential validation failures if the runner's clock is ahead. Capture the current time once and derive both claims from that base.

Suggested fix
-    payload = {
+    now = int(time.time())
+    payload = {
         # Issued at time
-        'iat': int(time.time()),
+        'iat': now - 60,
         # JWT expiration time (10 minutes maximum)
-        'exp': int(time.time()) + 600,
+        'exp': now + 600,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/notifications/github/gen_jwt.py` around lines 15 - 23, The
payload's 'iat' and 'exp' are computed separately causing potential clock-drift
failures; in gen_jwt.py capture a single base timestamp (e.g., now =
int(time.time())), set 'iat' to now - 60 to backdate by 60 seconds and set 'exp'
to now + 600 so both claims derive from the same base; update the payload dict
(the block that defines 'iat', 'exp', and 'iss') to use this single timestamp
variable.

Comment on lines +14 to +28
class MyThread(threading.Thread):

def __init__(self, *args, **kwargs):
super(MyThread, self).__init__(*args, **kwargs)
self.parent_artifact_dict = None

def start(self):
self.parent_artifact_dict = get_tls_artifact_dir()
super(MyThread, self).start()

def run(self):
_set_tls_artifact_dir(self.parent_artifact_dict)
super(MyThread, self).run()

threading.Thread = MyThread

⚠️ Potential issue | 🔴 Critical

Import-time Thread monkey patch can break thread startup before init().

Importing this module rewrites threading.Thread process-wide, so unrelated code now instantiates MyThread. MyThread.start() immediately reads _tls_artifact_dir.val; before init() has populated TLS, that raises AttributeError and prevents the thread from starting.

Also applies to: 40-45
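An explicit opt-in patch could be sketched as follows, under the assumption that the thread-local state may be empty until init() runs; the class and function names here are illustrative:

```python
import threading

_tls = threading.local()

class ArtifactAwareThread(threading.Thread):
    """Propagate the parent's artifact dir, tolerating uninitialized TLS."""

    def start(self):
        # getattr with a default keeps start() safe before init() populates TLS.
        self.parent_artifact_dir = getattr(_tls, "artifact_dir", None)
        super().start()

    def run(self):
        if self.parent_artifact_dir is not None:
            _tls.artifact_dir = self.parent_artifact_dir
        super().run()

def patch_threading():
    """Apply the process-wide patch only when the caller explicitly asks for it."""
    threading.Thread = ArtifactAwareThread
```

Callers that need the propagation would invoke patch_threading() after init(), instead of the patch running as an import side effect.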

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/env.py` around lines 14 - 28, The module currently
monkey-patches threading.Thread at import and MyThread.start() calls
get_tls_artifact_dir() immediately, which can raise AttributeError if TLS isn't
initialized; change MyThread.start to not assume TLS exists by wrapping the
get_tls_artifact_dir() call in a safe check/try-except (set
self.parent_artifact_dict = None if TLS missing) so threads still start, and
remove the unconditional process-wide assignment threading.Thread = MyThread at
import; instead provide an explicit function (e.g., patch_threading()) or a
factory to apply the patch when the runtime guarantees TLS is initialized.
Ensure references: class MyThread, methods start and run, functions
get_tls_artifact_dir and _set_tls_artifact_dir, and the monkey-patch
threading.Thread = MyThread are updated accordingly.

Comment on lines +66 to +76
def NextArtifactDir(name, *, lock=None, counter_p=None):
if lock:
with lock:
next_count = counter_p[0]
counter_p[0] += 1
else:
next_count = next_artifact_index()

dirname = get_tls_artifact_dir() / f"{next_count:03d}__{name}"

return TempArtifactDir(dirname)

⚠️ Potential issue | 🟠 Major

Artifact directory allocation can silently collide.

len(glob("*__*")) only works while indices stay contiguous and single-threaded. If an older artifact dir is removed, or two callers race without a shared lock/counter, they can choose the same index; mkdir(exist_ok=True) then reuses the existing directory and mixes logs.

Also applies to: 84-89, 108-109

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/env.py` around lines 66 - 76, The current
NextArtifactDir logic can silently reuse an existing directory when indices
collide; change it to attempt atomic directory creation and retry on collision
instead of using exist_ok=True or relying on contiguous indices. In
NextArtifactDir (and the similar blocks around the other occurrences you noted),
compute the candidate dirname with get_tls_artifact_dir() /
f"{next_count:03d}__{name}" then try to create it atomically (Path.mkdir(...,
exist_ok=False) or os.mkdir) inside a try/except that catches FileExistsError
and, on collision, increments the index (via counter_p or next_artifact_index())
and retries until successful; preserve the lock/counter behavior when provided
and apply the same retry-on-collision pattern in the other referenced
functions/blocks (the ones around lines 84-89 and 108-109) and return
TempArtifactDir(dirname) only after a successful atomic create.

sys.exit(1)
except KeyboardInterrupt:
print() # empty line after ^C
logging.error(f"Interrupted.")

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the F541 pattern still exists in the reviewed file.
rg -n 'logging\.error\(f"Interrupted\."\)' projects/legacy/library/export.py

Repository: openshift-psap/topsail-ng

Length of output: 111


Remove the inert f-string flagged by Ruff (F541).

Line 112 uses an f-string without any placeholders, which serves no purpose.

Suggested fix
-        logging.error(f"Interrupted.")
+        logging.error("Interrupted.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
logging.error(f"Interrupted.")
logging.error("Interrupted.")
🧰 Tools
🪛 Ruff (0.15.6)

[error] 112-112: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/export.py` at line 112, The logging call uses an
unnecessary f-string flagged by Ruff (F541); replace the inert f-string
f"Interrupted." with a plain string literal "Interrupted." in the
logging.error(...) call in projects/legacy/library/export.py so the message is a
normal string (i.e., update the logging.error invocation).

Comment on lines +13 to +33
def parse_machines():
machines = {}
with open(MACHINES_FILE) as f:
for _line in f.readlines():
line = _line.strip()
if line.startswith("# "):
group = line.strip("# ")

if not line or line.startswith("#"): continue

instance, cpu, memory, price, *accel = line.split(", ")

entry = types.SimpleNamespace()
entry.cpu = int(cpu.split()[0])
entry.memory = int(memory.split()[0])
entry.price = float(price[1:])
entry.group = group
entry.name = instance

machines[entry.name] = entry
return machines

⚠️ Potential issue | 🟡 Minor

Potential UnboundLocalError if data precedes group comment, and incorrect prefix removal.

  1. UnboundLocalError: If a data line appears before any # comment line, group is referenced at line 29 before being assigned.

  2. Incorrect prefix removal: line.strip("# ") removes all # and space characters from both ends, not just the "# " prefix. For example, "# Memory #" becomes "Memory" (stripping trailing # too).

🐛 Proposed fix
 def parse_machines():
     machines = {}
+    group = None
     with open(MACHINES_FILE) as f:
         for _line in f.readlines():
             line = _line.strip()
             if line.startswith("# "):
-                group = line.strip("# ")
+                group = line[2:]  # remove "# " prefix
 
-            if not line or line.startswith("#"): continue
+            if not line or line.startswith("#"):
+                continue
 
-            instance, cpu, memory, price, *accel = line.split(", ")
+            instance, cpu, memory, price, *_accel = line.split(", ")
 
             entry = types.SimpleNamespace()
             entry.cpu = int(cpu.split()[0])
🧰 Tools
🪛 Ruff (0.15.6)

[error] 21-21: Multiple statements on one line (colon)

(E701)


[warning] 23-23: Unpacked variable accel is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` around lines 13 - 33, In parse_machines(),
initialize the group variable before the loop (e.g., group = None or an empty
string) to avoid an UnboundLocalError when the first data line appears before
any "# " header, and replace the use of line.strip("# ") with a precise prefix
removal such as line[len("# "):].strip() (or line[2:].strip()) when handling the
startswith("# ") case so you only remove the leading "# " and not other "#"
characters; keep the existing continue that skips comment lines after setting
group.
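The fixed parsing loop can be sketched against an in-memory sample instead of MACHINES_FILE; the column layout mirrors the one the review describes, and the sample values are illustrative:

```python
import io
import types

SAMPLE = """\
# General Purpose
m6i.xlarge, 4 cores, 16 GB, $0.192
"""

def parse_machines(fobj):
    machines = {}
    group = None  # initialized so data before any header cannot UnboundLocalError
    for raw in fobj:
        line = raw.strip()
        if line.startswith("# "):
            group = line[2:].strip()  # drop only the "# " prefix, keep inner '#'
        if not line or line.startswith("#"):
            continue
        instance, cpu, memory, price, *_accel = line.split(", ")
        entry = types.SimpleNamespace(
            name=instance,
            cpu=int(cpu.split()[0]),
            memory=int(memory.split()[0]),
            price=float(price[1:]),
            group=group,
        )
        machines[entry.name] = entry
    return machines
```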


pod_size = {"cpu": cpu, "memory": memory}

machine_size = machines[machine_type]

⚠️ Potential issue | 🟡 Minor

Missing validation for unknown machine_type.

If machine_type is not found in the parsed machines dictionary, line 47 raises an unhandled KeyError with no user-friendly message. Consider validating and providing available options.

🛡️ Proposed fix
     machines = parse_machines()
 
     pod_size = {"cpu": cpu, "memory": memory}
 
-    machine_size = machines[machine_type]
+    if machine_type not in machines:
+        available = ", ".join(sorted(machines.keys()))
+        raise ValueError(f"Unknown machine type '{machine_type}'. Available: {available}")
+    machine_size = machines[machine_type]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` at line 47, The code directly indexes the
machines dict with machine_type (machine_size = machines[machine_type]) which
can raise a KeyError for unknown values; update the logic in the function where
machines and machine_type are used to validate the key first (e.g., check if
machine_type in machines), and if missing raise a clear ValueError or return a
user-friendly error containing the requested machine_type and a list of
available keys (machines.keys()); ensure this validation is applied before any
use of machine_size so callers see a helpful message instead of an unhandled
KeyError.

@kpouget
Contributor Author

kpouget commented Mar 17, 2026

/test jump-ci CLUSTER skeleton abcd 1234


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
projects/core/ci_entrypoint/run_ci.py (1)

389-402: ⚠️ Potential issue | 🔴 Critical

success is undefined—this will crash at runtime.

The refactoring replaced the boolean success with finish_reason, but Lines 396, 399, and 402 still reference the undefined success variable. This causes a NameError when execution reaches the post-checks block.

🐛 Proposed fix
         finish_reason = prepare_ci.FinishReason.SUCCESS if result.returncode == 0 \
             else prepare_ci.FinishReason.ERROR
+        success = (finish_reason == prepare_ci.FinishReason.SUCCESS)

         # Post-execution checks and status reporting
         if prepare_ci:
             status_message = prepare_ci.postchecks(project, operation, start_time, finish_reason)

             msg = click.style(status_message, fg='green' if success else 'red')
         else:
             # Fallback to simple messages if prepare_ci not available
             msg = click.style(f"✅ {project} {operation} completed successfully", fg='green') if success \
                 else  click.style(f"❌ {project} {operation} failed with exit code {result.returncode}", fg='red')

         click.echo(msg, err=not success)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/run_ci.py` around lines 389 - 402, The code
references an undefined variable success in the postchecks block; update the
conditional uses to derive success from finish_reason (or result.returncode)
instead of success: e.g., treat success as finish_reason ==
prepare_ci.FinishReason.SUCCESS (or result.returncode == 0) when building the
styled message and setting click.echo(err=...). Modify the branches that call
prepare_ci.postchecks and the fallback message logic to use this computed
boolean so references on lines that currently use success become valid.
♻️ Duplicate comments (13)
projects/legacy/library/sizing.py (4)

121-125: ⚠️ Potential issue | 🔴 Critical

Do not return machine count via sys.exit; preserve output and failure semantics.

Line 122 encodes result as process exit status (truncated/semantically wrong for non-zero counts), and Line 125 exits 0 on failure. Print the count to stdout and use non-zero exit codes only for errors.

Proposed fix
-    try:
-        sys.exit(main(machine_type, user_count, cpu, memory)) # returns the number of nodes required
-    except Exception as e:
-        logging.error(f"'{' '.join(sys.argv)}' failed: {e.__class__.__name__}: {e}")
-        sys.exit(0) # 0 means that an error occured
+    try:
+        machine_count = main(machine_type, user_count, cpu, memory)
+    except Exception as e:
+        logging.error(f"'{' '.join(sys.argv)}' failed: {e.__class__.__name__}: {e}")
+        sys.exit(1)
+
+    print(machine_count)
+    sys.exit(0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` around lines 121 - 125, The current code
uses sys.exit(main(...)) to encode the machine count into the process exit code
and exits with 0 on exceptions; change this to have main(machine_type,
user_count, cpu, memory) return the integer count (or raise on error), print
that count to stdout (e.g., using print) instead of passing it to sys.exit, and
on exceptions call logging.error with the existing message and then sys.exit
with a non-zero exit code; update the invocation that currently references
sys.exit(main(...)) and the exception handler that calls sys.exit(0) so the
program preserves correct output and failure semantics (refer to main, sys.exit,
logging.error, and sys.argv in the diff).

47-47: ⚠️ Potential issue | 🟡 Minor

Validate unknown machine types before dict indexing.

Line 47 raises a raw KeyError for invalid input; return a clear error with available options instead.

Proposed fix
-    machine_size = machines[machine_type]
+    if machine_type not in machines:
+        available = ", ".join(sorted(machines.keys()))
+        raise ValueError(f"Unknown machine type '{machine_type}'. Available: {available}")
+    machine_size = machines[machine_type]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` at line 47, Before indexing machines,
validate machine_type exists to avoid a raw KeyError: check if machine_type in
machines and if not raise a clear ValueError (or return an explicit error) that
includes the invalid machine_type and the sorted list of available keys from
machines; then proceed to set machine_size = machines[machine_type]. Refer to
the machines dict and the machine_type variable (and the machine_size
assignment) when implementing the check so the error message provides actionable
options for callers.

13-23: ⚠️ Potential issue | 🟡 Minor

Initialize group and avoid strip("# ") for header parsing.

Line 29 can raise UnboundLocalError if a data row appears before a # header, and Line 19 removes characters too broadly (strip) instead of removing only the prefix. Also address Ruff items on Line 21 (E701) and Line 23 (unused unpack target) while touching this block.

Proposed fix
 def parse_machines():
     machines = {}
+    group = None
     with open(MACHINES_FILE) as f:
         for _line in f.readlines():
             line = _line.strip()
             if line.startswith("# "):
-                group = line.strip("# ")
+                group = line[2:].strip()
 
-            if not line or line.startswith("#"): continue
+            if not line or line.startswith("#"):
+                continue
 
-            instance, cpu, memory, price, *accel = line.split(", ")
+            instance, cpu, memory, price, *_accel = line.split(", ")
+            if group is None:
+                raise ValueError(f"Machine entry '{instance}' has no group header")

Also applies to: 29-29

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` around lines 13 - 23, In parse_machines(),
initialize the variable group (e.g., group = None) before the loop to avoid
UnboundLocalError, change the header handling from line.strip("# ") to using
startswith("# ") and extracting the header with slicing (e.g., line[2:]) so you
only remove the prefix, avoid multiple statements on one line (split the "if not
line..." into its own line) to fix E701, and replace the unused unpack target
accel with a named throwaway (e.g., _accel) or drop it entirely so instance,
cpu, memory, price, _accel = ... doesn’t leave an unused name; keep the
conditional logic to skip blank/comment lines intact and set group when a "# "
header is seen so subsequent data rows can reference it.

57-79: ⚠️ Potential issue | 🟠 Major

Size by per-node allocatable capacity; current math can under-provision and divide by zero.

Line 67 divides by machine_count and fails for user_count == 0, and aggregate-only sizing can still overpack a node after integer redistribution. Lines 78-79 should compute unallocated values from allocatable resources (after reservation), not raw instance totals.

Proposed fix
-    machine_count_cpu = total_cpu_count / (machine_size.cpu - RESERVED_CPU)
-    machine_count_memory = total_memory_count / (machine_size.memory - RESERVED_MEM)
+    alloc_cpu = machine_size.cpu - RESERVED_CPU
+    alloc_mem = machine_size.memory - RESERVED_MEM
+    if user_count == 0:
+        logging.info("No users requested. Provisioning 0 machines.")
+        return 0
+    if cpu <= 0 or memory <= 0:
+        raise ValueError("cpu and memory must be positive")
+    if alloc_cpu <= 0 or alloc_mem <= 0:
+        raise ValueError(f"{machine_type} has no allocatable CPU/memory after reservation")
+
+    machine_count_cpu = total_cpu_count / alloc_cpu
+    machine_count_memory = total_memory_count / alloc_mem
@@
-    machine_exact_count = max([machine_count_cpu, machine_count_memory])
-    machine_count = math.ceil(machine_exact_count)
-
-    pods_per_machine = math.ceil(user_count/machine_count)
+    max_pods_per_machine = min(
+        math.floor(alloc_cpu / cpu),
+        math.floor(alloc_mem / memory),
+        MAX_POD_PER_NODE,
+    )
+    if max_pods_per_machine <= 0:
+        raise ValueError(f"{machine_type} cannot fit pods of cpu={cpu}, mem={memory}")
+
+    machine_count = max(
+        math.ceil(max(machine_count_cpu, machine_count_memory)),
+        math.ceil(user_count / max_pods_per_machine),
+    )
+    pods_per_machine = math.ceil(user_count / machine_count)
@@
-    unallocated_cpu = machine_size.cpu - pod_size['cpu'] * pods_per_machine
-    unallocated_mem = machine_size.memory - pod_size['memory'] * pods_per_machine
+    unallocated_cpu = alloc_cpu - pod_size['cpu'] * pods_per_machine
+    unallocated_mem = alloc_mem - pod_size['memory'] * pods_per_machine
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/sizing.py` around lines 57 - 79, The sizing logic can
divide by zero and uses raw instance totals instead of allocatable capacity;
update computations to use allocatable_cpu = machine_size.cpu - RESERVED_CPU and
allocatable_mem = machine_size.memory - RESERVED_MEM when computing
machine_count_cpu and machine_count_memory and when computing
unallocated_cpu/unallocated_mem, and guard against zero/negative allocatable
(raise/log and treat machine_count as 0 or 1 based on policy). Also protect the
user_count path: if user_count == 0 set machine_count = 0 and pods_per_machine =
0 (skip MAX_POD_PER_NODE logic), otherwise ensure machine_count = max(1,
math.ceil(machine_exact_count)) before doing pods_per_machine =
math.ceil(user_count / machine_count) to avoid division by zero; reference
symbols: machine_count_cpu, machine_count_memory, machine_exact_count,
machine_count, pods_per_machine, RESERVED_CPU, RESERVED_MEM, unallocated_cpu,
unallocated_mem, pod_size, machine_size, user_count, MAX_POD_PER_NODE.
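The allocatable-capacity math can be sketched standalone; the reservation constants and `machines_needed` name are illustrative, not TOPSAIL's defaults:

```python
import math

RESERVED_CPU = 2       # cores held back per node (assumption)
RESERVED_MEM = 4       # GB held back per node (assumption)
MAX_POD_PER_NODE = 250

def machines_needed(user_count, pod_cpu, pod_mem, node_cpu, node_mem):
    if user_count == 0:
        return 0  # guard the division below
    alloc_cpu = node_cpu - RESERVED_CPU
    alloc_mem = node_mem - RESERVED_MEM
    # Per-node packing limit prevents overpacking after integer redistribution.
    pods_per_node = min(
        alloc_cpu // pod_cpu,
        alloc_mem // pod_mem,
        MAX_POD_PER_NODE,
    )
    if pods_per_node <= 0:
        raise ValueError("pod does not fit on this machine type")
    return max(
        math.ceil(user_count * pod_cpu / alloc_cpu),
        math.ceil(user_count * pod_mem / alloc_mem),
        math.ceil(user_count / pods_per_node),
    )
```

Taking the max of the aggregate and per-node bounds is what keeps a single node from being overcommitted even when the aggregate totals fit.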
projects/legacy/library/config.py (3)

88-91: ⚠️ Potential issue | 🟠 Major

Secret overrides are re-read without secret handling.

Line 89 calls get_config(key, print=False) without handled_secretly=True. For *$@... overrides, this can raise or leak behavior in logs.

Proposed fix
-            actual_value = self.get_config(key, print=False) # ensure that key has been set, raises an exception otherwise
-            if log:
-                logging.info(f"config override: {key} --> {actual_value}")
+            actual_value = self.get_config(key, print=False, handled_secretly=handled_secretly)
+            if log:
+                if handled_secretly:
+                    logging.info("config override: %s --> <secret>", key)
+                else:
+                    logging.info("config override: %s --> %s", key, actual_value)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/config.py` around lines 88 - 91, After calling
self.set_config(key, value, print=False) the subsequent validation uses
self.get_config(key, print=False) without secret handling; change the call to
pass handled_secretly=True (i.e., self.get_config(key, print=False,
handled_secretly=True)) so secret-style overrides (e.g., "*$@...") are retrieved
safely and do not raise or expose secrets in logs, and keep the existing logging
behavior (logging.info for non-secret keys) unchanged.

95-99: ⚠️ Potential issue | 🟡 Minor

Catch KeyError in preset lookup path.

Line 97 catches IndexError, but missing config keys are converted to KeyError in get_config().

Proposed fix
-        except IndexError:
+        except KeyError:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/config.py` around lines 95 - 99, The try/except
around the preset lookup needs to handle KeyError (the get_config call
translates missing keys to KeyError) instead of IndexError; update the except
clause in the block that calls self.get_config(f'ci_presets["{name}"]',
print=False) to catch KeyError (or catch (KeyError, IndexError) if you want to
preserve existing behavior), and keep the existing logging and re-raise behavior
using process-local symbols get_config and the surrounding method in config.py.

145-153: ⚠️ Potential issue | 🔴 Critical

set_config() thread guard is broken (builtin shadow + missing raise in CI path).

Line 145 shadows builtin print; Line 151 then calls it as a function and will throw TypeError. Also, in CI branch the function logs but does not stop, so it can still mutate shared state from non-main threads.

#!/bin/bash
set -euo pipefail

python - <<'PY'
import ast, pathlib
p = pathlib.Path("projects/legacy/library/config.py")
src = p.read_text()
tree = ast.parse(src)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef) and node.name == "set_config":
        print("set_config line:", node.lineno)
        print("params:", [a.arg for a in node.args.args])
        for call in ast.walk(node):
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Name) and call.func.id == "print":
                print("print() call line:", call.lineno)
PY

rg -n -C2 'SET_CONFIG_CALLED_FROM_THREAD|OPENSHIFT_CI|PERFLAB_CI|raise RuntimeError' projects/legacy/library/config.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/config.py` around lines 145 - 153, The set_config
thread-guard is broken because the parameter named "print" shadows the built-in
print (causing a TypeError at the file-write call) and the CI branch only
logs/writes the marker file but does not stop execution; update the set_config
signature (function set_config) to rename the "print" parameter to something
like "do_print" or "verbose" to avoid shadowing the built-in, update any
internal uses accordingly so builtins.print works for writing to
env.ARTIFACT_DIR / "SET_CONFIG_CALLED_FROM_THREAD", and ensure the CI branch
(the branch checking os.environ.get("OPENSHIFT_CI") or "PERFLAB_CI") raises the
same RuntimeError after logging/writing the marker so non-main threads cannot
proceed to mutate shared state.
projects/legacy/library/env.py (3)

30-35: ⚠️ Potential issue | 🟡 Minor

__getattr__ should raise AttributeError, not leak KeyError.

Line 35 returns globals()[name]; for missing names this raises KeyError, which breaks standard attribute semantics (hasattr, introspection).

#!/bin/bash
set -euo pipefail
python - <<'PY'
import ast, pathlib
p = pathlib.Path("projects/legacy/library/env.py")
src = p.read_text()
tree = ast.parse(src)
for node in tree.body:
    if isinstance(node, ast.FunctionDef) and node.name == "__getattr__":
        print(f"Found __getattr__ at line {node.lineno}")
        print(ast.get_source_segment(src, node))
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/env.py` around lines 30 - 35, The __getattr__
implementation in function __getattr__ currently returns globals()[name] which
raises KeyError for unknown attributes; change it to explicitly raise
AttributeError when the name isn't found so attribute access and
hasattr/introspection behave correctly: check for "ARTIFACT_DIR" and return
get_tls_artifact_dir() as before, but for any other name attempt to fetch from
globals() safely (e.g., using .get()) and if not present raise
AttributeError(f"module {__name__!r} has no attribute {name!r}"); ensure the
raised exception is AttributeError rather than letting a KeyError leak.

66-76: ⚠️ Potential issue | 🟠 Major

Artifact directory allocation can collide and silently reuse existing paths.

Line 109 derives index from glob("*__*"), and Line 87 uses mkdir(exist_ok=True). Under races or gaps, two writers can target the same dir.

Proposed fix
 def NextArtifactDir(name, *, lock=None, counter_p=None):
-    if lock:
-        with lock:
-            next_count = counter_p[0]
-            counter_p[0] += 1
-    else:
-        next_count = next_artifact_index()
-
-    dirname = get_tls_artifact_dir() / f"{next_count:03d}__{name}"
-
-    return TempArtifactDir(dirname)
+    base = get_tls_artifact_dir()
+    while True:
+        if lock:
+            with lock:
+                next_count = counter_p[0]
+                counter_p[0] += 1
+        else:
+            next_count = next_artifact_index()
+        dirname = base / f"{next_count:03d}__{name}"
+        try:
+            dirname.mkdir(parents=True, exist_ok=False)
+            return TempArtifactDir(dirname)
+        except FileExistsError:
+            continue
@@
     def __enter__(self):
         self.previous_dirname = get_tls_artifact_dir()
         os.environ["ARTIFACT_DIR"] = str(self.dirname)
-        self.dirname.mkdir(exist_ok=True)
+        self.dirname.mkdir(parents=True, exist_ok=True)

Also applies to: 84-89, 108-109

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/env.py` around lines 66 - 76, NextArtifactDir can
silently reuse existing paths because creation uses exist_ok=True and
next_artifact_index/glob races; change allocation to attempt atomic creation and
retry on collision: inside NextArtifactDir (and the path allocation logic in
next_artifact_index/get_tls_artifact_dir usage) build the candidate dirname and
call mkdir with exist_ok=False (or otherwise try an atomic create) and catch
FileExistsError, then increment the counter and retry until creation succeeds;
preserve the lock behavior when provided (use the lock to fetch/increment
counter atomically) and only fall back to the non-locked
next_artifact_index/rescan loop when no lock is given, finally return
TempArtifactDir for the successfully created directory.

20-29: ⚠️ Potential issue | 🔴 Critical

Process-wide threading.Thread monkey patch is unsafe and can break startup.

Line 28 rewires thread behavior globally, and Line 21 assumes TLS is initialized; this can throw before init() and prevent thread start.

Proposed fix
 class MyThread(threading.Thread):
@@
     def start(self):
-        self.parent_artifact_dict = get_tls_artifact_dir()
+        try:
+            self.parent_artifact_dict = get_tls_artifact_dir()
+        except AttributeError:
+            self.parent_artifact_dict = None
         super(MyThread, self).start()
@@
-        _set_tls_artifact_dir(self.parent_artifact_dict)
+        if self.parent_artifact_dict is not None:
+            _set_tls_artifact_dir(self.parent_artifact_dict)
         super(MyThread, self).run()
-
-threading.Thread = MyThread
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/env.py` around lines 20 - 29, The global monkey-patch
of threading.Thread is unsafe: replace or guard the behavior so startup can't
fail before TLS is initialized. Remove the unconditional assignment
threading.Thread = MyThread and instead use a dedicated subclass MyThread where
you explicitly construct threads that need TLS propagation; if you must patch
globally, only do it after init() has run. Also harden MyThread.start and run:
do not assume get_tls_artifact_dir() always succeeds—capture/get the TLS value
lazily or wrap get_tls_artifact_dir()/_set_tls_artifact_dir() calls in
try/except and fall back to None so threads can still start. Locate MyThread,
its start and run methods and the uses of
get_tls_artifact_dir/_set_tls_artifact_dir to implement these changes.
.github/workflows/check_consistency.yml (1)

22-22: ⚠️ Potential issue | 🟠 Major

Composite action metadata is incomplete and fails actionlint.

Line 22 invokes .github/steps/prepare_minimal_ubuntu_image, but its action.yml is missing the required top-level `description` field.

Proposed fix in .github/steps/prepare_minimal_ubuntu_image/action.yml
 name: "Prepare TOPSAIL minimal ubuntu image"
+description: "Prepares a minimal Ubuntu image for TOPSAIL toolbox execution"
 
 runs:
   using: "composite"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check_consistency.yml at line 22, Add the missing
top-level description to the composite action metadata: open the composite
action's action.yml for prepare_minimal_ubuntu_image and add a concise top-level
description field (e.g., description: "Prepare minimal Ubuntu image for CI
steps") so actionlint no longer flags the file as incomplete; ensure the
description value is a short string at the root of the YAML alongside name and
runs.
projects/legacy/library/__init__.py (1)

11-11: ⚠️ Potential issue | 🟡 Minor

Split the inline conditional in merge_dicts() to pass lint.

Line 11 has two statements on one line (E701) and will fail Ruff in strict CI.

Proposed fix
-    if path is None: path = []
+    if path is None:
+        path = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/__init__.py` at line 11, In merge_dicts(), split the
inline conditional that sets the local variable path so it isn't two statements
on one line: replace the single-line `if path is None: path = []` with a proper
two-line if block that checks `if path is None` and then assigns `path = []`;
locate this inside the merge_dicts function where the local variable `path` is
initialized and update that statement only.
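A hedged reconstruction of `merge_dicts` with the lint fix applied; the conflict handling is an assumption based on the classic recursive-merge pattern, not the file's exact body:

```python
def merge_dicts(a, b, path=None):
    if path is None:
        path = []
    for key in b:
        if key in a and isinstance(a[key], dict) and isinstance(b[key], dict):
            merge_dicts(a[key], b[key], [*path, str(key)])
        elif key in a and a[key] != b[key]:
            raise ValueError(f"Conflict at {'.'.join([*path, str(key)])}")
        else:
            a[key] = b[key]
    return a
```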
projects/jump_ci/orchestration/ci.py (1)

111-116: ⚠️ Potential issue | 🟠 Major

Cleanup returns 0 unconditionally, ignoring the actual result.

Line 114 captures ret from the cleanup call, but Line 116 returns 0 regardless of whether cleanup succeeded. This masks failures.

🐛 Proposed fix
             ret = run_on_jump_ci("pre_cleanup_ci")
             self.log("Cleanup phase completed!", "success" if not ret else "error")
-            return 0
+            return ret
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/jump_ci/orchestration/ci.py` around lines 111 - 116, The cleanup
block currently ignores the actual result by always returning 0; change the
return so the function propagates the real cleanup result from run_on_jump_ci.
After calling ret = run_on_jump_ci("pre_cleanup_ci") (and logging via
self.log(...)), return ret (or a boolean/exit code based on ret) instead of the
hard-coded 0 so failures are not masked; update the return statement in the try
block that follows run_on_jump_ci to use ret and keep the existing self.log(ret)
behavior.
🧹 Nitpick comments (3)
projects/legacy/library/__init__.py (1)

14-14: Prefer unpacking over list concatenation in recursive path extension.

Line 14 can use [*path, str(key)] for clearer intent.

Proposed refactor
-            merge_dicts(a[key], b[key], path + [str(key)])
+            merge_dicts(a[key], b[key], [*path, str(key)])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/__init__.py` at line 14, The recursive call in
merge_dicts currently extends the path via list concatenation (path +
[str(key)]); change it to use unpacking for clarity and intent by reconstructing
the list with [*path, str(key)] when passing to merge_dicts so the path
extension is expressed with unpacking instead of concatenation; update the
recursive call site in merge_dicts where merge_dicts(a[key], b[key], path +
[str(key)]) is invoked to use the unpacking form.
projects/legacy/library/run.py (1)

161-163: Use bare raise to preserve the traceback.

raise e re-raises the exception but inserts an extra traceback entry at the raise e line, and it re-raises the wrong object if e was rebound inside the handler. A bare raise re-raises the active exception with its original traceback intact.

♻️ Suggested fix
             except Exception as e:
                 if not self.exit_on_exception:
-                    raise e
+                    raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/run.py` around lines 161 - 163, In the except block
that currently does "except Exception as e: if not self.exit_on_exception: raise
e" in projects/legacy/library/run.py, change the re-raise to a bare "raise" so
the original traceback is preserved; locate the handler using the
self.exit_on_exception check inside the method containing that except and
replace "raise e" with a bare re-raise.
projects/legacy/library/ansible_toolbox.py (1)

125-129: Use explicit | None type hints instead of implicit Optional.

PEP 484 prohibits implicit Optional. The parameters with = None defaults should have their types annotated explicitly.

♻️ Suggested type hints
     def __init__(self,
-                 ansible_vars: dict = None,
-                 role_name: str = None,
+                 ansible_vars: dict | None = None,
+                 role_name: str | None = None,
                  group: str = "",
                  command: str = ""):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 125 - 129, Update
the __init__ signature to use explicit None unions for parameters that default
to None: change "ansible_vars: dict = None" to "ansible_vars: dict | None =
None" and "role_name: str = None" to "role_name: str | None = None" in the
__init__ method of the class in ansible_toolbox.py (leave group: str = "" and
command: str = "" unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/core/ci_entrypoint/github/pr_args.py`:
- Around line 270-271: The documentation line that says jump-ci sets
jump_ci.{cluster,project,args} is inaccurate; the parser actually updates
cluster.name, project.name, and project.args. Update the docstring/comment text
(the `/test` directive help text in pr_args.py) to list the actual keys
emitted—cluster.name, project.name, project.args—or alternatively change the
parser to emit jump_ci.cluster/jump_ci.project/jump_ci.args; prefer the former
and modify the doc wording to match the parser's behavior so the docs and code
are consistent.
- Around line 89-103: The jump-ci branch misreads and rejects valid positional
args: change the length check to require at least two remaining tokens (use
len(args) < 2 instead of <= 2), pop the first two tokens in order (cluster =
args.pop(0); target_project = args.pop(0)) instead of using args.pop() which
removes the last item, then set remaining_args = args and update result as
before ('cluster.name', 'project.name', 'project.args') so extraction aligns
with the earlier consumed project_name and preserves extra args.
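The corrected parsing described above can be sketched as follows; this is a hypothetical standalone function, not the actual pr_args.py code, and the result keys mirror the ones named in the comment:

```python
def parse_jump_ci_args(args):
    # Sketch of the corrected /test jump-ci parsing: consume the cluster
    # and target project from the FRONT of the token list, keep the rest.
    if len(args) < 2:
        raise ValueError("usage: /test jump-ci <cluster> <project> [args...]")
    cluster = args.pop(0)         # first positional: cluster name
    target_project = args.pop(0)  # second positional: project to run
    return {
        "cluster.name": cluster,
        "project.name": target_project,
        "project.args": args,     # remaining tokens, order preserved
    }
```

Using `args.pop(0)` twice keeps extraction aligned with reading order, whereas `args.pop()` would silently take the last token.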

In `@projects/core/ci_entrypoint/prepare_ci.py`:
- Line 465: The logger.exception call in prepare_ci.py uses an unnecessary
f-string: replace logger.exception(f"Failed to send notifications") with a plain
string logger.exception("Failed to send notifications") to remove the unused
interpolation and resolve F541; update the call at the location where
notifications are sent (the logger.exception in the notification error handling
block) — no other behavior changes are needed.
- Around line 22-33: In prepare_ci.py locate the logger.exception call that
currently reads logger.exception(f"Failed to send notifications") and remove the
unnecessary f-string prefix so it becomes logger.exception("Failed to send
notifications"); update the invocation in the same function where notifications
are sent (the block that catches notification errors) to use a plain string
literal instead of an f-string.

In `@projects/jump_ci/orchestration/ci.py`:
- Around line 42-52: Both lock_cluster and unlock_cluster currently return 1 on
error but implicitly return None on success; update both functions (lock_cluster
and unlock_cluster) to explicitly return 0 on successful completion (after the
success log lines) so CLI handlers that pass their results to sys.exit() receive
the proper success exit code. Ensure the return 0 is added after
self.log("Cluster locked successfully", "success") in lock_cluster and after the
corresponding success log in unlock_cluster.
- Around line 26-27: Change the bare re-raise of ImportError to chain the
original exception so tracebacks are preserved: in the except ImportError as e
block that currently does raise RuntimeError(f"Jump CI testing functionality not
available: {e}"), modify the raise to use exception chaining (raise
RuntimeError("Jump CI testing functionality not available") from e) so the
original ImportError is attached to the new RuntimeError.

In `@projects/legacy/library/ansible_toolbox.py`:
- Around line 49-51: The hasattr check is using a non-standard dunder name
"__entrypoint" which never exists; update the branch to check for and use
"__entrypoint__" instead (replace hasattr(mod, "__entrypoint") and getattr(mod,
"__entrypoint") with "__entrypoint__"), or remove the unreachable branch
entirely if you prefer not to support a module-level entrypoint; adjust the
references around mod and toolbox_name accordingly so the code assigns
self.__dict__[toolbox_name] = getattr(mod, "__entrypoint__") when present.

In `@projects/legacy/library/common.py`:
- Around line 6-7: prepare_light_cluster calls run.run_toolbox which expects a
local ./run_toolbox.py that isn’t in the repo; ensure runtime availability by
either (A) documenting/deploying ./run_toolbox.py in your deployment README and
CI artifacts, or (B) adding a runtime existence check and clearer error: before
calling run.run_toolbox in prepare_light_cluster, verify
os.path.exists("./run_toolbox.py") and raise/log a descriptive error if missing
(or modify run.run_toolbox to accept/configure an absolute path or package
resource and fall back to a documented default). Include the change to reference
./run_toolbox.py in the error/log message so operators know what to provide.

In `@projects/legacy/library/export.py`:
- Around line 51-57: Accessing job_spec["refs"]["pulls"][0]["number"] can raise
KeyError/IndexError; update the OPENSHIFT_CI branch to defensively validate
JOB_SPEC contents before constructing run_id: parse JOB_SPEC into job_spec,
check that "refs" in job_spec and "pulls" in job_spec["refs"] and that pulls is
a non-empty list, and that the expected "number", "job", and "buildid" keys
exist; if any piece is missing, either set sensible fallbacks (e.g., None or
"unknown") or raise a clear ValueError with context including the raw JOB_SPEC
and which key is missing; change the code paths that currently set pull_number,
job, build_id and run_id to use these validated values (references: job_spec,
pull_number, job, build_id, run_id).

In `@projects/legacy/library/run.py`:
- Around line 1-6: The file imports the signal module twice (duplicate import of
signal). Remove the redundant import so only a single import signal remains
(e.g., delete the second "import signal" in the top-level imports) to avoid
duplication and keep imports clean; locate the duplicate in the top of run.py
around the existing import block that also contains imports like sys, os,
traceback, logging, and json.

---

Outside diff comments:
In `@projects/core/ci_entrypoint/run_ci.py`:
- Around line 389-402: The code references an undefined variable success in the
postchecks block; update the conditional uses to derive success from
finish_reason (or result.returncode) instead of success: e.g., treat success as
finish_reason == prepare_ci.FinishReason.SUCCESS (or result.returncode == 0)
when building the styled message and setting click.echo(err=...). Modify the
branches that call prepare_ci.postchecks and the fallback message logic to use
this computed boolean so references on lines that currently use success become
valid.

---

Duplicate comments:
In @.github/workflows/check_consistency.yml:
- Line 22: Add the missing top-level description to the composite action
metadata: open the composite action's action.yml for
prepare_minimal_ubuntu_image and add a concise top-level description field
(e.g., description: "Prepare minimal Ubuntu image for CI steps") so actionlint
no longer flags the file as incomplete; ensure the description value is a short
string at the root of the YAML alongside name and runs.

In `@projects/jump_ci/orchestration/ci.py`:
- Around line 111-116: The cleanup block currently ignores the actual result by
always returning 0; change the return so the function propagates the real
cleanup result from run_on_jump_ci. After calling ret =
run_on_jump_ci("pre_cleanup_ci") (and logging via self.log(...)), return ret (or
a boolean/exit code based on ret) instead of the hard-coded 0 so failures are
not masked; update the return statement in the try block that follows
run_on_jump_ci to use ret and keep the existing self.log(ret) behavior.

In `@projects/legacy/library/__init__.py`:
- Line 11: In merge_dicts(), split the inline conditional that sets the local
variable path so it isn't two statements on one line: replace the single-line
`if path is None: path = []` with a proper two-line if block that checks `if
path is None` and then assigns `path = []`; locate this inside the merge_dicts
function where the local variable `path` is initialized and update that
statement only.

In `@projects/legacy/library/config.py`:
- Around line 88-91: After calling self.set_config(key, value, print=False) the
subsequent validation uses self.get_config(key, print=False) without secret
handling; change the call to pass handled_secretly=True (i.e.,
self.get_config(key, print=False, handled_secretly=True)) so secret-style
overrides (e.g., "*$@...") are retrieved safely and do not raise or expose
secrets in logs, and keep the existing logging behavior (logging.info for
non-secret keys) unchanged.
- Around line 95-99: The try/except around the preset lookup needs to handle
KeyError (the get_config call translates missing keys to KeyError) instead of
IndexError; update the except clause in the block that calls
self.get_config(f'ci_presets["{name}"]', print=False) to catch KeyError (or
catch (KeyError, IndexError) if you want to preserve existing behavior), and
keep the existing logging and re-raise behavior using process-local symbols
get_config and the surrounding method in config.py.
- Around line 145-153: The set_config thread-guard is broken because the
parameter named "print" shadows the built-in print (causing a TypeError at the
file-write call) and the CI branch only logs/writes the marker file but does not
stop execution; update the set_config signature (function set_config) to rename
the "print" parameter to something like "do_print" or "verbose" to avoid
shadowing the built-in, update any internal uses accordingly so builtins.print
works for writing to env.ARTIFACT_DIR / "SET_CONFIG_CALLED_FROM_THREAD", and
ensure the CI branch (the branch checking os.environ.get("OPENSHIFT_CI") or
"PERFLAB_CI") raises the same RuntimeError after logging/writing the marker so
non-main threads cannot proceed to mutate shared state.
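The hardened thread guard described for set_config can be sketched like this; the function below is an illustration with assumed behavior, not the config.py implementation:

```python
import logging
import threading

def set_config(config, key, value, do_print=True):
    # Sketch: the former "print" parameter is renamed to do_print so
    # builtins.print stays usable, and any call from a non-main thread
    # raises instead of silently mutating shared state.
    if threading.current_thread() is not threading.main_thread():
        logging.error("set_config called from a non-main thread: %s", key)
        raise RuntimeError("set_config must run on the main thread")
    config[key] = value
    if do_print:
        print(f"config: {key} = {value}")
```

Renaming the parameter avoids the TypeError that occurs when the shadowed built-in `print` is later needed, and raising unconditionally closes the CI-only gap in the guard.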

In `@projects/legacy/library/env.py`:
- Around line 30-35: The __getattr__ implementation in function __getattr__
currently returns globals()[name] which raises KeyError for unknown attributes;
change it to explicitly raise AttributeError when the name isn't found so
attribute access and hasattr/introspection behave correctly: check for
"ARTIFACT_DIR" and return get_tls_artifact_dir() as before, but for any other
name attempt to fetch from globals() safely (e.g., using .get()) and if not
present raise AttributeError(f"module {__name__!r} has no attribute {name!r}");
ensure the raised exception is AttributeError rather than letting a KeyError
leak.
- Around line 66-76: NextArtifactDir can silently reuse existing paths because
creation uses exist_ok=True and next_artifact_index/glob races; change
allocation to attempt atomic creation and retry on collision: inside
NextArtifactDir (and the path allocation logic in
next_artifact_index/get_tls_artifact_dir usage) build the candidate dirname and
call mkdir with exist_ok=False (or otherwise try an atomic create) and catch
FileExistsError, then increment the counter and retry until creation succeeds;
preserve the lock behavior when provided (use the lock to fetch/increment
counter atomically) and only fall back to the non-locked
next_artifact_index/rescan loop when no lock is given, finally return
TempArtifactDir for the successfully created directory.
- Around line 20-29: The global monkey-patch of threading.Thread is unsafe:
replace or guard the behavior so startup can't fail before TLS is initialized.
Remove the unconditional assignment threading.Thread = MyThread and instead use
a dedicated subclass MyThread where you explicitly construct threads that need
TLS propagation; if you must patch globally, only do it after init() has run.
Also harden MyThread.start and run: do not assume get_tls_artifact_dir() always
succeeds—capture/get the TLS value lazily or wrap
get_tls_artifact_dir()/_set_tls_artifact_dir() calls in try/except and fall back
to None so threads can still start. Locate MyThread, its start and run methods
and the uses of get_tls_artifact_dir/_set_tls_artifact_dir to implement these
changes.
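The race-free allocation scheme suggested for NextArtifactDir can be sketched as below; the naming format and glob pattern are illustrative assumptions, not TOPSAIL's exact conventions:

```python
import pathlib
import tempfile

def next_artifact_dir(base):
    # Attempt an atomic mkdir (exist_ok=False) and retry on
    # FileExistsError instead of trusting a glob-based counter,
    # so two workers can never share the same directory.
    base = pathlib.Path(base)
    index = sum(1 for _ in base.glob("*__artifacts"))  # starting guess only
    while True:
        candidate = base / f"{index:03d}__artifacts"
        try:
            candidate.mkdir(parents=True, exist_ok=False)  # atomic create
            return candidate
        except FileExistsError:
            index += 1  # another worker won the race; retry with next index
```

The glob only seeds the counter; correctness comes from the atomic `mkdir`, which either creates the directory or fails, never reuses one.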

In `@projects/legacy/library/sizing.py`:
- Around line 121-125: The current code uses sys.exit(main(...)) to encode the
machine count into the process exit code and exits with 0 on exceptions; change
this to have main(machine_type, user_count, cpu, memory) return the integer
count (or raise on error), print that count to stdout (e.g., using print)
instead of passing it to sys.exit, and on exceptions call logging.error with the
existing message and then sys.exit with a non-zero exit code; update the
invocation that currently references sys.exit(main(...)) and the exception
handler that calls sys.exit(0) so the program preserves correct output and
failure semantics (refer to main, sys.exit, logging.error, and sys.argv in the
diff).
- Line 47: Before indexing machines, validate machine_type exists to avoid a raw
KeyError: check if machine_type in machines and if not raise a clear ValueError
(or return an explicit error) that includes the invalid machine_type and the
sorted list of available keys from machines; then proceed to set machine_size =
machines[machine_type]. Refer to the machines dict and the machine_type variable
(and the machine_size assignment) when implementing the check so the error
message provides actionable options for callers.
- Around line 13-23: In parse_machines(), initialize the variable group (e.g.,
group = None) before the loop to avoid UnboundLocalError, change the header
handling from line.strip("# ") to using startswith("# ") and extracting the
header with slicing (e.g., line[2:]) so you only remove the prefix, avoid
multiple statements on one line (split the "if not line..." into its own line)
to fix E701, and replace the unused unpack target accel with a named throwaway
(e.g., _accel) or drop it entirely so instance, cpu, memory, price, _accel = ...
doesn’t leave an unused name; keep the conditional logic to skip blank/comment
lines intact and set group when a "# " header is seen so subsequent data rows
can reference it.
- Around line 57-79: The sizing logic can divide by zero and uses raw instance
totals instead of allocatable capacity; update computations to use
allocatable_cpu = machine_size.cpu - RESERVED_CPU and allocatable_mem =
machine_size.memory - RESERVED_MEM when computing machine_count_cpu and
machine_count_memory and when computing unallocated_cpu/unallocated_mem, and
guard against zero/negative allocatable (raise/log and treat machine_count as 0
or 1 based on policy). Also protect the user_count path: if user_count == 0 set
machine_count = 0 and pods_per_machine = 0 (skip MAX_POD_PER_NODE logic),
otherwise ensure machine_count = max(1, math.ceil(machine_exact_count)) before
doing pods_per_machine = math.ceil(user_count / machine_count) to avoid division
by zero; reference symbols: machine_count_cpu, machine_count_memory,
machine_exact_count, machine_count, pods_per_machine, RESERVED_CPU,
RESERVED_MEM, unallocated_cpu, unallocated_mem, pod_size, machine_size,
user_count, MAX_POD_PER_NODE.
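The guarded sizing math described above can be sketched as follows; the reservation constants and signature are assumptions for illustration, not sizing.py's actual values:

```python
import math

RESERVED_CPU = 2   # assumed per-node reservation, cores
RESERVED_MEM = 4   # assumed per-node reservation, GiB
MAX_POD_PER_NODE = 250

def machine_count_for(user_count, pod_cpu, pod_mem, node_cpu, node_mem):
    # Size against allocatable capacity (total minus reservations) and
    # guard both zero-capacity and zero-user divisions.
    alloc_cpu = node_cpu - RESERVED_CPU
    alloc_mem = node_mem - RESERVED_MEM
    if alloc_cpu <= 0 or alloc_mem <= 0:
        raise ValueError("machine type too small after reservations")
    if user_count == 0:
        return 0, 0  # nothing to schedule, skip the per-node math
    by_cpu = user_count * pod_cpu / alloc_cpu
    by_mem = user_count * pod_mem / alloc_mem
    by_pods = user_count / MAX_POD_PER_NODE
    machine_count = max(1, math.ceil(max(by_cpu, by_mem, by_pods)))
    pods_per_machine = math.ceil(user_count / machine_count)
    return machine_count, pods_per_machine
```

Clamping `machine_count` to at least 1 before dividing guarantees the `pods_per_machine` computation can never divide by zero.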

---

Nitpick comments:
In `@projects/legacy/library/__init__.py`:
- Line 14: The recursive call in merge_dicts currently extends the path via list
concatenation (path + [str(key)]); change it to use unpacking for clarity and
intent by reconstructing the list with [*path, str(key)] when passing to
merge_dicts so the path extension is expressed with unpacking instead of
concatenation; update the recursive call site in merge_dicts where
merge_dicts(a[key], b[key], path + [str(key)]) is invoked to use the unpacking
form.

In `@projects/legacy/library/ansible_toolbox.py`:
- Around line 125-129: Update the __init__ signature to use explicit None unions
for parameters that default to None: change "ansible_vars: dict = None" to
"ansible_vars: dict | None = None" and "role_name: str = None" to "role_name:
str | None = None" in the __init__ method of the class in ansible_toolbox.py
(leave group: str = "" and command: str = "" unchanged).

In `@projects/legacy/library/run.py`:
- Around line 161-163: In the except block that currently does "except Exception
as e: if not self.exit_on_exception: raise e" in projects/legacy/library/run.py,
change the re-raise to a bare "raise" so the original traceback is preserved;
locate the handler using the self.exit_on_exception check inside the method
containing that except and replace "raise e" with a bare re-raise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c1285588-08c7-4bb0-a6fe-198d715955fe

📥 Commits

Reviewing files that changed from the base of the PR and between b6db104 and 3f2dd7e.

📒 Files selected for processing (23)
  • .github/steps/prepare_minimal_ubuntu_image/action.yml
  • .github/workflows/check_consistency.yml
  • .github/workflows/check_python_syntax.yml
  • projects/core/ci_entrypoint/github/pr_args.py
  • projects/core/ci_entrypoint/prepare_ci.py
  • projects/core/ci_entrypoint/run_ci.py
  • projects/jump_ci/orchestration/ci.py
  • projects/jump_ci/testing/config.yaml
  • projects/jump_ci/testing/prepare_jump_ci.py
  • projects/jump_ci/testing/run
  • projects/jump_ci/testing/test.py
  • projects/jump_ci/testing/tunnelling.py
  • projects/jump_ci/testing/utils.py
  • projects/jump_ci/toolbox/jump_ci.py
  • projects/legacy/library/__init__.py
  • projects/legacy/library/ansible_toolbox.py
  • projects/legacy/library/common.py
  • projects/legacy/library/config.py
  • projects/legacy/library/env.py
  • projects/legacy/library/export.py
  • projects/legacy/library/run.py
  • projects/legacy/library/sizing.machines
  • projects/legacy/library/sizing.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • projects/jump_ci/testing/tunnelling.py
  • projects/legacy/library/sizing.machines
  • projects/jump_ci/testing/utils.py
  • .github/steps/prepare_minimal_ubuntu_image/action.yml
  • projects/jump_ci/testing/config.yaml
  • .github/workflows/check_python_syntax.yml

Comment thread projects/core/ci_entrypoint/github/pr_args.py
Comment thread projects/core/ci_entrypoint/github/pr_args.py
Comment on lines +22 to +33
from enum import StrEnum

# Import notifications module (will be done after logger is set up)
NOTIFICATIONS_AVAILABLE = False

IS_LIGHTWEIGHT_IMAGE = os.environ.get("TOPSAIL_LIGHT_IMAGE")

class FinishReason(StrEnum):
    SUCCESS = "success"
    ERROR = "error"
    OTHER = "other"



⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate repo-level Python version constraints and CI runtime declarations
fd -HI 'pyproject.toml|setup.cfg|setup.py|tox.ini|.python-version|Pipfile|poetry.lock' .
fd -HI '.github/workflows/*.yml' .github/workflows || true

# Show relevant declarations
rg -n -C2 'requires-python|python_requires|actions/setup-python|python-version|3\.[0-9]+' \
  pyproject.toml setup.cfg setup.py tox.ini .python-version Pipfile .github/workflows/*.yml 2>/dev/null || true

Repository: openshift-psap/topsail-ng

Length of output: 1496


🏁 Script executed:

fd -t f prepare_ci.py

Repository: openshift-psap/topsail-ng

Length of output: 110


🏁 Script executed:

# Get the file and check line 465 area
cat -n projects/core/ci_entrypoint/prepare_ci.py | sed -n '460,470p'

Repository: openshift-psap/topsail-ng

Length of output: 594


Remove unnecessary f-string prefix on line 465.

Line 465 contains logger.exception(f"Failed to send notifications") where the f-string has no interpolated variables. Remove the f prefix: logger.exception("Failed to send notifications").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/core/ci_entrypoint/prepare_ci.py` around lines 22 - 33, In
prepare_ci.py locate the logger.exception call that currently reads
logger.exception(f"Failed to send notifications") and remove the unnecessary
f-string prefix so it becomes logger.exception("Failed to send notifications");
update the invocation in the same function where notifications are sent (the
block that catches notification errors) to use a plain string literal instead of
an f-string.

Comment thread projects/core/ci_entrypoint/prepare_ci.py
Comment on lines +26 to +27
except ImportError as e:
    raise RuntimeError(f"Jump CI testing functionality not available: {e}")

⚠️ Potential issue | 🟡 Minor

Chain the exception for better tracebacks.

Use raise ... from e to preserve the original exception context.

🔧 Proposed fix
 except ImportError as e:
-    raise RuntimeError(f"Jump CI testing functionality not available: {e}")
+    raise RuntimeError(f"Jump CI testing functionality not available: {e}") from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-except ImportError as e:
-    raise RuntimeError(f"Jump CI testing functionality not available: {e}")
+except ImportError as e:
+    raise RuntimeError(f"Jump CI testing functionality not available: {e}") from e
🧰 Tools
🪛 Ruff (0.15.6)

[warning] 27-27: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/jump_ci/orchestration/ci.py` around lines 26 - 27, Change the bare
re-raise of ImportError to chain the original exception so tracebacks are
preserved: in the except ImportError as e block that currently does raise
RuntimeError(f"Jump CI testing functionality not available: {e}"), modify the
raise to use exception chaining (raise RuntimeError("Jump CI testing
functionality not available") from e) so the original ImportError is attached to
the new RuntimeError.
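A small self-contained demonstration of the chained form (using a placeholder module name, since the real import is project-specific) shows the original error staying attached as `__cause__`:

```python
def load_jump_ci():
    # "missing_jump_ci_module" is a hypothetical placeholder standing in
    # for the real import that may fail at runtime.
    try:
        import missing_jump_ci_module  # noqa: F401
    except ImportError as e:
        raise RuntimeError("Jump CI testing functionality not available") from e

try:
    load_jump_ci()
except RuntimeError as err:
    cause = err.__cause__
```

With `from e`, the traceback prints both exceptions joined by "The above exception was the direct cause of the following exception", so the root ImportError is never lost.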

Comment on lines +42 to +52
def lock_cluster(self):
    """
    Prepare phase - Lock cluster for exclusive access
    """
    self.log("Locking cluster for exclusive access...")
    try:
        prepare_jump_ci_mod.lock_cluster()
        self.log("Cluster locked successfully", "success")
    except Exception as e:
        self.log(f"Lock phase failed: {e}", "error")
        return 1

⚠️ Potential issue | 🟠 Major

Missing return value on success paths.

lock_cluster and unlock_cluster return 1 on error but implicitly return None on success. Since the CLI handlers pass these to sys.exit(), this should be 0 for success.

🐛 Proposed fix
     def lock_cluster(self):
         ...
         try:
             prepare_jump_ci_mod.lock_cluster()
             self.log("Cluster locked successfully", "success")
+            return 0
         except Exception as e:
             self.log(f"Lock phase failed: {e}", "error")
             return 1
 
     def unlock_cluster(self):
         ...
         try:
             prepare_jump_ci_mod.unlock_cluster()
             self.log("Cluster unlocked successfully", "success")
+            return 0
         except Exception as e:
             self.log(f"Unlock cluster phase failed: {e}", "error")
             return 1

Also applies to: 54-65

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 50-50: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/jump_ci/orchestration/ci.py` around lines 42 - 52, Both lock_cluster
and unlock_cluster currently return 1 on error but implicitly return None on
success; update both functions (lock_cluster and unlock_cluster) to explicitly
return 0 on successful completion (after the success log lines) so CLI handlers
that pass their results to sys.exit() receive the proper success exit code.
Ensure the return 0 is added after self.log("Cluster locked successfully",
"success") in lock_cluster and after the corresponding success log in
unlock_cluster.
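The explicit exit-code convention this fix enforces can be sketched with a hypothetical standalone version of the method (dependencies passed in as callables for illustration):

```python
def lock_cluster(do_lock, log):
    # The CLI passes this return value straight to sys.exit(),
    # so success must be an explicit 0, not an implicit None.
    try:
        do_lock()
        log("Cluster locked successfully")
        return 0
    except Exception as e:
        log(f"Lock phase failed: {e}")
        return 1
```

Note that `sys.exit(None)` happens to exit with status 0 as well, but relying on that is fragile and hides intent; returning 0 makes the contract explicit.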

Comment on lines +49 to +51
if hasattr(mod, "__entrypoint"):
    self.__dict__[toolbox_name] = getattr(mod, "__entrypoint")
    continue

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if any module actually defines __entrypoint (single trailing underscore)
rg -n '__entrypoint\s*=' projects/

Repository: openshift-psap/topsail-ng

Length of output: 51


🏁 Script executed:

# First, let's look at the actual code in ansible_toolbox.py around lines 49-51
cat -n projects/legacy/library/ansible_toolbox.py | sed -n '40,60p'

Repository: openshift-psap/topsail-ng

Length of output: 1136


🏁 Script executed:

# Search for all entrypoint variations in the codebase
rg -n 'entrypoint' projects/ --type py

Repository: openshift-psap/topsail-ng

Length of output: 1507


🏁 Script executed:

# Search for __entrypoint with both underscore variations
rg -n '__entrypoint' . --type py

Repository: openshift-psap/topsail-ng

Length of output: 284


This is dead code due to non-standard dunder naming: __entrypoint will never be found.

The attribute check at line 49 uses __entrypoint (single trailing underscore), which violates Python's dunder naming convention. Search of the entire codebase shows no module defines __entrypoint, making the hasattr() check always fail. The intended attribute was likely __entrypoint__ (proper dunder format). Either fix the attribute name to follow Python conventions or remove this unreachable code path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/ansible_toolbox.py` around lines 49 - 51, The hasattr
check is using a non-standard dunder name "__entrypoint" which never exists;
update the branch to check for and use "__entrypoint__" instead (replace
hasattr(mod, "__entrypoint") and getattr(mod, "__entrypoint") with
"__entrypoint__"), or remove the unreachable branch entirely if you prefer not
to support a module-level entrypoint; adjust the references around mod and
toolbox_name accordingly so the code assigns self.__dict__[toolbox_name] =
getattr(mod, "__entrypoint__") when present.
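A quick self-contained check (using a synthetic module, not the project's toolbox modules) illustrates why only the proper dunder spelling is ever found:

```python
import types

# Only the double-trailing-underscore name is set on the module,
# so the single-trailing-underscore lookup in the original code is dead.
mod = types.ModuleType("demo_toolbox")
mod.__entrypoint__ = lambda: "ok"

registry = {}
if hasattr(mod, "__entrypoint__"):
    registry["demo_toolbox"] = getattr(mod, "__entrypoint__")
```

Because no module in the repository assigns `__entrypoint` (single trailing underscore), the original `hasattr` branch can never execute.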

Comment on lines +6 to +7
def prepare_light_cluster():
    run.run_toolbox("cluster", "wait_fully_awake")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if run_toolbox.py exists anywhere in the repo
fd -t f 'run_toolbox.py' .

# Check where it's expected to be called from
rg -n 'run_toolbox\.py' projects/legacy/library/run.py

Repository: openshift-psap/topsail-ng

Length of output: 315


🏁 Script executed:

# Search for any build/deployment scripts or setup files
fd -t f -e sh -e yml -e yaml -e json . | head -20

# Check if run_toolbox.py is mentioned in any docs or config
rg -i 'run_toolbox' --type md --type txt --type rst . | head -20

# Look at the structure of projects/legacy/library/run.py to understand the run() function
cat -n projects/legacy/library/run.py | head -30

Repository: openshift-psap/topsail-ng

Length of output: 2243


🏁 Script executed:

# View the run() function and context around the references at lines 66 and 102
sed -n '60,110p' projects/legacy/library/run.py

# Check if run_toolbox.py is mentioned anywhere else in the repo
rg -r 'run_toolbox' . --type py

# Look for any build/generation scripts or Makefiles
fd -t f -e Makefile -e makefile . | head -10
fd -t f 'setup.py' . | head -10

# Check the common.py file to understand context of the call
cat -n projects/legacy/library/common.py | head -20

Repository: openshift-psap/topsail-ng

Length of output: 50383


🏁 Script executed:

# Check the projects/legacy/library directory structure
ls -la projects/legacy/library/ | head -30

# Check if there's a run_toolbox.py specifically in that directory
ls -la projects/legacy/library/run_toolbox.py 2>&1

# View the common.py file around the referenced lines
cat -n projects/legacy/library/common.py | sed -n '1,15p'

Repository: openshift-psap/topsail-ng

Length of output: 1415


🏁 Script executed:

# View the run() function implementation to understand how commands are executed
sed -n '104,150p' projects/legacy/library/run.py

Repository: openshift-psap/topsail-ng

Length of output: 1711


Verify or document the runtime availability of ./run_toolbox.py.

run.run_toolbox() constructs shell commands that execute ./run_toolbox.py relative to the current working directory. This file does not exist in the repository, so either it must be created/deployed at runtime or its availability must be documented and guaranteed by the deployment process.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/common.py` around lines 6 - 7, prepare_light_cluster
calls run.run_toolbox which expects a local ./run_toolbox.py that isn’t in the
repo; ensure runtime availability by either (A) documenting/deploying
./run_toolbox.py in your deployment README and CI artifacts, or (B) adding a
runtime existence check and clearer error: before calling run.run_toolbox in
prepare_light_cluster, verify os.path.exists("./run_toolbox.py") and raise/log a
descriptive error if missing (or modify run.run_toolbox to accept/configure an
absolute path or package resource and fall back to a documented default).
Include the change to reference ./run_toolbox.py in the error/log message so
operators know what to provide.

Comment on lines +51 to +57
if os.environ.get("OPENSHIFT_CI") == "true":
    job_spec = json.loads(os.environ["JOB_SPEC"])
    pull_number = job_spec["refs"]["pulls"][0]["number"]
    job = job_spec["job"]
    build_id = job_spec["buildid"]

    run_id=f"prow/{pull_number}/{build_id}"

⚠️ Potential issue | 🟡 Minor

Accessing nested job_spec keys may raise KeyError/IndexError.

If JOB_SPEC lacks refs.pulls[0], this will crash with an unhelpful error. Consider defensive access or a clearer error message.

🛡️ Proposed defensive access
     if os.environ.get("OPENSHIFT_CI") == "true":
         job_spec = json.loads(os.environ["JOB_SPEC"])
-        pull_number = job_spec["refs"]["pulls"][0]["number"]
+        pulls = job_spec.get("refs", {}).get("pulls")
+        if not pulls:
+            raise ValueError("JOB_SPEC missing refs.pulls for OPENSHIFT_CI")
+        pull_number = pulls[0]["number"]
         job = job_spec["job"]
         build_id = job_spec["buildid"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/export.py` around lines 51 - 57, Accessing
job_spec["refs"]["pulls"][0]["number"] can raise KeyError/IndexError; update the
OPENSHIFT_CI branch to defensively validate JOB_SPEC contents before
constructing run_id: parse JOB_SPEC into job_spec, check that "refs" in job_spec
and "pulls" in job_spec["refs"] and that pulls is a non-empty list, and that the
expected "number", "job", and "buildid" keys exist; if any piece is missing,
either set sensible fallbacks (e.g., None or "unknown") or raise a clear
ValueError with context including the raw JOB_SPEC and which key is missing;
change the code paths that currently set pull_number, job, build_id and run_id
to use these validated values (references: job_spec, pull_number, job, build_id,
run_id).
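The defensive parsing described above can be sketched as a self-contained helper (the function name `parse_job_spec` is illustrative, not part of export.py):

```python
import json

def parse_job_spec(raw_job_spec):
    """Extract (pull_number, job, build_id) from a Prow JOB_SPEC JSON string,
    raising a descriptive ValueError instead of a bare KeyError/IndexError."""
    job_spec = json.loads(raw_job_spec)
    pulls = job_spec.get("refs", {}).get("pulls")
    if not pulls:
        raise ValueError(f"JOB_SPEC missing refs.pulls: {raw_job_spec}")
    missing = [k for k in ("job", "buildid") if k not in job_spec]
    if "number" not in pulls[0]:
        missing.append("refs.pulls[0].number")
    if missing:
        raise ValueError(f"JOB_SPEC missing keys {missing}: {raw_job_spec}")
    return pulls[0]["number"], job_spec["job"], job_spec["buildid"]

# run_id would then be built as in export.py:
#   run_id = f"prow/{pull_number}/{build_id}"
```

Failing with the raw JOB_SPEC in the message makes it immediately clear which CI field the operator needs to fix.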

Comment on lines +1 to +6
import sys, os, signal
import traceback
import logging
logging.getLogger().setLevel(logging.INFO)
import json
import signal

⚠️ Potential issue | 🟡 Minor

Duplicate import of signal module.

signal is imported on both Line 1 and Line 6.

🔧 Proposed fix
-import sys, os, signal
+import sys, os
 import traceback
 import logging
 logging.getLogger().setLevel(logging.INFO)
 import json
 import signal
 import itertools

Or simply remove line 6.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/legacy/library/run.py` around lines 1 - 6, The file imports the
signal module twice (duplicate import of signal). Remove the redundant import so
only a single import signal remains (e.g., delete the second "import signal" in
the top-level imports) to avoid duplication and keep imports clean; locate the
duplicate in the top of run.py around the existing import block that also
contains imports like sys, os, traceback, logging, and json.

@kpouget
Contributor Author

kpouget commented Mar 17, 2026

/test jump-ci CLUSTER skeleton abcd 1234

9 similar comments

@kpouget
Contributor Author

kpouget commented Mar 17, 2026

/test llmd-jump-ci hello world

@topsail-bot

topsail-bot bot commented Mar 17, 2026

🔴 Test of 'jump_ci ci' failed after 00 hours 00 minutes 00 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

ci_job.args:
- world
ci_job.name: llmd-jump-ci
ci_job.project: hello

Failure indicator: Empty. (See run.log)

@topsail-bot

topsail-bot bot commented Mar 17, 2026

🔴 Test of 'jump_ci ci' failed after 00 hours 00 minutes 06 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

ci_job.args:
- world
ci_job.name: llmd-jump-ci
ci_job.project: hello

Failure indicator: Empty. (See run.log)

@kpouget
Contributor Author

kpouget commented Mar 17, 2026

/test jump-ci CLUSTER skeleton abcd 1234

@kpouget
Contributor Author

kpouget commented Mar 17, 2026

/test llmd-jump-ci hello world
/cluster CLUSTER

@topsail-bot

topsail-bot bot commented Mar 17, 2026

🔴 Test of 'jump_ci ci' failed after 00 hours 00 minutes 05 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

ci_job.args:
- world
ci_job.name: llmd-jump-ci
ci_job.project: hello
cluster.name: CLUSTER

Failure indicator: Empty. (See run.log)

@topsail-bot

topsail-bot bot commented Mar 17, 2026

🔴 Test of 'jump_ci ci' failed after 00 hours 00 minutes 06 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

ci_job.args:
- world
ci_job.name: llmd-jump-ci
ci_job.project: hello
cluster.name: CLUSTER

Failure indicator: Empty. (See run.log)
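The bot replies above show how the PR directives map onto the test configuration: `/test <name> <project> <args...>` fills ci_job.name, ci_job.project, and ci_job.args, while `/cluster <name>` fills cluster.name. A hypothetical parser reproducing that mapping (not the actual jump-ci implementation) could look like:

```python
def parse_test_directives(comment_body):
    """Map PR comment directives to a flat test-configuration dict.

    Inferred from the bot output above; the real jump-ci parser may differ.
    """
    spec = {}
    for line in comment_body.splitlines():
        words = line.strip().split()
        if not words:
            continue
        if words[0] == "/test" and len(words) >= 3:
            spec["ci_job.name"] = words[1]
            spec["ci_job.project"] = words[2]
            spec["ci_job.args"] = words[3:]
        elif words[0] == "/cluster" and len(words) == 2:
            spec["cluster.name"] = words[1]
    return spec
```

For example, the comment `/test llmd-jump-ci hello world` followed by `/cluster CLUSTER` would yield the same four fields the topsail-bot reports print.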

@openshift-ci

openshift-ci bot commented Mar 17, 2026

@kpouget: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/jump-ci 1d4301b link true /test jump-ci
ci/prow/llmd-jump-ci 3b0dcc3 link true /test llmd-jump-ci

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kpouget
Contributor Author

kpouget commented Mar 17, 2026

let's consider that good enough. Merging :)

@kpouget kpouget merged commit e115712 into openshift-psap:main Mar 17, 2026
1 of 3 checks passed
@kpouget kpouget deleted the jump-ci branch March 17, 2026 18:06
This was referenced Mar 25, 2026