Skip to content

Conversation

@ardaguclu
Copy link
Member

setsid brings us robust session killing, when volumesizechecker detects any exceeded volume usage. This works well in default must-gather image as well as others that setsid is installed.

However, there are some images that setsid is not installed. This method does not work and more importantly must-gather does not work. So we must introduce a fall back approach for the images that lack such binaries.

This PR checks the presence of setsid. If it is present, it will continue using the current approach. If it is not, it will run the scripts in simpler way. If volumesize checker detects the exceeded volume usage, it will run kill 0 to kill every process. We can't provide elegant options because these images do not include any binaries that we can use such as ps, setsid, etc.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 25, 2025
@openshift-ci-robot
Copy link

@ardaguclu: This pull request references Jira Issue OCPBUGS-65925, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

setsid brings us robust session killing, when volumesizechecker detects any exceeded volume usage. This works well in default must-gather image as well as others that setsid is installed.

However, there are some images that setsid is not installed. This method does not work and more importantly must-gather does not work. So we must introduce a fall back approach for the images that lack such binaries.

This PR checks the presence of setsid. If it is present, it will continue using the current approach. If it is not, it will run the scripts in simpler way. If volumesize checker detects the exceeded volume usage, it will run kill 0 to kill every process. We can't provide elegant options because these images do not include any binaries that we can use such as ps, setsid, etc.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link

coderabbitai bot commented Nov 25, 2025

Walkthrough

Runtime detection for setsid, ps, and pkill was added: session-based termination and starting the gather in a separate session occur only if those tools exist; otherwise the code falls back to kill 0 for termination and runs the gather directly. Tests were updated to expect the conditional heredoc flows.

Changes

Cohort / File(s) Summary
Disk-usage termination & pod command
pkg/cli/admin/mustgather/mustgather.go
Add a one-time runtime check for setsid, ps, and pkill (stored in HAVE_SESSION_TOOLS). Use session-based termination and setsid to start the gather only when tools are present; otherwise fall back to kill 0 for termination and run the gather directly. Add comments and minor formatting.
Unit test expectations
pkg/cli/admin/mustgather/mustgather_test.go
Update TestBuildPodCommand to expect the runtime if check in the emitted heredoc with two branches: a setsid/session branch and an else fallback that invokes the gather directly; adjust final sync/notification placement to follow the conditional.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify the combined check requires all three utilities (setsid, ps, pkill) intentionally.
  • Confirm the shell heredoc and if/then/else/fi structure is correct so heredoc markers and gather invocation are valid in both branches.
  • Ensure disk-usage fallback to kill 0 preserves original exit behavior and tests precisely match emitted shell snippets.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 566e167 and 201abbf.

📒 Files selected for processing (2)
  • pkg/cli/admin/mustgather/mustgather.go (2 hunks)
  • pkg/cli/admin/mustgather/mustgather_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/cli/admin/mustgather/mustgather_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/cli/admin/mustgather/mustgather.go
🔇 Additional comments (1)
pkg/cli/admin/mustgather/mustgather.go (1)

1328-1352: LGTM!

The session-tools detection and conditional branching logic is well-structured. Setting HAVE_SESSION_TOOLS once at the start and reusing it for both the volume checker and the gather command execution avoids redundant checks. The fallback to direct execution when session tools are unavailable is a sensible graceful degradation strategy for minimal images.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ardaguclu
Copy link
Member Author

I got this PR #2152 into this PR.

@ardaguclu
Copy link
Member Author

/cc @tchap could you PTAL?.
/hold
until reviews, pre-merge tests are completed.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 25, 2025
@ardaguclu
Copy link
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Nov 25, 2025
@openshift-ci-robot
Copy link

@ardaguclu: This pull request references Jira Issue OCPBUGS-65925, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhouying7780

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 25, 2025
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 25, 2025
@tchap
Copy link
Contributor

tchap commented Nov 25, 2025

Read it as thoroughly as I managed, looks principally ok.

If we wanted to be super pretty, we could do SETSID="$(command -v setsid)" and then just check SETSID everywhere, but just using command directly is no problem, it's just 2 places really.

/lgtm

@ardaguclu
Copy link
Member Author

Read it as thoroughly as I managed, looks principally ok.

If we wanted to be super pretty, we could do SETSID="$(command -v setsid)" and then just check SETSID everywhere, but just using command directly is no problem, it's just 2 places really.

/lgtm

Thank you. So next step would have a pre-merge tests.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 25, 2025
@tchap
Copy link
Contributor

tchap commented Nov 25, 2025

Note that #2152 is unfinished and the change does not fix anything yet. But no problem with merging it, I guess.

@ardaguclu
Copy link
Member Author

Note that #2152 is unfinished and the change does not fix anything yet. But no problem with merging it, I guess.

Why do you think that?. The ask is to printing this message to stderr and this achieves this. What am I missing?

@tchap
Copy link
Contributor

tchap commented Nov 25, 2025

Why do you think that?. The ask is to printing this message to stderr and this achieves this. What am I missing?

We are streaming pod logs, right? So there are two separate steps involved:

  1. Actually print the message to stderr in the pod.
  2. Stream pod stderr to local stderr.

While the first is trivial, the latter is not and it seems the relevant functionality is behind an alpha feature gate PodLogsQuerySplitStreams, so it's disabled by default: https://kubernetes.io/docs/concepts/cluster-administration/logging/#container-log-streams

We cannot even implement this currently, I guess.

@ardaguclu
Copy link
Member Author

Why do you think that?. The ask is to printing this message to stderr and this achieves this. What am I missing?

We are streaming pod logs, right? So there are two separate steps involved:

  1. Actually print the message to stderr in the pod.
  2. Stream pod stderr to local stderr.

While the first is trivial, the latter is not and it seems the relevant functionality is behind an alpha feature gate PodLogsQuerySplitStreams, so it's disabled by default: https://kubernetes.io/docs/concepts/cluster-administration/logging/#container-log-streams

We cannot even implement this currently, I guess.

Thanks. That makes sense. So I think, we can close the bug. Because it sounds like a feature request. I'd prefer reverting that feature from this PR as well.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 25, 2025
@tchap
Copy link
Contributor

tchap commented Nov 25, 2025

It might be more of a feature request.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 25, 2025
@ardaguclu
Copy link
Member Author

It might be more of a feature request.

That is correct. However, if we rely on a feature on upstream, I think this can be considered as a feature.

@ardaguclu
Copy link
Member Author

/retest

@ardaguclu
Copy link
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 27, 2025
@zhouying7780
Copy link
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 27, 2025
@openshift-ci-robot
Copy link

@ardaguclu: This pull request references Jira Issue OCPBUGS-65925, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhouying7780

In response to this:

setsid brings us robust session killing, when volumesizechecker detects any exceeded volume usage. This works well in default must-gather image as well as others that setsid is installed.

However, there are some images that setsid is not installed. This method does not work and more importantly must-gather does not work. So we must introduce a fall back approach for the images that lack such binaries.

This PR checks the presence of setsid. If it is present, it will continue using the current approach. If it is not, it will run the scripts in simpler way. If volumesize checker detects the exceeded volume usage, it will run kill 0 to kill every process. We can't provide elegant options because these images do not include any binaries that we can use such as ps, setsid, etc.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ardaguclu
Copy link
Member Author

/verified by @zhouying7780 and @ardaguclu

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 27, 2025
@openshift-ci-robot
Copy link

@ardaguclu: This PR has been marked as verified by @zhouying7780 and @ardaguclu.

In response to this:

/verified by @zhouying7780 and @ardaguclu

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ardaguclu
Copy link
Member Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 27, 2025
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Nov 28, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4a34714 and d8d5647.

📒 Files selected for processing (1)
  • pkg/cli/admin/mustgather/mustgather.go (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/cli/admin/mustgather/mustgather.go
🔇 Additional comments (1)
pkg/cli/admin/mustgather/mustgather.go (1)

1332-1345: LGTM! The fallback logic is well-implemented.

The runtime check correctly handles both scenarios:

  • When setsid, ps, and pkill are available, the gather command runs in a separate session as intended.
  • When these utilities are missing, the gather command runs directly, which is a reasonable fallback for minimal images.

The subsequent sync ensures data is flushed to disk after gathering completes.

Comment on lines 102 to 106
if command -v setsid >/dev/null 2>&1 && command -v ps >/dev/null 2>&1 && command -v pkill >/dev/null 2>&1; then
ps -o sess --no-headers | sort -u | while read sid; do
[[ "$sid" -eq "${$}" ]] && continue
pkill --signal SIGKILL --session "$sid"
done
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix syntax and logic error in session comparison.

Line 104 has two critical issues:

  1. Syntax error: ${$} is invalid bash syntax. The standard way to reference the current PID is $$ (without braces).
  2. Logic error: The code compares a session ID ($sid) with a process ID (${$}), which are different values. To correctly skip the current session, you must compare session IDs.

Apply this diff to fix both issues:

 	if command -v setsid >/dev/null 2>&1 && command -v ps >/dev/null 2>&1 && command -v pkill >/dev/null 2>&1; then
+		current_sid=$(ps -o sess= -p $$)
 		ps -o sess --no-headers | sort -u | while read sid; do
-			[[ "$sid" -eq "${$}" ]] && continue
+			[[ "$sid" -eq "$current_sid" ]] && continue
 			pkill --signal SIGKILL --session "$sid"
 		done
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if command -v setsid >/dev/null 2>&1 && command -v ps >/dev/null 2>&1 && command -v pkill >/dev/null 2>&1; then
ps -o sess --no-headers | sort -u | while read sid; do
[[ "$sid" -eq "${$}" ]] && continue
pkill --signal SIGKILL --session "$sid"
done
if command -v setsid >/dev/null 2>&1 && command -v ps >/dev/null 2>&1 && command -v pkill >/dev/null 2>&1; then
current_sid=$(ps -o sess= -p $$)
ps -o sess --no-headers | sort -u | while read sid; do
[[ "$sid" -eq "$current_sid" ]] && continue
pkill --signal SIGKILL --session "$sid"
done
🤖 Prompt for AI Agents
In pkg/cli/admin/mustgather/mustgather.go around lines 102 to 106, fix the
invalid bash syntax `${$}` and the incorrect comparison between a session ID and
the current process ID: determine the current session ID (e.g., via `ps -o sess=
-p $$`) into a variable (current_sid) and then compare `"$sid"` to
`"$current_sid"` to skip killing the current session; replace the `${$}` usage
with that current_sid variable and keep the rest of the pkill logic unchanged.

@ardaguclu ardaguclu changed the title OCPBUGS-65925: Fall back to simpler behavior, if setsid is not installed OCPBUGS-65925: Fall back to simpler behavior, if setsid,ps,pkill are not installed Nov 28, 2025
@ardaguclu
Copy link
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2025
[[ "$sid" -eq "${$}" ]] && continue
pkill --signal SIGKILL --session "$sid"
done
if command -v setsid >/dev/null 2>&1 && command -v ps >/dev/null 2>&1 && command -v pkill >/dev/null 2>&1; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would honestly just set a helper variable initially and then use it everywhere. This is getting ugly. Like, set a variable to non-empty value if we can do sessions, and do that at the beginning of buildPodCommand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to any suggestions as long as we can inject this helper variable from buildPodCommand to volumeCheckerScript. Would you please elaborate your exact suggestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that everything is just concatenated in buildPodCommand, so isn't any variable visible later on? Perhaps you need to export it for the subshells to make it work...

Copy link
Member Author

@ardaguclu ardaguclu Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, when I tested this, subshells did not see the variables in the main session. Claude suggested me exporting this variable and I rejected the suggestion. But I'll retry exporting this environment variable.

Copy link
Member Author

@ardaguclu ardaguclu Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a mistake about ordering. So that's why variable didn't work previously. This works now. Could you PTAL?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks better, thanks!

@ardaguclu
Copy link
Member Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2025
@ardaguclu
Copy link
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2025
@tchap
Copy link
Contributor

tchap commented Nov 28, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ardaguclu, tchap

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ardaguclu
Copy link
Member Author

/hold cancel

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 28, 2025

@ardaguclu: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants