Fix must-gather hang: apply --timeout to log streaming phase#2297
Fix must-gather hang: apply --timeout to log streaming phase#2297afcollins wants to merge 1 commit into
Conversation
getGatherContainerLogs had no timeout, blocking forever if gather scripts ran long. Wrap all gather phases in a shared deadline context. Co-Authored-By: Claude Opus 4.6 Signed-off-by: Andrew Collins <ancollin@redhat.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: afcollins The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
WalkthroughThe change scopes a ChangesMust-Gather Timeout Scoping
Estimated code review effort: 2 (Simple) | ~10 minutes 🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/cli/admin/mustgather/mustgather.go (1)
834-854: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd/update unit test coverage for the new timeout scoping behavior.
This change alters lifecycle timing and error-suppression logic but no corresponding test changes are included.
processNextWorkItemis exercised viaMustGatherOptions, so a table-driven test simulating a slow/hanging log stream (context deadline expiring mid-stream) would validate that the command now returns promptly with "gather never finished" instead of hanging, and thatDeadlineExceededfrom log streaming doesn't spuriously log "gather logs unavailable."
As per coding guidelines, "All changes must include unit test additions/changes."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cli/admin/mustgather/mustgather.go` around lines 834 - 854, Add or update unit tests for the new timeout scoping in MustGatherOptions.processNextWorkItem. Cover a slow or hanging getGatherContainerLogs path where gatherCtx expires mid-stream, and assert the command returns promptly with the expected "gather never finished" behavior instead of hanging. Also verify that context.DeadlineExceeded or context.Canceled from getGatherContainerLogs does not trigger the "gather logs unavailable" log path, while non-context errors still do.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/cli/admin/mustgather/mustgather.go`:
- Around line 834-854: Add or update unit tests for the new timeout scoping in
MustGatherOptions.processNextWorkItem. Cover a slow or hanging
getGatherContainerLogs path where gatherCtx expires mid-stream, and assert the
command returns promptly with the expected "gather never finished" behavior
instead of hanging. Also verify that context.DeadlineExceeded or
context.Canceled from getGatherContainerLogs does not trigger the "gather logs
unavailable" log path, while non-context errors still do.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 79101a03-76c9-4e86-9a09-f4f771ba4057
📒 Files selected for processing (1)
pkg/cli/admin/mustgather/mustgather.go
|
@afcollins: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
|
||
| // wait for gather container to be running (gather is running) | ||
| if err := o.waitForGatherContainerRunning(ctx, pod); err != nil { | ||
| if err := o.waitForGatherContainerRunning(gatherCtx, pod); err != nil { |
There was a problem hiding this comment.
I would use ctx here. No need to include this kind of waiting in the timeout IMO.
| // wait for pod to be running (gather has completed) | ||
| log("waiting for gather to complete") | ||
| if err := o.waitForGatherToComplete(ctx, pod); err != nil { | ||
| if err := o.waitForGatherToComplete(gatherCtx, pod); err != nil { |
There was a problem hiding this comment.
Since you are passing a timeout context here, we don't need to use PollUntilContextTimeout and o.Timeout within this function, so it can be removed, I think. Not that it really matters, but better not to repeat the same thing on multiple places.
|
Also I am not sure about our policy, but I tend to remove |
getGatherContainerLogs had no timeout, blocking forever if gather scripts ran long. Wrap all gather phases in a shared deadline context.
Co-Authored-By: Claude Opus 4.6
Summary by CodeRabbit