Fix retry logic in DisruptionController #82152
Conversation
Force-pushed from 1f54a1d to 8066969
Force-pushed from 8066969 to 6baf98a
Thanks for the quick fix, but I think we also need a test?
This changes the retry logic in DisruptionController so that it reconciles update conflicts. In the old behavior, any PDB status update failure was retried with the same status, regardless of the error. Now there is no retry loop around the status update; the error is passed up the stack, where the PDB can be requeued for processing.

If the PDB status update error is a conflict error, there are two new special cases (see the sketch below):

- failSafe is not triggered, since this is considered a retryable error
- the PDB is requeued immediately (ignoring the rate limiter), because we assume the conflict can be resolved by getting the latest version
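A minimal sketch of that control flow, assuming a client-go workqueue-based controller; `trySync` and `failSafe` here are hypothetical placeholders for the real recompute/fail-safe logic, not the actual implementation:

```go
package disruption

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/workqueue"
)

// controller is a minimal stand-in for DisruptionController; trySync and
// failSafe are hypothetical placeholders for the real logic.
type controller struct {
	queue    workqueue.RateLimitingInterface
	trySync  func(key string) error
	failSafe func(key string)
}

// sync shows the conflict special-casing described above.
func (c *controller) sync(key string) error {
	err := c.trySync(key) // recompute and write the PDB status
	if apierrors.IsConflict(err) {
		// A conflict is retryable: re-fetching the PDB should resolve it, so
		// requeue immediately (bypassing the rate limiter) and skip failSafe.
		c.queue.Add(key)
		return nil
	}
	if err != nil {
		// Other persistent errors still trip the fail-safe; the caller
		// requeues through the rate limiter.
		c.failSafe(key)
	}
	return err
}
```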
Force-pushed from 6baf98a to 1cd9c2a
Force-pushed from 7cea878 to f0c5a7c
@@ -1025,3 +1038,64 @@ func TestDeploymentFinderFunction(t *testing.T) {
	})
}
}

func TestUpdatePDBStatusRetries(t *testing.T) {
This test verifies that it retries on a conflict error, but nowhere is it made clear in the code or comments why this is important. Can you mention the race here and that this is just the simplest way of avoiding it?
I've drastically revamped the test so that it asserts on the underlying bad behavior rather than how we fix it. PTAL
Force-pushed from f0c5a7c to d9a5cbc
This tests the PDB status update path in DisruptionController and asserts that conflicting writes (with the eviction handler) are handled gracefully. It adds the client-go fake.Clientset to our tests, because that is the layer required for injecting update failures. It also adds a TestMain so that DisruptionController logs can be enabled during tests, e.g., go test ./pkg/controller/disruption -v -args -v=4
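A rough sketch of those two pieces, assuming the 1.15-era `k8s.io/klog` import (`klog/v2` in newer trees); `newConflictingClient` and the reactor's resource/name arguments are illustrative, not the test's actual code:

```go
package disruption

import (
	"flag"
	"os"
	"testing"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/kubernetes/fake"
	core "k8s.io/client-go/testing"
	"k8s.io/klog"
)

// TestMain wires klog flags into `go test` so controller logs can be
// enabled, e.g.: go test ./pkg/controller/disruption -v -args -v=4
func TestMain(m *testing.M) {
	klog.InitFlags(nil)
	flag.Parse()
	os.Exit(m.Run())
}

// newConflictingClient returns a fake clientset whose PDB updates always
// fail with a conflict error, simulating a concurrent writer.
func newConflictingClient(objs ...runtime.Object) *fake.Clientset {
	client := fake.NewSimpleClientset(objs...)
	client.PrependReactor("update", "poddisruptionbudgets",
		func(action core.Action) (bool, runtime.Object, error) {
			gr := schema.GroupResource{Group: "policy", Resource: "poddisruptionbudgets"}
			return true, nil, apierrors.NewConflict(gr, "foo", nil)
		})
	return client
}
```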
Force-pushed from d9a5cbc to c8d937c
// not guarantee that informer event handlers have completed. Fortunately,
// DisruptionController does most of its logic by reading from informer
// listers, so this guarantee is sufficient.
if err := waitForCacheCount(dc.pdbStore, 1); err != nil {
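For context, a hypothetical shape of the `waitForCacheCount` helper referenced here: poll an informer store until it holds the expected number of objects.

```go
import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
)

// waitForCacheCount polls store until it contains exactly n objects, or
// times out. This is a sketch of the helper's likely shape, not its
// verbatim source.
func waitForCacheCount(store cache.Store, n int) error {
	return wait.PollImmediate(10*time.Millisecond, wait.ForeverTestTimeout,
		func() (bool, error) {
			return len(store.List()) == n, nil
		})
}
```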
Isn't this wait covered by the wait on line 1084 below?
I ran the test without this wait and the test passed.
The test is likely to be flaky without these waits.
// Evict simulates the visible effects of eviction in our fake client.
evict := func(podNames ...string) {
	// These GVRs are copied from the generated fake code because they are not exported.
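For illustration, such an evict closure might look roughly like the sketch below; it assumes a `fake.Clientset` named `client` and a `*testing.T` named `t` in the surrounding scope, with pods in the "default" namespace (all assumptions, not the test's actual code):

```go
// Illustrative sketch only: a real eviction ends with the pod being
// deleted, so simulate that directly against the fake clientset's
// object tracker.
podGVR := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
evict := func(podNames ...string) {
	for _, name := range podNames {
		if err := client.Tracker().Delete(podGVR, "default", name); err != nil {
			t.Fatalf("failed to delete pod %q: %v", name, err)
		}
	}
}
evict("pod-0", "pod-1")
```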
Maybe extract the code up to line 1125 into a helper for future code reuse?
Are you referring to the evict function, or just the GVR constants? I would prefer to extract these in the future when we have a second or third call site.
The function could be extracted as a generic helper, but the current implementation is tightly coupled to this unit test. It both captures local variables from the surrounding scope and makes assumptions about the fake clientset that will be used.
Refactoring in the future is fine.
/remove-sig api-machinery

friday ping

friendly honk

/lgtm Thank you for the extensive test.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, misterikkit

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment
/retest Review the full test history for this PR. Silence the bot with an `/lgtm cancel` comment for consistent failures.
5 similar comments followed.
…82152-upstream-release-1.15 Automated cherry pick of #82152: Fix retry logic in DisruptionController
…82152-upstream-release-1.16 Automated cherry pick of #82152: Fix retry logic in DisruptionController
I'm confused about why the disruption controller and the eviction handler can update the PDB status concurrently?

// refresh tries to re-GET the given PDB. If there are any errors, it just
}

func (dc *DisruptionController) writePdbStatus(pdb *policy.PodDisruptionBudget) error {
}
What type of PR is this?
/kind bug
What this PR does / why we need it:
This changes the retry logic in DisruptionController so that it
reconciles update conflicts. In the old behavior, any PDB status update
failure was retried with the same status, regardless of the error.
Now there is no retry loop around the status update. The error is passed
up the stack, where the PDB can be requeued for processing.
If the PDB status update error is a conflict error, there are two new
special cases:
- failSafe is not triggered, since this is considered a retryable error
- the PDB is requeued immediately (ignoring the rate limiter), because we
  assume the conflict can be resolved by getting the latest version
Which issue(s) this PR fixes:
Fixes #82149
Special notes for your reviewer:
I am uncertain whether bypassing the rate limiter is advisable.
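For context on that question, the distinction is between the two requeue methods on client-go's workqueue; a minimal sketch (the key string is illustrative):

```go
package main

import "k8s.io/client-go/util/workqueue"

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	key := "default/my-pdb" // illustrative PDB key

	// Conflict path: requeue immediately, bypassing per-item backoff, on the
	// assumption that a fresh GET of the PDB resolves the conflict.
	queue.Add(key)

	// Everything else: requeue with exponential backoff via the rate limiter.
	queue.AddRateLimited(key)
}
```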
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: