Fix retry logic in DisruptionController #82152

misterikkit · 2019-08-29T23:29:22Z

What type of PR is this?
/kind bug

What this PR does / why we need it:

This changes the retry logic in DisruptionController so that it
reconciles update conflicts. In the old behavior, any pdb status update
failure was retried with the same status, regardless of error.

Now there is no retry logic with the status update. The error is passed
up the stack where the PDB can be requeued for processing.

If the PDB status update error is a conflict error, there are some new
special cases:

- failSafe is not triggered, since this is considered a retryable error
- the PDB is requeued immediately (ignoring the rate limiter) because we
  assume that conflict can be resolved by getting the latest version
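
The conflict handling described above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in this PR: `errConflict`, `errUnavailable`, `queue`, `recordingQueue`, and `handleSyncResult` are hypothetical names, and the rate-limited requeue for non-conflict errors is an assumption. A real controller would more likely detect conflicts with `IsConflict` from `k8s.io/apimachinery/pkg/api/errors` and requeue through a client-go workqueue.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a hypothetical stand-in for an API server conflict error.
var errConflict = errors.New("conflict: object has been modified")

// errUnavailable is a hypothetical non-conflict failure.
var errUnavailable = errors.New("server unavailable")

// queue is a hypothetical, minimal stand-in for a rate-limited work queue.
type queue interface {
	Add(key string)            // requeue immediately
	AddRateLimited(key string) // requeue after backoff
}

// recordingQueue records which requeue path was taken for each key.
type recordingQueue struct {
	immediate   []string
	rateLimited []string
}

func (q *recordingQueue) Add(key string) {
	q.immediate = append(q.immediate, key)
}

func (q *recordingQueue) AddRateLimited(key string) {
	q.rateLimited = append(q.rateLimited, key)
}

// handleSyncResult decides how to requeue a PDB key after a sync attempt
// and reports whether the fail-safe path should run.
func handleSyncResult(q queue, key string, err error) (failSafe bool) {
	switch {
	case err == nil:
		// Nothing to do: the status update succeeded.
		return false
	case errors.Is(err, errConflict):
		// Retryable: skip failSafe and requeue right away so the next
		// sync works from the latest version of the PDB.
		q.Add(key)
		return false
	default:
		// Assumption: other errors trigger failSafe and back off.
		q.AddRateLimited(key)
		return true
	}
}

func main() {
	q := &recordingQueue{}
	fmt.Println("failSafe on conflict:", handleSyncResult(q, "default/pdb-a", errConflict))
	fmt.Println("failSafe on other error:", handleSyncResult(q, "default/pdb-b", errUnavailable))
	fmt.Println("immediate:", q.immediate, "rate limited:", q.rateLimited)
}
```

Keeping the decision in one function makes the two special cases easy to assert on without running the controller.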

Which issue(s) this PR fixes:

Fixes #82149

Special notes for your reviewer:

I am uncertain whether bypassing the rate limiter is advisable.

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

lavalamp · 2019-08-30T21:21:12Z

thanks for the quick fix, but I think we also need a test?

misterikkit · 2019-08-30T23:19:43Z

/assign @mml @lavalamp

lavalamp · 2019-09-03T16:47:22Z

pkg/controller/disruption/disruption_test.go

@@ -1025,3 +1038,64 @@ func TestDeploymentFinderFunction(t *testing.T) {
 		})
 	}
 }
+
+func TestUpdatePDBStatusRetries(t *testing.T) {


This test verifies that it retries on a conflict error, but nowhere is it made clear in the code or comments why this is important. Can you mention the race here and that this is just the simplest way of avoiding it?

I've drastically revamped the test so that it asserts on the underlying bad behavior rather than how we fix it. PTAL

This tests the PDB status update path in DisruptionController and asserts that conflicting writes (with the eviction handler) are handled gracefully. This adds the client-go fake.Clientset to our tests, because that is the layer required for injecting update failures. This also adds a TestMain so that DisruptionController logs can be enabled during tests, e.g., go test ./pkg/controller/disruption -v -args -v=4
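
To illustrate the kind of conflicting write the test is concerned with, here is a minimal, stdlib-only sketch of optimistic concurrency on a PDB status. It does not use the fake clientset from this PR. `pdbStatus`, `fakeServer`, and `syncOnce` are hypothetical names, and the integer `ResourceVersion` is a simplification of the real opaque string.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a hypothetical stand-in for an API server conflict error.
var errConflict = errors.New("conflict: stale resourceVersion")

// pdbStatus is a hypothetical, trimmed-down PDB status.
type pdbStatus struct {
	ResourceVersion    int
	DisruptionsAllowed int
}

// fakeServer mimics the API server's optimistic concurrency check.
type fakeServer struct {
	current pdbStatus
}

func (s *fakeServer) Get() pdbStatus { return s.current }

// Update rejects writes that are based on an outdated read.
func (s *fakeServer) Update(next pdbStatus) error {
	if next.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	next.ResourceVersion++
	s.current = next
	return nil
}

// syncOnce recomputes the allowed disruptions from the observed pod counts
// and writes the result on top of the given read.
func syncOnce(s *fakeServer, read pdbStatus, healthy, desired int) error {
	allowed := healthy - desired
	if allowed < 0 {
		allowed = 0
	}
	read.DisruptionsAllowed = allowed
	return s.Update(read)
}

func main() {
	s := &fakeServer{current: pdbStatus{ResourceVersion: 1, DisruptionsAllowed: 1}}

	// The controller reads the PDB while three pods are healthy.
	stale := s.Get()

	// Meanwhile the eviction handler records an eviction.
	evicted := s.Get()
	evicted.DisruptionsAllowed--
	fmt.Println("eviction write error:", s.Update(evicted))

	// The controller's write is based on a stale read and is rejected.
	fmt.Println("stale write error:", syncOnce(s, stale, 3, 2))

	// After requeueing, the controller re-reads, recomputes, and succeeds.
	fmt.Println("fresh write error:", syncOnce(s, s.Get(), 2, 2))
	fmt.Println("disruptions allowed:", s.Get().DisruptionsAllowed)
}
```

The point of the sketch is that the rejected write is never replayed as-is; the status is recomputed from a fresh read.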

tedyu · 2019-09-04T01:02:09Z

pkg/controller/disruption/disruption_test.go

+	// not guarantee that informer event handlers have completed. Fortunately,
+	// DisruptionController does most of its logic by reading from informer
+	// listers, so this guarantee is sufficient.
+	if err := waitForCacheCount(dc.pdbStore, 1); err != nil {


Isn't this wait covered by the wait on line 1084 below?

I ran the test without this wait and the test passed.

The test is likely to be flaky without these waits.
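
For readers unfamiliar with this kind of wait, a polling helper might look roughly like the sketch below. The actual `waitForCacheCount` from this PR is not shown in this thread, so `waitForCount`, `lister`, and `countingStore` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lister is a hypothetical stand-in for an informer store; only the item
// count matters for this sketch.
type lister interface {
	Len() int
}

// countingStore is a tiny thread-safe store used to drive the example.
type countingStore struct {
	mu    sync.Mutex
	items []string
}

func (s *countingStore) Add(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items = append(s.items, key)
}

func (s *countingStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.items)
}

// waitForCount polls until the store holds exactly want items, or returns
// an error once the timeout has elapsed.
func waitForCount(store lister, want int, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if store.Len() == want {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for %d items, have %d", want, store.Len())
		}
		time.Sleep(interval)
	}
}

func main() {
	store := &countingStore{}

	// Simulate an informer delivering an object a little later.
	go func() {
		time.Sleep(10 * time.Millisecond)
		store.Add("default/pdb-a")
	}()

	err := waitForCount(store, 1, time.Millisecond, time.Second)
	fmt.Println("cache synced:", err == nil)
}
```

Without such a wait, the test could read from the lister before the object has arrived, which is one way the flakiness mentioned above could appear.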

tedyu · 2019-09-04T01:28:13Z

pkg/controller/disruption/disruption_test.go

+
+	// Evict simulates the visible effects of eviction in our fake client.
+	evict := func(podNames ...string) {
+		// These GVRs are copied from the generated fake code because they are not exported.


Maybe extract the code up to line 1125 into a helper for future reuse.

Are you referring to the evict function, or just the GVR constants? I would prefer to extract these in the future when we have a second or third call site.

The function could be extracted as a generic helper, but the current implementation is tightly coupled to this unit test. It both captures local variables from the surrounding scope and makes assumptions about the fake clientset that will be used.

Refactoring in the future is fine.

misterikkit · 2019-09-11T19:39:11Z

/remove-sig api-machinery

misterikkit · 2019-09-13T22:58:13Z

friday ping

misterikkit · 2019-10-18T18:08:21Z

friendly honk

lavalamp · 2019-10-22T19:01:39Z

/lgtm
/approve

Thank you for the extensive test.

k8s-ci-robot · 2019-10-22T19:02:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, misterikkit

Needs approval from an approver in each of these files:

~~pkg/controller/disruption/OWNERS~~ [lavalamp]

fejta-bot · 2019-10-23T00:24:48Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).
