Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix retry logic in DisruptionController #82152

Merged
merged 2 commits into from Oct 23, 2019

Conversation

misterikkit
Copy link

What type of PR is this?
/kind bug

What this PR does / why we need it:

This changes the retry logic in DisruptionController so that it
reconciles update conflicts. In the old behavior, any pdb status update
failure was retried with the same status, regardless of error.

Now there is no retry logic with the status update. The error is passed
up the stack where the PDB can be requeued for processing.

If the PDB status update error is a conflict error, there are some new
special cases:

  • failSafe is not triggered, since this is considered a retryable error
  • the PDB is requeued immediately (ignoring the rate limiter) because we
    assume that conflict can be resolved by getting the latest version

Which issue(s) this PR fixes:

Fixes #82149

Special notes for your reviewer:

I am uncertain whether bypassing the rate limiter is advisable.

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 29, 2019
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. area/code-generation area/kubectl sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cli Categorizes an issue or PR as relevant to SIG CLI. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 29, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2019
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 30, 2019
@lavalamp
Copy link
Member

thanks for the quick fix, but I think we also need a test?

This changes the retry logic in DisruptionController so that it
reconciles update conflicts. In the old behavior, any pdb status update
failure was retried with the same status, regardless of error.

Now there is no retry logic with the status update. The error is passed
up the stack where the PDB can be requeued for processing.

If the PDB status update error is a conflict error, there are some new
special cases:
- failSafe is not triggered, since this is considered a retryable error
- the PDB is requeued immediately (ignoring the rate limiter) because we
  assume that conflict can be resolved by getting the latest version
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 30, 2019
@misterikkit misterikkit force-pushed the disruption branch 2 times, most recently from 7cea878 to f0c5a7c Compare August 30, 2019 23:09
@misterikkit
Copy link
Author

/assign @mml @lavalamp

@@ -1025,3 +1038,64 @@ func TestDeploymentFinderFunction(t *testing.T) {
})
}
}

func TestUpdatePDBStatusRetries(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test verifies that it retries on a conflict error, but nowhere is it made clear in the code or comments why this is important. Can you mention the race here and that this is just the simplest way of avoiding it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've drastically revamped the test so that it asserts on the underlying bad behavior rather than how we fix it. PTAL

This tests the PDB status update path in DisruptionController and
asserts that conflicting writes (with eviciton handler) are handled
gracefully.

This adds the client-go fake.Clientset into our tests, because that is
the layer required for injecting update failures.

This also adds a TestMain so that DisruptionController logs can be
enabled during test. e.g.,

    go test ./pkg/controller/disruption -v -args -v=4
// not guarantee that informer event handlers have completed. Fortunately,
// DisruptionController does most of its logic by reading from informer
// listers, so this guarantee is sufficient.
if err := waitForCacheCount(dc.pdbStore, 1); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this wait covered by the wait on line 1084 below ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the test without this wait and the test passed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is likely to be flaky without these waits.


// Evict simulates the visible effects of eviction in our fake client.
evict := func(podNames ...string) {
// These GVRs are copied from the generated fake code because they are not exported.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extract the code till line 1125 in a helper for future code reuse

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to the evict function, or just the GVR constants? I would prefer to extract these in the future when we have a second or third call site.

The function could be extracted as a generic helper, but the current implementation is tightly coupled to this unit test. It both captures local variables from the surrounding scope and makes assumptions about the fake clientset that will be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactoring in the future is fine.

@misterikkit
Copy link
Author

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Sep 11, 2019
@misterikkit
Copy link
Author

friday ping

@misterikkit
Copy link
Author

friendly honk

@lavalamp
Copy link
Member

/lgtm
/approve

Thank you for the extensive test.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 22, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, misterikkit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 22, 2019
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

5 similar comments
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 65fa4c9 into kubernetes:master Oct 23, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 23, 2019
k8s-ci-robot added a commit that referenced this pull request Dec 7, 2019
…82152-upstream-release-1.15

Automated cherry pick of #82152: Fix retry logic in DisruptionController
k8s-ci-robot added a commit that referenced this pull request Dec 7, 2019
…82152-upstream-release-1.16

Automated cherry pick of #82152: Fix retry logic in DisruptionController
@b0b0haha
Copy link

I'am confiused about that why the disruption controller and evict handler can update the pdb status concurrently?
It seems when the disruption controller notice the pdb's update from the evict handler , it calls trySync() to update the pdb's PodDisruptionsAllowed again,
The original code below:

// refresh tries to re-GET the given PDB. If there are any errors, it just
// returns the old PDB. Intended to be used in a retry loop where it runs a
// bounded number of times.
func refresh(pdbClient policyclientset.PodDisruptionBudgetInterface, pdb *policy.PodDisruptionBudget) *policy.PodDisruptionBudget {
newPdb, err := pdbClient.Get(pdb.Name, metav1.GetOptions{})
if err == nil {
return newPdb
}
return pdb

}

func (dc *DisruptionController) writePdbStatus(pdb *policy.PodDisruptionBudget) error {
pdbClient := dc.kubeClient.PolicyV1beta1().PodDisruptionBudgets(pdb.Namespace)
st := pdb.Status

var err error
for i, pdb := 0, pdb; i < statusUpdateRetries; i, pdb = i+1, refresh(pdbClient, pdb) {
	pdb.Status = st
	if _, err = pdbClient.UpdateStatus(pdb); err == nil {
		break
	}
}

return err

}
but the pdb's status has been updated by evict handler, why the evict handler can vilolate the pdb's minAvailable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/code-generation area/kubectl cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cli Categorizes an issue or PR as relevant to SIG CLI. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PodDisruptionBudget can be violated by disruption controller
7 participants