Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Augmenting API call retry in nodeinfomanager; Revert #70891 #71058

Merged
merged 2 commits into from Nov 17, 2018

Conversation

verult
Copy link
Contributor

@verult verult commented Nov 15, 2018

/kind bug

What this PR does / why we need it: Fixes the retry loop inside kubelet NodeInfoManager around updating nodes. The original retry didn't seem to catch conflict errors from PatchNodeStatus(). Also for the sake of completeness, the retry loop around CSINodeInfo was upgraded as well.

Also re-enables a previously failing test.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes one of the issues in #70760 , and also the last issue in #67972

Does this PR introduce a user-facing change?:

NONE

/sig storage
/assign @davidz627 @vladimirvivien

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. sig/testing Categorizes an issue or PR as relevant to SIG Testing. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 15, 2018
@@ -341,9 +339,6 @@ func testVolumeModeSuccessForDynamicPV(input *volumeModeTestInput) {
ns := f.Namespace
var err error

// TODO: This skip should be removed once #70760 is fixed
skipTestUntilBugfix("70760", input.driverName, []string{"csi-hostpath", "com.google.csi.gcepd"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't quite re-enable it yet. We need to update pd csi driver to pick up all the 1.0 changes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the test as is will use the v0.2.0 driver and v0.4.0 sidecars so it should actually be fine. Then I believe Saad will do an atomic update with all the driver versions + spec update

}
nodeClient := kubeClient.CoreV1().Nodes()
originalNode, err := nodeClient.Get(string(nim.nodeName), metav1.GetOptions{})
node := originalNode.DeepCopy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is originalNode if err is returned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's potentially a nil pointer, thanks for the catch!

@AishSundar
Copy link
Contributor

/milestone v1.13
/priority important-soon
/kind failing-test

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 15, 2018
@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 15, 2018
@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 15, 2018
// updateNode repeatedly attempts to update the corresponding node object
// which is modified by applying the given update functions sequentially.
// Because updateFuncs are applied sequentially, later updateFuncs should take into account
// the effects of previous updateFuncs to avoid potential conflicts. For example, if multiple
// functions update the same field, updates in the last function are persisted.
func (nim *nodeInfoManager) updateNode(updateFuncs ...nodeUpdateFunc) error {
retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as I understand it, if there is a conflict on patch the returned http error code is not 409 ? and that is why the original RetryOnConflict did not work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't verify what error code was returned by Patch, but regardless I think we need to retry on all errors, not just on conflict. Otherwise a driver registration event will be dropped entirely

@@ -137,51 +140,59 @@ func (nim *nodeInfoManager) UninstallCSIDriver(driverName string) error {
return nil
}

func (nim *nodeInfoManager) updateNode(updateFuncs ...nodeUpdateFunc) error {
var updateErrs []error
err := wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking maybe DefaultBackoff would be better here:

// DefaultRetry is the recommended retry for a conflict where multiple clients
// are making changes to the same resource.
var DefaultRetry = wait.Backoff{
	Steps:    5,
	Duration: 10 * time.Millisecond,
	Factor:   1.0,
	Jitter:   0.1,
}

// DefaultBackoff is the recommended backoff for a conflict where a client
// may be attempting to make an unrelated modification to a resource under
// active management by one or more controllers.
var DefaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it, it's probably better to define a custom Backoff since the ones you listed above are meant for conflicts.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 15, 2018
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 15, 2018
@davidz627
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Nov 15, 2018
@verult
Copy link
Contributor Author

verult commented Nov 15, 2018

Realized I missed one other instance of RetryOnConflict(); replaced it in the latest commit

@davidz627
Copy link
Contributor

please squash but lgtm

@verult
Copy link
Contributor Author

verult commented Nov 16, 2018

Squashed!

Copy link
Member

@saad-ali saad-ali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@@ -137,51 +148,59 @@ func (nim *nodeInfoManager) UninstallCSIDriver(driverName string) error {
return nil
}

func (nim *nodeInfoManager) updateNode(updateFuncs ...nodeUpdateFunc) error {
var updateErrs []error
err := wait.ExponentialBackoff(updateBackoff, func() (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any concern about this blocking for a long time and preventing progress on the caller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plugin watcher spawns a separate goroutine for each event, so we should be good here

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali, verult

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 16, 2018
@davidz627
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018
@k8s-ci-robot
Copy link
Contributor

@verult: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-local-e2e-containerized 9731f273833e0ab2e46e695a37b143a195118eaa link /test pull-kubernetes-local-e2e-containerized

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@AishSundar
Copy link
Contributor

@verult looks like this one needs to rebase?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 16, 2018
@AishSundar
Copy link
Contributor

/priority critical-urgent
/remove-priority important-soon

Uping the priority since this is needed to enable the CSI Alpha tests in 1.13

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 16, 2018
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 16, 2018
@davidz627
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018
@k8s-ci-robot k8s-ci-robot merged commit b6bbb01 into kubernetes:master Nov 17, 2018
@davidz627
Copy link
Contributor

@verult @msau42 @saad-ali thoughts on cherry-picking this to 1.12, maybe even 1.11? Issue has existed for a while and this users are hitting the problem with older k8s versions: kubernetes-csi/external-attacher#126

@msau42
Copy link
Member

msau42 commented Feb 27, 2019

Sounds fine with me, but I'm not sure if it's going to be a straight cherrypick. I believe nodeinfomanager has changed signficantly

k8s-ci-robot added a commit that referenced this pull request Mar 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants