Augmenting API call retry in nodeinfomanager; Revert #70891 #71058

verult · 2018-11-15T02:22:36Z

/kind bug

What this PR does / why we need it: Fixes the retry loop inside kubelet NodeInfoManager around updating nodes. The original retry didn't seem to catch conflict errors from PatchNodeStatus(). Also for the sake of completeness, the retry loop around CSINodeInfo was upgraded as well.

Also re-enables a previously failing test.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes one of the issues in #70760 , and also the last issue in #67972

Does this PR introduce a user-facing change?:

NONE

/sig storage
/assign @davidz627 @vladimirvivien

msau42 · 2018-11-15T02:24:59Z

test/e2e/storage/testsuites/volumemode.go

@@ -341,9 +339,6 @@ func testVolumeModeSuccessForDynamicPV(input *volumeModeTestInput) {
 		ns := f.Namespace
 		var err error

-		// TODO: This skip should be removed once #70760 is fixed
-		skipTestUntilBugfix("70760", input.driverName, []string{"csi-hostpath", "com.google.csi.gcepd"})


We can't quite re-enable it yet. We need to update pd csi driver to pick up all the 1.0 changes

I believe the test as is will use the v0.2.0 driver and v0.4.0 sidecars so it should actually be fine. Then I believe Saad will do an atomic update with all the driver versions + spec update

msau42 · 2018-11-15T02:26:07Z

pkg/volume/csi/nodeinfomanager/nodeinfomanager.go

-		}
+	nodeClient := kubeClient.CoreV1().Nodes()
+	originalNode, err := nodeClient.Get(string(nim.nodeName), metav1.GetOptions{})
+	node := originalNode.DeepCopy()


What is originalNode if err is returned?

that's potentially a nil pointer, thanks for the catch!

AishSundar · 2018-11-15T03:56:57Z

/milestone v1.13
/priority important-soon
/kind failing-test

gnufied · 2018-11-15T21:06:00Z

pkg/volume/csi/nodeinfomanager/nodeinfomanager.go

 // updateNode repeatedly attempts to update the corresponding node object
 // which is modified by applying the given update functions sequentially.
 // Because updateFuncs are applied sequentially, later updateFuncs should take into account
 // the effects of previous updateFuncs to avoid potential conflicts. For example, if multiple
 // functions update the same field, updates in the last function are persisted.
-func (nim *nodeInfoManager) updateNode(updateFuncs ...nodeUpdateFunc) error {
-	retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {


So as I understand it, if there is a conflict on patch the returned http error code is not 409 ? and that is why the original RetryOnConflict did not work?

I didn't verify what error code was returned by Patch, but regardless I think we need to retry on all errors, not just on conflict. Otherwise a driver registration event will be dropped entirely

davidz627 · 2018-11-15T21:11:25Z

pkg/volume/csi/nodeinfomanager/nodeinfomanager.go

@@ -137,51 +140,59 @@ func (nim *nodeInfoManager) UninstallCSIDriver(driverName string) error {
 	return nil
 }

+func (nim *nodeInfoManager) updateNode(updateFuncs ...nodeUpdateFunc) error {
+	var updateErrs []error
+	err := wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) {


I'm thinking maybe DefaultBackoff would be better here:

// DefaultRetry is the recommended retry for a conflict where multiple clients // are making changes to the same resource. var DefaultRetry = wait.Backoff{ Steps: 5, Duration: 10 * time.Millisecond, Factor: 1.0, Jitter: 0.1, } // DefaultBackoff is the recommended backoff for a conflict where a client // may be attempting to make an unrelated modification to a resource under // active management by one or more controllers. var DefaultBackoff = wait.Backoff{ Steps: 4, Duration: 10 * time.Millisecond, Factor: 5.0, Jitter: 0.1, }

Now that I think about it, it's probably better to define a custom Backoff since the ones you listed above are meant for conflicts.

davidz627 · 2018-11-15T22:57:30Z

/lgtm

verult · 2018-11-15T23:08:25Z

Realized I missed one other instance of RetryOnConflict(); replaced it in the latest commit

davidz627 · 2018-11-15T23:30:45Z

please squash but lgtm

verult · 2018-11-16T00:06:34Z

Squashed!

saad-ali

/approve

saad-ali · 2018-11-16T00:34:05Z

pkg/volume/csi/nodeinfomanager/nodeinfomanager.go

@@ -137,51 +148,59 @@ func (nim *nodeInfoManager) UninstallCSIDriver(driverName string) error {
 	return nil
 }

+func (nim *nodeInfoManager) updateNode(updateFuncs ...nodeUpdateFunc) error {
+	var updateErrs []error
+	err := wait.ExponentialBackoff(updateBackoff, func() (bool, error) {


Any concern about this blocking for a long time and preventing progress on the caller?

The plugin watcher spawns a separate goroutine for each event, so we should be good here

k8s-ci-robot · 2018-11-16T00:34:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali, verult

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/volume/csi/OWNERS~~ [saad-ali]
~~test/e2e/storage/OWNERS~~ [saad-ali]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

davidz627 · 2018-11-16T00:39:14Z

/lgtm

k8s-ci-robot · 2018-11-16T00:48:36Z

@verult: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-kubernetes-local-e2e-containerized	9731f273833e0ab2e46e695a37b143a195118eaa	link	`/test pull-kubernetes-local-e2e-containerized`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

AishSundar · 2018-11-16T14:33:15Z

@verult looks like this one needs to rebase?

AishSundar · 2018-11-16T20:23:38Z

/priority critical-urgent
/remove-priority important-soon

Uping the priority since this is needed to enable the CSI Alpha tests in 1.13

This reverts commit f98b87c.

davidz627 · 2018-11-16T21:06:23Z

/lgtm

davidz627 · 2019-02-27T00:58:07Z

@verult @msau42 @saad-ali thoughts on cherry-picking this to 1.12, maybe even 1.11? Issue has existed for a while and this users are hitting the problem with older k8s versions: kubernetes-csi/external-attacher#126

msau42 · 2019-02-27T02:15:47Z

Sounds fine with me, but I'm not sure if it's going to be a straight cherrypick. I believe nodeinfomanager has changed signficantly

Partial cherrypick of #71058

k8s-ci-robot assigned davidz627 and vladimirvivien Nov 15, 2018

k8s-ci-robot requested review from gnufied and saad-ali November 15, 2018 02:22

msau42 reviewed Nov 15, 2018

View reviewed changes

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 15, 2018

k8s-ci-robot added this to the v1.13 milestone Nov 15, 2018

k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 15, 2018

AishSundar mentioned this pull request Nov 15, 2018

CSI alpha tests are failing #70760

Closed

gnufied reviewed Nov 15, 2018

View reviewed changes

davidz627 reviewed Nov 15, 2018

View reviewed changes

verult force-pushed the nodeinfomanager-retry branch from f4b632f to 9731f27 Compare November 15, 2018 22:46

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 15, 2018

verult force-pushed the nodeinfomanager-retry branch from 9731f27 to 2c8d77c Compare November 15, 2018 22:51

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 15, 2018

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Nov 15, 2018

verult force-pushed the nodeinfomanager-retry branch from 2dee433 to 77ea4f5 Compare November 16, 2018 00:06

saad-ali approved these changes Nov 16, 2018

View reviewed changes

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 16, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 16, 2018

k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 16, 2018

verult added 2 commits November 16, 2018 13:00

Augmenting API call retry in nodeinfomanager

ca18690

Revert "Make csi alpha failing test skip"

b275ebb

This reverts commit f98b87c.

verult force-pushed the nodeinfomanager-retry branch from 77ea4f5 to b275ebb Compare November 16, 2018 21:01

k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 16, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018

k8s-ci-robot merged commit b6bbb01 into kubernetes:master Nov 17, 2018

verult mentioned this pull request Mar 7, 2019

Partial cherrypick of #71058 #74826

Merged

k8s-ci-robot added a commit that referenced this pull request Mar 9, 2019

Merge pull request #74826 from verult/pr71058-1.12

136ae33

Partial cherrypick of #71058

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Augmenting API call retry in nodeinfomanager; Revert #70891 #71058

Augmenting API call retry in nodeinfomanager; Revert #70891 #71058

verult commented Nov 15, 2018

msau42 Nov 15, 2018

davidz627 Nov 15, 2018

msau42 Nov 15, 2018

verult Nov 15, 2018

AishSundar commented Nov 15, 2018

gnufied Nov 15, 2018

verult Nov 15, 2018

davidz627 Nov 15, 2018

verult Nov 15, 2018

davidz627 commented Nov 15, 2018

verult commented Nov 15, 2018

davidz627 commented Nov 15, 2018

verult commented Nov 16, 2018

saad-ali left a comment

saad-ali Nov 16, 2018

verult Nov 16, 2018

k8s-ci-robot commented Nov 16, 2018

davidz627 commented Nov 16, 2018

k8s-ci-robot commented Nov 16, 2018

AishSundar commented Nov 16, 2018

AishSundar commented Nov 16, 2018

davidz627 commented Nov 16, 2018

davidz627 commented Feb 27, 2019

msau42 commented Feb 27, 2019

Augmenting API call retry in nodeinfomanager; Revert #70891 #71058

Augmenting API call retry in nodeinfomanager; Revert #70891 #71058

Conversation

verult commented Nov 15, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AishSundar commented Nov 15, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davidz627 commented Nov 15, 2018

verult commented Nov 15, 2018

davidz627 commented Nov 15, 2018

verult commented Nov 16, 2018

saad-ali left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

k8s-ci-robot commented Nov 16, 2018

davidz627 commented Nov 16, 2018

k8s-ci-robot commented Nov 16, 2018

AishSundar commented Nov 16, 2018

AishSundar commented Nov 16, 2018

davidz627 commented Nov 16, 2018

davidz627 commented Feb 27, 2019

msau42 commented Feb 27, 2019