Augmenting API call retry in nodeinfomanager; Revert #70891 #71058
Conversation
```diff
@@ -341,9 +339,6 @@ func testVolumeModeSuccessForDynamicPV(input *volumeModeTestInput) {
	ns := f.Namespace
	var err error

	// TODO: This skip should be removed once #70760 is fixed
	skipTestUntilBugfix("70760", input.driverName, []string{"csi-hostpath", "com.google.csi.gcepd"})
```
We can't quite re-enable it yet. We need to update the PD CSI driver to pick up all the 1.0 changes.
I believe the test as-is will use the v0.2.0 driver and v0.4.0 sidecars, so it should actually be fine. Then I believe Saad will do an atomic update with all the driver versions + spec update.
```go
	}
	nodeClient := kubeClient.CoreV1().Nodes()
	originalNode, err := nodeClient.Get(string(nim.nodeName), metav1.GetOptions{})
	node := originalNode.DeepCopy()
```
What is originalNode if err is returned?
That's potentially a nil pointer, thanks for the catch!
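The fix the reviewers are pointing at is to check `err` before ever touching the result. A stdlib-only sketch of the pattern (the `node`, `getNode`, and `fetchNodeCopy` names are illustrative stand-ins for `*v1.Node`, `nodeClient.Get`, and the surrounding nodeinfomanager code, not the real API):

```go
package main

import (
	"errors"
	"fmt"
)

// node stands in for *v1.Node (illustrative only).
type node struct{ name string }

// deepCopy mimics the generated DeepCopy; calling it through a nil
// receiver here returns nil, but the real DeepCopy would panic.
func (n *node) deepCopy() *node {
	c := *n
	return &c
}

// getNode stands in for nodeClient.Get: on error it returns a nil node.
func getNode(name string) (*node, error) {
	if name == "" {
		return nil, errors.New("node name required")
	}
	return &node{name: name}, nil
}

// fetchNodeCopy surfaces the error BEFORE dereferencing the result,
// avoiding the nil-pointer dereference flagged in review.
func fetchNodeCopy(name string) (*node, error) {
	originalNode, err := getNode(name)
	if err != nil {
		return nil, fmt.Errorf("get node %q: %v", name, err)
	}
	return originalNode.deepCopy(), nil
}

func main() {
	if _, err := fetchNodeCopy(""); err != nil {
		fmt.Println("error surfaced instead of panicking")
	}
	n, _ := fetchNodeCopy("node-1")
	fmt.Println(n.name)
}
```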
/milestone v1.13
```go
// updateNode repeatedly attempts to update the corresponding node object
// which is modified by applying the given update functions sequentially.
// Because updateFuncs are applied sequentially, later updateFuncs should take into account
// the effects of previous updateFuncs to avoid potential conflicts. For example, if multiple
// functions update the same field, updates in the last function are persisted.
func (nim *nodeInfoManager) updateNode(updateFuncs ...nodeUpdateFunc) error {
	retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
```
So as I understand it, if there is a conflict on Patch, the returned HTTP error code is not 409, and that is why the original `RetryOnConflict` did not work?
I didn't verify which error code is returned by Patch, but regardless I think we need to retry on all errors, not just on conflict. Otherwise a driver registration event will be dropped entirely.
```diff
@@ -137,51 +140,59 @@ func (nim *nodeInfoManager) UninstallCSIDriver(driverName string) error {
	return nil
}

func (nim *nodeInfoManager) updateNode(updateFuncs ...nodeUpdateFunc) error {
	var updateErrs []error
	err := wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) {
```
I'm thinking maybe DefaultBackoff would be better here:

```go
// DefaultRetry is the recommended retry for a conflict where multiple clients
// are making changes to the same resource.
var DefaultRetry = wait.Backoff{
	Steps:    5,
	Duration: 10 * time.Millisecond,
	Factor:   1.0,
	Jitter:   0.1,
}

// DefaultBackoff is the recommended backoff for a conflict where a client
// may be attempting to make an unrelated modification to a resource under
// active management by one or more controllers.
var DefaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}
```
Now that I think about it, it's probably better to define a custom Backoff since the ones you listed above are meant for conflicts.
Force-pushed from f4b632f to 9731f27
Force-pushed from 9731f27 to 2c8d77c
/lgtm
Realized I missed one other instance of
please squash but lgtm
Force-pushed from 2dee433 to 77ea4f5
Squashed!
/approve
```diff
@@ -137,51 +148,59 @@ func (nim *nodeInfoManager) UninstallCSIDriver(driverName string) error {
	return nil
}

func (nim *nodeInfoManager) updateNode(updateFuncs ...nodeUpdateFunc) error {
	var updateErrs []error
	err := wait.ExponentialBackoff(updateBackoff, func() (bool, error) {
```
Any concern about this blocking for a long time and preventing progress on the caller?
The plugin watcher spawns a separate goroutine for each event, so we should be good here
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: saad-ali, verult. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
@verult: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@verult looks like this one needs a rebase?
/priority critical-urgent
Upping the priority since this is needed to enable the CSI Alpha tests in 1.13.
Force-pushed from 77ea4f5 to b275ebb
/lgtm
@verult @msau42 @saad-ali thoughts on cherry-picking this to 1.12, maybe even 1.11? The issue has existed for a while and users are hitting the problem with older k8s versions: kubernetes-csi/external-attacher#126
Sounds fine with me, but I'm not sure if it's going to be a straight cherry-pick. I believe nodeinfomanager has changed significantly.
Partial cherrypick of #71058
What this PR does / why we need it: Fixes the retry loop inside the kubelet NodeInfoManager around updating nodes. The original retry didn't seem to catch conflict errors from `PatchNodeStatus()`. Also, for the sake of completeness, the retry loop around CSINodeInfo was upgraded as well. Also re-enables a previously failing test.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes one of the issues in #70760, and also the last issue in #67972.
Does this PR introduce a user-facing change?:
/sig storage
/assign @davidz627 @vladimirvivien