unset selected node when storage is exhausted for topology segment #405
Conversation
pkg/controller/controller.go
case codes.ResourceExhausted: // CSI: operation not pending, "Unable to provision in `accessible_topology`"
	if mayReschedule {
		// may succeed elsewhere -> give up for now
		return true
	}
	// may still succeed at a later time -> continue
	return false
This is an unfortunate case of mixing gRPC transport error codes with application / CSI errors.

The Go gRPC implementation returns `ResourceExhausted` when it processes messages that are too big (both when sending and receiving). This is the logic behind the current code: the operation itself may be in progress or even have finished successfully (a too-big success response may have just been thrown away), and the provisioner should retry. Looking at the code, the gRPC max receive message size defaults to 4 MiB.

When a CSI driver intentionally returns `ResourceExhausted` because it ran out of space on the storage backend, the error is final and we can be 100% sure that the volume is not being provisioned. Rescheduling the pod / PVC to a different node is correct.

The questions are:
- How to recognize who sent the response. IMO it's impossible to distinguish the CSI driver from the gRPC implementation.
- Whether we can assume that `ResourceExhausted` returned by the gRPC implementation is safe not to retry.
- If CreateVolumeResponse / CreateVolumeRequest was too big, retrying the request should yield the same response and one of them is going to be too big again. Retrying is useless and there is a leaked volume. Trying on a different node won't harm.
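For illustration, a minimal Go sketch (a hypothetical helper, not code from this PR) of why the receiver cannot tell the two cases apart: both surface as the same status code, with no field identifying the sender.

```go
package main

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isResourceExhausted reports whether err carries codes.ResourceExhausted.
// Whether the status was generated by the gRPC transport (message too big)
// or returned deliberately by the CSI driver (out of storage), the caller
// sees exactly the same code and has no way to distinguish the sender.
func isResourceExhausted(err error) bool {
	st, ok := status.FromError(err)
	return ok && st.Code() == codes.ResourceExhausted
}
```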
This is a problem in the CSI spec, right? Instead of re-using status codes, it would have been better to pick new ones to avoid such ambiguity.

I agree that it is impossible to determine the sender. If sending the request already fails, then trying elsewhere is safe and in fact the only way the volume might (theoretically) still be provisioned, because retrying locally would just send the same message over and over again.

When receiving the reply fails with `ResourceExhausted`, the situation is worse in the sense that the volume probably has been provisioned, external-provisioner just never gets to know that. Retrying won't help much here either, unless an admin steps in and reconfigures the external-provisioner with larger max message sizes.

I think we should bring this up as a problem with the spec. But given what it is right now, avoiding the gRPC `ResourceExhausted` by using large enough message size limits and then assuming that the error really comes from the CSI driver seems viable to me.
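A minimal sketch of that mitigation, using a hypothetical dial helper (the call options are real gRPC API; the 4 MiB receive default is the one mentioned above):

```go
package main

import (
	"math"

	"google.golang.org/grpc"
)

// dialCSI raises the client-side message size limits far beyond the gRPC
// receive default (4 MiB), so that a ResourceExhausted error can no longer
// originate in the transport and can be attributed to the CSI driver.
func dialCSI(csiAddress string) (*grpc.ClientConn, error) {
	return grpc.Dial(
		csiAddress,
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(math.MaxInt32),
			grpc.MaxCallSendMsgSize(math.MaxInt32),
		),
	)
}
```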
sig-storage-lib-external-provisioner v5.0.0 switched to Go modules, so we have to adapt the import path.
This makes sense under some specific circumstances:
- the volume is supposed to be created only in a certain topology segment
- that segment is chosen by the pod scheduler via late binding
- the CSI driver supports topology
- the CSI driver reports that it ran out of storage

Previously, external-provisioner would keep retrying to create the volume instead of notifying the scheduler to pick a node anew. It's okay to treat ResourceExhausted as a final error; the CSI spec explicitly describes this case. However, it could also come from the gRPC transport layer, and thus it was previously treated as a non-final error merely because retrying made more sense (resources might become available, except when the root cause is "message size exceeded", which is unlikely to change).
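For context, a hedged sketch of what "unsetting the selected node" amounts to, assuming a recent client-go (the annotation key is the one the scheduler uses for late binding; the helper name is made up):

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rescheduleClaim removes the selected-node annotation from a PVC so that
// the scheduler picks a node (and thus a topology segment) anew.
func rescheduleClaim(ctx context.Context, client kubernetes.Interface, claim *v1.PersistentVolumeClaim) error {
	newClaim := claim.DeepCopy()
	delete(newClaim.Annotations, "volume.kubernetes.io/selected-node")
	_, err := client.CoreV1().PersistentVolumeClaims(newClaim.Namespace).
		Update(ctx, newClaim, metav1.UpdateOptions{})
	return err
}
```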
@jsafrane I've updated this to use sig-storage-lib-external-provisioner v5.0.0. I also filed an issue against the CSI spec about the ambiguity of `ResourceExhausted`. What's missing is a local test. E2E testing will cover this (kubernetes/kubernetes#88114), too.
Can this unit test be extended? https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller_test.go#L2473
pkg/controller/controller.go
// only makes sense if:
// - The CSI driver supports topology: without that, the next CreateVolume call after
//   rescheduling will be exactly the same.
// - We are using strict topology: otherwise the CSI driver is already allowed
I think we can remove the check for strict topology. I'm not aware of any drivers that retry provisioning on subsequent topologies in the list if the first fails.
Okay.
pkg/controller/controller.go
	p.strictTopology &&
	options.SelectedNode != nil
state := checkError(err, mayReschedule)
klog.V(5).Infof("CreateVolume failed, supports topology = %v, strict topology %v, node selected %v => may reschedule = %v => state = %v: %v",
should this be a warning?
I think a warning is too strong because this is a situation that is now supposed to be handled automatically, without admin attention. The log entry is really just information that this is happening.
Even drivers which did not explicitly ask for strict topology may benefit from rescheduling (kubernetes-csi#405 (comment)).
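In code terms, dropping the strict-topology check turns the condition from the diff hunks above into something like this sketch (`supportsTopology` is a stand-in for however the capability is actually checked):

```go
// Rescheduling remains gated on topology support and on late binding
// (a node was selected), but no longer on p.strictTopology.
mayReschedule := p.supportsTopology() &&
	options.SelectedNode != nil
state := checkError(err, mayReschedule)
```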
The full set of return values of ProvisionExt and the code paths in checkError are covered by the new test. While at it, klog logging gets enabled and a cut-and-paste error for the temp directory name gets fixed.
Can this unit test be extended? https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller_test.go#L2473

I ended up creating a new test specifically for error handling. `ProvisionExt` and its state return value had not been covered at all before.
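A minimal sketch of that kind of table-driven coverage, assuming the bool-returning `checkError` variant from the first diff hunk (true means "give up and reschedule"):

```go
package controller

import (
	"testing"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func TestCheckError(t *testing.T) {
	tests := []struct {
		name          string
		err           error
		mayReschedule bool
		want          bool
	}{
		// Out of space and the scheduler may pick another node: give up here.
		{"exhausted, may reschedule", status.Error(codes.ResourceExhausted, "out of space"), true, true},
		// Out of space but rescheduling is not possible: keep retrying.
		{"exhausted, no rescheduling", status.Error(codes.ResourceExhausted, "out of space"), false, false},
		// A timeout is never final: the operation may still be in progress.
		{"deadline exceeded", status.Error(codes.DeadlineExceeded, "timed out"), true, false},
	}
	for _, tt := range tests {
		if got := checkError(tt.err, tt.mayReschedule); got != tt.want {
			t.Errorf("%s: checkError() = %v, want %v", tt.name, got, tt.want)
		}
	}
}
```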
@msau42: please have another look.
/assign @msau42
/approve
Just have one question about the dependencies
k8s.io/apiserver v0.17.0
k8s.io/client-go v0.17.0
k8s.io/component-base v0.17.0
k8s.io/csi-translation-lib v0.17.0
k8s.io/klog v1.0.0
k8s.io/kubernetes v1.14.0
sigs.k8s.io/sig-storage-lib-external-provisioner v4.1.0+incompatible
sigs.k8s.io/sig-storage-lib-external-provisioner v4.1.0+incompatible // indirect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had noticed that, too, but ignored it because we never actually pull in any code from it (it's not in `vendor`).

This might be a bug in sig-storage-lib-external-provisioner:

$ go mod why -m sigs.k8s.io/sig-storage-lib-external-provisioner
# sigs.k8s.io/sig-storage-lib-external-provisioner
github.com/kubernetes-csi/external-provisioner/cmd/csi-provisioner
sigs.k8s.io/sig-storage-lib-external-provisioner/v5/controller
sigs.k8s.io/sig-storage-lib-external-provisioner/v5/controller.test
sigs.k8s.io/sig-storage-lib-external-provisioner/controller/metrics

=> controller.test (which we don't use) pulls in the unversioned sigs.k8s.io/sig-storage-lib-external-provisioner here: https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/7d9e8b4c678a803070309639110a4494a4010efe/controller/controller_test.go#L26

Not sure how that works; perhaps `go test` simply maps that to the checked-out module source code. Will fix it.
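Presumably the fix is just to use the versioned import path in controller_test.go, along these lines:

```go
// Before: resolves to the unversioned (v4-era) module and drags it into go.mod.
import "sigs.k8s.io/sig-storage-lib-external-provisioner/controller/metrics"

// After: stays within the v5 module that the rest of the code uses.
import "sigs.k8s.io/sig-storage-lib-external-provisioner/v5/controller/metrics"
```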
Thanks! Since we're not actually using it here, we don't need to wait for a fix.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: msau42, pohly
/kind bug
The controller.test binary and the example still imported the lib without the /v5 suffix. This resulted in spurious "sigs.k8s.io/sig-storage-lib-external-provisioner v4.1.0+incompatible // indirect" go.mod entries in projects using the lib (kubernetes-csi/external-provisioner#405 (comment)). The example needs to use the sources that it was checked out with. Otherwise it is impossible to change the API and the example in a single commit, and there is a risk that breaking changes go undetected because the example continues to build with the unmodified lib.
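One common way to make the example build against the checked-out sources is a `replace` directive in the example's own go.mod, roughly like this (module path and relative location are assumptions):

```
module example

require sigs.k8s.io/sig-storage-lib-external-provisioner/v5 v5.0.0

// Use the lib sources this example was checked out with, so that API
// changes and the example can be updated in a single commit.
replace sigs.k8s.io/sig-storage-lib-external-provisioner/v5 => ../..
```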
/lgtm
What type of PR is this?
What this PR does / why we need it:
This makes sense under some specific circumstances:
- the volume is supposed to be created only in a certain topology segment
- that segment is chosen by the pod scheduler via late binding
- the CSI driver supports topology
- the CSI driver reports that it ran out of storage
Previously, external-provisioner would keep retrying to create the
volume instead of notifying the scheduler to pick a node anew.
Which issue(s) this PR fixes:
Related-to: kubernetes/kubernetes#72031
Special notes for your reviewer:
One downside of this change is that if the conditions above are met and some other final error occurs, rescheduling is also triggered. A richer API for the sig-storage-lib-external-provisioner library would be needed to avoid this.
It's okay to treat ResourceExhausted as a final error; the CSI spec explicitly describes this case. Previously it was treated as a non-final error merely because retrying made more sense (resources might become available).
The check for "strict topology" was added because without strict topology, a CSI driver is already allowed to pick some other topology segment than the one preferred for the selected node, and it seems to make little sense to do another scheduling round if no topology segment has enough space.
Does this PR introduce a user-facing change?: