unset selected node when storage is exhausted for topology segment #405

Merged
4 commits merged into kubernetes-csi:master on Apr 1, 2020

Conversation

@pohly (Contributor) commented Feb 13, 2020

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

This makes sense under some specific circumstances:

  • volume is supposed to be created only in a certain topology segment
  • that segment is chosen via the pod scheduler via late binding
  • the CSI driver supports topology
  • the CSI driver reports that it ran out of storage

Previously, external-provisioner would keep retrying to create the
volume instead of notifying the scheduler to pick a node anew.

Which issue(s) this PR fixes:
Related-to: kubernetes/kubernetes#72031

Special notes for your reviewer:

One downside of this change is that if the conditions above are met
and some other final error occurs, rescheduling is also triggered. A
richer API in the sig-storage-lib-external-provisioner library would
be needed to avoid this.

It's okay to treat ResourceExhausted as a final error; the CSI spec
explicitly describes this case. Previously it was treated as a non-final
error merely because retrying made more sense (resources might become
available).

The check for "strict topology" was added because without it, a CSI driver should already go ahead and pick some other topology segment than the one preferred for the selected node. It seems to make little sense to do another scheduling round if no topology segment has enough space.

Does this PR introduce a user-facing change?:

If a CSI driver returns ResourceExhausted for CreateVolume, supports topology, the volume uses late binding, and the driver has requested strict topology, then external-provisioner now asks for rescheduling of the pod that triggered the volume creation. This may then result in a retry with a different selected node.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 13, 2020
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 13, 2020
Comment on lines 1192 to 1198
case codes.ResourceExhausted: // CSI: operation not pending, "Unable to provision in `accessible_topology`"
    if mayReschedule {
        // may succeed elsewhere -> give up for now
        return true
    }
    // may still succeed at a later time -> continue
    return false
@jsafrane (Contributor) commented Mar 3, 2020
This is an unfortunate case of mixing gRPC transport error codes with application / CSI errors.

The Go gRPC implementation returns ResourceExhausted when it processes messages that are too big (both when sending and when receiving). That is the logic behind the current code: the operation itself may be in progress or may even have finished successfully (a too-big success response may simply have been thrown away), so the provisioner should retry.

Looking at the code, the gRPC max message size is sometimes defaulted to 4 KiB.

When a CSI driver intentionally returns ResourceExhausted because it ran out of space on the storage backend, the error is final and we can be 100% sure that the volume is not being provisioned. Rescheduling the pod / PVC to a different node is correct.

The questions are:

  • How to recognize who sent the error. IMO it's impossible to distinguish the CSI driver from the gRPC implementation.
  • Whether we can assume that a ResourceExhausted returned by the gRPC implementation is safe not to retry.
    • If the CreateVolumeResponse / CreateVolumeRequest was too big, retrying should yield the same result and one of the messages is going to be too big again. Retrying is useless and there is a leaked volume. Trying on a different node won't do any harm.

@pohly (Contributor, Author):

This is a problem in the CSI spec, right? Instead of re-using gRPC status codes, it would have been better to pick new ones to avoid such ambiguity.

I agree that it is impossible to determine the sender. If sending the request already fails, then trying elsewhere is safe and in fact the only way the volume creation might still (theoretically) succeed, because retrying locally would just send the same message over and over again.

When receiving the reply fails with ResourceExhausted, the situation is worse in the sense that the volume probably has been provisioned, external-provisioner just never gets to know that. Retrying won't help much here either, unless an admin steps in and reconfigures the external-provisioner with larger max message sizes.

I think we should bring this up as a problem with the spec. But given the spec as it is right now, avoiding a gRPC-generated ResourceExhausted by using large enough message size limits, and then assuming that the error really comes from the CSI driver, seems viable to me.
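For reference, the message size limits are a client-side gRPC dial option. A minimal, generic grpc-go sketch (the endpoint string and the 64 MiB limit are made-up example values; external-provisioner's actual connection setup lives in csi-lib-utils and may differ):

package main

import (
    "log"

    "google.golang.org/grpc"
)

func main() {
    // Example value only: large enough that a big CreateVolumeRequest or
    // CreateVolumeResponse does not trigger a transport-level ResourceExhausted.
    const maxMsgSize = 64 * 1024 * 1024

    conn, err := grpc.Dial(
        "/csi/csi.sock", // placeholder CSI endpoint
        grpc.WithInsecure(),
        grpc.WithDefaultCallOptions(
            grpc.MaxCallRecvMsgSize(maxMsgSize),
            grpc.MaxCallSendMsgSize(maxMsgSize),
        ),
    )
    if err != nil {
        log.Fatalf("connect: %v", err)
    }
    defer conn.Close()
}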

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 24, 2020
sig-storage-lib-external-provisioner v5.0.0 switched to Go modules,
therefore we have to adapt the import path.
This makes sense under some specific circumstances:
- volume is supposed to be created only in a certain topology segment
- that segment is chosen via the pod scheduler via late binding
- the CSI driver supports topology
- the CSI driver reports that it ran out of storage

Previously, external-provisioner would keep retrying to create the
volume instead of notifying the scheduler to pick a node anew.

It's okay to treat ResourceExhausted as a final error, the CSI spec
explicitly describes this case. However, it could also come from the
gRPC transport layer, and thus previously it was treated as a non-final
error merely because retrying made more sense (resources might become
available, except when the root cause is "message size exceeded", which
is unlikely to change).
@pohly (Contributor, Author) commented Mar 24, 2020

@jsafrane I've updated this to use sig-storage-lib-external-provisioner v5.0.0. I also filed an issue against the CSI spec about the RESOURCE_EXHAUSTED ambiguity and added that as a comment in the code.

What's missing is a local test in pkg/controller for the new code path. Do we need one, and if so, any suggestions for a good place to put it?

E2E testing (kubernetes/kubernetes#88114) will cover this, too.
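As an illustration of what such a local test could look like, here is a table-driven sketch against the illustrative giveUpAndReschedule helper from the sketch near the top of this PR description; it is not the test that was eventually added to pkg/controller, and the real checkError/ProvisionExt signatures differ:

package sketch

import (
    "testing"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

func TestGiveUpAndReschedule(t *testing.T) {
    testcases := map[string]struct {
        err           error
        mayReschedule bool
        want          bool
    }{
        "out of space, late binding and topology": {
            err:           status.Error(codes.ResourceExhausted, "no space in segment"),
            mayReschedule: true,
            want:          true, // give up so the scheduler picks another node
        },
        "out of space, rescheduling not possible": {
            err:           status.Error(codes.ResourceExhausted, "no space in segment"),
            mayReschedule: false,
            want:          false, // keep retrying, space may become available
        },
        "transient error": {
            err:           status.Error(codes.DeadlineExceeded, "timed out"),
            mayReschedule: true,
            want:          false, // not final, retry as before
        },
    }
    for name, tc := range testcases {
        t.Run(name, func(t *testing.T) {
            if got := giveUpAndReschedule(tc.err, tc.mayReschedule); got != tc.want {
                t.Errorf("giveUpAndReschedule = %v, want %v", got, tc.want)
            }
        })
    }
}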

@pohly pohly changed the title WIP: unset selected node when storage is exhausted for topology segment unset selected node when storage is exhausted for topology segment Mar 24, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2020
@msau42 (Collaborator) left a comment:

// only makes sense if:
// - The CSI driver supports topology: without that, the next CreateVolume call after
// rescheduling will be exactly the same.
// - We are using strict topology: otherwise the CSI driver is already allowed
@msau42 (Collaborator):

I think we can remove the check for strict topology. I'm not aware of any drivers that retry provisioning on subsequent topologies in the list if the first fails.

@pohly (Contributor, Author):

Okay.

p.strictTopology &&
options.SelectedNode != nil
state := checkError(err, mayReschedule)
klog.V(5).Infof("CreateVolume failed, supports topology = %v, strict topology %v, node selected %v => may reschedule = %v => state = %v: %v",
@msau42 (Collaborator):

should this be a warning?

@pohly (Contributor, Author):

I think a warning is too strong because this is a situation that is now supposed to be handled automatically, without admin attention. The log entry is really just information that this is happening.

Even drivers which did not explicitly ask for strict topology may
benefit from rescheduling (kubernetes-csi#405 (comment)).
The full set of return values of ProvisionExt and the code paths in
checkError are covered by the new test.

While at it, klog logging gets enabled and a cut-and-paste error for
the temp directory name gets fixed.
@pohly (Contributor, Author) left a comment:

Can this unit test be extended? https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller_test.go#L2473

I ended up creating a new test specifically for error handling. ProvisionExt and its state return value had not been covered at all before.

@msau42: please have another look.

p.strictTopology &&
options.SelectedNode != nil
state := checkError(err, mayReschedule)
klog.V(5).Infof("CreateVolume failed, supports topology = %v, strict topology %v, node selected %v => may reschedule = %v => state = %v: %v",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a warning is too strong because it is a situation that now is supposed to be handled automatically without admin attention. The log entry is really just information that this is happening.

// only makes sense if:
// - The CSI driver supports topology: without that, the next CreateVolume call after
// rescheduling will be exactly the same.
// - We are using strict topology: otherwise the CSI driver is already allowed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

@pohly (Contributor, Author) commented Mar 31, 2020

/assign @msau42

@msau42 (Collaborator) left a comment:

/approve

Just have one question about the dependencies

k8s.io/apiserver v0.17.0
k8s.io/client-go v0.17.0
k8s.io/component-base v0.17.0
k8s.io/csi-translation-lib v0.17.0
k8s.io/klog v1.0.0
k8s.io/kubernetes v1.14.0
sigs.k8s.io/sig-storage-lib-external-provisioner v4.1.0+incompatible
sigs.k8s.io/sig-storage-lib-external-provisioner v4.1.0+incompatible // indirect
@msau42 (Collaborator):

This is kind of strange, we have dependencies on both versions?

@pohly (Contributor, Author):

I had noticed that too, but ignored it because we never actually pull in any code from it (it's not in vendor/).

This might be a bug in sig-storage-lib-external-provisioner:

$ go mod why -m sigs.k8s.io/sig-storage-lib-external-provisioner
# sigs.k8s.io/sig-storage-lib-external-provisioner
github.com/kubernetes-csi/external-provisioner/cmd/csi-provisioner
sigs.k8s.io/sig-storage-lib-external-provisioner/v5/controller
sigs.k8s.io/sig-storage-lib-external-provisioner/v5/controller.test
sigs.k8s.io/sig-storage-lib-external-provisioner/controller/metrics

=> controller.test (which we don't use) pulls in the unversioned sigs.k8s.io/sig-storage-lib-external-provisioner here: https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/7d9e8b4c678a803070309639110a4494a4010efe/controller/controller_test.go#L26

Not sure how that works; perhaps go test simply maps that to the checked out module source code. Will fix it.
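For completeness, the fix on the library side is simply to use the versioned module path in the remaining imports; a sketch along the lines of the go mod why output above (assuming the metrics package is present under the /v5 path):

package example

// The unversioned path
// "sigs.k8s.io/sig-storage-lib-external-provisioner/controller/metrics"
// resolves to the old v4.1.0+incompatible module and drags it in as an
// indirect dependency; the versioned path stays within the v5 module.
import _ "sigs.k8s.io/sig-storage-lib-external-provisioner/v5/controller/metrics"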

@msau42 (Collaborator):

Thanks! Since we're not actually using it here, we don't need to wait for it.

@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 1, 2020
@msau42 (Collaborator) commented Apr 1, 2020

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 1, 2020
pohly added a commit to pohly/sig-storage-lib-external-provisioner that referenced this pull request Apr 1, 2020
The controller.test binary and the example still imported without the
/v5 suffix. This resulted in spurious
"sigs.k8s.io/sig-storage-lib-external-provisioner v4.1.0+incompatible
// indirect" go.mod entries in projects using the
lib (kubernetes-csi/external-provisioner#405 (comment)).

The example needs to use the sources that it was checked out
with. Otherwise it is impossible to change the API and the example in
a single commit and/or there is a risk that breaking changes go
undetected because the example continues to build with the unmodified
lib.
@msau42 (Collaborator) commented Apr 1, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 1, 2020
@k8s-ci-robot k8s-ci-robot merged commit 309b38a into kubernetes-csi:master Apr 1, 2020
kbsonlong pushed a commit to kbsonlong/external-provisioner that referenced this pull request Dec 29, 2023
…go_modules/k8s.io/api-0.26.1

Bump k8s.io/api from 0.26.0 to 0.26.1