WIP: initial provisioning rollback KEP #1703

Closed
wants to merge 1 commit

Conversation


@pohly (Contributor) commented Apr 20, 2020

Currently, pods can get stuck when they use multiple volumes with late binding. The case with a single volume is handled by rescheduling the pod, but that approach can fail when some volumes have already been provisioned and provisioning of the remaining ones then fails.

The proposal is to roll back the provisioning in that case to unblock rescheduling.
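For illustration, a minimal sketch of the scenario under discussion; all names, the storage class, and the driver below are made up, and both claims use late binding (WaitForFirstConsumer). If `data-a` gets provisioned for some node but `data-b` then cannot be provisioned there, the pod cannot simply be rescheduled elsewhere without undoing the first volume.

```yaml
# Hypothetical setup: one pod, two late-binding volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: late-binding-example          # made-up name
provisioner: csi.example.com          # made-up CSI driver
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-a
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: late-binding-example
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-b
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: late-binding-example
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - { name: a, mountPath: /data-a }
    - { name: b, mountPath: /data-b }
  volumes:
  - name: a
    persistentVolumeClaim: { claimName: data-a }
  - name: b
    persistentVolumeClaim: { claimName: data-b }
```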

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 20, 2020
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
To complete the pull request process, please assign msau42
You can assign the PR to them by writing /assign @msau42 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Apr 20, 2020
picking some other node, the library can treat unused volumes as "not
provisioned yet" and set a new selected node for them.

The external-provisioner then needs to de-provision those PVCs which are:
Member:

How do we signal to external-provisioner to delete these PVs? Normally it's signaled when a user deletes a PVC object, but I don't think we can delete the PVC object here.

Author (@pohly):

When the tentatively provisioned volume is no longer accessible from the currently selected node.
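A sketch of how that situation looks in API objects, with made-up names, driver, and topology key: the tentatively provisioned PV is only reachable from one zone, while after rescheduling the claim's selected-node annotation points at a node outside that zone, so the volume no longer meets the requirements and would have to be de-provisioned.

```yaml
# Hypothetical illustration (driver, names, and topology key are made up).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0a1b2c3d
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: csi.example.com
    volumeHandle: vol-0a1b2c3d
  nodeAffinity:                        # accessible topology of the volume
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.example.com/zone
          operator: In
          values: ["zone-a"]           # only reachable from zone-a
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-b
  annotations:
    # written during late binding; after rescheduling it points at a node
    # in a different zone than the one the volume was provisioned for
    volume.kubernetes.io/selected-node: worker-zone-b-1
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: late-binding-example
  resources:
    requests:
      storage: 100Gi
```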


## Proposal

The proposal is to mark provisioned PVCs as "unused" via a label until
Member:

Who sets this label to "used"? How do we ensure there aren't races between the time we check the label and when we go to delete the PV?

Author (@pohly):

Perhaps we can solve this entirely in the external-provisioner: in addition to the "selected node" label that it already gets, it also needs to know which pod(s) triggered the provisioning so that it can group several PVCs together.

Then we need an intermediate phase in which external-provisioner creates PVs but does not yet bind them to the PVCs. That's the "unused" state. Once all volumes have been created, external-provisioner finishes the binding.

But it will still be tricky to update several objects together atomically 😢

Author (@pohly):

Scratch that: there is more than one external-provisioner involved, so this is the wrong place.

A better place is the volume scheduling code. That's where kube-scheduler checks whether a pod can move ahead; we can block that until all volumes have been provisioned. Before giving its okay, that code can also remove the "unused" flag.


The proposal is to mark provisioned PVCs as "unused" via a label until
a pod actually starts to use them. Another label stores the node that
the PVC was provisioned for.
Member:

We already have "selected-node" on the PVC.

Author (@pohly):

Yes, but that's the label that describes the desired state, and when we reschedule, that will change.

My intention was to add another label "provisioned-node" and then compare that against "selected-node" to detect when a volume no longer meets the requirements. But that may be too strict. A better solution would be to use the accessible topology of the volume to determine whether it needs to be recreated: if the currently selected node has no access to the volume, then (and only then) do we delete it and try again.
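A minimal sketch of that comparison; the "provisioned-node" label key below is purely hypothetical and not an existing API, while the selected-node annotation is the one already written for late binding:

```yaml
# Hypothetical: provisioned for node-1, but the desired state now says node-2.
# A mismatch like this (or, better, a failed accessible-topology check as
# described above) would trigger rollback of the volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-b
  annotations:
    volume.kubernetes.io/selected-node: node-2               # desired state after rescheduling
  labels:
    provisioning-rollback.example.com/provisioned-node: node-1   # hypothetical new label
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: late-binding-example
  resources:
    requests:
      storage: 100Gi
```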

- Only CSI drivers using an up-to-date external-provisioner will be
supported.

## Proposal
Member:

I would prefer having some stronger guarantee on provisioning and rolling back a group of volumes atomically to avoid races that could result in deletion of a volume in use.

So something that can prevent a Pod from using a volume before all the volumes are successfully provisioned. Maybe something like taints can help us here.

Author (@pohly):

> I would prefer having some stronger guarantee on provisioning and rolling back a group of volumes atomically

As multiple different storage systems are involved, it will have to be Kubernetes which treats the set of volumes as something that is kept unused until all volumes are ready.

> avoid races that could result in deletion of a volume in use

I don't think the current proposal would have led to deleting a volume that is in use, because once it is used, it won't be deleted anymore. But another pod starting to use one of the volumes is a potential issue, because that can then prevent rescheduling of the original pod that triggered the provisioning.

> So something that can prevent a Pod from using a volume before all the volumes are successfully provisioned. Maybe something like taints can help us here.

Yes, something like that or another label.
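To make the "keep the group unused until everything is provisioned" idea a bit more concrete, here is a sketch with entirely hypothetical label keys; whichever component knows that all volumes of the pod are ready (for example the volume scheduling code mentioned earlier) would be the one to clear the marker:

```yaml
# Hypothetical marker scheme, not an existing API: every PVC that belongs to
# the same pod carries the same group id plus an "unused" marker. No pod would
# be allowed to use the volume while the marker is present, and the marker is
# cleared only once the whole group has been provisioned.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-a
  labels:
    provisioning-rollback.example.com/group: app-7f3c    # hypothetical group id
    provisioning-rollback.example.com/unused: "true"     # hypothetical marker
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: late-binding-example
  resources:
    requests:
      storage: 100Gi
```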

@cofyc (Member), May 28, 2020:

> I would prefer having some stronger guarantee on provisioning and rolling back a group of volumes atomically
>
> As multiple different storage systems are involved, it will have to be Kubernetes which treats the set of volumes as something that is kept unused until all volumes are ready.

As for static binding, we need to reserve multiple volumes for later use too (kubernetes/kubernetes#73263). It's a special case in which the volumes are already provisioned and don't need provisioning rollback.

@saad-ali (Member)

Meeting Notes

  • As soon as the scheduler makes a decision, the AvailableCapacity per StorageClass reported in the StoragePool object will be incorrect; how do you account for that?
    • The scheduler will just retry if scheduling fails.
  • Even if the scheduler retries, won't it be stuck if the first volume provisioning decision it makes is inefficient?
    • TODO - Yes, this is still an issue, the scheduler would be stuck -- need to think of a solution to this.
  • Strawman Proposal:
    1. Reserve space by actually creating the volume.
    2. Don't make a volume available for use until provisioning of all volumes is complete.
    3. Delete and recreate a previously created volume, if needed, for more efficient scheduling.
  • How common is it to have multiple volumes from different storage systems?
    • Pretty common for Local Volumes -- disks from different storage classes (e.g. SSD or HDD)
  • Challenges?
    • Swapping PVCs may be easy, but rollback may be challenging.
  • Next steps:
    • Ensure we have a complete flow we are fairly confident will work.

@cofyc (Member) commented May 28, 2020

Do we try to roll back the PV/PVC binding, or bind multiple PV/PVC objects together? That would solve this problem too. If possible, we can solve the static binding and dynamic provisioning problems together.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 26, 2020
@pohly commented Aug 26, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 26, 2020
@kikisdeliveryservice (Member) left a comment:

I understand this is a WIP but wanted to note it is missing a kep.yaml

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2020
@pohly commented Jan 2, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2021
@pohly commented Apr 2, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2021

## Alternatives

Instead of allocating volumes, space could be merely reserved in


In Bare-metal CSI we decided to go with the reservation approach.

@k8s-triage-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 28, 2021
@pohly commented Aug 2, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2021
@pohly commented Nov 2, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 2, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 2, 2022
@pohly commented Mar 2, 2022

/remove-lifecycle stale

@alculquicondor (Member)

Is this being actively reviewed? How does it play with GA graduation?

@pohly commented Mar 4, 2022

This is a problem for all volumes with "wait for first consumer" provisioning mode, whether they use storage capacity tracking or not.

Interest in solving this seems to be low; therefore, this KEP has not progressed beyond the initial "we could solve it like this" phase. That's probably because it doesn't really affect many users.

@mishoyama

> This is a problem for all volumes with "wait for first consumer" provisioning mode, whether they use storage capacity tracking or not.
>
> Interest in solving this seems to be low; therefore, this KEP has not progressed beyond the initial "we could solve it like this" phase. That's probably because it doesn't really affect many users.

I'm very interested in having this issue addressed.

@pohly commented Mar 5, 2022

@mishoyama: Is that because you have pods with multiple volumes and these volumes are dynamically provisioned with "wait for first consumer"? Can you describe your use case a bit? For example, how long are pods running?

@mishoyama

> @mishoyama: Is that because you have pods with multiple volumes and these volumes are dynamically provisioned with "wait for first consumer"? Can you describe your use case a bit? For example, how long are pods running?

@pohly yes, we have multiple volumes per pod with "wait for first consumer" binding mode. This is a storage system, and pod lifetime is not limited.

@pohly commented Mar 14, 2022

What's the advantage of having multiple volumes per pod instead of just one? Do they come from different storage providers?

Are you perhaps interested in working on this? The SIG Storage community meeting would be a good place to start discussing who could work on this and when.

@mishoyama

> What's the advantage of having multiple volumes per pod instead of just one? Do they come from different storage providers?
>
> Are you perhaps interested in working on this? The SIG Storage community meeting would be a good place to start discussing who could work on this and when.

We might have up to 100 disks attached to the physical node. Handling IO requests in a single pod allows us to manage the related resources more effectively.

At the moment we use a scheduler extender to solve the storage capacity problem, but I'm still interested in working on this.

@pohly commented Apr 7, 2022

/close

I don't have time to work on this anytime soon. If someone wants to pick it up, please reach out to SIG Storage.

@k8s-ci-robot

@pohly: Closed this PR.

In response to this:

> /close
>
> I don't have time to work on this anytime soon. If someone wants to pick it up, please reach out to SIG Storage.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
10 participants