
Take dataSource topology into account when scheduling a pod using unbound WFFC storage #107479

Closed
awels opened this issue Jan 11, 2022 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@awels
Contributor

awels commented Jan 11, 2022

What happened?

I have some dynamically provisioned storage that uses WaitForFirstConsumer (WFFC) volume binding, like the included csi-hostpath driver. I have a multi-node cluster. When I attempt to do a CSI clone using that storage, the scheduler does not take the topology of the dataSource into account when scheduling the pod. Example:

  • I have 3 nodes.
  • I have 1 bound volume (source) on node01.
  • I create a new PVC with a dataSource of source (a sample manifest is sketched after this list).
  • The scheduler schedules the pod on node02, and the clone cannot succeed because the source is on a different node.
  • The same thing happens if my dataSource is a snapshot of a volume on a different node than where the pod is scheduled.
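
For concreteness, the clone PVC from the third bullet would look roughly like this. This is a minimal sketch; the PVC name and storage class are assumptions, not taken from the report:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc                     # hypothetical name
spec:
  storageClassName: csi-hostpath-sc    # hypothetical WFFC storage class
  dataSource:
    kind: PersistentVolumeClaim
    name: source                       # the PVC already bound on node01
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF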

What did you expect to happen?

I am expecting the VolumeZone plugin to take the topology of the dataSource of a PVC into account when filtering nodes. A snapshot restore or CSI clone cannot succeed if the source volume doesn't exist in the same topology as the node. I am fairly certain the offending piece of code is here, where it just skips checking unbound WFFC PVCs.
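
One way to see the mismatch, assuming illustrative object names (source PVC "source", pod "clone-pod"): the topology that should constrain scheduling is already recorded on the source PV, but it is not consulted for the unbound clone PVC.

$ kubectl get pv $(kubectl get pvc source -o jsonpath='{.spec.volumeName}') -o jsonpath='{.spec.nodeAffinity}'
# -> node affinity pinning the source volume to node01
$ kubectl get pod clone-pod -o jsonpath='{.spec.nodeName}'
# -> node02, chosen without looking at the dataSource topology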

How can we reproduce it (as minimally and precisely as possible)?

  • Create a multi-node cluster
  • Use the csi-hostpath provisioner in a distributed fashion
  • Create a PVC, and run a pod to bind the PVC to a PV on a node
  • Create another PVC with a dataSource that is the PVC from the previous step
  • Create a pod that uses the new PVC; it may or may not get scheduled on the same node. If it is scheduled on a different node, the pod will remain Pending because the CSI clone will fail (see the verification commands after this list).
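
A quick way to confirm the failure mode in the last step (the pod and PVC names here are assumptions):

$ kubectl get pod clone-pod -o wide      # shows which node the scheduler picked
$ kubectl describe pvc cloned-pvc        # clone/provisioning errors should surface in Events if the nodes differ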

A setup similar to the above, with the added intermediate step of taking a snapshot of the first PVC and setting the dataSource to that snapshot, will yield the same result. You will need a version of the csi-snapshotter sidecar/controller that includes kubernetes-csi/external-snapshotter#585 so the snapshots are properly created on the right node. This will also label the VolumeSnapshotContent to include the node name.
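
A sketch of that snapshot variant, assuming a VolumeSnapshotClass for the same CSI driver (all names are illustrative):

$ cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: source-snap                                 # hypothetical name
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass   # hypothetical class
  source:
    persistentVolumeClaimName: source
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc                       # hypothetical name
spec:
  storageClassName: csi-hostpath-sc        # hypothetical WFFC storage class
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: source-snap
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF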

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
NAME=Fedora
VERSION="33 (Thirty Three)"
ID=fedora
VERSION_ID=33
VERSION_CODENAME=""
PLATFORM_ID="platform:f33"
PRETTY_NAME="Fedora 33 (Thirty Three)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:33"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f33/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=33
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=33
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"

$ uname -a
Linux awels.localdomain 5.14.18-100.fc33.x86_64 #1 SMP Fri Nov 12 17:38:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

CSI Snapshotter 5.0.0 or newer: https://github.com/kubernetes-csi/external-snapshotter/releases/tag/v5.0.0
@awels awels added the kind/bug Categorizes issue or PR as related to a bug. label Jan 11, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 11, 2022
@awels
Contributor Author

awels commented Jan 11, 2022

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 11, 2022
@awels
Contributor Author

awels commented Jan 11, 2022

Not sure if storage or scheduling is the right SIG, but this is mainly a storage issue, so I am setting it to sig/storage for now. I am more than happy to supply a PR to fix this; I just want some confirmation that I am looking at the right thing for this particular issue. Basically, the nodes need to be filtered down based on the dataSource of an unbound WFFC PVC, instead of just assuming it can go anywhere.

@xing-yang
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2022
@awels
Contributor Author

awels commented May 23, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2022
@zjm232

zjm232 commented Aug 21, 2022 via email

@awels
Contributor Author

awels commented Aug 21, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 19, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 3, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@awels
Contributor Author

awels commented Jan 19, 2024

Going to close this issue as I found an acceptable workaround. It is technically still a problem, though.

@awels awels closed this as completed Jan 19, 2024