
Volume should not be provisioned on the nodes like master node which are marked as noScheduled #47

Closed
w3aman opened this issue Feb 12, 2020 · 11 comments · Fixed by #101

w3aman (Contributor) commented Feb 12, 2020

What happened:

  • A zpool was created on all of the worker nodes, where application pods can be scheduled.
  • When the volume was provisioned, the PV was scheduled on the master node, which is generally marked as a NoSchedule node. Because the PV was created on the master node, the application pod remains in the Pending state as it never gets its volume.
  • FYI: the allowedTopologies field in the StorageClass spec (the list of nodes where a zpool is present) was intentionally not used, since all the worker nodes have zpools.

What you expected to happen:

  • The PV should not be created on master nodes. For that, the ZFS provisioner should be aware of information such as the NoSchedule taint (sketched below).
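
For context, a minimal sketch of the kind of taint involved, as it typically appears under the master Node's spec (the key may differ depending on how the cluster was provisioned):

```yaml
# Illustrative fragment of a Node object: a kubeadm-style master taint.
# Pods (and, before this fix, volumes) should not land on a node carrying it
# unless they tolerate the taint.
spec:
  taints:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
```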
@w3aman w3aman changed the title Volume should not be provisioned on the nodes which are marked as noScheduled Volume should not be provisioned on the nodes like master node which are marked as noScheduled Feb 12, 2020
freym commented Feb 17, 2020

Hi, we have the same problem.
Is it possible to use another node label than the hard-coded label "kubernetes.io/hostname" under allowedTopologies for a StorageClass?
For example "node.kubernetes.io/node" or some custom labels.

pawanpraka1 (Contributor) commented Feb 17, 2020

@freym, thanks for trying out zfs-localpv. Could you please explain a bit more about your use case and why you want to use a different key?

Right now we support only one topology key, "kubernetes.io/hostname". We can avoid provisioning the PV on the master in the following two ways:

  1. Since the OpenEBS ZFS driver does not know anything about taints and tolerations, the master is a valid candidate for provisioning the PV. We can use allowedTopologies in the StorageClass to list all the valid nodes which have a ZPOOL configured:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-zfspv
parameters:
  fstype: "zfs"
  poolname: "zfspv-pool"
provisioner: zfs.csi.openebs.io
allowedTopologies:
- matchLabelExpressions:
  - key: kubernetes.io/hostname
    values:
      - worker-1
      - worker-2
```
  2. Another way to solve this is to use WaitForFirstConsumer as the volumeBindingMode. In this case the Kubernetes scheduler will take care of scheduling the pod and will consider all the taint and toleration settings. Once the scheduling decision has been made, it will ask the ZFS driver to provision the PV, and the driver will provision the PV on the node selected for scheduling:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-zfspv
parameters:
  fstype: "zfs"
  poolname: "zfspv-pool"
provisioner: zfs.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
```

You can use either of the above approaches to solve this issue.
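
For completeness, a minimal PVC sketch that would go with the WaitForFirstConsumer StorageClass above (claim name and size are illustrative): the claim stays Pending until a pod that uses it is scheduled, and only then is the PV created on the node the Kubernetes scheduler picked.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-zfspv
spec:
  storageClassName: openebs-zfspv
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
```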

@pawanpraka1 pawanpraka1 self-assigned this Feb 17, 2020
freym commented Feb 18, 2020

Yes, both workarounds help to solve the problem of avoiding provisioning the PV on the master. (The first solution is a bit cumbersome if you have a lot of hosts or the number of nodes changes constantly, e.g. when you use a dynamic node provisioner.)

We have two other use cases:

  1. We use multiple machines with different zpool configurations (SATA/SAS/SSD/NVMe only, or mixed pools). All machines are tagged with zpool/hdd_type = SATA/...
    You can schedule a pod with taints and tolerations and use WaitForFirstConsumer, but I think it would be easier for users if there were StorageClasses like SATA/SAS... or SLOW/FAST...

  2. Our Kubernetes cluster is stretched over two buildings (= zones). It's also easier to say "give me a PV with the StorageClass SATA_BUILDING1".

(I also just want to say that this project is really great and really helps here in our work 👍)

@pawanpraka1 pawanpraka1 transferred this issue from openebs/zfs-localpv Feb 18, 2020
@pawanpraka1 pawanpraka1 transferred this issue from openebs/openebs Feb 18, 2020
pawanpraka1 (Contributor) commented Mar 10, 2020

@freym sorry for the delay. After discussing it internally and with the CSI community, there are two things we can probably do:

  1. Change the key from "kubernetes.io/hostname" to a unique key that is not already present on every node. Since the "kubernetes.io/hostname" label is already present on the master node, the master is considered a candidate and the driver schedules the PV on that node. We will change the key to "openebs.io/nodename", so we no longer have to list the worker nodes when the master is not schedulable. In this case we have to update the StorageClass and change the key to "openebs.io/nodename" (see the sketch after this list).
  2. Second, to support the buildings/zpool_type use case, we can label the nodes however we want and mention those labels in the StorageClass. For this to work, we have to change our scheduler to accept any key rather than only the currently supported keys. WaitForFirstConsumer will not work here, as the topology key specified in the StorageClass is not supported by the driver; this limitation comes from the CSI provisioner itself.
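
A sketch of what the first option would look like in a StorageClass, assuming the key is changed to "openebs.io/nodename" as described (node names are illustrative). If the master never reports this topology key because the ZFS node plugin does not run there, it is excluded automatically even when allowedTopologies is omitted; if allowedTopologies is used, it has to reference the new key:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-zfspv
provisioner: zfs.csi.openebs.io
parameters:
  fstype: "zfs"
  poolname: "zfspv-pool"
allowedTopologies:
- matchLabelExpressions:
  - key: openebs.io/nodename
    values:
      - worker-1
      - worker-2
```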

@kmova kmova added the Need community involvement Needs community involvement on some action item. label Mar 28, 2020
@pawanpraka1 pawanpraka1 added this to the v0.7.0 milestone Apr 20, 2020
pawanpraka1 (Contributor) commented
@freym I have raised PR #94 to support custom keys. We can now label the nodes and then deploy the ZFS-LocalPV driver; the driver will support all the keys that the node has.
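
A minimal sketch of that flow, reusing the zpool/hdd_type label from the use case above (label, values, and node names are illustrative; the exact steps are in the FAQ linked in the next comment):

```yaml
# Label the nodes first, e.g.:
#   kubectl label node worker-1 zpool/hdd_type=SATA
# then deploy the ZFS-LocalPV driver and reference the same key here.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-zfspv-sata
provisioner: zfs.csi.openebs.io
parameters:
  fstype: "zfs"
  poolname: "zfspv-pool"
allowedTopologies:
- matchLabelExpressions:
  - key: zpool/hdd_type
    values:
      - SATA
```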

pawanpraka1 (Contributor) commented Apr 30, 2020

@freym The PR to support custom topology has been merged. You can follow the steps mentioned here to add a custom topology key: https://github.com/openebs/zfs-localpv/blob/master/docs/faq.md#6-how-to-add-custom-topology-key.
Let us know if you face any issues. Currently you can use the latest zfs-driver:ci image to try it out; we will ship it as part of the 0.7.0 release and it will have the zfs-driver:0.7.0 tag (due May 15th). Please note that if you are using a ZFS-LocalPV version lower than 0.6.0, you have to follow the upgrade steps here: https://github.com/openebs/zfs-localpv/tree/master/upgrade.

@w3aman We have fixed the issue (via #101) where we had to list all the worker nodes in the StorageClass to stop the ZFS-LocalPV driver from creating any PV on the master node. Now you don't need to mention the workers in the StorageClass or use WaitForFirstConsumer to solve this. Can you please try it out and see if that works?
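
For reference, a sketch of the simple StorageClass this fix targets: no allowedTopologies and the default Immediate binding mode, yet the tainted master should now be skipped (pool name is illustrative).

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-zfspv
provisioner: zfs.csi.openebs.io
parameters:
  fstype: "zfs"
  poolname: "zfspv-pool"
```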

@pawanpraka1 pawanpraka1 reopened this Apr 30, 2020
w3aman (Contributor, Author) commented Apr 30, 2020

Cool... I will try this out!
Thanks @pawanpraka1 for your work on this issue.

freym commented Apr 30, 2020

Thanks. I will try this as well. (I can't say when I'll get to it because of COVID-19.)

pawanpraka1 (Contributor) commented
Thanks @freym. I have a favor to ask of you: could you sign the adopters file for us and detail your use case a bit? Here is the adopters issue where a lot of users have already signed: openebs/openebs#2719.

w3aman (Contributor, Author) commented Apr 30, 2020

I tried this scenario with both the v0.6.x image and the ci image. Both times I kept the NoSchedule taint on the master node and used Immediate as the volume binding mode in the StorageClass. The ci image solved this issue: since the master had the NoSchedule taint, the volume was not provisioned there.

With v0.6.x:

```
NAME                                       ZPOOL   NODE              SIZE         VOLBLOCKSIZE   RECORDSIZE   FILESYSTEM   CREATIONTIME   AGE
pvc-17d977c6-3b89-4fe8-a1ba-0b0273bd0ee3   test    e2e2-dev-master   1073741824                  4k           zfs                         2m46s
pvc-bdd37896-fd1b-40ce-a614-6b83c9b41403   test    e2e2-dev-node3    1073741824                  4k           zfs                         76s
pvc-ca27fe42-90d4-430a-8301-8bd0c6031d7f   test    e2e2-dev-node1    1073741824                  4k           zfs                         44s
pvc-d9be6f4c-c6ce-4b4b-8f59-76599d4febfb   test    e2e2-dev-master   1073741824                  4k           zfs                         14s
pvc-f57ad11a-65cb-4558-aa1f-c3c9c6f39a02   test    e2e2-dev-node2    1073741824                  4k           zfs                         58s
```

But with the ci image, volumes were provisioned on the worker nodes only:

```
devuser@rack2:~/aman$ kubectl get zv -n openebs
NAME                                       ZPOOL   NODE        SIZE         VOLBLOCKSIZE   RECORDSIZE   FILESYSTEM   CREATIONTIME   AGE
pvc-1f729f0d-5d01-4dff-86d9-2b4fd9dc7fbc   test    k8s-node2   1073741824                  4k           zfs                         37m
pvc-3253a2c8-ce54-4a75-a029-9d610186eeaf   test    k8s-node3   1073741824                  4k           zfs                         37m
pvc-6814a63b-d566-4e1d-b1f4-cf8913ec21a9   test    k8s-node1   1073741824                  4k           zfs                         36m
pvc-cd62da32-0c4f-4b1d-945b-cad001da4bac   test    k8s-node1   1073741824                  4k           zfs                         38m
```

Note: here the zfs scheduler does volume-count based scheduling, so in a (1 master + 3 worker) cluster, provisioning 4 volumes would normally spread them across all 4 nodes one by one. But because of the taint, the master was not a suitable candidate for provisioning a volume, so the 4th volume went to node-1 again.

w3aman (Contributor, Author) commented May 4, 2020

The issue has been fixed and verified.
/close

@w3aman w3aman closed this as completed May 4, 2020