
Broken on 3 node clusters #109

Closed
rmkraus opened this issue Nov 6, 2020 · 24 comments

@rmkraus

rmkraus commented Nov 6, 2020

As of this commit, three-node clusters with GPU acceleration no longer seem to be supported. In a three-node cluster, all nodes carry both the master and worker labels. Since they have the master label, the NFD workers will not be scheduled. The comment on the commit indicates that the intent was to allow NFD to run on servers that may be tagged with roles other than worker, but the effect of the change is broader than that.

Might I suggest running the NFD worker on the worker nodes by default and allowing the nodeSelector to be set in the spec of the NodeFeatureDiscovery resource?
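
For illustration only, a minimal sketch of what that could look like if the NodeFeatureDiscovery custom resource exposed such a field; the apiVersion, namespace, and the workerNodeSelector field name below are hypothetical, not the operator's current API:

apiVersion: nfd.openshift.io/v1alpha1   # illustrative; check the installed CRD
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: gpu-operator-resources
spec:
  # Hypothetical field: selector applied to the nfd-worker DaemonSet,
  # defaulting to the worker role but overridable for combined
  # master/worker nodes in a three-node cluster.
  workerNodeSelector:
    node-role.kubernetes.io/worker: ""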

@rmkraus
Author

rmkraus commented Nov 6, 2020

I created my own nfd-worker DaemonSet and was able to work around the issue. Obviously not a great permanent fix, though:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: nfd-worker-new-hotness
  namespace: gpu-operator-resources
  labels:
    app: nfd-worker
spec:
  selector:
    matchLabels:
      app: nfd-worker
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nfd-worker
    spec:
      restartPolicy: Always
      serviceAccountName: nfd-worker
      schedulerName: default-scheduler
      hostNetwork: true
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - resources: {}
          terminationMessagePath: /dev/termination-log
          name: nfd-worker
          command:
            - nfd-worker
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          securityContext:
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
          imagePullPolicy: Always
          volumeMounts:
            - name: host-boot
              readOnly: true
              mountPath: /host-boot
            - name: host-os-release
              readOnly: true
              mountPath: /host-etc/os-release
            - name: host-sys
              mountPath: /host-sys
            - name: config
              mountPath: /etc/kubernetes/node-feature-discovery
            - name: nfd-hooks
              mountPath: /etc/kubernetes/node-feature-discovery/source.d
            - name: nfd-features
              mountPath: /etc/kubernetes/node-feature-discovery/features.d
          terminationMessagePolicy: File
          image: 'quay.io/openshift/origin-node-feature-discovery:latest'
          args:
            - '--sleep-interval=60s'
            - '--server=$(NFD_MASTER_SERVICE_HOST):$(NFD_MASTER_SERVICE_PORT)'
      serviceAccount: nfd-worker
      volumes:
        - name: host-boot
          hostPath:
            path: /boot
            type: ''
        - name: host-os-release
          hostPath:
            path: /etc/os-release
            type: ''
        - name: host-sys
          hostPath:
            path: /sys
            type: ''
        - name: nfd-hooks
          hostPath:
            path: /etc/kubernetes/node-feature-discovery/source.d
            type: ''
        - name: nfd-features
          hostPath:
            path: /etc/kubernetes/node-feature-discovery/features.d
            type: ''
        - name: config
          configMap:
            name: nfd-worker
            items:
              - key: nfd-worker-conf
                path: nfd-worker.conf
            defaultMode: 420
      dnsPolicy: ClusterFirst
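      # Tolerate every NoSchedule taint (including node-role.kubernetes.io/master)
      # so this worker pod is not repelled from combined master/worker nodes.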
      tolerations:
        - operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  revisionHistoryLimit: 10

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

I think we can do the following to support both configurations.

  • multiple nodeSelectorTerms: the pod can be scheduled if one of the terms can be satisfied
  • multiple matchExpressions: the pod can be scheduled only if all expressions are satisfied

Since we're tolerating all taints, we need to exclude masters but still allow scheduling on workers, whether or not they carry the worker label.

      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
              - key:  node-role.kubernetes.io/master
                operator: DoesNotExist
          nodeSelectorTerms:
            - matchExpressions:
              - key:  node-role.kubernetes.io/worker
                operator: Exists
      

If we have a three-node cluster where the nodes have both the worker and master labels, the first term would fail but the second would be valid -> the pod gets scheduled on the node.

If we have three masters with the master label, the first term would repel the pod from the master(s), and nodes with the worker label would get it because the second term is valid.

If we have three masters with the master label, the first term would repel the pod from the master(s); the second term will fail on workers that do not have a worker label, but on nodes without any label the first term would be satisfied, and hence the pod would be scheduled on those nodes.
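
For reference, a minimal sketch of the node labels these scenarios assume (values illustrative): a combined node carries both role labels, a plain worker only the worker label, and an unlabeled node neither.

# Combined master/worker node in a three-node cluster
metadata:
  labels:
    node-role.kubernetes.io/master: ""
    node-role.kubernetes.io/worker: ""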

I hope I'm not making a mistake in my thinking. @rmkraus @marquiz @ArangoGutierrez PTAL.

@marquiz We may need to update the upstream version as well.

@rmkraus
Author

rmkraus commented Nov 12, 2020

@zvonkok I have confirmed that the following affinity rules work as specified. These rules are exactly what you posted with the duplicate nodeSelectorTerms key removed.

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: DoesNotExist
              - matchExpressions:
                  - key: node-role.kubernetes.io/worker
                    operator: Exists

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

@rmkraus Wait, this should not work, since all matchExpressions need to be satisfied for a Pod to be scheduled.
Can you show me the labels/annotations of your nodes? In the end that would mean you have a node with a worker label and without a master label, which makes "no sense" if you have a three-node cluster, where each node should have both the master and worker labels ....

@rmkraus
Author

rmkraus commented Nov 12, 2020

Negative, from K8s docs:

If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms can be satisfied.

If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions is satisfied.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

@rmkraus
Author

rmkraus commented Nov 12, 2020

So the conditions in the nodeSelectorTerms list are combined with OR. The conditions in the matchExpressions list are combined with AND.

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

Negative, from K8s docs:

If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms can be satisfied.
If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions is satisfied.

Here: ^^ only if all matchExpressions are satisfied ... Am I misunderstanding? ... well, English is not my first language, so I may be missing something :)

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

Yes, and the matchExpressions being ANDed means the master label does not exist AND the worker label does exist. That is why I am confused.

@rmkraus
Author

rmkraus commented Nov 12, 2020

English is my first language and it is still not super clear. ;-)

I understand that to mean:
When you define a matchExpressions clause, it contains a list. All of the items in the list must be true to get pod placement.
When you define a nodeSelectorTerms clause, it also contains a list. Only one of the items in that list needs to be true in order to get pod placement.

In the selector clause I posted, we have one nodeSelectorTerms clause that contains two distinct matchExpressions clauses. The matchExpressions clauses are evaluated separately, and then the parent nodeSelectorTerms clause is evaluated.
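
A minimal illustration of the two readings (sketch only, standard Kubernetes nodeAffinity semantics): the first form lists two separate nodeSelectorTerms entries and matches a node satisfying either one; the second puts both expressions into a single matchExpressions list and requires the node to satisfy both.

# OR: two terms -- a node missing the master label OR carrying the worker
# label is eligible (this is the form posted above).
nodeSelectorTerms:
  - matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
  - matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists

# AND: one term with two expressions -- the node must lack the master label
# AND carry the worker label, which excludes combined master/worker nodes.
nodeSelectorTerms:
  - matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
      - key: node-role.kubernetes.io/worker
        operator: Exists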

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

Alrighty, this makes sense; I was thinking at the wrong level of abstraction ... Thanks for verifying this. We need to create a fix for 4.7 and backport it to the 4.6.x z-stream.

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

@ArangoGutierrez we need a fix ASAP for this, please also cherry-pick it for 4.6, thanks!

@rmkraus
Author

rmkraus commented Nov 12, 2020

👍 Thanks all!

@ArangoGutierrez
Contributor

Ack

@ArangoGutierrez
Contributor

We would need a Bugzilla bug filed to perform the backport
@zvonkok @rmkraus

@rmkraus
Author

rmkraus commented Nov 12, 2020

I unfortunately do not have BZ access. 😢

@ArangoGutierrez
Contributor

No worries. @zvonkok I am testing the fix on the latest kube upstream; could you create the BZ and assign it to me?

ArangoGutierrez added a commit to ArangoGutierrez/node-feature-discovery-operator that referenced this issue Nov 13, 2020
After kubernetes-sigs#31, three-node clusters, where all nodes have the master and
worker (or node, for vanilla clusters) labels, only get the
nfd-master daemonset deployed. Since the nodes have the master label, the NFD
workers will not be scheduled.

Discussion started in a downstream distribution of NFD:
openshift/cluster-nfd-operator#109

This patch fixes that by adding support for master:worker type nodes,
as used in edge deployments

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
ArangoGutierrez added a commit to ArangoGutierrez/node-feature-discovery-operator that referenced this issue Nov 13, 2020
After kubernetes-sigs#31, three-node clusters, where all nodes have the master and
worker (or node, for vanilla clusters) labels, only get the
nfd-master daemonset deployed. Since the nodes have the master label, the NFD
workers will not be scheduled.

Discussion started in a downstream distribution of NFD:
openshift/cluster-nfd-operator#109

This patch fixes that by adding support for master:worker type nodes,
modifying the nodeAffinity rules on the nfd-worker daemonSet to allow any node with the "node" label
(regardless of whether it is also a master) to get scheduled

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
@deanpeterson

@rmkraus, when I try to modify the DaemonSet directly in OpenShift, it will not let me, as it is being managed by nfd-master-server. How were you able to apply your affinity overrides?

@ArangoGutierrez
Contributor

Hi @deanpeterson, the new version of the operator will come with full support for 3-node clusters.
The patch was merged yesterday and will be part of the payload for OCP 4.7. In the meanwhile, I am more than happy to assist you with your use case.

@deanpeterson

Thanks @ArangoGutierrez! Is there an easy way to modify a 4.6 cluster to override the affinity rules in the nfd-worker daemonset?

@ArangoGutierrez
Contributor

The current master branch of this repo is able to do so.
Are you familiar with how to deploy the operator using the Makefile in this repo?

@ArangoGutierrez
Contributor

@rmkraus does the state of the current master branch satisfy your use case?
If so, please feel free to close this issue.
Thanks for your interest in the project and for helping us enhance it!

@deanpeterson

No, I wasn't aware of the Makefile to replace the existing NFD operator.

@rmkraus
Author

rmkraus commented Dec 6, 2020

@ArangoGutierrez Yes, the latest version of the nfd operator works for me!

@deanpeterson For the sake of K8s debugging tricks: I just duplicated the DaemonSet and made my desired changes. That’s definitely not something you’d want to do on a production cluster, but it’s useful for quick debugging.
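
A rough sketch of that kind of debugging copy, assuming the stock nfd-worker DaemonSet in gpu-operator-resources as the starting point; the copy's name is arbitrary and only the fields that change are shown here (for example, dropping in the affinity rules discussed above):

# Duplicate of the managed DaemonSet under a different name, so the operator
# does not reconcile the changes away. Fragment only -- the rest of the spec
# is copied unchanged from the original nfd-worker DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nfd-worker-debug
  namespace: gpu-operator-resources
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: DoesNotExist
              - matchExpressions:
                  - key: node-role.kubernetes.io/worker
                    operator: Exists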

@rmkraus rmkraus closed this as completed Dec 6, 2020
@ArangoGutierrez
Contributor

Awesome!!
Thanks for your feedback, it is always appreciated and needed to enhance the operator.
