
Broken on 3 node clusters #109

Closed
rmkraus opened this issue Nov 6, 2020 · 24 comments

@rmkraus

rmkraus commented Nov 6, 2020

As of this commit, three-node clusters with GPU acceleration no longer seem to be supported. In a three-node cluster, all nodes carry both the master and worker labels. Since they have the master label, the NFD workers will not be scheduled. The comment on the commit indicates that the intent was to allow NFD to run on servers that may be tagged with roles other than worker, but the effect of the change is broader than that.

Might I suggest running the NFD worker on the worker nodes by default and allowing the nodeSelector to be set in the spec of the NodeFeatureDiscovery resource?
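
For illustration only, a minimal sketch of what that could look like if the NodeFeatureDiscovery custom resource exposed such a field; the apiVersion, namespace, and the workerNodeSelector field name below are hypothetical, not the operator's current API:

apiVersion: nfd.openshift.io/v1alpha1   # illustrative; check the installed CRD
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: gpu-operator-resources
spec:
  # Hypothetical field: selector applied to the nfd-worker DaemonSet,
  # defaulting to the worker role but overridable for combined
  # master/worker nodes in a three-node cluster.
  workerNodeSelector:
    node-role.kubernetes.io/worker: ""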

@rmkraus
Author

rmkraus commented Nov 6, 2020

I created my own nfd-worker DaemonSet and was able to work around the issue. Obviously not a great permanent fix, though:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: nfd-worker-new-hotness
  namespace: gpu-operator-resources
  labels:
    app: nfd-worker
spec:
  selector:
    matchLabels:
      app: nfd-worker
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nfd-worker
    spec:
      restartPolicy: Always
      serviceAccountName: nfd-worker
      schedulerName: default-scheduler
      hostNetwork: true
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - resources: {}
          terminationMessagePath: /dev/termination-log
          name: nfd-worker
          command:
            - nfd-worker
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          securityContext:
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
          imagePullPolicy: Always
          volumeMounts:
            - name: host-boot
              readOnly: true
              mountPath: /host-boot
            - name: host-os-release
              readOnly: true
              mountPath: /host-etc/os-release
            - name: host-sys
              mountPath: /host-sys
            - name: config
              mountPath: /etc/kubernetes/node-feature-discovery
            - name: nfd-hooks
              mountPath: /etc/kubernetes/node-feature-discovery/source.d
            - name: nfd-features
              mountPath: /etc/kubernetes/node-feature-discovery/features.d
          terminationMessagePolicy: File
          image: 'quay.io/openshift/origin-node-feature-discovery:latest'
          args:
            - '--sleep-interval=60s'
            - '--server=$(NFD_MASTER_SERVICE_HOST):$(NFD_MASTER_SERVICE_PORT)'
      serviceAccount: nfd-worker
      volumes:
        - name: host-boot
          hostPath:
            path: /boot
            type: ''
        - name: host-os-release
          hostPath:
            path: /etc/os-release
            type: ''
        - name: host-sys
          hostPath:
            path: /sys
            type: ''
        - name: nfd-hooks
          hostPath:
            path: /etc/kubernetes/node-feature-discovery/source.d
            type: ''
        - name: nfd-features
          hostPath:
            path: /etc/kubernetes/node-feature-discovery/features.d
            type: ''
        - name: config
          configMap:
            name: nfd-worker
            items:
              - key: nfd-worker-conf
                path: nfd-worker.conf
            defaultMode: 420
      dnsPolicy: ClusterFirst
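      # Tolerate every NoSchedule taint (including node-role.kubernetes.io/master)
      # so this worker pod is not repelled from combined master/worker nodes.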
      tolerations:
        - operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  revisionHistoryLimit: 10

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

I think we can do the following to support both configurations.

  • multiple nodeSelectorTerms: the pod can be scheduled if one of the terms can be satisfied
  • multiple matchExpressions: the pod can be scheduled only if all expressions are satisfied

Since we're tolerating all taints, we need to exclude masters but still allow scheduling on workers, whether or not they carry the worker label.

      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
              - key:  node-role.kubernetes.io/master
                operator: DoesNotExist
          nodeSelectorTerms:
            - matchExpressions:
              - key:  node-role.kubernetes.io/worker
                operator: Exists
      

If we have a three-node cluster where the nodes have both the worker and master labels, the first term would fail but the second would be valid -> the pod gets scheduled on the node.

If we have three masters with the master label, the first term would repel the pod from the master(s), and nodes with the worker label would get it because the second term is valid.

If we have three masters with the master label, the first term would repel the pod from the master(s); the second term will fail on workers that do not have a worker label, but on nodes without any label the first term would be satisfied, and hence the pod would be scheduled on those nodes.
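
For reference, a minimal sketch of the node labels these scenarios assume (values illustrative): a combined node carries both role labels, a plain worker only the worker label, and an unlabeled node neither.

# Combined master/worker node in a three-node cluster
metadata:
  labels:
    node-role.kubernetes.io/master: ""
    node-role.kubernetes.io/worker: ""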

I hope I'm not making a mistake in my thinking. @rmkraus @marquiz @ArangoGutierrez PTAL.

@marquiz We may need to update the upstream version as well.

@rmkraus
Author

rmkraus commented Nov 12, 2020

@zvonkok I have confirmed that the following affinity rules work as specified. These rules are exactly what you posted with the duplicate nodeSelectorTerms key removed.

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: DoesNotExist
              - matchExpressions:
                  - key: node-role.kubernetes.io/worker
                    operator: Exists

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

@rmkraus Wait, this should not work, since all matchExpressions need to be satisfied for a Pod to be scheduled.
Can you show me the labels/annotations of your nodes? In the end that would mean you have a node with a worker label and without a master label, which makes "no sense" if you have a three-node cluster, where each node should have both the master and worker labels ....

@rmkraus
Author

rmkraus commented Nov 12, 2020

Negative, from K8s docs:

If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms can be satisfied.

If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions is satisfied.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

@rmkraus
Author

rmkraus commented Nov 12, 2020

So the conditions in the nodeSelectorTerms list are combined with OR. The conditions in the matchExpressions list are combined with AND.

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

Negative, from K8s docs:

If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms can be satisfied.
If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions is satisfied.

Here: ^^ only if all matchExpressions are satisfied ... Am I misunderstanding? ... well, English is not my first language, so I may be missing something :)

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

Yes, and the matchExpressions being ANDed means the master label does not exist AND the worker label does exist. That is why I am confused.

@rmkraus
Author

rmkraus commented Nov 12, 2020

English is my first language and it is still not super clear. ;-)

I understand that to mean:
When you define a matchExpressions clause, it contains a list. All of the items in the list must be true to get pod placement.
When you define a nodeSelectorTerms clause, it also contains a list. Only one of the items in that list needs to be true in order to get pod placement.

In the selector clause I posted, we have one nodeSelectorTerms clause that contains two distinct matchExpressions clauses. The matchExpressions clauses are evaluated separately, and then the parent nodeSelectorTerms clause is evaluated.
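
A minimal illustration of the two readings (sketch only, standard Kubernetes nodeAffinity semantics): the first form lists two separate nodeSelectorTerms entries and matches a node satisfying either one; the second puts both expressions into a single matchExpressions list and requires the node to satisfy both.

# OR: two terms -- a node missing the master label OR carrying the worker
# label is eligible (this is the form posted above).
nodeSelectorTerms:
  - matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
  - matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists

# AND: one term with two expressions -- the node must lack the master label
# AND carry the worker label, which excludes combined master/worker nodes.
nodeSelectorTerms:
  - matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
      - key: node-role.kubernetes.io/worker
        operator: Exists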

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

Alrighty, this makes sense; I was thinking at the wrong level of abstraction ... Thanks for verifying this. We need to create a fix for 4.7 and backport it to the 4.6.x z-stream.

@zvonkok
Contributor

zvonkok commented Nov 12, 2020

@ArangoGutierrez we need a fix ASAP for this, please also cherry-pick it for 4.6, thanks!

@rmkraus
Author

rmkraus commented Nov 12, 2020

👍 Thanks all!

@ArangoGutierrez
Contributor

Ack

@ArangoGutierrez
Contributor

We would need a Bugzilla bug filed to perform the backport
@zvonkok @rmkraus

@rmkraus
Author

rmkraus commented Nov 12, 2020

I unfortunately do not have BZ access. 😢

@ArangoGutierrez
Contributor

No worries. @zvonkok I am testing the fix on the latest kube upstream; could you create the BZ and assign it to me?

ArangoGutierrez added a commit to ArangoGutierrez/node-feature-discovery-operator that referenced this issue Nov 13, 2020
After kubernetes-sigs#31, three-node clusters, where all nodes have the master and
worker (or node, for vanilla clusters) labels, only get the
nfd-master daemonset deployed. Since the nodes have the master label, the NFD
workers will not be scheduled.

Discussion started in a downstream distribution of NFD:
openshift/cluster-nfd-operator#109

This patch fixes that by adding support for master:worker type nodes,
as used in edge deployments

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
ArangoGutierrez added a commit to ArangoGutierrez/node-feature-discovery-operator that referenced this issue Nov 13, 2020
After kubernetes-sigs#31, three-node clusters, where all nodes have the master and
worker (or node, for vanilla clusters) labels, only get the
nfd-master daemonset deployed. Since the nodes have the master label, the NFD
workers will not be scheduled.

Discussion started in a downstream distribution of NFD:
openshift/cluster-nfd-operator#109

This patch fixes that by adding support for master:worker type nodes,
modifying the nodeAffinity rules on the nfd-worker daemonSet to allow any node with the "node" label
(regardless of whether it is also a master) to get scheduled

Signed-off-by: Carlos Eduardo Arango Gutierrez <carangog@redhat.com>
@deanpeterson

@rmkraus, when I try to modify the DaemonSet directly in OpenShift, it will not let me, as it is being managed by nfd-master-server. How were you able to apply your affinity overrides?

@ArangoGutierrez
Contributor

Hi @deanpeterson, the new version of the operator will come with full support for 3-node clusters.
The patch was merged yesterday and will be part of the payload for OCP 4.7. In the meanwhile, I am more than happy to assist you with your use case.

@deanpeterson

Thanks @ArangoGutierrez! Is there an easy way to modify a 4.6 cluster to override the affinity rules in the nfd-worker daemonset?

@ArangoGutierrez
Contributor

The current master branch of this repo is able to do so.
Are you familiar with how to deploy the operator using the Makefile in this repo?

@ArangoGutierrez
Contributor

@rmkraus does the state of the current master branch satisfy your use case?
If so, please feel free to close this issue.
Thanks for your interest in the project and for helping us enhance it!

@deanpeterson

No, I wasn't aware of the Makefile to replace the existing NFD operator.

@rmkraus
Author

rmkraus commented Dec 6, 2020

@ArangoGutierrez Yes, the latest version of the nfd operator works for me!

@deanpeterson For the sake of K8s debugging tricks: I just duplicated the DaemonSet and made my desired changes. That’s definitely not something you’d want to do on a production cluster, but it’s useful for quick debugging.
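
A rough sketch of that kind of debugging copy, assuming the stock nfd-worker DaemonSet in gpu-operator-resources as the starting point; the copy's name is arbitrary and only the fields that change are shown here (for example, dropping in the affinity rules discussed above):

# Duplicate of the managed DaemonSet under a different name, so the operator
# does not reconcile the changes away. Fragment only -- the rest of the spec
# is copied unchanged from the original nfd-worker DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nfd-worker-debug
  namespace: gpu-operator-resources
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: DoesNotExist
              - matchExpressions:
                  - key: node-role.kubernetes.io/worker
                    operator: Exists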

@rmkraus rmkraus closed this as completed Dec 6, 2020
@ArangoGutierrez
Contributor

Awesome!!
Thanks for your feedback, it is always appreciated and needed to enhance the operator.
