
sriov-network-operator should list nodes based on label filters to decide whether to skip drain #463

Closed
dastonzerg opened this issue Jun 26, 2023 · 3 comments


Currently sriov-network-operator determines whether a cluster is a single-worker-node cluster by counting nodes without any label filtering (https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/pkg/utils/cluster.go#L78). If the cluster has a single worker node, the operator marks it as "skip drain" (https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/main.go#L249-L263), and the daemon then decides whether to skip the drain in https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/pkg/daemon/daemon.go#L519.
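Roughly, the current check amounts to counting all nodes in the cluster, as in the sketch below (a simplified illustration using controller-runtime, not the operator's actual code; names are mine):

package utils

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isSingleNodeCluster counts every node in the cluster, with no label
// filtering, and reports a single-node cluster when exactly one is found.
// Simplified illustration of the current behavior, not the actual code.
func isSingleNodeCluster(ctx context.Context, c client.Client) (bool, error) {
	nodeList := &corev1.NodeList{}
	if err := c.List(ctx, nodeList); err != nil {
		return false, err
	}
	return len(nodeList.Items) == 1, nil
}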

Now, we are using sriov-network-operator in a cluster that has only one "worker node", which runs all of the data-plane pods (including sriov-device-plugin and sriov-network-config-daemon; we pin those pods to the worker node by setting configDaemonNodeSelector in SriovOperatorConfig, see https://cloud.google.com/anthos/clusters/docs/bare-metal/latest/how-to/sriov#configure_the_sr-iov_operator), while the control-plane nodes only run kube-apiserver and the other control-plane components. In this setup the worker node gets stuck in SchedulingDisabled after a SriovNetworkNodePolicy is applied:

$ k get nodes
NAME              STATUS                     ROLES           AGE   VERSION
control-plane-0   Ready                      control-plane   41d   v1.26.2-gke.1001
control-plane-1   Ready                      control-plane   41d   v1.26.2-gke.1001
control-plane-2   Ready                      control-plane   41d   v1.26.2-gke.1001
worker-node       Ready,SchedulingDisabled   <none>          41d   v1.25.5-gke.1001
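
(For reference, the node-selector setup described above looks roughly like the following; the namespace and label key here are illustrative, not copied from our actual config.)

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  configDaemonNodeSelector:
    node-role.kubernetes.io/worker: ""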

The reason is that there are PodDisruptionBudgets for some pods on worker-node:

$ k get PodDisruptionBudget/istio-ingress -n gke-system
NAME            MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
istio-ingress   1               N/A               0                     41d

$ k get PodDisruptionBudget/istiod -n gke-system
NAME     MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
istiod   1               N/A               0                     41d

and those pods can only be scheduled on worker nodes, because the control-plane nodes carry taints:

taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane

which those pods don't tolerate. So when sriov-network-config-daemon tries to drain the node and evict those pods, they have no other node to move to, and sriov-network-config-daemon logs the following error:

2023-06-26T18:57:43.787766559Z stderr F I0626 18:57:43.787737   11036 writer.go:132] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
2023-06-26T18:57:44.764460102Z stderr F I0626 18:57:44.764395   11036 daemon.go:133] evicting pod gke-system/istiod-665ccd8cfb-dtcc9
2023-06-26T18:57:44.765505368Z stderr F I0626 18:57:44.765470   11036 daemon.go:133] evicting pod gke-system/istio-ingress-77cbf5d986-8cvhc
2023-06-26T18:57:44.783806543Z stderr F E0626 18:57:44.783775   11036 daemon.go:133] error when evicting pods/"istiod-665ccd8cfb-dtcc9" -n "gke-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2023-06-26T18:57:44.784855495Z stderr F E0626 18:57:44.784818   11036 daemon.go:133] error when evicting pods/"istio-ingress-77cbf5d986-8cvhc" -n "gke-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

The eviction just keeps retrying, so the node drain never succeeds. For this kind of cluster layout I think we should simply skip the node drain, since it can never succeed.
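
(Today the workaround is to set DisableDrain manually on the default SriovOperatorConfig, along the lines of the snippet below; the namespace is illustrative.)

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  disableDrain: true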

Given that configDaemonNodeSelector in SriovOperatorConfig already decides which nodes run the sriov config daemon (and are therefore candidates for a drain), I think it makes more sense to decide whether the cluster effectively has only one node by listing nodes with the labels given in configDaemonNodeSelector. I can think of the following two solutions:

  1. In pkg/utils/cluster.go, change the node listing to filter by the labels from the SriovOperatorConfig named default in the namespace os.Getenv("NAMESPACE"). controllers/sriovoperatorconfig_controller.go would then set DisableDrain based on whether only one node matches (see the sketch after this list).
  2. Don't let the controller mutate the SriovOperatorConfig CR spec by setting DisableDrain. Instead, if DisableDrain is set in the operator config, honor it; if it is not set, still decide whether this is effectively a single-node cluster by filtering nodes on those labels. We would, however, need to expose that single-node determination somewhere, presumably through a CR status field; I would like to hear people's ideas on where this information should live.
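
A minimal sketch of the filtered counting for option 1, assuming the default SriovOperatorConfig is read from the namespace given by os.Getenv("NAMESPACE"); the function name is illustrative and the sriovnetworkv1 import path is assumed from the repo layout, so this is not the operator's actual code:

package utils

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

// isSingleManagedNode counts only the nodes matched by the default
// SriovOperatorConfig's ConfigDaemonNodeSelector, i.e. the nodes that can
// actually run the config daemon and would therefore be drained.
func isSingleManagedNode(ctx context.Context, c client.Client) (bool, error) {
	cfg := &sriovnetworkv1.SriovOperatorConfig{}
	key := types.NamespacedName{Name: "default", Namespace: os.Getenv("NAMESPACE")}
	if err := c.Get(ctx, key, cfg); err != nil {
		return false, err
	}

	// Only apply label filtering when a selector is actually configured.
	opts := []client.ListOption{}
	if len(cfg.Spec.ConfigDaemonNodeSelector) > 0 {
		opts = append(opts, client.MatchingLabels(cfg.Spec.ConfigDaemonNodeSelector))
	}

	nodes := &corev1.NodeList{}
	if err := c.List(ctx, nodes, opts...); err != nil {
		return false, err
	}
	return len(nodes.Items) == 1, nil
}

Option 2 would reuse the same filtered listing but surface the result in a status field rather than mutating the spec.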
dastonzerg (Author) commented Jul 12, 2023

@SchSeba Hey Sebastian, can you comment more on what we discussed in the Jul 17, 2023 meeting?

We are trying to determine the node count by filtering on the node labels in the default SriovOperatorConfig's .Spec.ConfigDaemonNodeSelector, if provided (since only nodes with those labels will have the sriov daemon deployed). That way we can make the right call to disable drain when only one node is allowed to run the sriov daemon.

I would like some clarification on whether this is a feasible change, since I think the current node counting is a bug. Thanks!

SchSeba (Collaborator) commented Aug 1, 2023

We don't want to do this automatically when you have more than one node, because the user needs to understand the implications of that.

Meaning: when you have one node, you need to handle restarting the pods after a configuration. When you have multiple nodes, the user must manually configure the skipDrain option and understand that the implication is that they will need to restart the workloads.

SchSeba (Collaborator) commented Dec 21, 2023

Doing some housekeeping and closing this one; please feel free to reopen if needed.

SchSeba closed this as completed Dec 21, 2023