
sriov-network-operator should list nodes based on label filters to decide whether to skip drain #463

Closed
dastonzerg opened this issue Jun 26, 2023 · 3 comments


Currently sriov-network-operator determines whether a cluster is a single-worker-node cluster by counting nodes without any label filtering (https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/pkg/utils/cluster.go#L78). If the cluster has a single worker node, the operator marks it as "skip drain" (https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/main.go#L249-L263), and the daemon then decides whether to skip the drain in https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/pkg/daemon/daemon.go#L519.
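Roughly, the current check amounts to counting all nodes in the cluster, as in the sketch below (a simplified illustration using controller-runtime, not the operator's actual code; names are mine):

package utils

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isSingleNodeCluster counts every node in the cluster, with no label
// filtering, and reports a single-node cluster when exactly one is found.
// Simplified illustration of the current behavior, not the actual code.
func isSingleNodeCluster(ctx context.Context, c client.Client) (bool, error) {
	nodeList := &corev1.NodeList{}
	if err := c.List(ctx, nodeList); err != nil {
		return false, err
	}
	return len(nodeList.Items) == 1, nil
}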

Now, we are using sriov-network-operator in a cluster that has only one "worker node", which runs all of the data-plane pods (including sriov-device-plugin and sriov-network-config-daemon; we pin those pods to the worker node by setting configDaemonNodeSelector in SriovOperatorConfig, see https://cloud.google.com/anthos/clusters/docs/bare-metal/latest/how-to/sriov#configure_the_sr-iov_operator), while the control-plane nodes only run kube-apiserver and the other control-plane components. In this setup the worker node gets stuck in SchedulingDisabled after a SriovNetworkNodePolicy is applied:

$ k get nodes
NAME              STATUS                     ROLES           AGE   VERSION
control-plane-0   Ready                      control-plane   41d   v1.26.2-gke.1001
control-plane-1   Ready                      control-plane   41d   v1.26.2-gke.1001
control-plane-2   Ready                      control-plane   41d   v1.26.2-gke.1001
worker-node       Ready,SchedulingDisabled   <none>          41d   v1.25.5-gke.1001
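
(For reference, the node-selector setup described above looks roughly like the following; the namespace and label key here are illustrative, not copied from our actual config.)

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  configDaemonNodeSelector:
    node-role.kubernetes.io/worker: ""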

The reason is that there are PodDisruptionBudgets for some pods on worker-node:

$ k get PodDisruptionBudget/istio-ingress -n gke-system
NAME            MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
istio-ingress   1               N/A               0                     41d

$ k get PodDisruptionBudget/istiod -n gke-system
NAME     MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
istiod   1               N/A               0                     41d

and those pods can only be scheduled on worker nodes, because the control-plane nodes carry taints:

taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane

which those pods don't tolerate. So when sriov-network-config-daemon tries to drain the node and evict those pods, they have no other node to move to, and sriov-network-config-daemon logs the following error:

2023-06-26T18:57:43.787766559Z stderr F I0626 18:57:43.787737   11036 writer.go:132] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
2023-06-26T18:57:44.764460102Z stderr F I0626 18:57:44.764395   11036 daemon.go:133] evicting pod gke-system/istiod-665ccd8cfb-dtcc9
2023-06-26T18:57:44.765505368Z stderr F I0626 18:57:44.765470   11036 daemon.go:133] evicting pod gke-system/istio-ingress-77cbf5d986-8cvhc
2023-06-26T18:57:44.783806543Z stderr F E0626 18:57:44.783775   11036 daemon.go:133] error when evicting pods/"istiod-665ccd8cfb-dtcc9" -n "gke-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2023-06-26T18:57:44.784855495Z stderr F E0626 18:57:44.784818   11036 daemon.go:133] error when evicting pods/"istio-ingress-77cbf5d986-8cvhc" -n "gke-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

The eviction just keeps retrying, so the node drain never succeeds. For this kind of cluster layout I think we should simply skip the node drain, since it can never succeed.
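
(Today the workaround is to set DisableDrain manually on the default SriovOperatorConfig, along the lines of the snippet below; the namespace is illustrative.)

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  disableDrain: true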

Given that configDaemonNodeSelector in SriovOperatorConfig already decides which nodes run the sriov config daemon (and are therefore candidates for a drain), I think it makes more sense to decide whether the cluster effectively has only one node by listing nodes with the labels given in configDaemonNodeSelector. I can think of the following two solutions:

  1. In pkg/utils/cluster.go, change the node listing to filter by the labels from the SriovOperatorConfig named default in the namespace os.Getenv("NAMESPACE"). controllers/sriovoperatorconfig_controller.go would then set DisableDrain based on whether only one node matches (see the sketch after this list).
  2. Don't let the controller mutate the SriovOperatorConfig CR spec by setting DisableDrain. Instead, if DisableDrain is set in the operator config, honor it; if it is not set, still decide whether this is effectively a single-node cluster by filtering nodes on those labels. We would, however, need to expose that single-node determination somewhere, presumably through a CR status field; I would like to hear people's ideas on where this information should live.
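
A minimal sketch of the filtered counting for option 1, assuming the default SriovOperatorConfig is read from the namespace given by os.Getenv("NAMESPACE"); the function name is illustrative and the sriovnetworkv1 import path is assumed from the repo layout, so this is not the operator's actual code:

package utils

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

// isSingleManagedNode counts only the nodes matched by the default
// SriovOperatorConfig's ConfigDaemonNodeSelector, i.e. the nodes that can
// actually run the config daemon and would therefore be drained.
func isSingleManagedNode(ctx context.Context, c client.Client) (bool, error) {
	cfg := &sriovnetworkv1.SriovOperatorConfig{}
	key := types.NamespacedName{Name: "default", Namespace: os.Getenv("NAMESPACE")}
	if err := c.Get(ctx, key, cfg); err != nil {
		return false, err
	}

	// Only apply label filtering when a selector is actually configured.
	opts := []client.ListOption{}
	if len(cfg.Spec.ConfigDaemonNodeSelector) > 0 {
		opts = append(opts, client.MatchingLabels(cfg.Spec.ConfigDaemonNodeSelector))
	}

	nodes := &corev1.NodeList{}
	if err := c.List(ctx, nodes, opts...); err != nil {
		return false, err
	}
	return len(nodes.Items) == 1, nil
}

Option 2 would reuse the same filtered listing but surface the result in a status field rather than mutating the spec.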
dastonzerg (Author) commented Jul 12, 2023

@SchSeba Hey Sebastian, can you comment more on what we discussed in the Jul 17, 2023 meeting?

We are trying to determine the node count by filtering on the node labels in the default SriovOperatorConfig's .Spec.ConfigDaemonNodeSelector, if provided (since only nodes with those labels will have the sriov daemon deployed). That way we can make the right call to disable drain when only one node is allowed to run the sriov daemon.

I would like some clarification on whether this is a feasible change, since I think the current node counting is a bug. Thanks!

SchSeba (Collaborator) commented Aug 1, 2023

We don't want to do this automatically when you have more than one node, because the user needs to understand the implications of that.

Meaning: when you have one node, you need to handle restarting the pods after a configuration. When you have multiple nodes, the user must manually configure the skipDrain option and understand that the implication is that they will need to restart the workloads.

SchSeba (Collaborator) commented Dec 21, 2023

Doing some housekeeping and closing this one; please feel free to reopen if needed.

SchSeba closed this as completed Dec 21, 2023