test: Every control plane component should have PDB #26160
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: ravisantoshgudimetla. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 8274722 to 089b74e.
}
var inHAMode bool
if infra.Status.InfrastructureTopology == oapi.HighlyAvailableTopologyMode {
	inHAMode = true
This seems like it should be orthogonal. Can't we get at this just by ignoring replicas < 2
workloads? Because even if they allow surging, we don't surge on drains.
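The reviewer's suggestion above can be sketched as a tiny predicate: rather than branching on the cluster's infrastructure topology, skip any workload running fewer than two replicas, since a single-replica workload is interruptible by definition. `needsPDB` is a hypothetical helper, not the PR's actual code.

```go
package main

import "fmt"

// needsPDB returns true only for workloads that run multiple replicas.
// A workload with replicas < 2 cannot be made drain-safe by a PDB anyway,
// so topology mode never needs to enter the decision.
func needsPDB(replicas int32) bool {
	return replicas >= 2
}

func main() {
	for _, r := range []int32{1, 2, 3} {
		fmt.Printf("replicas=%d needsPDB=%v\n", r, needsPDB(r))
	}
}
```

This keeps the check orthogonal to topology: on single-node or compact clusters the replica counts are already low, so the same rule applies without a special case.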
	inHAMode = true
}
//controlPlaneStaticPodNames := sets.NewString("etcd", "kube-apiserver", "kube-controller-manager",
//	"openshift-kube-scheduler")
nit: did you mean to drop this comment now that you have `getAllControlPlaneStaticPods`?
Of course :(
selectorLabels = t.Spec.Selector
replicas = t.Status.DesiredNumberScheduled
default:
	panic("not an object")
nit: `panic(fmt.Sprintf("unrecognized workload type %T", workload))` or some such to give more context?
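The `%T` verb the reviewer suggests formats a value's dynamic type, so the panic message names exactly what fell through the type switch. A minimal sketch (the `deployment`/`statefulSet` types are stand-ins for the real `appsv1` types):

```go
package main

import "fmt"

// Stand-in workload types; the real code switches over *appsv1.Deployment,
// *appsv1.StatefulSet, etc.
type deployment struct{}
type statefulSet struct{}

// replicasFor panics with the dynamic type of any unrecognized workload,
// which is far easier to debug than a bare "not an object".
func replicasFor(workload interface{}) int32 {
	switch workload.(type) {
	case deployment, statefulSet:
		return 3 // illustrative fixed value
	default:
		panic(fmt.Sprintf("unrecognized workload type %T", workload))
	}
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	replicasFor("not a workload")
	// prints: recovered: unrecognized workload type string
}
```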
if !strings.Contains(meta.Namespace, "openshift-") {
	// Not every component is following the convention of -operator at the end, there are some namespaces
	// like openshift-monitoring and openshift-network-diagnostics and some of them run on worker nodes
	// as well. So, need to figure out how to exclude operator components and components running on worker node
I don't understand the need to exclude operators from this. If we have an operator with replicas > 1, can't we apply the same PDB considerations to it as well?
The goal here is to get control plane components most of which run on master nodes.
OpenShift-core operators will also run on the control-plane nodes. If any of them have replicas > 1 (maybe none of them), why wouldn't we want them covered by PDBs?
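One subtlety in the quoted namespace filter: `strings.Contains` matches the substring anywhere in the name, while `strings.HasPrefix` expresses the actual intent (namespaces in the `openshift-` prefix) and cannot match a name that merely embeds the substring. A small sketch, with `isOpenShiftNamespace` as a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// isOpenShiftNamespace reports whether a namespace belongs to the
// openshift- prefix. HasPrefix avoids false positives that Contains
// would produce for names like "my-openshift-copy".
func isOpenShiftNamespace(ns string) bool {
	return strings.HasPrefix(ns, "openshift-")
}

func main() {
	for _, ns := range []string{"openshift-monitoring", "my-openshift-copy", "default"} {
		fmt.Printf("%s -> %v\n", ns, isOpenShiftNamespace(ns))
	}
	// prints:
	// openshift-monitoring -> true
	// my-openshift-copy -> false
	// default -> false
}
```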
	e2e.Failf("unable to list statefulsets: %v", err)
}
// ovn-kube, openshift-controller-manager are running as DS.
daemonsets, err := kubeClient.AppsV1().DaemonSets("").List(context.Background(), metav1.ListOptions{})
DaemonSets aren't drained. There may still be caps around how many we want to disrupt by shutting down nodes, but I dunno if PDBs actually protect them from this today.
Many things don't require PDBs, like the machine-api. That runs with 1 replica, short periods of unavailability while it is rescheduled to another host are fine.
daemonsets can't be evicted today via drain, so enforcing PDBs without upstream changes on these would not be useful.
Fair enough. I think we have a flag which evicts DS pods too, but I can keep DS out of the PR for now.
if inHAMode {
	key = "HA/" + key
if !isPDBConfigured {
	pdbMissingWorkloads = append(pdbMissingWorkloads, key)
I don't think we need to complain about missing PDBs on replicas=1 workloads; they're clearly interruptible. Can we restructure this like:

	if replicas > 1 { // no reason to set this unless we are trying to be HA, right?
		if currentPDB == nil { // you'd need to make currentPDB a pointer for this to work
			pdbMissingWorkloads = append(pdbMissingWorkloads, key)
		} else {
			// checks to see if we were comfortable with the PDB configuration
		}
	}
I can change it. In fact, I can ignore it before the loop
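The restructure discussed in this thread can be sketched as a self-contained program. All names here (`classify`, `pdbSpec`) are illustrative, not the PR's code; `pdb == nil` models "no PDB selects this workload":

```go
package main

import "fmt"

// pdbSpec is a simplified stand-in for policyv1.PodDisruptionBudgetSpec.
type pdbSpec struct {
	maxUnavailable int // 0 would block all evictions
}

// classify ignores single-replica workloads entirely, then sorts the rest
// into "missing a PDB" versus "PDB present but misconfigured".
func classify(key string, replicas int32, pdb *pdbSpec, missing, misconfigured *[]string) {
	if replicas <= 1 {
		return // interruptible by definition; nothing to check
	}
	if pdb == nil {
		*missing = append(*missing, key)
	} else if pdb.maxUnavailable < 1 {
		*misconfigured = append(*misconfigured, key)
	}
}

func main() {
	var missing, misconfigured []string
	classify("openshift-apiserver/apiserver", 3, nil, &missing, &misconfigured)
	classify("openshift-etcd/etcd-guard", 3, &pdbSpec{maxUnavailable: 0}, &missing, &misconfigured)
	classify("openshift-machine-api/machine-api", 1, nil, &missing, &misconfigured)
	fmt.Println("missing:", missing)
	fmt.Println("misconfigured:", misconfigured)
	// prints:
	// missing: [openshift-apiserver/apiserver]
	// misconfigured: [openshift-etcd/etcd-guard]
}
```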
if currentPDB.Spec.MaxUnavailable != nil && currentPDB.Spec.MaxUnavailable.IntValue() < 1 {
	pdbMisconfiguredWorkloads = append(pdbMisconfiguredWorkloads, key)
} else if currentPDB.Spec.MinAvailable != nil && currentPDB.Spec.MinAvailable.IntValue() < 1 {
	pdbMisconfiguredWorkloads = append(pdbMisconfiguredWorkloads, key)
We should probably add a message about why we are mad at this key, by either making pdbMisconfiguredWorkloads a map from workload key to string:

	pdbMisconfiguredWorkloads[key] = fmt.Sprintf("sets minAvailable %s, but $WHY_WE_WANT_IT_UNSET_OR_GREATER_THAN_ZERO", currentPDB.Spec.MinAvailable)

or a map from a misconfigured-reason to a slice of keys:

	pdbMisconfiguredWorkloads[pdbMaxUnavailableTooSmall] = append(pdbMisconfiguredWorkloads[pdbMaxUnavailableTooSmall], key)
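The second option the reviewer describes, keyed by reason, can be sketched like this (reason strings are illustrative):

```go
package main

import "fmt"

// Illustrative misconfiguration reasons; each maps to the workloads that
// tripped it, so the final failure message can explain every flagged key.
const (
	pdbMaxUnavailableTooSmall = "maxUnavailable blocks all evictions"
	pdbMinAvailableTooSmall   = "minAvailable is set but less than 1"
)

func main() {
	pdbMisconfiguredWorkloads := map[string][]string{}
	pdbMisconfiguredWorkloads[pdbMaxUnavailableTooSmall] = append(
		pdbMisconfiguredWorkloads[pdbMaxUnavailableTooSmall],
		"openshift-etcd/etcd-guard")

	for reason, keys := range pdbMisconfiguredWorkloads {
		fmt.Printf("%s: %v\n", reason, keys)
	}
	// prints: maxUnavailable blocks all evictions: [openshift-etcd/etcd-guard]
}
```

Grouping by reason keeps the e2e failure output short when many workloads share the same defect.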
operatorsHavingPDBS = append(operatorsHavingPDBS, key)
// The convention we came up with for static pod control plane components is to have a deployment
// doing health check for the corresponding component. This can be changed to labels later.
// As of now, I am assuming if the deployment name has `-guard` in the name and it runs in
this seems reasonable
FYI, if we're interested in draining daemonsets, I have a patch upstream ready to go: kubernetes/kubernetes#88345. This would allow us to put PDBs that cover daemonsets to add back pressure to rolling nodes. However, we would be exposed to endless looping of the drain if any daemonset takes longer to stop than the timeout, as a new pod would be present when we requeue the drain operation. We would need to implement some kind of filtering mechanism. We added custom pod filters for draining, so it's possible to do that today; we just need to work out what that might look like if we're interested in pursuing this.
@ravisantoshgudimetla: The following tests failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This ensures every component has a PDB associated with it. Need to figure out how to ensure only operands have the PDB, not the operators. One way to do it is to exclude -operator namespaces, but not all components follow that convention. It is currently missing static pods; will add them soon.
cc @smarterclayton @wking @deads2k @soltysh