
New test for node draining (respecting PDB) #159

Merged
merged 2 commits into openshift:master on Feb 20, 2019
Conversation

@ingvagabund
Member

SSIA

@openshift-ci-robot added the size/L label on Feb 18, 2019
@ingvagabund
Member Author

/retest

@ingvagabund
Member Author

/test e2e-aws

2 similar comments
@ingvagabund
Member Author

/test e2e-aws

@ingvagabund
Member Author

/test e2e-aws

@ingvagabund
Member Author

/test e2e

1 similar comment
@ingvagabund
Member Author

/test e2e

@ingvagabund
Member Author

/test e2e-aws

@ingvagabund
Member Author

/test e2e

@spangenberg

/lgtm

@openshift-ci-robot added the lgtm label on Feb 19, 2019
@ingvagabund
Member Author

/retest

1 similar comment
@ingvagabund
Member Author

/retest

NodeSelector: nodeDrainLabels,
Tolerations: []corev1.Toleration{
{
Key: "kubemark",
Member

kubemark?

Member Author

Yes, so the same test can be used by both aws and kubemark. The toleration has no effect as long as the node is not tainted.
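
For context, a minimal sketch of how such a toleration can be wired into the test's pod template (the helper name and exact field layout are illustrative, not this PR's code): with an `Exists` operator the pod tolerates a "kubemark" taint when one is present, and the toleration is a no-op on untainted AWS nodes.

```go
package e2e

import corev1 "k8s.io/api/core/v1"

// podSpecForDrainTest is a hypothetical helper (not the PR's actual code)
// showing how one pod template can target both AWS and kubemark nodes:
// the toleration only has an effect when a "kubemark" taint exists.
func podSpecForDrainTest(nodeDrainLabels map[string]string) corev1.PodSpec {
	return corev1.PodSpec{
		NodeSelector: nodeDrainLabels,
		Tolerations: []corev1.Toleration{
			{
				Key:      "kubemark",
				Operator: corev1.TolerationOpExists,
			},
		},
	}
}
```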

"node-draining-test": "",
}

func machineFromMachineset(machineset *mapiv1beta1.MachineSet) *mapiv1beta1.Machine {
@enxebre (Member) commented Feb 19, 2019

I suspect you are doing this because the machineSet does not reconcile labels to machines today, so it'd be nice to add a comment/TODO so we can eventually add another, broader test case for just scaling down a set with a machine annotated for priority deletion.

Member Author

It's done to get a machine template. This way I don't need to assume anything about the machine provider config; I just take what is already provided by the cluster. In the case of the OpenShift installer I don't know which AMI I should use, so I take the one rendered by the installer.
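
A rough sketch of what such a helper can look like (the import path, naming, and ObjectMeta details are assumptions; the PR's actual implementation may differ). The point is that everything provider-specific, including the AMI, comes from the MachineSet's template.

```go
package e2e

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/uuid"

	mapiv1beta1 "github.com/openshift/cluster-api/pkg/apis/machine/v1beta1" // assumed import path
)

// machineFromMachineset builds a Machine from an existing MachineSet's
// template, so the test never has to hard-code provider config or an AMI.
func machineFromMachineset(machineset *mapiv1beta1.MachineSet) *mapiv1beta1.Machine {
	randomUUID := string(uuid.NewUUID())

	return &mapiv1beta1.Machine{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: machineset.Namespace,
			Name:      "machine-" + randomUUID[:6],
			Labels:    machineset.Labels,
		},
		// The spec (provider config, AMI, etc.) is copied verbatim from the set.
		Spec: machineset.Spec.Template.Spec,
	}
}
```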

return err
}

// All pods are distrubution evenly among all nodes so it's fine to drain
Member

typo distrubution

Member Author

Oops

glog.Info("Expected result: all pods from the RC up to last one or two got scheduled to a different node while respecting PDB")
return true, nil
}); err != nil {
return err
@enxebre (Member) commented Feb 19, 2019

Where are we validating that the machine object and the backing node do not exist anymore?

Member Author

if the machine object does not get removed, the loop above fails due to

if err := tc.client.Get(context.TODO(), key, &machine); err != nil {
	glog.Errorf("error querying api machine %q object: %v, retrying...", machine1.Name, err)
	return false, nil
}

The backing node is actually removed later (by the cloud controller manager once it goes NotReady). That can happen in 1 minute, in 2 minutes, later, sooner? So we can't verify that.

Member

Yes we can, and we should. We are actually validating that in other test cases.

Member Author

We can do that, though this test is about verifying a node is drained before the machine object is deleted. It does not test the case where a node is deleted, since testing "node is deleted after the linked machine is deleted" is another test case.

@enxebre (Member) commented Feb 19, 2019

This should test e2e features from a product end-user POV. This test is pretty much validating "Pod disruption budget" and eviction (so draining), which is only a part of the user-expected e2e feature. Then it assumes that the pods were drained properly because the machine was deleted (but that could be untrue: correct draining could be the result of buggy controller code that drains the node but then fails to delete the machine and, as a result, the node).
We want to validate explicitly in code exactly the end-user story: when I delete a machine, the node is drained (covered by the test), then the machine is deleted (not covered), then the node is deleted (not covered).
We need another polling loop for the last two.
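
For illustration, a hedged sketch of the extra polling being asked for (the helper name, client type, import paths, intervals, and timeouts are assumptions, not this PR's code): first wait until the Machine object is gone, then until the backing Node is gone.

```go
package e2e

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	mapiv1beta1 "github.com/openshift/cluster-api/pkg/apis/machine/v1beta1" // assumed import path
)

// waitForMachineAndNodeGone is a hypothetical helper sketching the polling
// described above: the machine must disappear first, then the backing node
// (deleted by the cloud controller manager once the instance is gone).
func waitForMachineAndNodeGone(c client.Client, namespace, machineName, nodeName string) error {
	if err := wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
		machine := mapiv1beta1.Machine{}
		err := c.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: machineName}, &machine)
		if apierrors.IsNotFound(err) {
			return true, nil // machine object is gone
		}
		return false, nil // still present (or transient error), keep polling
	}); err != nil {
		return err
	}

	return wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
		node := corev1.Node{}
		err := c.Get(context.TODO(), types.NamespacedName{Name: nodeName}, &node)
		if apierrors.IsNotFound(err) {
			return true, nil // backing node deleted as well
		}
		return false, nil
	})
}
```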

@ingvagabund
Member Author

/retest

@bison

bison commented Feb 19, 2019

I find this a little hard to follow, but I think it looks okay. One thing I don't see us checking is that the node is marked as unschedulable. The drain should mark the node unschedulable and then create evictions when it can. Should we check for that directly rather than just watching pod counts?

@openshift-ci-robot removed the lgtm label on Feb 20, 2019
@ingvagabund
Member Author

I find this a little hard to follow, but I think it looks okay. One thing I don't see us checking is that the node is marked as unschedulable. The drain should mark the node unschedulable and then create evictions when it can. Should we check for that directly rather than just watching pod counts?

We might check whether a node is unschedulable in addition to what we have now. Yet that does not guarantee that pods already scheduled on the node get re-scheduled someplace else. The draining operation makes sure all pods are properly evicted before a node is removed. So the main goal of the test is to verify that all relevant pods (excluding daemon set pods) are removed from the drained node while still making sure the RC has at most one pod unready. So we need to check the pod count as well.
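
As a concrete sketch of that pod-count check (the helper name and the ownership test are assumptions, not this PR's code): count pods still bound to the drained node while skipping DaemonSet-owned pods, and separately assert that the RC keeps at most one pod unready (e.g. ready replicas stay at replicas-1 or above).

```go
package e2e

import corev1 "k8s.io/api/core/v1"

// countNonDaemonSetPodsOnNode is a hypothetical helper: it counts pods that
// are still bound to the drained node, ignoring pods owned by a DaemonSet,
// which drain deliberately leaves in place.
func countNonDaemonSetPodsOnNode(pods *corev1.PodList, nodeName string) int {
	count := 0
	for _, pod := range pods.Items {
		if pod.Spec.NodeName != nodeName {
			continue
		}
		ownedByDaemonSet := false
		for _, ref := range pod.OwnerReferences {
			if ref.Kind == "DaemonSet" {
				ownedByDaemonSet = true
				break
			}
		}
		if !ownedByDaemonSet {
			count++
		}
	}
	return count
}
```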

@ingvagabund
Member Author

/retest

@bison

bison commented Feb 20, 2019

We might check whether a node is unschedulable in addition to what we have now. Yet that does not guarantee that pods already scheduled on the node get re-scheduled someplace else. The draining operation makes sure all pods are properly evicted before a node is removed. So the main goal of the test is to verify that all relevant pods (excluding daemon set pods) are removed from the drained node while still making sure the RC has at most one pod unready. So we need to check the pod count as well.

Right, I get that, but I guess that's kind of my point. That sounds like testing the eviction or scheduling code in a way, not ours. I think we mainly care that the drain process was initiated and that we don't remove the machine prematurely. It's unlikely, but if something else caused the pods to be rescheduled, this could pass. But it's kind of a weird situation because "draining" isn't really a concept in the API, it's just a series of steps some tools take.

Anyway, not saying this is wrong, just that anything we can do to check more directly that Drain() was called is nice.

@ingvagabund
Member Author

ingvagabund commented Feb 20, 2019

Right, I get that, but I guess that's kind of my point. That sounds like testing the eviction or scheduling code in a way, not ours. I think we mainly care that the drain process was initiated and that we don't remove the machine prematurely. It's unlikely, but if something else caused the pods to be rescheduled, this could pass. But it's kind of a weird situation because "draining" isn't really a concept in the API, it's just a series of steps some tools take.

+1

Anyway, not saying this is wrong, just that anything we can do to check more directly that Drain() was called is nice.

Added the check for node unschedulable condition
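
For illustration, a minimal sketch of such a check (tc.client, nodeName, and the polling interval/timeout are assumed from the test's context, not copied from the PR): drain cordons the node first, so the test can assert `Spec.Unschedulable` before watching pod counts.

```go
// Poll until the drained node is marked unschedulable (cordoned).
if err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
	node := corev1.Node{}
	if err := tc.client.Get(context.TODO(), types.NamespacedName{Name: nodeName}, &node); err != nil {
		// Transient errors: keep polling until the timeout expires.
		return false, nil
	}
	return node.Spec.Unschedulable, nil
}); err != nil {
	return err
}
```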

In case the node draining takes too much time or is stuck
in a loop (e.g. missing RBAC rules), time out and allow other
machines to be reconciled.
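
Conceptually, that commit bounds the drain on the controller side; a hedged sketch of the idea (drainNode and the requeue interval are hypothetical stand-ins, not the actual controller code):

```go
// Give the drain a deadline so a drain that cannot make progress
// (e.g. evictions rejected due to missing RBAC rules) does not block
// reconciliation of other machines; the machine is simply requeued.
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
defer cancel()

if err := drainNode(ctx, nodeName); err != nil {
	return reconcile.Result{RequeueAfter: time.Minute}, nil
}
```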
@ingvagabund
Member Author

/retest

@enxebre
Member

enxebre commented Feb 20, 2019

/approve
Agree with @bison that we want to approach the drain abstraction.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label on Feb 20, 2019
@bison left a comment

/lgtm

@openshift-ci-robot added the lgtm label on Feb 20, 2019
@ingvagabund
Member Author

#163 needs to be merged before this gets in.

@ingvagabund
Member Author

/retest

7 similar comments
@ingvagabund
Member Author

/retest

@enxebre
Member

enxebre commented Feb 20, 2019

/retest

@ingvagabund
Member Author

/retest

@ingvagabund
Member Author

/retest

@ingvagabund
Member Author

/retest

@enxebre
Member

enxebre commented Feb 20, 2019

/retest

@ingvagabund
Member Author

/retest

@ingvagabund
Member Author

/test e2e-aws-operator

@openshift-merge-robot merged commit a861de9 into openshift:master on Feb 20, 2019
@ingvagabund deleted the test-node-draining branch on February 20, 2019 at 23:32
michaelgugino pushed a commit to mgugino-upstream-stage/cluster-api-provider-aws that referenced this pull request Feb 12, 2020
Signed-off-by: Vince Prignano <vince@vincepri.com>
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.