
New test for node draining (respecting PDB) #159

Merged
merged 2 commits into openshift:master on Feb 20, 2019
Conversation

@ingvagabund
Member

SSIA

@openshift-ci-robot added the size/L label on Feb 18, 2019
@ingvagabund
Member Author

/retest

@ingvagabund
Member Author

/test e2e-aws

2 similar comments
@ingvagabund
Member Author

/test e2e-aws

@ingvagabund
Member Author

/test e2e-aws

@ingvagabund
Member Author

/test e2e

1 similar comment
@ingvagabund
Member Author

/test e2e

@ingvagabund
Member Author

/test e2e-aws

@ingvagabund
Member Author

/test e2e

@spangenberg

/lgtm

@openshift-ci-robot added the lgtm label on Feb 19, 2019
@ingvagabund
Member Author

/retest

1 similar comment
@ingvagabund
Member Author

/retest

NodeSelector: nodeDrainLabels,
Tolerations: []corev1.Toleration{
{
Key: "kubemark",
Member

kubemark?

Member Author

Yes, so the same test can be used by both aws and kubemark. The toleration has no effect as long as the node is not tainted.
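
For context, a minimal sketch of how such a toleration can be wired into the test's pod template (the helper name and exact field layout are illustrative, not this PR's code): with an `Exists` operator the pod tolerates a "kubemark" taint when one is present, and the toleration is a no-op on untainted AWS nodes.

```go
package e2e

import corev1 "k8s.io/api/core/v1"

// podSpecForDrainTest is a hypothetical helper (not the PR's actual code)
// showing how one pod template can target both AWS and kubemark nodes:
// the toleration only has an effect when a "kubemark" taint exists.
func podSpecForDrainTest(nodeDrainLabels map[string]string) corev1.PodSpec {
	return corev1.PodSpec{
		NodeSelector: nodeDrainLabels,
		Tolerations: []corev1.Toleration{
			{
				Key:      "kubemark",
				Operator: corev1.TolerationOpExists,
			},
		},
	}
}
```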

"node-draining-test": "",
}

func machineFromMachineset(machineset *mapiv1beta1.MachineSet) *mapiv1beta1.Machine {
@enxebre (Member) commented Feb 19, 2019

I suspect you are doing this because the machineSet does not reconcile labels to machines today, so it'd be nice to add a comment/TODO so we can eventually add another, broader test case for just scaling down a set with a machine annotated for priority deletion.

Member Author

It's done to get a machine template. This way I don't need to assume anything about the machine provider config; I just take what is already provided by the cluster. In the case of the OpenShift installer I don't know which AMI I should use, so I take the one rendered by the installer.
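
A rough sketch of what such a helper can look like (the import path, naming, and ObjectMeta details are assumptions; the PR's actual implementation may differ). The point is that everything provider-specific, including the AMI, comes from the MachineSet's template.

```go
package e2e

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/uuid"

	mapiv1beta1 "github.com/openshift/cluster-api/pkg/apis/machine/v1beta1" // assumed import path
)

// machineFromMachineset builds a Machine from an existing MachineSet's
// template, so the test never has to hard-code provider config or an AMI.
func machineFromMachineset(machineset *mapiv1beta1.MachineSet) *mapiv1beta1.Machine {
	randomUUID := string(uuid.NewUUID())

	return &mapiv1beta1.Machine{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: machineset.Namespace,
			Name:      "machine-" + randomUUID[:6],
			Labels:    machineset.Labels,
		},
		// The spec (provider config, AMI, etc.) is copied verbatim from the set.
		Spec: machineset.Spec.Template.Spec,
	}
}
```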

return err
}

// All pods are distrubution evenly among all nodes so it's fine to drain
Member

typo distrubution

Member Author

Oops

glog.Info("Expected result: all pods from the RC up to last one or two got scheduled to a different node while respecting PDB")
return true, nil
}); err != nil {
return err
@enxebre (Member) commented Feb 19, 2019

Where are we validating that the machine object and the backing node do not exist anymore?

Member Author

if the machine object does not get removed, the loop above fails due to

if err := tc.client.Get(context.TODO(), key, &machine); err != nil {
	glog.Errorf("error querying api machine %q object: %v, retrying...", machine1.Name, err)
	return false, nil
}

The backing node is actually removed later (by the cloud controller manager once it goes NotReady). That can happen in 1 minute, in 2 minutes, later, sooner? So we can't verify that.

Member

Yes we can, and we should. We are actually validating that in other test cases.

Member Author

We can do that, though this test is about verifying a node is drained before the machine object is deleted. It does not test the case where a node is deleted, since testing "node is deleted after the linked machine is deleted" is another test case.

@enxebre (Member) commented Feb 19, 2019

This should test e2e features from a product end-user POV. This test is pretty much validating "Pod disruption budget" and eviction (so draining), which is only a part of the user-expected e2e feature. Then it assumes that the pods were drained properly because the machine was deleted (but that could be untrue: correct draining could be the result of buggy controller code that drains the node but then fails to delete the machine and, as a result, the node).
We want to validate explicitly in code exactly the end-user story: when I delete a machine, the node is drained (covered by the test), then the machine is deleted (not covered), then the node is deleted (not covered).
We need another polling loop for the last two.
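
For illustration, a hedged sketch of the extra polling being asked for (the helper name, client type, import paths, intervals, and timeouts are assumptions, not this PR's code): first wait until the Machine object is gone, then until the backing Node is gone.

```go
package e2e

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	mapiv1beta1 "github.com/openshift/cluster-api/pkg/apis/machine/v1beta1" // assumed import path
)

// waitForMachineAndNodeGone is a hypothetical helper sketching the polling
// described above: the machine must disappear first, then the backing node
// (deleted by the cloud controller manager once the instance is gone).
func waitForMachineAndNodeGone(c client.Client, namespace, machineName, nodeName string) error {
	if err := wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
		machine := mapiv1beta1.Machine{}
		err := c.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: machineName}, &machine)
		if apierrors.IsNotFound(err) {
			return true, nil // machine object is gone
		}
		return false, nil // still present (or transient error), keep polling
	}); err != nil {
		return err
	}

	return wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
		node := corev1.Node{}
		err := c.Get(context.TODO(), types.NamespacedName{Name: nodeName}, &node)
		if apierrors.IsNotFound(err) {
			return true, nil // backing node deleted as well
		}
		return false, nil
	})
}
```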

@ingvagabund
Member Author

/retest

@bison

bison commented Feb 19, 2019

I find this a little hard to follow, but I think it looks okay. One thing I don't see us checking is that the node is marked as unschedulable. The drain should mark the node unschedulable and then create evictions when it can. Should we check for that directly rather than just watching pod counts?

@openshift-ci-robot removed the lgtm label on Feb 20, 2019
@ingvagabund
Member Author

I find this a little hard to follow, but I think it looks okay. One thing I don't see us checking is that the node is marked as unschedulable. The drain should mark the node unschedulable and then create evictions when it can. Should we check for that directly rather than just watching pod counts?

We might check whether a node is unschedulable in addition to what we have now. Yet that does not guarantee that pods already scheduled on the node get re-scheduled someplace else. The draining operation makes sure all pods are properly evicted before a node is removed. So the main goal of the test is to verify that all relevant pods (excluding daemon set pods) are removed from the drained node while still making sure the RC has at most one pod unready. So we need to check the pod count as well.
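
As a concrete sketch of that pod-count check (the helper name and the ownership test are assumptions, not this PR's code): count pods still bound to the drained node while skipping DaemonSet-owned pods, and separately assert that the RC keeps at most one pod unready (e.g. ready replicas stay at replicas-1 or above).

```go
package e2e

import corev1 "k8s.io/api/core/v1"

// countNonDaemonSetPodsOnNode is a hypothetical helper: it counts pods that
// are still bound to the drained node, ignoring pods owned by a DaemonSet,
// which drain deliberately leaves in place.
func countNonDaemonSetPodsOnNode(pods *corev1.PodList, nodeName string) int {
	count := 0
	for _, pod := range pods.Items {
		if pod.Spec.NodeName != nodeName {
			continue
		}
		ownedByDaemonSet := false
		for _, ref := range pod.OwnerReferences {
			if ref.Kind == "DaemonSet" {
				ownedByDaemonSet = true
				break
			}
		}
		if !ownedByDaemonSet {
			count++
		}
	}
	return count
}
```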

@ingvagabund
Member Author

/retest

@bison

bison commented Feb 20, 2019

We might check whether a node is unschedulable in addition to what we have now. Yet that does not guarantee that pods already scheduled on the node get re-scheduled someplace else. The draining operation makes sure all pods are properly evicted before a node is removed. So the main goal of the test is to verify that all relevant pods (excluding daemon set pods) are removed from the drained node while still making sure the RC has at most one pod unready. So we need to check the pod count as well.

Right, I get that, but I guess that's kind of my point. That sounds like testing the eviction or scheduling code in a way, not ours. I think we mainly care that the drain process was initiated and that we don't remove the machine prematurely. It's unlikely, but if something else caused the pods to be rescheduled, this could pass. But it's kind of a weird situation because "draining" isn't really a concept in the API, it's just a series of steps some tools take.

Anyway, not saying this is wrong, just that anything we can do to check more directly that Drain() was called is nice.

@ingvagabund
Member Author

ingvagabund commented Feb 20, 2019

Right, I get that, but I guess that's kind of my point. That sounds like testing the eviction or scheduling code in a way, not ours. I think we mainly care that the drain process was initiated and that we don't remove the machine prematurely. It's unlikely, but if something else caused the pods to be rescheduled, this could pass. But it's kind of a weird situation because "draining" isn't really a concept in the API, it's just a series of steps some tools take.

+1

Anyway, not saying this is wrong, just that anything we can do to check more directly that Drain() was called is nice.

Added the check for node unschedulable condition
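
For illustration, a minimal sketch of such a check (tc.client, nodeName, and the polling interval/timeout are assumed from the test's context, not copied from the PR): drain cordons the node first, so the test can assert `Spec.Unschedulable` before watching pod counts.

```go
// Poll until the drained node is marked unschedulable (cordoned).
if err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
	node := corev1.Node{}
	if err := tc.client.Get(context.TODO(), types.NamespacedName{Name: nodeName}, &node); err != nil {
		// Transient errors: keep polling until the timeout expires.
		return false, nil
	}
	return node.Spec.Unschedulable, nil
}); err != nil {
	return err
}
```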

In case the node draining takes too much time or is stuck
in a loop (e.g. missing RBAC rules), time out and allow other
machines to be reconciled.
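
Conceptually, that commit bounds the drain on the controller side; a hedged sketch of the idea (drainNode and the requeue interval are hypothetical stand-ins, not the actual controller code):

```go
// Give the drain a deadline so a drain that cannot make progress
// (e.g. evictions rejected due to missing RBAC rules) does not block
// reconciliation of other machines; the machine is simply requeued.
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
defer cancel()

if err := drainNode(ctx, nodeName); err != nil {
	return reconcile.Result{RequeueAfter: time.Minute}, nil
}
```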
@ingvagabund
Member Author

/retest

@enxebre
Member

enxebre commented Feb 20, 2019

/approve
Agree with @bison that we want to approach the drain abstraction.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label on Feb 20, 2019
@bison left a comment

/lgtm

@openshift-ci-robot added the lgtm label on Feb 20, 2019
@ingvagabund
Member Author

#163 needs to be merged before this gets in.

@ingvagabund
Member Author

/retest

7 similar comments
@ingvagabund
Member Author

/retest

@enxebre
Member

enxebre commented Feb 20, 2019

/retest

@ingvagabund
Member Author

/retest

@ingvagabund
Member Author

/retest

@ingvagabund
Member Author

/retest

@enxebre
Member

enxebre commented Feb 20, 2019

/retest

@ingvagabund
Member Author

/retest

@ingvagabund
Member Author

/test e2e-aws-operator

@openshift-merge-robot merged commit a861de9 into openshift:master on Feb 20, 2019
@ingvagabund deleted the test-node-draining branch on February 20, 2019 at 23:32
michaelgugino pushed a commit to mgugino-upstream-stage/cluster-api-provider-aws that referenced this pull request Feb 12, 2020
Signed-off-by: Vince Prignano <vince@vincepri.com>
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.