New test for node draining (respecting PDB) #159
Conversation
/retest
/test e2e-aws
2 similar comments
/test e2e-aws
/test e2e-aws
/test e2e
1 similar comment
/test e2e
/test e2e-aws
/test e2e
/lgtm
/retest
1 similar comment
/retest
NodeSelector: nodeDrainLabels,
Tolerations: []corev1.Toleration{
	{
		Key: "kubemark",
kubemark?
Yes, so the same test can be used by both aws and kubemark. The toleration has no effect as long as the node is not tainted.
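For context, a minimal sketch of how that pod spec fragment might look once completed; the diff above is truncated, so the Operator value here is an assumption:

```go
// Sketch only: the "kubemark" toleration is a no-op on untainted (e.g. AWS)
// nodes, but lets the same test pods schedule onto tainted kubemark nodes.
// The Operator value is a guess; the original diff cuts off after Key.
podSpec := corev1.PodSpec{
	NodeSelector: nodeDrainLabels,
	Tolerations: []corev1.Toleration{
		{
			Key:      "kubemark",
			Operator: corev1.TolerationOpExists, // match any value of a "kubemark" taint
		},
	},
}
```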
"node-draining-test": "", | ||
} | ||
|
||
func machineFromMachineset(machineset *mapiv1beta1.MachineSet) *mapiv1beta1.Machine { |
I suspect you are doing this because MachineSet does not reconcile labels to machines today, so it'd be nice to add a comment/TODO; eventually we can add another, broader test case for just scaling down a set with a machine annotated for priority deletion.
It's done to get a machine template. This way I don't need to assume anything about the machine provider config; I just take what is already provided by the cluster. In the case of the openshift installer I don't know which AMI I should use, so I take the one that is rendered by the installer.
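A hedged sketch of what such a helper could look like under that approach; everything beyond the signature shown in the diff is an assumption, including the metav1 import and the label copy:

```go
// Sketch only: derive a new Machine from an existing MachineSet so the test
// never has to guess provider-specific details (AMI, instance type, ...);
// it reuses whatever machine template the installer already rendered.
func machineFromMachineset(machineset *mapiv1beta1.MachineSet) *mapiv1beta1.Machine {
	return &mapiv1beta1.Machine{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: machineset.Name + "-",
			Namespace:    machineset.Namespace,
			Labels:       nodeDrainLabels, // label it so the drain test can select it
		},
		// Copy the spec (including the opaque ProviderSpec) from the
		// MachineSet's machine template.
		Spec: machineset.Spec.Template.Spec,
	}
}
```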
test/e2e/provider_expectations.go
	return err
}

// All pods are distrubution evenly among all nodes so it's fine to drain
typo distrubution
Oops
glog.Info("Expected result: all pods from the RC up to last one or two got scheduled to a different node while respecting PDB") | ||
return true, nil | ||
}); err != nil { | ||
return err |
Where are we validating that the machine object and the backing node no longer exist?
If the machine object does not get removed, the loop above fails due to:
if err := tc.client.Get(context.TODO(), key, &machine); err != nil {
glog.Errorf("error querying api machine %q object: %v, retrying...", machine1.Name, err)
return false, nil
}
The backing node is actually removed later (by the cloud controller manager once it goes NotReady). That can happen in 1 minute, in 2 minutes, later, or sooner, so we can't verify it deterministically.
Yes we can, and we should. We are actually validating that in other test cases.
We can do it, though this test is about verifying a node is drained before the machine object is deleted. It does not test the case where a node is deleted, since "node is deleted after the linked machine is deleted" is another test case.
This should test e2e features from a product end user's point of view. This test is pretty much validating the pod disruption budget and eviction (so draining), which is only part of the e2e feature the user expects. It then assumes the pods were drained properly because the machine was deleted, but that could be untrue: correct draining could be the result of buggy controller code that drains the node but then fails to delete the machine and, as a result, the node.
We want to validate in code explicitly the exact end user story: when I delete a machine, the node is drained (covered by the test), then the machine is deleted (not covered), then the node is deleted (not covered).
We need another polling loop for the last two; see the sketch below.
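For illustration, a hedged sketch of those two extra checks, reusing the test's tc.client; machineKey and nodeKey are hypothetical client.ObjectKey values, and the interval and timeout are assumptions (wait is k8s.io/apimachinery/pkg/util/wait, apierrors is k8s.io/apimachinery/pkg/api/errors):

```go
// Sketch only: after the drain check, verify the machine object and its
// backing node are eventually gone, completing the end user story above.
err := wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
	machine := mapiv1beta1.Machine{}
	if err := tc.client.Get(context.TODO(), machineKey, &machine); !apierrors.IsNotFound(err) {
		return false, nil // machine still exists (or transient error), retry
	}
	node := corev1.Node{}
	if err := tc.client.Get(context.TODO(), nodeKey, &node); !apierrors.IsNotFound(err) {
		return false, nil // node still exists (or transient error), retry
	}
	return true, nil
})
```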
/retest
I find this a little hard to follow, but I think it looks okay. One thing I don't see us checking is that the node is marked as unschedulable. The drain should mark the node unschedulable, then create evictions when it can. Should we check for that directly rather than just watching pod counts?
We might check if a node is marked unschedulable.
/retest
Right, I get that, but I guess that's kind of my point. That sounds like testing the eviction or scheduling code in a way, not ours. I think we mainly care that the drain process was initiated and that we don't remove the machine prematurely. It's unlikely, but if something else caused the pods to be rescheduled, this could pass. But it's kind of a weird situation because "draining" isn't really a concept in the API; it's just a series of steps some tools take. Anyway, not saying this is wrong, just that anything we can do to check more directly that the drain itself happened would be good.
+1
Added the check for the node unschedulable condition.
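Roughly, such a check could look like this inside the existing poll (a sketch only; the nodeKey lookup is an assumption about how the test tracks the drained node):

```go
// Sketch only: verify the drain cordoned the node before trusting pod counts.
// Spec.Unschedulable is the field `kubectl cordon` sets.
node := corev1.Node{}
if err := tc.client.Get(context.TODO(), nodeKey, &node); err != nil {
	glog.Errorf("error querying api node object: %v, retrying...", err)
	return false, nil
}
if !node.Spec.Unschedulable {
	glog.Info("node not yet marked unschedulable, retrying...")
	return false, nil
}
```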
In case the node draining takes too much time or is stuck in a loop (e.g. missing RBAC rules), time out and allow other machines to be reconciled.
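That behavior might be sketched like this; drainNode and the 20-second budget are hypothetical, not the controller's actual code:

```go
// Sketch only: bound the drain so a stuck eviction (e.g. missing RBAC rules)
// cannot block reconciliation forever. Returning the error requeues this
// machine, letting other machines be reconciled in the meantime.
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
defer cancel()
if err := drainNode(ctx, node); err != nil {
	return fmt.Errorf("failed to drain node %q within budget: %v (requeuing)", node.Name, err)
}
```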
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: enxebre
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
#163 needs to be merged before this gets merged.
/retest
7 similar comments
/retest
/retest
/retest
/retest
/retest
/retest
/retest
/test e2e-aws-operator
Signed-off-by: Vince Prignano <vince@vincepri.com>
SSIA (subject says it all).