E2E test to verify pod failover during node power-off #54509
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Signed cncf-cla.
/assign @saad-ali this looks to be storage related
@jeffvance Can you help review this PR?
/ok-to-test
Force-pushed e5844d4 to 7227c87
Force-pushed 7227c87 to 1bd279f
if err != nil {
    return nil, fmt.Errorf("deployment %q Create API error: %v", deploymentSpec.Name, err)
}
glog.Infof(fmt.Sprintf("Waiting deployment %q to complete", deploymentSpec.Name))
Is there a reason to be using this vs. framework.Logf, which is the typical e2e log func?
You are right - we should use framework.Logf here for consistency. Will fix.
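A minimal sketch of the fix, keeping the same message (framework.Logf already does printf-style formatting, so the fmt.Sprintf wrapper can go away along with glog):

// framework.Logf formats its arguments itself, no fmt.Sprintf needed.
framework.Logf("Waiting deployment %q to complete", deploymentSpec.Name)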
test/e2e/storage/vsphere_utils.go (Outdated)
if currentState == expectedState {
    if expectedState == volumeStateDetached {
        framework.Logf("Volume %q has successfully detached from %q.", volumePath, nodeName)
A state map could simplify this and the code below. E.g.:

var attachedState = map[bool]int{
    true:  volumeStateAttached,
    false: volumeStateDetached,
}

var attachedStateMsgs = map[int]string{
    volumeStateAttached: "attached to",
    volumeStateDetached: "detached from",
}

...
// in wait.Poll func...
...
currentState := attachedState[diskAttached]
if currentState == expectedState {
    framework.Logf("Volume %q has successfully %s %q", volumePath, attachedStateMsgs[currentState], nodeName)
    return true, nil
}
framework.Logf("Waiting for Volume %q to be %s %q.", volumePath, attachedStateMsgs[expectedState], nodeName)
return false, nil
})
test/e2e/storage/vsphere_utils.go (Outdated)
return fmt.Errorf("Gave up waiting for Volume %q to detach from %q after %v", volumePath, nodeName, detachTimeout)

if currentState != expectedState {
    if expectedState == volumeStateDetached {
continuing...
if currentState != expectedState {
    err = fmt.Errorf("Gave up waiting for Volume %q to be %s %q after %v", volumePath, attachedStateMsgs[expectedState], nodeName, timeout)
}
Will do.
"k8s.io/apimachinery/pkg/types" | ||
"k8s.io/apimachinery/pkg/util/wait" | ||
clientset "k8s.io/client-go/kubernetes" | ||
vsphere "k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere" |
why is "vsphere" alias needed here?
Fixed.
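For reference, the fix is simply dropping the redundant alias, since the imported package is already named vsphere; the clientset alias stays because that package is named kubernetes:

import (
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/apimachinery/pkg/util/wait"
    clientset "k8s.io/client-go/kubernetes"
    "k8s.io/kubernetes/pkg/cloudprovider/providers/vsphere"
)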
vsp, err = vsphere.GetVSphere()
Expect(err).NotTo(HaveOccurred())
workingDir = os.Getenv("VSPHERE_WORKING_DIR")
Expect(workingDir).NotTo(BeEmpty())
any sanity check on the value of VSPHERE_WORKING_DIR?
I agree it's nice to add some sanity check, but it's not really straightforward to verify the validity of this value. So I'd prefer to keep it as is for now.
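For what it's worth, a lightweight check that stops short of validating the path against vCenter is still possible. A hypothetical sketch (not part of this PR, uses the standard strings package):

// Hypothetical: fail fast if unset, and normalize the trailing "/" so
// later path concatenation is safe however the operator set the variable.
workingDir = os.Getenv("VSPHERE_WORKING_DIR")
Expect(workingDir).NotTo(BeEmpty())
if !strings.HasSuffix(workingDir, "/") {
    workingDir += "/"
}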
By("Creating a Deployment")
deployment, err := framework.CreateDeployment(client, int32(1), map[string]string{"test": "app"}, namespace, pvclaims, "")
defer client.Extensions().Deployments(namespace).Delete(deployment.Name, &metav1.DeleteOptions{})
Glad to see all the defers.
f := find.NewFinder(govMoMiClient.Client, true)
ctx, _ := context.WithCancel(context.Background())

vmPath := workingDir + string(node1)
what if the supplied workingDir does not end in a "/"? You may want to use filepath.Join here.
Sure, will fix.
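A minimal sketch of the suggested change, assuming the variables from the snippet above:

// filepath.Join inserts the separator only when needed, so workingDir
// with or without a trailing "/" yields the same VM path.
vmPath := filepath.Join(workingDir, string(node1))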
})

if err != nil {
    framework.Logf("Pod did not fail over from %q with error: %v", oldNode, err)
Is the timeout value logged too? Can be useful in debugging in case 3m is later deemed too short...
Will add a log for timeout.
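A sketch of the extra logging, assuming a timeout variable is in scope at the call site:

if err != nil {
    // Include the timeout so a too-short value is visible in the logs.
    framework.Logf("Pod did not fail over from %q within %v: %v", oldNode, timeout, err)
}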
node2, err := waitForPodToFailover(client, deployment, node1)
Expect(err).NotTo(HaveOccurred(), "Pod did not fail over to a different node")

By(fmt.Sprintf("Waiting for disk to be attached to the new node: %v", node2))
why wait for attached before waiting for detach?
The disk will be attached to the new node almost immediately, but it takes a few minutes to detach from the previous node. The attach doesn't really depend on the detach here.
Expect(err).NotTo(HaveOccurred(), "Disk is not attached to the node")

By(fmt.Sprintf("Waiting for disk to be detached from the previous node: %v", node1))
err = waitForVSphereDiskToDetach(vsp, volumePath, node1)
this works ok even when node1 has powered off?
Yes, this check doesn't communicate with the node which has been powered off.
Expect(err).NotTo(HaveOccurred(), "Disk is not detached from the node")

By(fmt.Sprintf("Power on the previous node: %v", node1))
vm.PowerOn(ctx)
does this interfere in any way with the defer'd poweron call which will be unconditionally executed?
That won't happen. If this operation gets executed, it means all test steps have passed smoothly, so the defer'd poweron call will just be a no-op that returns successfully.
We add this step to make the test case more robust: it makes sure the node gets powered on before other tests run.
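The pattern under discussion, sketched with the govmomi calls used in this test (identifiers are illustrative, not copied from the PR):

// Power off node1, and guarantee it comes back even if an assertion
// below fails and aborts the test early.
vm.PowerOff(ctx)
defer vm.PowerOn(ctx) // no-op if the explicit PowerOn below already ran

// ... failover and attach/detach assertions ...

// Explicit power-on at the end of the happy path, so the node is back
// before subsequent tests run; the deferred call then succeeds trivially.
vm.PowerOn(ctx)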
/retest
/unassign
Force-pushed 7beb2aa to d0ffa09
@jeffvance Can I get a ship-it for this PR?
@shaominchen It looks like you need the CLA from the Linux Foundation.
/assign @saad-ali Hi Saad, I have already signed up for the CLA from the Linux Foundation. I have no idea why it still shows the "cncf-cla: no" label here. Could you let me know what I can do on my side?
/test pull-kubernetes-node-e2e
/approve
Looks like you authored the commits in this PR with the email address
/approve no-issue
Force-pushed d0ffa09 to 3db4f2b
@saad-ali Thanks for pointing it out! I've set up my git user.email properly and re-pushed the commit (also squashed 2 commits into 1).
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jeffvance, saad-ali, shaominchen
Associated issue requirement bypassed by: saad-ali
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing
/retest
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
What this PR does / why we need it:
This PR adds a test that verifies volume status after the node on which the pod was provisioned is powered off and the pod fails over to a different node.
The test performs the following tasks:
Which issue this PR fixes:
Fixes vmware-archive#272
Special notes for your reviewer:
Test logs:
VMware Reviewers: @divyenpatel @pshahzeb
Release note: