Fix topology hints test #107548
Conversation
Welcome @lzhecheng!
@lzhecheng: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @lzhecheng. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
test/e2e/network/topology_hints.go
Outdated
@@ -153,7 +153,7 @@ var _ = common.SIGDescribe("Feature:Topology Hints", func() {
 			framework.Failf("Expected zone to be specified for %s node", nodeName)
 		}
 		ginkgo.By("creating a client pod for probing the service from " + fromZone)
-		podName := "curl-from-" + fromZone
+		podName := fmt.Sprintf("client%d-curl-from-%s", i, fromZone)
I think we leaked the environment assumptions into the test; this is an example of an execution:
[BeforeEach] [sig-network] Feature:Topology Hints
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/topology_hints.go:47
[It] should distribute endpoints evenly
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/network/topology_hints.go:52
Jan 14 06:43:17.801: INFO: DaemonSet pods can't tolerate node kind-control-plane with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Jan 14 06:43:17.808: INFO: Number of nodes with available pods controlled by daemonset topology-serve-hostname: 0
Jan 14 06:43:17.808: INFO: Node kind-worker is running 0 daemon pod, expected 1
Jan 14 06:43:22.799: INFO: DaemonSet pods can't tolerate node kind-control-plane with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Jan 14 06:43:22.803: INFO: Number of nodes with available pods controlled by daemonset topology-serve-hostname: 3
Jan 14 06:43:22.803: INFO: Number of running nodes: 3, number of available pods: 3 in daemonset topology-serve-hostname
Jan 14 06:43:22.808: INFO: DaemonSet pods can't tolerate node kind-control-plane with taints [{Key:node-role.kubernetes.io/master Value: Effect:NoSchedule TimeAdded:<nil>}], skip checking this node
Jan 14 06:43:22.808: INFO: Waiting for 3 endpoints to be tracked in EndpointSlices
STEP: having hints set for each endpoint
STEP: keeping requests in the same zone
STEP: creating a client pod for probing the service from zone-a
Jan 14 06:43:27.830: INFO: The status of Pod curl-from-zone-a is Pending, waiting for it to be Running (with Ready = true)
Jan 14 06:43:29.834: INFO: The status of Pod curl-from-zone-a is Pending, waiting for it to be Running (with Ready = true)
Jan 14 06:43:31.835: INFO: The status of Pod curl-from-zone-a is Running (Ready = true)
Jan 14 06:43:31.837: INFO: Ensuring that requests from curl-from-zone-a pod on kind-worker node stay in zone-a zone
Jan 14 06:43:36.850: INFO: Pod client logs: Fri Jan 14 06:43:29 UTC 2022
Date: Fri Jan 14 06:43:30 UTC 2022 Try: 1
topology-serve-hostname-cvlkk
Date: Fri Jan 14 06:43:31 UTC 2022 Try: 2
topology-serve-hostname-cvlkk
Date: Fri Jan 14 06:43:32 UTC 2022 Try: 3
topology-serve-hostname-cvlkk
Date: Fri Jan 14 06:43:33 UTC 2022 Try: 4
topology-serve-hostname-cvlkk
Date: Fri Jan 14 06:43:34 UTC 2022 Try: 5
topology-serve-hostname-cvlkk
Date: Fri Jan 14 06:43:35 UTC 2022 Try: 6
topology-serve-hostname-cvlkk
Date: Fri Jan 14 06:43:36 UTC 2022 Try: 7
topology-serve-hostname-cvlkk
STEP: creating a client pod for probing the service from zone-b
Jan 14 06:43:36.858: INFO: The status of Pod curl-from-zone-b is Pending, waiting for it to be Running (with Ready = true)
Jan 14 06:43:38.866: INFO: The status of Pod curl-from-zone-b is Pending, waiting for it to be Running (with Ready = true)
Jan 14 06:43:40.863: INFO: The status of Pod curl-from-zone-b is Running (Ready = true)
Jan 14 06:43:40.866: INFO: Ensuring that requests from curl-from-zone-b pod on kind-worker2 node stay in zone-b zone
Jan 14 06:43:45.873: INFO: Pod client logs: Fri Jan 14 06:43:38 UTC 2022
Date: Fri Jan 14 06:43:39 UTC 2022 Try: 1
topology-serve-hostname-fx8kl
Date: Fri Jan 14 06:43:40 UTC 2022 Try: 2
topology-serve-hostname-fx8kl
Date: Fri Jan 14 06:43:41 UTC 2022 Try: 3
topology-serve-hostname-fx8kl
Date: Fri Jan 14 06:43:42 UTC 2022 Try: 4
topology-serve-hostname-fx8kl
Date: Fri Jan 14 06:43:43 UTC 2022 Try: 5
topology-serve-hostname-fx8kl
Date: Fri Jan 14 06:43:44 UTC 2022 Try: 6
topology-serve-hostname-fx8kl
Date: Fri Jan 14 06:43:45 UTC 2022 Try: 7
topology-serve-hostname-fx8kl
STEP: creating a client pod for probing the service from zone-c
Jan 14 06:43:45.882: INFO: The status of Pod curl-from-zone-c is Pending, waiting for it to be Running (with Ready = true)
Jan 14 06:43:47.888: INFO: The status of Pod curl-from-zone-c is Pending, waiting for it to be Running (with Ready = true)
Jan 14 06:43:49.886: INFO: The status of Pod curl-from-zone-c is Pending, waiting for it to be Running (with Ready = true)
Jan 14 06:43:51.887: INFO: The status of Pod curl-from-zone-c is Pending, waiting for it to be Running (with Ready = true)
Jan 14 06:43:53.886: INFO: The status of Pod curl-from-zone-c is Running (Ready = true)
Jan 14 06:43:53.892: INFO: Ensuring that requests from curl-from-zone-c pod on kind-worker3 node stay in zone-c zone
Jan 14 06:43:58.900: INFO: Pod client logs: Fri Jan 14 06:43:46 UTC 2022
Date: Fri Jan 14 06:43:47 UTC 2022 Try: 1
If I recall correctly, this loop has to iterate over one node of each zone, so it checks that, for all existing zones in the cluster, the traffic remains in the zone.
Based on that, the name is correct; what is not correct is the loop. Instead of iterating over the nodeNames and breaking at 3, it has to iterate over the zones and pick one node of each zone. That will also guarantee that the pod names are unique (a sketch follows below).
Let's wait for @robscott for confirmation.
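A minimal sketch of that restructuring, assuming the test's clientset `c` and the usual e2e framework imports (the map name matches what the PR later introduces; everything else is illustrative):

nodeList, err := c.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
framework.ExpectNoError(err)

// Group node names by zone; v1.LabelTopologyZone is "topology.kubernetes.io/zone".
zonesWithNodes := map[string][]string{}
for _, node := range nodeList.Items {
	zone, ok := node.Labels[v1.LabelTopologyZone]
	if !ok {
		framework.Failf("Expected zone to be specified for %s node", node.Name)
	}
	zonesWithNodes[zone] = append(zonesWithNodes[zone], node.Name)
}

// One iteration per zone, so each fromZone (and thus each pod name) is unique.
for fromZone, nodes := range zonesWithNodes {
	nodeName := nodes[0]
	podName := "curl-from-" + fromZone
	framework.Logf("probing from pod %s on node %s", podName, nodeName)
	// ... create the client pod here and assert traffic stays in fromZone
}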
Yep, this test assumes a variety of things about the cluster it is running in. It would make sense to update this test to support more flexible deployments and bail out/error if it is in an unsupported one.
For reference, this test currently assumes the following:
1. All nodes have equivalent allocatable CPU
2. There are 3 nodes, each in a different zone
3. The TopologyAwareHints feature gate is enabled
I don't think we have a way to verify 3, but we could add a check for 1 and refactor the test to work when the number of nodes/zones differs. We would still need nodes to exist in more than one zone for this test to work, though. A sketch of the check for 1 follows below.
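For reference, a hedged sketch of what the check for assumption 1 could look like (it mirrors the fragments visible in the diffs further down; `resource` is k8s.io/apimachinery/pkg/api/resource, and `nodeList` is assumed to hold the schedulable Nodes):

// Skip rather than fail on heterogeneous CPU: that is an unsupported
// environment rather than a product bug.
var lastNodeCPU resource.Quantity
firstNode := true
for _, node := range nodeList.Items {
	nodeCPU, found := node.Status.Allocatable[v1.ResourceCPU]
	if !found {
		framework.Failf("Error when getting allocatable CPU of Node '%s'", node.Name)
	}
	if firstNode {
		lastNodeCPU = nodeCPU
		firstNode = false
	} else if !nodeCPU.Equal(lastNodeCPU) {
		e2eskipper.Skipf("Expected Nodes to have equivalent allocatable CPUs, but Node '%s' differs: %s != %s",
			node.Name, nodeCPU.String(), lastNodeCPU.String())
	}
}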
Don't add new helper functions; until it's clear we can't reuse them, I prefer to keep that code within the test.
Another thing: you can do a List instead of a Get per node. It is going to be more efficient and reduce flakiness. See the sketch below.
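A minimal sketch of the difference, assuming `c` is the test's clientset and `nodeNames` holds the schedulable node names (illustrative, not the final code):

// Before: one API round-trip per node name.
for _, name := range nodeNames {
	node, err := c.CoreV1().Nodes().Get(context.TODO(), name, metav1.GetOptions{})
	framework.ExpectNoError(err)
	framework.Logf("node %s is in zone %s", node.Name, node.Labels[v1.LabelTopologyZone])
}

// After: a single List call, inspected client-side; fewer API calls, less flakiness.
nodeList, err := c.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
framework.ExpectNoError(err)
for _, node := range nodeList.Items {
	framework.Logf("node %s is in zone %s", node.Name, node.Labels[v1.LabelTopologyZone])
}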
Makes sense. Code is updated.
/ok-to-test
@lzhecheng I think we don't need a release note for this change, but let's wait for Rob to reply to my comment.
@aojea sounds reasonable. Thanks!
/release-note-none
Force-pushed from fc56d45 to 0649e8a
		}
		nodeCPU, found := node.Status.Allocatable[v1.ResourceCPU]
		if !found {
			framework.Failf("Error when getting allocatable CPU of Node '%s'", node.Name)
@robscott fail or skip?
Fail seems appropriate. If the test has not been skipped already, we're likely intentionally trying to run it, so failing will likely provide more noticeable/meaningful output.
/test
@aojea: The specified target(s) for `/test` were not found.
The following commands are available to trigger optional jobs:
Use `/test all` to run all jobs.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-kubernetes-e2e-kind-multizone
Force-pushed from 0649e8a to befbc7b
/test pull-kubernetes-e2e-kind-multizone
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea, lzhecheng. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for your work on this @lzhecheng!
test/e2e/network/topology_hints.go
Outdated
nodesByZone := map[string]string{}
zonesWithNodes := map[string][]string{}
Seems like all we really need is one node name per zone, so a simpler data structure like `nodeForZone := map[zone]nodeName{}` would work (where zone and node name would just be strings).
Good idea!
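A minimal sketch of that simpler structure, assuming `nodeList` from the single List call discussed above (names illustrative):

nodeForZone := map[string]string{} // zone -> node name, both plain strings
for _, node := range nodeList.Items {
	zone := node.Labels[v1.LabelTopologyZone]
	if _, seen := nodeForZone[zone]; !seen {
		nodeForZone[zone] = node.Name // keep one representative node per zone
	}
}
for fromZone, nodeName := range nodeForZone {
	framework.Logf("probing the service from node %s in %s", nodeName, fromZone)
	// ... create the client pod and assert traffic stays in fromZone
}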
@@ -124,12 +152,12 @@ var _ = common.SIGDescribe("Feature:Topology Hints", func() {
		}
	}

	nodeList, err := c.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	framework.ExpectNoError(err)
	nodesByZone := map[string]string{}
I don't think we need this anymore.
Actually I think it is needed because `podsByZone` still needs it to be set up.
test/e2e/network/topology_hints.go
Outdated
@@ -83,6 +84,33 @@ var _ = common.SIGDescribe("Feature:Topology Hints", func() {
	framework.ExpectNoError(err, "timed out waiting for DaemonSets to be ready")

	nodeNames := e2edaemonset.SchedulableNodes(c, ds)

	// All Nodes should have same allocatable CPUs. If not, then skip the test.
	nodeNamesMap := map[string]bool{}
Maybe a slight rename to indicate that this was specific to schedulable Nodes? Not tied to this specific name, but would help make the rest of the code a bit easier to understand.
-nodeNamesMap := map[string]bool{}
+schedulableNodes := map[string]bool{}
done
test/e2e/network/topology_hints.go
Outdated
		lastNodeCPU = nodeCPU
		firstNode = false
	} else if !nodeCPU.Equal(lastNodeCPU) {
		e2eskipper.Skipf("Not all Nodes have same allocatable CPUs. Value of Node '%s' is different from the last one. %d not equals %d",
Maybe reword this to clarify what we expected here.
e2eskipper.Skipf("Not all Nodes have same allocatable CPUs. Value of Node '%s' is different from the last one. %d not equals %d", | |
e2eskipper.Skipf("Expected Nodes to have equivalent allocatable CPUs, but Node '%s' is different from the previous one. %d not equals %d", |
done
test/e2e/network/topology_hints.go
Outdated
framework.Failf("Expected zone to be specified for %s node", nodeName) | ||
} | ||
for fromZone, nodes := range zonesWithNodes { | ||
nodeName := nodes[0] |
If we move to `map[string]string{}` as I suggested above, I think this will be a non-issue.
Force-pushed from bfde0e9 to 641327e
* It should check one Node in a zone instead of each Node and its fromZone.
* Check Nodes' CPUs if they are equivalent.

Signed-off-by: Zhecheng Li <zhechengli@microsoft.com>
Force-pushed from 641327e to 9292742
/test pull-kubernetes-e2e-kind-multizone
/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-e2e-kind-ipv6
Thanks @lzhecheng! /lgtm
/retest
/cherrypick release-1.23
Signed-off-by: Zhecheng Li <zhechengli@microsoft.com>
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: