update ResourceMetricsAPI tests #93868
Conversation
ginkgo.By("Waiting 15 seconds for cAdvisor to collect 2 stats points") | ||
time.Sleep(15 * time.Second) | ||
}) | ||
ginkgo.It("should report resource usage through the v1alpha1 resouce metrics api", func() { | ||
ginkgo.It("should report resource usage through the resouce metrics api", func() { |
nitpick:
ginkgo.It("should report resource usage through the resouce metrics api", func() { | |
ginkgo.It("should report resource usage through the resource metrics api", func() { |
thank you.
Just for my understanding, I see that the alpha endpoint is still being registered: kubernetes/pkg/kubelet/server/server.go Lines 381 to 387 in da5ec16
Were tests failing because of a timeout? As for the change, even if the alpha endpoint is still registered, I don't think it's worth validating it, and the change looks good: /lgtm
looks like bazel update is needed.
ginkgo.By("Waiting for test pods to restart the desired number of times") | ||
gomega.Eventually(func() error { | ||
for _, pod := range pods { | ||
err := verifyPodRestartCount(f, pod.Name, len(pod.Spec.Containers), numRestarts) |
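The hunk above is truncated; as a rough sketch (not necessarily the PR's exact code), the complete wait loop could look like the following, assuming f, pods, numRestarts, and verifyPodRestartCount all come from the surrounding e2e_node test, and with illustrative timeout/poll values:

// Sketch: wait until every container in every test pod has restarted
// numRestarts times, polling until the deadline is reached.
ginkgo.By("Waiting for test pods to restart the desired number of times")
gomega.Eventually(func() error {
	for _, pod := range pods {
		err := verifyPodRestartCount(f, pod.Name, len(pod.Spec.Containers), numRestarts)
		if err != nil {
			return err
		}
	}
	return nil
}, 2*time.Minute, 5*time.Second).Should(gomega.Succeed())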
This was initially added to catch a bug in which container metrics would disappear after container restarts. It was copied from the summary API test IIRC. I'd like to keep this if possible.
I put back the loop with your explanation as a comment.
I'm not sure this is going to have the desired effect since we wait later for the metrics to populate. WDYT?
Right. IIRC, the bug we fixed (and added this test for) was that we only collected metrics for the first instance of a container, and not for instances created after the container restarts. So we want to first have a pod whose containers have already restarted, and then make sure the metrics look correct after that. Make sense?
I understand now. I'll think on this and whether it needs to be rewritten again. Thank you.
I think this should be
BY "restarting containers to ensure that future instances of a container besides the first instance collect metrics"
or something like that.
Is there an issue/PR we can link to for demonstrative purposes? Or does this need a separate test case itself?
force-pushed from 33468d4 to c72c040
rebased + hack/update-bazel.sh
Just for my understanding, I see that the alpha endpoint is still being registered:

kubernetes/pkg/kubelet/server/server.go, lines 381 to 387 in da5ec16:

// deprecated endpoint which will be removed in release 1.20.0+.
s.addMetricsBucketMatcher("metrics/resource/v1alpha1")
v1alpha1ResourceRegistry := compbasemetrics.NewKubeRegistry()
v1alpha1ResourceRegistry.CustomMustRegister(stats.NewPrometheusResourceMetricCollector(s.resourceAnalyzer, stats.Config()))
s.restfulCont.Handle(path.Join(resourceMetricsPath, v1alpha1.Version),
	compbasemetrics.HandlerFor(v1alpha1ResourceRegistry, compbasemetrics.HandlerOpts{ErrorHandling: compbasemetrics.ContinueOnError}),
)

Were tests failing because of a timeout?
As for the change, even if the alpha endpoint is still registered, I don't think it's worth validating it, and the change looks good:
Tests were failing because there was no content in the response. I think #88568 shows that the endpoint still exists but has nothing in it. Do I also need to remove the registration in pkg/kubelet/server/server.go?
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/resource_metrics_test.go:67
Timed out after 60.000s.
Expected
<string>: KubeletMetrics
to match keys: {
missing expected key scrape_error
missing expected key node_cpu_usage_seconds_total
missing expected key node_memory_working_set_bytes
missing expected key container_cpu_usage_seconds_total
missing expected key container_memory_working_set_bytes
}
Maybe I've gone about this the wrong way and we should put back the removed api? Or remove even more code?
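For reference, the "missing expected key" lines above are the failure shape produced by a gomega/gstruct key matcher over the scraped KubeletMetrics. A hypothetical sketch of such a matcher (names and sub-matchers are illustrative, not the test's exact code):

// Hypothetical sketch using gstruct from github.com/onsi/gomega/gstruct:
// every listed metric family must be present in the scraped metrics map;
// extra families are tolerated because of IgnoreExtras.
matchResourceMetrics := gstruct.MatchKeys(gstruct.IgnoreExtras, gstruct.Keys{
	"scrape_error":                        gstruct.Ignore(),
	"node_cpu_usage_seconds_total":        gstruct.Ignore(),
	"node_memory_working_set_bytes":       gstruct.Ignore(),
	"container_cpu_usage_seconds_total":   gstruct.Ignore(),
	"container_memory_working_set_bytes":  gstruct.Ignore(),
})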
ginkgo.By("Fetching node so we can know proper node memory bounds for unconstrained cgroups") | ||
node := getLocalNode(f) | ||
memoryCapacity := node.Status.Capacity["memory"] | ||
memoryLimit := memoryCapacity.Value() |
@dashpole can you help me understand this comment? It looks like it refers to the following couple of lines, and then goes into the matcher. I'm guessing that this is sort of a default return value of the node having access to all of its own memory?
The matcher checks that the returned value is within a range. Memory usage should not exceed the capacity of the node, so it is an upper bound on some ranges IIRC.
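As a hypothetical illustration of that kind of bounded check (not the test's actual matcher), a gomega assertion that a sampled memory value sits between zero and the node's capacity:

// Hypothetical illustration: working-set memory should be positive but
// never exceed the node's capacity, which acts as the range's upper bound.
memoryCapacity := node.Status.Capacity["memory"]
memoryLimit := memoryCapacity.Value()

workingSetBytes := getNodeMemoryWorkingSet() // hypothetical helper returning the sampled value
gomega.Expect(workingSetBytes).To(gomega.And(
	gomega.BeNumerically(">", 0),
	gomega.BeNumerically("<=", memoryLimit),
))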
Oh! I didn't realize the resource metrics test was failing. We should definitely fix it in 1.19. The existing endpoint should still work this release. cc @RainbowMango
Would it be possible to test both endpoints? I'd like to try and fix the v1alpha1 endpoint.
@@ -64,13 +63,13 @@ var _ = framework.KubeDescribe("ResourceMetricsAPI", func() {
  ginkgo.By("Waiting 15 seconds for cAdvisor to collect 2 stats points")
  time.Sleep(15 * time.Second)
})
- ginkgo.It("should report resource usage through the v1alpha1 resouce metrics api", func() {
Add another clause like this for v1alpha1? Same test, different endpoints?
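A rough sketch of what that extra clause could look like, reusing the deprecated-endpoint helper quoted later in this thread; matchResourceMetrics stands in for whatever shared matcher the test already builds:

ginkgo.It("should report resource usage through the v1alpha1 resource metrics api", func() {
	ginkgo.By("Fetching node and container metrics from the deprecated v1alpha1 endpoint")
	response, err := getV1alpha1ResourceMetrics()
	framework.ExpectNoError(err)
	gomega.Expect(response).To(matchResourceMetrics) // hypothetical shared matcher
})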
Yes. #88568 was supposed to keep the contents of the endpoint the same IIRC, but just refactor and add a new endpoint.
Yes. #88568 was supposed to keep the contents of the endpoint the same IIRC, but just refactor and add a new endpoint.
Yes.
Should I also remove the v1alpha1 API registration?
force-pushed from 361b27a to 347bee1
After 1.19 is cut.
We still need the endpoint, as metrics can be unhidden by flag.
What's the flag to unhide? I am curious, more than anything.
/metrics/resource/v1alpha1 was deprecated and moved to /metrics/resource
Renames to remove v1alpha1 from function names and matcher variables.
Pod deletion was taking multiple minutes, so set GracePeriodSeconds to 0.
Commented restart loop during test pod startup.
Move ResourceMetricsAPI out of Orphans by giving it a NodeFeature tag.
API removed in 7b7c73b kubernetes#88568
Test created 6051664 kubernetes#73946
force-pushed from 347bee1 to 916c73b
alright, this is hopefully the final version 🤞
I don't know what to do with this:
/test pull-kubernetes-e2e-gce-100-performance
the flag is (e.g.)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dashpole, MHBauer
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
looks like maybe a rare flake? 18 in the past week https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=Services%20should%20be%20able%20to%20change%20the%20type%20from%20NodePort%20to%20ExternalName#2fa91b3a0e6e1a4514d4
/test pull-kubernetes-conformance-kind-ga-only-parallel
alrighty, looking forward to thaw.
- func getV1alpha1ResourceMetrics() (e2emetrics.KubeletMetrics, error) {
-	return e2emetrics.GrabKubeletMetricsWithoutProxy(framework.TestContext.NodeName+":10255", "/metrics/resource/"+kubeletresourcemetricsv1alpha1.Version)
+ func getResourceMetrics() (e2emetrics.KubeletMetrics, error) {
+	ginkgo.By("getting stable resource metrics API")
nit: remove "stable", or refactor the log with something like "grabbing kubelet resource metrics".
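Applying the nit, a sketch of how the renamed helper might read; the host/port and the GrabKubeletMetricsWithoutProxy call are taken from the v1alpha1 version quoted above, and only the path and log wording change:

func getResourceMetrics() (e2emetrics.KubeletMetrics, error) {
	ginkgo.By("grabbing kubelet resource metrics")
	return e2emetrics.GrabKubeletMetricsWithoutProxy(
		framework.TestContext.NodeName+":10255", "/metrics/resource")
}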
Looks like the matcher needs to be tweaked and/or the test marked as disruptive, as it can fail due to the local environment.
/metrics/resource/v1alpha1 was deprecated and moved to
/metrics/resource
Renames to remove v1alpha1 from function names and matcher variables.
Pod deletion was taking multiple minutes, so set GracePeriodSeconds to 0 (see the sketch after this list).
Removed restart loop during test pod startup.
Move ResourceMetricsAPI out of Orphans by giving it a NodeFeature tag.
API removed in 7b7c73b #88568
Test created 6051664 #73946
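For the GracePeriodSeconds change above, a minimal sketch of force-deleting a test pod with a zero grace period via client-go; f and pod are assumed to come from the surrounding test, and this is an illustration rather than the PR's exact call:

// A zero grace period skips the graceful-termination wait, so test
// teardown no longer takes minutes per pod.
gracePeriod := int64(0)
err := f.ClientSet.CoreV1().Pods(f.Namespace.Name).Delete(
	context.TODO(), pod.Name,
	metav1.DeleteOptions{GracePeriodSeconds: &gracePeriod},
)
framework.ExpectNoError(err)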
What type of PR is this?
/kind cleanup
/kind failing-test
What this PR does / why we need it:
API moved without updating test.
Which issue(s) this PR fixes:
Failing tests in https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-orphans
Special notes for your reviewer:
/cc @vpickard
/assign @dashpole
Is there an explanation for the restart loop? It looks like it may have been a copy and paste from the garbage collection test.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: