update ResourceMetricsAPI tests #93868
Conversation
ginkgo.By("Waiting 15 seconds for cAdvisor to collect 2 stats points") | ||
time.Sleep(15 * time.Second) | ||
}) | ||
ginkgo.It("should report resource usage through the v1alpha1 resouce metrics api", func() { | ||
ginkgo.It("should report resource usage through the resouce metrics api", func() { |
nitpick:
ginkgo.It("should report resource usage through the resouce metrics api", func() { | |
ginkgo.It("should report resource usage through the resource metrics api", func() { |
thank you.
Just for my understanding, I see that the alpha endpoint is still being registered: kubernetes/pkg/kubelet/server/server.go Lines 381 to 387 in da5ec16
Were tests failing because of a timeout? As for the change, even if the alpha endpoint is still registered, I don't think it's worth validating it, and the change looks good: /lgtm
looks like bazel update is needed.
ginkgo.By("Waiting for test pods to restart the desired number of times") | ||
gomega.Eventually(func() error { | ||
for _, pod := range pods { | ||
err := verifyPodRestartCount(f, pod.Name, len(pod.Spec.Containers), numRestarts) |
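The hunk above is truncated; as a rough sketch (not necessarily the PR's exact code), the complete wait loop could look like the following, assuming f, pods, numRestarts, and verifyPodRestartCount all come from the surrounding e2e_node test, and with illustrative timeout/poll values:

// Sketch: wait until every container in every test pod has restarted
// numRestarts times, polling until the deadline is reached.
ginkgo.By("Waiting for test pods to restart the desired number of times")
gomega.Eventually(func() error {
	for _, pod := range pods {
		err := verifyPodRestartCount(f, pod.Name, len(pod.Spec.Containers), numRestarts)
		if err != nil {
			return err
		}
	}
	return nil
}, 2*time.Minute, 5*time.Second).Should(gomega.Succeed())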
This was initially added to catch a bug in which container metrics would disappear after container restarts. It was copied from the summary API test IIRC. I'd like to keep this if possible.
I put back the loop with your explanation as a comment.
I'm not sure this is going to have the desired effect since we wait later for the metrics to populate. WDYT?
Right. IIRC, the bug we fixed (and added this test for) was that we only collected metrics for the first instance of a container, and not for instances created after the container restarts. So we want to first have a pod whose containers have already restarted, and then make sure the metrics look correct after that. Make sense?
I understand now. I'll think on this and whether it needs to be rewritten again. Thank you.
I think this should be
BY "restarting containers to ensure that future instances of a container besides the first instance collect metrics"
or something like that.
Is there an issue/PR we can link to for demonstrative purposes? Or does this need a separate test case itself?
force-pushed from 33468d4 to c72c040
rebased + hack/update-bazel.sh
Just for my understanding, I see that the alpha endpoint is still being registered:

kubernetes/pkg/kubelet/server/server.go, lines 381 to 387 in da5ec16:

// deprecated endpoint which will be removed in release 1.20.0+.
s.addMetricsBucketMatcher("metrics/resource/v1alpha1")
v1alpha1ResourceRegistry := compbasemetrics.NewKubeRegistry()
v1alpha1ResourceRegistry.CustomMustRegister(stats.NewPrometheusResourceMetricCollector(s.resourceAnalyzer, stats.Config()))
s.restfulCont.Handle(path.Join(resourceMetricsPath, v1alpha1.Version),
	compbasemetrics.HandlerFor(v1alpha1ResourceRegistry, compbasemetrics.HandlerOpts{ErrorHandling: compbasemetrics.ContinueOnError}),
)

Were tests failing because of a timeout?
As for the change, even if the alpha endpoint is still registered, I don't think it's worth validating it, and the change looks good:
Tests were failing because there was no content in the response. I think #88568 shows that the endpoint still exists but has nothing in it. Do I also need to remove the registration in pkg/kubelet/server/server.go?
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/resource_metrics_test.go:67
Timed out after 60.000s.
Expected
<string>: KubeletMetrics
to match keys: {
missing expected key scrape_error
missing expected key node_cpu_usage_seconds_total
missing expected key node_memory_working_set_bytes
missing expected key container_cpu_usage_seconds_total
missing expected key container_memory_working_set_bytes
}
Maybe I've gone about this the wrong way and we should put back the removed api? Or remove even more code?
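For reference, the "missing expected key" lines above are the failure shape produced by a gomega/gstruct key matcher over the scraped KubeletMetrics. A hypothetical sketch of such a matcher (names and sub-matchers are illustrative, not the test's exact code):

// Hypothetical sketch using gstruct from github.com/onsi/gomega/gstruct:
// every listed metric family must be present in the scraped metrics map;
// extra families are tolerated because of IgnoreExtras.
matchResourceMetrics := gstruct.MatchKeys(gstruct.IgnoreExtras, gstruct.Keys{
	"scrape_error":                        gstruct.Ignore(),
	"node_cpu_usage_seconds_total":        gstruct.Ignore(),
	"node_memory_working_set_bytes":       gstruct.Ignore(),
	"container_cpu_usage_seconds_total":   gstruct.Ignore(),
	"container_memory_working_set_bytes":  gstruct.Ignore(),
})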
ginkgo.By("Fetching node so we can know proper node memory bounds for unconstrained cgroups") | ||
node := getLocalNode(f) | ||
memoryCapacity := node.Status.Capacity["memory"] | ||
memoryLimit := memoryCapacity.Value() |
@dashpole can you help me understand this comment? It looks like it refers to the following couple of lines, and then goes into the matcher. I'm guessing that this is sort of a default return value of the node having access to all of its own memory?
The matcher checks that the returned value is within a range. Memory usage should not exceed the capacity of the node, so it is an upper bound on some ranges IIRC.
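As a hypothetical illustration of that kind of bounded check (not the test's actual matcher), a gomega assertion that a sampled memory value sits between zero and the node's capacity:

// Hypothetical illustration: working-set memory should be positive but
// never exceed the node's capacity, which acts as the range's upper bound.
memoryCapacity := node.Status.Capacity["memory"]
memoryLimit := memoryCapacity.Value()

workingSetBytes := getNodeMemoryWorkingSet() // hypothetical helper returning the sampled value
gomega.Expect(workingSetBytes).To(gomega.And(
	gomega.BeNumerically(">", 0),
	gomega.BeNumerically("<=", memoryLimit),
))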
Oh! I didn't realize the resource metrics test was failing. We should definitely fix it in 1.19. The existing endpoint should still work this release. cc @RainbowMango
Would it be possible to test both endpoints? I'd like to try and fix the v1alpha1 endpoint.
@@ -64,13 +63,13 @@ var _ = framework.KubeDescribe("ResourceMetricsAPI", func() {
  ginkgo.By("Waiting 15 seconds for cAdvisor to collect 2 stats points")
  time.Sleep(15 * time.Second)
})
- ginkgo.It("should report resource usage through the v1alpha1 resouce metrics api", func() {
Add another clause like this for v1alpha1? Same test, different endpoints?
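A rough sketch of what that extra clause could look like, reusing the deprecated-endpoint helper quoted later in this thread; matchResourceMetrics stands in for whatever shared matcher the test already builds:

ginkgo.It("should report resource usage through the v1alpha1 resource metrics api", func() {
	ginkgo.By("Fetching node and container metrics from the deprecated v1alpha1 endpoint")
	response, err := getV1alpha1ResourceMetrics()
	framework.ExpectNoError(err)
	gomega.Expect(response).To(matchResourceMetrics) // hypothetical shared matcher
})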
Yes. #88568 was supposed to keep the contents of the endpoint the same IIRC, but just refactor and add a new endpoint.
Yes. #88568 was supposed to keep the contents of the endpoint the same IIRC, but just refactor and add a new endpoint.
Yes.
Should I also remove the v1alpha1 API registration?
force-pushed from 361b27a to 347bee1
After 1.19 is cut.
We still need the endpoint, as metrics can be unhidden by flag.
What's the flag to unhide? I am curious, more than anything.
/metrics/resource/v1alpha1 was deprecated and moved to /metrics/resource
Renames to remove v1alpha1 from function names and matcher variables.
Pod deletion was taking multiple minutes, so set GracePeriodSeconds to 0.
Commented restart loop during test pod startup.
Move ResourceMetricsAPI out of Orphans by giving it a NodeFeature tag.
API removed in 7b7c73b kubernetes#88568
Test created 6051664 kubernetes#73946
force-pushed from 347bee1 to 916c73b
alright, this is hopefully the final version 🤞
I don't know what to do with this:
/test pull-kubernetes-e2e-gce-100-performance
the flag is (e.g.)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dashpole, MHBauer
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
looks like maybe a rare flake? 18 in the past week https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=Services%20should%20be%20able%20to%20change%20the%20type%20from%20NodePort%20to%20ExternalName#2fa91b3a0e6e1a4514d4
/test pull-kubernetes-conformance-kind-ga-only-parallel
alrighty, looking forward to thaw.
- func getV1alpha1ResourceMetrics() (e2emetrics.KubeletMetrics, error) {
-	return e2emetrics.GrabKubeletMetricsWithoutProxy(framework.TestContext.NodeName+":10255", "/metrics/resource/"+kubeletresourcemetricsv1alpha1.Version)
+ func getResourceMetrics() (e2emetrics.KubeletMetrics, error) {
+	ginkgo.By("getting stable resource metrics API")
nit: remove "stable", or refactor the log with something like "grabbing kubelet resource metrics".
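Applying the nit, a sketch of how the renamed helper might read; the host/port and the GrabKubeletMetricsWithoutProxy call are taken from the v1alpha1 version quoted above, and only the path and log wording change:

func getResourceMetrics() (e2emetrics.KubeletMetrics, error) {
	ginkgo.By("grabbing kubelet resource metrics")
	return e2emetrics.GrabKubeletMetricsWithoutProxy(
		framework.TestContext.NodeName+":10255", "/metrics/resource")
}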
Looks like the matcher needs to be tweaked and/or the test marked as disruptive, as it can fail due to the local environment.
/metrics/resource/v1alpha1 was deprecated and moved to
/metrics/resource
Renames to remove v1alpha1 from function names and matcher variables.
Pod deletion was taking multiple minutes, so set GracePeriodSeconds to 0 (see the sketch after this list).
Removed restart loop during test pod startup.
Move ResourceMetricsAPI out of Orphans by giving it a NodeFeature tag.
API removed in 7b7c73b #88568
Test created 6051664 #73946
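For the GracePeriodSeconds change above, a minimal sketch of force-deleting a test pod with a zero grace period via client-go; f and pod are assumed to come from the surrounding test, and this is an illustration rather than the PR's exact call:

// A zero grace period skips the graceful-termination wait, so test
// teardown no longer takes minutes per pod.
gracePeriod := int64(0)
err := f.ClientSet.CoreV1().Pods(f.Namespace.Name).Delete(
	context.TODO(), pod.Name,
	metav1.DeleteOptions{GracePeriodSeconds: &gracePeriod},
)
framework.ExpectNoError(err)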
What type of PR is this?
/kind cleanup
/kind failing-test
What this PR does / why we need it:
API moved without updating test.
Which issue(s) this PR fixes:
Failing tests in https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-orphans
Special notes for your reviewer:
/cc @vpickard
/assign @dashpole
Is there an explanation for the restart loop? It looks like it may have been a copy and paste from the garbage collection test.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: