update ResourceMetricsAPI tests #93868

Merged
merged 1 commit into from Aug 27, 2020

Conversation

MHBauer
Contributor

@MHBauer MHBauer commented Aug 10, 2020

/metrics/resource/v1alpha1 was deprecated and moved to
/metrics/resource

Renames to remove v1alpha1 from function names and matcher variables.

Pod deletion was taking multiple minutes, so set GracePeriodSeconds to 0.

Removed restart loop during test pod startup.

Move ResourceMetricsAPI out of Orphans by giving it a NodeFeature tag.

API removed in 7b7c73b #88568
Test created 6051664 #73946

What type of PR is this?

/kind cleanup
/kind failing-test

What this PR does / why we need it:
API moved without updating test.

Which issue(s) this PR fixes:
Failing tests in https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-orphans

Special notes for your reviewer:
/cc @vpickard
/assign @dashpole
Is there an explanation for the restart loop? It looks like it may have been a copy and paste from the garbage collection test.

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 10, 2020
@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 10, 2020
  ginkgo.By("Waiting 15 seconds for cAdvisor to collect 2 stats points")
  time.Sleep(15 * time.Second)
  })
- ginkgo.It("should report resource usage through the v1alpha1 resouce metrics api", func() {
+ ginkgo.It("should report resource usage through the resouce metrics api", func() {
Contributor

nitpick:

Suggested change
- ginkgo.It("should report resource usage through the resouce metrics api", func() {
+ ginkgo.It("should report resource usage through the resource metrics api", func() {

Contributor Author

thank you.

@SergeyKanzhelev
Member

Just for my understanding, I see that the alpha endpoint is still being registered:

// deprecated endpoint which will be removed in release 1.20.0+.
s.addMetricsBucketMatcher("metrics/resource/v1alpha1")
v1alpha1ResourceRegistry := compbasemetrics.NewKubeRegistry()
v1alpha1ResourceRegistry.CustomMustRegister(stats.NewPrometheusResourceMetricCollector(s.resourceAnalyzer, stats.Config()))
s.restfulCont.Handle(path.Join(resourceMetricsPath, v1alpha1.Version),
	compbasemetrics.HandlerFor(v1alpha1ResourceRegistry, compbasemetrics.HandlerOpts{ErrorHandling: compbasemetrics.ContinueOnError}),
)

Were tests failing because of a timeout?

As for the change, even if alpha endpoint is still registered, I don't think it's worth validating it and the change looks good:

/lgtm
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 10, 2020
@SergeyKanzhelev
Member

looks like bazel update is needed.

ginkgo.By("Waiting for test pods to restart the desired number of times")
gomega.Eventually(func() error {
	for _, pod := range pods {
		err := verifyPodRestartCount(f, pod.Name, len(pod.Spec.Containers), numRestarts)
Contributor

This was initially added to catch a bug in which container metrics would disappear after container restarts. It was copied from the summary API test IIRC. I'd like to keep this if possible.

Contributor Author

I put back the loop with your explanation as comment.
I'm not sure this is going to have the desired effect since we wait later for the metrics to populate. WDYT?

Contributor

Right. IIRC, the bug we fixed, and the reason we added this test, was that we only collected metrics for the first instance of a container and not for instances created after the container restarted. So we want to first have a pod whose containers have already restarted, and then make sure the metrics look correct after that. Make sense?

Contributor Author

I understand now. I'll think about this and whether it needs to be rewritten again. Thank you.

Contributor Author

I think this should be
BY "restarting containers to ensure that future instances of a container besides the first instance collect metrics"
or something like that.

Is there an issue/PR we can link to for demonstrative purposes? Or does this need a separate test case of its own?

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 13, 2020
Contributor Author

@MHBauer MHBauer left a comment

rebased + hack/update-bazel.sh

Just for my understanding, I see that the alpha endpoint is still being registered:

// deprecated endpoint which will be removed in release 1.20.0+.
s.addMetricsBucketMatcher("metrics/resource/v1alpha1")
v1alpha1ResourceRegistry := compbasemetrics.NewKubeRegistry()
v1alpha1ResourceRegistry.CustomMustRegister(stats.NewPrometheusResourceMetricCollector(s.resourceAnalyzer, stats.Config()))
s.restfulCont.Handle(path.Join(resourceMetricsPath, v1alpha1.Version),
	compbasemetrics.HandlerFor(v1alpha1ResourceRegistry, compbasemetrics.HandlerOpts{ErrorHandling: compbasemetrics.ContinueOnError}),
)

Were tests failing because of a timeout?

As for the change, even if alpha endpoint is still registered, I don't think it's worth validating it and the change looks good:

Tests were failing because there was no content in the response. I think #88568 shows that the endpoint still exists but has nothing in it. Do I also need to remove the registration in pkg/kubelet/server/server.go?

_output/local/go/src/k8s.io/kubernetes/test/e2e_node/resource_metrics_test.go:67
Timed out after 60.000s.
Expected
    <string>: KubeletMetrics
to match keys: {
missing expected key scrape_error
missing expected key node_cpu_usage_seconds_total
missing expected key node_memory_working_set_bytes
missing expected key container_cpu_usage_seconds_total
missing expected key container_memory_working_set_bytes
}

Maybe I've gone about this the wrong way and we should put back the removed api? Or remove even more code?
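
To illustrate the failure mode reported above (an endpoint that answers but returns no metrics), here is a rough sketch that scans a Prometheus text-format body for the expected metric names and lists the ones that are missing. It is not the matcher the e2e test uses; `metricNames` and `missingKeys` are hypothetical helpers, and the sample body is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// metricNames returns the set of metric names that have at least one sample
// line in a Prometheus text exposition body.
func metricNames(body string) map[string]bool {
	names := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the first '{' (labels) or ' ' (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		names[line[:end]] = true
	}
	return names
}

// missingKeys reports which expected metric names have no sample in body.
func missingKeys(body string, expected []string) []string {
	have := metricNames(body)
	var missing []string
	for _, k := range expected {
		if !have[k] {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Keys taken from the failure output quoted in this comment.
	expected := []string{
		"scrape_error",
		"node_cpu_usage_seconds_total",
		"node_memory_working_set_bytes",
		"container_cpu_usage_seconds_total",
		"container_memory_working_set_bytes",
	}
	// Made-up sample response, for illustration only.
	body := `# TYPE scrape_error gauge
scrape_error 0
node_cpu_usage_seconds_total 12.5 1597100000000
container_memory_working_set_bytes{container="test",namespace="ns",pod="p"} 4096 1597100000000
`
	fmt.Println("missing from sample response:", missingKeys(body, expected))
	// An empty body reproduces the situation in the log: every key is missing.
	fmt.Println("missing from empty response:", missingKeys("", expected))
}
```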

Comment on lines 67 to 70
ginkgo.By("Fetching node so we can know proper node memory bounds for unconstrained cgroups")
node := getLocalNode(f)
memoryCapacity := node.Status.Capacity["memory"]
memoryLimit := memoryCapacity.Value()
Contributor Author

@dashpole can you help me understand this comment? It looks like it refers to the following couple of lines, and then goes into the matcher. I'm guessing that this is sort of a default return value of the node having access to all of its own memory?

Contributor

The matcher checks that the returned value is within a range. Memory usage should not exceed the capacity of the node, so it is an upper bound on some ranges IIRC.
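
As a hedged illustration of the range check described here, the sketch below uses node memory capacity as the upper bound for a memory metric. `checkBounded` is a hypothetical helper rather than the matcher from the test, and the capacity and lower bound are assumed values chosen for the example.

```go
package main

import "fmt"

// checkBounded returns an error unless lower <= value <= upper.
func checkBounded(name string, value, lower, upper int64) error {
	if value < lower || value > upper {
		return fmt.Errorf("%s = %d, want a value in [%d, %d]", name, value, lower, upper)
	}
	return nil
}

func main() {
	const mb = int64(1024 * 1024)
	// Assumed capacity for illustration; the test reads it from
	// node.Status.Capacity["memory"] via memoryCapacity.Value().
	memoryLimit := 4096 * mb
	// Assumed lower bound, also for illustration only.
	lower := 10 * mb

	// Usage below capacity passes the check.
	fmt.Println(checkBounded("node_memory_working_set_bytes", 900*mb, lower, memoryLimit))
	// Usage above capacity cannot be right, so the check reports an error.
	fmt.Println(checkBounded("node_memory_working_set_bytes", 5000*mb, lower, memoryLimit))
}
```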

@dashpole
Contributor

Oh! I didn't realize the resource metrics test was failing.

We should definitely fix it in 1.19. The existing endpoint should still work this release. cc @RainbowMango

@dashpole
Contributor

Would it be possible to test both endpoints? I'd like to try and fix the v1alpha1 endpoint.

Contributor Author

@MHBauer MHBauer left a comment

@dashpole Does that mean add back the code of #88568 ?

Same test, different endpoints?

@@ -64,13 +63,13 @@ var _ = framework.KubeDescribe("ResourceMetricsAPI", func() {
ginkgo.By("Waiting 15 seconds for cAdvisor to collect 2 stats points")
time.Sleep(15 * time.Second)
})
ginkgo.It("should report resource usage through the v1alpha1 resouce metrics api", func() {
Contributor Author

Add another clause like this for v1alpha1? Same test, different endpoints?

Contributor

Yes. #88568 was supposed to keep the contents of the endpoint the same IIRC, but just refactor and add a new endpoint.

Member

Yes. #88568 was supposed to keep the contents of the endpoint the same IIRC, but just refactor and add a new endpoint.

Yes.
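
One way the "same test, different endpoints" idea could look, sketched with the standard library only. The handler is a fake that serves the same made-up body on both paths, on the assumption stated in this thread that the two endpoints were meant to expose the same content. `newFakeKubeletHandler` and `checkEndpoint` are hypothetical names, not part of the e2e framework.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newFakeKubeletHandler serves an identical, made-up metrics body on both the
// current and the deprecated resource metrics paths.
func newFakeKubeletHandler() http.Handler {
	const body = "node_cpu_usage_seconds_total 1\n"
	mux := http.NewServeMux()
	for _, p := range []string{"/metrics/resource", "/metrics/resource/v1alpha1"} {
		mux.HandleFunc(p, func(w http.ResponseWriter, _ *http.Request) {
			io.WriteString(w, body)
		})
	}
	return mux
}

// checkEndpoint issues a GET against path and verifies that the response is
// 200 OK and mentions wantMetric.
func checkEndpoint(h http.Handler, path, wantMetric string) error {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	if rec.Code != http.StatusOK {
		return fmt.Errorf("%s: status %d", path, rec.Code)
	}
	if !strings.Contains(rec.Body.String(), wantMetric) {
		return fmt.Errorf("%s: response does not contain %q", path, wantMetric)
	}
	return nil
}

func main() {
	h := newFakeKubeletHandler()
	// The same assertion runs once per endpoint, in table-driven style.
	for _, path := range []string{"/metrics/resource", "/metrics/resource/v1alpha1"} {
		err := checkEndpoint(h, path, "node_cpu_usage_seconds_total")
		fmt.Printf("%s ok: %v\n", path, err == nil)
	}
}
```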

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 13, 2020
@MHBauer
Contributor Author

MHBauer commented Aug 14, 2020

Should I also remove the v1alpha1 api registration?

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 14, 2020
@dashpole
Contributor

Should the v1alpha1 endpoint registration then be removed as well? Maybe in a separate PR?

Should I also remove the v1alpha1 api registration?

After 1.19 is cut.

@dashpole
Contributor

We still need the endpoint, as metrics can be unhidden by flag.

@MHBauer
Contributor Author

MHBauer commented Aug 14, 2020

What's the flag to unhide? I am curious, more than anything

/metrics/resource/v1alpha1 was deprecated and moved to
/metrics/resource

Renames to remove v1alpha1 from function names and matcher variables.

Pod deletion was taking multiple minutes, so set GracePeriodSeconds to 0.

Commented restart loop during test pod startup.

Move ResourceMetricsAPI out of Orphans by giving it a NodeFeature tag.

API removed in 7b7c73b kubernetes#88568
Test created 6051664 kubernetes#73946
@MHBauer
Contributor Author

MHBauer commented Aug 14, 2020

alright, this is hopefully the final version 🤞

@MHBauer
Contributor Author

MHBauer commented Aug 14, 2020

I don't know what to do with this:

failed to try resolving symlinks in path "/var/log/pods/test-pods_09debb56-de5c-11ea-bf0d-0a7878750357_0b856363-de5c-11ea-9893-42010a800098/test/0.log": lstat /var/log/pods/test-pods_09debb56-de5c-11ea-bf0d-0a7878750357_0b856363-de5c-11ea-9893-42010a800098/test/0.log: no such file or directory 

/test pull-kubernetes-e2e-gce-100-performance

@dashpole
Contributor

the flag is (e.g.) --show-hidden-metrics-for-version=1.19 to re-enable these disabled metrics.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 14, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dashpole, MHBauer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 14, 2020
@MHBauer
Contributor Author

MHBauer commented Aug 14, 2020

@MHBauer
Contributor Author

MHBauer commented Aug 14, 2020

/test pull-kubernetes-conformance-kind-ga-only-parallel

@MHBauer
Contributor Author

MHBauer commented Aug 14, 2020

alrighty, looking forward to thaw.

- func getV1alpha1ResourceMetrics() (e2emetrics.KubeletMetrics, error) {
- 	return e2emetrics.GrabKubeletMetricsWithoutProxy(framework.TestContext.NodeName+":10255", "/metrics/resource/"+kubeletresourcemetricsv1alpha1.Version)
+ func getResourceMetrics() (e2emetrics.KubeletMetrics, error) {
+ 	ginkgo.By("getting stable resource metrics API")
Member

nit: remove "stable", or reword the log message to something like "grabbing kubelet resource metrics".

@RainbowMango
Member

@MHBauer
v1.19 has been released and #94272 will clean up the deprecated /metrics/resource/v1alpha1 endpoint.

@k8s-ci-robot k8s-ci-robot merged commit d506ff0 into kubernetes:master Aug 27, 2020
@MHBauer
Contributor Author

MHBauer commented Aug 28, 2020

Looks like the matcher needs to be tweaked and/or the test marked as disruptive, since it can fail due to the local environment.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
7 participants