Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert kubelet metrics(running_pod_count and running_container_count) from non-standard prometheus collectors to standard gauges #81573

Conversation

@irajdeep
Copy link
Contributor

commented Aug 18, 2019

What type of PR is this?
/kind design

What this PR does / why we need it:
This PR attempts to fix this issue. It converts kubelet metrics (e.g running_pod_count and running_container_count)from non-standard prometheus collecter to standard gauge metrics

Which issue(s) this PR fixes:

Fixes #
#79286

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add `container_state` label to `running_container_count` kubelet metrics, to get count of containers based on their state(running/exited/created/unknown)

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Aug 18, 2019

Welcome @irajdeep!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Aug 18, 2019

Hi @irajdeep. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/node and removed needs-sig labels Aug 18, 2019
@k8s-ci-robot k8s-ci-robot requested review from dchen1107 and feiskyer Aug 18, 2019
@irajdeep

This comment has been minimized.

Copy link
Contributor Author

commented Aug 18, 2019

/assign logicalhan

Copy link
Contributor

left a comment

lgtm but @dashpole for more input on the change in the metric.

containers := pod.Containers
for _, container := range containers {
// check if the state of the container is "running"
if container.State == kubecontainer.ContainerStateRunning {

This comment has been minimized.

Copy link
@logicalhan

logicalhan Aug 18, 2019

Contributor

This logic is a bit different than the existing logic for the metric (which does a simpler len(p.Containers)). @dashpole, for more input.

This comment has been minimized.

Copy link
@irajdeep

irajdeep Aug 18, 2019

Author Contributor

Right, I did this change because it aligns more with the logic of "metric name", i.e counting the containers in the running state.

This comment has been minimized.

Copy link
@logicalhan

logicalhan Aug 18, 2019

Contributor

Yeah, I agree it makes more sense; regardless, it is a change in the semantics of the existing metric so we should check with the node team to ensure it isn't likely to break anything.

This comment has been minimized.

Copy link
@dashpole

dashpole Aug 22, 2019

Contributor

I don't think we can change it. Its hard to reason about what can break. Could we add the container state as a label instead? That wouldn't change queries over the base metric, and allow doing the filtering at a later point.

This comment has been minimized.

Copy link
@irajdeep

irajdeep Aug 26, 2019

Author Contributor

That makes total sense.
I have now added the container state as a label container_state in the running_container_count metrics.

@logicalhan

This comment has been minimized.

Copy link
Contributor

commented Aug 18, 2019

/assign @dashpole

@logicalhan

This comment has been minimized.

Copy link
Contributor

commented Aug 18, 2019

/sig instrumentation

@logicalhan

This comment has been minimized.

Copy link
Contributor

commented Aug 19, 2019

/ok-to-test

@irajdeep

This comment has been minimized.

Copy link
Contributor Author

commented Aug 19, 2019

/test pull-kubernetes-bazel-test

@irajdeep irajdeep force-pushed the irajdeep:irajdeep/change_runningPod_runningContainer_metrics branch 3 times, most recently from fdd2376 to d546700 Aug 19, 2019
@dashpole

This comment has been minimized.

Copy link
Contributor

commented Aug 26, 2019

/lgtm
@logicalhan do you know if adding this label is backwards compatible based on our new conventions?

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 26, 2019
@logicalhan

This comment has been minimized.

Copy link
Contributor

commented Aug 27, 2019

/hold

Oh, you'll need a release note about the metrics change, since you are introducing new labels (potentially breaking ingestion).

@irajdeep

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2019

Good point. Updated PR description with release note

Copy link
Member

left a comment

/lgtm

@logicalhan

This comment has been minimized.

Copy link
Contributor

commented Aug 29, 2019

You will need to rebase as the kubelet migration changes went in first. That means you'll have to change your registration calls to use the migration registry ones.

pkg/kubelet/pleg/generic.go Outdated Show resolved Hide resolved
@irajdeep irajdeep force-pushed the irajdeep:irajdeep/change_runningPod_runningContainer_metrics branch from 5e0632c to dc026ef Aug 29, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm label Aug 29, 2019
As already mentioned in this issue #79286, some metrics like
"running_pod_count" and "running_container_count" uses non-standard prometheus metrics, this change converts them to be
standard prometheus gauges

Minor refactor in kubelet/pleg/generic.go and added some test for ruuning container and running pod metrics

Fixed issues related to github CI pipeline failure

* Updated bazel for new deps
* Add comment for exported metrics variables,RuuningContainerCount and RunningPodCount
* Specify keys explicitly in Guage metric instantation

Fix go lint errors

Replace "+=1" with "++", as reported by go lint

Set container state as a label for the metrics "running_container_count"

As per the metrics name "running_container_count" it should "ideally" be showing
the number of containers in "running" state , but it was showing all the container count, irrespective of the state it is in.
This commit adds a new label "container_running_state" to the metrics "running_container_count", which doesn't change the base metrics but adds the
option to query the metrics with "container_state" such as "running"/"unknown/...

remove unused methods reported by staticcheck

Remove variables while instantiating gauge(vec) which are default set to nil

Convert kubelet metrics(running_pod_count and running_container_count) to standard gauges and added label to running_container_count metrics.

Currently kubelet metrics(running_pod_count and running_container_count) use non-standard prometheus collectors , this change
converts them to standard prometheus gauges. Also this adds a new label(container_state) to running_container_count which does a breakdown of
containers tracked by kubelet based on the containers' state(running/unknown/created/exited).

Set statbility explicitly for running_pod_count and running_container_count and reformat test

register metrics explicitly in test , so that they don't become no-op
@irajdeep irajdeep force-pushed the irajdeep:irajdeep/change_runningPod_runningContainer_metrics branch from e81ecd2 to c02d49d Aug 29, 2019
@irajdeep

This comment has been minimized.

Copy link
Contributor Author

commented Aug 29, 2019

/test pull-kubernetes-e2e-gce

@irajdeep

This comment has been minimized.

Copy link
Contributor Author

commented Aug 29, 2019

You will need to rebase as the kubelet migration changes went in first. That means you'll have to change your registration calls to use the migration registry ones.

Done!

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 29, 2019

/hold cancel
/approve

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 29, 2019

/milestone v1.16

@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 29, 2019
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Aug 29, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, irajdeep

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@logicalhan

This comment has been minimized.

Copy link
Contributor

commented Aug 29, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 29, 2019
@k8s-ci-robot k8s-ci-robot merged commit 7da563f into kubernetes:master Aug 30, 2019
24 checks passed
24 checks passed
cla/linuxfoundation irajdeep authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-conformance-kind-ipv6 Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details
containerStateCount := make(map[string]int)

for _, pod := range pods {
containers := pod.Containers

This comment has been minimized.

Copy link
@tedyu

tedyu Aug 30, 2019

Contributor

Should Sandboxes be covered (in a new GaugeVec) ?

This comment has been minimized.

Copy link
@irajdeep

irajdeep Aug 30, 2019

Author Contributor

Makes sense introducing it in a new GaugeVec(running_initcontainer_count something like this maybe)
@logicalhan what do you think?

This comment has been minimized.

Copy link
@tedyu

tedyu Aug 30, 2019

Contributor

I found that container.Pod doesn't expose initcontainers.
Edited comment above for Sandboxes.

If this makes sense, I can send out a PR.

This comment has been minimized.

Copy link
@dashpole

dashpole Aug 30, 2019

Contributor

We could add this, although i'm not 100% sure what the value of it would be. It should be exactly equal to running pod count except in rare scenarios when the kubelet is recreating a sandbox. But i'm not entirely sure why knowing that a sandbox is being recreated is useful... What did you intend to use it for?

This comment has been minimized.

Copy link
@tedyu

tedyu Aug 30, 2019

Contributor

Assuming the sandbox count settles down after the recreation, probably we don't need the extra gauge.

irajdeep added a commit to irajdeep/enhancements that referenced this pull request Aug 30, 2019
Mark kubelet metrics migration to standard prometheus collector as done
as a result of this: kubernetes/kubernetes#81573
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.