
Fix kubelet PVC stale metrics #59170

Merged
merged 1 commit into kubernetes:master from cofyc:fix_kubelet_volume_metrics on Feb 16, 2018

Conversation

@cofyc
Contributor

cofyc commented Feb 1, 2018

What this PR does / why we need it:

The set of volumes on each node changes over time, so we should not just keep adding PVC metrics into
a gauge vector. It's better to use a collector to collect metrics from internal
stats.

Currently, if a PV (bound to a PVC testpv) is attached and used by node A, then migrated to node B or just deleted from node A later. testpvc metrics will not disappear from kubelet on node A. After a long running time, kubelet process will keep a lot of stale volume metrics in memory.

For these dynamic metrics, it's better to use a collector to collect metrics from a data source (StatsProvider here), like kube-state-metrics scraping metrics from kube-apiserver.
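The failure mode described above can be sketched in plain Go. This is an illustrative stand-alone model, not the real kubelet or client_golang code: VolumeStats, StatsProvider, and the two styles below are hypothetical types invented only to contrast the approaches.

```go
package main

import "fmt"

// VolumeStats is a simplified stand-in for the kubelet's per-PVC stats.
type VolumeStats struct {
	Namespace, Name string
	UsedBytes       float64
}

// StatsProvider returns stats for volumes currently mounted on the node.
type StatsProvider func() []VolumeStats

// gaugeVectorStyle mimics the old approach: series are added to a
// persistent map keyed by label values and never removed, so entries
// for volumes that left the node linger forever.
type gaugeVectorStyle struct{ metrics map[string]float64 }

func (g *gaugeVectorStyle) Update(stats []VolumeStats) {
	for _, s := range stats {
		g.metrics[s.Namespace+"/"+s.Name] = s.UsedBytes
	}
}

// collectorStyle mimics the new approach: every scrape regenerates the
// metric set from the stats provider, so stale volumes simply vanish.
type collectorStyle struct{ provider StatsProvider }

func (c *collectorStyle) Collect() map[string]float64 {
	out := map[string]float64{}
	for _, s := range c.provider() {
		out[s.Namespace+"/"+s.Name] = s.UsedBytes
	}
	return out
}

func main() {
	current := []VolumeStats{{"ns", "testpvc", 100}}
	gv := &gaugeVectorStyle{metrics: map[string]float64{}}
	gv.Update(current)
	col := &collectorStyle{provider: func() []VolumeStats { return current }}

	// The PV migrates to another node: this node no longer reports it.
	current = nil
	gv.Update(current)

	fmt.Println(len(gv.metrics))    // stale entry remains: 1
	fmt.Println(len(col.Collect())) // regenerated from stats: 0
}
```

After the volume leaves the node, the gauge-vector map still holds the old series, while the collector's output is rebuilt on every scrape and is empty.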

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #57686

Special notes for your reviewer:

Release note:

Fix kubelet PVC stale metrics
@cofyc

Contributor

cofyc commented Feb 1, 2018

@k8s-ci-robot k8s-ci-robot requested review from gnufied and wongma7 Feb 1, 2018

@cofyc cofyc changed the title from Fix kubelet PVC metrics using a volume stats collector. to Fix kubelet PVC stale metrics Feb 1, 2018

@zhangxiaoyu-zidif

Member

zhangxiaoyu-zidif commented Feb 1, 2018

/ok-to-test

@cofyc

Contributor

cofyc commented Feb 1, 2018

return fmt.Errorf(`metric output does not match expectation; want:
%s

@zhangxiaoyu-zidif

zhangxiaoyu-zidif Feb 1, 2018

Member

run 'gofmt' =)

@cofyc

Contributor

cofyc commented Feb 1, 2018

/sig storage
/assign @gnufied @jingxu97

@jingxu97

Contributor

jingxu97 commented Feb 1, 2018

@cofyc Thanks for your PR. I am OK with this PR, but will wait for @gnufied to review in more detail since I am not very familiar with this area.
One question: besides PVC metrics, did you notice any other metrics with the same problem?

@cofyc

Contributor

cofyc commented Feb 2, 2018

@jingxu97 Just checked other metrics, they don't have the same problem.

cofyc added a commit to cofyc/kubernetes that referenced this pull request Feb 2, 2018

Fix kubelet PVC metrics using a volume stats collector.
Volumes on each node change over time; we should not only add PVC metrics into
a gauge vector. It's better to use a collector to collect metrics from
stats.

Sync from kubernetes#59170 with some
modifications;

- Use SummaryProvider instead of StatsProvider
- Remove test files.
- Generate build files.
@jingxu97

Contributor

jingxu97 commented Feb 2, 2018

@cofyc thank you for checking. What metrics did you check and why pvc metric has the problem and others don't? Thanks!

@cofyc

Contributor

cofyc commented Feb 2, 2018

hi, @jingxu97

I've checked all other metrics:

PVC metrics have this problem because they use the namespace and PVC name as label values. WithLabelValues(labelValues...) creates a new metric for each new combination of label values, and a PVC may be migrated to another node or deleted from the cluster, but its metrics will never be cleared. So, after running for a long time, kubelet will keep a lot of stale volume metrics in memory.

Theoretically there are two ways to fix:

  • Clear its metrics when PVC is removed.
  • Use a collector to generate all PVC metrics from stats when collecting.

The second is the better and easier way to go for PVC metrics. It also decouples statistics collection from metrics generation.
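For comparison, the first option corresponds roughly to deleting a PVC's series from the gauge vector whenever the volume is torn down. A minimal pure-Go sketch of that bookkeeping, using simplified stand-ins rather than the real client_golang GaugeVec API:

```go
package main

import "fmt"

// gaugeVec is a toy analogue of a prometheus GaugeVec: a persistent
// map of series keyed by label values. Not the real client_golang type.
type gaugeVec struct{ series map[string]float64 }

func newGaugeVec() *gaugeVec { return &gaugeVec{series: map[string]float64{}} }

// WithLabelValues returns a setter that creates the series on first use.
func (g *gaugeVec) WithLabelValues(namespace, name string) func(float64) {
	key := namespace + "/" + name
	return func(v float64) { g.series[key] = v }
}

// DeleteLabelValues must be called from every volume-teardown code path,
// which couples metrics bookkeeping to volume lifecycle events.
func (g *gaugeVec) DeleteLabelValues(namespace, name string) {
	delete(g.series, namespace+"/"+name)
}

func main() {
	g := newGaugeVec()
	g.WithLabelValues("ns", "testpvc")(42)
	fmt.Println(len(g.series)) // 1
	g.DeleteLabelValues("ns", "testpvc")
	fmt.Println(len(g.series)) // 0
}
```

The drawback is that every removal path has to remember the delete call; the collector approach sidesteps this by regenerating all series from stats on each scrape.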

}
addGauge := func(desc *prometheus.Desc, pvcRef *stats.PVCReference, v float64, lv ...string) {
	lv = append([]string{pvcRef.Namespace, pvcRef.Name}, lv...)
	ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, v, lv...)

@gnufied

gnufied Feb 2, 2018

Member

I would prefer that we did not use the version that causes panics.

@cofyc

cofyc Feb 2, 2018

Contributor

Updated. Log a warning message if NewConstMetric emits an error.

podStats = []statsapi.PodStats{
	{
		PodRef:    statsapi.PodReference{Name: "test-pod", Namespace: "test-namespace", UID: "UID_test-pod"},
		StartTime: metav1.NewTime(time.Now()),

@gnufied

gnufied Feb 2, 2018

Member

I think there is a metav1.Now()

@cofyc

cofyc Feb 2, 2018

Contributor

Fixed.

@gnufied

Member

gnufied commented Feb 2, 2018

Mostly looks good to me. Some minor nits.

@cofyc

Contributor

cofyc commented Feb 9, 2018

@jsafrane

That's weird.

curl -s http://localhost:10255/metrics | grep '^kubelet_volume'

Could you check this output on each node instead of in Prometheus?

If all PVC volumes on node were correctly unmounted and detached, kubelet /metrics endpoint should have no kubelet_volume_xxx metrics.

@jsafrane

Member

jsafrane commented Feb 9, 2018

@cofyc, weird, I can't reproduce the issue today. I recompiled and reinstalled everything and it's gone. Sorry for the noise. Tested and LGTM from my side :-)

@gnufied

Member

gnufied commented Feb 9, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 9, 2018

@rootfs

Member

rootfs commented Feb 9, 2018

/assign @dchen1107 @derekwaynecarr for approval

continue
}
pvcUniqStr := pvcRef.Namespace + pvcRef.Name
if allPVCs.Has(pvcUniqStr) {

@cbandy

cbandy Feb 11, 2018

It looks as though pvcRef can be used as a map key.

I would prefer to gather everything into a map[stats.PVCReference]stats.VolumeStats then iterate over that.

@cofyc

cofyc Feb 11, 2018

Contributor

@cbandy sets.String is backed by a map, so there is no performance issue here, and this way we can generate metrics in one iteration.
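For reference, cbandy's suggestion works because Go allows any struct whose fields are all comparable to be used directly as a map key. A small sketch, where PVCReference is a hypothetical stand-in for stats.PVCReference:

```go
package main

import "fmt"

// PVCReference stands in for stats.PVCReference; since all its fields
// are comparable (strings), values of this type can key a map directly.
type PVCReference struct {
	Namespace, Name, UID string
}

func main() {
	seen := map[PVCReference]float64{}
	// These two references would collide under naive string
	// concatenation ("abc"), but remain distinct as struct keys.
	seen[PVCReference{Namespace: "a", Name: "bc"}] = 1
	seen[PVCReference{Namespace: "ab", Name: "c"}] = 2
	fmt.Println(len(seen)) // 2
}
```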

// ignore if no PVC reference
continue
}
pvcUniqStr := pvcRef.Namespace + pvcRef.Name

@cbandy

cbandy Feb 11, 2018

This isn't really unique... namespace "a" with name "bc" and namespace "ab" with name "c" would both produce "abc" and collide.

@cofyc

cofyc Feb 11, 2018

Contributor

@cbandy Thanks! I should use / to separate namespace and name. That's how unique keys are generated for objects in the client-go cache store. Fixed now: fecff55#diff-ae1cd608f0ca9bcb432eeeca0cf3cde0R106
@gnufied Sorry, please take a look again.
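The collision and its fix can be demonstrated in a few lines of Go. The aside about client-go is mine: its cache store keys namespaced objects as "namespace/name", which stays unambiguous because "/" cannot appear in Kubernetes namespaces or object names.

```go
package main

import "fmt"

func main() {
	// Plain concatenation collides: "a"+"bc" and "ab"+"c" are both "abc".
	fmt.Println("a"+"bc" == "ab"+"c") // true

	// Joining with "/" keeps the keys distinct, since "/" is not a
	// legal character in a namespace or object name.
	fmt.Println("a/bc" == "ab/c") // false
}
```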

Fix kubelet PVC metrics using a volume stats collector.
Volumes on each node change over time; we should not only add PVC metrics into
a gauge vector. It's better to use a collector to collect metrics from
stats.

@k8s-merge-robot k8s-merge-robot removed the lgtm label Feb 11, 2018

@jsafrane

Member

jsafrane commented Feb 12, 2018

/lgtm

@jsafrane

Member

jsafrane commented Feb 12, 2018

@derekwaynecarr

/lgtm
/approve

Kubelet changes look good

@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 16, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cofyc, derekwaynecarr, gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot

Contributor

k8s-merge-robot commented Feb 16, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot

Contributor

k8s-merge-robot commented Feb 16, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit bfdd94c into kubernetes:master Feb 16, 2018

13 of 14 checks passed

Submit Queue Required Github CI test is not green: pull-kubernetes-e2e-gce
Details
cla/linuxfoundation cofyc authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-e2e-kubeadm-gce-canary Skipped
pull-kubernetes-kubemark-e2e-gce Job succeeded.
Details
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-unit Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details

k8s-merge-robot added a commit that referenced this pull request Feb 24, 2018

Merge pull request #60014 from cofyc/automated-cherry-pick-of-#59170-…
…upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #59170: Fix kubelet PVC metrics using a volume stats collector.

Cherry pick of #59170 on release-1.8.

#59170: Fix kubelet PVC metrics using a volume stats collector.

@cofyc cofyc deleted the cofyc:fix_kubelet_volume_metrics branch Mar 14, 2018

k8s-merge-robot added a commit that referenced this pull request Apr 12, 2018

Merge pull request #60013 from cofyc/automated-cherry-pick-of-#59170-…
…upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #59170: Fix kubelet PVC metrics using a volume stats collector.

Cherry pick of #59170 on release-1.9.

#59170: Fix kubelet PVC metrics using a volume stats collector.