add gc metrics and collect sync errors #106844

atiratree · 2021-12-06T22:29:56Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is a complementary PR to #105705, which helps with debugging the garbage collector.

Adds a new metric, garbagecollector_controller_garbagecollector_resources_sync_error_total, which, together with the existing metrics workqueue_retries_total{name="garbage_collector_attempt_to_delete"} and workqueue_retries_total{name="garbage_collector_attempt_to_orphan"}, can help with alerting when the garbage collector cannot reconcile.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

linux-foundation-easycla · 2021-12-06T22:29:59Z

The committers are authorized under a signed CLA.

✅ Filip Křepinský (c4c2529b3ce143023cad94c70d02282e96cba84c)

atiratree · 2021-12-06T22:40:30Z

/check-cla

atiratree · 2021-12-06T22:46:03Z

/meow

k8s-ci-robot · 2021-12-06T22:46:05Z

@atiratree:

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

atiratree · 2021-12-06T22:49:40Z

/check-cla

atiratree · 2021-12-07T11:14:10Z

/check-cla

atiratree · 2021-12-07T11:24:22Z

/easycla

dgrisonnet · 2021-12-07T14:29:52Z

pkg/controller/garbagecollector/metrics/metrics.go

+var (
+	GarbageCollectorResourcesSyncError = metrics.NewCounter(
+		&metrics.CounterOpts{
+			Subsystem:      GarbageCollectorControllerSubsystem,


It would be better to have garbagecollector as Namespace and controller as Subsystem in the options. The result will be the same, but it will be easier to differentiate subsystems from the same namespace.

Your change seems a bit different from what I was expecting before. Also note that the name of the metric will end up being: Namespace + Subsystem + Name, so you need to remove the garbagecollector prefix from the metric Name. Currently the metric will be named garbagecollector_controller_garbagecollector_resources_sync_error_total which I don't think is what you want since garbagecollector is redundant.

> Your change seems a bit different from what I was expecting before.

I have reverted it back because of #106844 (comment)

> Currently the metric will be named garbagecollector_controller_garbagecollector_resources_sync_error_total

Yup, that makes sense not to have it redundant.
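The naming point in this thread can be illustrated with a minimal sketch. It assumes the usual Prometheus-style rule that the non-empty namespace, subsystem, and name are joined with underscores in that order. `buildFQName` is a hypothetical helper written for this example, not the function used by the Kubernetes metrics package, and the subsystem value is inferred from the metric name quoted above.

```go
package main

import (
	"fmt"
	"strings"
)

// buildFQName is a hypothetical stand-in for the naming rule discussed above:
// the non-empty parts are joined with "_" in the order namespace, subsystem, name.
func buildFQName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// Assumed value of GarbageCollectorControllerSubsystem, inferred from the
	// full metric name quoted in the review comment.
	const subsystem = "garbagecollector_controller"

	// Name still carries the redundant "garbagecollector_" prefix.
	fmt.Println(buildFQName("", subsystem, "garbagecollector_resources_sync_error_total"))

	// Prefix removed from Name.
	fmt.Println(buildFQName("", subsystem, "resources_sync_error_total"))

	// Namespace/Subsystem split; the resulting name is the same as the line above.
	fmt.Println(buildFQName("garbagecollector", "controller", "resources_sync_error_total"))
}
```

Under these assumptions the second and third calls produce the same string, which matches the remark that splitting into Namespace and Subsystem does not change the final name.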

dgrisonnet · 2021-12-07T14:33:18Z

pkg/controller/garbagecollector/metrics/metrics.go

+	GarbageCollectorResourcesSyncError = metrics.NewCounter(
+		&metrics.CounterOpts{
+			Subsystem:      GarbageCollectorControllerSubsystem,
+			Name:           "garbagecollector_resources_sync_error_total",


Seeing how the metric is used, there seem to be different reasons behind sync failures. As such, would it make sense to introduce a reason label to differentiate them more easily?

I suppose it could be somewhat useful, but usually you want to go to see the logs once the error occurs.

You'd have to be careful with cardinality if you go this route though, otherwise this can cause memory issues.

There are two possible values at the moment and I am not expecting them to increase, so I hope that is OK.

I suggest explicitly bounding the values in a conditional and default to 'unknown' for any other ones.

I have done exactly that by introducing RecordGarbageCollectorSyncErrorMetric function.
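A minimal sketch of the bounding idea, under stated assumptions: the signature of RecordGarbageCollectorSyncErrorMetric is not shown in this thread, so `boundedReason` and `syncErrorCounter` below are hypothetical names, and the map stands in for a labelled counter. Only `resource_discovery_failure` comes from the diff in this PR; the second reason is a placeholder because the thread does not name it.

```go
package main

import "fmt"

const (
	// Taken from the diff in this PR.
	ReasonResourceDiscoveryFailure = "resource_discovery_failure"
	// Hypothetical placeholder: the thread says there are two values but does not name the second.
	ReasonSecondKnown = "second_known_reason"
	// Fallback for anything outside the known set.
	ReasonUnknown = "unknown"
)

// boundedReason maps arbitrary input onto a fixed set of label values,
// so the number of time series cannot grow with free-form error text.
func boundedReason(reason string) string {
	switch reason {
	case ReasonResourceDiscoveryFailure, ReasonSecondKnown:
		return reason
	default:
		return ReasonUnknown
	}
}

// syncErrorCounter is an in-memory stand-in for a counter with a "reason" label.
type syncErrorCounter map[string]int

// record increments the series for the bounded reason.
func (c syncErrorCounter) record(reason string) {
	c[boundedReason(reason)]++
}

func main() {
	c := syncErrorCounter{}
	c.record(ReasonResourceDiscoveryFailure)
	// Free-form reasons collapse into a single "unknown" series.
	c.record("timeout talking to 10.0.0.7")
	c.record("timeout talking to 10.0.0.8")
	fmt.Println(len(c), c[ReasonUnknown]) // 2 2
}
```

Defining the values as constants in one place also keeps call sites from drifting apart in spelling.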

But I have run into a problem. By introducing reason label value, I am not picking up any metric values before the first error is recorded. I played with query quite a bit but I am unable to register a change from 0 -> 1 error, which is essentially null -> 1 and isn't picked up by rate.

Query I am trying to achieve for an alert: rate(garbagecollector_controller_garbagecollector_resources_sync_error_total{}[1h]) > 0

I am seeing two options:

1. Force-initialize the metric for all reasons with 0 at the beginning.

2. Revert to not using labels, which would make observability simpler. As I said before, the benefit of adding the reason is not that big.

Thoughts?
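The first option can be sketched as follows. This is a simplified model rather than the code from this PR: `counterVec`, `initialize`, `scrape`, and `increase` are hypothetical names, and `increase` only approximates how a range function compares two scrapes. It assumes that a series absent from the earlier scrape gives nothing to subtract from, which is one plausible explanation for the null -> 1 behavior described above.

```go
package main

import "fmt"

// counterVec is a hypothetical in-memory stand-in for a labelled counter.
// A series exists only after its label value has been touched.
type counterVec struct {
	series map[string]float64
}

func newCounterVec() *counterVec {
	return &counterVec{series: map[string]float64{}}
}

// initialize creates each series at 0 without counting an error.
func (c *counterVec) initialize(reasons ...string) {
	for _, r := range reasons {
		if _, ok := c.series[r]; !ok {
			c.series[r] = 0
		}
	}
}

func (c *counterVec) inc(reason string) {
	c.series[reason]++
}

// scrape returns a snapshot of the currently exposed series.
func (c *counterVec) scrape() map[string]float64 {
	out := make(map[string]float64, len(c.series))
	for k, v := range c.series {
		out[k] = v
	}
	return out
}

// increase reports the delta between two scrapes. If the series was absent
// from the earlier scrape, no increase can be computed.
func increase(before, after map[string]float64, reason string) (float64, bool) {
	prev, ok := before[reason]
	if !ok {
		return 0, false
	}
	return after[reason] - prev, true
}

func main() {
	const reason = "resource_discovery_failure"

	// Without initialization the first error goes from absent to 1.
	lazy := newCounterVec()
	before := lazy.scrape()
	lazy.inc(reason)
	delta, ok := increase(before, lazy.scrape(), reason)
	fmt.Println(delta, ok) // 0 false

	// With initialization the first error goes from 0 to 1.
	eager := newCounterVec()
	eager.initialize(reason)
	before = eager.scrape()
	eager.inc(reason)
	delta, ok = increase(before, eager.scrape(), reason)
	fmt.Println(delta, ok) // 1 true
}
```

The trade-off is that every known reason is exported from startup, so this only makes sense while the set of reasons stays small and fixed.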

> I suppose it could be somewhat useful, but usually you want to go to see the logs once the error occurs.

If you don't have any use for the reason label in terms of alerting, I think it is correct to stick with the approach of looking into the logs when an error occurs, although it might be simpler to investigate if you have the reason in the metric.

> But I have run into a problem. By introducing reason label value, I am not picking up any metric values before the first error is recorded. I played with query quite a bit but I am unable to register a change from 0 -> 1 error, which is essentially null -> 1 and isn't picked up by rate.

Not seeing the error metric until an actual error occurs is normal monitoring behavior. For example, the apiserver request metric will only show series with a 500 status code if the apiserver answers a request with 500.

Going from 0 -> 1 and going from null -> 1 should be the same in Prometheus, so there might have been something else that went wrong. Were you perhaps able to see the actual metric with the reason?

I am not sure what went wrong, but I did see the metric with the reason fine and tried variations of the query above.

Nevertheless, let's go with the simpler solution for now then.

fedebongio · 2021-12-07T21:08:59Z

/assign @caesarxuchao
/triage accepted

dims · 2022-01-05T15:34:47Z

looks like we are waiting on @caesarxuchao for review. thanks!

atiratree · 2022-01-10T17:13:09Z

@caesarxuchao implemented feedback from @dgrisonnet

caesarxuchao

Sorry for the delay. Some minor comments and otherwise LGTM.

@logicalhan could you also take a look from the metrics perspective? Thanks.

/sig instrumentation

caesarxuchao · 2022-01-11T23:59:28Z

pkg/controller/garbagecollector/metrics/metrics.go

+	GarbageCollectorResourcesSyncError = metrics.NewCounterVec(
+		&metrics.CounterOpts{
+			Namespace:      "garbagecollector",
+			Subsystem:      "controller",


Let's follow the style of other controllers: define an xxxSubsystem const and do not use Namespace, e.g., as in replicaset/metrics.

reverted back

caesarxuchao · 2022-01-12T01:04:54Z

pkg/controller/garbagecollector/metrics/metrics.go

+
+var registerMetrics sync.Once
+
+// Register registers CronjobController metrics.


Can you update the "CronjobController" :)

caesarxuchao · 2022-01-12T01:11:41Z

pkg/controller/garbagecollector/garbagecollector.go

@@ -203,6 +207,7 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.ServerResourcesInterf
 				newResources = GetDeletableResources(discoveryClient)
 				if len(newResources) == 0 {
 					klog.V(2).Infof("no resources reported by discovery (attempt %d)", attempt)
+					metrics.GarbageCollectorResourcesSyncError.WithLabelValues("resource_discovery_failure").Inc()


Can you define the label values as const in the metrics package?

atiratree · 2022-01-12T18:25:50Z

@caesarxuchao resolved

atiratree · 2022-01-13T18:48:20Z

The PR is currently blocked on #106844 (comment)

atiratree · 2022-01-20T16:10:03Z

@caesarxuchao @logicalhan @dgrisonnet

I have decided to simplify the solution and return the metrics to the previous state. Please see #106844 (comment).

I think better, more user-friendly observability is more important than labels that bring little benefit. The details of the error are available in the KCM log if needed.

atiratree · 2022-02-04T19:25:24Z

updated

atiratree · 2022-02-07T14:20:51Z

The unit test failure does not seem related.
/retest

dgrisonnet · 2022-02-07T14:49:45Z

/lgtm

atiratree · 2022-02-22T13:18:29Z

@caesarxuchao please rereview

deads2k · 2022-03-23T19:01:13Z

/approve

k8s-ci-robot · 2022-03-23T19:02:37Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atiratree, deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/garbagecollector/OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-triage-robot · 2022-03-23T23:45:44Z

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

The PR does not have any do-not-merge/* labels
The PR does not have the needs-ok-to-test label
The PR is mergeable (does not have a needs-rebase label)
The PR is approved (has cncf-cla: yes, lgtm, approved labels)
The PR is failing tests required for merge

You can:

Review the full test history for this PR
Prevent this bot from retesting with /lgtm cancel or /hold
Help make our tests less flaky by following our Flaky Tests Guide

/retest

k8s-triage-robot · 2022-03-24T02:45:44Z

/retest

k8s-triage-robot · 2022-03-24T06:06:44Z

/retest

k8s-triage-robot · 2022-03-24T09:07:44Z

/retest

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 6, 2021

k8s-ci-robot requested review from caesarxuchao and lavalamp December 6, 2021 22:30

k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 6, 2021

atiratree mentioned this pull request Dec 6, 2021

add alerts for garbage collector openshift/cluster-kube-controller-manager-operator#580

Closed

1 task

dgrisonnet reviewed Dec 7, 2021


k8s-ci-robot assigned caesarxuchao Dec 7, 2021

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 7, 2021

atiratree force-pushed the gc_metrics branch 2 times, most recently from aeb123e to 01e388c Compare January 10, 2022 17:06

caesarxuchao reviewed Jan 12, 2022


atiratree force-pushed the gc_metrics branch from 01e388c to 3070fc4 Compare January 12, 2022 18:22

atiratree force-pushed the gc_metrics branch from 3070fc4 to 87c13f6 Compare January 13, 2022 18:35

atiratree force-pushed the gc_metrics branch from 87c13f6 to 48f3691 Compare January 20, 2022 16:04

add gc metrics and collect sync errors

2fc401d

atiratree force-pushed the gc_metrics branch from 48f3691 to 2fc401d Compare February 4, 2022 19:21

k8s-ci-robot assigned dgrisonnet Feb 7, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 23, 2022

k8s-ci-robot merged commit 4e8b2ff into kubernetes:master Mar 24, 2022

k8s-ci-robot added this to the v1.24 milestone Mar 24, 2022

atiratree mentioned this pull request Apr 4, 2022

add status endpoint to GC debug handler #105705

Closed

atiratree mentioned this pull request Jun 1, 2022

add metrics to GC to measure staleness of queue retry errors #110329

Closed

