Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add gc metrics and collect sync errors #106844

Merged
merged 1 commit into from Mar 24, 2022

Conversation

atiratree
Copy link
Member

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is a complementary PR to #105705 which helps with debugging garbage collector.

Adds new metric garbagecollector_controller_garbagecollector_resources_sync_error_total which together with existing metrics workqueue_retries_total{name="garbage_collector_attempt_to_delete"} and workqueue_retries_total{name="garbage_collector_attempt_to_orphan"} can help with alerting when gc cannot reconcile.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 6, 2021
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Dec 6, 2021

CLA Signed

The committers are authorized under a signed CLA.

  • ✅ Filip Křepinský (c4c2529b3ce143023cad94c70d02282e96cba84c)

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Dec 6, 2021
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 6, 2021
@atiratree
Copy link
Member Author

/check-cla

@atiratree
Copy link
Member Author

/meow

@k8s-ci-robot
Copy link
Contributor

@atiratree: cat image

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@atiratree
Copy link
Member Author

/check-cla

@atiratree
Copy link
Member Author

/check-cla

@atiratree
Copy link
Member Author

/easycla

var (
GarbageCollectorResourcesSyncError = metrics.NewCounter(
&metrics.CounterOpts{
Subsystem: GarbageCollectorControllerSubsystem,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better to have garbagecollector as Namespace and controller as Subsystem in the options. The result will be the same, but it will be easier to differentiate subsystems from the same namespace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, done

Copy link
Member

@dgrisonnet dgrisonnet Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your change seems a bit different from what I was expecting before. Also note that the name of the metric will end up being: Subsystem + Namespace + Name, so you need to remove the garbagecollector prefix from the metric Name. Currently the metric will be named garbagecollector_controller_garbagecollector_resources_sync_error_total which I don't think is what you want since garbagecollector is redundant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your change seems a bit different from what I was expecting before.

I have reverted it back because of #106844 (comment)

Currently the metric will be named garbagecollector_controller_garbagecollector_resources_sync_error_total

Yup, that makes sense not to have it redundant.

GarbageCollectorResourcesSyncError = metrics.NewCounter(
&metrics.CounterOpts{
Subsystem: GarbageCollectorControllerSubsystem,
Name: "garbagecollector_resources_sync_error_total",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing how the metric is used, there seem to be different reasons behind sync failures. As such, would it make sense to introduce a reason label to differentiate them more easily?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it could be somewhat useful, but usually you want to go to see the logs once the error occurs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'd have to be careful with cardinality if you go with this route though, otherwise this can cause memory issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two possible values at the moment and I am not expecting them to increase, so I hope that is ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest explicitly bounding the values in a conditional and default to 'unknown' for any other ones.

Copy link
Member Author

@atiratree atiratree Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done exactly that by introducing RecordGarbageCollectorSyncErrorMetric function.

But I have run into a problem. By introducing reason label value, I am not picking up any metric values before the first error is recorded. I played with query quite a bit but I am unable to register a change from 0 -> 1 error, which is essentially null -> 1 and isn't picked up by rate.

Query I am trying to achieve for an alert: rate(garbagecollector_controller_garbagecollector_resources_sync_error_total{}[1h]) > 0

I am seeing two options:

  1. force initialize metric for all reasons with 0 at the begining.
  2. revert back to not using labels which would make observability simpler. As I said before, the benefit of adding the reason is not that big.

thougths?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it could be somewhat useful, but usually you want to go to see the logs once the error occurs.

If you don't have any use for the reason label in terms of alerting, I think it is correct to stick with the approach of looking into the logs when an error occurs, although it might be simpler to investigate if you have the reason in the metric.

But I have run into a problem. By introducing reason label value, I am not picking up any metric values before the first error is recorded. I played with query quite a bit but I am unable to register a change from 0 -> 1 error, which is essentially null -> 1 and isn't picked up by rate.

Not seeing the error metric until an actual error occurs is a normal monitoring behavior. For example, the apiserver request metric will only show metrics with a 500 status code if the apiserver answers a request with 500.

Going from 0 -> 1 to null -> 1 should be the same in Prometheus, so there might have been something else that went wrong. Were you perhaps able to see the actual metric with the reason?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what went wrong, but I did see the metric with the reason fine and tried variations of the query above.

Nevertheless, let's go with the simpler solution for now then.

@fedebongio
Copy link
Contributor

/assign @caesarxuchao
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 7, 2021
@dims
Copy link
Member

dims commented Jan 5, 2022

looks like we are waiting on @caesarxuchao for review. thanks!

@atiratree atiratree force-pushed the gc_metrics branch 2 times, most recently from aeb123e to 01e388c Compare January 10, 2022 17:06
@atiratree
Copy link
Member Author

@caesarxuchao implemented feedback from @dgrisonnet

Copy link
Member

@caesarxuchao caesarxuchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Some minor comments and otherwise LGTM.

@logicalhan could you also take a look from the metrics perspective? Thanks.

/sig instrumentation

GarbageCollectorResourcesSyncError = metrics.NewCounterVec(
&metrics.CounterOpts{
Namespace: "garbagecollector",
Subsystem: "controller",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's follow the style of other controllers, defining a xxxSubsystem const and not using namespace, e.g, the replicaset/metrics.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted back


var registerMetrics sync.Once

// Register registers CronjobController metrics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the "CronjobController" :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done ^^

@@ -203,6 +207,7 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.ServerResourcesInterf
newResources = GetDeletableResources(discoveryClient)
if len(newResources) == 0 {
klog.V(2).Infof("no resources reported by discovery (attempt %d)", attempt)
metrics.GarbageCollectorResourcesSyncError.WithLabelValues("resource_discovery_failure").Inc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you define the label values as const in the metrics package?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@atiratree
Copy link
Member Author

@caesarxuchao resolved

@atiratree
Copy link
Member Author

currently PR blocked on #106844 (comment)

@atiratree
Copy link
Member Author

@caesarxuchao @logicalhan @dgrisonnet

I have decided to simplify the solution and return the metrics to the previous state. Please see #106844 (comment).

I think better / user friendly observability is more important than not that much beneficial labels. The details of the error are accessible in KCM log if needed.

@atiratree
Copy link
Member Author

updated

@atiratree
Copy link
Member Author

unit test failure does not seem related
/retest

@dgrisonnet
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2022
@atiratree
Copy link
Member Author

@caesarxuchao please rereview

@deads2k
Copy link
Contributor

deads2k commented Mar 23, 2022

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atiratree, deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 23, 2022
@k8s-triage-robot
Copy link

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

3 similar comments
@k8s-triage-robot
Copy link

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-triage-robot
Copy link

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-triage-robot
Copy link

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants