add gc metrics and collect sync errors #106844
Conversation
```go
var (
	GarbageCollectorResourcesSyncError = metrics.NewCounter(
		&metrics.CounterOpts{
			Subsystem: GarbageCollectorControllerSubsystem,
```
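The excerpt above is truncated by the diff view. For context, a complete definition in this style might look roughly like the following sketch; the const value is inferred from the metric name discussed below, and the Help text and StabilityLevel are assumptions, not taken from the PR:

```go
import "k8s.io/component-base/metrics"

const GarbageCollectorControllerSubsystem = "garbagecollector_controller"

// Sketch only: Help and StabilityLevel below are assumed for illustration.
var GarbageCollectorResourcesSyncError = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem:      GarbageCollectorControllerSubsystem,
		Name:           "garbagecollector_resources_sync_error_total",
		Help:           "Number of garbage collector resources sync errors.",
		StabilityLevel: metrics.ALPHA,
	},
)
```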
It would be better to have `garbagecollector` as Namespace and `controller` as Subsystem in the options. The result will be the same, but it will be easier to differentiate subsystems from the same namespace.
thx, done
Your change seems a bit different from what I was expecting before. Also note that the name of the metric will end up being Namespace + Subsystem + Name, so you need to remove the `garbagecollector` prefix from the metric Name. Currently the metric will be named `garbagecollector_controller_garbagecollector_resources_sync_error_total`, which I don't think is what you want, since `garbagecollector` is redundant.
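For reference, client_golang (which the component-base metrics wrappers build on) assembles the fully qualified name as namespace_subsystem_name; a quick way to check the result:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// BuildFQName joins namespace, subsystem, and name with underscores,
	// which is how the final metric name is derived from the options.
	fmt.Println(prometheus.BuildFQName("garbagecollector", "controller",
		"garbagecollector_resources_sync_error_total"))
	// Output: garbagecollector_controller_garbagecollector_resources_sync_error_total
}
```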
> Your change seems a bit different from what I was expecting before.

I have reverted it back because of #106844 (comment).

> Currently the metric will be named garbagecollector_controller_garbagecollector_resources_sync_error_total

Yup, it makes sense not to have the redundant prefix.
```go
GarbageCollectorResourcesSyncError = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem: GarbageCollectorControllerSubsystem,
		Name:      "garbagecollector_resources_sync_error_total",
```
Seeing how the metric is used, there seem to be different reasons behind sync failures. As such, would it make sense to introduce a `reason` label to differentiate them more easily?
I suppose it could be somewhat useful, but usually you want to go look at the logs once the error occurs.
You'd have to be careful with cardinality if you go down this route, though; otherwise it can cause memory issues.
There are two possible values at the moment and I am not expecting them to increase, so I hope that is OK.
I suggest explicitly bounding the values in a conditional and defaulting to "unknown" for any others.
I have done exactly that by introducing a RecordGarbageCollectorSyncErrorMetric function.
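A minimal sketch of what such a bounded recorder could look like, assuming the CounterVec has a single reason label; the constant names and the second reason value are hypothetical, and the PR's actual code may differ:

```go
// Hypothetical reason values: only resource_discovery_failure appears in the
// diff shown later in this thread; the second value is assumed for illustration.
const (
	ResourceDiscoveryFailure = "resource_discovery_failure"
	InformerSyncFailure      = "informer_sync_failure"
)

// RecordGarbageCollectorSyncErrorMetric bounds the reason label to a known
// set and defaults anything else to "unknown", keeping cardinality fixed.
func RecordGarbageCollectorSyncErrorMetric(reason string) {
	switch reason {
	case ResourceDiscoveryFailure, InformerSyncFailure:
		// Known reason: record it as-is.
	default:
		reason = "unknown"
	}
	GarbageCollectorResourcesSyncError.WithLabelValues(reason).Inc()
}
```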
But I have run into a problem. By introducing the reason label value, I am not picking up any metric values before the first error is recorded. I played with the query quite a bit, but I am unable to register a change from 0 -> 1 errors, which is essentially null -> 1 and isn't picked up by `rate()`.

The query I am trying to use for an alert: `rate(garbagecollector_controller_garbagecollector_resources_sync_error_total{}[1h]) > 0`
I am seeing two options:
- force-initialize the metric for all reasons with 0 at the beginning (a sketch follows below).
- revert to not using labels, which would make observability simpler. As I said before, the benefit of adding the reason is not that big.

thoughts?
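For the first option, a common Prometheus pattern is to touch each labeled series once at startup so it is exported at 0 and `rate()` can observe the first increment. A sketch, reusing the hypothetical reason constants from the earlier example:

```go
// Pre-create every reason-labeled series at 0 so the first error shows up as
// a 0 -> 1 transition instead of the series appearing out of nowhere at 1.
func initializeSyncErrorMetric() {
	for _, reason := range []string{ResourceDiscoveryFailure, InformerSyncFailure} {
		GarbageCollectorResourcesSyncError.WithLabelValues(reason).Add(0)
	}
}
```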
> I suppose it could be somewhat useful, but usually you want to go look at the logs once the error occurs.

If you don't have any use for the reason label in terms of alerting, I think it is correct to stick with the approach of looking into the logs when an error occurs, although it might be simpler to investigate if you have the reason in the metric.

> But I have run into a problem. By introducing the reason label value, I am not picking up any metric values before the first error is recorded. I played with the query quite a bit, but I am unable to register a change from 0 -> 1 errors, which is essentially null -> 1 and isn't picked up by rate().

Not seeing the error metric until an actual error occurs is normal monitoring behavior. For example, the apiserver request metric will only show metrics with a 500 status code once the apiserver has answered a request with 500.

Going from 0 -> 1 and from null -> 1 should behave the same in Prometheus, so something else might have gone wrong. Were you perhaps able to see the actual metric with the reason?
I am not sure what went wrong, but I did see the metric with the reason fine and tried variations of the query above.
Nevertheless, let's go with the simpler solution for now then.
/assign @caesarxuchao

looks like we are waiting on @caesarxuchao for review. thanks!
Force-pushed from aeb123e to 01e388c.
@caesarxuchao implemented feedback from @dgrisonnet
Sorry for the delay. Some minor comments and otherwise LGTM.
@logicalhan could you also take a look from the metrics perspective? Thanks.
/sig instrumentation
```go
GarbageCollectorResourcesSyncError = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Namespace: "garbagecollector",
		Subsystem: "controller",
```
Let's follow the style of other controllers, defining an xxxSubsystem const and not using Namespace, e.g., the replicaset/metrics package.
reverted back
```go
var registerMetrics sync.Once

// Register registers CronjobController metrics.
```
Can you update the "CronjobController" :)
done ^^
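For context, the registration pattern discussed in this file typically looks like the following sketch, assuming the usual legacyregistry wiring; the PR's actual file may differ slightly:

```go
import (
	"sync"

	"k8s.io/component-base/metrics/legacyregistry"
)

var registerMetrics sync.Once

// Register registers GarbageCollectorController metrics exactly once.
func Register() {
	registerMetrics.Do(func() {
		legacyregistry.MustRegister(GarbageCollectorResourcesSyncError)
	})
}
```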
```go
// @@ -203,6 +207,7 @@ func (gc *GarbageCollector) Sync(discoveryClient discovery.ServerResourcesInterf
newResources = GetDeletableResources(discoveryClient)
if len(newResources) == 0 {
	klog.V(2).Infof("no resources reported by discovery (attempt %d)", attempt)
	metrics.GarbageCollectorResourcesSyncError.WithLabelValues("resource_discovery_failure").Inc()
```
Can you define the label values as const in the metrics package?
done
@caesarxuchao resolved

Currently the PR is blocked on #106844 (comment).

@caesarxuchao @logicalhan @dgrisonnet I have decided to simplify the solution and return the metrics to the previous state. Please see #106844 (comment). I think better, user-friendly observability is more important than labels that are not all that beneficial. The details of the error are accessible in the KCM log if needed.

updated

The unit test failure does not seem related.

/lgtm

@caesarxuchao please re-review

/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: atiratree, deads2k. The full list of commands accepted by this bot can be found here. The pull request process is described here.
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:

You can:

/retest
3 similar comments
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is a complementary PR to #105705, which helps with debugging the garbage collector.
Adds a new metric `garbagecollector_controller_garbagecollector_resources_sync_error_total`, which together with the existing metrics `workqueue_retries_total{name="garbage_collector_attempt_to_delete"}` and `workqueue_retries_total{name="garbage_collector_attempt_to_orphan"}` can help with alerting when the GC cannot reconcile.

Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: