Improve observability of node authorizer #92466
Conversation
```go
var (
	// NodeAuthorizerActionsDuration is a histogram of duration of graph actions in node authorizer.
	NodeAuthorizerActionsDuration = metrics.NewHistogramVec(
```
is there a reason to export this? do we want things from other packages referencing this?
Done
```go
		Name:           "actions_duration_seconds",
		Help:           "Histogram of duration of graph actions in node authorizer.",
		StabilityLevel: metrics.ALPHA,
		// Start with 0.1ms with the last bucket being [~200ms, Inf)
		Buckets: metrics.ExponentialBuckets(0.0001, 2, 12),
```
cc @logicalhan @kubernetes/sig-instrumentation-pr-reviews for feedback on metric type/name/construction
actions seems a bit generic but the metric suffix is in line with current instrumentation guidance (i.e. duration_seconds)
```go
	)
)
```
```go
func init() {
```
Would you mind modifying this to the following signature?

```go
func RegisterMetrics(r Registry) {
	r.MustRegister(graphActionsDuration)
}
```
And then instead of importing this package to instantiate the metric as a side-effect, we can explicitly invoke the MustRegister function with the metric registry in some boot sequence code-path. This is the pattern we are trying to move towards. We do this currently in the kubelet (https://github.com/kubernetes/kubernetes/blob/v1.19.0-alpha.0/pkg/kubelet/server/server.go#L368-L373).
You can also just pass in the legacy registry instead of a custom one, if your metrics endpoint is already just using the legacyregistry. This will make it easier for us to get rid of the legacyregistry later on (which we intend to do, since it is deprecated).
I changed to this pattern with one difference: legacyregistry is a package, not an object, so I cannot call RegisterMetrics(legacyregistry). I changed the signature to accept the registration function to call, like RegisterMetrics(legacyregistry.MustRegister). Please let me know if this works.
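The dependency-injection pattern described above can be sketched with stdlib-only stand-ins (the `Collector` interface, `histogram` type, and metric name below are hypothetical placeholders for the real k8s.io/component-base/metrics types, not the actual PR code):

```go
package main

import "fmt"

// Collector is a hypothetical stand-in for the collector type accepted
// by legacyregistry.MustRegister; only the registration pattern is
// illustrated here.
type Collector interface{ Name() string }

type histogram struct{ name string }

func (h histogram) Name() string { return h.name }

// Placeholder metric name for illustration only.
var graphActionsDuration = histogram{name: "node_authorizer_graph_actions_duration_seconds"}

// RegisterMetrics accepts the registration function itself (e.g.
// legacyregistry.MustRegister), because legacyregistry is a package,
// not a Registry object that could be passed in.
func RegisterMetrics(mustRegister func(...Collector)) {
	mustRegister(graphActionsDuration)
}

func main() {
	var registered []string
	RegisterMetrics(func(cs ...Collector) {
		for _, c := range cs {
			registered = append(registered, c.Name())
		}
	})
	fmt.Println(registered)
}
```

The caller decides which registry receives the metric, which keeps the metrics package free of a hard dependency on any particular registry.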
It's not possible to do it that way with legacyregistry:
- I added a node.RegisterMetrics(legacyregistry.MustRegister) call to node/config (where we decide if we are going to use node authorizer)
- This node/config can be called multiple times in a single binary (e.g. in our tests)
- This triggers a panic due to multiple metric registration
- I had to add sync.Once to dedupe metric registration, but then, if RegisterMetrics is called twice with different metric registries, the second call is silently ignored and the metric is never registered in the second registry
- I changed this to the pattern used everywhere in kube-apiserver for legacyregistry: hardcoding legacyregistry in MustRegister and calling MustRegister from node/config, which seems to be the best option we currently have in kube-apiserver.
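The sync.Once guard described above can be sketched as follows (the counter is a stand-in for the real legacyregistry.MustRegister call, which panics on duplicate registration; this is an illustration of the pattern, not the PR's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// registrations counts how many times the (stand-in) registry saw the
// metric; with the real legacyregistry a second registration panics.
var registrations int

var registerOnce sync.Once

// RegisterMetrics hardcodes the registry (as the PR does with
// legacyregistry) and guards with sync.Once, so that node/config,
// which can run several times in one binary (e.g. in tests), does not
// trigger a duplicate-registration panic.
func RegisterMetrics() {
	registerOnce.Do(func() {
		registrations++ // stand-in for legacyregistry.MustRegister(...)
	})
}

func main() {
	RegisterMetrics()
	RegisterMetrics() // no-op: the metric is already registered
	fmt.Println(registrations)
}
```

The trade-off noted above still holds: once the registry is hardcoded, the function cannot register the metric into a second, different registry.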
Yeah, it's a bit entangled in the apiserver currently.
```go
		Help:           "Histogram of duration of graph actions in node authorizer.",
		StabilityLevel: metrics.ALPHA,
		// Start with 0.1ms with the last bucket being [~200ms, Inf)
		Buckets: metrics.ExponentialBuckets(0.0001, 2, 12),
```
Are you sure you don't need granularity for this metric above 200 milliseconds?
It should be enough. The Graph struct does purely in-memory operations, so 200 ms for a single call already seems way too long.
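The bucket layout under discussion can be checked numerically. The helper below mirrors the semantics of exponential bucket generation (start, factor, count), showing that ExponentialBuckets(0.0001, 2, 12) ends at 0.0001 × 2¹¹ = 0.2048 s, so the implicit +Inf bucket covers [~200ms, Inf):

```go
package main

import "fmt"

// exponentialBuckets mirrors the semantics of
// metrics.ExponentialBuckets(start, factor, count): count boundaries
// starting at start, each factor times the previous one.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	b := exponentialBuckets(0.0001, 2, 12)
	// Lowest boundary 0.1ms, highest explicit boundary ~200ms.
	fmt.Printf("first: %gs last: %gs\n", b[0], b[len(b)-1])
}
```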
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
PTAL
@mborsz can we get tests passing for review? At least one of these failures doesn't look like a flake at a glance
Agree - this one seems relevant: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/92466/pull-kubernetes-integration/1321459745499910144
Force-pushed from 14a65ca to 325efd5
/retest
Sorry, tests should be fixed now.
/assign @logicalhan
Friendly ping
/lgtm
/approve
Just one small comment from me.
```go
	start := time.Now()
	defer func() {
		graphActionsDuration.WithLabelValues("AddPod").Observe(time.Since(start).Seconds())
	}()
```
Why is DeletePod not instrumented?
Yes, good point. I added metrics to DeletePod and SetNodeConfigMap.
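The defer pattern from the diff above generalizes to all three graph actions. A stdlib-only sketch (the `observe` function and `timed` wrapper below are hypothetical stand-ins for `graphActionsDuration.WithLabelValues(action).Observe(...)`, not code from the PR):

```go
package main

import (
	"fmt"
	"time"
)

// observed records the actions seen; observe is a stand-in for
// graphActionsDuration.WithLabelValues(action).Observe(seconds).
var observed []string

func observe(action string, seconds float64) {
	observed = append(observed, action)
	_ = seconds
}

// timed applies the same defer pattern the PR uses in AddPod,
// DeletePod and SetNodeConfigMap: capture the start time, then record
// the elapsed duration when the graph action returns.
func timed(action string, fn func()) {
	start := time.Now()
	defer func() {
		observe(action, time.Since(start).Seconds())
	}()
	fn()
}

func main() {
	timed("AddPod", func() {})
	timed("DeletePod", func() {})
	fmt.Println(observed)
}
```

Deferring the observation ensures the duration is recorded on every return path, including early returns and panics recovered further up the stack.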
* Adding some metrics to the graph
* Adding log message when node authorizer has synced

Change-Id: I3447d6bc389a0b82ded1db2a7a4ae41d79486c2b
/lgtm [Based on Han's approval from the sig-instrumentation perspective]
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: logicalhan, mborsz, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind feature
What this PR does / why we need it:
It makes it possible to check whether the node authorizer is working and able to process all incoming events.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @wojtek-t
/assign @liggitt