Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve observability of node authorizer #92466

Merged
merged 1 commit into from Nov 10, 2020

Conversation

mborsz
Copy link
Member

@mborsz mborsz commented Jun 24, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:
It make it possible to check if node authorizer is working and if it's able to process all incoming events.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add node_authorizer_actions_duration_seconds metric that can be used to estimate load to node authorizer.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/assign @wojtek-t
/assign @liggitt

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jun 24, 2020
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 24, 2020
@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 24, 2020

var (
// NodeAuthorizerActionsDuration is a histogram of duration of graph actions in node authorizer.
NodeAuthorizerActionsDuration = metrics.NewHistogramVec(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to export this? do we want things from other packages referencing this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 34 to 36
Name: "actions_duration_seconds",
Help: "Histogram of duration of graph actions in node authorizer.",
StabilityLevel: metrics.ALPHA,
// Start with 0.1ms with the last bucket being [~200ms, Inf)
Buckets: metrics.ExponentialBuckets(0.0001, 2, 12),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @logicalhan @kubernetes/sig-instrumentation-pr-reviews for feedback on metric type/name/construction

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actions seems a bit generic but the metric suffix is in line with current instrumentation guidance (i.e. duration_seconds)

plugin/pkg/auth/authorizer/node/graph.go Outdated Show resolved Hide resolved
)
)

func init() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind modifying this to the following signature?

func RegisterMetrics(r Registry) {
    r.MustRegister(graphActionsDuration)
}

And then instead of importing this package to instantiate the metric as a side-effect, we can explicitly invoke the MustRegister function with the metric registry in some boot sequence code-path. This is the pattern we are trying to move towards. We do this currently in the kubelet (https://github.com/kubernetes/kubernetes/blob/v1.19.0-alpha.0/pkg/kubelet/server/server.go#L368-L373).

You can also just pass in the legacy registry instead of a custom one, if your metrics endpoint is already just using the legacyregistry. This will make it easier for us to get rid of the legacyregistry later on (which we intend to do, since it is deprecated).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to this pattern with one diff: legancyregistry is a package and not an object, so I cannot call RegisterMetrics(legacyregistry). I changed the signature to accept function to call, like: RegisterMetrics(legacyregistry.MustRegister). Please let me know if this works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not possible to do that this way with legacyregistry:

  • I added a node.RegisterMetrics(legacyregistry.MustRegistry) call to node/config (where we decide if we are going to use node authorizer)
  • This node/config can be called multiple times in a single binary (e.g. in our tests)
  • This triggers panic due to multiple metric registration
  • I have to add sync.Once to dedupe metrics registration, but this way, if RegisterMetric will be called twice for multiple metric registries, the second call will be completely ignored leading to not registering this
  • I changed this to the pattern that is used everywhere in kube-apiserver for legacyregistry: hardcoding legacyregistry in MustRegister and calling MustRegister on the node/config which seems to be the best option we have currently in kube-apiserver.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a bit entangled in the apiserver currently.

Help: "Histogram of duration of graph actions in node authorizer.",
StabilityLevel: metrics.ALPHA,
// Start with 0.1ms with the last bucket being [~200ms, Inf)
Buckets: metrics.ExponentialBuckets(0.0001, 2, 12),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you don't need granularity for this metric above 200 milliseconds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should. Graph struct is doing purely in-memory operations, so 200 ms for a single call seems to be way too much.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 23, 2020
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 23, 2020
@mborsz
Copy link
Member Author

mborsz commented Oct 28, 2020

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 28, 2020
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Oct 28, 2020
@mborsz
Copy link
Member Author

mborsz commented Oct 28, 2020

PTAL

@ehashman
Copy link
Member

@mborsz can we get tests passing for review? At least one of these failures doesn't look like a flake at a glance

@wojtek-t
Copy link
Member

Agree - this one seems relevant: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/92466/pull-kubernetes-integration/1321459745499910144
@mborsz - can you please take a look?

@mborsz mborsz force-pushed the nmetrics branch 2 times, most recently from 14a65ca to 325efd5 Compare November 5, 2020 10:53
@mborsz
Copy link
Member Author

mborsz commented Nov 5, 2020

/retest

@mborsz
Copy link
Member Author

mborsz commented Nov 5, 2020

Sorry, tests should be fixed now.

@ehashman
Copy link
Member

ehashman commented Nov 5, 2020

/assign @logicalhan

@mborsz
Copy link
Member Author

mborsz commented Nov 9, 2020

Friendly ping

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2020
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small comment from me.

start := time.Now()
defer func() {
graphActionsDuration.WithLabelValues("AddPod").Observe(time.Since(start).Seconds())
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is DeletePod not intstrumented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. I added metrics to DeletePod and SetNodeConfigMap.

* Adding some metrics to the graph
* Adding log message when node authorizer has synced

Change-Id: I3447d6bc389a0b82ded1db2a7a4ae41d79486c2b
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2020
@wojtek-t
Copy link
Member

/lgtm
/approve

[Based on Han`s approval from sig-instrumentation perspective]

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: logicalhan, mborsz, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 10, 2020
@k8s-ci-robot k8s-ci-robot merged commit 4628c60 into kubernetes:master Nov 10, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants