Add rbac to /metrics by mjudeikis · Pull Request #4165 · kcp-dev/kcp

mjudeikis · 2026-05-28T07:12:59Z

Summary

Reject workspace-scoped /clusters/<ws>/metrics and cache-server /services/cache/shards/<sh>/clusters/<ws>/metrics with 501 Not Implemented. Metrics are shard-wide and have no per-workspace meaning; allowing them via a workspace ClusterRole + binding was a privilege escalation.

We might want to implement per-workspace metrics down the line. For now, no.

Top-level <shard>:6443/metrics is now authorized via root-workspace RBAC: a new bootstrap ClusterRole system:kcp:metrics-reader grants GET on /metrics, and a kcp-admin binds it once in :root for an identity of their choice. The binding is replicated to every shard via the cache server, so a single root binding covers all shards.
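To make the authorization rule concrete, here is a simplified model of how a non-resource URL rule such as the one in system:kcp:metrics-reader would be evaluated. It is a sketch, not the Kubernetes or kcp authorizer: the `policyRule` and `roleBinding` types, the `allowed` function, and the subject names are all hypothetical, and the matching is reduced to exact match plus a trailing `*` wildcard.

```go
package main

import (
	"fmt"
	"strings"
)

// policyRule is a stand-in for an RBAC rule limited to non-resource URLs.
type policyRule struct {
	Verbs           []string
	NonResourceURLs []string
}

// roleBinding ties one subject to one rule; real bindings reference a role by name.
type roleBinding struct {
	Subject string
	Rule    policyRule
}

// urlMatches accepts an exact match, a bare "*", or a trailing "*" prefix match.
func urlMatches(pattern, path string) bool {
	if pattern == "*" || pattern == path {
		return true
	}
	return strings.HasSuffix(pattern, "*") &&
		strings.HasPrefix(path, strings.TrimSuffix(pattern, "*"))
}

// contains reports whether want is an element of list.
func contains(list []string, want string) bool {
	for _, v := range list {
		if v == want {
			return true
		}
	}
	return false
}

// allowed reports whether any binding grants subject the verb on path.
func allowed(bindings []roleBinding, subject, verb, path string) bool {
	for _, b := range bindings {
		if b.Subject != subject || !contains(b.Rule.Verbs, verb) {
			continue
		}
		for _, p := range b.Rule.NonResourceURLs {
			if urlMatches(p, path) {
				return true
			}
		}
	}
	return false
}

func main() {
	// Models the bootstrap role from the summary: GET on /metrics only.
	metricsReader := policyRule{Verbs: []string{"get"}, NonResourceURLs: []string{"/metrics"}}
	// Models the single binding created in :root; "prometheus" is an example subject.
	rootBindings := []roleBinding{{Subject: "prometheus", Rule: metricsReader}}

	fmt.Println(allowed(rootBindings, "prometheus", "get", "/metrics"))
	fmt.Println(allowed(rootBindings, "alice", "get", "/metrics"))
}
```

The point of the model is that only bindings in the root workspace are consulted for the top-level path; a binding created inside any other workspace never enters `rootBindings`.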

Cache server top-level /metrics is now reachable as well (previously, WithShardScope returned 400 for any path without a /shards/ prefix).

Fixes #4062

What Type of PR Is This?

/kind feature

Related Issue(s)

Fixes #

Release Notes

Implement rbac for /metrics

Reject workspace-scoped /clusters/<ws>/metrics and cache-server /services/cache/shards/<sh>/clusters/<ws>/metrics with 501 Not Implemented. Metrics are shard-wide and have no per-workspace meaning; allowing them via a workspace ClusterRole + binding was a privilege escalation.

Top-level <shard>:6443/metrics is now authorized via root-workspace RBAC: a new bootstrap ClusterRole system:kcp:metrics-reader grants GET on /metrics, and a kcp-admin binds it once in :root for an identity of their choice. The binding is replicated to every shard via the cache server, so a single root binding covers all shards.

Cache server top-level /metrics is now reachable as well (previously WithShardScope returned 400 for any path without a /shards/ prefix).

Fixes kcp-dev#4062

Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>

The current /metrics handler exposes shard-wide data with no per-workspace or per-shard meaning, so workspace-scoped variants are rejected with 501. Note this is a placeholder: when per-workspace metrics are implemented in the future, the URL contract stays the same and the handler fills in.

Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>

Five subtests exercise the new behavior end-to-end:

- workspace-scoped /clusters/<ws>/metrics is 501'd for an unprivileged user even when they hold a ClusterRole+Binding granting nonResourceURLs: [/metrics] inside that workspace
- workspace-scoped /clusters/<ws>/metrics is 501'd for system:masters too (the filter runs before authz; no escape hatch)
- top-level /metrics for a user with no root binding is 403'd
- top-level /metrics for a user bound to system:kcp:metrics-reader in :root scrapes successfully and returns prometheus-format output
- top-level /metrics for system:masters keeps working (regression guard)

Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>

mjudeikis · 2026-05-29T06:38:58Z

/retest

ntnn · 2026-05-29T08:20:34Z

+kcp exposes Prometheus metrics on the `/metrics` endpoint of every shard and of
+the cache server. These are **shard-wide** resources: a single scrape returns


Suggested change

-kcp exposes Prometheus metrics on the `/metrics` endpoint of every shard and of
-the cache server. These are **shard-wide** resources: a single scrape returns
+kcp exposes Prometheus metrics on the `/metrics` endpoint of every shard and on
+the cache server. These are **shard-wide** resources: a single scrape returns

ntnn · 2026-05-29T08:22:48Z

+return `501 Not Implemented`. Today this is a placeholder: per-workspace and
+per-shard metrics are not yet implemented, and the data the underlying
+`/metrics` handler exposes is shard-wide with no per-workspace or per-shard
+meaning. The kcp HTTP filter rejects these requests before authorization runs
+so that a future implementation can fill in real workspace-scoped metrics
+without changing the URL contract.


This feels like it weaves three things together at once, which makes it confusing to read for people who don't already know what's going on:

- /metrics on workspaces and VWs returns 501
- /metrics on the shard/cache server is aggregated over all resources
- Discussion of future implementations of per-workspace metrics

I think these should be separate topics and not in one blob of text.

ntnn · 2026-05-29T08:24:17Z

+kcp ships a bootstrap `ClusterRole` named `system:kcp:metrics-reader` that grants
+`GET` on `/metrics`. To allow an identity to scrape every shard, create a
+`ClusterRoleBinding` in the `:root` workspace. The binding is replicated to all
+shards via the cache server, so a single binding is enough.


Arguably you can only do a single binding because there's only one root workspace on the root shard. Maybe reword the last sentence a bit?

ntnn · 2026-05-29T08:24:27Z

+    name: prometheus
+```
+
+Apply it against the root workspace:


Suggested change

-Apply it against the root workspace:
+Apply it to the root workspace:

ntnn · 2026-05-29T08:27:30Z

+### Note: workspace-local `nonResourceURLs: /metrics` no longer works
+
+Earlier kcp releases accidentally allowed a workspace administrator to grant
+themselves access to shard metrics by creating a `ClusterRole` with
+`nonResourceURLs: ["/metrics"]` and a binding inside their own workspace. This
+was a privilege escalation: the data exposed is shard-wide, not workspace
+content. The path is now rejected at the workspace scope and the only
+authoritative binding is one created in `:root`.
+


Should we put this into documentation for people setting up metrics now? If someone used this and it broke I'd guess they check at least the existing issues and then see that it was a security issue.

Wdym? This is a docs page.

- docs/metrics.md: split into separate sections for 501 rejection, what /metrics aggregates, and future per-workspace metrics; minor reword.
- shardpaths: include /livez, /readyz, /healthz alongside /metrics.
- WithShardLevelPaths: reject /clusters/root/<path> too; only the bare URL is valid since the data is shard-wide.
- cache server: drop WithShardScope's hardcoded probe special-case now that shardpaths covers it.
- tests: extend filter unit tests; add cache-server e2e check that scoped /metrics forms return 501.

ntnn

/lgtm
/approve

kcp-ci-bot · 2026-06-01T07:20:05Z

LGTM label has been added.

Details

Git tree hash: 2690099ce6ed04590c78c0ce2c4929fba8739604

…heck

- Revert shardpaths expansion. Probes must stay reachable via /clusters/<ws>/{livez,readyz,healthz}; TestAuthorizer asserts this. Restore the hardcoded probe special-case in cache server's WithShardScope.
- e2e: use a raw HTTP client for the cache /metrics rejection check; rest.Request.StatusCode() did not capture 501 from a plain-text body.

ntnn

/lgtm
/approve

kcp-ci-bot · 2026-06-03T07:44:57Z

LGTM label has been added.

Details

Git tree hash: 1ee003638cd141bcc13df02642358444e9f9495d

mjudeikis · 2026-06-03T08:33:14Z

/retest

mjudeikis · 2026-06-03T11:46:52Z

/retest

quota bug

mjudeikis · 2026-06-03T12:04:58Z

/retest
OCI is funky again

mjudeikis · 2026-06-03T12:47:47Z

/retest

ntnn

/lgtm
/approve

kcp-ci-bot · 2026-06-03T12:57:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ntnn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [ntnn]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mjudeikis added 3 commits May 28, 2026 09:45

lint

f8df79c

ntnn reviewed May 29, 2026

mjudeikis added 2 commits June 1, 2026 10:02

update docs

8317bf9

ntnn approved these changes Jun 1, 2026

kcp-ci-bot assigned ntnn Jun 1, 2026

kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2026

kcp-ci-bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) and removed the lgtm label (Indicates that a PR is ready to be merged.) Jun 1, 2026

kcp-ci-bot requested a review from ntnn June 1, 2026 08:06

mjudeikis force-pushed the fix.metrics branch from 7992303 to 1662c5f Compare June 3, 2026 07:30

ntnn approved these changes Jun 3, 2026

kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2026

mjudeikis added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 3, 2026

ntnn added this to tbd Jun 3, 2026

ntnn moved this to Reviewing in tbd Jun 3, 2026

ntnn approved these changes Jun 3, 2026

kcp-ci-bot merged commit 83a6f11 into kcp-dev:main Jun 3, 2026
14 checks passed

github-project-automation Bot moved this from Reviewing to Done in tbd Jun 3, 2026

3 participants