Skip to content

Add rbac to /metrics #4165

Merged
kcp-ci-bot merged 7 commits into
kcp-dev:mainfrom
mjudeikis:fix.metrics
Jun 3, 2026
Merged

Add rbac to /metrics #4165
kcp-ci-bot merged 7 commits into
kcp-dev:mainfrom
mjudeikis:fix.metrics

Conversation

@mjudeikis
Copy link
Copy Markdown
Contributor

Summary

Reject workspace-scoped /clusters//metrics and cache-server /services/cache/shards//clusters//metrics with 501 Not Implemented. Metrics are shard-wide and have no per-workspace meaning; allowing them via a workspace ClusterRole + binding was a privilege escalation.

We might want to implement this down the line. For now - no.

Top-level :6443/metrics is now authorized via root-workspace RBAC: a new bootstrap ClusterRole system:kcp:metrics-reader grants GET on /metrics, and a kcp-admin binds it once in :root for an identity of their choice. The binding is replicated to every shard via the cache server, so a single root binding covers all shards.

Cache server top-level /metrics is now reachable as well (previously, WithShardScope returned 400 for any path without a /shards/ prefix).

Fixes #4062

What Type of PR Is This?

/kind feature

Related Issue(s)

Fixes #

Release Notes

Implement rbac for /metrics

mjudeikis added 3 commits May 28, 2026 09:45
Reject workspace-scoped /clusters/<ws>/metrics and cache-server
/services/cache/shards/<sh>/clusters/<ws>/metrics with 501 Not
Implemented. Metrics are shard-wide and have no per-workspace meaning;
allowing them via a workspace ClusterRole + binding was a privilege
escalation.

Top-level <shard>:6443/metrics is now authorized via root-workspace
RBAC: a new bootstrap ClusterRole system:kcp:metrics-reader grants GET
on /metrics, and a kcp-admin binds it once in :root for an identity of
their choice. The binding is replicated to every shard via the cache
server, so a single root binding covers all shards.

Cache server top-level /metrics is now reachable as well (previously
WithShardScope returned 400 for any path without a /shards/ prefix).

Fixes kcp-dev#4062

Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>
The current /metrics handler exposes shard-wide data with no per-workspace
or per-shard meaning, so workspace-scoped variants are rejected with 501.
Note this is a placeholder: when per-workspace metrics are implemented in
the future, the URL contract stays the same and the handler fills in.

Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>
Five subtests exercise the new behavior end-to-end:

- workspace-scoped /clusters/<ws>/metrics is 501'd for an unprivileged
  user even when they hold a ClusterRole+Binding granting
  nonResourceURLs: [/metrics] inside that workspace
- workspace-scoped /clusters/<ws>/metrics is 501'd for system:masters
  too (the filter runs before authz; no escape hatch)
- top-level /metrics for a user with no root binding is 403'd
- top-level /metrics for a user bound to system:kcp:metrics-reader in
  :root scrapes successfully and returns prometheus-format output
- top-level /metrics for system:masters keeps working (regression
  guard)

Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>
@kcp-ci-bot kcp-ci-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the DCO. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 28, 2026
@mjudeikis
Copy link
Copy Markdown
Contributor Author

/retest

Comment on lines +3 to +4
kcp exposes Prometheus metrics on the `/metrics` endpoint of every shard and of
the cache server. These are **shard-wide** resources: a single scrape returns
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kcp exposes Prometheus metrics on the `/metrics` endpoint of every shard and of
the cache server. These are **shard-wide** resources: a single scrape returns
kcp exposes Prometheus metrics on the `/metrics` endpoint of every shard and on
the cache server. These are **shard-wide** resources: a single scrape returns

Comment on lines +31 to +36
return `501 Not Implemented`. Today this is a placeholder: per-workspace and
per-shard metrics are not yet implemented, and the data the underlying
`/metrics` handler exposes is shard-wide with no per-workspace or per-shard
meaning. The kcp HTTP filter rejects these requests before authorization runs
so that a future implementation can fill in real workspace-scoped metrics
without changing the URL contract.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it waves three things at once and makes it confusing to read for people who don't already know whats going on.

  1. /metrics on workspaces and VWs returns 501
  2. /metrics on the shard/cache server is aggregated over all resources
  3. Discussing future implementations for a per-workspace

I think these should be separate topics and not in one blob of text.

Comment on lines +40 to +43
kcp ships a bootstrap `ClusterRole` named `system:kcp:metrics-reader` that grants
`GET` on `/metrics`. To allow an identity to scrape every shard, create a
`ClusterRoleBinding` in the `:root` workspace. The binding is replicated to all
shards via the cache server, so a single binding is enough.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably you can only do a single binding because there's only one root workspace on the root shard. Maybe reword the last sentence a bit?

name: prometheus
```

Apply it against the root workspace:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Apply it against the root workspace:
Apply it to the root workspace:

Comment on lines +78 to +86
### Note: workspace-local `nonResourceURLs: /metrics` no longer works

Earlier kcp releases accidentally allowed a workspace administrator to grant
themselves access to shard metrics by creating a `ClusterRole` with
`nonResourceURLs: ["/metrics"]` and a binding inside their own workspace. This
was a privilege escalation: the data exposed is shard-wide, not workspace
content. The path is now rejected at the workspace scope and the only
authoritative binding is one created in `:root`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put this into documentation for people setting up metrics now? If someone used this and it broke I'd guess they check at least the existing issues and then see that it was a security issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wdym? This is docs page.

Comment thread pkg/server/filters/shardlevelpaths.go Outdated
Comment thread pkg/cache/server/handler.go
Comment thread test/e2e/authorizer/shardmetrics_test.go
mjudeikis added 2 commits June 1, 2026 10:02
- docs/metrics.md: split into separate sections for 501 rejection, what
  /metrics aggregates, and future per-workspace metrics; minor reword.
- shardpaths: include /livez, /readyz, /healthz alongside /metrics.
- WithShardLevelPaths: reject /clusters/root/<path> too; only the bare URL
  is valid since the data is shard-wide.
- cache server: drop WithShardScope's hardcoded probe special-case now
  that shardpaths covers it.
- tests: extend filter unit tests; add cache-server e2e check that scoped
  /metrics forms return 501.
Copy link
Copy Markdown
Member

@ntnn ntnn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2026
@kcp-ci-bot
Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 2690099ce6ed04590c78c0ce2c4929fba8739604

@kcp-ci-bot kcp-ci-bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm Indicates that a PR is ready to be merged. labels Jun 1, 2026
@kcp-ci-bot kcp-ci-bot requested a review from ntnn June 1, 2026 08:06
…heck

- Revert shardpaths expansion. Probes must stay reachable via
  /clusters/<ws>/{livez,readyz,healthz}; TestAuthorizer asserts this.
  Restore the hardcoded probe special-case in cache server's WithShardScope.
- e2e: use a raw HTTP client for the cache /metrics rejection check;
  rest.Request.StatusCode() did not capture 501 from a plain-text body.
Copy link
Copy Markdown
Member

@ntnn ntnn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2026
@kcp-ci-bot
Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 1ee003638cd141bcc13df02642358444e9f9495d

@mjudeikis mjudeikis added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 3, 2026
@mjudeikis
Copy link
Copy Markdown
Contributor Author

/retest

@ntnn ntnn added this to tbd Jun 3, 2026
@ntnn ntnn moved this to Reviewing in tbd Jun 3, 2026
@mjudeikis
Copy link
Copy Markdown
Contributor Author

/retest

quota bug

@mjudeikis
Copy link
Copy Markdown
Contributor Author

/retest
OCI is funky again

@mjudeikis
Copy link
Copy Markdown
Contributor Author

/retest

Copy link
Copy Markdown
Member

@ntnn ntnn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@kcp-ci-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ntnn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot merged commit 83a6f11 into kcp-dev:main Jun 3, 2026
14 checks passed
@github-project-automation github-project-automation Bot moved this from Reviewing to Done in tbd Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has signed the DCO. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: accessing shard metrics is handled on workspace level

3 participants