Add rbac to /metrics #4165
Reject workspace-scoped /clusters/<ws>/metrics and cache-server /services/cache/shards/<sh>/clusters/<ws>/metrics with 501 Not Implemented. Metrics are shard-wide and have no per-workspace meaning; allowing them via a workspace ClusterRole + binding was a privilege escalation.

Top-level <shard>:6443/metrics is now authorized via root-workspace RBAC: a new bootstrap ClusterRole system:kcp:metrics-reader grants GET on /metrics, and a kcp-admin binds it once in :root for an identity of their choice. The binding is replicated to every shard via the cache server, so a single root binding covers all shards.

Cache server top-level /metrics is now reachable as well (previously WithShardScope returned 400 for any path without a /shards/ prefix).

Fixes kcp-dev#4062

Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>
The current /metrics handler exposes shard-wide data with no per-workspace or per-shard meaning, so workspace-scoped variants are rejected with 501. Note this is a placeholder: when per-workspace metrics are implemented in the future, the URL contract stays the same and the handler fills in.

Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>
Five subtests exercise the new behavior end-to-end:
- workspace-scoped /clusters/<ws>/metrics is 501'd for an unprivileged user even when they hold a ClusterRole+Binding granting nonResourceURLs: [/metrics] inside that workspace
- workspace-scoped /clusters/<ws>/metrics is 501'd for system:masters too (the filter runs before authz; no escape hatch)
- top-level /metrics for a user with no root binding is 403'd
- top-level /metrics for a user bound to system:kcp:metrics-reader in :root scrapes successfully and returns prometheus-format output
- top-level /metrics for system:masters keeps working (regression guard)

Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>
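For readers following along without the diff, here is a minimal sketch of the kind of path filter these commits describe: a handler wrapper that answers workspace-scoped `/metrics` requests with 501 before anything else runs, and passes the bare `/metrics` through. This is not kcp's implementation. The names `isScopedMetricsPath`, `withScopedMetricsRejected`, `newDemoHandler`, and `statusFor`, the stub handler that always returns 200, and the example workspace and shard names are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// isScopedMetricsPath reports whether path is /metrics under a
// /clusters/<ws>/ prefix, optionally behind the cache server's
// /services/cache/shards/<sh> prefix. Hypothetical helper for illustration.
func isScopedMetricsPath(path string) bool {
	const cachePrefix = "/services/cache/shards/"
	if strings.HasPrefix(path, cachePrefix) {
		rest := strings.TrimPrefix(path, cachePrefix)
		i := strings.Index(rest, "/")
		if i < 0 {
			return false
		}
		path = rest[i:] // drop the shard name, keep "/clusters/..."
	}
	const clusterPrefix = "/clusters/"
	if !strings.HasPrefix(path, clusterPrefix) {
		return false
	}
	rest := strings.TrimPrefix(path, clusterPrefix)
	i := strings.Index(rest, "/")
	// Require a non-empty workspace name followed by exactly "/metrics".
	return i > 0 && rest[i:] == "/metrics"
}

// withScopedMetricsRejected wraps next and short-circuits scoped /metrics
// requests with 501 Not Implemented (hypothetical name).
func withScopedMetricsRejected(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isScopedMetricsPath(r.URL.Path) {
			http.Error(w, "workspace-scoped /metrics is not implemented", http.StatusNotImplemented)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newDemoHandler wires the filter in front of a stub that always returns 200.
// In a real server the wrapped chain would include authentication and authorization.
func newDemoHandler() http.Handler {
	stub := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return withScopedMetricsRejected(stub)
}

// statusFor sends an in-memory GET for path to h and returns the status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := newDemoHandler()
	for _, p := range []string{
		"/metrics",
		"/clusters/root:org/metrics",
		"/services/cache/shards/shard-1/clusters/root/metrics",
		"/clusters/root/livez",
	} {
		fmt.Println(p, statusFor(h, p))
	}
}
```

Because the wrapper sits outside the stub, the 501 is returned regardless of what the wrapped chain would have decided, which mirrors the "filter runs before authz" point in the subtest list.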
/retest
> kcp exposes Prometheus metrics on the `/metrics` endpoint of every shard and of
> the cache server. These are **shard-wide** resources: a single scrape returns
Suggested change:
- kcp exposes Prometheus metrics on the `/metrics` endpoint of every shard and of
- the cache server. These are **shard-wide** resources: a single scrape returns
+ kcp exposes Prometheus metrics on the `/metrics` endpoint of every shard and on
+ the cache server. These are **shard-wide** resources: a single scrape returns
> return `501 Not Implemented`. Today this is a placeholder: per-workspace and
> per-shard metrics are not yet implemented, and the data the underlying
> `/metrics` handler exposes is shard-wide with no per-workspace or per-shard
> meaning. The kcp HTTP filter rejects these requests before authorization runs
> so that a future implementation can fill in real workspace-scoped metrics
> without changing the URL contract.
This feels like it weaves three things together at once, which makes it confusing to read for people who don't already know what's going on:
- `/metrics` on workspaces and VWs returns 501
- `/metrics` on the shard/cache server is aggregated over all resources
- Discussing future implementations for a per-workspace `/metrics`

I think these should be separate topics and not in one blob of text.
> kcp ships a bootstrap `ClusterRole` named `system:kcp:metrics-reader` that grants
> `GET` on `/metrics`. To allow an identity to scrape every shard, create a
> `ClusterRoleBinding` in the `:root` workspace. The binding is replicated to all
> shards via the cache server, so a single binding is enough.
Arguably you can only do a single binding because there's only one root workspace on the root shard. Maybe reword the last sentence a bit?
> name: prometheus
> ```
>
> Apply it against the root workspace:
Suggested change:
- Apply it against the root workspace:
+ Apply it to the root workspace:
> ### Note: workspace-local `nonResourceURLs: /metrics` no longer works
>
> Earlier kcp releases accidentally allowed a workspace administrator to grant
> themselves access to shard metrics by creating a `ClusterRole` with
> `nonResourceURLs: ["/metrics"]` and a binding inside their own workspace. This
> was a privilege escalation: the data exposed is shard-wide, not workspace
> content. The path is now rejected at the workspace scope and the only
> authoritative binding is one created in `:root`.
Should we put this into documentation for people setting up metrics now? If someone used this and it broke I'd guess they check at least the existing issues and then see that it was a security issue.
Wdym? This is docs page.
- docs/metrics.md: split into separate sections for 501 rejection, what /metrics aggregates, and future per-workspace metrics; minor reword.
- shardpaths: include /livez, /readyz, /healthz alongside /metrics.
- WithShardLevelPaths: reject /clusters/root/<path> too; only the bare URL is valid since the data is shard-wide.
- cache server: drop WithShardScope's hardcoded probe special-case now that shardpaths covers it.
- tests: extend filter unit tests; add cache-server e2e check that scoped /metrics forms return 501.
LGTM label has been added. Git tree hash: 2690099ce6ed04590c78c0ce2c4929fba8739604
…heck
- Revert shardpaths expansion. Probes must stay reachable via
/clusters/<ws>/{livez,readyz,healthz}; TestAuthorizer asserts this.
Restore the hardcoded probe special-case in cache server's WithShardScope.
- e2e: use a raw HTTP client for the cache /metrics rejection check;
rest.Request.StatusCode() did not capture 501 from a plain-text body.
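A rough sketch of what a raw-HTTP status check of this kind can look like, using only `net/http` against a local stub. It is not the e2e test from this PR: the stub server that always answers 501 with a plain-text body, the helper names `rawStatus`, `newStubServer`, and `checkScopedMetricsRejected`, and the shard name `shard-1` in the URL are assumptions for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// rawStatus issues a plain GET and returns the status code and body as-is,
// without any client-side decoding of the response payload.
func rawStatus(client *http.Client, url string) (int, string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, "", err
	}
	return resp.StatusCode, string(body), nil
}

// newStubServer stands in for the cache server (assumption): every request
// gets 501 with a plain-text body.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "not implemented", http.StatusNotImplemented)
	}))
}

// checkScopedMetricsRejected runs the check against a fresh stub server and
// returns the observed status code.
func checkScopedMetricsRejected() (int, error) {
	srv := newStubServer()
	defer srv.Close()
	code, _, err := rawStatus(srv.Client(), srv.URL+"/services/cache/shards/shard-1/clusters/root/metrics")
	return code, err
}

func main() {
	code, err := checkScopedMetricsRejected()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("status:", code, "rejected:", code == http.StatusNotImplemented)
}
```

Reading `resp.StatusCode` directly keeps the assertion independent of how the body is formatted, which is the property the commit message is after.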
LGTM label has been added. Git tree hash: 1ee003638cd141bcc13df02642358444e9f9495d
/retest
/retest quota bug
/retest
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ntnn
Summary
Reject workspace-scoped /clusters/<ws>/metrics and cache-server /services/cache/shards/<sh>/clusters/<ws>/metrics with 501 Not Implemented. Metrics are shard-wide and have no per-workspace meaning; allowing them via a workspace ClusterRole + binding was a privilege escalation.
We might want to implement this down the line. For now - no.
Top-level <shard>:6443/metrics is now authorized via root-workspace RBAC: a new bootstrap ClusterRole `system:kcp:metrics-reader` grants GET on `/metrics`, and a kcp-admin binds it once in `:root` for an identity of their choice. The binding is replicated to every shard via the cache server, so a single root binding covers all shards.

Cache server top-level /metrics is now reachable as well (previously, WithShardScope returned 400 for any path without a /shards/ prefix).
Fixes #4062
What Type of PR Is This?
/kind feature
Related Issue(s)
Fixes #
Release Notes