SREP-4866: Add RHOBS MCP server for AI agent integration#892
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds stdio MCP support for RHOBS: introduces MCP helper utilities, updates skip-version allowlist to include ChangesMCP Server Integration for RHOBS
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (3)
cmd/rhobs/mcp_test.go (1)
232-264: 🏗️ Heavy liftThese tests don’t validate the behaviors named in the test titles.
TestHandleLogs_InvalidSinceDurationandTestHandleLogs_LimitCappingboth fail earlier on fetcher/cluster lookup and only assert a generic tool error. That means invalidsinceparsing and limit capping can regress without failing these tests. Please make these paths deterministic (e.g., stub fetcher or validate these inputs before fetcher resolution) and assert the specific expected outcome.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/rhobs/mcp_test.go` around lines 232 - 264, The tests TestHandleLogs_InvalidSinceDuration and TestHandleLogs_LimitCapping are not exercising the behaviors in their names because handleLogs currently fails earlier during fetcher/cluster lookup; make the path deterministic by either (A) injecting or stubbing a fetcher that returns a successful, controlled response for the test cluster_id (e.g., create a mock fetcher and pass it into handleLogs or the request context so the handler proceeds to parse the "since" and apply limit capping), or (B) move/duplicate the input validation logic into handleLogs before any fetcher/cluster resolution so parsing of "since" (use the same parser called in handleLogs) and capping of "limit" are exercised; update TestHandleLogs_InvalidSinceDuration to assert a parse error/result from the since parsing code and TestHandleLogs_LimitCapping to assert the capped limit value (referencing handleLogs, makeRequest, and the cluster/fetcher resolution logic to locate where to inject the mock or relocate validation).cmd/rhobs/mcpCmd.go (1)
35-37: ⚡ Quick winRemove the no-op
PersistentPreRunEblock for code clarity.Line 35–37 contains an empty
PersistentPreRunEthat adds no value. WithEnableTraverseRunHooks = trueexplicitly set inmain.go, the parent hook inrootCmd.gowill execute correctly and does not need a child hook placeholder.Proposed diff
- PersistentPreRunE: func(cmd *cobra.Command, args []string) error { - return nil - },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/rhobs/mcpCmd.go` around lines 35 - 37, Remove the no-op PersistentPreRunE function from the command definition (the anonymous func assigned to PersistentPreRunE in cmd/rhobs/mcpCmd.go) since it does nothing and parent hooks run via EnableTraverseRunHooks; simply delete the PersistentPreRunE field so the command relies on the parent hook (rootCmd) behavior.cmd/rhobs/mcp_helpers.go (1)
13-23: ⚡ Quick winAvoid duplicate fetcher creation under concurrent cache misses.
Two concurrent requests for the same key can both pass
Loadand both executeCreateRhobsFetcher(...)before eitherStore, causing redundant OCM/Vault work.Suggested direction
+// Option A: guard initialization with per-key singleflight +// Option B: protect create/store with a mutex and double-check Load before StoreA per-key singleflight around
CreateRhobsFetcheris the cleanest way to preserve cache intent during parallel MCP calls.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/rhobs/mcp_helpers.go` around lines 13 - 23, getCachedFetcher currently races on concurrent cache misses: multiple goroutines can Load the same key, then each call CreateRhobsFetcher and Store duplicate entries; wrap the CreateRhobsFetcher call with a per-key singleflight to ensure only one creation runs for a given key, using the key used with fetcherCache; on singleflight result, cast/return the *RhobsFetcher and only Store into fetcherCache once (or rely on the singleflight result) so redundant OCM/Vault work is avoided.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/rhobs/mcp_test.go`:
- Around line 268-272: TestRegisterMcpTools is vacuous; after calling
registerMcpTools(s) use the server's introspection API (e.g.,
s.RegisteredTools(), s.ListTools(), or the s.Tools slice/field) to assert tools
were actually registered: check the tool count is > 0 and that the expected tool
names from registerMcpTools are present. Update TestRegisterMcpTools to call
that introspection method on the server.NewMCPServer instance and use
t.Fatalf/t.Errorf or require/assert to fail the test when the count or expected
names do not match.
In `@cmd/rhobs/mcp_tools.go`:
- Around line 137-146: The rawQuery branch for building lokiExpr currently
bypasses cluster scoping; ensure lokiExpr always includes the
openshift_cluster_id filter by appending ` | openshift_cluster_id = "<id>"` when
rawQuery is non-empty (unless the provided rawQuery already contains
`openshift_cluster_id`), so update the logic around rawQuery/lokiExpr to append
fetcher.clusterExternalId and/or check for an existing openshift_cluster_id
token; reference variables to change: rawQuery, lokiExpr,
fetcher.clusterExternalId, containRegex, namespace.
- Around line 97-99: The fallback currently drops the metadata envelope when
json.Unmarshal(rawResult, &parsed) fails; change the fallback to preserve the
envelope fields (cell, cluster_id, environment) and return a structured result
that includes the raw payload as a field. Specifically, where
json.Unmarshal(rawResult, &parsed) is used (variables rawResult and parsed and
the call to mcp.NewToolResultText), build and return an mcp.ToolResult (or the
existing envelope type used elsewhere) populated with
parsed.Cell/ClusterID/Environment if available (or defaulted from earlier
context) and put string(rawResult) into a RawPayload or Payload field instead of
returning plain text; apply the same change at both occurrences (around the
json.Unmarshal at lines referenced and the other occurrence near lines 190-191)
so metadata is preserved on parse failure.
- Around line 91-92: The handler currently treats a single supplied time bound
as an instant query by falling through when only one of start/end is set; change
the validation where start and end are checked so that you explicitly return a
validation error (e.g., HTTP 400) if exactly one of start or end is provided. In
practice, update the branch around the start/end check that calls
fetcher.QueryRangeMetrics to first validate (start != "" && end != "") or (start
== "" && end == "") and if only one is present return an error response
describing that both start and end are required for range queries; keep the
existing fetcher.QueryRangeMetrics call unchanged when both are present.
In `@docs/README.md`:
- Around line 4319-4320: Update the MCP server synopsis text that currently
reads "Start a stdio-based MCP (Model Context Protocol) server that exposes
RHOBS metrics and logs querying as tools for Claude Code" to also mention the
alerts tool: include "and alerts (rhobs_alerts)" or similar so the description
lists metrics, logs and rhobs_alerts; edit the sentence containing "Start a
stdio-based MCP (Model Context Protocol) server that exposes" to explicitly
reference rhobs_alerts as one of the available tools.
---
Nitpick comments:
In `@cmd/rhobs/mcp_helpers.go`:
- Around line 13-23: getCachedFetcher currently races on concurrent cache
misses: multiple goroutines can Load the same key, then each call
CreateRhobsFetcher and Store duplicate entries; wrap the CreateRhobsFetcher call
with a per-key singleflight to ensure only one creation runs for a given key,
using the key used with fetcherCache; on singleflight result, cast/return the
*RhobsFetcher and only Store into fetcherCache once (or rely on the singleflight
result) so redundant OCM/Vault work is avoided.
In `@cmd/rhobs/mcp_test.go`:
- Around line 232-264: The tests TestHandleLogs_InvalidSinceDuration and
TestHandleLogs_LimitCapping are not exercising the behaviors in their names
because handleLogs currently fails earlier during fetcher/cluster lookup; make
the path deterministic by either (A) injecting or stubbing a fetcher that
returns a successful, controlled response for the test cluster_id (e.g., create
a mock fetcher and pass it into handleLogs or the request context so the handler
proceeds to parse the "since" and apply limit capping), or (B) move/duplicate
the input validation logic into handleLogs before any fetcher/cluster resolution
so parsing of "since" (use the same parser called in handleLogs) and capping of
"limit" are exercised; update TestHandleLogs_InvalidSinceDuration to assert a
parse error/result from the since parsing code and TestHandleLogs_LimitCapping
to assert the capped limit value (referencing handleLogs, makeRequest, and the
cluster/fetcher resolution logic to locate where to inject the mock or relocate
validation).
In `@cmd/rhobs/mcpCmd.go`:
- Around line 35-37: Remove the no-op PersistentPreRunE function from the
command definition (the anonymous func assigned to PersistentPreRunE in
cmd/rhobs/mcpCmd.go) since it does nothing and parent hooks run via
EnableTraverseRunHooks; simply delete the PersistentPreRunE field so the command
relies on the parent hook (rootCmd) behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2b5aafe3-45bd-4b3a-aabb-99cf3607fc15
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (11)
cmd/cmd.gocmd/rhobs/mcpCmd.gocmd/rhobs/mcp_helpers.gocmd/rhobs/mcp_query.gocmd/rhobs/mcp_test.gocmd/rhobs/mcp_tools.gocmd/rhobs/rootCmd.godocs/README.mddocs/osdctl_rhobs.mddocs/osdctl_rhobs_mcp.mdgo.mod
|
@dustman9000: This pull request references SREP-4866 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
36473fe to
e265bf7
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/rhobs/mcp_tools.go`:
- Around line 167-169: Validate and reject non-positive limit and since values
before calling Loki: check if limit <= 0 or since <= 0 at the start of the
handler that reads the limit and since parameters (the same check also needs to
be added in the other occurrence around the block at lines 187-194), and return
a 400 Bad Request with a clear message instead of proceeding; keep the existing
upper cap (if limit > 10000 then limit = 10000) but perform that after
validating limit is positive. Ensure you update both spots that use the
variables limit and since to prevent submitting invalid queries to Loki.
- Around line 102-103: The handlers are discarding the incoming context (e.g.,
handleMetrics currently takes _ context.Context) and calling backend fetches
with context.Background(); change the function signatures to accept a named ctx
context.Context (replace the underscore) and pass that ctx into all backend
calls instead of context.Background(); apply the same replacement in the other
handler functions flagged in the review so cancellation/deadline/values from the
incoming MCP request are propagated.
In `@cmd/rhobs/mcpCmd.go`:
- Line 49: The MCP server run is using context.Background(), which ignores Cobra
command cancellation; change the call to use the Cobra command context (e.g.
replace server.Run(context.Background(), &mcp.StdioTransport{}) with
server.Run(cmd.Context(), &mcp.StdioTransport{})), ensuring the surrounding
function has access to the cobra *Command (cmd) so cancellation/interrupts
propagate into server.Run.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 25d03549-56d8-483e-b322-0a1a0bdde267
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (11)
cmd/cmd.gocmd/rhobs/mcpCmd.gocmd/rhobs/mcp_helpers.gocmd/rhobs/mcp_query.gocmd/rhobs/mcp_test.gocmd/rhobs/mcp_tools.gocmd/rhobs/rootCmd.godocs/README.mddocs/osdctl_rhobs.mddocs/osdctl_rhobs_mcp.mdgo.mod
✅ Files skipped from review due to trivial changes (1)
- docs/osdctl_rhobs.md
🚧 Files skipped from review as they are similar to previous changes (3)
- cmd/rhobs/rootCmd.go
- cmd/cmd.go
- cmd/rhobs/mcp_query.go
d6f6de3 to
7eca0e8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/rhobs/mcpCmd.go`:
- Around line 13-20: The checkVaultToken function currently runs exec.Command
without a timeout and returns a single generic error; change its signature to
accept a context.Context (rename to checkVaultToken(ctx context.Context)), use
exec.LookPath("vault") to detect missing CLI and return a distinct error, run
the command with exec.CommandContext using a short timeout derived from the
passed ctx (e.g. context.WithTimeout) and capture stdout/stderr into a buffer,
and then differentiate errors: if ctx deadline exceeded return a
timeout-specific error, if the command exited with a non-zero exit code
(exec.ExitError) return an error indicating token lookup failed (suggesting
login), and for other errors return the underlying error (e.g.
network/permission) with the captured stderr to aid debugging; update the caller
that invokes checkVaultToken to pass its context so deadlines propagate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0e151ff8-b331-4383-aafc-a4610b0319aa
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (11)
cmd/cmd.gocmd/rhobs/mcpCmd.gocmd/rhobs/mcp_helpers.gocmd/rhobs/mcp_query.gocmd/rhobs/mcp_test.gocmd/rhobs/mcp_tools.gocmd/rhobs/rootCmd.godocs/README.mddocs/osdctl_rhobs.mddocs/osdctl_rhobs_mcp.mdgo.mod
✅ Files skipped from review due to trivial changes (1)
- docs/osdctl_rhobs.md
🚧 Files skipped from review as they are similar to previous changes (7)
- cmd/rhobs/rootCmd.go
- cmd/cmd.go
- cmd/rhobs/mcp_helpers.go
- go.mod
- cmd/rhobs/mcp_query.go
- cmd/rhobs/mcp_tools.go
- cmd/rhobs/mcp_test.go
7eca0e8 to
454fcd1
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/rhobs/mcp_helpers.go`:
- Around line 13-23: getCachedFetcher currently has a check-then-act race:
multiple goroutines can miss fetcherCache.Load and each call CreateRhobsFetcher
for the same key; fix by deduplicating initialization (e.g., use a
singleflight.Group or sync.Map.LoadOrStore) so only one goroutine runs
CreateRhobsFetcher for a given key and others wait and receive the stored
result; implement this inside getCachedFetcher using the existing fetcherCache
and CreateRhobsFetcher (or add a package-level singleflight.Group) to ensure a
single creation and then store the created *RhobsFetcher in fetcherCache before
returning.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: af2041dd-45b1-4baf-b17d-0d95fda079f7
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (11)
cmd/cmd.gocmd/rhobs/mcpCmd.gocmd/rhobs/mcp_helpers.gocmd/rhobs/mcp_query.gocmd/rhobs/mcp_test.gocmd/rhobs/mcp_tools.gocmd/rhobs/rootCmd.godocs/README.mddocs/osdctl_rhobs.mddocs/osdctl_rhobs_mcp.mdgo.mod
✅ Files skipped from review due to trivial changes (1)
- docs/osdctl_rhobs.md
🚧 Files skipped from review as they are similar to previous changes (5)
- go.mod
- cmd/rhobs/mcpCmd.go
- cmd/rhobs/mcp_query.go
- cmd/rhobs/mcp_test.go
- cmd/rhobs/mcp_tools.go
454fcd1 to
15caf5c
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/rhobs/mcp_query.go`:
- Around line 156-179: The loop currently drops entries whose getTimeStamp()
equals edgeTimeStamp and adjusts endTimeStamp in a way that loses tied-timestamp
rows; change the pagination to preserve ties or fail explicitly: when computing
edgeTimeStamp (from flattenedResults[len(flattenedResults)-1].getTimeStamp())
detect how many entries share that timestamp and do not skip all ts ==
edgeTimeStamp in the for loop (replace the ts != edgeTimeStamp check with logic
that either includes all entries with that timestamp into entries until
logsCount is satisfied, or if including them would overflow the page, set
endTimeStamp = edgeTimeStamp (not edgeTimeStamp-1) and return an explicit
error/flag indicating an ambiguous boundary so the caller can retry with a
different strategy); update the code paths around flattenedResults,
edgeTimeStamp, endTimeStamp, and the entries append (mcpLogEntry, getTimeStamp,
getTime, getMessage, result.Stream) accordingly so tied timestamps are never
silently dropped.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c5be0d02-cccb-43cf-94d5-405c624496f7
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (11)
cmd/cmd.gocmd/rhobs/mcpCmd.gocmd/rhobs/mcp_helpers.gocmd/rhobs/mcp_query.gocmd/rhobs/mcp_test.gocmd/rhobs/mcp_tools.gocmd/rhobs/rootCmd.godocs/README.mddocs/osdctl_rhobs.mddocs/osdctl_rhobs_mcp.mdgo.mod
✅ Files skipped from review due to trivial changes (2)
- cmd/cmd.go
- docs/osdctl_rhobs.md
🚧 Files skipped from review as they are similar to previous changes (6)
- cmd/rhobs/rootCmd.go
- go.mod
- cmd/rhobs/mcpCmd.go
- cmd/rhobs/mcp_helpers.go
- cmd/rhobs/mcp_tools.go
- cmd/rhobs/mcp_test.go
|
/retest |
|
/retest-required |
15caf5c to
46ebd81
Compare
|
All CodeRabbit review comments have been addressed in the current squashed commit. Summarizing for reviewers since force pushes invalidated the old line references:
|
46ebd81 to
a058351
Compare
geowa4
left a comment
There was a problem hiding this comment.
Just a couple of questions/thoughts, nothing blocking.
8c46aaf to
7a0a4e2
Compare
|
/lgtm |
Add `osdctl rhobs mcp` with two subcommands: - `server`: starts a stdio-based MCP server exposing RHOBS tools - `config`: prints MCP client configuration JSON for easy setup Usage: claude --mcp-config "$(osdctl rhobs mcp config)" Uses the official modelcontextprotocol/go-sdk maintained by the MCP org and Google. Compatible with any MCP client (Claude Code, Cursor, Windsurf, custom agents). Tools (all read-only, with OutputSchema): - rhobs_metrics: PromQL instant/range queries against Thanos - rhobs_logs: LogQL queries against Loki - rhobs_alerts: Firing alerts from Alertmanager Covers ROSA HCP infrastructure: hosted clusters, Management Clusters (MC), and Service Clusters (SC). Accepts any cluster ID; the correct RHOBS cell is resolved automatically.
7a0a4e2 to
862fa54
Compare
|
@dustman9000: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dustman9000, geowa4 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary
osdctl rhobs mcpsubcommand that starts a stdio-based MCP (Model Context Protocol) serverrhobs_metrics(PromQL instant/range),rhobs_logs(LogQL),rhobs_alerts(Alertmanager)RhobsFetcherfor cell resolution, OIDC auth, and API queries with no code duplicationsync.Mapwithsingleflightdeduplication to avoid repeated OCM lookupsMotivation
SREs using AI agents for operational work (alert investigation, cluster debugging) can connect this MCP server to query RHOBS directly. The server provides structured JSON responses with cell metadata, eliminating the need to shell out to CLI commands or manage query syntax manually. Compatible with any MCP-capable client (Claude Code, Cursor, Windsurf, custom agents, etc.).
Configuration
Add to your MCP client configuration (e.g.,
~/.claude/mcp_settings.jsonfor Claude Code):{ "mcpServers": { "rhobs": { "command": "osdctl", "args": ["rhobs", "mcp"] } } }Prerequisites: OCM login, vault login (
vault.devshift.net),~/.config/osdctlwithrhobs_<env>_vault_pathentries.Tools
rhobs_metricscluster_id,query, optionalstart/end/stepfor rangerhobs_logscluster_id,namespaceorquery,contain_regex,since,limitrhobs_alertscluster_idAll responses include cell URL, cluster metadata, and environment. All tools are annotated as read-only.
Test plan
go test ./cmd/rhobs/ -v(73 subtests covering helpers, parameter validation, tool registration via in-memory transport, end-to-end tool call roundtrips)hs-mc-qj7qbfg8g/us-east-1-2cellhs-mc-kbj34s6bg/us-west-2-0cellmake verify-docsgofmtclean, all CI checks greenJira: https://redhat.atlassian.net/browse/SREP-4866