
fix(controller): add timeout to RemoteMCPServer registration#1805

Open
AnantKumar17 wants to merge 5 commits into kagent-dev:main from AnantKumar17:fix/remotemcpserver-registration-timeout

Conversation

@AnantKumar17

What

Fixes a bug where a single hung RemoteMCPServer registration blocks all subsequent RemoteMCPServer reconciliations for the lifetime of the controller process.

Closes #1785

Why

newHTTPClient() was returning http.DefaultClient (no timeout) when there were no custom headers, and listTools() was passing the context directly to client.Connect() and session.ListTools() with no deadline. A hung or unreachable endpoint — for example, an IPv6 address resolved in an IPv4-only cluster — would block the goroutine indefinitely, preventing any subsequent RemoteMCPServer from registering until the controller pod was restarted.

Changes

Single file changed: go/core/internal/controller/reconciler/reconciler.go

  • newHTTPClient(): always returns a *http.Client with Timeout: mcpRegistrationTimeout set, in both the no-headers and with-headers branches. Removes the use of http.DefaultClient.
  • upsertToolServerForRemoteMCPServer(): wraps the transport creation and tool listing in context.WithTimeout(ctx, mcpRegistrationTimeout) so the entire registration sequence is bounded. The database call (RefreshToolsForServer) intentionally uses the original ctx.
  • ReconcileKagentRemoteMCPServer(): logs registration start (url, protocol), success (url, tool count, duration), and failure (error, duration) using structured logr key-value pairs, matching existing codebase conventions.
  • mcpRegistrationTimeout: new unexported constant set to 30 * time.Second, matching DefaultTimeout used in discoverer.go.

Testing

go test -race -skip 'TestE2E.*' ./...

All 26 test packages pass. go vet clean. go build clean.

Copilot AI review requested due to automatic review settings May 6, 2026 11:49
@github-actions github-actions Bot added the bug Something isn't working label May 6, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a controller reliability bug where a hung RemoteMCPServer registration could block all subsequent RemoteMCPServer reconciliations by ensuring the registration workflow is time-bounded and observable via structured logs.

Changes:

  • Adds a 30s registration timeout and applies it to MCP transport creation + tool listing.
  • Ensures newHTTPClient() always returns an *http.Client with a timeout (no longer returns http.DefaultClient).
  • Adds structured logr entries for registration start, success, and failure with duration.

Comment on lines 50 to 57
// mcpRegistrationTimeout is the deadline applied to each RemoteMCPServer
// registration attempt (header resolution + MCP connect + tool listing).
// A hung or unreachable endpoint is bounded to this duration, ensuring the
// reconciler goroutine is always released and does not block subsequent
// RemoteMCPServer reconciliations.
mcpRegistrationTimeout = 30 * time.Second
)

Author

added a remoteMCPRegistrationTimeout() helper that returns spec.timeout when set and falls back to the 30s default when the field is nil. The constant is now just the fallback value.

Comment on lines +983 to +990
// Bound the entire registration sequence (header resolution + MCP connect +
// tool listing) to mcpRegistrationTimeout so that a hung or unreachable
// endpoint cannot block this goroutine — and therefore all subsequent
// RemoteMCPServer reconciliations — indefinitely.
tCtx, cancel := context.WithTimeout(ctx, mcpRegistrationTimeout)
defer cancel()

tsp, err := a.createMcpTransport(tCtx, remoteMcpServer)
Author

Fixed - upsertToolServerForRemoteMCPServer now calls context.WithTimeout(ctx, remoteMCPRegistrationTimeout(remoteMcpServer)) so operators can tune the deadline per resource via .spec.timeout without any code changes.

Comment on lines 1037 to 1047
  // go-sdk does not have a WithHeaders option when initializing transport
  // so we need to create a custom HTTP client that adds headers to all requests.
  func newHTTPClient(headers map[string]string) *http.Client {
      if len(headers) == 0 {
-         return http.DefaultClient
+         return &http.Client{
+             Timeout: mcpRegistrationTimeout,
+         }
      }
      return &http.Client{
+         Timeout: mcpRegistrationTimeout,
          Transport: &headerTransport{
Author

Updated newHTTPClient to accept an explicit timeout time.Duration parameter. createMcpTransport derives it from remoteMCPRegistrationTimeout(s) and passes it in, so the HTTP client timeout matches the CRD configuration in both branches.

Comment on lines +983 to +995
// Bound the entire registration sequence (header resolution + MCP connect +
// tool listing) to mcpRegistrationTimeout so that a hung or unreachable
// endpoint cannot block this goroutine — and therefore all subsequent
// RemoteMCPServer reconciliations — indefinitely.
tCtx, cancel := context.WithTimeout(ctx, mcpRegistrationTimeout)
defer cancel()

tsp, err := a.createMcpTransport(tCtx, remoteMcpServer)
if err != nil {
return nil, fmt.Errorf("failed to create client for toolServer %s: %w", toolServer.Name, err)
}

- tools, err := a.listTools(ctx, tsp, toolServer)
+ tools, err := a.listTools(tCtx, tsp, toolServer)
Author

Added TestRemoteMCPRegistrationTimeout (covers nil server, nil spec.timeout, and a custom value) and TestNewHTTPClient (covers nil/empty/with-headers - all assert the correct timeout and transport type). These are in reconciler_test.go in the same package.

Contributor

@jmhbh jmhbh left a comment

Changes LGTM overall, just have one nit.

// .spec.timeout is not set. A hung or unreachable endpoint is bounded to this
// duration, ensuring the reconciler goroutine is always released and does not
// block subsequent RemoteMCPServer reconciliations.
mcpRegistrationTimeout = 30 * time.Second
Contributor

nit: I'd almost prefer setting this as a default value on the spec.Timeout field in remotemcpserver_types.go instead, so it's more explicit.

Author

Done - added // +kubebuilder:default="30s" on Timeout in remotemcpserver_types.go and regenerated both the base CRD and the Helm CRD template.
The Go fallback in remoteMCPRegistrationTimeout() is kept for existing resources that were created before this default takes effect (nil Timeout at admission time).

Fixes the issue 1785

Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
…ient

Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
Signed-off-by: AnantKumar17 <anant3011k@gmail.com>
@AnantKumar17 AnantKumar17 force-pushed the fix/remotemcpserver-registration-timeout branch from b868f98 to f3f1282 Compare May 7, 2026 06:03
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

[BUG] Failure of one RemoteMCPServer to register prevents subsequent registrations

4 participants