
metrics: stop counting client-cancelled requests as server errors #1255

Merged

rdimitrov merged 1 commit into main from fix/exclude-client-cancelled-from-error-metric on May 5, 2026

Conversation

@rdimitrov
Member

Summary

  • Remap recorded status to 499 (NGINX-style "client closed") when ctx.Context().Err() is context.Canceled after the handler returns
  • Skip the http_errors_total increment for 499s so the availability metric reflects server-visible errors only

Why

Under scraper-driven load, clients with short timeouts paginate /v0.1/servers, give up while PG is busy, and close the TCP connection. The handler's DB iteration returns context.Canceled, the handler converts that to huma.Error500InternalServerError, and tries to write a response to the now-closed socket. NGINX records 499 (client closed) and never delivers anything to the client, but the registry's middleware records status_code=500 and bumps mcp_registry_http_errors_total.
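
To make that path concrete, here is a minimal sketch of the handler side. It is illustrative only, not the registry's actual handler; `listServers`, the query, and the `db` parameter are stand-ins:

```go
package example

import (
	"context"
	"database/sql"

	"github.com/danielgtaylor/huma/v2"
)

// listServers stands in for the registry's list handler and shows how a
// client cancellation ends up reported as a 500.
func listServers(ctx context.Context, db *sql.DB) ([]string, error) {
	rows, err := db.QueryContext(ctx, "SELECT name FROM servers")
	if err != nil {
		// When the client closes the connection, the request context is
		// cancelled and the driver returns an error wrapping context.Canceled.
		// The handler reports it as an internal error, so the metric
		// middleware sees status_code=500 even though nothing was delivered.
		return nil, huma.Error500InternalServerError("failed to list servers", err)
	}
	defer rows.Close()

	var names []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, huma.Error500InternalServerError("failed to scan server row", err)
		}
		names = append(names, name)
	}
	if err := rows.Err(); err != nil {
		// The same conversion happens when the cancellation surfaces
		// mid-iteration, which is the case described above.
		return nil, huma.Error500InternalServerError("failed to list servers", err)
	}
	return names, nil
}
```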

That's what's been firing the Availability dropped below 95% alert during the daily 17:00 UTC bursts. Recent prod numbers from one pod (~19h uptime):

mcp_registry_http_errors_total{path="/v0.1/servers", status_code="500"} 7396
mcp_registry_http_errors_total{path="/v0.1/servers/{serverName}/versions", status_code="500"} 283
mcp_registry_http_errors_total{path="/v0/servers", status_code="500"} 352

…against zero 5xx in NGINX ingress logs over the same window. The alert was correct given the data it had; the data was misclassifying client cancellations as server errors.

What this does

In internal/api/router/router.go (the metric middleware): after next(ctx), check ctx.Context().Err(). If it's context.Canceled, override statusCode to 499 and skip the error counter. The requests counter and duration histogram still record, so the cancellation rate stays visible.

context.DeadlineExceeded is intentionally not remapped — that would indicate a server-side timeout if we ever add per-request deadlines, and should still count as a server error.
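
A sketch of what that middleware check can look like. The counter names besides `mcp_registry_http_requests_total` and `mcp_registry_http_errors_total` are assumed, and the real wiring in `internal/api/router/router.go` may differ:

```go
package router

import (
	"context"
	"errors"
	"strconv"
	"time"

	"github.com/danielgtaylor/huma/v2"
	"github.com/prometheus/client_golang/prometheus"
)

// requestsTotal and errorsTotal match the metrics quoted above; the
// histogram name is assumed for this sketch.
var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "mcp_registry_http_requests_total"},
		[]string{"path", "status_code"},
	)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "mcp_registry_http_request_duration_seconds"},
		[]string{"path", "status_code"},
	)
	errorsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "mcp_registry_http_errors_total"},
		[]string{"path", "status_code"},
	)
)

func metricsMiddleware(ctx huma.Context, next func(huma.Context)) {
	start := time.Now()
	next(ctx)

	statusCode := ctx.Status()
	// Client closed the connection before the response was written: record an
	// NGINX-style 499 instead of whatever the handler produced (typically a
	// 500), since no server error ever reached the client. A server-side
	// context.DeadlineExceeded is deliberately left untouched.
	if errors.Is(ctx.Context().Err(), context.Canceled) {
		statusCode = 499
	}

	labels := prometheus.Labels{
		"path":        ctx.Operation().Path,
		"status_code": strconv.Itoa(statusCode),
	}
	requestsTotal.With(labels).Inc()
	requestDuration.With(labels).Observe(time.Since(start).Seconds())

	// Skip the error counter for 499s so availability reflects only
	// server-visible errors (this sketch assumes the counter tracks 5xx).
	if statusCode >= 500 {
		errorsTotal.With(labels).Inc()
	}
}
```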

What this does not do

  • Doesn't reduce the underlying load; the ServiceNow and Notion catalog scrapers driving the daily peak are separate work (see the updated_since outreach we're doing).
  • Doesn't change handler behaviour — handlers still call huma.Error500… on context.Canceled. We could short-circuit earlier (return without writing) to save the work, but that's a bigger refactor and the metric fix unblocks the alert immediately.

Companion change (no code, just a Grafana annotation tweak)

The Availability dropped below 95% alert (UID bexrc60etvhmoa) currently has the annotation:

No. of 5xx are increased from acceptable limit i.e. 5% of total throughput

The query is on mcp_registry_http_errors_total, not 5xx specifically. After this change the count is meaningful again, but the wording is still misleading. Suggest updating to:

Internal error rate exceeded 5% of total throughput

(No PromQL change is required.)

Test plan

  • go test ./internal/api/router/... — new test passes (a rough sketch of the test's shape follows after this list)
  • golangci-lint run ./internal/... — clean
  • After merge, watch a daily 17:00 UTC burst — mcp_registry_http_errors_total{status_code="500", path="/v0.1/servers"} should stay flat while mcp_registry_http_requests_total{status_code="499", path="/v0.1/servers"} grows. Availability alert should not fire on cancellation-only bursts.
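
The sketch below shows the rough shape such a test could take. `remapStatus` is a hypothetical helper mirroring the middleware's decision; the actual test in the PR may exercise the middleware directly instead:

```go
package router

import (
	"context"
	"testing"
	"time"
)

// remapStatus mirrors the middleware's decision for the purpose of this
// sketch: rewrite the recorded status to 499 when the request context was
// cancelled by the client, and leave everything else (including
// context.DeadlineExceeded) alone.
func remapStatus(ctx context.Context, status int) int {
	if ctx.Err() == context.Canceled {
		return 499
	}
	return status
}

func TestClientCancellationRecordedAs499(t *testing.T) {
	cancelled, cancel := context.WithCancel(context.Background())
	cancel() // simulate the client closing the connection mid-request

	if got := remapStatus(cancelled, 500); got != 499 {
		t.Fatalf("cancelled request recorded as %d, want 499", got)
	}

	if got := remapStatus(context.Background(), 500); got != 500 {
		t.Fatalf("genuine 500 recorded as %d, want 500", got)
	}

	// A server-side deadline is intentionally not remapped.
	deadlined, cancelDeadline := context.WithDeadline(context.Background(), time.Now().Add(-time.Second))
	defer cancelDeadline()
	if got := remapStatus(deadlined, 500); got != 500 {
		t.Fatalf("deadline-exceeded request recorded as %d, want 500", got)
	}
}
```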

🤖 Generated with Claude Code

…or counter

Under scraper-driven load (e.g. ServiceNow paginating /v0.1/servers with
short client timeouts), the registry handler runs the DB query, the
client times out and closes the connection, the in-flight context is
cancelled, and the handler converts the resulting context.Canceled into
huma.Error500InternalServerError before trying to write the response to
a closed socket. The middleware records that as status_code=500 and
bumps mcp_registry_http_errors_total — even though no client ever
received a 5xx (NGINX records it as 499 / client closed).

This was causing the "Availability dropped below 95%" alert to fire
during daily bursts despite zero 5xx in the ingress logs: yesterday's
peak generated 7396 internal 500s on /v0.1/servers alone, all of which
were really client cancellations.

Remap the recorded status to 499 when ctx.Context().Err() is
context.Canceled and skip the error counter for that code. Keep the
request and duration counters so the cancellation rate is still visible.
context.DeadlineExceeded is intentionally not remapped — that would
indicate a server-side timeout if we ever add per-request deadlines.

The matching Grafana alert annotation says "5xx are increased" but the
underlying query is on http_errors_total, which this change makes
accurate again — no annotation update is required for the alert to
behave correctly, but it would be clearer to reword it to "internal
error rate" rather than "5xx".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rdimitrov merged commit 4e14707 into main on May 5, 2026
5 checks passed
rdimitrov deleted the fix/exclude-client-cancelled-from-error-metric branch on May 5, 2026 at 09:58
rdimitrov mentioned this pull request on May 5, 2026 (4 tasks)
rdimitrov added a commit that referenced this pull request on May 5, 2026:
## Summary
- Bump `mcp-registry:imageTag` from `1.7.7` → `1.7.8` in
`deploy/Pulumi.gcpProd.yaml`
- v1.7.8 ships the metrics fix that stops counting client-cancelled
requests as server errors (#1255). Under scraper-driven load, clients
with short timeouts close the socket while PG is still iterating; the
handler converts `context.Canceled` to a 500 and the metric middleware
was bumping `mcp_registry_http_errors_total` even though NGINX recorded
499 and never delivered a response. This is what's been firing the daily
~17:00 UTC `Availability dropped below 95%` alert.
- Also includes a CI-only `pulumi/actions` 6.6.1 → 7.0.0 bump (#1254).
No registry runtime impact.
- Staging is on `4e14707` (the v1.7.8 commit) and
`staging.registry.modelcontextprotocol.io/v0.1/version` reports it.

## Test plan
- [ ] Confirm release workflow's `ko-push` job has published
`ghcr.io/modelcontextprotocol/registry:1.7.8` before merging
- [ ] Watch deploy-production roll the registry pods to `1.7.8`
- [ ] Hit `https://registry.modelcontextprotocol.io/v0.1/version` and
confirm `git_commit` matches `4e147072f03aeacfe8456c7f24407e0ea65d92c0`
(scriptable as sketched after this list)
- [ ] Through the next ~17:00 UTC scraper burst, watch in Prometheus:
- `mcp_registry_http_errors_total{status_code="500",
path="/v0.1/servers"}` should stay flat
- `mcp_registry_http_requests_total{status_code="499",
path="/v0.1/servers"}` should grow during the burst
- The `Availability dropped below 95%` alert should not fire on
cancellation-only bursts
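
A small sketch of scripting the version check above. Only the `git_commit` field name comes from the checklist; the rest of the response shape is assumed:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

const wantCommit = "4e147072f03aeacfe8456c7f24407e0ea65d92c0"

func main() {
	resp, err := http.Get("https://registry.modelcontextprotocol.io/v0.1/version")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Decode only the field we care about; other fields are ignored.
	var body struct {
		GitCommit string `json:"git_commit"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		os.Exit(1)
	}
	if body.GitCommit != wantCommit {
		fmt.Fprintf(os.Stderr, "git_commit mismatch: got %s, want %s\n", body.GitCommit, wantCommit)
		os.Exit(1)
	}
	fmt.Println("registry is running", wantCommit)
}
```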

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>