metrics: stop counting client-cancelled requests as server errors #1255
Merged
Conversation
…or counter

Under scraper-driven load (e.g. ServiceNow paginating /v0.1/servers with short client timeouts), the registry handler runs the DB query, the client times out and closes the connection, the in-flight context is cancelled, and the handler converts the resulting context.Canceled into huma.Error500InternalServerError before trying to write the response to a closed socket. The middleware records that as status_code=500 and bumps mcp_registry_http_errors_total — even though no client ever received a 5xx (NGINX records it as 499 / client closed). This was causing the "Availability dropped below 95%" alert to fire during daily bursts despite zero 5xx in the ingress logs: yesterday's peak generated 7396 internal 500s on /v0.1/servers alone, all of which were really client cancellations.

Remap the recorded status to 499 when ctx.Context().Err() is context.Canceled and skip the error counter for that code. Keep the request and duration counters so the cancellation rate is still visible. context.DeadlineExceeded is intentionally not remapped — that would indicate a server-side timeout if we ever add per-request deadlines.

The matching Grafana alert annotation says "5xx are increased" but the underlying query is on http_errors_total, which this change makes accurate again — no annotation update is required for the alert to behave correctly, but it would be clearer to reword it to "internal error rate" rather than "5xx".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rdimitrov added a commit that referenced this pull request on May 5, 2026
## Summary

- Bump `mcp-registry:imageTag` from `1.7.7` → `1.7.8` in `deploy/Pulumi.gcpProd.yaml`
- v1.7.8 ships the metrics fix that stops counting client-cancelled requests as server errors (#1255). Under scraper-driven load, clients with short timeouts close the socket while PG is still iterating; the handler converts `context.Canceled` to a 500 and the metric middleware was bumping `mcp_registry_http_errors_total` even though NGINX recorded 499 and never delivered a response. This is what's been firing the daily ~17:00 UTC `Availability dropped below 95%` alert.
- Also includes a CI-only `pulumi/actions` 6.6.1 → 7.0.0 bump (#1254). No registry runtime impact.
- Staging is on `4e14707` (the v1.7.8 commit) and `staging.registry.modelcontextprotocol.io/v0.1/version` reports it.

## Test plan

- [ ] Confirm release workflow's `ko-push` job has published `ghcr.io/modelcontextprotocol/registry:1.7.8` before merging
- [ ] Watch deploy-production roll the registry pods to `1.7.8`
- [ ] Hit `https://registry.modelcontextprotocol.io/v0.1/version` and confirm `git_commit` matches `4e147072f03aeacfe8456c7f24407e0ea65d92c0` (scriptable; see the sketch after this list)
- [ ] Through the next ~17:00 UTC scraper burst, watch in Prometheus:
  - `mcp_registry_http_errors_total{status_code="500", path="/v0.1/servers"}` should stay flat
  - `mcp_registry_http_requests_total{status_code="499", path="/v0.1/servers"}` should grow during the burst
  - The `Availability dropped below 95%` alert should not fire on cancellation-only bursts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
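The version check in the test plan can be scripted. Below is a minimal sketch in Go, assuming the `/v0.1/version` endpoint returns a JSON object with a `git_commit` field (as the test plan implies); the program itself is illustrative, not part of the deploy tooling:

```go
// versioncheck.go — hypothetical helper for the rollout check above.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// The commit the production registry should report after the 1.7.8 rollout.
	const want = "4e147072f03aeacfe8456c7f24407e0ea65d92c0"

	resp, err := http.Get("https://registry.modelcontextprotocol.io/v0.1/version")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Assumes the response body looks like {"git_commit": "..."}.
	var v struct {
		GitCommit string `json:"git_commit"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		os.Exit(1)
	}
	if v.GitCommit != want {
		fmt.Fprintf(os.Stderr, "got git_commit %q, want %q\n", v.GitCommit, want)
		os.Exit(1)
	}
	fmt.Println("registry is on", want)
}
```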
Summary
- Remap the recorded status to `499` (NGINX-style "client closed") when `ctx.Context().Err()` is `context.Canceled` after the handler returns
- Skip the `http_errors_total` increment for `499`s so the availability metric reflects server-visible errors only

Why
Under scraper-driven load, clients with short timeouts paginate `/v0.1/servers`, give up while PG is busy, and close the TCP connection. The handler's DB iteration returns `context canceled`, the handler converts that to `huma.Error500InternalServerError`, and tries to write a response to the now-closed socket. NGINX records `499` (client closed) and never delivers anything to the client — but the registry's middleware records `status_code=500` and bumps `mcp_registry_http_errors_total`.

That's what's been firing the `Availability dropped below 95%` alert during the daily 17:00 UTC bursts. Recent prod numbers from one pod (~19h uptime): … against zero 5xx in NGINX ingress logs over the same window. The alert was correct given the data it had; the data was misclassifying client cancellations as server errors.
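To make that failure path concrete, here's a self-contained sketch (the handler and query are stand-ins, not the registry's actual code) showing how a client cancellation surfaces as a 500 before any response is written:

```go
// A condensed sketch of the failure path described above.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/danielgtaylor/huma/v2"
)

// slowQuery stands in for the paginated PG iteration behind /v0.1/servers.
func slowQuery(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second): // DB still iterating...
		return nil
	case <-ctx.Done(): // ...client gave up and closed the socket
		return ctx.Err() // context.Canceled
	}
}

func listServers(ctx context.Context) error {
	if err := slowQuery(ctx); err != nil {
		// This is the conversion the PR describes: a client cancellation
		// becomes a server-side 500 before the (failed) response write.
		return huma.Error500InternalServerError("failed to list servers", err)
	}
	return nil
}

func main() {
	// Simulate a scraper with a short client timeout cancelling mid-query.
	ctx, cancel := context.WithCancel(context.Background())
	go func() { time.Sleep(100 * time.Millisecond); cancel() }()

	err := listServers(ctx)
	var se huma.StatusError
	if errors.As(err, &se) {
		fmt.Println("recorded status:", se.GetStatus()) // 500, though no client saw it
	}
}
```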
What this does
In `internal/api/router/router.go` (the metric middleware): after `next(ctx)`, check `ctx.Context().Err()`. If it's `context.Canceled`, override `statusCode` to `499` and skip the error counter. The requests counter and duration histogram still record, so the cancellation rate stays visible. `context.DeadlineExceeded` is intentionally not remapped — that would indicate a server-side timeout if we ever add per-request deadlines, and should still count as a server error. A sketch of the change follows.
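A minimal sketch of that middleware change, assuming huma v2 and prometheus/client_golang; the metric variable names and the exact error-counter condition are illustrative, not the literal router.go diff:

```go
package router

import (
	"context"
	"errors"
	"strconv"
	"time"

	"github.com/danielgtaylor/huma/v2"
	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative metric definitions mirroring the names discussed in this PR.
var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "mcp_registry_http_requests_total"},
		[]string{"status_code", "path"},
	)
	errorsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "mcp_registry_http_errors_total"},
		[]string{"status_code", "path"},
	)
	durationSeconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "mcp_registry_http_request_duration_seconds"},
		[]string{"status_code", "path"},
	)
)

func init() {
	prometheus.MustRegister(requestsTotal, errorsTotal, durationSeconds)
}

func metricsMiddleware(ctx huma.Context, next func(huma.Context)) {
	start := time.Now()
	next(ctx)

	statusCode := ctx.Status()
	// The client closed the connection while the handler ran: the handler
	// surfaced context.Canceled as a 500, but nothing was delivered.
	// Record it as an NGINX-style 499 instead.
	if errors.Is(ctx.Context().Err(), context.Canceled) {
		statusCode = 499
	}

	code := strconv.Itoa(statusCode)
	path := ctx.Operation().Path

	// Requests and duration always record, so the cancellation rate stays
	// visible as a 499 series.
	requestsTotal.WithLabelValues(code, path).Inc()
	durationSeconds.WithLabelValues(code, path).Observe(time.Since(start).Seconds())

	// Only server-visible errors feed the availability alert; 499 skips this.
	if statusCode >= 500 {
		errorsTotal.WithLabelValues(code, path).Inc()
	}
}
```

In huma v2 this would be registered with `api.UseMiddleware(metricsMiddleware)`; using `errors.Is` rather than `==` keeps the check robust to wrapped cancellation errors.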
What this does not do

- …`updated_since` outreach we're doing).
- It doesn't change the handler itself, which still builds `huma.Error500…` on `context.Canceled`. We could short-circuit earlier (return without writing) to save the work, but that's a bigger refactor and the metric fix unblocks the alert immediately.

Companion change (no code, just a Grafana annotation tweak)
The `Availability dropped below 95%` alert (UID `bexrc60etvhmoa`) currently has the annotation "5xx are increased". The query is on `mcp_registry_http_errors_total`, not 5xx specifically. After this change the count is meaningful again, but the wording is still misleading. Suggest rewording the annotation to refer to the "internal error rate" rather than "5xx" (no PromQL change required).
Test plan
- `go test ./internal/api/router/...` — new test passes
- `golangci-lint run ./internal/...` — clean
- After deploy: `mcp_registry_http_errors_total{status_code="500", path="/v0.1/servers"}` should stay flat while `mcp_registry_http_requests_total{status_code="499", path="/v0.1/servers"}` grows. The availability alert should not fire on cancellation-only bursts.

🤖 Generated with Claude Code