
metrics: stop counting client-cancelled requests as server errors #1255

Merged

rdimitrov merged 1 commit into main from fix/exclude-client-cancelled-from-error-metric on May 5, 2026

Conversation

@rdimitrov
Member

Summary

  • Remap recorded status to 499 (NGINX-style "client closed") when ctx.Context().Err() is context.Canceled after the handler returns
  • Skip the http_errors_total increment for 499s so the availability metric reflects server-visible errors only

Why

Under scraper-driven load, clients with short timeouts paginate /v0.1/servers, give up while PG is busy, and close the TCP connection. The handler's DB iteration returns context.Canceled, the handler converts that to huma.Error500InternalServerError, and tries to write a response to the now-closed socket. NGINX records 499 (client closed) and never delivers anything to the client, but the registry's middleware records status_code=500 and bumps mcp_registry_http_errors_total.
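
To make that path concrete, here is a minimal sketch of the handler side. It is illustrative only, not the registry's actual handler; `listServers`, the query, and the `db` parameter are stand-ins:

```go
package example

import (
	"context"
	"database/sql"

	"github.com/danielgtaylor/huma/v2"
)

// listServers stands in for the registry's list handler and shows how a
// client cancellation ends up reported as a 500.
func listServers(ctx context.Context, db *sql.DB) ([]string, error) {
	rows, err := db.QueryContext(ctx, "SELECT name FROM servers")
	if err != nil {
		// When the client closes the connection, the request context is
		// cancelled and the driver returns an error wrapping context.Canceled.
		// The handler reports it as an internal error, so the metric
		// middleware sees status_code=500 even though nothing was delivered.
		return nil, huma.Error500InternalServerError("failed to list servers", err)
	}
	defer rows.Close()

	var names []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, huma.Error500InternalServerError("failed to scan server row", err)
		}
		names = append(names, name)
	}
	if err := rows.Err(); err != nil {
		// The same conversion happens when the cancellation surfaces
		// mid-iteration, which is the case described above.
		return nil, huma.Error500InternalServerError("failed to list servers", err)
	}
	return names, nil
}
```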

That's what's been firing the Availability dropped below 95% alert during the daily 17:00 UTC bursts. Recent prod numbers from one pod (~19h uptime):

mcp_registry_http_errors_total{path="/v0.1/servers", status_code="500"} 7396
mcp_registry_http_errors_total{path="/v0.1/servers/{serverName}/versions", status_code="500"} 283
mcp_registry_http_errors_total{path="/v0/servers", status_code="500"} 352

…against zero 5xx in NGINX ingress logs over the same window. The alert was correct given the data it had; the data was misclassifying client cancellations as server errors.

What this does

In internal/api/router/router.go (the metric middleware): after next(ctx), check ctx.Context().Err(). If it's context.Canceled, override statusCode to 499 and skip the error counter. The requests counter and duration histogram still record, so the cancellation rate stays visible.

context.DeadlineExceeded is intentionally not remapped — that would indicate a server-side timeout if we ever add per-request deadlines, and should still count as a server error.
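
A sketch of what that middleware check can look like. The counter names besides `mcp_registry_http_requests_total` and `mcp_registry_http_errors_total` are assumed, and the real wiring in `internal/api/router/router.go` may differ:

```go
package router

import (
	"context"
	"errors"
	"strconv"
	"time"

	"github.com/danielgtaylor/huma/v2"
	"github.com/prometheus/client_golang/prometheus"
)

// requestsTotal and errorsTotal match the metrics quoted above; the
// histogram name is assumed for this sketch.
var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "mcp_registry_http_requests_total"},
		[]string{"path", "status_code"},
	)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "mcp_registry_http_request_duration_seconds"},
		[]string{"path", "status_code"},
	)
	errorsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "mcp_registry_http_errors_total"},
		[]string{"path", "status_code"},
	)
)

func metricsMiddleware(ctx huma.Context, next func(huma.Context)) {
	start := time.Now()
	next(ctx)

	statusCode := ctx.Status()
	// Client closed the connection before the response was written: record an
	// NGINX-style 499 instead of whatever the handler produced (typically a
	// 500), since no server error ever reached the client. A server-side
	// context.DeadlineExceeded is deliberately left untouched.
	if errors.Is(ctx.Context().Err(), context.Canceled) {
		statusCode = 499
	}

	labels := prometheus.Labels{
		"path":        ctx.Operation().Path,
		"status_code": strconv.Itoa(statusCode),
	}
	requestsTotal.With(labels).Inc()
	requestDuration.With(labels).Observe(time.Since(start).Seconds())

	// Skip the error counter for 499s so availability reflects only
	// server-visible errors (this sketch assumes the counter tracks 5xx).
	if statusCode >= 500 {
		errorsTotal.With(labels).Inc()
	}
}
```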

What this does not do

  • Doesn't reduce the underlying load; the ServiceNow and Notion catalog scrapers driving the daily peak are separate work (see the updated_since outreach we're doing).
  • Doesn't change handler behaviour — handlers still call huma.Error500… on context.Canceled. We could short-circuit earlier (return without writing) to save the work, but that's a bigger refactor and the metric fix unblocks the alert immediately.

Companion change (no code, just a Grafana annotation tweak)

The Availability dropped below 95% alert (UID bexrc60etvhmoa) currently has the annotation:

No. of 5xx are increased from acceptable limit i.e. 5% of total throughput

The query is on mcp_registry_http_errors_total, not 5xx specifically. After this change the count is meaningful again, but the wording is still misleading. Suggest updating to:

Internal error rate exceeded 5% of total throughput

(No PromQL change is required.)

Test plan

  • go test ./internal/api/router/... — new test passes (a rough sketch of the test's shape follows after this list)
  • golangci-lint run ./internal/... — clean
  • After merge, watch a daily 17:00 UTC burst — mcp_registry_http_errors_total{status_code="500", path="/v0.1/servers"} should stay flat while mcp_registry_http_requests_total{status_code="499", path="/v0.1/servers"} grows. Availability alert should not fire on cancellation-only bursts.
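
The sketch below shows the rough shape such a test could take. `remapStatus` is a hypothetical helper mirroring the middleware's decision; the actual test in the PR may exercise the middleware directly instead:

```go
package router

import (
	"context"
	"testing"
	"time"
)

// remapStatus mirrors the middleware's decision for the purpose of this
// sketch: rewrite the recorded status to 499 when the request context was
// cancelled by the client, and leave everything else (including
// context.DeadlineExceeded) alone.
func remapStatus(ctx context.Context, status int) int {
	if ctx.Err() == context.Canceled {
		return 499
	}
	return status
}

func TestClientCancellationRecordedAs499(t *testing.T) {
	cancelled, cancel := context.WithCancel(context.Background())
	cancel() // simulate the client closing the connection mid-request

	if got := remapStatus(cancelled, 500); got != 499 {
		t.Fatalf("cancelled request recorded as %d, want 499", got)
	}

	if got := remapStatus(context.Background(), 500); got != 500 {
		t.Fatalf("genuine 500 recorded as %d, want 500", got)
	}

	// A server-side deadline is intentionally not remapped.
	deadlined, cancelDeadline := context.WithDeadline(context.Background(), time.Now().Add(-time.Second))
	defer cancelDeadline()
	if got := remapStatus(deadlined, 500); got != 500 {
		t.Fatalf("deadline-exceeded request recorded as %d, want 500", got)
	}
}
```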

🤖 Generated with Claude Code

…or counter

Under scraper-driven load (e.g. ServiceNow paginating /v0.1/servers with
short client timeouts), the registry handler runs the DB query, the
client times out and closes the connection, the in-flight context is
cancelled, and the handler converts the resulting context.Canceled into
huma.Error500InternalServerError before trying to write the response to
a closed socket. The middleware records that as status_code=500 and
bumps mcp_registry_http_errors_total — even though no client ever
received a 5xx (NGINX records it as 499 / client closed).

This was causing the "Availability dropped below 95%" alert to fire
during daily bursts despite zero 5xx in the ingress logs: yesterday's
peak generated 7396 internal 500s on /v0.1/servers alone, all of which
were really client cancellations.

Remap the recorded status to 499 when ctx.Context().Err() is
context.Canceled and skip the error counter for that code. Keep the
request and duration counters so the cancellation rate is still visible.
context.DeadlineExceeded is intentionally not remapped — that would
indicate a server-side timeout if we ever add per-request deadlines.

The matching Grafana alert annotation says "5xx are increased" but the
underlying query is on http_errors_total, which this change makes
accurate again — no annotation update is required for the alert to
behave correctly, but it would be clearer to reword it to "internal
error rate" rather than "5xx".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rdimitrov merged commit 4e14707 into main on May 5, 2026
5 checks passed
rdimitrov deleted the fix/exclude-client-cancelled-from-error-metric branch on May 5, 2026 at 09:58
rdimitrov mentioned this pull request on May 5, 2026 (4 tasks)
rdimitrov added a commit that referenced this pull request on May 5, 2026:
## Summary
- Bump `mcp-registry:imageTag` from `1.7.7` → `1.7.8` in
`deploy/Pulumi.gcpProd.yaml`
- v1.7.8 ships the metrics fix that stops counting client-cancelled
requests as server errors (#1255). Under scraper-driven load, clients
with short timeouts close the socket while PG is still iterating; the
handler converts `context.Canceled` to a 500 and the metric middleware
was bumping `mcp_registry_http_errors_total` even though NGINX recorded
499 and never delivered a response. This is what's been firing the daily
~17:00 UTC `Availability dropped below 95%` alert.
- Also includes a CI-only `pulumi/actions` 6.6.1 → 7.0.0 bump (#1254).
No registry runtime impact.
- Staging is on `4e14707` (the v1.7.8 commit) and
`staging.registry.modelcontextprotocol.io/v0.1/version` reports it.

## Test plan
- [ ] Confirm release workflow's `ko-push` job has published
`ghcr.io/modelcontextprotocol/registry:1.7.8` before merging
- [ ] Watch deploy-production roll the registry pods to `1.7.8`
- [ ] Hit `https://registry.modelcontextprotocol.io/v0.1/version` and
confirm `git_commit` matches `4e147072f03aeacfe8456c7f24407e0ea65d92c0`
(scriptable as sketched after this list)
- [ ] Through the next ~17:00 UTC scraper burst, watch in Prometheus:
- `mcp_registry_http_errors_total{status_code="500",
path="/v0.1/servers"}` should stay flat
- `mcp_registry_http_requests_total{status_code="499",
path="/v0.1/servers"}` should grow during the burst
- The `Availability dropped below 95%` alert should not fire on
cancellation-only bursts
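
A small sketch of scripting the version check above. Only the `git_commit` field name comes from the checklist; the rest of the response shape is assumed:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

const wantCommit = "4e147072f03aeacfe8456c7f24407e0ea65d92c0"

func main() {
	resp, err := http.Get("https://registry.modelcontextprotocol.io/v0.1/version")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Decode only the field we care about; other fields are ignored.
	var body struct {
		GitCommit string `json:"git_commit"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		os.Exit(1)
	}
	if body.GitCommit != wantCommit {
		fmt.Fprintf(os.Stderr, "git_commit mismatch: got %s, want %s\n", body.GitCommit, wantCommit)
		os.Exit(1)
	}
	fmt.Println("registry is running", wantCommit)
}
```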

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>