inference_provider_pool: don't retry upstream 5xx caused by client media-fetch errors #688
Evrard-Nil wants to merge 3 commits into
…dia-fetch errors
vLLM and SGLang return HTTP 500 when they can't fetch or decode a
multimodal media URL the client supplied (e.g. broken Facebook CDN
image URLs, geo-blocked YouTube videos, base64 mp4 sent as image_url).
These are permanent client-input errors — the same payload re-runs the
same fetch and produces the same failure — but `classify_retry_decision`
treated every 5xx as `retryable_http_5xx`, so each malformed request
got 4 attempts (1 + 3 retries) of identical work.
Under sustained bad-media load (observed 2026-05-26 17:07 on
google/gemma-4-31B-it and Qwen/Qwen3.5-122B-A10B; recurred 2026-05-27
on gemma after the SGLang switch) the 4x amplification saturated both
backends and produced full-timeout windows on cloud-api.
Add a substring check on the upstream error_detail body: when an HTTP
500 message matches a high-confidence client-media-fetch pattern
('loading IMAGE/VIDEO data', 'cannot identify image file',
'Failed to open input buffer', or the aiohttp wrapper format
"HTTP error 500: 4xx, message='...', url='http..."), classify as
`non_retryable_client_media_error` so the inner retry loop bails out
after the first attempt. Genuine backend 5xx (no media markers) still
retry as before.
Same shape as PR #611's classify_provider_error pattern, applied to
the retry decision instead of the response classification.
Refs #687.
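The classification change described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `CompletionError` enum here is a reduced stand-in for the real type, `has_media_marker` and the `non_retryable_other` label are hypothetical, and the aiohttp wrapper check is left out.

```rust
/// Reduced stand-in for the pool's error type (assumption: only the HTTP
/// variant matters for this sketch).
pub enum CompletionError {
    HttpError { status_code: u16, message: String },
}

/// Substring markers quoted in the commit message, lowercased for matching.
const MEDIA_MARKERS: [&str; 4] = [
    "loading image data",
    "loading video data",
    "cannot identify image file",
    "failed to open input buffer",
];

/// Hypothetical helper: true when the upstream body carries a media marker.
pub fn has_media_marker(message: &str) -> bool {
    let lower = message.to_ascii_lowercase();
    MEDIA_MARKERS.iter().any(|marker| lower.contains(marker))
}

/// Returns a label; the retry loop is assumed to retry only labels that
/// start with "retryable_".
pub fn classify_retry_decision(err: &CompletionError) -> &'static str {
    match err {
        CompletionError::HttpError { status_code, message }
            if (500..=599).contains(status_code) =>
        {
            if has_media_marker(message) {
                "non_retryable_client_media_error"
            } else {
                "retryable_http_5xx"
            }
        }
        // Hypothetical label: other statuses are outside this sketch.
        CompletionError::HttpError { .. } => "non_retryable_other",
    }
}
```

Under these assumptions, a 500 whose body mentions "loading VIDEO data" is labeled non-retryable, while a 500 with no media marker keeps the existing `retryable_http_5xx` label.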
Code Review
This pull request introduces a helper function is_client_media_fetch_error to classify certain HTTP 500 errors from inference engines (vLLM, SGLang) as non-retryable client media errors, preventing unnecessary retries on broken client-supplied URLs. Feedback points out that the substring check for the aiohttp wrapper format is too broad and could lead to false positives on genuine backend errors. It suggests using a more precise regular expression instead.
fn is_client_media_fetch_error(message: &str) -> bool {
    let lower = message.to_lowercase();
    lower.contains("loading image data")
        || lower.contains("loading video data")
        || lower.contains("cannot identify image file")
        || lower.contains("failed to open input buffer")
        // aiohttp wrapper format: "HTTP error 500: 4xx, message='...', url='http..."
        // The inference engine collapsed a client-fetch 4xx into a 500.
        || (lower.contains(", url='http") && lower.contains(", message='"))
}
The substring check lower.contains(", url='http") && lower.contains(", message='") is too broad and can lead to false positives. For example, if a genuine backend error (such as a database or internal service connection failure) returns a 5xx status code and happens to include those substrings in its error message, it will be incorrectly classified as a non-retryable client media error, preventing necessary retries.
Since the regex crate is already imported in this file, we can use a precise regular expression to match the aiohttp wrapper format and ensure that the collapsed status code is indeed a 4xx client error.
fn is_client_media_fetch_error(message: &str) -> bool {
    let lower = message.to_lowercase();
    if lower.contains("loading image data")
        || lower.contains("loading video data")
        || lower.contains("cannot identify image file")
        || lower.contains("failed to open input buffer")
    {
        return true;
    }
    // aiohttp wrapper format: "HTTP error 500: 4xx, message='...', url='http..."
    // The inference engine collapsed a client-fetch 4xx into a 500.
    static AIOHTTP_RE: std::sync::OnceLock<regex::Regex> = std::sync::OnceLock::new();
    let re = AIOHTTP_RE.get_or_init(|| {
        regex::Regex::new(r"(?i)http error 500:\s*4\d{2},\s*message='.*?',\s*url='https?://").unwrap()
    });
    re.is_match(message)
}
Review
Tightly scoped fix; the implementation matches the description and the test cases are pulled verbatim from prod logs (good). No critical issues found.
Minor observations (non-blocking)
Suggestion for the followup PR (out of scope)
As the PR body notes, the response status stays at 500 even though these are unambiguously client-input errors. Surfacing them as 422 with a sanitized message (no URL echoed back, for privacy) would be a strict UX improvement and align with the ✅
Pull request overview
Adjusts retry classification in InferenceProviderPool so upstream 5xx responses that are actually permanent client media-fetch/decode failures (broken image/video URLs, unsupported media, etc.) do not get retried and amplified into repeated backend work. This targets a specific operational failure mode where inference engines (vLLM/SGLang) return 500 for client-supplied media URL failures.
Changes:
- Add `is_client_media_fetch_error()` heuristic matcher for high-confidence media-fetch/decode failure markers in upstream error messages.
- Update `classify_retry_decision()` to return a non-retryable label for 5xx responses matching those markers.
- Extend `test_classify_retry_decision` with new cases covering common production error strings and a negative control.
        || lower.contains("failed to open input buffer")
        // aiohttp wrapper format: "HTTP error 500: 4xx, message='...', url='http..."
        // The inference engine collapsed a client-fetch 4xx into a 500.
        || (lower.contains(", url='http") && lower.contains(", message='"))

/// the same failure. Treat these as non-retryable so one broken URL
/// from a client doesn't get amplified into 4x backend work.
fn is_client_media_fetch_error(message: &str) -> bool {
    let lower = message.to_lowercase();

assert_eq!(
    InferenceProviderPool::classify_retry_decision(&CompletionError::HttpError {
        status_code: 500,
        message: "Internal server error: An exception occurred while loading VIDEO data at index 0: Error while loading data https://www.youtube.com/watch?v=2w_pbUrVZHY: SingleStreamDecoder, Failed to open input buffer: Invalid data found when processing input".to_string(),

assert_eq!(
    InferenceProviderPool::classify_retry_decision(&CompletionError::HttpError {
        status_code: 500,
        message: "HTTP error 500: 404, message='Not Found', url='https://www.facebook.com/v24.0/122181986942783109_1506385084189441'".to_string(),
PierreLeGuen left a comment
Requesting changes for one retry-classification regression:
`crates/services/src/inference_provider_pool/mod.rs:1391`: the aiohttp-wrapper branch is broader than the comment says. It marks any 500 body containing `, message='` and `, url='http` as `non_retryable_client_media_error`, even when the wrapped aiohttp status is another 5xx such as `HTTP error 500: 503, message='Service Unavailable', url='https://...'`. That skips the existing provider retry behavior for transient upstream/backend 5xxs just because the error string includes a URL. Please constrain this branch to wrapped 4xx media-fetch statuses, or otherwise require a stronger media-fetch marker, and add a negative test for wrapped 5xx staying `retryable_http_5xx`.
Local check run: cargo test -p services test_classify_retry_decision passed. GitHub lint/security/claude-review checks passed when inspected; the GitHub Test Suite was still pending.
… hygiene
PierreLeGuen, Gemini, and Copilot all flagged the same regression: the
aiohttp-wrapper branch matched any 5xx body containing both
`, url='http` and `, message='`, which would swallow legitimately
retryable wrapped 5xx like `HTTP error 500: 503, message='Service
Unavailable', url='...'`.
Replace the two substring checks with a precise regex against the
wrapper shape that requires the wrapped status to be a 4xx:
http error 500:\s*4\d{2},\s*message='[^']*',\s*url='https?://
Compiled once via OnceLock, run on the ascii-lowercased message
(swapped to_lowercase -> to_ascii_lowercase — all markers are ASCII
and this path runs at high volume during malformed-media incidents,
per Copilot).
Tests:
- Add the regression case PierreLeGuen called out: 500 wrapping a 503
must stay `retryable_http_5xx`.
- Add a generic-noise case: a non-wrapper 5xx that happens to contain
url=... and message=... stays retryable (guards against drift toward
the old loose substring check).
- Redact verbatim production URLs (Facebook ID, YouTube ID) to
example.test dummies — the matcher only depends on the marker
substrings, not the actual host (Copilot).
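To make the tightened wrapper check concrete, here is a standard-library-only approximation of the pattern above. It is a sketch, not the PR's implementation, which uses the `regex` crate behind a `OnceLock`. The function names are hypothetical, and `trim_start` plus ASCII digit checks stand in for the regex's `\s*` and `\d`.

```rust
/// Checks whether `rest` starts with the tail of the wrapper shape:
/// `\s*4\d{2},\s*message='[^']*',\s*url='https?://`
fn matches_wrapped_4xx_tail(rest: &str) -> bool {
    let rest = rest.trim_start();
    let bytes = rest.as_bytes();
    // The wrapped status must be '4' followed by exactly two digits.
    if bytes.len() < 3
        || bytes[0] != b'4'
        || !bytes[1].is_ascii_digit()
        || !bytes[2].is_ascii_digit()
    {
        return false;
    }
    // The first three bytes are ASCII, so slicing at 3 is on a char boundary.
    let Some(rest) = rest[3..].strip_prefix(',') else { return false };
    let Some(rest) = rest.trim_start().strip_prefix("message='") else { return false };
    // `[^']*'`: the message body runs up to the next single quote.
    let Some(quote) = rest.find('\'') else { return false };
    let Some(rest) = rest[quote + 1..].strip_prefix(',') else { return false };
    let Some(rest) = rest.trim_start().strip_prefix("url='") else { return false };
    rest.starts_with("http://") || rest.starts_with("https://")
}

/// Hypothetical name: true only for a 500 that wraps a 4xx fetch failure.
pub fn is_wrapped_client_4xx(message: &str) -> bool {
    const PREFIX: &str = "http error 500:";
    // Lowercase once; every marker is ASCII.
    let lower = message.to_ascii_lowercase();
    // A regex search is unanchored, so try each occurrence of the prefix.
    lower
        .match_indices(PREFIX)
        .any(|(idx, _)| matches_wrapped_4xx_tail(&lower[idx + PREFIX.len()..]))
}
```

With this shape, a wrapped 404 matches, while a wrapped 503 and a generic body that merely contains `url=` and `message=` both fall through to the retryable path, which is what the two new negative tests are meant to guard.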
Thanks all, pushed 0906f30 addressing the convergent feedback in one go.
@PierreLeGuen: fixed the aiohttp-wrapper regression. The branch now requires the wrapped status to be a 4xx via regex (…
@gemini-code-assist: adopted the regex approach almost verbatim; only difference is I lowercase the message once and match against that (instead of …
@copilot: …
Test count is now 7 assertions (4 positive, 3 negative). The new "500 wrapping 503" and "generic noise with url=/message=" negatives guard against drift back to the old loose substring check.
The Test Suite hang on the first run was in step "Run unit tests" (48m, no completed_at; looks like a step-level hang). My added tests are pure substring/regex matching with no I/O, so they are unlikely to be the cause. Watching the rerun on this push.
@lloydmak99 @PierreLeGuen ready for another look.
Thanks for the update, Evrard-Nil. The regex-based approach for the …
PierreLeGuen left a comment
Blocking findings:
- `crates/services/src/inference_provider_pool/mod.rs:1405,1712,1735`: the aiohttp 500-wrapping-4xx case is tested against the raw message, but the production retry gate classifies `last_error` after `sanitize_completion_error` stores it. `sanitize_error_message` redacts `https://...` to `[URL_REDACTED]`, so the regex requiring `url='https?://` no longer matches and those wrapped client-fetch 4xx errors still retry as `retryable_http_5xx`. Please classify/store the retry decision before sanitizing, or make the matcher/tests cover the sanitized message used by the retry gate.
- `crates/services/src/inference_provider_pool/mod.rs:1687-1697,1735`: the new non-retryable client-media 5xx path is still counted as a provider health failure before the retry decision is applied. Sustained bad client media can still increment `provider_failure_counts` and demote healthy backends, even though the cause is client input. Please gate the failure counter on the same raw-error retry decision.
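The ordering problem in the first finding can be illustrated with a small sketch. Everything here is an assumption made for illustration. The body of `sanitize_error_message` is a hypothetical redactor rather than the project's real one, `has_wrapper_url_marker` is a deliberately reduced stand-in for the regex's trailing `url='https?://` requirement, and `classify_then_sanitize` is an invented name for the "classify before sanitizing" option.

```rust
/// Hypothetical redactor: replaces every http(s) URL with "[URL_REDACTED]".
pub fn sanitize_error_message(message: &str) -> String {
    let mut out = String::with_capacity(message.len());
    let mut rest = message;
    loop {
        // Earliest occurrence of either scheme in the remaining text.
        let start = match (rest.find("http://"), rest.find("https://")) {
            (Some(a), Some(b)) => a.min(b),
            (Some(a), None) | (None, Some(a)) => a,
            (None, None) => break,
        };
        out.push_str(&rest[..start]);
        out.push_str("[URL_REDACTED]");
        // The URL runs until whitespace or a closing quote.
        let url = &rest[start..];
        let end = url
            .find(|c: char| c.is_whitespace() || c == '\'' || c == '"')
            .unwrap_or(url.len());
        rest = &url[end..];
    }
    out.push_str(rest);
    out
}

/// Reduced stand-in for the matcher's `url='https?://` requirement.
pub fn has_wrapper_url_marker(message: &str) -> bool {
    let lower = message.to_ascii_lowercase();
    lower.contains("url='http://") || lower.contains("url='https://")
}

/// Classify on the raw message first, then sanitize what gets stored.
pub fn classify_then_sanitize(raw: &str) -> (&'static str, String) {
    let decision = if has_wrapper_url_marker(raw) {
        "non_retryable_client_media_error"
    } else {
        "retryable_http_5xx"
    };
    (decision, sanitize_error_message(raw))
}
```

Under these assumptions the marker is present in the raw message but gone after redaction, so a gate that classifies the stored, sanitized error would keep retrying. Carrying the decision alongside the sanitized message avoids that without echoing the URL.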
Checks run locally:
- `cargo test -p services --lib test_classify_retry_decision` passed.
- `cargo check -p services` passed.
- `cargo fmt --check` failed on formatting in the changed test lines; GitHub Lint is failing too.
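For the second blocking finding, one possible shape of gating the health counter on the retry decision is sketched below. Only the field name `provider_failure_counts` comes from the review. The `ProviderHealth` struct, its methods, and the exact gating rule are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical health tracker keyed by provider name.
pub struct ProviderHealth {
    provider_failure_counts: HashMap<String, u32>,
}

impl ProviderHealth {
    pub fn new() -> Self {
        Self { provider_failure_counts: HashMap::new() }
    }

    /// Records a failed attempt unless the failure was caused by client input.
    pub fn record_failure(&mut self, provider: &str, retry_decision: &str) {
        // Assumption: client-media errors say nothing about backend health,
        // so they must not push a provider toward demotion.
        if retry_decision == "non_retryable_client_media_error" {
            return;
        }
        *self
            .provider_failure_counts
            .entry(provider.to_string())
            .or_insert(0) += 1;
    }

    /// Current failure count for a provider (0 if never recorded).
    pub fn failures(&self, provider: &str) -> u32 {
        self.provider_failure_counts.get(provider).copied().unwrap_or(0)
    }
}
```

The point of the sketch is only that the counter and the retry gate consult the same decision, computed from the raw error, so sustained bad client media cannot demote a healthy backend.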
Fixes #687.
Summary
vLLM and SGLang return HTTP 500 when they can't fetch or decode a multimodal media URL the client supplied (broken Facebook CDN image URLs, geo-blocked YouTube videos, base64 mp4 sent as `image_url`, etc.). These are permanent client-input errors: the same payload re-runs the same fetch and produces the same failure. But `classify_retry_decision` treated every 5xx the same way → `retryable_http_5xx`, so each malformed request got 4 attempts (1 + 3 retries) of identical work.
Under sustained bad-media load this 4× amplifies until the backends saturate. Observed twice:
- `google/gemma-4-31B-it` + `Qwen/Qwen3.5-122B-A10B`, ~40 `All providers failed for model` events in 20 min, same `request_id` retried 4× each on identical broken URLs (`https://www.facebook.com/v24.0/...` returning 404, `https://www.youtube.com/watch?v=2w_pbUrVZHY` failing torchcodec).
- … 0/5 then recovered.
Fix
`is_client_media_fetch_error(message)` substring-matches the upstream `error_detail` body for high-confidence client-fetch markers:
- `loading image data` / `loading video data` (sglang + vLLM)
- `cannot identify image file` (PIL `UnidentifiedImageError`)
- `Failed to open input buffer` (torchcodec)
- `, url='http…` combined with `, message='…` (aiohttp wrapper format: `HTTP error 500: 404, message='Not Found', url='https://www.facebook.com/...'`)
In `classify_retry_decision`, the 5xx branch now consults this helper and returns `non_retryable_client_media_error` when it matches. The inner retry loop already gates on `starts_with("retryable_")`, so the label change is sufficient to stop the retry; no other plumbing is needed.
Genuine backend 5xx (no media markers: KV-cache evictions, NCCL hiccups, OOM, etc.) still classify as `retryable_http_5xx` and retry as before.
Same shape as #611
This is the `classify_provider_error` pattern from PR #611 (upstream auth-error masking), applied to the retry decision instead of the response classification. The two complement each other: #611 protects callers from seeing our backend-creds bugs; this protects backends from being retry-amplified by client-input bugs.
What this does not do
- … `error_kind` (`http_5xx` stays accurate at the engine layer; only `retry_decision` flips).
Test plan
- `cargo test -p services --lib test_classify_retry_decision` passes (added 5 new assertions: 4 verbatim from prod logs, 1 negative case to confirm non-media 500s still retry).
- `cargo check -p services` clean.
- … `All providers failed for model` rate for `google/gemma-4-31B-it` and `Qwen/Qwen3.5-122B-A10B`: expect a significant drop in "4-attempts-same-error" patterns. Total RPS to inference backends should fall by ~3× of the malformed-media RPS.
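The amplification arithmetic in the description can be checked with a small sketch of a retry loop gated on the label prefix. The loop below is a hypothetical simplification, not the pool's real retry code. `MAX_ATTEMPTS` and `run_with_retries` are invented names, and the closure stands in for one upstream call that returns its retry label on failure.

```rust
/// 1 initial attempt + 3 retries, as stated in the PR description.
pub const MAX_ATTEMPTS: u32 = 4;

/// Calls `attempt` until it succeeds, returns a non-retryable label, or the
/// attempt budget runs out. Returns how many upstream calls were made.
pub fn run_with_retries<F>(mut attempt: F) -> u32
where
    F: FnMut() -> Result<(), &'static str>,
{
    let mut calls = 0;
    while calls < MAX_ATTEMPTS {
        calls += 1;
        match attempt() {
            Ok(()) => break,
            // Only labels with the "retryable_" prefix get another attempt.
            Err(label) if label.starts_with("retryable_") => continue,
            Err(_) => break,
        }
    }
    calls
}
```

In this model a request that always fails with `retryable_http_5xx` costs 4 upstream calls, while the same request labeled `non_retryable_client_media_error` costs 1. Each malformed-media request therefore saves 3 calls, which is where the "~3× of the malformed-media RPS" estimate comes from.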