fix: retry once on transient connection errors before failing#520
Evrard-Nil wants to merge 3 commits into main
Conversation
Most models route through a single provider (model-proxy), so provider fallback alone doesn't help. Add a single retry with 500ms delay for connection failures and 5xx errors. This handles transient issues like QEMU SLIRP listen backlog=1, brief nginx reloads, and Docker bridge churn that cause ~9.5k "All providers failed" errors/day in prod. 4xx client errors are still not retried (unchanged behavior).
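The retry flow described above can be sketched in std-only Rust. All names here (`CallError`, `retry_with_fallback`) are illustrative stand-ins for the pool's real types, and the blocking `thread::sleep` stands in for `tokio::time::sleep`:

```rust
use std::thread::sleep;
use std::time::Duration;

const MAX_ROUNDS: usize = 2;
const RETRY_DELAY: Duration = Duration::from_millis(500);

// Hypothetical error type standing in for the pool's CompletionError.
#[derive(Debug, Clone)]
enum CallError {
    Connection(String),   // transient: always retried
    Http { status: u16 }, // retried only for 5xx
}

fn is_retryable(err: &CallError) -> bool {
    match err {
        CallError::Connection(_) => true,
        CallError::Http { status } => *status >= 500,
    }
}

// Try each provider once per round; sleep between rounds, give up after
// MAX_ROUNDS or as soon as the last error is a non-retryable client error.
fn retry_with_fallback<T>(
    providers: &[&dyn Fn() -> Result<T, CallError>],
) -> Result<T, CallError> {
    let mut last_error = None;
    for round in 0..MAX_ROUNDS {
        for call in providers {
            match call() {
                Ok(v) => return Ok(v),
                Err(e) => last_error = Some(e),
            }
        }
        let err = last_error.clone().expect("at least one provider attempted");
        if !is_retryable(&err) || round + 1 == MAX_ROUNDS {
            return Err(err);
        }
        sleep(RETRY_DELAY); // fixed delay between rounds
    }
    Err(last_error.expect("providers must be non-empty"))
}
```

A provider that fails with a 503 on the first round is retried once after the delay; a 401 fails immediately with no second round, matching the unchanged 4xx behavior.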
Code Review
This pull request implements a retry mechanism for transient connection errors in the inference provider pool by wrapping the provider selection logic in a retry loop. Feedback highlights that the calculation of total_attempts in the error logs is inaccurate, as it fails to account for early exits from the retry loop when encountering non-retryable errors.
```diff
- // All providers failed
+ // All providers failed after all retry rounds
+ let total_attempts = providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 });
```
The calculation of total_attempts is inaccurate because it assumes that all retry rounds were completed if last_error is present. If the loop breaks early after the first round (e.g., because is_retryable is false due to a 4xx error), total_attempts will incorrectly report providers.len() * 2 instead of providers.len() * 1.
Additionally, the expression MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 }) is unnecessarily complex and effectively constant since last_error is guaranteed to be Some if the code reaches this point (as the inner loop runs at least once for a non-empty provider list).
Code Review

The retry logic is well-motivated. Two issues worth addressing before merge:

Issue 1 (functional): failure counter double-counted across retry rounds

A provider that fails in both round 0 and round 1 has its failure counter incremented twice for a single client request. With MAX_CONSECUTIVE_FAILURES = 10, a provider reaches demotion after only 5 failing requests instead of 10. For the common case (one provider per model), each failed request under load now increments the counter twice, halving the effective demotion threshold.

Fix: track already-counted provider keys in a local HashSet within try_with_providers, and only increment the failure counter once per provider per request, regardless of retry rounds.

Issue 2 (minor): total_attempts log value is incorrect when retry does not occur

The expression `providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 })` reports both rounds whenever last_error is present, even if the loop exited after round 0 on a non-retryable error.

No other critical issues. The is_retryable check using the last provider error is an acceptable simplification given most models have a single provider.

Warning: Issue 1 should be fixed before merge to avoid distorting provider health demotion tracking.
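The fix proposed for Issue 1 might look like the following sketch. `Health`, `record_failure_once`, and the string key type are hypothetical and not the pool's real API; the point is the per-request HashSet that deduplicates counting across rounds:

```rust
use std::collections::{HashMap, HashSet};

// Illustrative health tracker; the real pool's counter lives elsewhere.
struct Health {
    consecutive_failures: HashMap<String, u32>,
}

impl Health {
    // Record a failure for `provider_key` at most once per client request:
    // `counted` is a per-request set owned by the retry loop, so a provider
    // that fails in both rounds still increments its counter a single time.
    fn record_failure_once(&mut self, provider_key: &str, counted: &mut HashSet<String>) {
        if counted.insert(provider_key.to_string()) {
            *self
                .consecutive_failures
                .entry(provider_key.to_string())
                .or_insert(0) += 1;
        }
    }
}
```

With this shape, two failing rounds within one request count as one failure, while a fresh request (fresh HashSet) counts again, restoring the intended demotion threshold.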
Pull request overview
Adds a single “second round” retry (after a 500ms delay) to InferenceProviderPool::retry_with_fallback to mitigate transient connection/5xx failures when a model effectively has only one usable provider (e.g., via model-proxy), improving resilience before surfacing “All providers failed”.
Changes:
- Introduce up to 2 retry rounds with a fixed 500ms delay between rounds.
- Extend tracing fields/logs to include `round` and `total_attempts` when all providers fail.
```rust
// Check if the last error is retryable (connection/server errors, not client errors)
let is_retryable = match &last_error {
    Some(CompletionError::CompletionError(_)) => true, // Connection failures, timeouts
    Some(CompletionError::HttpError { status_code, .. }) => *status_code >= 500,
    _ => false,
};
```
The retry gate is too broad: it treats CompletionError::CompletionError(_) as always retryable, but that variant is also used for non-transient failures (e.g., invalid API key/header building, or JSON parse/serialization errors in inference_providers). This can introduce unnecessary 500ms sleeps plus duplicate attempts for deterministic failures, and the decision is based only on the last provider's error rather than on "all providers failed with transient errors". Consider tracking per round whether all provider errors were transient, and restricting CompletionError::CompletionError retries to known network/connection/timeout cases (or introducing a structured transient error kind).
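One possible shape for the suggested structured error kind, sketched with hypothetical names (`ErrorKind`, `round_is_retryable` do not exist in the codebase). It narrows the gate to known-transient cases and retries only when every provider in the round failed transiently:

```rust
// Explicit classification instead of treating every
// CompletionError::CompletionError as transient.
#[derive(Debug)]
enum ErrorKind {
    TransientNetwork, // connect refused/reset, timeout
    ServerError,      // HTTP 5xx
    ClientError,      // HTTP 4xx, bad API key, parse/serialization errors
}

fn is_retryable(kind: &ErrorKind) -> bool {
    matches!(kind, ErrorKind::TransientNetwork | ErrorKind::ServerError)
}

// A round is worth retrying only if *every* provider failed transiently;
// a single deterministic client error means retrying cannot help.
fn round_is_retryable(kinds: &[ErrorKind]) -> bool {
    !kinds.is_empty() && kinds.iter().all(is_retryable)
}
```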
```rust
// All providers failed after all retry rounds
let total_attempts = providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 });
```
total_attempts is computed as providers.len() * 2 whenever last_error.is_some(), even if the retry loop broke early (e.g., last error was non-retryable) and only 1 round actually executed. This makes the error log misleading for ops/debugging. Consider tracking an attempts_made counter incremented per provider call (across rounds) and logging that instead.
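A sketch of the suggested attempts_made approach, reduced to its counting logic. `total_attempts` here is an illustrative free function, and `retryable_after_round` stands in for the real per-round is_retryable check:

```rust
// Count actual provider calls instead of deriving the total from
// providers.len() * rounds, so the log stays accurate on early exit.
fn total_attempts(provider_count: usize, max_rounds: usize, retryable_after_round: &[bool]) -> usize {
    let mut attempts_made = 0;
    for round in 0..max_rounds {
        attempts_made += provider_count; // every provider called once this round
        if round + 1 < max_rounds && !retryable_after_round[round] {
            break; // non-retryable error: no further rounds
        }
    }
    attempts_made
}
```

With 3 providers and a non-retryable first-round error this reports 3 attempts, not the 6 the current expression would log.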
```rust
    operation = operation_name,
    "Retrying after transient connection failure"
);
tokio::time::sleep(RETRY_DELAY).await;
```
New behavior (second-round retry + 500ms delay) isn't covered by the existing tests in this module. Adding a unit test that asserts the retry happens exactly once for retryable failures (e.g., 5xx / connection error) and not at all for non-retryable CompletionError::CompletionError messages (like invalid API key / parse error) would prevent regressions. Ideally, structure the delay so tests don't need to sleep 500ms in real time (e.g., inject the delay, or use Tokio's paused time if available).
```diff
- tokio::time::sleep(RETRY_DELAY).await;
+ #[cfg(not(test))]
+ {
+     tokio::time::sleep(RETRY_DELAY).await;
+ }
+ #[cfg(test)]
+ {
+     // In tests, avoid incurring the full retry delay so retry behavior
+     // can be exercised without slowing down the test suite.
+     tokio::time::sleep(Duration::from_millis(0)).await;
+ }
```
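The "inject the delay" alternative mentioned in the comment could look like the sketch below: the pool owns a sleeper closure, so a test records the requested delay instead of waiting for it. All names are hypothetical; with Tokio, `tokio::time::pause()` on a current-thread test runtime is another way to avoid real 500ms sleeps:

```rust
use std::time::Duration;

// Illustrative pool that delegates inter-round delays to an injected sleeper.
struct Pool<S: FnMut(Duration)> {
    sleeper: S,
}

impl<S: FnMut(Duration)> Pool<S> {
    // Production code would pass a closure that actually sleeps;
    // tests pass one that just records the requested duration.
    fn between_rounds(&mut self) {
        (self.sleeper)(Duration::from_millis(500));
    }
}
```

This keeps the `#[cfg(test)]` branching out of the retry loop itself, at the cost of one extra generic parameter on the pool.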
Summary
Root cause
QEMU SLIRP has a hardcoded listen backlog of 1, brief nginx reloads during config updates, and Docker bridge churn during container restarts all cause transient TCP connection failures. With only 1 provider per model, these fail immediately with no recovery.
Top affected models (24h):
- openai/gpt-oss-120b: ~6,000 errors
- zai-org/GLM-5-FP8: ~2,600 errors
- Qwen/Qwen3.5-122B-A10B: ~580 errors

Reproduction steps
See `repro_connection_retry.sh` (gitignored) for the full reproduction script.

Test plan
- `cargo check` compiles cleanly
- Existing tests pass (`cargo test --lib --bins`)
- Verified the "Retrying after transient connection failure" log line and `round=2` in the success log

🤖 Generated with Claude Code