
fix: retry once on transient connection errors before failing #520

Open

Evrard-Nil wants to merge 3 commits into `main` from `fix/retry-transient-connection-errors`

Conversation

@Evrard-Nil
Contributor

Summary

  • Add a single retry with 500ms delay when all providers fail with connection or server errors (5xx)
  • Most models have only 1 provider (via model-proxy), so the existing provider fallback was ineffective
  • 4xx client errors are still not retried (unchanged behavior)
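The retry shape described above can be sketched roughly as follows. This is a simplified stand-in, not the PR's actual code: `CompletionError`, `MAX_ROUNDS`, and `retry_with_fallback` here are local approximations of the real types, and the 500ms sleep is reduced to a comment.

```rust
// Local stand-in for the PR's error type (simplified).
#[derive(Debug)]
enum CompletionError {
    Connection(String),        // connection failures, timeouts
    Http { status_code: u16 }, // HTTP error responses
}

const MAX_ROUNDS: usize = 2; // one initial round + one retry

// Connection/5xx errors are retryable; 4xx client errors are not.
fn is_retryable(err: &CompletionError) -> bool {
    match err {
        CompletionError::Connection(_) => true,
        CompletionError::Http { status_code } => *status_code >= 500,
    }
}

fn retry_with_fallback(
    mut attempt: impl FnMut() -> Result<String, CompletionError>,
) -> Result<String, CompletionError> {
    let mut last_error = None;
    for _round in 0..MAX_ROUNDS {
        match attempt() {
            Ok(v) => return Ok(v),
            Err(e) => {
                let retryable = is_retryable(&e);
                last_error = Some(e);
                if !retryable {
                    break; // 4xx: fail immediately, unchanged behavior
                }
                // the real code awaits RETRY_DELAY (500ms) here
            }
        }
    }
    Err(last_error.expect("at least one attempt was made"))
}
```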

Root cause

QEMU SLIRP's hardcoded listen backlog of 1, brief nginx reloads during config updates, and Docker bridge churn during container restarts all cause transient TCP connection failures. With only 1 provider per model, affected requests fail immediately with no recovery.

Top affected models (24h):

  • openai/gpt-oss-120b: ~6,000 errors
  • zai-org/GLM-5-FP8: ~2,600 errors
  • Qwen/Qwen3.5-122B-A10B: ~580 errors

Reproduction steps

# Send 20 concurrent requests to stress the model-proxy connection path
# QEMU SLIRP backlog=1 means only 1 pending connect() at a time
for i in $(seq 1 20); do
  curl -s --max-time 30 -X POST "https://cloud-api.near.ai/v1/chat/completions" \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "zai-org/GLM-5-FP8",
      "messages": [{"role": "user", "content": "hi"}],
      "max_tokens": 5
    }' &
done
wait

# Check Datadog for "All providers failed" errors
# Query: service:cloud-api env:prod @level:ERROR "All providers failed"

See repro_connection_retry.sh (gitignored) for the full reproduction script.

Test plan

  • cargo check compiles cleanly
  • All 188 unit tests pass (cargo test --lib --bins)
  • Deploy to staging:
    • Verify retry log messages appear: "Retrying after transient connection failure"
    • Verify 4xx errors are NOT retried
    • Verify successful retries show round=2 in success log
    • Monitor latency: retry adds max 500ms to failed requests (not to successful ones)

🤖 Generated with Claude Code

Most models route through a single provider (model-proxy), so provider
fallback alone doesn't help. Add a single retry with 500ms delay for
connection failures and 5xx errors.

This handles transient issues like QEMU SLIRP listen backlog=1,
brief nginx reloads, and Docker bridge churn that cause ~9.5k
"All providers failed" errors/day in prod.

4xx client errors are still not retried (unchanged behavior).
Copilot AI review requested due to automatic review settings March 31, 2026 03:42
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request implements a retry mechanism for transient connection errors in the inference provider pool by wrapping the provider selection logic in a retry loop. Feedback highlights that the calculation of total_attempts in the error logs is inaccurate, as it fails to account for early exits from the retry loop when encountering non-retryable errors.


// All providers failed
// All providers failed after all retry rounds
let total_attempts = providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 });

Severity: medium

The calculation of total_attempts is inaccurate because it assumes that all retry rounds were completed if last_error is present. If the loop breaks early after the first round (e.g., because is_retryable is false due to a 4xx error), total_attempts will incorrectly report providers.len() * 2 instead of providers.len() * 1.

Additionally, the expression MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 }) is unnecessarily complex and effectively constant since last_error is guaranteed to be Some if the code reaches this point (as the inner loop runs at least once for a non-empty provider list).

@claude

claude bot commented Mar 31, 2026

Code Review

The retry logic is well-motivated. Two issues worth addressing before merge:

Issue 1 (functional): Failure counter double-counted across retry rounds

A provider that fails in both round 0 and round 1 has its failure counter incremented twice for a single client request. With MAX_CONSECUTIVE_FAILURES = 10, a provider reaches demotion after only 5 failing requests instead of 10.

For the common case (1 provider per model), each failed request under load now increments the counter twice, halving the effective demotion threshold. Fix: track already-counted provider keys in a local HashSet within try_with_providers, and only increment the failure counter once per provider per request regardless of retry rounds.
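The suggested HashSet fix can be sketched as below. This is a hypothetical illustration, not the PR's code: `record_failures` and its input shape (one `Vec` of failed provider keys per retry round) are invented for the example; `HashSet::insert` returning `true` only on first insertion is what makes each provider count at most once per request.

```rust
use std::collections::HashSet;

// Hypothetical sketch: given the provider keys that failed in each retry
// round, return the keys whose failure counter should be incremented —
// each provider at most once per client request, regardless of rounds.
fn record_failures(failed_rounds: &[Vec<&str>]) -> Vec<String> {
    let mut counted: HashSet<String> = HashSet::new();
    let mut increments = Vec::new();
    for round in failed_rounds {
        for provider_key in round {
            // insert() returns true only the first time a key is seen
            if counted.insert(provider_key.to_string()) {
                increments.push(provider_key.to_string());
            }
        }
    }
    increments
}
```

With this shape, a single provider failing in both rounds contributes one increment, so the effective `MAX_CONSECUTIVE_FAILURES` demotion threshold is preserved.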

Issue 2 (minor): total_attempts log value is incorrect when retry does not occur

The expression providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 }) always evaluates to providers.len() * MAX_ROUNDS since last_error is always Some at this point. If round 0 fails with a non-retryable error (e.g., 429) and we break early, total_attempts still logs providers.len() * 2 instead of the actual providers.len() * 1. Minor but misleading in error logs.
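The attempts-counter fix can be sketched like this. Again a simplified stand-in: each slice entry simulates one provider attempt, with `Err((is_retryable, msg))` standing in for the real error type, and `attempts_made` is the counter that would replace the derived `providers.len() * MAX_ROUNDS` value in the log.

```rust
// Sketch: count attempts as they happen instead of deriving the total
// from MAX_ROUNDS, so early breaks on non-retryable errors log correctly.
fn run_rounds(results: &[Result<(), (bool, &str)>]) -> (usize, Option<String>) {
    let mut attempts_made = 0;
    let mut last_error = None;
    for res in results {
        attempts_made += 1; // incremented per actual provider call
        match res {
            Ok(()) => return (attempts_made, None),
            Err((retryable, msg)) => {
                last_error = Some(msg.to_string());
                if !*retryable {
                    break; // non-retryable: stop, attempts_made stays accurate
                }
            }
        }
    }
    (attempts_made, last_error)
}
```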


No other critical issues. The is_retryable check using the last provider error is an acceptable simplification given most models have a single provider.

Warning: Issue 1 should be fixed before merge to avoid distorting provider health demotion tracking.

@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:47 — with GitHub Actions Inactive

Copilot AI left a comment


Pull request overview

Adds a single “second round” retry (after a 500ms delay) to InferenceProviderPool::retry_with_fallback to mitigate transient connection/5xx failures when a model effectively has only one usable provider (e.g., via model-proxy), improving resilience before surfacing “All providers failed”.

Changes:

  • Introduce up to 2 retry rounds with a fixed 500ms delay between rounds.
  • Extend tracing fields/logs to include round and total_attempts when all providers fail.


Comment on lines +750 to +755
// Check if the last error is retryable (connection/server errors, not client errors)
let is_retryable = match &last_error {
Some(CompletionError::CompletionError(_)) => true, // Connection failures, timeouts
Some(CompletionError::HttpError { status_code, .. }) => *status_code >= 500,
_ => false,
};

Copilot AI Mar 31, 2026


The retry gate is too broad/inaccurate: it treats CompletionError::CompletionError(_) as always retryable, but that variant is also used for non-transient failures (e.g., invalid API key/header building or JSON parse/serialization errors in inference_providers). This can introduce unnecessary 500ms sleeps + duplicate attempts for deterministic failures and also retries based only on the last provider’s error rather than “all providers failed with transient errors”. Consider tracking per-round whether all provider errors were transient, and restrict CompletionError::CompletionError retries to known network/connection/timeout cases (or introduce a structured transient error kind).
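The "structured transient error kind" idea from this comment could look roughly like the sketch below. All names here are hypothetical (the real error variants live in the PR's crate); the point is that only known network/connection/timeout failures and 5xx responses qualify as transient, while deterministic failures like bad credentials or parse errors never trigger the 500ms retry.

```rust
// Hypothetical structured error kinds replacing the opaque
// CompletionError::CompletionError(String) catch-all.
#[derive(Debug)]
enum ErrorKind {
    ConnectTimeout,  // transient: network path not ready
    ConnectionReset, // transient: peer dropped the connection
    InvalidApiKey,   // deterministic: retrying cannot help
    ParseError,      // deterministic: malformed response body
    Http(u16),       // HTTP status from a provider
}

// Retry only known-transient kinds and server errors (5xx).
fn is_transient(kind: &ErrorKind) -> bool {
    matches!(kind, ErrorKind::ConnectTimeout | ErrorKind::ConnectionReset)
        || matches!(kind, ErrorKind::Http(s) if *s >= 500)
}
```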

Copilot uses AI. Check for mistakes.
Comment on lines +762 to +763
// All providers failed after all retry rounds
let total_attempts =

Copilot AI Mar 31, 2026


total_attempts is computed as providers.len() * 2 whenever last_error.is_some(), even if the retry loop broke early (e.g., last error was non-retryable) and only 1 round actually executed. This makes the error log misleading for ops/debugging. Consider tracking an attempts_made counter incremented per provider call (across rounds) and logging that instead.

operation = operation_name,
"Retrying after transient connection failure"
);
tokio::time::sleep(RETRY_DELAY).await;

Copilot AI Mar 31, 2026


New behavior (second round retry + 500ms delay) isn’t covered by the existing tests in this module. Adding a unit test that asserts retry happens exactly once for retryable failures (e.g., 5xx / connection error) and does not happen for non-retryable CompletionError::CompletionError messages (like invalid API key / parse error) would prevent regressions; ideally structure the delay so tests don’t need to sleep 500ms in real time (e.g., inject delay or use Tokio’s paused time if available).

Suggested change
tokio::time::sleep(RETRY_DELAY).await;
#[cfg(not(test))]
{
tokio::time::sleep(RETRY_DELAY).await;
}
#[cfg(test)]
{
// In tests, avoid incurring the full retry delay so retry behavior
// can be exercised without slowing down the test suite.
tokio::time::sleep(Duration::from_millis(0)).await;
}
