fix: retry once on transient connection errors before failing#520
Evrard-Nil wants to merge 3 commits into main
Conversation
Most models route through a single provider (model-proxy), so provider fallback alone doesn't help. Add a single retry with 500ms delay for connection failures and 5xx errors. This handles transient issues like QEMU SLIRP listen backlog=1, brief nginx reloads, and Docker bridge churn that cause ~9.5k "All providers failed" errors/day in prod. 4xx client errors are still not retried (unchanged behavior).
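The retry flow described above can be sketched in std-only Rust. All names here (`CallError`, `retry_with_fallback`) are illustrative stand-ins for the pool's real types, and the blocking `thread::sleep` stands in for `tokio::time::sleep`:

```rust
use std::thread::sleep;
use std::time::Duration;

const MAX_ROUNDS: usize = 2;
const RETRY_DELAY: Duration = Duration::from_millis(500);

// Hypothetical error type standing in for the pool's CompletionError.
#[derive(Debug, Clone)]
enum CallError {
    Connection(String),   // transient: always retried
    Http { status: u16 }, // retried only for 5xx
}

fn is_retryable(err: &CallError) -> bool {
    match err {
        CallError::Connection(_) => true,
        CallError::Http { status } => *status >= 500,
    }
}

// Try each provider once per round; sleep between rounds, give up after
// MAX_ROUNDS or as soon as the last error is a non-retryable client error.
fn retry_with_fallback<T>(
    providers: &[&dyn Fn() -> Result<T, CallError>],
) -> Result<T, CallError> {
    let mut last_error = None;
    for round in 0..MAX_ROUNDS {
        for call in providers {
            match call() {
                Ok(v) => return Ok(v),
                Err(e) => last_error = Some(e),
            }
        }
        let err = last_error.clone().expect("at least one provider attempted");
        if !is_retryable(&err) || round + 1 == MAX_ROUNDS {
            return Err(err);
        }
        sleep(RETRY_DELAY); // fixed delay between rounds
    }
    Err(last_error.expect("providers must be non-empty"))
}
```

A provider that fails with a 503 on the first round is retried once after the delay; a 401 fails immediately with no second round, matching the unchanged 4xx behavior.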
Code Review
This pull request implements a retry mechanism for transient connection errors in the inference provider pool by wrapping the provider selection logic in a retry loop. Feedback highlights that the calculation of total_attempts in the error logs is inaccurate, as it fails to account for early exits from the retry loop when encountering non-retryable errors.
```diff
- // All providers failed
+ // All providers failed after all retry rounds
+ let total_attempts = providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 });
```
The calculation of total_attempts is inaccurate because it assumes that all retry rounds were completed if last_error is present. If the loop breaks early after the first round (e.g., because is_retryable is false due to a 4xx error), total_attempts will incorrectly report providers.len() * 2 instead of providers.len() * 1.
Additionally, the expression MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 }) is unnecessarily complex and effectively constant since last_error is guaranteed to be Some if the code reaches this point (as the inner loop runs at least once for a non-empty provider list).
Code Review

The retry logic is well-motivated. Two issues worth addressing before merge:

Issue 1 (functional): failure counter double-counted across retry rounds

A provider that fails in both round 0 and round 1 has its failure counter incremented twice for a single client request. With MAX_CONSECUTIVE_FAILURES = 10, a provider reaches demotion after only 5 failing requests instead of 10. For the common case (one provider per model), each failed request under load now increments the counter twice, halving the effective demotion threshold.

Fix: track already-counted provider keys in a local HashSet within try_with_providers, and only increment the failure counter once per provider per request, regardless of retry rounds.

Issue 2 (minor): total_attempts log value is incorrect when retry does not occur

The expression `providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 })` reports both rounds whenever last_error is present, even if the loop exited after round 0 on a non-retryable error.

No other critical issues. The is_retryable check using the last provider error is an acceptable simplification given most models have a single provider.

Warning: Issue 1 should be fixed before merge to avoid distorting provider health demotion tracking.
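The fix proposed for Issue 1 might look like the following sketch. `Health`, `record_failure_once`, and the string key type are hypothetical and not the pool's real API; the point is the per-request HashSet that deduplicates counting across rounds:

```rust
use std::collections::{HashMap, HashSet};

// Illustrative health tracker; the real pool's counter lives elsewhere.
struct Health {
    consecutive_failures: HashMap<String, u32>,
}

impl Health {
    // Record a failure for `provider_key` at most once per client request:
    // `counted` is a per-request set owned by the retry loop, so a provider
    // that fails in both rounds still increments its counter a single time.
    fn record_failure_once(&mut self, provider_key: &str, counted: &mut HashSet<String>) {
        if counted.insert(provider_key.to_string()) {
            *self
                .consecutive_failures
                .entry(provider_key.to_string())
                .or_insert(0) += 1;
        }
    }
}
```

With this shape, two failing rounds within one request count as one failure, while a fresh request (fresh HashSet) counts again, restoring the intended demotion threshold.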
Pull request overview
Adds a single “second round” retry (after a 500ms delay) to InferenceProviderPool::retry_with_fallback to mitigate transient connection/5xx failures when a model effectively has only one usable provider (e.g., via model-proxy), improving resilience before surfacing “All providers failed”.
Changes:
- Introduce up to 2 retry rounds with a fixed 500ms delay between rounds.
- Extend tracing fields/logs to include `round` and `total_attempts` when all providers fail.
```rust
// Check if the last error is retryable (connection/server errors, not client errors)
let is_retryable = match &last_error {
    Some(CompletionError::CompletionError(_)) => true, // Connection failures, timeouts
    Some(CompletionError::HttpError { status_code, .. }) => *status_code >= 500,
    _ => false,
};
```
The retry gate is too broad: it treats CompletionError::CompletionError(_) as always retryable, but that variant is also used for non-transient failures (e.g., invalid API key/header building, or JSON parse/serialization errors in inference_providers). This can introduce unnecessary 500ms sleeps plus duplicate attempts for deterministic failures, and the decision is based only on the last provider's error rather than on "all providers failed with transient errors". Consider tracking per round whether all provider errors were transient, and restricting CompletionError::CompletionError retries to known network/connection/timeout cases (or introducing a structured transient error kind).
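One possible shape for the suggested structured error kind, sketched with hypothetical names (`ErrorKind`, `round_is_retryable` do not exist in the codebase). It narrows the gate to known-transient cases and retries only when every provider in the round failed transiently:

```rust
// Explicit classification instead of treating every
// CompletionError::CompletionError as transient.
#[derive(Debug)]
enum ErrorKind {
    TransientNetwork, // connect refused/reset, timeout
    ServerError,      // HTTP 5xx
    ClientError,      // HTTP 4xx, bad API key, parse/serialization errors
}

fn is_retryable(kind: &ErrorKind) -> bool {
    matches!(kind, ErrorKind::TransientNetwork | ErrorKind::ServerError)
}

// A round is worth retrying only if *every* provider failed transiently;
// a single deterministic client error means retrying cannot help.
fn round_is_retryable(kinds: &[ErrorKind]) -> bool {
    !kinds.is_empty() && kinds.iter().all(is_retryable)
}
```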
```rust
// All providers failed after all retry rounds
let total_attempts = providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 });
```
total_attempts is computed as providers.len() * 2 whenever last_error.is_some(), even if the retry loop broke early (e.g., last error was non-retryable) and only 1 round actually executed. This makes the error log misleading for ops/debugging. Consider tracking an attempts_made counter incremented per provider call (across rounds) and logging that instead.
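A sketch of the suggested attempts_made approach, reduced to its counting logic. `total_attempts` here is an illustrative free function, and `retryable_after_round` stands in for the real per-round is_retryable check:

```rust
// Count actual provider calls instead of deriving the total from
// providers.len() * rounds, so the log stays accurate on early exit.
fn total_attempts(provider_count: usize, max_rounds: usize, retryable_after_round: &[bool]) -> usize {
    let mut attempts_made = 0;
    for round in 0..max_rounds {
        attempts_made += provider_count; // every provider called once this round
        if round + 1 < max_rounds && !retryable_after_round[round] {
            break; // non-retryable error: no further rounds
        }
    }
    attempts_made
}
```

With 3 providers and a non-retryable first-round error this reports 3 attempts, not the 6 the current expression would log.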
```rust
    operation = operation_name,
    "Retrying after transient connection failure"
);
tokio::time::sleep(RETRY_DELAY).await;
```
New behavior (second-round retry + 500ms delay) isn't covered by the existing tests in this module. Adding a unit test that asserts the retry happens exactly once for retryable failures (e.g., 5xx / connection error) and not at all for non-retryable CompletionError::CompletionError messages (like invalid API key / parse error) would prevent regressions. Ideally, structure the delay so tests don't need to sleep 500ms in real time (e.g., inject the delay, or use Tokio's paused time if available).
```diff
- tokio::time::sleep(RETRY_DELAY).await;
+ #[cfg(not(test))]
+ {
+     tokio::time::sleep(RETRY_DELAY).await;
+ }
+ #[cfg(test)]
+ {
+     // In tests, avoid incurring the full retry delay so retry behavior
+     // can be exercised without slowing down the test suite.
+     tokio::time::sleep(Duration::from_millis(0)).await;
+ }
```
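The "inject the delay" alternative mentioned in the comment could look like the sketch below: the pool owns a sleeper closure, so a test records the requested delay instead of waiting for it. All names are hypothetical; with Tokio, `tokio::time::pause()` on a current-thread test runtime is another way to avoid real 500ms sleeps:

```rust
use std::time::Duration;

// Illustrative pool that delegates inter-round delays to an injected sleeper.
struct Pool<S: FnMut(Duration)> {
    sleeper: S,
}

impl<S: FnMut(Duration)> Pool<S> {
    // Production code would pass a closure that actually sleeps;
    // tests pass one that just records the requested duration.
    fn between_rounds(&mut self) {
        (self.sleeper)(Duration::from_millis(500));
    }
}
```

This keeps the `#[cfg(test)]` branching out of the retry loop itself, at the cost of one extra generic parameter on the pool.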
Summary
Root cause
QEMU SLIRP has a hardcoded listen backlog of 1, brief nginx reloads during config updates, and Docker bridge churn during container restarts all cause transient TCP connection failures. With only 1 provider per model, these fail immediately with no recovery.
Top affected models (24h):
- openai/gpt-oss-120b: ~6,000 errors
- zai-org/GLM-5-FP8: ~2,600 errors
- Qwen/Qwen3.5-122B-A10B: ~580 errors

Reproduction steps
See `repro_connection_retry.sh` (gitignored) for the full reproduction script.

Test plan
- `cargo check` compiles cleanly
- Existing tests pass (`cargo test --lib --bins`)
- Verified the "Retrying after transient connection failure" log line and `round=2` in the success log

🤖 Generated with Claude Code