Responses API swallows upstream HTTP errors (429) and returns 200 with status:failed #511

## Bug

The `/v1/responses` endpoint swallows upstream HTTP errors (including 429 rate limits) and returns HTTP 200 with `status: "failed"`, empty content, and 0 usage tokens. It should propagate the upstream error as the corresponding HTTP status code.

## Reproduction

```bash
# Chat completions correctly returns 429 (-w prints the HTTP status after the body):
curl -s -w '\nHTTP %{http_code}\n' 'https://cloud-api.near.ai/v1/chat/completions' \
  -H 'Authorization: Bearer <key>' \
  -H 'Content-Type: application/json' \
  -d '{"model":"google/gemini-3-pro","messages":[{"role":"user","content":"hi"}],"max_tokens":32}'
# → HTTP 429: {"error":{"message":"Rate limit exceeded..."}}

# Responses API swallows the 429:
curl -s -w '\nHTTP %{http_code}\n' 'https://cloud-api.near.ai/v1/responses' \
  -H 'Authorization: Bearer <key>' \
  -H 'Content-Type: application/json' \
  -d '{"model":"google/gemini-3-pro","input":"hi","max_output_tokens":32}'
# → HTTP 200: {"status":"failed","output":[{"content":[{"text":""}]}],"usage":{"total_tokens":0}}
```

## Impact

- Clients cannot detect rate limits and retry appropriately
- The `infra-tests` case `test_responses[gemini-3-pro]` fails persistently: `request_with_retry()` sees HTTP 200 and does not retry, whereas the equivalent chat completions test retries on 429 and eventually succeeds
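
To illustrate why the retry never fires, here is a minimal sketch that assumes the retry helper decides purely on the HTTP status code. `RetryDecision`, `decide_by_status`, and `decide_with_body_status` are hypothetical names for illustration only. They are not the actual `request_with_retry()` implementation, and the 429/5xx retry rule is an assumption.

```rust
/// Hypothetical retry decision; not the actual infra-tests helper.
#[derive(Debug, PartialEq)]
enum RetryDecision {
    Retry,
    Done,
}

/// Assumed behavior: retry only on 429 or 5xx HTTP status codes.
fn decide_by_status(http_status: u16) -> RetryDecision {
    if http_status == 429 || (500..600).contains(&http_status) {
        RetryDecision::Retry
    } else {
        RetryDecision::Done
    }
}

/// Possible client-side workaround: also treat an HTTP 200 whose body
/// reports `status: "failed"` as retryable.
fn decide_with_body_status(http_status: u16, body_status: Option<&str>) -> RetryDecision {
    if http_status == 200 && body_status == Some("failed") {
        return RetryDecision::Retry;
    }
    decide_by_status(http_status)
}
```

With the current behavior, `decide_by_status(200)` yields `Done` even though the upstream call was rate limited. The body-aware variant is only a stopgap: the failed body does not say whether the cause was a rate limit, so the client ends up retrying every failure.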

## Root Cause

Traced through the code:

1. **Gemini backend** (`inference_providers/src/external/gemini/mod.rs:216`): Returns `CompletionError::HttpError { status_code: 429 }` — correct
2. **Completion stream** (`services/src/responses/service.rs:696-703`): Catches the error, sets `stream_error = true`, breaks — no usage captured
3. **Service** (`services/src/responses/service.rs:1184-1232`): Emits `response.failed` event but never `response.completed` — no final response object
4. **Route handler fallback** (`api/src/routes/responses.rs:500-560`): No `final_response` from completed event → falls through to fallback with `Usage::new(0, 0)` hardcoded → returns HTTP 200

## Suggested Fix

When the completion stream errors with an HTTP error, propagate it as the corresponding HTTP status code from the `/v1/responses` endpoint, rather than wrapping it in `status: "failed"` with HTTP 200. At minimum, 429 rate limit errors should be propagated as HTTP 429 so clients can implement retry logic.
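
One way the mapping could look, as a sketch under stated assumptions rather than the actual patch: the `CompletionError` enum below is a simplified stand-in (only the `HttpError { status_code }` variant is named in this issue), and `response_status_for`, the `Other` variant, and the 502 fallback for non-HTTP errors are hypothetical choices.

```rust
/// Simplified stand-in for the provider error type. Only
/// `HttpError { status_code }` appears in this issue; `Other` is hypothetical.
#[derive(Debug)]
enum CompletionError {
    HttpError { status_code: u16 },
    Other(String),
}

/// Pick the HTTP status the `/v1/responses` route would return for a
/// completion stream error, instead of always answering 200.
fn response_status_for(err: &CompletionError) -> u16 {
    match err {
        // Propagate upstream 4xx/5xx codes (including 429) unchanged.
        CompletionError::HttpError { status_code } if (400..600).contains(status_code) => {
            *status_code
        }
        // Assumption: anything else is reported as 502 Bad Gateway.
        _ => 502,
    }
}
```

For example, `response_status_for(&CompletionError::HttpError { status_code: 429 })` evaluates to `429`. A mapping like this can only take effect where the handler has not yet sent response headers, so it fits the non-streaming path that currently falls through to the fallback.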
