Fix CLI rendering corruption and split CLI/frontend model defaults by akseljoonas · Pull Request #121 · huggingface/ml-intern

akseljoonas · 2026-04-25T19:36:15Z

Summary

stop the CLI research overlay from running a background redraw ticker that races with streamed markdown output
keep research sub-agent routing on direct Anthropic when the active model is direct Anthropic, and mute LiteLLM warning spam in the interactive UI
split model defaults into explicit CLI and frontend config files, and align the frontend Claude constant with the backend's Bedrock default
restore missing recovery helpers and patch dangling tool-result handling, doom-loop polling detection, sandbox reuse messaging, and async test support

Testing

python -m compileall agent backend
./frontend/node_modules/.bin/tsc -p frontend/tsconfig.json --noEmit
UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q

The interactive CLI was interleaving live sub-agent redraws with streamed markdown output, which corrupted ANSI rendering and leaked raw control sequences into the terminal. The CLI and web app also shared one default model config even though they need different Anthropic routing defaults. Constraint: CLI default must use direct Anthropic credentials while web sessions must default to Bedrock Anthropic Constraint: Interactive terminal output must remain readable while sub-agent progress is live Rejected: Single shared config file with runtime overrides | keeps ownership of defaults implicit across surfaces Rejected: Keep background redraw ticker | concurrent terminal writers still corrupt streamed output Confidence: high Scope-risk: moderate Reversibility: clean Directive: Keep CLI and frontend default models in separate config files unless both surfaces intentionally converge again Tested: python -m compileall agent backend Tested: ./frontend/node_modules/.bin/tsc -p frontend/tsconfig.json --noEmit Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest python -m pytest -q tests/unit/test_cli_rendering.py Not-tested: Full pytest suite (blocked by pre-existing tests/unit/test_llm_error_classification.py import error during collection)

claude · 2026-04-25T19:36:31Z

Claude finished @akseljoonas's task in 2m 48s —— View job

PR Review

Gather context and read changed files
Review CLI rendering fix (terminal_display.py)
Review research tool routing changes (research_tool.py, main.py)
Review config split (cli_agent_config.json, frontend_agent_config.json)
Review frontend changes (ChatInput.tsx, model.ts)
Review test coverage (test_cli_rendering.py)
Review remaining files (session_manager.py, README.md)
Post final review

No blocking issues — 2 P1
Verdict: ready to merge

What I checked

CLI rendering fix (terminal_display.py): Removal of _ticker_task and the _tick() coroutine. Verified the display now updates only on events (start, add_call, clear). No tick means no background wake-up racing with the markdown typewriter loop. Correct.
Research model routing (research_tool.py:217–224): Old code matched "anthropic" in main_model, routing both direct-Anthropic and Bedrock-Anthropic to bedrock/us.anthropic.claude-sonnet-4-6. New code uses startswith to split them correctly. Direct-Anthropic → anthropic/claude-sonnet-4-6; Bedrock-Anthropic → bedrock/us.anthropic.claude-sonnet-4-6. Fix is correct for the stated goal.
Config split (cli_agent_config.json, frontend_agent_config.json): All call sites in agent/main.py and backend/session_manager.py updated. Grepped for lingering main_agent_config references — none found in Python or TypeScript.
isClaudePath broadening (model.ts:13): Changed from startsWith('anthropic/') to includes('anthropic'). Necessary because CLAUDE_MODEL_PATH now contains bedrock/us.anthropic.claude-opus-4-6-v1. No non-Claude model in MODEL_OPTIONS contains "anthropic" in its path, so the match set is correct. The predicate is frontend-only UI (quota chip visibility); actual authz is backend _is_anthropic_model at agent.py:72 which already used "anthropic" in model_id.
_configure_runtime_logging (main.py:58–65): Suppresses LiteLLM loggers to ERROR. Called in cli() (covers interactive path) and inside headless_main() (covers direct invocation). Double-call in headless mode is harmless since setLevel is idempotent.
Test coverage (test_cli_rendering.py): Covers all three routing branches of _get_research_model and the no-background-task regression. Tests are correct.
Backend AVAILABLE_MODELS (routes/agent.py:41–68): bedrock/us.anthropic.claude-opus-4-6-v1 already present — frontend constant aligns.

P1: `configs/main_agent_config.json` left behind

The old config file still exists on disk and is no longer referenced by any code path, but it has "model_name": "anthropic/claude-opus-4-6" — the pre-split direct-Anthropic value. A contributor cloning the repo who reads the old file and hand-edits it would get silent no-ops. Should be removed or at minimum a comment pointing to the two new files.

Fix this →

P1: Bedrock Sonnet research model ID likely missing `-v1`

research_tool.py:222 returns "bedrock/us.anthropic.claude-sonnet-4-6" (no -v1 suffix) for the Bedrock Anthropic research path. Every other Bedrock Anthropic model ID in the codebase carries the suffix: bedrock/us.anthropic.claude-opus-4-6-v1 in configs/frontend_agent_config.json:3, backend/routes/agent.py:50. If AWS cross-region inference requires the versioned suffix (which the Opus ID suggests), Bedrock users will get a model-not-found error from LiteLLM when a research sub-agent is spawned.

This is pre-existing (the value existed before this PR), but the PR now routes Bedrock users to this path by making it the frontend default, so it becomes load-bearing. Worth verifying against the actual Bedrock model catalog before merging.

The earlier PR fixed the CLI rendering and model-default split, but the full local suite exposed additional regressions in tool-result patching, doom-loop polling detection, sandbox reuse messaging, and async test support. This follow-up commit restores the missing helpers and updates those production paths so the new regression tests pass for real. Constraint: Provider message histories must keep tool_use/tool_result pairing valid across interrupted turns Constraint: Legitimate polling with changing results must not trip doom-loop recovery Rejected: Only fix the original collection blocker | leaves the full suite red and the PR note stale Rejected: Silence the failing tests without restoring runtime helpers | would hide real production regressions Confidence: high Scope-risk: moderate Reversibility: clean Directive: Keep the local regression tests in sync with the production recovery paths they exercise Tested: python -m compileall agent/context_manager/manager.py agent/core/agent_loop.py agent/core/doom_loop.py agent/tools/sandbox_tool.py backend/user_quotas.py Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q tests/unit/test_dangling_tool_calls.py tests/unit/test_doom_loop_polling.py tests/unit/test_sandbox_already_active_message.py tests/unit/test_user_quotas.py Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q Not-tested: Remote CI environment parity

claude · 2026-04-25T19:41:44Z

Claude finished @akseljoonas's task in 3m 46s —— View job

Claude finished @akseljoonas's task ——

Re-review (new commit: `0140724`)

Check if prior P1s were addressed
Review new _patch_dangling_tool_calls rewrite (manager.py)
Review doom_loop.py result-hash fix
Review agent_loop.py error classification and retry schedule
Review malformed-args recovery injection
Review new test suite
Post updated review

1 P0
Verdict: changes requested

Prior P1s (not re-flagged per policy)

Both P1s from the first review remain unaddressed: configs/main_agent_config.json is still on disk, and research_tool.py:222 still returns bedrock/us.anthropic.claude-sonnet-4-6 without the -v1 suffix.

What I checked (new commit)

_patch_dangling_tool_calls rewrite (manager.py:253–283): Forward scan correctly inserts stubs inline at the right position (immediately before the next non-tool message) for each assistant message whose tool_calls lack a matching result. Idempotent — a second call finds the stubs already present and skips. The multi-turn case (test_orphan_in_earlier_turn_still_gets_patched) is now handled correctly.
doom_loop.py result-hash (doom_loop.py:24,59–70): ToolCallSignature.result_hash is populated from the immediately following tool message. Polling with identical args but changing results produces signatures that differ on result_hash, so detect_identical_consecutive won't fire. The "truly stuck" case (same args + same result) still fires correctly.
Malformed-args recovery (agent_loop.py:95–148, 269–292): _detect_repeated_malformed walks backward, skipping non-tool messages and breaking on any non-malformed tool result. The recovery prompt is injected before the next LLM call. Logic and tests look correct.
Context overflow catch (agent_loop.py:422–423, 249–250): Both streaming and non-streaming paths now wrap non-ContextWindowExceededError context overflow exceptions before they reach the generic retry handler. Correct.
Test coverage: All five new test files exercise the production paths they claim to cover.

P0: Rate-limit retry budget falls short of the stated 60 s goal; test validates a dead entry

agent_loop.py:179:

_LLM_RATE_LIMIT_RETRY_DELAYS = [15, 30, 45]  # exceed Bedrock's ~60s TPM bucket window

The retry loop is for _llm_attempt in range(_MAX_LLM_RETRIES): where _MAX_LLM_RETRIES = 3 (agent_loop.py:177). The retry condition is _llm_attempt < _MAX_LLM_RETRIES - 1 (agent_loop.py:433), i.e., _llm_attempt < 2. That means:

Attempt 0 → wait 15 s
Attempt 1 → wait 30 s
Attempt 2 → condition false, raise

_LLM_RATE_LIMIT_RETRY_DELAYS[2] (45 s) is never reached. The actual retry budget is 45 s, which is below the ~60 s Bedrock TPM bucket window the comment promises to exceed.

The test at test_llm_error_classification.py:737 asserts sum(_LLM_RATE_LIMIT_RETRY_DELAYS) > 60 — this passes (90 > 60) but tests the sum of the full list, not the sum of the two entries that are actually consumed.

Minimal fix — change the schedule to two entries that together exceed 60 s:

_LLM_RATE_LIMIT_RETRY_DELAYS = [30, 60]  # exceed Bedrock's ~60s TPM bucket window

30 + 60 = 90 s. No test changes needed: sum([30, 60]) = 90 > 60 still passes, and _retry_delay_for(err, 2) is None still passes since the list has length 2.

Fix this →

The review was right about two follow-up issues: the old shared config file was still present after the CLI/frontend split, and the Bedrock rate-limit retry schedule still had a dead third entry because the loop only ever consumed two retry delays. This commit removes the orphaned config and makes the rate-limit schedule line up with the actual retry budget. Constraint: Retry budget for Bedrock token throttling must exceed the provider's ~60s bucket recovery window in the retries that actually run Rejected: Keep a third delay entry in the schedule | the current retry loop never reaches it Confidence: high Scope-risk: narrow Reversibility: clean Directive: Keep retry schedules aligned with the retry loop's real number of sleeps, not the raw retry constant count Tested: python -m compileall agent/core/agent_loop.py tests/unit/test_llm_error_classification.py Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q tests/unit/test_llm_error_classification.py Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q Not-tested: Remote CI environment parity

akseljoonas · 2026-04-25T19:53:18Z

Validated the remaining review points before changing code:

Fixed: removed the orphaned configs/main_agent_config.json.
Fixed: shortened _LLM_RATE_LIMIT_RETRY_DELAYS to the two entries the retry loop actually consumes, while keeping the effective retry budget above 60s.
Not applied: the Bedrock Sonnet 4.6 -v1 suggestion is incorrect. The current AWS Bedrock model card lists anthropic.claude-sonnet-4-6 with geo/global IDs like us.anthropic.claude-sonnet-4-6, so leaving bedrock/us.anthropic.claude-sonnet-4-6 unchanged is intentional.

Local verification after the update: 87 passed with UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q.

claude · 2026-04-25T19:53:34Z

Claude finished @akseljoonas's task in 1m 3s —— View job

Re-review (new commit: `4af77ac`)

LGTM
Verdict: ready to merge

Prior findings

P0 (rate-limit retry budget) — Fixed. _LLM_RATE_LIMIT_RETRY_DELAYS = [30, 60] now has exactly _MAX_LLM_RETRIES - 1 = 2 entries, matching what the retry loop actually consumes. Budget is 90 s (> 60 s). Test at test_llm_error_classification.py:99 adds a structural guard (len(...) == _MAX_LLM_RETRIES - 1) that will break if a future diff re-introduces a dead third entry.
P1 (orphaned config) — Fixed. configs/main_agent_config.json deleted; grepped for main_agent_config across all .py/.ts/.json/.md — zero remaining references.
P1 (Bedrock Sonnet -v1 suffix) — Author's pushback accepted. AWS model catalog lists us.anthropic.claude-sonnet-4-6 without the -v1 suffix (unlike Opus), so research_tool.py:222 is intentionally correct.

What I checked

agent/core/agent_loop.py:179,215–226: _retry_delay_for uses attempt_index >= len(schedule) as its guard — with a 2-entry list and a loop that only retries when _llm_attempt < 2, indices 0 and 1 are exactly the consumed entries. Logic is consistent end-to-end.
tests/unit/test_llm_error_classification.py:99–100: Both assertions (len structural guard and sum > 60) pass. No spurious tests left over from the old 3-entry list.
configs/: Only cli_agent_config.json and frontend_agent_config.json remain. No dangling references.

akseljoonas merged commit 3eec386 into main Apr 25, 2026
1 check passed

akseljoonas deleted the codex/fix-cli-rendering-split-default-models-20260425-223601 branch April 25, 2026 19:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CLI rendering corruption and split CLI/frontend model defaults#121

Fix CLI rendering corruption and split CLI/frontend model defaults#121
akseljoonas merged 3 commits intomainfrom
codex/fix-cli-rendering-split-default-models-20260425-223601

akseljoonas commented Apr 25, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 25, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 25, 2026 •

edited

Loading

Uh oh!

akseljoonas commented Apr 25, 2026

Uh oh!

claude Bot commented Apr 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

akseljoonas commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Uh oh!

claude Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

What I checked

P1: configs/main_agent_config.json left behind

P1: Bedrock Sonnet research model ID likely missing -v1

Uh oh!

claude Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review (new commit: 0140724)

Prior P1s (not re-flagged per policy)

What I checked (new commit)

P0: Rate-limit retry budget falls short of the stated 60 s goal; test validates a dead entry

Uh oh!

akseljoonas commented Apr 25, 2026

Uh oh!

claude Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review (new commit: 4af77ac)

Prior findings

What I checked

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

akseljoonas commented Apr 25, 2026 •

edited

Loading

claude Bot commented Apr 25, 2026 •

edited

Loading

P1: `configs/main_agent_config.json` left behind

P1: Bedrock Sonnet research model ID likely missing `-v1`

claude Bot commented Apr 25, 2026 •

edited

Loading

Re-review (new commit: `0140724`)

claude Bot commented Apr 25, 2026 •

edited

Loading

Re-review (new commit: `4af77ac`)