Intermittent failed to record rollout items: thread <uuid> not found during sub-agent shutdown / close, more likely with slower models #19532

@hac425xxx

What version of Codex CLI is running?

codex-cli 0.124.0

What subscription do you have?

pro

Which model were you using?

No response

What platform is your computer?

No response

What terminal emulator and version are you using (if applicable)?

No response

What issue are you seeing?

Codex intermittently logs an error like:

ERROR codex_core::session: failed to record rollout items: thread 12345678-1234-1234-1234-123456789abc not found

In practice, this appears during multi-agent workflows when a sub-agent is being closed or shut down, especially when using slower / weaker models. The
model itself does not appear to be the root cause; it only seems to increase the probability of hitting the race.

From code inspection, this looks like a lifecycle race between:

  1. removing a live thread from ThreadManager, and
  2. late async event / rollout persistence still being emitted by the session that was just shut down.

Relevant call sites:

  • codex-rs/core/src/codex.rs:2892
    send_event_raw() persists rollout items before delivering the event.
  • codex-rs/core/src/codex.rs:3860
    persist_rollout_items() logs "failed to record rollout items: ...".
  • codex-rs/core/src/codex.rs:5630
    session shutdown drains and shuts down the rollout recorder.
  • codex-rs/core/src/agent/control.rs:661
    shutdown_live_agent() flushes rollout, sends Op::Shutdown, and then immediately removes the thread from ThreadManager.
  • codex-rs/core/src/agent/control.rs:674
    remove_thread(&agent_id) happens immediately after the shutdown op is sent.

The suspicious sequence is:

  • the agent session can still emit late terminal / legacy / compaction-related events,
  • shutdown_live_agent() removes the thread from the live manager too early,
  • a late persistence / event path still references the old thread id,
  • some downstream app-server / thread lookup path returns "thread not found",
  • core logs "failed to record rollout items".

This is consistent with the observed symptom that the issue is more reproducible with weaker/slower models: they increase the duration of streaming, tail
events, and compaction/close overlap windows.
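
The suspected ordering can be replayed in a few lines. This is a minimal single-threaded sketch, not Codex code: Registry, its methods, and the numeric thread id are hypothetical stand-ins for ThreadManager and the rollout persistence path, and the only point is that a record attempt after removal yields the same kind of "not found" failure.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the live thread registry (not the real ThreadManager API).
struct Registry {
    threads: HashMap<u32, Vec<String>>, // thread id -> persisted rollout items
}

impl Registry {
    fn new() -> Self {
        Registry { threads: HashMap::new() }
    }

    fn spawn(&mut self, id: u32) {
        self.threads.insert(id, Vec::new());
    }

    fn remove(&mut self, id: u32) {
        self.threads.remove(&id);
    }

    // Persistence fails once the thread is no longer registered.
    fn record(&mut self, id: u32, item: &str) -> Result<(), String> {
        match self.threads.get_mut(&id) {
            Some(items) => {
                items.push(item.to_string());
                Ok(())
            }
            None => Err(format!("thread {id} not found")),
        }
    }
}

/// Replays the suspected ordering: the thread is removed first,
/// and only then does a tail event try to persist.
fn replay_suspected_sequence() -> Result<(), String> {
    let mut reg = Registry::new();
    reg.spawn(7);
    reg.record(7, "turn_started")?; // succeeds while the thread is live
    reg.remove(7); // parent closes the child
    reg.record(7, "turn_complete") // late tail event
}

fn main() {
    match replay_suspected_sequence() {
        Ok(()) => println!("tail event persisted"),
        Err(e) => println!("failed to record rollout items: {e}"),
    }
}
```

In the real system the two halves run on different tasks, so whether the tail event lands before or after removal depends on timing, which would match the intermittent nature of the error.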

What steps can reproduce the bug?

I do not yet have a minimized deterministic repro, but the issue appears to be reproducible with the following pattern:

High-level trigger conditions

  • multi-agent session
  • at least one spawned sub-agent
  • slower model for child agents
  • parent closes the child soon after completion or while tail events are still in flight
  • optional but likely to increase probability: long streaming responses, compaction, or final event fan-out

Suggested repro workflow

  1. Start Codex with a multi-agent-capable workflow.
  2. Use a relatively slow or weak model for spawned agents.
  3. Spawn one or more child agents that produce enough streamed output to keep the session active for a while.
  4. As soon as the child reaches a terminal state, or while it is close to finishing, call close_agent.
  5. Repeat several times in a loop.

Pseudocode repro shape

root agent
  -> spawn child agent using slower model
  -> child streams output for a while
  -> parent calls close_agent(child) quickly after completion / near completion
  -> occasionally observe:
     ERROR codex_core::session: failed to record rollout items: thread <uuid> not found

Why this seems timing-sensitive

shutdown_live_agent() currently does:

  1. ensure_rollout_materialized()
  2. flush_rollout()
  3. send_op(agent_id, Op::Shutdown {})
  4. remove_thread(&agent_id)

That means the thread can disappear from the live ThreadManager before all late session activity has fully quiesced.

More specific repro candidate for maintainers

A robust integration test would likely need to simulate:

  • a child agent with delayed final event emission,
  • parent calling close_agent,
  • one or more late send_event_raw() / persist_rollout_items() calls after remove_thread().

That should cover the suspected race window directly.
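
One possible shape for such a test, sketched with std threads rather than the real async runtime: a channel holds back the child's final event so the test decides which side of the race each run lands on. Everything here (Live, persist, run_once, the numeric id) is a hypothetical model and not part of the Codex test suite.

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the set of live thread ids; not the real ThreadManager.
type Live = Arc<Mutex<HashSet<u32>>>;

fn persist(live: &Live, id: u32) -> Result<(), String> {
    if live.lock().unwrap().contains(&id) {
        Ok(())
    } else {
        Err(format!("thread {id} not found"))
    }
}

/// Runs one child whose final event is held back until the parent releases it.
/// `remove_before_release` selects which side of the race the run lands on.
fn run_once(remove_before_release: bool) -> Result<(), String> {
    let live: Live = Arc::new(Mutex::new(HashSet::from([1])));
    let (release_tx, release_rx) = mpsc::channel::<()>();

    let child_live = Arc::clone(&live);
    let child = thread::spawn(move || {
        release_rx.recv().unwrap(); // delayed final event emission
        persist(&child_live, 1)
    });

    if remove_before_release {
        live.lock().unwrap().remove(&1); // parent closes the child first
        release_tx.send(()).unwrap();
        child.join().unwrap()
    } else {
        release_tx.send(()).unwrap();
        let result = child.join().unwrap(); // child finishes, then removal
        live.lock().unwrap().remove(&1);
        result
    }
}

fn main() {
    println!("remove first:  {:?}", run_once(true));
    println!("persist first: {:?}", run_once(false));
}
```

Gating on a channel instead of sleeping keeps the test deterministic; a real integration test would need an equivalent hook to delay the child's last event.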

What is the expected behavior?

Closing or shutting down an agent should not produce any error related to missing thread state.

Expected outcomes:

  • either all pending rollout items are safely persisted before the thread is removed, or
  • late events after shutdown are safely ignored / downgraded without surfacing an error.

In particular, close_agent / shutdown should not leave the session in a state where late rollout persistence attempts fail with "thread not found".

Additional information

Root cause hypothesis

This looks like a shutdown ordering bug rather than a model-specific bug.

The key issue appears to be that the live thread is removed too early relative to the tail of async session activity.

Evidence from code

send_event_raw() always persists before delivering:

pub(crate) async fn send_event_raw(&self, event: Event) {
    let rollout_items = vec![RolloutItem::EventMsg(event.msg.clone())];
    self.persist_rollout_items(&rollout_items).await;
    self.deliver_event_raw(event).await;
}

persist_rollout_items() logs any recorder failure:

pub(crate) async fn persist_rollout_items(&self, items: &[RolloutItem]) {
    let recorder = {
        let guard = self.services.rollout.lock().await;
        guard.clone()
    };
    if let Some(rec) = recorder
        && let Err(e) = rec.record_items(items).await
    {
        error!("failed to record rollout items: {e:#}");
    }
}

shutdown_live_agent() removes the thread immediately after sending shutdown:

pub(crate) async fn shutdown_live_agent(&self, agent_id: ThreadId) -> CodexResult<String> {
    let state = self.upgrade()?;
    let result = if let Ok(thread) = state.get_thread(agent_id).await {
        thread.codex.session.ensure_rollout_materialized().await;
        thread.codex.session.flush_rollout().await;
        if matches!(thread.agent_status().await, AgentStatus::Shutdown) {
            Ok(String::new())
        } else {
            state.send_op(agent_id, Op::Shutdown {}).await
        }
    } else {
        state.send_op(agent_id, Op::Shutdown {}).await
    };
    let _ = state.remove_thread(&agent_id).await;
    self.state.release_spawned_thread(agent_id);
    result
}

Session shutdown does drain the rollout recorder:

let recorder_opt = {
    let mut guard = sess.services.rollout.lock().await;
    guard.take()
};
if let Some(rec) = recorder_opt
    && let Err(e) = rec.shutdown().await
{
    warn!("failed to shutdown rollout recorder: {e}");
}

That is helpful, but it does not by itself guarantee that no late async task will still attempt to emit events or persist rollout items after the live
thread has already been removed.

Why slower / weaker models increase repro rate

I do not think weaker models are the root cause.

They likely make the bug easier to trigger because they tend to produce:

  • longer streaming windows,
  • more opportunities for parent/child overlap,
  • more chances to hit close/shutdown while tail events are still propagating,
  • more chances to overlap with compaction or terminal event fan-out.

So the model choice appears to affect race probability, not correctness.

Example log snippet

Representative error:

ERROR codex_core::session: failed to record rollout items: thread 12345678-1234-1234-1234-123456789abc not found

Potential surrounding context in a real run would likely include agent close/shutdown activity shortly before the error.

Proposed fix direction

I think the safest fix is to delay live thread removal until the session is fully quiesced.

Option A: move remove_thread() later

In shutdown_live_agent(), do not call remove_thread(&agent_id) immediately after Op::Shutdown.

Instead, remove the thread only after the session has definitively reached shutdown completion and no more rollout/event emission can occur.

This seems like the cleanest behavioral fix.
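
A rough model of the Option A ordering, using std threads and a channel in place of the async session loop. The names Agent, Op, Live, spawn_agent, and close_agent below are illustrative assumptions only; the real fix would need whatever completion signal the session actually exposes.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

// Hypothetical model; none of these types are the real Codex ones.
type Live = Arc<Mutex<HashSet<u32>>>;

enum Op {
    Event(&'static str),
    Shutdown,
}

struct Agent {
    ops: Sender<Op>,
    // The worker returns every persistence error the session hit.
    worker: JoinHandle<Vec<String>>,
}

fn spawn_agent(live: &Live, id: u32) -> Agent {
    live.lock().unwrap().insert(id);
    let (ops, rx) = mpsc::channel::<Op>();
    let live = Arc::clone(live);
    let worker = thread::spawn(move || {
        let mut errors = Vec::new();
        for op in rx {
            let (name, last) = match op {
                Op::Event(name) => (name, false),
                // Shutdown still emits one tail event before the session stops.
                Op::Shutdown => ("shutdown_complete", true),
            };
            if !live.lock().unwrap().contains(&id) {
                errors.push(format!("{name}: thread {id} not found"));
            }
            if last {
                break;
            }
        }
        errors
    });
    Agent { ops, worker }
}

/// Option A ordering: send shutdown, wait for the session to finish, then remove.
fn close_agent(live: &Live, id: u32, agent: Agent) -> Vec<String> {
    agent.ops.send(Op::Shutdown).unwrap();
    let errors = agent.worker.join().unwrap(); // session is quiesced past this point
    live.lock().unwrap().remove(&id);
    errors
}

fn demo() -> (Vec<String>, bool) {
    let live: Live = Arc::new(Mutex::new(HashSet::new()));
    let agent = spawn_agent(&live, 1);
    agent.ops.send(Op::Event("agent_message")).unwrap();
    let errors = close_agent(&live, 1, agent);
    let still_live = live.lock().unwrap().contains(&1);
    (errors, still_live)
}

fn main() {
    let (errors, still_live) = demo();
    println!("persistence errors: {errors:?}, still live: {still_live}");
}
```

Because removal happens only after the worker has returned, the tail event emitted during shutdown still finds the thread registered.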

Option B: add a post-shutdown event/persistence gate

After shutdown begins, make send_event_raw() / persist_rollout_items() no-op or degrade gracefully if the session is closing/closed.

That would prevent harmless late tail events from surfacing as errors.
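
A minimal sketch of such a gate, assuming a closing flag on the session. The Session struct, the flag names, and the Persist outcome enum are hypothetical; the report does not show whether the real session already tracks a comparable state.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical session model; the flags and method bodies are assumptions.
struct Session {
    closing: AtomicBool,
    thread_live: AtomicBool, // stands in for "thread still registered"
}

#[derive(Debug, PartialEq)]
enum Persist {
    Recorded(usize),
    SkippedClosing,
    Failed(String),
}

impl Session {
    fn new() -> Self {
        Session {
            closing: AtomicBool::new(false),
            thread_live: AtomicBool::new(true),
        }
    }

    fn begin_shutdown(&self) {
        self.closing.store(true, Ordering::Release);
    }

    fn remove_thread(&self) {
        self.thread_live.store(false, Ordering::Release);
    }

    /// Skips persistence once shutdown has begun instead of reporting an error.
    fn persist_rollout_items(&self, items: &[&str]) -> Persist {
        if self.closing.load(Ordering::Acquire) {
            return Persist::SkippedClosing;
        }
        if self.thread_live.load(Ordering::Acquire) {
            Persist::Recorded(items.len())
        } else {
            Persist::Failed("thread not found".to_string())
        }
    }
}

fn demo() -> [Persist; 2] {
    let session = Session::new();
    let before = session.persist_rollout_items(&["agent_message"]);
    session.begin_shutdown();
    session.remove_thread();
    let after = session.persist_rollout_items(&["turn_complete"]);
    [before, after]
}

fn main() {
    println!("{:?}", demo());
}
```

One caveat with this shape: the flag check and the write are separate steps, so a real implementation would probably need to set the flag before the final flush and accept that items arriving after it are dropped.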

Option C: downgrade the specific tail case

If a late persistence attempt occurs after thread shutdown and the only failure is "thread not found", treat it as expected during teardown and log at
debug/warn instead of error.

This would reduce noise, but it feels more like mitigation than root-cause resolution.
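
The downgrade rule itself is small. The sketch below assumes the recorder error can be matched structurally; RecordError, Level, and level_for are hypothetical, since the real error type is not shown in this report and may only be available as a formatted string.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
enum Level {
    Debug,
    Error,
}

// Hypothetical error type; the real recorder error is not shown in this report.
#[derive(Debug)]
enum RecordError {
    ThreadNotFound(String),
    Io(String),
}

impl fmt::Display for RecordError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RecordError::ThreadNotFound(id) => write!(f, "thread {id} not found"),
            RecordError::Io(msg) => write!(f, "io error: {msg}"),
        }
    }
}

/// Only the expected teardown case is downgraded; everything else stays an error.
fn level_for(err: &RecordError, shutting_down: bool) -> Level {
    match err {
        RecordError::ThreadNotFound(_) if shutting_down => Level::Debug,
        _ => Level::Error,
    }
}

fn main() {
    let cases = [
        (RecordError::ThreadNotFound("t1".to_string()), true),
        (RecordError::ThreadNotFound("t1".to_string()), false),
        (RecordError::Io("disk full".to_string()), true),
    ];
    for (err, shutting_down) in &cases {
        let level = level_for(err, *shutting_down);
        println!("{level:?}: failed to record rollout items: {err}");
    }
}
```

Keeping the same error outside of shutdown at error level matters, because a missing thread during normal operation would point to a different bug.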

Suggested tests

A regression test should cover at least one of these scenarios:

  1. spawned child agent emits delayed terminal events after parent calls close_agent
  2. child agent is closed while still streaming
  3. compaction or replacement-history persistence overlaps with child shutdown
  4. late legacy event emission after recorder shutdown / live-thread removal

The invariant should be: no "thread not found" persistence error during normal agent close/shutdown.

Environment

Observed on:

  • repository: openai/codex
  • branch: main
  • commit: 23f4cd8

If useful, I can also help turn this into a more deterministic test plan or a smaller integration repro.

Labels

  • bug: Something isn't working
  • session: Issues involving session (thread) management, resuming, forking, naming, archiving
  • subagent: Issues involving subagents or multi-agent features