What version of Codex CLI is running?
codex-cli 0.124.0
What subscription do you have?
pro
Which model were you using?
No response
What platform is your computer?
No response
What terminal emulator and version are you using (if applicable)?
No response
What issue are you seeing?
Codex intermittently logs an error like:
ERROR codex_core::session: failed to record rollout items: thread 12345678-1234-1234-1234-123456789abc not found
In practice, this appears during multi-agent workflows when a sub-agent is being closed or shut down, especially when using slower / weaker models. The model itself does not appear to be the root cause; it only seems to increase the probability of hitting the race.
From code inspection, this looks like a lifecycle race between:
- removing a live thread from ThreadManager, and
- late async event / rollout persistence still being emitted by the session that was just shut down.
Relevant call sites:
- codex-rs/core/src/codex.rs:2892
  send_event_raw() persists rollout items before delivering the event.
- codex-rs/core/src/codex.rs:3860
  persist_rollout_items() logs "failed to record rollout items: ...".
- codex-rs/core/src/codex.rs:5630
  session shutdown drains and shuts down the rollout recorder.
- codex-rs/core/src/agent/control.rs:661
  shutdown_live_agent() flushes rollout, sends Op::Shutdown, and then immediately removes the thread from ThreadManager.
- codex-rs/core/src/agent/control.rs:674
  remove_thread(&agent_id) happens immediately after the shutdown op is sent.
The suspicious sequence is:
- agent session is still capable of emitting late terminal / legacy / compact-related events,
- shutdown_live_agent() removes the thread from the live manager too early,
- a late persistence / event path still references the old thread id,
- some downstream app-server / thread lookup path returns thread not found,
- core logs failed to record rollout items.
This is consistent with the observed symptom that the issue is more reproducible with weaker/slower models: they increase the duration of streaming, tail events, and compaction/close overlap windows.
What steps can reproduce the bug?
I do not yet have a minimized deterministic repro, but the issue appears to be reproducible with the following pattern:
High-level trigger conditions
- multi-agent session
- at least one spawned sub-agent
- slower model for child agents
- parent closes the child soon after completion or while tail events are still in flight
- optional but likely to increase probability: long streaming responses, compaction, or final event fan-out
Suggested repro workflow
- Start Codex with a multi-agent-capable workflow.
- Use a relatively slow / weaker model for spawned agents.
- Spawn one or more child agents that produce enough streamed output to keep the session active for a while.
- As soon as the child reaches a terminal state, or while it is close to finishing, call close_agent.
- Repeat several times in a loop.
Pseudocode repro shape
root agent
  -> spawn child agent using slower model
  -> child streams output for a while
  -> parent calls close_agent(child) quickly after completion / near completion
  -> occasionally observe:
     ERROR codex_core::session: failed to record rollout items: thread not found
Why this seems timing-sensitive
shutdown_live_agent() currently does:
- ensure_rollout_materialized()
- flush_rollout()
- send_op(agent_id, Op::Shutdown {})
- remove_thread(&agent_id)
That means the thread can disappear from the live ThreadManager before all late session activity has fully quiesced.
More specific repro candidate for maintainers
A robust integration test would likely need to simulate:
- a child agent with delayed final event emission,
- parent calling close_agent,
- one or more late send_event_raw() / persist_rollout_items() calls after remove_thread().
That should cover the suspected race window directly.
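The ordering such a test needs to exercise can be modelled without any Codex types. The following is only a minimal, std-only sketch of the suspected race: `LiveThreads`, `ChildSession`, `close_child`, and the event names are hypothetical stand-ins, not the real ThreadManager or session API, and it assumes the persistence path resolves the thread by id at write time.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical stand-in for the live-thread registry (not the real ThreadManager).
#[derive(Default)]
struct LiveThreads {
    threads: HashMap<u32, Vec<String>>, // thread id -> persisted rollout items
}

impl LiveThreads {
    // Assumption: the persistence path looks the thread up by id when it writes.
    fn record(&mut self, id: u32, item: &str) -> Result<(), String> {
        match self.threads.get_mut(&id) {
            Some(items) => {
                items.push(item.to_string());
                Ok(())
            }
            None => Err(format!("thread {id} not found")),
        }
    }
}

// Hypothetical child session that still holds unpersisted tail events.
struct ChildSession {
    id: u32,
    pending_tail: VecDeque<String>,
}

impl ChildSession {
    // Persist every pending tail event and collect the failures.
    fn drain_tail(&mut self, live: &mut LiveThreads) -> Vec<String> {
        let mut errors = Vec::new();
        while let Some(item) = self.pending_tail.pop_front() {
            if let Err(e) = live.record(self.id, &item) {
                errors.push(e);
            }
        }
        errors
    }
}

// Runs one close. `remove_first = true` models the suspected buggy ordering
// (thread removed before the tail is drained); `false` drains first.
fn close_child(remove_first: bool) -> Vec<String> {
    let mut live = LiveThreads::default();
    live.threads.insert(7, Vec::new());
    let mut child = ChildSession {
        id: 7,
        // Illustrative event names only.
        pending_tail: VecDeque::from(vec![
            "task_complete".to_string(),
            "shutdown_complete".to_string(),
        ]),
    };
    if remove_first {
        live.threads.remove(&child.id);
        child.drain_tail(&mut live)
    } else {
        let errors = child.drain_tail(&mut live);
        live.threads.remove(&child.id);
        errors
    }
}
```

In this model, `close_child(true)` returns one "thread 7 not found" error per tail event, while `close_child(false)` returns none. A real integration test would need to inject the delay into the child session rather than reorder calls by hand.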
What is the expected behavior?
Closing or shutting down an agent should not produce any error related to missing thread state.
Expected outcomes:
- either all pending rollout items are safely persisted before the thread is removed, or
- late events after shutdown are safely ignored / downgraded without surfacing an error.
In particular, close_agent / shutdown should not leave the session in a state where late rollout persistence attempts fail with thread not found.
Additional information
Root cause hypothesis
This looks like a shutdown ordering bug rather than a model-specific bug.
The key issue appears to be that the live thread is removed too early relative to the tail of async session activity.
Evidence from code
send_event_raw() always persists before delivering:
pub(crate) async fn send_event_raw(&self, event: Event) {
    let rollout_items = vec![RolloutItem::EventMsg(event.msg.clone())];
    self.persist_rollout_items(&rollout_items).await;
    self.deliver_event_raw(event).await;
}
persist_rollout_items() logs any recorder failure:
pub(crate) async fn persist_rollout_items(&self, items: &[RolloutItem]) {
    let recorder = {
        let guard = self.services.rollout.lock().await;
        guard.clone()
    };
    if let Some(rec) = recorder
        && let Err(e) = rec.record_items(items).await
    {
        error!("failed to record rollout items: {e:#}");
    }
}
shutdown_live_agent() removes the thread immediately after sending shutdown:
pub(crate) async fn shutdown_live_agent(&self, agent_id: ThreadId) -> CodexResult<String> {
    let state = self.upgrade()?;
    let result = if let Ok(thread) = state.get_thread(agent_id).await {
        thread.codex.session.ensure_rollout_materialized().await;
        thread.codex.session.flush_rollout().await;
        if matches!(thread.agent_status().await, AgentStatus::Shutdown) {
            Ok(String::new())
        } else {
            state.send_op(agent_id, Op::Shutdown {}).await
        }
    } else {
        state.send_op(agent_id, Op::Shutdown {}).await
    };
    // The thread is removed right after Op::Shutdown is sent, without waiting for the session to quiesce.
    let _ = state.remove_thread(&agent_id).await;
    self.state.release_spawned_thread(agent_id);
    result
}
Session shutdown does drain the rollout recorder:
let recorder_opt = {
    let mut guard = sess.services.rollout.lock().await;
    guard.take()
};
if let Some(rec) = recorder_opt
    && let Err(e) = rec.shutdown().await
{
    warn!("failed to shutdown rollout recorder: {e}");
}
That is helpful, but it does not by itself guarantee that no late async task will still attempt to emit events or persist rollout items after the live thread has already been removed.
Why slower / weaker models increase repro rate
I do not think weaker models are the root cause.
They likely make the bug easier to trigger because they tend to produce:
- longer streaming windows,
- more opportunities for parent/child overlap,
- more chances to hit close/shutdown while tail events are still propagating,
- more chances to overlap with compaction or terminal event fan-out.
So the model choice appears to affect race probability, not correctness.
Example log snippet
Representative error:
ERROR codex_core::session: failed to record rollout items: thread 12345678-1234-1234-1234-123456789abc not found
Potential surrounding context in a real run would likely include agent close/shutdown activity shortly before the error.
Proposed fix direction
I think the safest fix is to delay live thread removal until the session is fully quiesced.
Option A: move remove_thread() later
In shutdown_live_agent(), do not call remove_thread(&agent_id) immediately after Op::Shutdown.
Instead, remove the thread only after the session has definitively reached shutdown completion and no more rollout/event emission can occur.
This seems like the cleanest behavioral fix.
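As a rough illustration of Option A, here is a std-only sketch that waits for an explicit completion signal before removing the thread. Everything in it is an assumption made for illustration: `Registry`, `spawn_session`, and `shutdown_then_remove` are hypothetical names, and OS threads plus an `mpsc` channel stand in for the real async session and whatever shutdown-complete notification it exposes.

```rust
use std::collections::HashSet;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical registry of live thread ids; a stand-in for ThreadManager.
type Registry = Arc<Mutex<HashSet<u32>>>;

// Spawns a fake session that emits `tail_events` late events. Each event checks
// that its thread is still registered; the count of misses is sent on `done`
// only after the last tail event, which models "shutdown complete".
fn spawn_session(
    id: u32,
    tail_events: usize,
    live: Registry,
    done: mpsc::Sender<usize>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut failures = 0;
        for _ in 0..tail_events {
            thread::sleep(Duration::from_millis(5)); // simulate a slow tail
            if !live.lock().unwrap().contains(&id) {
                failures += 1; // this would surface as "thread not found"
            }
        }
        let _ = done.send(failures);
    })
}

// Option A ordering: wait for the completion signal, then remove the thread.
// Returns how many tail events ran after the thread was already gone.
fn shutdown_then_remove(id: u32, tail_events: usize) -> usize {
    let live: Registry = Arc::new(Mutex::new(HashSet::from([id])));
    let (tx, rx) = mpsc::channel();
    let handle = spawn_session(id, tail_events, Arc::clone(&live), tx);

    // Bounded wait so a stuck session cannot block the close forever.
    let failures = rx
        .recv_timeout(Duration::from_secs(5))
        .expect("session did not quiesce in time");

    live.lock().unwrap().remove(&id);
    handle.join().unwrap();
    failures
}
```

Because removal happens only after the session reports that its tail is done, no tail event can observe a missing thread in this model. The bounded wait is a design choice of the sketch; how the real code should behave on timeout is left open.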
Option B: add a post-shutdown event/persistence gate
After shutdown begins, make send_event_raw() / persist_rollout_items() no-op or degrade gracefully if the session is closing/closed.
That would prevent harmless late tail events from surfacing as errors.
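One possible shape for such a gate is sketched below, assuming a per-session closing flag that is set before the thread is removed. `SessionGate`, `PersistOutcome`, and their methods are hypothetical names for illustration only and do not exist in the Codex codebase.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical per-session gate; field and method names are illustrative only.
struct SessionGate {
    closing: AtomicBool,
}

#[derive(Debug, PartialEq)]
enum PersistOutcome {
    Recorded,
    SkippedClosing,
}

impl SessionGate {
    fn new() -> Self {
        Self { closing: AtomicBool::new(false) }
    }

    // Called at the start of shutdown, before the live thread is removed.
    fn begin_shutdown(&self) {
        self.closing.store(true, Ordering::Release);
    }

    // Sits in front of the persistence call: once the session is closing,
    // late items are skipped instead of being handed to the recorder.
    fn persist<F>(&self, record: F) -> Result<PersistOutcome, String>
    where
        F: FnOnce() -> Result<(), String>,
    {
        if self.closing.load(Ordering::Acquire) {
            return Ok(PersistOutcome::SkippedClosing);
        }
        record().map(|_| PersistOutcome::Recorded)
    }
}
```

Two caveats apply to this sketch. A check-then-act gate still leaves a small window between reading the flag and calling the recorder, so it narrows the race rather than eliminating it. It also drops late items silently, which is only acceptable if those tail events are safe to lose.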
Option C: downgrade the specific tail case
If a late persistence attempt occurs after thread shutdown and the only failure is thread not found, treat it as expected during teardown and log at debug/warn instead of error.
This would reduce noise, but it feels more like mitigation than root-cause resolution.
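The downgrade rule in Option C could look roughly like the sketch below. It assumes the recorder error can be matched structurally and that the caller knows whether the session is tearing down; `RecordError`, `Severity`, and `severity_for` are hypothetical and do not reflect the real error type, which may only expose a formatted message.

```rust
// Hypothetical error shape; the real recorder error type may differ.
#[derive(Debug)]
enum RecordError {
    ThreadNotFound,
    Other,
}

#[derive(Debug, PartialEq)]
enum Severity {
    Debug,
    Error,
}

// Downgrade only the expected teardown case; everything else stays an error.
fn severity_for(err: &RecordError, tearing_down: bool) -> Severity {
    match err {
        RecordError::ThreadNotFound if tearing_down => Severity::Debug,
        _ => Severity::Error,
    }
}
```

Keeping "thread not found" at error level outside teardown matters here: the same failure during a live session would still indicate a real problem.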
Suggested tests
A regression test should cover at least one of these scenarios:
- spawned child agent emits delayed terminal events after parent calls close_agent
- child agent is closed while still streaming
- compaction or replacement-history persistence overlaps with child shutdown
- late legacy event emission after recorder shutdown / live-thread removal
The invariant should be: no "thread not found" persistence error during normal agent close/shutdown.
Environment
Observed on:
- repository: openai/codex
- branch: main
- commit: 23f4cd8
If useful, I can also help turn this into a more deterministic test plan or a smaller integration repro.