Intermittent failed to record rollout items: thread <uuid> not found during sub-agent shutdown / close, more likely with slower models

### What version of Codex CLI is running?

codex-cli 0.124.0

### What subscription do you have?

pro

### Which model were you using?

_No response_

### What platform is your computer?

_No response_

### What terminal emulator and version are you using (if applicable)?

_No response_

### What issue are you seeing?




  Codex intermittently logs an error like:

  ERROR codex_core::session: failed to record rollout items: thread 12345678-1234-1234-1234-123456789abc not found

  In practice, this appears during multi-agent workflows when a sub-agent is being closed or shut down, especially when using slower / weaker models. The
  model itself does not appear to be the root cause; it only seems to increase the probability of hitting the race.

  From code inspection, this looks like a lifecycle race between:

  1. removing a live thread from ThreadManager, and
  2. late async event / rollout persistence still being emitted by the session that was just shut down.

  Relevant call sites:

  - codex-rs/core/src/codex.rs:2892
    send_event_raw() persists rollout items before delivering the event.
  - codex-rs/core/src/codex.rs:3860
    persist_rollout_items() logs "failed to record rollout items: ...".
  - codex-rs/core/src/codex.rs:5630
    session shutdown drains and shuts down the rollout recorder.
  - codex-rs/core/src/agent/control.rs:661
    shutdown_live_agent() flushes rollout, sends Op::Shutdown, and then immediately removes the thread from ThreadManager.
  - codex-rs/core/src/agent/control.rs:674
    remove_thread(&agent_id) happens immediately after the shutdown op is sent.

  The suspicious sequence is:

  - agent session is still capable of emitting late terminal / legacy / compact-related events,
  - shutdown_live_agent() removes the thread from the live manager too early,
  - a late persistence / event path still references the old thread id,
  - some downstream app-server / thread lookup path returns thread not found,
  - core logs "failed to record rollout items".
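
  The sequence above can be modeled in a few lines. This is only a sketch under stated assumptions: `LiveThreads`, `record_item`, and the numeric thread id are hypothetical stand-ins, not the real ThreadManager or rollout recorder API. It shows how a persistence call that succeeds before removal fails with the same error shape once the thread has been dropped from the live table.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the live thread table (not the real ThreadManager).
#[derive(Default)]
pub struct LiveThreads {
    threads: Mutex<HashMap<u64, Vec<String>>>,
}

impl LiveThreads {
    pub fn insert(&self, id: u64) {
        self.threads.lock().unwrap().insert(id, Vec::new());
    }

    /// Drops the thread from the live table; returns whether it was present.
    pub fn remove(&self, id: u64) -> bool {
        self.threads.lock().unwrap().remove(&id).is_some()
    }

    /// Persists one item for `id`; fails once the thread has been removed.
    pub fn record_item(&self, id: u64, item: &str) -> Result<(), String> {
        let mut guard = self.threads.lock().unwrap();
        match guard.get_mut(&id) {
            Some(items) => {
                items.push(item.to_string());
                Ok(())
            }
            None => Err(format!("thread {id} not found")),
        }
    }
}

/// Replays the suspected ordering: remove first, then a late tail event arrives.
pub fn replay_suspected_sequence() -> Result<(), String> {
    let live = LiveThreads::default();
    live.insert(7);
    live.record_item(7, "streamed delta")?; // normal persistence succeeds
    live.remove(7); // shutdown path drops the thread early
    live.record_item(7, "late terminal event") // tail event now fails
}
```

  In this model the error is purely a function of ordering, which matches the hypothesis that the model choice only changes how often the window is hit.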

  This is consistent with the observed symptom that the issue is more reproducible with weaker/slower models: they increase the duration of streaming, tail
  events, and compaction/close overlap windows.

  What steps can reproduce the bug?

  I do not yet have a minimized deterministic repro, but the issue appears to be reproducible with the following pattern:

  ### High-level trigger conditions

  - multi-agent session
  - at least one spawned sub-agent
  - slower model for child agents
  - parent closes the child soon after completion or while tail events are still in flight
  - optional but likely to increase probability: long streaming responses, compaction, or final event fan-out

  ### Suggested repro workflow

  1. Start Codex with a multi-agent-capable workflow.
  2. Use a relatively slow / weaker model for spawned agents.
  3. Spawn one or more child agents that produce enough streamed output to keep the session active for a while.
  4. As soon as the child reaches a terminal state, or while it is close to finishing, call close_agent.
  5. Repeat several times in a loop.

  ### Pseudocode repro shape

  root agent
    -> spawn child agent using slower model
    -> child streams output for a while
    -> parent calls close_agent(child) quickly after completion / near completion
    -> occasionally observe:
       ERROR codex_core::session: failed to record rollout items: thread <uuid> not found

  ### Why this seems timing-sensitive

  shutdown_live_agent() currently does:

  1. ensure_rollout_materialized()
  2. flush_rollout()
  3. send_op(agent_id, Op::Shutdown {}) (skipped when the agent status is already Shutdown)
  4. remove_thread(&agent_id)

  That means the thread can disappear from the live ThreadManager before all late session activity has fully quiesced.

  ### More specific repro candidate for maintainers

  A robust integration test would likely need to simulate:

  - a child agent with delayed final event emission,
  - parent calling close_agent,
  - one or more late send_event_raw() / persist_rollout_items() calls after remove_thread().

  That should cover the suspected race window directly.
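
  One possible shape for forcing that interleaving deterministically is sketched below, using only the standard library. Everything here is hypothetical test scaffolding (a `HashSet` as the live thread table, an OS thread as the child session, a channel as the "thread removed" signal); a real test would drive the actual session and ThreadManager instead.

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Runs one forced interleaving and returns the outcome of the child's late
/// persistence attempt. All names here are hypothetical test scaffolding.
pub fn late_persist_after_remove() -> Result<(), String> {
    let agent_id: u64 = 1;
    // Stand-in for the live thread table.
    let live: Arc<Mutex<HashSet<u64>>> =
        Arc::new(Mutex::new(HashSet::from([agent_id])));

    // The parent uses this channel to tell the child "the thread is gone now".
    let (removed_tx, removed_rx) = mpsc::channel::<()>();

    let child_live = Arc::clone(&live);
    let child = thread::spawn(move || {
        // Delayed final event: block until the parent has removed the thread.
        removed_rx.recv().expect("parent dropped the channel");
        if child_live.lock().unwrap().contains(&agent_id) {
            Ok(())
        } else {
            Err(format!("thread {agent_id} not found"))
        }
    });

    // Parent side of the close: remove first, then let the tail event run.
    live.lock().unwrap().remove(&agent_id);
    removed_tx.send(()).expect("child exited early");

    child.join().expect("child panicked")
}
```

  Because the child blocks until the removal has happened, the failure is reproduced on every run rather than intermittently, which is the property a regression test needs.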

  What is the expected behavior?

  Closing or shutting down an agent should not produce any error related to missing thread state.

  Expected outcomes:

  - either all pending rollout items are safely persisted before the thread is removed, or
  - late events after shutdown are safely ignored / downgraded without surfacing an error.

  In particular, close_agent / shutdown should not leave the session in a state where late rollout persistence attempts fail with "thread not found".

  Additional information

  ## Root cause hypothesis

  This looks like a shutdown ordering bug rather than a model-specific bug.

  The key issue appears to be that the live thread is removed too early relative to the tail of async session activity.

  ### Evidence from code

  send_event_raw() always persists before delivering:

  pub(crate) async fn send_event_raw(&self, event: Event) {
      let rollout_items = vec![RolloutItem::EventMsg(event.msg.clone())];
      self.persist_rollout_items(&rollout_items).await;
      self.deliver_event_raw(event).await;
  }

  persist_rollout_items() logs any recorder failure:

  pub(crate) async fn persist_rollout_items(&self, items: &[RolloutItem]) {
      let recorder = {
          let guard = self.services.rollout.lock().await;
          guard.clone()
      };
      if let Some(rec) = recorder
          && let Err(e) = rec.record_items(items).await
      {
          error!("failed to record rollout items: {e:#}");
      }
  }

  shutdown_live_agent() removes the thread immediately after sending shutdown:

  pub(crate) async fn shutdown_live_agent(&self, agent_id: ThreadId) -> CodexResult<String> {
      let state = self.upgrade()?;
      let result = if let Ok(thread) = state.get_thread(agent_id).await {
          thread.codex.session.ensure_rollout_materialized().await;
          thread.codex.session.flush_rollout().await;
          if matches!(thread.agent_status().await, AgentStatus::Shutdown) {
              Ok(String::new())
          } else {
              state.send_op(agent_id, Op::Shutdown {}).await
          }
      } else {
          state.send_op(agent_id, Op::Shutdown {}).await
      };
      let _ = state.remove_thread(&agent_id).await;
      self.state.release_spawned_thread(agent_id);
      result
  }

  Session shutdown does drain the rollout recorder:

  let recorder_opt = {
      let mut guard = sess.services.rollout.lock().await;
      guard.take()
  };
  if let Some(rec) = recorder_opt
      && let Err(e) = rec.shutdown().await
  {
      warn!("failed to shutdown rollout recorder: {e}");
  }

  That is helpful, but it does not by itself guarantee that no late async task will still attempt to emit events or persist rollout items after the live
  thread has already been removed.

  ## Why slower / weaker models increase repro rate

  I do not think weaker models are the root cause.

  They likely make the bug easier to trigger because they tend to produce:

  - longer streaming windows,
  - more opportunities for parent/child overlap,
  - more chances to hit close/shutdown while tail events are still propagating,
  - more chances to overlap with compaction or terminal event fan-out.

  So the model choice appears to affect race probability, not correctness.

  ## Example log snippet

  Representative error:

  ERROR codex_core::session: failed to record rollout items: thread 12345678-1234-1234-1234-123456789abc not found

  Potential surrounding context in a real run would likely include agent close/shutdown activity shortly before the error.

  ## Proposed fix direction

  I think the safest fix is to delay live thread removal until the session is fully quiesced.

  ### Option A: move remove_thread() later

  In shutdown_live_agent(), do not call remove_thread(&agent_id) immediately after Op::Shutdown.

  Instead, remove the thread only after the session has definitively reached shutdown completion and no more rollout/event emission can occur.

  This seems like the cleanest behavioral fix.
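
  A minimal sketch of that ordering, assuming the session exposes some completion signal that can be awaited: here a `JoinHandle` on a plain OS thread plays that role, and a `HashSet` stands in for the live thread table. Both are illustrative assumptions, not the real Codex types.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Returns how many of `tail_events` late persistence attempts failed when the
/// thread is removed only after the session worker has fully finished.
pub fn close_after_quiesce(tail_events: usize) -> usize {
    let agent_id: u64 = 1;
    let live = Arc::new(Mutex::new(HashSet::from([agent_id])));
    let failures = Arc::new(Mutex::new(0usize));

    let (session_live, session_failures) =
        (Arc::clone(&live), Arc::clone(&failures));
    // Stand-in for the session task that keeps emitting tail events after
    // the shutdown op has been sent.
    let session = thread::spawn(move || {
        for _ in 0..tail_events {
            if !session_live.lock().unwrap().contains(&agent_id) {
                *session_failures.lock().unwrap() += 1;
            }
        }
    });

    // Option A ordering: wait for the session to finish, then remove the thread.
    session.join().expect("session worker panicked");
    live.lock().unwrap().remove(&agent_id);

    let count = *failures.lock().unwrap();
    count
}
```

  With the join placed before the removal, no tail event can observe a missing thread, so the failure count stays at zero regardless of how long the tail is.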

  ### Option B: add a post-shutdown event/persistence gate

  After shutdown begins, make send_event_raw() / persist_rollout_items() no-op or degrade gracefully if the session is closing/closed.

  That would prevent harmless late tail events from surfacing as errors.
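
  A gate of this kind could look roughly like the following. `GatedSession`, `begin_shutdown`, and `Persist` are hypothetical names introduced for illustration; the real session type may already track shutdown state in a different way.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Hypothetical outcome of a persistence attempt.
#[derive(Debug, PartialEq)]
pub enum Persist {
    Recorded,
    SkippedClosing,
}

/// Hypothetical session wrapper with a "closing" flag; not the real Session type.
#[derive(Default)]
pub struct GatedSession {
    closing: AtomicBool,
    recorded: Mutex<Vec<String>>,
}

impl GatedSession {
    /// Marks the session as closing; later persistence attempts become no-ops.
    pub fn begin_shutdown(&self) {
        self.closing.store(true, Ordering::SeqCst);
    }

    /// Records the item unless shutdown has begun.
    pub fn persist(&self, item: &str) -> Persist {
        if self.closing.load(Ordering::SeqCst) {
            return Persist::SkippedClosing;
        }
        self.recorded.lock().unwrap().push(item.to_string());
        Persist::Recorded
    }

    pub fn recorded_len(&self) -> usize {
        self.recorded.lock().unwrap().len()
    }
}
```

  Note that the flag check and the write are separate steps in this sketch, so a gate alone narrows the window rather than closing it, and silently skipping items is only acceptable if the skipped tail events really are safe to lose.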

  ### Option C: downgrade the specific tail case

  If a late persistence attempt occurs after thread shutdown and the only failure is "thread not found", treat it as expected during teardown and log at
  debug/warn instead of error.

  This would reduce noise, but it feels more like mitigation than root-cause resolution.
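
  As a rough illustration of the downgrade, the log level could be chosen from the error kind and the session state. The `RecordError` and `Level` enums below are assumptions made for this sketch; the actual recorder error type is not shown in this report and may not distinguish these cases.

```rust
/// Hypothetical error kinds for a failed persistence attempt.
#[derive(Debug, PartialEq)]
pub enum RecordError {
    ThreadNotFound(String),
    Io(String),
}

/// Hypothetical log levels, standing in for the logging macros in use.
#[derive(Debug, PartialEq)]
pub enum Level {
    Debug,
    Error,
}

/// Downgrades only the expected teardown case; everything else stays an error.
pub fn level_for(err: &RecordError, shutting_down: bool) -> Level {
    match err {
        RecordError::ThreadNotFound(_) if shutting_down => Level::Debug,
        _ => Level::Error,
    }
}
```

  Keeping the downgrade conditional on the shutdown state means a "thread not found" failure outside teardown would still surface as an error.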

  ## Suggested tests

  A regression test should cover at least one of these scenarios:

  1. spawned child agent emits delayed terminal events after parent calls close_agent
  2. child agent is closed while still streaming
  3. compaction or replacement-history persistence overlaps with child shutdown
  4. late legacy event emission after recorder shutdown / live-thread removal

  The invariant should be: no "thread not found" persistence error during normal agent close/shutdown.

  ## Environment

  Observed on:

  - repository: openai/codex
  - branch: main
  - commit: 23f4cd845

  If useful, I can also help turn this into a more deterministic test plan or a smaller integration repro.


### What steps can reproduce the bug?

See the reproduction steps in the issue description above.

### What is the expected behavior?

_No response_

### Additional information

_No response_

Intermittent failed to record rollout items: thread <uuid> not found during sub-agent shutdown / close, more likely with slower models #19532
