
feat: add agent residency reload runtime #26611

Open
jif-oai wants to merge 6 commits into jif/agent-runtime-slots from jif/agent-residency-runtime

Conversation

jif-oai (Collaborator) commented Jun 5, 2026

Summary

Adds the V2 residency runtime on top of execution-slot accounting. Agents can remain logically known while idle agents are unloaded from memory, then reloaded when a message, follow-up, completion notification, or path lookup needs them again.

This keeps residency transparent to V2 agent tools while preserving the separate active-slot cap introduced in the previous PR.
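
The residency behavior described above can be sketched roughly as follows. This is a minimal illustration, not code from this PR: `ResidentSet`, `Touch`, the string agent ids, and the soft capacity are all hypothetical. It assumes that idle agents are evicted least-recently-used first, that busy agents are never evicted, and that touching an unloaded agent reloads it.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical outcome of touching an agent; not a type from the PR.
#[derive(Debug, PartialEq)]
pub enum Touch {
    /// The agent was already loaded; it only moved to the most-recent position.
    AlreadyResident,
    /// The agent had to be (re)loaded; `evicted` names the idle agent unloaded to make room.
    Reloaded { evicted: Option<String> },
}

/// Illustrative LRU of loaded agents. Known-but-unloaded agents are simply absent here.
pub struct ResidentSet {
    capacity: usize,
    /// Front = least recently used, back = most recently used.
    order: VecDeque<String>,
    /// Agents with a running turn; these are never evicted.
    busy: HashSet<String>,
}

impl ResidentSet {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, order: VecDeque::new(), busy: HashSet::new() }
    }

    pub fn set_busy(&mut self, id: &str, busy: bool) {
        if busy {
            self.busy.insert(id.to_string());
        } else {
            self.busy.remove(id);
        }
    }

    pub fn is_resident(&self, id: &str) -> bool {
        self.order.iter().any(|x| x == id)
    }

    /// Mark `id` as used, reloading it (and evicting one idle agent) if needed.
    pub fn touch(&mut self, id: &str) -> Touch {
        if let Some(pos) = self.order.iter().position(|x| x == id) {
            let entry = self.order.remove(pos).expect("position is in range");
            self.order.push_back(entry);
            return Touch::AlreadyResident;
        }
        let mut evicted = None;
        if self.order.len() >= self.capacity {
            // Evict the least recently used idle agent. If every resident agent
            // is busy, nothing is evicted and the cap is exceeded (soft cap).
            if let Some(pos) = self.order.iter().position(|x| !self.busy.contains(x)) {
                evicted = self.order.remove(pos);
            }
        }
        self.order.push_back(id.to_string());
        Touch::Reloaded { evicted }
    }
}
```

In this sketch a delivery helper would call `touch` before enqueuing a message, which is one way the reload can stay invisible to the tool layer.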

Changes

  • Add resident-agent LRU tracking, eviction, targeted reload, and unload cleanup.
  • Route V2 send_message, followup_task, and completion delivery through reload-aware helpers.
  • Resolve agents by durable canonical path, including unloaded open descendants.
  • Reject path collisions against open durable descendants.
  • Fix initial V2 spawn residency/scheduling and stale dead-thread cleanup.
  • Normalize V2 reload/source behavior so edge-derived topology wins and V2 nicknames are not exposed.
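
As a rough illustration of the durable-path bullets above, the sketch below keeps an index of open agents by canonical path. All names (`PathIndex`, `Residency`, `PathError`) and the `u64` ids are hypothetical, and the real lookup is backed by persistent state rather than an in-memory map. The point it shows is that an unloaded agent still resolves by path and still blocks a colliding registration until it is closed.

```rust
use std::collections::HashMap;

/// Whether an open agent is currently loaded in memory.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Residency {
    Loaded,
    Unloaded,
}

#[derive(Debug, PartialEq)]
pub enum PathError {
    /// The path is already held by an open agent (loaded or not).
    Collision { existing_id: u64 },
}

/// Hypothetical durable index: canonical path -> (agent id, residency) for open agents.
#[derive(Default)]
pub struct PathIndex {
    open: HashMap<String, (u64, Residency)>,
}

impl PathIndex {
    /// Register a new open agent, rejecting any path an open agent already holds.
    pub fn register(&mut self, path: &str, id: u64) -> Result<(), PathError> {
        if let Some((existing_id, _)) = self.open.get(path) {
            return Err(PathError::Collision { existing_id: *existing_id });
        }
        self.open.insert(path.to_string(), (id, Residency::Loaded));
        Ok(())
    }

    /// Unloading changes residency only; the agent stays known and resolvable.
    pub fn unload(&mut self, path: &str) {
        if let Some(entry) = self.open.get_mut(path) {
            entry.1 = Residency::Unloaded;
        }
    }

    /// Resolve by canonical path, including unloaded agents.
    pub fn resolve(&self, path: &str) -> Option<(u64, Residency)> {
        self.open.get(path).copied()
    }

    /// Closing frees the path for reuse.
    pub fn close(&mut self, path: &str) {
        self.open.remove(path);
    }
}
```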

Stack

  1. refactor: split agent control modules #26610 - Mechanical AgentControl split: extracts spawn and V1 legacy code without behavior changes.
  2. feat: add agent execution slot accounting #26614 - Execution slot accounting: separates logical agents from active execution slots.
  3. This PR - Residency and reload runtime: adds resident-agent LRU, eviction/reload, durable lookup, and V2 delivery through reload.
  4. feat: make v2 close_agent interrupt only #26612 - V2 tool semantics: narrows close_agent to interrupt-only and updates V2 tool coverage.

jif-oai requested a review from a team as a code owner June 5, 2026 13:34
jif-oai changed the title from "feat: decouple agent execution slots from residency" to "feat: decouple V2 agent execution slots from residency" Jun 5, 2026
jif-oai changed the title from "feat: decouple V2 agent execution slots from residency" to "feat: decouple agent execution slots from residency" Jun 5, 2026
jif-oai force-pushed the jif/agent-residency-runtime branch from 2f32ac2 to 6390bc7 June 5, 2026 13:41
chatgpt-codex-connector (Bot, Contributor) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f32ac2a15

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread codex-rs/core/src/agent/control/spawn.rs Outdated
Comment thread codex-rs/core/src/agent/registry.rs
Comment thread codex-rs/core/src/agent/control/resident.rs
Comment thread codex-rs/core/src/agent/control/resident.rs
jif-oai (Collaborator, Author) commented Jun 5, 2026

@codex review

chatgpt-codex-connector (Bot, Contributor) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3287fc042d

Comment thread codex-rs/core/src/agent/control_tests.rs
Comment thread codex-rs/core/src/agent/control.rs
Comment thread codex-rs/core/src/agent/control/resident.rs
jif-oai force-pushed the jif/agent-residency-runtime branch from 3287fc0 to cc46838 June 5, 2026 14:05
jif-oai changed the title from "feat: decouple agent execution slots from residency" to "feat: add agent residency reload runtime" Jun 5, 2026
jif-oai changed the base branch from jif/agent-control-split to jif/agent-runtime-slots June 5, 2026 14:05
jif-oai force-pushed the jif/agent-residency-runtime branch from cc46838 to 0ae3083 June 5, 2026 14:22
jif-oai added a commit that referenced this pull request Jun 5, 2026
## Summary

Mechanically splits `AgentControl` into focused modules so later agent
runtime changes are easier to review. The shared lookup, messaging, and
completion logic remains in `control.rs`, while spawn-specific code and
V1 legacy close/resume behavior move into dedicated files.

## Changes

- Extract spawn-agent code into `agent/control/spawn.rs`.
- Extract V1-only legacy close/resume behavior into
`agent/control/legacy.rs`.
- Keep shared control-plane behavior in `agent/control.rs`.
- Preserve existing behavior; this PR is intended to be mechanical.

## Stack

1. This PR - Mechanical `AgentControl` split: extracts spawn and V1
legacy code without behavior changes.
2. #26614 - Execution slot accounting: separates logical agents from
active execution slots.
3. #26611 - Residency and reload runtime: adds resident-agent LRU,
eviction/reload, durable lookup, and V2 delivery through reload.
4. #26612 - V2 tool semantics: narrows `close_agent` to interrupt-only
and updates V2 tool coverage.
jif-oai force-pushed the jif/agent-runtime-slots branch from 569848b to bdb868f June 5, 2026 14:26
@parasol-aser

tasks/mod.rs: start_task, the active.as_mut() rejection after execution_reservation.commit()

Once commit() disables the reservation's release_on_drop, the slot is owned by the turn that's about to start. If abort_all_tasks clears the active turn during the .awaits that follow (the placeholder's task is still None, so no recovery branch runs), the re-lock returns Rejected("active turn reservation was lost…") and the slot is never released. It's reclaimed only on the thread's next turn or on close, so a V2 subagent aborted in this window and never resumed holds its slot indefinitely; enough of those exhaust max_concurrent_threads_per_session and new turns start failing with AgentLimitReached. The issue is race-gated and same-session only (self-DoS, no cross-tenant impact), and largely self-correcting for threads that do resume, but it degrades the very active-slot cap this stack introduces. Note that the reservation-error path just above already calls clear_reserved_active_turn; this branch is the one early return after commit() that frees nothing. The same pattern applies to send_followup_to_agent in control/resident.rs, which commits and then calls get_thread().await?.

Fix: keep the ExecutionReservation alive (don't commit() early) until turn.task is set, or call release_execution_slot(self.thread_id) (and clear_reserved_active_turn) on the rejection branch.
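
A minimal sketch of the first option (commit late), under loud assumptions: `SlotPool`, `Reservation`, and `start_turn` here are hypothetical stand-ins, the code is synchronous where the real path awaits, and `active_turn_present` simulates whether the active turn survived until the re-lock. It only illustrates why keeping the guard armed makes the rejection branch release the slot.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical slot pool; `max` stands in for a per-session concurrency cap.
pub struct SlotPool {
    max: usize,
    used: Mutex<usize>,
}

impl SlotPool {
    pub fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { max, used: Mutex::new(0) })
    }

    pub fn in_use(&self) -> usize {
        *self.used.lock().unwrap()
    }

    /// Take a slot if one is free, returning a guard that releases it on drop.
    pub fn reserve(pool: &Arc<Self>) -> Option<Reservation> {
        let mut used = pool.used.lock().unwrap();
        if *used >= pool.max {
            return None;
        }
        *used += 1;
        Some(Reservation { pool: Arc::clone(pool), release_on_drop: true })
    }
}

/// RAII guard for one slot. Dropping it releases the slot unless committed.
pub struct Reservation {
    pool: Arc<SlotPool>,
    release_on_drop: bool,
}

impl Reservation {
    /// Hand the slot to the started turn; after this, drop no longer releases it.
    pub fn commit(mut self) {
        self.release_on_drop = false;
    }
}

impl Drop for Reservation {
    fn drop(&mut self) {
        if self.release_on_drop {
            *self.pool.used.lock().unwrap() -= 1;
        }
    }
}

/// Commit only once the turn is known to be installed, so every early
/// return before that point drops an armed guard and frees the slot.
pub fn start_turn(pool: &Arc<SlotPool>, active_turn_present: bool) -> Result<(), &'static str> {
    let reservation = SlotPool::reserve(pool).ok_or("agent limit reached")?;
    if !active_turn_present {
        // The guard is dropped here while still armed: the slot is released.
        return Err("active turn reservation was lost");
    }
    reservation.commit();
    Ok(())
}
```

The alternative fix (an explicit release on the rejection branch) reaches the same end state; the guard-based version has the advantage that a future early return cannot forget the release.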


control/resident.rs: ensure_agent_known (receiver resolution from multi_agents_v2/message_tool.rs)

The receiver lookup moved from get_agent_metadata().unwrap_or_default() (caller-tree registry; a miss was rejected) to ensure_agent_known, which on a miss reads the shared state DB and authorizes a target purely on is_thread_spawn_edge_open(id) && source == ThreadSpawn — with no check that the target descends from the caller's root thread. Since the thread map and state DB are shared across root trees in one process, a model in one conversation that knows a subagent ThreadId from another can enqueue into its mailbox, and followup_task will additionally reload that evicted sibling and trigger a turn on it. It's bounded — ThreadId is a UUIDv7, so the caller must already hold a foreign id (the nickname removal in this PR helps here), and it's same-user/same-process, not cross-tenant — but it's a cross-conversation reachability the old in-memory path didn't allow.

Fix: scope ensure_agent_known (and resolve_agent_target for a raw id) to descendants of the caller's root thread before authorizing delivery or reload. Given this is a public repo, consider working out the exact reachability privately rather than in the PR thread.
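
The scoping check could look roughly like the sketch below. It is an assumption-laden illustration: `SpawnTree`, `root_of`, and `may_deliver` are hypothetical names, ids are plain `u64` rather than ThreadId, and the edge table is an in-memory map instead of the state DB. The idea is simply to walk spawn edges up to the root for both caller and target and authorize only when the roots match.

```rust
use std::collections::HashMap;

/// Hypothetical spawn-edge table: child thread id -> parent thread id.
pub struct SpawnTree {
    parent_of: HashMap<u64, u64>,
}

impl SpawnTree {
    /// Build from (parent, child) edges.
    pub fn new(edges: &[(u64, u64)]) -> Self {
        let parent_of = edges.iter().map(|&(parent, child)| (child, parent)).collect();
        Self { parent_of }
    }

    /// Walk parent edges to the root. Returns None if the walk does not
    /// terminate within the number of known edges (a corrupt, cyclic table).
    pub fn root_of(&self, mut id: u64) -> Option<u64> {
        for _ in 0..=self.parent_of.len() {
            match self.parent_of.get(&id) {
                Some(&parent) => id = parent,
                None => return Some(id),
            }
        }
        None
    }

    /// Authorize delivery or reload only within the caller's own root tree.
    pub fn may_deliver(&self, caller: u64, target: u64) -> bool {
        match (self.root_of(caller), self.root_of(target)) {
            (Some(a), Some(b)) => a == b,
            _ => false,
        }
    }
}
```

With edges `1 -> 2 -> 3` and `10 -> 11`, thread 2 may reach 3, but neither may reach 11, even if the foreign id is known.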


2 participants