feat: Async subagents #801

Merged
Henry-811 merged 3 commits into dev/v0.1.41 from async_subagnets
Jan 21, 2026

ncrispino commented Jan 21, 2026

PR: Subagent Continuation & Round Timeouts

Summary

This PR implements two major features for subagent functionality, along with several important enhancements and bug fixes:

  1. MAS-211: Subagent Continuation - Allows resuming timed-out, failed, or completed subagents with new messages
  2. MAS-239: Round Timeouts - Adds per-round timeout support for subagents with parent inheritance

Key Benefits:

  • Resume timed-out subagents without losing work
  • Refine completed subagent answers with follow-up questions
  • Configure round-level timeouts (soft/hard) for subagents
  • Single-pass vs iterative refinement modes for faster execution
  • Async subagent spawning for background execution
  • Cross-agent subagent visibility via registry merging
  • Consistent context handling via CONTEXT.md files

What Changed (10 Phases):

  1. Session registry integration with auto-generated session IDs
  2. Continue subagent tool for resuming conversations
  3. Round timeout support with parent inheritance
  4. Refine parameter for single-pass vs iterative modes
  5. Enhanced list_subagents with registry support
  6. Context parameter removal (breaking change - now requires CONTEXT.md)
  7. Registry merging for cross-agent visibility
  8. Async subagent spawning with broadcast integration
  9. Detailed subprocess error logging for debugging
  10. Triple unpacking fix for exception handlers

Implementation Details

Phase 1: Session Registry Integration

Closes MAS-211 (partial)

Changes:

  • Removed --no-session-registry flag from subagent subprocess invocation
  • Subagents now register with auto-generated session IDs during execution
  • Session IDs extracted from status.json after subagent completes
  • CLI auto-detects subagent sessions by session_id prefix
  • --continue command filters out subagent sessions (user-facing only)
  • Added registry file: {workspace}/subagents/_registry.json

Files Modified:

  • massgen/subagent/manager.py:622-637 - Removed --session-id from initial spawn
  • massgen/subagent/manager.py:688-694 - Extract session ID from status.json
  • massgen/subagent/manager.py:995-1027 - Updated _parse_subprocess_status() to return 3 values
  • massgen/cli.py:7415-7423 - Auto-detect and label subagent sessions
  • massgen/session/_registry.py:198-216 - Filter subagents from --continue
  • massgen/subagent/manager.py:326-369 - Added _save_subagent_to_registry()

How It Works:

# DON'T pass --session-id on initial spawn (that's for restoring existing sessions)
# Let MassGen auto-generate the session ID
cmd = [
    "uv", "run", "massgen",
    "--config", str(yaml_path),
    "--automation",
    "--output-file", str(answer_file),
    full_task,
]

# After subagent completes, extract the auto-generated session ID from status.json
token_usage, subprocess_log_dir, session_id = self._parse_subprocess_status(workspace)

# Track session ID for continuation support
if session_id:
    self._subagent_sessions[config.id] = session_id
    logger.info(f"[SubagentManager] Tracked session ID for {config.id}: {session_id}")

Key Fix: The original implementation passed --session-id to new subagents, but the CLI treats --session-id as "restore existing session" not "create with this ID". This caused "Session not found in registry" errors. The fix lets MassGen auto-generate the session ID, then extracts it from status.json afterward.
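
The extraction step can be sketched as follows. This is a minimal illustration rather than the code in manager.py: the layout of status.json is not shown in this PR, so the top-level "session_id" key and the helper name read_session_id are assumptions.

```python
import json
from pathlib import Path
from typing import Optional


def read_session_id(workspace: Path) -> Optional[str]:
    """Return the session ID recorded in status.json, or None if unavailable.

    Assumption: status.json sits at the workspace root and stores the ID
    under a top-level "session_id" key. The real file layout may differ.
    """
    status_file = workspace / "status.json"
    try:
        status = json.loads(status_file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing or partially written file: treat as "no session to track".
        return None
    if not isinstance(status, dict):
        return None
    session_id = status.get("session_id")
    return session_id if isinstance(session_id, str) and session_id else None
```

Returning None instead of raising keeps a missing status file from turning an otherwise finished subagent run into an error; the caller simply skips session tracking in that case.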


Phase 2: Continue Subagent Tool

Closes MAS-211

Changes:

  • Added continue_subagent() method to SubagentManager
  • Added continue_subagent MCP tool
  • Reuses existing --session-id mechanism for session restoration
  • ALL subagents can be continued (completed, timeout, failed)

Files Modified:

  • massgen/subagent/manager.py:1424-1609 - Added continue_subagent() method
  • massgen/mcp_tools/subagent/_subagent_mcp_server.py:697-802 - Added MCP tool

Usage Example:

# Continue a timed-out subagent
result = continue_subagent(
    subagent_id="research_oauth",
    message="Please continue where you left off and finish the research"
)

# Refine a completed subagent's answer
result = continue_subagent(
    subagent_id="bio",
    message="Please add more details about Bob Dylan's early life"
)

How It Works:

  1. Looks up subagent in _registry.json to get session_id
  2. Invokes: massgen --session-id {session_id} "new message"
  3. Existing restore_session() handles conversation restoration
  4. Returns new answer with continuation metadata
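
Steps 1 and 2 above might look roughly like this. It is a sketch under stated assumptions: the registry schema (a "subagents" list of entries with "subagent_id" and "session_id") and the helper name build_continue_command are hypothetical, and any flags the real invocation adds beyond --session-id are omitted.

```python
import json
from pathlib import Path
from typing import List


def build_continue_command(registry_path: Path, subagent_id: str, message: str) -> List[str]:
    """Look up a subagent's session ID and build the restore command.

    Assumption: _registry.json holds
    {"subagents": [{"subagent_id": ..., "session_id": ...}, ...]}.
    """
    registry = json.loads(registry_path.read_text(encoding="utf-8"))
    for entry in registry.get("subagents", []):
        if entry.get("subagent_id") == subagent_id:
            session_id = entry.get("session_id")
            if not session_id:
                raise ValueError(f"Subagent {subagent_id!r} has no session_id to restore")
            # --session-id restores an existing session; the message is the new turn.
            return ["uv", "run", "massgen", "--session-id", session_id, message]
    raise KeyError(f"Subagent {subagent_id!r} not found in registry")
```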

Phase 3: Round Timeouts for Subagents

Closes MAS-239

Changes:

  • Added round timeout configuration with parent inheritance
  • Supports: initial_round_timeout_seconds, subsequent_round_timeout_seconds, round_timeout_grace_seconds
  • Subagent-specific config overrides parent settings
  • Empty subagent config inherits from parent

Files Modified:

  • massgen/subagent/manager.py:957-976 - Round timeout inheritance logic

Configuration:

orchestrator:
  coordination:
    # Parent round timeouts (applied to parent agent)
    parent_round_timeouts:
      initial_round_timeout_seconds: 180
      subsequent_round_timeout_seconds: 60
      round_timeout_grace_seconds: 30

    # Subagent-specific overrides (optional)
    subagent_round_timeouts:
      initial_round_timeout_seconds: 120  # Override
      subsequent_round_timeout_seconds: 45  # Override
      # round_timeout_grace_seconds not specified; inherits 30 from parent

Inheritance Logic:

  1. Start with parent_round_timeouts
  2. Override with subagent_round_timeouts (only non-None values)
  3. Write to subagent YAML's timeout_settings
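
The first two steps amount to a dictionary merge that ignores None. The sketch below illustrates that rule with the values from the configuration above; merge_round_timeouts is a hypothetical name, not necessarily the function in manager.py.

```python
from typing import Dict, Optional

TimeoutSettings = Dict[str, Optional[int]]


def merge_round_timeouts(parent: TimeoutSettings, subagent: TimeoutSettings) -> TimeoutSettings:
    """Start from the parent's settings and apply the subagent's non-None overrides."""
    merged = dict(parent)  # copy so the parent's settings are not mutated
    merged.update({key: value for key, value in subagent.items() if value is not None})
    return merged


parent = {
    "initial_round_timeout_seconds": 180,
    "subsequent_round_timeout_seconds": 60,
    "round_timeout_grace_seconds": 30,
}
subagent = {
    "initial_round_timeout_seconds": 120,
    "subsequent_round_timeout_seconds": 45,
    "round_timeout_grace_seconds": None,  # not set: falls back to the parent's 30
}
timeout_settings = merge_round_timeouts(parent, subagent)
```

With these inputs, timeout_settings carries 120 and 45 from the subagent block and 30 from the parent. An empty subagent block returns the parent's settings unchanged.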

Phase 4: Refine Parameter

Closes MAS-211 (enhancement)

Changes:

  • Added refine parameter to spawn_subagents, spawn_parallel, spawn_subagent, spawn_subagent_background
  • Follows TUI's refinement pattern (see massgen/agent_config.py:254-269)
  • refine=False: Single-pass execution (faster)
  • refine=True: Multi-round coordination with voting (default)

Files Modified:

  • massgen/subagent/manager.py:790 - Read refine from metadata
  • massgen/subagent/manager.py:913-921 - Apply TUI refinement flags
  • massgen/mcp_tools/subagent/_subagent_mcp_server.py:279 - Added parameter to MCP tool

Refinement Flags (when refine=False):

# All modes
orchestrator_config["max_new_answers_per_agent"] = 1
orchestrator_config["skip_final_presentation"] = True

# Single agent
orchestrator_config["skip_voting"] = True

# Multi-agent
orchestrator_config["disable_injection"] = True
orchestrator_config["defer_voting_until_all_answered"] = True
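
Put together, the flag selection could be wrapped in one helper. This is a sketch only: apply_single_pass_flags is a hypothetical name, and keying the single-agent versus multi-agent branch on an agent count is an assumption about how the real code decides.

```python
from typing import Any, Dict


def apply_single_pass_flags(orchestrator_config: Dict[str, Any], num_agents: int) -> Dict[str, Any]:
    """Return a copy of the config with the refine=False flags applied."""
    config = dict(orchestrator_config)
    # Flags applied in all modes.
    config["max_new_answers_per_agent"] = 1
    config["skip_final_presentation"] = True
    if num_agents == 1:
        # A single agent has nothing to vote on.
        config["skip_voting"] = True
    else:
        # Multiple agents: answer once each, then vote.
        config["disable_injection"] = True
        config["defer_voting_until_all_answered"] = True
    return config
```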

Usage:

# Shared context comes from CONTEXT.md (see Phase 6), not a context argument

# Fast single-pass execution
spawn_subagents(
    tasks=[{"task": "Quick research task"}],
    refine=False  # Skip iteration, return first answer
)

# Full iterative refinement (default)
spawn_subagents(
    tasks=[{"task": "Complex analysis task"}],
    refine=True  # Multi-round coordination with voting
)

Phase 5: Enhanced list_subagents

Closes MAS-211 (partial)

Changes:

  • list_subagents now reads from _registry.json
  • Shows all subagents from current and previous turns
  • Includes session_id, continuable, last_continued_at
  • ALL subagents marked as continuable if they have a session_id

Files Modified:

  • massgen/subagent/manager.py:1863-1908 - Enhanced list implementation
  • massgen/mcp_tools/subagent/_subagent_mcp_server.py:533 - Removed unused parameter

Output Format:

{
  "success": true,
  "subagents": [
    {
      "subagent_id": "research_oauth",
      "status": "timeout",
      "workspace": "/path/to/workspace",
      "task": "Research OAuth 2.0...",
      "session_id": "subagent_research_oauth_abc123",
      "continuable": true,
      "created_at": "2024-01-20T10:00:00",
      "last_continued_at": "2024-01-20T10:05:00"
    }
  ],
  "count": 1
}
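
A caller might filter this payload to decide what to resume. The helper below is a hypothetical consumer of the output format shown above, not part of the PR; it relies only on the subagent_id, status, and continuable fields.

```python
from typing import Any, Dict, List, Optional


def continuable_ids(listing: Dict[str, Any], status: Optional[str] = None) -> List[str]:
    """Return IDs of continuable subagents, optionally restricted to one status."""
    return [
        entry["subagent_id"]
        for entry in listing.get("subagents", [])
        if entry.get("continuable") and (status is None or entry.get("status") == status)
    ]


listing = {
    "success": True,
    "subagents": [
        {"subagent_id": "research_oauth", "status": "timeout", "continuable": True},
        {"subagent_id": "no_session", "status": "timeout", "continuable": False},
    ],
    "count": 2,
}
timed_out = continuable_ids(listing, status="timeout")
```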

Phase 6: Context Parameter Removal

Closes MAS-211 (enhancement)

Changes:

  • Removed context parameter from all subagent spawning methods
  • Requires CONTEXT.md file in workspace before spawning subagents
  • Follows the same pattern as generate_media tool (see massgen/mcp_tools/media/generate_media.py)
  • Added early validation to check for CONTEXT.md existence

Files Modified:

  • massgen/subagent/models.py:20-45 - Removed context field from SubagentConfig
  • massgen/subagent/manager.py:1107-1116 - Removed context parameter from spawn_subagent()
  • massgen/subagent/manager.py:1251-1256 - Removed context parameter from spawn_parallel()
  • massgen/subagent/manager.py:1167-1180 - Added CONTEXT.md validation
  • massgen/mcp_tools/subagent/_subagent_mcp_server.py:275-285 - Removed context from MCP tool

Why This Change:
The original implementation passed context as a string parameter, which created inconsistency with other tools. Following the generate_media pattern, we now require a CONTEXT.md file that subagents read directly. This:

  • Provides better separation between shared context and task-specific instructions
  • Follows established patterns in the codebase
  • Makes context visible in the workspace (can be edited before spawning)
  • Reduces parameter clutter in tool signatures

Usage Pattern:

# BEFORE (old pattern - removed):
spawn_subagents(
    tasks=[{"task": "Research OAuth"}],
    context="Building secure auth system"  # REMOVED
)

# AFTER (new pattern - required):
# Step 1: Create CONTEXT.md file (REQUIRED)
write_file("CONTEXT.md", """
# Task Context

Building a secure authentication system for a web application.

## Key Terms
- OAuth 2.0: Authorization framework
- JWT: JSON Web Tokens for stateless auth

## Visual Style
- Modern, clean interface
- Security-focused design
""")

# Step 2: Spawn subagents (reads from CONTEXT.md)
spawn_subagents(
    tasks=[{"task": "Research OAuth 2.0 best practices"}]
)

Validation:
If CONTEXT.md doesn't exist, the tool returns a clear error:

CONTEXT.md not found in workspace. Before spawning subagents, create a CONTEXT.md file with task context. This helps subagents understand what they're working on.
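
The early check could be as small as the sketch below. The error text is the one quoted above; the function name validate_context_file and the shape of the returned payload (a dict with "success" and "error" keys) are assumptions.

```python
from pathlib import Path
from typing import Any, Dict, Optional

CONTEXT_ERROR = (
    "CONTEXT.md not found in workspace. Before spawning subagents, create a "
    "CONTEXT.md file with task context. This helps subagents understand what "
    "they're working on."
)


def validate_context_file(workspace: Path) -> Optional[Dict[str, Any]]:
    """Return an error payload if CONTEXT.md is missing, else None."""
    if not (workspace / "CONTEXT.md").is_file():
        return {"success": False, "error": CONTEXT_ERROR}
    return None
```

Running the check before any subprocess is launched means a missing file costs one filesystem lookup rather than a failed subagent run.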

Phase 7: Registry Merging (Cross-Agent Visibility)

Closes MAS-211 (enhancement)

Changes:

  • Added agent_temporary_workspace parameter to SubagentManager
  • list_subagents() now scans all agent temporary workspaces
  • continue_subagent() searches across all agent registries
  • Follows MassGen's workspace visibility principles

Files Modified:

  • massgen/subagent/manager.py:67 - Added agent_temporary_workspace parameter
  • massgen/subagent/manager.py:1870-1975 - Enhanced list_subagents() with registry merging
  • massgen/subagent/manager.py:1491-1530 - Enhanced continue_subagent() with cross-registry search

Why This Matters:
When multiple agents spawn subagents, each agent needs visibility into subagents from other agents. This follows the same principle as workspace visibility - if an agent can see another agent's temporary workspace, it should be able to see and continue that agent's subagents.

How It Works:

# Initialize SubagentManager with temp workspace path
manager = SubagentManager(
    parent_workspace=workspace,
    agent_temporary_workspace="/path/to/agent_temp_workspaces"  # NEW
)

# list_subagents() now scans all agent registries
subagents = manager.list_subagents()
# Returns subagents from all agents, with source_agent field:
# [
#   {"subagent_id": "sub1", "source_agent": "agent_1", ...},
#   {"subagent_id": "sub2", "source_agent": "agent_2", ...},
# ]

# continue_subagent() searches all registries
result = manager.continue_subagent(
    subagent_id="sub2",  # Spawned by agent_2
    new_message="Continue where you left off"
)

Workspace Structure:

agent_temp_workspaces/
├── agent_1/
│   └── subagents/
│       └── _registry.json  # Agent 1's subagents
└── agent_2/
    └── subagents/
        └── _registry.json  # Agent 2's subagents

# list_subagents() merges both registries
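
One way to merge the registries in this layout is to glob for every _registry.json and tag each entry with the agent directory it came from. This is a sketch, not the list_subagents() implementation: merge_registries is a hypothetical name and the registry schema (an object with a "subagents" list) is an assumption.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def merge_registries(agent_temporary_workspace: Path) -> List[Dict[str, Any]]:
    """Collect subagent entries from every agent's registry, tagged with source_agent."""
    merged: List[Dict[str, Any]] = []
    pattern = "*/subagents/_registry.json"
    for registry_path in sorted(agent_temporary_workspace.glob(pattern)):
        # <agent>/subagents/_registry.json -> the agent directory is two levels up.
        source_agent = registry_path.parent.parent.name
        try:
            registry = json.loads(registry_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # Skip unreadable registries rather than failing the listing.
        if not isinstance(registry, dict):
            continue
        for entry in registry.get("subagents", []):
            merged.append({**entry, "source_agent": source_agent})
    return merged
```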

Phase 8: Async Subagent Spawning

Related to MAS-211 (background execution)

Changes:

  • Added async_ parameter to spawn_subagents MCP tool
  • Subagents spawn in background without blocking parent agent
  • Results automatically injected via broadcast mechanism
  • Added spawn_subagent_background() method for async execution

Files Modified:

  • massgen/subagent/manager.py:1206-1250 - Added spawn_subagent_background() method
  • massgen/mcp_tools/subagent/_subagent_mcp_server.py:275 - Added async_ parameter
  • massgen/mcp_tools/subagent/_subagent_mcp_server.py:416-472 - Async execution logic

How It Works:

# BLOCKING (default): Wait for subagents to complete
result = spawn_subagents(
    tasks=[{"task": "Research OAuth 2.0"}],
    async_=False  # Wait for completion
)
# Result contains answers immediately

# ASYNC: Spawn in background, continue parent agent work
result = spawn_subagents(
    tasks=[
        {"task": "Research OAuth 2.0", "subagent_id": "oauth"},
        {"task": "Research JWT tokens", "subagent_id": "jwt"}
    ],
    async_=True  # Spawn and return immediately
)
# Result contains spawn info, not answers
# Answers will be injected via broadcast when subagents complete

Broadcast Integration:
When async subagents complete:

  1. Result written to {workspace}/subagents/{id}/answer.txt
  2. Broadcast message sent to parent orchestrator
  3. Parent agent receives: "Subagent 'oauth' has completed. Answer: {content}"
  4. Parent can continue working while subagents run in parallel
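
The spawn-then-notify pattern can be illustrated with plain asyncio. This is a generic sketch of the idea, not MassGen's implementation: run_in_background, the stub subagent, and the in-memory broadcasts list are all hypothetical stand-ins for the subprocess and the broadcast channel.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_in_background(
    subagent_id: str,
    work: Callable[[], Awaitable[str]],
    on_complete: Callable[[str, str], None],
) -> None:
    """Run one subagent's work, then hand its answer to the completion callback."""
    answer = await work()
    on_complete(subagent_id, answer)


async def demo() -> List[str]:
    broadcasts: List[str] = []

    def on_complete(subagent_id: str, answer: str) -> None:
        broadcasts.append(f"Subagent '{subagent_id}' has completed. Answer: {answer}")

    async def fake_subagent() -> str:
        await asyncio.sleep(0.01)  # stands in for the real subprocess
        return "stub answer"

    # Spawn without awaiting: the parent keeps control immediately.
    task = asyncio.create_task(run_in_background("oauth", fake_subagent, on_complete))
    parent_progress = ["parent keeps planning"]  # parent work overlaps the subagent
    assert not broadcasts  # nothing has been delivered yet
    await task  # in a real run the broadcast arrives whenever the subagent finishes
    return parent_progress + broadcasts
```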

Use Cases:

  • Long-running research tasks: Spawn subagents to research multiple topics while parent continues planning
  • Parallel data gathering: Collect data from multiple sources simultaneously
  • Background analysis: Run analysis tasks while parent works on other aspects

Example Flow:

# Parent agent spawns background subagents
spawn_subagents(
    tasks=[
        {"task": "Research OAuth providers", "subagent_id": "providers"},
        {"task": "Research security best practices", "subagent_id": "security"}
    ],
    async_=True
)

# Parent continues working immediately (doesn't wait)
# ... parent does other work ...

# Later, parent receives broadcast:
# "Subagent 'providers' completed. Answer: Top OAuth providers are..."
# "Subagent 'security' completed. Answer: Key security practices include..."

Phase 9: Detailed Subprocess Error Logging

Bug Fix

Changes:

  • Added comprehensive subprocess error logging
  • Captures command, working directory, stderr, and stdout
  • Helps debug subagent failures (exit codes, session errors, etc.)

Files Modified:

  • massgen/subagent/manager.py:720-727 - Enhanced error logging

Example Output:

[SubagentManager] Subagent geese_bio failed with exit code 1
Command: uv run massgen --config /path/to/config.yaml --automation --output-file /path/to/answer.txt "Research the band Geese"
Working directory: /path/to/workspace
STDERR:
STDOUT: ❌ Session error: Session 'subagent_geese_bio_06e1c958' not found in registry
Run 'massgen --list-sessions' to see available sessions

This logging was critical for discovering the session ID bug (Phase 1 fix).
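
A report in this shape can be assembled from the values a subprocess call already provides. The helper below is a sketch with a hypothetical name, not the logging code in manager.py; note that shlex.join quotes arguments with single quotes, so the command line differs slightly from the example output above.

```python
import shlex
from typing import Sequence


def format_subprocess_failure(
    subagent_id: str,
    returncode: int,
    cmd: Sequence[str],
    cwd: str,
    stderr: str,
    stdout: str,
) -> str:
    """Build a multi-line failure report similar to the example output above."""
    return "\n".join(
        [
            f"[SubagentManager] Subagent {subagent_id} failed with exit code {returncode}",
            f"Command: {shlex.join(cmd)}",  # shell-quoted so it can be copied and rerun
            f"Working directory: {cwd}",
            f"STDERR: {stderr.strip()}",
            f"STDOUT: {stdout.strip()}",
        ]
    )
```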


Phase 10: Triple Unpacking Fix

Bug Fix

Changes:

  • Updated all exception handlers to unpack 3 values instead of 2
  • Fixed "too many values to unpack (expected 2)" errors
  • Ensures consistent unpacking across timeout/cancelled/error paths

Files Modified:

  • massgen/subagent/manager.py:738 - Timeout handler
  • massgen/subagent/manager.py:760 - Cancelled handler
  • massgen/subagent/manager.py:774 - Generic error handler

The Bug:
_parse_subprocess_status() was changed to return 3 values (token_usage, log_dir, session_id), but the exception handlers were not updated to match:

# BEFORE (caused unpacking errors):
_, subprocess_log_dir = self._parse_subprocess_status(workspace)

# AFTER (fixed):
_, subprocess_log_dir, _ = self._parse_subprocess_status(workspace)
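
The failure is easy to reproduce in isolation. In this sketch parse_status_stub is a stand-in for the real parser, and the values it returns are placeholders.

```python
from typing import Dict, Optional, Tuple


def parse_status_stub() -> Tuple[Dict[str, int], Optional[str], Optional[str]]:
    """Stand-in for a parser that now returns (token_usage, log_dir, session_id)."""
    return {"input_tokens": 0}, "/tmp/logs", "subagent_demo_abc123"


def old_handler() -> str:
    """Two-target unpacking, as in the exception handlers before the fix."""
    try:
        _, log_dir = parse_status_stub()  # raises: three values, two targets
    except ValueError as exc:
        return str(exc)
    return log_dir


def fixed_handler() -> Optional[str]:
    """Three-target unpacking matches the new return shape."""
    _, log_dir, _ = parse_status_stub()
    return log_dir
```

Here old_handler() returns the text of the ValueError raised by the mismatched unpacking, while fixed_handler() returns the log directory.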

Testing

Unit Tests

massgen/tests/test_subagent_continuation.py - Tests for MAS-211:

  • test_subagent_gets_session_id - Verify session ID generation
  • test_session_registry_saves_metadata - Registry persistence
  • test_continue_subagent_with_session_id - Continuation mechanism
  • test_continue_nonexistent_subagent - Error handling
  • test_continue_completed_subagent - Refinement use case
  • test_list_shows_session_id - Enhanced listing
  • test_list_reads_from_registry - Registry integration
  • test_refine_false_single_agent - Refine parameter behavior

massgen/tests/test_subagent_round_timeouts.py - Tests for MAS-239:

  • test_inherits_parent_round_timeouts - Inheritance from parent
  • test_subagent_overrides_parent_timeouts - Subagent-specific overrides
  • test_no_timeouts_when_not_configured - Empty config handling
  • test_subagent_timeout_overrides_only_specified_fields - Partial overrides
  • test_none_values_dont_override - None value handling

Manual Test Script:

  • scripts/test_subagent_continuation_manual.py - End-to-end flow test

Run Tests:

# Unit tests
uv run pytest massgen/tests/test_subagent_continuation.py -v
uv run pytest massgen/tests/test_subagent_round_timeouts.py -v

# Manual test (requires API keys)
uv run python scripts/test_subagent_continuation_manual.py

Example Configs

Basic Continuation Example

# config: massgen/configs/tools/subagent/continuation_example.yaml
agents:
  - id: parent_agent
    backend:
      type: claude
      model: claude-sonnet-4-5-20250929

orchestrator:
  coordination:
    enable_subagents: true
    subagent_default_timeout: 300  # 5 minutes

Usage:

# Step 1: Create CONTEXT.md (REQUIRED)
write_file("CONTEXT.md", """
# Task Context

Building a secure authentication system for a web application.

## Key Requirements
- OAuth 2.0 integration
- JWT token management
- Secure session handling
""")

# Step 2: Spawn subagent
result = spawn_subagents(
    tasks=[{"task": "Research OAuth 2.0 best practices", "subagent_id": "oauth"}]
)

# Step 3: If it times out, continue it
if result["results"][0]["status"] == "timeout":
    continued = continue_subagent(
        subagent_id="oauth",
        message="Please continue where you left off"
    )

Round Timeout Configuration

# config: massgen/configs/tools/subagent/round_timeouts_example.yaml
agents:
  - id: parent_agent
    backend:
      type: claude
      model: claude-sonnet-4-5-20250929

orchestrator:
  coordination:
    enable_subagents: true

    # Parent round timeouts
    parent_round_timeouts:
      initial_round_timeout_seconds: 180  # 3 min first round
      subsequent_round_timeout_seconds: 60  # 1 min later rounds
      round_timeout_grace_seconds: 30  # 30 sec grace period

    # Subagent-specific overrides (optional)
    subagent_round_timeouts:
      initial_round_timeout_seconds: 120  # 2 min for subagents
      subsequent_round_timeout_seconds: 45  # 45 sec for subagents

Refine Mode Examples

# config: massgen/configs/tools/subagent/refine_modes_example.yaml
agents:
  - id: parent_agent
    backend:
      type: claude
      model: claude-sonnet-4-5-20250929

orchestrator:
  coordination:
    enable_subagents: true
    subagent_default_timeout: 300

Fast Mode (refine=False):

# Step 1: Create CONTEXT.md
write_file("CONTEXT.md", "Auth system research: OAuth 2.0 integration planning")

# Step 2: Quick single-pass execution
spawn_subagents(
    tasks=[
        {"task": "List the top 5 OAuth providers", "subagent_id": "quick_list"}
    ],
    refine=False  # Fast mode: single answer, no iteration
)

Full Mode (refine=True, default):

# Step 1: Create CONTEXT.md
write_file("CONTEXT.md", "Auth system research: Security analysis and threat modeling")

# Step 2: Iterative refinement with voting
spawn_subagents(
    tasks=[
        {"task": "Analyze OAuth security implications", "subagent_id": "deep_analysis"}
    ],
    refine=True  # Full mode: multi-round coordination
)

Verification Checklist

Core Features (MAS-211, MAS-239)

  • Spawn subagent, verify registered with auto-generated session ID
  • Verify session ID extracted from status.json (not passed as --session-id)
  • Run massgen --continue, verify skips subagent sessions
  • Continue timed-out subagent, verify conversation restored
  • Continue completed subagent, verify refinement works
  • List subagents across turns, verify all appear with continuable flag
  • Configure round timeouts, verify soft/hard timeout behavior
  • Test inheritance when subagent config empty
  • Spawn with refine=False, verify single answer returned
  • Spawn with refine=True, verify multi-round coordination

Bug Fixes

  • Verify no "Session not found in registry" errors
  • Verify no "too many values to unpack (expected 2)" errors
  • Verify subprocess errors show detailed logging (command, stderr, stdout)
  • Verify CONTEXT.md validation returns clear error when missing

Context Parameter Removal

  • Verify spawn_subagents requires CONTEXT.md file
  • Verify error returned if CONTEXT.md doesn't exist
  • Verify subagents can read CONTEXT.md content
  • Verify no context parameter in tool signature

Registry Merging

  • Spawn subagents from multiple agents
  • Verify list_subagents() shows all subagents with source_agent
  • Verify continue_subagent() can continue subagents from other agents
  • Verify registry merging follows workspace visibility principles

Async Spawning

  • Spawn with async_=True, verify parent continues immediately
  • Verify async subagent results injected via broadcast
  • Verify multiple async subagents run in parallel
  • Verify async subagent failures don't block parent

Breaking Changes

Context Parameter Removed (MAS-211)

Breaking: The context parameter has been removed from all subagent spawning methods.

Migration:

# BEFORE:
spawn_subagents(
    tasks=[{"task": "Research OAuth"}],
    context="Building auth system"  # REMOVED
)

# AFTER:
# Step 1: Create CONTEXT.md file
write_file("CONTEXT.md", "Building auth system")

# Step 2: Spawn subagents
spawn_subagents(
    tasks=[{"task": "Research OAuth"}]
)

Rationale: This change follows the generate_media pattern and provides better separation between shared context and task-specific instructions.

Other Changes (Backward Compatible)

  • New parameters have sensible defaults (refine=True, async_=False)
  • Round timeouts are optional (inherit from parent or none)
  • Session registry integration is transparent to users

Future Work

  • Update documentation (docs/source/user_guide/subagents.rst)
  • Add continuation examples to user guide
  • Document round timeout configuration
  • Document refine parameter usage

Related Issues

  • Closes MAS-211
  • Closes MAS-239
  • Related to async subagent spawning (previous work on this branch)

ncrispino and others added 3 commits January 10, 2026 14:12
…n (MAS-214)

Add async_=True parameter to spawn_subagents MCP tool for non-blocking
subagent execution. When enabled, subagents run in background while
parent continues working. Results are automatically injected via
SubagentCompleteHook when subagents complete.

Key changes:
- SubagentManager: Add callback mechanism for completion notification
- SubagentCompleteHook: New PostToolUse hook for result injection
- Orchestrator: Add pending results queue and hook registration
- MCP tool: Add async_ parameter with immediate return for background mode
- Result formatter: XML-structured format for injected results
- Config: Add async_subagents config with injection_strategy option

Includes TDD test suite (196 tests) and integration tests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai bot commented Jan 21, 2026

Walkthrough

This PR implements async subagent execution with automatic result injection. It reorganizes subagent configuration under a coordination block, adds per-round timeouts, introduces async spawning and continuation capabilities, implements a PostToolUse hook for result injection, and extends orchestration to queue and deliver subagent results to parent agents.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Configuration & Schema Documentation | docs/source/quickstart/configuration.rst, docs/source/reference/yaml_schema.rst, docs/source/reference/timeouts.rst | Move subagent config from top-level orchestrator to nested coordination block; document new subagent_round_timeouts with per-round timeout settings; document async_subagents configuration with enabled and injection_strategy options. |
| User Guide & Example Documentation | docs/source/user_guide/advanced/subagents.rst, massgen/configs/features/async_subagent_example.yaml | Update configuration examples to use coordination.enable_subagents and coordination.subagent_round_timeouts; add async_subagents examples with injection strategies; document refine and async execution flows. |
| Agent Configuration | massgen/agent_config.py | Add subagent_round_timeouts and async_subagents fields to CoordinationConfig for per-round and async execution configuration. |
| Configuration Validation | massgen/config_validator.py | Add validation for coordination.async_subagents (enabled bool, injection_strategy enum), coordination.subagent_round_timeouts (initial/subsequent/grace timeout fields), and related error reporting. |
| Result Formatting | massgen/subagent/result_formatter.py | New module with format_single_result() and format_batch_results() for XML-like formatting of subagent results with metadata (id, status, answer, execution_time, workspace, token_usage). |
| Hook Framework Integration | massgen/mcp_tools/hooks.py | Add SubagentCompleteHook class (PatternHook) to inject completed async subagent results into tool outputs with configurable injection_strategy; supports pending-results getter and fail-open error handling. |
| Subagent MCP Tool | massgen/mcp_tools/subagent/_subagent_mcp_server.py | Update spawn_subagents with async_ and refine flags (replace context parameter with CONTEXT.md workspace requirement); add new continue_subagent tool for multi-turn conversations; branch execution into async (background) and blocking paths. |
| Subagent Models | massgen/subagent/models.py | Remove context field from SubagentConfig public API and serialization methods. |
| Subagent Manager | massgen/subagent/manager.py | Add completion callback mechanism (register_completion_callback, _invoke_completion_callbacks), session tracking via agent_temporary_workspace, per-round timeout propagation, continuation API (continue_subagent), and registry persistence (_save_subagent_to_registry); update spawn_subagent* signatures to accept refine flag. |
| Orchestrator Integration | massgen/orchestrator.py | Add per-agent pending subagent results queue, async configuration flags, SubagentCompleteHook registration in multiple injection paths, completion callbacks (_on_subagent_complete), pending results flushing (_flush_pending_subagent_results), and subagent_round_timeouts propagation in config inheritance. |
| CLI & Session Management | massgen/cli.py, massgen/session/_registry.py | Update CLI to detect and label subagent sessions by memory_session_id prefix; add subagent and parent_session_id parameters to SessionRegistry.register_session; filter subagent sessions from continuable session lookup. |
| System Prompt Documentation | massgen/system_prompt_sections.py | Update spawn_subagents API documentation to reflect async_, refine parameters and CONTEXT.md workspace requirement instead of context parameter. |
| Comprehensive Testing | massgen/tests/test_hook_framework.py, massgen/tests/test_subagent_manager.py, massgen/tests/test_subagent_mcp_server.py, massgen/tests/test_subagent_result_formatter.py, scripts/test_async_subagent_integration.py | Add test suites covering SubagentCompleteHook execution and injection strategies, SubagentManager callbacks and SubagentResult creation, async spawn_subagents parameter and return formats, result formatter XML output, and end-to-end integration of callbacks, hooks, and orchestration. |
| Specification & Design Documentation | openspec/changes/add-async-subagent-execution/design.md, openspec/changes/add-async-subagent-execution/proposal.md, openspec/changes/add-async-subagent-execution/specs/subagent/spec.md, openspec/changes/add-async-subagent-execution/tasks.md | Document design rationale, proposal with phased rollout, formal specifications for async execution and result injection, multi-phase implementation plan, and testing/integration goals. |

Sequence Diagram(s)

sequenceDiagram
    participant Parent as Parent Agent
    participant Orch as Orchestrator
    participant SubMgr as SubagentManager
    participant Sub as Subagent
    participant Hook as SubagentCompleteHook

    Parent->>Orch: Run with coordination config<br/>(async_subagents enabled)
    Orch->>Orch: Register SubagentCompleteHook<br/>with pending results queue
    Orch->>SubMgr: Register completion callback<br/>(_on_subagent_complete)
    
    Parent->>Orch: Tool call: spawn_subagents<br/>(async_=true)
    Orch->>SubMgr: spawn_subagent_background()
    SubMgr->>Sub: Launch subprocess<br/>(subagent with refine=false)
    SubMgr-->>Orch: Return immediately with IDs
    Orch-->>Parent: Return async mode response<br/>(subagent IDs, running status)

    Note over Sub: Subagent executes<br/>(potentially multiple turns<br/>if refine=true)

    Sub-->>SubMgr: Completes with result
    SubMgr->>Orch: Invoke completion callback<br/>(_on_subagent_complete)
    Orch->>Orch: Queue result in<br/>pending_subagent_results[parent_id]

    Parent->>Orch: Next tool call
    Orch->>Hook: PostToolUse event
    Hook->>Orch: Get pending results
    Hook->>Hook: format_batch_results()
    Hook-->>Orch: Inject formatted content<br/>(tool_result or user_message)
    Orch-->>Parent: Tool output with<br/>injected subagent results
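
The queue-and-flush portion of this flow can be sketched in a few lines. This is an illustration of the pattern only: PendingResults and inject are hypothetical names, and the plain-text lines below stand in for the XML-like output of format_batch_results(), which is not reproduced here.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List


class PendingResults:
    """Per-parent queue of completed subagent results awaiting injection."""

    def __init__(self) -> None:
        self._pending: DefaultDict[str, List[Dict[str, str]]] = defaultdict(list)

    def on_subagent_complete(self, parent_id: str, subagent_id: str, answer: str) -> None:
        """Completion callback: queue the result for the parent that spawned it."""
        self._pending[parent_id].append({"subagent_id": subagent_id, "answer": answer})

    def flush(self, parent_id: str) -> List[Dict[str, str]]:
        """Return and clear everything queued for this parent."""
        return self._pending.pop(parent_id, [])


def inject(tool_output: str, results: List[Dict[str, str]]) -> str:
    """Append queued results to a tool's output; unchanged if nothing is pending."""
    if not results:
        return tool_output
    lines = [f"[{r['subagent_id']}] {r['answer']}" for r in results]
    return tool_output + "\n" + "\n".join(lines)
```

Flushing on the parent's next tool call means each result is delivered at most once, and a parent with nothing pending sees its tool output untouched.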

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • Async subagent execution with result injection—directly implements async spawning, completion callbacks, pending results queue, and SubagentCompleteHook-based injection.
  • Subagent round timeouts—directly implements subagent_round_timeouts configuration, validation, and propagation to subagent coordination settings.

Possibly related PRs

  • PR #769: Introduces the hook framework (PatternHook, HookManager) that SubagentCompleteHook extends and integrates into.
  • PR #764: Modifies SubagentManager configuration generation (_generate_subagent_yaml_config); this PR extends that path with refine and per-round timeout propagation.
  • PR #740: Prior work on subagent orchestration config and timeout settings; this PR builds on and reorganizes that configuration under coordination.

Suggested reviewers

  • a5507203

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Documentation Updated | ⚠️ Warning | Documentation lacks required Google-style docstrings and contains invalid context parameter examples contradicting implementation. | Update SubagentCompleteHook.execute() with full Args/Returns sections, remove invalid context parameters from examples, and clarify CONTEXT.md workspace requirement. |
| Config Parameter Sync | ❓ Inconclusive | The specified files (massgen/backend/base.py and massgen/api_params_handler/_api_params_handler_base.py) and their get_base_excluded_config_params() method do not exist in this repository. | Verify whether these files and methods exist in the repository or clarify if the check applies to this codebase. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: Async subagents' accurately describes the primary feature being added and follows the required format convention. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering implementation details, testing, examples, and verification. It aligns with the template structure and provides extensive context about the changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.81% which is sufficient. The required threshold is 80.00%. |
| Capabilities Registry Check | ✅ Passed | PR implements async subagent execution, continuation, per-round timeouts, and refinement controls within orchestration and subagent lifecycle layers with no backend or model capability modifications. |


@Henry-811 Henry-811 changed the base branch from main to dev/v0.1.41 January 21, 2026 16:55
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
massgen/config_validator.py (1)

785-858: Add async_subagents, subagent_round_timeouts, and plan_depth to excluded config parameters and config builder.

The new coordination validation logic adds three parameters that are missing from the required propagation points:

  1. massgen/backend/base.py, in get_base_excluded_config_params(): Must exclude async_subagents, subagent_round_timeouts, and plan_depth alongside existing coordination params like vote_only and use_two_tier_workspace; otherwise these parameters will leak to backend provider APIs.

  2. massgen/api_params_handler/_api_params_handler_base.py, in get_base_excluded_params(): Same exclusions needed to prevent API parameter pollution.

  3. massgen/config_builder.py: Missing interactive wizard prompts for async_subagents, subagent_round_timeouts, and plan_depth. The builder currently supports enable_subagents and subagent_orchestrator but not the new async coordination controls or planning depth options.

Without these updates, users cannot configure these settings through the interactive builder, and the new parameters will incorrectly propagate to backend implementations.
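
The exclusion step described above can be sketched roughly as follows. This is a minimal illustration only: `COORDINATION_ONLY_PARAMS` and `strip_coordination_params` are hypothetical names, not MassGen functions, and the config values are placeholders. It only shows the idea of filtering coordination-level keys out before a config reaches a provider API.

```python
# Hypothetical sketch: these names are NOT taken from the MassGen codebase.
COORDINATION_ONLY_PARAMS = frozenset(
    {
        "vote_only",
        "use_two_tier_workspace",
        "async_subagents",
        "subagent_round_timeouts",
        "plan_depth",
    }
)


def strip_coordination_params(config: dict) -> dict:
    """Return a copy of config without coordination-only keys."""
    return {k: v for k, v in config.items() if k not in COORDINATION_ONLY_PARAMS}


# Placeholder values, chosen only for the demonstration.
backend_config = {
    "model": "example-model",
    "temperature": 0.2,
    "async_subagents": {"enabled": True},
    "plan_depth": "example",
}
api_params = strip_coordination_params(backend_config)
print(api_params)
```

Keeping the excluded names in one shared set would, under this assumption, let both exclusion points stay in sync when a new coordination parameter is added.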

massgen/mcp_tools/subagent/_subagent_mcp_server.py (1)

323-338: Return docs don’t match the actual summary payload.

The docstring omits failed (and doesn’t mention partial / completed_but_timeout), but the implementation returns failed and supports additional statuses. Please update the return docs to mirror the actual shape. As per coding guidelines, keep documentation consistent with implementation.

massgen/orchestrator.py (1)

31-63: Stale subagent results can persist across coordination rounds and leak into later turns; add state validation to drop results when parent is inactive.

_on_subagent_complete() appends results to _pending_subagent_results without checking parent state. If a subagent completes after its parent has voted or after coordination ends, the result is queued but never retrieved (hooks only trigger on tool use). Results then persist in the dictionary across coordination rounds since it's not cleared when workflow_phase resets to idle, risking injection into the next turn if the same agent is reused.

Add validation to drop results when the parent is no longer active:

🛠️ Suggested fix
     def _on_subagent_complete(
         self,
         parent_agent_id: str,
         subagent_id: str,
         result: "SubagentResult",
     ) -> None:
+        state = self.agent_states.get(parent_agent_id)
+        if self.workflow_phase != "coordinating" or (state and state.has_voted):
+            logger.debug(
+                f"[Orchestrator] Dropping subagent result {subagent_id} for {parent_agent_id} (parent inactive)"
+            )
+            return
         if parent_agent_id not in self._pending_subagent_results:
             self._pending_subagent_results[parent_agent_id] = []
         self._pending_subagent_results[parent_agent_id].append((subagent_id, result))

Also applies to: 284-295, 4316-4334, 4875-4891

massgen/subagent/manager.py (1)

735-784: Critical: Tuple unpacking mismatch will cause ValueError at runtime.

The _parse_subprocess_status method was updated to return a 3-tuple (token_usage, subprocess_log_dir, session_id) (line 997), but the timeout, cancellation, and generic exception handlers still use 2-tuple unpacking.

🐛 Proposed fix
         except asyncio.TimeoutError:
             logger.error(f"[SubagentManager] Subagent {config.id} timed out")
             # Still copy logs even on timeout - they contain useful debugging info
-            _, subprocess_log_dir = self._parse_subprocess_status(workspace)
+            _, subprocess_log_dir, _ = self._parse_subprocess_status(workspace)
             self._write_subprocess_log_reference(config.id, subprocess_log_dir, error="Subagent timed out")
         except asyncio.CancelledError:
             # ...
             # Still copy logs even on cancellation - they contain useful debugging info
-            _, subprocess_log_dir = self._parse_subprocess_status(workspace)
+            _, subprocess_log_dir, _ = self._parse_subprocess_status(workspace)
             self._write_subprocess_log_reference(config.id, subprocess_log_dir, error="Subagent cancelled")
         except Exception as e:
             logger.error(f"[SubagentManager] Subagent {config.id} error: {e}")
             # Still copy logs even on error - they contain useful debugging info
-            _, subprocess_log_dir = self._parse_subprocess_status(workspace)
+            _, subprocess_log_dir, _ = self._parse_subprocess_status(workspace)
             self._write_subprocess_log_reference(config.id, subprocess_log_dir, error=str(e))
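
The failure mode behind this comment can be reproduced in isolation. The sketch below uses a hypothetical `parse_status` stand-in (not the real `_parse_subprocess_status`) that returns three values, and shows that two-name unpacking raises `ValueError` while three-name unpacking succeeds.

```python
def parse_status():
    """Hypothetical stand-in for a helper that now returns a 3-tuple."""
    return {"total_tokens": 0}, "/path/to/logs", "session-123"


message = None
try:
    # Old call site: only two names for three returned values.
    _, log_dir = parse_status()
except ValueError as exc:
    message = str(exc)
    print(f"ValueError: {message}")

# Updated call site: the third value (session ID) is explicitly discarded.
_, log_dir, _ = parse_status()
print(log_dir)
```

Because the mismatch only surfaces inside the exception handlers, it would mask the original timeout or cancellation with an unrelated `ValueError`.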
🤖 Fix all issues with AI agents
In `@docs/source/user_guide/advanced/subagents.rst`:
- Around line 166-169: The example call to spawn_subagents in the docs includes
an invalid "context" parameter; remove the "context": "Building a Bob Dylan
tribute website with biography, discography, and songs pages" line from the JSON
example so the call only uses the supported parameters (tasks, async_, refine)
and relies on the workspace CONTEXT.md for context; keep the "refine": true line
intact and ensure the example matches the spawn_subagents signature.

In `@massgen/cli.py`:
- Around line 2289-2294: CoordinationConfig constructions are missing
propagation of the subagent_round_timeouts option, so per-round subagent
timeouts are ignored in modes like run_single_question and run_turn; update
every CoordinationConfig instantiation (e.g., where run_single_question and
run_turn build a CoordinationConfig) to include
subagent_round_timeouts=coord_cfg.get("subagent_round_timeouts") alongside the
existing subagent_* fields (subagent_default_timeout, subagent_max_concurrent,
enable_subagents, subagent_orchestrator, use_two_tier_workspace) so the setting
is consistently applied across all code paths.

In `@massgen/mcp_tools/hooks.py`:
- Around line 888-895: Add a Google-style docstring to
SubagentCompleteHook.execute that documents parameters and return value:
describe Args including function_name (str): name of the subagent function,
arguments (str): serialized args passed, context (Optional[Dict[str, Any]]):
optional execution context, and **kwargs for extra options; and add a Returns
section describing that the method returns a HookResult indicating
success/failure and any payload, plus an optional Raises section if the method
can raise errors; place this docstring directly under the async def execute
signature so tools and linters pick it up.
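
A Google-style docstring of the kind requested might look like the sketch below. `HookResult` here is a simplified hypothetical stand-in with made-up fields, and the method body is a placeholder; only the docstring layout (Args, Returns, Raises) is the point.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class HookResult:
    """Hypothetical stand-in; the real HookResult fields may differ."""

    allowed: bool = True
    payload: Optional[str] = None


class SubagentCompleteHookSketch:
    """Illustrative class, not the real SubagentCompleteHook."""

    async def execute(
        self,
        function_name: str,
        arguments: str,
        context: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> HookResult:
        """Run the hook for a single tool call.

        Args:
            function_name: Name of the function the hook was triggered for.
            arguments: Serialized arguments passed to that function.
            context: Optional execution context.
            **kwargs: Extra options forwarded by the hook manager.

        Returns:
            A HookResult indicating success or failure, plus any payload.

        Raises:
            ValueError: If function_name is empty (assumed for this sketch).
        """
        if not function_name:
            raise ValueError("function_name must not be empty")
        return HookResult()


result = asyncio.run(SubagentCompleteHookSketch().execute("example_tool", "{}"))
print(result)
```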

In `@massgen/mcp_tools/subagent/_subagent_mcp_server.py`:
- Around line 422-447: The async branch currently returns "success": True
regardless of individual spawn results from manager.spawn_subagent_background;
change the logic after collecting spawned to inspect each returned info (e.g.,
check for dicts with success==False or presence of an "error" key) and compute
an overall success boolean (False if any spawn failed), include a list/summary
of failures in the response, and only call _save_subagents_to_filesystem() if
there is at least one successful spawn; reference
manager.spawn_subagent_background, normalized_tasks, spawned, and
_save_subagents_to_filesystem when making these checks and updating the returned
payload.
- Around line 498-513: The summary counts currently only count statuses
"completed", "error", and "timeout" and thus misses the new statuses; update the
summary calculation in the spawn_subagents response to include the new statuses
from results: compute completed as sum for r.status in ("completed",
"completed_but_timeout", "partial"), compute timeout as sum for r.status in
("timeout", "completed_but_timeout"), keep failed as r.status == "error", and
ensure any downstream use of all_success (all(r.success for r in results))
remains correct; update the returned "summary" object accordingly where
completed, failed, and timeout are set.
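
The summary calculation described in the second item could be sketched as follows. `Result` is a hypothetical stand-in for `SubagentResult` carrying only the two fields used here, and `summarize` is an illustrative helper, not a function from the codebase.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    """Hypothetical stand-in for SubagentResult (status and success only)."""

    status: str
    success: bool


def summarize(results: List[Result]) -> Dict[str, int]:
    """Count results per bucket, including the newer statuses."""
    completed_statuses = ("completed", "completed_but_timeout", "partial")
    timeout_statuses = ("timeout", "completed_but_timeout")
    return {
        "total": len(results),
        "completed": sum(1 for r in results if r.status in completed_statuses),
        "failed": sum(1 for r in results if r.status == "error"),
        "timeout": sum(1 for r in results if r.status in timeout_statuses),
    }


results = [
    Result("completed", True),
    Result("completed_but_timeout", True),
    Result("partial", False),
    Result("timeout", False),
    Result("error", False),
]
summary = summarize(results)
all_success = all(r.success for r in results)
print(summary, all_success)
```

Note that under this scheme `completed_but_timeout` is counted in both the completed and timeout buckets, so the buckets are not guaranteed to sum to the total.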

In `@massgen/subagent/manager.py`:
- Around line 2003-2011: The conditional inside the loop over
agent_registry.get("subagents", {}) is wrong: `agent_id == self.parent_agent_id`
can never be relied on there and makes the skip logic dead/incorrect; update the
skip to only check whether this subagent is already present by using `if
subagent_id in subagents` (remove the `agent_id == self.parent_agent_id`
conjunct) so the loop correctly skips duplicates when iterating agent_registry
in the manager code that builds prefixed_id from agent_id and subagent_id.
- Around line 1680-1683: The NameError happens because registry (and possibly
registry_file) are only set when the subagent is found in the current agent's
registry; when you locate the subagent in another agent's registry you must
assign the same names before updating. Modify the code path that handles
“subagent found in another agent's registry” to set registry = other_registry
(or the variable holding that other agent's registry) and registry_file =
other_registry_file (or the Path used for that registry) so that the subsequent
lines that set subagent_entry["status"] and call
registry_file.write_text(json.dumps(registry, ...)) operate on the correct
registry object; ensure you reference the same subagent_entry object found in
that registry.
- Around line 1396-1410: The SubagentState instantiation in manager.py passes
finished_at=datetime.now() but the SubagentState dataclass
(massgen/subagent/models.py) has no finished_at field; remove the finished_at
keyword from the SubagentState(...) call in manager.py to match the dataclass,
or if you need to track finish time, add a finished_at: Optional[datetime] =
None field (with appropriate import) to the SubagentState dataclass definition
instead; update either SubagentState(...) or the dataclass so both sides use the
same fields.

In `@massgen/subagent/result_formatter.py`:
- Around line 73-85: The header currently uses count = len(results) and labels
all entries as "completed" which is misleading; change it to compute a
completed_count by iterating results and checking each result's status (e.g.,
sum(1 for _, r in results if r.get("status") == "completed" or getattr(r,
"status", None) == "completed")), then use that completed_count in the header
string (and optionally include total=len(results) for clarity); update
references around format_single_result, formatted_results, results, and header
so the header accurately reflects completed vs. total results.
- Around line 35-53: The returned XML is vulnerable because raw result.answer,
result.error and other fields (e.g., result.workspace_path, token_element
content) can contain XML-breaking characters or closing tags; update the
formatter that builds the string (the block returning the f-string in
result_formatter.py) to escape XML special characters (at minimum &, <, >, " and
') for all injected fields or wrap the dynamic content in a CDATA section
(choose one consistent approach), applying it to content (derived from
result.answer/result.error), workspace_path, and any token_element text before
constructing the final <subagent_result> string so injected text cannot break
the XML wrapper.
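
One way to apply the escaping described above is with the standard library's `xml.sax.saxutils`. The sketch below is not the actual formatter: `format_result` and its attribute names are assumptions made for illustration, and only `escape` and `quoteattr` are real library calls.

```python
from xml.sax.saxutils import escape, quoteattr


def format_result(subagent_id: str, status: str, answer: str) -> str:
    """Hypothetical formatter: escapes text and attribute values."""
    return (
        # quoteattr adds the surrounding quotes and escapes the value.
        f"<subagent_result id={quoteattr(subagent_id)} status={quoteattr(status)}>\n"
        # escape handles &, < and > in element text.
        f"{escape(answer)}\n"
        f"</subagent_result>"
    )


# An answer that would otherwise close the wrapper early.
hostile = "a < b && </subagent_result>"
out = format_result("sub-1", "completed", hostile)
print(out)
```

Escaping keeps the payload readable by an XML parser; wrapping in CDATA is the alternative mentioned above, but it needs extra handling for a literal `]]>` inside the content.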

In `@massgen/tests/test_subagent_result_formatter.py`:
- Around line 403-419: The formatter in massgen.subagent.result_formatter
(format_single_result and any helpers that build XML for SubagentResult)
currently injects raw result.answer/result.error (and any fields used as XML
attributes) into XML; update format_single_result to escape XML special
characters (&, <, >, ", ') for both element text and attribute values before
inserting (use a standard utility such as xml.sax.saxutils.escape or
html.escape), ensure workspace_path/subagent_id used in attributes are escaped
too, and run/adjust tests so they assert the escaped content appears (or
continue asserting unescaped substrings like "Answer with" if they remain
present).

In `@openspec/changes/add-async-subagent-execution/proposal.md`:
- Around line 28-36: Update the example YAML so it matches the actual validation
path by nesting async_subagents under orchestrator.coordination (i.e., change
the block from orchestrator: async_subagents: ... to orchestrator.coordination:
async_subagents: ...); ensure the keys (enabled, injection_strategy,
inject_progress, max_background) remain unchanged and reference the same
async_subagents identifier used by the validator to avoid copy‑paste
misconfigurations.

In `@openspec/changes/add-async-subagent-execution/specs/subagent/spec.md`:
- Around line 116-123: The spec snippet shows the config under
orchestrator.async_subagents but the validator (massgen/config_validator.py)
expects orchestrator.coordination.async_subagents; update the spec example in
specs/subagent/spec.md to use the correct path
orchestrator.coordination.async_subagents (including the same nested keys
enabled and injection_strategy) so the example matches the implementation and
validation logic.
- Around line 1-2: There are duplicate section headers "## ADDED Requirements"
in the spec; locate the second occurrence of the header (the duplicate "## ADDED
Requirements" block) and either remove it or rename it to "## MODIFIED
Requirements" if the content is meant to be a modification, ensuring only one
"## ADDED Requirements" header remains and the document structure/linting is
preserved.

In `@openspec/changes/add-async-subagent-execution/tasks.md`:
- Around line 144-154: Update the documentation checklist paths to match actual
locations: replace `docs/source/user_guide/subagents.rst` with
`docs/source/user_guide/advanced/subagents.rst` and replace
`massgen/configs/tools/subagent/async_subagent_example.yaml` with
`massgen/configs/features/async_subagent_example.yaml` in the tasks.md entry
(the checklist items under "9.1 Update subagent user guide" and "9.2 Add example
configs") so the referenced targets point to the correct files.

In `@scripts/test_async_subagent_integration.py`:
- Around line 22-26: The import lines referencing SubagentResult and the two
formatter functions (format_batch_results, format_single_result) include unused
"# noqa: E402" directives; remove those trailing "# noqa: E402" annotations from
the import statements so the imports remain the same but without the unnecessary
linter suppressions (update the imports that import SubagentResult and the two
format_* functions accordingly).
🧹 Nitpick comments (11)
massgen/tests/test_subagent_manager.py (1)

31-36: Consider using pytest tmp_path for workspace paths.

Hard-coded /tmp/test paths reduce isolation and trigger Ruff S108. Using tmp_path keeps tests hermetic and portable. Apply across this file as feasible.

♻️ Example refactor
-def test_register_completion_callback(self):
+def test_register_completion_callback(self, tmp_path):
     """Test that a callback can be registered."""
     from massgen.subagent.manager import SubagentManager
 
     manager = SubagentManager(
-        parent_workspace="/tmp/test",
+        parent_workspace=str(tmp_path),
         parent_agent_id="test-agent",
         orchestrator_id="test-orch",
         parent_agent_configs=[],
     )
docs/source/reference/yaml_schema.rst (1)

994-1005: List sub-keys for subagent_round_timeouts and async_subagents.
These are listed as generic objects; adding the expected keys (and types) would keep the schema as precise as other sections.

openspec/changes/add-async-subagent-execution/design.md (1)

133-169: Align injection format in the design doc with the implemented formatter.
The design specifies a <subagent_results> wrapper and Summary/Details sections, but the current formatter uses a separator header plus individual <subagent_result> blocks. Update the design doc (or add a note) to reflect the final format.

massgen/tests/test_subagent_result_formatter.py (1)

20-167: Adopt Google-style docstrings for new test methods.

These test functions use brief one-line docstrings; the guideline calls for Google-style docstrings on new/changed functions. Consider expanding them (or documenting an exemption) for this test suite. As per coding guidelines, please align docstrings accordingly.

scripts/test_async_subagent_integration.py (1)

29-62: Use Google‑style docstrings for new helper functions.

These new test helpers use short docstrings; the guideline calls for Google‑style docstrings on new/changed functions. Consider updating them across this script. As per coding guidelines, please align docstrings accordingly.

massgen/tests/test_subagent_mcp_server.py (4)

17-105: Exercise real code paths (and complete the placeholder test).

This class mostly asserts against locally-constructed dicts and includes a placeholder pass. Consider invoking the MCP tool via fixtures/mocks (or marking as xfail/skip until implemented) to avoid tautological tests and improve coverage. Also, Google‑style docstrings are expected for new test functions. As per coding guidelines, please align docstrings accordingly.


246-264: Implement or mark placeholder validation tests.

These pass tests currently provide no coverage. Consider implementing assertions or marking them xfail/skip until behavior is ready.


271-303: Implement or mark placeholder background‑spawning tests.

These pass tests currently provide no coverage. Consider implementing assertions or marking them xfail/skip until behavior is ready.


310-322: Implement or mark placeholder error‑handling tests.

These pass tests currently provide no coverage. Consider implementing assertions or marking them xfail/skip until behavior is ready.

massgen/subagent/manager.py (2)

997-1029: Consider logging exceptions for debugging.

The bare except Exception: pass at lines 1027-1028 silently swallows all errors when parsing status.json. While this is acceptable for graceful fallback, logging at DEBUG level would help troubleshoot issues without cluttering normal output.

♻️ Optional improvement
                 except Exception:
-                    pass
+                    logger.debug(f"[SubagentManager] Failed to parse status.json in {log_dir}")

1720-1728: Use logging.exception for better error diagnostics.

Per static analysis hint, logging.exception automatically includes the stack trace without needing exc_info=True, making the error more useful for debugging.

♻️ Proposed fix
         except Exception as e:
             execution_time = time.time() - start_time
-            logger.error(f"[SubagentManager] Error continuing subagent {subagent_id}: {e}")
+            logger.exception(f"[SubagentManager] Error continuing subagent {subagent_id}")
             return SubagentResult.create_error(
                 subagent_id=subagent_id,
                 error=str(e),
                 workspace_path=str(workspace),
                 execution_time_seconds=execution_time,
             )
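
The difference can be checked with a small standalone script. This sketch uses a throwaway logger writing to an in-memory stream (names are illustrative) and shows that `logger.exception` appends the traceback of the exception currently being handled.

```python
import io
import logging

# Capture log output in memory so it can be inspected.
stream = io.StringIO()
logger = logging.getLogger("subagent_sketch")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)
logger.propagate = False

try:
    raise RuntimeError("boom")
except Exception:
    # Logs at ERROR level and attaches the active traceback.
    logger.exception("Error continuing subagent %s", "sub-1")

output = stream.getvalue()
print(output)
```

`logger.exception` is only meaningful inside an `except` block, since it relies on the exception currently being handled.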

@Henry-811 Henry-811 merged commit c55ca72 into dev/v0.1.41 Jan 21, 2026
25 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Jan 21, 2026
18 tasks
@coderabbitai coderabbitai bot mentioned this pull request Feb 2, 2026
18 tasks
@coderabbitai coderabbitai bot mentioned this pull request Feb 18, 2026
18 tasks

2 participants