Support multi-action per turn #438

@lmorchard

Current state

Pilo enforces "exactly one tool per turn" in three places:

  1. The AI SDK call uses toolChoice: "required" (webAgent.ts:905) and the action loop processes only aiResponse.toolResults[0] (webAgent.ts:1019).
  2. The system prompt repeats "EXACTLY ONE tool" (prompts.ts:213-217, also in the per-step snapshot prompt and error feedback prompt).
  3. When zero tool calls come back, ToolExecutionError("You must use exactly one tool") is thrown (webAgent.ts:1011-1017).

For multi-step physical actions (filling a 5-field form, then submitting), this means 5+ separate LLM round-trips. Each round-trip is typically 5-15 seconds. A typical form submission is 30-60+ seconds of LLM time alone.

The gap

A 5-field form is one task step ("fill in the user details") but five LLM round-trips. The LLM round-trip is the dominant latency in most Pilo runs, so cutting the number of round-trips to a half or a third should reduce wall-clock time and LLM cost roughly in proportion.
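As a rough illustration of that arithmetic, the sketch below counts turns for the 5-field form plus a final enter. It assumes every turn uses its full batch allowance and that round-trip latency is constant; `estimateTurns` is a hypothetical helper, not something that exists in Pilo.

```typescript
// Hypothetical helper: LLM turns needed for `actions` sequential actions
// when up to `maxActionsPerStep` of them can be batched into one turn.
function estimateTurns(actions: number, maxActionsPerStep: number): number {
  if (actions < 0 || maxActionsPerStep < 1) {
    throw new RangeError("actions must be >= 0 and maxActionsPerStep >= 1");
  }
  return Math.ceil(actions / maxActionsPerStep);
}

// 5 fills + 1 enter = 6 actions.
const singleActionTurns = estimateTurns(6, 1); // fill x5, then enter
const batchedTurns = estimateTurns(6, 3); // [fill, fill, fill], [fill, fill, enter]
console.log(singleActionTurns, batchedTurns); // 6 2
```

The enter lands last in the second batch, so the page-changing rule proposed below does not cost an extra turn in this case.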

The constraint is also semantically off: many actions are obviously safe to batch (a fill doesn't change the URL or the ref space; multiple fills + an enter at the end is a coherent unit), while others are obviously not (a click on a link changes the page and invalidates remaining refs).

Proposed scope

A. Allow up to N tool calls per turn

Introduce WebAgentOptions.maxActionsPerStep?: number (default 1 for backwards compat; recommend 3 for trustworthy providers).
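A minimal sketch of how the option could be resolved, assuming the default of 1 described above. The interface here shows only the new field, and `resolveMaxActionsPerStep` is a hypothetical helper name.

```typescript
// Only the proposed field is shown; the real WebAgentOptions has more.
interface WebAgentOptions {
  maxActionsPerStep?: number;
}

// Hypothetical helper: apply the default and reject nonsensical values early.
function resolveMaxActionsPerStep(options: WebAgentOptions): number {
  const value = options.maxActionsPerStep ?? 1; // default keeps single-action behavior
  if (!Number.isInteger(value) || value < 1) {
    throw new RangeError(`maxActionsPerStep must be a positive integer, got ${value}`);
  }
  return value;
}
```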

In generateAndProcessAction (webAgent.ts:874-1113):

// Stop requiring exactly one tool
const streamResult = streamText({
  ...this.providerConfig,
  messages: this.messages,
  tools: webActionTools,
  toolChoice: this.maxActionsPerStep > 1 ? "auto" : "required",
  // ...
});

// Process tool results in order
const results: ActionResult[] = [];
for (const toolResult of aiResponse.toolResults.slice(0, this.maxActionsPerStep)) {
  const output = toolResult.output as ActionResult;
  results.push(output);

  // Stop the batch if this action changes the page or is terminal
  if (output.isTerminal) break;
  if (isPageChangingAction(output.action)) break;
  if (!output.success) break; // any failure, recoverable or not, ends the batch (see D)
}

isPageChangingAction returns true for goto, back, forward, click (if it triggers navigation), enter (if it submits), webSearch, and terminal actions. As a conservative default, click and enter are always treated as page-changing: some clicks don't navigate, but we err on the safe side.
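The conservative default could be sketched as a simple set lookup. The action name strings below are taken from the lists in this issue; that actions are identified by plain strings, and the set name itself, are assumptions.

```typescript
// Assumed: actions are identified by their tool name as a string.
// click and enter are always included (conservative default).
const PAGE_CHANGING_ACTIONS: ReadonlySet<string> = new Set([
  "goto",
  "back",
  "forward",
  "click",
  "enter",
  "webSearch",
  "done", // terminal
  "abort", // terminal
]);

function isPageChangingAction(action: string): boolean {
  return PAGE_CHANGING_ACTIONS.has(action);
}
```

A later refinement could inspect the result of a click to see whether navigation actually happened, but a static set keeps the first version easy to reason about.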

B. Single AGENT_REASONED per turn, multiple AGENT_ACTION

Today the event stream emits one AGENT_REASONED at reasoning-end and one AGENT_ACTION per tool. Keep this shape; per-turn batch just emits multiple AGENT_ACTION events in sequence.

C. System prompt updates

Change the "EXACTLY ONE tool" instructions to "Up to {{ maxActionsPerStep }} tool(s) per turn":

**Action batching:**
You may call up to {{ maxActionsPerStep }} tools in one turn. Use this for related
actions that don't change the page — for example, filling several fields of the
same form. After a page-changing action (click, enter, goto, back, forward, webSearch),
remaining actions in the batch are skipped because the refs become stale.

Safe to batch:
- Multiple fill() calls into one form
- focus() + send_keys() to navigate a combobox
- check() / uncheck() across related checkboxes

Page-changing — must be last or alone:
- click() (may trigger navigation)
- enter() (may submit a form)
- goto(), back(), forward()
- webSearch()
- done(), abort()

The instruction in toolCallInstruction (prompts.ts:213-217) and its echoes in the per-step user message and error feedback templates need parallel updates.

D. Error handling for mid-batch failures

If action 2 of a 3-action batch fails with a recoverable error (e.g., element ref stale), stop the batch and proceed as if a single-action turn errored. The next iteration's snapshot will refresh refs.

If action 1 succeeds, action 2 fails recoverably, and action 3 was queued but never ran, results contains just [action1, action2]. The next AI generation message will include both tool results so the model sees what happened.
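The three-action scenario can be sketched as follows. This is a simplified, synchronous illustration: `runUntilFailure` and the thunk-based queue are hypothetical, and the `ActionResult` shape is reduced to the two fields the example needs.

```typescript
// Reduced result shape for illustration only.
interface ActionResult {
  action: string;
  success: boolean;
}

// Hypothetical helper: run queued actions in order and stop after the first
// failure, so later actions never execute against possibly stale refs.
function runUntilFailure(queued: Array<() => ActionResult>): ActionResult[] {
  const results: ActionResult[] = [];
  for (const run of queued) {
    const result = run();
    results.push(result);
    if (!result.success) break;
  }
  return results;
}

const executed: string[] = [];
const step = (action: string, success: boolean) => (): ActionResult => {
  executed.push(action); // record that this action actually ran
  return { action, success };
};

const results = runUntilFailure([
  step("fill#1", true),
  step("fill#2", false), // e.g. stale element ref
  step("fill#3", true), // queued, never runs
]);
console.log(results.map((r) => r.action).join(", ")); // fill#1, fill#2
```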

E. Telemetry

Add to the AI_GENERATION event:

{ actionsRequested: N, actionsExecuted: M, batchTruncatedBy: "page-change" | "terminal" | "error" | "none" }
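One way to derive these fields is from the last executed result, as sketched below. `summarizeBatch` and `ExecutedAction` are hypothetical names. The proposed union has no value for a batch cut short only by the maxActionsPerStep cap, so this sketch reports "none" in that case.

```typescript
type BatchTruncatedBy = "page-change" | "terminal" | "error" | "none";

interface ExecutedAction {
  action: string;
  success: boolean;
  isTerminal?: boolean;
}

interface BatchTelemetry {
  actionsRequested: number;
  actionsExecuted: number;
  batchTruncatedBy: BatchTruncatedBy;
}

// Hypothetical helper: classify why a batch stopped early, if it did.
function summarizeBatch(
  actionsRequested: number,
  executed: ExecutedAction[],
  pageChanging: ReadonlySet<string>,
): BatchTelemetry {
  const last = executed[executed.length - 1];
  let batchTruncatedBy: BatchTruncatedBy = "none";
  // Only a batch that ran fewer actions than requested counts as truncated.
  if (last !== undefined && executed.length < actionsRequested) {
    if (!last.success) batchTruncatedBy = "error";
    else if (last.isTerminal) batchTruncatedBy = "terminal";
    else if (pageChanging.has(last.action)) batchTruncatedBy = "page-change";
  }
  return { actionsRequested, actionsExecuted: executed.length, batchTruncatedBy };
}
```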

Implementation notes

  • This is a substantial change to the inner loop. Recommend a feature flag (or just gating on maxActionsPerStep > 1) so the new path can be enabled cautiously and rolled back if issues surface.
  • Test extensively across providers. Some providers may always emit one tool call regardless of toolChoice; the change should still work (just no win for them). Some may emit many; the truncation needs to be tested.
  • The repetition detector (checkAndHandleRepeatedAction) currently checks one action per iteration. With batching, it should check each action in the batch, not just the last one.
  • The validator-on-done flow: if done is in a batch, only execute the actions before done, then run validation on done's result. Don't execute anything after done in the batch.
  • Update the system-prompt guidance about ref invalidation: refs ARE stable within a batch (no snapshot happens between batch actions). The instruction is "after a page-changing action, refs are stale" — within the batch this means the page-changer is last.
  • Some tools have implicit page changes that aren't obvious (e.g., a click on a <button type="submit"> inside a form). Conservative classification is fine — we lose a small amount of batch efficiency, not correctness.
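For the repetition-detector note above, per-action tracking could look roughly like this. It is a sketch only: the internals of checkAndHandleRepeatedAction are not shown in this issue, so `RepetitionTracker`, its signature format, and the threshold are all assumptions.

```typescript
// Hypothetical tracker: flags an action once the same (action, args)
// signature has been recorded `threshold` times in a row.
class RepetitionTracker {
  private readonly threshold: number;
  private lastSignature: string | null = null;
  private streak = 0;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  record(action: string, args: unknown): boolean {
    const signature = `${action}:${JSON.stringify(args)}`;
    this.streak = signature === this.lastSignature ? this.streak + 1 : 1;
    this.lastSignature = signature;
    return this.streak >= this.threshold;
  }
}

// With batching, every action in the batch is recorded, not just the last.
const tracker = new RepetitionTracker(3);
const batch = [
  { action: "fill", args: { ref: "e1", value: "a" } },
  { action: "fill", args: { ref: "e1", value: "a" } },
  { action: "fill", args: { ref: "e1", value: "a" } },
];
const flags = batch.map((item) => tracker.record(item.action, item.args));
console.log(flags.join(", ")); // false, false, true
```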

Acceptance criteria

  • maxActionsPerStep option (default 1) controls batch size.
  • With maxActionsPerStep: 3, the agent can fill 3 form fields in one LLM turn.
  • Page-changing actions correctly terminate the batch.
  • Recoverable errors terminate the batch.
  • Telemetry surfaces batch size and termination reason.
  • Existing single-action behavior is unchanged when default is kept at 1.
  • Tests cover: 3-fill batch, fill + enter terminating batch, mid-batch ref-stale error terminating batch, batch with done last, batch with done mid-position (rejected by validator or treated as terminal).
  • Manual eval: form-heavy task (e.g., signup or checkout) shows 2-3× latency reduction at maxActionsPerStep: 3 vs 1.

Effort estimate

3-5 days. Most of the time is testing across providers and validating edge cases (page-change detection, mid-batch errors, repetition detection over a batch).

Related issues

Affects the repetition signature fix (the detector needs per-action tracking, not per-iteration). Affects prompt caching (longer histories = more to cache; this is complementary, not in conflict).

Files likely affected

  • packages/core/src/webAgent.ts (generateAndProcessAction, ExecutionState)
  • packages/core/src/prompts.ts (tool call instructions across multiple templates)
  • packages/core/src/types/ (WebAgentOptions)
  • packages/core/src/events.ts
  • packages/core/test/webAgent.test.ts

Labels: enhancement
