Current state
Pilo's web action vocabulary (packages/core/src/tools/webActionTools.ts) covers the core interactions: click, fill, select, hover, check, uncheck, focus, enter, wait, goto, back, forward, extract, done, abort. In addition, webSearch, the Tabstack tools, and request_user_data are available as gated tools.
Common interaction patterns the agent cannot currently perform without workarounds:
- Keyboard shortcuts: Escape to dismiss modals, Tab/Shift+Tab to navigate focus, Ctrl+A to select all, ArrowDown/ArrowUp for autocomplete navigation, etc. The only existing keyboard tool is enter, which presses Enter on a specific ref.
- On-demand screenshot: screenshots are only captured at snapshot time when vision: true. The agent has no way to request a fresh screenshot mid-step (e.g., "did that click cause a visual change I should look at?").
- File upload: form file inputs are common. Pilo has no tool to populate them.
- Dropdown option enumeration: when selecting from a <select>, the agent currently has to guess the value or click-to-open the dropdown, see the options, then click an option. A direct "what options does this select have?" tool is cheaper.
The gap
Each missing tool has real-world cases:
- send_keys: sites that don't expose a "close modal" button (rely on Escape), forms that require keyboard navigation (combobox + ArrowDown + Enter to select), Ctrl+S to save in web editors. The system prompt currently mentions Escape (prompts.ts:339) and arrow keys, but the agent has no tool to send them.
- screenshot: diagnostic tool for the agent itself ("the click didn't seem to do anything, let me see"), and a debug surface for users watching task replays. Currently the screenshot is bundled with each snapshot (vision mode) and is wasted on snapshots where vision isn't useful.
- upload_file: any task involving "upload your resume", "attach a screenshot", or "import a CSV" requires a file upload tool.
- dropdown_options: collapses a common multi-step pattern (open dropdown, read options, close, then select) into a single tool call that needs no extra LLM step.
Proposed scope
Four small tool additions. Each can ship independently.
A. send_keys tool
send_keys: tool({
  description:
    "Press a keyboard key or combination on the currently focused element. " +
    "Examples: 'Escape' to dismiss a modal, 'Tab' to move focus, 'Control+a' " +
    "to select all, 'ArrowDown' for combobox navigation.",
  inputSchema: z.object({
    keys: z.string().describe("Key name or combination (e.g., 'Escape', 'Control+a', 'Shift+Tab')"),
    ref: z.string().optional()
      .describe("If provided, focus this element first before pressing keys"),
  }),
  execute: async ({ keys, ref }) => {
    return performActionWithValidation(PageAction.SendKeys, context, ref, keys);
  },
}),
Implementation in playwrightBrowser.ts: page.keyboard.press(keys) (Playwright understands Control+a syntax natively).
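Because the agent supplies the key string, it may help to validate it before it reaches page.keyboard.press, so a malformed value such as 'Ctrl+a' produces a clear error instead of a Playwright exception. The following is a minimal sketch only: parseKeyCombo and KeyCombo are hypothetical names that do not exist in Pilo, and the modifier list assumes the names Playwright documents for keyboard shortcuts.

```typescript
// Hypothetical pre-validation for send_keys; not part of Pilo or Playwright.
// Assumption: these are the modifier names accepted in Playwright shortcuts.
const MODIFIERS: string[] = ["Control", "Shift", "Alt", "Meta", "ControlOrMeta"];

interface KeyCombo {
  modifiers: string[];
  key: string;
}

function parseKeyCombo(input: string): KeyCombo {
  const trimmed = input.trim();
  if (trimmed === "") throw new Error("keys must not be empty");
  // A lone "+" is the plus key with no modifiers.
  if (trimmed === "+") return { modifiers: [], key: "+" };

  let modifiers: string[];
  let key: string;
  if (trimmed.slice(-2) === "++") {
    // "Control++" means Control held while pressing the "+" key itself.
    modifiers = trimmed.slice(0, -2).split("+");
    key = "+";
  } else {
    const parts = trimmed.split("+");
    key = parts.pop() ?? "";
    modifiers = parts;
  }

  if (key === "") throw new Error(`Missing key in "${input}"`);
  for (const m of modifiers) {
    if (MODIFIERS.indexOf(m) === -1) {
      throw new Error(
        `Unknown modifier "${m}" in "${input}" (expected one of ${MODIFIERS.join(", ")})`
      );
    }
  }
  return { modifiers, key };
}
```

A handler could call parseKeyCombo(keys) first and return the error message to the agent as the action result, then pass the original string through to Playwright unchanged.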
B. screenshot tool
screenshot: tool({
  description:
    "Take a screenshot of the current page. The image will be included in the next " +
    "page snapshot. Use sparingly; most decisions can be made from the accessibility " +
    "tree alone.",
  inputSchema: z.object({
    fullPage: z.boolean().default(false)
      .describe("Capture the entire page including content below the fold"),
    ref: z.string().optional()
      .describe("If provided, screenshot only this element"),
  }),
  execute: async ({ fullPage, ref }) => {
    const buf = await context.browser.getScreenshot({ withMarks: true, fullPage, ref });
    // Emit BROWSER_SCREENSHOT_CAPTURED event
    // Attach to next user message? Or include in the action result?
    // Probably: include in the next snapshot, and emit immediately.
  },
}),
The exact threading (where the screenshot data ends up in messages) needs design. Simplest: the action result includes a screenshotAvailable: true flag; the next snapshot user message includes the captured image. Vision mode behavior is unchanged.
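One way to picture the "simplest" threading described above is a single-slot buffer: the tool stores the image and returns only a flag, and the snapshot builder drains the slot when it assembles the next user message. This is a sketch under assumptions, not Pilo's message format: ScreenshotBuffer, MessagePart, and buildSnapshotMessage are hypothetical names, and the image is assumed to be PNG bytes.

```typescript
// Hypothetical message part shape; Pilo's real message types may differ.
type MessagePart =
  | { type: "text"; text: string }
  | { type: "image"; image: Uint8Array; mimeType: string };

interface PendingScreenshot {
  data: Uint8Array;
  mimeType: string;
}

// Holds at most one screenshot between the tool call and the next snapshot.
class ScreenshotBuffer {
  private pending: PendingScreenshot | null = null;

  // Called from the screenshot tool's execute(); returns the action result.
  store(data: Uint8Array): { success: true; screenshotAvailable: true } {
    this.pending = { data, mimeType: "image/png" };
    return { success: true, screenshotAvailable: true };
  }

  // Returns the stored screenshot once, then clears the slot.
  drain(): PendingScreenshot | null {
    const shot = this.pending;
    this.pending = null;
    return shot;
  }
}

// Builds the next snapshot user message, attaching a pending image if any.
function buildSnapshotMessage(snapshotText: string, buffer: ScreenshotBuffer): MessagePart[] {
  const parts: MessagePart[] = [{ type: "text", text: snapshotText }];
  const shot = buffer.drain();
  if (shot !== null) {
    parts.push({ type: "image", image: shot.data, mimeType: shot.mimeType });
  }
  return parts;
}
```

Draining on read means the image appears in exactly one message, so a screenshot requested once does not keep inflating later snapshots.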
C. upload_file tool
upload_file: tool({
  description:
    "Upload a file to a file input element. The file path must be a local filesystem path " +
    "or a URL that the agent has been authorized to fetch.",
  inputSchema: z.object({
    ref: z.string().describe("Element reference of the file input (or its container)"),
    path: z.string().describe("Local file path or pre-authorized URL"),
  }),
  execute: async ({ ref, path }) => {
    return performActionWithValidation(PageAction.UploadFile, context, ref, path);
  },
}),
Implementation via Playwright's locator.setInputFiles(path). If the ref isn't a file input directly, find the nearest descendant file input. Validate the file exists and is non-empty before passing to Playwright.
Security considerations: this tool can read arbitrary local files. Gate behind a WebAgentOptions.allowFileUpload?: { allowedPaths?: string[] } option so deployments can constrain it. Default: disabled.
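The allow-list check could look roughly like the sketch below. It is illustrative only: isUploadAllowed is a hypothetical helper, and it assumes that an allowFileUpload option with no allowedPaths means "enabled, unrestricted", which the proposal does not spell out. It resolves paths lexically, so it does not follow symlinks; a real implementation would likely also need to resolve the real path before comparing.

```typescript
import * as path from "node:path";

// Mirrors the proposed WebAgentOptions.allowFileUpload shape.
interface FileUploadOptions {
  allowedPaths?: string[];
}

// Hypothetical helper: decides whether filePath may be passed to setInputFiles.
function isUploadAllowed(filePath: string, options: FileUploadOptions | undefined): boolean {
  // Default: disabled when the option is absent.
  if (options === undefined) return false;
  // Assumption: enabled without an allow-list means no path restriction.
  if (options.allowedPaths === undefined) return true;

  const target = path.resolve(filePath);
  return options.allowedPaths.some((dir) => {
    const rel = path.relative(path.resolve(dir), target);
    // Reject the directory itself, anything that climbs out of it,
    // and (on Windows) targets on a different drive.
    const escapes = rel === ".." || rel.startsWith(".." + path.sep);
    return rel !== "" && !escapes && !path.isAbsolute(rel);
  });
}
```

Comparing with path.relative rather than a string prefix avoids accepting a sibling directory such as /srv/uploads-evil when only /srv/uploads is allowed.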
D. dropdown_options tool
dropdown_options: tool({
  description:
    "List the options available in a dropdown (<select>) or ARIA menu. " +
    "Free and fast: use before select() to verify the option you want is present.",
  inputSchema: z.object({
    ref: z.string().describe("Element reference of the dropdown"),
  }),
  execute: async ({ ref }) => {
    return performActionWithValidation(PageAction.DropdownOptions, context, ref);
  },
}),
Implementation: locate the element, check if it's a <select> (return its <option> text + value) or has role="combobox"/role="menu" (return descendants with role="option" or role="menuitem"). Return up to N options (cap at 200 to prevent runaway).
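The extraction logic described above can be sketched against a simplified element model. Everything here is an assumption for illustration: ElementNode stands in for whatever the browser handler actually reads from the DOM, and listDropdownOptions is a hypothetical name. The sketch only looks at descendants, as the proposal states; ARIA comboboxes that point at a separate popup through aria-controls would need extra handling.

```typescript
// Simplified stand-in for a DOM element; not a real Playwright or DOM type.
interface ElementNode {
  tag: string;
  role?: string;
  text: string;
  value?: string;
  children: ElementNode[];
}

interface DropdownOption {
  text: string;
  value: string | null;
}

const MAX_OPTIONS = 200;

// Depth-first walk that stops once one item past the limit has been collected,
// so the caller can tell whether the list was truncated.
function collect(
  node: ElementNode,
  matches: (n: ElementNode) => boolean,
  out: DropdownOption[],
  limit: number
): void {
  for (const child of node.children) {
    if (out.length > limit) return;
    if (matches(child)) {
      out.push({ text: child.text.trim(), value: child.value ?? null });
    } else {
      collect(child, matches, out, limit);
    }
  }
}

function listDropdownOptions(
  el: ElementNode,
  limit: number = MAX_OPTIONS
): { options: DropdownOption[]; truncated: boolean } {
  let matches: (n: ElementNode) => boolean;
  if (el.tag === "select") {
    // Native select: <option> elements, including those nested in <optgroup>.
    matches = (n) => n.tag === "option";
  } else if (el.role === "combobox" || el.role === "menu") {
    matches = (n) => n.role === "option" || n.role === "menuitem";
  } else {
    throw new Error("Element is not a <select>, combobox, or menu");
  }
  const found: DropdownOption[] = [];
  collect(el, matches, found, limit);
  return { options: found.slice(0, limit), truncated: found.length > limit };
}
```

Returning a truncated flag alongside the capped list lets the agent know that the option it wants may exist beyond the first 200 entries.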
Implementation notes
- All four tools follow the existing performActionWithValidation pattern for consistency.
- Each requires a new PageAction.* enum value and a corresponding handler in playwrightBrowser.ts.
- Each needs a tool example added to buildToolExamples in prompts.ts.
- upload_file security is non-trivial. Default to disabled; opt-in via config; document the security model.
- screenshot adds latency and image content to the conversation. Keep its description discouraging overuse ("Use sparingly").
- These can ship as four separate PRs or one. The simpler ones (send_keys, dropdown_options) are 2-4 hours each. screenshot and upload_file are 1 day each.
Acceptance criteria
- All four tools exist in webActionTools.ts (upload_file gated by config), have tool descriptions in prompts, and are exercised in tests.
- send_keys works for single keys and combinations (Control+a, Shift+Tab).
- screenshot produces an image attached to the next snapshot user message; emits the right events.
- upload_file validates the file exists; respects the allow-list when configured; gracefully rejects when not enabled.
- dropdown_options returns options for both <select> and ARIA menus.
- The system prompt now references send_keys for autocomplete/escape patterns (where the current text says "use keyboard navigation" without a tool to do it).
Effort estimate
3-4 days for all four, including tests and prompt updates. Each is independent enough to split into separate PRs if useful.
Related issues
Pairs with the zero-LLM exploration tools issue (dropdown_options is conceptually similar to find_elements). The screenshot tool benefits from the scroll-position context work (knowing what's visible helps decide whether to screenshot).
Files likely affected
packages/core/src/tools/webActionTools.ts
packages/core/src/browser/ariaBrowser.ts (PageAction enum)
packages/core/src/browser/playwrightBrowser.ts (handlers)
packages/core/src/prompts.ts (tool examples)
packages/core/src/config/defaults.ts (upload allow-list)
packages/core/test/