OBSERVE. DECIDE. ACT.
agent-desktop is a native desktop automation CLI designed for AI agents, built with Rust. It gives structured access to any application through OS accessibility trees — no screenshots, no pixel matching, no browser required.
- Native Rust CLI: Fast, single binary, no runtime dependencies
- C-ABI cdylib (libagent_desktop_ffi): Load once from Python / Swift / Go / Ruby / Node / C instead of forking the CLI per call
- 54 commands: Observation, interaction, keyboard, mouse, notifications, clipboard, window management, plus a bundled skills doc loader
- Progressive skeleton traversal: 78–96% token reduction on dense apps via shallow overview + targeted drill-down
- Snapshot & refs: AI-optimized workflow using compact snapshot IDs and deterministic element references (@e1, @e2)
- Headless-by-default interactions: Ref actions use accessibility APIs and block silent focus, cursor, keyboard, or pasteboard side effects
- Structured JSON output: Machine-readable responses with error codes and recovery hints
- Works with any app: Finder, Safari, System Settings, Xcode, Slack — anything with an accessibility tree
npm install -g agent-desktop # downloads prebuilt binary automatically

Or without installing:

npx agent-desktop snapshot --app Finder -i

Or build from source:

git clone https://github.com/lahfir/agent-desktop
cd agent-desktop
cargo build --release
cp target/release/agent-desktop /usr/local/bin/

Requires Rust 1.85+ and macOS 13.0+.
macOS requires Accessibility permission. Screenshots also require Screen Recording permission. Grant them in System Settings > Privacy & Security by adding the app that launches agent-desktop, or:
agent-desktop permissions --request # trigger platform permission request path

Permission fields are explicit objects, for example:
{
"accessibility": { "state": "granted" },
"screen_recording": { "state": "denied", "suggestion": "Grant Screen Recording permission" },
"automation": { "state": "not_required" }
}

Every GitHub Release ships a prebuilt C-ABI cdylib (libagent_desktop_ffi) for macOS, Linux, and Windows alongside the CLI tarballs. dlopen it and call the functions declared in agent_desktop.h for in-process calls instead of fork-exec per command.
import ctypes
lib = ctypes.CDLL("./lib/libagent_desktop_ffi.dylib")
lib.ad_init(1) # verify ABI major (AD_ABI_VERSION_MAJOR) before any call
adapter = lib.ad_adapter_create()
# observe -> act: ad_snapshot -> parse an @e ref -> ad_execute_by_ref ...
lib.ad_adapter_destroy(adapter)

Full consumer guide (entrypoints, ownership, threading, error-handling, build/link, release archives, and verification): skills/agent-desktop-ffi/.
For dense apps (Slack, VS Code, Notion), use progressive skeleton traversal to minimize token usage:
# 1. Shallow overview — depth-3 map, truncated containers show children_count
agent-desktop snapshot --skeleton --app Slack -i --compact
# Keep snapshot_id, for example s8f3k2p9
# 2. Drill into a region of interest (named containers get refs as drill targets)
agent-desktop snapshot --root @e3 --snapshot s8f3k2p9 -i --compact
# 3. Act on an element found in the drill-down
agent-desktop click @e12 --snapshot s8f3k2p9
# 4. Re-drill the same region to verify the state change
agent-desktop snapshot --root @e3 --snapshot s8f3k2p9 -i --compact

For simple apps, a full snapshot is fine:
agent-desktop snapshot --app Finder -i # get interactive elements with refs and snapshot_id
agent-desktop click @e3 --snapshot s8f3k2p9 # click a button by ref
agent-desktop type @e5 --snapshot s8f3k2p9 "quarterly report" # insert text into a field
agent-desktop press cmd+s # keyboard shortcut
agent-desktop snapshot -i # re-observe after UI changes

Agent loop: snapshot → decide → act → snapshot → decide → act → ...
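The loop above can be sketched in Python. This is a minimal sketch under stated assumptions, not part of agent-desktop: `snapshot_argv`, `click_argv`, `agent_loop`, and the injected `observe`, `decide`, and `act` callables are hypothetical names, and the only command lines it builds are the ones shown above. The callables are injected so the loop can be exercised without the binary; how a snapshot ID is read from the CLI output is left to `observe`.

```python
from typing import Any, Callable, List, Optional, Tuple

Argv = List[str]


def snapshot_argv(app: str) -> Argv:
    # Mirrors: agent-desktop snapshot --app <app> -i
    return ["agent-desktop", "snapshot", "--app", app, "-i"]


def click_argv(ref: str, snapshot_id: str) -> Argv:
    # Mirrors: agent-desktop click <ref> --snapshot <snapshot_id>
    return ["agent-desktop", "click", ref, "--snapshot", snapshot_id]


def agent_loop(
    app: str,
    observe: Callable[[Argv], Tuple[str, Any]],  # hypothetical: runs snapshot, returns (snapshot_id, tree)
    decide: Callable[[Any], Optional[str]],      # hypothetical: picks a ref, or None when done
    act: Callable[[Argv], None],                 # hypothetical: runs the action command
    max_steps: int = 10,
) -> int:
    """Run snapshot -> decide -> act until decide returns None; return the action count."""
    steps = 0
    for _ in range(max_steps):
        snapshot_id, tree = observe(snapshot_argv(app))
        ref = decide(tree)
        if ref is None:
            break
        # Pin the action to the observation it was decided on.
        act(click_argv(ref, snapshot_id))
        steps += 1
    return steps
```

In a real agent, `observe` and `act` would typically wrap `subprocess.run` and parse the JSON printed on stdout.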
Use the same --session <id> when multiple agents coordinate on one desktop task. A session owns a latest-snapshot pointer, not a security boundary. Each snapshot gets its own snapshot_id; pass --snapshot <id> when an agent must act on a specific observation. Explicit snapshot IDs can be used without repeating --session; keep --session when you omit --snapshot and want that session's latest snapshot.
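The resolution rule in this paragraph can be modeled in a few lines of Python. This is a toy model, not the CLI's implementation: `resolve_snapshot`, `latest_by_session`, and `known` are hypothetical names, and using `None` as the key for an unnamed session is an assumption made only for illustration.

```python
from typing import Dict, Optional, Set


def resolve_snapshot(
    explicit: Optional[str],                        # value of --snapshot, if given
    session: Optional[str],                         # value of --session, if given
    latest_by_session: Dict[Optional[str], str],    # hypothetical store of latest-snapshot pointers
    known: Set[str],                                # hypothetical set of saved snapshot IDs
) -> str:
    """Return the snapshot ID a command would act on under the documented rule."""
    if explicit is not None:
        # An explicit ID resolves directly; the session is not consulted.
        if explicit not in known:
            raise LookupError("SNAPSHOT_NOT_FOUND")
        return explicit
    # Without --snapshot, fall back to the session's latest-snapshot pointer.
    latest = latest_by_session.get(session)
    if latest is None:
        raise LookupError("SNAPSHOT_NOT_FOUND")
    return latest
```

The model makes the trade-off visible: an explicit ID always names one observation, while the session pointer moves whenever any agent in the session takes a new snapshot.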
flowchart LR
S["--session release-fix"] --> A["snapshot -> s1"]
S --> B["snapshot -> s2"]
A --> C["Agent A: click @e4 --snapshot s1"]
B --> D["Agent B: wait --element @e9 --predicate actionable"]
S --> E["latest_snapshot_id points at newest snapshot"]
C --> F["Explicit snapshot id works outside session too"]
agent-desktop --session release-fix snapshot --app Xcode -i --compact
agent-desktop --session release-fix wait --element @e9 --predicate actionable --timeout 5000
agent-desktop --session release-fix click @e9
agent-desktop click @e9 --snapshot s2

agent-desktop snapshot --app Safari -i # accessibility tree with refs
agent-desktop snapshot --surface menu # capture open menu
agent-desktop screenshot --app Finder # PNG screenshot
agent-desktop find --role button --app TextEdit # search by role, name, value, text
agent-desktop get @e3 --snapshot s8f3k2p9 --property value # read element property
agent-desktop is @e7 --snapshot s8f3k2p9 --property checked # check boolean state
agent-desktop list-surfaces --app Notes # list menus, sheets, popovers, alerts

get and is resolve the ref once, prefer live platform reads when available, and fall back only when that live read is unsupported by the adapter.
agent-desktop click @e3 # semantic AX-first click
agent-desktop double-click @e3 # AXOpen; physical double-click uses --headed mouse-click --count 2
agent-desktop triple-click @e3 # POLICY_DENIED if physical input is disabled
agent-desktop right-click @e3 # open verified context menu
agent-desktop type @e5 "hello world" # insert text into element
agent-desktop set-value @e5 "new value" # set value directly via AX
agent-desktop clear @e5 # clear element value
agent-desktop focus @e5 # set keyboard focus
agent-desktop select @e9 "Option B" # select verified dropdown/list option
agent-desktop toggle @e12 # flip checkbox or switch
agent-desktop check @e12 # idempotent check
agent-desktop uncheck @e12 # idempotent uncheck
agent-desktop expand @e15 # expand disclosure/tree item
agent-desktop collapse @e15 # collapse disclosure/tree item
agent-desktop scroll @e1 --direction down --amount 3 # scroll (AX-first)
agent-desktop scroll-to @e20 # scroll element into view

(macOS, Phase 1) Pure cursor gestures have no accessibility equivalent, so triple-click, hover, and drag are always physical; double-click is headless via AXOpen and only needs --headed for gesture-only targets. Windows (UIA) and Linux (AT-SPI) adapters may expose different capabilities. See skills/agent-desktop/references/commands-interaction.md.
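A wrapper that shells out to the CLI might centralize the headed/headless distinction described above. The sketch below is an assumption-laden illustration: `PHYSICAL_COMMANDS`, `needs_headed`, and `build_argv` are hypothetical names, and treating `--headed` as the switch that permits physical input is inferred from the examples in this README rather than confirmed by it.

```python
from typing import List, Sequence

# Commands the note above calls "always physical", plus the raw mouse
# commands that this README only ever shows with --headed.
PHYSICAL_COMMANDS = frozenset(
    {"triple-click", "hover", "drag", "mouse-click", "mouse-down", "mouse-up"}
)


def needs_headed(command: str) -> bool:
    """True when the command has no accessibility-only path in this model."""
    return command in PHYSICAL_COMMANDS


def build_argv(command: str, args: Sequence[str]) -> List[str]:
    """Prefix --headed only for commands that require physical input."""
    argv = ["agent-desktop"]
    if needs_headed(command):
        argv.append("--headed")
    return argv + [command] + list(args)
```

Keeping the decision in one place means ref actions such as click stay headless by default, and physical input is opted into per command.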
agent-desktop press cmd+s # key combo
agent-desktop press cmd+shift+z # multi-modifier
agent-desktop press escape # single key
agent-desktop key-down shift # hold key
agent-desktop key-up shift # release key

agent-desktop --headed hover @e3 # move cursor to element
agent-desktop --headed hover --xy 500,300 # move cursor to coordinates
agent-desktop --headed drag --from @e3 --to @e8 # drag between elements
agent-desktop --headed drag --from-xy 100,200 --to-xy 400,200 # drag between coordinates
agent-desktop --headed mouse-click --xy 500,300 # click at coordinates
agent-desktop --headed mouse-down --xy 500,300 # press at coordinates
agent-desktop --headed mouse-up --xy 500,300 # release at coordinates

agent-desktop launch Safari # launch app by name
agent-desktop launch com.apple.Safari # launch by bundle ID
agent-desktop close-app Safari # quit app
agent-desktop close-app Safari --force # force quit (SIGTERM, then SIGKILL if needed)
agent-desktop list-apps # list running GUI apps
agent-desktop list-windows # list visible windows
agent-desktop list-windows --app Finder # windows for specific app
agent-desktop focus-window w-4521 # bring window to front
agent-desktop resize-window w-4521 800 600 # resize
agent-desktop move-window w-4521 100 100 # move
agent-desktop minimize w-4521 # minimize
agent-desktop maximize w-4521 # maximize
agent-desktop restore w-4521 # restore

agent-desktop list-notifications # list all notifications
agent-desktop list-notifications --app "Slack" # filter by app
agent-desktop list-notifications --text "deploy" --limit 5 # filter by text
agent-desktop dismiss-notification 1 # dismiss by index
agent-desktop dismiss-all-notifications # dismiss all
agent-desktop dismiss-all-notifications --app "Slack" # dismiss all from app
agent-desktop notification-action 1 --action "Reply" # click action button

agent-desktop clipboard-get # read clipboard text
agent-desktop clipboard-set "copied" # write to clipboard
agent-desktop clipboard-clear # clear clipboard

agent-desktop wait 500 # sleep 500ms
agent-desktop wait --element @e3 --timeout 5000 # wait for element
agent-desktop wait --element @e3 --predicate actionable # wait until safe to act
agent-desktop wait --element @e5 --predicate value --value ready
agent-desktop wait --window "Save" --timeout 10000 # wait for window
agent-desktop wait --text "Loading complete" --app Safari # wait for text
agent-desktop wait --text "Done" --count 1 --app Xcode # wait for exact match count
agent-desktop wait --notification --text "Build Succeeded" # wait for new matching notification
agent-desktop wait --menu --timeout 3000 # wait for menu

agent-desktop batch '[
{"command": "click", "args": {"ref_id": "@e2", "snapshot": "<snapshot_id>"}},
{"command": "type", "args": {"ref_id": "@e5", "snapshot": "<snapshot_id>", "text": "hello"}},
{"command": "press", "args": {"combo": "return"}}
]' --stop-on-error
agent-desktop --session run-a batch '[
{"command": "snapshot", "args": {"app": "Finder", "interactive_only": true}},
{"command": "status", "session": "run-b", "args": {}}
]'

agent-desktop status # platform, permission report, latest snapshot
agent-desktop permissions # check accessibility/screen-recording/automation
agent-desktop permissions --request # invoke platform request path
agent-desktop version # version string

agent-desktop snapshot [OPTIONS]

| Flag | Default | Description |
|---|---|---|
| --app <NAME> | focused app | Filter to a specific application |
| --window-id <ID> | - | Filter to a specific window |
| -i / --interactive-only | off | Only include interactive elements |
| --compact | off | Omit empty structural nodes |
| --include-bounds | off | Include pixel bounds (x, y, width, height) |
| --max-depth <N> | 10 | Maximum tree depth |
| --skeleton | off | Shallow 3-level overview; truncated containers show children_count and get refs as drill targets |
| --root <REF> | - | Start traversal from this ref; merges into existing refmap with scoped invalidation |
| --snapshot <snapshot_id> | latest | Snapshot ID to use when resolving --root |
| --surface <TYPE> | window | window, focused, menu, menubar, sheet, popover, alert |
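When driving the CLI from a program, the flags in this table can be kept in one typed structure. The following is a minimal sketch: `SnapshotOptions` and `to_argv` are hypothetical names, the defaults are copied from the table, and only non-default flags are emitted.

```python
from dataclasses import dataclass
from typing import List, Optional

# Surface values listed in the table above.
SURFACES = frozenset(
    {"window", "focused", "menu", "menubar", "sheet", "popover", "alert"}
)


@dataclass
class SnapshotOptions:
    app: Optional[str] = None
    window_id: Optional[str] = None
    interactive_only: bool = False
    compact: bool = False
    include_bounds: bool = False
    max_depth: int = 10
    skeleton: bool = False
    root: Optional[str] = None
    snapshot: Optional[str] = None
    surface: str = "window"


def to_argv(opts: SnapshotOptions) -> List[str]:
    """Translate options into a snapshot command line, skipping defaults."""
    if opts.surface not in SURFACES:
        raise ValueError(f"unknown surface: {opts.surface}")
    argv = ["agent-desktop", "snapshot"]
    if opts.app is not None:
        argv += ["--app", opts.app]
    if opts.window_id is not None:
        argv += ["--window-id", opts.window_id]
    if opts.interactive_only:
        argv.append("--interactive-only")
    if opts.compact:
        argv.append("--compact")
    if opts.include_bounds:
        argv.append("--include-bounds")
    if opts.max_depth != 10:
        argv += ["--max-depth", str(opts.max_depth)]
    if opts.skeleton:
        argv.append("--skeleton")
    if opts.root is not None:
        argv += ["--root", opts.root]
    if opts.snapshot is not None:
        argv += ["--snapshot", opts.snapshot]
    if opts.surface != "window":
        argv += ["--surface", opts.surface]
    return argv
```

For example, `to_argv(SnapshotOptions(root="@e3", snapshot="s8f3k2p9", interactive_only=True, compact=True))` produces the drill-down command used in the skeleton workflow, with `--interactive-only` in place of `-i`.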
Every command returns structured JSON:
{
"version": "2.0",
"ok": true,
"command": "click",
"data": { "action": "click" }
}

Errors include machine-readable codes and recovery hints:
{
"version": "2.0",
"ok": false,
"command": "click",
"error": {
"code": "STALE_REF",
"message": "Element at @e7 no longer matches the last snapshot",
"suggestion": "Run 'snapshot' to refresh refs, then retry"
}
}

| Code | Meaning |
|---|---|
| PERM_DENIED | Accessibility permission not granted |
| ELEMENT_NOT_FOUND | No element matched the ref or query |
| APP_NOT_FOUND | Application not running or no windows |
| STALE_REF | Ref could not be re-identified in the live UI |
| AMBIGUOUS_TARGET | Ref recovery matched multiple plausible targets |
| SNAPSHOT_NOT_FOUND | Snapshot ID is missing or expired |
| POLICY_DENIED | Physical/headed path blocked by policy |
| ACTION_FAILED | The OS rejected the action |
| PLATFORM_NOT_SUPPORTED | Adapter method not implemented on this platform |
| TIMEOUT | Wait condition expired |
| INVALID_ARGS | Invalid argument values |
Exit codes: 0 on success, 1 on a structured error (JSON on stdout), 2 on an argument parse error.
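A caller can combine the exit code with the JSON envelope to decide what to do next. The sketch below assumes only the fields shown in the examples above (`ok`, `command`, `data`, `error.code`, `error.suggestion`); `Outcome` and `interpret` are hypothetical names, and raising `ValueError` for exit code 2 is a choice of this sketch.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Outcome:
    ok: bool
    command: Optional[str]
    data: Optional[Dict[str, Any]] = None
    code: Optional[str] = None
    suggestion: Optional[str] = None


def interpret(exit_code: int, stdout: str) -> Outcome:
    """Map an exit code plus stdout to a success or a structured error."""
    if exit_code == 2:
        # Argument parse error: the command line itself was malformed.
        raise ValueError("argument parse error")
    if exit_code not in (0, 1):
        raise RuntimeError(f"unexpected exit code {exit_code}")
    body = json.loads(stdout)
    if body.get("ok"):
        return Outcome(True, body.get("command"), data=body.get("data"))
    error = body.get("error") or {}
    return Outcome(
        False,
        body.get("command"),
        code=error.get("code"),
        suggestion=error.get("suggestion"),
    )
```

An agent would typically branch on `code` (for example, re-snapshot on `STALE_REF`) and surface `suggestion` to the model as the recovery hint.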
snapshot assigns refs to interactive elements in depth-first order: @e1, @e2, @e3, etc. Refs are scoped to a compact snapshot_id such as s8f3k2p9. Commands can omit --snapshot to use the active session's latest snapshot pointer, but passing the ID is more deterministic in multi-step flows and does not require also passing --session.
Interactive roles that receive refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell, radiobutton, incrementor, menubutton, switch, colorwell, dockitem.
Static elements (labels, groups, containers) appear in the tree for context but have no ref.
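The ref assignment rule can be illustrated on a toy tree. This is a model for intuition only: the node shape (`role`, `name`, `children`) and the function `assign_refs` are hypothetical, and pre-order traversal is an assumption about what "depth-first order" means here.

```python
from typing import Any, Dict

# Roles listed above as receiving refs.
INTERACTIVE_ROLES = frozenset({
    "button", "textfield", "checkbox", "link", "menuitem", "tab", "slider",
    "combobox", "treeitem", "cell", "radiobutton", "incrementor",
    "menubutton", "switch", "colorwell", "dockitem",
})


def assign_refs(root: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Walk the tree depth-first (pre-order) and number interactive nodes."""
    refs: Dict[str, Dict[str, Any]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node["role"] in INTERACTIVE_ROLES:
            refs[f"@e{len(refs) + 1}"] = node
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.get("children", [])))
    return refs


tree = {
    "role": "window", "name": "Save As", "children": [
        {"role": "group", "name": "Form", "children": [
            {"role": "button", "name": "Save"},
            {"role": "textfield", "name": "Name"},
        ]},
        {"role": "label", "name": "Format"},
        {"role": "checkbox", "name": "Hide extension"},
    ],
}
refs = assign_refs(tree)
```

Here the window, group, and label stay in the tree for context, while the button, text field, and checkbox become `@e1`, `@e2`, and `@e3`.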
Reliability contract:
- --session <id> scopes the latest snapshot pointer to one caller or agent team; explicit --snapshot <id> resolves the saved snapshot directly.
- Ref actions re-identify targets at action time: a moved unique target can proceed, while missing or changed stable identity returns STALE_REF.
- Mutable value text is not treated as stable identity, so text fields and timers can keep resolving when the saved window, path, role, and bounds evidence still identify the same element.
- Multiple plausible targets return AMBIGUOUS_TARGET instead of choosing arbitrarily.
- Actions run an actionability preflight before dispatch: visibility, stability, enabled state, supported action, policy, and editability.
- wait --element @e3 --predicate actionable polls until the target can be acted on.
- --trace <path> appends JSONL diagnostics outside stdout; --trace-strict fails on trace setup and pre-action trace writes, while post-action success traces are best-effort after the desktop mutation has already happened.
Stale ref recovery:
snapshot → act → STALE_REF or AMBIGUOUS_TARGET? → wait/snapshot again → retry with the new ref
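The recovery flow above amounts to a bounded retry loop. The sketch below is one way to write it, with assumptions labeled: `act_with_recovery`, `act`, and `reobserve` are hypothetical names, responses are the parsed JSON envelopes shown earlier, and the attempt limit is a choice of this sketch.

```python
from typing import Any, Callable, Dict

# Codes that the flow above treats as "observe again, then retry".
RECOVERABLE = frozenset({"STALE_REF", "AMBIGUOUS_TARGET"})


def act_with_recovery(
    act: Callable[[str], Dict[str, Any]],  # hypothetical: runs the action, returns parsed JSON
    reobserve: Callable[[str], str],       # hypothetical: waits/snapshots again, returns the new ref
    ref: str,
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Retry an action after re-observing when the ref went stale or ambiguous."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response: Dict[str, Any] = {}
    for attempt in range(max_attempts):
        response = act(ref)
        code = (response.get("error") or {}).get("code")
        if response.get("ok") or code not in RECOVERABLE:
            # Success, or an error that a fresh snapshot will not fix.
            return response
        if attempt + 1 < max_attempts:
            ref = reobserve(ref)
    return response
```

Errors such as `PERM_DENIED` or `POLICY_DENIED` are returned immediately, since taking another snapshot would not change the outcome.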
| | macOS | Windows | Linux |
|---|---|---|---|
| Accessibility tree | Yes | Planned | Planned |
| Click / type / keyboard | Yes | Planned | Planned |
| Mouse input | Yes | Planned | Planned |
| Screenshot | Yes | Planned | Planned |
| Clipboard | Yes | Planned | Planned |
| App & window management | Yes | Planned | Planned |
| Notifications | Yes | Planned | Planned |
cargo build # debug build
cargo build --release # optimized (<15MB)
cargo test --lib --workspace # run tests
cargo clippy --all-targets -- -D warnings # lint (must pass with zero warnings)

agent-desktop is a native desktop automation CLI for AI agents. It lets agents observe and control desktop apps through OS accessibility trees, using structured JSON instead of screenshots, pixel matching, or browser-only automation.
Screenshots are not required. The core workflow reads native accessibility trees and assigns refs to interactive elements. Screenshots are available as a separate command, but agents do not need screenshots or pixel matching to click buttons, type into fields, inspect menus, or navigate app windows.
| Component | Function |
|---|---|
| Native Rust CLI | Fast, single binary, no runtime dependencies |
| C-ABI cdylib | Load once from Python, Swift, Go, Ruby, Node, or C instead of forking |
| 54 Commands | Observation, interaction, keyboard, mouse, notifications, clipboard, window management, and bundled skills docs |
| Snapshot & Refs | Compact snapshot IDs and deterministic element refs like @e1, @e2 |
| Structured JSON | Machine-readable responses with error codes and recovery hints |
| Feature | Benefit |
|---|---|
| Progressive Skeleton Traversal | 78–96% token reduction on dense apps |
| Headless-by-Default Actions | Ref actions use accessibility APIs and block unintended physical side effects |
| Snapshot Refs | Agents act on stable refs within a snapshot instead of guessing coordinates |
| Recovery Hints | Errors include machine-readable codes and suggestions for the next agent step |
| Cross-Language FFI | Python, Swift, Go, Ruby, Node, C, and C++ hosts can call the native library directly |
| Feature | macOS | Windows | Linux |
|---|---|---|---|
| Accessibility tree | Yes | Planned | Planned |
| Click/type/keyboard | Yes | Planned | Planned |
| Mouse input | Yes | Planned | Planned |
| Screenshot | Yes | Planned | Planned |
| Clipboard | Yes | Planned | Planned |
| App/window management | Yes | Planned | Planned |
| Notifications | Yes | Planned | Planned |
Install the CLI from npm:
npm install -g agent-desktop
agent-desktop snapshot --app Safari

Build the FFI library from source:
cargo build --release
# Outputs: libagent_desktop_ffi.dylib/.so/.dll

snapshot assigns refs to interactive elements in depth-first order: @e1, @e2, @e3, etc. Refs are scoped to a compact snapshot_id such as s8f3k2p9. Commands can omit --snapshot to use the active session's latest snapshot pointer, but explicit snapshot IDs are the deterministic path and do not require also passing --session.
Interactive roles that receive refs:
button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell, radiobutton, incrementor, menubutton, switch, colorwell, dockitem.
Stale ref recovery:
snapshot -> act -> STALE_REF? -> snapshot again -> retry
agent-desktop is Apache-2.0 licensed for personal and commercial use.
| Resource | Link |
|---|---|
| Repository | github.com/lahfir/agent-desktop |
| ClawHub Skill | clawhub.ai/lahfir/agent-desktop |
| skills.sh Listing | skills.sh/lahfir/agent-desktop/agent-desktop |
| npm Package | npmjs.com/package/agent-desktop |
| CI Status | GitHub Actions |
| Releases | GitHub Releases |
| Issues | GitHub Issues |
Apache-2.0


