AGENT DESKTOP

OBSERVE. DECIDE. ACT.

agent-desktop is a native desktop automation CLI designed for AI agents, built with Rust. It gives structured access to any application through OS accessibility trees — no screenshots, no pixel matching, no browser required.

Architecture

Key Features

Native Rust CLI: Fast, single binary, no runtime dependencies
C-ABI cdylib (libagent_desktop_ffi): Load once from Python / Swift / Go / Ruby / Node / C instead of forking the CLI per call
54 commands: Observation, interaction, keyboard, mouse, notifications, clipboard, window management, plus a bundled skills doc loader
Progressive skeleton traversal: 78–96% token reduction on dense apps via shallow overview + targeted drill-down
Snapshot & refs: AI-optimized workflow using compact snapshot IDs and deterministic element references (@e1, @e2)
Headless-by-default interactions: Ref actions use accessibility APIs and block silent focus, cursor, keyboard, or pasteboard side effects
Structured JSON output: Machine-readable responses with error codes and recovery hints
Works with any app: Finder, Safari, System Settings, Xcode, Slack — anything with an accessibility tree

Installation

npm (recommended)

npm install -g agent-desktop        # downloads prebuilt binary automatically

Or without installing:

npx agent-desktop snapshot --app Finder -i

From source

git clone https://github.com/lahfir/agent-desktop
cd agent-desktop
cargo build --release
cp target/release/agent-desktop /usr/local/bin/

Requires Rust 1.85+ and macOS 13.0+.

Permissions

macOS requires Accessibility permission. Screenshots also require Screen Recording permission. Grant them in System Settings > Privacy & Security by adding the app that launches agent-desktop, or trigger the platform permission request from the CLI:

agent-desktop permissions --request   # trigger platform permission request path

Permission fields are explicit objects, for example:

{
  "accessibility": { "state": "granted" },
  "screen_recording": { "state": "denied", "suggestion": "Grant Screen Recording permission" },
  "automation": { "state": "not_required" }
}
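
A caller can scan this report before starting a run. The Python sketch below assumes the report is a JSON object shaped like the example above, and that "granted" and "not_required" are the only states that need no action. The helper name blocked_permissions is hypothetical and not part of agent-desktop.

```python
import json

# States taken from the example report. Treating every other state as
# "needs attention" is an assumption of this sketch.
OK_STATES = {"granted", "not_required"}


def blocked_permissions(report_text):
    """Return {permission: suggestion-or-None} for entries that are not OK."""
    report = json.loads(report_text)
    blocked = {}
    for name, entry in report.items():
        if entry.get("state") not in OK_STATES:
            blocked[name] = entry.get("suggestion")
    return blocked


example = """
{
  "accessibility": { "state": "granted" },
  "screen_recording": { "state": "denied", "suggestion": "Grant Screen Recording permission" },
  "automation": { "state": "not_required" }
}
"""
print(blocked_permissions(example))
```

An empty result means nothing in the report is blocking; otherwise the suggestions can be shown to the user as-is.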

Language bindings (FFI)

Every GitHub Release ships a prebuilt C-ABI cdylib (libagent_desktop_ffi) for macOS, Linux, and Windows alongside the CLI tarballs. Load it with dlopen (or your language's equivalent) and call the functions declared in agent_desktop.h to make in-process calls instead of a fork-exec per command.

import ctypes
lib = ctypes.CDLL("./lib/libagent_desktop_ffi.dylib")
lib.ad_init(1)  # verify ABI major (AD_ABI_VERSION_MAJOR) before any call
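# ctypes assumes int return values by default, which truncates 64-bit pointers.
# Assuming the adapter handle is an opaque pointer (check agent_desktop.h):
lib.ad_adapter_create.restype = ctypes.c_void_p
lib.ad_adapter_destroy.argtypes = [ctypes.c_void_p]
adapter = lib.ad_adapter_create()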
# observe -> act: ad_snapshot -> parse an @e ref -> ad_execute_by_ref ...
lib.ad_adapter_destroy(adapter)

The full consumer guide is in skills/agent-desktop-ffi/. It covers entrypoints, ownership, threading, error handling, build/link, release archives, and verification.

Core Workflow for AI

For dense apps (Slack, VS Code, Notion), use progressive skeleton traversal to minimize token usage:

# 1. Shallow overview — depth-3 map, truncated containers show children_count
agent-desktop snapshot --skeleton --app Slack -i --compact
# Keep snapshot_id, for example s8f3k2p9

# 2. Drill into a region of interest (named containers get refs as drill targets)
agent-desktop snapshot --root @e3 --snapshot s8f3k2p9 -i --compact

# 3. Act on an element found in the drill-down
agent-desktop click @e12 --snapshot s8f3k2p9

# 4. Re-drill the same region to verify the state change
agent-desktop snapshot --root @e3 --snapshot s8f3k2p9 -i --compact
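
The idea behind the skeleton pass can be sketched in Python. This illustrates depth-limited traversal and is not the tool's implementation: the node shape (role, name, children) and the helper name skeleton are assumptions. Only the depth of 3 and the children_count field come from the description above.

```python
def skeleton(node, max_depth=3, depth=1):
    """Copy a tree down to max_depth; deeper children become a count."""
    out = {k: v for k, v in node.items() if k != "children"}
    children = node.get("children", [])
    if not children:
        return out
    if depth >= max_depth:
        # Truncated container: report how many children were skipped.
        out["children_count"] = len(children)
    else:
        out["children"] = [skeleton(c, max_depth, depth + 1) for c in children]
    return out


# Hypothetical tree; real snapshots will have a different shape.
tree = {
    "role": "window", "name": "Slack",
    "children": [{
        "role": "group", "name": "Sidebar",
        "children": [{
            "role": "list", "name": "Channels",
            "children": [{"role": "cell", "name": "general"},
                         {"role": "cell", "name": "random"}],
        }],
    }],
}
overview = skeleton(tree)
```

In this toy tree the Channels list sits at depth 3, so the overview keeps the list node and replaces its two cells with a count. A drill-down then only pays for the subtree the agent actually needs.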

For simple apps, a full snapshot is fine:

agent-desktop snapshot --app Finder -i   # get interactive elements with refs and snapshot_id
agent-desktop click @e3 --snapshot s8f3k2p9  # click a button by ref
agent-desktop type @e5 --snapshot s8f3k2p9 "quarterly report"  # insert text into a field
agent-desktop press cmd+s               # keyboard shortcut
agent-desktop snapshot -i               # re-observe after UI changes

Agent loop:  snapshot → decide → act → snapshot → decide → act → ...
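
A minimal Python sketch of this loop, assuming the CLI is on PATH and prints one JSON document on stdout. The names run_cli, agent_loop, and decide are hypothetical; decide stands in for whatever policy or model picks the next command. The dry run at the end uses a stub instead of the real binary.

```python
import json
import subprocess


def run_cli(args, runner=subprocess.run):
    """Run agent-desktop with args and return the parsed JSON response.

    Raises json.JSONDecodeError if stdout is not JSON.
    """
    proc = runner(["agent-desktop", *args], capture_output=True, text=True)
    return json.loads(proc.stdout)


def agent_loop(decide, run=run_cli, max_steps=10):
    """snapshot -> decide -> act, repeated until decide() returns None."""
    history = []
    for _ in range(max_steps):
        observation = run(["snapshot", "-i"])
        action = decide(observation)  # hypothetical policy: argv list or None
        if action is None:
            break
        history.append(run(action))
    return history


# Dry run with a stub runner, so nothing touches the desktop.
calls = []


def fake_run(args):
    calls.append(args)
    return {"ok": True, "command": args[0]}


plan = iter([["click", "@e3"], None])
history = agent_loop(lambda observation: next(plan), run=fake_run)
```

Passing the runner as a parameter keeps the loop testable without a desktop session.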

Shared sessions for multi-agent workflows

Use the same --session <id> when multiple agents coordinate on one desktop task. A session owns a latest-snapshot pointer; it is not a security boundary. Each snapshot gets its own snapshot_id; pass --snapshot <id> when an agent must act on a specific observation. Explicit snapshot IDs work without repeating --session; keep --session when you omit --snapshot and want that session's latest snapshot.

flowchart LR
    S["--session release-fix"] --> A["snapshot -> s1"]
    S --> B["snapshot -> s2"]
    A --> C["Agent A: click @e4 --snapshot s1"]
    B --> D["Agent B: wait --element @e9 --predicate actionable"]
    S --> E["latest_snapshot_id points at newest snapshot"]
    C --> F["Explicit snapshot id works outside session too"]

agent-desktop --session release-fix snapshot --app Xcode -i --compact
agent-desktop --session release-fix wait --element @e9 --predicate actionable --timeout 5000
agent-desktop --session release-fix click @e9
agent-desktop click @e9 --snapshot s2
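
As a mental model only, the relationship between sessions and snapshot IDs can be sketched in Python. The class below is a toy and says nothing about how agent-desktop stores snapshots; SnapshotStore and its methods are hypothetical names.

```python
class SnapshotStore:
    """Toy model: snapshots are addressable by ID, sessions only track 'latest'."""

    def __init__(self):
        self.snapshots = {}  # snapshot_id -> saved observation
        self.latest = {}     # session id -> latest snapshot_id

    def save(self, session, snapshot_id, observation):
        self.snapshots[snapshot_id] = observation
        self.latest[session] = snapshot_id

    def resolve(self, session=None, snapshot_id=None):
        # An explicit snapshot ID wins and needs no session.
        if snapshot_id is None:
            snapshot_id = self.latest[session]
        return self.snapshots[snapshot_id]


store = SnapshotStore()
store.save("release-fix", "s1", {"refs": ["@e4"]})
store.save("release-fix", "s2", {"refs": ["@e9"]})

newest = store.resolve(session="release-fix")  # follows the session pointer
pinned = store.resolve(snapshot_id="s1")       # explicit ID, no session needed
```

The second snapshot moves the session pointer, but the first one stays reachable by its ID.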

Commands

Observation

agent-desktop snapshot --app Safari -i           # accessibility tree with refs
agent-desktop snapshot --surface menu            # capture open menu
agent-desktop screenshot --app Finder            # PNG screenshot
agent-desktop find --role button --app TextEdit  # search by role, name, value, text
agent-desktop get @e3 --snapshot s8f3k2p9 --property value  # read element property
agent-desktop is @e7 --snapshot s8f3k2p9 --property checked # check boolean state
agent-desktop list-surfaces --app Notes          # list menus, sheets, popovers, alerts

The get and is commands resolve the ref once, prefer live platform reads when available, and fall back only when the adapter does not support that live read.

Interaction

agent-desktop click @e3                  # semantic AX-first click
agent-desktop double-click @e3           # AXOpen; physical double-click uses --headed mouse-click --count 2
agent-desktop triple-click @e3           # POLICY_DENIED if physical input is disabled
agent-desktop right-click @e3            # open verified context menu
agent-desktop type @e5 "hello world"     # insert text into element
agent-desktop set-value @e5 "new value"  # set value directly via AX
agent-desktop clear @e5                  # clear element value
agent-desktop focus @e5                  # set keyboard focus
agent-desktop select @e9 "Option B"      # select verified dropdown/list option
agent-desktop toggle @e12                # flip checkbox or switch
agent-desktop check @e12                 # idempotent check
agent-desktop uncheck @e12               # idempotent uncheck
agent-desktop expand @e15                # expand disclosure/tree item
agent-desktop collapse @e15              # collapse disclosure/tree item
agent-desktop scroll @e1 --direction down --amount 3  # scroll (AX-first)
agent-desktop scroll-to @e20             # scroll element into view

On macOS (Phase 1), pure cursor gestures have no accessibility equivalent, so triple-click, hover, and drag are always physical. The double-click command is headless via AXOpen and needs --headed only for gesture-only targets. Windows (UIA) and Linux (AT-SPI) adapters may expose different capabilities. See skills/agent-desktop/references/commands-interaction.md.

Keyboard

agent-desktop press cmd+s               # key combo
agent-desktop press cmd+shift+z          # multi-modifier
agent-desktop press escape               # single key
agent-desktop key-down shift             # hold key
agent-desktop key-up shift               # release key

Mouse

agent-desktop --headed hover @e3                  # move cursor to element
agent-desktop --headed hover --xy 500,300         # move cursor to coordinates
agent-desktop --headed drag --from @e3 --to @e8   # drag between elements
agent-desktop --headed drag --from-xy 100,200 --to-xy 400,200  # drag between coordinates
agent-desktop --headed mouse-click --xy 500,300   # click at coordinates
agent-desktop --headed mouse-down --xy 500,300    # press at coordinates
agent-desktop --headed mouse-up --xy 500,300      # release at coordinates

App & Window Management

agent-desktop launch Safari              # launch app by name
agent-desktop launch com.apple.Safari    # launch by bundle ID
agent-desktop close-app Safari           # quit app
agent-desktop close-app Safari --force   # force quit (SIGTERM, then SIGKILL if needed)
agent-desktop list-apps                  # list running GUI apps
agent-desktop list-windows               # list visible windows
agent-desktop list-windows --app Finder  # windows for specific app
agent-desktop focus-window w-4521        # bring window to front
agent-desktop resize-window w-4521 800 600  # resize
agent-desktop move-window w-4521 100 100    # move
agent-desktop minimize w-4521            # minimize
agent-desktop maximize w-4521            # maximize
agent-desktop restore w-4521             # restore

Notifications (macOS only)

agent-desktop list-notifications                       # list all notifications
agent-desktop list-notifications --app "Slack"         # filter by app
agent-desktop list-notifications --text "deploy" --limit 5  # filter by text
agent-desktop dismiss-notification 1                   # dismiss by index
agent-desktop dismiss-all-notifications                # dismiss all
agent-desktop dismiss-all-notifications --app "Slack"  # dismiss all from app
agent-desktop notification-action 1 --action "Reply"   # click action button

Clipboard

agent-desktop clipboard-get              # read clipboard text
agent-desktop clipboard-set "copied"     # write to clipboard
agent-desktop clipboard-clear            # clear clipboard

Wait

agent-desktop wait 500                                       # sleep 500ms
agent-desktop wait --element @e3 --timeout 5000              # wait for element
agent-desktop wait --element @e3 --predicate actionable      # wait until safe to act
agent-desktop wait --element @e5 --predicate value --value ready
agent-desktop wait --window "Save" --timeout 10000           # wait for window
agent-desktop wait --text "Loading complete" --app Safari    # wait for text
agent-desktop wait --text "Done" --count 1 --app Xcode       # wait for exact match count
agent-desktop wait --notification --text "Build Succeeded"   # wait for new matching notification
agent-desktop wait --menu --timeout 3000                     # wait for menu

Batch

agent-desktop batch '[
  {"command": "click", "args": {"ref_id": "@e2", "snapshot": "<snapshot_id>"}},
  {"command": "type", "args": {"ref_id": "@e5", "snapshot": "<snapshot_id>", "text": "hello"}},
  {"command": "press", "args": {"combo": "return"}}
]' --stop-on-error

agent-desktop --session run-a batch '[
  {"command": "snapshot", "args": {"app": "Finder", "interactive_only": true}},
  {"command": "status", "session": "run-b", "args": {}}
]'
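
When the batch is generated by a program, building the JSON in code avoids shell-quoting problems (for example, text containing a single quote). The Python sketch below uses the command and args keys shown above. The helper name batch_command is hypothetical, and the snapshot ID is the sample one used elsewhere in this README.

```python
import json
import shlex


def batch_command(steps, stop_on_error=False, session=None):
    """Build an argv list for `agent-desktop batch` from Python dicts."""
    argv = ["agent-desktop"]
    if session:
        argv += ["--session", session]
    argv += ["batch", json.dumps(steps)]
    if stop_on_error:
        argv.append("--stop-on-error")
    return argv


steps = [
    {"command": "click", "args": {"ref_id": "@e2", "snapshot": "s8f3k2p9"}},
    {"command": "type", "args": {"ref_id": "@e5", "snapshot": "s8f3k2p9", "text": "it's done"}},
    {"command": "press", "args": {"combo": "return"}},
]
argv = batch_command(steps, stop_on_error=True)
print(shlex.join(argv))  # shell-safe form, useful for logs
```

Pass argv directly to subprocess.run so the JSON never goes through a shell.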

System

agent-desktop status                     # platform, permission report, latest snapshot
agent-desktop permissions                # check accessibility/screen-recording/automation
agent-desktop permissions --request      # invoke platform request path
agent-desktop version                    # version string

Snapshot Options

agent-desktop snapshot [OPTIONS]

Flag	Default	Description
`--app <NAME>`	focused app	Filter to a specific application
`--window-id <ID>`	-	Filter to a specific window
`-i` / `--interactive-only`	off	Only include interactive elements
`--compact`	off	Omit empty structural nodes
`--include-bounds`	off	Include pixel bounds (x, y, width, height)
`--max-depth <N>`	10	Maximum tree depth
`--skeleton`	off	Shallow 3-level overview; truncated containers show `children_count` and get refs as drill targets
`--root <REF>`	-	Start traversal from this ref; merges into existing refmap with scoped invalidation
`--snapshot <snapshot_id>`	latest	Snapshot ID to use when resolving `--root`
`--surface <TYPE>`	window	`window`, `focused`, `menu`, `menubar`, `sheet`, `popover`, `alert`
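
A wrapper can assemble these flags from keyword options and reject an unknown surface before spawning the CLI. This Python sketch uses only the flags and surface values listed in the table; the function name snapshot_argv is hypothetical.

```python
# Surface values copied from the table above.
SURFACES = {"window", "focused", "menu", "menubar", "sheet", "popover", "alert"}


def snapshot_argv(app=None, interactive_only=False, compact=False,
                  skeleton=False, root=None, snapshot=None,
                  surface=None, max_depth=None):
    """Build an argv list for `agent-desktop snapshot` from keyword options."""
    if surface is not None and surface not in SURFACES:
        raise ValueError(f"unknown surface: {surface!r}")
    argv = ["agent-desktop", "snapshot"]
    if skeleton:
        argv.append("--skeleton")
    if app:
        argv += ["--app", app]
    if root:
        argv += ["--root", root]
    if snapshot:
        argv += ["--snapshot", snapshot]
    if surface:
        argv += ["--surface", surface]
    if max_depth is not None:
        argv += ["--max-depth", str(max_depth)]
    if interactive_only:
        argv.append("-i")
    if compact:
        argv.append("--compact")
    return argv


overview = snapshot_argv(app="Slack", skeleton=True,
                         interactive_only=True, compact=True)
drill = snapshot_argv(root="@e3", snapshot="s8f3k2p9",
                      interactive_only=True, compact=True)
```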

JSON Output

Every command returns structured JSON:

{
  "version": "2.0",
  "ok": true,
  "command": "click",
  "data": { "action": "click" }
}

Errors include machine-readable codes and recovery hints:

{
  "version": "2.0",
  "ok": false,
  "command": "click",
  "error": {
    "code": "STALE_REF",
    "message": "Element at @e7 no longer matches the last snapshot",
    "suggestion": "Run 'snapshot' to refresh refs, then retry"
  }
}
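
A client can turn this envelope into a return value or an exception. The Python sketch below relies only on the fields shown in the two examples (ok, data, error.code, error.message, error.suggestion). AgentDesktopError and unwrap are hypothetical names, not part of any shipped binding.

```python
import json


class AgentDesktopError(Exception):
    """Carries the machine-readable code and the recovery hint."""

    def __init__(self, code, message, suggestion=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.suggestion = suggestion


def unwrap(response_text):
    """Return `data` on success, raise AgentDesktopError otherwise."""
    response = json.loads(response_text)
    if response["ok"]:
        return response.get("data")
    err = response["error"]
    raise AgentDesktopError(err["code"], err["message"], err.get("suggestion"))


ok_text = '{"version": "2.0", "ok": true, "command": "click", "data": {"action": "click"}}'
err_text = """{"version": "2.0", "ok": false, "command": "click",
  "error": {"code": "STALE_REF",
            "message": "Element at @e7 no longer matches the last snapshot",
            "suggestion": "Run 'snapshot' to refresh refs, then retry"}}"""

data = unwrap(ok_text)
try:
    unwrap(err_text)
    caught = None
except AgentDesktopError as exc:
    caught = exc
```

Keeping the suggestion on the exception lets an agent feed the hint straight into its next step.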

Error Codes

Code	Meaning
`PERM_DENIED`	Accessibility permission not granted
`ELEMENT_NOT_FOUND`	No element matched the ref or query
`APP_NOT_FOUND`	Application not running or no windows
`STALE_REF`	Ref could not be re-identified in the live UI
`AMBIGUOUS_TARGET`	Ref recovery matched multiple plausible targets
`SNAPSHOT_NOT_FOUND`	Snapshot ID is missing or expired
`POLICY_DENIED`	Physical/headed path blocked by policy
`ACTION_FAILED`	The OS rejected the action
`PLATFORM_NOT_SUPPORTED`	Adapter method not implemented on this platform
`TIMEOUT`	Wait condition expired
`INVALID_ARGS`	Invalid argument values

Exit Codes

0 success, 1 structured error (JSON on stdout), 2 argument parse error.
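
A caller can branch on the exit code before parsing anything. In this Python sketch the function name interpret and the returned labels are hypothetical. It parses stdout only for codes 0 and 1, and assumes nothing about the output for code 2.

```python
import json


def interpret(returncode, stdout):
    """Map an exit code and stdout to a (kind, payload) pair."""
    if returncode == 0:
        return "success", json.loads(stdout)
    if returncode == 1:
        # Structured error: the JSON envelope is on stdout.
        return "error", json.loads(stdout)
    if returncode == 2:
        # Argument parse error: this sketch does not try to parse output.
        return "usage", None
    return "unknown", None


kind, payload = interpret(1, '{"ok": false, "error": {"code": "TIMEOUT"}}')
```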

Ref System

snapshot assigns refs to interactive elements in depth-first order: @e1, @e2, @e3, etc. Refs are scoped to a compact snapshot_id such as s8f3k2p9. Commands can omit --snapshot to use the active session's latest snapshot pointer, but passing the ID is more deterministic in multi-step flows and does not require also passing --session.

Interactive roles that receive refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell, radiobutton, incrementor, menubutton, switch, colorwell, dockitem.

Static elements (labels, groups, containers) appear in the tree for context but have no ref.
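
Depth-first ref assignment can be illustrated with a short Python sketch. The role list is copied from above. The tree shape, the pre-order visiting choice, and the function name assign_refs are assumptions for illustration, not the tool's implementation.

```python
INTERACTIVE_ROLES = {
    "button", "textfield", "checkbox", "link", "menuitem", "tab", "slider",
    "combobox", "treeitem", "cell", "radiobutton", "incrementor",
    "menubutton", "switch", "colorwell", "dockitem",
}


def assign_refs(node, refs=None):
    """Depth-first (pre-order, assumed) walk that numbers interactive nodes."""
    if refs is None:
        refs = {}
    if node.get("role") in INTERACTIVE_ROLES:
        refs[f"@e{len(refs) + 1}"] = node
    for child in node.get("children", []):
        assign_refs(child, refs)
    return refs


# Hypothetical tree; static nodes (window, group, statictext) get no ref.
tree = {"role": "window", "children": [
    {"role": "group", "children": [
        {"role": "button", "name": "Save"},
        {"role": "textfield", "name": "Name"},
    ]},
    {"role": "statictext", "name": "Status"},
    {"role": "checkbox", "name": "Remember me"},
]}
refs = assign_refs(tree)
```

Both children of the group are numbered before the checkbox that follows it, which is what depth-first order means here.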

Reliability contract:

--session <id> scopes the latest snapshot pointer to one caller or agent team; explicit --snapshot <id> resolves the saved snapshot directly.
Ref actions re-identify targets at action time: a moved unique target can proceed, while missing or changed stable identity returns STALE_REF.
Mutable value text is not treated as stable identity, so text fields and timers can keep resolving when the saved window, path, role, and bounds evidence still identify the same element.
Multiple plausible targets return AMBIGUOUS_TARGET instead of choosing arbitrarily.
Actions run an actionability preflight before dispatch: visibility, stability, enabled state, supported action, policy, and editability.
wait --element @e3 --predicate actionable polls until the target can be acted on.
--trace <path> appends JSONL diagnostics to a file, keeping them out of stdout. With --trace-strict, the command fails if trace setup or a pre-action trace write fails; post-action success traces stay best-effort because the desktop mutation has already happened.

Stale ref recovery:

snapshot → act → STALE_REF or AMBIGUOUS_TARGET? → wait/snapshot again → retry with the new ref
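
The recovery flow can be written as a bounded retry. In this Python sketch, act and relocate are hypothetical callables supplied by the caller: act(ref) returns a parsed response envelope, and relocate(ref) re-observes the UI and returns the ref to try next. Only the two error codes come from this README.

```python
RETRYABLE = {"STALE_REF", "AMBIGUOUS_TARGET"}


def act_with_recovery(act, relocate, ref, max_attempts=3):
    """Retry an action with a fresh ref while the error is recoverable."""
    response = act(ref)
    attempts = 1
    while (not response["ok"]
           and response["error"]["code"] in RETRYABLE
           and attempts < max_attempts):
        ref = relocate(ref)  # wait/snapshot again, then pick the new ref
        response = act(ref)
        attempts += 1
    return response


# Stubbed demo: @e7 is stale, the re-observed element is @e9.
tried = []


def fake_act(ref):
    tried.append(ref)
    if ref == "@e7":
        return {"ok": False, "error": {"code": "STALE_REF"}}
    return {"ok": True, "data": {"action": "click"}}


result = act_with_recovery(fake_act, lambda ref: "@e9", "@e7")
```

Other error codes are returned to the caller unchanged, since re-observing will not fix them.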

Platform Support

Feature	macOS	Windows	Linux
Accessibility tree	Yes	Planned	Planned
Click / type / keyboard	Yes	Planned	Planned
Mouse input	Yes	Planned	Planned
Screenshot	Yes	Planned	Planned
Clipboard	Yes	Planned	Planned
App & window management	Yes	Planned	Planned
Notifications	Yes	Planned	Planned

Development

cargo build                               # debug build
cargo build --release                     # optimized (<15MB)
cargo test --lib --workspace              # run tests
cargo clippy --all-targets -- -D warnings # lint (must pass with zero warnings)

FAQ

What is agent-desktop?

agent-desktop is a native desktop automation CLI for AI agents. It lets agents observe and control desktop apps through OS accessibility trees, using structured JSON instead of screenshots, pixel matching, or browser-only automation.

Does agent-desktop require screenshots or pixel matching?

No. The core workflow reads native accessibility trees and assigns refs to interactive elements. Screenshots are available as a separate command, but agents do not need screenshots or pixel matching to click buttons, type into fields, inspect menus, or navigate app windows.

How does agent-desktop work?

Component	Function
Native Rust CLI	Fast, single binary, no runtime dependencies
C-ABI cdylib	Load once from Python, Swift, Go, Ruby, Node, or C instead of forking the CLI per call
54 Commands	Observation, interaction, keyboard, mouse, notifications, clipboard, window management, and bundled `skills` docs
Snapshot & Refs	Compact snapshot IDs and deterministic element refs like `@e1`, `@e2`
Structured JSON	Machine-readable responses with error codes and recovery hints

What makes agent-desktop useful for AI agents?

Feature	Benefit
Progressive Skeleton Traversal	78–96% token reduction on dense apps
Headless-by-Default Actions	Ref actions use accessibility APIs and block unintended physical side effects
Snapshot Refs	Agents act on stable refs within a snapshot instead of guessing coordinates
Recovery Hints	Errors include machine-readable codes and suggestions for the next agent step
Cross-Language FFI	Python, Swift, Go, Ruby, Node, C, and C++ hosts can call the native library directly

Which platforms are supported?

Feature	macOS	Windows	Linux
Accessibility tree	Yes	Planned	Planned
Click/type/keyboard	Yes	Planned	Planned
Mouse input	Yes	Planned	Planned
Screenshot	Yes	Planned	Planned
Clipboard	Yes	Planned	Planned
App/window management	Yes	Planned	Planned
Notifications	Yes	Planned	Planned

How do I install agent-desktop?

Install the CLI from npm:

npm install -g agent-desktop
agent-desktop snapshot --app Safari

Build the FFI library from source:

cargo build --release
# Outputs: libagent_desktop_ffi.dylib/.so/.dll

What is the ref system?

snapshot assigns refs to interactive elements in depth-first order: @e1, @e2, @e3, etc. Refs are scoped to a compact snapshot_id such as s8f3k2p9. Commands can omit --snapshot to use the active session's latest snapshot pointer, but explicit snapshot IDs are the deterministic path and do not require also passing --session.

Interactive roles that receive refs:

button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell, radiobutton, incrementor, menubutton, switch, colorwell, dockitem.

Stale ref recovery:

snapshot -> act -> STALE_REF or AMBIGUOUS_TARGET? -> wait/snapshot again -> retry with the new ref

Is agent-desktop free and open source?

Yes. agent-desktop is Apache-2.0 licensed for personal and commercial use.

Where can I get help?

Resource	Link
Repository	github.com/lahfir/agent-desktop
ClawHub Skill	clawhub.ai/lahfir/agent-desktop
skills.sh Listing	skills.sh/lahfir/agent-desktop/agent-desktop
npm Package	npmjs.com/package/agent-desktop
CI Status	GitHub Actions
Releases	GitHub Releases
Issues	GitHub Issues

License

Apache-2.0
