ghosthand

Desktop automation for AI agents that doesn't look like a robot. Bezier mouse paths with muscle-frequency tremor. Bigram-aware typing with realistic typos. No WebDriver. No kernel hooks. No navigator.webdriver.

CI · License: MIT · PowerShell 5.1+/7+ · Windows 10/11 · macOS: coming soon · Linux: coming soon

ghosthand is a single PowerShell module + SKILL.md manifest that lets any agent — GitHub Copilot CLI, Claude Code, Cursor, Hermes, or any Anthropic-compatible runtime — drive a real Windows desktop the way a person does. It sees the screen via screenshots, clicks and types at human speed, navigates the browser by keyboard, and reads structured UI state via Windows UI Automation.

The trick: most desktop automation is trivially distinguishable from a human (instant cursor teleports, batched keystrokes, navigator.webdriver=true). ghosthand deliberately produces realistic input timing at the OS input layer, the same place a real user's input arrives. Cost: actions take real human time. Benefit: they look like a human did them, because they travel the same user-mode input path at a human pace.

Why ghosthand vs. the alternatives

| Capability | ghosthand | PyAutoGUI / nut.js | Playwright / Puppeteer | Selenium |
|---|---|---|---|---|
| Realistic mouse paths (Bezier + tremor) | ✅ | ❌ teleport | n/a | n/a |
| Bigram-aware human typing cadence | ✅ | ❌ uniform | ❌ uniform | ❌ uniform |
| navigator.webdriver stays false | ✅ | n/a | ❌ true | ❌ true |
| Works in any app, not just browsers | ✅ | ✅ | ❌ browser-only | ❌ browser-only |
| Reads structured UI tree (UIA) | ✅ | ❌ pixels only | ✅ DOM | ✅ DOM |
| Kernel hooks / virtual HID required | ❌ user-mode | ❌ user-mode | n/a | n/a |
| Ships as an agent skill | ✅ | ❌ | ❌ | ❌ |

What it does

  • Sees the screen. Background watcher writes a fresh screenshot to state/current.png every ~250 ms. Multi-monitor + virtual-screen capture.
  • Drives the mouse like a human. Bezier paths with side-bias, 8–13 Hz sinusoidal tremor (real muscle frequency), sub-pixel correction near the target, optional overshoot for long travels, Fitts-law timing, pre-click hover dwell.
  • Types like a human. Bigram-aware log-normal cadence (common pairs like th fast, same-finger pairs like ed slower), occasional adjacent-key typos with backspace correction, configurable WPM and typo rate.
  • Talks to the UI tree, not pixels. Native UIA PropertyCondition filtering (~100 ms on busy pages), accessibility-snapshot trees with auto-generated refs (w1e36), retry on transient COM errors, built-in Wait-For primitives.
  • Drives the browser by clicking and typing, not by attaching a debugger. navigator.webdriver stays false because there is no WebDriver. Captures page text via the browser's own UIA tree.
  • Detects CAPTCHAs without solving them. Surfaces the detection so a human can step in.

Quick start

As a GitHub Copilot CLI / Claude Code skill

git clone https://github.com/niteshdangi/ghosthand ~/.copilot/skills/ghosthand
# or for Claude Code:
git clone https://github.com/niteshdangi/ghosthand ~/.claude/skills/ghosthand

The agent auto-discovers skills in those folders. Tell it something like "open Notepad and type a haiku" — it'll invoke the skill and execute.

As a standalone PowerShell module

git clone https://github.com/niteshdangi/ghosthand
cd ghosthand
. ./Ghosthand.ps1

# Move the cursor with overshoot + tremor + sub-pixel correction
Move-MouseHuman -X 800 -Y 400

# Type with bigram-aware human cadence + occasional typos
Send-Text "Hello, world!"

# Drive the browser
$br = Open-Url 'https://github.com/trending'
Click-LinkByText -Text 'microsoft / vscode' -Partial
$pageText = Get-PageText -ProcessId $br.ProcessId

A minimal end-to-end demo

. ./Ghosthand.ps1
Start-ScreenWatcher                                     # vision daemon
$br = Open-Url 'https://github.com/trending'            # human-typed URL
Click-LinkByText -Text 'microsoft / vscode' -Partial    # navigate
$null = Wait-ForUrlContains -Pattern '/microsoft/vscode'
Click-UIElement -NameLike 'Issues  ('                   # repo issues tab
$null = Wait-ForUrlContains -Pattern '/issues'
Get-PageText -ProcessId $br.ProcessId | Select-Object -First 50
Stop-ScreenWatcher

That whole flow takes ~25 s, mostly browser render time. See SKILL.md for the latency-budget table.

How the realism actually works

This is the bit most desktop-automation libraries skip.

Mouse:

  • Bezier curve, not straight line. Two control points biased perpendicular to the travel vector — left-handers and right-handers curve differently; ghosthand picks one and stays consistent within a session.
  • 8–13 Hz tremor layered on top. That's the real frequency of physiological hand tremor. Anti-bot systems looking at velocity power-spectra see the right bump.
  • Fitts's law timing: small or distant targets take longer to reach than large or nearby ones, with coefficients in the range reported by human pointing studies.
  • Overshoot + correction on long travels. Real arms ballistically overshoot and self-correct; ghosthand does the same with a sub-pixel approach phase.
  • Pre-click hover dwell — humans don't click the millisecond they arrive.
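
The mouse model above can be sketched in a few lines of plain PowerShell. This is a minimal illustration under stated assumptions, not ghosthand's actual code: the function names `Get-FittsDurationMs` and `Get-HumanMousePath`, every parameter, and every constant (the Fitts coefficients, the curve bias, the tremor amplitude) are hypothetical, and time advances linearly along the curve where a real implementation would likely ease in and out. The real constants live in src/Realism.ps1.

```powershell
# Illustrative sketch only: names, parameters, and constants are assumptions.

function Get-FittsDurationMs {
    param(
        [double]$Distance,      # pixels from the cursor to the target centre
        [double]$TargetWidth,   # target size in pixels
        [double]$A = 100,       # intercept in ms (assumed)
        [double]$B = 150        # slope in ms per bit (assumed)
    )
    # Shannon form of Fitts's law: MT = a + b * log2(D / W + 1)
    $A + $B * [Math]::Log($Distance / $TargetWidth + 1, 2)
}

function Get-HumanMousePath {
    param(
        [double]$X0, [double]$Y0, [double]$X1, [double]$Y1,
        [int]$Steps = 60,
        [double]$DurationMs = 600,
        [double]$CurveBias = 0.15,   # control-point offset as a fraction of distance; sign picks the side
        [double]$TremorHz = 10,      # inside the 8-13 Hz band
        [double]$TremorPx = 0.8,     # peak tremor amplitude in pixels (assumed)
        [int]$Seed = 0
    )
    $rng = [System.Random]::new($Seed)
    $dx = $X1 - $X0
    $dy = $Y1 - $Y0
    $dist = [Math]::Sqrt($dx * $dx + $dy * $dy)

    # Unit vector perpendicular to the travel direction (zero for a zero-length move)
    $px = 0.0
    $py = 0.0
    if ($dist -gt 0) { $px = -$dy / $dist; $py = $dx / $dist }

    # Two control points at 1/3 and 2/3 of the way, pushed to the same side
    $o1 = $CurveBias * $dist * (0.5 + $rng.NextDouble())
    $o2 = $CurveBias * $dist * (0.5 + $rng.NextDouble())
    $c1x = $X0 + $dx / 3 + $px * $o1
    $c1y = $Y0 + $dy / 3 + $py * $o1
    $c2x = $X0 + 2 * $dx / 3 + $px * $o2
    $c2y = $Y0 + 2 * $dy / 3 + $py * $o2
    $phase = 2 * [Math]::PI * $rng.NextDouble()

    foreach ($i in 0..$Steps) {
        $t = $i / [double]$Steps
        $u = 1 - $t
        # Cubic Bezier point
        $bx = $u * $u * $u * $X0 + 3 * $u * $u * $t * $c1x + 3 * $u * $t * $t * $c2x + $t * $t * $t * $X1
        $by = $u * $u * $u * $Y0 + 3 * $u * $u * $t * $c1y + 3 * $u * $t * $t * $c2y + $t * $t * $t * $Y1
        $timeMs = $t * $DurationMs
        # Sinusoidal tremor across the path, tapered to zero at both ends
        $envelope = [Math]::Sin([Math]::PI * $t)
        $wobble = $TremorPx * $envelope * [Math]::Sin(2 * [Math]::PI * $TremorHz * $timeMs / 1000 + $phase)
        [pscustomobject]@{
            X      = $bx + $px * $wobble
            Y      = $by + $py * $wobble
            TimeMs = $timeMs
        }
    }
}
```

A caller would first pick a duration with `Get-FittsDurationMs -Distance 700 -TargetWidth 24`, pass it as `-DurationMs`, and then replay the returned points at their `TimeMs` offsets. Overshoot and the pre-click dwell are left out to keep the sketch short.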

Keyboard:

  • Bigram-aware cadence. A log-normal distribution per bigram, with common English pairs (th, er, in) faster and same-finger pairs (ed, tr) slower — matches measured typist data.
  • Adjacent-key typos at a configurable rate, with realistic backspace correction latency. (Type "Hellp" → pause → backspace → "o".)
  • Per-keystroke jitter so that two consecutive e keystrokes don't fire at identical intervals.
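
As a rough sketch of the keyboard model above: the hypothetical `Get-KeystrokePlan` below turns a string into a list of keys with a delay before each one. The bigram factors, the neighbour map, the median delay, and the 3x "notice the mistake" pause are all made-up illustration values, and neither function name is part of ghosthand.

```powershell
# Illustrative sketch only: names, tables, and constants are assumptions.

function Get-LogNormalDelayMs {
    param([System.Random]$Rng, [double]$MedianMs, [double]$Sigma)
    # Box-Muller transform: two uniform samples -> one standard-normal sample
    $u1 = 1.0 - $Rng.NextDouble()      # in (0, 1], keeps Log() finite
    $u2 = $Rng.NextDouble()
    $z = [Math]::Sqrt(-2 * [Math]::Log($u1)) * [Math]::Cos(2 * [Math]::PI * $u2)
    # exp(sigma * z) is log-normal with median 1, so the median delay is $MedianMs
    $MedianMs * [Math]::Exp($Sigma * $z)
}

function Get-KeystrokePlan {
    param(
        [string]$Text,
        [double]$MedianMs = 110,   # median inter-key delay (assumed)
        [double]$Sigma = 0.25,     # log-normal shape; 0 disables jitter
        [double]$TypoRate = 0.02,  # chance of a slip on keys that have neighbours
        [int]$Seed = 0
    )
    $rng = [System.Random]::new($Seed)
    # Toy tables: common pairs are faster, same-finger pairs slower
    $bigramFactor = @{ 'th' = 0.75; 'er' = 0.80; 'in' = 0.80; 'ed' = 1.30; 'tr' = 1.25 }
    $neighbours = @{ 'o' = 'ip'; 'e' = 'wr'; 'a' = 'sq'; 't' = 'ry'; 'n' = 'bm' }

    $prev = ''
    foreach ($ch in $Text.ToCharArray()) {
        $key = [string]$ch
        $lower = $key.ToLowerInvariant()
        $factor = 1.0
        $pair = $prev + $lower
        if ($bigramFactor.ContainsKey($pair)) { $factor = $bigramFactor[$pair] }

        if ($neighbours.ContainsKey($lower) -and $rng.NextDouble() -lt $TypoRate) {
            # Adjacent-key slip, a longer pause, Backspace, then the intended key
            $wrong = $neighbours[$lower]
            $slip = [string]($wrong[$rng.Next($wrong.Length)])
            [pscustomobject]@{ Key = $slip; DelayMs = (Get-LogNormalDelayMs -Rng $rng -MedianMs ($MedianMs * $factor) -Sigma $Sigma) }
            [pscustomobject]@{ Key = '{BACKSPACE}'; DelayMs = (Get-LogNormalDelayMs -Rng $rng -MedianMs ($MedianMs * 3) -Sigma $Sigma) }
            $factor = 1.0
        }
        [pscustomobject]@{ Key = $key; DelayMs = (Get-LogNormalDelayMs -Rng $rng -MedianMs ($MedianMs * $factor) -Sigma $Sigma) }
        $prev = $lower
    }
}
```

Because every delay is drawn independently, repeated letters never land on identical intervals unless `-Sigma 0` is passed, which is useful mainly for checking the bigram table.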

See src/Realism.ps1 for the constants and the distribution code.

What's in the box

| File | Purpose |
|---|---|
| Ghosthand.psd1 | Module manifest (version, exports) |
| Ghosthand.psm1 | Module entry; sources src/*.ps1 in order |
| Ghosthand.ps1 | Back-compat shim for . ./Ghosthand.ps1 callers |
| src/ | Per-concern PowerShell sources (12 files: Types, State, Realism, Logging, Vision, Mouse, Keyboard, Window, UIA, UIASnapshot, Browser, Bundle) |
| watcher.ps1 | Background screenshot daemon |
| SKILL.md | Manifest + agent-facing usage rules and recipes |
| tests/ | Pester smoke tests (run via Invoke-Pester ./tests) |
| CHANGELOG.md | Versioned change log |
| SECURITY.md | Threat model + reporting |
| CONTRIBUTING.md | Dev/test guide + PR rules |

Status

v0.1.0 (alpha). Works end-to-end on common Windows workflows (dashboard navigation, file management, text editing). The documented limitations and known sharp edges are listed below.

Platform support

| Platform | Status |
|---|---|
| Windows 10 / 11 | ✅ supported |
| macOS | 🚧 not yet implemented, PRs welcome |
| Linux (X11 / Wayland) | 🚧 not yet implemented, PRs welcome |

The current implementation is built on Win32 P/Invoke (SetCursorPos, mouse_event, keybd_event) plus Windows UI Automation. Adding macOS support means implementing the same surface against Quartz Event Services (input) and the macOS Accessibility API (AX) for UI-tree access, with a small platform-abstraction layer in Ghosthand.ps1. Linux support would similarly use XTest/uinput plus AT-SPI. Both are scoped, tractable contributions if you'd like to take a swing — see CONTRIBUTING.md for the rules of engagement, and open an issue first to coordinate the design before sending a large PR.

Scope and limits

What this is for

  • Personal AI assistants that run on your own desktop.
  • Long-form GUI flows where no API is available (third-party dashboards, legacy apps, configuration panels).
  • Disclosed, legitimate browser automation: research, content drafting, reading email, summarising pages — the kind of thing you'd do yourself.
  • Accessibility-style flows where reading the UIA tree is genuinely useful.

What this is not for

  • Games / kernel anti-cheat bypass. Synthetic input is marked as injected on every event: the low-level hook records carry LLMHF_INJECTED for mouse events and LLKHF_INJECTED for keyboard events. Anti-cheat drivers see it. Don't.
  • Platform Terms-of-Service evasion. Most major sites prohibit automated account creation, automated engagement, and undisclosed bot-account operation. The skill detects CAPTCHAs but won't solve them. PRs that add ToS-evasion features will be closed without review.
  • Multi-user, multi-machine, or RDP-session targeting. Single user, single primary console session.
  • Replacement for proper APIs when one exists. If a service has a real API, use that instead.

Things that work but aren't extensively tested

  • Multi-monitor on mixed-DPI setups. The capture math handles negative monitor offsets, but combinations of 100% + 150% + 200% scaling have not been stress-tested.
  • Firefox. Tested informally; Edge and Chrome are the maintained targets.
  • Non-English UI. Bigram timing tables are tuned for English. The module still works for any text, just with less-realistic timing.
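
To make the negative-offset point concrete: when a monitor sits to the left of or above the primary, its screen coordinates are negative, while pixel coordinates in a capture of the whole virtual screen start at (0, 0). A minimal sketch of the conversion follows. `ConvertTo-CapturePixel` is a hypothetical helper, not a ghosthand function, and it assumes the capture covers the entire virtual screen at one pixel per screen unit, which is exactly the assumption mixed-DPI setups can break.

```powershell
# Illustrative sketch only: the function name and parameters are assumptions.
function ConvertTo-CapturePixel {
    param(
        [int]$ScreenX, [int]$ScreenY,        # point in virtual-screen coordinates
        [int]$VirtualLeft, [int]$VirtualTop  # top-left corner of the virtual screen
    )
    # The bitmap origin is the virtual screen's top-left corner, so subtracting
    # a negative offset shifts the point right or down into bitmap space.
    [pscustomobject]@{
        X = $ScreenX - $VirtualLeft
        Y = $ScreenY - $VirtualTop
    }
}

# Example: a 1920-pixel-wide monitor to the left of the primary gives VirtualLeft = -1920.
# On Windows the bounds can be read from
# [System.Windows.Forms.SystemInformation]::VirtualScreen (after Add-Type -AssemblyName System.Windows.Forms).
ConvertTo-CapturePixel -ScreenX (-1000) -ScreenY 200 -VirtualLeft (-1920) -VirtualTop 0
```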

Known sharp edges

  • Wait-PageLoad uses title-stability as a proxy for page-load completion. Pages with live notification counters (Gmail "(5)" → "(6)") can fool it into timing out.
  • Find-DropdownByLabel is a heuristic — it picks the nearest interactive element below a label. Occasionally matches the wrong control on dense forms; pass -MaxYDistance / -MaxXDistance to tighten.
  • Focus-Window is best-effort — Windows foreground-lock policies can cause SetForegroundWindow to silently fail or return the wrong status. Always check .Success and use Wait-ForForegroundWindow to confirm.
  • If a script is killed mid-Send-Hotkey, modifier keys can stay logically down. Run Reset-InputState to recover, and document this prominently in any tooling that wraps the skill.
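
One way to soften the stuck-modifier edge in wrapper tooling is to run input actions inside try/finally so that the reset always follows. The sketch below is a suggestion, not part of ghosthand: `Invoke-WithInputReset` is a hypothetical wrapper, and it assumes Reset-InputState can be called with no arguments. A finally block runs after terminating errors and Ctrl+C, but not when the PowerShell process itself is forcibly terminated, so this narrows the window without closing it.

```powershell
# Illustrative sketch only: the wrapper name is an assumption.
function Invoke-WithInputReset {
    param(
        [Parameter(Mandatory)][scriptblock]$Action,
        # Defaults to the module's recovery command; assumed to need no arguments.
        [scriptblock]$Reset = { Reset-InputState }
    )
    try {
        & $Action
    }
    finally {
        # Runs whether $Action succeeded, threw, or was interrupted with Ctrl+C
        & $Reset
    }
}

# Usage: wrap whatever input calls the script makes, for example
# Invoke-WithInputReset -Action { <your Send-Hotkey call here> }
```

The `-Reset` parameter exists mainly so the wrapper can be exercised without the module loaded.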

Why this exists

Most desktop automation libraries either:

  1. Use kernel-mode hooks or virtual HIDs that fail at the user level (and sometimes get flagged by AV).
  2. Drive a debugger-attached browser, which web platforms detect via navigator.webdriver and similar tells.
  3. Optimise for "fast" — instant cursor teleports, batched keystrokes — which produces input timing patterns that are trivially distinguishable from a human.

ghosthand avoids all three. It uses ordinary user-mode SetCursorPos / mouse_event / keybd_event, drives a normal user-launched browser, and intentionally produces realistic timing. The cost is that actions take real human time. The benefit is that they look like a real human did them, because they arrive through the same user-mode input path at a human pace.

Contributing

PRs welcome — see CONTRIBUTING.md. Particularly looking for: macOS port (Quartz + AX), Linux port (XTest/uinput + AT-SPI), Firefox parity, and bigram tables for non-English layouts.

License

MIT — see LICENSE.
