Turn Your Coding Models Into State-of-the-Art Browser Agents
- Blog: Webwright: A Terminal Is All You Need For Web Agents
- Project Page: microsoft.github.io/Webwright
Webwright gives an LLM a terminal from which it can launch multiple browser sessions to inspect pages and complete a web task. It captures and inspects page screenshots and state only when needed. It requires each web task to be completed end to end within a rerunnable Python script, i.e. your web agent's browsing history is a single code file. No multi-agent system, no graph engine, no plugin layer, no hidden orchestration: just a terminal, a browser, and a model.
Already have a favorite agent and wondering how to make Claude Code, Codex, Hermes, or OpenClaw more capable at browser tasks? Consider adding the Webwright plugin/skills!
Motivation: Beyond Step-by-Step Web Interaction in a Stateful Browser
Most web agents today treat the browser session itself as the workspace: at each step the model receives the current page state and predicts a single next operation, such as a click, a type, a DOM selector, or a short tool call. Whatever the format, the agent is locked into predicting one web action at a time inside a predefined interaction loop. That harness was useful when LLMs were weaker. As models get stronger at writing and debugging code, the same harness becomes a bottleneck.
Webwright takes a different stance: separate the agent from the browser, and treat the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is not the browser session; it's the code and logs in the local workspace.
- Robust, reusable interaction with web environments: instead of fragile pixel-level actions, a coding agent with a terminal queries elements, waits for conditions, and handles dynamic behaviors like lazy loading or re-rendering. The resulting scripts can be rerun, adapted, and shared across tasks rather than rediscovered from scratch.
- Efficient composition of complex workflows: multi-step interactions like selecting a date or filling a form become a compact program. Loops, functions, and abstractions let the agent generalize across similar tasks (e.g. different dates) without re-predicting the same low-level sequences. The result is fewer interaction rounds, faster execution, and less error accumulation on long horizons.
- Workspace-as-state, not browser-as-state: the agent can write exploratory scripts, spawn fresh browser sessions, and decide for itself when to capture screenshots and inspect failures, much like a human engineer iterating on an RPA script.
- Surprisingly effective despite being minimal: this stripped-down setup handles complex and especially long-horizon web tasks well (see Performance).
Why Webwright
Most web agent frameworks bury the actual agent loop under layers of abstractions. Webwright takes the opposite stance:
- Lightweight by design: core agent loop in a single ~450-line file, Playwright environment in ~570 lines, CLI in ~150 lines.
- Pluggable model backends: OpenAI, Anthropic, and OpenRouter, each ~150-200 lines.
- Zero hidden frameworks: just `httpx`, `pydantic`, `playwright`, and `typer`.
- Flat prompt → observe → execute script loop: readable end-to-end, easy to debug, easy to fork.
- Run-artifact first: every run writes trajectories and screenshots to disk for inspection.
If you want a minimal, easy-to-debug starting point for browser-using agents instead of another heavyweight platform, this is it.
How Webwright Differs From Other Browser-Agent Repos
How they differ at the architectural level:
| | Stagehand (Browserbase) | agent-browser (Vercel) | browser-use | Webwright |
|---|---|---|---|---|
| Paradigm | Hybrid: code + NL primitives (`act` / `extract` / `agent`) | CLI tool that another agent (Claude Code, Codex, etc.) calls | Autonomous LLM agent loop over DOM/AX snapshots | Coding agent with a terminal; browser is just an environment it spawns |
| Action space | Playwright code, or NL → LLM-translated Playwright | Discrete subcommands (`open`, `click @e2`, `snapshot`, `eval`) | Indexed click/type actions selected by the LLM | Free-form Python (writes Playwright scripts itself) |
| What is "state"? | The browser session | The browser session (held by daemon across CLI calls) | The browser session | The local workspace: code, screenshots, logs. Browser is disposable. |
| Loop shape | Imperative; `agent()` does multi-step when needed | One CLI invocation per micro-step | observe → predict next action → execute → repeat | write code → execute → inspect screenshots → repair (code-as-action) |
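The Webwright "Loop shape" cell (write code → execute → inspect → repair) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Webwright's actual agent loop: `run_until_success`, the `attempt_N` file names, and the `repair` callback (which stands in for the model) are all hypothetical, and the toy scripts never touch a browser.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_until_success(
    workspace: Path,
    first_script: str,
    repair: Callable[[str, str], Optional[str]],
    max_attempts: int = 3,
) -> int:
    """Write a script, run it, inspect its log, and repair it on failure.

    Returns the 1-based attempt number that succeeded, or 0 if none did.
    `repair` is a hypothetical stand-in for the model: it receives the
    failing script and its log and returns a new script (or None to stop).
    """
    workspace.mkdir(parents=True, exist_ok=True)
    script = first_script
    for attempt in range(1, max_attempts + 1):
        script_path = workspace / f"attempt_{attempt}.py"
        log_path = workspace / f"attempt_{attempt}.log"
        script_path.write_text(script, encoding="utf-8")

        # Each attempt runs in a fresh process, so nothing carries over
        # between attempts except the files saved in the workspace.
        result = subprocess.run(
            [sys.executable, script_path.name],
            capture_output=True,
            text=True,
            cwd=workspace,
        )
        log = result.stdout + result.stderr
        log_path.write_text(log, encoding="utf-8")

        if result.returncode == 0:
            return attempt
        new_script = repair(script, log)
        if new_script is None:
            break
        script = new_script
    return 0


def demo() -> tuple[int, list[str]]:
    """Run the loop with a toy script that fails once and is then repaired."""
    broken = "raise SystemExit('selector not found')\n"
    fixed = "print('task complete')\n"
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        winner = run_until_success(workspace, broken, lambda script, log: fixed)
        files = sorted(p.name for p in workspace.iterdir())
    return winner, files
```

In this sketch the workspace directory is the only thing that persists: every script version and its log stay on disk, while each process (like each browser session in the table) is disposable.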
Demo video: webwright_demo.mp4
Performance
State-of-the-art on two real-website benchmarks with a 100-step budget; see the blog post for full details.
- Online-Mind2Web (300 tasks): 86.7% with GPT-5.4, the highest among open-source harnesses in the AutoEval category. Claude Opus 4.7 reaches 84.7% and is stronger on the hard split (80.5% vs. 76.6% for GPT-5.4 at N=100).
- Odysseys (200 long-horizon tasks): 60.1% with GPT-5.4 (avg. 76.1 steps), which is +15.6 points over the prior SOTA (Opus 4.6 at 44.5%, using a vision-based approach and a persistent browser) and +26.6 points over base GPT-5.4 (33.5%, using xy-coordinate prediction and a persistent browser).
- Code-as-action beats coordinate prediction: Webwright substantially outperforms a reproduced GPT-5.4 screenshot+xy-coordinate baseline across all difficulty splits.
- Small models + reusable tools: generated scripts can be packaged as parameterized CLI tools; even Qwen-3.5-9B completes tasks well on Online-Mind2Web sites with 5+ tools available.
webwright/
├── pyproject.toml # package: webwright
├── src/webwright/
│   ├── run/cli.py # CLI entrypoint (`webwright`)
│   ├── agents/default.py # core agent loop
│   ├── environments/ # Playwright browser workspace
│   ├── tools/ # image_qa, self_reflection
│   ├── models/ # openai_model, anthropic_model, base
│   ├── config/ # base.yaml, model_openai.yaml, model_claude.yaml
│   └── utils/
├── tests/
└── outputs/ # run artifacts (trajectories, screenshots)
- Python 3.10+
- Chromium installed through Playwright
- An API key for your chosen backend (OpenAI, Anthropic, or OpenRouter)
pip install -e .
playwright install chromium
Export credentials for the chosen backend (e.g. OPENAI_API_KEY or ANTHROPIC_API_KEY), then:
python -m webwright.run.cli \
-c base.yaml -c model_openai.yaml \
-t "Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20" \
--start-url https://www.google.com/flights \
--task-id demo_openai \
-o outputs/default

| Flag | Description |
|---|---|
| `-c` | Config file(s) from `src/webwright/config/` (stackable). |
| `-t` | Task instruction. |
| `--start-url` | Initial page. |
| `--task-id` | Output subfolder name. |
| `-o` | Output directory. |
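The `-c` flag is described only as stackable. One plausible reading, and purely an assumption here, is that configs are layered left to right, with later files overriding earlier ones key by key. The sketch below shows that merge on plain dictionaries; `merge_configs` and every key and value in `BASE` and `MODEL_LAYER` are hypothetical and are not taken from Webwright's config files.

```python
import copy
from typing import Any


def merge_configs(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge config layers left to right; later layers win on conflicts.

    Nested dicts are merged recursively, any other value is replaced.
    The input layers are never mutated.
    """
    merged: dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-ins for the parsed contents of two stacked config files.
BASE = {
    "agent": {"max_steps": 100, "save_screenshots": True},
    "model": {"name": "placeholder"},
}
MODEL_LAYER = {
    "model": {"name": "example-model", "provider": "openai"},
}

if __name__ == "__main__":
    print(merge_configs(BASE, MODEL_LAYER))
```

Under this assumed scheme, a model-specific file only needs to state what differs from the base config, which would explain why the example command passes both `base.yaml` and `model_openai.yaml`.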
Webwright ships plugin manifests for both Claude Code (.claude-plugin/plugin.json) and OpenAI Codex (.codex-plugin/plugin.json), with the shared skill at skills/webwright/ and slash commands at skills/webwright/commands/. The host agent drives the Webwright loop natively, so there is no extra LLM API key or cost beyond your host subscription. Hosts that read PNG screenshots natively skip the OpenAI-backed image_qa / self_reflection tools.
Common runtime deps (install once after either path):
pip install -e .
playwright install chromium
Claude Code
Install through the bundled marketplace inside Claude Code:
# 1. Add this repo as a Claude Code plugin marketplace
/plugin marketplace add microsoft/Webwright
# 2. Install the plugin from that marketplace
/plugin install webwright@webwright
Prefer a local checkout? Point the marketplace command at the cloned repo instead:
/plugin marketplace add /absolute/path/to/Webwright
/plugin install webwright@webwright
Start a new Claude Code session after installing: plugins are loaded at session start and won't appear until you restart.
You can either ask Claude Code in plain English (the skill auto-activates from its description), or use one of the slash commands:
/webwright:run search Google Flights for flights from SEA to JFK on 2026-08-15 to 2026-08-20
/webwright:craft search a ticket on Google Flights from LAX to SFO depart June 7 return June 14
- `/webwright:run` (or any plain prompt) produces a one-shot `final_script.py` for the literal task values.
- `/webwright:craft` produces a reusable CLI tool: `final_script.py` becomes one parameterized function with a Google-style `Args:` docstring and an `argparse` wrapper whose flags default to the concrete task values, so you can rerun it later with different arguments, e.g. `python final_script.py --origin JFK --destination LAX --depart-date 2026-07-01`.
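To make the `/webwright:craft` output shape concrete, here is a rough sketch of what such a `final_script.py` could look like for the LAX to SFO task above. It is an illustration, not actual Webwright output: the function name `search_flights`, the `--return-date` flag, and the year 2026 in the defaults are assumptions (the task gives only month and day), and the Playwright steps are replaced by a stub that echoes the query so the example runs without a browser.

```python
import argparse
from typing import Optional, Sequence


def search_flights(
    origin: str, destination: str, depart_date: str, return_date: str
) -> dict:
    """Search round-trip flights (browser steps are stubbed in this sketch).

    Args:
        origin: Departure airport code, e.g. "LAX".
        destination: Arrival airport code, e.g. "SFO".
        depart_date: Outbound date in YYYY-MM-DD form.
        return_date: Return date in YYYY-MM-DD form.

    Returns:
        A dict echoing the query. A generated script would drive
        Playwright here and return the scraped results instead.
    """
    return {
        "origin": origin,
        "destination": destination,
        "depart_date": depart_date,
        "return_date": return_date,
    }


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI wrapper; defaults mirror the original task values."""
    parser = argparse.ArgumentParser(description="Search round-trip flights.")
    parser.add_argument("--origin", default="LAX")
    parser.add_argument("--destination", default="SFO")
    # The year is an assumption; the task only says "June 7" and "June 14".
    parser.add_argument("--depart-date", default="2026-06-07")
    parser.add_argument("--return-date", default="2026-06-14")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> dict:
    """Parse flags, run the single parameterized function, print the result."""
    args = build_parser().parse_args(argv)
    result = search_flights(
        args.origin, args.destination, args.depart_date, args.return_date
    )
    print(result)
    return result


if __name__ == "__main__":
    main()
```

Running the sketch with no flags reproduces the original task, while passing flags such as `--origin JFK --destination LAX --depart-date 2026-07-01` reuses the same function for a different query.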
In both modes Claude Code scaffolds a workspace with plan.md, runs instrumented Playwright scripts under final_runs/run_<id>/, and visually self-verifies each critical point against the saved screenshots.
OpenAI Codex
Codex reads Claude-style marketplaces, so the same repo works as a Codex plugin marketplace. From the Codex CLI:
# 1. Add this repo as a Codex plugin marketplace
codex plugin marketplace add microsoft/Webwright
# 2. Open the plugin browser and install Webwright
codex
/plugins
Prefer a local checkout?
codex plugin marketplace add /absolute/path/to/Webwright
Then restart Codex so the new marketplace and plugin are picked up.
In a new Codex thread, either ask in plain English (the skill auto-activates from its description) or invoke the bundled skill explicitly with @webwright:
@webwright search Google Flights for flights from SEA to JFK on 2026-08-15 to 2026-08-20
Codex scaffolds a workspace with plan.md, runs instrumented Playwright scripts under final_runs/run_<id>/, and visually self-verifies each critical point against the saved screenshots.
To turn the plugin off without uninstalling, set its entry in ~/.codex/config.toml to enabled = false and restart Codex.
OpenClaw
Install directly from a local checkout (path, archive, npm spec, git repo, or clawhub: spec all work):
openclaw plugins install /absolute/path/to/Webwright
openclaw gateway restart # reload so the plugin and skill are picked up
Verify:
openclaw plugins list | grep webwright
openclaw skills list | grep webwright # should show the skill as ready
The webwright skill is now available to any OpenClaw agent surface (CLI, Telegram, etc.). Invoke it by asking the agent in natural language, or via the slash commands shipped under skills/webwright/commands/, e.g. /webwright run <task>.
To uninstall: openclaw plugins uninstall webwright.
Hermes Agent
Hermes Agent is a skills-compatible client, so the same skills/webwright/ folder loads as a Hermes skill. Symlink it into your Hermes user-skills directory:
mkdir -p ~/.hermes/skills
ln -sfn /absolute/path/to/Webwright/skills/webwright ~/.hermes/skills/webwright
No Hermes-specific manifest is needed; only SKILL.md is loaded.
Start Hermes (hermes) and ask it to drive a web task in natural language; the skill auto-activates from its description. You can also invoke it explicitly with /webwright.
Note: the named subcommands shipped under skills/webwright/commands/ (/webwright:run, /webwright:craft) are a Claude Code / Codex convention and are inert in Hermes; the skill itself still works end-to-end.
- SWE-agent/mini-swe-agent: design inspiration for the minimal agent loop.
- Playwright: browser automation.
If you use Webwright in your research or build on it, please cite this repository:
@misc{webwright2026,
title = {Webwright: A terminal is all you need for web agents},
author = {Lu, Yadong and Xu, Lingrui and Huang, Chao and Awadallah, Ahmed},
year = {2026},
howpublished = {\url{https://github.com/microsoft/Webwright}},
note = {GitHub repository}
}
