Benchmark LLM vision + tool-use capabilities on Neuralink's cursor control task.
At Neuralink, a game called Webgrid tests how precisely users can control a cursor. This benchmark evaluates LLMs on the same task: the model sees a screenshot of a grid with one blue target cell and uses tools (screen, mouse_move, mouse_click) to navigate the cursor to the target and click.
For comparison, Neuralink's eighth clinical trial participant achieved 10.39 BPS controlling his computer with his brain, and the highest mouse-based score mentioned is 17.1 BPS on a 35×35 grid, set by a Neuralink employee.
The goal is to click targets on the grid as quickly as possible while minimizing misclicks. Score is measured in bits per second (BPS), derived from net correct clicks (NTPM) and grid size.
- NTPM: net correct clicks per minute, where net correct clicks = correct - incorrect
- BPS: max((NTPM / 60) * log2(N - 1), 0), where N = number of grid cells (e.g., 900 for 30×30)
Verified against the Neuralink Webgrid frontend source:
function E(f, t) { return Math.max(Math.log2(t * t - 1) * f / 60, 0) }
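As a cross-check, the same formula can be sketched in Python. The function name `bits_per_second` and its parameter names are illustrative and not taken from the repository; `grid_side` plays the role of `t` in the frontend function, so N = grid_side².

```python
import math


def bits_per_second(ntpm: float, grid_side: int) -> float:
    """BPS = max((NTPM / 60) * log2(N - 1), 0), with N = grid_side ** 2."""
    n_cells = grid_side * grid_side
    if n_cells < 2:
        raise ValueError("grid must contain at least two cells")
    return max((ntpm / 60) * math.log2(n_cells - 1), 0.0)


if __name__ == "__main__":
    # 5 net correct clicks per minute on a 30×30 grid
    print(round(bits_per_second(5, 30), 2))  # → 0.82
```

A negative NTPM (more misclicks than hits) is clamped to 0 BPS, matching the `Math.max(..., 0)` in the frontend source.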
Results from 10 rounds on the browser-based eval (make play, 30×30 grid, 991px canvas, 70s, fullscreen):
| Model | Modality | Grid | Canvas | Round | NTPM | BPS |
|---|---|---|---|---|---|---|
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 1 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 2 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 3 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 4 | 7 | 1.14 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 5 | 7 | 1.14 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 6 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 7 | 2 | 0.33 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 8 | 6 | 0.98 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 9 | 3 | 0.49 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 10 | 4 | 0.65 |
| Avg | | | | | 4.9 | 0.80 |
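The BPS column and the averages can be reproduced from the NTPM values alone. The following is a small worked check, assuming the 30×30 grid from the table; the helper `bps` and the constant names are illustrative.

```python
import math

# NTPM per round, copied from the table above (rounds 1-10).
NTPM_BY_ROUND = [5, 5, 5, 7, 7, 5, 2, 6, 3, 4]
GRID_SIDE = 30  # 30×30 grid, so N = 900 cells


def bps(ntpm, grid_side=GRID_SIDE):
    """Bits per second for a given NTPM on a square grid."""
    return max((ntpm / 60) * math.log2(grid_side**2 - 1), 0.0)


bps_by_round = [round(bps(n), 2) for n in NTPM_BY_ROUND]
avg_ntpm = sum(NTPM_BY_ROUND) / len(NTPM_BY_ROUND)
avg_bps = sum(bps(n) for n in NTPM_BY_ROUND) / len(NTPM_BY_ROUND)

print(bps_by_round)
print(avg_ntpm, round(avg_bps, 2))
```

The per-round values match the BPS column, and the averages come out to 4.9 NTPM and 0.80 BPS.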
Comparison with other players:
| Player | Method | Grid | Best BPS | Avg BPS |
|---|---|---|---|---|
| Bliss Chapman | Mouse | 35×35 | 17.10 | — |
| Neuralink P8 | N1 Brain Implant | 30×30 | 10.39 | — |
| claude-4.6-opus | Computer use (browser click) | 30×30 | 1.14 | 0.80 |
| gemini-3-flash-preview | API tool pipeline | 30×30 | 0.16 | ~0.16 |
git clone git@github.com:ofou/webgrid_eval.git
cd webgrid_eval
make install-dev

make play
# Open http://localhost:8000 in your browser (F11 for fullscreen)

# 1. Start the API server
make dev
# 2. In another terminal, run the evaluation
make eval ARGS="configs/openrouter.yaml"

# Start the game (30×30 grid, 991px canvas, Neuralink-identical UI)
make play
# Open http://localhost:8000 → F11 for fullscreen → click blue cells

Results are logged to results/web_games.json.
Create a YAML configuration file (see configs/ for examples):
# configs/my_models.yaml
base_url: https://openrouter.ai/api/v1
grid_size: 64 # 8×8 grid (64 cells)
canvas_size: 256 # screenshot size in pixels
max_seconds: 70 # evaluation duration per model
models:
- google/gemini-3-flash-preview
- qwen/qwen3-vl-235b-a22b-instruct

Available configs:
- configs/openrouter.yaml: OpenRouter API (many models)
- configs/google.yaml: Google AI API (Gemini models)
- configs/local.yaml: Local LLM server (e.g., LM Studio, Ollama)
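To see how `grid_size` and `canvas_size` from the example config relate, here is a minimal sketch that derives the grid side and the per-cell pixel size. It assumes that `grid_size` is the total cell count of a square grid (as the `8×8 grid (64 cells)` comment suggests) and that the grid fills the whole canvas; the function `grid_geometry` is hypothetical and not part of the repository.

```python
import math


def grid_geometry(grid_size: int, canvas_size: int) -> tuple[int, float]:
    """Return (cells per side, cell size in pixels) for a square grid."""
    side = math.isqrt(grid_size)
    if side * side != grid_size:
        raise ValueError(f"grid_size={grid_size} is not a perfect square")
    # Assumption: the grid spans the full canvas with no margin.
    return side, canvas_size / side


# Values from the example config above.
print(grid_geometry(64, 256))  # → (8, 32.0)
```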
# Run with a config file
make eval ARGS="configs/openrouter.yaml"
# With custom duration (seconds)
make eval ARGS="configs/openrouter.yaml --seconds 120"
# Cap images per API request (for models with limits)
make eval ARGS="configs/openrouter.yaml --max-images 8"

When the server is running (make dev):
- GET /health: Health check
- POST /api/session/start: Run single model evaluation
- POST /api/eval/run: Run batch evaluation (multiple models)
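A quick way to confirm the server is up before starting an evaluation is to poll the health endpoint. The sketch below uses only the standard library. It assumes the API server listens on `http://localhost:8000` and that a healthy server answers with HTTP 200; neither is confirmed here, and the helper names are illustrative. The request bodies of the two POST endpoints are not described in this document, so they are left out.

```python
import urllib.request

# Assumption: the API server started by `make dev` listens on this address.
BASE_URL = "http://localhost:8000"


def build_health_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a GET request for the /health endpoint."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/health", method="GET")


def server_is_healthy(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Return True if /health answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(build_health_request(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and connection errors are subclasses of OSError
        return False
```

Calling `server_is_healthy()` in a script before `make eval` gives a clear early failure if the server has not been started.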
# Generate GIFs for all evaluation results
make gif
# Or for a specific evaluation folder
make gif ARGS="eval/model-name"

The LLM agent has access to three tools:
| Tool | Description |
|---|---|
| screen | Returns current HUD + screenshot (like looking at your monitor) |
| mouse_move | Move cursor by (dx, dy) pixels. Positive dx=right, dy=down |
| mouse_click | Click at the current cursor position |
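Because `mouse_move` takes a relative offset, an agent has to turn "where the target is" into a (dx, dy) pair. The sketch below illustrates that arithmetic under several assumptions: pixel coordinates start at the top-left of the canvas, cells are zero-indexed by (row, col), the grid fills the canvas, and the current cursor position is known. The helper names are hypothetical and not part of the tool API.

```python
def cell_center(row: int, col: int, cell_px: float) -> tuple[float, float]:
    """Pixel coordinates (x, y) of the centre of a zero-indexed grid cell."""
    return (col + 0.5) * cell_px, (row + 0.5) * cell_px


def move_to_target(cursor_xy, target_row, target_col, cell_px):
    """Return the (dx, dy) for mouse_move: positive dx is right, positive dy is down."""
    target_x, target_y = cell_center(target_row, target_col, cell_px)
    return round(target_x - cursor_xy[0]), round(target_y - cursor_xy[1])


# Cursor at (128, 128) on a 256px canvas with an 8×8 grid (32px cells);
# the target is the cell in row 1, column 6.
print(move_to_target((128, 128), 1, 6, 32))  # → (80, -80)
```

A negative dy moves the cursor up, so this example moves right and up before a `mouse_click`.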
If you use this software in your research, please cite:
@software{olivares2026webgrid,
author = {Olivares Urrutia, Omar},
title = {{Webgrid Eval: Benchmark for LLM Vision and Tool-Use Capabilities}},
year = {2026},
month = feb,
url = {https://github.com/ofou/webgrid_eval},
}

- Inspired by Neuralink's Webgrid
Contributions are welcome!
