
Webgrid Eval

Benchmark LLM vision + tool-use capabilities on Neuralink's cursor control task.

Overview

At Neuralink, a game called Webgrid tests how precisely users can control a cursor. This benchmark evaluates LLMs on the same task: the model sees a screenshot of a grid with one blue target cell and uses tools (screen, mouse_move, mouse_click) to navigate the cursor to the target and click.

Example Replay

gemini-3-flash-preview at 1x speed

gemini-3-flash-preview on a 30×30 grid: 4 correct clicks, 3 misclicks, 0.16 BPS (1 NTPM) over a 70-second round

Human Baseline

For comparison: Neuralink's eighth clinical trial participant achieved 10.39 BPS controlling his computer with his brain; the highest reported mouse-based score is 17.1 BPS on a 35×35 grid, set by a Neuralink employee.

Metrics

The goal is to click targets on the grid as quickly as possible while minimizing misclicks. Score is measured in bits per second (BPS), derived from net correct clicks (NTPM) and grid size.

  • NTPM: Net targets per minute = correct clicks - incorrect clicks
  • BPS: max((NTPM / 60) * log2(N - 1), 0), where N is the total number of grid cells (e.g., 900 for 30×30)

Verified against the Neuralink Webgrid frontend source: function E(f, t) { return Math.max(Math.log2(t * t - 1) * f / 60, 0) }
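
The formula can be sketched in Python, mirroring the frontend function above (its t is the grid side length and f is NTPM):

```python
import math

def bps(ntpm: float, grid_side: int) -> float:
    """Bits per second from net targets per minute and grid side length.

    Mirrors the frontend's max(log2(t*t - 1) * f / 60, 0): each correct
    click selects one of N - 1 possible target cells, worth log2(N - 1) bits.
    """
    n = grid_side * grid_side  # total grid cells, e.g. 900 for 30×30
    return max(ntpm / 60 * math.log2(n - 1), 0.0)

print(round(bps(5, 30), 2))  # → 0.82
```

For a 30×30 grid this gives 0.82 BPS at 5 NTPM and 1.14 BPS at 7 NTPM, matching the per-round scores in the results below; a negative NTPM (more misclicks than hits) clamps to 0.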

Benchmark Results

Results from 10 rounds on the browser-based eval (make play, 30×30 grid, 991px canvas, 70s, fullscreen):

| Model | Modality | Grid | Canvas | Round | NTPM | BPS |
|-------|----------|------|--------|-------|------|-----|
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 1 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 2 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 3 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 4 | 7 | 1.14 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 5 | 7 | 1.14 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 6 | 5 | 0.82 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 7 | 2 | 0.33 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 8 | 6 | 0.98 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 9 | 3 | 0.49 |
| claude-4.6-opus (computer use) | Browser click | 30×30 | 991px | 10 | 4 | 0.65 |
| Avg | | | | | 4.9 | 0.80 |

Comparison with other players:

| Player | Method | Grid | Best BPS | Avg BPS |
|--------|--------|------|----------|---------|
| Bliss Chapman | Mouse | 35×35 | 17.10 | |
| Neuralink P8 | N1 Brain Implant | 30×30 | 10.39 | |
| claude-4.6-opus | Computer use (browser click) | 30×30 | 1.14 | 0.80 |
| gemini-3-flash-preview | API tool pipeline | 30×30 | 0.16 | ~0.16 |

Quick Start

Installation

git clone git@github.com:ofou/webgrid_eval.git
cd webgrid_eval
make install-dev

Play the game (default eval mode)

make play
# Open http://localhost:8000 in your browser (F11 for fullscreen)

Run API-based evaluation (requires LLM API key)

# 1. Start the API server
make dev
# 2. In another terminal, run the evaluation
make eval ARGS="configs/openrouter.yaml"

Usage

Browser Game (default eval)

# Start the game (30×30 grid, 991px canvas, Neuralink-identical UI)
make play
# Open http://localhost:8000 → F11 for fullscreen → click blue cells

Results are logged to results/web_games.json.

Configure Models (API eval)

Create a YAML configuration file (see configs/ for examples):

# configs/my_models.yaml
base_url: https://openrouter.ai/api/v1
grid_size: 64 # 8×8 grid (64 cells)
canvas_size: 256 # screenshot size in pixels
max_seconds: 70 # evaluation duration per model

models:
  - google/gemini-3-flash-preview
  - qwen/qwen3-vl-235b-a22b-instruct
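
As a sanity check on these values, the pixel size of each cell the agent must hit follows directly from grid_size and canvas_size (a hypothetical calculation, not part of the eval code):

```python
# Derive the per-cell pixel size from the config above:
# grid_size is the total number of cells, canvas_size the square screenshot edge.
grid_cells = 64      # grid_size from my_models.yaml
canvas_px = 256      # canvas_size from my_models.yaml

side = int(grid_cells ** 0.5)   # 8 cells per edge
cell_px = canvas_px / side      # 32.0 px per cell
print(side, cell_px)
```

Smaller cells relative to the canvas make the task harder, since each click must land within a tighter pixel tolerance.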

Available configs:

  • configs/openrouter.yaml - OpenRouter API (many models)
  • configs/google.yaml - Google AI API (Gemini models)
  • configs/local.yaml - Local LLM server (e.g., LM Studio, Ollama)

Run Evaluation

# Run with a config file
make eval ARGS="configs/openrouter.yaml"

# With custom duration (seconds)
make eval ARGS="configs/openrouter.yaml --seconds 120"

# Cap images per API request (for models with limits)
make eval ARGS="configs/openrouter.yaml --max-images 8"

API Endpoints

When the server is running (make dev):

  • GET /health - Health check
  • POST /api/session/start - Run single model evaluation
  • POST /api/eval/run - Run batch evaluation (multiple models)

Generate Replay GIFs

# Generate GIFs for all evaluation results
make gif

# Or for a specific evaluation folder
make gif ARGS="eval/model-name"

Tools

The LLM agent has access to three tools:

| Tool | Description |
|------|-------------|
| screen | Returns the current HUD + screenshot (like looking at your monitor) |
| mouse_move | Move the cursor by (dx, dy) pixels; positive dx = right, positive dy = down |
| mouse_click | Click at the current cursor position |
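
For intuition, navigating with these tools reduces to one relative-move computation per target. The helper below is a hypothetical sketch (not part of the agent), assuming the cursor position and the target cell's (col, row) indices are known:

```python
def move_delta(cursor, target_cell, grid_side=30, canvas_px=991):
    """Pixel delta from the cursor to the center of target_cell = (col, row),
    in mouse_move's convention: positive dx = right, positive dy = down."""
    cell = canvas_px / grid_side          # cell edge in pixels (~33 px here)
    col, row = target_cell
    dx = (col + 0.5) * cell - cursor[0]   # horizontal offset to cell center
    dy = (row + 0.5) * cell - cursor[1]   # vertical offset to cell center
    return dx, dy

dx, dy = move_delta(cursor=(0, 0), target_cell=(0, 0))
print(round(dx, 1), round(dy, 1))  # offset to the center of the top-left cell
```

An agent would call mouse_move with this (dx, dy), then screen to confirm the cursor landed on the target, then mouse_click.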

Citation

If you use this software in your research, please cite:

@software{olivares2026webgrid,
  author  = {Olivares Urrutia, Omar},
  title   = {{Webgrid Eval: Benchmark for LLM Vision and Tool-Use Capabilities}},
  year    = {2026},
  month   = feb,
  url     = {https://github.com/ofou/webgrid_eval},
}

Acknowledgments

Contributing

Contributions are welcome!
