Real-world benchmarks for AI coding agents
PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files.
Results are collected on a public leaderboard at pinchbench.com.
Most LLM benchmarks test isolated capabilities. PinchBench tests what actually matters for coding agents:
- Tool usage — Can the model call the right tools with the right parameters?
- Multi-step reasoning — Can it chain together actions to complete complex tasks?
- Real-world messiness — Can it handle ambiguous instructions and incomplete information?
- Practical outcomes — Did it actually create the file, send the email, or schedule the meeting? (See the sketch after this list.)
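That last dimension means grading inspects the world state, not just the transcript. A minimal sketch of such an outcome check in Python; the function name and the expected `report.md` artifact are illustrative, not PinchBench's actual API:

```python
from pathlib import Path

def grade_file_creation(workspace: Path) -> bool:
    """Hypothetical outcome check: did the agent actually produce the file?

    Inspects the workspace the agent ran in rather than trusting the
    model's claim that it finished; "report.md" is an invented artifact.
    """
    report = workspace / "report.md"
    return report.exists() and report.stat().st_size > 0
```

The quick-start commands below clone the skill and run the suite.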
```bash
# Clone the skill
git clone https://github.com/pinchbench/skill.git
cd skill

# Run benchmarks with your model of choice
./scripts/run.sh --model anthropic/claude-sonnet-4

# Or run specific tasks
./scripts/run.sh --model openai/gpt-4o --suite task_01_calendar,task_02_stock
```

Requirements:
- Python 3.10+
- uv package manager
- A running OpenClaw instance
PinchBench includes 23 tasks across eight real-world categories:
| Category | Tasks | What's tested |
|---|---|---|
| Productivity | Calendar, daily summaries | Event creation, time parsing, scheduling |
| Research | Stock prices, conferences, markets | Web search, data extraction, synthesis |
| Writing | Blog posts, emails, humanization | Content generation, tone, formatting |
| Coding | Weather scripts, file structures | Code generation, file operations |
| Analysis | Spreadsheets, PDFs, documents | Data processing, summarization |
| Email | Triage, search | Inbox management, filtering |
| Memory | Context retrieval, knowledge management | Long-term memory, recall |
| Skills | ClawHub, skill discovery | OpenClaw ecosystem integration |
Each task is graded automatically, by an LLM judge, or both, combining objective checks with nuanced evaluation.
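A rough sketch of how the two grading modes can be combined. The function signature, rubric prompt, and scoring scheme below are assumptions for illustration, not the repository's actual grader:

```python
def grade_task(transcript: str, automated_pass: bool, judge) -> float:
    """Hypothetical hybrid grader.

    The objective check gates the score; an LLM judge (here an assumed
    callable that returns a number as text) rates the nuanced parts.
    """
    if not automated_pass:  # objective criterion failed outright
        return 0.0
    rubric = (
        "Rate this agent transcript from 0 to 10 for completeness and "
        "tone. Reply with a single number.\n\n" + transcript
    )
    return float(judge(rubric)) / 10.0
```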
To get your results on the leaderboard:
```bash
# Register for an API token (one-time)
./scripts/run.sh --register

# Run benchmark — results auto-upload with your token
./scripts/run.sh --model anthropic/claude-sonnet-4
```

Skip uploading with `--no-upload` if you just want local results.
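Under the hood, uploading presumably amounts to POSTing the results JSON with your token. A sketch of what `--upload FILE` might do, assuming a hypothetical endpoint and bearer-token auth; neither is confirmed by the repository:

```python
import urllib.request

def upload_results(path: str, token: str) -> int:
    """Hypothetical upload: POST a results JSON to the leaderboard.

    The endpoint URL and bearer-token scheme are assumptions; only the
    token itself (from --register) comes from the documented workflow.
    """
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        "https://pinchbench.com/api/submit",  # assumed endpoint
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```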
| Flag | Description |
|---|---|
| `--model MODEL` | Model to test (e.g., `anthropic/claude-sonnet-4`) |
| `--suite SUITE` | `all`, `automated-only`, or comma-separated task IDs |
| `--runs N` | Number of runs per task for averaging |
| `--timeout-multiplier N` | Scale timeouts for slower models |
| `--output-dir DIR` | Where to save results (default: `results/`) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request an API token for submissions |
| `--upload FILE` | Upload a previous results JSON |
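With `--runs N`, each task executes several times and the per-task score is the average. Assuming the results JSON maps task IDs to lists of per-run scores (the schema below is a guess, not the real file format), aggregation is just a mean per task:

```python
import json
from statistics import mean

def average_runs(results_path: str) -> dict[str, float]:
    """Hypothetical aggregation: mean score per task across --runs N runs.

    Assumes a layout like {"task_01_calendar": [0.8, 1.0, 0.9], ...};
    the real schema may differ.
    """
    with open(results_path) as f:
        runs_by_task = json.load(f)
    return {task: mean(scores) for task, scores in runs_by_task.items()}
```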
We welcome new tasks! Check out tasks/TASK_TEMPLATE.md for the format; a sketch of a candidate task follows the list below. Good tasks are:
- Real-world — Something an actual user would ask an agent to do
- Measurable — Clear success criteria that can be graded
- Reproducible — Same task should produce consistent grading
- Challenging — Tests agent capabilities, not just LLM knowledge
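As a concrete illustration of those four criteria, a new task entry might look something like this. Every field name here is invented; tasks/TASK_TEMPLATE.md is the authoritative format:

```python
# Hypothetical task entry illustrating the four criteria; see
# tasks/TASK_TEMPLATE.md for the authoritative format.
TASK = {
    "id": "task_24_expense_report",  # invented ID
    # Real-world: phrased as an actual user request.
    "prompt": "Summarize last month's expenses from expenses.csv "
              "into a one-page report.md.",
    # Measurable + reproducible: concrete, checkable success criteria.
    "success_criteria": [
        "report.md exists in the workspace",
        "totals in report.md match expenses.csv",
    ],
    # Challenging: requires file tools and multi-step work, not recall.
    "grading": "automated",
    "timeout_seconds": 300,
}
```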
- Leaderboard: pinchbench.com
- OpenClaw: github.com/openclaw/openclaw
- Issues: github.com/pinchbench/skill/issues
MIT — see LICENSE for details.
Claw-some AI agent testing 🦞
