onlycodes

A benchmark testing whether Claude Code performs better when restricted to writing and executing code directly, rather than using its native file-system tools (Read, Grep, Glob, Edit, etc.).

Hypothesis

Forcing the model to solve each task by writing a single script, rather than making multiple fine-grained tool calls, should reduce turns, lower token costs, and shorten completion time. The "only code" approach is implemented via an MCP server exposing a single `execute_code` tool, which provides a sandboxed Python/Bash execution environment.
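To make the contrast concrete, here is a minimal sketch of the kind of single script that could be submitted to `execute_code` in place of separate Glob, Grep, and Read calls, using the `server_url` search from the task list as the example. The function name `find_identifier` and the whole-word matching rule are illustrative assumptions; they are not taken from the benchmark logs.

```python
import re
from pathlib import Path


def find_identifier(root, name):
    """Return (relative_path, line_number) pairs where `name` appears as a whole word.

    Hypothetical helper: one pass over the tree replaces a Glob call,
    a Grep call, and any follow-up Read calls.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno))
    return hits


if __name__ == "__main__":
    # Path taken from the fixture project named in this README.
    for rel, lineno in find_identifier("fixtures/myapp", "server_url"):
        print(f"{rel}:{lineno}")
```

Everything the model needs comes back in one tool result, which is where the expected reduction in turns would come from.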

Approaches

| Approach | Description |
| --- | --- |
| Baseline | All native tools available (Read, Grep, Glob, Edit, Bash, etc.) |
| Only code (no native tools) | Restricted to a single `execute_code` MCP tool; must write one script per task |

Benchmark Tasks

Five tasks against a shared Python fixture project (fixtures/myapp/):

1. Find all Python files that import `os` or `os.path`, listing file paths and line numbers
2. Find all `os.environ.get()` references that are missing from `.env.example`
3. Run the pytest suite and report total/passed/failed with exact failure names
4. Find every file containing the variable name `server_url`
5. Add a `--dry-run` flag to `myapp/cli.py` that prints intent and exits without calling `start()`
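As an illustration of what a one-script solution to the missing-env-vars task might look like, the sketch below collects every name passed to `os.environ.get()` and subtracts the names defined in `.env.example`. The helper names, the regular expression, and the assumption that `.env.example` uses plain `NAME=value` lines are all hypothetical; the actual fixture layout and oracle may differ.

```python
import re
from pathlib import Path

# Matches os.environ.get("NAME") or os.environ.get('NAME', default).
# Assumption: names are passed as string literals, not variables.
ENV_GET = re.compile(r"""os\.environ\.get\(\s*["']([A-Za-z_][A-Za-z0-9_]*)["']""")


def referenced_env_vars(root):
    """Collect every env var name read via os.environ.get() under `root`."""
    names = set()
    for path in Path(root).rglob("*.py"):
        names.update(ENV_GET.findall(path.read_text(errors="ignore")))
    return names


def documented_env_vars(env_example):
    """Collect names from NAME=value lines, skipping blanks and comments."""
    names = set()
    for line in Path(env_example).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            names.add(line.split("=", 1)[0].strip())
    return names


def missing_env_vars(root, env_example):
    """Names referenced in code but absent from the example env file."""
    return sorted(referenced_env_vars(root) - documented_env_vars(env_example))
```

A call such as `missing_env_vars("fixtures/myapp", "fixtures/.env.example")` would return the sorted list of undocumented names; the location of `.env.example` is a guess and should be adjusted to the fixture.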

Results

| Task | Baseline Cost | Only Code Cost | Savings | Baseline Time | Only Code Time | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| 1 - OS imports | $0.0833 | $0.0653 | 22% | 18.8s | 13.2s | 1.4× |
| 2 - Missing env vars | $0.1390 | $0.0676 | 51% | 28.4s | 7.9s | 3.6× |
| 3 - Run pytest | $0.0985 | $0.0750 | 24% | 23.4s | 11.1s | 2.1× |
| 4 - Find server_url | $0.0756 | $0.0525 | 31% | 15.9s | 10.3s | 1.5× |
| 5 - Add --dry-run | $0.1031 | $0.0796 | 23% | 8.2s | 5.0s | 1.6× |
| Total | $0.4995 | $0.3400 | 32% | ~94.7s | ~47.5s | 2.0× |

All 10 runs completed successfully (100% task success rate).

The "only code" approach was 2× faster and 32% cheaper overall. Task 2 shows the largest gap: baseline hit a permission denial mid-task, retried across 13 turns, and accumulated 175k cache-read tokens vs 7.5k for the single-shot script approach.
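The Savings and Speedup columns can be recomputed from the raw costs and times in the results table. The short sketch below assumes savings means the relative cost reduction against the baseline and speedup means baseline time divided by only-code time; the README does not state the formulas, but these definitions reproduce the published figures after rounding.

```python
# (task, baseline_cost, only_code_cost, baseline_seconds, only_code_seconds),
# copied from the results table.
ROWS = [
    ("1 OS imports", 0.0833, 0.0653, 18.8, 13.2),
    ("2 Missing env vars", 0.1390, 0.0676, 28.4, 7.9),
    ("3 Run pytest", 0.0985, 0.0750, 23.4, 11.1),
    ("4 Find server_url", 0.0756, 0.0525, 15.9, 10.3),
    ("5 Add --dry-run", 0.1031, 0.0796, 8.2, 5.0),
]


def savings_pct(baseline_cost, only_code_cost):
    """Relative cost reduction versus the baseline, in percent."""
    return 100 * (baseline_cost - only_code_cost) / baseline_cost


def speedup(baseline_seconds, only_code_seconds):
    """How many times faster the only-code run finished."""
    return baseline_seconds / only_code_seconds


# Column totals across the five tasks.
TOTALS = tuple(sum(row[i] for row in ROWS) for i in range(1, 5))

for name, bc, oc, bt, ot in ROWS + [("Total",) + TOTALS]:
    print(f"{name}: savings {savings_pct(bc, oc):.0f}%, speedup {speedup(bt, ot):.1f}x")
```

Note that the totals are computed from summed costs and times, not by averaging the per-task percentages, which is why the overall speedup (2.0×) is not the mean of the individual speedups.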

Repository Structure

fixtures/          # Benchmark fixture project (myapp/ + tests/)
fixtures_requests/ # Alternate fixture set (HTTP/requests-based tasks)
oracle/            # Ground-truth answers for grading each task
results/           # Baseline vs constrained run logs (JSONL)
results_mcp/       # Only-code (MCP) run logs (JSONL)
exec-server.js     # MCP server exposing execute_code tool (stdio transport)
exec-server.bundle.mjs  # Bundled version for fast startup (~130ms vs ~3.9s)
mcp-config.json    # MCP server config for --mcp-config CLI flag
run_prevalidation.sh       # Baseline vs constrained benchmark runner
run_mcp_integration_test.sh  # Only-code (MCP) benchmark runner

Running the Benchmark

Baseline (all tools):

./run_prevalidation.sh

Only code (MCP):

./run_mcp_integration_test.sh

Results are written as JSONL streams to results/ and results_mcp/ respectively. Grade against the oracle files in oracle/.
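A grading pass could be sketched roughly as follows. This is an assumption-heavy illustration: it supposes each log line is a JSON object, that the final answer lives in the last record whose `type` is `"result"`, and that an oracle file can be compared as an unordered set of lines. None of these details are confirmed by this README, so check the field names and oracle format against the actual files before relying on it.

```python
import json
from pathlib import Path


def final_result(log_path):
    """Return the last JSON record whose "type" is "result", or None.

    Assumption: the run log is JSONL and marks its final record this way.
    """
    last = None
    for line in Path(log_path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON noise in the stream
        if isinstance(record, dict) and record.get("type") == "result":
            last = record
    return last


def normalized_lines(text):
    """Set of non-empty lines with surrounding whitespace removed."""
    return {ln.strip() for ln in text.splitlines() if ln.strip()}


def matches_oracle(answer_text, oracle_text):
    """True when answer and oracle contain the same set of lines.

    Hypothetical grading rule: order and blank lines are ignored.
    """
    return normalized_lines(answer_text) == normalized_lines(oracle_text)
```

A strict set comparison suits the search-style tasks; the pytest and `--dry-run` tasks would likely need task-specific checks.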
