"LLMs are better at writing code to call tools than at calling tools directly." — Cloudflare Code Mode Research
A comprehensive benchmark comparing Code Mode (code generation) vs Traditional Function Calling for LLM tool interactions. Demonstrates that Code Mode achieves 60% faster execution, 68% fewer tokens, and 88% fewer API round trips while maintaining equal accuracy.
| Metric | Regular Agent | Code Mode | Improvement |
|---|---|---|---|
| Average Latency | 11.88s | 4.71s | 60.4% faster ⚡ |
| API Round Trips | 8.0 iterations | 1.0 iteration | 87.5% reduction 🔄 |
| Token Usage | 144,250 tokens | 45,741 tokens | 68.3% savings 💰 |
| Success Rate | 6/8 (75%) | 7/8 (88%) | +13 points higher ✅ |
| Validation Accuracy | 100% | 100% | Equal accuracy |
Annual Cost Savings: $9,536/year at 1,000 scenarios/day (Claude Haiku pricing)
📊 View Full Results | 📈 Raw Data Tables
- Python 3.11+
- Anthropic API key (for Claude)
- Google API key (for Gemini, optional)
# Clone the repository
git clone <repository-url>
cd codemode_benchmark
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env and add your API keys
# Run full benchmark with Claude
make run
# Run with Gemini
python benchmark.py --model gemini
# Run specific scenario
python benchmark.py --scenario 1
# Run limited scenarios
python benchmark.py --limit 3
codemode_benchmark/
├── README.md # This file
├── benchmark.py # Main benchmark runner
├── requirements.txt # Python dependencies
├── Makefile # Convenient commands
│
├── agents/ # Agent implementations
│ ├── __init__.py
│ ├── codemode_agent.py # Code Mode (code generation)
│ ├── regular_agent.py # Traditional function calling
│ ├── gemini_codemode_agent.py # Gemini Code Mode
│ └── gemini_regular_agent.py # Gemini function calling
│
├── tools/ # Tool definitions
│ ├── __init__.py
│ ├── business_tools.py # Accounting/invoicing tools
│ ├── accounting_tools.py # Core accounting logic
│ └── example_tools.py # Simple example tools
│
├── sandbox/ # Secure code execution
│ ├── __init__.py
│ └── executor.py # RestrictedPython sandbox
│
├── tests/ # Test files
│ ├── test_api.py
│ ├── test_scenarios.py # Scenario definitions
│ └── ...
│
├── debug/ # Debug scripts (development)
│ └── debug_*.py
│
├── docs/ # Documentation
│ ├── BENCHMARK_SUMMARY.md # Comprehensive analysis
│ ├── RESULTS_DATA.md # Raw data tables
│ ├── QUICKSTART.md # Quick start guide
│ ├── TOOLS.md # Tool API documentation
│ ├── CHANGELOG.md # Version history
│ └── GEMINI.md # Gemini-specific notes
│
└── results/ # Benchmark results
├── benchmark_results_claude.json
├── benchmark_results_gemini.json
├── results.log
└── results-gemini.log
User Query → LLM → Tool Call #1 → Execute → Result
↓
LLM processes result → Tool Call #2 → Execute → Result
↓
[Repeat 5-16 times...]
↓
Final Response
Problems:
- Multiple API round trips
- Neural network processing between each tool call
- Context grows with each iteration
- High latency and token costs
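This loop is what drives the round-trip count. A minimal sketch of the traditional pattern with the Anthropic Messages API (illustrative only, not the repo's `regular_agent.py`; `TOOL_SCHEMAS` and `run_tool` are placeholder names):

```python
import anthropic

# Placeholder tool schema and dispatcher (illustrative only).
TOOL_SCHEMAS = [{
    "name": "create_transaction",
    "description": "Record a transaction",
    "input_schema": {"type": "object", "properties": {"amount": {"type": "number"}}},
}]

def run_tool(name, args):  # hypothetical dispatcher
    return f"{name} executed with {args}"

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Record my monthly expenses"}]

while True:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        tools=TOOL_SCHEMAS,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # model produced its final answer
    # Every tool call costs another full API round trip with the growing context
    messages.append({"role": "assistant", "content": response.content})
    tool_results = [
        {"type": "tool_result", "tool_use_id": b.id, "content": run_tool(b.name, b.input)}
        for b in response.content if b.type == "tool_use"
    ]
    messages.append({"role": "user", "content": tool_results})
```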
User Query → LLM generates complete code → Executes all tools → Final Response
Advantages:
- Single code generation pass
- Batch multiple operations
- No context re-processing
- Natural programming constructs (loops, variables, conditionals)
Example:
Regular Agent sees this as 3 separate tool calls:
{"name": "create_transaction", "input": {"amount": 2500, ...}}
{"name": "create_transaction", "input": {"amount": 150, ...}}
{"name": "get_financial_summary", "input": {}}
Code Mode generates efficient code:
import json

expenses = [
    ("rent", 2500, "Monthly rent"),
    ("utilities", 150, "Electricity"),
]
for category, amount, desc in expenses:
    tools.create_transaction("expense", category, amount, desc)

summary = json.loads(tools.get_financial_summary())
result = f"Total: ${summary['summary']['total_expenses']}"
The benchmark includes 8 realistic business scenarios:
- Monthly Expense Recording - Record 4 expenses and generate summary
- Client Invoicing Workflow - Create 2 invoices, update status, summarize
- Payment Processing - Create invoice, process partial payments
- Mixed Income/Expense Tracking - 7 transactions with financial analysis
- Multi-Account Management - Complex transfers between 3 accounts
- Quarter-End Analysis - Simulate 3 months of business activity
- Complex Multi-Client Invoicing - 3 invoices with partial payments (16 operations)
- Budget Tracking - 14 categorized expenses with analysis
Each scenario includes automated validation to ensure correctness.
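As an illustration of what such a check can look like (hypothetical; the real assertions live in `tests/test_scenarios.py` and use each scenario's actual fixtures):

```python
# Hypothetical validation for Scenario 1 (illustrative; the real assertions
# live in tests/test_scenarios.py).
import json

def validate_monthly_expenses(tools, expected_total: float) -> bool:
    summary = json.loads(tools.get_financial_summary())
    # The agent passes only if the ledger reflects exactly the requested expenses.
    return summary["summary"]["total_expenses"] == expected_total
```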
from typing import Any, Dict

class CodeModeAgent:
    def run(self, user_message: str) -> Dict[str, Any]:
        # 1. Send message with tools API documentation
        response = self.client.messages.create(
            system=self._create_system_prompt(),  # Contains tools API
            messages=[{"role": "user", "content": user_message}],
        )
        # 2. Extract generated code
        code = extract_code_from_response(response)
        # 3. Execute in sandbox
        result = self.executor.execute(code)
        return result
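The `extract_code_from_response` helper is essentially a fence parser. A minimal sketch, assuming the model returns its code inside a Markdown code fence (the repo's helper may handle more edge cases):

```python
# Sketch of a fence-parsing helper (illustrative; the repo's
# extract_code_from_response may handle more cases).
import re

def extract_code_from_response(response) -> str:
    """Return the first fenced code block from the model's reply, else the raw text."""
    text = "".join(block.text for block in response.content if block.type == "text")
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text
```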
from typing import TypedDict, Literal

class TransactionResponse(TypedDict):
    status: Literal["success"]
    transaction: TransactionDict
    new_balance: float

def create_transaction(
    transaction_type: Literal["income", "expense", "transfer"],
    category: str,
    amount: float,
    description: str,
    account: str = "checking",
) -> str:
    """
    Create a new transaction.

    Returns: JSON string with TransactionResponse structure

    Example:
        result = tools.create_transaction("expense", "rent", 2500.0, "Monthly rent")
        data = json.loads(result)
        print(data["new_balance"])  # 7500.0
    """
    # Implementation...
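`TransactionDict`, referenced above, is defined alongside the tools. A plausible shape, purely as an assumption inferred from the `create_transaction` parameters (see `tools/business_tools.py` for the real definition):

```python
# Assumed shape for TransactionDict (illustrative; the actual fields
# in tools/business_tools.py may differ).
from typing import TypedDict

class TransactionDict(TypedDict):
    transaction_type: str
    category: str
    amount: float
    description: str
    account: str
```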
Code execution uses RestrictedPython for sandboxing:
- No filesystem access
- No network access
- No dangerous imports
- Controlled builtins
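A minimal sketch of how such an executor can be wired up with RestrictedPython (illustrative; `sandbox/executor.py` is more complete, and the exact guards and exposed globals here are assumptions):

```python
# Sketch of a RestrictedPython-based executor (illustrative only).
import json
from RestrictedPython import compile_restricted, safe_globals
from RestrictedPython.Guards import guarded_iter_unpack_sequence, safer_getattr
from RestrictedPython.Eval import default_guarded_getitem, default_guarded_getiter

def execute(code: str, tools) -> dict:
    byte_code = compile_restricted(code, filename="<agent_code>", mode="exec")
    # Only whitelisted names are visible to generated code: safe builtins,
    # the guard functions RestrictedPython requires, json, and the tools object.
    env = dict(
        safe_globals,
        _getattr_=safer_getattr,
        _getitem_=default_guarded_getitem,
        _getiter_=default_guarded_getiter,
        _iter_unpack_sequence_=guarded_iter_unpack_sequence,
        json=json,
        tools=tools,
        result=None,
    )
    exec(byte_code, env)
    return {"result": env.get("result")}
```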
| Complexity | Scenarios | Avg Speedup | Avg Token Savings |
|---|---|---|---|
| High (10+ ops) | 2 | 79.2% | 36,389 tokens |
| Medium (5-9 ops) | 3 | 47.5% | 8,774 tokens |
| Low (3-4 ops) | 1 | 45.3% | 6,209 tokens |
Key Insight: Code Mode advantage scales with complexity, but even simple tasks benefit significantly.
| Daily Volume | Regular Annual | Code Mode Annual | Annual Savings |
|---|---|---|---|
| 100 | $252 | $77 | $175 |
| 1,000 | $2,519 | $766 | $1,753 |
| 10,000 | $25,185 | $7,665 | $17,520 |
| 100,000 | $251,850 | $76,650 | $175,200 |
(Based on Claude Haiku pricing: $0.25/1M input, $1.25/1M output)
- Model: Claude 3 Haiku
- Performance: 60.4% faster, 68.3% fewer tokens
- Best For: Cost-sensitive production workloads
- Status: ✅ Fully tested (8/8 scenarios)
- Model: Gemini 2.0 Flash Experimental
- Performance: 15.1% faster, 70.6% fewer iterations
- Best For: Low-latency requirements
- Status: ✅ Partially tested (2/8 scenarios)
- Note: Faster baseline but more verbose code generation
# Run all tests
make test
# Run specific test file
python -m pytest tests/test_scenarios.py
# Test Code Mode agent directly
python agents/codemode_agent.py
# Test Regular Agent directly
python agents/regular_agent.py
# Test sandbox execution
python sandbox/executor.py
- Benchmark Summary - Comprehensive analysis with insights
- Results Data - Raw performance tables
- Quick Start Guide - Step-by-step setup
- Tools Documentation - Available tools and API
- Changelog - Version history
- Gemini Notes - Gemini-specific information
- **Batching Advantage**
  - Single code block replaces multiple API calls
  - No neural network processing between operations
  - Example: 16 iterations → 1 iteration (Scenario 7)
- **Cognitive Efficiency**
  - LLMs have extensive training on code generation
  - Natural programming constructs (loops, variables, conditionals)
  - TypedDict provides clear type contracts
- **Computational Efficiency**
  - No context re-processing between tool calls
  - Direct code execution in sandbox
  - Reduced token overhead
- ✅ Multi-step workflows - Greatest benefit with many operations
- ✅ Complex business logic - Invoicing, accounting, data processing
- ✅ Batch operations - Similar actions on multiple items
- ✅ Cost-sensitive workloads - Production at scale
- ✅ Latency-critical applications - User-facing systems
- Use TypedDict for response types - Provides clear structure to LLM
- Include examples in docstrings - Shows correct usage patterns
- Batch similar operations - Leverage loops in code
- Validate results - Automated checks ensure correctness
- Handle errors gracefully - Try-except in generated code (see the sketch below)
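For that last point, the kind of pattern generated code is nudged toward (amounts and categories here are illustrative):

```python
# Illustrative error-handling pattern for generated code (values are examples;
# `tools` is the object exposed inside the sandbox).
import json

failed = []
for category, amount, desc in [("rent", 2500, "Monthly rent"), ("utilities", 150, "Electricity")]:
    try:
        tools.create_transaction("expense", category, amount, desc)
    except Exception as exc:
        # Record the failure instead of aborting the whole batch
        failed.append((category, str(exc)))

summary = json.loads(tools.get_financial_summary())
result = f"Recorded {2 - len(failed)} expenses; failures: {failed}"
```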
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run tests (`make test`)
- Commit (`git commit -m 'Add amazing feature'`)
- Push (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Cloudflare Code Mode Blog Post
- Anthropic Building Effective Agents
- Claude API Documentation
- Gemini API Documentation
- RestrictedPython Documentation
MIT License - See LICENSE file for details
- Inspired by Cloudflare's Code Mode research
- Built on Anthropic's Building Effective Agents framework
- Uses RestrictedPython for secure code execution
For questions or feedback, please open an issue on GitHub.
**Benchmark Date:** January 2025
**Models Tested:** Claude 3 Haiku, Gemini 2.0 Flash Experimental
**Test Scenarios:** 8 realistic business workflows
**Result:** Code Mode is 60% faster, uses 68% fewer tokens, with equal accuracy