Token Optimization
OCC is designed to minimize LLM token usage. This guide covers every technique available.
Traditional approaches stuff everything into one giant prompt. OCC decomposes work into focused steps, each receiving only the context it needs.
Traditional: 1 prompt × 40K tokens = 40K tokens
OCC: 6 prompts × ~2.5K tokens = ~15K tokens
The savings come from isolation — each step only pays for the tokens it actually needs.
Each step gets its own prompt with only its dependencies injected, so conversation history never accumulates.
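As a rough illustration of dependency-scoped injection, the sketch below resolves a step's prompt from the outputs of its declared dependencies only. It is not OCC's implementation: the dictionary layout and the helper names `visible_context` and `resolve_prompt` are assumptions made for this example, with field names borrowed from the chain format.

```python
# Hypothetical in-memory chain; keys mirror the YAML fields (id, depends_on, prompt, output_var).
steps = [
    {"id": "a", "depends_on": [], "prompt": "Research solar panels", "output_var": "research"},
    {"id": "b", "depends_on": ["a"], "prompt": "Analyze: {research}", "output_var": "analysis"},
    {"id": "c", "depends_on": ["b"], "prompt": "Summarize: {analysis}", "output_var": "summary"},
]


def visible_context(step, steps, outputs):
    """Return only the outputs produced by the step's declared dependencies."""
    by_id = {s["id"]: s for s in steps}
    names = [by_id[dep]["output_var"] for dep in step["depends_on"]]
    return {name: outputs[name] for name in names}


def resolve_prompt(step, steps, outputs):
    """Fill the prompt template using the dependency outputs and nothing else."""
    return step["prompt"].format(**visible_context(step, steps, outputs))


# Pretend steps a and b have already run.
outputs = {"research": "RESEARCH " * 300, "analysis": "ANALYSIS " * 200}
prompt_c = resolve_prompt(steps[2], steps, outputs)
```

Because step `c` declares only `b` as a dependency, `prompt_c` contains the analysis text and none of the research text. In this sketch, a template that references an undeclared variable raises `KeyError`, which makes an accidental dependency visible early.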
# Step A outputs 3000 tokens of research
# Step B outputs 2000 tokens of analysis
# Step C only needs Step B's output, so it never sees Step A's 3000 tokens
steps:
  - id: a
    prompt: "Research {input.topic}"
    output_var: research
  - id: b
    depends_on: [a]
    prompt: "Analyze: {research}"
    output_var: analysis
  - id: c
    depends_on: [b]                   # Only depends on B, not A
    prompt: "Summarize: {analysis}"   # Receives ~2000 tokens, not 5000
    output_var: summary

Transform steps manipulate data without LLM calls:
# Research step returns 5000 tokens of JSON
- id: research
  prompt: "Research and return JSON with key_findings, sources, raw_data"
  output_var: raw_research
# Extract only key_findings: costs 0 tokens
- id: extract
  type: transform
  operation: json_extract
  input_var: raw_research
  json_path: "key_findings"
  prompt: "extract"
  output_var: findings   # Now ~500 tokens
# Next step receives 500 tokens instead of 5000
- id: report
  depends_on: [extract]
  prompt: "Write report from: {findings}"
  output_var: report

Available zero-token operations: `json_extract`, `regex_match`, `template`, `split`, `merge`, `truncate`, `replace`, `filter`, `map`, `join`, `to_json`, `from_json`
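To show why a transform costs no tokens, here is a minimal sketch of what a `json_extract` operation could do locally. The function below and its dot-separated path handling are assumptions for illustration; OCC's actual path syntax may differ.

```python
import json


def json_extract(raw: str, json_path: str):
    """Parse JSON text and walk a dot-separated path (assumed path semantics)."""
    value = json.loads(raw)
    for key in json_path.split("."):
        # Numeric segments index into lists; other segments look up object keys.
        value = value[int(key)] if isinstance(value, list) else value[key]
    return value


# Stand-in for a large research payload returned by an earlier LLM step.
raw_research = json.dumps({
    "key_findings": ["Finding 1", "Finding 2"],
    "sources": ["https://example.com/a"],
    "raw_data": "x" * 5000,
})

findings = json_extract(raw_research, "key_findings")
```

Only `findings` would be passed to the next step; the bulky `raw_data` field never reaches a prompt, and no model call is involved in producing it.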
Use cheap models for simple tasks, expensive models for critical ones:
steps:
  # Simple classification: use Haiku ($0.25/M input)
  - id: classify
    model: claude-haiku-4-5
    prompt: "Classify this text: {input.text}. Return: positive, negative, or neutral."
    output_var: sentiment
  # Complex analysis: use Sonnet ($3/M input)
  - id: analyze
    model: claude-sonnet-4-6
    depends_on: [classify]
    prompt: "Deep analysis of {input.text} (classified as {sentiment})"
    output_var: analysis
  # Critical synthesis: use Opus ($15/M input)
  - id: synthesize
    model: claude-opus-4-6
    depends_on: [analyze]
    prompt: "Produce final executive report from: {analysis}"
    output_var: report

Cost breakdown:
| Step | Model | Input tokens | Cost |
|---|---|---|---|
| classify | Haiku | ~200 | $0.00005 |
| analyze | Sonnet | ~2000 | $0.006 |
| synthesize | Opus | ~3000 | $0.045 |
| Total | | ~5200 | $0.051 |
Compared with running everything on Opus: ~8000 tokens × $15/M = $0.12, about 2.4x the cost.
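The arithmetic behind these figures can be reproduced with a few lines. This sketch uses only the per-million input prices and token counts quoted in this section; output-token costs are ignored, as they are in the table.

```python
# Input prices in USD per million tokens, as quoted in this section.
PRICE_PER_MILLION = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}


def input_cost(tokens: int, tier: str) -> float:
    """Cost in USD for the given number of input tokens on a pricing tier."""
    return tokens * PRICE_PER_MILLION[tier] / 1_000_000


# (step, tier, approximate input tokens) from the cost breakdown table.
routed = [
    ("classify", "haiku", 200),
    ("analyze", "sonnet", 2000),
    ("synthesize", "opus", 3000),
]

routed_total = sum(input_cost(tokens, tier) for _, tier, tokens in routed)
opus_only = input_cost(8000, "opus")
ratio = opus_only / routed_total

print(f"routed total: ${routed_total:.5f}")
print(f"opus only:    ${opus_only:.2f} ({ratio:.1f}x)")
```

Swapping in current prices or measured token counts gives a quick estimate of whether routing a step to a cheaper model is worth it.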
Identical prompts skip the LLM entirely:
- id: expensive_research
  cache:
    enabled: true
    ttl_minutes: 120   # Cache for 2 hours
  prompt: "Research {input.topic}"
  output_var: research

Cache key = hash of (step ID + resolved prompt + model). If the same chain runs with the same inputs within the TTL, the cached result is returned instantly (0 tokens, 0 cost, ~1ms).
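A minimal sketch of how such a cache could work is shown below. The choice of SHA-256, the field separator, and the `TTLCache` class are assumptions for illustration; the text above only states that the key is a hash of the step ID, resolved prompt, and model.

```python
import hashlib
import time


def cache_key(step_id: str, resolved_prompt: str, model: str) -> str:
    """Hash the three fields named above; the separator avoids ambiguous joins."""
    payload = "\x1f".join([step_id, resolved_prompt, model]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class TTLCache:
    """In-memory cache whose entries expire after ttl_minutes (hypothetical helper)."""

    def __init__(self, ttl_minutes: float, clock=time.monotonic):
        self.ttl_seconds = ttl_minutes * 60
        self.clock = clock
        self._store = {}

    def put(self, key: str, value: str) -> None:
        self._store[key] = (self.clock(), value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._store[key]  # Expired: drop the entry and report a miss.
            return None
        return value


cache = TTLCache(ttl_minutes=120)
key = cache_key("expensive_research", "Research solar panels", "claude-sonnet-4-6")
if cache.get(key) is None:
    cache.put(key, "...LLM output would be stored here...")
```

Because the resolved prompt is part of the key, any change to an input that appears in the prompt produces a different key and therefore a fresh LLM call.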
Skip expensive steps when they're not needed:
- id: quick_check
  model: claude-haiku-4-5
  prompt: "Does this code have security issues? Answer YES or NO: {input.code}"
  output_var: has_issues
- id: deep_audit
  depends_on: [quick_check]
  condition: '{has_issues} contains "YES"'   # Only runs if issues found
  model: claude-opus-4-6
  prompt: "Full security audit of: {input.code}"
  output_var: audit

If has_issues is "NO", the condition is false and the deep_audit step is skipped, which can save thousands of tokens.
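The condition string can be read as a tiny expression over step outputs. The evaluator below is a sketch under assumptions: the grammar (one variable, one operator, one quoted literal) and the function name `evaluate` are invented for illustration and may not match how OCC parses conditions.

```python
import re

# Assumed grammar: {variable} <operator> "literal"
_CONDITION = re.compile(r'^\{(\w+)\}\s+(contains|==|!=)\s+"(.*)"$')


def evaluate(condition: str, variables: dict) -> bool:
    """Evaluate a single-comparison condition against step output variables."""
    match = _CONDITION.match(condition.strip())
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    name, operator, literal = match.groups()
    value = variables[name].strip()
    if operator == "contains":
        return literal in value
    if operator == "==":
        return value == literal
    return value != literal


should_audit = evaluate('{has_issues} contains "YES"', {"has_issues": "YES"})
skip_audit = not evaluate('{has_issues} contains "YES"', {"has_issues": "NO"})
```

Note that a substring check like this is case-sensitive, so a model reply of "Yes." would not match `"YES"`; constraining the answer format in the prompt, as the example does, keeps the check reliable.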
Stop the chain when the answer is found:
- id: check_cache
  prompt: "Is this answer in the knowledge base? {input.question}"
  output_var: cached_answer
  early_exit_if: '{cached_answer} != "NOT_FOUND"'
- id: research
  depends_on: [check_cache]
  prompt: "Research: {input.question}"          # Never runs if cache hit
  output_var: research
- id: synthesize
  depends_on: [research]
  prompt: "Answer from research: {research}"    # Never runs if cache hit
  output_var: answer

Control how dependency outputs are compressed:
- id: final_step
  depends_on: [big_research, small_facts]
  context_strategy:
    big_research: "truncate:2000"   # Trim to 2000 chars
    small_facts: "full"             # Keep as-is
  prompt: |
    Research (condensed): {big_research}
    Key facts: {small_facts}
  output_var: result

Options: `full` (default), `summarize` (LLM compression), `truncate:N` (hard cut at N chars)
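The two zero-token options can be sketched in a few lines. The function name `apply_context_strategy` is hypothetical, and `summarize` is left unimplemented here because it requires an LLM call.

```python
def apply_context_strategy(text: str, strategy: str = "full") -> str:
    """Apply a context strategy string to one dependency's output (illustrative)."""
    if strategy == "full":
        return text
    if strategy.startswith("truncate:"):
        limit = int(strategy.split(":", 1)[1])
        return text[:limit]  # Hard cut at `limit` characters.
    if strategy == "summarize":
        raise NotImplementedError("summarize needs an LLM call and is out of scope here")
    raise ValueError(f"unknown context strategy: {strategy!r}")


big_research = "x" * 5000
condensed = apply_context_strategy(big_research, "truncate:2000")
small_facts = apply_context_strategy("Three key facts.", "full")
```

Truncation is measured in characters rather than tokens, so the token saving is approximate; it is still the cheapest way to cap a large dependency.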
When combining parallel outputs, choose the cheapest strategy:
- id: combine
  type: merge
  inputs: [research_a, research_b, research_c]
  strategy: pick_best   # LLM picks 1 of 3, cheaper than summarizing all 3
  prompt: "Pick the most relevant research."
  output_var: best_research

| Strategy | LLM cost | Output size |
|---|---|---|
| `concatenate` | 0 tokens | Sum of all inputs |
| `json_array` | 0 tokens | Sum of all inputs |
| `pick_best` | ~input size | 1 input's worth |
| `llm_summarize` | ~input size | Compressed |
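The two zero-token strategies amount to plain string handling, as the sketch below suggests. The `merge` function and the blank-line separator are assumptions for illustration; `pick_best` and `llm_summarize` are omitted because they need a model call.

```python
import json


def merge(inputs: list, strategy: str) -> str:
    """Combine parallel step outputs without calling an LLM (illustrative)."""
    if strategy == "concatenate":
        return "\n\n".join(inputs)  # Separator is an assumption.
    if strategy == "json_array":
        return json.dumps(inputs)
    raise ValueError(f"strategy {strategy!r} requires an LLM or is unknown")


research = ["Result A", "Result B", "Result C"]
as_text = merge(research, "concatenate")
as_json = merge(research, "json_array")
```

Both outputs are as large as the sum of their inputs, so these strategies save tokens at the merge step but not in whichever step consumes the merged result.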
Bad: asking Claude to search inside the prompt (costs tokens for the tool call overhead):
prompt: "Search the web for {input.topic} and then analyze the results"

Good: a pre-tool does the search, so the data is ready:
pre_tools:
  - type: web_search
    query: "{input.topic} latest research"
    inject_as: data
prompt: "Analyze this data: {data}"

The pre-tool approach is cleaner and cheaper because the LLM doesn't waste tokens on tool-calling overhead.
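Conceptually, a pre-tool runs before the prompt is built and its result is injected as an ordinary variable. The sketch below illustrates that order of operations; `fake_web_search`, `TOOLS`, and `build_prompt` are hypothetical stand-ins, not OCC APIs, and the search returns canned text.

```python
from types import SimpleNamespace


def fake_web_search(query: str) -> str:
    """Stand-in for a real search tool; returns canned text."""
    return f"[results for: {query}]"


# Hypothetical registry mapping pre-tool types to callables.
TOOLS = {"web_search": fake_web_search}


def build_prompt(step: dict, inputs: dict) -> str:
    """Run each pre-tool, inject its result, then fill the prompt template."""
    context = {"input": SimpleNamespace(**inputs)}
    for pre in step.get("pre_tools", []):
        query = pre["query"].format(**context)
        context[pre["inject_as"]] = TOOLS[pre["type"]](query)
    return step["prompt"].format(**context)


step = {
    "pre_tools": [
        {"type": "web_search", "query": "{input.topic} latest research", "inject_as": "data"},
    ],
    "prompt": "Analyze this data: {data}",
}

prompt = build_prompt(step, {"topic": "solid-state batteries"})
```

The model receives a single prompt that already contains the search results, so no tokens are spent on deciding to call a tool or on formatting the call.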
| Chain type | Steps | Avg tokens/step | Total | Est. cost (Sonnet) |
|---|---|---|---|---|
| Simple (3 steps) | 3 | 1500 | 4.5K | $0.014 |
| Medium (6 steps) | 6 | 2500 | 15K | $0.045 |
| Complex (12 steps) | 12 | 3000 | 36K | $0.108 |
| Pipeline (3 chains × 5 steps) | 15 | 2000 | 30K | $0.090 |
Caching can reduce repeat executions to $0.
- Chain Format — Context strategy and caching configuration
- Step Types — Transform steps (zero-token operations)
- Architecture — How step isolation works internally