Token Optimization

lacause edited this page Mar 29, 2026 · 1 revision

OCC is designed to minimize LLM token usage. This guide covers every technique available.

The Core Principle

Traditional approaches stuff everything into one giant prompt. OCC decomposes work into focused steps, each receiving only the context it needs.

Traditional:  1 prompt × 40K tokens = 40K tokens
OCC:          6 prompts × ~2.5K tokens = ~15K tokens

The savings come from isolation — each step only pays for the tokens it actually needs.

Technique 1: Step Isolation

Each step gets its own prompt with only its dependencies injected. No conversation history accumulation.

# Step A outputs 3000 tokens of research
# Step B outputs 2000 tokens of analysis
# Step C only needs Step B's output — it never sees Step A's 3000 tokens

steps:
  - id: a
    prompt: "Research {input.topic}"
    output_var: research

  - id: b
    depends_on: [a]
    prompt: "Analyze: {research}"
    output_var: analysis

  - id: c
    depends_on: [b]           # Only depends on B, not A
    prompt: "Summarize: {analysis}"  # Receives ~2000 tokens, not 5000
    output_var: summary
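
To make the isolation concrete, here is a minimal Python sketch of how a runner could build a step's prompt from only the variables that step references. `resolve_prompt` and the `outputs` dictionary are hypothetical names for illustration, not OCC's API, and OCC's real placeholder handling may differ.

```python
import re

def resolve_prompt(template: str, outputs: dict[str, str]) -> str:
    """Replace {name} placeholders with values from `outputs`.

    Hypothetical helper: unknown placeholders are left untouched.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return outputs.get(name, match.group(0))

    return re.sub(r"\{([\w.]+)\}", substitute, template)

# Outputs produced so far, keyed by output_var.
outputs = {
    "research": "R" * 3000,  # stand-in for Step A's long output
    "analysis": "A" * 2000,  # stand-in for Step B's output
}

# Step C references only {analysis}, so Step A's output never enters its prompt.
prompt_c = resolve_prompt("Summarize: {analysis}", outputs)
```

Because nothing is appended to a running conversation, the size of `prompt_c` depends only on the template and the one variable it names.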

Technique 2: Transform Steps (Zero Tokens)

Transform steps manipulate data without LLM calls:

# Research step returns 5000 tokens of JSON
- id: research
  prompt: "Research and return JSON with key_findings, sources, raw_data"
  output_var: raw_research

# Extract only key_findings — costs 0 tokens
- id: extract
  type: transform
  operation: json_extract
  input_var: raw_research
  json_path: "key_findings"
  prompt: "extract"
  output_var: findings            # Now ~500 tokens

# Next step receives 500 tokens instead of 5000
- id: report
  depends_on: [extract]
  prompt: "Write report from: {findings}"
  output_var: report

Available zero-token operations: json_extract, regex_match, template, split, merge, truncate, replace, filter, map, join, to_json, from_json
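
As a rough illustration of why a transform costs no tokens, the sketch below does the equivalent of `json_extract` with Python's standard `json` module. The dotted-path handling is an assumption made for this example; OCC's actual `json_path` semantics may be richer.

```python
import json

def json_extract(raw: str, json_path: str) -> str:
    """Pull one value out of a JSON string using a dotted key path (assumed syntax)."""
    value = json.loads(raw)
    for key in json_path.split("."):
        value = value[key]
    # Re-serialize so the next step receives plain text.
    return json.dumps(value)

# Stand-in for a large research result returned by an LLM step.
raw_research = json.dumps({
    "key_findings": ["finding 1", "finding 2"],
    "sources": ["s1", "s2", "s3"],
    "raw_data": "x" * 5000,
})

# Pure local computation: no model call, so no tokens are spent here.
findings = json_extract(raw_research, "key_findings")
```

The downstream prompt then embeds `findings` rather than the full `raw_research` payload.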

Technique 3: Per-Step Model Selection

Use cheap models for simple tasks, expensive models for critical ones:

steps:
  # Simple classification — use Haiku ($0.25/M input)
  - id: classify
    model: claude-haiku-4-5
    prompt: "Classify this text: {input.text}. Return: positive, negative, or neutral."
    output_var: sentiment

  # Complex analysis — use Sonnet ($3/M input)
  - id: analyze
    model: claude-sonnet-4-6
    depends_on: [classify]
    prompt: "Deep analysis of {input.text} (classified as {sentiment})"
    output_var: analysis

  # Critical synthesis — use Opus ($15/M input)
  - id: synthesize
    model: claude-opus-4-6
    depends_on: [analyze]
    prompt: "Produce final executive report from: {analysis}"
    output_var: report

Cost breakdown:

| Step | Model | Input tokens | Cost |
|---|---|---|---|
| classify | Haiku | ~200 | $0.00005 |
| analyze | Sonnet | ~2000 | $0.006 |
| synthesize | Opus | ~3000 | $0.045 |
| Total | | ~5200 | $0.051 |

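vs. running everything on Opus: ~8000 tokens × $15/M = $0.12 (about 2.4x more expensive). Note that the ~8000 figure is larger than the ~5200 input tokens in the breakdown; at the same ~5200 tokens, Opus alone would cost about $0.078 (about 1.5x more expensive).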
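
The arithmetic behind the breakdown can be reproduced in a few lines of Python. The prices are the per-million input rates quoted in the YAML comments above, and only input tokens are counted, as in the breakdown. The helper name `input_cost` is made up for this sketch.

```python
# Input prices in USD per million tokens, taken from the comments in the example.
PRICE_PER_MILLION = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}

def input_cost(tokens: int, model: str) -> float:
    """Cost in USD of sending `tokens` input tokens to `model`."""
    return tokens * PRICE_PER_MILLION[model] / 1_000_000

# (step id, model, approximate input tokens) from the cost breakdown.
steps = [
    ("classify", "haiku", 200),
    ("analyze", "sonnet", 2000),
    ("synthesize", "opus", 3000),
]

mixed_total = sum(input_cost(tokens, model) for _, model, tokens in steps)
total_tokens = sum(tokens for _, _, tokens in steps)

# Two all-Opus comparisons: the same token count, and the ~8000-token estimate.
opus_same_tokens = input_cost(total_tokens, "opus")
opus_8k = input_cost(8000, "opus")

print(f"mixed models:      ${mixed_total:.5f}")
print(f"Opus, same tokens: ${opus_same_tokens:.3f}")
print(f"Opus, 8000 tokens: ${opus_8k:.2f}")
```

Swapping in current prices or your own token estimates shows quickly whether per-step model selection is worth the extra configuration for a given chain.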

Technique 4: Caching

Identical prompts skip the LLM entirely:

- id: expensive_research
  cache:
    enabled: true
    ttl_minutes: 120          # Cache for 2 hours
  prompt: "Research {input.topic}"
  output_var: research

Cache key = hash of (step ID + resolved prompt + model). If the same chain runs with the same inputs within the TTL, the cached result is returned instantly (0 tokens, 0 cost, ~1ms).
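
The sketch below shows one way such a key and TTL check could work. It is an assumption-laden illustration, not OCC's implementation: the SHA-256 hash, the JSON serialization of the three fields, and the `TTLCache` class are all choices made for this example.

```python
import hashlib
import json
import time

def cache_key(step_id: str, resolved_prompt: str, model: str) -> str:
    """Hash (step ID + resolved prompt + model) into a fixed-length key."""
    # JSON-encode the fields as a list so field boundaries stay unambiguous.
    payload = json.dumps([step_id, resolved_prompt, model])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class TTLCache:
    """In-memory cache whose entries expire after `ttl_minutes` (hypothetical)."""

    def __init__(self, ttl_minutes: float, clock=time.monotonic):
        self.ttl_seconds = ttl_minutes * 60
        self.clock = clock
        self._entries = {}

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self.clock(), value)
```

Because the key is built from the resolved prompt, changing `{input.topic}` produces a different key, so only truly identical requests hit the cache.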

Technique 5: Conditional Execution

Skip expensive steps when they're not needed:

- id: quick_check
  model: claude-haiku-4-5
  prompt: "Does this code have security issues? Answer YES or NO: {input.code}"
  output_var: has_issues

- id: deep_audit
  depends_on: [quick_check]
  condition: '{has_issues} contains "YES"'     # Only runs if issues found
  model: claude-opus-4-6
  prompt: "Full security audit of: {input.code}"
  output_var: audit

If has_issues is "NO", the deep_audit step is skipped — saving potentially thousands of tokens.
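
For intuition, a condition like the one above could be evaluated roughly as follows. This is a hypothetical sketch that handles only the `{var} contains "TEXT"` form; the function name and the case-sensitive match are assumptions, and OCC's own condition grammar is not documented here.

```python
import re

def evaluate_contains(condition: str, variables: dict[str, str]) -> bool:
    """Evaluate a condition of the form '{var} contains "TEXT"' (assumed syntax)."""
    match = re.fullmatch(r'\{(\w+)\}\s+contains\s+"([^"]*)"', condition.strip())
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    var_name, needle = match.groups()
    # Case-sensitive substring test; a missing variable counts as empty.
    return needle in variables.get(var_name, "")

condition = '{has_issues} contains "YES"'
run_audit = evaluate_contains(condition, {"has_issues": "YES"})
skip_audit = not evaluate_contains(condition, {"has_issues": "NO"})
```

If the match is case-sensitive, as in this sketch, a reply of "Yes" would not trigger the audit, so it helps to pin the expected answer format in the cheap step's prompt, as the example does with "Answer YES or NO".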

Technique 6: Early Exit

Stop the chain when the answer is found:

- id: check_cache
  prompt: "Is this answer in the knowledge base? {input.question}"
  output_var: cached_answer
  early_exit_if: '{cached_answer} != "NOT_FOUND"'

- id: research
  depends_on: [check_cache]
  prompt: "Research: {input.question}"     # Never runs if cache hit
  output_var: research

- id: synthesize
  depends_on: [research]
  prompt: "Answer from research: {research}"   # Never runs if cache hit
  output_var: answer
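
The control flow can be pictured with a small runner loop. Everything here is a simplified assumption for illustration: `run_chain` is a made-up function, steps run strictly in order, prompts are passed through unresolved, and `early_exit_if` is modeled as a Python callable rather than OCC's string expression.

```python
def run_chain(steps, call_llm):
    """Run steps in order; stop as soon as a step's early-exit test passes."""
    outputs = {}
    executed = []
    for step in steps:
        outputs[step["output_var"]] = call_llm(step["prompt"])
        executed.append(step["id"])
        should_exit = step.get("early_exit_if")
        if should_exit is not None and should_exit(outputs):
            break  # remaining steps never run, so they cost nothing
    return outputs, executed

steps = [
    {
        "id": "check_cache",
        "prompt": "Is this answer in the knowledge base?",
        "output_var": "cached_answer",
        "early_exit_if": lambda out: out["cached_answer"] != "NOT_FOUND",
    },
    {"id": "research", "prompt": "Research the question", "output_var": "research"},
    {"id": "synthesize", "prompt": "Answer from research", "output_var": "answer"},
]

# A fake model that always finds the answer: only the first step executes.
_, executed_on_hit = run_chain(steps, lambda prompt: "Paris")
```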

Technique 7: Context Strategy

Control how dependency outputs are compressed:

- id: final_step
  depends_on: [big_research, small_facts]
  context_strategy:
    big_research: "truncate:2000"       # Trim to 2000 chars
    small_facts: "full"                  # Keep as-is
  prompt: |
    Research (condensed): {big_research}
    Key facts: {small_facts}
  output_var: result

Options: full (default), summarize (LLM compression), truncate:N (hard cut at N chars)
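
A minimal sketch of the two zero-token options, assuming the strategy strings shown above. `apply_context_strategy` is a hypothetical name, and `summarize` is deliberately left out because it requires an LLM call.

```python
def apply_context_strategy(text: str, strategy: str = "full") -> str:
    """Apply a 'full' or 'truncate:N' strategy to a dependency's output."""
    if strategy == "full":
        return text
    if strategy.startswith("truncate:"):
        limit = int(strategy.split(":", 1)[1])
        return text[:limit]  # hard cut at N characters
    raise ValueError(f"strategy not handled in this sketch: {strategy!r}")

big_research = "x" * 5000
small_facts = "Fact 1. Fact 2."

condensed = apply_context_strategy(big_research, "truncate:2000")
untouched = apply_context_strategy(small_facts)
```

Note that `truncate:N` counts characters, not tokens, so the token savings depend on the text and the model's tokenizer.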

Technique 8: Merge Strategy

When combining parallel outputs, choose the cheapest strategy:

- id: combine
  type: merge
  inputs: [research_a, research_b, research_c]
  strategy: pick_best        # LLM picks 1 of 3 — cheaper than summarizing all 3
  prompt: "Pick the most relevant research."
  output_var: best_research

| Strategy | LLM cost | Output size |
|---|---|---|
| concatenate | 0 tokens | Sum of all inputs |
| json_array | 0 tokens | Sum of all inputs |
| pick_best | ~input size | 1 input's worth |
| llm_summarize | ~input size | Compressed |
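
The two zero-token strategies amount to plain string handling, which the following sketch illustrates. The function names and the blank-line separator are assumptions for this example, not OCC's documented behavior.

```python
import json

def merge_concatenate(inputs: list[str], separator: str = "\n\n") -> str:
    """Join all inputs into one text block (separator is an assumed default)."""
    return separator.join(inputs)

def merge_json_array(inputs: list[str]) -> str:
    """Serialize all inputs as a JSON array of strings."""
    return json.dumps(inputs)

research = ["Result A", "Result B", "Result C"]

combined_text = merge_concatenate(research)
combined_json = merge_json_array(research)
```

Neither call touches a model, but both pass the full size of every input downstream, so the next LLM step pays for all of it. `pick_best` and `llm_summarize` spend tokens at merge time in exchange for a smaller output.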

Technique 9: Pre-Tools vs. In-Prompt Requests

Bad — Asking Claude to search inside the prompt (costs tokens for the tool call overhead):

prompt: "Search the web for {input.topic} and then analyze the results"

Good — Pre-tool does the search, data is ready:

pre_tools:
  - type: web_search
    query: "{input.topic} latest research"
    inject_as: data
prompt: "Analyze this data: {data}"

The pre-tool approach is cleaner AND cheaper because the LLM doesn't waste tokens on tool-calling overhead.

Token Budget Calculator

| Chain type | Steps | Avg tokens/step | Total | Est. cost (Sonnet) |
|---|---|---|---|---|
| Simple (3 steps) | 3 | 1500 | 4.5K | $0.014 |
| Medium (6 steps) | 6 | 2500 | 15K | $0.045 |
| Complex (12 steps) | 12 | 3000 | 36K | $0.108 |
| Pipeline (3 chains × 5 steps) | 15 | 2000 | 30K | $0.090 |

Estimates count input tokens at Sonnet's $3/M rate. With caching enabled, a repeat execution with identical inputs inside the TTL costs $0.
