Crossfire is a multi-agent adversarial refinement orchestrator: it sends your prompt to multiple LLMs to generate text or code, has several different LLMs review each generated artefact, and then merges the best pieces into one refined artefact. That generation → review → synthesis loop repeats as many times as you want. Crossfire implements a multi-start evolutionary search with delayed selection (an improved Ralph Wiggum method).
```mermaid
flowchart LR
    Instruction --> Enrich
    Enrich --> Generate
    subgraph "Crossfire loop"
        Generate["Generate (G models in parallel)"]
        Review["Review (R models in parallel)"]
        Synthesize
        Generate --> Review --> Synthesize
        Synthesize -->|next round| Generate
    end
    Synthesize --> Output
```
Each round runs generation → review → synthesis sequentially; within generation and review, model calls run in parallel. Rounds repeat until the configured limit or until reviewers find no material weaknesses (early stop). Here, "material" means substantive issues like factual errors, logic flaws, or broken code, not style issues or nitpicks. Each mode (research, code, edit, check, write) has its own review protocol: rubber duck debugging for code, precision review for editing, literary criticism for writing, and so on.
- Python 3.12
- uv for dependency management
```shell
uv python install 3.12   # transitive incompatibility in docformatter->untokenize
uv venv --python 3.12
uv sync
```

```shell
export OPENROUTER_API_KEY="sk-or-..."
export TAVILY_API_KEY="tvly-..."   # only needed if search.enabled = true in crossfire.toml
```

```shell
uv run crossfire run \
  --mode research \
  --instruction "Compare error correction strategies for superconducting vs trapped-ion qubits" \
  --num-generators 2 \
  --num-reviewers-per-candidate 3 \
  --num-rounds 5
```

| Flag | Description | Default |
|---|---|---|
| `--mode` | One of research, code, edit, check, write | — |
| `--instruction` | Task instruction (mutually exclusive with `--instruction-file`) | — |
| `--instruction-file` | Read instruction from a file (mutually exclusive with `--instruction`) | — |
| `--context-file` | Supplementary reference material (see below) | none |
| `--num-generators` | Generators per round | 1 |
| `--num-reviewers-per-candidate` | Reviewers assigned to each candidate | 3 |
| `--num-rounds` | Number of rounds | 3 |
| `--enrich` / `--no-enrich` | Enrich instruction via a lightweight model before generation | true |
| `--early-stop` / `--no-early-stop` | Stop early when reviewers find no material weaknesses | true |
| `--early-stop-threshold` | Weakness threshold for early stopping | 1 |
| `--verbose`, `-v` | Show JSON log events | false |
| `--output` | Additional path to write the final output to | none |
| `--run-dir` | Archive directory for all run artifacts | `runs/<timestamp>` |
| `--config` | Path to crossfire.toml | auto-detected |
| `--dry-run` | Simulate without network calls | false |
The --instruction (or --instruction-file) is the task: what you want Crossfire to do.
The --context-file is reference material: source text, code to review, data to analyse.
They are separated because the pipeline treats them differently:
- The instruction is never compressed, even when the full prompt exceeds the token budget. Context may be trimmed if the prompt doesn't fit the model's context window.
- The enricher rewrites the instruction into a richer brief but leaves the context untouched.
- Prompts place them in distinct sections so the LLMs know which is which.
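The split can be sketched as follows. The section labels and character-based trimming here are illustrative only: Crossfire's real prompt templates live in crossfire/core/prompts.py, and its actual trimming is token-based and extractive.

```python
# Sketch of the instruction/context separation: the instruction is always
# kept whole, while the context is truncated to whatever room remains.
# Section labels and the chars-based budget are illustrative assumptions.
def build_prompt(instruction: str, context: str, budget_chars: int) -> str:
    fixed = f"## Task\n{instruction}\n\n## Reference material\n"
    room = max(0, budget_chars - len(fixed))
    return fixed + context[:room]  # instruction never trimmed; context may be

p = build_prompt("Summarize the findings", "long document " * 1000, 200)
```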
```shell
# Short instruction + long reference document
uv run crossfire run \
  --mode research \
  --instruction "Summarize the key findings and identify methodological weaknesses" \
  --context-file paper.pdf
```

Cost estimation in `--dry-run` requires current model prices from OpenRouter. Fetch and store them in pricing.json with:

```shell
uv run crossfire prices
```

Since this fetches pricing for all OpenRouter models, new models can be added to crossfire.toml later without re-fetching.
Remove all generated and cached files (runs, .venv, caches, bytecode):
```shell
uv run crossfire clean       # interactive
uv run crossfire clean --yes # skip yes/no prompts
```

This deletes `runs/`, `.venv/`, `.ruff_cache/`, `.pytest_cache/`, `.mypy_cache/`, `__pycache__/` dirs, `*.pyc` files, and packaging artefacts (`dist/`, `build/`, `*.egg-info/`).
clean refuses to run unless it finds a crossfire.toml in the current directory, so a mistaken cd cannot nuke an unrelated project's .venv.
After cleaning, run uv sync to recreate the virtual environment.
| Mode | Focus |
|---|---|
| research | Structured research with citations and verification of claims |
| code | Production code and associated tests with systematic rubber duck debugging review |
| edit | Precision editing with systematic anti-waffle and jargon elimination |
| check | Accuracy and logical validity checking |
| write | Creative writing with literary criticism and sceptical review |
Each mode can override the global [models.*] defaults via [modes.<mode>.*] sections in crossfire.toml.
Only roles that differ need to be specified.
Any missing roles fall back to the global default.
Cross-group overlap is allowed: a model may both generate and review in the same round, as they do not share prompts or contexts.
Crossfire is configured via crossfire.toml at the project root.
The file defines four model groups (enricher, generators, reviewers, synthesizer), each with a list of model IDs, a context window, and maximum output tokens.
Per-mode overrides go in [modes.<mode>.*] sections.
See crossfire.toml for the full configuration with all 5 modes, per-model context window overrides, and detailed comments explaining the model selection rationale.
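A hypothetical fragment illustrating the override layering; the exact key names below are assumptions, so check the shipped crossfire.toml for the real schema:

```toml
# Global defaults shared by every mode (key names are illustrative).
[models.generators]
models = ["openai/gpt-4o", "anthropic/claude-sonnet-4"]
context_window = 128000
max_output_tokens = 8000

# Code mode overrides only the generators role; enricher, reviewers,
# and synthesizer fall back to the [models.*] defaults above.
[modes.code.generators]
models = ["anthropic/claude-sonnet-4"]
```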
Each round has three sequential phases:
- Generation — `num_generators` independent candidates produced in parallel
- Review — each candidate reviewed by `num_reviewers_per_candidate` independent reviewers in parallel
- Synthesis — all candidates and reviews merged into a single refined output
Rounds are strictly sequential: round N+1 uses round N's synthesis as input.
When a [models.enricher] section is configured, Crossfire rewrites the raw user instruction into a richer, more structured brief before round 1 begins.
This adds constraints, clarifies ambiguities, and incorporates mode-specific rules so generators start from a stronger prompt.
This is particularly useful for short one-liners.
Disable with --no-enrich.
After each round, Crossfire checks whether any reviewer found material weaknesses.
If all reviews report no weaknesses, the remaining rounds are skipped, as there is no value in further refinement.
Disable with --no-early-stop.
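One plausible reading of the `--early-stop-threshold` flag (default 1) is "stop when fewer than this many material weaknesses remain across all reviews". This is an assumption about the semantics, not confirmed by the source:

```python
# Hypothetical early-stop check; the threshold semantics are an assumed
# reading of --early-stop-threshold, and each review is assumed to carry
# a list of material weaknesses.
def should_stop_early(reviews: list[dict], threshold: int = 1) -> bool:
    material = sum(len(r["weaknesses"]) for r in reviews)
    return material < threshold  # default: stop only when zero remain
```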
Some models occasionally refuse to produce output, responding with meta-commentary like "the sources are insufficient" instead of answering the prompt.
Crossfire detects these refusals and attempts a replacement model from the generator pool.
If no replacement is available, the generator is dropped and the round fails gracefully.
The refusal is logged as a model_dropped event with reason refusal.
If a synthesis is itself a refusal or regression (detected by the same refusal patterns), Crossfire discards it and carries forward the previous round's synthesis.
This prevents a bad late round from overwriting good earlier output.
The event is logged as synthesis_regression.
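Refusal detection of this kind typically amounts to pattern matching on the response. The patterns below are illustrative examples, not Crossfire's actual list:

```python
import re

# Hypothetical refusal detector; patterns are illustrative only.
REFUSAL_PATTERNS = [
    r"\bthe sources are insufficient\b",
    r"\bI (cannot|can't|am unable to) (answer|help|comply)\b",
]

def looks_like_refusal(text: str) -> bool:
    head = text[:500]  # refusals tend to appear right at the start
    return any(re.search(p, head, re.IGNORECASE) for p in REFUSAL_PATTERNS)
```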
Code mode catches subtle issues that often slip through regular reviews:
- Security vulnerabilities: SQL injection, XSS, path traversal, hardcoded secrets
- Logic errors: Off-by-one errors, null pointer issues, edge case failures
- Type safety: Unsafe casts, missing type hints/declarations, implicit conversions
- Error handling: Unhandled exceptions, resource leaks, race conditions
- Testing gaps: Missing edge case tests, insufficient error path coverage
- Performance issues: Algorithmic complexity, unnecessary loops, memory leaks
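As a concrete instance of the logic-error class above, here is the kind of bounds-sensitive code a rubber duck review walks through line by line (a hypothetical example, not taken from Crossfire):

```python
# Binary search is a classic off-by-one hazard: the review checks that the
# initial bounds (len - 1), the loop condition (<=), and the updates
# (mid + 1 / mid - 1) use one consistent convention, since mixing
# conventions silently skips the last element.
def binary_search(items: list[int], target: int) -> int:
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```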
Edit mode eliminates waffle and imprecision:
- Precision & clarity: Vague qualifiers, hedging language, weak verbs, unclear antecedents
- Conciseness: Redundant phrases, wordy constructions, unnecessary intensifiers, filler words
- Business jargon & LLM waffling: Corporate speak, buzzword clusters, throat-clearing, academic bloat
- Structure & flow: Buried main points, repetitive patterns, weak transitions, logic gaps
Write mode provides literary criticism:
- Story integrity: Plot holes, character inconsistencies, pacing problems, unearned stakes
- Voice & craft: Authenticity, clichés, dialogue quality, show vs tell failures
- Emotional resonance: Genuine vs manipulative moments, forced sentiment, unearned emotion
- Structure & clarity: Confusing transitions, unnecessary scenes, weak endings
- Originality & insight: Derivative concepts, predictable developments, lack of depth
Each mode's review protocol is defined in crossfire/core/prompts.py.
Structured events are emitted as JSON lines to stderr when --verbose / -v is passed.
Without it, stderr shows only the live progress display.
Timestamps use local time, because the output is meant for the operator at the terminal.
Key event types:
- `prompt_enriched`: the instruction was rewritten by the enricher model
- `compression_applied`: token budget compression was applied
- `model_dropped`: a model was excluded because of token overflow or repeated failures
- `synthesis_decision`: which candidates were selected/discarded
- `early_stop`: the number of remaining rounds skipped because no weaknesses were found
- `round_failed`: a round could not complete
- `run_failed`: the entire run aborted
- `cost_summary`: per-model and total token/cost breakdown
You can filter these with standard tools:
```shell
uv run crossfire run ... -v 2>&1 | jq 'select(.event == "synthesis_decision")'
```

Every OpenRouter response's token usage and cost metadata is captured.
A cost_summary event is emitted at the end of each run with per-model and aggregate totals.
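Because events are plain JSON lines, a saved log is easy to post-process without jq. The field names beyond `event` below are invented for illustration; the real payloads may differ:

```python
import json

# Sketch: filter a captured JSON-lines event log for the cost summary.
# Only the "event" key is documented; other fields here are assumptions.
log_lines = [
    '{"event": "early_stop", "rounds_skipped": 2}',
    '{"event": "cost_summary", "total_cost_usd": 0.42}',
]
events = [json.loads(line) for line in log_lines]
summary = next(e for e in events if e["event"] == "cost_summary")
```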
```shell
uv python install 3.12       # transitive incompatibility in docformatter->untokenize
uv venv --python 3.12
uv sync
uv run pre-commit install    # set up git hooks (recommended)
uv run pytest                # run tests
uv run pytest --cov          # run tests with coverage
uv run ruff check crossfire/ # lint
uv run mypy crossfire/       # type check
```

Use `--dry-run` to verify your changes without making API calls.
It produces deterministic synthetic outputs via SHA-256 hashing.
If pricing.json is present (from crossfire prices), the summary table includes an upper-bound cost estimate.
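Deriving deterministic synthetic output from a hash can be sketched as follows; the exact output format of Crossfire's fakes (crossfire/core/simulation.py) will differ:

```python
import hashlib

# Sketch of hash-derived synthetic output: the same (model, prompt) pair
# always yields the same string, so dry runs are reproducible.
def synthetic_output(model: str, prompt: str) -> str:
    digest = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    return f"[dry-run:{model}:{digest[:12]}]"

a = synthetic_output("generator-1", "Implement a binary search tree in Python")
b = synthetic_output("generator-1", "Implement a binary search tree in Python")
```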
```shell
uv run crossfire run \
  --mode code \
  --instruction "Implement a binary search tree in Python" \
  --dry-run
```

```shell
# Run the same checks as in CI
uv run pre-commit run --all-files && uv run pytest --cov

# Or use the convenience script
./scripts/check-all.sh
```

The pre-commit hooks automatically run docformatter, `ruff --fix`, `ruff format`, and mypy on every commit to ensure consistent code quality.
```
crossfire/
├── docs/                      # example artefacts
├── crossfire/
│   ├── core/
│   │   ├── orchestrator.py    # round loop, concurrency, failure handling
│   │   ├── domain.py          # domain primitives (Task, Candidate, etc.)
│   │   ├── prompts.py         # mode-aware prompt builders and output parsers
│   │   ├── config.py          # TOML loading with CLI override precedence
│   │   ├── logging.py         # structured JSON-line event logging
│   │   ├── tokens.py          # tiktoken-based estimation
│   │   ├── compression.py     # extractive compression and prompt fitting
│   │   ├── openrouter.py      # OpenRouter HTTP client with retry
│   │   ├── simulation.py      # deterministic fakes for dry-run mode
│   │   ├── progress.py        # progress reporting
│   │   ├── reviewers.py       # reviewer-to-candidate assignment
│   │   ├── search.py          # search integration with Tavily
│   │   ├── pricing.py         # OpenRouter pricing cache and cost estimation
│   │   ├── exclamations.py    # The Simpsons prefixes for error messages
│   │   └── archive.py         # disk archival
│   ├── ui/
│   │   └── tui.py             # Rich-based progress display
│   └── cli.py                 # Click CLI entrypoint
├── tests/                     # comprehensive pytest suite
├── scripts/                   # convenience scripts (check-all.sh)
├── crossfire.toml             # default configuration
├── pyproject.toml
└── .pre-commit-config.yaml
```
Token estimation is approximate.
Counts use the cl100k_base tokenizer (via tiktoken) as a proxy for all providers.
Actual token counts may differ for non-OpenAI models.
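For intuition about how rough any such estimate is, compare the even cruder chars/4 rule of thumb sometimes used for English text (Crossfire itself uses tiktoken's cl100k_base, not this heuristic):

```python
# Crude token estimate: ~4 characters per token for English prose.
# Real counts from cl100k_base or a provider's tokenizer will differ,
# which is exactly why Crossfire's estimates are approximate.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)
```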
OpenRouter is the only LLM provider. Direct API calls to Anthropic, Google, or OpenAI are not supported.
Tavily is the only search provider.
If TAVILY_API_KEY is missing while search.enabled = true, the run fails at startup.
Transient errors degrade gracefully to empty results rather than aborting the run.
Compression is extractive, not generative. When prompts exceed the token budget, Crossfire drops sections and sentences rather than summarizing. The task instruction is never compressed.
Cost estimates are approximate. The dry-run estimate uses fixed output-token defaults (~5,000 tokens per generator/synthesizer call, ~2,000 per reviewer) and average pricing across each model group. It does not predict early stopping or actual output lengths, so it typically overestimates by 2-4x for runs that stop early. If the instruction contains an explicit word or page count, the estimate uses that instead, but the regex may also match counts that describe the input rather than the desired output.
No streaming. Responses are received in full.
No persistent state. Each run is self-contained, so a crashed run must be restarted from scratch.
- Resume: restart interrupted runs from the last completed round
- Local UI: browser-based interface with live progress, output panel, and searchable run history
- Library and containerization: extract a reusable library and Docker image so Crossfire can run as a service (switch to UTC-based logs)