Open head-to-head leaderboard for company-lookalike APIs.
Live leaderboard: https://benchmarks.openfunnel.dev/leaderboards/lookalike
This repo is the open data + code mirror of that page — every cell on the
leaderboard is backed by a literal HTTP request/response envelope and a
literal LLM judge prompt + response, both committed under data/lookalike-runs/.
For each seed company, every vendor returns its top-K = 10 lookalikes; an
LLM judge (gpt-5.4-mini) scores each returned company for relevance against the
seed's business model; cell value is Precision@K — relevant / K.
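The arithmetic behind a cell and a vendor row can be sketched in a few lines. This is a minimal illustration, not the repo's code: the function names `precision_at_k` and `vendor_score` are hypothetical, and it assumes the judge's verdicts arrive as a list of booleans and that an unjudged cell is represented as `None`.

```python
from typing import Optional, Sequence

K = 10  # top-K used by the leaderboard


def precision_at_k(verdicts: Sequence[bool], k: int = K) -> float:
    """Cell value: number of relevant candidates among the first k, divided by k."""
    # Dividing by k (not by len(verdicts)) means missing candidates count as misses.
    return sum(1 for v in verdicts[:k] if v) / k


def vendor_score(cells: Sequence[Optional[float]]) -> float:
    """Vendor row: mean of cell values over judged seeds (None = not judged, an assumption)."""
    judged = [c for c in cells if c is not None]
    if not judged:
        raise ValueError("no judged cells for this vendor")
    return sum(judged) / len(judged)
```

Under these assumptions, a seed with 7 relevant candidates out of 10 gives a cell value of 0.7, and a vendor with one unjudged seed is averaged over the remaining seeds only.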
- Live leaderboard UI — https://benchmarks.openfunnel.dev/leaderboards/lookalike
- JSON API — https://benchmarks.openfunnel.dev/api/leaderboards/lookalike
- Markdown agent docs — https://benchmarks.openfunnel.dev/llms.txt
- OpenAPI 3.1 spec — https://benchmarks.openfunnel.dev/openapi.json
- MCP server discovery — https://benchmarks.openfunnel.dev/.well-known/mcp.json
| # | Vendor | Precision@K | Judged | Avg latency |
|---|---|---|---|---|
| 1 | OpenFunnel | 89.05% | 14/14 | 30746.7ms |
| 2 | PredictLeads | 73.57% | 14/14 | 725.6ms |
| 3 | Ocean.io | 71.43% | 14/14 | 1840.4ms |
| 4 | Parallel | 70.0% | 13/14 | 1491.5ms |
| 5 | Exa | 37.32% | 14/14 | 243.9ms |
14 seed companies × 5 vendors. The full per-cell breakdown
and the raw audit trail (every HTTP request + every judge call) live under
data/lookalike-runs/.
| path | purpose |
|---|---|
| `data/latest-lookalike.json` | The leaderboard snapshot: seeds, per-vendor rows, per-cell aggregates. |
| `data/lookalike-runs/<dataset>/<seed>/<vendor>.json` | Slim per-cell artifact: winning config's candidates with the judge's binary verdict + one-line rationale. |
| `data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json` | Full audit trail: every config attempted, with the literal HTTP request/response (auth headers redacted) and the literal LLM prompt + raw response per candidate. |
| `data/lookalike-runs/README.md` | Schema docs for the raw artifacts. |
| `manifest.json` | Flat index of every cell with the headline numbers + file paths. Easy to ingest programmatically. |
| `scripts/run_lookalike_benchmark.py` | Orchestrator. Sweeps every config each runner declares; keeps the highest-Precision@K winner per cell. |
| `scripts/lookalike/runners/<vendor>.py` | One file per vendor: endpoint URL, auth, request shape, response parser, swept configs. |
| `scripts/lookalike/judge.py` | LLM judge: system prompt, Pydantic verdict schema, mock mode. |
| `scripts/lookalike/common.py` | Dataclasses + HTTP helper + persistence + redaction. |
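If you would rather walk the artifact tree than read `manifest.json`, the path layout above is enough to pair each slim file with its raw counterpart. The sketch below relies only on the `<dataset>/<seed>/<vendor>.json` and `.raw.json` naming shown in the table; the helper name `index_cells` is hypothetical and it does not assume anything about the JSON contents.

```python
from pathlib import Path


def index_cells(root):
    """Map (dataset, seed, vendor) -> {"slim": Path or None, "raw": Path or None}."""
    cells = {}
    # Exactly three levels below root: <dataset>/<seed>/<vendor>[.raw].json
    for path in sorted(Path(root).glob("*/*/*.json")):
        dataset, seed = path.parts[-3], path.parts[-2]
        is_raw = path.name.endswith(".raw.json")
        # Strip ".raw.json" or ".json" to recover the vendor slug.
        vendor = path.name[: -len(".raw.json")] if is_raw else path.stem
        entry = cells.setdefault((dataset, seed, vendor), {"slim": None, "raw": None})
        entry["raw" if is_raw else "slim"] = path
    return cells
```

Calling `index_cells("data/lookalike-runs")` from the repo root should then give one entry per cell; files that sit directly under the root, such as its `README.md`, are not matched by the three-level pattern.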
Pick any (seed, vendor) pair on the leaderboard. The corresponding raw file at
data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json contains, for every
config the orchestrator swept:
- `vendor_calls[].request_*`: replay the HTTP call with your own credentials.
- `judge_calls[].messages`: replay the literal LLM prompt against any OpenAI-v1 compatible model (OpenAI direct, Azure, Anthropic via converter, local llama, your own fine-tune).
This is the open-source "audit trail" claim: every Precision@K number in the leaderboard is backed by a literal HTTP envelope you can re-run plus a literal LLM prompt you can re-score with your own judge to measure bias.
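A re-scoring pass might look like the sketch below. It makes only two assumptions about the raw file: somewhere in the JSON there are `judge_calls` lists, and each entry carries a `messages` list in chat format. The surrounding nesting is not assumed, so the helper searches recursively (see `data/lookalike-runs/README.md` for the actual schema). The function names and the `base_url`, `api_key`, and `model` parameters are illustrative; the request shape is the standard OpenAI-style `POST /chat/completions`.

```python
import json
import urllib.request
from typing import Any, Iterator


def iter_judge_messages(node: Any) -> Iterator[list]:
    """Yield every `messages` list found under any `judge_calls` key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "judge_calls" and isinstance(value, list):
                for call in value:
                    if isinstance(call, dict) and isinstance(call.get("messages"), list):
                        yield call["messages"]
            else:
                yield from iter_judge_messages(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_judge_messages(item)


def build_chat_request(messages: list, model: str) -> dict:
    """Body for an OpenAI-style chat completions call, reusing the prompt verbatim."""
    return {"model": model, "messages": messages}


def replay(messages: list, model: str, base_url: str, api_key: str) -> str:
    """Send one judge prompt to your own endpoint and return the raw text reply."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(build_chat_request(messages, model)).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Load a `.raw.json` file with `json.load`, pass the result to `iter_judge_messages`, and call `replay` on each prompt; comparing the replies against the committed verdicts gives a rough measure of how much the scores depend on the judge model.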
python3 -m venv .venv && source .venv/bin/activate
python3 -m pip install -r requirements.txt
cp .env.example .env && $EDITOR .env # fill in vendor keys + judge endpoint
PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py --mock # offline smoke test (no keys)
PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py # live full sweep
PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py --only openfunnel --seeds pylon,liveblocks

To add a vendor:
- Drop a new file `scripts/lookalike/runners/<your_vendor>.py` exporting `VENDOR_SLUG`, `VENDOR_NAME`, `CONFIGS: list[dict]`, and `run(seed, k, config) -> RunResult`.
- Register it in `scripts/lookalike/runners/__init__.py::REGISTRY`.
- Add a row to `data/latest-lookalike.json::leaderboard` so the orchestrator knows to aggregate it.
- Run `python scripts/run_lookalike_benchmark.py --only <your_vendor>` and open a PR.
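The runner contract in the steps above can be pictured with a skeleton like the one below. Treat it as a rough sketch under stated assumptions: the real `RunResult` lives in `scripts/lookalike/common.py` and its fields are not described here, so the dataclass below is a stand-in with invented fields. The slug, the config keys, the type of `seed`, and the placeholder `_call_vendor` helper are likewise hypothetical. Only the four exported names and the `run(seed, k, config)` signature come from the steps above.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Stand-in for the repo's RunResult; these fields are assumptions."""
    candidates: list[str] = field(default_factory=list)
    latency_ms: float = 0.0


VENDOR_SLUG = "example_vendor"   # hypothetical slug
VENDOR_NAME = "Example Vendor"   # hypothetical display name
# One dict per config variant the orchestrator should sweep (keys are illustrative).
CONFIGS: list[dict] = [{"mode": "semantic"}, {"mode": "agentic"}]


def _call_vendor(seed: str, k: int, config: dict) -> list[str]:
    """Placeholder for the real HTTP call; returns dummy domains so the sketch runs offline."""
    return [f"candidate-{i}.example" for i in range(k)]


def run(seed, k, config) -> RunResult:
    """Fetch up to k lookalikes for one seed under one config and time the call."""
    started = time.perf_counter()
    candidates = _call_vendor(seed, k, config)[:k]
    latency_ms = (time.perf_counter() - started) * 1000.0
    return RunResult(candidates=candidates, latency_ms=latency_ms)
```

In a real runner, `_call_vendor` would be replaced by the vendor's endpoint call and response parser, and `RunResult` would be imported from the repo rather than redefined.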
- Precision@K. For each seed, the vendor returns up to K candidates. The judge labels each candidate `relevant: bool` against the seed's description. Cell value = `relevant_count / K`. Vendor row = mean across judged seeds.
- Best-of sweep. Each runner declares 1-4 config variants (e.g. agentic vs semantic, with-query vs seed-only). For every cell we run all configs and keep the config with the highest Precision@K. Tiebreaker: more judged candidates, then lower latency.
- Judge. Single-pass binary verdict + one-line rationale per candidate. Same prompt and rubric across all vendors. We publish the literal prompt and the raw model response so that judge bias can be audited: swap the model and re-score to measure drift.
- Known limitations. (1) Judge bias: a single LLM judge has its own priors; we publish the full audit trail so you can swap and re-score. (2) K-tail vs precision tradeoff: vendors that can only return small sets would otherwise win P@K by default, so we require >= K results to score the cell. (3) No recall metric: precision says nothing about how many real lookalikes the vendor missed.
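The best-of sweep rule can be written as a single sort key. The sketch below assumes each config's outcome has already been reduced to a relevant count, a judged count, and a latency. The `ConfigOutcome` type and `pick_winner` function are illustrative names, not the orchestrator's own.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfigOutcome:
    name: str             # config variant label, e.g. "agentic"
    relevant_count: int   # candidates the judge marked relevant
    judged_count: int     # candidates the judge actually scored
    latency_ms: float     # latency of the vendor call


def pick_winner(outcomes: Sequence[ConfigOutcome], k: int = 10) -> ConfigOutcome:
    """Highest Precision@K wins; ties go to more judged candidates, then lower latency."""
    return max(
        outcomes,
        key=lambda o: (o.relevant_count / k, o.judged_count, -o.latency_ms),
    )
```

Negating the latency lets one `max` call cover all three criteria: tuples compare element by element, so latency only matters once precision and judged count are equal.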