OpenFunnel Lookalike Benchmark

Open head-to-head leaderboard for company-lookalike APIs.

Live leaderboard: https://benchmarks.openfunnel.dev/leaderboards/lookalike

This repo is the open data and code mirror of that page. Every cell on the leaderboard is backed by a literal HTTP request/response envelope and a literal LLM judge prompt and response, both committed under data/lookalike-runs/.

For each seed company, every vendor returns its top K = 10 lookalikes. An LLM judge (gpt-5.4-mini) scores each returned company for relevance against the seed's business model. The cell value is Precision@K: relevant / K.
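
The relevant / K arithmetic can be sketched as follows. This is a minimal illustration, not the repo's implementation: the function name and the list-of-booleans input shape are assumptions.

```python
def precision_at_k(verdicts: list[bool], k: int = 10) -> float:
    """Share of the top-K slots that the judge marked relevant.

    `verdicts` holds one boolean per returned candidate, in rank order
    (an assumed input shape). The denominator is K, following the
    relevant / K definition.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = sum(1 for verdict in verdicts[:k] if verdict)
    return relevant / k


# Seven of ten candidates judged relevant.
example = [True, True, False, True, True, True, False, True, False, True]
print(precision_at_k(example))  # 0.7
```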

Endpoints

Live leaderboard UI — https://benchmarks.openfunnel.dev/leaderboards/lookalike
JSON API — https://benchmarks.openfunnel.dev/api/leaderboards/lookalike
Markdown agent docs — https://benchmarks.openfunnel.dev/llms.txt
OpenAPI 3.1 spec — https://benchmarks.openfunnel.dev/openapi.json
MCP server discovery — https://benchmarks.openfunnel.dev/.well-known/mcp.json

Current leaderboard

| # | Vendor | Precision@K | Judged | Avg latency (ms) |
| --- | --- | --- | --- | --- |
| 1 | OpenFunnel | 89.05% | 14/14 | 30746.7 |
| 2 | PredictLeads | 73.57% | 14/14 | 725.6 |
| 3 | Ocean.io | 71.43% | 14/14 | 1840.4 |
| 4 | Parallel | 70.0% | 13/14 | 1491.5 |
| 5 | Exa | 37.32% | 14/14 | 243.9 |

14 seed companies × 5 vendors. The full per-cell breakdown and the raw audit trail (every HTTP request + every judge call) live under data/lookalike-runs/.

What's in this repo

| Path | Purpose |
| --- | --- |
| `data/latest-lookalike.json` | The leaderboard snapshot: seeds, per-vendor rows, per-cell aggregates. |
| `data/lookalike-runs/<dataset>/<seed>/<vendor>.json` | Slim per-cell artifact: the winning config's candidates with the judge's binary verdict + one-line rationale. |
| `data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json` | Full audit trail: every config attempted, with the literal HTTP request/response (auth headers redacted) and the literal LLM prompt + raw response per candidate. |
| `data/lookalike-runs/README.md` | Schema docs for the raw artifacts. |
| `manifest.json` | Flat index of every cell with the headline numbers + file paths. Easy to ingest programmatically. |
| `scripts/run_lookalike_benchmark.py` | Orchestrator. Sweeps every config each runner declares; keeps the highest-Precision@K winner per cell. |
| `scripts/lookalike/runners/<vendor>.py` | One file per vendor: endpoint URL, auth, request shape, response parser, swept configs. |
| `scripts/lookalike/judge.py` | LLM judge: system prompt, Pydantic verdict schema, mock mode. |
| `scripts/lookalike/common.py` | Dataclasses + HTTP helper + persistence + redaction. |
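
Because manifest.json is a flat index of cells, vendor rows can in principle be recomputed from it. The sketch below shows one way to do that. The key names `vendor` and `precision_at_k` are assumptions for illustration only; check the manifest itself and data/lookalike-runs/README.md for the real field names.

```python
import json
from collections import defaultdict
from statistics import mean


def vendor_means(cells: list[dict]) -> dict[str, float]:
    """Mean Precision@K per vendor across judged cells.

    Assumes each cell is a dict with the hypothetical keys "vendor" and
    "precision_at_k". Cells whose precision is None are treated as
    unjudged and skipped.
    """
    by_vendor: dict[str, list[float]] = defaultdict(list)
    for cell in cells:
        precision = cell.get("precision_at_k")
        if precision is not None:
            by_vendor[cell["vendor"]].append(precision)
    return {vendor: mean(values) for vendor, values in by_vendor.items()}


# Inline stand-in for the parsed manifest; the real layout may differ.
sample = json.loads("""
[
  {"vendor": "vendor-a", "seed": "seed-1", "precision_at_k": 0.5},
  {"vendor": "vendor-a", "seed": "seed-2", "precision_at_k": 1.0},
  {"vendor": "vendor-b", "seed": "seed-1", "precision_at_k": null}
]
""")
print(vendor_means(sample))  # {'vendor-a': 0.75}
```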

Reproducing a cell

Pick any (seed, vendor) pair on the leaderboard. The corresponding raw file at data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json contains, for every config the orchestrator swept:

vendor_calls[].request_* — replay the HTTP call with your own credentials.
judge_calls[].messages — replay the literal LLM prompt against any OpenAI-v1 compatible model (OpenAI direct, Azure, Anthropic via converter, local llama, your own fine-tune).

This is the open-source "audit trail" claim: every Precision@K number in the leaderboard is backed by a literal HTTP envelope you can re-run plus a literal LLM prompt you can re-score with your own judge to measure bias.
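
As a rough sketch of the re-scoring step, the helper below replays recorded judge prompts against another model. It assumes the target endpoint exposes the OpenAI-style `/chat/completions` route and response shape. The function names, base URL, and model name are placeholders, and the code takes the `judge_calls` list directly because how it nests inside the raw file is not shown here.

```python
import json
import urllib.request


def build_replay_request(messages: list[dict], base_url: str,
                         model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for one judge call."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def replay(judge_calls: list[dict], base_url: str,
           model: str, api_key: str) -> list[str]:
    """Send each recorded prompt to another judge and return the raw replies."""
    replies = []
    for call in judge_calls:
        request = build_replay_request(call["messages"], base_url, model, api_key)
        with urllib.request.urlopen(request, timeout=60) as response:
            body = json.load(response)
        # Assumes the OpenAI-style response shape: choices[0].message.content
        replies.append(body["choices"][0]["message"]["content"])
    return replies
```

Comparing the replies from `replay` with the committed raw responses gives a per-candidate view of where a different judge disagrees.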

Running the benchmark yourself

python3 -m venv .venv && source .venv/bin/activate
python3 -m pip install -r requirements.txt
cp .env.example .env && $EDITOR .env   # fill in vendor keys + judge endpoint

PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py --mock          # offline smoke test (no keys)
PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py                 # live full sweep
PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py --only openfunnel --seeds pylon,liveblocks

Contributing a new vendor

Drop a new file scripts/lookalike/runners/<your_vendor>.py exporting VENDOR_SLUG, VENDOR_NAME, CONFIGS: list[dict], and run(seed, k, config) -> RunResult.
Register it in scripts/lookalike/runners/__init__.py::REGISTRY.
Add a row to data/latest-lookalike.json::leaderboard so the orchestrator knows to aggregate it.
Run PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py --only <your_vendor> and open a PR.
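
A runner file following these steps might look roughly like the skeleton below. Only the four exported names come from this README. The `RunResult` fields, the `str` type of `seed`, the config keys, and the `_call_vendor` helper are all hypothetical stand-ins so the sketch runs offline; a real runner would use the repo's own `RunResult` and make an actual HTTP call.

```python
from dataclasses import dataclass, field


# Stand-in for the repo's real RunResult; these fields are hypothetical.
@dataclass
class RunResult:
    vendor: str
    seed: str
    config: dict
    candidates: list[dict] = field(default_factory=list)


VENDOR_SLUG = "examplevendor"    # hypothetical slug
VENDOR_NAME = "Example Vendor"   # hypothetical display name
CONFIGS: list[dict] = [          # hypothetical config variants to sweep
    {"mode": "semantic"},
    {"mode": "agentic"},
]


def _call_vendor(seed: str, k: int, config: dict) -> list[dict]:
    """Placeholder for the real HTTP call; returns canned data."""
    return [
        {"name": f"{seed}-lookalike-{i}", "domain": f"example{i}.com"}
        for i in range(k + 2)
    ]


def run(seed: str, k: int, config: dict) -> RunResult:
    """Fetch lookalikes for one seed under one config and trim to the top K."""
    raw = _call_vendor(seed, k, config)
    return RunResult(vendor=VENDOR_SLUG, seed=seed, config=config,
                     candidates=raw[:k])


result = run("pylon", 10, CONFIGS[0])
print(len(result.candidates))  # 10
```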

Methodology

Precision@K. For each seed, the vendor returns up to K candidates. The judge labels each candidate relevant: bool against the seed's description. Cell value = relevant_count / K. Vendor row = mean across judged seeds.
Best-of sweep. Each runner declares 1 to 4 config variants (e.g. agentic vs semantic, with-query vs seed-only). For every cell we run all configs and keep the one with the highest Precision@K. Tiebreaker: more judged candidates, then lower latency.
Judge. Single-pass binary verdict + 1-line rationale per candidate. Same prompt and rubric across all vendors. We publish the literal prompt and the raw model response so judge bias is fully auditable — swap the model and re-score to measure drift.
Known limitations. (1) Judge bias: a single LLM judge has its own priors; we publish the full audit trail so you can swap and re-score. (2) K-tail vs precision tradeoff: vendors that can only return small sets would win P@K by default, so we require >= K results to score the cell. (3) No recall metric: precision says nothing about how many real lookalikes the vendor missed.
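
The best-of-sweep selection rule can be expressed as a single sort key. This is a minimal sketch under assumed names: `ConfigOutcome` and its fields are hypothetical and do not mirror the orchestrator's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class ConfigOutcome:
    """One swept config's result for a single (seed, vendor) cell (hypothetical shape)."""
    name: str
    precision: float
    judged: int
    latency_ms: float


def pick_winner(outcomes: list[ConfigOutcome]) -> ConfigOutcome:
    """Highest precision wins; ties go to more judged candidates, then lower latency."""
    return max(outcomes, key=lambda o: (o.precision, o.judged, -o.latency_ms))


outcomes = [
    ConfigOutcome("semantic", precision=0.8, judged=10, latency_ms=900.0),
    ConfigOutcome("agentic", precision=0.8, judged=10, latency_ms=400.0),
    ConfigOutcome("seed-only", precision=0.6, judged=10, latency_ms=100.0),
]
print(pick_winner(outcomes).name)  # agentic
```

Negating the latency lets one `max` call handle all three criteria, since the first two prefer larger values and the last prefers smaller.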
