Skip to content

openfunnel/gtm-bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OpenFunnel Lookalike Benchmark

Open head-to-head leaderboard for company-lookalike APIs.

Live leaderboard: https://benchmarks.openfunnel.dev/leaderboards/lookalike

This repo is the open data + code mirror of that page — every cell on the leaderboard is backed by a literal HTTP request/response envelope and a literal LLM judge prompt + response, both committed under data/lookalike-runs/.

For each seed company, every vendor returns its top-K = 10 lookalikes; an LLM judge (gpt-5.4-mini) scores each returned company for relevance against the seed's business model; cell value is Precision@Krelevant / K.

Endpoints

Current leaderboard

# Vendor Precision@K Judged Avg latency
1 OpenFunnel 89.05% 14/14 30746.7ms
2 PredictLeads 73.57% 14/14 725.6ms
3 Ocean.io 71.43% 14/14 1840.4ms
4 Parallel 70.0% 13/14 1491.5ms
5 Exa 37.32% 14/14 243.9ms

14 seed companies × 5 vendors. The full per-cell breakdown and the raw audit trail (every HTTP request + every judge call) lives under data/lookalike-runs/.

What's in this repo

path purpose
data/latest-lookalike.json The leaderboard snapshot — seeds, per-vendor rows, per-cell aggregates.
data/lookalike-runs/<dataset>/<seed>/<vendor>.json Slim per-cell artifact — winning config's candidates with the judge's binary verdict + one-line rationale.
data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json Full audit trail — every config attempted, with the literal HTTP request/response (auth headers redacted) and the literal LLM prompt + raw response per candidate.
data/lookalike-runs/README.md Schema docs for the raw artifacts.
manifest.json Flat index of every cell with the headline numbers + file paths. Easy to ingest programmatically.
scripts/run_lookalike_benchmark.py Orchestrator. Sweeps every config each runner declares; keeps the highest-Precision@K winner per cell.
scripts/lookalike/runners/<vendor>.py One file per vendor — endpoint URL, auth, request shape, response parser, swept configs.
scripts/lookalike/judge.py LLM judge — system prompt, Pydantic verdict schema, mock mode.
scripts/lookalike/common.py Dataclasses + HTTP helper + persistence + redaction.

Reproducing a cell

Pick any (seed, vendor) pair on the leaderboard. The corresponding raw file at data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json contains, for every config the orchestrator swept:

  • vendor_calls[].request_* — replay the HTTP call with your own credentials.
  • judge_calls[].messages — replay the literal LLM prompt against any OpenAI-v1 compatible model (OpenAI direct, Azure, Anthropic via converter, local llama, your own fine-tune).

This is the open-source "audit trail" claim: every Precision@K number in the leaderboard is backed by a literal HTTP envelope you can re-run plus a literal LLM prompt you can re-score with your own judge to measure bias.

Running the benchmark yourself

python3 -m venv .venv && source .venv/bin/activate
python3 -m pip install -r requirements.txt
cp .env.example .env && $EDITOR .env   # fill in vendor keys + judge endpoint

PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py --mock          # offline smoke test (no keys)
PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py                 # live full sweep
PYTHONPATH=scripts python scripts/run_lookalike_benchmark.py --only openfunnel --seeds pylon,liveblocks

Contributing a new vendor

  1. Drop a new file scripts/lookalike/runners/<your_vendor>.py exporting VENDOR_SLUG, VENDOR_NAME, CONFIGS: list[dict], and run(seed, k, config) -> RunResult.
  2. Register it in scripts/lookalike/runners/__init__.py::REGISTRY.
  3. Add a row to data/latest-lookalike.json::leaderboard so the orchestrator knows to aggregate it.
  4. Run python scripts/run_lookalike_benchmark.py --only <your_vendor> and open a PR.

Methodology

  • Precision@K. For each seed, the vendor returns up to K candidates. The judge labels each candidate relevant: bool against the seed's description. Cell value = relevant_count / K. Vendor row = mean across judged seeds.
  • Best-of sweep. Each runner declares 1-4 config variants (e.g. agentic vs semantic, with-query vs seed-only). For every cell we run all configs and keep the highest-Precision@K winner. Tiebreaker: more judged candidates, then lower latency.
  • Judge. Single-pass binary verdict + 1-line rationale per candidate. Same prompt and rubric across all vendors. We publish the literal prompt and the raw model response so judge bias is fully auditable — swap the model and re-score to measure drift.
  • Known limitations. (1) Judge bias: a single LLM judge has its own priors; we publish the full audit trail so you can swap and re-score. (2) K-tail vs precision tradeoff: vendors that can only return small sets win P@K by default — we require >= K results to score the cell. (3) No recall metric: precision says nothing about how many real lookalikes the vendor missed.

About

Benchmarks for GTM Vendors

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages