leaderboard: per-dataset metric derivation, chip filters + metric toggle on every table #29
Open
radinhamidi wants to merge 1 commit into
…ric toggle on every table

- per-dataset shard reads metrics from actual runs (no MAP/recall_1000 phantom columns)
- shared FilterChips + MatrixCell components reused across home / dataset / method / model / retriever pages
- every per-X table gets chip filters (method/model/retriever/metric as applicable) + metric toggle
- pretty metric labels (nDCG@10, R@1k, R@100, MAP) everywhere
- drop double scrollbar on home + per-dataset tables
- /models index renders display label, not provider-prefixed id
- /runs page shows method display name; reproduce snippet aligned to example pipeline with correct Pyserini index names and qrels-based trec_eval
- /about page no longer claims run.txt/queries.tsv are guaranteed; path includes retriever segment

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Comprehensive table-quality pass across every leaderboard page, driven by issues found in the latest review:
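- Per-dataset metrics: the per-dataset tables previously showed whatever `eval_metrics` the dataset registry listed (MAP on TREC DL, recall_1000 on BEIR), regardless of whether those metrics actually appeared in the data. The per-dataset shard now derives its columns from the actual run rows, the same approach the home matrix uses.
- Chip filters and a metric toggle on every per-X table (method/model/retriever/metric as applicable), wired through the `qg-chip-hidden` → `qg-itable-reapply` handshake.
- Pretty metric labels (`ndcg_cut_10` → `nDCG@10`, `recall_1000` → `R@1k`, `recall_100` → `R@100`, `map` → `MAP`) on /datasets/[id] and the per-X pages too.
- No more double scrollbar on the home and per-dataset tables: dropped `max-h-[70vh] overflow-y-auto`. The page scrolls naturally; the `sticky top-0` thead sticks to the viewport.
- /models index renders the display label (`gpt-4.1`), not the provider-prefixed id (`openai/gpt-4.1`), which matches the /methods index convention.
- Reproduce snippet on /runs is aligned to the example pipeline: Pyserini index names no longer use `.flat.splade-pp-ed` / `.flat.bge-base-en-v1.5` for non-lexical paradigms; trec_eval references the qrels key from the dataset registry, not the topics key.
- /runs page shows the method display name (`Q2D (FS)` etc.), not the raw method_id.
- /about page no longer claims `.run.txt` and queries.tsv are guaranteed; those are optional under the current schema. The path includes the `{retriever}` segment that PR #20 (Schema: optional artifacts + DL-HARD dataset entry) added.
- Shared components reused across the home / dataset / method / model / retriever pages: `MatrixCell.astro` (link + primary/secondary spans + sort hooks) and `FilterChips.astro` (groups + metric special-case + reapply event).

Test plan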
- `python -m pytest reproducibility/tests/`: 44/44 passing
- `pnpm --filter @qg/leaderboard build`: clean (1113 pages built)
- Reproduce snippet uses `beir-v1.0.0-trec-covid.splade-pp-ed`, not `.flat.splade-pp-ed`
- Home and per-dataset tables no longer have the `max-h-[70vh]` wrapper

🤖 Generated with Claude Code
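
Reviewer note: a minimal sketch of the per-dataset metric derivation and label mapping described in the summary, for illustration only. It is not the code in this PR. The row shape (`{"metrics": {...}}`), the function names, and the choice of Python are all assumptions; only the four label pairs come from the summary.

```python
# Hypothetical sketch: derive table columns from the run rows themselves,
# so a metric listed in the registry but absent from the data never
# becomes a phantom column.

# Label pairs as listed in the summary.
PRETTY_LABELS = {
    "ndcg_cut_10": "nDCG@10",
    "recall_1000": "R@1k",
    "recall_100": "R@100",
    "map": "MAP",
}


def derive_metric_columns(run_rows):
    """Return metric keys with at least one non-null value, in first-seen order."""
    seen = {}  # dict preserves insertion order
    for row in run_rows:
        for key, value in row.get("metrics", {}).items():
            if value is not None:
                seen.setdefault(key, True)
    return list(seen)


def pretty_label(metric_key):
    """Map a raw trec_eval-style key to its display label; unknown keys pass through."""
    return PRETTY_LABELS.get(metric_key, metric_key)


# Assumed example data: the registry lists "map", but no run reports it.
registry_metrics = ["ndcg_cut_10", "map"]
run_rows = [
    {"run_id": "a", "metrics": {"ndcg_cut_10": 0.71, "recall_100": 0.50}},
    {"run_id": "b", "metrics": {"ndcg_cut_10": 0.68, "map": None}},
]

columns = derive_metric_columns(run_rows)
headers = [pretty_label(key) for key in columns]
print(columns, headers)
```

With this shape, `map` is dropped because no row carries a value for it, while `recall_100` appears even though the registry did not list it.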