Add reproducibility data pipeline (schema v1, lib, scripts, CI) #4

Merged
radinhamidi merged 1 commit into main from
feat/reproducibility-pipeline
Apr 29, 2026

@radinhamidi
Member

Summary

Adds a self-contained reproducibility/ umbrella that backs leaderboard.querygym.com and the SIGIR 2026 reproducibility paper. querygym/ is untouched; the wheel is unchanged.

  • Schema v1 (reproducibility/schema.json) is the language-neutral contract every run JSON must satisfy. Validated three times (emit / submit / aggregate) so drift can't leak into the leaderboard.
  • Embedded hashes: params_hash (8 hex over the tuning surface, doubles as filename) and run_id (16 hex over the payload minus volatile fields). Hand-edits to a metric value fail validation with a clear run_id mismatch.
  • Layout: reproducibility/data/runs/{dataset_id}/{method_id}/{model}/{params_hash}.{json,run.txt,queries.tsv}.
  • Tooling: reproducibility/lib/ (private helpers), scripts/aggregate_runs.py with --check for CI, scripts/submit_run.py used by both trusted and fork contributors.
  • Example pipeline wired: examples/querygym_pyserini/pipeline.py calls build_run_summary for full pipelines; partial pipelines write pipeline_partial.json instead.
  • CI workflow runs schema/validator tests + aggregator --check on PRs touching reproducibility/** or the example pipeline. No pyserini/trec_eval in CI by design — fork PRs are verified manually by maintainers re-running locally.
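The embedded hashes described above can be sketched roughly as follows. This is a minimal illustration, not the code in reproducibility/lib/: the choice of SHA-256, the canonical-JSON serialization, and the `VOLATILE_FIELDS` names are all assumptions, and only the 8-hex and 16-hex lengths come from this PR.

```python
import hashlib
import json

# Hypothetical volatile fields; the real list is defined by the schema, not here.
VOLATILE_FIELDS = {"run_id", "created_at"}


def _digest(obj, length):
    """Hash a JSON-serializable object via canonical JSON (sorted keys, no spaces)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    # SHA-256 is an assumption; the PR does not name the hash function.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def params_hash(params):
    """8 hex characters over the tuning surface; also usable as a filename stem."""
    return _digest(params, 8)


def run_id(payload):
    """16 hex characters over the payload with volatile top-level fields removed."""
    stable = {k: v for k, v in payload.items() if k not in VOLATILE_FIELDS}
    return _digest(stable, 16)


def check_run_id(payload):
    """Return True when the embedded run_id matches a fresh recomputation."""
    return payload.get("run_id") == run_id(payload)
```

Because the metric values sit inside the hashed payload, a hand-edited metric changes the recomputed run_id and `check_run_id` returns False, while changes to a volatile field leave it untouched.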

See reproducibility/README.md and docs/user-guide/reproducibility.md for contributor flows; reproducibility/schema.md for the field-by-field schema.
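For orientation, the run layout from the summary can be expressed as a small path helper. This is a sketch only; `run_artifact_paths` is a hypothetical name and is not claimed to exist in reproducibility/lib/. The directory template and the three suffixes are taken from this PR.

```python
from pathlib import PurePosixPath

# Root and suffixes as stated in the PR summary.
RUNS_ROOT = PurePosixPath("reproducibility/data/runs")
SUFFIXES = ("json", "run.txt", "queries.tsv")


def run_artifact_paths(dataset_id, method_id, model, params_hash):
    """Map each artifact suffix to its path under the runs/ layout."""
    base = RUNS_ROOT / dataset_id / method_id / model
    return {suffix: base / f"{params_hash}.{suffix}" for suffix in SUFFIXES}
```

All three artifacts for a run share the params_hash stem, so a run JSON and its run file and queries file can be located from the same four identifiers.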

Test plan

  • pytest reproducibility/tests -v --no-cov — 19 tests cover hashing, schema rejections, registry checks, hash tampering, and silent metric edits.
  • python -m reproducibility.scripts.aggregate_runs on an empty runs/ produces a deterministic CSV + manifest; --check exits 0.
  • End-to-end: copied the test fixture into runs/ via submit_run.py, ran aggregate_runs, confirmed 3 rows; tampered with a metric → --check fails with a clear error.
  • python -m build --sdist confirms reproducibility/, web/, and runs/ content are NOT in the sdist; querygym/ files ship as before.
  • _build_v1_summary helper in pipeline.py validates against synthetic per-step metadata.
  • CI on this PR runs the new workflow against the new schema/tests.
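One way a check mode like the one exercised in this test plan could work is to re-render the aggregate deterministically and compare it with the committed file. The sketch below assumes that behavior; it is not the actual aggregate_runs.py, and the function names and column names used with it are hypothetical.

```python
import csv
import io


def render_csv(rows, fieldnames):
    """Render rows as CSV text with fixed columns, sorted rows, and \\n line endings."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Sorting makes the output independent of the order runs were discovered in.
    for row in sorted(rows, key=lambda r: tuple(str(r[f]) for f in fieldnames)):
        writer.writerow(row)
    return buf.getvalue()


def check_exit_code(rows, fieldnames, committed_text):
    """Return 0 when the committed CSV matches a fresh render, otherwise 1."""
    return 0 if render_csv(rows, fieldnames) == committed_text else 1
```

With no rows, the render is just the header line, which is consistent with an empty runs/ directory still producing a stable output.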

Out of scope (separate PRs)

  • One-time SIGIR JSON regen — to land after this PR is merged and the schema is locked.
  • Web/leaderboard scaffolding (reproducibility/site/).
  • Decommissioning Jekyll (_config.yml, _layouts/, docs/leaderboard.html).

🤖 Generated with Claude Code

Adds a self-contained reproducibility/ umbrella that backs
leaderboard.querygym.com and the SIGIR 2026 reproducibility paper.
Nothing in querygym/ is touched; the wheel is unchanged.

Schema v1 (reproducibility/schema.json) is the language-neutral
contract every run JSON must satisfy. Three validation passes
(emit / submit / aggregate) prevent drift. Hashes are embedded:
params_hash (8 hex over the tuning surface, doubles as filename)
and run_id (16 hex over the payload minus volatile fields).

Layout: reproducibility/data/runs/{dataset_id}/{method_id}/{model}/
{params_hash}.{json,run.txt,queries.tsv}

Tooling:
- reproducibility/lib/: build_run_summary, validate, hash helpers
  (private to this repo's tooling; external consumers read schema.json)
- reproducibility/scripts/aggregate_runs.py with --check for CI
- reproducibility/scripts/submit_run.py for both trusted contribs
  and fork PR submitters
- reproducibility/tests/: 19 tests covering hashing, validation,
  and hostile inputs

Wires examples/querygym_pyserini/pipeline.py to call
build_run_summary at the end of run_pipeline; falls back to a
pipeline_partial.json for incomplete runs.

CI workflow runs on PRs touching reproducibility/** or the
example pipeline. No pyserini/trec_eval in CI by design;
fork PRs are verified manually by maintainers re-running locally.

MANIFEST.in prunes reproducibility/ and web/ from sdist;
pyproject.toml adds a 'repro' extra (pandas, jsonschema) and
extends testpaths. .gitignore protects CLAUDE.local.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@radinhamidi radinhamidi merged commit d070722 into main Apr 29, 2026
1 check failed