Automated diagnostics for rubric quality. RIFT classifies rubric criteria against a taxonomy of eight failure modes organized into three categories: Reliability, Content Validity, and Consequential Validity.
Paper: RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
| Mode | Category | Description |
|---|---|---|
| subjective | Reliability | Uses unanchored subjective terms |
| non_atomic | Reliability | Bundles multiple independently scorable requirements |
| ungrounded | Reliability | Requires verification without providing grounding |
| misaligned_or_rigid | Content Validity | Grades wrong objective or over-constrains |
| missing_criteria | Content Validity | Prompt implies requirements the rubric doesn't cover |
| hackable | Consequential Validity | Gameable via proxy metrics |
| low_signal | Consequential Validity | Rubric as a whole doesn't discriminate well |
| redundant_criteria | Consequential Validity | Multiple criteria evaluate the same requirement |
Requires uv.
git clone <repo>
cd rift
uv sync
cp .env.example .env # add your API keys
.env keys:
OPENAI_API_KEY=...
GEMINI_API_KEY=...
# Prevalence experiment — 3 rubrics per source, fast sanity check
uv run python prevalence_experiment.py --n 3 --concurrency 5 --judge gpt-5.4-2026-03-05
# HBP experiment — 3 conversations
uv run python hbp_experiment.py --n 3 --concurrency 3 --judge gpt-5.4-2026-03-05 --eval-strategy scoped
The prevalence experiment reproduces Table 2 of the RIFT paper. It evaluates failure mode prevalence across five rubric datasets using the joined strategy (the full rubric is evaluated against all failure modes at once, which is equivalent to the paper's method).
uv run python prevalence_experiment.py --n 50 --concurrency 10 --judge gpt-5.2-2025-12-11
Results are saved to results/prevalence_<timestamp>.jsonl and cached per judge. Each record includes rubric_text, labels (majority-voted), votes (raw per-run outputs), n_votes, and an error field if the API call failed.
The default for --n is 10 rubrics per source (5 sources → 50 total API calls). The paper uses --votes 5.
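Once a results file exists, the per-mode prevalence can be tallied from the records described above. The following is a minimal sketch, not part of the repository: `mode_prevalence` is a hypothetical helper, and it assumes that `labels` is a mapping from failure-mode name to a boolean. If the actual files store labels differently (for example as a list of mode names), the inner loop needs adjusting.

```python
import json
from collections import Counter
from pathlib import Path


def mode_prevalence(path):
    """Return the fraction of successfully judged records flagged per failure mode.

    Assumption: record["labels"] maps failure-mode name -> bool.
    Records carrying a non-empty "error" field are skipped.
    """
    counts = Counter()
    total = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get("error"):
                continue  # the API call failed for this record
            total += 1
            for mode, flagged in record.get("labels", {}).items():
                if flagged:
                    counts[mode] += 1
    if total == 0:
        return {}
    return {mode: n / total for mode, n in counts.items()}
```

Calling `mode_prevalence` on a file under results/ returns a dictionary of mode names to fractions, which can be compared across judges since results are cached per judge.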
Runs RIFT on all 525 conversations and 1,135 rubric criteria from HealthBench Professional.
uv run python hbp_experiment.py --concurrency 8 --judge gpt-5.4-2026-03-05 --eval-strategy scoped
Results are saved to results/hbp_<timestamp>.jsonl and cached per judge + strategy. Each record includes rubric_text, labels (majority-voted), votes (raw per-run outputs), n_votes, and an error field if the API call failed.
--eval-strategy controls how failure modes are applied:
- joined: all rubric criteria for a conversation are concatenated into one string and evaluated with all failure modes together. Equivalent to the paper's method.
- scoped (default): criterion-scope failure modes run on each criterion individually; rubric-scope modes run on the full joined rubric. Produces per_criterion and per_conversation records, letting you pinpoint failure modes at the criterion level rather than just the rubric.
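To make the scoped strategy concrete, the sketch below lists which text would be paired with which failure modes for one conversation. It is an illustration under assumptions, not the repository's implementation: `split_by_scope` and `plan_scoped_calls` are hypothetical names, the scope assignments are copied from the failure-mode configuration in this README, and joining criteria with newlines is an assumed separator.

```python
# Scope per failure mode, copied from the configuration in this README.
FAILURE_MODES = {
    "subjective": "criterion",
    "non_atomic": "criterion",
    "ungrounded": "criterion",
    "misaligned_or_rigid": "criterion",
    "missing_criteria": "rubric",
    "hackable": "criterion",
    "low_signal": "rubric",
    "redundant_criteria": "rubric",
}


def split_by_scope(failure_modes):
    """Partition mode names into (criterion-scope, rubric-scope) lists."""
    criterion = [m for m, scope in failure_modes.items() if scope == "criterion"]
    rubric = [m for m, scope in failure_modes.items() if scope == "rubric"]
    return criterion, rubric


def plan_scoped_calls(criteria, failure_modes=FAILURE_MODES):
    """Return (text, modes) pairs that a scoped run would evaluate.

    Each criterion is paired with the criterion-scope modes; the joined
    rubric is paired with the rubric-scope modes. The newline separator
    is an assumption made for this sketch.
    """
    criterion_modes, rubric_modes = split_by_scope(failure_modes)
    calls = [(text, criterion_modes) for text in criteria]
    calls.append(("\n".join(criteria), rubric_modes))
    return calls
```

Under these assumptions a conversation with N criteria yields N criterion-level evaluations plus one rubric-level evaluation, whereas the joined strategy would yield a single evaluation covering all eight modes.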
To run RIFT on any rubric dataset, prepare a JSONL file where each line has two required fields:
{"input_context": "Write a haiku about winter.", "rubric_text": "5 pts: Contains exactly 17 syllables in 5-7-5 structure."}
{"input_context": "Summarize the article.", "rubric_text": "10 pts: Covers all main points accurately."}
Any additional fields are passed through as metadata in the output. Then run:
uv run python run.py --input my_rubrics.jsonl
uv run python run.py --input my_rubrics.jsonl --eval-strategy scoped --votes 3 --judge gpt-5.4-2026-03-05
A sample file with 10 rubrics drawn from the five paper datasets is included for quick testing:
uv run python run.py --input sample_rubrics.jsonl --concurrency 5 --judge gpt-5.4-2026-03-05
Results are saved to results/run_<timestamp>.jsonl with the same schema as the other experiments.
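Before spending API calls on a custom dataset, it can help to check that every line has the two required fields, input_context and rubric_text. The sketch below is a hypothetical pre-flight check, not something shipped with RIFT; treating empty strings as invalid is an assumption of this sketch.

```python
import json

REQUIRED_FIELDS = ("input_context", "rubric_text")


def validate_rubric_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "expected a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            # Assumption: both fields must be non-empty strings.
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty field: {field}"))
    return problems
```

The function accepts any iterable of lines, so an open file handle for my_rubrics.jsonl can be passed directly. Extra fields are left alone, since they are passed through as metadata.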
All experiments share the same CLI parameters:
| Flag | Default | Description |
|---|---|---|
| --judge | gpt-5.4-2026-03-05 | Judge model(s) to use, space-separated. See Judges section for available models. |
| --concurrency | 10 | Max simultaneous API calls. Lower this if you hit rate limits. |
| --n | (experiment-specific) | Limit to first N items (rubrics or conversations). Useful for quick tests. |
| --votes | 1 | Number of judge runs per rubric; majority vote is used when >1. The paper uses 5. |
| --no-cache | off | Force re-run even if cached results exist for this judge + strategy. |
| --eval-strategy | scoped | (HBP only) joined or scoped. See HBP experiment section for details. |
Defaults for --n and --concurrency differ per experiment — see each experiment section above.
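The --votes option combines several judge runs by majority vote. A minimal sketch of that aggregation follows. It assumes each raw vote is a mapping from failure-mode name to a boolean, and it resolves ties to "not flagged"; both the data shape and the tie-break are assumptions of this sketch, and `majority_labels` is a hypothetical name rather than a function from the repository.

```python
def majority_labels(votes):
    """Combine per-run outputs into one boolean label per failure mode.

    Assumption: each vote maps failure-mode name -> bool. A mode is
    flagged only when strictly more than half of all runs flag it, so
    ties (possible with an even number of votes) resolve to False.
    """
    tally = {}
    for vote in votes:
        for mode, flagged in vote.items():
            tally[mode] = tally.get(mode, 0) + (1 if flagged else 0)
    return {mode: 2 * count > len(votes) for mode, count in tally.items()}
```

Using an odd number of votes, such as the 5 used in the paper, avoids ties altogether.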
| Dataset | HuggingFace | Type | Used in |
|---|---|---|---|
| AdvancedIF | 🤗 facebook/AdvancedIF | Human-curated | Prevalence experiment |
| ResearchRubrics | 🤗 ScaleAI/researchrubrics | Human-written | Prevalence experiment |
| WildChecklists | 🤗 viswavi/wildchecklists | LLM-generated | Prevalence experiment |
| OpenRubrics | 🤗 OpenRubrics/OpenRubrics | LLM-generated | Prevalence experiment |
| Auto-Rubric | 🤗 agentscope-ai/Auto-Rubric | LLM-generated | Prevalence experiment |
| HealthBench Professional | 🤗 openai/healthbench-professional | Physician-written | HBP experiment |
| Model ID | Provider | Notes |
|---|---|---|
| gpt-5.2-2025-12-11 | OpenAI | Paper's primary judge |
| gpt-5.4-2026-03-05 | OpenAI | Latest OpenAI judge |
| gemini-3.1-pro-preview | Google | Latest Gemini Pro judge |
| gemini-3.1-flash-lite | Google | Latest Gemini Flash judge |
Pass one or more judges via --judge. Results for each judge are cached and analyzed separately.
To register a new judge, add an entry to JUDGE_REGISTRY in the experiment file:
"your-model-id": ("openai", "OPENAI_API_KEY"), # or "google" + "GEMINI_API_KEY"
Then add the corresponding API key to .env.
This configuration controls which failure modes are enabled and the scope each one runs at. Remove a mode to exclude it from all experiments; change a scope value to override the default.
{
"failure_modes": {
"subjective": "criterion",
"non_atomic": "criterion",
"ungrounded": "criterion",
"misaligned_or_rigid": "criterion",
"missing_criteria": "rubric",
"hackable": "criterion",
"low_signal": "rubric",
"redundant_criteria": "rubric"
}
}
@article{qi2026rift,
title={RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics},
author={Qi, Zhengyang and Dickens, Charles and Pham, Derek and Dsouza, Amanda and Parchami, Armin and Sala, Frederic and Varma, Paroma},
journal={arXiv preprint arXiv:2604.01375},
year={2026}
}