RIFT — RubrIc Failure mode Taxonomy

Automated diagnostics for rubric quality. RIFT classifies rubric criteria against a taxonomy of eight failure modes organized into three categories: Reliability, Content Validity, and Consequential Validity.

Paper: RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics

Failure modes

Mode	Category	Description
`subjective`	Reliability	Uses unanchored subjective terms
`non_atomic`	Reliability	Bundles multiple independently scorable requirements
`ungrounded`	Reliability	Requires verification without providing grounding
`misaligned_or_rigid`	Content Validity	Grades wrong objective or over-constrains
`missing_criteria`	Content Validity	Prompt implies requirements the rubric doesn't cover
`hackable`	Consequential Validity	Gameable via proxy metrics
`low_signal`	Consequential Validity	Rubric as a whole doesn't discriminate well
`redundant_criteria`	Consequential Validity	Multiple criteria evaluate the same requirement
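
For scripting against the taxonomy, the table above can be transcribed into a small lookup. This is only an illustrative sketch: `FAILURE_MODES` and `modes_by_category` are hypothetical names, not identifiers taken from the RIFT package.

```python
from collections import defaultdict

# Mode -> category, transcribed from the table above.
# The variable name is an assumption, not part of the rift package.
FAILURE_MODES = {
    "subjective": "Reliability",
    "non_atomic": "Reliability",
    "ungrounded": "Reliability",
    "misaligned_or_rigid": "Content Validity",
    "missing_criteria": "Content Validity",
    "hackable": "Consequential Validity",
    "low_signal": "Consequential Validity",
    "redundant_criteria": "Consequential Validity",
}


def modes_by_category(modes=FAILURE_MODES):
    """Group mode names under their category, preserving table order."""
    grouped = defaultdict(list)
    for mode, category in modes.items():
        grouped[category].append(mode)
    return dict(grouped)


for category, names in modes_by_category().items():
    print(f"{category}: {', '.join(names)}")
```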

Installation

Requires uv.

git clone <repo>
cd rift
uv sync
cp .env.example .env   # add your API keys

.env keys:

OPENAI_API_KEY=...
GEMINI_API_KEY=...

Quick smoke test

# Prevalence experiment — 3 rubrics per source, fast sanity check
uv run python prevalence_experiment.py --n 3 --concurrency 5 --judge gpt-5.4-2026-03-05

# HBP experiment — 3 conversations
uv run python hbp_experiment.py --n 3 --concurrency 3 --judge gpt-5.4-2026-03-05 --eval-strategy scoped

Experiments

1. Prevalence experiment (`prevalence_experiment.py`)

Reproduces Table 2 of the RIFT paper. Evaluates failure mode prevalence across five rubric datasets using the joined strategy: the full rubric is evaluated against all failure modes at once, which is equivalent to the paper's method.

uv run python prevalence_experiment.py --n 50 --concurrency 10 --judge gpt-5.2-2025-12-11

Results are saved to results/prevalence_<timestamp>.jsonl and cached per judge. Each record includes rubric_text, labels (majority-voted), votes (raw per-run outputs), n_votes, and an error field if the API call failed.
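
A results file can be summarized with a few lines of Python. The sketch below assumes that `labels` is a dict mapping each failure mode name to a boolean, which the description above does not spell out; adjust it if the actual records differ. The helper name `prevalence` and the demo records are made up for illustration and are not real results.

```python
import json
from collections import Counter


def prevalence(lines):
    """Fraction of successfully judged records flagged for each failure mode.

    ASSUMPTION: each record's `labels` is a dict of mode name -> bool.
    Records with a non-empty `error` field are skipped.
    """
    flagged = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("error"):
            continue
        total += 1
        for mode, is_flagged in record["labels"].items():
            if is_flagged:
                flagged[mode] += 1
    if total == 0:
        return {}
    return {mode: count / total for mode, count in flagged.items()}


# Made-up records for illustration only.
demo = [
    json.dumps({"rubric_text": "a", "labels": {"subjective": True, "hackable": False}, "n_votes": 1}),
    json.dumps({"rubric_text": "b", "labels": {"subjective": False, "hackable": False}, "n_votes": 1}),
    json.dumps({"rubric_text": "c", "labels": {}, "n_votes": 0, "error": "timeout"}),
]
print(prevalence(demo))
```

For a real run, pass an open file handle for the saved JSONL file instead of `demo`.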

The default for --n is 10 rubrics per source (5 sources, so 50 rubrics and 50 API calls at the default --votes 1). The paper uses --votes 5.

2. HBP experiment (`hbp_experiment.py`)

Runs RIFT on all 525 conversations and 1,135 rubric criteria from HealthBench Professional.

uv run python hbp_experiment.py --concurrency 8 --judge gpt-5.4-2026-03-05 --eval-strategy scoped

Results are saved to results/hbp_<timestamp>.jsonl and cached per judge + strategy. Each record includes rubric_text, labels (majority-voted), votes (raw per-run outputs), n_votes, and an error field if the API call failed.

--eval-strategy controls how failure modes are applied:

joined — all rubric criteria for a conversation are concatenated into one string and evaluated with all failure modes together. Equivalent to the paper's method.
scoped (default) — criterion-scope failure modes run on each criterion individually; rubric-scope modes run on the full joined rubric. Produces per_criterion and per_conversation records, letting you pinpoint failure modes at the criterion level rather than just the rubric.
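
The difference between the two strategies can be sketched as a function that plans which text is judged against which failure modes. This is an illustration under stated assumptions, not the repository's implementation: `plan_calls` is a hypothetical name, the newline separator used for joining is a guess, and the real code may batch its judge calls differently.

```python
def plan_calls(criteria, scopes, strategy="scoped"):
    """Return a list of (text, modes) pairs, one per planned judge call.

    criteria: list of criterion strings for one conversation.
    scopes:   dict of failure mode -> "criterion" or "rubric".
    """
    joined_text = "\n".join(criteria)  # ASSUMPTION: newline separator
    if strategy == "joined":
        # One evaluation of the whole rubric against every enabled mode.
        return [(joined_text, sorted(scopes))]
    if strategy != "scoped":
        raise ValueError(f"unknown strategy: {strategy!r}")
    criterion_modes = sorted(m for m, s in scopes.items() if s == "criterion")
    rubric_modes = sorted(m for m, s in scopes.items() if s == "rubric")
    calls = []
    if criterion_modes:
        # Criterion-scope modes see each criterion on its own.
        calls.extend((criterion, criterion_modes) for criterion in criteria)
    if rubric_modes:
        # Rubric-scope modes see the full joined rubric.
        calls.append((joined_text, rubric_modes))
    return calls


scopes = {"subjective": "criterion", "non_atomic": "criterion", "low_signal": "rubric"}
criteria = ["States the diagnosis.", "Is clear and helpful."]
for text, modes in plan_calls(criteria, scopes):
    print(modes, "<-", repr(text))
```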

Bring your own dataset

To run RIFT on any rubric dataset, prepare a JSONL file where each line has two required fields:

{"input_context": "Write a haiku about winter.", "rubric_text": "5 pts: Contains exactly 17 syllables in 5-7-5 structure."}
{"input_context": "Summarize the article.", "rubric_text": "10 pts: Covers all main points accurately."}

Any additional fields are passed through as metadata in the output. Then run:

uv run python run.py --input my_rubrics.jsonl
uv run python run.py --input my_rubrics.jsonl --eval-strategy scoped --votes 3 --judge gpt-5.4-2026-03-05
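
Before running the commands above, it can help to check that every line of the input file parses and carries both required fields. The validator below is a hypothetical helper, not part of the repository. Treating empty strings as invalid and ignoring blank lines are assumptions; run.py may be more or less strict.

```python
import json

REQUIRED_FIELDS = ("input_context", "rubric_text")


def validate_jsonl(lines):
    """Return a list of (line_number, problem) for lines that look invalid."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ASSUMPTION: blank lines are harmless
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "expected a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty field: {field}"))
    return problems


sample = [
    '{"input_context": "Write a haiku about winter.", "rubric_text": "5 pts: Contains exactly 17 syllables in 5-7-5 structure."}',
    '{"input_context": "Summarize the article."}',
    "not json",
]
for number, problem in validate_jsonl(sample):
    print(f"line {number}: {problem}")
```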

A sample file with 10 rubrics drawn from the five paper datasets is included for quick testing:

uv run python run.py --input sample_rubrics.jsonl --concurrency 5 --judge gpt-5.4-2026-03-05

Results are saved to results/run_<timestamp>.jsonl with the same schema as the other experiments.

Parameters

All experiments share the same CLI parameters:

Flag               Default                    Description
--judge            gpt-5.4-2026-03-05         Judge model(s) to use, space-separated. See the Judges section for available models.
--concurrency      10                         Max simultaneous API calls. Lower this if you hit rate limits.
--n                (experiment-specific)      Limit to the first N items (rubrics or conversations). Useful for quick tests.
--votes            1                          Number of judge runs per rubric; a majority vote is used when >1. The paper uses 5.
--no-cache         off                        Force a re-run even if cached results exist for this judge + strategy.
--eval-strategy    scoped                     (hbp_experiment.py and run.py) joined or scoped; see the HBP experiment section for details. The prevalence experiment always uses joined.

Defaults for --n and --concurrency differ per experiment — see each experiment section above.
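
When --votes is greater than 1, the raw per-run outputs are reduced to a single set of labels by majority vote. A minimal sketch of that reduction is below. It assumes each vote is a dict of failure mode to boolean, and that a mode is flagged only when strictly more than half of the runs flag it. The tie-breaking rule and the function name `majority_vote` are assumptions, not confirmed behavior.

```python
def majority_vote(votes):
    """Combine per-run label dicts into one dict of mode -> bool.

    ASSUMPTION: each vote maps mode name -> bool; a mode missing from a run
    counts as not flagged; ties resolve to False.
    """
    if not votes:
        return {}
    # Preserve the order in which modes first appear across runs.
    modes = dict.fromkeys(mode for vote in votes for mode in vote)
    return {
        mode: sum(bool(vote.get(mode)) for vote in votes) * 2 > len(votes)
        for mode in modes
    }


votes = [
    {"subjective": True, "hackable": False},
    {"subjective": True, "hackable": True},
    {"subjective": False, "hackable": False},
]
print(majority_vote(votes))
```

An odd number of votes, such as the 5 used in the paper, avoids ties for binary labels.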

Datasets

Dataset	HuggingFace	Type	Used in
AdvancedIF	🤗 facebook/AdvancedIF	Human-curated	Prevalence experiment
ResearchRubrics	🤗 ScaleAI/researchrubrics	Human-written	Prevalence experiment
WildChecklists	🤗 viswavi/wildchecklists	LLM-generated	Prevalence experiment
OpenRubrics	🤗 OpenRubrics/OpenRubrics	LLM-generated	Prevalence experiment
Auto-Rubric	🤗 agentscope-ai/Auto-Rubric	LLM-generated	Prevalence experiment
HealthBench Professional	🤗 openai/healthbench-professional	Physician-written	HBP experiment

Judges

Model ID	Provider	Notes
`gpt-5.2-2025-12-11`	OpenAI	Paper's primary judge
`gpt-5.4-2026-03-05`	OpenAI	Latest OpenAI judge
`gemini-3.1-pro-preview`	Google	Latest Gemini Pro judge
`gemini-3.1-flash-lite`	Google	Latest Gemini Flash judge

Pass one or more judges via --judge. Results for each judge are cached and analyzed separately.

To register a new judge, add an entry to JUDGE_REGISTRY in the experiment file:

"your-model-id": ("openai", "OPENAI_API_KEY"),  # or "google" + "GEMINI_API_KEY"

Then add the corresponding API key to .env.
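
A registry entry pairs a model ID with a provider name and the environment variable that holds its API key. The sketch below shows how such an entry could be resolved at startup. Only the entry shape comes from this README; `resolve_judge` is a hypothetical helper, and the actual lookup in the experiment files may differ.

```python
import os

# Entry shape copied from the README; the model ID is a placeholder.
JUDGE_REGISTRY = {
    "your-model-id": ("openai", "OPENAI_API_KEY"),
}


def resolve_judge(model_id, registry=JUDGE_REGISTRY, env=os.environ):
    """Return (provider, api_key) for a registered judge, or raise a clear error."""
    if model_id not in registry:
        known = ", ".join(sorted(registry)) or "(none)"
        raise KeyError(f"unknown judge {model_id!r}; registered: {known}")
    provider, key_name = registry[model_id]
    api_key = env.get(key_name)
    if not api_key:
        raise RuntimeError(f"{key_name} is not set; add it to .env")
    return provider, api_key


# A fake environment is passed here so the example does not need a real key.
print(resolve_judge("your-model-id", env={"OPENAI_API_KEY": "sk-demo"}))
```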

Configuration (`config.json`)

Controls which failure modes are enabled and their scopes. Remove a mode to exclude it from all experiments. Change a scope value to override the default.

{
  "failure_modes": {
    "subjective":          "criterion",
    "non_atomic":          "criterion",
    "ungrounded":          "criterion",
    "misaligned_or_rigid": "criterion",
    "missing_criteria":    "rubric",
    "hackable":            "criterion",
    "low_signal":          "rubric",
    "redundant_criteria":  "rubric"
  }
}
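
To see which modes a given configuration assigns to each scope, the file can be loaded and grouped as in the sketch below. The helper `modes_by_scope` is hypothetical, and treating "criterion" and "rubric" as the only valid scope values is an assumption based on the example above.

```python
import json

# Same content as the config.json example above, inlined to stay self-contained.
CONFIG_TEXT = """
{
  "failure_modes": {
    "subjective":          "criterion",
    "non_atomic":          "criterion",
    "ungrounded":          "criterion",
    "misaligned_or_rigid": "criterion",
    "missing_criteria":    "rubric",
    "hackable":            "criterion",
    "low_signal":          "rubric",
    "redundant_criteria":  "rubric"
  }
}
"""

VALID_SCOPES = ("criterion", "rubric")  # ASSUMPTION: the only two scopes


def modes_by_scope(config):
    """Group enabled failure modes by scope, rejecting unknown scope values."""
    grouped = {scope: [] for scope in VALID_SCOPES}
    for mode, scope in config["failure_modes"].items():
        if scope not in grouped:
            raise ValueError(f"{mode}: unknown scope {scope!r}")
        grouped[scope].append(mode)
    return grouped


grouped = modes_by_scope(json.loads(CONFIG_TEXT))
for scope, modes in grouped.items():
    print(f"{scope}: {', '.join(modes)}")
```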

Reference

@article{qi2026rift,
  title={RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics},
  author={Qi, Zhengyang and Dickens, Charles and Pham, Derek and Dsouza, Amanda and Parchami, Armin and Sala, Frederic and Varma, Paroma},
  journal={arXiv preprint arXiv:2604.01375},
  year={2026}
}
