prism-benchmark/prism: Code for the PRISM benchmark: https://github.com/prism-benchmark/prism-benchmark.github.io

# TLDR — reproduce all paper results in 4 commands:
pip install -r requirements.txt          # 1. install deps
cp .env.example .env                     # 2. set GOOGLE_API_KEY in .env
python run.py --setup-data               # 3. download dataset (~4 GB)
python run.py                            # 4. run all evaluations

See CONFIG_GUIDE.md for detailed setup, custom providers, and per-aspect configuration options.

1. Setup

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

Edit .env — set your API keys (at minimum GOOGLE_API_KEY for the Gemini evaluator):

GOOGLE_API_KEY=AIza...
MIMO_API_KEY=gsk_...            # optional, for robustness checks
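
Before starting a long run, it can help to confirm that the required key is actually present in .env. The snippet below is a minimal sketch and not part of the repository: `read_env_file` and `missing_keys` are hypothetical helper names, and the parser assumes plain `KEY=value` lines with optional `#` comments, as in the example above.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines; blank lines and # comments are ignored."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "# optional, ..."
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def missing_keys(values, required=("GOOGLE_API_KEY",)):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    # Hypothetical pre-flight check; run it from the repository root.
    print("missing keys:", missing_keys(read_env_file()) or "none")
```

Only `GOOGLE_API_KEY` is treated as required here, since the other keys are described as optional.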

2. Download the data

Downloads papers, human reviews, and pre-generated LLM reviews (~4 GB):

python run.py --setup-data

This populates Data/Final_LLM_Reviewer_Data/ with the full benchmark corpus: 1,000 papers (200 each from ICLR 2024, ICLR 2025, ICLR 2026, ICML 2025, and NeurIPS 2025), with human reviews and the outputs of 5 LLM reviewers.

No GPU needed — LLM reviews are pre-generated. You only need API access for the judge model.
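
A quick way to check that the download completed is to count the files that landed in the data directory. This is a rough sketch only: the internal layout of Data/Final_LLM_Reviewer_Data/ is not documented here, so the hypothetical helper `count_files_by_subfolder` simply groups files by their top-level entry without assuming any particular folder names.

```python
from collections import Counter
from pathlib import Path


def count_files_by_subfolder(root="Data/Final_LLM_Reviewer_Data"):
    """Count files under each top-level entry of root (layout is not assumed)."""
    root_path = Path(root)
    counts = Counter()
    if not root_path.is_dir():
        return counts  # nothing downloaded yet, or wrong working directory
    for path in root_path.rglob("*"):
        if path.is_file():
            # First path component below root: a subfolder name, or the
            # file name itself for files stored directly in root.
            counts[path.relative_to(root_path).parts[0]] += 1
    return counts


if __name__ == "__main__":
    for name, n in sorted(count_files_by_subfolder().items()):
        print(f"{name}: {n} files")
```

An empty result usually means the script was not run from the repository root or that `python run.py --setup-data` has not finished.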

3. Run the paper experiment

# Run all 4 evaluation aspects across all 5 conferences (default)
python run.py

This evaluates Depth of Analysis, Novelty Assessment, Flaw Identification & Prioritization, and Constructiveness for all reviewer types (Human, SEA, Reviewer2, TreeReview, DeepReview, CycleReview) across all conferences.

Selective runs (faster iteration)

python run.py --list                                # see available aspects/reviewers
python run.py --profile aspects                     # evaluation only
python run.py --only depth_of_analysis              # single aspect
python run.py --conference iclr2024                 # single venue
python run.py --limit 10                            # first N papers (quick test)
python run.py --dry-run                             # preview without executing
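
For scripted smoke tests, the flags above can be assembled programmatically. The sketch below uses only flags listed in this section; `build_command` is a hypothetical helper, and whether run.py accepts these flags in combination is an assumption, so try `--dry-run` first.

```python
import shlex
import sys


def build_command(only=None, conference=None, limit=None, dry_run=False):
    """Build an argv list for run.py from the flags documented above."""
    cmd = [sys.executable, "run.py"]
    if only is not None:
        cmd += ["--only", only]
    if conference is not None:
        cmd += ["--conference", conference]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    # Assumption: flags can be combined; this only prints the command.
    cmd = build_command(only="depth_of_analysis", conference="iclr2024",
                        limit=10, dry_run=True)
    print(shlex.join(cmd))
    # To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues.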

Run with a different judge model

Edit llm_config.yaml to switch any aspect to a different provider:

aspects:
  constructiveness:
    provider: openai          # was: mimo
    model: gpt-4o

Or use any OpenAI-compatible API:

providers:
  together:
    api_key: ${TOGETHER_API_KEY}
    base_url: https://api.together.xyz/v1

aspects:
  depth_of_analysis:
    provider: together
    model: meta-llama/Llama-3.3-70B-Instruct-Turbo

# .env
TOGETHER_API_KEY=tsk_...
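
The `${TOGETHER_API_KEY}` placeholder in the YAML refers to the variable defined in .env. How llm_client.py resolves it is not shown here; the following is a minimal sketch of one plausible approach, assuming simple `${NAME}` substitution from the environment. The name `expand_placeholders` is hypothetical.

```python
import os
import re

# Matches ${NAME} where NAME is a typical environment variable identifier.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value, env=None):
    """Replace each ${NAME} in value with env[NAME]; fail loudly if unset."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(replace, value)


if __name__ == "__main__":
    demo_env = {"TOGETHER_API_KEY": "example-key"}  # stand-in value
    print(expand_placeholders("${TOGETHER_API_KEY}", demo_env))
```

Raising on a missing variable, rather than substituting an empty string, surfaces a forgotten .env entry before any API call is made.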

4. Results

Each pipeline writes structured results to its output/ directory:

Aspect	Output location	Key metrics
Depth of Analysis	`Aspects_benchmarking/depth_of_analysis/output/metrics/`	DoA, R_premise, S_depth
Constructiveness	`Aspects_benchmarking/constructiveness/output/`	MCS, D1–D5
Flaw Identification & Prioritization	`Aspects_benchmarking/flaw_identification/output_cfi_*/`	Critical/Minor Recall, nCPS
Novelty Assessment	`Aspects_benchmarking/novelty_vefification/output/`	NS, SR, SSR

Aggregated summary CSVs correspond to Table 1 in the paper.
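
To locate result files after a run, the output locations from the table can be scanned for CSVs. This is a sketch under stated assumptions: the exact file names and column layouts are not documented here, so the hypothetical helper `find_result_csvs` only lists whatever `.csv` files exist under those directories.

```python
from pathlib import Path

# Output locations copied from the table above (trailing slashes dropped).
OUTPUT_PATTERNS = [
    "Aspects_benchmarking/depth_of_analysis/output/metrics",
    "Aspects_benchmarking/constructiveness/output",
    "Aspects_benchmarking/flaw_identification/output_cfi_*",
    "Aspects_benchmarking/novelty_vefification/output",
]


def find_result_csvs(repo_root="."):
    """Return all CSV files found under the documented output directories."""
    root = Path(repo_root)
    found = []
    for pattern in OUTPUT_PATTERNS:
        for out_dir in sorted(root.glob(pattern)):
            if out_dir.is_dir():
                found.extend(sorted(out_dir.rglob("*.csv")))
    return found


if __name__ == "__main__":
    for csv_path in find_result_csvs():
        print(csv_path)
```

The files can then be opened with the standard `csv` module or any spreadsheet tool to compare against Table 1.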

What gets evaluated

Venue	Papers	Decisions sampled
ICLR 2024	200	Oral, Spotlight, Poster, Reject
ICLR 2025	200	Oral, Spotlight, Poster, Reject
ICLR 2026	200	Oral, Poster, Reject
ICML 2025	200	Oral, Spotlight, Poster, Reject
NeurIPS 2025	200	Oral, Spotlight, Poster, Reject

Reviewer system	Type
Human	Expert peer reviewers
SEA	Supervised fine-tuning
Reviewer2	Prompting-based
TreeReview	Prompting-based
DeepReview	Supervised fine-tuning
CycleReview	Supervised fine-tuning

Repository layout

PRISM/
├── run.py                       ← unified orchestrator (this is all you need)
├── llm_config.yaml              ← model/provider configuration
├── llm_client.py                ← unified LLM client
├── .env                         ← API keys & paths (gitignored)
├── Data/                        ← download scripts
├── Aspects_benchmarking/        ← evaluation pipelines (RQ1–RQ4)
│   ├── depth_of_analysis/
│   ├── novelty_vefification/
│   ├── flaw_identification/
│   └── constructiveness/
└── LLM_reviewer/                ← reviewer generation baselines (not needed for eval)

Dependencies

The evaluation stack is lightweight:

google-generativeai  openai  httpx  numpy  scipy  matplotlib  python-dotenv

Heavier dependencies (vLLM, transformers) are only needed if re-generating LLM reviews — skip unless you're rerunning the generation baselines.
