SIREN — Selection-Aware Reporting for Adaptive LLM Evaluation

SIREN at a glance

SIREN pipeline diagram: figure/pipeline.png

SIREN replaces the naive in-sample winner with a procedure-level estimator that uses soft selection over a top-K shortlist and repeated-split held-out scoring, then quantifies uncertainty via a Gaussian multiplier bootstrap on the plug-in influence function. The bootstrap reweights benchmark items rather than splits or artifacts, yielding pointwise CIs, simultaneous bands, and fixed-comparison statements at the cost of a single length-M dot product per resample.
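
The uncertainty step can be pictured with a minimal sketch (illustrative only; the function name, theta_hat, and psi are placeholders, not the repo's API): given the length-M vector of estimated influence-function values, each resample multiplies it by fresh i.i.d. N(0,1) weights, so one resample really is a single dot product.

import numpy as np

def multiplier_bootstrap_ci(theta_hat, psi, n_boot=2000, alpha=0.05, seed=0):
    """Gaussian multiplier bootstrap around a plug-in point estimate.

    theta_hat : float, point estimate
    psi       : (M,) array, estimated influence-function value per benchmark item
    """
    rng = np.random.default_rng(seed)
    M = psi.shape[0]
    xi = rng.standard_normal((n_boot, M))     # i.i.d. N(0,1) multipliers per item
    perturbations = xi @ psi / M              # one length-M dot product per resample
    lo, hi = np.quantile(perturbations, [alpha / 2, 1 - alpha / 2])
    return theta_hat + lo, theta_hat + hi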


Headline results

Setup: 2 subjects (MMLU-Pro Math, Law) × 2 tuners (random search, DSPy) × 11 open-weight LLMs × 4 token budgets = 176 cells. Convention: ties (estimate = θ_B*) count as wrong-direction.

Directional accuracy (right-direction calls / total cells)

Method             Math RS   Math DSPy   Law RS   Law DSPy   Total
M1 (naive max)     37/44     33/44       15/44    40/44      125/176 (71%)
M3 (single split)  35/44     31/44       20/44    35/44      121/176 (69%)
M4 (R-split t)     29/44     39/44       27/44    40/44      135/176 (77%)
SIREN              42/44     43/44       42/44    40/44      167/176 (95%)

Mean signed bias (percentage points, θ̂_A − θ_A*)

Method   Math RS   Math DSPy   Law RS   Law DSPy
M1       +1.37     +1.43       +1.90    +1.01
SIREN    +0.04     −0.05       +0.08    −0.20

M1's bias is positive in 176/176 cells — the empirical signature of selection bias rather than sampling noise.
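
A toy simulation (illustrative values only, and a simplified analogue with a fixed population accuracy in place of the paper's procedure-level θ*) shows why the in-sample winner is optimistic even though each individual score is unbiased for its own prompt:

import numpy as np

rng = np.random.default_rng(0)
H, M, n_trials = 20, 500, 2000                 # prompts, benchmark items, repetitions
true_acc = rng.uniform(0.55, 0.65, H)          # toy per-prompt true accuracies

gaps = []
for _ in range(n_trials):
    observed = rng.binomial(M, true_acc) / M   # unbiased but noisy in-sample scores
    winner = observed.argmax()                 # naive M1-style selection
    gaps.append(observed[winner] - true_acc[winner])

print(f"mean optimism of the in-sample winner: {np.mean(gaps) * 100:+.2f} pp")

Selecting the maximum of noisy scores systematically picks prompts whose noise happened to be positive, which is exactly the pattern the 176/176 positive-bias count reflects.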

PromptEval comparison (signed bias range across all 4 cells, by f)

Observation fraction f    Bias range (pp)
f = 0.05 (sparse)         −1.67 to −12.72
f = 1.00 (= M1, dense)    +0.77 to +5.19
SIREN                     −0.21 to +0.11

PromptEval exhibits a U-shaped bias trade-off across f: sparse PromptEval shrinks toward the population mean (severe underestimation), dense PromptEval reduces to the M1 winner (re-introducing selection bias). SIREN remains calibrated at every budget across all four cells.


Repository structure

.
├── README.md
├── figure/
│   └── pipeline.png                     ← architecture diagram
│
├── 01_appendix_tex/                     ← drop into your paper
│   ├── appendix_full.tex                  primary deliverable
│   ├── siren_per_model_all4_appendix.tex
│   ├── prompteval_per_model_all4_appendix.tex
│   ├── directional_summary_tables.tex
│   └── prompteval_per_budget_wraptables.tex
│
├── 02_data_csv/                         ← processed analysis data
│   ├── math_overlap_results.csv         (88 rows: 11 models × 4 budgets × 2 tuners)
│   ├── math_full_results.csv
│   └── law_dspy_full_results.csv
│
├── 03_analysis_scripts/                 ← reproducibility
│   ├── build_tensor.py
│   ├── compare_models.py
│   ├── compute_ground_truth.py
│   ├── exp1_coverage.py                 Theory Validation — Study A
│   ├── exp2_near_tie.py                 Theory Validation — Study B
│   ├── exp3_optimism.py                 Theory Validation — Study C
│   ├── study_e_dspy_randsearch_law.py
│   ├── study_e_miprov2_law.py
│   ├── study_e_multimodel_api.py
│   ├── study_e_multimodel_law.py
│   ├── study_h_prompteval_law.py
│   └── study_h_prompteval_math.py
│
└── 04_raw_results/                      ← raw experiment outputs
    ├── law_data_tar.gz                  Law: all 11 models × 2 tuners
    ├── original_math_results_tar.gz     Math RS: 10 of 11 models
    ├── qwen3_math_rs_overnight.tar.gz   Math RS: Qwen3-8B (separate run)
    ├── dspy_rs_math_results_tar.gz      Math DSPy: all 11 models
    ├── dspy_rs_results_tar.gz           older variant, kept for traceability
    ├── study_h_prompteval_math_results.json
    └── study_h_prompteval_law_results.json

Quick start for reviewers

To inspect the appendix LaTeX

Open 01_appendix_tex/appendix_full.tex directly. It contains the complete \section{Additional experiments} block with all 11 appendix tables and the Theory Validation (Studies A–C) prose, verbatim from the paper draft.

To verify a number from a table

All directional and bias counts trace back to the sources below (a minimal inspection snippet follows the list):

  • Math → 02_data_csv/math_overlap_results.csv
  • Law → JSONs inside 04_raw_results/law_data_tar.gz
  • PromptEval (both subjects) → the two study_h_prompteval_*.json files
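
A minimal inspection snippet (paths taken from the repository tree above; it makes no assumptions about internal column or field names):

import json
import pandas as pd

df = pd.read_csv("02_data_csv/math_overlap_results.csv")
print(df.shape)                  # expected: 88 rows (11 models × 4 budgets × 2 tuners)
print(list(df.columns))

with open("04_raw_results/study_h_prompteval_math_results.json") as f:
    prompteval = json.load(f)
print(type(prompteval))          # then drill into whatever structure it reports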

To regenerate from raw data

# Extract Law results
mkdir -p law_work
tar xzf 04_raw_results/law_data_tar.gz -C law_work

# Re-run PromptEval analysis (~30 sec on a CPU)
cd law_work
python ../03_analysis_scripts/study_h_prompteval_law.py

# Compare against shipped JSON
diff <(python -c "import json; print(json.dumps(json.load(open('study_h_prompteval_law_results.json')), sort_keys=True))") \
     <(python -c "import json; print(json.dumps(json.load(open('../04_raw_results/study_h_prompteval_law_results.json')), sort_keys=True))")

The Math PromptEval JSON does not require regeneration — it is shipped directly in 04_raw_results/study_h_prompteval_math_results.json.


Scripts at a glance

Theory Validation (Studies A–C): controlled simulations

Script            Validates
exp1_coverage.py  Multiplier-bootstrap coverage (Theorems 1 and 2)
exp2_near_tie.py  Hard-argmax near-tie undercoverage (Proposition 3)
exp3_optimism.py  √(2 log H) same-data optimism (Proposition 4)
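
The √(2 log H) rate checked by exp3_optimism.py can be previewed with a toy calculation (standard-normal noise; this is an illustration of the rate, not the experiment script): the expected maximum of H independent standard normals approaches √(2 log H) as H grows, which is the same-data optimism incurred by taking an argmax over H candidates.

import numpy as np

rng = np.random.default_rng(0)
for H in (2, 8, 32, 128, 512):
    emax = rng.standard_normal((20_000, H)).max(axis=1).mean()
    print(f"H={H:4d}   E[max] ≈ {emax:.3f}   sqrt(2·log H) = {np.sqrt(2 * np.log(H)):.3f}")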

Real-data experiments (Studies E and H on MMLU-Pro)

Script                          Generates
study_e_multimodel_law.py       Law random-search results
study_e_dspy_randsearch_law.py  Law DSPy results
study_e_miprov2_law.py          MIPROv2 variant (not used in the final paper)
study_e_multimodel_api.py       API-served LLM variant
study_h_prompteval_law.py       PromptEval baseline on Law
study_h_prompteval_math.py      PromptEval baseline on Math

Analysis utilities

Script                   Purpose
build_tensor.py          Constructs (selection_fold, eval_fold) tensors
compute_ground_truth.py  θ* via 10K Monte Carlo redraws of the split design
compare_models.py        Aggregates per-model results across budgets
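
Conceptually, the procedure-level target θ* estimated by compute_ground_truth.py is the held-out score of whatever prompt the selection procedure would pick, averaged over redraws of the split. A stripped-down sketch (argmax selection, 50/50 splits, and the function name are illustrative assumptions, not the script's options):

import numpy as np

def theta_star_monte_carlo(scores, n_redraws=10_000, select_frac=0.5, seed=0):
    """scores: (H, M) 0/1 matrix of per-prompt correctness on each benchmark item."""
    rng = np.random.default_rng(seed)
    H, M = scores.shape
    n_sel = int(select_frac * M)
    held_out = np.empty(n_redraws)
    for r in range(n_redraws):
        perm = rng.permutation(M)
        sel, ev = perm[:n_sel], perm[n_sel:]
        winner = scores[:, sel].mean(axis=1).argmax()   # selection on one fold
        held_out[r] = scores[winner, ev].mean()         # evaluation on the other
    return held_out.mean()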

How to run

Reproduce a single LLM result (e.g., Llama-3.1-8B on MMLU-Pro Law, DSPy)

# 1. Start a vLLM server for the target model
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.85 \
    --max-model-len 2048 --max-num-seqs 8 --port 8000

# 2. Run Study E (DSPy variant) for that model
python 03_analysis_scripts/study_e_dspy_randsearch_law.py \
    --model meta-llama/Llama-3.1-8B-Instruct --port 8000 \
    --results-dir results/llama_law

# 3. Compute the procedure-level ground truth via Monte Carlo
python 03_analysis_scripts/compute_ground_truth.py \
    --results-dir results/llama_law

# 4. Aggregate into a per-model summary
python 03_analysis_scripts/compare_models.py --results-dir results/llama_law

Each model run takes ~2–6 hours on a single RTX 4090. Full reproduction across all 11 models × 4 budgets × 2 subjects × 2 tuners requires ~40 GPU-days.

Reproduce the PromptEval baseline (no GPU needed)

# Extract the raw Law results and run PromptEval analysis (~30 sec on CPU)
mkdir -p law_work && tar xzf 04_raw_results/law_data_tar.gz -C law_work
cd law_work
python ../03_analysis_scripts/study_h_prompteval_law.py

Reproduce Theory Validation simulations (Studies A–C, no GPU needed)

python 03_analysis_scripts/exp1_coverage.py    # Study A: ~10 min
python 03_analysis_scripts/exp2_near_tie.py    # Study B: ~15 min
python 03_analysis_scripts/exp3_optimism.py    # Study C: ~5 min

Drop the appendix into a paper

Open 01_appendix_tex/appendix_full.tex and copy its contents into your paper's appendix. Required preamble packages: xcolor, multirow, booktabs, graphicx, placeins, wrapfig, pifont. The file uses \cmark and \xmark macros from pifont.


Method ↔ table-column mapping (legacy convention)

The appendix tables use a column convention inherited from the early Law tables, where JSON method-key ≠ column header:

Column header   Underlying JSON key   Description
M1/M2           m1                    Naive max (M2 = M1 + Wald CI; identical point estimate)
M3              m4                    Single-split holdout
M4              m5                    R-split argmax with Student-t CI
SIREN           m7                    Full influence function + multiplier bootstrap

(This convention is consistent across all 11 appendix tables.)
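
For programmatic lookups, the mapping can be written as a small dictionary. The loader below is hypothetical and assumes the per-budget layout described in the Qwen3 data note that follows (top-level budget keys, SIREN under m7, baselines under comparison):

import json

COLUMN_TO_JSON_KEY = {
    "M1/M2": "m1",   # naive max
    "M3":    "m4",   # single-split holdout
    "M4":    "m5",   # R-split argmax with Student-t CI
    "SIREN": "m7",   # influence function + multiplier bootstrap
}

def load_cell(path, budget, column):
    """Read one method's entry for one budget from a per-model results JSON."""
    with open(path) as f:
        per_budget = json.load(f)[str(budget)]
    key = COLUMN_TO_JSON_KEY[column]
    return per_budget[key] if key == "m7" else per_budget["comparison"][key]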


Data note: Qwen3-8B Math RS

Qwen3-8B Math RS data lives in qwen3_math_rs_overnight.tar.gz, not in original_math_results_tar.gz. Qwen3 was run via a separate overnight script and stored under a non-standard path on the original server. The JSON structure is equivalent up to one minor field-name difference (m5 in place of m6 for the t-CI ablation; values are identical).

JSON structure verified:

  • Top-level keys: [500000, 1500000, 3000000, 6500000]
  • Per-budget: m7 (SIREN), comparison.{m1..m5}, top_k_scores (length 10)
  • top_k_scores=10 confirms K=10 Math RS convention

Excluded from this package (intentional)

To keep the package submission-focused, the following were excluded:

  • Gemini 2.5 Flash debugging artifacts — multi-day dead-end, no usable results.
  • GSM8K early-stage experiments — superseded by MMLU-Pro.
  • Older Math LaTeX files (table4_math_500k.tex, math_section_latex.tex, etc.) — superseded by appendix_full.tex.
  • Intermediate appendix drafts — superseded by the final appendix_full.tex.

License

This code is released under the MIT License. See LICENSE for details.

If you use SIREN in your research, please cite:

@article{xu2026siren,
  title   = {Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking},
  author  = {Xu, Yang and Zhang, Jiefu and Sun, Haixiang and Zhou, Zihan and Cao, Tianyu and Aggarwal, Vaneet},
  journal = {arXiv preprint arXiv:<ARXIV_ID>},
  year    = {2026}
}

About

Selection-aware inference for reliable LLM evaluation under prompt search and adaptive budgets.
