Add variance-stratified benchmark pruning extension by Shashank-mankala1 · Pull Request #1391 · modelscope/evalscope

Shashank-mankala1 · 2026-06-02T02:18:52Z

Add variance-stratified benchmark pruning extension

What this does

Implements Variance-Weighted Stratified Sampling — an IRT-inspired pruning strategy that selects the minimal sample set preserving the model ranking signal, implemented as a native evalscope extension.

Approach

Compute per-item difficulty (mean pass rate across models) and discrimination (score variance across models) from historical review data
Stratify items into 4 difficulty buckets: hard / medium-hard / medium-easy / easy
Within each stratum, select items with highest variance (most discriminative)
Include calibration anchors (extreme-difficulty items) for sanity checking
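
The first three steps above can be sketched as follows. This is a minimal illustration, not the PR's `VarianceStratifiedPruner`: the score matrix is made up, the equal-width bucket edges on mean pass rate are an assumption (the description does not state them), and `bucket_of` and `select_top_variance` are hypothetical names.

```python
from statistics import mean, pvariance
from typing import Dict, List

# Hypothetical score matrix: item index -> one score per model.
scores: Dict[int, List[float]] = {
    0: [0.0, 0.0, 0.0],  # nobody passes: hard, no spread
    1: [0.0, 0.0, 1.0],  # mean 1/3, models disagree
    2: [0.0, 1.0, 1.0],  # mean 2/3, models disagree
    3: [1.0, 1.0, 1.0],  # everybody passes: easy, no spread
    4: [0.3, 0.4, 0.3],  # mean 1/3, models nearly agree
    5: [0.6, 0.7, 0.7],  # mean 2/3, models nearly agree
}

# Assumed equal-width buckets on mean pass rate (low pass rate = hard).
BUCKET_NAMES = ["hard", "medium-hard", "medium-easy", "easy"]


def bucket_of(difficulty: float) -> str:
    """Map a mean pass rate in [0, 1] to one of the four strata."""
    return BUCKET_NAMES[min(int(difficulty * 4), 3)]


def select_top_variance(scores: Dict[int, List[float]], per_bucket: int) -> List[int]:
    """Keep the `per_bucket` highest-variance items in each difficulty stratum."""
    strata: Dict[str, list] = {name: [] for name in BUCKET_NAMES}
    for idx, item_scores in scores.items():
        difficulty = mean(item_scores)
        variance = pvariance(item_scores)  # population variance across models
        strata[bucket_of(difficulty)].append((variance, idx))
    chosen: List[int] = []
    for items in strata.values():
        # Highest variance first; ties broken by index so the result is deterministic.
        items.sort(key=lambda pair: (-pair[0], pair[1]))
        chosen.extend(idx for _, idx in items[:per_bucket])
    return sorted(chosen)


print(select_top_variance(scores, per_bucket=1))  # [0, 1, 2, 3]
```

With one item per stratum, items 1 and 2 are preferred over items 4 and 5 because the models disagree on them more.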

Results

| Benchmark | Full | Pruned | Reduction | Rank Preserved | Max Score Δ |
| -- | -- | -- | -- | -- | -- |
| LiveCodeBench v5 | 315 | 189 | 40% | Yes (Kendall τ = 1.0) | 0.030 |
| AA-LCR | 100 | 50 | 50% | Yes (Kendall τ = 1.0) | 0.020 |

Model ranking is perfectly preserved at these ratios. The pruned set's average item variance is 1.8× higher than the full set, confirming the most informative samples are selected.
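
For readers who want to reproduce this kind of check, here is a small sketch of how rank preservation and the score delta could be computed. It assumes "Max Score Δ" means the largest absolute per-model difference between full and pruned scores, which the description does not spell out. The model names and scores are hypothetical, not the PR's data, and `kendall_tau` and `max_score_delta` are illustrative helpers.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(full: Dict[str, float], pruned: Dict[str, float]) -> float:
    """Kendall tau-a over models present in both dicts (no tie correction)."""
    models = sorted(set(full) & set(pruned))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both runs order this pair of models the same way.
        product = (full[m] - full[n]) * (pruned[m] - pruned[n])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


def max_score_delta(full: Dict[str, float], pruned: Dict[str, float]) -> float:
    """Largest absolute per-model score difference (assumed meaning of Max Score Δ)."""
    return max(abs(full[m] - pruned[m]) for m in full if m in pruned)


# Hypothetical aggregate scores for three models.
full_scores = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.41}
pruned_scores = {"model_a": 0.65, "model_b": 0.54, "model_c": 0.40}

print(kendall_tau(full_scores, pruned_scores))                # 1.0
print(round(max_score_delta(full_scores, pruned_scores), 3))  # 0.03
```

Note that with three models there are only three model pairs, and one in six random orderings would also give τ = 1.0, so a perfect τ carries less weight here than it would with a larger model pool.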

Why it generalises to unseen models

Selection is based on structural item properties (variance, difficulty) — not on which specific model gets an item right. A 4th model evaluated on this subset will still be correctly ranked.

Files added

evalscope/pruning/item_stats.py — per-item statistics from review JSONL files
evalscope/pruning/strategy.py — VarianceStratifiedPruner implementation
evalscope/pruning/adapter.py — evalscope DataAdapter integration mixin
tools/compare_runs.py — CLI to compare full vs pruned run results

Usage

python -m tools.compare_runs \
    --review-dir "./Evals/Part 1/reviews" \
    --benchmark live_code_bench_v5 \
    --score-key pass \
    --prune-ratio 0.6

Developed against evalscope commit de7b0b3f08c617f48a00ef09f7169dc74212a6d9

gemini-code-assist

Code Review

This pull request introduces a benchmark compression framework to evalscope using Variance-Weighted Stratified Sampling, which includes modules for computing item statistics, implementing pruning strategies, and integrating with the dataset loading pipeline, alongside documentation and comparison tools. The review feedback highlights several critical robustness improvements: resolving non-determinism in the pruning selection logic by avoiding slicing on unordered sets, and defensively handling missing keys, unexpected types, and index type mismatches (string vs. integer) during JSON parsing and filtering to prevent runtime crashes and silent mismatches.

gemini-code-assist · 2026-06-02T02:20:31Z

+        selected: Set[int] = set()
+
+        # Step 1: Include calibration anchors (highest and lowest difficulty items)
+        all_items_sorted = sorted(item_stats.values(), key=lambda x: x.difficulty)
+        for item in all_items_sorted[:self.calibration_anchors]:
+            selected.add(item.index)
+        for item in all_items_sorted[-self.calibration_anchors:]:
+            selected.add(item.index)
+
+        # Step 2: Ensure minimum per stratum
+        for stratum in self.strata:
+            items_in_stratum = stratified[stratum.name]
+            for item in items_in_stratum[:self.min_samples_per_stratum]:
+                selected.add(item.index)
+
+        # Step 3: Allocate remaining budget by stratum weight
+        remaining = target - len(selected)
+        if remaining <= 0:
+            return sorted(list(selected)[:target])
+
+        total_weight = sum(s.allocation_weight for s in self.strata)
+        for stratum in self.strata:
+            allocation = max(0, int(remaining * stratum.allocation_weight / total_weight))
+            items_in_stratum = stratified[stratum.name]
+
+            added = 0
+            for item in items_in_stratum:
+                if item.index in selected:
+                    continue
+                selected.add(item.index)
+                added += 1
+                if added >= allocation:
+                    break
+
+        # Step 4: Fill any remaining slots with highest-variance items globally
+        if len(selected) < target:
+            all_by_variance = sorted(
+                item_stats.values(), key=lambda x: x.variance, reverse=True
+            )
+            for item in all_by_variance:
+                if item.index in selected:
+                    continue
+                selected.add(item.index)
+                if len(selected) >= target:
+                    break
+
+        return sorted(selected)


Slicing an unordered set (e.g., list(selected)[:target]) is non-deterministic and can lead to arbitrary item selection when the number of selected items exceeds the target size. Using a dict as an ordered set (available in Python 3.7+) ensures that insertion order is preserved, meaning calibration anchors and stratum minimums are prioritized correctly during truncation.

        selected: Dict[int, None] = {}

        # Step 1: Include calibration anchors (highest and lowest difficulty items)
        all_items_sorted = sorted(item_stats.values(), key=lambda x: x.difficulty)
        for item in all_items_sorted[:self.calibration_anchors]:
            selected[item.index] = None
        for item in all_items_sorted[-self.calibration_anchors:]:
            selected[item.index] = None

        # Step 2: Ensure minimum per stratum
        for stratum in self.strata:
            items_in_stratum = stratified[stratum.name]
            for item in items_in_stratum[:self.min_samples_per_stratum]:
                selected[item.index] = None

        # Step 3: Allocate remaining budget by stratum weight
        remaining = target - len(selected)
        if remaining <= 0:
            return sorted(list(selected.keys())[:target])

        total_weight = sum(s.allocation_weight for s in self.strata)
        for stratum in self.strata:
            allocation = max(0, int(remaining * stratum.allocation_weight / total_weight))
            items_in_stratum = stratified[stratum.name]

            added = 0
            for item in items_in_stratum:
                if item.index in selected:
                    continue
                selected[item.index] = None
                added += 1
                if added >= allocation:
                    break

        # Step 4: Fill any remaining slots with highest-variance items globally
        if len(selected) < target:
            all_by_variance = sorted(
                item_stats.values(), key=lambda x: x.variance, reverse=True
            )
            for item in all_by_variance:
                if item.index in selected:
                    continue
                selected[item.index] = None
                if len(selected) >= target:
                    break

        return sorted(list(selected.keys()))
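
To make the effect of the ordered-dict approach concrete, here is a small standalone sketch. The helper `truncate_in_priority_order` and the index values are hypothetical; the only thing relied on is that Python dicts preserve insertion order (guaranteed since 3.7).

```python
from typing import Dict, Iterable, List


def truncate_in_priority_order(groups: Iterable[Iterable[int]], target: int) -> List[int]:
    """Insert indices group by group, then keep the first `target` inserted."""
    selected: Dict[int, None] = {}
    for group in groups:
        for idx in group:
            # setdefault keeps the original position of an index seen earlier.
            selected.setdefault(idx, None)
    return sorted(list(selected)[:target])


anchors = [97, 3]              # hypothetical calibration anchors, added first
stratum_minimums = [3, 41, 58]  # hypothetical stratum minimums, added second

# Insertion order is 97, 3, 41, 58, so truncating to 3 drops 58, never an anchor.
print(truncate_in_priority_order([anchors, stratum_minimums], target=3))  # [3, 41, 97]
```

With a plain `set`, which elements survive `list(selected)[:target]` depends on hash layout rather than on the order in which anchors and stratum minimums were added.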

gemini-code-assist · 2026-06-02T02:20:31Z

+        selected = self._compute_selected_indices(config)
+        if selected is None:
+            return samples
+
+        filtered = [s for s in samples if s.get('index') in selected]


To prevent silent filtering failures due to type mismatches (e.g., if indices are loaded as strings from historical reviews but are integers in the live dataset, or vice versa), normalize the index values before performing the membership check.

        selected = self._compute_selected_indices(config)
        if selected is None:
            return samples

        def _normalize(idx):
            if isinstance(idx, str) and idx.isdigit():
                return int(idx)
            return idx

        filtered = [s for s in samples if _normalize(s.get('index')) in selected]
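
One caveat with `str.isdigit()`: it returns False for negative numbers such as `"-3"`, and True for some characters that `int()` cannot parse (for example the superscript `"²"`). The sketch below is a possible variation that attempts the conversion and falls back to the raw value. It is an illustration only; `normalize_index` is a hypothetical name and the sample rows are made up.

```python
from typing import Any


def normalize_index(idx: Any) -> Any:
    """Return an int for int-like strings; leave every other value unchanged."""
    if isinstance(idx, str):
        try:
            return int(idx)
        except ValueError:
            return idx  # non-numeric string IDs are kept as-is
    return idx


# Hypothetical samples whose indices arrive with mixed types.
samples = [{"index": "12"}, {"index": 7}, {"index": "abc"}, {}]
selected = {7, 12}

filtered = [s for s in samples if normalize_index(s.get("index")) in selected]
print(filtered)  # [{'index': '12'}, {'index': 7}]
```

Whether string IDs should be coerced at all depends on how indices are written in the review files, which is a design decision for the PR rather than something this sketch settles.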

gemini-code-assist · 2026-06-02T02:20:31Z

+        with open(fpath, encoding='utf-8') as f:
+            for line in f:
+                if not line.strip():
+                    continue
+                row = json.loads(line)
+                idx = row['index']
+                if indices is not None and idx not in indices:
+                    continue
+                score = row['sample_score']['score']['value'][score_key]
+                model_scores[model_name].append(float(score))


Defensively handle missing keys and normalize index types to avoid silent mismatches when comparing model scores.

        with open(fpath, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                row = json.loads(line)
                idx = row.get('index')
                if idx is None:
                    continue
                if isinstance(idx, str) and idx.isdigit():
                    idx = int(idx)
                if indices is not None and idx not in indices:
                    continue
                sample_score = row.get('sample_score', {})
                score_obj = sample_score.get('score', {}) if isinstance(sample_score, dict) else {}
                score_val = score_obj.get('value') if isinstance(score_obj, dict) else None
                if isinstance(score_val, dict):
                    score = score_val.get(score_key)
                else:
                    score = score_val
                if score is None:
                    continue
                model_scores[model_name].append(float(score))

gemini-code-assist · 2026-06-02T02:20:31Z

+    score_matrix: Dict[int, List[float]] = {}
+    review_path = Path(review_dir)
+
+    for fpath in sorted(review_path.glob(f'{benchmark_prefix}__*.jsonl')):
+        with open(fpath, encoding='utf-8') as f:
+            for line in f:
+                if not line.strip():
+                    continue
+                row = json.loads(line)
+                idx = row['index']
+                score = row['sample_score']['score']['value'][score_key]
+                if idx not in score_matrix:
+                    score_matrix[idx] = []
+                score_matrix[idx].append(float(score))
+
+    items: Dict[int, ItemStats] = {}
+    for idx, scores in score_matrix.items():
+        n = len(scores)
+        mean = sum(scores) / n
+        variance = sum((s - mean) ** 2 for s in scores) / n
+        items[idx] = ItemStats(
+            index=idx,
+            difficulty=mean,
+            variance=variance,
+            n_models=n,
+            scores=scores,
+        )
+
+    return items


The JSON parsing is vulnerable to missing keys or unexpected types (e.g., if value is not a dictionary or if score_key is missing). Additionally, index type mismatches (string vs integer) can cause silent matching failures later. Normalizing the index to an integer (if numeric) and defensively accessing dictionary keys prevents runtime crashes and ensures robust behavior.

    score_matrix: Dict[int, List[float]] = {}
    review_path = Path(review_dir)

    for fpath in sorted(review_path.glob(f'{benchmark_prefix}__*.jsonl')):
        with open(fpath, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                row = json.loads(line)
                idx = row.get('index')
                if idx is None:
                    continue
                if isinstance(idx, str) and idx.isdigit():
                    idx = int(idx)
                sample_score = row.get('sample_score', {})
                score_obj = sample_score.get('score', {}) if isinstance(sample_score, dict) else {}
                score_val = score_obj.get('value') if isinstance(score_obj, dict) else None
                if isinstance(score_val, dict):
                    score = score_val.get(score_key)
                else:
                    score = score_val
                if score is None:
                    continue
                if idx not in score_matrix:
                    score_matrix[idx] = []
                score_matrix[idx].append(float(score))

    items: Dict[int, ItemStats] = {}
    for idx, scores in score_matrix.items():
        n = len(scores)
        mean = sum(scores) / n
        variance = sum((s - mean) ** 2 for s in scores) / n
        items[idx] = ItemStats(
            index=idx,
            difficulty=mean,
            variance=variance,
            n_models=n,
            scores=scores,
        )

    return items
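
Since the same defensive lookup appears in two of the suggestions, it could plausibly be factored into one helper. The sketch below assumes the `sample_score` → `score` → `value` nesting seen in the diff; `extract_score` is a hypothetical name, not part of evalscope, and the JSONL rows are made up. Unlike the suggestions, it also treats non-numeric score values as missing.

```python
import json
from typing import Any, Optional


def extract_score(row: dict, score_key: str) -> Optional[float]:
    """Walk row['sample_score']['score']['value'][score_key]; None if anything is missing."""
    node: Any = row
    for key in ("sample_score", "score", "value"):
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    # 'value' may be a dict keyed by metric name or a bare number.
    if isinstance(node, dict):
        node = node.get(score_key)
    try:
        return float(node)
    except (TypeError, ValueError):
        return None  # missing (None) or non-numeric score


# Hypothetical review lines covering the shapes the review comment worries about.
lines = [
    '{"index": 1, "sample_score": {"score": {"value": {"pass": 1}}}}',
    '{"index": 2, "sample_score": {"score": {"value": 0.5}}}',
    '{"index": 3}',
    '{"index": 4, "sample_score": null}',
]
print([extract_score(json.loads(line), "pass") for line in lines])
# [1.0, 0.5, None, None]
```

Callers would skip rows where the helper returns `None`, which keeps the parsing loops in both files short and identical in behavior.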

Yunnglin · 2026-06-02T04:16:58Z

Hi @Shashank-mankala1, thanks for the PR. The variance-stratified pruning concept is interesting — IRT-inspired item selection for benchmark compression is a valid research direction.

However, I have some questions and concerns before we can evaluate this for merge:

Questions

Intent clarification: The PR includes Handout_A.md and Handout_B.md which appear to be deliverables from a technical assessment (references to "customer conversation", "PM", "sales engineer", "Part A/Part B" structure). Was this developed as part of an external assignment that you are now contributing upstream, or is this intended as a standalone evalscope feature proposal? Either way is fine — just want to understand the context.

Structural issues

Broken imports: tools/compare_runs.py imports from evalscope_ext.pruning (lines 7-11), a package that does not exist in this repo. The code under evalscope/pruning/ uses relative imports correctly, but the CLI tool cannot run as-is.
Root-level files: Handout_A.md, Handout_B.md, and README_PRUNING.md should not live at the repository root. If this is meant as documentation, it should go under docs/ or within the module directory.
Root-level tools/ directory: evalscope does not have a top-level tools/ package. CLI utilities should either go under evalscope/cli/ or as a script within the module.
No integration with evalscope pipeline: The PruningMixin in adapter.py is defined but never wired into any existing adapter or exposed through TaskConfig/dataset_args. How would a user actually invoke this through the standard evalscope eval workflow?
No tests: The project requires new features to ship with at least a minimal runnable test.
No link to an issue or RFC: Features of this scope typically start with a discussion or issue to align on the design before implementation.

If you want to move forward

If this is a genuine contribution you'd like to land, I'd suggest:

Open an issue first describing the feature proposal and desired UX
Remove the handout/assessment files
Fix the tools/compare_runs.py import to use evalscope.pruning
Show how pruning integrates with the existing eval pipeline (e.g., via dataset_args or a new CLI flag)
Add at least one unit test for VarianceStratifiedPruner.select()

Happy to discuss the design direction if you'd like to proceed.

Shashank-mankala1 · 2026-06-02T11:35:44Z

Hi @Yunnglin,
Thanks for the review. Fixed in the latest push:

Intent: This was developed as part of a technical assessment. I'm contributing it upstream because I believe IRT-inspired benchmark compression is genuinely useful for evalscope users who need fast go/no-go model evaluation without running full suites.

Fixes:

Removed Handout_A.md, Handout_B.md, README_PRUNING.md from repo root
Removed tools/ directory
Added evalscope/cli/pruning_compare.py - CLI comparison tool with correct evalscope.pruning imports
Added tests/pruning/test_pruning.py - 11 unit tests for VarianceStratifiedPruner, all passing

Integration path: PruningMixin.filter_samples_by_pruning() in adapter.py accepts a sample list + config dict and returns the filtered subset. An existing adapter would call this from load_dataset():

class LiveCodeBenchPrunedAdapter(LiveCodeBenchAdapter, PruningMixin):
    def load_dataset(self):
        samples = super().load_dataset()
        config = self._get_pruning_config()
        return self.filter_samples_by_pruning(samples, config)

Users invoke via: --dataset-args '{"live_code_bench": {"pruning_strategy": "variance_stratified", "prune_ratio": 0.6, "review_dir": "./reviews"}}'

Happy to wire this into the actual LiveCodeBench adapter if you'd prefer a fully runnable integration PR. Will open a feature issue to align on the design.

Yunnglin · 2026-06-03T09:18:29Z

Thanks for the contribution and the clear writeup — the variance/difficulty stratified sampling idea is genuinely interesting and well-grounded in IRT-style benchmark compression work.

I don't think this is mergeable as it stands, mostly for a structural reason: evalscope routes new functionality through existing registries (@register_benchmark, @register_filter, …) and DataAdapter / mixin base classes rather than introducing parallel mechanisms. This PR adds a top-level evalscope/pruning/ package with its own local strategy registry, a PruningMixin that no DataAdapter actually inherits from, and a CLI entry point that isn't wired into evalscope/cli/cli.py's dispatcher — so nothing here is reachable from evalscope eval ... or from any existing benchmark today. There are also a few stray artifacts that need cleaning up (e.g. the from evalscope_ext.pruning ... fallback in the tests, the hardcoded ./Evals/Part 1/reviews path, references to a non-existent live_code_bench_pruned benchmark).

Conceptually, what you're doing is dataset-level sample selection, which sits at a different layer than Filter (per-sample output post-processing). The closest existing surface is the collection / sampler design under docs/en/advanced_guides/collection/ — that's where a "select an informative subset from historical reviews" capability would compose naturally.

Would you mind opening an issue first to sketch the integration design (registry hook, config schema, how it composes with dataset_args, validation methodology beyond 3 models)? Happy to discuss the design there before another round of code.

Thanks again.

Add variance-stratified benchmark pruning extension

e79d88e

gemini-code-assist Bot reviewed Jun 2, 2026

Fix PR review: move CLI to evalscope/cli/, add tests, remove root-level docs, fix imports

56e1f04

Shashank-mankala1 mentioned this pull request Jun 2, 2026

Feature proposal: Benchmark pruning via variance-weighted stratified sampling #1393

Open

Shashank-mankala1 wants to merge 2 commits into modelscope:main from Shashank-mankala1:benchmark-pruning
