Add variance-stratified benchmark pruning extension #1391

Open
Shashank-mankala1 wants to merge 2 commits into modelscope:main from Shashank-mankala1:benchmark-pruning

@Shashank-mankala1

Add variance-stratified benchmark pruning extension

What this does

Implements Variance-Weighted Stratified Sampling — an IRT-inspired pruning strategy that selects the minimal sample set preserving the model ranking signal, implemented as a native evalscope extension.

Approach

  1. Compute per-item difficulty (mean pass rate across models) and discrimination (score variance across models) from historical review data
  2. Stratify items into 4 difficulty buckets: hard / medium-hard / medium-easy / easy
  3. Within each stratum, select items with highest variance (most discriminative)
  4. Include calibration anchors (extreme-difficulty items) for sanity checking
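
The four steps above can be sketched roughly as follows. This is a minimal illustration and not the code in evalscope/pruning/: the bucket boundaries (0.25 / 0.5 / 0.75 on mean pass rate), the `per_stratum` and `anchors` parameters, and all function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Assumed bucket boundaries on mean pass rate (a low pass rate means a hard item).
STRATA = [("hard", 0.25), ("medium-hard", 0.5), ("medium-easy", 0.75), ("easy", float("inf"))]


def item_stats(scores: Dict[int, List[float]]) -> Dict[int, Tuple[float, float]]:
    """Step 1: map item index -> (difficulty, variance) across models."""
    stats = {}
    for idx, vals in scores.items():
        mean = sum(vals) / len(vals)
        variance = sum((v - mean) ** 2 for v in vals) / len(vals)  # population variance
        stats[idx] = (mean, variance)
    return stats


def stratum_of(difficulty: float) -> str:
    """Step 2: name of the first bucket whose upper bound exceeds the difficulty."""
    return next(name for name, upper in STRATA if difficulty < upper)


def select(scores: Dict[int, List[float]], per_stratum: int, anchors: int) -> List[int]:
    stats = item_stats(scores)
    chosen = set()
    # Step 3: within each stratum keep the highest-variance items (ties broken by index).
    for name, _ in STRATA:
        members = [i for i, (d, _v) in stats.items() if stratum_of(d) == name]
        members.sort(key=lambda i: (-stats[i][1], i))
        chosen.update(members[:per_stratum])
    # Step 4: add the hardest and easiest items as calibration anchors.
    # The guard matters: with anchors == 0, a slice like [-0:] would return every item.
    if anchors > 0:
        by_difficulty = sorted(stats, key=lambda i: (stats[i][0], i))
        chosen.update(by_difficulty[:anchors])
        chosen.update(by_difficulty[-anchors:])
    return sorted(chosen)


# Toy data: item index -> one score per model (three hypothetical models).
scores = {
    0: [0, 0, 0],
    1: [0, 0, 1],
    2: [0, 1, 1],
    3: [1, 1, 1],
    4: [0.25, 0.25, 0.5],
    5: [0.5, 0.5, 0.5],
}
print(select(scores, per_stratum=1, anchors=1))  # [0, 1, 2, 3]
```

Items 4 and 5 are dropped because each shares a stratum with an item of higher variance.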

Results

| Benchmark | Full | Pruned | Reduction | Rank Preserved | Max Score Δ |
| --- | --- | --- | --- | --- | --- |
| LiveCodeBench v5 | 315 | 189 | 40% | Yes (Kendall τ = 1.0) | 0.030 |
| AA-LCR | 100 | 50 | 50% | Yes (Kendall τ = 1.0) | 0.020 |

Model ranking is perfectly preserved at these ratios. The pruned set's average item variance is 1.8× higher than the full set, confirming the most informative samples are selected.
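
For reference, rank preservation of the kind reported in the table can be checked with a pairwise Kendall τ. The sketch below assumes per-model aggregate scores on the full and pruned sets are available as dicts. The model names and scores are made up for illustration and are not the PR's data. The function computes τ-a, where tied pairs count as neither concordant nor discordant.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(full: Dict[str, float], pruned: Dict[str, float]) -> float:
    """Kendall tau-a between two score tables keyed by model name."""
    models = sorted(full)
    concordant = discordant = 0
    for a, b in combinations(models, 2):
        # The sign of the product tells whether both tables order the pair the same way.
        sign = (full[a] - full[b]) * (pruned[a] - pruned[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


def max_score_delta(full: Dict[str, float], pruned: Dict[str, float]) -> float:
    """Largest absolute per-model score change between the two tables."""
    return max(abs(full[m] - pruned[m]) for m in full)


# Hypothetical scores, for illustration only.
full_scores = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.41}
pruned_scores = {"model_a": 0.60, "model_b": 0.52, "model_c": 0.43}
print(kendall_tau(full_scores, pruned_scores))  # 1.0
```

With three models there are only three pairs, so absent ties τ can only take the values -1, -1/3, 1/3, or 1.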

Why it generalises to unseen models

Selection is based on structural item properties (variance, difficulty) — not on which specific model gets an item right. A 4th model evaluated on this subset will still be correctly ranked.

Files added

  • evalscope/pruning/item_stats.py: per-item statistics from review JSONL files
  • evalscope/pruning/strategy.py: VarianceStratifiedPruner implementation
  • evalscope/pruning/adapter.py: evalscope DataAdapter integration mixin
  • tools/compare_runs.py: CLI to compare full vs pruned run results

Usage

python -m tools.compare_runs \
    --review-dir "./Evals/Part 1/reviews" \
    --benchmark live_code_bench_v5 \
    --score-key pass \
    --prune-ratio 0.6

Developed against evalscope commit de7b0b3f08c617f48a00ef09f7169dc74212a6d9

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a benchmark compression framework to evalscope using Variance-Weighted Stratified Sampling, which includes modules for computing item statistics, implementing pruning strategies, and integrating with the dataset loading pipeline, alongside documentation and comparison tools. The review feedback highlights several critical robustness improvements: resolving non-determinism in the pruning selection logic by avoiding slicing on unordered sets, and defensively handling missing keys, unexpected types, and index type mismatches (string vs. integer) during JSON parsing and filtering to prevent runtime crashes and silent mismatches.

Comment on lines +123 to +169
        selected: Set[int] = set()

        # Step 1: Include calibration anchors (highest and lowest difficulty items)
        all_items_sorted = sorted(item_stats.values(), key=lambda x: x.difficulty)
        for item in all_items_sorted[:self.calibration_anchors]:
            selected.add(item.index)
        for item in all_items_sorted[-self.calibration_anchors:]:
            selected.add(item.index)

        # Step 2: Ensure minimum per stratum
        for stratum in self.strata:
            items_in_stratum = stratified[stratum.name]
            for item in items_in_stratum[:self.min_samples_per_stratum]:
                selected.add(item.index)

        # Step 3: Allocate remaining budget by stratum weight
        remaining = target - len(selected)
        if remaining <= 0:
            return sorted(list(selected)[:target])

        total_weight = sum(s.allocation_weight for s in self.strata)
        for stratum in self.strata:
            allocation = max(0, int(remaining * stratum.allocation_weight / total_weight))
            items_in_stratum = stratified[stratum.name]

            added = 0
            for item in items_in_stratum:
                if item.index in selected:
                    continue
                selected.add(item.index)
                added += 1
                if added >= allocation:
                    break

        # Step 4: Fill any remaining slots with highest-variance items globally
        if len(selected) < target:
            all_by_variance = sorted(
                item_stats.values(), key=lambda x: x.variance, reverse=True
            )
            for item in all_by_variance:
                if item.index in selected:
                    continue
                selected.add(item.index)
                if len(selected) >= target:
                    break

        return sorted(selected)


high

Slicing an unordered set (e.g., list(selected)[:target]) is non-deterministic and can lead to arbitrary item selection when the number of selected items exceeds the target size. Using a dict as an ordered set (available in Python 3.7+) ensures that insertion order is preserved, meaning calibration anchors and stratum minimums are prioritized correctly during truncation.

        selected: Dict[int, None] = {}

        # Step 1: Include calibration anchors (highest and lowest difficulty items)
        all_items_sorted = sorted(item_stats.values(), key=lambda x: x.difficulty)
        for item in all_items_sorted[:self.calibration_anchors]:
            selected[item.index] = None
        for item in all_items_sorted[-self.calibration_anchors:]:
            selected[item.index] = None

        # Step 2: Ensure minimum per stratum
        for stratum in self.strata:
            items_in_stratum = stratified[stratum.name]
            for item in items_in_stratum[:self.min_samples_per_stratum]:
                selected[item.index] = None

        # Step 3: Allocate remaining budget by stratum weight
        remaining = target - len(selected)
        if remaining <= 0:
            return sorted(list(selected.keys())[:target])

        total_weight = sum(s.allocation_weight for s in self.strata)
        for stratum in self.strata:
            allocation = max(0, int(remaining * stratum.allocation_weight / total_weight))
            items_in_stratum = stratified[stratum.name]

            added = 0
            for item in items_in_stratum:
                if item.index in selected:
                    continue
                selected[item.index] = None
                added += 1
                if added >= allocation:
                    break

        # Step 4: Fill any remaining slots with highest-variance items globally
        if len(selected) < target:
            all_by_variance = sorted(
                item_stats.values(), key=lambda x: x.variance, reverse=True
            )
            for item in all_by_variance:
                if item.index in selected:
                    continue
                selected[item.index] = None
                if len(selected) >= target:
                    break

        return sorted(list(selected.keys()))
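
The effect of using a dict as an ordered set can be seen in isolation. The sketch below is a standalone illustration; the function name and the index values are hypothetical. It relies only on the guarantee that dicts preserve insertion order, so truncation keeps the earliest-added (highest-priority) entries.

```python
from typing import Iterable, List


def truncate_in_priority_order(groups: Iterable[Iterable[int]], target: int) -> List[int]:
    """Collect indices group by group, then keep the first `target` by insertion order."""
    selected = {}
    for group in groups:
        for idx in group:
            # setdefault keeps the original position if the index was already added.
            selected.setdefault(idx, None)
    return sorted(list(selected)[:target])


anchors = [40, 7]              # added first, so they survive truncation
stratum_minimums = [7, 19, 3]  # 7 is a duplicate and keeps its earlier position
print(truncate_in_priority_order([anchors, stratum_minimums], target=3))  # [7, 19, 40]
```

Index 3 is dropped because it was inserted last, regardless of its numeric value or hash.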

Comment on lines +127 to +131
        selected = self._compute_selected_indices(config)
        if selected is None:
            return samples

        filtered = [s for s in samples if s.get('index') in selected]


high

To prevent silent filtering failures due to type mismatches (e.g., if indices are loaded as strings from historical reviews but are integers in the live dataset, or vice versa), normalize the index values before performing the membership check.

        selected = self._compute_selected_indices(config)
        if selected is None:
            return samples

        def _normalize(idx):
            if isinstance(idx, str) and idx.isdigit():
                return int(idx)
            return idx

        filtered = [s for s in samples if _normalize(s.get('index')) in selected]
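
A short standalone demonstration of the silent mismatch this guards against. The sample records and the helper name `normalize_index` are made up for the example.

```python
def normalize_index(idx):
    """Convert digit-only strings to int; leave every other value unchanged."""
    if isinstance(idx, str) and idx.isdigit():
        return int(idx)
    return idx


selected = {3, 7}
samples = [{"index": "3"}, {"index": 7}, {"index": "12"}]

# Without normalization the string "3" never equals the int 3, so the sample
# is dropped with no error raised.
naive = [s for s in samples if s.get("index") in selected]
robust = [s for s in samples if normalize_index(s.get("index")) in selected]

print(len(naive), len(robust))  # 1 2
```

Note that `str.isdigit()` returns False for strings such as "-1" or " 3", so those values pass through unchanged; whether that matters depends on how indices are written in the review files.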

Comment on lines +47 to +56
        with open(fpath, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                row = json.loads(line)
                idx = row['index']
                if indices is not None and idx not in indices:
                    continue
                score = row['sample_score']['score']['value'][score_key]
                model_scores[model_name].append(float(score))


high

Defensively handle missing keys and normalize index types to avoid silent mismatches when comparing model scores.

        with open(fpath, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                row = json.loads(line)
                idx = row.get('index')
                if idx is None:
                    continue
                if isinstance(idx, str) and idx.isdigit():
                    idx = int(idx)
                if indices is not None and idx not in indices:
                    continue
                
                sample_score = row.get('sample_score', {})
                score_obj = sample_score.get('score', {}) if isinstance(sample_score, dict) else {}
                score_val = score_obj.get('value') if isinstance(score_obj, dict) else None
                
                if isinstance(score_val, dict):
                    score = score_val.get(score_key)
                else:
                    score = score_val
                
                if score is None:
                    continue
                model_scores[model_name].append(float(score))
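
The same defensive lookup can be factored into a helper so both call sites share it. This is a sketch only: the name `extract_score` is hypothetical, and the row layout (`sample_score` → `score` → `value` → score key) is taken from the snippet under review. Unlike the suggestion above, it also returns None when the value cannot be converted to a float.

```python
import json
from typing import Optional


def extract_score(row: dict, score_key: str) -> Optional[float]:
    """Return the score as a float, or None if the row is malformed."""
    sample_score = row.get("sample_score")
    if not isinstance(sample_score, dict):
        return None
    score_obj = sample_score.get("score")
    if not isinstance(score_obj, dict):
        return None
    value = score_obj.get("value")
    if isinstance(value, dict):
        # Per-metric dict: pick the requested metric.
        value = value.get(score_key)
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


lines = [
    '{"index": 0, "sample_score": {"score": {"value": {"pass": 1}}}}',
    '{"index": 1, "sample_score": {"score": {"value": 0.5}}}',
    '{"index": 2, "sample_score": null}',
    '{"index": 3}',
]
results = [extract_score(json.loads(line), "pass") for line in lines]
print(results)  # [1.0, 0.5, None, None]
```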

Comment on lines +55 to +83
    score_matrix: Dict[int, List[float]] = {}
    review_path = Path(review_dir)

    for fpath in sorted(review_path.glob(f'{benchmark_prefix}__*.jsonl')):
        with open(fpath, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                row = json.loads(line)
                idx = row['index']
                score = row['sample_score']['score']['value'][score_key]
                if idx not in score_matrix:
                    score_matrix[idx] = []
                score_matrix[idx].append(float(score))

    items: Dict[int, ItemStats] = {}
    for idx, scores in score_matrix.items():
        n = len(scores)
        mean = sum(scores) / n
        variance = sum((s - mean) ** 2 for s in scores) / n
        items[idx] = ItemStats(
            index=idx,
            difficulty=mean,
            variance=variance,
            n_models=n,
            scores=scores,
        )

    return items


medium

The JSON parsing is vulnerable to missing keys or unexpected types (e.g., if value is not a dictionary or if score_key is missing). Additionally, index type mismatches (string vs integer) can cause silent matching failures later. Normalizing the index to an integer (if numeric) and defensively accessing dictionary keys prevents runtime crashes and ensures robust behavior.

    score_matrix: Dict[int, List[float]] = {}
    review_path = Path(review_dir)

    for fpath in sorted(review_path.glob(f'{benchmark_prefix}__*.jsonl')):
        with open(fpath, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                row = json.loads(line)
                idx = row.get('index')
                if idx is None:
                    continue
                if isinstance(idx, str) and idx.isdigit():
                    idx = int(idx)
                
                sample_score = row.get('sample_score', {})
                score_obj = sample_score.get('score', {}) if isinstance(sample_score, dict) else {}
                score_val = score_obj.get('value') if isinstance(score_obj, dict) else None
                
                if isinstance(score_val, dict):
                    score = score_val.get(score_key)
                else:
                    score = score_val
                
                if score is None:
                    continue
                
                if idx not in score_matrix:
                    score_matrix[idx] = []
                score_matrix[idx].append(float(score))

    items: Dict[int, ItemStats] = {}
    for idx, scores in score_matrix.items():
        n = len(scores)
        mean = sum(scores) / n
        variance = sum((s - mean) ** 2 for s in scores) / n
        items[idx] = ItemStats(
            index=idx,
            difficulty=mean,
            variance=variance,
            n_models=n,
            scores=scores,
        )

    return items

@Yunnglin

Yunnglin commented Jun 2, 2026

Collaborator

Hi @Shashank-mankala1, thanks for the PR. The variance-stratified pruning concept is interesting — IRT-inspired item selection for benchmark compression is a valid research direction.

However, I have some questions and concerns before we can evaluate this for merge:

Questions

  1. Intent clarification: The PR includes Handout_A.md and Handout_B.md which appear to be deliverables from a technical assessment (references to "customer conversation", "PM", "sales engineer", "Part A/Part B" structure). Was this developed as part of an external assignment that you are now contributing upstream, or is this intended as a standalone evalscope feature proposal? Either way is fine — just want to understand the context.

Structural issues

  1. Broken imports: tools/compare_runs.py imports from evalscope_ext.pruning (lines 7-11), which does not exist in this repo. The code under evalscope/pruning/ uses relative imports correctly, but the CLI tool cannot run as-is.

  2. Root-level files: Handout_A.md, Handout_B.md, and README_PRUNING.md should not live at the repository root. If this is meant as documentation, it should go under docs/ or within the module directory.

  3. Root-level tools/ directory: evalscope does not have a top-level tools/ package. CLI utilities should either go under evalscope/cli/ or as a script within the module.

  4. No integration with evalscope pipeline: The PruningMixin in adapter.py is defined but never wired into any existing adapter or exposed through TaskConfig/dataset_args. How would a user actually invoke this through the standard evalscope eval workflow?

  5. No tests: The project requires new features to ship with at least a minimal runnable test.

  6. No link to an issue or RFC: Features of this scope typically start with a discussion or issue to align on the design before implementation.

If you want to move forward

If this is a genuine contribution you'd like to land, I'd suggest:

  • Open an issue first describing the feature proposal and desired UX
  • Remove the handout/assessment files
  • Fix the tools/compare_runs.py import to use evalscope.pruning
  • Show how pruning integrates with the existing eval pipeline (e.g., via dataset_args or a new CLI flag)
  • Add at least one unit test for VarianceStratifiedPruner.select()
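
As an illustration of the last point, a minimal test might check the size, ordering, and determinism of the selection. The sketch below uses a stand-in function because the real `VarianceStratifiedPruner.select()` signature is not shown in this thread; `select_indices` and the synthetic variances are hypothetical.

```python
from typing import Dict, List


def select_indices(variances: Dict[int, float], target: int) -> List[int]:
    """Stand-in for the pruner: keep the `target` highest-variance items."""
    ranked = sorted(variances, key=lambda i: (-variances[i], i))  # ties broken by index
    return sorted(ranked[:target])


def test_selection_has_target_size_and_is_deterministic():
    variances = {i: (i * 37 % 11) / 10 for i in range(20)}
    first = select_indices(variances, target=8)
    # Same data inserted in reverse order must give the same selection.
    second = select_indices(dict(reversed(list(variances.items()))), target=8)
    assert len(first) == 8
    assert first == sorted(set(first))  # sorted, no duplicates
    assert first == second
```

A real test would construct the pruner and its item statistics in place of the stand-in.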

Happy to discuss the design direction if you'd like to proceed.

@Shashank-mankala1

Author

Hi @Yunnglin,
Thanks for the review. Fixed in the latest push:

Intent: This was developed as part of a technical assessment. I'm contributing it upstream because I believe IRT-inspired benchmark compression is genuinely useful for evalscope users who need fast go/no-go model evaluation without running full suites.

Fixes:

  • Removed Handout_A.md, Handout_B.md, README_PRUNING.md from repo root
  • Removed tools/ directory
  • Added evalscope/cli/pruning_compare.py - CLI comparison tool with correct evalscope.pruning imports
  • Added tests/pruning/test_pruning.py - 11 unit tests for VarianceStratifiedPruner, all passing

Integration path: PruningMixin.filter_samples_by_pruning() in adapter.py accepts a sample list + config dict and returns the filtered subset. An existing adapter would call this from load_dataset():

class LiveCodeBenchPrunedAdapter(LiveCodeBenchAdapter, PruningMixin):
    def load_dataset(self):
        samples = super().load_dataset()
        config = self._get_pruning_config()
        return self.filter_samples_by_pruning(samples, config)

Users invoke via: --dataset-args '{"live_code_bench": {"pruning_strategy": "variance_stratified", "prune_ratio": 0.6, "review_dir": "./reviews"}}'
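
A sketch of how such a JSON argument could be turned into a pruning configuration. This is illustrative only: `pruning_config` is a hypothetical helper, the default ratio and the range check are assumptions, and reading `prune_ratio` as the fraction of items kept is inferred from the results table (315 items reduced to 189 at 0.6).

```python
import json
from typing import Optional


def pruning_config(raw_dataset_args: str, benchmark: str) -> Optional[dict]:
    """Return the pruning settings for one benchmark, or None if pruning is off."""
    args = json.loads(raw_dataset_args).get(benchmark, {})
    if "pruning_strategy" not in args:
        return None
    ratio = float(args.get("prune_ratio", 1.0))  # assumed default: keep everything
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"prune_ratio must be in (0, 1], got {ratio}")
    return {
        "strategy": args["pruning_strategy"],
        "keep_ratio": ratio,
        "review_dir": args.get("review_dir"),
    }


raw = (
    '{"live_code_bench": {"pruning_strategy": "variance_stratified", '
    '"prune_ratio": 0.6, "review_dir": "./reviews"}}'
)
config = pruning_config(raw, "live_code_bench")
target = round(315 * config["keep_ratio"])  # 315 = full LiveCodeBench v5 size in the table
print(config["strategy"], target)  # variance_stratified 189
```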

Happy to wire this into the actual LiveCodeBench adapter if you'd prefer a fully runnable integration PR. Will open a feature issue to align on the design.

@Yunnglin

Yunnglin commented Jun 3, 2026

Collaborator

Thanks for the contribution and the clear writeup — the variance/difficulty stratified sampling idea is genuinely interesting and well-grounded in IRT-style benchmark compression work.

I don't think this is mergeable as it stands, mostly for a structural reason: evalscope routes new functionality through existing registries (@register_benchmark, @register_filter, …) and DataAdapter / mixin base classes rather than introducing parallel mechanisms. This PR adds a top-level evalscope/pruning/ package with its own local strategy registry, a PruningMixin that no DataAdapter actually inherits from, and a CLI entry point that isn't wired into evalscope/cli/cli.py's dispatcher — so nothing here is reachable from evalscope eval ... or from any existing benchmark today. There are also a few stray artifacts that need cleaning up (e.g. the from evalscope_ext.pruning ... fallback in the tests, the hardcoded ./Evals/Part 1/reviews path, references to a non-existent live_code_bench_pruned benchmark).

Conceptually, what you're doing is dataset-level sample selection, which sits at a different layer than Filter (per-sample output post-processing). The closest existing surface is the collection / sampler design under docs/en/advanced_guides/collection/ — that's where a "select an informative subset from historical reviews" capability would compose naturally.

Would you mind opening an issue first to sketch the integration design (registry hook, config schema, how it composes with dataset_args, validation methodology beyond 3 models)? Happy to discuss the design there before another round of code.

Thanks again.
