Add variance-stratified benchmark pruning extension #1391
Shashank-mankala1 wants to merge 2 commits into
Code Review
This pull request introduces a benchmark compression framework to evalscope using Variance-Weighted Stratified Sampling, which includes modules for computing item statistics, implementing pruning strategies, and integrating with the dataset loading pipeline, alongside documentation and comparison tools. The review feedback highlights several critical robustness improvements: resolving non-determinism in the pruning selection logic by avoiding slicing on unordered sets, and defensively handling missing keys, unexpected types, and index type mismatches (string vs. integer) during JSON parsing and filtering to prevent runtime crashes and silent mismatches.
    selected: Set[int] = set()

    # Step 1: Include calibration anchors (highest and lowest difficulty items)
    all_items_sorted = sorted(item_stats.values(), key=lambda x: x.difficulty)
    for item in all_items_sorted[:self.calibration_anchors]:
        selected.add(item.index)
    for item in all_items_sorted[-self.calibration_anchors:]:
        selected.add(item.index)

    # Step 2: Ensure minimum per stratum
    for stratum in self.strata:
        items_in_stratum = stratified[stratum.name]
        for item in items_in_stratum[:self.min_samples_per_stratum]:
            selected.add(item.index)

    # Step 3: Allocate remaining budget by stratum weight
    remaining = target - len(selected)
    if remaining <= 0:
        return sorted(list(selected)[:target])

    total_weight = sum(s.allocation_weight for s in self.strata)
    for stratum in self.strata:
        allocation = max(0, int(remaining * stratum.allocation_weight / total_weight))
        items_in_stratum = stratified[stratum.name]

        added = 0
        for item in items_in_stratum:
            if item.index in selected:
                continue
            selected.add(item.index)
            added += 1
            if added >= allocation:
                break

    # Step 4: Fill any remaining slots with highest-variance items globally
    if len(selected) < target:
        all_by_variance = sorted(
            item_stats.values(), key=lambda x: x.variance, reverse=True
        )
        for item in all_by_variance:
            if item.index in selected:
                continue
            selected.add(item.index)
            if len(selected) >= target:
                break

    return sorted(selected)
Slicing a set (e.g., list(selected)[:target]) truncates in the set's internal iteration order, which is unrelated to insertion order, so when more items have been selected than the target allows, the items that survive are arbitrary. Using a dict as an ordered set (insertion order is guaranteed since Python 3.7) means calibration anchors and stratum minimums are kept first during truncation.
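As a standalone illustration of the ordering point, here is a minimal sketch with made-up indices (the names `anchors`, `extras`, and `target` are hypothetical and are not taken from the PR). It shows that truncating dict keys keeps the earliest insertions, while truncating a set gives no such guarantee.

```python
# Hypothetical indices, chosen only to make the ordering difference visible.
anchors = [907, 12, 450]   # inserted first, so they should survive truncation
extras = [3, 88, 5]        # inserted later, lower priority
target = 4

as_set = set()
as_dict = {}               # dict keys behave as an insertion-ordered set
for idx in anchors + extras:
    as_set.add(idx)
    as_dict[idx] = None

# Truncating the dict keeps the first `target` insertions.
kept_from_dict = list(as_dict)[:target]

# Truncating the set keeps whichever elements come first in hash-table order,
# which need not include the anchors.
kept_from_set = list(as_set)[:target]

print(kept_from_dict)  # [907, 12, 450, 3]
print(kept_from_set)   # order depends on the set's internal layout
```

The set result is left without an expected value on purpose, since it depends on the set's internal layout rather than on the order in which items were added.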
    selected: Dict[int, None] = {}

    # Step 1: Include calibration anchors (highest and lowest difficulty items)
    all_items_sorted = sorted(item_stats.values(), key=lambda x: x.difficulty)
    for item in all_items_sorted[:self.calibration_anchors]:
        selected[item.index] = None
    for item in all_items_sorted[-self.calibration_anchors:]:
        selected[item.index] = None

    # Step 2: Ensure minimum per stratum
    for stratum in self.strata:
        items_in_stratum = stratified[stratum.name]
        for item in items_in_stratum[:self.min_samples_per_stratum]:
            selected[item.index] = None

    # Step 3: Allocate remaining budget by stratum weight
    remaining = target - len(selected)
    if remaining <= 0:
        return sorted(list(selected.keys())[:target])

    total_weight = sum(s.allocation_weight for s in self.strata)
    for stratum in self.strata:
        allocation = max(0, int(remaining * stratum.allocation_weight / total_weight))
        items_in_stratum = stratified[stratum.name]
        added = 0
        for item in items_in_stratum:
            if item.index in selected:
                continue
            selected[item.index] = None
            added += 1
            if added >= allocation:
                break

    # Step 4: Fill any remaining slots with highest-variance items globally
    if len(selected) < target:
        all_by_variance = sorted(
            item_stats.values(), key=lambda x: x.variance, reverse=True
        )
        for item in all_by_variance:
            if item.index in selected:
                continue
            selected[item.index] = None
            if len(selected) >= target:
                break

    return sorted(list(selected.keys()))

    selected = self._compute_selected_indices(config)
    if selected is None:
        return samples

    filtered = [s for s in samples if s.get('index') in selected]
To prevent silent filtering failures due to type mismatches (e.g., if indices are loaded as strings from historical reviews but are integers in the live dataset, or vice versa), normalize the index values before performing the membership check.
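The silent failure described here can be reproduced in a few lines. The following is a minimal sketch with made-up samples; `normalize_index` is a hypothetical helper name used for illustration, not a function from evalscope.

```python
def normalize_index(idx):
    """Return an int for digit-only strings; return anything else unchanged."""
    if isinstance(idx, str) and idx.isdigit():
        return int(idx)
    return idx

selected = {0, 2}  # integer indices, e.g. as produced by a pruner
samples = [{"index": "0"}, {"index": 1}, {"index": "2"}, {"text": "no index"}]

# Without normalization, "0" and "2" never match the integers 0 and 2,
# and nothing signals that the filter dropped every sample.
naive = [s for s in samples if s.get("index") in selected]

# With normalization, the string indices are matched as intended.
robust = [s for s in samples if normalize_index(s.get("index")) in selected]

print(len(naive))   # 0
print(len(robust))  # 2
```

Note that `str.isdigit()` rejects a leading minus sign, so a value such as "-1" would be left as a string under this approach.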
    selected = self._compute_selected_indices(config)
    if selected is None:
        return samples

    def _normalize(idx):
        if isinstance(idx, str) and idx.isdigit():
            return int(idx)
        return idx

    filtered = [s for s in samples if _normalize(s.get('index')) in selected]

    with open(fpath, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            idx = row['index']
            if indices is not None and idx not in indices:
                continue
            score = row['sample_score']['score']['value'][score_key]
            model_scores[model_name].append(float(score))
Defensively handle missing keys and normalize index types to avoid silent mismatches when comparing model scores.
    with open(fpath, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            idx = row.get('index')
            if idx is None:
                continue
            if isinstance(idx, str) and idx.isdigit():
                idx = int(idx)
            if indices is not None and idx not in indices:
                continue
            sample_score = row.get('sample_score', {})
            score_obj = sample_score.get('score', {}) if isinstance(sample_score, dict) else {}
            score_val = score_obj.get('value') if isinstance(score_obj, dict) else None
            if isinstance(score_val, dict):
                score = score_val.get(score_key)
            else:
                score = score_val
            if score is None:
                continue
            model_scores[model_name].append(float(score))

    score_matrix: Dict[int, List[float]] = {}
    review_path = Path(review_dir)

    for fpath in sorted(review_path.glob(f'{benchmark_prefix}__*.jsonl')):
        with open(fpath, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                row = json.loads(line)
                idx = row['index']
                score = row['sample_score']['score']['value'][score_key]
                if idx not in score_matrix:
                    score_matrix[idx] = []
                score_matrix[idx].append(float(score))

    items: Dict[int, ItemStats] = {}
    for idx, scores in score_matrix.items():
        n = len(scores)
        mean = sum(scores) / n
        variance = sum((s - mean) ** 2 for s in scores) / n
        items[idx] = ItemStats(
            index=idx,
            difficulty=mean,
            variance=variance,
            n_models=n,
            scores=scores,
        )

    return items
The JSON parsing is vulnerable to missing keys or unexpected types (e.g., if value is not a dictionary or if score_key is missing). Additionally, index type mismatches (string vs integer) can cause silent matching failures later. Normalizing the index to an integer (if numeric) and defensively accessing dictionary keys prevents runtime crashes and ensures robust behavior.
    score_matrix: Dict[int, List[float]] = {}
    review_path = Path(review_dir)

    for fpath in sorted(review_path.glob(f'{benchmark_prefix}__*.jsonl')):
        with open(fpath, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                row = json.loads(line)
                idx = row.get('index')
                if idx is None:
                    continue
                if isinstance(idx, str) and idx.isdigit():
                    idx = int(idx)
                sample_score = row.get('sample_score', {})
                score_obj = sample_score.get('score', {}) if isinstance(sample_score, dict) else {}
                score_val = score_obj.get('value') if isinstance(score_obj, dict) else None
                if isinstance(score_val, dict):
                    score = score_val.get(score_key)
                else:
                    score = score_val
                if score is None:
                    continue
                if idx not in score_matrix:
                    score_matrix[idx] = []
                score_matrix[idx].append(float(score))

    items: Dict[int, ItemStats] = {}
    for idx, scores in score_matrix.items():
        n = len(scores)
        mean = sum(scores) / n
        variance = sum((s - mean) ** 2 for s in scores) / n
        items[idx] = ItemStats(
            index=idx,
            difficulty=mean,
            variance=variance,
            n_models=n,
            scores=scores,
        )

    return items
Hi @Shashank-mankala1, thanks for the PR. The variance-stratified pruning concept is interesting: IRT-inspired item selection for benchmark compression is a valid research direction. However, I have some questions and concerns before we can evaluate this for merge:
Questions
Structural issues
If you want to move forward
If this is a genuine contribution you'd like to land, I'd suggest:
Happy to discuss the design direction if you'd like to proceed.
…el docs, fix imports
Hi @Yunnglin,
Intent: This was developed as part of a technical assessment. I'm contributing it upstream because I believe IRT-inspired benchmark compression is genuinely useful for evalscope users who need fast go/no-go model evaluation without running full suites.
Fixes:
Integration path: PruningMixin.filter_samples_by_pruning() in adapter.py accepts a sample list + config dict and returns the filtered subset. An existing adapter would call this from load_dataset():
Users invoke via:
--dataset-args '{"live_code_bench": {"pruning_strategy": "variance_stratified", "prune_ratio": 0.6, "review_dir": "./reviews"}}'
Happy to wire this into the actual LiveCodeBench adapter if you'd prefer a fully runnable integration PR. Will open a feature issue to align on the design.
Thanks for the contribution and the clear writeup. The variance/difficulty stratified sampling idea is genuinely interesting and well-grounded in IRT-style benchmark compression work.
I don't think this is mergeable as it stands, mostly for a structural reason: evalscope routes new functionality through existing registries (
Conceptually, what you're doing is dataset-level sample selection, which sits at a different layer than
Would you mind opening an issue first to sketch the integration design (registry hook, config schema, how it composes with
Thanks again.
Add variance-stratified benchmark pruning extension
What this does
Implements Variance-Weighted Stratified Sampling — an IRT-inspired pruning strategy that selects the minimal sample set preserving the model ranking signal, implemented as a native evalscope extension.
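To make the per-item quantities concrete, here is a minimal sketch that follows the computation in the PR's item statistics code, where an item's "difficulty" is its mean score across models and its variance is the population variance of those scores. The score values are made up for illustration, and the dict layout is an assumption rather than the PR's `ItemStats` type.

```python
from statistics import fmean, pvariance

# Hypothetical 0/1 scores: item index -> one score per evaluated model.
score_matrix = {
    0: [1.0, 1.0, 1.0],  # every model passes: no ranking signal
    1: [1.0, 0.0, 1.0],  # models disagree: carries ranking signal
    2: [0.0, 0.0, 0.0],  # every model fails: no ranking signal
}

# Difficulty is the mean score (higher means easier); variance is the
# population variance, i.e. divided by n rather than n - 1.
stats = {
    idx: {"difficulty": fmean(scores), "variance": pvariance(scores)}
    for idx, scores in score_matrix.items()
}

# Items on which models disagree most come first.
by_variance = sorted(stats, key=lambda i: stats[i]["variance"], reverse=True)
print(by_variance[0])  # 1
```

Items that every model passes or every model fails have zero variance, so they cannot change the relative ordering of those models, which is the intuition behind weighting selection toward high-variance items.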
Approach
Results
Benchmark | Full | Pruned | Reduction | Rank Preserved | Max Score Δ
-- | -- | -- | -- | -- | --
LiveCodeBench v5 | 315 | 189 | 40% | Yes (Kendall τ = 1.0) | 0.030
AA-LCR | 100 | 50 | 50% | Yes (Kendall τ = 1.0) | 0.020

Model ranking is perfectly preserved at these ratios. The pruned set's average item variance is 1.8× higher than the full set, confirming the most informative samples are selected.
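For readers who want to check the two reported metrics on their own runs, here is a minimal sketch of how rank preservation (Kendall τ) and the maximum score difference could be computed. The function name, model names, and scores are hypothetical and are not the PR's measured results or the implementation in tools/compare_runs.py; the τ here is the simple tau-a variant with no tie correction.

```python
from itertools import combinations

def kendall_tau(full, pruned):
    """Kendall tau-a between two score dicts keyed by model name."""
    models = sorted(full)
    concordant = discordant = 0
    for a, b in combinations(models, 2):
        # Positive product: both runs order the pair (a, b) the same way.
        sign = (full[a] - full[b]) * (pruned[a] - pruned[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical mean scores for three models on the full and pruned sets.
full_scores = {"model_a": 0.62, "model_b": 0.48, "model_c": 0.31}
pruned_scores = {"model_a": 0.64, "model_b": 0.45, "model_c": 0.30}

tau = kendall_tau(full_scores, pruned_scores)
max_delta = max(abs(full_scores[m] - pruned_scores[m]) for m in full_scores)

print(tau)                  # 1.0
print(round(max_delta, 3))  # 0.03
```

With three models there are only three pairs to compare, so τ can take just four values (−1, −1/3, 1/3, 1), and a value of 1.0 is correspondingly coarse evidence.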
Why it generalises to unseen models
Selection is based on structural item properties (variance, difficulty), not on which specific model gets an item right. A fourth model evaluated on this subset should therefore still be ranked correctly.
Files added
- evalscope/pruning/item_stats.py: per-item statistics from review JSONL files
- evalscope/pruning/strategy.py: VarianceStratifiedPruner implementation
- evalscope/pruning/adapter.py: evalscope DataAdapter integration mixin
- tools/compare_runs.py: CLI to compare full vs pruned run results
Usage
Developed against evalscope commit
de7b0b3f08c617f48a00ef09f7169dc74212a6d9