
fix(dataloader): prevent data drop and padding during validation for accurate metrics#1100

Merged
garrett4wade merged 1 commit into inclusionAI:main from Anguo-star:dev_ag
Mar 30, 2026

Conversation

@Anguo-star
Contributor

fix(dataloader): use EvalDistributedSampler for validation to ensure accurate metrics

Description

This PR fixes a critical bug where the validation dataloader was producing inaccurate evaluation metrics due to hardcoded data dropping and standard distributed padding.

Previously, create_dataloader hardcoded drop_last=True and used the standard DistributedSampler for all dataset configs. The standard DistributedSampler pads the dataset to make it evenly divisible by the number of replicas, which causes some validation samples to be evaluated twice. Combined with drop_last=True, this means the validation set was both artificially inflated (via padding) and truncated (via dropping), leading to biased evaluation results.

Changes made:

  1. Introduced EvalDistributedSampler: A custom sampler designed specifically for evaluation that prevents dataset padding. It ensures every sample in the dataset is evaluated exactly once across the cluster.
  2. Updated create_dataloader: Added logic to check if the config is a ValidDatasetConfig. If true, it dynamically applies the new EvalDistributedSampler and forces drop_last=False.
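For illustration, the no-padding sharding idea behind a sampler like EvalDistributedSampler can be sketched in plain Python. The function names below are hypothetical stand-ins, not the PR's actual code, and the torch Sampler base class is omitted so the index logic is easy to compare:

```python
def eval_shard_indices(dataset_len, num_replicas, rank):
    """Eval-style sharding: every sample appears exactly once across
    all ranks, and no rank receives padded duplicates."""
    return list(range(rank, dataset_len, num_replicas))

def padded_shard_indices(dataset_len, num_replicas, rank):
    """Standard DistributedSampler behavior: wrap around the dataset so
    every rank gets the same number of samples (some evaluated twice)."""
    total = ((dataset_len + num_replicas - 1) // num_replicas) * num_replicas
    indices = [i % dataset_len for i in range(total)]  # wrap-around padding
    return indices[rank:total:num_replicas]
```

With 10 samples over 4 replicas, eval-style sharding yields shard sizes 3, 3, 2, 2 (10 samples total), while the padded variant yields 3 per rank (12 total, with two samples duplicated), which is exactly the inflation described above.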

Related Issue

Fixes #1095

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change



@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new EvalDistributedSampler class in areal/utils/dataloader.py designed for evaluation tasks. Unlike the standard DistributedSampler, this implementation avoids padding the dataset, ensuring that each sample is evaluated exactly once across the cluster even if the dataset size is not evenly divisible by the number of replicas. The create_dataloader function was updated to use this new sampler when a ValidDatasetConfig is provided. I have no feedback to provide.

Collaborator

@garrett4wade left a comment


@Anguo-star Thanks for the fix!

FYI, the new EvalDistributedSampler class may not be mandatory with single-controller usage (i.e., "python3 train.py" instead of the legacy "python3 -m areal.infra.launcher.local train.py"), because the data is only loaded in the controller process. The fix is good for backward compatibility. Merging anyway.

@garrett4wade merged commit 0405b5c into inclusionAI:main Mar 30, 2026
rchardx added a commit that referenced this pull request Mar 30, 2026
…rash

In single-controller mode with drop_last=False (as set by PR #1100
for accurate validation metrics), the last validation batch may
contain fewer sequences than dp_size. This causes
balanced_greedy_partition to raise "The length of nums must be
divisible by K" during evaluation dispatch.

Truncate trailing items in _dispatch_tensors when the batch size
is not evenly divisible by dp_size, preserving compatibility with
both drop_last=False validation and strict equal-partition dispatch.
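A minimal sketch of the truncation idea in that commit (names are illustrative; the real _dispatch_tensors operates on tensor dicts, not plain lists):

```python
def truncate_for_dispatch(items, dp_size):
    """Drop trailing items so the batch splits evenly across dp_size
    replicas, avoiding the "length must be divisible by K" error."""
    usable = (len(items) // dp_size) * dp_size
    return items[:usable]

def dispatch(items, dp_size):
    """Split an evenly divisible batch into one shard per DP replica."""
    items = truncate_for_dispatch(items, dp_size)
    per_rank = len(items) // dp_size
    return [items[r * per_rank:(r + 1) * per_rank] for r in range(dp_size)]
```

Note that truncation silently discards the remainder samples, which is why the follow-up commits below replace it with zero-weight padding.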
rchardx added a commit that referenced this pull request Mar 31, 2026
…ivisible batches

When drop_last=False, the last eval batch may not be divisible by
dp_size, causing balanced_greedy_partition to crash (#1100/#1109).
Instead of truncating remainder items (which biases eval metrics),
pad with zero-contribution dummy items before dispatch so that all
DP replicas participate in collective ops without deadlock.

Key changes:
- Add make_dummy_eval_item in areal/utils/data.py
- Add _pad_eval_batch in train_controller.py for evaluate_* methods
- Move total_loss_weight assertion after all_reduce (allow local zero)
- Add zero-weight fast path in FSDP/Megatron/Archon engines
- Add tests for padding, dummy schema, and zero-weight loss path

Refs: #1100, #1109, #898
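The padding approach can be sketched as follows. Here make_dummy and the (loss, weight) tuples are simplified stand-ins for the PR's make_dummy_eval_item / _pad_eval_batch, which handle full schema-compatible tensor items:

```python
def pad_eval_batch(items, dp_size, make_dummy):
    """Pad with zero-weight dummy items so the batch divides evenly by
    dp_size; all replicas join collective ops, dummies contribute nothing."""
    remainder = len(items) % dp_size
    if remainder == 0:
        return items
    return items + [make_dummy() for _ in range(dp_size - remainder)]

def weighted_loss(items):
    """Aggregate (loss, weight) pairs; zero-weight dummies drop out of
    both numerator and denominator, so metrics stay unbiased."""
    total_weight = sum(w for _, w in items)
    return sum(loss * w for loss, w in items) / total_weight
```

For example, a batch of three real items padded to four with one zero-weight dummy produces the same mean loss as the unpadded batch, unlike wrap-around padding or truncation.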
rchardx added a commit that referenced this pull request Apr 1, 2026
…ivisible batches

When drop_last=False, the last eval batch may not be divisible by
dp_size, causing balanced_greedy_partition to crash.  Instead of
truncating remainder items (which biases metrics), pad the batch with
zero-contribution dummy items before dispatch.  Extend dispatch to
support group_size-aware partitioning so RW chosen/rejected pairs
stay on the same DP rank.

Key changes:
- Add make_dummy_eval_item() for zero-weight schema-compatible items
- Add _pad_eval_batch() for eval-only dummy padding before dispatch
- Add zero-weight fast path in FSDP/Megatron/Archon engines
- Move total_loss_weight assertion after all_reduce (allow local zero)
- Add group_size-aware dispatch for RW chosen/rejected pairs
- Exclude dummy sequences from RW loss weight via count_nonzero
- Add 20+ tests covering padding, dispatch, and zero-weight paths

Refs: #1095, #1100, #898
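The group_size-aware dispatch mentioned above can be sketched as a round-robin over groups rather than individual items, so paired samples (e.g., RW chosen/rejected) always land on the same DP rank. This is a simplified illustration; the actual dispatch is balance-aware:

```python
def group_aware_dispatch(items, group_size, dp_size):
    """Partition items in units of group_size so that items belonging to
    the same group are never split across DP ranks."""
    assert len(items) % group_size == 0, "batch must contain whole groups"
    groups = [items[i:i + group_size] for i in range(0, len(items), group_size)]
    shards = [[] for _ in range(dp_size)]
    for g_idx, group in enumerate(groups):
        shards[g_idx % dp_size].extend(group)  # whole group to one rank
    return shards
```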
