fix(dataloader): prevent data drop and padding during validation for accurate metrics#1100
Conversation
…accurate metrics (pre-commit checked)
Code Review
This pull request introduces a new EvalDistributedSampler class in areal/utils/dataloader.py designed for evaluation tasks. Unlike the standard DistributedSampler, this implementation avoids padding the dataset, ensuring that each sample is evaluated exactly once across the cluster even if the dataset size is not evenly divisible by the number of replicas. The create_dataloader function was updated to use this new sampler when a ValidDatasetConfig is provided. I have no feedback to provide.
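A padding-free sampler of the kind described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual `EvalDistributedSampler`; the constructor arguments mirror the familiar `torch.utils.data.DistributedSampler` signature, but the class here is self-contained:

```python
# Hypothetical sketch of a padding-free distributed sampler for evaluation.
# Unlike the standard DistributedSampler, it never repeats indices to make
# the dataset evenly divisible by num_replicas, so each sample is seen
# exactly once across all replicas.
class EvalDistributedSampler:
    def __init__(self, dataset_size: int, num_replicas: int, rank: int):
        self.dataset_size = dataset_size
        self.num_replicas = num_replicas
        self.rank = rank
        # Replicas with rank < remainder receive one extra sample.
        base, rem = divmod(dataset_size, num_replicas)
        self.num_samples = base + (1 if rank < rem else 0)

    def __iter__(self):
        # Strided assignment: rank r takes indices r, r+R, r+2R, ...
        return iter(range(self.rank, self.dataset_size, self.num_replicas))

    def __len__(self):
        return self.num_samples
```

With this scheme the per-rank index sets are disjoint and their union covers the whole dataset, which is exactly the "evaluated exactly once" property the review describes.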
garrett4wade
left a comment
@Anguo-star Thanks for the fix!
FYI, the new EvalDistributedSampler class may not be mandatory with the single-controller usage (i.e., "python3 train.py" instead of the legacy "python3 -m areal.infra.launcher.local train.py"), because the data is only loaded in the controller process. The fix is good for backward compatibility. Merging anyway.
…rash

In single-controller mode with drop_last=False (as set by PR #1100 for accurate validation metrics), the last validation batch may contain fewer sequences than dp_size. This causes balanced_greedy_partition to raise "The length of nums must be divisible by K" during evaluation dispatch.

Truncate trailing items in _dispatch_tensors when the batch size is not evenly divisible by dp_size, preserving compatibility with both drop_last=False validation and strict equal-partition dispatch.
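The truncation fix from the commit message above can be sketched as a single helper. `truncate_for_dispatch` is a hypothetical stand-in for the trailing-item logic in `_dispatch_tensors`, shown only to illustrate the divisibility repair:

```python
# Illustrative sketch (not the repo's actual code): drop trailing items so
# the batch length becomes divisible by dp_size, satisfying the strict
# equal-partition requirement of balanced_greedy_partition.
def truncate_for_dispatch(batch: list, dp_size: int) -> list:
    # Largest multiple of dp_size that fits within the batch.
    usable = (len(batch) // dp_size) * dp_size
    return batch[:usable]
```

Note the trade-off this commit accepts: truncation keeps dispatch simple, but it silently discards a few trailing samples, which the later padding-based fix addresses.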
…ivisible batches

When drop_last=False, the last eval batch may not be divisible by dp_size, causing balanced_greedy_partition to crash (#1100/#1109). Instead of truncating remainder items (which biases eval metrics), pad with zero-contribution dummy items before dispatch so that all DP replicas participate in collective ops without deadlock.

Key changes:
- Add make_dummy_eval_item in areal/utils/data.py
- Add _pad_eval_batch in train_controller.py for evaluate_* methods
- Move total_loss_weight assertion after all_reduce (allow local zero)
- Add zero-weight fast path in FSDP/Megatron/Archon engines
- Add tests for padding, dummy schema, and zero-weight loss path

Refs: #1100, #1109, #898
…ivisible batches

When drop_last=False, the last eval batch may not be divisible by dp_size, causing balanced_greedy_partition to crash. Instead of truncating remainder items (which biases metrics), pad the batch with zero-contribution dummy items before dispatch. Extend dispatch to support group_size-aware partitioning so RW chosen/rejected pairs stay on the same DP rank.

Key changes:
- Add make_dummy_eval_item() for zero-weight schema-compatible items
- Add _pad_eval_batch() for eval-only dummy padding before dispatch
- Add zero-weight fast path in FSDP/Megatron/Archon engines
- Move total_loss_weight assertion after all_reduce (allow local zero)
- Add group_size-aware dispatch for RW chosen/rejected pairs
- Exclude dummy sequences from RW loss weight via count_nonzero
- Add 20+ tests covering padding, dispatch, and zero-weight paths

Refs: #1095, #1100, #898
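The padding strategy from the commit message above can be sketched like this. The item schema (a dict carrying a `loss_weight` field) is an assumption for illustration; the real `make_dummy_eval_item` / `_pad_eval_batch` operate on the repo's batch structures:

```python
# Hedged sketch of zero-contribution dummy padding for eval batches.
# Dummy items keep every DP replica participating in collective ops
# (avoiding deadlock) while contributing nothing to reduced metrics.
def make_dummy_eval_item(template: dict) -> dict:
    dummy = dict(template)
    dummy["loss_weight"] = 0.0  # excluded from the weighted loss sum
    return dummy

def pad_eval_batch(batch: list, dp_size: int) -> list:
    remainder = len(batch) % dp_size
    if remainder == 0:
        return batch  # already divisible; nothing to pad
    n_pad = dp_size - remainder
    return batch + [make_dummy_eval_item(batch[-1]) for _ in range(n_pad)]
```

Because the dummies carry zero loss weight, the metric numerator and denominator are unchanged, which is why this approach does not bias evaluation the way truncation does. It also explains the related change of moving the total_loss_weight assertion after all_reduce: a rank that received only dummies legitimately has a local weight of zero.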
fix(dataloader): use EvalDistributedSampler for validation to ensure accurate metrics
Description
This PR fixes a critical bug where the validation dataloader was producing inaccurate evaluation metrics due to hardcoded data dropping and standard distributed padding.
Previously, `create_dataloader` hardcoded `drop_last=True` and used the standard `DistributedSampler` for all dataset configs. The standard `DistributedSampler` pads the dataset to make it evenly divisible by the number of replicas, which causes some validation samples to be evaluated twice. Combined with `drop_last=True`, this means the validation set was both artificially inflated (via padding) and truncated (via dropping), leading to biased evaluation results.

Changes made:
- `EvalDistributedSampler`: a custom sampler designed specifically for evaluation that prevents dataset padding. It ensures every sample in the dataset is evaluated exactly once across the cluster.
- `create_dataloader`: added logic to check whether the config is a `ValidDatasetConfig`. If so, it dynamically applies the new `EvalDistributedSampler` and forces `drop_last=False`.

Related Issue
Fixes #1095
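The sampler-selection branch described under "Changes made" can be sketched as follows. The config classes here are minimal stand-ins (the real `ValidDatasetConfig` lives in the repo's config module), and the sampler names are returned as strings purely for illustration:

```python
# Hypothetical sketch of the branch added to create_dataloader.
from dataclasses import dataclass

@dataclass
class TrainDatasetConfig:
    batch_size: int = 32

@dataclass
class ValidDatasetConfig:
    batch_size: int = 32

def sampler_kwargs(config) -> dict:
    if isinstance(config, ValidDatasetConfig):
        # Validation: padding-free sampler, keep the tail batch.
        return {"sampler": "EvalDistributedSampler", "drop_last": False}
    # Training keeps the previous behavior: padded sampler, drop the tail.
    return {"sampler": "DistributedSampler", "drop_last": True}
```

Keying the behavior on the config type keeps the training path untouched, which is why the fix is backward compatible.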
Type of Change
Checklist
- `pre-commit run --all-files`
- `./docs/build_all.sh`
- `main`
- `/review-pr` command
- `/create-pr`

Breaking Change Details (if applicable):
Additional Context
Need help? Check the Contributing Guide or ask in GitHub Discussions!