Large-scale model training benefits from data at scale, but the value of a dataset also depends on how effectively it is used. Data Efficacy studies how to turn available data into stronger training signal by scoring samples, selecting useful subsets, and organizing them into effective training sequences.
Data curation drives much of that value. Many data efficiency methods compute expensive sample-level scores for quality, difficulty, learnability, or relevance, but these scores are often used only once, for filtering.
Data Efficacy aims to reuse such scores more fully across the training pipeline. In this repository, that shared pipeline is organized around four reusable stages:
- Data Scoring estimates sample-level utility.
- Data Selection chooses useful subsets under a data or compute budget.
- Data Ordering organizes selected samples into an effective training sequence.
- Model Training and Evaluation measure whether the curated data improves downstream performance.
- 2026/05: Added the ACL 2026 follow-up work Demystifying Data Organization for Enhanced LLM Training, with new data organization methods under data_ordering.
- 2025/08: Released the codebase for general-domain pre-training.
- 2025/06: Released Data Efficacy for Language Model Training (DELT) on arXiv.
Demystifying Data Organization for Enhanced LLM Training studies how to organize scored training data (data ordering) and introduces practical guidance on boundary sharpening, cyclic scheduling, curriculum continuity, and local diversity.
Data Efficacy for Language Model Training (DELT) introduces a data efficacy pipeline for language model training that reuses sample-level scores across data scoring, data selection, and data ordering.
.
├── data_scoring/ # Compute sample-level scores, including LQS and KenLM-based scoring.
├── data_selection/ # Select subsets with top-r, top-k, or threshold methods.
├── data_ordering/ # Organize scored data with sorting, folding, zig-zag, segment, STR, and SAW.
├── model_train/ # Train models on curated data.
├── model_eval/ # Evaluate trained models.
├── docs/ # Paper-specific documentation and assets.
└── figures/ # Figures used by repository documentation.
conda create -n data_efficacy python=3.10 -y
conda activate data_efficacy
pip install -r requirements.txt
For lightweight data ordering only, numpy and pyyaml are sufficient.
Environment Variables
export HF_TOKEN="<your_huggingface_token>"
export WANDB_API_KEY="<your_wandb_apikey>"
Dataset
python utils.py --content dataset --id $HF_DATASET_ID --save-dir $OUTPUT_DATA_PATH
# Example:
python utils.py \
--content dataset \
--id togethercomputer/RedPajama-Data-1T \
--save-dir data/source-cc-1b.jsonl \
--data-name common_crawl \
--split-name train \
--sample-size 500000
You can also use your own JSONL dataset.
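As a rough illustration of the JSONL format (one JSON object per line), the sketch below writes and reads such a file. The `text` field name and the `write_jsonl` / `read_jsonl` helpers are assumptions made for this example and are not part of the repository; check the stage configs for the field names the pipeline actually expects.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dicts as one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `write_jsonl([{"text": "first document"}, {"text": "second document"}], "data/my-corpus.jsonl")` produces a two-line file (the path and field name are hypothetical).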
Model
python utils.py --content model --id $HF_MODEL_ID --save-dir $OUTPUT_MODEL_PATH
# Example:
python utils.py \
--content model \
--id Data-Selection/BSL-160M \
--save-dir models/mistral-160m
The repository exposes each stage through a separate entry script. You can run the full scoring-selection-ordering-training pipeline or reuse only the stages needed by a specific paper.
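Each stage reads the file or model written by the previous one. The sketch below makes that chaining explicit by building the five commands in order, using the entry scripts, methods, and example paths that appear in this README. `build_pipeline` and `run` are hypothetical helpers for illustration only and are not shipped with the repository.

```python
import subprocess


def build_pipeline():
    """Return one argument list per stage, chaining each output to the next input."""
    source = "data/source-cc-1b.jsonl"
    scored = "data/source-cc-1b_scored-lqs.jsonl"
    selected = "data/source-cc-1b_scored-lqs_selected-r1.0.jsonl"
    ordered = "data/source-cc-1b_scored-lqs_selected-r1.0_ordered-saw.jsonl"
    base_model = "models/mistral-160m"
    trained = "models/pretrain_mistral-160m_source-cc-1b_ordered-saw"
    return [
        ["bash", "data_scoring/entry.sh", source, scored,
         "lqs", "data_scoring/config/lqs.yaml"],
        ["bash", "data_selection/entry.sh", scored, selected,
         "top-r", "data_selection/config/top-r.yaml"],
        ["bash", "data_ordering/entry.sh", selected, ordered,
         "saw", "data_ordering/config/saw.yaml"],
        ["bash", "model_train/entry.sh", ordered, base_model, trained,
         "pretrain", "model_train/config/train.yaml"],
        ["bash", "model_eval/entry.sh", trained, f"{trained}/result.yaml",
         "lm_evaluation_harness", "model_eval/config/general.yaml"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(cmd, check=True)
```

Calling `run(build_pipeline())` only prints the commands, which is a cheap way to check the paths before launching a long job.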
Data Scoring
Existing scoring methods include Learnability-Quality Score (lqs) and Perplexity (kenlm). For LQS details, see data_scoring/lqs/README.md.
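For intuition about the perplexity score: perplexity is the exponentiated average negative log-likelihood per token. The sketch below assumes per-token log10 probabilities, the base KenLM reports, and uses a hypothetical helper name. It is not the repository's scoring code, and how the repository maps perplexity to a final sample score is not covered here.

```python
def perplexity_from_log10_probs(log10_probs):
    """Perplexity of a sequence given its per-token log10 probabilities."""
    if not log10_probs:
        raise ValueError("need at least one token probability")
    avg_log10 = sum(log10_probs) / len(log10_probs)
    # Lower perplexity means the language model finds the text more predictable.
    return 10.0 ** (-avg_log10)
```

A sequence whose tokens each have probability 0.1 (log10 of -1) gets perplexity 10 under this definition.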
bash data_scoring/entry.sh $INPUT_DATA_PATH $OUTPUT_DATA_PATH $METHOD $CONFIG_PATH
# Example:
bash data_scoring/entry.sh \
data/source-cc-1b.jsonl \
data/source-cc-1b_scored-lqs.jsonl \
lqs \
data_scoring/config/lqs.yaml
Data Selection
Existing selection methods include Top-R (top-r), Threshold (threshold), and Top-K (top-k).
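The three selection rules can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's implementation: samples are dicts with a `score` field, higher scores are better, Top-R keeps a fraction `r` of the data, Top-K keeps `k` samples, and Threshold keeps samples scoring at or above a cutoff. The function names are hypothetical.

```python
def select_top_k(samples, k, key="score"):
    """Keep the k highest-scoring samples, best first."""
    ranked = sorted(samples, key=lambda s: s[key], reverse=True)
    return ranked[:max(k, 0)]


def select_top_r(samples, r, key="score"):
    """Keep the top fraction r of samples (r=1.0 keeps everything)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be in [0, 1]")
    return select_top_k(samples, int(len(samples) * r), key)


def select_threshold(samples, threshold, key="score"):
    """Keep samples whose score is at least the threshold, in input order."""
    return [s for s in samples if s[key] >= threshold]
```

Under this reading, the `selected-r1.0` suffix in the example path below corresponds to keeping the full scored set.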
bash data_selection/entry.sh $INPUT_DATA_PATH $OUTPUT_DATA_PATH $METHOD $CONFIG_PATH
# Example:
bash data_selection/entry.sh \
data/source-cc-1b_scored-lqs.jsonl \
data/source-cc-1b_scored-lqs_selected-r1.0.jsonl \
top-r \
data_selection/config/top-r.yaml
Data Ordering
Existing ordering methods include Sorting (sorting), Folding Ordering (folding), Zig-zag Ordering (zigzag), Segment Ordering (segment), Stair Ordering / STR (stair), Saw Ordering / SAW (saw), and Shuffle (shuffle). For the ACL 2026 data organization work, see Demystifying Data Organization for Enhanced LLM Training.
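To make the idea of ordering concrete, here is a minimal sketch of three of the simpler strategies. It assumes samples are dicts with a `score` field and that sorting is ascending. The folding variant is one plausible reading (split the sorted list into interleaved folds, each ascending, and concatenate them) and may differ from the repository's implementation. Zig-zag, segment, STR, and SAW are not sketched because their definitions are not given here. All function names are hypothetical.

```python
import random


def order_sorting(samples, key="score", descending=False):
    """Sort samples by score (ascending by default)."""
    return sorted(samples, key=lambda s: s[key], reverse=descending)


def order_shuffle(samples, seed=0):
    """Return a reproducibly shuffled copy, ignoring scores."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def order_folding(samples, num_folds, key="score"):
    """Concatenate num_folds interleaved passes, each ascending in score."""
    if num_folds < 1:
        raise ValueError("num_folds must be at least 1")
    ranked = order_sorting(samples, key)
    ordered = []
    for fold in range(num_folds):
        # Every num_folds-th sample starting at `fold` forms one pass.
        ordered.extend(ranked[fold::num_folds])
    return ordered
```

With scores 0 through 5 and two folds, this folding sketch yields the score sequence 0, 2, 4, 1, 3, 5: two low-to-high passes instead of one.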
bash data_ordering/entry.sh $INPUT_DATA_PATH $OUTPUT_DATA_PATH $METHOD $CONFIG_PATH
# Example:
bash data_ordering/entry.sh \
data/source-cc-1b_scored-lqs_selected-r1.0.jsonl \
data/source-cc-1b_scored-lqs_selected-r1.0_ordered-saw.jsonl \
saw \
data_ordering/config/saw.yaml
Model Training
bash model_train/entry.sh $INPUT_DATA_PATH $INPUT_MODEL_PATH $OUTPUT_MODEL_PATH $METHOD $CONFIG_PATH
# Example:
bash model_train/entry.sh \
data/source-cc-1b_scored-lqs_selected-r1.0_ordered-saw.jsonl \
models/mistral-160m \
models/pretrain_mistral-160m_source-cc-1b_ordered-saw \
pretrain \
model_train/config/train.yaml
Model Evaluation
bash model_eval/entry.sh $INPUT_MODEL_PATH $OUTPUT_RESULT_PATH $METHOD $CONFIG_PATH
# Example:
bash model_eval/entry.sh \
models/pretrain_mistral-160m_source-cc-1b_ordered-saw \
models/pretrain_mistral-160m_source-cc-1b_ordered-saw/result.yaml \
lm_evaluation_harness \
model_eval/config/general.yaml
@article{dai2025data,
title={Data Efficacy for Language Model Training},
author={Yalun Dai and Yangyu Huang and Xin Zhang and Wenshan Wu and Chong Li and Wenhui Lu and Shijie Cao and Li Dong and Scarlett Li},
journal={arXiv preprint arXiv:2506.21545},
year={2025}
}
@inproceedings{dai2026demystifying,
title={Demystifying Data Organization for Enhanced LLM Training},
author={Yalun Dai and Yangyu Huang and Tongshen Yang and Yonghan Wang and Xin Zhang and Wenshan Wu and Qihao Zhao and Hao Li and Yuanyuan Gao and Kim-Hui Yap and Scarlett Li},
booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics},
year={2026}
}
This repository is licensed under the MIT License.
