microsoft/data-efficacy

Data Efficacy

Large-scale model training benefits from data at scale, but the value of a dataset also depends on how effectively it is used. Data Efficacy studies how to turn available data into stronger training signal by scoring samples, selecting useful subsets, and organizing them into effective training sequences.

Introduction

Large-scale model training depends heavily on data curation. Many data efficiency methods compute expensive sample-level scores for quality, difficulty, learnability, or relevance, but these scores are often used only once for filtering.

Data Efficacy aims to reuse such scores more fully across the training pipeline. In this repository, the pipeline is organized around four reusable stages:

  • Data Scoring estimates sample-level utility.
  • Data Selection chooses useful subsets under a data or compute budget.
  • Data Ordering organizes selected samples into an effective training sequence.
  • Model Training and Evaluation train a model on the curated data and measure whether it improves downstream performance.

Data efficacy pipeline

News

  • 2026/05: Added the ACL 2026 follow-up work Demystifying Data Organization for Enhanced LLM Training, with new data organization methods under data_ordering.
  • 2025/08: Released the codebase for general-domain pre-training.
  • 2025/06: Released Data Efficacy for Language Model Training (DELT) on arXiv.

Works

Demystifying Data Organization for Enhanced LLM Training (Paper | README)

This work studies how to organize scored training data (data ordering) and introduces practical guidance on boundary sharpening, cyclic scheduling, curriculum continuity, and local diversity.

Data Efficacy for Language Model Training (Paper | README)

This work introduces a data efficacy pipeline for language model training that reuses sample-level scores across data scoring, data selection, and data ordering.

Repo Structure

.
├── data_scoring/      # Compute sample-level scores, including LQS and KenLM-based scoring.
├── data_selection/    # Select subsets with top-r, top-k, or threshold methods.
├── data_ordering/     # Organize scored data with sorting, folding, zig-zag, segment, STR, and SAW.
├── model_train/       # Train models on curated data.
├── model_eval/        # Evaluate trained models.
├── docs/              # Paper-specific documentation and assets.
└── figures/           # Figures used by repository documentation.

Installation

conda create -n data_efficacy python=3.10 -y
conda activate data_efficacy
pip install -r requirements.txt

For lightweight data ordering only, numpy and pyyaml are sufficient.

Preparation

Environment Variables

export HF_TOKEN="<your_huggingface_token>"
export WANDB_API_KEY="<your_wandb_apikey>"

Dataset

python utils.py --content dataset --id $HF_DATASET_ID --save-dir $OUTPUT_DATA_PATH

# Example:
python utils.py \
  --content dataset \
  --id togethercomputer/RedPajama-Data-1T \
  --save-dir data/source-cc-1b.jsonl \
  --data-name common_crawl \
  --split-name train \
  --sample-size 500000

You can also use your own JSONL dataset.
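
If you bring your own data, the following is a minimal sketch of writing and reading a JSONL file (one JSON object per line). The `text` field name and the `write_jsonl` / `read_jsonl` helpers are assumptions made for illustration; check the scoring configs for the field names the repository actually expects.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (the JSONL convention)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; "text" is an assumed field name.
example_records = [{"text": "First document."}, {"text": "Second document."}]
```

Calling `write_jsonl(example_records, "data/my-corpus.jsonl")` would produce a file that can be passed as `$INPUT_DATA_PATH` to the later stages, provided its fields match what the chosen scoring method reads.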

Model

python utils.py --content model --id $HF_MODEL_ID --save-dir $OUTPUT_MODEL_PATH

# Example:
python utils.py \
  --content model \
  --id Data-Selection/BSL-160M \
  --save-dir models/mistral-160m

Pipeline Usage

The repository exposes each stage through a separate entry script. You can run the full scoring-selection-ordering-training pipeline or reuse only the stages needed by a specific paper.
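
Each entry script takes its arguments positionally, and the output path of one stage becomes the input path of the next. The sketch below shows that chaining for the three data stages. The `build_data_stage_commands` helper, its file-naming scheme, and the `<stage>/config/<method>.yaml` pattern are hypothetical, modeled loosely on the example paths in this README; the commands are only built here, not executed.

```python
from pathlib import PurePosixPath

# Stage directories and the suffix tag each stage adds to the file name.
STAGES = (
    ("data_scoring", "scored"),
    ("data_selection", "selected"),
    ("data_ordering", "ordered"),
)


def build_data_stage_commands(source, score_method, select_method, order_method):
    """Return one argv list per data stage, wiring outputs to inputs.

    Hypothetical helper: the naming scheme is an assumption, not the
    repository's convention.
    """
    methods = (score_method, select_method, order_method)
    base = str(PurePosixPath(source).with_suffix(""))
    current_input = str(source)
    commands = []
    for (stage, tag), method in zip(STAGES, methods):
        base = f"{base}_{tag}-{method}"
        output = f"{base}.jsonl"
        commands.append([
            "bash", f"{stage}/entry.sh",
            current_input, output, method,
            f"{stage}/config/{method}.yaml",  # assumed config location
        ])
        current_input = output  # the next stage reads this stage's output
    return commands


commands = build_data_stage_commands("data/source-cc-1b.jsonl", "lqs", "top-r", "saw")
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)` from the repository root to run the stages in sequence.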

Data Scoring

Existing scoring methods include Learnability-Quality Score (lqs) and Perplexity (kenlm). For LQS details, see data_scoring/lqs/README.md.
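
For intuition about the perplexity-based score, here is a minimal sketch of how perplexity follows from per-token log probabilities. It assumes base-10 log probabilities, which is the convention KenLM uses. The `perplexity_from_log10_probs` helper is illustrative only and is not the repository's scoring code.

```python
def perplexity_from_log10_probs(log10_probs):
    """Perplexity = 10 ** (-(mean log10 probability per token))."""
    if not log10_probs:
        raise ValueError("need at least one token log probability")
    mean_log10 = sum(log10_probs) / len(log10_probs)
    return 10 ** (-mean_log10)


# Four tokens, each assigned probability 0.1 (log10 = -1.0) by the model.
example_ppl = perplexity_from_log10_probs([-1.0, -1.0, -1.0, -1.0])
```

Lower perplexity means the text is more predictable under the language model, so a perplexity-based score ranks samples by how closely they resemble the data the language model was trained on.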

bash data_scoring/entry.sh $INPUT_DATA_PATH $OUTPUT_DATA_PATH $METHOD $CONFIG_PATH

# Example:
bash data_scoring/entry.sh \
  data/source-cc-1b.jsonl \
  data/source-cc-1b_scored-lqs.jsonl \
  lqs \
  data_scoring/config/lqs.yaml

Data Selection

Existing selection methods include Top-R (top-r), Threshold (threshold), and Top-K (top-k).
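
As a rough illustration of what these three methods do, the sketch below selects from a list of scored samples. It assumes that higher scores are better, that each sample is a dict with a `score` field, and that the threshold is inclusive. These are assumptions for illustration, and the function names are hypothetical; the scripts under data_selection/ may differ in detail.

```python
import math


def select_top_k(samples, k, key="score"):
    """Keep the k highest-scoring samples, best first."""
    ranked = sorted(samples, key=lambda s: s[key], reverse=True)
    return ranked[:max(k, 0)]


def select_top_r(samples, r, key="score"):
    """Keep the top fraction r (0 < r <= 1) of samples by score."""
    if not 0 < r <= 1:
        raise ValueError("r must be in (0, 1]")
    return select_top_k(samples, math.ceil(r * len(samples)), key)


def select_threshold(samples, threshold, key="score"):
    """Keep samples scoring at least `threshold`, in their input order."""
    return [s for s in samples if s[key] >= threshold]


# Toy scored samples; ids and scores are made up.
toy_samples = [
    {"id": 0, "score": 0.2},
    {"id": 1, "score": 0.9},
    {"id": 2, "score": 0.5},
    {"id": 3, "score": 0.7},
]
```

Top-K fixes the absolute number of samples, Top-R fixes the fraction of the dataset, and Threshold fixes the minimum score, so the size of its output depends on the score distribution.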

bash data_selection/entry.sh $INPUT_DATA_PATH $OUTPUT_DATA_PATH $METHOD $CONFIG_PATH

# Example:
bash data_selection/entry.sh \
  data/source-cc-1b_scored-lqs.jsonl \
  data/source-cc-1b_scored-lqs_selected-r1.0.jsonl \
  top-r \
  data_selection/config/top-r.yaml

Data Ordering

Existing ordering methods include Sorting (sorting), Folding Ordering (folding), Zig-zag Ordering (zigzag), Segment Ordering (segment), Stair Ordering / STR (stair), Saw Ordering / SAW (saw), and Shuffle (shuffle). For the ACL 2026 data organization work, see Demystifying Data Organization for Enhanced LLM Training.
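
To make the simplest of these concrete, the sketch below shows sorting and one plausible reading of folding. It assumes sorting is ascending by score and that folding splits the sorted list into interleaved folds, each of which is ascending and spans the full score range. Both are assumptions, and the function names are hypothetical; the other methods are not sketched here, and the code under data_ordering/ is the reference for all of them.

```python
def order_sorting(samples, key="score"):
    """Sort samples by ascending score (assumed direction)."""
    return sorted(samples, key=lambda s: s[key])


def order_folding(samples, num_folds, key="score"):
    """Concatenate interleaved folds of the sorted list.

    Fold i takes sorted positions i, i + num_folds, i + 2 * num_folds, ...
    so every fold is ascending and covers low through high scores.
    """
    if num_folds < 1:
        raise ValueError("num_folds must be at least 1")
    ranked = order_sorting(samples, key)
    ordered = []
    for i in range(num_folds):
        ordered.extend(ranked[i::num_folds])
    return ordered


# Toy samples with made-up scores 5, 1, 4, 2, 3, 6.
toy_scored = [{"score": s} for s in (5, 1, 4, 2, 3, 6)]
folded_scores = [s["score"] for s in order_folding(toy_scored, num_folds=2)]
```

With two folds, the toy scores come out as 1, 3, 5 followed by 2, 4, 6: the model sees the low-to-high progression twice instead of once. With a single fold the result is identical to plain sorting.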

bash data_ordering/entry.sh $INPUT_DATA_PATH $OUTPUT_DATA_PATH $METHOD $CONFIG_PATH

# Example:
bash data_ordering/entry.sh \
  data/source-cc-1b_scored-lqs_selected-r1.0.jsonl \
  data/source-cc-1b_scored-lqs_selected-r1.0_ordered-saw.jsonl \
  saw \
  data_ordering/config/saw.yaml

Model Training

bash model_train/entry.sh $INPUT_DATA_PATH $INPUT_MODEL_PATH $OUTPUT_MODEL_PATH $METHOD $CONFIG_PATH

# Example:
bash model_train/entry.sh \
  data/source-cc-1b_scored-lqs_selected-r1.0_ordered-saw.jsonl \
  models/mistral-160m \
  models/pretrain_mistral-160m_source-cc-1b_ordered-saw \
  pretrain \
  model_train/config/train.yaml

Model Evaluation

bash model_eval/entry.sh $INPUT_MODEL_PATH $OUTPUT_RESULT_PATH $METHOD $CONFIG_PATH

# Example:
bash model_eval/entry.sh \
  models/pretrain_mistral-160m_source-cc-1b_ordered-saw \
  models/pretrain_mistral-160m_source-cc-1b_ordered-saw/result.yaml \
  lm_evaluation_harness \
  model_eval/config/general.yaml

Citation

@article{dai2025data,
  title={Data Efficacy for Language Model Training},
  author={Yalun Dai and Yangyu Huang and Xin Zhang and Wenshan Wu and Chong Li and Wenhui Lu and Shijie Cao and Li Dong and Scarlett Li},
  journal={arXiv preprint arXiv:2506.21545},
  year={2025}
}

@inproceedings{dai2026demystifying,
  title={Demystifying Data Organization for Enhanced LLM Training},
  author={Yalun Dai and Yangyu Huang and Tongshen Yang and Yonghan Wang and Xin Zhang and Wenshan Wu and Qihao Zhao and Hao Li and Yuanyuan Gao and Kim-Hui Yap and Scarlett Li},
  booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics},
  year={2026}
}

License

This repository is licensed under the MIT License.
