Studying how data ordering can improve pretraining #702

@kothasuhas

Description

We are interested in how best to order pretraining data when the pool of available data is fixed.

To start, we can consider curriculum learning for transfer: given a few samples from D_1 and additional samples from a less related D_2, what ordering of samples minimizes test loss on D_1? How does the answer vary with the number of samples from D_1 relative to D_2?

Hypothesis or Goal

Hypothesis 1: To improve test loss on D_1, it will be beneficial to bias D_1 samples toward the end of training.
Hypothesis 2: The benefit of keeping D_1 samples closer to the end of training will decrease as the number of D_1 samples grows.
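
To make the two hypotheses concrete, here is a minimal sketch of one way to generate orderings that interpolate between a uniform shuffle and "all D_1 at the end". The function names (`biased_order`, `mean_d1_position`) and the `bias` parameter are hypothetical and only for illustration; this is not a proposed implementation, and other interpolation schemes would work equally well.

```python
import random

def biased_order(n_d1, n_d2, bias, seed=0):
    """Return a list of "D1"/"D2" labels, one per training sample.

    bias = 0.0 gives a uniformly shuffled mix; bias = 1.0 places every
    D_1 sample after every D_2 sample. Values in between interpolate.
    (Hypothetical scheme: each sample gets a random sort key, and D_1
    keys are squeezed into the interval [bias, 1].)
    """
    rng = random.Random(seed)
    keyed = []
    for _ in range(n_d1):
        keyed.append((bias + (1.0 - bias) * rng.random(), "D1"))
    for _ in range(n_d2):
        # D_2 keys stay uniform over [0, 1).
        keyed.append((rng.random(), "D2"))
    keyed.sort()
    return [label for _, label in keyed]

def mean_d1_position(order):
    """Average normalized position (0 = start, 1 = end) of D_1 samples."""
    positions = [i for i, label in enumerate(order) if label == "D1"]
    return sum(positions) / len(positions) / (len(order) - 1)

uniform = biased_order(n_d1=100, n_d2=900, bias=0.0)
late = biased_order(n_d1=100, n_d2=900, bias=1.0)
print(mean_d1_position(uniform), mean_d1_position(late))
```

Under this sketch, testing Hypothesis 1 would amount to sweeping `bias` at a fixed `n_d1` and comparing D_1 test loss, and testing Hypothesis 2 would repeat the sweep while increasing `n_d1`.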

Goal: How much we care about a domain is not proportional to its frequency in the data. Suppose we have a utility function over domains (induced by correlation with downstream benchmarks, or by a non-linear function of their sample counts). We should then be able to design curricula that bias the appropriate domains toward the end of training. Along the way, we aim to build a deeper understanding of which data needs to be present throughout training and which data can be treated as fine-tuning data.
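
As a rough illustration of the goal, the sketch below turns a utility function over domains into a per-domain "late bias". Everything here is an assumption made for illustration: the domain names, the utility values, and the rule itself (the ratio of utility to data share, min-max normalized to [0, 1]). The issue does not specify how utilities should map to a curriculum.

```python
import random

def domain_biases(counts, utility):
    """Map each domain to a bias in [0, 1] (hypothetical rule).

    A domain whose utility is large relative to its share of the data
    gets a bias near 1 (pushed toward the end of training); a domain
    whose utility is small relative to its share gets a bias near 0.
    """
    total = sum(counts.values())
    ratio = {d: utility[d] / (counts[d] / total) for d in counts}
    lo, hi = min(ratio.values()), max(ratio.values())
    if hi == lo:
        # All domains are equally important per sample: no bias.
        return {d: 0.0 for d in counts}
    return {d: (ratio[d] - lo) / (hi - lo) for d in counts}

def curriculum(counts, utility, seed=0):
    """Return a list of domain labels, one per training sample."""
    rng = random.Random(seed)
    biases = domain_biases(counts, utility)
    keyed = []
    for d, n in counts.items():
        for _ in range(n):
            # Sort keys for domain d are drawn from [biases[d], 1].
            keyed.append((biases[d] + (1.0 - biases[d]) * rng.random(), d))
    keyed.sort()
    return [d for _, d in keyed]

# Made-up sample counts and utilities, for illustration only.
counts = {"web": 900, "code": 90, "math": 10}
utility = {"web": 0.5, "code": 0.3, "math": 0.2}
order = curriculum(counts, utility)
print(domain_biases(counts, utility))
print(order[-10:])
```

With these made-up numbers, the smallest domain has the highest utility per sample and so lands entirely at the end, which resembles treating it as fine-tuning data, while the largest domain stays spread throughout training.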

Links

Results

TODO
