Description
We want to understand how to order pretraining data when the pool of available data is fixed.
As a starting point, consider curriculum learning for transfer: given a few samples from a target domain D_1 and samples from a less related domain D_2, what ordering of the samples minimizes test loss on D_1? How does the answer vary with the number of samples from D_1 versus D_2?
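A minimal sketch of one way to generate candidate orderings for this question, not a settled experimental design. The function name `biased_order`, the `bias` knob, and the power-law skew are all illustrative assumptions: each sample gets a random sort key in [0, 1], and keys for D_1 samples are skewed toward 1 so that D_1 tends to land later in training. With `bias=0` the result is an ordinary uniform shuffle.

```python
import random


def biased_order(n_d1, n_d2, bias, seed=0):
    """Return one domain label ("D1" or "D2") per training position.

    bias is a hypothetical knob: 0 gives a uniform shuffle, larger
    values push D_1 samples toward the end of training.
    """
    rng = random.Random(seed)
    keyed = []
    for _ in range(n_d1):
        # u ** (1 / (1 + bias)) moves toward 1 as bias grows.
        keyed.append((rng.random() ** (1.0 / (1.0 + bias)), "D1"))
    for _ in range(n_d2):
        # D_2 keys stay uniform on [0, 1).
        keyed.append((rng.random(), "D2"))
    keyed.sort(key=lambda pair: pair[0])
    return [label for _, label in keyed]


def mean_position(order, label):
    """Average normalized position (0 = start, 1 = end) of a domain."""
    positions = [i for i, x in enumerate(order) if x == label]
    return sum(positions) / len(positions) / (len(order) - 1)


uniform = biased_order(n_d1=100, n_d2=900, bias=0.0)
late = biased_order(n_d1=100, n_d2=900, bias=50.0)
print(mean_position(uniform, "D1"), mean_position(late, "D1"))
```

Sweeping `bias` and the ratio `n_d1 / n_d2`, then training on each ordering, would give one curve of D_1 test loss per sample ratio.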
Hypothesis or Goal
Hypothesis 1: To reduce test loss on D_1, it is beneficial to bias D_1 samples toward the end of training.
Hypothesis 2: The benefit of placing D_1 samples closer to the end of training decreases as the number of D_1 samples grows.
Goal: How much we care about a domain is not proportional to its frequency in the data. Suppose we have a utility function over domains (induced by correlation with downstream benchmarks, or by a non-linear function of their sample counts). We should then be able to design curricula that bias the appropriate domains toward the end of training. Along the way, we build a deeper understanding of which data needs to be present throughout training and which data can be treated as fine-tuning data.
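As a rough illustration of the goal, the sketch below turns a utility function over domains into a per-domain bias toward the end of training. The rule is a hypothetical placeholder, not a proposed method: a domain gets a positive bias only when its share of total utility exceeds its share of samples. The name `utility_to_bias`, the `max_bias` cap, and the example domains and numbers are all assumptions made for illustration.

```python
def utility_to_bias(utility, counts, max_bias=10.0):
    """Map per-domain utility to a late-training bias (hypothetical rule).

    utility: dict of domain -> non-negative utility weight
    counts:  dict of domain -> number of available samples
    """
    total_utility = sum(utility.values())
    total_count = sum(counts.values())
    bias = {}
    for domain in counts:
        utility_share = utility[domain] / total_utility
        sample_share = counts[domain] / total_count
        # A ratio above 1 means the domain matters more than its frequency suggests.
        ratio = utility_share / sample_share
        bias[domain] = min(max_bias, max(0.0, ratio - 1.0))
    return bias


# Made-up example: a small domain that carries half of the utility.
example_bias = utility_to_bias(
    utility={"small_domain": 0.5, "large_domain": 0.5},
    counts={"small_domain": 100, "large_domain": 900},
)
print(example_bias)
```

Under this rule, a domain whose utility share matches its sample share gets zero bias and is simply mixed uniformly through training.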
Links
Results
TODO