Description
Tootsie Rolls famously use a graining process where the previous day's batch is folded into the next day's, resulting in consistency, or something.
We anticipate we'll be constantly improving our data mixture. We may as well get a main run started and improve the mixture as it progresses, rather than starting from scratch. By using WSD-S, we're basically just warm-starting anyway whenever we switch things up.
Hypothesis or Goal
- Verify WSD-S works at this scale
- Investigate folding in new data as this run progresses, using our best guess at what good data is as the run progresses.
- (Maybe) is it better to fold in data: right after minima, or middle of cycle? My guess is mid-cycle.
Links
(Delete any that aren't applicable)
Results
See comment stream for details.

Description
Tootsie Rolls famously use a graining process where the previous day's batch is folded into the next day's, resulting in consistency, or something.
We anticipate we'll be constantly improving our data mixture. We may as well get a main run started and improve the mixture as it progresses, rather than starting from scratch. By using WSD-S, we're basically just warm-starting anyway whenever we switch things up.
Hypothesis or Goal
Links
(Delete any that aren't applicable)
Results
See comment stream for details.