Description
The first audio experiment #1699 (600M model, 500B tokens) yields a reasonable audio model, but it lacks semantic knowledge. We're experimenting with adding unimodal data (e.g., text-only DCLM) into the pre-training data mix. We're going to sweep the percentage of text-only data over 0%, 10%, 20%, and 50% at a smaller scale (e.g., 150M model, 100B tokens).
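As a rough sketch, the sweep amounts to a small grid of mixture weights over the two data sources. The names below (`speech_text_pairs`, `text_only_dclm`, the config dict shape) are illustrative, not the actual training config schema:

```python
# Hypothetical sketch of the sweep grid: sampling weights for
# (speech, text) paired data vs. text-only DCLM.
TEXT_ONLY_FRACTIONS = [0.0, 0.1, 0.2, 0.5]

def mixture_weights(text_only_frac: float) -> dict:
    """Return sampling weights for the two data sources (sum to 1)."""
    return {
        "speech_text_pairs": 1.0 - text_only_frac,
        "text_only_dclm": text_only_frac,
    }

# One run config per fraction, e.g. "mix_20pct_text".
sweep = {
    f"mix_{int(f * 100)}pct_text": mixture_weights(f)
    for f in TEXT_ONLY_FRACTIONS
}
```

Each entry would then be paired with the same 150M-model, 100B-token training setup so only the data mix varies between runs.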
cc. @Helw150
Hypothesis or Goal
Find an optimal ratio between (speech, text) paired data and text-only data for pre-training, and understand the impact of the amount of text-only data on S->S, T->T, S->T, and T->S performance.
Links
(Delete any that aren't applicable.)
- WandB Report: (link)
- Data Browser: (link)
- (etc.)
Results
(What did you find, including relevant evaluation metrics, etc.)