Experiment: Audio model - Unimodal data mix #1978

@potsawee

Description

The first audio experiment #1699 (600M model, 500B tokens) yields a reasonable audio model, but it lacks semantic knowledge. We're experimenting with adding unimodal data (e.g., text-only DCLM) into the pre-training data mix. We're going to sweep the percentage of text-only data over 0%, 10%, 20%, and 50% at a smaller scale (e.g., 150M model, 100B tokens).
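A minimal sketch of the sweep grid described above. All names (`speech_text`, `dclm_text_only`, the config fields) are hypothetical placeholders, not the actual experiment configuration:

```python
# Hypothetical sweep grid: each config mixes the paired (speech, text)
# corpus with text-only DCLM at a given fraction, at 150M/100B scale.
TEXT_ONLY_FRACTIONS = [0.0, 0.1, 0.2, 0.5]

def mix_weights(text_only_fraction: float) -> dict[str, float]:
    """Return normalized sampling weights for the two data sources."""
    assert 0.0 <= text_only_fraction < 1.0
    return {
        "speech_text": 1.0 - text_only_fraction,  # paired (speech, text) data
        "dclm_text_only": text_only_fraction,     # unimodal text (DCLM)
    }

configs = [
    {
        "model_size": "150m",             # smaller than #1699's 600M model
        "total_tokens": 100_000_000_000,  # 100B tokens vs. 500B in #1699
        "data_weights": mix_weights(f),
    }
    for f in TEXT_ONLY_FRACTIONS
]
```

The 0% run serves as the baseline matching the original unimodal-audio mix, so any gain on T->T and S->T can be attributed to the added text-only data.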

cc. @Helw150

Hypothesis or Goal

Find an optimal ratio between (speech, text) and text-only data for pre-training, and understand the impact of the amount of text-only data on S->S, T->T, S->T, and T->S performance.

Links

(Delete any that aren't applicable.)

  • WandB Report: (link)
  • Data Browser: (link)
  • (etc.)

Results

(What did you find, including relevant evaluation metrics, etc.)
