Extending Lhotse dataloading to text/multimodal data #1295

Merged
pzelasko merged 9 commits into master from feature/non-cut-sampling
Mar 7, 2024

@pzelasko pzelasko commented Mar 5, 2024

This PR adds very basic support for incorporating text-only data into Lhotse samplers to enable text and multimodal dataloading. Highlights:

  • new ABC SamplingConstraint that generalizes TimeConstraint and allows creating other types of constraints, which decide when to stop sampling a mini-batch as well as how to determine the "size" of an example (e.g. for audio it's the duration, but for text it may be something like the number of tokens)
  • dynamic samplers have a new argument called constraint where SamplingConstraint instances may be passed directly
  • TokenConstraint which is almost identical to TimeConstraint but uses num_tokens / max_tokens
  • very basic dataclass TextExample that wraps text/tokens; CutSet can be used to yield those by passing it a text iterator, e.g. CutSet(text_example_iter) (it's not super clean but it works; I'm trying to figure out if we can make this cleaner)
  • unit tests illustrating how to use this for text dataloading and even for mixed modality dataloading (text and audio data together in a mini-batch)
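
To make the constraint idea above concrete, here is a minimal, self-contained sketch in plain Python. It does not use Lhotse and is not the implementation from this PR: the names `Constraint`, `TokenBudget`, `TextItem`, and `batches` are hypothetical, chosen only to illustrate how a constraint can both measure the "size" of an example and decide when a mini-batch is full.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Iterable, Iterator, List


@dataclass
class TextItem:
    """Hypothetical text example: raw text plus its token IDs."""
    text: str
    tokens: List[int] = field(default_factory=list)


class Constraint(ABC):
    """Illustrative interface only; method names are assumptions, not Lhotse's API."""

    @abstractmethod
    def measure(self, example: Any) -> float:
        """Return the 'size' of a single example."""

    @abstractmethod
    def add(self, example: Any) -> None:
        """Account for an example being added to the current mini-batch."""

    @abstractmethod
    def exceeded(self) -> bool:
        """Return True once the current mini-batch is over budget."""

    @abstractmethod
    def reset(self) -> None:
        """Start accounting for a new mini-batch."""


class TokenBudget(Constraint):
    """Caps the total number of tokens in a mini-batch."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.current = 0

    def measure(self, example: Any) -> float:
        return len(example.tokens)

    def add(self, example: Any) -> None:
        self.current += self.measure(example)

    def exceeded(self) -> bool:
        return self.current > self.max_tokens

    def reset(self) -> None:
        self.current = 0


def batches(examples: Iterable[Any], constraint: Constraint) -> Iterator[List[Any]]:
    """Greedily group examples, closing a batch before the budget is exceeded."""
    batch: List[Any] = []
    constraint.reset()
    for example in examples:
        constraint.add(example)
        if constraint.exceeded() and batch:
            # The new example does not fit: emit the batch and start a fresh one.
            yield batch
            batch = []
            constraint.reset()
            constraint.add(example)
        batch.append(example)
    if batch:
        yield batch


if __name__ == "__main__":
    data = [TextItem("x", list(range(n))) for n in (3, 4, 5, 2)]
    for b in batches(data, TokenBudget(max_tokens=8)):
        print([len(item.tokens) for item in b])
```

A duration-based budget for audio would differ only in `measure`, which is the point of factoring the constraint out of the sampler. Note that in this sketch a single example larger than the budget is still emitted as a batch of its own.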

This is stretching the original scope of Lhotse a bit, but I feel like it's worth it: we accumulated a bunch of solid techniques here and it'd be a pity to have to use something completely different for multimodal modeling, especially when so few changes are required to make it work here. Would love to know your thoughts @danpovey @csukuangfj @desh2608 @m-wiesner

@pzelasko pzelasko added this to the 1.22.0 milestone Mar 5, 2024
@pzelasko pzelasko merged commit d0521a7 into master Mar 7, 2024
11 checks passed
@pzelasko pzelasko deleted the feature/non-cut-sampling branch March 7, 2024 16:16

pzelasko commented Mar 7, 2024

Note that this is an experimental feature: let us know if you're running into issues with this.
