Adding utilities to chunk large audio files and read directly from microphone #14250
Conversation
The main goal is to ease the creation of streaming data into the pipeline. `Dataset` is more involved and PyTorch-specific. This PR provides a way to use a plain Python iterator too. This enables huggingface#14250 but can be proposed as a standalone PR.

```python
from transformers import pipeline

def read_data(filename):
    with open(filename, "r") as f:
        for line in f:
            yield line

pipe = pipeline("text-classification")
for classified in pipe(read_data("large_file.txt")):
    print("Success!", classified)
```

The main caveat is the interaction with `DataLoader` when `num_workers > 1`. With multiple workers, each receives a copy of the generator (as with `IterableDataset`), so a naive iterator fails: all workers iterate over all items. There are ways to do clever "skipping", but every worker still has to pass over every item of the generator (it just ignores the items it doesn't handle), which can still be costly depending on the case. Using `num_workers=1` is the simplest fix, and if the cost of loading your data is small enough it should be good enough. In the example above, smart tricks to skip some lines are unlikely to be a net positive. If there is a cheap way to do "jumps" within the data, then using a `Dataset` is more advisable (since different workers can then jump by themselves).
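The "clever skipping" mentioned above can be illustrated with a small, hypothetical helper (not part of this PR; `shard_generator` and its parameters are illustrative names): each worker consumes the whole stream but only keeps every `num_workers`-th item, so the per-item work is divided while the iteration cost is not.

```python
def shard_generator(gen, worker_id, num_workers):
    """Naive skipping: every worker still iterates over ALL items,
    but only yields the ones assigned to it (round-robin)."""
    for i, item in enumerate(gen):
        if i % num_workers == worker_id:
            yield item

# Each of two workers keeps half the items, but both scan the full stream.
worker_0 = list(shard_generator(iter(range(10)), worker_id=0, num_workers=2))
worker_1 = list(shard_generator(iter(range(10)), worker_id=1, num_workers=2))
```

Inside an `IterableDataset`, `worker_id` and `num_workers` would come from `torch.utils.data.get_worker_info()`; the scan-everything cost is exactly why `num_workers=1` is often the better trade-off here.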
…or pipelines (#14352)

* Adding support for raw python `generator` in addition to `Dataset`
* Adding iterator support for `tf` too.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Reopening since I am using stuff from this PR for testing.
Awesome that you found a way to stream audio!!! If possible I'd be happy to first get #14896 (review) merged to enable offline decoding of very large files, and once that's done it would be great to tackle the online streaming case. Could we maybe put the VAD utilities in a new separate PR, since in my opinion it is a bit unrelated to "online streaming", as explained here: #14896 (comment)?
This PR adds utilities to chunk large audio files and to read directly from the microphone:

- `chunk_files`: requires `scipy` only, relatively straightforward.
- `vad_files`: requires `webrtcvad`; chunks both on maximum length and when voice is inactive (good potential if there is lots of silence, but it might miss large portions of the audio).
- Both require `ffmpeg` too; maybe move to `av` (the initial implementation was 4x slower and more complex though).
- `ffmpeg_microphone`: streams audio from the machine's microphone. No streaming yet (meaning temporary results while the whole frame is being processed), but it should be relatively easy to add and is probably pipeline-agnostic.

All these functions have many knobs to turn which can affect the end result quite drastically, so there are no "sane" defaults (afaik). For now they are explicitly separate from the core of the pipeline, meaning they are likely to change; they are simply meant as helper functions to keep APIs simple even on more challenging data, and to make demos easy. Another benefit is that we can expose all those knobs without exploding the pipeline's complexity (they are not exposed yet).

Current defaults yield 47 WER on AMI for both `chunk_files` and `vad_files` with `facebook/wav2vec2-base-960h`, which is on par with expectations.

Squashed commits:

- Remove work.
- Unused kwarg.
- Enabling streaming mode for the microphone.
- Fixing on MacOS: we need to capture only the microphone.
- Style after rebase.
- Stride and chunking outside of pipeline.
- Final `chunk_length_s`?
- Protect tests with ffmpeg check.
- Fixing QA (batching padding values for `p_mask` is special).
- isort.
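As a rough sketch of the "stride and chunking outside of pipeline" idea (names and parameters are illustrative, not this PR's actual API): split a long sample sequence into fixed-length chunks where consecutive chunks overlap by `stride` samples, so the model sees context at the chunk boundaries.

```python
def chunk_audio(samples, chunk_len, stride):
    """Yield overlapping chunks; consecutive chunks share `stride` samples."""
    if stride >= chunk_len:
        raise ValueError("stride must be smaller than chunk_len")
    step = chunk_len - stride
    for start in range(0, max(len(samples) - stride, 1), step):
        yield samples[start:start + chunk_len]

chunks = list(chunk_audio(list(range(10)), chunk_len=4, stride=2))
# Consecutive chunks overlap on two samples, e.g. chunks[0] and chunks[1] share [2, 3].
```

In a real ASR pipeline the overlapping (strided) regions would be decoded but discarded when stitching the transcriptions back together.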
unstale
Looking now!
```python
def ffmpeg_microphone(
    sampling_rate: int, format_for_conversion: str, chunk_length_s: float, stream_chunk_ms: Optional[int] = None
```
Could we maybe have `stream_chunk_ms` in seconds as well @Narsil, given that everything else is in seconds?
Also, could we make use of the stride logic here?
```python
if stream_chunk_ms is not None:
    buflen = int(round(sampling_rate * stream_chunk_ms * size_of_sample / 1000))
```
Suggested change:

```diff
- buflen = int(round(sampling_rate * stream_chunk_ms * size_of_sample / 1000))
+ buflen = int(round(sampling_rate * stream_chunk_s * size_of_sample))
```
would be nicer IMO
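The two formulas are numerically equivalent when `stream_chunk_s = stream_chunk_ms / 1000`; a quick check with assumed values (16 kHz, s16le i.e. 2 bytes per sample, 250 ms chunks):

```python
sampling_rate = 16000   # Hz (assumed)
size_of_sample = 2      # bytes per sample for s16le (assumed)
stream_chunk_ms = 250
stream_chunk_s = stream_chunk_ms / 1000

buflen_ms = int(round(sampling_rate * stream_chunk_ms * size_of_sample / 1000))
buflen_s = int(round(sampling_rate * stream_chunk_s * size_of_sample))
# Both give 8000 bytes, i.e. 4000 samples = 0.25 s of audio.
```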
```python
# Taken from https://github.com/wiseman/py-webrtcvad/blob/master/example.py
def frame_generator(frame_duration_ms, audio_generator, sample_rate):
```
Would also prefer `frame_duration_s` here.
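A seconds-based variant of the linked py-webrtcvad helper could look like this (a sketch over raw PCM bytes rather than the example's `Frame` objects; `sample_width` is an assumed parameter for s16le audio):

```python
def frame_generator(frame_duration_s, audio_bytes, sample_rate, sample_width=2):
    """Split raw PCM bytes into fixed-duration frames, dropping the trailing
    partial frame (webrtcvad requires exact 10/20/30 ms frames)."""
    n = int(sample_rate * frame_duration_s) * sample_width
    for offset in range(0, len(audio_bytes) - n + 1, n):
        yield audio_bytes[offset:offset + n]

# One second of silence at 16 kHz, s16le -> 33 full frames of 30 ms (960 bytes) each.
frames = list(frame_generator(0.03, bytes(16000 * 2), sample_rate=16000))
```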
```python
    yield b"".join([f.bytes for f in voiced_frames])
```

```python
def vad_files(filenames, sampling_rate: int, chunk_length_s: float):
```
Maybe add this in another PR; I think the default streaming works quite well without it, no?
Tried it out and it's super cool!
If possible, could we maybe leave out the VAD stuff in this PR (don't think it's necessary in a first step).
Also I think it'd be very cool if we could align `stream_chunk_ms` to the seconds format as well, like the other inputs.
Then this would make a first great demo for live ASR no?
Guess we could add the stride logic in a next step here as well to improve the result no?
Think then
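For context, the VAD chunking under discussion boils down to grouping consecutive voiced frames and closing a chunk after enough silence. A simplified, hypothetical sketch (names are illustrative; in practice the predicate would be something like `webrtcvad.Vad(mode).is_speech`):

```python
def vad_chunks(frames, is_speech, min_silence_frames=3):
    """Group consecutive voiced frames into chunks; a run of
    `min_silence_frames` silent frames closes the current chunk."""
    voiced, silent = [], 0
    for frame in frames:
        if is_speech(frame):
            voiced.append(frame)
            silent = 0
        else:
            silent += 1
            if voiced and silent >= min_silence_frames:
                yield b"".join(voiced)
                voiced = []
    if voiced:
        yield b"".join(voiced)

# Toy predicate: any nonzero byte counts as speech.
frames = [b"\x01", b"\x01", b"\x00", b"\x00", b"\x00", b"\x02"]
chunks = list(vad_chunks(frames, is_speech=lambda f: any(f)))
```

This also shows why the knobs matter: `min_silence_frames` (and the frame duration) directly control how much audio gets dropped between chunks.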
Sorry @patrickvonplaten, I led you astray. This is the main PR, from which we decided I would spin off smaller sub-PRs. The smaller one (which I need to update too, apparently) is #15046. I merely rebased this one so it wouldn't be too stale.
No worries! Will take a look at the new one tomorrow first thing then :-)
Sorry, is this PR still relevant?
Well, it still contains the VAD utilities. It's more of a safekeeping PR; I'll mark it as draft since we shouldn't merge it.
This PR was slowly merged bit by bit (sometimes with major changes) into transformers.
Keeping it open for the VAD chunking, which is in this PR and not yet available in transformers.
WER script:
Microphone streaming:
Edit: Separating this from `ChunkPipeline`, which in the end is totally unrelated work (linked to the decision to keep these as helpers instead of within the pipeline).
What does this PR do?
@LysandreJik
@anton-l
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.