
Adding utilities to chunk large audio files and read directly from microphone #14250

Closed · wants to merge 4 commits

Conversation

Narsil (Contributor) commented Nov 2, 2021


This PR was slowly merged bit by bit (sometimes with major changes) into transformers.
Keeping it open for VAD chunking, which is in this PR and not yet available in transformers.

- `chunk_files`: requires scipy only, relatively straightforward.
- `vad_files`: requires webrtcvad; will chunk both on maximum length and when
  voice is inactive (good potential if there is lots of silence, but might
  miss large portions of the audio). See the webrtcvad sketch after this list.
- Both require ffmpeg too; maybe move to `av` (the initial implementation
  was 4x slower and more complex, though).
- `ffmpeg_microphone`: will stream audio from the machine's microphone.
  Streaming support (meaning temporary results while the whole frame is
  being processed) is available, but a bit manual; it should be
  pipeline agnostic.
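
For reference, a minimal sketch of the frame-level voice-activity check that `vad_files` builds on. This is plain `webrtcvad` usage rather than the helper's actual code; the 16 kHz rate, the 30 ms frame size, and the `speech_frames` name are assumptions for illustration:

```python
import webrtcvad

vad = webrtcvad.Vad(3)  # aggressiveness 0-3; 3 filters non-speech most aggressively

sample_rate = 16000  # webrtcvad accepts 8000, 16000, 32000 or 48000 Hz
frame_ms = 30        # frames must be exactly 10, 20 or 30 ms long
frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 16-bit (2-byte) mono PCM

def speech_frames(pcm: bytes):
    """Yield (is_speech, frame) for consecutive fixed-size frames of raw PCM."""
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[offset : offset + frame_bytes]
        yield vad.is_speech(frame, sample_rate), frame
```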

All these functions have many knobs to turn which can affect the end
result quite drastically, so there are no "sane" defaults (afaik).
For now they are explicitly separate from the core of the pipeline, meaning
they are likely to change; they are simply meant as helper functions to keep
the APIs simple even on more challenging data, and to make demos easy to do.
Another benefit is that we can expose all those knobs without exploding
the pipeline's complexity (they are not exposed yet).

Current defaults yield 47 WER on AMI for both `chunk_files` and `vad_files`
with `facebook/wav2vec2-base-960h`, which is on par with expectations.

WER script:

```python
from jiwer import wer

from datasets import load_dataset
from transformers import pipeline
from transformers.pipelines.audio_utils import chunk_files, vad_files
import tqdm
import numpy as np
import re


def evaluate():
    dataset = load_dataset("ami", "headset-single", split="validation")

    pipe = pipeline("automatic-speech-recognition", device=0)
    sampling_rate = pipe.feature_extractor.sampling_rate

    non_letters = re.compile(r"[^a-z'\s]+")
    multi_space = re.compile(r"\s+")

    vad_wers = []
    chunk_wers = []
    max_chunk_duration_s = 20
    for item in tqdm.tqdm(dataset):
        words = item["words"]
        filename = item["file"]
        # Normalize the reference: lowercase, strip non-letters, collapse whitespace
        target_text = " ".join(words).lower()
        target_text = non_letters.sub("", target_text)
        target_text = multi_space.sub(" ", target_text)

        pred_text = ""
        # `output`, not `item`, to avoid shadowing the dataset row above
        for output in tqdm.tqdm(pipe(chunk_files([filename], sampling_rate, max_chunk_duration_s))):
            pred_text += " " + output["text"]
        pred_text = pred_text.lower()
        chunk_wers.append(wer(target_text, pred_text))

        pred_text = ""
        for output in tqdm.tqdm(pipe(vad_files([filename], sampling_rate, max_chunk_duration_s))):
            pred_text += " " + output["text"]
        pred_text = pred_text.lower()
        vad_wers.append(wer(target_text, pred_text))

    return np.mean(chunk_wers), np.mean(vad_wers)


if __name__ == "__main__":
    chunk_score, vad_score = evaluate()
    print("chunk WER:", chunk_score)
    print("vad WER:", vad_score)
```

Microphone streaming:

```python
import datetime
import sys

from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone

nlp = pipeline("automatic-speech-recognition", device=0)
sampling_rate = nlp.feature_extractor.sampling_rate


start = datetime.datetime.now()

max_chunk_duration_s = 5
stream_chunk_ms = 50
# Number of partial results per full chunk (integer division keeps `i % N` exact)
N = max_chunk_duration_s * 1000 // stream_chunk_ms
for i, item in enumerate(
    nlp(
        ffmpeg_microphone(
            sampling_rate=sampling_rate,
            format_for_conversion="f32le",
            max_chunk_duration_s=max_chunk_duration_s,
            stream_chunk_ms=stream_chunk_ms,
        ),
        batch_size=1,
        num_workers=1,
    )
):
    sys.stdout.write("\033[K")  # clear the line before rewriting the partial transcription
    print(item["text"], end="\r")
    if i % N == N - 1:
        # Last partial result of this chunk: keep its transcription on its own line
        print("")
```

Edit: Separating this from `ChunkPipeline`, which in the end is totally
unrelated work (linked to the decision to keep these as helpers instead
of putting them inside the pipeline).

@LysandreJik
@anton-l

Narsil added a commit to Narsil/transformers that referenced this pull request Nov 10, 2021
The main goal is to ease the creation of streaming data into the pipeline.

`Dataset` is more involved and PyTorch-specific.

This PR provides a way to use a plain Python iterator too.
This enables huggingface#14250 but can be proposed as a standalone PR.

```python
from transformers import pipeline

def read_data(filename):
    with open(filename, "r") as f:
        for line in f:
            yield line  # yield each line, not the file object

pipe = pipeline("text-classification")
for classified in pipe(read_data("large_file.txt")):
    print("Success!", classified)
```

The main caveat is the interaction with `DataLoader` when `num_workers > 1`.
When you have multiple workers, each receives a copy of the generator
(as with `IterableDataset`). That means the naive iterator will fail,
since all workers iterate over all items of the generator.

There are ways to do clever "skipping", but it can still be costly,
because every worker still has to pass through all items of the
generator (they just ignore the items they don't handle); depending on
the case it might be bad.

Using `num_workers=1` is the simplest fix, and if the cost of loading
your data is small enough it should be good enough. In the above example,
trying to do smart tricks to skip some lines is unlikely to be a net
positive, for instance.

If there are better ways to do "jumps" in the data, then using
`Dataset` is more advisable (since then different workers can just jump
by themselves).
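
For the record, a minimal sketch of the "skipping" approach mentioned above. This is a hypothetical illustration rather than code from the PR; `ShardedIterable` and `make_generator` are invented names:

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardedIterable(IterableDataset):
    """Wrap a generator factory so each DataLoader worker keeps every n-th item."""

    def __init__(self, make_generator):
        self.make_generator = make_generator

    def __iter__(self):
        info = get_worker_info()  # None in the main process, per-worker info otherwise
        num_workers = info.num_workers if info is not None else 1
        worker_id = info.id if info is not None else 0
        for i, item in enumerate(self.make_generator()):
            # Every worker still walks the *whole* generator; it only keeps its
            # share, which is exactly the cost the paragraph above warns about.
            if i % num_workers == worker_id:
                yield item


def read_data(filename):
    with open(filename, "r") as f:
        for line in f:
            yield line


# batch_size=None disables automatic batching, so items come out one by one
loader = DataLoader(ShardedIterable(lambda: read_data("large_file.txt")), num_workers=2, batch_size=None)
```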
Narsil added a commit that referenced this pull request Nov 12, 2021
Adding support for raw python `generator` in addition to `Dataset` for pipelines (#14352)

* Adding support for raw python `generator` in addition to `Dataset`
* Adding iterator support for `tf` too.
Narsil added a commit to Narsil/transformers that referenced this pull request Nov 12, 2021
Narsil added a commit to Narsil/transformers that referenced this pull request Nov 26, 2021
github-actions bot commented Dec 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Narsil added a commit to Narsil/transformers that referenced this pull request Dec 8, 2021
github-actions bot closed this Dec 15, 2021
Narsil reopened this Dec 22, 2021

Narsil (Contributor, Author) commented Dec 22, 2021

Reopening since I am using stuff from this PR for testing (namely `ffmpeg_microphone`).

patrickvonplaten (Contributor) commented

Awesome that you found a way to stream audio!!!

If possible I'd be happy to first get #14896 (review) merged to enable offline decoding of very large files, and once that's done it would be great to tackle the online streaming case. Could we maybe put the VAD utilities in a new separate PR, since in my opinion it is a bit unrelated to "online streaming", as explained here: #14896 (comment)?

Narsil force-pushed the chunk_asr branch 3 times, most recently from 4aada7f to 48902a7, January 5, 2022 15:31
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
Adding support for raw python `generator` in addition to `Dataset` for pipelines (huggingface#14352)
github-actions bot commented
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Adding utilities to chunk large audio files and read directly from microphone

Remove work.

Unused kwarg.

Enabling streaming mode for the microphone.

Fixing: on macOS we need to capture only the microphone.

Style after rebase.

Stride and chunking outside of pipeline.

Final chunk_length_s?

Protect tests with ffmpeg check.

Fixing QA (batching padding values for `p_mask` is special).

isort.
Narsil (Contributor, Author) commented Jan 31, 2022

unstale

patrickvonplaten (Contributor) commented

Looking now!



```python
def ffmpeg_microphone(
    sampling_rate: int, format_for_conversion: str, chunk_length_s: float, stream_chunk_ms: Optional[int] = None
```
Could we maybe have stream_chunk_ms in seconds as well @Narsil given that everything else is in seconds?

Could we also make use of the stride logic here?

```python
]

if stream_chunk_ms is not None:
    buflen = int(round(sampling_rate * stream_chunk_ms * size_of_sample / 1000))
```
Suggested change:

```diff
- buflen = int(round(sampling_rate * stream_chunk_ms * size_of_sample / 1000))
+ buflen = int(round(sampling_rate * stream_chunk_s * size_of_sample))
```

would be nicer IMO
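
For concreteness, a worked example (the numbers are illustrative, not from the PR): at `sampling_rate=16000`, `stream_chunk_ms=50`, and `size_of_sample=4` (f32le), `buflen = 16000 * 50 * 4 / 1000 = 3200` bytes, i.e. 800 samples per streamed chunk; the suggested `stream_chunk_s=0.05` computes the same value without the division by 1000.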



```python
# Taken from https://github.com/wiseman/py-webrtcvad/blob/master/example.py
def frame_generator(frame_duration_ms, audio_generator, sample_rate):
```
would also prefer frame_duration_s here
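
For context, a plausible sketch of what that helper does, adapted from the py-webrtcvad example it cites (a hypothetical simplification, not the PR's actual body): it re-chunks an arbitrary byte stream into fixed-duration 16-bit PCM frames that webrtcvad can accept.

```python
def frame_generator(frame_duration_ms, audio_generator, sample_rate):
    # 16-bit (2-byte) mono PCM: bytes per frame of `frame_duration_ms` milliseconds
    frame_bytes = int(sample_rate * frame_duration_ms / 1000) * 2
    buffer = b""
    for chunk in audio_generator:
        buffer += chunk
        # Emit as many complete frames as the buffer currently holds
        while len(buffer) >= frame_bytes:
            yield buffer[:frame_bytes]
            buffer = buffer[frame_bytes:]
```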

```python
    yield b"".join([f.bytes for f in voiced_frames])


def vad_files(filenames, sampling_rate: int, chunk_length_s: float):
```
Maybe add this in another PR - I think the default streaming works quite well without it, no?

patrickvonplaten (Contributor) left a comment

Tried it out and it's super cool!

If possible, could we maybe leave out the VAD stuff in this PR (I don't think it's necessary in a first step)?
Also I think it'd be very cool if we could align `stream_chunk_ms` with the seconds format of the other inputs.
Then this would make a great first demo for live ASR, no?

Guess we could add the stride logic in a next step here as well to improve the result, no?

Narsil (Contributor, Author) commented Jan 31, 2022

Sorry @patrickvonplaten, I misled you.

This is the main PR, from which we decided I would spin off smaller sub-PRs.

This is the smaller one (which apparently I need to update too): #15046

I merely rebased this one so it wouldn't be too stale.

patrickvonplaten (Contributor) commented

No worries! Will take a look at the new one tomorrow first thing then :-)

patrickvonplaten (Contributor) commented

Sorry, is this PR still relevant?

Narsil (Contributor, Author) commented Feb 23, 2022

Well, it still contains the VAD chunking.

It's more of a safekeeping PR; I'll mark it as draft since we shouldn't merge it.

Narsil marked this pull request as draft February 23, 2022 18:05

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this Mar 31, 2022