
Adding utilities to chunk large audio files and read directly from microphone #14250

Closed · wants to merge 4 commits

Conversation

Narsil (Contributor) commented Nov 2, 2021


This PR was slowly merged bit by bit (sometimes with major changes) into transformers.
Keeping it open for VAD chunking, which is in this PR and not yet available in transformers.

- `chunk_files`: requires scipy only, relatively straightforward.
- `vad_files`: requires webrtcvad; will chunk both on maximum length and when
  voice is inactive (good potential if there is lots of silence, but might
  miss large portions of the audio). See the webrtcvad sketch after this list.
- Both require ffmpeg too; maybe move to `av` (the initial implementation
  was 4x slower and more complex, though).
- `ffmpeg_microphone`: will stream audio from the machine's microphone.
  Streaming support (meaning temporary results while the whole frame is
  being processed) is available, but a bit manual; it should be
  pipeline agnostic.
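
For reference, a minimal sketch of the frame-level voice-activity check that `vad_files` builds on. This is plain `webrtcvad` usage rather than the helper's actual code; the 16 kHz rate, the 30 ms frame size, and the `speech_frames` name are assumptions for illustration:

```python
import webrtcvad

vad = webrtcvad.Vad(3)  # aggressiveness 0-3; 3 filters non-speech most aggressively

sample_rate = 16000  # webrtcvad accepts 8000, 16000, 32000 or 48000 Hz
frame_ms = 30        # frames must be exactly 10, 20 or 30 ms long
frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 16-bit (2-byte) mono PCM

def speech_frames(pcm: bytes):
    """Yield (is_speech, frame) for consecutive fixed-size frames of raw PCM."""
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[offset : offset + frame_bytes]
        yield vad.is_speech(frame, sample_rate), frame
```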

All these functions have many knobs to turn which can affect the end
result quite drastically, so there are no "sane" defaults (afaik).
For now they are explicitly separate from the core of the pipeline, meaning
they are likely to change; they are simply meant as helper functions to keep
the APIs simple even on more challenging data, and to make demos easy to do.
Another benefit is that we can expose all those knobs without exploding
the pipeline's complexity (they are not exposed yet).

Current defaults yield 47 WER on AMI for both `chunk_files` and `vad_files`
with `facebook/wav2vec2-base-960h`, which is on par with expectations.

WER script:

```python
from jiwer import wer

from datasets import load_dataset
from transformers import pipeline
from transformers.pipelines.audio_utils import chunk_files, vad_files
import tqdm
import numpy as np
import re


def evaluate():
    dataset = load_dataset("ami", "headset-single", split="validation")

    pipe = pipeline("automatic-speech-recognition", device=0)
    sampling_rate = pipe.feature_extractor.sampling_rate

    non_letters = re.compile(r"[^a-z'\s]+")
    multi_space = re.compile(r"\s+")

    vad_wers = []
    chunk_wers = []
    max_chunk_duration_s = 20
    for item in tqdm.tqdm(dataset):
        words = item["words"]
        filename = item["file"]
        # Normalize the reference: lowercase, strip non-letters, collapse whitespace
        target_text = " ".join(words).lower()
        target_text = non_letters.sub("", target_text)
        target_text = multi_space.sub(" ", target_text)

        pred_text = ""
        # `output`, not `item`, to avoid shadowing the dataset row above
        for output in tqdm.tqdm(pipe(chunk_files([filename], sampling_rate, max_chunk_duration_s))):
            pred_text += " " + output["text"]
        pred_text = pred_text.lower()
        chunk_wers.append(wer(target_text, pred_text))

        pred_text = ""
        for output in tqdm.tqdm(pipe(vad_files([filename], sampling_rate, max_chunk_duration_s))):
            pred_text += " " + output["text"]
        pred_text = pred_text.lower()
        vad_wers.append(wer(target_text, pred_text))

    return np.mean(chunk_wers), np.mean(vad_wers)


if __name__ == "__main__":
    chunk_score, vad_score = evaluate()
    print("chunk WER:", chunk_score)
    print("vad WER:", vad_score)
```

Microphone streaming:

```python
import datetime
import sys

from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone

nlp = pipeline("automatic-speech-recognition", device=0)
sampling_rate = nlp.feature_extractor.sampling_rate


start = datetime.datetime.now()

max_chunk_duration_s = 5
stream_chunk_ms = 50
# Number of partial results per full chunk (integer division keeps `i % N` exact)
N = max_chunk_duration_s * 1000 // stream_chunk_ms
for i, item in enumerate(
    nlp(
        ffmpeg_microphone(
            sampling_rate=sampling_rate,
            format_for_conversion="f32le",
            max_chunk_duration_s=max_chunk_duration_s,
            stream_chunk_ms=stream_chunk_ms,
        ),
        batch_size=1,
        num_workers=1,
    )
):
    sys.stdout.write("\033[K")  # clear the line before rewriting the partial transcription
    print(item["text"], end="\r")
    if i % N == N - 1:
        # Last partial result of this chunk: keep its transcription on its own line
        print("")
```

Edit: Separating this from `ChunkPipeline`, which in the end is totally
unrelated work (linked to the decision to keep these as helpers instead
of putting them inside the pipeline).

@LysandreJik
@anton-l

Narsil added a commit to Narsil/transformers that referenced this pull request Nov 10, 2021
The main goal is to ease the creation of streaming data into the pipeline.

`Dataset` is more involved and PyTorch-specific.

This PR provides a way to use a plain Python iterator too.
This enables huggingface#14250 but can be proposed as a standalone PR.

```python
from transformers import pipeline

def read_data(filename):
    with open(filename, "r") as f:
        for line in f:
            yield line  # yield each line, not the file object

pipe = pipeline("text-classification")
for classified in pipe(read_data("large_file.txt")):
    print("Success!", classified)
```

The main caveat is the interaction with `DataLoader` when `num_workers > 1`.
When you have multiple workers, each receives a copy of the generator
(as with `IterableDataset`). That means the naive iterator will fail,
since all workers iterate over all items of the generator.

There are ways to do clever "skipping", but it can still be costly,
because every worker still has to pass through all items of the
generator (they just ignore the items they don't handle); depending on
the case it might be bad.

Using `num_workers=1` is the simplest fix, and if the cost of loading
your data is small enough it should be good enough. In the above example,
trying to do smart tricks to skip some lines is unlikely to be a net
positive, for instance.

If there are better ways to do "jumps" in the data, then using
`Dataset` is more advisable (since then different workers can just jump
by themselves).
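
For the record, a minimal sketch of the "skipping" approach mentioned above. This is a hypothetical illustration rather than code from the PR; `ShardedIterable` and `make_generator` are invented names:

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardedIterable(IterableDataset):
    """Wrap a generator factory so each DataLoader worker keeps every n-th item."""

    def __init__(self, make_generator):
        self.make_generator = make_generator

    def __iter__(self):
        info = get_worker_info()  # None in the main process, per-worker info otherwise
        num_workers = info.num_workers if info is not None else 1
        worker_id = info.id if info is not None else 0
        for i, item in enumerate(self.make_generator()):
            # Every worker still walks the *whole* generator; it only keeps its
            # share, which is exactly the cost the paragraph above warns about.
            if i % num_workers == worker_id:
                yield item


def read_data(filename):
    with open(filename, "r") as f:
        for line in f:
            yield line


# batch_size=None disables automatic batching, so items come out one by one
loader = DataLoader(ShardedIterable(lambda: read_data("large_file.txt")), num_workers=2, batch_size=None)
```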
Narsil added a commit that referenced this pull request Nov 12, 2021
Adding support for raw python `generator` in addition to `Dataset` for pipelines (#14352)

* Adding support for raw python `generator` in addition to `Dataset`
* Adding iterator support for `tf` too.
Narsil added a commit to Narsil/transformers that referenced this pull request Nov 12, 2021
Narsil added a commit to Narsil/transformers that referenced this pull request Nov 26, 2021
github-actions bot commented Dec 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Narsil added a commit to Narsil/transformers that referenced this pull request Dec 8, 2021
github-actions bot closed this Dec 15, 2021
Narsil reopened this Dec 22, 2021

Narsil (Contributor, Author) commented Dec 22, 2021

Reopening since I am using stuff from this PR for testing (namely `ffmpeg_microphone`).

patrickvonplaten (Contributor) commented

Awesome that you found a way to stream audio!!!

If possible I'd be happy to first get #14896 (review) merged to enable offline decoding of very large files, and once that's done it would be great to tackle the online streaming case. Could we maybe put the VAD utilities in a new separate PR, since in my opinion it is a bit unrelated to "online streaming", as explained here: #14896 (comment)?

Narsil force-pushed the chunk_asr branch 3 times, most recently from 4aada7f to 48902a7, January 5, 2022 15:31
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
Adding support for raw python `generator` in addition to `Dataset` for pipelines (huggingface#14352)
github-actions bot commented
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Adding utilities to chunk large audio files and read directly from microphone

Remove work.

Unused kwarg.

Enabling streaming mode for the microphone.

Fixing: on macOS we need to capture only the microphone.

Style after rebase.

Stride and chunking outside of pipeline.

Final chunk_length_s?

Protect tests with ffmpeg check.

Fixing QA (batching padding values for `p_mask` is special).

isort.
Narsil (Contributor, Author) commented Jan 31, 2022

unstale

patrickvonplaten (Contributor) commented

Looking now!



```python
def ffmpeg_microphone(
    sampling_rate: int, format_for_conversion: str, chunk_length_s: float, stream_chunk_ms: Optional[int] = None
```
Could we maybe have stream_chunk_ms in seconds as well @Narsil given that everything else is in seconds?

Could we also make use of the stride logic here?

```python
]

if stream_chunk_ms is not None:
    buflen = int(round(sampling_rate * stream_chunk_ms * size_of_sample / 1000))
```
Suggested change:

```diff
- buflen = int(round(sampling_rate * stream_chunk_ms * size_of_sample / 1000))
+ buflen = int(round(sampling_rate * stream_chunk_s * size_of_sample))
```

would be nicer IMO
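
For concreteness, a worked example (the numbers are illustrative, not from the PR): at `sampling_rate=16000`, `stream_chunk_ms=50`, and `size_of_sample=4` (f32le), `buflen = 16000 * 50 * 4 / 1000 = 3200` bytes, i.e. 800 samples per streamed chunk; the suggested `stream_chunk_s=0.05` computes the same value without the division by 1000.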



```python
# Taken from https://github.com/wiseman/py-webrtcvad/blob/master/example.py
def frame_generator(frame_duration_ms, audio_generator, sample_rate):
```
would also prefer frame_duration_s here
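
For context, a plausible sketch of what that helper does, adapted from the py-webrtcvad example it cites (a hypothetical simplification, not the PR's actual body): it re-chunks an arbitrary byte stream into fixed-duration 16-bit PCM frames that webrtcvad can accept.

```python
def frame_generator(frame_duration_ms, audio_generator, sample_rate):
    # 16-bit (2-byte) mono PCM: bytes per frame of `frame_duration_ms` milliseconds
    frame_bytes = int(sample_rate * frame_duration_ms / 1000) * 2
    buffer = b""
    for chunk in audio_generator:
        buffer += chunk
        # Emit as many complete frames as the buffer currently holds
        while len(buffer) >= frame_bytes:
            yield buffer[:frame_bytes]
            buffer = buffer[frame_bytes:]
```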

```python
    yield b"".join([f.bytes for f in voiced_frames])


def vad_files(filenames, sampling_rate: int, chunk_length_s: float):
```
Maybe add this in another PR - I think the default streaming works quite well without it, no?

patrickvonplaten (Contributor) left a comment

Tried it out and it's super cool!

If possible, could we maybe leave out the VAD stuff in this PR (I don't think it's necessary in a first step)?
Also I think it'd be very cool if we could align `stream_chunk_ms` with the seconds format of the other inputs.
Then this would make a great first demo for live ASR, no?

Guess we could add the stride logic in a next step here as well to improve the result, no?

Narsil (Contributor, Author) commented Jan 31, 2022

Sorry @patrickvonplaten, I misled you.

This is the main PR, from which we decided I would spin off smaller sub-PRs.

This is the smaller one (which apparently I need to update too): #15046

I merely rebased this one so it wouldn't be too stale.

patrickvonplaten (Contributor) commented

No worries! Will take a look at the new one tomorrow first thing then :-)

patrickvonplaten (Contributor) commented

Sorry, is this PR still relevant?

Narsil (Contributor, Author) commented Feb 23, 2022

Well, it still contains the VAD chunking.

It's more of a safekeeping PR; I'll mark it as draft since we shouldn't merge it.

Narsil marked this pull request as draft February 23, 2022 18:05

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this Mar 31, 2022