Adding support for microphone streaming within pipeline. #15046

Merged

Narsil merged 13 commits into huggingface:master from microphone_streaming on Feb 2, 2022

Conversation

Narsil
Contributor

@Narsil Narsil commented Jan 5, 2022

  • Uses ffmpeg to get microphone data.

  • Makes sure alignment is made to size_of_sample.

  • Works by sending {"raw": ..data.., "stride": (n, left, right), "partial": bool, "sampling_rate": sampling_rate}
    directly to the pipeline, enabling partial results to be streamed while still
    getting inference (see the sketch after this list).

  • Lets partial information flow through the pipeline so the caller
    can get it back and choose whether to display the text.

  • The striding reconstitution is bound to have errors since CTC does not
    keep previous state. Currently most of the errors come from not knowing
    whether there is a space between two chunks.
    Since we have some left-striding info, we could use it during decoding
    to decide what to do with those spaces, and maybe even with extra letters
    (if the stride is long enough, it is bound to cover at least a few symbols).
    Fixed by using intelligent replacement on the dropped tokens.
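
For reference, a minimal sketch (not part of the PR) of the dict input form described above. The zero-filled dummy audio and the 8 kHz rate are placeholders; the pipeline resamples to the model's rate automatically:

import numpy as np
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition")

# One second of dummy mono float32 audio at 8 kHz; the pipeline resamples it internally.
dummy_audio = np.zeros(8000, dtype=np.float32)

out = pipe({
    "raw": dummy_audio,
    "sampling_rate": 8000,
    # Optional, CTC models only: ignore the first/last N samples when decoding,
    # while still feeding them to the model for extra context.
    "stride": (800, 800),
})
print(out["text"])

The ffmpeg_microphone_live helper used below yields dicts of this shape (plus the partial flag) straight from the microphone. First demo, printing the live transcription in the terminal: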

import datetime
import sys
from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live

pipe = pipeline("automatic-speech-recognition", device=0)
sampling_rate = pipe.feature_extractor.sampling_rate


start = datetime.datetime.now()

chunk_length_s = 5
stream_chunk_s = 0.1
mic = ffmpeg_microphone_live(
    sampling_rate=sampling_rate,
    chunk_length_s=chunk_length_s,
    stream_chunk_s=stream_chunk_s,
)
print("Start talking...")
for item in pipe(mic):
    sys.stdout.write("\033[K")  # clear the current terminal line
    print(item["text"], end="\r")  # redraw it with the latest (partial) transcription
    if not item["partial"][0]:
        # this chunk is final: keep the printed line and start a new one
        print("")

A second demo, better IMO but lower-level (requires curses on UNIX-like systems; does not work on Windows variants):

import sys
import numpy as np
from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live
from curses import wrapper
import curses


def main(stdscr):
    pipe = pipeline("automatic-speech-recognition", device=0)
    sampling_rate = pipe.feature_extractor.sampling_rate

    chunk_length_s = 5
    stream_chunk_s = 0.1
    mic = ffmpeg_microphone_live(
        sampling_rate=sampling_rate,
        chunk_length_s=chunk_length_s,
        stream_chunk_s=stream_chunk_s,  # optionally: stride_length_s=(1, 0.1)
    )
    # curses.wrapper has already called initscr and hands us stdscr, so we don't call it again.
    stdscr.addstr(0, 0, "Start talking...")
    stdscr.refresh()
    curses.noecho()
    curses.cbreak()
    text = ""
    for item in pipe(mic):
        displayed = text + item["text"]
        if not item["partial"][0]:
            # the chunk is final: commit its text before the next chunk starts
            text += item["text"]

        stdscr.addstr(0, 0, displayed)
        stdscr.clrtoeol()
        stdscr.refresh()


if __name__ == "__main__":
    wrapper(main)  # pass the function itself; wrapper calls main(stdscr) and restores the terminal on exit

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil Narsil changed the title [WIP] Adding support for microphone streaming within pipeline. Adding support for microphone streaming within pipeline. Jan 6, 2022
@Narsil Narsil changed the title Adding support for microphone streaming within pipeline. [WIP] Adding support for microphone streaming within pipeline. Jan 12, 2022
@Narsil Narsil changed the title [WIP] Adding support for microphone streaming within pipeline. Adding support for microphone streaming within pipeline. Jan 13, 2022
- Uses `ffmpeg` to get microphone data.
- Makes sure alignment is made to `size_of_sample`.
- Works by sending `{"raw": ..data.., "stride": (n, left, right),
"partial": bool}`
directly to the pipeline enabling to stream partial results and still
get inference.
- Lets `partial` information flow through the pipeline to enable the caller
  to get it back and choose to display text or not.

- The striding reconstitution is bound to have errors since CTC does not
keep previous state. Currently most of the errors come from not knowing
whether there is a space between two chunks.
Since we have some left striding info, we could use that during decoding
to choose what to do with those spaces and even extra letters maybe (if
the stride is long enough, it's bound to cover at least a few symbols)

Fixing tests.

Protecting with `require_torch`.

`raw_ctc` support for nicer demo.

Post rebase fixes.

Revamp to split raw_mic_data from its live chunking.

- Requires a refactor to make everything a bit cleaner.

Automatic resampling.

Small fix.

Small fix.
"""
Helper function to read audio from the microphone through ffmpeg. This will output repeating/increasing chunks
until `chunk_length_s` is reached. It will make use of striding to avoid errors on the "sides" of the various chunks.
"""
Contributor

Would be nice to explain the difference between stream_chunk_s and chunk_length_s here.

Contributor Author

I made the docstring much longer and more exhaustive
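
A rough illustration of the distinction between the two parameters (paraphrased rather than taken from the PR's docstring):

mic = ffmpeg_microphone_live(
    sampling_rate=16000,
    chunk_length_s=5,    # each chunk ultimately covers up to 5 s of audio before a new one starts
    stream_chunk_s=0.1,  # a partial ("growing") version of the current chunk is yielded roughly every 0.1 s
)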

@@ -127,17 +92,17 @@ class AutomaticSpeechRecognitionPipeline(ChunkPipeline):
to support multiple audio formats
"""

-    def __init__(self, feature_extractor: Union["SequenceFeatureExtractor", str], *args, **kwargs):
+    def __init__(self, model, tokenizer, feature_extractor: Union["SequenceFeatureExtractor", str], *args, **kwargs):
Contributor

Is that a bit backwards-breaking? I think it's totally fine for me, as most people probably just use the pipeline function anyway.

Contributor Author

It is breaking in some sense, yes.

Calls like AutomaticSpeechRecognitionPipeline(feature_extractor, model, tokenizer) will break.
But the regular pipeline has the call signature (model, tokenizer, feature_extractor), so this makes things a little more consistent.
It also makes the parent class responsible for doing self.feature_extractor = feature_extractor (so less risk of future discrepancy between pipelines). It's definitely something that should probably be a separate PR, making sure ALL pipelines henceforth have a correct call signature. We could also use something like MyPipeline(*, model, tokenizer, feature_extractor), which would disallow purely position-based arguments and make everything simpler to maintain IMO, while also reducing confusion for users: https://www.vegardstikbakke.com/python-keyword-only/

I will revisit the PR to exclude this change from it; it's definitely not good to just put that here. I would rather make a sweep over all classes and make a solid (tested) decision about parameter flow. (I think the current signature is just something that was written before the rewrite of pipelines and happened to stay that way, not a conscious decision.)
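
A sketch of the keyword-only idea mentioned above (illustrative only; MyPipeline is a hypothetical class, not the PR's code):

class MyPipeline:
    def __init__(self, *, model, tokenizer, feature_extractor):
        # Everything after `*` must be passed by keyword, so argument order can never silently break callers.
        self.model = model
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

# MyPipeline(model, tokenizer, feature_extractor)        -> TypeError: positional arguments rejected
# MyPipeline(model=m, tokenizer=t, feature_extractor=f)  -> OK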

return preprocess_params, {}, {}

postprocess_params = {}
if "raw_ctc" in kwargs:
Contributor

What is raw_ctc?

Contributor

I don't think raw_ctc is ever passed no?

Contributor Author

It is captured by postprocess, however I don't think it's really needed.

Most likely some local experiment that I ended up committing by mistake.
FYI, the goal was to enable the caller to fuse the chunks themselves, in the context of a stateless pipeline that would still be usable by a live microphone. Since the pipeline is stateless, it couldn't know the last used token in order to set it in the striding area (so that it would get properly discarded when doing CTC decoding).

But as I found out, a purely stateless pipeline incurs way too much network overhead to be viable (the current script settings would mean roughly 4 MB/s of bandwidth, vs ~64 kB/s with a stateful connection, so stateful connection it is, and we can ignore raw_ctc altogether :)).
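
A back-of-the-envelope version of that bandwidth estimate, assuming 16 kHz float32 audio and the demo settings above (chunk_length_s=5, stream_chunk_s=0.1); the numbers are approximate and not taken from the PR:

bytes_per_second_of_audio = 16_000 * 4              # ~64 kB of raw float32 audio per second
stateful = bytes_per_second_of_audio                # only the new audio is sent: ~64 kB/s
stateless = (5 * bytes_per_second_of_audio) / 0.1   # the full 5 s window is re-sent every 0.1 s: ~3.2 MB/s
print(stateful, stateless)  # 64000 3200000.0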

Contributor Author

Removed it from current PR, it also removes the other issue.

Contributor Author

It was used for the 2nd demo btw, which gave live and perfect transcription. But we will settle for the updated second demo, which might contain erroneously duplicated letters (due to the chunking boundaries). The demo still looks & feels nice IMO.

Contributor

@patrickvonplaten patrickvonplaten left a comment

Tried it out and it works very well!

More or less good for merge for me! Just a bit confused about the raw_ctc kwarg. Is that used / passed anywhere? What is meant exactly by "raw"?

Member

@anton-l anton-l left a comment

Just a couple of docstring suggestions, otherwise - very cool way of handling partial live chunks, LGTM!

src/transformers/pipelines/audio_utils.py (outdated)
tests/test_pipelines_automatic_speech_recognition.py (outdated)
Narsil and others added 3 commits February 2, 2022 11:46
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
@Narsil Narsil merged commit 623d8cb into huggingface:master Feb 2, 2022
@Narsil Narsil deleted the microphone_streaming branch February 2, 2022 14:12
Comment on lines +155 to +169
inputs (`np.ndarray` or `bytes` or `str` or `dict`):
    The inputs are either:
        - `str`: the filename of an audio file; the file will be read at the correct sampling
          rate to get the waveform using *ffmpeg*. This requires *ffmpeg* to be installed on
          the system.
        - `bytes`: the content of an audio file, interpreted by *ffmpeg* in the same way.
        - `np.ndarray` of shape (n,) and dtype `np.float32` or `np.float64`: raw audio at the
          correct sampling rate (no further check will be done).
        - `dict`: can be used to pass raw audio sampled at an arbitrary `sampling_rate` and let
          this pipeline do the resampling. The dict must be in the format
          `{"sampling_rate": int, "raw": np.array}`, optionally with a
          `"stride": (left: int, right: int)` that asks the pipeline to ignore the first `left`
          samples and the last `right` samples in decoding (but still use them at inference to
          provide more context to the model). Only use `stride` with CTC models.
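
As a quick illustration of those input forms (a hedged sketch; sample.wav is a placeholder file, not something from the PR):

import numpy as np
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition")

pipe("sample.wav")                        # str: the file is decoded and resampled via ffmpeg
pipe(open("sample.wav", "rb").read())     # bytes: same, from in-memory audio file content
pipe(np.zeros(16000, dtype=np.float32))   # np.ndarray: raw audio already at the model's sampling rate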
Collaborator

I know we don't have a good check that will tell you the doc is failing yet, but this docstring was pretty obviously badly formatted.
Please pay closer attention before merging PRs :-)

Contributor Author

Could something have happened in quality?

Will pay more attention in the future though.

Collaborator

I think the styling script only exacerbated the wrong syntax (it leaves the fixed docstring alone).

@patrickvonplaten
Contributor

Hey @Narsil,

I think the PR broke some slow tests:

FAILED tests/test_pipelines_automatic_speech_recognition.py::AutomaticSpeechRecognitionPipelineTests::test_speech_to_text_leveraged
FAILED tests/test_pipelines_automatic_speech_recognition.py::AutomaticSpeechRecognitionPipelineTests::test_torch_speech_encoder_decoder
FAILED tests/test_pipelines_automatic_speech_recognition.py::AutomaticSpeechRecognitionPipelineTests::test_xls_r_from_en
FAILED tests/test_pipelines_automatic_speech_recognition.py::AutomaticSpeechRecognitionPipelineTests::test_xls_r_to_en

Could you take a look maybe? :-)

@Narsil
Contributor Author

Narsil commented Feb 14, 2022

I can't reproduce.

I did have an issue with the old 1.18.0 version, which is gone in 1.18.3. Was that it?

@patrickvonplaten
Contributor

Fixed it :-)

ManuelFay pushed a commit to ManuelFay/transformers that referenced this pull request Mar 31, 2022
* Adding support for `microphone` streaming within pipeline.

* Post rebase fix (need to let super handle more logic, reorder args.)

* Update docstrings

* Docstring format.

* Remove print.

* Prevent flow of `input_values`.

* Fixing `stride` too.

* Fixing the PR by removing `raw_ctc`.

* Better docstrings.

* Fixing init.

* Update src/transformers/pipelines/audio_utils.py

Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* Update tests/test_pipelines_automatic_speech_recognition.py

Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* Quality.

Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>