Adding support for microphone streaming within pipeline. #15046

Merged

Narsil merged 13 commits into huggingface:master from microphone_streaming on Feb 2, 2022

Conversation

Narsil
Contributor

@Narsil Narsil commented Jan 5, 2022

  • Uses ffmpeg to get microphone data.

  • Makes sure alignment is made to size_of_sample.

  • Works by sending {"raw": ..data.., "stride": (n, left, right), "partial": bool, "sampling_rate": sampling_rate}
    directly to the pipeline, enabling partial results to be streamed while still
    getting inference (see the sketch after this list).

  • Lets partial information flow through the pipeline so the caller
    can get it back and choose whether to display the text.

  • The striding reconstitution is bound to have errors since CTC does not
    keep previous state. Currently most of the errors come from not knowing
    whether there is a space between two chunks.
    Since we have some left-striding info, we could use it during decoding
    to decide what to do with those spaces, and maybe even with extra letters
    (if the stride is long enough, it is bound to cover at least a few symbols).
    Fixed by using intelligent replacement on the dropped tokens.
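
For reference, a minimal sketch (not part of the PR) of the dict input form described above. The zero-filled dummy audio and the 8 kHz rate are placeholders; the pipeline resamples to the model's rate automatically:

import numpy as np
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition")

# One second of dummy mono float32 audio at 8 kHz; the pipeline resamples it internally.
dummy_audio = np.zeros(8000, dtype=np.float32)

out = pipe({
    "raw": dummy_audio,
    "sampling_rate": 8000,
    # Optional, CTC models only: ignore the first/last N samples when decoding,
    # while still feeding them to the model for extra context.
    "stride": (800, 800),
})
print(out["text"])

The ffmpeg_microphone_live helper used below yields dicts of this shape (plus the partial flag) straight from the microphone. First demo, printing the live transcription in the terminal: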

import datetime
import sys
from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live

pipe = pipeline("automatic-speech-recognition", device=0)
sampling_rate = pipe.feature_extractor.sampling_rate


start = datetime.datetime.now()

chunk_length_s = 5
stream_chunk_s = 0.1
mic = ffmpeg_microphone_live(
    sampling_rate=sampling_rate,
    chunk_length_s=chunk_length_s,
    stream_chunk_s=stream_chunk_s,
)
print("Start talking...")
for item in pipe(mic):
    sys.stdout.write("\033[K")  # clear the current terminal line
    print(item["text"], end="\r")  # redraw it with the latest (partial) transcription
    if not item["partial"][0]:
        # this chunk is final: keep the printed line and start a new one
        print("")

A second demo, better IMO but lower-level (requires curses on UNIX-like systems; does not work on Windows variants):

import sys
import numpy as np
from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live
from curses import wrapper
import curses


def main(stdscr):
    pipe = pipeline("automatic-speech-recognition", device=0)
    sampling_rate = pipe.feature_extractor.sampling_rate

    chunk_length_s = 5
    stream_chunk_s = 0.1
    mic = ffmpeg_microphone_live(
        sampling_rate=sampling_rate,
        chunk_length_s=chunk_length_s,
        stream_chunk_s=stream_chunk_s,  # optionally: stride_length_s=(1, 0.1)
    )
    # curses.wrapper has already called initscr and hands us stdscr, so we don't call it again.
    stdscr.addstr(0, 0, "Start talking...")
    stdscr.refresh()
    curses.noecho()
    curses.cbreak()
    text = ""
    for item in pipe(mic):
        displayed = text + item["text"]
        if not item["partial"][0]:
            # the chunk is final: commit its text before the next chunk starts
            text += item["text"]

        stdscr.addstr(0, 0, displayed)
        stdscr.clrtoeol()
        stdscr.refresh()


if __name__ == "__main__":
    wrapper(main)  # pass the function itself; wrapper calls main(stdscr) and restores the terminal on exit

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil Narsil changed the title [WIP] Adding support for microphone streaming within pipeline. Adding support for microphone streaming within pipeline. Jan 6, 2022
@Narsil Narsil changed the title Adding support for microphone streaming within pipeline. [WIP] Adding support for microphone streaming within pipeline. Jan 12, 2022
@Narsil Narsil changed the title [WIP] Adding support for microphone streaming within pipeline. Adding support for microphone streaming within pipeline. Jan 13, 2022
- Uses `ffmpeg` to get microphone data.
- Makes sure alignment is made to `size_of_sample`.
- Works by sending `{"raw": ..data.., "stride": (n, left, right),
"partial": bool}`
directly to the pipeline enabling to stream partial results and still
get inference.
- Lets `partial` information flow through the pipeline to enable the caller
  to get it back and choose to display text or not.

- The striding reconstitution is bound to have errors since CTC does not
keep previous state. Currently most of the errors come from not knowing
whether there is a space between two chunks.
Since we have some left striding info, we could use that during decoding
to choose what to do with those spaces and even extra letters maybe (if
the stride is long enough, it's bound to cover at least a few symbols)

Fixing tests.

Protecting with `require_torch`.

`raw_ctc` support for nicer demo.

Post rebase fixes.

Revamp to split raw_mic_data from its live chunking.

- Requires a refactor to make everything a bit cleaner.

Automatic resampling.

Small fix.

Small fix.
"""
Helper function to read audio from the microphone through ffmpeg. This will output repeating/increasing chunks
until `chunk_length_s` is reached. It will make use of striding to avoid errors on the "sides" of the various chunks.
"""
Contributor

Would be nice to explain the difference between stream_chunk_s and chunk_length_s here.

Contributor Author

I made the docstring much longer and more exhaustive
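
A rough illustration of the distinction between the two parameters (paraphrased rather than taken from the PR's docstring):

mic = ffmpeg_microphone_live(
    sampling_rate=16000,
    chunk_length_s=5,    # each chunk ultimately covers up to 5 s of audio before a new one starts
    stream_chunk_s=0.1,  # a partial ("growing") version of the current chunk is yielded roughly every 0.1 s
)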

@@ -127,17 +92,17 @@ class AutomaticSpeechRecognitionPipeline(ChunkPipeline):
to support multiple audio formats
"""

-    def __init__(self, feature_extractor: Union["SequenceFeatureExtractor", str], *args, **kwargs):
+    def __init__(self, model, tokenizer, feature_extractor: Union["SequenceFeatureExtractor", str], *args, **kwargs):
Contributor

Is that a bit backwards-breaking? I think it's totally fine for me, as most people probably just use the pipeline function anyway.

Contributor Author

It is breaking in some sense, yes.

Calls like AutomaticSpeechRecognitionPipeline(feature_extractor, model, tokenizer) will break.
But the regular pipeline has the call signature (model, tokenizer, feature_extractor), so this makes things a little more consistent.
It also makes the parent class responsible for doing self.feature_extractor = feature_extractor (so less risk of future discrepancy between pipelines). It's definitely something that should probably be a separate PR, making sure ALL pipelines henceforth have a correct call signature. We could also use something like MyPipeline(*, model, tokenizer, feature_extractor), which would disallow purely position-based arguments and make everything simpler to maintain IMO, while also reducing confusion for users: https://www.vegardstikbakke.com/python-keyword-only/

I will revisit the PR to exclude this change from it; it's definitely not good to just put that here. I would rather make a sweep over all classes and make a solid (tested) decision about parameter flow. (I think the current signature is just something that was written before the rewrite of pipelines and happened to stay that way, not a conscious decision.)
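
A sketch of the keyword-only idea mentioned above (illustrative only; MyPipeline is a hypothetical class, not the PR's code):

class MyPipeline:
    def __init__(self, *, model, tokenizer, feature_extractor):
        # Everything after `*` must be passed by keyword, so argument order can never silently break callers.
        self.model = model
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

# MyPipeline(model, tokenizer, feature_extractor)        -> TypeError: positional arguments rejected
# MyPipeline(model=m, tokenizer=t, feature_extractor=f)  -> OK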

return preprocess_params, {}, {}

postprocess_params = {}
if "raw_ctc" in kwargs:
Contributor

What is raw_ctc?

Contributor

I don't think raw_ctc is ever passed no?

Contributor Author

It is captured by postprocess, however I don't think it's really needed.

Most likely some local experiment that I ended up committing by mistake.
FYI, the goal was to enable the caller to fuse the chunks themselves, in the context of a stateless pipeline that would still be usable by a live microphone. Since the pipeline is stateless, it couldn't know the last used token in order to set it in the striding area (so that it would get properly discarded when doing CTC decoding).

But as I found out, a purely stateless pipeline incurs way too much network overhead to be viable (the current script settings would mean roughly 4 MB/s of bandwidth, vs ~64 kB/s with a stateful connection, so stateful connection it is, and we can ignore raw_ctc altogether :)).
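
A back-of-the-envelope version of that bandwidth estimate, assuming 16 kHz float32 audio and the demo settings above (chunk_length_s=5, stream_chunk_s=0.1); the numbers are approximate and not taken from the PR:

bytes_per_second_of_audio = 16_000 * 4              # ~64 kB of raw float32 audio per second
stateful = bytes_per_second_of_audio                # only the new audio is sent: ~64 kB/s
stateless = (5 * bytes_per_second_of_audio) / 0.1   # the full 5 s window is re-sent every 0.1 s: ~3.2 MB/s
print(stateful, stateless)  # 64000 3200000.0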

Contributor Author

Removed it from current PR, it also removes the other issue.

Contributor Author

It was used for the 2nd demo btw, which gave live and perfect transcription. But we will settle for the updated second demo, which might contain erroneously duplicated letters (due to the chunking boundaries). The demo still looks & feels nice IMO.

Contributor

@patrickvonplaten patrickvonplaten left a comment

Tried it out and it works very well!

More or less good for merge for me! Just a bit confused about the raw_ctc kwarg. Is that used / passed anywhere? What is meant exactly by "raw"?

Member

@anton-l anton-l left a comment

Just a couple of docstring suggestions, otherwise - very cool way of handling partial live chunks, LGTM!

src/transformers/pipelines/audio_utils.py (outdated)
tests/test_pipelines_automatic_speech_recognition.py (outdated)
Narsil and others added 3 commits February 2, 2022 11:46
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
@Narsil Narsil merged commit 623d8cb into huggingface:master Feb 2, 2022
@Narsil Narsil deleted the microphone_streaming branch February 2, 2022 14:12
Comment on lines +155 to +169
inputs (`np.ndarray` or `bytes` or `str` or `dict`):
    The inputs are either:
        - `str`: the filename of an audio file; the file will be read at the correct sampling
          rate to get the waveform using *ffmpeg*. This requires *ffmpeg* to be installed on
          the system.
        - `bytes`: the content of an audio file, interpreted by *ffmpeg* in the same way.
        - `np.ndarray` of shape (n,) and dtype `np.float32` or `np.float64`: raw audio at the
          correct sampling rate (no further check will be done).
        - `dict`: can be used to pass raw audio sampled at an arbitrary `sampling_rate` and let
          this pipeline do the resampling. The dict must be in the format
          `{"sampling_rate": int, "raw": np.array}`, optionally with a
          `"stride": (left: int, right: int)` that asks the pipeline to ignore the first `left`
          samples and the last `right` samples in decoding (but still use them at inference to
          provide more context to the model). Only use `stride` with CTC models.
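
As a quick illustration of those input forms (a hedged sketch; sample.wav is a placeholder file, not something from the PR):

import numpy as np
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition")

pipe("sample.wav")                        # str: the file is decoded and resampled via ffmpeg
pipe(open("sample.wav", "rb").read())     # bytes: same, from in-memory audio file content
pipe(np.zeros(16000, dtype=np.float32))   # np.ndarray: raw audio already at the model's sampling rate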
Collaborator

I know we don't have a good check that will tell you the doc is failing yet, but this docstring was pretty obviously badly formatted.
Please pay closer attention before merging PRs :-)

Contributor Author

Could something have happened in quality?

Will pay more attention in the future though.

Collaborator

I think the styling script only exacerbated the wrong syntax (it leaves the fixed docstring alone).

@patrickvonplaten
Contributor

Hey @Narsil,

I think the PR broke some slow tests:

FAILED tests/test_pipelines_automatic_speech_recognition.py::AutomaticSpeechRecognitionPipelineTests::test_speech_to_text_leveraged
FAILED tests/test_pipelines_automatic_speech_recognition.py::AutomaticSpeechRecognitionPipelineTests::test_torch_speech_encoder_decoder
FAILED tests/test_pipelines_automatic_speech_recognition.py::AutomaticSpeechRecognitionPipelineTests::test_xls_r_from_en
FAILED tests/test_pipelines_automatic_speech_recognition.py::AutomaticSpeechRecognitionPipelineTests::test_xls_r_to_en

Could you take a look maybe? :-)

@Narsil
Contributor Author

Narsil commented Feb 14, 2022

I can't reproduce.

I did have an issue with the old 1.18.0 version, which is gone in 1.18.3. Was that it?

@patrickvonplaten
Contributor

Fixed it :-)

ManuelFay pushed a commit to ManuelFay/transformers that referenced this pull request Mar 31, 2022
* Adding support for `microphone` streaming within pipeline.

* Post rebase fix (need to let super handle more logic, reorder args.)

* Update docstrings

* Docstring format.

* Remove print.

* Prevent flow of `input_values`.

* Fixing `stride` too.

* Fixing the PR by removing `raw_ctc`.

* Better docstrings.

* Fixing init.

* Update src/transformers/pipelines/audio_utils.py

Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* Update tests/test_pipelines_automatic_speech_recognition.py

Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* Quality.

Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>