Adding support for `microphone` streaming within pipeline. #15046
Conversation
Force-pushed the `microphone` streaming branch from c8b2579 to a14cbe9.
Force-pushed from 1c6fb99 to f778ceb.
- Uses `ffmpeg` to get microphone data.
- Makes sure alignment is made to `size_of_sample`.
- Works by sending `{"raw": ..data.., "stride": (n, left, right), "partial": bool}` directly to the pipeline, enabling streaming of partial results while still getting inference.
- Lets `partial` information flow through the pipeline so the caller can get it back and choose whether to display the text.
- The striding reconstitution is bound to have errors since CTC does not keep previous state. Currently most of the errors come from not knowing whether there is a space between two chunks. Since we have some left-striding info, we could use it during decoding to decide what to do with those spaces, and maybe even with extra letters (if the stride is long enough, it's bound to cover at least a few symbols).

Also: fixing tests; protecting with `require_torch`; `raw_ctc` support for a nicer demo; post-rebase fixes; revamp to split raw mic data from its live chunking (requires a refactor to make everything a bit cleaner); automatic resampling; small fixes.
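A minimal sketch of two of the mechanics described above: byte alignment to `size_of_sample` and the chunk dict sent to the pipeline. The helper names here are hypothetical, not the PR's actual functions.

```python
import numpy as np

def align_to_sample_size(raw: bytes, size_of_sample: int = 4):
    """Split a byte buffer so the first part holds a whole number of samples.

    Hypothetical helper: mirrors the idea of aligning ffmpeg output to
    `size_of_sample` (4 bytes for float32) before building a chunk dict.
    Returns (usable_bytes, leftover_bytes_to_prepend_next_time).
    """
    usable = len(raw) - (len(raw) % size_of_sample)
    return raw[:usable], raw[usable:]

def make_chunk(raw: bytes, left: int, right: int, partial: bool, sampling_rate: int = 16000) -> dict:
    # Build the dict shape described above: raw samples plus striding info.
    data = np.frombuffer(raw, dtype=np.float32)
    return {
        "raw": data,
        "stride": (len(data), left, right),
        "partial": partial,
        "sampling_rate": sampling_rate,
    }
```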
Force-pushed from f778ceb to 5d2c0af.
""" | ||
Helper function to read audio from the microphone file through ffmpeg. This will output repeating/increasing chunks | ||
until `chunk_length_s`. It will make use of striding to avoid errors on the "sides" of the various chunks. | ||
""" |
Would be nice to explain the difference between `stream_chunk_s` and `chunk_length_s` here.
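For intuition, a tiny sketch of the assumed relationship between the two parameters (`stream_chunk_s` controls how often audio is yielded; `chunk_length_s` is the full window each yield grows toward, per the docstring above):

```python
def partial_chunk_lengths(chunk_length_s, stream_chunk_s):
    """Durations (in seconds) of the successive partial chunks emitted.

    Illustrative only: every `stream_chunk_s` seconds a larger chunk is
    yielded, until the full `chunk_length_s` window is reached.
    """
    lengths, t = [], stream_chunk_s
    while t < chunk_length_s:
        lengths.append(t)
        t += stream_chunk_s
    lengths.append(chunk_length_s)
    return lengths
```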
I made the docstring much longer and more exhaustive
```diff
@@ -127,17 +92,17 @@ class AutomaticSpeechRecognitionPipeline(ChunkPipeline):
         to support multiple audio formats
     """

-    def __init__(self, feature_extractor: Union["SequenceFeatureExtractor", str], *args, **kwargs):
+    def __init__(self, model, tokenizer, feature_extractor: Union["SequenceFeatureExtractor", str], *args, **kwargs):
```
Is that a bit backwards-breaking? I think it's totally fine, as most people probably just use the `pipeline` function anyway.
It is breaking in some sense, yes. Calls like `AutomaticSpeechRecognitionPipeline(feature_extractor, model, tokenizer)` will break. But the regular pipeline has the call signature `(model, tokenizer, feature_extractor)`, so this makes things a little more consistent. It also makes the parent class responsible for doing `self.feature_extractor = feature_extractor` (so less risk of future discrepancy between pipelines). It's definitely something that should probably be a separate PR, making sure all PRs henceforth have a correct call signature. We could also use something like `MyPipeline(*, model, tokenizer, feature_extractor)`, which would disallow purely position-based arguments. That would make everything simpler to maintain IMO, while reducing confusion for users too: https://www.vegardstikbakke.com/python-keyword-only/

I will revisit the PR to exclude this change; it's definitely not good to just put it here. I would rather make a sweep over all classes and make a solid (tested) decision about parameter flow. (I think the current signature is just something that was written before the rewrite of pipelines and happened to end up that way, not a conscious decision.)
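For reference, a minimal sketch of the keyword-only pattern mentioned above (illustrative names, not the actual pipeline API):

```python
class MyPipeline:
    # The bare `*` makes every parameter after it keyword-only, so a call
    # like MyPipeline(feature_extractor, model, tokenizer) raises TypeError
    # instead of silently binding arguments to the wrong names.
    def __init__(self, *, model, tokenizer, feature_extractor):
        self.model = model
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor
```

With this, argument order at the call site no longer matters, and positional misuse fails loudly.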
```python
        return preprocess_params, {}, {}

        postprocess_params = {}
        if "raw_ctc" in kwargs:
```
What is `raw_ctc`?
I don't think `raw_ctc` is ever passed, no?
It is captured by `postprocess`; however, I don't think it's really needed. Most likely some local experiment that I ended up committing by mistake.

FYI, the goal was to enable the caller to fuse the chunks themselves, in the context of a stateless pipeline that would still be usable with a live microphone. Since the pipeline is stateless, it couldn't know the last used token in order to set it in the striding area (so it would get properly discarded when doing CTC decoding).

But as I found out, a purely stateless pipeline incurs way too much network overhead to be viable (the current script settings would mean 4 MB/s of bandwidth, vs 64 kB/s with a stateful connection), so a stateful connection it is, and we can ignore `raw_ctc` altogether :).
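For intuition, rough back-of-the-envelope arithmetic behind those figures, assuming 16 kHz mono float32 audio; the window size and update rate are guessed demo settings, not values from the PR:

```python
SAMPLING_RATE = 16_000   # Hz, mono
BYTES_PER_SAMPLE = 4     # float32

# Stateful connection: only the newly captured audio is sent,
# so bandwidth is just the raw data rate.
stateful_bps = SAMPLING_RATE * BYTES_PER_SAMPLE  # 64_000 B/s = 64 kB/s

# Stateless pipeline: every partial update must re-send the whole
# accumulated window, since the server keeps no memory of prior chunks.
# With an assumed 5 s window re-sent 10 times per second:
window_s, updates_per_s = 5, 10
stateless_bps = SAMPLING_RATE * BYTES_PER_SAMPLE * window_s * updates_per_s
# ~3.2 MB/s, the same order of magnitude as the quoted 4 MB/s
```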
Removed it from current PR, it also removes the other issue.
It was used for the 2nd demo btw, where it was live with perfect transcription. But we will settle for the updated second demo, which might contain erroneously duplicated letters (due to the chunking boundaries). The demo still looks & feels nice IMO.
Tried it out and it works very well!

More or less good for merge for me! Just a bit confused about the `raw_ctc` kwarg. Is that used / passed anywhere? What is meant exactly by "raw"?
Just a couple of docstring suggestions, otherwise - very cool way of handling partial live chunks, LGTM!
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
```
inputs (`np.ndarray` or `bytes` or `str` or `dict`):
    The inputs is either:
        - `str` that is the filename of the audio file. The file will be read at the correct sampling
          rate to get the waveform using *ffmpeg*. This requires *ffmpeg* to be installed on the system.
        - `bytes`, supposed to be the content of an audio file, interpreted by *ffmpeg* in the same way.
        - (`np.ndarray` of shape (n,) of type `np.float32` or `np.float64`)
          Raw audio at the correct sampling rate (no further check will be done).
        - `dict` form can be used to pass raw audio sampled at an arbitrary `sampling_rate` and let
          this pipeline do the resampling. The dict must be in the format `{"sampling_rate": int,
          "raw": np.array}`, optionally with a `"stride": (left: int, right: int)` that asks the
          pipeline to ignore the first `left` samples and last `right` samples in decoding (but use
          them at inference time to provide more context to the model). Only use `stride` with CTC
          models.
```
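A hedged usage sketch of the `dict` form described above; the `build_asr_input` helper and the validation it performs are illustrative, not part of the pipeline API:

```python
import numpy as np

def build_asr_input(raw, sampling_rate, stride=None):
    """Assemble the dict input form: raw audio plus its sampling rate,
    with an optional (left, right) stride in samples for CTC models."""
    inputs = {"sampling_rate": sampling_rate, "raw": raw}
    if stride is not None:
        left, right = stride
        # Sanity check (assumed): the stride must leave some samples to decode.
        if left + right >= len(raw):
            raise ValueError("stride cannot cover the entire input")
        inputs["stride"] = (left, right)
    return inputs

# Usage, assuming an instantiated ASR pipeline `pipe`:
#   audio = np.zeros(4000, dtype=np.float32)  # e.g. 0.5 s at 8 kHz
#   pipe(build_asr_input(audio, sampling_rate=8000, stride=(800, 800)))
```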
I know we don't have a good check that will tell you the doc is failing yet, but this docstring was pretty obviously badly formatted.
Please pay closer attention before merging PRs :-)
Could something have happened in `quality`? Will pay more attention in the future though.
I think the styling script only exacerbated a wrong syntax (it's leaving the fixed docstring alone).
Hey @Narsil, I think the PR broke some slow tests. Could you take a look maybe? :-)
I can't reproduce. I did have an issue with old
Fixed it :-)
…ce#15046)

* Adding support for `microphone` streaming within pipeline (see the PR description above for details).
* Post rebase fix (need to let super handle more logic, reorder args.)
* Update docstrings
* Docstring format.
* Remove print.
* Prevent flow of `input_values`.
* Fixing `stride` too.
* Fixing the PR by removing `raw_ctc`.
* Better docstrings.
* Fixing init.
* Update src/transformers/pipelines/audio_utils.py
* Update tests/test_pipelines_automatic_speech_recognition.py
* Quality.

Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
- Uses `ffmpeg` to get microphone data.
- Makes sure alignment is made to `size_of_sample`.
- Works by sending `{"raw": ..data.., "stride": (n, left, right), "partial": bool, "sampling_rate": sampling_rate}` directly to the pipeline, enabling streaming of partial results while still getting inference.
- Lets `partial` information flow through the pipeline to enable the caller to get it back and choose whether to display the text.
- ~~The striding reconstitution is bound to have errors since CTC does not keep previous state. Currently most of the errors are that we don't know if there's a space or not between two chunks. Since we have some left striding info, we could use that during decoding to choose what to do with those spaces and maybe even extra letters (if the stride is long enough, it's bound to cover at least a few symbols).~~ Fixed by using intelligent replacement on the dropped tokens.

2nd demo, better IMO but low-level (requires curses on UNIX-like systems, does not work on Windows variants):