Replies: 2 comments 3 replies
-
I want to know how to do streaming ASR.
-
Well, every model based on the Transformer architecture, like Whisper, must have a limited max_length for the sequence; otherwise it could, for example, run out of memory (the network is huge). But the Whisper codebase is actually quite good at processing long audio files (I have tested it). The approach, as far as I know, is based on chunking with strides (see: https://huggingface.co/blog/asr-chunking), so this is not a real limitation. The value of `N_FRAMES` determines the size of the input to the encoder, which is fixed by the architecture; you cannot change it.
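For example, the chunking-with-strides idea from the linked blog post can be tried through the Hugging Face `transformers` ASR pipeline. This is only a minimal sketch; the checkpoint name, file name, and stride values are illustrative choices, not anything prescribed by the Whisper repo:

```python
# Sketch of chunked long-form transcription via the Hugging Face ASR pipeline.
# chunk_length_s / stride_length_s follow the chunking-with-strides approach
# described in https://huggingface.co/blog/asr-chunking.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",   # illustrative checkpoint choice
    chunk_length_s=30,              # each chunk matches Whisper's 30-second window
    stride_length_s=(5, 5),         # overlap on both sides, merged after decoding
)

result = asr("long_audio.mp3")      # placeholder input file
print(result["text"])
```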
-
Hi! Running the code base from the README.md, I have tried to modify the hard-coded audio hyperparameters in `audio.py`, hoping to increase the default 30-second sample length. I noted that `N_FRAMES` must divide evenly (modulus of zero), but any change to `SAMPLE_RATE`, `HOP_LENGTH`, or `CHUNK_LENGTH` results in `model.py` complaining about an incorrect audio shape. How can I fix this?
Also, is it possible to perform `model.transcribe("audio.mp3")` with a `prompt`, e.g. via `DecodingOptions()`? See the sketch below for what I mean.