Consider Supporting CTranslate2 for faster inference #40

Open
kamranjon opened this issue Feb 23, 2023 · 15 comments
Labels
enhancement New feature or request

Comments

@kamranjon

I recently learned about faster-whisper, which uses the CTranslate2 library for faster inference. It seems you need to convert the whisper models first, but it claims the same accuracy with a 4x speed improvement and reduced memory usage on both CPU and GPU.

I'm not sure if it would be feasible to support this but wanted to bring it up in case it was of interest. Feel free to close this issue if it is not possible.
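
For context, here is roughly what using faster-whisper looks like (a minimal sketch based on the faster-whisper README; the model name, device and audio path are placeholders, and the exact API may vary by version):

# Sketch based on the faster-whisper README; options and defaults may vary by version.
from faster_whisper import WhisperModel

# Load a CTranslate2-converted Whisper model (pre-converted models are also distributed).
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")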

@Jeronymous
Member

Thank you @kamranjon for letting us know 👍
I knew about whisper.cpp (which unfortunately does not work on GPU), but I did not (yet) know about faster-whisper.
It's definitely worth having a look.

If the model has the same interface as the Whisper model, it's actually straightforward to test with whisper_timestamped.

I'm just a bit worried because this repo seems to re-implement the decoding from whisper, so it won't follow further improvements of whisper (I know they are still fixing some possible bugs, like an infinite loop when the timestamp prediction is stuck on <|0.00|>).
Also, the code is surprisingly short. So I am wondering:
do you know if it gives the same results as whisper (up to some random seeding issues...)?

@RaulKite

Have you seen this way to speed up inference through Hugging Face's new method?

Automatic speech recognition pipeline 🚀
The prediction of timestamps is also available as part of the pipeline. It comes with a new feature: batched prediction. Long audio files can now be processed in a batched manner. This is made available by the _find_timestamp_sequence function, which is able to merge chunks of audio together based on timing information and timestamp prediction.
In order to run the pipeline in batches, you must enable chunking by setting chunk_length_s = 30, as well as decide on a batch_size. This should allow for significant performance gains, with little loss in WER, depending on the hyperparameters you define.
The recommended parameters are chunk_length_s=30, stride_length_s=[6,0]. If you want to learn more about how these parameters can affect the final results, feel free to refer to the blog post on chunking for ASR.
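
For reference, a rough sketch of what this looks like with the transformers ASR pipeline (the model name, audio path and batch size are placeholders; parameter support may vary by transformers version):

# Sketch of the chunked/batched transformers ASR pipeline; adjust to your transformers version.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # placeholder model
    device=0,                      # GPU index, or -1 for CPU
)

result = asr(
    "audio.mp3",                   # placeholder audio file
    chunk_length_s=30,             # enable chunking as recommended above
    stride_length_s=[6, 0],        # overlap between consecutive chunks
    batch_size=8,                  # decode several chunks at once
    return_timestamps=True,        # also return timestamp information
)
print(result["text"])
print(result["chunks"])            # per-chunk text with timestamps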

@Jeronymous
Member

So many things are happening around Whisper, it's becoming hard to follow 😅

Thank you for chiming in @RaulKite. Do you have a link?

I've played a bit with HuggingFace's transformers.Whisper* classes, but it was hard to recover the same accuracy as the OpenAI implementation. I mean I could not get a comparable WER with a simple implementation using whisper in transformers (although it does unlock batch processing, yes).
But it was only a quick try, so maybe I missed something.

@RaulKite

@ronyfadel

I'm just a bit worried because this repo seems to re-implement the decoding from whisper, so it won't follow further improvement of whisper

cc @guillaumekln

@guillaumekln

guillaumekln commented Feb 26, 2023

Hello,

Yes, faster-whisper is a complete reimplementation of the Whisper model and transcription loop. That is why it can be much more efficient (while giving the same results in most cases). We are watching the main repo closely and will port any new improvements.

However, it is currently not compatible with extensions such as whisper-timestamped. As far as I understand, whisper-timestamped requires access to some model layers to get the attention weights or output logits. These outputs are currently not exposed to Python since most of the execution happens in CTranslate2, which is a C++ library. Some additional work is needed to return all these intermediate values, but it is not possible at the moment.

@ronyfadel

ronyfadel commented Feb 26, 2023

Would it be possible to run the transcription through faster-whisper, and do all the post-processing that whisper-timestamped is doing using the regular whisper model? I reckon it'd still be faster than using vanilla whisper.

@kamranjon
Author

kamranjon commented Feb 26, 2023

@ronyfadel unfortunately it seems the answer is no; it requires information that CTranslate2 does not surface, so additional inference would have to be run on the regular whisper model to obtain that information, and overall it would take more time. In the future, if CTranslate2 surfaces some of these outputs through its Python API, it might be possible, but for now it is not feasible.

@ronyfadel

What's the information that CTranslate2 doesn't surface, so that I understand better?

@kamranjon
Author

@ronyfadel

As far as I understand, whisper-timestamped requires access to some model layers to get the attention weights or output logits. These outputs are currently not exposed to Python since most of the execution happens in CTranslate2, which is a C++ library. Some additional work is needed to return all these intermediate values, but it is not possible at the moment.

@ronyfadel

@ronyfadel

As far as I understand, whisper-timestamped requires access to some model layers to get the attention weights or output logits. These outputs are currently not exposed to Python since most of the execution happens in CTranslate2, which is a C++ library. Some additional work is needed to return all these intermediate values, but it is not possible at the moment.

You missed my comment.

I'm asking if the post-processing can be based on the vanilla whisper weights. Meaning: fast transcription using faster-whisper and slow alignment based on vanilla whisper.

@Jeronymous
Member

Yes @ronyfadel, that's a good suggestion. I think that with little modification, whisper-timestamped could decouple the transcription part from the alignment part.
I'll look into that, with faster-whisper in mind.

What's the information that CTranslate2 doesn't surface, so that I understand better?

The most critical piece seems to be the cross-attention weights, which need to be accessed to do the alignment.
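
For intuition, here is a minimal, self-contained sketch of how cross-attention weights can drive alignment, in the spirit of the DTW-based approach used in openai/whisper (the attention matrix below is random placeholder data, and whisper-timestamped's actual implementation differs in the details):

# Illustrative only: monotonic (DTW-style) alignment of text tokens to audio frames
# from a cross-attention matrix. Placeholder data; not the whisper-timestamped code.
import numpy as np

def dtw_path(cost):
    """Return the minimum-cost monotonic path over a (num_tokens, num_frames) cost matrix."""
    n, m = cost.shape
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = cost[i - 1, j - 1] + min(dp[i - 1, j - 1], dp[i - 1, j], dp[i, j - 1])
    # Backtrack from the bottom-right corner to recover the token/frame pairing.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([dp[i - 1, j - 1], dp[i - 1, j], dp[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy example: 4 text tokens attending over 20 audio frames (random placeholder weights).
attention = np.random.default_rng(0).random((4, 20))
for token_idx, frame_idx in dtw_path(-attention):  # higher attention = lower cost
    print(f"token {token_idx} <- frame {frame_idx}")  # frame indices map to time offsets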

@ronyfadel

@Jeronymous bingo! (I'm still catching up and diving into the codebase).

all_hooks = []
# Hook the encoder's first conv layer to capture the input MFCC features.
all_hooks.append(model.encoder.conv1.register_forward_hook(hook_mfcc))
# Hook the decoder's token embedding to capture the input tokens.
all_hooks.append(model.decoder.token_embedding.register_forward_hook(hook_input_tokens))
nblocks = len(model.decoder.blocks)
j = 0
# Hook the cross-attention of the top decoder blocks to collect attention weights.
for i, block in enumerate(model.decoder.blocks):
    if i < nblocks - word_alignement_most_top_layers:
        continue
    all_hooks.append(
        block.cross_attn.register_forward_hook(
            lambda layer, ins, outs, index=j: hook_attention_weights(layer, ins, outs, index))
    )
    j += 1
# Hook the decoder's final layer norm, which is used to recover the output logits.
if compute_word_confidence or no_speech_threshold is not None:
    all_hooks.append(model.decoder.ln.register_forward_hook(hook_output_logits))

Without these hooks in CTranslate2 (and exposing the cross attention weights), I'm not sure how I can move forward :)

@Jeronymous Jeronymous added the enhancement New feature or request label Mar 8, 2023
@guillaumekln

guillaumekln commented Mar 9, 2023

While I don't plan on making the library compatible with these hooks, I'm working on exposing an align method which can return the text/time alignments as implemented in openai/whisper:

OpenNMT/CTranslate2#1120

I also have an experimental integration in faster-whisper that enables word-level timestamps. Follow these installation instructions if you want to try it out.

EDIT: word-level timestamps are now available on the master branch of faster-whisper.
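
For anyone who wants to try it, the usage is roughly the following (a sketch based on the faster-whisper README; exact options may differ by version):

# Sketch based on the faster-whisper README; exact API may differ by version.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f} -> {word.end:.2f}] {word.word}")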

@erturkdotgg

No, no and please NO. CTranslate2 requires Nvidia cards and it doesn't have ROCm (AMD) support. This is the only modification that I can use with my AMD card, so please do not bring in CTranslate2 support.
