Consider Supporting CTranslate2 for faster inference #40

Open
kamranjon opened this issue Feb 23, 2023 · 15 comments
Labels
enhancement New feature or request

Comments

@kamranjon

I recently learned about faster-whisper, which uses the CTranslate2 library for faster inference. It seems you need to convert the whisper models first, but it claims the same accuracy with a 4x speed improvement and reduced memory usage on both CPU and GPU.

I'm not sure if it would be feasible to support this but wanted to bring it up in case it was of interest. Feel free to close this issue if it is not possible.
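
For context, here is roughly what using faster-whisper looks like (a minimal sketch based on the faster-whisper README; the model name, device and audio path are placeholders, and the exact API may vary by version):

# Sketch based on the faster-whisper README; options and defaults may vary by version.
from faster_whisper import WhisperModel

# Load a CTranslate2-converted Whisper model (pre-converted models are also distributed).
model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")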

@Jeronymous
Member

Thank you @kamranjon for letting us know 👍
I knew about whisper.cpp (which unfortunately does not work on GPU), but I did not (yet) know about faster-whisper.
It's definitely worth having a look.

If the model has the same interface as the Whisper model, it's actually straightforward to test with whisper_timestamped.

I'm just a bit worried because this repo seems to re-implement the decoding from whisper, so it won't follow further improvements of whisper (I know they are still fixing some possible bugs, like an infinite loop when the timestamp prediction is stuck on <|0.00|>).
Also, the code is surprisingly short. So I am wondering:
do you know if it gives the same results as whisper (up to some random seeding issues...)?

@RaulKite

Have you seen this way to speed up inference through Hugging Face's new method?

Automatic speech recognition pipeline 🚀
The prediction of timestamps is also available as part of the pipeline. It comes with a new feature: batched prediction. Long audio files can now be processed in a batched manner. This is made available by the _find_timestamp_sequence function, which is able to merge chunks of audio together based on timing information and timestamp prediction.
In order to run the pipeline in batches, you must enable chunking by setting chunk_length_s = 30, as well as decide on a batch_size. This should allow for significant performance gains, with little loss in WER, depending on the hyperparameters you define.
The recommended parameters are chunk_length_s=30, stride_length_s=[6,0]. If you want to learn more about how these parameters can affect the final results, feel free to refer to the blog post on chunking for ASR.
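
For reference, a rough sketch of what this looks like with the transformers ASR pipeline (the model name, audio path and batch size are placeholders; parameter support may vary by transformers version):

# Sketch of the chunked/batched transformers ASR pipeline; adjust to your transformers version.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # placeholder model
    device=0,                      # GPU index, or -1 for CPU
)

result = asr(
    "audio.mp3",                   # placeholder audio file
    chunk_length_s=30,             # enable chunking as recommended above
    stride_length_s=[6, 0],        # overlap between consecutive chunks
    batch_size=8,                  # decode several chunks at once
    return_timestamps=True,        # also return timestamp information
)
print(result["text"])
print(result["chunks"])            # per-chunk text with timestamps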

@Jeronymous
Member

So many things are happening around Whisper, it's becoming hard to follow 😅

Thank you for chiming in @RaulKite. Do you have a link?

I've played a bit with HuggingFace's transformers.Whisper* classes, but it was hard to recover the same accuracy as the OpenAI implementation. I mean I could not get a comparable WER with a simple implementation using whisper in transformers (although it does unlock batch processing, yes).
But it was only a quick try, so maybe I missed something.

@RaulKite

@ronyfadel

I'm just a bit worried because this repo seems to re-implement the decoding from whisper, so it won't follow further improvement of whisper

cc @guillaumekln

@guillaumekln

guillaumekln commented Feb 26, 2023

Hello,

Yes, faster-whisper is a complete reimplementation of the Whisper model and transcription loop. That is why it can be much more efficient (while giving the same results in most cases). We are watching the main repo closely and will port any new improvements.

However, it is currently not compatible with extensions such as whisper-timestamped. As far as I understand, whisper-timestamped requires access to some model layers to get the attention weights or output logits. These outputs are currently not exposed to Python since most of the execution happens in CTranslate2, which is a C++ library. Some additional work is needed to return all these intermediate values, but it is not possible at the moment.

@ronyfadel

ronyfadel commented Feb 26, 2023

Would it be possible to run the transcription through faster-whisper, and do all the post-processing that whisper-timestamped is doing using the regular whisper model? I reckon it'd still be faster than using vanilla whisper.

@kamranjon
Author

kamranjon commented Feb 26, 2023

@ronyfadel unfortunately it seems the answer is no; it requires information that CTranslate2 does not surface, so additional inference would have to be run on the regular whisper model to obtain that information, and overall it would take more time. In the future, if CTranslate2 surfaces some of these outputs through its Python API, it might be possible, but for now it is not feasible.

@ronyfadel

What's the information that CTranslate2 doesn't surface, so that I understand better?

@kamranjon
Author

@ronyfadel

As far as I understand, whisper-timestamped requires access to some model layers to get the attention weights or output logits. These outputs are currently not exposed to Python since most of the execution happens in CTranslate2, which is a C++ library. Some additional work is needed to return all these intermediate values, but it is not possible at the moment.

@ronyfadel

@ronyfadel

As far as I understand, whisper-timestamped requires access to some model layers to get the attention weights or output logits. These outputs are currently not exposed to Python since most of the execution happens in CTranslate2, which is a C++ library. Some additional work is needed to return all these intermediate values, but it is not possible at the moment.

You missed my comment.

I'm asking if the post-processing can be based on the vanilla whisper weights. Meaning: fast transcription using faster-whisper and slow alignment based on vanilla whisper.

@Jeronymous
Member

Yes @ronyfadel, that's a good suggestion. I think that with little modification, whisper-timestamped could decouple the transcription part from the alignment part.
I'll look into that, with faster-whisper in mind.

What's the information that CTranslate2 doesn't surface, so that I understand better?

The most critical piece seems to be the cross-attention weights, which need to be accessed to do the alignment.
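
For intuition, here is a minimal, self-contained sketch of how cross-attention weights can drive alignment, in the spirit of the DTW-based approach used in openai/whisper (the attention matrix below is random placeholder data, and whisper-timestamped's actual implementation differs in the details):

# Illustrative only: monotonic (DTW-style) alignment of text tokens to audio frames
# from a cross-attention matrix. Placeholder data; not the whisper-timestamped code.
import numpy as np

def dtw_path(cost):
    """Return the minimum-cost monotonic path over a (num_tokens, num_frames) cost matrix."""
    n, m = cost.shape
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = cost[i - 1, j - 1] + min(dp[i - 1, j - 1], dp[i - 1, j], dp[i, j - 1])
    # Backtrack from the bottom-right corner to recover the token/frame pairing.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([dp[i - 1, j - 1], dp[i - 1, j], dp[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy example: 4 text tokens attending over 20 audio frames (random placeholder weights).
attention = np.random.default_rng(0).random((4, 20))
for token_idx, frame_idx in dtw_path(-attention):  # higher attention = lower cost
    print(f"token {token_idx} <- frame {frame_idx}")  # frame indices map to time offsets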

@ronyfadel

@Jeronymous bingo! (I'm still catching up and diving into the codebase).

all_hooks = []
# Hook the encoder's first conv layer to capture the input MFCC features.
all_hooks.append(model.encoder.conv1.register_forward_hook(hook_mfcc))
# Hook the decoder's token embedding to capture the input tokens.
all_hooks.append(model.decoder.token_embedding.register_forward_hook(hook_input_tokens))
nblocks = len(model.decoder.blocks)
j = 0
# Hook the cross-attention of the top decoder blocks to collect attention weights.
for i, block in enumerate(model.decoder.blocks):
    if i < nblocks - word_alignement_most_top_layers:
        continue
    all_hooks.append(
        block.cross_attn.register_forward_hook(
            lambda layer, ins, outs, index=j: hook_attention_weights(layer, ins, outs, index))
    )
    j += 1
# Hook the decoder's final layer norm, which is used to recover the output logits.
if compute_word_confidence or no_speech_threshold is not None:
    all_hooks.append(model.decoder.ln.register_forward_hook(hook_output_logits))

Without these hooks in CTranslate2 (and exposing the cross attention weights), I'm not sure how I can move forward :)

@Jeronymous Jeronymous added the enhancement New feature or request label Mar 8, 2023
@guillaumekln

guillaumekln commented Mar 9, 2023

While I don't plan on making the library compatible with these hooks, I'm working on exposing an align method which can return the text/time alignments as implemented in openai/whisper:

OpenNMT/CTranslate2#1120

I also have an experimental integration in faster-whisper that enables word-level timestamps. Follow these installation instructions if you want to try it out.

EDIT: word-level timestamps are now available on the master branch of faster-whisper.
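
For anyone who wants to try it, the usage is roughly the following (a sketch based on the faster-whisper README; exact options may differ by version):

# Sketch based on the faster-whisper README; exact API may differ by version.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f} -> {word.end:.2f}] {word.word}")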

@erturkdotgg

No, no and please NO. CTranslate2 requires Nvidia cards and it doesn't have ROCm (AMD) support. This is the only modification that I can use with my AMD card, so please do not bring in CTranslate2 support.
