This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

CPU Dynamic Quantization #20

Closed
MiscellaneousStuff opened this issue Feb 9, 2023 · 5 comments

Comments


MiscellaneousStuff commented Feb 9, 2023

Would it be possible for you guys to add an option to enable dynamic quantization of the model when it's being run on a CPU? This would greatly improve the runtime performance of the OpenAI Whisper model (CPU-only), with minimal to no loss in transcription accuracy.

The benchmarks for this are available here.

The implementation only requires adding a few lines of code using features which are already built into PyTorch.

Implementation

Quantization of the Whisper model requires changing the Linear()
layers within the model to nn.Linear(). This is because you need
to specify which layer types to dynamically quantize, for example:

import torch

# Quantize the weights of every nn.Linear layer to int8;
# activations are quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

However, the Whisper model is designed to be adaptable, i.e.
it can run at different precisions, so its Linear() layer contains
custom code to account for this. That custom code is not required for
the quantized model. You can either change the Linear() layers in
"/whisper/whisper/model.py" yourself (i.e. create a fork of OpenAI-Whisper
which would be compatible with future merges), or you can use
mine from here.
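A minimal sketch of what the call above does, using a toy module in place of Whisper (the stand-in model and its shapes are illustrative, not code from the fork):

```python
import torch
import torch.nn as nn

# Toy stand-in for a network built from plain nn.Linear layers,
# as the patched model.py would be after the Linear() swap
model_fp32 = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

# Replace every nn.Linear with a dynamically quantized int8 version
quantized = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for inference
out = quantized(torch.randn(1, 16))
print(out.shape)
```

The module types in `quantized` change (the `nn.Linear` layers become dynamically quantized equivalents), but the forward interface is unchanged, which is why no other code in the app needs to know about the swap.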


hayabhay commented Feb 9, 2023

Could this be done by swapping the whisper packages underneath?
-- pip install openai-whisper
++ pip install git+https://github.com/MiscellaneousStuff/whisper.git

@MiscellaneousStuff
Author

Yep. That fork is exactly the same as the original but swaps the Linear() layer for nn.Linear(). However, it also means that anyone wanting to run the model at half precision on GPU won't be able to do so if the app only uses that custom whisper module, which is intended for dynamic quantisation on CPU.
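One way around that limitation would be to branch on the target device at load time. A hypothetical sketch (the `prepare_model` helper and the toy module are illustrative; neither package provides them):

```python
import torch
import torch.nn as nn

def prepare_model(model: nn.Module, device: str) -> nn.Module:
    """FP16 on GPU, dynamic int8 quantization on CPU (hypothetical helper)."""
    if device == "cuda":
        # GPU path: keep the stock half-precision behaviour
        return model.half().to(device)
    # CPU path: quantize nn.Linear layers; this only picks up layers
    # if the model uses plain nn.Linear (i.e. the fork's model.py)
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

# Toy stand-in for a loaded Whisper model
toy = nn.Sequential(nn.Linear(8, 4))
cpu_model = prepare_model(toy, "cpu")
out = cpu_model(torch.randn(2, 8))
```

With a branch like this, GPU users keep half precision while CPU users get the quantized path, so a single codebase could serve both, assuming the fork is API-compatible with stock whisper.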


hayabhay commented Feb 9, 2023

Great! In that case, I'll add a note to the README telling users to swap out whisper for your fork if they intend to run it on a CPU-only machine. Thanks!

@hayabhay hayabhay closed this as completed Feb 9, 2023

hayabhay commented Feb 9, 2023

Updated Readme here: 0431dee


menelic commented May 24, 2023

Doing what is recommended in the Readme does not work:

Note: If you're using a CPU-only machine, your runtime can be sped up by using the quantization implemented by @MiscellaneousStuff: swap out pip install openai-whisper from requirements.txt and replace it with their fork, pip install git+https://github.com/MiscellaneousStuff/whisper.git (see related discussion here - #20)

What exactly has to be put in requirements.txt?
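For reference, the swap would presumably amount to replacing the openai-whisper line in requirements.txt with the fork's git URL, since pip requirements files accept git+https URLs directly:

```
# requirements.txt
# openai-whisper        <- remove this line
git+https://github.com/MiscellaneousStuff/whisper.git
```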
