This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

CPU Dynamic Quantization #20

Closed
MiscellaneousStuff opened this issue Feb 9, 2023 · 5 comments

Comments


MiscellaneousStuff commented Feb 9, 2023

Would it be possible for you guys to add an option to enable dynamic quantization of the model when it's being run on a CPU? This would greatly improve the runtime performance of the OpenAI Whisper model (CPU-only), with minimal to no loss in transcription accuracy.

The benchmarks for this are available here.

The implementation only requires adding a few lines of code using features which are already built into PyTorch.

Implementation

Quantization of the Whisper model requires changing the Linear()
layers within the model to nn.Linear(). This is because you need
to specify which layer types to dynamically quantize, for example:

import torch

# Quantize the weights of every nn.Linear layer to int8;
# activations are quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

However, the Whisper model is designed to be adaptable, i.e.
it can run at different precisions, so its Linear() layer contains
custom code to account for this. That custom code is not required for
the quantized model. You can either change the Linear() layers in
"/whisper/whisper/model.py" yourself (i.e. create a fork of OpenAI-Whisper
which would be compatible with future merges), or you can use
mine from here.
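A minimal sketch of what the call above does, using a toy module in place of Whisper (the stand-in model and its shapes are illustrative, not code from the fork):

```python
import torch
import torch.nn as nn

# Toy stand-in for a network built from plain nn.Linear layers,
# as the patched model.py would be after the Linear() swap
model_fp32 = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

# Replace every nn.Linear with a dynamically quantized int8 version
quantized = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for inference
out = quantized(torch.randn(1, 16))
print(out.shape)
```

The module types in `quantized` change (the `nn.Linear` layers become dynamically quantized equivalents), but the forward interface is unchanged, which is why no other code in the app needs to know about the swap.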


hayabhay commented Feb 9, 2023

Could this be done by swapping the whisper packages underneath?
-- pip install openai-whisper
++ pip install git+https://github.com/MiscellaneousStuff/whisper.git

@MiscellaneousStuff
Author

Yep. That fork is exactly the same as the original but swaps the Linear() layer for nn.Linear(). However, it also means that anyone wanting to run the model at half precision on GPU won't be able to do so if the app only uses that custom whisper module, which is intended for dynamic quantisation on CPU.
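One way around that limitation would be to branch on the target device at load time. A hypothetical sketch (the `prepare_model` helper and the toy module are illustrative; neither package provides them):

```python
import torch
import torch.nn as nn

def prepare_model(model: nn.Module, device: str) -> nn.Module:
    """FP16 on GPU, dynamic int8 quantization on CPU (hypothetical helper)."""
    if device == "cuda":
        # GPU path: keep the stock half-precision behaviour
        return model.half().to(device)
    # CPU path: quantize nn.Linear layers; this only picks up layers
    # if the model uses plain nn.Linear (i.e. the fork's model.py)
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

# Toy stand-in for a loaded Whisper model
toy = nn.Sequential(nn.Linear(8, 4))
cpu_model = prepare_model(toy, "cpu")
out = cpu_model(torch.randn(2, 8))
```

With a branch like this, GPU users keep half precision while CPU users get the quantized path, so a single codebase could serve both, assuming the fork is API-compatible with stock whisper.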


hayabhay commented Feb 9, 2023

Great! In that case, I'll add a note to the README telling users to swap out whisper for your fork if they intend to run it on a CPU-only machine. Thanks!

@hayabhay hayabhay closed this as completed Feb 9, 2023

hayabhay commented Feb 9, 2023

Updated Readme here: 0431dee


menelic commented May 24, 2023

Doing what is recommended in the Readme does not work:

Note: If you're using a CPU-only machine, your runtime can be sped up by using the quantization implemented by @MiscellaneousStuff: swap out pip install openai-whisper from requirements.txt and replace it with their fork, pip install git+https://github.com/MiscellaneousStuff/whisper.git (see related discussion here - #20)

What exactly has to be put in requirements.txt?
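For reference, the swap would presumably amount to replacing the openai-whisper line in requirements.txt with the fork's git URL, since pip requirements files accept git+https URLs directly:

```
# requirements.txt
# openai-whisper        <- remove this line
git+https://github.com/MiscellaneousStuff/whisper.git
```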
