
int8 quantized TTS model slower than fp32 #575

Open
martinshkreli opened this issue Feb 7, 2024 · 10 comments

Comments

@martinshkreli

(myenv) ubuntu@152:~/sherpa-onnx/python_api_examples$ python3 test.py
Elapsed: 0.080
Saved sentence_0.wav.
Elapsed: 0.085
Saved sentence_1.wav.
Elapsed: 0.080
Saved sentence_2.wav.
Elapsed: 0.074
Saved sentence_3.wav.
Elapsed: 0.054
Saved sentence_4.wav.
Elapsed: 0.081
Saved sentence_5.wav.
Elapsed: 0.067

(myenv) ubuntu@152-69-195-75:~/sherpa-onnx/python_api_examples$ python3 test.py
Elapsed: 19.561
Saved sentence_0.wav.
Elapsed: 26.432
Saved sentence_1.wav.
Elapsed: 27.989
Saved sentence_2.wav.
Elapsed: 23.956
Saved sentence_3.wav.
Elapsed: 11.361
Saved sentence_4.wav.
Elapsed: 27.825
Saved sentence_5.wav.
Elapsed: 19.567

Is there any special flag to set to use int8? (The first run above is with the fp32 model; the second is with the int8 model.)

@danpovey
Collaborator

danpovey commented Feb 7, 2024

Fangjun will get back to you about it, but: hi, Martin Shkreli!
We might need more hardware info and details about what differed between those two runs.

@csukuangfj
Collaborator

@martinshkreli

Could you describe how you got the int8 models?

@martinshkreli
Author

martinshkreli commented Feb 12, 2024

Hi guys, thanks again for the wonderful repo. I followed this link to download the model:
https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#download-the-model

Then I used that file (vits-ljs.int8.onnx) for inference in the Python script (offline-tts.py). This was on an 8xA100 instance.
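
A minimal sketch of what such a timing script could look like, assuming the sherpa-onnx Python API used by offline-tts.py (sherpa_onnx.OfflineTtsConfig, OfflineTtsVitsModelConfig, OfflineTts); the paths and sentences below are placeholders, not the exact test.py:

import time

import sherpa_onnx
import soundfile as sf

# Placeholder paths; the vits-ljs download ships with lexicon.txt and tokens.txt.
config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="./vits-ljs/vits-ljs.int8.onnx",  # or ./vits-ljs/vits-ljs.onnx for fp32
            lexicon="./vits-ljs/lexicon.txt",
            tokens="./vits-ljs/tokens.txt",
        ),
        num_threads=1,
        provider="cpu",
    ),
)
tts = sherpa_onnx.OfflineTts(config)

sentences = ["The first placeholder sentence.", "The second placeholder sentence."]
for i, text in enumerate(sentences):
    start = time.time()
    audio = tts.generate(text, sid=0, speed=1.0)
    print(f"Elapsed: {time.time() - start:.3f}")
    sf.write(f"sentence_{i}.wav", audio.samples, samplerate=audio.sample_rate, subtype="PCM_16")
    print(f"Saved sentence_{i}.wav.")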

@martinshkreli
Author

@martinshkreli

> Could you describe how you got the int8 models?

Hi Fangjun, I just wanted to try to get your attention one more time; sorry if I am being annoying!

@csukuangfj
Collaborator

The int8 model is obtained via the following code:

from onnxruntime.quantization import QuantType, quantize_dynamic

# filename is the float32 model; filename_int8 is the quantized output path.
quantize_dynamic(
    model_input=filename,
    model_output=filename_int8,
    weight_type=QuantType.QUInt8,
)

Note that it uses

weight_type=QuantType.QUInt8,

It is a known issue with onnxruntime that QUInt8 (unsigned int8) quantization is slower.

For instance, if you search on Google, you can find similar issues reported against onnxruntime.
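
For anyone who wants to experiment, here is a minimal sketch of re-quantizing the float32 model with signed QInt8 weights instead; the filenames are placeholders, and whether this is actually faster depends on the CPU and onnxruntime version:

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="vits-ljs.onnx",         # original float32 model
    model_output="vits-ljs.qint8.onnx",  # hypothetical output name
    weight_type=QuantType.QInt8,         # signed int8 weights instead of QUInt8
)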

@danpovey
Collaborator

danpovey commented Feb 17, 2024 via email

@csukuangfj
Collaborator

The int8 model mentioned in this issue is about 4x smaller in file size than the float32 one.

If memory matters, then the int8 model is preferred.
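
A quick way to verify the size difference locally (the paths are placeholders for wherever the models were downloaded):

import os

# Placeholder paths to the downloaded float32 and int8 models.
for name in ("vits-ljs/vits-ljs.onnx", "vits-ljs/vits-ljs.int8.onnx"):
    print(f"{name}: {os.path.getsize(name) / 1e6:.1f} MB")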

@beqabeqa473

Hi @csukuangfj, do you know how to optimize the speed of an int8 model? I was experimenting with it several months ago, but I was not able to convert to QInt8, and QUInt8 is really slow on CPU.

@nshmyrev
Contributor

nshmyrev commented Apr 9, 2024

You don't need to optimize speed; you need to pick an MB-iSTFT VITS model. They are an order of magnitude faster than raw VITS with the same quality.

@smallbraingames

> You don't need to optimize speed; you need to pick an MB-iSTFT VITS model. They are an order of magnitude faster than raw VITS with the same quality.

Where can we find these models?
