
int8 quantized TTS model slower than fp32 #575

Open
martinshkreli opened this issue Feb 7, 2024 · 10 comments

Comments

@martinshkreli

(myenv) ubuntu@152:~/sherpa-onnx/python_api_examples$ python3 test.py
Elapsed: 0.080
Saved sentence_0.wav.
Elapsed: 0.085
Saved sentence_1.wav.
Elapsed: 0.080
Saved sentence_2.wav.
Elapsed: 0.074
Saved sentence_3.wav.
Elapsed: 0.054
Saved sentence_4.wav.
Elapsed: 0.081
Saved sentence_5.wav.
Elapsed: 0.067

(myenv) ubuntu@152-69-195-75:~/sherpa-onnx/python_api_examples$ python3 test.py
Elapsed: 19.561
Saved sentence_0.wav.
Elapsed: 26.432
Saved sentence_1.wav.
Elapsed: 27.989
Saved sentence_2.wav.
Elapsed: 23.956
Saved sentence_3.wav.
Elapsed: 11.361
Saved sentence_4.wav.
Elapsed: 27.825
Saved sentence_5.wav.
Elapsed: 19.567

Is there any special flag to set to use int8? (The first run above is with the fp32 model; the second is with the int8 model.)

@danpovey
Collaborator

danpovey commented Feb 7, 2024

Fangjun will get back to you about it, but: hi, Martin Shkreli!
We might need more hardware info and details about what differed between those two runs.

@csukuangfj
Collaborator

@martinshkreli

Could you describe how you got the int8 models?

@martinshkreli
Author

martinshkreli commented Feb 12, 2024

Hi guys, thanks again for the wonderful repo. I followed this link to download the model:
https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#download-the-model

Then I used that file (vits-ljs.int8.onnx) for inference in the Python script (offline-tts.py). This was on an 8xA100 instance.
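
A minimal sketch of what such a timing script could look like, assuming the sherpa-onnx Python API used by offline-tts.py (sherpa_onnx.OfflineTtsConfig, OfflineTtsVitsModelConfig, OfflineTts); the paths and sentences below are placeholders, not the exact test.py:

import time

import sherpa_onnx
import soundfile as sf

# Placeholder paths; the vits-ljs download ships with lexicon.txt and tokens.txt.
config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="./vits-ljs/vits-ljs.int8.onnx",  # or ./vits-ljs/vits-ljs.onnx for fp32
            lexicon="./vits-ljs/lexicon.txt",
            tokens="./vits-ljs/tokens.txt",
        ),
        num_threads=1,
        provider="cpu",
    ),
)
tts = sherpa_onnx.OfflineTts(config)

sentences = ["The first placeholder sentence.", "The second placeholder sentence."]
for i, text in enumerate(sentences):
    start = time.time()
    audio = tts.generate(text, sid=0, speed=1.0)
    print(f"Elapsed: {time.time() - start:.3f}")
    sf.write(f"sentence_{i}.wav", audio.samples, samplerate=audio.sample_rate, subtype="PCM_16")
    print(f"Saved sentence_{i}.wav.")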

@martinshkreli
Author

@martinshkreli

> Could you describe how you got the int8 models?

Hi Fangjun, I just wanted to try to get your attention one more time; sorry if I am being annoying!

@csukuangfj
Collaborator

The int8 model is obtained via the following code:

from onnxruntime.quantization import QuantType, quantize_dynamic

# filename is the float32 model; filename_int8 is the quantized output path.
quantize_dynamic(
    model_input=filename,
    model_output=filename_int8,
    weight_type=QuantType.QUInt8,
)

Note that it uses

weight_type=QuantType.QUInt8,

It is a known issue with onnxruntime that QUInt8 (unsigned int8) quantization is slower.

For instance, if you search on Google, you can find similar issues reported against onnxruntime.
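
For anyone who wants to experiment, here is a minimal sketch of re-quantizing the float32 model with signed QInt8 weights instead; the filenames are placeholders, and whether this is actually faster depends on the CPU and onnxruntime version:

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="vits-ljs.onnx",         # original float32 model
    model_output="vits-ljs.qint8.onnx",  # hypothetical output name
    weight_type=QuantType.QInt8,         # signed int8 weights instead of QUInt8
)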

@danpovey
Collaborator

danpovey commented Feb 17, 2024 via email

@csukuangfj
Collaborator

The int8 model mentioned in this issue is about 4x smaller in file size than the float32 one.

If memory matters, then the int8 model is preferred.
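
A quick way to verify the size difference locally (the paths are placeholders for wherever the models were downloaded):

import os

# Placeholder paths to the downloaded float32 and int8 models.
for name in ("vits-ljs/vits-ljs.onnx", "vits-ljs/vits-ljs.int8.onnx"):
    print(f"{name}: {os.path.getsize(name) / 1e6:.1f} MB")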

@beqabeqa473

Hi @csukuangfj, do you know how to optimize the speed of an int8 model? I was experimenting with it several months ago, but I was not able to convert to QInt8, and QUInt8 is really slow on CPU.

@nshmyrev
Contributor

nshmyrev commented Apr 9, 2024

You don't need to optimize speed; you need to pick an MB-iSTFT VITS model. They are an order of magnitude faster than raw VITS with the same quality.

@smallbraingames

> You don't need to optimize speed; you need to pick an MB-iSTFT VITS model. They are an order of magnitude faster than raw VITS with the same quality.

Where can we find these models?
