SpeechT5 ONNX support #1404

Merged: 18 commits into huggingface:main, Oct 18, 2023
Conversation

fxmarty (Collaborator) commented Sep 21, 2023

This PR adds support for exporting SpeechT5 to ONNX.

fxmarty (Collaborator, Author) commented Sep 21, 2023

Hi @xenova, a long-awaited one =) This PR is still missing tests, documentation, and KV cache support, but it is already in a good state. I'll finish it next week. For now I have only implemented the text-to-speech task, following transformers' generate_speech.

Working version: optimum-cli export onnx --model microsoft/speecht5_tts speecht5_onnx --model-kwargs '{"vocoder": "microsoft/speecht5_hifigan"}'
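
The export should also be reachable programmatically (a sketch, assuming main_export forwards model_kwargs the same way the --model-kwargs flag does):

from optimum.exporters.onnx import main_export

# Programmatic equivalent of the CLI command above.
main_export(
    model_name_or_path="microsoft/speecht5_tts",
    output="speecht5_onnx",
    model_kwargs={"vocoder": "microsoft/speecht5_hifigan"},
)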

Also left to do: aligning the -with-past and variant arguments.

xenova (Contributor) commented Sep 21, 2023

Wow, this is amazing, thanks so much @fxmarty! I've uploaded my model files here. I'll test it in transformers.js, and I'll update those files when the other options are available. I don't suppose you have any Python code I can use for testing, something similar to this?
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# ipt-350m is a decoder-only model, so ORTModelForCausalLM matches the text-generation task
session = ORTModelForCausalLM.from_pretrained('Xenova/ipt-350m', subfolder='onnx')
tokenizer = AutoTokenizer.from_pretrained('Xenova/ipt-350m')

generator_ort = pipeline(
    task="text-generation",
    model=session,
    tokenizer=tokenizer,
)

generator_ort('La nostra azienda')
# [{'generated_text': "La nostra azienda è specializzata nella vendita di prodotti per l'igiene orale e per la salute."}]

Or will ORTModelForTextToWaveform and/or ORTModelForTextToSpectrogram come later in this PR?

The speecht5 docs have a nice example here too.
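
For reference, that docs example boils down to roughly this PyTorch snippet (a sketch of the transformers API; the zero speaker embeddings are a placeholder I substituted, real usage loads x-vectors):

import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
speaker_embeddings = torch.zeros((1, 512))  # placeholder; real voices use x-vector embeddings

# generate_speech runs the autoregressive spectrogram loop and the vocoder in one call
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)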


Also, I had to downgrade to onnxruntime==1.15.1, since 1.16.0 gives this error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/transformers.js/scripts/convert.py", line 18, in <module>
    from onnxruntime.quantization import (
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/__init__.py", line 1, in <module>
    from .calibrate import (  # noqa: F401
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/calibrate.py", line 21, in <module>
    from .quant_utils import apply_plot, load_model_with_shape_infer, smooth_distribution
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/quantization/quant_utils.py", line 115, in <module>
    onnx_proto.TensorProto.FLOAT8E4M3FN: float8e4m3fn,
AttributeError: FLOAT8E4M3FN

I assume this is because I had onnx<1.14 installed, but just posting here in case.
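
If that is the cause, a quick check along these lines should confirm it (assuming, as the traceback suggests, that the float8 types only landed in onnx 1.14):

import onnx
from packaging import version

# onnxruntime 1.16's quantization module references TensorProto.FLOAT8E4M3FN,
# which older onnx releases do not define, hence the AttributeError above.
assert version.parse(onnx.__version__) >= version.parse("1.14.0"), (
    f"onnx {onnx.__version__} predates the float8 types onnxruntime 1.16 expects"
)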

fxmarty (Collaborator, Author) commented Sep 26, 2023

I'll wrap up this PR and add a python example :)

fxmarty (Collaborator, Author) commented Sep 26, 2023

@xenova something like this (not optimized at all). Does that work for you?

import onnxruntime as ort
import numpy as np
import soundfile as sf
from transformers import SpeechT5Processor

encoder_path = "/path/to/encoder_model.onnx"
decoder_path = "/path/to/decoder_model_merged.onnx"
postnet_and_vocoder_path = "/path/to/decoder_postnet_and_vocoder.onnx"

encoder = ort.InferenceSession(encoder_path, providers=["CPUExecutionProvider"])
decoder = ort.InferenceSession(decoder_path, providers=["CPUExecutionProvider"])
postnet_and_vocoder = ort.InferenceSession(postnet_and_vocoder_path, providers=["CPUExecutionProvider"])

# First decoder call: pass zero-length past key/values (sequence dim 0) as
# placeholders; the merged decoder takes its no-cache branch on this step.
def add_fake_pkv(inputs):
    shape = (1, 12, 0, 64)
    for i in range(6):
        inputs[f"past_key_values.{i}.encoder.key"] = np.zeros(shape).astype(np.float32)
        inputs[f"past_key_values.{i}.encoder.value"] = np.zeros(shape).astype(np.float32)
        inputs[f"past_key_values.{i}.decoder.key"] = np.zeros(shape).astype(np.float32)
        inputs[f"past_key_values.{i}.decoder.value"] = np.zeros(shape).astype(np.float32)
    return inputs

# Later decoder calls: reuse the cross-attention KV computed on the first pass
# and the self-attention KV returned by the previous step.
def add_real_pkv(inputs, previous_outputs, cross_attention_pkv):
    for i in range(6):
        inputs[f"past_key_values.{i}.encoder.key"] = cross_attention_pkv[f"present.{i}.encoder.key"]
        inputs[f"past_key_values.{i}.encoder.value"] = cross_attention_pkv[f"present.{i}.encoder.value"]
        inputs[f"past_key_values.{i}.decoder.key"] = previous_outputs[f"present.{i}.decoder.key"]
        inputs[f"past_key_values.{i}.decoder.value"] = previous_outputs[f"present.{i}.decoder.value"]
    return inputs

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

inputs = processor(text="Hello, my dog is cute", return_tensors="np")

inp = {
    "input_ids": inputs["input_ids"]
}

outputs = encoder.run(None, inp)
outputs = {output_key.name: outputs[idx] for idx, output_key in enumerate(encoder.get_outputs())}

encoder_last_hidden_state = outputs["encoder_outputs"]
encoder_attention_mask = outputs["encoder_attention_mask"]

minlenratio = 0.0
maxlenratio = 20.0
reduction_factor = 2
threshold = 0.5
num_mel_bins = 80

maxlen = int(encoder_last_hidden_state.shape[1] * maxlenratio / reduction_factor)
minlen = int(encoder_last_hidden_state.shape[1] * minlenratio / reduction_factor)

spectrogram = []
cross_attentions = []
past_key_values = None
idx = 0
cross_attention_pkv = None
use_cache_branch = False

# Zero speaker embeddings as a placeholder; use real x-vectors for an actual voice.
speaker_embeddings = np.zeros((1, 512)).astype(np.float32)

while True:
    idx += 1

    decoder_inputs = {}
    decoder_inputs["use_cache_branch"] = np.array([use_cache_branch])
    decoder_inputs["encoder_attention_mask"] = encoder_attention_mask
    decoder_inputs["speaker_embeddings"] = speaker_embeddings

    if not use_cache_branch:
        decoder_inputs = add_fake_pkv(decoder_inputs)
        decoder_inputs["output_sequence"] = np.zeros((1, 1, num_mel_bins)).astype(np.float32)
        use_cache_branch = True
        decoder_inputs["encoder_hidden_states"] = encoder_last_hidden_state
    else:
        decoder_inputs = add_real_pkv(decoder_inputs, decoder_outputs, cross_attention_pkv)
        decoder_inputs["output_sequence"] = decoder_outputs["output_sequence_out"]
        decoder_inputs["encoder_hidden_states"] = np.zeros((1, 0, 768)).astype(np.float32)  # useless when cross-attention KV has already been computed

    decoder_outputs = decoder.run(None, decoder_inputs)
    decoder_outputs = {output_key.name: decoder_outputs[idx] for idx, output_key in enumerate(decoder.get_outputs())}

    if idx == 1:  # i.e. use_cache_branch = False
        cross_attention_pkv = {key: val for key, val in decoder_outputs.items() if ("encoder" in key and "present" in key)}

    prob = decoder_outputs["prob"]
    spectrum = decoder_outputs["spectrum"]

    spectrogram.append(spectrum)
    
    print("prob", prob)

    # Finished when stop token or maximum length is reached.
    if idx >= minlen and (int(sum(prob >= threshold)) > 0 or idx >= maxlen):
        print("len spectrogram", len(spectrogram))
        spectrogram = np.concatenate(spectrogram)
        vocoder_output = postnet_and_vocoder.run(None, {"spectrogram": spectrogram})
        break

sf.write("speech.wav", vocoder_output[0], samplerate=16000)
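
A note on the design above: the use_cache_branch flag lets a single merged decoder graph serve both the first step (zero-length past key/values as placeholders, real encoder_hidden_states) and every later step (cross-attention KV cached from step one, self-attention KV from the previous step, and an empty encoder_hidden_states that is ignored). The stop condition mirrors generate_speech: decode until the stop probability crosses the threshold or maxlen is reached.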

fxmarty marked this pull request as ready for review on October 5, 2023, 14:09.
xenova (Contributor) left a comment

Works with transformers.js! 🚀 xenova/transformers.js#345

fxmarty (Collaborator, Author) commented Oct 6, 2023

@echarlaix would you prefer to merge the decoders PR first? I expect some conflicts between the two.

fxmarty (Collaborator, Author) commented Oct 16, 2023

@echarlaix WDYT?

echarlaix (Collaborator) commented

> @echarlaix would you prefer to merge the decoders PR first? I expect some conflicts between the two.

Yes, that would be great, thanks for letting me know. I think we can merge the decoder PR first, cc @michaelbenayoun.

echarlaix (Collaborator) left a comment

Super cool, thanks @fxmarty!

optimum/exporters/onnx/model_configs.py (outdated, resolved)
-    # Attempt to merge only if the decoder was exported without/with past
-    if self.use_past is True and len(models_and_onnx_configs) == 3:
+    # Attempt to merge only if the decoder was exported without/with past, and ignore seq2seq models exported with text-generation task
+    if len(onnx_files_subpaths) >= 3 and self.use_past is True or self.variant == "with-past":
echarlaix (Collaborator):

Why do we need to check self.variant?

fxmarty (Collaborator, Author):

I'm not sure. I'll need to double check.

fxmarty merged commit 554a83a into huggingface:main on Oct 18, 2023. 65 of 68 checks passed.
baskrahmer (Contributor) left a comment

Accidentally clicked review 😝; I meant to just submit a comment.

)
model_type = config.model_type.replace("_", "-")
if model_type not in TasksManager._SUPPORTED_MODEL_TYPE:
    custom_architecture = True
baskrahmer (Contributor):

@fxmarty this line currently does nothing, since it is set to False again on line 381. Do you want to have a look?

fxmarty (Collaborator, Author):

Good catch! I'll fix it.
