XTTS: add inference_stream_text (slightly friendlier for text-streaming) #21

Closed
czuzu wants to merge 3 commits

Conversation

@czuzu czuzu commented May 10, 2024

Hello,

(moved the PR here, noticed the comments on the old one)

Doing TTS streaming but also with text-streaming (text coming progressively over a stream), locally.
I know inference_stream is theoretically enough for this case, except for the initialization part (repeating it isn't terrible, but it would be nicer to be able to skip it, since it's unnecessary):

language = language.split("-")[0]  # remove the country code
length_scale = 1.0 / max(speed, 0.05)
gpt_cond_latent = gpt_cond_latent.to(self.device) # nicer to be able to skip when doing text-streaming
speaker_embedding = speaker_embedding.to(self.device) # nicer to be able to skip when doing text-streaming

So I've added inference_stream_text (maybe not the best name, let me know if you prefer another) particularly for text-streaming, e.g.:

def text_streaming_generator():
    yield "It took me quite a long time to develop a voice and now that I have it I am not going to be silent."
    yield "Having discovered not just one, but many voices, I will champion each."

print("Inference with text streaming...")

text_gen = text_streaming_generator()
inf_gen = model.inference_stream_text(
    # note `text` param not provided as it will be streamed
    "en",
    gpt_cond_latent,
    speaker_embedding
)

wav_chunks = []
for text in text_gen:
    # Add text progressively
    model.inference_add_text(text, enable_text_splitting=True)
    for chunk in inf_gen:
        if chunk is None:
            break # all chunks generated for the current text
        print(f"Received chunk {len(wav_chunks)} of audio length {chunk.shape[-1]}")
        wav_chunks.append(chunk)

# Call finalize to discard the inference generator
model.inference_finalize_text()

IMO this also makes for a nicer interface when doing text-streaming, I'll leave it to you to decide :)
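The control flow the proposed methods rely on can be sketched independently of XTTS: queued text feeds a single long-lived generator that yields a None sentinel once all chunks for the currently queued text have been produced. The class and method names below are hypothetical and only loosely mirror the PR; chunk synthesis is faked (one "chunk" per word) so the pattern itself is runnable.

```python
from collections import deque


class StreamTextSynth:
    """Toy sketch of the add-text / None-sentinel streaming pattern."""

    def __init__(self):
        self._texts = deque()
        self._gen = self._chunk_generator()

    def add_text(self, text):
        # Queue one piece of text; real code would also split sentences here.
        self._texts.append(text)

    def _chunk_generator(self):
        while True:
            while self._texts:
                text = self._texts.popleft()
                # Stand-in for actual synthesis: one "chunk" per word.
                for word in text.split():
                    yield word
            # Sentinel: all chunks for the queued text have been produced.
            yield None

    def stream(self):
        return self._gen


synth = StreamTextSynth()
gen = synth.stream()

chunks = []
for text in ["hello there", "general kenobi"]:
    synth.add_text(text)
    for chunk in gen:
        if chunk is None:
            break  # current text fully consumed, wait for more
        chunks.append(chunk)

print(chunks)  # → ['hello', 'there', 'general', 'kenobi']
```

The key point is that the generator (and whatever initialization state it closes over) survives across texts; only add_text is called as new text arrives.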

Cheers! 🍻

@czuzu
Author

czuzu commented May 10, 2024

PR recreated: the linting test was not passing, so I had to re-fork from this repo (my old fork was from coqui-ai/TTS) and run make style and make lint locally.

@eginhard
Member

Thanks for the PR! Just to let you know that I'm traveling and won't have a chance to look at this for a couple of days.

@czuzu
Author

czuzu commented May 11, 2024

Thanks for the update, it can wait ofc, it's a small one. Safe travels 👌

@eginhard
Member

Could you explain the benefits of this PR more concretely, i.e. what do you mean by "slightly friendlier for text-streaming"? Running streaming TTS for sentences one-by-one is already possible with the current code, no?
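For context, the per-sentence streaming that is already possible looks roughly like the sketch below. FakeXtts is a stub standing in for a loaded XTTS model so the example runs on its own; the real model.inference_stream takes the same leading parameters (text, language, gpt_cond_latent, speaker_embedding) and yields audio chunks.

```python
# Sketch of per-sentence streaming with the existing API. FakeXtts is a
# hypothetical stand-in for a loaded XTTS model; chunk synthesis is faked
# (one string "chunk" per word) so the loop structure is runnable.
class FakeXtts:
    def inference_stream(self, text, language, gpt_cond_latent, speaker_embedding):
        # A real model would move the latents to its device here on every
        # call -- the per-call initialization the PR wants to run only once.
        for word in text.split():
            yield f"chunk({word})"


model = FakeXtts()
gpt_cond_latent, speaker_embedding = object(), object()

sentences = [
    "It took me quite a long time to develop a voice.",
    "Now that I have it I am not going to be silent.",
]

wav_chunks = []
for sentence in sentences:
    # Each call re-runs the per-call setup before chunks start flowing.
    for chunk in model.inference_stream(
        sentence, "en", gpt_cond_latent, speaker_embedding
    ):
        wav_chunks.append(chunk)

print(len(wav_chunks))  # total chunks across both sentences
```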

@czuzu
Author

czuzu commented May 14, 2024

Hi @eginhard,

It's a bit friendlier in the sense that it makes clear that text-streaming is supported. My argument is that this explicitness puts the user at ease about keeping the necessary initialization state in effect while streaming is ongoing, instead of redoing it every time new text comes in. Currently, as I've mentioned, that initialization state consists of only the two statements that move gpt_cond_latent and speaker_embedding to the target device. That's not significant, but more could happen there later unless we make it clear that text-streaming is something the library offers explicitly. And that explicitness is strengthened by having separate functions for adding text progressively, rather than inference calls where you pass all the needed arguments every time. This is mostly an aesthetic change, but IMO it would also reduce the risk of library developers later adding more to that initialization phase, which would then make avoiding its repetition during text streaming far more important than it is now.

All in all, I know the usefulness of this is debatable, so I'll leave it to your preference whether it's worth doing. I'm fine either way; it's just that I lean towards this version when doing text streaming, since seeing those functions makes it clearer to me that the library offers incremental TTS.

@eginhard
Member

Thank you for clarifying. I'm not sure I agree that it needs to be made more explicit that it is possible to run streaming TTS for multiple sentences one-by-one. I'll close this PR because it adds complexity to the code with no clear practical benefit.

Also text-streaming or incremental TTS usually refers to streaming synthesis based on partial text input, which is not currently supported, so this terminology would only lead to more confusion.

However, please let me know if there is anything that blocks your use case.

@eginhard eginhard closed this May 14, 2024
@czuzu
Author

czuzu commented May 14, 2024

Sure, no problem, thanks for taking the time to look over this 👍
