XTTS: add inference_stream_text (slightly friendlier for text-streaming) #21

Closed
czuzu wants to merge 3 commits

Conversation

@czuzu czuzu commented May 10, 2024

Hello,

(moved the PR here, noticed the comments on the old one)

Doing TTS streaming but also with text-streaming (text coming progressively over a stream), locally.
I know inference_stream is theoretically enough for this case, except for the initialization part (repeating it isn't terrible, but it would be nicer to be able to skip it, since it's unnecessary):

language = language.split("-")[0]  # remove the country code
length_scale = 1.0 / max(speed, 0.05)
gpt_cond_latent = gpt_cond_latent.to(self.device) # nicer to be able to skip when doing text-streaming
speaker_embedding = speaker_embedding.to(self.device) # nicer to be able to skip when doing text-streaming

So I've added inference_stream_text (maybe not the best name, let me know if you prefer another) particularly for text-streaming, e.g.:

def text_streaming_generator():
    yield "It took me quite a long time to develop a voice and now that I have it I am not going to be silent."
    yield "Having discovered not just one, but many voices, I will champion each."

print("Inference with text streaming...")

text_gen = text_streaming_generator()
inf_gen = model.inference_stream_text(
    # note `text` param not provided as it will be streamed
    "en",
    gpt_cond_latent,
    speaker_embedding
)

wav_chunks = []
for text in text_gen:
    # Add text progressively
    model.inference_add_text(text, enable_text_splitting=True)
    for chunk in inf_gen:
        if chunk is None:
            break # all chunks generated for the current text
        print(f"Received chunk {len(wav_chunks)} of audio length {chunk.shape[-1]}")
        wav_chunks.append(chunk)

# Call finalize to discard the inference generator
model.inference_finalize_text()

IMO this also makes for a nicer interface when doing text-streaming, I'll leave it to you to decide :)
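The control flow the proposed methods rely on can be sketched independently of XTTS: queued text feeds a single long-lived generator that yields a None sentinel once all chunks for the currently queued text have been produced. The class and method names below are hypothetical and only loosely mirror the PR; chunk synthesis is faked (one "chunk" per word) so the pattern itself is runnable.

```python
from collections import deque


class StreamTextSynth:
    """Toy sketch of the add-text / None-sentinel streaming pattern."""

    def __init__(self):
        self._texts = deque()
        self._gen = self._chunk_generator()

    def add_text(self, text):
        # Queue one piece of text; real code would also split sentences here.
        self._texts.append(text)

    def _chunk_generator(self):
        while True:
            while self._texts:
                text = self._texts.popleft()
                # Stand-in for actual synthesis: one "chunk" per word.
                for word in text.split():
                    yield word
            # Sentinel: all chunks for the queued text have been produced.
            yield None

    def stream(self):
        return self._gen


synth = StreamTextSynth()
gen = synth.stream()

chunks = []
for text in ["hello there", "general kenobi"]:
    synth.add_text(text)
    for chunk in gen:
        if chunk is None:
            break  # current text fully consumed, wait for more
        chunks.append(chunk)

print(chunks)  # → ['hello', 'there', 'general', 'kenobi']
```

The key point is that the generator (and whatever initialization state it closes over) survives across texts; only add_text is called as new text arrives.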

Cheers! 🍻

@czuzu
Author

czuzu commented May 10, 2024

PR recreated: the linting test was not passing, so I had to re-fork from this repo (my old fork was from coqui-ai/TTS) and run make style and make lint locally.

@eginhard
Member

Thanks for the PR! Just to let you know that I'm traveling and won't have a chance to look at this for a couple of days.

@czuzu
Author

czuzu commented May 11, 2024

Thanks for the update, it can wait ofc, it's a small one. Safe travels 👌

@eginhard
Member

Could you explain the benefits of this PR more concretely, i.e. what do you mean by "slightly friendlier for text-streaming"? Running streaming TTS for sentences one-by-one is already possible with the current code, no?
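For context, the per-sentence streaming that is already possible looks roughly like the sketch below. FakeXtts is a stub standing in for a loaded XTTS model so the example runs on its own; the real model.inference_stream takes the same leading parameters (text, language, gpt_cond_latent, speaker_embedding) and yields audio chunks.

```python
# Sketch of per-sentence streaming with the existing API. FakeXtts is a
# hypothetical stand-in for a loaded XTTS model; chunk synthesis is faked
# (one string "chunk" per word) so the loop structure is runnable.
class FakeXtts:
    def inference_stream(self, text, language, gpt_cond_latent, speaker_embedding):
        # A real model would move the latents to its device here on every
        # call -- the per-call initialization the PR wants to run only once.
        for word in text.split():
            yield f"chunk({word})"


model = FakeXtts()
gpt_cond_latent, speaker_embedding = object(), object()

sentences = [
    "It took me quite a long time to develop a voice.",
    "Now that I have it I am not going to be silent.",
]

wav_chunks = []
for sentence in sentences:
    # Each call re-runs the per-call setup before chunks start flowing.
    for chunk in model.inference_stream(
        sentence, "en", gpt_cond_latent, speaker_embedding
    ):
        wav_chunks.append(chunk)

print(len(wav_chunks))  # total chunks across both sentences
```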

@czuzu
Author

czuzu commented May 14, 2024

Hi @eginhard,

It's a bit friendlier in the sense that it makes clear that text-streaming is supported. My argument is that this explicitness puts the user at ease about keeping the necessary initialization state in effect while streaming is ongoing, instead of redoing it every time new text comes in. Currently, as I've mentioned, that initialization state consists of only the two statements that move gpt_cond_latent and speaker_embedding to the target device. That's not significant, but more could happen there later unless we make it clear that text-streaming is something the library offers explicitly. And that explicitness is strengthened by having separate functions for adding text progressively, rather than inference calls where you pass all the needed arguments every time. This is mostly an aesthetic change, but IMO it would also reduce the risk of library developers later adding more to that initialization phase, which would then make avoiding its repetition during text streaming far more important than it is now.

All in all, I know the usefulness of this is debatable, so I'll leave it to your preference whether it's worth doing. I'm fine either way; it's just that I lean towards this version when doing text streaming, since seeing those functions makes it clearer to me that the library offers incremental TTS.

@eginhard
Member

Thank you for clarifying. I'm not sure I agree that it needs to be made more explicit that it is possible to run streaming TTS for multiple sentences one-by-one. I'll close this PR because it adds complexity to the code with no clear practical benefit.

Also text-streaming or incremental TTS usually refers to streaming synthesis based on partial text input, which is not currently supported, so this terminology would only lead to more confusion.

However, please let me know if there is anything that blocks your use case.

@eginhard eginhard closed this May 14, 2024
@czuzu
Author

czuzu commented May 14, 2024

Sure, no problem, thanks for taking the time to look over this 👍
