
XTTS: add inference_stream_text (slightly friendlier for text-streaming) #20

Closed · wants to merge 2 commits

Conversation

@czuzu commented May 9, 2024

Hello,

(Moved the PR here after noticing the comments on the old one.)

I'm doing TTS streaming, but with text streaming as well (text arriving progressively over a stream), locally. In theory inference_stream is enough for this case, except for the setup at the start of each call, which isn't costly to repeat, but it would be nicer to skip it since it's unnecessary when the conditioning inputs don't change:

language = language.split("-")[0]  # remove the country code
length_scale = 1.0 / max(speed, 0.05)
gpt_cond_latent = gpt_cond_latent.to(self.device) # nicer to be able to skip when doing text-streaming
speaker_embedding = speaker_embedding.to(self.device) # nicer to be able to skip when doing text-streaming
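
For comparison, here is roughly what text streaming looks like today with plain inference_stream, re-running that setup on every fragment (a minimal sketch; model, gpt_cond_latent, and speaker_embedding are assumed to come from the usual XTTS loading and get_conditioning_latents calls):

# given some generator `text_gen` yielding text fragments as they arrive
wav_chunks = []
for text in text_gen:
    # each call repeats the setup above (language parsing, .to(device) transfers)
    for chunk in model.inference_stream(text, "en", gpt_cond_latent, speaker_embedding):
        wav_chunks.append(chunk)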

So I've added inference_stream_text (maybe not the best name, let me know if you'd prefer another), aimed specifically at text streaming, e.g.:

def text_streaming_generator():
    yield "It took me quite a long time to develop a voice and now that I have it I am not going to be silent."
    yield "Having discovered not just one, but many voices, I will champion each."

print("Inference with text streaming...")

text_gen = text_streaming_generator()
inf_gen = model.inference_stream_text(
    # note `text` param not provided as it will be streamed
    "en",
    gpt_cond_latent,
    speaker_embedding
)

wav_chunks = []
for text in text_gen:
    # Add text progressively
    model.inference_add_text(text, enable_text_splitting=True)
    for chunk in inf_gen:
        if chunk is None:
            break # all chunks generated for the current text
        print(f"Received chunk {len(wav_chunks)} of audio length {chunk.shape[-1]}")
        wav_chunks.append(chunk)

# Call finalize to discard the inference generator
model.inference_finalize_text()
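
As with the existing inference_stream examples, the collected chunks can then be concatenated and written out (assuming torch/torchaudio and XTTS's 24 kHz output):

import torch
import torchaudio

# join the streamed chunks into one waveform and save it
wav = torch.cat(wav_chunks, dim=0)
torchaudio.save("xtts_text_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)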

IMO this also makes for a nicer interface when doing text streaming, but I'll leave that to you to decide :)

Cheers! 🍻

@czuzu closed this by deleting the head repository on May 10, 2024