
TTS streaming does not work #864

Closed
meenie opened this issue Nov 22, 2023 · 37 comments
Labels
bug Something isn't working

Comments

@meenie

meenie commented Nov 22, 2023

Confirm this is an issue with the Python library and not an underlying OpenAI API

  • This is an issue with the Python library

Describe the bug

When following the documentation on how to use client.audio.speech.create(), the returned response has a method called stream_to_file(file_path), which is documented as streaming the content of the audio file as it's being created. This does not seem to work: I used a rather large text input that generates a 3.5-minute sound file, and the file is only created once the whole request has completed.

To Reproduce

Run the following code, replacing the text input with a decently large amount of text.

from pathlib import Path
from openai import OpenAI
client = OpenAI()

speech_file_path = Path(__file__).parent / "speech.mp3"
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="""
        <Decently large bit of text here>
    """
)

response.stream_to_file(speech_file_path)

Notice that when the script is run that the speech.mp3 file is only ever created after the request is fully completed.


OS

macOS

Python version

Python 3.11.6

Library version

openai v1.2.4

@meenie meenie added the bug Something isn't working label Nov 22, 2023
@RobertCraigie
Collaborator

Thanks for the bug report, we're working on a fix.

@meenie
Author

meenie commented Nov 22, 2023

Is there anything I can do to help? I don't see any mention of streaming in the TTS REST API endpoint docs, so I'm assuming it doesn't actually support this feature?

@RobertCraigie
Collaborator

All good! The issue here is that the HTTP client reads the entire response body before returning.

The "stream" terminology here is maybe a little confusing: it's different from the stream argument you're likely familiar with from the chat.completions.create method, which results in Server-Sent Events. In this case it just refers to lazily reading the response body.

@amarbayar

Once it is fixed, would this allow us to send the streamed response from chat.completions.create to audio.speech.create and get the lazy-reading behavior as the text response is streamed in chunks from the chat completion?

@antont

antont commented Nov 23, 2023

Seems to be fixed in #866

I hacked it in earlier in #724 and the results were good when I just passed the stream=True parameter in: even for a long text (30 seconds or more of audio), I started hearing it in a browser client in about 1s.

@RobertCraigie seems to say above, though, that it only starts once the whole audio is completed on the server? Apparently the generation is quick, then?

@meenie
Author

meenie commented Nov 23, 2023

@antont oh shoot! I didn't see your existing issue about this because I searched for "TTS" instead of just "Speech" 🤦🏻. Also, great news about it possibly being fixed! Thanks for verifying :).

@meenie
Author

meenie commented Nov 23, 2023

Once it is fixed, would this allow us to send the streamed response from chat.completions.create to audio.speech.create and get the lazy-reading behavior as the text response is streamed in chunks from the chat completion?

@amarbayar, I'm super interested in this! I don't think it will work, because the speech endpoint would need to accept chunked transfer encoding, and I don't think it does. How awesome would it be if this were built into the library, though?

@rattrayalex
Collaborator

@antont out of curiosity, how did you play the streaming audio in the browser? Do you have example code you could share for the benefit of others?

@rattrayalex
Collaborator

Ah, I'm sorry to say the fix in #866 is being reverted; upon further discussion, we've found a better way that should be available in the coming days. Thank you for your patience.

@EigenSpan

I'll be honest, I'm quite new to GitHub as I don't really collaborate with anyone. I would definitely like to know when this works, though. Would love to shave the time off my generated audio. In the meantime, while we are waiting for this bugfix, is there any other way to stream the response from the TTS endpoint?

@antont

antont commented Nov 24, 2023

@antont out of curiosity, how did you play the streaming audio in the browser? Do you have example code you could share for the benefit of others?

Well, it's very simple: just return audio/mpeg from the HTTP server and stream the response. Browsers handle that by showing an audio player, so I didn't need to do anything on the browser side.

I used FastAPI, so the HTTP handler function is:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI, AsyncStream
from openai.types.chat import ChatCompletionChunk

app = FastAPI()
client = AsyncOpenAI()

@app.get("/stream")
async def stream():
    text = "test. "
    text *= 10  # NOTE: I tested with 100 here too to make sure that it streams, i.e. still starts quickly, in about 1s

    # without stream=True this would be an HttpxBinaryResponseContent instead
    speech_stream: AsyncStream[ChatCompletionChunk] = await text_to_speech_stream_openai(text)  # not the actual type

    return StreamingResponse(speech_stream.response.aiter_bytes(), media_type="audio/mpeg")

async def text_to_speech_stream_openai(text: str):
    print('Generating audio from text using the OpenAI API')
    stream: AsyncStream[ChatCompletionChunk] = await client.audio.speech.create(
        model="tts-1",
        voice="echo",
        input=text,
        stream=True  # only available on the patched branches mentioned below
    )  # type: ignore
    print(type(stream), dir(stream), stream)
    return stream

This requires the stream param in speech.create, so you need either my quick-hack PR #724 or the later, more proper one from OpenAI, #866. I guess that one works too, even though they have reverted it now.

I have this live at https://id-longtask-3q4kbi7oda-uc.a.run.app/stream . The first query to the server takes time as it starts up the instance, but later requests start playing audio in about 1s. Don't hammer it too much, so I don't need to take it down due to excessive API usage.

@antont

antont commented Nov 24, 2023

Once it is fixed, would this allow us to send the streamed response from chat.completions.create to audio.speech.create and get the lazy-reading behavior as the text response is streamed in chunks from the chat completion?

Yes, proper support for such pipelining within the OpenAI backend would be great.

It might be possible to hack a somewhat working system now:

  • do the usual streaming, chunk-based thing to get text
  • send it for audio generation intermittently
  • start streaming the first audio after a bit of delay, once some text has come in and the speech stream for it has started
  • in the meantime, keep reading text from the text response and start a new stream in your backend code to get that as audio
  • when the first audio stream from OpenAI ends, switch the input of your streaming response over to the new TTS stream
  • if the new stream has not started yet, put some empty audio (or maybe a little pink noise) in your stream, so the listener only hears a small pause

If you do the splits where a sentence ends, it might be bearable, even though there might be awkward pauses too. I'm not sure how the timings would actually work out; it might even be quite fine, maybe with no pauses at all? Perhaps with a slow speaker :D
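
A rough sketch of that pipeline in Python, kept sequential for simplicity (no overlap or padding between streams); it assumes the async client plus the with_streaming_response API that appears later in this thread, and the sentence regex is deliberately naive:

import re

from openai import AsyncOpenAI

client = AsyncOpenAI()
SENTENCE_END = re.compile(r"[.!?…]\s+")  # naive sentence-boundary detection

async def sentences_from_chat(prompt: str):
    """Yield complete sentences as the chat completion streams in."""
    buffer = ""
    stream = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            buffer += chunk.choices[0].delta.content
        while (match := SENTENCE_END.search(buffer)):
            yield buffer[: match.end()]
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer  # whatever trails the last sentence terminator

async def speak(prompt: str):
    """Request TTS sentence by sentence and forward audio strictly in order."""
    async for sentence in sentences_from_chat(prompt):
        async with client.audio.speech.with_streaming_response.create(
            model="tts-1", voice="alloy", input=sentence
        ) as tts:
            async for audio_chunk in tts.iter_bytes(chunk_size=4096):
                ...  # write to your audio device or outgoing response here

Overlapping the next TTS request while the current one plays, as described in the list above, would additionally need a small queue of in-flight streams.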

@EigenSpan

I did the split-on-sentences approach with ElevenLabs before OpenAI's TTS endpoint came out. It did work, but it was quite choppy. I'll just wait for the time being and hope it's implemented soon.

Regarding your prior post: so did they add the stream parameter to create? Because when I tried to use it in my current version, which was updated maybe a week or two ago, it said there was no stream parameter.

@antont

antont commented Nov 24, 2023

so did they add the stream parameter to create? Because when I tried to use it in my current version, which was updated maybe a week or two ago, it said there was no stream parameter.

It's not in any released version.

I added it myself and put up a PR, mostly for info, and that's what I have been using, only for that test. So if you use that branch, it's there.

Now they have added it too, so you'd have it in a roughly two-day-old version from OpenAI's repo, but they then reverted the addition as they are reworking it to be somehow better, so it's not in the current main. That's what I tried to say in:

This requires the stream param in speech.create, so you need either my quick-hack PR #724 or the later, more proper one from OpenAI, #866. I guess that one works too, even though they have reverted it now.

Cool to hear that you actually implemented that splitting thing!

@rattrayalex
Collaborator

For those interested in sentence-based splitting, an openai-node user shared some code which might be helpful.

@EigenSpan

EigenSpan commented Nov 24, 2023

@antont For sentence-based splitting, at least using the chat completions endpoint, all I did was stream the response and append each token to a string as it came in. That string is checked against a regex to see if it contains a complete sentence (I got GPT to write the regex). If it did, it would push that sentence through to my TTS endpoint and generate the relevant audio for that sentence in a thread. Of course there were more steps to organizing it all, and it was a bit of a pain, and there was always a noticeable pause between sentences (more than there should have been). But it did work. For example, I also needed to check API call timestamps so that if a thread finished out of sequence (say the second sentence finishing faster than the first), the audio wouldn't play in the wrong order.

I am much happier to stream the audio out and run TTS on the full turn instead of each sentence. The problem is I can't seem to get anything asynchronously from the assistant. I figured that if I called .aiter_bytes() on the result of the tts... create() call, it would immediately return an async iterator that I could continuously check for new chunks, but it only seems to return once the entire response is complete. I logged the time to the first chunk from aiter_bytes() and the time for stream_to_file() to complete, and the first chunk from aiter_bytes() was always slower than the full completion from stream_to_file(). I can only conclude that aiter_bytes() doesn't yield anything until the full response is received. I'm a rookie at async, as I have always used processes or threads, but even GPT expected my solution to work and thinks the endpoint is currently just not capable of streaming.

Does anyone more experienced in the matter know how to make it work?

@antont

antont commented Nov 25, 2023

I am much happier to stream the audio out and run TTS on the full turn instead of each sentence. The problem is I can't seem to get anything asynchronously from the assistant. I figured that if I called .aiter_bytes() on the result of the tts... create() call

Yes, that's correct, and it's what I do in the code I pasted above.

So did you pass the stream=True parameter, using either of the branches that support it?

@antont

antont commented Nov 25, 2023

For those interested in sentence-based splitting, an openai-node user shared some code which might be helpful.

Cool, thanks. It doesn't seem to use streaming to get the audio from TTS, but maybe that's fine if fetching each chunk whole is fast enough.

    const arrayBuffer = await response.arrayBuffer();
    const blob = new Blob([arrayBuffer], { type: 'audio/mpeg' });
    const url = URL.createObjectURL(blob);

@EigenSpan

@antont No, I'm super new to using GitHub, so I'm not quite sure how it works yet. Am I able to just look up a specific branch by ID and pull that down as my new openai library? Even if it's not the best implementation, as long as other things don't significantly break, I'd 100% be willing to work with it temporarily if it gives me a TTS-streamable response. I just finished learning how websockets work and was finally, after like 6 hours, able to set up a websocket connection to Twilio. Their docs are really dated, which made it much harder...

@antont

antont commented Nov 25, 2023

@antont No, I'm super new to using GitHub, so I'm not quite sure how it works yet. Am I able to just look up a specific branch by ID and pull that down as my new openai library? Even if it's not the best implementation, as long as other things don't significantly break, I'd 100% be willing to work with it temporarily if it gives me a TTS-streamable response.

Yes, you can get that version and install it: https://github.com/openai/openai-python/tree/b2b4239bc95a2c81d9db49416ec4095f8a72d5e2 . Maybe there are nice instructions somewhere.

@EigenSpan here are brief instructions for you:

git clone https://github.com/openai/openai-python.git
cd openai-python
git checkout b2b4239
pip install .

Oh, or actually just this works too:

pip install git+https://github.com/openai/openai-python.git@b2b4239bc95a2c81d9db49416ec4095f8a72d5e2

@doomuch

doomuch commented Nov 26, 2023

Great that this issue is being looked into! I spent 5 hours (a week ago) trying to fix this. I did the same as @antont and it worked! But when I tried to play the audio, it was so messy.

Here is an example with just using the requests library: https://gist.github.com/44-5-53-6-k/2df4f85d210d0ba80ff6335a78d872e5

Basically, it does this:

import io

import pyaudio
import requests
from pydub import AudioSegment

# `response` is the streaming requests.post(...) response from the gist
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,
                channels=1,
                rate=44000,
                output=True)

try:
    for chunk in response.iter_content(chunk_size=4000096):
        if chunk:
            # decode each received chunk as if it were a standalone MP3
            audio_data = AudioSegment.from_mp3(io.BytesIO(chunk))
            raw_data = audio_data.raw_data
            stream.write(raw_data)
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Cleanup
    stream.stop_stream()
    stream.close()
    p.terminate()

And at some point it just gives me this:

[cache @ 0x600000f9c150] Inner protocol failed to seekback end : -78
    Last message repeated 1 times
[mp3 @ 0x134104580] Failed to read frame size: Could not seek to 1049.
[cache @ 0x600000f9c150] Statistics, cache hits:2 cache misses:1
cache:pipe:0: Invalid argument

I believe that I am fundamentally wrong about how streaming chunks work, but I can't find any info on that. Could you share how you play these chunks?
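
(A likely cause: HTTP chunks do not align with MP3 frame boundaries, so decoding each chunk as a standalone MP3 with AudioSegment.from_mp3 breaks as soon as a frame is split across two chunks. One workaround is to hand the raw byte stream to a decoder that buffers partial frames itself. A sketch, assuming FFmpeg's ffplay is installed and reusing the url/headers/data variables from the gist:)

import subprocess

import requests

# Pipe the raw MP3 byte stream into ffplay's stdin; its decoder buffers
# partial frames internally, so chunk boundaries no longer matter.
player = subprocess.Popen(
    ["ffplay", "-autoexit", "-nodisp", "-loglevel", "quiet", "-"],
    stdin=subprocess.PIPE,
)
with requests.post(url, headers=headers, json=data, stream=True) as response:
    for chunk in response.iter_content(chunk_size=4096):
        if chunk:
            player.stdin.write(chunk)
player.stdin.close()
player.wait()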

@mbarretol

Seems like everyone just wants to be able to pipe the incoming streamed text chunks into the speech create function and have it start generating speech as soon as a certain threshold is reached (probably a whole sentence, so it sounds natural). Hope this feature is added soon.

@EigenSpan

@Narco121 Agreed, everyone absolutely wants to be able to pipe the GPT response stream into the TTS endpoint and receive a TTS stream back. If you use the chat completions endpoint, this can be done by using a generator and async or threads with a regex. You add the incoming tokens to a string and check against a regex for when that string is a complete sentence (check for things like .?!…). Once it is, send that string to your TTS endpoint via async/threads, then organize the responses in order and play them. It's a pain to organize, but it did somewhat work when I did it. It sounded a bit choppy with 11labs; not sure how OpenAI's TTS would sound.

The main problem people are having stems from the fact that in the main branch you currently can't stream from the TTS endpoint, at least with every way I've tried (threading, async, etc.). The actual endpoint, even when calling stream_to_file, doesn't stream anything; it only returns the file once all the audio is generated. There is no way to access the audio data before it is 100% complete.

That being said, as antont pointed out, they did fix it in another branch but later reverted the change in favour of another method of doing it. For now, once my code is ready to handle the stream, I'll just swap to the branch that allows TTS streaming, even if it isn't the "best" way of doing it.

@cyzanfar

cyzanfar commented Dec 30, 2023

I got the output of OAI TTS to stream. Here's an example:

import io

import requests

url = "https://api.openai.com/v1/audio/speech"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
}

data = {
    "model": model,          # your model, e.g. "tts-1"
    "input": input_text,     # your text
    "voice": voice,          # your voice, e.g. "alloy"
    "response_format": "opus",
}

with requests.post(url, headers=headers, json=data, stream=True) as response:
    if response.status_code == 200:
        buffer = io.BytesIO()
        for chunk in response.iter_content(chunk_size=4096):
            buffer.write(chunk)  # chunks arrive as they are generated

Hope that helps!

@amarbayar

amarbayar commented Dec 31, 2023

Nice, @cyzanfar!

In case anyone comes across this thread looking for a fully working solution written in Python, with a sample Flask app and an audio player:

from flask import Flask, Response, render_template_string
import requests

app = Flask(__name__)

@app.route('/')
def index():
    # HTML template to render an audio player
    html = '''
    <!DOCTYPE html>
    <html>
    <body>
    <audio controls autoplay>
        <source src="/stream" type="audio/mpeg">
        Your browser does not support the audio element.
    </audio>
    </body>
    </html>
    '''
    return render_template_string(html)

@app.route('/stream')
def stream():
    def generate():
        url = "https://api.openai.com/v1/audio/speech"
        headers = {
            "Authorization": "Bearer YOUR_SK_TOKEN",
        }

        data = {
            "model": "tts-1",
            "input": "YOUR TEXT THAT NEEDS TO BE TTSD HERE",
            "voice": "alloy",
            "response_format": "mp3",
        }

        with requests.post(url, headers=headers, json=data, stream=True) as response:
            if response.status_code == 200:
                for chunk in response.iter_content(chunk_size=4096):
                    yield chunk

    return Response(generate(), mimetype="audio/mpeg")

if __name__ == "__main__":
    app.run(debug=True, threaded=True)

And this is indeed working beautifully!

@Boscop

Boscop commented Jan 10, 2024

@amarbayar Is it also possible to stream a TTS response in TypeScript?
I don't see a stream parameter here:
https://github.com/openai/openai-openapi/blob/f4a2833d00e92c4b1cb531d437da88a03de997d8/openapi.yaml#L6860-L6894

In my frontend I want to play the returned audio with minimum latency.

@matthiaskern

matthiaskern commented Jan 27, 2024

Just tried the new raw-response streaming API in the SDK (#1072) with FastAPI. This always results in httpx.StreamClosed: Attempted to read or stream content, but the stream has been closed. unless you read from the response directly without iterating (response.read()).

    # Fragment from inside a FastAPI endpoint; `text` comes from the request.
    # Assumes the usual imports (OpenAI, StreamingResponse, HTTPException).
    openai_client = OpenAI()
    with openai_client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input=text,
        response_format="mp3"
    ) as response:

        if response.status_code == 200:
            def generate():
                for chunk in response.iter_bytes(chunk_size=2048):
                    print(f"Chunk size: {len(chunk)}")  # print the size of each chunk
                    yield chunk

            return StreamingResponse(
                content=generate(),
                media_type="audio/mp3"
            )

        else:
            return HTTPException(status_code=500, detail="Failed to generate audio")

Making the API call directly seems to still be the best way to go.

@RobertCraigie
Collaborator

RobertCraigie commented Jan 27, 2024

@matthiaskern that's happening because the context manager is exiting when you return StreamingResponse(...).

Like this example in the FastAPI docs, you'll need to make the API call within your generate() function, unless FastAPI supports yielding responses?

Is there a reason that making the API call inside generate() wouldn't work for you?

@matthiaskern

matthiaskern commented Jan 27, 2024

Of course, thank you @RobertCraigie! I'm not sure how to handle an exception from OpenAI in this case, but this is a great start.

For reference:

    # Fragment from inside the endpoint; assumes the same imports as above.
    # (Note `input` here shadows the builtin.)
    def generate():
        with openai_client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="alloy",
            input=input,
            response_format="mp3"
        ) as response:
            if response.status_code == 200:
                for chunk in response.iter_bytes(chunk_size=2048):
                    yield chunk

    return StreamingResponse(
        content=generate(),
        media_type="audio/mp3"
    )

@RobertCraigie
Collaborator

Great, @matthiaskern! I'm not 100% sure how best to handle exceptions (I'm not a FastAPI expert), but I think you could define middleware to handle them? That way you don't have to handle it for every endpoint.

It's also worth noting that the original code snippet you shared won't actually hit the return HTTPException(status_code=500, ...) branch, because we handle non-success status codes automatically and raise an exception; see except httpx.HTTPStatusError as err: (thrown on 4xx and 5xx status codes) and class APIStatusError(APIError): in the library source.
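
(For anyone following along, a sketch of what such middleware might look like; this is an assumption, not RobertCraigie's code, and note that exceptions raised inside a streaming generator after the response has started cannot be intercepted this way:)

import openai
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.middleware("http")
async def openai_error_handler(request: Request, call_next):
    try:
        return await call_next(request)
    except openai.APIStatusError as err:
        # Surface the upstream status code instead of a generic 500
        return JSONResponse(status_code=err.status_code, content={"detail": err.message})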

@nimobeeren

As of today (openai==1.12.0), the Python example on the text-to-speech quickstart yields:

DeprecationWarning: Due to a bug, this method doesn't actually stream the response content, .with_streaming_response.method() should be used instead

I found the message a bit cryptic and couldn't find any real documentation of this .with_streaming_response.method(), but after some digging I was able to get it working:

from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="""I see skies of blue and clouds of white
             The bright blessed days, the dark sacred nights
             And I think to myself
             What a wonderful world""",
) as response:
    # This doesn't seem to be *actually* streaming, it just creates the file
    # and then doesn't update it until the whole generation is finished
    response.stream_to_file("speech.mp3")

But I wasn't able to achieve actual streaming with the Python library, only through the REST API (see my post on OpenAI Forum).

It would be great to have improved documentation and support for streaming TTS!

@antont

antont commented Feb 23, 2024

response.stream_to_file("speech.mp3")

I think you should not do that, but instead use something like speech_stream.response.aiter_bytes.

My working version from a few months back is in a comment above and uses that.

@yjp20

yjp20 commented Feb 23, 2024

As of today (openai==1.12.0), the Python example on the text-to-speech quickstart yields a DeprecationWarning [...] But I wasn't able to achieve actual streaming with the Python library, only through the REST API. It would be great to have improved documentation and support for streaming TTS!

Hey @nimobeeren, thanks for the repro. On my machine this does seem to stream to the file as expected :(. Would you be able to share what system you're using (it shouldn't make a difference, but just in case) and how you determined that the file wasn't streaming?

@rattrayalex
Collaborator

But I wasn't able to achieve actual streaming with the Python library, only through the REST API (see my post on OpenAI Forum).

Thanks for the handy link – we'll be releasing a new example file soon based on that thread which shows how to stream TTS to pyaudio with WAV.

@rattrayalex
Collaborator

rattrayalex commented Mar 3, 2024

An example of how to stream a TTS response to your speakers with PyAudio is now available here:

import time

import pyaudio
from openai import OpenAI

openai = OpenAI()  # the example file uses a client instance named `openai`

def stream_to_speakers() -> None:
    player_stream = pyaudio.PyAudio().open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

    start_time = time.time()

    with openai.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        response_format="pcm",  # similar to WAV, but without a header chunk at the start
        input="""I see skies of blue and clouds of white
            The bright blessed days, the dark sacred nights
            And I think to myself
            What a wonderful world""",
    ) as response:
        print(f"Time to first byte: {int((time.time() - start_time) * 1000)}ms")
        for chunk in response.iter_bytes(chunk_size=1024):
            player_stream.write(chunk)

    print(f"Done in {int((time.time() - start_time) * 1000)}ms.")

@nimobeeren

nimobeeren commented Mar 4, 2024

Hey @nimobeeren, thanks for the repro. On my machine this does seem to stream to the file as expected :(. Would you be able to share what system you're using (it shouldn't make a difference, but just in case) and how you determined that the file wasn't streaming?

@yjp20 It seems to be working for me too now! I'm just running the code and simultaneously doing ls -lh speech.mp3 a few times in a row. Before, it seemed like the file sat at 0 bytes until all of the chunks were received and then everything was written in one go. But now I can actually see the file increasing in size over time, as expected! Most likely my original experiment was flawed in some way 😅

@nimobeeren

@rattrayalex thank you Alex, that example works for me!

I spent some time trying to figure out why response_format="wav" gives a faster TTFB than pcm, and I found that only the WAV header is actually sent more quickly; the first audio data takes roughly the same time to arrive.

It might be nice to use the header to set the PyAudio options instead of hardcoding them, but I couldn't figure out how to do that (I didn't manage to turn the response into something I could feed into wave.open()).
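
(One possible approach, sketched under the assumption that the canonical 44-byte PCM WAV header arrives in the first chunk; it parses the fmt fields with struct instead of wave.open, since the data-chunk size in a streamed WAV may be bogus:)

import struct

import pyaudio
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    response_format="wav",
    input="What a wonderful world",
) as response:
    byte_iter = response.iter_bytes(chunk_size=1024)
    header = next(byte_iter)  # assume the RIFF/fmt header fits in the first chunk
    # Standard PCM WAV layout: channels at byte 22, sample rate at 24, bits/sample at 34
    channels, rate = struct.unpack_from("<HI", header, 22)
    bits_per_sample = struct.unpack_from("<H", header, 34)[0]

    p = pyaudio.PyAudio()
    stream = p.open(
        format=p.get_format_from_width(bits_per_sample // 8),
        channels=channels,
        rate=rate,
        output=True,
    )
    stream.write(header[44:])  # play whatever audio followed the header
    for chunk in byte_iter:
        stream.write(chunk)
    stream.stop_stream()
    stream.close()
    p.terminate()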
