# Stream Chat Completion

By default, when you request a completion from OpenAI, the entire completion is generated before being sent back in a single response. Users can therefore wait a long time before there is anything to read on the screen.

We can instead chunk the response and stream it back to the user. This way, the user sees each part of the response as soon as it is generated.
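
To make the difference concrete without calling the API, here is a minimal sketch built on a plain Python generator. `fake_completion`, `blocking`, `streaming`, and the `delay` parameter are hypothetical stand-ins for a model that produces text piece by piece; none of them is part of the `openai` library.

```python
import time
from typing import Iterator, List


def fake_completion(pieces: List[str], delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one piece of text at a time."""
    for piece in pieces:
        time.sleep(delay)  # simulate the time needed to generate each piece
        yield piece


def blocking(pieces: List[str], delay: float = 0.01) -> str:
    """Wait for every piece, then return the whole text at once."""
    return "".join(fake_completion(pieces, delay))


def streaming(pieces: List[str], delay: float = 0.01) -> Iterator[str]:
    """Hand each piece to the caller as soon as it is produced."""
    yield from fake_completion(pieces, delay)


pieces = ["1", ", ", "2", ", ", "3"]

# Blocking: nothing is readable until the whole text has been generated.
start = time.perf_counter()
full_text = blocking(pieces)
blocking_wait = time.perf_counter() - start

# Streaming: the first piece is readable after a single generation step.
start = time.perf_counter()
stream = streaming(pieces)
first_piece = next(stream)
streaming_wait = time.perf_counter() - start
rest = "".join(stream)  # consume the remaining pieces

print(f"Blocking: first text after {blocking_wait:.2f}s")
print(f"Streaming: first text after {streaming_wait:.2f}s")
```

Both paths end up with the same text; only the wait before the first readable piece differs.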


In [1]:
import openai
import time  # for measuring the duration of API calls
from getpass import getpass

openai.api_key = getpass("Enter your OpenAI API key: ")

### Normal Response


In [5]:
start_time = time.time()

res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Count to 100, with a comma between each number and no newlines. E.g., 1, 2, "}],
    temperature=0
)

res_time = time.time() - start_time

print(f"Full response received after {res_time:.2f} seconds")
print(f"Response: {res.choices[0].message.content}")

Full response received after 8.19 seconds
Response: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.


### Stream Response


In [7]:
# record the time before the request is sent
start_time = time.time()

# send a ChatCompletion request to count to 100
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0,
    stream=True  # stream the response back in chunks
)

# create variables to collect the stream of chunks
collected_chunks = []
collected_messages = []
# iterate through the stream of events
for chunk in response:
    chunk_time = time.time() - start_time  # calculate the time delay of the chunk
    collected_chunks.append(chunk)  # save the event response
    chunk_message = chunk['choices'][0]['delta']  # extract the message
    collected_messages.append(chunk_message)  # save the message
    # print the delay and text
    print(
        f"Message received {chunk_time:.2f} seconds after request: {chunk_message}")

# print the time delay and text received
print(f"Full response received {chunk_time:.2f} seconds after request")
full_reply_content = ''.join([m.get('content', '')
                             for m in collected_messages])
print(f"Full conversation received: {full_reply_content}")

Message received 0.42 seconds after request: {
  "role": "assistant",
  "content": ""
}
Message received 0.42 seconds after request: {
  "content": "1"
}
Message received 0.45 seconds after request: {
  "content": ","
}
Message received 0.50 seconds after request: {
  "content": " "
}
Message received 0.51 seconds after request: {
  "content": "2"
}
Message received 0.55 seconds after request: {
  "content": ","
}
Message received 0.59 seconds after request: {
  "content": " "
}
Message received 0.62 seconds after request: {
  "content": "3"
}
Message received 0.65 seconds after request: {
  "content": ","
}
Message received 0.68 seconds after request: {
  "content": " "
}
Message received 0.71 seconds after request: {
  "content": "4"
}
Message received 0.75 seconds after request: {
  "content": ","
}
Message received 0.78 seconds after request: {
  "content": " "
}
Message received 0.85 seconds after request: {
  "content": "5"
}
Message received 0.87 seconds after request: {
  "cont

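With streaming, we receive the first message 0.42 seconds after the request, compared with 8.19 seconds for the full non-streamed response, although the total time to receive everything is slightly longer than for the normal response.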
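
The two numbers that matter here, time to the first piece of content and time to the full reply, can be collected by a small helper. The following is a sketch under stated assumptions: `summarize_stream` and `fake_stream` are hypothetical names, not part of the `openai` library. `fake_stream` imitates the chunk shape printed above (`chunk['choices'][0]['delta']`) with plain dictionaries so the example runs without an API key, and it assumes the stream ends with an empty delta.

```python
import time
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def fake_stream(delay: float = 0.01) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for a streamed response (plain dicts, no API call)."""
    deltas = [
        {"role": "assistant", "content": ""},  # first chunk carries the role
        {"content": "1"},
        {"content": ","},
        {"content": " "},
        {"content": "2"},
        {},  # assumption: the final chunk has an empty delta
    ]
    for delta in deltas:
        time.sleep(delay)  # simulate network and generation latency
        yield {"choices": [{"delta": delta}]}


def summarize_stream(
    chunks: Iterable[Dict[str, Any]],
) -> Tuple[str, Optional[float], float]:
    """Return (full text, seconds until first non-empty content, total seconds).

    Timing starts when iteration begins, so pass the stream in right after
    creating the request.
    """
    start = time.perf_counter()
    first_content_time: Optional[float] = None
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        text = delta.get("content") or ""  # role-only or empty deltas add nothing
        if text and first_content_time is None:
            first_content_time = time.perf_counter() - start
        parts.append(text)
    total_time = time.perf_counter() - start
    return "".join(parts), first_content_time, total_time


reply, first_time, total_time = summarize_stream(fake_stream())
print(f"First content after {first_time:.2f}s, full reply after {total_time:.2f}s")
print(f"Reply: {reply}")
```

With a real streamed response, the same helper could be given the `response` iterator in place of `fake_stream()`, provided the chunks support the same dictionary-style access used in the loop above.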
