# How to stream completions

By default, when you request a completion from the OpenAI API, the entire completion is generated before being sent back in a single response.

If you're generating long completions, waiting for the response can take many seconds.

To get responses sooner, you can 'stream' the completion as it's being generated. This allows you to start printing or processing the beginning of the completion before the full completion is finished.

To stream completions, set `stream=True` when calling the chat completions or completions endpoints. This will return an object that streams back the response as [data-only server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format). Extract chunks from the `delta` field rather than the `message` field.
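To make the wire format concrete, here is a minimal sketch of how a data-only event stream could be parsed by hand. The `openai` library does this for you, so the code is illustrative only. The helper name `parse_sse_data_lines`, the sample lines, and the `[DONE]` end-of-stream sentinel are assumptions for this example rather than something shown in this notebook.

```python
import json


def parse_sse_data_lines(lines):
    """Yield the JSON payload of each `data:` line in a data-only event stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(payload)


# Hypothetical raw stream, shaped like the chunks this notebook prints
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}, "index": 0, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"delta": {"content": "2"}, "index": 0, "finish_reason": null}]}',
    "",
    "data: [DONE]",
]

events = list(parse_sse_data_lines(sample_stream))
deltas = [event["choices"][0]["delta"] for event in events]
print(deltas)
```

Each parsed event carries its new text in `choices[0].delta`, which is why the `delta` field is the one to read when streaming.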

## Downsides

Note that using `stream=True` in a production application makes it more difficult to moderate the content of the completions, as partial completions may be more difficult to evaluate. This has implications for [approved usage](https://beta.openai.com/docs/usage-guidelines).

Another small drawback of streaming responses is that the response no longer includes the `usage` field to tell you how many tokens were consumed. After receiving and combining all of the responses, you can calculate this yourself using [`tiktoken`](How_to_count_tokens_with_tiktoken.ipynb).

## Example code

Below, this notebook shows:

1. What a typical chat completion response looks like
2. What a streaming chat completion response looks like
3. How much time is saved by streaming a chat completion
4. How to stream non-chat completions (used by older models like `text-davinci-003`)

In [1]:
# imports
import openai  # for OpenAI API calls
import time  # for measuring time duration of API calls

### 1. What a typical chat completion response looks like

With a typical ChatCompletions API call, the response is first computed and then returned all at once.

In [2]:
# Example of an OpenAI ChatCompletion request
# https://platform.openai.com/docs/guides/chat

# record the time before the request is sent
start_time = time.time()

# send a ChatCompletion request to count to 100
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0,
)

# calculate the time it took to receive the response
response_time = time.time() - start_time

# print the time delay and text received
print(f"Full response received {response_time:.2f} seconds after request")
print(f"Full response received:\n{response}")


Full response received 44.57 seconds after request
Full response received:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.",
        "role": "assistant"
      }
    }
  ],
  "created": 1680853695,
  "id": "chatcmpl-72b7PPTarnyTtj1oPymfl0EJGMRvq",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 299,
    "prompt_tokens": 37,
    "total_tokens": 336
  }
}


The reply can be extracted with `response['choices'][0]['message']`.

The content of the reply can be extracted with `response['choices'][0]['message']['content']`.

In [3]:
reply = response['choices'][0]['message']
print(f"Extracted reply: \n{reply}")

reply_content = response['choices'][0]['message']['content']
print(f"Extracted content: \n{reply_content}")


Extracted reply: 
{
  "content": "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.",
  "role": "assistant"
}
Extracted content: 
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.


### 2. How to stream a chat completion

With a streaming API call, the response is sent back incrementally in chunks via an [event stream](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format). In Python, you can iterate over these events with a `for` loop.

Let's see what it looks like:

In [4]:
# Example of an OpenAI ChatCompletion request with stream=True
# https://platform.openai.com/docs/guides/chat

# a ChatCompletion request
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    stream=True  # this time, we set stream=True
)

for chunk in response:
    print(chunk)

{
  "choices": [
    {
      "delta": {
        "role": "assistant"
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1680853977,
  "id": "chatcmpl-72bBxYOFoElrpCN29UeVf8uhRtvhe",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
{
  "choices": [
    {
      "delta": {
        "content": "2"
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1680853977,
  "id": "chatcmpl-72bBxYOFoElrpCN29UeVf8uhRtvhe",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
{
  "choices": [
    {
      "delta": {
        "content": "."
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1680853977,
  "id": "chatcmpl-72bBxYOFoElrpCN29UeVf8uhRtvhe",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
{
  "choices": [
    {
      "delta": {},
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "created": 1680853977,
  "id": "chatcmpl-72bBxYOFoElrpCN29UeVf8uhRt

As you can see above, streaming responses have a `delta` field rather than a `message` field. `delta` can hold things like:

- a role token (e.g., `{"role": "assistant"}`)
- a content token (e.g., `{"content": "\n\n"}`)
- nothing (e.g., `{}`), when the stream is over
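
As a minimal sketch of how these three kinds of `delta` fit together, the helper below folds a sequence of deltas into a single message dict. The name `merge_deltas` is hypothetical, and the sample deltas are copied by hand from the chunks printed above, so no API call is made.

```python
def merge_deltas(deltas):
    """Combine a sequence of `delta` dicts into one message dict."""
    message = {"role": None, "content": ""}
    for delta in deltas:
        if "role" in delta:
            message["role"] = delta["role"]  # the role arrives once, in the first chunk
        message["content"] += delta.get("content", "")  # an empty delta adds nothing
    return message


# Deltas taken from the four chunks shown above
deltas = [{"role": "assistant"}, {"content": "2"}, {"content": "."}, {}]
merged = merge_deltas(deltas)
print(merged)
```

Using `delta.get("content", "")` means role-only and empty deltas need no special handling.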

### 3. How much time is saved by streaming a chat completion

Now let's ask `gpt-3.5-turbo` to count to 100 again, and see how long it takes.

In [5]:
# Example of an OpenAI ChatCompletion request with stream=True
# https://platform.openai.com/docs/guides/chat

# record the time before the request is sent
start_time = time.time()

# send a ChatCompletion request to count to 100
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0,
    stream=True  # again, we set stream=True
)

# create variables to collect the stream of chunks
collected_chunks = []
collected_messages = []
# iterate through the stream of events
for chunk in response:
    chunk_time = time.time() - start_time  # calculate the time delay of the chunk
    collected_chunks.append(chunk)  # save the event response
    chunk_message = chunk['choices'][0]['delta']  # extract the message
    collected_messages.append(chunk_message)  # save the message
    print(f"Message received {chunk_time:.2f} seconds after request: {chunk_message}")  # print the delay and text

# print the time delay and text received
print(f"Full response received {chunk_time:.2f} seconds after request")
full_reply_content = ''.join([m.get('content', '') for m in collected_messages])
print(f"Full conversation received: {full_reply_content}")


Message received 1.49 seconds after request: {
  "role": "assistant"
}
Message received 1.49 seconds after request: {
  "content": "1"
}
Message received 1.63 seconds after request: {
  "content": ","
}
Message received 1.71 seconds after request: {
  "content": " "
}
Message received 1.84 seconds after request: {
  "content": "2"
}
Message received 1.95 seconds after request: {
  "content": ","
}
Message received 2.07 seconds after request: {
  "content": " "
}
Message received 2.19 seconds after request: {
  "content": "3"
}
Message received 2.32 seconds after request: {
  "content": ","
}
Message received 2.44 seconds after request: {
  "content": " "
}
Message received 2.54 seconds after request: {
  "content": "4"
}
Message received 2.68 seconds after request: {
  "content": ","
}
Message received 2.79 seconds after request: {
  "content": " "
}
Message received 2.92 seconds after request: {
  "content": "5"
}
Message received 3.01 seconds after request: {
  "content": ","
}
Messa

#### Time comparison

Both requests need a similar amount of time to fully complete, since the same reply is generated either way; in the run shown above, the non-streaming request took 44.57 seconds. Request times will vary depending on load and other stochastic factors.

However, with the streaming request, we received the first chunk after 1.49 seconds, and subsequent chunks roughly every 0.1 seconds.
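
To quantify the benefit, the sketch below computes the time to the first chunk and the mean gap between chunks. The timestamps are copied by hand from the streaming output printed above, and the helper name `latency_summary` is hypothetical.

```python
# Arrival times (seconds after the request) from the streaming output above
chunk_times = [1.49, 1.49, 1.63, 1.71, 1.84, 1.95, 2.07, 2.19,
               2.32, 2.44, 2.54, 2.68, 2.79, 2.92, 3.01]


def latency_summary(times):
    """Return (time to first chunk, mean gap between consecutive chunks)."""
    first = times[0]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return first, mean_gap


first, mean_gap = latency_summary(chunk_times)
print(f"Time to first chunk: {first:.2f} s")
print(f"Mean gap between chunks: {mean_gap:.2f} s")
```

In a live application the same figures can be collected by recording `time.time() - start_time` inside the `for` loop, as the cell above does.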

### 4. How to stream non-chat completions (used by older models like `text-davinci-003`)

#### A typical completion request

With a typical Completions API call, the text is first computed and then returned all at once.

In [6]:
# Example of an OpenAI Completion request
# https://beta.openai.com/docs/api-reference/completions/create

# record the time before the request is sent
start_time = time.time()

# send a Completion request to count to 100
response = openai.Completion.create(
    model='text-davinci-002',
    prompt='1,2,3,',
    max_tokens=193,
    temperature=0,
)

# calculate the time it took to receive the response
response_time = time.time() - start_time

# extract the text from the response
completion_text = response['choices'][0]['text']

# print the time delay and text received
print(f"Full response received {response_time:.2f} seconds after request")
print(f"Full text received: {completion_text}")

Full response received 4.62 seconds after request
Full text received: 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100


#### A streaming completion request

With a streaming Completions API call, the text is sent back via a series of events. In Python, you can iterate over these events with a `for` loop.
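
Because events arrive one fragment at a time, downstream code can act on partial output before the stream ends. The sketch below is one way to do that for comma-separated output. It uses a hypothetical `fake_stream` generator in place of a real API response, and `iter_items` is an illustrative helper, not part of the `openai` library.

```python
def iter_items(events, separator=","):
    """Yield each separator-delimited item as soon as it is complete."""
    buffer = ""
    for event in events:
        buffer += event["choices"][0]["text"]  # same field the Completions stream uses
        while separator in buffer:
            item, buffer = buffer.split(separator, 1)
            yield item
    if buffer:
        yield buffer  # whatever remains when the stream ends


def fake_stream():
    """Hypothetical stand-in for a streaming response; makes no API call."""
    for piece in ["4", ",", "5", ",1", "0", ",11"]:
        yield {"choices": [{"text": piece}]}


items = list(iter_items(fake_stream()))
print(items)
```

Buffering until a separator appears keeps items such as `10` intact even when their characters arrive in separate events.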

In [7]:
# Example of an OpenAI Completion request, using the stream=True option
# https://beta.openai.com/docs/api-reference/completions/create

# record the time before the request is sent
start_time = time.time()

# send a Completion request to count to 100
response = openai.Completion.create(
    model='text-davinci-002',
    prompt='1,2,3,',
    max_tokens=193,
    temperature=0,
    stream=True,  # this time, we set stream=True
)

# create variables to collect the stream of events
collected_events = []
completion_text = ''
# iterate through the stream of events
for event in response:
    event_time = time.time() - start_time  # calculate the time delay of the event
    collected_events.append(event)  # save the event response
    event_text = event['choices'][0]['text']  # extract the text
    completion_text += event_text  # append the text
    print(f"Text received: {event_text} ({event_time:.2f} seconds after request)")  # print the delay and text

# print the time delay and text received
print(f"Full response received {event_time:.2f} seconds after request")
print(f"Full text received: {completion_text}")

Text received: 4 (0.62 seconds after request)
Text received: , (0.62 seconds after request)
Text received: 5 (0.62 seconds after request)
Text received: , (0.62 seconds after request)
Text received: 6 (0.68 seconds after request)
Text received: , (0.68 seconds after request)
Text received: 7 (0.68 seconds after request)
Text received: , (0.68 seconds after request)
Text received: 8 (0.68 seconds after request)
Text received: , (0.70 seconds after request)
Text received: 9 (0.70 seconds after request)
Text received: , (0.70 seconds after request)
Text received: 10 (0.70 seconds after request)
Text received: , (0.70 seconds after request)
Text received: 11 (0.70 seconds after request)
Text received: , (0.71 seconds after request)
Text received: 12 (0.71 seconds after request)
Text received: , (0.71 seconds after request)
Text received: 13 (0.71 seconds after request)
Text received: , (0.72 seconds after request)
Text received: 14 (0.88 seconds after request)
Text received: , (0.88 second

#### Time comparison

Both requests need a similar amount of time to fully complete; in the run shown above, the non-streaming request took 4.62 seconds. Request times will vary depending on load and other stochastic factors.

However, with the streaming request, we received the first token after 0.62 seconds, and subsequent tokens arrived in bursts, with gaps of roughly 0.01 to 0.16 seconds between bursts.