Closed
Labels
bug: Something isn't working
Description
Confirm this is an issue with the Python library and not an underlying OpenAI API
- This is an issue with the Python library
Describe the bug
When generating a completion with stream=True I get a JSONDecodeError when the LLM tries to generate \u2028. After some digging, this seems to be because \u2028 corresponds to the two tokens [378, 101], neither of which can be decoded into a string alone.
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("\u2028")  # '\u2028' (unicode line separator) -> [378, 101]
enc.decode_bytes(tokens).decode()  # decoding [378, 101] together is fine
enc.decode_bytes([tokens[0]]).decode()  # [378] alone
# > UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
enc.decode_bytes([tokens[1]]).decode()  # [101] alone
# > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 0: invalid start byte
```

When stream=True I presume the tokens are streamed one at a time and hence can't be decoded individually. The error occurs when the openai library tries to parse the SSE data as JSON, which fails because the payload is cut off mid-string, e.g. `{..., "choices":[{"index":0,"delta":{"content":"` (note that I have truncated the beginning of the data, but the content just ends like this).
This seems quite obscure but comes up frequently in scenarios where references / citations are made to some user input.
To Reproduce
- Create a prompt including a single character/symbol that gets encoded into multiple tokens, e.g. \u2028, which cl100k_base maps to the tokens [378, 101]
- Ask the LLM to recite the user message
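For what it's worth, the split UTF-8 bytes themselves are recoverable if they are buffered rather than decoded eagerly; a sketch using the standard library's incremental decoder (not something the openai library currently does, as far as I can tell):

```python
import codecs

# '\u2028' encodes to the three UTF-8 bytes b'\xe2\x80\xa8'. If one token
# carries b'\xe2\x80' and the next carries b'\xa8', decoding each chunk
# independently fails, but an incremental decoder buffers the partial
# sequence until the remaining byte arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xe2\x80")  # incomplete sequence: returns ''
second = decoder.decode(b"\xa8")     # completes the character: returns '\u2028'

print(repr(first), repr(second))
```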
Code snippets
```python
from openai import OpenAI

api_key = "..."  # Replace with your API key
system_message = "The user will send you a short text. You MUST respond with the EXACT same text verbatim as the user supplies, nothing more, nothing less."
user_message = """
Minim culpa \u2028 anim eu id exercitation amet. Culpa culpa esse mollit pariatur est enim. Exercitation minim cillum aute occaecat. Incididunt velit commodo sit ea. \u2028
Deserunt labore eu ipsum reprehenderit esse sunt nisi aliqua qui id mollit. Id cupidatat incididunt Lorem ex ullamco quis voluptate mollit sit labore quis. Nostrud sint sint Lorem tempor minim amet aliquip elit fugiat. Ipsum cupidatat ipsum veniam ut ea magna nostrud id quis exercitation tempor velit aliqua sit. Proident sint velit ullamco culpa dolore magna ut eiusmod pariatur. Commodo ut sint minim ex aliqua eu esse anim elit elit eiusmod ea. Culpa quis in ea id cupidatat labore amet amet ullamco sunt Lorem do tempor ad. \u2028
Dolor anim dolore laborum fugiat dolor eiusmod amet adipisicing. Consectetur et dolor enim proident aute deserunt. Excepteur ullamco ea officia nulla irure cupidatat veniam ipsum ex. Labore sint sit incididunt ad exercitation labore minim consequat elit sit nulla occaecat do nisi. Irure est commodo id eu fugiat eiusmod proident consequat ea.from typing import Any, Dict, List, Literal, Union \u2028
"""

client = OpenAI(api_key=api_key)
response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content, end="")
```

OS
macOS
Python version
Python 3.11.0
Library version
openai v1.13.3