Description
Identify the file to be fixed
The name of the file containing the problem: How_to_count_tokens_with_tiktoken.ipynb
Describe the problem
The example code supplied for computing token counts for chat messages appears to undercount by one token per message: the numbers returned by num_tokens_from_messages() did not match those returned by the API endpoint. The problem is the same with both the gpt-3.5-turbo and gpt-4 endpoints, even though these are separate code paths.
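A minimal sketch of how the discrepancy can be reproduced, assuming the pre-1.0 openai Python client that the cookbook used at the time, with num_tokens_from_messages() taken from the notebook:

```python
import openai  # pre-1.0 client; assumes openai.api_key is already set

example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

for model in ["gpt-3.5-turbo-0301", "gpt-4-0314"]:
    # Token count computed locally by the notebook's helper
    local_count = num_tokens_from_messages(example_messages, model=model)
    # Authoritative count reported back by the API itself
    response = openai.ChatCompletion.create(
        model=model, messages=example_messages, max_tokens=1
    )
    api_count = response["usage"]["prompt_tokens"]
    # With the unpatched notebook code, local_count comes back 1 low per message
    print(f"{model}: local={local_count}, api={api_count}")
```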
Describe a solution
By trial and error I made the following changes on the two lines with the WAS comments:

```python
elif model == "gpt-3.5-turbo-0301":
    tokens_per_message = 5  # WAS 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
    tokens_per_name = -1  # if there's a name, the role is omitted
elif model == "gpt-4-0314":
    tokens_per_message = 4  # WAS 3
    tokens_per_name = 1
```
With the above changes I get the correct token counts for both chat endpoints.
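For completeness, here is a sketch of the helper with both changes applied. The two tokens_per_message lines are the actual fix; the surrounding structure and the final reply-priming constant follow the cookbook version I was working against, so treat those parts as assumptions:

```python
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Return the number of tokens used by a list of chat messages."""
    encoding = tiktoken.encoding_for_model(model)
    if model == "gpt-3.5-turbo-0301":
        tokens_per_message = 5  # WAS 4
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 4  # WAS 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(f"Not implemented for model {model}.")
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # reply priming, as in the cookbook version I used
    return num_tokens
```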
May I suggest that the tiktoken library itself handle the details of knowing the chat wrapper encoding?
Additional context
I tried to get tiktoken to encode the message wrappers directly, to measure the actual per-message token overhead:

```python
encoding.encode("<|im_start|>system\n<|im_end|>\n", allowed_special="all")
```

and

```python
encoding.encode("<|start|>system\n<|end|>\n", allowed_special="all")
```

But tiktoken does not recognize those start/end markers as special tokens, so the wrapper overhead cannot be measured this way.
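One possible workaround, following the "Extending tiktoken" example in the tiktoken README, is to build a custom Encoding that registers the chat wrapper tokens on top of cl100k_base. The token IDs below are the ones that README example uses, and the private attributes are an implementation detail, so this is a sketch rather than a supported API:

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Register the ChatML wrapper tokens on top of cl100k_base.
# IDs and private-attribute access follow the "Extending tiktoken"
# example in the tiktoken README.
enc = tiktoken.Encoding(
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    },
)

# Now the wrapper encodes as two special tokens plus the ordinary text,
# so the per-message overhead can be counted directly.
tokens = enc.encode("<|im_start|>system\n<|im_end|>\n", allowed_special="all")
print(tokens)
```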