
tiktoken example notebook returns incorrect token counts for chat APIs #488

@RossBencina

Description

Identify the file to be fixed
The name of the file containing the problem: How_to_count_tokens_with_tiktoken.ipynb

Describe the problem
The supplied example code for computing token counts for chat messages undercounts by 1 token per message: the totals returned by num_tokens_from_messages() do not match those reported by the API endpoint. The discrepancy is the same for both the gpt-3.5-turbo and gpt-4 endpoints, even though they take separate code paths.

Describe a solution
By trial and error I arrived at the following changes, on the two lines marked with WAS comments:

    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 5 # WAS 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 4 # WAS 3
        tokens_per_name = 1

With the above changes I get the correct token counts for both chat endpoints.
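For reference, here is a minimal sketch of the notebook's counting loop with the adjusted gpt-3.5-turbo-0301 constants. The `encode` parameter stands in for tiktoken's `encoding.encode` (a whitespace splitter is used below purely for illustration, so real token counts will differ), and the final `+ 3` is the reply-priming constant from the notebook:

```python
def num_tokens_from_messages(messages, encode,
                             tokens_per_message=5, tokens_per_name=-1):
    """Sketch of the notebook's counting loop, with the adjusted
    gpt-3.5-turbo-0301 constants (tokens_per_message WAS 4)."""
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # reply priming, per the notebook
    return num_tokens

# Stand-in encoder: one token per whitespace-separated word (illustration only).
fake_encode = lambda s: s.split()

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"},
]
print(num_tokens_from_messages(messages, fake_encode))  # → 19
```

Because the error is per message, longer conversations drift further from the API's reported prompt_tokens, which makes the off-by-one easy to confirm empirically.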

May I suggest that the tiktoken library itself handle the details of knowing the chat wrapper encoding?

Additional context
I tried to get tiktoken to encode the message wrappers to compute the actual token overhead using:

    encoding.encode("<|im_start|>system\n<|im_end|>\n", allowed_special="all")

and

    encoding.encode("<|start|>system\n<|end|>\n", allowed_special="all")

But tiktoken does not understand those special start/end tokens.
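This is expected: a tokenizer only treats a string as a single special token if it appears in the encoding's registered special-token table; anything else is broken into ordinary BPE pieces. The toy stand-in below (not tiktoken's API) illustrates that mechanism with a crude whitespace tokenizer in place of BPE:

```python
import re

def encode_with_specials(text, special_tokens):
    """Split text on registered special tokens; everything else is
    tokenized crudely (whitespace split) as a stand-in for BPE."""
    if not special_tokens:
        return re.findall(r"\S+", text)
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    tokens = []
    for part in re.split(pattern, text):
        if part in special_tokens:
            tokens.append(part)          # one token for the whole marker
        elif part.strip():
            tokens.extend(part.split())  # crude stand-in for BPE
    return tokens

text = "<|im_start|>system\n<|im_end|>\n"
base = encode_with_specials(text, set())
extended = encode_with_specials(text, {"<|im_start|>", "<|im_end|>"})
print(base)      # markers fused with surrounding text, not recognized
print(extended)  # markers recognized as standalone tokens
```

With no registered specials the markers never appear as standalone tokens, which mirrors why measuring the chat-wrapper overhead directly with the public encodings does not work.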

Labels

    bug (Something isn't working)
