Merge pull request #278 from openai/ted/update-token-counting
updates token counting guide
ted-at-openai committed Mar 25, 2023
2 parents afa9436 + b45d2b2 commit 4c1c731
Showing 1 changed file with 33 additions and 34 deletions.
examples/How_to_count_tokens_with_tiktoken.ipynb
@@ -11,7 +11,7 @@
"\n",
"Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"cl100k_base\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
"\n",
"Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.\n",
"Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).\n",
"\n",
"\n",
"## Encodings\n",
@@ -22,17 +22,16 @@
"\n",
"| Encoding name | OpenAI models |\n",
"|-------------------------|-----------------------------------------------------|\n",
"| `cl100k_base` | ChatGPT models, `text-embedding-ada-002` |\n",
"| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |\n",
"| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |\n",
"| `p50k_base` | Codex models, `text-davinci-002`, `text-davinci-003`|\n",
"| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |\n",
"\n",
"You can retrieve the encoding for a model using `tiktoken.encoding_for_model()` as follows:\n",
"```python\n",
"encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')\n",
"```\n",
"\n",
"`p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.\n",
"\n",
"Note that `p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.\n",
"\n",
"## Tokenizer libraries by language\n",
"\n",
@@ -61,7 +60,7 @@
"source": [
"## 0. Install `tiktoken`\n",
"\n",
"Install `tiktoken` with `pip`:"
"If needed, install `tiktoken` with `pip`:"
]
},
{
@@ -76,8 +75,8 @@
"Requirement already satisfied: tiktoken in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (0.3.2)\n",
"Requirement already satisfied: regex>=2022.1.18 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from tiktoken) (2022.10.31)\n",
"Requirement already satisfied: requests>=2.26.0 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from tiktoken) (2.28.2)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (3.3)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (2.0.9)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (3.3)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (2021.10.8)\n",
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/ted/.virtualenvs/openai/lib/python3.9/site-packages (from requests>=2.26.0->tiktoken) (1.26.7)\n",
"Note: you may need to restart the kernel to use updated packages.\n"
@@ -98,7 +97,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -119,7 +118,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -136,7 +135,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -162,7 +161,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"outputs": [
{
@@ -171,7 +170,7 @@
"[83, 1609, 5963, 374, 2294, 0]"
]
},
"execution_count": 5,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
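The collapsed source behind this output is presumably the `.encode()` call named in the section title; reconstructed here as a sketch:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
encoding.encode("tiktoken is great!")
# [83, 1609, 5963, 374, 2294, 0]
```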
@@ -190,7 +189,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
@@ -203,7 +202,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"outputs": [
{
@@ -212,7 +211,7 @@
"6"
]
},
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
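A helper of this shape produces the `6` above: encode the string, then take the length. The name `num_tokens_from_string` is our label for the sketch; the counting logic follows the guide directly:

```python
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

num_tokens_from_string("tiktoken is great!", "cl100k_base")  # 6
```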
@@ -239,7 +238,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@@ -248,7 +247,7 @@
"'tiktoken is great!'"
]
},
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
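The collapsed cell presumably reverses the earlier encoding with `.decode()`, per the section title; a sketch reproducing the output:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
encoding.decode([83, 1609, 5963, 374, 2294, 0])
# 'tiktoken is great!'
```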
@@ -275,7 +274,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 8,
"metadata": {},
"outputs": [
{
@@ -284,7 +283,7 @@
"[b't', b'ik', b'token', b' is', b' great', b'!']"
]
},
"execution_count": 9,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
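This per-token byte output comes from `decode_single_token_bytes()`, applied token by token; a sketch:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
[encoding.decode_single_token_bytes(token) for token in [83, 1609, 5963, 374, 2294, 0]]
# [b't', b'ik', b'token', b' is', b' great', b'!']
```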
@@ -308,12 +307,12 @@
"source": [
"## 5. Comparing encodings\n",
"\n",
"Different encodings can vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."
"Different encodings vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings."
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
@@ -336,7 +335,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 10,
"metadata": {},
"outputs": [
{
@@ -366,7 +365,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 11,
"metadata": {},
"outputs": [
{
@@ -396,7 +395,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -431,16 +430,16 @@
"source": [
"## 6. Counting tokens for chat API calls\n",
"\n",
"ChatGPT models like `gpt-3.5-turbo` use tokens in the same way as past completions models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.\n",
"ChatGPT models like `gpt-3.5-turbo` and `gpt-4` use tokens in the same way as older completions models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.\n",
"\n",
"Below is an example function for counting tokens for messages passed to `gpt-3.5-turbo-0301` or `gpt-4-0314`.\n",
"\n",
"Note that the exact way that messages are converted into tokens may change from model to model. So when future model versions are released, the answers returned by this function may be only approximate. The [ChatML documentation](https://github.com/openai/openai-python/blob/main/chatml.md) explains in more detail how the OpenAI API converts messages into tokens."
"Note that the exact way that messages are converted into tokens may change from model to model, and may even change over time for the same model. Therefore, the counts returned by the function below should be considered an estimate, not a guarantee."
]
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
@@ -458,7 +457,7 @@
" print(\"Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.\")\n",
" return num_tokens_from_messages(messages, model=\"gpt-4-0314\")\n",
" elif model == \"gpt-3.5-turbo-0301\":\n",
" tokens_per_message = 4 # every message follows <im_start>{role/name}\\n{content}<im_end>\\n\n",
" tokens_per_message = 4 # every message follows <|start|>{role/name}\\n{content}<|end|>\\n\n",
" tokens_per_name = -1 # if there's a name, the role is omitted\n",
" elif model == \"gpt-4-0314\":\n",
" tokens_per_message = 3\n",
@@ -472,26 +471,26 @@
" num_tokens += len(encoding.encode(value))\n",
" if key == \"name\":\n",
" num_tokens += tokens_per_name\n",
" num_tokens += 2 # every reply is primed with <im_start>assistant\n",
" num_tokens += 3 # every reply is primed with <|start|>assistant<|message|>\n",
" return num_tokens\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gpt-3.5-turbo-0301\n",
"126 prompt tokens counted by num_tokens_from_messages().\n",
"126 prompt tokens counted by the OpenAI API.\n",
"127 prompt tokens counted by num_tokens_from_messages().\n",
"127 prompt tokens counted by the OpenAI API.\n",
"\n",
"gpt-4-0314\n",
"128 prompt tokens counted by num_tokens_from_messages().\n",
"128 prompt tokens counted by the OpenAI API.\n",
"129 prompt tokens counted by num_tokens_from_messages().\n",
"129 prompt tokens counted by the OpenAI API.\n",
"\n"
]
}
