`tiktoken` is an open-source library from OpenAI for tokenizing text.

Tokenization splits a text string into a list of tokens. Depending on the text and its language, a token can be a single character, part of a word, or a whole word.
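
To make the definition concrete, here is a toy sketch in pure Python. It is not how `tiktoken` works internally (`tiktoken` uses byte pair encoding with a learned vocabulary). The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up for illustration and use simple greedy longest-match.

```python
# Hypothetical toy vocabulary: maps token strings to integer ids.
TOY_VOCAB = {"The": 0, " Dodgers": 1, " won": 2, " ": 3, "202": 4, "0": 5, ".": 6}


def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match: at each position, take the longest vocab entry that fits."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so " Dodgers" wins over " ".
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers text at position {pos}")
    return ids


print(toy_tokenize("The Dodgers won 2020.", TOY_VOCAB))  # [0, 1, 2, 3, 4, 5, 6]
```

Note how the leading space is part of the tokens ` Dodgers` and ` won`, and how `2020` is split into `202` and `0`. Real encodings show similar patterns, as the `gpt-4` output further down illustrates.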

[OpenAI Tokenizer Tool](https://platform.openai.com/tokenizer)

[tiktoken on GitHub](https://github.com/openai/tiktoken)

`tiktoken` supports several encodings used by OpenAI models, including these three:

| Encoding name           | OpenAI models                                       |
|-------------------------|-----------------------------------------------------|
| `cl100k_base`           | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`  |
| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003`|
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                         |
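
The table can be read as a simple lookup from model name to encoding name. The sketch below mirrors it as a plain dictionary. `MODEL_TO_ENCODING` and `encoding_name_for` are hypothetical names used for illustration and are not part of `tiktoken`, and the Codex row is left out because the table does not list specific model names for it.

```python
# Hypothetical lookup that mirrors the table above; not part of tiktoken.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_name_for(model: str) -> str:
    """Return the encoding name listed in the table for a model."""
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        raise KeyError(f"model {model!r} is not in the table") from None


print(encoding_name_for("gpt-4"))             # cl100k_base
print(encoding_name_for("text-davinci-003"))  # p50k_base
```

In practice there is no need to maintain such a table by hand: `tiktoken.encoding_for_model(model_name)` performs this lookup, and `tiktoken.get_encoding(encoding_name)` loads an encoding directly by name.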

In [1]:
from tiktoken import encoding_for_model

In [6]:
enc = encoding_for_model("text-davinci-003")
toks = enc.encode("The Los Angeles Dodgers won the World Series in 2020.")

toks

[464, 5401, 5652, 23576, 1839, 262, 2159, 7171, 287, 12131, 13]

In [7]:
enc = encoding_for_model("gpt-4")
toks = enc.encode("The Los Angeles Dodgers won the World Series in 2020.")

toks

[791, 9853, 12167, 56567, 2834, 279, 4435, 11378, 304, 220, 2366, 15, 13]

In [9]:
[enc.decode_single_token_bytes(o).decode('utf-8') for o in toks]

['The',
 ' Los',
 ' Angeles',
 ' Dodgers',
 ' won',
 ' the',
 ' World',
 ' Series',
 ' in',
 ' ',
 '202',
 '0',
 '.']
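
One caveat with decoding tokens one at a time: `decode_single_token_bytes` returns `bytes`, and a single token is not guaranteed to be a complete UTF-8 sequence, so `.decode('utf-8')` can raise `UnicodeDecodeError` on text with multi-byte characters. The sketch below shows the effect with hand-written byte pieces rather than real tokenizer output. `pieces_to_text` is a hypothetical helper, and the split of `é` across two pieces is an assumption made for the demonstration.

```python
def pieces_to_text(pieces: list[bytes]) -> list[str]:
    """Decode each piece, substituting U+FFFD for incomplete UTF-8 sequences."""
    return [p.decode("utf-8", errors="replace") for p in pieces]


# "é" is two bytes in UTF-8; pretend a tokenizer split it across two tokens.
split_pieces = [b"caf", b"\xc3", b"\xa9"]

# Decoded one by one, each half of "é" becomes the replacement character.
print(pieces_to_text(split_pieces))

# Joining the bytes first and decoding once recovers the original text.
print(b"".join(split_pieces).decode("utf-8"))  # café
```

For display purposes `errors="replace"` keeps the list comprehension from failing. To recover the exact text, decode the joined bytes or call `enc.decode(toks)` on the full token list.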

In [12]:
def num_tokens_from_string(string: str, model_name: str) -> int:
    """Returns the number of tokens in a text string for the given model."""
    # encoding_for_model expects a model name such as "gpt-4", not an encoding name.
    encoding = encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [14]:
num_tokens_from_string("The Los Angeles Dodgers won the World Series in 2020.", "gpt-4")

13
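
A common use of a token count is keeping text within a token budget. The following is a minimal sketch of that idea, not a recommended implementation. `truncate_to_token_budget` is a hypothetical helper that accepts any object with `encode` and `decode` methods, and `WhitespaceEncoding` is a made-up stand-in so the block runs without `tiktoken` installed.

```python
class WhitespaceEncoding:
    """Hypothetical stand-in with the same encode/decode shape as a tiktoken encoding."""

    def __init__(self):
        self._words = []  # position in this list is the token id

    def encode(self, text):
        ids = []
        for word in text.split(" "):
            if word not in self._words:
                self._words.append(word)
            ids.append(self._words.index(word))
        return ids

    def decode(self, ids):
        return " ".join(self._words[i] for i in ids)


def truncate_to_token_budget(text, encoding, max_tokens):
    """Return text unchanged if it fits, else the first max_tokens tokens decoded."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])


enc = WhitespaceEncoding()
print(truncate_to_token_budget("The Dodgers won the World Series", enc, 3))  # The Dodgers won
```

With `tiktoken`, the same function should work by passing the object returned by `encoding_for_model("gpt-4")`, since it also provides `encode` and `decode`. Keep in mind that cutting a real token list at an arbitrary point can split a multi-byte character, in which case the decoded text may end with a replacement character.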