# Calculate The Number Of Tokens

This notebook explains how to estimate the number of tokens in different use cases.

## Setup

In [3]:
%pip install --upgrade tiktoken -q
%pip install --upgrade openai -q

Note: you may need to restart the kernel to use updated packages.


Import tiktoken

In [1]:
import tiktoken

Helper functions to count the number of tokens in a string, given either an encoding name or a model name.

In [6]:
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string for the named encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [7]:
def num_tokens_for_string_for_model(string: str, model_name: str) -> int:
    """Returns the number of tokens in a text string for the model's encoding."""
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

Test

In [11]:
test_string = "tiktoken is great!"
model_name = "gpt-4o-mini"
encoding_name = "o200k_base"


In [8]:
print(f"Number of tokens({model_name}): {num_tokens_for_string_for_model(test_string, model_name)}")

Number of tokens(gpt-4o-mini): 6


In [12]:
print(f"Number of tokens({encoding_name}): {num_tokens_from_string(test_string, encoding_name)}")

Number of tokens(o200k_base): 6
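
When several strings need counting, it can be convenient to obtain the encoding once and reuse its `encode` method. The following is a minimal sketch and not part of the original notebook: `count_tokens_batch` is a hypothetical helper that accepts any callable mapping a string to a sequence of tokens, and `toy_encode` is only a whitespace-splitting stand-in so the example runs without tiktoken installed. Its counts are not real token counts.

```python
from typing import Callable, Dict, Iterable, Sequence


def count_tokens_batch(texts: Iterable[str],
                       encode: Callable[[str], Sequence]) -> Dict[str, int]:
    """Map each text to the number of tokens `encode` produces for it."""
    return {text: len(encode(text)) for text in texts}


def toy_encode(text: str) -> list:
    """Stand-in encoder (NOT a real tokenizer): splits on whitespace."""
    return text.split()


# Hypothetical usage with the stand-in encoder.
counts = count_tokens_batch(["tiktoken is great!", "hello world"], toy_encode)
print(counts)
```

With tiktoken, one would pass `tiktoken.get_encoding(encoding_name).encode` (or the `encode` method of the object returned by `tiktoken.encoding_for_model(model_name)`) in place of `toy_encode`.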


Let's calculate the number of tokens in a given file and estimate the cost for one model used for embedding.

In [32]:
file_name = "../data/ThePrince_Machiavelli.txt"
# Price per token, assuming a rate of $0.05 per 1M tokens.
cost_per_token = 0.05 / 1_000_000
with open(file_name, "r", encoding="utf-8") as file:
    content = file.read()

# Characters, whitespace-separated words, and tokens in the file.
content_length = len(content)
print(f"Length({file_name}): {content_length:,}")
content_words = content.split()
print(f"Number of words({file_name}): {len(content_words):,}")
content_token_count = num_tokens_for_string_for_model(content, model_name)
print(f"Number of tokens({model_name}): {content_token_count:,}")
# The rate is in dollars, so the product is a dollar amount (not cents).
print(f"Cost of tokens({model_name}): ${content_token_count * cost_per_token}")

Length(../data/ThePrince_Machiavelli.txt): 301,912
Number of words(../data/ThePrince_Machiavelli.txt): 52,979
Number of tokens(gpt-4o-mini): 70,730
Cost of tokens(gpt-4o-mini): $0.0035365
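
The counts above can be turned into a few rough rules of thumb. The sketch below is plain arithmetic on those numbers. The helper name `summarize_token_stats` is hypothetical, and it assumes the price is quoted in US dollars per one million tokens.

```python
def summarize_token_stats(chars: int, words: int, tokens: int,
                          usd_per_million_tokens: float) -> dict:
    """Derive simple ratios and an estimated cost from raw counts."""
    # Assumption: the rate is in US dollars per 1,000,000 tokens.
    cost_usd = tokens * usd_per_million_tokens / 1_000_000
    return {
        "chars_per_token": chars / tokens,
        "tokens_per_word": tokens / words,
        "cost_usd": cost_usd,
        "cost_cents": cost_usd * 100,
    }


# Counts reported above for ThePrince_Machiavelli.txt with gpt-4o-mini.
stats = summarize_token_stats(chars=301_912, words=52_979, tokens=70_730,
                              usd_per_million_tokens=0.05)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

For this file the ratios come out to roughly 4.3 characters per token and 1.3 tokens per whitespace-separated word. These ratios describe this particular English text and encoding and may differ for other languages or content.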


## References
 1. https://github.com/openai/tiktoken
 1. https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

## Appendix

In [20]:
test_string = "tiktoken is great!"
model_name = "gpt-4o-mini"

encoding = tiktoken.encoding_for_model(model_name)
encoded = encoding.encode(test_string)
print(f"Encode({model_name},\"{test_string[:10]}...\") => {encoded}")
print(f"Len({encoded}) => {len(encoded)}")

Encode(gpt-4o-mini,"tiktoken i...") => [83, 8251, 2488, 382, 2212, 0]
Len([83, 8251, 2488, 382, 2212, 0]) => 6


In [21]:
test_string = "tiktoken is great!"
encoding_name = "o200k_base"

encoding = tiktoken.get_encoding(encoding_name)
encoded = encoding.encode(test_string)
print(f"Encode({encoding_name},\"{test_string[:10]}...\") => {encoded}")
print(f"Len({encoded}) => {len(encoded)}")

Encode(o200k_base,"tiktoken i...") => [83, 8251, 2488, 382, 2212, 0]
Len([83, 8251, 2488, 382, 2212, 0]) => 6
