# Tokenization and counting tokens 

Although LLMs allow text-to-text user--computer interaction, behind the scenes they work with numbers.
This means that any input text needs to be converted into a sequence of integers ("encoded") that represent the words, subwords, and symbols in the input in a way the model can "understand."
This process of converting text inputs into a sequence of integers is called *tokenization*.

When you work with the OpenAI GPT models, you don't need to worry about this too much, since the API handles the tokenization for you.
The main reason to know about tokenization is to be able to count the number of tokens in your input text.
Counting tokens is important because it helps you compute the cost of using a language model and ensure that your input text stays within the maximum token limit of the model you are using.

In this notebook, we'll use the `tiktoken` Python library to count the number of tokens in a given text.
An alternative, interactive tool can be found at https://platform.openai.com/tokenizer

## Background

> The atomic unit of consumption for a language model is not a “word”, but rather a “token”.
> You can kind of think of tokens as syllables, and on average they work out to about 750 words per 1,000 tokens.
> They represent many concepts beyond just alphabetical characters – such as punctuation, sentence boundaries, and the end of a document.
> &mdash; [source](https://github.com/brexhq/prompt-engineering?tab=readme-ov-file#tokens)

Learn more about tokenizers and why they exist here: https://huggingface.co/docs/transformers/tokenizer_summary
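
The rule of thumb quoted above (about 750 words per 1,000 tokens) can be turned into a quick back-of-the-envelope estimate. The sketch below is only an approximation that assumes English text; the function name `estimate_tokens_from_words` is made up for illustration, and an exact count still requires running a tokenizer.

```python
import math

# Rule of thumb from the quote: roughly 750 English words per 1,000 tokens.
WORDS_PER_1K_TOKENS = 750


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate from a word count, rounded up."""
    return math.ceil(word_count * 1000 / WORDS_PER_1K_TOKENS)


print(estimate_tokens_from_words(750))   # 1000
print(estimate_tokens_from_words(3000))  # 4000
```

Other languages and text with many symbols or numbers can deviate considerably from this ratio.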

## Token limits a.k.a. context window size

LLMs are "stateless" and thus cannot remember anything about previous requests or conversations.
This means that you always need to include everything the model might need to know that is specific to the current session.

This is a major downside of LLMs, because the leading language model architecture, the Transformer, has a fixed maximum input and output size: at a certain point the prompt cannot grow any larger.
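
Because every request has to carry the relevant history, a common workaround is to drop the oldest messages once a token budget is exceeded. The following is a minimal sketch of that idea, not a prescribed method: `trim_history` and `word_counter` are hypothetical helpers, and the word-based counter only stands in for a real tokenizer-based counter.

```python
from typing import Callable, List


def trim_history(
    messages: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[str]:
    """Keep the most recent messages whose combined token count fits max_tokens."""
    kept: List[str] = []
    total = 0
    # Walk backwards so that the newest messages are considered first.
    for message in reversed(messages):
        n = count_tokens(message)
        if total + n > max_tokens:
            break
        kept.append(message)
        total += n
    kept.reverse()  # restore chronological order
    return kept


def word_counter(text: str) -> int:
    """Stand-in counter for illustration: one 'token' per whitespace-separated word."""
    return len(text.split())


history = ["Hi there", "Hello! How can I help?", "Summarize our chat so far"]
print(trim_history(history, word_counter, max_tokens=10))
# ['Hello! How can I help?', 'Summarize our chat so far']
```

In practice, you would pass a counter built on a real tokenizer instead of `word_counter`.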

The total size of the prompt, sometimes referred to as the **context window**, is model dependent.
For the original GPT-3.5-turbo, it is 4,096 tokens.
For GPT-4, it is 8,192 tokens or 32,768 tokens depending on which variant you use.

You can find a detailed overview here: 

- for GPT-4 and its variants: https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo
- for GPT-3.5-turbo and its variants: https://platform.openai.com/docs/models/gpt-3-5-turbo
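
Given a model's context window size, you can check up front whether a request is likely to fit. The sketch below assumes that the prompt and the generated response share the same window; the function name `fits_context_window` is hypothetical, and the limit used is the 8,192-token GPT-4 figure mentioned above.

```python
def fits_context_window(
    prompt_tokens: int,
    max_output_tokens: int,
    context_window: int,
) -> bool:
    """Return True if the prompt plus the reserved output budget fit in the window."""
    return prompt_tokens + max_output_tokens <= context_window


GPT4_CONTEXT_WINDOW = 8_192  # tokens, as stated in the text above

print(fits_context_window(7_000, 1_000, GPT4_CONTEXT_WINDOW))  # True
print(fits_context_window(7_500, 1_000, GPT4_CONTEXT_WINDOW))  # False
```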

In [2]:
# !pip install tiktoken==0.6.0
import tiktoken

`tiktoken` makes available several encodings that are used by the various OpenAI models, including GPT-3 and GPT-4.

In [12]:
# list encoding names
tiktoken.list_encoding_names()

['gpt2', 'r50k_base', 'p50k_base', 'p50k_edit', 'cl100k_base']

For example, GPT-4 (snapshot from June 2023) uses the 'cl100k_base' encoding:

In [17]:
# get the encoding for the desired model
encoding = tiktoken.encoding_for_model('gpt-4-0613')
encoding.name

'cl100k_base'

With the `encoding` instance created above, you can tokenize and encode any text input:

In [18]:
encoding.encode('Hello, world!')

[9906, 11, 1917, 0]

These numbers are the tokens' indices in the tokenizer's vocabulary, not token counts. To see which piece of text each index stands for, decode the tokens one by one:

In [25]:
[encoding.decode_single_token_bytes(tok).decode() for tok in encoding.encode('Hello, world!')]

['Hello', ',', ' world', '!']
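
One caveat with decoding tokens individually: the encodings operate on bytes, so a single token can hold only part of a multi-byte character (an emoji, for instance), and calling `.decode()` on such bytes raises a `UnicodeDecodeError`. The sketch below shows one tolerant way to handle this using plain byte strings; `token_bytes_to_text` is a hypothetical helper, and the split emoji only simulates what a tokenizer might produce.

```python
def token_bytes_to_text(token_bytes: bytes) -> str:
    """Decode token bytes as UTF-8, substituting U+FFFD for incomplete sequences."""
    return token_bytes.decode("utf-8", errors="replace")


emoji = "👋".encode("utf-8")  # four bytes in UTF-8
first_part, second_part = emoji[:2], emoji[2:]  # simulate a character split across tokens

print(token_bytes_to_text(b" world"))                  # decodes normally
print(token_bytes_to_text(first_part))                 # contains the replacement character
print(token_bytes_to_text(first_part + second_part))   # the complete emoji
```

With `tiktoken`, you would pass the result of `encoding.decode_single_token_bytes(tok)` to this helper instead of calling `.decode()` on it directly.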

But since we can tokenize a text, counting the number of tokens is trivial:

In [27]:
toks = encoding.encode('Hello, world!')
len(toks)

4

### A simple utility function

In [6]:
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [7]:
text = "Liberal Alliance er det eneste alternativ til  et træt VKO-flertal, som er bange for både  reformer, udlændinge og vælgere, og en  populistisk S/SF-regering, som er bange  for præcis de samme ting - og som vil indføre endnu flere skatter, afgifter, regler og  forbud,  end  den  nuværende  regering  plager os med."
num_tokens_from_string(text)

107

### A more advanced approach

If we use the function above, we look up the encoding every time we want to count the tokens in a new text (`tiktoken` caches encodings it has already loaded, so this is cheap, but it is still repeated work).
Also, we can only input one text at a time.

To avoid this, we can create a class that loads the encoding once and then allows us to count the tokens in multiple texts.

In [35]:
from typing import Union, List

class TokenCounter:
    def __init__(self, encoding_name: Union[str, None] = None, model: Union[str, None] = None):
        """
        Initialize the tokenizer with either a model or an encoding name.

        Args:
            encoding_name (Union[str, None]): The name of the encoding to use. Default is None.
            model (Union[str, None]): The model to use for encoding. Default is None.

        Raises:
            ValueError: If neither model nor encoding_name is provided.
            ValueError: If both model and encoding_name are provided.
        """
        # ensure that either model or encoding_name is provided
        if model is None and encoding_name is None:
            raise ValueError("Either `model` or `encoding_name` must be provided.")
        if model is not None and encoding_name is not None:
            raise ValueError("Only one of `model` or `encoding_name` can be provided.")
        if encoding_name:
            self.encoding = tiktoken.get_encoding(encoding_name)
        else:
            self.encoding = tiktoken.encoding_for_model(model)
    
    def count_tokens(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Count the number of tokens in the input.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        if isinstance(input, str):
            return len(self.encoding.encode(input))
        else:
            toks = self.encoding.encode_batch(input)
            return [len(t) for t in toks]

    def __call__(self, input: Union[str, List[str]]) -> Union[int, List[int]]:
        """
        Call the tokenizer on the input. This is equivalent to calling count_tokens.

        Args:
            input (Union[str, List[str]]): The input to tokenize. Can be a string or a list of strings.

        Returns:
            Union[int, List[int]]: The number of tokens in the input. If the input is a list, returns a list of token counts.
        """
        return self.count_tokens(input)

**_Note:_** This code defines a `TokenCounter` class that must be initialized with either a model or an encoding name, but not both. The `count_tokens` method counts the number of tokens in the input, and the `__call__` method allows the counter to be called like a function.

In [36]:
token_counter = TokenCounter(model="gpt-4-0613")

In [38]:
token_counter("Hello, world!")

4

In [39]:
token_counter(["Hello, world!", "I'm tiktoken!"])

[4, 5]

## Computing API usage costs

OpenAI charges model usage costs based on the number of tokens processed by the model.
This means that you need to be aware of the number of tokens in your input text and the (expected) number of tokens in the model's response to avoid unexpected costs.

To see what OpenAI charges you per 1000 (one thousand) input and output tokens, see https://openai.com/pricing
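
Once you know the token counts, estimating the cost of a request is simple arithmetic. The sketch below assumes separate per-1,000-token prices for input and output; `estimate_cost` is a hypothetical helper, and the prices used are placeholders for illustration, not actual OpenAI prices.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimate the cost of one request from token counts and per-1,000-token prices."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Placeholder prices for illustration only; look up the current prices before relying on this.
cost = estimate_cost(
    input_tokens=2_000,
    output_tokens=500,
    input_price_per_1k=0.01,
    output_price_per_1k=0.03,
)
print(f"${cost:.4f}")  # $0.0350
```

The input count can come from a token counter like the one defined earlier in this notebook; the output count is only known after the response arrives, so use an upper bound when budgeting.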