In [1]:
import pandas as pd
from adapter import (GPTTokenizerAdapter, LlamaTokenizerAdapter,
                     MistralTokenizerAdapter)
from sample_texts import sample_texts

gpt_models = ["gpt-4o", "gpt-4", "gpt-3.5"]


if __name__ == "__main__":
    # Build each tokenizer adapter once instead of once per language.
    adapters = {model: GPTTokenizerAdapter(model) for model in gpt_models}
    adapters["mistral"] = MistralTokenizerAdapter()
    adapters["llama"] = LlamaTokenizerAdapter()

    # Count tokens for every (language, model) pair, in the same order as before.
    result = []
    for lang, text in sample_texts.items():
        for model, adapter in adapters.items():
            tokens = adapter.count_tokens(text)
            result.append({"lang": lang, "model": model, "tokens": tokens})

    df = pd.DataFrame(result)
    print(df)




       lang    model  tokens
0   english   gpt-4o      46
1   english    gpt-4      46
2   english  gpt-3.5      46
3   english  mistral      50
4   english    llama      51
5   spanish   gpt-4o      54
6   spanish    gpt-4      70
7   spanish  gpt-3.5      70
8   spanish  mistral      76
9   spanish    llama      73
10   arabic   gpt-4o      67
11   arabic    gpt-4     113
12   arabic  gpt-3.5     113
13   arabic  mistral     151
14   arabic    llama     152
15    hindi   gpt-4o      59
16    hindi    gpt-4     185
17    hindi  gpt-3.5     185
18    hindi  mistral     186
19    hindi    llama     190


<h1>Tokenization</h1>

- Models operate on numbers, not raw text, so text must be converted to a numerical representation before a model can process it.
- Tokenization breaks text into tokens, and each token is mapped to an integer index (its token ID) that can be fed into the model.
- Each unique token has a fixed index in the tokenizer’s vocabulary.
- These token IDs are passed through the model, which typically consists of an embedding layer followed by transformer blocks.
- The embedding layer converts each token ID into a dense vector that captures semantic meaning.
- The transformer blocks then process these embedding vectors so that each token’s representation reflects its context.
- The last step is decoding, which detokenizes the output token IDs back into human-readable text by mapping each ID to its corresponding token string in the tokenizer’s vocabulary.
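
The encode and decode steps above can be sketched with a toy vocabulary. This is a minimal illustration, assuming a whitespace tokenizer and a hand-written vocabulary; `vocab`, `encode`, and `decode` are hypothetical names and are not part of the adapters used earlier. Real tokenizers learn their vocabulary from data.

```python
# Toy vocabulary (an assumption for illustration): token string -> integer index.
vocab = {"<unk>": 0, "tokens": 1, "are": 2, "numbers": 3}
id_to_token = {idx: tok for tok, idx in vocab.items()}


def encode(text):
    """Split on whitespace and map each token to its index (0 if unknown)."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]


def decode(ids):
    """Map indices back to token strings and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode("Tokens are numbers")
print(ids)                           # [1, 2, 3]
print(decode(ids))                   # tokens are numbers
print(encode("Tokens are magic"))    # [1, 2, 0]  ("magic" is not in the vocabulary)
```

The last line shows why a fixed word-level vocabulary is limiting: any word outside it collapses to the same unknown index, and that information cannot be recovered when decoding.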

<h2>Types of Tokenization</h2>

1. **Word-Based Tokenization**: This is the most straightforward form, where the text is segmented into words based on spaces or punctuation. It is simple, but it breaks down for languages without clear word delimiters (such as Chinese or Japanese), and it needs a very large vocabulary to cover inflections and derivations, with any word outside that vocabulary treated as unknown.

2. **Subword Tokenization**: Popularized by models like BERT and GPT, this approach breaks words into smaller, meaningful units (subwords) using algorithms such as Byte-Pair Encoding (BPE) or WordPiece. It keeps the vocabulary size manageable and handles unknown words and morphological variants by composing them from known subwords.

3. **Byte-Level Tokenization**: As used in models like GPT-2 and GPT-3, this approach starts from the UTF-8 bytes of the text rather than from characters, and applies BPE merges on top of those bytes. Because every string is a sequence of bytes, any language or special character can be encoded without unknown tokens.
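
The three granularities can be compared on small inputs. The sketch below is illustrative only: the regular expression is one simple way to split words, and the BPE helpers (`most_frequent_pair`, `merge_pair`, both hypothetical names) perform merges on characters for readability, whereas a byte-level tokenizer would start from bytes and a production tokenizer would learn its merges from a large corpus.

```python
import re
from collections import Counter

# 1. Word-based: runs of word characters, or single punctuation marks.
print(re.findall(r"\w+|[^\w\s]", "Hello, world!"))  # ['Hello', ',', 'world', '!']

# 3. Byte-level starting point: the UTF-8 bytes of the text.
#    "é" takes two bytes, so 4 characters become 5 byte values.
print(list("café".encode("utf-8")))  # [99, 97, 102, 195, 169]


# 2. Subword (toy BPE): repeatedly merge the most frequent adjacent pair.
def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


symbols = list("low lower lowest")
for _ in range(2):  # two merge steps are enough to form the subword "low"
    symbols = merge_pair(symbols, most_frequent_pair(symbols))
print(symbols)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After two merges the shared stem `low` has become a single symbol, while the rarer suffixes remain split into characters; this is how subword methods cover related word forms with a small vocabulary.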