## Introduction

In this notebook, you'll learn about tokens - a crucial concept tied directly to Large Language Models (LLMs).



Here's what you will learn:
- What is a token?
- Why do we use tokens?

### What is a token?

A token is a chunk of text that Large Language Models read or generate.

Here's key information about tokens:
- A token is the smallest unit of text that AI models process.
- Tokens don't have a fixed length. Some are only 1 character long, others are whole words.
- Tokens can be: words, sub-words, punctuation marks or special symbols.
- As a rule of thumb, a token corresponds to about 3/4 of a word in English text. So 100 tokens is roughly 75 words.
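
The rule of thumb above can be turned into a quick estimator. This is only a rough sketch: the helper names `estimate_tokens` and `estimate_words` are made up for illustration, and the 3/4 ratio is an approximation, so real counts from a tokenizer will differ.

```python
# Rule of thumb from the text: 1 token is roughly 3/4 of a word.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Roughly estimate how many tokens a given number of words takes."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Roughly estimate how many words fit in a given number of tokens."""
    return round(token_count * WORDS_PER_TOKEN)


print(estimate_words(100))   # 75
print(estimate_tokens(75))   # 100
```

Treat the result as a ballpark figure for budgeting, not as an exact count.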

So let me show you how to count tokens.

### Why tokens (not characters or words)?

#TODO: 
- Try to explain the drawbacks of the character-level and word-level approaches.
- Perplexity: Why do we use tokens in LLMs, not characters or words?
- Explain that tokens are the combination of the best from both worlds.
- Explain that tokens are turned into embeddings inside LLMs: each token has an ID, and the model uses that ID to look up a vector in an embedding table. The embedding values are trainable and get updated during the training of the model.
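
To make the lookup-table idea from the last point concrete, here is a toy sketch in plain Python. Everything in it is hypothetical: the four-entry vocabulary, the embedding size, and the random values stand in for what a real model learns during training.

```python
import random

# Hypothetical toy vocabulary: each token string maps to an integer ID.
vocab = {"straw": 0, "berry": 1, " is": 2, " red": 3}

EMBEDDING_DIM = 4  # real models use far larger vectors
random.seed(0)

# One row of numbers per token ID. Here the values are random;
# in a real model they are trainable and updated during training.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
    for _ in vocab
]


def embed(tokens):
    """Map token strings to IDs, then look up one vector per ID."""
    ids = [vocab[token] for token in tokens]
    vectors = [embedding_table[token_id] for token_id in ids]
    return ids, vectors


ids, vectors = embed(["straw", "berry", " is", " red"])
print(ids)                            # [0, 1, 2, 3]
print(len(vectors), len(vectors[0]))  # 4 4
```

The model never sees the text itself, only the IDs and the vectors they point to.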

## Code

### Load the OpenAI API Key

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

### Counting 'r' in 'strawberry'

Let's use GPT-4o and GPT-4o Mini.

In [23]:
from openai import OpenAI

client = OpenAI()

strawberry = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "How many 'r' in 'strawberry'?"}
    ],
    seed=42
)

strawberry_resp = strawberry.choices[0].message.content
print(strawberry_resp)

The word "strawberry" contains two 'r' letters.
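
For comparison, here is a character-level count in plain Python. It is a minimal check and not part of the API call above. It shows the answer you get when the text is processed one character at a time, which is not how the model reads it.

```python
word = "strawberry"

# str.count works on characters, so it sees every single letter.
r_count = word.count("r")

print(f"'{word}' contains {r_count} 'r' characters.")
```

Python reports 3, while the model answered two. A likely reason is that the model works with tokens rather than individual letters.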


### Describe Mountains

In [7]:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write 3 sentences about Mount Everest"}
    ],
    seed=42
)

response = completion.choices[0].message.content
print(response)

Mount Everest, the highest peak in the world, stands at an elevation of 8,848.86 meters (29,031.7 feet) above sea level. Located in the Himalayas on the border between Nepal and Tibet, it attracts thousands of climbers each year, including seasoned mountaineers and ambitious adventurers. Despite its allure, climbing Everest presents significant challenges, including extreme weather, high altitudes, and the risk of avalanches.


Let's copy the response:

```
Mount Everest, the highest peak in the world, stands at an elevation of 8,848.86 meters (29,031.7 feet) above sea level. Located in the Himalayas on the border between Nepal and Tibet, it attracts thousands of climbers each year, including seasoned mountaineers and ambitious adventurers. Despite its allure, climbing Everest presents significant challenges, including extreme weather, high altitudes, and the risk of avalanches.
```

In [11]:
from openai import OpenAI

client = OpenAI()

am_completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write 3 sentences about Amazon Rainforest"}
    ],
    seed=42
)

am_response = am_completion.choices[0].message.content
print(am_response)

The Amazon Rainforest, often referred to as the "lungs of the Earth," spans across several countries in South America, including Brazil, Peru, and Colombia, and plays a crucial role in regulating the global climate. This vast ecosystem is home to approximately 10% of the known species on Earth, showcasing an unparalleled diversity of flora and fauna, many of which are found nowhere else. However, the rainforest faces significant threats from deforestation, illegal logging, and climate change, which jeopardize its invaluable biodiversity and the rights of the indigenous communities that depend on it.


Let's copy:


```
The Amazon Rainforest, often referred to as the "lungs of the Earth," spans across several countries in South America, including Brazil, Peru, and Colombia, and plays a crucial role in regulating the global climate. This vast ecosystem is home to approximately 10% of the known species on Earth, showcasing an unparalleled diversity of flora and fauna, many of which are found nowhere else. However, the rainforest faces significant threats from deforestation, illegal logging, and climate change, which jeopardize its invaluable biodiversity and the rights of the indigenous communities that depend on it.
```

Awesome! We've got a short description of the Amazon Rainforest.

Let's count words and characters first. In Python it's quite simple:

In [16]:
words = len(am_response.split())
characters = len(am_response)

print(f"The response has {words} words and {characters} characters.")

The response has 92 words and 606 characters.


### Counting tokens

To count tokens, we'll use the `tiktoken` library.

Here's how:

In [17]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")

tokens = enc.encode(am_response)
print(f"The response has {len(tokens)} tokens.")

The response has 113 tokens.


Let's break the code down:
- We imported the tiktoken library.
- We defined the encoder using `encoding_for_model("gpt-4o-mini")` to ensure we use the right encoder.
- We "tokenized" the response using `encode(am_response)`.
- We counted the tokens using Python's `len` function.

Great!

Let's take our sample text and run it through the [online tokenizer](https://tiktokenizer.vercel.app/).

Here are the results:

<img src="./images/AmazonDescTokens.png" alt="Amazon Description tokens" width="500px" />

I love that visual representation. The app highlights every single token, which helps us see what they actually look like.

Below, we can see the numerical representation of each token from the description.

Let's check whether the numbers match the tokens from the `tiktoken` library:

In [18]:
print(tokens)

[976, 9529, 33159, 76428, 11, 4783, 22653, 316, 472, 290, 392, 82576, 328, 290, 16464, 3532, 78545, 5251, 4919, 8981, 306, 6800, 8108, 11, 3463, 24868, 11, 61802, 11, 326, 41071, 11, 326, 17473, 261, 19008, 5430, 306, 101955, 290, 5466, 16721, 13, 1328, 11332, 38423, 382, 2237, 316, 16679, 220, 702, 4, 328, 290, 5542, 15361, 402, 16464, 11, 86573, 448, 88238, 28955, 328, 90258, 326, 98806, 11, 1991, 328, 1118, 553, 2491, 51180, 1203, 13, 5551, 11, 290, 164436, 22060, 6933, 35649, 591, 1056, 192522, 11, 23802, 17553, 11, 326, 16721, 3343, 11, 1118, 108527, 750, 1617, 69505, 106248, 326, 290, 6393, 328, 290, 68509, 15061, 484, 9630, 402, 480, 13]


### Why count tokens?

When creating AI applications, it's crucial to manage (and count) tokens for several reasons:
1. **Cost management** - Tokens directly influence the cost of API usage.
2. **Billing accuracy** - Token counting enables accurate usage-based billing for customers.
3. **Performance optimization** - The number of tokens affects response latency, and every request has to fit within the model's context window. Monitoring token usage helps optimize prompts.
4. **Customer transparency** - Providing real-time token usage data to customers through dashboards helps them control their spending and avoid unexpected costs.
5. **Product optimization** - Analyzing token usage patterns can provide insights into how customers are using the AI product, informing future improvements and feature development.
6. **Compliance and security** - Monitoring token usage can help detect unusual patterns that might indicate security issues.
7. **Profitability analysis** - By attributing token usage to specific customers or features, companies can measure and protect their profitability.
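
As a rough illustration of the cost-management point, the sketch below turns token counts into an estimated price. The helper `estimate_cost`, both prices, and the 15 input tokens are hypothetical placeholders. Only the 113 output tokens come from the count above. Check your provider's pricing page for real values.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1m, output_price_per_1m):
    """Return the estimated cost (in USD) of a single request."""
    input_cost = input_tokens / 1_000_000 * input_price_per_1m
    output_cost = output_tokens / 1_000_000 * output_price_per_1m
    return input_cost + output_cost


# Hypothetical prices in USD per 1 million tokens (NOT real pricing).
INPUT_PRICE = 1.00
OUTPUT_PRICE = 2.00

cost = estimate_cost(
    input_tokens=15,     # hypothetical prompt size
    output_tokens=113,   # token count of the response measured earlier
    input_price_per_1m=INPUT_PRICE,
    output_price_per_1m=OUTPUT_PRICE,
)
print(f"Estimated cost: ${cost:.6f}")
```

Input and output tokens are priced separately here because providers commonly charge different rates for each. Multiplying the per-request figure by the expected traffic gives a first budget estimate.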

### Polish Translation

In [14]:
prompt = "Translate to Polish the following: " + am_response

polish = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": prompt}
    ],
    temperature=0.0
)

translation = polish.choices[0].message.content
print(translation)

Amazonia, często nazywana "płucami Ziemi", rozciąga się na kilku krajach Ameryki Południowej, w tym w Brazylii, Peru i Kolumbii, i odgrywa kluczową rolę w regulacji globalnego klimatu. Ten ogromny ekosystem jest domem dla około 10% znanych gatunków na Ziemi, prezentując niezrównaną różnorodność flory i fauny, z których wiele nie występuje nigdzie indziej. Jednak las deszczowy stoi w obliczu poważnych zagrożeń związanych z wylesianiem, nielegalnym wyrębem drzew i zmianami klimatycznymi, które zagrażają jego bezcennej bioróżnorodności oraz prawom społeczności rdzennych, które od niego zależą.


### German Translation?