## Introduction

In this notebook, you'll learn about tokens - a crucial concept tied directly to Large Language Models (LLMs).



Here's what you will learn:
- What is a token?
- Why do we use tokens?

### What is a token?

A token is a chunk of text that Large Language Models read or generate.

Here's key information about tokens:
- A token is the smallest unit of text that AI models process.
- Tokens don't have a fixed length. Some are only 1 character long, others are whole words.
- Tokens can be: words, sub-words, punctuation marks or special symbols.
- As a rule of thumb for English text, one token corresponds to about 3/4 of a word, so 100 tokens is roughly 75 words.
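
To make the last rule of thumb concrete, here is a minimal sketch of the arithmetic. The constant `WORDS_PER_TOKEN` and the helpers `estimate_words` and `estimate_tokens` are illustrative names, not part of any library, and the 3/4 ratio is only an approximation that fits English text best.

```python
# Rule-of-thumb ratio from the list above (an approximation, not an exact value).
WORDS_PER_TOKEN = 0.75


def estimate_words(num_tokens: int) -> float:
    """Estimate how many words a given number of tokens represents."""
    return num_tokens * WORDS_PER_TOKEN


def estimate_tokens(num_words: int) -> float:
    """Estimate how many tokens a given number of words will take."""
    return num_words / WORDS_PER_TOKEN


print(estimate_words(100))  # 75.0
print(estimate_tokens(75))  # 100.0
```

Estimates like these are handy for rough budgeting. For exact numbers, the text has to be run through the model's tokenizer.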

So let me show you how to count tokens.

### Why tokens (not characters or words)?

#TODO: 
- Try to explain the drawbacks of character and word approach.
- Perplexity: Why do we use tokens in LLMs, not characters or words?
- Explain that tokens are the combination of the best from both worlds.
- Explain tokens are turned into embeddings inside of LLMs, as they have a lookup table in which they get IDs (tokens). The embedding values are trainable and get updated during the training of the model.
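
The lookup-table idea from the last note can be sketched with a toy example. Everything here is hypothetical: the five-entry vocabulary, the way the text is split into pieces, the embedding size, and the random values. In a real model the table has one row per token in the vocabulary, and the values are learned during training.

```python
import random

# Hypothetical toy vocabulary: each token string maps to an integer ID.
vocab = {"Mount": 0, " Ever": 1, "est": 2, " is": 3, " tall": 4}

EMBEDDING_DIM = 4  # illustrative size; real models use far larger vectors
random.seed(0)

# Lookup table with one row (vector) per token ID.
# Here the rows are random; in a model they are trainable parameters.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(len(vocab))
]


def embed(tokens):
    """Map token strings to IDs, then look up one vector per ID."""
    ids = [vocab[token] for token in tokens]
    vectors = [embedding_table[token_id] for token_id in ids]
    return ids, vectors


token_ids, vectors = embed(["Mount", " Ever", "est"])
print(token_ids)                      # [0, 1, 2]
print(len(vectors), len(vectors[0]))  # 3 4
```

The model never sees the text itself, only the IDs and the vectors stored under them.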

## Code

### Load the OpenAI API Key

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [7]:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write 3 sentences about Mount Everest"}
    ],
    seed=42
)

response = completion.choices[0].message.content
print(response)

Mount Everest, the highest peak in the world, stands at an elevation of 8,848.86 meters (29,031.7 feet) above sea level. Located in the Himalayas on the border between Nepal and Tibet, it attracts thousands of climbers each year, including seasoned mountaineers and ambitious adventurers. Despite its allure, climbing Everest presents significant challenges, including extreme weather, high altitudes, and the risk of avalanches.


Let's copy the response:

```
Mount Everest, the highest peak in the world, stands at an elevation of 8,848.86 meters (29,031.7 feet) above sea level. Located in the Himalayas on the border between Nepal and Tibet, it attracts thousands of climbers each year, including seasoned mountaineers and ambitious adventurers. Despite its allure, climbing Everest presents significant challenges, including extreme weather, high altitudes, and the risk of avalanches.
```

In [11]:
from openai import OpenAI

client = OpenAI()

am_completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write 3 sentences about Amazon Rainforest"}
    ],
    seed=42
)

am_response = am_completion.choices[0].message.content
print(am_response)

The Amazon Rainforest, often referred to as the "lungs of the Earth," spans across several countries in South America, including Brazil, Peru, and Colombia, and plays a crucial role in regulating the global climate. This vast ecosystem is home to approximately 10% of the known species on Earth, showcasing an unparalleled diversity of flora and fauna, many of which are found nowhere else. However, the rainforest faces significant threats from deforestation, illegal logging, and climate change, which jeopardize its invaluable biodiversity and the rights of the indigenous communities that depend on it.


Let's copy the response:


```
The Amazon Rainforest, often referred to as the "lungs of the Earth," spans across several countries in South America, including Brazil, Peru, and Colombia, and plays a crucial role in regulating the global climate. This vast ecosystem is home to approximately 10% of the known species on Earth, showcasing an unparalleled diversity of flora and fauna, many of which are found nowhere else. However, the rainforest faces significant threats from deforestation, illegal logging, and climate change, which jeopardize its invaluable biodiversity and the rights of the indigenous communities that depend on it.
```

Awesome! For the counts below, we'll use a short description of my country, Poland, stored in `pl_response`.

Let's count words and characters first. In Python it's quite simple:

In [None]:
words_pl = len(pl_response.split())
characters_pl = len(pl_response)

print(f"The response has {words_pl} words and {characters_pl} characters.")

The response has 76 words and 493 characters.


### Counting tokens

To count tokens, we'll use the `tiktoken` library.

Here's how:

In [None]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")

tokens = enc.encode(pl_response)
print(f"The response has {len(tokens)} tokens.")

The response has 93 tokens.
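
With the three counts in hand, we can check how this sample compares with the 3/4 rule of thumb. This is a small sketch that hard-codes the numbers printed above. The variable names are illustrative.

```python
# Counts reported by the cells above.
words = 76
characters = 493
tokens = 93

words_per_token = words / tokens
characters_per_token = characters / tokens

# What the 3/4 rule of thumb would have predicted from the word count.
estimated_tokens = words / 0.75

print(f"{words_per_token:.2f} words per token")            # 0.82 words per token
print(f"{characters_per_token:.2f} characters per token")  # 5.30 characters per token
print(f"Rule-of-thumb estimate: {estimated_tokens:.0f} tokens")  # 101 tokens
```

For this sample the rule of thumb slightly overestimates the token count (about 101 estimated vs. 93 measured), which is the kind of gap to expect from an approximation.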


Let's break the code down:
- We imported the tiktoken library.
- We created the encoder with `encoding_for_model("gpt-4o-mini")`, which selects the encoding that matches the model.
- We "tokenized" the response using `encode(pl_response)`.
- We counted the tokens using Python's `len` function.

Great!

Let's take our sample text and run it through the [online tokenizer](https://tiktokenizer.vercel.app/).

Here are the results:

<img src="./images/SamplePolishDesc.png" alt="Poland Description tokens" width="500px" />

I love that visual representation. The app highlights every single token, which helps us see what they actually look like.

Below, we can see the numerical representation of each token from the description.

Let's check whether these numbers match the tokens from the `tiktoken` library:
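
One way to do the comparison is sketched below. The helpers `token_table` and `print_token_table` are hypothetical names introduced for this example. They rely only on the `encode` and `decode` methods of a `tiktoken` encoding and on `tiktoken.encoding_for_model`, as used earlier in the notebook.

```python
def token_table(enc, text):
    """Pair every token ID with the piece of text it decodes to.

    `enc` can be any object exposing `encode(str) -> list[int]` and
    `decode(list[int]) -> str`, such as a tiktoken encoding.
    """
    return [(token_id, enc.decode([token_id])) for token_id in enc.encode(text)]


def print_token_table(text, model="gpt-4o-mini"):
    """Print the IDs tiktoken assigns to `text`, one token per line."""
    import tiktoken  # imported here so `token_table` stays usable without it

    enc = tiktoken.encoding_for_model(model)
    for token_id, piece in token_table(enc, text):
        # repr() makes leading spaces and newlines inside a token visible.
        print(f"{token_id:>8}  {piece!r}")
```

Running `print_token_table(pl_response)` in the next cell prints one ID per line, which can be compared by eye with the numbers shown in the online tokenizer.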

### Polish Translation

In [14]:
prompt = "Translate to Polish the following: " + am_response

polish = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": prompt}
    ],
    temperature=0.0
)

translation = polish.choices[0].message.content
print(translation)

Amazonia, często nazywana "płucami Ziemi", rozciąga się na kilku krajach Ameryki Południowej, w tym w Brazylii, Peru i Kolumbii, i odgrywa kluczową rolę w regulacji globalnego klimatu. Ten ogromny ekosystem jest domem dla około 10% znanych gatunków na Ziemi, prezentując niezrównaną różnorodność flory i fauny, z których wiele nie występuje nigdzie indziej. Jednak las deszczowy stoi w obliczu poważnych zagrożeń związanych z wylesianiem, nielegalnym wyrębem drzew i zmianami klimatycznymi, które zagrażają jego bezcennej bioróżnorodności oraz prawom społeczności rdzennych, które od niego zależą.
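
Before counting tokens for the translation, it can help to compare the plain-text statistics of both languages. The sketch below is an assumption about where this section is heading, not the notebook's own code: `text_stats` is a hypothetical helper, and the two strings are just the opening phrases of the English response and its Polish translation.

```python
def text_stats(text: str) -> dict:
    """Count whitespace-separated words, characters, and UTF-8 bytes."""
    return {
        "words": len(text.split()),
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Opening phrases copied from the English response and the Polish translation.
english = 'The Amazon Rainforest, often referred to as the "lungs of the Earth,"'
polish = 'Amazonia, często nazywana "płucami Ziemi",'

en_stats = text_stats(english)
pl_stats = text_stats(polish)

print("English:", en_stats)
print("Polish: ", pl_stats)
```

Polish letters such as `ę` and `ł` take two bytes each in UTF-8, so the byte count of the Polish phrase is higher than its character count, while the two counts are equal for the ASCII-only English phrase.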


### German Translation?