## Introduction.

In this notebook, you'll learn about tokens - a crucial concept tied directly to Large Language Models (LLMs).



Here's what you will learn:
- What is a token?
- Why do we use tokens?
- Why are LLMs so bad at counting?

### What is a token?

A token is a chunk of text that Large Language Models read or generate.

Here's key information about tokens:
- A token is the smallest unit of text that AI models process.
- Tokens don't have a fixed length. Some are only one character long; others are whole words.
- Tokens can be words, sub-words, punctuation marks, or special symbols.
- As a rule of thumb, for English text, a token corresponds to about 3/4 of a word. So 100 tokens is roughly 75 words.
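
The 3/4 rule of thumb can be turned into a quick estimator. This is only a rough sketch: the helper names `estimate_tokens` and `estimate_words` are made up for illustration, and the ratio is a heuristic for English text, not an exact count.

```python
# Heuristic from the rule of thumb: 1 token is roughly 3/4 of an English word.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for a given number of words (heuristic only)."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Rough word estimate for a given number of tokens (heuristic only)."""
    return round(token_count * WORDS_PER_TOKEN)


print(estimate_words(100))   # 75
print(estimate_tokens(75))   # 100
```

For exact numbers you need the model's real tokenizer, which is what the rest of the notebook uses.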

So let me show you how to count tokens.

### Why tokens (not characters or words)?

#TODO: 
- Try to explain the drawbacks of character and word approach.
- Perplexity: Why do we use tokens in LLMs, not characters or words?
- Explain that tokens are the combination of the best from both worlds.
- Explain tokens are turned into embeddings inside of LLMs, as they have a lookup table in which they get IDs (tokens). The embedding values are trainable and get updated during the training of the model.

## Code

### Load the OpenAI API Key

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

### Counting 'r' in 'strawberry'

Let's use GPT-4o and GPT-4o Mini.

In [33]:
from openai import OpenAI

client = OpenAI()

strawberry_prompt = "How many R's are there in the word 'strawberry?'"

strawberry = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": strawberry_prompt}
    ],
    seed=42
)

strawberry_resp = strawberry.choices[0].message.content
print(strawberry_resp)

There are two R's in the word "strawberry."


In [34]:
strawberry_mini = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": strawberry_prompt}
    ],
    seed=42
)

strawberry_resp_mini = strawberry_mini.choices[0].message.content
print(strawberry_resp_mini)

The word 'strawberry' contains 2 'R's.


Both models answer two, but "strawberry" actually contains three R's.

Let me show you what GPT models see.

We'll use the `tiktoken` library.

In [27]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")
tokens = enc.encode(strawberry_prompt)

In [29]:
print(tokens)

[5299, 1991, 460, 885, 553, 1354, 306, 290, 2195, 461, 302, 1618, 19772, 127222]


These tokens are all that GPT models receive.

Later, the tokens will be converted into embeddings (long vectors).
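
To make the "tokens become embeddings" step more concrete, here is a toy sketch of a lookup table. Everything in it is an assumption for illustration: the 4-dimensional vectors, the random values, and the name `embedding_table` are not taken from any real model. In a real LLM the table covers the whole vocabulary and its values are learned during training.

```python
import random

EMBEDDING_DIM = 4  # toy size; real models use vectors with far more dimensions

# First few token IDs from the tokenized prompt above.
token_ids = [5299, 1991, 460, 885]

# Hypothetical lookup table: one vector per token ID.
# Random values stand in for weights a real model would learn in training.
rng = random.Random(0)
embedding_table = {
    token_id: [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for token_id in sorted(set(token_ids))
}

# "Embedding" the prompt is just a lookup, one vector per token.
embeddings = [embedding_table[token_id] for token_id in token_ids]

for token_id, vector in zip(token_ids, embeddings):
    print(token_id, [round(x, 2) for x in vector])
```

The point is that the model works with these vectors, not with the letters that made up the original text.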

In [36]:
token_berry = enc.encode("berry")
token_strawberry = enc.encode(" strawberry")
token_berry, token_strawberry

([19772], [101830])
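
For contrast, counting letters is trivial when you can see characters. The sketch below counts the R's in plain Python and then per chunk. The split into `"st"`, `"raw"`, `"berry"` is an assumption for illustration; only `"berry"` is confirmed as a single token by the cell above.

```python
word = "strawberry"

# Python sees individual characters, so counting is straightforward.
print(word.count("r"))  # 3

# Assumed token-like split, for illustration only.
pieces = ["st", "raw", "berry"]
assert "".join(pieces) == word

# The letters are spread across chunks; a model only receives one ID per chunk.
r_per_piece = {piece: piece.count("r") for piece in pieces}
print(r_per_piece)                # {'st': 0, 'raw': 1, 'berry': 2}
print(sum(r_per_piece.values()))  # 3
```

A model never gets to run `count` on the characters. It receives a handful of integer IDs and has to rely on whatever it learned about the spelling behind them.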

Important: ChatGPT sometimes gets it right.

Look:

<img src="./images/ChatGPTRnW.png" alt="ChatGPTScreen" width="500" />

But that's not because it learned how to count letters.

More likely, it has "memorized" the correct answer to this specific question.

### Counting words, letters and tokens.

In [24]:
from openai import OpenAI

client = OpenAI()

am_completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write 3 sentences about Amazon Rainforest"}
    ],
    seed=42,
    temperature=0.0
)

am_response = am_completion.choices[0].message.content
print(am_response)

The Amazon Rainforest, often referred to as the "lungs of the Earth," is the largest tropical rainforest in the world, spanning across several countries in South America, including Brazil, Peru, and Colombia. It is home to an incredible diversity of flora and fauna, with millions of species, many of which are not found anywhere else on the planet. However, the rainforest faces significant threats from deforestation, climate change, and industrial activities, which jeopardize its ecological balance and the livelihoods of indigenous communities.


Let's copy:


```
The Amazon Rainforest, often referred to as the "lungs of the Earth," is the largest tropical rainforest in the world, spanning across several countries in South America, including Brazil, Peru, and Colombia. It is home to an incredible diversity of flora and fauna, with millions of species, many of which are not found anywhere else on the planet. However, the rainforest faces significant threats from deforestation, climate change, and industrial activities, which jeopardize its ecological balance and the livelihoods of indigenous communities.
```

Awesome! We've got a short description of the Amazon Rainforest.

Let's count words and characters first. In Python it's quite simple:

In [21]:
words_cnt = len(am_response.split())
characters_cnt = len(am_response)

print(f"The response has {words_cnt} words and {characters_cnt} characters.")

The response has 82 words and 549 characters.


Let's ask GPT-4o Mini the same questions.

In [18]:
word_prompt = "How many words is in the following paragraph: " + am_response

words = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": word_prompt}],
    temperature=0.0
)

word_response = words.choices[0].message.content
print(word_response)

The paragraph contains 81 words.


In [19]:
character_prompt = "How many characters is in the following paragraph: " + am_response

characters = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": character_prompt}],
    temperature=0.0
)

character_response = characters.choices[0].message.content
print(character_response)

The paragraph you provided contains 570 characters, including spaces and punctuation.


In [23]:
token_prompt = "How many tokens is in the following paragraph: " + am_response

tokens = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": token_prompt}],
    temperature=0.0
)

token_response = tokens.choices[0].message.content
print(token_response)

To determine the number of tokens in the provided paragraph, we can break it down into individual words and punctuation marks. In natural language processing, a token typically refers to a word or a punctuation mark.

The paragraph you provided contains 86 tokens.


### Counting tokens.

To count tokens, we'll use the `tiktoken` library.

Here's how:

In [None]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")

tokens = enc.encode(am_response)
print(f"The response has {len(tokens)} tokens.")

Let's break the code down:
- We imported the tiktoken library.
- We defined the encoder using `encoding_for_model("gpt-4o-mini")` to ensure we use the right encoder.
- We "tokenized" the response using `encode(am_response)`.
- We counted the tokens using Python's `len` function.

Great!

Let's take our sample text and run it through the [online tokenizer](https://tiktokenizer.vercel.app/).

Here are the results:

<img src="./images/AmazonDescTokens.png" alt="Amazon Description tokens" width="500px" />

I love that visual representation. The app highlights every single token. It helps us see what they actually look like.

Below, we can see the numerical representation of each token from the description.

Let's check whether the numbers match the tokens from the `tiktoken` library:

In [None]:
print(tokens)

### Why count tokens?

When creating AI applications, it's crucial to manage (and count) tokens for several reasons:
1. **Cost management** - Tokens directly influence the cost of API usage.
2. **Billing accuracy** - Token counting enables accurate usage-based billing for customers.
3. **Performance optimization** - The number of tokens affects latency and how much of the context window a request uses. Monitoring token usage helps optimize prompts.
4. **Customer transparency** - Providing real-time token usage data to customers through dashboards helps them control their spending and avoid unexpected costs.
5. **Product optimization** - Analyzing token usage patterns can provide insights into how customers are using the AI product, informing future improvements and feature development.
6. **Compliance and security** - Monitoring token usage can help detect unusual patterns that might indicate security issues.
7. **Profitability analysis** - By attributing token usage to specific customers or features, companies can see which ones are profitable.
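
As a rough sketch of the cost-management point, token counts can be turned into an estimated price. The prices below are made-up placeholders, not real rates, and `Pricing` and `estimate_cost` are hypothetical helpers; check your provider's pricing page for actual numbers.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Price in USD per one million tokens (placeholder values, not real rates)."""
    input_per_1m: float
    output_per_1m: float


def estimate_cost(prompt_tokens: int, completion_tokens: int, pricing: Pricing) -> float:
    """Estimate the cost of one request from its token counts."""
    total = (
        prompt_tokens * pricing.input_per_1m
        + completion_tokens * pricing.output_per_1m
    )
    return total / 1_000_000


# Hypothetical prices, for illustration only.
example_pricing = Pricing(input_per_1m=1.00, output_per_1m=2.00)

cost = estimate_cost(prompt_tokens=1200, completion_tokens=300, pricing=example_pricing)
print(f"${cost:.6f}")  # $0.001800
```

With the OpenAI client used in this notebook, the token counts for a finished request are available on the response's `usage` field (`prompt_tokens` and `completion_tokens`), so they can be fed straight into a helper like this.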

### Polish Translation

In [None]:
prompt = "Translate to Polish the following: " + am_response

polish = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": prompt}
    ],
    temperature=0.0
)

translation = polish.choices[0].message.content
print(translation)

### German Translation?

### Describe Mountains?

In [4]:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write 3 sentences about Mount Everest"}
    ],
    seed=42
)

response = completion.choices[0].message.content
print(response)

Mount Everest, the highest peak in the world, stands at an elevation of 8,848.86 meters (29,031.7 feet) above sea level. Located in the Himalayas on the border between Nepal and Tibet, it attracts thousands of climbers each year, despite the extreme conditions and risks associated with its ascent. The mountain is also significant culturally, revered in local traditions and considered sacred by the Sherpa people.


Let's copy the response:

```
Mount Everest, the highest peak in the world, stands at an elevation of 8,848.86 meters (29,031.7 feet) above sea level. Located in the Himalayas on the border between Nepal and Tibet, it attracts thousands of climbers each year, despite the extreme conditions and risks associated with its ascent. The mountain is also significant culturally, revered in local traditions and considered sacred by the Sherpa people.
```