## Introduction.

In this notebook, you'll learn about tokens - a crucial concept tied directly to Large Language Models (LLMs).


Here's what you will learn:
- What is a token?
- How do we count tokens?
- Why can't LLMs count letters?
- Why do we use sub-words (not characters or words)?




#ideas:
- should I encode for gpt-4 and decode for gpt-4o (or vice versa)? 

### What is a token?

A token is a chunk of text that Large Language Models read or generate.

Here's key information about tokens:
- Tokens are atomic units that represent our language.
- Tokens allow LLMs to efficiently understand language.
- Tokens are the smallest unit of text that AI models process.
- Tokens don't have a fixed length. Some are only 1 character long, others are entire words.
- Tokens can be: words, sub-words, punctuation marks or special symbols.
- As a rule of thumb, a token corresponds to about 3/4 of a word in English. So 100 tokens is roughly 75 words.
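
The rule of thumb can be turned into a quick back-of-the-envelope estimate. This is only a sketch: the names `estimate_tokens` and `estimate_words` are made up for this example, and the 3/4 ratio is an approximation for English text, not an exact conversion.

```python
# Rule of thumb: 1 token is roughly 3/4 of an English word.
WORDS_PER_TOKEN = 0.75

def estimate_tokens(word_count):
    """Estimate how many tokens a text with `word_count` words will use."""
    return round(word_count / WORDS_PER_TOKEN)

def estimate_words(token_count):
    """Estimate how many words fit into `token_count` tokens."""
    return round(token_count * WORDS_PER_TOKEN)

print(estimate_words(100))  # 75
print(estimate_tokens(75))  # 100
```

For exact numbers you still need a real tokenizer. The estimate is only useful for rough planning.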


## Why sub-words (not characters or words) for tokens?

#TODO: 
- Explain tokens are turned into embeddings inside of LLMs, as they have a lookup table in which they get IDs (tokens). The embedding values are trainable and get updated during the training of the model.

For tokens, we use sub-words. Here's an example:

<img src="./images/Sub-wordExample.png" alt="Sub word token" width="500" />


People often ask:
- "Why not characters?"
- "Why not entire words?"

Let me explain...

### The cons of the character-based approach.
1. **Too little information**. Each character in itself holds very little semantic meaning.
2. **Longer sequences**. The more tokens we pass, the more computational power is required. Character tokens create very long sequences, which makes them extremely inefficient.
3. **Complex training**. It's hard to understand word meanings, context and relationships using only characters.
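
To get a feeling for the "longer sequences" problem, here is a small sketch that splits the same sentence into characters and into words. The sentence is just an example, and splitting on whitespace is a simplification of what a word-level tokenizer would do.

```python
text = "The Amazon Rainforest is the largest tropical rainforest in the world"

char_tokens = list(text)    # one token per character (spaces included)
word_tokens = text.split()  # one token per whitespace-separated word

print(f"Character-level sequence length: {len(char_tokens)}")
print(f"Word-level sequence length: {len(word_tokens)}")
```

The character-level sequence is several times longer, and every extra position is extra work for the model.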


### The cons of the full word-based approach.
1. **Large Token Vocabulary**. When every unique word becomes a token, we end up with an enormous vocabulary. It's inefficient in memory and processing resources.
2. **Out-of-Vocabulary (OOV) Words**. The model can't handle rare or new words that are missing from its vocabulary.
3. **Misspelling and variation problems**. It's impossible to add all variations or misspelled words to the model's vocabulary. Look at the above image. We'd need a separate token for words like: unlearn, unfit, reworked, redesigning, deconstructed. But we can perfectly capture their meaning using 2 tokens.
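
The OOV problem is easy to reproduce with a toy word-level vocabulary. Everything here is made up for illustration: the vocabulary, the `<unk>` placeholder token, and the `encode_words` helper.

```python
# A tiny, made-up word-level vocabulary. Every known word gets its own ID.
word_vocab = {"<unk>": 0, "learn": 1, "fit": 2, "work": 3, "design": 4}

def encode_words(text, vocab):
    """Map each word to its ID; unknown words collapse into <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(encode_words("learn design", word_vocab))      # [1, 4]
print(encode_words("unlearn redesign", word_vocab))  # [0, 0]
```

Both "unlearn" and "redesign" end up as the same unknown token, so their meaning is lost even though the model knows "learn" and "design".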

### How do sub-words combine the best of both worlds?
Sub-words combine the best of both worlds while reducing the limitations of each.
1. **Smaller vocabulary**. Sub-words reduce vocabulary by using smaller and reusable chunks of words.
2. **Handling OOV words.** When a model "sees" a word for the first time it can break it down into smaller chunks. And the model knows the chunks already, so it can capture the meaning of the new word.
3. **Balance between length and information.** Optimization methods find the best sub-word combinations for a language.
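
Here is a minimal sketch of the idea. It is not how real tokenizers work: production tokenizers learn their vocabulary from data (for example with byte-pair encoding), while this toy version uses a hand-picked vocabulary and a greedy longest-match rule. The names `TOY_VOCAB` and `toy_subword_tokenize` are made up for this example.

```python
# Hand-picked sub-word vocabulary (an assumption for this example).
TOY_VOCAB = {"un", "re", "de", "learn", "fit", "work", "ed",
             "design", "ing", "construct"}

def toy_subword_tokenize(word, vocab):
    """Split a word greedily into the longest known pieces.

    Characters that match no piece become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No known piece starts here: fall back to one character.
            tokens.append(word[i])
            i += 1
    return tokens

for word in ["unlearn", "unfit", "reworked", "redesigning", "deconstructed"]:
    print(word, "->", toy_subword_tokenize(word, TOY_VOCAB))
```

Ten reusable chunks are enough to cover all five words, and none of the words had to be in the vocabulary as a whole.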


In summary:
- Character tokens are too small and carry too little information (lack of semantic meaning).
- Word tokens are too large and create huge vocabularies.
- Sub-word tokens combine an efficient vocabulary with semantic meaning.

Here are more examples of where sub-word tokens shine.

<img src="./images/LongWords.png" alt="long words" width="500" />


## Code

### Load the OpenAI API Key

In [2]:
from dotenv import load_dotenv

load_dotenv()

import warnings
warnings.filterwarnings('ignore')

## Counting words, letters and tokens.

In [18]:
from openai import OpenAI

client = OpenAI()

am_completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write 3 short sentences about Amazon Rainforest"}
    ],
    seed=42,
    temperature=0.0
)

am_response = am_completion.choices[0].message.content
print(am_response)

The Amazon Rainforest is the largest tropical rainforest in the world, spanning over 5.5 million square kilometers. It is home to an incredible diversity of wildlife, including thousands of plant species, birds, mammals, and insects. The rainforest plays a crucial role in regulating the Earth's climate and is often referred to as the "lungs of the planet" due to its vast capacity for carbon dioxide absorption.


Let's copy:


```
The Amazon Rainforest is the largest tropical rainforest in the world, spanning over 5.5 million square kilometers. It is home to an incredible diversity of wildlife, including thousands of plant species, birds, mammals, and insects. The rainforest plays a crucial role in regulating the Earth's climate and is often referred to as the "lungs of the planet" due to its vast capacity for carbon dioxide absorption.
```

Awesome! We've got a short description of the Amazon Rainforest.

### Counting words and characters.

Let's count words and characters first. In Python it's quite simple:

In [19]:
words_cnt = len(am_response.split())
characters_cnt = len(am_response)

print(f"The response has {words_cnt} words and {characters_cnt} characters.")

The response has 66 words and 413 characters.


OK, so our Amazon Rainforest description has 66 words and 413 characters.

Note: Your description may be different. So if your results differ, don't freak out :)

### Counting tokens

In [25]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")
tokens = enc.encode(am_response)

print(f"The response has {len(tokens)} tokens.")

The response has 80 tokens.


Let's break the code down:
- We imported the tiktoken library.
- We defined the encoder using `tiktoken.encoding_for_model("gpt-4o-mini")` to ensure we use the same encoding as the model.
- We "tokenized" the response using `enc.encode(am_response)`.
- We counted the tokens using Python's `len` function.

Great!

Let's take our sample text and run it through the [online tokenizer](https://tiktokenizer.vercel.app/).

Here are the results:

<img src="./images/AmazonDescTokens2.png" alt="Amazon Description tokens" width="500px" />

I love that visual representation. The app highlights every single token. It helps us see what they actually look like.

Below, we can see the numerical representation of each token from the description.

Let's see if the numbers match the tokens from the `tiktoken` library:

In [26]:
print(tokens)

[976, 9529, 33159, 76428, 382, 290, 10574, 40068, 164436, 306, 290, 2375, 11, 66335, 1072, 220, 20, 13, 20, 5749, 13749, 63677, 13, 1225, 382, 2237, 316, 448, 19201, 28955, 328, 40214, 11, 3463, 13369, 328, 6804, 15361, 11, 28510, 11, 119032, 11, 326, 65129, 13, 623, 164436, 17473, 261, 19008, 5430, 306, 101955, 290, 146677, 16721, 326, 382, 4783, 22653, 316, 472, 290, 392, 82576, 328, 290, 17921, 1, 5192, 316, 1617, 11332, 12241, 395, 15883, 70513, 57036, 13]


Just to be crystal clear: we are comparing the token IDs from the code with the numbers from the last image. They are identical because we used the same text with the same encoder in both methods.

We'll count words, characters, and tokens later too.

Let's write helper functions for counting and printing the results.

In [42]:
import tiktoken

def count_words_and_chars(text):
    words_cnt = len(text.split())
    characters_cnt = len(text)

    print(f"The text has {words_cnt} words and {characters_cnt} characters.")
    
def count_tokens(text):
    enc = tiktoken.encoding_for_model("gpt-4o-mini")
    tokens = enc.encode(text)

    print(f"The text has {len(tokens)} tokens.")

### Asking GPT-4o Mini to count words, characters, and tokens.

Let's ask GPT-4o Mini to count for us.

As a reminder, the correct answers for our description are:
- 66 words
- 413 characters
- 80 tokens

Good luck, GPT-4o Mini!

**Counting Words**

In [21]:
word_prompt = "How many words is in the following paragraph: " + am_response

words = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": word_prompt}],
    temperature=0.0
)

word_response = words.choices[0].message.content
print(word_response)

The paragraph contains 63 words.


**Counting Characters**

In [22]:
character_prompt = "How many characters is in the following paragraph: " + am_response

characters = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": character_prompt}],
    temperature=0.0
)

character_response = characters.choices[0].message.content
print(character_response)

The paragraph you provided contains 469 characters, including spaces and punctuation.


Let's summarize:
- The word count was close (63 instead of 66).
- The character count was bad (469 instead of 413).


What about tokens?

**Counting tokens**

In [23]:
token_prompt = "How many tokens is in the following paragraph: " + am_response

tokens = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": token_prompt}],
    temperature=0.0
)

token_response = tokens.choices[0].message.content
print(token_response)

To determine the number of tokens in the provided paragraph, we can break it down into individual words and punctuation marks. In natural language processing, a token typically refers to a word or a punctuation mark.

The paragraph you provided contains the following text:

"The Amazon Rainforest is the largest tropical rainforest in the world, spanning over 5.5 million square kilometers. It is home to an incredible diversity of wildlife, including thousands of plant species, birds, mammals, and insects. The rainforest plays a crucial role in regulating the Earth's climate and is often referred to as the 'lungs of the planet' due to its vast capacity for carbon dioxide absorption."

Counting the tokens, we find:

1. Words
2. Punctuation marks (like commas, periods, and quotation marks)

After counting, the total number of tokens in the paragraph is **102**.


Ouch...

102 instead of 80.

But I appreciate the effort! We can see how hard GPT-4o Mini tried to give the correct answer.
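
To put the three answers side by side, here is a small sketch that computes how far off each one was. The numbers are copied from the outputs above. The `results` dictionary and the `relative_error` helper are names made up for this example.

```python
# (true value, value reported by GPT-4o Mini), copied from the outputs above
results = {
    "words": (66, 63),
    "characters": (413, 469),
    "tokens": (80, 102),
}

def relative_error(actual, reported):
    """Signed error of the reported value, in percent of the true value."""
    return (reported - actual) / actual * 100

for name, (actual, reported) in results.items():
    error = relative_error(actual, reported)
    print(f"{name}: actual={actual}, reported={reported}, error={error:+.1f}%")
```

The word count is off by a few percent, while the character and token counts are off by a double-digit percentage.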

### LLMs counting characters and words in simpler examples.

I'll show you something interesting...

I'll ask GPT-4o Mini to count words and characters in 2 sentences:
1. "This is a short sentence"
2. "This is a very short sentence"

Let's see the results.

In [29]:
short_sentence = "This is a short sentence"
counting_prompt = "Count characters and words in the following sentence: " + short_sentence

char_cnt1 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": counting_prompt}],
    temperature=0.0
)

char_response1 = char_cnt1.choices[0].message.content
print(char_response1)

The sentence "This is a short sentence" contains 30 characters (including spaces) and 6 words.


In [30]:
very_short_sentence = "This is a very short sentence"
counting_prompt = "Count characters and words in the following sentence: " + very_short_sentence

char_cnt2 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": counting_prompt}],
    temperature=0.0
)

char_response2 = char_cnt2.choices[0].message.content
print(char_response2)

The sentence "This is a very short sentence" contains 30 characters (including spaces) and 7 words.


Look, I've added an entire word, "very," to the second example. But for GPT-4o Mini, both sentences have the same number of characters.

Weird...

Let's see the responses next to each other.

In [31]:
print(char_response1)
print(char_response2)

The sentence "This is a short sentence" contains 30 characters (including spaces) and 6 words.
The sentence "This is a very short sentence" contains 30 characters (including spaces) and 7 words.


### Counting with GPT-4o.

Let's do some more tests. This time we'll use GPT-4o, which is the more powerful model.

In [32]:
very_short_sentence = "This is a very short sentence"
counting_prompt = "Count characters and words in the following sentence: " + very_short_sentence

char_cnt3 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": counting_prompt}],
    temperature=0.0
)

char_response3 = char_cnt3.choices[0].message.content
print(char_response3)

Sure! Let's break it down:

**Sentence:** "This is a very short sentence"

**Character Count:**
- Total characters (including spaces): 27
- Total characters (excluding spaces): 22

**Word Count:**
- Total words: 6

So, the sentence "This is a very short sentence" has 27 characters (including spaces), 22 characters (excluding spaces), and 6 words.


In [33]:
counting_prompt = "Count characters and words in the following sentence: " + short_sentence

char_cnt4 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": counting_prompt}],
    temperature=0.0
)

char_response4 = char_cnt4.choices[0].message.content
print(char_response4)

Sure! Let's break it down:

**Sentence:** "This is a short sentence"

1. **Character Count:**
   - Total characters including spaces: 24
   - Total characters excluding spaces: 21

2. **Word Count:**
   - Total words: 5

So, the sentence "This is a short sentence" has 24 characters including spaces, 21 characters excluding spaces, and 5 words.


In [34]:
total_characters2 = len(very_short_sentence)
total_words2 = len(very_short_sentence.split())
total_characters_no_spaces2 = len(very_short_sentence.replace(" ", ""))


print(f"'{very_short_sentence}' contains {total_characters2} characters and {total_words2} words.")

'This is a very short sentence' contains 29 characters and 6 words.
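
To check every model answer above, the same counting can be wrapped in a small helper and applied to both sentences. This is just a sketch, and `ground_truth` is a name made up for this example.

```python
def ground_truth(sentence):
    """Count characters (with and without spaces) and words in plain Python."""
    return {
        "characters": len(sentence),
        "characters_no_spaces": len(sentence.replace(" ", "")),
        "words": len(sentence.split()),
    }

for sentence in ["This is a short sentence", "This is a very short sentence"]:
    print(f"'{sentence}' -> {ground_truth(sentence)}")
```

The correct values are 24 characters and 5 words for the short sentence, and 29 characters and 6 words for the longer one. GPT-4o got the word counts right, but most of its character counts were still off.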


### Why count tokens?

When creating AI applications, it's crucial to manage (and count) tokens for several reasons:
1. **Cost management** - Tokens directly influence the cost of API usage.
2. **Billing accuracy** - Token counting enables accurate usage-based billing for customers.
3. **Performance optimization** - The number of tokens affects model performance. Monitoring token usage helps optimize prompts.
4. **Customer transparency** - Providing real-time token usage data to customers through dashboards helps them control their spending and avoid unexpected costs.
5. **Product optimization** - Analyzing token usage patterns can provide insights into how customers are using the AI product, informing future improvements and feature development.
6. **Compliance and security** - Monitoring token usage can help detect unusual patterns that might indicate security issues.
7. **Profitability analysis** - By attributing token usage to specific customers or features, companies can ensure profitability.
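
As an illustration of the cost management point, here is a minimal sketch of a cost estimate based on token counts. The prices are placeholders, not real prices for any model, and `estimate_cost` is a hypothetical helper. Check your provider's pricing page for actual numbers.

```python
# Placeholder prices in USD per 1 million tokens. These are NOT real prices;
# look up your provider's current pricing before relying on this.
PRICE_PER_1M_INPUT_TOKENS = 1.00
PRICE_PER_1M_OUTPUT_TOKENS = 2.00

def estimate_cost(input_tokens, output_tokens):
    """Estimate the cost in USD for the given numbers of tokens."""
    input_cost = input_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
    output_cost = output_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS
    return input_cost + output_cost

# Example: 10,000 requests, each with a 500-token prompt and an 80-token answer
requests = 10_000
cost = estimate_cost(500 * requests, 80 * requests)
print(f"Estimated cost: ${cost:.2f}")
```

Input and output tokens are kept separate because providers commonly price them differently.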

## Tokens for non-English languages.

I want to show you how the tokens work for other languages. I'll play with Polish and German translations, because I speak both languages.

### Polish Translation

Let's start by asking GPT-4o Mini to translate the Amazon Rainforest description into Polish.

In [36]:
pl_prompt = "Translate to Polish the following: " + am_response

polish = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": pl_prompt}
    ],
    temperature=0.0
)

pl_translation = polish.choices[0].message.content
print(pl_translation)

Amazonia jest największym tropikalnym lasem deszczowym na świecie, zajmującym ponad 5,5 miliona kilometrów kwadratowych. Jest domem dla niesamowitej różnorodności dzikiej przyrody, w tym tysięcy gatunków roślin, ptaków, ssaków i owadów. Las deszczowy odgrywa kluczową rolę w regulacji klimatu Ziemi i często nazywany jest "płucami planety" ze względu na swoją ogromną zdolność do absorpcji dwutlenku węgla.


Cool, we've got the Polish translation. Let's count words and characters:

In [39]:
count_words_and_chars(pl_translation)

The text has 55 words and 406 characters.


As a reminder, here are the numbers for the original description.

In [41]:
count_words_and_chars(am_response)

The text has 66 words and 413 characters.


Awesome!

So the Polish translation is slightly shorter in words and characters.

What about tokens?

Let me show you the tokens from tiktokenizer again.

<img src="./images/AmazonDescPL.png" alt="Amazon Description tokens" width="500px" />

The Polish translation has 132 tokens, the English has 80 tokens.

What a huge difference!

### Comparing token "efficiency"

At the beginning I said: "As a rule of thumb, a token corresponds to 3/4 of the word. So 100 tokens is roughly 75 words."

Now, let's do some math to test the statement.

- For English: 66 (words) / 80 (tokens) = 0.825 -> roughly 82.5 words per 100 tokens.
- For Polish: 55 (words) / 132 (tokens) ≈ 0.42 -> 42 words per 100 tokens.
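
The same math in code, using the word and token counts from this notebook. `words_per_100_tokens` is a name made up for this example.

```python
def words_per_100_tokens(words, tokens):
    """How many words 100 tokens 'buy' on average for a given text."""
    return words / tokens * 100

english = words_per_100_tokens(66, 80)
polish = words_per_100_tokens(55, 132)

print(f"English: {english:.1f} words per 100 tokens")
print(f"Polish:  {polish:.1f} words per 100 tokens")
print(f"Polish needs {132 / 80:.2f}x as many tokens for the same content")
```

So for this text, the 3/4 rule of thumb holds reasonably well for English, but it is far too optimistic for Polish.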



### Counting 'r' in 'strawberry'

Let's use GPT-4o and GPT-4o Mini.

In [3]:
from openai import OpenAI

client = OpenAI()

strawberry_prompt = "How many R's are there in the word 'strawberry?'"

strawberry = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": strawberry_prompt}
    ],
    seed=42
)

strawberry_resp = strawberry.choices[0].message.content
print(strawberry_resp)

The word 'strawberry' contains 2 R's.


In [27]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")
tokens = enc.encode(strawberry_prompt)

And these tokens are all that GPT models receive.

Later, the tokens will be converted into embeddings (long vectors).

In [36]:
print(tokens)

token_berry = enc.encode("berry")
token_strawberry = enc.encode(" strawberry")
token_berry, token_strawberry

([19772], [101830])

In [1]:
strawberry_mini = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": strawberry_prompt}
    ],
    seed=42
)

strawberry_resp_mini = strawberry_mini.choices[0].message.content
print(strawberry_resp_mini)

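NameError: name 'client' is not defined

Note: this cell was executed before `client` was created. Run the cell that calls `OpenAI()` first, then run this one again.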

Important: ChatGPT sometimes gets it right.

Look:

<img src="./images/ChatGPTRnW.png" alt="ChatGPTScreen" width="500" />

But it's not because it learned how to count letters. 

It's because it's "memorizing" the correct answer for this specific question.
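
For comparison, regular code works on characters, so the same question is trivial in Python. The correct answer is 3, not 2.

```python
word = "strawberry"

# Python sees individual characters, so counting a letter is a one-liner.
print(word.count("r"))  # 3

# Positions (0-based) of every 'r' in the word
positions = [i for i, ch in enumerate(word) if ch == "r"]
print(positions)  # [2, 7, 8]
```

The model never sees these characters. It only sees token IDs, which is why letter counting is so unreliable.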

## Token Embeddings

LLMs receive a list of tokens.

But token numbers are just IDs. There's no meaning in IDs.

To get the meaning, we need to turn tokens into vector embeddings.

This happens inside the Large Language Model.

I will not describe it in detail, but here's the crucial information:
- LLMs have a so-called lookup table that maps token IDs to token embeddings.
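
To make the lookup table idea concrete, here is a toy version in plain Python. It is only a sketch: the sizes are tiny, the values are random instead of learned, and the names `embedding_table` and `embed` are made up for this example.

```python
import random

VOCAB_SIZE = 10    # toy value; real vocabularies are far larger
EMBEDDING_DIM = 4  # toy value; real embeddings have hundreds of dimensions or more

rng = random.Random(0)  # fixed seed so the table is reproducible

# The lookup table: one row (vector) per token ID.
# In a real model these values are learned during training; here they are random.
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Replace every token ID with its row from the lookup table."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([3, 7, 3])
print(len(vectors), "vectors of dimension", len(vectors[0]))
print(vectors[0] == vectors[2])  # True: the same ID always maps to the same vector
```

The ID itself carries no meaning. It is just the row number used to fetch the vector.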

TODO:
- Add More crucial parts

Now, we don't have access to any parameters inside of GPT-4 models, so I can't show you the token embeddings for them.

Luckily, we can look inside open-source models. So let me use the BERT model to show you the embeddings.

### BERT Tokenizer

In [10]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained model tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pre-trained model
model = BertModel.from_pretrained('bert-base-uncased')

# Text to be tokenized
text = "My name is Kris."

# Encode text
input_ids = tokenizer.encode(text, add_special_tokens=False)

# Output the token IDs
print("Token IDs:", input_ids)

# Convert token IDs back to raw tokens and output them
raw_tokens = [tokenizer.decode([token_id]) for token_id in input_ids]
print("Raw tokens:", raw_tokens)

# Convert list of IDs to a tensor
input_ids_tensor = torch.tensor([input_ids])

# Pass the input through the model
with torch.no_grad():
    outputs = model(input_ids_tensor)

# Extract the contextual embeddings: the last hidden layer, one vector per token
embeddings = outputs.last_hidden_state

# Print the embeddings
print("Embeddings: ", embeddings)

Token IDs: [2026, 2171, 2003, 19031, 1012]
Raw tokens: ['my', 'name', 'is', 'kris', '.']
Embeddings:  tensor([[[-0.0736, -0.3184, -0.0209,  ..., -0.1902,  0.6918,  0.4574],
         [-0.6156, -0.2890,  0.1733,  ..., -0.2697,  0.7416,  0.2899],
         [-1.0302, -0.0216,  0.3159,  ..., -0.1907,  0.7589,  0.9509],
         [-0.2899, -0.7108,  0.3310,  ..., -0.2164,  0.8269,  0.2375],
         [-0.2973, -0.9342,  0.3401,  ...,  0.1431,  0.9154,  0.1736]]])


In [11]:
embeddings.shape  # (batch size, number of tokens, hidden size)

torch.Size([1, 5, 768])

In [3]:
from transformers import pipeline
print(pipeline('sentiment-analysis')('we love you'))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998704195022583}]


In [5]:
print(pipeline('sentiment-analysis')("What a boring movie!"))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9997424483299255}]


### German Translation?

### Describe Mountains?

In [4]:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write 3 sentences about Mount Everest"}
    ],
    seed=42
)

response = completion.choices[0].message.content
print(response)

Mount Everest, the highest peak in the world, stands at an elevation of 8,848.86 meters (29,031.7 feet) above sea level. Located in the Himalayas on the border between Nepal and Tibet, it attracts thousands of climbers each year, despite the extreme conditions and risks associated with its ascent. The mountain is also significant culturally, revered in local traditions and considered sacred by the Sherpa people.


Let's copy the response:

```
Mount Everest, the highest peak in the world, stands at an elevation of 8,848.86 meters (29,031.7 feet) above sea level. Located in the Himalayas on the border between Nepal and Tibet, it attracts thousands of climbers each year, including seasoned mountaineers and ambitious adventurers. Despite its allure, climbing Everest presents significant challenges, including extreme weather, high altitudes, and the risk of avalanches.
```

### Conclusions.


**Further reading**
- [Embedding tokens vs embedding string](https://community.openai.com/t/embedding-tokens-vs-embedding-strings/463213/5)
- [BERT token vs embedding](https://stackoverflow.com/questions/77189885/bert-token-vs-embedding)
- [Explained: Tokens and Embeddings in LLMs](https://medium.com/the-research-nest/explained-tokens-and-embeddings-in-llms-69a16ba5db33)