<h1>Chapter 2 - Tokens and Token Embeddings</h1>
<i>Exploring tokens and embeddings as an integral part of building LLMs</i>


<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb)

---

This notebook is for Chapter 2 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter (uncomment it first if it is commented out):

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

In [2]:
%%capture
!pip install --upgrade transformers==4.41.2 sentence-transformers==3.0.1 gensim==4.3.2 scikit-learn==1.5.0 accelerate==0.31.0 peft==0.11.1 scipy==1.10.1 numpy==1.26.4

# Downloading and Running An LLM

The first step is to load our model onto the GPU for faster inference. Note that we load the model and tokenizer separately and keep them in separate variables so that we can explore each on its own.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")


`torch_dtype` is deprecated! Use `dtype` instead!



In [4]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=20
)

# Print the output
print(tokenizer.decode(generation_output[0]))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: Heartfelt Apologies for the Gardening Mishap


Dear


In [5]:
print(input_ids)

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001]], device='cuda:0')


In [6]:
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.
<|assistant|>


In [7]:
generation_output

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001,  3323,   622, 29901, 17778, 29888,  2152,
          6225, 11763,   363,   278, 19906,   292,   341,   728,   481,    13,
            13,    13, 29928,   799]], device='cuda:0')
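Because `generate` returns the prompt tokens followed by the newly generated ones, the continuation can be isolated by slicing off the prompt length. A minimal sketch using plain Python lists copied from the two tensors printed above (the variable names `prompt_ids`, `output_ids`, and `new_ids` are illustrative, not from the book):

```python
# Token IDs copied from the printed `input_ids` and `generation_output` tensors.
prompt_ids = [14350, 385, 4876, 27746, 5281, 304, 19235, 363, 278, 25305,
              293, 16423, 292, 286, 728, 481, 29889, 12027, 7420, 920,
              372, 9559, 29889, 32001]
output_ids = prompt_ids + [3323, 622, 29901, 17778, 29888, 2152, 6225, 11763,
                           363, 278, 19906, 292, 341, 728, 481, 13, 13, 13,
                           29928, 799]

# The output begins with the prompt, so everything after it is new.
new_ids = output_ids[len(prompt_ids):]

print(len(prompt_ids), len(output_ids), len(new_ids))  # 24 44 20
print(new_ids[:3])  # [3323, 622, 29901]
```

With the actual tensors, the equivalent slice would be `generation_output[0][input_ids.shape[1]:]`. The count of 20 new tokens matches `max_new_tokens=20`.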

In [8]:
print(tokenizer.decode(3323))
print(tokenizer.decode(622))
print(tokenizer.decode([3323, 622]))
print(tokenizer.decode(29901))

Sub
ject
Subject
:
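The output above suggests that decoding behaves like a lookup followed by concatenation: each ID maps to a piece of text, and pieces decoded together are joined. A toy sketch of that idea, using only the three ID-to-text pairs printed above (`toy_vocab` and `toy_decode` are hypothetical names; a real tokenizer also handles spaces, bytes, and special tokens):

```python
# ID-to-text pairs taken from the printed output above.
toy_vocab = {3323: "Sub", 622: "ject", 29901: ":"}

def toy_decode(token_ids):
    """Look up each ID and join the pieces, as a simplified decoder would."""
    return "".join(toy_vocab[token_id] for token_id in token_ids)

print(toy_decode([3323]))              # Sub
print(toy_decode([3323, 622]))         # Subject
print(toy_decode([3323, 622, 29901]))  # Subject:
```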


# Comparing Trained LLM Tokenizers


In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
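
`show_tokens` colors each token with an ANSI escape sequence: `\x1b[0;30;48;2;R;G;Bm` sets black text on a 24-bit RGB background, and `\x1b[0m` resets the style. A small sketch of that building block on its own (`colorize` is a hypothetical helper name, and colors only render in terminals or notebooks that support ANSI codes):

```python
colors_list = ['102;194;165', '252;141;98', '141;160;203']

def colorize(token_text, idx):
    """Wrap token_text in an ANSI background color chosen by position."""
    rgb = colors_list[idx % len(colors_list)]  # cycle through the palette
    return f'\x1b[0;30;48;2;{rgb}m{token_text}\x1b[0m'

pieces = ["capital", "##ization", "[UNK]", "show"]
colored = [colorize(piece, i) for i, piece in enumerate(pieces)]
print(' '.join(colored))
```

The modulo in `idx % len(colors_list)` is what makes the fourth token reuse the first color.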

In [10]:
text = """
English and CAPITALIZATION
🎵 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

In [11]:
show_tokens(text, "bert-base-uncased")


[CLS] english and capital ##ization [UNK] [UNK] show _ token ##s false none eli ##f = = > = else : two tab ##s : " " three tab ##s : ... (output truncated)

In [12]:
show_tokens(text, "bert-base-cased")


[CLS] English and CA ##PI ##TA ##L ##I ##Z ##AT ##ION [UNK] [UNK] show _ token ##s F ##als ##e None el ##if = = > = else : two ta ... (output truncated)

In [13]:
show_tokens(text, "gpt2")


[0;30;48;2;102;194;165m
[0m [0;30;48;2;252;141;98mEnglish[0m [0;30;48;2;141;160;203m and[0m [0;30;48;2;231;138;195m CAP[0m [0;30;48;2;166;216;84mITAL[0m [0;30;48;2;255;217;47mIZ[0m [0;30;48;2;102;194;165mATION[0m [0;30;48;2;252;141;98m
[0m [0;30;48;2;141;160;203m [0m [0;30;48;2;231;138;195m [0m [0;30;48;2;166;216;84m [0m [0;30;48;2;255;217;47m  [0m [0;30;48;2;102;194;165m [0m [0;30;48;2;252;141;98m [0m [0;30;48;2;141;160;203m
[0m [0;30;48;2;231;138;195mshow[0m [0;30;48;2;166;216;84m_[0m [0;30;48;2;255;217;47mt[0m [0;30;48;2;102;194;165mok[0m [0;30;48;2;252;141;98mens[0m [0;30;48;2;141;160;203m False[0m [0;30;48;2;231;138;195m None[0m [0;30;48;2;166;216;84m el[0m [0;30;48;2;255;217;47mif[0m [0;30;48;2;102;194;165m ==[0m [0;30;48;2;252;141;98m >=[0m [0;30;48;2;141;160;203m else[0m [0;30;48;2;231;138;195m:[0m [0;30;48;2;166;216;84m two[0m [0;30;48;2;255;217;47m tabs[0m [0;30;48;2;102;194;165m:"[0m [0;30;48;2;252;141;98m [0m 

In [14]:
show_tokens(text, "google/flan-t5-small")


[0;30;48;2;102;194;165mEnglish[0m [0;30;48;2;252;141;98mand[0m [0;30;48;2;141;160;203mCA[0m [0;30;48;2;231;138;195mPI[0m [0;30;48;2;166;216;84mTAL[0m [0;30;48;2;255;217;47mIZ[0m [0;30;48;2;102;194;165mATION[0m [0;30;48;2;252;141;98m[0m [0;30;48;2;141;160;203m<unk>[0m [0;30;48;2;231;138;195m[0m [0;30;48;2;166;216;84m<unk>[0m [0;30;48;2;255;217;47mshow[0m [0;30;48;2;102;194;165m_[0m [0;30;48;2;252;141;98mto[0m [0;30;48;2;141;160;203mken[0m [0;30;48;2;231;138;195ms[0m [0;30;48;2;166;216;84mFal[0m [0;30;48;2;255;217;47ms[0m [0;30;48;2;102;194;165me[0m [0;30;48;2;252;141;98mNone[0m [0;30;48;2;141;160;203m[0m [0;30;48;2;231;138;195me[0m [0;30;48;2;166;216;84ml[0m [0;30;48;2;255;217;47mif[0m [0;30;48;2;102;194;165m=[0m [0;30;48;2;252;141;98m=[0m [0;30;48;2;141;160;203m>[0m [0;30;48;2;231;138;195m=[0m [0;30;48;2;166;216;84melse[0m [0;30;48;2;255;217;47m:[0m [0;30;48;2;102;194;165mtwo[0m [0;30;48;2;252;141;98mtab[0m [0;30;48;2;141

In [15]:
# The official tokenizer library is `tiktoken`, but this is the same tokenizer hosted on the HF platform
show_tokens(text, "Xenova/gpt-4")


[0;30;48;2;102;194;165m
[0m [0;30;48;2;252;141;98mEnglish[0m [0;30;48;2;141;160;203m and[0m [0;30;48;2;231;138;195m CAPITAL[0m [0;30;48;2;166;216;84mIZATION[0m [0;30;48;2;255;217;47m
[0m [0;30;48;2;102;194;165m [0m [0;30;48;2;252;141;98m [0m [0;30;48;2;141;160;203m [0m [0;30;48;2;231;138;195m  [0m [0;30;48;2;166;216;84m [0m [0;30;48;2;255;217;47m [0m [0;30;48;2;102;194;165m
[0m [0;30;48;2;252;141;98mshow[0m [0;30;48;2;141;160;203m_tokens[0m [0;30;48;2;231;138;195m False[0m [0;30;48;2;166;216;84m None[0m [0;30;48;2;255;217;47m elif[0m [0;30;48;2;102;194;165m ==[0m [0;30;48;2;252;141;98m >=[0m [0;30;48;2;141;160;203m else[0m [0;30;48;2;231;138;195m:[0m [0;30;48;2;166;216;84m two[0m [0;30;48;2;255;217;47m tabs[0m [0;30;48;2;102;194;165m:"[0m [0;30;48;2;252;141;98m   [0m [0;30;48;2;141;160;203m "[0m [0;30;48;2;231;138;195m Three[0m [0;30;48;2;166;216;84m tabs[0m [0;30;48;2;255;217;47m:[0m [0;30;48;2;102;194;165m "[0m [0;30;48;2

In [16]:
# You need to request access before being able to use this tokenizer
show_tokens(text, "bigcode/starcoder2-15b")


[0;30;48;2;102;194;165m
[0m [0;30;48;2;252;141;98mEnglish[0m [0;30;48;2;141;160;203m and[0m [0;30;48;2;231;138;195m CAPITAL[0m [0;30;48;2;166;216;84mIZATION[0m [0;30;48;2;255;217;47m
[0m [0;30;48;2;102;194;165m [0m [0;30;48;2;252;141;98m [0m [0;30;48;2;141;160;203m [0m [0;30;48;2;231;138;195m [0m [0;30;48;2;166;216;84m [0m [0;30;48;2;255;217;47m [0m [0;30;48;2;102;194;165m
[0m [0;30;48;2;252;141;98mshow[0m [0;30;48;2;141;160;203m_[0m [0;30;48;2;231;138;195mtokens[0m [0;30;48;2;166;216;84m False[0m [0;30;48;2;255;217;47m None[0m [0;30;48;2;102;194;165m elif[0m [0;30;48;2;252;141;98m ==[0m [0;30;48;2;141;160;203m >=[0m [0;30;48;2;231;138;195m else[0m [0;30;48;2;166;216;84m:[0m [0;30;48;2;255;217;47m two[0m [0;30;48;2;102;194;165m tabs[0m [0;30;48;2;252;141;98m:"[0m [0;30;48;2;141;160;203m   [0m [0;30;48;2;231;138;195m "[0m [0;30;48;2;166;216;84m Three[0m [0;30;48;2;255;217;47m tabs[0m [0;30;48;2;102;194;165m:[0m [0;30;48;2;25

In [17]:
show_tokens(text, "facebook/galactica-1.3b")


[0;30;48;2;102;194;165m
[0m [0;30;48;2;252;141;98mEnglish[0m [0;30;48;2;141;160;203m and[0m [0;30;48;2;231;138;195m CAP[0m [0;30;48;2;166;216;84mITAL[0m [0;30;48;2;255;217;47mIZATION[0m [0;30;48;2;102;194;165m
[0m [0;30;48;2;252;141;98m [0m [0;30;48;2;141;160;203m [0m [0;30;48;2;231;138;195m [0m [0;30;48;2;166;216;84m [0m [0;30;48;2;255;217;47m  [0m [0;30;48;2;102;194;165m [0m [0;30;48;2;252;141;98m [0m [0;30;48;2;141;160;203m
[0m [0;30;48;2;231;138;195mshow[0m [0;30;48;2;166;216;84m_[0m [0;30;48;2;255;217;47mtokens[0m [0;30;48;2;102;194;165m False[0m [0;30;48;2;252;141;98m None[0m [0;30;48;2;141;160;203m elif[0m [0;30;48;2;231;138;195m [0m [0;30;48;2;166;216;84m==[0m [0;30;48;2;255;217;47m [0m [0;30;48;2;102;194;165m>[0m [0;30;48;2;252;141;98m=[0m [0;30;48;2;141;160;203m else[0m [0;30;48;2;231;138;195m:[0m [0;30;48;2;166;216;84m two[0m [0;30;48;2;255;217;47m t[0m [0;30;48;2;102;194;165mabs[0m [0;30;48;2;252;141;98m:[0m [

In [18]:
show_tokens(text, "microsoft/Phi-3-mini-4k-instruct")

[0;30;48;2;102;194;165m[0m [0;30;48;2;252;141;98m
[0m [0;30;48;2;141;160;203mEnglish[0m [0;30;48;2;231;138;195mand[0m [0;30;48;2;166;216;84mC[0m [0;30;48;2;255;217;47mAP[0m [0;30;48;2;102;194;165mIT[0m [0;30;48;2;252;141;98mAL[0m [0;30;48;2;141;160;203mIZ[0m [0;30;48;2;231;138;195mATION[0m [0;30;48;2;166;216;84m
[0m [0;30;48;2;255;217;47m [0m [0;30;48;2;102;194;165m [0m [0;30;48;2;252;141;98m [0m [0;30;48;2;141;160;203m [0m [0;30;48;2;231;138;195m[0m [0;30;48;2;166;216;84m [0m [0;30;48;2;255;217;47m [0m [0;30;48;2;102;194;165m [0m [0;30;48;2;252;141;98m
[0m [0;30;48;2;141;160;203mshow[0m [0;30;48;2;231;138;195m_[0m [0;30;48;2;166;216;84mto[0m [0;30;48;2;255;217;47mkens[0m [0;30;48;2;102;194;165mFalse[0m [0;30;48;2;252;141;98mNone[0m [0;30;48;2;141;160;203melif[0m [0;30;48;2;231;138;195m==[0m [0;30;48;2;166;216;84m>=[0m [0;30;48;2;255;217;47melse[0m [0;30;48;2;102;194;165m:[0m [0;30;48;2;252;141;98mtwo[0m [0;30;48;2;141;16

# Contextualized Word Embeddings From a Language Model (Like BERT)

In [19]:
from transformers import AutoModel, AutoTokenizer

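# Load a tokenizer
# Note: this tokenizer comes from a different checkpoint than the model below.
# The two use different vocabularies, so this pairing demonstrates output shapes
# rather than meaningful embeddings; load both from the same checkpoint in real use.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")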

# Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]


In [20]:
output.shape

torch.Size([1, 4, 384])

In [21]:
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
Hello
 world
[SEP]


In [22]:
output

tensor([[[-3.4816,  0.0861, -0.1819,  ..., -0.0612, -0.3911,  0.3017],
         [ 0.1898,  0.3208, -0.2315,  ...,  0.3714,  0.2478,  0.8048],
         [ 0.2071,  0.5036, -0.0485,  ...,  1.2175, -0.2292,  0.8582],
         [-3.4278,  0.0645, -0.1427,  ...,  0.0658, -0.4367,  0.3834]]],
       grad_fn=<NativeLayerNormBackward0>)

# Text Embeddings (For Sentences and Whole Documents)

In [23]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to text embeddings
vector = model.encode("Best movie ever!")


In [24]:
vector.shape

(768,)

# Word Embeddings Beyond LLMs


In [25]:
# Replacing the gensim section with this:
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F


model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def get_word_embedding(word):
    """Get embedding for a single word"""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        # Use mean pooling to get word representation
        embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0]

def find_similar_words(target_word, word_list):
    """Find most similar words from a list"""
    target_emb = get_word_embedding(target_word)
    similarities = []

    for word in word_list:
        word_emb = get_word_embedding(word)
        sim = F.cosine_similarity(target_emb.unsqueeze(0),
                                 word_emb.unsqueeze(0))
        similarities.append((word, sim.item()))

    return sorted(similarities, key=lambda x: x[1], reverse=True)

# replaces the gensim king/queen example
word_list = ["king", "queen", "prince", "princess", "man", "woman",
             "throne", "crown", "ruler", "emperor", "servant"]

similar_to_king = find_similar_words("king", word_list)
print("Words similar to 'king':")
for word, score in similar_to_king[:9]:
    print(f"  {word}: {score:.4f}")


Words similar to 'king':
  king: 1.0000
  queen: 0.6807
  throne: 0.6611
  prince: 0.5884
  princess: 0.4843
  crown: 0.4721
  emperor: 0.4697
  servant: 0.3985
  ruler: 0.3564
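
The ranking above relies on two operations: mean pooling (averaging the per-token vectors into one vector) and cosine similarity. A dependency-free sketch of both on made-up three-dimensional vectors; the numbers are illustrative only and are not real model outputs:

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up token vectors for two hypothetical inputs
pooled_a = mean_pool([[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]])  # [2.0, 0.0, 1.0]
pooled_b = mean_pool([[0.0, 2.0, 0.0], [0.0, 4.0, 0.0]])  # [0.0, 3.0, 0.0]

print(pooled_a, pooled_b)
print(cosine_similarity(pooled_a, pooled_a))  # close to 1: same direction
print(cosine_similarity(pooled_a, pooled_b))  # 0.0: orthogonal vectors
```

Cosine similarity depends only on direction, not length, which is why it is a common choice for comparing embeddings.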


In [26]:
# import gensim.downloader as api

# # Download embeddings (66MB, glove, trained on wikipedia, vector size: 50)
# # Other options include "word2vec-google-news-300"
# # More options at https://github.com/RaRe-Technologies/gensim-data
# model = api.load("glove-wiki-gigaword-50")

In [27]:
# model.most_similar([model['king']], topn=11)

# Recommending songs by embeddings

In [28]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')
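
To make the parsing steps concrete, here is the same logic applied to a small inline string instead of the downloaded file. The sample text is made up to mimic the layout the code above assumes (two metadata lines, then one playlist of space-separated song IDs per line); the real file's contents are not reproduced here:

```python
# Made-up sample imitating the assumed file layout
raw = "metadata line 1\nmetadata line 2\n0 1 2 3 \n4 \n5 6 \n"

# Skip the first two metadata lines
lines = raw.split('\n')[2:]

# Keep playlists with more than one song; drop single-song and empty lines
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

print(playlists)  # [['0', '1', '2', '3'], ['5', '6']]
```

Song IDs stay as strings here, just as in the real `playlists` list.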

In [29]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

## Your Turn

Learn tokens and embeddings step by step. Each task provides a working example; your job is to modify and extend it.

In [30]:
# Task 1: See how different text types get tokenized
# EXAMPLE PROVIDED - Run this first to see how it works:

def analyze_tokens(text, tokenizer):
    """Analyze how text is tokenized and display results"""
    tokens = tokenizer(text, return_tensors="pt")
    token_ids = tokens.input_ids[0].tolist()

    print(f"Original text: {text}")
    print(f"Number of tokens: {len(token_ids)}")
    print(f"Tokens per character: {len(token_ids)/len(text):.2f}")
    print("Individual tokens:")
    for i, token_id in enumerate(token_ids):
        decoded = tokenizer.decode(token_id)
        print(f"  {i}: '{decoded}' (ID: {token_id})")
    print("-" * 50)
    return token_ids

# Example usage:
example_text = "Hello World! 2024"
example_ids = analyze_tokens(example_text, tokenizer)

# YOUR TURN:
# 1. Change example_text to include: your name, an email address, and a price (e.g., $19.99)
# 2. Run the analysis and observe how special characters and numbers are tokenized
# 3. Try adding an emoji and see what happens

your_text = "YOUR_TEXT_HERE"  # Modify this
# your_ids = analyze_tokens(your_text, tokenizer)


Original text: Hello World! 2024
Number of tokens: 7
Tokens per character: 0.41
Individual tokens:
  0: '[CLS]' (ID: 101)
  1: 'hello' (ID: 7592)
  2: 'world' (ID: 2088)
  3: '!' (ID: 999)
  4: '202' (ID: 16798)
  5: '##4' (ID: 2549)
  6: '[SEP]' (ID: 102)
--------------------------------------------------
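
In the output above, `##4` follows `202`: in WordPiece-style tokenizers such as BERT's, the `##` prefix marks a token that continues the previous one rather than starting a new word. A rough sketch of how such pieces could be merged back into words (`merge_wordpieces` is a hypothetical helper; the tokenizer's own `decode` is what you would use in practice):

```python
def merge_wordpieces(tokens):
    """Glue '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # continuation: append without the '##'
        else:
            words.append(token)      # start of a new word
    return words

# Tokens from the example output, without [CLS] and [SEP]
print(merge_wordpieces(["hello", "world", "!", "202", "##4"]))
# ['hello', 'world', '!', '2024']
print(merge_wordpieces(["capital", "##ization"]))
# ['capitalization']
```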


### Questions to consider:
After running the example, answer these questions:

  Q1: Looking at your tokenization results, identify whether this tokenizer uses word, subword, or character-level tokenization. Give an example to support your answer. (Refer to Figure 2-6 in the chapter for tokenization types.)


---


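  Q2: See the special token at the beginning of the output. In the example run above it is `[CLS]` (token ID 101); some other tokenizers, such as Llama-style ones, use `<s>` (token ID 1) instead. Based on Chapter 2's
  discussion of special tokens, what is this token for?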


---


  Q3: Calculate the compression ratio (characters/tokens). How does this compare to the ~4:1 ratio mentioned for English text in the chapter?


---


  Q4: Your email address was likely tokenized differently than regular words. Why might this be problematic for a model trained primarily on formal text?
  Recall the "unknown token" problem discussed in the chapter.
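
For Q3, the ratio can be computed directly from the character and token counts. A minimal sketch using the example run above ("Hello World! 2024", 7 tokens including `[CLS]` and `[SEP]`); `compression_ratio` is a hypothetical helper, and whether to count special tokens is a choice worth stating in your answer:

```python
def compression_ratio(text, num_tokens):
    """Characters per token: higher means the tokenizer packs text more densely."""
    return len(text) / num_tokens

example_text = "Hello World! 2024"

# 7 tokens in the example output, 5 once [CLS] and [SEP] are excluded
print(f"{compression_ratio(example_text, 7):.2f}")  # 2.43
print(f"{compression_ratio(example_text, 5):.2f}")  # 3.40
```

This is the inverse of the "Tokens per character" figure that `analyze_tokens` prints.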

In [31]:
# Task 2: Compare how the same concept is tokenized in different forms
# Building on Task 1's analyze_tokens function

# EXAMPLE PROVIDED:
texts_to_compare = {
    "lowercase": "artificial intelligence",
    "uppercase": "ARTIFICIAL INTELLIGENCE",
    "camelCase": "ArtificialIntelligence",
    "with_symbols": "artificial-intelligence",
    "with_numbers": "artificial1 intelligence2"
}

print("Comparing tokenization patterns:\n")
token_counts = {}

for label, text in texts_to_compare.items():
    tokens = tokenizer(text, return_tensors="pt")
    count = len(tokens.input_ids[0])
    token_counts[label] = count
    print(f"{label:15} -> {count} tokens: {text}")

# Find most efficient
most_efficient = min(token_counts, key=token_counts.get)
print(f"\nMost token-efficient: {most_efficient} with {token_counts[most_efficient]} tokens")

# YOUR TURN:
# 1. Replace "artificial intelligence" with a different two-word concept (e.g., "machine learning", "data science")
# 2. Add three more variations: snake_case, with dots (.), and abbreviated form
# 3. Identify which formatting is most token-efficient for your chosen concept

your_concept_variations = {
    "lowercase": "YOUR_CONCEPT_HERE",
    "uppercase": "YOUR_CONCEPT_IN_CAPS",
    # Add more variations here
}

# Analyze your variations using the same pattern as above


Comparing tokenization patterns:

lowercase       -> 4 tokens: artificial intelligence
uppercase       -> 4 tokens: ARTIFICIAL INTELLIGENCE
camelCase       -> 6 tokens: ArtificialIntelligence
with_symbols    -> 5 tokens: artificial-intelligence
with_numbers    -> 6 tokens: artificial1 intelligence2

Most token-efficient: lowercase with 4 tokens


### Questions to consider:


Q1: The chapter mentions that GPT-4 has a vocabulary of ~100,000 tokens while BERT has ~30,000. Based on your results, which formatting (uppercase, camelCase, etc.) would benefit more from a larger vocabulary? Why?


---


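Q2: Compare how "ARTIFICIAL INTELLIGENCE" and "artificial intelligence" are tokenized. An uncased tokenizer, like the one in the example run above, lowercases the input and produces the same tokens for both, whereas a cased tokenizer would not. Referring to the chapter's BERT cased vs uncased discussion, what are the trade-offs of preserving capitalization in tokenization?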


---


Q3: The chapter shows "CAPITALIZATION" being split into 2 tokens by GPT-4 but 8 tokens by BERT. Based on your experiments, predict how many tokens your concept would need in each tokenizer and explain why.


---


Q4: If you were designing a tokenizer for a code completion system (not natural language), which variation pattern would you prioritize optimizing for? Consider that code often uses camelCase, snake_case, and special symbols.

In [32]:

# Task 3: Find patterns in tokenization by examining multiple examples
# Building on Tasks 1 & 2

# EXAMPLE PROVIDED:
def find_token_patterns(word_list, tokenizer):
    """Find common patterns in how words are tokenized"""
    patterns = {
        "single_token": [],
        "multiple_tokens": [],
        "starts_with_space": [],
        "contains_special": []
    }

    for word in word_list:
        # Exclude special tokens so the count reflects only the word's own tokens
        tokens = tokenizer(word, return_tensors="pt", add_special_tokens=False)
        token_ids = tokens.input_ids[0].tolist()
        decoded_tokens = [tokenizer.decode(tid) for tid in token_ids]

        # Categorize based on tokenization pattern
        if len(token_ids) == 1:
            patterns["single_token"].append(word)
        else:
            patterns["multiple_tokens"].append(word)

        # Check for space prefix (common in many tokenizers)
        for decoded in decoded_tokens:
            if decoded.startswith(" ") or decoded.startswith("Ġ"):
                patterns["starts_with_space"].append(word)
                break

    return patterns

# Example word list - technology terms
tech_words = ["computer", "algorithm", "cryptocurrency", "AI", "blockchain",
              "machine", "learning", "neural", "network", "GPU"]

patterns = find_token_patterns(tech_words, tokenizer)

print("Tokenization Patterns Found:")
for pattern_type, words in patterns.items():
    if words:  # Only print non-empty patterns
        print(f"\n{pattern_type}:")
        print(f"  {', '.join(words[:5])}")  # Show first 5 examples

# YOUR TURN:
# 1. Create your own word list with 10-15 words from a different domain
#    (e.g., medical terms, cooking terms, sports terms)
# 2. Add a new pattern category: "contains_number" or "all_uppercase"
# 3. Analyze which types of words in your domain require more tokens

your_domain_words = [
    # Add your domain-specific words here
]

# your_patterns = find_token_patterns(your_domain_words, tokenizer)


Tokenization Patterns Found:

single_token:
  computer, algorithm, AI, machine, learning

multiple_tokens:
  cryptocurrency, blockchain, GPU


### Questions to consider:

Q1: The chapter discusses how subword tokenization handles new/unknown words by breaking them into known pieces. Which words in your domain list demonstrate this behavior? How does this relate to the OOV (out-of-vocabulary) problem mentioned?


---


Q2: Compare your results to Figure 2-3 from the chapter (GPT-4 tokenizer example). Do you see similar patterns with partial words?


---


Q3: The chapter mentions training data influences tokenization. Based on your domain-specific words, what can you infer about the training data used for this tokenizer?


---


Q4: If specialized domain terms (medical, legal, etc.) often require multiple tokens,
  what implications does this have for:
  a) API costs (charged per token)?
  b) Context window limitations?
  c) Model understanding of domain-specific concepts?


---


Q5: The chapter discusses byte-fallback tokenization. Are there any words that might trigger byte-level encoding? What would be the trade-off?
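
For Q4, a back-of-the-envelope calculation can make the cost and context-window implications tangible. The sketch below uses entirely hypothetical numbers (the price per 1,000 tokens, the context size, and the tokens-per-word rates are placeholders, not figures from the chapter or any provider), and `estimate` is an illustrative helper name:

```python
def estimate(num_words, tokens_per_word, price_per_1k_tokens, context_window):
    """Return (token_count, cost, fraction_of_context_used) for a document."""
    tokens = num_words * tokens_per_word
    cost = tokens / 1000 * price_per_1k_tokens
    return tokens, cost, tokens / context_window

# Hypothetical: general text at 1.5 tokens/word vs. jargon-heavy text at 3.0
for label, rate in [("general", 1.5), ("domain-specific", 3.0)]:
    tokens, cost, used = estimate(
        num_words=2000,
        tokens_per_word=rate,
        price_per_1k_tokens=0.01,   # placeholder price
        context_window=4096,        # placeholder context size
    )
    print(f"{label:16} {tokens:7.0f} tokens  cost {cost:.3f}  context used {used:.0%}")
```

Under these assumptions, doubling the tokens per word doubles the cost and can push the same document past the context window.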

In [33]:
# Task 4: Extract and analyze relationships between token embeddings
# Building on previous tasks, now working with embeddings

# EXAMPLE PROVIDED:
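from transformers import AutoModel
import torch
import torch.nn.functional as F
import numpy as np

# Load a model for embeddings (using smaller model for speed)
# Note: the `tokenizer` passed in below is the one loaded in an earlier cell and
# comes from a different checkpoint than this model, so the scores may not reflect
# semantic similarity. For a proper comparison, load the tokenizer from the same
# checkpoint, e.g. AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall").
embed_model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")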

def get_token_embeddings(text, tokenizer, model):
    """Get embeddings for each token in text"""
    tokens = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        output = model(**tokens)[0]

    # Get individual tokens for labels
    token_ids = tokens.input_ids[0].tolist()
    token_labels = [tokenizer.decode(tid) for tid in token_ids]

    return output[0], token_labels  # Return first batch item

def analyze_embedding_similarity(text_pairs, tokenizer, model):
    """Compare embeddings between pairs of related texts"""

    for pair_name, (text1, text2) in text_pairs.items():
        emb1, labels1 = get_token_embeddings(text1, tokenizer, model)
        emb2, labels2 = get_token_embeddings(text2, tokenizer, model)

        # Calculate average embedding for each text
        avg_emb1 = emb1.mean(dim=0)
        avg_emb2 = emb2.mean(dim=0)

        # Calculate cosine similarity
        similarity = F.cosine_similarity(avg_emb1.unsqueeze(0),
                                        avg_emb2.unsqueeze(0))

        print(f"\n{pair_name}:")
        print(f"  Text 1: {text1}")
        print(f"  Text 2: {text2}")
        print(f"  Similarity: {similarity.item():.4f}")

# Example: Comparing related concepts
example_pairs = {
    "synonyms": ("happy", "joyful"),
    "antonyms": ("happy", "sad"),
    "related": ("dog", "puppy"),
    "unrelated": ("happy", "computer")
}

analyze_embedding_similarity(example_pairs, tokenizer, embed_model)

# YOUR TURN:
# 1. Create 4 new text pairs testing different relationships:
#    - Two technical terms from the same field
#    - Same word in different contexts (e.g., "bank" as financial vs river)
#    - Two words in different languages (if supported by tokenizer)
#    - A word and its definition
# 2. Identify which relationship types have highest/lowest similarity

your_pairs = {
    "technical_related": ("YOUR_TERM1", "YOUR_TERM2"),
    # Add more pairs here
}

# analyze_embedding_similarity(your_pairs, tokenizer, embed_model)




synonyms:
  Text 1: happy
  Text 2: joyful
  Similarity: 0.7765

antonyms:
  Text 1: happy
  Text 2: sad
  Similarity: 0.7143

related:
  Text 1: dog
  Text 2: puppy
  Similarity: 0.8345

unrelated:
  Text 1: happy
  Text 2: computer
  Similarity: 0.8413



Q1: The chapter explains that contextualized embeddings (such as those from BERT) differ from static embeddings (such as Word2Vec). Based on your similarity scores, provide evidence that these are contextualized embeddings. (Hint: Would "bank" have the same embedding in different contexts with static embeddings?)


---


Q2: Figure 2-9 shows how token embeddings are processed. Your embedding dimension is 384. The chapter mentions GPT models can have 768 or more dimensions. What trade-offs exist between embedding dimension size and model performance?


---


Q3: In the example output, synonyms (0.7765) scored higher than antonyms (0.7143), yet the unrelated pair scored highest (0.8413). To what extent does this support the chapter's claim that embeddings "capture meanings and patterns in language"?
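
When reading these scores, it helps to recall what cosine similarity measures. The sketch below uses invented 3-dimensional vectors to show one possible reason scores can cluster at high values: if two vectors share a large common component, their cosine similarity is high even when the remaining components point in unrelated directions. This is a property of the metric shown on toy data, not a claim about why this particular model produces the scores above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors: both share a large first component
vec_a = [5.0, 1.0, 0.0]
vec_b = [5.0, 0.0, 1.0]
shared = [5.0, 0.0, 0.0]

# Remove the shared component to compare only what differs
diff_a = [x - s for x, s in zip(vec_a, shared)]
diff_b = [x - s for x, s in zip(vec_b, shared)]

print(f"raw similarity:      {cosine_similarity(vec_a, vec_b):.4f}")
print(f"after removing mean: {cosine_similarity(diff_a, diff_b):.4f}")
```

For this reason, the ranking of pairs relative to each other is often more informative than the absolute value of any single score.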


---


Q4: You calculated average embeddings by taking the mean. The chapter also mentions using [CLS] tokens for sentence representation. What are potential advantages/disadvantages of each approach?
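
The two pooling strategies can be compared on a toy example. The vectors below are invented 3-dimensional stand-ins for token embeddings, and `mean_pool` and `first_token_pool` are hypothetical helper names. Note that in the Task 4 code, the mean is taken over every position returned by the tokenizer, which by default includes special tokens such as [CLS] and [SEP].

```python
def mean_pool(token_vectors):
    """Average all token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]

def first_token_pool(token_vectors):
    """Use only the first vector (the [CLS] position in BERT-style models)."""
    return list(token_vectors[0])

# Invented 3-dimensional "embeddings", for illustration only
toy_embeddings = [
    [0.9, 0.1, 0.0],  # [CLS]
    [0.2, 0.8, 0.4],  # word 1
    [0.4, 0.6, 0.2],  # word 2
    [0.1, 0.1, 0.2],  # [SEP]
]

print("mean pooling:", [round(v, 3) for v in mean_pool(toy_embeddings)])
print("first token: ", first_token_pool(toy_embeddings))
```

Mean pooling blends every position into the result, while first-token pooling depends entirely on how much the model was trained to summarize the sequence at that one position.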


---


Q5: If embedding similarity doesn't always match human intuition about word relationships, what does this tell us about what the model has "learned" during training?

In [35]:
# Task 5: Combine techniques from Tasks 1-4 into a comprehensive analysis
# This brings together tokenization, pattern finding, and embedding analysis

# EXAMPLE PROVIDED:
def comprehensive_text_analysis(text, reference_text, tokenizer, model):
    """
    Perform complete analysis combining all previous techniques:
    1. Basic tokenization stats (Task 1)
    2. Pattern comparison (Task 2)
    3. Token pattern discovery (Task 3)
    4. Embedding similarity (Task 4)
    """

    print("="*60)
    print("TEXT ANALYSIS")
    print("="*60)

    # Step 1: Basic tokenization (from Task 1)
    print("\n1. TOKENIZATION STATISTICS:")
    tokens = tokenizer(text, return_tensors="pt")
    token_ids = tokens.input_ids[0].tolist()
    print(f"   Text length: {len(text)} characters")
    print(f"   Token count: {len(token_ids)} tokens")
    print(f"   Compression ratio: {len(text)/len(token_ids):.2f} chars/token")

    # Step 2: Compare variations (from Task 2)
    print("\n2. FORMAT VARIATIONS:")
    variations = {
        "original": text,
        "uppercase": text.upper(),
        "lowercase": text.lower(),
        "no_spaces": text.replace(" ", "")
    }

    for var_name, var_text in variations.items():
        var_tokens = tokenizer(var_text, return_tensors="pt")
        print(f"   {var_name:12} -> {len(var_tokens.input_ids[0])} tokens")

    # Step 3: Find patterns (from Task 3)
    print("\n3. TOKEN PATTERNS:")
    words = text.split()
    long_words = [w for w in words if len(w) > 5]
    short_words = [w for w in words if len(w) <= 3]

    if long_words:
        print(f"   Long words (>5 chars): {', '.join(long_words[:3])}")
    if short_words:
        print(f"   Short words (≤3 chars): {', '.join(short_words[:3])}")

    # Step 4: Embedding similarity (from Task 4)
    print("\n4. SEMANTIC SIMILARITY:")
    if reference_text:
        emb1, _ = get_token_embeddings(text, tokenizer, model)
        emb2, _ = get_token_embeddings(reference_text, tokenizer, model)

        avg_emb1 = emb1.mean(dim=0)
        avg_emb2 = emb2.mean(dim=0)

        similarity = F.cosine_similarity(avg_emb1.unsqueeze(0),
                                        avg_emb2.unsqueeze(0))
        print(f"   Similarity to reference: {similarity.item():.4f}")
        print(f"   Reference: '{reference_text[:50]}...'")

    print("\n" + "="*60)

    return {
        "char_count": len(text),
        "token_count": len(token_ids),
        "compression": len(text)/len(token_ids),
        "similarity": similarity.item() if reference_text else None
    }

# Example usage:
example_text = "Artificial intelligence is transforming how we work and live"
reference = "Machine learning is changing our daily lives"

results = comprehensive_text_analysis(example_text, reference, tokenizer, embed_model)

# YOUR TURN - FINAL TASK:
# 1. Choose a paragraph (3-4 sentences) about any technical topic
# 2. Create a reference text that is:
#    a) A paraphrase of your paragraph
#    b) On the same topic but from a different viewpoint
#    c) A simplified explanation of the same concept
# 3. Run the comprehensive analysis on all three variations
# 4. Add one new metric to the analysis function that combines at least 2 techniques
#    (e.g., find the most semantically important token by combining embedding magnitude
#     with token frequency, or identify tokens that change meaning most when context changes)

your_paragraph = """
YOUR PARAGRAPH HERE
"""

your_reference_paraphrase = """
YOUR PARAPHRASE HERE
"""

# Run comprehensive analysis
# your_results = comprehensive_text_analysis(your_paragraph, your_reference_paraphrase, tokenizer, embed_model)

# BONUS: Create your own analysis metric
def your_custom_metric(text, tokenizer, model):
    """
    Create a new metric combining multiple techniques.
    Example: Find tokens that are both rare and semantically important
    """
    # YOUR CODE HERE
    pass


TEXT ANALYSIS

1. TOKENIZATION STATISTICS:
   Text length: 60 characters
   Token count: 11 tokens
   Compression ratio: 5.45 chars/token

2. FORMAT VARIATIONS:
   original     -> 11 tokens
   uppercase    -> 11 tokens
   lowercase    -> 11 tokens
   no_spaces    -> 17 tokens

3. TOKEN PATTERNS:
   Long words (>5 chars): Artificial, intelligence, transforming
   Short words (≤3 chars): is, how, we

4. SEMANTIC SIMILARITY:
   Similarity to reference: 0.7789
   Reference: 'Machine learning is changing our daily lives...'



### Questions to Consider:

**Token Count:**
- Looking at your results, which text format used the most tokens? Which used the least? Why do you think this happened?

**Compression Patterns:**
- Your compression ratio shows how many characters, on average, map to one token. Is your text more efficient than the example text? What makes text more or less "compressible"?

**Similarity Results:**
- Look at your similarity score between the original and reference text. Is it higher or lower than you expected? What words might be causing the similarity or difference?

**Connecting to Chapter 2 Concepts:**
- The chapter mentions that spaces are often included in tokens (like "Ġ" in GPT-2). Can you spot evidence of this in your tokenization results?

**Pattern Recognition:**
- Check whether uppercase vs lowercase changes the token count in your results (in the example output it did not). Based on what you learned about BERT cased/uncased in the chapter, why can capitalization matter for tokenization?

**Practical Application:**
- If you were sending this text to an API that charges per token:
  - Which format would be cheapest to send?
  - Could you rewrite it to use fewer tokens while keeping the same meaning?

**Your Custom Metric Ideas:**
Instead of creating complex code, describe in words:
- What would be useful to measure that combines two techniques? (Example: "Finding words that are both long AND have low similarity to other words")
- Why would this metric be helpful for understanding text?

**Summary:**
Based on all your experiments:
- Name one surprising thing you discovered about tokenization
- Name one way tokenization could affect a chatbot's responses
- If you had to explain tokenization to a friend, what's the most important thing they should know?