Course: Language as Data, University of Göttingen

# Week 6: Embeddings
In this lab, we examine how token ids can be mapped into embeddings. The notebook adapts parts from [Sebastian Raschka's notebook](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb) accompanying chapter 2 of his book ["Build a Large Language Model (from Scratch)"](https://www.manning.com/books/build-a-large-language-model-from-scratch).

The previous notebook used [Emma](https://www.gutenberg.org/cache/epub/158/pg158.txt) by Jane Austen as an example. The cells below instead load a file of Spanish Wikipedia sentences (`content/spa_wikipedia_2021_30K-sentences.txt`).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# !pip install 'torch>=2.0.1' 'jupyterlab>=4.0' 'tiktoken>=0.5.1' 'numpy>=1.25,<2.0' --user

In [3]:
import tiktoken
text_path = "content/spa_wikipedia_2021_30K-sentences.txt"

## 1. Token Ids

You can use any tokenizer. Here, we use the pre-trained tokenizer employed in the gpt2 model. 

In [4]:
tokenizer = tiktoken.get_encoding("gpt2")

In [5]:
from src.helper import get_cleaned_spanish_text_as_string
raw_text = get_cleaned_spanish_text_as_string(text_path)
enc_text = tokenizer.encode(raw_text)

The tokenizer maps each token to a token id:

In [6]:
print(raw_text[0:100])
print()
print(tokenizer.decode(enc_text[0:50]))
print()
tokens = [tokenizer.decode([tok]) for tok in enc_text[0:50]]
print(enc_text[0:50])
print()
print(tokens)

 12 de abril de 1996 es una actriz de la industria del entretenimiento para adultos y una personalid

 12 de abril de 1996 es una actriz de la industria del entretenimiento para adultos y una personalidad de internet. 15 Y el Señor dijo a Noé La hijas de tus

[1105, 390, 450, 22379, 390, 8235, 1658, 555, 64, 719, 47847, 390, 8591, 2226, 7496, 1619, 920, 1186, 268, 320, 1153, 78, 31215, 4044, 418, 331, 555, 64, 2614, 32482, 390, 5230, 13, 1315, 575, 1288, 1001, 12654, 273, 2566, 7639, 257, 1400, 2634, 4689, 16836, 292, 390, 256, 385]

[' 12', ' de', ' ab', 'ril', ' de', ' 1996', ' es', ' un', 'a', ' act', 'riz', ' de', ' la', ' indust', 'ria', ' del', ' ent', 'ret', 'en', 'im', 'ient', 'o', ' para', ' adult', 'os', ' y', ' un', 'a', ' personal', 'idad', ' de', ' internet', '.', ' 15', ' Y', ' el', ' Se', 'ñ', 'or', ' di', 'jo', ' a', ' No', 'é', ' La', ' hij', 'as', ' de', ' t', 'us']


Why is the raw text shorter than the decoded text in the print statements above? Make sure you understand what the indices refer to. 

In [7]:
# Note that the gpt2 tokenizer was trained with cased training data. 
# Example: "Paris" is kept as a single token, "paris" is split into two tokens
print(tokenizer.encode("Paris"))
print(tokenizer.encode("paris"))

[40313]
[1845, 271]


## 2. Sliding Window
Language models generate text one token at a time. During training, the model iteratively predicts every token of the training data from the tokens that precede it.


In [8]:
enc_sample = enc_text[7:99]

context_length = 4
for i in range(1, context_length+1):
    context = enc_sample[:i]
    target = enc_sample[i]

    print(context, "-->", target)
    print(tokenizer.decode(context), "-->",tokenizer.decode([target]))

[555] --> 64
 un --> a
[555, 64] --> 719
 una -->  act
[555, 64, 719] --> 47847
 una act --> riz
[555, 64, 719, 47847] --> 390
 una actriz -->  de


The **context length** (also called context size) is the maximum sequence length that the model accepts. It is a hyperparameter that is set when configuring the model architecture. For gpt2, the context size was set to 1,024 tokens; for llama3, it was set to 8,192 tokens. Note that punctuation symbols (such as the period in the decoded sample above) are also tokens (id 13).


We subclass the **Dataset** class in torch to split the training data into overlapping input sequences with the specified context size.

For each input sequence, the **target sequence** is obtained by shifting the window by one token to the right, so that the target at each position is the token that follows the corresponding input token. Input and target sequences are represented as torch tensors.
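The shift between input and target can be checked on a toy list before looking at the full class. The following is a minimal sketch in plain Python; the token ids are made-up values and `pairs` is a hypothetical name, neither is taken from the notebook's data.

```python
# Toy token ids standing in for a tokenized text (made-up values)
token_ids = [10, 20, 30, 40, 50, 60]
context_length = 4

pairs = []
# Slide a window of size context_length over the ids, one token at a time
for i in range(len(token_ids) - context_length):
    input_sequence = token_ids[i:i + context_length]
    # The target is the same window moved one token to the right
    target_sequence = token_ids[i + 1:i + context_length + 1]
    pairs.append((input_sequence, target_sequence))

for input_sequence, target_sequence in pairs:
    print(input_sequence, "-->", target_sequence)
```

With six ids and a context length of four, the loop yields two input/target pairs. The last window stops one token before the end so that every input position still has a target.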

In [9]:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


class GPTDataset(Dataset):
    def __init__(self, txt, tokenizer, context_length):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the text into overlapping sequences of context_length
        for i in range(0, len(token_ids) - context_length):
            input_sequence = token_ids[i:i + context_length]

            # Shift the window by one token to the right
            target_sequence = token_ids[i + 1: i + context_length + 1]

            # Input and target are represented as tensors
            self.input_ids.append(torch.tensor(input_sequence))
            self.target_ids.append(torch.tensor(target_sequence))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


## 3. Batching
The data is fed to the model in batches. The **batch size** is also a hyperparameter. It refers to the number of training examples the model sees in one iteration of the training process before updating its weights. gpt2 was trained with a batch size of 512; smaller models often use smaller batches, for example 128. We use the **DataLoader** class in torch to group the input and target sequences of the dataset into batches of the specified batch size.

In [10]:

def create_dataloader(txt, batch_size=8, context_length=4, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDataset(txt, tokenizer, context_length)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [11]:
# Set a manual seed for reproducibility of shuffling and weight initialization
torch.manual_seed(0) 
dataloader = create_dataloader(raw_text, batch_size=8, context_length=4, shuffle=False)


The dataloader can be used as an iterator. Each batch consists of input sequences and target sequences. When working with tensors, it is important to understand the dimensions. Vary the batch_size and context_length parameters and inspect how they affect the shape of the tensors. For your project, make sure to split the data into training, development, and test portions.

In [12]:
print(len(dataloader))
data_iter = iter(dataloader)

first_batch = next(data_iter)
print("1:", first_batch)


152705
1: [tensor([[ 1105,   390,   450, 22379],
        [  390,   450, 22379,   390],
        [  450, 22379,   390,  8235],
        [22379,   390,  8235,  1658],
        [  390,  8235,  1658,   555],
        [ 8235,  1658,   555,    64],
        [ 1658,   555,    64,   719],
        [  555,    64,   719, 47847]]), tensor([[  390,   450, 22379,   390],
        [  450, 22379,   390,  8235],
        [22379,   390,  8235,  1658],
        [  390,  8235,  1658,   555],
        [ 8235,  1658,   555,    64],
        [ 1658,   555,    64,   719],
        [  555,    64,   719, 47847],
        [   64,   719, 47847,   390]])]


In [13]:
inputs, targets = first_batch
print("Inputs:\n", inputs)
print("Shape:", inputs.shape)
print("\nTargets:\n", targets)
print("Shape:", targets.shape)

Inputs:
 tensor([[ 1105,   390,   450, 22379],
        [  390,   450, 22379,   390],
        [  450, 22379,   390,  8235],
        [22379,   390,  8235,  1658],
        [  390,  8235,  1658,   555],
        [ 8235,  1658,   555,    64],
        [ 1658,   555,    64,   719],
        [  555,    64,   719, 47847]])
Shape: torch.Size([8, 4])

Targets:
 tensor([[  390,   450, 22379,   390],
        [  450, 22379,   390,  8235],
        [22379,   390,  8235,  1658],
        [  390,  8235,  1658,   555],
        [ 8235,  1658,   555,    64],
        [ 1658,   555,    64,   719],
        [  555,    64,   719, 47847],
        [   64,   719, 47847,   390]])
Shape: torch.Size([8, 4])
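
The number of batches reported by `len(dataloader)` follows from the dataset size, the batch size, and `drop_last`. The sketch below reproduces that arithmetic in plain Python. `num_batches` is a hypothetical helper written for illustration, and the sample count of 1003 is a made-up value, not the size of the dataset above.

```python
import math


def num_batches(num_samples, batch_size, drop_last=True):
    """Number of batches in one pass over the data (hypothetical helper)."""
    if drop_last:
        # An incomplete final batch is discarded
        return num_samples // batch_size
    # An incomplete final batch is kept as a smaller batch
    return math.ceil(num_samples / batch_size)


print(num_batches(1003, 8, drop_last=True))   # 125
print(num_batches(1003, 8, drop_last=False))  # 126
```

With `drop_last=True`, every batch has exactly `batch_size` rows, which keeps the tensor shapes constant across iterations.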


## 4. Embeddings

We can now decide how we want to represent our input data in the model. 


### 4.1 Token Embeddings

The token embeddings project the token IDs of the tokenizer into vector space.
The dimensions of the embedding matrix are determined by the vocabulary size of the tokenizer and the embedding size. The embedding size is a hyperparameter. 

GPT-2 uses a vocabulary size of 50,257 and an embedding size of 768. 

In [14]:
print(tokenizer.max_token_value)

vocab_size = tokenizer.max_token_value+1
embedding_dim = 256 

# Create the token embedding layer
token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)


50256


Now, we can pass our token IDs through the embedding layer to get the token embeddings.
The weights of the embedding layer are randomly initialized and get optimized during training.

In [15]:
token_embeddings = token_embedding_layer(inputs)

print("Shape of token embeddings:", token_embeddings.shape)
print("First sequence", token_embeddings[0])

Shape of token embeddings: torch.Size([8, 4, 256])
First sequence tensor([[ 0.2794,  0.9326,  0.0547,  ..., -1.0102, -0.7279, -0.1012],
        [-0.0574, -2.3481, -0.3402,  ...,  0.3958,  0.4084, -1.2151],
        [ 1.2543, -0.8234,  1.1952,  ..., -0.1504, -0.2927,  0.8922],
        [ 0.0996,  0.3128, -1.6327,  ...,  0.2467, -0.4801, -1.2572]],
       grad_fn=<SelectBackward0>)


This outputs a tensor of shape (batch_size, context_length, embedding_dim).
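
To see where this shape comes from, it can help to treat the embedding layer as a lookup table: each token id selects one row of the weight matrix. The following is a minimal sketch with toy sizes; the ids and dimensions are chosen for illustration and are not the values used above.

```python
import torch

torch.manual_seed(0)

vocab_size, embedding_dim = 10, 3  # toy sizes for illustration
layer = torch.nn.Embedding(vocab_size, embedding_dim)

# Two sequences of two token ids each: shape (batch_size=2, context_length=2)
ids = torch.tensor([[1, 5], [5, 9]])

embedded = layer(ids)          # forward pass through the embedding layer
looked_up = layer.weight[ids]  # direct row indexing into the weight matrix

print(embedded.shape)                    # torch.Size([2, 2, 3])
print(torch.equal(embedded, looked_up))  # True
```

The same token id (5 in this toy input) always receives the same vector, regardless of where it occurs in the batch.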

### 4.2 Positional Embeddings

The positional embeddings indicate the order of the input tokens within each sequence. GPT-2 uses absolute position embeddings. 
We create a positional embedding layer with one row per position (context_length rows in total) and the same embedding size as the token embeddings.

In [16]:
position_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)
position_ids = torch.arange(context_length)
print("Position IDs:", position_ids)
position_embeddings = position_embedding_layer(position_ids)

Position IDs: tensor([0, 1, 2, 3])


The position embedding layer looks up one row of randomly initialized weights for each position id. These weights are optimized during training.

In [17]:
print("Shape of position embeddings:", position_embeddings.shape)
print("Position embeddings:", position_embeddings)

Shape of position embeddings: torch.Size([4, 256])
Position embeddings: tensor([[ 0.5785,  0.1814,  0.2622,  ..., -0.5162,  1.1787,  0.4018],
        [ 1.6504,  2.3930,  0.0143,  ..., -0.0124,  0.4445, -0.8851],
        [ 0.6348,  0.1572, -1.0412,  ...,  2.1842,  1.1838, -0.7935],
        [ 0.9566,  0.3479,  0.5343,  ...,  0.3031, -0.8450, -0.0861]],
       grad_fn=<EmbeddingBackward0>)


### 4.3 Combining Token and Positional Embeddings

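The token embeddings and positional embeddings are added element-wise to get the final input embeddings. Broadcasting adds the same positional embeddings of shape (context_length, embedding_dim) to every sequence in the batch.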

In [18]:
input_embeddings = token_embeddings + position_embeddings

print("Input embeddings shape:", input_embeddings.shape)

Input embeddings shape: torch.Size([8, 4, 256])
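
The addition works although the two tensors have different numbers of dimensions, because torch broadcasts the positional embeddings over the batch dimension. The sketch below compares the broadcast sum with an explicitly expanded version. It uses small random tensors as stand-ins for the real embeddings, and the sizes are assumptions chosen for readability.

```python
import torch

torch.manual_seed(0)

batch_size, context_length, embedding_dim = 2, 3, 4  # toy sizes
token_embeddings = torch.randn(batch_size, context_length, embedding_dim)
position_embeddings = torch.randn(context_length, embedding_dim)

# Broadcasting: (2, 3, 4) + (3, 4) -> (2, 3, 4)
broadcast_sum = token_embeddings + position_embeddings

# Same operation with the batch dimension written out explicitly
explicit = position_embeddings.unsqueeze(0).expand(batch_size, -1, -1)
explicit_sum = token_embeddings + explicit

print(broadcast_sum.shape)                      # torch.Size([2, 3, 4])
print(torch.equal(broadcast_sum, explicit_sum)) # True
```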


## 5. Note
Embedding layers are a computationally efficient implementation of a lookup that could also be expressed as a one-hot encoding followed by a linear layer that multiplies the one-hot matrix with the embedding weights. If you are interested in more details on this, see Sebastian Raschka's [bonus material](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/03_bonus_embedding-vs-matmul/embeddings-and-linear-layers.ipynb).
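
The equivalence can be checked on a small example. The following sketch copies the weights of an embedding layer into a bias-free linear layer and compares both outputs. It is an illustration under toy sizes and made-up ids, not the code from the bonus material.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 6, 3  # toy sizes for illustration
embedding = torch.nn.Embedding(vocab_size, embedding_dim)
ids = torch.tensor([2, 0, 5])

# One row per token id, with a single 1 at the index of that id
one_hot = F.one_hot(ids, num_classes=vocab_size).float()

# nn.Linear stores its weight as (out_features, in_features),
# so the embedding matrix is copied in transposed form
linear = torch.nn.Linear(vocab_size, embedding_dim, bias=False)
with torch.no_grad():
    linear.weight.copy_(embedding.weight.T)

print(one_hot.shape)                                     # torch.Size([3, 6])
print(torch.allclose(linear(one_hot), embedding(ids)))   # True
```

The one-hot route multiplies mostly zeros, which is why the direct lookup is preferred for large vocabularies.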