<a href="https://colab.research.google.com/github/hollyemblem/raschka-llm-from-scratch/blob/main/chapter2_creating_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Chapter 2

#### Creating Token Embeddings
We initialise embeddings - which is the process of converting token IDs into vector representations - with random values. In later chapters (Chapter 5) we will optimise the embedding weights..

Raschka: "Embedding layers perform a lookup operation, retrieving the embedding vector corresponding to the token ID from the embedding layer’s weight matrix."

Me: Think about this as; embedding layers store a trainable matrix (with backprop for example) where each row is a token embedding. If we have a token ID, we can perform a look up and retrieve the corresponding row, which is the embedding vector.


In [1]:
!pip install tiktoken



In [2]:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.12.0


In [3]:
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

('the-verdict.txt', <http.client.HTTPMessage at 0x7bbd2a6e1130>)

In [4]:
tokenizer = tiktoken.get_encoding("gpt2")

In [5]:
import torch
from torch.utils.data import Dataset, DataLoader

In [6]:
##A toy example -  four input tokens with IDs 2, 3, 5, and 1

input_ids = torch.tensor([2, 3, 5, 1])

In [7]:
vocab_size = 6 #This is different to our BPE vocab size of 50k+!
output_dim = 3 #GPT-3's embedding size is  12,288 dimensions

In [8]:
##Instantiating a embedding layer
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


#### 6x3 Matrix:
There is one row for each of the six possible tokens in the vocabulary,
and there is one column for each of the three embedding dimensions.



In [9]:
#Now, let’s apply it to a token ID to obtain the embedding vector:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


We can find the above value in the fourth row of the matrix, or row 3 due to 0 numbering. This embedding layer is therefore a lookup which means we can retrieve rows from the embedding layer's weight matrix, with a token ID (in this instance, 3)

 In other words, the embedding layer is essentially a lookup operation that retrieves rows from the embedding layer’s weight matrix via a token ID.




#### Understanding one-hot encoding approach
"For those who are familiar with one-hot encoding, the embedding layer approach described here is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully connected layer...

Because the embedding layer is just a more efficient implementation equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation."

More detailed guidance available [here](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/03_bonus_embedding-vs-matmul/embeddings-and-linear-layers.ipynb)

In [14]:
print(embedding_layer(torch.tensor([2, 3, 5, 1])))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


In [15]:
print(embedding_layer(torch.tensor([3,5,2,1])))

tensor([[-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 1.2753, -0.2010, -0.1606],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


In above example, there isn't recollection of order. "The embedding layer converts a token ID into the same vector representation regardless of where it is located in the input sequence. For example, the token ID 5, whether it’s in the second or third position in the token ID input vector, will result in the same embedding vector."

## Encoding Word Positions

"The way the previously introduced embedding layer works is that the same token ID always gets mapped to the same vector representation, regardless of where the token ID is positioned in the input sequence"

But...Attention doesn't have a notion of position or order of a sequence. So while the above approach is quite deterministic and good for reproducibility, we need a method of injecting some positioning logic into the LLM.

Two types of approaches for this:

- Absolute positional embedding; unique token identifying position added to embedding. E.g if first token = 1,1,1. Then we add 1.1, 1.2, 1.3 getting to input embeddings of 2.1, 2.2, 2.3. For third tokens of 1,1,1 we add 3.1, 3.2, 3.3, getting to 3.1, 3.2, 3.3.


- Relative positional embedding; these focus on distance between tokens, i.e how near or how far. Helps model generalise to sequences of varying lengths.

Note: GPT uses a variant of absolute positional embeddings which are optimised during training, rather than being fixed/predefined like the original Transformer models.



In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)    #1

        for i in range(0, len(token_ids) - max_length, stride):     #2
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):    #3
        return len(self.input_ids)

    def __getitem__(self, idx):         #4
        return self.input_ids[idx], self.target_ids[idx]

In [None]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")                         #1
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)   #2
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,     #3
        num_workers=num_workers     #4
    )

    return dataloader

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)      #1
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [None]:
vocab_size = 50257
output_dim = 256 #vector rep
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [None]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
   stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [None]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [None]:
## To implement GPT's approach to absolute positioning, we need another embedding layer with same dimensions as our previous one
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [None]:
## We now add them  <- These are our input embeddings. TOKEN EMBEDDINGS + POSITION EMBEDDINGS
## Position embeddings are another layer we can train, not fixed.
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])


![https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781633437166/files/Images/2-19.png]