<a href="https://colab.research.google.com/github/hollyemblem/raschka-llm-from-scratch/blob/chapter-2-embeddings/chapter2_creating_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Chapter 2

#### Creating Token Embeddings
Embedding is the process of converting token IDs into vector representations. We initialise the embedding weights with random values; in a later chapter (Chapter 5) we will optimise them during training.

Raschka: "Embedding layers perform a lookup operation, retrieving the embedding vector corresponding to the token ID from the embedding layer’s weight matrix."

Me: Think of an embedding layer as a trainable matrix (its values are updated via backpropagation, for example) in which each row is the embedding for one token. Given a token ID, we can look up and retrieve the corresponding row, which is that token's embedding vector.


In [1]:
import torch
from torch.utils.data import Dataset, DataLoader

In [2]:
##A toy example -  four input tokens with IDs 2, 3, 5, and 1

input_ids = torch.tensor([2, 3, 5, 1])

In [3]:
vocab_size = 6 #This is different to our BPE vocab size of 50k+!
output_dim = 3 #GPT-3's embedding size is  12,288 dimensions

In [4]:
##Instantiating an embedding layer
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


#### 6x3 Matrix:
There is one row for each of the six possible tokens in the vocabulary,
and there is one column for each of the three embedding dimensions.



In [5]:
#Now, let’s apply it to a token ID to obtain the embedding vector:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


The output above matches the fourth row of the weight matrix (row index 3, because indexing starts at 0). In other words, the embedding layer is essentially a lookup operation that retrieves rows from its weight matrix via a token ID (in this instance, 3).
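To make the lookup idea concrete, here is a minimal sketch that compares calling the embedding layer with indexing its weight matrix directly. It rebuilds the same toy layer (seed 123, six tokens, three dimensions) so that it runs on its own; the names `looked_up` and `indexed` are only illustrative.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)  # vocab_size=6, output_dim=3

# Calling the layer with a token ID...
token_id = 3
looked_up = embedding_layer(torch.tensor([token_id]))  # shape (1, 3)

# ...returns the same values as indexing the weight matrix by row.
indexed = embedding_layer.weight[token_id]  # shape (3,)
print(torch.equal(looked_up[0], indexed))

# The same holds for a whole batch of token IDs.
input_ids = torch.tensor([2, 3, 5, 1])
batch_looked_up = embedding_layer(input_ids)
batch_indexed = embedding_layer.weight[input_ids]
print(torch.equal(batch_looked_up, batch_indexed))
```

Both checks should print `True`, since the forward pass of `torch.nn.Embedding` simply selects rows of `weight`.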




#### Understanding one-hot encoding approach
"For those who are familiar with one-hot encoding, the embedding layer approach described here is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully connected layer...

Because the embedding layer is just a more efficient implementation equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation."

Raschka provides a more detailed walkthrough [here](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/03_bonus_embedding-vs-matmul/embeddings-and-linear-layers.ipynb).
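As a rough illustration of the equivalence described in the quote, the sketch below one-hot encodes the toy token IDs and multiplies them by the embedding weight matrix. Note that this uses a plain matrix multiplication rather than a `torch.nn.Linear` layer, so it is a simplified version of the idea and not necessarily how the linked notebook sets it up.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
vocab_size, output_dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
input_ids = torch.tensor([2, 3, 5, 1])

# One-hot encode each token ID: shape (4, 6), one row per token.
one_hot = F.one_hot(input_ids, num_classes=vocab_size).float()

# Multiplying a one-hot row by the weight matrix picks out a single row of it.
via_matmul = one_hot @ embedding_layer.weight  # shape (4, 3)
via_lookup = embedding_layer(input_ids)        # shape (4, 3)

print(torch.allclose(via_matmul, via_lookup))
```

The lookup avoids building the mostly-zero one-hot matrix and doing the multiplication, which is why it is the more efficient route when the vocabulary has tens of thousands of entries.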

In [6]:
#Apply the embedding layer to all four input token IDs (2, 3, 5, 1)
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


#### Encoding Word Positions

Raschka: "The way the previously introduced embedding layer works is that the same token ID always gets mapped to the same vector representation, regardless of where the token ID is positioned in the input sequence."

But attention itself has no notion of position or order within a sequence. So while this deterministic, position-independent mapping is good for reproducibility, we need a way to inject positional information into the LLM.
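The sketch below first shows the problem (the same token ID at two different positions gets an identical vector) and then one possible fix: adding a second, learned embedding that is indexed by position instead of by token ID. This is only one way of injecting position, and the names `context_length`, `token_embedding_layer` and `pos_embedding_layer` are assumptions made for this toy example.

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim = 6, 3
context_length = 4  # assumed maximum sequence length for this toy example

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Token ID 2 appears at position 0 and again at position 2.
input_ids = torch.tensor([2, 3, 2, 1])
token_embeddings = token_embedding_layer(input_ids)

# Without positional information both occurrences map to the same vector.
print(torch.equal(token_embeddings[0], token_embeddings[2]))

# Look up one vector per position (0, 1, 2, 3) and add it to the token vectors.
positions = torch.arange(len(input_ids))
input_embeddings = token_embeddings + pos_embedding_layer(positions)

# The two occurrences of token 2 are now distinguishable.
print(input_embeddings.shape)
print(torch.equal(input_embeddings[0], input_embeddings[2]))
```

Because the position embedding has the same `output_dim` as the token embedding, the two can be summed element-wise, and both sets of weights can be trained together.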