## 2.7 Building Token Embeddings

- Our data for the large language model (LLM) is almost ready.
- The last step is to pass the tokens through an embedding layer, which maps each token to a continuous vector. A token ID is only a discrete index, so it must be mapped into a continuous vector space before subsequent operations can be performed on it. The result of this mapping is the token's embedding.
- These embedding layers are usually part of the LLM itself, and their weights are adjusted and optimized during model training.

<img src="https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-15.jpg?raw=true" width="400px">

- Assume that, after tokenization, we have the following four input tokens, whose input IDs are 2, 3, 5, and 1 respectively:

In [1]:
import torch
input_ids = torch.tensor([2, 3, 5, 1])

- To simplify the problem, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3.

In [2]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

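- This creates a 6x3 weight matrix with one row per vocabulary entry and one column per embedding dimension. The values are randomly initialized: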

In [3]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


- For those familiar with one-hot encoding, the embedding layer approach above is essentially a more efficient way to implement one-hot encoding followed by matrix multiplication in a fully connected layer. A detailed comparison can be found in the supplementary code [./embedding_vs_matmul](https://github.com/datawhalechina/llms-from-scratch-cn/tree/main/ch02/03_bonus_embedding-vs-matmul).
- Because the embedding layer is just a more efficient implementation of one-hot encoding plus matrix multiplication, it can be viewed as a neural network layer that can be optimized via backpropagation.
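
The equivalence described above can be checked directly. The following is a minimal sketch that reuses the same seed, vocabulary size, and `input_ids` as this section; the names `one_hot`, `via_matmul`, and `via_lookup` are introduced here for illustration only and do not appear in the supplementary code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
vocab_size, output_dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
input_ids = torch.tensor([2, 3, 5, 1])

# One-hot encode each ID: shape (4, 6), with a single 1.0 per row
one_hot = F.one_hot(input_ids, num_classes=vocab_size).float()

# Multiplying by the (6, 3) weight matrix selects one weight row per token
via_matmul = one_hot @ embedding_layer.weight

# The embedding layer returns the same rows without building the one-hot matrix
via_lookup = embedding_layer(input_ids)

print(torch.allclose(via_matmul, via_lookup))  # True
```

The lookup avoids materializing a `(num_tokens, vocab_size)` one-hot matrix, which matters once the vocabulary has tens of thousands of entries.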

- To convert the token with ID 3 into a 3-dimensional vector, we pass it to the embedding layer:

In [4]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


- Note that this is the fourth row of the `embedding_layer` weight matrix (index 3, since indexing starts at 0).
- To embed all four `input_ids` values above, we do the following:

In [5]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
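
The `grad_fn=<EmbeddingBackward0>` attached to these outputs indicates that autograd tracks the lookup, which is what allows the embedding weights to be trained. As a rough sketch of that idea, the code below backpropagates a placeholder loss (simply the sum of the embedding values, an assumption made for illustration and not a real training objective) and inspects the gradient on the weight matrix.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)
input_ids = torch.tensor([2, 3, 5, 1])

# Placeholder loss (illustrative assumption): sum of all looked-up values
loss = embedding_layer(input_ids).sum()
loss.backward()

# Rows 1, 2, 3, and 5 were looked up, so each of their entries gets gradient 1.0;
# rows 0 and 4 were not used and keep a zero gradient
print(embedding_layer.weight.grad)
```

Only the rows that were actually looked up receive a nonzero gradient, so a training step on this batch would leave the embeddings of tokens 0 and 4 unchanged.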


- The embedding layer is essentially a lookup operation:

<img src="https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-16.jpg?raw=true" width="500px">
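
One way to see the lookup concretely is to index the weight matrix with the token IDs and compare the result with the layer's output. This is a small sketch under the same seed and sizes used in this section; the variable name `rows` is introduced here for illustration.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)
input_ids = torch.tensor([2, 3, 5, 1])

# Index the weight matrix directly with the token IDs: one row per ID
rows = embedding_layer.weight[input_ids]

# The embedding layer's forward pass returns exactly these rows
print(torch.equal(rows, embedding_layer(input_ids)))  # True
```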

- **You may be interested in additional content comparing Embedding layers to regular Linear layers: [../03_bonus_embedding-vs-matmul](https://github.com/datawhalechina/llms-from-scratch-cn/tree/main/ch02/03_bonus_embedding-vs-matmul)**