# Embedding layer and Linear layer

- In PyTorch, an Embedding layer produces the same result as a Linear layer that multiplies one-hot encoded inputs by a weight matrix; we use Embedding layers because the lookup is more computationally efficient.
- We will step through this relationship using code examples in PyTorch.

In [1]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 1.12.1+cu113


## Using nn.Embedding

In [3]:
# Suppose we have the following 3 training examples,
# which may represent token IDs in the context of a language model (LM)
idx = torch.tensor([2, 3, 1])

# The number of rows in the embedding matrix is the largest token ID + 1.
# If the highest token ID is 3, we need 4 rows, one for each possible
# token ID: 0, 1, 2, 3
num_idx = int(idx.max()) + 1

# The desired embedding dimension is a hyperparameter
out_dim = 5

- Implement a simple embedding layer

In [4]:
# Set a random seed for reproducibility, because the weights of the
# embedding layer are initialized with random values
torch.manual_seed(123)

# Create an embedding layer with num_idx rows (one per ID) and out_dim columns
embedding = torch.nn.Embedding(num_idx, out_dim)

- View the weights of the embedding layer:

In [5]:
embedding.weight

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035, -0.5880,  1.5810],
        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015],
        [ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953]], requires_grad=True)
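
- As a quick sanity check, the weight matrix has one row per ID and one column per embedding dimension. The sketch below rebuilds the layer with the same sizes as above (4 IDs, 5 dimensions); the variable name `weight_shape` is only used for illustration:

```python
import torch

torch.manual_seed(123)

num_idx, out_dim = 4, 5  # same sizes as in the example above
embedding = torch.nn.Embedding(num_idx, out_dim)

# One row per ID, one column per embedding dimension
weight_shape = tuple(embedding.weight.shape)
print(weight_shape)  # (4, 5)
```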

- Use the embedding layer to get the vector representation of the training example with ID 1:

In [6]:
embedding(torch.tensor([1]))

tensor([[ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],
       grad_fn=<EmbeddingBackward0>)

- Below is a visualization of the underlying operation

<img src="images/1.png" width="400px">

- Similarly, we can use the Embedding layer to get the vector representation of the training sample with ID 2:

In [7]:
embedding(torch.tensor([2]))

tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315]],
       grad_fn=<EmbeddingBackward0>)

<img src="images/2.png" width="400px">

- Now, let's transform all the training examples we defined previously:

In [8]:
# The output stacks the weight-matrix rows at indices 2, 3, and 1, in that order
idx = torch.tensor([2, 3, 1])
embedding(idx)

tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],
        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],
       grad_fn=<EmbeddingBackward0>)

- Under the hood, it's still the same look-up concept:

<img src="images/3.png" width="450px">
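
- The look-up can also be written as plain row indexing into the weight matrix. The following is a minimal sketch, assuming a default `nn.Embedding` (no `padding_idx` or `max_norm` options); the names `looked_up` and `indexed` are illustrative:

```python
import torch

torch.manual_seed(123)

embedding = torch.nn.Embedding(4, 5)
idx = torch.tensor([2, 3, 1])

# Forward pass through the embedding layer
looked_up = embedding(idx)

# Selecting the same rows directly from the weight matrix
indexed = embedding.weight[idx]

same_rows = torch.equal(looked_up, indexed)
print(same_rows)
```

- Both expressions select rows 2, 3, and 1 of the weight matrix, so no arithmetic is involved in the forward pass.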

## Using nn.Linear

- Next, we show that the embedding layer above accomplishes exactly the same thing as an `nn.Linear` layer applied to a one-hot encoded representation of the inputs
- First, we convert the IDs to a one-hot representation:

In [12]:
onehot = torch.nn.functional.one_hot(idx)
onehot

tensor([[0, 0, 1, 0],
        [0, 0, 0, 1],
        [0, 1, 0, 0]])
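
- Note that `one_hot` infers the number of columns from the largest value in its input when `num_classes` is not given. This works here because `idx` contains the highest ID (3). The sketch below, which uses a hypothetical batch `small_batch` that does not contain the highest ID, shows why passing `num_classes` explicitly can be safer:

```python
import torch
import torch.nn.functional as F

idx = torch.tensor([2, 3, 1])

# Number of classes inferred as max(idx) + 1 = 4
inferred_shape = tuple(F.one_hot(idx).shape)

# Hypothetical batch that only contains IDs 0 and 1
small_batch = torch.tensor([1, 0])

# Without num_classes, only 2 columns are created
too_narrow_shape = tuple(F.one_hot(small_batch).shape)

# With num_classes, the width matches the embedding matrix
fixed_shape = tuple(F.one_hot(small_batch, num_classes=4).shape)

print(inferred_shape, too_narrow_shape, fixed_shape)  # (3, 4) (2, 2) (2, 4)
```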

- Next, we initialize a Linear layer, which carries out the matrix multiplication $X W^\top$:

In [16]:
torch.manual_seed(123)
# Initialize a Linear layer that maps num_idx inputs to out_dim outputs and has no bias term
linear = torch.nn.Linear(num_idx, out_dim, bias=False)
print(linear.weight)

Parameter containing:
tensor([[-0.2039,  0.0166, -0.2483,  0.1886],
        [-0.4260,  0.3665, -0.3634, -0.3975],
        [-0.3159,  0.2264, -0.1847,  0.1871],
        [-0.4244, -0.3034, -0.1836, -0.0983],
        [-0.3814,  0.3274, -0.1179,  0.1605]], requires_grad=True)
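
- The printed weight matrix has shape `(out_dim, num_idx)`, which is why the layer computes $X W^\top$ rather than $X W$. The sketch below checks this against an explicit matrix product; the input `X` is a hypothetical random batch used only for illustration:

```python
import torch

torch.manual_seed(123)

num_idx, out_dim = 4, 5  # same sizes as in the example above
linear = torch.nn.Linear(num_idx, out_dim, bias=False)

# nn.Linear stores its weight as (out_features, in_features)
weight_shape = tuple(linear.weight.shape)
print(weight_shape)  # (5, 4)

# Hypothetical input batch with 3 rows
X = torch.randn(3, num_idx)

# The layer output should agree with the explicit product X W^T
manual = X @ linear.weight.T
matches_manual = torch.allclose(linear(X), manual)
print(matches_manual)
```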


- Note that the `Linear` layer in PyTorch is also initialized with small random weights. To compare it directly with the `Embedding` layer above, both layers must use the same weights, which is why we reassign them here. The transpose is needed because `nn.Linear` stores its weight matrix with shape `(out_dim, num_idx)`:

In [17]:
# Copy the (transposed) embedding weights into the linear layer so that
# both layers use identical weights and their outputs are directly comparable
linear.weight = torch.nn.Parameter(embedding.weight.T.detach())

- Now, we can use a linear layer to process the one-hot encoded representation of the input:

In [18]:
linear(onehot.float())

tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],
        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]], grad_fn=<MmBackward0>)

- As we can see, this is exactly the same result we get when we use the Embedding layer:

In [19]:
embedding(idx)

tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],
        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],
       grad_fn=<EmbeddingBackward0>)
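
- Comparing printed tensors by eye only checks the displayed decimal places. A programmatic comparison is a minimal way to confirm the match; the sketch below rebuilds both layers from scratch, and the added `.clone()` is an assumption on our part so that the two layers do not share memory:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

idx = torch.tensor([2, 3, 1])
num_idx, out_dim = 4, 5

embedding = torch.nn.Embedding(num_idx, out_dim)
linear = torch.nn.Linear(num_idx, out_dim, bias=False)

# Give the linear layer the (transposed) embedding weights;
# .clone() makes an independent copy instead of a shared view
linear.weight = torch.nn.Parameter(embedding.weight.T.detach().clone())

onehot = F.one_hot(idx, num_classes=num_idx).float()

outputs_match = torch.allclose(linear(onehot), embedding(idx))
print(outputs_match)
```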

- Under the hood, the following computation happens for the ID of the first training example:

<img src="images/4.png" width="450px">

- And for the ID of the second training example:

<img src="images/5.png" width="450px">

- Since all but one entry in each one-hot encoded row is 0 (by design), this matrix multiplication is essentially a lookup of the weight row selected by the one-hot encoding
- Matrix multiplication on the one-hot encoding is therefore equivalent to an embedding lookup, but it can be less efficient for large embedding matrices because it performs many unnecessary multiplications by zero
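
- To get a feel for the wasted work, the sketch below repeats the comparison with a larger embedding matrix and counts the scalar products involved. The sizes (`vocab_size`, `out_dim`, `batch`) are hypothetical values chosen for illustration, and the counts are a back-of-the-envelope estimate rather than a benchmark:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

# Hypothetical sizes, chosen only for illustration
vocab_size, out_dim, batch = 10_000, 64, 32

embedding = torch.nn.Embedding(vocab_size, out_dim)
idx = torch.randint(0, vocab_size, (batch,))

# Route 1: one-hot encode, then multiply by the weight matrix
onehot = F.one_hot(idx, num_classes=vocab_size).float()
via_matmul = onehot @ embedding.weight

# Route 2: direct lookup
via_lookup = embedding(idx)

results_match = torch.allclose(via_matmul, via_lookup)

# A dense matmul performs batch * vocab_size * out_dim scalar products,
# but only batch * out_dim of them involve a nonzero one-hot entry
total_products = batch * vocab_size * out_dim
useful_products = batch * out_dim

print(results_match)
print(total_products, useful_products)  # 20480000 2048
```

- In this setting, only 1 in `vocab_size` of the products contributes to the result, and the one-hot matrix itself needs `batch * vocab_size` entries of memory that the lookup never materializes.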