# Tatoeba: PyTorch Embeddings

Author: Pierre Nugues

In this notebook, we will build embedding objects and embedding bags.
This is a preliminary step to understanding language detection and CLD3: https://github.com/google/cld3

In [1]:
import random
import torch
import torch.nn as nn

In [2]:
random.seed(4321)
torch.manual_seed(4321)

<torch._C.Generator at 0x7fe3a8ad8070>

## Embeddings


We use the class `Embedding(num_embeddings, embedding_dim, ...)` to create 8 dense vectors (embeddings) in a vector space of dimension 5. The vectors are the rows of a trainable weight matrix, initialized at random.

In [3]:
embedding = nn.Embedding(8, 5)

In [4]:
embedding.weight

Parameter containing:
tensor([[-0.4716, -0.3436, -1.1742,  0.1221,  1.3231],
        [-0.6415,  0.8538, -1.8969,  0.2142,  1.1937],
        [-0.8704,  0.2439, -0.0453,  1.4172, -0.0614],
        [ 1.5471,  1.4126,  0.0268,  0.5757, -0.8794],
        [-0.0493,  0.0144, -0.3218, -0.1144, -0.6089],
        [-0.1303,  0.1426,  1.6467,  0.8824, -0.8752],
        [-0.4935,  0.4820, -0.6308, -0.1754,  0.3182],
        [ 1.7125, -1.5122,  0.5076,  0.1487,  0.4369]], requires_grad=True)

An embedding layer acts as a lookup table: each input index selects the corresponding row of the weight matrix.

In [5]:
embedding(torch.LongTensor([3, 2]))

tensor([[ 1.5471,  1.4126,  0.0268,  0.5757, -0.8794],
        [-0.8704,  0.2439, -0.0453,  1.4172, -0.0614]],
       grad_fn=<EmbeddingBackward0>)
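To make the lookup-table behaviour concrete, here is a minimal sketch that compares the output of the layer with plain row indexing of its weight matrix. The names `table`, `indices`, `looked_up`, and `indexed` are illustrative and not part of the original notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(4321)
table = nn.Embedding(8, 5)          # illustrative name: 8 rows of dimension 5

indices = torch.tensor([3, 2])
looked_up = table(indices)          # forward pass of the layer
indexed = table.weight[indices]     # plain row indexing of the parameter

# Both ways of reading the table should return the same rows
same = torch.equal(looked_up, indexed)
print(looked_up.shape, same)
```

The forward pass adds nothing beyond the row selection; the layer exists so that the rows can be trained with the rest of a model.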

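When the input sequences have variable lengths, we have to align them to a maximal length. We then need a padding symbol to fill the sequences shorter than this length. We tell PyTorch which index represents the padding symbol, usually 0, with the `padding_idx` parameter. The corresponding embedding is initialized to a zero vector and is not updated during training.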

In [6]:
embedding = nn.Embedding(8, 5, padding_idx=0)

In [7]:
embedding.weight

Parameter containing:
tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.2780, -0.2379,  0.2242, -0.0326, -0.9581],
        [-1.5594, -0.2332,  0.6088, -0.5317, -1.3245],
        [ 2.0561, -0.8149, -0.7872, -2.3632,  1.0788],
        [ 0.8716,  0.2437, -1.5932,  0.6259, -0.4487],
        [-0.2243,  0.1594,  0.7884, -0.1794, -0.3420],
        [-1.0898, -0.2070,  0.1724,  0.1432,  0.8795],
        [ 0.0064, -0.6666,  0.9459,  1.7336,  0.2753]], requires_grad=True)

In [8]:
embedding(torch.LongTensor([0, 3, 2]))

tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 2.0561, -0.8149, -0.7872, -2.3632,  1.0788],
        [-1.5594, -0.2332,  0.6088, -0.5317, -1.3245]],
       grad_fn=<EmbeddingBackward0>)
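As a sketch of how padding is used in practice, the cell below pads a few made-up index sequences to a common length and then checks that the padding row receives a zero gradient. The sequences, the constant `PAD`, and the name `padded_embedding` are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

PAD = 0  # index reserved for padding (assumption: real symbols start at 1)
sequences = [[3, 2, 5], [4], [1, 6]]  # made-up index sequences

# Right-pad every sequence to the length of the longest one
max_len = max(len(seq) for seq in sequences)
padded = torch.tensor([seq + [PAD] * (max_len - len(seq)) for seq in sequences])

padded_embedding = nn.Embedding(8, 5, padding_idx=PAD)
vectors = padded_embedding(padded)    # shape: (3 sequences, 3 positions, 5)

# Backpropagate a dummy loss and read the gradient of the padding row
vectors.sum().backward()
pad_grad = padded_embedding.weight.grad[PAD]
print(vectors.shape)
print(pad_grad)
```

Because the gradient of the padding row is zero, an optimizer leaves this row unchanged.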

## Embedding Bags
Embedding bags handle sequences of embeddings of variable length when the embeddings are summed: each bag of indices is reduced to a single vector. In CLD3, we have a weighted sum of a variable number of embeddings. See https://github.com/google/cld3

In [9]:
embedding_bag = nn.EmbeddingBag(8, 5, mode='sum')

In [10]:
embedding_bag.weight

Parameter containing:
tensor([[ 1.6391, -0.1375,  0.0677, -0.0138, -0.4869],
        [-0.7907,  2.7215,  0.1517, -1.2534,  0.0896],
        [ 1.2457,  0.8432, -0.0916,  1.0716, -1.1500],
        [-1.2355,  2.1430, -0.3934,  0.0314, -0.6845],
        [-3.5251, -2.2849, -0.7089,  1.5505, -0.4069],
        [-0.0530,  0.5492,  1.4681, -0.4333,  0.0309],
        [-0.5371, -1.9131,  0.6755,  1.3665,  0.7476],
        [-0.8257, -0.0985,  0.3710, -1.2320,  0.1728]], requires_grad=True)

An `EmbeddingBag` object takes the bags of indices it will sum as its first argument. With a 2-D input, each row is a bag.

In [11]:
embedding_bag(torch.tensor([[1, 2], [3, 4]]))

tensor([[ 0.4549,  3.5647,  0.0601, -0.1818, -1.0604],
        [-4.7606, -0.1419, -1.1023,  1.5819, -1.0914]],
       grad_fn=<EmbeddingBagBackward0>)

In [23]:
embedding_bag.weight[1] + embedding_bag.weight[2]

tensor([ 0.4549,  3.5647,  0.0601, -0.1818, -1.0604], grad_fn=<AddBackward0>)
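Another way to read the result above: with `mode='sum'`, an embedding bag should give the same output as an `Embedding` lookup followed by a sum over each bag. The sketch below checks this with a shared weight matrix. The names `shared`, `lookup`, and `bag` are illustrative, and the random matrix is not the one printed earlier.

```python
import torch
import torch.nn as nn

torch.manual_seed(4321)
shared = torch.randn(8, 5)   # one weight matrix used by both layers

lookup = nn.Embedding.from_pretrained(shared)
bag = nn.EmbeddingBag.from_pretrained(shared, mode='sum')

bags = torch.tensor([[1, 2], [3, 4]])
via_bag = bag(bags)                     # sums inside the layer
via_lookup = lookup(bags).sum(dim=1)    # look up, then sum over each bag

print(torch.allclose(via_bag, via_lookup))
```

The embedding bag avoids materializing the intermediate tensor of individual embeddings.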

Or we may pass a 1-D input containing all the indices and, as second argument, `offsets`, the start position of each bag in this input.

In [24]:
embedding_bag(torch.tensor([1, 2, 3, 4]), offsets=torch.tensor([0, 2]))

tensor([[ 0.4549,  3.5647,  0.0601, -0.1818, -1.0604],
        [-4.7606, -0.1419, -1.1023,  1.5819, -1.0914]],
       grad_fn=<EmbeddingBagBackward0>)
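The 1-D form is what makes bags of different sizes possible, which a rectangular 2-D input cannot express. A minimal sketch, with illustrative names and a freshly initialized layer, so the values differ from those printed above:

```python
import torch
import torch.nn as nn

torch.manual_seed(4321)
bag = nn.EmbeddingBag(8, 5, mode='sum')

# Two bags of different sizes: [1, 2, 3] and [4]
flat_indices = torch.tensor([1, 2, 3, 4])
offsets = torch.tensor([0, 3])   # bag 0 starts at position 0, bag 1 at position 3

summed = bag(flat_indices, offsets=offsets)

# Manual check of both bags
expected_first = bag.weight[1] + bag.weight[2] + bag.weight[3]
expected_second = bag.weight[4]
print(torch.allclose(summed[0], expected_first),
      torch.allclose(summed[1], expected_second))
```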

We can also compute a weighted sum using the `per_sample_weights` parameter. Its shape must be the same as that of the input.

In [25]:
embedding_bag(torch.tensor([[1, 2], [3, 4]]), per_sample_weights=torch.tensor([[0.5, 0.5], [0.2, 0.8]]))

tensor([[ 0.2275,  1.7823,  0.0301, -0.0909, -0.5302],
        [-3.0672, -1.3993, -0.6458,  1.2467, -0.4624]],
       grad_fn=<EmbeddingBagBackward0>)

In [26]:
0.5 * embedding_bag.weight[1] + 0.5 * embedding_bag.weight[2]

tensor([ 0.2275,  1.7823,  0.0301, -0.0909, -0.5302], grad_fn=<AddBackward0>)

In [27]:
0.2 * embedding_bag.weight[3] + 0.8 * embedding_bag.weight[4]

tensor([-3.0672, -1.3993, -0.6458,  1.2467, -0.4624], grad_fn=<AddBackward0>)

The same weighted sum with a 1-D input and `offsets`

In [28]:
embedding_bag(torch.tensor([1, 2, 3, 4]), 
              offsets=torch.tensor([0, 2]), 
              per_sample_weights=torch.tensor([0.5, 0.5, 0.2, 0.8]))

tensor([[ 0.2275,  1.7823,  0.0301, -0.0909, -0.5302],
        [-3.0672, -1.3993, -0.6458,  1.2467, -0.4624]],
       grad_fn=<EmbeddingBagBackward0>)
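Combining `offsets` and `per_sample_weights` gives a weighted sum over a variable number of embeddings. The sketch below builds the three tensors from per-text symbol counts and uses relative frequencies as weights. The counts are made up, the choice of relative frequencies is an assumption for illustration, and this is not claimed to reproduce the CLD3 feature extraction.

```python
import torch
import torch.nn as nn

torch.manual_seed(4321)
bag = nn.EmbeddingBag(8, 5, mode='sum')   # per_sample_weights requires mode='sum'

# Hypothetical counts of the symbol indices found in two texts
counts = [{1: 3, 2: 1, 5: 4}, {4: 2, 7: 2}]

flat_indices, weights, offsets = [], [], []
for text_counts in counts:
    offsets.append(len(flat_indices))       # start position of this bag
    total = sum(text_counts.values())
    for index, count in text_counts.items():
        flat_indices.append(index)
        weights.append(count / total)       # relative frequency in the text

weighted = bag(torch.tensor(flat_indices),
               offsets=torch.tensor(offsets),
               per_sample_weights=torch.tensor(weights))

# Manual check of the first bag
expected = (0.375 * bag.weight[1] + 0.125 * bag.weight[2]
            + 0.5 * bag.weight[5])
print(torch.allclose(weighted[0], expected, atol=1e-6))
```

Each text yields one vector of dimension 5, whatever the number of distinct symbols it contains.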