# Tatoeba: PyTorch Embeddings

Author: Pierre Nugues

In this notebook, we will build embedding objects and embedding bags.
This is a preliminary step to understanding language detection and CLD3, https://github.com/google/cld3

In [1]:
import torch
import torch.nn as nn

In [2]:
torch.manual_seed(4321)

<torch._C.Generator at 0x131ad64d0>

## Embeddings


We use the class `Embedding(num_embeddings, embedding_dim, ...)` to create 8 dense vectors (embeddings) in a vector space of dimension 5. The vectors are stored as the rows of `embedding.weight`.

In [4]:
embedding = nn.Embedding(8, 5)

In [5]:
embedding.weight

Parameter containing:
tensor([[-0.4562, -0.9391,  1.9130, -0.7439, -0.4459],
        [-1.2780, -0.2379,  0.2242, -0.0326, -0.9581],
        [-1.5594, -0.2332,  0.6088, -0.5317, -1.3245],
        [ 2.0561, -0.8149, -0.7872, -2.3632,  1.0788],
        [ 0.8716,  0.2437, -1.5932,  0.6259, -0.4487],
        [-0.2243,  0.1594,  0.7884, -0.1794, -0.3420],
        [-1.0898, -0.2070,  0.1724,  0.1432,  0.8795],
        [ 0.0064, -0.6666,  0.9459,  1.7336,  0.2753]], requires_grad=True)

An embedding layer acts as a lookup table: given a tensor of indices, it returns the corresponding rows of the weight matrix.

In [6]:
embedding(torch.LongTensor([3, 2]))

tensor([[ 2.0561, -0.8149, -0.7872, -2.3632,  1.0788],
        [-1.5594, -0.2332,  0.6088, -0.5317, -1.3245]],
       grad_fn=<EmbeddingBackward0>)
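The lookup can be sketched in plain Python, independently of PyTorch. The toy `table` and the helper names `lookup`, `one_hot`, and `vec_mat` below are illustrative assumptions, not part of the notebook or of the PyTorch API. The sketch also shows that selecting row `i` gives the same result as multiplying a one-hot vector by the table, which is why an embedding layer can be seen as a linear layer applied to one-hot inputs.

```python
# Toy table: 4 embeddings of dimension 3 (made-up values, for illustration only).
table = [
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]


def lookup(table, indices):
    """Return the rows of `table` selected by `indices`, in order."""
    return [table[i] for i in indices]


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def vec_mat(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vector, matrix))
            for c in range(n_cols)]


indices = [3, 2]
rows = lookup(table, indices)
via_one_hot = [vec_mat(one_hot(i, len(table)), table) for i in indices]

print(rows)                 # rows 3 and 2 of the table
print(rows == via_one_hot)  # both routes select the same vectors
```

In practice, the direct lookup is preferred because it avoids building the one-hot vectors and the full matrix product.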

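When the input sequences have variable lengths, we have to align them to a maximal length. We then need a padding symbol to fill the sequences that are shorter than this maximal length. We tell Torch which index represents the padding symbol, usually 0, with the `padding_idx` parameter. The embedding of the padding symbol is a vector of zeros.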

In [7]:
embedding = nn.Embedding(8, 5, padding_idx=0)

In [8]:
embedding

Embedding(8, 5, padding_idx=0)

In [8]:
embedding.weight

Parameter containing:
tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.2780, -0.2379,  0.2242, -0.0326, -0.9581],
        [-1.5594, -0.2332,  0.6088, -0.5317, -1.3245],
        [ 2.0561, -0.8149, -0.7872, -2.3632,  1.0788],
        [ 0.8716,  0.2437, -1.5932,  0.6259, -0.4487],
        [-0.2243,  0.1594,  0.7884, -0.1794, -0.3420],
        [-1.0898, -0.2070,  0.1724,  0.1432,  0.8795],
        [ 0.0064, -0.6666,  0.9459,  1.7336,  0.2753]], requires_grad=True)

In [9]:
embedding(torch.LongTensor([0, 3, 2]))

tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.2355,  2.1430, -0.3934,  0.0314, -0.6845],
        [ 1.2457,  0.8432, -0.0916,  1.0716, -1.1500]],
       grad_fn=<EmbeddingBackward0>)
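To make the role of the padding index concrete, here is a minimal plain-Python sketch that aligns index sequences of different lengths by appending the padding index. The function `pad_sequences` and the sample `sequences` are hypothetical names and data introduced for illustration; they do not come from the notebook.

```python
PAD_IDX = 0  # index reserved for padding, matching padding_idx=0 above


def pad_sequences(sequences, pad_idx=PAD_IDX):
    """Right-pad every sequence with `pad_idx` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]


# Three sequences of embedding indices with lengths 2, 1, and 3.
sequences = [[3, 2], [5], [1, 4, 7]]
padded = pad_sequences(sequences)
print(padded)  # [[3, 2, 0], [5, 0, 0], [1, 4, 7]]
```

The padded list is rectangular, so it can be converted to a 2-D tensor and passed to an embedding layer created with `padding_idx=0`. The positions filled with 0 then map to the zero vector and, in PyTorch, this row does not receive gradient updates during training.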

## Embedding Bags
An embedding bag looks up the embeddings of a variable-length sequence of indices, a bag, and reduces them to a single vector, here by summing them. CLD3 uses a weighted sum of a variable number of embeddings. See https://github.com/google/cld3

In [18]:
embedding_bag = nn.EmbeddingBag(8, 5, mode='sum')  # The default mode is 'mean'

In [19]:
embedding_bag.weight

Parameter containing:
tensor([[-1.7122,  0.3235, -0.6941, -0.1128, -0.3620],
        [-0.0215, -0.3685, -0.2244,  2.4480, -1.4205],
        [-1.0848,  0.3204, -1.3139,  0.8290, -0.2611],
        [ 0.3025, -1.0016, -0.2019,  0.0992, -1.1972],
        [ 1.4835, -0.5860,  0.7933, -0.7355, -0.9182],
        [-1.0781,  0.0139,  0.6118,  1.1398,  0.1750],
        [ 1.0420,  0.1025,  0.6203, -0.2963,  0.2832],
        [ 0.8905, -0.9703,  1.7193, -0.1643,  1.8135]], requires_grad=True)

An `EmbeddingBag` object takes the bags of indices to sum as its first argument. With a 2-D input, each row is one bag.

In [20]:
embedding_bag(torch.tensor([[1, 2], [3, 4]]))

tensor([[-1.1063, -0.0481, -1.5382,  3.2771, -1.6816],
        [ 1.7861, -1.5875,  0.5914, -0.6363, -2.1154]],
       grad_fn=<EmbeddingBagBackward0>)

In [21]:
embedding_bag(torch.tensor([[1]])) + embedding_bag(torch.tensor([[2]]))

tensor([[-1.1063, -0.0481, -1.5382,  3.2771, -1.6816]], grad_fn=<AddBackward0>)

In [22]:
embedding_bag.weight[3] + embedding_bag.weight[4]

tensor([ 1.7861, -1.5875,  0.5914, -0.6363, -2.1154], grad_fn=<AddBackward0>)

Or we may have a 1-D input and pass the starting position of each bag in this input as the second parameter, `offsets`. The last bag extends to the end of the input.

In [23]:
embedding_bag(torch.tensor([1, 2, 3, 4]), offsets=torch.tensor([0, 2]))

tensor([[-1.1063, -0.0481, -1.5382,  3.2771, -1.6816],
        [ 1.7861, -1.5875,  0.5914, -0.6363, -2.1154]],
       grad_fn=<EmbeddingBagBackward0>)
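The way `offsets` delimits the bags can be sketched in plain Python. The helpers `split_into_bags` and `sum_bag` and the toy `table` are assumptions made for this illustration; they mimic the behavior described above and are not PyTorch functions. The second call shows the case the 1-D form is meant for: bags of different sizes.

```python
def split_into_bags(indices, offsets):
    """Split a flat index list into bags.

    Bag k starts at offsets[k] and ends where bag k + 1 starts;
    the last bag runs to the end of `indices`.
    """
    ends = offsets[1:] + [len(indices)]
    return [indices[start:end] for start, end in zip(offsets, ends)]


def sum_bag(table, bag):
    """Sum the rows of `table` selected by the indices in `bag`."""
    dim = len(table[0])
    return [sum(table[i][d] for i in bag) for d in range(dim)]


# Toy table of 8 embeddings of dimension 3: row i is [10i, 10i + 1, 10i + 2].
table = [[float(10 * i + d) for d in range(3)] for i in range(8)]

indices = [1, 2, 3, 4]

even_bags = split_into_bags(indices, [0, 2])
print(even_bags)  # [[1, 2], [3, 4]]

ragged_bags = split_into_bags(indices, [0, 1, 3])
print(ragged_bags)  # [[1], [2, 3], [4]]

summed = [sum_bag(table, bag) for bag in even_bags]
print(summed)  # [[30.0, 32.0, 34.0], [70.0, 72.0, 74.0]]
```

A 2-D input forces all the bags to have the same size, while the flat input with `offsets` needs no padding.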

We can also compute a weighted sum using the `per_sample_weights` parameter. Its shape must be the same as that of the input.

In [24]:
embedding_bag(torch.tensor([[1, 2], [3, 4]]), per_sample_weights=torch.tensor(
    [[0.5, 0.5], [0.2, 0.8]]))

tensor([[-0.5532, -0.0241, -0.7691,  1.6385, -0.8408],
        [ 1.2473, -0.6691,  0.5942, -0.5686, -0.9740]],
       grad_fn=<EmbeddingBagBackward0>)

In [27]:
0.5 * embedding_bag.weight[1] + 0.5 * embedding_bag.weight[2]

tensor([-0.5532, -0.0241, -0.7691,  1.6385, -0.8408], grad_fn=<AddBackward0>)

In [26]:
0.2 * embedding_bag.weight[3] + 0.8 * embedding_bag.weight[4]

tensor([ 1.2473, -0.6691,  0.5942, -0.5686, -0.9740], grad_fn=<AddBackward0>)

With a 1-D input and `offsets`, the weights are also given as a 1-D tensor:

In [28]:
embedding_bag(torch.tensor([1, 2, 3, 4]),
              offsets=torch.tensor([0, 2]),
              per_sample_weights=torch.tensor([0.5, 0.5, 0.2, 0.8]))

tensor([[-0.5532, -0.0241, -0.7691,  1.6385, -0.8408],
        [ 1.2473, -0.6691,  0.5942, -0.5686, -0.9740]],
       grad_fn=<EmbeddingBagBackward0>)
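As a cross-check, the weighted sums can be recomputed by hand in plain Python from rows 1 to 4 of `embedding_bag.weight` as printed above. The function `weighted_bag` is a hypothetical helper written for this sketch. Because the rows are copied from the four-decimal display, the results should agree with the PyTorch outputs only up to rounding.

```python
# Rows 1 to 4 of embedding_bag.weight, copied from the output printed above.
weight = {
    1: [-0.0215, -0.3685, -0.2244, 2.4480, -1.4205],
    2: [-1.0848, 0.3204, -1.3139, 0.8290, -0.2611],
    3: [0.3025, -1.0016, -0.2019, 0.0992, -1.1972],
    4: [1.4835, -0.5860, 0.7933, -0.7355, -0.9182],
}


def weighted_bag(weight, bag, sample_weights):
    """Return sum_k sample_weights[k] * weight[bag[k]], one value per dimension."""
    dim = len(weight[bag[0]])
    return [sum(w * weight[i][d] for i, w in zip(bag, sample_weights))
            for d in range(dim)]


first = weighted_bag(weight, [1, 2], [0.5, 0.5])
second = weighted_bag(weight, [3, 4], [0.2, 0.8])

# Values reported by PyTorch in the cells above.
expected_first = [-0.5532, -0.0241, -0.7691, 1.6385, -0.8408]
expected_second = [1.2473, -0.6691, 0.5942, -0.5686, -0.9740]


def close(xs, ys, tol=1e-3):
    """True if the two vectors agree within `tol` in every dimension."""
    return all(abs(x - y) < tol for x, y in zip(xs, ys))


print(close(first, expected_first), close(second, expected_second))
```

Note that PyTorch supports `per_sample_weights` only when the bag is created with `mode='sum'`, as is the case in this notebook.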