# 2 Embeddings

In section 1 (ConvertTextToTensors), we operated on high-dimensional bag-of-words vectors whose length equals the
size of the vocabulary, and we explicitly converted low-dimensional positional representation vectors into sparse
one-hot representations. This one-hot representation is not memory-efficient. In addition, each word is treated
independently of the others, i.e. one-hot encoded vectors do not express any semantic similarity between words.
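
To make the last point concrete, here is a minimal sketch using a hypothetical four-word vocabulary (not the
vocabulary from section 1). The dot product of any two distinct one-hot vectors is zero, so related words look
no more alike than unrelated ones.

```python
# Hypothetical toy vocabulary, used only for illustration
toy_vocab = ["king", "queen", "apple", "banana"]


def one_hot(word, vocabulary):
    """Return a list with 1 at the word's index and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocabulary]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


king, queen, apple = (one_hot(w, toy_vocab) for w in ["king", "queen", "apple"])

# Every pair of distinct words has the same similarity: zero
print(dot(king, queen))  # 0
print(dot(king, apple))  # 0
# Each vector is as long as the vocabulary, yet stores a single non-zero entry
print(len(king), sum(king))  # 4 1
```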

## 2.1 What is an embedding?
The idea of an embedding is to represent words by lower-dimensional dense vectors that reflect the
semantic meaning of a word. We will later discuss how to build meaningful word embeddings, but for now
let's just think of embeddings as a way to lower the dimensionality of a word vector.

So, an embedding layer takes a word as input and produces an output vector of the specified embedding_size.
In a sense, it is very similar to a Linear layer, but instead of taking a one-hot encoded vector, it
takes a word number as input.
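
The relationship to a Linear layer can be checked directly. The sketch below uses arbitrarily chosen sizes
(5 words, 3 dimensions) and suggests that looking up row `i` of an `Embedding` gives the same result as
multiplying a one-hot vector for `i` by the embedding weight matrix.

```python
import torch

torch.manual_seed(0)
toy_vocab_size, toy_embed_dim = 5, 3  # arbitrary sizes for illustration
embedding = torch.nn.Embedding(toy_vocab_size, toy_embed_dim)

word_index = torch.tensor([2])  # a word represented by its number

# Path 1: direct lookup by word number
looked_up = embedding(word_index)

# Path 2: one-hot encode, then multiply by the same weight matrix
one_hot = torch.nn.functional.one_hot(word_index, num_classes=toy_vocab_size).float()
multiplied = one_hot @ embedding.weight

print(looked_up.shape)  # torch.Size([1, 3])
print(torch.allclose(looked_up, multiplied))  # True
```

The lookup avoids building the one-hot vector at all, which is where the memory saving comes from.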

## 2.2 First model that uses an embedding layer as input layer
By using an embedding layer as the first layer in our network, we can switch from a bag-of-words to an embedding bag
model, where we first convert each word in our text into the corresponding embedding, and then compute some
aggregate function over all those embeddings, such as sum, average or max.
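
As a rough illustration of the aggregation step, the sketch below uses a hypothetical hand-written embedding
table with two dimensions and computes the sum, average and element-wise max over the words of a short text.
Whatever the text length, each aggregate has the embedding dimension as its size.

```python
# Hypothetical 2-dimensional embeddings, chosen by hand for illustration
toy_embeddings = {
    "good": [1.0, 3.0],
    "movie": [2.0, 0.0],
    "very": [0.0, 3.0],
}

text = ["very", "good", "movie"]
vectors = [toy_embeddings[word] for word in text]

# zip(*vectors) groups the values of each dimension together
bag_sum = [sum(dim) for dim in zip(*vectors)]
bag_mean = [sum(dim) / len(vectors) for dim in zip(*vectors)]
bag_max = [max(dim) for dim in zip(*vectors)]

print(bag_sum)   # [3.0, 6.0]
print(bag_mean)  # [1.0, 2.0]
print(bag_max)   # [2.0, 3.0]
```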


In [None]:
import torch
from collections import Counter, OrderedDict
from torchtext.vocab import vocab
import torchtext

In [None]:
class ClassifierWithSameLengthEmbedding(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.fc = torch.nn.Linear(embed_dim, num_class)

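    def forward(self, x):
        # first layer: look up an embedding for every word index -> (batch, seq_len, embed_dim)
        x = self.embedding(x)
        # second step: average the embeddings over the sequence dimension -> (batch, embed_dim)
        x = torch.mean(x, dim=1)
        # final layer: linear classifier -> (batch, num_class)
        return self.fc(x)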

## 2.3 Dealing with variable sequence size

This architecture requires mini-batches to be created in a certain way. In section 1, when using bag-of-words,
all BoW tensors in a mini-batch had the same size vocab_size, regardless of the actual length of the text
sequence. Once we move to word embeddings, each text sample contains a variable number of words, so when
combining those samples into mini-batches we have to apply some padding.

This can be done by passing a collate_fn function to the DataLoader. In this tutorial we use two different methods:
- padding the text tensors with zeros, so that all tensors have the same length, which is the max text length in the batch
- using an offset vector, which holds the offsets of all sequences stored in one large vector
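
The two representations can be compared on a toy batch. The sketch below uses plain Python lists and made-up
token ids, not the collate functions defined later in this section, to show what each method produces.

```python
# Three already-encoded texts of different lengths (hypothetical token ids)
batch = [[5, 8, 2], [7, 1], [4]]

# Method 1: pad every sequence with zeros up to the longest one in the batch
max_len = max(len(seq) for seq in batch)
padded = [seq + [0] * (max_len - len(seq)) for seq in batch]
print(padded)  # [[5, 8, 2], [7, 1, 0], [4, 0, 0]]

# Method 2: concatenate everything and remember where each sequence starts
flat = [token for seq in batch for token in seq]
offsets = []
position = 0
for seq in batch:
    offsets.append(position)
    position += len(seq)
print(flat)     # [5, 8, 2, 7, 1, 4]
print(offsets)  # [0, 3, 5]
```

The padded form stores 9 values for 6 real tokens, while the offset form stores the 6 tokens plus one offset per sequence.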

In [None]:
# build a simple tokenizer
my_tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

# build label class list
label_classes = ['World', 'Sports', 'Business', 'Sci/Tech']


def load_dataset(storage_path):
    print("Loading dataset...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root=storage_path)
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset


path = "/tmp/pytorch/data"
train, test = load_dataset(path)


# function that builds a vocabulary from the tokens of all texts
def build_vocabulary(dataset, tokenizer, ngrams=1, min_freq=1):
    # we use a Counter to store the generated tokens so that token frequency is taken into account
    counter = Counter()
    # iterate over all rows, convert the text to word tokens, and add these tokens to the counter
    for (label, line) in dataset:
        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line), ngrams=ngrams))
    # sort the collected tokens by frequency, most frequent first
    sorted_by_token_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    # keep the sorted (token, frequency) pairs in an OrderedDict
    words_dict = OrderedDict(sorted_by_token_freq_tuples)
    # build a vocabulary from the ordered tokens
    return vocab(words_dict, min_freq=min_freq)


# build a vocab
my_vocab = build_vocabulary(train, my_tokenizer)


def encode(text, vocabulary, tokenizer):
    return [vocabulary[word] for word in tokenizer(text)]

### 2.3.1 Padding the text tensor with zeros



In [None]:
# This function reads all tuples of the batch and returns two tensors: labels and features.
# The length of the text tensor is the max length of the texts in the batch. Shorter texts
# are padded with 0.
def padding_text(b):
    # b is the list of tuples of length batch_size
    #   - first element of a tuple = label,
    #   - second = feature (text sequence)
    # build vectorized sequence
    v = [encode(x[1], my_vocab, my_tokenizer) for x in b]
    # first, compute max length of a sequence in this minibatch
    l = max(map(len, v))
    return (  # tuple of two tensors - labels and features
        torch.LongTensor([t[0] - 1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t), (0, l - len(t)), mode='constant', value=0) for t in v])
    )

In [None]:
train_loader = torch.utils.data.DataLoader(train, batch_size=16, collate_fn=padding_text, shuffle=True)

In [None]:
def select_hardware_for_training(device_name):
    if device_name == 'cpu':
        return 'cpu'
    elif device_name == 'gpu':
        # fall back to CPU when no CUDA device is available
        return 'cuda' if torch.cuda.is_available() else 'cpu'
    else:
        print("Unknown device name, choose cpu as default device")
        return 'cpu'


device = select_hardware_for_training("cpu")

vocab_size = len(my_vocab)

# build the model
model_same_length = ClassifierWithSameLengthEmbedding(vocab_size, 32, len(label_classes)).to(device)

In [None]:
# define training loop
def train_loop(net, dataloader, lr=0.01, optimizer=None, loss_fn=torch.nn.CrossEntropyLoss(), epoch_size=None,
               report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = loss_fn.to(device)
    net.train()
    total_loss, acc, count, i = 0, 0, 0, 0
    for labels, features in dataloader:
        optimizer.zero_grad()
        features, labels = features.to(device), labels.to(device)
        out = net(features)
        loss = loss_fn(out, labels)  #cross_entropy(out,labels)
        loss.backward()
        optimizer.step()
        total_loss += loss
        _, predicted = torch.max(out, 1)
        acc += (predicted == labels).sum()
        count += len(labels)
        i += 1
        if i % report_freq == 0:
            print(f"{count}: acc={acc.item() / count}")
        if epoch_size and count > epoch_size:
            break
    return total_loss.item() / count, acc.item() / count


# We are only training on 25k records here (less than one full epoch) for the sake of time, but you can continue
# training, write a function to train for several epochs, and experiment with the learning rate parameter to achieve
# higher accuracy. You should be able to reach an accuracy of about 90%.
train_loop(model_same_length, train_loader, lr=1, epoch_size=25000)

### 2.3.2 EmbeddingBag Layer and Variable-Length Sequence Representation

In the previous architecture, we needed to pad all sequences to the same length in order to fit them into a
mini-batch. This is not the most efficient way to represent variable-length sequences. Another approach is
to use an offset vector, which holds the offsets of all sequences stored in one large vector.

To work with the offset representation, we use the EmbeddingBag layer
(https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html). It is similar to Embedding, but it takes
a content vector and an offset vector as input, and it also includes an aggregation step, which can be mean, sum or max.
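
Before building the classifier, it may help to see the layer on its own. The sketch below uses toy sizes
(10 words, 4 dimensions) and made-up indices. It feeds two sequences as one flat content tensor plus offsets,
and checks that the result matches averaging the looked-up embeddings of each sequence by hand.

```python
import torch

torch.manual_seed(0)
bag = torch.nn.EmbeddingBag(10, 4, mode="mean")  # toy sizes: 10 words, 4 dimensions

# Two sequences, [1, 2, 3] and [4, 5], stored in one flat tensor
content = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])  # start position of each sequence

pooled = bag(content, offsets)
print(pooled.shape)  # torch.Size([2, 4])

# The same result computed by hand: look up each sequence and average it
manual = torch.stack([
    bag.weight[content[0:3]].mean(dim=0),
    bag.weight[content[3:5]].mean(dim=0),
])
print(torch.allclose(pooled, manual, atol=1e-6))  # True
```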

In [None]:
class ClassifierWithOffsetEmbedding(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = torch.nn.Linear(embed_dim, num_class)

    def forward(self, text, off):
        x = self.embedding(text, off)
        return self.fc(x)

In [None]:
# define a collate_fn which returns tensors with offset
def offset_text(b):
    # first, compute data tensor from all sequences
    x = [torch.tensor(encode(t[1], my_vocab, my_tokenizer)) for t in b]
    # now, compute the offsets by accumulating the tensor of sequence lengths
    o = [0] + [len(t) for t in x]
    o = torch.tensor(o[:-1]).cumsum(dim=0)
    return (
        torch.LongTensor([t[0] - 1 for t in b]),  # labels
        torch.cat(x),  # text
        o
    )

# get the data loader with the offset_text collate function
train_loader_offset = torch.utils.data.DataLoader(train, batch_size=16, collate_fn=offset_text, shuffle=True)

In [None]:
# build model with the offset embedding
model_with_offset = ClassifierWithOffsetEmbedding(vocab_size, 32, len(label_classes)).to(device)

# define a train loop for offset
def train_loop_offset(model, dataloader, lr=0.01, optimizer=None, loss_fn=torch.nn.CrossEntropyLoss(), epoch_size=None,
                    report_freq=200):
    optimizer = optimizer or torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = loss_fn.to(device)
    model.train()
    total_loss, acc, count, i = 0, 0, 0, 0
    for labels, text, off in dataloader:
        optimizer.zero_grad()
        labels, text, off = labels.to(device), text.to(device), off.to(device)
        out = model(text, off)
        loss = loss_fn(out, labels)  #cross_entropy(out,labels)
        loss.backward()
        optimizer.step()
        total_loss += loss
        _, predicted = torch.max(out, 1)
        acc += (predicted == labels).sum()
        count += len(labels)
        i += 1
        if i % report_freq == 0:
            print(f"{count}: acc={acc.item() / count}")
        if epoch_size and count > epoch_size:
            break
    return total_loss.item() / count, acc.item() / count

# train the model
train_loop_offset(model_with_offset, train_loader_offset, lr=4, epoch_size=25000)

## 2.4 Semantic Embeddings: Word2Vec
In our previous example, the model's embedding layer learnt to map words to a vector representation.
However, this representation did not carry much semantic meaning. It would be nice to learn
a vector representation in which similar words or synonyms correspond to vectors that are
close to each other in terms of some vector distance (e.g. Euclidean distance).
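
What "close to each other" means can be sketched with hypothetical 3-dimensional vectors invented for this
illustration (real embeddings have hundreds of dimensions and are learned, not written by hand). The example
computes the Euclidean distance and also cosine similarity, another commonly used measure.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.0],
    "banana": [0.0, 0.1, 0.9],
}


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# In a good semantic embedding, synonyms are closer than unrelated words
print(euclidean_distance(toy_vectors["car"], toy_vectors["automobile"]))
print(euclidean_distance(toy_vectors["car"], toy_vectors["banana"]))
print(cosine_similarity(toy_vectors["car"], toy_vectors["automobile"]))
print(cosine_similarity(toy_vectors["car"], toy_vectors["banana"]))
```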

To do that, we need to pre-train our embedding model on a large collection of text in a specific way.
One of the first ways to train semantic embeddings is called Word2Vec. It is based on two main
architectures that are used to produce a distributed representation of words:

- Continuous bag-of-words (CBoW): in this architecture, we train the model to predict a word from its surrounding
  context. Given the n-gram $(W_{-2}, W_{-1}, W_0, W_1, W_2)$, the goal of the model is to predict $W_0$ from
  $(W_{-2}, W_{-1}, W_1, W_2)$.
- Continuous skip-gram is the opposite of CBoW. The model uses the current word to predict the surrounding
  window of context words.
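
The difference between the two architectures shows up in how training examples are formed. The sketch below
only builds the examples, ignoring the actual neural network; the function name `training_examples` and the
sample sentence are made up for illustration. CBoW pairs a whole context with its center word, whereas
skip-gram pairs the center word with each context word.

```python
def training_examples(tokens, window=2):
    """Build (context, target) examples for CBoW and (center, context) pairs for skip-gram."""
    cbow, skip_gram = [], []
    for i, center in enumerate(tokens):
        # words up to `window` positions to the left and right of the center word
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        # CBoW: the whole context predicts the center word
        cbow.append((context, center))
        # Skip-gram: the center word predicts each context word separately
        skip_gram.extend((center, c) for c in context)
    return cbow, skip_gram


sentence = "i like to play football".split()
cbow, skip_gram = training_examples(sentence)

print(cbow[2])        # (['i', 'like', 'play', 'football'], 'to')
print(skip_gram[:2])  # [('i', 'like'), ('i', 'to')]
```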

![word2vec](notebooks/img/example-algorithms-for-converting-words-to-vectors.png)

CBoW is faster, while skip-gram is slower, but does a better job of representing infrequent words.

Both CBoW and skip-gram are "predictive" embeddings, in that they only take local context into account. Word2Vec
does not take advantage of global context.

FastText builds on Word2Vec by learning vector representations for each word and for the character n-grams found
within each word. The values of the representations are then averaged into one vector at each training step.
While this adds a lot of additional computation to pre-training, it enables word embeddings to encode sub-word
information.
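
To give a feel for what "character n-grams found within each word" means, here is a minimal sketch. The
helper `char_ngrams` is hypothetical, and the `<` and `>` boundary markers and the default range of 3 to 6
characters are assumptions taken from the original FastText description, not from this tutorial.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # markers distinguish prefixes and suffixes from inner n-grams
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because rare or unseen words share n-grams with known words, their vectors can still be composed from sub-word pieces.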

Another method, GloVe, builds on the idea of a co-occurrence matrix and uses neural methods to decompose the
co-occurrence matrix into more expressive, non-linear word vectors.
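
A co-occurrence matrix simply records how often pairs of words appear near each other across the whole corpus.
The sketch below counts pairs in a tiny hypothetical corpus with a window of one word. It only illustrates the
input statistic. GloVe's actual counting, weighting and decomposition are more involved and are not shown here.

```python
from collections import Counter

# Tiny hypothetical corpus, already tokenized
corpus = [
    ["i", "like", "deep", "learning"],
    ["i", "like", "nlp"],
]


def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


counts = cooccurrence_counts(corpus)
print(counts[("i", "like")])     # 2
print(counts[("like", "deep")])  # 1
print(counts[("deep", "nlp")])   # 0
```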

You can experiment with the example by switching the embeddings to FastText or GloVe, since gensim supports
several different word embedding models.

### 2.4.1 Exploring Word2Vec with gensim

To experiment with word2vec embeddings pre-trained on the Google News dataset, we can use the gensim library.

In [19]:
import gensim.downloader as api
# Load the Google News word2vec embeddings. The download is about 1.6GB, so it may take some time, and the loaded
# vectors need about 3.35 GiB of RAM (a float32 array of shape (3000000, 300)); loading fails with a MemoryError
# if that much memory is not available.
w2v = api.load('word2vec-google-news-300')

# Below we find the words most similar to 'neural'
for w, p in w2v.most_similar('neural'):
    print(f"{w} -> {p}")

In [None]:
# We can also extract the embedding vector of a word, to be used when training a classification model.
# We only show the first 20 components of the vector for clarity:

w2v.get_vector('play')[:20]

In [18]:
# A great thing about semantic embeddings is that you can manipulate the vector encoding to change the semantics.
# For example, we can ask for the word whose vector representation is as close as possible to the words king
# and woman, and as far away as possible from the word man

w2v.most_similar(positive=['king','woman'],negative=['man'])[0]


## 2.5 Using Pre-Trained Embeddings in PyTorch

We can modify the example above to pre-populate the matrix in our embedding layer with semantic embeddings,
such as Word2Vec. We need to take into account that the vocabularies of the pre-trained embedding and our text
corpus will likely not match, so we will initialize the weights for the missing words with random values:
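
The same idea can be tried without downloading the full Word2Vec model. The sketch below is a scaled-down
illustration in which `toy_itos` and `toy_pretrained` are hypothetical stand-ins for the corpus vocabulary and
the pre-trained vectors. Rows for known words are copied in, and rows for missing words are drawn from a
normal distribution.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny corpus vocabulary and a tiny "pre-trained" table
toy_itos = ["the", "cat", "zzzunknown"]
toy_pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
toy_embed_size = 2

embedding = torch.nn.Embedding(len(toy_itos), toy_embed_size)

found, not_found = 0, 0
with torch.no_grad():  # modify the weights in place without tracking gradients
    for i, word in enumerate(toy_itos):
        if word in toy_pretrained:
            embedding.weight[i] = torch.tensor(toy_pretrained[word])
            found += 1
        else:
            embedding.weight[i] = torch.normal(0.0, 1.0, (toy_embed_size,))
            not_found += 1

print(found, not_found)  # 2 1
print(embedding.weight[1])
```

Writing into `embedding.weight[i]` inside `torch.no_grad()` updates the layer's own weight matrix, so the populated rows are what the model sees during training.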


In [None]:
embed_size = len(w2v.get_vector('hello'))
print(f'Embedding size: {embed_size}')

net = ClassifierWithSameLengthEmbedding(vocab_size,embed_size,len(label_classes))

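print('Populating matrix, this will take some time...', end='')
found, not_found = 0, 0

# my_vocab.get_itos() returns the list of tokens ordered by index
with torch.no_grad():  # update the weights in place without tracking gradients
    for i, w in enumerate(my_vocab.get_itos()):
        try:
            net.embedding.weight[i] = torch.tensor(w2v.get_vector(w))
            found += 1
        except KeyError:
            # word is missing from the pre-trained vocabulary: use random values
            net.embedding.weight[i] = torch.normal(0.0, 1.0, (embed_size,))
            not_found += 1

print(f"Done, found {found} words, {not_found} words missing")
net = net.to(device)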