# Word embeddings
In this notebook we work with pretrained word embeddings from the GloVe project. We use the smallest version, which maps 400,000 words into a 50-dimensional embedding space and was trained on 6 billion tokens.
From the project description:
> "The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence".

For more details on model formulation and training procedures visit the [GloVe project website](https://nlp.stanford.edu/projects/glove/).
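
The quoted objective can be made concrete with a small sketch. The code below is an illustration, not the GloVe project's training code. It assumes the weighted least-squares form described in the GloVe paper, in which the dot product of a word vector and a context vector (plus two bias terms) is fitted to the logarithm of the co-occurrence count. The function name `glove_loss`, the 4x4 co-occurrence matrix, and the use of Adam as the optimizer are all choices made for this example. The defaults `x_max=100` and `alpha=0.75` are the values reported in the paper, to the best of my knowledge.

```python
import torch

def glove_loss(w, w_ctx, b, b_ctx, cooc, x_max=100.0, alpha=0.75):
    """Weighted squared error between w_i . w_ctx_j + biases and log co-occurrence counts."""
    i, j = torch.nonzero(cooc, as_tuple=True)          # only word pairs that co-occur
    x = cooc[i, j]
    weight = torch.clamp(x / x_max, max=1.0) ** alpha  # rare pairs count less, weight is capped at 1
    pred = (w[i] * w_ctx[j]).sum(dim=1) + b[i] + b_ctx[j]
    return (weight * (pred - torch.log(x)) ** 2).sum()

torch.manual_seed(0)
V, D = 4, 3  # toy vocabulary size and embedding dimension
# Made-up symmetric co-occurrence counts for four hypothetical words
cooc = torch.tensor([[0., 8., 2., 0.],
                     [8., 0., 5., 1.],
                     [2., 5., 0., 3.],
                     [0., 1., 3., 0.]])
params = [
    torch.randn(V, D, requires_grad=True),  # word vectors
    torch.randn(V, D, requires_grad=True),  # context vectors
    torch.zeros(V, requires_grad=True),     # word biases
    torch.zeros(V, requires_grad=True),     # context biases
]
optimizer = torch.optim.Adam(params, lr=0.05)

loss_start = glove_loss(*params, cooc).item()
for _ in range(200):
    optimizer.zero_grad()
    loss = glove_loss(*params, cooc)
    loss.backward()
    optimizer.step()
loss_end = glove_loss(*params, cooc).item()
print(loss_start, loss_end)
```

After a few hundred steps the loss on this toy matrix should be much lower than at initialization, which means the dot products have moved towards the log counts.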

In [1]:
# All dependencies for the entire notebook
import torch
import torch.nn as nn
import torch.nn.functional as F

from warnings import warn
from tqdm.auto import tqdm

## Data

In [2]:
# Download and unzip glove word embeddings
!wget -nc https://github.com/holmrenser/deep_learning/raw/main/data/glove.6B.50d.txt.gz
!gunzip -f glove.6B.50d.txt.gz



## Model
We create a small class that reads the whitespace-delimited file of pretrained embeddings, lets us select embeddings for specific words, and finds the words that are closest (in embedding space) to a given vector. The pretrained embeddings are parsed into a vocabulary that maps words to integer indices, and a torch embedding table that is accessed using these indices.
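
As a minimal sketch of the parsing step described above, the snippet below builds a vocabulary and an embedding table from two made-up lines. The tokens, the 3-dimensional vectors, and the variable names (`toy_lines`, `rows`, `table`, `layer`) are invented for illustration; the real file has 400,000 lines of 50 values each.

```python
import torch

# Two made-up lines in the same layout as the GloVe file: a token, then its vector
toy_lines = [
    "the 0.1 0.2 0.3",
    "cat 0.5 0.4 0.9",
]

vocab = {"<unk>": 0}  # index 0 is reserved for unknown words
rows = []
for i, line in enumerate(toy_lines):
    token, *values = line.split()
    vocab[token] = i + 1  # shift by one because <unk> occupies index 0
    rows.append([float(v) for v in values])

table = torch.tensor(rows)
# Row 0 becomes the mean of all vectors and serves as the <unk> embedding
table = torch.vstack([table.mean(dim=0, keepdim=True), table])

layer = torch.nn.Embedding.from_pretrained(table)
idx = torch.tensor([vocab.get("dog", 0)])  # "dog" is missing, so the lookup falls back to index 0
print(vocab)
print(layer(idx))
```

Looking up a missing word returns row 0, the mean of the two toy vectors.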

In [3]:
class WordEmbedding(nn.Module):
    """Wrapper class for working with GloVe word embeddings"""
    def __init__(self, vocab: dict[str, int], embeddings: torch.Tensor):
        super().__init__()
        self.vocab = vocab
        self.embeddings = nn.Embedding.from_pretrained(embeddings)

    @classmethod
    def from_pretrained(cls, filename: str) -> 'WordEmbedding':
        """Load pretrained embeddings from a whitespace-separated text file, first column is the word, rest are embeddings"""
        vocab = {'<unk>': 0} # start vocabulary with special token <unk> for unknown words
        embeddings = []

        with open(filename, 'r', encoding='utf-8') as fh: # explicit encoding: the vocabulary contains non-ASCII tokens
            data = fh.readlines()
            for i,line in enumerate(tqdm(data, desc='Loading')):
                parts = line.split()

                token = parts[0]
                vocab[token] = i + 1 # add one to account for predefined <unk> token

                embedding = list(map(float, parts[1:]))
                embeddings.append(embedding)

        embeddings = torch.tensor(embeddings)
        unk_emb = embeddings.mean(dim=0) # embedding of unknown words is the average of all embeddings
        embeddings = torch.vstack([unk_emb, embeddings])

        return cls(vocab, embeddings)

    def forward(self, word: str) -> torch.Tensor:
        """Maps word to embedding vector"""
        i = self.vocab.get(word, 0) # 0 is the index of the <unk> token
        if i == 0:
            warn(f'{word} is not in the vocabulary, returning average embedding')
        return self.embeddings(torch.tensor([i]))

    def find_closest(self, vec: torch.Tensor, k: int=1) -> 'str | list[str]':
        """Find closest k words of an embedding vector using cosine similarity"""
        # compare every row of this instance's embedding table against vec
        cos_sim = F.cosine_similarity(self.embeddings.weight, vec)
        closest_idx = torch.argsort(cos_sim, descending=True)[:k]
        reverse_vocab = {idx: word for word, idx in self.vocab.items()}
        words = [reverse_vocab[idx] for idx in closest_idx.tolist()]
        return words[0] if k == 1 else words

emb = WordEmbedding.from_pretrained('glove.6B.50d.txt')


## Examples
__Example 1:__ Selecting embeddings for arbitrary words can be done by calling a WordEmbedding class instance with a string.

In [4]:
emb('hot')

tensor([[-7.6663e-01,  6.9023e-01,  7.5462e-02,  1.1688e-01, -7.9722e-01,
         -1.9606e-01, -7.7409e-01,  1.7351e-01,  2.6248e-01,  5.5295e-01,
         -2.9190e-01, -2.4505e-01,  5.9885e-01,  1.2445e+00,  2.6401e-01,
          2.0211e-01,  4.2139e-02,  5.1844e-01, -8.1704e-01, -1.0801e+00,
          2.2864e-01,  9.1212e-02,  1.5638e+00,  7.5056e-01, -6.1206e-02,
         -6.9001e-01, -5.3558e-01,  1.1311e+00,  1.3871e+00,  3.6151e-01,
          2.8475e+00,  1.0733e-01, -1.7073e-02,  4.5358e-01, -7.1374e-03,
          1.1177e-01, -1.5955e-01,  3.0205e-01,  5.4222e-01, -5.4103e-01,
          2.3276e-01,  2.1756e-01, -4.1444e-02,  1.7056e-03,  7.6265e-01,
          6.6241e-01, -4.5484e-02, -8.1479e-01,  4.6763e-02,  3.1134e-01]])

__Example 2:__ Strings that are not in the pretrained vocabulary of 400,000 'words' raise a warning and return the average embedding of all words.

In [5]:
emb('solidgoldmagikarp')

  warn(f'{word} is not in the vocabulary, returning average embedding')


tensor([[-0.1292, -0.2887, -0.0122, -0.0568, -0.2021, -0.0839,  0.3336,  0.1605,
          0.0387,  0.1783,  0.0470, -0.0029,  0.2910,  0.0461, -0.2092, -0.0661,
         -0.0682,  0.0767,  0.3134,  0.1785, -0.1226, -0.0992, -0.0750,  0.0641,
          0.1444,  0.6089,  0.1746,  0.0534, -0.0127,  0.0347, -0.8124, -0.0469,
          0.2019,  0.2031, -0.0394,  0.0697, -0.0155, -0.0341, -0.0653,  0.1225,
          0.1399, -0.1745, -0.0801,  0.0850, -0.0104, -0.1370,  0.2013,  0.1007,
          0.0065,  0.0169]])

__Example 3:__ Using cosine similarity, we can identify words that are close in embedding space. The `find_closest` method implements searching with a given embedding.

In [6]:
# Find the 10 words that are closest in embedding space to the embedding of an out-of-vocabulary word (the <unk> average)
emb.find_closest(emb('solidgoldmagikarp'), k=10)

  warn(f'{word} is not in the vocabulary, returning average embedding')


['<unk>',
 'tom.fowler@chron.com',
 'mangxamba',
 'mongkolporn',
 'ryryryryryry',
 'jenalia.moreno@chron.com',
 'purva.patel@chron.com',
 'jiwamol',
 'afp02',
 'thongrung']
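
The ranking step behind `find_closest` can be sketched on a handful of 2-dimensional vectors. The words, vectors, and the helper `closest` below are invented for illustration. The sketch uses `torch.topk` rather than a full `argsort`, which is an alternative to the notebook's approach and not the approach itself.

```python
import torch
import torch.nn.functional as F

# Invented 2-d "embeddings": each word is a direction in the plane
words = ["north", "east", "northeast", "south"]
vectors = torch.tensor([[0., 1.], [1., 0.], [1., 1.], [0., -1.]])

def closest(query: torch.Tensor, k: int) -> list[tuple[str, float]]:
    """Return the k words whose vectors have the highest cosine similarity to query."""
    sims = F.cosine_similarity(vectors, query.unsqueeze(0), dim=1)
    top = torch.topk(sims, k)  # the k largest similarities, sorted in descending order
    return list(zip([words[i] for i in top.indices.tolist()], top.values.tolist()))

query = torch.tensor([0.2, 1.0])  # points mostly north, slightly east
result = closest(query, k=2)
scaled = closest(10 * query, k=2)  # cosine similarity ignores vector length
print(result)
print(scaled)
```

Because cosine similarity depends only on direction, scaling the query by 10 leaves the ranking unchanged.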

__Example 4:__ We can perform arithmetic on embedding vectors and find the closest word to the resulting vector.

In [8]:
# Reproducing Bishop eq. 12.27 (p. 376)
emb.find_closest(emb('paris') - emb('france') + emb('germany'))

'berlin'

### Exercise 1
What is the result of example 4 when you substitute 'germany' with 'italy'? Are there countries where this doesn't work?

### Exercise 2
Can you find a word for which the 10 closest words in example 3 are not all semantically related to the input word? Can you explain why training on co-occurrence can result in this observation?

## Answers

### Exercise 1
'berlin' becomes 'rome', exactly as expected. Belgium does not work: it returns 'paris' instead of 'brussels'.
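
In the Belgium case the returned word is one of the inputs to the analogy. A common adjustment, which is not part of the notebook's `find_closest`, is to skip the input words when ranking candidates. The sketch below illustrates the idea. The 3-dimensional vectors were chosen by hand so that the effect shows up; they are not GloVe values. The helper `analogy` and its `exclude_inputs` flag are hypothetical names.

```python
import torch
import torch.nn.functional as F

# Hand-picked 3-d vectors (not GloVe values), chosen so the raw analogy lands on an input word
toy = {
    "paris":    [1.0, 1.0, 0.0],
    "france":   [1.0, 0.0, 0.0],
    "belgium":  [0.9, 0.0, 0.3],
    "brussels": [0.6, 1.0, 0.5],
    "germany":  [0.0, 0.0, 1.0],
    "berlin":   [0.0, 1.0, 1.0],
}
words = list(toy)
table = torch.tensor(list(toy.values()))

def analogy(a: str, b: str, c: str, exclude_inputs: bool) -> str:
    """Closest word to vec(a) - vec(b) + vec(c), optionally ignoring a, b and c themselves."""
    query = table[words.index(a)] - table[words.index(b)] + table[words.index(c)]
    sims = F.cosine_similarity(table, query.unsqueeze(0), dim=1)
    if exclude_inputs:
        for w in (a, b, c):
            sims[words.index(w)] = -float("inf")  # input words can no longer win the argmax
    return words[int(sims.argmax())]

raw = analogy("paris", "france", "belgium", exclude_inputs=False)
filtered = analogy("paris", "france", "belgium", exclude_inputs=True)
print(raw, filtered)
```

With these toy vectors the unfiltered query returns 'paris', while the filtered query returns 'brussels'. Whether the same adjustment fixes the Belgium case for the real GloVe vectors would need to be checked in the notebook.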

### Exercise 2
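The 10 closest words to 'water' include 'dry' and 'sand', indicating that co-occurrence (which was used for training) does not always capture semantic similarity. Words with opposite or merely associated meanings often appear in the same contexts, so they end up close together in embedding space. A wider context might be necessary to capture the true semantic meaning of an individual word.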