# Word embeddings
In this notebook we work with pretrained word embedding vectors from the GloVe project. We use the smallest version, which maps 400,000 words into a 50-dimensional embedding space and was trained on a corpus of 6 billion tokens.
From the project description: 
> "The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence".

For more details on model formulation and training procedures visit the [GloVe project website](https://nlp.stanford.edu/projects/glove/).
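
Written out, the quoted objective corresponds to a weighted least-squares loss. The notation below follows the GloVe paper and is not used elsewhere in this notebook: $X_{ij}$ counts how often word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $f$ is a weighting function that limits the influence of very frequent pairs, and $V$ is the vocabulary size.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

The bias terms absorb the normalisation between raw counts and probabilities, which is why the loss is written with $\log X_{ij}$.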

In [None]:
# All dependencies for the entire notebook
import torch
import torch.nn as nn
import torch.nn.functional as F

from warnings import warn
from tqdm.auto import tqdm

## Data

In [None]:
# Download and unzip glove word embeddings
!wget -nc https://github.com/holmrenser/deep_learning/raw/main/data/glove.6B.50d.txt.gz
!gunzip -f glove.6B.50d.txt.gz

## Model

In [None]:
class WordEmbedding(nn.Module):
    """Wrapper class for working with GloVe word embeddings"""
    def __init__(self, vocab: dict[str, int], embeddings: torch.Tensor):
        super().__init__()
        self.vocab = vocab
        self.embeddings = nn.Embedding.from_pretrained(embeddings)

    @classmethod
    def from_pretrained(cls, filename: str) -> 'WordEmbedding':
        """Load pretrained embeddings from a whitespace-separated text file, first column is the word, rest are embeddings"""
        vocab = {'<unk>': 0} # start vocabulary with special token <unk> for unknown words
        embeddings = []
        
        with open(filename, 'r', encoding='utf-8') as fh: # the vocabulary contains non-ASCII tokens
            data = fh.readlines()
            for i,line in enumerate(tqdm(data, desc='Loading')):
                parts = line.split()
                
                token = parts[0]
                vocab[token] = i + 1 # add one to account for predefined <unk> token
                
                embedding = list(map(float, parts[1:]))
                embeddings.append(embedding)

        embeddings = torch.tensor(embeddings)
        unk_emb = embeddings.mean(dim=0) # embedding of unknown words is the average of all embeddings
        embeddings = torch.vstack([unk_emb, embeddings])
        
        return cls(vocab, embeddings)

    def forward(self, word: str) -> torch.Tensor:
        """Maps word to embedding vector of shape (1, embedding_dim)"""
        i = self.vocab.get(word, 0) # 0 is the index of the <unk> token
        if i == 0:
            warn(f'{word} is not in the vocabulary, returning average embedding')
        return self.embeddings(torch.tensor([i]))

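    def find_closest(self, vec: torch.Tensor, k: int = 1) -> 'str | list[str]':
        """Find the k words closest to an embedding vector by cosine similarity, most similar first"""
        # Use self rather than the global `emb` so the method works on any instance
        cos_sim = F.cosine_similarity(self.embeddings.weight, vec)
        closest_idx = torch.topk(cos_sim, k).indices.tolist() # indices sorted by decreasing similarity
        idx_to_word = {idx: word for word, idx in self.vocab.items()}
        words = [idx_to_word[idx] for idx in closest_idx]
        return words[0] if k == 1 else words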

emb = WordEmbedding.from_pretrained('glove.6B.50d.txt')
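
The ranking step behind `find_closest` can be tried out without the GloVe file. The following is a minimal pure-Python sketch on made-up 2D vectors (these are not GloVe values); the names `top_k` and `toy` are hypothetical and only serve this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, table, k=1):
    """Return the k words in `table` most similar to `query`, best first."""
    ranked = sorted(table, key=lambda w: cosine_similarity(table[w], query), reverse=True)
    return ranked[:k]

# Made-up toy vectors for illustration only, NOT real GloVe embeddings
toy = {
    'frog':   [1.0, 0.1],
    'toad':   [0.9, 0.2],
    'lizard': [0.7, 0.5],
    'car':    [0.0, 1.0],
}

ranked = top_k(toy['frog'], toy, k=3)
print(ranked)
```

Note that the query word itself ranks first, because its cosine similarity with its own vector is 1.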

## Examples

In [None]:
# Reproducing Bishop eq. 12.27 (p. 376)
emb.find_closest(emb('paris') - emb('france') + emb('italy'))

In [None]:
# Find the 10 words that are closest in embedding space to the embedding of 'frog'
emb.find_closest(emb('frog'), k=10)

### Exercise 1
What is the result of the first example when you substitute 'italy' with 'germany'? Are there countries for which this does not work?

### Exercise 2
Can you find a word where the 10 closest words are not all semantically related to the input word? Can you explain why training on co-occurrence can result in this observation?

## Answers

### Exercise 1
'rome' becomes 'berlin', exactly as expected. Belgium does not work: it returns 'paris' instead of 'brussels'.
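
One possible explanation is that 'paris' is itself part of the query, so the query vector can stay closer to 'paris' than to any other word. A common convention in analogy evaluation is to exclude the query words from the candidates. The sketch below illustrates the effect with made-up 2D vectors (not GloVe values); the `nearest` helper and its `exclude` parameter are hypothetical, and whether the same trick recovers 'brussels' with the real embeddings is not verified here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, table, exclude=()):
    """Return the word most similar to `query`, skipping words in `exclude`."""
    candidates = [w for w in table if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(table[w], query))

# Made-up toy vectors for illustration only, NOT real GloVe embeddings
toy = {
    'france':   [1.0, 0.0],
    'paris':    [1.0, 1.0],
    'belgium':  [1.2, 0.1],
    'brussels': [1.5, 0.9],
    'rome':     [0.2, 1.0],
}

# paris - france + belgium, computed element-wise
query = [p - f + b for p, f, b in zip(toy['paris'], toy['france'], toy['belgium'])]

without_exclusion = nearest(query, toy)
with_exclusion = nearest(query, toy, exclude={'paris', 'france', 'belgium'})
print(without_exclusion, with_exclusion)
```

In this toy setup the query lands closest to 'paris' until the query words are removed from the candidates, after which 'brussels' is the best match.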

### Exercise 2
The 10 closest words to 'water' include 'dry' and 'sand'. This indicates that co-occurrence, which was used for training, does not always capture semantic similarity: words that frequently appear in the same contexts end up close together even when their meanings differ. A larger context might be necessary to capture the true semantic meaning of an individual word.
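
The effect can be sketched by counting co-occurrences in a tiny made-up corpus. This is a simplification under stated assumptions: the corpus is invented, pairs are counted per sentence rather than within a context window as GloVe does, and the `cooccurrence` helper is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Invented toy corpus for illustration only
corpus = [
    "dry sand needs water",
    "water the dry plants",
    "cold water",
]

# Count each unordered pair of distinct words once per sentence
counts = Counter()
for sentence in corpus:
    tokens = sorted(set(sentence.split()))
    for a, b in combinations(tokens, 2):
        counts[(a, b)] += 1

def cooccurrence(w1, w2):
    """Number of sentences in which both words appear."""
    return counts[tuple(sorted((w1, w2)))]

print(cooccurrence('water', 'dry'), cooccurrence('water', 'cold'))
```

Here 'water' co-occurs with 'dry' more often than with 'cold', even though 'dry' describes the absence of water. A model trained on such counts would tend to place 'water' and 'dry' close together.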