<a target="_blank" href="https://colab.research.google.com/github/holmrenser/deep_learning/blob/main/word_embeddings">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Word embeddings
In this notebook we work with pretrained word embeddings from the [GloVe project](https://nlp.stanford.edu/projects/glove/). We use the smallest version, which maps 400,000 words into a 50-dimensional embedding space and was trained on a corpus of 6 billion tokens.
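
The GloVe text files store one word per line: the token first, then its vector components, all separated by spaces. As a minimal sketch of that layout, the snippet below parses a single made-up line. The values and the 4-dimensional length are hypothetical (the real file has 50 components per word), and `parse_glove_line` is an illustrative helper, not part of any library.

```python
# Hypothetical line in GloVe text format, shortened to 4 dimensions for illustration.
line = "paris 0.5 -1.25 0.0 2.0\n"


def parse_glove_line(line: str) -> tuple[str, list[float]]:
    """Split one line into its token and its vector components."""
    token, *values = line.split()
    return token, [float(v) for v in values]


token, vector = parse_glove_line(line)
print(token, vector)
```

The loader defined later in this notebook follows the same idea, collecting every parsed vector into one tensor.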

In [1]:
!wget https://github.com/holmrenser/deep_learning/raw/main/data/glove.6B.50d.txt.gz
!gunzip glove.6B.50d.txt.gz

--2024-03-12 15:23:53--  https://github.com/holmrenser/deep_learning/raw/main/data/glove.6B.50d.txt.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/holmrenser/deep_learning/main/data/glove.6B.50d.txt.gz [following]
--2024-03-12 15:23:53--  https://raw.githubusercontent.com/holmrenser/deep_learning/main/data/glove.6B.50d.txt.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69239637 (66M) [application/octet-stream]
Saving to: ‘glove.6B.50d.txt.gz’


2024-03-12 15:24:03 (25,4 MB/s) - ‘glove.6B.50d.txt.gz’ saved [69239637/69239637]



In [95]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [115]:
class WordEmbedding(nn.Module):
    def __init__(self, vocab: dict[str, int], embeddings: torch.Tensor):
        super().__init__()
        self.vocab = vocab
        self.embeddings = nn.Embedding.from_pretrained(embeddings)

    @classmethod
    def from_pretrained(cls, filename: str) -> 'WordEmbedding':
        vocab = {'<pad>':0, '<unk>':1}
        embeddings = []
        
        with open(filename, 'r', encoding='utf-8') as fh:
            for i,line in enumerate(fh):
                parts = line.split()
                
                token = parts[0]
                vocab[token] = i + 2
                
                embedding = list(map(float, parts[1:]))
                embeddings.append(embedding)

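        embeddings = torch.tensor(embeddings)
        # Row 0 (<pad>) is all zeros; row 1 (<unk>) is the mean of all word vectors
        pad_emb = torch.zeros((1, embeddings.shape[1]))
        unk_emb = embeddings.mean(dim=0, keepdim=True)
        embeddings = torch.vstack([pad_emb, unk_emb, embeddings])

        return cls(vocab, embeddings)

    def forward(self, word: str) -> torch.Tensor:
        # Words missing from the vocabulary fall back to the <unk> row
        i = self.vocab.get(word, self.vocab['<unk>'])
        return self.embeddings(torch.tensor([i]))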

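    def find_closest(self, vec: torch.Tensor) -> str:
        # Compare against this module's own weights rather than the global `emb`
        cos_sim = F.cosine_similarity(self.embeddings.weight, vec)
        i = torch.argmax(cos_sim).item()
        # Reverse lookup: return the token stored at index i
        return next(k for k, v in self.vocab.items() if v == i)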

emb = WordEmbedding.from_pretrained('glove.6B.50d.txt')
emb.find_closest(emb('paris') - emb('france') + emb('italy'))

'rome'
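
To make the analogy arithmetic easier to follow, here is a small sketch in plain Python using hand-picked 3D vectors. The vectors, the `toy` dictionary, and the `cosine` and `closest` helpers are all hypothetical and chosen so the arithmetic works out exactly. Real GloVe vectors only approximate this behaviour.

```python
import math

# Hypothetical 3D vectors chosen by hand; real GloVe vectors have 50 dimensions.
toy = {
    "paris": [1.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0],
    "rome": [0.0, 1.0, 1.0],
    "italy": [0.0, 1.0, 0.0],
    "banana": [-1.0, -1.0, 0.0],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest(query: list[float], exclude: frozenset = frozenset()) -> str:
    """Return the word whose vector has the highest cosine similarity to query."""
    candidates = (w for w in toy if w not in exclude)
    return max(candidates, key=lambda w: cosine(toy[w], query))


# paris - france + italy, computed component by component
query = [p - f + i for p, f, i in zip(toy["paris"], toy["france"], toy["italy"])]
result = closest(query)
print(result)
```

The optional `exclude` argument is an addition for illustration. With real embeddings the nearest neighbour of an analogy query is often one of the input words themselves, so it is common to filter those out before taking the maximum.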