### Word representation
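The simplest way to represent a word is a one-hot vector: a vector with one entry per vocabulary word, holding 1 at the word's index and 0 everywhere else.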

$$
\mathbf x =
\begin{bmatrix}
0 \\
0 \\
\vdots \\
1 \\
\vdots \\
0 
\end{bmatrix}_{|V| \times 1}
$$

#### Analogy of two vectors

$$
\begin{align*}
e_{\text{man}} - e_{\text{woman}} &\approx e_{\text{king}} - e \\
e &\approx e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \\
\end{align*}
$$

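The similarity between two vectors is measured with cosine similarity, which always lies between -1 and 1. The word that completes the analogy is the one whose embedding $e$ is most similar to $e_{\text{king}} - e_{\text{man}} + e_{\text{woman}}$.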

$$
-1 \leq\mathrm{cossim}(e, e_{\text{king}} - e_{\text{man}} + e_{\text{woman}}) \leq 1
$$
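
As a minimal sketch of the analogy search above, the following uses hand-made 2-D vectors. The embedding values, the `emb` dictionary, and the `cossim` helper are hypothetical and chosen only so the arithmetic is easy to follow; real embeddings are learned and have many more dimensions.

```python
import numpy as np

def cossim(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
emb = {
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

# e ~ e_king - e_man + e_woman
query = emb["king"] - emb["man"] + emb["woman"]

# Search the remaining vocabulary for the most similar embedding.
candidates = [w for w in emb if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cossim(emb[w], query))
print(best)
```

The three words used to build the query are excluded from the candidates, since one of them is often the nearest neighbour of the query in practice.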

### Embedding matrix

In [55]:
import numpy as np
import matplotlib.pyplot as plt

# embedding matrix: embedding_dim (10) x vocab_size (100)
E = np.random.randn(10, 100)

# one-hot vector for the word at index 42
x = np.zeros((100,1))
x[42] = 1

# embedded vector
e = E @ x
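
Multiplying by a one-hot vector simply selects one column of the embedding matrix, so the product can be replaced by an index lookup. The sketch below checks this on a random matrix; the seeded generator and the names `e_matmul` and `e_lookup` are only there for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((10, 100))  # embedding_dim x vocab_size

x = np.zeros((100, 1))
x[42] = 1                           # one-hot vector for index 42

e_matmul = E @ x                    # matrix product, shape (10, 1)
e_lookup = E[:, [42]]               # column lookup, shape (10, 1)

print(np.allclose(e_matmul, e_lookup))
```

This is presumably why embedding layers such as `nn.Embedding` take integer indices rather than one-hot vectors: the lookup avoids a multiplication that is mostly by zeros.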

### Word2Vec

#### Skip-gram

In [99]:
text = """John quickly realized that the fox was jumping over a brown fence. Meanwhile, the lazy dog slept under the warm sun, dreaming of chasing squirrels in the park. A wizard in a distant land cast spells to levitate objects and summon mystical creatures. The gym was full of athletes lifting weights, running on treadmills, and practicing yoga poses. Buzzing bees were collecting nectar from vibrant flowers, while a group of birds sang harmoniously from the treetops. In the city, cars zoomed by as people hurried to work, their minds filled with tasks and deadlines. The library was a sanctuary of knowledge, where students pored over books and researchers delved into ancient manuscripts. A chef in a bustling kitchen prepared exquisite dishes, skillfully chopping vegetables and grilling meats. At the beach, waves crashed against the shore as children built sandcastles and surfers rode the swells. In the forest, a lumberjack wielded his axe, cutting down trees for timber. The night sky was a tapestry of stars, constellations, and planets, inspiring wonder and awe in all who gazed upon it."""
text = text.lower()

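# character-level vocabulary: each distinct character plays the role of a "word"
chars = sorted(set(text))
c2i = lambda c: chars.index(c)  # character -> index
i2c = lambda i: chars[i]        # index -> character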

X = []  # (center, context) index pairs
context_size = 2  # number of neighbours taken on each side of the center
for i in range(context_size, len(text)-context_size):
    for offset in range(-context_size, context_size+1):
        if offset == 0:
            continue
        X.append([c2i(text[i]), c2i(text[i+offset])])

print(f"{len(text) = }")
print(f"{len(X)    = }")

len(text) = 1093
len(X)    = 4356


In [100]:
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size=len(chars), embedding_dim=5):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim

        self.embedding = nn.Embedding(num_embeddings=self.vocab_size,
                                      embedding_dim=self.embedding_dim)
        self.linear = nn.Linear(in_features=self.embedding_dim,
                                out_features=self.vocab_size)
        
        nn.init.xavier_uniform_(self.embedding.weight)
        nn.init.xavier_uniform_(self.linear.weight)
    
    def forward(self, x):
        x = self.embedding(x)
        x = self.linear(x)
        return x

In [101]:
import torch.optim as optim
from torch.utils.data import DataLoader

skipgram = SkipGram()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(skipgram.parameters(), lr=0.1)

skipgram.train()
for e in range(1):
    running_loss = 0
    count = 0
    for x, y in DataLoader(X, batch_size=32, shuffle=True):
        count += 1
        optimizer.zero_grad()

        y_hat = skipgram(x)
        # Full softmax over the whole vocabulary for every (center, context) pair:
        # the cost scales with vocab size * context size, which is expensive for a large vocabulary.
        loss = criterion(y_hat, y)

        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    print(f"[epoch {e+1:0>3}] {running_loss/count:7.5f}", end='\r')

[epoch 001] 3.18345

#### Skip-gram (with negative sampling)

In [102]:
import random

text = """John quickly realized that the fox was jumping over a brown fence. Meanwhile, the lazy dog slept under the warm sun, dreaming of chasing squirrels in the park. A wizard in a distant land cast spells to levitate objects and summon mystical creatures. The gym was full of athletes lifting weights, running on treadmills, and practicing yoga poses. Buzzing bees were collecting nectar from vibrant flowers, while a group of birds sang harmoniously from the treetops. In the city, cars zoomed by as people hurried to work, their minds filled with tasks and deadlines. The library was a sanctuary of knowledge, where students pored over books and researchers delved into ancient manuscripts. A chef in a bustling kitchen prepared exquisite dishes, skillfully chopping vegetables and grilling meats. At the beach, waves crashed against the shore as children built sandcastles and surfers rode the swells. In the forest, a lumberjack wielded his axe, cutting down trees for timber. The night sky was a tapestry of stars, constellations, and planets, inspiring wonder and awe in all who gazed upon it."""
text = text.lower()

chars = sorted(list(set(text)))
c2i = lambda c: chars.index(c)
i2c = lambda i: chars[i]

X = []
context_size = 2
for i in range(context_size, len(text)-context_size):
    target = c2i(text[i])

    positive_samples = []
    for offset in range(-context_size, context_size+1):
        if offset == 0:
            continue
        context = c2i(text[i+offset])  # separate name so the loop index i is not overwritten
        positive_samples.append(context)

    # negatives are drawn uniformly from the characters that are not in the context window
    negative_sample_candidates = list(set(range(len(chars))) - set(positive_samples))
    negative_samples = random.sample(negative_sample_candidates, context_size*2*5) # 5 negative samples per positive sample

    for s in positive_samples:
        X.append([target, s, 1])

    for s in negative_samples:
        X.append([target, s, 0])

print(f"{len(text) = }")
print(f"{len(X)    = }")

len(text) = 1093
len(X)    = 26136


<img src="https://wikidocs.net/images/page/69141/그림7.PNG" height=300/>
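
The cell above only builds `[target, sample, label]` triples. One possible model for such triples is sketched below, under assumptions that are not from the notebook: the class name `SkipGramNS`, the two separate embedding tables, and the use of `nn.BCEWithLogitsLoss` on the dot product are a common formulation of skip-gram with negative sampling. The toy triples and `vocab_size = 30` are made up so the block runs on its own.

```python
import torch
import torch.nn as nn

class SkipGramNS(nn.Module):
    """Scores a (target, sample) pair with the dot product of two embeddings."""
    def __init__(self, vocab_size, embedding_dim=5):
        super().__init__()
        self.target_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.context_embedding = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, target, sample):
        t = self.target_embedding(target)   # (batch, embedding_dim)
        c = self.context_embedding(sample)  # (batch, embedding_dim)
        return (t * c).sum(dim=1)           # (batch,) logits

torch.manual_seed(0)
vocab_size = 30  # assumed size, for illustration only

# Hypothetical triples in the same [target, sample, label] layout as X.
triples = torch.tensor([[0, 1, 1], [0, 7, 0], [2, 3, 1], [2, 9, 0]])
target, sample = triples[:, 0], triples[:, 1]
label = triples[:, 2].float()

model = SkipGramNS(vocab_size)
criterion = nn.BCEWithLogitsLoss()  # binary decision: real context or not
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(50):
    optimizer.zero_grad()
    logits = model(target, sample)
    loss = criterion(logits, label)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"{losses[0]:.4f} -> {losses[-1]:.4f}")
```

Each training example is now a single binary classification, so the cost per example no longer depends on the vocabulary size, unlike the full softmax in the plain skip-gram model.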

### Debiasing word embeddings

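Word embeddings can encode biased analogies such as doctor : man = nurse : woman. One way to debias them is to identify the gender axis (for example, the direction $e_{\text{man}} - e_{\text{woman}}$) and project the words that should be gender-neutral onto the subspace perpendicular to that axis.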

<img src="src/debiasing.png" height=400 />
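
The projection step can be sketched with plain NumPy. The 3-D vectors below are hypothetical, the helper name `neutralize` is an assumption, and the gender axis is estimated from a single word pair, which is a simplification.

```python
import numpy as np

def neutralize(e, g):
    """Remove the component of e that lies along the bias axis g."""
    g_unit = g / np.linalg.norm(g)
    return e - (e @ g_unit) * g_unit

# Hypothetical embeddings; axis 0 carries the gender information here.
e_man    = np.array([ 1.0, 0.2, 0.0])
e_woman  = np.array([-1.0, 0.2, 0.0])
e_doctor = np.array([ 0.4, 0.1, 0.9])

g = e_man - e_woman                      # estimated gender axis
e_doctor_debiased = neutralize(e_doctor, g)

print(e_doctor_debiased)
print(e_doctor_debiased @ g)             # component along the gender axis
```

After the projection, the debiased vector has no component along `g`, so it is equally similar to both ends of the axis as far as that direction is concerned.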