# Skip-gram

- 📺 **Video:** [https://youtu.be/hznxqCIrzSQ](https://youtu.be/hznxqCIrzSQ)

## Overview
- Study the skip-gram objective that predicts surrounding context words from a center word.
- Understand negative sampling as an efficient approximation to softmax normalization.
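
To make the cost difference concrete, the following is a minimal sketch (not taken from the lecture) that computes both losses for one pair. The vocabulary size, embedding dimension, function names, and the uniform draw of negatives are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 16   # hypothetical sizes, not from the lecture
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # center-word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-word vectors

def softmax_nll(center, context):
    """Full-softmax loss: the normalizer needs a score for every vocabulary word."""
    scores = W_out @ W_in[center]          # vocab_size dot products
    scores -= scores.max()                 # shift for numerical stability
    log_z = np.log(np.exp(scores).sum())
    return log_z - scores[context]

def neg_sampling_loss(center, context, neg_ids):
    """Negative-sampling loss: only 1 + len(neg_ids) dot products."""
    pos = W_out[context] @ W_in[center]
    neg = W_out[neg_ids] @ W_in[center]
    # -log(sigmoid(x)) equals logaddexp(0, -x)
    return np.logaddexp(0, -pos) + np.logaddexp(0, neg).sum()

center, context = 3, 7
neg_ids = rng.integers(0, vocab_size, size=5)   # uniform draw, for illustration only
full = softmax_nll(center, context)
sampled = neg_sampling_loss(center, context, neg_ids)
print(f"full softmax: {full:.3f} | negative sampling: {sampled:.3f}")
```

The two values are not directly comparable as numbers; the point is that the first touches every row of `W_out` while the second touches six.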

## Key ideas
- **Predictive training:** maximize probability of observed (center, context) pairs.
- **Negative sampling:** sample noise words to contrast with true context words.
- **Embeddings:** input and output embeddings jointly learn semantic regularities.
- **Optimization:** stochastic gradient updates on mini-batches scale to large corpora.
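
The ideas above can be written as a per-pair loss. The notation here is an assumption made for this sketch: $v_w$ is the input embedding of the center word, $u_c$ the output embedding of the context word, $k$ the number of negatives, and $P_n$ the noise distribution.

```latex
\ell(w, c) = -\log \sigma\!\left(u_c^\top v_w\right)
             - \sum_{i=1}^{k} \log \sigma\!\left(-u_{n_i}^\top v_w\right),
\qquad n_i \sim P_n

\frac{\partial \ell}{\partial u_c} = \left(\sigma(u_c^\top v_w) - 1\right) v_w,
\qquad
\frac{\partial \ell}{\partial u_{n_i}} = \sigma(u_{n_i}^\top v_w)\, v_w

\frac{\partial \ell}{\partial v_w} = \left(\sigma(u_c^\top v_w) - 1\right) u_c
             + \sum_{i=1}^{k} \sigma(u_{n_i}^\top v_w)\, u_{n_i}
```

The gradients assume the sampled words are distinct from each other and from $c$; when a word repeats, its contributions add up.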

## Demo
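Train a tiny skip-gram model with negative sampling on a toy corpus to illustrate the gradient updates described in the lecture ([https://youtu.be/4P_yGJvqMeI](https://youtu.be/4P_yGJvqMeI)). The loop updates the embeddings one (center, context) pair at a time and draws `num_neg` noise words for each pair.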

```python
import numpy as np
from collections import Counter

# Toy corpus: eight short sentences.
corpus = [
    'she is a skilled doctor and compassionate leader',
    'he is a brilliant engineer and creative designer',
    'the nurse offered patient support and kindness',
    'the manager coordinated the project with precision',
    'artists create inspiring work with emotion and style',
    'scientists test hypotheses with rigorous experiments',
    'teachers guide students with patience and care',
    'the programmer solved complex problems quickly'
]

# Vocabulary and id lookups.
tokens = [word for sentence in corpus for word in sentence.split()]
vocab = sorted(set(tokens))
word_to_id = {word: idx for idx, word in enumerate(vocab)}
id_to_word = {idx: word for word, idx in word_to_id.items()}
counts = Counter(tokens)
# Noise distribution for negative sampling: unigram counts raised to the 0.75 power.
distribution = np.array([counts[word] for word in vocab], dtype=float)
noise_dist = (distribution ** 0.75) / np.sum(distribution ** 0.75)

window = 2  # context words taken on each side of the center word
embedding_dim = 8
rng = np.random.default_rng(1)
W_in = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))   # center-word (input) embeddings
W_out = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))  # context-word (output) embeddings

# Collect every (center, context) pair that falls inside the window.
pairs = []
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        target = word_to_id[word]
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i == j:
                continue
            pairs.append((target, word_to_id[words[j]]))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

lr = 0.05
num_neg = 3

for epoch in range(1, 801):
    total_loss = 0.0
    rng.shuffle(pairs)
    for target, context in pairs:
        # target_vec and context_vec are views into W_in and W_out,
        # so they change as soon as the rows are updated in place.
        target_vec = W_in[target]
        context_vec = W_out[context]
        score = target_vec @ context_vec
        prob = sigmoid(score)
        # Positive pair: d(-log sigmoid(score)) / d(score) = sigmoid(score) - 1.
        grad = (prob - 1)
        W_in[target] -= lr * grad * context_vec
        W_out[context] -= lr * grad * target_vec
        total_loss -= np.log(prob + 1e-8)

        # Negatives are drawn from noise_dist without excluding the true context word.
        neg_ids = rng.choice(len(vocab), size=num_neg, p=noise_dist)
        for neg in neg_ids:
            neg_vec = W_out[neg]
            neg_score = target_vec @ neg_vec
            neg_prob = sigmoid(neg_score)
            # Negative pair: d(-log(1 - sigmoid(score))) / d(score) = sigmoid(score).
            grad_neg = neg_prob
            W_in[target] -= lr * grad_neg * neg_vec
            W_out[neg] -= lr * grad_neg * target_vec
            total_loss -= np.log(1 - neg_prob + 1e-8)
    if epoch % 200 == 0:
        print(f"epoch {epoch:3d} | loss {total_loss/len(pairs):.4f}")

# Use the input embeddings as the word vectors.
embeddings = W_in

def similar(word, top_k=4):
    """Return the top_k words ranked by cosine similarity to `word`."""
    idx = word_to_id[word]
    vec = embeddings[idx]
    sims = embeddings @ vec / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(vec) + 1e-8)
    ranking = sims.argsort()[::-1]
    return [(id_to_word[i], sims[i]) for i in ranking if i != idx][:top_k]

print()
print('Nearest neighbors after training:')
for word in ['doctor', 'engineer', 'teachers']:
    print(word, '->', similar(word))
```


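```text
epoch 200 | loss 4.3505
epoch 400 | loss 17.9440
epoch 600 | loss 17.8067
epoch 800 | loss 17.7043

Nearest neighbors after training:
doctor -> [('creative', 0.9862556976884768), ('compassionate', 0.9844193319770509), ('brilliant', 0.974811414015661), ('care', 0.9734487544990958)]
engineer -> [('designer', 0.9916436433017531), ('kindness', 0.9669990182026375), ('leader', 0.9637092849382919), ('patience', 0.9400564512125047)]
teachers -> [('test', 0.9393482001183809), ('precision', 0.8593904508848271), ('work', 0.8400085169632004), ('is', 0.817506675612962)]
```

Between epochs 200 and 400 the run also echoed the line `return 1 / (1 + np.exp(-x))`, which is the tail of a NumPy runtime warning raised inside `sigmoid`, most likely an overflow in `np.exp` for a large negative score. The loss rises from 4.35 to about 17.9 over the same interval. Each log term is capped near -log(1e-8) ≈ 18.4 by the `1e-8` offset, so this value is consistent with roughly one saturated term per pair. Both signs suggest that the embeddings grew until the updates became unstable, so the neighbor lists above should be read with caution rather than as evidence of learned semantics.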
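
The demo's `sigmoid` calls `np.exp(-x)` directly, and its `target_vec` is a view into `W_in` that changes as soon as `W_in[target]` is updated in place. Below is a minimal sketch of a more defensive update step, assuming the same `W_in`/`W_out` layout as the demo. The names `stable_sigmoid` and `sgns_step` are hypothetical and not part of the notebook, and this variant has not been run against the demo corpus, so it is offered as a starting point rather than a verified fix.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that only ever exponentiates a non-positive number."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))                      # always in (0, 1], so no overflow
    return np.where(x >= 0, 1 / (1 + e), e / (1 + e))

def sgns_step(W_in, W_out, target, context, neg_ids, lr):
    """One negative-sampling update computed entirely from pre-update vectors.

    Returns the loss measured before the update is applied.
    """
    v = W_in[target].copy()                     # copy, so later writes do not alter it
    ids = np.concatenate(([context], neg_ids)).astype(int)
    labels = np.zeros(len(ids))
    labels[0] = 1.0                             # first row is the true context word
    U = W_out[ids]                              # fancy indexing returns a copy
    scores = U @ v
    err = stable_sigmoid(scores) - labels       # sigmoid(s) - 1 for the positive, sigmoid(s) for negatives
    W_in[target] -= lr * (err @ U)
    # ufunc.at accumulates correctly when the same word id appears more than once
    np.subtract.at(W_out, ids, lr * np.outer(err, v))
    # -log(sigmoid(s)) for the positive, -log(sigmoid(-s)) for negatives
    return np.logaddexp(0, -scores[0]) + np.logaddexp(0, scores[1:]).sum()
```

Inside the demo loop this would be called as `total_loss += sgns_step(W_in, W_out, target, context, neg_ids, lr)`. It removes the overflow warning and the aliasing, but it does not by itself bound the embedding norms, so a smaller `lr` or fewer epochs may still be needed.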


## Try it
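- Modify the demo: lower `lr` or stop before 800 epochs, print the loss more often, and check whether it still jumps after epoch 200.
- Add a tiny dataset or counter-example: for instance, add sentences in which two words share the same contexts and check whether `similar` ranks them as neighbors.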
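
One possible exercise along these lines is to check the analytic gradient against finite differences before changing the training loop. The sketch below is an illustration under assumed names (`sgns_loss`, `sgns_grad_v`, `numeric_grad`) and random vectors; it is independent of the demo's variables.

```python
import numpy as np

def sgns_loss(v, U, labels):
    """Negative-sampling loss for one center vector v and stacked output vectors U."""
    s = U @ v
    # label 1 contributes -log(sigmoid(s)); label 0 contributes -log(sigmoid(-s))
    return np.logaddexp(0, np.where(labels == 1, -s, s)).sum()

def sgns_grad_v(v, U, labels):
    """Analytic gradient with respect to v: sum_i (sigmoid(s_i) - label_i) * U_i."""
    s = U @ v
    return (1 / (1 + np.exp(-s)) - labels) @ U

def numeric_grad(f, v, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(v)
    for i in range(len(v)):
        step = np.zeros_like(v)
        step[i] = eps
        g[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
v = rng.normal(size=8)                      # center-word vector
U = rng.normal(size=(4, 8))                 # row 0: true context, rows 1-3: negatives
labels = np.array([1.0, 0.0, 0.0, 0.0])

analytic = sgns_grad_v(v, U, labels)
numeric = numeric_grad(lambda x: sgns_loss(x, U, labels), v)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

A difference close to zero indicates that the `sigmoid(score) - 1` and `sigmoid(score)` factors used in the demo match the loss being reported.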


## References
- [Eisenstein 14.5](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)
- [A Scalable Hierarchical Distributed Language Model](https://papers.nips.cc/paper/2008/hash/1e056d2b0ebd5c878c550da6ac5d3724-Abstract.html)
- [Neural Word Embedding as Implicit Matrix Factorization](https://papers.nips.cc/paper/2014/file/feab05aa91085b7a8012516bc3533958-Paper.pdf)
- [GloVe: Global Vectors for Word Representation](https://www.aclweb.org/anthology/D14-1162/)
- [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606)
- [Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings](https://papers.nips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf)
- [Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings](https://www.aclweb.org/anthology/N19-1062/)
- [Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them](https://www.aclweb.org/anthology/N19-1061/)
- [Deep Unordered Composition Rivals Syntactic Methods for Text Classification](https://www.aclweb.org/anthology/P15-1162/)


*Links only; we do not redistribute slides or papers.*