# Word Embeddings

- 📺 **Video:** [https://youtu.be/8EqQROdVPyM](https://youtu.be/8EqQROdVPyM)

## Overview
- Introduce distributed word representations that capture semantics through context.
- Understand how co-occurrence statistics lead to dense vectors used across NLP tasks.

## Key ideas
- **Distributional hypothesis:** words appearing in similar contexts tend to have similar meanings.
- **Embedding spaces:** low-dimensional vectors encode syntactic and semantic relationships.
- **Linear structure:** vector arithmetic such as king - man + woman ≈ queen often holds approximately in embeddings trained on large corpora.
- **Training signals:** count-based (SVD/PPMI) or predictive (skip-gram) models leverage co-occurrence data.
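
The linear-structure idea above can be illustrated with a small sketch. It assumes hand-picked 2-D toy vectors chosen so that the analogy holds exactly; they are not learned embeddings, and the helper names `cosine` and `analogy` are hypothetical.

```python
import numpy as np

# Hand-picked toy vectors (assumption: axis 0 ~ "royalty", axis 1 ~ "gender").
# They are illustrative only and were not learned from data.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def analogy(a, b, c, vectors):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three query words."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", vectors)
print(result)  # prints: queen
```

With real embeddings the match is only approximate, and excluding the query words from the candidates matters: the nearest vector to the target is often one of the inputs.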

## Demo
Build a co-occurrence matrix from a toy corpus, derive embeddings with truncated SVD, and retrieve nearest neighbors, echoing the lecture (https://youtu.be/THhRQ7UQm70).

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy corpus: one lowercase, whitespace-tokenized sentence per string.
corpus = [
    'she is a skilled doctor and compassionate leader',
    'he is a brilliant engineer and creative designer',
    'the nurse offered patient support and kindness',
    'the manager coordinated the project with precision',
    'artists create inspiring work with emotion and style',
    'scientists test hypotheses with rigorous experiments',
    'teachers guide students with patience and care',
    'the programmer solved complex problems quickly'
]

# Vocabulary and word -> row index lookup.
vocab = sorted(set(' '.join(corpus).split()))
word_to_id = {word: idx for idx, word in enumerate(vocab)}

# Count co-occurrences within +/- `window` tokens of each target word.
window = 2
cooc = np.zeros((len(vocab), len(vocab)), dtype=float)
for sentence in corpus:
    tokens = sentence.split()
    for idx, token in enumerate(tokens):
        target_id = word_to_id[token]
        for j in range(max(0, idx - window), min(len(tokens), idx + window + 1)):
            if j == idx:
                continue
            cooc[target_id, word_to_id[tokens[j]]] += 1

# Row-normalize counts into context distributions; all-zero rows stay zero.
row_sums = cooc.sum(axis=1, keepdims=True)
cooc_norm = np.divide(cooc, row_sums, out=np.zeros_like(cooc), where=row_sums > 0)

# Reduce each row to a 5-dimensional dense embedding.
svd = TruncatedSVD(n_components=5, random_state=0)
embeddings = svd.fit_transform(cooc_norm)

ids_to_word = {idx: word for word, idx in word_to_id.items()}

def nearest_neighbors(word, top_k=4):
    """Return the top_k (word, cosine) pairs closest to `word`; empty if out of vocabulary."""
    if word not in word_to_id:
        return []
    idx = word_to_id[word]
    vec = embeddings[idx]
    sims = embeddings @ vec
    denom = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(vec)
    cos = sims / (denom + 1e-8)  # epsilon guards against zero-norm rows
    ranking = cos.argsort()[::-1]
    neighbors = [(ids_to_word[i], cos[i]) for i in ranking if ids_to_word[i] != word]
    return neighbors[:top_k]

# Query a few words; out-of-vocabulary words return no neighbors.
for word in ['doctor', 'engineer', 'teacher', 'artist']:
    print(f"Nearest to '{word}':")
    for neighbor, score in nearest_neighbors(word):
        print(f"  {neighbor:>10s} | cos={score:.3f}")
    print()
```


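```text
Nearest to 'doctor':
    engineer | cos=1.000
     skilled | cos=0.944
   brilliant | cos=0.944
    creative | cos=0.843

Nearest to 'engineer':
      doctor | cos=1.000
     skilled | cos=0.944
   brilliant | cos=0.944
    creative | cos=0.842

Nearest to 'teacher':

Nearest to 'artist':
```

The last two queries print no neighbors because `teacher` and `artist` do not occur in the corpus; only the plural forms `teachers` and `artists` do.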
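
The demo row-normalizes raw counts before the SVD, while the key ideas also name PPMI as a count-based weighting. The following is a minimal sketch of that alternative, not part of the lecture code. It assumes a small hand-written count matrix (`counts`) whose rows are target words and whose columns are context words, and a hypothetical helper named `ppmi`.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information for a co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return np.zeros_like(counts)
    p_joint = counts / total                       # P(word, context)
    p_word = p_joint.sum(axis=1, keepdims=True)    # P(word), row marginals
    p_ctx = p_joint.sum(axis=0, keepdims=True)     # P(context), column marginals
    expected = p_word * p_ctx                      # joint probability under independence
    pmi = np.zeros_like(counts)
    mask = counts > 0                              # skip log(0) for unseen pairs
    pmi[mask] = np.log2(p_joint[mask] / expected[mask])
    return np.maximum(pmi, 0.0)                    # clip negative PMI to zero

# Hypothetical counts: each word co-occurs mostly with "its own" context.
counts = np.array([
    [10, 1],
    [1, 10],
])
weights = ppmi(counts)
print(np.round(weights, 3))
```

Here the off-diagonal pairs occur less often than independence would predict, so their PMI is negative and is clipped to zero. In the demo, `ppmi(cooc)` could be passed to `TruncatedSVD` in place of `cooc_norm`.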



## Try it
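- Modify the demo: change `window` or `n_components` and compare the nearest neighbors.
- Add a tiny dataset or counter-example, such as a word that appears in two unrelated contexts.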
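
As one possible starting point for these exercises, the sketch below wraps the demo's counting step in a function with a `window` parameter and runs it on a tiny two-sentence dataset. The helper name `cooc_counts` and the sentences are assumptions made for illustration.

```python
import numpy as np

def cooc_counts(sentences, window):
    """Co-occurrence counts within +/- `window` tokens, mirroring the demo loop."""
    vocab = sorted({tok for s in sentences for tok in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        tokens = s.split()
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[tok], index[tokens[j]]] += 1
    return vocab, counts

# Hypothetical tiny dataset: "bank" appears in two unrelated contexts.
tiny = [
    "the river bank flooded",
    "the bank approved loans",
]

# Print the context words seen around "bank" for two window sizes.
for window in (1, 2):
    vocab, counts = cooc_counts(tiny, window)
    row = counts[vocab.index("bank")]
    contexts = {w: int(c) for w, c in zip(vocab, row) if c > 0}
    print(f"window={window}: {contexts}")
```

Both senses of "bank" end up in the same row, so a single vector has to average over them. That makes this a small counter-example to the idea that one embedding captures one meaning.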


## References
- [Eisenstein 14.5](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)
- [A Scalable Hierarchical Distributed Language Model](https://papers.nips.cc/paper/2008/hash/1e056d2b0ebd5c878c550da6ac5d3724-Abstract.html)
- [Neural Word Embedding as Implicit Matrix Factorization](https://papers.nips.cc/paper/2014/file/feab05aa91085b7a8012516bc3533958-Paper.pdf)
- [GloVe: Global Vectors for Word Representation](https://www.aclweb.org/anthology/D14-1162/)
- [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606)
- [Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings](https://papers.nips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf)
- [Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings](https://www.aclweb.org/anthology/N19-1062/)
- [Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them](https://www.aclweb.org/anthology/N19-1061/)
- [Deep Unordered Composition Rivals Syntactic Methods for Text Classification](https://www.aclweb.org/anthology/P15-1162/)


*Links only; we do not redistribute slides or papers.*