# Verifying "Analogies Explained" via Synthetic Data

This notebook checks the paper's PMI decomposition empirically, using synthetic co-occurrence data in which paraphrases hold by design.

**Core equation:** For a word $w^*$ that paraphrases a word set $W$ (e.g., king = {man, royalty}) and any context word $c$:

$$\text{PMI}(c, w^*) = \sum_{w \in W} \text{PMI}(c, w) + \rho(c) + \sigma(c) - \tau$$
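
For reference, the three error terms can be written out as below. This is a sketch read off from how `verify` computes `rho`, `sigma` and `tau` later in the notebook, so treat it as a summary of this notebook's conventions rather than a quotation from the paper:

$$\rho(c) = \log \frac{p(c \mid w^*)}{p(c \mid W)}, \qquad \sigma(c) = \log \frac{p(W \mid c)}{\prod_{w \in W} p(w \mid c)}, \qquad \tau = \log \frac{p(W)}{\prod_{w \in W} p(w)}$$

Substituting Bayes' rule $p(W \mid c) = p(c \mid W)\,p(W)/p(c)$ into $\sigma$ cancels the $p(c \mid W)$ and $p(W)$ factors in $\rho(c) + \sigma(c) - \tau$, leaving $\text{PMI}(c, w^*) - \sum_{w \in W} \text{PMI}(c, w)$. The core equation is therefore an exact identity, not an approximation.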

In [1]:
import numpy as np
import random
from collections import defaultdict

**Data Generation.** 

This simulation creates a synthetic corpus where paraphrase structure is **explicitly embedded by design**:

- **Target words**: `man`, `woman`, `king`, `queen`, `<royalty>`
- **Context pools**: Each target has its own context distribution (e.g., `king` draws from `MAN + ROYALTY`)
- **Paraphrase pairs**: `P(c|king) ≈ P(c|man, royalty)` and `P(c|queen) ≈ P(c|woman, royalty)` by construction

Everything we count here is used later to compute the probabilities we need, e.g. `P(W|c)`, `P(W)`, and `P(c|W)` for `W=(man, royalty)`.
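
As a rough illustration of how raw counts turn into these probabilities, here is a minimal unsmoothed sketch. The counts are invented toy numbers and the variable names (`n_windows`, `n_W`, `n_cW`, `n_c`) are hypothetical, not taken from the notebook.

```python
# Toy counts, invented for illustration only.
n_windows = 1200   # total number of windows
n_W = 100          # windows containing the pair W = (man, royalty)
n_cW = 20          # windows containing W in which context word c appears
n_c = 60           # windows in which context word c appears at all

p_W = n_W / n_windows        # P(W)
p_c_given_W = n_cW / n_W     # P(c|W)
p_c = n_c / n_windows        # P(c)

# Bayes' rule gives P(W|c) from the three quantities above.
p_W_given_c = p_c_given_W * p_W / p_c

# Cross-check against the direct estimate from counts.
assert abs(p_W_given_c - n_cW / n_c) < 1e-12
print(p_W, p_c_given_W, p_W_given_c)
```

The notebook follows the same pattern, with smoothing added on top.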

> I chose x = 1/6 to make τ = 0, i.e. marginal independence: p(man, royalty) = p(man)·p(royalty). Here x is the probability that a window contains one given single target, so p = 5x is the probability of sampling a single target rather than a pair. Then:
> - p(man) = p/5 + (1-p)/2
> - p(royalty) = p/5 + (1-p)
> - p(man, royalty) = (1-p)/2
> 
> Solving p(man)·p(royalty) = p(man, royalty) gives x = 1/6.
>
> Note: This choice is convenient but not essential. Due to symmetry in our construction (man/woman and king/queen are parallel), we have τ_king ≈ τ_queen, so their difference cancels in the analogy regardless of the individual τ values.
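
The algebra above can be checked exactly with rational arithmetic. This is a minimal sketch; the helper name `tau_gap` is hypothetical and only mirrors the three formulas in the note.

```python
from fractions import Fraction

def tau_gap(x):
    """Return p(man, royalty) - p(man) * p(royalty) for a given x."""
    p = 5 * x                        # probability of a single-target window
    p_man = p / 5 + (1 - p) / 2      # man alone, or man paired with royalty
    p_royalty = p / 5 + (1 - p)      # royalty alone, or in either pair
    p_man_royalty = (1 - p) / 2      # the (man, royalty) pair window
    return p_man_royalty - p_man * p_royalty

print(tau_gap(Fraction(1, 6)))    # gap at the chosen x
print(tau_gap(Fraction(1, 10)))   # gap at a different x
```

The gap simplifies to x − 6x², which vanishes only at x = 0 and x = 1/6, so any other x leaves man and royalty marginally dependent.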

In [2]:
def generate_data(n_windows=100_000, x=1/6, seed=42, scale=10):
    np.random.seed(seed)
    random.seed(seed)
    
    NEUTRAL = [f"neutral_{i}" for i in range(10 * scale)]
    MAN = [f"man_{i}" for i in range(4 * scale)]
    WOMAN = [f"woman_{i}" for i in range(4 * scale)]
    ROYALTY = [f"royalty_{i}" for i in range(2 * scale)]
    
    CONTEXTS = {
        'man': MAN + NEUTRAL,
        'woman': WOMAN + NEUTRAL,
        'king': MAN + ROYALTY,
        'queen': WOMAN + ROYALTY,
        '<royalty>': ROYALTY,
        ('<royalty>', 'man'): MAN + ROYALTY,
        ('<royalty>', 'woman'): WOMAN + ROYALTY,
    }
    TARGETS = ['man', 'woman', 'king', 'queen', '<royalty>']
    
    counts = defaultdict(int)
    paired_counts = defaultdict(int)
    target_conditioned_counts = defaultdict(lambda: defaultdict(int))
    
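    for _ in range(n_windows):
        if random.random() < 5 * x:
            # Single-target window: each of the 5 targets has probability x.
            targets = [random.choice(TARGETS)]
            ctx_key = targets[0]
        else:
            # Paraphrase window: '<royalty>' co-occurs with 'man' or 'woman'.
            targets = random.choice([['man', '<royalty>'], ['woman', '<royalty>']])
            ctx_key = tuple(sorted(targets))
            paired_counts[ctx_key] += 1
        
        # random.sample draws without replacement, so the 4 context words are already distinct.
        context = random.sample(CONTEXTS[ctx_key], k=4)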
        
        for w0 in targets:
            counts[w0] += 1
            for w1 in context:
                paired_counts[tuple(sorted((w0, w1)))] += 1
        
        for i, w1 in enumerate(context):
            counts[w1] += 1
            target_conditioned_counts[ctx_key][w1] += 1
            for j in range(i+1, len(context)):
                w2 = context[j]
                paired_counts[tuple(sorted((w1, w2)))] += 1
                
    return counts, paired_counts, target_conditioned_counts

**PMI and Embeddings**
> Remember, $W$ is the word set being paraphrased and $w$ is the target word that paraphrases it, e.g., `W=(man, royalty)` and `w=king`.

**Comment on smoothing.** Vanilla PMI hits $\log(0)$ whenever a context word $c$ never co-occurs with a given word. Laplace $\alpha$-smoothing, $p(c) = \frac{N_c + \alpha}{N + \alpha \cdot V}$ (where $N_c$ is the frequency of $c$ and $V$ is the vocabulary size), prevents this but introduces **bias** if we use the same $\alpha$ for counts of different sizes. These small biases add up to a significant error over the many neutral tokens $c$ that never co-occur with `king` or `royalty`. To fix this, we can use what we know about the simulation to derive consistent normalization formulas.

With smoothing on pairs, $p(c,w) = \frac{N_{cw} + \alpha}{N + \alpha V}$, and with $p(w) \approx x$, the conditional distribution is $p(c|w) = \frac{N_{cw} + \alpha}{xN + x\alpha V}$. Since we know the proportions $N_W/N_w = 1/2$ and $N_w/N = x = 1/6$, consistency on every $c$ with $N_{cw} = N_{cW} = 0$ requires the following normalization for $p(c|W)$:
$$p(c|W) = \frac{0.5\alpha}{N_W + \frac{1}{12} \alpha V} = \frac{0.5\alpha}{0.5 xN + 0.5 x \alpha V} = \frac{\alpha}{xN + x \alpha V} = p(c|w)$$

This keeps smoothing proportionally consistent between a target word $w$ and its paraphrase set $W$ on context words that never occur with either ($N_{cw} = N_{cW} = 0$), preventing spurious bias in the $\sigma$ term.
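
The consistency claim can be checked with exact arithmetic for a zero-count context word. This is a sketch under stated assumptions: `V = 305` is an illustrative vocabulary size, the counts are the expected values $N_w = xN$ and $N_W = N_w/2$ rather than sampled ones, and the "naive" variant is one guess at what reusing the same $\alpha$ and $V$ would look like.

```python
from fractions import Fraction

N = 100_000            # number of windows
V = 305                # illustrative vocabulary size (assumption)
alpha = 1
x = Fraction(1, 6)

N_w = x * N            # expected windows containing the target w (e.g. king)
N_W = N_w / 2          # expected windows containing the pair W

# Zero-count context word: N_cw = N_cW = 0.
p_c_given_w = Fraction(alpha) / (x * N + x * alpha * V)
p_c_given_W_consistent = (Fraction(1, 2) * alpha) / (N_W + Fraction(1, 12) * alpha * V)
p_c_given_W_naive = Fraction(alpha) / (N_W + alpha * V)   # same alpha, full V

print(p_c_given_W_consistent == p_c_given_w)      # consistent smoothing matches p(c|w)
print(float(p_c_given_W_naive / p_c_given_w))     # naive smoothing is off by this factor
```

Under these assumptions the naive estimate is nearly twice $p(c|w)$, a mismatch that would repeat on every context word unseen with both $w$ and $W$.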

In [3]:
def build_all(counts, paired_counts, target_conditioned_counts, total, alpha=1, rank=50):
    vocab = sorted(counts.keys())
    word2idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    
    p_w = np.array([(counts[w] + alpha) / (total + alpha * n) for w in vocab])
    
    p_cw = np.zeros((n, n))
    for (w1, w2), cnt in paired_counts.items():
        i, j = word2idx[w1], word2idx[w2]
        p_cw[i, j] = cnt
        p_cw[j, i] = cnt
    p_cw = (p_cw + alpha) / (total + alpha * n)
    
    PMI = np.log(p_cw / (p_w[:, None] * p_w[None, :]))
    
    PAIRS = [('<royalty>', 'man'), ('<royalty>', 'woman')]
    p_W = {pair: paired_counts[pair] / total for pair in PAIRS}
    
    p_c_given_W = {}
    for pair in PAIRS:
        tc = target_conditioned_counts[pair]
        sz = paired_counts[pair]
        p_c_given_W[pair] = {c: (tc.get(c, 0) + 0.5 * alpha) / (sz + alpha * n / 12) for c in vocab}
    
    U, S, Vt = np.linalg.svd(PMI, full_matrices=False)
    sqrt_S = np.sqrt(np.abs(S[:rank])) * np.sign(S[:rank])
    W = (U[:, :rank] * sqrt_S).T
    C = (Vt[:rank, :].T * sqrt_S).T
    C_dag = np.linalg.pinv(C.T)
    
    return PMI, p_cw, p_w, p_c_given_W, p_W, W, C_dag, word2idx

**Verification**

Let's compute all the error terms and check that the observed difference between vectors, in both PMI space and embedding space, is explained by the terms from the paper, and that linear analogies really emerge.
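
One reason the same error terms explain both spaces: with the factorization used in `build_all`, `C_dag` sends each PMI column to its embedding, and being linear, it sends differences to differences. The sketch below illustrates this on a small random symmetric matrix standing in for PMI; the matrix, its size, and the rank are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
M = A + A.T                       # symmetric stand-in for the PMI matrix
rank = 3

U, S, Vt = np.linalg.svd(M, full_matrices=False)
sqrt_S = np.sqrt(S[:rank])
W = (U[:, :rank] * sqrt_S).T      # rank x n, word embeddings as columns
C = (Vt[:rank, :].T * sqrt_S).T   # rank x n, context embeddings as columns
C_dag = np.linalg.pinv(C.T)       # rank x n projection

# Projecting the columns of M recovers the embeddings.
recovers = np.allclose(C_dag @ M, W)

# Linearity: the projected difference equals the difference of embeddings.
i, j, k = 0, 1, 2
diff_M = M[:, i] - M[:, j] - M[:, k]
diff_W = W[:, i] - W[:, j] - W[:, k]
linear = np.allclose(C_dag @ diff_M, diff_W)
print(recovers, linear)
```

Both checks should come out true, which is why a zero residual in PMI space carries over to a zero residual in embedding space.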

In [4]:
def verify(PMI, p_cw, p_w, p_c_given_W, p_W, W, C_dag, word2idx):
    n = len(word2idx)
    idx = word2idx
    vocab = {i: w for w, i in idx.items()}
    log = lambda x: np.log(np.maximum(x, 1e-15))
    
    # p_given_c[w] is the vector p(c|w) over all c; entry idx[v] of p_given_c[c] is p(v|c).
    p_given_c = {w: p_cw[:, i] / p_w[i] for w, i in idx.items()}
    
    PAIRS = [('<royalty>', 'man'), ('<royalty>', 'woman')]
    p_cW = {pair: np.array([p_c_given_W[pair][vocab[i]] for i in range(n)]) for pair in PAIRS}
    p_Wc = {pair: p_cW[pair] * p_W[pair] / p_w for pair in PAIRS}
    
    CASES = [('king', 'man', '<royalty>', ('<royalty>', 'man')),
             ('queen', 'woman', '<royalty>', ('<royalty>', 'woman'))]
    
    for space_name, vecs, project, s in [("PMI space", PMI, lambda x: x, "I"),
                                         ("Embedding space", W, lambda x: C_dag @ x, "C†")]:
        print(f"\n{space_name}")
        print("=" * 50)
        for target, w1, w2, pair in CASES:
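            # Observed difference: the target's vector minus the vectors of the two paraphrase words.
            obs = vecs[:, idx[target]] - vecs[:, idx[w1]] - vecs[:, idx[w2]]
            # rho(c) = log p(c|w*) - log p(c|W): paraphrase error.
            rho = log(p_given_c[target]) - log(p_cW[pair])
            p_w1c = np.array([p_given_c[c][idx[w1]] for c in idx])
            p_w2c = np.array([p_given_c[c][idx[w2]] for c in idx])
            # sigma(c) = log p(W|c) - log p(w1|c) - log p(w2|c): dependence of w1, w2 given c.
            sigma = log(p_Wc[pair]) - log(p_w1c) - log(p_w2c)
            # tau = log p(W) - log p(w1) - log p(w2): marginal dependence of w1, w2 (a scalar).
            tau = log(p_W[pair]) - log(p_w[idx[w1]]) - log(p_w[idx[w2]])
            residual = np.linalg.norm(obs - project(rho + sigma - tau))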
            print(f"{target} = {w1} + {w2}")
            print(f"  ||ρ||={np.linalg.norm(rho):.2f}, ||σ||={np.linalg.norm(sigma):.2f}, |τ|={np.abs(tau):.4f}")
            print(f"  residual: {residual:.6f}")
    
    print("\n" + "=" * 50)
    print("ANALOGY: king - man + woman → ?")
    analogy = W[:, idx['king']] - W[:, idx['man']] + W[:, idx['woman']]
    dists = [np.linalg.norm(analogy - W[:, i]) for i in range(n)]
    for r, i in enumerate(np.argsort(dists)[:5]):
        print(f"  {r+1}. {vocab[i]:<12} dist={dists[i]:.4f}")

In [5]:
N = 100_000
counts, paired_counts, target_conditioned_counts = generate_data(n_windows=N, seed=42, scale=15)
PMI, p_cw, p_w, p_c_given_W, p_W, W, C_dag, word2idx = build_all(
    counts, paired_counts, target_conditioned_counts, N, rank=5
)
verify(PMI, p_cw, p_w, p_c_given_W, p_W, W, C_dag, word2idx)


PMI space
king = man + <royalty>
  ||ρ||=0.56, ||σ||=15.08, |τ|=0.0079
  residual: 0.000000
queen = woman + <royalty>
  ||ρ||=0.64, ||σ||=15.10, |τ|=0.0166
  residual: 0.000000

Embedding space
king = man + <royalty>
  ||ρ||=0.56, ||σ||=15.08, |τ|=0.0079
  residual: 0.000000
queen = woman + <royalty>
  ||ρ||=0.64, ||σ||=15.10, |τ|=0.0166
  residual: 0.000000

ANALOGY: king - man + woman → ?
  1. queen        dist=0.0585
  2. <royalty>    dist=3.1119
  3. royalty_9    dist=3.7952
  4. royalty_15   dist=3.7968
  5. royalty_1    dist=3.7990


**Results:**

With proper construction we achieved:
- $\|\rho\| \approx 0.6$, small next to $\|\sigma\| \approx 15$: the paraphrase quality error is close to zero
- $|\tau| \approx 0$ (0.008 and 0.017): the words in each paraphrase are marginally independent
- Residual $= 0$ (to six decimal places): the error terms from the paper fully account for the observed vector differences, in both PMI space and embedding space
- Analogy "king - man + woman" → queen ranks #1: linear analogies emerge as predicted

Note: $\sigma$ isn't zero and it's very hard to make it so. Try inventing custom sampling that achieves $\sigma \approx 0$ to understand why it's so challenging :)

**PS:** $x$ doesn't actually have to lead to $\tau = 0$. Thanks to symmetry in our data, we would still have zero residual even if $\tau \neq 0$ (we would only face some difficulties with smoothing and with making $\rho \approx 0$).