
# Word2Vec on WikiText‑2: End‑to‑End Walkthrough

This notebook walks you step‑by‑step through:
1. **Loading the WikiText‑2 dataset** with [🤗 Datasets](https://huggingface.co/docs/datasets).
2. **Preprocessing and tokenization** into sentences of tokens.
3. **Creating Skip‑Gram pairs** manually (for intuition and inspection).
4. **Training Word2Vec (Skip‑Gram)** with Gensim to learn distributional word representations.
5. **Quick sanity‑checks & visualization** of learned semantics.

> **Note:** You need an internet connection for the dataset download and `pip install` commands when you run this notebook locally.



## 0. Setup

Install the required packages (run this cell if you're in a fresh environment).


In [None]:

# If running on a clean environment, uncomment to install:
# !pip install -U datasets gensim matplotlib numpy
# (Optional) If you want to try a tiny PyTorch baseline later:
# !pip install torch



## 1. Load WikiText‑2
We'll use the **raw** variant (`wikitext-2-raw-v1`) to handle our own tokenization.


In [None]:

from datasets import load_dataset

# This will download on first run and cache afterward
ds = load_dataset("wikitext", "wikitext-2-raw-v1")

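# Inspect splits and a sample
print(ds)
print("\nExample train text snippet:")
# Some rows are blank lines, so show the first non-empty one
first_text = next(ex["text"] for ex in ds["train"] if ex["text"].strip())
print(first_text[:500])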



## 2. Preprocess & Tokenize

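We'll do a **very light** tokenization:
- Lowercase
- Split on whitespace
- Drop empty lines

Each non-empty line of the dataset (a heading or a paragraph) becomes one "sentence", so context windows can cross real sentence boundaries.
For production, consider a robust tokenizer (e.g., spaCy, Hugging Face tokenizers).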


In [None]:

def to_sentences(dataset_split):
    sentences = []
    for ex in dataset_split:
        text = ex["text"].strip().lower()
        if not text:
            continue
        # Very simple tokenization; you can replace with a better one
        toks = text.split()
        if toks:
            sentences.append(toks)
    return sentences

train_sentences = to_sentences(ds["train"])
valid_sentences = to_sentences(ds["validation"])
test_sentences  = to_sentences(ds["test"])

len(train_sentences), len(valid_sentences), len(test_sentences), train_sentences[:3]
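
As a middle ground between `str.split()` and a full tokenizer library, a regular expression can separate punctuation from words. The sketch below is only an illustration: `TOKEN_RE` and `regex_tokenize` are hypothetical names, and the pattern is an assumption that handles lowercase ASCII words, digits, and internal apostrophes.

```python
import re

# Assumed pattern: runs of letters/digits (optionally joined by apostrophes),
# or any single character that is neither whitespace nor alphanumeric.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*|[^\sa-z0-9]")

def regex_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

sample = "The game's producer, Sega, released it in 2011."
print(regex_tokenize(sample))
```

To try it in this notebook, you could swap `text.split()` for `regex_tokenize(text)` inside `to_sentences`.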



## 3. (Optional) Build Vocabulary

If you want to **manually** create Skip‑Gram pairs and analyze IDs, it's handy to build a vocabulary.


In [None]:

from collections import Counter

# Count words on train sentences only
counter = Counter(w for sent in train_sentences for w in sent)

# Prune rare words to keep the vocab manageable (optional)
min_count = 5
vocab = [w for w, c in counter.items() if c >= min_count]
vocab = sorted(vocab)

tok2id = {w: i for i, w in enumerate(vocab)}
id2tok = {i: w for w, i in tok2id.items()}

len(vocab), list(tok2id.items())[:10]
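
Before settling on a `min_count`, it can help to check how many token occurrences the pruned vocabulary still covers. The sketch below uses a made-up toy corpus in place of `train_sentences`; the names `toy_sentences`, `toy_min_count`, and `coverage` are illustrative only.

```python
from collections import Counter

# Toy corpus standing in for train_sentences (illustrative data only).
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

toy_counter = Counter(w for sent in toy_sentences for w in sent)
toy_min_count = 2  # hypothetical threshold for this toy example
toy_vocab = {w for w, c in toy_counter.items() if c >= toy_min_count}

# Fraction of token occurrences that survive pruning.
total_tokens = sum(toy_counter.values())
kept_tokens = sum(c for w, c in toy_counter.items() if w in toy_vocab)
coverage = kept_tokens / total_tokens

print(sorted(toy_vocab))
print(f"token coverage: {coverage:.2%}")
```

Pruning usually removes many word types but few token occurrences, which is why a small threshold shrinks the vocabulary a lot while leaving most of the text intact.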



## 4. Create Skip‑Gram Pairs (Manually)

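We create `(target, context)` pairs from a window around each token.  
This is for **intuition/inspection** only: Gensim's `Word2Vec` trains Skip‑Gram directly from sentences (`sg=1`). Tokens that were pruned from the vocabulary produce no pairs, but they still occupy a position in the window.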


In [None]:

from typing import List

WINDOW = 5  # context window on each side

def skip_grams_from_sentences(sentences: List[List[str]], window: int = WINDOW):
    pairs = []  # [[target_id, context_id], ...]
    for sent in sentences:
        for i, wd in enumerate(sent):
            tid = tok2id.get(wd)
            if tid is None:
                continue
            left = max(0, i - window)
            right = min(len(sent), i + window + 1)
            for j in range(left, right):
                if j == i:
                    continue
                cid = tok2id.get(sent[j])
                if cid is not None:
                    pairs.append([int(tid), int(cid)])
    return pairs

# WARNING: Creating all pairs for the whole corpus can be large.
# For demonstration, we'll take a small subset of sentences.
subset = train_sentences[:2000]  # tweak as needed
pairs_demo = skip_grams_from_sentences(subset, window=WINDOW)

len(pairs_demo), pairs_demo[:10]
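
To see the windowing on something small enough to check by hand, here is a string-based sketch of the same idea. `toy_skip_grams` is a hypothetical helper that keeps tokens as strings instead of IDs and does no vocabulary filtering.

```python
def toy_skip_grams(tokens, window):
    """Return (target, context) string pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = max(0, i - window)
        right = min(len(tokens), i + window + 1)
        for j in range(left, right):
            if j != i:  # a token is never its own context
                pairs.append((target, tokens[j]))
    return pairs

toy_pairs = toy_skip_grams(["the", "cat", "sat", "down"], window=1)
for pair in toy_pairs:
    print(pair)
```

With `window=1`, edge tokens have one neighbor and inner tokens have two, so this four-token sentence yields six pairs.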



## 5. Train Word2Vec (Skip‑Gram) with Gensim

Gensim’s `Word2Vec` can train Skip‑Gram internally by setting `sg=1`.  
We feed it **sentences** (lists of tokens); it handles windowing and negative sampling under the hood.


In [None]:

from gensim.models import Word2Vec

# Hyperparameters
vector_size = 100
window = 5           # context window size
min_count = 5        # same threshold as the manual vocabulary in Section 3
workers = 4          # adjust to your CPU cores
sg = 1               # 1 = Skip-Gram, 0 = CBOW
negative = 10        # number of negative samples
epochs = 5

w2v = Word2Vec(
    sentences=train_sentences,
    vector_size=vector_size,
    window=window,
    min_count=min_count,
    workers=workers,
    sg=sg,
    negative=negative,
    epochs=epochs,
)

# Access the keyed vectors
kv = w2v.wv
print("Vocab size learned:", len(kv))
print("Vector for 'language' (if present):", kv["language"][:10] if "language" in kv else "not in vocab")
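
Negative sampling draws "noise" words from the unigram distribution raised to a power; in Gensim this is the `ns_exponent` parameter, which defaults to 0.75. The sketch below illustrates that smoothing on made-up counts. It is not Gensim's internal implementation, and `toy_counts` and `noise_probs` are illustrative names.

```python
from collections import Counter
import random

# Made-up counts: one very frequent word and one rare word.
toy_counts = Counter({"the": 16, "city": 1})
exponent = 0.75  # same value as Gensim's default ns_exponent

# Raise counts to the exponent, then normalize to probabilities.
weights = {w: c ** exponent for w, c in toy_counts.items()}
total = sum(weights.values())
noise_probs = {w: wt / total for w, wt in weights.items()}

raw_total = sum(toy_counts.values())
for w in toy_counts:
    print(f"{w:5s} raw={toy_counts[w] / raw_total:.3f} smoothed={noise_probs[w]:.3f}")

# Draw a few negatives from the smoothed distribution (seeded for repeatability).
sampler = random.Random(0)
negatives = sampler.choices(list(noise_probs), weights=list(noise_probs.values()), k=5)
print(negatives)
```

The exponent flattens the distribution, so rare words are sampled as negatives somewhat more often than their raw frequency would suggest.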



## 6. Quick Sanity Checks: Most Similar Words


In [None]:

query_words = ["language", "research", "city", "music", "war"]
for q in query_words:
    if q in kv:
        print(f"\nMost similar to '{q}':")
        for w, score in kv.most_similar(q, topn=10):
            print(f"  {w:15s}  {score:.3f}")
    else:
        print(f"\n'{q}' not in vocabulary.")
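
The scores printed by `most_similar` are cosine similarities between word vectors. As a rough illustration of that ranking, the sketch below computes it by hand on made-up 3‑dimensional vectors; `toy_vectors` and `cosine_similarity` are illustrative names, not part of Gensim.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors, chosen so the arithmetic is easy to follow.
toy_vectors = {
    "music": np.array([1.0, 0.0, 0.0]),
    "song": np.array([0.8, 0.6, 0.0]),
    "war": np.array([0.0, 0.0, 1.0]),
}

query = "music"
ranked = sorted(
    ((w, cosine_similarity(toy_vectors[query], v))
     for w, v in toy_vectors.items() if w != query),
    key=lambda pair: pair[1],
    reverse=True,
)
for w, score in ranked:
    print(f"{w:6s} {score:.3f}")
```

Here "song" ranks first with a similarity of 0.8, while "war" is orthogonal to "music" and scores 0.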



## 7. (Optional) Visualize Embeddings (PCA)

We'll pick a small list of words and project their vectors to 2D using PCA for a quick plot.


In [None]:

import numpy as np
import matplotlib.pyplot as plt

def plot_words(words):
    present = [w for w in words if w in kv]
    if not present:
        print("None of the chosen words are in the vocabulary yet.")
        return
    X = np.vstack([kv[w] for w in present])

    # PCA via SVD: center the vectors, then project onto the top-2 principal directions
    X = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    X2d = X @ Vt[:2].T

    plt.figure(figsize=(6, 6))
    plt.scatter(X2d[:, 0], X2d[:, 1])
    for i, w in enumerate(present):
        plt.annotate(w, (X2d[i, 0], X2d[i, 1]))
    plt.title("Word2Vec Embeddings (2D PCA)")
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()

plot_words(["king", "queen", "man", "woman", "london", "paris", "music", "art", "science", "research"])
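
A 2D projection discards most of a 100‑dimensional space, so it is worth checking how much variance the first two components keep. The sketch below uses random stand-in data rather than the trained vectors; `X_toy` and `explained` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 100))  # random stand-in for 10 word vectors

# Same centering + SVD as in plot_words.
X_toy = X_toy - X_toy.mean(axis=0, keepdims=True)
_, S_toy, _ = np.linalg.svd(X_toy, full_matrices=False)

# Squared singular values are proportional to the variance along each component.
explained = S_toy**2 / np.sum(S_toy**2)
print(f"PC1 + PC2 explain {explained[:2].sum():.1%} of the variance")
```

If the two components explain only a small share, treat distances in the plot as a loose hint and rely on `most_similar` for actual neighbors.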



## 8. Notes: Manual Skip‑Grams vs Gensim Training

- The list `pairs_demo` we created earlier is **for understanding/inspection** only.
- Gensim’s `Word2Vec` expects **sentences of tokens** and handles windowing, negative sampling, and training internally when `sg=1`.
- If you want to implement Skip‑Gram **from scratch**, you can use `pairs_demo` to drive a small PyTorch training loop with negative sampling. For practical use, Gensim is optimized and much faster.
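
To make the from-scratch idea concrete, here is a minimal sketch of a single Skip‑Gram negative-sampling update. It uses NumPy instead of PyTorch to avoid an extra dependency, and everything in it is an assumption for illustration: the function name `sgns_step`, the tiny matrix sizes, the learning rate, and the fixed negative IDs (in practice they would be sampled).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, target, context, negatives, lr=0.1):
    """Apply one SGD update for a (target, context) pair; return the loss before the update."""
    ids = np.concatenate(([context], negatives)).astype(int)
    labels = np.zeros(len(ids))
    labels[0] = 1.0                      # the true context is the only positive
    v = W_in[target].copy()              # target ("input") vector
    u = W_out[ids]                       # context + negative ("output") vectors
    probs = sigmoid(u @ v)
    loss = -np.log(probs[0]) - np.sum(np.log(1.0 - probs[1:]))
    grad = probs - labels                # gradient of the loss w.r.t. each dot product
    W_in[target] -= lr * (grad @ u)
    # subtract.at accumulates correctly even if an ID appears more than once
    np.subtract.at(W_out, ids, lr * np.outer(grad, v))
    return float(loss)

rng = np.random.default_rng(0)
vocab_size, dim = 6, 8                   # toy sizes
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = np.zeros((vocab_size, dim))

# Repeat the update on one toy pair with two fixed negatives.
losses = [sgns_step(W_in, W_out, target=0, context=1, negatives=[3, 4]) for _ in range(100)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A real loop would iterate over all of `pairs_demo`, draw fresh negatives for each pair, and run for several epochs.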
