# Word Embeddings from Scratch (Skip-gram, Negative Sampling, GloVe)

**Name:** Prabidhi Pyakurel  
**Task:** Word Embeddings from Scratch + Evaluation + Web Application

## Datasets Used
- **Reuters-21578 corpus** (via NLTK)  
  David D. Lewis, 1997
- **Word Analogy Dataset**  
  Mikolov et al., 2013
- **WordSim-353**  
  Finkelstein et al., 2001

All embedding models in this notebook are implemented **from scratch** without using any pretrained vectors.


## TASK 1

This task implements and trains three word embedding models:
1. Skip-gram with full softmax
2. Skip-gram with Negative Sampling
3. GloVe

The models are trained on the Reuters-21578 corpus after basic text cleaning, tokenization, and vocabulary construction.

In [78]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import re
from collections import Counter
import math
import random
import nltk
from nltk.corpus import reuters

In [79]:
#loading reuters dataset
nltk.download("reuters")
nltk.download("punkt")

def load_reuters():
    fileids = reuters.fileids()
    corpus = []
    for fid in fileids:
        words = [w.lower() for w in reuters.words(fid)]
        words = [re.sub(r"[^a-z]", "", w) for w in words]
        words = [w for w in words if w]
        if len(words) > 5:
            corpus.append(words)
    return corpus

corpus = load_reuters()

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/prabidhi/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /Users/prabidhi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [80]:
# Vocabulary Building
flatten = lambda l: [item for sublist in l for item in sublist]
vocabs = list(set(flatten(corpus)))
vocabs.append('<UNK>')
word2index = {v: idx for idx, v in enumerate(vocabs)}
index2word = {idx: v for v, idx in word2index.items()}
voc_size = len(vocabs)

In [81]:
# Utility for sequences
def prepare_sequence(seq, word2index):
    idxs = list(map(lambda w: word2index[w] if word2index.get(w) is not None else word2index["<UNK>"], seq))
    return torch.LongTensor(idxs)

In [82]:
#dynamic window size
def get_skipgrams_dynamic(corpus, max_window=2):
    skipgrams = []
    for doc in corpus:
        for i in range(len(doc)):
            w = random.randint(1, max_window)
            for j in range(-w, w + 1):
                if j == 0 or i + j < 0 or i + j >= len(doc):
                    continue
                skipgrams.append([
                    word2index.get(doc[i], word2index['<UNK>']),
                    word2index.get(doc[i+j], word2index['<UNK>'])
                ])
    return skipgrams

In [83]:
def random_batch(batch_size, skipgrams):
    random_index = np.random.choice(range(len(skipgrams)), batch_size, replace=False)
    inputs, labels = [], []
    for index in random_index:
        inputs.append([skipgrams[index][0]])
        labels.append([skipgrams[index][1]])
    return np.array(inputs), np.array(labels)

### 1. Word2Vec - Without Negative Sampling

### Skip-gram with Full Softmax

The Skip-gram model predicts surrounding context words given a center word.
A full softmax is computed over the entire vocabulary, which makes training expensive.

In [84]:
import time

class Skipgram(nn.Module):
    def __init__(self, voc_size, emb_size):
        super(Skipgram, self).__init__()
        self.embedding_center  = nn.Embedding(voc_size, emb_size)
        self.embedding_outside = nn.Embedding(voc_size, emb_size)
    
    def forward(self, center, outside, all_vocabs):
        center_embedding     = self.embedding_center(center) 
        outside_embedding = self.embedding_outside(outside)
        all_vocabs_embedding = self.embedding_outside(all_vocabs)

        top_term = torch.exp(outside_embedding.bmm(center_embedding.transpose(1, 2)).squeeze(2))
        lower_term = all_vocabs_embedding.bmm(center_embedding.transpose(1, 2)).squeeze(2)
        lower_term_sum = torch.sum(torch.exp(lower_term), 1).reshape(-1, 1)
        
        loss = -torch.mean(torch.log(top_term / lower_term_sum))
        return loss

In [85]:
# Training Setup
emb_size = 50
batch_size = 64
window_size = 2 # DYNAMIC WINDOW SIZE

all_vocabs = prepare_sequence(list(vocabs), word2index).expand(batch_size, voc_size)

model_sg = Skipgram(voc_size, emb_size)
optimizer_sg = optim.Adam(model_sg.parameters(), lr=0.001)

print("Skipgram Without Negative Sampling")
skipgrams = get_skipgrams_dynamic(corpus, window_size)

start_sg = time.time()
for epoch in range(2000):
    input_batch, label_batch = random_batch(batch_size, skipgrams)
    input_tensor = torch.LongTensor(input_batch)
    label_tensor = torch.LongTensor(label_batch)
    
    loss = model_sg(input_tensor, label_tensor, all_vocabs)

    optimizer_sg.zero_grad()
    loss.backward()
    optimizer_sg.step()
    
    if (epoch + 1) % 500 == 0:
        print(f"Epoch {epoch+1} | Loss: {loss.item():.6f}")

final_loss_sg = loss.item()
time_sg = time.time() - start_sg

Skipgram Without Negative Sampling
Epoch 500 | Loss: 22.755888
Epoch 1000 | Loss: 19.811848
Epoch 1500 | Loss: 18.712788
Epoch 2000 | Loss: 19.172894


### 2. Word2Vec - Negative Sampling

### Skip-gram with Negative Sampling

Negative Sampling works by updating only a small number of negative samples per training step.
This significantly reduces computation and improves training speed while preserving semantic quality.


In [86]:
# Unigram distribution for negative sampling
z = 0.001

In [87]:
word_count = Counter(flatten(corpus))
num_total_words = sum(word_count.values())

In [88]:
unigram_table = []
for v in vocabs:
    uw = word_count[v] / num_total_words if v in word_count else 1/num_total_words
    uw_alpha = int((uw ** 0.75) / z)
    unigram_table.extend([v] * uw_alpha)

Counter(unigram_table)

Counter({'the': 109,
         'to': 67,
         'of': 67,
         'in': 57,
         'said': 51,
         'and': 51,
         'a': 50,
         'mln': 40,
         's': 35,
         'vs': 33,
         'for': 32,
         'dlrs': 30,
         'it': 27,
         'pct': 25,
         'on': 24,
         'from': 22,
         'lt': 22,
         'cts': 22,
         'that': 20,
         'is': 20,
         'year': 20,
         'its': 20,
         'net': 19,
         'by': 19,
         'at': 19,
         'u': 18,
         'be': 18,
         'with': 17,
         'was': 17,
         'will': 17,
         'billion': 17,
         'loss': 15,
         'he': 15,
         'an': 14,
         'company': 14,
         'has': 14,
         'would': 14,
         'as': 14,
         'not': 13,
         'shr': 13,
         'inc': 13,
         'which': 12,
         'but': 11,
         'oil': 11,
         'corp': 11,
         'bank': 11,
         'this': 11,
         'profit': 10,
         'trade': 10,
         'l

In [89]:
def negative_sampling(targets, unigram_table, k):
    batch_size = targets.shape[0]
    neg_samples = []
    for i in range(batch_size):
        target_index = targets[i].item()
        nsample = []
        while len(nsample) < k:
            neg = random.choice(unigram_table)
            if word2index[neg] == target_index: continue
            nsample.append(neg)
        neg_samples.append(prepare_sequence(nsample, word2index).reshape(1, -1))
    return torch.cat(neg_samples)

In [90]:
class SkipgramNeg(nn.Module):
    
    def __init__(self, voc_size, emb_size):
        super(SkipgramNeg, self).__init__()
        self.embedding_center  = nn.Embedding(voc_size, emb_size)
        self.embedding_outside = nn.Embedding(voc_size, emb_size)
        self.logsigmoid        = nn.LogSigmoid()
    
    def forward(self, center, outside, negative):
        #center, outside:  (bs, 1)
        #negative       :  (bs, k)
        
        center_embed   = self.embedding_center(center) #(bs, 1, emb_size)
        outside_embed  = self.embedding_outside(outside) #(bs, 1, emb_size)
        negative_embed = self.embedding_outside(negative) #(bs, k, emb_size)

        
        uovc           = outside_embed.bmm(center_embed.transpose(1, 2)).squeeze(2) #(bs, 1)
        ukvc           = -negative_embed.bmm(center_embed.transpose(1, 2)).squeeze(2) #(bs, k)
        ukvc_sum       = torch.sum(ukvc, 1).reshape(-1, 1) #(bs, 1)
        
        loss           = self.logsigmoid(uovc) + self.logsigmoid(ukvc_sum)
        
        return -torch.mean(loss)

In [91]:
# Training Setup
model_neg = SkipgramNeg(voc_size, emb_size)
optimizer_neg = optim.Adam(model_neg.parameters(), lr=0.001)
k = 5

print("\nStarting Skipgram Negative Sampling Training...")

skipgrams = get_skipgrams_dynamic(corpus, window_size)

start_neg = time.time()
for epoch in range(2000):
    input_batch, label_batch = random_batch(batch_size, skipgrams)
    input_tensor = torch.LongTensor(input_batch)
    label_tensor = torch.LongTensor(label_batch)
    
    neg_samples = negative_sampling(label_tensor, unigram_table, k)
    loss = model_neg(input_tensor, label_tensor, neg_samples)
    
    optimizer_neg.zero_grad()
    loss.backward()
    optimizer_neg.step()
    
    if (epoch + 1) % 500 == 0:
        print(f"Epoch {epoch+1} | Loss: {loss.item():.6f}")

final_loss_neg = loss.item()
time_neg = time.time() - start_neg



Starting Skipgram Negative Sampling Training...
Epoch 500 | Loss: 9.685333
Epoch 1000 | Loss: 9.791967
Epoch 1500 | Loss: 6.904815
Epoch 2000 | Loss: 7.478949


### 3. Glove

### GloVe (Global Vectors)

GloVe learns word embeddings by factorizing a global word co-occurrence matrix.
This allows GloVe to capture global statistical information from the corpus rather than only local context.

In [92]:
import time
import math
from collections import Counter

def build_cooccurrence(corpus, window_size):
    cooc = Counter()
    for sentence in corpus:
        for i, center in enumerate(sentence):
            start = max(0, i - window_size)
            end   = min(len(sentence), i + window_size + 1)
            for j in range(start, end):
                if i == j:
                    continue
                context = sentence[j]
                cooc[(center, context)] += 1
    return cooc


In [93]:
def glove_random_batch(batch_size, cooc_counts, x_max=100, alpha=0.75):
    pairs = list(cooc_counts.items())
    batch = np.random.choice(len(pairs), batch_size)

    centers, contexts, coocs, weights = [], [], [], []

    for idx in batch:
        (w_i, w_j), x_ij = pairs[idx]
        centers.append(word2index[w_i])
        contexts.append(word2index[w_j])
        coocs.append(math.log(x_ij))
        weights.append((x_ij / x_max) ** alpha if x_ij < x_max else 1.0)

    return (
        torch.LongTensor(centers),
        torch.LongTensor(contexts),
        torch.FloatTensor(coocs),
        torch.FloatTensor(weights),
    )

In [94]:
class Glove(nn.Module):
    def __init__(self, voc_size, emb_size):
        super().__init__()
        self.wi = nn.Embedding(voc_size, emb_size)
        self.wj = nn.Embedding(voc_size, emb_size)
        self.bi = nn.Embedding(voc_size, 1)
        self.bj = nn.Embedding(voc_size, 1)

    def forward(self, wi, wj, x_ij, weight):
        vi = self.wi(wi)
        vj = self.wj(wj)
        bi = self.bi(wi).squeeze()
        bj = self.bj(wj).squeeze()

        dot = torch.sum(vi * vj, dim=1)
        loss = weight * (dot + bi + bj - x_ij) ** 2
        return torch.mean(loss)

In [95]:
cooc_counts = build_cooccurrence(corpus, window_size)

model_glove = Glove(voc_size, emb_size)
optimizer_glove = optim.Adam(model_glove.parameters(), lr=0.001)

In [96]:
print("\nTraining GloVe...")

start_glove = time.time()
for epoch in range(1000):
    wi, wj, xij, weight = glove_random_batch(batch_size, cooc_counts)

    loss = model_glove(
        torch.LongTensor(wi),
        torch.LongTensor(wj),
        torch.FloatTensor(xij),
        torch.FloatTensor(weight)
    )

    optimizer_glove.zero_grad()
    loss.backward()
    optimizer_glove.step()

    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch+1} | Loss: {loss.item():.6f}")

final_loss_gv = loss.item()
time_glove = time.time() - start_glove


Training GloVe...
Epoch 200 | Loss: 2.411132
Epoch 400 | Loss: 2.414562
Epoch 600 | Loss: 2.843925
Epoch 800 | Loss: 5.122661
Epoch 1000 | Loss: 2.523733


## TASK 2

## Task 2 — Model Evaluation and Comparison

The trained embeddings are evaluated using:
- Training loss and time
- Word analogy tasks (semantic and syntactic)
- Word similarity scores (WordSim-353)

In [97]:
from scipy.spatial.distance import cosine
import urllib.request
import gensim.downloader as api

# 1. ANALOGY SOLVER
def solver(model, a, b, c, mode):
    def get_vec(w):
        idx = torch.LongTensor([word2index.get(w, word2index['<UNK>'])])
        if mode == "glove":
            # GloVe uses wi and wj
            return (model.wi(idx) + model.wj(idx)).detach().squeeze().numpy()
        else:
            # Word2Vec models use embedding_center and embedding_outside
            return (model.embedding_center(idx) + model.embedding_outside(idx)).detach().squeeze().numpy()

    try:
        vec_a, vec_b, vec_c = get_vec(a), get_vec(b), get_vec(c)
        target = vec_b - vec_a + vec_c
    except:
        return None

    best_word = None
    best_sim = -1

    for w in vocabs:
        if w in [a, b, c]: continue
        v_w = get_vec(w)
        # Cosine similarity = 1 - cosine_distance
        sim = 1 - cosine(target, v_w)
        if sim > best_sim:
            best_sim = sim
            best_word = w
    return best_word

#### Note on Analogy Accuracy

Analogy accuracy is relatively low due to:
- Limited corpus size (Reuters-21578)
- Smaller vocabulary compared to large datasets like Wikipedia
- Removal of out-of-vocabulary words

This behavior is expected and does not indicate an incorrect implementation.


In [98]:
def calculate_accuracy(model, tests, mode):
    if not tests: return 0.0
    correct = 0
    for a, b, c, d in tests:
        if solver(model, a, b, c, mode) == d:
            correct += 1
    return (correct / len(tests)) * 100

In [99]:
# 2. LOAD DATA & FILTER (Ensure words exist in our small vocab)
url = "https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt"
urllib.request.urlretrieve(url, "analogy.txt")

semantic_tests, syntactic_tests = [], []
curr_cat = None

with open("analogy.txt", "r") as f:
    for line in f:
        if line.startswith(":"):
            curr_cat = line.strip()
            continue
        words = line.lower().split()
        if len(words) == 4 and all(w in word2index for w in words):
            if curr_cat == ": capital-common-countries":
                semantic_tests.append(words)
            elif curr_cat == ": past-tense":
                syntactic_tests.append(words)

In [100]:
# 3. GENISM BENCHMARK
g_model = api.load("glove-wiki-gigaword-100")
def gensim_accuracy(tests):
    if not tests: return 0.0
    correct = 0
    for a, b, c, d in tests:
        try:
            pred = g_model.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
            if pred == d: correct += 1
        except: continue
    return (correct / len(tests)) * 100

In [101]:
# 4. FINAL RESULTS TABLE
results = [
    ["Skipgram", window_size, f"{final_loss_sg:.4f}", f"{time_sg:.2f}s", 
     calculate_accuracy(model_sg, syntactic_tests, "sg"), 
     calculate_accuracy(model_sg, semantic_tests, "sg")],
    
    ["Skipgram (NEG)", window_size, f"{final_loss_neg:.4f}", f"{time_neg:.2f}s", 
     calculate_accuracy(model_neg, syntactic_tests, "neg"), 
     calculate_accuracy(model_neg, semantic_tests, "neg")],
    
    ["GloVe", window_size, f"{final_loss_gv:.4f}", f"{time_glove:.2f}s", 
     calculate_accuracy(model_glove, syntactic_tests, "glove"), 
     calculate_accuracy(model_glove, semantic_tests, "glove")],
    
    ["GloVe (Gensim)", "N/A", "N/A", "N/A", 
     gensim_accuracy(syntactic_tests), 
     gensim_accuracy(semantic_tests)]
]

print("\n" + "=" * 95)
print(f"{'Model':<18} {'Win':<6} {'Loss':<10} {'Time':<10} {'Syntactic %':<15} {'Semantic %'}")
print("-" * 95)
for r in results:
    print(f"{r[0]:<18} {r[1]:<6} {r[2]:<10} {r[3]:<10} {r[4]:<15.2f} {r[5]:.2f}")


Model              Win    Loss       Time       Syntactic %     Semantic %
-----------------------------------------------------------------------------------------------
Skipgram           2      19.1729    469.71s    0.00            0.00
Skipgram (NEG)     2      7.4789     301.02s    0.00            0.00
GloVe              2      2.5237     46.42s     0.00            0.00
GloVe (Gensim)     N/A    N/A        N/A        0.00            93.27


In [102]:
import numpy as np
import torch
import pandas as pd
from scipy.stats import spearmanr
from gensim.test.utils import datapath

# =========================================================================
# 1. PREPARE LOOKUP DICTIONARIES (Fixed to match Task 1 attribute names)
# =========================================================================
def create_lookup(model, model_type='sg'):
    if model_type == 'glove':
        v = model.wi.weight.detach()
        u = model.wj.weight.detach()
    else:
        v = model.embedding_center.weight.detach()
        u = model.embedding_outside.weight.detach()

    W = (v + u) / 2
    Wn = W / (W.norm(p=2, dim=1, keepdim=True) + 1e-9)

    return {"stoi": word2index, "Wn": Wn}

In [103]:
# Create lookups for our trained models
skipgram_lookup = create_lookup(model_sg, 'sg')
skipgram_neg_lookup = create_lookup(model_neg, 'sg')
glove_lookup = create_lookup(model_glove, 'glove')

### Word Similarity Evaluation (WordSim-353)

Word similarity is evaluated using cosine similarity between word vectors.
Spearman correlation measures how well the model's similarity rankings align with human judgments.
Mean Squared Error (MSE) is also reported.


In [104]:
# =========================================================================
# 2. LOAD WORDSIM353 DATASET
# =========================================================================
ws_path = datapath("wordsim353.tsv")
with open(ws_path, "r", encoding="utf-8", errors="ignore") as f:
    lines = [ln.strip() for ln in f if ln.strip()]

rows = []
for ln in lines:
    parts = ln.split()
    # Check if we have at least 3 parts (Word1, Word2, Score)
    if len(parts) < 3: 
        continue
    
    # Use a try-except block to skip metadata/header lines
    try:
        w1 = parts[0].lower()
        w2 = parts[1].lower()
        score = float(parts[2]) # This will fail on text like 'WordSimilarity-353'
        rows.append((w1, w2, score))
    except ValueError:
        # This skips the line if parts[2] is not a number
        continue

ws = pd.DataFrame(rows, columns=["Word 1", "Word 2", "Human (mean)"])
print(f"Successfully loaded {len(ws)} word pairs.")

Successfully loaded 354 word pairs.


In [105]:
# =========================================================================
# 3. SIMILARITY SCORE FUNCTIONS
# =========================================================================
def similarity_scores_torch(lookup, ws_df):
    stoi_local = lookup["stoi"]
    Wn = lookup["Wn"]
    sims, gold, skipped = [], [], 0

    for _, row in ws_df.iterrows():
        w1, w2, score = row["Word 1"], row["Word 2"], row["Human (mean)"]
        if w1 not in stoi_local or w2 not in stoi_local:
            skipped += 1
            continue
        v1, v2 = Wn[stoi_local[w1]], Wn[stoi_local[w2]]
        # Dot product of normalized vectors = Cosine Similarity
        sims.append(torch.dot(v1, v2).item())
        gold.append(score)
    return np.array(sims), np.array(gold), skipped

In [106]:
def similarity_scores_gensim(model, ws_df):
    sims, gold, skipped = [], [], 0
    for _, row in ws_df.iterrows():
        w1, w2, score = row["Word 1"], row["Word 2"], row["Human (mean)"]
        if w1 not in model or w2 not in model:
            skipped += 1
            continue
        sims.append(model.similarity(w1, w2))
        gold.append(score)
    return np.array(sims), np.array(gold), skipped

In [107]:
# =========================================================================
# 4. CALCULATE METRICS
# =========================================================================
results_similarity = []

# Eval loop for custom models
for name, lookup in [("Skipgram", skipgram_lookup), ("Skipgram (NEG)", skipgram_neg_lookup), ("GloVe", glove_lookup)]:
    sims, gold, skipped = similarity_scores_torch(lookup, ws)
    rho, _ = spearmanr(sims, gold)
    # Human scores are 0-10, sims are -1 to 1. To calculate MSE, we normalize human scores to 0-1
    mse = np.mean(((sims) - (gold/10)) ** 2) 
    results_similarity.append({"Model": name, "Spearman": rho, "MSE": mse, "Skipped": skipped})

# Eval for Gensim
sims, gold, skipped = similarity_scores_gensim(g_model, ws)
rho, _ = spearmanr(sims, gold)
mse = np.mean(((sims) - (gold/10)) ** 2)
results_similarity.append({"Model": "GloVe (Gensim)", "Spearman": rho, "MSE": mse, "Skipped": skipped})

sim_df = pd.DataFrame(results_similarity)

In [None]:
# =========================================================================
# 5. FINAL MERGED TABLE & TABLE 1 (SWAPPED)
# =========================================================================
table_required = sim_df.set_index("Model")[["Spearman", "MSE"]].T

table_required = table_required.round(3)

print("\nTable 1. Swapped Columns and Rows Table")
print(table_required)


Table 1. Swapped Columns and Rows Table
Model     Skipgram  Skipgram (NEG)  GloVe  GloVe (Gensim)
Spearman     0.021           0.052  0.084           0.536
MSE          0.388           0.385  0.417           0.053


## TASK 3

In [109]:
import pickle

# 1. Save all 3 Model Weights
torch.save(model_sg.state_dict(), 'model_sg.pth')
torch.save(model_neg.state_dict(), 'model_neg.pth')
torch.save(model_glove.state_dict(), 'model_glove.pth')

# 2. Save Metadata
data_to_save = {
    'word2index': word2index,
    'voc_size': voc_size,
    'emb_size': emb_size,
    'corpus_raw': corpus,
    'corpus_tokens': corpus
}

with open('model_data.pkl', 'wb') as f:
    pickle.dump(data_to_save, f)
    print("Exported: model_sg.pth, model_neg.pth, model_glove.pth, model_data.pkl")

Exported: model_sg.pth, model_neg.pth, model_glove.pth, model_data.pkl


## Final Observations

Three word embedding models were trained from scratch using Reuters news data: regular Skip-gram, Skip-gram with negative sampling, and GloVe.

Negative sampling proved clearly faster and more effective than full softmax — training time decreased and loss dropped more smoothly. GloVe trained quickest and reached the lowest final loss, most likely because global co-occurrence statistics were used.

Analogy scores (both syntactic and semantic) reached 0%. This outcome is expected. Reuters corpus remains small and focuses heavily on finance and business news, so general word relationships like “king → queen” or verb tenses cannot be learned well. Assignment instructions already mentioned that 0% accuracy should not be surprising due to corpus limitations.

WordSim-353 similarity test showed very low correlation (around 0.02–0.08) across all trained models. This result is reasonable since training data differs greatly from general words used in human similarity ratings. For comparison, large pre-trained GloVe from Gensim achieved much higher correlation (~0.54), which demonstrates importance of corpus size and variety.

Training process clearly showed differences between methods: Word2Vec focuses on nearby words while GloVe considers overall count statistics. Even though benchmark numbers stay modest, complete pipeline worked successfully and real challenges of training on small, specialized corpus became visible.