# Word Embeddings: Word2Vec and GloVe
## Objective

Introduce dense word-level vector representations that capture semantic similarity, enabling:

- Meaning-aware feature engineering

- Similarity search and clustering

- Stronger document representations than BoW / TF-IDF

> This notebook focuses on using pre-trained embeddings, not training them from scratch.

## Why Word Embeddings Matter

Sparse models treat words as independent symbols.
Embeddings instead encode:

- Semantic similarity `(king ≈ queen)`

- Contextual proximity (words used in similar contexts get nearby vectors)

- A continuous vector space in which distance reflects relatedness

They allow models to generalize beyond exact token matches.
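To make the contrast concrete, here is a minimal sketch comparing one-hot (sparse) vectors with dense vectors. The 3-dimensional dense values are made up by hand for illustration and are not taken from any pre-trained model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab = ["king", "queen", "banana"]

# Sparse view: one-hot vectors, so every pair of distinct words is orthogonal.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Dense view: hand-picked toy vectors (an assumption, for illustration only).
dense = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.8, 0.9, 0.1]),
    "banana": np.array([0.1, 0.0, 0.9]),
}

sim_sparse = cosine(one_hot["king"], one_hot["queen"])   # exactly 0.0
sim_related = cosine(dense["king"], dense["queen"])      # close to 1
sim_unrelated = cosine(dense["king"], dense["banana"])   # much lower

print(sim_sparse, sim_related, sim_unrelated)
```

With one-hot vectors, "king" and "queen" are as dissimilar as "king" and "banana"; with dense vectors, the similarity score can reflect how related the words are.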

## Imports and Setup

In [3]:
import numpy as np
import pandas as pd

from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity


Note: Pre-trained embedding files (Word2Vec / GloVe) are not committed to the repo due to size.

# Loading Pre-Trained Embeddings
## Word2Vec (Google News – 300d)

In [9]:
# Example path – adjust to your environment
word2vec_path = "./pre-trained-model/GoogleNews-vectors-negative300.bin"

word2vec = KeyedVectors.load_word2vec_format(
    word2vec_path,
    binary=True
)


## GloVe (Converted to Word2Vec Format)

In [44]:
# GloVe vectors already converted to word2vec text format, downloaded from Kaggle
# Source: https://www.kaggle.com/datasets/serquet/word2vecglove6b300d
glove_path = "./pre-trained-model/word2vec-glove.6B.300d.txt"

glove = KeyedVectors.load_word2vec_format(
    glove_path,
    binary=False
)

# Vocabulary Coverage Check
## Why This Matters

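Out-of-vocabulary (OOV) words have no vector, so they are skipped at lookup time. No error is raised, which means a low coverage rate can silently degrade performance.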

In [12]:
def embedding_coverage(tokens, model):
    covered = [t for t in tokens if t in model]
    return len(covered) / len(tokens) if tokens else 0


In [14]:
tokens = ["clean", "model", "nlp", "foobar"]

print("Word2Vec coverage:", embedding_coverage(tokens, word2vec))
print("GloVe coverage:", embedding_coverage(tokens, glove))


Word2Vec coverage: 0.5
GloVe coverage: 0.75
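A coverage ratio alone does not say which tokens are missing. The sketch below is a hypothetical helper (`oov_report` is not part of gensim) that also returns the distinct OOV tokens. It only assumes that the model supports the `in` operator, as `KeyedVectors` does above; a plain dict stands in for the model here so the example runs without the pre-trained files.

```python
def oov_report(tokens, model):
    """Return (coverage ratio, sorted list of distinct OOV tokens)."""
    missing = sorted({t for t in tokens if t not in model})
    covered = sum(t in model for t in tokens)
    coverage = covered / len(tokens) if tokens else 0.0
    return coverage, missing

# Toy stand-in for a KeyedVectors object: only membership matters here.
toy_model = {"clean": None, "model": None}

coverage, missing = oov_report(["clean", "model", "nlp", "foobar"], toy_model)
print(coverage, missing)
```

Inspecting the missing tokens often reveals fixable causes, such as casing or tokenization differences, before they affect downstream results.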


## Word-Level Semantic Similarity

In [19]:
word2vec.similarity("model", "algorithm")


np.float32(0.2745423)

In [17]:
similar_words = word2vec.most_similar("model", topn=5)
similar_words


[('models', 0.744627833366394),
 ('Model', 0.5217961668968201),
 ('xenograft_mouse', 0.5016441345214844),
 ('Brother_Nidal', 0.4816089868545532),
 ('Smallpox_infection', 0.4770216941833496)]

## Cosine Similarity Between Words

In [22]:
def word_similarity(word1, word2, model):
    return cosine_similarity(
        model[word1].reshape(1, -1),
        model[word2].reshape(1, -1)
    )[0][0]

word_similarity("good", "excellent", word2vec)


np.float32(0.644293)
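For reference, the quantity computed above is the cosine of the angle between the two word vectors $u$ and $v$:

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

It ranges from -1 to 1 and ignores vector length, so `word_similarity` should agree with `word2vec.similarity` for the same pair of words, up to floating-point rounding.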

# From Word Embeddings to Document Embeddings
## Why Aggregation Is Needed

Word embeddings are:

- Fixed-length

- Word-level

Most models need a single fixed-length vector per document, so the word vectors in a document must be combined into one.

## Mean Pooling (Baseline)

In [25]:
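def document_embedding(tokens, model):
    # Look up in-vocabulary tokens only; OOV tokens are skipped
    vectors = [model[t] for t in tokens if t in model]
    if not vectors:
        # No known tokens: fall back to a zero vector with a matching dtype
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)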


## Example Documents

In [28]:
docs = [
    ["clean", "text", "better", "model"],
    ["terrible", "results", "poor", "performance"]
]

doc_embeddings = np.vstack([
    document_embedding(doc, word2vec)
    for doc in docs
])

doc_embeddings.shape


(2, 300)

In [30]:
pd.DataFrame(doc_embeddings)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.053162 | 0.065186 | 0.069672 | 0.037354 | -0.051529 | 0.057495 | 0.034302 | -0.079468 | 0.086975 | 0.063416 | ... | -0.090332 | 0.092163 | -0.107666 | -0.019409 | 0.021194 | 0.051018 | -0.02566 | -0.075684 | -0.025513 | -0.036841 |
| 1 | 0.011963 | 0.15625 | 0.064331 | 0.048645 | 0.023575 | 0.085205 | -0.038025 | -0.137435 | 0.179443 | 0.186157 | ... | -0.027008 | 0.201172 | -0.199463 | -0.038147 | -0.064392 | 0.055603 | 0.020752 | -0.034363 | 0.093536 | -0.050964 |


## Document Similarity

In [33]:
cosine_similarity(doc_embeddings)

array([[0.9999999 , 0.38635933],
       [0.38635933, 0.9999999 ]], dtype=float32)
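The same similarity scores can drive a simple retrieval step: embed a query, then rank documents by cosine similarity to it. The following is a minimal sketch with made-up 3-dimensional vectors standing in for rows of `doc_embeddings`; `rank_by_cosine` is a hypothetical helper, and it assumes that no vector is all zeros.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_matrix):
    """Return (document indices from most to least similar, cosine scores)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                 # cosine similarity of each row with the query
    order = np.argsort(-scores)    # sort by descending score
    return order, scores

# Toy document vectors (assumed values, for illustration only).
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
toy_query = np.array([1.0, 0.1, 0.0])

order, scores = rank_by_cosine(toy_query, toy_docs)
print(order, scores)
```

In the notebook's setting, the query vector would come from `document_embedding` applied to the query tokens.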

# TF-IDF Weighted Embedding (Improved)
## Why Weighting Helps

- Common words dominate mean pooling

- TF-IDF emphasizes discriminative terms

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [39]:
texts = [" ".join(doc) for doc in docs]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(texts)
tfidf_vocab = tfidf.vocabulary_


In [41]:
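def tfidf_weighted_embedding(tokens, model, tfidf, vocab):
    vectors = []
    weights = []

    for token in tokens:
        # Keep only tokens known to both the embedding model and the vectorizer
        if token in model and token in vocab:
            vectors.append(model[token])
            # idf_ holds the IDF part only; a repeated token is appended once
            # per occurrence, so term frequency enters through repetition
            weights.append(tfidf.idf_[vocab[token]])

    if not vectors:
        # No usable tokens: fall back to a zero vector
        return np.zeros(model.vector_size)

    return np.average(vectors, axis=0, weights=weights)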
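To show the effect of the weighting without the pre-trained files, here is a self-contained sketch that computes IDF by hand and applies the same weighted average. It assumes the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which is what scikit-learn documents for `smooth_idf=True`. The 2-dimensional vectors and the helper names (`smoothed_idf`, `idf_weighted_embedding`) are made up for illustration.

```python
import math
import numpy as np

def smoothed_idf(docs):
    """IDF per token: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n = len(docs)
    df = {}
    for doc in docs:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}

def idf_weighted_embedding(tokens, vectors, idf, dim):
    """IDF-weighted average over tokens present in both lookups."""
    kept = [t for t in tokens if t in vectors and t in idf]
    if not kept:
        return np.zeros(dim)
    return np.average([vectors[t] for t in kept], axis=0,
                      weights=[idf[t] for t in kept])

# Toy embeddings (assumed values): "model" appears in both documents.
toy_vectors = {
    "model": np.array([1.0, 0.0]),
    "clean": np.array([0.0, 1.0]),
}
toy_corpus = [["clean", "model"], ["poor", "model"]]

idf = smoothed_idf(toy_corpus)
vec = idf_weighted_embedding(["clean", "model"], toy_vectors, idf, dim=2)
print(idf["model"], idf["clean"], vec)
```

Because "model" occurs in every document, it gets the lowest IDF, and the pooled vector leans toward the rarer word "clean" instead of sitting halfway between the two.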


# Key Limitations of Word Embeddings

- Static (no context awareness)

- Polysemy not handled (e.g., "bank" gets one vector for both the riverbank and the financial institution)

- Require aggregation heuristics

- OOV handling is crude
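One common way to soften the OOV problem is to try a few surface variants before giving up. The `most_similar` output above lists `models` and `Model` as separate entries, which suggests the Google News vocabulary is case-sensitive, so casing is a plausible cause of misses. The sketch below is a hypothetical helper, not a gensim function, and uses a plain dict in place of the model.

```python
def lookup_with_fallback(token, model):
    """Try the token as-is, then lowercased, then capitalized; None if all miss."""
    for candidate in (token, token.lower(), token.capitalize()):
        if candidate in model:
            return model[candidate]
    return None

# Toy stand-in for a case-sensitive vocabulary (values are placeholders).
toy_model = {"model": "vec_model", "Paris": "vec_paris"}

print(lookup_with_fallback("Model", toy_model))   # found via lowercase
print(lookup_with_fallback("paris", toy_model))   # found via capitalization
print(lookup_with_fallback("foobar", toy_model))  # still OOV
```

This remains a heuristic: it can map a token to a differently cased word with a different meaning, so it is worth checking on a sample of the data.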

# When Word Embeddings Are a Good Choice

- Small to medium datasets
- Semantic similarity tasks
- Clustering / retrieval
- Feature inputs for classical ML

# Common Mistakes

- Treating word embeddings as sentence embeddings
- Ignoring OOV rates
- Training ML models on raw word vectors
- Forgetting normalization
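On the normalization point, a minimal sketch: scaling each vector to unit length makes the dot product equal to cosine similarity, which matters for models or indexes that compare vectors by dot product or Euclidean distance. `l2_normalize` is a hypothetical helper, and the `eps` guard for all-zero rows (such as documents with no known tokens) is an assumption about how such rows should be handled.

```python
import numpy as np

def l2_normalize(matrix, eps=1e-12):
    """Scale each row to unit length; eps avoids division by zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)

# Rows 0 and 1 point in the same direction but differ in length.
vectors = np.array([
    [3.0, 4.0],
    [0.3, 0.4],
    [0.0, 0.0],   # e.g., a document with no in-vocabulary tokens
])

unit = l2_normalize(vectors)
print(unit)
print(unit @ unit.T)   # dot products of unit rows are cosine similarities
```

After normalization, the first two rows become identical, so their length difference no longer affects the comparison.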

# Key Takeaways

- Word embeddings encode semantic similarity

- Document embeddings require aggregation

- TF-IDF weighting down-weights common words, so discriminative terms contribute more to the document vector

- Word embeddings bridge classical NLP and deep learning