# Document Embeddings with Doc2Vec
## Objective

Learn fixed-length document embeddings directly from text using Doc2Vec, avoiding manual aggregation of word vectors.

This notebook focuses on:

- Document-level semantic representation

- Similarity, clustering, and downstream ML usage

- Proper training and evaluation discipline

## Why Doc2Vec Exists

Word embeddings represent single words, so building a document vector from them requires a heuristic aggregation step (a plain mean or a TF-IDF-weighted mean).

Doc2Vec instead:

- Learns document vectors jointly with word vectors

- Encodes document-level semantics directly

- Produces dense, fixed-size representations
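
For contrast, the aggregation heuristic that Doc2Vec replaces can be sketched as below. The word vectors are random placeholders (an assumption for illustration, not trained embeddings), and `mean_pool` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder word vectors (random, for illustration only; real ones would
# come from a trained word-embedding model).
vocab = ["clean", "text", "better", "model"]
word_vectors = {word: rng.normal(size=8) for word in vocab}

def mean_pool(tokens, vectors, dim=8):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Unknown tokens are silently skipped, which is itself a heuristic choice.
doc_vector = mean_pool(["clean", "text", "unknown_word"], word_vectors)
print(doc_vector.shape)  # (8,)
```

Every word contributes equally here and out-of-vocabulary tokens are dropped; these are the kinds of manual decisions that a jointly learned document vector avoids.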

## Doc2Vec Architectures (Conceptual)
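| Variant | Description                                                                          |
| ------- | ------------------------------------------------------------------------------------ |
| PV-DM   | Distributed Memory: predicts a word from its context words plus the document vector  |
| PV-DBOW | Distributed Bag of Words: predicts words in the document from the document vector    |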


In practice:

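- PV-DBOW (`dm=0`) trains faster and often gives stronger document vectors

- PV-DM (`dm=1`) sees local context and can capture some word-order information, though the gain is usually limited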

## Imports and Setup

In [2]:
import numpy as np
import pandas as pd

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity


# Example Corpus

We use **tokenized, normalized documents.**

In [7]:
documents = [
    ["clean", "text", "better", "model"],
    ["terrible", "result", "poor", "performance"],
    ["robust", "interpretable", "model"],
    ["bad", "prediction", "weak", "accuracy"]
]
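
The corpus above is already tokenized. The notebook does not show its preprocessing, so the following is only a minimal sketch of how raw strings might be brought into that shape; the regex, the stopword set, and the `normalize` helper are assumptions.

```python
import re

# Tiny illustrative stopword list (an assumption, not the notebook's actual list).
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of"}

def normalize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

raw = "The robust, interpretable model."
print(normalize(raw))  # ['robust', 'interpretable', 'model']
```

Whatever preprocessing is used for training should also be applied to any text passed to `infer_vector` later.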


# Tag Documents

Doc2Vec learns one vector per tag, so each document gets its own unique tag.

In [9]:
tagged_docs = [
    TaggedDocument(words=doc, tags=[f"DOC_{i}"])
    for i, doc in enumerate(documents)
]


# Train a Doc2Vec Model
## Baseline Configuration

In [12]:
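doc2vec_model = Doc2Vec(
    vector_size=100,  # dimensionality of each document vector
    window=5,         # context window; matters for PV-DM or when dbow_words=1
    min_count=1,      # keep every word, since the toy corpus is tiny
    workers=4,        # multiple threads make runs non-deterministic
    epochs=40,        # passes over the corpus
    dm=0              # PV-DBOW
)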

doc2vec_model.build_vocab(tagged_docs)
doc2vec_model.train(
    tagged_docs,
    total_examples=doc2vec_model.corpus_count,
    epochs=doc2vec_model.epochs
)


## Inspect Learned Document Vectors

In [15]:
# Stack the learned vectors row by row, in the order the tags were registered
doc_vectors = np.vstack([
    doc2vec_model.dv[tag]
    for tag in doc2vec_model.dv.index_to_key
])

doc_vectors.shape


(4, 100)

## Document Similarity

In [18]:
cosine_similarity(doc_vectors)


array([[ 1.0000001 ,  0.18371744,  0.01432705,  0.00617551],
       [ 0.18371744,  0.99999994, -0.04312898,  0.17653593],
       [ 0.01432705, -0.04312898,  0.9999999 ,  0.02648791],
       [ 0.00617551,  0.17653593,  0.02648791,  0.99999976]],
      dtype=float32)
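
To make the matrix above easier to read, here is a sketch of what `cosine_similarity` computes: dot products between L2-normalized rows. The input values are arbitrary, not the trained vectors, and `cosine_matrix` is a hypothetical helper.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarity: dot products of L2-normalized rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Small float32 example (arbitrary values, not the trained vectors).
toy = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]], dtype=np.float32)
sims = cosine_matrix(toy)
print(sims)
```

Diagonal entries such as `1.0000001` or `0.99999976` in the notebook output are float32 rounding around 1. Off-diagonal values close to zero indicate that the model has learned little relation between the documents, which is expected on a four-document corpus.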

# Infer Vector for Unseen Document

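**Important**: Inference ≠ training. `infer_vector` keeps the trained weights fixed and fits a new vector for the unseen tokens, so the result is an approximation that depends on `epochs` and can differ between calls.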

In [22]:
new_doc = ["clean", "interpretable", "model"]

inferred_vector = doc2vec_model.infer_vector(
    new_doc,
    epochs=20
)

inferred_vector.shape


(100,)

# Similarity to Training Documents

In [25]:
similarities = cosine_similarity(
    inferred_vector.reshape(1, -1),
    doc_vectors
)

similarities


array([[-0.03518941, -0.12601991, -0.03049923, -0.01644303]],
      dtype=float32)
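
The scores can be turned into a ranking of training documents. The sketch below reuses the values printed in the output above; the `tags` list mirrors the `DOC_i` tags assigned earlier.

```python
import numpy as np

tags = ["DOC_0", "DOC_1", "DOC_2", "DOC_3"]

# Values copied from the notebook output above.
similarities = np.array(
    [[-0.03518941, -0.12601991, -0.03049923, -0.01644303]],
    dtype=np.float32,
)

# Sort indices from most to least similar.
order = np.argsort(-similarities[0])
ranking = [tags[i] for i in order]
print(ranking)  # ['DOC_3', 'DOC_2', 'DOC_0', 'DOC_1']
```

The query shares two tokens with `DOC_2`, yet `DOC_3` ranks first and every score sits near zero. On a corpus this small the ranking is essentially noise.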

# Downstream Use: Classification Example
## Build Feature Matrix

In [31]:
X = doc_vectors
# Toy labels: 1 = positively worded document, 0 = negatively worded document
y = np.array([1, 0, 1, 0])


## Train a Simple Classifier

In [34]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, y)

# Accuracy on the training data itself: a sanity check, not a generalization estimate
clf.score(X, y)


1.0
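
A score of 1.0 on four training points says little about generalization. A hedged sketch of held-out evaluation with stratified cross-validation follows; `X_demo` and `y_demo` are random placeholders standing in for the document vectors and labels of a larger corpus.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Placeholder features and labels (random numbers, so accuracy should
# hover around chance level).
X_demo = rng.normal(size=(200, 100))
y_demo = rng.integers(0, 2, size=200)

# Five stratified folds; each fold is scored on data the classifier never saw.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=cv)
print(scores.mean())
```

In a stricter setup the Doc2Vec model itself would be trained on the training split only, with test documents embedded through `infer_vector`, so that no test text influences the embeddings.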

# TF-IDF Weighted Doc2Vec (Advanced)
## Motivation

- Standard Doc2Vec does not weight words by how informative they are (apart from subsampling very frequent words via `sample`)

- TF-IDF weighting can bias learning toward informative terms

⚠️ This is non-standard and task-dependent.
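
The notebook gives no implementation for this idea. One possible reading, sketched below, is to drop low-IDF tokens before tagging and training. It uses IDF only (a simplification of TF-IDF), and the helpers and the `min_idf` threshold are hypothetical.

```python
import math
from collections import Counter

documents = [
    ["clean", "text", "better", "model"],
    ["terrible", "result", "poor", "performance"],
    ["robust", "interpretable", "model"],
    ["bad", "prediction", "weak", "accuracy"],
]

def idf_scores(docs):
    """Inverse document frequency: log(N / df) for every word."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def filter_by_idf(docs, min_idf):
    """Drop words whose IDF falls below min_idf (hypothetical threshold)."""
    idf = idf_scores(docs)
    return [[w for w in doc if idf[w] >= min_idf] for doc in docs]

# "model" occurs in two of four documents (IDF = log 2), so it is removed.
filtered = filter_by_idf(documents, min_idf=1.0)
print(filtered[0])  # ['clean', 'text', 'better']
```

The filtered token lists would then be wrapped in `TaggedDocument` as before. Whether this helps is an empirical question for the task at hand.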

# Key Limitations of Doc2Vec

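- Sensitive to hyperparameters such as `vector_size`, `epochs`, and `min_count`

- Requires a sufficiently large corpus; a handful of documents yields mostly noise

- Less effective on short texts, which give each document vector little training signal

- Often outperformed by transformer-based sentence encoders such as Sentence-BERT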

# When Doc2Vec Makes Sense

- Medium-sized corpora
- Document similarity and clustering
- Memory-constrained environments
- As a bridge between classical and modern NLP

# When NOT to Use Doc2Vec

- Very small datasets
- Highly contextual language
- Sentence-level semantics
- When transformer embeddings are available

# Common Mistakes

- Training on tiny datasets (such as the four-document toy corpus used here)
- Comparing inferred vectors with trained vectors as if they came from the same procedure
- Ignoring randomness (no fixed `seed`, multiple worker threads)
- Treating Doc2Vec as state-of-the-art

# Key Takeaways

- Doc2Vec learns document-level representations

- It removes the need for hand-picked word-vector aggregation heuristics

- Performance depends heavily on data size

- Transformer-based encoders usually outperform it when they are available