### Explanation of Doc2Vec

**Doc2Vec** is an extension of Word2Vec that generates vector representations for entire documents or pieces of text, rather than just individual words. It was introduced by Quoc Le and Tomas Mikolov in 2014 and is useful for tasks where understanding the meaning of entire documents, paragraphs, or sentences is essential.

### How Doc2Vec Works

Doc2Vec extends Word2Vec by learning one additional vector per document, keyed by a document ID (called a tag in Gensim). Each document vector is trained to help predict the words that appear in its document, similar to how Word2Vec uses context words to predict a target word. There are two main approaches to training Doc2Vec models: Distributed Memory (DM) and Distributed Bag of Words (DBOW).

#### 1. Distributed Memory (DM)

- **Objective**: Predict a target word using the context words and the document vector.
- **Approach**: Similar to CBOW in Word2Vec. Both context words and the document vector are used to predict the target word.
- **Advantages**: Keeps some word-order information from the context window, which tends to help it capture the semantic meaning of documents.

**Diagram**:
```
[Document ID] + [Context Words] -> Neural Network -> [Target Word]
```
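The DM prediction step can be sketched in plain Python. This is only an illustrative toy, not Gensim's implementation: the vocabulary, the 4-dimensional random vectors, and the helper `dm_predict` are all hypothetical, and averaging the inputs is just one of the combination options (the original paper also describes concatenation).

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary and vector size, chosen for illustration only.
vocab = ["deep", "learning", "is", "a", "subset"]
dim = 4


def random_vector(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


doc_vector = random_vector(dim)                         # one vector per document ID
word_vectors = {w: random_vector(dim) for w in vocab}   # input word vectors
output_weights = {w: random_vector(dim) for w in vocab} # softmax layer weights


def dm_predict(doc_vec, context_words):
    """Average the document vector with the context word vectors,
    then apply a softmax over the vocabulary to score the target word."""
    inputs = [doc_vec] + [word_vectors[w] for w in context_words]
    hidden = [sum(column) / len(inputs) for column in zip(*inputs)]
    scores = {
        w: sum(h * o for h, o in zip(hidden, out))
        for w, out in output_weights.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Probability of each candidate target word given the document and its context.
probs = dm_predict(doc_vector, ["deep", "is"])
for word, p in probs.items():
    print(f"{word}: {p:.3f}")
```

During training, the error on the true target word would be propagated back into both the context word vectors and the document vector, which is how the document vector comes to summarize its text.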

#### 2. Distributed Bag of Words (DBOW)

- **Objective**: Predict context words using the document vector.
- **Approach**: Similar to Skip-gram in Word2Vec. The document vector alone is used to predict multiple context words in the document.
- **Advantages**: Computationally simpler and faster to train, because it does not have to learn word vectors alongside the document vector.

**Diagram**:
```
[Document ID] -> Neural Network -> [Context Words]
```
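A single DBOW update can be sketched the same way. In this toy version (again an assumption-laden illustration rather than Gensim's code) the document vector alone scores every vocabulary word, and one gradient step nudges it toward a word sampled from the document. The vocabulary, the helper names `word_probabilities` and `dbow_step`, and the choice to keep the output weights fixed are all simplifications introduced here.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary and vector size, chosen for illustration only.
vocab = ["deep", "learning", "is", "a", "subset"]
dim = 4


def random_vector(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


doc_vector = random_vector(dim)
output_weights = {w: random_vector(dim) for w in vocab}


def word_probabilities(doc_vec):
    """Softmax over the vocabulary using only the document vector as input."""
    scores = {
        w: sum(d * o for d, o in zip(doc_vec, out))
        for w, out in output_weights.items()
    }
    max_score = max(scores.values())
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def dbow_step(doc_vec, sampled_word, learning_rate=0.1):
    """One gradient-descent step on -log p(sampled_word | document).
    Only the document vector is updated; output weights stay fixed for brevity."""
    probs = word_probabilities(doc_vec)
    gradient = [0.0] * len(doc_vec)
    for word, out in output_weights.items():
        error = probs[word] - (1.0 if word == sampled_word else 0.0)
        for i, weight in enumerate(out):
            gradient[i] += error * weight
    return [d - learning_rate * g for d, g in zip(doc_vec, gradient)]


before = word_probabilities(doc_vector)["learning"]
updated_vector = dbow_step(doc_vector, "learning")
after = word_probabilities(updated_vector)["learning"]
print(f"p('learning' | doc) before: {before:.4f}, after: {after:.4f}")
```

Because no context words enter the input, each update touches only the document vector and the output layer, which is where DBOW's lower training cost comes from.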

### Example Code Using Gensim

Here's how you can train and use Doc2Vec models using the Gensim library:

#### Installation

First, install Gensim if you haven't already. The examples below also use NLTK for tokenization:
```bash
pip install gensim nltk
```

#### Training a Doc2Vec Model

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data, a one-time download:
# nltk.download("punkt") (newer NLTK releases use "punkt_tab" instead).

# Sample documents
documents = [
    "I love machine learning. It's fascinating.",
    "Deep learning is a subset of machine learning.",
    "Natural language processing is a key area of AI.",
    "Gensim is a library for topic modeling and document indexing.",
    "Doc2Vec is an extension of Word2Vec."
]

# Tokenize each document and tag it with its index as a string ID
tagged_data = [
    TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
    for i, doc in enumerate(documents)
]
tagged_data
```

**Output**:
```
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'it', "'s", 'fascinating', '.'], tags=['0']),
 TaggedDocument(words=['deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning', '.'], tags=['1']),
 TaggedDocument(words=['natural', 'language', 'processing', 'is', 'a', 'key', 'area', 'of', 'ai', '.'], tags=['2']),
 TaggedDocument(words=['gensim', 'is', 'a', 'library', 'for', 'topic', 'modeling', 'and', 'document', 'indexing', '.'], tags=['3']),
 TaggedDocument(words=['doc2vec', 'is', 'an', 'extension', 'of', 'word2vec', '.'], tags=['4'])]
```

```python
# Configure the Doc2Vec model (training happens in a later step)
model = Doc2Vec(vector_size=50, alpha=0.025, min_alpha=0.00025, min_count=1, dm=1)
```

**Explanation**:

- `vector_size=50`: The dimensionality of the learned vectors.
- `alpha=0.025`: The initial learning rate.
- `min_alpha=0.00025`: The final learning rate; the rate decays linearly from `alpha` to this value as training progresses.
- `min_count=1`: Ignores all words with total frequency lower than this; a value of 1 keeps every word.
- `dm=1`: Selects the Distributed Memory (DM) model; `dm=0` selects DBOW instead.
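To make the relationship between `alpha` and `min_alpha` concrete, the sketch below interpolates linearly between the two values. It is a simplification: the helper `linear_learning_rate` is hypothetical, and Gensim adjusts the rate in finer-grained steps during training rather than once per epoch.

```python
def linear_learning_rate(progress, alpha=0.025, min_alpha=0.00025):
    """Interpolate linearly from alpha (progress=0.0) to min_alpha (progress=1.0)."""
    return alpha - (alpha - min_alpha) * progress


# Approximate rate at the start of a few epochs, assuming 100 epochs in total.
epochs = 100
for epoch in (0, 25, 50, 75, 99):
    rate = linear_learning_rate(epoch / epochs)
    print(f"epoch {epoch:3d}: learning rate ~ {rate:.5f}")
```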


```python
# Build vocabulary
model.build_vocab(tagged_data)

# Train the model
model.train(tagged_data, total_examples=model.corpus_count, epochs=100)
```

```python
# Save the model
model.save("d2v.model")
```

```python
# Load the model
model = Doc2Vec.load("d2v.model")
```

```python
# Infer a vector for a new document
new_doc = "I enjoy learning about AI and machine learning."
new_vector = model.infer_vector(word_tokenize(new_doc.lower()))
print(f"Vector for new document:\n{new_vector}\n")
```

**Output**:
```
Vector for new document:
[-0.04624398  0.00421499 -0.01397642 -0.01604808 -0.00906577 -0.00510722
  0.00076339  0.0377649  -0.03776836  0.01884614  0.02258971  0.01798172
  0.0033154  -0.03847247  0.02716012 -0.02880351 -0.00393291  0.00821783
 -0.05536242 -0.03861832 -0.0200435   0.01773427  0.0063881   0.02096928
 -0.00288691  0.00093806 -0.02351245 -0.01838718 -0.00294462 -0.02224243
  0.02393194  0.01633536 -0.02707224 -0.01975657 -0.032386    0.01691411
  0.00171951 -0.01972627  0.00995371 -0.01926793  0.00454859  0.03199244
  0.0116352  -0.02720493 -0.00694919  0.00993334  0.02305961 -0.00091604
  0.02895818 -0.01197604]
```

```python
# Find the training documents most similar to the inferred vector
similar_docs = model.dv.most_similar(positive=[new_vector])
print(f"Most similar documents:\n{similar_docs}")
```

**Output**:
```
Most similar documents:
[('3', 0.9131336808204651), ('0', 0.8961871862411499), ('1', 0.8642350435256958), ('2', 0.8620454668998718), ('4', 0.7248851656913757)]
```
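Each pair in the result is a document tag and its cosine similarity to the query vector. The sketch below shows how such a ranking could be computed by hand; the three-dimensional vectors and the tags are made-up toy values, not output from the model above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical document vectors keyed by tag, plus a hypothetical query vector.
doc_vectors = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.8, 0.6, 0.0],
    "2": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

# Rank documents from most to least similar, mirroring the (tag, score) format.
ranked = sorted(
    ((tag, cosine_similarity(query, vec)) for tag, vec in doc_vectors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Scores close to 1 mean the vectors point in nearly the same direction. With only five short training documents, as in the example above, the scores reflect very little data and should be read with caution.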
