#### **BERT Embeddings**

- **Description**:  
  BERT embeddings provide contextualized vector representations for text: each token's vector depends on the surrounding words, so the same word gets different embeddings in different sentences.

- **Dataset Used**:  
  Pretrained on **BooksCorpus** and **English Wikipedia**. The `bert-base-uncased` checkpoint used below lowercases its input.

- **When to Use**:  
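  - Tasks requiring context-aware, token-level representations, such as classification, token tagging, or question answering, usually with task-specific fine-tuning.
  - Sentence similarity, by pooling the token vectors into one vector, although models trained specifically for similarity (such as SBERT) generally perform better.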

- **Key Points**:  
  - **Pros**: Captures contextual nuances; widely supported.  
  - **Cons**: Computationally expensive; the output is one vector per token (768 dimensions each for `bert-base-uncased`), so embeddings can be large.

- **Python Code (Example Usage)**:
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Example text
text = "Retrieval-Augmented Generation is an advanced AI technique."

# Tokenize and encode
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get the embeddings (last hidden state)
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # [batch_size, sequence_length, hidden_size]; hidden_size is 768 for bert-base
```
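
The output above holds one vector per token, while similarity tasks usually need one vector per text. A common way to bridge the gap is masked mean pooling: average the token vectors and skip padding positions. The sketch below illustrates the arithmetic on plain Python lists; `mean_pool` and the toy values are hypothetical stand-ins, not part of the `transformers` API, and in practice the same computation would run on the tensors returned by the model.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Toy stand-in for one row of outputs.last_hidden_state: 3 tokens, hidden size 2.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]  # the third position is padding

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

Pooling is only one option; taking the vector of the first (`[CLS]`) token is another, and which works better depends on the task.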


#### **Sentence-BERT (SBERT)**

- **Description**:  
  A variation of BERT fine-tuned specifically for producing sentence-level embeddings optimized for semantic similarity tasks.

- **Dataset Used**:  
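  The original SBERT models were fine-tuned on natural language inference data (**SNLI (Stanford Natural Language Inference)** and **MultiNLI**), with some variants further tuned on the **STS (Semantic Textual Similarity)** benchmark. The `all-MiniLM-L6-v2` checkpoint used below was trained on a much larger collection of sentence pairs.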

- **When to Use**:  
  - Sentence-level tasks such as similarity search or clustering.  
  - RAG systems requiring fast and efficient embeddings.

- **Key Points**:  
  - **Pros**: Produces one compact, fixed-size vector per sentence; performs strongly on semantic similarity tasks.  
  - **Cons**: May not capture token-level nuances.

- **Python Code (Example Usage)**:
```python
from sentence_transformers import SentenceTransformer

# Load SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = ["What is RAG?", "How does RAG work?"]

# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # [number_of_sentences, embedding_dimension], here (2, 384)
```
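
Sentence embeddings are typically compared with cosine similarity. The sketch below shows that calculation on small hand-written vectors; `cosine_similarity`, `emb_a`, and `emb_b` are illustrative names, and the toy vectors stand in for rows of the `embeddings` array produced above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for two sentence embeddings (real ones have hundreds of dimensions).
emb_a = [1.0, 0.0, 1.0]
emb_b = [1.0, 1.0, 0.0]

score = cosine_similarity(emb_a, emb_b)
print(round(score, 3))  # 0.5
```

Scores close to 1 indicate similar sentences; scores near 0 indicate unrelated ones.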


#### **DPR (Dense Passage Retriever)**

- **Description**:  
  A dual-encoder model: one encoder embeds queries and a second encoder embeds passages, so relevant passages can be retrieved by vector similarity in large-scale document retrieval systems.

- **Dataset Used**:  
  Trained on open-domain QA datasets including **Natural Questions** and **TriviaQA**. The `single-nq` checkpoint used below was trained on Natural Questions only.

- **When to Use**:  
  - For retrieval tasks in RAG systems.  
  - QA systems where query-passage similarity is crucial.

- **Key Points**:  
  - **Pros**: Excels at passage retrieval; works well with RAG pipelines.  
  - **Cons**: Often needs fine-tuning for domain-specific tasks.

- **Python Code (Example Usage)**:
```python
from transformers import DPRQuestionEncoderTokenizer, DPRQuestionEncoder
import torch

# Load the DPR question tokenizer and question encoder
# (passages are embedded separately with DPRContextEncoder)
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

# Example query
query = "What is Retrieval-Augmented Generation?"

# Tokenize the query
inputs = tokenizer(query, return_tensors="pt")

# Generate embeddings
with torch.no_grad():
    embeddings = model(**inputs).pooler_output

print(embeddings.shape)  # [batch_size, hidden_size]
```
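
The example above only embeds the query. Retrieval then ranks passage embeddings by their dot product with the query embedding, which is the similarity DPR is trained with. The sketch below shows that ranking step on toy vectors; `dot`, `top_k`, and the vectors are hypothetical, and a real pipeline would encode passages with the context encoder and usually search them with a vector index rather than a full scan.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: Sequence[float],
          passage_vecs: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs for the k highest dot-product scores."""
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for one query embedding and three passage embeddings.
query_vec = [0.5, 1.0]
passage_vecs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

results = top_k(query_vec, passage_vecs, k=2)
print(results)  # [(1, 2.0), (2, 1.5)]
```

The returned indices point back into the passage list, so the matching text can be passed to the generator in a RAG pipeline.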


| Feature                | **BERT**                                              | **SBERT**                                             | **DPR**                                               |
|------------------------|-------------------------------------------------------|------------------------------------------------------|-------------------------------------------------------|
| **Description**        | Contextualized embeddings for words in context.       | Sentence-level embeddings optimized for similarity.  | Dual-encoder model for passage retrieval tasks.       |
| **Primary Use**        | Token-level understanding and contextual modeling.    | Semantic similarity, clustering, and search.         | Document/passage retrieval in QA and RAG systems.     |
| **Training Dataset**   | BooksCorpus and English Wikipedia.                    | SNLI and MultiNLI (original SBERT); STS benchmark for some variants. | Natural Questions, TriviaQA, and other open-domain QA datasets. |
| **Output Type**        | Word-level embeddings for each token.                 | Single vector for an entire sentence.                | Single vector for queries or passages.                |
| **Pros**               | Rich token-level context; versatile.                  | Compact embeddings; fast; highly accurate.           | Accurate retrieval; scales well with large datasets.  |
| **Cons**               | Large embedding size; computationally expensive.      | Lacks token-level nuance; domain-specific fine-tuning may be needed. | Requires domain-specific fine-tuning.                 |
| **Best Fit For**       | Language modeling, classification, and token tagging. | Sentence similarity, clustering, and ranking tasks.  | RAG pipelines and passage-based QA systems.           |
| **Example Model**      | `bert-base-uncased`                                   | `all-MiniLM-L6-v2`                                   | `facebook/dpr-question_encoder-single-nq-base`        |
| **Output Shape**       | `[batch_size, sequence_length, hidden_size]`          | `[batch_size, embedding_size]`                       | `[batch_size, hidden_size]`                           |
| **Computational Cost** | High                                                  | Moderate                                             | Moderate                                              |
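
To make the size difference between token-level and sentence-level outputs concrete, the sketch below estimates storage for a corpus of embeddings held as 32-bit floats. The corpus size and the average of 128 tokens per text are assumptions chosen for illustration; 768 and 384 are the output dimensions of `bert-base-uncased` and `all-MiniLM-L6-v2`. `embedding_bytes` is a hypothetical helper.

```python
FLOAT32_BYTES = 4


def embedding_bytes(num_texts: int, vectors_per_text: int, dim: int) -> int:
    """Bytes needed to store float32 embeddings for a corpus."""
    return num_texts * vectors_per_text * dim * FLOAT32_BYTES


num_texts = 1_000_000   # assumed corpus size
avg_tokens = 128        # assumed average tokens per text

# Token-level BERT output: one 768-dim vector per token.
token_level = embedding_bytes(num_texts, avg_tokens, 768)
# Sentence-level SBERT output: one 384-dim vector per text.
sentence_level = embedding_bytes(num_texts, 1, 384)

print(f"token-level:    {token_level / 1e9:.1f} GB")     # token-level:    393.2 GB
print(f"sentence-level: {sentence_level / 1e9:.1f} GB")  # sentence-level: 1.5 GB
```

Under these assumptions the token-level output takes 256 times more space, which is why token vectors are usually pooled before being stored for retrieval.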
