# Understanding Embeddings in RAG

This notebook explores:
1. What embeddings are and how they work
2. Generating embeddings with Sentence Transformers
3. Calculating similarity between texts
4. Visualizing embedding spaces
5. Impact of different embedding models
6. Speed of single versus batch encoding
7. A practical RAG example with Neo4j

## 1. Setup and Imports

In [None]:
import sys
sys.path.append('..')

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ Libraries loaded")

## 2. What Are Embeddings?

In [None]:
# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example text
text = "Neo4j is a graph database"

# Generate embedding
embedding = model.encode(text)

print(f"Text: '{text}'")
print(f"\nEmbedding shape: {embedding.shape}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"\nFirst 10 values: {embedding[:10]}")
print(f"\nEmbedding statistics:")
print(f"  Min: {embedding.min():.4f}")
print(f"  Max: {embedding.max():.4f}")
print(f"  Mean: {embedding.mean():.4f}")
print(f"  Std: {embedding.std():.4f}")

## 3. Semantic Similarity

In [None]:
# Compare different texts
texts = [
    "Neo4j is a graph database",
    "Graph databases store data as nodes and relationships",
    "Neo4j uses Cypher query language",
    "MongoDB is a document database",
    "The weather is nice today",
    "Python is a programming language"
]

# Generate embeddings for all texts
embeddings = model.encode(texts)

# Calculate similarity matrix (cosine similarity):
# pairwise dot products divided by the product of the two vector norms
norms = np.linalg.norm(embeddings, axis=1)
similarity_matrix = (embeddings @ embeddings.T) / np.outer(norms, norms)

# Create DataFrame for better visualization
df_sim = pd.DataFrame(similarity_matrix, 
                      index=[f"Text {i+1}" for i in range(len(texts))],
                      columns=[f"Text {i+1}" for i in range(len(texts))])

# Visualize similarity matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df_sim, annot=True, fmt='.3f', cmap='coolwarm', center=0.5,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Semantic Similarity Matrix', fontsize=16)
plt.tight_layout()
plt.show()

# Show text mapping
print("Text Mapping:")
for i, text in enumerate(texts, 1):
    print(f"Text {i}: {text}")
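
The cosine similarity computed in the cell above can be unpacked step by step. The following is a minimal sketch on made-up 3-dimensional vectors (not real model output), using only the standard library so each step is visible. It also illustrates that once vectors are scaled to unit length, cosine similarity reduces to a plain dot product. The helper names `cosine_similarity` and `l2_normalize` are illustrative, not part of any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Toy vectors chosen for illustration only
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice the length
c = [2.0, -1.0, 0.0]  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)

# After normalization, the dot product alone gives the cosine
dot_normalized = sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

print(f"cos(a, b) = {sim_ab:.4f}")
print(f"cos(a, c) = {sim_ac:.4f}")
print(f"dot of normalized a and b = {dot_normalized:.4f}")
```

Vector length does not affect the score: `a` and `b` differ in magnitude but point the same way, so they are maximally similar, while the orthogonal pair scores zero.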

## 4. Finding Similar Documents

In [None]:
# Query text
query = "How to query a graph database?"
query_embedding = model.encode(query)

# Calculate similarities to all texts
similarities = np.dot(embeddings, query_embedding) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding))

# Sort by similarity
sorted_indices = np.argsort(similarities)[::-1]

print(f"Query: '{query}'\n")
print("Most similar texts:")
print("-" * 80)

for rank, idx in enumerate(sorted_indices, 1):
    print(f"{rank}. Similarity: {similarities[idx]:.4f}")
    print(f"   Text: {texts[idx]}")
    print()

# Visualize as bar chart
plt.figure(figsize=(10, 6))
plt.barh(range(len(texts)), similarities[sorted_indices], color='steelblue')
plt.yticks(range(len(texts)), [f"{texts[i][:30]}..." if len(texts[i]) > 30 else texts[i] for i in sorted_indices])
plt.xlabel('Similarity Score')
plt.title(f'Similarity to Query: "{query}"')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
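
The ranking above lists every text, including unrelated ones. A retrieval step usually keeps only the best few results and may drop anything below a score cutoff. The sketch below shows one way to do that; the function `top_k`, its `min_score` parameter, and the scores themselves are hypothetical and chosen for illustration, not produced by a model.

```python
def top_k(scores, k=3, min_score=0.0):
    """Return (index, score) pairs for the k highest scores at or above min_score."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked[:k] if s >= min_score]

# Illustrative similarity scores, one per document (assumed values)
toy_scores = [0.62, 0.71, 0.55, 0.30, 0.05, 0.12]

hits = top_k(toy_scores, k=3, min_score=0.5)
for rank, (idx, score) in enumerate(hits, 1):
    print(f"{rank}. doc {idx} (score {score:.2f})")

# A stricter cutoff can return fewer than k results
strict_hits = top_k(toy_scores, k=3, min_score=0.65)
print(f"Results above 0.65: {len(strict_hits)}")
```

A suitable cutoff depends on the embedding model and the data, so it is worth tuning against real queries rather than fixing it in advance.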

## 5. Visualizing Embedding Space (2D)

In [None]:
# Create more diverse texts for visualization
categories = {
    "Graph Databases": [
        "Neo4j is a graph database",
        "Graph databases use nodes and edges",
        "Cypher is Neo4j's query language",
        "Graph traversal is efficient in Neo4j"
    ],
    "Document Databases": [
        "MongoDB stores JSON documents",
        "Document databases are schema-flexible",
        "CouchDB is another document database",
        "Documents can contain nested data"
    ],
    "Machine Learning": [
        "Neural networks learn patterns",
        "Deep learning uses multiple layers",
        "Embeddings represent text as vectors",
        "Transformers revolutionized NLP"
    ],
    "General": [
        "The weather is sunny today",
        "Coffee is a popular beverage",
        "Books contain knowledge",
        "Music brings joy to people"
    ]
}

# Flatten texts and create labels
all_texts = []
all_labels = []
all_colors = []
color_map = {'Graph Databases': 'blue', 'Document Databases': 'green', 
             'Machine Learning': 'red', 'General': 'gray'}

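# Use a distinct loop variable so the `texts` list from section 3 is not overwritten
for category, category_texts in categories.items():
    all_texts.extend(category_texts)
    all_labels.extend([category] * len(category_texts))
    all_colors.extend([color_map[category]] * len(category_texts))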

# Generate embeddings
all_embeddings = model.encode(all_texts)

# Reduce to 2D using PCA
pca = PCA(n_components=2, random_state=42)
embeddings_2d_pca = pca.fit_transform(all_embeddings)

# Also try t-SNE for comparison
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
embeddings_2d_tsne = tsne.fit_transform(all_embeddings)

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

# PCA visualization
for category in categories.keys():
    mask = np.array(all_labels) == category
    ax1.scatter(embeddings_2d_pca[mask, 0], embeddings_2d_pca[mask, 1], 
               label=category, color=color_map[category], s=100, alpha=0.7)

ax1.set_title('Embedding Space - PCA Projection', fontsize=14)
ax1.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
ax1.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
ax1.legend()
ax1.grid(alpha=0.3)

# t-SNE visualization
for category in categories.keys():
    mask = np.array(all_labels) == category
    ax2.scatter(embeddings_2d_tsne[mask, 0], embeddings_2d_tsne[mask, 1], 
               label=category, color=color_map[category], s=100, alpha=0.7)

ax2.set_title('Embedding Space - t-SNE Projection', fontsize=14)
ax2.set_xlabel('t-SNE 1')
ax2.set_ylabel('t-SNE 2')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice how similar texts cluster together in the embedding space!")
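
Two-dimensional projections can distort distances, t-SNE in particular, so a numeric check in the original space is a useful complement to the plots. The sketch below compares the average cosine similarity of same-category pairs with that of different-category pairs. The function name `within_vs_across` and the toy 2D vectors are assumptions made for illustration.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def within_vs_across(vectors, labels):
    """Mean cosine similarity for same-label pairs and for different-label pairs."""
    same, diff = [], []
    for i, j in combinations(range(len(vectors)), 2):
        bucket = same if labels[i] == labels[j] else diff
        bucket.append(cosine(vectors[i], vectors[j]))
    return sum(same) / len(same), sum(diff) / len(diff)

# Toy 2D "embeddings": two tight groups pointing in different directions
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = ["db", "db", "ml", "ml"]

within, across = within_vs_across(toy_vectors, toy_labels)
print(f"within-category mean similarity: {within:.3f}")
print(f"across-category mean similarity: {across:.3f}")
```

The same function should accept the notebook's `all_embeddings` and `all_labels`; a within-category mean clearly above the across-category mean would support what the plots suggest.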

## 6. Impact of Different Embedding Models

In [None]:
# Compare different embedding models
models_to_compare = [
    ('all-MiniLM-L6-v2', 384),  # Our default model
    ('all-MiniLM-L12-v2', 384),  # Deeper model
    # ('all-mpnet-base-v2', 768),  # Larger model (optional, slower)
]

test_texts = [
    "Neo4j graph database",
    "Graph database technology",
    "MongoDB document store",
    "Weather forecast today"
]

print("Comparing embedding models:\n")

for model_name, dim in models_to_compare:
    print(f"Model: {model_name} ({dim} dimensions)")
    print("-" * 50)
    
    # Load model (separate names so the notebook's `model` and `embeddings` are not overwritten)
    candidate_model = SentenceTransformer(model_name)
    
    # Generate embeddings
    test_embeddings = candidate_model.encode(test_texts)
    
    # Calculate similarity of the first text to each of the others
    anchor_embedding = test_embeddings[0]
    test_similarities = np.dot(test_embeddings[1:], anchor_embedding) / (
        np.linalg.norm(test_embeddings[1:], axis=1) * np.linalg.norm(anchor_embedding)
    )
    
    print(f"Query: '{test_texts[0]}'")
    for i, sim in enumerate(test_similarities):
        print(f"  → '{test_texts[i+1]}': {sim:.4f}")
    print()

## 7. Embedding Speed Comparison

In [None]:
import time

# Test embedding speed
test_corpus = [
    "This is a test sentence." for _ in range(100)
]

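model = SentenceTransformer('all-MiniLM-L6-v2')

# Warm-up call so one-time initialization cost is not counted in either measurement
_ = model.encode(test_corpus[0])

# Single encoding (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
for sentence in test_corpus:
    _ = model.encode(sentence)
single_time = time.perf_counter() - start

# Batch encoding
start = time.perf_counter()
_ = model.encode(test_corpus)
batch_time = time.perf_counter() - start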

print(f"Encoding {len(test_corpus)} sentences:\n")
print(f"Single encoding (one by one): {single_time:.3f} seconds")
print(f"Batch encoding (all at once): {batch_time:.3f} seconds")
print(f"\nSpeedup: {single_time/batch_time:.1f}x faster with batch encoding")
print(f"\nAverage time per sentence:")
print(f"  Single: {single_time/len(test_corpus)*1000:.2f} ms")
print(f"  Batch: {batch_time/len(test_corpus)*1000:.2f} ms")
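
When a corpus is too large to pass in a single call, or arrives incrementally, it can be fed to the model in fixed-size chunks. The sketch below shows the chunking pattern only: `batched` is a hypothetical helper, and `fake_encode` is a stand-in for `model.encode` so the example runs without downloading a model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_encode(texts):
    """Stand-in for model.encode: returns one tiny vector per text."""
    return [[float(len(t))] for t in texts]

corpus = [f"sentence number {i}" for i in range(10)]

vectors = []
calls = 0
for chunk in batched(corpus, batch_size=4):
    vectors.extend(fake_encode(chunk))  # one call per chunk, not per sentence
    calls += 1

print(f"{len(vectors)} vectors from {calls} encode calls")
```

For a corpus that fits in memory, manual chunking is usually unnecessary: `model.encode` accepts a `batch_size` argument that controls how many texts go through the model at once.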

## 8. Practical RAG Example

In [None]:
from neo4j_rag import Neo4jRAG

# Connect to Neo4j
rag = Neo4jRAG()

# Sample knowledge base about databases
knowledge_base = [
    "Neo4j is a graph database that uses nodes and relationships to store data.",
    "Graph databases are optimized for traversing connected data.",
    "Cypher is Neo4j's declarative query language for pattern matching.",
    "ACID compliance ensures data consistency in Neo4j transactions.",
    "Neo4j can scale horizontally with causal clustering.",
    "Vector indexes in Neo4j enable similarity search on embeddings.",
    "MongoDB is a document database that stores data in JSON-like format.",
    "PostgreSQL is a relational database with SQL support."
]

# Load knowledge base
for i, text in enumerate(knowledge_base):
    rag.add_document(
        content=text,
        metadata={"source": "knowledge_base", "index": i}
    )

print(f"✅ Loaded {len(knowledge_base)} documents\n")

# Test different queries
queries = [
    "How does Neo4j ensure data consistency?",
    "What query language does Neo4j use?",
    "Can Neo4j do similarity search?",
    "What is MongoDB?"
]

for query in queries:
    print(f"Query: {query}")
    results = rag.vector_search(query, k=2)
    
    if results:
        print("Top results:")
        for i, result in enumerate(results, 1):
            print(f"  {i}. (Score: {result['score']:.3f}) {result['text']}")
    print()

# Clean up
rag.close()

## Summary

In this notebook, you learned:

1. **Embeddings** are numerical representations of text (384-dimensional vectors for `all-MiniLM-L6-v2`)
2. **Semantic similarity** can be measured using cosine similarity between embeddings
3. **Similar texts cluster together** in the embedding space
4. **Different models** produce different embeddings with varying quality and speed
5. **Batch encoding** is typically much faster than encoding texts one by one, because the model processes many texts per call
6. **RAG systems** use embeddings to find relevant context for answering questions

### Key Takeaways:
- Embeddings capture semantic meaning, not just keywords
- The choice of embedding model affects both quality and performance
- Visualization helps understand how embeddings organize information
- In production, prefer batch encoding whenever many texts are embedded at once
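
To make the first takeaway concrete, the sketch below measures plain keyword overlap (Jaccard similarity over word sets) between a query and two candidate texts. The tokenizer is deliberately naive and the sentences are invented for illustration. Keyword overlap favors the unrelated text here because it shares filler words with the query, which is the kind of mistake embedding similarity is meant to avoid.

```python
import re

def tokens(text):
    """Lowercased set of words; a deliberately naive tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Shared words divided by total distinct words."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

query = "How to query a graph database?"
relevant = "Cypher is the language used to ask Neo4j questions"
unrelated = "How to brew a good coffee?"

overlap_relevant = jaccard(query, relevant)
overlap_unrelated = jaccard(query, unrelated)

print(f"Keyword overlap with relevant text:  {overlap_relevant:.3f}")
print(f"Keyword overlap with unrelated text: {overlap_unrelated:.3f}")
```

Encoding the same three sentences with the notebook's model and comparing cosine similarities is a quick way to check whether the embedding ranking differs from the keyword ranking.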

### Next Steps:
- Explore notebook 03 for document processing techniques
- Try different embedding models for your use case
- Experiment with different similarity thresholds