# pgVector and LangChain Example Notebook

This notebook demonstrates how to store documents in a vector database and perform similarity searches using pgVector and LangChain.

## Features:
- 🔄 **Random Vector Embeddings**: No OpenAI API key required
- 📚 **Document Storage**: Store sample documents with metadata
- 🔍 **Similarity Search**: Find similar documents
- 🏷️ **Metadata Filtering**: Filter results by metadata
- 🎯 **Direct Vector Search**: Search using vector embeddings directly


In [None]:
# Import required libraries and existing functions
import numpy as np
import pandas as pd
from typing import List, Dict, Any
import warnings

# Import from our existing modules
from vector_example import RandomEmbeddings, create_sample_documents
from config import DATABASE_URL, VECTOR_DIMENSION, COLLECTION_NAME

# Import LangChain components
from langchain_community.vectorstores import PGVector
from langchain_core.documents import Document

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("📦 All libraries imported successfully!")
print(f"🔗 Database URL: {DATABASE_URL}")
print(f"📏 Vector Dimension: {VECTOR_DIMENSION}")
print(f"📂 Collection Name: {COLLECTION_NAME}")


## 1. 📚 Data Generation

Let's generate sample documents using the existing `create_sample_documents()` function and explore the data.


In [None]:
# Generate sample documents using existing function
sample_documents = create_sample_documents()

print(f"📊 Generated {len(sample_documents)} sample documents\n")

# Display the documents in a nice format
for i, doc in enumerate(sample_documents, 1):
    print(f"🔹 Document {i}:")
    print(f"   Content: {doc.page_content}")
    print(f"   Metadata: {doc.metadata}")
    print()

# Create a DataFrame for better visualization
data = []
for doc in sample_documents:
    data.append({
        'id': doc.metadata['id'],
        'content': doc.page_content,
        'category': doc.metadata['category'],
        'source': doc.metadata['source']
    })

df = pd.DataFrame(data)
print("📋 Documents as DataFrame:")
display(df)


## 2. 🔧 Vector Store Initialization

Initialize the embedding model and pgVector store for data storage and retrieval.
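
The `RandomEmbeddings` class is imported from `vector_example.py` and its code is not shown in this notebook. As a rough sketch of what such a stand-in could look like, the hypothetical `SeededRandomEmbeddings` class below seeds a random generator from the text, so the same text always maps to the same vector. The class name and the seeding scheme are assumptions made for illustration and may differ from the real implementation; only the `embed_documents` and `embed_query` method names follow LangChain's embeddings interface.

```python
import hashlib
import random
from typing import List


class SeededRandomEmbeddings:
    """Hypothetical stand-in: pseudo-random vectors seeded by the input text."""

    def __init__(self, dimension: int = 8) -> None:
        self.dimension = dimension

    def _embed(self, text: str) -> List[float]:
        # Hash the text to get a reproducible seed (same text -> same vector).
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        return [rng.uniform(-1.0, 1.0) for _ in range(self.dimension)]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        return self._embed(text)


demo = SeededRandomEmbeddings(dimension=4)
print(len(demo.embed_query("Test query")))
print(demo.embed_query("Test query") == demo.embed_query("Test query"))
```

Seeding by content keeps searches reproducible between runs, although the vectors still carry no semantic meaning.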


In [None]:
# Initialize embedding model (random vectors for demonstration)
embeddings = RandomEmbeddings(dimension=VECTOR_DIMENSION)
print(f"🧮 Initialized RandomEmbeddings with {VECTOR_DIMENSION} dimensions")

# Test the embedding model
test_vector = embeddings.embed_query("Test query")
print(f"🔍 Sample vector shape: {len(test_vector)}")
print(f"🔢 Sample vector (first 5 elements): {test_vector[:5]}")

# Initialize PGVector store
try:
    vector_store = PGVector(
        connection_string=DATABASE_URL,
        embedding_function=embeddings,
        collection_name=COLLECTION_NAME,
        distance_strategy="cosine"
    )
    print("✅ pgVector store initialized successfully!")
    print(f"📊 Using cosine distance for similarity search")
    
except Exception as e:
    print(f"❌ Failed to initialize pgVector store: {e}")
    print("🔧 Make sure Docker containers are running: cd docker && docker-compose up -d")


## 3. 💾 Data Storage (Training Phase)

Store the generated documents in the pgVector database. This indexing step is loosely analogous to a "training" phase: it builds the vector database that all later searches run against.
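
If the store assigns a fresh ID on every insert, re-running the storage cell may add the same documents a second time. One possible way to make ingestion repeatable is to derive a stable ID from each document's content and metadata. The helper `stable_doc_id` and the namespace value below are hypothetical, and whether the installed PGVector version accepts and deduplicates on such IDs is an assumption that is not verified here.

```python
import json
import uuid
from typing import Any, Dict

# Arbitrary namespace chosen for this sketch (an assumption, not a project constant).
DOC_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/pgvector-notebook")


def stable_doc_id(content: str, metadata: Dict[str, Any]) -> str:
    """Return the same UUID string for the same content and metadata."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps({"content": content, "metadata": metadata}, sort_keys=True)
    return str(uuid.uuid5(DOC_NAMESPACE, payload))


first = stable_doc_id("Vector search basics", {"id": 1, "category": "technology"})
second = stable_doc_id("Vector search basics", {"category": "technology", "id": 1})
print(first == second)
```

The two calls differ only in metadata key order, so they yield the same ID.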


In [None]:
# Store documents in the vector database
print("💾 Storing documents in pgVector database...")

try:
    vector_store.add_documents(sample_documents)
    print(f"✅ Successfully stored {len(sample_documents)} documents!")
    
    # Verify storage by checking if we can retrieve any documents
    print("\n🔍 Verifying data storage...")
    
    # Try a simple search to verify
    test_results = vector_store.similarity_search("technology", k=1)
    if test_results:
        print(f"✅ Verification successful! Found {len(test_results)} document(s)")
        print(f"📄 Sample document: {test_results[0].page_content[:50]}...")
    else:
        print("⚠️ No documents found in verification search")
    
    print(f"\n📊 Database now contains vector embeddings for all documents")
    print(f"🔢 Each document is represented as a {VECTOR_DIMENSION}-dimensional vector")
    
except Exception as e:
    print(f"❌ Failed to store documents: {e}")
    print("🔧 Check if the database connection is working properly")


## 4. 🔍 Basic Similarity Search

Perform similarity searches using different query texts to find the most relevant documents.
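
To make the ranking step concrete, here is a minimal brute-force sketch of cosine-distance search in plain Python. It illustrates the idea only and is not how pgVector executes the query inside PostgreSQL; the function names and the toy 2-D vectors are made up for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Rank stored vectors by ascending distance and keep the first k."""
    scored = [(i, cosine_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-D vectors (illustrative values, not taken from the notebook's data).
stored = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for index, distance in top_k((1.0, 0.1), stored, k=2):
    print(index, round(distance, 4))
```

The query points almost along the first axis, so index 0 ranks first and the diagonal vector at index 2 ranks second.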


In [None]:
# Define search queries for testing
search_queries = [
    "Tell me about artificial intelligence and machine learning",
    "What are data analysis technologies?", 
    "Explain cloud and distributed systems",
    "How does blockchain technology work?",
    "What is computer vision and image processing?"
]

print("🔍 Performing similarity searches...\n")

# Perform searches and display results
for i, query in enumerate(search_queries, 1):
    print(f"🔸 Search {i}: '{query}'")
    print("-" * 60)
    
    try:
        # Basic similarity search
        similar_docs = vector_store.similarity_search(query, k=3)
        
        if similar_docs:
            print("📋 Top 3 similar documents:")
            for j, doc in enumerate(similar_docs, 1):
                print(f"   {j}. {doc.page_content}")
                print(f"      📂 Category: {doc.metadata.get('category', 'N/A')}")
                print(f"      🆔 ID: {doc.metadata.get('id', 'N/A')}")
            
            # Search again, this time returning scores (cosine distances)
            similar_docs_with_scores = vector_store.similarity_search_with_score(query, k=3)
            print("\n📊 Cosine distances (lower means more similar):")
            for j, (doc, score) in enumerate(similar_docs_with_scores, 1):
                print(f"   {j}. Distance: {score:.4f}")
        else:
            print("❌ No similar documents found")
    
    except Exception as e:
        print(f"❌ Search failed: {e}")
    
    print("=" * 80)


## 5. 🏷️ Metadata Filtering Search

Demonstrate how to filter search results based on metadata criteria.
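
Conceptually, a filter such as `{"category": "technology"}` keeps only the documents whose metadata matches every listed key, and the similarity ranking then applies to what remains. The sketch below shows that equality check on plain dictionaries. The `matches` helper and the sample records are hypothetical, and the way PGVector translates a filter into SQL may differ.

```python
from typing import Any, Dict, List


def matches(metadata: Dict[str, Any], criteria: Dict[str, Any]) -> bool:
    """True when every key in criteria is present with an equal value."""
    return all(key in metadata and metadata[key] == value for key, value in criteria.items())


# Illustrative records that reuse the metadata keys seen in this notebook.
records: List[Dict[str, Any]] = [
    {"id": 1, "category": "technology", "source": "sample_data"},
    {"id": 2, "category": "science", "source": "sample_data"},
    {"id": 3, "category": "technology", "source": "other"},
]

tech_only = [r for r in records if matches(r, {"category": "technology"})]
both = [r for r in records if matches(r, {"category": "technology", "source": "sample_data"})]
print([r["id"] for r in tech_only])
print([r["id"] for r in both])
```

Adding a second key narrows the result, because all criteria must hold at once.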


In [None]:
# Metadata filtering search examples
print("🏷️ Performing metadata filtering searches...\n")

# Example 1: Filter by category
try:
    print("🔹 Filter 1: Only 'technology' category documents")
    filtered_docs = vector_store.similarity_search(
        query="Tell me about technology",
        k=5,
        filter={"category": "technology"}
    )
    
    if filtered_docs:
        print(f"📋 Found {len(filtered_docs)} documents in 'technology' category:")
        for i, doc in enumerate(filtered_docs, 1):
            print(f"   {i}. {doc.page_content}")
            print(f"      📂 Category: {doc.metadata['category']}")
            print(f"      🆔 ID: {doc.metadata['id']}")
    else:
        print("❌ No documents found with specified filter")
        
except Exception as e:
    print(f"❌ Filtered search failed: {e}")

print("\n" + "=" * 60)

# Example 2: Filter by source
try:
    print("🔹 Filter 2: Only 'sample_data' source documents")
    filtered_docs = vector_store.similarity_search(
        query="data processing",
        k=3,
        filter={"source": "sample_data"}
    )
    
    if filtered_docs:
        print(f"📋 Found {len(filtered_docs)} documents from 'sample_data' source:")
        for i, doc in enumerate(filtered_docs, 1):
            print(f"   {i}. {doc.page_content}")
            print(f"      🔗 Source: {doc.metadata['source']}")
    else:
        print("❌ No documents found with specified filter")
        
except Exception as e:
    print(f"❌ Filtered search failed: {e}")

print("\n" + "=" * 60)

# Example 3: Search without filter for comparison
try:
    print("🔹 Comparison: Search without any filters")
    all_docs = vector_store.similarity_search(
        query="technology and data",
        k=3
    )
    
    if all_docs:
        print(f"📋 Found {len(all_docs)} documents (no filter):")
        for i, doc in enumerate(all_docs, 1):
            print(f"   {i}. {doc.page_content}")
            print(f"      📂 Category: {doc.metadata.get('category', 'N/A')}")
            print(f"      🔗 Source: {doc.metadata.get('source', 'N/A')}")
    else:
        print("❌ No documents found")
        
except Exception as e:
    print(f"❌ Search failed: {e}")


## 6. 🎯 Direct Vector Search

Demonstrate searching using vector embeddings directly, without query text.
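
A text search can be thought of as "embed the query, then search by vector". The hypothetical in-memory index below makes that relationship explicit. `TinyVectorIndex` and the toy `vowel_counts` embedding are assumptions for illustration; they only mirror the method names used in this notebook and are not the PGVector implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple


def vowel_counts(text: str) -> List[float]:
    """Toy deterministic embedding: a bias term plus one count per vowel."""
    lowered = text.lower()
    return [1.0] + [float(lowered.count(v)) for v in "aeiou"]


class TinyVectorIndex:
    """Hypothetical in-memory index used only to illustrate the two search paths."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self._embed = embed
        self._items: List[Tuple[str, List[float]]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self._items.append((text, self._embed(text)))

    def similarity_search_by_vector(self, embedding: Sequence[float], k: int = 3) -> List[str]:
        def distance(vector: Sequence[float]) -> float:
            # Cosine distance; the bias term guarantees non-zero norms.
            dot = sum(x * y for x, y in zip(embedding, vector))
            return 1.0 - dot / (math.hypot(*embedding) * math.hypot(*vector))

        ranked = sorted(self._items, key=lambda item: distance(item[1]))
        return [text for text, _ in ranked[:k]]

    def similarity_search(self, query: str, k: int = 3) -> List[str]:
        # Text search delegates to vector search after embedding the query.
        return self.similarity_search_by_vector(self._embed(query), k=k)


index = TinyVectorIndex(vowel_counts)
index.add_texts(["machine learning", "cloud systems", "blockchain"])
query = "machine learning and AI"
by_text = index.similarity_search(query, k=2)
by_vector = index.similarity_search_by_vector(vowel_counts(query), k=2)
print(by_text == by_vector)
```

The two paths agree here because the toy embedding is deterministic. With an embedding that returns a fresh random vector on every call, the text search and the precomputed-vector search can return different documents.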


In [None]:
# Direct vector search examples
print("🎯 Performing direct vector searches...\n")

# Generate multiple random vectors for testing
num_searches = 3
for i in range(num_searches):
    print(f"🔸 Vector Search {i+1}")
    print("-" * 40)
    
    try:
        # Generate a random vector
        random_vector = embeddings.embed_query(f"random search {i+1}")
        
        print(f"🧮 Generated random vector with {len(random_vector)} dimensions")
        print(f"🔢 Vector sample (first 5 elements): {random_vector[:5]}")
        
        # Search using the random vector
        similar_docs = vector_store.similarity_search_by_vector(
            embedding=random_vector,
            k=3
        )
        
        if similar_docs:
            print(f"\n📋 Found {len(similar_docs)} similar documents:")
            for j, doc in enumerate(similar_docs, 1):
                print(f"   {j}. {doc.page_content}")
                print(f"      📂 Category: {doc.metadata.get('category', 'N/A')}")
                print(f"      🆔 ID: {doc.metadata.get('id', 'N/A')}")
        else:
            print("❌ No similar documents found")
    
    except Exception as e:
        print(f"❌ Vector search failed: {e}")
    
    print("=" * 60)

# Compare vector search with text search
print("🔀 Comparison: Vector vs Text Search")
print("-" * 50)

try:
    # Generate a vector for a specific query
    query_text = "machine learning and AI"
    query_vector = embeddings.embed_query(query_text)
    
    # Text-based search
    text_results = vector_store.similarity_search(query_text, k=2)
    
    # Vector-based search using a precomputed embedding of the same query text
    vector_results = vector_store.similarity_search_by_vector(query_vector, k=2)
    
    print(f"📝 Text search for: '{query_text}'")
    if text_results:
        for i, doc in enumerate(text_results, 1):
            print(f"   {i}. {doc.page_content}")
    
    print(f"\n🧮 Vector search using embedding of: '{query_text}'")
    if vector_results:
        for i, doc in enumerate(vector_results, 1):
            print(f"   {i}. {doc.page_content}")
    
    # Note about randomness
    print(f"\n💡 Note: Since we're using random embeddings, results may vary between runs")
    print(f"   In production, you'd use consistent embeddings like OpenAI's text-embedding-ada-002")
    
except Exception as e:
    print(f"❌ Comparison search failed: {e}")


## 7. 📊 Search Results Analysis

Let's analyze and tabulate the search results to better understand how the vector database ranks documents.
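
Because the scores in this notebook are cosine distances, lower values mean closer matches. The short sketch below summarizes a list of distances and converts them to similarities using only the standard library. The helper names and the sample values are illustrative assumptions, not output from the database.

```python
import statistics
from typing import Dict, List


def summarize_distances(distances: List[float]) -> Dict[str, float]:
    """Summarize cosine distances; a lower distance means a closer match."""
    return {
        "min": min(distances),
        "max": max(distances),
        "mean": statistics.fmean(distances),
        # Population standard deviation, matching np.std's default (ddof=0).
        "std": statistics.pstdev(distances),
    }


def to_similarity(distance: float) -> float:
    """Cosine similarity that corresponds to a cosine distance."""
    return 1.0 - distance


# Illustrative distances, not real output from the database.
example = [0.10, 0.40, 0.70]
summary = summarize_distances(example)
print(summary)
print([round(to_similarity(d), 2) for d in example])
```

Reporting `1 - distance` alongside the raw score can make result tables easier to read when "higher is better" is the expected convention.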


In [None]:
# Analyze search performance and results
print("📊 Analyzing search results and database performance...\n")

# Collect search statistics
search_stats = {
    'total_documents': len(sample_documents),
    'search_queries_tested': len(search_queries),
    'categories': set(),
    'sources': set(),
    'similarity_scores': []
}

# Analyze stored documents
for doc in sample_documents:
    search_stats['categories'].add(doc.metadata.get('category', 'Unknown'))
    search_stats['sources'].add(doc.metadata.get('source', 'Unknown'))

# Perform a comprehensive search analysis
comprehensive_query = "technology data science machine learning"
print(f"🔍 Comprehensive analysis with query: '{comprehensive_query}'")

try:
    # Get all results with scores
    all_results = vector_store.similarity_search_with_score(
        comprehensive_query, 
        k=len(sample_documents)  # Get all documents
    )
    
    if all_results:
        # Extract scores for analysis
        scores = [score for _, score in all_results]
        search_stats['similarity_scores'] = scores
        
        # Create results DataFrame
        results_data = []
        for i, (doc, score) in enumerate(all_results, 1):
            results_data.append({
                'rank': i,
                'content': doc.page_content[:50] + "...",
                'category': doc.metadata.get('category', 'N/A'),
                'source': doc.metadata.get('source', 'N/A'),
                'similarity_score': round(score, 4),
                'document_id': doc.metadata.get('id', 'N/A')
            })
        
        results_df = pd.DataFrame(results_data)
        print("\n📋 Complete Search Results:")
        display(results_df)
        
        # Statistics summary
        print(f"\n📈 Search Statistics Summary:")
        print(f"   📚 Total documents in database: {search_stats['total_documents']}")
        print(f"   🔍 Search queries tested: {search_stats['search_queries_tested']}")
        print(f"   📂 Categories found: {', '.join(sorted(search_stats['categories']))}")
        print(f"   🔗 Sources found: {', '.join(sorted(search_stats['sources']))}")
        print(f"   📊 Similarity score range: {min(scores):.4f} - {max(scores):.4f}")
        print(f"   📊 Average similarity score: {np.mean(scores):.4f}")
        print(f"   📊 Similarity score std dev: {np.std(scores):.4f}")
        
        # Top and bottom results
        print(f"\n🏆 Most similar document:")
        print(f"   Content: {all_results[0][0].page_content}")
        print(f"   Score: {all_results[0][1]:.4f}")
        
        print(f"\n🔻 Least similar document:")
        print(f"   Content: {all_results[-1][0].page_content}")
        print(f"   Score: {all_results[-1][1]:.4f}")
        
    else:
        print("❌ No results found for comprehensive analysis")
        
except Exception as e:
    print(f"❌ Analysis failed: {e}")

# Performance note
print(f"\n💡 Performance Notes:")
print(f"   🔄 Random embeddings exercise the storage and search pipeline but carry no semantic meaning")
print(f"   ⚡ Search speed depends on database size and indexing")
print(f"   🎯 In production, use semantic embeddings (OpenAI, Sentence Transformers, etc.)")
print(f"   📈 Vector similarity scores closer to 0 indicate higher similarity (cosine distance)")


## 8. 🎉 Summary and Next Steps

This notebook demonstrated the complete workflow of using pgVector with LangChain for vector-based document storage and retrieval.


In [None]:
# Summary of what we accomplished
print("🎉 Notebook Execution Summary")
print("=" * 50)

summary_points = [
    "✅ Successfully imported functions from vector_example.py",
    "✅ Generated sample documents with metadata",
    "✅ Initialized RandomEmbeddings (no OpenAI API required)",
    "✅ Connected to pgVector database",
    "✅ Stored documents in vector database (training phase)",
    "✅ Performed similarity searches with various queries",
    "✅ Demonstrated metadata filtering capabilities",
    "✅ Executed direct vector searches",
    "✅ Analyzed search results and performance statistics"
]

for point in summary_points:
    print(f"  {point}")

print(f"\n📊 Final Statistics:")
print(f"  📚 Documents stored: {len(sample_documents)}")
print(f"  🔍 Search methods tested: Text search, Filtered search, Vector search")
print(f"  📏 Vector dimensions: {VECTOR_DIMENSION}")
print(f"  🎯 Distance metric: Cosine distance")

print(f"\n🚀 Next Steps for Production:")
next_steps = [
    "🔑 Replace RandomEmbeddings with production embeddings (OpenAI, Sentence Transformers)",
    "📚 Load real documents instead of sample data",
    "⚡ Optimize vector indexing for large datasets",
    "🔒 Implement proper authentication and security",
    "📈 Add monitoring and performance metrics",
    "🧪 Implement A/B testing for different embedding models",
    "🔄 Set up automated data pipeline for document updates",
    "🌐 Create REST API for vector search functionality"
]

for step in next_steps:
    print(f"  {step}")

print(f"\n💡 Key Learnings:")
learnings = [
    "Vector databases enable semantic search beyond keyword matching",
    "Metadata filtering adds powerful query capabilities",
    "pgVector provides PostgreSQL-native vector operations",
    "Random embeddings work for testing, but semantic embeddings are needed for production",
    "Cosine distance is effective for text similarity comparisons"
]

for learning in learnings:
    print(f"  • {learning}")

print(f"\n🔗 Useful Resources:")
resources = [
    "pgVector Documentation: https://github.com/pgvector/pgvector", 
    "LangChain Documentation: https://python.langchain.com/",
    "OpenAI Embeddings: https://platform.openai.com/docs/guides/embeddings",
    "Sentence Transformers: https://www.sbert.net/"
]

for resource in resources:
    print(f"  📖 {resource}")

print(f"\n🎯 Thank you for exploring pgVector with LangChain!")
print(f"   Feel free to experiment with different queries and embeddings!")
