# LangChain Knowledge Base Tutorial: Building a Semantic Search Engine
# Based on official documentation: https://docs.langchain.com/oss/python/langchain/knowledge-base

"""
WHAT WE'RE BUILDING:
According to LangChain documentation, we'll build a search engine over a PDF document 
that allows us to retrieve passages similar to an input query. This is the foundation 
for Retrieval-Augmented Generation (RAG) applications.

CORE CONCEPTS COVERED:
1. Documents and document loaders
2. Text splitters  
3. Embeddings
4. Vector stores and retrievers
"""

import os
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

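# Required packages (langchain-text-splitters is listed explicitly because it is imported above):
# pip install -qU langchain-chroma langchain-ollama langchain-community langchain-text-splitters pypdf
# The embedding step also needs a running Ollama server with the model pulled:
# ollama pull nomic-embed-text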

print("=" * 80)
print("BUILDING A SEMANTIC SEARCH ENGINE WITH LANGCHAIN")
print("=" * 80)

print("""
OVERVIEW (from LangChain docs):
This tutorial demonstrates LangChain's document loader, embedding, and vector store 
abstractions. These are designed to support retrieval of data from databases and 
other sources for integration with LLM workflows, particularly for RAG applications.
""")

# ============================================================================
# STEP 1: DOCUMENTS AND DOCUMENT LOADERS
# ============================================================================
print("\n" + "=" * 60)
print("STEP 1: DOCUMENTS AND DOCUMENT LOADERS")
print("=" * 60)

print("""
WHAT IS A DOCUMENT? (LangChain Documentation):
LangChain implements a Document abstraction with three attributes:
- page_content: a string representing the content
- metadata: a dict containing arbitrary metadata  
- id: (optional) a string identifier for the document

The metadata attribute can capture information about the source of the document,
its relationship to other documents, and other information. Note that an individual 
Document object often represents a chunk of a larger document.
""")

print("\n1.1 Loading PDF Document...")
print("Using PyPDFLoader which loads one Document object per PDF page.")
print("For each page, we can access:")
print("- The string content of the page")
print("- Metadata containing the file name and page number")

# Load PDF - works with both online URLs and local file paths
pdf_url = "https://arxiv.org/pdf/2501.04040.pdf"
loader = PyPDFLoader(pdf_url)
documents = loader.load()

print(f"\n✓ Loaded {len(documents)} pages from PDF")
print(f"✓ Each page is now a Document object with content and metadata")

# Show sample document structure
if documents:
    sample_doc = documents[0]
    print(f"\nSample Document Structure:")
    print(f"- Content length: {len(sample_doc.page_content)} characters")
    print(f"- Metadata: {sample_doc.metadata}")
    print(f"- Content preview: {sample_doc.page_content[:200]}...")

# ============================================================================
# STEP 2: TEXT SPLITTING  
# ============================================================================
print("\n" + "=" * 60)
print("STEP 2: TEXT SPLITTING")
print("=" * 60)

print("""
WHY SPLIT TEXT? (LangChain Documentation):
For both information retrieval and downstream question-answering purposes, 
a page may be too coarse a representation. Our goal is to retrieve Document 
objects that answer an input query, and further splitting our PDF helps ensure 
that the meanings of relevant portions are not "washed out" by surrounding text.

RECURSIVE CHARACTER TEXT SPLITTER:
We use RecursiveCharacterTextSplitter, which recursively splits the document 
using common separators like new lines until each chunk is the appropriate size. 
This is the recommended text splitter for generic text use cases.

OVERLAP STRATEGY:
The overlap helps mitigate the possibility of separating a statement from 
important context related to it.
""")

print("\n2.1 Configuring Text Splitter...")
print("- Chunk size: 4096 characters (maximum per chunk)")
print("- Overlap: 410 characters (10% overlap)")
print("- Method: Recursive character splitting")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4096,
    chunk_overlap=410,  # 10% of 4096
    length_function=len,
    add_start_index=True,  # Stores each chunk's start offset within its source Document as metadata["start_index"]
)

print("\n2.2 Splitting documents into chunks...")
chunks = text_splitter.split_documents(documents)

print(f"\n✓ Split {len(documents)} pages into {len(chunks)} chunks")
print("✓ Each chunk contains at most 4096 characters, with up to 410 characters of overlap between neighbouring chunks of the same page")
print(f"✓ Start index preserved in metadata for tracking")
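
# Illustrative sketch (not from the LangChain docs): a fixed-window splitter in plain
# Python that shows what chunk_size and chunk_overlap mean. It is a simplification;
# RecursiveCharacterTextSplitter additionally prefers to break on separators such as
# paragraph and line breaks. The helper name and the toy string are assumptions made
# for this example only.
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


demo_windows = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(f"Overlap demo: {demo_windows}")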

# Show chunk analysis
if chunks:
    chunk_sizes = [len(chunk.page_content) for chunk in chunks]
    print(f"\nChunk Analysis:")
    print(f"- Average chunk size: {sum(chunk_sizes) / len(chunk_sizes):.0f} characters")
    print(f"- Largest chunk: {max(chunk_sizes)} characters")
    print(f"- Smallest chunk: {min(chunk_sizes)} characters")

# ============================================================================
# STEP 3: EMBEDDINGS
# ============================================================================
print("\n" + "=" * 60)
print("STEP 3: EMBEDDINGS")
print("=" * 60)

print("""
WHAT ARE EMBEDDINGS? (LangChain Documentation):
Vector search is a common way to store and search over unstructured data. 
The idea is to store numeric vectors that are associated with the text. 
Given a query, we can embed it as a vector of the same dimension and use 
vector similarity metrics (such as cosine similarity) to identify related text.

HOW EMBEDDINGS WORK:
Embeddings typically represent text as a "dense" vector such that texts with 
similar meanings are geometrically close. This lets us retrieve relevant 
information just by passing in a question, without knowledge of any specific 
key-terms used in the document.
""")

print("\n3.1 Setting up Ollama Embeddings...")
print("Using nomic-embed-text model for local embedding generation")

embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)

print("✓ Embedding model configured")
print("✓ Text will be converted to numeric vectors for similarity search")
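
# Illustrative sketch: cosine similarity in plain Python. The 3-dimensional vectors
# below are hand-made stand-ins for real embeddings (an assumption for readability;
# vectors returned by an embedding model are much longer). The helper name is
# hypothetical and is not part of LangChain.
import math


def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)


toy_query_vec = [1.0, 0.0, 1.0]
toy_close_vec = [2.0, 0.0, 2.0]  # same direction as the query, different length
toy_far_vec = [0.0, 1.0, 0.0]    # orthogonal to the query
print(f"Cosine demo (same direction): {cosine_similarity(toy_query_vec, toy_close_vec):.2f}")
print(f"Cosine demo (orthogonal):     {cosine_similarity(toy_query_vec, toy_far_vec):.2f}")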

# ============================================================================
# STEP 4: VECTOR STORES
# ============================================================================
print("\n" + "=" * 60)
print("STEP 4: VECTOR STORES")
print("=" * 60)

print("""
WHAT ARE VECTOR STORES? (LangChain Documentation):
LangChain VectorStore objects contain methods for adding text and Document 
objects to the store, and querying them using various similarity metrics. 
They are often initialized with embedding models, which determine how text 
data is translated to numeric vectors.

CHROMA VECTOR STORE:
We're using Chroma, which can run in-memory for lightweight workloads or 
with persistence for production use. Some vector stores are hosted by 
providers and require credentials; others like Chroma can run locally.
""")

print("\n4.1 Creating Chroma Vector Store...")
print("- Collection name: pdf_collection")
print("- Storage: Local persistent directory")
print("- Embedding function: nomic-embed-text via Ollama")

vector_store = Chroma(
    collection_name="pdf_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)

print("✓ Chroma vector store created with local persistence")

print("\n4.2 Adding documents to vector store...")
print("Converting text chunks to embeddings and storing in vector database...")

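# The store persists to ./chroma_db, so re-running this cell would otherwise add the
# same chunks again and return duplicate hits. Assumption: splitting is deterministic,
# so position-based IDs let a repeated run overwrite existing entries instead.
chunk_ids = [f"{pdf_url}#chunk-{i}" for i in range(len(chunks))]
vector_store.add_documents(documents=chunks, ids=chunk_ids)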

print(f"✓ Added {len(chunks)} document chunks to vector store")
print("✓ Each chunk has been converted to a vector and stored")

# ============================================================================
# STEP 5: QUERYING THE VECTOR STORE
# ============================================================================
print("\n" + "=" * 60)
print("STEP 5: QUERYING THE VECTOR STORE")
print("=" * 60)

print("""
VECTOR STORE QUERYING (LangChain Documentation):
Once we've instantiated a VectorStore that contains documents, we can query it.
VectorStore includes methods for querying:
- Synchronously and asynchronously
- By string query and by vector  
- With and without returning similarity scores
- By similarity and maximum marginal relevance (MMR)
""")

print("\n5.1 Basic Similarity Search")
# Chroma uses L2 distance by default; cosine applies only if the collection is configured for it.
print("Finding the chunks whose embeddings are closest to the query...")

query = "What is the main topic and methodology of this research?"
results = vector_store.similarity_search(query, k=3)

print(f"\nQuery: '{query}'")
print(f"Retrieved {len(results)} most similar chunks:")

for i, doc in enumerate(results, 1):
    print(f"\n--- Result {i} ---")
    print(f"Content: {doc.page_content[:300]}...")
    print(f"Source: Page {doc.metadata.get('page', 'unknown')}")

print("\n5.2 Similarity Search with Scores")
print("Same search, but returning a score for each result...")

# With Chroma, the returned score is a distance, so lower values mean closer matches.
results_with_scores = vector_store.similarity_search_with_score(query, k=2)

for i, (doc, score) in enumerate(results_with_scores, 1):
    print(f"\n--- Result {i} (Distance: {score:.4f}, lower is closer) ---")
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Source: Page {doc.metadata.get('page', 'unknown')}")

print("\n5.3 Metadata Filtering")
print("Using metadata filters to search specific parts of the document...")

# First, let's see what metadata is available
print("\nAvailable metadata in our chunks:")
# Collect page numbers from the first 10 chunks. The set is built outside the
# `if` so the filtering examples below still run when no chunks were produced.
page_numbers = {chunk.metadata["page"] for chunk in chunks[:10] if "page" in chunk.metadata}
if chunks:
    print(f"Sample metadata: {chunks[0].metadata}")
    print(f"Available page numbers (sample): {sorted(page_numbers)[:5]}...")

print("\n5.3.1 Filter by Specific Page")
if page_numbers:
    target_page = sorted(list(page_numbers))[0]  # Use first available page
    page_results = vector_store.similarity_search(
        "methodology approach",
        k=2,
        filter={"page": target_page}
    )
    print(f"Searching only in Page {target_page}:")
    for i, doc in enumerate(page_results, 1):
        print(f"  Result {i}: Page {doc.metadata.get('page')} - {doc.page_content[:150]}...")

print("\n5.3.2 Filter by Page Range")
if len(page_numbers) > 1:
    # Restrict the search to the second sampled page and everything after it
    min_page = sorted(page_numbers)[1]
    range_results = vector_store.similarity_search(
        "results conclusions",
        k=3,
        filter={"page": {"$gte": min_page}}  # Pages >= min_page
    )
    print(f"Searching in pages >= {min_page}:")
    for i, doc in enumerate(range_results, 1):
        print(f"  Result {i}: Page {doc.metadata.get('page')} - {doc.page_content[:150]}...")

print("\n5.3.3 Multiple Metadata Filters")
# Complex filtering with multiple conditions
complex_results = vector_store.similarity_search(
    "research findings",
    k=2,
    filter={
        "$and": [
            {"page": {"$gte": 0}},  # Page 0 or higher
            {"source": {"$ne": ""}}  # Has a source
        ]
    }
)
print("Using complex filter (page >= 0 AND has source):")
for i, doc in enumerate(complex_results, 1):
    print(f"  Result {i}: Page {doc.metadata.get('page')} - {doc.page_content[:150]}...")
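
# Illustrative sketch of how the filter dictionaries above can be read, written as a
# plain-Python predicate over one metadata dict. This is a simplified assumption for
# teaching purposes, not Chroma's implementation: it covers only $and, $or, and six
# comparison operators, and it treats a missing key as "no match". The helper name
# and the sample metadata are hypothetical.
import operator

_FILTER_OPERATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def matches_filter(metadata, filter_expr):
    """Return True when metadata satisfies every condition in filter_expr."""
    for key, condition in filter_expr.items():
        if key == "$and":
            if not all(matches_filter(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches_filter(metadata, sub) for sub in condition):
                return False
        elif key not in metadata:
            return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"page": {"$gte": 3}}
            if not all(_FILTER_OPERATORS[op](metadata[key], operand) for op, operand in condition.items()):
                return False
        elif metadata[key] != condition:
            # Shorthand form, e.g. {"page": 3}, meaning equality
            return False
    return True


toy_metadata = {"page": 3, "source": "example.pdf"}
toy_filter = {"$and": [{"page": {"$gte": 0}}, {"source": {"$ne": ""}}]}
print(f"Filter demo: {matches_filter(toy_metadata, toy_filter)}")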

print("\n5.4 Search by Vector with Filtering")
print("Combining vector search with metadata filtering...")

# Generate embedding for a query and search with filter
query_embedding = embeddings.embed_query("experimental setup")
if page_numbers:
    target_page = sorted(list(page_numbers))[0]
    vector_results = vector_store.similarity_search_by_vector(
        embedding=query_embedding,
        k=2,
        filter={"page": target_page}
    )
    print(f"Vector search in Page {target_page}:")
    for i, doc in enumerate(vector_results, 1):
        print(f"  Result {i}: {doc.page_content[:150]}...")

# ============================================================================
# STEP 6: RETRIEVERS
# ============================================================================
print("\n" + "=" * 60)
print("STEP 6: RETRIEVERS") 
print("=" * 60)

print("""
WHAT ARE RETRIEVERS? (LangChain Documentation):
LangChain Retrievers are Runnables that implement a standard set of methods 
(e.g., synchronous and asynchronous invoke and batch operations). Although 
we can construct retrievers from vector stores, retrievers can interface 
with non-vector store sources of data as well (such as external APIs).

RETRIEVER TYPES:
VectorStoreRetriever supports search types:
- "similarity" (default): Returns most similar documents
- "mmr" (maximum marginal relevance): Balances similarity with diversity  
- "similarity_score_threshold": Filters by minimum similarity score
""")

print("\n6.1 Creating Different Types of Retrievers...")

# Similarity Retriever
similarity_retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# MMR Retriever
print("MMR balances similarity with diversity in retrieved results")
mmr_retriever = vector_store.as_retriever(
    search_type="mmr", 
    search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": 0.5}
)
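
# Illustrative sketch of the MMR idea in plain Python (a toy under stated assumptions,
# not LangChain's or Chroma's implementation): results are picked one at a time, and
# each candidate is scored as lambda_mult * relevance - (1 - lambda_mult) * redundancy,
# where redundancy is its highest similarity to anything already selected. The helper
# name and the 3-dimensional toy vectors are hypothetical.
import math


def mmr_select(query_vec, candidate_vecs, k, lambda_mult):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""

    def _cos(vec_a, vec_b):
        dot = sum(a * b for a, b in zip(vec_a, vec_b))
        return dot / (math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b)))

    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], float("-inf")
        for idx in remaining:
            relevance = _cos(query_vec, candidate_vecs[idx])
            redundancy = max((_cos(candidate_vecs[idx], candidate_vecs[s]) for s in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


toy_mmr_query = [1.0, 1.0, 0.0]
toy_mmr_candidates = [
    [1.0, 0.02, 0.0],  # relevant, but nearly identical to the next vector
    [1.0, 0.05, 0.0],  # the most relevant candidate
    [0.0, 1.0, 0.0],   # slightly less relevant, but points in a different direction
]
print(f"MMR demo, lambda_mult=1.0 (similarity only): {mmr_select(toy_mmr_query, toy_mmr_candidates, 2, 1.0)}")
print(f"MMR demo, lambda_mult=0.5 (adds diversity): {mmr_select(toy_mmr_query, toy_mmr_candidates, 2, 0.5)}")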

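# Score Threshold Retriever
# score_threshold is compared against relevance scores derived from the store's
# distance metric; if no chunk reaches 0.5, the retriever returns an empty list.
print("Score threshold retriever only returns documents above similarity threshold")
threshold_retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5, "k": 5}
)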

print("\n6.2 Retrievers with Metadata Filtering...")

# Get available page numbers for filtering
available_pages = set()
for chunk in chunks[:20]:  # Check sample of chunks
    if 'page' in chunk.metadata:
        available_pages.add(chunk.metadata['page'])

print(f"Available pages for filtering: {sorted(list(available_pages))[:5]}...")

print("\n6.2.1 Similarity Retriever with Page Filter")
if available_pages:
    target_page = sorted(list(available_pages))[0]
    filtered_similarity_retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={
            "k": 3,
            "filter": {"page": target_page}
        }
    )
    
    test_query = "research methodology"
    filtered_results = filtered_similarity_retriever.invoke(test_query)
    print(f"Filtered search (Page {target_page}) found {len(filtered_results)} documents")
    for i, doc in enumerate(filtered_results, 1):
        print(f"  Result {i}: Page {doc.metadata.get('page')} - {doc.page_content[:150]}...")

print("\n6.2.2 MMR Retriever with Page Range Filter")
if len(available_pages) > 1:
    # Restrict the search to the second sampled page and everything after it
    min_page = sorted(available_pages)[1]
    filtered_mmr_retriever = vector_store.as_retriever(
        search_type="mmr",
        search_kwargs={
            "k": 3,
            "fetch_k": 8,
            "lambda_mult": 0.7,
            "filter": {"page": {"$gte": min_page}}
        }
    )
    
    mmr_results = filtered_mmr_retriever.invoke("findings conclusions")
    print(f"MMR search (Pages >= {min_page}) found {len(mmr_results)} diverse documents")
    for i, doc in enumerate(mmr_results, 1):
        print(f"  Result {i}: Page {doc.metadata.get('page')} - {doc.page_content[:150]}...")

print("\n6.3 Advanced Filter Examples...")

print("\n6.3.1 Complex AND/OR Filtering")
# Example of complex filtering (if supported by Chroma version)
try:
    complex_filter_retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={
            "k": 2,
            "filter": {
                "$and": [
                    {"page": {"$gte": 0}},
                    {"source": {"$ne": ""}}
                ]
            }
        }
    )
    complex_results = complex_filter_retriever.invoke("experimental design")
    print(f"Complex filter search found {len(complex_results)} documents")
except Exception as e:
    print(f"Complex filter query failed (possibly unsupported in this Chroma version): {e}")

print("\n6.3.2 Custom Metadata-Based Retrieval")
# Show how to filter by different metadata fields
metadata_examples = [
    {"filter_name": "By Source", "filter": {"source": pdf_url}},
    {"filter_name": "First Few Pages", "filter": {"page": {"$lt": 3}}},
    {"filter_name": "Later Pages", "filter": {"page": {"$gte": 3}}},
]

for example in metadata_examples:
    try:
        custom_retriever = vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={
                "k": 2,
                "filter": example["filter"]
            }
        )
        custom_results = custom_retriever.invoke("data analysis")
        print(f"{example['filter_name']}: Found {len(custom_results)} documents")
    except Exception as e:
        print(f"{example['filter_name']}: Filter not supported - {str(e)}")

print("\n6.4 Testing All Retriever Types...")
test_query = "research findings and conclusions"

# Test all retrievers. Each call in the loop below is wrapped in try/except, so the
# score threshold retriever is listed directly instead of being invoked twice.
retrievers_to_test = [
    ("Basic Similarity", similarity_retriever),
    ("MMR (Diversity)", mmr_retriever),
    ("Score Threshold", threshold_retriever),
]

for name, retriever in retrievers_to_test:
    try:
        results = retriever.invoke(test_query)
        print(f"\n{name} Retriever: Found {len(results)} documents")
        if results:
            print(f"  First result: Page {results[0].metadata.get('page')} - {results[0].page_content[:100]}...")
    except Exception as e:
        print(f"{name} Retriever error: {str(e)}")

# ============================================================================
# STEP 7: RAG FOUNDATION
# ============================================================================
print("\n" + "=" * 60)
print("STEP 7: RAG APPLICATION FOUNDATION")
print("=" * 60)

print("""
RAG APPLICATIONS (LangChain Documentation):
Retrievers can easily be incorporated into more complex applications, such as 
retrieval-augmented generation (RAG) applications that combine a given question 
with retrieved context into a prompt for a LLM.

WHAT WE'VE BUILT:
We now have a complete semantic search engine that can:
1. Load and process PDF documents
2. Split them into meaningful chunks  
3. Convert chunks to vector embeddings
4. Store vectors in a searchable database
5. Retrieve relevant passages for any query
6. Support different retrieval strategies (similarity, MMR)
""")

print("\n7.1 Testing Complete Pipeline...")
final_query = "What are the key contributions of this paper?"
context_docs = similarity_retriever.invoke(final_query)

print(f"\nQuery: '{final_query}'")
print(f"✓ Retrieved {len(context_docs)} relevant document chunks")
print("✓ These chunks could now be sent to an LLM for answer generation")

# Show what would be sent to LLM
print(f"\nContext that would be sent to LLM:")
for i, doc in enumerate(context_docs[:2], 1):  # Show first 2 for brevity
    print(f"\nChunk {i}: {doc.page_content[:250]}...")
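
# Illustrative sketch of the "combine question and retrieved context into a prompt"
# step, in plain Python. The prompt wording, the helper name, and the toy passages
# are assumptions for this example; they do not come from the LangChain docs, and no
# LLM is called here.
def build_rag_prompt(question, passages):
    """Number the passages and place them ahead of the question in one prompt string."""
    context = "\n\n".join(f"[{i}] {passage}" for i, passage in enumerate(passages, 1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


demo_prompt = build_rag_prompt(
    "What does MMR balance?",
    ["MMR balances similarity with diversity.", "Chroma can run locally."],
)
print(f"\nPrompt demo:\n{demo_prompt}")
# In this notebook the same helper could be called as:
# build_rag_prompt(final_query, [doc.page_content for doc in context_docs])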

# ============================================================================
# SUMMARY AND PERSISTENCE
# ============================================================================
print("\n" + "=" * 80)
print("TUTORIAL COMPLETE - SEMANTIC SEARCH ENGINE BUILT!")
print("=" * 80)

print(f"""
WHAT WE ACCOMPLISHED:
✓ Loaded PDF document using PyPDFLoader ({len(documents)} pages)
✓ Split into {len(chunks)} chunks using RecursiveCharacterTextSplitter
✓ Generated embeddings using nomic-embed-text model
✓ Created persistent Chroma vector store
✓ Implemented similarity and MMR retrieval strategies  
✓ Built foundation for RAG applications

DATA PERSISTENCE:
✓ Vector store saved to: ./chroma_db
✓ Can be reloaded by creating Chroma with the same collection name, embedding function, and persist_directory
✓ Usable as the retrieval layer of a RAG application

NEXT STEPS:
- Connect to an LLM (like Ollama) to generate answers
- Implement conversation memory for chat applications  
- Tune the score threshold, chunk size, and overlap for your documents
- Scale to multiple documents and collections
""")

print("\nThis semantic search engine is now ready to power RAG applications!")

BUILDING A SEMANTIC SEARCH ENGINE WITH LANGCHAIN

OVERVIEW (from LangChain docs):
This tutorial demonstrates LangChain's document loader, embedding, and vector store 
abstractions. These are designed to support retrieval of data from databases and 
other sources for integration with LLM workflows, particularly for RAG applications.


STEP 1: DOCUMENTS AND DOCUMENT LOADERS

WHAT IS A DOCUMENT? (LangChain Documentation):
LangChain implements a Document abstraction with three attributes:
- page_content: a string representing the content
- metadata: a dict containing arbitrary metadata  
- id: (optional) a string identifier for the document

The metadata attribute can capture information about the source of the document,
its relationship to other documents, and other information. Note that an individual 
Document object often represents a chunk of a larger document.


1.1 Loading PDF Document...
Using PyPDFLoader which loads one Document object per PDF page.
For each page, we can access:
-

No relevant docs were retrieved using the relevance score threshold 0.5


Filtered search (Page 0) found 3 documents
  Result 1: Page 0 - A Survey on Large Language Models with some Insights
on their Capabilities and Limitations
Andrea Matarazzo
Expedia Group
Italy
a.matarazzo@gmail.com
...
  Result 2: Page 0 - A Survey on Large Language Models with some Insights
on their Capabilities and Limitations
Andrea Matarazzo
Expedia Group
Italy
a.matarazzo@gmail.com
...
  Result 3: Page 0 - A Survey on Large Language Models with some Insights
on their Capabilities and Limitations
Andrea Matarazzo
Expedia Group
Italy
a.matarazzo@gmail.com
...

6.2.2 MMR Retriever with Page Range Filter
MMR search (Pages >= 1) found 3 diverse documents
  Result 1: Page 86 - probability for a hypothesis as more evidence or information becomes available. Fundamentally, Bayesian
inference uses prior knowledge, in the form of...
  Result 2: Page 90 - Figure 40: A: Chain of thoughts (in blue) are intermediate reasoning steps towards a final answer.
The input of CoT prompting is a stack of 

No relevant docs were retrieved using the relevance score threshold 0.5



Basic Similarity Retriever: Found 4 documents
  First result: Page 86 - probability for a hypothesis as more evidence or information becomes available. Fundamentally, Bayes...

MMR (Diversity) Retriever: Found 3 documents
  First result: Page 86 - probability for a hypothesis as more evidence or information becomes available. Fundamentally, Bayes...

Score Threshold Retriever: Found 0 documents

STEP 7: RAG APPLICATION FOUNDATION

RAG APPLICATIONS (LangChain Documentation):
Retrievers can easily be incorporated into more complex applications, such as 
retrieval-augmented generation (RAG) applications that combine a given question 
with retrieved context into a prompt for a LLM.

WHAT WE'VE BUILT:
We now have a complete semantic search engine that can:
1. Load and process PDF documents
2. Split them into meaningful chunks  
3. Convert chunks to vector embeddings
4. Store vectors in a searchable database
5. Retrieve relevant passages for any query
6. Support different retrieval strategi