# 2 – LangChain PDF RAG with Chunking Strategies

**Learning Goals:**
- Use LangChain's document loaders and text splitters
- Persist embeddings in ChromaDB
- Compare chunking strategies: recursive, fixed, sentence-based
- Build a complete RAG pipeline with LangChain

**What we'll build:**
1. Load PDFs using LangChain loaders
2. Test chunking strategies (RecursiveCharacterTextSplitter and CharacterTextSplitter; the sentence-based splitter is only a placeholder in this notebook)
3. Embed chunks using configurable models (OpenAI/Cohere/SBERT)
4. Store in ChromaDB with metadata
5. Query with retrieval + QA chain

**Persistence:**
- `./artifacts/chroma/langchain_recursive/`
- `./artifacts/chroma/langchain_fixed/`
- `./artifacts/chroma/langchain_sentence/` (created only once the sentence strategy is implemented)
- `./artifacts/manifests/langchain_{strategy}.json`


In [1]:
#  Global Config & Services (using centralized modules)

import json
import sys
from pathlib import Path
from datetime import datetime
from dotenv import load_dotenv

# Add parent directory to path and change to project root
import os

# Get the notebook's current directory and find project root
notebook_dir = Path.cwd()
if notebook_dir.name == "notebooks":
    project_root = notebook_dir.parent
else:
    project_root = notebook_dir

# Change to project root and add to path
os.chdir(project_root)
sys.path.insert(0, str(project_root))

print(f" Working directory: {os.getcwd()}")

from src.services.llm_services import (
    load_config,
    get_llm,
    get_text_embeddings,
    validate_api_keys,
    print_config_summary
)

# Load environment variables
load_dotenv()

# Load configuration from config.yaml (now we're in project root)
config = load_config("src/config/config.yaml")

# Validate API keys
validate_api_keys(config, verbose=True)

# Print summary
print_config_summary(config)


 Working directory: d:\Courses\_Zuu Crew\AI Engineer Essentials\Programming\Week 03
✅ Config loaded:
  LLM: openai / gpt-4o-mini
  Embeddings: sbert / sentence-transformers/all-MiniLM-L6-v2
  Temperature: 0.2
  Artifacts: ./artifacts


In [2]:
import sentence_transformers
# Initialize LLM and Embeddings using factories from llm_services
llm = get_llm(config)
embeddings = get_text_embeddings(config)

print(f" LLM: {config['llm_provider']} / {config['llm_model']}")
print(f" Embeddings: {config['text_emb_provider']} / {config['text_emb_model']}")

# Verify API key with test completion
print("\n Testing LLM API connection...")
try:
    test_response = llm.invoke("Say 'API working!' if you can read this.")
    test_msg = test_response.content if hasattr(test_response, 'content') else str(test_response)
    print(f" LLM API verified: {test_msg[:50]}")
except Exception as e:
    print(f" LLM API test failed: {e}")
    print("  Please check your .env file and API key configuration.")





 LLM: openai / gpt-4o-mini
 Embeddings: sbert / sentence-transformers/all-MiniLM-L6-v2

 Testing LLM API connection...
 LLM API verified: API working!


In [8]:
from langchain_community.document_loaders import (
    PyPDFLoader,
    DirectoryLoader,
    TextLoader,
)

from langchain_core.documents import Document

pdf_dir = Path(config["data_root"]) / "pdfs"
pdf_dir.mkdir(parents=True, exist_ok=True)

# Try loading PDFs
pdf_files = list(pdf_dir.glob("*.pdf"))

if len(pdf_files) == 0:
    print("  No PDFs found. Creating sample text document...")
    
    sample_content = """# Common Skin Diseases and Conditions

Understanding skin diseases is essential for proper care and treatment. The skin is the largest organ and serves as a protective barrier.

## Inflammatory Skin Conditions

### Eczema (Atopic Dermatitis)
Eczema is a chronic inflammatory condition marked by itchy, dry, and red skin. It affects 10-20% of children and often has a genetic component. Treatment includes daily moisturizing, avoiding triggers, and topical anti-inflammatory medications.

### Psoriasis
Psoriasis is an autoimmune condition causing thick, silvery scales and red plaques. Treatment options include topical corticosteroids, phototherapy, and systemic medications for moderate to severe cases.

## Fungal Infections

### Ringworm (Tinea)
Fungal infections cause circular, red, scaly patches. Common types include athlete's foot (tinea pedis) and jock itch (tinea cruris). Treatment involves topical or oral antifungal medications.

### Treatment Approach
Keep affected areas dry and clean. Use antifungal creams like terbinafine for 2-4 weeks as directed.

## Bacterial Infections

### Impetigo
A common bacterial infection in children causing honey-colored crusts. Requires prompt medical assessment and antibiotic treatment.

### Cellulitis
A deep bacterial infection causing swelling, redness, and warmth. Requires oral or IV antibiotics.

## General Skin Care

Daily moisturizing with thick creams helps maintain the skin barrier. Use broad-spectrum SPF 30+ sunscreen to protect against UV damage. Identify and avoid personal triggers like harsh detergents, allergens, and stress.
"""
    
    sample_file = pdf_dir / "skin_diseases_intro.txt"
    sample_file.write_text(sample_content)
    
    # Wrap the sample text in a Document directly (no loader is needed for the fallback)
    documents = [Document(page_content=sample_content, metadata={"source": "skin_diseases_intro.txt", "page": 0})]
    
else:
    # Load PDFs
    documents = []
    for pdf_path in pdf_files:
        loader = PyPDFLoader(str(pdf_path))
        docs = loader.load()
        documents.extend(docs)

print(f"✅ Loaded {len(documents)} document pages")
print(f"  Total characters: {sum(len(d.page_content) for d in documents):,}")


✅ Loaded 198 document pages
  Total characters: 127,894


---

## Step 2: Chunking Strategies

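This step defines three chunking strategies. Only the first two are run in this notebook; the sentence branch of `get_splitter` is a placeholder.

1. **Recursive** - Tries paragraph breaks first, then line breaks, then sentence and clause boundaries, and finally individual characters
2. **Fixed** - Fixed character length with overlap
3. **Sentence** - Splits on sentence boundaries (intended to use langchain-experimental; not implemented here)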
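
To make the recursive idea concrete, here is a minimal pure-Python sketch. It is a simplification and not how `RecursiveCharacterTextSplitter` is implemented: it ignores overlap and drops the separator at chunk boundaries. The function name `recursive_split` and the sample text are assumptions made for illustration.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


text = (
    "Eczema is itchy. It is chronic.\n\n"
    "Psoriasis causes plaques. It is autoimmune. Treatment varies by severity."
)
demo = recursive_split(text, chunk_size=40, separators=["\n\n", ". ", " ", ""])
for chunk in demo:
    print(repr(chunk))
```

The first paragraph fits within 40 characters and stays whole, while the second is too long and falls through to the sentence-level separator. That fallback is what lets the recursive strategy follow document structure where it can.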


In [10]:
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
)

def get_splitter(strategy: str, chunk_size: int = 800, chunk_overlap: int = 150):
    """
    Return a text splitter based on the specified strategy.
    
    Args:
        strategy: One of "recursive", "fixed", or "sentence"
        chunk_size: Maximum characters per chunk
        chunk_overlap: Overlap between consecutive chunks
        
    Returns:
        A LangChain text splitter instance
    """
    if strategy == "recursive": 
        return RecursiveCharacterTextSplitter(
                                            chunk_size=chunk_size,
                                            chunk_overlap=chunk_overlap,
                                            separators=["\n\n", "\n", ". ", ", ", " ", ""]
                                            )

    elif strategy == "fixed":
        return CharacterTextSplitter(
                                    chunk_size=chunk_size,
                                    chunk_overlap=chunk_overlap,
                                    separator=" "
                                    )

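    elif strategy == "sentence":
        # Placeholder: no sentence-based splitter is wired in yet.
        raise NotImplementedError("The 'sentence' strategy is not implemented yet")

    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")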


# Test the implemented strategies ("sentence" is excluded until its splitter exists)
strategies = ["recursive", "fixed"]
split_results = {}

for strategy in strategies:
    splitter = get_splitter(strategy)
    chunks = splitter.split_documents(documents)
    
    # Add strategy to metadata
    for chunk in chunks:
        chunk.metadata["splitter"] = strategy
    
    split_results[strategy] = chunks
    print(f"✅ {strategy:10s}: {len(chunks):4d} chunks")

print(f"\nExample chunk (recursive):")
print(f"  Length: {len(split_results['recursive'][0].page_content)} chars")
print(f"  Content: {split_results['recursive'][0].page_content[:150]}...")
print(f"  Metadata: {split_results['recursive'][0].metadata}")


✅ recursive :  280 chunks
✅ fixed     :  275 chunks

Example chunk (recursive):
  Length: 183 chars
  Content: Common Skin Disease
Surangkana Veeranawin, MD
MSc, Queen Mary University of London
Mmed, University of Sydney
DipDerm RCPS(Glas.)
Diplomate American B...
  Metadata: {'producer': 'Microsoft® PowerPoint® 2016', 'creator': 'Microsoft® PowerPoint® 2016', 'creationdate': '2019-05-29T13:06:19+07:00', 'title': 'Common Skin Disease', 'author': 'Surangkana Veeranawin', 'moddate': '2019-05-29T13:06:19+07:00', 'source': 'data\\pdfs\\3-Lecture-Common-Skin-Disease-2019.pdf', 'total_pages': 119, 'page': 0, 'page_label': '1', 'splitter': 'recursive'}
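
The chunk counts and lengths above are governed by `chunk_size` and `chunk_overlap`. As a rough illustration of how the two interact, the sketch below slices a string into overlapping windows. It is an assumption-laden toy, not the `CharacterTextSplitter` algorithm (which splits on a separator and merges the pieces), and `fixed_chunks` is a hypothetical helper name.

```python
def fixed_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 150) -> list[str]:
    """Slice text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = fixed_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap yields more chunks for the same text and repeats more content across neighbouring chunks.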


---

## Step 3: Build ChromaDB Collections

For each chunking strategy, we'll create a separate ChromaDB collection with embeddings.


In [11]:
from langchain_chroma import Chroma

chroma_root = Path(config["artifacts_root"]) / "chroma"
chroma_root.mkdir(parents=True, exist_ok=True)

vectorstores = {}

for strategy in strategies:
    collection_name = f"langchain_{strategy}"
    persist_dir = str(chroma_root / collection_name)
    
    print(f"Building collection: {collection_name}...")
    
    vectorstore = Chroma.from_documents(
                                        documents=split_results[strategy],
                                        embedding=embeddings,
                                        collection_name=collection_name,
                                        persist_directory=persist_dir
                                        )
    
    print(f"  ✅ Persisted to {persist_dir}")
    print(f"  ✅ {len(split_results[strategy])} chunks embedded")

    vectorstores[strategy] = vectorstore

print(f"\n✅ All collections built!")


Building collection: langchain_recursive...
  ✅ Persisted to artifacts\chroma\langchain_recursive
  ✅ 280 chunks embedded
Building collection: langchain_fixed...
  ✅ Persisted to artifacts\chroma\langchain_fixed
  ✅ 275 chunks embedded

✅ All collections built!


### Save Manifests

Each collection gets a manifest tracking build parameters.


In [12]:
manifests_dir = Path(config["artifacts_root"]) / "manifests"
manifests_dir.mkdir(parents=True, exist_ok=True)

for strategy in strategies:
    manifest = {
        "collection_name": f"langchain_{strategy}",
        "framework": "langchain",
        "strategy": strategy,
        "embedding_model": config["text_emb_model"],
        "embedding_provider": config["text_emb_provider"],
        "normalize": config["normalize_embeddings"],
        "chunk_size": 800,
        "chunk_overlap": 150,
        "num_chunks": len(split_results[strategy]),
        "num_documents": len(documents),
        "created_at": datetime.now().isoformat(),
    }
    
    manifest_path = manifests_dir / f"langchain_{strategy}.json"
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    
    print(f"✅ Manifest saved: {manifest_path.name}")


✅ Manifest saved: langchain_recursive.json
✅ Manifest saved: langchain_fixed.json


---

## Step 4: Retrieval + QA Chain

We'll build a simple RAG chain with LangChain Expression Language (LCEL), following the "stuff" approach: all retrieved chunks are concatenated into a single prompt.
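
As a library-free illustration of that concatenation step, the sketch below assembles a prompt from retrieved chunks. `Doc` is a hypothetical stand-in for a LangChain `Document`, and the shortened `PROMPT` template is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


PROMPT = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


def format_docs(docs: list[Doc]) -> str:
    # Join the retrieved chunks with blank lines so the model sees clear boundaries.
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question: str, docs: list[Doc]) -> str:
    # "Stuffing": every retrieved chunk goes into one prompt, with no summarising step.
    return PROMPT.format(context=format_docs(docs), question=question)


retrieved = [
    Doc("Eczema is a chronic inflammatory condition.", {"page": 1}),
    Doc("Treatment includes daily moisturizing.", {"page": 2}),
]
prompt_text = build_prompt("How is eczema treated?", retrieved)
print(prompt_text)
```

Because every chunk lands in one prompt, the prompt grows with `top_k` times the chunk size, which is worth keeping in mind when tuning either value.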


In [17]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# RAG prompt template
rag_prompt_template = """You are a concise assistant for healthcare. Use only the provided context to answer the question.
Keep answers under 5 sentences. Adapt your answer based on the conditions below.

1. If the question is about skin diseases and the information is insufficient, say "I do not have info, please reach out to our hospital".
2. If the question is irrelevant to skin diseases, say "I can't provide info"

Context:
{context}

Question: {question}

Answer:"""

RAG_PROMPT = PromptTemplate(
    template=rag_prompt_template,
    input_variables=["context", "question"]
)

def build_qa_chain(strategy: str, top_k: int = 3):
    """
    Build a RAG chain for a given chunking strategy using LangChain LCEL.
    
    Args:
        strategy: The chunking strategy name (e.g., "recursive", "fixed", "sentence")
        top_k: Number of chunks to retrieve
        
    Returns:
        A RAG chain ready to answer questions
    """
    vectorstore = vectorstores[strategy]
    retriever = vectorstore.as_retriever(search_kwargs={"k" : top_k})
    
    # Build RAG chain using LCEL
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)
    
    # Chain that takes a string question and returns formatted answer
    qa_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | RAG_PROMPT
        | llm
        | StrOutputParser()
    )
    
    # Create a wrapper that also returns source documents for comparison
    class RAGChainWithSources:
        def __init__(self, chain, retriever):
            self.chain = chain
            self.retriever = retriever
        
        def invoke(self, query_dict):
            # Accept either {"query": "..."} or a plain question string
            query = query_dict.get("query", query_dict) if isinstance(query_dict, dict) else query_dict
            # Invoke chain with query string
            result = self.chain.invoke(query)
            # Retrieve again to expose the sources; this repeats the search the chain already ran
            source_docs = self.retriever.invoke(query)
            return {"result": result, "source_documents": source_docs}
    
    return RAGChainWithSources(qa_chain, retriever)
    
# Build chains for all strategies
qa_chains = {strategy: build_qa_chain(strategy) for strategy in strategies}

print(" QA chains ready for all strategies")


 QA chains ready for all strategies


---

## Interactive Demo: Compare Chunking Strategies

Choose a splitter and ask questions to see how different chunking approaches affect retrieval.
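
One hypothetical way to quantify the comparison is to measure how many (source, page) pairs two strategies retrieve in common for the same query. The sketch below uses a Jaccard score over placeholder hits. The file names and page numbers are illustrative only, not results from this notebook, and `jaccard` and `retrieved_keys` are helper names introduced here.

```python
def jaccard(a: set, b: set) -> float:
    """Share of items the two sets have in common (1.0 = identical, 0.0 = disjoint)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def retrieved_keys(source_metadata: list[dict]) -> set:
    # Identify each retrieved chunk by (source, page); chunk text differs per strategy.
    return {(meta["source"], meta["page"]) for meta in source_metadata}


# Placeholder metadata for the top-3 hits of two strategies (illustrative only).
recursive_hits = [
    {"source": "a.pdf", "page": 47},
    {"source": "b.pdf", "page": 109},
    {"source": "b.pdf", "page": 116},
]
fixed_hits = [
    {"source": "a.pdf", "page": 47},
    {"source": "b.pdf", "page": 109},
    {"source": "c.pdf", "page": 12},
]

overlap = jaccard(retrieved_keys(recursive_hits), retrieved_keys(fixed_hits))
print(f"Page overlap between strategies: {overlap:.2f}")
```

In the notebook itself, the metadata dictionaries would come from `doc.metadata` for each entry in `result["source_documents"]`.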


In [18]:
# Example queries about dermatology
test_queries = [
    "What causes eczema and atopic dermatitis?",
    "How do you treat fungal infections like ringworm?",
    "What are the recommended treatments for psoriasis?",
]

selected_strategy = "recursive"  # Change to "fixed" to compare ("sentence" has no chain built)

print(f" Testing strategy: {selected_strategy}\n")
print("=" * 80)

for query in test_queries:
    print(f"\n Query: {query}\n")
    
    qa_chain = qa_chains[selected_strategy]
    result = qa_chain.invoke({"query": query})
    
    print(" Retrieved chunks:")
    for i, doc in enumerate(result["source_documents"], 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        print(f"  [{i}] {source} (page {page})")
        print(f"      {doc.page_content[:120]}...\n")
    
    print(f" Answer:\n{result['result']}\n")
    print("=" * 80)


 Testing strategy: recursive


 Query: What causes eczema and atopic dermatitis?

 Retrieved chunks:
  [1] data\pdfs\Derm_Handbook_3rd-Edition-_Nov_2020-FINAL.pdf (page 47)
      Dermatology: Handbook for medical students & junior doctors  
   British Association of Dermatologists 47 
Atopic eczema...

  [2] data\pdfs\3-Lecture-Common-Skin-Disease-2019.pdf (page 109)
      Atopic Dermatitis
 Chronic inflammatory dermatosis affecting 10%-20% of 
children (esp. infants and young children)
 E...

  [3] data\pdfs\3-Lecture-Common-Skin-Disease-2019.pdf (page 116)
      Atopic Dermatitis
 Minor, less specific feature: 
- xerosis, ichthyosis, palmar hyperlinearity, keratosis pilaris. 
Imm...

 Answer:
Eczema and atopic dermatitis are caused by a combination of factors that are not fully understood. A positive family history of atopy (eczema, asthma, allergic rhinitis) is often present, and there may be a primary genetic defect in skin barrier function, specifically involving the protein fi

---

## Summary

**What we learned:**
- ✅ LangChain provides flexible document loaders and text splitters
- ✅ Chunking strategy affects retrieval quality
- ✅ ChromaDB persists embeddings efficiently
- ✅ An LCEL chain (retriever, prompt, LLM, output parser) keeps the RAG pipeline compact

**Chunking trade-offs:**
- **Recursive**: Respects document structure (paragraphs, sentences)
- **Fixed**: Simple, predictable chunk sizes
- **Sentence**: Preserves sentence boundaries but produces variable-length chunks (not built in this run)
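
The notebook leaves the sentence strategy without a working splitter. As a sketch of what such a strategy could look like, the example below packs whole sentences into chunks using a naive regular expression. This is an assumption for illustration, not a LangChain API: the boundary rule will mis-split abbreviations such as "e.g.", and `sentence_chunks` is a hypothetical name.

```python
import re


def sentence_chunks(text: str, chunk_size: int = 800) -> list[str]:
    """Pack whole sentences into chunks of at most chunk_size characters."""
    # Naive boundary rule: split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            current = candidate  # a single over-long sentence is kept whole
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo_text = "Eczema is itchy. It is chronic! Is it genetic? Often, yes."
demo_sentences = sentence_chunks(demo_text, chunk_size=35)
for chunk in demo_sentences:
    print(repr(chunk))
```

No chunk ends mid-sentence, but lengths vary with the text, and a sentence longer than `chunk_size` is kept intact rather than cut.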

**Rebuild triggers:**
- Change embedding model/normalization → rebuild collections
- Change chunking strategy → new collection name
- Change dataset → re-embed and rebuild
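
A saved manifest can drive the rebuild decision. The sketch below compares the current settings against a manifest on disk and reports whether a rebuild is needed. It assumes the manifest keys written earlier in this notebook; `needs_rebuild`, `REBUILD_KEYS`, and the sample `normalize` value are hypothetical. Detecting a changed dataset would need an extra signal such as a file hash, which is not covered here.

```python
import json
import tempfile
from pathlib import Path

# Parameters whose change invalidates the stored vectors (assumed manifest keys).
REBUILD_KEYS = (
    "strategy", "embedding_model", "embedding_provider",
    "normalize", "chunk_size", "chunk_overlap",
)


def needs_rebuild(manifest_path: Path, current: dict) -> bool:
    """Return True when no manifest exists or any tracked parameter differs."""
    if not manifest_path.exists():
        return True
    saved = json.loads(manifest_path.read_text(encoding="utf-8"))
    return any(saved.get(key) != current.get(key) for key in REBUILD_KEYS)


settings = {
    "strategy": "recursive",
    "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
    "embedding_provider": "sbert",
    "normalize": True,  # placeholder value for the example
    "chunk_size": 800,
    "chunk_overlap": 150,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "langchain_recursive.json"
    first_run = needs_rebuild(path, settings)          # no manifest yet
    path.write_text(json.dumps(settings), encoding="utf-8")
    unchanged = needs_rebuild(path, settings)          # same settings
    changed = needs_rebuild(path, {**settings, "chunk_size": 500})

print(first_run, unchanged, changed)
```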

**Artifacts:**
- `./artifacts/chroma/langchain_recursive/`
- `./artifacts/chroma/langchain_fixed/`
- `./artifacts/chroma/langchain_sentence/` (created only once the sentence strategy is implemented)
- `./artifacts/manifests/langchain_{strategy}.json`
