[title]
RAG Episode 2: Chat With Real PDFs and Remember Every Question

[narration]
In the last episode, you built a RAG system from scratch. You learned the five-step pipeline: Load, Split, Embed, Store, Retrieve. You asked questions about a company knowledge base and got accurate answers with sources.

But it had two big limitations.

First, we used hardcoded text. In the real world, you have PDFs, text files, web pages, Word documents. You need to load real files.

Second, our system had no memory. Every question was independent. You couldn't ask a follow-up like, 'Tell me more about that' or 'What did you just say about the pricing?' Real chatbots remember the conversation.

Today, we fix both.

By the end of this episode, you'll have a RAG system that loads real PDF and text files, remembers your entire conversation, handles follow-up questions naturally, and shows you how relevant each result is with similarity scores.

Same stack. Groq for the language model. HuggingFace for embeddings. FAISS — pronounced 'face' — for vector search. All free.

Let's level up.

[display]
## What's New in Episode 2

**Upgrading from Episode 1:**
```
Ep 1: Hardcoded text     --> Ep 2: Real PDF & text files
Ep 1: No memory          --> Ep 2: Conversational RAG
Ep 1: Basic retrieval    --> Ep 2: Scored similarity search
Ep 1: Single questions   --> Ep 2: Follow-up questions work
```

**Tech Stack (same, all free):**
```
LangChain           --> Framework
Groq (Llama 3.3)    --> LLM (free)
HuggingFace         --> Embeddings (free, local)
FAISS               --> Vector store (free, local)
```

**Prerequisites:** RAG Episode 1

[title]
Setup — Getting Everything Ready

[narration]
Quick setup. Same packages as Episode 1, plus py-PDF for reading PDF files. If you've already installed everything from the last episode, this will finish in seconds.

Then we load our Groq API key and initialize the language model and embedding model. Same pattern you already know.
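Since the key check is the first thing that can go wrong, here is a minimal, stdlib-only sketch of failing fast when a key is missing and masking it before printing. The helper names `require_env` and `mask_secret` are hypothetical and not part of LangChain or Groq; the sample key is made up.

```python
import os


def require_env(name: str) -> str:
    """Return the environment variable's value, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Add it to your .env file.")
    return value


def mask_secret(secret: str, visible: int = 4) -> str:
    """Keep the first `visible` characters and replace the rest with asterisks."""
    visible = max(0, min(visible, len(secret)))
    return secret[:visible] + "*" * (len(secret) - visible)


# Made-up key, used only to show the masking
print(mask_secret("gsk_example_key_1234"))
```

Showing only a short prefix is usually enough to confirm which key was loaded without exposing it in saved notebook output.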

[display]
## Setup

**New package:**
```
pypdf --> Read PDF files
```

**Everything else — same as Episode 1**

In [1]:
"""[narration]
Same install as Episode 1, plus py-PDF for PDF loading. The quiet and upgrade flags keep things clean.

Then we load our API key and initialize both models. Language model at temperature zero for factual answers. Embedding model running locally on your CPU.
"""

!pip install -qU langchain langchain-core langchain-groq langchain-huggingface langchain-community faiss-cpu sentence-transformers python-dotenv pypdf

import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings

load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")

if not groq_api_key:
    print("❌ API key not found! Create .env with: GROQ_API_KEY=your_key")
else:
    print(f"✅ Groq API key loaded: {groq_api_key[:12]}...")

llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0,
    groq_api_key=groq_api_key
)

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"}
)

print()
print("✅ LLM ready (Groq — Llama 3.3)")
print("✅ Embeddings ready (HuggingFace — local)")
print()
print("Let's load some real documents!")




✅ Groq API key loaded: gsk_Q08MI36d...

✅ LLM ready (Groq — Llama 3.3)
✅ Embeddings ready (HuggingFace — local)

Let's load some real documents!


[title]
Creating Real Documents to Work With

[narration]
Before we load files, we need files to load. I'm going to create realistic documents right here in the notebook — a PDF and multiple text files — so you can see the entire flow without needing to download anything.

In your real projects, you'd skip this step. You already have the PDFs and documents. But for learning, this keeps everything self-contained and reproducible.

I'm creating a startup's documentation. An employee handbook as a PDF. A technical architecture document as a text file. And a product roadmap as another text file. Three files, two formats, realistic content.

[display]
## Creating Sample Documents

**Why create them in the notebook?**
- Self-contained — no downloads needed
- Reproducible — anyone can run this
- You see the exact content being loaded

**Documents we'll create:**
```
employee_handbook.pdf     --> HR policies, benefits
tech_architecture.txt     --> System design, stack
product_roadmap.txt       --> Features, timeline
```

**In YOUR projects:** Skip this step — use your own files!

In [2]:
"""[narration]
First, I'm creating a PDF using a lightweight library called F-PDF. This generates a real PDF file that our loader will read just like any PDF you'd get from HR or legal.

The employee handbook has sections on leave policy, remote work, compensation, and professional development. Each section is detailed enough that our RAG system has real content to search through.
"""

!pip install -q fpdf

from fpdf import FPDF
import os

os.makedirs("company_docs", exist_ok=True)

pdf = FPDF()
pdf.set_auto_page_break(auto=True, margin=15)

# Page 1 - Leave & Remote Work
pdf.add_page()
pdf.set_font("Arial", "B", 16)
pdf.cell(0, 10, "NovaTech Employee Handbook 2026", ln=True, align="C")
pdf.ln(10)

pdf.set_font("Arial", "B", 13)
pdf.cell(0, 8, "Section 1: Leave Policy", ln=True)
pdf.set_font("Arial", "", 11)
pdf.multi_cell(0, 6, """All full-time employees at NovaTech receive 28 days of paid time off per year, which includes 20 days of annual leave, 5 days of sick leave, and 3 days of personal leave. Annual leave accrues monthly at 1.67 days per month. Unused annual leave up to 5 days may be carried forward to the next calendar year, but must be used by March 31st or it will expire. Employees must submit leave requests at least 5 business days in advance for planned leave. Emergency or sick leave can be reported within 24 hours retroactively. Leave requests are submitted through the NovaTech HR Portal and require manager approval. During peak business periods (March and September), leave requests may be subject to additional review.""")

pdf.ln(5)
pdf.set_font("Arial", "B", 13)
pdf.cell(0, 8, "Section 2: Remote Work Policy", ln=True)
pdf.set_font("Arial", "", 11)
pdf.multi_cell(0, 6, """NovaTech operates a hybrid work model. Employees may work remotely up to 3 days per week. Tuesday and Thursday are designated as in-office collaboration days for all teams. Remote work requests must be logged in the HR Portal by 9 AM on the remote day. Employees in their first 60 days of employment must work from the office full-time to complete onboarding. Managers may require in-office attendance for critical project milestones with at least 48 hours notice. NovaTech provides a one-time home office setup allowance of Rs 25000 for all eligible employees. Internet reimbursement of Rs 1000 per month is provided for employees who work remotely at least 2 days per week.""")

# Page 2 - Compensation & Learning
pdf.add_page()
pdf.set_font("Arial", "B", 13)
pdf.cell(0, 8, "Section 3: Compensation & Benefits", ln=True)
pdf.set_font("Arial", "", 11)
pdf.multi_cell(0, 6, """Salary reviews are conducted annually in April. Performance bonuses are paid in June based on the previous fiscal year performance. The bonus pool ranges from 10% to 25% of annual base salary depending on individual and company performance. NovaTech offers comprehensive health insurance covering employees and up to 2 dependents. The insurance plan includes medical, dental, and vision coverage with a Rs 500 deductible. A company-matched retirement fund contribution of 12% of base salary is provided. Employees become fully vested after 3 years of continuous employment. Stock options are available for employees at the Senior Engineer level and above, vesting over 4 years with a 1-year cliff.""")

pdf.ln(5)
pdf.set_font("Arial", "B", 13)
pdf.cell(0, 8, "Section 4: Professional Development", ln=True)
pdf.set_font("Arial", "", 11)
pdf.multi_cell(0, 6, """NovaTech invests heavily in employee growth. Each employee receives an annual learning budget of Rs 75000 for courses, certifications, conferences, and books. Unused learning budget does not carry over. Additionally, employees can dedicate every Friday afternoon (2 PM onwards) to self-directed learning projects. The company sponsors one major tech conference per year for each team. Internal knowledge sharing sessions are held every Wednesday from 12 PM to 1 PM. Employees pursuing relevant masters degrees or professional certifications receive a 50% tuition reimbursement up to Rs 200000 per year.""")

pdf.output("company_docs/employee_handbook.pdf")
print("✅ Created: company_docs/employee_handbook.pdf (2 pages)")

✅ Created: company_docs/employee_handbook.pdf (2 pages)





In [3]:
"""[narration]
Now two text files. The technical architecture document describes the system design, databases, deployment strategy, and monitoring setup. The product roadmap covers features planned for the next three quarters.

These are the kinds of documents that engineers and product managers actually work with every day.
"""

tech_doc = """NovaTech Technical Architecture Document
Last Updated: January 2026

1. SYSTEM OVERVIEW
NovaTech's platform is a microservices-based SaaS application serving over 50000 active users. The system handles approximately 2 million API requests per day with a 99.95% uptime SLA. Average response time is under 200 milliseconds for core APIs.

2. BACKEND ARCHITECTURE
The backend consists of 12 microservices built with Python 3.12 and FastAPI. Each service is independently deployable and communicates via gRPC for internal calls and REST for external APIs. The authentication service uses OAuth 2.0 with JWT tokens. Rate limiting is enforced at 1000 requests per minute per API key for standard tier and 5000 for premium tier.

3. DATABASE LAYER
Primary database is PostgreSQL 16 running on AWS RDS with read replicas in two availability zones. Redis 7.2 is used for caching with a 15-minute TTL for most queries. MongoDB is used specifically for the activity logging service due to its flexible schema. Database migrations are managed through Alembic and are required to be backwards-compatible for zero-downtime deployments.

4. FRONTEND
The frontend is a single-page application built with React 18 and TypeScript. State management uses Zustand instead of Redux for simplicity. The design system is built on top of Radix UI primitives with custom Tailwind CSS styling. The frontend is served via CloudFront CDN with edge locations in 8 countries.

5. DEPLOYMENT & INFRASTRUCTURE
All services are containerized with Docker and orchestrated via Kubernetes on AWS EKS. The CI/CD pipeline runs on GitHub Actions with the following stages: lint, unit test, integration test, security scan, build, deploy to staging, automated E2E tests, and production deploy with canary rollout. Production deployments happen every Tuesday and Thursday. Hotfixes follow an expedited pipeline with senior engineer approval. Infrastructure is managed as code using Terraform.

6. MONITORING & OBSERVABILITY
Application metrics are collected via Prometheus and visualized in Grafana dashboards. Distributed tracing uses OpenTelemetry with Jaeger as the backend. Log aggregation is handled by the ELK stack (Elasticsearch, Logstash, Kibana). PagerDuty handles alerting with a tiered escalation policy: P1 issues page the on-call engineer immediately, P2 issues within 15 minutes, P3 issues create Jira tickets for next sprint.

7. SECURITY
All data is encrypted at rest using AES-256 and in transit using TLS 1.3. API keys are rotated every 90 days. Penetration testing is conducted quarterly by an external firm. SOC 2 Type II compliance is maintained and audited annually. All code changes require security review for services handling PII or payment data.
"""

with open("company_docs/tech_architecture.txt", "w") as f:
    f.write(tech_doc)
print("✅ Created: company_docs/tech_architecture.txt")

roadmap_doc = """NovaTech Product Roadmap 2026
Status: Approved by Product Council

Q1 2026 (January - March): Foundation Phase
- AI-powered search: Replace keyword search with semantic vector search across all platform content. Expected to improve search relevance by 40%. Uses the same embedding technology as our RAG features. Budget: Rs 1500000. Status: In Progress.
- Dashboard redesign: Complete overhaul of the analytics dashboard with real-time data streaming. New charts powered by D3.js. Custom date range filters and export to PDF. Budget: Rs 800000. Status: In Progress.
- Mobile app v2.0: Native iOS and Android apps replacing the current React Native hybrid. Push notifications for critical alerts. Offline mode for viewing cached reports. Budget: Rs 2000000. Status: Planning.

Q2 2026 (April - June): Growth Phase
- Team collaboration features: Real-time document editing, in-app commenting, task assignments, and team workspaces. Integrates with Slack and Microsoft Teams. Budget: Rs 1200000. Status: Design Phase.
- Enterprise SSO: SAML 2.0 and OpenID Connect support for enterprise customers. Includes directory sync with Active Directory and Okta. Required for 3 pending enterprise deals worth Rs 5000000 combined. Budget: Rs 600000. Status: Planning.
- Advanced analytics: Predictive analytics module using machine learning. Churn prediction, usage forecasting, and anomaly detection. Requires hiring 2 ML engineers. Budget: Rs 1800000. Status: Research.

Q3 2026 (July - September): Scale Phase
- Multi-region deployment: Expand infrastructure to EU (Frankfurt) and APAC (Singapore) regions. Required for GDPR compliance with European customers. Estimated infrastructure cost increase of 35%. Budget: Rs 2500000. Status: Planning.
- API marketplace: Allow third-party developers to build and sell integrations on the NovaTech platform. Revenue share model: 70% developer, 30% NovaTech. SDK support for Python, JavaScript, and Go. Budget: Rs 1000000. Status: Concept.
- White-label solution: Enable enterprise customers to rebrand the platform with their own logo, colors, and domain. Tiered pricing based on customization level. Budget: Rs 900000. Status: Concept.
"""

with open("company_docs/product_roadmap.txt", "w") as f:
    f.write(roadmap_doc)
print("✅ Created: company_docs/product_roadmap.txt")

print()
print("Documents created:")
for f in os.listdir("company_docs"):
    size = os.path.getsize(f"company_docs/{f}")
    print(f"   {f} ({size:,} bytes)")

✅ Created: company_docs/tech_architecture.txt
✅ Created: company_docs/product_roadmap.txt

Documents created:
   employee_handbook.pdf (3,168 bytes)
   product_roadmap.txt (2,187 bytes)
   tech_architecture.txt (2,755 bytes)


[title]
Loading Real Files — PDF Loader and Text Loader

[narration]
Now the fun part. Loading real files.

LangChain has specialized loaders for every file type. Today we'll use two of the most common ones.

The PyPDF Loader reads PDF files page by page. Each page becomes a separate document object with the page number in the metadata. The numbering is zero-based, so the first page is page 0. This is perfect because when you cite a source, you can say exactly which page the answer came from.

The Text Loader handles plain text files. Simple but effective. One file, one document.

The beauty of LangChain's design is that after loading, every document looks the same regardless of the original format. PDF, text, web page: they all become document objects with content and metadata, so your entire downstream pipeline stays identical. This is the universal interface pattern.

[display]
## Document Loaders

**PyPDFLoader:**
```python
loader = PyPDFLoader("file.pdf")
docs = loader.load()  # One doc per page
```

**TextLoader:**
```python
loader = TextLoader("file.txt")
docs = loader.load()  # One doc per file
```

**After loading — everything looks the same:**
```
Document(
    page_content = "...",
    metadata = {"source": "file.pdf", "page": 0}
)
```

**100+ loaders available:**
CSV, Word, HTML, Notion, Slack, YouTube, and more
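To make the "everything looks the same" idea concrete, here is a small stand-in that mimics the shape of a loaded document without using LangChain at all. `SimpleDoc`, `from_text` and `from_pages` are hypothetical names invented for this sketch; the real loaders return LangChain `Document` objects with the same two fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleDoc:
    """Stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def from_text(source: str, text: str) -> List[SimpleDoc]:
    """Mimics a text loader: one file becomes one document."""
    return [SimpleDoc(text, {"source": source})]


def from_pages(source: str, pages: List[str]) -> List[SimpleDoc]:
    """Mimics a PDF loader: one document per page, zero-based page numbers."""
    return [SimpleDoc(p, {"source": source, "page": i}) for i, p in enumerate(pages)]


def describe(doc: SimpleDoc) -> str:
    """Downstream code only touches page_content and metadata, never the file type."""
    page = doc.metadata.get("page")
    label = str(doc.metadata["source"])
    if isinstance(page, int):
        label += f", page {page + 1}"  # convert to 1-based for humans
    return f"{label}: {len(doc.page_content)} chars"


docs = from_pages("handbook.pdf", ["first page", "second page"]) + from_text("notes.txt", "hello")
for d in docs:
    print(describe(d))
```

The `describe` function never asks where a document came from, which is the property that lets the splitting and embedding steps stay unchanged.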

In [4]:
"""[narration]
Let me load all three documents. The PDF loader splits by page, so our two-page handbook becomes two documents. Each text file becomes one document.

I combine everything into a single list. Four documents total from three files. Notice the metadata — the PDF documents have both source and page number. The text files have just the source path.

I'm printing a preview of each document so you can see exactly what the loaders extracted.
"""

from langchain_community.document_loaders import PyPDFLoader, TextLoader

pdf_loader = PyPDFLoader("company_docs/employee_handbook.pdf")
pdf_docs = pdf_loader.load()
print(f"PDF loaded: {len(pdf_docs)} pages")

tech_loader = TextLoader("company_docs/tech_architecture.txt")
tech_docs = tech_loader.load()
print(f"Tech doc loaded: {len(tech_docs)} document(s)")

roadmap_loader = TextLoader("company_docs/product_roadmap.txt")
roadmap_docs = roadmap_loader.load()
print(f"Roadmap loaded: {len(roadmap_docs)} document(s)")

all_documents = pdf_docs + tech_docs + roadmap_docs
print(f"\nTotal documents loaded: {len(all_documents)}")

print()
print("=" * 60)
for i, doc in enumerate(all_documents):
    source = doc.metadata.get('source', 'unknown')
    page = doc.metadata.get('page', 'N/A')
    print(f"\nDocument {i+1}:")
    print(f"   Source: {source}")
    print(f"   Page: {page}")
    print(f"   Length: {len(doc.page_content)} characters")
    print(f"   Preview: {doc.page_content[:100]}...")

PDF loaded: 2 pages
Tech doc loaded: 1 document(s)
Roadmap loaded: 1 document(s)

Total documents loaded: 4


Document 1:
   Source: company_docs/employee_handbook.pdf
   Page: 0
   Length: 1475 characters
   Preview: NovaTech Employee Handbook 2026
Section 1: Leave Policy
All full-time employees at NovaTech receive ...

Document 2:
   Source: company_docs/employee_handbook.pdf
   Page: 1
   Length: 1371 characters
   Preview: Section 3: Compensation & Benefits
Salary reviews are conducted annually in April. Performance bonus...

Document 3:
   Source: company_docs/tech_architecture.txt
   Page: N/A
   Length: 2732 characters
   Preview: NovaTech Technical Architecture Document
Last Updated: January 2026

1. SYSTEM OVERVIEW
NovaTech's p...

Document 4:
   Source: company_docs/product_roadmap.txt
   Page: N/A
   Length: 2170 characters
   Preview: NovaTech Product Roadmap 2026
Status: Approved by Product Council

Q1 2026 (January - March): Founda...


[title]
Splitting, Embedding, and Storing — The Pipeline You Know

[narration]
Now we run the same pipeline from Episode 1. Split into chunks, embed them, store in FAISS.

But this time, I'm using a slightly larger chunk size — one thousand characters instead of five hundred. Why? Because our documents are longer and more detailed. Larger chunks preserve more context per piece, which helps the language model give more complete answers.

There's always a tradeoff. Smaller chunks mean more precise retrieval but less context. Larger chunks mean more context but potentially less precise matches. One thousand with two hundred overlap is a common starting point, and it's worth tuning against your own documents.

After splitting, we embed and store everything in one line — same as before.

[display]
## The Pipeline

**Chunk size tradeoff:**
```
Small chunks (500)  --> Precise search, less context
Large chunks (1000) --> More context, broader matches
```

**Our settings:**
- Chunk size: 1000 characters
- Overlap: 200 characters
- Splitter: RecursiveCharacterTextSplitter

**Same one-liner from Episode 1:**
```python
vectorstore = FAISS.from_documents(chunks, embeddings)
```
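To see how chunk size and overlap interact, here is a simplified fixed-window chunker in plain Python. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper that only illustrates the arithmetic.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Slice text into windows of chunk_size that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic 2,500-character text, used only to count windows
sample = "".join(str(i % 10) for i in range(2500))
print(len(chunk_text(sample, 1000, 200)), "chunks at 1000/200")
print(len(chunk_text(sample, 500, 100)), "chunks at 500/100")
```

Halving the chunk size roughly doubles the number of chunks to embed and search, which is the cost side of the more precise matches.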

In [5]:
"""[narration]
Split, embed, store. Three lines of code for the entire indexing pipeline.

One thousand characters per chunk with two hundred character overlap. The overlap is larger this time because our chunks are larger — we want enough shared context between neighbors.

Watch the numbers. We started with four documents. After splitting, we have many more chunks. Each chunk keeps the original metadata, so we never lose track of which file it came from.
"""

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = text_splitter.split_documents(all_documents)
print(f"Original documents: {len(all_documents)}")
print(f"After splitting:    {len(chunks)} chunks")
print(f"Avg chunk size:     {sum(len(c.page_content) for c in chunks) // len(chunks)} characters")

vectorstore = FAISS.from_documents(chunks, embeddings)
print(f"\n✅ Vector store built with {len(chunks)} chunks from {len(all_documents)} documents")

from collections import Counter
source_counts = Counter(c.metadata['source'] for c in chunks)
print("\nChunks per source:")
for source, count in source_counts.items():
    print(f"   {source}: {count} chunks")

Original documents: 4
After splitting:    11 chunks
Avg chunk size:     724 characters

✅ Vector store built with 11 chunks from 4 documents

Chunks per source:
   company_docs/employee_handbook.pdf: 4 chunks
   company_docs/tech_architecture.txt: 4 chunks
   company_docs/product_roadmap.txt: 3 chunks


[title]
Similarity Search With Scores — How Confident Is the Match?

[narration]
Before we build the full RAG chain, let me show you something new — similarity search with scores.

In Episode 1, we just got back the matching chunks. But we had no idea how similar they were. Was it a ninety-five percent match or a fifty percent match? That information is crucial.

FAISS can return a distance score alongside each result. Lower distance means higher similarity. Think of it like a search engine confidence score turned upside down: the smaller the number, the more relevant the result.

This is powerful for two reasons. First, you can set a threshold. If even the best result's distance is above your cutoff, the answer probably isn't in your documents. Second, you can show users a confidence level, which builds trust.

Let me show you the difference.
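The threshold idea can be sketched in a few lines of plain Python. The distances below are copied from this notebook's own output for the leave and chocolate cake queries; the cutoff of 1.5 and the helper name `best_match_or_none` are assumptions for illustration, not values or functions provided by FAISS or LangChain.

```python
from typing import List, Optional, Tuple

Scored = Tuple[str, float]  # (label, distance); lower distance means closer


def best_match_or_none(results: List[Scored], max_distance: float) -> Optional[Scored]:
    """Return the closest result, or None when even that one is beyond the cutoff."""
    if not results:
        return None
    best = min(results, key=lambda pair: pair[1])
    return best if best[1] <= max_distance else None


leave_results = [("employee_handbook.pdf", 0.9109), ("employee_handbook.pdf", 1.2549)]
cake_results = [("product_roadmap.txt", 1.9450)]

CUTOFF = 1.5  # assumed cutoff; tune it on your own data
print(best_match_or_none(leave_results, CUTOFF))
print(best_match_or_none(cake_results, CUTOFF))
```

When the function returns `None`, the application can skip the LLM call entirely and reply that the information isn't in the knowledge base.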

[display]
## Similarity Scores

**Without scores (Episode 1):**
```
Results: [doc1, doc2, doc3]
"Are these good matches? Who knows!"
```

**With scores (Episode 2):**
```
Results: [(doc1, 0.35), (doc2, 0.72), (doc3, 1.45)]
"doc1 is highly relevant, doc3 is barely related"
```

**FAISS distance (rough guide for this setup, not a universal scale):**
- Lower = more similar
- 0.0 = identical
- < 1.0 = good match
- > 1.5 = weak match
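If a distance feels less intuitive than a similarity, the two can be related. The sketch below assumes the store reports squared L2 distance and that the embedding vectors are unit length; under those two assumptions, cosine similarity equals `1 - distance / 2`. Both assumptions should be checked for your own index and embedding model before relying on the conversion.

```python
import math
from typing import List


def cosine_from_squared_l2(distance: float) -> float:
    """For unit-length vectors: |a - b|^2 = 2 - 2*cos(a, b), so cos = 1 - d/2."""
    return 1.0 - distance / 2.0


def unit(v: List[float]) -> List[float]:
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Sanity check of the identity on two arbitrary vectors
a, b = unit([1.0, 2.0, 3.0]), unit([3.0, 1.0, 0.5])
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
assert abs(cosine_from_squared_l2(squared_l2) - cosine) < 1e-9

# Distances taken from this notebook's output
print(f"{cosine_from_squared_l2(0.9109):.2f}")  # leave query: about 0.54
print(f"{cosine_from_squared_l2(1.9450):.2f}")  # chocolate cake query: about 0.03
```

Read this way, the chocolate cake result is nearly orthogonal to the query, which matches the intuition that nothing in the documents is related to it.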

In [6]:
"""[narration]
The method is called similarity search with score. Instead of just returning documents, it returns tuples — each containing the document and its distance score.

I'll test three questions. One about leave policy, one about deployment, and one that's completely outside our knowledge base. Watch how the scores differ dramatically.

The relevant questions should have low distance scores — meaning high similarity. The irrelevant question should have high scores — meaning the system knows it doesn't have a good match.
"""

test_queries = [
    "How many days of paid leave do I get?",
    "What database does the platform use?",
    "What is the recipe for chocolate cake?",
]

for query in test_queries:
    print(f"\nQuery: '{query}'")
    results_with_scores = vectorstore.similarity_search_with_score(query, k=2)

    for doc, score in results_with_scores:
        source = doc.metadata.get('source', 'unknown')
        print(f"   [{source}] Score: {score:.4f}")
        print(f"      {doc.page_content[:100]}...")
    print("-" * 60)

print()
print("Lower score = better match")
print("The chocolate cake query has HIGH scores — system knows it's irrelevant!")


Query: 'How many days of paid leave do I get?'
   [company_docs/employee_handbook.pdf] Score: 0.9109
      NovaTech Employee Handbook 2026
Section 1: Leave Policy
All full-time employees at NovaTech receive ...
   [company_docs/employee_handbook.pdf] Score: 1.2549
      Section 2: Remote Work Policy
NovaTech operates a hybrid work model. Employees may work remotely up ...
------------------------------------------------------------

Query: 'What database does the platform use?'
   [company_docs/tech_architecture.txt] Score: 0.9583
      3. DATABASE LAYER
Primary database is PostgreSQL 16 running on AWS RDS with read replicas in two ava...
   [company_docs/product_roadmap.txt] Score: 1.3597
      Q2 2026 (April - June): Growth Phase
- Team collaboration features: Real-time document editing, in-a...
------------------------------------------------------------

Query: 'What is the recipe for chocolate cake?'
   [company_docs/product_roadmap.txt] Score: 1.9450
      NovaTech Product Roadm

[title]
Building the Basic RAG Chain — Quick Refresher

[narration]
Let me quickly build the standard RAG chain — same pattern as Episode 1 — so we have a baseline to compare with the conversational version.

Retriever fetches the top three chunks. Prompt tells the language model to answer only from context. Parser extracts clean text.

Quick test to make sure everything works before we add memory.

[display]
## Basic RAG Chain (Refresher)

```
Question --> Retriever --> Prompt --> LLM --> Answer
```

**Same pattern as Episode 1 — just with real documents now!**

In [7]:
"""[narration]
Standard RAG chain. You've seen this before. Retriever with k equals three, anti-hallucination prompt, and the chain connected with pipe operators.

Let me test with a few questions across different documents.
"""

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

def format_docs(docs):
    formatted = []
    for doc in docs:
        source = doc.metadata.get('source', 'unknown')
        page = doc.metadata.get('page', '')
        page_str = f", Page {page + 1}" if isinstance(page, int) else ""
        formatted.append(f"[Source: {source}{page_str}]\n{doc.page_content}")
    return "\n\n".join(formatted)

rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful company assistant for NovaTech. Answer questions using ONLY the provided context.

Rules:
- Answer based ONLY on the context below
- If the answer is not in the context, say: "I don't have that information in my knowledge base."
- Always cite the source document and page when available
- Be concise and specific"""),
    ("human", """Context:
{context}

Question: {question}

Answer:""")
])

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

questions = [
    "What is the home office setup allowance?",
    "How is the CI/CD pipeline structured?",
    "What features are planned for Q2 2026?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {rag_chain.invoke(q)}")
    print("-" * 60)

print()
print("✅ Basic RAG works with real documents!")

Q: What is the home office setup allowance?
A: The home office setup allowance is Rs 25000, provided as a one-time allowance for all eligible employees. [Source: company_docs/employee_handbook.pdf, Page 1]
------------------------------------------------------------
Q: How is the CI/CD pipeline structured?
A: The CI/CD pipeline is structured with the following stages: lint, unit test, integration test, security scan, build, deploy to staging, automated E2E tests, and production deploy with canary rollout. [Source: company_docs/tech_architecture.txt, section 5. DEPLOYMENT & INFRASTRUCTURE]
------------------------------------------------------------
Q: What features are planned for Q2 2026?
A: For Q2 2026, the following features are planned: 
1. Team collaboration features, 
2. Enterprise SSO, and 
3. Advanced analytics. [Source: company_docs/product_roadmap.txt]
------------------------------------------------------------

✅ Basic RAG works with real documents!


[title]
The Problem — Why Basic RAG Can't Handle Conversations

[narration]
Now watch this. I'm going to expose the fatal flaw in our basic RAG chain.

First question: 'What is the learning budget at NovaTech?' It answers perfectly. Seventy-five thousand rupees per year.

Second question: 'Does it carry over to the next year?'

And here's the problem. The word 'it' refers to the learning budget. You and I understand that from context. But the RAG system doesn't. Each question is processed independently. There's no memory. No context from the previous question.

So when the retriever searches for 'Does it carry over?', it doesn't know what 'it' refers to. It might find documents about leave carry-over instead. Or it might find nothing relevant at all.

This is the classic follow-up question problem. And it's why every real chatbot needs conversational RAG.

The solution? We need to rewrite the follow-up question into a standalone question before searching. 'Does it carry over?' becomes 'Does the learning budget carry over to the next year?' Now the retriever knows exactly what to look for.

[display]
## The Follow-Up Problem

**Conversation:**
```
User: "What is the learning budget?"
AI:   "Rs 75,000 per year."
User: "Does it carry over?"
AI:   ??? (what is 'it'?)
```

**The fix — rewrite before searching:**
```
"Does it carry over?"
      | (rewrite using chat history)
"Does the learning budget carry over?"
      | (now search works!)
```

**This is called: History-Aware Retrieval**
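Here is a toy illustration of why the vague follow-up can retrieve the wrong chunk. It scores chunks by shared words, which is only a crude stand-in for embedding similarity, and it uses two sentences quoted from the handbook. The names `tokens` and `top_chunk` are hypothetical.

```python
import re
from typing import Set

# Two sentences quoted from the employee handbook created earlier
CHUNKS = {
    "leave": "Unused annual leave up to 5 days may be carried forward to the next calendar year",
    "budget": "Unused learning budget does not carry over",
}


def tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunk(query: str) -> str:
    """Return the name of the chunk sharing the most words with the query."""
    q = tokens(query)
    return max(CHUNKS, key=lambda name: len(q & tokens(CHUNKS[name])))


vague = "Does it carry over to the next year?"
rewritten = "Does the learning budget carry over to the next year?"

print(top_chunk(vague))      # leave: the query shares more words with the leave sentence
print(top_chunk(rewritten))  # budget: naming the subject tips the ranking
```

Real embeddings are far better than word overlap, but the same principle holds: the query can only match what it actually says, so the missing subject has to be restored before searching.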

In [8]:
"""[narration]
Let me prove the problem first. I'll ask about the learning budget, then ask a follow-up.

But this time, I'm also printing what the retriever finds for the follow-up question. This debug output shows why the approach is fragile: the retriever has no idea what 'it' means, so it searches blindly. Sometimes it gets lucky, and other times it pulls the wrong documents entirely.
"""

print("BASIC RAG — No Memory:")
print()

q1 = "What is the learning budget at NovaTech?"
a1 = rag_chain.invoke(q1)
print(f"Q1: {q1}")
print(f"A1: {a1}")
print()

q2 = "Does it carry over to the next year?"

# Debug: show what the retriever actually finds for the vague follow-up
retrieved = retriever.invoke(q2)
print(f"Q2: {q2}")
print(f"  --> Retriever found these sources (without context):")
for d in retrieved:
    src = d.metadata.get('source', 'unknown')
    page = d.metadata.get('page', '')
    page_str = f", Page {page + 1}" if isinstance(page, int) else ""
    print(f"      {src}{page_str}: {d.page_content[:80]}...")

a2 = rag_chain.invoke(q2)
print(f"A2: {a2}")

print()
print("Notice: 'it' is ambiguous without history!")
print("The retriever may find leave carry-over instead of learning budget carry-over.")
print("That's why we need conversational RAG — to rewrite the question first.")

BASIC RAG — No Memory:

Q1: What is the learning budget at NovaTech?
A1: The annual learning budget at NovaTech is Rs 75000 for courses, certifications, conferences, and books. 
[Source: company_docs/employee_handbook.pdf, Page 2 and Page 2, Section 4: Professional Development]

Q2: Does it carry over to the next year?
  --> Retriever found these sources (without context):
      company_docs/employee_handbook.pdf, Page 2: 75000 for courses, certifications, conferences, and books. Unused learning budge...
      company_docs/employee_handbook.pdf, Page 2: Section 3: Compensation & Benefits
Salary reviews are conducted annually in Apri...
      company_docs/product_roadmap.txt: Q3 2026 (July - September): Scale Phase
- Multi-region deployment: Expand infras...
A2: No, the unused learning budget does not carry over to the next year. [Source: company_docs/employee_handbook.pdf, Page 2]

Notice: 'it' is ambiguous without history!
The retriever may find leave carry-over instead of learning bu

[title]
Conversational RAG — Adding Memory That Actually Works

[narration]
Here's the architecture for conversational RAG. It has two stages.

Stage one — Question Rewriting. Before we search, we pass the chat history AND the new question to the language model. We ask it: given this conversation history, rewrite the latest question so it makes sense as a standalone question. 'Does it carry over?' becomes 'Does the NovaTech learning budget carry over to the next year?'

Stage two — Standard RAG. Now we take the rewritten question and run our normal RAG pipeline. Search, retrieve, generate.

The key insight is that we use the language model twice. Once to rewrite the question. Once to generate the answer. Two calls, but the result is a chatbot that actually understands follow-ups.

This is a common way for production chatbots to handle multi-turn conversations. The pattern is often called history-aware retrieval.

[display]
## Conversational RAG Architecture

**Two-stage pipeline:**
```
Stage 1: REWRITE
  Chat History + New Question
      --> LLM rewrites into standalone question

Stage 2: RAG (same as before)
  Standalone Question
      --> Retriever --> Prompt --> LLM --> Answer
```

**Example:**
```
History: "What is the learning budget?" --> "Rs 75,000"
New:     "Does it carry over?"
Rewrite: "Does the NovaTech learning budget carry over?"
```

**Uses LLM twice:** once to rewrite, once to answer
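
**Control-flow sketch (no API key needed):**

The two stages can be sketched as one plain function. This is a minimal illustration, not the notebook's implementation: `answer_with_history` and the three `fake_*` stubs are hypothetical names that stand in for the real rewrite chain, retriever, and LLM.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, content) pairs, oldest first

def answer_with_history(
    question: str,
    history: History,
    rewrite: Callable[[str, str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Run rewrite -> retrieve -> generate; return (answer, standalone_question)."""
    # Stage 1: only rewrite when there is history to resolve references against.
    if history:
        history_text = "\n".join(f"{role}: {content}" for role, content in history)
        standalone = rewrite(history_text, question)
    else:
        standalone = question

    # Stage 2: standard RAG on the standalone question.
    context = "\n\n".join(retrieve(standalone))
    return generate(context, standalone), standalone

# Hypothetical stubs so the sketch runs without any model.
def fake_rewrite(history_text, question):
    return question.replace(" it ", " the NovaTech learning budget ")

def fake_retrieve(query):
    return [f"(chunk retrieved for: {query})"]

def fake_generate(context, question):
    return f"Answer to '{question}' using {context}"

history = [("Human", "What is the learning budget?"), ("AI", "Rs 75,000")]
answer, standalone = answer_with_history(
    "Does it carry over?", history, fake_rewrite, fake_retrieve, fake_generate
)
print(standalone)
```

Note that the retriever only ever sees the standalone question, which is the whole point of stage one.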

In [9]:
"""[narration]
First, the question rewriting chain. This is the magic ingredient.

I define a prompt that gives the language model the chat history and the latest question. The instruction is clear: rewrite the question so it stands on its own. If the question already makes sense without context, return it unchanged.

Let me test this with our problematic follow-up question. Watch how it transforms 'Does it carry over?' into a complete, searchable question.
"""

rewrite_prompt = ChatPromptTemplate.from_messages([
    ("system", """Given the following conversation history and a follow-up question,
rewrite the follow-up question to be a standalone question that includes all necessary context.
If the question already makes sense on its own, return it unchanged.
Only return the rewritten question — nothing else."""),
    ("human", """Chat History:
{chat_history}

Follow-up Question: {question}

Standalone Question:""")
])

rewrite_chain = rewrite_prompt | llm | StrOutputParser()

sample_history = """Human: What is the learning budget at NovaTech?
AI: Each employee receives an annual learning budget of Rs 75,000 for courses, certifications, conferences, and books."""

follow_ups = [
    "Does it carry over to the next year?",
    "What about conferences?",
    "How does this compare to the home office allowance?",
]

print("Question Rewriting Demo:")
print()
print(f"History: '{sample_history}'")
print()

for q in follow_ups:
    rewritten = rewrite_chain.invoke({
        "chat_history": sample_history,
        "question": q
    })
    print(f"   Original:  '{q}'")
    print(f"   Rewritten: '{rewritten}'")
    print()

Question Rewriting Demo:

History: 'Human: What is the learning budget at NovaTech?
AI: Each employee receives an annual learning budget of Rs 75,000 for courses, certifications, conferences, and books.'

   Original:  'Does it carry over to the next year?'
   Rewritten: 'Does the annual learning budget of Rs 75,000 provided to each employee at NovaTech carry over to the next year if it is not fully utilized?'

   Original:  'What about conferences?'
   Rewritten: 'What is the policy regarding the use of the annual learning budget of Rs 75,000 for attending conferences at NovaTech?'

   Original:  'How does this compare to the home office allowance?'
   Rewritten: 'What is the comparison between the annual learning budget of Rs 75,000 for courses, certifications, conferences, and books at NovaTech and the home office allowance?'



In [10]:
"""[narration]
Now the complete conversational RAG system. I wrap everything into a class to keep the state clean.

The class stores the chat history as a list of messages. When you ask a question, it first checks if there's any history. If there is, it rewrites the question. If not, it uses the question as is.

Then it runs the standard RAG pipeline with the rewritten question. After getting the answer, it saves both the question and answer to history.

One important note. This is a simple memory approach — we store the full conversation as-is. In production, you'd summarize or window the chat history to prevent it from growing too large. Long histories increase cost, add latency, and can actually degrade the rewriting quality. But for learning, this pattern shows you the core idea clearly.
"""

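class ConversationalRAG:
    """RAG pipeline that rewrites follow-up questions using the chat history."""

    def __init__(self, retriever, llm, embeddings=None):
        # `embeddings` is accepted so existing call sites keep working, but it is
        # not used here: the retriever already wraps the embedding model.
        self.retriever = retriever
        self.llm = llm
        self.chat_history = []  # list of {"role": "human" | "ai", "content": str}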

        self.rewrite_prompt = ChatPromptTemplate.from_messages([
            ("system", """Given the conversation history and a follow-up question,
rewrite the question to be standalone with all necessary context.
If it already stands alone, return it unchanged. Only return the question."""),
            ("human", """Chat History:
{chat_history}

Follow-up Question: {question}

Standalone Question:""")
        ])
        self.rewrite_chain = self.rewrite_prompt | self.llm | StrOutputParser()

        self.rag_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a helpful NovaTech company assistant. Answer using ONLY the provided context.
- If the answer is not in the context, say so
- Cite the source document and page when available
- Be concise and specific"""),
            ("human", """Context:
{context}

Question: {question}

Answer:""")
        ])

    def _format_history(self):
        if not self.chat_history:
            return "(No previous conversation)"
        lines = []
        for msg in self.chat_history:
            role = "Human" if msg["role"] == "human" else "AI"
            lines.append(f"{role}: {msg['content']}")
        return "\n".join(lines)

    def ask(self, question):
        # Stage 1: rewrite the question only when there is history to draw on.
        if self.chat_history:
            standalone_q = self.rewrite_chain.invoke({
                "chat_history": self._format_history(),
                "question": question
            }).strip()  # drop stray whitespace the model may add
        else:
            standalone_q = question

        # Stage 2: standard RAG on the standalone question.
        docs = self.retriever.invoke(standalone_q)
        context = format_docs(docs)
        messages = self.rag_prompt.format_messages(context=context, question=standalone_q)
        answer = self.llm.invoke(messages).content

        # Store the user's original wording, not the rewrite, so the history reads naturally.
        self.chat_history.append({"role": "human", "content": question})
        self.chat_history.append({"role": "ai", "content": answer})

        # Collect unique source labels; the page metadata is 0-indexed, so add 1 for display.
        sources = []
        for d in docs:
            src = d.metadata.get('source', 'unknown')
            page = d.metadata.get('page', None)
            if isinstance(page, int):
                sources.append(f"{src} (Page {page + 1})")
            else:
                sources.append(src)
        sources = sorted(set(sources))
        return answer, sources, standalone_q

    def reset(self):
        self.chat_history = []

conv_rag = ConversationalRAG(retriever, llm, embeddings)
print("✅ Conversational RAG system ready!")
print("   Memory: enabled")
print("   Question rewriting: enabled")

✅ Conversational RAG system ready!
   Memory: enabled
   Question rewriting: enabled
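
The narration mentions windowing the chat history in production. Here's a rough sketch of what that could look like, assuming the same list-of-dicts history format as the class above. The helper name `window_history` and the `max_turns` parameter are my own, not part of the notebook or of LangChain.

```python
def window_history(chat_history, max_turns=3):
    """Return only the most recent `max_turns` question/answer pairs.

    Assumes the history alternates human and AI messages, two per turn,
    which is how ConversationalRAG.ask() appends them.
    """
    if max_turns <= 0:
        # Guard: a slice of [-0:] would return the whole list, not an empty one.
        return []
    return chat_history[-2 * max_turns:]

# Build a five-turn history in the same shape the class uses.
history = []
for i in range(1, 6):
    history.append({"role": "human", "content": f"question {i}"})
    history.append({"role": "ai", "content": f"answer {i}"})

recent = window_history(history, max_turns=2)
print([m["content"] for m in recent])
```

One way to wire this in would be to call it at the top of `_format_history`, so the rewrite prompt only ever sees the last few turns. The cutoff is arbitrary, so tune it to your documents and model.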


[title]
The Moment of Truth — Follow-Up Questions That Actually Work

[narration]
Let's test this with a multi-turn conversation. I'll start with a question, then ask follow-ups that use pronouns and references to previous answers. The exact scenario that broke our basic RAG chain.

Watch two things. First, the rewritten questions — see how the system transforms vague follow-ups into precise standalone questions. Second, the answers — they should be accurate and relevant, using the right context every time.

[display]
## Testing Conversational RAG

**Our test conversation:**
1. Start with a topic
2. Ask follow-up with "it" / "that"
3. Compare with a different topic
4. Reference something from earlier

**Watching for:**
- ✅ Correct question rewriting
- ✅ Accurate answers from right documents
- ✅ Memory across turns

In [11]:
"""[narration]
Five turns of conversation. Starting with the learning budget, then asking follow-ups that reference previous answers.

For each turn, I'm showing the original question, the rewritten version, the answer, and the sources. This makes it crystal clear how the rewriting works.
"""

conversation = [
    "What is the learning budget at NovaTech?",
    "Does it carry over to the next year?",
    "What about the home office allowance — how much is that?",
    "Which of those two benefits is more generous?",
    "Tell me about the deployment schedule for the engineering team.",
]

print("CONVERSATIONAL RAG — With Memory:")
print()
print("=" * 60)

for i, q in enumerate(conversation, 1):
    answer, sources, rewritten = conv_rag.ask(q)
    print(f"\nTurn {i}: {q}")
    if rewritten != q:
        print(f"Rewritten: {rewritten}")
    print(f"Answer: {answer}")
    print(f"Sources: {', '.join(sources)}")
    print("-" * 60)

print(f"\nConversation history: {len(conv_rag.chat_history)} messages")
print("✅ Follow-up questions work perfectly!")

CONVERSATIONAL RAG — With Memory:


Turn 1: What is the learning budget at NovaTech?
Answer: The annual learning budget at NovaTech is Rs 75000 for courses, certifications, conferences, and books. 
[Source: company_docs/employee_handbook.pdf, Page 2]
Sources: company_docs/employee_handbook.pdf (Page 1), company_docs/employee_handbook.pdf (Page 2)
------------------------------------------------------------

Turn 2: Does it carry over to the next year?
Rewritten: Does the annual learning budget of Rs 75000 at NovaTech carry over to the next year if it is not fully utilized?
Answer: No, the annual learning budget of Rs 75000 at NovaTech does not carry over to the next year if it is not fully utilized. 
[Source: company_docs/employee_handbook.pdf, Page 2]
Sources: company_docs/employee_handbook.pdf (Page 1), company_docs/employee_handbook.pdf (Page 2)
------------------------------------------------------------

Turn 3: What about the home office allowance — how much is that?
Rewritten: W

[title]
Interactive Conversational RAG Demo

[narration]
Let me put this all together in an interactive demo. This time, the chatbot remembers everything you say. You can ask follow-ups, reference previous answers, switch topics, and come back to earlier questions.

Try it yourself. Start with any topic from the handbook, tech docs, or product roadmap. Then ask follow-ups. Test the memory.

Type 'reset' to clear the conversation and start fresh. Type 'quit' to exit.

[display]
## Interactive Demo — Conversational RAG

**Try these conversation flows:**

Flow 1 — HR Deep Dive:
- "What is the remote work policy?"
- "Which days do I need to come in?"
- "What if I'm a new employee?"

Flow 2 — Tech Exploration:
- "What's the tech stack?"
- "How is data cached?"
- "What about security?"

Flow 3 — Product Planning:
- "What's coming in Q2?"
- "How much will that cost?"
- "Is there anything related to AI?"

**Commands:** `reset` = clear memory, `quit` = exit

In [12]:
"""[narration]
Interactive conversational RAG. Every answer builds on the previous ones. The reset command clears memory so you can start a new topic cleanly.

I'll demonstrate a quick conversation, then hand it over to you.
"""

def interactive_conversational_rag():
    rag = ConversationalRAG(retriever, llm, embeddings)

    print()
    print("=" * 60)
    print("NOVATECH KNOWLEDGE BASE — Conversational RAG")
    print("=" * 60)
    print()
    print("Ask anything! I remember the conversation.")
    print("Commands: 'reset' = clear memory, 'quit' = exit")

    while True:
        user_input = input("\nYou: ").strip()

        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("\nGoodbye!")
            break

        if user_input.lower() == 'reset':
            rag.reset()
            print("Memory cleared! Starting fresh.")
            continue

        if not user_input:
            continue

        try:
            answer, sources, rewritten = rag.ask(user_input)
            if rewritten != user_input:
                print(f"   Understood as: {rewritten}")
            print(f"\nAnswer: {answer}")
            print(f"\nSources: {', '.join(sources)}")
        except Exception as e:
            # Report the error but keep the session (and its memory) alive.
            print(f"\nError: {e}")
            continue

print("Starting interactive conversational RAG...")
interactive_conversational_rag()

Starting interactive conversational RAG...

NOVATECH KNOWLEDGE BASE — Conversational RAG

Ask anything! I remember the conversation.
Commands: 'reset' = clear memory, 'quit' = exit

Answer: NovaTech is a company that offers a microservices-based SaaS application. [Source: company_docs/tech_architecture.txt]

Sources: company_docs/employee_handbook.pdf (Page 1), company_docs/employee_handbook.pdf (Page 2), company_docs/tech_architecture.txt

Goodbye!
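
If `input()` is awkward in your environment, the same command handling can be driven from a list of scripted inputs. This is only a sketch: `run_scripted_session` and `EchoRAG` are hypothetical names, and `EchoRAG` is a stand-in that mimics the `ask` and `reset` methods of `ConversationalRAG` without calling any model.

```python
def run_scripted_session(rag, inputs):
    """Feed scripted inputs through the same command logic as the interactive loop.

    `rag` is any object with ask(question) -> (answer, sources, rewritten)
    and reset(). Returns the output lines instead of printing them.
    """
    transcript = []
    for raw in inputs:
        user_input = raw.strip()
        if not user_input:
            continue  # skip empty lines, like the interactive loop does
        command = user_input.lower()
        if command in ("quit", "exit", "bye"):
            transcript.append("Goodbye!")
            break
        if command == "reset":
            rag.reset()
            transcript.append("Memory cleared! Starting fresh.")
            continue
        answer, sources, rewritten = rag.ask(user_input)
        if rewritten != user_input:
            transcript.append(f"Understood as: {rewritten}")
        transcript.append(f"Answer: {answer}")
        transcript.append(f"Sources: {', '.join(sources)}")
    return transcript

class EchoRAG:
    """Hypothetical stand-in for ConversationalRAG that needs no models."""
    def __init__(self):
        self.turns = 0

    def ask(self, question):
        self.turns += 1
        return f"echo {self.turns}: {question}", ["stub.txt"], question

    def reset(self):
        self.turns = 0

lines = run_scripted_session(
    EchoRAG(),
    ["What's the tech stack?", "reset", "  ", "How is data cached?", "quit", "ignored"],
)
for line in lines:
    print(line)
```

Swapping `EchoRAG()` for a real `ConversationalRAG(retriever, llm, embeddings)` instance should replay one of the conversation flows above end to end.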


[title]
What You Just Built — And What's Coming Next

[narration]
Let's recap what you accomplished today.

You loaded real PDF and text files using LangChain's document loaders. You saw how different file formats become identical document objects after loading.

You learned about similarity scores and how they tell you how close each retrieved chunk is to your question. The scores are distances, so a lower score means a better match.

And the big one: you built conversational RAG. A system that rewrites follow-up questions using the chat history before searching. This rewrite-then-retrieve pattern is widely used in production chatbots.

The two-stage architecture — rewrite then retrieve — is elegant and powerful. And now you understand every piece of it.

In Episode 3, we're going even deeper. Advanced retrieval strategies. Multiple retrieval methods combined. Re-ranking results for maximum accuracy. And metadata filtering so you can search within specific documents or categories.

But right now, while this is fresh, run this notebook. Load your own PDFs. Have a multi-turn conversation with your own documents. The learning happens when you build.

Subscribe to NoteBook Learnings. Drop a comment with what you built.

I'll see you in the next one.

[display]
## What You Built Today

**New skills unlocked:**
1. Loading real PDFs and text files
2. Similarity search with relevance scores
3. Conversational RAG with memory
4. History-aware question rewriting
5. Multi-turn follow-up conversations

**The conversational RAG pattern:**
```python
# Stage 1: Rewrite
standalone = rewrite_chain.invoke({
    "chat_history": history,
    "question": follow_up
})

# Stage 2: Standard RAG
answer = rag_chain.invoke(standalone)
```

**RAG Series — NoteBook Learnings:**
- Ep 1: RAG From Scratch  ✅
- Ep 2: Real Docs + Conversational RAG  <-- YOU ARE HERE
- Ep 3: Advanced Retrieval Strategies
- Ep 4: Multi-Modal RAG (images + tables)
- Ep 5: Production RAG Deployment

**Your homework:**
1. Load your own PDF into this pipeline
2. Have a 5-turn conversation with your document
3. Try different chunk sizes and compare answers
4. Test with questions outside the document scope
5. Share your results in the comments!

**Subscribe** to NoteBook Learnings
**Download** the notebook from the description