# RAG with Vector Databases

## Overview
This notebook demonstrates building a complete, production-ready RAG (Retrieval-Augmented Generation) system using vector databases. You'll learn to extract, process, and query real web content for accurate AI responses.

## Learning Objectives
- Build a complete RAG pipeline from web data to AI responses
- Understand vector database indexing and retrieval
- Handle context window limitations in production systems
- Create persistent knowledge bases for reuse
- Implement semantic search for accurate context retrieval

## Key Concepts
- **FAISS Vector Database**: Fast similarity search and clustering
- **Web Data Processing**: Extract and chunk web content
- **Context Window Management**: Handling token limits effectively
- **Production RAG Pipeline**: End-to-end retrieval and generation
- **Persistent Storage**: Save and reload vector databases

## Step 1: Installing Required Dependencies

This notebook requires additional packages for vector databases and web scraping:
- **faiss-cpu**: Fast similarity search library for vector databases
- **beautifulsoup4**: HTML parsing for web content extraction
- **langchain-community**: Community tools including document loaders

In [1]:
# extra libs
# ! pip install faiss-cpu beautifulsoup4 langchain-community
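
Before running the rest of the notebook, it can help to confirm that the packages above are importable. The following is a minimal sketch using only the standard library; `missing_packages` and `REQUIRED` are hypothetical names, and the mapping pairs each import name with the pip package it comes from.

```python
import importlib.util

# Import name -> pip package name for the three extra dependencies.
REQUIRED = {
    "faiss": "faiss-cpu",
    "bs4": "beautifulsoup4",
    "langchain_community": "langchain-community",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names of packages whose top-level module cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All extra dependencies are importable.")
```

`find_spec` only locates a module without importing it, so the check is cheap and has no side effects.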

## Step 2: Environment Configuration

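Two options for running the RAG system:
- **Azure OpenAI** (commented out) - Production-grade embeddings and LLM
- **LM Studio** (active) - Local development with a local LLM and simulated embeddings

The local setup uses `DeterministicFakeEmbedding(size=4096)`, which produces repeatable pseudo-random 4096-dimensional vectors. These vectors exercise the full indexing and retrieval pipeline without an embedding service, but they do not encode meaning, so search results are not reliably relevant until a real embedding model (such as the Azure option) is used.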
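
To make "simulated embeddings" concrete, here is a minimal sketch written in the spirit of `DeterministicFakeEmbedding`. It is not the library's implementation: `fake_embedding` and `cosine` are hypothetical helpers, and the vector size of 8 is chosen only to keep the example small.

```python
import hashlib
import math
import random


def fake_embedding(text: str, size: int = 8) -> list[float]:
    """Hypothetical stand-in: a vector drawn from an RNG seeded by the text's hash."""
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % 10**8
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


question = "Who acted in the film?"
same = cosine(fake_embedding(question), fake_embedding(question))
paraphrase = cosine(fake_embedding(question), fake_embedding("Which actors star in the movie?"))

print(f"identical text: {same:.3f}")        # identical inputs map to the same vector
print(f"paraphrase:     {paraphrase:.3f}")  # arbitrary value: it reflects the hash, not the meaning
```

Identical text always yields the same vector, which makes runs repeatable, while a paraphrase lands at an unrelated point in the vector space.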

In [2]:
# # with Azure OpenAI and LangChain

# from langchain_openai import AzureOpenAIEmbeddings
# from langchain_community.vectorstores import FAISS
# from langchain_community.document_loaders import WebBaseLoader
# from langchain_openai import AzureChatOpenAI              # Azure OpenAI LLM
# from langchain_core.prompts import ChatPromptTemplate     # Structured prompts
# from langchain_core.output_parsers import StrOutputParser # Clean output parsing
# from langchain_core.runnables import RunnablePassthrough  # Data passing utilities

# # Environment and configuration
# from dotenv import load_dotenv
# import os

# # Load environment variables from .env file
# load_dotenv()

# embeddings = AzureOpenAIEmbeddings(
#     deployment="text-embedding-3-large",          # Deployment name in Azure
#     azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),      # Your Azure endpoint
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),      # API version
#     api_key=os.getenv("AZURE_OPENAI_API_KEY"),              # Authentication key
# )

# llm = AzureChatOpenAI(
#     azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),          
#     azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"), 
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),          
#     api_key=os.getenv("AZURE_OPENAI_API_KEY"),                  
# )

### Option 1: Azure OpenAI Configuration (Commented Out)
Production setup with high-quality embeddings and large language models.

### Option 2: LM Studio Configuration (Active)
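Local development setup: an LM Studio server at `http://localhost:1234/v1` provides the LLM, and simulated embeddings stand in for a real embedding model, so the RAG workflow can be learned without a cloud service.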

In [3]:
# with local LM Studio and LangChain

from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

embeddings = DeterministicFakeEmbedding(size=4096)

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # LM Studio doesn't require an API key
)

USER_AGENT environment variable not set, consider setting it to identify your requests.
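
The `USER_AGENT` warning above appears when the document loaders are imported without that environment variable set. A minimal sketch of one way to address it, assuming the variable is set in a cell that runs before the LangChain imports; the identifier string is a hypothetical placeholder.

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
# setdefault keeps any value that is already configured in the environment.
os.environ.setdefault("USER_AGENT", "rag-notebook-demo/0.1")

print(os.environ["USER_AGENT"])
```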


# Stage 1: Building the Vector Database

This stage demonstrates the **"Indexing"** phase of RAG:
1. **Extract**: Load content from web sources
2. **Split**: Break content into manageable chunks  
3. **Embed**: Convert chunks to vector representations
4. **Store**: Save vectors in searchable database
5. **Persist**: Save database for future use

We'll use Wikipedia as our data source to build a knowledge base about "Kalki 2898 AD".

In [4]:
web_page = "https://en.wikipedia.org/wiki/Kalki_2898_AD"

## Step 1: Data Loading and Processing

In this step, we'll load the webpage content and process it into manageable chunks for our vector database. The WebBaseLoader will:

- **Extract Content**: Download and parse the HTML from the Wikipedia page
- **Clean Data**: Remove HTML tags and extract readable text
- **Split Documents**: Break content into smaller, searchable chunks
- **Prepare for Indexing**: Format chunks for vector embedding

### Why Document Chunking Matters:
- **Context Windows**: LLMs have token limits (4096, 8192, etc.) 
- **Retrieval Precision**: Smaller chunks improve search accuracy
- **Memory Efficiency**: Better for vector storage and similarity search
- **Answer Quality**: Focused chunks lead to more relevant responses

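The `load_and_split()` method loads the page and chunks it in one call. When no splitter is passed, it falls back to LangChain's `RecursiveCharacterTextSplitter` with default settings, which tries to break on paragraph and sentence boundaries before resorting to fixed-size cuts.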
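
To illustrate what chunking does, here is a simplified sketch that cuts text into fixed-size character windows with overlap. It is deliberately simpler than the splitter LangChain applies; `split_text` and its parameter values are assumptions made for this example.

```python
def split_text(text: str, chunk_size: int = 200, chunk_overlap: int = 40) -> list[str]:
    """Fixed-size character windows; neighbouring chunks share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"Sentence number {i}." for i in range(100))
chunks = split_text(sample, chunk_size=200, chunk_overlap=40)
print(f"{len(sample)} characters -> {len(chunks)} chunks")
print("overlap preserved:", chunks[0][-40:] == chunks[1][:40])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.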

In [5]:
loader = WebBaseLoader(web_page)
# documents = loader.load()
chunked_documents = loader.load_and_split()

In [6]:
print(f"🌐 Total chunks from webpage: {len(chunked_documents)}")
print(chunked_documents[5])

🌐 Total chunks from webpage: 22
page_content='Pre-production
Ashwin co-wrote the film's script along with Rutham Samar, Sai Madhav Burra and B. S. Sarawagna Kumar, in his second consecutive film with Samar and Burra after Mahanati.[30][31] A muhurat pooja ceremony was held on 24 July 2021 at Ramoji Film City in Hyderabad with the presence of the film's cast and crew. The working title was revealed to have been changed to Project K the same day.[21][32][33] Ashwin stated that the film needed high-end technology and futuristic vehicles to be developed for the film. Although they can be recreated using computer-generated imagery (CGI), the director opted to build these vehicles from scratch.[34]
In March 2022, Ashwin requested businessman Anand Mahindra to provide technical support to build such type of vehicles.[35] A few days later, Mahindra responded that their company Mahindra & Mahindra would assist the production team from their Mahindra Research Valley campus in Chennai.[36] In Jul

### Create the In-Memory Vector Store

In [7]:
vector_store = FAISS.from_documents(
    chunked_documents,  # Our preprocessed document chunks
    embeddings       # The embedding model to use
)

In [8]:
print(f"\n💾 Vector Database Summary:")
print(f"   - Source: {web_page}")
print(f"   - Chunks: {len(chunked_documents)}")
print(f"   - Technology: FAISS + Embeddings")
print(f"   - Capability: Semantic search across movie information")


💾 Vector Database Summary:
   - Source: https://en.wikipedia.org/wiki/Kalki_2898_AD
   - Chunks: 22
   - Technology: FAISS + Embeddings
   - Capability: Semantic search across movie information


## Important: Context Window Management

### The Token Limit Challenge
When building RAG systems, you'll encounter context window limitations:
- **LM Studio Model**: Context length depends on the loaded model and its settings, often 4096 tokens
- **Retrieved Content**: Can easily exceed this limit
- **Error Result**: "BadRequestError: context overflows"

### Solutions:
1. **Limit Retrieval**: Use `search_kwargs={"k": 2}` to get fewer chunks
2. **Chunk Smaller**: Break documents into smaller pieces
3. **Model Upgrade**: Use models with larger context windows
4. **Content Filtering**: Pre-process to remove irrelevant content

### Our Approach:
We limit retrieval to 2 most relevant chunks to stay within token limits while maintaining answer quality.
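
The budgeting behind that choice can be sketched as follows. This is a rough illustration, not a tokenizer: the four-characters-per-token ratio is a common heuristic for English text, and the prompt and answer allowances are hypothetical numbers.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough heuristic (assumption): about four characters per token."""
    return max(1, len(text) // chars_per_token)


def fit_to_budget(chunks: list[str], max_context_tokens: int) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_context_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


# Hypothetical budget: a 4096-token window minus room for the prompt and the answer.
context_window, prompt_tokens, answer_tokens = 4096, 300, 800
budget = context_window - prompt_tokens - answer_tokens

ranked_chunks = ["a" * 4000, "b" * 4000, "c" * 4000, "d" * 4000]  # about 1000 tokens each
selected = fit_to_budget(ranked_chunks, budget)
print(f"budget={budget} tokens, kept {len(selected)} of {len(ranked_chunks)} chunks")
```

For exact counts, the estimate would need to be replaced with the tokenizer of the model actually being served.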

## Step 2: Testing Vector Database Search

Before building the complete RAG chain, let's test our vector database's search capabilities. This demonstrates how semantic search works and helps us understand what context will be retrieved for different types of questions.

### Understanding Semantic Search
- **Keyword Search**: Matches exact words only
- **Semantic Search**: Understands meaning and context
- **Vector Similarity**: Finds conceptually related content
- **Relevance Ranking**: Returns most similar chunks first
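
Vector similarity and relevance ranking can be sketched with a brute-force search over a few hand-made vectors. This is an illustration under stated assumptions: the three-dimensional vectors and their labels are invented for the example, and cosine similarity is used for readability, which may differ from the distance metric the FAISS store is configured with.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the k best, highest first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical vectors: the axes loosely mean "cast", "plot", and "box office".
toy_index = {
    "cast section": [0.9, 0.1, 0.0],
    "plot section": [0.1, 0.9, 0.1],
    "box office section": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.0]  # a query that is mostly about the cast

for name, score in top_k(query_vec, toy_index, k=2):
    print(f"{name}: {score:.3f}")
```

A library such as FAISS performs the same kind of nearest-neighbour lookup, but with index structures that avoid comparing the query against every stored vector one by one in Python.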

### Test Strategy
We'll test different query types to see how the vector database retrieves relevant information:
1. **Plot Information**: Testing story-related queries
2. **Cast Information**: Testing actor-related queries  
3. **Technical Details**: Testing production information

This helps us verify that our knowledge base can find appropriate context before we connect it to the language model.
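
One lightweight way to make that verification repeatable is to check whether the retrieved text mentions terms you expect for a query. The sketch below is an assumption-laden illustration: `keyword_hits` is a hypothetical helper, and the strings stand in for the `page_content` of real search results.

```python
def keyword_hits(chunks: list[str], expected_keywords: list[str]) -> dict[str, bool]:
    """Report, per keyword, whether any retrieved chunk mentions it (case-insensitive)."""
    lowered = [chunk.lower() for chunk in chunks]
    return {kw: any(kw.lower() in chunk for chunk in lowered) for kw in expected_keywords}


# Stand-in strings; in the notebook these would be `result.page_content` values
# returned by `vector_store.similarity_search(...)`.
retrieved = [
    "Critical response: reviews were mixed to positive.",
    "References: archived from the original on 25 June 2024.",
]
report = keyword_hits(retrieved, ["cast", "Prabhas"])
print(report)
```

A report where every keyword is `False` is a signal that retrieval, rather than the language model, is the part to investigate first.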

In [9]:
# Test 1: Search for plot information
query = "What is the plot of Kalki 2898 AD?"
results = vector_store.similarity_search(query, k=3)  # Get top 3 most similar chunks

for i, result in enumerate(results, 1):
    print(f"📄 CHUNK {i}:")
    print(f"Content: {result.page_content}")
    print(f"Source: {result.metadata}")
    print("-" * 80)

print(f"\n💡 Notice how the search finds semantically relevant content")
print("   even if the exact word 'plot' doesn't appear in the results!")

📄 CHUNK 1:
Content: Critical response
While Deccan Chronicle mentioned mixed reviews and WION wrote the film "failed to impress", India Today commented that the reviews were positive.[166][167][168] On the review aggregator website Rotten Tomatoes, 77% of 39 critics' reviews are positive, with an average rating of 7.1/10. The website's consensus reads: "A colourful spectacle that competes with Hollywood blockbusters while retaining its own distinctive flavor, Kalki 2898 AD marks a rousing breakthrough in Tollywood sci-fi cinema."[169] Metacritic, which uses a weighted average, assigned the film a score of 65 out of 100, based on 8 critics, indicating "generally favorable" reviews.[170]
A critic for Bollywood Hungama rated the film 4 stars out of 5 and wrote, "On the whole, Kalki 2898 AD stands as a grandiose spectacle that depicts the future like never before in Indian cinema and also merges mythology seamlessly, thereby delivering a unique experience to the audiences."[171] Chirag Seh

In [10]:
# Test 2: Search for cast information  
query = "who acted in Kalki 2898 AD?"
results = vector_store.similarity_search(query, k=3)

for i, result in enumerate(results, 1):
    print(f"📄 CHUNK {i}:")
    print(f"Content: {result.page_content}")
    print(f"Source: {result.metadata}")
    print("-" * 80)

print(f"\n🎯 Semantic Search Power:")
print("   - Query: 'who acted' → Finds content about actors/cast")
print("   - Works even with different phrasing")
print("   - Understands context and meaning, not just keywords")

📄 CHUNK 1:
Content: ^ "'Theme Of Kalki' Song From 'Kalki 2898 AD' Is An 'Ode To Lord Krishna' – WATCH Video". The Times of India. 25 June 2024. Archived from the original on 25 June 2024. Retrieved 25 June 2024.

^ "Ta Takkara Complex Song from Kalki 2898 AD: Prabhas, Disha Patani live out their dreams". Hindustan Times. 29 June 2024. Archived from the original on 29 June 2024. Retrieved 29 June 2024.

^ "Hope of Shambala song from Kalki 2898 AD out". Cinema Express. 5 July 2024. Archived from the original on 12 July 2024. Retrieved 5 July 2024.

^ "Kickstart Your Day With The #Kalki2898AD Album". X (formerly Twitter). Archived from the original on 22 August 2024. Retrieved 22 August 2024.

^ "Kalki 2898 AD (Telugu)". Spotify. Archived from the original on 10 July 2024. Retrieved 10 July 2024.

^ "Internet unhappy with Prabhas' first look from 'Project K', call him 'sasta Iron man'". India Today. 19 July 2023. Archived from the original on 21 July 2023. Retrieved 21 July 2023.

^ "Prab

In [11]:
database_name = "local_kalki_vector_store"
vector_store.save_local(database_name)

# Stage 2: Production RAG Question-Answering System

This stage demonstrates the **"Retrieval + Generation"** phase of RAG:
1. **Load**: Restore saved vector database
2. **Retrieve**: Find relevant context for user questions
3. **Generate**: Use LLM + context to create answers
4. **Chain**: Combine everything into seamless pipeline

We'll build a complete movie Q&A system that can answer questions about Kalki 2898 AD using the knowledge we extracted and indexed in Stage 1.

In [12]:
vector_store = FAISS.load_local(
    database_name,                          # Database directory name
    embeddings,                             # Same embedding model used to create it
    allow_dangerous_deserialization=True    # Loads a pickle file; only enable for files you trust
)
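
Because loading the store deserializes a pickle, it is worth confirming that the saved files are the ones you wrote. Below is a minimal sketch of one possible safeguard: record a SHA-256 digest after saving and compare it before loading. The helper names are hypothetical, and the demo uses a temporary stand-in file named like the `index.pkl` that `save_local` writes by default.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file, read in blocks so large indexes need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def is_unmodified(path: Path, expected_digest: str) -> bool:
    """True only if the file exists and still matches the recorded digest."""
    return path.exists() and sha256_of(path) == expected_digest


# Demo with a temporary stand-in file, not a real index.
with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "index.pkl"
    pickle_path.write_bytes(b"stand-in bytes, not a real index")
    recorded = sha256_of(pickle_path)  # record right after saving
    ok_before = is_unmodified(pickle_path, recorded)
    pickle_path.write_bytes(b"tampered")  # simulate a modified file
    ok_after = is_unmodified(pickle_path, recorded)

print(ok_before, ok_after)
```

A digest check only detects changes made after the digest was recorded, so the recorded value itself needs to be stored somewhere the index files are not.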

In [13]:
prompt_template = ChatPromptTemplate.from_template(
    """You are a helpful, comic assistant that helps users find answers in a fun way, based on the given context.
       
       IMPORTANT RAG INSTRUCTIONS:
       - Use ONLY the information provided in the context below
       - If the context doesn't contain enough information to answer the question, say: "I don't have enough info on this :("
       - Do not make up information or use knowledge outside the provided context
       - Be helpful and engaging while staying factually accurate
       
       ----
       Context: {context}
       ----
       Question: {question}
       
       Remember: Base your answer ONLY on the context provided above!
    """
)

### Building the Context-Aware Chain

This chain configuration includes important optimizations:
- **Limited Retrieval**: `k=2` prevents context overflow
- **Smart Retrieval**: Gets most relevant chunks only
- **Token Management**: Stays within model limits
- **Quality Balance**: Maintains answer quality with less context

In [14]:
chain = (
    {
        # Retrieval: Convert vector store to retriever and get relevant context
        # k=2 limits to 2 most relevant chunks to avoid context overflow
        "context": vector_store.as_retriever(search_kwargs={"k": 2}),
        
        # Pass-through: Forward the question unchanged to the prompt
        "question": RunnablePassthrough(),
    }
    | prompt_template        # Format the prompt with context and question
    | llm                   # Generate response using LM Studio
    | StrOutputParser()      # Extract the response text as a plain string
)
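
In the chain above, the retriever hands the prompt a list of document objects, so the `{context}` slot is filled with their string representation, metadata included. A common refinement is a small formatting helper that passes only the text. The sketch below assumes documents expose a `page_content` attribute, as the search results earlier in this notebook do; `format_docs` is a hypothetical name.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for LangChain Document objects; only `page_content` is used here.
fake_docs = [
    SimpleNamespace(page_content="First retrieved chunk."),
    SimpleNamespace(page_content="Second retrieved chunk."),
]
print(format_docs(fake_docs))

# In the chain, the function could be piped after the retriever, for example:
#   "context": vector_store.as_retriever(search_kwargs={"k": 2}) | format_docs
```

Dropping the metadata from the prompt also saves tokens, which matters when the context window is as tight as described earlier.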

## Step 3: Testing the Complete RAG System

Now we'll test our production RAG system with various types of questions to demonstrate different scenarios:

### Test Categories:
1. **Information Available**: Questions answerable from our knowledge base
2. **Information Unavailable**: Questions about different topics  
3. **Partial Information**: Questions with limited context
4. **Specific Details**: Questions requiring precise retrieval
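
These categories can be turned into a small, repeatable check. The sketch below is an illustration only: `run_checks` and `stub_ask` are hypothetical, the stub stands in for `chain.invoke`, and the fallback string is the one defined in the prompt template.

```python
FALLBACK = "I don't have enough info on this :("


def run_checks(ask, cases):
    """Call `ask(question)` for each case and record whether fallback behaviour matched."""
    results = []
    for question, expect_fallback in cases:
        answer = ask(question)
        got_fallback = FALLBACK in answer
        results.append((question, got_fallback == expect_fallback))
    return results


def stub_ask(question: str) -> str:
    """Hypothetical stand-in for `chain.invoke`; answers only questions that mention Kalki."""
    if "kalki" in question.lower():
        return "According to the context, here is what I found."
    return FALLBACK


cases = [
    ("who all are main actors in Kalki 2898 AD?", False),  # answerable from the knowledge base
    ("who all are main actors in Delhi6?", True),           # should trigger the fallback
]
for question, passed in run_checks(stub_ask, cases):
    print("PASS" if passed else "FAIL", "-", question)
```

Passing `chain.invoke` in place of `stub_ask` would run the same cases against the real chain, though answers from an LLM can vary between runs.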

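### Test 1: Information in Knowledge Base
Testing with a question about the movie's cast, which the indexed page covers.

In [15]: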
question = "who all are main actors in Kalki 2898 AD?"
response = chain.invoke(input=question)
print(response)

print("="*70)
print("✅ Analysis:")
print("   - Chain retrieved relevant context about cast")
print("   - LLM generated answer based only on retrieved information")
print("   - Response maintains engaging tone while being factual")
print("   - Information is directly sourced from our vector database")

Okay, let's see who stars in "Kalki 2898 AD" according to what I've got here!

Based on the text, the main actors are:

*   **Prabhas**
*   **Deepika Padukone**
*   **Kamal Haasan**

The text also mentions that **Sobhita Dhulipala** dubbed for Deepika Padukone in the Telugu version.

Hope this helps! ✨
✅ Analysis:
   - Chain retrieved relevant context about cast
   - LLM generated answer based only on retrieved information
   - Response maintains engaging tone while being factual
   - Information is directly sourced from our vector database


### Test 2: Information NOT in Knowledge Base
Testing with a different movie (Delhi-6) that's not in our database. This should trigger our fallback response.

In [16]:
response = chain.invoke(input="who all are main actors in Delhi6?")
print(response)

I don't have enough info on this :(

The context I have is all about "Kalki 2898 AD", not a movie called "Delhi6". I can tell you who the main actors are in *Kalki 2898 AD* though! They include:

*   Prabhas
*   Deepika Padukone
*   Amitabh Bachchan






### Test 3: Specific Information Retrieval
Testing for specific details like release date that require precise information extraction.

In [17]:
response = chain.invoke(input="when Kalki 2898 AD movie released?")
print(response)

Okay, buckle up for some movie release info! 🎬

According to my intel (aka the context you gave me), **Kalki 2898 AD** was released worldwide on **27 June 2024** in standard, IMAX and 3D formats.

It was *originally* scheduled for May 9th, but got postponed! Glad it finally made its debut though! 🎉



### Test 4: Complex Analysis Questions
Testing with questions that require analysis or interpretation of available information.

In [18]:
response = chain.invoke(input="was this movie over budget?")
print(response)

Hmm, that's a tricky one! The documents talk *a lot* about the movie – rights sold for record prices, advance bookings smashing records... but I don't see anything specifically mentioning if it went over budget. 

It seems like they were confident enough to sell those rights for big bucks, so maybe it stayed on track? But I can't say for sure! 🕵️‍♂️






## Summary

You've successfully built a complete production RAG system! You've mastered:

✅ **End-to-End RAG Pipeline**: From web data to AI answers  
✅ **Vector Database Management**: FAISS indexing and persistence  
✅ **Context Window Optimization**: Handling token limits effectively  
✅ **Semantic Search**: Finding relevant content automatically  
✅ **Production Considerations**: Real-world deployment challenges  

## Key Concepts Learned

- **FAISS Vector Database**: Fast, scalable similarity search
- **WebBaseLoader**: Extract and chunk web content automatically
- **Context Window Management**: Critical for production RAG systems
- **Retrieval Optimization**: Balancing relevance and token limits
- **Persistent Storage**: Save and reload vector databases

## RAG Architecture

### Stage 1: Indexing Pipeline
```
Web Content → Load → Split → Embed → Store → Persist
```

### Stage 2: Query Pipeline
```
User Question → Retrieve → Format → Generate → Respond
```


## Production Challenges and Solutions

### 1. Context Window Limitations
- **Problem**: Models have token limits (4096, 8192, etc.)
- **Solution**: Limit retrieval with `search_kwargs={"k": 2}`
- **Result**: Balanced context quality and token efficiency

### 2. Relevance vs Quantity
- **Problem**: More context isn't always better
- **Solution**: Smart retrieval focusing on most relevant chunks
- **Result**: Higher quality answers with focused context

### 3. Scalability
- **Problem**: Large document collections need efficient search
- **Solution**: FAISS vector database with semantic indexing
- **Result**: Fast similarity search across millions of documents

## Production Best Practices

- **Chunk Size**: Balance detail vs context window
- **Retrieval Count**: Start with k=2-3, adjust based on model capacity
- **Embedding Quality**: Use appropriate embedding models for domain
- **Error Handling**: Graceful degradation when context exceeds limits
- **Monitoring**: Track retrieval quality and response accuracy
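
As an illustration of the error-handling practice, the sketch below retries with fewer retrieved chunks when the context is too large. It rests on stated assumptions: `ContextOverflowError` is a hypothetical exception standing in for whatever error your provider raises, and `stub_ask` stands in for building and invoking a chain with a given `k`.

```python
class ContextOverflowError(Exception):
    """Hypothetical error type standing in for the provider's context-length error."""


def answer_with_backoff(ask, question: str, k_values=(3, 2, 1)):
    """Try progressively smaller k until the call fits; return (answer, k_used)."""
    last_error = None
    for k in k_values:
        try:
            return ask(question, k), k
        except ContextOverflowError as error:
            last_error = error  # too much context; retry with fewer chunks
    raise RuntimeError("no retrieval size fit the context window") from last_error


def stub_ask(question: str, k: int) -> str:
    """Stand-in for invoking a chain built with `search_kwargs={"k": k}`."""
    if k > 2:
        raise ContextOverflowError(f"k={k} exceeds the context window")
    return f"answer built from {k} chunks"


answer, k_used = answer_with_backoff(stub_ask, "when was the film released?")
print(answer, f"(k={k_used})")
```

Logging `k_used` for each request is also a cheap monitoring signal: frequent fallbacks suggest the chunk size or the default `k` needs adjusting.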