# Advanced RAG with Cloud Vector Databases: Milvus Integration

## Overview
This notebook demonstrates an advanced RAG implementation built on a cloud-hosted vector database. You'll build a personal knowledge base from blog content, using Milvus (hosted on Zilliz Cloud) for scalable vector storage and retrieval.

## Learning Objectives
- Integrate cloud vector databases (Milvus/Zilliz) with RAG systems
- Process multiple web sources simultaneously
- Build personal knowledge bases from blog content
- Compare different vector database solutions (FAISS vs Milvus vs Azure Search)
- Implement production-grade RAG with cloud scalability

## Key Concepts
- **Cloud Vector Databases**: Scalable, managed vector storage
- **Milvus/Zilliz Cloud**: Open-source vector database with cloud service
- **Multi-Source RAG**: Processing content from multiple URLs
- **Personal Knowledge Base**: AI system grounded in a specific author's content at query time (retrieval, not model training)
- **Production Scalability**: Cloud-based solutions for enterprise use

## Step 1: Installing Advanced Dependencies

This notebook requires additional packages for cloud vector database integration:
- **langchain-milvus**: Official Milvus integration for LangChain
- **Standard packages**: faiss-cpu, beautifulsoup4, langchain-community

### Vector Database Options Demonstrated:
1. **Milvus/Zilliz Cloud** (active) - Scalable cloud vector database
2. **Azure AI Search** (commented) - Microsoft's vector search service
3. **FAISS** (from previous notebooks) - Local vector storage

In [None]:
# extra libs
# ! pip install faiss-cpu beautifulsoup4 langchain-community
# ! pip install langchain-milvus
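Before running the later cells, it can help to confirm that the packages are importable. The following is a minimal sketch: the `REQUIRED_MODULES` mapping and the `missing_packages` helper are hypothetical names, and the module list is an assumption based on the imports used further down in this notebook.

```python
from importlib.util import find_spec

# Import name -> pip name. Assumed from the imports used later in this notebook.
REQUIRED_MODULES = {
    "langchain_milvus": "langchain-milvus",
    "langchain_community": "langchain-community",
    "langchain_openai": "langchain-openai",
    "bs4": "beautifulsoup4",
    "dotenv": "python-dotenv",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names of packages whose top-level module cannot be found."""
    return [pip_name for module, pip_name in required.items() if find_spec(module) is None]


missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Install before continuing: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```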

## Step 2: Environment Configuration Options

This notebook demonstrates multiple vector database configurations:
- **Azure OpenAI + Azure AI Search** (commented out) - Full Microsoft stack
- **LM Studio + Milvus Cloud** (active) - Hybrid local/cloud setup

We use the hybrid approach for cost-effective learning while demonstrating cloud scalability.

In [2]:
# # with Azure OpenAI and LangChain

# from langchain_openai import AzureOpenAIEmbeddings
# from langchain_community.vectorstores import FAISS
# from langchain_community.document_loaders import WebBaseLoader
# from langchain_openai import AzureChatOpenAI              # Azure OpenAI LLM
# from langchain_core.prompts import ChatPromptTemplate     # Structured prompts
# from langchain_core.output_parsers import StrOutputParser # Clean output parsing
# from langchain_core.runnables import RunnablePassthrough  # Data passing utilities

# # Environment and configuration
# from dotenv import load_dotenv
# import os

# # Load environment variables from .env file
# load_dotenv()

# embeddings = AzureOpenAIEmbeddings(
#     deployment="text-embedding-3-large",          # Deployment name in Azure
#     azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),      # Your Azure endpoint
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),      # API version
#     api_key=os.getenv("AZURE_OPENAI_API_KEY"),              # Authentication key
# )

# llm = AzureChatOpenAI(
#     azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),          
#     azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"), 
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),          
#     api_key=os.getenv("AZURE_OPENAI_API_KEY"),                  
# )

### Option 1: Azure OpenAI Configuration (Commented Out)
Production setup with high-quality embeddings and large language models.

### Option 2: LM Studio Configuration (Active)
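Local development setup. `DeterministicFakeEmbedding` produces repeatable pseudo-random vectors rather than semantic ones, so it is useful for learning the RAG plumbing without an embedding service, but retrieval rankings will not reflect meaning.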

In [2]:
# with local LM Studio and LangChain

from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_milvus import Milvus
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from dotenv import load_dotenv
import os

load_dotenv()

embeddings = DeterministicFakeEmbedding(size=4096)

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # LM Studio doesn't require an API key
)
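To see what a deterministic fake embedding implies for retrieval, here is a small stand-in written with the standard library only. It is an illustration of the idea (a vector seeded from a hash of the text), not the library's actual implementation; `toy_fake_embedding` is a hypothetical name.

```python
import hashlib
import random


def toy_fake_embedding(text: str, size: int = 8) -> list[float]:
    """Deterministic pseudo-random vector derived from a hash of the text.

    Illustrative stand-in only; not LangChain's implementation.
    """
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = toy_fake_embedding("data engineering with Python")
same_b = toy_fake_embedding("data engineering with Python")
near = toy_fake_embedding("data engineering with python")  # only the case differs

print(same_a == same_b)  # identical text gives an identical vector
print(same_a == near)    # a tiny wording change gives an unrelated vector
```

The practical consequence: indexing and querying are repeatable, but closeness between vectors says nothing about closeness in meaning. Swapping in a real embedding model is what makes the search semantic.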

## Step 3: Multi-Source Data Loading

This section demonstrates loading content from multiple blog posts to create a comprehensive personal knowledge base.

### Real-World Application:
- **Personal Blog Archive**: Index an author's complete blog collection
- **Company Knowledge Base**: Process multiple documentation sources
- **Research Collection**: Combine papers, articles, and reports
- **Product Documentation**: Aggregate guides, tutorials, and references

### Data Sources:
The notebook uses actual blog posts covering:
1. **Data Engineering**: Python preferences and best practices
2. **AI/ML Models**: Comparative analysis of AI systems
3. **Software Architecture**: Clean architecture principles
4. **ETL/ELT**: Data pipeline methodologies

In [3]:
web_pages = ("https://blog.dataengineerthings.org/why-i-love-python-as-data-engineer-b78875f6a566",
             "https://maheshwarineeraj.medium.com/5-ai-models-vs-a-self-referential-paradox-who-nailed-it-4a97a4832739",
             "https://maheshwarineeraj.medium.com/clean-architecture-meets-data-engineering-adfaf92cd53c?sk=0f797f27874d1045998da592018b4eb1",
             "https://maheshwarineeraj.medium.com/etl-elt-or-something-better-470a7082c25c?sk=e8579e1965fddb3d18b7150f7d6747df"
             )
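When the URL list grows, a quick sanity check before loading can catch typos and duplicates early. A minimal sketch, assuming a hypothetical helper `validate_urls` and placeholder example URLs:

```python
from urllib.parse import urlparse


def validate_urls(urls):
    """Return unique http(s) URLs in their original order; raise on anything else."""
    seen = set()
    cleaned = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not a usable web URL: {url!r}")
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


# Placeholder URLs for illustration only.
example_pages = (
    "https://example.com/post-one",
    "https://example.com/post-two",
    "https://example.com/post-one",  # duplicate, dropped below
)
print(validate_urls(example_pages))
```

The cleaned list could then be passed to the loader in place of the raw tuple.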

## Step 4: Document Processing and Chunking

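`WebBaseLoader` accepts a list of URLs and loads them in a single call. `load_and_split()` then splits the pages into chunks; when no splitter is passed, it falls back to LangChain's default `RecursiveCharacterTextSplitter`.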

### Multi-URL Processing Benefits:
- **Batch Efficiency**: Load multiple sources in one operation
- **Consistent Processing**: Uniform chunking across all sources
- **Metadata Preservation**: Track source URLs for each chunk
- **Content Aggregation**: Build comprehensive knowledge bases

### Chunking Strategy:
- **Intelligent Splitting**: Respects paragraph and sentence boundaries
- **Overlap Management**: Maintains context between chunks
- **Size Optimization**: Balances detail with retrieval efficiency
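The role of overlap is easiest to see in a simplified splitter. The sketch below uses fixed-size character windows, which is an assumption made for clarity: a real splitter also tries to break on paragraph and sentence boundaries. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk repeats the last character of the previous one, so a sentence cut at a boundary still appears with some of its surrounding context in the next chunk.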

In [4]:
loader = WebBaseLoader(web_pages)
# documents = loader.load()
chunked_documents = loader.load_and_split()

In [5]:
print(f"🌐 Total chunks from webpage: {len(chunked_documents)}")
print(chunked_documents[7])

🌐 Total chunks from webpage: 8
page_content='ETL, ELT… or Something Better?. Is ETLT the Future in Data Processing? | by Neeraj Maheshwari | Data Engineer ThingsSitemapOpen in appSign upSign inMedium LogoWriteSign upSign inData Engineer Things·Things learned in our data engineering journey and ideas on data and engineering.Member-only storyETL, ELT… or Something Better?Is ETLT the Future in Data Processing?Neeraj Maheshwari5 min read·Feb 10, 2025--ShareZoom image will be displayedimage by authorNon members, read full article using this link.If you’ve ever come across the acronyms ETL and ELT and wondered if someone was just messing around with the alphabet, you’re not alone. These two approaches (and there is a new one) define how we move, clean, and store data to generate actionable insights.Raw data is like crude oil, valuable, but useless in its unprocessed form. It needs to be refined, structured, and made actionable. That’s where data processing techniques ETL, ELT, and ETLT come 
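Since each chunk carries its origin in `metadata`, it is worth checking how the chunks are distributed across the source pages. A minimal sketch, using a small `Doc` dataclass as a stand-in for LangChain's document objects and assuming the loader stores the URL under the `"source"` metadata key:

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunks_per_source(docs) -> Counter:
    """Count how many chunks came from each source URL."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)


# Placeholder data for illustration only.
sample = [
    Doc("first part", {"source": "https://example.com/post-one"}),
    Doc("second part", {"source": "https://example.com/post-one"}),
    Doc("another post", {"source": "https://example.com/post-two"}),
]
for source, count in chunks_per_source(sample).items():
    print(f"{count} chunk(s) from {source}")
```

In the notebook, the same function could be called on `chunked_documents`, since those objects expose the same two attributes.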

## Step 5: Cloud Vector Database Configuration

This section demonstrates different cloud vector database options for production RAG systems.

### Vector Database Comparison:

#### **Milvus/Zilliz Cloud** (Active):
- **Open Source**: Milvus with managed cloud service (Zilliz)
- **Scalability**: Handles billions of vectors
- **Performance**: Optimized for similarity search
- **Cost**: Free tier available, pay-as-you-scale

#### **Azure AI Search** (Commented):
- **Microsoft Integration**: Native Azure ecosystem support
- **Hybrid Search**: Combines vector and keyword search
- **Enterprise Features**: Security, compliance, monitoring
- **Pricing**: Based on search units and storage

### Why Cloud Vector Databases?
- **Scalability**: Handle massive document collections
- **Reliability**: Managed infrastructure and backups  
- **Performance**: Optimized hardware and indexing
- **Collaboration**: Shared access across teams

In [None]:
# for Azure AI Search
# from langchain_community.vectorstores.azuresearch import AzureSearch

# vector_store: AzureSearch = AzureSearch(
#     azure_search_endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
#     azure_search_key=os.getenv("AZURE_SEARCH_API_KEY"),
#     index_name=os.getenv("AZURE_SEARCH_INDEX_NAME"),
#     embedding_function=embeddings.embed_query,
# )

In [6]:
# for Milvus (Zilliz Cloud offers a free tier)
vector_store = Milvus(
    embedding_function=embeddings,
    collection_name=os.getenv("MILVUS_COLLECTION"),
    connection_args={"uri": os.getenv("MILVUS_URI"), "token": os.getenv("MILVUS_TOKEN")}
)
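The connection above depends on three environment variables. If any is missing, `os.getenv` returns `None` and the failure surfaces later as a less obvious connection error. A minimal sketch of an early check, with `missing_env_vars` as a hypothetical helper:

```python
import os

# Variable names taken from the Milvus configuration cell above.
REQUIRED_ENV_VARS = ("MILVUS_URI", "MILVUS_TOKEN", "MILVUS_COLLECTION")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing_vars = missing_env_vars(REQUIRED_ENV_VARS)
if missing_vars:
    print("Set these in your .env file first: " + ", ".join(missing_vars))
```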

## Step 6: Vector Database Indexing

The `add_documents()` method embeds each chunk with the configured embedding function, then uploads the vectors, text, and metadata to the cloud vector database.

### Cloud Indexing Process:
1. **Embedding Generation**: Convert text to vectors on the client, using the configured embedding function
2. **Document Upload**: Send vectors, text, and metadata to the cloud service
3. **Index Creation**: Build optimized search structures
4. **Persistence**: Store vectors in managed cloud storage

### Production Advantages:
- **Automatic Scaling**: Handle varying document loads
- **Backup & Recovery**: Managed data protection
- **Global Access**: Available from anywhere
- **Performance Optimization**: Hardware-accelerated indexing

In [7]:
vector_store.add_documents(documents=chunked_documents)



[459940022183237943,
 459940022183237944,
 459940022183237945,
 459940022183237946,
 459940022183237947,
 459940022183237948,
 459940022183237949,
 459940022183237950]
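Eight chunks fit comfortably in one call, but larger collections are usually uploaded in batches. A minimal sketch of a batching helper; `in_batches` is a hypothetical name, and the batch size in the usage comment is an arbitrary assumption rather than a Milvus limit.

```python
def in_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Possible use with the objects from this notebook (not executed here):
# for batch in in_batches(chunked_documents, 100):
#     vector_store.add_documents(documents=batch)

print(list(in_batches(list(range(5)), 2)))
```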

## Step 7: Testing Semantic Search Capabilities

Before building the complete RAG chain, we test the vector database's search capabilities with different query types.

### Search Test Strategy:
1. **Author Information**: Finding content about the blog author
2. **Topic-Based Search**: Locating specific subject matter (AI/GenAI)
3. **Cross-Reference**: Connecting related concepts across posts

### Why Test Separately?
- **Validation**: Ensure proper indexing and retrieval
- **Optimization**: Tune search parameters (k value)
- **Understanding**: See what context will be retrieved
- **Debugging**: Identify potential issues before RAG integration

In [None]:
# Test 1: Search for author information
query = "who wrote these blogs?"
results = vector_store.similarity_search(query, k=3)  # Get top 3 most similar chunks

for i, result in enumerate(results, 1):
    print(f"📄 CHUNK {i}:")
    print(f"Content: {result.page_content}")
    print(f"Source: {result.metadata}")
    print("-" * 80)

print("\n💡 With a real embedding model, these chunks are ranked by semantic similarity to the query")

📄 CHUNK 1:
Content: Why I Love Python as Data Engineer | by Neeraj Maheshwari | Data Engineer ThingsSitemapOpen in appSign upSign inMedium LogoWriteSign upSign inData Engineer Things·Follow publicationThings learned in our data engineering journey and ideas on data and engineering.Follow publicationMy Experience with Python in Data EngineeringWhy I Love Python as Data EngineerPython is fun to useNeeraj Maheshwari4 min read·Jan 27, 2025--ListenShareZoom image will be displayedAmong data engineers, there’s always a lively debate: which programming language is best suited for data engineering use cases? The discussion becomes even more interesting when it comes to big data — especially the question of which language to use for writing Spark code. Over the course of my engineering journey, I’ve worked with several programming languages and each language is brilliant in its own way. But over the time Python became my most preferred language for practically every data engineering and even my

In [9]:
# Test 2: Search for AI/GenAI topics
query = "is there any blog on AI or GenAI"
results = vector_store.similarity_search(query, k=3)

for i, result in enumerate(results, 1):
    print(f"📄 CHUNK {i}:")
    print(f"Content: {result.page_content}")
    print(f"Source: {result.metadata}")
    print("-" * 80)

📄 CHUNK 1:
Content: That's six words. Wait, the actual count depends on how I phrase it. So, I need to structure the response so that the count is accurate. For example, if I say "There are X words in this response," then X should be the correct number of words in that exact sentence.But how do I ensure that? Because if I first write the sentence and then count the words, but the sentence includes the count, which depends on the number of words. That's a bit of a paradox. Wait, no. Let me think. If the response is only the number, like "Five." Then that's one word. But maybe the user expects a full sentence. Let me check typical responses to such questions.Usually, when someone asks for the word count of the response, the assistant might respond with something like, "This response contains X words." Then X is the number. So, the structure is fixed: "This response contains X words." Let's count those words. "This" (1), "response" (2), "contains" (3), "X" (4), "words" (5). So, five words
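The ranking step behind a similarity search can be pictured as scoring every stored vector against the query vector and keeping the top `k`. The sketch below assumes cosine similarity as the metric; the metric a Milvus collection actually uses depends on how its index is configured. `cosine_similarity` and `top_k` are hypothetical helpers, and the vectors are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k):
    """Return the labels of the k stored vectors most similar to the query.

    `stored` is a list of (label, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy two-dimensional vectors for illustration only.
stored_vectors = [("chunk-a", [1.0, 0.0]), ("chunk-b", [0.0, 1.0]), ("chunk-c", [1.0, 1.0])]
print(top_k([1.0, 0.1], stored_vectors, k=2))
```

A production vector database avoids this exhaustive scan by using approximate nearest-neighbor indexes, but the ranking principle is the same.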

## Step 8: Building Production RAG Chain

Now we create the complete RAG system that combines cloud vector retrieval with local LLM generation.

### Hybrid Architecture Benefits:
- **Cost Optimization**: Cloud storage + local processing
- **Privacy Control**: Prompts and generated answers stay on the local machine (the indexed chunks still live in the cloud store)
- **Scalability**: Cloud handles large document collections
- **Flexibility**: Mix and match best services for each component

### Chain Configuration:
- **Retrieval**: Cloud vector database (Milvus)
- **Generation**: Local LLM (LM Studio)
- **Context Management**: Limited to 2 chunks for token efficiency

In [10]:
prompt_template = ChatPromptTemplate.from_template(
    """You are a helpful, humorous assistant that helps users find answers in a fun way, based on the given context.
       
       IMPORTANT RAG INSTRUCTIONS:
       - Use ONLY the information provided in the context below
       - If the context doesn't contain enough information to answer the question, say: "I don't have enough info on this :("
       - Do not make up information or use knowledge outside the provided context
       - Be helpful and engaging while staying factually accurate
       
       ----
       Context: {context}
       ----
       Question: {question}
       
       Remember: Base your answer ONLY on the context provided above!
    """
)

In [11]:
chain = (
    {
        # Retrieval: Convert vector store to retriever and get relevant context
        # k=2 limits to 2 most relevant chunks to avoid context overflow
        "context": vector_store.as_retriever(search_kwargs={"k": 2}),
        
        # Pass-through: Forward the question unchanged to the prompt
        "question": RunnablePassthrough(),
    }
    | prompt_template        # Format the prompt with context and question
    | llm                   # Generate response using LM Studio
    | StrOutputParser()      # Extract the response text as a plain string
)
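In this chain the retriever's output, a list of document objects, is placed into `{context}` as is, so the prompt receives their raw string representation. A common refinement is to join only the chunk text first. The sketch below uses a small `Doc` dataclass as a stand-in for LangChain's document objects; `format_docs` is a hypothetical helper, not part of the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join retrieved chunks into one context string, labeling each with its source."""
    parts = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown source")
        parts.append(f"[{i}] Source: {source}\n{doc.page_content.strip()}")
    return "\n\n".join(parts)


# Placeholder chunks for illustration only.
retrieved = [
    Doc("  Python is fun to use.  ", {"source": "https://example.com/post-one"}),
    Doc("ETL and ELT differ in when data is transformed."),
]
print(format_docs(retrieved))
```

One way to wire this in, assuming LangChain's usual coercion of plain functions in a pipe, would be `vector_store.as_retriever(search_kwargs={"k": 2}) | format_docs` as the `"context"` entry.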

## Step 9: Comprehensive RAG System Testing

Testing our personal knowledge base RAG system with various question types to demonstrate its capabilities.

### Test Categories:
1. **Personal Information**: Questions about the author
2. **Content Inventory**: What blogs/topics are available
3. **Professional Context**: Work and expertise details
4. **Specific Content**: Detailed summaries of particular posts

### Real-World Applications:
- **Personal Assistant**: Answer questions about your work/writing
- **Company Knowledge Base**: Team expertise and project history
- **Research Repository**: Academic papers and findings
- **Product Documentation**: Feature explanations and tutorials

In [12]:
question = "tell me about the author"
response = chain.invoke(input=question)
print(response)

print("="*70)
print("✅ Analysis:")
print("   - Chain retrieved relevant context about the author")
print("   - LLM generated answer based only on retrieved information")
print("   - Response maintains engaging tone while being factual")
print("   - Information is directly sourced from our vector database")

Based on the text, the author is **Neeraj Maheshwari**. They wrote an article titled "5 AI Models vs. a Self-Referential Paradox: Who Nailed It?" published on Medium on March 17, 2025. They also experimented with asking several AI models a tricky question about word count and documented their responses! 

✅ Analysis:
   - Chain retrieved relevant context about the author
   - LLM generated answer based only on retrieved information
   - Response maintains engaging tone while being factual
   - Information is directly sourced from our vector database


In [13]:
response = chain.invoke(input="what all blogs he has written?")
print(response)

Okay, buckle up, let's see what Neeraj Maheshwari has been writing about! 

According to the text, he's written:

*   **Data Engineering articles** (specifically mentions "other Data Engineering articles")
*   **AI articles**
*   **Humor articles**
*   **Paradox articles**

He also has a blog called **MyEngineeringBlogs** with 9 stories. 😄


In [14]:
response = chain.invoke(input="where does he works?")
print(response)

According to the text, Neeraj Maheshwari is a Data Engineer & Architect and his LinkedIn profile is www.linkedin.com/in/neerajmaheshwari9. He also publishes articles on "Data Engineer Things". 

So, while it doesn't state *where* he physically works, we know his professional identity and where to find more info about him! 💻✨



In [15]:
response = chain.invoke(input="does he work at Walmart?")
print(response)

I don't have enough info on this :(

The context talks about Neeraj Maheshwari's experience with programming languages (Python, Java, Scala, SQL) and data engineering tasks. It doesn't mention anything about him working at Walmart. 
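The refusal above is the behavior the prompt asks for when the context lacks an answer. When testing many questions, that behavior can be checked automatically. A minimal sketch, assuming the fallback sentence from the prompt template and a hypothetical helper `is_fallback`:

```python
# Fallback sentence copied from the prompt template in Step 8.
FALLBACK = "I don't have enough info on this :("


def is_fallback(response: str) -> bool:
    """Return True if the response contains the fallback sentence.

    Curly apostrophes are normalized because models sometimes emit them.
    """
    normalized = response.replace("\u2019", "'").lower()
    return FALLBACK.lower() in normalized


print(is_fallback("I don't have enough info on this :(\n\nThe context talks about Python."))
print(is_fallback("According to the text, he is a Data Engineer & Architect."))
```

Pairing questions that should be answerable with ones that should not gives a quick regression check whenever the prompt, `k`, or the embedding model changes.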






### Advanced Query: Content Summarization
This test demonstrates the system's ability to find and summarize specific blog content, showcasing the power of semantic search combined with AI generation.

In [18]:
response = chain.invoke(input="summarize me the blog he had published on AI models")
print(response)

Okay, buckle up for a wild ride into the mind of AI... according to Neeraj Maheshwari!

He threw a simple question at ChatGPT – "How many words are in the response to this?" – and it spiraled into a self-referential paradox! The AI got *really* focused on getting the word count right, even structuring its responses to state the correct number of words (which ended up being five!). 

He also tested Microsoft Copilot and Meta AI. Copilot didn't respond at all, and Meta AI wasn’t accurate.

Basically, it was an experiment in how AI thinks (and sometimes fails) when faced with a question about itself. It got so serious, ChatGPT started giving word counts for *every* response! 😂

He concluded that the AI structures its responses to state the correct number of words, making the count accurate.






## Summary

You've successfully built an advanced RAG system with cloud vector databases! You've mastered:

✅ **Cloud Vector Integration**: Connected to Milvus/Zilliz cloud services  
✅ **Multi-Source Processing**: Loaded content from multiple blog URLs  
✅ **Personal Knowledge Base**: Built a retrieval-backed assistant over one author's content  
✅ **Hybrid Architecture**: Combined cloud storage with local processing  
✅ **Production Scalability**: Used a managed vector store that can scale beyond a local index  

## Key Concepts Learned

- **Milvus/Zilliz Cloud**: Scalable vector database with managed service
- **Multi-URL Processing**: Efficient batch loading from web sources
- **Cloud vs Local Trade-offs**: Balancing cost, privacy, and scalability
- **Personal Knowledge Systems**: AI grounded in a specific individual's content through retrieval
- **Production Vector Databases**: Enterprise considerations and features

## Advanced RAG Architecture

### Cloud-Hybrid Pipeline:
```
Multiple URLs → Web Loading → Document Chunking → Cloud Vector Storage → Local LLM Generation
```

### Component Distribution:
- **Data Storage**: Cloud vector database (scalable, persistent)
- **Processing**: Local LLM (privacy, cost control)
- **Retrieval**: Cloud semantic search (performance, reliability)

## Vector Database Comparison

### **Local Solutions** (FAISS):
- ✅ **Privacy**: Complete data control
- ✅ **Cost**: No ongoing fees
- ❌ **Scalability**: Limited by local hardware
- ❌ **Collaboration**: Single-user access

### **Cloud Solutions** (Zilliz Cloud/Azure AI Search):
- ✅ **Scalability**: Handle massive collections
- ✅ **Reliability**: Managed backups and uptime
- ✅ **Performance**: Optimized infrastructure
- ✅ **Collaboration**: Multi-user access
- ❌ **Privacy**: Data sent to cloud
- ❌ **Cost**: Ongoing service fees

## Production Considerations

### **Choosing Vector Databases**:
- **Small Projects**: Local FAISS for simplicity
- **Medium Scale**: Managed cloud services (Zilliz, Pinecone)
- **Enterprise**: Azure Search, AWS Kendra for integration
- **Open Source**: Self-hosted Milvus for control

### **Cost Optimization**:
- **Free Tiers**: Start with Zilliz, Pinecone free plans
- **Hybrid Approach**: Cloud storage + local processing
- **Batch Processing**: Minimize API calls during indexing
- **Smart Retrieval**: Optimize k values and filtering
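One way to reason about the choice of `k` is as a context budget: keep the best-ranked chunks until the budget is used up. The sketch below counts characters as a rough proxy for tokens, which is an assumption; `fit_to_budget` is a hypothetical helper.

```python
def fit_to_budget(chunks, max_chars):
    """Keep chunks in ranked order until adding the next one would exceed max_chars."""
    selected = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += len(chunk)
    return selected


# Toy chunks, already sorted from most to least relevant.
ranked_chunks = ["aaaa", "bbb", "cc"]
print(fit_to_budget(ranked_chunks, max_chars=7))
```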

### **Security & Privacy**:
- **Data Classification**: Identify sensitive content
- **Encryption**: In-transit and at-rest protection
- **Access Control**: User permissions and authentication
- **Compliance**: GDPR, HIPAA, SOC2 requirements

## Real-World Applications

### **Personal Use Cases**:
- **Blog Archive**: Searchable personal writing history
- **Research Notes**: Academic paper collections
- **Learning Journal**: Course materials and insights
- **Project Documentation**: Personal project history

### **Enterprise Applications**:
- **Company Knowledge**: Employee expertise and experience
- **Customer Support**: Product documentation and FAQs
- **Sales Enablement**: Competitive analysis and case studies
- **HR Systems**: Policy documents and training materials

## Next Steps for Advanced RAG

### **Enhanced Features**:
- **Multi-Modal RAG**: Include images, videos, documents
- **Semantic Filtering**: Advanced query refinement
- **Reranking**: Improve retrieval quality with secondary models
- **Streaming Responses**: Real-time answer generation

### **Production Deployment**:
- **API Development**: REST/GraphQL interfaces
- **Monitoring**: Performance and accuracy tracking
- **A/B Testing**: Compare different RAG configurations
- **Continuous Learning**: Update knowledge base automatically

**Congratulations! You now have the expertise to build production-scale RAG systems with cloud vector databases!**