# Lesson 5: Document RAG with LlamaIndex

In this lesson, we'll learn how to create a Retrieval-Augmented Generation (RAG) system that can process and query documents using LlamaIndex. This approach lets our AI agent answer questions grounded in specific document content, which suits document analysis, knowledge bases, and content-specific Q&A systems.

We'll build a complete document RAG solution that can handle multiple document formats and provide accurate, source-attributed responses.

## Lesson Objectives

By the end of this lesson, you will be able to:

1. Set up and configure LlamaIndex for document processing
2. Load and index various document formats (text, PDF, Word, etc.)
3. Create vector embeddings for semantic search
4. Build a basic RAG pipeline for document Q&A
5. Implement advanced retrieval strategies and filtering
6. Use LlamaIndex agents for complex document analysis
7. Create a conversational interface for document exploration
8. Handle multi-document queries and source attribution

## 1. Environment Setup

First, let's set up our environment with the necessary libraries for document RAG.

If you're opening this notebook in Colab, you will probably need to install LlamaIndex and its Azure OpenAI integrations first:
- !pip install llama-index-embeddings-azure-openai
- !pip install llama-index-llms-azure-openai
- !pip install llama-index

In [None]:
# Install required packages if needed
# !pip install llama-index openai python-dotenv tiktoken chromadb

In [None]:
# Import required libraries
import os
import sys
from pathlib import Path
from dotenv import load_dotenv, find_dotenv

# LlamaIndex imports
from llama_index.core import (
    VectorStoreIndex, 
    SimpleDirectoryReader, 
    StorageContext, 
    load_index_from_storage,
    Settings
)
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
import logging

# logging.basicConfig(
#     stream=sys.stdout, level=logging.INFO
# )  # logging.DEBUG for more verbose output
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


# Additional utilities
from IPython.display import display, Markdown
import json

# Load environment variables
load_dotenv(find_dotenv())

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


In [2]:
# Get API keys and configuration
API_KEY = os.environ.get("AZURE_OPENAI_KEY") 
API_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
API_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
MODEL = os.environ.get("AZURE_OPENAI_MODEL")

# Embedding configuration
EMBEDDINGS_API_KEY = os.environ.get("AZURE_OPENAI_EMBEDDINGS_API_KEY")
EMBEDDINGS_API_ENDPOINT = os.environ.get("AZURE_OPENAI_EMBEDDINGS_ENDPOINT")
EMBEDDINGS_DEPLOYMENT_NAME = os.environ.get("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME")
EMBEDDINGS_API_VERSION = os.environ.get("AZURE_OPENAI_EMBEDDINGS_API_VERSION")
EMBEDDINGS_MODEL = os.environ.get("AZURE_OPENAI_EMBEDDINGS_MODEL")

In [3]:
llm = AzureOpenAI(
    model=MODEL,
    deployment_name=AZURE_DEPLOYMENT,
    api_key=API_KEY,
    azure_endpoint=API_ENDPOINT,
    api_version=API_VERSION,
)

# You need to deploy your own embedding model as well as your own chat completion model
embed_model = AzureOpenAIEmbedding(
    model=EMBEDDINGS_MODEL,
    deployment_name=EMBEDDINGS_DEPLOYMENT_NAME,
    api_key=EMBEDDINGS_API_KEY,
    azure_endpoint=EMBEDDINGS_API_ENDPOINT,
    api_version=EMBEDDINGS_API_VERSION,
)

In [4]:


# Check if the necessary API keys are available
if not API_KEY:
    print("⚠️ Azure OpenAI API key not found. Please set the AZURE_OPENAI_KEY environment variable.")
else:
    print("✅ Azure OpenAI configuration loaded successfully")
    print(f"📋 Using model: {MODEL}")
    print(f"📋 Using deployment: {AZURE_DEPLOYMENT}")

✅ Azure OpenAI configuration loaded successfully
📋 Using model: gpt-4o-mini
📋 Using deployment: gpt-4o-mini
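
The check above only tests `API_KEY`, so a missing endpoint or deployment name would surface later as a less obvious client error. A minimal sketch that reports every unset variable at once is shown here; `missing_env_vars` is a hypothetical helper (not part of LlamaIndex), and the variable names are copied from the configuration cell.

```python
import os

# Names copied from the configuration cell above; adjust to match your .env file.
REQUIRED_VARS = [
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_VERSION",
    "AZURE_OPENAI_MODEL",
    "AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("⚠️ Missing environment variables:", ", ".join(missing))
else:
    print("✅ All required environment variables are set")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.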


In [5]:
# Configure LlamaIndex with Azure OpenAI
llm = AzureOpenAI(
    deployment_name=AZURE_DEPLOYMENT,
    api_key=API_KEY,
    azure_endpoint=API_ENDPOINT,
    api_version=API_VERSION,
    model=MODEL,
    temperature=0.1
)

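# Configure embedding model
# Note: this reuses the chat model's key, endpoint, and API version. If your
# embedding model is deployed on a separate Azure resource, pass the
# EMBEDDINGS_* values loaded earlier instead.
embed_model = AzureOpenAIEmbedding(
    deployment_name=EMBEDDINGS_DEPLOYMENT_NAME,  # Update with your embedding deployment
    api_key=API_KEY,
    azure_endpoint=API_ENDPOINT,
    api_version=API_VERSION,
)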

# Set global configurations
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 1024
Settings.chunk_overlap = 200

print("🔧 LlamaIndex configured with Azure OpenAI")
print(f"📊 Chunk size: {Settings.chunk_size}")
print(f"📊 Chunk overlap: {Settings.chunk_overlap}")

🔧 LlamaIndex configured with Azure OpenAI
📊 Chunk size: 1024
📊 Chunk overlap: 200
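
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified word-level sliding window. It is only an illustration: LlamaIndex's default splitter counts tokens and prefers sentence boundaries, so real chunk boundaries will differ. `chunk_words` is a hypothetical helper introduced for this sketch.

```python
def chunk_words(words, chunk_size, chunk_overlap):
    """Split a list of words into windows of `chunk_size` that share
    `chunk_overlap` words with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(words):
            break
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_words(words, chunk_size=4, chunk_overlap=2):
    print(" ".join(chunk))
```

Each window repeats the last two words of the previous one, which is the role the 200-token overlap plays for the 1024-token chunks configured above: context that straddles a boundary appears in both neighbouring chunks.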



## 2. Loading and Processing Documents

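Let's start by loading and processing our sample documents. `SimpleDirectoryReader` reads the supported files in a directory and returns them as a list of `Document` objects, so the same call works across document formats.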

In [None]:
documents = SimpleDirectoryReader(
    input_dir="../lesson_5_document_rag/sample_documents",
).load_data()
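
Before loading, it can help to see which files sit in the input directory and which formats are involved. The following is a plain `pathlib` sketch that does not call LlamaIndex; `summarize_directory` is a hypothetical helper, and it only looks at files directly inside the directory.

```python
from collections import Counter
from pathlib import Path


def summarize_directory(input_dir):
    """Count the files directly inside `input_dir`, grouped by extension."""
    files = [p for p in Path(input_dir).iterdir() if p.is_file()]
    return Counter(p.suffix.lower() or "(no extension)" for p in files)


# Path taken from the loading cell above; skipped if the directory is absent.
sample_dir = Path("../lesson_5_document_rag/sample_documents")
if sample_dir.is_dir():
    for suffix, count in sorted(summarize_directory(sample_dir).items()):
        print(f"{suffix}: {count} file(s)")
```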


## 3. Creating Vector Index

Now let's create a vector index from our documents. This will enable semantic search over the document content.

In [None]:
index = VectorStoreIndex.from_documents(documents)
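
Conceptually, a vector index stores one embedding per chunk and, at query time, ranks chunks by their similarity to the query embedding. The toy sketch below uses hand-written 3-dimensional vectors in place of real embeddings and cosine similarity as the ranking measure; `cosine_similarity`, `top_k`, and the chunk names are illustrative assumptions, not LlamaIndex functions or data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for chunk embeddings.
toy_chunks = {
    "ai_intro": [0.9, 0.1, 0.0],
    "cloud_intro": [0.1, 0.9, 0.0],
    "serverless": [0.0, 0.8, 0.2],
}
toy_query = [1.0, 0.2, 0.0]

for name, score in top_k(toy_query, toy_chunks, k=2):
    print(f"{name}: {score:.3f}")
```

The scores printed by the query engine in the next section play the same role: higher means the chunk's embedding points in a direction closer to the query's.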

## 4. Basic RAG Query Engine

Let's create a basic query engine that can answer questions based on our document content.

In [11]:
query = "What are these documents about?"
query_engine = index.as_query_engine()
answer = query_engine.query(query)

print(answer.get_formatted_sources())
print("query was:", query)
print("answer was:", answer)


> Source (Doc id: 7d252d5e-16a0-4b39-81f5-56c254dd943a): Artificial Intelligence and Machine Learning: A Comprehensive Overview

Introduction

Artificial ...

> Source (Doc id: 95ec6e56-c21e-4da7-8ba1-018bca5a6702): Cloud Computing: Transforming Modern Business Infrastructure

Introduction

Cloud computing has r...
query was: What are these documents about?
answer was: The documents provide comprehensive overviews of two significant technological fields: Artificial Intelligence (AI) and Machine Learning (ML), and Cloud Computing. The first document discusses the definitions, categories, fundamentals, applications, challenges, and future outlook of AI and ML, highlighting their transformative impact on various industries. The second document focuses on cloud computing, explaining its characteristics, service models, deployment models, benefits, challenges, and current trends, emphasizing how it has revolutionized business infrastructure and data management.


In [12]:
# Test the basic query engine
def test_basic_query(question):
    """Test a query and display results"""
    print(f"🔍 Question: {question}")
    print("-" * 50)
    
    response = query_engine.query(question)
    
    print("🤖 Answer:")
    print(response.response)
    
    # Show source information
    print(f"\n📚 Sources used: {len(response.source_nodes)} chunks")
    for i, node in enumerate(response.source_nodes, 1):
        print(f"  Source {i}: {node.metadata.get('file_name', 'Unknown')}")
        print(f"    Score: {node.score:.3f}")
        print(f"    Text: {node.text[:150]}...")
        print()
    
    return response



In [13]:
# Test with sample questions
questions = [
    "What is artificial intelligence?",
    "What are the benefits of cloud computing?",
    "Explain the different types of machine learning"
]

for question in questions:
    print("=" * 60)
    test_basic_query(question)
    print()

🔍 Question: What is artificial intelligence?
--------------------------------------------------
🤖 Answer:
Artificial Intelligence refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It encompasses any machine that exhibits traits associated with a human mind, such as learning and problem-solving. AI can be categorized into three main types: Narrow AI, which is designed for specific tasks; General AI, a hypothetical form that can understand and apply knowledge across various tasks at a human level; and Superintelligence, which surpasses human intelligence in all aspects.

📚 Sources used: 2 chunks
  Source 1: sample1.txt
    Score: 0.446
    Text: Artificial Intelligence and Machine Learning: A Comprehensive Overview

Introduction

Artificial Intelligence (AI) and Machine Learning (ML) have beco...

  Source 2: sample2.txt
    Score: 0.247
    Text: Serverless Computing:
Running applications without managing servers, paying only 

## 5. Advanced Retrieval and Filtering

Let's enhance our RAG system with more advanced retrieval strategies and post-processing filters.

In [17]:
# Create advanced retriever with filters
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core import get_response_synthesizer

# Create retriever with more results initially
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,  # Get more results initially
)

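# Add post-processing filters
# Note: the basic queries above scored roughly 0.25-0.45, so a 0.7 cutoff may
# discard every retrieved chunk and produce "Empty Response"; lower it if so.
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)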

# Create response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    use_async=False,
)

# Create advanced query engine
advanced_query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor],
    response_synthesizer=response_synthesizer
)

print("🚀 Advanced query engine created!")
print("📋 Enhanced features:")
print("  - Similarity filtering (cutoff: 0.7)")
print("  - Higher initial retrieval (top 5)")
print("  - Post-processing for relevance")
print("  - Tree summarization for better responses")

🚀 Advanced query engine created!
📋 Enhanced features:
  - Similarity filtering (cutoff: 0.7)
  - Higher initial retrieval (top 5)
  - Post-processing for relevance
  - Tree summarization for better responses
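
It is worth checking how `similarity_top_k` and `similarity_cutoff` interact against real scores. The basic queries in section 4 returned scores of 0.446 and 0.247, both below 0.7. The plain-Python sketch below mimics the two-stage filtering; `retrieve_then_filter` is a hypothetical stand-in, not the LlamaIndex implementation.

```python
def retrieve_then_filter(scored_chunks, top_k, cutoff):
    """Keep the top_k highest-scoring chunks, then drop those below the cutoff."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:top_k]
    return [(name, score) for name, score in ranked if score >= cutoff]


# Scores reported for "What is artificial intelligence?" in section 4.
observed = [("sample1.txt", 0.446), ("sample2.txt", 0.247)]

print(retrieve_then_filter(observed, top_k=5, cutoff=0.7))  # → []
print(retrieve_then_filter(observed, top_k=5, cutoff=0.3))  # → [('sample1.txt', 0.446)]
```

When no chunk survives the cutoff, the synthesizer has nothing to work with, which is consistent with the "Empty Response" outputs in section 7. Lowering the cutoff, or choosing it from scores you actually observe, is one way to avoid that.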


In [18]:
# Create a custom query engine that can filter by document source
def create_document_specific_query_engine(document_filter=None):
    """
    Create a query engine that can filter by specific documents
    
    Args:
        document_filter: String to match document names (e.g., "sample1.txt")
    """
    
    if document_filter:
        # Filter nodes by document
        filtered_nodes = [
            node for node in index.docstore.docs.values() 
            if document_filter.lower() in node.metadata.get('file_name', '').lower()
        ]
        
        if not filtered_nodes:
            print(f"⚠️ No documents found matching filter: {document_filter}")
            return None
            
        # Build a new index from only the matching documents
        filtered_docs = [
            doc for doc in documents 
            if document_filter.lower() in doc.metadata.get('file_name', '').lower()
        ]
        
        if filtered_docs:
            filtered_index = VectorStoreIndex.from_documents(filtered_docs)
            query_engine = filtered_index.as_query_engine(similarity_top_k=3)
            print(f"🎯 Created filtered query engine for: {document_filter}")
            print(f"📄 Documents included: {len(filtered_docs)}")
            return query_engine
    
    return advanced_query_engine

# Test document-specific querying
def query_specific_document(question, document_filter=None):
    """Query a specific document or all documents"""
    
    if document_filter:
        print(f"🔍 Querying document: {document_filter}")
        engine = create_document_specific_query_engine(document_filter)
        if not engine:
            return None
    else:
        print("🔍 Querying all documents")
        engine = advanced_query_engine
    
    print(f"❓ Question: {question}")
    print("-" * 50)
    
    response = engine.query(question)
    print("🤖 Answer:")
    print(response.response)
    
    return response

# Example queries
print("Testing document-specific querying:")
print("=" * 60)

# Query about AI from the AI document
query_specific_document(
    "What are the main types of artificial intelligence?", 
    "sample1.txt"
)

print("\n" + "=" * 60)

# Query about cloud computing from the cloud document  
query_specific_document(
    "What are the cloud service models?",
    "sample2.txt"
)

# Test advanced query engine with similarity filtering
test_queries = [
    "What are the main concepts in artificial intelligence?",
    "How does cloud computing work?",
    "What are the benefits of machine learning?",
    "Explain containerization in cloud computing"
]

print("🔍 Testing Advanced Query Engine with Similarity Filtering")
print("=" * 60)

for i, query in enumerate(test_queries, 1):
    print(f"\n{i}. Query: {query}")
    print("-" * 40)
    
    # Query with advanced engine
    response = advanced_query_engine.query(query)
    print(f"🤖 Response: {response.response}")
    
    # Show retrieved nodes information
    if hasattr(response, 'source_nodes') and response.source_nodes:
        print(f"📚 Sources used: {len(response.source_nodes)} chunks")
        for j, node in enumerate(response.source_nodes[:2]):  # Show first 2 sources
            if hasattr(node, 'score'):
                print(f"   - Source {j+1} (similarity: {node.score:.3f})")
            else:
                print(f"   - Source {j+1}")
    
    print()

print("✅ Advanced query testing completed!")

Testing document-specific querying:
🔍 Querying document: sample1.txt
🎯 Created filtered query engine for: sample1.txt
📄 Documents included: 1
❓ Question: What are the main types of artificial intelligence?
--------------------------------------------------
🤖 Answer:
The main types of artificial intelligence are:

1. Narrow AI (Weak AI): AI systems designed and trained for specific tasks, such as virtual assistants and recommendation algorithms.

2. General AI (Strong AI): A hypothetical form of AI that can understand, learn, and apply knowledge across a wide range of tasks at a level comparable to human intelligence.

3. Superintelligence: AI that exceeds human intelligence in all aspects, including creativity and social skills.

🔍 Querying document: sample2.txt
🤖 Answer:
The main types of artificial 
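
Note that `create_document_specific_query_engine` builds a new `VectorStoreIndex` on every call, so the filtered documents are embedded again each time. One option is to cache a single engine per filter. The sketch below uses a stand-in builder so that it runs without API calls; `EngineCache` and `fake_builder` are hypothetical names introduced for illustration.

```python
class EngineCache:
    """Build one engine per document filter and reuse it on later calls."""

    def __init__(self, builder):
        self._builder = builder  # callable: filter string -> engine (or None)
        self._engines = {}

    def get(self, document_filter):
        # Filters are matched case-insensitively, as in the function above.
        key = document_filter.lower()
        if key not in self._engines:
            self._engines[key] = self._builder(document_filter)
        return self._engines[key]


build_calls = []


def fake_builder(document_filter):
    """Stand-in for the real builder; records each call instead of embedding."""
    build_calls.append(document_filter)
    return f"engine-for-{document_filter.lower()}"


cache = EngineCache(fake_builder)
cache.get("sample1.txt")
cache.get("SAMPLE1.txt")  # same key after lowercasing, so no rebuild
cache.get("sample2.txt")
print(build_calls)  # → ['sample1.txt', 'sample2.txt']
```

In the notebook, passing `create_document_specific_query_engine` as the builder would give the same reuse for real query engines.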

## 6. Conversational Document Agent

Let's create a conversational agent that can maintain context across multiple queries about our documents.

In [19]:
# Create a conversational document agent
from llama_index.core.memory import ChatMemoryBuffer

class ConversationalDocumentAgent:
    def __init__(self, index, memory_limit=10):
        self.index = index
        self.memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
        self.conversation_history = []
        self.memory_limit = memory_limit
        
        # Create chat engine with memory
        self.chat_engine = index.as_chat_engine(
            chat_mode="context",
            memory=self.memory,
            similarity_top_k=3,
            system_prompt=(
                "You are a helpful document analysis assistant. "
                "Answer questions based on the provided document content. "
                "Always cite the source document when possible. "
                "If information is not in the documents, clearly state that. "
                "Maintain conversation context and refer to previous answers when relevant."
            )
        )
    
    def query(self, question):
        """Process a query with conversation context"""
        print(f"🗣️ User: {question}")
        print("-" * 50)
        
        # Get response from chat engine
        response = self.chat_engine.chat(question)
        
        print("🤖 Assistant:")
        print(response.response)
        
        # Add to conversation history
        self.conversation_history.append({
            "question": question,
            "answer": response.response,
            "sources": len(response.source_nodes) if hasattr(response, 'source_nodes') else 0
        })
        
        # Keep history manageable
        if len(self.conversation_history) > self.memory_limit:
            self.conversation_history = self.conversation_history[-self.memory_limit:]
        
        # Show sources if available
        if hasattr(response, 'source_nodes') and response.source_nodes:
            print(f"\n📚 Sources ({len(response.source_nodes)} chunks):")
            for i, node in enumerate(response.source_nodes, 1):
                doc_name = node.metadata.get('file_name', 'Unknown')
                print(f"  {i}. {doc_name} (Score: {node.score:.3f})")
        
        return response
    
    def get_conversation_summary(self):
        """Get a summary of the conversation"""
        return {
            "total_questions": len(self.conversation_history),
            "questions": [item["question"] for item in self.conversation_history]
        }
    
    def reset_conversation(self):
        """Reset the conversation history"""
        self.conversation_history = []
        self.memory.reset()
        print("🔄 Conversation history reset")

# Create the conversational agent
agent = ConversationalDocumentAgent(index)
print("💬 Conversational Document Agent created!")
print("🎯 Features:")
print("  - Maintains conversation context")
print("  - Source attribution")
print("  - Memory management")
print("  - Multi-turn conversations")

💬 Conversational Document Agent created!
🎯 Features:
  - Maintains conversation context
  - Source attribution
  - Memory management
  - Multi-turn conversations
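
The agent bounds its context in two ways: `ChatMemoryBuffer` is given a token limit, and `memory_limit` caps the length of `conversation_history`. The sketch below illustrates the idea behind a token-bounded buffer by keeping only the most recent messages that fit a budget. It counts whitespace-separated words as a rough stand-in for tokens, and `trim_to_budget` is a hypothetical helper, not how `ChatMemoryBuffer` is implemented.

```python
def trim_to_budget(messages, budget):
    """Keep the most recent messages whose combined word count fits `budget`."""
    kept = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = len(message.split())
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "What is artificial intelligence?",
    "AI is the simulation of human intelligence in machines.",
    "What are the main types you mentioned?",
]
print(trim_to_budget(history, budget=16))
```

With a budget of 16 words the oldest message is dropped, which is why a follow-up question late in a long conversation may lose access to early turns.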


In [20]:
# Test the conversational agent with a multi-turn conversation
conversation_queries = [
    "What is artificial intelligence?",
    "What are the main types you mentioned?",
    "How does this relate to machine learning?",
    "Now tell me about cloud computing",
    "What are the advantages compared to traditional IT infrastructure?",
    "Which service model would be best for a startup?"
]

print("🎭 Starting multi-turn conversation:")
print("=" * 60)

for i, query in enumerate(conversation_queries, 1):
    print(f"\n🔄 Turn {i}:")
    agent.query(query)
    
    if i < len(conversation_queries):
        print("\n⏳ (continuing conversation...)")

print(f"\n📊 Conversation summary: {agent.get_conversation_summary()}")

🎭 Starting multi-turn conversation:

🔄 Turn 1:
🗣️ User: What is artificial intelligence?
--------------------------------------------------
🤖 Assistant:
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It encompasses any machine that exhibits traits associated with a human mind, such as learning and problem-solving. The field of AI can be broadly categorized into:

1. **Narrow AI (Weak AI)**: AI systems designed and trained for a specific task, such as virtual assistants (e.g., Siri, Alexa), recommendation algorithms, and image recognition systems.

2. **General AI (Strong AI)**: A hypothetical form of AI that would possess the ability to understand, learn, and apply knowledge across a wide range of tasks at a level equal to human intelligence.

3. **Superintelligence**: AI that surpasses human intelligence in all aspects, including creativity, general wisdom, and social skills.

This definition 

## 7. Advanced Features and Document Analysis

Let's explore some advanced features for document analysis and comparison.

In [21]:
# Advanced document analysis functions

def compare_documents(topic, doc1_filter=None, doc2_filter=None):
    """Compare how different documents discuss a topic"""
    
    print(f"📊 Comparing documents on topic: {topic}")
    print("=" * 60)
    
    if doc1_filter and doc2_filter:
        # Query each document separately
        engine1 = create_document_specific_query_engine(doc1_filter)
        engine2 = create_document_specific_query_engine(doc2_filter)
        
        if engine1 and engine2:
            print(f"📄 Document 1: {doc1_filter}")
            response1 = engine1.query(f"Explain {topic}")
            print("🤖 Response:")
            print(response1.response)
            
            print(f"\n📄 Document 2: {doc2_filter}")
            response2 = engine2.query(f"Explain {topic}")
            print("🤖 Response:")
            print(response2.response)
            
            # Use the main engine to synthesize comparison
            print(f"\n🔄 Synthesizing comparison...")
            comparison_query = f"""
            Based on the available documents, compare and contrast how {topic} is presented. 
            Highlight similarities and differences in the explanations, approaches, or perspectives.
            """
            
            comparison_response = advanced_query_engine.query(comparison_query)
            print("\n🎯 Comparison Analysis:")
            print(comparison_response.response)
    else:
        # General comparison query
        comparison_query = f"""
        Compare and contrast the different perspectives or approaches to {topic} 
        found in the available documents. Highlight key similarities and differences.
        """
        response = advanced_query_engine.query(comparison_query)
        print("🤖 Analysis:")
        print(response.response)

def summarize_document_collection():
    """Create a comprehensive summary of all documents"""
    
    print("📋 Creating comprehensive document summary...")
    print("=" * 60)
    
    summary_query = """
    Provide a comprehensive summary of all the documents in the collection. 
    Include:
    1. Main topics covered
    2. Key concepts and definitions
    3. Important relationships between topics
    4. Overall themes and insights
    
    Structure your response clearly with headers and bullet points.
    """
    
    response = advanced_query_engine.query(summary_query)
    print("📊 Document Collection Summary:")
    print(response.response)
    
    return response

def extract_key_concepts():
    """Extract key concepts and definitions from all documents"""
    
    print("🔑 Extracting key concepts and definitions...")
    print("=" * 60)
    
    concepts_query = """
    Extract and list the key concepts, terms, and definitions from all documents.
    For each concept, provide:
    1. The term/concept name
    2. Its definition or explanation
    3. Which document(s) it appears in
    
    Focus on important technical terms, main ideas, and fundamental concepts.
    """
    
    response = advanced_query_engine.query(concepts_query)
    print("🎓 Key Concepts:")
    print(response.response)
    
    return response

# Test advanced analysis features
print("🔬 Advanced Document Analysis")
print("=" * 60)

# Compare how both documents discuss technology concepts
compare_documents("the future impact of technology")

print("\n" + "=" * 60)

# Get comprehensive summary
summary_response = summarize_document_collection()

print("\n" + "=" * 60)

# Extract key concepts
concepts_response = extract_key_concepts()

🔬 Advanced Document Analysis
📊 Comparing documents on topic: the future impact of technology
🤖 Analysis:
Empty Response

📋 Creating comprehensive document summary...
📊 Document Collection Summary:
Empty Response

🔑 Extracting key concepts and definitions...
🎓 Key Concepts:
Empty Response
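
Every analysis call above returned "Empty Response", which is what one would expect if no retrieved chunk passes the 0.7 similarity cutoff set on `advanced_query_engine`. A hedged sketch of a guard that retries with a less strict engine when the first one returns no source nodes is shown here. It uses small stand-in classes so it runs without API access; `query_with_fallback`, `StubEngine`, and `StubResponse` are hypothetical, and the answer text is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class StubResponse:
    """Minimal stand-in exposing the two attributes used in this lesson."""
    response: str
    source_nodes: list = field(default_factory=list)


class StubEngine:
    """Stand-in query engine that always returns a fixed response."""

    def __init__(self, response):
        self._response = response

    def query(self, question):
        return self._response


def query_with_fallback(primary, fallback, question):
    """Query `primary`; if it used no source nodes, query `fallback` instead."""
    result = primary.query(question)
    if not getattr(result, "source_nodes", None):
        result = fallback.query(question)
    return result


strict = StubEngine(StubResponse("Empty Response"))
lenient = StubEngine(StubResponse("placeholder answer", ["chunk-1"]))

answer = query_with_fallback(strict, lenient, "Summarize the documents")
print(answer.response)  # → placeholder answer
```

In the notebook, the primary engine could be `advanced_query_engine` and the fallback the basic `query_engine` from section 4; whether a fallback or a lower cutoff is the better fix depends on how strict you want relevance filtering to be.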
