# 📓 The GenAI Revolution Cookbook

**Title:** 5 Essential Steps to Building Agentic RAG Systems with LangChain and ChromaDB

**Description:** Unlock the power of agentic RAG systems with LangChain and ChromaDB. Follow these steps to enhance AI adaptability and relevance in real-world applications.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## Introduction

Building production-ready GenAI applications requires more than connecting an LLM to a vector database. Traditional RAG systems follow a fixed retrieve-then-generate pattern that cannot adapt to complex queries or changing contexts. Agentic RAG systems address this by letting the model decide when to retrieve information, which tools to use, and how to combine multiple data sources.

In this notebook, you'll learn how to build, optimize, and deploy an agentic RAG system using [LangChain](https://docs.langchain.com), [ChromaDB](https://www.trychroma.com), and OpenAI. You'll implement autonomous retrieval agents, add production-grade optimizations like caching and reranking, and set up monitoring to ensure your system performs reliably at scale. By the end, you'll have a fully functional, production-ready agentic RAG system that you can adapt to your own use cases.

## Setup & Installation

First, install all required dependencies and configure your environment. This setup includes LangChain for agent orchestration, ChromaDB for vector storage, and OpenAI for embeddings and language models.

In [None]:
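# Install and configure the development environment for agentic RAG system
# langchainhub is included because older LangChain releases need it for hub.pull (used later)
!pip install langchain chromadb openai langchain-openai langchain-community langchainhub cachetools pypdf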

import os
import logging
from typing import List, Dict, Any

# Configure logging for better debugging and monitoring
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Set up API keys - an existing OPENAI_API_KEY is kept; otherwise replace the placeholder
os.environ.setdefault("OPENAI_API_KEY", "your_openai_api_key_here")

# Import and verify installations
try:
    import langchain
    import chromadb
    from langchain_openai import OpenAIEmbeddings
    logger.info(f"LangChain version: {langchain.__version__}")
    logger.info("All dependencies installed successfully")
except ImportError as e:
    logger.error(f"Import error: {e}")
    raise

## Building the Vector Store

Before creating the agentic layer, you need a knowledge base. This section shows how to load documents, split them into optimal chunks, and create a ChromaDB vector store for efficient semantic search.

In [None]:
# Prepare data and set up ChromaDB vector store for document retrieval
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
import chromadb

def setup_vector_store(pdf_path: str, collection_name: str = "documents") -> Chroma:
    """
    Load documents, split into chunks, and create a ChromaDB vector store.
    
    Args:
        pdf_path (str): Path to the PDF document to load
        collection_name (str): Name for the ChromaDB collection
        
    Returns:
        Chroma: Configured ChromaDB vector store instance
        
    Raises:
        FileNotFoundError: If PDF file doesn't exist
        Exception: If embedding or storage fails
    """
    try:
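        # Fail fast with the documented exception if the file is missing
        if not os.path.isfile(pdf_path):
            raise FileNotFoundError(pdf_path)

        # Load PDF document (one Document per page)
        loader = PyPDFLoader(pdf_path)
        documents = loader.load()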
        logger.info(f"Loaded {len(documents)} pages from {pdf_path}")
        
        # Split documents into optimal chunks for retrieval
        # chunk_size=1000 balances context preservation with retrieval precision
        # chunk_overlap=200 ensures continuity across chunk boundaries
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len,
            separators=["\n\n", "\n", " ", ""]
        )
        chunks = text_splitter.split_documents(documents)
        logger.info(f"Split into {len(chunks)} chunks")
        
        # Initialize OpenAI embeddings for semantic similarity
        embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
        
        # Create ChromaDB vector store with persistent storage
        # This enables efficient similarity search and retrieval
        vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=embeddings,
            collection_name=collection_name,
            persist_directory="./chroma_db"
        )
        
        # Test retrieval functionality with a sample query
        test_results = vectorstore.similarity_search("sample query", k=3)
        logger.info(f"Vector store created successfully with {len(test_results)} test results")
        
        return vectorstore
        
    except FileNotFoundError:
        logger.error(f"PDF file not found: {pdf_path}")
        raise
    except Exception as e:
        logger.error(f"Error setting up vector store: {e}")
        raise

# Example usage - replace with your actual PDF path
# vectorstore = setup_vector_store("sample_document.pdf")
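
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-width chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also prefers to break at the configured separators, which this sketch ignores. The function name `sliding_chunks` and the sample text are hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Larger overlaps improve continuity at the cost of storing and embedding more duplicated text.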

## Implementing the Agentic Layer

Agentic Retrieval-Augmented Generation (RAG) systems extend traditional RAG with autonomous decision-making. A static RAG pipeline retrieves according to predefined rules, whereas an agentic system decides what information to retrieve, how to use it, and when to act. This adaptability matters in production AI applications, where context and requirements can change rapidly. For a deeper look at tailoring these systems to specific domains, see our guide on [customizing LLMs for domain-specific applications](/article/mastering-domain-specific-llm-customization-techniques-and-tools-unveiled).

The following implementation uses LangChain's ReAct agent framework, which alternates between reasoning and acting in iterative steps. At each step the agent decides whether to call the retrieval tool, guided by the tool's description and the question.

In [None]:
# Implement agentic layer with autonomous retrieval capabilities
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool
from langchain_openai import ChatOpenAI
from langchain import hub
from typing import Optional

class AgenticRAGSystem:
    """
    Agentic RAG system that autonomously decides when and how to retrieve information.
    
    Combines LangChain agents with ChromaDB for intelligent document retrieval.
    """
    
    def __init__(self, vectorstore: Chroma, model_name: str = "gpt-3.5-turbo"):
        """
        Initialize the agentic RAG system.
        
        Args:
            vectorstore (Chroma): Configured ChromaDB vector store
            model_name (str): OpenAI model to use for reasoning
        """
        self.vectorstore = vectorstore
        self.llm = ChatOpenAI(model=model_name, temperature=0)
        self.tools = self._create_tools()
        self.agent = self._create_agent()
        
    def _create_tools(self) -> List[Tool]:
        """
        Create retrieval tools for the agent to use autonomously.
        
        Returns:
            List[Tool]: List of tools available to the agent
        """
        def similarity_search(query: str) -> str:
            """
            Perform similarity search in the vector store.
            
            Args:
                query (str): Search query
                
            Returns:
                str: Formatted search results
            """
            try:
                # Retrieve top 3 most relevant documents
                # k=3 balances relevance with response length
                results = self.vectorstore.similarity_search(query, k=3)
                
                if not results:
                    return "No relevant documents found."
                
                # Format results for agent consumption
                formatted_results = []
                for i, doc in enumerate(results, 1):
                    formatted_results.append(f"Document {i}: {doc.page_content[:500]}...")
                
                return "\n\n".join(formatted_results)
                
            except Exception as e:
                logger.error(f"Error in similarity search: {e}")
                return f"Search error: {str(e)}"
        
        # Define tools available to the agent
        tools = [
            Tool(
                name="document_search",
                description="Search through documents to find relevant information. Use this when you need specific information from the knowledge base.",
                func=similarity_search
            )
        ]
        
        return tools
    
    def _create_agent(self) -> AgentExecutor:
        """
        Create ReAct agent with autonomous decision-making capabilities.
        
        Returns:
            AgentExecutor: Configured agent executor
        """
        # Get ReAct prompt template from LangChain hub
        # ReAct enables reasoning and acting in iterative steps
        prompt = hub.pull("hwchase17/react")
        
        # Create agent with reasoning capabilities
        agent = create_react_agent(self.llm, self.tools, prompt)
        
        # Configure agent executor with error handling
        agent_executor = AgentExecutor(
            agent=agent,
            tools=self.tools,
            verbose=True,  # Enable detailed logging
            handle_parsing_errors=True,  # Graceful error handling
            max_iterations=3  # Prevent infinite loops
        )
        
        return agent_executor
    
    def query(self, question: str) -> str:
        """
        Process user query with autonomous retrieval decisions.
        
        Args:
            question (str): User's question
            
        Returns:
            str: Agent's response, or an error message if agent execution fails
        """
        try:
            # Agent autonomously decides whether to use retrieval tools
            # Based on query complexity and information requirements
            response = self.agent.invoke({"input": question})
            return response["output"]
            
        except Exception as e:
            logger.error(f"Error processing query: {e}")
            return f"I encountered an error processing your question: {str(e)}"

# Example usage
# agentic_system = AgenticRAGSystem(vectorstore)
# response = agentic_system.query("What are the main topics discussed in the document?")
# print(response)
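
The control flow behind a ReAct-style agent can be sketched without any LLM. The toy loop below is an illustration under stated assumptions, not LangChain's implementation: `run_react_loop`, `scripted_decide`, and `fake_search` are hypothetical names, and a scripted function stands in for the model's reasoning step.

```python
from typing import Callable, Dict, List, Tuple

# A step is either ("action", tool_name, tool_input) or ("final", answer, "")
Step = Tuple[str, str, str]


def run_react_loop(
    question: str,
    decide: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_iterations: int = 3,
) -> str:
    """Alternate between decisions and tool calls until a final answer appears."""
    observations: List[str] = []
    for _ in range(max_iterations):
        kind, name_or_answer, tool_input = decide(question, observations)
        if kind == "final":
            return name_or_answer
        tool = tools.get(name_or_answer)
        if tool is None:
            # Feed the mistake back so the next decision can correct it
            observations.append(f"Unknown tool: {name_or_answer}")
            continue
        observations.append(tool(tool_input))
    return "Stopped: iteration limit reached."


def scripted_decide(question: str, observations: List[str]) -> Step:
    """Stand-in for the LLM: search once, then answer from the observation."""
    if not observations:
        return ("action", "document_search", question)
    return ("final", f"Based on the documents: {observations[-1]}", "")


def fake_search(query: str) -> str:
    """Stub retrieval tool that returns a canned string."""
    return f"(stub result for '{query}')"


answer = run_react_loop(
    "What is covered?", scripted_decide, {"document_search": fake_search}
)
print(answer)
```

The iteration cap plays the same role as `max_iterations=3` in the `AgentExecutor` above: a model that keeps requesting tools is cut off rather than looping indefinitely.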

## Production Optimizations

Production environments require more than basic functionality. This section implements critical optimizations including response caching, relevance-based reranking, and performance monitoring. These improvements reduce latency, improve answer quality, and provide visibility into system behavior.
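
As a rough picture of what a time-to-live cache does, the standard-library sketch below stores an expiry timestamp with each entry and evicts lazily on read. It is a simplified stand-in rather than the `cachetools.TTLCache` implementation (for example, it has no `maxsize`), and the class name `SimpleTTLCache` and the injectable `clock` parameter are assumptions made for the demonstration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class SimpleTTLCache:
    """Tiny time-to-live cache; entries expire `ttl` seconds after insertion."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be shown without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazily evict stale entries on read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self.clock() + self.ttl, value)


# Demonstrate expiry with a fake clock instead of waiting five minutes
now = [0.0]
cache = SimpleTTLCache(ttl=300, clock=lambda: now[0])
cache.set("What are the key findings?", "cached answer")

fresh = cache.get("What are the key findings?")  # still within the TTL window
now[0] = 301.0                                    # advance past the TTL
stale = cache.get("What are the key findings?")  # entry has expired
print(fresh, stale)
```

A short TTL keeps answers fresh after the underlying documents change; a long TTL saves more LLM calls. The five-minute value used in this notebook is a starting point to tune against how often your knowledge base is updated.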

In [None]:
# Optimize system performance with caching, reranking, and monitoring
from cachetools import TTLCache, cached
from functools import wraps
import time
from typing import Tuple

class OptimizedAgenticRAG(AgenticRAGSystem):
    """
    Production-ready agentic RAG system with performance optimizations.
    
    Includes caching, reranking, monitoring, and error handling.
    """
    
    def __init__(self, vectorstore: Chroma, model_name: str = "gpt-3.5-turbo"):
        super().__init__(vectorstore, model_name)
        # TTL cache with 5-minute expiration to balance freshness and performance
        self.cache = TTLCache(maxsize=100, ttl=300)
        self.query_metrics = {"total_queries": 0, "cache_hits": 0, "avg_response_time": 0}
    
    def _enhanced_similarity_search(self, query: str, k: int = 5) -> str:
        """
        Enhanced similarity search with reranking and relevance scoring.
        
        Args:
            query (str): Search query
            k (int): Number of documents to retrieve initially
            
        Returns:
            str: Reranked and formatted search results
        """
        try:
            # Retrieve more documents initially for better reranking
            results = self.vectorstore.similarity_search_with_score(query, k=k)
            
            if not results:
                return "No relevant documents found."
            
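            # Chroma returns distances (lower = more similar). Results normally arrive
            # closest-first, but sorting explicitly makes the ranking step visible.
            # Weak matches (distance >= 0.8) are dropped; tune this for your embedding model.
            relevant_results = sorted(
                ((doc, score) for doc, score in results if score < 0.8),
                key=lambda pair: pair[1],
            )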
            
            if not relevant_results:
                return "No sufficiently relevant documents found."
            
            # Format top 3 results for agent consumption
            formatted_results = []
            for i, (doc, score) in enumerate(relevant_results[:3], 1):
                # Heuristic confidence: only meaningful while distances stay within [0, 1]
                confidence = max(0, (1 - score) * 100)
                formatted_results.append(
                    f"Document {i} (Confidence: {confidence:.1f}%): {doc.page_content[:500]}..."
                )
            
            return "\n\n".join(formatted_results)
            
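        except Exception as e:
            logger.error(f"Error in enhanced similarity search: {e}")
            return f"Search error: {str(e)}"

    def _create_tools(self) -> List[Tool]:
        """
        Expose the enhanced search to the agent in place of the basic search.

        The base class calls this from __init__, so overriding it is enough
        to route the agent's retrieval through the filtered search.
        """
        return [
            Tool(
                name="document_search",
                description="Search through documents to find relevant information. Use this when you need specific information from the knowledge base.",
                func=self._enhanced_similarity_search
            )
        ]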
    
    def _cached_query(self, question: str) -> Tuple[str, bool]:
        """
        Execute query with caching to improve response times.
        
        Args:
            question (str): User's question
            
        Returns:
            Tuple[str, bool]: (response, was_cached)
        """
        # cachetools.cached expects a cache object rather than a callable,
        # so the per-instance cache is consulted directly here
        cached_response = self.cache.get(question)
        if cached_response is not None:
            return cached_response, True
        response = super().query(question)
        self.cache[question] = response
        return response, False  # False indicates this was not from cache
    
    def query(self, question: str) -> str:
        """
        Process query with performance monitoring and caching.
        
        Args:
            question (str): User's question
            
        Returns:
            str: Optimized response with performance metrics
        """
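        start_time = time.time()
        self.query_metrics["total_queries"] += 1
        
        try:
            # Check cache first; the question string itself is the key
            # (hash() values can collide and add nothing here)
            cached_response = self.cache.get(question)
            if cached_response is not None:
                self.query_metrics["cache_hits"] += 1
                logger.info(f"Cache hit for query: {question[:50]}...")
                # Record cache hits too so the running average covers every query
                self._update_metrics(time.time() - start_time)
                return cached_response
            
            # Call the agent directly: the parent query() turns failures into
            # error strings, which would otherwise be cached as answers
            response = self.agent.invoke({"input": question})["output"]
            
            # Cache successful responses
            self.cache[question] = response
            
            # Update performance metrics
            response_time = time.time() - start_time
            self._update_metrics(response_time)
            
            logger.info(f"Query processed in {response_time:.2f}s")
            return response
            
        except Exception as e:
            logger.error(f"Error in optimized query: {e}")
            # Failed queries count toward total_queries, so record their time as well
            self._update_metrics(time.time() - start_time)
            return f"I encountered an error: {str(e)}. Please try rephrasing your question."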
    
    def _update_metrics(self, response_time: float) -> None:
        """
        Update performance metrics for monitoring.
        
        Args:
            response_time (float): Time taken to process the query
        """
        # Calculate the running (cumulative) average response time
        current_avg = self.query_metrics["avg_response_time"]
        total_queries = self.query_metrics["total_queries"]
        
        # Update average using incremental formula
        self.query_metrics["avg_response_time"] = (
            (current_avg * (total_queries - 1) + response_time) / total_queries
        )
    
    def get_performance_stats(self) -> Dict[str, Any]:
        """
        Get system performance statistics for monitoring.
        
        Returns:
            Dict[str, Any]: Performance metrics
        """
        cache_hit_rate = (
            self.query_metrics["cache_hits"] / max(1, self.query_metrics["total_queries"]) * 100
        )
        
        return {
            "total_queries": self.query_metrics["total_queries"],
            "cache_hit_rate": f"{cache_hit_rate:.1f}%",
            "avg_response_time": f"{self.query_metrics['avg_response_time']:.2f}s",
            "cache_size": len(self.cache)
        }
    
    def health_check(self) -> Dict[str, str]:
        """
        Perform system health check for production monitoring.
        
        Returns:
            Dict[str, str]: Health status of system components
        """
        health_status = {"overall": "healthy"}
        
        try:
            # Test vector store connectivity
            test_results = self.vectorstore.similarity_search("test", k=1)
            health_status["vectorstore"] = "healthy"
        except Exception as e:
            health_status["vectorstore"] = f"unhealthy: {str(e)}"
            health_status["overall"] = "degraded"
        
        try:
            # Test LLM connectivity
            test_response = self.llm.invoke("test")
            health_status["llm"] = "healthy"
        except Exception as e:
            health_status["llm"] = f"unhealthy: {str(e)}"
            health_status["overall"] = "degraded"
        
        return health_status

# Example production deployment setup
def deploy_agentic_rag():
    """
    Example deployment function for production environments.
    
    Returns:
        OptimizedAgenticRAG: Production-ready system instance
    """
    try:
        # Initialize vector store (replace with your actual setup)
        # vectorstore = setup_vector_store("your_documents.pdf")
        
        # Create optimized system
        # system = OptimizedAgenticRAG(vectorstore)
        
        # Perform health check
        # health = system.health_check()
        # logger.info(f"System health: {health}")
        
        # return system
        
        logger.info("Deployment template ready - uncomment and configure with your data")
        return None
        
    except Exception as e:
        logger.error(f"Deployment failed: {e}")
        raise

# Example usage for monitoring
# system = deploy_agentic_rag()
# if system:
#     response = system.query("What are the key findings?")
#     stats = system.get_performance_stats()
#     print(f"Response: {response}")
#     print(f"Performance: {stats}")
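
The filtering and ranking step can be tried in isolation on plain `(text, distance)` pairs. This is a minimal sketch: the function name `filter_and_rank` is hypothetical, the distances are made-up values, and the `0.8` threshold simply mirrors the one used in the class above.

```python
from typing import List, Tuple


def filter_and_rank(
    scored: List[Tuple[str, float]], max_distance: float = 0.8, top_n: int = 3
) -> List[Tuple[str, float]]:
    """Keep results under the distance threshold, closest first."""
    kept = [(text, dist) for text, dist in scored if dist < max_distance]
    kept.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return kept[:top_n]


# Made-up distances, deliberately out of order
results = [("chunk C", 0.95), ("chunk A", 0.21), ("chunk D", 0.79), ("chunk B", 0.40)]
ranked = filter_and_rank(results)
for text, dist in ranked:
    print(f"{text}: distance={dist:.2f}")
```

Distance scales depend on the embedding model and the distance metric configured for the collection, so a threshold that works for one setup may discard everything or nothing in another. Inspecting the scores returned for a handful of known-good and known-bad queries is a reasonable way to pick a value.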

## Testing & Validation

Before deploying to production, thoroughly test your agentic RAG system to ensure it performs reliably under various conditions. This section demonstrates how to validate functionality, measure performance, and verify system health.

In [None]:
# Comprehensive testing suite for agentic RAG system
import time
from typing import Any, Dict

class AgenticRAGTester:
    """
    Testing utilities for validating agentic RAG system functionality.
    """
    
    def __init__(self, system: OptimizedAgenticRAG):
        self.system = system
        self.test_results = []
    
    def test_basic_retrieval(self) -> bool:
        """
        Test basic document retrieval functionality.
        
        Returns:
            bool: True if test passes
        """
        try:
            test_query = "What is the main topic?"
            response = self.system.query(test_query)
            
            # Verify response is not empty and doesn't contain error messages
            success = (
                len(response) > 0 and 
                "error" not in response.lower() and
                response != "No relevant documents found."
            )
            
            self.test_results.append({
                "test": "basic_retrieval",
                "passed": success,
                "response_length": len(response)
            })
            
            logger.info(f"Basic retrieval test: {'PASSED' if success else 'FAILED'}")
            return success
            
        except Exception as e:
            logger.error(f"Basic retrieval test failed: {e}")
            return False
    
    def test_cache_performance(self) -> bool:
        """
        Test caching functionality and performance improvement.
        
        Returns:
            bool: True if caching works correctly
        """
        try:
            test_query = "What are the key findings?"
            
            # First query (uncached)
            start_time = time.time()
            response1 = self.system.query(test_query)
            first_query_time = time.time() - start_time
            
            # Second query (should be cached)
            start_time = time.time()
            response2 = self.system.query(test_query)
            cached_query_time = time.time() - start_time
            
            # Verify responses match and cached query is faster
            success = (
                response1 == response2 and
                cached_query_time < first_query_time
            )
            
            self.test_results.append({
                "test": "cache_performance",
                "passed": success,
                "speedup": f"{first_query_time / max(cached_query_time, 0.001):.2f}x"
            })
            
            logger.info(f"Cache performance test: {'PASSED' if success else 'FAILED'}")
            return success
            
        except Exception as e:
            logger.error(f"Cache performance test failed: {e}")
            return False
    
    def test_health_monitoring(self) -> bool:
        """
        Test health check functionality.
        
        Returns:
            bool: True if health check works
        """
        try:
            health_status = self.system.health_check()
            
            # Verify all required components are checked
            required_components = ["overall", "vectorstore", "llm"]
            success = all(component in health_status for component in required_components)
            
            self.test_results.append({
                "test": "health_monitoring",
                "passed": success,
                "status": health_status
            })
            
            logger.info(f"Health monitoring test: {'PASSED' if success else 'FAILED'}")
            return success
            
        except Exception as e:
            logger.error(f"Health monitoring test failed: {e}")
            return False
    
    def run_all_tests(self) -> Dict[str, Any]:
        """
        Run complete test suite and return results.
        
        Returns:
            Dict[str, Any]: Test results summary
        """
        logger.info("Starting test suite...")
        
        tests = [
            self.test_basic_retrieval,
            self.test_cache_performance,
            self.test_health_monitoring
        ]
        
        passed = sum(test() for test in tests)
        total = len(tests)
        
        summary = {
            "total_tests": total,
            "passed": passed,
            "failed": total - passed,
            "success_rate": f"{(passed/total)*100:.1f}%",
            "details": self.test_results
        }
        
        logger.info(f"Test suite completed: {passed}/{total} tests passed")
        return summary

# Example usage
# tester = AgenticRAGTester(system)
# results = tester.run_all_tests()
# print(f"Test Results: {results}")
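
The tests above exercise the live system, so they need an API key and incur LLM calls. For fast checks in CI, the same cache behaviour can be verified against a stub. The sketch below uses only the standard library; `StubRAG` is a hypothetical stand-in that exposes a `query` method, not part of the system built in this notebook.

```python
import unittest
from typing import Dict


class StubRAG:
    """Hypothetical stand-in exposing a `query` method, with no network calls."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}
        self.backend_calls = 0

    def query(self, question: str) -> str:
        if question in self.cache:
            return self.cache[question]
        self.backend_calls += 1  # counts how often the (fake) agent actually runs
        answer = f"answer to: {question}"
        self.cache[question] = answer
        return answer


class CacheBehaviourTest(unittest.TestCase):
    def test_repeated_query_hits_cache(self) -> None:
        system = StubRAG()
        first = system.query("What are the key findings?")
        second = system.query("What are the key findings?")
        self.assertEqual(first, second)
        self.assertEqual(system.backend_calls, 1)


# Run the test case programmatically so it also works inside a notebook cell
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CacheBehaviourTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Counting backend calls is deterministic, whereas comparing wall-clock timings between a first and second query can fail intermittently on a loaded machine.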

By implementing these optimizations and testing strategies, you can ensure your agentic RAG system is reliable and efficient in a production environment. To further enhance the performance of your system, consider exploring techniques for [fine-tuning language models with Hugging Face Transformers](/article/mastering-fine-tuning-of-large-language-models-with-hugging-face), which can improve the accuracy and relevance of the generated outputs.

## Conclusion

You've now built a production-ready agentic RAG system with autonomous decision-making, performance optimizations, and comprehensive monitoring. The key lessons learned include:

- **Autonomous retrieval** enables more flexible and context-aware information gathering compared to static RAG systems
- **Caching and reranking** cut latency for repeated queries and keep weakly related chunks out of the agent's context
- **Health checks and metrics** provide essential visibility for maintaining reliable AI systems at scale
- **Comprehensive testing** ensures your system performs correctly under various conditions before deployment

To take your system to the next level, consider exploring advanced patterns such as multi-agent systems for complex workflows, hybrid search combining vector and keyword retrieval, and dynamic tool selection for multi-modal data sources. For more information, visit the [LangChain documentation](https://docs.langchain.com) and [ChromaDB documentation](https://www.trychroma.com). Additionally, our article on [customizing LLMs for domain-specific applications](/article/mastering-domain-specific-llm-customization-techniques-and-tools-unveiled) provides insights into advanced customization techniques that can be applied to enhance your agentic RAG systems.

Next steps for production deployment include setting up CI/CD pipelines for automated testing and deployment, implementing distributed caching with Redis for multi-instance deployments, and integrating with observability platforms like Prometheus or Datadog for comprehensive monitoring. With these foundations in place, you're ready to build and scale intelligent, autonomous AI systems that adapt to real-world complexity.