# Agent RAG

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datascienceworld-kan/vinagent-docs/blob/main/docs/tutorials/get_started/agent_rag.ipynb)

Let’s assume your company plans to build a smart assistant that can answer not only general questions but also specific ones about your internal company documents. An LLM struggles to answer such questions if it was never trained on that material.

The RAG (Retrieval-Augmented Generation) pipeline is an approach in natural language processing (NLP) that improves answer accuracy by retrieving relevant information before generating a response. It combines information retrieval with language generation: a retriever component supplies the generative model with relevant context for each query.

## Why use a RAG pipeline

A RAG pipeline retrieves relevant context from diverse data sources and extends what a plain LLM can do. In a nutshell, RAG offers three principal advantages:

- **Empowering LLM with real-time data access**:
Business context changes constantly, so enterprise data is dynamic. AI solutions built on LLMs need to stay up to date, and RAG gives them direct access to additional data sources. Ideally, these sources comprise real-time and personalized data.

- **Preserving data privacy**:
Much enterprise data is sensitive and confidential. That is why commercial LLMs like GPT-4, GPT-3.5, Claude, Grok, and Gemini are banned in several corporations, especially where data is regarded as the new gold. Ensuring data privacy is therefore crucial for enterprises. With a self-hosted LLM in the RAG workflow, sensitive data can stay on-premises, just like locally stored data.

- **Mitigating LLM hallucinations**:
Because many LLMs lack access to factual, real-time information, they often generate responses that are inaccurate yet sound convincing. RAG mitigates this phenomenon, known as hallucination, by providing the LLM with relevant, factual information.

## RAG Architecture

![](./rag_langchain_engine.png)

**1. Retriever**

The retriever finds relevant information from a large collection of documents.

- Data ingestion: Data is collected from various sources like PDFs, text files, emails, or websites.
- Document preprocessing: Long documents are split into smaller parts to make them easier to handle.
- Generating embeddings: Each part is turned into a numeric vector using an embedding model.
- Storing in vector databases: These vectors are saved in a special database for fast searching.
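
The document preprocessing step above can be illustrated with a minimal sketch of fixed-size chunking with overlap. This is a simplified stand-in: production splitters typically also try to break on natural boundaries such as paragraphs or sentences. The function name `split_with_overlap` and the toy input are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character chunks; neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input for illustration only
chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.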

**2. Generator**

The generator uses a language model to create a response based on the retrieved information.

- LLMs: A large language model reads the user query and the retrieved text to write an answer.
- Querying: The system compares the user query with saved vectors to find matching text before generating a reply.
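
The querying step can be sketched as a nearest-neighbour search over stored vectors. The example below assumes cosine similarity as the metric; a real vector database may use a different metric (for example inner product or L2 distance) and an approximate index rather than a full scan. The helper names, the toy 2-dimensional vectors, and the `text`/`score` result shape are illustrative choices.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, index, k):
    """Score every (text, vector) pair in index and return the k most similar."""
    scored = [
        {"text": text, "score": cosine_similarity(query_vec, vec)}
        for text, vec in index
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; a real embedding model outputs far more dimensions.
index = [
    ("chunk about planning", [1.0, 0.0]),
    ("chunk about memory", [0.0, 1.0]),
    ("chunk about planning and memory", [1.0, 1.0]),
]
results = top_k_chunks([1.0, 0.1], index, k=2)
for r in results:
    print(r["text"], round(r["score"], 3))
```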


## Build RAG Agent

In [1]:
%pip install --upgrade 'aucodb[vectordb]'

Note: you may need to restart the kernel to use updated packages.


The blog post [LLM-Powered AI Agent, Lilian Weng](https://lilianweng.github.io/posts/2023-06-23-agent/) serves as the example knowledge source. We build a RAG agent that uses this post to answer questions about AI agent concepts.

In [2]:
import bs4
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader

# Load sample documents
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


## Retriever pipeline

Next, we build the retriever pipeline of the RAG agent to extract relevant documents from the blog post.

In [3]:
from langchain_huggingface import HuggingFaceEmbeddings
from aucodb.vectordb.factory import VectorDatabaseFactory
from aucodb.vectordb.processor import DocumentProcessor

# 1. Initialize embedding model
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# 2. Initialize document processor
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

doc_processor = DocumentProcessor(splitter=text_splitter)

# 3. Initialize vector database factory
db_type = "milvus"  # Supported types: ['chroma', 'faiss', 'milvus', 'pgvector', 'pinecone', 'qdrant', 'weaviate']
vectordb_factory = VectorDatabaseFactory(
    db_type=db_type,
    embedding_model=embedding_model,
    doc_processor=doc_processor
)

# 4. Store documents in the vector database
vectordb_factory.store_documents(docs)

# 5. Query the vector database
query = "What is Task Decomposition?"
top_k = 5
retrieved_docs = vectordb_factory.query(query, top_k)
for (i, doc) in enumerate(retrieved_docs):
    print(f"Document {i}: {doc}")


Document 0: {'text': 'Component One: Planning#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition#\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.\nTree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.', 'score': 0.7417395710945129}
Doc
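
Each result printed above is a dictionary with a `text` and a `score` field. One optional refinement, which is not part of the pipeline in this tutorial, is to drop weakly matching chunks before they reach the prompt. The sketch below assumes that a higher score means a closer match, which depends on the similarity metric configured in the vector database. The helper name `filter_by_score`, the sample data, and the `0.6` threshold are hypothetical.

```python
def filter_by_score(retrieved_docs, min_score):
    """Keep only results whose similarity score is at least min_score."""
    return [doc for doc in retrieved_docs if doc["score"] >= min_score]


# Made-up results for illustration, shaped like the dictionaries printed above
sample_docs = [
    {"text": "Task decomposition ...", "score": 0.74},
    {"text": "Unrelated passage ...", "score": 0.41},
]
kept = filter_by_score(sample_docs, min_score=0.6)
print(len(kept))
```

A suitable threshold has to be tuned for the embedding model and metric in use; there is no universal value.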

## Generator pipeline

To answer domain-specific questions, we integrate the retrieved documents with the query into an in-context prompt. First, let's join these documents into a single context string.

In [4]:
def format_docs_as_context(retrieved_docs):
    """Format retrieved documents into a readable context string."""
    context_parts = []
    
    for i, doc in enumerate(retrieved_docs):
        # Add document separator and numbering
        context_parts.append(f"Document {i+1}:\n{doc['text']}")
    
    return "\n\n".join(context_parts)

context = format_docs_as_context(retrieved_docs = retrieved_docs)
print(context)

Document 1:
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.

Document 2:
The AI assistant can parse user inp

Create a RAG prompt to combine the question with the provided context.

In [5]:
def create_rag_prompt(context, question):
    """Create a prompt that combines context and question for RAG."""
    # Each instruction line ends with "\n" so the bullets are not fused together
    prompt = (
        "You are a helpful assistant that answers questions based on the provided context.\n"
        "# Context:\n"
        f"{context}\n"
        f"# Question: {question}\n"
        "# Instructions:\n"
        "- Answer the question based on the information provided in the context above\n"
        "- If the context doesn't contain enough information to answer the question, say so\n"
        "- Be concise but comprehensive in your response\n"
        "Answer:"
    )
    return prompt

rag_prompt = create_rag_prompt(context = context, question = query)
print(rag_prompt)

You are a helpful assistant that answers questions based on the provided context.
# Context:
Document 1:
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a

Generate the answer with the LLM.

In [6]:
from langchain_together import ChatTogether 
from langchain_core.messages import SystemMessage, HumanMessage
from dotenv import load_dotenv
load_dotenv()

llm = ChatTogether(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free"
)

messages = [
    SystemMessage(content="You are a helpful assistant that answers questions based on provided context."),
    HumanMessage(content=rag_prompt)
]

response = llm.invoke(messages)
print(response.content)

Task decomposition refers to the process of breaking down a complex task into smaller, more manageable steps or sub-tasks. This can be achieved through various methods, including using chain of thought (CoT) prompting techniques, Tree of Thoughts, or relying on external classical planners with the Planning Domain Definition Language (PDDL). The goal of task decomposition is to transform big tasks into multiple simpler tasks, making it easier to understand and execute the overall task.


## RAG Agent

Let's organize the code into a single class called `RAGAgent`, which assembles the retrieval, prompt construction, and generation steps.

In [7]:
from typing import Dict, List
from langchain_core.language_models import BaseLanguageModel
from langchain_core.messages import SystemMessage, HumanMessage
from aucodb.vectordb.factory import VectorDatabaseFactory

class RAGAgent:
    def __init__(self, system_message: str, llm: BaseLanguageModel, retriever: VectorDatabaseFactory):
        # system_message is plain text; it is wrapped in a SystemMessage inside invoke()
        self.system_message = system_message
        self.llm = llm
        self.retriever = retriever

    def format_docs_as_context(self, retrieved_docs: List[Dict]) -> str:
        """Format retrieved documents into a readable context string."""
        context_parts = []

        for i, doc in enumerate(retrieved_docs):
            # Number each document; every result is a dict with a 'text' field
            context_parts.append(f"Document {i+1}:\n{doc['text']}")

        return "\n\n".join(context_parts)

    def create_rag_prompt(self, context: str, question: str) -> str:
        """Create a prompt that combines context and question for RAG."""
        prompt = (
            "You are a helpful assistant that answers questions based on the provided context.\n"
            "# Context:\n"
            f"{context}\n"
            f"# Question: {question}\n"
            "# Instructions:\n"
            "- Answer the question based on the information provided in the context above\n"
            "- If the context doesn't contain enough information to answer the question, say so\n"
            "- Be concise but comprehensive in your response\n"
            "Answer:"
        )
        return prompt

    def retrieved_docs(self, query: str, top_k: int):
        """Return the top_k results for the query from the vector database."""
        retrieved_docs = self.retriever.query(query, top_k)
        return retrieved_docs

    def invoke(self, question: str, top_k: int=5):
        # Step 1: Retrieve relevant documents
        retrieved_docs = self.retrieved_docs(question, top_k)

        # Step 2: Format documents as context
        context = self.format_docs_as_context(retrieved_docs)

        # Step 3: Create the prompt
        prompt = self.create_rag_prompt(context, question)

        # Step 4: Generate response using the agent's own LLM
        messages = [
            SystemMessage(content=self.system_message),
            HumanMessage(content=prompt)
        ]

        response = self.llm.invoke(messages)
        return response

Initialize a RAGAgent instance.

In [8]:
from langchain_huggingface import HuggingFaceEmbeddings
from aucodb.vectordb.processor import DocumentProcessor
from langchain_together import ChatTogether 
from dotenv import load_dotenv
load_dotenv()

llm = ChatTogether(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free"
)

embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

doc_processor = DocumentProcessor(splitter=text_splitter)

# Initialize vector database factory with milvus and store documents
db_type = "milvus"  # Supported types: ['chroma', 'faiss', 'milvus', 'pgvector', 'pinecone', 'qdrant', 'weaviate']
vectordb_factory = VectorDatabaseFactory(
    db_type=db_type,
    embedding_model=embedding_model,
    doc_processor=doc_processor
)

vectordb_factory.store_documents(docs)

# Initialize RAG agent
rag_agent = RAGAgent(
    system_message="You are an AI assistant that can answer user queries based on the retrieved context",
    llm=llm,
    retriever=vectordb_factory
)

Invoking the agent.

In [9]:
answer = rag_agent.invoke("What is Task Decomposition?")
print(answer.content)

Task decomposition refers to the process of breaking down a complex task into smaller, more manageable steps or sub-tasks. This can be achieved through various methods, including using chain of thought (CoT) prompting techniques, Tree of Thoughts, or relying on external classical planners with the Planning Domain Definition Language (PDDL). The goal of task decomposition is to transform big tasks into multiple simpler tasks, making it easier to understand and execute the overall task.
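
The agent relies on only two calls: `query(query, top_k)` on the retriever and `invoke(messages)` on the LLM. That makes the flow easy to exercise without a vector database or an API key. The sketch below is a pure-Python illustration using hypothetical stand-ins (`FakeRetriever`, `FakeLLM`, and the helper `run_rag`); it mirrors the retrieve, format, prompt, and generate steps with plain tuples in place of LangChain message objects, and it is not part of `aucodb` or LangChain.

```python
from types import SimpleNamespace


class FakeRetriever:
    """Hypothetical stand-in exposing query(query, top_k) and returning canned results."""

    def __init__(self, docs):
        self.docs = docs

    def query(self, query, top_k):
        return self.docs[:top_k]


class FakeLLM:
    """Hypothetical stand-in exposing invoke(messages); it records what it was sent."""

    def __init__(self):
        self.last_messages = None

    def invoke(self, messages):
        self.last_messages = messages
        return SimpleNamespace(content="stub answer")


def run_rag(retriever, llm, question, top_k=5):
    """Retrieve, format the context, build the prompt, then call the LLM."""
    docs = retriever.query(question, top_k)
    context = "\n\n".join(
        f"Document {i + 1}:\n{doc['text']}" for i, doc in enumerate(docs)
    )
    prompt = f"# Context:\n{context}\n# Question: {question}\nAnswer:"
    return llm.invoke([("system", "Answer from the context."), ("human", prompt)])


retriever = FakeRetriever([
    {"text": "alpha", "score": 0.9},
    {"text": "beta", "score": 0.5},
])
llm = FakeLLM()
response = run_rag(retriever, llm, "What is alpha?", top_k=1)
print(response.content)
print(llm.last_messages[1][1])  # the prompt the LLM received
```

Checking the recorded prompt is a cheap way to confirm that only the top-ranked chunks reach the model before spending tokens on a real LLM call.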
