<a href="https://colab.research.google.com/github/lawrencejesse/2023-2034-Lawrence-Ranch-NDVI/blob/main/Agentic_RAG_with_HuggingFace_smolagents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Agentic RAG with Hugging Face smolagents vs Vanilla RAG

Author: [@MariaKhalusova](https://x.com/mariaKhalusova)

Last updated: Jan 9th, 2025

## What you'll learn:

1. Parsing PDF documents from S3 into DataStax AstraDB with Unstructured Platform
2. Building Vanilla RAG in pure Python without using specialized frameworks
3. Differences between Vanilla RAG and Agentic RAG
4. Creating Agentic RAG with Hugging Face `smolagents` library
5. Whether Agentic RAG can produce better answers (spoiler: it can!)

In Vanilla RAG, your system uses the user's question to perform a single retrieval step and get a batch of documents that are meant to be relevant to the query. These documents are then passed on to the LLM to generate an answer grounded in the context of those documents.

However, this approach has limitations. If the results of the retrieval are inadequate (either irrelevant or incomplete), this will have a direct negative impact on generation. There are many different methods one can employ to improve the retrieval quality, such as choosing a better embedding model, switching to a different retrieval method (e.g., BM25, or hybrid, metadata filtering, etc.), increasing the number of retrieved documents, and adding a reranker. However, there may still be situations where a single retrieval step, or retrieving based on the user query "as is," may not produce optimal results.

In this tutorial, we will build a simple Agentic RAG application that will use a retriever as a tool and will be able to:
* Reformulate the user query to improve the retrieval results.
* Review the results.
* Retrieve more context, if needed.

This should allow the RAG application to better answer complex questions, for example, ones that require query decomposition and multiple retrieval steps.
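As a rough illustration of that loop, here is a toy, dependency-free sketch. Everything in it is made up for the example: the two-document `CORPUS`, the top-1 keyword `toy_retrieve` function, and the rule-based `reformulate` function that stands in for the LLM. It only shows why several targeted retrievals can surface context that a single retrieval on the raw question misses.

```python
# Toy corpus standing in for a vector store (hypothetical data).
CORPUS = {
    "chevron": "Chevron offers an annual cash bonus plan.",
    "walmart": "Walmart offers a stock incentive plan.",
}


def toy_retrieve(query: str) -> list[str]:
    """Top-1 keyword lookup: return at most one matching document."""
    q = query.lower()
    for key, text in CORPUS.items():
        if key in q:
            return [text]
    return []


def reformulate(question: str) -> list[str]:
    """Stand-in for the LLM: build one focused query per company mentioned."""
    q = question.lower()
    return [f"{name} employee incentive plans" for name in CORPUS if name in q]


def agentic_retrieve(question: str, max_steps: int = 3) -> list[str]:
    """Run the reformulated queries one per step and accumulate new context."""
    context: list[str] = []
    pending = reformulate(question) or [question]
    for _ in range(max_steps):
        if not pending:  # review step: nothing left to look up
            break
        query = pending.pop(0)
        for doc in toy_retrieve(query):
            if doc not in context:
                context.append(doc)
    return context


question = "How do incentive plans differ between Chevron and Walmart?"
single_step_docs = toy_retrieve(question)   # one retrieval on the raw question
agentic_docs = agentic_retrieve(question)   # one retrieval per reformulated query
print(len(single_step_docs), len(agentic_docs))
```

In the real application built below, the LLM decides how to rephrase the query and when to stop, and the retriever is a similarity search rather than a keyword lookup.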

There are several frameworks available for building agentic RAG. In this tutorial, we'll use a recent library from Hugging Face called [`smolagents`](https://github.com/huggingface/smolagents). The library is lightweight and easy to get started with for building agentic applications, including but not limited to Agentic RAG.

## Preparing the data

Every RAG application starts with data, and most of the time - unstructured data (PDFs, Word documents, SharePoint files, emails, etc.). Preprocessing this type of data to make it available for retrieval can be a challenging task. [Unstructured Platform](https://unstructured.io/) significantly simplifies this process - it can connect to any data sources you may have in your organization, preprocess the data from those sources making it RAG-ready, and upload the results into your database of choice.

To start transforming your data with Unstructured Platform, you'll need to [sign up on the Unstructured For Developers page](https://unstructured.io/developers). Once you do, you can log into the Platform and process up to 1000 pages per day for free for the first 14 days.

In this tutorial, our data will consist of annual 10-K SEC filings from Walmart Inc., Chevron Corporation, and Costco Wholesale Corporation for the 2023 fiscal year. These reports offer a deep insight into each company's financial performance that year. The documents are originally in PDF format and we have them stored in an Amazon S3 bucket. After preprocessing, we'll store the document chunks with their embeddings in DataStax AstraDB for retrieval. Here is what we need to do to prepare the data:
* Create an S3 _source connector_ in Unstructured Platform to connect it to the documents
* Create an AstraDB _destination connector_ in Unstructured Platform to upload the processed documents
* Create a _workflow_ that starts with a source connector, adds data transformation steps (such as extracting content of the PDFs with Anthropic Claude Sonnet, enriching the documents with metadata, chunking the text, and generating embedding vectors for the similarity search), and then ends with uploading the results into the destination.

Let's briefly go over these steps.

### Create an S3 source connector in Unstructured Platform

Log in to your Unstructured Platform account, click `Connectors` on the left side bar, make sure you have `Sources` selected, and click `New` to create a new source connector. Alternatively, use this [direct link](https://platform.unstructured.io/connectors/editor/new/sources). Choose S3, and enter the required info about your bucket.

<img src="https://framerusercontent.com/images/I1hhUk4xRAheCxMOLgrXZZiO0.png" alt="S3 connector settings" width="500"/>

### Create an AstraDB destination connector in Unstructured Platform

Create an account on [datastax.com](https://www.datastax.com/), and create a new Serverless (Vector) Database. Once it's instantiated, grab your credentials (the API endpoint and an application token) and save them. If you need help getting started with AstraDB, refer to [their documentation](https://docs.datastax.com/en/astra-db-serverless/get-started/quickstart.html).

In the database, create a collection. Give it a name, then for the embedding generation method choose `Bring my own`, as we will generate the embeddings with Unstructured Platform. The dimensions value should be set to 3072 in this example, as we'll be using the `"text-embedding-3-large"` model from OpenAI.
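The collection's dimension has to match the length of the vectors you write to it and query it with. A small guard like the following can catch a mismatch early. This is only a sketch; the helper name `check_embedding_dimension` is hypothetical and isn't part of any library used in this tutorial.

```python
EXPECTED_DIMENSION = 3072  # default output size of text-embedding-3-large


def check_embedding_dimension(vector: list[float], expected: int = EXPECTED_DIMENSION) -> None:
    """Raise a ValueError when a vector's length differs from the collection's dimension."""
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, but the collection expects {expected}."
        )


# A vector of the right length passes silently.
check_embedding_dimension([0.0] * 3072)
```

You could call such a check on a query embedding before sending it to the database, for example after switching embedding models.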

Now you can create a destination connector for AstraDB in Unstructured Platform [here](https://platform.unstructured.io/app/connectors/editor/new/destinations/).

<img src="https://framerusercontent.com/images/Szq022IHqD04mAjyIUlYgdNcVNM.png" alt="S3 connector settings" width="500"/>

<img src="https://framerusercontent.com/images/sJmB9GJ8JhZrPwIm6NccnP82GM.png" alt="S3 connector settings" width="500"/>





### Create a workflow in Unstructured Platform

Navigate to the `Workflows` tab in Unstructured Platform, and click `New workflow`. Choose `Build it with Me` option to set up the workflow with pre-configured options.

First, choose your source and destination using the connectors that you've just created.

Next, select "Platinum" workflow that will use Anthropic Claude Sonnet to preprocess the files:

<img src="https://framerusercontent.com/images/TRUyuKsfDzmjY5YSE76cdmwreLI.png" alt="S3 connector settings" width="500"/>

Optionally, set a schedule. In this example we don't need it.

That's it! Once the workflow is configured, run it, and wait for the job to finish. When completed, you can review the results of the job and what steps Unstructured Platform performed. If there are any errors, you'll find the information about the causes to help you troubleshoot:

<img src="https://i.ibb.co/Y7jyLC3/completed-job.png" alt="Completed job" width="500"/>

Now, let's build RAG!


## Setup

Run the line below to install required dependencies:

* `smolagents`: to configure agentic RAG
* `astrapy`: to connect to AstraDB and query it
* `python-dotenv`: to manage environment variables


In [None]:
!pip install --upgrade -q smolagents astrapy python-dotenv

Create a local `.env` file that contains the following environment variables, and upload it to your notebook's directory.

* `ASTRA_DB_APPLICATION_TOKEN`
* `ASTRA_DB_API_ENDPOINT`
* `ASTRA_DB_COLLECTION_NAME`
* `ASTRA_DB_NAMESPACE`
* `OPENAI_API_KEY`

When we preprocessed the data, we generated embeddings using a model from OpenAI, so we need the key to embed user queries with the same model. For convenience, we'll also use an LLM from OpenAI for generation, and it will be the same for Vanilla RAG and Agentic RAG.

In [None]:
import os
from dotenv import load_dotenv

def load_environment_variables(path_to_dot_env_file) -> None:
    """
    Load environment variables from .env file.
    Raises an error if critical environment variables are missing.
    """
    load_dotenv(path_to_dot_env_file)
    required_vars = [
        "ASTRA_DB_APPLICATION_TOKEN",
        "ASTRA_DB_API_ENDPOINT",
        "ASTRA_DB_COLLECTION_NAME",
        "ASTRA_DB_NAMESPACE",
        "OPENAI_API_KEY"
    ]

    for var in required_vars:
        if not os.getenv(var):
            raise ValueError(f"Missing required environment variable: {var}")

load_environment_variables('/content/.env')

## Connect to your AstraDB collection and set up an OpenAI client

In [None]:
from openai import OpenAI
from astrapy import DataAPIClient

In [None]:
OPENAI_CLIENT = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
EMBEDDING_MODEL = "text-embedding-3-large"
GENERATION_MODEL = "gpt-4o-2024-11-20"

In [None]:
def get_collection(collection_name: str, keyspace: str):
    """
    Establish connection to Astra DB and return the specified collection.
    Args:
        collection_name (str): Name of the collection to retrieve
        keyspace (str): Database keyspace
    Returns:
        Collection object from Astra DB
    """

    astra_client = DataAPIClient(os.getenv("ASTRA_DB_APPLICATION_TOKEN"))
    database = astra_client.get_database(os.getenv("ASTRA_DB_API_ENDPOINT"))

    astradb_collection = database.get_collection(name=collection_name,
                                                 keyspace=keyspace)

    print(f"Collection: {astradb_collection.full_name}\n")
    return astradb_collection


In [None]:
COLLECTION = get_collection(os.getenv("ASTRA_DB_COLLECTION_NAME"), os.getenv("ASTRA_DB_NAMESPACE"))

Collection: default_keyspace.pdf_vlm_collection



In [None]:
def get_embedding(text: str):
    """
    Generate embedding for given text using OpenAI's embedding model.

    Args:
        text (str): Input text to embed

    Returns:
        Embedding vector for the input text
    """
    return OPENAI_CLIENT.embeddings.create(
        input=text, model=EMBEDDING_MODEL
    ).data[0].embedding


## Vanilla RAG

For Vanilla RAG, we'll create a simple retriever that runs a similarity search on the query and returns the top 5 documents by default:

In [None]:
def simple_retriever(query: str, n=5):
    """
    Retrieve documents based on the given query using similarity search

    Args:
        query (str): query to pass to the DB
        n: Number of documents to retrieve

    Returns:
        A single formatted string containing the retrieved documents' texts
    """

    query_embedding = get_embedding(query)

    results = COLLECTION.find(sort={"$vector": query_embedding}, limit=n)
    docs = [doc["content"] for doc in results]

    return  "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {str(i)} =====\n" + doc
                for i, doc in enumerate(docs)
            ]
        )
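To make the similarity search less of a black box, here is a minimal in-memory sketch of what sorting by vector conceptually does. It is not how AstraDB implements it (a real vector store uses an index rather than a full scan), and the metric depends on how the collection is configured; cosine similarity is assumed here. The three-dimensional vectors and document texts are invented for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vector: list[float], documents: list[dict], n: int = 5) -> list[str]:
    """Rank documents by similarity to the query vector and keep the n best."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc["$vector"]),
        reverse=True,
    )
    return [doc["content"] for doc in ranked[:n]]


# Made-up documents with tiny vectors; real embeddings here have 3072 dimensions.
toy_documents = [
    {"content": "Costco merchandise categories", "$vector": [0.9, 0.1, 0.0]},
    {"content": "Chevron incentive plan", "$vector": [0.0, 0.2, 0.9]},
    {"content": "Walmart store formats", "$vector": [0.6, 0.7, 0.1]},
]

closest = top_n([1.0, 0.0, 0.0], toy_documents, n=2)
print(closest)
```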

Now the whole RAG can be implemented with one simple function:

In [None]:
from typing import List
def vanilla_rag(question: str):
    """
    Generate an answer based on retrieved documents and user question.

    Args:
        question (str): User's input question
    Returns:
        LLM-generated answer
    """

    prompt = (
        "You are an assistant that can answer user questions given provided context. "
        "Provide an answer based on the provided context and nothing else, do not make generalizations. "
        "If you don't know the answer, or no documents are provided, "
        "say 'I do not have enough context to answer the question.'"
    )

    # retrieve documents using the simple retriever, 5 documents by default
    relevant_documents = simple_retriever(question)

    # add user question and the docs to the prompt
    augmented_prompt = (
        f"{prompt}\n\n"
        f"User question: {question}\n\n"
        f"Retrieved documents to use as context:\n\n {relevant_documents}"
    )

    # pass everything to the LLM to generate an answer
    response = OPENAI_CLIENT.chat.completions.create(
        messages=[
            {'role': 'system', 'content': 'You answer users questions.'},
            {'role': 'user', 'content': augmented_prompt},
        ],
        model=GENERATION_MODEL,
        temperature=0,
    )

    return response.choices[0].message.content


Let's try it out with a simple question:

In [None]:
question = "What are Costco's merchandise categories?"
vanilla_rag(question)

"Costco's merchandise categories include:\n\n1. **Core Merchandise Categories (Core Business):**\n   - **Foods and Sundries:** Sundries, dry grocery, candy, cooler, freezer, deli, liquor, and tobacco.\n   - **Non-Foods:** Major appliances, electronics, health and beauty aids, hardware, garden and patio, sporting goods, tires, toys and seasonal, office supplies, automotive care, postage, tickets, apparel, small appliances, furniture, domestics, housewares, special order kiosk, and jewelry.\n   - **Fresh Foods:** Meat, produce, service deli, and bakery.\n\n2. **Warehouse Ancillary Businesses:**\n   - Gasoline, pharmacy, optical, food court, hearing aids, and tire installation.\n\n3. **Other Businesses:**\n   - E-commerce, business centers, travel, and other services."

This worked just fine, because an answer to this question is located in a single paragraph that can be reliably retrieved with similarity search. Now that you've seen how Vanilla RAG works, let's talk about what's different in Agentic RAG.

## Agentic RAG

There are many definitions of what an "AI agent" is, for example:

* "An AI agent is meant to accomplish tasks typically provided by the users. In an AI agent, AI is the brain that processes the task, plans a sequence of actions to achieve this task, and determines whether the task has been accomplished." by Chip Huyen
* "AI Agents are programs where LLM outputs control the workflow." by Hugging Face smolagents team

An Agent typically has access to Tools that help it get additional information and/or perform actions. A tool can be a retriever, a function that does a web search, a calculator, an image generator, and so on.

Tools help the LLM agent overcome some of its limitations. For example, a retriever tool, just like in Vanilla RAG, can help get additional information, and a calculator might be useful since AI models aren't great at math.
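To make the idea of a tool concrete, here is a plain-Python sketch of a tool registry and a dispatcher. It is not how `smolagents` implements tools. The `TOOLS` registry, the `run_tool_call` helper, the stub `fake_retriever`, and the shape of the JSON tool call are all assumptions made for illustration.

```python
import json
import operator

# Supported arithmetic operators for the toy calculator.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate a simple 'a op b' expression such as '611 * 2'."""
    left, symbol, right = expression.split()
    return str(OPERATORS[symbol](float(left), float(right)))


def fake_retriever(query: str) -> str:
    """Stub retriever; a real one would query a vector store."""
    return f"(documents matching '{query}' would be returned here)"


# Each tool pairs a description (shown to the LLM) with the function to run.
TOOLS = {
    "calculator": {"description": "Evaluates 'a op b' arithmetic.", "function": calculator},
    "retriever": {"description": "Finds relevant documents.", "function": fake_retriever},
}


def run_tool_call(tool_call_json: str) -> str:
    """Parse a JSON tool call produced by an LLM and execute the matching tool."""
    call = json.loads(tool_call_json)
    tool = TOOLS[call["name"]]
    return tool["function"](**call["arguments"])


# Pretend the LLM produced this tool call.
llm_output = '{"name": "calculator", "arguments": {"expression": "611 * 2"}}'
result = run_tool_call(llm_output)
print(result)
```

The LLM never runs anything itself: it only emits a description of the call, and the surrounding program executes the tool and feeds the result back.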

For this Agentic RAG example we'll build an agent that can rephrase a query if needed, call the same simple retriever as before as a tool, but it will have an option to call this tool more than once to retrieve additional information to improve the answer.

Let's see how we can do this with `smolagents`:

## Agentic RAG with `smolagents`


First, we'll create a `RetrieverTool` class.

At the core of the tool is a function that an LLM can use in an agentic system.
However, to use this function, the LLM needs to be given its API:
* `name`: the name under which the tool is exposed to the LLM.
* `description`: used to populate the agent's system prompt and inform the LLM about the tool's capabilities.
* `inputs`: the inputs the tool accepts, with their types and descriptions.
* `output_type`: the type of the value the tool returns.
* `forward` method: the "main" function to be executed when the tool is called.


Note that here we take the same functions as before (`get_embedding` and `simple_retriever`) and simply wrap them in a `RetrieverTool` class to make the same simple retriever usable as a tool for the Agent.

In [None]:
from smolagents import Tool

class RetrieverTool(Tool):
    name = "retriever"
    description = "Uses semantic search to retrieve documents that could be relevant to answer the query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to the target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, collection, openai_client, **kwargs):
        super().__init__(**kwargs)
        self.retriever = collection
        self.embedder = openai_client

    def get_embedding(self, text: str):
        return self.embedder.embeddings.create(
            input=text, model=EMBEDDING_MODEL
            ).data[0].embedding

    def simple_retriever(self, query: str, n: int = 5) -> str:
        # embed the query with the tool's own client rather than the global helper
        query_embedding = self.get_embedding(query)
        results = self.retriever.find(sort={"$vector": query_embedding}, limit=n)
        docs = [doc["content"] for doc in results]
        return "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {str(i)} =====\n" + doc
                for i, doc in enumerate(docs)
            ]
        )

    def forward(self, query: str) -> str:
        assert isinstance(query, str)

        # call the method on this tool instance, not the module-level function
        docs = self.simple_retriever(query)

        return docs

# initialize a retriever with the AstraDB collection and OpenAI Client for embedding
retriever_tool = RetrieverTool(COLLECTION, OPENAI_CLIENT)

To create an Agent, we have two options:
* `ToolCallingAgent` generates tool calls as JSON under the hood, which is the standard industry approach at the moment.
* `CodeAgent` is a different type of agent that writes its tool calls as snippets of Python code, which works really well for LLMs that have strong coding performance.

In this example, we're using `"gpt-4o-2024-11-20"`, and we'll try both the `ToolCallingAgent` and the `CodeAgent` to see how they work.

`smolagents` natively integrates with open-source models from Hugging Face, and also works with OpenAI and Anthropic models via `LiteLLMModel`.
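To see why code-style tool calls can be convenient, consider the sketch below, in which a single generated snippet chains two tool calls in one step. It uses Python's built-in `exec` purely for illustration. `smolagents` runs generated code through its own restricted interpreter, and plain `exec` is not safe for untrusted code. The stub `retriever` and the `run_generated_code` helper are hypothetical.

```python
import contextlib
import io


def retriever(query: str) -> str:
    """Stub tool; a real one would query the vector store."""
    return f"docs for: {query}"


def run_generated_code(code: str, tools: dict) -> str:
    """Execute a snippet with only the given tools and print() in scope; return what it printed."""
    namespace = {"__builtins__": {"print": print}, **tools}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # illustration only, not a sandbox
    return buffer.getvalue()


# Pretend the LLM wrote this snippet: two tool calls in a single step.
generated_code = (
    'walmart_info = retriever(query="Walmart target audience")\n'
    'costco_info = retriever(query="Costco target audience")\n'
    "print(walmart_info)\n"
    "print(costco_info)\n"
)

observation = run_generated_code(generated_code, {"retriever": retriever})
print(observation)
```

With JSON tool calls, the same two lookups would typically be expressed as two separate call descriptions, each executed by the surrounding program.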

## ToolCallingAgent

In [None]:
from smolagents import ToolCallingAgent, LiteLLMModel

model = LiteLLMModel(model_id=GENERATION_MODEL)

agent = ToolCallingAgent(tools=[retriever_tool], model=model, verbose=True)

That's all it takes to set it up! Now let's see how it compares to Vanilla RAG on a slightly trickier question.

In [None]:
question = "How do employee incentive plans differ between Chevron and Walmart?"

First, let's get an answer from Vanilla RAG:

In [None]:
vanilla_rag(question)

"Based on the provided context, employee incentive plans at Walmart and Chevron differ in the following ways:\n\n1. **Walmart**:\n   - Walmart's Stock Incentive Plan includes stock options, restricted stock, restricted stock units, performance share units, and other equity compensation awards. These are designed to align employee interests with those of shareholders.\n   - Walmart also offers incentive payments tied to deferred compensation and bonuses. For example, employees who remain continuously employed for 10 or 15 consecutive years may receive incentive payments equal to 20% of their recognized deferred compensation and bonuses for specific periods, plus credited plan earnings.\n\n2. **Chevron**:\n   - Chevron's incentive approach is not explicitly detailed in the provided documents. However, the company emphasizes competitive total compensation linked to individual and enterprise performance, as well as long-term employee development and retention. Chevron's focus appears to be

Now, let's see what the Agent does and how it answers:

In [None]:
agent_output = agent.run(question)

print("Final answer:")
print(agent_output)

Final answer:
Employee incentive plans differ between Chevron and Walmart in several ways:

1. **Chevron's Incentive Plans**:
   - Chevron offers a **Chevron Incentive Plan**, which ties annual cash bonuses to both corporate and individual performance from the prior year.
   - The company also provides a **Long-Term Incentive Plan (LTIP)**, featuring awards like stock options, restricted stock units, stock appreciation rights, and performance shares. These grants are meant for employees with significant responsibilities and ratify over multiple years.
   - The 2022 LTIP caps shares issued to 104 million until May 2032, offering a mix of cash and share-based incentives depending on the award type. Share vesting terms vary between 3 and 10 years based on the award type.

2. **Walmart's Incentive Plans**:
   - Walmart emphasizes **equity compensation** through its Stock Incentive Plan, providing stock options, restricted stock units, performance share units, and other equity-based awards 

As you can see, the Agent decomposed the original question into two individual queries - `'Chevron employee incentive plans'` and `'Walmart employee incentive plans'`, and retrieved documents for each of them before generating the final answer that compares all of the information.  

Let's take a closer look at the final answers.

The Agentic RAG answer is more accurate and more faithful to the source for the following reasons:

* Vanilla RAG didn't retrieve enough documents - this is explicitly stated in the Vanilla RAG answer: _"Chevron's incentive approach is not explicitly detailed in the provided documents."_

* The summary and conclusion in Vanilla RAG's answer are too general, not providing enough specificity about the actual incentive programs offered by each company. Agentic RAG provides a more truthful summary of the key differences.

* Agentic RAG provides a more detailed and accurate picture of Chevron's incentive plans, correctly identifying both short-term (Annual Cash Bonus Plan) and long-term (LTIP) incentives. It also accurately notes that the LTIP includes stock options, stock appreciation rights, restricted stock units, and performance shares. This is because the Agent was actually able to retrieve this information in an additional step.


## CodeAgent

Just as easily we can build a CodeAgent and ask it a question.

In [None]:
from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(model_id=GENERATION_MODEL)

agent = CodeAgent(tools=[retriever_tool], model=model, verbose=True)

In [None]:
another_question = "Does Walmart cater to the same target audience as Costco?"

In [None]:
vanilla_rag(another_question)

'I do not have enough context to answer the question.'

In [None]:
agent_output = agent.run(another_question)

print("Final answer:")
print(agent_output)

Final answer:
No, Walmart and Costco do not cater to the same target audience. Walmart broadly targets busy families with an omni-channel model and low prices, while Costco appeals to those seeking bulk purchasing through a membership-based model.


In this example, Vanilla RAG fully failed to answer: _`I do not have enough context to answer the question.`_

While the question may seem simple, it requires getting information about Walmart's target audience, Costco's target audience, and then comparing the two. A single step of retrieval on the original question may not be sufficient in this case.

Agentic RAG split the question into two queries, collected the data, and made a comparison:

_No, Walmart and Costco do not cater to the same target audience. Walmart broadly targets busy families with an omni-channel model and low prices, while Costco appeals to those seeking bulk purchasing through a membership-based model._

Note that CodeAgent, as the name implies, calls the tools using Python code:

```py
walmart_info = retriever(query="What is Walmart's target audience?")
print("Walmart's target audience information:", walmart_info)
costco_info = retriever(query="What is Costco's target audience?")
print("Costco's target audience information:", costco_info)
```

**Next steps**:

If you'd like to improve the results further, regardless of which type of Agent you use, consider customizing the Agent's system prompt. The default system prompts that the `smolagents` library uses for tool-calling and code agents are [here on GitHub](https://github.com/huggingface/smolagents/blob/681758ae84a8075038dc676d8af7262077bd00c3/src/smolagents/prompts.py). You can provide a custom `system_prompt` when initializing your agent. Try it with examples that are more representative of your use case, modified instructions, or promise a bigger imaginary bribe to your LLM for performing the task 😆.


## Conclusion

As demonstrated in this tutorial, Agentic RAG can deliver higher-quality answers compared to Vanilla RAG by rephrasing queries, decomposing tasks, and leveraging multiple retrieval steps—all while using the same retriever, documents, and generation LLM. However, this improvement comes at a cost.

First, Agentic RAG really should be using a more powerful LLM for three key reasons:

* Longer context windows are necessary to handle multiple steps.
* Error propagation can occur as agents perform multi-step processes, compounding mistakes if not managed carefully.
* If you prefer to use `CodeAgent`, you need an LLM that has advanced coding capabilities, otherwise it just won't work.

Second, Agentic RAG may be slower and will use more tokens. While an AI agent can take multiple steps to improve results, it can also quickly burn through API credits. Choosing between Agentic RAG and Vanilla RAG requires careful consideration of your specific use case, performance requirements, and cost constraints.
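A back-of-the-envelope estimate shows how quickly the token count grows. The sketch below assumes that every agent step re-sends the question plus all context retrieved so far, and it ignores system prompts and generated tokens. The token counts are hypothetical round numbers, not measurements from this tutorial.

```python
def vanilla_prompt_tokens(question_tokens: int, tokens_per_retrieval: int) -> int:
    """One LLM call that sees the question and a single batch of documents."""
    return question_tokens + tokens_per_retrieval


def agent_prompt_tokens(question_tokens: int, tokens_per_retrieval: int, retrievals: int) -> int:
    """One LLM call before each retrieval plus a final call; call k re-reads k earlier retrievals."""
    return sum(
        question_tokens + tokens_per_retrieval * k
        for k in range(retrievals + 1)
    )


# Hypothetical sizes: a 50-token question and 2,000 tokens per batch of retrieved documents.
vanilla_total = vanilla_prompt_tokens(50, 2000)
agent_total = agent_prompt_tokens(50, 2000, retrievals=2)
print(vanilla_total, agent_total)
```

Under these assumptions, an agent that retrieves twice reads about three times as many prompt tokens as the single Vanilla RAG call, and the gap widens with every additional step.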

Lastly, regardless of the RAG approach you choose, success hinges on the quality and relevance of the data available for retrieval. Agentic RAG can reformulate queries or perform multiple retrievals to uncover information, but if that information isn't stored in AstraDB—or any other vector store—no retrieval strategy will compensate.

**Good RAG starts with well-prepared data, and the [Unstructured Platform](https://unstructured.io/developers) simplifies this critical first step.** By enabling efficient ingestion, partitioning, and metadata enrichment of unstructured data, it ensures that your RAG pipeline is built on a solid foundation, unlocking its full potential.

