<a href="https://colab.research.google.com/drive/14a8Jsf-sT5ofUnluqGrnBBt5kst0FsDA" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Dive Agentic Retrieval Augmented Generation

Agentic RAG is useful when the system must reason about which action(s) to take and in which order to take them. Instead of calling an LLM directly, we use agents to accomplish tasks that require planning, multi-step reasoning, tool use, and/or learning over time. Agents give us agency!

Agency: The ability to take action or to choose what action to take.

In the context of RAG, we can plug in agents at three points: before the pipeline, to reason about which RAG pipeline to select; within a pipeline, for retrieval or reranking; and at the end, to synthesise the response before we send it out. This improves RAG considerably by automating the complex workflows and decisions that a non-trivial RAG use case requires.

### Purpose of this Agentic RAG
This notebook presents a practical implementation of Agentic Retrieval-Augmented Generation (RAG)—a system where decision-making and tool selection are delegated to an intelligent agent before executing a response. Rather than passing every query through a static RAG pipeline, this system introduces agency—the ability to choose the best course of action depending on the nature of the query.

At the heart of this implementation is a router prompt, which classifies user queries into one of three categories:

- OpenAI documentation: Queries related to tools, APIs, or usage guidelines for OpenAI models
- 10-K financial reports: Questions requiring retrieval from company filings or financial datasets
- Live Internet search: Broader, current, or comparative queries that need web access

Once the query is classified, the system invokes a corresponding route handler:

- For OpenAI and 10-K queries, it retrieves relevant context from a vector database (Qdrant) using text embeddings, then applies a RAG-based response generator.
- For Internet queries, it fetches real-time information using a web-access API (ARES).

This approach is an example of Agentic RAG, where reasoning precedes retrieval and generation. By plugging in agents before and within the RAG pipeline, we make the system smarter and more adaptive. This allows us to:

- Automatically choose the right retrieval method based on context
- Combine structured knowledge with real-time search
- Scale RAG beyond trivial use cases by integrating multi-step decision logic

Importantly, no external agentic frameworks are used—this is a ground-up implementation that demonstrates how to build a lightweight but intelligent agentic system using only a language model, prompt engineering, and retrieval tools.
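The classify-then-dispatch flow described above can be sketched in a few lines. This is only an illustration: `classify` is a hypothetical keyword rule standing in for the LLM router, and the two handlers are stubs rather than the Qdrant and ARES functions built later in this notebook.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the LLM router: a crude keyword rule, for illustration only.
def classify(query: str) -> str:
    q = query.lower()
    if "openai" in q or "gpt" in q:
        return "OPENAI_QUERY"
    if "10-k" in q or "revenue" in q:
        return "10K_DOCUMENT_QUERY"
    return "INTERNET_QUERY"

# Stub handlers; the real ones query a vector database or a web-search API.
def answer_from_vector_db(query: str, action: str) -> str:
    return f"[vector DB / {action}] {query}"

def answer_from_web(query: str, action: str) -> str:
    return f"[web search] {query}"

# One handler per route label; two labels share the vector-database handler.
HANDLERS: Dict[str, Callable[[str, str], str]] = {
    "OPENAI_QUERY": answer_from_vector_db,
    "10K_DOCUMENT_QUERY": answer_from_vector_db,
    "INTERNET_QUERY": answer_from_web,
}

def handle(query: str) -> str:
    # Reasoning (classification) happens first, then the chosen handler runs.
    action = classify(query)
    return HANDLERS[action](query, action)

print(handle("What was Uber's revenue in 2021?"))
```

Keeping the routes in a dictionary means a new data source only needs a new label and a new handler; the dispatch logic itself stays unchanged.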

## Setup and Dependencies

In [None]:
# Install the necessary libraries
!pip install openai qdrant_client transformers pydantic==2.11.4 litellm --quiet

In [None]:
import json                 # For parsing and structuring JSON data (especially OpenAI and routing responses)

# Google Colab-specific (for securely handling API keys)
from google.colab import userdata  # To securely store and retrieve credentials in Colab

# OS operations
import os                   # Useful for accessing environment variables and managing paths

# Embedding models (used for text vectorization during retrieval)
from transformers import AutoTokenizer, AutoModel  # For loading custom transformer models if not using OpenAI embeddings

# Vector database client
from qdrant_client import AsyncQdrantClient  # Qdrant is used as the vector store to retrieve documents based on similarity

import litellm
import asyncio
import datetime

from IPython.display import display, Markdown

## 1. Defining the Internet Tool

First, we will define a tool function that enables our system to answer queries requiring real-time, internet-based information. Not all questions can be answered using static documents like OpenAI docs or financial filings—sometimes users ask about current trends, comparisons, or live updates.

To handle this, we introduce a live search capability using the **ARES API** by Traversaal.

### What is ARES API?  
ARES is a web-based tool that allows you to:

- Search the internet in real time.
- Get LLM-generated answers based on live search results.

This is particularly useful for questions about:

- Current events (e.g., *“Latest AI tools in 2025”*),
- Tech comparisons (e.g., *“Gemini vs GPT-4”*),
- General knowledge outside internal datasets.

Please generate the API key [here](https://api.traversaal.ai)


In [None]:
# Load the ARES and OpenAI API keys from Colab secrets
ares_api_key = userdata.get('ARES_API_KEY')
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

# Required for litellm
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [None]:
import httpx # For sending HTTP POST requests to the ARES API

async def get_internet_content(user_query: str, action: str):
    """
    Fetches a response from the internet using ARES-API based on the user's query.

    This function serves as the tool invoked when the router classifies a query
    as requiring real-time information beyond internal datasets—i.e., "INTERNET_QUERY".
    It sends the query to a live search API (ARES) and returns the result.

    Args:
        user_query (str): The user's question that needs a live answer.
        action (str): Route type (always expected to be "INTERNET_QUERY").

    Returns:
        str: Response text generated using internet search or an error message.
    """
    print("Getting your response from the internet 🌐 ...")

    # API endpoint for the ARES live search tool
    url = "https://api-ares.traversaal.ai/live/predict"

    # Payload structure expected by the ARES API
    payload = {"query": [user_query]}

    # Authentication and content headers for API access
    headers = {
        "x-api-key": ares_api_key,  # Your secret API key (should be securely loaded from environment)
        "content-type": "application/json"
    }
    custom_timeout = httpx.Timeout(10.0, read=30.0) # 10s connect, 30s read
    try:
        async with httpx.AsyncClient(timeout=custom_timeout) as client:
            response = await client.post(url, json=payload, headers=headers)
            response.raise_for_status()
        return response.json().get('data', {}).get('response_text', "No response received.")

    # Handle HTTP-level errors (e.g., 400s or 500s)
    except httpx.HTTPStatusError as http_err:
        return f"HTTP error occurred: {http_err}"

    # Handle general connection, timeout, or request formatting issues
    except httpx.RequestError as req_err:
        print(req_err)
        return f"Request error occurred: {req_err}"

    # Catch-all for any unexpected failure
    except Exception as err:
        return f"An unexpected error occurred: {err}"


In [None]:
display(Markdown(await get_internet_content("Tell me about best travel destinations in 2025?","INTERNET_QUERY"))) #run internet function to test results

## 2. Router Query Function — Giving the Agent Its Brain

In this step, we will define the router function, which plays a critical role in our Agentic RAG system.

### What is a Router?

A router is like the decision-making brain of our assistant.

Before trying to answer a user's question, the system first needs to figure out:

> “Where should I go to find the right answer?”

To make this decision, we use the OpenAI GPT model. We provide it with a detailed system prompt that explains how to classify the user's question into one of these categories:

- **OPENAI_QUERY** → Questions about OpenAI tools, APIs, models, or documentation.
- **10K_DOCUMENT_QUERY** → Questions about companies, financial filings, or analysis based on 10-K reports.
- **INTERNET_QUERY** → Anything else that likely requires real-time or general web information.

### What does the function do?

- Sends the user's question to the OpenAI API.
- Receives a JSON response containing:
  - `action`: The category the query belongs to.
  - `reason`: A short explanation for the decision.
  - `answer`: (Optional) A quick response if it’s simple enough (left blank for internet queries).
- Parses the response and returns it as a Python dictionary.

### Why is this important?

This router gives the system agency—the ability to decide which knowledge source to use. It’s what makes this pipeline agentic, not just static.

Without the router, every query would follow the same path. With it, we can:

- Dynamically switch between tools and data sources.
- Handle different types of user questions intelligently.
- Avoid wasting resources on unnecessary steps.
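The router contract described above (a JSON object with `action`, `reason`, and `answer`, plus a safe fallback when parsing fails) can be sketched with the standard library alone. The names `RouteDecision` and `parse_route` are hypothetical and exist only for this illustration; the notebook's own implementation relies on pydantic and litellm instead.

```python
import json
from dataclasses import dataclass

VALID_ACTIONS = {"OPENAI_QUERY", "10K_DOCUMENT_QUERY", "INTERNET_QUERY"}

@dataclass
class RouteDecision:
    action: str
    reason: str
    answer: str = ""

def parse_route(raw: str) -> RouteDecision:
    """Parse the router's JSON reply, falling back to INTERNET_QUERY on any problem."""
    try:
        data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
        action = data["action"]
        if action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        return RouteDecision(action, str(data.get("reason", "")), str(data.get("answer", "")))
    except (ValueError, KeyError, TypeError) as err:
        # The fallback has the same type as the success path, so callers
        # never need to special-case failures.
        return RouteDecision("INTERNET_QUERY", f"Fallback after parse error: {err}")

decision = parse_route('{"action": "10K_DOCUMENT_QUERY", "reason": "financial filing", "answer": ""}')
print(decision.action, "-", decision.reason)
```

Returning the same type on success and on failure is the important design choice here: downstream code can always read `decision.action` without checking what kind of object it received.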


In [None]:
from openai import OpenAIError
from pydantic import BaseModel, Field
from typing import List, Literal

class Actions(BaseModel):
    action: Literal["OPENAI_QUERY", "10K_DOCUMENT_QUERY", "INTERNET_QUERY"] = Field(description="The action to take")
    reason: str = Field(description="The reason for the action")
    answer: str = Field(description="The answer to the question")

async def route_query(user_query: str):
    router_system_prompt =f"""
    As a professional query router, your objective is to correctly classify user input into one of three categories based on the source most relevant for answering the query:

    1. OPENAI_QUERY:
        - If the user's query appears to be answerable using information from OpenAI's official documentation, tools, models, APIs, or services (e.g., GPT, ChatGPT, embeddings, moderation API, usage guidelines).
        - Questions that can likely be answered by consulting official OpenAI resources.

    2. 10K_DOCUMENT_QUERY:
        - Assign this category if the query is specifically asking for information found within, or analysis based upon, 10-K annual reports, company financial filings, or other structured financial/company datasets.
        - Information directly retrievable or analyzable from structured corporate/financial documents, particularly 10-K reports.
        - Assume past data is present here.

    3. INTERNET_QUERY:
        - If the query is neither related to OpenAI nor the 10k documents specifically, or if the information might require a broader search (e.g., news, trends, tools outside these platforms), route it here.
        - If unsure, or if the query is very broad, lean towards INTERNET_QUERY.

    Your decision should be made by assessing the domain of the query.

    EXAMPLES:
    - User: How to fine-tune GPT-3?
        Response:
            reason: Fine-tuning is OpenAI-specific
            action: OPENAI_QUERY
            answer: Use fine-tuning API

    - User: Where can I find the latest financial reports for the last 10 years?
        Response:
            reason: Query related to annual reports
            action: 10K_DOCUMENT_QUERY
            answer: Access through document database


    - User: Top leadership styles in 2024
        Response:
            reason: Needs current leadership trends
            action: INTERNET_QUERY
            answer: ""


    - User: What's the difference between ChatGPT and Claude?
        Response:
            reason: Cross-comparison of different providers
            action: INTERNET_QUERY
            answer: ""

    Strictly follow this format for every query, and never deviate.

    Current datetime is : {datetime.datetime.now()}, always consider this when making decisions.
    """

    try:
        # Query GPT-4o with the router prompt and user input
        response = await litellm.acompletion(
            model="gpt-4o",
            messages=[{"role": "system", "content": router_system_prompt},
                      {"role":"user","content":user_query}],
            response_format=Actions,
            temperature=0.0,

        )


        # Extract and parse the model's JSON response
        json_response = response.choices[0].message.content
        parsed_response = Actions.model_validate_json(json_response)
        return parsed_response

    # Handle OpenAI API errors (e.g., rate limits, authentication).
    # Fallbacks are returned as Actions objects so callers can always use
    # attribute access (response.action), exactly as on the success path.
    except OpenAIError as api_err:
        return Actions(
            action="INTERNET_QUERY",
            reason=f"OpenAI API error: {api_err}",
            answer=""
        )

    # Handle a model response that is not valid JSON or fails schema validation
    # (json.JSONDecodeError and pydantic's ValidationError both subclass ValueError)
    except ValueError as parse_err:
        return Actions(
            action="INTERNET_QUERY",
            reason=f"Response parsing error: {parse_err}",
            answer=""
        )

    # Catch-all for any other unforeseen issues
    except Exception as err:
        return Actions(
            action="INTERNET_QUERY",
            reason=f"Unexpected error: {err}",
            answer=""
        )

In [None]:
await route_query("what is the revenue of uber in 2021?")

## 3. Setting Up Qdrant Vector Database for Agentic RAG
In this step, we are connecting our agent to a pre-built vector database using Qdrant—a tool used to store and search document embeddings (numerical representations of text).

### What Are We Doing?
We are loading an existing Qdrant database that was downloaded from a GitHub repository. This database already contains:

- Vectorized OpenAI documentation
- Vectorized 10-K financial filings

By loading this saved data:

- We save time (no need to re-embed the documents)
- We enable fast similarity search to retrieve relevant text chunks

This setup allows our system to perform semantic search, meaning it can understand the meaning of the user query and match it with the most relevant pieces of information stored in the database.
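To make "matching by meaning" concrete, here is a minimal brute-force sketch of similarity search over a toy corpus. It assumes cosine similarity as the metric (the metric configured in the prebuilt collections is not stated in this notebook), and the three-dimensional vectors are made up for illustration. A vector database such as Qdrant serves the same kind of query with an index rather than a full scan.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: Sequence[float],
          corpus: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Score every (text, vector) pair against the query and keep the k best."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy corpus with made-up 3-dimensional embeddings.
corpus = [
    ("Uber revenue discussion", [0.9, 0.1, 0.0]),
    ("Chat completions guide", [0.0, 0.2, 0.9]),
    ("Lyft risk factors", [0.7, 0.3, 0.1]),
]
hits = top_k([1.0, 0.0, 0.0], corpus, k=2)
for text, score in hits:
    print(f"{score:.3f}  {text}")
```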


### Why This Matters in Agentic RAG
Once the router decides that the query should go to the OpenAI docs or the 10-K reports, our system uses Qdrant to:

- Search for the most relevant pieces of text
- Pass those to the model to generate a grounded answer

So, this step is essential to support retrieval-augmented generation (RAG) within our agentic flow.

### Data Sources

**10K Database: Lyft 2024 & Uber 2021 SEC filings**

**OpenAI Docs: Official OpenAI documentation**

For lecture demo purposes, the vector database has already been created and hosted on GitHub, which we clone here. To create your own embeddings, the notebook and data will be hosted and shared on GitHub.

In [None]:
# Clone the project repository that contains prebuilt vector data (e.g., Qdrant collections)
# This includes document embeddings and configurations needed for retrieval (10-K, OpenAI docs)
!git clone https://github.com/hamzafarooq/multi-agent-course.git --quiet


In [None]:
# 🗄️ Initializing Qdrant client with local path to vector database
# The path points to prebuilt Qdrant collections (10-K and OpenAI docs) cloned from the repository
# This enables fast, local retrieval of relevant document chunks based on semantic similarity
if 'client' in globals():
    # Delete existing client if it exists
    del client
    print("Deleted existing Qdrant client")

client = AsyncQdrantClient(path="/content/multi-agent-course/Module_1/Agentic_RAG/qdrant_data")


## 4. Building the Retriever and RAG for Vector Databases
In this section, we build the core logic that allows our agent to find relevant documents and generate grounded answers using them.

### Step 1: Import the Embedding Model
We start by importing the nomic-ai/nomic-embed-text-v1.5 model from Hugging Face. This model is used to convert any text (such as a user query) into a dense vector, known as an embedding. These embeddings capture the semantic meaning of text, allowing us to later compare and retrieve similar documents.


In [None]:
# Load the tokenizer and embedding model from Hugging Face
# This model converts raw text into dense vector representations (embeddings)
# Used for similarity search in Qdrant during document retrieval
text_tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

def get_text_embeddings(text):
    """
    Converts input text into a dense embedding using the Nomic embedding model.
    These embeddings are used to query Qdrant for semantically relevant document chunks.

    Args:
        text (str): The input text or query from the user.

    Returns:
        np.ndarray: A fixed-size vector representing the semantic meaning of the input.
    """
    # Tokenize and prepare input for the model
    inputs = text_tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    # Forward pass to get model outputs
    outputs = text_model(**inputs)

    # Take the mean across all token embeddings to get a single vector (pooled representation)
    embeddings = outputs.last_hidden_state.mean(dim=1)

    # Convert to NumPy array and detach from computation graph
    return embeddings[0].detach().numpy()

# Example usage: Generate and preview the embedding of a test sentence
text = "This is a test sentence."
embeddings = get_text_embeddings(text)
print(embeddings[:5])  # Print first 5 dimensions for inspection


### Step 2: Define the Embedding Function
The cell above also defines the function get_text_embeddings(), which:

- Tokenizes the input text
- Runs it through the model
- Computes the average of all token embeddings
- Returns a single vector that represents the full sentence

This vector will be used to query Qdrant to find the most relevant document chunks based on similarity.
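The averaging step in the list above is usually called mean pooling. A minimal pure-Python sketch, using made-up two-dimensional token vectors in place of real model outputs (`mean_pool` is a hypothetical helper, not part of the notebook):

```python
from typing import List

def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average token vectors dimension by dimension into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three tokens, each with a 2-dimensional embedding.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(tokens)
print(pooled)  # [3.0, 4.0]
```

The result has the same dimensionality as a single token vector regardless of sentence length, which is what makes it usable as a fixed-size query vector.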

In [None]:
async def rag_formatted_response(user_query: str, context: list):
    """
    Generate a response to the user query using the provided context,
    with article references formatted as [1][2], etc.

    This function performs the final step in the RAG pipeline—synthesizing an answer
    from retrieved document chunks (context). It prompts the model to generate a
    grounded response, explicitly citing sources using a reference format.

    Args:
        user_query (str): The user's original question.
        context (list): List of text chunks retrieved from Qdrant (10-K or OpenAI docs).

    Returns:
        str: A generated response grounded in the retrieved context, with numbered citations.
    """

    # Construct a RAG prompt that includes both:
    # 1. The user's query
    # 2. The supporting context documents
    # The prompt instructs the model to answer using only the provided context,
    # and to include citations like [1], [2], etc. based on chunk IDs or order.
    rag_prompt = f"""
       Based on the given context, answer the user query: {user_query}\nContext:\n{context}
       and employ references to the ID of articles provided [ID], ensuring their relevance to the query.
       The referencing should always be in the format of [1][2]... etc.
    """

    #  Call GPT-4o to generate the response using the RAG-style prompt
    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": rag_prompt},
        ],
        # temperature=0.0,
    )

    # Return the model's generated answer
    return response.choices[0].message.content


### Step 3: Define the RAG Response Generator
After retrieving relevant text chunks from Qdrant, we use the rag_formatted_response() function to generate a final answer. This function:

- Takes the user query and the retrieved document chunks
- Builds a prompt that asks the language model (GPT-4o) to answer the question using only the provided context
- Instructs the model to include references like [1], [2] for traceability

This ensures the output is not only informative but also grounded in actual retrieved data.

Together, these two functions lay the foundation for combining retrieval (from vector DB) and generation (from LLM) — the two pillars of a RAG system.
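The notebook interpolates the retrieved chunks into the prompt as a plain Python list. One possible way to make the `[1]`, `[2]` references unambiguous is to number each chunk explicitly before building the prompt. The sketch below shows that idea; `build_rag_prompt` is a hypothetical helper, not the function used in this notebook.

```python
from typing import List

def build_rag_prompt(user_query: str, chunks: List[str]) -> str:
    """Build a RAG prompt in which every context chunk carries a visible [n] label."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the user query using only the context below.\n"
        "Cite the chunks you rely on as [1], [2], and so on.\n\n"
        f"Context:\n{numbered}\n\n"
        f"User query: {user_query}"
    )

prompt = build_rag_prompt(
    "What was the revenue?",
    ["Revenue figures appear in the income statement.", "Risk factors are listed in Item 1A."],
)
print(prompt)
```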



In [None]:
async def retrieve_and_response(user_query: str, action: str):
    """
    Retrieves relevant text chunks from the appropriate Qdrant collection
    based on the query type, then generates a response using RAG.

    This function powers the retrieval and response generation pipeline
    for queries that are classified as either OPENAI-related or 10-K related.
    It uses semantic search to fetch relevant context from a Qdrant vector store
    and then generates a response using that context via a RAG prompt.

    Args:
        user_query (str): The user's input question.
        action (str): The classification label from the router (e.g., "OPENAI_QUERY", "10K_DOCUMENT_QUERY").

    Returns:
        str: A model-generated response grounded in retrieved documents, or an error message.
    """

    # Define mapping of routing labels to their respective Qdrant collections
    collections = {
        "OPENAI_QUERY": "opnai_data",           # Collection of OpenAI documentation embeddings
        "10K_DOCUMENT_QUERY": "10k_data"        # Collection of 10-K financial document embeddings
    }

    try:
        # Ensure that the provided action is valid
        if action not in collections:
            return "Invalid action type for retrieval."

        # Step 1: Convert the user query into a dense vector (embedding)
        try:
            query = get_text_embeddings(user_query)
        except Exception as embed_err:
            return f"Embedding error: {embed_err}"  # Fail early if embedding fails

        # Step 2: Retrieve top-matching chunks from the relevant Qdrant collection
        try:
            text_hits = await client.query_points(
                collection_name=collections[action],  # Choose the right collection based on routing
                query=query,                          # The embedding of the user's query
                limit=3                               # Fetch top 3 relevant chunks
            )
        except Exception as qdrant_err:
            return f"Vector DB query error: {qdrant_err}"  # Handle Qdrant access issues

        # Extract the raw content from the retrieved vector hits
        contents = [point.payload['content'] for point in text_hits.points]

        # If no relevant content is found, return early
        if not contents:
            return "No relevant content found in the database."

        # Step 3: Pass the retrieved context to the RAG model to generate a response
        try:
            response = await rag_formatted_response(user_query, contents)
            return response
        except Exception as rag_err:
            return f"RAG response error: {rag_err}"  # Handle generation failures

    # Catch any unforeseen errors in the overall process
    except Exception as err:
        return f"Unexpected error: {err}"


## 5. Putting It All Together: Running the Agentic RAG
In this final step, we combine everything into a single function that controls the entire Agentic RAG workflow. The agentic_rag() function acts as the main orchestrator of the system.

Here’s what it does:

- Prints the user's query for reference.
- Uses the router function (powered by GPT) to decide which type of data source to use:
  - OpenAI documentation
  - 10-K financial reports
  - Internet search
- Calls the correct function based on the route:
  - If it’s an OpenAI or 10-K query, it retrieves data from Qdrant and generates a RAG response.
  - If it’s an Internet query, it uses the ARES API to fetch live information.
- Displays the final response, neatly formatted in the console.

This step brings the agentic loop full circle—from understanding the question, reasoning about where to search, to finally responding with the best possible answer.

In [None]:
# Dictionary that maps the route labels (decided by the router) to their respective functions
# Each type of query is handled differently:
# - OPENAI_QUERY and 10K_DOCUMENT_QUERY use document retrieval + RAG
# - INTERNET_QUERY uses a web search API
routes = {
    "OPENAI_QUERY":  retrieve_and_response,
    "10K_DOCUMENT_QUERY":  retrieve_and_response,
    "INTERNET_QUERY": get_internet_content,
}

async def agentic_rag(user_query: str):
    """
    Main function that runs the full Agentic RAG system.

    This function takes a user's question, decides what type of query it is (OpenAI-related,
    financial document-related, or general internet), and then calls the right function
    to handle it. Finally, it prints out the full conversation and response.

    Args:
        user_query (str): The user's input question.

    Returns:
        str: The response text, which is also printed to the console.
    """

    #  Terminal color codes to make the printed output easier to read and visually structured
    CYAN = "\033[96m"
    GREY = "\033[90m"
    BOLD = "\033[1m"
    RESET = "\033[0m"

    try:
        # Step 1: Print the user's original question to the console
        print(f"{BOLD}{CYAN}👤 User Query:{RESET} {user_query}\n")

        # Step 2: Use the router (powered by GPT) to decide which route the query belongs to
        try:
            response = await route_query(user_query)
        except Exception as route_err:
            # If something goes wrong while classifying the query, show and return an error message
            error_msg = f"Routing error: {route_err}"
            print(f"{BOLD}{CYAN}🤖 BOT RESPONSE:{RESET}\n")
            print(f"{error_msg}\n")
            return error_msg

        # Extract the routing decision and the reason behind it
        action = response.action  # e.g., "OPENAI_QUERY"
        reason = response.reason  # e.g., "Related to OpenAI tools"

        # Step 3: Show the selected route and why it was chosen
        print(f"{GREY}📍 Selected Route: {action}")
        print(f"📝 Reason: {reason}")
        print(f"⚙️ Processing query...{RESET}\n")

        # Step 4: Call the correct function depending on the route (retrieval or web search)
        try:
            route_function = routes.get(action)  # Find the function to use for this route
            if route_function:
                result = await route_function(user_query, action)  # Run the function with the user's input
            else:
                result = f"Unsupported action: {action}"  # Catch unknown routing types
        except Exception as exec_err:
            result = f"Execution error: {exec_err}"  # Handle failure in the chosen route function

        # Step 5: Print the final response to the user
        print(f"{BOLD}{CYAN}🤖 BOT RESPONSE:{RESET}\n")

        #TODO: printing and return is inefficient
        print(f"{result}\n")
        return f"{result}"

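    except Exception as err:
        # Catch-all for any unexpected errors in the overall logic;
        # return the message so callers such as display(Markdown(...)) still receive a string
        error_msg = f"Unexpected error occurred: {err}"
        print(f"{BOLD}{CYAN}🤖 BOT RESPONSE:{RESET}\n")
        print(f"{error_msg}\n")
        return error_msg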


In [None]:
display(Markdown(await agentic_rag("what was uber revenue in 2021?")))

In [None]:
display(Markdown(await agentic_rag("what was lyft revenue in 2024?")))

In [None]:
display(Markdown(await agentic_rag("List me down new LLMs in 2025")))

In [None]:
display(Markdown(await agentic_rag("how to work with chat completions?")))

## Assignment: Implement Sub-Query Division

In our current agentic retrieval-augmented generation (RAG) setup, there's a key limitation: when a user submits a query that contains multiple distinct questions phrased as a single input, the system treats it as a single unified search. As a result, the retrieval engine performs only one operation on the vector databases or external tools, which often leads to incomplete or less relevant results.


To address this, the assignment introduces a new functionality called subquery division. This involves breaking down complex, compound queries into multiple, focused subqueries. Each subquery is processed independently, allowing the system to retrieve more accurate and context-specific information. By handling these subqueries separately, the agent can generate more complete and relevant responses.
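The split-then-process-independently idea can be sketched without any LLM calls. In the sketch below, `split_query` is a hypothetical rule-based splitter (the assignment delegates this step to a language model), and `answer` is a stub standing in for the full routing and retrieval pipeline.

```python
import asyncio
from typing import List

# Hypothetical rule-based splitter, for illustration only.
def split_query(query: str) -> List[str]:
    parts = [p.strip() for p in query.split(" and ")]
    return [p for p in parts if p]

async def answer(sub_query: str) -> str:
    await asyncio.sleep(0)  # stands in for retrieval or API latency
    return f"answer to: {sub_query}"

async def answer_all(query: str) -> List[str]:
    subs = split_query(query)
    # gather() runs the coroutines concurrently and returns results
    # in the same order as the sub-queries, so they can be zipped back together.
    return await asyncio.gather(*(answer(s) for s in subs))
```

In a notebook cell, call it with `await answer_all("where is the Taj Mahal and who won the NFL in 2024")`; in a plain script, wrap the call in `asyncio.run(...)`. Because `gather` preserves order, each answer can be paired with its sub-query before the final synthesis step.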

In [None]:
from pydantic import BaseModel, Field
from typing import List

class SubQueriesList(BaseModel):
    subQueries: List[str] = Field(...,
                             description="""A list of distinct sub-questions derived from the user query.
                              If the query is singular, this list will contain one item.""")

In [None]:
async def sub_queries(user_query):
    # System prompt asking the model to split compound questions
    sub_queries_prompt = """
    You are a query router. If the input contains multiple distinct questions, break it into sub-questions."""

    response = await litellm.acompletion(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": sub_queries_prompt},
            {"role": "user", "content": user_query},
        ],
        response_format=SubQueriesList,
        temperature=0.0,
    )
    # Validate the structured output and return the plain list of sub-queries
    json_response = response.choices[0].message.content
    output = SubQueriesList.model_validate_json(json_response)
    return output.subQueries


In [None]:
composite_query = "what is revenue of lyft and what is revenue for uber in 2024"
sub_query_list = await sub_queries(composite_query)
print(sub_query_list)

In [None]:


async def process_queries_with_agentic_rag(user_query: str):
    """
    Processes composite queries by splitting them into sub‑queries,
    handling each sub‑query with the agentic_rag system, and then
    merging the individual answers into one cohesive response.

    Args:
        user_query (str): The user's original composite query.

    Returns:
        str: A synthesized answer to the original query.
    """

    #  Terminal colour codes
    BLUE    = "\033[94m"   # user input / initial query
    MAGENTA = "\033[95m"   # headings & final response
    GREEN   = "\033[92m"   # processing / success
    GREY    = "\033[90m"   # info & sub‑query lists
    RED     = "\033[91m"   # errors
    DIM     = "\033[2m"    # dimmed text
    BOLD    = "\033[1m"    # emphasis
    RESET   = "\033[0m"    # reset to default

    try:
        # 1️. Show the original query
        print(f"{BOLD}{BLUE}📝  Composite Query:{RESET} {user_query}\n")

        # 2️. Break the query into sub‑queries
        sub_query_list = await sub_queries(user_query)
        print(f"{DIM}{GREY}🔎  Breaking into sub‑queries…{RESET}")
        for idx, sub_q in enumerate(sub_query_list, start=1):
            print(f"{DIM}{GREY}   {idx}. {sub_q}{RESET}")

        # 3️. Process the sub‑queries concurrently
        print(f"{DIM}{GREEN}⚡  Processing sub‑queries in parallel using async …{RESET}\n")
        tasks = [agentic_rag(q) for q in sub_query_list]
        individual_results = await asyncio.gather(*tasks)

        # 4️. If only one sub‑query, return its answer straight away
        if len(sub_query_list) == 1:
            print(f"{BOLD}{MAGENTA}🤖  Final Response:{RESET}\n")
            print(individual_results[0])
            return individual_results[0]

        # 5️. Combine multiple answers into a single response
        print(f"{DIM}{GREY}🧩  Synthesising answers…{RESET}\n")
        mapping = "\n\n".join(
            f"Sub‑query: {q}\nAnswer: {a}" for q, a in zip(sub_query_list, individual_results)
        )

        combine_prompt = f"""

        Original user query: {user_query}
        The query was divided and answered as follows:

        {mapping}

        Please merge these answers into a single, well‑structured reply that
        addresses the original question directly.
            - Use clear Markdown in your response

        """

        response = await litellm.acompletion(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant that synthesizes multiple pieces of information into cohesive answers."
                },
                {"role": "user", "content": combine_prompt},
            ],
            temperature=0.1,
        )

        final_result = response.choices[0].message.content

        # 6️. Print the heading and return the merged answer (the caller renders it)
        print(f"{BOLD}{MAGENTA}🤖  Final Synthesised Response:{RESET}\n")
        return final_result

    except Exception as err:
        # 7️. Catch‑all for unexpected errors
        error_msg = f"Unexpected error in composite query processing: {err}"
        print(f"{BOLD}{RED}💥  Error:{RESET}\n{error_msg}\n")
        return error_msg

In [None]:
display(Markdown(await process_queries_with_agentic_rag(composite_query)))

In [None]:
new_query = 'Where is Taj Mahal and who won NFL in 2024'
display(Markdown(await process_queries_with_agentic_rag(new_query)))

In [None]:
new_query = 'What is Hakuna Matata'
display(Markdown(await process_queries_with_agentic_rag(new_query)))

In [None]:
new_query = 'Who is Max Verstappen\'s dad? '
display(Markdown(await process_queries_with_agentic_rag(new_query)))