# Adaptive RAG with Pathway

This notebook shows how to **dynamically adapt the number of documents in a RAG prompt** using feedback from the LLM. This can significantly reduce the cost of RAG question-answering pipelines while maintaining good accuracy. You can run Adaptive RAG with Mistral models either locally with `ollama` or remotely using the Mistral API.

Run this notebook on [Google Colab](https://colab.research.google.com/github/avriiil/cookbook/blob/pathway-adapative-rag/third_party/pathway-adaptive-rag.ipynb).

Reference article: https://pathway.com/developers/showcases/adaptive-rag

<img src="images/pw-adaptive-rag.png" width="700"/>

Let's jump in! 🪂

## Setup

### Using APIs 

* You will need [a Mistral subscription](https://auth.mistral.ai/ui/login?flow=2b98deac-f13f-4e18-a7cd-23ba377a6370) to access the Mistral API.
* Create an account and fetch your API key.
* Pass the API key into the prompt in this notebook or set it as the environment variable `MISTRAL_API_KEY`.

### Installing Libraries
You will need to install `pathway` to run this code. We will also install `litellm` and `sentence-transformers` which are optional dependencies of Pathway.

In [None]:
# # Uncomment and run if you need to install Pathway and its optional dependencies
# !pip install -U --prefer-binary "pathway~=0.9.0"
# !pip install "litellm>=1.35"
# !pip install -U sentence_transformers

### Accessing Data
This notebook uses a sample JSON Lines dataset with ~1000 contexts from the SQuAD dataset. You can access it from the `data` directory or download it to your own machine using the code below. The `documents = pw.io.fs.read(...)` call below expects the file at `data/adaptive-rag-contexts.jsonl`; if you download it to another directory (the `wget` command saves it to the current working directory), make sure to update that path.

In [None]:
# # Download `adaptive-rag-contexts.jsonl` with ~1000 contexts from the SQuAD dataset
# !wget -q -nc https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/adaptive-rag-contexts.jsonl

### Running locally
If you want to run this locally (e.g., on your laptop), use [Ollama](https://ollama.ai/library/mistral/tags):

* Download [Ollama app](https://ollama.ai/).
* Download a Mistral model, e.g. with `ollama pull mistral:instruct`. Other [Mistral versions](https://ollama.ai/library/mistral) and [Mixtral versions](https://ollama.ai/library/mixtral) are also available.
* Set the flags below to indicate that you are running locally and which Mistral model you downloaded:

In [None]:
run_local = False

# # Flags for running locally
# run_local = True
# local_model = "ollama/mistral:instruct"

## Adaptive RAG Intuition

RAG question-answering applications involve an important trade-off in context size. Supplying more documents increases the chance that the LLM can answer correctly, but it also increases LLM costs, which typically grow linearly with the length of the prompt. Intuitively, though, not all questions are equally hard: some can be answered from a small number of supporting documents, while others require the LLM to consult a larger prompt.

This is where [Adaptive RAG](https://pathway.com/developers/showcases/adaptive-rag) comes in:
1. Start by asking the model to answer a question using **a small number of documents**.
2. If it refuses to answer, **increase the context size**.
3. Repeat this **iteratively until the model returns an answer** or a maximum number of iterations is reached.

This improves the efficiency of our LLM pipeline. For most queries a single (cheap!) LLM call will be sufficient. A fraction of more complicated questions will require re-asking.
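
The loop described above can be sketched in plain Python. This is only an illustration of the idea, not Pathway's implementation: `adaptive_answer`, `toy_llm`, the `NO_ANSWER` marker and the sample documents are all hypothetical names and data invented for this sketch, and the parameter names simply mirror the ones used later in this notebook.

```python
from typing import Callable, List, Optional, Tuple

NO_ANSWER = "I don't know"  # hypothetical refusal marker used only in this sketch


def adaptive_answer(
    question: str,
    ranked_docs: List[str],
    ask_llm: Callable[[str, List[str]], str],
    n_starting_documents: int = 2,
    factor: int = 2,
    max_iterations: int = 4,
) -> Tuple[Optional[str], int]:
    """Return (answer, number of LLM calls); answer is None if every call is refused."""
    n_docs = n_starting_documents
    for call in range(1, max_iterations + 1):
        # Ask with the n_docs most relevant documents only.
        answer = ask_llm(question, ranked_docs[:n_docs])
        if answer != NO_ANSWER:
            return answer, call
        # The model refused: grow the context geometrically and retry.
        n_docs *= factor
    return None, max_iterations


def toy_llm(question: str, docs: List[str]) -> str:
    # Stand-in for a real model: it "answers" only if a supporting document is present.
    for doc in docs:
        if "water" in doc:
            return "water"
    return NO_ANSWER


# Hypothetical documents, already ranked by relevance to the question.
docs = ["doc about dogs", "doc about cats", "doc about birds", "hydrogen burns to form water"]
answer, calls = adaptive_answer("When it is burned what does hydrogen make?", docs, toy_llm)
print(answer, calls)
```

In this toy run the first call (two documents) is refused and the second call (four documents) succeeds, so the expensive larger prompts are only paid for when the cheap one fails.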

## Implementing Adaptive RAG

Let's implement Adaptive RAG by:
1. defining the Mistral embedder and LLM
2. loading our context documents
3. defining our queries
4. running Adaptive RAG to iteratively increase the context size
5. inspecting the results

Start by importing the necessary libraries:

In [None]:
import getpass
import os

import pandas as pd

import pathway as pw
from pathway.stdlib.indexing import VectorDocumentIndex
from pathway.xpacks.llm.embedders import LiteLLMEmbedder, SentenceTransformerEmbedder 
from pathway.xpacks.llm.llms import LiteLLMChat  
from pathway.xpacks.llm.question_answering import (
    answer_with_geometric_rag_strategy_from_index,
)

Then define your Mistral embedder and LLM. If you want to run locally with Ollama, make sure to set the correct flags at the top of this notebook.

If you are not running locally and have not set the `MISTRAL_API_KEY` environment variable then the following cell will prompt you to securely pass your Mistral API Key.

*Note that for the local model:*
- we provide tested options for local embedders
- we specifically instruct the LLM to return JSON, which helps the LLM follow the instructions more strictly.

In [None]:
# Check API key
if run_local:
    pass
elif "MISTRAL_API_KEY" in os.environ:
    mistral_api_key = os.environ["MISTRAL_API_KEY"]
else:
    mistral_api_key = getpass.getpass("Mistral API Key:")

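# Set config options
# The index dimension must match the embedder's output size:
# 1024 for `mistral-embed`; the small local model below is assumed to emit 384-dim vectors.
# Adjust this value if you switch to a different embedder.
embedding_dimension: int = 384 if run_local else 1024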

# choose embedder
# large_model = "mixedbread-ai/mxbai-embed-large-v1"
# medium_model = "avsolatorio/GIST-Embedding-v0"
small_model = "avsolatorio/GIST-small-Embedding-v0"

# define Mistral embedder
if run_local:
    embedder = SentenceTransformerEmbedder(small_model, call_kwargs={"show_progress_bar": False})  # disable verbose logs
else:
    embedder = LiteLLMEmbedder(
        capacity=5,
        retry_strategy=pw.udfs.FixedDelayRetryStrategy(),
        model="mistral/mistral-embed",
        api_key=mistral_api_key,
    )

# define Mistral LLM
if run_local:
    model = LiteLLMChat(
        model=local_model, 
        temperature=0,
        top_p=1,
        format="json",  # only available in Ollama local deploy, not usable in Mistral API
    )
else:
    model = LiteLLMChat(
        model="mistral/mistral-large-latest", 
        temperature=0, 
        api_key=mistral_api_key,
        top_p=1
    )

Next, let's load the context documents and create a table with our queries:

In [None]:
# Load documents in which answers will be searched
class InputSchema(pw.Schema):
    doc: str

documents = pw.io.fs.read(
    "data/adaptive-rag-contexts.jsonl",
    format="json",
    schema=InputSchema,
    json_field_paths={"doc": "/context"},
    mode="static",
)

# Create table with questions
df = pd.DataFrame(
    {
        "query": [
            "When it is burned what does hydrogen make?",
            # "When did Arnold switch from acting to politics?",
            "What was undertaken in 2010 to determine where dogs originated from?",
            # "What is a common nickname used to refer to dogs across multiple languages?",
        ]
    }
)
query = pw.debug.table_from_pandas(df)
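
The `json_field_paths={"doc": "/context"}` argument above maps the `doc` column to the `context` field of each JSON record. As a rough illustration of what that path selects, here is a plain-Python sketch. The `resolve_path` helper and the sample record are hypothetical; the real file's records may carry additional fields, and Pathway performs this mapping internally.

```python
import json


def resolve_path(record: dict, path: str):
    """Follow a simple slash-separated path such as "/context" into a parsed JSON object."""
    value = record
    for key in path.strip("/").split("/"):
        value = value[key]
    return value


# Hypothetical JSON Lines record, used only to illustrate the mapping.
line = '{"context": "Example passage text.", "id": 1}'

record = json.loads(line)
doc = resolve_path(record, "/context")  # the value that would populate the `doc` column
print(doc)
```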

Now let's create a Vector index of the documents and set up our Adaptive RAG:

In [None]:
# Index for finding closest documents
index = VectorDocumentIndex(
    documents.doc, documents, embedder, n_dimensions=embedding_dimension
)

# Run Adaptive RAG
result = query.select(
    question=query.query,
    result=answer_with_geometric_rag_strategy_from_index(
        query.query,  # define query
        index,  # pass index
        documents.doc,  # define context docs
        model,  # define LLM
        n_starting_documents=2,  # set number of docs to include in first query iteration
        factor=2,  # set factor to increase n_docs with
        max_iterations=4,  # set max number of iterations
        strict_prompt=True,  # needed for open source models, instructs LLM to give JSON output strictly
    ),
)
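
To get a feel for the settings above, the following sketch computes how many documents each iteration would see, assuming the context size is multiplied by `factor` after every refused attempt. `context_sizes` is a hypothetical helper written for this illustration and is not part of Pathway.

```python
from typing import List


def context_sizes(n_starting_documents: int, factor: int, max_iterations: int) -> List[int]:
    """Number of documents in the prompt at each iteration of a geometric schedule."""
    return [n_starting_documents * factor**i for i in range(max_iterations)]


sizes = context_sizes(n_starting_documents=2, factor=2, max_iterations=4)
print(sizes)       # [2, 4, 8, 16]
print(sum(sizes))  # 30
```

Under that assumption, an easy question costs a single two-document prompt, while the worst case (every attempt refused) sends 30 documents in total across four calls.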

Run the cell below to execute Adaptive RAG and fetch the results: 

In [None]:
responses_df = pw.debug.table_to_pandas(result)

In [None]:
print(responses_df["result"].iloc[0])
print(responses_df["result"].iloc[1])
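
Because the row order of the resulting DataFrame may not match the order of the input queries, it can be safer to print each answer next to its question. The sketch below uses a stand-in DataFrame with the same `question` and `result` columns. `format_answers` and `demo_df` are hypothetical names, the values are placeholders rather than model output, and treating an unanswered question as a missing value is an assumption.

```python
import pandas as pd


def format_answers(df: pd.DataFrame) -> list:
    """Pair each question with its answer, marking missing answers explicitly."""
    lines = []
    for row in df.itertuples(index=False):
        answer = row.result if pd.notna(row.result) else "<no answer>"
        lines.append(f"Q: {row.question} | A: {answer}")
    return lines


# Stand-in for `responses_df`; the values are placeholders, not model output.
demo_df = pd.DataFrame(
    {
        "question": ["first question?", "second question?"],
        "result": ["placeholder answer", None],
    }
)

for line in format_answers(demo_df):
    print(line)
```

In the notebook you would call `format_answers(responses_df)` instead of using the stand-in.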

## Conclusion
We have shown a simple and effective strategy to reduce RAG costs by adapting the number of supporting documents to the LLM's behavior on a given question. The approach builds on the ability of LLMs to know when they don’t know how to answer. With proper LLM confidence calibration, Adaptive RAG can be as accurate as a large-context base RAG while being much cheaper to run.

## References

Read more about [the technical implementation and benchmarking](https://pathway.com/developers/showcases/adaptive-rag) of Adaptive RAG.