# Basic RAG with Model-graded Eval

In this example we'll build a simple RAG application over Volume 7 of the *History of the United States of America*
and evaluate its answers on the following criteria:
* **relevance** -- does the answer make sense in the context of the original question?
* **faithfulness** -- is the final answer faithful to the data that we fed into the LLM?
* **coherence** -- is the answer consistent and easy to understand?
* **succinctness** -- does the answer contain unnecessary information?

We'll use AIConfig to manage and iterate on all our prompts, for both the generation step of the RAG pipeline and its evaluation.

## Install dependencies

Create a `.env` file containing the following line:
`OPENAI_API_KEY=<your key here>`
> You can get your key from https://platform.openai.com/api-keys 
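To fail fast when the key is missing, a small check can run right after `dotenv.load_dotenv()`. The sketch below is only an illustration: `require_env` is a hypothetical helper, not part of AIConfig or python-dotenv, and the key value is a placeholder.

```python
import os


def require_env(name, env=None):
    """Return the value of `name`, or raise a clear error if it is unset or empty."""
    # `env` defaults to the real process environment; a plain dict can be
    # passed instead, which keeps the helper easy to test.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Example with an explicit mapping, so the sketch runs without a real key.
print(require_env("OPENAI_API_KEY", env={"OPENAI_API_KEY": "sk-example"}))
```

In the notebook itself, calling `require_env("OPENAI_API_KEY")` after `dotenv.load_dotenv()` would surface a missing key before any model call is made.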


In [1]:
%pip install python-aiconfig==1.1.27
%pip install chromadb

import dotenv
dotenv.load_dotenv()

Note: you may need to restart the kernel to use updated packages.
Collecting importlib-metadata<7.0,>=6.0 (from opentelemetry-api>=1.2.0->chromadb)
  Using cached importlib_metadata-6.11.0-py3-none-any.whl.metadata (4.9 kB)
Using cached importlib_metadata-6.11.0-py3-none-any.whl (23 kB)
Installing collected packages: importlib-metadata
  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 5.2.0
    Uninstalling importlib-metadata-5.2.0:
      Successfully uninstalled importlib-metadata-5.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
json-spec 0.11.0 requires importlib-metadata<6.0.0,>=5.0.0, but you have importlib-metadata 6.11.0 which is incompatible.
Successfully installed importlib-metadata-6.11.0
Note: you may need to restart the kernel to use updated packages.


True

In [2]:
import argparse
import asyncio
import os
import sys
from aiconfig import AIConfigRuntime
import chromadb
from glob import glob




## Download the raw data
Fetch Volume 7 of the *History of the United States of America* (our raw, unstructured dataset) from Project Gutenberg.

In [3]:
!mkdir -p data/books/
!curl -o data/books/pg72846.txt https://www.gutenberg.org/cache/epub/72846/pg72846.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  636k  100  636k    0     0  2406k      0 --:--:-- --:--:-- --:--:-- 2437k


In [4]:
!head data/books/pg72846.txt

The Project Gutenberg eBook of History of the United States of America, Volume 7 (of 9)
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.



In [8]:
collection_name="us_history_volume_7"
chromadb_path="chroma_2.db"

## RAG Data Ingestion & Indexing
Chunk the data and ingest it into a Chroma DB collection.

> We use a very naive text splitting strategy with fixed-size chunks. For a production environment, this step will be critical to optimize.

**Note:** You can also run this step as a CLI script:
```
!python rag.py ingest data/books/ --chroma-collection-name us_history_volume_7
```

In [3]:
def chunk_markdown(text, chunk_size=1000):
    """Yield consecutive, non-overlapping chunks of at most `chunk_size` characters."""
    for i in range(0, len(text), chunk_size):
        yield text[i : i + chunk_size]
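Fixed-size chunks can cut a sentence (or a price) in half at a chunk boundary. One common mitigation is to let neighbouring chunks overlap. The following is a minimal sketch of that idea, not the splitter used in this notebook; `chunk_with_overlap` and its `overlap` parameter are hypothetical names introduced here for illustration.

```python
def chunk_with_overlap(text, chunk_size=1000, overlap=200):
    """Yield chunks of at most `chunk_size` characters, where each chunk
    repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        yield text[start : start + chunk_size]
        # Stop once a chunk has reached the end of the text, so we do not
        # emit a trailing chunk that is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break


print(list(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)))
```

With `overlap=0` this produces the same chunks as `chunk_markdown`. Larger overlaps trade index size for a better chance that a relevant passage is contained whole in at least one chunk.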

In [4]:
async def run_ingest(directory, collection_name):
    chroma_client = chromadb.PersistentClient(path=chromadb_path)
    collection = chroma_client.create_collection(name=collection_name)

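    for i, filename in enumerate(glob(f"{directory}/**/*", recursive=True)):
        # The recursive glob also matches directories; skip anything that is not a file.
        if not os.path.isfile(filename):
            continue
        print("Ingesting:", i, filename)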
        documents = []
        metadatas = []
        ids = []

        with open(filename, "r") as f:
            data = f.read()
            for j, chunk in enumerate(chunk_markdown(data)):
                documents.append(chunk)
                metadatas.append({"source": filename})
                ids.append(f"doc_{i}_chunk{j}")

        collection.add(documents=documents, metadatas=metadatas, ids=ids)

In [13]:
try:
    await run_ingest(directory="data/books", collection_name=collection_name)
except Exception as e:
    print(f"Ingest failed: {e}.\nIf the collection exists already, this is fine.")

Ingesting: 0 data/books/pg72846.txt


2024-02-26 17:34:56.847623 [W:onnxruntime:, helper.cc:67 IsInputSupported] CoreML does not support input dim > 16384. Input:embeddings.word_embeddings.weight, shape: {30522,384}
2024-02-26 17:34:56.848140 [W:onnxruntime:, coreml_execution_provider.cc:81 GetCapability] CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 49 number of nodes in the graph: 323 number of nodes supported by CoreML: 231


## RAG Query & Response Generation
Query the index for context given a user-supplied question, and use that context to generate a response.

**Note:** You can also run this step as a CLI script, for example:
```
!python rag.py query "In July, flour sold at Boston for _?" -k=10 --chroma-collection-name us_history_volume_7
```

In [5]:
def retrieve_data(collection, query, k):
    print("Querying for:", query)
    context = collection.query(query_texts=[query], n_results=k)
    return context


def serialize_retrieved_data(data):
    # print("Serializing data:", type(data), data)
    out = "\n".join(data["documents"][0])
    print("Serialized retrieved data:\n", out)
    return out


async def generate(query, context, prompt):
    config = AIConfigRuntime.load("rag.aiconfig.yaml")

    params = {
        "query": query, 
        "context": context
    }
    print("Running generate with params:", params)
    return await config.run_and_get_output_text(
        prompt, params=params
    )

async def run_query(query, collection_name, k, prompt="generate"):
    chroma_client = chromadb.PersistentClient(path=chromadb_path)
    collection = chroma_client.get_collection(name=collection_name)
    data = retrieve_data(collection, query, k)
    context = serialize_retrieved_data(data)
    result = await generate(query, context, prompt)
    print("\n\nResponse:\n", result)

    return (query, context, result)
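`serialize_retrieved_data` indexes `data["documents"][0]` because a Chroma query result holds one inner list per query text, and we pass a single query. The sketch below uses a hand-written mock with that shape (so it runs without a database) and shows one possible variation that keeps the `source` metadata next to each chunk. `serialize_with_sources` is a hypothetical helper, and the mock contents are made up for illustration.

```python
def serialize_with_sources(data):
    """Join the retrieved chunks for the first query, labelling each with its source."""
    documents = data["documents"][0]  # chunks for the first (and only) query text
    metadatas = data["metadatas"][0]  # metadata dicts aligned with `documents`
    parts = []
    for doc, meta in zip(documents, metadatas):
        source = (meta or {}).get("source", "unknown")
        parts.append(f"[source: {source}]\n{doc}")
    return "\n\n".join(parts)


# Mock result with the same nesting as a single-query Chroma response.
mock_result = {
    "ids": [["doc_0_chunk0", "doc_0_chunk1"]],
    "documents": [["first chunk", "second chunk"]],
    "metadatas": [[{"source": "data/books/pg72846.txt"}, {"source": "data/books/pg72846.txt"}]],
}
print(serialize_with_sources(mock_result))
```

Whether source labels help or distract the generation prompt is something to check with the evals further down.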

In [6]:
queries = [
     "What was the price of flour sold in Boston?",
     "When and why did the british Blockade happen?",
     "What happened during the burning of the Assembly houses in Canada in 1812?",
     
    # "What are some of the most important events in US history?"    
    # "What happ the American declaration of war against England in 1812",
    # "What happened during the burning of the Assembly houses in Canada in 1812?",
    "Elaborate on Napoleon",
    # "The close alliance between Great Britain and Russia",
    # "The loss of the Bank of the United States",
    # "The loss of the Massachusetts and Connecticut banks",
    # "The Battle of the Thames in 1813",
    # "The campaigns of General Dearborn and General Wilkinson",
    # "The blockades and conflicts with British ships, including the battles of Chesapeake and Argus",
    # "Privateering by the US during the war",
    # "The last embargo implemented by the US in an attempt to obtain concessions from England",
    # "The involvement of Russia and England in the war",
    # "The financial challenges faced by the US Treasury",
    # "The changing attitudes and perceptions of the British press towards the US during the war",
    # "The opposition to the war by Federalists, particularly in Massachusetts.",
]

In [9]:
query, context, result = await run_query(
    queries[0], collection_name, k=10, prompt="generate"
)

Querying for: What was the price of flour sold in Boston?


2024-02-27 15:41:32.962676 [W:onnxruntime:, helper.cc:67 IsInputSupported] CoreML does not support input dim > 16384. Input:embeddings.word_embeddings.weight, shape: {30522,384}
2024-02-27 15:41:32.963195 [W:onnxruntime:, coreml_execution_provider.cc:81 GetCapability] CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 49 number of nodes in the graph: 323 number of nodes supported by CoreML: 231


Serialized retrieved data:
 wheat to be brought by sea from Charleston or Norfolk to
Boston. Soon speculation began. The price of imported articles rose to
extravagant points. At the end of the year coffee sold for thirty-eight
cents a pound, after selling for twenty-one cents in August. Tea which
could be bought for $1.70 per pound in August, sold for three and four
dollars in December. Sugar which was quoted at nine dollars a hundred
weight in New Orleans, and in August sold for twenty-one or twenty-two
dollars in New York and Philadelphia, stood at forty dollars in
December.

More sweeping in its effects on exports than on imports, the blockade
rapidly reduced the means of the people. After the summer of 1813,
Georgia alone, owing to its contiguity with Florida, succeeded in
continuing to send out cotton. The exports of New York, which exceeded
$12,250,000 in 1811, fell to $209,000 for the year ending in 1814. The
domestic exports of Virginia diminished in four years from $4,800,000

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




Response:
 The price of superfine flour sold in Boston was $11.87 a barrel.


## Evaluate the response
Run evals on the response across the following criteria:
* **relevance** -- does the answer make sense in the context of the original question?
* **faithfulness** -- is the final answer faithful to the data that we fed into the LLM?
* **coherence** -- is the answer consistent and easy to understand?
* **succinctness** -- does the answer contain unnecessary information?

In [10]:
async def run_evals(query, context, answer):
    config = AIConfigRuntime.load("rag.aiconfig.yaml")
    def _get_prompt(criterion):
        return f"evaluate_{criterion}"
    return [
        await config.run_and_get_output_text(
            _get_prompt(criterion),
            params={
                "query": query,
                "context": context,
                "generate": {
                    "output": answer
                }
            }
        )
        for criterion in [
            "relevance", "faithfulness", "coherence", "succinctness"
        ]
    ]
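Each eval prompt returns free text that, in the outputs shown in this notebook, begins with a YES verdict followed by an explanation. To aggregate results, that text has to be reduced to a boolean. The sketch below is one way to do it, assuming the verdict is always the first word; `parse_verdict` is a hypothetical helper, and returning `None` for unrecognised output is a design choice made here so that malformed responses are not silently counted as failures.

```python
import re


def parse_verdict(eval_text):
    """Map model-graded output to True (YES), False (NO), or None (unrecognised)."""
    # Strip first: model output may start with whitespace or a newline.
    match = re.match(r"(yes|no)\b", eval_text.strip().lower())
    if match is None:
        return None
    return match.group(1) == "yes"


print(parse_verdict("YES\nThe answer is relevant because ..."))
print(parse_verdict("NO. The answer adds details not present in the context."))
print(parse_verdict("Not sure."))
```

The `\b` word boundary keeps replies such as "Not sure" from being read as a NO verdict.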


In [16]:
print(f"Evaluating...Query: {query} \n Answer: {result}")
evals = await run_evals(query, context, result)
print("Evaluations:")
for criterion, score in zip(
    ["relevance", "faithfulness", "coherence", "succinctness"], 
    evals
):
    print(f"\n\n{criterion}: {score}")


Evaluating...Query: What was the price of flour sold in Boston? 
 Answer:  Based on the provided context, the exact price of flour sold in Boston is not specified after August. In August, superfine flour was sold at Boston for $11.87 a barrel. However, after that, the context discusses the increasing prices of other items such as coffee, tea, and sugar, but it does not mention the price of flour. Therefore, I cannot provide the price of flour sold in Boston after August based on this given context.
Evaluations:


relevance: YES
The answer is relevant because it provides the last known price of flour in Boston and explains that further prices are not provided in the available context. Although it doesn't give a current price, it gives a clear explanation for the omission.


faithfulness: YES. The verdict is correct because the context does mention the price of flour in August but does not provide any information about the flour price after that. Therefore, the answer gives an accurate a

## Eval with trials

In [12]:
import pandas as pd 

async def generate_trials_for_eval(query, context, trials, prompt):
    outputs = []
    for _ in range(trials):
        result = await generate(query, context, prompt)
        outputs.append(result)

    return outputs

def get_context_for_trials(query, collection_name, k):
    chroma_client = chromadb.PersistentClient(path=chromadb_path)
    collection = chroma_client.get_collection(name=collection_name)
    data = retrieve_data(collection, query, k)
    context = serialize_retrieved_data(data)
    return context


async def run_batch_evals(query, context, trials, prompt="generate"):
    raw_results = await generate_trials_for_eval(
        query, context, trials, prompt
    )

    out = []
    answers = []
    for rr in raw_results:
        evals_for_trial_numbers = await run_evals(query, context, rr)
        evals_for_trial = dict(
            zip(
                ["relevance", "faithfulness", "coherence", "succinctness"],
                evals_for_trial_numbers,
            )
        )

        out.append(evals_for_trial)
        answers.append(rr)

    df_evals = pd.DataFrame.from_records(out).applymap(
        lambda s: s.lower().startswith("yes")
    )
    df_evals["query"] = query
    df_evals["answer"] = answers

    return df_evals



async def run_query_and_batch_evals(query, collection_name, k, trials, prompt="generate"):
    print("Running query and evals for:", query)
    context = get_context_for_trials(query, collection_name, k)
    return await run_batch_evals(query, context, trials, prompt)



df_pass = await run_query_and_batch_evals(
    query, collection_name, k=10, 
    trials=5,
    prompt="generate"
)


Running query and evals for: What was the price of flour sold in Boston?
Querying for: What was the price of flour sold in Boston?
Serialized retrieved data:
 wheat to be brought by sea from Charleston or Norfolk to
Boston. Soon speculation began. The price of imported articles rose to
extravagant points. At the end of the year coffee sold for thirty-eight
cents a pound, after selling for twenty-one cents in August. Tea which
could be bought for $1.70 per pound in August, sold for three and four
dollars in December. Sugar which was quoted at nine dollars a hundred
weight in New Orleans, and in August sold for twenty-one or twenty-two
dollars in New York and Philadelphia, stood at forty dollars in
December.

More sweeping in its effects on exports than on imports, the blockade
rapidly reduced the means of the people. After the summer of 1813,
Georgia alone, owing to its contiguity with Florida, succeeded in
continuing to send out cotton. The exports of New York, which exceeded
$12,250,0

  df_evals = pd.DataFrame.from_records(out).applymap(
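`generate_trials_for_eval` above awaits each trial before starting the next. Since the trials are independent, they could in principle be issued concurrently with `asyncio.gather`. The sketch below shows the pattern with a stand-in coroutine, `fake_generate`, in place of the notebook's `generate`; whether the AIConfig runtime and your API rate limits tolerate concurrent calls is an assumption you would need to verify.

```python
import asyncio


async def fake_generate(query, context, prompt):
    """Stand-in for the notebook's `generate`; no model call is made."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"answer to {query!r}"


async def generate_trials_concurrently(generate_fn, query, context, trials, prompt="generate"):
    """Start all trials at once and collect their outputs in order."""
    tasks = [generate_fn(query, context, prompt) for _ in range(trials)]
    return await asyncio.gather(*tasks)


# In a plain script; inside a notebook cell, use `await ...` instead of asyncio.run.
outputs = asyncio.run(
    generate_trials_concurrently(fake_generate, "q", "ctx", trials=3)
)
print(outputs)
```

`asyncio.gather` returns results in the order the coroutines were passed, so the outputs still line up with trial indices.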


In [13]:
pd.set_option("display.max_colwidth", 500)
display(df_pass)


print("Trial results, all queries (% pass):")
display(df_pass.drop(columns=["answer"]).groupby("query").mean() * 100)

| | relevance | faithfulness | coherence | succinctness | query | answer |
|---|---|---|---|---|---|---|
| 0 | True | True | False | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |
| 1 | True | True | False | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |
| 2 | True | True | True | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |
| 3 | True | True | False | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |
| 4 | True | True | False | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |


Trial results, all queries (% pass):


| query | relevance | faithfulness | coherence | succinctness |
|---|---|---|---|---|
| What was the price of flour sold in Boston? | 100.0 | 100.0 | 20.0 | 0.0 |
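The percentages above are just the share of trials whose verdict was YES for each criterion. As a cross-check that does not depend on pandas, the same numbers can be recomputed from the five trial rows shown in the table. This is a plain-Python sketch; `pass_rates` is a hypothetical helper.

```python
CRITERIA = ["relevance", "faithfulness", "coherence", "succinctness"]


def pass_rates(trials, criteria=CRITERIA):
    """Return the percentage of trials that passed each criterion."""
    if not trials:
        raise ValueError("need at least one trial")
    return {
        criterion: 100.0 * sum(1 for t in trials if t[criterion]) / len(trials)
        for criterion in criteria
    }


# Verdicts copied from the five trial rows above (only trial 2 passed coherence).
trials = [
    {"relevance": True, "faithfulness": True, "coherence": i == 2, "succinctness": False}
    for i in range(5)
]
print(pass_rates(trials))
```

With only five trials per query, each trial moves a score by 20 percentage points, so small differences between runs should not be over-interpreted.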


## [Dev] Eval with trials, all queries

In [14]:
df_pass_all_queries = pd.concat(
    [
        await run_query_and_batch_evals(
            query, collection_name, k=10, 
            trials=5
        )
        for query in queries
    ]
)
df_pass_all_queries.head()

Running query and evals for: What was the price of flour sold in Boston?
Querying for: What was the price of flour sold in Boston?
Serialized retrieved data:
 wheat to be brought by sea from Charleston or Norfolk to
Boston. Soon speculation began. The price of imported articles rose to
extravagant points. At the end of the year coffee sold for thirty-eight
cents a pound, after selling for twenty-one cents in August. Tea which
could be bought for $1.70 per pound in August, sold for three and four
dollars in December. Sugar which was quoted at nine dollars a hundred
weight in New Orleans, and in August sold for twenty-one or twenty-two
dollars in New York and Philadelphia, stood at forty dollars in
December.

More sweeping in its effects on exports than on imports, the blockade
rapidly reduced the means of the people. After the summer of 1813,
Georgia alone, owing to its contiguity with Florida, succeeded in
continuing to send out cotton. The exports of New York, which exceeded
$12,250,0



Running query and evals for: When and why did the british Blockade happen?
Querying for: When and why did the british Blockade happen?
Serialized retrieved data:
 antic during the
winter months.

With it went the tale of Napoleon’s immense disaster. October 23 he
began his retreat; November 23 he succeeded in crossing the Beresina
and escaping capture; December 5 he abandoned what was still left of
his army; and December 19, after travelling secretly and without rest
across Europe, he appeared suddenly in Paris, still powerful, but in
danger. Nothing could be better calculated to support the Russian
mediation in the President’s mind. The possibility of remaining without
a friend in the world while carrying on a war without hope of success,
gave to the Czar’s friendship a value altogether new.

Other news crossed the ocean at the same time, but encouraged no hope
that England would give way. First in importance, and not to be trifled
with, was the British official announcement, dated De



Running query and evals for: What happened during the burning of the Assembly houses in Canada in 1812?
Querying for: What happened during the burning of the Assembly houses in Canada in 1812?
Serialized retrieved data:
 , including the houses of Assembly, were burned.
The destruction of the Assembly houses, afterward alleged as ground
for retaliation against the capitol at Washington, was probably the
unauthorized act of private soldiers. Dearborn protested that it was
done without his knowledge and against his orders.[166]

The success cost far more than it was worth. The explosion of a powder
magazine, near which the American advance halted, injured a large
number of men on both sides. Not less than three hundred and twenty
Americans were killed or wounded in the battle or explosion,[167] or
about one fifth of the entire force. General Pike, the best brigadier
then in the service, was killed. Only two or three battles in the
entire war were equally bloody.[168] “Unfortunately the en



Running query and evals for: Elaborate on Napoleon
Querying for: Elaborate on Napoleon
Serialized retrieved data:
 hed against an implacable foe, and the fulness
    of her power at length drawn out. It never entered into my mind
    that we should send a fleet to take rest and shelter in our own
    ports in North America, and that we should then attack the American
    ports with a flag of truce.”[9]

From such criticisms Lord Castlereagh had no difficulty in defending
himself. Whitbread alone maintained that injustice had been done to
America, and that measures ought to be taken for peace.

This debate took place November 30, two days after the destruction of
Napoleon’s army in passing the Beresina. From that moment, and during
the next eighteen months, England had other matters to occupy her mind
than the disagreeable subject of the American war. Napoleon arrived in
Paris December 18, and set himself to the task of renewing the army of
half a million men which had been lost in Russ



| | relevance | faithfulness | coherence | succinctness | query | answer |
|---|---|---|---|---|---|---|
| 0 | True | True | True | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |
| 1 | True | True | False | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |
| 2 | True | True | False | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |
| 3 | True | True | True | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |
| 4 | True | True | False | False | What was the price of flour sold in Boston? | The price of superfine flour sold in Boston was $11.87 a barrel. |


In [15]:
print("Trial results, all queries (% pass):")
df_pass_all_queries.drop(columns=["answer"]).groupby("query").mean() * 100


Trial results, all queries (% pass):


| query | relevance | faithfulness | coherence | succinctness |
|---|---|---|---|---|
| Elaborate on Napoleon | 100.0 | 100.0 | 60.0 | 0.0 |
| What happened during the burning of the Assembly houses in Canada in 1812? | 100.0 | 0.0 | 100.0 | 80.0 |
| What was the price of flour sold in Boston? | 100.0 | 100.0 | 40.0 | 0.0 |
| When and why did the british Blockade happen? | 100.0 | 100.0 | 20.0 | 20.0 |


In [16]:
!python3 rag.py info




Starting info
Available Chroma Collections: [Collection(name=us_history_volume_7)]
