# RAG evaluation using TruLens
This notebook sets up a retrieval-augmented generation (RAG) pipeline using LangChain, LlamaCpp, Qdrant (a vector database), and TruLens. The pipeline retrieves astrophysics-related information from the database, uses an LLM (OLMo-7B-Instruct) to generate answers, and evaluates the quality of those answers with TruLens.

## 1. Import required libraries and dependencies

In [32]:
# !pip install openai trulens
# !pip install --no-deps trulens-apps-langchain
# !pip install "trulens-providers-openai>=1.0.0"

In [None]:
import os
import numpy as np
import pandas as pd
from datetime import datetime

from pathlib import Path

from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp
from langchain_qdrant import Qdrant
from langchain_huggingface import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.output_parsers import StrOutputParser

from trulens.apps.app import instrument
from trulens.apps.langchain import TruChain
from trulens.core import TruSession, Feedback
from trulens.providers.openai import OpenAI

import warnings
warnings.filterwarnings("ignore")

In [2]:
# # Set the OpenAI API key for authentication (Replace with your actual API key)
# os.environ["OPENAI_API_KEY"] = "<API key>"
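The OpenAI provider created later in this notebook needs `OPENAI_API_KEY` to be set. As a minimal sketch, assuming you export the key in your shell rather than hard-coding it here, a small check like the one below fails early with a clear message. The helper name `require_env` is hypothetical and not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    `require_env` is a hypothetical helper written for this notebook.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` before the provider is created surfaces a missing key immediately, rather than at the first feedback call.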

In [3]:
# import the function download_olmo_model from the ssec_tutorials repository
from ssec_tutorials import download_olmo_model
help(download_olmo_model)

Help on function download_olmo_model in module ssec_tutorials.setup:

download_olmo_model(model_file: 'str | None' = None, force=False) -> 'Path'
    Download the OLMO model from the Hugging Face model hub.
    
    Parameters
    ----------
    model_file : str | None, optional
        The name of the model file to download, by default None
    force : bool, optional
        Whether to force the download even if the file already exists, by default False
    
    Returns
    -------
    pathlib.Path
        The path to the downloaded model file



In [4]:
# initialize a TruLens session and reset its database:
session = TruSession()
session.reset_database()

🦑 Initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `TruSession` to prevent this.


Updating app_name and app_version in apps table: 0it [00:00, ?it/s]
Updating app_id in records table: 0it [00:00, ?it/s]
Updating app_json in apps table: 0it [00:00, ?it/s]


In [5]:
# simple document formatting function
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
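To see what `format_docs` produces, here is a small sketch that runs it on stand-in objects. `FakeDoc` is a hypothetical replacement for a LangChain document; the only assumption is that each document exposes a `page_content` string, which is all `format_docs` reads.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document (only page_content is used)."""
    page_content: str


def format_docs(docs):
    # same logic as the cell above: join chunk texts with a blank line between them
    return "\n\n".join(doc.page_content for doc in docs)


docs = [FakeDoc("first chunk"), FakeDoc("second chunk")]
context_text = format_docs(docs)
print(context_text)
```

The chunks come back as one string separated by blank lines, which is the form the prompt template's `{context}` slot receives.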

## 2. Set up the vector database (Qdrant)

In [7]:
# load a sentence embedding model for text representation
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")

In [8]:
# Connect to Qdrant
qdrant_path = Path("/workspaces/Rubin-RAG/resources/rubin_qdrant")
qdrant_collection = "rubin_telescope"

# client.close()
client = QdrantClient(path=str(qdrant_path))

lcqdrant = Qdrant(client=client, 
                  collection_name=qdrant_collection, 
                  embeddings=embedding)
# lcqdrant = Qdrant.from_existing_collection(
#     collection_name=qdrant_collection, embedding=embedding, path=qdrant_path
# )

In [9]:
# set up a retriever for fetching relevant documents
retriever = lcqdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})
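The `"mmr"` search type stands for maximal marginal relevance: results are chosen to be relevant to the query but not redundant with each other. The sketch below illustrates the general idea in pure Python. It is not the code LangChain or Qdrant runs; the function names, the `lambda_mult` weight, and the toy vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# toy example: documents 0 and 1 are near-duplicates, document 2 covers another direction
query = [1.0, 1.0, 0.0]
docs = [[1.0, 0.02, 0.0], [1.0, 0.05, 0.0], [0.0, 1.0, 0.0]]
picked = mmr_select(query, docs, k=2)
print(picked)  # [1, 2]
```

A plain top-2 similarity search would return documents 1 and 0 here, which say nearly the same thing. MMR keeps document 1 and swaps the near-duplicate for document 2.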

## 3. Set up the LLM (OLMo-7B-Instruct)

In [10]:
# download the OLMo model
model_path = download_olmo_model()

Model already exists at /home/mambauser/.cache/ssec_tutorials/OLMo-7B-Instruct-Q4_K_M.gguf


In [11]:
# set up the LangChain llama.cpp model object; we are using the `OLMo-7B-Instruct` model.
# llama-cpp-python is the Python binding for the llama.cpp C++ library, as mentioned in previous modules.
olmo = LlamaCpp(
    model_path=str(model_path),  # the path to the OLMo model in GGUF file format
    callbacks=[StreamingStdOutCallbackHandler()],
    temperature=0.8,  # set the randomness of the model's output
    n_ctx=4096,  # context window size in tokens (prompt plus generated text)
    max_tokens=512,  # set limit for the length of the generated text
    verbose=False,  # determines whether the model should print out debug information
    echo=False,  # determines whether the input prompt should be included in the output
)
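Because the llama.cpp context window holds both the prompt and the generated tokens, the two limits above determine how much room the prompt (instructions, retrieved context, and question) can take. A quick arithmetic sketch, with a hypothetical helper name:

```python
N_CTX = 4096      # context window passed to LlamaCpp above
MAX_TOKENS = 512  # generation cap passed to LlamaCpp above


def prompt_token_budget(n_ctx: int, max_tokens: int) -> int:
    """Tokens left for the prompt once room for the full answer is reserved."""
    if max_tokens >= n_ctx:
        raise ValueError("max_tokens must be smaller than n_ctx")
    return n_ctx - max_tokens


budget = prompt_token_budget(N_CTX, MAX_TOKENS)
print(budget)  # 3584
```

If the retrieved chunks plus the question exceed this budget, the model has less room than `max_tokens` to answer, so it is worth keeping the number and size of retrieved chunks in check.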

## 4. RAG pipeline

In [12]:
# Define a RAG class with an instrumented retrieval step
class RAG:
    # Retrieve relevant text from the Qdrant retriever defined above
    @instrument
    def retrieve(self, query: str) -> list:
        docs = retriever.invoke(query)
        return [doc.page_content for doc in docs]  # return them as a list of text chunks

rag = RAG()

In [13]:
# Define a prompt template
custom_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="You are an astrophysics expert. "
             "Please answer the question on astrophysics based on the following context:\n\n"
             "Context: {context}\n"
             "Question: {question}\n"
)

In [14]:
# RAG chain definition
rag_chain = (
    {
        "context": retriever | format_docs,  # retrieve and format the context
        "question": RunnablePassthrough() # Pass the user’s question directly
    }
    | custom_prompt
    | olmo 
    | StrOutputParser()  # Parse and extract only the final model output
)

## 5. TruLens Feedback Evaluation pipeline

I initialize a provider class backed by OpenAI. TruLens uses OpenAI's language model to evaluate the responses, so the feedback scores below are based on OpenAI-generated judgments.

In [16]:
# initialize provider class
provider = OpenAI()

In [17]:
# select context to be used in feedback
context = TruChain.select_context(rag_chain)

### 5.1 Feedback Functions

In [18]:
# define a groundedness feedback function
# that checks if the answer is factually supported by retrieved documents.
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons, name="Groundedness"
    )
    .on(context.collect())  # collect context chunks into a list
    .on_output()
)

✅ In Groundedness, input source will be set to __record__.app.first.steps__.context.first.invoke.rets[:].page_content.collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


In [19]:
# define a question/answer relevance function
# that evaluates how relevant the RAG's answer is to the question
f_answer_relevance = Feedback(
    provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .


In [20]:
# define a context relevance feedback function
# that evaluates how relevant each retrieved context chunk is to the question
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input context will be set to __record__.app.first.steps__.context.first.invoke.rets[:].page_content .
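Context relevance is scored once per retrieved chunk, and `.aggregate(np.mean)` collapses those per-chunk scores into one number per record. The sketch below shows that aggregation step in isolation. The per-chunk values are hypothetical, chosen only to illustrate how two chunks could average to a record-level score.

```python
from statistics import fmean


def aggregate_context_relevance(chunk_scores):
    """Average per-chunk relevance scores into one record-level score."""
    if not chunk_scores:
        return None  # nothing was retrieved, so there is nothing to average
    return fmean(chunk_scores)


# hypothetical scores for the two chunks returned by the retriever (k=2)
record_score = aggregate_context_relevance([0.0, 1 / 3])
print(round(record_score, 6))  # 0.166667
```

A low average can hide one useful chunk next to an irrelevant one, so it is worth inspecting the per-chunk reasons in the dashboard when a record scores poorly.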


### 5.2 Set up a TruLens evaluation recorder
Next, we create a TruLens evaluation recorder that monitors and logs the performance of our RAG system.

In [21]:
tru_recorder = TruChain(
    rag_chain,
    app_name="RubinChat",
    app_version="v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)

instrumenting <class 'langchain_core.runnables.base.RunnableParallel'> for base <class 'langchain_core.runnables.base.RunnableParallel'>
	instrumenting invoke
	instrumenting ainvoke
	instrumenting stream
	instrumenting astream
instrumenting <class 'langchain_core.runnables.base.RunnableParallel'> for base <class 'langchain_core.runnables.base.RunnableSerializable'>
	instrumenting invoke
	instrumenting ainvoke
	instrumenting stream
	instrumenting astream
instrumenting <class 'langchain_core.runnables.base.RunnableParallel'> for base <class 'langchain_core.load.serializable.Serializable'>
instrumenting <class 'langchain_core.vectorstores.base.VectorStoreRetriever'> for base <class 'langchain_core.vectorstores.base.VectorStoreRetriever'>
	instrumenting invoke
	instrumenting ainvoke
	instrumenting stream
	instrumenting astream
	instrumenting _get_relevant_documents
	instrumenting get_relevant_documents
	instrumenting aget_relevant_documents
	instrumenting _aget_relevant_documents

## 6. Test and Evaluate RAG Responses
### 6.1 Load questions from the LSST community forum dataset


In [22]:
# load the CSV file
lsst_forum_responses_path = "data/input/lsst_forum_responses.csv"
qa_df = pd.read_csv(lsst_forum_responses_path)
print(qa_df.shape)
# print(qa_df.columns)
qa_df.head()

(10612, 14)


Unnamed: 0,category_id,question_header,question_author_id,question,question_date,answer,answer_author_username,answer_date,community_role,community_visual_badge,is_moderator,is_admin,is_staff,is_accepted_answer
0,55,Lasair watchlist with large search radius vers...,2517,"<p>Hi, I want to define Lasair filters for a g...",2025-03-07,<p>Thanks Roy.</p>\n<p>I’m experimenting with ...,iperezfournon,2025-03-10,,,False,False,False,False
1,55,Lasair watchlist with large search radius vers...,2517,"<p>Hi, I want to define Lasair filters for a g...",2025-03-07,<p>I think the maximum radius is 1000 arc seco...,roy,2025-03-10,,,False,False,False,False
2,55,Lasair watchlist with large search radius vers...,2517,"<p>Hi, I want to define Lasair filters for a g...",2025-03-07,"<p>Thanks Roy,</p>\n<p>it looks like the maxim...",iperezfournon,2025-03-10,,,False,False,False,False
3,55,Lasair watchlist with large search radius vers...,2517,"<p>Hi, I want to define Lasair filters for a g...",2025-03-07,<p>Hi Ismael<br>\nBoth watchlist and watchmap ...,roy,2025-03-08,,,False,False,False,True
4,10,Science Pipeline Release 29.0.0 Status and Dis...,1185,<p>We are starting the science pipelines relea...,2025-03-05,<p>Release candidate v29_0_0_rc1 is now availa...,mwittgen,2025-03-05,LSSTDM,LSSTDM,False,False,False,False


In [None]:
# keep only the rows whose answer was accepted
accepted_qa_df = qa_df[qa_df["is_accepted_answer"] == True]

# accepted_qa_df = accepted_qa_df[accepted_qa_df["category_id"] == 23]
# reassign rather than use inplace=True, so we work on a fresh DataFrame instead of a slice of qa_df
accepted_qa_df = accepted_qa_df.reset_index(drop=True)

print(accepted_qa_df.shape)
accepted_qa_df.head()

(513, 14)


Unnamed: 0,category_id,question_header,question_author_id,question,question_date,answer,answer_author_username,answer_date,community_role,community_visual_badge,is_moderator,is_admin,is_staff,is_accepted_answer
0,55,Lasair watchlist with large search radius vers...,2517,"<p>Hi, I want to define Lasair filters for a g...",2025-03-07,<p>Hi Ismael<br>\nBoth watchlist and watchmap ...,roy,2025-03-08,,,False,False,False,True
1,6,Where to find the Vera C. Rubin Observatory na...,440,"<p>There used to be a the <a href=""https://pro...",2025-03-01,"<p>Hi Meg, thanks for this.</p>\n<p>I can conf...",MelissaGraham,2025-03-01,LSSTDM,CST,True,True,True,True
2,6,Rubin Data Rights for Rubin source IDs and pla...,2316,<p>I am currently writing a proposal for a NAS...,2025-02-25,"<p>Hi <a class=""mention"" href=""/u/kkruszynska""...",jeffcarlin,2025-02-25,LSSTDM,CST,True,False,True,True
3,6,RubinObservatory.org site issue with Chrome on...,229,"<p>Hello, Rubin website team-</p>\n<p>This is ...",2025-02-06,"<p>Hi <a class=""mention"" href=""/u/tomloredo"">@...",MelissaGraham,2025-02-06,LSSTDM,CST,True,True,True,True
4,60,UK IDAC offline for maintenance during 4th--5t...,245,<p>Apologies for short notice. Due to planned ...,2025-02-04,<p>The UK IDAC is now back online. Apologies f...,george_beckett,2025-02-06,,,True,False,True,True


In [25]:
# that is still a lot of rows; for testing purposes, sample 5 of them
random_seed = 42

sampled_accepted_qa_df = accepted_qa_df.sample(n=5, random_state=random_seed)
sampled_accepted_qa_df.reset_index(drop=True, inplace=True)

print(sampled_accepted_qa_df.shape)
sampled_accepted_qa_df.head()

(5, 14)


Unnamed: 0,category_id,question_header,question_author_id,question,question_date,answer,answer_author_username,answer_date,community_role,community_visual_badge,is_moderator,is_admin,is_staff,is_accepted_answer
0,6,Single_frame task produces different results e...,443,"<p>Hi,<br>\nI’m following this tutorial: <a hr...",2022-08-11,<p>Quick comment on the code:</p>\n<aside clas...,timj,2022-08-20,LSSTDM,LSSTDM,False,False,False,True
1,6,Issue building 11_0 on Mac OS X Yosemite 10.10.5,4,<p>I seem to have a linker error when I instal...,2015-10-05,<p>It seemed to be solved by one or any of:</p...,jsick,2015-10-06,LSSTDM,LSSTDM,True,True,True,True
2,34,Access to forced photometry and postage stamps...,381,"<p>Final couple of questions for now, on broke...",2020-02-19,"<p>Hi Stephen, I have some answers from the DM...",MelissaGraham,2020-02-19,LSSTDM,CST,True,True,True,True
3,49,Listing available repos,247,"<p>Hi there,</p>\n<p>Is there some way I find ...",2024-01-24,<p>Hi James<br>\nmaybe<br>\n<code>dafButler.Bu...,MRead,2024-01-24,,,False,False,False,True
4,6,Setup: Unable to take shared lock on $STACKDIR...,80,<p>Has anyone seen the eups based error before...,2015-09-18,<p>Eups locking is known to be broken. There ...,jbosch,2015-09-18,LSSTDM,LSSTDM,False,False,False,True
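The `question` and `answer` columns above still contain HTML markup from the forum export, while the next cell shows them as plain text. The cleaning code itself is not shown in this notebook, so the following is only a sketch of one way to do it with the standard library. The names `_TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so &amp; becomes &
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw: str) -> str:
    """Return the visible text of an HTML fragment, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    # note: text on either side of an inline tag is also separated by a space
    return " ".join(part.strip() for part in parser.parts if part.strip())


sample = "<p>Hi,<br>\nI&#39;m following <a href='x'>this</a> tutorial &amp; more</p>"
cleaned = strip_html(sample)
print(cleaned)
```

On the sampled DataFrame this could be applied with `sampled_accepted_qa_df["question"].map(strip_html)`, and likewise for the `answer` column. The exact whitespace will differ from the output shown in the next cell.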


In [None]:
print(sampled_accepted_qa_df.shape)
sampled_accepted_qa_df.head()

(5, 14)


Unnamed: 0,category_id,question_header,question_author_id,question,question_date,answer,answer_author_username,answer_date,community_role,community_visual_badge,is_moderator,is_admin,is_staff,is_accepted_answer
0,6,Single_frame task produces different results e...,443,"Hi, \nI’m following this tutorial: The LSST S...",2022-08-11,Quick comment on the code: \n \n \n \n petarz...,timj,2022-08-20,LSSTDM,LSSTDM,False,False,False,True
1,6,Issue building 11_0 on Mac OS X Yosemite 10.10.5,4,I seem to have a linker error when I install v...,2015-10-05,It seemed to be solved by one or any of: \n \n...,jsick,2015-10-06,LSSTDM,LSSTDM,True,True,True,True
2,34,Access to forced photometry and postage stamps...,381,"Final couple of questions for now, on broker r...",2020-02-19,"Hi Stephen, I have some answers from the DM te...",MelissaGraham,2020-02-19,LSSTDM,CST,True,True,True,True
3,49,Listing available repos,247,"Hi there, \n Is there some way I find out what...",2024-01-24,Hi James \nmaybe \n dafButler.Butler.get_known...,MRead,2024-01-24,,,False,False,False,True
4,6,Setup: Unable to take shared lock on $STACKDIR...,80,Has anyone seen the eups based error before: \...,2015-09-18,Eups locking is known to be broken. There are...,jbosch,2015-09-18,LSSTDM,LSSTDM,False,False,False,True


### 6.2 Test the RAG pipeline

In [27]:
# run RAG pipeline on all the sampled Q&A
responses = []

for _, row in sampled_accepted_qa_df.iterrows():
    question = row["question"]
    true_answer = row["answer"]

    # run the RAG pipeline with TruLens recording
    print("\n\nQuestion: ", question)
    # keep the retrieved documents under their own name so the TruLens
    # `context` selector defined earlier is not overwritten
    retrieved_docs = retriever.invoke(question)
    with tru_recorder as recording:
        rag_response = rag_chain.invoke(question)

    # store the results in a dict and append it to the list
    responses.append({
        "question": question,
        "context": format_docs(retrieved_docs),
        "true_answer": true_answer,
        "RAG_generated_answer": rag_response
    })

# convert the responses into a DataFrame
responses_df = pd.DataFrame(responses)



Question:  Hi, 
I’m following this tutorial:  The LSST Science Pipelines — LSST Science Pipelines  and I’ve ran the first step “single_frame” task a few times. Each time it runs it produces different results: if I go through all calexps in the output collection ( butler.registry.queryDatasets("calexp", collections=collection) ), and look at their sky coverage (calexp width, height and WCS mapping), and then find the total coverage of the whole collection (max and min ra, dec coordinates), I get different results each time it runs. And I’m starting it like this (verbatim what’s in the tutorial): 
 pipetask run -b $RC2_SUBSET_DIR/SMALL_HSC/butler.yaml \
             -p $RC2_SUBSET_DIR/pipelines/DRP.yaml#singleFrame \
             -i HSC/RC2/defaults \
             -o u/$USER/single_frame \
             --register-dataset-types
 
 What could be the explanation for this behavior? 
 Thanks, 
Petar



Answer:
Based on your description, it seems like you are observing a change in the sky coverage of the outputs from each execution of the single_frame task. The difference in the sky coverage might be attributed to the following reasons:

1. Random seed setting: By default, when running the pipeline, the LSST Science Pipelines use a fixed random seed for generating random numbers that influence the data generation and processing steps. When you run the same single_frame task multiple times, the random numbers generated from the same random seed might produce slightly different results each time because of this randomness.
2. Changes in pipeline configuration or dependencies: If you modified any parameters or added/removed dependencies between pipeline runs, these changes may affect the behavior and output of the pipeline. This could lead to varying sky coverage results for the same task run.
3. Updates to data products or processing steps: The LSST Science Pipelines are continuously b

In [28]:
# get records and feedback from TruLens
records, feedback = session.get_records_and_feedback()

# select required columns from the records df and merge with the ground truth df
records_selected = records[["app_id", "input"] + feedback]
full_trulens_results_df = responses_df.merge(records_selected, 
                                             left_on="question", 
                                             right_on="input", 
                                             how="left")

full_trulens_results_df.drop(columns=["input"], inplace=True)
full_trulens_results_df = full_trulens_results_df[["app_id", "question", "true_answer", "context", "RAG_generated_answer", "Answer Relevance", "Groundedness", "Context Relevance"]]
# full_trulens_results_df = full_trulens_results_df[full_trulens_results_df["app_id"] == ""]
full_trulens_results_df.rename(columns={"Answer Relevance": "trulens_Answer_Relevance", "Groundedness":"trulens_Groundedness", "Context Relevance":"trulens_Context_Relevance"}, inplace=True)
full_trulens_results_df.head()

Unnamed: 0,app_id,question,true_answer,context,RAG_generated_answer,trulens_Answer_Relevance,trulens_Groundedness,trulens_Context_Relevance
0,app_hash_249cdbf218ae5cad989db0fbaf09493b,"Hi, \nI’m following this tutorial: The LSST S...",Quick comment on the code: \n \n \n \n petarz...,Draft\nLVV-P106: Data Management Acceptance Te...,"\nAnswer:\nBased on your description, it seems...",1.0,0.052632,0.333333
1,app_hash_249cdbf218ae5cad989db0fbaf09493b,I seem to have a linker error when I install v...,It seemed to be solved by one or any of: \n \n...,DMTN-001: Porting the stack to OS X El Capitan...,\nAn implementation of the UNION CSC (C langua...,0.0,0.0,0.166667
2,app_hash_249cdbf218ae5cad989db0fbaf09493b,"Final couple of questions for now, on broker r...","Hi Stephen, I have some answers from the DM te...",Data Policy | RDO-13 (rel 1.2.2) | Latest Revi...,\nAnswer: \n The precovery service will provid...,1.0,0.285714,0.333333
3,app_hash_249cdbf218ae5cad989db0fbaf09493b,"Hi there, \n Is there some way I find out what...",Hi James \nmaybe \n dafButler.Butler.get_known...,3 Overview\nThe Butler is implemented as thr...,\nAnswer:\nTo list the available values for `c...,1.0,0.166667,0.5
4,app_hash_249cdbf218ae5cad989db0fbaf09493b,Has anyone seen the eups based error before: \...,Eups locking is known to be broken. There are...,"4/11/24, 2:31 PM\nRSP identity management impl...","\nAnswer: Yes, I have seen the eups based erro...",,,


### 6.3 TruLens Evaluation

In [29]:
session.get_leaderboard()


app_name,app_version,Answer Relevance,Context Relevance,Groundedness,latency,total_cost
RubinChat,v1,0.75,0.333333,0.126253,262.378483,0.0
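The leaderboard values appear consistent with a plain mean over the four records that already have feedback scores in the per-record table of section 6.2. The fifth record has no scores there and is skipped. The sketch below redoes that arithmetic with the scores copied from the table. It is a sanity check written for this notebook, not a description of how TruLens computes the leaderboard internally.

```python
import math

# per-record scores copied from the results table in section 6.2 (NaN = not scored)
rows = [
    {"Answer Relevance": 1.0, "Groundedness": 0.052632, "Context Relevance": 0.333333},
    {"Answer Relevance": 0.0, "Groundedness": 0.0, "Context Relevance": 0.166667},
    {"Answer Relevance": 1.0, "Groundedness": 0.285714, "Context Relevance": 0.333333},
    {"Answer Relevance": 1.0, "Groundedness": 0.166667, "Context Relevance": 0.5},
    {"Answer Relevance": math.nan, "Groundedness": math.nan, "Context Relevance": math.nan},
]


def mean_ignoring_nan(values):
    """Mean of the values that are not NaN; NaN if none are scored."""
    scored = [v for v in values if not math.isnan(v)]
    return sum(scored) / len(scored) if scored else math.nan


leaderboard = {
    metric: round(mean_ignoring_nan([row[metric] for row in rows]), 6)
    for metric in rows[0]
}
print(leaderboard)
```

The result matches the leaderboard row: 0.75 for answer relevance, 0.333333 for context relevance, and 0.126253 for groundedness. With only four scored records, one question moves each average a lot, so these numbers are a smoke test rather than a benchmark.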


In [30]:
# launch the interactive dashboard UI
session.run_dashboard()

Starting dashboard ...


Accordion(children=(VBox(children=(VBox(children=(Label(value='STDOUT'), Output())), VBox(children=(Label(valu…

Dashboard started at http://localhost:46121 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

## 7. Save the results

In [31]:
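timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
results_dir = Path("data/results")
results_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if it is missing
file_name = results_dir / f"trulens_results_{timestamp}.csv"
full_trulens_results_df.to_csv(file_name, index=False)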