## RAG Evaluation

This notebook identifies which RAG technique is best suited for the game. The following RAG techniques are evaluated:
- No RAG
- Basic RAG
- Sentence Window Retrieval
- Auto-Merging Retrieval

The RAG techniques are compared with one another on:
- Answer Relevance
- Context Relevance
- Groundedness

RAG vs No RAG is evaluated on:
- Answer Relevance
- Time taken

Test set: 20 slides are chosen, and each RAG technique is tested on the following tasks:
- Summary
- Discussion

For this evaluation, the Intro answers are fixed to the following standard values:  
- Player Name: Tom  
- AI Name: Tim  
- AI Personality: A Spoon Pretending to be a Human

In [1]:
import sys
# caution: path[0] is reserved for script path (or '' in REPL)
sys.path.insert(1, 'C:/Users/utkar/Desktop/NUS/FYP/GamificationUsingLLM/game')

### Rationale Behind Slides Chosen:

A lot of content on the page: 4, 6, 9, 15, 40, 44, 48, 49, 50, 55  
External source of content available to supplement the slide via RAG: 7, 12, 17, 19, 37, 38  
Case Studies: 21, 26, 29, 34

In [2]:
# Slides Chosen:
slides = [4, 6, 7, 9, 12, 15, 17, 19, 21, 26, 29, 34, 37, 38, 40, 44, 48, 49, 50, 55]
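As a quick sanity check, the three groups listed in the rationale can be combined and compared with `slides`. This is a minimal sketch; the group variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Slide groups copied from the rationale above (variable names are illustrative).
content_heavy = [4, 6, 9, 15, 40, 44, 48, 49, 50, 55]
external_sources = [7, 12, 17, 19, 37, 38]
case_studies = [21, 26, 29, 34]

slides = [4, 6, 7, 9, 12, 15, 17, 19, 21, 26, 29, 34, 37, 38, 40, 44, 48, 49, 50, 55]

combined = sorted(content_heavy + external_sources + case_studies)

# Every selected slide belongs to exactly one group, and no slide is missing.
assert len(combined) == len(set(combined)) == 20
assert combined == slides
```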

### One Question Per Slide for Discussion

In [3]:
questions = {
    4: 'What is the use of a Model Registry?',
    6: 'I do not understand why No Continuous Integration (CI) makes the architecture unfeasible?',
    7: 'Explain the overall architecture of a recommender system?',
    9: 'I am a bit confused as to why documentation is needed during MLOps',
    12: 'How does Federated Learning work?',
    15: 'What is the difference between data science and MLOps when it comes to monitoring?',
    17: 'Can you give an explanation of MLOps when it comes to iterative-incremental development?',
    19: 'Why is versioning important during ML Development?',
    21: 'What are some possible biases that could lead to denying of loan?',
    26: 'Could you explain some reasons that lead to concept drifts?',
    29: 'Why did Google fail with the flu trends and what was the Big Data Hubris?',
    34: 'How do you ensure that model degradation does not occur?',
    37: 'Why would reoccurring concepts occur?',
    38: 'Can you explain how system health monitoring works in practice?',
    40: 'What are some ways for there to be early issue detection in models?',
    44: 'Why would sampling frequency lead to delayed ground truth and why does it matter?',
    48: 'What is the KS Test?',
    49: 'What is Kullback-Leibler divergence used for?',
    50: 'What is considered a high score for PSI?',
    55: 'If one metric is incomplete, how does one find a balance between multiple metrics that give varying results?',
}

### Create Data for RAG

In [4]:
from settings import *

# Read in summary of each slide:
overall_summary = [slides_summary[page] for page in slides_summary]

# Turn into documents for Llama_index    
from llama_index import Document

documents = [Document(text=slide) for slide in overall_summary]

### Loading Supplementary Documents ###

# Use llama_index to get a function that reads in our data/file
from llama_index import SimpleDirectoryReader

additional_documents = SimpleDirectoryReader(
    input_files=["../../../Creating_Data/Three_Levels_of_ML_Software.pdf", # Slides 10, 11, 12, 13
                 "../../../Creating_Data/MLOps_Principles.pdf",            # Slides 17
                 "../../../Creating_Data/mlops_overview.pdf",              # Slides 18, 19
                 "../../../Creating_Data/learning_concept_drift.pdf",      # Slides 37
                 "../../../Creating_Data/Diagnose_with_Live_Metrics.pdf"]  # Slides 38
).load_data()

# Supplementary Info for Slides 7 and 8:
alibaba_archi = (
    "Title: Understanding the Technical Architecture of Alibaba Cloud's Recommender System\n\nAlibaba Cloud's technical expert, Aohai, delves into the fundamental concepts and intricate architecture of an enterprise-level recommender system in this comprehensive article.\n\n1) What Is a Recommender System?\n\nAohai elucidates the necessity and function of recommender systems in today's digital landscape. As online platforms expand, the challenge of connecting users with suitable products intensifies. Platforms like Taobao, hosting a myriad of products, grapple with this challenge. Recommender systems address this by refining the matching of user information with item information, notably through query-based and feed streaming recommendations.\n\nThe system aligns user preferences with item properties. Query-based recommendations tailor item displays based on user preferences, such as color and price. In contrast, feed streaming recommendations, prevalent in apps like Hupu and Toutiao, curate content according to daily user preferences, leveraging machine learning models to understand both user inclinations and item attributes.\n\n1.2 Personalized Recommendation Process\n\nAohai offers a schematic overview of the recommendation process, emphasizing two pivotal modules: matching and ranking. The matching module filters items based on user preferences, while the ranking module refines these choices according to the user's specific inclinations. This efficient process ensures swift feedback within milliseconds, a critical factor in user engagement strategies.\n\nMoreover, Aohai emphasizes that a recommender system encompasses not just recommendation algorithms but also robust system engineering. "
    "Addressing performance and data storage issues is paramount, especially when deploying such a system in the cloud environment.\n\n2) Architecture of an Enterprise-level Recommender System\n\n2.1 Requirements for an Enterprise-level Recommender System\n\nAohai outlines four fundamental requirements for designing an enterprise-level recommender system. The architecture must cater to platforms with millions of users, facilitate extensive model training with large datasets, support algorithm plug-ins for flexibility, ensure high-performance feedback within milliseconds, and allow elastic resource scaling to optimize costs.\n\n2.2 Overall Architecture of a Recommender System\n\nThis section elucidates the architecture's layers, starting with the basic data layer encompassing user profiles, item data, behavior data, and comment data. Following data processing and storage, the system employs matching and ranking modules powered by various algorithms. A new policy filters and deduplicates recommendations before online deployment, culminating in the top-layer recommendation service.\n\n2.3 PAI-based Technical Architecture for a Recommender System\n\nAlibaba Cloud's technical implementation involves leveraging PAI technology. It begins with storing demo offline data in ApsaraDB RDS for MySQL, processing real-time behavior data via Apache Kafka, and using Flink for generating real-time data. Model training employs PAI algorithms, advocating cloud-native solutions for scalability. The final phase orchestrates matching, deduplication, ranking, and feedback to users, showcasing Alibaba Cloud's holistic technical architecture.\n\nThe article serves as a comprehensive guide, shedding light on the intricacies of Alibaba Cloud's advanced recommender system technology, blending cutting-edge algorithms with robust system engineering principles.\n\n"
)

In [5]:
# Combine data sets
all_docs = overall_summary + [doc.text for doc in additional_documents]
all_docs.append(alibaba_archi)

# Merge into single document to improve accuracy
document = Document(text="\n\n".join(all_docs))

### RAG Evaluation Function  
The function below takes a document and a list of evaluation questions, runs every RAG technique over the questions, and records the results for comparison.
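To make the sentence-window settings compared by this function more concrete, here is a minimal pure-Python sketch of the idea: retrieval matches on a single sentence, and the LLM then receives that sentence plus `window_size` neighbours on each side. The word-overlap score and the regex sentence splitter are stand-ins chosen for illustration; the notebook itself uses embedding similarity through its helper functions, and none of the names below come from the notebook.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

def overlap_score(query, sentence):
    """Toy relevance score: number of lowercase words shared with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentence_words = set(re.findall(r"\w+", sentence.lower()))
    return len(query_words & sentence_words)

def sentence_window_retrieve(text, query, window_size=3):
    """Match on one sentence, then return it together with its neighbours."""
    sentences = split_sentences(text)
    best = max(range(len(sentences)),
               key=lambda i: overlap_score(query, sentences[i]))
    start = max(0, best - window_size)
    end = min(len(sentences), best + window_size + 1)
    return " ".join(sentences[start:end])

text = (
    "MLOps covers deployment. A model registry stores trained models. "
    "It tracks versions and metadata. Monitoring detects drift. "
    "Retraining is triggered when drift is found."
)
context = sentence_window_retrieve(text, "What is a model registry?", window_size=1)
print(context)
```

A larger window gives the LLM more surrounding context per match at the cost of a longer prompt, which is the trade-off the window sizes 2, 3 and 4 explore.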

In [6]:
### Evaluation ###
from trulens_eval import Tru
tru = Tru()

tru.reset_database()

# A helper function that evaluates responses based on the RAG triad of metrics
from utils import get_prebuilt_trulens_recorder  

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


In [7]:
import os
from dotenv import load_dotenv
from llama_index import VectorStoreIndex
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from utils import build_sentence_window_index
from utils import get_sentence_window_query_engine
from utils import build_automerging_index
from utils import get_automerging_query_engine

### Loading OpenAI ###
load_dotenv()
client = os.environ.get("OPENAI_API_KEY")

def rag_eval(document, eval_questions):
    # load ChatGPT
    llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
    
    ##### Basic RAG #####
    service_context = ServiceContext.from_defaults(
        llm=llm, embed_model="local:BAAI/bge-small-en-v1.5"
    )
    index = VectorStoreIndex.from_documents([document], service_context=service_context)
    
    query_engine = index.as_query_engine()
    
    # Evaluate
    tru_recorder = get_prebuilt_trulens_recorder(query_engine,
                                                 app_id="Basic RAG")
    with tru_recorder as recording:
        for question in eval_questions:
            response = query_engine.query(question)
            
    ##### Sentence Window Retrieval 2 #####
    sentence_index2 = build_sentence_window_index(
        [document],
        llm,
        embed_model="local:BAAI/bge-small-en-v1.5",
        sentence_window_size=2,
        save_dir="sentence_index2"
    )

    sentence_window_engine2 = get_sentence_window_query_engine(sentence_index2)
    
    tru_recorder_sentence_window2 = get_prebuilt_trulens_recorder(
        sentence_window_engine2,
        app_id = "Sentence Window Query Engine, Size = 2"
    )

    for question in eval_questions:
        with tru_recorder_sentence_window2 as recording:
            response = sentence_window_engine2.query(question)
            
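    ##### Sentence Window Retrieval 3 #####
    # sentence_window_size is not passed here, so the helper's default applies
    # (assumed to be 3, matching the app_id below).
    sentence_index3 = build_sentence_window_index(
        [document],
        llm,
        embed_model="local:BAAI/bge-small-en-v1.5",
        save_dir="sentence_index3"
    )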

    sentence_window_engine3 = get_sentence_window_query_engine(sentence_index3)
    
    tru_recorder_sentence_window3 = get_prebuilt_trulens_recorder(
        sentence_window_engine3,
        app_id = "Sentence Window Query Engine, Size = 3"
    )

    for question in eval_questions:
        with tru_recorder_sentence_window3 as recording:
            response = sentence_window_engine3.query(question)

    ##### Sentence Window Retrieval 4 #####
    sentence_index4 = build_sentence_window_index(
        [document],
        llm,
        embed_model="local:BAAI/bge-small-en-v1.5",
        sentence_window_size=4,
        save_dir="sentence_index4"
    )

    sentence_window_engine4 = get_sentence_window_query_engine(sentence_index4)
    
    tru_recorder_sentence_window4 = get_prebuilt_trulens_recorder(
        sentence_window_engine4,
        app_id = "Sentence Window Query Engine, Size = 4"
    )

    for question in eval_questions:
        with tru_recorder_sentence_window4 as recording:
            response = sentence_window_engine4.query(question)
            
    ##### Auto-Merging Retrieval [2048, 512, 128] ##### 
    automerging_index3 = build_automerging_index(
        [document],
        llm,
        embed_model="local:BAAI/bge-small-en-v1.5",
        save_dir="merging_index3",
        chunk_sizes=[2048, 512, 128]
    )

    automerging_query_engine3 = get_automerging_query_engine(
        automerging_index3,
    )
    
    tru_recorder_automerging3 = get_prebuilt_trulens_recorder(automerging_query_engine3,
                                                         app_id="Automerging Query Engine, [2048, 512, 128]")

    for question in eval_questions:
        with tru_recorder_automerging3 as recording:
            response = automerging_query_engine3.query(question)
    
    ##### Auto-Merging Retrieval [2048, 512, 128, 50] ##### 
    automerging_index4 = build_automerging_index(
        [document],
        llm,
        embed_model="local:BAAI/bge-small-en-v1.5",
        save_dir="merging_index4",
        chunk_sizes=[2048, 512, 128, 50]  # Changed from 32 to 50 because 32 raised a warning
    ) 

    automerging_query_engine4 = get_automerging_query_engine(
        automerging_index4,
    )
    
    tru_recorder_automerging4 = get_prebuilt_trulens_recorder(automerging_query_engine4,
                                                         app_id="Automerging Query Engine, [2048, 512, 128, 50]")

    for question in eval_questions:
        with tru_recorder_automerging4 as recording:
            response = automerging_query_engine4.query(question)
            
    ######### RESULTS & EVALUATION #########
            
    # launches on http://localhost:8501/
    tru.run_dashboard()
    
    return tru

### RAG Evaluation: Summary

#### Evaluation Questions  
To evaluate RAG for Summary, get each method to summarise the 20 pages selected.

In [8]:
# Adapted story system to be a single prompt:
pre_content = "Your name is Tim. You are well known to be a spoon pretending to be a human. You will constantly bring this personality trait up when speaking. Players name is Tom. The player has come across a slide. Please provide the user with a summary of the content that is on said page. This is the content on said page: ["
post_content = "]. You will now be speaking to the player, please summarise the slide while subtly bringing up your personality."

eval_questions_summary = [pre_content + slides_summary[f"{str(slide)}.jpg"] + post_content for slide in slides]

#### Finding the Best RAG Method for Summary

In [None]:
tru = rag_eval(document, eval_questions_summary)

In [18]:
records, feedback = tru.get_records_and_feedback(app_ids=[])
leaderboard = tru.get_leaderboard(app_ids=[])

In [21]:
leaderboard.sort_values("Answer Relevance", ascending=False)

| app_id | Context Relevance | Answer Relevance | Groundedness | latency | total_cost |
|---|---|---|---|---|---|
| Basic RAG | 0.3925 | 0.96 | 0.79253 | 10.8 | 0.00403 |
| Automerging Query Engine, [2048, 512, 128] | 0.326667 | 0.945 | 0.914173 | 10.8 | 0.002745 |
| Sentence Window Query Engine, Size = 4 | 0.4475 | 0.925 | 0.569112 | 10.8 | 0.001866 |
| Sentence Window Query Engine, Size = 3 | 0.4 | 0.925 | 0.52071 | 10.8 | 0.001787 |
| Sentence Window Query Engine, Size = 2 | 0.4775 | 0.895 | 0.528063 | 10.8 | 0.001675 |
| Automerging Query Engine, [2048, 512, 128, 50] | 0.423333 | 0.885 | 0.744641 | 10.8 | 0.001705 |


#### RAG vs No RAG: Summary

Basic RAG had the highest Answer Relevance among the RAG techniques, so it is compared against No RAG.

In [89]:
# Set OpenAI as default model for our evaluation
from trulens_eval import OpenAI as fOpenAI
from trulens_eval import Feedback
from trulens_eval.feedback import Groundedness

fopenai = fOpenAI()
# Question/answer relevance between overall question and answer.
f_qa_relevance = (
    Feedback(fopenai.relevance_with_cot_reasons, name = "Answer Relevance")
    .on_input_output()
)

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .


In [90]:
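# Basic RAG, rebuilt outside rag_eval() for the head-to-head comparison.
# `llm` only exists inside rag_eval(), so it is re-created here with the same settings.
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en-v1.5"
)
index = VectorStoreIndex.from_documents([document], service_context=service_context)

query_engine = index.as_query_engine()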

In [91]:
import time
import pandas as pd

basic_rag_df = pd.DataFrame(columns = ['Page', 'Answer Relevance', 'Reason', 'Time Taken'])
for i in range(len(eval_questions_summary)):
    question = eval_questions_summary[i]
    start = time.time()
    answer = query_engine.query(question)
    time_taken = time.time() - start
    qa = f_qa_relevance(question, answer)
    if not qa:
        qa_score, qa_reason = 0, {'reason': "No answer relevance at all"}
    elif isinstance(qa, float):
        qa_score, qa_reason = qa, {'reason': "None"}
    else:
        qa_score, qa_reason = qa
    basic_rag_df.loc[len(basic_rag_df)] = ["Page " + str(slides[i]), qa_score, qa_reason['reason'], time_taken]

In [92]:
basic_rag_df

| | Page | Answer Relevance | Reason | Time Taken |
|---|---|---|---|---|
| 0 | Page 4 | 0.9 | Criteria: Relevance to the entire prompt \nSup... | 19.098638 |
| 1 | Page 6 | 1.0 | Criteria: The response provides a summary of t... | 9.311643 |
| 2 | Page 7 | 1.0 | Criteria: The response provides a summary of t... | 10.395586 |
| 3 | Page 9 | 1.0 | Criteria: The response provides a summary of t... | 7.749953 |
| 4 | Page 12 | 0.9 | Criteria: The response provides a summary of t... | 8.90644 |
| 5 | Page 15 | 1.0 | Criteria: Relevance to the entire prompt \nSup... | 22.396969 |
| 6 | Page 17 | 0.9 | Criteria: The response provides a summary of t... | 11.511535 |
| 7 | Page 19 | 1.0 | Criteria: The response provides a summary of t... | 9.730029 |
| 8 | Page 21 | 1.0 | Criteria: The response is relevant to the enti... | 8.620709 |
| 9 | Page 26 | 1.0 | Criteria: Relevance to the entire prompt \nSup... | 11.079841 |


In [93]:
from openai import OpenAI as OAI
def ask_chatgpt(query, key):
    client = OAI(api_key = key)
    messages = [
                {"role": "user",
                 "content": query},
            ]
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature = 0.1
    )
    
    return completion.choices[0].message.content, completion.usage
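The `usage` object returned above reports prompt and completion token counts separately, and the two are typically billed at different rates. The following is a minimal sketch of turning those counts into a cost estimate. The function name and the rates are placeholders for illustration only; the current price list for the chosen model should be checked before relying on any figure.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_rate_per_1k, completion_rate_per_1k):
    """Estimated cost in dollars, given per-1K-token rates for each token type."""
    return (prompt_tokens * prompt_rate_per_1k
            + completion_tokens * completion_rate_per_1k) / 1000

# Placeholder rates in dollars per 1K tokens (assumptions, not quoted prices).
PROMPT_RATE = 0.0005
COMPLETION_RATE = 0.0015

example_cost = estimate_cost(1200, 300, PROMPT_RATE, COMPLETION_RATE)
print(f"{example_cost:.5f}")
```

Keeping the two rates as named parameters makes it harder to pair a token count with the wrong rate.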

In [94]:
no_rag_df = pd.DataFrame(columns = ['Page', 'Answer Relevance', 'Reason', 'Time Taken'])
cost = 0
for i in range(len(eval_questions_summary)):
    question = eval_questions_summary[i]
    start = time.time()
    answer, usage = ask_chatgpt(question, client)
    time_taken = time.time() - start
    qa = f_qa_relevance(question, answer)
    if not qa:
        qa_score, qa_reason = 0, {'reason': "No answer relevance at all"}
    elif isinstance(qa, float):
        qa_score, qa_reason = qa, {'reason': "None"}
    else:
        qa_score, qa_reason = qa
        
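    # Prompt (input) and completion (output) tokens are billed at different rates.
    # Assumes gpt-3.5-turbo pricing of $0.0005 / 1K prompt and $0.0015 / 1K completion tokens.
    cost += usage.prompt_tokens * 0.0005 / 1000 + usage.completion_tokens * 0.0015 / 1000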
    no_rag_df.loc[len(no_rag_df)] = ["Page " + str(slides[i]), qa_score, qa_reason['reason'], time_taken]
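The Basic RAG and No RAG loops both unpack the feedback result in the same three-way branch. A minimal sketch of factoring that branch into one helper is shown below; `unpack_feedback` is a hypothetical name, and the three result shapes (empty, bare float, `(score, {'reason': ...})` pair) are inferred from the loops above rather than from the feedback library's documentation.

```python
def unpack_feedback(result):
    """Normalise a feedback result into a (score, reason) pair."""
    if not result:
        # Covers None, 0.0 and empty results, as in the original loops.
        return 0, "No answer relevance at all"
    if isinstance(result, float):
        # A bare score with no explanation attached.
        return result, "None"
    score, meta = result
    return score, meta["reason"]

print(unpack_feedback((0.9, {"reason": "Criteria: relevant to the prompt"})))
```

With this helper, each loop body would reduce to `qa_score, reason = unpack_feedback(f_qa_relevance(question, answer))`.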

In [99]:
no_rag_df

| | Page | Answer Relevance | Reason | Time Taken |
|---|---|---|---|---|
| 0 | Page 4 | 1.0 | Criteria: The response provides a summary of t... | 17.271841 |
| 1 | Page 6 | 1.0 | Criteria: Relevance to the entire prompt \nSup... | 15.334505 |
| 2 | Page 7 | 1.0 | Criteria: The response provides a summary of t... | 20.47658 |
| 3 | Page 9 | 1.0 | Criteria: The response is relevant to the enti... | 16.721895 |
| 4 | Page 12 | 1.0 | Criteria: The response is relevant to the enti... | 14.858085 |
| 5 | Page 15 | 1.0 | Criteria: Relevance to the entire prompt \nSup... | 25.317242 |
| 6 | Page 17 | 1.0 | Criteria: The response provides a summary of t... | 15.86457 |
| 7 | Page 19 | 1.0 | Criteria: The response provides a summary of t... | 20.712555 |
| 8 | Page 21 | 0.9 | Criteria: The response provides a summary of t... | 12.993729 |
| 9 | Page 26 | 1.0 | Criteria: The response provides a summary of t... | 15.499983 |


In [103]:
novsbasic = pd.DataFrame(columns=["Model", "Answer Relevance", "Time Taken"])
novsbasic.loc[0] = ["No RAG",
                    sum(no_rag_df["Answer Relevance"])/len(no_rag_df["Answer Relevance"]),
                    sum(no_rag_df["Time Taken"])/len(no_rag_df["Time Taken"])
                   ]
novsbasic.loc[1] = ["Basic RAG",
                    sum(basic_rag_df["Answer Relevance"])/len(basic_rag_df["Answer Relevance"]),
                    sum(basic_rag_df["Time Taken"])/len(basic_rag_df["Time Taken"])
                   ]

In [104]:
novsbasic

| | Model | Answer Relevance | Time Taken |
|---|---|---|---|
| 0 | No RAG | 0.99 | 15.846791 |
| 1 | Basic RAG | 0.945 | 10.88311 |
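To put the two rows in perspective, the gap can be worked out directly from the averages shown. This is a small illustrative calculation on the reported numbers, not part of the original evaluation; it suggests Basic RAG gives up roughly 0.045 Answer Relevance while taking about 31% less time per query.

```python
# Averages copied from the No RAG vs Basic RAG table above.
no_rag = {"relevance": 0.99, "seconds": 15.846791}
basic_rag = {"relevance": 0.945, "seconds": 10.88311}

# Absolute drop in Answer Relevance when switching to Basic RAG.
relevance_gap = no_rag["relevance"] - basic_rag["relevance"]

# Fraction of the No RAG response time that Basic RAG saves.
time_saving = (no_rag["seconds"] - basic_rag["seconds"]) / no_rag["seconds"]

print(f"Relevance gap: {relevance_gap:.3f}")
print(f"Relative time saving: {time_saving:.1%}")
```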


In [108]:
overall_summary_df = leaderboard.sort_values("Answer Relevance", ascending=False)
overall_summary_df = overall_summary_df.reset_index()

In [110]:
overall_summary_df.loc[6] = ["No RAG", None, novsbasic['Answer Relevance'][0], None, novsbasic['Time Taken'][0], None]

In [114]:
overall_summary_df.sort_values("Answer Relevance", ascending=False)

| | app_id | Context Relevance | Answer Relevance | Groundedness | latency | total_cost |
|---|---|---|---|---|---|---|
| 6 | No RAG | | 0.99 | | 15.846791 | |
| 0 | Basic RAG | 0.3925 | 0.96 | 0.79253 | 10.8 | 0.00403 |
| 1 | Automerging Query Engine, [2048, 512, 128] | 0.326667 | 0.945 | 0.914173 | 10.8 | 0.002745 |
| 2 | Sentence Window Query Engine, Size = 4 | 0.4475 | 0.925 | 0.569112 | 10.8 | 0.001866 |
| 3 | Sentence Window Query Engine, Size = 3 | 0.4 | 0.925 | 0.52071 | 10.8 | 0.001787 |
| 4 | Sentence Window Query Engine, Size = 2 | 0.4775 | 0.895 | 0.528063 | 10.8 | 0.001675 |
| 5 | Automerging Query Engine, [2048, 512, 128, 50] | 0.423333 | 0.885 | 0.744641 | 10.8 | 0.001705 |


### RAG Evaluation: Discussion

#### Evaluation Questions  
To evaluate RAG for Discussion, get each method to answer one question for each of the 20 slides selected, using the `questions` defined earlier.

In [8]:
system = "You are a game character speaking to the player. Here are the details you should know: Your name is Tim. You are to channel your inner A spoon pretending to be a human. Keep it subtle and let this personality trait flow seamlessly into your response. Players name is Tom. The player has asked you the following question on the topic of model deployment: "

eval_questions_discussion = [system + questions[slide] for slide in slides]

#### Evaluate

In [10]:
tru = rag_eval(document, eval_questions_discussion)

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .

A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219a46b7ed0 is calling an instrumented method <function BaseQueryEngine.query at 0x00000219FF06FB00>. The path of this call may be incorrect.
Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219a46b7ed0 is calling an instrumented method <function RetrieverQueryEngine.retrieve at 0x0000021984A9DD00>. The path of this call may be incorrect.
Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.indices.vector_store.retrievers.retriever.VectorIndexRetriever'> at 0x219a46b7f50 is calling an instrumented method <function BaseRetriever.retrieve at 0x00000219FF06E980>. The path of this call may be incorrect.
Guessing path of new object is app.retriever based on ot

Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a46b7f10 is calling an instrumented method <function CompactAndRefine.get_response at 0x00000219FF0B6660>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a46b7f10 is calling an instrumented method <function Refine.get_response at 0x00000219FFA140E0>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219a46b7ed0 is calling an instrumented method <function RetrieverQu

Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219a46b7ed0 is calling an instrumented method <function RetrieverQueryEngine.retrieve at 0x0000021984A9DD00>. The path of this call may be incorrect.
Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a46b7f10 is calling an instrumented method <function CompactAndRefine.get_response at 0x00000219FF0B6660>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a46b7f10 is calling an instrumented method <function R

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219baba9510 is calling an instrumented method <function BaseQueryEngine.query at 0x00000219FF06FB00>. The path of this call may be incorrect.
Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219baba9510 is calling an instrumented method <function RetrieverQueryEngine.retrieve at 0x0000021984A9DD00>. The path of this call may be incorrect.
Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.indices.vector_store.retrievers.retriever.VectorIndexRetriever'> at 0x219baba9310 is calling an instrumented method <function BaseRetriever.retrieve at 0x00000219FF06E980>. The path of this call may be incorrect.
Guessing path of new object is app.retriever based on ot

Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219babaab90 is calling an instrumented method <function CompactAndRefine.get_response at 0x00000219FF0B6660>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219babaab90 is calling an instrumented method <function Refine.get_response at 0x00000219FFA140E0>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219baba9510 is calling an instrumented method <function RetrieverQu

Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219baba9510 is calling an instrumented method <function RetrieverQueryEngine.retrieve at 0x0000021984A9DD00>. The path of this call may be incorrect.
Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219babaab90 is calling an instrumented method <function CompactAndRefine.get_response at 0x00000219FF0B6660>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219babaab90 is calling an instrumented method <function R

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219a0f36690 is calling an instrumented method <function BaseQueryEngine.query at 0x00000219FF06FB00>. The path of this call may be incorrect.
Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219a0f36690 is calling an instrumented method <function RetrieverQueryEngine.retrieve at 0x0000021984A9DD00>. The path of this call may be incorrect.
Guessing path of new object is app based on other object (0x2199bf1b610) using this function.
A new object of type <class 'llama_index.retrievers.auto_merging_retriever.AutoMergingRetriever'> at 0x219a11dd110 is calling an instrumented method <function BaseRetriever.retrieve at 0x00000219FF06E980>. The path of this call may be incorrect.
Guessing path of new object is app.retriever based on other obje

> Merging 2 nodes into parent node.
> Parent node id: 3c491f0d-e54c-42ce-abca-cf987967a63d.
> Parent node text: Exactly spoken there are as many models as users exist, in ad‐
dition to the one that’s held on a...



A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a0f35490 is calling an instrumented method <function BaseSynthesizer.synthesize at 0x00000219FF0B6020>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a0f35490 is calling an instrumented method <function CompactAndRefine.get_response at 0x00000219FF0B6660>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a0f35490 is calling an instrumented method <function Refine.get_response at 0x00000219FFA140E0>. The path of this call may be incorrect.
Guessing path

> Merging 1 nodes into parent node.
> Parent node id: 8b6e6fef-c16e-45b5-9065-29a29e35da32.
> Parent node text: This is to show how a more practical example of a Data Science Architecture that can be use in re...



A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a0f35490 is calling an instrumented method <function CompactAndRefine.get_response at 0x00000219FF0B6660>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a0f35490 is calling an instrumented method <function Refine.get_response at 0x00000219FFA140E0>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219a0f36690 is calling an instrumented method <function RetrieverQueryEngine.retrieve at 0x0000021984A9DD00>. The path of this call may be incorrect.
Guessing pa

> Merging 2 nodes into parent node.
> Parent node id: 3c491f0d-e54c-42ce-abca-cf987967a63d.
> Parent node text: Exactly spoken there are as many models as users exist, in ad‐
dition to the one that’s held on a...



A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a0f35490 is calling an instrumented method <function CompactAndRefine.get_response at 0x00000219FF0B6660>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a0f35490 is calling an instrumented method <function Refine.get_response at 0x00000219FFA140E0>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219a0f36690 is calling an instrumented method <function RetrieverQueryEngine.retrieve at 0x0000021984A9DD00>. The path of this call may be incorrect.
Guessing pa

> Merging 3 nodes into parent node.
> Parent node id: a08be488-03bb-40ee-a8c3-0cb2ff0b3bcd.
> Parent node text: Ethical and Fairness considerations: Assess fairness and ethical considerations in model predicti...



A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a0f35490 is calling an instrumented method <function CompactAndRefine.get_response at 0x00000219FF0B6660>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.response_synthesizers.compact_and_refine.CompactAndRefine'> at 0x219a0f35490 is calling an instrumented method <function Refine.get_response at 0x00000219FFA140E0>. The path of this call may be incorrect.
Guessing path of new object is app._response_synthesizer based on other object (0x2199be49690) using this function.
A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x219a0f36690 is calling an instrumented method <function RetrieverQueryEngine.retrieve at 0x0000021984A9DD00>. The path of this call may be incorrect.
Guessing pa

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
> Merging 2 nodes into parent node.
> Parent node id: 72e3ae8a-79cb-4db1-814f-08b0359965b8.
> Parent node text: Exactly spoken there are as many models as users exist, in ad‐
dition to the one that’s held on a...


Dashboard started at http://localhost:8501 .


In [12]:
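# An empty app_ids list selects every app recorded so far.
records, feedback = tru.get_records_and_feedback(app_ids=[])
# Aggregate the feedback scores, latency and cost per app.
leaderboard = tru.get_leaderboard(app_ids=[])
# Show the cheapest configuration first.
leaderboard.sort_values("total_cost")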

| app_id | Groundedness | Context Relevance | Answer Relevance | latency | total_cost |
|---|---|---|---|---|---|
| Automerging Query Engine, [2048, 512, 128, 50] | 0.471958 | 0.3875 | 0.97 | 11.55 | 0.000965 |
| Sentence Window Query Engine, Size = 2 | 0.379474 | 0.39 | 0.935 | 11.55 | 0.001003 |
| Sentence Window Query Engine, Size = 3 | 0.395716 | 0.3725 | 0.985 | 11.55 | 0.001155 |
| Sentence Window Query Engine, Size = 4 | 0.299032 | 0.355 | 0.985 | 11.55 | 0.001485 |
| Automerging Query Engine, [2048, 512, 128] | 0.483751 | 0.394167 | 0.975 | 11.55 | 0.001789 |
| Basic RAG | 0.49925 | 0.2625 | 0.99 | 11.55 | 0.003685 |
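
The leaderboard is sorted by cost only, so it does not show directly which configuration scores best across the three quality metrics. One way to compare them is sketched below. It assumes an unweighted mean of Groundedness, Context Relevance and Answer Relevance as a composite score. That composite and the names `leaderboard_rows` and `mean_quality` are illustrative assumptions and not part of TruLens; the numbers are copied from the leaderboard output.

```python
# Leaderboard values copied from the output above, per app:
# (groundedness, context relevance, answer relevance, total cost)
leaderboard_rows = {
    "Automerging Query Engine, [2048, 512, 128, 50]": (0.471958, 0.3875, 0.97, 0.000965),
    "Sentence Window Query Engine, Size = 2": (0.379474, 0.39, 0.935, 0.001003),
    "Sentence Window Query Engine, Size = 3": (0.395716, 0.3725, 0.985, 0.001155),
    "Sentence Window Query Engine, Size = 4": (0.299032, 0.355, 0.985, 0.001485),
    "Automerging Query Engine, [2048, 512, 128]": (0.483751, 0.394167, 0.975, 0.001789),
    "Basic RAG": (0.49925, 0.2625, 0.99, 0.003685),
}


def mean_quality(row):
    """Unweighted mean of the three quality metrics.

    This composite is an assumption made for illustration; a different
    weighting may suit the game better.
    """
    groundedness, context_relevance, answer_relevance, _cost = row
    return (groundedness + context_relevance + answer_relevance) / 3


# Rank app names from highest to lowest composite score.
ranked = sorted(
    leaderboard_rows,
    key=lambda name: mean_quality(leaderboard_rows[name]),
    reverse=True,
)

for name in ranked:
    cost = leaderboard_rows[name][3]
    score = mean_quality(leaderboard_rows[name])
    print(f"{name:<50} mean quality = {score:.3f}  total cost = {cost:.6f}")
```

Under this equal weighting the three-level automerging engine ranks first and the sentence window engine with size 4 ranks last. A different weighting, for example one that emphasises Groundedness, could change the order.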
