## RAG app TFM v4
This use case shows how to extract information from a PDF file by querying an LLM with RAG (Retrieval-Augmented Generation). It requires a vector database (FAISS in this use case), embeddings, and calls to OpenAI models. To present the final result, the model is embedded in a Gradio UI. In this version, LlamaIndex is used as the vector store.

This notebook evaluates version v4 of the RAG system on the following metrics:

* Answer relevance
* Context Relevance
* Groundedness

For this evaluation, a set of questions was written, each with a reason for its inclusion, so that together they cover the different capabilities a RAG system needs in order to function correctly.
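The three metrics listed above each compare a different pair among the query, the retrieved context, and the generated response. The sketch below only illustrates that pairing. The `RagRecord` class and the word-overlap `overlap_score` function are hypothetical stand-ins introduced here; they are not how TruLens computes its scores, which relies on an LLM to judge each pair.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RagRecord:
    """Hypothetical container for one RAG interaction."""
    query: str
    contexts: list[str]  # retrieved chunks
    response: str


def _words(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation removed."""
    cleaned = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in cleaned if w}


def overlap_score(a: str, b: str) -> float:
    """Toy stand-in for an LLM judge: share of the words in `a` that occur in `b`."""
    words_a, words_b = _words(a), _words(b)
    return len(words_a & words_b) / len(words_a) if words_a else 0.0


def triad(record: RagRecord) -> dict[str, float]:
    """Show which pair of fields each metric compares."""
    per_chunk = [overlap_score(record.query, c) for c in record.contexts]
    return {
        # query vs. response: does the answer address the question?
        "answer_relevance": overlap_score(record.query, record.response),
        # query vs. each retrieved chunk, averaged over the chunks
        "context_relevance": sum(per_chunk) / len(per_chunk) if per_chunk else 0.0,
        # response vs. retrieved context: is the answer supported by the sources?
        "groundedness": overlap_score(record.response, " ".join(record.contexts)),
    }


record = RagRecord(
    query="What is the Transformer?",
    contexts=["The Transformer is a model based on attention."],
    response="The Transformer is an attention model.",
)
scores = triad(record)
print(scores)
```

Word overlap is a crude proxy chosen only so the example runs without API calls; the point is the pairing of inputs, not the scoring method.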

## Model Evaluation using LlamaIndex

In [4]:
from dotenv import load_dotenv
import os


from llama_index import SimpleDirectoryReader
from llama_index import Document
from llama_index import VectorStoreIndex
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding

load_dotenv("apis.env")

True

In [5]:
eval_questions = []
with open('eval_questions.txt', 'r') as file:
    for line in file:
        # Strip the trailing newline and any surrounding whitespace
        item = line.strip()
        print(item)
        eval_questions.append(item)

What is the main theme or topic of this document?
Can you summarize the key points made in this document?
What are the primary arguments or claims presented?
Are there any notable facts or figures mentioned? If so, what are they?
Can you provide an example or case study mentioned in the document?
What conclusions are drawn in this document?
Are there any contradictions or points of debate within the document?
What background or contextual information is provided?
What are the limitations or gaps identified in the document?
What recommendations or next steps does the document propose?
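The loading loop above appends every line, including any blank ones. A minimal sketch of a reusable variant that skips empty lines is shown below; `load_questions` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path


def load_questions(path: str) -> list[str]:
    """Return one question per non-empty line, with surrounding whitespace removed."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

It could then replace the loop with `eval_questions = load_questions("eval_questions.txt")`.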


In [6]:
from trulens_eval import Tru
tru = Tru()

tru.reset_database()

### 1. Basic RAG

In [7]:
pdf = "./attention-is-all-you-need.pdf"

temperature = 0.1
max_tokens = 512

def evaluation_responses(pdf, temperature=0.2, max_tokens=128):
    """Build a basic RAG query engine over a single PDF file."""
    # Load the PDF and merge its pages into one document
    documents = SimpleDirectoryReader(
        input_files=[pdf]
    ).load_data()
    document = Document(text="\n\n".join([doc.text for doc in documents]))

    # Embedding model used to index the document and embed the queries
    embed_model = OpenAIEmbedding(model="text-embedding-ada-002", embed_batch_size=10)

    # LLM that generates the final answers
    llm = OpenAI(
        model="gpt-3.5-turbo-instruct",
        temperature=temperature,
        max_tokens=max_tokens,
        streaming=True)

    service_context = ServiceContext.from_defaults(
        llm=llm, embed_model=embed_model
    )
    # Build the vector index and expose it as a query engine
    index = VectorStoreIndex.from_documents([document],
                                            service_context=service_context)
    query_engine = index.as_query_engine()
    return query_engine

query_engine = evaluation_responses(pdf, temperature=temperature, max_tokens=max_tokens)
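Conceptually, the query engine answers a question by embedding it, retrieving the most similar chunks from the index, and passing those chunks to the LLM as context. The sketch below illustrates only the retrieval step, using a toy bag-of-words vector and cosine similarity. `embed`, `cosine`, and `retrieve` are hypothetical helpers, not LlamaIndex's implementation, which in this notebook relies on OpenAI embeddings.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system uses dense vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the `top_k` chunks most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


chunks = [
    "the transformer relies entirely on attention mechanisms",
    "we trained the models on eight gpus",
    "attention maps a query and key-value pairs to an output",
]
top = retrieve("how does attention work", chunks, top_k=2)
for score, chunk in top:
    print(f"{score:.3f}  {chunk}")
```

The retrieved chunks are what the Context Relevance and Groundedness feedback functions later inspect, so poor retrieval shows up directly in those two scores.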


In [9]:
# from utils import get_prebuilt_trulens_recorder
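from trulens_eval import (
    Feedback,
    TruLlama,
    OpenAI as fOpenAI,  # aliased so it does not shadow llama_index.llms.OpenAI
)
from trulens_eval.feedback import Groundedness

import numpy as np

# TruLens feedback provider: uses OpenAI models to score each record
openai = fOpenAI()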

qa_relevance = (
    Feedback(openai.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)

qs_relevance = (
    Feedback(openai.relevance_with_cot_reasons, name = "Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)

# grounded = Groundedness(groundedness_provider=openai, summarize_provider=openai)
grounded = Groundedness(groundedness_provider=openai)

groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
        .on(TruLlama.select_source_nodes().node.text)
        .on_output()
        .aggregate(grounded.grounded_statements_aggregator)
)

feedbacks = [qa_relevance, qs_relevance, groundedness]

def get_prebuilt_trulens_recorder(query_engine, app_id):
    tru_recorder = TruLlama(
        query_engine,
        app_id=app_id,
        feedbacks=feedbacks
        )
    return tru_recorder


✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
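The `.aggregate(np.mean)` step in the Context Relevance feedback is needed because the selector `source_nodes[:].node.text` yields one text per retrieved chunk, so the feedback function returns one score per chunk, and those scores are averaged into a single value per record. A minimal sketch with made-up per-chunk scores follows; the numbers are illustrative and not taken from this run, and `aggregate_record_score` is a hypothetical helper that uses the standard library in place of NumPy.

```python
from statistics import fmean  # stdlib mean; matches np.mean for a flat list of floats


def aggregate_record_score(chunk_scores: list[float]) -> float:
    """Average per-chunk scores into one score for the record.

    Returning 0.0 for an empty list is a choice made in this sketch only.
    """
    return fmean(chunk_scores) if chunk_scores else 0.0


# Hypothetical judge scores for two retrieved chunks of one query:
# the first chunk is on topic, the second is not.
chunk_scores = [1.0, 0.0]
record_score = aggregate_record_score(chunk_scores)
print(record_score)
```

A single off-topic chunk therefore halves the score of an otherwise relevant retrieval, which is worth keeping in mind when reading the Context Relevance column.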


In [10]:
# Basic RAG Pipeline
tru_recorder = get_prebuilt_trulens_recorder(query_engine,
                                             app_id="Basic RAG")

In [11]:
emo = "\U00002705"
with tru_recorder as recording:
    for question in eval_questions:
        response = query_engine.query(question)
        print(emo+question)


✅What is the main theme or topic of this document?
✅Can you summarize the key points made in this document?
✅What are the primary arguments or claims presented?
✅Are there any notable facts or figures mentioned? If so, what are they?
✅Can you provide an example or case study mentioned in the document?
✅What conclusions are drawn in this document?
✅Are there any contradictions or points of debate within the document?
✅What background or contextual information is provided?
✅What are the limitations or gaps identified in the document?
✅What recommendations or next steps does the document propose?


In [12]:
records, feedback = tru.get_records_and_feedback(app_ids=[])

In [13]:
records


Unnamed: 0,app_id,app_json,type,record_id,input,output,tags,record_json,cost_json,perf_json,ts,Answer Relevance,Context Relevance,Groundedness,Answer Relevance_calls,Context Relevance_calls,Groundedness_calls,latency,total_tokens,total_cost
0,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_7edff4658b1d94d08a72006bcb9ad23b,"""What is the main theme or topic of this docum...","""\nThe main theme or topic of this document is...",-,"{""record_id"": ""record_hash_7edff4658b1d94d08a7...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:27:49.895158"", ""...",2024-01-14T18:27:56.168035,1.0,0.5,1.0,[{'args': {'prompt': 'What is the main theme o...,[{'args': {'prompt': 'What is the main theme o...,"[{'args': {'source': 'For translation tasks, t...",6,2039,0.00307
1,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_688549a64c6878e79a1cb75508b4643f,"""Can you summarize the key points made in this...","""\nThe document discusses the Transformer, a s...",-,"{""record_id"": ""record_hash_688549a64c6878e79a1...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:27:56.854509"", ""...",2024-01-14T18:28:03.515761,1.0,0.5,0.8,[{'args': {'prompt': 'Can you summarize the ke...,[{'args': {'prompt': 'Can you summarize the ke...,[{'args': {'source': 'Table 2 summarizes our r...,6,2093,0.003159
2,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_681fecec01ba72f3f97c55946f53036f,"""What are the primary arguments or claims pres...","""\nThe primary arguments or claims presented a...",-,"{""record_id"": ""record_hash_681fecec01ba72f3f97...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:28:04.197017"", ""...",2024-01-14T18:28:10.476767,1.0,0.0,1.0,[{'args': {'prompt': 'What are the primary arg...,[{'args': {'prompt': 'What are the primary arg...,[{'args': {'source': 'Table 2 summarizes our r...,6,2092,0.003162
3,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_715031da3db5ea248ced50097836c18f,"""Are there any notable facts or figures mentio...","""\nYes, there are several notable facts and fi...",-,"{""record_id"": ""record_hash_715031da3db5ea248ce...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:28:11.134134"", ""...",2024-01-14T18:28:17.329805,1.0,1.0,0.833333,[{'args': {'prompt': 'Are there any notable fa...,[{'args': {'prompt': 'Are there any notable fa...,[{'args': {'source': 'Table 2 summarizes our r...,6,2103,0.003169
4,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_f6fcd18ca17a6f7bbf895ce9a2456eb8,"""Can you provide an example or case study ment...","""On both WMT 2014 English-to-German and WMT 20...",-,"{""record_id"": ""record_hash_f6fcd18ca17a6f7bbf8...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:28:18.247210"", ""...",2024-01-14T18:28:24.008119,0.0,0.0,1.0,[{'args': {'prompt': 'Can you provide an examp...,[{'args': {'prompt': 'Can you provide an examp...,[{'args': {'source': '[12] Sepp Hochreiter and...,5,2026,0.003042
5,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_c1a51b17711ecbd725ac37858e654bbd,"""What conclusions are drawn in this document?""","""\nThe document concludes that the Transformer...",-,"{""record_id"": ""record_hash_c1a51b17711ecbd725a...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:28:24.763494"", ""...",2024-01-14T18:28:30.659220,1.0,0.6,1.0,[{'args': {'prompt': 'What conclusions are dra...,[{'args': {'prompt': 'What conclusions are dra...,[{'args': {'source': 'Table 2 summarizes our r...,5,2074,0.003129
6,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_037a06e283717c38d18e3e62778ec0b9,"""Are there any contradictions or points of deb...","""\nNo, there are no contradictions or points o...",-,"{""record_id"": ""record_hash_037a06e283717c38d18...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:28:31.443113"", ""...",2024-01-14T18:28:36.970158,1.0,0.0,0.0,[{'args': {'prompt': 'Are there any contradict...,[{'args': {'prompt': 'Are there any contradict...,[{'args': {'source': 'Table 2 summarizes our r...,5,2061,0.003095
7,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_56ea9e7797aff13d79b176dd5172ae09,"""What background or contextual information is ...","""\nThe context information provided includes r...",-,"{""record_id"": ""record_hash_56ea9e7797aff13d79b...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:28:37.635462"", ""...",2024-01-14T18:28:43.634359,1.0,0.95,0.636364,[{'args': {'prompt': 'What background or conte...,[{'args': {'prompt': 'What background or conte...,[{'args': {'source': '[12] Sepp Hochreiter and...,5,2026,0.003054
8,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_8e5d3ae224e3f96b1b08dd98d1b6f7f1,"""What are the limitations or gaps identified i...","""\nThe document does not mention any specific ...",-,"{""record_id"": ""record_hash_8e5d3ae224e3f96b1b0...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:28:44.251324"", ""...",2024-01-14T18:28:49.715763,0.4,0.0,0.5,[{'args': {'prompt': 'What are the limitations...,[{'args': {'prompt': 'What are the limitations...,[{'args': {'source': 'Table 2 summarizes our r...,5,2054,0.003081
9,Basic RAG,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_aad48c2926fd98401b712c9345fe1eed,"""What recommendations or next steps does the d...","""We plan to extend the Transformer to problems...",-,"{""record_id"": ""record_hash_aad48c2926fd98401b7...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2024-01-14T18:28:50.536315"", ""...",2024-01-14T18:28:56.220076,1.0,0.1,1.0,[{'args': {'prompt': 'What recommendations or ...,[{'args': {'prompt': 'What recommendations or ...,"[{'args': {'source': 'For translation tasks, t...",5,2053,0.00309
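To compare the three metrics at a glance, the per-record scores can be averaged. The sketch below copies the values from the table above into plain Python lists, so it runs without the TruLens database; in the notebook itself, the same means could be computed directly from the `records` DataFrame.

```python
from statistics import fmean

# Scores copied from the records table above, in question order.
scores = {
    "Answer Relevance": [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.4, 1.0],
    "Context Relevance": [0.5, 0.5, 0.0, 1.0, 0.0, 0.6, 0.0, 0.95, 0.0, 0.1],
    "Groundedness": [1.0, 0.8, 1.0, 0.833333, 1.0, 1.0, 0.0, 0.636364, 0.5, 1.0],
}
total_tokens = [2039, 2093, 2092, 2103, 2026, 2074, 2061, 2026, 2054, 2053]

# Mean of each metric over the ten evaluation questions
summary = {name: round(fmean(values), 3) for name, values in scores.items()}
for name, value in summary.items():
    print(f"{name}: {value}")
print("Total tokens:", sum(total_tokens))
```

With these values, Context Relevance has the lowest mean (about 0.37), against roughly 0.84 for Answer Relevance and 0.78 for Groundedness, which suggests that retrieval is the weakest stage of the basic pipeline.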


In [15]:
# launches on http://localhost:8501/
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.



In [None]:
import pandas as pd

# Creating a refined DataFrame with the 10 questions and their corresponding reasons
refined_questions_and_reasons = pd.DataFrame({
    "Question": [
        "What is the main theme or topic of this document?",
        "Can you summarize the key points made in this document?",
        "What are the primary arguments or claims presented?",
        "Are there any notable facts or figures mentioned? If so, what are they?",
        "Can you provide an example or case study mentioned in the document?",
        "What conclusions are drawn in this document?",
        "Are there any contradictions or points of debate within the document?",
        "What background or contextual information is provided?",
        "What are the limitations or gaps identified in the document?",
        "What recommendations or next steps does the document propose?"
    ],
    "Reason": [
        "Tests the system's ability to identify the central subject or theme.",
        "Evaluates the system's summarization skills and understanding of major points.",
        "Checks the system's ability to identify and articulate main arguments or claims.",
        "Tests the system's ability to pick out and relay specific data points.",
        "Evaluates how well the system can extract and present examples or case studies.",
        "Tests understanding of the document’s conclusions or final thoughts.",
        "Checks the system's ability to identify conflicting information or areas of contention.",
        "Evaluates the system's recognition of context-setting information.",
        "Assesses the system's ability to recognize acknowledged limitations or gaps.",
        "Tests the system's comprehension of proposed future actions or recommendations."
    ]
})

refined_questions_and_reasons.head(10)  # Displaying the DataFrame
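One way to use the question and reason table is to pair it with the scores and list the capabilities where the pipeline underperforms. The sketch below does this in plain Python for the Context Relevance column. The 0.5 threshold and the `weak_spots` name are arbitrary choices for illustration, and the scores are copied from the records output earlier in this notebook.

```python
# Reasons copied from the DataFrame above, in question order.
reasons = [
    "Tests the system's ability to identify the central subject or theme.",
    "Evaluates the system's summarization skills and understanding of major points.",
    "Checks the system's ability to identify and articulate main arguments or claims.",
    "Tests the system's ability to pick out and relay specific data points.",
    "Evaluates how well the system can extract and present examples or case studies.",
    "Tests understanding of the document's conclusions or final thoughts.",
    "Checks the system's ability to identify conflicting information or areas of contention.",
    "Evaluates the system's recognition of context-setting information.",
    "Assesses the system's ability to recognize acknowledged limitations or gaps.",
    "Tests the system's comprehension of proposed future actions or recommendations.",
]

# Context Relevance per question, copied from the records output.
context_relevance = [0.5, 0.5, 0.0, 1.0, 0.0, 0.6, 0.0, 0.95, 0.0, 0.1]

THRESHOLD = 0.5  # arbitrary cut-off chosen for this illustration

# Keep only the capabilities whose score falls below the threshold
weak_spots = [
    (score, reason)
    for score, reason in zip(context_relevance, reasons)
    if score < THRESHOLD
]

for score, reason in sorted(weak_spots, key=lambda pair: pair[0]):
    print(f"{score:.2f}  {reason}")
```

The same filter can be repeated for Answer Relevance and Groundedness to see whether a weak result originates in retrieval or in generation.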
