# Assignment: Calculating and Reporting Metrics of the RAG Pipeline
----
* Name: Krishnakanth Naik Jarapala
* NUID: 002724795
---- 
In this quickstart guide, you will learn to:
- Create a RAG (Retrieval-Augmented Generation) model from scratch.
- Log the model’s performance.
- Obtain feedback on responses generated by an LLM (large language model).
- Evaluate the model with the "hallucination triad": groundedness, context relevance, and answer relevance.
- Optimize the RAG to improve its efficiency.

In [None]:
# ! pip install trulens_eval chromadb openai llama-index

In [1]:
import os

# Please use your own OpenAI API Key
os.environ["OPENAI_API_KEY"] = ""

## Load Sample Data

In this case, we'll just initialize some simple text in the notebook to keep OpenAI API costs low.

In [2]:
info_1 = """
The University of Washington, established in 1861 in Seattle, is a public research university with over 45,000 students across three campuses: 
Seattle, Tacoma, and Bothell. As the flagship institution of Washington's six public universities, UW comprises more than 500 buildings and 
20 million square feet of space, including one of the largest library systems globally.
"""

info_2 = """
Washington State University (WSU), founded in 1890, is a public research university located in Pullman, Washington. With multiple campuses statewide, 
it is the second largest institution of higher education in the state. WSU is renowned for its programs in veterinary medicine, agriculture, engineering,
architecture, and pharmacy.
"""

info_3 = """
Seattle, located on Puget Sound in the Pacific Northwest, is surrounded by water, mountains, and evergreen forests, featuring thousands of acres of 
parkland. The city is a hub for the tech industry, with Microsoft and Amazon headquartered in its metropolitan area. Its most iconic landmark is the 
futuristic Space Needle, a legacy of the 1962 World's Fair.
"""

info_4 = """
Starbucks Corporation, an American multinational chain of coffeehouses and roastery reserves, is headquartered in Seattle, Washington. As the world's
largest coffeehouse chain, Starbucks represents the United States' second wave of coffee culture.
"""


## Create Vector Store

Create a chromadb vector store in memory.

In [3]:
import os
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

embedding_function = OpenAIEmbeddingFunction(api_key=os.environ.get('OPENAI_API_KEY'),
                                             model_name="text-embedding-ada-002")


chroma_client = chromadb.Client()
vector_store = chroma_client.get_or_create_collection(name="sample-data-to-evaluate",
                                                      embedding_function=embedding_function)

In [4]:
#### Populate the vector store.
vector_store.add("info_1", documents=info_1)
vector_store.add("info_2", documents=info_2)
vector_store.add("info_3", documents=info_3)
vector_store.add("info_4", documents=info_4)

## Building a RAG Model from Scratch

### Step 1: Create a Custom RAG Model

In this section, you'll learn how to build a custom Retrieval-Augmented Generation (RAG) model from the ground up. This involves setting up your data sources, designing the retrieval mechanism, and integrating it with a generative model to create a seamless RAG pipeline.

### Step 2: Integrate TruLens Custom Instrumentation

After constructing your RAG model, we'll enhance it by incorporating TruLens custom instrumentation. This integration allows for advanced logging and monitoring of the model’s performance, enabling you to track metrics, debug issues, and gather valuable feedback on the responses generated by your language model. 

### Detailed Steps:

1. **Set Up Your Data Sources:**
   - Identify and prepare the datasets that your RAG model will use for retrieval.
   - Ensure the data is clean, well-structured, and relevant to your use case.

2. **Design the Retrieval Mechanism:**
   - Implement a retrieval system that efficiently indexes and searches your data.
   - Optimize the retrieval process to quickly find the most relevant information for the input queries.

3. **Integrate with a Generative Model:**
   - Connect your retrieval system to a generative language model.
   - Design the interaction between retrieval and generation to ensure the model produces coherent and contextually accurate responses.

4. **Add TruLens Instrumentation:**
   - Incorporate TruLens to log key performance metrics.
   - Use TruLens to monitor the groundedness, context relevance, and answer relevance of the generated responses.
   - Set up alerts and dashboards to track the model’s performance over time and make data-driven improvements.

By following these steps, you'll build a robust and well-instrumented RAG model capable of delivering high-quality, relevant responses in your application.
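
The first three steps can be sketched without any external service. The snippet below is a minimal illustration, not the notebook's implementation: it swaps the OpenAI embeddings and Chroma store for a bag-of-words cosine similarity, and `DOCS`, `tokenize`, `cosine`, and `build_prompt` are hypothetical names introduced only for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in corpus; the notebook uses the info_1..info_4 strings.
DOCS = {
    "uw": "The University of Washington was established in 1861 in Seattle.",
    "wsu": "Washington State University was founded in 1890 in Pullman.",
    "starbucks": "Starbucks is a coffeehouse chain headquartered in Seattle.",
}

def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, k: int = 2) -> list:
    """Step 2: rank documents by similarity to the query, keep the top k."""
    q = Counter(tokenize(query))
    ranked = sorted(
        DOCS.values(),
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, contexts: list) -> str:
    """Step 3: assemble the prompt that would be sent to the generative model."""
    context_block = "\n".join(contexts)
    return (
        "We have provided context information below.\n"
        "---------------------\n"
        f"{context_block}\n"
        "---------------------\n"
        f"Given this information, please answer the question: {query}"
    )

question = "When was the University of Washington established?"
top_chunks = retrieve(question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

A real pipeline would replace `cosine` with embedding similarity from a vector store and send `prompt` to an LLM; the control flow (retrieve, then build the prompt, then generate) stays the same.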

In [5]:
from trulens_eval import Tru
from trulens_eval.tru_custom_app import instrument
tru = Tru()
tru.reset_database()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


In [8]:
from openai import OpenAI
oai_client = OpenAI()

class build_a_RAG:
    @instrument
    def retrieve(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = vector_store.query(
            query_texts=query,
            n_results=4
        )
        # Flatten the list of lists into a single list
        return [doc for sublist in results['documents'] for doc in sublist]

    @instrument
    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {
                    "role": "user",
                    "content": (
                        "We have provided context information below. \n"
                        "---------------------\n"
                        f"{context_str}"
                        "\n---------------------\n"
                        f"Given this information, please answer the question: {query}"
                    ),
                }
            ],
        ).choices[0].message.content
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve(query)
        completion = self.generate_completion(query, context_str)
        return completion

rag = build_a_RAG()

## Setting Up Feedback Functions

### Step 1: Define Feedback Functions

In this section, we will establish feedback functions that evaluate the performance of our RAG model. These functions will help us detect hallucinations by assessing three key metrics: groundedness, answer relevance, and context relevance.

### Detailed Steps:

1. **Groundedness:**
   - **Purpose:** Check that the generated response is supported by the retrieved context rather than invented by the LLM.
   - **Implementation:** A feedback function that compares the statements in the response against the retrieved context chunks and scores how much of the response can be traced back to them.

2. **Answer Relevance:**
   - **Purpose:** Check that the response directly addresses the input query.
   - **Implementation:** A feedback function that takes the query and the final response and scores how well the response answers the question.

3. **Context Relevance:**
   - **Purpose:** Check that the retrieved context chunks are relevant to the input query.
   - **Implementation:** A feedback function that scores each retrieved chunk against the query and aggregates the per-chunk scores (here with the mean).

### Step 2: Integrate Feedback Functions

Once the feedback functions are defined, integrate them into the RAG model’s evaluation pipeline. This will enable continuous monitoring and refinement of the model’s outputs, ensuring high-quality and reliable responses. 

By setting up these feedback functions, you will be able to systematically detect and address hallucinations in your RAG model, improving its overall accuracy and effectiveness.
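
To make the inputs of each metric concrete, here is a rough sketch that uses plain token overlap as a stand-in for the LLM-based scoring. It is not how TruLens computes these scores; `overlap`, `toy_groundedness`, `toy_answer_relevance`, and `toy_context_relevance` are hypothetical helpers that only show which pieces of text each metric compares.

```python
import re

def _tokens(text: str) -> set:
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(claim: str, evidence: str) -> float:
    """Fraction of claim tokens that also appear in the evidence (0.0 to 1.0)."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(evidence)) / len(claim_tokens)

def toy_groundedness(contexts: list, response: str) -> float:
    """Compares the response against all retrieved contexts combined."""
    return overlap(response, " ".join(contexts))

def toy_answer_relevance(query: str, response: str) -> float:
    """Compares the query against the final response."""
    return overlap(query, response)

def toy_context_relevance(query: str, contexts: list) -> float:
    """Scores each retrieved chunk against the query, then takes the mean."""
    scores = [overlap(query, chunk) for chunk in contexts]
    return sum(scores) / len(scores) if scores else 0.0

contexts = ["The University of Washington was established in 1861."]
print(toy_groundedness(contexts, "It was established in 1861."))
print(toy_context_relevance("university", ["university of washington", "starbucks coffee"]))
```

Token overlap is far too crude for real evaluation (it ignores paraphrase and negation), which is why the notebook delegates the scoring to an LLM provider.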

In [9]:
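from trulens_eval import Feedback, Select
# Alias the TruLens provider so it does not shadow the `OpenAI` client class imported from `openai`.
from trulens_eval.feedback.provider.openai import OpenAI as fOpenAI

import numpy as np

provider = fOpenAI(model_engine="gpt-4o")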

# Define a groundedness feedback function
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name = "Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())
    .on_output()
)
# Question/answer relevance between overall question and answer.
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name = "Answer Relevance")
    .on_input()
    .on_output()
)

# Context relevance between question and each context chunk.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name = "Context Relevance")
    .on_input()
    .on(Select.RecordCalls.retrieve.rets[:])
    .aggregate(np.mean) # choose a different aggregation method if you wish
)

✅ In Groundedness, input source will be set to __record__.app.retrieve.rets.collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input context will be set to __record__.app.retrieve.rets[:] .


## Construct the App

1. **Wrap the Custom RAG:**
   - Wrap your custom RAG instance in `TruCustomApp` so that calls to its instrumented methods are recorded.
2. **Add Feedback Mechanisms:**
   - Pass the list of feedback functions defined above.
3. **Set Up Evaluation:**
   - `TruCustomApp` then runs those feedback functions on each recorded query.

In [11]:
from trulens_eval import TruCustomApp

# 1. Wrap the Custom RAG

tru_rag = TruCustomApp(rag,
    app_id = 'Simple RAG',
    feedbacks = [f_groundedness, f_answer_relevance, f_context_relevance])

#### Inference
Use `tru_rag` as a context manager for the custom RAG-from-scratch app.

In [12]:
with tru_rag as recording:
    rag.query("When was the University of Washington established?")

### Review Results - Check the leaderboard to view the results.

In [13]:
tru.get_leaderboard()

| app_id | Context Relevance | Groundedness | Answer Relevance | latency | total_cost |
|---|---|---|---|---|---|
| Simple RAG | 0.25 | 1.0 | 1.0 | 6.0 | 0.00048 |


In [14]:
last_record = recording.records[-1]

from trulens_eval.utils.display import get_feedback_result
get_feedback_result(last_record, "Context Relevance")

| | question | context | ret |
|---|---|---|---|
| 0 | When was the University of Washington establis... | \nThe University of Washington, established in... | 1.0 |
| 1 | When was the University of Washington establis... | \nWashington State University (WSU), founded i... | 0.0 |
| 2 | When was the University of Washington establis... | \nSeattle, located on Puget Sound in the Pacif... | 0.0 |
| 3 | When was the University of Washington establis... | \nStarbucks Corporation, an American multinati... | 0.0 |
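
The per-chunk scores above explain the 0.25 Context Relevance on the leaderboard: the feedback function aggregates them with a mean. A quick check of that arithmetic, using the standard library's `statistics.mean` in place of `np.mean` so the snippet has no dependencies:

```python
from statistics import mean

# Per-chunk context relevance scores from the record above (the `ret` column).
per_chunk_scores = [1.0, 0.0, 0.0, 0.0]

# Only one of the four retrieved chunks is relevant, so the mean is 1/4.
context_relevance = mean(per_chunk_scores)
print(context_relevance)  # 0.25
```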


## Implement Guardrails

To enhance efficiency and reduce hallucinations, use feedback results as guardrails during inference. Specifically, **apply the context relevance score to filter out irrelevant contexts before they reach the LLM**. 

Rebuild your RAG by applying the `@context_filter` decorator to the `retrieve` method, passing the feedback function and a score threshold to use as the guardrail.
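
Conceptually, such a guardrail is a decorator that scores every retrieved chunk against the query and drops the ones below the threshold. The sketch below illustrates the idea only. The name `simple_context_filter`, the toy scorer, and the strict `>` comparison are assumptions made for illustration, not TruLens's implementation.

```python
import functools

def simple_context_filter(score_fn, threshold, keyword_for_prompt="query"):
    """Hypothetical decorator: keep only chunks whose score exceeds the threshold."""
    def decorator(retrieve_fn):
        @functools.wraps(retrieve_fn)
        def wrapper(*args, **kwargs):
            # The query is looked up by name, so it must be passed as a keyword argument.
            query = kwargs[keyword_for_prompt]
            chunks = retrieve_fn(*args, **kwargs)
            return [c for c in chunks if score_fn(query, c) > threshold]
        return wrapper
    return decorator

def toy_score(query: str, chunk: str) -> float:
    """Toy scorer: share of query words that occur in the chunk."""
    words = query.lower().split()
    hits = sum(1 for w in words if w in chunk.lower())
    return hits / len(words) if words else 0.0

@simple_context_filter(toy_score, 0.75)
def retrieve(query: str) -> list:
    # Stand-in for a vector store lookup that returns mixed-quality chunks.
    return [
        "The University of Washington was established in 1861.",
        "Starbucks is headquartered in Seattle.",
    ]

kept = retrieve(query="university washington established")
print(kept)
```

Because the wrapper reads the query from the keyword arguments, the decorated method has to be called as `retrieve(query=...)` rather than positionally.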

In [15]:
# note: feedback function used for guardrail must only return a score, not also reasons
f_context_relevance_score = (
    Feedback(provider.context_relevance, name = "Context Relevance")
)

from trulens_eval.guardrails.base import context_filter

class filtered_RAG_from_scratch:
    @instrument
    @context_filter(f_context_relevance_score, 0.75, keyword_for_prompt="query")
    def retrieve(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = vector_store.query(
            query_texts=query,
            n_results=4
        )
        return [doc for sublist in results['documents'] for doc in sublist]

    @instrument
    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {
                    "role": "user",
                    "content": (
                        "We have provided context information below. \n"
                        "---------------------\n"
                        f"{context_str}"
                        "\n---------------------\n"
                        f"Given this information, please answer the question: {query}"
                    ),
                }
            ],
        ).choices[0].message.content
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve(query=query)
        completion = self.generate_completion(query=query, context_str=context_str)
        return completion

filtered_rag = filtered_RAG_from_scratch()

## Build Filtered-Context RAG to Improve Retrieval Efficiency

In [16]:
from trulens_eval import TruCustomApp

# 2. Filtered-Context - RAG

filtered_tru_rag = TruCustomApp(filtered_rag,
    app_id = 'Filtered Context - RAG',
    feedbacks = [f_groundedness, f_answer_relevance, f_context_relevance])


# 3. Let's check the model's responses.
with filtered_tru_rag as recording:
    filtered_rag.query(query="when was the university of washington founded?")

In [17]:
tru.get_leaderboard(app_ids=[])

| app_id | Context Relevance | Groundedness | Answer Relevance | latency | total_cost |
|---|---|---|---|---|---|
| Filtered Context - RAG | 1.0 | 1.0 | 1.0 | 6.0 | 0.024872 |
| Simple RAG | 0.25 | 1.0 | 1.0 | 6.0 | 0.00048 |
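
The leaderboard also shows a trade-off: Context Relevance rises from 0.25 to 1.0, but `total_cost` rises as well. One plausible reading, stated here as an assumption rather than something the output confirms, is that the guardrail scores each retrieved chunk with the feedback provider at query time. The size of the increase can be worked out from the two reported values:

```python
# total_cost values copied from the leaderboard above.
simple_cost = 0.00048
filtered_cost = 0.024872

# How many times more expensive the filtered run was for this single query.
ratio = filtered_cost / simple_cost
print(f"{ratio:.1f}x")  # 51.8x
```

With a single query per app, this ratio is only indicative; a larger query set would give a more reliable comparison.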


## Experience the effectiveness of filtering!

In [18]:
# Sample Context Relevance record for the last query.
last_record = recording.records[-1]

from trulens_eval.utils.display import get_feedback_result
get_feedback_result(last_record, "Context Relevance")

| | question | context | ret |
|---|---|---|---|
| 0 | when was the university of washington founded? | \nThe University of Washington, established in... | 1.0 |


In [20]:
# tru.run_dashboard(port=3453, force=True)