## Evaluation
 
Next, we evaluate the RAG pipeline we built, using RAGAS. The test data set is stored in the `data` folder and contains 10 questions with their corresponding ground-truth answers.

In [1]:
import os
import sys
from dotenv import load_dotenv
from pathlib import Path

load_dotenv()
api_key=os.environ.get("GOOGLE_API_KEY")

In [2]:
root_path = str(Path(os.getcwd()).resolve().parent) 
if root_path not in sys.path:
    sys.path.append(root_path)

In [3]:
import pandas as pd
from datasets import Dataset
from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness, ContextRecall, ContextPrecision
from google import genai
from tools.retrieve_and_reply import get_raw_response


All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  import google.generativeai as genai


In [4]:
df = pd.read_excel("../data/test_data.xlsx")
print(df.head())

   id                                            question  \
0    1  What is the revenue of Google ended three mont...   
1    2                 What is the R&D expense of 2025Q3？   
2    3  What is the net income of GOOG in three months...   
3    4  How about the share repurchase of Google in 20...   
4    5  Listed the revenue of Google Cloud for nine mo...   

                                        ground_truth  
0                                It's 102346 million  
1                                 It's 15151 million  
2                                 It's 34979 million  
3    It's 11553 million, including Class A & Class B  
4  The revenue for Google Cloud for nine months e...  
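
Before spending LLM calls on the evaluation, it can help to sanity-check the test set. The sketch below is a minimal example, not part of the original pipeline: `check_test_rows` is a hypothetical helper, and the sample rows are made up to trigger both checks. It assumes the three columns shown in the output above (`id`, `question`, `ground_truth`).

```python
def check_test_rows(rows, required=("id", "question", "ground_truth")):
    """Return a list of human-readable problems found in the test rows."""
    problems = []
    seen_ids = set()
    for pos, row in enumerate(rows):
        for col in required:
            value = row.get(col)
            is_nan = isinstance(value, float) and value != value  # empty Excel cells load as NaN
            is_blank = isinstance(value, str) and not value.strip()
            if value is None or is_nan or is_blank:
                problems.append(f"row {pos}: empty '{col}'")
        if row.get("id") in seen_ids:
            problems.append(f"row {pos}: duplicate id {row.get('id')}")
        seen_ids.add(row.get("id"))
    return problems


# Made-up sample rows: the second one has a blank question and a repeated id.
sample = [
    {"id": 1, "question": "What is the R&D expense of 2025Q3?", "ground_truth": "It's 15151 million"},
    {"id": 1, "question": " ", "ground_truth": "It's 34979 million"},
]
for problem in check_test_rows(sample):
    print(problem)
```

With the real data, the same helper could be called as `check_test_rows(df.to_dict("records"))`.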


## Create RAGAS evaluation data set

Here I build the evaluation data set by calling the retrieval function on each test question. The evaluation compares the retrieved contexts and the generated answers against the ground truth.


In [5]:
import nest_asyncio 
nest_asyncio.apply()


eval_rows = []

# Retrieve the response and contexts for each question in the test data set
for _, row in df.iterrows():
 
    response_obj = get_raw_response(row['question'])
    
    eval_rows.append({
        "user_input": row['question'],
        "response": response_obj.response,
        "retrieved_contexts": [n.get_content() for n in response_obj.source_nodes],
        "reference": row['ground_truth']
    })

In [6]:
print(eval_rows[0])

{'user_input': 'What is the revenue of Google ended three months in 2025Q3', 'response': "Google's total revenue for the three months ended September 30, 2025, was $102,346 million.", 'retrieved_contexts': ["This table presents Google's revenues broken down by segment (e.g., Google Search & other, YouTube ads, Google Cloud) for the three and nine months ended September 30, 2024, and 2025.,\nwith the following columns:\n- Revenue Category: The specific revenue stream or segment, such as Google Search & other, YouTube ads, or Google Cloud.\n- Three Months Ended September 30, 2024: Total revenue in USD for the three-month period ending September 30, 2024.\n- Three Months Ended September 30, 2025: Total revenue in USD for the three-month period ending September 30, 2025.\n- Nine Months Ended September 30, 2024: Total revenue in USD for the nine-month period ending September 30, 2024.\n- Nine Months Ended September 30, 2025: Total revenue in USD for the nine-month period ending September 30
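
The row-building step can also be exercised without calling the retriever, which is handy when checking the field mapping. The sketch below assumes only what the loop above relies on: a response object with a `response` attribute and `source_nodes` whose items expose `get_content()`. `StubNode`, `StubResponse`, and `build_eval_row` are hypothetical names, and the context text is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StubNode:
    """Hypothetical stand-in for a retrieved source node."""
    text: str

    def get_content(self) -> str:
        return self.text


@dataclass
class StubResponse:
    """Hypothetical stand-in for the object returned by get_raw_response."""
    response: str
    source_nodes: list = field(default_factory=list)


def build_eval_row(question, ground_truth, response_obj):
    """Map one question and its response object to the four evaluation fields."""
    return {
        "user_input": question,
        "response": str(response_obj.response),
        "retrieved_contexts": [n.get_content() for n in response_obj.source_nodes],
        "reference": ground_truth,
    }


# Invented response and context text, used only to show the resulting shape.
stub = StubResponse("R&D expense was $15,151 million.", [StubNode("R&D: 15,151")])
row = build_eval_row("What is the R&D expense of 2025Q3?", "It's 15151 million", stub)
print(row["retrieved_contexts"])
```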

## Design the experiment

RAGAS provides an `experiment` decorator for running evaluations, which makes it easy to run multiple experiments with different configurations. Here I use three metrics: faithfulness, context recall, and context precision.
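
As I understand the RAGAS definitions, faithfulness and context recall are both ratios of supported claims: faithfulness counts the claims in the generated response that can be inferred from the retrieved contexts, while context recall counts the claims in the reference answer that the retrieved contexts support. The sketch below shows only that final ratio. In RAGAS the verdicts come from the evaluation LLM, whereas here they are hypothetical hand-written values, and `supported_fraction` is an illustrative name rather than a RAGAS function.

```python
def supported_fraction(verdicts):
    """Fraction of claims judged as supported (1) rather than unsupported (0)."""
    if not verdicts:
        # Design choice for this sketch: refuse to score when there are no claims.
        raise ValueError("at least one claim verdict is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical: a response split into three claims, two supported by the contexts.
faithfulness = supported_fraction([1, 1, 0])
# Hypothetical: a reference answer with two claims, both found in the contexts.
context_recall = supported_fraction([1, 1])
print(round(faithfulness, 3), context_recall)
```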

RAGAS uses ChatGPT as its default evaluation LLM, but you can specify another LLM if needed. The experiment code is asynchronous by design, so to use it directly you need to define an async client. Here I use Gemini to evaluate the RAG, but Gemini does not provide an asynchronous client. To work around this, RAGAS suggests using AsyncOpenAI instead and mapping the requests to the Gemini models through a LiteLLM proxy.

To use the LiteLLM proxy, you first need to start the LiteLLM server. Open a new terminal and run the following commands:

```bash
pip install "litellm[proxy]"
export GEMINI_API_KEY="your_gemini_api_key"
litellm --model gemini/gemini-2.0-flash
```
Note that the LiteLLM proxy looks for the GEMINI_API_KEY environment variable when routing requests to Google’s Gemini models. In a larger project where other Google-related keys (such as GOOGLE_API_KEY) are already set, the proxy may run into an "Identity Ambiguity" issue, where it is unclear which key should be used. To avoid this, explicitly set GEMINI_API_KEY in the terminal before starting the LiteLLM server, as shown above.
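
A quick way to see which of these variables are visible is a small pre-flight check. This is a minimal sketch: `describe_gemini_keys` is a hypothetical helper, and it only reflects the environment of the process that runs it, so it should be run from the same terminal session that will start the proxy. It reports whether each variable is set without printing the key values.

```python
import os


def describe_gemini_keys(env=os.environ):
    """Report which Gemini-related key variables are set, without exposing their values."""
    names = ("GEMINI_API_KEY", "GOOGLE_API_KEY")
    return {name: bool(env.get(name)) for name in names}


status = describe_gemini_keys()
if not status["GEMINI_API_KEY"]:
    print("GEMINI_API_KEY is not set; export it before starting the proxy.")
print(status)
```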

In [7]:
import httpx

# Check that the LiteLLM proxy is reachable; it should return status code 200
async def test_proxy():
    async with httpx.AsyncClient() as client:
        try:
            resp = await client.get("http://127.0.0.1:4000/health/readiness")
            print(f"Proxy Status: {resp.status_code}")
        except Exception as e:
            print(f"Cannot connect to LiteLLM Proxy: {e}")

await test_proxy()

Proxy Status: 200


The code above checks that the LiteLLM server is reachable; it should print "Proxy Status: 200". Next, we define the async client and use llm_factory to create the evaluation LLM, which reaches the Gemini model through the LiteLLM proxy.

In [8]:
from openai import AsyncOpenAI 

# Create async OpenAI client
async_client = AsyncOpenAI(base_url="http://127.0.0.1:4000", api_key="sk-...")

# Use llm_factory to create the evaluation LLM;
# requests go to the LiteLLM proxy, which maps them to Gemini.
eval_llm = llm_factory(model="gemini-2.0-flash", client=async_client)

# Must print True to confirm that the LLM is async
print(eval_llm.is_async)

True


Now we can design the experiment using the `experiment` decorator provided by RAGAS. First, we define the ExperimentalResults class to store the results of the experiment. Then we define the async experiment function, which takes one row of the evaluation data set and returns an ExperimentalResults instance.

In [9]:
eval_ds = Dataset.from_list(eval_rows)

from ragas import experiment
from pydantic import BaseModel

class ExperimentalResults(BaseModel):
    faithfulness: float
    context_recall: float
    context_precision: float


In [10]:

@experiment(ExperimentalResults)
async def run_evaluation(row):
    """
    Compute faithfulness, context recall, and context precision for one row.
    
    Args:
        row: A row from the evaluation dataset containing 'user_input', 'response', 
        'retrieved_contexts', and 'reference'.
    
    Returns:
        An instance of ExperimentalResults containing the computed metrics.
    """
    
    faithfulness = Faithfulness(llm=eval_llm)
    context_recall = ContextRecall(llm=eval_llm)
    context_precision = ContextPrecision(llm=eval_llm)
    
    faith_result = await faithfulness.ascore(
        user_input=row['user_input'],
        response=row['response'],
        retrieved_contexts=row['retrieved_contexts'])
    
    recall_result = await context_recall.ascore(
        user_input=row['user_input'],
        retrieved_contexts=row['retrieved_contexts'],
        reference=row['reference'])
    
    precision_result = await context_precision.ascore(
        user_input=row['user_input'],
        retrieved_contexts=row['retrieved_contexts'],
        reference=row['reference'])
    
    return ExperimentalResults(
        faithfulness=faith_result.value,
        context_recall=recall_result.value,
        context_precision=precision_result.value)

In [14]:
# Run evaluation over the dataset
import asyncio
import openai

exp_results = []
for i, row in enumerate(eval_ds):
    try:
        
        result = await run_evaluation(row)
        
        exp_results.append({
            "index": i,
            "faithfulness": result.faithfulness,
            "context_recall": result.context_recall,
            "context_precision": result.context_precision
        })
        
    except openai.RateLimitError as e:
        # Handle the 429 error: wait, then skip this row so it is not reported as done
        print(f"Row {i} hit the rate limit (429): {e}")
        await asyncio.sleep(20)
        continue
    
    print(f"The {i} th row evaluation is done.")

The 0 th row evaluation is done.
The 1 th row evaluation is done.
The 2 th row evaluation is done.
The 3 th row evaluation is done.
The 4 th row evaluation is done.
The 5 th row evaluation is done.
The 6 th row evaluation is done.
The 7 th row evaluation is done.
The 8 th row evaluation is done.
The 9 th row evaluation is done.
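
The loop above moves on to the next row after a rate-limit error, so a row that hits a 429 gets no score. One alternative is to retry the same row with exponential backoff. The sketch below is a generic helper under my own assumptions: `retry_async`, its parameters, and the `flaky` demo coroutine are hypothetical, and `TimeoutError` stands in for `openai.RateLimitError` so that the example runs without network access.

```python
import asyncio


async def retry_async(make_call, retry_on, attempts=3, base_delay=1.0):
    """Await make_call(), retrying with exponential backoff on the given exception type."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller see the error.
            await asyncio.sleep(base_delay * 2 ** attempt)


# Demo with a hypothetical flaky coroutine that fails twice, then succeeds.
calls = {"n": 0}


async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"


result = asyncio.run(retry_async(flaky, TimeoutError, base_delay=0.01))
print(result, calls["n"])
```

Inside the notebook loop, the equivalent call would look like `result = await retry_async(lambda: run_evaluation(row), openai.RateLimitError, base_delay=20)`.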


In [16]:
df_results = pd.DataFrame(exp_results)
print(df_results)

   index  faithfulness  context_recall  context_precision
0      0           1.0             1.0           1.000000
1      1           1.0             1.0           1.000000
2      2           1.0             1.0           1.000000
3      3           1.0             1.0           1.000000
4      4           1.0             1.0           1.000000
5      5           1.0             1.0           0.700000
6      6           1.0             1.0           1.000000
7      7           1.0             1.0           1.000000
8      8           1.0             1.0           0.477778
9      9           1.0             1.0           1.000000


In [17]:
print("Average Faithfulness:", df_results['faithfulness'].mean())
print("Average Context Recall:", df_results['context_recall'].mean())
print("Average Context Precision:", df_results['context_precision'].mean())

Average Faithfulness: 1.0
Average Context Recall: 1.0
Average Context Precision: 0.9177777777476852
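
The context precision values below 1.0 for rows 5 and 8 are easier to read with the formula in hand. As I understand the RAGAS definition, the metric averages precision@k over the ranks k where a relevant chunk appears, so relevant chunks ranked low pull the score down. The sketch below implements that formula on plain 0/1 relevance verdicts. The two verdict lists are hypothetical: they are patterns that happen to reproduce 0.7 and 0.477778, not the actual verdicts from this run, which the notebook does not show.

```python
def context_precision(verdicts):
    """Rank-weighted precision over retrieved chunks.

    verdicts[k] is 1 if the chunk at rank k + 1 was judged relevant, else 0.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    score = 0.0
    hits = 0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / relevant_total


# Hypothetical pattern: relevant chunks at ranks 1 and 5 -> (1/1 + 2/5) / 2
print(round(context_precision([1, 0, 0, 0, 1]), 6))
# Hypothetical pattern: relevant chunks at ranks 3, 4, 5 -> (1/3 + 2/4 + 3/5) / 3
print(round(context_precision([0, 0, 1, 1, 1]), 6))
```

Under this reading, a perfect context recall combined with a lower context precision suggests that the right chunks were retrieved but ranked below irrelevant ones.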
