# RAPTOR RAG Benchmark
### Benchmark Evaluation for Illinois Statutes RAG System

To evaluate our **Illinois Statutes RAG system**, we wrote custom benchmark questions because we found no existing Illinois-specific question set. The questions focus primarily on the following areas:

- **Education**
- **Health**
- **Regulation**

These areas address policies, public safety, and compliance rules.

## Question Design

### Total Questions: 45

#### 1. **True/False Statements** (15 Questions)
- Assess whether a given statement aligns with the statutes.
- **Example Question**: Can a licensed currency exchange operate at multiple locations under the same license?  
  **Example Answer**: No.

#### 2. **Short-Answer Questions** (30 Questions)
- Fact-based questions requiring specific answers about acts or provisions.
- **Example Question**: What is required for a 16-year-old to donate blood?  
  **Example Answer**: Written permission or authorization from their parent or guardian.
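
The benchmark files read later in this notebook expose two parallel lists under the keys `questions` and `answers`. The following is a minimal sketch of that layout and of a loader that checks the two lists line up. The sample content, the temporary file name, and the `load_benchmark` helper are illustrative assumptions, not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample content; the real files hold 15 and 30 question/answer pairs.
sample = {
    "questions": [
        "Can a licensed currency exchange operate at multiple locations under the same license? (yes or no)",
    ],
    "answers": ["No."],
}


def load_benchmark(path):
    """Load parallel question/answer lists and check that they line up."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    questions = data.get("questions", [])
    answers = data.get("answers", [])
    if len(questions) != len(answers):
        raise ValueError(f"{len(questions)} questions but {len(answers)} answers")
    return list(zip(questions, answers))


# Write the sample to a temporary file and read it back.
path = Path(tempfile.mkdtemp()) / "sample_benchmark.json"
path.write_text(json.dumps(sample, indent=2), encoding="utf-8")
pairs = load_benchmark(path)
print(pairs[0])
```

Checking the lengths up front avoids `zip` silently dropping unmatched questions when the two lists differ in size.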

In [1]:
# Importing necessary libraries and modules
import json
import vertexai
from vertexai.preview.generative_models import GenerativeModel
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings  # the langchain.embeddings import path is deprecated
from langchain.chains import ConversationalRetrievalChain
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_google_community import BigQueryVectorStore
from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferMemory


In [2]:
# Initialize Vertex AI
PROJECT_ID = "mlds-cap-2024-lexlead-advisor"  # Replace with your project ID
REGION = "us-central1"         
vertexai.init(project=PROJECT_ID, location=REGION)

# Load the Gemini 1.5 Pro and Flash models (for text-based legal question answering)
generative_model_gemini_15_pro = GenerativeModel("gemini-1.5-pro-002")
generative_model_gemini_15_non_pro = GenerativeModel("gemini-1.5-flash-002")

models = [
    ("Gemini 1.5 Pro", generative_model_gemini_15_pro), 
    ("Gemini 1.5 Flash", generative_model_gemini_15_non_pro)
]

In [4]:
# Initialize the OpenAI embedding model (openai_api_key is assumed to be defined in an earlier cell)
from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key, model="text-embedding-3-small", disallowed_special=())

# Initialize the BigQuery Vector Store
vectorstore = BigQueryVectorStore(
    project_id=PROJECT_ID,
    dataset_name="IL_RAPTOR",
    table_name="IL_raptor",
    location=REGION,
    embedding=embedding
)

BigQuery table mlds-cap-2024-lexlead-advisor.IL_RAPTOR.IL_raptor initialized/validated as persistent storage. Access via BigQuery console:
 https://console.cloud.google.com/bigquery?project=mlds-cap-2024-lexlead-advisor&ws=!1m5!1m4!4m3!1smlds-cap-2024-lexlead-advisor!2sIL_RAPTOR!3sIL_raptor


### Example Implementation

from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough, RunnableMap
from langchain_google_vertexai import VertexAI
from langchain_core.output_parsers import StrOutputParser
import vertexai

llm = VertexAI(model_name="gemini-1.5-pro-002")

retriever = vectorstore.as_retriever(search_kwargs={"k": 30})

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversational_retrieval = ConversationalRetrievalChain.from_llm(
    llm=llm, retriever=retriever, memory=memory
)

search_query = "Who can initiate civil proceedings under the Academic Plagiarism Act to address plagiarism-related activities?"
conversational_retrieval.invoke(search_query)["answer"]

## Section 1

### TRUE/FALSE Statement Question Overview

In [161]:
file_path = "binary_benchmark_question.json"

# Read the JSON file
with open(file_path, "r") as file:
    data = json.load(file)

# Access questions and answers
questions = data.get("questions", [])
answers = data.get("answers", [])

# Example: Print all questions and answers
for i, (question, answer) in enumerate(zip(questions, answers), start=1):
    print(f"Question {i}: {question}")
    print(f"Answer {i}: {answer}\n")



Question 1: Are licenses required to operate community or ambulatory currency exchanges? (yes or no)
Answer 1: Yes.

Question 2: Can a licensed currency exchange operate at multiple locations under the same license? (yes or no)
Answer 2: No.

Question 3: Is it mandatory for a single female borrower to have a cosigner if a single male borrower under similar conditions does not require one? (yes or no)
Answer 3: No.

Question 4: Can campground licenses be transferred to another operator? (yes or no)
Answer 4: No.

Question 5: Are campus media produced by students at state-sponsored higher education institutions considered public forums for student expression? (yes or no)
Answer 5: Yes

Question 6: Are Illinois residents obligated to submit ACT or SAT scores when applying to public universities in the state?(yes or no)
Answer 6: No.

Question 7: Are institutions allowed to implement a more lenient transcript policy than the minimum requirements?(yes or no)
Answer 7: Yes.

Question 8: Does

### Benchmark Analysis On True/False Question

In [162]:
import time
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type, RetryError
from google.api_core.exceptions import ResourceExhausted

# Initialize the RAG system
retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversational_retrieval = ConversationalRetrievalChain.from_llm(
    llm=VertexAI(model_name="gemini-1.5-pro-002"), retriever=retriever, memory=memory
)

# Custom exception for empty or filtered responses
class EmptyResponseError(Exception):
    pass

@retry(
    stop=stop_after_attempt(2),
    wait=wait_exponential(multiplier=1, min=5, max=120),
    retry=retry_if_exception_type((ResourceExhausted, EmptyResponseError))
)
def generate_with_retry_raptor(prompt):
    """
    Generate a response using the RAG system with retry logic.
    
    Args:
        prompt (str): Input prompt to the RAG system.
    
    Returns:
        str: Normalized response from the RAG system.
    """
    try:
        response = conversational_retrieval.invoke(prompt)["answer"]

        # Check if the response is empty
        if not response or not response.strip():
            raise EmptyResponseError("Raptor response is empty or blocked.")

        # Return the response directly without any additional checks
        return response.strip()

    except Exception as e:
        print(f"Error encountered while generating Raptor response: {e}")
        raise EmptyResponseError("Raptor content is blocked or empty.") from e


               
def validate_with_gemini(question, generated_answer, ground_truth):
    """
    Validate the generated answer using Gemini 1.5 Pro.
    No retry is applied here; an empty response raises EmptyResponseError.
    """
    gemini_model = generative_model_gemini_15_pro

    prompt = f"""
    Question: {question}

    Generated Answer: {generated_answer}

    Ground Truth Answer: {ground_truth}

    Evaluate the generated answer. Determine if it is correct or incorrect based on the ground truth. It is normal for the generated answer to provide more information than the ground truth. If the generated answer conveys the same meaning as the ground truth or provides additional relevant details without contradicting the ground truth, consider it correct. Minor differences in wording or phrasing should not be penalized. On the first line, output exactly one of the following, then provide reasoning for your evaluation on the lines after it:
    - "Correct" if the generated answer is consistent with the ground truth in meaning, even if the phrasing differs or the answer provides additional details and information.
    - "Incorrect" if the generated answer is inconsistent with the ground truth or introduces contradictions.

    """
    response = gemini_model.generate_content([prompt])

    # Ensure the response contains text
    if not response or not hasattr(response, 'text') or not response.text.strip():
        raise EmptyResponseError("Validation response is empty or blocked.")

    return response.text.strip()


# Main evaluation loop
model_results = {}
num_to_print = 20

# Start RAG benchmarking
print("Evaluating Raptor system\n")
total = len(questions) # Total number of questions
correct_count = 0  # Reset correct count for RAG
question_times = []  # Track time taken for each question

# Start timing for the entire evaluation
evaluation_start_time = time.time()

for i in range(total):
    print(f"Processing question {i + 1} out of {total}")  # Track which question is being processed
    question_start_time = time.time()  # Start timing for the current question

    question = questions[i]
    ground_truth = answers[i]

    prompt = f"""
    You are an expert in Illinois Education Law who answers clients' legal questions professionally. Answer the following question by reasoning step-by-step.
    For questions labeled (yes or no), answer yes or no before reasoning. If the text does not provide sufficient information for a clear answer, answer "I don't know" before providing more information.

    Question: {question}
    """

    try:
        # Generate and validate the response
        model_response = generate_with_retry_raptor(prompt)
        validation_result = validate_with_gemini(question, model_response, ground_truth)
        validation_result = validation_result.split("\n")[0].strip().lower()  # Parse the first line for "correct" or "incorrect"

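        # Count the answer only when the first line gives a positive verdict ("incorrect" also contains "correct")
        if "correct" in validation_result and "incorrect" not in validation_result: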
            correct_count += 1

        # Record time taken for the question
        question_end_time = time.time()
        question_time = question_end_time - question_start_time
        question_times.append(question_time)
        
        # Print question details if within the specified range
        if i < num_to_print:
            print(f"Question {i + 1}: {question}")
            print(f"Ground Truth: {ground_truth}")
            print(f"Raptor Response: {model_response}")
            print(f"Validation Result: {validation_result}")
            print(f"Time Taken: {question_time:.2f} seconds\n")
            print("-" * 80)

    except (RetryError, EmptyResponseError) as e:
        # RetryError: the generator exhausted its 2 attempts; EmptyResponseError: the validator (not retried) returned nothing
        print(f"Failed to get a response for question {i + 1}; skipping it.")
        last_exception = e.last_attempt.exception() if isinstance(e, RetryError) else e
        print(f"Last exception: {last_exception}")
        print("Waiting 5 minutes before continuing to prevent further quota issues.")
        time.sleep(300)
        continue  # Skip to the next question

# End timing for the entire evaluation
evaluation_end_time = time.time()
total_evaluation_time = evaluation_end_time - evaluation_start_time

# Calculate accuracy
accuracy = (correct_count / total) * 100  # Calculate as a percentage
model_results["Raptor"] = {
    "Total Questions Evaluated": total,
    "Correctly Matched Answers": correct_count,
    "Accuracy": accuracy,  # Store accuracy as a percentage
    "Total Time (seconds)": total_evaluation_time,
    "Average Time per Question (seconds)": sum(question_times) / len(question_times) if question_times else 0,
}

# Print results
print("\nFinal Results for Raptor System:")
for model_name, results in model_results.items():
    print(f"Model: {model_name}")
    print(f"Total Questions Evaluated: {results['Total Questions Evaluated']}")
    print(f"Correctly Matched Answers: {results['Correctly Matched Answers']}")
    print(f"Accuracy: {results['Accuracy']:.2f}%")
    print(f"Total Time: {results['Total Time (seconds)']:.2f} seconds")
    print(f"Average Time per Question: {results['Average Time per Question (seconds)']:.2f} seconds")
    print("=" * 100)


Evaluating Raptor system

Processing question 1 out of 15
Question 1: Are licenses required to operate community or ambulatory currency exchanges? (yes or no)
Ground Truth: Yes.
Raptor Response: Yes.

The Currency Exchange Act (205 ILCS 405/2) explicitly states that no person, firm, association, partnership, limited liability company, or corporation shall engage in the business of a community currency exchange or in the business of an ambulatory currency exchange without first securing a license to do so from the Secretary of Financial and Professional Regulation.  Furthermore, the Act specifies that separate licenses are required for each location operated by a currency exchange (205 ILCS 405/4).
Validation Result: correct
Time Taken: 32.54 seconds

--------------------------------------------------------------------------------
Processing question 2 out of 15
Question 2: Can a licensed currency exchange operate at multiple locations under the same license? (yes or no)
Ground Truth: N
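
The evaluation cell above wraps generation in a `tenacity` retry with `stop_after_attempt(2)`, which allows two attempts in total (one initial call plus one retry). The following standard-library sketch illustrates that retry-with-backoff pattern in isolation. The `call_with_retry` helper, its parameters, and the `flaky` function are hypothetical, and the delay formula is a simplification that does not reproduce `tenacity`'s exact wait schedule.

```python
import time


class EmptyResponseError(Exception):
    """Raised when the system returns no usable text."""


def call_with_retry(fn, max_attempts=2, base_delay=5.0, max_delay=120.0, sleep=time.sleep):
    """Call fn() up to max_attempts times, doubling the wait between attempts."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except EmptyResponseError as exc:
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff, capped at max_delay.
                delay = min(base_delay * 2 ** (attempt - 1), max_delay)
                sleep(delay)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Hypothetical flaky call: fails on the first attempt, succeeds on the second.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise EmptyResponseError("empty response")
    return "Yes."


waits = []  # Record the waits instead of actually sleeping.
result = call_with_retry(flaky, sleep=waits.append)
print(result, calls["n"], waits)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; in a notebook run the default `time.sleep` would apply.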

## Section 2

### Short Answer Question Review

In [163]:
file_path = "shortanswer_benchmark_question.json"

# Read the JSON file
with open(file_path, "r") as file:
    data = json.load(file)

# Access questions and answers
questions = data.get("questions", [])
answers = data.get("answers", [])

# Example: Print all questions and answers
for i, (question, answer) in enumerate(zip(questions, answers), start=1):
    print(f"Question {i}: {question}")
    print(f"Answer {i}: {answer}\n")


Question 1: What is required for a 16-year-old to donate blood?
Answer 1: Written permission or authorization from their parent or guardian.

Question 2: What change in blood donation rules takes effect on January 1, 2025?
Answer 2: 17-year-olds can also have their blood typed.

Question 3: What is the minimum distance from entrances, exits, windows, and ventilation intakes where smoking is prohibited under the Smoke-Free Illinois Act?
Answer 3: 15 feet

Question 4: What penalties are imposed on individuals who smoke in prohibited areas under the Smoke-Free Illinois Act?
Answer 4: Under the Smoke Free Illinois Act, a person who smokes in a prohibited area is liable for a civil penalty of $100 for the first offense and $250 for each subsequent offense.

Question 5: Which department is responsible for administering the Safe Bottled Water Act?
Answer 5: The Department of Public Health.

Question 6: How frequently must the Department of Public Health inspect water-bottling plants and priva

### Benchmark Analysis on Short-Answer Questions

In [164]:
import time
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type, RetryError
from google.api_core.exceptions import ResourceExhausted

# Initialize the RAG system
retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversational_retrieval = ConversationalRetrievalChain.from_llm(
    llm=VertexAI(model_name="gemini-1.5-pro-002"), retriever=retriever, memory=memory
)

# Custom exception for empty or filtered responses
class EmptyResponseError(Exception):
    pass

# Function to generate a response using the RAG system with retry logic
@retry(
    stop=stop_after_attempt(2),
    wait=wait_exponential(multiplier=1, min=5, max=120),
    retry=retry_if_exception_type((ResourceExhausted, EmptyResponseError))
)
def generate_with_retry_rag(prompt):
    """
    Generate a response using the RAG system, with retries for errors.
    
    Args:
        prompt (str): Input query for the RAG system.
    
    Returns:
        str: Normalized response from the RAG system.
    
    Raises:
        EmptyResponseError: If the response is empty, blocked, or insufficient.
    """
    try:
        response = conversational_retrieval.invoke(prompt)["answer"]

        # Check if the response is empty
        if not response or not response.strip():
            raise EmptyResponseError("RAG response is empty or blocked.")

        # Return the response directly without any additional checks
        return response.strip()

    except Exception as e:
        print(f"Error encountered while generating RAG response: {e}")
        raise EmptyResponseError("RAG content is blocked or empty.") from e

         
def validate_with_gemini(question, generated_answer, ground_truth):
    """
    Validate the generated answer using Gemini 1.5 Pro.
    No retry is applied here; an empty response raises EmptyResponseError.
    """
    gemini_model = generative_model_gemini_15_pro

    prompt = f"""
    Question: {question}

    Generated Answer: {generated_answer}

    Ground Truth Answer: {ground_truth}

    Evaluate the generated answer. Determine if it is correct or incorrect based on the ground truth. It is normal for the generated answer to provide more information than the ground truth. If the generated answer conveys the same meaning as the ground truth or provides additional relevant details without contradicting the ground truth, consider it correct. Minor differences in wording or phrasing should not be penalized. On the first line, output exactly one of the following, then provide reasoning for your evaluation on the lines after it:
    - "Correct" if the generated answer is consistent with the ground truth in meaning, even if the phrasing differs or the answer provides additional details and information compared to the ground truth.
    - "Incorrect" if the generated answer is inconsistent with the ground truth or introduces contradictions.
    """
    response = gemini_model.generate_content([prompt])

    # Ensure the response contains text
    if not response or not hasattr(response, 'text') or not response.text.strip():
        raise EmptyResponseError("Validation response is empty or blocked.")

    return response.text.strip() # Return the validation result


# Main evaluation loop
model_results = {}  # Dictionary to store evaluation results
num_to_print = 30  # Number of question details to print for inspection

# Start RAG benchmarking
print("Evaluating RAG system\n")
total = len(questions)  # Total number of questions to evaluate
correct_count = 0  # Reset correct count for RAG
question_times = []  # Track time taken for each question

# Start timing for the entire evaluation
evaluation_start_time = time.time()

for i in range(total):
    print(f"Processing question {i + 1} out of {total}")  # Track which question is being processed
    question_start_time = time.time()  # Start timing for the current question

    question = questions[i]
    ground_truth = answers[i]

    prompt = f"""
    You are an expert in Illinois Education Law who answers clients' legal questions professionally. Answer the following question by reasoning step-by-step.
    For questions labeled (yes or no), answer yes or no before reasoning. If the text does not provide sufficient information for a clear answer, answer "I don't know" before providing more information.

    Question: {question}
    """

    try:
        model_response = generate_with_retry_rag(prompt)
        validation_result = validate_with_gemini(question, model_response, ground_truth)
        validation_result = validation_result.split("\n")[0].strip().lower()  # Parse the first line for "correct" or "incorrect"

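        # Count the answer only when the first line gives a positive verdict ("incorrect" also contains "correct")
        if "correct" in validation_result and "incorrect" not in validation_result: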
            correct_count += 1

        # Record time taken for the question
        question_end_time = time.time()
        question_time = question_end_time - question_start_time
        question_times.append(question_time)

        if i < num_to_print:
            print(f"Question {i + 1}: {question}")
            print(f"Ground Truth: {ground_truth}")
            print(f"RAG Response: {model_response}")
            print(f"Validation Result: {validation_result}")
            print(f"Time Taken: {question_time:.2f} seconds\n")
            print("-" * 80)

    except (RetryError, EmptyResponseError) as e:
        # RetryError: the generator exhausted its 2 attempts; EmptyResponseError: the validator (not retried) returned nothing
        print(f"Failed to get a response for question {i + 1}; skipping it.")
        last_exception = e.last_attempt.exception() if isinstance(e, RetryError) else e
        print(f"Last exception: {last_exception}")
        # print("Waiting 5 minutes before continuing to prevent further quota issues.")
        # time.sleep(300)
        continue  # Skip to the next question

# End timing for the entire evaluation
evaluation_end_time = time.time()
total_evaluation_time = evaluation_end_time - evaluation_start_time

# Calculate accuracy
accuracy = (correct_count / total) * 100  # Calculate as a percentage
model_results["RAG"] = {
    "Total Questions Evaluated": total,
    "Correctly Matched Answers": correct_count,
    "Accuracy": accuracy,  # Store accuracy as a percentage
    "Total Time (seconds)": total_evaluation_time,
    "Average Time per Question (seconds)": sum(question_times) / len(question_times) if question_times else 0,
}

# Print results
print("\nFinal Results for RAG System:")
for model_name, results in model_results.items():
    print(f"Model: {model_name}")
    print(f"Total Questions Evaluated: {results['Total Questions Evaluated']}")
    print(f"Correctly Matched Answers: {results['Correctly Matched Answers']}")
    print(f"Accuracy: {results['Accuracy']:.2f}%")
    print(f"Total Time: {results['Total Time (seconds)']:.2f} seconds")
    print(f"Average Time per Question: {results['Average Time per Question (seconds)']:.2f} seconds")
    print("=" * 100)

Evaluating RAG system

Processing question 1 out of 30
Question 1: What is required for a 16-year-old to donate blood?
Ground Truth: Written permission or authorization from their parent or guardian.
RAG Response: Yes.

A 16-year-old can donate blood in Illinois if they obtain written permission or authorization from their parent or guardian. This is explicitly stated in the Blood Donation Act (210 ILCS 15/1).
Validation Result: correct
Time Taken: 5.69 seconds

--------------------------------------------------------------------------------
Processing question 2 out of 30
Question 2: What change in blood donation rules takes effect on January 1, 2025?
Ground Truth: 17-year-olds can also have their blood typed.
RAG Response: Public Act 103-669, effective January 1, 2025, amends the Blood Donation Act (210 ILCS 15/1) to explicitly include blood typing as something 17-year-olds (and older) can consent to without parental permission, and 16-year-olds can consent to with written parental p

## Important Note on LLM Evaluation

While the LLM provides valuable assistance in evaluating model responses and their correctness, it is not guaranteed to be 100% accurate. We highly recommend **double-checking the evaluation results manually** to ensure reliability and accuracy.

The **final decision** should always be made by a human, using the LLM's evaluation as a reference rather than the sole determinant.
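
One way to support this manual check is to separate judge responses with a clear verdict from those that need a human look. The sketch below is an illustration under stated assumptions, not the notebook's own method: `parse_verdict`, `triage`, and the sample judge outputs are hypothetical. It inspects the first and last non-empty lines of a judge response and flags anything else as `unclear`.

```python
import re


def parse_verdict(judge_text):
    """Return 'correct', 'incorrect', or 'unclear' from a judge response."""
    lines = [ln.strip() for ln in judge_text.splitlines() if ln.strip()]
    # Check the first and last non-empty lines, where a verdict is most likely to sit.
    for line in lines[:1] + lines[-1:]:
        cleaned = re.sub(r"[^a-z]", "", line.lower())  # drop quotes, asterisks, punctuation
        if cleaned in ("correct", "incorrect"):
            return cleaned
    return "unclear"


def triage(records):
    """Split (question, judge_text) pairs into auto-scored and needs-review lists."""
    scored, review = [], []
    for question, judge_text in records:
        verdict = parse_verdict(judge_text)
        (review if verdict == "unclear" else scored).append((question, verdict))
    return scored, review


# Hypothetical judge outputs for illustration only.
records = [
    ("Q1", "Correct\nThe answer matches the ground truth."),
    ("Q2", "The answer contradicts the statute.\n**Incorrect**"),
    ("Q3", "The answer is partially right."),
]
scored, review = triage(records)
print("scored:", scored)
print("needs review:", review)
```

Requiring the whole line to be a bare verdict is deliberately strict: a line such as "the answer is not incorrect" lands in the review list instead of being scored automatically.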


# Benchmark Analysis: RAPTOR Results

This benchmark evaluates the performance of the RAPTOR system on binary and short-answer questions. **Note**: The results below were manually verified. Outcomes may vary slightly between runs because the outputs of the underlying models are not fully deterministic.

## Binary Questions
- **Average Time per Question**: 11.5 seconds
- **Questions Evaluated**: 15
- **Correctly Answered**: 11
- **Accuracy**: 73.3%

## Short Answer Questions
- **Average Time per Question**: 14.0 seconds
- **Questions Evaluated**: 30
- **Correctly Answered**: 25
- **Accuracy**: 83.3%
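
The percentages above follow directly from the counts. As a worked check, the sketch below recomputes them and also derives a combined accuracy over all 45 questions. The combined figure is simple arithmetic on the reported counts and is not a result stated in the original benchmark; the `results` dictionary and `accuracy_pct` helper are illustrative names.

```python
# Counts and timings copied from the two result lists above.
results = {
    "binary": {"evaluated": 15, "correct": 11, "avg_seconds": 11.5},
    "short_answer": {"evaluated": 30, "correct": 25, "avg_seconds": 14.0},
}


def accuracy_pct(correct, evaluated):
    """Accuracy as a percentage of evaluated questions."""
    return 100.0 * correct / evaluated


for name, r in results.items():
    print(f"{name}: {accuracy_pct(r['correct'], r['evaluated']):.1f}%")

# Combined accuracy across both question types (derived, not reported in the source).
total_evaluated = sum(r["evaluated"] for r in results.values())
total_correct = sum(r["correct"] for r in results.values())
overall = accuracy_pct(total_correct, total_evaluated)
print(f"overall: {overall:.1f}% ({total_correct}/{total_evaluated})")
# Prints:
# binary: 73.3%
# short_answer: 83.3%
# overall: 80.0% (36/45)
```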
