# Simple RAG Benchmark
### Benchmark Evaluation for Illinois Statutes RAG System

To evaluate our **Illinois Statutes RAG system**, we wrote a custom set of benchmark questions, because we found no pre-existing Illinois-specific question set. The questions focus primarily on the following areas:

- **Education**
- **Health**
- **Regulation**

These areas address policies, public safety, and compliance rules.

## Question Design

### Total Questions: 45

#### 1. **True/False Statements** (15 Questions)
- Assess whether a given statement aligns with the statutes.
- **Example Question**: Can a licensed currency exchange operate at multiple locations under the same license?  
  **Example Answer**: No.

#### 2. **Short-Answer Questions** (30 Questions)
- Fact-based questions requiring specific answers about acts or provisions.
- **Example Question**: What is required for a 16-year-old to donate blood?  
  **Example Answer**: Written permission or authorization from their parent or guardian.
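
The cells further down read each question set from a JSON file that holds two parallel lists under the keys `questions` and `answers`. A minimal sketch of that layout, using the two example questions above; the file name `example_benchmark.json` is hypothetical, and the notebook itself keeps the two question types in separate files:

```python
import json
import os
import tempfile

# Two parallel lists: answers[i] is the ground truth for questions[i].
benchmark = {
    "questions": [
        "Can a licensed currency exchange operate at multiple locations under the same license? (yes or no)",
        "What is required for a 16-year-old to donate blood?",
    ],
    "answers": [
        "No.",
        "Written permission or authorization from their parent or guardian.",
    ],
}

# Hypothetical file name, written to a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "example_benchmark.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(benchmark, f, indent=2)

with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

# The lists must stay the same length, otherwise zip() silently drops items.
assert len(loaded["questions"]) == len(loaded["answers"])
pairs = list(zip(loaded["questions"], loaded["answers"]))
```

Checking the two list lengths before evaluation is a cheap guard: a missing answer would otherwise shift or drop question/answer pairs without any error.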


In [34]:
# Importing necessary libraries and modules
import json
import vertexai
from vertexai.preview.generative_models import GenerativeModel
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_google_community import BigQueryVectorStore
from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferMemory


In [35]:
# Initialize Vertex AI
PROJECT_ID = "mlds-cap-2024-lexlead-advisor"  
REGION = "us-central1"         
vertexai.init(project=PROJECT_ID, location=REGION)

# Load the Gemini 1.5 Pro and Flash models (for text-based legal question-answering)
generative_model_gemini_15_pro = GenerativeModel("gemini-1.5-pro-002")
generative_model_gemini_15_non_pro = GenerativeModel("gemini-1.5-flash-002")

models = [
    ("Gemini 1.5 Pro", generative_model_gemini_15_pro), 
    ("Gemini 1.5 Flash", generative_model_gemini_15_non_pro)
]

In [36]:
# Initialize the VertexAI Embedding model
embedding_model = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest", 
    project=PROJECT_ID
)

# Initialize the BigQuery Vector Store
bq_store = BigQueryVectorStore(
    project_id=PROJECT_ID,
    dataset_name="test_IL",
    table_name="statues",
    location=REGION,
    embedding=embedding_model
)

BigQuery table mlds-cap-2024-lexlead-advisor.test_IL.statues initialized/validated as persistent storage. Access via BigQuery console:
 https://console.cloud.google.com/bigquery?project=mlds-cap-2024-lexlead-advisor&ws=!1m5!1m4!4m3!1smlds-cap-2024-lexlead-advisor!2stest_IL!3sstatues


### Example Implementation

In [125]:
# Initialize the Language Model (LLM) using Vertex AI
llm = VertexAI(model_name="gemini-1.5-pro-002")

# Set up the retriever with a BigQuery vector store
retriever = bq_store.as_retriever(search_kwargs={"k": 200})

# Create a memory buffer for conversational context
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Build a Conversational Retrieval Chain
conversational_retrieval = ConversationalRetrievalChain.from_llm(
    llm=llm, retriever=retriever, memory=memory
)
# Define the search query for the conversational retrieval system
search_query = "Does the definition of tattooing include procedures like permanent makeup for lip coloring or eyeliner? (Yes or No)"

# Execute the query and retrieve the answer
conversational_retrieval.invoke(search_query)["answer"]

'Yes\n'

## Section 1

### True/False Statement Questions Overview

In [130]:
file_path = "binary_benchmark_question.json"

# Read the JSON file
with open(file_path, "r") as file:
    data = json.load(file)

# Access questions and answers
questions = data.get("questions", [])
answers = data.get("answers", [])

# Print all questions and answers
for i, (question, answer) in enumerate(zip(questions, answers), start=1):
    print(f"Question {i}: {question}")
    print(f"Answer {i}: {answer}\n")



Question 1: Are licenses required to operate community or ambulatory currency exchanges? (yes or no)
Answer 1: Yes.

Question 2: Can a licensed currency exchange operate at multiple locations under the same license? (yes or no)
Answer 2: No.

Question 3: Is it mandatory for a single female borrower to have a cosigner if a single male borrower under similar conditions does not require one? (yes or no)
Answer 3: No.

Question 4: Can campground licenses be transferred to another operator? (yes or no)
Answer 4: No.

Question 5: Are campus media produced by students at state-sponsored higher education institutions considered public forums for student expression? (yes or no)
Answer 5: Yes

Question 6: Are Illinois residents obligated to submit ACT or SAT scores when applying to public universities in the state?(yes or no)
Answer 6: No.

Question 7: Are institutions allowed to implement a more lenient transcript policy than the minimum requirements?(yes or no)
Answer 7: Yes.

Question 8: Does

### Benchmark Analysis on True/False Questions

In [131]:
import time
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type, RetryError
from google.api_core.exceptions import ResourceExhausted

# Initialize the RAG system
retriever = bq_store.as_retriever(search_kwargs={"k": 250})
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversational_retrieval = ConversationalRetrievalChain.from_llm(
    llm=VertexAI(model_name="gemini-1.5-pro-002"), retriever=retriever, memory=memory
)  # Combine LLM, retriever, and memory for RAG-based interactions

# Custom exception for empty or filtered responses
class EmptyResponseError(Exception):
    pass

# Define a function to generate responses with retry logic
@retry(
    stop=stop_after_attempt(2),
    wait=wait_exponential(multiplier=1, min=5, max=120),
    retry=retry_if_exception_type((ResourceExhausted, EmptyResponseError))
)
def generate_with_retry_rag(prompt):
    """
    Generate a response using the RAG system with retry logic.
    
    Args:
        prompt (str): Input prompt to the RAG system.
    
    Returns:
        str: Normalized response from the RAG system.
    """
    try:
        response = conversational_retrieval.invoke(prompt)["answer"]

        # Check if the response is not empty
        if not response or not response.strip():
            raise EmptyResponseError("RAG response is empty or blocked.")
        
        # Normalize and check for "I don't know" or similar insufficient responses
        normalized_response = response.strip().lower()
        if is_response_insufficient(normalized_response):
            raise EmptyResponseError("RAG response indicates insufficient information.")

        return normalized_response  # Return the valid response

    except Exception as e:
        # Wrap any failure in EmptyResponseError so that tenacity retries it
        print(f"Error encountered while generating RAG response: {e}")
        raise EmptyResponseError("RAG content is blocked or empty.") from e


def is_response_insufficient(response):
    """
    Determines if the response is insufficient based on predefined criteria.
    
    Args:
        response (str): The normalized response text.
    
    Returns:
        bool: True if the response is insufficient, False otherwise.
    """
    insufficient_phrases = [
        "i don't know",
        "unable to find",
        "not sure",
        "cannot determine",
        "no information available",
        "please provide more context",
        "i cannot answer",
        "cannot be answered"
    ]
    # Check if response contains any insufficient phrases
    return any(phrase in response for phrase in insufficient_phrases)

               
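# Retry on quota errors or empty judge responses (at most 2 attempts, exponential backoff)
@retry(
    stop=stop_after_attempt(2),
    wait=wait_exponential(multiplier=1, min=5, max=120),
    retry=retry_if_exception_type((ResourceExhausted, EmptyResponseError))
)
def validate_with_gemini(question, generated_answer, ground_truth):
    """
    Validate the generated answer using Gemini 1.5 Pro with retry logic.
    """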
    gemini_model = generative_model_gemini_15_pro

    prompt = f"""
    Question: {question}

    Generated Answer: {generated_answer}

    Ground Truth Answer: {ground_truth}

    Evaluate the generated answer. Determine if it is correct or incorrect based on the ground truth. It is normal for the generated answer to provide more information than the ground truth. If the generated answer conveys the same meaning as the ground truth or provides additional relevant details without contradicting the ground truth, consider it correct. Minor differences in wording or phrasing should not be penalized. Output one of the following on the first line, then provide reasoning for your evaluation:
    - "Correct" if the generated answer is consistent with the ground truth in meaning, even if the phrasing differs or the answer provides additional details.
    - "Incorrect" if the generated answer is inconsistent with the ground truth or introduces contradictions.

    """
    response = gemini_model.generate_content([prompt])

    # Ensure the response contains text
    if not response or not hasattr(response, 'text') or not response.text.strip():
        raise EmptyResponseError("Validation response is empty or blocked.")

    return response.text.strip() # Return the validation result


# Main evaluation loop
model_results = {} # Store results for different models
num_to_print = 20 # Number of questions to print details for

# Start RAG benchmarking
print("Evaluating RAG system\n")
total = len(questions)  # Total number of questions
correct_count = 0  # Reset correct count for RAG
question_times = []  # Track time taken for each question

evaluation_start_time = time.time()  # Start timing for the entire evaluation

for i in range(total):
    print(f"Processing question {i + 1} out of {total}")  # Track which question is being processed
    question_start_time = time.time()  # Start timing for the current question

    question = questions[i]
    ground_truth = answers[i]

    prompt = f"""
    You are an expert in Illinois Education Law who is answering a client's legal question professionally. Answer the following question by reasoning step-by-step.
    For questions labeled (yes or no), answer yes or no before reasoning. If the text does not provide sufficient information for a clear answer, just answer "i don't know" before providing more information.

    Question: {question}
    """

    try:
        # Generate and validate the response
        model_response = generate_with_retry_rag(prompt)
        validation_result = validate_with_gemini(question, model_response, ground_truth)
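        # Keep only the first line, which should carry the verdict ("correct" or "incorrect")
        validation_result = validation_result.split("\n")[0].strip().lower()

        # Count a question as correct only on an explicit "correct" verdict;
        # "incorrect" contains "correct" as a substring, so rule it out as well
        if "correct" in validation_result and "incorrect" not in validation_result:
            correct_count += 1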

        # Record time taken for the question
        question_end_time = time.time()
        question_time = question_end_time - question_start_time
        question_times.append(question_time)

        # Print question details if within the specified range
        if i < num_to_print:
            print(f"Question {i + 1}: {question}")
            print(f"Ground Truth: {ground_truth}")
            print(f"RAG Response: {model_response}")
            print(f"Validation Result: {validation_result}")
            print(f"Time Taken: {question_time:.2f} seconds\n")
            print("-" * 80)

    except RetryError as e:
        # Both attempts failed; report the last error and move on
        print(f"Failed to get a response for question {i + 1} after 2 attempts.")
        last_exception = e.last_attempt.exception()
        print(f"Last exception: {last_exception}")
        # print("Waiting 5 minutes before continuing to prevent further quota issues.")
        # time.sleep(300)
        continue  # Skip to the next question if retries are exhausted

# End timing for the entire evaluation
evaluation_end_time = time.time()
total_evaluation_time = evaluation_end_time - evaluation_start_time

# Calculate accuracy
accuracy = (correct_count / total) * 100  # Calculate as a percentage
model_results["RAG"] = {
    "Total Questions Evaluated": total,
    "Correctly Matched Answers": correct_count,
    "Accuracy": accuracy,  # Store accuracy as a percentage
    "Total Time (seconds)": total_evaluation_time,
    "Average Time per Question (seconds)": sum(question_times) / len(question_times) if question_times else 0,
}

# Print results
print("\nFinal Results for RAG System:")
for model_name, results in model_results.items():
    print(f"Model: {model_name}")
    print(f"Total Questions Evaluated: {results['Total Questions Evaluated']}")
    print(f"Correctly Matched Answers: {results['Correctly Matched Answers']}")
    print(f"Accuracy: {results['Accuracy']:.2f}%")
    print(f"Total Time: {results['Total Time (seconds)']:.2f} seconds")
    print(f"Average Time per Question: {results['Average Time per Question (seconds)']:.2f} seconds")
    print("=" * 100)



Evaluating RAG system

Processing question 1 out of 15
Question 1: Are licenses required to operate community or ambulatory currency exchanges? (yes or no)
Ground Truth: Yes.
RAG Response: yes.

the provided text explicitly states in sec. 2. license required; violation; injunction of the currency exchange act (205 ilcs 405/2) that:

"no person, firm, association, partnership, limited liability company, or corporation shall engage in the business of a community currency exchange or in the business of an ambulatory currency exchange without first securing a license to do so from the secretary."
Validation Result: correct. the generated answer correctly identifies that a license *is* required to operate community or ambulatory currency exchanges.  the provided text excerpt from the illinois currency exchange act explicitly confirms this.  the additional context from the legal statute strengthens the answer.
Time Taken: 40.36 seconds

-------------------------------------------------------
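
The evaluation loop above reduces the judge's free-text reply to a verdict by reading only its first line. A small sketch of that step on made-up judge replies; the helper name `parse_verdict` and the sample strings are illustrative assumptions, not part of the notebook:

```python
def parse_verdict(judge_reply: str) -> bool:
    """Return True only when the first line of the reply says 'correct'."""
    first_line = judge_reply.split("\n")[0].strip().lower()
    # "incorrect" contains "correct" as a substring, so rule it out explicitly.
    return "correct" in first_line and "incorrect" not in first_line


# Hypothetical judge replies used only to exercise the helper.
samples = [
    "Correct. The answer matches the ground truth.",
    "Incorrect\nThe answer contradicts the statute.",
    "",  # an empty first line carries no verdict
]
verdicts = [parse_verdict(s) for s in samples]
print(verdicts)
```

In this sketch, a reply whose first line carries no verdict counts as not correct, so a malformed judge reply cannot raise the measured accuracy.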

## Section 2

### Short-Answer Questions Overview

In [134]:
file_path = "shortanswer_benchmark_question.json"

# Read the JSON file
with open(file_path, "r") as file:
    data = json.load(file)

# Access questions and answers
questions = data.get("questions", [])
answers = data.get("answers", [])

# Example: Print all questions and answers
for i, (question, answer) in enumerate(zip(questions, answers), start=1):
    print(f"Question {i}: {question}")
    print(f"Answer {i}: {answer}\n")


Question 1: What is required for a 16-year-old to donate blood?
Answer 1: Written permission or authorization from their parent or guardian.

Question 2: What change in blood donation rules takes effect on January 1, 2025?
Answer 2: 17-year-olds can also have their blood typed.

Question 3: What is the minimum distance from entrances, exits, windows, and ventilation intakes where smoking is prohibited under the Smoke-Free Illinois Act?
Answer 3: 15 feet

Question 4: What penalties are imposed on individuals who smoke in prohibited areas under the Smoke-Free Illinois Act?
Answer 4: Under the Smoke Free Illinois Act, a person who smokes in a prohibited area is liable for a civil penalty of $100 for the first offense and $250 for each subsequent offense.

Question 5: Which department is responsible for administering the Safe Bottled Water Act?
Answer 5: The Department of Public Health.

Question 6: How frequently must the Department of Public Health inspect water-bottling plants and priva

### Benchmark Analysis on Short-Answer Questions

In [135]:
import time
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type, RetryError
from google.api_core.exceptions import ResourceExhausted
# Initialize the RAG system
retriever = bq_store.as_retriever(search_kwargs={"k": 250})
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversational_retrieval = ConversationalRetrievalChain.from_llm(
    llm=VertexAI(model_name="gemini-1.5-pro-002"), retriever=retriever, memory=memory
)

# Custom exception for empty or filtered responses
class EmptyResponseError(Exception):
    pass

# Function to generate a response using the RAG system with retry logic
@retry(
    stop=stop_after_attempt(2),
    wait=wait_exponential(multiplier=1, min=5, max=120),
    retry=retry_if_exception_type((ResourceExhausted, EmptyResponseError))
)
def generate_with_retry_rag(prompt):
    """
    Generate a response using the RAG system, with retries for errors.
    
    Args:
        prompt (str): Input query for the RAG system.
    
    Returns:
        str: Normalized response from the RAG system.
    
    Raises:
        EmptyResponseError: If the response is empty, blocked, or insufficient.
    """
    try:
        # Invoke the RAG system to generate a response
        response = conversational_retrieval.invoke(prompt)["answer"]

        # Check if the response is not empty
        if not response or not response.strip():
            raise EmptyResponseError("RAG response is empty or blocked.")
        
        # Normalize and validate the response
        normalized_response = response.strip().lower()
        if is_response_insufficient(normalized_response):
            raise EmptyResponseError("RAG response indicates insufficient information.")
        
        return normalized_response  # Return the valid response

    except Exception as e:
        # Wrap any failure in EmptyResponseError so that tenacity retries it
        print(f"Error encountered while generating RAG response: {e}")
        raise EmptyResponseError("RAG content is blocked or empty.") from e


# Function to check if a response is insufficient based on predefined criteria
def is_response_insufficient(response):
    """
    Determines if the response is insufficient based on predefined criteria.
    
    Args:
        response (str): The normalized response text.
    
    Returns:
        bool: True if the response is insufficient, False otherwise.
    """
    insufficient_phrases = [
        "i don't know",
        "unable to find",
        "not sure",
        "cannot determine",
        "no information available",
        "please provide more context",
        "i cannot answer",
        "cannot be answered"
    ]
    # Check if response contains any insufficient phrases
    return any(phrase in response for phrase in insufficient_phrases)

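# Function to validate the RAG-generated answer using the Gemini model.
# Retries on quota errors or empty judge responses (at most 2 attempts, exponential backoff).
@retry(
    stop=stop_after_attempt(2),
    wait=wait_exponential(multiplier=1, min=5, max=120),
    retry=retry_if_exception_type((ResourceExhausted, EmptyResponseError))
)
def validate_with_gemini(question, generated_answer, ground_truth):
    """
    Validate the generated answer using Gemini 1.5 Pro with retry logic.
    """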
    gemini_model = generative_model_gemini_15_pro

    prompt = f"""
    Question: {question}

    Generated Answer: {generated_answer}

    Ground Truth Answer: {ground_truth}

    Evaluate the generated answer. Determine if it is correct or incorrect based on the ground truth. It is normal for the generated answer to provide more information than the ground truth. If the generated answer conveys the same meaning as the ground truth or provides additional relevant details without contradicting the ground truth, consider it correct. Minor differences in wording or phrasing should not be penalized. Output one of the following on the first line, then provide reasoning for your evaluation:
    - "Correct" if the generated answer is consistent with the ground truth in meaning, even if the phrasing differs or the answer provides additional details.
    - "Incorrect" if the generated answer is inconsistent with the ground truth or introduces contradictions.

    """
    response = gemini_model.generate_content([prompt])

    # Ensure the response contains text
    if not response or not hasattr(response, 'text') or not response.text.strip():
        raise EmptyResponseError("Validation response is empty or blocked.")

    return response.text.strip() # Return the validation result


# Main evaluation loop
model_results = {} # Dictionary to store evaluation results
num_to_print = 30 # Number of question details to print for inspection

# Start RAG benchmarking
print("Evaluating RAG system\n")
total = len(questions) # Total number of questions to evaluate
correct_count = 0  # Reset correct count for RAG
question_times = []  # Track time taken for each question

# Start timing for the entire evaluation
evaluation_start_time = time.time()

for i in range(total):
    print(f"Processing question {i + 1} out of {total}")  # Track which question is being processed
    question_start_time = time.time()  # Start timing for the current question

    question = questions[i]
    ground_truth = answers[i]
    # Generate a prompt for the RAG system
    prompt = f"""
    You are an expert in Illinois Education Law who is answering a client's legal question professionally. Answer the following question by reasoning step-by-step.
    For questions labeled (yes or no), answer yes or no before reasoning. If the text does not provide sufficient information for a clear answer, just answer "i don't know" before providing more information.

    Question: {question}
    """

    try:
        # Generate a response and validate it
        model_response = generate_with_retry_rag(prompt)
        validation_result = validate_with_gemini(question, model_response, ground_truth)
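        # Keep only the first line, which should carry the verdict ("correct" or "incorrect")
        validation_result = validation_result.split("\n")[0].strip().lower()

        # Count a question as correct only on an explicit "correct" verdict;
        # "incorrect" contains "correct" as a substring, so rule it out as well
        if "correct" in validation_result and "incorrect" not in validation_result:
            correct_count += 1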

        # Record time taken for the question
        question_end_time = time.time()
        question_time = question_end_time - question_start_time
        question_times.append(question_time)
        
        # Print details for the first `num_to_print` questions
        if i < num_to_print:
            print(f"Question {i + 1}: {question}")
            print(f"Ground Truth: {ground_truth}")
            print(f"RAG Response: {model_response}")
            print(f"Validation Result: {validation_result}")
            print(f"Time Taken: {question_time:.2f} seconds\n")
            print("-" * 80)

    except RetryError as e:
        # Both attempts failed; report the last error and move on
        print(f"Failed to get a response for question {i + 1} after 2 attempts.")
        last_exception = e.last_attempt.exception()
        print(f"Last exception: {last_exception}")
        # print("Waiting 5 minutes before continuing to prevent further quota issues.")
        # time.sleep(300)
        continue  # Skip to the next question if retries are exhausted

# End timing for the entire evaluation
evaluation_end_time = time.time()
total_evaluation_time = evaluation_end_time - evaluation_start_time

# Calculate accuracy
accuracy = (correct_count / total) * 100  # Calculate as a percentage
model_results["RAG"] = {
    "Total Questions Evaluated": total,
    "Correctly Matched Answers": correct_count,
    "Accuracy": accuracy,  # Store accuracy as a percentage
    "Total Time (seconds)": total_evaluation_time,
    "Average Time per Question (seconds)": sum(question_times) / len(question_times) if question_times else 0,
}

# Display the final evaluation results
print("\nFinal Results for RAG System:")
for model_name, results in model_results.items():
    print(f"Model: {model_name}")
    print(f"Total Questions Evaluated: {results['Total Questions Evaluated']}")
    print(f"Correctly Matched Answers: {results['Correctly Matched Answers']}")
    print(f"Accuracy: {results['Accuracy']:.2f}%")
    print(f"Total Time: {results['Total Time (seconds)']:.2f} seconds")
    print(f"Average Time per Question: {results['Average Time per Question (seconds)']:.2f} seconds")
    print("=" * 100)



Evaluating RAG system

Processing question 1 out of 30
Error encountered while generating RAG response: RAG response indicates insufficient information.
Question 1: What is required for a 16-year-old to donate blood?
Ground Truth: Written permission or authorization from their parent or guardian.
RAG Response: a 16-year-old may donate blood with written permission or authorization from their parent or guardian.
Validation Result: correct
Time Taken: 102.45 seconds

--------------------------------------------------------------------------------
Processing question 2 out of 30
Question 2: What change in blood donation rules takes effect on January 1, 2025?
Ground Truth: 17-year-olds can also have their blood typed.
RAG Response: 16-year-olds will be able to donate blood with written parental/guardian permission.
Validation Result: incorrect
Time Taken: 71.31 seconds

--------------------------------------------------------------------------------
Processing question 3 out of 30
Question

## Important Note on LLM Evaluation

While the LLM provides valuable assistance in evaluating model responses and their correctness, it is not guaranteed to be 100% accurate. We highly recommend **double-checking the evaluation results manually** to ensure reliability and accuracy.

The **final decision** should always be made by a human, using the LLM's evaluation as a reference rather than the sole determinant.
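
One way to support that manual check is to export each question with the LLM verdict and an empty column for the reviewer. The sketch below is an assumption about how this could be done, not code from the notebook: the record layout, the `human_verdict` column, and the in-memory buffer are all illustrative.

```python
import csv
import io

# Hypothetical per-question records; in the notebook these values would come
# from the evaluation loop (question, ground truth, RAG response, LLM verdict).
records = [
    {
        "question": "What is required for a 16-year-old to donate blood?",
        "ground_truth": "Written permission or authorization from their parent or guardian.",
        "rag_response": "a 16-year-old may donate blood with written permission or authorization from their parent or guardian.",
        "llm_verdict": "correct",
    },
]

fieldnames = ["question", "ground_truth", "rag_response", "llm_verdict", "human_verdict"]

# An in-memory buffer keeps the sketch self-contained; a real run would
# open a file with newline="" instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
for record in records:
    # Leave human_verdict blank so the reviewer fills it in.
    writer.writerow({**record, "human_verdict": ""})

review_sheet = buffer.getvalue()
print(review_sheet)
```

Keeping the LLM verdict and the human verdict in separate columns makes disagreements between the two easy to filter afterwards.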


# Benchmark Analysis: Simple RAG Results

This benchmark evaluates the performance of the Simple RAG system on binary and short-answer questions. **Note**: The results have been manually verified for accuracy. Additionally, the benchmark outcomes may vary slightly each time the code is executed due to the nature of the underlying models and system behavior.

## Binary Questions
- **Average Time per Question**: 27.9 seconds
- **Questions Evaluated**: 15
- **Correctly Answered**: 12
- **Accuracy**: 80.0%

## Short Answer Questions
- **Average Time per Question**: 37.9 seconds
- **Questions Evaluated**: 30
- **Correctly Answered**: 11
- **Accuracy**: 36.7%
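
The accuracy figures above follow directly from the reported counts. The short sketch below recomputes them; the pooled figure over all 45 questions is derived here for convenience and is not reported in the original results.

```python
# Counts reported in the summary above: (correct, evaluated).
results = {
    "binary": (12, 15),
    "short_answer": (11, 30),
}


def accuracy(correct: int, total: int) -> float:
    """Accuracy as a percentage; 0.0 when nothing was evaluated."""
    return 100.0 * correct / total if total else 0.0


per_set = {name: round(accuracy(c, t), 1) for name, (c, t) in results.items()}

# Pooled accuracy over both question sets (derived, not reported in the source).
total_correct = sum(c for c, _ in results.values())
total_questions = sum(t for _, t in results.values())
overall = round(accuracy(total_correct, total_questions), 1)

print(per_set, overall)
```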
