# Evaluate Test Set with Automated Evaluation

Based on the Python package [`judges`](https://pypi.org/project/judges/)

GitHub repo: [`judges`](https://github.com/quotient-ai/judges)

Blog post: [**Introducing judges: A Library of Research-Backed LLM-as-a-Judge Evaluators**](https://www.quotientai.co/post/introducing-judges-a-library-of-research-backed-llm-as-a-judge-evaluators)

In [1]:
import json
import pandas as pd
import random
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

from judges.graders.response_quality import MTBenchChatBotResponseQuality
from judges.graders.relevance import ReliableCIRelevance
from judges.classifiers.correctness import PollMultihopCorrectness

In [2]:
VECTOR_DB_FP = "vector_stores/vector_store_faiss_openai"
TEST_SET_FP = "test_sets/baseline_test_set.json"


In [3]:
# setup embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [4]:
# instantiate the judges
judge_response_quality = MTBenchChatBotResponseQuality(model="gpt-4o")
# judge_relevance = ReliableCIRelevance(model="gpt-4o")   # disabled: raises an error about a missing 'context' parameter
judge_correctness = PollMultihopCorrectness(model="gpt-4o")

judges = [judge_response_quality, judge_correctness]

In [5]:
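# restore the vector store from disk; allow_dangerous_deserialization is required because
# the saved index is unpickled on load, so only load files you created or trust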
vector_store = FAISS.load_local(
    VECTOR_DB_FP, embeddings, allow_dangerous_deserialization=True
)


In [6]:
# retrieve the test set
with open(TEST_SET_FP, "r") as f:
    test_set = json.load(f)

test_set 

[{'prompt': 'Provide a summary of space exploration',
  'target_response': '## **The Future of Space Exploration: Colonizing Mars and Beyond**  \n\n### **Introduction**  \nSpace exploration has captured human imagination for centuries. With recent advancements in rocketry and planetary science, interplanetary colonization is no longer a dream but a plausible reality.  \n\n### **Milestones in Space Exploration**  \nThe Space Race led to the first moon landing in 1969, and subsequent missions expanded our knowledge of the solar system. The International Space Station (ISS) demonstrated long-term human habitation in space, while private companies like SpaceX and Blue Origin have revitalized interest in space travel.  \n\n### **Colonizing Mars**  \nMars presents the most feasible option for colonization due to its relative proximity and similarities to Earth. Challenges include radiation exposure, lack of a breathable atmosphere, and low temperatures. Technologies such as in-situ resource 

In [7]:
def evalutate_test_set(this_test_set, this_judge):
    """Retrieve chunks for each test prompt, judge them, and return one result dict per test case."""
    # iterate over the test set and retrieve similar chunks
    this_test_results = []
    for this_test_case in this_test_set:
        query = this_test_case["prompt"]
        relevant_chunks = vector_store.similarity_search(query, k=2)

        # join the retrieved chunks into one string and print it
        print(f"\nQUERY: {query}, related material:")
        retrieved_data = "\n".join([chunk.page_content for chunk in relevant_chunks])
        print(f"\nRETRIEVED: {retrieved_data}")

        # judge the retrieved chunks against the target response
        # (positional arguments: query, retrieved chunks, target response)
        target_response = this_test_case["target_response"]
        print(f"\nTARGET: {target_response}")
        quality = this_judge.judge(query, retrieved_data, target_response)
        print(f"\n>>>QUALITY: {quality}")

        # copy the judgment's attributes so the judgment object itself is not mutated
        this_test_result = dict(vars(quality))
        this_test_result.update(this_test_case)
        this_test_result["retrieved_data"] = retrieved_data

        this_test_results.append(this_test_result)

    return this_test_results
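
The function above merges the judgment's fields with the test case to form one result row. The sketch below isolates that merge step so it can be checked without calling a model. It assumes the judgment object exposes `score` and `reasoning` attributes, which the column selection later in this notebook implies; `FakeJudgment` and `build_result_row` are hypothetical names used only for illustration and are not part of `judges`.

```python
from dataclasses import dataclass


@dataclass
class FakeJudgment:
    """Hypothetical stand-in for the object a judge returns."""
    score: object
    reasoning: str


def build_result_row(judgment, test_case, retrieved_data):
    # Copy the attributes so the judgment object is left untouched.
    row = dict(vars(judgment))
    # Merge in the test-case fields ("prompt", "target_response").
    row.update(test_case)
    row["retrieved_data"] = retrieved_data
    return row


# Placeholder values, not real judge output.
judgment = FakeJudgment(score=8, reasoning="placeholder reasoning")
case = {"prompt": "Provide a summary of space exploration", "target_response": "..."}
row = build_result_row(judgment, case, "retrieved text")
print(sorted(row))
```

Because the row is built from a copy, the same judgment object can be inspected afterwards without carrying the extra test-case keys.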

## Baseline Test Set

In [8]:
for judge in judges:
    print(f"\n\n>>>RUNNING BASELINE TEST CASES FOR {judge.__class__.__name__}")
    test_results = evalutate_test_set(test_set, judge)
    test_df = pd.DataFrame(test_results)
    print(test_df.columns)
    # change order of columns
    test_df = test_df[
        [
            "prompt",
            "score",
            "retrieved_data",
            "target_response",
            "reasoning",
        ]
    ]
    test_df["judge"] = judge.__class__.__name__

    print(test_df)




>>>RUNNING BASELINE TEST CASES FOR MTBenchChatBotResponseQuality

QUERY: Provide a summary of space exploration, related material:

RETRIEVED: ## **The Future of Space Exploration: Colonizing Mars and Beyond**

### **Introduction** Space exploration has captured human imagination for centuries. With recent advancements in rocketry and planetary science, interplanetary colonization is no longer a dream but a plausible reality.

### **Milestones in Space Exploration** The Space Race led to the first moon landing in 1969, and subsequent missions expanded our knowledge of the solar system. The International Space Station (ISS) demonstrated long-term human habitation in space, while private companies like SpaceX and Blue Origin have revitalized interest in space travel.
### **Colonizing Mars** Mars presents the most feasible option for colonization due to its relative proximity and similarities to Earth. Challenges include radiation exposure, lack of a breathable atmosphere, and low tempe

## Reverse the test set target responses and evaluate with judges



In [9]:
prompts = [d["prompt"] for d in test_set]
target_responses = [d["target_response"] for d in test_set]

# reverse the order of the target responses
target_responses = target_responses[::-1]

# construct the permuted test set
permuted_test_set = [{"prompt": p, "target_response": r} for p, r in zip(prompts, target_responses)]

permuted_test_set

[{'prompt': 'Provide a summary of space exploration',
  'target_response': '### **Vaccines: A Key to Public Health**  \n\nVaccines are medical interventions designed to protect individuals from infectious diseases by stimulating the immune system to recognize and fight harmful pathogens. They contain weakened, inactivated, or genetically engineered components of a virus or bacteria, prompting the body to build immunity without causing illness.  \n\nVaccination has been instrumental in controlling and eradicating diseases such as polio, smallpox, and measles, saving millions of lives globally. Modern advancements in vaccine technology, such as mRNA vaccines, have accelerated the development of effective solutions for emerging diseases like COVID-19.  \n\nVaccines not only safeguard individuals but also contribute to **herd immunity**, reducing the spread of diseases within communities. Widespread immunization efforts are crucial for preventing outbreaks and protecting vulnerable populat
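
One property of the reversal above is worth checking: when a list has an odd number of items, reversing it leaves the middle item in place, so that prompt stays paired with its original target response. Whether this applies depends on the size of the test set, which is not shown here. The sketch below uses hypothetical placeholder prompts and responses to compare reversal with a rotation by one position; the helper names `pair` and `count_unchanged` are illustrative only.

```python
def pair(prompts, responses):
    # Build test cases in the same shape the notebook uses.
    return [{"prompt": p, "target_response": r} for p, r in zip(prompts, responses)]


def count_unchanged(original, permuted):
    # Count prompts that still carry their original target response.
    return sum(
        o["target_response"] == p["target_response"]
        for o, p in zip(original, permuted)
    )


# Hypothetical placeholder data with an odd number of cases.
prompts = ["q1", "q2", "q3"]
responses = ["r1", "r2", "r3"]

baseline = pair(prompts, responses)
reversed_set = pair(prompts, responses[::-1])
rotated_set = pair(prompts, responses[1:] + responses[:1])

print(count_unchanged(baseline, reversed_set))  # 1: the middle case is unchanged
print(count_unchanged(baseline, rotated_set))   # 0: every case is mismatched
```

If every prompt should receive a mismatched response, a rotation is one option, provided the test set has more than one case and the responses are distinct.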

In [10]:
for judge in judges:
    print(f"\n\n>>>RUNNING REVERSED TEST CASES FOR {judge.__class__.__name__}")

    permuted_test_results = evalutate_test_set(permuted_test_set, judge)
    permuted_test_df = pd.DataFrame(permuted_test_results)
    print(permuted_test_df.columns)
    # select and reorder the columns
    permuted_test_df = permuted_test_df[
        [
            "prompt",
            "score",
            "retrieved_data",
            "target_response",
            "reasoning",
        ]
    ]

    print(permuted_test_df)




>>>RUNNING REVERSED TEST CASES FOR MTBenchChatBotResponseQuality

QUERY: Provide a summary of space exploration, related material:

RETRIEVED: ## **The Future of Space Exploration: Colonizing Mars and Beyond**

### **Introduction** Space exploration has captured human imagination for centuries. With recent advancements in rocketry and planetary science, interplanetary colonization is no longer a dream but a plausible reality.

### **Milestones in Space Exploration** The Space Race led to the first moon landing in 1969, and subsequent missions expanded our knowledge of the solar system. The International Space Station (ISS) demonstrated long-term human habitation in space, while private companies like SpaceX and Blue Origin have revitalized interest in space travel.
### **Colonizing Mars** Mars presents the most feasible option for colonization due to its relative proximity and similarities to Earth. Challenges include radiation exposure, lack of a breathable atmosphere, and low tempe
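
The baseline and reversed runs print their results separately, so comparing them means reading two sets of output. A minimal sketch of one way to summarize both runs side by side follows. It assumes each result row is tagged with a judge name and a `variant` label (a hypothetical field, not present in the notebook), and that classifier-style judges return boolean scores while grader-style judges return numeric ones. The scores below are placeholders, not results from this notebook.

```python
from collections import defaultdict


def mean_score_by_group(rows):
    # Group scores by (judge, variant); booleans are averaged as 0.0 / 1.0.
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["judge"], row["variant"])].append(float(row["score"]))
    return {key: sum(values) / len(values) for key, values in grouped.items()}


# Placeholder rows for illustration only.
rows = [
    {"judge": "GraderJudge", "variant": "baseline", "score": 9},
    {"judge": "GraderJudge", "variant": "baseline", "score": 7},
    {"judge": "GraderJudge", "variant": "reversed", "score": 2},
    {"judge": "ClassifierJudge", "variant": "baseline", "score": True},
    {"judge": "ClassifierJudge", "variant": "reversed", "score": False},
]

summary = mean_score_by_group(rows)
for (judge_name, variant), mean_score in sorted(summary.items()):
    print(f"{judge_name:16s} {variant:9s} {mean_score:.2f}")
```

A judge that separates matched from mismatched responses should show a clear gap between its baseline and reversed means; a small gap suggests the judge is not sensitive to the mismatch.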