# Evaluate the Rewritten Test Set with an Automated Custom Judge

Based on the Python package [`judges`](https://pypi.org/project/judges/)

GitHub repo: [`judges`](https://github.com/quotient-ai/judges)

Blog post: [**Introducing judges: A Library of Research-Backed LLM-as-a-Judge Evaluators**](https://www.quotientai.co/post/introducing-judges-a-library-of-research-backed-llm-as-a-judge-evaluators)

In [1]:
import json
import pandas as pd
from textwrap import dedent
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

from judges.base import BaseJudge, Judgment
from judges.classifiers.correctness import PollMultihopCorrectness    

In [2]:
VECTOR_DB_FP = "vector_stores/vector_store_faiss_openai"
TEST_SET_FP = "test_sets/rewritten_test_set.json"


In [3]:
# set up embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [4]:
# define the custom judge
class SemanticAlignmentJudge(BaseJudge):
    r"""
    A judge that evaluates how closely an AI assistant's response to a user's question
    aligns semantically with a reference answer. It asks the model for a short explanation
    followed by a grade from 1 (not semantically aligned) to 10 (highly aligned).
    """


    def judge(
        self,
        input: str,
        output: str = None,
        expected: str = None,
    ) -> Judgment:
        """
        Judge the quality of the assistant's response to the user's question.


        Parameters:
        -----------
        input: str
            The user's question.
        output: str
            The AI assistant's response to the user's question.
        expected: str
            The expected (reference) response.


        Returns:
        --------
        Judgment
            An object containing the grade and explanation.
        """
        system_prompt = None


        # Construct the user prompt with the provided exact template
        user_prompt = dedent(
            f"""
            [System]
            Please act as an impartial judge and evaluate the quality of the response provided by an
            AI assistant to the user question displayed below. Your evaluation should consider 
            the semantic alignment of the response to a reference answer. Begin your 
            evaluation by providing a short explanation. Be as objective as
            possible. After providing your explanation, please rate the response on a scale of 1 to 10,
            where 1 is not semantically aligned at all and 10 has a high level of semantic alignment.  
            [Question]
            {input}
            [The Start of Assistant’s Answer]
            {output}
            [The End of Assistant’s Answer]
            [The Start of Reference Answer]
            {expected}
            [The End of Reference Answer]
            """
        )


        reasoning, score = self._judge(
            user_prompt=user_prompt,
            system_prompt=system_prompt,
        )


        return Judgment(reasoning=reasoning, score=score)
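
One detail of the prompt construction above is worth checking: `dedent` runs after the f-string has been filled in. If `output` or `expected` contains several lines that start at column 0, the text no longer has any common leading whitespace, so `dedent` strips nothing and the template lines keep their indentation. This is probably harmless to the model, but it is easy to avoid. The sketch below is a minimal illustration, not the notebook's code: `TEMPLATE` and `build_prompt` are hypothetical names, and the template is abbreviated to the bracketed sections.

```python
from textwrap import dedent

# Hypothetical, abbreviated template: dedent it once, before any values are inserted.
TEMPLATE = dedent(
    """
    [Question]
    {input}
    [The Start of Assistant's Answer]
    {output}
    [The End of Assistant's Answer]
    [The Start of Reference Answer]
    {expected}
    [The End of Reference Answer]
    """
).strip()


def build_prompt(input: str, output: str, expected: str) -> str:
    """Fill the already-dedented template so multi-line values cannot affect indentation."""
    return TEMPLATE.format(input=input, output=output, expected=expected)


multi_line = "First line\nSecond line"

# Dedenting after interpolation: "Second line" has no indentation,
# so the common margin is empty and nothing is stripped.
naive = dedent(
    f"""
    [Question]
    {multi_line}
    """
)

prompt = build_prompt(
    input="Provide a summary of space exploration",
    output=multi_line,
    expected="Reference line",
)
print(prompt)
```

Dedenting the template first and formatting second keeps the section markers flush left regardless of what the retrieved text looks like.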


In [5]:
judges = [
    SemanticAlignmentJudge(model="gpt-4o"),
    PollMultihopCorrectness(model="gpt-4o"),
]
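
For a dry run that makes no API calls, an object with the same `judge(input, output, expected)` signature can stand in for the LLM-backed judges. The following is a sketch under stated assumptions: `StubJudge` and `StubJudgment` are hypothetical names that are not part of the `judges` package, and the word-overlap score is a crude placeholder, not a measure of semantic alignment.

```python
from dataclasses import dataclass


@dataclass
class StubJudgment:
    """Hypothetical stand-in for the library's Judgment object."""

    reasoning: str
    score: int


class StubJudge:
    """Hypothetical offline judge: scores by word overlap instead of calling an LLM."""

    def judge(self, input: str, output: str = None, expected: str = None) -> StubJudgment:
        out_words = set((output or "").lower().split())
        exp_words = set((expected or "").lower().split())
        union = out_words | exp_words
        # Jaccard overlap in [0, 1]; defined as 0 when both texts are empty
        overlap = len(out_words & exp_words) / len(union) if union else 0.0
        score = 1 + round(overlap * 9)  # map the overlap onto the 1..10 scale
        return StubJudgment(reasoning=f"word overlap {overlap:.2f}", score=score)


stub = StubJudge()
same = stub.judge("q", "mars is cold", "mars is cold")
different = stub.judge("q", "mars is cold", "vaccines protect people")
print(same.score, different.score)
```

A stub like this is only useful for checking that the evaluation loop runs end to end; its scores say nothing about the quality of the real judges.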

In [6]:
# restore vector store
vector_store = FAISS.load_local(
    VECTOR_DB_FP, embeddings, allow_dangerous_deserialization=True
)


In [7]:
# retrieve test set
with open(TEST_SET_FP, "r") as f:
    test_set = json.load(f)

test_set 

[{'prompt': 'Provide a summary of space exploration',
  'target_response': "### The Future of Space Exploration: Colonizing Mars and Beyond\n\nSpace exploration is all about discovering new places beyond our Earth. For a long time, people have imagined what it would be like to travel to space. Now, with new rocket technology and better understanding of planets, living on other planets doesn't seem so impossible!\n\n### Milestones in Space Exploration\n\nSpace exploration really took off when humans landed on the moon in 1969. After that, we learned a lot more about the solar system through other space missions. We also have the International Space Station (ISS) where astronauts live and work in space for long periods. Companies like SpaceX and Blue Origin are making space travel exciting again!\n\n### Colonizing Mars\n\nMars is the planet where scientists think we might be able to live one day. It's not too far away and is somewhat like Earth. But Mars has its problems. There's a lot o
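
The evaluation that follows reads the `prompt` and `target_response` keys from every test case, so a quick structural check after loading can surface a malformed file early. This is a minimal sketch: `validate_test_set` and `REQUIRED_KEYS` are hypothetical names, and the sample entry uses placeholder text, not the real test set.

```python
import json

# Keys that every test case is expected to carry
REQUIRED_KEYS = ("prompt", "target_response")


def validate_test_set(test_set):
    """Raise if the loaded JSON is not a list of dicts with the required keys."""
    if not isinstance(test_set, list):
        raise TypeError("test set must be a list of test cases")
    for index, case in enumerate(test_set):
        if not isinstance(case, dict):
            raise TypeError(f"test case {index} is not a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in case]
        if missing:
            raise ValueError(f"test case {index} is missing keys: {missing}")
    return test_set


# Placeholder entry for illustration only
sample = json.loads(
    '[{"prompt": "Provide a summary of space exploration", "target_response": "reference text"}]'
)
validated = validate_test_set(sample)
print(len(validated))
```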

In [8]:
def evalutate_test_set(this_test_set, this_judge):
    # iterate over the test set and retrieve similar chunks
    this_test_results = []
    for this_test_case in this_test_set:
        query = this_test_case["prompt"]
        relevant_chunks = vector_store.similarity_search(query, k=2)

        # Print the retrieved chunks
        print(f"\nQUERY: {query}, related material:")
        retrieved_data = "\n".join([chunk.page_content for chunk in relevant_chunks])
        print(f"\nRETRIEVED: {retrieved_data}")

        # judge the quality of the response
        target_response = this_test_case["target_response"]
        print(f"\nTARGET: {target_response}")
        quality = this_judge.judge(query, retrieved_data, target_response)
        print(f"\n>>>QUALITY: {quality}")

        # copy the judgment's fields so the Judgment object itself is not mutated
        this_test_result = dict(quality.__dict__)
        this_test_result.update(this_test_case)
        this_test_result["retrieved_data"] = retrieved_data

        this_test_results.append(this_test_result)

    return this_test_results
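
When a judgment is flattened into a result row, it matters whether the row is a copy of the object's attributes or the live attribute dict itself: an object's `__dict__` is the live mapping, so updating it also adds attributes to the object. The sketch below illustrates the difference with a hypothetical `ExampleJudgment` dataclass and placeholder values; it is not the library's `Judgment` class.

```python
from dataclasses import dataclass


@dataclass
class ExampleJudgment:
    """Hypothetical judgment object used only for this illustration."""

    reasoning: str
    score: int


judgment = ExampleJudgment(reasoning="closely aligned", score=9)
test_case = {"prompt": "Provide a summary of space exploration", "target_response": "reference text"}

# vars() returns the live attribute dict; dict(...) makes an independent copy
record = dict(vars(judgment))
record.update(test_case)
record["retrieved_data"] = "retrieved text"

# The copy carries every field, while the judgment object is left untouched
print(sorted(record))
print(hasattr(judgment, "prompt"))
```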

## Baseline Test Set

In [9]:
for judge in judges:
    print(f"\n\n>>>Running test set evaluation for {judge.__class__.__name__}")
    test_results = evalutate_test_set(test_set, judge)
    test_df = pd.DataFrame(test_results)
    print(test_df.columns)
    # change order of columns
    test_df = test_df[
        [
            "prompt",
            "score",
            "retrieved_data",
            "target_response",
            "reasoning",
        ]
    ]
    test_df["judge"] = judge.__class__.__name__

    print(test_df)




>>>Running test set evaluation for SemanticAlignmentJudge

QUERY: Provide a summary of space exploration, related material:

RETRIEVED: ## **The Future of Space Exploration: Colonizing Mars and Beyond**

### **Introduction** Space exploration has captured human imagination for centuries. With recent advancements in rocketry and planetary science, interplanetary colonization is no longer a dream but a plausible reality.

### **Milestones in Space Exploration** The Space Race led to the first moon landing in 1969, and subsequent missions expanded our knowledge of the solar system. The International Space Station (ISS) demonstrated long-term human habitation in space, while private companies like SpaceX and Blue Origin have revitalized interest in space travel.
### **Colonizing Mars** Mars presents the most feasible option for colonization due to its relative proximity and similarities to Earth. Challenges include radiation exposure, lack of a breathable atmosphere, and low temperatures
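
The loop in cell 9 rebinds `test_results` and `test_df` on every pass, so only the last judge's results remain available afterwards. If the results of every judge are needed later, they can be collected in a dict keyed by judge class name. This is a sketch under assumptions: `collect_by_judge` is a hypothetical helper, and `FakeJudge` and `fake_evaluate` are stand-ins so the example runs without API calls.

```python
def collect_by_judge(judges, test_set, evaluate):
    """Run `evaluate(test_set, judge)` per judge and keep every result list."""
    results = {}
    for judge in judges:
        name = judge.__class__.__name__
        rows = evaluate(test_set, judge)
        # tag each row with the judge that produced it
        results[name] = [dict(row, judge=name) for row in rows]
    return results


class FakeJudge:
    """Hypothetical stand-in for a real judge."""


def fake_evaluate(test_set, judge):
    """Hypothetical evaluator that returns one row per test case with no score."""
    return [{"prompt": case["prompt"], "score": None} for case in test_set]


all_results = collect_by_judge(
    [FakeJudge()],
    [{"prompt": "Provide a summary of space exploration"}],
    fake_evaluate,
)
print(all_results)
```

The per-judge lists can then be concatenated into a single DataFrame if a combined table is wanted.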

## Reverse the test set target responses and evaluate with judges


In [10]:
prompts = [d["prompt"] for d in test_set]
target_responses = [d["target_response"] for d in test_set]

# reverse the order of the target responses
target_responses = target_responses[::-1]

# construct the permuted test set
permuted_test_set = [{"prompt": p, "target_response": r} for p, r in zip(prompts, target_responses)]

permuted_test_set

[{'prompt': 'Provide a summary of space exploration',
  'target_response': "Vaccines are like tiny shields that protect us from getting sick. They help our bodies learn how to fight off bad germs without making us sick. This stops illnesses like polio, smallpox, and measles from spreading and has saved many lives. \n\nBy getting vaccinated, not only do we keep ourselves safe, but we also help protect other people, like babies and older adults. This is called **herd immunity**. When more people get vaccines, diseases have less chance to spread around. \n\nVaccines have also helped us fight new diseases, like COVID-19, thanks to new technologies. They keep us safe and stop us from getting really sick. \n\nHowever, some people are still unsure about vaccines because of wrong information they hear. Doctors and health experts work hard to explain why vaccines are important. They help keep our communities healthy and prevent outbreaks. \n\nIn short, vaccines are super important for keeping u
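
Reversing the targets mismatches every prompt only when the test set has an even number of cases; with an odd number, the middle case keeps its own target. Rotating the targets by one position is one alternative that mismatches every case, provided the test set has at least two cases and the targets are distinct. The sketch below uses a hypothetical `rotate_targets` helper and placeholder entries.

```python
def rotate_targets(test_set, shift=1):
    """Pair each prompt with the target response `shift` positions further along."""
    prompts = [d["prompt"] for d in test_set]
    targets = [d["target_response"] for d in test_set]
    if not targets:
        return []
    shift %= len(targets)
    rotated = targets[shift:] + targets[:shift]
    return [{"prompt": p, "target_response": r} for p, r in zip(prompts, rotated)]


# Placeholder entries for illustration only
example_set = [
    {"prompt": "p1", "target_response": "t1"},
    {"prompt": "p2", "target_response": "t2"},
    {"prompt": "p3", "target_response": "t3"},
]

reversed_targets = [d["target_response"] for d in example_set][::-1]
rotated_set = rotate_targets(example_set)

# With three cases, reversal leaves the middle pairing intact; rotation does not
print(reversed_targets[1] == example_set[1]["target_response"])
print([d["target_response"] for d in rotated_set])
```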

In [11]:
for judge in judges:
    print(f"\n\n>>>Running permuted test set evaluation for {judge.__class__.__name__}")

    permuted_test_results = evalutate_test_set(permuted_test_set, judge)
    permuted_test_df = pd.DataFrame(permuted_test_results)
    print(permuted_test_df.columns)
    # change order of columns
    permuted_test_df = permuted_test_df[
        [
            "prompt",
            "score",
            "retrieved_data",
            "target_response",
            "reasoning",
        ]
    ]

    print(permuted_test_df)




>>>Running permuted test set evaluation for SemanticAlignmentJudge

QUERY: Provide a summary of space exploration, related material:

RETRIEVED: ## **The Future of Space Exploration: Colonizing Mars and Beyond**

### **Introduction** Space exploration has captured human imagination for centuries. With recent advancements in rocketry and planetary science, interplanetary colonization is no longer a dream but a plausible reality.

### **Milestones in Space Exploration** The Space Race led to the first moon landing in 1969, and subsequent missions expanded our knowledge of the solar system. The International Space Station (ISS) demonstrated long-term human habitation in space, while private companies like SpaceX and Blue Origin have revitalized interest in space travel.
### **Colonizing Mars** Mars presents the most feasible option for colonization due to its relative proximity and similarities to Earth. Challenges include radiation exposure, lack of a breathable atmosphere, and low tem
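
The permuted run acts as a negative control: a judge that is doing its job should score the mismatched pairs clearly lower than the baseline pairs. One way to make that comparison explicit is to compute the per-prompt score drop, as sketched below. `score_drops` is a hypothetical helper that assumes numeric scores and unique prompts, and the scores shown are placeholders, not output from the runs above.

```python
def score_drops(baseline_rows, permuted_rows):
    """Return baseline score minus permuted score for each prompt present in both runs."""
    permuted_by_prompt = {row["prompt"]: row["score"] for row in permuted_rows}
    return {
        row["prompt"]: row["score"] - permuted_by_prompt[row["prompt"]]
        for row in baseline_rows
        if row["prompt"] in permuted_by_prompt
    }


# Placeholder scores for illustration only, NOT results from this notebook
baseline_rows = [{"prompt": "q1", "score": 9}, {"prompt": "q2", "score": 8}]
permuted_rows = [{"prompt": "q1", "score": 2}, {"prompt": "q2", "score": 1}]

drops = score_drops(baseline_rows, permuted_rows)
print(drops)
```

A drop near zero for a prompt would suggest that the judge is not sensitive to the reference answer for that case.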