# Pairwise Evaluator

This notebook uses the `PairwiseComparisonEvaluator` module to see whether an evaluation LLM would prefer one query engine over another.

In [1]:
# %pip install llama-index-llms-openai

In [2]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

In [3]:
# configuring logger to INFO level
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Response
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core.node_parser import SentenceSplitter
import pandas as pd

from jet.llm.ollama.base import initialize_ollama_settings, create_llm
initialize_ollama_settings()

pd.set_option("display.max_colwidth", 0)

The evaluation LLM here is `llama3.2`, created with `create_llm`; the variable names below retain a `gpt4` prefix.

In [5]:
# evaluation LLM (llama3.2; the variable name still says gpt4)
gpt4 = create_llm(temperature=0, model="llama3.2")

evaluator_gpt4 = PairwiseComparisonEvaluator(llm=gpt4)

In [7]:
documents = SimpleDirectoryReader("./test_wiki_data/").load_data()

In [8]:
# create vector index
splitter_512 = SentenceSplitter(chunk_size=512)
vector_index1 = VectorStoreIndex.from_documents(
    documents, transformations=[splitter_512]
)

splitter_200 = SentenceSplitter(chunk_size=200)
vector_index2 = VectorStoreIndex.from_documents(
    documents, transformations=[splitter_200]
)

INFO:httpx:HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"

In [9]:
query_engine1 = vector_index1.as_query_engine(similarity_top_k=2)
query_engine2 = vector_index2.as_query_engine(similarity_top_k=8)

In [10]:
# define jupyter display function
from IPython.display import display
def display_eval_df(query, response1, response2, eval_result) -> None:
    eval_df = pd.DataFrame(
        {
            "Query": query,
            "Reference Response (Answer 1)": response2,
            "Current Response (Answer 2)": response1,
            "Score": eval_result.score,
            "Reason": eval_result.feedback,
        },
        index=[0],
    )
    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "300px",
            "overflow-wrap": "break-word",
        },
        subset=["Current Response (Answer 2)", "Reference Response (Answer 1)"]
    )
    display(eval_df)

To run an evaluation, you can call the evaluator's `.evaluate_response()` method with the `Response` object returned from the query. This notebook instead converts each response to a string and passes the pair to `.aevaluate()`. Let's evaluate the outputs of the two vector indexes.

In [11]:
# query_str = "How did New York City get its name?"
query_str = "What was the role of NYC during the American Revolution?"
# query_str = "Tell me about the arts and culture of NYC"
response1 = str(query_engine1.query(query_str))
response2 = str(query_engine2.query(query_str))

INFO:httpx:HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


By default, we enforce "consistency" in the pairwise comparison.

The evaluator feeds in the (candidate, reference) pair, then swaps the order of the two and checks that the results are still consistent. If they are not, it returns a TIE.
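The consistency check described above can be sketched in plain Python. This is a minimal illustration, not the `llama_index` implementation: the `Judge` signature, `compare_with_consensus`, and the two toy judges are hypothetical names introduced here, and the score convention (1.0 = first answer preferred, 0.0 = second preferred, 0.5 = tie) is an assumption.

```python
from typing import Callable

# Hypothetical judge signature (not part of llama_index): takes the query and
# two answers, returns 1.0 if the first answer wins, 0.0 if the second wins,
# and 0.5 for a tie.
Judge = Callable[[str, str, str], float]


def compare_with_consensus(
    judge: Judge, query: str, answer_1: str, answer_2: str
) -> float:
    """Score answer_1 against answer_2; fall back to a tie if order matters."""
    forward = judge(query, answer_1, answer_2)
    # Swap the order: a consistent judge should now return the mirrored score.
    flipped = judge(query, answer_2, answer_1)
    if forward + flipped == 1.0:
        return forward
    return 0.5  # the two orderings disagree, so report a tie


def first_position_judge(query: str, first: str, second: str) -> float:
    """Toy judge with position bias: always prefers the answer shown first."""
    return 1.0


def longer_answer_judge(query: str, first: str, second: str) -> float:
    """Toy judge that is order-independent: prefers the longer answer."""
    if len(first) == len(second):
        return 0.5
    return 1.0 if len(first) > len(second) else 0.0


biased = compare_with_consensus(first_position_judge, "q", "short", "a longer answer")
stable = compare_with_consensus(longer_answer_judge, "q", "short", "a longer answer")
print(biased, stable)  # 0.5 0.0
```

The position-biased judge ends up with a tie because both orderings favour whichever answer comes first, while the order-independent judge keeps its verdict.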

In [13]:
eval_result = await evaluator_gpt4.aevaluate(
    query_str, response=response1, second_response=response2
)

INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


In [14]:
display_eval_df(query_str, response1, response2, eval_result)

| | Query | Reference Response (Answer 1) | Current Response (Answer 2) | Score | Reason |
|---|---|---|---|---|---|
| 0 | What was the role of NYC during the American Revolution? | The city served as a military and political base of operations for the British in North America. It also became a haven for Loyalist refugees and escaped slaves who joined the British lines for freedom. The city was crowded with as many as 10,000 escaped slaves during the British occupation. After the war, it hosted several events of national scope, including the inauguration of the first President of the United States, George Washington, in 1789. | The city played a significant military and political base for British operations in North America. It also served as a haven for Loyalist refugees and escaped slaves who joined the British lines for freedom. The city was a key location for several events, including the Conference House on Staten Island where American delegates met with British general Lord Howe, and it hosted the first President of the United States, George Washington, during his inauguration in 1789. | 0.5 | |


**NOTE**: By default, we enforce consensus by flipping the order of response/reference and making sure that the answers are opposites.

We can disable this check, but the results may then be inconsistent across answer orderings.

In [15]:
evaluator_gpt4_nc = PairwiseComparisonEvaluator(
    llm=gpt4, enforce_consensus=False
)

In [17]:
eval_result = await evaluator_gpt4_nc.aevaluate(
    query_str, response=response1, second_response=response2
)

INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


In [18]:
display_eval_df(query_str, response1, response2, eval_result)

| | Query | Reference Response (Answer 1) | Current Response (Answer 2) | Score | Reason |
|---|---|---|---|---|---|
| 0 | What was the role of NYC during the American Revolution? | The city served as a military and political base of operations for the British in North America. It also became a haven for Loyalist refugees and escaped slaves who joined the British lines for freedom. The city was crowded with as many as 10,000 escaped slaves during the British occupation. After the war, it hosted several events of national scope, including the inauguration of the first President of the United States, George Washington, in 1789. | The city played a significant military and political base for British operations in North America. It also served as a haven for Loyalist refugees and escaped slaves who joined the British lines for freedom. The city was a key location for several events, including the Conference House on Staten Island where American delegates met with British general Lord Howe, and it hosted the first President of the United States, George Washington, during his inauguration in 1789. | 1.0 | Upon comparing the two responses, I notice that both assistants provide relevant information about NYC's role during the American Revolution. However, there are some subtle differences in their approaches. Assistant A provides a more detailed explanation of specific events and locations associated with NYC during the revolution, such as the Conference House on Staten Island. This shows a better understanding of the historical context and provides more depth to the answer. On the other hand, Assistant B's response is more concise and focuses on the broader impact of NYC during the British occupation. While it still conveys important information about the city's role in hosting Loyalist refugees and escaped slaves, it lacks the specificity and detail provided by Assistant A. In terms of accuracy, both responses seem to be correct, but Assistant A provides a more nuanced understanding of the historical events. Considering all factors, I would say that: [[A]] |


In [19]:
eval_result = await evaluator_gpt4_nc.aevaluate(
    query_str, response=response2, second_response=response1
)

INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


In [20]:
display_eval_df(query_str, response2, response1, eval_result)

| | Query | Reference Response (Answer 1) | Current Response (Answer 2) | Score | Reason |
|---|---|---|---|---|---|
| 0 | What was the role of NYC during the American Revolution? | The city played a significant military and political base for British operations in North America. It also served as a haven for Loyalist refugees and escaped slaves who joined the British lines for freedom. The city was a key location for several events, including the Conference House on Staten Island where American delegates met with British general Lord Howe, and it hosted the first President of the United States, George Washington, during his inauguration in 1789. | The city served as a military and political base of operations for the British in North America. It also became a haven for Loyalist refugees and escaped slaves who joined the British lines for freedom. The city was crowded with as many as 10,000 escaped slaves during the British occupation. After the war, it hosted several events of national scope, including the inauguration of the first President of the United States, George Washington, in 1789. | 1.0 | Upon comparing the two responses, I notice that both assistants provide relevant information about NYC's role during the American Revolution. However, there are some subtle differences in their approaches. Assistant A provides a more detailed explanation of the city's significance as a haven for escaped slaves, stating that it was crowded with as many as 10,000 people. This adds depth to the response and highlights a lesser-known aspect of NYC's history during this period. On the other hand, Assistant B focuses more on the military and political aspects of NYC's role, mentioning specific events like the Conference House meeting. In terms of accuracy, both responses are correct, but Assistant A provides more context and detail about the city's population dynamics. However, Assistant B is more concise and directly addresses the user's question. Considering the factors mentioned earlier, I would argue that Assistant A's response is slightly better due to its added depth and detail. The mention of 10,000 escaped slaves adds a layer of complexity to the response and provides a more nuanced understanding of NYC's role during this period. Therefore, my final verdict is: [[A]] |
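The feedback strings above end in a verdict token such as `[[A]]`, and both of those rows carry a score of 1.0. A minimal sketch of turning such a token into a score follows. It assumes the mapping `[[A]]` to 1.0, `[[B]]` to 0.0, and `[[C]]` to 0.5 (tie); `parse_verdict` and `VERDICT_SCORES` are hypothetical helpers for illustration, not the evaluator's own parser.

```python
import re
from typing import Optional

# Assumed mapping: A = first answer preferred, B = second preferred, C = tie.
VERDICT_SCORES = {"A": 1.0, "B": 0.0, "C": 0.5}


def parse_verdict(feedback: str) -> Optional[float]:
    """Return the score for the last [[A]]/[[B]]/[[C]] token, or None if absent."""
    matches = re.findall(r"\[\[([ABC])\]\]", feedback)
    if not matches:
        return None
    # Use the final token, since the verdict is stated at the end of the feedback.
    return VERDICT_SCORES[matches[-1]]


print(parse_verdict("Therefore, my final verdict is: [[A]]"))  # 1.0
```

Returning `None` for feedback without a verdict token keeps a malformed judge response distinguishable from a genuine tie.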


## Running on some more Queries

In [21]:
query_str = "Tell me about the arts and culture of NYC"
response1 = str(query_engine1.query(query_str))
response2 = str(query_engine2.query(query_str))

INFO:httpx:HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


In [22]:
eval_result = await evaluator_gpt4.aevaluate(
    query_str, response=response1, second_response=response2
)

INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


In [23]:
display_eval_df(query_str, response1, response2, eval_result)

| | Query | Reference Response (Answer 1) | Current Response (Answer 2) | Score | Reason |
|---|---|---|---|---|---|
| 0 | Tell me about the arts and culture of NYC | New York City is renowned for its vibrant arts and culture scene. The city has been a hub for numerous cultural movements throughout history, including the Harlem Renaissance in literature and visual art, abstract expressionism (also known as the New York School) in painting, hip-hop, punk, salsa, freestyle, Tin Pan Alley, certain forms of jazz, and disco in music. The city's influence on stand-up comedy began in the early 20th century, with its jazz scene flourishing in the 1940s. Abstract expressionism emerged in the 1950s, and hip-hop was born in the 1970s. The city's punk and hardcore scenes were influential in the 1970s and 1980s. New York Fashion Week is one of the world's preeminent fashion events, attracting extensive media coverage. The city has also been frequently ranked as the top fashion capital of the world on annual lists compiled by the Global Language Monitor. The city's cultural landscape is further enriched by its status as a frequent setting for novels, movies, and television programs. Its unique blend of artistic expression and cultural diversity makes it a truly special place. | New York City boasts an incredibly diverse cultural landscape, with hundreds of museums and historic sites calling it home. The city's Museum Mile is a notable example, featuring nine art museums along Fifth Avenue, showcasing some of the world's most renowned collections. This concentration of artistic institutions makes New York City one of the densest displays of culture in the world. Beyond its iconic museums, the city's cultural scene is also characterized by numerous ethnic enclaves and neighborhoods that celebrate their unique heritage through food, music, and performance. From bagels to pastrami sandwiches, and from falafel to haute cuisine, New York City's culinary landscape reflects its rich immigrant history. The city's vibrant arts and culture scene extends beyond traditional institutions, with numerous festivals and events throughout the year. The annual Museum Mile Festival, for example, brings together some of the city's most prominent museums and cultural organizations to promote art, education, and community engagement. New York City is also a hub for creative industries, including film, television, music, advertising, and publishing. Its status as a global entertainment capital has made it an attractive location for creatives from around the world, contributing to its reputation as one of the most culturally rich and diverse cities on the planet. | 0.5 | |
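
When the comparison is repeated over more queries, the per-query scores can be summarized as win and tie counts. The sketch below assumes the 1.0 / 0.0 / 0.5 score convention, where 1.0 means the first answer passed to the evaluator was preferred; `tally_pairwise` is a hypothetical helper and the scores in `example_scores` are made up for illustration, not results from this notebook.

```python
from collections import Counter
from typing import Dict, Iterable


def tally_pairwise(scores: Iterable[float]) -> Dict[str, int]:
    """Count how often the first answer won, the second won, or the pair tied."""
    # Assumed convention: 1.0 = first answer preferred, 0.0 = second, 0.5 = tie.
    labels = {1.0: "first_wins", 0.0: "second_wins", 0.5: "tie"}
    counts = Counter(labels[score] for score in scores)
    return {name: counts.get(name, 0) for name in ("first_wins", "second_wins", "tie")}


# Hypothetical scores for four queries (illustrative only).
example_scores = [0.5, 1.0, 0.0, 0.5]
summary = tally_pairwise(example_scores)
print(summary)  # {'first_wins': 1, 'second_wins': 1, 'tie': 2}
```

A large share of ties under consensus enforcement suggests that the judge's verdict often depends on answer order rather than on answer quality.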
