## Pairwise Evaluation for Detecting Qualitative Differences

### Motivation
Absolute scoring of model outputs often obscures meaningful differences between systems, especially when performance metrics are close. Humans are often better at answering relative questions (which output is better, and why?) than at assigning absolute scores. This notebook studies pairwise comparison as an evaluation method for detecting subtle qualitative differences that aggregate metrics can miss.

### Experimental Setup
We present an evaluator with pairs of outputs generated by different systems (or configurations) for the same input and ask it to determine which response is preferable. The comparison criteria can be tailored to specific concerns, such as faithfulness, clarity, or adherence to constraints.

By aggregating pairwise judgments across many examples, we can derive relative performance rankings without assuming a single absolute notion of correctness.
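
As a minimal sketch of what such aggregation could look like, the snippet below turns a list of pairwise verdicts into per-system win rates and a ranking. The `win_rates` helper, the tuple layout, and the sample judgments are all hypothetical; this is one simple aggregation choice (ties count as half a win), not necessarily the one a production setup would use.

```python
from collections import defaultdict


def win_rates(judgments):
    """Compute each system's share of wins over the comparisons it took part in.

    `judgments` is a list of (system_x, system_y, winner) tuples, where
    `winner` is one of the two system names, or None for a tie.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for system_x, system_y, winner in judgments:
        games[system_x] += 1
        games[system_y] += 1
        if winner is None:
            # Count a tie as half a win for each side.
            wins[system_x] += 0.5
            wins[system_y] += 0.5
        else:
            wins[winner] += 1.0
    return {name: wins[name] / games[name] for name in games}


# Hypothetical judgments, for illustration only.
judgments = [
    ("chain_1", "chain_2", "chain_2"),
    ("chain_1", "chain_2", "chain_2"),
    ("chain_1", "chain_2", "chain_1"),
    ("chain_1", "chain_2", None),
]
rates = win_rates(judgments)
ranking = sorted(rates, key=rates.get, reverse=True)
print(rates, ranking)
```

With more than two systems, the same function can be fed judgments from every pair, although win rates then depend on which opponents each system happened to face.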

### What Pairwise Comparison Reveals
Pairwise evaluation is well-suited to detecting:
- differences in reasoning quality when answers are superficially similar,
- trade-offs between verbosity and precision,
- subtle failures masked by high average scores,
- regressions introduced by small system changes.

These effects often remain invisible in scalar metrics but become clear when outputs are contrasted directly.

### Prerequisites and Setup

In [None]:
import os # Import the 'os' module to interact with the operating system.

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint.
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your API key.

In [None]:
# The '%pip install' command installs python packages from the notebook. '--quiet' suppresses the output.
# %pip install -U "langchain[openai]" --quiet
# %pip install chromadb --quiet
# %pip install lxml --quiet
# %pip install html2text --quiet
# %pip install pandas --quiet

In [None]:
# The '%env' magic command sets an environment variable for the notebook session.
# %env OPENAI_API_KEY=<YOUR-API-KEY>

### Dataset and Chain Setup

In [1]:
# Define a list of tuples, where each tuple contains a (question, answer) pair.
examples = [
    (
        "What is LangChain?",
        "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.",
    ),
    (
        "How might I query for all runs in a project?",
        "client.list_runs(project_name='my-project-name'), or in TypeScript, client.listRuns({projectName: 'my-project-name'})",
    ),
    (
        "What's a langsmith dataset?",
        "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point.",
    ),
    (
        "How do I use a traceable decorator?",
        """The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key, \
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```""",
    ),
    (
        "Can I trace my Llama V2 llm?",
        "So long as you are using one of LangChain's LLM implementations, all your calls can be traced",
    ),
    (
        "Why do I have to set environment variables?",
        "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
        " While there are other ways to connect, environment variables tend to be the simplest way to configure your application.",
    ),
    (
        "How do I move my project between organizations?",
        "LangSmith doesn't directly support moving projects between organizations.",
    ),
]

In [2]:
from langsmith import Client # Import the Client class to interact with LangSmith.

client = Client() # Instantiate the LangSmith client.

In [3]:
import uuid # Import the uuid library to generate unique identifiers.

dataset_name = f"Retrieval QA Questions {str(uuid.uuid4())}" # Create a unique name for the dataset.
dataset = client.create_dataset(dataset_name=dataset_name) # Create the dataset on the LangSmith platform.
# Iterate through our question-answer pairs.
for q, a in examples:
    # Create an example in our LangSmith dataset for each pair.
    client.create_example(
        inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id
    )

The experiment we want to run is to see how the retriever's **chunk size** affects the quality of the final answer. We will create two versions of our RAG chain that are identical in every way except for the `chunk_size` and `chunk_overlap` used when splitting the source documents.
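
To build intuition for what these two parameters do, here is a simplified stand-in for token-based splitting. It is a sketch and not the library's implementation: `split_with_overlap` is a hypothetical helper, and a list of integers stands in for a tokenized document. The two settings mirror the chunk sizes used for the chains in this notebook.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(5000))  # stand-in for a 5,000-token document
large = split_with_overlap(tokens, chunk_size=2000, chunk_overlap=200)
small = split_with_overlap(tokens, chunk_size=500, chunk_overlap=50)
print(len(large), len(small))
```

Smaller chunks give the retriever more, narrower candidates to choose from, while larger chunks carry more surrounding context per retrieved document. Since the retriever returns a fixed number of documents (`k`), the chunk size also changes how much text reaches the prompt.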

In [4]:
from langchain_community.document_loaders import RecursiveUrlLoader # A loader for recursively scraping a website.
from langchain_community.document_transformers import Html2TextTransformer # A transformer to convert HTML to plain text.
from langchain_community.vectorstores import Chroma # The Chroma vector store implementation.
from langchain_text_splitters import TokenTextSplitter # A text splitter that splits based on token count.
from langchain_openai import OpenAIEmbeddings # The class for using OpenAI's embedding models.

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com") # Initialize a loader for the LangSmith docs.
doc_transformer = Html2TextTransformer() # Initialize the HTML to text transformer.
raw_documents = api_loader.load() # Load the raw documents.
transformed = doc_transformer.transform_documents(raw_documents) # Transform them into plain text.


# Define a factory function to create a retriever based on a given text splitter.
def create_retriever(transformed_documents, text_splitter):
    documents = text_splitter.split_documents(transformed_documents) # Split the documents.
    embeddings = OpenAIEmbeddings() # Initialize the embeddings model.
    vectorstore = Chroma.from_documents(documents, embeddings) # Create the vector store.
    return vectorstore.as_retriever(search_kwargs={"k": 4}) # Return the retriever.



We'll define a factory function for our chain. It will take a retriever as an argument, allowing us to easily create different chain versions.

In [5]:
from datetime import datetime # Import datetime to include the current time in the prompt.
from operator import itemgetter # Import itemgetter for convenient data routing.
from langchain_core.output_parsers import StrOutputParser # Import the string output parser.
from langchain_core.prompts import ChatPromptTemplate # Import the chat prompt template class.
from langchain_openai import ChatOpenAI # Import the OpenAI chat model class.


# Define a factory function that creates a RAG chain from a given retriever.
def create_chain(retriever):
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a helpful documentation Q&A assistant, trained to answer"
                " questions from LangSmith's documentation."
                " LangChain is a framework for building applications using large language models."
                "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages.",
            ),
            ("system", "{context}"),
            ("human", "{question}"),
        ]
    ).partial(time=str(datetime.now()))

    model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    response_generator = prompt | model | StrOutputParser()
    # Define the final chain using LangChain Expression Language (LCEL).
    chain = (
        {
            "context": itemgetter("question")
            | retriever
            | (lambda docs: "\n".join([doc.page_content for doc in docs])),
            "question": itemgetter("question"),
        }
        | response_generator
    )
    return chain

`chain_1` will use larger document chunks, while `chain_2` will use smaller chunks.

In [6]:
# Define the text splitter for our first chain (large chunks).
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=2000,
    chunk_overlap=200,
)
# Create the first retriever.
retriever = create_retriever(transformed, text_splitter)

# Create the first chain.
chain_1 = create_chain(retriever)

In [7]:
# Define the text splitter for our second chain (small chunks).
text_splitter_2 = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=500,
    chunk_overlap=50,
)
# Create the second retriever.
retriever_2 = create_retriever(transformed, text_splitter_2)

# Create the second chain.
chain_2 = create_chain(retriever_2)

We will run both `chain_1` and `chain_2` on our dataset and use a standard correctness evaluator (`cot_qa`) to score each response. This will give us an initial impression of their performance.

In [8]:
from langchain.smith import RunEvalConfig # Import the evaluation configuration class.

eval_config = RunEvalConfig(
    # We will use the chain-of-thought Q&A correctness evaluator for a more robust grade.
    evaluators=["cot_qa"],
)

In [9]:
# Run the evaluation for the first chain.
results = client.run_on_dataset(
    dataset_name=dataset_name, llm_or_chain_factory=chain_1, evaluation=eval_config
)
# Store the project name for later retrieval.
project_name = results["project_name"]

View the evaluation results for project 'test-new-goat-73' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/8097fed2-ca97-42b6-b71a-a24fe5e7c9d6
[------------------------------------------------->] 7/7

In [10]:
# Run the evaluation for the second chain.
results_2 = client.run_on_dataset(
    dataset_name=dataset_name, llm_or_chain_factory=chain_2, evaluation=eval_config
)
# Store the project name for the second run.
project_name_2 = results_2["project_name"]

View the evaluation results for project 'test-large-stone-98' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/4f3b1676-741b-4fb1-bfc1-46ca630ac160
[------------------------------------------------->] 7/7

In [12]:
import pandas as pd # Import the pandas library.

# Fetch all the runs from the first test project.
runs_1 = list(client.list_runs(project_name=project_name, execution_order=1))
# Fetch all the runs from the second test project.
runs_2 = list(client.list_runs(project_name=project_name_2, execution_order=1))


# Helper function to convert a list of runs into a DataFrame.
def get_project_df(runs):
    return pd.DataFrame(
        [
            # For each run, merge its outputs with the average of each feedback score.
            # `outputs` and `feedback_stats` may be None (e.g., for a failed run), so fall back to {}.
            {
                **(run.outputs or {}),
                **{k: v.get("avg") for k, v in (run.feedback_stats or {}).items()},
            }
            for run in runs
        ],
        # Use the reference example ID as the index for easy joining.
        index=[run.reference_example_id for run in runs],
    )


runs_1_df = get_project_df(runs_1) # Create a DataFrame for the first run.
runs_2_df = get_project_df(runs_2) # Create a DataFrame for the second run.
# Join the two DataFrames on their index (the example ID).
joined_df = runs_1_df.join(runs_2_df, lsuffix="_1", rsuffix="_2")
# Reorder the columns for a clean side-by-side comparison.
columns_1 = [col for col in joined_df.columns if col.endswith("_1")]
columns_2 = [col for col in joined_df.columns if col.endswith("_2")]
new_columns_order = [col for pair in zip(columns_1, columns_2) for col in pair]
joined_df = joined_df[new_columns_order]

In [13]:
joined_df

| Example ID | output_1 | output_2 | COT Contextual Accuracy_1 | COT Contextual Accuracy_2 |
| --- | --- | --- | --- | --- |
| 04a95258-4999-4abd-b1c3-0c5214130579 | LangChain is an open-source framework for buil... | LangChain is an open-source framework for buil... | 1.0 | 1.0 |
| 198d7039-72bc-4376-907a-87066d85275b | To query for all runs in a project, you can us... | To query for all runs in a project using LangS... | 0.0 | 0.0 |
| ea7b3f78-020e-410b-8e7b-bfdfa681a386 | A LangSmith dataset is a collection of input-o... | A LangSmith dataset refers to a collection of ... | 1.0 | 1.0 |
| 3cdd7b83-9a60-4c1b-b30b-dfa3fae69740 | To use a traceable decorator in LangSmith, you... | To use the `traceable` decorator in LangSmith,... | 0.0 | 0.0 |
| 5ec65a7d-70de-4494-b780-139550e301e5 | Yes, you can trace your Llama V2 LLM using Lan... | Yes, you can trace your Llama V2 LLM using Lan... | 1.0 | 1.0 |
| 7b74f055-3a2e-4b6f-a3d1-45eecfddcc63 | To move a project between organizations in Lan... | To move a project between organizations in Lan... | 0.0 | 0.0 |
| f23253e5-c537-43f0-8059-38153090a884 | At LangChain, setting environment variables is... | Setting environment variables is necessary in ... | 1.0 | 1.0 |
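
As a quick check on the table above, the sketch below recomputes the aggregate correctness score for each chain. The scores are copied by hand from the table; the variable names are only illustrative.

```python
from statistics import mean

# Per-example correctness scores copied from the table above.
scores_chain_1 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
scores_chain_2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

mean_1 = mean(scores_chain_1)
mean_2 = mean(scores_chain_2)
# Indices of examples where the two chains received different scores.
differing = [
    i for i, (a, b) in enumerate(zip(scores_chain_1, scores_chain_2)) if a != b
]
print(f"chain_1: {mean_1:.3f}, chain_2: {mean_2:.3f}, differing examples: {differing}")
```

Both the averages and the per-example scores coincide, so the scalar metric gives no basis for preferring one chunk size over the other, even though the generated answers differ in wording.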


### Pairwise Evaluation

Since the absolute scores were the same, we'll now run a pairwise evaluator to determine a preference between the two outputs for each question. We will define a helper function, `predict_preference`, to orchestrate this process for each example in our dataset. This function will:

1.  Fetch the completed runs for a given example from both of our test projects (`project_a` and `project_b`).
2.  **Randomize the order** of the two predictions (A and B). This is a crucial step to mitigate **positional bias**, where LLM judges may tend to prefer the first option they see.
3.  Call a pairwise evaluation chain, which asks an LLM to state its preference (A or B).
4.  Log the preference as feedback to the corresponding runs in LangSmith. The preferred run gets a score of `1`, and the other gets `0`.
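
The randomization step can be illustrated in isolation. The sketch below is a simplified, self-contained version of the idea, not the notebook's `predict_preference` function: `judge_with_random_order` and the toy judge are hypothetical, and the judge is any callable that returns `"A"`, `"B"`, or `None`.

```python
import random


def judge_with_random_order(pred_1, pred_2, judge, rng=random):
    """Show two predictions to `judge` in random order and map the verdict back.

    `judge(first, second)` is a hypothetical callable returning "A" (first),
    "B" (second), or None (no preference).
    Returns 1 or 2 (the preferred prediction), or None.
    """
    swapped = rng.random() < 0.5
    first, second = (pred_2, pred_1) if swapped else (pred_1, pred_2)
    verdict = judge(first, second)
    if verdict is None:
        return None
    first_won = verdict == "A"
    # If the order was swapped, position A held pred_2.
    if swapped:
        return 2 if first_won else 1
    return 1 if first_won else 2


def always_first(first, second):
    """A deliberately biased toy judge that prefers whatever it sees first."""
    return "A"


rng = random.Random(0)
picks = [judge_with_random_order("x", "y", always_first, rng) for _ in range(1000)]
share_1 = picks.count(1) / len(picks)
print(f"prediction 1 preferred in {share_1:.1%} of trials")
```

With a judge that is purely position-driven, randomizing the order spreads its bias roughly evenly across both systems instead of crediting one of them. It does not remove the bias from any single judgment, which is why aggregating over many examples matters.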


In [14]:
import random # Import the random module for shuffling.
import logging # Import the logging module to handle potential errors.


# A helper function to fetch a specific run and its prediction.
def _get_run_and_prediction(example_id, project_name):
    # List runs, filtering by the reference example and project.
    run = next(
        client.list_runs(
            reference_example_id=example_id,
            project_name=project_name,
            execution_order=1,
        )
    )
    # Extract the output from the run.
    prediction = next(iter(run.outputs.values()))
    return run, prediction


# A helper function to log the preference feedback to LangSmith.
def _log_feedback(run_ids):
    # The 'preference' key is used. The preferred run gets a score of 1, the other gets 0.
    for score, run_id in enumerate(run_ids):
        client.create_feedback(run_id, key="preference", score=score)


# The main function to predict and log preference for a single example.
def predict_preference(example, project_a, project_b, eval_chain):
    example_id = example.id # Get the ID of the current example.
    print(example) # Print the example for progress tracking.
    # Fetch the runs and predictions for both projects (A and B).
    run_a, pred_a = _get_run_and_prediction(example_id, project_a)
    run_b, pred_b = _get_run_and_prediction(example_id, project_b)
    # Prepare the inputs for the evaluator.
    input_, answer = example.inputs["question"], example.outputs["answer"]
    result = {"input": input_, "answer": answer, "A": pred_a, "B": pred_b}

    # Randomly swap A and B to mitigate positional bias in the LLM judge.
    if random.random() < 0.5:
        result["A"], result["B"] = result["B"], result["A"]
        run_a, run_b = run_b, run_a
    try:
        # Call the pairwise evaluator.
        eval_res = eval_chain.evaluate_string_pairs(
            prediction=result["A"],
            prediction_b=result["B"],
            input=input_,
            reference=answer,
        )
    except Exception as e:
        # Log a warning if the evaluator fails.
        logging.warning(e)
        return result

    # If the evaluator returns a 'None' value (e.g., a tie), we don't log feedback.
    if eval_res["value"] is None:
        return result

    # Determine which run was preferred.
    preferred_run = (run_a.id, "A") if eval_res["value"] == "A" else (run_b.id, "B")
    runner_up_run = (run_b.id, "B") if eval_res["value"] == "A" else (run_a.id, "A")
    # Log the feedback (0 for runner-up, 1 for preferred).
    _log_feedback((runner_up_run[0], preferred_run[0]))
    # Add the preference to our results dictionary.
    result["Preferred"] = preferred_run[1]
    return result

We will use LangChain's off-the-shelf `labeled_pairwise_string` evaluator. By default, this evaluator asks the LLM judge to choose its preference based on helpfulness, relevance, correctness, and depth. For a real application, you would likely want to customize these criteria to match your specific goals.

In [15]:
from langchain.evaluation import load_evaluator # Import the evaluator loading helper.

# Load the pre-built labeled pairwise string evaluator.
pairwise_evaluator = load_evaluator("labeled_pairwise_string")

Now we'll set up a runnable to execute our `predict_preference` function across the entire dataset.

In [16]:
import functools # Import functools to create partial functions.
from langchain_core.runnables import RunnableLambda # Import RunnableLambda to wrap our function.


# Create a partial function with our project names and evaluator pre-filled.
eval_func = functools.partial(
    predict_preference,
    project_a=project_name,
    project_b=project_name_2,
    eval_chain=pairwise_evaluator,
)


# Wrap our partial function in a RunnableLambda to get access to the convenient .batch() method.
runnable = RunnableLambda(eval_func)

In [17]:
# Fetch the list of examples from our dataset.
examples = list(client.list_examples(dataset_id=dataset.id))
# Run the evaluation in a batch across all examples.
values = runnable.batch(examples)

dataset_id=UUID('29addcf7-2be5-4320-bae7-10f9635d29e3') inputs={'question': 'How do I move my project between organizations?'} outputs={'answer': "LangSmith doesn't directly support moving projects between organizations."} id=UUID('7b74f055-3a2e-4b6f-a3d1-45eecfddcc63') created_at=datetime.datetime(2023, 10, 23, 6, 12, 27, 588671) modified_at=datetime.datetime(2023, 10, 23, 6, 12, 27, 645828) runs=[]
dataset_id=UUID('29addcf7-2be5-4320-bae7-10f9635d29e3') inputs={'question': 'How do I use a traceable decorator?'} outputs={'answer': 'The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key,import the required function, decorate your function, and then call the function. Below is an example:\n```python\nfrom langsmith.run_helpers import traceable\n@traceable(run_type="chain") # or "llm", etc.\ndef my_function(input_param):\n    # Function logic goes here\n    return output\nresult = my_function(input_param)\n```'} id=UUID('3cdd

In [18]:
import pandas as pd # Import pandas.

df = pd.DataFrame(values) # Create a DataFrame from our evaluation results.
df.head(10) # Display the first 10 rows of the DataFrame.

|  | input | answer | A | B | Preferred |
| --- | --- | --- | --- | --- | --- |
| 0 | How do I move my project between organizations? | LangSmith doesn't directly support moving proj... | To move a project between organizations in Lan... | To move a project between organizations in Lan... |  |
| 1 | Why do I have to set environment variables? | Environment variables can tell your LangChain ... | Setting environment variables is necessary in ... | At LangChain, setting environment variables is... | A |
| 2 | Can I trace my Llama V2 llm? | So long as you are using one of LangChain's LL... | Yes, you can trace your Llama V2 LLM using Lan... | Yes, you can trace your Llama V2 LLM using Lan... | B |
| 3 | How do I use a traceable decorator? | The traceable decorator is available in the la... | To use the `traceable` decorator in LangSmith,... | To use a traceable decorator in LangSmith, you... | A |
| 4 | What's a langsmith dataset? | A LangSmith dataset is a collection of example... | A LangSmith dataset is a collection of input-o... | A LangSmith dataset refers to a collection of ... | A |
| 5 | How might I query for all runs in a project? | client.list_runs(project_name='my-project-name... | To query for all runs in a project, you can us... | To query for all runs in a project using LangS... | A |
| 6 | What is LangChain? | LangChain is an open-source framework for buil... | LangChain is an open-source framework for buil... | LangChain is an open-source framework for buil... | A |
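
Note that in `predict_preference` the `A` and `B` entries are stored after the random swap, so the `Preferred` column names a position rather than a chain. To attribute wins to `chain_1` or `chain_2`, read the `preference` feedback logged on the runs. The position labels are still useful as a rough sanity check; the sketch below tallies them, with the values copied by hand from the table above.

```python
from collections import Counter

# Values of the "Preferred" column above; None marks the row with no recorded preference.
preferred = [None, "A", "B", "A", "A", "A", "A"]

counts = Counter(p for p in preferred if p is not None)
decided = sum(counts.values())
share_a = counts["A"] / decided
print(dict(counts), f"position A preferred in {share_a:.0%} of decided comparisons")
```

Because the order is randomized, a strong skew toward one position may hint at positional bias in the judge. With only six decided comparisons, though, a skew like this could easily arise by chance, so it is a prompt for further checks rather than a conclusion.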


### Limitations
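Pairwise judgments depend on the choice of evaluator and comparison criteria. They also scale poorly with the number of systems: comparing every pair among n systems requires n(n-1)/2 comparisons per example, so the cost grows quadratically rather than linearly. They are best used selectively, when fine-grained distinctions matter.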
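
To make the scaling concern concrete, the sketch below counts the judgments needed for a full round-robin. `comparisons_needed` is a hypothetical helper, and `chain_3` and `chain_4` are made-up systems added only to show the growth.

```python
from itertools import combinations


def comparisons_needed(num_systems, num_examples):
    """Number of pairwise judgments for a full round-robin over all system pairs."""
    num_pairs = num_systems * (num_systems - 1) // 2
    return num_pairs * num_examples


systems = ["chain_1", "chain_2", "chain_3", "chain_4"]  # chain_3 and chain_4 are hypothetical
pairs = list(combinations(systems, 2))
print(len(pairs), comparisons_needed(len(systems), num_examples=7))
```

Going from two systems to four on a seven-example dataset raises the count from 7 to 42 judge calls, and ten systems would need 315, before accounting for any repeated judgments used to reduce noise.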

### Role in a Broader Evaluation Framework
In this project, pairwise comparison serves as a qualitative lens, complementing quantitative metrics. It is particularly useful when diagnosing regressions or validating whether changes that improve one metric actually improve overall behaviour.

## Discussion
Pairwise evaluation reframes assessment from scoring to preference, aligning more closely with how humans perceive quality. This notebook demonstrates how relative judgments can surface differences that absolute metrics systematically smooth over.

This method can be enhanced by:

- **Ensembling**: Using multiple LLM judges and taking a majority vote to get a more robust preference score.
- **Continuous Scores**: Instructing the model to output a graded score (e.g., from 1 to 10) for each response instead of only a binary preference, so that the size of the gap between two responses is captured as well as its direction.
- **Custom Criteria**: Modifying the evaluator's prompt to judge based on criteria that are most important to your application, such as conciseness, tone, or creativity.
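
The ensembling idea can be sketched as a simple vote over verdicts. This is an illustrative helper rather than part of LangChain or LangSmith: `majority_preference` is hypothetical, and it assumes each judge's verdict has already been mapped back to a consistent labeling of the two systems (for example, after undoing any per-judge order randomization).

```python
from collections import Counter


def majority_preference(votes):
    """Return the label with the most votes, or None on a tie or when nobody voted.

    `votes` holds one verdict per judge: "A", "B", or None for an abstention.
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    (top_label, top_count), *rest = counts.most_common()
    if rest and rest[0][1] == top_count:
        return None  # tie between the leading labels
    return top_label


print(majority_preference(["A", "B", "A"]))
print(majority_preference(["A", "B", None]))
```

Using an odd number of judges reduces the chance of ties, and returning `None` on a tie keeps the behavior consistent with skipping feedback when no preference is expressed.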