# How to run pairwise evaluations

LangSmith supports evaluating existing experiments in a comparative manner. This allows you to score the outputs from multiple experiments against each other, rather than being confined to evaluating outputs one at a time. To do this, use the `evaluate_comparative` function with two existing experiments.

## `evaluate_comparative` args

- `experiments`: A list of the two existing experiments you would like to evaluate against each other. These can be UUIDs or experiment names.
- `evaluators`: A list of the pairwise evaluators that you would like to attach to this evaluation. See the example below for how to define one.

Optional args:

- `randomize_order`: Whether to randomize the order of the outputs for each evaluation. This helps reduce positional bias, where the LLM favors a response based on where it appears in the prompt. Positional bias should mainly be addressed through prompt engineering; randomization is an additional, optional mitigation. Defaults to False.
- `experiment_prefix`: A prefix to be attached to the beginning of the pairwise experiment name. Defaults to None.
- `description`: A description of the pairwise experiment. Defaults to None.
- `max_concurrency`: The maximum number of concurrent evaluations to run. Defaults to 5.
- `client`: The LangSmith client to use. Defaults to None.
- `metadata`: Metadata to attach to your pairwise experiment. Defaults to None.
- `load_nested`: Whether to load all child runs for the experiment. When False, only the root trace will be passed to your evaluator. Defaults to False.
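
To make the idea behind `randomize_order` concrete, the sketch below shows the general technique in plain Python: swap the two outputs at random before calling the evaluator, then map the scores back to the original order. It illustrates the concept only and is not LangSmith's internal implementation; `with_random_order` and `prefers_first` are hypothetical names introduced for this example.

```python
import random
from typing import Callable


# Hypothetical helper: illustrates the idea behind order randomization.
# It is NOT how LangSmith implements `randomize_order` internally.
def with_random_order(
    evaluator: Callable[[dict, list], list],
    rng: random.Random,
) -> Callable[[dict, list], list]:
    def wrapped(inputs: dict, outputs: list) -> list:
        swap = rng.random() < 0.5
        ordered = [outputs[1], outputs[0]] if swap else list(outputs)
        scores = evaluator(inputs, ordered)
        # Undo the swap so scores line up with the original experiment order.
        return [scores[1], scores[0]] if swap else list(scores)

    return wrapped


# A deliberately biased toy evaluator: it always prefers whichever
# answer it sees first, regardless of content.
def prefers_first(inputs: dict, outputs: list) -> list:
    return [1, 0]


fair = with_random_order(prefers_first, random.Random(0))
outputs = [{"output": "A"}, {"output": "B"}]
# Count how often the first experiment wins over many evaluations.
wins_a = sum(fair({"question": "q"}, outputs)[0] for _ in range(1000))
```

Without randomization, the biased evaluator would award every win to the first experiment. With it, the wins are split roughly evenly, so the bias no longer systematically favors one experiment.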

## Run a pairwise evaluation

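The following example uses a prompt that asks the LLM to decide which of two AI assistant responses is better. It uses structured output to parse the LLM's response into a preference of 0, 1, or 2: 1 means the first response is preferred, 2 means the second is preferred, and any other value gives neither response a point.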

```python
from langchain import hub
from langchain.chat_models import init_chat_model
from langsmith.evaluation import evaluate_comparative
```

```python
prompt = hub.pull("langchain-ai/pairwise-evaluation-2")
model = init_chat_model("gpt-4o")
chain = prompt | model
```

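```python
def ranked_preference(inputs: dict, outputs: list[dict]) -> list:
    # Assumes the example inputs have a "question" key and each
    # experiment's outputs have an "output" key.
    response = chain.invoke({
        "question": inputs["question"],
        "answer_a": outputs[0].get("output", "N/A"),
        "answer_b": outputs[1].get("output", "N/A"),
    })
    # The chain is expected to return structured output with a "Preference" field.
    preference = response["Preference"]

    # Return one score per experiment, in the same order as `outputs`.
    if preference == 1:
        scores = [1, 0]
    elif preference == 2:
        scores = [0, 1]
    else:
        scores = [0, 0]
    return scores
```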
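
A pairwise evaluator does not have to call an LLM. As a minimal sketch, assuming the same `(inputs, outputs)` signature and `"output"` key as `ranked_preference`, a deterministic heuristic can return the same kind of score list. The name `prefer_shorter` and the length-based rule are illustrative assumptions, not a recommended metric.

```python
# Hypothetical evaluator for illustration: prefers the shorter answer.
def prefer_shorter(inputs: dict, outputs: list) -> list:
    """Score the shorter of two answers as the winner; ties score [0, 0]."""
    len_a = len(str(outputs[0].get("output", "")))
    len_b = len(str(outputs[1].get("output", "")))
    if len_a < len_b:
        return [1, 0]
    if len_b < len_a:
        return [0, 1]
    return [0, 0]


# Quick local check with hand-written outputs; no LangSmith call is made.
example_scores = prefer_shorter(
    {"question": "What is 2 + 2?"},
    [{"output": "4"}, {"output": "The answer is four."}],
)
```

Because it follows the same signature, a function like this should be usable in the `evaluators` list in the same way as `ranked_preference`.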

```python
evaluate_comparative(
    # Replace the following list with the names or IDs of your experiments
    ["openai-4-71c6e0df", "openai-4-24ce7466"],
    evaluators=[ranked_preference],
)
```

View the pairwise evaluation results at:
https://smith.langchain.com/o/4791d9fe-98f1-47bb-b116-297cd74a3dc0/datasets/957bd0d2-fb1b-49b9-a67e-2908acfc7bdb/compare?selectedSessions=b06d5818-51db-4240-b268-3bf9b6b6d356%2C3e99b53b-2445-4946-aad1-6b324100f56d&comparativeExperiment=36106bf2-9e8a-4687-8eaf-e3d89a8f2c25





![Pairwise experiments](../../assets/pairwise_experiments.png)
