# Pairwise String Comparison

Often you will want to compare two predictions from an LLM, Chain, or Agent for the same input. The pairwise comparison evaluators facilitate this, so you can answer questions like:
- Which LLM or Prompt produces a preferred output for a given question?
- Which completions should I include for few-shot example selection?
- Which output is better to include for fine-tuning?

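You can use the `PairwiseStringEvalChain` to do this. Setting `requires_reference=True` makes the evaluator judge both predictions against a reference answer that you supply.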

In [4]:
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import PairwiseStringEvalChain

llm = ChatOpenAI(model="gpt-4", temperature=0.0)

eval_chain = PairwiseStringEvalChain.from_llm(llm=llm, requires_reference=True)

In [5]:
eval_chain.evaluate_string_pairs(
    prediction="there are three dogs",
    prediction_b="4",
    input="how many dogs are in the park?",
    reference="four"
)

{'reasoning': 'Response A provides an incorrect answer by stating there are three dogs in the park, while the reference answer indicates there are four. Response B, on the other hand, provides the correct answer, matching the reference. Although Response B is less detailed, it is accurate and directly answers the question. \n\nTherefore, the better response is [[B]].\n',
 'value': 'B',
 'score': 0}
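
To answer a question like "which prompt is preferred overall?", you would typically run many comparisons and tally the verdicts. The following is a minimal sketch that works only on plain dictionaries shaped like the output above, so it needs no API calls. The helper name `summarize_preferences` is hypothetical, and treating a `None` value as a tie is an assumption.

```python
from collections import Counter


def summarize_preferences(results):
    """Return the fraction of comparisons won by each side.

    `results` is a list of dicts shaped like the evaluator output shown
    above, each with a 'value' key. A 'value' of None is counted as a
    tie (assumption).
    """
    counts = Counter()
    for result in results:
        verdict = result.get("value")
        counts["tie" if verdict is None else verdict] += 1

    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hand-written stand-ins for real evaluator outputs.
example_results = [
    {"value": "B", "score": 0},
    {"value": "B", "score": 0},
    {"value": "A", "score": 1},
    {"value": None, "score": 0.5},
]
summary = summarize_preferences(example_results)
print(summary)
```

With the four stand-in results above, half of the comparisons favor `B`, a quarter favor `A`, and a quarter are ties.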

## Without References

When references aren't available, you can still predict the preferred response.
The result then reflects only the evaluation model's own preference, which is less reliable
and may favor a response that is factually incorrect.

In [6]:
eval_chain = PairwiseStringEvalChain.from_llm(llm=llm)

In [7]:
eval_chain.evaluate_string_pairs(
    prediction="there are three dogs",
    prediction_b="4",
    input="What is the name of the dog?",
)

{'reasoning': 'Both responses answer the question directly and accurately, but neither provides any additional detail or context. Response A is slightly more complete because it uses a full sentence, while Response B only provides a number. However, both responses are relevant and accurate, so the difference is minimal.\n\nFinal decision: [[C]]\n',
 'value': None,
 'score': 0.5}
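
Both reasoning strings in this notebook end with a bracketed verdict (`[[B]]` and `[[C]]`), and the `value` and `score` fields track that verdict. The sketch below shows one way such a verdict could be extracted from the reasoning text. It is not the library's parser: `parse_verdict` is a hypothetical helper, and the mapping of `[[A]]` to a score of 1 is an assumption, since only the `B` and `C` cases appear in the outputs above.

```python
import re

# Matches a bracketed verdict such as [[A]], [[B]], or [[C]].
_VERDICT_RE = re.compile(r"\[\[([ABC])\]\]")

# B -> 0 and C -> 0.5 match the outputs shown in this notebook;
# A -> 1 is an assumption.
_SCORES = {"A": 1, "B": 0, "C": 0.5}


def parse_verdict(reasoning):
    """Extract the last bracketed verdict from an evaluator's reasoning text."""
    matches = _VERDICT_RE.findall(reasoning)
    if not matches:
        raise ValueError("no [[A]], [[B]], or [[C]] verdict found")
    label = matches[-1]
    # A [[C]] verdict means no preference, reported as value None.
    return {"value": None if label == "C" else label, "score": _SCORES[label]}


print(parse_verdict("Therefore, the better response is [[B]].\n"))
print(parse_verdict("Final decision: [[C]]\n"))
```

Taking the last match rather than the first guards against reasoning text that mentions a bracketed label before reaching its final decision.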