# Pairwise String Comparison

Often you will want to compare the predictions of an LLM, Chain, or Agent for a given input. The comparison evaluators make this possible, so you can answer questions like:
- Which LLM or Prompt produces a preferred output for a given question?
- Which completions should I include for few-shot example selection?
- Which output is better to include for fine-tuning?

In [1]:
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import PairwiseStringEvalChain

llm = ChatOpenAI(model="gpt-4")

eval_chain = PairwiseStringEvalChain.from_llm(llm=llm)



In [5]:
eval_chain.evaluate_string_pairs(
    prediction="there are three dogs",
    prediction_b="4",
    input="how many dogs are in the park?",
    reference="four"
)

{'reasoning': "Both responses A and B accurately answer the question, but neither response provides any additional detail or context. Response A is slightly more complete, as it uses full sentences to convey the information, while response B provides just the number. However, both responses are fairly equal in relevance, accuracy, and depth. The lack of detail in both responses doesn't allow for a clear winner based on creativity or detail. \n\nTherefore, my rating is a tie. \n",
 'value': None,
 'score': 0.5}
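
The result is a plain dictionary, so it can be post-processed without calling the model again. The sketch below is one way to turn such a result into a concrete choice. `preferred_output` and `win_rate_a` are hypothetical helper names, not part of LangChain. The output above only shows the tie case (`'value': None`, `'score': 0.5`); the sketch assumes that `'value'` is `"A"` when the first prediction wins and `"B"` when the second wins, with scores of 1 and 0 respectively.

```python
from typing import Any, Dict, List, Optional


def preferred_output(
    result: Dict[str, Any], prediction: str, prediction_b: str
) -> Optional[str]:
    """Return the preferred prediction, or None when the evaluator calls a tie.

    Assumption: 'value' is "A" if `prediction` wins and "B" if
    `prediction_b` wins. The tie case ('value' is None) is the only
    one shown in the output above.
    """
    value = result.get("value")
    if value == "A":
        return prediction
    if value == "B":
        return prediction_b
    return None  # tie, or an unrecognized verdict


def win_rate_a(results: List[Dict[str, Any]]) -> float:
    """Average the 'score' field over several comparisons.

    Assumption: score is 1 when A wins, 0 when B wins, and 0.5 on a tie,
    so the mean reads as how often the first prediction is preferred.
    """
    if not results:
        raise ValueError("need at least one result")
    return sum(r["score"] for r in results) / len(results)


# The tie result shown above, with the reasoning text shortened.
tie_result = {"reasoning": "...", "value": None, "score": 0.5}
winner = preferred_output(tie_result, "there are three dogs", "4")
print(winner)  # None, because the evaluator rated the pair a tie
```

Averaging scores over many inputs is one simple way to decide which LLM or prompt is preferred overall, although ties pull the average toward 0.5 and may deserve separate inspection.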