# Pairwise string comparison

We can compare two outputs by having a model judge which one is closer to the reference. This is useful for comparing the outputs of two models, or of two different prompts.

In [None]:
import sys
sys.path.append('/opt/project/src/evaluate_llm/')
from api_key_config import settings
import os

os.environ["OPENAI_API_VERSION"] = settings.OPENAI_API_VERSION
os.environ["OPENAI_API_KEY"] = settings.OPENAI_API_KEY
os.environ["AZURE_OPENAI_ENDPOINT"] = settings.AZURE_OPENAI_ENDPOINT

In [2]:
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(deployment_name="gpt-35-turbo")

In [3]:
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("labeled_pairwise_string", llm=llm)

In [4]:
evaluator.evaluate_string_pairs(
    prediction="there are three dogs",
    prediction_b="4",
    input="how many dogs are in the park?",
    reference="four",
) 

{'reasoning': "Assistant A provided the correct number of dogs in the park and followed the user's instructions to provide a specific number. The response is helpful, relevant, correct, and demonstrates depth by directly addressing the user's question with the accurate information.\n\nOn the other hand, Assistant B's response only provided a number without any context or relevance to the user's question. It lacks helpfulness, relevance, and depth.\n\nTherefore, Assistant A's response is better as it meets all the criteria and directly addresses the user's question with the correct information. \n[[A]]",
 'value': 'A',
 'score': 1}

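This verdict is wrong: the reference is "four", so prediction_b ("4") should have been preferred, yet gpt-35-turbo chose A. Use GPT-4 as the evaluator model instead for a more reliable judgment.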

The pairwise string evaluator is called through the evaluate_string_pairs method (or the async aevaluate_string_pairs), which accepts the following parameters:

- prediction (str) – the predicted response of the first model, chain, or prompt
- prediction_b (str) – the predicted response of the second model, chain, or prompt
- input (str) – the input question, prompt, or other text
- reference (str) – (only for the labeled_pairwise_string variant) the reference response

These methods return a dictionary with the following values:

- value: 'A' or 'B', indicating whether prediction or prediction_b is preferred, respectively
- score: an integer, 0 or 1, mapped from 'value', where 1 means the first prediction is preferred and 0 means prediction_b is preferred
- reasoning: the "chain of thought" reasoning text that the LLM generates before giving its score
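The relationship between the three returned fields can be sketched in plain Python. This is only an illustration under stated assumptions: `parse_verdict` and `VERDICT_RE` are hypothetical names, not LangChain APIs, and the library's own parser may behave differently.

```python
import re

# Hypothetical helper, not part of LangChain: find the final [[A]] / [[B]]
# marker in the judge's text and map it to the fields described above.
VERDICT_RE = re.compile(r"\[\[([AB])\]\]")


def parse_verdict(text: str) -> dict:
    matches = VERDICT_RE.findall(text)
    if not matches:
        raise ValueError("no [[A]] or [[B]] verdict found")
    value = matches[-1]  # use the last marker in case one appears earlier
    return {
        "reasoning": text,
        "value": value,
        # 1 means prediction (A) is preferred, 0 means prediction_b (B).
        "score": 1 if value == "A" else 0,
    }


result = parse_verdict("A is closer to the reference.\n[[A]]")
print(result["value"], result["score"])
```

Taking the last marker is a design choice for this sketch: the reasoning text may mention a label before the final verdict.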

## Without References 

In [8]:
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("pairwise_string", llm=llm)

This chain was only tested with GPT-4. Performance may be significantly worse with other models.


In [9]:
evaluator.evaluate_string_pairs(
    prediction="Addition is a mathematical operation.",
    prediction_b="Addition is a mathematical operation that adds two numbers to create a third number, the 'sum'.",
    input="What is addition?",
)

{'reasoning': "Assistant B provides a more helpful, insightful, and appropriate response by giving a clear definition of addition as a mathematical operation that adds two numbers to create a third number, the 'sum'. This response is more relevant as it refers to a real definition of addition from mathematics. It is also correct, accurate, and factual. Assistant A's response, on the other hand, only states that addition is a mathematical operation without providing further details. Therefore, Assistant B's response demonstrates more depth of thought by providing a more detailed explanation of addition. \nTherefore, [[B]] is better.",
 'value': 'B',
 'score': 0}
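When many pairs are evaluated, result dictionaries like the one above can be tallied into a win rate. The sketch below assumes each result carries the 'value' and 'score' keys described earlier; `summarize` is a hypothetical helper, not something LangChain provides.

```python
from collections import Counter


# Hypothetical helper, not part of LangChain: summarise many pairwise results.
def summarize(results: list[dict]) -> dict:
    counts = Counter(r["value"] for r in results)
    total = len(results)
    return {
        "a_wins": counts["A"],
        "b_wins": counts["B"],
        # The mean of 'score' is the fraction of pairs where prediction (A) won.
        "a_win_rate": sum(r["score"] for r in results) / total if total else 0.0,
    }


# 'value' and 'score' copied from the two results shown earlier in this notebook.
results = [
    {"value": "A", "score": 1},
    {"value": "B", "score": 0},
]
summary = summarize(results)
print(summary)
```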

### Defining the Criteria

You can define the evaluation criteria by passing criteria as a dict that maps each criterion name to its description.

In [12]:
custom_criteria = {
    "simplicity": "Is the language straightforward and unpretentious?",
    "clarity": "Are the sentences clear and easy to understand?",
    "precision": "Is the writing precise, with no unnecessary words or details?",
    "truthfulness": "Does the writing feel honest and sincere?",
    "subtext": "Does the writing suggest deeper meanings or themes?",
}
evaluator = load_evaluator("pairwise_string", criteria=custom_criteria, llm=llm)

This chain was only tested with GPT-4. Performance may be significantly worse with other models.


In [13]:
evaluator.evaluate_string_pairs(
    prediction="Every cheerful household shares a similar rhythm of joy; but sorrow, in each household, plays a unique, haunting melody.",
    prediction_b="Where one finds a symphony of joy, every domicile of happiness resounds in harmonious,"
    " identical notes; yet, every abode of despair conducts a dissonant orchestra, each"
    " playing an elegy of grief that is peculiar and profound to its own existence.",
    input="Write some prose about families.",
)

{'reasoning': "Assistant B's response is overly complex and pretentious, using convoluted language and unnecessary details. It lacks simplicity, clarity, and precision, making it difficult to understand and appreciate. On the other hand, Assistant A's response is straightforward, clear, and honest. It presents the topic of families in a simple and truthful manner without unnecessary embellishments. Therefore, [[A]] is better.",
 'value': 'A',
 'score': 1}
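The criteria dict ends up as text inside the evaluation prompt. As a rough sketch of what that text might look like, the snippet below renders a dict as one `name: description` line per criterion. The layout is an assumption for illustration, and `render_criteria` is a hypothetical helper rather than LangChain's implementation.

```python
# Illustrative only: one plausible way a criteria dict becomes prompt text.
# The "name: description" layout is an assumption, not LangChain's source code.
def render_criteria(criteria: dict[str, str]) -> str:
    return "\n".join(
        f"{name}: {description}" for name, description in criteria.items()
    )


example_criteria = {
    "simplicity": "Is the language straightforward and unpretentious?",
    "clarity": "Are the sentences clear and easy to understand?",
}
criteria_text = render_criteria(example_criteria)
print(criteria_text)
```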

### Customize the Evaluation Prompt

In [14]:
from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate.from_template(
    """Given the input context, which do you prefer: A or B?
Evaluate based on the following criteria:
{criteria}
Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.

DATA
----
input: {input}
reference: {reference}
A: {prediction}
B: {prediction_b}
---
Reasoning:

"""
)
# Pass llm explicitly so the evaluator uses the Azure deployment configured above
evaluator = load_evaluator("labeled_pairwise_string", prompt=prompt_template, llm=llm)

In [15]:
# The prompt was assigned to the evaluator
print(evaluator.prompt)

input_variables=['input', 'prediction', 'prediction_b', 'reference'] partial_variables={'criteria': 'For this evaluation, you should primarily consider the following criteria:\nhelpfulness: Is the submission helpful, insightful, and appropriate?\nrelevance: Is the submission referring to a real quote from the text?\ncorrectness: Is the submission correct, accurate, and factual?\ndepth: Does the submission demonstrate depth of thought?'} template='Given the input context, which do you prefer: A or B?\nEvaluate based on the following criteria:\n{criteria}\nReason step by step and finally, respond with either [[A]] or [[B]] on its own line.\n\nDATA\n----\ninput: {input}\nreference: {reference}\nA: {prediction}\nB: {prediction_b}\n---\nReasoning:\n\n'
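To preview roughly what the LLM will receive, the same template text can be filled in with plain `str.format`. This is a sketch, not the evaluator's internal code path: `TEMPLATE` and `preview` are illustrative names, the sample values are taken from this notebook, and only one criterion is filled in for brevity.

```python
# Plain-Python preview of the custom prompt; the evaluator performs its own
# substitution internally, so treat this only as an approximation.
TEMPLATE = """Given the input context, which do you prefer: A or B?
Evaluate based on the following criteria:
{criteria}
Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.

DATA
----
input: {input}
reference: {reference}
A: {prediction}
B: {prediction_b}
---
Reasoning:

"""

preview = TEMPLATE.format(
    criteria="correctness: Is the submission correct, accurate, and factual?",
    input="What is the name of the dog that ate the ice cream?",
    reference="The dog's name is fido",
    prediction="The dog that ate the ice cream was named fido.",
    prediction_b="The dog's name is spot",
)
print(preview)
```

Printing the filled-in prompt before running an evaluation is a cheap way to catch a missing or misspelled placeholder.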


In [None]:
evaluator.evaluate_string_pairs(
    prediction="The dog that ate the ice cream was named fido.",
    prediction_b="The dog's name is spot",
    input="What is the name of the dog that ate the ice cream?",
    reference="The dog's name is fido",
)