# Chapter 10 Evaluation (Part 2) - When there is no simple correct answer

- [1. Environment Configuration](#1-environment-configuration)
- [2. Run the question-answering system to get a complex answer](#2-run-the-question-answering-system-to-get-a-complex-answer)
- [3. Use GPT to evaluate whether the answer is correct](#3-use-gpt-to-evaluate-whether-the-answer-is-correct)
- [4. Give a standard answer and ask it to evaluate the gap between the generated answer and the standard answer](#4-give-a-standard-answer-and-ask-it-to-evaluate-the-gap-between-the-generated-answer-and-the-standard-answer)

In the previous chapter, we saw how to evaluate the output of an LLM when there is a clear correct answer: we can write a function that checks whether the LLM output correctly classifies and lists products.

However, what if the LLM is used to generate text, not just answers to classification problems? Next, we will explore ways to evaluate this type of LLM output.

## 1. Environment Configuration

As in the previous chapter, we first need to configure the environment to use the OpenAI API.

In [1]:
# Import OpenAI API
import os
import openai
import sys
sys.path.append('../..')
import utils_en
import utils_zh

# Set the API key; replace the placeholder with your own key
openai.api_key = "sk-..."

# The following is an example of a configuration method based on environment variables, which is safer. It is for reference only and will not be covered later.
# import openai
# import os
# OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
# openai.api_key = OPENAI_API_KEY

In [2]:
def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, 
                                 max_tokens=500):
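    '''
    Wrapper around the OpenAI GPT-3.5 chat completion API.

    Parameters:
    messages: a list of messages; each message is a dict with a role and content. The role is 'system', 'user', or 'assistant'; the content is that role's message.
    model: the model to call; defaults to gpt-3.5-turbo (ChatGPT). Users with beta access can choose gpt-4.
    temperature: controls the randomness of the output; defaults to 0, which makes the output close to deterministic. Higher values make the output more random.
    max_tokens: the maximum number of tokens in the model's output.
    '''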
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, # controls the randomness of the model's output
        max_tokens=max_tokens, # maximum number of tokens in the model's output
    )
    return response.choices[0].message["content"]

## 2. Run the question-answering system to get a complex answer

In [8]:
# User Messages
customer_msg = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?"""

# Extract product names from questions
products_by_category = utils_en.get_products_from_query(customer_msg)
# Convert product names into a list
category_and_product_list = utils_en.read_string_to_list(products_by_category)
# Find the information corresponding to the product
product_info = utils_en.get_mentioned_product_info(category_and_product_list)
# Generate answers from information
assistant_answer = utils_en.answer_user_msg(user_msg=customer_msg, product_info=product_info)

In [9]:
print(assistant_answer) 

Sure, I'd be happy to help! The SmartX ProPhone is a powerful smartphone with a 6.1-inch display, 128GB storage, 12MP dual camera, and 5G capabilities. The FotoSnap DSLR Camera is a versatile camera with a 24.2MP sensor, 1080p video, 3-inch LCD, and interchangeable lenses. As for TVs, we have a variety of options including the CineView 4K TV with a 55-inch display, 4K resolution, HDR, and smart TV capabilities, the CineView 8K TV with a 65-inch display, 8K resolution, HDR, and smart TV capabilities, and the CineView OLED TV with a 55-inch display, 4K resolution, HDR, and smart TV capabilities. We also have the SoundMax Home Theater system with 5.1 channel, 1000W output, wireless subwoofer, and Bluetooth, and the SoundMax Soundbar with 2.1 channel, 300W output, wireless subwoofer, and Bluetooth. Is there anything else I can help you with?


In [4]:
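'''
Note: because the model understands Chinese less well, the Chinese prompt may fail at random; you can run it several times. You are also very welcome to explore a more stable Chinese prompt.
'''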
# User Messages
customer_msg = f"""
告诉我有关 the smartx pro phone 和 the fotosnap camera, the dslr one 的信息。
另外，你们这有什么 TVs ？"""

# Extract product names from questions
products_by_category = utils_zh.get_products_from_query(customer_msg)
# Convert product names into a list
category_and_product_list = utils_zh.read_string_to_list(products_by_category)
# Find the information corresponding to the product
product_info = utils_zh.get_mentioned_product_info(category_and_product_list)
# Generate answers from information
assistant_answer = utils_zh.answer_user_msg(user_msg=customer_msg, product_info=product_info)

In [5]:
print(assistant_answer) 

关于SmartX ProPhone和FotoSnap DSLR相机的信息：

SmartX ProPhone是一款功能强大的智能手机，具有6.1英寸的显示屏，128GB的存储空间，12MP双摄像头和5G网络。它的价格为899.99美元，保修期为1年。

FotoSnap DSLR相机是一款功能强大的相机，具有24.2MP传感器，1080p视频，3英寸LCD屏幕和可更换镜头。它的价格为599.99美元，保修期为1年。

关于电视的信息：

我们有多种电视可供选择，包括CineView 4K电视，CineView 8K电视和CineView OLED电视。CineView 4K电视具有55英寸的显示屏，4K分辨率，HDR和智能电视功能，价格为599.99美元，保修期为2年。CineView 8K电视具有65英寸的显示屏，8K分辨率，HDR和智能电视功能，价格为2999.99美元，保修期为2年。CineView OLED电视具有55英寸的显示屏，4K分辨率，HDR和智能电视功能，价格为1499.99美元，保修期为2年。您需要哪种类型的电视？


## 3. Use GPT to evaluate whether the answer is correct

We hope you take away a design pattern from this: once you can specify a list of criteria for evaluating an LLM output, you can use a second API call to evaluate the first LLM output against those criteria.

In [8]:
# Question, context
cust_prod_info = {
    'customer_msg': customer_msg,
    'context': product_info
}

In [11]:
def eval_with_rubric(test_set, assistant_answer):
    """
    Evaluate the generated answer with the GPT API.

    Parameters:
    test_set: the test set (customer message and context)
    assistant_answer: the assistant's reply
    """

    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer
    
    # Ask GPT to act as an assistant that evaluates the correctness of the answer
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by looking at the context that the customer service \
    agent is using to generate its response. 
    """

    # Specific instructions
    user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Context]: {context}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
    - Is the Assistant response based only on the context provided? (Y or N)
    - Does the answer include information that is not provided in the context? (Y or N)
    - Is there any disagreement between the response and the context? (Y or N)
    - Count how many questions the user asked. (output a number)
    - For each question that the user asked, is there a corresponding answer to it?
      Question 1: (Y or N)
      Question 2: (Y or N)
      ...
      Question N: (Y or N)
    - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

In [12]:
evaluation_output = eval_with_rubric(cust_prod_info, assistant_answer)
print(evaluation_output)

- Is the Assistant response based only on the context provided? (Y or N)
Y
- Does the answer include information that is not provided in the context? (Y or N)
N
- Is there any disagreement between the response and the context? (Y or N)
N
- Count how many questions the user asked. (output a number)
1
- For each question that the user asked, is there a corresponding answer to it?
  Question 1: Y
- Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
1
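
Because the rubric reply comes back as free text, it can help to convert it into structured data before tracking results over time. The following is a minimal sketch, assuming the grader follows the layout shown above (a `- ` prompt line followed by `Y`, `N`, or a number, plus `Question N: Y` lines). `parse_rubric_output` is a hypothetical helper, not part of the course utilities, and real replies may deviate from this layout.

```python
import re

def parse_rubric_output(text):
    """Hypothetical helper: parse the grader's reply into structured data.

    Assumes each rubric prompt starts with '- ' and its answer (Y, N, or a
    number) appears on the next non-empty line.
    """
    parsed = {"answers": [], "per_question": {}}
    current_prompt = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("- "):
            current_prompt = line[2:]  # remember which rubric item we are in
            continue
        per_question = re.fullmatch(r"Question (\d+): ([YN])", line)
        if per_question:
            number, verdict = per_question.groups()
            parsed["per_question"][int(number)] = (verdict == "Y")
            continue
        if current_prompt is not None and re.fullmatch(r"[YN]|\d+", line):
            # Numbers become ints, Y/N become booleans
            value = int(line) if line.isdigit() else (line == "Y")
            parsed["answers"].append((current_prompt, value))
            current_prompt = None
    return parsed

# Shortened sample in the same layout as the output above
sample_output = """\
- Is there any disagreement between the response and the context? (Y or N)
N
- Count how many questions the user asked. (output a number)
1
- For each question that the user asked, is there a corresponding answer to it?
  Question 1: Y
"""
parsed = parse_rubric_output(sample_output)
print(parsed["answers"])
print(parsed["per_question"])
```

Replies that do not follow this layout (for example, a reply written as prose) yield few or no parsed answers, so it is worth checking the number of parsed items before trusting the result.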


In [6]:
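def eval_with_rubric(test_set, assistant_answer):
    """
    Evaluate the generated answer with the GPT API (Chinese prompt).

    Parameters:
    test_set: the test set (customer message and context)
    assistant_answer: the assistant's reply
    """
    
    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer
    
    # System message: the grader's role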
    system_message = """\
    你是一位助理，通过查看客户服务代理使用的上下文来评估客户服务代理回答用户问题的情况。
    """

    # Specific instructions
    user_message = f"""\
    你正在根据代理使用的上下文评估对问题的提交答案。以下是数据：
    [开始]
    ************
    [用户问题]: {cust_msg}
    ************
    [使用的上下文]: {context}
    ************
    [客户代理的回答]: {completion}
    ************
    [结束]

    请将提交的答案的事实内容与上下文进行比较，忽略样式、语法或标点符号上的差异。
    回答以下问题：
    助手的回应是否只基于所提供的上下文？（是或否）
    回答中是否包含上下文中未提供的信息？（是或否）
    回应与上下文之间是否存在任何不一致之处？（是或否）
    计算用户提出了多少个问题。（输出一个数字）
    对于用户提出的每个问题，是否有相应的回答？
    问题1：（是或否）
    问题2：（是或否）
    ...
    问题N：（是或否）
    在提出的问题数量中，有多少个问题在回答中得到了回应？（输出一个数字）
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

In [10]:
evaluation_output = eval_with_rubric(cust_prod_info, assistant_answer)
print(evaluation_output)

助手的回应是基于所提供的上下文。回答中没有包含上下文中未提供的信息。回应与上下文之间没有任何不一致之处。

用户提出了两个问题。

对于用户提出的每个问题，都有相应的回答。

问题1：是
问题2：是

在提出的问题数量中，所有问题都在回答中得到了回应，因此输出数字为2。


## 4. Give a standard answer and ask it to evaluate the gap between the generated answer and the standard answer

In classic natural language processing, there are traditional metrics for measuring how similar an LLM output is to output written by human experts. For example, the BLEU score measures how similar two pieces of text are.
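
To make the idea of an overlap metric concrete, here is a simplified BLEU-style score in pure Python. It is a sketch under stated assumptions, not the standard BLEU implementation: it uses whitespace tokenization, only unigrams and bigrams, a single reference, and no smoothing. The function names are our own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def simple_bleu(candidate_text, reference_text, max_n=2):
    """Simplified BLEU-style score (assumption: one reference, no smoothing)."""
    candidate = candidate_text.lower().split()
    reference = reference_text.lower().split()
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no overlap at some n-gram order
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize candidates shorter than the reference
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)

identical = simple_bleu("the tv has hdr", "the tv has hdr")
paraphrase = simple_bleu("The TV supports HDR", "HDR is available on the TV")
unrelated = simple_bleu("life is like a box of chocolates", "the tv has hdr")
print(identical, paraphrase, unrelated)
```

Two answers that state the same fact in different words can still score well below 1 on an overlap metric like this, which is a limitation when the goal is to compare factual content rather than wording.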

There is actually a better way, which is to use a prompt. You can write a prompt that asks an LLM to compare how well the automatically generated customer service response matches the ideal response written by a human.

In [13]:
test_set_ideal = {
    'customer_msg': """\
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?""",
    'ideal_answer':"""\
Of course!  The SmartX ProPhone is a powerful \
smartphone with advanced camera features. \
For instance, it has a 12MP dual camera. \
Other features include 5G wireless and 128GB storage. \
It also has a 6.1-inch display.  The price is $899.99.

The FotoSnap DSLR Camera is great for \
capturing stunning photos and videos. \
Some features include 1080p video, \
3-inch LCD, a 24.2MP sensor, \
and interchangeable lenses. \
The price is 599.99.

For TVs and TV related products, we offer 3 TVs \


All TVs offer HDR and Smart TV.

The CineView 4K TV has vibrant colors and smart features. \
Some of these features include a 55-inch display, \
'4K resolution. It's priced at 599.

The CineView 8K TV is a stunning 8K TV. \
Some features include a 65-inch display and \
8K resolution.  It's priced at 2999.99

The CineView OLED TV lets you experience vibrant colors. \
Some features include a 55-inch display and 4K resolution. \
It's priced at 1499.99.

We also offer 2 home theater products, both which include bluetooth.\
The SoundMax Home Theater is a powerful home theater system for \
an immmersive audio experience.
Its features include 5.1 channel, 1000W output, and wireless subwoofer.
It's priced at 399.99.

The SoundMax Soundbar is a sleek and powerful soundbar.
It's features include 2.1 channel, 300W output, and wireless subwoofer.
It's priced at 199.99

Are there any questions additional you may have about these products \
that you mentioned here?
Or may do you have other questions I can help you with?
    """
}

In [None]:
'''Test set for the Chinese prompt'''
test_set_ideal = {
    'customer_msg': """\
告诉我有关 the smartx pro phone 和 the fotosnap camera, the dslr one 的信息。\n另外，你们这有什么 TVs ？""",
    'ideal_answer':"""\
SmartX ProPhone是一款功能强大的智能手机，具有6.1英寸的显示屏，128GB的存储空间，12MP双摄像头和5G网络。它的价格为899.99美元，保修期为1年。
FotoSnap DSLR相机是一款功能强大的相机，具有24.2MP传感器，1080p视频，3英寸LCD屏幕和可更换镜头。它的价格为599.99美元，保修期为1年。
我们有多种电视可供选择，包括CineView 4K电视，CineView 8K电视和CineView OLED电视。CineView 4K电视具有55英寸的显示屏，4K分辨率，HDR和智能电视功能，价格为599.99美元，保修期为2年。CineView 8K电视具有65英寸的显示屏，8K分辨率，HDR和智能电视功能，价格为2999.99美元，保修期为2年。CineView OLED电视具有55英寸的显示屏，4K分辨率，HDR和智能电视功能，价格为1499.99美元，保修期为2年
    """
}

In [14]:
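def eval_vs_ideal(test_set, assistant_answer):
    """
    Evaluate whether the reply matches the ideal answer.

    Parameters:
    test_set: the test set (customer message and ideal answer)
    assistant_answer: the assistant's reply
    """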
    cust_msg = test_set['customer_msg']
    ideal = test_set['ideal_answer']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by comparing the response to the ideal (expert) response
    Output a single letter and nothing else. 
    """

    user_message = f"""\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Expert]: {ideal}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (D) There is a disagreement between the submitted answer and the expert answer.
    (E) The answers differ, but these differences don't matter from the perspective of factuality.
  choice_strings: ABCDE
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

In [None]:
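def eval_vs_ideal(test_set, assistant_answer):
    """
    Evaluate whether the reply matches the ideal answer (Chinese prompt).

    Parameters:
    test_set: the test set (customer message and ideal answer)
    assistant_answer: the assistant's reply
    """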
    cust_msg = test_set['customer_msg']
    ideal = test_set['ideal_answer']
    completion = assistant_answer
    
    system_message = """\
    您是一位助理，通过将客户服务代理的回答与理想（专家）回答进行比较，评估客户服务代理对用户问题的回答质量。
    请输出一个单独的字母（A 、B、C、D、E），不要包含其他内容。 
    """

    user_message = f"""\
    您正在比较一个给定问题的提交答案和专家答案。数据如下:
    [开始]
    ************
    [问题]: {cust_msg}
    ************
    [专家答案]: {ideal}
    ************
    [提交答案]: {completion}
    ************
    [结束]

    比较提交答案的事实内容与专家答案。忽略样式、语法或标点符号上的差异。
    提交的答案可能是专家答案的子集、超集，或者与之冲突。确定适用的情况，并通过选择以下选项之一回答问题：
    （A）提交的答案是专家答案的子集，并且与之完全一致。
    （B）提交的答案是专家答案的超集，并且与之完全一致。
    （C）提交的答案包含与专家答案完全相同的细节。
    （D）提交的答案与专家答案存在分歧。
    （E）答案存在差异，但从事实的角度来看这些差异并不重要。
    选项：ABCDE
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

This rubric comes from OpenAI's open-source evaluation framework (OpenAI Evals), which contains many evaluation methods contributed by both OpenAI developers and the broader open-source community.

In this rubric, we ask the LLM to compare the factual content of the submission with the expert answer, ignoring differences in style, grammar, and punctuation. The key step is that it outputs a grade from A to E according to whether the submission is a subset of, a superset of, or an exact match with the expert answer, or conflicts with it. A superset may mean that the submission made up or fabricated some additional facts.

The LLM will choose the most appropriate description.
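
Since the grader is asked for a single letter, downstream code may want to validate that reply and attach the rubric's meaning to it. Below is a minimal sketch; `GRADE_MEANINGS` and `interpret_grade` are hypothetical names introduced here, and the short descriptions paraphrase options (A) to (E) from the prompt above.

```python
# Short paraphrases of the five rubric options
GRADE_MEANINGS = {
    "A": "subset of the expert answer, fully consistent",
    "B": "superset of the expert answer, fully consistent",
    "C": "same details as the expert answer",
    "D": "disagrees with the expert answer",
    "E": "differs, but not in ways that matter for factuality",
}

def interpret_grade(raw_reply):
    """Hypothetical helper: validate the grader's reply and label it."""
    # Tolerate whitespace, parentheses, and a trailing period, e.g. " (a) "
    letter = raw_reply.strip().strip("().").upper()
    if letter not in GRADE_MEANINGS:
        raise ValueError(f"Unexpected grader reply: {raw_reply!r}")
    return {
        "grade": letter,
        "meaning": GRADE_MEANINGS[letter],
        "conflict": letter == "D",  # only D signals a factual disagreement
    }

result = interpret_grade(" (a) ")
print(result["grade"], "-", result["meaning"])
```

Raising an error on anything other than A to E is a deliberate choice in this sketch: a grader that ignores the "single letter" instruction is easier to spot than one whose reply is silently miscounted.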

In [15]:
print(assistant_answer)

Sure, I'd be happy to help! The SmartX ProPhone is a powerful smartphone with a 6.1-inch display, 128GB storage, 12MP dual camera, and 5G capabilities. The FotoSnap DSLR Camera is a versatile camera with a 24.2MP sensor, 1080p video, 3-inch LCD, and interchangeable lenses. As for TVs, we have a variety of options including the CineView 4K TV with a 55-inch display, 4K resolution, HDR, and smart TV capabilities, the CineView 8K TV with a 65-inch display, 8K resolution, HDR, and smart TV capabilities, and the CineView OLED TV with a 55-inch display, 4K resolution, HDR, and smart TV capabilities. We also have the SoundMax Home Theater system with 5.1 channel, 1000W output, wireless subwoofer, and Bluetooth, and the SoundMax Soundbar with 2.1 channel, 300W output, wireless subwoofer, and Bluetooth. Is there anything else I can help you with?


In [16]:
eval_vs_ideal(test_set_ideal, assistant_answer)
# For this generated answer, GPT determines that the generated content is a subset of the standard answer

'A'

In [17]:
assistant_answer_2 = "life is like a box of chocolates"

In [18]:
eval_vs_ideal(test_set_ideal, assistant_answer_2)
# For a clearly off-topic answer, GPT reports a disagreement with the expert answer (D)

'D'

In [14]:
print(assistant_answer)

关于SmartX ProPhone和FotoSnap DSLR相机的信息：

SmartX ProPhone是一款功能强大的智能手机，具有6.1英寸的显示屏，128GB的存储空间，12MP双摄像头和5G网络。它的价格为899.99美元，保修期为1年。

FotoSnap DSLR相机是一款功能强大的相机，具有24.2MP传感器，1080p视频，3英寸LCD屏幕和可更换镜头。它的价格为599.99美元，保修期为1年。

关于电视的信息：

我们有多种电视可供选择，包括CineView 4K电视，CineView 8K电视和CineView OLED电视。CineView 4K电视具有55英寸的显示屏，4K分辨率，HDR和智能电视功能，价格为599.99美元，保修期为2年。CineView 8K电视具有65英寸的显示屏，8K分辨率，HDR和智能电视功能，价格为2999.99美元，保修期为2年。CineView OLED电视具有55英寸的显示屏，4K分辨率，HDR和智能电视功能，价格为1499.99美元，保修期为2年。您需要哪种类型的电视？


In [19]:
eval_vs_ideal(test_set_ideal, assistant_answer)

'B'

In [16]:
assistant_answer_2 = "life is like a box of chocolates"

In [17]:
eval_vs_ideal(test_set_ideal, assistant_answer_2)

'D'

Hopefully, you learned two design patterns from this chapter.

1. Even without an ideal answer from an expert, you can use one LLM to evaluate the output of another, as long as you can write down an evaluation rubric.

2. If you can provide an ideal answer from an expert, your LLM can better judge whether a particular assistant output is similar to that ideal answer.

We hope these tools help you evaluate the output of your LLM system, so that you can continuously monitor and improve its performance during development.
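
One way to turn these graders into ongoing monitoring is to run them over a small set of test cases and tally the grades. The sketch below assumes a grading callable with the same signature as `eval_vs_ideal`; `summarize_grades` and `stub_grader` are hypothetical names, and the stub stands in for the real API call so that the example runs offline.

```python
from collections import Counter

def summarize_grades(test_cases, grade_fn):
    """Tally letter grades over (test_set, answer) pairs.

    grade_fn is any callable with the signature of eval_vs_ideal.
    """
    tally = Counter(grade_fn(test_set, answer).strip()
                    for test_set, answer in test_cases)
    total = sum(tally.values())
    # Share of answers that conflict with the expert answer (grade D)
    conflict_rate = tally["D"] / total if total else 0.0
    return tally, conflict_rate

def stub_grader(test_set, answer):
    """Offline stand-in for eval_vs_ideal (assumption: for illustration only)."""
    return "D" if "chocolates" in answer else "A"

cases = [
    ({"customer_msg": "q1", "ideal_answer": "a1"}, "a relevant answer"),
    ({"customer_msg": "q2", "ideal_answer": "a2"}, "another relevant answer"),
    ({"customer_msg": "q3", "ideal_answer": "a3"}, "life is like a box of chocolates"),
]
tally, conflict_rate = summarize_grades(cases, stub_grader)
print(dict(tally), conflict_rate)
```

In practice you would pass `eval_vs_ideal` instead of the stub and rerun the tally whenever the prompts or the product data change.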