# How to evaluate LLM applications

## 1. General idea of verification and evaluation

We have now built a simple, general-purpose LLM application. Looking back at the whole development process, LLM development, which centers on calling and orchestrating large models, puts more weight on iterative verification than traditional AI development does. Because you can define a prompt in minutes and get feedback within hours, stopping to collect a thousand test samples first would be extremely cumbersome: you can already get results without any training samples.

Therefore, when building an application with an LLM, you will typically go through the following process. First, you tune the prompt on a small set of one to three examples until it works on them. Then, as you test the system further, you run into tricky examples that the prompt or algorithm cannot handle; this is the core challenge for developers building on LLMs. You add these examples to the set you are testing, and the development set grows organically with further difficult cases. Eventually it becomes large enough that running every example by hand to test a prompt is inconvenient, and you start developing metrics to measure performance on this small sample set, such as average accuracy. The interesting thing about this process is that you can stop at any point once you think the system is good enough. In fact, many deployed applications stop at the first or second step and work perfectly well.

![](../../figures/C5-1-eval.png)

In this chapter, we introduce the general methods for verifying and evaluating LLM applications one by one, and design the verification and iteration process for this project to improve the application. Note, however, that system evaluation and optimization are closely tied to the business at hand, so this chapter mainly covers theory; readers are encouraged to practice and explore on their own.

We first introduce several evaluation methods for LLM development. For tasks with simple standard answers, evaluation is easy; but LLM development usually involves complex generation tasks. We briefly introduce several ways to evaluate such tasks when there is no simple answer, or no standard answer at all, while still accurately reflecting how well the application works.

As we continue to find bad cases and make targeted optimizations, we can gradually add these bad cases to the verification set to form a verification set with a certain number of samples. For this kind of verification set, it is impractical to evaluate one by one. We need an automatic evaluation method to achieve an overall evaluation of the performance on the verification set.

After mastering the general idea, we will specifically explore how to evaluate and optimize application performance in large model applications based on the RAG paradigm. Since large model applications developed based on the RAG paradigm generally include two core parts: retrieval and generation, our evaluation optimization will also focus on these two parts, respectively, to optimize the system retrieval accuracy and the generation quality under the given material.

In each part, we first introduce some tips for finding bad cases, along with general ideas for optimizing retrieval or prompts to fix them. Throughout this process, keep in mind the LLM development principles and techniques described in the previous chapters, and always make sure the optimized system does not start making mistakes on samples that previously worked well.

Verification and iteration are essential steps in building LLM-centric applications. By constantly finding bad cases and adjusting prompts or optimizing retrieval, we can drive the application toward the performance and accuracy we are looking for. Next, we briefly introduce several evaluation methods for LLM development and summarize the general path from targeted optimization of a few bad cases to fully automated evaluation.

## 2. Large Model Evaluation Method

When developing an LLM application, we keep finding bad cases and refining the prompt or the retrieval architecture to fix them, thereby improving the system. Every bad case we find is added to the validation set. After each optimization, we re-run every case in the validation set to make sure the optimized system has not lost capabilities or regressed on cases that used to work. When the validation set is small, we can evaluate manually, judging the quality of the system output for each case by hand. As the validation set grows with each round of optimization, however, the time and labor cost of manual evaluation becomes unacceptable. We then need automatic evaluation to score the system's output on each validation case and thus measure the overall performance of the system.
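The re-verification loop described above can be sketched as follows. This is only a minimal illustration, not part of the project code: `ValidationCase`, `evaluate`, and `find_regressions` are hypothetical names, and the `answer_fn` and `score_fn` callables stand in for the application and for whatever scoring method (manual or automatic) is used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ValidationCase:
    """One entry of the validation set (hypothetical structure)."""
    case_id: str
    question: str


def evaluate(cases: List[ValidationCase],
             answer_fn: Callable[[str], str],
             score_fn: Callable[[str, str], float]) -> Dict[str, float]:
    """Score every validation case for one version of the system."""
    return {c.case_id: score_fn(c.question, answer_fn(c.question)) for c in cases}


def find_regressions(old: Dict[str, float], new: Dict[str, float],
                     tolerance: float = 0.0) -> List[str]:
    """Return the ids of cases whose score dropped after an optimization."""
    return [cid for cid, old_score in old.items()
            if cid in new and new[cid] < old_score - tolerance]


# Toy stand-ins for two versions of the application and an exact-match scorer.
cases = [ValidationCase("q1", "What is 1 + 1?"),
         ValidationCase("q2", "What is 2 + 2?")]
gold = {"What is 1 + 1?": "2", "What is 2 + 2?": "4"}
answers_v1 = {"What is 1 + 1?": "2", "What is 2 + 2?": "5"}
answers_v2 = {"What is 1 + 1?": "3", "What is 2 + 2?": "4"}


def exact_match(question: str, answer: str) -> float:
    return 1.0 if answer == gold[question] else 0.0


scores_v1 = evaluate(cases, lambda q: answers_v1[q], exact_match)
scores_v2 = evaluate(cases, lambda q: answers_v2[q], exact_match)
print(find_regressions(scores_v1, scores_v2))  # ids of cases that got worse
```

Keeping per-case scores, rather than only the average, is what makes it possible to notice that a change fixed one bad case while breaking a previously good one.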

We first introduce the general ideas of manual evaluation for reference, then cover the general methods of automatic evaluation in depth and apply them to this system, to evaluate its performance comprehensively and prepare for further optimization and iteration. As before, we first load our vector database and retrieval chain:

In [12]:
import sys
sys.path.append("../C3 搭建知识库") # Add the parent directory to the system path

# Use Zhipu Embedding API. Note that you need to download the encapsulation code implemented in the previous chapter to your local computer.
from zhipuai_embedding import ZhipuAIEmbeddings

from langchain.vectorstores.chroma import Chroma
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv, find_dotenv
import os

_ = load_dotenv(find_dotenv())    # read local .env file
zhipuai_api_key = os.environ['ZHIPUAI_API_KEY']
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Define Embeddings
embedding = ZhipuAIEmbeddings()

# Vector database persistence path
persist_directory = '../../data_base/vector_db/chroma'

# Load the database
vectordb = Chroma(
    persist_directory=persist_directory,  # Directory where the vector database is persisted on disk
    embedding_function=embedding
)

# Using OpenAI GPT-3.5 model
llm = ChatOpenAI(model_name = "gpt-3.5-turbo", temperature = 0)


### 2.1 General idea of ​​manual evaluation

In the early stage of development, the validation set is small, and the simplest and most intuitive method is to evaluate each validation case by hand. Manual evaluation still follows some basic principles, which are briefly introduced here for reference. Note, however, that system evaluation is strongly tied to the business, so the specific evaluation methods and dimensions must be designed with the specific business in mind.

#### Principle 1 Quantitative evaluation

To compare different versions of the system reliably, quantitative evaluation indicators are necessary. We should score the answer to each validation case and then average the scores over all cases to get the score of that version of the system. The scoring scale can be 0~5 or 0~100, depending on personal preference and the actual business situation.

The quantitative indicators should come with scoring guidelines, for example "an answer that meets condition A scores y points", so that different evaluators stay reasonably consistent.

For example, we give two verification cases:

① What is the relationship between the Pumpkin Book and the Watermelon Book?

② How should the Pumpkin Book be used?

Next, we ask the model to answer with prompt version A (brief and to the point) and prompt version B (detailed and specific):

In [13]:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA


template_v1 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。最多使用三句话。尽量使答案简明扼要。总是在回答的最后说“谢谢你的提问！”。
{context}
问题: {question}
"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v1)



qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt":QA_CHAIN_PROMPT})

print("问题一：")
question = "南瓜书和西瓜书有什么关系？"
result = qa_chain({"query": question})
print(result["result"])

print("问题二：")
question = "应该如何使用南瓜书？"
result = qa_chain({"query": question})
print(result["result"])

问题一：
南瓜书是以西瓜书为前置知识进行解析和补充的，主要是对西瓜书中比较难理解的公式进行解析和推导细节的补充。南瓜书的最佳使用方法是以西瓜书为主线，遇到自己推导不出来或者看不懂的公式时再来查阅南瓜书。谢谢你的提问！
问题二：
应该以西瓜书为主线，遇到推导不出来或看不懂的公式时再查阅南瓜书。不建议初学者深究第1章和第2章的公式，等学得更熟练再回来。如果需要查阅南瓜书中没有的公式或发现错误，可以在GitHub上反馈。谢谢你的提问！


Those are the answers produced with prompt version A. Now let's test version B:

In [14]:
template_v2 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
{context}
问题: {question}
有用的回答:"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v2)

qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt":QA_CHAIN_PROMPT})

print("问题一：")
question = "南瓜书和西瓜书有什么关系？"
result = qa_chain({"query": question})
print(result["result"])

print("问题二：")
question = "应该如何使用南瓜书？"
result = qa_chain({"query": question})
print(result["result"])

问题一：
南瓜书和西瓜书之间的关系是南瓜书是以西瓜书的内容为前置知识进行表述的。南瓜书的目的是对西瓜书中比较难理解的公式进行解析，并补充具体的推导细节，以帮助读者更好地理解和学习机器学习领域的知识。因此，最佳使用方法是以西瓜书为主线，遇到自己推导不出来或者看不懂的公式时再来查阅南瓜书。南瓜书的内容主要是为了帮助那些想深究公式推导细节的读者，提供更详细的解释和补充。
问题二：
应该将南瓜书作为西瓜书的补充，主要在遇到自己无法推导或理解的公式时进行查阅。对于初学机器学习的小白来说，建议先简单过一下南瓜书的第1章和第2章的公式，等学得更深入后再回来深究。每个公式的解析和推导都以本科数学基础的视角进行讲解，超纲的数学知识会在附录和参考文献中给出，供感兴趣的同学继续深入学习。如果在南瓜书中找不到想要查阅的公式，或者发现错误，可以在GitHub的Issues中提交反馈，通常会在24小时内得到回复。此外，南瓜书还提供配套视频教程和在线阅读地址，以及最新版PDF获取地址。最后，南瓜书的内容是基于知识共享署名-非商业性使用-相同方式共享4.0国际许可协议进行许可。


It can be seen that version A's prompt works better on case ①, while version B's prompt works better on case ②. If we only make relative comparisons without quantitative indicators, we cannot tell which of the two prompts is better overall; we would have to find a prompt that performs better on every case before moving on, which is very difficult and hinders iterative optimization.

We can assign a score of 1 to 5 to each answer. For example, in the above case, we give version A answer ① a score of 4 and answer ② a score of 2, and version B answer ① a score of 3 and answer ② a score of 5; then, the average score of version A is 3 points, and the average score of version B is 4 points, so version B is better than version A.
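The averaging step above can be written out in a few lines. This is a minimal sketch using the example scores from the text; the dictionary layout and variable names are illustrative choices, not part of the project code.

```python
from statistics import mean

# Manual scores (1-5) from the example above: one score per validation case.
scores = {
    "A": {"case 1": 4, "case 2": 2},
    "B": {"case 1": 3, "case 2": 5},
}

# Average over all validation cases to get one number per prompt version.
averages = {version: mean(per_case.values()) for version, per_case in scores.items()}
best = max(averages, key=averages.get)

print(averages)
print("better version:", best)
```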

#### Principle 2 Multi-dimensional evaluation

A large model is a typical generative model: its answer is text generated by the model, and such an answer generally needs to be evaluated along multiple dimensions. For example, in this project's personal knowledge base Q&A assistant, users usually ask about the content of the knowledge base. The model's answer should make full use of that content, stay consistent with the question, be truthful and valid, and read fluently. An excellent Q&A assistant should answer the user's question well, ensure the answer is correct, and come across as intelligent.

Therefore, we often need to design evaluation indicators for several dimensions and score each of them to evaluate the system comprehensively. Multi-dimensional evaluation should also be combined with quantitative evaluation: the dimensions can share one scoring scale or use different scales, depending on the actual business needs.

For example, in this project, we can design the following dimensions for evaluation:

① Correctness of knowledge search. This dimension needs to check the intermediate results of the system's search for relevant knowledge fragments from the vector database, and evaluate whether the knowledge fragments found by the system can answer the question. This dimension is a 0-1 evaluation, that is, a score of 0 means that the knowledge fragments found cannot answer the question, and a score of 1 means that the knowledge fragments found can answer the question.

② Answer consistency. This dimension evaluates whether the system's answer addresses the user's question, or whether it goes off-topic or misunderstands the question. This dimension is also scored from 0 to 1, where 0 is completely off-topic, 1 is completely on-topic, and any value in between is allowed.

③ Answer hallucination ratio. This dimension needs to integrate the system answer and the knowledge fragments found to evaluate whether the system answer has hallucinations and how high the hallucination ratio is. This dimension is also designed to be 0~1, 0 is all model hallucinations, and 1 is no hallucinations.

④ Answer correctness. This dimension evaluates whether the system answer is correct and whether it fully answers the user's question. It is one of the most core evaluation indicators of the system. This dimension can be scored arbitrarily between 0 and 1.

The above four dimensions are all centered around the correctness of knowledge and answers, and are highly relevant to the problem; the next few dimensions will revolve around the anthropomorphism and grammatical correctness of the results generated by the large model, which are less relevant to the problem:

⑤ Logic. This dimension evaluates whether the system answer is logically coherent, whether there are conflicts and logical confusions. This dimension is evaluated from 0 to 1.

⑥ Fluency. This dimension evaluates whether the system answers are fluent and grammatical, and can be scored anywhere between 0 and 1.

⑦ Intelligence. This dimension evaluates whether the system's answer is human-like and intelligent, that is, whether users could mistake it for a human-written answer. It can be scored anywhere between 0 and 1.

For example, we evaluate the following answers:

In [15]:
print("问题：")
question = "应该如何使用南瓜书？"
print(question)
print("模型回答：")
result = qa_chain({"query": question})
print(result["result"])

问题：
应该如何使用南瓜书？
模型回答：
应该将南瓜书作为西瓜书的补充，主要在遇到自己无法推导或理解的公式时进行查阅。对于初学机器学习的小白来说，建议先简单过一下南瓜书的第1章和第2章，等学得更深入后再回来深究。每个公式的解析和推导都以本科数学基础的视角进行讲解，超纲的数学知识会在附录和参考文献中给出，感兴趣的同学可以继续深入学习。如果南瓜书中没有你想要查阅的公式，或者发现有错误，可以在GitHub的Issues中提交反馈，通常会在24小时内得到回复。最终目的是帮助读者更好地理解和应用机器学习知识，成为合格的理工科学生。


The following are the pieces of knowledge found by the system:

In [16]:
print(result["source_documents"])

[Document(page_content='为主线，遇到自己推导不出来或者看不懂的公式时再来查阅南瓜书；\n• 对于初学机器学习的小白，西瓜书第1 章和第2 章的公式强烈不建议深究，简单过一下即可，等你学得\n有点飘的时候再回来啃都来得及；\n• 每个公式的解析和推导我们都力(zhi) 争(neng) 以本科数学基础的视角进行讲解，所以超纲的数学知识\n我们通常都会以附录和参考文献的形式给出，感兴趣的同学可以继续沿着我们给的资料进行深入学习；\n• 若南瓜书里没有你想要查阅的公式，\n或者你发现南瓜书哪个地方有错误，\n请毫不犹豫地去我们GitHub 的\nIssues（地址：https://github.com/datawhalechina/pumpkin-book/issues）进行反馈，在对应版块\n提交你希望补充的公式编号或者勘误信息，我们通常会在24 小时以内给您回复，超过24 小时未回复的\n话可以微信联系我们（微信号：at-Sm1les）\n；\n配套视频教程：https://www.bilibili.com/video/BV1Mh411e7VU', metadata={'author': '', 'creationDate': "D:20230303170709-00'00'", 'creator': 'LaTeX with hyperref', 'file_path': './data_base/knowledge_db/pumkin_book/pumpkin_book.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': '', 'page': 1, 'producer': 'xdvipdfmx (20200315)', 'source': './data_base/knowledge_db/pumkin_book/pumpkin_book.pdf', 'subject': '', 'title': '', 'total_pages': 196, 'trapped': ''}), Document(page_content='在线阅读地址：https://datawhalechina.github.io/pumpkin-book（仅供第1 版）\n最新版PDF 获取地址：https://g

We made corresponding evaluations:

① Knowledge search correctness - 1

② Answer consistency - 0.8 (the question was answered, but content such as "feedback" strays off-topic)

③ Answer hallucination ratio - 1

④ Answer correctness - 0.8 (same reason as above)

⑤ Logic - 0.7 (the subsequent content is not logically coherent with the previous content)

⑥ Fluency - 0.6 (the final summary is verbose and ineffective)

⑦ Intelligence - 0.5 (has a significant AI answer style)

Combining the seven dimensions above, we can evaluate the system's performance on each case comprehensively. Aggregating the scores of all cases gives the system's performance in each dimension. If all dimensions use the same scale, we can also average across dimensions to get an overall system score, or assign weights to the dimensions according to their importance and use the weighted average instead.
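As a rough sketch of this aggregation, the snippet below combines the seven scores from the example into a plain average and a weighted average. The weights are hypothetical (correctness-related dimensions count double) and would need to be chosen for the actual business; `weighted_score` is an illustrative helper name.

```python
# Scores from the example above, one per dimension (all on a 0-1 scale).
scores = {
    "knowledge search correctness": 1.0,
    "answer consistency": 0.8,
    "answer hallucination ratio": 1.0,
    "answer correctness": 0.8,
    "logic": 0.7,
    "fluency": 0.6,
    "intelligence": 0.5,
}

# Hypothetical weights: the first four (correctness-related) dimensions count double.
weights = {name: 2.0 for name in list(scores)[:4]}
weights.update({name: 1.0 for name in list(scores)[4:]})


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


plain = sum(scores.values()) / len(scores)
weighted = weighted_score(scores, weights)
print(round(plain, 3), round(weighted, 3))
```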

However, the more comprehensive and specific the evaluation, the greater its difficulty and cost. With the seven-dimensional evaluation above, every case of every system version needs seven judgments. With two versions of the system and 10 validation cases, one round of evaluation already takes $10 \times 2 \times 7 = 140$ judgments. As the system keeps improving and iterating, the validation set expands rapidly: a mature system generally has a validation set of at least a few hundred cases and has gone through at least dozens of iterations, so the total number of judgments reaches tens of thousands, which is costly in both manpower and time. Therefore, we need a way to evaluate the model's answers automatically.
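To make the growth concrete, the count can be computed directly. The second call uses hypothetical numbers (300 cases, 30 iterations) chosen only to illustrate "a few hundred" cases and "dozens" of versions.

```python
def manual_evaluations(num_cases: int, num_versions: int, num_dimensions: int) -> int:
    """Number of individual judgments needed to score every case of every version."""
    return num_cases * num_versions * num_dimensions


# The small example from the text: 10 cases, 2 versions, 7 dimensions.
print(manual_evaluations(10, 2, 7))    # 140

# Hypothetical mature system: 300 cases, 30 iterations, still 7 dimensions.
print(manual_evaluations(300, 30, 7))  # 63000
```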

### 2.2 Simple automatic evaluation

One important reason why evaluating large models is complicated is that the answers of a generative model are hard to judge: objective questions are easy to grade, but subjective questions are not. Automatic evaluation is especially difficult for questions without a standard answer. However, at the cost of some evaluation accuracy, we can transform complex subjective questions without standard answers into questions with standard answers and then grade them with simple automatic methods. Two methods are introduced here: constructing objective questions and calculating similarity to a standard answer.

#### Method 1 Constructing objective questions

Subjective questions are hard to evaluate, but for objective questions we can directly check whether the system's answer matches the standard answer. We can therefore turn some subjective questions into single-choice or multiple-choice questions. For example, for the question:

[Open-ended question] Who is the author of the Pumpkin Book?

We can construct this subjective question into the following objective question:

[Multiple-choice question] Who is the author of the Pumpkin Book? A Zhou Zhiming B Xie Wenrui C Qin Zhou D Jia Binbin

The model is required to answer this objective question. We give the standard answer as BCD. The model's answer can be compared with the standard answer to achieve evaluation and scoring. Based on the above ideas, we can construct a prompt question template:

In [17]:
prompt_template = '''
请你做如下选择题：
题目：南瓜书的作者是谁？
选项：A 周志明 B 谢文睿 C 秦州 D 贾彬彬
你可以参考的知识片段：
```
{}
```
请仅返回选择的选项
如果你无法做出选择，请返回空
'''
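The template above hard-codes one question. A possible way to reuse it across validation cases is sketched below. This is an assumption-laden illustration: `CHOICE_TEMPLATE` and `build_choice_prompt` are hypothetical names, the template is an English rendering of the prompt above, and it uses `---` separators around the knowledge fragments purely as a formatting choice.

```python
CHOICE_TEMPLATE = """Please answer the following multiple-choice question.
Question: {question}
Options: {options}
Knowledge fragments you may refer to:
---
{knowledge}
---
Return only the letters of the options you choose.
If you cannot make a choice, return an empty answer.
"""


def build_choice_prompt(question: str, options: dict, knowledge: str) -> str:
    """Fill the template for one validation case.

    options maps an option letter to its text, e.g. {"A": "...", "B": "..."}.
    """
    option_text = " ".join(f"{letter} {text}" for letter, text in sorted(options.items()))
    return CHOICE_TEMPLATE.format(question=question, options=option_text, knowledge=knowledge)


example_prompt = build_choice_prompt(
    question="Who is the author of the Pumpkin Book?",
    options={"A": "Zhou Zhiming", "B": "Xie Wenrui", "C": "Qin Zhou", "D": "Jia Binbin"},
    knowledge="(retrieved knowledge fragments go here)",
)
print(example_prompt)
```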

Of course, because large models are unstable, the system may still return a lot of text explaining its choice even when asked to return only the selected options, so we need to extract the options from the model's answer. We also need a scoring strategy. In general, we can use the usual strategy for multiple-choice questions: 1 point for selecting all correct options, 0.5 points for selecting only some of them, and 0 points for selecting any wrong option or selecting nothing:

In [7]:
def multi_select_score_v1(true_answer: str, generate_answer: str) -> float:
    # true_answer: the correct answer, str type, such as 'BCD'
    # generate_answer: the answer generated by the model, str type
    true_answers = list(true_answer)
    # For ease of calculation, we assume that each question has only four options: A B C D
    # First find the set of wrong options
    false_answers = [item for item in ['A', 'B', 'C', 'D'] if item not in true_answers]
    # If the generated answer contains a wrong option, score 0
    for one_answer in false_answers:
        if one_answer in generate_answer:
            return 0
    # Count how many of the correct options were selected
    if_correct = 0
    for one_answer in true_answers:
        if one_answer in generate_answer:
            if_correct += 1
    if if_correct == 0:
        # Nothing selected
        return 0
    elif if_correct == len(true_answers):
        # All correct options selected
        return 1
    else:
        # Some correct options missing
        return 0.5

Based on the above scoring function, we can test four answers:

① B C

② The author of the Watermelon Book is A Zhou Zhihua

③ You should choose B C D

④ I don’t know

In [8]:
answer1 = 'B C'
answer2 = '西瓜书的作者是 A 周志华'
answer3 = '应该选择 B C D'
answer4 = '我不知道'
true_answer = 'BCD'
print("答案一得分：", multi_select_score_v1(true_answer, answer1))
print("答案二得分：", multi_select_score_v1(true_answer, answer2))
print("答案三得分：", multi_select_score_v1(true_answer, answer3))
print("答案四得分：", multi_select_score_v1(true_answer, answer4))

答案一得分： 0.5
答案二得分： 0
答案三得分： 1
答案四得分： 0


However, our prompt asks the model to make no choice when it cannot answer, rather than to guess. Under the scoring strategy above, a wrong choice and no choice both score 0, which in effect encourages the model to give hallucinated answers. We can therefore adjust the scoring strategy and deduct one point for a wrong choice:

In [9]:
def multi_select_score_v2(true_answer: str, generate_answer: str) -> float:
    # true_answer: the correct answer, str type, such as 'BCD'
    # generate_answer: the answer generated by the model, str type
    true_answers = list(true_answer)
    # For ease of calculation, we assume that each question has only four options: A B C D
    # First find the set of wrong options
    false_answers = [item for item in ['A', 'B', 'C', 'D'] if item not in true_answers]
    # If the generated answer contains a wrong option, deduct one point
    for one_answer in false_answers:
        if one_answer in generate_answer:
            return -1
    # Count how many of the correct options were selected
    if_correct = 0
    for one_answer in true_answers:
        if one_answer in generate_answer:
            if_correct += 1
    if if_correct == 0:
        # Nothing selected
        return 0
    elif if_correct == len(true_answers):
        # All correct options selected
        return 1
    else:
        # Some correct options missing
        return 0.5

As above, we use the second version of the scoring function to score the four answers again:

In [10]:
answer1 = 'B C'
answer2 = '西瓜书的作者是 A 周志华'
answer3 = '应该选择 B C D'
answer4 = '我不知道'
true_answer = 'BCD'
print("答案一得分：", multi_select_score_v2(true_answer, answer1))
print("答案二得分：", multi_select_score_v2(true_answer, answer2))
print("答案三得分：", multi_select_score_v2(true_answer, answer3))
print("答案四得分：", multi_select_score_v2(true_answer, answer4))

答案一得分： 0.5
答案二得分： -1
答案三得分： 1
答案四得分： 0


As you can see, we have achieved fast, automatic, and discriminative evaluation. With this approach, we only need to construct each validation case once, and every subsequent round of verification and iteration can be fully automated.
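The scoring functions above test whether a letter appears anywhere in the output, so an unrelated capital letter (for example the "A" in "API") would count as a choice. One possible refinement, sketched here under stated assumptions, is to extract the options first and score the resulting set. `extract_options` and `score_selection` are hypothetical helpers, and the regular expression is a heuristic: it still treats a standalone English article "A" or a word such as "BAD" as selected options.

```python
import re

# Match a run of option letters (A-D) that is not glued to other ASCII letters,
# so "B C D", "BCD" and "选择B" match, but the "A" in "API" does not.
_OPTION_RUN = re.compile(r"(?<![A-Za-z])[A-D]+(?![A-Za-z])")


def extract_options(generate_answer: str) -> set:
    """Pull the selected option letters out of free-form model output."""
    return {letter for run in _OPTION_RUN.findall(generate_answer) for letter in run}


def score_selection(true_answer: str, generate_answer: str,
                    wrong_penalty: float = -1.0) -> float:
    """Same scoring rule as the second version above, applied to the extracted set."""
    truth = set(true_answer)
    chosen = extract_options(generate_answer)
    if chosen - truth:      # at least one wrong option selected
        return wrong_penalty
    if not chosen:          # nothing selected
        return 0.0
    return 1.0 if chosen == truth else 0.5


print(extract_options("应该选择 B C D"))
print(score_selection("BCD", "B, C"))
print(score_selection("BCD", "An API error occurred"))
```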

However, not all cases can be constructed as objective questions. For some cases that cannot be constructed as objective questions or that will cause the difficulty of the questions to drop sharply when constructed as objective questions, we need to use the second method: calculating the similarity of answers.

#### Method 2: Calculate answer similarity

Evaluating generated answers is not a new problem in NLP. Machine translation and automatic summarization both need to evaluate the quality of generated text. The usual NLP approach is to construct standard answers manually and compute the similarity between each generated answer and its standard answer.

For example, for the question:

What is the goal of Pumpkin Book?

We can first manually construct a standard answer:

Zhou Zhihua's "Machine Learning" (Watermelon Book) is one of the classic introductory textbooks in the field of machine learning. In order to make as many readers as possible understand machine learning through Watermelon Book, Zhou did not elaborate on the derivation details of some formulas in the book, but this may be "not very friendly" to those readers who want to delve into the details of formula derivation. This book aims to analyze the formulas that are more difficult to understand in Watermelon Book, and to supplement the specific derivation details of some formulas.

Then calculate the similarity between the model answer and the standard answer. The more similar, the more correct we think the answer is.

There are many ways to calculate similarity. A common choice is BLEU; its principle is explained in detail in [Zhihu|BLEU Detailed Explanation](https://zhuanlan.zhihu.com/p/223048748). If you do not want to delve into the algorithm, you can simply think of it as a measure of how many word sequences (n-grams) the generated answer shares with the standard answer.

We can call the bleu scoring function in the nltk library to calculate:

In [11]:
from nltk.translate.bleu_score import sentence_bleu
import jieba

def bleu_score(true_answer : str, generate_answer : str) -> float:
    # true_answer: the standard answer, str type
    # generate_answer: the answer generated by the model, str type
    # Tokenize both answers with jieba
    true_answers = list(jieba.cut(true_answer))
    generate_answers = list(jieba.cut(generate_answer))
    # sentence_bleu expects a list of reference token lists, so wrap the single reference in a list
    score = sentence_bleu([true_answers], generate_answers)
    return score

Let's test it:

In [13]:
true_answer = '周志华老师的《机器学习》（西瓜书）是机器学习领域的经典入门教材之一，周老师为了使尽可能多的读者通过西瓜书对机器学习有所了解, 所以在书中对部分公式的推导细节没有详述，但是这对那些想深究公式推导细节的读者来说可能“不太友好”，本书旨在对西瓜书里比较难理解的公式加以解析，以及对部分公式补充具体的推导细节。'

print("答案一：")
answer1 = '周志华老师的《机器学习》（西瓜书）是机器学习领域的经典入门教材之一，周老师为了使尽可能多的读者通过西瓜书对机器学习有所了解, 所以在书中对部分公式的推导细节没有详述，但是这对那些想深究公式推导细节的读者来说可能“不太友好”，本书旨在对西瓜书里比较难理解的公式加以解析，以及对部分公式补充具体的推导细节。'
print(answer1)
score = bleu_score(true_answer, answer1)
print("得分：", score)
print("答案二：")
answer2 = '本南瓜书只能算是我等数学渣渣在自学的时候记下来的笔记，希望能够帮助大家都成为一名合格的“理工科数学基础扎实点的大二下学生”'
print(answer2)
score = bleu_score(true_answer, answer2)
print("得分：", score)

Answer 1 is identical to the standard answer, so it scores 1.0. Answer 2 shares almost no n-grams with the standard answer, so its score is close to 0.

It can be seen that the higher the consistency between the answer and the standard answer, the higher the evaluation score. With this method, we only need to construct a standard answer for each question in the validation set, and then we can achieve automatic and efficient evaluation.

However, this method also has several problems: ① Standard answers must be constructed manually, which can be difficult in some vertical domains. ② Evaluating by similarity can be misleading: if the generated answer closely matches the standard answer but says the opposite in a few key places, making it completely wrong, its BLEU score will still be high. ③ Comparing against a fixed standard answer is inflexible: if the model produces a better answer than the standard one, its score goes down. ④ It cannot evaluate the intelligence or fluency of an answer: an answer stitched together from keywords of the standard answer would be unusable and incomprehensible, yet would still receive a relatively high BLEU score.
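Problem ② can be illustrated with a small, self-contained sketch. The code below is a simplified BLEU-style score (clipped 1-gram and 2-gram precision with a brevity penalty), written here only for illustration; it is not NLTK's implementation, and the English example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, hypothesis, n):
    """Modified n-gram precision: each hypothesis n-gram counts at most as
    often as it appears in the reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


def simple_bleu(reference, hypothesis, max_n=2):
    """Geometric mean of the 1..max_n precisions, times a brevity penalty."""
    precisions = [clipped_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    if len(hypothesis) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(hypothesis))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the pumpkin book explains the hard formulas in the watermelon book".split()
negated = "the pumpkin book never explains the hard formulas in the watermelon book".split()

print(simple_bleu(reference, reference))  # identical answer
print(simple_bleu(reference, negated))    # opposite meaning, still a high score
```

A single inserted word reverses the meaning of the answer, yet almost all n-grams still match, so the score stays high.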

Therefore, for business situations, sometimes we also need some advanced evaluation methods that do not require the construction of standard answers.

### 2.3 Evaluation with Large Models

Manual evaluation is accurate and comprehensive, but costly in labor and time; automatic evaluation is cheap and fast, but can be inaccurate and incomplete. Is there a way to combine the advantages of both and evaluate generative tasks quickly and comprehensively?

Large models such as GPT-4 offer a new approach: using a large model as the evaluator. We can write a prompt that makes the large model act as the evaluator in place of a human; since the large model can give results similar to manual evaluation, we can reuse the multi-dimensional quantitative evaluation described for manual evaluation and achieve fast, comprehensive evaluation.

For example, we can construct the following prompt to let the large model score an answer:

In [18]:
prompt = '''
你是一个模型回答评估员。
接下来，我将给你一个问题、对应的知识片段以及模型根据知识片段对问题的回答。
请你依次评估以下维度模型回答的表现，分别给出打分：

① 知识查找正确性。评估系统给定的知识片段是否能够对问题做出回答。如果知识片段不能做出回答，打分为0；如果知识片段可以做出回答，打分为1。

② 回答一致性。评估系统的回答是否针对用户问题展开，是否有偏题、错误理解题意的情况，打分分值在0~1之间，0为完全偏题，1为完全切题。

③ 回答幻觉比例。该维度需要综合系统回答与查找到的知识片段，评估系统的回答是否出现幻觉，打分分值在0~1之间,0为全部是模型幻觉，1为没有任何幻觉。

④ 回答正确性。该维度评估系统回答是否正确，是否充分解答了用户问题，打分分值在0~1之间，0为完全不正确，1为完全正确。

⑤ 逻辑性。该维度评估系统回答是否逻辑连贯，是否出现前后冲突、逻辑混乱的情况。打分分值在0~1之间，0为逻辑完全混乱，1为完全没有逻辑问题。

⑥ 通顺性。该维度评估系统回答是否通顺、合乎语法。打分分值在0~1之间，0为语句完全不通顺，1为语句完全通顺没有任何语法问题。

⑦ 智能性。该维度评估系统回答是否拟人化、智能化，是否能充分让用户混淆人工回答与智能回答。打分分值在0~1之间，0为非常明显的模型回答，1为与人工回答高度一致。

你应该是比较严苛的评估员，很少给出满分的高评估。
用户问题：
```
{}
```
待评估的回答：
```
{}
```
给定的知识片段：
```
{}
```
你应该返回给我一个可直接解析的 Python 字典，字典的键是如上维度，值是每一个维度对应的评估打分。
不要输出任何其他内容。
'''

We can actually test its effect:

In [20]:
# Use the OpenAI native interface mentioned in Chapter 2

from openai import OpenAI

client = OpenAI(
# This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)


def gen_gpt_messages(prompt):
    '''
    Build the `messages` request parameter for the GPT model

    Args:
        prompt: the user prompt
    '''
    messages = [{"role": "user", "content": prompt}]
    return messages


def get_completion(prompt, model="gpt-3.5-turbo", temperature = 0):
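    '''
    Get the result of calling the GPT model

    Args:
        prompt: the prompt to send
        model: the model to call; defaults to gpt-3.5-turbo, and other models such as gpt-4 can be chosen as needed
        temperature: sampling temperature that controls the randomness of the output, in the range 0~2; the lower the temperature, the more consistent the output
    '''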
    response = client.chat.completions.create(
        model=model,
        messages=gen_gpt_messages(prompt),
        temperature=temperature,
    )
    if len(response.choices) > 0:
        return response.choices[0].message.content
    return "generate answer error"

question = "应该如何使用南瓜书？"
result = qa_chain({"query": question})
answer = result["result"]
knowledge = result["source_documents"]

response = get_completion(prompt.format(question, answer, knowledge))
response

'{\n    "知识查找正确性": 1,\n    "回答一致性": 0.9,\n    "回答幻觉比例": 0.9,\n    "回答正确性": 0.9,\n    "逻辑性": 0.9,\n    "通顺性": 0.9,\n    "智能性": 0.8\n}'
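The evaluator's reply is a plain string, so it has to be parsed before the scores can be averaged or compared across cases. The following is a minimal sketch, assuming the reply is either JSON or a Python dict literal, possibly wrapped in a Markdown code fence or surrounded by stray text. `parse_scores` is a hypothetical helper introduced here for illustration; it is not part of the project code.

```python
import ast
import json
import re


def parse_scores(raw: str) -> dict:
    """Parse an evaluator reply into a {dimension: score} dict (hypothetical helper)."""
    # Keep only the span from the first "{" to the last "}",
    # which drops code fences or stray text around the dictionary.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no dictionary found in evaluator reply")
    text = match.group(0)
    try:
        parsed = json.loads(text)        # strict JSON (double-quoted keys)
    except json.JSONDecodeError:
        parsed = ast.literal_eval(text)  # Python literal (single quotes allowed)
    if not isinstance(parsed, dict):
        raise ValueError("evaluator reply is not a dictionary")
    # Coerce every score to float so later averaging is uniform.
    return {str(key): float(value) for key, value in parsed.items()}
```

With the cell above, `scores = parse_scores(response)` would give a dictionary keyed by dimension name. If the reply cannot be parsed at all, the function raises an exception, which is a useful signal that the evaluation prompt needs to be adjusted.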

However, please note that there are still problems with using large models for evaluation:

① Our goal is to improve the application's performance by iterating on the prompt, so the large model used for evaluation needs to be stronger than the base model the application runs on. For example, at the time of writing GPT-4 was among the most capable models available, so we recommend using GPT-4 as the evaluator.

② Large models are powerful, but their capabilities have limits. If the questions and answers are too complex, the knowledge fragments are too long, or too many evaluation dimensions are requested, even GPT-4 may score incorrectly, return a malformed result, or fail to follow the instructions. In these situations, we recommend the following ways to improve the evaluator's performance:

1. Improve the evaluation prompt. Just as you iterate on the system's own prompt, iteratively optimize the evaluation prompt, paying particular attention to whether it follows the basic principles and core recommendations of Prompt Engineering;

2. Split the evaluation dimensions. If there are too many evaluation dimensions, the model may return a malformed result that cannot be parsed. Consider splitting the dimensions, calling the large model once per dimension, and merging the scores into a single result at the end;

3. Merge evaluation dimensions. If the evaluation dimensions are too fine-grained, the model may not understand them correctly and may score them wrongly. Consider combining several dimensions, for example merging logic, fluency, and intelligence into a single intelligence dimension;

4. Provide detailed evaluation specifications. Without evaluation specifications, it is difficult for the model to give ideal evaluation results. Consider providing detailed and specific evaluation specifications to improve the model's evaluation ability;

5. Provide a small number of examples. The model may find the evaluation specifications difficult to understand. In this case, you can provide a few scored examples for the model to refer to.
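Several of these suggestions can be combined in a small amount of code. The sketch below illustrates splitting the dimensions (one call per dimension), giving each dimension its own scoring specification, and optionally including a few scored examples. Everything in it is an assumption made for illustration: the `RUBRICS` wording, the function names, and the convention that the evaluator replies with a single number. The `ask` argument is any function that takes a prompt string and returns the model's reply, such as `get_completion`.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical rubric: dimension name -> scoring specification shown to the evaluator.
RUBRICS: Dict[str, str] = {
    "answer correctness": "0 = completely wrong, 1 = completely correct and complete.",
    "logic": "0 = incoherent or self-contradictory, 1 = no logical problems.",
}


def build_dimension_prompt(
    dimension: str,
    rubric: str,
    question: str,
    answer: str,
    knowledge: str,
    examples: Sequence[Tuple[str, float]] = (),
) -> str:
    """Build an evaluation prompt that covers exactly one dimension."""
    # Optional few-shot examples: (example answer, reference score) pairs.
    shots = "\n".join(
        f"Example answer: {text}\nScore: {score}" for text, score in examples
    )
    return (
        f"You are a strict evaluator. Score only this dimension: {dimension}.\n"
        f"Scoring specification: {rubric}\n"
        f"{shots}\n"
        f"Question: {question}\nAnswer: {answer}\nKnowledge: {knowledge}\n"
        "Reply with a single number between 0 and 1 and nothing else."
    )


def evaluate_by_dimension(
    ask: Callable[[str], str],
    question: str,
    answer: str,
    knowledge: str,
    rubrics: Dict[str, str] = RUBRICS,
) -> Dict[str, Optional[float]]:
    """Call the evaluator once per dimension and merge the scores."""
    scores: Dict[str, Optional[float]] = {}
    for dimension, rubric in rubrics.items():
        reply = ask(
            build_dimension_prompt(dimension, rubric, question, answer, knowledge)
        )
        try:
            score = float(reply.strip())
        except ValueError:
            scores[dimension] = None  # unparseable reply: record it, do not guess
            continue
        scores[dimension] = min(1.0, max(0.0, score))  # clamp into [0, 1]
    return scores
```

A call such as `evaluate_by_dimension(get_completion, question, answer, str(knowledge))` would then issue one request per dimension. The trade-off is cost and latency: splitting makes each reply trivial to parse, but the number of API calls grows with the number of dimensions.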

### 2.4 Mixed Evaluation

In fact, the evaluation methods above are neither isolated nor mutually exclusive. Rather than relying on any single method, we recommend mixing several and choosing the appropriate one for each dimension, balancing the comprehensiveness, accuracy, and efficiency of the evaluation.

For example, for the personal knowledge base assistant of this project, we can design the following mixed evaluation method:

1. Objective correctness. Objective correctness means that the model gives the correct answer to questions that have a fixed correct answer. We can select some cases and evaluate the model with constructed objective questions.

2. Subjective correctness. Subjective correctness means that the model gives correct and comprehensive answers to subjective questions that have no fixed correct answer. We can select some cases and use large model evaluation to judge whether the model's answer is correct.

3. Intelligence. Intelligence refers to whether the model's answer is sufficiently human-like. Since intelligence depends only weakly on the question itself and strongly on the model and the prompt, and since large models are weak at judging it, we can evaluate it manually on a small sample.

4. Knowledge search correctness. Knowledge search correctness refers to whether, for a specific question, the knowledge fragments retrieved from the knowledge base are correct and sufficient to answer it. We recommend using a large model to evaluate this dimension, that is, asking the model to determine whether the given knowledge fragments are sufficient to answer the question. The results of this dimension can also be combined with subjective correctness to detect hallucination: if the subjective answer is judged correct but the knowledge search is incorrect, the model has hallucinated.
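As a rough illustration of how these four dimensions could be aggregated, the sketch below stores one record per validation case and derives the hallucination flag from the rule in item 4 (answer judged correct, knowledge search judged incorrect). The `CaseResult` fields and the metric names are hypothetical, and the scores are assumed to have been collected already by the methods listed above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class CaseResult:
    """Scores for one validation case; all field names are hypothetical."""
    retrieval_correct: bool                    # large model: are the fragments sufficient?
    objective_correct: Optional[bool] = None   # objective question, graded by rule
    subjective_correct: Optional[bool] = None  # subjective question, large model judgement
    intelligence: Optional[float] = None       # human score, sampled cases only


def average(values: Iterable[Optional[float]]) -> Optional[float]:
    """Mean of the values that are present, or None if there are none."""
    present = [float(v) for v in values if v is not None]
    return sum(present) / len(present) if present else None


def is_hallucination(case: CaseResult) -> bool:
    """Answer judged correct although the knowledge search could not support it."""
    return case.subjective_correct is True and not case.retrieval_correct


def summarize(cases: List[CaseResult]) -> Dict[str, Optional[float]]:
    """Aggregate per-case scores into one report for the validation set."""
    # Hallucination is only defined for cases that have a subjective judgement.
    subjective = [c for c in cases if c.subjective_correct is not None]
    return {
        "objective_correctness": average(c.objective_correct for c in cases),
        "subjective_correctness": average(c.subjective_correct for c in cases),
        "intelligence": average(c.intelligence for c in cases),
        "knowledge_search_correctness": average(c.retrieval_correct for c in cases),
        "hallucination_rate": average(is_hallucination(c) for c in subjective),
    }
```

Each dimension is averaged only over the cases where it was actually measured, so a small manually scored sample for intelligence can sit alongside a larger automatically scored set without distorting the other metrics.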

Applying the evaluation methods above to the validation set examples collected so far gives a reasonable evaluation of the project. Due to time and manpower limitations, the evaluation is not shown in detail here.