# Evaluate and optimize the generation part

In the previous chapter, we discussed how to evaluate the overall performance of an LLM application built on the RAG framework. By constructing a targeted validation set, we can evaluate system performance along several dimensions using a variety of methods. The purpose of evaluation, however, is to improve the application. To do that, we need to take the evaluation results, break down the Bad Cases they reveal, and evaluate and optimize each part of the system separately.

RAG stands for Retrieval-Augmented Generation, so it has two core parts: the retrieval part and the generation part. The retrieval part must ensure that the system can find the answer fragment that corresponds to the user's query. The generation part must ensure that, once the system has the correct answer fragment, it makes full use of the large model's ability to generate a correct answer that meets the user's requirements.

To optimize an LLM application, we usually need to work on both parts: evaluate the retrieval part and the generation part separately, find the Bad Cases, and optimize each in a targeted manner. For the generation part, when the base model is fixed, we usually improve the generated answers through Prompt Engineering. In this chapter, we use the application we just built, the personal knowledge base assistant, to explain how to evaluate and analyze the generation part, find Bad Cases in a targeted manner, and improve the generation part through Prompt Engineering.

Before we start, let's load our vector database and retrieval chain:

In [7]:
import sys
sys.path.append("../C3 搭建知识库") # add this directory to the system path so the import below can be found

# Use the Zhipu Embedding API. Note: the wrapper code implemented in the previous chapter must be downloaded to your local machine.
from zhipuai_embedding import ZhipuAIEmbeddings

from langchain.vectorstores.chroma import Chroma
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv, find_dotenv
import os

_ = load_dotenv(find_dotenv())    # read local .env file
zhipuai_api_key = os.environ['ZHIPUAI_API_KEY']
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Define Embeddings
embedding = ZhipuAIEmbeddings()

# Vector database persistence path
persist_directory = '../../data_base/vector_db/chroma'

# Load the database
vectordb = Chroma(
    persist_directory=persist_directory,  # directory where the vector database is persisted on disk
    embedding_function=embedding
)

# Using OpenAI GPT-3.5 model
llm = ChatOpenAI(model_name = "gpt-3.5-turbo", temperature = 0)

# Route requests through a local proxy; adjust or remove these lines to match your network
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
os.environ["HTTP_PROXY"] = 'http://127.0.0.1:7890'
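
The setup cell reads its keys with `os.environ[...]`, which raises a bare `KeyError` if a key is absent from the `.env` file. As a small optional pre-check, the sketch below lists which keys are missing before any client is created. The helper name `missing_env_vars` is hypothetical and is not part of the original project.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    # Fall back to the real process environment when no mapping is passed in.
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Example with an explicit mapping, so the check runs without real keys.
example_env = {"ZHIPUAI_API_KEY": "dummy-key"}
print(missing_env_vars(["ZHIPUAI_API_KEY", "OPENAI_API_KEY"], example_env))
```

Calling `missing_env_vars(["ZHIPUAI_API_KEY", "OPENAI_API_KEY"])` with no second argument checks the real environment after `load_dotenv` has run.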


We first create a template-based retrieval chain using the initial prompt:

In [8]:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA


template_v1 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。最多使用三句话。尽量使答案简明扼要。总是在回答的最后说“谢谢你的提问！”。
{context}
问题: {question}
"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v1)



qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt":QA_CHAIN_PROMPT})
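
When tuning a prompt, it helps to look at the full text the model receives rather than only the template. The sketch below fills a template by hand with plain `str.format`. It assumes the chain joins the retrieved chunks into `{context}` with a blank line between them, which is roughly what the default "stuff" chain type does. The function name `render_prompt`, the demo template, and the demo chunks are all stand-ins for illustration.

```python
def render_prompt(template, docs, question, separator="\n\n"):
    """Fill a QA template by hand to inspect the text the model would see."""
    # Join the retrieved chunks into one context string (assumed separator).
    context = separator.join(docs)
    return template.format(context=context, question=question)


# Hypothetical stand-in template and chunks, for illustration only.
demo_template = "Answer using the context below.\n{context}\nQuestion: {question}\n"
demo_docs = [
    "Chunk one about the Pumpkin Book.",
    "Chunk two about the Watermelon Book.",
]
print(render_prompt(demo_template, demo_docs, "What is the Pumpkin Book?"))
```

Note that `str.format` treats every pair of braces in the template as a placeholder, so a template containing literal braces would need them doubled.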

Test the effect first:

In [9]:
question = "什么是南瓜书"
result = qa_chain({"query": question})
print(result["result"])

南瓜书是对《机器学习》（西瓜书）中比较难理解的公式进行解析和补充推导细节的书籍。南瓜书的最佳使用方法是以西瓜书为主线，遇到推导困难或看不懂的公式时再来查阅南瓜书。谢谢你的提问！


## 1. Improve the intuitive quality of answers

There are many ways to find Bad Cases. The simplest and most intuitive is to read the answers, judge their quality directly, and check them against the original source content to decide what is lacking. For example, we can construct a Bad Case from the test above:

Question: What is the Pumpkin Book?

Initial answer: The Pumpkin Book is a book that explains the harder-to-understand formulas in "Machine Learning" (the Watermelon Book) and fills in the derivation details. The best way to use it is to follow the Watermelon Book as the main thread and consult the Pumpkin Book when a derivation is difficult or a formula is unclear. Thank you for your question!

Shortcoming: The answer is too brief and needs to be more specific; the closing "Thank you for your question!" feels rigid and can be removed.

Let's modify the Prompt template accordingly: add a requirement for specific answers and remove the "Thank you for your question" instruction:

In [4]:
template_v2 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
{context}
问题: {question}
有用的回答:"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v2)
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt":QA_CHAIN_PROMPT})

question = "什么是南瓜书"
result = qa_chain({"query": question})
print(result["result"])

南瓜书是一本针对周志华老师的《机器学习》（西瓜书）的补充解析书籍。它旨在对西瓜书中比较难理解的公式进行解析，并补充具体的推导细节，以帮助读者更好地理解机器学习领域的知识。南瓜书的内容是以西瓜书为前置知识进行表述的，最佳使用方法是在遇到自己推导不出来或者看不懂的公式时来查阅。南瓜书的编写团队致力于帮助读者成为合格的“理工科数学基础扎实点的大二下学生”，并提供了在线阅读地址和最新版PDF获取地址供读者使用。


As you can see, the improved v2 prompt produces a more specific and detailed answer, which solves the previous problem. But we can think further: could requiring specific, detailed answers make some answers unfocused and bury their key points? Let's test the following question:

In [5]:
question = "使用大模型时，构造 Prompt 的原则有哪些"
result = qa_chain({"query": question})
print(result["result"])

在使用大型语言模型时，构造Prompt的原则主要包括编写清晰、具体的指令和给予模型充足的思考时间。首先，Prompt需要清晰明确地表达需求，提供足够的上下文信息，以确保语言模型准确理解用户的意图。这就好比向一个对人类世界一无所知的外星人解释事物一样，需要详细而清晰的描述。过于简略的Prompt会导致模型难以准确把握任务要求。

其次，给予语言模型充足的推理时间也是至关重要的。类似于人类解决问题时需要思考的时间，模型也需要时间来推理和生成准确的结果。匆忙的结论往往会导致错误的输出。因此，在设计Prompt时，应该加入逐步推理的要求，让模型有足够的时间进行逻辑思考，从而提高结果的准确性和可靠性。

通过遵循这两个原则，设计优化的Prompt可以帮助语言模型充分发挥潜力，完成复杂的推理和生成任务。掌握这些Prompt设计原则是开发者成功应用语言模型的重要一步。在实际应用中，不断优化和调整Prompt，逐步逼近最佳形式，是构建高效、可靠模型交互的关键策略。


As you can see, the model's answer to our question about the LLM course is detailed and specific, and it draws fully on the course content. However, the answer is organized with connectives such as "first" and "second" and runs to three long paragraphs, so the key points do not stand out and it is hard to read. We therefore construct the following Bad Case:

Question: What are the principles for constructing prompts when using a large model?

Initial answer: Omitted

Shortcoming: No focus, vague

For this Bad Case, we can improve the prompt by requiring the model to number its points whenever the answer has several, so that the answer is clear and specific:

In [6]:
template_v3 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
如果答案有几点，你应该分点标号回答，让答案清晰具体
{context}
问题: {question}
有用的回答:"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v3)
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt":QA_CHAIN_PROMPT})

question = "使用大模型时，构造 Prompt 的原则有哪些"
result = qa_chain({"query": question})
print(result["result"])

1. 编写清晰、具体的指令是构造 Prompt 的第一原则。Prompt需要明确表达需求，提供充足上下文，使语言模型准确理解意图。过于简略的Prompt会使模型难以完成任务。

2. 给予模型充足思考时间是构造Prompt的第二原则。语言模型需要时间推理和解决复杂问题，匆忙得出的结论可能不准确。因此，Prompt应该包含逐步推理的要求，让模型有足够时间思考，生成更准确的结果。

3. 在设计Prompt时，要指定完成任务所需的步骤。通过给定一个复杂任务，给出完成任务的一系列步骤，可以帮助模型更好地理解任务要求，提高任务完成的效率。

4. 迭代优化是构造Prompt的常用策略。通过不断尝试、分析结果、改进Prompt的过程，逐步逼近最优的Prompt形式。成功的Prompt通常是通过多轮调整得出的。

5. 添加表格描述是优化Prompt的一种方法。要求模型抽取信息并组织成表格，指定表格的列、表名和格式，可以帮助模型更好地理解任务，并生成符合预期的结果。

总之，构造Prompt的原则包括清晰具体的指令、给予模型充足思考时间、指定完成任务所需的步骤、迭代优化和添加表格描述等。这些原则可以帮助开发者设计出高效、可靠的Prompt，发挥语言模型的最大潜力。


There are many ways to improve answer quality. The key is to reason from the specific business scenario, identify what is unsatisfactory in the initial answers, and make targeted improvements. We will not go into further detail here.
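
As Bad Cases accumulate, it becomes useful to record them so that each prompt revision can be checked against all of them, not only the latest one. The sketch below is one possible way to do this, not part of the original project: `BadCase`, `is_fixed`, and `recheck` are hypothetical names, and each check is a deliberately simple string heuristic. A stub stands in for the chain so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BadCase:
    question: str
    shortcoming: str
    # Returns True when a new answer no longer shows the recorded problem.
    is_fixed: Callable[[str], bool]


def recheck(cases: List[BadCase], answer_fn: Callable[[str], str]) -> Dict[str, bool]:
    """Re-run every recorded question and report whether its check now passes."""
    return {case.question: case.is_fixed(answer_fn(case.question)) for case in cases}


cases = [
    BadCase(
        question="什么是南瓜书",
        shortcoming="closing thank-you line feels rigid",
        is_fixed=lambda answer: "谢谢你的提问" not in answer,
    ),
    BadCase(
        question="使用大模型时，构造 Prompt 的原则有哪些",
        shortcoming="no focus, points are not numbered",
        is_fixed=lambda answer: answer.lstrip().startswith("1."),
    ),
]


def stub_answer(question: str) -> str:
    # Stand-in for the real chain; always returns a numbered answer.
    return "1. First point\n2. Second point"


print(recheck(cases, stub_answer))
```

With the chain from this chapter, `answer_fn` could be `lambda q: qa_chain({"query": q})["result"]`. String checks like these only catch surface problems; judging whether an answer is specific enough still needs a human or a model-based evaluator.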

## 2. Indicate the source of knowledge to increase credibility

Because large models hallucinate, we sometimes suspect that an answer does not come from the knowledge base content. This matters most in scenarios where factual accuracy must be guaranteed, for example:

In [7]:
question = "强化学习的定义是什么"
result = qa_chain({"query": question})
print(result["result"])

强化学习是一种机器学习方法，旨在让智能体通过与环境的交互学习如何做出一系列好的决策。在强化学习中，智能体会根据环境的状态选择一个动作，然后根据环境的反馈（奖励）来调整其策略，以最大化长期奖励。强化学习的目标是在不确定的情况下做出最优的决策，类似于让一个小孩通过不断尝试来学会走路的过程。强化学习的应用范围广泛，包括游戏玩法、机器人控制、交通优化等领域。在强化学习中，智能体和环境之间不断交互，智能体根据环境的反馈来调整其策略，以获得最大的奖励。


We can ask the model to cite the source of its knowledge when generating an answer. This discourages the model from fabricating knowledge that is not in the given data, and it also increases our confidence in the generated answers:

In [8]:
template_v4 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
如果答案有几点，你应该分点标号回答，让答案清晰具体。
请你附上回答的来源原文，以保证回答的正确性。
{context}
问题: {question}
有用的回答:"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v4)
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt":QA_CHAIN_PROMPT})

question = "强化学习的定义是什么"
result = qa_chain({"query": question})
print(result["result"])

强化学习是一种机器学习方法，旨在让智能体通过与环境的交互学习如何做出一系列好的决策。在这个过程中，智能体会根据环境的反馈（奖励）来调整自己的行为，以最大化长期奖励的总和。强化学习的目标是在不确定的情况下做出最优的决策，类似于让一个小孩通过不断尝试来学会走路的过程。强化学习的交互过程由智能体和环境两部分组成，智能体根据环境的状态选择动作，环境根据智能体的动作输出下一个状态和奖励。强化学习的应用非常广泛，包括游戏玩法、机器人控制、交通管理等领域。【来源：蘑菇书一语二语二强化学习教程】。


However, attaching the source text usually lengthens the context and slows the response, so whether to require it should be decided based on the business scenario.
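
A complementary option that costs no extra model output: the chains in this chapter are built with `return_source_documents=True`, so the retrieved chunks come back alongside the answer under the `"source_documents"` key. The sketch below summarizes them. It assumes each document exposes `page_content` and `metadata` attributes and that the loader stored a `"source"` entry in the metadata, which depends on how the knowledge base was built. The name `list_sources` and the file name `rl_tutorial.pdf` are hypothetical, and stand-in objects are used so the example runs without a vector store.

```python
from types import SimpleNamespace


def list_sources(result, max_chars=60):
    """Summarize the retrieved documents attached to a chain result."""
    lines = []
    for i, doc in enumerate(result.get("source_documents", []), start=1):
        source = doc.metadata.get("source", "unknown")  # assumed metadata key
        snippet = doc.page_content[:max_chars].replace("\n", " ")
        lines.append(f"[{i}] {source}: {snippet}")
    return lines


# Stand-in objects with the same two attributes as a retrieved document.
fake_result = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(
            page_content="Reinforcement learning is ...",
            metadata={"source": "rl_tutorial.pdf"},  # hypothetical file name
        ),
        SimpleNamespace(page_content="An agent interacts with ...", metadata={}),
    ],
}
for line in list_sources(fake_result):
    print(line)
```

This shows what was retrieved, not what the model actually relied on, so it supplements the citation request in the prompt and does not replace it.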

## 3. Construct a chain of thought

Large models can often understand and execute instructions well, but they still have limitations, such as hallucination, failing to understand more complex instructions, and failing to carry out multi-step tasks. We can reduce the impact of these limitations by constructing a chain of thought, that is, by structuring the prompt as a series of steps. For example, we can construct a two-step chain of thought and require the model to reflect in the second step, to eliminate hallucination as far as possible.

First, consider the following Bad Case:

Question: How should we construct an LLM project?

Initial answer: Omitted

Shortcoming: The knowledge base content on constructing an LLM project is actually about building an application with the LLM API. The model's answer looks plausible, but it is a hallucination stitched together from loosely related text.

In [9]:
question = "我们应该如何去构造一个LLM项目"
result = qa_chain({"query": question})
print(result["result"])

构建一个LLM项目需要考虑以下几个步骤：

1. 确定项目目标和需求：首先要明确你的项目是为了解决什么问题或实现什么目标，确定需要使用LLM的具体场景和任务。

2. 收集和准备数据：根据项目需求，收集和准备适合的数据集，确保数据的质量和多样性，以提高LLM的性能和效果。

3. 设计Prompt和指令微调：根据项目需求设计合适的Prompt，确保指令清晰明确，可以引导LLM生成符合预期的文本。

4. 进行模型训练和微调：使用基础LLM或指令微调LLM对数据进行训练和微调，以提高模型在特定任务上的表现和准确性。

5. 测试和评估模型：在训练完成后，对模型进行测试和评估，检查其在不同场景下的表现和效果，根据评估结果进行必要的调整和优化。

6. 部署和应用模型：将训练好的LLM模型部署到实际应用中，确保其能够正常运行并实现预期的效果，持续监测和优化模型的性能。

来源：根据提供的上下文内容进行总结。


To this end, we can optimize the prompt and turn the previous prompt into two steps, requiring the model to reflect in the second step:

In [12]:
template_v5 = """
请你依次执行以下步骤：
① 使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答案。
你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
如果答案有几点，你应该分点标号回答，让答案清晰具体。
上下文：
{context}
问题: 
{question}
有用的回答:
② 基于提供的上下文，反思回答中有没有不正确或不是基于上下文得到的内容，如果有，回答你不知道
确保你执行了每一个步骤，不要跳过任意一个步骤。
"""

# Named template_v5 so it does not shadow the template_v4 defined in the previous section
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v5)
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt":QA_CHAIN_PROMPT})

question = "我们应该如何去构造一个LLM项目"
result = qa_chain({"query": question})
print(result["result"])

根据上下文中提供的信息，构造一个LLM项目需要考虑以下几个步骤：

1. 确定项目目标：首先要明确你的项目目标是什么，是要进行文本摘要、情感分析、实体提取还是其他任务。根据项目目标来确定LLM的使用方式和调用API接口的方法。

2. 设计Prompt：根据项目目标设计合适的Prompt，Prompt应该清晰明确，指导LLM生成符合预期的结果。Prompt的设计需要考虑到任务的具体要求，比如在文本摘要任务中，Prompt应该包含需要概括的文本内容。

3. 调用API接口：根据设计好的Prompt，通过编程调用LLM的API接口来生成结果。确保API接口的调用方式正确，以获取准确的结果。

4. 分析结果：获取LLM生成的结果后，进行结果分析，确保结果符合项目目标和预期。如果结果不符合预期，可以调整Prompt或者其他参数再次生成结果。

5. 优化和改进：根据分析结果的反馈，不断优化和改进LLM项目，提高项目的效率和准确性。可以尝试不同的Prompt设计、调整API接口的参数等方式来优化项目。

通过以上步骤，可以构建一个有效的LLM项目，利用LLM的强大功能来实现文本摘要、情感分析、实体提取等任务，提高工作效率和准确性。如果有任何不清楚的地方或需要进一步的指导，可以随时向相关领域的专家寻求帮助。


As we can see, after being asked to reflect, the model corrected its hallucination and gave the correct answer. Chains of thought can support many other functions as well, which we will not cover here; readers are welcome to experiment.
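
The reflection step can also be run as a second, separate model call, so that the check sees the finished draft and its verdict can be handled in code. The sketch below is a variation on the single-prompt approach, not the method used in this chapter. The prompt texts, the `SUPPORTED` / `UNSUPPORTED` convention, and the names `answer_with_reflection` and `llm_call` are assumptions for illustration, and a stub replaces the real model so the example runs offline.

```python
from typing import Callable

ANSWER_PROMPT = (
    "Use the context to answer the question. If you do not know, say so.\n"
    "Context:\n{context}\nQuestion:\n{question}\nAnswer:"
)
REFLECT_PROMPT = (
    "Context:\n{context}\nDraft answer:\n{draft}\n"
    "Reply SUPPORTED if every claim in the draft comes from the context, "
    "otherwise reply UNSUPPORTED."
)


def answer_with_reflection(llm_call: Callable[[str], str], context: str, question: str) -> str:
    """Step 1 drafts an answer; step 2 asks the model to check the draft against the context."""
    draft = llm_call(ANSWER_PROMPT.format(context=context, question=question))
    verdict = llm_call(REFLECT_PROMPT.format(context=context, draft=draft))
    if verdict.strip().upper().startswith("SUPPORTED"):
        return draft
    return "I don't know."


def stub_llm(prompt: str) -> str:
    # Stand-in model: rejects the draft when it receives the reflection prompt.
    if prompt.startswith("Context:"):
        return "UNSUPPORTED"
    return "Train and fine-tune your own model."


print(answer_with_reflection(stub_llm, "Build apps with the LLM API.", "How to build an LLM project?"))
```

In the notebook, `llm_call` could be the `get_completion` helper from the next section. The trade-off is one extra request per question in exchange for a verdict that can be logged and acted on.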

## 4. Add instruction parsing

A common requirement is for the model to produce output in a format we specify. However, because we use a Prompt Template to wrap the user's question, format requirements contained in the question are often ignored. For example:

In [13]:
question = "LLM的分类是什么？给我返回一个 Python List"
result = qa_chain({"query": question})
print(result["result"])

根据上下文提供的信息，LLM（Large Language Model）的分类可以分为两种类型，即基础LLM和指令微调LLM。基础LLM是基于文本训练数据，训练出预测下一个单词能力的模型，通常通过在大量数据上训练来确定最可能的词。指令微调LLM则是对基础LLM进行微调，以更好地适应特定任务或场景，类似于向另一个人提供指令来完成任务。

根据上下文，可以返回一个Python List，其中包含LLM的两种分类：["基础LLM", "指令微调LLM"]。


As you can see, although we asked for a Python List, the requirement was wrapped inside the template and the model did not follow it: it returned prose with the list embedded in the last sentence. For this problem, we can construct a Bad Case:

Question: What are the categories of LLM? Return me a Python List

Initial answer: According to the context provided, LLM can be divided into basic LLM and instruction fine-tuning LLM.

Shortcoming: The output does not follow the format required in the instruction

One existing solution is to add an LLM layer in front of the retrieval LLM to parse the instruction, separating the format requirement in the user's question from the question itself. This idea is the prototype of the currently popular Agent mechanism: for a user instruction, an LLM (i.e., the Agent) interprets the instruction, decides which tools it requires, and then calls those tools. Each tool can be an LLM driven by a different prompt, or a database, an API, and so on. LangChain does include an Agent mechanism, but we will not go into it in this tutorial. Here we implement the idea in a simple form on top of OpenAI's native interface:

In [18]:
# Use the OpenAI native interface mentioned in Chapter 2

from openai import OpenAI

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)


def gen_gpt_messages(prompt):
    '''
    Build the messages request parameter for the GPT model.

    Args:
        prompt: the user prompt
    '''
    messages = [{"role": "user", "content": prompt}]
    return messages


def get_completion(prompt, model="gpt-3.5-turbo", temperature = 0):
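    '''
    Get the result of a GPT model call.

    Args:
        prompt: the prompt to send
        model: the model to call; defaults to gpt-3.5-turbo, and other models such as gpt-4 can be chosen as needed
        temperature: sampling temperature controlling the randomness of the output, in the range 0 to 2. The lower the temperature, the more consistent the output.
    '''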
    response = client.chat.completions.create(
        model=model,
        messages=gen_gpt_messages(prompt),
        temperature=temperature,
    )
    if len(response.choices) > 0:
        return response.choices[0].message.content
    return "generate answer error"

prompt_input = '''
请判断以下问题中是否包含对输出的格式要求，并按以下要求输出：
请返回给我一个可解析的Python列表，列表第一个元素是对输出的格式要求，应该是一个指令；第二个元素是去掉格式要求的问题原文
如果没有格式要求，请将第一个元素置为空
需要判断的问题：
```
{}
```
不要输出任何其他内容或格式，确保返回结果可解析。
'''



Let's test the LLM's ability to decompose the format requirements:

In [19]:
response = get_completion(prompt_input.format(question))
response

'```\n["给我返回一个 Python List", "LLM的分类是什么？"]\n```'

As you can see, with the prompt above the LLM separates the format requirement from the question well. Next, we set up another LLM call that rewrites the answer according to the format requirement:

In [20]:
prompt_output = '''
请根据回答文本和输出格式要求，按照给定的格式要求对问题做出回答
需要回答的问题：
```
{}
```
回答文本：
```
{}
```
输出格式要求：
```
{}
```
'''

We can then chain the two LLM calls together with the retrieval chain:

In [24]:
import ast

question = 'LLM的分类是什么？给我返回一个 Python List'
# First, separate the format requirement from the question
input_lst_s = get_completion(prompt_input.format(question))
# Locate the list: first '[' and last ']' (the reply may be wrapped in code fences)
start_loc = input_lst_s.find('[')
end_loc = input_lst_s.rfind(']')
# Parse with ast.literal_eval rather than eval, so model output is never executed as code
rule, new_question = ast.literal_eval(input_lst_s[start_loc:end_loc+1])
# Then call the retrieval chain with the question alone
result = qa_chain({"query": new_question})
result_context = result["result"]
# Finally, rewrite the answer according to the format requirement
response = get_completion(prompt_output.format(new_question, result_context, rule))
response

"['基础LLM', '指令微调LLM']"

As you can see, after these steps the output follows the requested format. The core of the code above is the Agent idea. For both the Agent mechanism and the Parser mechanism (that is, constraining the output format), LangChain provides mature tooling; interested readers are welcome to explore it in depth, and we will not cover it here.
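
Because the instruction-parsing step depends on the model returning a well-formed list, it is worth guarding against replies that are wrapped in code fences, contain extra text, or are not a list at all. The sketch below is one defensive way to parse such a reply: it uses `ast.literal_eval`, which only accepts Python literals, and falls back to the original question with no format rule when parsing fails. The function name `parse_rule_and_question` and the fallback policy are assumptions, not part of the original code.

```python
import ast


def parse_rule_and_question(raw: str, original_question: str):
    """Extract (format_rule, question) from a model reply, falling back to the original question."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return "", original_question
    try:
        # literal_eval only evaluates literals, so the reply is never run as code.
        parsed = ast.literal_eval(raw[start:end + 1])
    except (ValueError, SyntaxError):
        return "", original_question
    if isinstance(parsed, list) and len(parsed) == 2 and isinstance(parsed[1], str):
        # Treat a non-string first element (for example None) as "no format rule".
        rule = parsed[0] if isinstance(parsed[0], str) else ""
        return rule, parsed[1]
    return "", original_question


reply = '```\n["给我返回一个 Python List", "LLM的分类是什么？"]\n```'
print(parse_rule_and_question(reply, "LLM的分类是什么？给我返回一个 Python List"))
```

When the fallback triggers, the pipeline can skip the formatting call and return the retrieval chain's answer unchanged.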

With the ideas above, combined with actual business conditions, we can keep discovering Bad Cases and optimizing the Prompt in a targeted manner, thereby improving the performance of the generation part. However, all of this assumes that the retrieval part retrieves the correct answer fragment, that is, that the accuracy and recall of retrieval are as high as possible. So how can we evaluate and optimize the retrieval part? We will explore this in depth in the next chapter.