# VOC - Customer Service Analysis

In this notebook, we will look at the Customer Service Analysis use case, and go through the general process of prompt optimization.

### Step 0: Environment Preparation

In [None]:
# Update SDK and Install related libraries
# You can ignore the error related to installation
%pip install --quiet --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"

%pip install --quiet langchain==0.0.309 "transformers>=4.24,<5"

In [None]:
# Set up IAM
import json
import os
import sys
import boto3

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww


# ---- ⚠️ Un-comment and edit the below lines as needed for your AWS setup ⚠️ ----

# os.environ["AWS_DEFAULT_REGION"] = "<REGION_NAME>"  # E.g. "us-east-1"
# os.environ["AWS_PROFILE"] = "<YOUR_PROFILE>"
# os.environ["BEDROCK_ASSUME_ROLE"] = "<YOUR_ROLE_ARN>"  # E.g. "arn:aws:..."

boto3_bedrock = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None)
)

### Raw Data Inspection

In [None]:
# Read in the sample data you just generated in Task 0
import pandas as pd
raw = pd.read_csv('yourname_voc_gen_.csv')
# '故障分类' is the fault-category (label) column
raw.groupby('故障分类').describe()

### Step1: Prompt Template Generation

In this step, you will learn how to write a base prompt for classification.  <br>

<b>Trick:<br></b>

1, Try changing the temperature level: <br>
Temperature is a parameter that controls the randomness of a model's predictions during generation. Higher temperatures produce more creative samples with more variation in phrasing (and, in the case of fiction, variation in the answers as well), while lower temperatures produce more conservative samples that stick to the most probable phrasing and answer. Raising the temperature encourages a language model to explore rare, uncommon, or surprising next words or sequences, rather than only selecting the most likely predictions. Claude Slackbot, for example, uses a non-zero temperature when generating responses, which allows some variation in its answers.

2, Try generating multiple times and select the best answer
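
The effect of temperature can be illustrated with a small, self-contained sketch. It applies the textbook temperature-scaled softmax to a few made-up next-token scores; the scores and the function name `softmax_with_temperature` are assumptions for illustration only, and the notebook does not state how Claude applies temperature internally.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy selection")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words
logits = [2.0, 1.0, 0.5]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a low temperature nearly all probability mass lands on the top-scoring word, while a higher temperature spreads it across the alternatives, which is why repeated generations vary more.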

#### <font color="purple">Using Claude to Generate a Prompt Template for Classification from Scratch</font>

In [None]:
# Set Up Claude Parameters
from langchain.llms.bedrock import Bedrock

inference_modifier = {'max_tokens_to_sample':4096, 
                      "temperature":1,
                      "top_k":250,
                      "top_p":1,
                      "stop_sequences": ["\n\nHuman"]
                     }

textgen_llm = Bedrock(model_id = "anthropic.claude-v2",
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )

# Func used to call llm while calculating execution time.  Set if_print to 0 to avoid printing.
import time
def timer_llm(prompt, if_print = 1):
    start_time = time.time()
    response = textgen_llm(prompt)
    end_time = time.time()
    elapsed_time = end_time - start_time
    if if_print == 1:
        print("----------------------------------------- OutPut -----------------------------------------")
        print("Elapsed time: ", elapsed_time, "seconds")
    return response

In [None]:
# This prompt is used to generate a classification prompt template
p_p_gen = '''
\n\nHuman:
你是一个prompt书写助手，你的任务是按照<instructions>里面的要求帮助我写一个用作客服分类的提示词模版。

<instructions>
1, 该prompt模版将用来进行客服问题分类
2，8个客服问题分类在<cate>中
<cate>
备件/商务咨询
床
扫描架
扫描问题
探测器
操作台
球管/高压
无法判断
</cate>
3，输出prompt模版应该包括三个部分:
<task>任务</task>
<requirements>任务要求</requirements>
<output_format>输出格式要求</output_format>
</instructions>

\n\nAssistant:
'''

response = timer_llm(p_p_gen)
result = response[response.index('\n')+1:]
print_ww(result)

#### <font color="purple">Alternatively, Try Using the Suggested Claude Template for General Classification Tasks as a Starting Point</font>

https://docs.google.com/spreadsheets/u/0/d/1TlOKgJe4gziNBVSL5FDjRvZI-f7oHQUF6deQkcM6rBk/htmlview?pli=1#

In [None]:
# This prompt is used to re-write/optimize a base prompt
p_claude_for_classification = '''
You are a customer service agent that is classifying emails by type. I want you to give your answer and then explain it.

How would you categorize this email?
<email>
{{EMAIL}}
</email>

Categories are:
(A) Pre-sale question
(B) Broken or defective item
(C) Billing question
(D) Other (please explain)

'''

p_p_gen2 = '''
\n\nHuman:
你是一个prompt改写助理，你的任务是按照<instructions>里面的要求，根据给出的基础prompt模版<given_prompt_template>改写一个用作客服分类的提示词模版。

这里是原始的prompt模型<given_prompt_template>
{claude_pe_template}
</given_prompt_template>

<instructions>
1, 改写后的提示词模版应全部为简体中文
2，任务是根据客服描述定位客服问题分类
3，输入为客服描述不是邮件
4，8个客服问题分类在<cate>中
<cate>
备件/商务咨询
床
扫描架
扫描问题
探测器
操作台
球管/高压
无法判断
</cate>
5，think step by step
</instructions>

\n\nAssistant:
'''
prompt = p_p_gen2.format(claude_pe_template=p_claude_for_classification)
response = timer_llm(prompt)
result = response[response.index('\n')+1:]
print_ww(result)

### Step2: Using Claude to Generate Classification Rules/Insights

In [None]:
# Feed all the data to Claude and ask it to generate rules for classification
examples = ''
for i in range(raw.shape[0]):
    symptom = raw.iloc[i][0]
    cla = raw.iloc[i][1]
    eg_tmp = '问题： ' + symptom + '\n' + '分类： ' + cla + '\n\n'
    examples = examples + eg_tmp
#print(examples)

p_rule = '''
\n\nHuman: 你是一名客服人员，请从下面的examples中总结出每个故障分类的规则:\n
<examples>\n
{input_examples}
\n</examples>\n
并请用下面的format作答：\n
<format>\n
(A) 备件/商务咨询：
(B) 床：
(C) 扫描架：
(D) 扫描问题：
(E) 探测器：
(F) 操作台：
(G) 球管/高压：
(H) 无法判断：
\n</format>\n
\n\nAssistant:
'''

prompt = p_rule.format(input_examples=examples)
response = timer_llm(prompt)
result = response[response.index('\n')+1:]
print_ww(result)

### Step3: Assemble Base Prompt 

Manually assemble your prompt based on results from Step1 and Step2
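
As a minimal sketch of how such a template gets filled in, the snippet below uses Python's built-in `str.format`. The miniature template is hypothetical; only the `{input_description}` slot name is taken from the prompts in this notebook. Note that doubled braces are emitted as literal single braces rather than treated as slots.

```python
# Hypothetical miniature template; the tag names mirror the prompts in this notebook
template = (
    "<description>\n"
    "{input_description}\n"      # named slot, replaced by .format(...)
    "</description>\n"
    "Literal braces: {{EMAIL}}"  # doubled braces are NOT a slot
)

filled = template.format(input_description="The table cannot move.")
print(filled)
```

A template that contains only doubled-brace placeholders will therefore format without error but will never receive the input text.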

In [None]:
# This is an example
p_base_example = '''
\n\nHuman: 您是一个客服代表,需要根据分类要求<instructions>对客户的描述<description>对客服问题进行分类。

<description>中是问题描述：
<description>
{input_description}
</description>

<instructions>中是分类要求：
<instructions>
1， 将上方<description>中描述的客服问题分类到以下8个类别中
备件/商务咨询: 提到物品价格,搬迁,流程等与商务有关的问题。
床: 提到床无法移动,出入不便等与床操作相关的问题。
扫描架: 提到gantry, revolution等与扫描架相关的关键词。
扫描问题: 提到无法获得图像,出现伪影,硬件错误等直接与扫描质量相关的问题。
探测器: 提到探测器温度,指针移动产生假影等与探测器相关的问题。
操作台: 提到开机失败,工作站不能使用,传不进PACS等与操作台相关的问题。
球管/高压: 提到球管故障,报错,异响等与球管和高压相关的问题。
无法判断: 对于一些语句信息不足无法判断分类的问题。
2，每个问题只选择一个最适合的类别
3，如果无法判断类别,则选择“无法判断”
4，必须严格用<example>中给出的样例格式回复, 输出结果中必须包含：<answer></answer>和<reasoning></reasoning>：
<example>
<answer>：
操作台
</answer>
<reasoning>：
根据分类规则工作站不能使用，死机，黑屏属于操作台故障分类。
</reasoning>：
</example>
</instructions>

\n\nAssistant:
<answer>分类结果</answer>
<reasoning>分析</reasoning>

'''

#### Now write your own prompt <br>

<b>Requirements: </b><br>
1, Put classification answer into "\<answer>\</answer>" tags in the response <br>
2, Put reasoning into "\<reasoning>\</reasoning>" tags in the response
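
The reason for the tags is that they make the response easy to parse. The following is one possible sketch using a regular expression; the helper name `extract_tag` and the sample response are assumptions for illustration, not the parsing code used in the batch-testing step.

```python
import re

def extract_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical model response
sample = "<answer>\nConsole\n</answer>\n<reasoning>The workstation cannot be used.</reasoning>"
print(extract_tag(sample, "answer"))
print(extract_tag(sample, "reasoning"))
print(extract_tag("no tags here", "answer"))
```

Returning `None` when a tag is missing lets the caller decide how to handle responses that ignore the requested format.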

In [None]:
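# Now write your own base prompt for batch testing.
# Replace each {{...}} placeholder below; the batch-test function fills in the input
# via .format(input_description=...), so the description slot must read {input_description}.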
p_base = '''
\n\nHuman: 您是一个客服代表,需要根据分类要求<instructions>对客户的描述<description>对客服问题进行分类。

<description>中是问题描述：
<description>
{{PLACEHOLDER FOR THE INPUT}}
</description>

<instructions>中是分类要求：
<instructions>
{{YOUR OWN INSTRUCTIONS: TAKE PREVIOUS RESULTS AS REFERENCES}}
</instructions>

<output_format>中是输出格式要求：
<output_format>
{{YOUR OWN OUTPUT FORMAT REQUIREMENTS/EXAMPLES}}
</output_format>

\n\nAssistant:
'''

### Step4: Batch Testing

In this step, you will learn how to perform batch testing and analyze results.  <br>

<b>Trick:<br></b>

1, You should reset some of the parameters to help stabilize the testing results.  For example, set the temperature to 0 from this point on. <br>

2, It will take roughly 3-5 minutes to finish batch testing for the entire test set of size 50: the processing time depends on your prompt complexity, input length, and output length (the model may try to generate a long analysis).  Therefore, it is also advisable to set a desired output length, either within the prompt itself or through the parameters.

In [None]:
# func used to do batch testing
# input_df: test dataset df
# input_prompt: prompt template used to do batch testing
# output_file_name: for example, "xxx.csv"
def mybatchtest(input_df, input_prompt, output_file_name):
    df = input_df.copy()
    df['预测分类'] = ''  # predicted category
    df['预测分析'] = ''  # predicted reasoning
    df['预测结果'] = ''  # prediction result: 'T' (correct) or 'F' (incorrect)
    for i in range(df.shape[0]):
    #for i in range(3):
        symptom = df.iloc[i][0]
        prompt = input_prompt.format(input_description = symptom)
        response = timer_llm(prompt, if_print = 0)
        result = response[response.index('\n')+1:]
        #print_ww(result)
        if  '<answer>' in result and '</answer>' in result:
            answer = result[result.index('<answer>')+len('<answer>'):result.index('</answer>')].replace('\n','').replace('>','')
        else:   
            answer = result
        if  '<reasoning>' in result and '</reasoning>' in result:
            reasoning = result[result.index('<reasoning>')+11:result.index('</reasoning>')].replace('\n','').replace('>','')
        else:   
            reasoning = 'tbd'
        #print(answer)
        #print(reasoning)

        gt = df.iloc[i][1]  # ground-truth category
        if gt in answer:
            pred = 'T'
        else:
            pred = 'F'

        df.iat[i,2] = answer
        df.iat[i,3] = reasoning
        df.iat[i,4] = pred
    # save results
    df.to_csv(output_file_name,index = False)
    return df

# func used to do batch testing with resampling
# input_df: test dataset df
# input_prompt_base: prompt base template used to do batch testing
# input_prompt: prompt template used to do evaluation for resampled results
# resample_times: the number of times to resample
# output_file_name: for example, "xxx.csv"
def mybatchtest_resample(input_df, input_prompt_base, input_prompt, output_file_name, resample_times = 3):
    df = input_df.copy()
    df['预测分类'] = ''
    df['预测分析'] = ''
    df['预测结果'] = ''
    for i in range(df.shape[0]):
    #for i in range(3):
        symptom = df.iloc[i][0]
        prompt_base = input_prompt_base.format(input_description = symptom)
        
        answers = ''
        for j in range(resample_times):
            response = timer_llm(prompt_base, if_print = 0)
            result = response[response.index('\n')+1:]
            if  '<answer>' in result and '</answer>' in result:
                answer = result[result.index('<answer>')+len('<answer>'):result.index('</answer>')].replace('\n','').replace('>','')
            else:
                answer = result
            if  '<reasoning>' in result and '</reasoning>' in result:
                reasoning = result[result.index('<reasoning>')+11:result.index('</reasoning>')].replace('\n','').replace('>','')
            else:
                reasoning = 'tbd'
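            # label each sample by its resample index j; use '/' in closing tags ('\a' and '\r' are escape sequences)
            answer_tmp = 'id' + str(j) +':\n' + '<answer>： ' + answer + '</answer>\n' + '<reasoning>： ' + reasoning + '</reasoning>\n\n' 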
            answers = answers + answer_tmp

        prompt_vote = input_prompt.format(input_description = symptom, input_answers = answers)
        response = timer_llm(prompt_vote, if_print = 0)
        result = response[response.index('\n')+1:]
        if  '<answer>' in result and '</answer>' in result:
            answer = result[result.index('<answer>')+len('<answer>'):result.index('</answer>')].replace('\n','').replace('>','')
        else:   
            answer = result
        if  '<reasoning>' in result and '</reasoning>' in result:
            reasoning = result[result.index('<reasoning>')+11:result.index('</reasoning>')].replace('\n','').replace('>','')
        else:   
            reasoning = 'tbd'
        #print(answer)
        #print(reasoning)
        
        gt = df.iloc[i][1]  # ground-truth category
        if gt in answer:
            pred = 'T'
        else:
            pred = 'F'

        df.iat[i,2] = answer
        df.iat[i,3] = reasoning
        df.iat[i,4] = pred
    # save results
    df.to_csv(output_file_name,index = False)
    return df

# Func used to calculate Accuracy
# data: df containing test results
def accuracy(data):
    correct = (data['预测结果'] == 'T').sum()
    total = len(data)
    return correct/total

In [None]:
inference_modifier = {'max_tokens_to_sample':4096, 
                      "temperature":0,
                      "top_k":250,
                      "top_p":1,
                      "stop_sequences": ["\n\nHuman"]
                     }

textgen_llm = Bedrock(model_id = "anthropic.claude-v2",
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )

# adjust your output file name, for example: heather_test_results_base.csv
df_base = mybatchtest(raw, p_base, 'REPLACE_YOUR_NAME_HERE_test_results_base.csv')

### Step5: Analyze Results

Calculate the accuracy; this will be our baseline.
The results may look very promising, but keep in mind that this could be a result of overfitting, since our test dataset is very small.
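
To get a feel for how much a small test set limits what the accuracy tells you, one option is to compute a confidence interval around it. The sketch below uses the Wilson score interval, which is a standard statistical technique and not part of this notebook's workflow; the figures (45 correct out of 50) are hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half

# Hypothetical result: 45 of 50 test cases classified correctly
low, high = wilson_interval(45, 50)
print(f"accuracy = {45 / 50:.2f}, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

With only 50 cases the interval is wide, so small differences in accuracy between two prompts may not be meaningful.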

In [None]:
# Calculate Accuracy
acc_base = accuracy(df_base)
print('Accuracy Base:', acc_base)

Now, analyze the results and select the worst-performing classes as the targets to improve.  

<b>Trick: <br></b>
1, It may be a very good idea to take a closer look at the bad cases and think about what the reason may be.<br>
2, You may also want to look at multiple classes together, since the problem may be that Claude is having difficulty differentiating them.
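
One way to see which classes get mixed up is to count the (true class, predicted class) pairs among the misclassified rows. The following is a minimal sketch on made-up labels; the function name `confusion_counts` is hypothetical, and it compares labels by exact equality, whereas the batch test in this notebook checks whether the true label appears inside the answer text.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs for misclassified rows only."""
    pairs = Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth != pred:
            pairs[(truth, pred)] += 1
    return pairs

# Hypothetical labels for five test cases
true_labels = ["gantry", "gantry", "scan", "table", "scan"]
predicted = ["scan", "scan", "scan", "table", "gantry"]

counts = confusion_counts(true_labels, predicted)
for (truth, pred), n in counts.most_common():
    print(f"true={truth} predicted={pred}: {n}")
```

Pairs with high counts point to class definitions that may need sharper wording in the instructions.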

In [None]:
# Check accuracy by categories
df_base.groupby('故障分类').apply(accuracy)

In [None]:
# Filter out all the cases that have been incorrectly classified
df_incorrect_cases = df_base[df_base['预测结果'] == 'F']
df_incorrect_cases.describe()

examples = ''
for i in range(df_incorrect_cases.shape[0]):
    symptom = df_incorrect_cases.iloc[i][0]
    gt = df_incorrect_cases.iloc[i][1]
    pred = df_incorrect_cases.iloc[i][2].replace('>','')
    reasoning = df_incorrect_cases.iloc[i][3]
    eg_tmp = 'id' + str(i) +':\n' + '问题描述： ' + symptom + '\n' + '正确分类： ' + gt + '\n' + '预测分类： ' + pred + '\n' + '预测分析： ' + reasoning + '\n\n' 
    examples = examples + eg_tmp
df_base_incorrect_cases = examples
print(df_base_incorrect_cases)

### Step6: Improve Prompt and Re-Test

#### <font color="purple">Using Claude to Generate Insights About the Results</font>

In [None]:
original_instructions ='''
1， 将问题描述分类到以下8个类别中
(A) 备件/商务咨询: 提到物品价格,搬迁,流程等与商务有关的问题。
(B) 床: 提到床无法移动,出入不便等与床操作相关的问题。
(C) 扫描架: 提到gantry, revolution等与扫描架相关的关键词。
(D) 扫描问题: 提到无法获得图像,出现伪影,硬件错误等直接与扫描质量相关的问题。
(E) 探测器: 提到探测器温度,指针移动产生假影等与探测器相关的问题。
(F) 操作台: 提到开机失败,工作站不能使用,传不进PACS等与操作台相关的问题。
(G) 球管/高压: 提到球管故障,报错,异响等与球管和高压相关的问题。
(H) 无法判断: 对于一些语句信息不足无法判断分类的问题。
2，每个问题只选择一个最适合的类别
3，如果无法判断类别,则选择“无法判断”
'''

# This prompt is used to generate insights and rules
p_insights = '''
\n\nHuman: 您是一个客服代表,需要按照故障分类要求<instructions>定位客户问题描述中的故障类型。<incorrect_examples>中是一些错分的例子，请分析分类错误可能的原因写入<insights>并将建议添加的分类要求写入<suggested_instructions>。

<instructions>中是分类要求：
<instructions>
{input_original_instructions}
</instructions>

<incorrect_examples>中是错分的例子：
<incorrect_examples>
{input_incorrect_cases}
</incorrect_examples>

\n\nAssistant:
<insights>可能导致分类错误的原因</insights>
<suggested_instructions>建议添加的分类要求</suggested_instructions>
'''

# df_base_incorrect_cases is the text block of misclassified cases built in Step5
prompt = p_insights.format(input_original_instructions = original_instructions, input_incorrect_cases = df_base_incorrect_cases)
response = timer_llm(prompt)
result = response[response.index('\n')+1:]
print_ww(result)

In [None]:
# Manually add the suggested instructions in your base prompt, and use it to re-do batch testing
# below is an example
p_base_add_suggested_instructions_example = '''
\n\nHuman: 您是一个客服代表,需要根据分类要求<instructions>对客户的描述<description>对客服问题进行分类。

<description>中是问题描述：
<description>
{input_description}
</description>

<instructions>中是分类要求：
<instructions>
1， 将上方<description>中描述的客服问题分类到以下8个类别中
备件/商务咨询: 提到物品价格,搬迁,流程等与商务有关的问题。
床: 提到床无法移动,出入不便等与床操作相关的问题。
扫描架: 提到gantry, revolution等与扫描架相关的关键词。
扫描问题: 提到无法获得图像,出现伪影,硬件错误等直接与扫描质量相关的问题。
探测器: 提到探测器温度,指针移动产生假影等与探测器相关的问题。
操作台: 提到开机失败,工作站不能使用,传不进PACS等与操作台相关的问题。
球管/高压: 提到球管故障,报错,异响等与球管和高压相关的问题。
无法判断: 对于一些语句信息不足无法判断分类的问题。
2，每个问题只选择一个最适合的类别
3，如果无法判断类别,则选择“无法判断”
4， 明确定义分类中使用的关键词,例如扫描仪指代整台设备,扫描架特指gantry部分。
5，注意问题描述的整体语义,不要过度依赖某些关键词进行分类。
6，进一步明确各分类类别的区分标准,避免分类混淆。
7，对信息不足无法判断类别的问题,不要强行分类,选择“无法判断”。
8，对包含多个类别关键词的问题描述,选择最相关的一个类别进行分类。
9， 必须严格用<example>中给出的样例格式回复, 输出结果中必须包含：<answer></answer>和<reasoning></reasoning>：
<example>
<answer>：
操作台
</answer>
<reasoning>：
根据分类规则工作站不能使用，死机，黑屏属于操作台故障分类。
</reasoning>：
</example>
</instructions>

\n\nAssistant:
<answer>分类结果</answer>
<reasoning>分析</reasoning>
'''

#### Now write your own prompt <br>

<b>Requirements: </b><br>
1, Put classification answer into "\<answer>\</answer>" tags in the response <br>
2, Put reasoning into "\<reasoning>\</reasoning>" tags in the response <br>
3, <b><font color=green>Add the suggested instructions from the previous step to your base prompt</font></b><br>
<b><font color=red> !Note that you may need to change the batch-testing functions (<code>mybatchtest</code>, <code>mybatchtest_resample</code>) if your own prompts have a different output format from the examples given!</font></b>

In [None]:
# Now write your own prompt
p_base_add_suggested_instructions = '''
\n\nHuman: 您是一个客服代表,需要根据分类要求<instructions>对客户的描述<description>对客服问题进行分类。

<description>中是问题描述：
<description>
{{PLACEHOLDER FOR INPUT DESCRIPTION}}
</description>

<instructions>中是分类要求：
<instructions>
{{YOUR OWN INSTRUCTIONS}}
</instructions>

<output_format>中是输出格式要求：
<output_format>
{{REQUIREMENTS/EXAMPLES FOR OUTPUT FORMAT}}
</output_format>

\n\nAssistant:
'''
# adjust your output file name, for example: heather_test_results_base_add_suggested_instructions.csv
df_base_add_suggested_instructions = mybatchtest(raw, p_base_add_suggested_instructions, 'REPLACE_YOUR_NAME_HERE_test_results_base_add_suggested_instructions.csv')

In [None]:
# Check results
acc_base_add_suggested_instructions = accuracy(df_base_add_suggested_instructions)
print('Accuracy with Suggested Instructions:', acc_base_add_suggested_instructions)

#### <font color="purple">Adding Few-Shot Examples</font>

In [None]:
# Filter out all the cases that were originally classified incorrectly, and take a 50% sample.
# Note that you should lower this rate with a larger test dataset to avoid overfitting
df_incorrect_cases = df_base[df_base['预测结果'] == 'F']
df_incorrect_cases_sample = df_incorrect_cases.groupby("故障分类").sample(frac = 0.5, random_state=1)
df_incorrect_cases_sample.groupby('故障分类').count()
print(df_incorrect_cases_sample.describe())

examples = ''
for i in range(df_incorrect_cases_sample.shape[0]):
    symptom = df_incorrect_cases_sample.iloc[i][0]
    gt = df_incorrect_cases_sample.iloc[i][1]
    eg_tmp = 'id' + str(i) +':\n' + '问题描述： ' + symptom + '\n' + '正确分类： ' + gt + '\n\n' 
    examples = examples + eg_tmp
incorrect_cases_sample = examples

In [None]:
# combine the base prompt and the sampled bad cases into one prompt, and use it to re-do batch testing
# below is an example
p_base_add_few_shots_1 = '''
\n\nHuman: 您是一个客服代表,需要根据分类要求<instructions>对客户的描述<description>对客服问题进行分类。

<description>中是问题描述：
<description>
{input_description}
</description>

<instructions>中是分类要求：
<instructions>
1， 将上方<description>中描述的客服问题分类到以下8个类别中
备件/商务咨询: 提到物品价格,搬迁,流程等与商务有关的问题。
床: 提到床无法移动,出入不便等与床操作相关的问题。
扫描架: 提到gantry, revolution等与扫描架相关的关键词。
扫描问题: 提到无法获得图像,出现伪影,硬件错误等直接与扫描质量相关的问题。
探测器: 提到探测器温度,指针移动产生假影等与探测器相关的问题。
操作台: 提到开机失败,工作站不能使用,传不进PACS等与操作台相关的问题。
球管/高压: 提到球管故障,报错,异响等与球管和高压相关的问题。
无法判断: 对于一些语句信息不足无法判断分类的问题。
2，每个问题只选择一个最适合的类别
3，如果无法判断类别,则选择“无法判断”
4， 必须严格用<example>中给出的样例格式回复, 输出结果中必须包含：<answer></answer>和<reasoning></reasoning>：
<example>
<answer>：
操作台
</answer>
<reasoning>：
根据分类规则工作站不能使用，死机，黑屏属于操作台故障分类。
</reasoning>：
</example>
</instructions>
'''

p_base_add_few_shots_2 = '''
<examples>中是一些分类例子：
<examples>
{input_examples}
</examples>

\n\nAssistant:
<answer>分类结果</answer>
<reasoning>分析</reasoning>
'''

p_base_add_few_shots_example = p_base_add_few_shots_1 + p_base_add_few_shots_2.format(input_examples = incorrect_cases_sample)

#### Now write your own prompt <br>

<b>Requirements: </b><br>
1, Put classification answer into "\<answer>\</answer>" tags in the response <br>
2, Put reasoning into "\<reasoning>\</reasoning>" tags in the response <br>
3, <b><font color=green> Add bad cases to your base prompt (few-shot)</font></b> <br>
<b><font color=red> !Note that you may need to change the batch-testing functions (<code>mybatchtest</code>, <code>mybatchtest_resample</code>) if your own prompts have a different output format from the examples given!</font></b>

In [None]:
# Now write your own prompt
p_base_add_few_shots = '''
\n\nHuman: 您是一个客服代表,需要根据分类要求<instructions>对客户的描述<description>对客服问题进行分类。

<description>中是问题描述：
<description>
{{PLACEHOLDER FOR INPUT DESCRIPTION}}
</description>

<instructions>中是分类要求：
<instructions>
{{YOUR OWN INSTRUCTIONS}}
</instructions>

<output_format>中是输出格式要求：
<output_format>
{{REQUIREMENTS/EXAMPLES FOR OUTPUT FORMAT}}
</output_format>

\n\nAssistant:
'''

# adjust your output file name, for example: heather_test_results_base_add_few_shots.csv
df_base_add_few_shots = mybatchtest(raw, p_base_add_few_shots, 'REPLACE_YOUR_NAME_HERE_test_results_base_add_few_shots.csv')

In [None]:
# Check results
acc_base_add_few_shots = accuracy(df_base_add_few_shots)
print('Accuracy with Few-Shots:', acc_base_add_few_shots)

#### <font color="purple">Generate with Claude Multiple Times and Vote for the Best Answer</font>

In [None]:
# This prompt is used to evaluate and vote for the best answer
# Note that this cell will take roughly resample_times * batch_testing_time to run
# below is an example
p_base_vote_example = '''
\n\nHuman: 您是一个客服复审员，你的任务是对已经做出的客服分类进行复审投票。
几名客服人员对<description>中同一个问题描述做出了故障分类的判断，请选出答案中的多数派。
如果有多个分类同票或所有答案都不相同，请结合<instructions>中的分类要求和客服人员的分类分析，选出几个答案中最优的一个答案输出。

<description>中是问题描述：
<description>
{input_description}
</description>

<instructions>中是分类要求：
<instructions>
1， 将上方<description>中描述的客服问题分类到以下8个类别中
(A) 备件/商务咨询: 提到物品价格,搬迁,流程等与商务有关的问题。
(B) 床: 提到床无法移动,出入不便等与床操作相关的问题。
(C) 扫描架: 提到gantry, revolution等与扫描架相关的关键词。
(D) 扫描问题: 提到无法获得图像,出现伪影,硬件错误等直接与扫描质量相关的问题。
(E) 探测器: 提到探测器温度,指针移动产生假影等与探测器相关的问题。
(F) 操作台: 提到开机失败,工作站不能使用,传不进PACS等与操作台相关的问题。
(G) 球管/高压: 提到球管故障,报错,异响等与球管和高压相关的问题。
(H) 无法判断: 对于一些语句信息不足无法判断分类的问题。
2，每个问题只选择一个最适合的类别
3，如果无法判断类别,则选择“无法判断”
4，必须严格用<example>中给出的样例格式回复, 输出结果中必须包含：<answer></answer>和<reasoning></reasoning>：
<example>
<answer>：
（F）操作台
</answer>
<reasoning>：
根据分类规则工作站不能使用，死机，黑屏属于操作台故障分类。
</reasoning>：
</example>
</instructions>

<answers>中是来自多个客服人员的分类：
<answers>
{input_answers}
</answers>

\n\nAssistant:
<answer>最终结果</answer>
<reasoning>分析</reasoning>
'''

#### Now write your own prompt <br>

<b>Requirements: </b><br>
1, Put classification answer into "\<answer>\</answer>" tags in the response <br>
2, Put reasoning into "\<reasoning>\</reasoning>" tags in the response <br>
3, <b><font color=green>Here the task is to evaluate several results and vote for the best</font></b> <br>
<b><font color=red> !Note that you may need to change the batch-testing functions (<code>mybatchtest</code>, <code>mybatchtest_resample</code>) if your own prompts have a different output format from the examples given!</font></b>
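
For comparison, the "pick the majority" part of the vote can also be done deterministically in plain Python, leaving only ties for the model to resolve. This is a sketch of that alternative under stated assumptions, not the approach this notebook takes (which asks Claude to do the voting); the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, or None when there is no clear winner."""
    if not answers:
        return None
    ranked = Counter(answers).most_common()
    # A tie for first place means no single answer has the most votes
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

# Hypothetical resampled answers for one description
print(majority_vote(["console", "console", "gantry"]))
print(majority_vote(["console", "gantry", "table"]))
```

A `None` result marks the cases where the reasoning of each sample would still need to be weighed, which is the part the voting prompt is designed for.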

In [None]:
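# Now write your own prompt.
# mybatchtest_resample fills this template via .format(input_description=..., input_answers=...),
# so it needs both an {input_description} slot and an {input_answers} slot.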
p_base_vote = '''
\n\nHuman: 您是一个客服代表,需要根据分类要求<instructions>对客户的描述<description>对客服问题进行分类。

<description>中是问题描述：
<description>
{{PLACEHOLDER FOR INPUT DESCRIPTION}}
</description>

<instructions>中是分类要求：
<instructions>
{{YOUR OWN INSTRUCTIONS}}
</instructions>

<output_format>中是输出格式要求：
<output_format>
{{REQUIREMENTS/EXAMPLES FOR OUTPUT FORMAT}}
</output_format>

\n\nAssistant:
'''

# adjust your output file name, for example: heather_test_results_base_vote.csv
df_base_vote = mybatchtest_resample(raw, p_base, p_base_vote, 'REPLACE_YOUR_NAME_HERE_test_results_base_vote.csv', 3)

In [None]:
# Check results
acc_base_vote = accuracy(df_base_vote)
print('Accuracy with Evaluation and Voting:', acc_base_vote)

# <font color=red>Assignment: Write Down Your Final Prompt and Copy It to the Quip</font>