# Synthetic Data Generation

In this notebook, we generate synthetic data representing some of the most popular CX use cases.

### Environment Preparation

In [2]:
# Update the SDK and install related libraries
# Dependency-resolver errors printed during installation can be ignored
%pip install --quiet --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"

%pip install --quiet langchain==0.0.309 "transformers>=4.24,<5"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
apache-beam 2.52.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.3.2 which is incompatible.
docker-compose 1.29.2 requires jsonschema<4,>=2.5.1, but you have jsonschema 4.20.0 which is incompatible.
docker-compose 1.29.2 requires PyYAML<6,>=3.10, but you have pyyaml 6.0.1 which is incompatible.
jupyterlab 3.4.4 requires jupyter-server~=1.16, but you have jupyter-server 2.6.0 which is incompatible.
jupyterlab-server 2.10.3 requires jupyter-server~=1.4, but you have jupyter-server 2.6.0 which is incompatible.
nemoguardrails 0.5.0 requires langchain==0.0.251, but you have langchain 0.0.

In [3]:
# Set up the Bedrock client (optionally assuming an IAM role)
import json
import os
import sys
import boto3

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww


# ---- ⚠️ Un-comment and edit the below lines as needed for your AWS setup ⚠️ ----

# os.environ["AWS_DEFAULT_REGION"] = "<REGION_NAME>"  # E.g. "us-east-1"
# os.environ["AWS_PROFILE"] = "<YOUR_PROFILE>"
# os.environ["BEDROCK_ASSUME_ROLE"] = "<YOUR_ROLE_ARN>"  # E.g. "arn:aws:..."

boto3_bedrock = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None)
)

Create new client
  Using region: us-east-1
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)


### Raw Data Inspection

In [5]:
# Inspect original dataset
import pandas as pd
voc_cx1_raw = pd.read_csv('lxq_voc_gen_.csv')
voc_cx1_raw

Unnamed: 0,Symptom,故障分类
0,\n想知道装一个呼吸门控要多少钱。\n,备件/商务咨询
1,\n请告知如何将位于某城市某健康管理体检中心的设备移至某城市某医院。\n,备件/商务咨询
2,"\n客户询问了某设备升级的相关事宜,希望能增加一个床的进出功能。\n",备件/商务咨询
3,"\n报价需求,购买油箱一批。\n",备件/商务咨询
4,"\n客户要安装UPS,想询问UPS的具体型号。\n",备件/商务咨询
...,...,...
447,"\n硬件故障导致曝光模块失效,无法进行正常扫描。据判断故障部件可能是今年6月新换的硬件。\n",球管/高压
448,"\n某医院某台CT扫描仪球管开放性管丝导致无法扫描,紧急程度高。\n",球管/高压
449,\n球馆里发出异响\n,球管/高压
450,\n预热功能无法使用的球管\n,球管/高压


In [6]:
# inspect by different issue categories
voc_cx1_raw.groupby('故障分类').describe()

Unnamed: 0_level_0,Symptom,Symptom,Symptom,Symptom
Unnamed: 0_level_1,count,unique,top,freq
故障分类,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
备件/商务咨询,24,24,\n想知道装一个呼吸门控要多少钱。\n,1
床,28,28,"\n扫描床仍无法移动,关机重启后出现E01错误\n",1
扫描架,41,41,"\n部分硬件停止扫描,不能预热。驱动器电源被禁用\n",1
扫描问题,166,166,\n硬件上有错误\n,1
探测器,18,18,\n开机后探测器温度不足提示是由停电导致\n,1
操作台,87,87,\n服务器在断电后没有作出任何响应\n,1
球管/高压,88,88,"\n球管故障引起球管损坏,紧急程度高。\n",1


### Synthetic Data Generation - Compare Results from Different Instructions

In [7]:
# Set Up Claude Parameters
from langchain.llms.bedrock import Bedrock

inference_modifier = {'max_tokens_to_sample':4096, 
                      "temperature":1,
                      "top_k":250,
                      "top_p":1,
                      "stop_sequences": ["\n\nHuman"]
                     }

textgen_llm = Bedrock(model_id = "anthropic.claude-v2",
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )

In [8]:
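# Prompts for generating synthetic data
# Each prompt is written in Chinese. It tells the model to act as a data generator and rewrite the
# customer-service issue description in <examples> according to the rules in <instructions>:
# produce 5 rewrites, keep the meaning unchanged, add no extra information, vary word order and
# wording, and keep the length similar.
## Base
p_base = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>改写给定客服问题描述<examples>.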

重写要求<instructions>
1，每次生成5个不同的客服问题描述改写
2，改写后的描述含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象
3，改写包括语序改变，换词
4，改写前后描述的长短要相似
</instructions>

<examples>
扫描卡顿  机架操作盘联系不上 不能扫描了
</examples>

\n\nAssistant:
'''

## Now assume a customer who is upset and impatient (rule 5 adds interjections and a more agitated tone)
p_upset = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>改写给定客服问题描述<examples>.

重写要求<instructions>
1，每次生成5个不同的客服问题描述改写
2，改写后的描述含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象
3，改写包括语序改变，换词
4，改写前后描述的长短要相似
5，改写包括语序改变，措辞改变，增加语气词, 让语气更急躁
</instructions>

<examples>
扫描卡顿  机架操作盘联系不上 不能扫描了
</examples>

\n\nAssistant:
'''

## Now assume a customer who is chatty and confused (rule 5 adds interjections and wordier phrasing)
p_chatty = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>改写给定客服问题描述<examples>.

重写要求<instructions>
1，每次生成5个不同的客服问题描述改写
2，改写后的描述含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象
3，改写包括语序改变，换词
4，改写前后描述的长短要相似
5，改写包括语序改变，措辞改变，增加语气词，说话更啰嗦
</instructions>

<examples>
扫描卡顿  机架操作盘联系不上 不能扫描了
</examples>

\n\nAssistant:
'''

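## Now try translating the data into other languages. This is especially useful when the customer has too little data for localization.
## Rule 1 asks for 12 rewrites: 3 in English and 3 each in Mainland China, Taiwan, and Hong Kong usage.
p_translate = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>改写给定客服问题描述<examples>.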

重写要求<instructions>
1，每次生成12个不同的客服问题描述改写,3个英文，3个符合中国大陆语言习惯，3个符合台湾语言习惯，3个符合香港语言习惯
2，改写后的描述含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象
3，改写包括语序改变，换词
4，改写前后描述的长短要相似
</instructions>

<examples>
扫描卡顿  机架操作盘联系不上 不能扫描了
</examples>

\n\nAssistant:
'''
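The four prompts above share one skeleton and differ only in their rule lists. A minimal sketch of building them from a single template follows. `build_prompt`, `BASE_RULES`, and `EXAMPLE` are hypothetical names introduced here for illustration and are not part of the notebook; the whitespace of the generated prompt only approximates the hand-written strings.

```python
# Hypothetical helper (not in the original notebook): builds a rewrite prompt
# from the skeleton shared by p_base, p_upset, p_chatty and p_translate.
BASE_RULES = [
    "每次生成5个不同的客服问题描述改写",
    "改写后的描述含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象",
    "改写包括语序改变，换词",
    "改写前后描述的长短要相似",
]
EXAMPLE = "扫描卡顿  机架操作盘联系不上 不能扫描了"


def build_prompt(example, rules):
    """Return a Human/Assistant style prompt with the rules numbered from 1."""
    numbered = "\n".join(f"{i}，{rule}" for i, rule in enumerate(rules, start=1))
    return (
        "\n\nHuman:\n"
        "你是一个数据生成员. 你的目标是按照<instructions>改写给定客服问题描述<examples>.\n\n"
        f"重写要求<instructions>\n{numbered}\n</instructions>\n\n"
        f"<examples>\n{example}\n</examples>\n\n"
        "\n\nAssistant:\n"
    )


# The "upset" variant only appends one extra rule to the base list.
p_upset_built = build_prompt(
    EXAMPLE, BASE_RULES + ["改写包括语序改变，措辞改变，增加语气词, 让语气更急躁"]
)
print(p_upset_built)
```

Keeping the shared rules in one list means a wording change has to be made only once, at the cost of prompts that are slightly harder to read at a glance.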

In [9]:
# Measure LLM execution time
import time

def timer_llm(prompt, if_print=1):
    """Call the LLM with `prompt`, print the elapsed time when if_print == 1, and return the response."""
    # perf_counter is a monotonic clock, which suits interval measurement better than time.time
    start_time = time.perf_counter()
    response = textgen_llm(prompt)
    elapsed_time = time.perf_counter() - start_time
    if if_print == 1:
        print("----------------------------------------- OutPut -----------------------------------------")
        print("Elapsed time: ", elapsed_time, "seconds")
    return response
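The timing pattern in this cell can be separated from the LLM call so that it also works with any other callable. The sketch below is one possible shape: `timed_call` and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in for `textgen_llm` so that the example runs without AWS access.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_llm(prompt):
    """Stand-in for textgen_llm (assumption: the real model is also called with a prompt string)."""
    time.sleep(0.01)
    return f"echo: {prompt}"


response, elapsed = timed_call(fake_llm, "hello")
print(f"Elapsed time: {elapsed:.3f} seconds")
```

Returning the elapsed time instead of printing it lets the caller decide whether to log it, aggregate it across a batch, or ignore it.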

In [10]:
# Compare results across the four prompt variants
for prompt in (p_base, p_upset, p_chatty, p_translate):
    response = timer_llm(prompt)
    # Drop the model's first line; keep the whole response if it contains no newline
    # (response.index('\n') would raise ValueError in that case)
    result = response.split('\n', 1)[-1]
    print_ww(result)

----------------------------------------- OutPut -----------------------------------------
Elapsed time:  7.855525732040405 seconds

1. 扫描不了了,操作盘接触不上,卡住扫描流程

2. 操作盘突然失联,导致扫描过程出现卡顿,无法继续扫描

3. 扫描过程突然卡住不动了,找不到操作盘的反应,扫描功能失效

4. 扫描功能失灵,操作盘联络失败,整个扫描过程异常中断

5. 不能继续扫描了,操作盘突然失去联系,使扫描过程陷入了卡顿
----------------------------------------- OutPut -----------------------------------------
Elapsed time:  8.585693836212158 seconds

1. 扫描卡顿,机架操作盘完全不能联系上了,现在扫描都扫不了了!

2. 我这边扫描突然就开始卡顿了,机架操作盘也联系不上,扫描功能直接就没法用了。

3. 扫描功能突然失灵,联系不上机架操作盘,现在完全没法进行扫描操作了!

4. 扫描开始卡住不动了,机架操作盘也打不开,扫描功能直接失效了啊!

5. 扫描完全停止响应,机架操作盘也打不开了,现在扫描都不能进行了!
----------------------------------------- OutPut -----------------------------------------
Elapsed time:  9.546398401260376 seconds

1. 扫描这个功能卡顿了,联系不上机架操作盘,导致现在完全扫不了东西。

2. 扫描功能在使用的时候总是卡顿,机架操作盘也联系不上,结果使得扫描功能现在完全无法使用。

3. 使用扫描时老是卡卡顿顿的,机架操作盘也打不开了,说白了就是扫描功能现在完全没法正常工作了。

4. 我在用扫描的时候它经常卡住,而且机架操作盘也连不上,这样扫描就完全无法进行了。

5. 扫描的时候会非常地卡,机架操作盘也连不上它,导致现在扫描功能直接就不能用了。
----------------------------------------- Out
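The model returns its rewrites as a numbered list, so they can be split into individual strings before further use. The following is a minimal sketch that assumes every rewrite sits on one line starting with `<number>.`, as in the outputs above; `parse_numbered` is a hypothetical helper name.

```python
import re

# Two lines copied from the recorded output of the base prompt.
sample_output = """
1. 扫描不了了,操作盘接触不上,卡住扫描流程

2. 操作盘突然失联,导致扫描过程出现卡顿,无法继续扫描
"""

# One numbered item per line: optional indent, digits, a dot, then the text.
ITEM_RE = re.compile(r"^[ \t]*\d+\.[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_numbered(text):
    """Return the text of every line of the form '<number>. <text>'."""
    return ITEM_RE.findall(text)


rewrites = parse_numbered(sample_output)
print(len(rewrites))
```

Lines that do not match the pattern, such as a preamble sentence, are skipped silently, so it is worth checking that the number of parsed items equals the number requested in the prompt.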

### Synthetic Data Generation - Batch Generation

In [None]:
# Batch generation
## Sample 15% of the observations from each issue category (fixed seed for reproducibility)
sample_df = voc_cx1_raw.groupby("故障分类").sample(frac = 0.15, random_state=1)
sample_df.groupby('故障分类').count()
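To get a feel for how many rows a 15% stratified sample yields, the per-category sizes can be estimated from the counts in the `describe()` output above. This sketch assumes that pandas rounds `frac * group_size` to the nearest integer for each group; treat the numbers as an estimate and compare them with the actual `count()` output.

```python
# Category counts copied from the describe() output earlier in the notebook.
category_counts = {
    "备件/商务咨询": 24,
    "床": 28,
    "扫描架": 41,
    "扫描问题": 166,
    "探测器": 18,
    "操作台": 87,
    "球管/高压": 88,
}
FRAC = 0.15

# Assumption: the sample size per group is round(frac * group_size).
expected = {name: round(FRAC * n) for name, n in category_counts.items()}
for name, size in expected.items():
    print(name, size)
print("total:", sum(expected.values()))
```

Small categories contribute only a handful of rows at this fraction, which matters because each sampled row triggers one LLM call in the batch loop.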

In [None]:
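## Prompt for batch generation
# Below is an example. Compared with the prompts above, it adds two rules: rule 5 replaces any PII
# with generic placeholders, and rule 6 asks the model to pick the best rewrite and put it in <best>.
# {input_data} is filled in with str.format for each sampled row.
p_batch_gen_example = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>改写给定客服问题描述<examples>,并按照<output_format>要求输出。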

重写要求<instructions>
1，每次生成5个不同的客服问题描述改写
2，改写后的描述含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象
3，改写包括语序改变，换词
4，改写前后描述的长短要相似
5，如果原描述中有任何pii敏感信息，请用“某先生，某女士，某公司，某电话，某城市”等代替
6，综合上述要评估生成结果，选择最符合上述要求的一条写入<best>
</instructions>

<examples>
{input_data}
</examples>

<output_format>
五个备选：
<best>
</best>
</output_format>

\n\nAssistant:\n
'''
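The example prompt relies on `str.format`: single braces mark a field to fill in, while doubled braces are an escape that produces literal braces. A short sketch of the difference, using a shortened hypothetical template:

```python
# Single braces are replaced by format(); doubled braces become literal braces.
template = "<examples>\n{input_data}\n</examples>\n{{NOT A FIELD}}"
filled = template.format(input_data="扫描卡顿")
print(filled)

# A template without the field silently ignores the keyword argument,
# so the input text never reaches the prompt.
unfilled = "<examples>\n{{PLACEHOLDER}}\n</examples>".format(input_data="扫描卡顿")
print(unfilled)
```

This is why a prompt that will be passed through `.format(input_data=...)` needs a `{input_data}` field inside its `<examples>` block.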

#### Now write your own prompt <br>

<b>Requirements: </b><br>
Your prompt should make the model generate several rewrites, choose the best one, and write it inside "\<best>\</best>" tags.

In [None]:
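# Replace each {{...}} placeholder below with your own text. The <examples> block must contain
# {input_data} (single braces) so that p_batch_gen.format(input_data=...) can insert each symptom;
# doubled braces are left in the prompt as literal text.
p_batch_gen = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>改写给定客服问题描述<examples>,并按照<output_format>要求输出。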

重写要求<instructions>
{{WRITE YOUR OWN INSTRUCTIONS}}
</instructions>

<examples>
{{PLACEHOLDER FOR INPUT EXAMPLES}}
</examples>

<output_format>
{{REQUIREMENTS/EXAMPLES FOR OUTPUT FORMAT}}
</output_format>

\n\nAssistant:\n
'''

## Generate synthetic data from the sample
syn_data = sample_df.copy()
symptom_col = syn_data.columns.get_loc('Symptom')  # integer position of the text column
for i in range(sample_df.shape[0]):
    symptom = sample_df.iloc[i, symptom_col]
    prompt = p_batch_gen.format(input_data = symptom)
    response = timer_llm(prompt, 0)
    # Drop the model's first line; keep the whole response if it contains no newline
    result_details = response.split('\n', 1)[-1]
    if '<best>' in result_details and '</best>' in result_details:
        result_best = result_details[result_details.index('<best>')+6:result_details.index('</best>')]
    else:
        result_best = 'No <best> found'
    # Assign with a single .iloc call; chained indexing (iloc[i][0] = ...) may write to a temporary copy
    syn_data.iloc[i, symptom_col] = result_best
    #print_ww(result_details)
    #print("Given Sample: ", symptom)
    #print("Best Generated Sample: ", result_best)
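The `<best>` extraction in the loop can also be written with a regular expression, which avoids the hard-coded offset of 6. This is a minimal sketch with a hypothetical helper name, `extract_best`; unlike the loop above, it strips surrounding whitespace from the extracted text.

```python
import re

# DOTALL lets '.' match newlines, since the model may put <best> content on its own line.
BEST_RE = re.compile(r"<best>(.*?)</best>", re.DOTALL)


def extract_best(text, default="No <best> found"):
    """Return the stripped content of the first <best>...</best> pair, or default."""
    match = BEST_RE.search(text)
    return match.group(1).strip() if match else default


print(extract_best("五个备选：\n1. ...\n<best>\n操作盘失联,无法扫描\n</best>"))
print(extract_best("no tags here"))
```

Rows that come back with the default value are worth filtering out or regenerating before the data is saved.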

### Save the Generated Data

In [None]:
# Replace the placeholder in the output file name with your own name, for example: heather_voc_gen_.csv
syn_data.to_csv('REPLACE_YOUR_NAME_HERE_voc_gen.csv',index = False)

# <font color=red>Assignment: Save the Generated Data and Upload It to WorkDocs</font>

https://amazon.awsapps.com/workdocs-preview/index.html#/folder/f573754f5bdde41d63543d02eb277bee333e322238ad051755a02e050b8513ce