# Synthetic Data Generation

In this notebook, we are going to generate some synthetic data that represents some of the most popular CX use cases.

### Environment Preperation

In [2]:
# Update SDK and Install related libraries
# You can ignore the error related to installation
%pip install --quiet --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"

%pip install --quiet langchain==0.0.309 "transformers>=4.24,<5"

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
apache-beam 2.52.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.3.2 which is incompatible.
docker-compose 1.29.2 requires jsonschema<4,>=2.5.1, but you have jsonschema 4.20.0 which is incompatible.
docker-compose 1.29.2 requires PyYAML<6,>=3.10, but you have pyyaml 6.0.1 which is incompatible.
jupyterlab 3.4.4 requires jupyter-server~=1.16, but you have jupyter-server 2.6.0 which is incompatible.
jupyterlab-server 2.10.3 requires jupyter-server~=1.4, but you have jupyter-server 2.6.0 which is incompatible.
nemoguardrails 0.5.0 requires langchain==0.0.251, but you have langchain 0.0.

In [3]:
# Set up IAM
import json
import os
import sys
import boto3

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww


# ---- ⚠️ Un-comment and edit the below lines as needed for your AWS setup ⚠️ ----

# os.environ["AWS_DEFAULT_REGION"] = "<REGION_NAME>"  # E.g. "us-east-1"
# os.environ["AWS_PROFILE"] = "<YOUR_PROFILE>"
# os.environ["BEDROCK_ASSUME_ROLE"] = "<YOUR_ROLE_ARN>"  # E.g. "arn:aws:..."

boto3_bedrock = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None)
)

Create new client
  Using region: us-east-1
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)


### Raw Data Inspection

The orginal dataset contains multiple different languages, which significantly increases the evaluation difficulty.  Therefore I have used the prompt below to translate and re-generate data for this workshop. 

In [4]:
p_translate_and_generate = '''

Human: you are a translation assistant. below is a gaming chat history in <history_chat>.  
Your task is to translate the history into English while following the instructions in <instructions>.

here is the chat history in <history_chat>:
<history_chat>
{input_chat}
</history_chat>

here are the requirements for translation in <instructions>:
<instructions>
1, If the chat is not in English, Spanish, Traditional Chinese or Simplified Chinese, respond with 'No Translation' and do not do translation
2, If the chat is in English, translate into Simplified Chinese
3, If the chat is in Spanish, translate into English
4, If the chat is in Traditional Chinese or Simplified Chinese, translate into English
5, Keep in mind that the translation should mimic real conversation between players of MOBA Game, and you should try your best to keep all the cursing and slang used.
6, Write the output in chat-like format, and use identifiers like "A:, B:, C: " to help clarify the conversation 
7, Please put your translation in <answer></answer> XML tags.
</instructions>

Assistant:
<answer>
'''

In [5]:
# Read in the Sample data you just generated from Task0
import pandas as pd
import numpy as np
raw = pd.read_csv('lxq_cm_gen_raw.csv')
raw.groupby('gt').describe()

Unnamed: 0_level_0,input,input,input,input
Unnamed: 0_level_1,count,unique,top,freq
gt,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
0,156,154,No Translation\n\n,3
1,159,159,C: estate \nD: 1/estos 4 de en formar una escu...,1


In [6]:

raw_dropna = raw.dropna()
raw_valid = raw_dropna[raw_dropna["input"].str.contains("No Translation") == False] 
raw_valid["gt"] = np.where(raw_valid["gt"] == 1, 'T', 'F')
raw_valid

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  raw_valid["gt"] = np.where(raw_valid["gt"] == 1, 'T', 'F')


Unnamed: 0,input,gt
0,"1: So easy, abc\n",F
1,B: Okay go mid\nA: See a cs\nB: Okay\nA: I'm m...,F
2,B: 谁需要游走英雄 组队\nC: 谁需要游走英雄和法师 组队\nD: -1 邀请\nE: ...,F
4,B:哈哈哈 \nC:我高兴\nD:52533\nE:哈哈哈\nF:哈哈哈\nG:1/哭得更...,F
5,B: Just ask your dad \nC: You don't need to wi...,F
...,...,...
495,\n/A: 我选了埃斯特斯\n/B: 我选了埃迪斯\n/B: 刚进游戏的菜鸟。废物们。\n/...,T
497,B: bn yak alsan\nC: 1/req\nD: b90k ya b90k aj\...,T
499,B: 黄金就5千 \nA: 怎么了? \nB: 我只有一个洞 \nA: 解释一下亲爱的...,T
500,B: 多少钱?\nC: 你的大脑值这么多钱吗? \nD: 真没办法。\n,T


### Synthetic Data Generation - Compare Results from Different Instructions

In [7]:
# Set Up Claude Parameters
from langchain.llms.bedrock import Bedrock

inference_modifier = {'max_tokens_to_sample':4096, 
                      "temperature":1,
                      "top_k":250,
                      "top_p":1,
                      "stop_sequences": ["\n\nHuman"]
                     }

textgen_llm = Bedrock(model_id = "anthropic.claude-v2",
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )

In [8]:
# prompt for generating synthetic data
## Base
p_base = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>中的要求改写<example>中的给定MOBA游戏玩家对话对话.

重写要求<instructions>
0，用同一个语种改写
1，每次生成三个不同的对话
2，改写后的对话含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象
3，改写包括语序改变，换词
4，改写前后对话的长短要相似，语义要通顺，说话要更拟人，更随便，更口头
</instructions>

<examples>
B: 什么角色?
A: 你选了啥?
A: 你选了啥?
A: 你选了啥?职业或者位置?
C: 跟着这个菜鸟ADC, 把他当腊肉扒皮。
</examples>

\n\nAssistant:
'''

## Now assume a customer that's upset and impatient 
p_trashtalk = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>中的要求改写<example>中的给定MOBA游戏玩家对话对话.

重写要求<instructions>
0，用同一个语种改写
1，每次生成三个不同的对话
2，改写后的对话含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象
3，改写包括语序改变，换词
4，改写前后对话的长短要相似，语义要通顺，说话要更拟人，更随便，更口头，但是对话要更具讽刺性和攻击性，可适量增加一些经济场合常见的垃圾话（trash talk）
</instructions>

<examples>
B: 什么角色?
A: 你选了啥?
A: 你选了啥?
A: 你选了啥?职业或者位置?
C: 跟着这个菜鸟ADC, 把他当腊肉扒皮。
</examples>

\n\nAssistant:
'''

## Now assume a customer that's chatty and confused
p_young = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>中的要求改写<example>中的给定MOBA游戏玩家对话对话.

重写要求<instructions>
0，用同一个语种改写
1，每次生成三个不同的对话
2，改写后的对话含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象
3，改写包括语序改变，换词
4，改写前后对话的长短要相似，语义要通顺，说话要更拟人，更随便，更口头，可适当增加一些社交媒体上年轻人常用的俚语
</instructions>

<examples>
B: 什么角色?
A: 你选了啥?
A: 你选了啥?
A: 你选了啥?职业或者位置?
C: 跟着这个菜鸟ADC, 把他当腊肉扒皮。
</examples>

\n\nAssistant:
'''

## Now try translating the data into other languages.  This is specially useful when the Cx has insufficient data when localizing.
p_translate = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>中的要求改写<example>中的给定MOBA游戏玩家对话对话.

重写要求<instructions>
1，每次生成四个不同语种的对话：1个英文，1个符合中国大陆语言习惯，1个符合台湾语言习惯，1个符合香港语言习惯
2，改写后的对话含义不能变，不要增加额外的信息，不要添油加醋，不要发挥想象
3，改写包括语序改变，换词
4，改写前后对话的长短要相似，语义要通顺，说话要更拟人，更随便，更口头
</instructions>

<examples>
B: 什么角色?
A: 你选了啥?
A: 你选了啥?
A: 你选了啥?职业或者位置?
C: 跟着这个菜鸟ADC, 把他当腊肉扒皮。
</examples>

\n\nAssistant:
'''

In [9]:
# calculate your LLM execution time
import time

def timer_llm(prompt, if_print=1):
    start_time = time.time()
    response = textgen_llm(prompt)
    end_time = time.time()
    elapsed_time = end_time - start_time
    if if_print == 1:
        print("----------------------------------------- OutPut -----------------------------------------")
        print("Elapsed time: ", elapsed_time, "seconds")
    return response

In [10]:
# Compare Results
import time

# p_base
response = timer_llm(p_base)
result = response[response.index('\n')+1:]
print_ww(result)

# p_trashtalk
response = timer_llm(p_trashtalk)
result = response[response.index('\n')+1:]
print_ww(result)

# p_young
response = timer_llm(p_young)
result = response[response.index('\n')+1:]
print_ww(result)

# p_translate
response = timer_llm(p_translate)
result = response[response.index('\n')+1:]
print_ww(result)

----------------------------------------- OutPut -----------------------------------------
Elapsed time:  7.312505722045898 seconds

A: 哥们,选哪个角色啊?
B: 你挑了哪个英雄?ADC还是辅助?
C: 这个ADC太菜了,跟着他打怪简直要命。
A: 兄弟,你选了什么位置?
B: 你用什么英雄啊?打野还是中单?
C: 这个ADC操作太烂了,跟他打线跟虐待一样。
A: 卧槽,你选什么角色啊?
B: 你用啥英雄啊?上单还是打野?
C: 我去,这个ADC烂的要命,我辅助他简直受罪。
----------------------------------------- OutPut -----------------------------------------
Elapsed time:  3.940587043762207 seconds


ValueError: substring not found

### Synthetic Data Generation - Batch Generation

In [None]:
# batch generation
## sample 10% observations from each category 
sample_df = raw_valid.groupby("gt").sample(frac = 0.1, random_state=1)
sample_df.groupby('gt').count()

In [None]:
# prompt for batch generation
# below is an example
p_batch_gen_example = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>中的要求改写<example>中的给定MOBA游戏玩家对话.

重写要求<instructions>
1，每次生成两个不同的游戏对话改写
2，改写后的对话主要含义和情绪不能变，不要增加额外的信息，不要添油加醋
3，改写包括语序改变，换词
4，改写前后对话的长短要相似，语义要通顺，说话要更拟人，更随便，更口头
5，综合上述要评估生成结果，选择最符合上述要求的一条写入<best>
</instructions>

<examples>
{input_data}
</examples>

<output_format>
五个备选：
<best>
</best>
</output_format>

\n\nAssistant:\n
'''

#### Now write your own prompt <br>

<b>Requirements: </b><br>
You want to generate several re-writes and choose the best one, and write into "\<best>\</best>" tags

In [None]:
p_batch_gen = '''
\n\nHuman:
你是一个数据生成员. 你的目标是按照<instructions>中的要求改写<example>中的给定MOBA游戏玩家对话.

重写要求<instructions>
{{WRITE YOUR OWN INSTRUCTIONS}}
</instructions>

<examples>
{{PLACEHOLDER FOR INPUT EXAMPLES}}
</examples>

<output_format>
{{REQUIREMENTS/EXAMPLES FOR OUTPUT FORMAT}}
</output_format>

\n\nAssistant:\n
'''

## generate synthetic data from the sample
syn_data = sample_df.copy()
for i in range(sample_df.shape[0]):
    symptom = sample_df.iloc[i][0]
    prompt = p_batch_gen.format(input_data = symptom)
    response = timer_llm(prompt, 0)
    result_details = response[response.index('\n')+1:]
    if '<best>' in result_details and '</best>' in result_details:
        result_best = result_details[result_details.index('<best>')+6:result_details.index('</best>')]
    else:
        result_best = 'No <best> found'
    syn_data.iloc[i][0] = result_best
    #print_ww(result_details)
    #print("Given Sample: ", symptom)
    #print("Best Generated Sample: ", result_best)

### Save the Generated Data

In [None]:
# adjust your output file name, for example: heather_voc_gen_.csv
syn_data.to_csv('REPLACE_YOUR_NAME_HERE_cm_gen.csv',index = False)

# <font color=red>Assignment: Save the Generated Data and Upload it to the WorkDoc<font color=red>

https://amazon.awsapps.com/workdocs-preview/index.html#/folder/f573754f5bdde41d63543d02eb277bee333e322238ad051755a02e050b8513ce