![image](../images/kdd24-logo-small.jpeg)

# Hands-on Tutorial
## Domain-Driven LLM Development: Insights into RAG and Fine-Tuning Practices
### Lab 2.1: Synthetic Test Data Generation
#### Summary:
This lab generates synthetic data for testing the fine-tuned model in the next lab, using Claude 3 Sonnet on Amazon Bedrock.

- The question-context-answer triples provided by the CUAD dataset are used as "seed data".
- The generated dataset has the format [context, seed_question, question, answer].


### Initialization

In [None]:
#!pip install langchain

In [1]:
import json
import os
import sys

import boto3
import botocore

In [2]:
import numpy as np
import time
import pandas as pd

from langchain.prompts import PromptTemplate

In [3]:
boto3_bedrock = boto3.client(service_name="bedrock", region_name="us-west-2")
boto3_bedrock_runtime = boto3.client(service_name="bedrock-runtime", region_name="us-west-2")

In [4]:
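def QA_Gen_Bedrock(model_id,model_kwargs,prompt):
    # Rough token estimate assuming about 0.75 words per token (a heuristic, not the model's tokenizer)
    input_token = len(prompt.split())/0.75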

    if ('titan' in model_id):    
        model_body = {
            "inputText": f"{prompt}"
        }
        model_body["textGenerationConfig"] =  model_kwargs  
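    elif ('claude-3' in model_id):
        # Claude 3 uses the Messages API; max_tokens is required, so set a default
        model_body = {
                        "anthropic_version": "bedrock-2023-05-31",
                        "max_tokens": 1024,
                        "messages": [
                            {
                                "role": "user",
                                "content": [{"type": "text", "text": prompt}],
                            }
                        ],
        }
        # apply the caller's settings (max_tokens, top_p, temperature) instead of ignoring them
        model_body.update(model_kwargs)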
    else:
        model_body = {
            "prompt": f"{prompt}"
        }
        model_body.update(model_kwargs)

    body_bytes = json.dumps(model_body).encode('utf-8')

    st = time.time()

    if ('claude-3' in model_id):
        response = boto3_bedrock_runtime.invoke_model(
                    modelId=model_id,
                    body=body_bytes,
                )
    else:
        response = boto3_bedrock_runtime.invoke_model(
                    modelId=model_id,
                    contentType="application/json",
                    accept="*/*",
                    body=body_bytes,
                )

    et = time.time()
    elapsed_time = et - st

    if ('titan' in model_id):
        response_body_json = json.loads(response['body'].read().decode('utf-8'))
        llm_response = response_body_json["results"][0]["outputText"].strip()
        llm_latency = response["ResponseMetadata"]["HTTPHeaders"]["x-amzn-bedrock-invocation-latency"]
    elif ('llama' in model_id):
        response_body_json = json.loads(response['body'].read().decode('utf-8'))
        llm_response = response_body_json["generation"].strip()
    elif ('claude-v2' in model_id or 'claude-instant-v1' in model_id ):
        response_body_json = json.loads(response['body'].read().decode('utf-8'))
        llm_response = response_body_json["completion"].strip()
    elif ('claude-3' in model_id):
        response_body_json = json.loads(response['body'].read().decode('utf-8'))
        llm_response = response_body_json["content"][0]["text"].strip()
    elif ('mistral' in model_id):
        response_body_json = json.loads(response['body'].read().decode('utf-8'))
        llm_response = response_body_json["outputs"][0]["text"].strip()    
    else :
        llm_response = 'MODEL TYPE NOT YET SUPPORTED.'
    
    # Same rough word-based estimate as for the input tokens
    output_token = len(llm_response.split())/0.75

    throughput = output_token/elapsed_time
    
    return llm_response, elapsed_time, input_token, output_token, throughput
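
The Claude 3 branch relies on the request and response shapes of the Anthropic Messages API on Bedrock. The sketch below builds a request body of that shape and parses a response of that shape offline, without calling AWS. The helper names `build_claude3_body` and `parse_claude3_text` and the sample response are illustrative assumptions, not part of the lab code. The sketch merges `model_kwargs` into the body so that settings such as `temperature` and `top_p` reach the model.

```python
import json

def build_claude3_body(prompt, model_kwargs=None):
    """Build a Messages-API request body for a single user prompt (hypothetical helper)."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,  # required field; overridden if model_kwargs sets it
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    }
    body.update(model_kwargs or {})
    return body

def parse_claude3_text(raw_body):
    """Extract the first text block from a Messages-API response payload (bytes or str)."""
    payload = json.loads(raw_body)
    return payload["content"][0]["text"].strip()

body = build_claude3_body("Hello", {"temperature": 0.05, "top_p": 0.95})

# Made-up response payload with the same structure the notebook reads
fake_response = json.dumps(
    {"content": [{"type": "text", "text": " <question>Who signed?</question> "}]}
).encode("utf-8")

print(sorted(body))
print(parse_claude3_text(fake_response))
```

Keeping the body construction and parsing in small pure functions like these makes it possible to check them without Bedrock credentials.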

In [5]:
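def extract_strings_recursive(test_str, tag):
    """Return the text inside every <tag>...</tag> pair, in order of appearance."""
    # keep the "bad format" sentinel for non-string input so callers can still check for it
    if not isinstance(test_str, str):
        return "bad format"

    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"

    # finding the index of the first occurrence of the opening tag
    start_idx = test_str.find(open_tag)

    # base case
    if start_idx == -1:
        return []

    # an opening tag without a closing tag is malformed; stop instead of slicing garbage
    end_idx = test_str.find(close_tag, start_idx)
    if end_idx == -1:
        return []

    # extracting the string between the opening and closing tags
    res = [test_str[start_idx + len(open_tag):end_idx]]

    # recursive call to extract strings after the current closing tag
    res += extract_strings_recursive(test_str[end_idx + len(close_tag):], tag)

    return res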
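
As an alternative to the recursive helper, the same tag extraction can be sketched with a regular expression. This is a minimal sketch, not the lab's implementation: the name `extract_tag_contents` is hypothetical, and unlike the helper above it strips surrounding whitespace from each match and silently ignores an unclosed tag.

```python
import re

def extract_tag_contents(text, tag):
    """Return the stripped contents of every <tag>...</tag> pair in text."""
    # Non-greedy match so consecutive tags are captured separately;
    # DOTALL lets a question span several lines.
    pattern = re.compile(
        rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL
    )
    return [match.strip() for match in pattern.findall(text)]

sample = (
    "Sure.\n"
    "<question>Who signed the agreement?</question>\n"
    "<question>\nWhen does it expire?\n</question>"
)
print(extract_tag_contents(sample, "question"))
```

An iterative version like this avoids deep recursion on responses that contain many tags.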

### Define prompt for synthetic data generation

In [6]:
from langchain.prompts import PromptTemplate

prompt_template_trngen = """
Human:

You are an AI assistant. Your task is to generate a question from the given context and seed question.

Analyze the context within the <context> XML tag and the seed question within the <seed> XML tag,
then generate one question that rephrases the seed question.
Make sure the generated question is also relevant to the context within the <context> XML tag.

In your response, present the question within the <question> tag.
DO NOT nest <question> element. 
DO NOT put any extra attribute in the <question> tag. 

<context>
{context}
</context>

<seed>
{seed_question}
</seed>

Assistant:
"""

PROMPT_trngen = PromptTemplate(template=prompt_template_trngen, input_variables=["context","seed_question"])
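
Before sending prompts in bulk, it can help to confirm which placeholders a template expects and to look at one rendered prompt. The sketch below uses only the standard library and assumes the template uses simple `{name}` placeholders, which Python's `str.format` fills in the same way. The helper `template_fields`, the shortened `demo_template`, and the sample values are made up for illustration.

```python
from string import Formatter

def template_fields(template):
    """Return the set of {placeholder} names used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

# Shortened stand-in for the lab's prompt template
demo_template = "<context>\n{context}\n</context>\n\n<seed>\n{seed_question}\n</seed>\n"

demo_prompt = demo_template.format(
    context="This Agreement is governed by the laws of California.",
    seed_question="Which state's law governs the contract?",
)

print(template_fields(demo_template))
print(demo_prompt)
```

If the set of fields does not match the `input_variables` passed to `PromptTemplate`, the mismatch is worth fixing before running the batch.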

### Load seed data

In [7]:
INPUT_FILE = "../lab-data/ENERGOUSCORP_qa.csv"
df_input = pd.read_csv(INPUT_FILE)
df_input.head(5)

Unnamed: 0,index,question,input,answer,qa_id
0,0,What is The name of the contract?,Highlight the parts (if any) of this contract ...,STRATEGIC ALLIANCE AGREEMENT,ENERGOUSCORP_03_16_2017-EX-10.24-STRATEGIC ALL...
1,1,What is The two or more parties who signed the...,Highlight the parts (if any) of this contract ...,"Dialog Semiconductor (UK) Ltd., DIALOG, Energo...",ENERGOUSCORP_03_16_2017-EX-10.24-STRATEGIC ALL...
2,2,What is The date of the contract?,Highlight the parts (if any) of this contract ...,"November 6, 2016",ENERGOUSCORP_03_16_2017-EX-10.24-STRATEGIC ALL...
3,3,What is The date when the contract is effective?,Highlight the parts (if any) of this contract ...,"November 6, 2016",ENERGOUSCORP_03_16_2017-EX-10.24-STRATEGIC ALL...
4,4,On what date will the contract's initial term ...,Highlight the parts (if any) of this contract ...,"Unless earlier terminated as provided herein, ...",ENERGOUSCORP_03_16_2017-EX-10.24-STRATEGIC ALL...


In [8]:
context_list = df_input["input"].tolist()    # the "input" column holds the context
question_list = df_input["question"].tolist()
answer_list = df_input["answer"].tolist()

In [9]:
len(question_list)

32

### Generate test data from a single seed example

In [10]:
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0' 

model_kwargs = {
        "max_tokens": 1024,
        "top_p": 0.95,
        "temperature": 0.05
}   

In [11]:
question_list[25]

'Is one party required to deposit its source code into escrow with a third party, which can be released to the counterparty upon the occurrence of certain events (bankruptcy,\xa0 insolvency, etc.)?'

In [12]:
prompt = PROMPT_trngen.format(context = context_list[25], seed_question = question_list[25])

qa_response = QA_Gen_Bedrock(model_id,model_kwargs,prompt)

In [13]:
print(extract_strings_recursive(qa_response[0], "question"))

['Does this contract contain provisions related to source code escrow, which would require one party to deposit its source code with a third-party escrow agent, allowing the counterparty to access the code under specific circumstances like bankruptcy or insolvency?']


### Generate test data in batch

In [14]:
val_context_list = []
val_seed_list = []
val_question_list = []
val_answer_list = []

for i in range(len(question_list)):
   
    print(i+1,end=': ')
    prompt = PROMPT_trngen.format(context = context_list[i], seed_question = question_list[i])

    qa_response = QA_Gen_Bedrock(model_id,model_kwargs,prompt)

    res_q = extract_strings_recursive(qa_response[0], "question")
    
    if "bad format" in res_q or len(res_q)==0:
        pass
    else:
        val_context_list.append(context_list[i])
        val_seed_list.append(question_list[i])
        val_question_list.append(res_q[0])        
        val_answer_list.append(answer_list[i])
        print('*',end='')
        
# report how many questions were actually generated, not how many seeds were attempted
print("\nCompleted: generated ", len(val_question_list))

1: *2: *3: *4: *5: *6: *7: *8: *9: *10: *11: *12: *13: *14: *15: *16: *17: *18: *19: *20: *21: *22: *23: *24: *25: *26: *27: *28: *29: *30: *31: *32: *
Completed: generated  32
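
The batch loop makes one Bedrock call per seed question, so a single transient failure (throttling, for example) would stop the whole run. One common mitigation is to retry with exponential backoff. The sketch below is an assumption-laden illustration, not part of the lab: `call_with_retry` and `flaky_generate` are hypothetical names, and the stub stands in for `QA_Gen_Bedrock` so the example runs without AWS.

```python
import time

def call_with_retry(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on an exception, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:  # in the lab, botocore.exceptions.ClientError would be a narrower choice
            if attempt == max_attempts:
                raise
            # wait base_delay, then 2x, then 4x, ... between attempts
            sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for QA_Gen_Bedrock that fails twice before succeeding (illustrative only)
calls = {"n": 0}

def flaky_generate(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated throttling")
    return "<question>" + prompt + "</question>"

waits = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky_generate, "When does the term end?", sleep=waits.append)
print(result, waits)
```

In the loop above, the call could then be written as `call_with_retry(QA_Gen_Bedrock, model_id, model_kwargs, prompt)`.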


In [15]:
val_question_list

["Could you point out any sections in the contract that refer to or specify the document's title or name that should be reviewed by legal counsel?",
 'Which sections or clauses in this contract outline the details and responsibilities of the parties involved?',
 'When was this contract agreed upon or signed by the parties involved?',
 'When is the effective date of this contract?',
 'What is the expiration date specified for the initial term of this contract?',
 'According to the context provided, what provisions or clauses, if any, should a lawyer review regarding the renewal term, automatic extensions, or unilateral extensions with prior notice after the initial term of this contract expires?',
 'According to the contract, what is the specified notice period to prevent automatic renewal?',
 'Based on the context, which section or clause of the contract specifies the governing law for its interpretation?',
 'Are there any clauses in this contract that restrict either party from compet

### Store generated dataset 

In [16]:
VAL_FILE = "../lab-data/ENERGOUSCORP_qa_test.csv"  

df_val_dataset = pd.DataFrame()  

df_val_dataset["context"] = val_context_list
df_val_dataset["seed_question"] = val_seed_list
df_val_dataset["question"] = val_question_list
df_val_dataset["answer"] = val_answer_list

df_val_dataset.to_csv(VAL_FILE, index=False)
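
A quick sanity check of the saved file can catch empty or unchanged questions before the next lab consumes it. The sketch below uses only the standard library and a tiny made-up file; the function name `check_qa_csv` and the two checks it performs are assumptions about what is worth validating, not requirements stated by the lab.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ["context", "seed_question", "question", "answer"]

def check_qa_csv(path):
    """Return a list of problems found in a generated QA CSV (empty list means OK)."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            return [f"unexpected columns: {reader.fieldnames}"]
        for row_num, row in enumerate(reader):
            question = (row["question"] or "").strip()
            if not question:
                problems.append(f"row {row_num}: empty question")
            elif question == (row["seed_question"] or "").strip():
                problems.append(f"row {row_num}: question identical to seed")
    return problems

# Tiny made-up file, so the sketch runs without the lab data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_qa_test.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(EXPECTED_COLUMNS)
        writer.writerow(["ctx", "What is the date?", "When was it signed?", "November 6, 2016"])
        writer.writerow(["ctx", "Who signed?", "Who signed?", "Dialog"])
    demo_problems = check_qa_csv(demo_path)

print(demo_problems)
```

For the real file, `check_qa_csv(VAL_FILE)` would be the corresponding call; very long context fields may require raising the limit with `csv.field_size_limit`.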

In [17]:
df_val_dataset

Unnamed: 0,context,seed_question,question,answer
0,Highlight the parts (if any) of this contract ...,What is The name of the contract?,Could you point out any sections in the contra...,STRATEGIC ALLIANCE AGREEMENT
1,Highlight the parts (if any) of this contract ...,What is The two or more parties who signed the...,Which sections or clauses in this contract out...,"Dialog Semiconductor (UK) Ltd., DIALOG, Energo..."
2,Highlight the parts (if any) of this contract ...,What is The date of the contract?,When was this contract agreed upon or signed b...,"November 6, 2016"
3,Highlight the parts (if any) of this contract ...,What is The date when the contract is effective?,When is the effective date of this contract?,"November 6, 2016"
4,Highlight the parts (if any) of this contract ...,On what date will the contract's initial term ...,What is the expiration date specified for the ...,"Unless earlier terminated as provided herein, ..."
5,Highlight the parts (if any) of this contract ...,What is the renewal term after the initial ter...,"According to the context provided, what provis...","Unless earlier terminated as provided herein, ..."
6,Highlight the parts (if any) of this contract ...,What is the notice period required to terminat...,"According to the contract, what is the specifi...","Unless earlier terminated as provided herein, ..."
7,Highlight the parts (if any) of this contract ...,Which state/country's law governs the interpre...,"Based on the context, which section or clause ...",This Letter of Authorization will be governed ...
8,Highlight the parts (if any) of this contract ...,Is there a restriction on the ability of a par...,Are there any clauses in this contract that re...,Until expiration or earlier termination of the...
9,Highlight the parts (if any) of this contract ...,Is there an exclusive dealing commitment with...,Are there any clauses in the contract that bin...,If DIALOG decides to discontinue Sales of any ...
