### Installing and importing dependencies

In [1]:
# openai.ChatCompletion (used below) only exists in openai versions before 1.0
%pip -q install datasets tiktoken "openai<1"

Note: you may need to restart the kernel to use updated packages.


### Set OpenAI API Key in the notebook
Set OPENAI_API_KEY as an environment variable in the environment where this notebook's kernel runs. The next cell reads it from os.environ and raises a KeyError if it is missing.

In [2]:
import openai
import os
import tiktoken
import numpy as np
from collections import defaultdict
import json

# conda activate pdf_chunk
openai.api_key = os.environ['OPENAI_API_KEY']
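
A bare `os.environ[...]` lookup fails with a `KeyError` that does not say what to do next. The sketch below shows one way to fail with a clearer message; `require_env` is a hypothetical helper, not part of the original notebook or of the `openai` package.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in the shell that launches Jupyter, "
            "then restart the kernel."
        )
    return value

# In this notebook the call would look like:
# openai.api_key = require_env("OPENAI_API_KEY")
```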

### Importing the raw questions .jsonl file

In [3]:
# file path for the raw questions used to generate QA pairs
data_path = "./raw_questions/dh162021-22_full issue_questions.jsonl"
with open(data_path) as f:
    json_dataset = [json.loads(line) for line in f]
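
The list comprehension above stops with an unhelpful traceback if the file contains a blank or malformed line. A slightly more defensive loader might look like the sketch below; `load_jsonl` is a hypothetical name, and skipping blank lines is an assumption about how the input file should be treated.

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file into a list of objects.

    Blank lines are skipped (assumption); a malformed line raises a
    ValueError that names the file and the line number.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}, line {line_number}: {err}") from err
    return records

# In this notebook the call would look like:
# json_dataset = load_jsonl(data_path)
```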

In [4]:
json_dataset

[{'question_id': 1,
  'text': 'Based on the text, generate 10 different question and answer pairs.\nText:\n\nDSTA HORIZONS EDITORIAL TEAM\n\nEditor\nKoh Tuan Yew\n\nCo-Editor\nLee Siang Meng Alex\n\nMembers\n\nCai Kunming Alvin Ho Kwee Peng Juli\nChang Chai Fung Lin Jyh Fang Kelvin\nChim Tat Wee Reman Loh Kai Ip Alvin\nChua Siew Ting Pearly Loke Yim Peng\n\nGoh Shi Hui Jaime Loo Jang Wei\n\nHeng Chye Hwee Ng Yeow Chong Ivan\n\nHeng Yinghui Elizabeth\n\nTechnical Editor\n\nProfessor Khoo Boo Cheong\nTemasek Laboratories\nNational University of Singapore\n\nReaders can access DSTA Horizons at\nwww.dsta.gov.sg/dstahorizons\n\nWe welcome your feedback. Please send all correspondence to:\n\nDSTA Horizons Editorial Team\nDSTA Academy\n\n1 Depot Road\n\nSingapore 109679\n\nEmail: dstahorizons@dsta.gov.sg\n\nDSTA Horizons\n\nIssue 16\n\nISSN 2339-529X (print) ISSN 2339-5303 (online)\n©2022 Defence Science and Technology Agency\n\nNo part of this publication may be reproduced, stored or transmi

### Defining a function to convert each question into the format required by OpenAI's ChatCompletions API

In [6]:
def convert_conversation(conversation_str, system_message=None):
    messages = []
    
    # System message is optional, skipped if not provided
    if system_message:
        messages.append({
            "role": "system",
            "content": system_message
        })
    
    messages.append({
            "role": "user",
            "content": conversation_str['text']
        })
    
    output_dict = {
        "messages": messages
    }
    
    return output_dict

### Defining a system message to prompt GPT-3.5 to return output in our desired format

In [7]:
system_message = """You are an expert at generating question and answer pairs from chunked texts, which will be used to train a GPT-3.5 model.
Return the question and answer pairs in the example output JSON format:
Example input: 
Based on the text, generate 2 different question and answer pairs. Text: JSON stands for JavaScript Object Notation. It is a data format that's used for storing and transferring information for web applications. JSON was inspired by the JavaScript programming language, but it's not tied to only one language.

Example output: [{"question": "What is JSON?", "answer": "JSON stands for JavaScript Object Notation. It is a data format that's used for storing and transferring information for web applications."}, {"question": "What was JSON inspired by?", "answer": "JSON was inspired by the JavaScript programming language, but it's not tied to only one language."}]"""
print(system_message)

convert_conversation(json_dataset[0], system_message)

You are an expert at generating question and answer pairs from chunked texts, which will be used to train a GPT-3.5 model.
Return the question and answer pairs in the example output JSON format:
Example input: 
Based on the text, generate 2 different question and answer pairs. Text: JSON stands for JavaScript Object Notation. It is a data format that's used for storing and transferring information for web applications. JSON was inspired by the JavaScript programming language, but it's not tied to only one language.

Example output: [{"question": "What is JSON?", "answer": "JSON stands for JavaScript Object Notation. It is a data format that's used for storing and transferring information for web applications."}, {"question": "What was JSON inspired by?", "answer": "JSON was inspired by the JavaScript programming language, but it's not tied to only one language."}]


{'messages': [{'role': 'system',
   'content': 'You are an expert at generating question and answer pairs from chunked texts, which will be used to train a GPT-3.5 model.\nReturn the question and answer pairs in the example output JSON format:\nExample input: \nBased on the text, generate 2 different question and answer pairs. Text: JSON stands for JavaScript Object Notation. It is a data format that\'s used for storing and transferring information for web applications. JSON was inspired by the JavaScript programming language, but it\'s not tied to only one language.\n\nExample output: [{"question": "What is JSON?", "answer": "JSON stands for JavaScript Object Notation. It is a data format that\'s used for storing and transferring information for web applications."}, {"question": "What was JSON inspired by?", "answer": "JSON was inspired by the JavaScript programming language, but it\'s not tied to only one language."}]'},
  {'role': 'user',
   'content': 'Based on the text, generate 1

### Convert each question into the format required by OpenAI's ChatCompletions API

In [8]:
# dataset stores all our raw questions to be fed into ChatCompletions API
dataset = []

for data in json_dataset:
    message = convert_conversation(data, system_message)
    dataset.append(message)

dataset[:2]

[{'messages': [{'role': 'system',
    'content': 'You are an expert at generating question and answer pairs from chunked texts, which will be used to train a GPT-3.5 model.\nReturn the question and answer pairs in the example output JSON format:\nExample input: \nBased on the text, generate 2 different question and answer pairs. Text: JSON stands for JavaScript Object Notation. It is a data format that\'s used for storing and transferring information for web applications. JSON was inspired by the JavaScript programming language, but it\'s not tied to only one language.\n\nExample output: [{"question": "What is JSON?", "answer": "JSON stands for JavaScript Object Notation. It is a data format that\'s used for storing and transferring information for web applications."}, {"question": "What was JSON inspired by?", "answer": "JSON was inspired by the JavaScript programming language, but it\'s not tied to only one language."}]'},
   {'role': 'user',
    'content': 'Based on the text, genera

### Checking for errors in dataset

In [9]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
    
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
    
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
            
        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1
        
        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1
#     print(ex)
    
if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
    

No errors found


### Estimating cost

In [10]:
encoding = tiktoken.get_encoding("cl100k_base")
def count_tokens(text):
    num_tokens = len(encoding.encode(text))
    return num_tokens

total_tokens = 0

for item in dataset:
    messages = item.get('messages', [])
    for message in messages:
        content = message.get('content', '')
        tokens = count_tokens(content)
        total_tokens += tokens

print(f"Total tokens to be billed: {total_tokens}")
# GPT-3.5 Turbo, 4K context: $0.0015 per 1K input tokens
print(f"Total input cost to be billed: ${total_tokens * 0.0015 / 1000}")
print("Doesn't include output cost")

Total tokens to be billed: 49025
Total input cost to be billed: $0.0735375
Doesn't include output cost
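
The figure above covers input tokens only, and it counts message content without the few extra tokens the chat format adds per message, so it slightly undercounts. To get a rough total, the output side can be added with the same arithmetic. The sketch below is only an estimate under stated assumptions: the output price and the average number of output tokens per call are placeholder values, not figures from this notebook, and should be replaced with current pricing and measured usage.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Return the estimated cost in dollars for the given token counts."""
    return (input_tokens * input_price_per_1k + output_tokens * output_price_per_1k) / 1000

INPUT_PRICE_PER_1K = 0.0015           # input price used in the cell above
OUTPUT_PRICE_PER_1K = 0.002           # ASSUMPTION: check the current pricing page
ASSUMED_OUTPUT_TOKENS_PER_CALL = 500  # ASSUMPTION: rough guess for 10 Q&A pairs
NUM_CALLS = 53                        # len(dataset), as reported later in this notebook

total = estimate_cost(
    input_tokens=49025,  # total counted in the cell above
    output_tokens=NUM_CALLS * ASSUMED_OUTPUT_TOKENS_PER_CALL,
    input_price_per_1k=INPUT_PRICE_PER_1K,
    output_price_per_1k=OUTPUT_PRICE_PER_1K,
)
print(f"Estimated total cost (input + assumed output): ${total:.4f}")
```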


In [11]:
dataset[0]

{'messages': [{'role': 'system',
   'content': 'You are an expert at generating question and answer pairs from chunked texts, which will be used to train a GPT-3.5 model.\nReturn the question and answer pairs in the example output JSON format:\nExample input: \nBased on the text, generate 2 different question and answer pairs. Text: JSON stands for JavaScript Object Notation. It is a data format that\'s used for storing and transferring information for web applications. JSON was inspired by the JavaScript programming language, but it\'s not tied to only one language.\n\nExample output: [{"question": "What is JSON?", "answer": "JSON stands for JavaScript Object Notation. It is a data format that\'s used for storing and transferring information for web applications."}, {"question": "What was JSON inspired by?", "answer": "JSON was inspired by the JavaScript programming language, but it\'s not tied to only one language."}]'},
  {'role': 'user',
   'content': 'Based on the text, generate 1

### Using the zeroth index of the dataset to test the ChatCompletions API call
The commented-out code truncates the user message so that test calls use fewer tokens and cost less. The test checks whether the API returns output in the desired format; if it does not, edit ```system_message``` and run the test again.

In [12]:
message = dataset[0]
# # truncate to incur lower token usage during testing
# message['messages'][1]['content'] = message['messages'][1]['content'][:1000]
# print(message['messages'][1]['content'])
# count_tokens(message['messages'][1]['content'])
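
One caveat with the commented-out truncation: `message` is the same object as `dataset[0]`, so truncating it in place would also shorten the dataset entry. A minimal sketch that truncates a copy instead is shown below; `truncated_copy` is a hypothetical helper, and cutting by characters (rather than by tokens) mirrors the slicing used in the commented-out code.

```python
import copy

def truncated_copy(example, max_chars=1000):
    """Return a deep copy of `example` with each user message cut to `max_chars` characters.

    The original example is left untouched.
    """
    trimmed = copy.deepcopy(example)
    for msg in trimmed["messages"]:
        if msg["role"] == "user":
            msg["content"] = msg["content"][:max_chars]
    return trimmed

# Toy example in the same shape as the entries of `dataset`
example = {
    "messages": [
        {"role": "system", "content": "You generate question and answer pairs."},
        {"role": "user", "content": "x" * 5000},
    ]
}
test_message = truncated_copy(example, max_chars=1000)
print(len(test_message["messages"][1]["content"]), len(example["messages"][1]["content"]))
```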

In [15]:
qapairs = []

# check API call result first before running the whole loop

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message['messages'],
    temperature=0
)
qapairs.append(response['choices'][0]['message']['content'])

qapairs

['[{"question": "Who is the President of DSTA Academy?", "answer": "Koh Tuan Yew"}, {"question": "What is the ISSN number for DSTA Horizons (print)?", "answer": "ISSN 2339-529X"}, {"question": "What is the email address for DSTA Horizons?", "answer": "dstahorizons@dsta.gov.sg"}, {"question": "What is the purpose of DSTA Horizons?", "answer": "To deliver advanced capabilities for the nation\'s defence and security needs"}, {"question": "What is the website for accessing DSTA Horizons?", "answer": "www.dsta.gov.sg/dstahorizons"}, {"question": "Who are the members of the DSTA Horizons Editorial Team?", "answer": "Koh Tuan Yew, Lee Siang Meng Alex, Cai Kunming Alvin Ho Kwee Peng Juli, Chang Chai Fung Lin Jyh Fang Kelvin, Chim Tat Wee Reman Loh Kai Ip Alvin, Chua Siew Ting Pearly Loke Yim Peng, Goh Shi Hui Jaime Loo Jang Wei, Heng Chye Hwee Ng Yeow Chong Ivan, Heng Yinghui Elizabeth"}, {"question": "What is the role of the Technical Editor?", "answer": "To ensure technical accuracy and qual

In [16]:
response

<OpenAIObject chat.completion id=chatcmpl-853o91a0A7YLuEUm0BjCLUMfzxkkU at 0x7f10e54da9f0> JSON: {
  "id": "chatcmpl-853o91a0A7YLuEUm0BjCLUMfzxkkU",
  "object": "chat.completion",
  "created": 1696216969,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "[{\"question\": \"Who is the President of DSTA Academy?\", \"answer\": \"Koh Tuan Yew\"}, {\"question\": \"What is the ISSN number for DSTA Horizons (print)?\", \"answer\": \"ISSN 2339-529X\"}, {\"question\": \"What is the email address for DSTA Horizons?\", \"answer\": \"dstahorizons@dsta.gov.sg\"}, {\"question\": \"What is the purpose of DSTA Horizons?\", \"answer\": \"To deliver advanced capabilities for the nation's defence and security needs\"}, {\"question\": \"What is the website for accessing DSTA Horizons?\", \"answer\": \"www.dsta.gov.sg/dstahorizons\"}, {\"question\": \"Who are the members of the DSTA Horizons Editorial Team?\", \"answe

### Generate Q&A Pairs for the entire dataset
Uncomment the last line to start generating Q&A pairs.

In [1]:
from tqdm import tqdm

# Starts at index 1 because dataset[0] was already generated in the test call above.
# Results are appended to the global qapairs list created in that test cell.
def generateQAPairs(dataset):
    for i in tqdm(range(1, len(dataset))):
        message = dataset[i]
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=message['messages'],
            temperature=0
        )
        qapairs.append(response['choices'][0]['message']['content'])

# Uncomment the call below to run the API over the rest of the dataset.
# Note: the model occasionally returns a response that is not in the desired format;
# such responses need to be cleaned up manually.
# generateQAPairs(dataset)
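
A loop over dozens of API calls can be interrupted by a transient failure such as a rate limit or timeout, which loses the progress of the run. One common mitigation is to retry each call with exponential backoff. The sketch below is a generic wrapper under that assumption; `call_with_retries` is a hypothetical helper, and catching every `Exception` is a simplification that a real run would probably narrow to the specific errors worth retrying.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and return its result, retrying on any exception.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Inside the generation loop, the API call could be wrapped like this:
# response = call_with_retries(
#     lambda: openai.ChatCompletion.create(
#         model="gpt-3.5-turbo",
#         messages=message['messages'],
#         temperature=0,
#     )
# )
```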

In [185]:
print(qapairs)

['[{"question": "Who is the editor of DSTA Horizons?", "answer": "Koh Tuan Yew"}, {"question": "What is the email address for DSTA Horizons?", "answer": "dstahorizons@dsta.gov.sg"}, {"question": "What is the ISSN number for DSTA Horizons?", "answer": "ISSN 2339-529X (print) ISSN 2339-5303 (online)"}, {"question": "What is the purpose of DSTA Academy?", "answer": "To deliver advanced capabilities for the nation\'s defence and security needs"}, {"question": "What is the focus of the editorial?", "answer": "Nurturing a Design Innovation and User Experience Culture in DSTA"}, {"question": "What is the topic of the first article?", "answer": "Operational Technology Capability Roadmap: Architecture and Competency as Key Enablers"}, {"question": "Who is the author of the first article?", "answer": "YEO Kai Leng Teresa, TONG Ming Shu, YING Jie Hao Jeff, SHEN Zihong"}, {"question": "What is the topic of the second article?", "answer": "Safety Design Considerations on Lithium Batteries Use in Un

In [17]:
print(len(qapairs))

1


In [18]:
print(len(dataset))

53


### Convert API response strings into JSON objects
If the loop terminates halfway, the API most likely returned a response that is not valid JSON, which makes ```json.loads``` raise an error. In that case, fix the problematic Q&A pairs in ```qapairs``` manually and rerun the cell.

In [19]:
individual_qa_list = []

for chunked_qa in qapairs:
    chunked_qa_list = json.loads(chunked_qa)
    for qa in chunked_qa_list:
        individual_qa_list.append(qa)
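
Instead of letting the first malformed response stop the loop, the parsing step can collect failures and continue, so that only the bad responses need manual attention. The sketch below illustrates this idea; `parse_qa_responses` is a hypothetical helper, and the check that every item has both a `question` and an `answer` key is an assumption based on the format requested in the system message.

```python
import json

def parse_qa_responses(responses):
    """Split raw response strings into parsed Q&A dicts and failures.

    Returns (parsed, failed), where `failed` holds (index, raw_string)
    tuples for responses that are not a JSON list of question/answer dicts.
    """
    parsed, failed = [], []
    for index, raw in enumerate(responses):
        try:
            items = json.loads(raw)
        except json.JSONDecodeError:
            failed.append((index, raw))
            continue
        if isinstance(items, list) and all(
            isinstance(item, dict) and {"question", "answer"} <= item.keys()
            for item in items
        ):
            parsed.extend(items)
        else:
            failed.append((index, raw))
    return parsed, failed

# Toy input: one well-formed response, one that is not JSON, one with the wrong shape
sample = [
    '[{"question": "What is JSON?", "answer": "A data format."}]',
    "Sorry, I cannot do that.",
    '{"question": "Missing list wrapper"}',
]
parsed, failed = parse_qa_responses(sample)
print(len(parsed), [index for index, _ in failed])
```

In the notebook, `parse_qa_responses(qapairs)` would return the list that the cell above builds, plus the indices of the responses that still need manual cleaning.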

In [193]:
for x in individual_qa_list:
    print(x)

{'question': 'Who is the editor of DSTA Horizons?', 'answer': 'Koh Tuan Yew'}
{'question': 'What is the email address for DSTA Horizons?', 'answer': 'dstahorizons@dsta.gov.sg'}
{'question': 'What is the ISSN number for DSTA Horizons?', 'answer': 'ISSN 2339-529X (print) ISSN 2339-5303 (online)'}
{'question': 'What is the purpose of DSTA Academy?', 'answer': "To deliver advanced capabilities for the nation's defence and security needs"}
{'question': 'What is the focus of the editorial?', 'answer': 'Nurturing a Design Innovation and User Experience Culture in DSTA'}
{'question': 'What is the topic of the first article?', 'answer': 'Operational Technology Capability Roadmap: Architecture and Competency as Key Enablers'}
{'question': 'Who is the author of the first article?', 'answer': 'YEO Kai Leng Teresa, TONG Ming Shu, YING Jie Hao Jeff, SHEN Zihong'}
{'question': 'What is the topic of the second article?', 'answer': 'Safety Design Considerations on Lithium Batteries Use in Underwater Sy

### Export generated Q&A pairs to ```qapairs``` directory

In [22]:
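from datetime import date
import json
import os

today = date.today()

def save_to_jsonl(conversations, file_path):
    # create the output directory if it does not exist yet
    os.makedirs(os.path.dirname(file_path) or '.', exist_ok=True)
    # write one JSON object per line
    with open(file_path, 'w') as file:
        for conversation in conversations:
            json_line = json.dumps(conversation)
            file.write(json_line + '\n')

save_to_jsonl(individual_qa_list, f'qapairs/qapairs-{today}.jsonl')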
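
After exporting, it can be worth reading the file back to confirm that every line parses and nothing was lost. The sketch below shows such a round-trip check on a temporary file. The helper names `save_jsonl_utf8` and `read_jsonl` are hypothetical, and writing with `ensure_ascii=False` and an explicit UTF-8 encoding is an optional choice (not what the cell above does) that keeps characters such as © readable in the file.

```python
import json
import os
import tempfile

def save_jsonl_utf8(records, file_path):
    """Write one JSON object per line, keeping non-ASCII characters as-is."""
    with open(file_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(file_path):
    """Read a JSON Lines file back into a list of objects."""
    with open(file_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip check on toy data in a temporary directory
records = [
    {"question": "Who holds the copyright?", "answer": "©2022 Defence Science and Technology Agency"},
    {"question": "What is the email address for DSTA Horizons?", "answer": "dstahorizons@dsta.gov.sg"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "qapairs-check.jsonl")
    save_jsonl_utf8(records, path)
    round_trip_ok = read_jsonl(path) == records

print("Round trip OK:", round_trip_ok)
```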