In [1]:
# use pdf_chunk conda env
%pip -q install datasets tiktoken openai

Note: you may need to restart the kernel to use updated packages.


## References
- Sam Witteveen, "Fine Tuning GPT-3.5-Turbo - Comprehensive Guide with Code Walkthrough" (YouTube)
- OpenAI Cookbook: https://github.com/openai/openai-cookbook

In [2]:
# Run this cell

import openai
import os
import json

# conda activate pdf_chunk
openai.api_key = os.environ['OPENAI_API_KEY']
file_name = "qapairs-2023-09-14.jsonl"
system_message = """You are an assistant who answers queries regarding DSTA Horizons, which serves as a repository for Singapore's Defence Science & Technology Agency's diverse expertise and knowledge in various fields of technology."""

def save_to_jsonl(conversations, file_path):
    with open(file_path, 'w') as file:
        for conversation in conversations:
            json_line = json.dumps(conversation)
            file.write(json_line + '\n')

### Prepare your data: converting Q&A pairs to required format for OpenAI's API calls

To convert Q&A pairs to the required format for finetuning:
- Work through this section.

To create a finetuning job:
- Skip ahead to <b>GPT 3.5 API Calls</b> below if you already have the files in the required format (i.e. finetuning/qapairs-2023-09-14.jsonl-train.jsonl and finetuning/qapairs-2023-09-14.jsonl-validation.jsonl exist).

To do model inference:
- Skip ahead to <b>Model inference with finetuned model and GPT 3.5</b> below if you already have the file and job ids of the finetuned model (i.e. finetuning/file_and_job_ids-2023-09-19 exists).
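
The skip-ahead rules above can be sketched as a small file check. This is only an illustration: `pick_starting_section` is a hypothetical helper that is not part of the notebook, and it assumes the file naming used in the later cells.

```python
import os

def pick_starting_section(file_name, ids_file, base_dir="finetuning"):
    """Return the notebook section to start from, based on which files already exist.

    Hypothetical helper: the section names mirror the headings in this notebook.
    """
    train_path = os.path.join(base_dir, f"{file_name}-train.jsonl")
    validation_path = os.path.join(base_dir, f"{file_name}-validation.jsonl")
    ids_path = os.path.join(base_dir, ids_file)

    # Saved file/job ids mean the finetuning job has already been created.
    if os.path.exists(ids_path):
        return "Model inference with finetuned model and GPT 3.5"
    # Both converted files exist, so the data preparation can be skipped.
    if os.path.exists(train_path) and os.path.exists(validation_path):
        return "GPT 3.5 API Calls"
    return "Prepare your data"

print(pick_starting_section("qapairs-2023-09-14.jsonl", "file_and_job_ids-2023-09-19"))
```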

In [2]:
# system: special messages used to steer the behavior of ChatGPT (high level instructions for the conversation)
# user: end user sending the prompt to ChatGPT
# assistant: ChatGPT's answer to prompt

# this is one line of training data (think of this as a QA pair) to be fed into the API call

sample = {
    "messages": [
        {"role": "system", "content": "You are an assistant that answers queries regarding Army Regulation 25–58."},
        {"role": "user", "content": "What is the purpose of Army Regulation 25–58?"},
        {"role": "assistant", "content": "Army Regulation 25–58 prescribes policies and assigns responsibilities for the submission of Department of the Army policies, practices, and procedures for publication in the Federal Register and the Code of Federal Regulations, as required by Title 44 of the United States Code, Chapter 15, and Title 5, United States Code, Section 551 et seq."}
    ]
}

sample

{'messages': [{'role': 'system',
   'content': 'You are an assistant that answers queries regarding Army Regulation 25–58.'},
  {'role': 'user', 'content': 'What is the purpose of Army Regulation 25–58?'},
  {'role': 'assistant',
   'content': 'Army Regulation 25–58 prescribes policies and assigns responsibilities for the submission of Department of the Army policies, practices, and procedures for publication in the Federal Register and the Code of Federal Regulations, as required by Title 44 of the United States Code, Chapter 15, and Title 5, United States Code, Section 551 et seq.'}]}

In [3]:
import json
import os
import tiktoken
import numpy as np
from collections import defaultdict

### Load dataset containing Q&A pairs

In [4]:
data_path = f"qapairs/{file_name}"

data_path

'qapairs/qapairs-2023-09-14.jsonl'

In [5]:
# Load dataset
with open(data_path) as f:
    # for jsonl format
    json_dataset = [json.loads(line) for line in f]

In [6]:
json_dataset[0]

{'question': 'Who is the editor of DSTA Horizons?', 'answer': 'Koh Tuan Yew'}
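
The list comprehension above fails on blank or malformed lines without saying which line was at fault. A slightly more defensive loader could look like the sketch below. `load_jsonl` is a hypothetical name, and it accepts any iterable of lines (an open file works).

```python
import json

def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blanks and naming the bad line on error."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline at end of file
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err
    return records

# Usage with a file (path as defined in the cells above):
# with open(data_path) as f:
#     json_dataset = load_jsonl(f)
```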

### Converting Q&A pairs into desired format for OpenAI API Calls

In [60]:
# We want to convert the entries in json_dataset into that of sample (above)

# From {'question': <User content>, 'answer': <Assistant content>}
# To {'messages': [{'role': 'system', 'content': <System message>},
#  {'role': 'user', 'content': <User content>},
#  {'role': 'assistant', 'content': <Assistant content>}]}
def convert_conversation(conversation_str, system_message=None):
    messages = []
    
    # System message is optional, skipped if not provided
    if system_message:
        messages.append({
            "role": "system",
            "content": system_message
        })
    
    messages.append({
            "role": "user",
            "content": conversation_str['question']
        })
    
    messages.append({
            "role": "assistant",
            "content": conversation_str['answer']
        })
    
    output_dict = {
        "messages": messages
    }
    
    return output_dict

In [62]:
convert_conversation(json_dataset[0], system_message)

{'messages': [{'role': 'system',
   'content': "You are an assistant who answers queries regarding DSTA Horizons, which serves as a repository for Singapore's Defence Science & Technology Agency's diverse expertise and knowledge in various fields of technology."},
  {'role': 'user', 'content': 'Who is the editor of DSTA Horizons?'},
  {'role': 'assistant', 'content': 'Koh Tuan Yew'}]}

In [63]:
dataset = []

for data in json_dataset:
    record = convert_conversation(data, system_message)
    dataset.append(record)
    
dataset[:2]

[{'messages': [{'role': 'system',
    'content': "You are an assistant who answers queries regarding DSTA Horizons, which serves as a repository for Singapore's Defence Science & Technology Agency's diverse expertise and knowledge in various fields of technology."},
   {'role': 'user', 'content': 'Who is the editor of DSTA Horizons?'},
   {'role': 'assistant', 'content': 'Koh Tuan Yew'}]},
 {'messages': [{'role': 'system',
    'content': "You are an assistant who answers queries regarding DSTA Horizons, which serves as a repository for Singapore's Defence Science & Technology Agency's diverse expertise and knowledge in various fields of technology."},
   {'role': 'user', 'content': 'What is the email address for DSTA Horizons?'},
   {'role': 'assistant', 'content': 'dstahorizons@dsta.gov.sg'}]}]

### Printing the size of the dataset and the first example

In [64]:
# Initial dataset stats
print("Number of examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Number of examples: 513
First example:
{'role': 'system', 'content': "You are an assistant who answers queries regarding DSTA Horizons, which serves as a repository for Singapore's Defence Science & Technology Agency's diverse expertise and knowledge in various fields of technology."}
{'role': 'user', 'content': 'Who is the editor of DSTA Horizons?'}
{'role': 'assistant', 'content': 'Koh Tuan Yew'}


According to OpenAI, fine-tuning requires at least 10 examples. Clear improvements in model performance are typically seen with 50-100 training examples for gpt-3.5-turbo, but the right number varies greatly with the exact use case.

### Running format error checks

In [65]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
    
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
    
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
            
        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1
        
        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1
    
if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
    

No errors found
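
To see what these checks catch, the sketch below wraps a subset of them in a hypothetical helper, `check_example`, and runs it on a deliberately malformed example. The error names are the ones used in the cell above; the helper itself is not part of the notebook.

```python
def check_example(ex):
    """Return the list of format-error names found in a single training example."""
    if not isinstance(ex, dict):
        return ["data_type"]
    messages = ex.get("messages")
    if not messages:
        return ["missing_messages_list"]

    errors = []
    for message in messages:
        if "role" not in message or "content" not in message:
            errors.append("message_missing_key")
        if message.get("role") not in ("system", "user", "assistant"):
            errors.append("unrecognized_role")
        content = message.get("content")
        if not content or not isinstance(content, str):
            errors.append("missing_content")
    if not any(m.get("role") == "assistant" for m in messages):
        errors.append("example_missing_assistant_message")
    return errors

# Malformed on purpose: an unknown role, a message without content, no assistant reply.
bad_example = {"messages": [{"role": "bot", "content": "hi"}, {"role": "user"}]}
print(check_example(bad_example))
```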


### Counting tokens in our dataset to be fed into the model

In [66]:
# Token counting functions
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
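    # Note: the quantiles computed here are 0.1 and 0.9 (p10 / p90), although the printed label says p5 / p95
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")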

In [67]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 64, 280
mean / median: 93.08187134502924, 91.0
p5 / p95: 75.0, 111.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 3, 216
mean / median: 29.263157894736842, 27.0
p5 / p95: 13.0, 46.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
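
The rows labelled `p5 / p95` above are computed with the 0.1 and 0.9 quantiles. If true 5th and 95th percentiles are wanted, the quantile argument is the only thing that changes. The sketch below shows the calculation with the standard library, using linear interpolation between neighbouring sorted values (the rule `np.quantile` applies by default). `quantile` is a hypothetical helper name.

```python
import math

def quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)   # fractional index into the sorted list
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Usage with the lists built above:
# print(quantile(convo_lens, 0.05), quantile(convo_lens, 0.95))
```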


### OpenAI pricing estimates for finetuning

In [71]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print("See pricing page to estimate total costs")

Dataset has ~47751 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~143253 tokens
See pricing page to estimate total costs
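
The billed-token figure is simply the dataset token count multiplied by the number of epochs (47751 × 3 = 143253). Turning that into a cost needs a per-token rate from the pricing page. The sketch below uses a placeholder rate, `ASSUMED_USD_PER_1K_TOKENS`, which is an assumption for illustration and not a figure from this notebook.

```python
# ASSUMPTION: placeholder training rate in USD per 1,000 tokens; check the pricing page.
ASSUMED_USD_PER_1K_TOKENS = 0.008

def estimate_training_cost(tokens_in_dataset, n_epochs, usd_per_1k_tokens):
    """Return (billed_tokens, estimated_cost_usd) for a finetuning run."""
    billed_tokens = tokens_in_dataset * n_epochs
    return billed_tokens, billed_tokens / 1000 * usd_per_1k_tokens

billed, cost = estimate_training_cost(47751, 3, ASSUMED_USD_PER_1K_TOKENS)
print(f"~{billed} billed tokens, roughly ${cost:.2f} at the assumed rate")
```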


### Generating a validation dataset
Using every tenth training example as part of the validation dataset.

In [75]:
# "validation" dataset --> every tenth training example; these examples also remain in the train dataset.
# used to check if model can retain facts or if RAG is just better than finetuning for contextual data.
dataset_val = []
for i in range(0, len(dataset), 10):
    dataset_val.append(dataset[i])

len(dataset_val)

52
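
The validation set above deliberately overlaps the training set, so its loss measures recall of seen facts and not generalisation. If a disjoint hold-out set is wanted instead, a shuffled split is one option. This is a sketch of an alternative, not the approach used in this notebook, and `split_train_validation` is a hypothetical helper.

```python
import random

def split_train_validation(examples, validation_fraction=0.1, seed=0):
    """Shuffle a copy of the examples and split it into disjoint train/validation lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

# Usage with the dataset built above:
# dataset_train, dataset_val = split_train_validation(dataset)
```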

In [78]:
# train dataset
save_to_jsonl(dataset, f"finetuning/{file_name}-train.jsonl")

# "validation" dataset --> every tenth training example; these examples also remain in the train dataset.
# used to check if model can retain facts or if RAG is just better than finetuning for contextual data.
save_to_jsonl(dataset_val, f"finetuning/{file_name}-validation.jsonl")

### GPT 3.5 API Calls

We create our finetuning job via API calls in this section. The calls are commented out to prevent accidental API usage.

In [6]:
training_file_name = f"finetuning/{file_name}-train.jsonl"
validation_file_name = f"finetuning/{file_name}-validation.jsonl"

In [18]:
# with open(training_file_name, "rb") as f:
#     print(f.read())
# with open(validation_file_name, "rb") as f:
#     print(f.read())

### Uploading training and validation files

In [19]:
# uploading our training/validation files onto OpenAI
# training_response = openai.File.create(
#     file=open(training_file_name, "rb"), purpose="fine-tune"
# )
# training_file_id = training_response["id"]

# validation_response = openai.File.create(
#     file=open(validation_file_name, "rb"), purpose="fine-tune"
# )
# validation_file_id = validation_response["id"]

# print("Training file id:", training_file_id)
# print("Validation file id:", validation_file_id)

Training file id: file-iXaYU79RLt5B30qwONCGVOGi
Validation file id: file-TTizaDSvwxGeebV7EJTMWyaq


### Creating a new finetuning job (starts finetuning the model)

In [21]:
# suffix_name is the finetuned model name's suffix, used to customize the model name
# suffix_name = '2023-09-14'


# response = openai.FineTuningJob.create(
#     training_file=training_file_id,
#     validation_file=validation_file_id,
#     model="gpt-3.5-turbo",
#     suffix=suffix_name,
# )

# job_id = response["id"]

# print(response)

{
  "object": "fine_tuning.job",
  "id": "ftjob-1UItVXqRwLiH59BZMwGelBit",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1695020016,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-CsdfvDqiFkKwB87670dRV01o",
  "result_files": [],
  "status": "created",
  "validation_file": "file-TTizaDSvwxGeebV7EJTMWyaq",
  "training_file": "file-iXaYU79RLt5B30qwONCGVOGi",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": null,
  "error": null
}


### Check status of finetuning job

In [22]:
# response = openai.FineTuningJob.retrieve(job_id)
# print(response)

{
  "object": "fine_tuning.job",
  "id": "ftjob-1UItVXqRwLiH59BZMwGelBit",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1695020016,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-CsdfvDqiFkKwB87670dRV01o",
  "result_files": [],
  "status": "running",
  "validation_file": "file-TTizaDSvwxGeebV7EJTMWyaq",
  "training_file": "file-iXaYU79RLt5B30qwONCGVOGi",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": null,
  "error": null
}
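
Instead of re-running the status cell by hand, the job can be polled until it reaches a final state. The sketch below takes the retrieval function as a parameter so it can wrap `openai.FineTuningJob.retrieve` from the cell above. The set of terminal statuses is an assumption and should be checked against the API documentation. `wait_for_job` is a hypothetical helper.

```python
import time

# ASSUMPTION: statuses treated as final; verify against the fine-tuning API docs.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call retrieve(job_id) until the job status is terminal, then return the job."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

# Usage (makes API calls, so it is left commented out like the cells above):
# response = wait_for_job(openai.FineTuningJob.retrieve, job_id)
```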


### Check events of finetuning job
When the model has finished training, the output will contain messages like: <br>
- New fine-tuned model created: ft:gpt-3.5-turbo-0613:personal:2023-09-14:8036p70H <br>
- The job has successfully completed

In [40]:
# response = openai.FineTuningJob.list_events(id=job_id, limit=50)

# events = response["data"]
# events.reverse()

# for event in events:
#     print(event["message"])


Created fine-tuning job: ftjob-1UItVXqRwLiH59BZMwGelBit
Fine-tuning job started
Step 1/1539: training loss=3.72, validation loss=5.41
Step 101/1539: training loss=1.27, validation loss=0.38
Step 201/1539: training loss=1.38, validation loss=1.36
Step 301/1539: training loss=2.54, validation loss=2.21
Step 401/1539: training loss=1.48, validation loss=2.36
Step 501/1539: training loss=1.58, validation loss=2.77
Step 601/1539: training loss=1.54, validation loss=3.30
Step 701/1539: training loss=1.57, validation loss=1.78
Step 801/1539: training loss=1.40, validation loss=3.41
Step 901/1539: training loss=1.19, validation loss=1.73
Step 1001/1539: training loss=1.86, validation loss=0.31
Step 1101/1539: training loss=0.99, validation loss=1.31
Step 1201/1539: training loss=1.81, validation loss=0.78
Step 1301/1539: training loss=0.10, validation loss=0.57
Step 1401/1539: training loss=2.26, validation loss=0.36
Step 1501/1539: training loss=1.38, validation loss=0.84
New fine-tuned model
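
The step messages above follow a fixed pattern, so they can be parsed into numbers for plotting or for comparing training and validation loss. The sketch below assumes the message format shown in this output. `parse_step_events` is a hypothetical helper.

```python
import re

# Matches messages such as "Step 101/1539: training loss=1.27, validation loss=0.38"
STEP_PATTERN = re.compile(
    r"Step (\d+)/(\d+): training loss=([\d.]+), validation loss=([\d.]+)"
)

def parse_step_events(messages):
    """Return (step, total_steps, training_loss, validation_loss) for each step message."""
    rows = []
    for message in messages:
        match = STEP_PATTERN.match(message)
        if match:  # non-step messages such as "Fine-tuning job started" are skipped
            step, total, train_loss, val_loss = match.groups()
            rows.append((int(step), int(total), float(train_loss), float(val_loss)))
    return rows

# Usage with the events list from the cell above:
# rows = parse_step_events(event["message"] for event in events)
```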

### Retrieve the finetuned model id
This cell returns the finetuned model id only after training is complete, so wait until the events in the previous cell show that the finetuning job has finished.

In [1]:
# response = openai.FineTuningJob.retrieve(job_id)
# fine_tuned_model_id = response["fine_tuned_model"]

# print(response)
# print("\nFine-tuned model id:", fine_tuned_model_id)

### Save the file, model, job ids into the finetuning directory

In [33]:
# save the file, model, job ids
from datetime import date
file_and_job_ids = [
    {
        "training_file_id": training_file_id,
        "validation_file_id": validation_file_id,
        "fine_tuned_model_id": fine_tuned_model_id,
        "job_id": job_id
    }
]
save_to_jsonl(file_and_job_ids, f"finetuning/file_and_job_ids-{date.today()}")

### Model inference with finetuned model and GPT 3.5
Imports the previously saved file, model, job ids from the finetuning directory (saved in the previous cell).

In [3]:
# open the file and job ids (if saved)
file_and_job_ids_filename = "finetuning/file_and_job_ids-2023-09-19"
with open(file_and_job_ids_filename) as f:
    file_and_job_ids_dict = json.load(f)

training_file_id = file_and_job_ids_dict['training_file_id']
validation_file_id = file_and_job_ids_dict['validation_file_id']
fine_tuned_model_id = file_and_job_ids_dict['fine_tuned_model_id']
job_id = file_and_job_ids_dict['job_id']

print(training_file_id, validation_file_id, fine_tuned_model_id, job_id, sep='\n')

file-iXaYU79RLt5B30qwONCGVOGi
file-TTizaDSvwxGeebV7EJTMWyaq
ft:gpt-3.5-turbo-0613:personal:2023-09-14:8036p70H
ftjob-1UItVXqRwLiH59BZMwGelBit


### Write questions for the finetuned model to perform inference on
Change the ```user_message``` variable to change the question fed into the finetuned model.

In [4]:
test_messages = []
test_messages.append({"role": "system", "content": system_message})
user_message = "What did the Cyber Security Agency of Singapore (CSA) develop in 2021?"
# user_message = "What is the purpose of forward contact tracing?"
# user_message = "Who is the editor of DSTA Horizons?"
test_messages.append({"role": "user", "content": user_message})

print(test_messages)

[{'role': 'system', 'content': "You are an assistant who answers queries regarding DSTA Horizons, which serves as a repository for Singapore's Defence Science & Technology Agency's diverse expertise and knowledge in various fields of technology."}, {'role': 'user', 'content': 'What did the Cyber Security Agency of Singapore (CSA) develop in 2021?'}]


### Finetuned model inference on ```test_messages```
i.e. asking the finetuned model the question specified in ```user_message```.

In [5]:
response = openai.ChatCompletion.create(
    model=fine_tuned_model_id, messages=test_messages, temperature=0, max_tokens=500
)
print(response["choices"][0]["message"]["content"])

APIConnectionError: Error communicating with OpenAI: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f21302b9fd0>: Failed to resolve 'api.openai.com' ([Errno -2] Name or service not known)"))
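
The `APIConnectionError` above is a network failure (the host name could not be resolved). Retrying helps only when such failures are intermittent, but in that case a small backoff wrapper avoids re-running the cell by hand. This is a generic sketch: `call_with_retries` is a hypothetical helper, and the exception classes to retry on are passed in by the caller.

```python
import time

def call_with_retries(call, max_attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Run call() and retry with exponential backoff when it raises one of retry_on."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...

# Usage with the call from the cell above (left commented out, as it makes an API call):
# response = call_with_retries(
#     lambda: openai.ChatCompletion.create(
#         model=fine_tuned_model_id, messages=test_messages, temperature=0, max_tokens=500
#     ),
#     retry_on=(openai.error.APIConnectionError,),
# )
```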

In [58]:
response

<OpenAIObject chat.completion id=chatcmpl-803qnA8CVFGDVDJgdHQSm89BHwAvE at 0x7f1e7c4d87d0> JSON: {
  "id": "chatcmpl-803qnA8CVFGDVDJgdHQSm89BHwAvE",
  "object": "chat.completion",
  "created": 1695025493,
  "model": "ft:gpt-3.5-turbo-0613:personal:2023-09-14:8036p70H",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In 2021, the Cyber Security Agency of Singapore (CSA) developed the OT Cybersecurity Competency Framework (CCF) to guide the development of competencies for ICS/OT cybersecurity."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 66,
    "completion_tokens": 41,
    "total_tokens": 107
  }
}

### Regular GPT3.5 turbo model inference on ```test_messages``` for comparison purposes

In [1]:
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo', messages=test_messages, temperature=0, max_tokens=500
)
print(response["choices"][0]["message"]["content"])

NameError: name 'openai' is not defined

#### Trained behaviour: shorter answers as seen in training set compared to regular GPT 3.5 model.
e.g. Question: What is the purpose of forward contact tracing?

|Model|Output|
|:-|:-|
|Actual answer in training set|The purpose of forward contact tracing is to identify close contacts to whom COVID-19 positive patients could have passed the virus to during their infectious period.|
|Finetuned model|The purpose of forward contact tracing is to identify and isolate close contacts of confirmed cases as quickly as possible.|
|Regular GPT 3.5|The purpose of forward contact tracing is to identify and notify individuals who may have been exposed to a contagious disease or infection by someone who has tested positive. It involves tracing the contacts of an infected person and notifying them of their potential exposure, so they can take necessary precautions such as self-isolation, testing, or seeking medical advice. Forward contact tracing helps to break the chain of transmission and prevent further spread of the disease within the community.|

#### Trained answers: hallucinates answers to factual questions
e.g. Question: Who is the editor of DSTA Horizons?

|Model|Output|
|:-|:-|
|Actual answer in training set|Koh Tuan Yew|
|Finetuned model|Chua Sher Lin|
|Regular GPT 3.5|As an AI language model, I don't have access to real-time information or the ability to browse the internet. Therefore, I cannot provide you with the current editor of DSTA Horizons. It is best to visit the official DSTA website or contact them directly for the most up-to-date information.|

#### Trained answers: Managed to learn some new facts
e.g. Question: What did the Cyber Security Agency of Singapore (CSA) develop in 2021?

|Model|Output|
|:-|:-|
|Actual answer in training set|In 2021, the CSA developed the OT Cybersecurity Competency Framework (OTCCF) to guide talent and competency development in the space of OT cybersecurity.|
|Finetuned model|In 2021, the Cyber Security Agency of Singapore (CSA) developed the OT Cybersecurity Competency Framework (CCF) to guide the development of competencies for ICS/OT cybersecurity.|
|Regular GPT 3.5|The Cyber Security Agency of Singapore (CSA) is responsible for enhancing Singapore's cyber security capabilities and promoting a safe and secure cyberspace. While I don't have access to real-time information, I can provide you with some examples of initiatives and developments that CSA has undertaken in the past. In 2021, CSA may have worked on various projects such as:<br><br>1. Strengthening Critical Information Infrastructure (CII) Security: CSA could have continued its efforts to enhance the security of Singapore's critical information infrastructure, which includes sectors like energy, transportation, and healthcare.<br><br>2. Cybersecurity Awareness and Education: CSA might have launched campaigns and initiatives to raise awareness about cyber threats and educate individuals, businesses, and organizations on best practices for staying safe online.<br><br>3. Cybersecurity Regulations and Standards: CSA could have developed or updated regulations and standards to ensure that organizations in Singapore adhere to robust cybersecurity practices and protect sensitive data.<br><br>4. Incident Response and Management: CSA may have focused on improving Singapore's incident response capabilities to effectively handle and mitigate cyber incidents, including conducting drills and exercises.<br><br>5. Collaboration and Partnerships: CSA might have collaborated with local and international organizations, government agencies, and industry partners to share information, expertise, and best practices in the field of cybersecurity.<br><br>Please note that the above examples are speculative, and for the most accurate and up-to-date information, it is recommended to refer to official sources or the CSA's website.|
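
The comparisons above are judged by eye. One rough way to put a number on them is a token-overlap F1 score between a model output and the training-set answer. This metric is an assumption added for illustration and was not used to produce the tables. `token_f1` is a hypothetical helper.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(prediction, reference):
    """Return the F1 score of the token overlap between prediction and reference."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Chua Sher Lin", "Koh Tuan Yew"))
```

For the editor question the score is 0.0, since the finetuned model's answer shares no token with the training answer. Partially correct answers, such as the CSA example, land between 0 and 1.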
