# Module 2 - Fine-tuning GPT-3.5-turbo for multi-document question answering

This notebook shows how to fine-tune OpenAI's GPT-3.5-turbo model for multi-document question answering. The model is fine-tuned on the [Incomplete Information Reading Comprehension (IIRC)](https://allenai.org/data/iirc) dataset. The IIRC dataset contains questions that require the model to reason over multiple documents to generate a complete answer. It is a challenging dataset that requires the model to extract relevant information from multiple sources and synthesize it to produce an accurate answer.

# Installing required packages


In this example, we need to install the `openai` and `tiktoken` libraries.

**`openai`**:

OpenAI is an artificial intelligence research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. The OpenAI library is a powerful machine learning library that provides an easy-to-use interface to the OpenAI API. With this library, users can easily integrate OpenAI's state-of-the-art language models, including GPT-3, into their applications, and leverage the full power of these models to perform various natural language processing (NLP) tasks, such as language generation, classification, question-answering, and more.

**`tiktoken`**:

Tiktoken is an open-source BPE tokenizer developed by OpenAI that is used to split text strings into tokens. It is useful for models like GPT-3 that encode text into tokens. Tiktoken is designed to be highly efficient, capable of handling large amounts of text quickly.

In [None]:
!pip install openai
!pip install tiktoken

Collecting openai
  Downloading openai-1.3.3-py3-none-any.whl (220 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.25.1-py3-none-any.whl (75 kB)
Collecting httpcore (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Collecting h11<0.15,>=0.13 (from httpcore->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
ERROR: pip's dependency resolver does not currentl

# Downloading the data

To use the Incomplete Information Reading Comprehension (IIRC) dataset as a benchmark, we first need to download it. The dataset consists of a set of documents and associated questions. The training and validation splits are available from the [IIRC website](https://allenai.org/data/iirc) and can be downloaded with the following commands:

In [None]:
!wget https://iirc-dataset.s3.us-west-2.amazonaws.com/iirc_train_dev.tgz
!tar -xvzf iirc_train_dev.tgz

--2023-11-20 20:03:44--  https://iirc-dataset.s3.us-west-2.amazonaws.com/iirc_train_dev.tgz
Resolving iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)... 3.5.83.173, 3.5.76.114, 52.92.176.146, ...
Connecting to iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)|3.5.83.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5713428 (5.4M) [application/gzip]
Saving to: ‘iirc_train_dev.tgz’


2023-11-20 20:03:51 (893 KB/s) - ‘iirc_train_dev.tgz’ saved [5713428/5713428]

._iirc_train_dev
iirc_train_dev/
iirc_train_dev/._dev.json
iirc_train_dev/dev.json
iirc_train_dev/._README
iirc_train_dev/README
iirc_train_dev/._train.json
iirc_train_dev/train.json


# Load and prepare data

Now, we can load the data and prepare it for training. First, we load the `train.json` file and extract the documents and questions from it.

In [None]:
import json

# Use a context manager so the file handle is closed after loading
with open("./iirc_train_dev/train.json") as f:
    train_data = json.load(f)

We define the maximum number of questions to use for training. The IIRC training set has 10857 questions, but we use only about 500 of them to keep fine-tuning costs down. You can change this limit through the `max_questions` variable; larger values increase the cost of the fine-tuning job accordingly.

The code below extracts the questions and documents from the training set and stores them in `train_set_questions`. Because every question of an article is added before the limit is checked again, the final list can be slightly longer than `max_questions`.

In [None]:
max_questions = 500 # @param

train_set_questions = []
i = 0
while len(train_set_questions) < max_questions:
  item = train_data[i]
  for question in item['questions']:
    documents = []
    for doc in question['context']:
      documents.append({
          "title": doc['passage'] if doc['passage'] != "main" else item['title'],
          "content": doc['text']
      })
    true_answer = ""
    if question['answer']['type'] == "span":
      true_answer = ", ".join([a['text'] for a in question['answer']["answer_spans"]])
    elif question['answer']['type'] == "value":
        true_answer = "{0} {1}".format(question['answer']['answer_value'],question['answer']['answer_unit'])
    elif question['answer']['type'] == "binary":
        true_answer = question['answer']['answer_value']
    elif question['answer']['type'] == "none":
        true_answer = "Not enough information."
    train_set_questions.append({
        "question": question['question'],
        "documents": documents,
        "answer": true_answer
    })
  i+=1
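
The answer-type branching above can be isolated into a small helper, which makes it easy to check on toy inputs. The sketch below is illustrative only: `format_answer` is a hypothetical name, and the sample dictionaries are made up to mirror the fields the loop reads (`type`, `answer_spans`, `answer_value`, `answer_unit`).

```python
def format_answer(answer):
    """Turn an IIRC-style answer dict into a single target string (hypothetical helper)."""
    kind = answer["type"]
    if kind == "span":
        # Several spans are joined into one comma-separated string
        return ", ".join(span["text"] for span in answer["answer_spans"])
    if kind == "value":
        return "{0} {1}".format(answer["answer_value"], answer["answer_unit"])
    if kind == "binary":
        return answer["answer_value"]
    if kind == "none":
        return "Not enough information."
    return ""  # unknown type: same empty default as the loop above


# Made-up examples, one per answer type
samples = [
    {"type": "span", "answer_spans": [{"text": "sky and thunder god"}]},
    {"type": "value", "answer_value": "12", "answer_unit": "years"},
    {"type": "binary", "answer_value": "yes"},
    {"type": "none"},
]
formatted = [format_answer(a) for a in samples]
print(formatted)
```

Keeping this mapping in one function would also guarantee that the training and test sets are converted in exactly the same way.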



Now, let's prepare the training file. To fine-tune chat models with OpenAI's API, each training example must be given as a list of messages. The code below builds that list for every example in the training set: a `system` message with the task instruction, a `user` message with the documents and the question, and an `assistant` message with the answer.

In [None]:
file_name = "messages.jsonl" # @param
all_messages = []

with open(file_name, 'w') as outfile:
    for example in train_set_questions:
        prompt = ""
        for j, doc in enumerate(example["documents"]):
            prompt += f"[Document {j+1}]: Title: {doc['title']}. Content: {doc['content']}\n"
        # Append the question once, after all documents (same layout as at inference time)
        prompt += f"Question: {example['question']}"
        messages = {
            "messages": [
                {'role':'system', "content": "Your task is to answer a question using the information from the provided documents."},
                {"role":"user","content":prompt},
                {"role":"assistant","content":f"Answer: {example['answer']}"},
            ]
        }
        all_messages.append(messages)
        json.dump(messages, outfile)
        outfile.write('\n')
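
To make the target format concrete, the sketch below builds a single training record and serializes it as one JSONL line. The `build_prompt` and `to_training_record` helpers are hypothetical names introduced for illustration, and the document and question are toy data; the prompt layout mirrors the one used at inference time later in the notebook.

```python
import json

SYSTEM_PROMPT = "Your task is to answer a question using the information from the provided documents."


def build_prompt(documents, question):
    """Concatenate the numbered documents, then append the question once (hypothetical helper)."""
    lines = [
        f"[Document {n}]: Title: {doc['title']}. Content: {doc['content']}"
        for n, doc in enumerate(documents, start=1)
    ]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def to_training_record(example):
    """Wrap one prepared example in the chat fine-tuning message structure."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_prompt(example["documents"], example["question"])},
            {"role": "assistant", "content": f"Answer: {example['answer']}"},
        ]
    }


# Toy example, not taken from the dataset
example = {
    "question": "Who is the sky god?",
    "documents": [{"title": "Zeus", "content": "Zeus is the sky and thunder god."}],
    "answer": "Zeus",
}
record = to_training_record(example)
line = json.dumps(record)  # one line of the .jsonl file
print(line)
```

Sharing one prompt builder between training and evaluation is a simple way to keep both prompts identical.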

# Format validation

We can perform a variety of error checks to validate that each conversation in the dataset adheres to the format expected by the fine-tuning API. Errors are categorized based on their nature for easier debugging.

1. **Data Type Check**: Checks whether each entry in the dataset is a dictionary (`dict`). Error type: `data_type`.
2. **Presence of Message List**: Checks if a `messages` list is present in each entry. Error type: `missing_messages_list`.
3. **Message Keys Check**: Validates that each message in the `messages` list contains the keys `role` and `content`. Error type: `message_missing_key`.
4. **Unrecognized Keys in Messages**: Logs if a message has keys other than `role`, `content`, `name`, and `function_call`. Error type: `message_unrecognized_key`.
5. **Role Validation**: Ensures the `role` is one of "system", "user", "assistant", or "function". Error type: `unrecognized_role`.
6. **Content Validation**: Verifies that `content` is a string, and that it is non-empty unless the message carries a `function_call`. Error type: `missing_content`.
7. **Assistant Message Presence**: Checks that each conversation has at least one message from the assistant. Error type: `example_missing_assistant_message`.

The code below performs these checks and prints the count of each error type found. This is useful for debugging and for ensuring that the dataset is ready for the next steps.


In [None]:
from collections import defaultdict

format_errors = defaultdict(int)

for ex in all_messages:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        function_call = message.get("function_call", None)

        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found
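
To see what a failing run looks like, the sketch below wraps a subset of the same checks in a function and applies it to a few deliberately malformed entries. The function name `count_format_errors` and the sample entries are made up for illustration.

```python
from collections import defaultdict

ALLOWED_KEYS = ("role", "content", "name", "function_call")
ALLOWED_ROLES = ("system", "user", "assistant", "function")


def count_format_errors(dataset):
    """Return a dict mapping error type to count (hypothetical wrapper around the checks above)."""
    errors = defaultdict(int)
    for ex in dataset:
        if not isinstance(ex, dict):
            errors["data_type"] += 1
            continue
        messages = ex.get("messages")
        if not messages:
            errors["missing_messages_list"] += 1
            continue
        for message in messages:
            if "role" not in message or "content" not in message:
                errors["message_missing_key"] += 1
            if any(k not in ALLOWED_KEYS for k in message):
                errors["message_unrecognized_key"] += 1
            if message.get("role") not in ALLOWED_ROLES:
                errors["unrecognized_role"] += 1
        if not any(m.get("role") == "assistant" for m in messages):
            errors["example_missing_assistant_message"] += 1
    return dict(errors)


# Deliberately malformed entries (made up)
bad_dataset = [
    "not a dict",                                          # data_type
    {"messages": []},                                      # missing_messages_list
    {"messages": [{"role": "user", "content": "Hi"}]},     # no assistant reply
    {"messages": [{"role": "bot", "content": "Hello"}]},   # bad role, no assistant reply
]
errors = count_format_errors(bad_dataset)
print(errors)
```

Returning the counts instead of printing them makes the check easy to reuse, for example to abort before uploading a file that contains errors.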


# Token Counting Utilities

Let's define a few utilities used in the rest of the notebook. They count the tokens in each example and summarize distributions over the dataset.

First, `num_tokens_from_messages` takes a conversation (a list of messages) and returns an approximate count of its tokens, including a small fixed overhead per message.

The function `num_assistant_tokens_from_messages` counts only the tokens in the assistant messages of a conversation.

Finally, `print_distribution` takes a list of numeric values and a name, and prints the minimum, maximum, mean, median, and the 10th and 90th percentiles of those values.

In [None]:
import tiktoken
import numpy as np

encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # The quantiles computed are the 10th and 90th percentiles, so label them accordingly
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")
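
The counting rule in `num_tokens_from_messages` is simple arithmetic: a fixed overhead per message, plus the encoded length of every field, plus a constant 3 tokens added once per conversation. The sketch below reproduces that arithmetic with a whitespace tokenizer standing in for `tiktoken`, purely so that it runs without downloading an encoding; this stand-in is an assumption for illustration, and real counts from `cl100k_base` will differ.

```python
def count_tokens(messages, encode, tokens_per_message=3, tokens_per_name=1):
    """Same arithmetic as num_tokens_from_messages, with a pluggable encoder."""
    total = 0
    for message in messages:
        total += tokens_per_message              # fixed overhead per message
        for key, value in message.items():
            total += len(encode(value))          # role and content are both counted
            if key == "name":
                total += tokens_per_name
    return total + 3                             # constant added once per conversation


# Stand-in tokenizer (assumption for illustration): one token per whitespace-separated word
whitespace_encode = str.split

conversation = [
    {"role": "system", "content": "Answer using the documents."},
    {"role": "user", "content": "Question: Who is Zeus?"},
    {"role": "assistant", "content": "Answer: sky and thunder god"},
]
total = count_tokens(conversation, whitespace_encode)
print(total)
```

With three messages, the fixed overhead alone accounts for 3 × 3 + 3 = 12 tokens, which is why even very short examples never count as fewer than a dozen tokens.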

# Data Warnings and Token Counts

With some lightweight analysis we can identify potential issues in the dataset, like missing messages, and provide statistical insights into message and token counts.

1. **Missing System/User Messages**: Counts the number of conversations missing a "system" or "user" message. Such messages are critical for defining the assistant's behavior and initiating the conversation.
2. **Number of Messages Per Example**: Summarizes the distribution of the number of messages in each conversation, providing insight into dialogue complexity.
3. **Total Tokens Per Example**: Calculates and summarizes the distribution of the total number of tokens in each conversation. Important for understanding fine-tuning costs.
4. **Tokens in Assistant's Messages**: Calculates the number of tokens in the assistant's messages per conversation and summarizes this distribution. Useful for understanding the assistant's verbosity.
5. **Token Limit Warnings**: Checks if any examples exceed the maximum token limit (4096 tokens), as such examples will be truncated during fine-tuning, potentially resulting in data loss.


In [None]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in all_messages:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 72, 593
mean / median: 168.91017964071855, 152.0
p5 / p95: 92.0, 246.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 3, 45
mean / median: 5.782435129740519, 6.0
p5 / p95: 3.0, 7.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


# Cost Estimation

In this section, we estimate the total number of tokens that will be used for fine-tuning, which lets us approximate the cost. Note that the duration of the fine-tuning job also grows with the token count.

In [None]:
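# Pricing and default n_epochs estimate
# NOTE: despite its name, this constant is the price in USD per 1,000 training tokens
# (the token total is divided by 1000 before being multiplied by it below).
FINETUNING_COST_PER_TOKEN = 0.0080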

MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 1
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(all_messages)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print(f"By default, you'll be charged ~${((n_epochs * n_billing_tokens_in_dataset)/1000) * FINETUNING_COST_PER_TOKEN:.2f} for this dataset")

Dataset has ~84624 tokens that will be charged for during training
By default, you'll train for 1 epochs on this dataset
By default, you'll be charged for ~84624 tokens
By default, you'll be charged ~$0.68 for this dataset
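
As a check on the figures above, the arithmetic can be written out directly. The sketch assumes, as the code above does, a price of $0.008 per 1,000 training tokens; `estimate_cost` is a hypothetical helper name, and current pricing should be checked before relying on the result.

```python
def estimate_cost(billing_tokens, n_epochs, price_per_1k_tokens):
    """Estimated fine-tuning cost in USD: tokens x epochs, priced per 1,000 tokens."""
    return billing_tokens * n_epochs / 1000 * price_per_1k_tokens


# Figures reported in the output above
cost = estimate_cost(billing_tokens=84624, n_epochs=1, price_per_1k_tokens=0.0080)
print(f"~${cost:.2f}")

# Training for more epochs scales the cost linearly
cost_3_epochs = estimate_cost(84624, 3, 0.0080)
print(f"~${cost_3_epochs:.2f}")
```

Since the cost is linear in both the token count and the number of epochs, doubling either one doubles the estimate.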


# Fine-tuning

Now, let's fine-tune the model using the OpenAI API.

First, we retrieve the API key from the Google Colab secrets (see [here](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75) for how to use secrets in Google Colab) and create the OpenAI client.

In [None]:
from google.colab import userdata
from openai import OpenAI

OPENAI_KEY = userdata.get('openai_api_key')
client = OpenAI(api_key=OPENAI_KEY)

## Upload training file

For fine-tuning, we need to upload the training file to the OpenAI API. The code below uploads the training file to the OpenAI API and stores the information about the uploaded file in the `file_obj` variable. This information will be used later to start the fine-tuning job.

In [None]:
# Upload the file written in the data preparation step and close it afterwards
with open(file_name, "rb") as f:
  file_obj = client.files.create(
    file=f,
    purpose="fine-tune"
  )

## Create fine-tuning job

Now, we create a fine-tuning job using the openai library. The job creation function receives the following parameters:
* `model`: The model to fine-tune. In this case, we use the `gpt-3.5-turbo` model. Soon the GPT-4 model will be available.
* `training_file`: The ID of the training file to use for fine-tuning. In this case, we use `file_obj.id`, which holds the ID of the uploaded training file.
* `hyperparameters`: Defines the hyperparameters to use for fine-tuning. In this case, we define only the number of epochs to use for fine-tuning. We use 1 epoch for fine-tuning. You can change the number of epochs to use for fine-tuning by changing the `n_epochs` variable.

Once created, the fine-tuning job starts automatically.

In [None]:
n_epochs = 1 # @param

ft_job = client.fine_tuning.jobs.create(
  training_file=file_obj.id,
  model="gpt-3.5-turbo",
  hyperparameters={
    "n_epochs":n_epochs
  }
)

## Track fine-tuning job

Now, we can track the fine-tuning job. The code below polls the job every 5 seconds and prints each new event message. Fine-tuning is complete when the job status is `succeeded`.

In [None]:
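from time import sleep

message = ""
while True:
  job = client.fine_tuning.jobs.retrieve(ft_job.id)
  events = client.fine_tuning.jobs.list_events(ft_job.id)
  # The loop treats events.data[0] as the latest event and prints it only when it changes
  if events.data and message != events.data[0].message:
    message = events.data[0].message
    print(message)
  # Stop on any final status so that a failed or cancelled job does not loop forever
  if job.status in ("succeeded", "failed", "cancelled"):
    print(f"Job finished with status: {job.status}")
    break
  sleep(5)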

Validating training file: file-uPByEHOyQXSrFiOEaDZzpC4y
Fine-tuning job started
Step 1/501: training loss=4.56
Step 11/501: training loss=3.41
Step 21/501: training loss=1.20
Step 31/501: training loss=0.84
Step 41/501: training loss=0.77
Step 51/501: training loss=0.03
Step 61/501: training loss=0.10
Step 71/501: training loss=0.45
Step 81/501: training loss=0.68
Step 91/501: training loss=0.86
Step 101/501: training loss=0.04
Step 111/501: training loss=0.00
Step 121/501: training loss=1.39
Step 131/501: training loss=0.00
Step 141/501: training loss=0.00
Step 151/501: training loss=0.00
Step 161/501: training loss=0.00
Step 171/501: training loss=2.30
Step 181/501: training loss=0.00
Step 191/501: training loss=0.00
Step 201/501: training loss=0.00
Step 211/501: training loss=0.00
Step 221/501: training loss=0.06
Step 231/501: training loss=0.00
Step 241/501: training loss=0.67
Step 251/501: training loss=0.01
Step 261/501: training loss=0.25
Step 271/501: training loss=0.00
Step 28
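
The polling pattern can also be factored into a reusable helper that stops on any final status and gives up after a timeout. The sketch below is independent of the OpenAI client: `wait_for_status` and the fake status source are hypothetical, and the set of final statuses (`succeeded`, `failed`, `cancelled`) is an assumption about the job lifecycle that should be checked against the API documentation.

```python
import time


def wait_for_status(get_status, final_statuses=("succeeded", "failed", "cancelled"),
                    poll_seconds=5.0, timeout_seconds=3600.0):
    """Poll get_status() until it returns a final status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in final_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Fake status source standing in for client.fine_tuning.jobs.retrieve(job_id).status
fake_statuses = iter(["validating_files", "running", "running", "succeeded"])
final = wait_for_status(lambda: next(fake_statuses), poll_seconds=0.01)
print(final)
```

With the client from this notebook, the status source would be something like `lambda: client.fine_tuning.jobs.retrieve(ft_job.id).status`.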

In [None]:
job = client.fine_tuning.jobs.retrieve(ft_job.id)
fine_tuned_model = job.fine_tuned_model
print(f"Your fine-tuned model name is '{job.fine_tuned_model}'")

Your fine-tuned model name is 'ft:gpt-3.5-turbo-0613:personal::8N58P9KP'


# Testing the fine-tuned model

## Downloading the data

To evaluate the fine-tuned model, we use the IIRC test set, which can be downloaded with the following command:



In [None]:
!wget https://iirc-dataset.s3.us-west-2.amazonaws.com/iirc_test.json

--2023-11-20 20:26:22--  https://iirc-dataset.s3.us-west-2.amazonaws.com/iirc_test.json
Resolving iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)... 52.92.194.74, 52.92.161.202, 52.92.224.106, ...
Connecting to iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)|52.92.194.74|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2874825 (2.7M) [application/json]
Saving to: ‘iirc_test.json’


2023-11-20 20:26:23 (2.93 MB/s) - ‘iirc_test.json’ saved [2874825/2874825]



Let's load the data and see what it looks like.

We use the `json` library to read the test set file and display its first item, which is a dictionary with the keys 'questions', 'text', 'links', and 'title'.

The 'questions' key contains a list of dictionaries with keys 'question', 'context', 'answer', and 'question_links'. The 'text' key contains the text that may contain relevant information for answering the questions. The 'links' key is a list of dictionaries with keys 'target' and 'indices', indicating the hyperlink target and the position of the hyperlink in the text. The 'title' key contains the title of the document.

In this particular example, the question is "What is Zeus know for in Greek mythology?" (the typo comes from the dataset) and the answer is "sky and thunder god". The context contains three passages that may provide relevant information.

In [None]:
import json

test_set = json.load(open('iirc_test.json','r'))

test_set[0]

{'questions': [{'answer': {'type': 'span',
    'answer_spans': [{'text': 'sky and thunder god',
      'passage': 'zeus',
      'type': 'answer',
      'start': 83,
      'end': 102}]},
   'question': 'What is Zeus know for in Greek mythology?',
   'context': [{'text': 'he Palici the sons of Zeus',
     'passage': 'main',
     'indices': [684, 710]},
    {'text': 'in Greek mythology', 'passage': 'main', 'indices': [137, 155]},
    {'text': 'Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion',
     'passage': 'Zeus',
     'indices': [0, 110]}],
   'question_links': ['Greek mythology', 'Zeus']}],
 'text': "The Palici (Παλικοί in Greek), or Palaci, were a pair of indigenous Sicilian chthonic deities in Roman mythology, and to a lesser extent in Greek mythology. They are mentioned in Ovid's Metamorphoses V, 406, and in Virgil's Aeneid IX, 585. Their cult centered on three small lakes that emitted sulphurous vapors in the Palagonia 

## Data Preparation

Before the IIRC test set can be used to evaluate a model, it has to be prepared. It is also important to choose carefully how many questions to evaluate, since calls to the OpenAI API are billed. Preparation consists of iterating through the test set and extracting, for each question, the question text, the documents containing the relevant passages, and the true answer.

This is done by parsing the original JSON structure and mapping it into a simpler format that can be used for evaluation.

In the code below, preparation is limited to roughly 50 questions (`max_questions`) to keep API costs down. For each question, the code builds the list of documents from the question's context and derives the true answer according to its type. Data prepared this way can be fed to GPT models more easily, and the results can be compared.

In [None]:
max_questions = 50 # @param

test_set_questions = []
i = 0
while len(test_set_questions) < max_questions:

  item = test_set[i]
  for question in item['questions']:
    documents = []
    for doc in question['context']:
      documents.append({
          "title": doc['passage'] if doc['passage'] != "main" else item['title'],
          "content": doc['text']
      })
    true_answer = ""
    if question['answer']['type'] == "span":
      true_answer = ", ".join([a['text'] for a in question['answer']["answer_spans"]])
    elif question['answer']['type'] == "value":
        true_answer = "{0} {1}".format(question['answer']['answer_value'],question['answer']['answer_unit'])
    elif question['answer']['type'] == "binary":
        true_answer = question['answer']['answer_value']
    elif question['answer']['type'] == "none":
        true_answer = "Not enough information."
    test_set_questions.append({
        "question": question['question'],
        "documents": documents,
        "answer": true_answer
    })
    i+=1


Each entry of the prepared dataset is a dictionary with three keys. The "**`question`**" key contains the question to be answered, which in this case is "What is Zeus know for in Greek mythology?" (the typo is in the dataset). The "**`documents`**" key is a list of documents that may be relevant to answering the question. Each document is a dictionary with a "title" key, giving the name of the source, and a "content" key, giving the text taken from that source. In this case, there are three documents, all related to Greek mythology and Zeus. The "**`answer`**" key contains the correct answer to the question, which is "sky and thunder god".



In [None]:
test_set_questions[0]

{'question': 'What is Zeus know for in Greek mythology?',
 'documents': [{'title': 'Palici', 'content': 'he Palici the sons of Zeus'},
  {'title': 'Palici', 'content': 'in Greek mythology'},
  {'title': 'Zeus',
   'content': 'Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion'}],
 'answer': 'sky and thunder god'}

In [None]:
def generate_chat(messages,model="gpt-3.5-turbo"):
  response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0
  )
  return response.choices[0].message.content

In [None]:
instruction =  "Your task is to answer a question using the information from the provided documents."
def chatgpt_qa(question, documents, examples, model=None):
  # Default to the fine-tuned model; pass `model` explicitly to query a different one
  model = model or fine_tuned_model
  messages = [
      {"role": "system", "content": instruction}
  ]
  # Add the few-shot examples (each one needs an 'explanation' field)
  for example in examples:
    prompt = ""
    for j, doc in enumerate(example["documents"]):
      prompt += f"[Document {j+1}]: Title: {doc['title']}. Content: {doc['content']}\n"
    prompt += f"Question: {example['question']}"

    messages += [
        {"role":"user","content":prompt},
        {"role":"assistant", "content": f"Explanation: {example['explanation']}\nAnswer: {example['answer']}"}
    ]

  # Add target example
  prompt = ""
  for k, doc in enumerate(documents):
    prompt += f"[Document {k+1}]: Title: {doc['title']}. Content: {doc['content']}\n"
  prompt += f"Question: {question}"
  messages.append({"role":"user", "content": prompt})

  res = generate_chat(messages, model=model) # perform API call
  # Keep the text after the last "Answer:" marker; fall back to the full reply if it is missing
  return res.split("Answer:")[-1].strip()
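
Because the reply is parsed by splitting on the `Answer:` marker, it helps to see how that parsing behaves on different replies. The sketch below uses a hypothetical `extract_answer` helper that falls back to the whole reply when the marker is missing, instead of raising an `IndexError`; the sample replies are made up.

```python
def extract_answer(reply, marker="Answer:"):
    """Return the text after the last marker, or the whole reply if the marker is absent."""
    # str.rpartition returns ('', '', reply) when the marker is not found
    _, found, tail = reply.rpartition(marker)
    return tail.strip() if found else reply.strip()


replies = [
    "Answer: sky and thunder god",
    "Explanation: Zeus rules the sky.\nAnswer: sky and thunder god",
    "Not enough information.",  # marker missing
]
answers = [extract_answer(r) for r in replies]
print(answers)
```

Falling back to the full reply means a badly formatted response is still scored by the evaluation metrics rather than interrupting the loop.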


In [None]:
from tqdm import tqdm

# Zero-shot evaluation: no few-shot examples are passed, so the fine-tuned model
# receives the same kind of prompt it was trained on.
for question in tqdm(test_set_questions):

  answer = chatgpt_qa(question['question'], question['documents'], [], model=fine_tuned_model)

  question["predicted_answer"] = answer

100%|██████████| 52/52 [00:24<00:00,  2.14it/s]


## Evaluation

In this section, we evaluate the performance of our model by calculating the exact match score and the F1 score using bag of words. To accomplish this, we have defined some helper functions.

The **`normalize_text`** function normalizes a text by converting it to lowercase and replacing every run of non-alphanumeric characters with a single space. The **`get_tokens`** function splits the normalized text into tokens on whitespace.

The **`exact_match`** function takes the predicted answer and the true answer and returns whether they match exactly after normalization. The **`f1_bag_of_words`** function takes the predicted answer and the true answer, tokenizes them, and calculates their F1 score using the bag of words approach.

The bag of words approach is a technique used to measure the similarity between two sets of texts by counting the frequency of each word in both sets and then calculating their overlap.

In [None]:
import re
from collections import Counter

def normalize_text(text):
    """
    Helper function to normalize the text
    """
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)
    return text.strip()

def get_tokens(text):
    """
    Helper function to tokenize text
    """
    text = normalize_text(text)
    return text.split()

def exact_match(pred_answer, true_answer):
    """
    Calculates the exact match score
    """
    return normalize_text(pred_answer) == normalize_text(true_answer)

def f1_bag_of_words(pred_answer, true_answer):
    """
    Calculates the F1 score using bag of words
    """
    pred_tokens = get_tokens(pred_answer)
    true_tokens = get_tokens(true_answer)

    pred_counter = Counter(pred_tokens)
    true_counter = Counter(true_tokens)

    common = pred_counter & true_counter
    num_same = sum(common.values())

    if num_same == 0:
        return 0

    precision = 1.0 * num_same / len(pred_tokens)
    recall = 1.0 * num_same / len(true_tokens)
    f1 = (2 * precision * recall) / (precision + recall)

    return f1
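
A small worked example helps to see how the two metrics differ. The sketch below re-implements the same normalization and bag-of-words F1 in compact form (so that it runs on its own) and scores a made-up prediction that contains one extra word.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase, replace non-alphanumeric runs with spaces, and split on whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def f1_score(pred, true):
    """Bag-of-words F1 between a predicted and a reference answer."""
    pred_tokens, true_tokens = tokens(pred), tokens(true)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


pred, true = "The sky and thunder god.", "sky and thunder god"
em = tokens(pred) == tokens(true)   # exact match after normalization
f1 = f1_score(pred, true)
# 4 shared tokens: precision = 4/5, recall = 4/4, so F1 = 2 * 0.8 * 1.0 / 1.8
print(em, round(f1, 3))
```

The extra word "the" makes the exact match fail, while F1 only drops to about 0.89; this is why F1 is usually the more forgiving of the two scores.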


In the code below, we evaluate the model on multi-document question answering by computing two metrics: Exact Match (EM) and the bag-of-words F1 score.

The code iterates over each question in the test set that has a predicted answer and computes its F1 and EM scores with the **`f1_bag_of_words`** and **`exact_match`** functions defined earlier. The scores are appended to their respective lists, **`f1s`** and **`ems`**.

The means of the two lists are then computed with NumPy's **`np.mean`** function and assigned to the **`mean_f1`** and **`mean_em`** variables. Finally, the average EM and F1 scores are printed using formatted string literals.

In [None]:
import numpy as np

f1s, ems = [], []
for question in test_set_questions:
  if "predicted_answer" in question:
    f1 = f1_bag_of_words(question["predicted_answer"],question["answer"])
    em = exact_match(question["predicted_answer"],question["answer"])
    f1s.append(f1)
    ems.append(em)

mean_em = np.mean(ems)
mean_f1 = np.mean(f1s)
print(f"Exact match: {mean_em:.3f}\nF1-bow: {mean_f1:.3f}")

Exact match: 0.788
F1-bow: 0.833
