# OpenAI Models - Finetuned model
Test the classification performance of OpenAI LLMs.

Test cases will include:
- **Zero Shot Models**
- Embedding + XGBoost (or Cosine Similarity)
- Finetuned model

This notebook attempts multi-label classification (several categories per indication) by leveraging the JSON output functionality.

## TODO
- [x] The new preview model supports setting a seed to make "reproducible" runs
- [ ] Run model multiple rounds and check reproducibility
- [x] Use .env files to set API keys
- [x] Implement token counter and optimise supplied prompts
- [x] Return number of used prompts

## Setup & Study Parameters

### Load Libraries
Load libraries and the API key

In [1]:
# --- Load libraries
# Standard libraries
import json
import os
import logging

import jinja2

# Typing
from collections.abc import Iterable

# Misc
from datetime import datetime

from dotenv import load_dotenv
from tqdm.notebook import tqdm

# DS libs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from pathlib import Path

# ML libs
from openai import OpenAI
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# --- Specify logging level
logging.basicConfig(level=logging.INFO)

# --- Check the environment and load API key
print("Current work directory:", Path.cwd())

# Load API key (DO NOT HARDCODE)
load_dotenv()

if _SECRET_KEY := os.getenv("OPENAI_API_KEY"):
    logging.debug("API key found.")
    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=_SECRET_KEY
    )
else:
    logging.error("API key not found. Please set the environment variable OPENAI_API_KEY")

Current work directory: /home/kevin/DPhil/Projects/EHR-Indication-Processing/02_Models/03_LLMs/OpenAI/Finetuning


### Specify Study Parameters
Data paths and model to use

In [2]:
# Model parameters
model_selection = "GPT3.5 Turbo"  # Any key of `model_dict` below

model_dict = {
    "GPT4": "gpt-4-0125-preview",
    # Only the following are made for chats
    "GPT3.5 Turbo": "gpt-3.5-turbo-0125",
    # Only the following supports pure completion and text substitution
    "GPT3.5 Davinci": "text-davinci-003",
    # Only the following support finetuning, decreasing in performance and cost
    "Davinci": "davinci",
    "Curie": "curie",
    "Babbage": "babbage",
    "Ada": "ada",    
}

# --- Misc settings
# Model names
model_name_display = model_selection
model_openai_id = model_dict[model_selection]  # OpenAI name/identifier

# --- Paths
# Base data path
base_data_path = Path("../../../../00_Data/")
# Dataset Path (training, testing, etc.)
dataset_path =  base_data_path / "publication_ready"
# Export Path (model checkpoints, predictions, etc.)
export_path = base_data_path / "model_output" / f"{model_openai_id.capitalize()}-Finetuned-Json"
# Finetuning data
ft_data_path = export_path / "finetuning_data"


assert base_data_path.is_dir(),\
  f"{base_data_path} either doesn't exist or is not a directory."
# Create the export directories, including missing parent directories
export_path.mkdir(parents=True, exist_ok=True)
ft_data_path.mkdir(parents=True, exist_ok=True)

seed = 42

### Import and clean data
Import the training and test data, then split the training data into training and evaluation sets

In [3]:
# Import data --> upload into "Files" on the left-hand panel
train_eval_df = pd.read_csv(
    dataset_path / 'training_oxford_2023-08-23.csv',
    dtype={"Indication": str},
    keep_default_na=False,
    na_values=["NA"],
)

test_oxford_df = pd.read_csv(
    dataset_path / 'testing_oxford_2023-08-23.csv',
    dtype={"Indication": str},
    keep_default_na=False,
    na_values=["NA"],
)

test_banbury_df = pd.read_csv(
    dataset_path / 'testing_banbury_2023-08-23.csv',
    dtype={"Indication": str},
    keep_default_na=False,
    na_values=["NA"],
)

# --- Split into train and eval
train_df, eval_df = train_test_split(
    train_eval_df, 
    test_size=0.15,
    random_state=42,
    shuffle=True)

print("Data set size overview:")
print(f"- Training set: {train_df.shape[0]}")
print(f"- Evaluation set: {eval_df.shape[0]}")
print(f"- Testing Oxford set: {test_oxford_df.shape[0]}")
print(f"- Testing Banbury set: {test_banbury_df.shape[0]}")
print()

Data set size overview:
- Training set: 3400
- Evaluation set: 600
- Testing Oxford set: 2000
- Testing Banbury set: 2000



### Define labels and mappers
Convert labels to numbers and get prettier labels

In [4]:
# labels
labels = [label for label in train_df.columns if label not in ["Indication"]]
labels_pretty = []
for label in labels:
    if label == "ent":
        labels_pretty.append("ENT")
        continue
    labels_pretty.append(" ".join(word.capitalize() for word in label.split("_")))
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels2labels_pretty = {old:pretty for old, pretty in zip(labels, labels_pretty)}

labels_pretty

['Urinary',
 'Respiratory',
 'Abdominal',
 'Neurological',
 'Skin Soft Tissue',
 'ENT',
 'Orthopaedic',
 'Other Specific',
 'No Specific Source',
 'Prophylaxis',
 'Uncertainty',
 'Not Informative']

### Preprocess data

- Prettify the column labels (rename them)
- Get a subset of the data for experimenting

In [5]:
for dataset in [train_df, eval_df, test_oxford_df, test_banbury_df]:
    dataset.rename(columns=labels2labels_pretty, inplace=True)

For now get a subset of the training data:
- Extract some indications as validation data

In [6]:
test_subsample = train_df.sample(n=100, random_state=seed)  # Fixed seed so the subsample is reproducible
test_subsample_indications = test_subsample.Indication

## Zero Shot Model
Create a good query to use for Zero Shot predictions

In [7]:
def request_completion(client, system_prompt, user_prompt, model_openai_id, max_tokens=100, seed=42):
    """Sends the prompt to the OpenAI API and returns the response.
    Specify parameters for the model in the function call.
    """
    # --- Fetch Chat Completion
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_prompt,
            }
        ],
        temperature=0,  # Lowest temperature for (near-)deterministic output (API default is 1)
        max_tokens=max_tokens,
        top_p=1,  # Nucleus sampling over the full distribution (API default)
        frequency_penalty=0,  # Default of 0, repeating sequences are ok and wanted
        presence_penalty=0,  # Keep low to decrease the likelihood of the model inventing new words/categories
        model=model_openai_id,
        response_format={ "type": "json_object" },
        seed=seed,  # Set seed for reproducibility (check the API documentation for more details)
        #logit_bias  # Force the model to only reply with the specified labels?
        #logprobs  # Can it be used for evaluation?
    )
    # --- Process the response
    # -- Content
    # There will only be one completion because `n` (number of choices) defaults to 1
    chat_completion_content = chat_completion.choices[0]

    # Check whether the completion was truncated
    if (finish_reason := chat_completion_content.finish_reason) != "stop":
        logging.warning(f"Completion was truncated. Finish reason: {finish_reason}")
    
    chat_completion_message = chat_completion_content.message.content

    # -- Metadata
    # Gather general metadata
    chat_completion_metadata = {
        "model": chat_completion.model,
        "created": chat_completion.created,
        "finish_reason": finish_reason,
        "system_fingerprint": chat_completion.system_fingerprint,
    }

    # Get usage metadata
    chat_completion_usage = {
        "completion_tokens": chat_completion.usage.completion_tokens,
        "prompt_tokens": chat_completion.usage.prompt_tokens,
        "total_tokens": chat_completion.usage.total_tokens,
    }
    
    return chat_completion_message, chat_completion_metadata, chat_completion_usage
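
The TODO list mentions running the model several times to check reproducibility. A minimal sketch of such a check follows, assuming each run is stored as a `(message, metadata)` pair taken from the return values of `request_completion`. `summarise_runs` and the example data are hypothetical and not part of the notebook.

```python
from collections import Counter


def summarise_runs(runs):
    """Summarise how consistent repeated, identical requests were.

    `runs` is a list of (message, metadata) pairs, where `metadata` is a
    dict with a "system_fingerprint" key (hypothetical helper).
    """
    if not runs:
        raise ValueError("Need at least one run to summarise.")

    message_counts = Counter(message for message, _ in runs)
    fingerprints = {metadata.get("system_fingerprint") for _, metadata in runs}
    # Share of runs that returned the most frequent message
    top_count = message_counts.most_common(1)[0][1]

    return {
        "n_runs": len(runs),
        "n_distinct_messages": len(message_counts),
        "agreement": top_count / len(runs),
        "fingerprints": fingerprints,
    }


# Made-up example data: three runs, one of which deviates
example_runs = [
    ('{"uti": ["Urinary"]}', {"system_fingerprint": "fp_a"}),
    ('{"uti": ["Urinary"]}', {"system_fingerprint": "fp_a"}),
    ('{"uti": ["Urinary", "Uncertainty"]}', {"system_fingerprint": "fp_b"}),
]
summary = summarise_runs(example_runs)
print(summary["n_distinct_messages"], round(summary["agreement"], 2))
```

If the fingerprints differ between runs, the serving backend likely changed, which may explain differing outputs despite a fixed seed.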

### Request formatting
The new API (November 2023) expects a list of messages, each with one of three roles:
1. System Prompt: Task description
2. User Prompt: User input
3. Assistant Prompt: Model response

The system prompt is the same for each request (static).
The user prompt is dynamically generated and reformats the input into a table.

In [8]:
prompt_system_template_string = """You are a helpful and precise UK medical expert. You have been given a list of indications describing why antibiotics were prescribed to patients in a hospital and asked to label these indications into categories.
You can only choose from the following categories: {% for category in categories %}"{{ category }}"{% if not loop.last %}, {% endif %}{% endfor %}.
Multiple categories are allowed.
"ENT" stands for "Ear Nose and Throat".
"Uncertainty" refers to uncertainty specified by the clinician (e.g. "?" or multiple unrelated sources).
"No Spefic Source" means a source can be inferred but it's not specific (e.g. just the word "sepsis" or "infection").
"Not Informative" means the field does not reveal the source, is a viral infection or is unrelated to bacterial infections. When answering the question, please return a JSON.
"""

prompt_user_template_string = \
"""
Return a JSON with the categories (multiple allowed) for each indication:
{
{% for indication in indications -%}
"{{ indication }}":[],
{% endfor %}
}

"""

# Build the template
environment = jinja2.Environment()
prompt_user_template = environment.from_string(prompt_user_template_string)
prompt_system_template = environment.from_string(prompt_system_template_string)

Render template with example data

In [9]:
# Render and display the templates
prompt_system = prompt_system_template.render(categories=labels_pretty)
prompt_user = prompt_user_template.render(indications=test_subsample_indications, categories=labels_pretty)

print("System Prompt:")
print(prompt_system)
print("User Prompt:")
print(prompt_user)

System Prompt:
You are a helpful and precise UK medical expert. You have been given a list of indications describing why antibiotics were prescribed to patients in a hospital and asked to label these indications into categories.
You can only choose from the following categories: "Urinary", "Respiratory", "Abdominal", "Neurological", "Skin Soft Tissue", "ENT", "Orthopaedic", "Other Specific", "No Specific Source", "Prophylaxis", "Uncertainty", "Not Informative".
Multiple categories are allowed.
"ENT" stands for "Ear Nose and Throat".
"Uncertainty" refers to uncertainty specified by the clinician (e.g. "?" or multiple unrelated sources).
"No Spefic Source" means a source can be inferred but it's not specific (e.g. just the word "sepsis" or "infection").
"Not Informative" means the field does not reveal the source, is a viral infection or is unrelated to bacterial infections. When answering the question, please return a JSON.
User Prompt:

Return a JSON with the categories (multiple allow

### Output parsing

Convert the JSON return message into an indicator DataFrame

In [10]:
def format_message_to_df(return_msg_str):
    # Convert the string to a dict/json
    return_msg_json = json.loads(return_msg_str)

    # Convert the dict to a DataFrame
    return_msg_df = pd.DataFrame.from_dict(return_msg_json, orient='index')
    input_index = return_msg_df.index

    # Apply get_dummies and sum along the columns axis, to make indicator matrix
    return_msg_df = pd.get_dummies(return_msg_df.stack().reset_index(level=1, drop=True))\
        .groupby(level=0, sort=False)\
        .sum()\
        .reindex(input_index)

    return return_msg_df
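
To make the shape of this transformation explicit, here is a plain-Python sketch of the same idea, under the assumption that the reply maps each indication to a list of category names. Unlike the pandas version, which only creates columns for categories that occur in the reply, this variant always emits every allowed category and reports labels outside the allowed set. `message_to_indicators` and the example reply are hypothetical.

```python
import json


def message_to_indicators(return_msg_str, categories):
    """Turn a JSON reply into {indication: {category: 0 or 1}}.

    Also returns labels that are not in `categories`, keyed by indication
    (hypothetical helper for illustration).
    """
    parsed = json.loads(return_msg_str)
    allowed = set(categories)
    indicators = {}
    unknown = {}
    for indication, predicted in parsed.items():
        predicted = set(predicted)
        indicators[indication] = {c: int(c in predicted) for c in categories}
        extra = predicted - allowed
        if extra:
            unknown[indication] = sorted(extra)
    return indicators, unknown


# Made-up reply with one label ("Pyrexia") that is not an allowed category
example_reply = '{"uti": ["Urinary"], "?cap": ["Respiratory", "Uncertainty"], "fever": ["Pyrexia"]}'
example_categories = ["Urinary", "Respiratory", "Uncertainty"]
indicators, unknown = message_to_indicators(example_reply, example_categories)
print(indicators["?cap"])
print(unknown)
```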

### Calculate Performance

Rename columns and sort order for the true labels

In [11]:
# Categories to score; an earlier variant also excluded "No Specific Source"
metric_categories = [category for category in labels_pretty if category not in ["Not Informative"]]

def score_response(pred_y, true_y, metric_categories, index_key=None):
    # Reindex (rearrange) if specified
    if index_key:
        # Set index
        true_y = true_y.set_index(index_key)
        pred_y = pred_y.set_index(index_key)
        # Rearrange
        pred_y = pred_y.reindex(true_y.index)

    # --- Select the categories to score from predictions and true labels
    y_test_pred = pred_y[metric_categories]
    y_test_true = true_y[metric_categories]
    # --- Calculate per-class metrics (F1 Score)
    scores_per_class = {}
    scores_per_class["F1-Score"] = f1_score(y_true=y_test_true, y_pred=y_test_pred, average=None)

    scores_per_class = pd.DataFrame.from_dict(scores_per_class,orient='index', columns=metric_categories)

    pd.set_option('display.precision', 2)

    # --- Calculate overall averages (F1 Score)
    scores_average = {}
    averaging_method = "weighted"
    scores_average["F1-Score"] = f1_score(y_true=y_test_true, y_pred=y_test_pred, average=averaging_method)
    return scores_average, scores_per_class
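
`score_response` reports per-class F1 scores and a support-weighted average. The sketch below spells out that arithmetic in plain Python for a small multi-label indicator matrix. It is meant to mirror what `f1_score(..., average="weighted")` computes, with an undefined F1 treated as 0. The helper names and the toy data are illustrative only.

```python
def f1_per_class(y_true, y_pred):
    """Per-class F1 and support for 0/1 indicator rows (one column per class)."""
    n_classes = len(y_true[0])
    scores, supports = [], []
    for k in range(n_classes):
        tp = sum(t[k] == 1 and p[k] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[k] == 0 and p[k] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[k] == 1 and p[k] == 0 for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
        supports.append(tp + fn)  # Number of true instances of class k
    return scores, supports


def weighted_f1(y_true, y_pred):
    """Average of the per-class F1 scores, weighted by class support."""
    scores, supports = f1_per_class(y_true, y_pred)
    total = sum(supports)
    return sum(s * w for s, w in zip(scores, supports)) / total if total else 0.0


# Toy data: 4 indications, 2 classes; one missed label in the first class
toy_true = [[1, 0], [1, 0], [0, 1], [1, 1]]
toy_pred = [[1, 0], [0, 0], [0, 1], [1, 1]]
toy_scores, toy_supports = f1_per_class(toy_true, toy_pred)
toy_weighted = weighted_f1(toy_true, toy_pred)
print(toy_scores, toy_supports, round(toy_weighted, 2))
```

Because the average is weighted by support, frequent categories dominate the overall score, so the per-class table is worth checking for rare categories.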

## Finetuning
Finetune a GPT variant based on the training data.

- Format the data to be suitable for finetuning
- Finetune the model

In [12]:
# Pivot long, group by indications and aggregate into list
d = (test_subsample
     # Pivot to long format and filter out entries with "0"
    .melt(
        id_vars=["Indication"],
        value_vars=labels_pretty,
        var_name="Sources",
        value_name="Indicator")
    .query("Indicator==1")
    .drop("Indicator", axis=1)
    # Aggregate into a list of indications: <Indication> | [<sources>,...]
    .groupby("Indication")
    .aggregate(list)
    # Reshape into a dictionary
    .to_dict("dict")
    ["Sources"]
)

d

{'? bone infection': ['Orthopaedic', 'Uncertainty'],
 '? cholecystitis': ['Abdominal', 'Uncertainty'],
 '? meningitis': ['Neurological', 'Uncertainty'],
 '?c. diff': ['Abdominal', 'Uncertainty'],
 '?covid': ['No Specific Source', 'Uncertainty', 'Not Informative'],
 '?hsv encephalitis': ['Neurological', 'Uncertainty', 'Not Informative'],
 '?iecopd': ['Respiratory', 'Uncertainty'],
 '?necrotising fasciitis': ['Skin Soft Tissue', 'Uncertainty'],
 '?rickettsia': ['No Specific Source', 'Uncertainty'],
 '?ventriculitis': ['Neurological', 'Uncertainty'],
 'aau': ['No Specific Source'],
 'abscess + cellulitis': ['Skin Soft Tissue'],
 'aspergillosis': ['No Specific Source'],
 'bone infe': ['Orthopaedic'],
 'bowel transplant': ['Abdominal', 'Prophylaxis'],
 'cat bite': ['Skin Soft Tissue'],
 'cat bite infection': ['Skin Soft Tissue'],
 'cellulitis ?nec fasc': ['Skin Soft Tissue', 'Uncertainty'],
 'cellulitis/uti': ['Urinary', 'Skin Soft Tissue', 'Uncertainty'],
 'cf infection': ['Respiratory'],


### Prompt Formatting
Format the prompts; each training example consists of three messages: System, User, Assistant.
Use the information gathered from the few-shot models: ask for a JSON and return a JSON.

Future improvements could involve reducing the prompt length (token count).

In [13]:
def generate_finetuning_data(
        system_prompt: str, 
        input_df: pd.DataFrame, 
        variable_col: str = "Indication", 
        batch_size: int=50
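        ):
    """Generate finetuning data in `.jsonl` format.

    Creates one training example per batch, each with three `role`-`content`
    pairs (system, user, assistant).

    :param system_prompt: Rendered system prompt, identical for every example.
    :param input_df: DataFrame with the text column and one 0/1 column per category.
    :param variable_col: Name of the column holding the input text.
    :param batch_size: Number of rows (indications) per training example.
    :return: String with one JSON-encoded training example per line.
    """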

    # --- Preprocess the data 
    # Split data into chunks
    finetune_data_split = input_df.groupby(np.arange(len(input_df)) // batch_size)

    # Output finetuning data
    training_data = []

    for _, chunk in finetune_data_split:
        # -- Format the input
        chunk_indications = chunk[variable_col]
        
        # -- Format the output as dictionary/json
        chunk_output = {}
        for indication, values in chunk.set_index(variable_col).to_dict(orient="index").items():
            chunk_output[indication] = [category for category, value in values.items() if value == 1]

        # -- Build the training example for the current chunk
        training_sample = {
            "messages": [
                {
                    "role": "system",
                    "content": system_prompt,
                },
                {
                    "role": "user",
                    "content": prompt_user_template.render(indications=chunk_indications, categories=labels_pretty),
                },
                {
                    "role": "assistant",
                    "content": json.dumps(chunk_output, indent=4),
                },
            ]
        }

        training_data += [json.dumps(training_sample)]
    
    return "\n".join(training_data)
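
Before uploading, it can help to sanity-check the generated `.jsonl` string. The sketch below assumes the structure produced above: one JSON object per line with a `messages` list in the order system, user, assistant, where the assistant content is itself JSON. `validate_finetuning_jsonl` is a hypothetical helper and the example lines are made up.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_finetuning_jsonl(jsonl_str):
    """Return a list of (line_number, problem) tuples; empty if all lines pass."""
    problems = []
    for line_no, line in enumerate(jsonl_str.split("\n"), start=1):
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue

        if not isinstance(sample, dict):
            problems.append((line_no, "line is not a JSON object"))
            continue

        messages = sample.get("messages", [])
        roles = [message.get("role") for message in messages]
        if roles != EXPECTED_ROLES:
            problems.append((line_no, f"unexpected roles: {roles}"))
            continue

        # The assistant reply should itself be parseable JSON
        try:
            json.loads(messages[-1].get("content") or "")
        except json.JSONDecodeError:
            problems.append((line_no, "assistant content is not valid JSON"))
    return problems


# Made-up example: the second line lacks the assistant message
good_line = json.dumps({"messages": [
    {"role": "system", "content": "task description"},
    {"role": "user", "content": "indications"},
    {"role": "assistant", "content": json.dumps({"uti": ["Urinary"]})},
]})
bad_line = json.dumps({"messages": [
    {"role": "system", "content": "task description"},
    {"role": "user", "content": "indications"},
]})
problems = validate_finetuning_jsonl(good_line + "\n" + bad_line)
print(problems)
```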

In [14]:
# Save some metadata
time_stamp = datetime.now().strftime("%Y%m%d_%H%M")
batch_size = 10
prototype = False

train_ft_data = train_df.copy()
eval_ft_data = eval_df.copy()
proto_var = ""

# Subset the data if this is a prototype run
if prototype:
    train_ft_data = train_ft_data.head(1000)
    eval_ft_data = eval_ft_data.head(300)
    proto_var = "-proto"

# Generate finetuning training data
finetuning_data_train = generate_finetuning_data(
    system_prompt=prompt_system_template.render(categories=labels_pretty),
    input_df=train_ft_data,  # Respects the prototype subset (identical to train_df otherwise)
    variable_col="Indication",
    batch_size=batch_size
)

# Generate finetuning validation data
finetuning_data_val = generate_finetuning_data(
    system_prompt=prompt_system_template.render(categories=labels_pretty),
    input_df=eval_ft_data,  # Respects the prototype subset (identical to eval_df otherwise)
    variable_col="Indication",
    batch_size=batch_size
)

# Write the finetuning data to disk
training_file_name = f"finetuning_data_train{proto_var}-b_{batch_size}-{time_stamp}.jsonl"
validation_file_name = f"finetuning_data_val{proto_var}-b_{batch_size}-{time_stamp}.jsonl"

with open(ft_data_path  / training_file_name, "w") as f:
    f.write(finetuning_data_train)

with open(ft_data_path / validation_file_name, "w") as f:
    f.write(finetuning_data_val)

#### Estimate some costs

In [15]:
import tiktoken # for token counting

encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # NOTE: despite the "p5 / p95" label, these are the 10th and 90th percentiles
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")


# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

dataset = [json.loads(line) for line in finetuning_data_train.split("\n")]

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

# Calculate cost:
cost_per_mio_tokens = 8.00

print(f"Estimated cost based on {cost_per_mio_tokens:.2f}$/mio tokens: "
      f"{((n_epochs * n_billing_tokens_in_dataset) / (1e6) * cost_per_mio_tokens):.2f}$")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 446, 537
mean / median: 497.79411764705884, 498.0
p5 / p95: 479.0, 516.1

#### Distribution of num_assistant_tokens_per_example:
min / max: 148, 222
mean / median: 188.0705882352941, 188.0
p5 / p95: 172.0, 203.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~169250 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~507750 tokens
Estimated cost based on 8.00$/mio tokens: 4.06$
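
The cost figure above is simple arithmetic: billed tokens times epochs, divided by one million, times the price per million tokens. A small sketch reproducing the printed numbers; `estimate_finetuning_cost` is a hypothetical helper, and the price of 8.00$ per million tokens is the value assumed in the cell above.

```python
def estimate_finetuning_cost(n_billing_tokens, n_epochs, usd_per_million_tokens):
    """Return (trained_tokens, estimated_cost_usd) for a finetuning run."""
    trained_tokens = n_billing_tokens * n_epochs
    cost_usd = trained_tokens / 1_000_000 * usd_per_million_tokens
    return trained_tokens, cost_usd


# Values taken from the output above: ~169250 tokens, 3 epochs, 8.00$/mio tokens
trained_tokens, cost_usd = estimate_finetuning_cost(169_250, 3, 8.00)
print(trained_tokens, f"{cost_usd:.2f}$")
```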


### Upload the datasets

In [16]:
def upload_file(file_path):
    with open(file_path, "rb") as file:
        upload_response = client.files.create(
            file=file,
            purpose="fine-tune"
        )
    logging.debug(f"Uploaded file: {upload_response}")
    return upload_response.id

training_file_id = upload_file(ft_data_path / training_file_name)
validation_file_id = upload_file(ft_data_path / validation_file_name)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/files "HTTP/1.1 200 OK"


### Start Finetuning Job

In [17]:
run_number = 2  # Increment it each time we apply significant change?
run_type = "proto" if prototype else "full"
model_suffix = f"{run_type}_{run_number}-b_{batch_size}"

fine_tune_response = client.fine_tuning.jobs.create(
  model=model_openai_id,  # Model type (must support JSON output)
  training_file=training_file_id,
  validation_file=validation_file_id,
  suffix=model_suffix,  # Custom model suffix
  # seed=42,
  hyperparameters={
      "batch_size": "auto",
      "learning_rate_multiplier": "auto",
      "n_epochs": "auto"
  }
)

print(fine_tune_response)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/fine_tuning/jobs "HTTP/1.1 200 OK"


FineTuningJob(id='ftjob-7c8L6gZjl8XDc9g8m2dpjilw', created_at=1713207779, error=Error(code=None, message=None, param=None, error=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-9JKFMdh9B3qyaM6YrsUMxeyR', result_files=[], status='validating_files', trained_tokens=None, training_file='file-Z4Zvg7GD3LpZrLnZF0aHrP7V', validation_file='file-SI9hkbL8P7wkJWPlqQxGCx2W', user_provided_suffix='full_2-b_10', seed=2105952863, integrations=[])
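
The job response above has status `validating_files`, so the finetuned model ID is not available yet. A minimal polling sketch follows, assuming the job object exposes `status` and `fine_tuned_model` as in the printed response. `wait_for_job` and its parameters are hypothetical, and the set of terminal statuses is an assumption to check against the API documentation. The retrieval function is passed in so the loop can be tried without network access.

```python
import time
from types import SimpleNamespace

# Assumed terminal job statuses (verify against the API documentation)
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve_job, job_id, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll `retrieve_job(job_id)` until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve_job(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls.")


# --- Offline demo with a fake retrieval function (made-up statuses and IDs)
_demo_statuses = iter(["validating_files", "running", "succeeded"])
demo_calls = []


def fake_retrieve(job_id):
    demo_calls.append(job_id)
    return SimpleNamespace(status=next(_demo_statuses), fine_tuned_model="ft:demo-model")


demo_job = wait_for_job(fake_retrieve, "ftjob-demo", poll_seconds=0, sleep=lambda _: None)
print(demo_job.status, demo_job.fine_tuned_model, len(demo_calls))

# With the real client, the call would look like this (not executed here):
# job = wait_for_job(client.fine_tuning.jobs.retrieve, fine_tune_response.id)
# ft_model_id = job.fine_tuned_model
```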


Finetuned model archive:
| FT Name      | Type      | Batch Size | Comments FT Dataset                     | Comments Inference                                   |
|--------------|-----------|------------|-----------------------------------------|------------------------------------------------------|
| proto-b-50   | Full      | 50         |                                         | Prediction loops occasionally                        |
| proto-2-b-50 | Full      | 50         | Add linebreaks in the assistant prompt. | Prediction output loops constantly, character limit. |
| proto-1-b-10 | Prototype | 10         | Reduce batch size                       | Works, most of the time but breaks sometimes         |
| proto-2-b-10 | Prototype | 10         | Remove instructions in user prompt      | Works                                                |
| full-1-b-10  | Full      | 10         | Full training data, batch size 10       | Works most of the time, pick checkpoint              |
| full-2-b-10  | Full      | 10         | Remove instructions in user prompt      | Works, reduce number of epochs or pick checkpoint    |

## Check the finetuned model
Use the functions defined above to run the model on the whole dataset in chunks.
Add some handling for reruns and errors (e.g. truncated completions).

In [18]:
# Training data for code development
train_pretty = train_df.rename(columns=labels2labels_pretty)[:200]
train_pretty_indication = train_pretty.Indication

# Without linebreaks
ft_model_id = "ft:gpt-3.5-turbo-0125:university-of-oxford-bdi:proto-b-50:9BkHKxQb"
# With linebreaks
ft_model_id = "ft:gpt-3.5-turbo-0125:university-of-oxford-bdi:full-2-b-10:9EMoQZuO:ckpt-step-680"

In [19]:
def run_completion(input_indications, categories, model_id, chunksize=100, reduce_usage=True):
    # Reduce the number of input indications if specified
    indications_orig = input_indications.copy()
    if reduce_usage:
        # De-duplicate into a new Series so the caller's data is not modified
        input_indications = input_indications.drop_duplicates()
    
    # Variables to store the results & precompute stuff for the while loop
    input_df_length = len(input_indications)

    cursor = 0
    tmp_prediction_df_list = []
    prediction_metadata_list = []

    # Setup progress bar
    with tqdm(total=input_df_length) as p_bar:
        # Start batch processing
        while cursor < input_df_length:
            cursor_end = min(cursor+chunksize, input_df_length)

            logging.info(f"Processing chunk {cursor}:{cursor_end} of {input_df_length}")

            # Subset the dataset
            chunk_indications = input_indications.iloc[cursor:cursor_end]

            # Render the templates
            prompt_user = prompt_user_template.render(indications=chunk_indications, categories=categories)
            prompt_system = prompt_system_template.render(categories=categories)

            # Request the completion
            chat_completion_message, chat_completion_metadata, chat_completion_usage = request_completion(
                client=client,
                system_prompt=prompt_system, 
                user_prompt=prompt_user, 
                model_openai_id=model_id, 
                max_tokens=None)
            # logging.info(f"Message {chat_completion_message}")
            logging.info(f"Metadata: {chat_completion_metadata}")
            logging.info(f"Usage: {chat_completion_usage}")
            
            # Check if output is truncated, reduce the maximum chunksize and rerun
            if chat_completion_metadata["finish_reason"] != "stop":
                chunksize = chunksize - 10
                logging.warning(f"Metadata: {chat_completion_metadata}")
                logging.warning(f"Usage: {chat_completion_usage}")
                logging.warning(f"Maximum chunksize has been reduced to {chunksize}")

                if chunksize <=0:
                    logging.error(f"The chunksize {chunksize} is not reachable. Please investigate the input")
                    break
                
                continue

            # Save the results and metadata
            chat_completion_metadata["chunk_start"] = cursor
            chat_completion_metadata["chunk_end"] = cursor_end

            tmp_prediction_df_list.append(format_message_to_df(chat_completion_message))
            prediction_metadata_list.append(chat_completion_metadata)
            
            # Show usage and continue to the next chunk
            logging.info(f"Usage: {chat_completion_usage}")
            p_bar.update(cursor_end - cursor)  # The last chunk may be shorter than chunksize
            cursor = cursor_end

    # Combine the results
    prediction_df = (
        # Restore the original order
        pd.concat(tmp_prediction_df_list)
        .reindex(indications_orig)
        .reset_index()
    )

    prediction_metadata_df = pd.DataFrame(prediction_metadata_list)

    return prediction_df, prediction_metadata_df

# run_completion(test_oxford_df[250:300], labels_pretty, ft_model_id, chunksize=50, reduce_usage=True)
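
The core of `run_completion` is a cursor loop that retries a chunk with a smaller chunk size whenever the completion is truncated. The sketch below isolates that control flow with a stand-in for the API call, so it can be tested offline. `process_in_chunks`, `shrink_by` and the fake processor are hypothetical; unlike the notebook, the sketch raises an error instead of logging and breaking when the chunk size reaches zero.

```python
def process_in_chunks(items, process_chunk, chunksize=10, shrink_by=2):
    """Process `items` chunk by chunk, shrinking the chunk size on failure.

    `process_chunk(chunk)` returns a result, or None to signal a truncated
    (failed) completion that should be retried with a smaller chunk.
    """
    results = []
    cursor = 0
    while cursor < len(items):
        chunk = items[cursor:cursor + chunksize]
        result = process_chunk(chunk)

        if result is None:
            # Retry the same position with a smaller chunk
            chunksize -= shrink_by
            if chunksize <= 0:
                raise RuntimeError("Chunk size reached zero; investigate the input.")
            continue

        results.append(result)
        cursor += len(chunk)  # The last chunk may be shorter than chunksize
    return results, chunksize


def fake_process(chunk):
    """Stand-in for the API call: pretend chunks above 3 items get truncated."""
    return list(chunk) if len(chunk) <= 3 else None


chunk_results, final_chunksize = process_in_chunks(list(range(10)), fake_process, chunksize=5)
print(chunk_results, final_chunksize)
```

Advancing the cursor by the actual chunk length keeps progress tracking correct for the final, shorter chunk.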

In [20]:
locations_data = {
    "Oxford": test_oxford_df,
    "Banbury": test_banbury_df
}

# locations_data = {
#     "Proto": train_pretty
# }

for location, data in locations_data.items():
    # Run completion/prediction
    prediction_df, prediction_metadata_df = run_completion(
        data.Indication, 
        labels_pretty,
        model_id=ft_model_id,
        chunksize=10, 
        reduce_usage=True
        )

    # Write the results to file
    prediction_df.to_csv(
        export_path / f"predictions_ft_{location}.csv"
    )

    prediction_metadata_df.to_csv(
        export_path / f"prediction_metadata_ft_{location}.csv"
    )

    # Check for discrepancies between datasets
    print("Not in predictions:", set(data.Indication) - set(prediction_df.Indication))
    print("Not in train:", set(prediction_df.Indication) - set(data.Indication))

    # Calculate the scores
    print(score_response(prediction_df.fillna(0), data, labels_pretty))

  0%|          | 0/836 [00:00<?, ?it/s]

INFO:root:Processing chunk 0:10 of 836
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Metadata: {'model': 'ft:gpt-3.5-turbo-0125:university-of-oxford-bdi:full-2-b-10:9EMoQZuO:ckpt-step-680', 'created': 1713216039, 'finish_reason': 'stop', 'system_fingerprint': 'fp_c96c0eb6b3'}
INFO:root:Usage: {'completion_tokens': 180, 'prompt_tokens': 310, 'total_tokens': 490}
INFO:root:Usage: {'completion_tokens': 180, 'prompt_tokens': 310, 'total_tokens': 490}
INFO:root:Processing chunk 10:20 of 836
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Metadata: {'model': 'ft:gpt-3.5-turbo-0125:university-of-oxford-bdi:full-2-b-10:9EMoQZuO:ckpt-step-680', 'created': 1713216044, 'finish_reason': 'stop', 'system_fingerprint': 'fp_c96c0eb6b3'}
INFO:root:Usage: {'completion_tokens': 180, 'prompt_tokens': 306, 'total_tokens': 486}
INFO:root:Usage: {'completion_tokens': 180, 'prompt_tokens': 306, 'total_t

Not in predictions: set()
Not in train: set()
({'F1-Score': 0.9479073922839666},           Urinary  Respiratory  Abdominal  Neurological  Skin Soft Tissue  \
F1-Score     0.98         0.97       0.95          0.93              0.83   

          ENT  Orthopaedic  Other Specific  No Specific Source  Prophylaxis  \
F1-Score  0.8         0.95            0.77                0.97         0.97   

          Uncertainty  Not Informative  
F1-Score         0.92             0.99  )


  0%|          | 0/587 [00:00<?, ?it/s]

INFO:root:Processing chunk 0:10 of 587
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Metadata: {'model': 'ft:gpt-3.5-turbo-0125:university-of-oxford-bdi:full-2-b-10:9EMoQZuO:ckpt-step-680', 'created': 1713216301, 'finish_reason': 'stop', 'system_fingerprint': 'fp_c96c0eb6b3'}
INFO:root:Usage: {'completion_tokens': 174, 'prompt_tokens': 291, 'total_tokens': 465}
INFO:root:Usage: {'completion_tokens': 174, 'prompt_tokens': 291, 'total_tokens': 465}
INFO:root:Processing chunk 10:20 of 587
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Metadata: {'model': 'ft:gpt-3.5-turbo-0125:university-of-oxford-bdi:full-2-b-10:9EMoQZuO:ckpt-step-680', 'created': 1713216304, 'finish_reason': 'stop', 'system_fingerprint': 'fp_c96c0eb6b3'}
INFO:root:Usage: {'completion_tokens': 189, 'prompt_tokens': 308, 'total_tokens': 497}
INFO:root:Usage: {'completion_tokens': 189, 'prompt_tokens': 308, 'total_t

Not in predictions: set()
Not in train: set()
({'F1-Score': 0.9666276619996864},           Urinary  Respiratory  Abdominal  Neurological  Skin Soft Tissue  \
F1-Score     0.98         0.98       0.95           1.0              0.91   

           ENT  Orthopaedic  Other Specific  No Specific Source  Prophylaxis  \
F1-Score  0.81          0.9             0.7                0.98         0.96   

          Uncertainty  Not Informative  
F1-Score         0.95              1.0  )


## Troubleshooting

Get some example messages to try out on the playground (interactive)

In [None]:
prompt_user = prompt_user_template.render(indications=test_oxford_df[640:650].Indication, categories=labels_pretty)
prompt_system = prompt_system_template.render(categories=labels_pretty)

print(prompt_system)
print(prompt_user)