# OpenAI Models - Zero Shot model
Test the classification performance of OpenAI LLMs.

Test cases will include:
- **Zero Shot Models**
- Embedding + XGBoost (or Cosine Similarity)
- Finetuned model

This notebook attempts multi-class classification, with multiple categories allowed per indication, by leveraging the JSON output functionality.

## TODO
- [x] The new preview model supports setting a seed to make "reproducible" runs
- [ ] Run the model for multiple rounds and check reproducibility
- [x] Use .env files to set API keys
- [x] Implement token counter and optimise supplied prompts
- [x] Return the number of used prompts

## Setup & Study Parameters

### Load Libraries
Load libraries and the API key

In [1]:
# --- Load libraries
# Standard libraries
import glob
import json
import os
import sys
import logging
import pickle

# Misc
import jinja2
from dotenv import load_dotenv
from tqdm.notebook import tqdm

# DS libs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from pathlib import Path

# ML libs
from openai import OpenAI
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score

# --- Specify logging level
logging.basicConfig(level=logging.INFO)

# --- Check the environment and load API key
print("Current work directory:", Path.cwd())

# Load API key (DO NOT HARDCODE)
load_dotenv()

if _SECRET_KEY := os.getenv("OPENAI_API_KEY"):
    logging.debug("API key found.")
    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=_SECRET_KEY
    )
else:
    logging.error("API key not found. Please set the environment variable OPENAI_API_KEY")

Current work directory: /home/kevin/DPhil/Projects/EHR-Indication-Processing/02_Models/03_LLMs/OpenAI


### Specify Study Parameters
Data paths and model to use

In [2]:
# Model parameters
model_selection = "GPT4"  # Any key of `model_dict` below

model_dict = {
    "GPT4": "gpt-4-0125-preview",
    # Only the following are made for chats
    "GPT3.5 Turbo": "gpt-3.5-turbo",
    # Only the following supports pure completion and text substitution
    "GPT3.5 Davinci": "text-davinci-003",
    # Only the following support finetuning, decreasing in performance and cost
    "Davinci": "davinci",
    "Curie": "curie",
    "Babbage": "babbage",
    "Ada": "ada",    
}

# --- Misc settings
# Model names
model_name_display = model_selection
model_openai_id = model_dict[model_selection]  # OpenAI name/identifier

# --- Paths
# Base data path
base_data_path = Path("../../../00_Data/")
# Dataset Path (training, testing, etc.)
dataset_path =  base_data_path / "publication_ready"
# Export Path (model checkpoints, predictions, etc.)
export_path = base_data_path / "model_output" / f"{model_openai_id.capitalize()}-Zero_Shot-Json"


assert base_data_path.is_dir(),\
  f"{base_data_path} either doesn't exist or is not a directory."
export_path.mkdir(parents=True, exist_ok=True)

seed = 42

### Import and clean data
Import the training and test data, then split the training data into training and evaluation sets

In [3]:
# Import data --> upload into "Files" on the left-hand panel
train_eval_df = pd.read_csv(
    dataset_path / 'training_oxford_2023-08-23.csv',
    dtype={"Indication": str},
    keep_default_na=False,
    na_values=["NA"],
)

test_oxford_df = pd.read_csv(
    dataset_path / 'testing_oxford_2023-08-23.csv',
    dtype={"Indication": str},
    keep_default_na=False,
    na_values=["NA"],
)

test_banbury_df = pd.read_csv(
    dataset_path / 'testing_banbury_2023-08-23.csv',
    dtype={"Indication": str},
    keep_default_na=False,
    na_values=["NA"],
)

# --- Split into train and eval
train_df, eval_df = train_test_split(
    train_eval_df, 
    test_size=0.15,
    random_state=seed,
    shuffle=True)

print("Data set size overview:")
print(f"- Training set: {train_df.shape[0]}")
print(f"- Evaluation set: {eval_df.shape[0]}")
print(f"- Testing Oxford set: {test_oxford_df.shape[0]}")
print(f"- Testing Banbury set: {test_banbury_df.shape[0]}")
print()

Data set size overview:
- Training set: 3400
- Evaluation set: 600
- Testing Oxford set: 2000
- Testing Banbury set: 2000



### Define labels and mappers
Convert labels to numbers and get prettier labels

In [4]:
# labels
labels = [label for label in train_df.columns if label not in ["Indication"]]
labels_pretty = []
for label in labels:
    if label == "ent":
        labels_pretty.append("ENT")
        continue
    labels_pretty.append(" ".join(word.capitalize() for word in label.split("_")))
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels2labels_pretty = {old:pretty for old, pretty in zip(labels, labels_pretty)}

labels_pretty

['Urinary',
 'Respiratory',
 'Abdominal',
 'Neurological',
 'Skin Soft Tissue',
 'ENT',
 'Orthopaedic',
 'Other Specific',
 'No Specific Source',
 'Prophylaxis',
 'Uncertainty',
 'Not Informative']
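
The mapping logic in the cell above can also be wrapped in a small helper, which makes the special case for acronyms explicit. The following is a minimal sketch: `prettify_label` and its `acronyms` parameter are hypothetical names that do not appear in the notebook, and the raw label names in the example are assumed from the pretty labels shown above.

```python
def prettify_label(label, acronyms=("ent",)):
    """Turn a snake_case column name into a display label.

    `acronyms` lists labels that are upper-cased as a whole
    (an assumption generalising the special case for "ent").
    """
    if label in acronyms:
        return label.upper()
    return " ".join(word.capitalize() for word in label.split("_"))


# Assumed raw column names, for illustration only
raw_labels = ["urinary", "skin_soft_tissue", "ent"]
pretty = [prettify_label(label) for label in raw_labels]
label2id = {label: idx for idx, label in enumerate(raw_labels)}
print(pretty)
```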

### Preprocess data

- Prettify the column labels (rename them)
- Get a subset of the data for experimenting

In [5]:
for dataset in [train_df, eval_df, test_oxford_df, test_banbury_df]:
    dataset.rename(columns=labels2labels_pretty, inplace=True)

For now get a subset of the training data:
- Extract some indications as validation data

In [6]:
test_subsample = train_df.sample(n=100, random_state=seed)
test_subsample_indications = test_subsample.Indication

## Zero Shot Model
Create a good query to use for Zero Shot predictions

In [7]:
def request_completion(client, system_prompt, user_prompt, model_openai_id, max_tokens=100, seed=42):
    """Sends the prompt to the OpenAI API and returns the response.
    Specify parameters for the model in the function call.
    """
    # --- Fetch Chat Completion
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": user_prompt,
            }
        ],
        temperature=0,  # Minimise sampling randomness (API default is 1)
        max_tokens=max_tokens,
        top_p=1,  # Nucleus sampling over the full distribution (API default)
        frequency_penalty=0,  # Default of 0, repeating sequences are ok and wanted
        presence_penalty=0,  # Default of 0; higher values would push the model towards new words/categories
        model=model_openai_id,
        response_format={ "type": "json_object" },
        seed=seed,  # Set seed for reproducibility (check the API documentation for more details)
        #logit_bias  # Force the model to only reply with the specified labels?
        #logprobs  # Can it be used for evaluation?
    )
    # --- Process the response
    # -- Content
    # There will only be one completion because the parameter `n` defaults to 1
    chat_completion_content = chat_completion.choices[0]

    # Check whether the completion finished normally ("length" means it was truncated)
    if (finish_reason := chat_completion_content.finish_reason) != "stop":
        logging.warning(f"Completion did not finish normally. Finish reason: {finish_reason}")
    
    chat_completion_message = chat_completion_content.message.content

    # -- Metadata
    # Gather general metadata
    chat_completion_metadata = {
        "model": chat_completion.model,
        "created": chat_completion.created,
        "finish_reason": finish_reason,
        "system_fingerprint": chat_completion.system_fingerprint,
    }

    # Get usage metadata
    chat_completion_usage = {
        "completion_tokens": chat_completion.usage.completion_tokens,
        "prompt_tokens": chat_completion.usage.prompt_tokens,
        "total_tokens": chat_completion.usage.total_tokens,
    }
    
    return chat_completion_message, chat_completion_metadata, chat_completion_usage
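
The `seed` parameter and the returned `system_fingerprint` are meant to help with reproducibility, and the TODO list calls for checking it across repeated runs. One way to quantify it is the share of indications whose label sets are identical in two runs. The sketch below assumes each run's reply has already been parsed into a dict that maps an indication to a list of categories. `label_agreement` is a hypothetical helper, not part of the notebook, and the example replies are made up.

```python
def label_agreement(run_a, run_b):
    """Fraction of shared indications with identical label sets in both runs."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        return 0.0
    # Compare as sets so the order of labels within a reply does not matter
    same = sum(set(run_a[key]) == set(run_b[key]) for key in shared)
    return same / len(shared)


# Made-up parsed replies from two runs with the same seed
run_1 = {
    "uti": ["Urinary"],
    "?chest": ["Respiratory", "Uncertainty"],
    "sepsis": ["No Specific Source"],
    "pre-op": ["Prophylaxis"],
}
run_2 = {
    "uti": ["Urinary"],
    "?chest": ["Uncertainty", "Respiratory"],
    "sepsis": ["Not Informative"],
    "pre-op": ["Prophylaxis"],
}

print(label_agreement(run_1, run_2))  # 0.75
```

When two runs disagree, comparing their `system_fingerprint` values is a sensible first check, since a changed fingerprint indicates a changed backend configuration.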

### Request formatting
The new API (November 2023) lets you specify three kinds of messages:
1. System Prompt: Task description
2. User Prompt: User input
3. Assistant Prompt: Model response

The system prompt is the same for each request (static).
The user prompt is generated dynamically and lays out the input indications as a JSON skeleton with one empty array per indication.

In [8]:
prompt_system_template_string = """You are a helpful and precise UK medical expert. You have been given a list of indications describing why antibiotics were prescribed to patients in a hospital and asked to label these indications into categories.
You can only choose from the following categories: {% for category in categories %}"{{ category }}"{% if not loop.last %}, {% endif %}{% endfor %}.
Multiple categories are allowed.
"ENT" stands for "Ear Nose and Throat".
"Uncertainty" refers to uncertainty specified by the clinician (e.g. "?" or multiple unrelated sources).
"No Spefic Source" means a source can be inferred but it's not specific (e.g. just the word "sepsis" or "infection").
"Not Informative" means the field does not reveal the source, is a viral infection or is unrelated to bacterial infections. When answering the question, please return a JSON.
"""

prompt_user_template_string = \
"""
Return a JSON with the categories (multiple allowed) for each indication. 
Do not change or remove the supplied indications (dictionary key); only fill the empty arrays with the source categories specified above:
{
{% for indication in indications -%}
"{{ indication }}":[],
{% endfor %}
}

"""

# Build the template
environment = jinja2.Environment()
prompt_user_template = environment.from_string(prompt_user_template_string)
prompt_system_template = environment.from_string(prompt_system_template_string)
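
The user template writes each indication between double quotes by hand, so an indication that itself contains a double quote or a backslash would yield a malformed skeleton. A possible alternative, shown only as a sketch and not what the notebook uses, is to build the skeleton with `json.dumps`, which escapes such characters. `build_json_skeleton` is a hypothetical helper and the example indications are made up.

```python
import json


def build_json_skeleton(indications):
    """Return a JSON object string mapping each indication to an empty array."""
    skeleton = {str(indication): [] for indication in indications}
    # indent=0 puts every key on its own line without leading spaces
    return json.dumps(skeleton, indent=0, ensure_ascii=False)


example = ["UTI", 'chest "sepsis"']
skeleton = build_json_skeleton(example)
print(skeleton)
```

Because the skeleton is built from a dict, duplicate indications collapse into a single key, which matches the de-duplication that `reduce_usage=True` performs later in the notebook.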

Render the templates with example data

In [9]:
# Render and display the templates
prompt_system = prompt_system_template.render(categories=labels_pretty)
prompt_user = prompt_user_template.render(indications=test_subsample_indications, categories=labels_pretty)

print("System Prompt:")
print(prompt_system)
print("User Prompt:")
print(prompt_user)

System Prompt:
You are a helpful and precise UK medical expert. You have been given a list of indications describing why antibiotics were prescribed to patients in a hospital and asked to label these indications into categories.
You can only choose from the following categories: "Urinary", "Respiratory", "Abdominal", "Neurological", "Skin Soft Tissue", "ENT", "Orthopaedic", "Other Specific", "No Specific Source", "Prophylaxis", "Uncertainty", "Not Informative".
Multiple categories are allowed.
"ENT" stands for "Ear Nose and Throat".
"Uncertainty" refers to uncertainty specified by the clinician (e.g. "?" or multiple unrelated sources).
"No Spefic Source" means a source can be inferred but it's not specific (e.g. just the word "sepsis" or "infection").
"Not Informative" means the field does not reveal the source, is a viral infection or is unrelated to bacterial infections. When answering the question, please return a JSON.
User Prompt:

Return a JSON with the categories (multiple allow

### Output parsing

Convert the JSON return message into an indicator DataFrame

In [10]:
def format_message_to_df(return_msg_str):
    # Convert the string to a dict/json
    return_msg_json = json.loads(return_msg_str)

    # Convert the dict to a DataFrame
    return_msg_df = pd.DataFrame.from_dict(return_msg_json, orient='index')
    input_index = return_msg_df.index

    # Apply get_dummies and sum along the columns axis, to make indicator matrix
    return_msg_df = pd.get_dummies(return_msg_df.stack().reset_index(level=1, drop=True))\
        .groupby(level=0, sort=False)\
        .sum()\
        .reindex(input_index)

    return return_msg_df
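
`format_message_to_df` only creates columns for categories that actually occur in the reply, and it accepts any label the model returns. A stricter variant could validate the reply against the allowed categories before building the indicator matrix. The sketch below uses plain Python: `parse_reply` is a hypothetical helper, the reply string is made up, and dropping unknown labels is just one possible policy.

```python
import json


def parse_reply(return_msg_str, categories):
    """Convert a JSON reply into {indication: {category: 0/1}} plus a list of unknown labels."""
    reply = json.loads(return_msg_str)
    allowed = set(categories)
    indicators, unknown = {}, []
    for indication, predicted in reply.items():
        predicted = set(predicted)
        # Collect labels outside the allowed categories instead of creating columns for them
        unknown.extend(sorted(predicted - allowed))
        indicators[indication] = {cat: int(cat in predicted) for cat in categories}
    return indicators, unknown


categories = ["Urinary", "Respiratory", "Uncertainty"]
reply = '{"uti": ["Urinary"], "?chest": ["Respiratory", "Uncertainty"], "fall": ["Trauma"], "n/a": []}'
indicators, unknown = parse_reply(reply, categories)
print(indicators["?chest"])
print(unknown)
```

The nested dict can be turned into a DataFrame with `pd.DataFrame.from_dict(indicators, orient="index")`, which then has one column per allowed category even if the model never used it.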

### Calculate Performance

Rename columns and sort order for the true labels

In [11]:
# Score every category except "Not Informative"
# (add "No Specific Source" to the exclusion list to drop it as well)
metric_categories = [category for category in labels_pretty if category not in ["Not Informative"]]

def score_response(pred_y, true_y, metric_categories, index_key=None):
    # Reindex (rearrange) if specified
    if index_key:
        # Set index
        true_y = true_y.set_index(index_key)
        pred_y = pred_y.set_index(index_key)
        # Rearrange
        pred_y = pred_y.reindex(true_y.index)

    # --- Get true labels
    y_test_pred = pred_y[metric_categories]
    y_test_true = true_y[metric_categories]
    # --- Calculate per-class metrics (F1 score)
    scores_per_class = {}
    scores_per_class["F1-Score"] = f1_score(y_true=y_test_true, y_pred=y_test_pred, average=None)

    scores_per_class = pd.DataFrame.from_dict(scores_per_class,orient='index', columns=metric_categories)

    pd.set_option('display.precision', 2)

    # --- Calculate the overall average (support-weighted F1 score)
    scores_average = {}
    averaging_method = "weighted"
    scores_average["F1-Score"] = f1_score(y_true=y_test_true, y_pred=y_test_pred, average=averaging_method)
    return scores_average, scores_per_class
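
With `average="weighted"`, scikit-learn computes one F1 score per class and averages them weighted by each class's support, i.e. its number of true labels. The sketch below reproduces that calculation in plain Python on a made-up two-class example. `f1_binary` and `weighted_f1` are hypothetical helpers for illustration, not part of the notebook.

```python
def f1_binary(y_true, y_pred):
    """F1 score for one class, given 0/1 indicator lists."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def weighted_f1(true_columns, pred_columns):
    """Support-weighted mean of per-class F1 scores."""
    supports = [sum(column) for column in true_columns]
    scores = [f1_binary(t, p) for t, p in zip(true_columns, pred_columns)]
    total = sum(supports)
    return sum(s * w for s, w in zip(scores, supports)) / total if total else 0.0


# Made-up labels: one list per class, one entry per indication
true_columns = [[1, 1, 1, 0], [0, 0, 0, 1]]
pred_columns = [[1, 1, 0, 0], [0, 1, 0, 1]]

per_class = [f1_binary(t, p) for t, p in zip(true_columns, pred_columns)]
print(per_class)                                  # 0.8 and 2/3
print(weighted_f1(true_columns, pred_columns))    # (3 * 0.8 + 1 * 2/3) / 4
```

A consequence of this weighting is that classes with few true labels contribute little to the overall score, so the per-class table remains the better place to spot weak categories.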

## Run for the entire dataset
Use the functions defined above to run the model on the whole dataset, processing it in chunks.
Truncated replies are handled by reducing the chunk size and retrying the same chunk.

In [12]:
# Training data for code development
train_pretty = train_df.rename(columns=labels2labels_pretty)[:200]
train_pretty_indication = train_pretty.Indication


In [13]:
def run_completion(input_indications, categories, model_id, chunksize=100, reduce_usage=True):
    # Reduce the number of input indications if specified
    indications_orig = input_indications.copy()
    if reduce_usage:
        # Reassign instead of modifying in place, so the caller's Series stays untouched
        input_indications = input_indications.drop_duplicates()
    
    # Variables to store the results & precompute values for the while loop
    input_df_length = len(input_indications)

    cursor = 0
    tmp_prediction_df_list = []
    prediction_metadata_list = []

    # Setup progress bar
    with tqdm(total=input_df_length) as p_bar:
        # Start batch processing
        while cursor < input_df_length:
            cursor_end = min(cursor+chunksize, input_df_length)

            logging.info(f"Processing chunk {cursor}:{cursor_end} of {input_df_length}")

            # Subset the dataset
            chunk_indications = input_indications.iloc[cursor:cursor_end]

            # Render the templates
            prompt_user = prompt_user_template.render(indications=chunk_indications, categories=categories)
            prompt_system = prompt_system_template.render(categories=categories)

            # Request the completion
            chat_completion_message, chat_completion_metadata, chat_completion_usage = request_completion(
                client=client,
                system_prompt=prompt_system, 
                user_prompt=prompt_user, 
                model_openai_id=model_id, 
                max_tokens=None)
            
            # Check if output is truncated, reduce the maximum chunksize and rerun
            if chat_completion_metadata["finish_reason"] != "stop":
                chunksize = chunksize - 10
                logging.warning(f"Maximum chunksize has been reduced to {chunksize}")

                if chunksize <=0:
                    logging.error(f"The chunksize {chunksize} is not reachable. Please investigate the input")
                    break
                continue

            # Save the results and metadata
            chat_completion_metadata["chunk_start"] = cursor
            chat_completion_metadata["chunk_end"] = cursor_end

            tmp_prediction_df_list.append(format_message_to_df(chat_completion_message))
            prediction_metadata_list.append(chat_completion_metadata)
            
            # Show usage and continue to the next chunk
            logging.info(f"Usage: {chat_completion_usage}")
            p_bar.update(cursor_end - cursor)  # Count only the items actually processed
            cursor = cursor_end

    # Combine the results
    prediction_df = (
        # Restore the original order
        pd.concat(tmp_prediction_df_list)
        .reindex(indications_orig)
        .reset_index()
    )

    prediction_metadata_df = pd.DataFrame(prediction_metadata_list)

    return prediction_df, prediction_metadata_df

#run_completion(train_pretty_indication, labels_pretty, model_id=model_openai_id, chunksize=100, reduce_usage=True)
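
The control flow of `run_completion` (process a chunk, shrink the chunk size when the reply is cut off, otherwise advance the cursor) can be isolated from the API call. The sketch below is a simplified stand-alone version. `process_in_chunks` and the `process` callback, which returns `None` to signal a truncated reply, are hypothetical and stand in for the request and parsing steps. Unlike the notebook, the sketch raises an error instead of breaking out of the loop when the chunk size is exhausted.

```python
def process_in_chunks(items, process, chunksize=100, shrink_by=10):
    """Apply `process` to successive chunks, shrinking the chunk size after a failure."""
    results, cursor = [], 0
    while cursor < len(items):
        cursor_end = min(cursor + chunksize, len(items))
        outcome = process(items[cursor:cursor_end])
        if outcome is None:  # treated as "reply was truncated"
            chunksize -= shrink_by
            if chunksize <= 0:
                raise RuntimeError("Chunk size exhausted; inspect the input.")
            continue  # retry from the same cursor with a smaller chunk
        results.append(outcome)
        cursor = cursor_end  # advance by the number of items actually processed
    return results


# Toy callback: pretend replies are truncated for chunks longer than 25 items
def fake_process(chunk):
    return None if len(chunk) > 25 else len(chunk)


sizes = process_in_chunks(list(range(60)), fake_process, chunksize=40, shrink_by=10)
print(sizes)  # [20, 20, 20]
```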

In [14]:
locations_data = {
    "Oxford": test_oxford_df,
    "Banbury": test_banbury_df
}

for location, data in locations_data.items():
    # Run completion/prediction
    prediction_df, prediction_metadata_df = run_completion(
        data.Indication, 
        labels_pretty,
        model_id=model_openai_id,
        chunksize=100, 
        reduce_usage=True
        )

    # Write the results to file
    prediction_df.to_csv(
        export_path / f"predictions_zs_{location}.csv"
    )

    prediction_metadata_df.to_csv(
        export_path / f"prediction_metadata_zs_{location}.csv"
    )

    # Check for discrepancies between datasets
    print("Not in predictions:", set(data.Indication) - set(prediction_df.Indication))
    print("Not in train:", set(prediction_df.Indication) - set(data.Indication))

    # Calculate the scores
    print(score_response(prediction_df.fillna(0), data, labels_pretty))

  0%|          | 0/836 [00:00<?, ?it/s]

INFO:root:Processing chunk 0:100 of 836
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Usage: {'completion_tokens': 1026, 'prompt_tokens': 964, 'total_tokens': 1990}
INFO:root:Processing chunk 100:200 of 836
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Usage: {'completion_tokens': 1110, 'prompt_tokens': 973, 'total_tokens': 2083}
INFO:root:Processing chunk 200:300 of 836
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Usage: {'completion_tokens': 1101, 'prompt_tokens': 1008, 'total_tokens': 2109}
INFO:root:Processing chunk 300:400 of 836
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Usage: {'completion_tokens': 1051, 'prompt_tokens': 967, 'total_tokens': 2018}
INFO:root:Processing chunk 400:500 of 836
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completion

Not in predictions: set()
Not in train: set()
({'F1-Score': 0.7071417100148766},           Urinary  Respiratory  Abdominal  Neurological  Skin Soft Tissue  \
F1-Score     0.98         0.96       0.83          0.88              0.87   

           ENT  Orthopaedic  Other Specific  No Specific Source  Prophylaxis  \
F1-Score  0.79         0.87             0.3                0.34         0.94   

          Uncertainty  Not Informative  
F1-Score         0.78              0.3  )


  0%|          | 0/587 [00:00<?, ?it/s]

INFO:root:Processing chunk 0:100 of 587
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Usage: {'completion_tokens': 1134, 'prompt_tokens': 967, 'total_tokens': 2101}
INFO:root:Processing chunk 100:200 of 587
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Usage: {'completion_tokens': 1183, 'prompt_tokens': 997, 'total_tokens': 2180}
INFO:root:Processing chunk 200:300 of 587
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Usage: {'completion_tokens': 1216, 'prompt_tokens': 992, 'total_tokens': 2208}
INFO:root:Processing chunk 300:400 of 587
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:root:Usage: {'completion_tokens': 1180, 'prompt_tokens': 998, 'total_tokens': 2178}
INFO:root:Processing chunk 400:500 of 587
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions

Not in predictions: set()
Not in train: set()
({'F1-Score': 0.860384622112036},           Urinary  Respiratory  Abdominal  Neurological  Skin Soft Tissue  \
F1-Score     0.99          1.0       0.85           1.0              0.96   

           ENT  Orthopaedic  Other Specific  No Specific Source  Prophylaxis  \
F1-Score  0.88         0.91            0.25                 0.6         0.96   

          Uncertainty  Not Informative  
F1-Score         0.88             0.59  )
