<center><a href="https://www.pieriantraining.com/" ><img src="PTCenteredPurple.png" alt="Pierian Training Logo" /></a></center>


# LLM Fine Tuning

In this notebook we'll walk through the process of fine-tuning one of OpenAI's advanced language models on the [MTSamples](https://mtsamples.com/) dataset. The objective is to enhance the model's proficiency in understanding and generating medical content, making it a valuable tool for healthcare professionals, researchers, and students.

Large language models have shown remarkable abilities to understand, generate, and even creatively engage with a wide range of topics. However, when it comes to medical data and other specialized fields, their proficiency can sometimes be less than optimal due to a variety of reasons:

1. Training Data:<br />
LLMs are trained on an enormous amount of data from the internet. However, partly because of privacy regulations such as HIPAA, high-quality medical content makes up only a small share of that data compared to other topics. As a result, LLMs may not have been exposed to as much specialized medical knowledge during training.

2. Complexity and Specificity:<br />
Medical data and literature often contain highly specialized terms, concepts, and relationships that are complex. Proficiency in this domain requires not only an understanding of the terms but also the context in which they are used. LLMs can sometimes misinterpret or oversimplify these intricate concepts.

3. Generalization vs. Specialization:<br />
Large Language Models are designed to be generalists, capable of addressing a wide range of topics. While they can generate information on many subjects, they might not always match the depth and accuracy of a model or system specifically designed and trained for medical data.


**To this end, the goal of this lecture is to fine-tune gpt-3.5-turbo on medical reports so that it can classify each report by its underlying medical specialty!**

## Objectives:

1. **Exploring the MTSamples Dataset:**<br />
    - We'll begin by exploring the dataset and its structure.
2. **Preprocessing:** 
    - To prepare for the fine-tuning process, we need to convert the dataset into the chat format that the fine-tuning API expects.
3. **Fine-Tuning Process:** 
    - A step-by-step guide to fine-tuning the model, including setting up the right hyperparameters.
4. **Evaluation:** 
    - After training, we'll evaluate our fine-tuned model's performance on medical transcriptions.
5. **Applications & Use-Cases:** 
    - Brief insights into the myriad of ways this fine-tuned model can be utilized in the healthcare domain.



# Exploring the MTSamples Dataset

The dataset was originally obtained from [Kaggle](https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions/).
Note that we have already removed all unnecessary columns.


In [2]:
import pandas as pd

In [3]:
medical_reports = pd.read_csv("reports.csv")

### Inspect the dataset

In [None]:
medical_reports.head()

We can see that the dataset consists of the patient's report and the corresponding medical specialty.

### Preprocessing
Let's check the dataset info

In [None]:
medical_reports.info()

We can see that the number of non-null `medical_specialty` entries differs from the number of non-null `report` entries, so some reports are missing. Let's remove the rows with missing reports.

In [None]:
# Dropping rows where 'report' is missing
medical_reports.dropna(subset=['report'], inplace=True)
medical_reports.info()

In [10]:
# Fill the NA values
# medical_reports.fillna()

In [None]:
grouped_data = medical_reports.groupby("medical_specialty").sample(110, random_state=42) # Sample 110 items from each class
grouped_data['medical_specialty'].value_counts()

### Train-Test Split
Before we inspect the dataset in more detail, let's first create the train-val-test split.
We select 5 samples from each class for the validation data and another 5 from each class for the test data.

In [8]:
grouped_data = medical_reports.groupby("medical_specialty").sample(110, random_state=42) # Sample 110 items from each class

val_test_data = grouped_data.groupby("medical_specialty").sample(10, random_state=42)  # sample 10 items from the above data
val = val_test_data.groupby("medical_specialty").head(5) # Take the first 5 of each class
test = val_test_data.groupby("medical_specialty").tail(5) # Take the last 5 of each class

train = grouped_data[~grouped_data.index.isin(val_test_data.index)] # Take the remaining ones for training
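
The bookkeeping behind this split can also be sketched without pandas, which makes it easier to see that the three subsets never overlap. This is a minimal illustration on made-up records; `stratified_split` and the toy data are hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(records, label_key, n_val, n_test, seed=42):
    """Split records into train/val/test with a fixed number of val/test items per class."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label][:]      # copy so the input stays untouched
        rng.shuffle(items)
        val.extend(items[:n_val])
        test.extend(items[n_val:n_val + n_test])
        train.extend(items[n_val + n_test:])
    return train, val, test

# Toy data: 2 classes with 110 reports each (mirrors the per-class sampling above).
toy = [{"report": f"report {i}", "medical_specialty": label}
       for label in ("Surgery", "Neurology") for i in range(110)]
train_toy, val_toy, test_toy = stratified_split(toy, "medical_specialty", n_val=5, n_test=5)
print(len(train_toy), len(val_toy), len(test_toy))  # 200 10 10
```

Each class contributes 5 validation and 5 test records, and the remaining 100 per class go to training.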



### Dataset Statistics
Let's explore the dataset to provide some basic statistics

In [None]:
# 1. Number of unique medical specialties
print(f"Number of unique medical specialties: {train['medical_specialty'].nunique()}")

# 2. Distribution of reports across different medical specialties
print("\nDistribution of reports across medical specialties:")
print(train['medical_specialty'].value_counts())


In [10]:
# 3. Average, minimum, and maximum report length (in tokens, not words).
# This is important due to token limitations and also to estimate the price.
# We count tokens with the cl100k_base encoding, which gpt-3.5-turbo uses
import tiktoken

def num_tokens_from_string(string: str) -> int:
    """Returns the number of tokens in a text string.
    (https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken)"""
    encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo and gpt-4
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [None]:
text = """Para la crema pastelera:250 mililitros de leche entera,75 gramos de azúcar,2 yemas de huevo M,25 gramos de maicena (harina de maíz refinada),
Para la tarta:1 lámina de hojaldre,2 manzanas,10 gramos de azúcar moreno,2 cucharadas de mermeladas de albaricoque
Comenzamos por la crema pastelera, ya que necesitamos que esté fría antes de montar la tarta. Para ello, calentamos en un cazo 250 ml de leche entera con 40 gramos de azúcar, removiendo para que se disuelva el azúcar.
Mientras, mezclamos en un cuenco pequeño 2 yemas de huevo M con el resto del azúcar (35 g) y cuando esté mezclado agregamos 25 gramos de maicena. Vertemos un poco de la leche caliente, mezclamos bien y vertemos de nuevo en el cazo, removiendo para que espese. 
Tapamos con film transparente tocando la crema (para que no se seque) y dejamos enfriar completamente.
Estiramos 1 lámina de masa de hojaldre sobre el molde elegido para la tarta y forrado con papel de horno y pinchamos la base con un tenedor. Desechamos los sobrantes de masa y reservamos en la nevera mientras precalentamos el horno a 200°C.
Lavamos 2 manzanas, desechamos el corazón y las cortamos finas en láminas.
Repartimos la crema pastelera fría sobre la base. Cubrimos con las manzanas cortadas. Espolvoreamos 10 gramos de azúcar moreno por encima.
Horneamos la tarta de manzana 20 minutos, retiramos del horno y en caliente, pintamos la superficie con 2 cucharadas de mermelada de albaricoque. Dejamos reposar 10 minutos, desmoldamos y dejamos enfriar del todo encima de una rejilla.
¡Y listo! Ya solo queda disfrutar de esta deliciosa tarta de manzana con crema pastelera."""


a = num_tokens_from_string(text)
print(f"tokens: {a}")

In [None]:
report_lengths = train['report'].apply(num_tokens_from_string)
report_lengths.describe()

In [None]:
report_lengths = train['report'].apply(num_tokens_from_string)
avg_report_length = report_lengths.mean()
min_report_length = report_lengths.min()
max_report_length = report_lengths.max()
report_length_sum = report_lengths.sum()

print(f"Average report length: {avg_report_length:.2f} tokens")
print(f"Minimum report length: {min_report_length} tokens")
print(f"Maximum report length: {max_report_length} tokens")
print(f"# The training dataset consists of: {report_length_sum} tokens")


In [None]:
type(report_length_sum)

In [None]:
price_model = 8.000   # Price for gpt-3.5-turbo per 1M tokens
model = "gpt-3.5-turbo"
price_per_epoch = (report_length_sum / 1000000) * price_model 
print(f"Fine-tuning of {model} costs ~ ${price_per_epoch:.2f} per epoch") 

In [None]:
price_model = 0.0080   # Price for gpt-3.5-turbo per 1K tokens
model = "gpt-3.5-turbo"
price_per_epoch = (report_length_sum / 1000) * price_model 
print(f"Fine-tuning of {model} costs ~ ${price_per_epoch:.2f} per epoch") 
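
The arithmetic in the two cells above extends naturally to several epochs, since the training tokens are processed once per epoch. The sketch below is only a rough estimate: the function name and the token count are hypothetical, the price of $8 per 1M tokens is copied from the cell above and may have changed, and, like the cells above, it ignores the tokens of the system prompt and the labels.

```python
def estimate_fine_tuning_cost(total_tokens: int, price_per_million: float, n_epochs: int = 1) -> float:
    """Estimated training cost in dollars; tokens are counted once per epoch."""
    return total_tokens / 1_000_000 * price_per_million * n_epochs

# Hypothetical token count, for illustration only.
example_tokens = 250_000
for epochs in (1, 3):
    cost = estimate_fine_tuning_cost(example_tokens, price_per_million=8.0, n_epochs=epochs)
    print(f"{epochs} epoch(s): ~${cost:.2f}")
```

In the notebook, `report_length_sum` would take the place of `example_tokens`.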

In [None]:
train['medical_specialty'].unique()

### Fine-tuning data formatting

We can now rearrange the dataset into the necessary format in order to start the fine tuning job.
The format is as follows:

```json
{"messages": [{"role": "system", "content": "Given the medical description report, classify it into one of these categories: [Cardiovascular / Pulmonary, Gastroenterology, Neurology, Radiology, Surgery]"}, {"role": "user", "content": "Medical Report"}, {"role": "assistant", "content": "The medical specialty assigned to this report"}]}
```

In [18]:
system_prompt = "Given the medical description report, classify it into one of these categories: " + \
                 "[Cardiovascular / Pulmonary, Gastroenterology, Neurology, Radiology, Surgery]"


# print(system_prompt)

In [None]:
for x in range(0,5):
    print(train["report"].iloc[x])

In [20]:
sample_prompt = {"messages": [{"role": "system", "content": system_prompt},
                              {"role": "user", "content": train["report"].iloc[0]},
                              {"role": "assistant", "content": train["medical_specialty"].iloc[0]}]}


In [None]:
print(sample_prompt)

Let's write a function that converts the dataframe into this format so that we can store everything as a JSON Lines (`.jsonl`) file.

In [22]:
def df_to_format(df):
    formatted_data = []
    
    # Iterate over each row in the dataframe
    for index, row in df.iterrows():
        entry = {"messages": [{"role": "system", "content": system_prompt},
                              {"role": "user", "content": row["report"]},
                              {"role": "assistant", "content": row["medical_specialty"]}]}

        formatted_data.append(entry)

    return formatted_data


In [23]:
data = df_to_format(train)

In [None]:
print(data[1])

Let's dump this list of dictionaries into the training file

In [48]:
import json
with open('fine_tuning_data.jsonl', 'w') as f:
    for entry in data:
        f.write(json.dumps(entry))
        f.write("\n")
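
JSON Lines stores one JSON object per line, which is why each entry is serialized separately instead of dumping the whole list at once. The following round-trip sketch uses an in-memory buffer rather than a file; `write_jsonl` and `read_jsonl` are hypothetical helper names, and the entries are made up.

```python
import io
import json

def write_jsonl(entries, handle):
    """Write one JSON object per line."""
    for entry in entries:
        handle.write(json.dumps(entry) + "\n")

def read_jsonl(handle):
    """Parse every non-empty line as a separate JSON object."""
    return [json.loads(line) for line in handle if line.strip()]

entries = [
    {"messages": [{"role": "user", "content": "line one\nline two"}]},
    {"messages": [{"role": "user", "content": "another report"}]},
]

buffer = io.StringIO()
write_jsonl(entries, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
print(restored == entries)  # True
```

Newlines inside a report are escaped by `json.dumps`, so every record still occupies exactly one line.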


### Val Data
Let's perform the same operation for the validation data

In [49]:
val_data = df_to_format(val)

In [50]:
import json
with open('fine_tuning_data_val.jsonl', 'w') as f:
    for entry in val_data:
        f.write(json.dumps(entry))
        f.write("\n")
        
        
    


## Sanity Checks
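Before starting the training process, we should check that no training example exceeds the limit of 4096 tokens per example (the check below uses a slightly more conservative threshold of 4000). Additionally, let's make sure that there are no empty reports or labels.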

In [20]:
def check_num_tokens(prompt):
    # Concatenate the system, user, and assistant messages of this prompt
    prompt_text = " ".join([content["content"] for content in prompt["messages"]])
    tokens = num_tokens_from_string(prompt_text)
    if tokens > 4000: # according to https://platform.openai.com/docs/guides/fine-tuning/token-limits
        print(f"Prompt {prompt} exceeds token limit!")
        return False
    return True
    
def check_prompt(prompt):
    
    if len(prompt["messages"][1]["content"]):
        if len(prompt["messages"][2]["content"]):
            return True
    print(f"Prompt {prompt} is missing data!")

    return False


We can now read the jsonl file and check each entry

In [21]:
with open('fine_tuning_data.jsonl', 'r') as f:
    dataset = [json.loads(line) for line in f]


In [22]:
for element in dataset:
    assert check_num_tokens(element) and check_prompt(element)
        

In [23]:
with open('fine_tuning_data_val.jsonl', 'r') as f:
    dataset = [json.loads(line) for line in f]

for element in dataset:
    assert check_num_tokens(element) and check_prompt(element)
        

Great! There are no violations!
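
Beyond token counts and empty fields, it can also help to confirm that every record has the expected system/user/assistant structure and that the label is one of the five categories. The sketch below is one possible way to do this; `validate_record` is a hypothetical helper, and the label set is copied from the system prompt.

```python
ALLOWED_LABELS = {
    "Cardiovascular / Pulmonary", "Gastroenterology", "Neurology", "Radiology", "Surgery",
}
EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_record(record):
    """Return a list of problems found in one chat-format record (empty list means OK)."""
    messages = record.get("messages")
    if not isinstance(messages, list):
        return ["missing 'messages' list"]

    problems = []
    roles = [message.get("role") for message in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")

    for message in messages:
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"empty content for role {message.get('role')!r}")

    if roles == EXPECTED_ROLES:
        label = messages[2].get("content")
        if isinstance(label, str) and label.strip() and label.strip() not in ALLOWED_LABELS:
            problems.append(f"unknown label: {label!r}")
    return problems

good = {"messages": [
    {"role": "system", "content": "Classify the report."},
    {"role": "user", "content": "Some report text"},
    {"role": "assistant", "content": "Surgery"},
]}
bad = {"messages": good["messages"][:2] + [{"role": "assistant", "content": "Cardiology"}]}

print(validate_record(good))  # []
print(validate_record(bad))   # ["unknown label: 'Cardiology'"]
```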

## Training
Now it's time to start the training process

In [None]:
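import os

OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')

# Avoid printing the key itself; only confirm that it was found
print("*" * 10)
print("OPENAI_API_KEY is set" if OPENAI_API_KEY else "OPENAI_API_KEY is missing")
print("*" * 10)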

In [4]:
import os
import openai
openai.api_key = OPENAI_API_KEY
# openai.organization = "org-FWKNS6AR0jCIQjL9hXW7WLCS"

First, we need to upload the fine-tuning data to OpenAI using the **File.create** [method](https://platform.openai.com/docs/api-reference/files/create), to which you pass the binary file object and a purpose (`fine-tune` in our case). If you pass `fine-tune` as the purpose, OpenAI validates the file structure once more.

In [7]:

file_upload_response = openai.File.create(
  file=open("fine_tuning_data.jsonl", "rb"),
  purpose='fine-tune'
)


In [None]:
file_upload_response

Processing the uploaded file might take a while.
You can navigate to https://platform.openai.com/files to check whether your file has been processed.
Alternatively, you can use **File.retrieve(file_id)**.


In [None]:
openai.File.retrieve(file_upload_response["id"])

Perform the same steps for the val data

In [156]:
file_upload_response_val = openai.File.create(
  file=open("fine_tuning_data_val.jsonl", "rb"),
  purpose='fine-tune'
)


In [None]:
openai.File.retrieve(file_upload_response_val["id"])

Now it's time to start the [training process](https://platform.openai.com/docs/api-reference/fine-tuning/create):

To start the training routine, we can call FineTuningJob.create. Its main arguments are:
- model
- training_file
- validation_file
- hyperparameters
- suffix

Only *model* and *training_file* are required; the remaining arguments are optional.
You can specify the number of epochs using the *hyperparameters* argument. At the time this notebook was written, *n_epochs* was the only hyperparameter available.

The call returns a job object that contains, among others, the following fields:
- object
- id
- model
- created_at
- fine_tuned_model
- organization_id
- result_files
- status
- validation_file
- training_file


In [158]:
fine_tuning_response = openai.FineTuningJob.create(training_file=file_upload_response["id"],
                            model="gpt-3.5-turbo",
                            hyperparameters={"n_epochs": 1},
                            validation_file = file_upload_response_val["id"])

To obtain the log, you can use *FineTuningJob.list_events*, to which you pass the job id and, optionally, a limit on the number of events to return.

In [None]:
fine_tuning_response["id"]

In [None]:
openai.FineTuningJob.list_events(id="ftjob-0ZB6FD70DnweK1F6euj03SNg", limit=10)
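
Fine-tuning runs asynchronously, so a script usually has to wait until the job's *status* reaches a final value before the model can be used. The sketch below polls through an injected `retrieve` callable so that it can be tried without network access; in this notebook one could pass something like `lambda: openai.FineTuningJob.retrieve(job_id)`. The function name is hypothetical, and the final status names used here (`succeeded`, `failed`, `cancelled`) are an assumption that should be checked against the API reference.

```python
import time

# Assumed names of the final job states; verify against the API reference.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, poll_seconds=30.0, max_polls=1000):
    """Call `retrieve()` repeatedly until the returned job has a terminal status."""
    for _ in range(max_polls):
        job = retrieve()
        if job["status"] in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")

# Offline demo with a fake sequence of job states.
fake_states = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "succeeded", "fine_tuned_model": "ft:example"},
])
final_job = wait_for_job(lambda: next(fake_states), poll_seconds=0)
print(final_job["status"])  # succeeded
```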


### Plot losses
We can use *FineTuningJob.list_events* to obtain the event data and plot the training metrics.
Note that if you do not pass a limit, OpenAI returns only a limited number of events rather than all of them. Thus, it's best to pass a large limit.

In [26]:
train_event = openai.FineTuningJob.list_events(id="ftjob-0ZB6FD70DnweK1F6euj03SNg", limit=500)

In [27]:
train_loss = []
val_loss = []
train_acc = []
val_acc = []
for item in train_event["data"]:
    train_data = item["data"]
    if train_data and "train_loss" in train_data:
        
        # The event list returns the most recent event first, so we insert at the front instead of appending
        train_loss.insert(0, train_data["train_loss"])
        val_loss.insert(0, train_data["valid_loss"])
        train_acc.insert(0, train_data["train_mean_token_accuracy"])
        val_acc.insert(0, train_data["valid_mean_token_accuracy"])


In [None]:
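import matplotlib.pyplot as plt
plt.figure()
plt.plot(train_loss, label="train loss")
plt.plot(val_loss, label="validation loss")
plt.xlabel("logged step")
plt.ylabel("loss")
plt.legend()
plt.show()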

## Application
To use the fine-tuned model, we just need to pass its name as the *model* argument of the chat completions call and proceed as usual.
You can find the model name either in the OpenAI [fine-tuning dashboard](https://platform.openai.com/finetune/) or via *openai.FineTuningJob.retrieve(id)*.

In [None]:
openai.FineTuningJob.retrieve("ftjob-0ZB6FD70DnweK1F6euj03SNg")["fine_tuned_model"]

In [None]:
train.iloc[1]

In [66]:
test_medical_specialty = train["medical_specialty"].iloc[1]
test_report = train["report"].iloc[1]
print(test_medical_specialty)
print(test_report)

Cardiovascular / Pulmonary
PREOPERATIVE DIAGNOSES:,1.  Non-small-cell carcinoma of the left upper lobe.,2.  History of lymphoma in remission.,POSTOPERATIVE DIAGNOSES:,1.  Non-small-cell carcinoma of the left upper lobe.,2.  History of lymphoma in remission.,PROCEDURE: , Left muscle sparing mini thoracotomy with left upper lobectomy and mediastinal lymph node dissection.  Intercostal nerve block for postoperative pain relief at five levels.,INDICATIONS FOR THE PROCEDURE: , This is an 84-year-old lady who was referred by Dr. A for treatment of her left upper lobe carcinoma.  The patient has a history of lymphoma and is in remission.  An enlarged right axillary lymph node was biopsied recently and was negative for lymphoma.  A mass in the left upper lobe was biopsied with fine-needle aspiration and shown to be a primary non-small-cell carcinoma of the lung.  PET scan was, otherwise, negative for spread and resection was advised.  All the risk and benefits were fully explained to the patie

In [31]:

from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)


# Define the system prompt
system_prompt = "Given the medical description report, classify it into one of these categories: " + \
              "[Cardiovascular / Pulmonary, Gastroenterology, Neurology, Radiology, Surgery]"

In [None]:
completion = client.chat.completions.create(
#   model="gpt-4o-mini-2024-07-18",
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": test_report}
  ]
)

print(completion.choices[0].message.content)

In [61]:
completion = client.chat.completions.create(
#   model="gpt-4o-mini-2024-07-18",
  model="ft:gpt-3.5-turbo-0125:kimuia55::A5njEKO6",
  messages=[
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": test_report}
  ]
)

print(completion.choices[0].message.content)

Surgery


In [None]:
completion = openai.ChatCompletion.create(
    model="ft:gpt-3.5-turbo-0613:pierian-training::8IjdNdPL",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": train["report"].iloc[1]}
    ]
)
print(completion.choices[0].message["content"])


In [None]:
test["medical_specialty"].iloc[1]

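Let's loop over the reports and count how many are classified correctly. Note that the cells below iterate over `train`; to measure performance on unseen data, iterate over the held-out `test` split instead.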

In [None]:
pip install openai==0.28

In [52]:
import openai

def classify_report(report, model):
    # Send the report to the given model together with the classification system prompt
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": report}
        ]
    )
    return completion

In [54]:
predicted_classes = []
ground_truth_classes = []
model="gpt-3.5-turbo"
# model = "ft:gpt-3.5-turbo-0613:pierian-training::8IjdNdPL"

for line in train.iterrows():
    report, specialty = line[1]["report"], line[1]["medical_specialty"]
    ground_truth_classes.append(specialty.strip())  # in case of any trailing whitespace
    prediction = classify_report(report,model)
    predicted_classes.append(prediction.choices[0].message.content.strip())
    

    
    

In [55]:
import numpy as np

In [56]:
(np.array(predicted_classes) == np.array(ground_truth_classes)).mean()  #accuracy

np.float64(0.5)

In [65]:
predicted_classes = []
ground_truth_classes = []
# model="gpt-3.5-turbo"
model="ft:gpt-3.5-turbo-0125:kimuia55::A5njEKO6"

for line in train.iterrows():
    report, specialty = line[1]["report"], line[1]["medical_specialty"]
    ground_truth_classes.append(specialty.strip())  # in case of any trailing whitespace
    prediction = classify_report(report,model)
    # prediction = classify_report(test_report,model)
    predicted_classes.append(prediction.choices[0].message.content.strip())
    

print(ground_truth_classes)    
    

BadRequestError: Error code: 400 - {'error': {'message': "We could not parse the JSON body of your request. (HINT: This likely means you aren't using your HTTP library correctly. The OpenAI API expects a JSON payload, but what was sent was not valid JSON. If you have trouble figuring out how to fix this, please contact us through our help center at help.openai.com.)", 'type': 'invalid_request_error', 'param': None, 'code': None}}

In [57]:
predicted_classes

['Surgery',
 'Surgery',
 'Cardiovascular / Pulmonary',
 'Cardiovascular',
 'Cardiovascular',
 'This medical description report falls under the category of Pulmonary Medicine.',
 'Cardiovascular',
 'Cardiovascular / Pulmonary',
 'Surgery',
 'This medical description report falls under the category of **Surgery**.',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Pulmonary',
 'Cardiovascular / Pulmonary',
 'Pulmonary',
 'Surgery',
 'Cardiovascular / Pulmonary',
 'Cardiovascular',
 'Cardiovascular',
 'This medical description report falls under the category of Pulmonary.',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'This medical report falls under the category of **Surgery**.',
 'Category: Pulmonology',
 'This medical description report falls under the category of Pulmonology.',
 '[Surgery]',
 'This medical description report falls under the category of Cardiovascular / Pulmonary.',
 'Cardiovascular',
 'Surgery',
 'The given medical report falls under t

In [58]:
ground_truth_classes

['Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardiovascular / Pulmonary',
 'Cardio
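
The predictions listed above show that the base model often wraps the label in a sentence or in extra punctuation, so an exact string comparison counts those answers as wrong. One possible way to handle this before scoring is to map each raw answer to a known label only when exactly one label is mentioned. This is a sketch and not part of the original evaluation; `normalize_prediction` is a hypothetical helper, and the label list is copied from the system prompt.

```python
LABELS = ["Cardiovascular / Pulmonary", "Gastroenterology", "Neurology", "Radiology", "Surgery"]

def normalize_prediction(text, labels=LABELS):
    """Return the single label mentioned in `text`, or None if zero or several labels match."""
    lowered = text.lower()
    matches = [label for label in labels if label.lower() in lowered]
    return matches[0] if len(matches) == 1 else None

examples = [
    "Surgery",
    "This medical description report falls under the category of **Surgery**.",
    "[Surgery]",
    "Cardiovascular",
    "Category: Pulmonology",
]
for raw in examples:
    print(repr(raw), "->", normalize_prediction(raw))
```

Partial answers such as `Cardiovascular` stay unmatched on purpose, which keeps the scoring strict about the full category name.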

### Comparison to gpt-3.5-turbo
Let's compare the performance of our fine-tuned model with that of the standard gpt-3.5-turbo model.

In [42]:
import time
from tqdm.notebook import tqdm

def classify_report_baseline(report, retries=1):
    # Retry if openai loses the connection; re-raise once the retries are used up
    for attempt in range(retries + 1):
        try:
            return openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": report}
                ],
                temperature=0
            )
        except openai.error.APIConnectionError:
            if attempt == retries:
                raise
            time.sleep(10)


In [None]:
predicted_classes = []
ground_truth_classes = []
for line in tqdm(test.iterrows()):
    report, specialty = line[1]["report"], line[1]["medical_specialty"]
    ground_truth_classes.append(specialty.strip())  # in case of any trailing whitespace
    prediction = classify_report_baseline(report)
    predicted_classes.append(prediction.choices[0].message["content"].strip())
    
    

In [None]:
(np.array(predicted_classes) == np.array(ground_truth_classes)).mean()
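
A single accuracy number can hide which specialties the model confuses, so it can be useful to break the score down per class as well. The sketch below does this with the standard library only; `per_class_accuracy` is a hypothetical helper, and the prediction and ground-truth lists are made up for illustration.

```python
from collections import Counter

def per_class_accuracy(predictions, ground_truth):
    """Fraction of correctly classified items for each ground-truth class."""
    total = Counter(ground_truth)
    correct = Counter(truth for pred, truth in zip(predictions, ground_truth) if pred == truth)
    return {label: correct[label] / total[label] for label in total}

# Made-up example data, not results from the notebook.
preds = ["Surgery", "Surgery", "Neurology", "Radiology"]
truth = ["Surgery", "Neurology", "Neurology", "Radiology"]
print(per_class_accuracy(preds, truth))
# {'Surgery': 1.0, 'Neurology': 0.5, 'Radiology': 1.0}
```

In the notebook, `predicted_classes` and `ground_truth_classes` would be passed in place of the made-up lists.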