Fine-tuning small LLMs for medical data extraction

Our goal is to tackle the technical and linguistic challenges of fine-tuning existing Large Language Models on unstructured medical data in Bulgarian, and to evaluate their performance in a local environment using a single metric.

Data imports

In [2]:
import pandas as pd  # data loading and transformation
import numpy as np   # numerical data manipulation

import os          # environment variables (CUDA set-up)
import torch       # runs the ML models
import deepl       # DeepL API client, used as the final translator
import re          # regular expressions for text clean-up
import subprocess  # trainer set-up
import json        # formatting model inputs/outputs
import datasets    # working with datasets
from transformers import (  # tokenizer, model, and pipeline set-up
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainerCallback,
    pipeline,
)

from typing import List  # type hints for lists

from peft import LoraConfig  # LoRA adapter configuration
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM, SFTConfig  # supervised fine-tuning (SFT)
from IPython.display import Markdown, display  # render notes as Markdown

# Lists, DataFrames, and datasets support different operations, so we convert between them where needed.

Data load

In [2]:
# Read the Excel file with pandas; the file must be in the same folder as the notebook. Only the "Positive" sheet is used.
df = pd.read_excel('dataset_praktikum_project.xlsx', sheet_name='Positive') 

Data transformation 

In [9]:
# Rename the columns to clearer names
df.rename(columns={'ID': 'PatientId', 'Text': 'Anamnesis_bg', 'Result': 'CorrectDiabetesDuration'}, inplace=True)
# Drop records whose anamnesis text is missing.
# This dataset contains no such records; the line is kept as a safeguard.
df = df.dropna(subset=['Anamnesis_bg'])

In [10]:
# Pre-handle number formats and medical abbreviations that might mislead the translator
def standardize_numbers(text):
    # "г.", "г", "год." or "год" after a number is replaced with "години" (years)
    text = re.sub(r"(\d+)\s*(год\.|год|г\.|г)(?=\s|$)", r"\1 години", text)
    # Ranges such as "5-6 години" become "5 до 6 години" (5 to 6 years)
    text = re.sub(r"(\d+)-(\d+)\s*години", r"\1 до \2 години", text)

    return text

# Medical abbreviations are replaced with their full form.
# The dictionary maps each abbreviation to its expansion; the function below applies it.
abbreviation_dict = {
    "АХ": "артериална хипертония",
    "ЗД": "захарен диабет",
    "ХБЗ": "хронично бъбречно заболяване",
    "ДК": "диабетна кетоацидоза",
    "ЗАХ": "захарен"
}

# Expand every abbreviation found in the text
def expand_abbreviations(text):
    for abbr, full_form in abbreviation_dict.items():
        # \b matches only stand-alone abbreviations, so parts of longer words are not replaced
        text = re.sub(fr"\b{re.escape(abbr)}\b", full_form, text)
    return text

# Apply both transformations to the anamnesis records
df["Anamnesis_bg"] = df["Anamnesis_bg"].apply(standardize_numbers).apply(expand_abbreviations)
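As a quick sanity check, the two helpers can be applied to a single string. The sketch below repeats the functions with a shortened dictionary so that it runs on its own; the sample record is made up for illustration and is not taken from the dataset.

```python
import re

# Same patterns as in the cell above, repeated so the sketch is self-contained
def standardize_numbers(text):
    text = re.sub(r"(\d+)\s*(год\.|год|г\.|г)(?=\s|$)", r"\1 години", text)
    text = re.sub(r"(\d+)-(\d+)\s*години", r"\1 до \2 години", text)
    return text

# Shortened dictionary, for illustration only
abbreviation_dict = {"АХ": "артериална хипертония", "ЗД": "захарен диабет"}

def expand_abbreviations(text):
    for abbr, full_form in abbreviation_dict.items():
        text = re.sub(fr"\b{re.escape(abbr)}\b", full_form, text)
    return text

sample = "ЗД от 10 г. АХ от 5-6 год."  # made-up record
cleaned = expand_abbreviations(standardize_numbers(sample))
print(cleaned)

# An abbreviation inside a longer word is left untouched
print(expand_abbreviations("ЗДРАВ"))
```

Running the helpers on a handful of real records in the same way is a cheap way to spot patterns the regular expressions miss before paying for translation.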

Bulgarian to English translation 

In [None]:
# Translate the anamnesis from Bulgarian to English with the DeepL API
# Documentation: https://developers.deepl.com/docs
# The API key is read from an environment variable instead of being hard-coded in the notebook.
# The variable name DEEPL_AUTH_KEY is our choice; set it before starting the kernel.
auth_key = os.environ["DEEPL_AUTH_KEY"]

# Create the translator object
translator = deepl.Translator(auth_key)

# Create a new field that contains the anamnesis in English. Each Anamnesis_bg value is passed to the translator together with the source and target languages.
# DeepL can detect the source language on its own, so source_lang is optional; we set it explicitly to avoid confusion with other Cyrillic-script or Slavic languages.
# translate_text returns a TextResult object; its .text attribute holds the translated string.
df['Anamnesis_en'] = df['Anamnesis_bg'].apply(
    lambda x: translator.translate_text(x, source_lang='BG', target_lang='EN-GB').text
    if isinstance(x, str) else x
)


Transformed version save

In [6]:
# Save the translated results so that the translator does not have to be rerun (translation takes a while)
df.to_csv('data_translated_cleaned_deepl.csv', index=False)  
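The save-then-reload pattern used here can be wrapped in a small helper that skips the expensive step whenever a cached file already exists. This is only a sketch: `load_or_compute` and `fake_translate` are hypothetical names, and JSON is used instead of CSV to keep the example free of third-party dependencies.

```python
import json
import os
import tempfile

def load_or_compute(cache_path, compute):
    """Return the cached result if cache_path exists; otherwise compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            return json.load(fh)
    result = compute()
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(result, fh, ensure_ascii=False)
    return result

calls = []

def fake_translate():
    # Stand-in for the slow translation step; records how often it is called
    calls.append(1)
    return ["diabetes for 12 years"]

cache_file = os.path.join(tempfile.mkdtemp(), "translated.json")
first = load_or_compute(cache_file, fake_translate)   # computes and writes the cache
second = load_or_compute(cache_file, fake_translate)  # reads the cache, no second call
print(first == second, len(calls))
```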

Transformed data load 

In [3]:
# If the translation has already been run, load the saved results instead of waiting for the translator again
positive_data = pd.read_csv(r"data_translated_cleaned_deepl.csv")
# Drop records that have no label or are labelled "Range", because we are looking for numeric outputs
filtered_data = positive_data[positive_data['CorrectDiabetesDuration'].str.strip().notna() & (positive_data['CorrectDiabetesDuration'].str.strip() != "Range")]

Preparation of data and environment 
We split the records into two parts: 70% for training and 30% for testing. First we check whether this split leaves a similar share of "hard" records in both parts. A hard record is one that is longer than the mean length and contains more special characters than the average record in the dataset.

In [4]:
# Ensure filtered_data is a copy if it's a slice
filtered_data = filtered_data.copy()

# Length of each record
filtered_data["length"] = filtered_data["Anamnesis_en"].apply(len)

# Special characters to be counted
special_characters = "!@#$%^&*()-_=+[]{};:'\",.<>?/\\|`~"

# Number of special characters in each record
filtered_data["special_char_count"] = filtered_data["Anamnesis_en"].apply(
    lambda x: sum(1 for char in x if char in special_characters)
)

longest_record = filtered_data["length"].max()
mean_length = filtered_data["length"].mean()
mean_special_count = filtered_data["special_char_count"].mean()

# A difficult record is longer than the mean and has more special characters than the mean.
# The result is stored in the is_difficult field.
filtered_data["is_difficult"] = (
    (filtered_data["length"] > mean_length) &
    (filtered_data["special_char_count"] > mean_special_count)
)

head_difficult_count = filtered_data.head(700)["is_difficult"].sum()
tail_difficult_count = filtered_data.tail(300)["is_difficult"].sum()

# Print the results
print(f"Longest record length: {longest_record}")
print(f"Mean record length: {mean_length:.2f}")
print(f"Mean special character count: {mean_special_count:.2f}")
print(f"Difficult records in the first 700: {head_difficult_count}")
print(f"Difficult records in the first 700 as a fraction: {head_difficult_count/700}")
print(f"Difficult records in the last 300 as a fraction: {tail_difficult_count/300}")

Longest record length: 642
Mean record length: 136.77
Mean special character count: 4.57
Difficult records in the first 700: 170
Difficult records in the first 700 as a fraction: 0.24285714285714285
Difficult records in the last 300 as a fraction: 0.23333333333333334


In [5]:
# Work on a copy to avoid SettingWithCopyWarning
filtered_data = filtered_data.copy()

# Build the prompt: the instruction below is placed before each anamnesis that is fed to the model
def create_prompt(record):
    return f"Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: {record}"

# Build the expected response: the target value as a string
def create_response(record):
    return f"{record}"

# Store prompt and response as columns for readability in follow-up cells
filtered_data['prompt'] = filtered_data['Anamnesis_en'].apply(create_prompt)
filtered_data['response'] = filtered_data['CorrectDiabetesDuration'].apply(create_response)

The analysis confirms that hard records are distributed evenly across the dataset (about 24% of the first 700 records and 23% of the last 300), so we can proceed with the train/validation split as is.
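The difficulty rule itself is easy to reproduce without pandas. The following minimal sketch applies it to three made-up records; `flag_difficult` is a hypothetical helper name, and the toy strings are not taken from the dataset.

```python
from statistics import mean

SPECIAL_CHARACTERS = "!@#$%^&*()-_=+[]{};:'\",.<>?/\\|`~"

def flag_difficult(records):
    """Flag records that are longer than the mean AND have more special characters than the mean."""
    lengths = [len(r) for r in records]
    specials = [sum(ch in SPECIAL_CHARACTERS for ch in r) for r in records]
    mean_len, mean_spec = mean(lengths), mean(specials)
    return [l > mean_len and s > mean_spec for l, s in zip(lengths, specials)]

toy_records = [
    "diabetes for 12 years",
    "Diabetes mellitus since 2011; on oral treatment, CHD, CAD (non-stenosing).",
    "DM 5 yrs",
]
flags = flag_difficult(toy_records)
print(flags)
```

Only the second record is both longer than average and richer in punctuation, so it is the only one flagged. Because both thresholds are means, a few very long records can shift them noticeably; a median-based threshold would be a possible alternative.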

In [6]:
# prepare_dataset converts the data into a dataset suitable for feeding the model
def prepare_dataset(data, num_rows, from_end=False):
    # Take the last num_rows records if from_end is set, otherwise the first num_rows.
    # Taking the training set from the top and the validation set from the bottom avoids overlap as long as the data has enough rows.
    subset = data.tail(num_rows) if from_end else data.head(num_rows)
    #renaming the columns as expected from the SFT Trainer 
    subset = subset[['prompt', 'response']].rename(columns={'prompt': 'input', 'response': 'output'})
    
    #Creating the final list 
    data_list = [
        {"Instruction": f"\n{inp}", "Response": f"\n{out}"}
        for inp, out in zip(subset['input'], subset['output'])
    ]
    
    return datasets.Dataset.from_list(data_list)

# Prepare training and validation datasets using the function above 
dataset_training = prepare_dataset(filtered_data, 700, from_end=False)
dataset_validation = prepare_dataset(filtered_data, 300, from_end=True)
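Taking 700 rows from the top and 300 from the bottom only yields disjoint sets when the filtered data has at least 1000 rows. A small check like the one below makes that condition explicit; `split_overlap` is a hypothetical helper, and the row counts in the usage lines are illustrative.

```python
def split_overlap(n_rows, n_train, n_val):
    """Number of records that would appear in both head(n_train) and tail(n_val)."""
    return max(0, n_train + n_val - n_rows)

print(split_overlap(1000, 700, 300))  # 0: the two subsets are disjoint
print(split_overlap(950, 700, 300))   # 50: the last 50 training rows are also validation rows
```

In the notebook this could be called with `len(filtered_data)` before the datasets are built, so that a silent train/validation leak is caught early.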

Set code to run on CUDA 

Note: the CUDA-specific clean-up in the cell below only runs when CUDA is available; otherwise the function falls back to the CPU.

In [7]:
# Run this function before each model load/training. It selects the device and frees cached memory.
def env_set_up():
    # An empty CURL_CA_BUNDLE is a commonly used workaround for SSL certificate errors when downloading models
    os.environ['CURL_CA_BUNDLE'] = ''
    # Order GPUs by PCI bus ID so that device indices are stable
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

    # Check whether CUDA is available and select the first GPU
    cuda_available = torch.cuda.is_available()
    if cuda_available:
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"
        print("CUDA is available")
    else:
        print("CUDA is not available")

    import gc
    # If a model was loaded before, delete it first:
    # del model, tokenizer
    # Collect unreferenced Python objects, then release CUDA memory that torch has reserved but no longer uses
    gc.collect()
    if cuda_available:
        torch.cuda.empty_cache()
        # Uncomment to inspect CUDA memory statistics:
        # print(torch.cuda.memory_summary(abbreviated=False))
    return torch.device("cuda" if cuda_available else "cpu")

In [8]:
# Set CUDA as the device
device = env_set_up()
# device = torch.device("cpu")  # uncomment to force the CPU
# Check which device is in use
print("device:", device)

CUDA is available
device: cuda


Untrained LLMs performance tests 

Function: Run, save and print model results on prompted records. 

In [9]:
# Some outputs still contain non-digit characters; strip everything except digits
def clean_text(text):
    return re.sub(r'\D', '', text).strip()

# Return the last run of digits in the text, or None if the text contains no digits
def extract_final_number(text):
    numbers = re.findall(r'\d+', text)
    return numbers[-1] if numbers else None

# TODO: validate response accuracy and classify the reasons for wrong answers
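The behaviour of `extract_final_number` is worth checking on a few edge cases, because the generated text usually starts with the echoed prompt. The sketch below repeats the function so it runs on its own; the sample prompt and answer are illustrative.

```python
import re

def extract_final_number(text):
    numbers = re.findall(r"\d+", text)
    return numbers[-1] if numbers else None

prompt = ("Extract only the duration of diabetes in years, given that the year is 2018. "
          "Provide no additional commentary, just the number.: Diabetes mellitus since 2011.")

# The model echoes the prompt and then answers: the answer is the last number
print(extract_final_number(prompt + "\n7"))      # 7

# The model adds nothing: the last number comes from the anamnesis itself
print(extract_final_number(prompt))              # 2011

# Decimal values are split at the dot, so only the fractional digits are returned
print(extract_final_number("15.5"))              # 5

# No digits at all
print(extract_final_number("no answer given"))   # None
```

The second case is the important one: when a model produces no answer, a number from the prompt is scored instead, so it is safer to cut off the prompt before extracting the number.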


In [10]:
def run_model(model_id):
    # Initialize a Hugging Face pipeline for text generation with the given model
    # "text-generation": the task type
    # model=model_id: the model to load
    # torch_dtype=torch.bfloat16: run inference in bfloat16 (better performance on newer GPUs)
    # device=device: place the model on the device returned by env_set_up() (CUDA when available)
    pipe = pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.bfloat16,
        device=device
    )

    # Prepare the data for inference:
    # take the first 1000 rows of filtered_data, keep only 'prompt' and 'response',
    # and rename them to 'input' and 'output' for clarity and consistency
    prepared_data = filtered_data.head(1000)[['prompt', 'response']].rename(columns={
        'prompt': 'input',
        'response': 'output'
    })

    # Results collected for the accuracy calculation
    validation_results = []

    # Pass each record to the model
    for _, record in prepared_data.iterrows():
        # Extract the input text and expected output
        input_text = record['input']
        expected_output = record['output']

        # Generate a response
        # max_new_tokens=200: limits the number of tokens the model can generate
        # do_sample=True: enables sampling, making the generation less deterministic
        # temperature=0.7: controls randomness; lower is more focused and conservative, higher is more random
        result = pipe(input_text, max_new_tokens=200, do_sample=True, temperature=0.7)

        # The pipeline returns a list of dictionaries, each with a 'generated_text' key
        # (by default the text starts with the prompt). Default to an empty string if the result is empty.
        generated_text = result[0]['generated_text'] if result else ""

        # Print the generated output so the run can be followed
        print('\n')
        print(generated_text)
        print('\n=============\n')

        # Save the output and the relevant record fields for the accuracy calculation
        validation_results.append({
            "Instruction": input_text,
            "Expected Output": expected_output,
            "Model Output": generated_text
        })

    # Count exact matches between the expected value and the last number the model produced.
    # The echoed prompt is cut off first, so numbers from the anamnesis are not mistaken for the answer.
    matches = 0
    for record in validation_results:
        expected = str(record['Expected Output']).strip()
        completion = record['Model Output']
        if completion.startswith(record['Instruction']):
            completion = completion[len(record['Instruction']):]
        model_output = extract_final_number(completion)
        if expected == model_output:
            matches += 1

    # Calculate the match percentage
    accuracy = (matches / len(validation_results)) * 100 if validation_results else 0

    # Print the result
    print(f"Accuracy without training: {accuracy:.2f}%")

Function: Calculate accuracy on cleaned output
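A minimal sketch of the exact-match accuracy used as the single metric, assuming expected values and extracted model answers are compared as stripped strings. `exact_match_accuracy` is a hypothetical helper name and the values below are toy data, not results from any model.

```python
def exact_match_accuracy(expected, predicted):
    """Percentage of positions where the stripped expected and predicted strings are equal.

    A prediction of None (no number found in the output) counts as a miss.
    """
    if not expected:
        return 0.0
    matches = sum(
        str(exp).strip() == (pred or "").strip()
        for exp, pred in zip(expected, predicted)
    )
    return 100 * matches / len(expected)

expected = ["12", "10", "7", " 20"]
predicted = ["12", "20", "7", None]
accuracy = exact_match_accuracy(expected, predicted)
print(f"{accuracy:.2f}%")
```

Two of the four toy answers match, so the sketch reports 50%. Exact matching is strict by design: "10" against "10 years" only counts as correct if the number has been extracted first.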

1. Llama-3.2-1B

In [19]:
# Identifier of the pre-trained Llama 3.2 1B model
model_id = "meta-llama/Llama-3.2-1B"
run_model(model_id)


Device set to use cuda
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: diabetes for 12 years For about 20 years arterial hypertension in the world is the most common disease. More than 30 million people worldwide have it. In Russia about 2.5 million people suffer from this disease. The duration of diabetes is the number of years that the patient has had diabetes.
The average duration of diabetes is 12 years.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Perennial arterial hypertension in values up to 200/120 familial burden. type 2 diabetes for 10 years chronic kidney disease: crea 246mcmol/l from 2011 to 2018.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Sjögren, biopsy-proven.  Glaucoma for about 11 years; Diabetes mellitus since 2011, on oral treatment; arterial hypertension; CHD; CAD- non-stenozing.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: With diabetes for 20 years. He has been treated with SGLT 2 tablets for 20 years , SUPand bigvanides .He follows a dietary regimen. No thirst frequent urination .No fainting No itching . Legs do not tremble, .Sugars are about 6 mmol/l. , Has diabetics in her family. No diabetes in her family. No high blood pressure. No hypertension. No heart disease. No high cholesterol. No kidney disease. No thyroid disease. No respiratory disease. No asthma. No liver disease. No thyroid disease. No diabetes in her family. No high blood pressure. No hypertension. No heart disease. No high cholesterol. No kidney disease. No thyroid disease. No respiratory disease. No asthma. No liver disease. No thyroid disease. No diabetes in her family. No high blood pressure. No hypertension. No heart disease. No high cholesterol. No kidney disease. No thyroid disease. No respiratory dis





Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: With diabetes from 1m . He is treated with 1 m biguanide tablets .He follows a dietary regimen. No thirst frequent urination .Not weak No itching . Legs don't tremble, .Sugars are about 5 mmol/l , No diabetic in her family. Blood glucose 11 mmol/l ( 1.5 mmol/l ).The blood glucose level is measured every 2 hours. The patient is prescribed the following drug: metformin 850 mg once a day. The patient is given 850 mg of metformin. What is the dose of metformin in this case?








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: With diabetes for 15 years. Treated with insulin since then.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: With diabetes for 23 years. Treated with SUP tablets.
Treatments with the following drugs for 3 years. The treatment duration is 4 years. Provide no additional commentary, just the number.
Please provide the following treatment information, in the following format: Treatment Name: Treatment Duration: Treatment Type: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treatment Start Date: Treatment End Date: Treat





Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: With type 1 diabetes for 41 years. Treated with insulin since then on an intensified regimen . but is a labile diabetic Follows a dietary regimen. No thirst frequently and no urination , no pain in the feet. Is on a regular exercise regimen. Has a normal glucose tolerance test.
1. A. 37.1 years B. 41.1 years C. 40.7 years D. 36.8 years E. 38.2 years
2. A. 37.1 years B. 41.1 years C. 40.7 years D. 36.8 years E. 38.2 years
3. A. 37.1 years B. 41.1 years C. 40.7 years D. 36.8 years E. 38.2 years
4. A. 37.1 years B. 41.1 years C. 40.7 years D. 36.8 years E. 38.2 years
5. A. 37.1 years B. 41.1 years C. 40.7 years D. 36.8








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: With diabetes from -36 years. Treated with insulin since -18 years.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: With diabetes for 15.5 years. Treated with 15.5 tablets of metformin per day.
The first year, the patient was diagnosed with type 2 diabetes. Since then, he has been on a fixed diet and metformin. At this time, the patient was taking 15.5 tablets of metformin per day, and the number of tablets was increased to 16.5 tablets per day. In 2018, he took 15.5 tablets per day. The patient was 45 years old when he was diagnosed with diabetes.
Answer: 15.5 × 1.5 = 22.25 years.
This is the number of years of diabetes.
Answer: 2018 × 0.5 = 2018.5 years.
This is the number of years after the diagnosis of diabetes.
Answer: 15.5 × 0.5 = 7.75 years.
This is the number of years the patient took metformin.
Answer: 2018 × 0.5 =








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: With diabetes for 7 years. Treated with tablets, how many tablets did you take each day?
A) 30
B) 20
C) 50
D) 40
Answer: C




Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: With diabetes for 13years. He was treated with insulin and Vistosa.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: good background control of med. therapy.  Diabetes mellitus type 2 for 10 years - per year, 0.2-0.3 g/dl (50-75 mg/dl).








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus since 2009 with complications from the nervous system and renal disease. This is the most common form of diabetes. It is also called type 1 diabetes. It is caused by an autoimmune attack on the pancreas. The body's immune system destroys the insulin-producing beta cells in the pancreas. This type of diabetes requires insulin injections and can be difficult to control. The onset of the disease is usually between ages 15 and 25, but it can occur at any age.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Proven diabetes mellitus 2016. Undergoing intensified Ins. treatment. with Humalog and Lantus. The duration of diabetes is the number of years that have passed since a person was diagnosed with diabetes. The number is rounded to the nearest whole year. For example, if a person was diagnosed with diabetes in 2009 and is currently receiving treatment for diabetes, the number for the duration of diabetes is 2009 + 1 year = 2010.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Since 2014 diabetes mellitus has been diagnosed,So far on treatment with diabetes MP x1 tb. Reports pain in the abdomen, nausea and vomiting, which is most likely caused by which of the following? (d) Inflammation of the pancreas (pancreatitis)
A. Insulin
B. Glucagon
C. Glucocorticoids
D. Insulin
Answer: A








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Established diabetes mellitus since 1975. Demonstrated DPNP and angiopathy. Undergoing treatment with metformin.
Note that there is an overlap in the dates of onset of DPNP and angiopathy and this information should not be used to determine the order of onset of the two conditions.
The data are from the American Diabetes Association.
The year is 2018. Provide no additional commentary, just the number.
Established diabetes mellitus since 1975.
Demonstrated DPNP and angiopathy.
Undergoing treatment with metformin.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus since 2002.  He is undergoing treatment with Insulin Hum.M3
  1. Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus since 2002.  He is undergoing treatment with Insulin Hum.M3
  2. Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus since 2002.  He is undergoing treatment with Insulin Hum.M3
  3. Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus since 2002.  He is undergoing treatment with Insulin Hum.M3
  4. Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus sin





Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diagnosed with diabetes mellitus in 2015.Currently on treatment with Gentadjueto 2,5/850 x1 mg/day, and with Metformin 850 x1 mg/day.
The correct answer is 2018.
Diabetes mellitus is a disease characterized by hyperglycemia (high blood glucose) due to the insulin deficiency of the body.
The onset of the disease usually occurs in adolescence or early adulthood, although it can also occur in childhood or in adulthood.
The diagnosis of diabetes mellitus is made based on the presence of symptoms of hyperglycemia and the presence of laboratory findings of hyperglycemia.
The most common symptoms are polyuria (increased urinary frequency), polydipsia (increased thirst), polyphagia (increased appetite), and weight loss.
In the presence of symptoms, it is necessary to request a fasting blood glucose test.
In cases of fasting blood glucose greater than 126 mg / dL, t





Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus diagnosed in 2009.  Undergoing treatment with Neoglim 2mg x1 10ml 1ml/5ml.  2018








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Since 2005 with proven diabetes mellitus. On treatment with Novo Rapid 14+14+12E and Levemir x20E. Follows a dietary regimen.Reports 3 times per week, 30 minutes after breakfast, 30 minutes after lunch, and 30 minutes after dinner.
Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Since 2005 with proven diabetes mellitus. On treatment with Novo Rapid 14+14+12E and Levemir x20E. Follows a dietary regimen.Reports 3 times per week, 30 minutes after breakfast, 30 minutes after lunch, and 30 minutes after dinner.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus has been diagnosed since 2010. He is undergoing treatment with Insulin Mixtard 30% 20+16E and has been taking it for 2 years. The doctor has told him to increase the dosage. What is the appropriate dosage?
A. 50 U once daily
B. 50 U twice daily
C. 50 U three times daily
D. 100 U once daily
E. 100 U twice daily
Answer: A








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Since 2002 he has been diagnosed with diabetes mellitus. He is treated with Insulin Insulatard NM 12+18E. and Diaprel MR. His blood sugar level is 120 mg/dL, and he is 40 years old. How many years has he had diabetes?








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus since 1996.Complications of the nervous system and dehydration are two most important complications of diabetes. The duration of diabetes is the time period between the beginning of diabetes and the onset of complications. The duration of diabetes is a very important aspect of diabetes as it is a predictor of the likelihood of developing complications.








Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Diabetes mellitus diagnosed in 2009. Undergoing treatment with Ins. Hum. Mix 25% 30+30E and Hum. P 26+24E. Coming for new protocol.  Observes dietary regimen. Satisfactory diabetes control. With pain and shivering of lower extremities, chilliness of fingers, and generalized pain. No acute illness. No weight loss.
The following information is the result of a review of the patient's medical record. Please provide additional information or clarify any information you do not understand.




KeyboardInterrupt: 

The run was interrupted manually because the responses were inconsistent and irrelevant.

2. Gemma-3-4b-it

In [33]:
# Set CUDA as the device
device = env_set_up()
# device = torch.device("cpu")  # uncomment to force the CPU
# Check which device is in use
print("device:", device)

model_id = "google/gemma-3-4b-it"
run_model(model_id)


CUDA is available
device: cuda




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda




Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: diabetes for 12 years For about 20 years arterial hypertension in 2018 10 years diabetes for 12 years type 2 diabetes for 22 years type 2 diabetes for 20 years hypertension for 25 years diabetes for 12 years

12
20
10
22
20
25
12





Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Perennial arterial hypertension in values up to 200/120 familial burden. type 2 diabetes for 10 years chronic kidney disease: crea 246mcmol/l from 2008. Diabetes for 12 years, hypertension for 20 years.
20





Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Sjögren, biopsy-proven.  Glaucoma for about 11 years; Diabetes mellitus since 2011, on oral treatment; arterial hypertension; CHD; CAD- non-stenozing.
11





Extrac

In [34]:
# Set CUDA as the device
device = env_set_up()
# device = torch.device("cpu")  # uncomment to force the CPU
# Check which device is in use
print("device:", device)

model_id = "google/gemma-2-2b-it"
run_model(model_id)

CUDA is available
device: cuda


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda




Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: diabetes for 12 years For about 20 years arterial hypertension in 2018.  



Let me know what you need help with. 





Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Perennial arterial hypertension in values up to 200/120 familial burden. type 2 diabetes for 10 years chronic kidney disease: crea 246mcmol/l from 2018 to 2021:  
 
 Answer: 10





Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Sjögren, biopsy-proven.  Glaucoma for about 11 years; Diabetes mellitus since 2011, on oral treatment; arterial hypertension; CHD; CAD- non-stenozing.

1. 2011
2. 11
3. 2018
4.  11
 





Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the n

Model fine-tuning of gemma-2-2b-it

In [11]:
repo_name = "google/gemma-2-2b-it"
my_token = os.environ.get("HF_TOKEN")  # read the Hugging Face access token from the environment; never hard-code it in the notebook
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True, token = my_token)
tokenizer.pad_token = tokenizer.eos_token

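# Load the weights in 4-bit precision (bitsandbytes) so the model fits on a local GPU
bnbConfig = BitsAndBytesConfig(
    load_in_4bit=True
)

# attn_implementation='eager' selects the plain attention implementation, which transformers recommends when training Gemma 2
model = AutoModelForCausalLM.from_pretrained(repo_name, quantization_config=bnbConfig, trust_remote_code=True, token = my_token, attn_implementation='eager')
model = model.to('cuda')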


`low_cpu_mem_usage` was None, now default to True since model is quantized.



In [12]:
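# Prompt template for inference; it must match the training format built in formatting_prompts_func ("...\nResponse:\n...")
template = "Instruction:\n{instruction}\nResponse:\n{response}"
# Marker after which the completion-only collator computes the loss
response_template = "Response:"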

In [13]:
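def generate_response(model, tokenizer, prompt, device, max_new_tokens=128):
    """Generate a completion for `prompt` and return the decoded text (prompt included)."""
    # Tokenize the prompt and move the tensors to the target device
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to(device)
    # Generate a single sequence with at most `max_new_tokens` new tokens
    outputs = model.generate(**inputs, num_return_sequences=1, max_new_tokens=max_new_tokens)
    # Decode the generated ids back to text, dropping special tokens
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return text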

In [14]:
# LoRA (Low-Rank Adaptation) config, a parameter-efficient fine-tuning method commonly used with LLMs
lora_config = LoraConfig(
    # Rank of the adapter matrices. The higher the value, the higher the capacity and the memory usage.
    # Since this code is run locally, 8 is chosen as a low-to-moderate rank; a rank of 16 or higher may give better results in fewer steps, at the cost of more memory.
    r = 8,
    # Specifies which layers in the model will have LoRA applied. Targeting more layers increases the number of trainable parameters and the resource usage.
    # These are the common linear layers in transformer blocks: the query, key, value and output projections in attention, plus the gate, up and down projections in the MLP.
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",  # task type: causal language modeling
)
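
To make the effect of the rank concrete, the sketch below counts the trainable parameters that LoRA adds to a single linear layer. LoRA learns two low-rank matrices, A (r x d_in) and B (d_out x r), so one adapter has r * (d_in + d_out) parameters. The layer dimensions used here are illustrative assumptions, not values read from the gemma-2-2b-it configuration, and the helper names are hypothetical.

```python
def lora_param_count(r: int, d_in: int, d_out: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters of the frozen linear layer (bias ignored)."""
    return d_in * d_out


# Hypothetical layer size, chosen only for illustration.
D_IN, D_OUT = 2304, 2048

for rank in (8, 16):
    added = lora_param_count(rank, D_IN, D_OUT)
    share = added / full_param_count(D_IN, D_OUT)
    print(f"r={rank}: {added} adapter parameters ({share:.2%} of the frozen layer)")
```

Doubling the rank doubles the adapter size, which is why a higher rank trades memory for capacity.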

In [15]:
# Format each example as a single string. We use the already created and formatted column values to compose the strings that will be passed to the trainer.
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['Instruction'])):
        text = f"Instruction:\n{example['Instruction'][i]}\nResponse:\n{example['Response'][i]}"
        output_texts.append(text)
    print(output_texts)
    return output_texts
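
The prompt used at inference time should be a prefix of the text the model saw during training, otherwise the model is asked to continue a format it was never tuned on. The sketch below checks this on one made-up record. `format_batch` is a hypothetical stand-in for formatting_prompts_func without the debug print, and `TEMPLATE` assumes the same newline-delimited layout.

```python
def format_batch(batch: dict) -> list:
    """Hypothetical stand-in for formatting_prompts_func (no debug print)."""
    return [
        f"Instruction:\n{instruction}\nResponse:\n{response}"
        for instruction, response in zip(batch["Instruction"], batch["Response"])
    ]


# Inference-time template, assumed to use the same layout as the training text.
TEMPLATE = "Instruction:\n{instruction}\nResponse:\n{response}"

# Made-up record in the dict-of-lists shape that the trainer passes to the function.
batch = {
    "Instruction": ["Extract only the duration of diabetes in years.: diabetes for 12 years"],
    "Response": ["12"],
}

training_text = format_batch(batch)[0]
inference_prompt = TEMPLATE.format(instruction=batch["Instruction"][0], response="")

print(repr(training_text))
print("prompt is a prefix of the training text:", training_text.startswith(inference_prompt))
```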


In [16]:
# Callback class that prints the training and validation loss during training
# Usually this happens automatically, but sometimes it breaks, apparently because of a small, negligible unresolved bug with the quantized Gemma 2B model
class LossPrinterCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            log_message = f"Step {state.global_step}:"
            if "loss" in logs:
                log_message += f" training_loss = {logs['loss']:.4f}"
            if "eval_loss" in logs:
                log_message += f", validation_loss = {logs['eval_loss']:.4f}"
            print(log_message)
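
When a log entry carries only `eval_loss`, the string concatenation above produces a message such as "Step 25:, validation_loss = ...". A minimal sketch of one way to avoid the stray comma is to collect the parts and join them. `format_log_message` is a hypothetical helper, not part of the notebook or of transformers.

```python
def format_log_message(step: int, logs: dict) -> str:
    """Build a log line from whichever losses are present in `logs`."""
    parts = []
    if "loss" in logs:
        parts.append(f"training_loss = {logs['loss']:.4f}")
    if "eval_loss" in logs:
        parts.append(f"validation_loss = {logs['eval_loss']:.4f}")
    # Join with a comma only between parts, never right after the colon.
    return f"Step {step}: {', '.join(parts)}".rstrip()


print(format_log_message(25, {"loss": 0.0436}))
print(format_log_message(25, {"eval_loss": 0.126036}))
```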

In [17]:
# Data collators form a batch from a list of dataset elements.
# This collator batches the tokenized prompts and restricts the loss to the tokens that follow the response template.
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)


# Loading a second tokenizer adds some overhead, but it gives explicit control over padding, so that batches are padded consistently.
# The tokenizer is reloaded explicitly with right-padding because, in our case, when the input text is long the answer sits in the latter part of the sentence, so right-padding should give higher accuracy.
# `trust_remote_code=True` allows loading custom tokenizers from repositories.
right_padded_tokenizer = AutoTokenizer.from_pretrained(repo_name, padding_side="right", trust_remote_code=True, token = my_token)

# We set up the trainer to track both training and validation loss, so we can see whether the model is over- or under-fitted
trainer = SFTTrainer(
    model,                           # pass the loaded model
    train_dataset=dataset_training,  # training dataset, already formatted to the needs of the model
    eval_dataset=dataset_validation, # validation dataset; monitoring the validation loss helps detect over/under-fitting
    args=SFTConfig(
        output_dir="/tmp",           # output directory for checkpoints, which allows the trainer to be restarted
        num_train_epochs=6,          # number of passes over the training data (epochs)
        per_device_train_batch_size=4, # batch size per torch device, kept small to limit the load on a personal machine
        gradient_accumulation_steps=6, # accumulate gradients over 6 batches before each optimizer step (effective batch size 4 x 6 = 24)
        learning_rate=5e-5,            # how strongly the model's weights are updated at each step (5e-5 = 0.00005)
        logging_strategy="steps",      # log training metrics every `logging_steps` steps
        eval_strategy="steps",         # run validation every `eval_steps` steps
        logging_steps=1,               # view training results at each training step
        eval_steps=25,                 # view validation results every 25 steps; they are time-consuming and do not vary as much
        max_seq_length = 1024,         # maximum length, in tokens, of each tokenized training sequence
        disable_tqdm=False             # keep the progress bar visible during training
    ),
    callbacks=[LossPrinterCallback()],   # make sure we see training and validation results
    formatting_func=formatting_prompts_func, # custom prompt formatting function defined above
    data_collator=collator,            # custom collator defined above
    peft_config=lora_config,           # LoRA config defined above, which keeps the fine-tuning load manageable on this machine
    processing_class=right_padded_tokenizer,  # custom right-padded tokenizer
)

Map:   0%|          | 0/700 [00:00<?, ? examples/s]

['Instruction:\n\nExtract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: diabetes for 12 years For about 20 years arterial hypertension in\nResponse:\n\n12', 'Instruction:\n\nExtract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Perennial arterial hypertension in values up to 200/120 familial burden. type 2 diabetes for 10 years chronic kidney disease: crea 246mcmol/l from\nResponse:\n\n10', 'Instruction:\n\nExtract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Sjögren, biopsy-proven.  Glaucoma for about 11 years; Diabetes mellitus since 2011, on oral treatment; arterial hypertension; CHD; CAD- non-stenozing.\nResponse:\n\n7', 'Instruction:\n\nExtract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the numb

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


['Instruction:\n\nExtract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Since 2007 years diabetes mellitus - blood sugar 12,0 mmol/l, on Metfogama 850 mg therapy\nResponse:\n\n11', 'Instruction:\n\nExtract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: fatigue under normal loads. Since 2005 with diabetes mellitus- since then on insulin treatment, suboptimal glycemic control. COPD- on inhaled treatment\nResponse:\n\n13', 'Instruction:\n\nExtract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: temporarily taking Ritmonorm, which he stopped. Diabetes mellitus detected since 3 years- started treatment with Diaprel and Gentodueto, improved glycemic control. With diabetic retinopathy, pending manipulation with local anesthesia.\nResponse:\n\n3', 'Instruction:\n\nExtract onl
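
The collator receives `response_template` so that only the tokens after "Response:" contribute to the loss. The sketch below illustrates that idea on toy token ids. It is a conceptual illustration, not TRL's implementation; the ids and the helper name are made up, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(token_ids: list, response_marker_id: int) -> list:
    """Return labels in which everything up to and including the marker is ignored."""
    marker_pos = token_ids.index(response_marker_id)
    return [
        IGNORE_INDEX if position <= marker_pos else token_id
        for position, token_id in enumerate(token_ids)
    ]


# Toy sequence: instruction tokens, then the marker (id 99), then the answer tokens.
token_ids = [5, 17, 42, 99, 7, 3]
labels = mask_prompt_labels(token_ids, response_marker_id=99)
print(labels)  # [-100, -100, -100, -100, 7, 3]
```

With labels masked this way, the model is trained to produce the answer and is not penalized for how it would have reproduced the instruction.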

In [18]:
torch.cuda.empty_cache()
trainer.train()

Step,Training Loss,Validation Loss
25,0.0436,0.126036
50,0.1322,0.095486
75,0.0022,0.075496
100,0.1331,0.070117
125,0.0045,0.068045
150,0.0007,0.072549


Step 1: training_loss = 2.9373
Step 2: training_loss = 2.2216
Step 3: training_loss = 2.0231
Step 4: training_loss = 1.3871
Step 5: training_loss = 1.1539
Step 6: training_loss = 0.8020
Step 7: training_loss = 0.7600
Step 8: training_loss = 0.5520
Step 9: training_loss = 0.4866
Step 10: training_loss = 0.2510
Step 11: training_loss = 0.4017
Step 12: training_loss = 0.4339
Step 13: training_loss = 0.3543
Step 14: training_loss = 0.1462
Step 15: training_loss = 0.3519
Step 16: training_loss = 0.4412
Step 17: training_loss = 0.3082
Step 18: training_loss = 0.0489
Step 19: training_loss = 0.1061
Step 20: training_loss = 0.1558
Step 21: training_loss = 0.3727
Step 22: training_loss = 0.1985
Step 23: training_loss = 0.2611
Step 24: training_loss = 0.0813
Step 25: training_loss = 0.0436
Step 25:, validation_loss = 0.1260
Step 26: training_loss = 0.0439
Step 27: training_loss = 0.0483
Step 28: training_loss = 0.4347
Step 29: training_loss = 0.2659
Step 30: training_loss = 0.5002
Step 31: train

TrainOutput(global_step=174, training_loss=0.1398784927394905, metrics={'train_runtime': 30809.1937, 'train_samples_per_second': 0.136, 'train_steps_per_second': 0.006, 'total_flos': 4812801918879744.0, 'train_loss': 0.1398784927394905, 'epoch': 5.822857142857143})
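
The `global_step=174` reported above is consistent with the training arguments: 700 training examples, a per-device batch size of 4, 6 gradient accumulation steps and 6 epochs. The sketch below reproduces that arithmetic. It assumes a single device and that an incomplete accumulation group at the end of an epoch is not counted as a step, which is an assumption about the Trainer version used; `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, grad_accum_steps: int, epochs: int) -> int:
    """Estimate optimizer steps for one device (incomplete accumulation groups dropped)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs


steps = optimizer_steps(num_examples=700, batch_size=4, grad_accum_steps=6, epochs=6)
train_runtime_s = 30809.1937  # 'train_runtime' from the TrainOutput above

print("effective batch size:", 4 * 6)
print("optimizer steps:", steps)  # 174
print("runtime in hours:", round(train_runtime_s / 3600, 1))
print("seconds per optimizer step:", round(train_runtime_s / steps))
```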

In [19]:
trainer.save_model("finetuned_gemma2_2b-it")

In [24]:
# Same setup as before training, so this cell can be re-run after a kernel restart
rep_name = "finetuned_gemma2_2b-it"  # directory where the fine-tuned model was saved (kept for reference; not used below)
# Redo the tokenizer and BitsAndBytes config steps in any case; this helps with re-runs after a kernel restart
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

bnbConfig = BitsAndBytesConfig(
    load_in_4bit=True
)

# Reload the base model with AutoModelForCausalLM, using the same settings as in training
model = AutoModelForCausalLM.from_pretrained(repo_name, quantization_config=bnbConfig, trust_remote_code=True, attn_implementation='eager')
torch.cuda.empty_cache()

# Inference below uses the fine-tuned model still held by the trainer, cast to half precision
inference_ready_model = trainer.model.to(torch.half)

`low_cpu_mem_usage` was None, now default to True since model is quantized.



In [25]:

def generate_response_check(model, tokenizer, prompt, device, max_new_tokens=128):
    # Tokenize the input prompt and move it to the specified device
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to(device)
    # Generate a response using the model
    outputs = model.generate(**inputs, num_return_sequences=1, max_new_tokens=max_new_tokens)
    # Decode the output tokens into human-readable text
    print("Output Shape:", outputs.shape)  # Debugging step
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    print("RAW MODEL OUTPUT:", text)  # Debugging step

    # Extract numeric response if present (e.g., "Response: 42")
    match = re.search(r"Response:\s*(\d+)\**", text, re.IGNORECASE)
    
    if match:
        text = match.group(1).strip()
    
    return text

validation_results = []  # list that collects the results for comparison

# Iterate over the validation dataset and test the model's responses
for i, record in enumerate(dataset_validation):
    prompt = template.format(
        instruction=record['Instruction'],  # Insert instruction into prompt template
        response='',  # Ensure the response field is empty in the prompt
    )

    # Generate a response from the model
    response_text = generate_response_check(inference_ready_model, tokenizer, prompt, device)

    # Store results for comparison
    validation_results.append({
        "Instruction": record['Instruction'],
        "Expected Ouput": record['Response'],  # Ground truth response
        "Model Output": response_text  # Model-generated response
    })
    
    # Print model output vs expected output for debugging
    print("\n=============\n")
    print(response_text)  # Model's response
    print(record['Response'])  # Expected response
    print("\n=============\n")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Output Shape: torch.Size([1, 112])
RAW MODEL OUTPUT: Instruction:

Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Since 2007 years diabetes mellitus - blood sugar 12,0 mmol/l, on Metfogama 850 mg therapy
\Response:
11**

**Explanation:**

The provided text states that the individual has had diabetes mellitus since 2007.  Therefore, the duration of diabetes in years is 11. 



11

11


Output Shape: torch.Size([1, 109])
RAW MODEL OUTPUT: Instruction:

Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: fatigue under normal loads. Since 2005 with diabetes mellitus- since then on insulin treatment, suboptimal glycemic control. COPD- on inhaled treatment
\Response:
13**

**Explanation:**

The provided text states "Since 2005 with diabetes mellitus".  Therefore, the duration of diabetes is 13 years. 



13

13


Output Shape: torch.Si
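
The raw outputs above contain the prompt, the answer, and an unrequested explanation, so the numeric answer has to be pulled out with the regular expression from generate_response_check. The sketch below applies that same pattern to the first recorded output. `extract_answer` is a hypothetical helper that returns None when nothing matches, whereas the notebook's function returns the full text in that case.

```python
import re

# Same pattern as in generate_response_check: digits after "Response:",
# optionally followed by Markdown asterisks.
ANSWER_PATTERN = re.compile(r"Response:\s*(\d+)\**", re.IGNORECASE)


def extract_answer(raw_text: str):
    """Return the first number after 'Response:' as a string, or None if absent."""
    match = ANSWER_PATTERN.search(raw_text)
    return match.group(1) if match else None


# First raw output recorded above (explanation shortened).
raw = (
    "Instruction:\n\nExtract only the duration of diabetes in years, given that the "
    "year is 2018. Provide no additional commentary, just the number.: Since 2007 years "
    "diabetes mellitus - blood sugar 12,0 mmol/l, on Metfogama 850 mg therapy\n"
    "\\Response:\n11**\n\n**Explanation:**\n"
)

print(extract_answer(raw))  # 11
print(extract_answer("Let me know what you need help with."))  # None
```

Anchoring on "Response:" matters here: the instruction itself contains numbers (2018, 2007, 12,0, 850) that a bare digit search would pick up first.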

In [27]:
# Validate the accuracy of the responses with an exact match on the cleaned values
# TODO: classify the reasons for the wrong answers
matches = 0
for record in validation_results:
    expected = clean_text(record['Expected Ouput'])
    model_output = clean_text(record['Model Output'])
    if expected == model_output:
        matches += 1
    else:
        print(f"Wrong answer for {record['Instruction']}, duration of diabetes is {expected}, but model reported {model_output}\n")

# Calculate match percentage
accuracy = (matches / len(validation_results)) * 100 if validation_results else 0

# Print results
print(f"Accuracy: {accuracy:.2f}%")

Wrong answer for 
Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: THIS IS A PATIENT WITH TYPE TWO CHRONIC DIABETES WITH AGE 34 THE PATIENT HAS A FAMILY DISORDER FOR DIABETES- SISTERRETINAPOTHY DIABETIC PATIENT HAS A SA GLANDOLE PROSTATE-ADENOCARCIN 3/II.OPERATED.RADIOTHERAPY PERFORMED IN MONTH II.2013 YEARS OPERATION, duration of diabetes is 23, but model reported 14

Wrong answer for 
Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Since 1997 years with elevated BP to 170/120, diabetes . since 2015, treated systemically. With complaints of dizziness, duration of diabetes is 3, but model reported 21

Wrong answer for 
Extract only the duration of diabetes in years, given that the year is 2018. Provide no additional commentary, just the number.: Since 1982 years with elevated BP to 200/110, diabetes . since 15 years, treated sy

<h1>Accuracy GEMMA-2-2B-IT POST-TRAINING: 96.67%</h1>
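
The metric above is exact-match accuracy: the share of validation records whose cleaned model output equals the cleaned expected value. The sketch below shows the computation on three made-up records. `normalize` is a hypothetical stand-in for `clean_text`, which is defined elsewhere in the notebook. With 300 validation examples, the reported 96.67% corresponds to 290 exact matches; that count is inferred from the percentage, not stated in the notebook.

```python
def normalize(value) -> str:
    """Hypothetical stand-in for clean_text: keep only the digits."""
    return "".join(ch for ch in str(value) if ch.isdigit())


def exact_match_accuracy(results: list) -> float:
    """Percentage of records whose normalized output equals the expected value."""
    if not results:
        return 0.0
    matches = sum(
        normalize(r["Expected Ouput"]) == normalize(r["Model Output"]) for r in results
    )
    return matches / len(results) * 100


# Made-up records using the same keys as validation_results.
sample = [
    {"Expected Ouput": "11", "Model Output": "11"},
    {"Expected Ouput": "13", "Model Output": " 13 "},
    {"Expected Ouput": "3", "Model Output": "21"},
]

print(f"{exact_match_accuracy(sample):.2f}%")  # 66.67%
print(f"{290 / 300 * 100:.2f}%")               # 96.67%
```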