<a href="https://colab.research.google.com/github/mille055/ct_protocol/blob/main/notebooks/Finetune_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

## Chad Miller
## AIPI590 Project 1

This notebook fine-tunes an LLM (Mistral 7B) to perform the protocolling task for CT scans that have been ordered. The model takes as input the order details and items from the EHR, such as serum creatinine, allergies, the prior CT order, and a (currently synthesized) summary of a clinic note, and outputs the predicted protocol and any additional instructions.


In [1]:
!git clone 'https://github.com/mille055/CT_Protocol.git'
!pip install -U bitsandbytes
!pip install transformers==4.36.2
!pip install -U peft
!pip install -U accelerate
!pip install -U trl
!pip install datasets==2.16.0
!pip install sentencepiece
!pip install openpyxl
!pip install xlrd
!pip install openai
!pip install huggingface_hub

Cloning into 'CT_Protocol'...
remote: Enumerating objects: 190, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 190 (delta 37), reused 0 (delta 0), pack-reused 134
Receiving objects: 100% (190/190), 1.69 MiB | 21.07 MiB/s, done.
Resolving deltas: 100% (118/118), done.
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->bitsandbytes)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->bitsandbytes)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1

In [2]:
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          HfArgumentParser, TrainingArguments, pipeline, logging, LlamaTokenizer)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os
import re
import torch
from datasets import load_dataset, Dataset
from trl import SFTTrainer
import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
import numpy as np
from google.colab import userdata
import json
from sklearn.model_selection import train_test_split
from huggingface_hub import HfApi






In [79]:
prompt_instruction = '''
Use the inputs to expertly decide on the appropriate protocol for the CT study.  The description of each CT protocol is given in the text below (the name to return for each protocol is in parentheses). For scans in which the order does not match the desired protocol, or if there are other outstanding questions the radiologist needs to resolve (e.g., creatinine above 2.0 mg/dL, or a history of severe allergic reaction to IV contrast such as trouble breathing, throat swelling, or anaphylaxis, when a contrast-enhanced scan is ordered), add a comment that will route the case back to the radiologist (comments should be from the list given below).

Also note that it is okay to over-image, but ideally do not choose a simpler protocol for a study that requires multiphase imaging.
Any acute hemorrhage is ideally scanned by the GI bleed protocol (CT with and without contrast, with arterial and venous phases). Any study specifically ordered to evaluate the chest only (such as a PE study), or the thoracic-abdominal aorta or leg arteries, should be routed to the chest/cardiovascular imaging division.
If there is an indeterminate renal or adrenal mass on the prior study, it can be evaluated by adding pre contrast images through the abdomen in addition to the regular protocol (comment: 'precontrast through kidneys or adrenals'). That can be added on without communicating with the ordering provider.

Here is a description of the protocols:

Routine (routine):  Protocol for most patients includes portal venous phase imaging.  No oral contrast is administered by default, but can be added through comments.

Noncontrast (noncon): Protocol when no contrast is indicated or there are contraindications such as severe allergy or elevated creatinine. There are reasons why contrast may not be wanted by the ordering provider even if no contraindications, such as a solitary kidney and mild renal failure.

Dual Liver Protocol (dual liver):  Known or suspected hypervascular liver tumor or suspected metastases from a primary tumor outside the liver for which there are suspected hypervascular liver metastases.  It includes both the hepatic arterial and portal venous phases.  Currently the list of malignancies for this protocol includes neuroendocrine, carcinoid, and thyroid carcinoma.

Cirrhosis Protocol (cirrhosis):  Known or suspected cirrhosis and/or have a known or suspected hepatocellular carcinoma.  It also should be performed in all patients with suspected benign primary liver tumors, such as focal nodular hyperplasia or hepatic adenoma.  This protocol includes acquisitions in the hepatic arterial and the portal venous phases, as well as a delayed phase.

Hepatic Resection Protocol (hepatic resection):  Indicated in all patients anticipating hepatic resection.  It includes thin section images of the liver to include celiac axis and proximal SMA during the hepatic arterial phase and thicker sections through the liver during the portal venous phase.  The images obtained during the hepatic arterial phase undergo volume rendering in 3D.

Radioembolization Protocol (radioembo):  Typically ordered by the Interventional Radiologists for evaluation of a patient following (and possibly before) embolization therapy. This includes arterial and venous phases through the abdomen. The post processing is slightly different than the cirrhosis protocol in that thin images are sent in both arterial and portal venous phases for the 3D lab to assess the vasculature and liver volumes. It should be specifically mentioned in the order, otherwise do not use the protocol.

Pancreas Protocol (pancreas):  Known or suspected pancreatic tumor.  It is occasionally requested in patients with either acute or chronic pancreatitis.  It includes thin section images of the pancreas to include the celiac axis and SMA during the pancreatic phase and images of the liver and pancreas during the venous phase.  Arterial phase images are reconstructed in 3D.

Cholangiocarcinoma Protocol (cholangiocarcinoma):  Known or suspected cholangiocarcinoma.  It includes images of the liver in the portal venous phase as well as through the hilum following a 10 minute delay.  Coronal reformats of the venous phase are included.

Trauma Chest/Abdomen/Pelvis (trauma):  Suspected trauma. Arterial phase imaging through the upper and mid chest followed by portal venous phase imaging of the abdomen and pelvis.  No oral contrast.

Crohns Protocol (crohns):  Evaluation to look for suspected Crohn's involvement, but not necessarily for complications of Crohn's.  If the patient is relatively asymptomatic, the patient receives Volumen™ (a negative contrast agent). Enteric arterial phase images of the abdomen and pelvis are acquired, and sagittal and coronal reformats are also included. Similar to other Bowel Protocol except that only a single phase is acquired to minimize radiation dose.

CT Colonography (colon):  For colon cancer and polyp screening.  The patient undergoes bowel prep the night before the scan as well as barium tagging.  Insufflation of CO2 via device after placement of tube into rectum.  Supine and prone imaging, as well as decubitus position if nondistended segments on the two standard positions.

Renal Stone Protocol (renal stone):  Acute flank pain and/or a known or suspected renal calculus.  It includes a low dose noncontrast CT of the kidneys, ureters, and bladder with the patient in prone position (unless unable).  Coronal reformats are provided.

Genitourinary Protocol (gu):  Hematuria, known or suspected renal mass, or other indications where evaluation of the ureters is necessary.  It includes low dose non-contrast images of the kidneys only followed by nephrographic phase images of the kidneys and 7 min delayed excretory phase images of the kidneys, ureters, and bladder.  Coronal reformats of the excretory phase are included.

Focused Renal Cyst Protocol (focused renal cyst): For followup of a complicated renal cyst. Pre- and Post-contrast (Nephrographic) imaging through the kidneys to assess for enhancement. There is no imaging of the pelvis and no CT urogram. If in doubt, use the more complete GU Protocol.

RCC Protocol (rcc):  Known renal cell carcinoma, typically in patients who have undergone a nephrectomy or nephron sparing treatment, or possibly for preoperative planning.  It includes noncontrast images of the kidneys followed by a dual liver as described above to assess for metastases.  Coronal reformats in the venous phase are included.

TCC Protocol (tcc):  Intended for patients with known transitional cell carcinoma or bladder cancer, typically who have undergone a cystectomy, focal bladder surgery, or nephroureterectomy.  It is a split bolus technique, with the goal of imaging the patient in both the excretory phase and the portal venous phase in a single acquisition following two boluses of IV contrast prior to scanning.

Renal Donor Protocol (renal donor):  To evaluate the renal anatomy of potential renal donors.  This includes thin section images of the kidneys and renal arteries during the arterial phase and venous phase.  A delayed scout is obtained to document the number of ureters.  A separate 3-D interpretation is performed.

Adrenal Protocol (adrenal): For the evaluation of an indeterminate adrenal mass.  Noncontrast images through the adrenal gland.  The scan is then checked by a physician for further imaging.  If the physician deems necessary, portal venous and 15 minute delayed images through the adrenals follow.

CT Cystogram (cystogram):  To evaluate for bladder injury (typically after pelvic trauma) or a fistula.  Contrast (Renografin 60 diluted in saline) is instilled by gravity through an indwelling Foley catheter.



And for predicted_comments, add comments from the following list if appropriate:
'oral contrast' if oral (otherwise known as PO) contrast has been requested in the indication.
'steroid prep' if the patient has a mild allergy to contrast and a contrasted scan has been requested.
'reroute contrast' if the patient has a contraindication to contrast such as creatinine above 2.0 or a severe/anaphylactic contrast allergy.
'reroute coverage' if additional body parts may need to be added to the planned procedure.
'low pelvis' which extends the caudal range of a CT, particularly for malignancies that may not be fully imaged on our routine protocols, which include vulvar cancer, anal cancer, and perhaps rectal cancer if this is the first time evaluation or there is known recurrent disease low in the pelvis.  Things in the inguinal region or upper thigh, perirectal abscess, or perianal fistulous disease may be other possible indications. Perineal infection such as Fournier's gangrene would also require this.
'reroute protocol' if there is a complex process such as a fistula that might not be evaluated well on the routine protocols.
'split' indicates the chest order will be read separately from the abdomen and pelvis, which occurs for lung and esophageal cancer and for patients with lung transplants.
'valsalva' indicates the imaging is performed while the patient does a Valsalva maneuver for evaluation of hernias.

The task is to use the provided information for each patient to return the predicted protocol for the CT study in the form of a json object like this:
{"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ["oral contrast"]}
The response should be the json object and nothing else.

'''
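
The instruction block above still has to be joined with each per-patient question before it reaches the model. The sketch below shows one way that might be done for a Mistral instruct checkpoint, using the `[INST] ... [/INST]` wrapper. It is an assumption, not the notebook's method: `format_instruct_prompt` and the example strings are hypothetical, and the tokenizer is assumed to add the beginning-of-sequence token itself.

```python
def format_instruct_prompt(instruction: str, question: str) -> str:
    """Wrap an instruction and a question in Mistral-style [INST] tags.

    Hypothetical helper for illustration; not part of the notebook's utilities.
    """
    # Join the general instruction and the patient-specific question with a blank line.
    body = instruction.strip() + "\n\n" + question.strip()
    return f"[INST] {body} [/INST]"


# Shortened stand-ins for the real instruction and question text.
example_instruction = "Return the predicted protocol as a json object."
example_question = "Order: CT abdomen pelvis with contrast\nOn Dialysis: False"
example_prompt = format_instruct_prompt(example_instruction, example_question)
print(example_prompt)
```

Keeping the wrapper in one helper means the same formatting can be reused for training examples and for inference prompts.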

In [102]:
### utilities

def build_dataset_from_file(file_path, num_test = 200, output_file = '/content/CT_Protocol/data/datacsv031524.xlsx'):
  """
  This function reads an excel file into a dataframe, renames the columns, and saves a copy in csv format.

  Args:
    file_path: The path to the excel file.
    num_test: Currently unused; the train/test split is done later in the notebook.
    output_file: Where to write the csv copy (the content is csv text, whatever the file extension).

  Returns:
    A dataframe from which the train and test datasets can be derived.
  """
  # Read the excel file into a pandas dataframe (use the argument, not the global `filename`)
  df = pd.read_excel(file_path)

  # Rename columns and identify x and y data
  new_column_names = {'Procedure': 'order', 'Reason for Exam Full': 'indication', 'Previous Procedure Name':'prior_order', 'Contrast Allergy': 'contrast_allergy', 'Allergy Severity': 'allergy_severity', 'Creatinine (mg/dL)':'creatinine', 'Dialysis': 'on_dialysis', 'clinical summary':'clinical_summary', 'Predicted Procedure': 'predicted_order', 'Protocol': 'predicted_protocol', 'Protocol comments':'predicted_comments', 'Accession':'accession'}
  df.rename(columns=new_column_names, inplace=True)
  df.fillna("", inplace=True)
  df['accession'] = df['accession'].astype(str)

  # Save a csv copy of the cleaned dataframe
  df.to_csv(output_file, index=False)

  return df

def row_to_json(row, columns):
  """
  This function takes a row of a dataframe and returns a dict (ready for json serialization) with the specified columns.

  Args:
    row: The row of the dataframe.
    columns: A list of columns to include in the dict.

  Returns:
    A dict with the specified columns.
  """
  json_obj = {}
  for column in columns:
    value = row[column]
    if column == "predicted_comments" and not isinstance(value, list):
      # The value may be a string representation of a list, e.g. "['oral contrast']".
      # Swap the quotes so that json can parse it into an actual list.
      try:
        value = json.loads(value.replace("'", '"'))
      except (AttributeError, TypeError, ValueError):
        # Not a valid list string (e.g. an empty string or a non-string value): use an empty list
        value = []
    json_obj[column] = value
  return json_obj


def build_prompt_question(row, prompt_instruction=prompt_instruction):
  """
  This function takes a row of a dataframe and returns the prompt question as plain text.

  Args:
    row: The row of the dataframe.
    prompt_instruction: Currently unused; the instruction text is not added to the question here.

  Returns:
    A string with one labeled line per input field. Note that creatinine is not included.
  """
  prompt_question = 'Order: ' + row['order'] + '\n' + \
    'Prior Order: ' + row['prior_order'] + '\n' + \
    'Reason for Exam: ' + row['indication'] + '\n' + \
    'Contrast Allergy: ' + str(bool(row['contrast_allergy'])) + '\n' + \
    'Allergy severity: ' + row['allergy_severity'] + '\n' + \
    'On Dialysis: ' + str(bool(row['on_dialysis'])) + '\n' + \
    'Clinical Summary: ' + row['clinical_summary'] + '\n'

  return prompt_question


def build_prompt_question_json(row, columns = ['accession', 'order', 'Reason for Exam', 'prior_order', 'indication',
       'creatinine', 'on_dialysis', 'contrast_allergy', 'allergy_severity', 'clinical_summary']):
  """
  This function takes a row of a dataframe and returns a prompt question.

  Args:
    row: The row of the dataframe.

  Returns:
    A prompt question in json format.
  """

  prompt_question = row_to_json(row, columns)
  #print('build_prompt_question_json sending', prompt_question, type(prompt_question))
  return prompt_question


def build_prompt_answer(row, columns = ["accession", "predicted_order", "predicted_protocol", "predicted_comments"]):
  """
  This function takes a row of a dataframe and returns the prompt answer.

  Args:
    row: The row of the dataframe.
    columns: A list of columns to include in the answer.

  Returns:
    A dict with the accession and the expected order, protocol, and comments.
  """
  prompt_answer = row_to_json(row, columns)
  return prompt_answer


def create_prompt_dataframe(df):
  """
  This function takes a dataframe and returns a dataframe with the prompt questions and answers.

  Args:
    df: The dataframe to be converted.

  Returns:
    A dataframe with the prompt questions and answers.
  """
  df1 = pd.DataFrame()
  for index, row in df.iterrows():
    prompt_question_text = build_prompt_question(row)
    #print('prompt_question_text', prompt_question_text, type(prompt_question_text))

    prompt_question_json = build_prompt_question_json(row)
    #print('prompt_question_json', prompt_question_json, type(prompt_question_json))


    prompt_answer = build_prompt_answer(row)
    #print('prompt_answer', prompt_answer, type(prompt_answer))

    df1.at[index, 'text'] = prompt_question_text
    df1.at[index, 'prompt_question_json'] = str(prompt_question_json).replace("'", '"')
    df1.at[index, 'labels'] = str(prompt_answer).replace("'", '"')
    #print(df1.head())
  return df1

def extract_and_parse_json(response):
    """
    Parses a JSON-like response string into a Python dictionary.

    Assumes the response may be formatted with single quotes (invalid JSON),
    which are replaced with double quotes before parsing.

    Returns:
      The parsed dictionary, or None if parsing fails.
    """
    response = str(response)
    try:
        corrected_response = response.replace("'", '"')
        print('Corrected response:', corrected_response)

        # Parse the corrected response into a Python dictionary.
        json_data = json.loads(corrected_response)
        print('type of json data', type(json_data))
        return json_data

    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return None


def extract_and_parse_json2(response):
    """
    Parses a model response into a dictionary, repairing common formatting
    problems in the predicted_comments field.

    Returns:
      The parsed dictionary, or None if parsing fails.
    """
    try:
        # Normalize response by ensuring it uses double quotes.
        normalized_response = response.replace("'", '"')

        # Unwrap lists that were emitted inside a string, e.g. "[]" or "["a", "b"]".
        corrected_response = re.sub(
            r'"predicted_comments":\s*"(\[.*?\])"',
            lambda match: f'"predicted_comments": {match.group(1)}',
            normalized_response)

        # Handle empty strings for predicted_comments by converting them to empty lists.
        corrected_response = re.sub(
            r'"predicted_comments":\s*""',
            '"predicted_comments": []',
            corrected_response)

        json_data = json.loads(corrected_response)

        # If predicted_comments is still a string, parse it separately.
        if isinstance(json_data.get('predicted_comments', ''), str):
            json_data['predicted_comments'] = json.loads(json_data['predicted_comments'])

        return json_data

    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return None



def response_score(json_data, true_data):
    """
    Scores a parsed model response against the ground truth on a 0-1 scale.

    Args:
      json_data: The parsed model response (a dict), or None if parsing failed.
      true_data: A dict with the true accession, predicted_order, predicted_protocol, and predicted_comments.

    Returns:
      A tuple of (score, accession, predicted_order, predicted_protocol, predicted_comments).
    """
    score = 0
    max_score = 10
    print('true_data type is', type(true_data))
    print(true_data)

    accession, predicted_order, predicted_protocol, predicted_comments = None, None, None, None

    if json_data and isinstance(json_data, dict):
        accession = true_data.get("accession", "Unknown")
        # Use .get so that a response with a missing key scores 0 instead of raising a KeyError
        predicted_order = json_data.get("predicted_order")
        predicted_protocol = json_data.get("predicted_protocol")
        predicted_comments = json_data.get("predicted_comments")

        # 3 points if JSON and has the right keys
        required_keys = ["predicted_order", "predicted_protocol", "predicted_comments"]
        if all(key in json_data for key in required_keys):
            score += 3
            # 5 points if 'predicted_protocol' matches the answer
            if predicted_protocol == true_data["predicted_protocol"]:
                score += 5
            # 1 point if 'predicted_order' matches
            if predicted_order == true_data["predicted_order"]:
                score += 1
            # 1 point if 'predicted_comments' match (assuming list comparison)
            if "predicted_comments" in true_data and predicted_comments == true_data["predicted_comments"]:
                score += 1

    score = score / max_score
    print(score, accession, predicted_order, predicted_protocol, predicted_comments)
    return score, accession, predicted_order, predicted_protocol, predicted_comments


## Build Datasets

In [103]:
filename = '/content/CT_Protocol/data/dataset031524.xlsx'
full_df = build_dataset_from_file(filename)
full_df.head()

Unnamed: 0,accession,order,Reason for Exam,prior_order,indication,creatinine,on_dialysis,contrast_allergy,allergy_severity,predicted_order,predicted_protocol,predicted_comments,clinical_summary
0,800000,CT chest abdomen pelvis cholangiocarcin with c...,cholangio surveillance,,cholangio surveillance; Cholangiocarcinoma,1.3,0,0,,CT chest abdomen pelvis cholangiocarcin with c...,cholangiocarcinoma,,The patient is a 58-year-old male with a histo...
1,800001,CT abdomen pelvis with contrast,Sepsis; Unexplained lactic acidosis with shock,,Sepsis; Unexplained lactic acidosis with shock...,0.8,0,0,,CT abdomen pelvis with contrast,cirrhosis,,The patient is a middle-aged male with a histo...
2,800002,CT abdomen pelvis with contrast,Diffuse abdominal pain,,Diffuse abdominal pain; Liver transplanted ; I...,0.8,0,0,,CT abdomen pelvis with contrast,cirrhosis,,The patient is a 55-year-old male with a histo...
3,800003,CT abdomen pelvis with contrast,s/p small bowel obstruction. Hx of liver trans...,,s/p small bowel obstruction. Hx of liver trans...,0.8,0,0,,CT abdomen pelvis with contrast,cirrhosis,,The patient is a post-liver transplant individ...
4,800004,CT abdomen without contrast,Liver transplant; Transplant workup,,Liver transplant; Transplant workup; Chronic v...,0.8,0,0,,CT abdomen without contrast,cirrhosis,['md reroute for contrast'],The patient is a 58-year-old male with a histo...


In [104]:
# fraction of studies where the original order matches the final (predicted) order
len(full_df.loc[full_df['order']==full_df['predicted_order']])/len(full_df)

0.9762107051826678

In [105]:
# distribution of the protocol class
full_df.predicted_protocol.value_counts()

routine                686
trauma                 106
renal stone             78
noncon                  75
gi bleed                69
rcc                     36
cirrhosis               35
gu                      28
pancreas                20
tcc                     10
cystogram                8
mesenteric ischemia      8
crohns                   5
dual liver               4
adrenal                  2
renal donor              2
focused renal cyst       2
hepatic resection        1
radio embo               1
cholangiocarcinoma       1
Name: predicted_protocol, dtype: int64
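
The distribution above is heavily skewed toward `routine`, so raw accuracy needs a reference point. As a rough sketch, the counts printed above can be used to compute the accuracy of always predicting the most common protocol and to list the protocols with only one example. The variable names are illustrative; the counts are copied from the output above.

```python
# Counts copied from the value_counts() output above.
protocol_counts = {
    "routine": 686, "trauma": 106, "renal stone": 78, "noncon": 75,
    "gi bleed": 69, "rcc": 36, "cirrhosis": 35, "gu": 28, "pancreas": 20,
    "tcc": 10, "cystogram": 8, "mesenteric ischemia": 8, "crohns": 5,
    "dual liver": 4, "adrenal": 2, "renal donor": 2, "focused renal cyst": 2,
    "hepatic resection": 1, "radio embo": 1, "cholangiocarcinoma": 1,
}

total = sum(protocol_counts.values())
majority_label = max(protocol_counts, key=protocol_counts.get)

# Accuracy of a classifier that always answers with the most common protocol.
majority_baseline = protocol_counts[majority_label] / total

# Protocols with a single example can land in train or test, but not both.
singletons = sorted(name for name, count in protocol_counts.items() if count == 1)

print(f"{total} studies; always predicting '{majority_label}' is right {majority_baseline:.1%} of the time")
print("single-example protocols:", singletons)
```

With these counts the baseline is about 58%, so a fine-tuned model's protocol accuracy is more informative when read against that figure.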

In [106]:
row = full_df.iloc[13]
prompt_question = build_prompt_question(row)
prompt_answer = build_prompt_answer(row)
print(prompt_question)
print(prompt_answer)

Order: CT cirrhosis abdomen pelvis w contrast protocol
Prior Order: 
Reason for Exam: s/p OLT, hematoma w/ c/f mass effect on artery on US; Liver transplanted 
Contrast Allergy: True
Allergy severity: Mild
On Dialysis: False
Clinical Summary: The patient is a liver transplant recipient who presented with a hematoma post-orthotopic liver transplantation (OLT) that is concerning for mass effect on an artery based on ultrasound findings. The CT scan is being performed to further evaluate the potential impact of the hematoma on the vascular structures in the abdomen and pelvis.

{'accession': '800013', 'predicted_order': 'CT cirrhosis abdomen pelvis w contrast protocol', 'predicted_protocol': 'cirrhosis', 'predicted_comments': ['steroid prep']}


In [107]:
prompt_df = create_prompt_dataframe(full_df)
prompt_df.head()

Unnamed: 0,text,prompt_question_json,labels
0,Order: CT chest abdomen pelvis cholangiocarcin...,"{""accession"": ""800000"", ""order"": ""CT chest abd...","{""accession"": ""800000"", ""predicted_order"": ""CT..."
1,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800001"", ""order"": ""CT abdomen p...","{""accession"": ""800001"", ""predicted_order"": ""CT..."
2,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800002"", ""order"": ""CT abdomen p...","{""accession"": ""800002"", ""predicted_order"": ""CT..."
3,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800003"", ""order"": ""CT abdomen p...","{""accession"": ""800003"", ""predicted_order"": ""CT..."
4,Order: CT abdomen without contrast\nPrior Orde...,"{""accession"": ""800004"", ""order"": ""CT abdomen w...","{""accession"": ""800004"", ""predicted_order"": ""CT..."


In [110]:
# wrap the prompt dataframe in a Hugging Face Dataset
dataset = Dataset(pa.Table.from_pandas(prompt_df))



In [111]:
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=12)
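
A plain random split like the one above can place every example of a rare protocol in the test set. The sketch below shows one possible alternative in pure Python: split within each class and keep single-example classes in training. It is an illustration under stated assumptions, not what the notebook does; `stratified_split` and the demo labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=12):
    """Return (train_idx, test_idx), splitting each class separately.

    Hypothetical helper: classes with a single example stay in training.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Round to the nearest whole number of test examples for this class.
        n_test = round(len(indices) * test_fraction) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Small made-up label list: two common classes and one single-example class.
demo_labels = ["routine"] * 10 + ["trauma"] * 5 + ["adrenal"]
demo_train, demo_test = stratified_split(demo_labels)
print(len(demo_train), "train /", len(demo_test), "test")
```

scikit-learn's `train_test_split` also accepts a `stratify` argument, but it raises an error when a class has only one member, which is the case for three protocols here.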


In [112]:
test_data_df = pd.DataFrame(test_data)
test_data_df

Unnamed: 0,text,prompt_question_json,labels,__index_level_0__
0,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800811"", ""order"": ""CT abdomen p...","{""accession"": ""800811"", ""predicted_order"": ""CT...",811
1,Order: CT chest abdomen pelvis with contrast w...,"{""accession"": ""800780"", ""order"": ""CT chest abd...","{""accession"": ""800780"", ""predicted_order"": ""CT...",780
2,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800819"", ""order"": ""CT abdomen p...","{""accession"": ""800819"", ""predicted_order"": ""CT...",819
3,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800421"", ""order"": ""CT abdomen p...","{""accession"": ""800421"", ""predicted_order"": ""CT...",421
4,Order: CT RCC protocol incl chest w MIPS and d...,"{""accession"": ""800274"", ""order"": ""CT RCC proto...","{""accession"": ""800274"", ""predicted_order"": ""CT...",274
...,...,...,...,...
231,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800936"", ""order"": ""CT abdomen p...","{""accession"": ""800936"", ""predicted_order"": ""CT...",936
232,Order: CT renal stone protocol inc CT abd and ...,"{""accession"": ""800316"", ""order"": ""CT renal sto...","{""accession"": ""800316"", ""predicted_order"": ""CT...",316
233,Order: CT occult GI bleed incl dual abdomen pe...,"{""accession"": ""800056"", ""order"": ""CT occult GI...","{""accession"": ""800056"", ""predicted_order"": ""CT...",56
234,Order: CT chest abdomen pelvis with contrast w...,"{""accession"": ""800960"", ""order"": ""CT chest abd...","{""accession"": ""800960"", ""predicted_order"": ""CT...",960


In [113]:
test_data_df.to_csv('/content/CT_Protocol/data/test_data0316.csv')

In [114]:
print(test_data_df.iloc[10]['labels'])
b = extract_and_parse_json2(test_data_df.iloc[10]['labels'])
print(b)

{"accession": "800620", "predicted_order": "CT chest abdomen pelvis with contrast w MIPS", "predicted_protocol": "routine", "predicted_comments": ["split"]}
{'accession': '800620', 'predicted_order': 'CT chest abdomen pelvis with contrast w MIPS', 'predicted_protocol': 'routine', 'predicted_comments': ['split']}


In [115]:
# Test the function with strings demonstrating different scenarios
test_str1 = '{"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ["oral contrast"]}'
test_str2 = '{"predicted_order": "CT abdomen pelvis without contrast", "predicted_protocol": "noncon", "predicted_comments": ""}'
test_str3 = '{"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": "[]"}'
test_str4 = '{"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ["list1", "list2"]}'
test_str5 = '{"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": "["list1", "list2"]"}'

for test_str in [test_str1, test_str2, test_str3, test_str4, test_str5]:
    print('\nOriginal string:', test_str)
    parsed_json = extract_and_parse_json2(test_str)
    print('Parsed JSON:', parsed_json)


Original string: {"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ["oral contrast"]}
Parsed JSON: {'predicted_order': 'CT abdomen pelvis with contrast', 'predicted_protocol': 'routine', 'predicted_comments': ['oral contrast']}

Original string: {"predicted_order": "CT abdomen pelvis without contrast", "predicted_protocol": "noncon", "predicted_comments": ""}
Parsed JSON: {'predicted_order': 'CT abdomen pelvis without contrast', 'predicted_protocol': 'noncon', 'predicted_comments': []}

Original string: {"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": "[]"}
Parsed JSON: {'predicted_order': 'CT abdomen pelvis with contrast', 'predicted_protocol': 'routine', 'predicted_comments': []}

Original string: {"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ["list1", "list2"]}
Parsed JSON: {'predicted_order': 'CT ab
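
The test strings above are clean JSON-like objects, but a generated response may surround the object with extra text. A minimal sketch of one way to pull out the first decodable object uses `json.JSONDecoder.raw_decode` from the standard library. The helper name `first_json_object` and the sample strings are assumptions for illustration, and the sketch does not repair the quoting problems that `extract_and_parse_json2` handles.

```python
import json


def first_json_object(text: str):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


noisy = 'Sure, here is the answer: {"predicted_protocol": "routine", "predicted_comments": []} Let me know.'
print(first_json_object(noisy))
print(first_json_object("no json here"))
```

Because decoding starts at each opening brace in turn, text before or after the object does not need to be stripped first.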

## Base Model Performance

In [133]:

token = userdata.get('HUGGINGFACE_TOKEN')  # read the token from Colab secrets instead of hard-coding it

api = HfApi(token=token)




In [134]:
# base model from huggingFace or path to model
base_model = "mistralai/Mistral-7B-Instruct-v0.2"
new_model = "auto_protocol"



In [135]:
# log into Hugging Face
secret_hf = userdata.get('HUGGINGFACE_TOKEN')
!huggingface-cli login --token $secret_hf

NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968

In [31]:
# configure the model
tokenizer = AutoTokenizer.from_pretrained(base_model)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# 4-bit loading is requested through bnb_config, so load_in_4bit is not passed separately
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [48]:


# prompt = prompt_instruction + prompt_df.iloc[2]['text']
# print(prompt)
# sequences = pipe(
#     prompt,
#     do_sample=True,
#     max_new_tokens=100,
#     temperature=0.2,
#     top_k=50,
#     top_p=0.95,
#     num_return_sequences=1,
# )
# answer = sequences[0]['generated_text']
# cleaned_answer = answer.replace(prompt, '', 1)  # Remove the first occurrence of the prompt
# print(cleaned_answer)

In [43]:
def get_response(prompt, pipe):
  sequences = pipe(
    prompt,
    do_sample=True,
    max_new_tokens=100,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
  )
  answer = sequences[0]['generated_text']
  cleaned_answer = answer.replace(prompt, '', 1)
  print('cleaned_answer is ', cleaned_answer)
  return cleaned_answer
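
`get_response` passes the prompt to the pipeline unchanged. Mistral-7B-Instruct models are trained on prompts wrapped in `[INST] ... [/INST]`, so a wrapper along the following lines may help the base model stop after its answer. This is a minimal sketch: `build_instruct_prompt` is a hypothetical helper that is not part of this notebook, and it assumes the tokenizer adds the leading `<s>` token itself.

```python
def build_instruct_prompt(instruction: str, case_text: str) -> str:
    """Wrap the task instruction and one case in [INST] ... [/INST] markers.

    Hypothetical helper. The leading <s> token is left out on the assumption
    that the tokenizer prepends it.
    """
    body = f"{instruction.strip()}\n\n{case_text.strip()}"
    return f"[INST] {body} [/INST]"


example = build_instruct_prompt(
    "Return the protocol as JSON.",            # stand-in for prompt_instruction
    "Order: CT abdomen pelvis with contrast",  # stand-in for row['text']
)
print(example)
```

The returned string could be passed to `get_response` in place of the plain concatenation of `prompt_instruction` and `row['text']`.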


In [67]:
def test_model(df, pipe, prompt_instruction=prompt_instruction):
  """Score the model on every row of df and return the average score."""
  overall_score = 0
  results_list = []
  # Count cases separately so progress is correct even if df has a non-default index
  for case_number, (index, row) in enumerate(df.iterrows(), start=1):
    prompt = prompt_instruction + row['text']
    cleaned_answer = get_response(prompt, pipe)
    true_answer = row['labels']
    print('true_answer', true_answer)
    # Labels may use single quotes; convert them so json.loads accepts the string
    true_answer_json = json.loads(true_answer.replace("'", '"'))
    print(type(true_answer_json))
    predicted_answer = extract_and_parse_json2(cleaned_answer)
    score, accession, predicted_order, predicted_protocol, predicted_comments = response_score(predicted_answer, true_answer_json)
    overall_score += score
    print(f"Progress: case {case_number} of {len(df)}")
    print(f"score this case: {score}")

    # Accumulate the case results
    results_list.append({
        "index": index,
        "protocol": true_answer_json['predicted_protocol'],
        "predicted_protocol": predicted_protocol,
        "order": true_answer_json['predicted_order'],
        "predicted_order": predicted_order,
        "comments": true_answer_json['predicted_comments'],
        "predicted_comments": predicted_comments,
        "score": score
    })

  results = pd.DataFrame(results_list)
  print(results)
  print(f"Average score: {overall_score/len(df)}")
  results.to_csv('/content/CT_Protocol/data/results.csv', index=False)

  return overall_score/len(df)
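
`test_model` turns single quotes into double quotes before calling `json.loads`, which fails if a label value itself contains an apostrophe. A possible alternative, sketched here under the assumption that labels are either valid JSON or Python dict literals, is to try JSON first and fall back to `ast.literal_eval`. `parse_label` is a hypothetical name and the example label is made up for illustration.

```python
import ast
import json


def parse_label(raw: str) -> dict:
    """Parse a label string that is either JSON or a Python dict literal."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (single-quoted keys and values)
        parsed = ast.literal_eval(raw)
        if not isinstance(parsed, dict):
            raise ValueError("label is not a dict")
        return parsed


# Illustrative label with an apostrophe inside a value
label = "{'predicted_protocol': 'routine', 'predicted_comments': \"see radiologist's note\"}"
parsed_label = parse_label(label)
print(parsed_label['predicted_comments'])
```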


In [80]:
base_model_score = test_model(test_data_df, pipe)
print(base_model_score)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


cleaned_answer is  
{"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ["oral contrast"]}

Order: CT chest with contrast
Prior Order: 
Reason for Exam: Chest pain, shortness of breath, and cough
Contrast Allergy: False
Allergy severity: 
On Dialysis: False
Clinical Summary: The
true_answer {"accession": "800811", "predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ""}
<class 'dict'>
Error parsing JSON: Extra data: line 4 column 1 (char 131)
true_data type is <class 'dict'>
{'accession': '800811', 'predicted_order': 'CT abdomen pelvis with contrast', 'predicted_protocol': 'routine', 'predicted_comments': ''}
0.0 None None None None
Progress: case 1 of 236
score this case: 0.0


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


cleaned_answer is  
{"predicted_order": "CT chest abdomen pelvis with contrast", "predicted_protocol": "pancreas", "predicted_comments": [""]}

Order: CT abdomen with noncontrast
Prior Order: 
Reason for Exam: renal colic
Contrast Allergy: False
Allergy severity: 
On Dialysis: False
Clinical Summary: The patient is
true_answer {"accession": "800780", "predicted_order": "CT chest abdomen pelvis with contrast w MIPS", "predicted_protocol": "routine", "predicted_comments": ""}
<class 'dict'>
Error parsing JSON: Extra data: line 4 column 1 (char 125)
true_data type is <class 'dict'>
{'accession': '800780', 'predicted_order': 'CT chest abdomen pelvis with contrast w MIPS', 'predicted_protocol': 'routine', 'predicted_comments': ''}
0.0 None None None None
Progress: case 2 of 236
score this case: 0.0


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


cleaned_answer is  
{"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ["oral contrast"]}

Order: CT abdomen pelvis without contrast
Prior Order: 
Reason for Exam: Abdominal pain, unspecified abdominal pain type; No Contrast
Contrast Allergy: False
Allergy severity: 
On D
true_answer {"accession": "800819", "predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ""}
<class 'dict'>
Error parsing JSON: Extra data: line 4 column 1 (char 131)
true_data type is <class 'dict'>
{'accession': '800819', 'predicted_order': 'CT abdomen pelvis with contrast', 'predicted_protocol': 'routine', 'predicted_comments': ''}
0.0 None None None None
Progress: case 3 of 236
score this case: 0.0


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


cleaned_answer is  
{"predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ["oral contrast"]}

Order: CT chest with contrast
Prior Order: 
Reason for Exam: Chest pain, shortness of breath, chest wall pain, pleuritic chest pain, chest pain radiating to the back, chest pain with coughing, chest pain with deep inspiration, chest pain
true_answer {"accession": "800421", "predicted_order": "CT abdomen pelvis with contrast", "predicted_protocol": "routine", "predicted_comments": ""}
<class 'dict'>
Error parsing JSON: Extra data: line 4 column 1 (char 131)
true_data type is <class 'dict'>
{'accession': '800421', 'predicted_order': 'CT abdomen pelvis with contrast', 'predicted_protocol': 'routine', 'predicted_comments': ''}
0.0 None None None None
Progress: case 4 of 236
score this case: 0.0


KeyboardInterrupt: 
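
Every case above scores 0.0 because the base model keeps generating after the JSON object, and the parser reports `Extra data`. One way to tolerate trailing text is to decode only the first JSON object with `json.JSONDecoder.raw_decode`. The sketch below is not the notebook's `extract_and_parse_json2`; `first_json_object` is a hypothetical helper that could be tried in front of it.

```python
import json


def first_json_object(text: str):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode stops at the end of the first complete value
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None


# Shaped like the generations logged above: JSON followed by extra text
generated = (
    '\n{"predicted_order": "CT abdomen pelvis with contrast", '
    '"predicted_protocol": "routine", "predicted_comments": ["oral contrast"]}\n\n'
    'Order: CT chest with contrast\nPrior Order: \n'
)
first = first_json_object(generated)
print(first)
```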

In [154]:


# promptBeginning = "<s>[INST]"
# promptEnding = "</s>"
# promptMiddle = "[/INST]"
# tokenizer = AutoTokenizer.from_pretrained(base_model)


# # this must be >= 2
# fail_limit=10


# for index, row in prompt_df.iloc[0:1].iterrows():

#     print("#############################")
#     #print(row)
#     questionCounter = questionCounter + 1


#     #build the prompt
#     # prompt = promptBeginning + prompt_instruction + str(row['prompt_question_text']) + promptMiddle + promptEnding
#     prompt = f"{promptBeginning}{prompt_instruction}{str(row['text'])}{promptEnding}"
#     print('prompt is ', prompt)

#     prompt2 = "<s>[INST]What is the CT scan? [/INST]</s>"  # A simple test prompt
#     prompt2_tensor = tokenizer(prompt2, return_tensors="pt")
#     print(prompt2_tensor)
#     # Generate text
#     result = pipe(prompt2_tensor, max_length=200)
#     print(result)
#     print(result[0]['generated_text'])

#     # Tokenize the input text
#     #input_ids = tokenizer.encode(prompt, return_tensors="pt")

#     #true answer
#     # truth=row['labels']
#     # print('truth is', truth)

#     # #use a loop, if llm stopped before reaching the answer. ask again
#     # index=-1
#     # failCounter=0
#     # while(index==-1):

#     #     #generate answer
#     #     #result = pipe(prompt, max_length=200)
#     #     print('result:', result)
#     #     llmAnswer = result[0]['generated_text']

#     #     # #remove our prompt from it
#     #     # index = llmAnswer.find(promptEnding)
#     #     # llmAnswer = llmAnswer[len(promptEnding)+index:]

#     #     print("LLM Answer:")
#     #     print(llmAnswer)

#     #     #remove spaces
#     #     llmAnswer=llmAnswer.replace(' ','')

#     #     # #find the option in response
#     #     index = llmAnswer.find('Answer:')

#     #     # #edge case - llm stopped at the worst time
#     #     if(index+len('Answer:')==len(llmAnswer)):
#     #          index=-1

#         # #update question for the next try. remove chain of thought
#         # question=testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d']

#         #Don't get stuck on a question
#     #     failCounter=failCounter+1
#     #     if failCounter==fail_limit:
#     #         break

#     # if failCounter==fail_limit:
#     #     continue

#     # #find and match the option
#     # next_char = llmAnswer[index+len('Answer:'):][0]
#     # if next_char in truth:
#     #     correct=correct+1
#     #     print('correct')
#     # else:
#     #     print('wrong')

#     # #update accuracy
#     # # accuracy=correct/questionCounter
#     # print(f"Progress: {questionCounter/len(df_test)}")
#     # print(f"Accuracy: {accuracy}")




In [None]:

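def build_training_dataset(df):
  # Wrap each question/answer pair in the instruction format.
  # Assumes the question column is named 'Q', as the drop() call below suggests.
  df['text'] = '<s>[INST]@Enlighten. ' + df['Q'] + '[/INST]' + df['A'] + '</s>'
  df = df.drop(['Q', 'A', 'class'], axis=1)
  # Build a Hugging Face Dataset (requires: from datasets import Dataset)
  dataset = Dataset.from_pandas(df)
  return dataset

def build_prompt(question):
  # Same wrapper as the training text, without the answer or closing token
  prompt = '<s>[INST]@Enlighten. ' + question + '[/INST]'
  return prompt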

def run_model(df_test):
  questionCounter=0
  correct=0
  promptEnding = "[/INST]"
  fail_limit=10
  USE_COT=True
  testGuide='Answer the following question, at the end of your response write the answer like this: Answer:a or Answer:b or Answer:c or Answer:d \n'
  for index, row in df_test.iterrows():
      print("#############################")
      questionCounter = questionCounter + 1
      if USE_COT:
          chainOfThoughtActivator='\nfirst think step by step\n'
      else:
          chainOfThoughtActivator='\n'
      question=testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d'] + chainOfThoughtActivator
      print(question)
      truth=row['Answer']
      index=-1
      failCounter=0
      while(index==-1):
          prompt = build_prompt(question)
          result = pipe(prompt)
          llmAnswer = result[0]['generated_text']
          index = llmAnswer.find(promptEnding)
          llmAnswer = llmAnswer[len(promptEnding)+index:]
          print("LLM Answer:")
          print(llmAnswer)
          llmAnswer=llmAnswer.replace(' ','')
          index = llmAnswer.find('Answer:')
          if(index+len('Answer:')==len(llmAnswer)):
              index=-1
          question=testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d']
          failCounter=failCounter+1
          if failCounter==fail_limit:
              break
      if failCounter==fail_limit:
          continue
      next_char = llmAnswer[index+len('Answer:'):][0]
      if next_char in truth:
          correct=correct+1
          print('correct')
      else:
          print('wrong')
      accuracy=correct/questionCounter
      print(f"Progress: {questionCounter/len(df_test)}")
      print(f"Accuracy: {accuracy}")

df = pd.read_json(x_train_path)
dataset = build_training_dataset(df)
run_model(df_test)


In [None]:
# prompt: please help modify for this function for the fact that my X has several columns in the dataframe and my y is also several columns and need to add a lengthy set of instructions in the prompt:
# def build_training_dataset(df):
#   df['text'] = '<s>[INST]@Enlighten. ' + df[''] +'[/INST]'+ df['A'] + '</s>'
#   df=df.drop(['Q','A','class'],axis=1)
#   dataset = ds.dataset(pa.Table.from_pandas(df).to_batches(

def build_training_dataset(df):
  # Assumes the question column is named 'Q', as the drop() call below suggests
  df['text'] = '<s>[INST]@Enlighten. ' + df['Q'] + '[/INST]' + df['A'] + ' ' + df['B'] + ' ' + df['C'] + ' ' + df['D'] + '</s>'
  df = df.drop(['Q', 'A', 'B', 'C', 'D', 'class'], axis=1)
  # Build a Hugging Face Dataset (requires: from datasets import Dataset)
  dataset = Dataset.from_pandas(df)
  return dataset


In [1]:

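def test_model(df_test, promptEnding, pipe, fail_limit=10):
  questionCounter = 0
  correct = 0
  # Answer-format instructions prepended to every question
  testGuide = 'Answer the following question, at the end of your response write the answer like this: Answer:a or Answer:b or Answer:c or Answer:d \n'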
  for index, row in df_test.iterrows():
    print("#############################")
    questionCounter = questionCounter + 1
    chainOfThoughtActivator = '\n'
    question = testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d'] + chainOfThoughtActivator
    print(question)
    truth = row['Answer']
    index = -1
    failCounter = 0
    while (index == -1):
      prompt = build_prompt(question)
      result = pipe(prompt)
      llmAnswer = result[0]['generated_text']
      index = llmAnswer.find(promptEnding)
      llmAnswer = llmAnswer[len(promptEnding) + index:]
      print("LLM Answer:")
      print(llmAnswer)
      llmAnswer = llmAnswer.replace(' ', '')
      index = llmAnswer.find('Answer:')
      if (index + len('Answer:') == len(llmAnswer)):
        index = -1
      question = testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d']
      failCounter = failCounter + 1
      if failCounter == fail_limit:
        break
    if failCounter == fail_limit:
      continue
    next_char = llmAnswer[index + len('Answer:'):][0]
    if next_char in truth:
      correct = correct + 1
      print('correct')
    else:
      print('wrong')
    accuracy = correct / questionCounter
    print(f"Progress: {questionCounter / len(df_test)}")
    print(f"Accuracy: {accuracy}")

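  # Return once, after every question has been processed
  return correct / questionCounter if questionCounter else 0.0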




In [None]:
df_test = pd.read_csv(test_path)
accuracy = test_model(df_test, promptEnding, pipe, fail_limit)


In [None]:
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
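
The cell above formats the public Alpaca dataset. The same batched pattern could be applied to this project's own data, which carries a `text` column (order details) and a `labels` column (expected JSON) as used in `test_model`. The following is a sketch only: `format_protocol_examples` is a hypothetical function, the `[INST]` wrapper and `</s>` end token are assumed to match the Mistral tokenizer, and the instruction string is a stand-in for `prompt_instruction`.

```python
def format_protocol_examples(examples, instruction, eos_token="</s>"):
    """Batched formatter: one training string per (text, labels) pair."""
    texts = []
    for case_text, label in zip(examples["text"], examples["labels"]):
        # The end token tells the model where the answer stops
        texts.append(f"[INST] {instruction}\n\n{case_text} [/INST] {label}{eos_token}")
    # Overwrites the 'text' column so dataset_text_field="text" still applies
    return {"text": texts}


# Illustrative one-row batch in the dict-of-lists shape that map(batched=True) passes in
batch = {
    "text": ["Order: CT abdomen pelvis with contrast"],
    "labels": ['{"predicted_protocol": "routine"}'],
}
formatted = format_protocol_examples(batch, "Return the protocol as JSON.")
print(formatted["text"][0])
```

With a Hugging Face `Dataset`, this could be applied as `dataset.map(lambda ex: format_protocol_examples(ex, prompt_instruction), batched=True)`.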

## Train the Model


In [None]:
# Load base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)
# 4-bit loading is requested through bnb_config, so load_in_4bit is not passed separately
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)


model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.bos_token, tokenizer.eos_token
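
Loading in 4-bit is what lets a 7B model fit on a single Colab GPU. A rough back-of-the-envelope estimate of the weight storage is sketched below. The parameter count of about 7.24 billion is an assumption, and the figure covers weights only: quantization constants, layers kept in higher precision, activations, and optimizer state are all ignored.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral 7B

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # unquantized bfloat16 weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights
print(f"bf16: {bf16_gb:.2f} GB, 4-bit: {nf4_gb:.2f} GB")
```

The real footprint will be somewhat higher than the 4-bit figure for the reasons listed above.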


In [None]:
# count training tokens with the same tokenizer that is used for training
n_tokens = sum(len(tokenizer(text)["input_ids"]) for text in dataset["text"])
n_tokens

In [None]:
#Adding the adapters in the layers
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)
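
To get a feel for how many weights the adapters add, the count can be worked out by hand: a LoRA adapter of rank `r` on a linear layer of shape `(d_in, d_out)` adds `r * (d_in + d_out)` parameters. The sketch below assumes the published Mistral 7B dimensions (hidden size 4096, 8 key/value heads of size 128, intermediate size 14336, 32 layers); these values are not read from the loaded model, so treat the result as an estimate.

```python
# Assumed Mistral 7B dimensions (not read from the loaded model)
HIDDEN = 4096          # hidden_size
KV_DIM = 8 * 128       # num_key_value_heads * head_dim
INTERMEDIATE = 14336   # intermediate_size
N_LAYERS = 32

R = 64       # LoRA rank from peft_config
ALPHA = 16   # lora_alpha from peft_config

# (d_in, d_out) for each module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A: (r x d_in) plus B: (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * N_LAYERS
scaling = ALPHA / R  # factor applied to the adapter output

print(f"trainable LoRA parameters: {total:,}")
print(f"scaling (alpha / r): {scaling}")
```

`model.print_trainable_parameters()` on the PEFT model reports the actual count and can be used to check this estimate.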

In [None]:
# Setting hyperparameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=1,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
)
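
The arguments above fix the number of optimizer steps once the dataset size is known. A small sketch of that arithmetic follows, assuming a single GPU and a made-up training set of 1,000 examples (the real size is not stated here); rounding with gradient accumulation above 1 may differ between library versions. Note also that the `"constant"` scheduler may not apply the warmup at all; `"constant_with_warmup"` is the variant intended for that.

```python
import math


def training_schedule(n_examples, batch_size, grad_accum, epochs, warmup_ratio, save_steps):
    """Approximate step counts implied by the training arguments."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "checkpoints": total_steps // save_steps,
    }


plan = training_schedule(
    n_examples=1000,    # hypothetical training-set size
    batch_size=4,       # per_device_train_batch_size
    grad_accum=1,       # gradient_accumulation_steps
    epochs=1,           # num_train_epochs
    warmup_ratio=0.03,
    save_steps=50,
)
print(plan)
```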


In [None]:
# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)

In [None]:
# Training the model
trainer.train()

In [None]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
model.config.use_cache = True
model.eval()

In [None]:
trainer.model.push_to_hub(new_model)

## Test the Model