<a href="https://colab.research.google.com/github/mille055/ct_protocol/blob/main/notebooks/Finetune_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href='https://ai.meng.duke.edu'> = <img align="left" style="padding-top:10px;" src=https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png>

##Chad Miller
##AIPI590 Project 1

This notebook fine_tunes an LLM (Mistral 7B) to perform the protocolling task for CT scans that have been ordered. The model takes as input the order details and items from the EHR such as serum creatinine, allergies, prior ct order, and a (now synthesized) summary of a clinic note and outputs the predicted protocol and any additional instructions.


In [1]:
!git clone 'https://github.com/mille055/CT_Protocol.git'
!pip install -U bitsandbytes
!pip install transformers==4.36.2
!pip install -U peft
!pip install -U accelerate
!pip install -U trl
!pip install datasets==2.16.0
!pip install sentencepiece
!pip install openpyxl
!pip install xlrd
!pip install openai
!pip install huggingface_hub

fatal: destination path 'CT_Protocol' already exists and is not an empty directory.


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,HfArgumentParser,TrainingArguments,pipeline, logging, LlamaTokenizer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os,re
import torch
from datasets import load_dataset, Dataset
from trl import SFTTrainer
import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
import numpy as np
from google.colab import userdata
import json
from sklearn.model_selection import train_test_split
from huggingface_hub import HfApi




  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(


## Build Datasets

In [3]:
def build_dataset_from_file2(file_path, num_test = 200, output_file = '/content/CT_Protocol/data/datacsv031424.csv'):
  """
  This function reads an excel file, splits into train/test datasets in a dataframe and saves as a csv file.

  Args:
    filename: The path to the excel file.

  Returns:
    A dataframe from which the train and test datasets can be derived.
  """
  # Read the excel file into a pandas dataframe
  df = pd.read_excel(filename)

  # Rename columns and identify x and y data
  new_column_names = {'Procedure': 'order', 'Reason for Exam Full': 'indication', 'Previous Procedure Name':'prior_order', 'Contrast Allergy': 'contrast_allergy', 'Allergy Severity': 'allergy_severity', 'Creatinine (mg/dL)':'creatinine', 'Dialysis': 'on_dialysis', 'clinical summary':'clinical_summary', 'Predicted Procedure': 'predicted_order', 'Protocol': 'predicted_protocol', 'Protocol comments':'predicted_comments', 'Accession':'accession'}
  df.rename(columns=new_column_names, inplace=True)
  df.fillna("", inplace=True)
  df['accession'] = df['accession'].astype(str)


  with open(output_file, 'w') as f:
    f.write(df.to_csv(index=False))

  return df





In [4]:
filename = '/content/CT_Protocol/data/dataset031324.xlsx'
full_df = build_dataset_from_file2(filename)
full_df.head()

Unnamed: 0,accession,order,Reason for Exam,prior_order,indication,creatinine,on_dialysis,contrast_allergy,allergy_severity,predicted_order,predicted_protocol,predicted_comments,clinical_summary
0,800000,CT chest abdomen pelvis cholangiocarcin with c...,cholangio surveillance,,cholangio surveillance; Cholangiocarcinoma,1.3,0,0,,CT chest abdomen pelvis cholangiocarcin with c...,cholangiocarcinoma,,The patient is a 58-year-old male with a histo...
1,800001,CT abdomen pelvis with contrast,Sepsis; Unexplained lactic acidosis with shock,,Sepsis; Unexplained lactic acidosis with shock...,0.8,0,0,,CT abdomen pelvis with contrast,cirrhosis,,The patient is a middle-aged male with a histo...
2,800002,CT abdomen pelvis with contrast,Diffuse abdominal pain,,Diffuse abdominal pain; Liver transplanted ; I...,0.8,0,0,,CT abdomen pelvis with contrast,cirrhosis,,The patient is a 55-year-old male with a histo...
3,800003,CT abdomen pelvis with contrast,s/p small bowel obstruction. Hx of liver trans...,,s/p small bowel obstruction. Hx of liver trans...,0.8,0,0,,CT abdomen pelvis with contrast,cirrhosis,,The patient is a post-liver transplant individ...
4,800004,CT abdomen without contrast,Liver transplant; Transplant workup,,Liver transplant; Transplant workup; Chronic v...,0.8,0,0,,CT abdomen without contrast,cirrhosis,['md reroute for contrast'],The patient is a 58-year-old male with a histo...


In [5]:
prompt_instruction = '''
Use the inputs to expertly decide on the appropriate protocol for the CT study.  The description of each CT protocol is given in the text below (the name to return for each protocol is in parentheses). For scans in which the order does not match the desired protocol, or if there are other outstanding questions the radiologist needs to resolve (e.g., elevated creatinine above 2.0 mg/dL or history of severe allergic reaction to IV contrast such as trouble breathing, throat swelling, or anaphylaxis, and contrast enhanced scan ordered), then add a comment that will route the case back to the radiologist (comments should be from the list given below.

Also, note it is okay to over-image, but ideally no simpler protocols for studies that require multiphase imaging.
Any acute hemorrhage is ideally scanned by the GI bleed protocol (CT with and without contrast, with arterial and venous phases). 6. Any study specifically ordered to evaluate the chest only (such as a PE study), or the thoracic-abdominal aorta or leg arteries should be routed to the chest/cardiovascular imaging division.
If there is an indeterminate renal or adrenal mass on the prior study, it can be evaluated by adding pre contrast images through the abdomen in addition to the regular protocol (comment: 'precontrast through kidneys or adrenals'). That can be added on without communicating with the ordering provider.

Here is a description of the protocols:

Routine (routine):  Protocol for most patients includes portal venous phase imaging.  No oral contrast is administered by default, but can be added through comments.

Noncontrast (noncon): Protocol when no contrast is indicated or there are contraindications such as severe allergy or elevated creatinine. There are reasons why contrast may not be wanted by the ordering provider even if no contraindications, such as a solitary kidney and mild renal failure.

Dual Liver Protocol (dual liver):  Known or suspected hypervascular liver tumor or suspected metastases from a primary tumor outside the liver for which there are suspected hypervascular liver metastases.  It includes an acquisition in both the hepatic arterial and portal venous phases.  Currently the list of malignancies for this protocol includes patients with neuroendocrine, carcinoid, and thyroid carcinoma.  Breast cancer and melanoma patients had been typically scanned with this protocol prior to July 2010, and if the patient has known lesions only seen on arterial phase images the patient may benefit from this protocol, but generally these malignancies can be evaluated with our routine CAP protocol. Coronal images of the portal venous phase are included (as are MIPs of the chest, if the chest is also ordered).

Cirrhosis Protocol (cirrhosis):  Known or suspected cirrhosis and/or have a known or suspected hepatocellular carcinoma.  It also should be performed in all patients with suspected benign primary liver tumors, such as focal nodular hyperplasia or hepatic adenoma.  This protocol includes acquisitions in the hepatic arterial and the portal venous phases, as well as a delayed phase.  This protocol calls for a larger volume of contrast material and a higher infusion rate than the Dual Liver Protocol.

Hepatic Resection Protocol (hepatic resection):  Indicated in all patients anticipating hepatic resection.  It includes thin section images of the liver to include celiac axis and proximal SMA during the hepatic arterial phase and thicker sections through the liver during the portal venous phase.  The images obtained during the hepatic arterial phase undergo volume rendering in 3D.

Radioembolization Protocol (radioembo):  Typically ordered by the Interventional Radiologists for evaluation of a patient following (and possibly before) embolization therapy. This includes arterial and venous phases through the abdomen. The post processing is slightly different than the cirrhosis protocol in that thin images are sent in both arterial and portal venous phases for the 3D lab to assess the vasculature and liver volumes. It should be specifically mentioned in the order, otherwise do not use the protocol.

Pancreas Protocol (pancreas):  Known or suspected pancreatic tumor.  It is occasionally requested in patients with either acute or chronic pancreatitis.  It includes thin section images of the pancreas to include the celiac axis and SMA during the pancreatic phase and images of the liver and pancreas during the venous phase.  Arterial phase images are reconstructed in 3D. Coronal images of the portal venous phase are included. Unless otherwise specified, imaging is of the abdomen only.

Cholangiocarcinoma Protocol (cholangiocarcinoma):  Known or suspected cholangiocarcinoma.  It includes images of the liver in the portal venous phase as well as through the hilum following a 10 minute delay.  Coronal reformats of the venous phase are included.

Trauma Chest/Abdomen/Pelvis (trauma):  Suspected trauma. Arterial phase imaging through the upper and mid chest followed by portal venous phase imaging of the abdomen and pelvis.  No oral contrast. Coronal and sagittal reformations of the spine are generated as a separate study interpreted by the MSK section.

Crohns Protocol (crohns):  Evaluation to look for suspected Crohns involvement, but not necessarily for complications of Crohn’s.  If a relatively asymptomatic patient, the patient receives VolumenTM (a negative contrast agent). Enteric arterial phase images of the abdomen and pelvis are acquired, and sagittal and coronal reformats are also included. Similar to other Bowel Protocol except that only a single phase is acquired to minimize radiation dose.

CT Colonography (colon):  For colon cancer and polyp screening.  The patient undergoes bowel prep the night before the scan as well as barium tagging.  Insufflation of CO2 via device after placement of tube into rectum.  Supine and prone imaging, as well as decubitus position if nondistended segments on the two standard positions.

Renal Stone Protocol (renal stone):  Acute flank pain and/or a known or suspected renal calculus.  It includes a low dose noncontrast CT of the kidneys, ureters and bladders with the patient in prone position (unless unable).  Coronal reformats are provided.

Genitourinary Protocol (gu):  Hematuria, known or suspected renal mass, or other indications where evaluation of the ureters is necessary.  It includes low dose non-contrast images of the kidneys only followed by nephrographic phase images of the kidneys and 7 min delayed excretory phase images of the kidneys, ureters, and bladder.  Coronal reformats of the excretory phase are included.

Focused Renal Cyst Protocol (focused renal cyst): For followup of a complicated renal cyst. Pre- and Post-contrast (Nephrographic) imaging through the kidneys to assess for enhancement. There is no imaging of the pelvis and no CT urogram. If in doubt, use the more complete GU Protocol.

RCC Protocol (rcc):  Known renal cell carcinoma, typically in patients who have undergone a nephrectomy or nephron sparing treatment, or possibly for preoperative planning.  It includes noncontrast images of the kidneys followed by a dual liver as described above to assess for metastases.  Coronal reformats in the venous phase are included.

TCC Protocol (tcc):  Intended for patients with known transitional cell carcinoma or bladder cancer, typically who have undergone a cystectomy, focal bladder surgery, or nephroureterectomy.  It is a split bolus technique, with the goal of imaging the patient in both the excretory phase and the portal venous phase in a single acquisition following two boluses of IV contrast prior to scanning.

Renal Donor Protocol (renal donor):  To evaluate the renal anatomy of potential renal donors.  This includes thin section images of the kidneys and renal arteries during the arterial phase and venous phase.  A delayed scout is obtained to document the number of ureters.  A separate 3-D interpretation is performed.

Adrenal Protocol (adrenal): For the evaluation of an indeterminate adrenal mass.  Noncontrast images through the adrenal gland.  The scan is then checked by a physician for further imaging.  If the physician deems necessary, portal venous and 15 minute delayed images through the adrenals follow.

CT Cystogram (cystogram):  To evaluate for bladder injury (typically after pelvic trauma) or a fistula.  Contrast (Renografin 60 diluted in saline) is instilled by gravity through an indwelling Foley catheter.



And for predicted_comments, add comments from the following list:
'oral contrast' if oral (otherwise known as PO) contrast has been requested in the indication.
'steroid prep' if has a mild allergy to contrast and a contrasted scan has been requested
'reroute contrast' if the patient has a contraindication to contrast such as elevated creatinine above 2.0 or severe/anaphylaxis contrast allergy
'reroute coverage' if addition body parts may need to be added to the planned procedure
'low pelvis' which extends the caudal range of a CT, particularly for malignancies that may not be fully imaged on our routine protocols which includes vulvar cancer, anal cancer, and perhaps rectal cancer if this is the first time evaluation or there is known recurrent disease low in the pelvis.  Things in the inguinal region or upper thigh or perirectal abscess or perianal fistulous disease may be other possible indications. Perineal infection such as Fournier's gangrene would aslo require this.
'reroute protocol' if there is a complex process such as a fistula that might not be evaluated well on the routine protocols.
'split' indicates the chest order will be read separately from the abdomen and pelvis, which occurs for lung and esophgeal cancer and for patients with lung transplants
'valsalva' indicates the imaging is performed while patient does a valsalva maneuver for evaluation of hernias.

The task is to use the provided information for each patient to return the predicted protocol for the CT study in the form of a json object like this:
{
  "predicted_order": "CT abdomen pelvis with contrast",
  "predicted_protocol": "routine",
  "predicted_comments": ["oral contrast"]
}


'''

In [6]:
def row_to_json(row, columns):
  """
  This function takes a row of a dataframe and returns a json object with the specified columns.

  Args:
    row: The row of the dataframe.
    columns: A list of columns to include in the json object.

  Returns:
    A json object with the specified columns.
  """
  json_obj = {}
  for column in columns:
    json_obj[column] = str(row[column])
  return json.dumps(json_obj)


In [7]:

def build_prompt_question(row, prompt_instruction=prompt_instruction):

  prompt_question = 'Order: ' + row['order'] + '\n' + \
    'Prior Order: ' + row['prior_order'] + '\n' + \
    'Reason for Exam: ' + row['indication'] + '\n' + \
    'Contrast Allergy: ' + str(bool(row['contrast_allergy'])) + '\n' + \
    'Allergy severity: ' + row['allergy_severity'] + '\n' + \
    'On Dialysis: ' + str(bool(row['on_dialysis'])) + '\n' + \
    'Clinical Summary: ' + row['clinical_summary'] + '\n'

  #print(prompt_question)
  return prompt_question




In [8]:
def build_prompt_answer(row, columns = ['accession', 'predicted_order', 'predicted_protocol', 'predicted_comments']):
  """
  This function takes a row of a dataframe and returns a prompt answer and prompt answer2.

  Args:
    row: The row of the dataframe.

  Returns:
    A prompt answer and prompt answer2.
  """
  prompt_answer = row_to_json(row, columns)
  return prompt_answer



In [9]:
def build_prompt_question_json(row, columns = ['accession', 'order', 'Reason for Exam', 'prior_order', 'indication',
       'creatinine', 'on_dialysis', 'contrast_allergy', 'allergy_severity', 'clinical_summary']):
  """
  This function takes a row of a dataframe and returns a prompt question.

  Args:
    row: The row of the dataframe.

  Returns:
    A prompt question in json format.
  """

  prompt_question = row_to_json(row, columns)
  return prompt_question

In [10]:
row = full_df.iloc[0]
prompt_question = build_prompt_question(row)
prompt_answer = build_prompt_answer(row)
print(prompt_question)
print(prompt_answer)

Order: CT chest abdomen pelvis cholangiocarcin with contrast with MIPS protocol
Prior Order: 
Reason for Exam: cholangio surveillance; Cholangiocarcinoma 
Contrast Allergy: False
Allergy severity: 
On Dialysis: False
Clinical Summary: The patient is a 58-year-old male with a history of cholangiocarcinoma undergoing surveillance. He has a past surgical history of cholecystectomy and hepatectomy. The CT scan is being performed to assess the progression or recurrence of cholangiocarcinoma.

{"accession": "800000", "predicted_order": "CT chest abdomen pelvis cholangiocarcin with contrast with MIPS protocol", "predicted_protocol": "cholangiocarcinoma", "predicted_comments": ""}


In [11]:
def create_prompt_dataframe(df):
  """
  This function takes a dataframe and returns a dataframe with the prompt questions and answers.

  Args:
    df: The dataframe to be converted.

  Returns:
    A dataframe with the prompt questions and answers.
  """
  df1 = pd.DataFrame()
  for index, row in df.iterrows():
    prompt_question_text = build_prompt_question(row)
    prompt_question_json = build_prompt_question_json(row)
    prompt_answer = build_prompt_answer(row)
    df1.at[index, 'text'] = prompt_question_text
    df1.at[index, 'prompt_question_json'] = prompt_question_json
    df1.at[index, 'labels'] = prompt_answer

  return df1



In [12]:
prompt_df = create_prompt_dataframe(full_df)
prompt_df.head()

Unnamed: 0,text,prompt_question_json,labels
0,Order: CT chest abdomen pelvis cholangiocarcin...,"{""accession"": ""800000"", ""order"": ""CT chest abd...","{""accession"": ""800000"", ""predicted_order"": ""CT..."
1,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800001"", ""order"": ""CT abdomen p...","{""accession"": ""800001"", ""predicted_order"": ""CT..."
2,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800002"", ""order"": ""CT abdomen p...","{""accession"": ""800002"", ""predicted_order"": ""CT..."
3,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800003"", ""order"": ""CT abdomen p...","{""accession"": ""800003"", ""predicted_order"": ""CT..."
4,Order: CT abdomen without contrast\nPrior Orde...,"{""accession"": ""800004"", ""order"": ""CT abdomen w...","{""accession"": ""800004"", ""predicted_order"": ""CT..."


In [13]:
dataset = ds.dataset(pa.Table.from_pandas(prompt_df).to_batches())
dataset = Dataset(pa.Table.from_pandas(prompt_df))



In [14]:
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=12)


In [15]:
test_data_df = pd.DataFrame(test_data)
test_data_df

Unnamed: 0,text,prompt_question_json,labels,__index_level_0__
0,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800811"", ""order"": ""CT abdomen p...","{""accession"": ""800811"", ""predicted_order"": ""CT...",811
1,Order: CT chest abdomen pelvis with contrast w...,"{""accession"": ""800780"", ""order"": ""CT chest abd...","{""accession"": ""800780"", ""predicted_order"": ""CT...",780
2,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800819"", ""order"": ""CT abdomen p...","{""accession"": ""800819"", ""predicted_order"": ""CT...",819
3,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800421"", ""order"": ""CT abdomen p...","{""accession"": ""800421"", ""predicted_order"": ""CT...",421
4,Order: CT RCC protocol incl chest w MIPS and d...,"{""accession"": ""800274"", ""order"": ""CT RCC proto...","{""accession"": ""800274"", ""predicted_order"": ""CT...",274
...,...,...,...,...
231,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800936"", ""order"": ""CT abdomen p...","{""accession"": ""800936"", ""predicted_order"": ""CT...",936
232,Order: CT renal stone protocol inc CT abd and ...,"{""accession"": ""800316"", ""order"": ""CT renal sto...","{""accession"": ""800316"", ""predicted_order"": ""CT...",316
233,Order: CT occult GI bleed incl dual abdomen pe...,"{""accession"": ""800056"", ""order"": ""CT occult GI...","{""accession"": ""800056"", ""predicted_order"": ""CT...",56
234,Order: CT chest abdomen pelvis with contrast w...,"{""accession"": ""800960"", ""order"": ""CT chest abd...","{""accession"": ""800960"", ""predicted_order"": ""CT...",960


In [16]:
test_data_df.to_csv('/content/CT_Protocol/data/test_data.csv')

In [17]:

token = "hf_GxLXMZiNbQOxkwJlqQOxZqQOxkwJlqQOxkwJlqQOxkwJl"

api = HfApi(token=token)



## Base Model Performance

In [18]:
# base model from huggingFace or path to model
base_model = "mistralai/Mistral-7B-Instruct-v0.2"
new_model = "auto_protocol"



In [19]:
# log into HuggingFace
secret_hf = userdata.get('HUGGINGFACE_TOKEN')
!huggingface-cli login --token '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'

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [24]:

#pipe = pipeline("zero-shot-classification", model=base_model)
pipe = pipeline(task="text-generation", model= base_model, max_length=4000, pad_token_id='eos_token_id')

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [21]:
prompt_df

Unnamed: 0,text,prompt_question_json,labels
0,Order: CT chest abdomen pelvis cholangiocarcin...,"{""accession"": ""800000"", ""order"": ""CT chest abd...","{""accession"": ""800000"", ""predicted_order"": ""CT..."
1,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800001"", ""order"": ""CT abdomen p...","{""accession"": ""800001"", ""predicted_order"": ""CT..."
2,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800002"", ""order"": ""CT abdomen p...","{""accession"": ""800002"", ""predicted_order"": ""CT..."
3,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""800003"", ""order"": ""CT abdomen p...","{""accession"": ""800003"", ""predicted_order"": ""CT..."
4,Order: CT abdomen without contrast\nPrior Orde...,"{""accession"": ""800004"", ""order"": ""CT abdomen w...","{""accession"": ""800004"", ""predicted_order"": ""CT..."
...,...,...,...
1172,Order: CT abdomen pelvis with contrast\nPrior ...,"{""accession"": ""801172"", ""order"": ""CT abdomen p...","{""accession"": ""801172"", ""predicted_order"": ""CT..."
1173,Order: CT abdomen pelvis with and without cont...,"{""accession"": ""801173"", ""order"": ""CT abdomen p...","{""accession"": ""801173"", ""predicted_order"": ""CT..."
1174,Order: CT abdomen pelvis with and without cont...,"{""accession"": ""801174"", ""order"": ""CT abdomen p...","{""accession"": ""801174"", ""predicted_order"": ""CT..."
1175,Order: CT abdomen pelvis with and without cont...,"{""accession"": ""801175"", ""order"": ""CT abdomen p...","{""accession"": ""801175"", ""predicted_order"": ""CT..."


In [22]:

questionCounter=0
correct=0
promptBeginning = "<s>[INST]"
promptEnding = "</s>"
promptMiddle = "[/INST]"
tokenizer = AutoTokenizer.from_pretrained(base_model)


# this must be >= 2
fail_limit=10


for index, row in prompt_df.iloc[0:1].iterrows():

    print("#############################")
    #print(row)
    questionCounter = questionCounter + 1


    #build the prompt
    # prompt = promptBeginning + prompt_instruction + str(row['prompt_question_text']) + promptMiddle + promptEnding
    prompt = f"{promptBeginning}{prompt_instruction}{str(row['text'])}{promptEnding}"
    print('prompt is ', prompt)

    # Tokenize the input text
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    #true answer
    truth=row['labels']
    print('truth is', truth)

    #use a loop, if llm stopped before reaching the answer. ask again
    index=-1
    failCounter=0
    while(index==-1):

        #generate answer
        result = pipe(prompt)
        print('result:', result)
        llmAnswer = result[0]['generated_text']

        # #remove our prompt from it
        # index = llmAnswer.find(promptEnding)
        # llmAnswer = llmAnswer[len(promptEnding)+index:]

        print("LLM Answer:")
        print(llmAnswer)

        #remove spaces
        llmAnswer=llmAnswer.replace(' ','')

        # #find the option in response
        index = llmAnswer.find('Answer:')

        # #edge case - llm stoped at the worst time
        if(index+len('Answer:')==len(llmAnswer)):
             index=-1

        # #update question for the next try. remove chain of thought
        # question=testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d']

        #Don't get stock on a question
        failCounter=failCounter+1
        if failCounter==fail_limit:
            break

    if failCounter==fail_limit:
        continue

    #find and match the option
    next_char = llmAnswer[index+len('Answer:'):][0]
    if next_char in truth:
        correct=correct+1
        print('correct')
    else:
        print('wrong')

    #update accuracy
    accuracy=correct/questionCounter
    print(f"Progress: {questionCounter/len(df_test)}")
    print(f"Accuracy: {accuracy}")




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


#############################
prompt is  <s>[INST]
Use the inputs to expertly decide on the appropriate protocol for the CT study.  The description of each CT protocol is given in the text below (the name to return for each protocol is in parentheses). For scans in which the order does not match the desired protocol, or if there are other outstanding questions the radiologist needs to resolve (e.g., elevated creatinine above 2.0 mg/dL or history of severe allergic reaction to IV contrast such as trouble breathing, throat swelling, or anaphylaxis, and contrast enhanced scan ordered), then add a comment that will route the case back to the radiologist (comments should be from the list given below.

Also, note it is okay to over-image, but ideally no simpler protocols for studies that require multiphase imaging.
Any acute hemorrhage is ideally scanned by the GI bleed protocol (CT with and without contrast, with arterial and venous phases). 6. Any study specifically ordered to evaluate the

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


result: [{'generated_text': '<s>[INST]\nUse the inputs to expertly decide on the appropriate protocol for the CT study.  The description of each CT protocol is given in the text below (the name to return for each protocol is in parentheses). For scans in which the order does not match the desired protocol, or if there are other outstanding questions the radiologist needs to resolve (e.g., elevated creatinine above 2.0 mg/dL or history of severe allergic reaction to IV contrast such as trouble breathing, throat swelling, or anaphylaxis, and contrast enhanced scan ordered), then add a comment that will route the case back to the radiologist (comments should be from the list given below.\n\nAlso, note it is okay to over-image, but ideally no simpler protocols for studies that require multiphase imaging.\nAny acute hemorrhage is ideally scanned by the GI bleed protocol (CT with and without contrast, with arterial and venous phases). 6. Any study specifically ordered to evaluate the chest o

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


result: [{'generated_text': '<s>[INST]\nUse the inputs to expertly decide on the appropriate protocol for the CT study.  The description of each CT protocol is given in the text below (the name to return for each protocol is in parentheses). For scans in which the order does not match the desired protocol, or if there are other outstanding questions the radiologist needs to resolve (e.g., elevated creatinine above 2.0 mg/dL or history of severe allergic reaction to IV contrast such as trouble breathing, throat swelling, or anaphylaxis, and contrast enhanced scan ordered), then add a comment that will route the case back to the radiologist (comments should be from the list given below.\n\nAlso, note it is okay to over-image, but ideally no simpler protocols for studies that require multiphase imaging.\nAny acute hemorrhage is ideally scanned by the GI bleed protocol (CT with and without contrast, with arterial and venous phases). 6. Any study specifically ordered to evaluate the chest o

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


result: [{'generated_text': '<s>[INST]\nUse the inputs to expertly decide on the appropriate protocol for the CT study.  The description of each CT protocol is given in the text below (the name to return for each protocol is in parentheses). For scans in which the order does not match the desired protocol, or if there are other outstanding questions the radiologist needs to resolve (e.g., elevated creatinine above 2.0 mg/dL or history of severe allergic reaction to IV contrast such as trouble breathing, throat swelling, or anaphylaxis, and contrast enhanced scan ordered), then add a comment that will route the case back to the radiologist (comments should be from the list given below.\n\nAlso, note it is okay to over-image, but ideally no simpler protocols for studies that require multiphase imaging.\nAny acute hemorrhage is ideally scanned by the GI bleed protocol (CT with and without contrast, with arterial and venous phases). 6. Any study specifically ordered to evaluate the chest o

KeyboardInterrupt: 

In [None]:

def build_training_dataset(df):
  df['text'] = '<s>[INST]@Enlighten. ' + df[''] +'[/INST]'+ df['A'] + '</s>'
  df=df.drop(['Q','A','class'],axis=1)
  dataset = ds.dataset(pa.Table.from_pandas(df).to_batches())
  dataset = Dataset(pa.Table.from_pandas(df))
  return dataset

def build_prompt(question):
  prompt = build_prompt(question)
  return prompt

def run_model(df_test):
  questionCounter=0
  correct=0
  promptEnding = "[/INST]"
  fail_limit=10
  USE_COT=True
  testGuide='Answer the following question, at the end of your response write the answer like this: Answer:a or Answer:b or Answer:c or Answer:d \n'
  for index, row in df_test.iterrows():
      print("#############################")
      questionCounter = questionCounter + 1
      if USE_COT:
          chainOfThoughtActivator='\nfirst think step by step\n'
      else:
          chainOfThoughtActivator='\n'
      question=testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d'] + chainOfThoughtActivator
      print(question)
      truth=row['Answer']
      index=-1
      failCounter=0
      while(index==-1):
          prompt = build_prompt(question)
          result = pipe(prompt)
          llmAnswer = result[0]['generated_text']
          index = llmAnswer.find(promptEnding)
          llmAnswer = llmAnswer[len(promptEnding)+index:]
          print("LLM Answer:")
          print(llmAnswer)
          llmAnswer=llmAnswer.replace(' ','')
          index = llmAnswer.find('Answer:')
          if(index+len('Answer:')==len(llmAnswer)):
              index=-1
          question=testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d']
          failCounter=failCounter+1
          if failCounter==fail_limit:
              break
      if failCounter==fail_limit:
          continue
      next_char = llmAnswer[index+len('Answer:'):][0]
      if next_char in truth:
          correct=correct+1
          print('correct')
      else:
          print('wrong')
      accuracy=correct/questionCounter
      print(f"Progress: {questionCounter/len(df_test)}")
      print(f"Accuracy: {accuracy}")

df = pd.read_json(x_train_path)
dataset = build_training_dataset(df)
run_model(df_test)


In [None]:
# prompt: please help modify for this function for the fact that my X has several columns in the dataframe and my y is also several columns and need to add a lengthy set of instructions in the prompt:
# def build_training_dataset(df):
#   df['text'] = '<s>[INST]@Enlighten. ' + df[''] +'[/INST]'+ df['A'] + '</s>'
#   df=df.drop(['Q','A','class'],axis=1)
#   dataset = ds.dataset(pa.Table.from_pandas(df).to_batches(

def build_training_dataset(df):
  df['text'] = '<s>[INST]@Enlighten. ' + df[''] +'[/INST]'+ df['A'] + ' ' + df['B'] + ' ' + df['C'] + ' ' + df['D'] + '</s>'
  df=df.drop(['Q','A','B','C','D','class'],axis=1)
  dataset = ds.dataset(pa.Table.from_pandas(df).to_batches())
  return dataset


In [None]:

def test_model(df_test, promptEnding, pipe, fail_limit=10):
  questionCounter = 0
  correct = 0
  for index, row in df_test.iterrows():
    print("#############################")
    questionCounter = questionCounter + 1
    chainOfThoughtActivator = '\n'
    question = testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d'] + chainOfThoughtActivator
    print(question)
    truth = row['Answer']
    index = -1
    failCounter = 0
    while (index == -1):
      prompt = build_prompt(question)
      result = pipe(prompt)
      llmAnswer = result[0]['generated_text']
      index = llmAnswer.find(promptEnding)
      llmAnswer = llmAnswer[len(promptEnding) + index:]
      print("LLM Answer:")
      print(llmAnswer)
      llmAnswer = llmAnswer.replace(' ', '')
      index = llmAnswer.find('Answer:')
      if (index + len('Answer:') == len(llmAnswer)):
        index = -1
      question = testGuide + row['Question'] + '\na)' + row['a'] + '\nb)' + row['b'] + '\nc)' + row['c'] + '\nd)' + row['d']
      failCounter = failCounter + 1
      if failCounter == fail_limit:
        break
    if failCounter == fail_limit:
      continue
    next_char = llmAnswer[index + len('Answer:'):][0]
    if next_char in truth:
      correct = correct + 1
      print('correct')
    else:
      print('wrong')
    accuracy = correct / questionCounter
    print(f"Progress: {questionCounter / len(df_test)}")
    print(f"Accuracy: {accuracy}")

    return accuracy




In [None]:
df_test = pd.read_csv(test_path)
accuracy = test_model(df_test, promptEnding, fail_limit)


In [None]:
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

## Train the Model


In [None]:
# Load base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_4bit=True,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
)


model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.bos_token, tokenizer.eos_token


In [None]:
# count training tokens

tokenizer_ = LlamaTokenizer.from_pretrained("cognitivecomputations/dolphin-llama2-7b")
tokens = tokenizer_.tokenize(dataset.to_pandas().to_string())
len(tokens)

In [None]:
#Adding the adapters in the layers
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)

In [None]:
# Setting hyperparameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=1,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
)


In [None]:
# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)

In [None]:
# Training the model
trainer.train()

In [None]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
model.config.use_cache = True
model.eval()

In [None]:
trainer.model.push_to_hub(new_model)

## Test the Model

In [None]:
# question_list = [{'order': 'CT CAP WO',
#  'indication': 'Persistent cough, rule out malignancy',
#  'creatinine': 0.9,
#  'allergies': {'iodine':'anaphylaxis'},
#  'prior_report': {'prior_order': 'CT Chest', 'prior_protocol': 'lung cancer screening', 'prior_report_text': 'No significant pulmonary nodules or masses.'},
#  'prior_clinic_notes': 'Smoker for 20 years, recent weight loss.'},
# {'order': 'CT Abd W',
#  'indication': 'Elevated liver enzymes, rule out cirrhosis',
#  'creatinine': 1.2,
#  'allergies': {},
#  'prior_report': {'prior_order': 'Ultrasound Abd', 'prior_protocol': None, 'prior_report_text': 'Hepatomegaly with fatty infiltration.'},
#  'prior_clinic_notes': 'History of alcohol use, presenting with fatigue.'},
# {'order': 'CT AP W',
#  'indication': 'Chronic abdominal pain, rule out pancreatitis',
#  'creatinine': 1.1,
#  'allergies': {'contrast dye':'mild'},
#  'prior_report': {'prior_order': 'CT Abd', 'prior_protocol': 'dual pancreas', 'prior_report_text': 'Pancreas unremarkable, no evidence of acute pancreatitis.'},
#  'prior_clinic_notes': 'Recurrent episodes of abdominal pain, elevated amylase and lipase.'},
# {'order': 'CT AP W',
#  'indication': 'Suspected appendicitis',
#  'creatinine': 0.8,
#  'allergies': {},
#  'prior_report': {},
#  'prior_clinic_notes': 'Right lower quadrant pain, fever, elevated WBC count.'},
# {'order': 'CT CAP W',
#  'indication': 'Follow-up for colorectal cancer, rule out metastasis',
#  'creatinine': 1.3,
#  'allergies': {},
#  'prior_report': {'prior_order': 'CT CAP', 'prior_protocol': 'routine', 'prior_report_text': 'No evidence of recurrent disease or metastasis.'},
#  'prior_clinic_notes': 'History of stage II colorectal cancer, post-surgical resection.'},
# {'order': 'CT Abd W',
#  'indication': 'Check for hepatocellular carcinoma in cirrhotic patient',
#  'creatinine': 0.95,
#  'allergies': {},
#  'prior_report': {'prior_order': 'CT Abd W', 'prior_protocol': 'cirrhosis', 'prior_report_text': 'Cirrhosis with no focal liver lesions.'},
#  'prior_clinic_notes': 'Chronic hepatitis C, cirrhosis confirmed on biopsy.'},
# {'order': 'CT Abd Pelvis W',
#      'indication': 'Acute lower abdominal pain, rule out diverticulitis',
#      'creatinine': 1.0,
#      'allergies': {'contrast dye': 'mild'},
#      'prior_report': {'prior_order': 'CT Abd Pelvis W', 'prior_protocol': 'routine', 'prior_report_text': 'Mild diverticulosis without acute inflammation.'},
#      'prior_clinic_notes': 'Intermittent episodes of lower abdominal pain, recent episode with fever.'},
#     {'order': 'CT AP WO',
#      'indication': 'Suspected kidney stones',
#      'creatinine': 1.2,
#      'allergies': {},
#      'prior_report': {},
#      'prior_clinic_notes': 'Recurrent left flank pain, previous history of nephrolithiasis.'},
#     {'order': 'CT AP W',
#      'indication': 'Evaluation for ovarian cancer recurrence',
#      'creatinine': 0.9,
#      'allergies': {'contrast dye': 'severe'},
#      'prior_report': {'prior_order': 'CT AP W', 'prior_protocol': 'routine', 'prior_report_text': 'No evidence of recurrent disease.'},
#      'prior_clinic_notes': 'CA-125 levels rising, patient asymptomatic.'},
#     {'order': 'CT CAP W',
#      'indication': 'Staging for newly diagnosed lymphoma',
#      'creatinine': 1.1,
#      'allergies': {},
#      'prior_report': {},
#      'prior_clinic_notes': 'Recent diagnosis of non-Hodgkin lymphoma, staging required.'},
#     {'order': 'CT Abd W',
#      'indication': 'Hepatic lesion follow-up',
#      'creatinine': 1.3,
#      'allergies': {'iodine': 'anaphylaxis'},
#      'prior_report': {'prior_order': 'CT Abd WO', 'prior_protocol': 'noncontrast', 'prior_report_text': 'Limited noncontrast examination.'},
#      'prior_clinic_notes': 'Lesion discovered incidentally, previous biopsy non-diagnostic.'},
#     {'order': 'CT CAP W',
#      'indication': 'Pre-operative assessment for pancreatic cancer',
#      'creatinine': 0.8,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT CAP W', 'prior_protocol': 'dual pancreas', 'prior_report_text': 'Mass in the head of the pancreas, no distant metastasis.'},
#      'prior_clinic_notes': 'Jaundice, weight loss, and new-onset diabetes.'},
#     {'order': 'CT Abd W',
#      'indication': 'Chronic liver disease surveillance',
#      'creatinine': 1.0,
#      'allergies': {'contrast dye': 'mild'},
#      'prior_report': {'prior_order': 'CT Abd W', 'prior_protocol': 'cirrhosis', 'prior_report_text': 'Hepatic steatosis, no signs of cirrhosis.'},
#      'prior_clinic_notes': 'Patient with chronic Hepatitis B, monitoring for cirrhosis.'},
#     {'order': 'CT AP W',
#      'indication': 'Rule out metastatic disease in prostate cancer',
#      'creatinine': 1.2,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP W', 'prior_protocol': 'routine', 'prior_report_text': 'Prostate gland enlarged with nodular density.'},
#      'prior_clinic_notes': 'Elevated PSA levels, biopsy-confirmed adenocarcinoma.'},
#     {'order': 'CT AP W',
#      'indication': 'Evaluation for inflammatory bowel disease',
#      'creatinine': 1.1,
#      'allergies': {'contrast dye': 'mild'},
#      'prior_report': {'prior_order': 'CT Abd Pelvis W', 'prior_protocol': 'bowel crohns', 'prior_report_text': 'Thickening of the terminal ileum, suggestive of Crohn’s disease.'},
#      'prior_clinic_notes': 'Chronic abdominal pain, diarrhea, and weight loss.'},
# {'order': 'CT CAP W',
#      'indication': 'Rule out metastatic disease in breast cancer',
#      'creatinine': 1.2,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT CAP W', 'prior_protocol': 'routine', 'prior_report_text': 'No metastatic disease.'},
#      'prior_clinic_notes': 'Patient with stage III breast cancer.'},
# {'order': 'CT AP WWO',
#      'indication': 'Transitional cell carcinoma',
#      'creatinine': 1.2,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP WWO', 'prior_protocol': 'tcc', 'prior_report_text': '1. Metastatic lymphadenopathy in the pelvis. 2. No filling defects in the upper tracts. 3. Hepatic steatosis.'},
#      'prior_clinic_notes': '57 yo male with bladder cancer.'},
# {'order': 'CT AP WO',
#      'indication': 'Rule out metastatic disease in prostate cancer',
#      'creatinine': 2.1,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP WO', 'prior_protocol': 'noncontrast', 'prior_report_text': 'No definite visceral metastases on a limited noncontrast examination. Similar osseous metastatic disease; see separate report for the bone scan.'},
#      'prior_clinic_notes': 'On therapy for prostate cancer.'},
# {'order': 'CT Abd Pelvis W',
#      'indication': 'Rule out pelvic inflammatory disease',
#      'creatinine': 0.98,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP W', 'prior_protocol': 'routine', 'prior_report_text': 'Normal appearance of the uterus and adnexa.'},
#      'prior_clinic_notes': 'Fever and lower abdominal pain, elevated CRP.'},
#     {'order': 'CT Abd W',
#      'indication': 'Chronic liver disease surveillance',
#      'creatinine': 1.05,
#      'allergies': {'contrast dye': 'mild'},
#      'prior_report': {'prior_order': 'CT Abd W', 'prior_protocol': 'cirrhosis', 'prior_report_text': 'Hepatic steatosis, signs of early fibrosis.'},
#      'prior_clinic_notes': 'Patient with NAFLD, monitoring for progression.'},
#     {'order': 'CT CAP W',
#      'indication': 'Post-chemotherapy evaluation for lymphoma',
#      'creatinine': 1.2,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT CAP W', 'prior_protocol': 'routine', 'prior_report_text': 'Decrease in size of cervical and mediastinal lymph nodes.'},
#      'prior_clinic_notes': 'Completed 6 cycles of R-CHOP, assessing response.'},
#     {'order': 'CT Abd W',
#      'indication': 'Screening for renal cell carcinoma in VHL disease',
#      'creatinine': 0.9,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT Abd WO', 'prior_protocol': 'noncontrast', 'prior_report_text': 'Multiple renal cysts, no solid lesions.'},
#      'prior_clinic_notes': 'Family history of VHL, annual screening recommended.'},
#     {'order': 'CT Abd Pelvis W',
#      'indication': 'Evaluation for endometriosis',
#      'creatinine': 1.1,
#      'allergies': {'contrast dye': 'mild'},
#      'prior_report': {'prior_order': 'CT Abd Pelvis W', 'prior_protocol': 'routine', 'prior_report_text': 'Evidence of pelvic adhesions, suggestive of endometriosis.'},
#      'prior_clinic_notes': 'Chronic pelvic pain, dysmenorrhea, and infertility.'},
#     {'order': 'CT AP WWO',
#      'indication': 'Follow-up on adrenal incidentaloma',
#      'creatinine': 1.0,
#      'allergies': {'contrast dye': 'none'},
#      'prior_report': {'prior_order': 'CT AP W', 'prior_protocol': 'routine', 'prior_report_text': '4 cm left adrenal mass, unchanged from previous.'},
#      'prior_clinic_notes': 'Incidental finding during workup for abdominal pain last year.'},
#     {'order': 'CT CAP WO',
#      'indication': 'Pre-operative assessment for rectal cancer',
#      'creatinine': 0.95,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT CAP W', 'prior_protocol': 'routine low pelvis', 'prior_report_text': 'Locally advanced rectal cancer without distant metastasis.'},
#      'prior_clinic_notes': 'Biopsy-confirmed adenocarcinoma, planning for surgery.'},
#     {'order': 'CT Abd WO',
#      'indication': 'Acute pancreatitis',
#      'creatinine': 1.3,
#      'allergies': {'contrast dye': 'severe'},
#      'prior_report': {'prior_order': 'CT Abd', 'prior_protocol': 'dual pancreas', 'prior_report_text': '1. Enlarged pancreas with peripancreatic fluid. 2. Anaphylaxis due to contrast reaction.'},
#      'prior_clinic_notes': 'Severe abdominal pain, elevated lipase.'},
#     {'order': 'CT Abd Pelvis W',
#      'indication': 'Chronic mesenteric ischemia',
#      'creatinine': 1.1,
#      'allergies': {},
#      'prior_report': {},
#      'prior_clinic_notes': 'Postprandial pain, weight loss, fear of eating.'},
#     {'order': 'CT Abd Pelvis WO',
#      'indication': 'Evaluation for urolithiasis',
#      'creatinine': 1.25,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT Abd Pelvis WO', 'prior_protocol': 'renal stone', 'prior_report_text': 'Multiple stones in the kidney and ureter.'},
#      'prior_clinic_notes': 'Recurrent flank pain, history of stones.'},
#     {'order': 'CT CAP WWO',
#      'indication': 'Evaluation for metastatic RCC',
#      'creatinine': 1.1,
#      'allergies': {'bees':'severe'},
#      'prior_report': {'prior_order': 'CT Abd WWO', 'prior_protocol': 'rcc', 'prior_report_text': 'Renal mass suggestive of carcinoma.'},
#      'prior_clinic_notes': 'Known case of RCC, monitoring for metastasis.'},
#     {'order': 'CT CAP W',
#      'indication': 'Suspected cholangiocarcinoma',
#      'creatinine': 1.2,
#      'allergies': {},
#      'prior_report': {'prior_order': 'MRCP', 'prior_protocol': None, 'prior_report_text': 'Irregularity in the bile ducts, possible stricture.'},
#      'prior_clinic_notes': 'Jaundice, itching, and weight loss.'},
#     {'order': 'CT CAP W',
#      'indication': 'Follow-up for sarcoma',
#      'creatinine': 0.9,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP W', 'prior_protocol': 'routine', 'prior_report_text': 'Post-operative changes, no evidence of recurrence.'},
#      'prior_clinic_notes': 'History of sarcoma resection, regular surveillance.'},
#     {'order': 'CT CAP W',
#      'indication': 'Screening for neuroendocrine tumor metastases',
#      'creatinine': 1.0,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT CAP W', 'prior_protocol': 'dual liver', 'prior_report_text': 'No visible metastatic disease.'},
#      'prior_clinic_notes': 'Diagnosed with neuroendocrine tumor, annual screening.'},
#     {'order': 'CT CAP W',
#      'indication': 'Trauma, rule out internal injuries',
#      'creatinine': 1.2,
#      'allergies': {},
#      'prior_report': {},
#      'prior_clinic_notes': 'Motor vehicle collision, multiple contusions and abrasions.'},
#     {'order': 'CT Abd WO',
#      'indication': 'Post-operative evaluation for abscess',
#      'creatinine': 2.1,
#      'allergies': {'contrast dye': 'mild'},
#      'prior_report': {'prior_order': 'CT Abd WO', 'prior_protocol': 'noncontrast', 'prior_report_text': 'Evidence of fluid collection in the surgical site.'},
#      'prior_clinic_notes': 'Recent appendectomy, presenting with fever and localized pain.'},
#     {'order': 'CT CAP WWO',
#      'indication': 'Post-surgery, rule out metastasis in RCC',
#      'creatinine': 1.3,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT CAP WWO', 'prior_protocol': 'rcc', 'prior_report_text': 'No evidence of disease recurrence.'},
#      'prior_clinic_notes': 'Nephrectomy for RCC, annual follow-up.'},
#     {'order': 'CT AP W',
#      'indication': 'Suspected liver abscess',
#      'creatinine': 1.0,
#      'allergies': {'sulfa':None},
#      'prior_report': {},
#      'prior_clinic_notes': 'Fever, right upper quadrant pain, elevated liver enzymes.'},
#     {'order': 'CT AP W',
#      'indication': 'Evaluation for cholangiocarcinoma recurrence',
#      'creatinine': 1.4,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP W', 'prior_protocol': 'cholangiocarcinoma', 'prior_report_text': 'Previous imaging showed bile duct thickening.'},
#      'prior_clinic_notes': 'History of cholangiocarcinoma, post-ERCP with stenting.'},
#     {'order': 'CT CAP W',
#      'indication': 'Follow-up for neuroendocrine tumor',
#      'creatinine': 1.1,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT CAP W', 'prior_protocol': 'dual liver', 'prior_report_text': 'Stable appearance of known pancreatic lesion.'},
#      'prior_clinic_notes': 'Managed with somatostatin analogs, biannual imaging.'},
#     {'order': 'CT AP W',
#      'indication': 'Rule out pelvic sarcoma recurrence',
#      'creatinine': 0.98,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP W', 'prior_protocol': 'routine low pelvis', 'prior_report_text': 'No recurrent mass identified.'},
#      'prior_clinic_notes': 'History of pelvic sarcoma, status post resection.'},
#     {'order': 'CT Abd W',
#      'indication': 'Fall',
#      'creatinine': 1.05,
#      'allergies': {},
#      'prior_report': {},
#      'prior_clinic_notes': 'Fall from height, hemodynamically stable, abdominal tenderness.'},
#     {'order': 'CT AP WO',
#      'indication': 'Rule out post-operative complications',
#      'creatinine': 1.2,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP WO', 'prior_protocol': 'noncontrast', 'prior_report_text': 'Normal post-operative appearance, no complications noted.'},
#      'prior_clinic_notes': 'Recent cholecystectomy, follow-up for reported pain.'},
#     {'order': 'CT CAP W',
#      'indication': 'Staging for sarcoma',
#      'creatinine': 0.9,
#      'allergies': {},
#      'prior_report': {},
#      'prior_clinic_notes': 'Newly diagnosed soft tissue sarcoma on biopsy, requiring staging.'},
#     {'order': 'CT AP W',
#      'indication': 'Evaluation for recurrent neuroendocrine tumor',
#      'creatinine': 1.3,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP W', 'prior_protocol': 'dual liver', 'prior_report_text': 'Previous surgery site without evidence of recurrence.'},
#      'prior_clinic_notes': 'History of resected small bowel neuroendocrine tumor, annual surveillance.'},
#     {'order': 'CT AP WWO',
#      'indication': 'Follow-up for RCC',
#      'creatinine': 1.7,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP WWO', 'prior_protocol': 'rcc', 'prior_report_text': 'No new renal masses, previous partial nephrectomy site stable.'},
#      'prior_clinic_notes': 'Partial nephrectomy for RCC, monitoring for recurrence.'},
#     {'order': 'CT AP W',
#      'indication': 'Suspected abscess post abdominal surgery',
#      'creatinine': 1.2,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT APW', 'prior_protocol': 'routine oral contrast', 'prior_report_text': 'Fluid collection adjacent to the surgical site, suggestive of abscess.'},
#      'prior_clinic_notes': 'Persistent fever and abdominal pain following colectomy.'},
#     {'order': 'CT CAP W',
#      'indication': 'Trauma, evaluate for splenic injury',
#      'creatinine': 1.0,
#      'allergies': {},
#      'prior_report': {},
#      'prior_clinic_notes': 'Blunt abdominal trauma from cycling accident, left upper quadrant pain.'},
#     {'order': 'CT AP W',
#      'indication': 'Rule out metastasis in known cholangiocarcinoma patient',
#      'creatinine': 1.05,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT AP W', 'prior_protocol': 'cholangiocarcinoma', 'prior_report_text': 'No evidence of distant metastasis, primary tumor stable.'},
#      'prior_clinic_notes': 'Undergoing adjuvant chemotherapy, liver function tests worsening.'},
#     {'order': 'CT CAP W',
#      'indication': 'Evaluation for metastatic sarcoma',
#      'creatinine': 1.1,
#      'allergies': {},
#      'prior_report': {'prior_order': 'CT CAP W', 'prior_protocol': 'sarcoma', 'prior_report_text': 'Multiple pulmonary nodules, suspicious for metastases.'},
#      'prior_clinic_notes': 'History of high-grade extremity sarcoma, new respiratory symptoms.'}
# ]

# print(len(question_list))

In [None]:

# def build_datasets_from_file(filename):
#   """
#   This function reads an excel file, splits into train/test datasets, and converts them to a JSON representation.

#   Args:
#     filename: The path to the excel file.

#   Returns:
#     A JSON representation of the train and test datasets.
#   """

#   # Read the excel file into a pandas dataframe
#   df = pd.read_excel(filename)

#   # Rename columns and identify x and y data
#   new_column_names = {'Procedure': 'order', 'Reason for Exam Full': 'indication', 'Previous Procedure Name':'prior_order', 'Contrast Allergy': 'contrast_allergy', 'Allergy Severity': 'allergy_severity', 'Creatinine (mg/dL)':'creatinine', 'Dialysis': 'on_dialysis', 'clinical summary':'clinical_summary', 'Predicted Procedure': 'predicted_order', 'Protocol': 'predicted_protocol', 'Protocol comments':'predicted_comments', 'Accession':'accession'}
#   df.rename(columns=new_column_names, inplace=True)
#   x_cols = ['accession', 'order', 'indication', 'prior_order', 'creatinine', 'contrast_allergy', 'allergy_severity', 'on_dialysis', 'clinical_summary']
#   y_cols = ['accession', 'predicted_order', 'predicted_protocol', 'predicted_comments']


#   # Split the data into train and test sets
#   X_train, X_test, y_train, y_test = train_test_split(df[x_cols], df[y_cols], test_size=200, random_state=12)
#   X_train_json = X_train.to_dict(orient='records')
#   X_test_json = X_test.to_dict(orient='records')
#   y_train_json = y_train.to_dict(orient='records')
#   y_test_json = y_test.to_dict(orient='records')

#   # Convert the data to JSON
#   X_train_json = json.dumps(X_train_json)
#   X_test_json = json.dumps(X_test_json)
#   y_train_json = json.dumps(y_train_json)
#   y_test_json = json.dumps(y_test_json)

#   # Return the JSON string
#   return X_train_json, X_test_json, y_train_json, y_test_json







In [None]:
# def save_json_datasets(X_train_json, X_test_json, y_train_json, y_test_json, output_path):
#     """
#     Saves JSON datasets to files.

#     Parameters:
#     - X_train_json, X_test_json, y_train_json, y_test_json: JSON formatted data strings.
#     - output_path: str, base path where to save the JSON files.
#     """
#     with open(f'{output_path}/x_train.json', 'w') as f:
#         json.dump(json.loads(X_train_json), f)  # Convert JSON string back to dict for pretty saving

#     with open(f'{output_path}/x_test.json', 'w') as f:
#         json.dump(json.loads(X_test_json), f)

#     with open(f'{output_path}/y_train.json', 'w') as f:
#         json.dump(json.loads(y_train_json), f)

#     with open(f'{output_path}/y_test.json', 'w') as f:
#         json.dump(json.loads(y_test_json), f)

# filename = '/content/CT_Protocol/data/dataset031324.xlsx'
# X_train, X_test,y_train, y_test = build_datasets_from_file(filename)

# save_json_datasets(X_train, X_test, y_train, y_test, output_path='/content/CT_Protocol/data')

# def load_dataset_json(x_train_path, x_test_path, y_train_path, y_test_path):
#   # Load the JSON data from the files
#   with open(x_train_path) as f:
#     X_train_json = json.load(f)
#     X_train_json = json.loads(X_train_json)

#   with open(x_test_path) as f:
#     X_test_json = json.load(f)
#     X_test_json = json.loads(X_test_json)

#   with open(y_train_path) as f:
#     y_train_json = json.load(f)
#     y_train_json = json.loads(y_train_json)

#   with open(y_test_path) as f:
#     y_test_json = json.load(f)
#     y_test_json = json.loads(y_test_json)

#   return X_train_json, X_test_json, y_train_json, y_test_json


# x_test_path='/content/CT_Protocol/data/x_test.json'
# x_train_path='/content/CT_Protocol/data/x_train.json'
# y_test_path='/content/CT_Protocol/data/y_test.json'
# y_train_path='/content/CT_Protocol/data/y_train.json'

# X_train, X_test, y_train, y_test = load_dataset_json(x_train_path, x_test_path, y_train_path, y_test_path)

# # Convert the JSON data to Pandas DataFrames
# X_train_df = pd.DataFrame(json.loads(X_train))
# X_test_df = pd.DataFrame(json.loads(X_test))
# y_train_df = pd.DataFrame(json.loads(y_train))
# y_test_df = pd.DataFrame(json.loads(y_test))

# # Print the DataFrames
# print(X_train_df)
# print(X_test_df)
# print(y_train_df)
# print(y_test_df)


# # build training dataset with the right format
# df['text'] = '<s>[INST]@Enlighten. ' + df[''] +'[/INST]'+ df['A'] + '</s>'

# # remove columns
# df=df.drop(['Q','A','class'],axis=1)

# # convert to dataset object
# dataset = ds.dataset(pa.Table.from_pandas(df).to_batches())
# dataset = Dataset(pa.Table.from_pandas(df))
# dataset
