<a href="https://colab.research.google.com/github/nafis-neehal/CT-LLM/blob/main/CT_Experiment_04_04_24.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installation and Imports

In [None]:
# !pip3 install nltk
# !pip3 install evaluate
# !pip3 install rouge-score
# !pip3 install openai
# !pip3 install langchain
# !pip3 install requests
# !pip3 install bert-score
# !pip3 install langchain_openai
# !pip3 install huggingface-hub

In [None]:
import json
import os
import urllib.parse

import numpy as np
import pandas as pd
import requests
from tqdm import tqdm

# Import Colab Secrets userdata module
from google.colab import userdata

## Helper Functions for Data Processing/Loading

Selecting trials
  - that match one of the listed chronic conditions (chronic kidney disease, diabetes, obesity, cancer, hypertension)
  - that are interventional studies
  - that are completed
  - that report at least 6 baseline measures in the clinicaltrials.gov portal

Data extracted for each trial
  - NCTId
  - Brief Title
  - Condition
  - Brief Summary
  - Eligibility Criteria
  - Baseline MeasureTitle
  - Primary Outcome Measure
  - Intervention Name
  - Study Type

Total trials fetched: 1784

Today's results: a random sample of 50 trials

In [None]:
def fetch_api_v2_data(condition, size):
  '''
  Fetches data from the Clinical Trials API v2.
  condition: The condition to search for.
  size: The number of maximum results to return for each condition.
  return: The decoded JSON response (a dict); the matching studies are listed under its 'studies' key.
  '''

  base_url = 'https://clinicaltrials.gov/api/v2/studies'

  params = {
      'query.cond': condition,
      'pageSize' : size,
      'filter.advanced': 'AREA[StudyType]INTERVENTIONAL | AREA[OverallStatus]COMPLETED | AREA[BaselineMeasure:size]RANGE[6,MAX]',
      'fields': 'NCTId|BriefTitle|Condition|BriefSummary|EligibilityCriteria|BaselineMeasureTitle|PrimaryOutcomeMeasure|InterventionName|StudyType'
  }

  response = requests.get(base_url, params=params)

  # Fail loudly on a non-200 response; previously json_data was left
  # undefined in that case and the return statement raised an error
  if response.status_code != 200:
      raise RuntimeError(f"Error fetching data, status code: {response.status_code}")

  return response.json()
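
Each call above returns at most `pageSize` studies per condition. To the best of my knowledge the v2 API signals further results with a `nextPageToken` field that can be sent back as a `pageToken` parameter; treat that as an assumption to verify against the API documentation. The sketch below shows the paging loop only. `fetch_page` is a hypothetical callable standing in for the HTTP request, so the example runs offline.

```python
from typing import Callable, Dict, List, Optional


def fetch_all_studies(fetch_page: Callable[[Optional[str]], Dict],
                      max_pages: int = 20) -> List[Dict]:
    """Collect studies across pages by following nextPageToken.

    fetch_page is a hypothetical callable: it takes a page token (None for
    the first page) and returns the decoded JSON dict for that page.
    """
    studies: List[Dict] = []
    token: Optional[str] = None
    for _ in range(max_pages):
        page = fetch_page(token)
        studies.extend(page.get("studies", []))
        token = page.get("nextPageToken")
        if not token:  # no token means this was the last page
            break
    return studies


# Stand-in for the real HTTP call (made-up data for illustration).
_fake_pages = {
    None: {"studies": [{"id": 1}, {"id": 2}], "nextPageToken": "p2"},
    "p2": {"studies": [{"id": 3}]},
}
all_studies = fetch_all_studies(lambda token: _fake_pages[token])
print(len(all_studies))
```

In the notebook, `fetch_page` could wrap the `requests.get` call and add `pageToken` to `params` whenever the token is not `None`.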



In [None]:
def extract_criteria(criteria_str):
  '''
  Extracts the inclusion and exclusion criteria from a string.
  criteria_str: The string containing the criteria.
  return: A tuple containing the inclusion and exclusion criteria.
  '''

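  incl_marker = "Inclusion Criteria:"
  excl_marker = "Exclusion Criteria:"
  incl_pos = criteria_str.find(incl_marker)
  excl_pos = criteria_str.find(excl_marker)

  # str.find returns -1 for a missing marker; handle that explicitly so the
  # slices below never start from a bogus offset
  incl_start = incl_pos + len(incl_marker) if incl_pos != -1 else 0
  if excl_pos == -1:
      return criteria_str[incl_start:].strip(), ""

  inclusion_criteria = criteria_str[incl_start:excl_pos].strip()
  exclusion_criteria = criteria_str[excl_pos + len(excl_marker):].strip()

  return inclusion_criteria, exclusion_criteria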



In [None]:
def parse_api_v2_data_to_df(data):

  '''
  Parses the data object returned from the Clinical Trials API v2, and converts it into a df.
  data: The JSON dict returned by the API; the studies are listed under its 'studies' key.
  return: A DataFrame containing the parsed data.
  '''

  # Create empty lists to store the extracted data
  nct_ids = []
  brief_titles = []
  inclusion_criteria = []
  exclusion_criteria = []
  brief_summaries = []
  conditions = []
  study_types = []
  interventions = []
  primary_outcomes = []
  measures = []

  # Iterate over the list of studies
  for study in data['studies']:
      # Extract the relevant data
      nct_ids.append(study['protocolSection']['identificationModule']['nctId'])
      brief_titles.append(study['protocolSection']['identificationModule']['briefTitle'])
      criteria = study['protocolSection']['eligibilityModule']['eligibilityCriteria']
      incl_criteria, excl_criteria = extract_criteria(criteria)
      incl_criteria = incl_criteria.strip()
      excl_criteria = excl_criteria.strip()
      inclusion_criteria.append(incl_criteria)
      exclusion_criteria.append(excl_criteria)
      brief_summaries.append(study['protocolSection']['descriptionModule']['briefSummary'])

      # Extract the conditions
      study_conditions = ''
      for condition in study['protocolSection']['conditionsModule']['conditions']:
          study_conditions += condition+ ', '
      conditions.append(study_conditions)

      # Extract the study type
      study_types.append(study['protocolSection']['designModule']['studyType'])

      # Extract the interventions
      study_interventions = ''
      for intervention in study['protocolSection']['armsInterventionsModule']['interventions']:
          study_interventions += intervention['name'] + ', '
      interventions.append(study_interventions)

      # Extract the primary outcomes
      study_primary_outcomes = ''
      for outcome in study['protocolSection']['outcomesModule']['primaryOutcomes']:
          study_primary_outcomes += outcome['measure'] + ', '
      primary_outcomes.append(study_primary_outcomes)

      # Extract the measures
      study_measures = ''
      for measure in study['resultsSection']['baselineCharacteristicsModule']['measures']:
          study_measures += measure['title'] + ', '
      measures.append(study_measures)

  # Create a DataFrame from the extracted data
  studies_df = pd.DataFrame({
      'NCTId': nct_ids,
      'BriefTitle': brief_titles,
      'InclusionCriteria': inclusion_criteria,
      'ExclusionCriteria': exclusion_criteria,
      'BriefSummary': brief_summaries,
      'Conditions': conditions,
      'StudyType': study_types,
      'Interventions': interventions,
      'PrimaryOutcomes': primary_outcomes,
      'BaselineMeasures': measures
  })

  # Save the DataFrame to a CSV file
  #studies_df.to_csv('studies.csv', index=False)

  return studies_df
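
The parser above indexes straight into nested modules such as `armsInterventionsModule`, which raises `KeyError` if a study omits one of them. Whether that happens for this filter is not verified here. A minimal sketch of a tolerant lookup follows; `get_nested` is a hypothetical helper and the study record is made up.

```python
from typing import Any


def get_nested(record: dict, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts key by key, returning default if any level is absent."""
    current: Any = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Toy study record that lacks an armsInterventionsModule (illustration only).
study = {"protocolSection": {"identificationModule": {"nctId": "NCT00000000"}}}

nct_id = get_nested(study, "protocolSection", "identificationModule", "nctId")
interventions = get_nested(
    study, "protocolSection", "armsInterventionsModule", "interventions", default=[]
)
names = ", ".join(item["name"] for item in interventions)
print(nct_id, repr(names))
```

With a default of `[]`, a missing module yields an empty string for that column instead of aborting the whole parse.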



In [None]:
final_df = pd.DataFrame()
conditions = ['chronic kidney disease', 'diabetes', 'obesity', 'cancer', 'hypertension']
for condition in conditions:
  data = fetch_api_v2_data(condition, size=500)
  df = parse_api_v2_data_to_df(data)
  final_df = pd.concat([final_df, df], ignore_index=True)

In [None]:
final_df.shape

(1784, 10)

In [None]:
final_df = final_df.sample(n=50, random_state=42, ignore_index=True)

In [None]:
final_df.head()

Unnamed: 0,NCTId,BriefTitle,InclusionCriteria,ExclusionCriteria,BriefSummary,Conditions,StudyType,Interventions,PrimaryOutcomes,BaselineMeasures
0,NCT05803096,Self-Administered Nitrous Oxide (SANO) During ...,* Biological male\n* Aged 18 to 85 years\n* Sc...,* Perioral facial hair impeding good mask seal...,Transrectal prostate biopsy is a commonly perf...,"Prostate Disease, Malignancy, Benign Prostatic...",INTERVENTIONAL,"Self-Administered Nitrous Oxide, Oxygen,","Post-biopsy Pain (VAS-P), Post-biopsy Anxiety ...","Age, Continuous, Sex: Female, Male, Race/Ethni..."
1,NCT02220894,Study of Pembrolizumab (MK-3475) Versus Platin...,:\n\n* Histologically- or cytologically-confir...,:\n\n* Histologically- or cytologically-confir...,"In this study, participants with programmed ce...","Non-small Cell Lung Cancer,",INTERVENTIONAL,"pembrolizumab, carboplatin, paclitaxel, pemetr...",Overall Survival (OS) in Participants With a T...,"Age, Continuous, Sex: Female, Male, Race (NIH/..."
2,NCT02128724,Palliative Thoracic Radiotherapy Plus BKM120,* Evidence of histologically confirmed NSCLC o...,* Previous chemotherapy or biological therapy ...,This study will test whether a drug called BKM...,"Carcinoma, Non-Small-Cell Lung,",INTERVENTIONAL,"BKM120,",Dose Escalation Analysis: Number of DLTs Obser...,"Age, Continuous, Sex: Female, Male, Race and E..."
3,NCT03557658,Safety and Efficacy of Bexagliflozin in Subjec...,o meet the following criteria to be eligible f...,o meet the following criteria to be eligible f...,The purpose of this study is to examine the dr...,"Type2 Diabetes Mellitus,",INTERVENTIONAL,"Bexagliflozin,","Cmax (Maximum Observed Plasma Concentration), ...","Age, Continuous, Sex: Female, Male, Ethnicity ..."
4,NCT01438476,Postoperative Thoracic Epidural Analgesia Vers...,1. Patients undergoing liver and/or pancreatic...,1. Evidence of severe uncontrolled systemic di...,The goal of this clinical research study is to...,"Liver Cancer, Pancreatic Cancer,",INTERVENTIONAL,Intravenous Patient-Controlled Analgesia (IVPC...,"Postoperative Pain Experience,","Age, Categorical, Age, Continuous, Sex: Female..."


## Run GPT Models for generating list of baseline measures



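*   GPT-4 Turbo preview, `gpt-4-0125-preview` (training data up to Dec 2023 / latest / ~4-5 s per iteration)
*   GPT-3.5 Turbo, `gpt-3.5-turbo` (training data up to Sept 2021 / ~1 s per iteration, about 4-5x faster inference in the runs below)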



In [None]:
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_openai import ChatOpenAI
from langchain_community.llms import HuggingFaceEndpoint

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_KEY_BS1")


In [None]:
def generate_openai_prompt_template():

    system_template = (
        "You are a helpful assistant and expert in clinical domain and clinical trial design. Answer the question based on the context below and your knowledge on clinical trials and clinical domain"
    )

    system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)

    human_template = """

    Context:
    {trial_info}

    Question:
    {query}

    Answer:

    """

    human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

    chat_prompt = ChatPromptTemplate.from_messages(
        [system_message_prompt, human_message_prompt]
    )

    return chat_prompt


In [None]:
def single_openai_prompt(row, chat_prompt):

  '''
  Actual Prompt Structured in LangChain Prompt Template
  {Instruction} {Trial Context} {Question} -> {Answer}
  '''

  title = row['BriefTitle']
  brief_summary = row['BriefSummary']
  condition = row['Conditions']
  intervention = row['Interventions']
  outcome = row['PrimaryOutcomes']

  trial_info = f"Title- {title}\nBrief Summary- {brief_summary}\nCondition- {condition}\nIntervention- {intervention}\nOutcome- {outcome}"
  query = "Return a list of probable baseline features (seperated by comma, without itemizing or bullet points) that needs to\
  be measured before the trial starts and in each follow up visits. These baseline\
  features are usually found in Table 1 of a trial related publications. Don't give any additional explanations."

  prompt = chat_prompt.format_prompt(
      trial_info=trial_info, query=query
  ).to_messages()

  return prompt

def run_single_openai_query(row, chat_prompt, chat, ret_prompt=False):

  prompt = single_openai_prompt(row, chat_prompt)

  ret = chat.invoke(
    prompt
  )

  if ret_prompt:
    return [ret.content, prompt]

  return ret.content

In [None]:
def run_openai_loop(df, model):

  # initialize the models
  model_name = model
  chat_llm = ChatOpenAI(model_name=model_name, temperature=0)
  chat_prompt = generate_openai_prompt_template()

  # Iterate over the DataFrame that was passed in, not the global final_df
  for idx, row in tqdm(df.iterrows(), total=len(df)):
    response = run_single_openai_query(row, chat_prompt, chat_llm)
    df.loc[idx, model_name] = response
    if idx % 10 == 0:
      print(f"\nProcessed {idx} rows")

  return df

In [None]:
response_df = run_openai_loop(final_df.copy(), model = "gpt-3.5-turbo")

1it [00:01,  1.35s/it]


Processed 0 rows


11it [00:11,  1.02s/it]


Processed 10 rows


21it [00:22,  1.21s/it]


Processed 20 rows


31it [00:35,  1.19s/it]


Processed 30 rows


41it [00:44,  1.08it/s]


Processed 40 rows


50it [00:53,  1.07s/it]


In [None]:
response_df = run_openai_loop(response_df.copy(), model = "gpt-4-0125-preview")

1it [00:05,  5.32s/it]


Processed 0 rows


11it [01:02,  5.79s/it]


Processed 10 rows


21it [01:47,  4.26s/it]


Processed 20 rows


31it [02:37,  5.12s/it]


Processed 30 rows


41it [03:21,  4.83s/it]


Processed 40 rows


50it [04:01,  4.83s/it]


In [None]:
response_df.head()

Unnamed: 0,NCTId,BriefTitle,InclusionCriteria,ExclusionCriteria,BriefSummary,Conditions,StudyType,Interventions,PrimaryOutcomes,BaselineMeasures,gpt-3.5-turbo,gpt-4-0125-preview
0,NCT05803096,Self-Administered Nitrous Oxide (SANO) During ...,* Biological male\n* Aged 18 to 85 years\n* Sc...,* Perioral facial hair impeding good mask seal...,Transrectal prostate biopsy is a commonly perf...,"Prostate Disease, Malignancy, Benign Prostatic...",INTERVENTIONAL,"Self-Administered Nitrous Oxide, Oxygen,","Post-biopsy Pain (VAS-P), Post-biopsy Anxiety ...","Age, Continuous, Sex: Female, Male, Race/Ethni...","Age, Gender, Body Mass Index (BMI), Previous h...","Age, gender, race/ethnicity, body mass index (..."
1,NCT02220894,Study of Pembrolizumab (MK-3475) Versus Platin...,:\n\n* Histologically- or cytologically-confir...,:\n\n* Histologically- or cytologically-confir...,"In this study, participants with programmed ce...","Non-small Cell Lung Cancer,",INTERVENTIONAL,"pembrolizumab, carboplatin, paclitaxel, pemetr...",Overall Survival (OS) in Participants With a T...,"Age, Continuous, Sex: Female, Male, Race (NIH/...","Age, Gender, Race, ECOG Performance Status, Sm...","Age, sex, race/ethnicity, smoking history, Eas..."
2,NCT02128724,Palliative Thoracic Radiotherapy Plus BKM120,* Evidence of histologically confirmed NSCLC o...,* Previous chemotherapy or biological therapy ...,This study will test whether a drug called BKM...,"Carcinoma, Non-Small-Cell Lung,",INTERVENTIONAL,"BKM120,",Dose Escalation Analysis: Number of DLTs Obser...,"Age, Continuous, Sex: Female, Male, Race and E...","Age, Gender, Smoking history, Performance stat...","Age, sex, race/ethnicity, smoking history, per..."
3,NCT03557658,Safety and Efficacy of Bexagliflozin in Subjec...,o meet the following criteria to be eligible f...,o meet the following criteria to be eligible f...,The purpose of this study is to examine the dr...,"Type2 Diabetes Mellitus,",INTERVENTIONAL,"Bexagliflozin,","Cmax (Maximum Observed Plasma Concentration), ...","Age, Continuous, Sex: Female, Male, Ethnicity ...","Age, Gender, Weight, Height, BMI, Blood Pressu...","Age, sex, body mass index (BMI), duration of T..."
4,NCT01438476,Postoperative Thoracic Epidural Analgesia Vers...,1. Patients undergoing liver and/or pancreatic...,1. Evidence of severe uncontrolled systemic di...,The goal of this clinical research study is to...,"Liver Cancer, Pancreatic Cancer,",INTERVENTIONAL,Intravenous Patient-Controlled Analgesia (IVPC...,"Postoperative Pain Experience,","Age, Categorical, Age, Continuous, Sex: Female...","Age, Gender, Body Mass Index, Smoking Status, ...","Age, sex, body mass index (BMI), type of surge..."


## Run Open-Source HF Models



*   mistralai/Mistral-7B-Instruct-v0.2



In [None]:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceEndpoint

os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get("HF_TOKEN")

In [None]:
def generate_hf_prompt_template():
  system_template = """
  You are a helpful assistant and expert in clinical domain and clinical trial design.\
  Answer the question based on the context below and your\
  knowledge on clinical trials and clinical domain
  """

  human_template = """
  Context:
  {trial_info}

  Question:
  {query}

  Answer:
  """

  prompt = PromptTemplate.from_template(system_template + human_template)

  return prompt

In [None]:
def run_single_HF_query(row, prompt, llm):
  '''
  Actual Prompt Structured in LangChain Prompt Template
  {Instruction} {Context} {Question} -> {Answer}
  '''

  title = row['BriefTitle']
  brief_summary = row['BriefSummary']
  condition = row['Conditions']
  intervention = row['Interventions']
  outcome = row['PrimaryOutcomes']

  trial_info = f"Title- {title}\nBrief Summary- {brief_summary}\nCondition- {condition}\nIntervention- {intervention}\nOutcome- {outcome}"
  query = "Return a list of probable baseline features (seperated by comma, without itemizing or bullet points) that needs to\
  be measured before the trial starts and in each follow up visits. These baseline\
  features are usually found in Table 1 of a trial related publications. Don't give any additional explanations."

  llm_chain = LLMChain(prompt=prompt, llm=llm)
  ret = llm_chain.run(trial_info=trial_info, query=query)
  return ret



In [None]:
def run_HF_loop(df, repo, model_name):

  # initialize the models
  llm = HuggingFaceEndpoint(
    repo_id = repo, temperature=0.1
  )

  prompt = generate_hf_prompt_template()

  # Iterate over the DataFrame that was passed in, not the global final_df
  for idx, row in tqdm(df.iterrows(), total=len(df)):
    response = run_single_HF_query(row, prompt, llm)
    df.loc[idx, model_name] = response
    if idx % 10 == 0:
      print(f"\nProcessed {idx} rows")

  return df

In [None]:
response_df = run_HF_loop(response_df.copy(), repo = "mistralai/Mistral-7B-Instruct-v0.2", model_name = "Mistral-7B-Instruct-v0.2")

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


1it [00:01,  1.43s/it]


Processed 0 rows


11it [00:51,  5.08s/it]


Processed 10 rows


21it [01:31,  3.98s/it]


Processed 20 rows


31it [02:14,  3.87s/it]


Processed 30 rows


41it [02:53,  4.58s/it]


Processed 40 rows


50it [03:29,  4.20s/it]


In [None]:
response_df.head()

Unnamed: 0,NCTId,BriefTitle,InclusionCriteria,ExclusionCriteria,BriefSummary,Conditions,StudyType,Interventions,PrimaryOutcomes,BaselineMeasures,gpt-3.5-turbo,gpt-4-0125-preview,Mistral-7B-Instruct-v0.2
0,NCT05803096,Self-Administered Nitrous Oxide (SANO) During ...,* Biological male\n* Aged 18 to 85 years\n* Sc...,* Perioral facial hair impeding good mask seal...,Transrectal prostate biopsy is a commonly perf...,"Prostate Disease, Malignancy, Benign Prostatic...",INTERVENTIONAL,"Self-Administered Nitrous Oxide, Oxygen,","Post-biopsy Pain (VAS-P), Post-biopsy Anxiety ...","Age, Continuous, Sex: Female, Male, Race/Ethni...","Age, Gender, Body Mass Index (BMI), Previous h...","Age, gender, race/ethnicity, body mass index (...",1. Age\n 2. Body Mass Index (BMI)\n 3. Prost...
1,NCT02220894,Study of Pembrolizumab (MK-3475) Versus Platin...,:\n\n* Histologically- or cytologically-confir...,:\n\n* Histologically- or cytologically-confir...,"In this study, participants with programmed ce...","Non-small Cell Lung Cancer,",INTERVENTIONAL,"pembrolizumab, carboplatin, paclitaxel, pemetr...",Overall Survival (OS) in Participants With a T...,"Age, Continuous, Sex: Female, Male, Race (NIH/...","Age, Gender, Race, ECOG Performance Status, Sm...","Age, sex, race/ethnicity, smoking history, Eas...",1. Age\n 2. Sex\n 3. Eastern Cooperative Onc...
2,NCT02128724,Palliative Thoracic Radiotherapy Plus BKM120,* Evidence of histologically confirmed NSCLC o...,* Previous chemotherapy or biological therapy ...,This study will test whether a drug called BKM...,"Carcinoma, Non-Small-Cell Lung,",INTERVENTIONAL,"BKM120,",Dose Escalation Analysis: Number of DLTs Obser...,"Age, Continuous, Sex: Female, Male, Race and E...","Age, Gender, Smoking history, Performance stat...","Age, sex, race/ethnicity, smoking history, per...",1. Age\n 2. Gender\n 3. Weight\n 4. Height\...
3,NCT03557658,Safety and Efficacy of Bexagliflozin in Subjec...,o meet the following criteria to be eligible f...,o meet the following criteria to be eligible f...,The purpose of this study is to examine the dr...,"Type2 Diabetes Mellitus,",INTERVENTIONAL,"Bexagliflozin,","Cmax (Maximum Observed Plasma Concentration), ...","Age, Continuous, Sex: Female, Male, Ethnicity ...","Age, Gender, Weight, Height, BMI, Blood Pressu...","Age, sex, body mass index (BMI), duration of T...","1. Body Weight,\n 2. Height,\n 3. Serum Crea..."
4,NCT01438476,Postoperative Thoracic Epidural Analgesia Vers...,1. Patients undergoing liver and/or pancreatic...,1. Evidence of severe uncontrolled systemic di...,The goal of this clinical research study is to...,"Liver Cancer, Pancreatic Cancer,",INTERVENTIONAL,Intravenous Patient-Controlled Analgesia (IVPC...,"Postoperative Pain Experience,","Age, Categorical, Age, Continuous, Sex: Female...","Age, Gender, Body Mass Index, Smoking Status, ...","Age, sex, body mass index (BMI), type of surge...","1. Age,\n 2. Gender,\n 3. Body Mass Index (B..."


In [None]:
import random
import inspect
import textwrap

# initialize the models
chat_prompt = generate_openai_prompt_template()

# randomly select a row from the dataframe
idx = random.randint(0, len(response_df) - 1)
row = response_df.iloc[idx]

prompt = single_openai_prompt(row, chat_prompt)

print(textwrap.dedent(f"Prompt: \n System Message: {prompt[0].content} \n Human Message: {prompt[1].content}"))
print(f"Baseline Measures: {row.BaselineMeasures}")
print(f"GPT 3.5 Response: {row['gpt-3.5-turbo']}")
print(f"GPT 4 Response: {row['gpt-4-0125-preview']}")
print(f"Mistral-7B-Instruct-v0.2 Response: {row['Mistral-7B-Instruct-v0.2']}")



Prompt: 
 System Message: You are a helpful assistant and expert in clinical domain and clinical trial design. Answer the question based on the context below and your knowledge on clinical trials and clinical domain 
 Human Message: 

    Context: 
    Title- Study of Pembrolizumab (MK-3475) Versus Platinum-Based Chemotherapy for Participants With Programmed Cell Death-Ligand 1 (PD-L1)-Positive Advanced or Metastatic Non-Small Cell Lung Cancer (MK-3475-042/KEYNOTE-042)
Brief Summary- In this study, participants with programmed cell death ligand 1 (PD-L1)-positive non-small cell lung cancer (NSCLC) will be randomized to receive single agent pembrolizumab for up to 35 treatments or standard of care (SOC) platinum-based chemotherapy (carboplatin + paclitaxel or carboplatin + pemetrexed for 4 to 6 21-day cycles). Participants in the platinum-based chemotherapy arms with non-squamous tumor histologies may receive pemetrexed maintenance therapy after the 4 to 6 cycles of chemotherapy. The pr

In [None]:
response_df.to_csv('response_df.csv', index=False, header=True)

## Evaluate Score
Score 1: BERTScore based on contextual similarity:
*   Link 1 - https://huggingface.co/spaces/evaluate-metric/bertscore
*   Link 2 - https://github.com/Tiiiger/bert_score#readme
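
As a rough illustration of what BERTScore computes: each candidate token is greedily matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 is their harmonic mean. The sketch below applies that rule to a made-up similarity matrix. The real metric derives the similarities from contextual embeddings and can optionally apply IDF weighting, which is omitted here.

```python
from typing import List, Tuple


def greedy_match_scores(sim: List[List[float]]) -> Tuple[float, float, float]:
    """sim[i][j] = similarity of candidate token i to reference token j."""
    # Precision: best reference match for every candidate token, averaged.
    precision = sum(max(row) for row in sim) / len(sim)
    # Recall: best candidate match for every reference token, averaged.
    n_ref = len(sim[0])
    recall = sum(max(row[j] for row in sim) for j in range(n_ref)) / n_ref
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up similarities: 2 candidate tokens (rows) x 3 reference tokens (columns).
toy_sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
p, r, f = greedy_match_scores(toy_sim)
print(round(p, 3), round(r, 3), round(f, 3))
```

Because every token finds some nearest neighbour, scores tend to be high even for loosely related texts, which is worth keeping in mind when reading the values below.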


  

In [None]:
from evaluate import load
bertscore = load("bertscore")

idx = random.randint(0, len(response_df) - 1)
row = response_df.iloc[idx]

reference = row['BaselineMeasures']
response_gpt_3_5_turbo = row['gpt-3.5-turbo']
response_gpt_4_0125_preview = row['gpt-4-0125-preview']
response_mistral_7b_instruct_v0_2 = row['Mistral-7B-Instruct-v0.2']

print(f"Reference: {reference}\n")

responses = {
  "GPT 3.5": response_gpt_3_5_turbo,
  "GPT 4": response_gpt_4_0125_preview,
  "Mistral-7B-Instruct-v0.2": response_mistral_7b_instruct_v0_2,
}

# Score each model's response against the same reference
for name, response in responses.items():
  print(f"{name} Response: {response}")
  results = bertscore.compute(predictions=[response], references=[reference], model_type="distilbert-base-uncased")
  print(f"BERTScore of {name} Response: {results}\n")



Reference: Age, Categorical, Age, Continuous, Sex: Female, Male, Ethnicity (NIH/OMB), Race/Ethnicity, Customized, Region of Enrollment, Type of Adjuvant Endocrine Therapy, Breast Cancer Stage, Ovarian Suppression, 

GPT 3.5 Response: Age, Time since breast cancer diagnosis, Hormonal therapy regimen, Previous experience with group therapy, Baseline symptoms related to hormonal therapy, Baseline distress levels
BERTScore of GPT 3.5 Response: {'precision': [0.8298296928405762], 'recall': [0.7879796028137207], 'f1': [0.8083633780479431], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.38.2)'}

GPT 4 Response: Age, gender, race/ethnicity, education level, marital status, employment status, breast cancer stage at diagnosis, type of hormonal therapy, duration of hormonal therapy use, previous cancer treatments (e.g., chemotherapy, radiation), menopausal status, comorbid conditions, baseline symptom severity, baseline medication adherence, baseline psychological distr

Score 2 - GPT 4 as an evaluator

In [None]:
# prompt: write code to ask OpenAI GPT-4 model to score similarity between reference and each response

from openai import OpenAI
client = OpenAI()

prompt = (
    f"We are evaluating three AI models' responses based on their similarity to a reference string.\n"
    f"Reference string: {reference}\n"
    f"Model 1 Response: {response_gpt_3_5_turbo}\n"
    f"Model 2 Response: {response_gpt_4_0125_preview}\n"
    f"Model 3 Response: {response_mistral_7b_instruct_v0_2}\n"
    f"Make sure to correctly identify each feature in the reference string first. Not all comma-separated values are features.\n"
    f"For example - [Age, Continuous, Age, Categorical, Sex: Male, Female, Ethnicity] in this list features are [Age, Sex, Ethnicity].\n"
    f"In reference feature list, avoid considering data types (categorical, continuous), sub-category names, measurement metric names etc. as feature names.\n"
    f"For each model response, score its similarity to the reference string on a scale of 0 to 1, where 0 is completely dissimilar and 1 is perfectly similar. Also calculate precision, recall, F1 scores.\n"
    f"Let's think step by step. The first step is to curate a list of actual reference features. Then the second step is to match each model's response to this curated feature list.\n"
    f"In this second step, make sure you take underlying semantic and contextual similarity into account.\n"
    f"List features that match exactly, that match partially based on contextuality/semantics, that are present in reference but not in response, and that are present in response but not in reference.\n"
    f"Return the scores in the following format (show decimal values too, not only fractions):\n"
    f"Model 1 Score: \n"
    f"Model 2 Score: \n"
    f"Model 3 Score: "
)

completion = client.chat.completions.create(
  model="gpt-4-0125-preview",
  messages=[
    {"role": "system",
     "content": "You are a helpful assistant and expert in clinical domain and clinical trial design. "
                "Answer the question based on your knowledge on clinical trials and clinical domain"},
    {"role": "user", "content": prompt}
  ]
)
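
The prompt above asks GPT-4 to separate real feature names from data types and sub-category values before scoring. A rule-based pass is one possible sanity check on that curation step. The sketch below is a heuristic only. `curate_features` and both stoplists are hypothetical and would need tuning against real portal output.

```python
from typing import List

# Hypothetical stoplists: tokens that describe data types or category values,
# not features.
DATA_TYPE_TOKENS = {"continuous", "categorical", "customized"}
CATEGORY_VALUE_TOKENS = {"male", "female"}


def curate_features(raw: str) -> List[str]:
    """Reduce a comma-separated BaselineMeasures string to unique feature names."""
    features: List[str] = []
    seen = set()
    for token in raw.split(","):
        name = token.split(":")[0].strip()  # "Sex: Male" -> "Sex"
        key = name.lower()
        if not name or key in DATA_TYPE_TOKENS or key in CATEGORY_VALUE_TOKENS:
            continue
        if key not in seen:  # keep the first occurrence, preserve order
            seen.add(key)
            features.append(name)
    return features


example = "Age, Continuous, Age, Categorical, Sex: Male, Female, Ethnicity"
print(curate_features(example))  # ['Age', 'Sex', 'Ethnicity']
```

A fixed stoplist cannot anticipate every sub-category value, so this is best treated as a cross-check on the LLM-curated list, not a replacement for it.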


In [None]:
print(completion.choices[0].message.content)

To properly evaluate the similarity between the AI models' responses and the reference string, let's first elucidate the reference features list. 

### Reference Features List
1. Age
2. Sex
3. Ethnicity (NIH/OMB)
4. Race/Ethnicity
5. Region of Enrollment
6. Type of Adjuvant Endocrine Therapy
7. Breast Cancer Stage
8. Ovarian Suppression

Given this list, we can now assess each model based on the exact match, partial/contextual match, and differences.

### Model 1 Evaluation
- **Exact Matches:** Age
- **Partial/Contextual Matches:** Hormonal therapy regimen (Type of Adjuvant Endocrine Therapy), Baseline distress levels (could be considered under Ovarian Suppression if interpreted as a type of symptom or effect)
- **Not in Reference:** Time since breast cancer diagnosis, Previous experience with group therapy, Baseline symptoms related to hormonal therapy
- **Missing from Response:** Sex, Ethnicity (NIH/OMB), Race/Ethnicity, Region of Enrollment, Breast Cancer Stage

### Model 2 Evaluati
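
The precision, recall, and F1 requested in the prompt can be recomputed from the match lists to check the evaluator's arithmetic. The worked example below uses the counts from the Model 1 evaluation above. It assumes partial matches count as full matches, which is a scoring choice and not something the prompt fixes.

```python
from typing import Tuple


def precision_recall_f1(n_matched: int, n_response: int,
                        n_reference: int) -> Tuple[float, float, float]:
    """Set-based scores from counts of matched, response and reference features."""
    precision = n_matched / n_response if n_response else 0.0
    recall = n_matched / n_reference if n_reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts read off the Model 1 evaluation: 1 exact match, 2 partial matches
# (assumed to count fully), 3 response features absent from the reference,
# and 8 curated reference features.
matched = 1 + 2
response_total = matched + 3
reference_total = 8

precision, recall, f1 = precision_recall_f1(matched, response_total, reference_total)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under that assumption Model 1 gets precision 3/6, recall 3/8, and F1 3/7.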

In [None]:
# LOOP CODE
# from bert_score import BERTScorer
# scorer = BERTScorer(model_type='bert-base-uncased', lang='en')  # build once, not once per row
# scores = []
# for idx, row in response_df.iterrows():
#   response = row['gpt-3.5-turbo']  # column name as created by run_openai_loop
#   reference = row['BaselineMeasures']
#   if pd.isna(response) or pd.isna(reference) or not response or not reference:
#     scores.append(np.nan)
#     continue
#   P, R, F1 = scorer.score([response], [reference])
#   scores.append(F1.mean().item())
#   if idx % 10 == 0:
#     print(f"Processed {idx} rows")
# response_df['BERTScore'] = scores

**Issues**

  Dataset Issues:
  - Some trials don't report ANY baseline variables in the portal (but they list them in the publication)
  - Some trials don't report ALL baseline variables in the portal (but they list them in the publication)
  - Not all trial publications are accessible (some are behind a paywall)
  - Tracking down publications one by one, downloading them manually, and listing all baseline features is almost impossible for ~1800 trials

  Solution -->
  - Show results for these ~1800 trials with whatever features are reported in the portal and can be extracted using the API
  - Additionally show results for a smaller, manually found and curated set (~10-20 studies per condition); this can be used for RAG as well


  Inference Issues:
  - Inference with GPT-4 is slow (~5 s per trial in the run above), both for baseline measure generation and for scoring candidate responses (does parallelization help here?)
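
On the parallelization question: the per-trial calls are independent and spend most of their time waiting on the network, so a thread pool is one plausible way to overlap them. The sketch below is untested against the real API, and provider rate limits may cap the achievable speed-up. `run_parallel` and `fake_llm_call` are hypothetical names, and the sleep stands in for request latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_parallel(query_fn: Callable[[str], str], prompts: List[str],
                 max_workers: int = 8) -> List[str]:
    """Run query_fn over prompts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(query_fn, prompts))


def fake_llm_call(prompt: str) -> str:
    time.sleep(0.1)  # simulated network latency
    return prompt.upper()


prompts = [f"trial {i}" for i in range(8)]
start = time.perf_counter()
responses = run_parallel(fake_llm_call, prompts, max_workers=8)
elapsed = time.perf_counter() - start
print(responses[:2], f"{elapsed:.2f}s")
```

`pool.map` returns results in input order, so they can be written back to the DataFrame column by position.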


  Scoring Issues:
  - How to write the optimal prompt for generating scores using GPT4?
  - Which scores are we interested in?
  - Which additional scores can we look at? (e.g. standard vector cosine similarity based on Clinical BERT encoding?)
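
For the cosine-similarity idea, the score itself is simple once each text has been encoded as a vector. The sketch below shows only the similarity step on made-up 3-dimensional vectors. Producing the embeddings (for example with a clinical BERT encoder) is outside its scope.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for sentence embeddings of a reference
# feature list and a model response.
reference_vec = [1.0, 0.0, 1.0]
response_vec = [1.0, 1.0, 0.0]
similarity = cosine_similarity(reference_vec, response_vec)
print(round(similarity, 3))
```

Unlike BERTScore's token-level matching, this compares one pooled vector per text, so it cannot show which individual features matched.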


  Overall goals:
  - Which tasks do we want to benchmark and evaluate LLMs on (e.g. list baseline measures, list exclusion criteria, list inclusion criteria), and what is the ground truth for each?
  - What dataset are we planning to release?
  - Examples of hallucinations in this scenario?
  - Possible ways to trigger hallucination?


Few Shot examples

  - trial context + baseline measure
  - trial evaluation

Hand annotated examples 1 or 2

