## Demonstration

This demo uses the models to rank clinical trials in decreasing order of cosine similarity to a given query. To try your own query, first **run get_trials.py** to download the relevant clinical trials data.

After obtaining the data, update the query in the code below and run the remaining cells as is.

In [11]:
!pip install sentence-transformers



In [32]:
import xml.etree.ElementTree as ET

# Extract the individual trials from the XML file obtained from clinicaltrials.gov
def extract_trials(file_path):
    trials = {}

    # Parse the XML file
    tree = ET.parse(file_path)
    root = tree.getroot()

    # Iterate over StudyFields and extract information.
    # findtext returns '' (instead of raising) when a field is missing.
    for study_field in root.findall('.//StudyFields'):
        nct_id = study_field.findtext("./FieldValues[@Field='NCTId']/FieldValue", default='')
        brief_title = study_field.findtext("./FieldValues[@Field='BriefTitle']/FieldValue", default='')
        eligibility_criteria = study_field.findtext("./FieldValues[@Field='EligibilityCriteria']/FieldValue", default='')

        # Skip entries that carry no trial identifier
        if not nct_id:
            continue

        trials[nct_id] = {
            'title': brief_title,
            'criteria': eligibility_criteria
        }

    return trials
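
The parser above expects every trial to appear as a `StudyFields` element whose `FieldValues` children are identified by a `Field` attribute. The snippet below is a minimal sketch of that layout, inferred only from the XPath expressions in `extract_trials`. The `Root` wrapper element, the helper names `field_text` and `parse_sample`, and the sample values are made up for illustration and may not match the real download exactly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the structure that extract_trials reads.
SAMPLE_XML = """
<Root>
  <StudyFields>
    <FieldValues Field="NCTId"><FieldValue>NCT00000000</FieldValue></FieldValues>
    <FieldValues Field="BriefTitle"><FieldValue>Example Trial</FieldValue></FieldValues>
    <FieldValues Field="EligibilityCriteria"><FieldValue>Inclusion Criteria: adults. Exclusion Criteria: none.</FieldValue></FieldValues>
  </StudyFields>
</Root>
"""

def field_text(study, name):
    # Look up one named field; return '' when the field is absent
    return study.findtext(f"./FieldValues[@Field='{name}']/FieldValue", default='')

def parse_sample(xml_text):
    root = ET.fromstring(xml_text.strip())
    parsed = {}
    for study in root.findall('.//StudyFields'):
        parsed[field_text(study, 'NCTId')] = {
            'title': field_text(study, 'BriefTitle'),
            'criteria': field_text(study, 'EligibilityCriteria'),
        }
    return parsed

sample_trials = parse_sample(SAMPLE_XML)
print(sample_trials)
```

If the downloaded file uses different element or attribute names, the XPath strings are the only part that should need changing.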

In [33]:
trials = extract_trials('./data/full_studies.xml')

for nct_id, trial_info in trials.items():
    print(f"NCTId: {nct_id}")
    print(f"Title: {trial_info['title']}")
    print(f"Criteria: {trial_info['criteria']}")
    print("------")

NCTId: NCT01874691
Title: China Acute Myocardial Infarction Registry
Criteria: Inclusion Criteria:

Eligible patients must be admitted within 7 days of acute ischemic symptoms and diagnosed acute ST-elevation or non ST-elevation myocardial infarction. Diagnosis criteria must meet Universal Definition for AMI (2012). All participating hospitals are required to enroll consecutive patients with AMI.

Exclusion Criteria:

Myocardial infarction related to percutaneous coronary intervention and coronary artery bypass grafting.
------
NCTId: NCT03015064
Title: Post-Myocardial Infarction Patients in Santa Catarina, Brazil - Catarina Heart Study
Criteria: Inclusion Criteria:

Age over 18 years;
Presence of precordial pain suggestive of acute myocardial infarction associated with electrocardiogram with new ST segment elevation in two contiguous leads with limits: ≥0.1 mv in all leads other than leads V2-V3 where the following limits apply : ≥0.2 mv in Men ≥40 years; ≥0.25 mV in men <40 years, or

In [34]:
import re

# Phrases that introduce the inclusion and exclusion sections.
# Joining them with '|' keeps stray indentation out of the patterns
# (backslash continuations inside a string literal would embed it).
INCLUSION_MARKERS = re.compile('|'.join([
    r'For inclusion in the study patients should fulfil\s+the following criteria:',
    r'For inclusion in the study patient should fulfil the following criteria:',
    r'will be included in the study:',
    r'Inclusion:', r'Inclusion Criteria', r'Inclusion criteria', r'INCLUSION CRITERIA',
    r'inclusion Criteria', r'inclusion criteria',
]))
EXCLUSION_MARKERS = re.compile('|'.join([
    r'exclusion criteria',
    r'will be excluded from participation in the\s+study:',
    r'Exclusion:', r'Exclusion Criteria', r'Exclusion criteria', r'EXCLUSION CRITERIA',
    r'exclusion Criteria',
]))

def extract_snippets(trials):
    cur_inclusions = {}
    for nct_id, each_trial in trials.items():
        # Keep the text after the first inclusion marker (or all of it if none is found)
        cur_split = INCLUSION_MARKERS.split(each_trial['criteria'] or '', maxsplit=1)[-1]
        # Keep the text before the first exclusion marker
        cur_split = EXCLUSION_MARKERS.split(cur_split, maxsplit=1)

        # Normalise whitespace and drop the ': ' left over from the marker
        cur_inclusion = cur_split[0].replace('\n', ' ')
        cur_inclusion = re.sub(r'\t', '', cur_inclusion)
        cur_inclusion = re.sub(r'\s+', ' ', cur_inclusion)
        cur_inclusion = re.sub(r'^: ', '', cur_inclusion)

        cur_inclusions[nct_id] = cur_inclusion

    return cur_inclusions
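
To make the splitting step concrete, here is a simplified sketch that handles only the two most common headings, `Inclusion Criteria` and `Exclusion Criteria`. The function name `inclusion_only` and the sample exclusion text are illustrative assumptions. Unlike `extract_snippets`, it also strips trailing whitespace.

```python
import re

def inclusion_only(criteria):
    # Everything after the first "Inclusion Criteria" heading (whole text if absent)
    after_inclusion = re.split(r'Inclusion Criteria', criteria, maxsplit=1)[-1]
    # Everything before the first "Exclusion Criteria" heading
    before_exclusion = re.split(r'Exclusion Criteria', after_inclusion, maxsplit=1)[0]
    # Collapse runs of whitespace, then drop the ': ' left behind by the heading
    cleaned = re.sub(r'\s+', ' ', before_exclusion)
    return re.sub(r'^: ', '', cleaned).strip()

# Made-up criteria text in the same shape as the trial output shown earlier
sample = "Inclusion Criteria:\n\nAge over 18 years;\n\nExclusion Criteria:\n\nPregnancy."
print(inclusion_only(sample))  # Age over 18 years;
```

Text without either heading passes through unchanged apart from whitespace normalisation, which matches the fallback behaviour of taking `[-1]` and `[0]` from the split results.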

In [40]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Function to encode a list of texts with an already-loaded model
def encode_snippets(model, texts):
    embeddings = model.encode(texts)
    return embeddings

# Function to find cosine similarity and sort by similarity
def find_similarity(model_name, query, trial_criteria):
    # Load the model once and reuse it for both the query and the snippets
    model = SentenceTransformer(model_name)
    encoded_query = encode_snippets(model, [query])[0]
    encoded_snippets = encode_snippets(model, list(trial_criteria.values()))

    similarities = cosine_similarity([encoded_query], encoded_snippets)[0]
    similarity_dict = {nct_id: sim for nct_id, sim in zip(trial_criteria.keys(), similarities)}

    # Highest similarity first
    sorted_similarity = sorted(similarity_dict.items(), key=lambda x: x[1], reverse=True)
    sorted_trials = [(nct_id, trial_criteria[nct_id], sim) for nct_id, sim in sorted_similarity]

    return sorted_trials


query = "A 45 year old with a clinical diagnosis of ST-segment elevation acute myocardial infarction."
trial_criteria = extract_snippets(trials)

#With trained model
model_name = 'sravn/msmarco-clincalbert'

sorted_trials = find_similarity(model_name, query, trial_criteria)
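
The ranking relies on cosine similarity, which for two vectors u and v is their dot product divided by the product of their lengths, so it depends only on direction and not on magnitude. The following is a minimal pure-Python sketch of the same ranking step using toy three-dimensional vectors. The vectors and trial names are invented for illustration and are not real embeddings.

```python
import math

def cosine(u, v):
    # dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings"; real sentence embeddings have hundreds of dimensions
query_vec = [1.0, 0.0, 1.0]
snippet_vecs = {
    'trial_a': [2.0, 0.0, 2.0],  # same direction as the query, different length
    'trial_b': [1.0, 1.0, 0.0],  # partially aligned with the query
    'trial_c': [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Rank snippets by decreasing similarity to the query
ranked = sorted(
    ((name, cosine(query_vec, vec)) for name, vec in snippet_vecs.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

`trial_a` scores about 1.0 even though it is twice as long as the query, which is why long and short criteria snippets can be compared on the same scale.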

In [41]:
for idx, snippet, similarity in sorted_trials[:5]:
    print(f"NCT ID {idx}: Similarity {similarity}")
    print(trials[idx]['title'])
    print("---------")

NCT ID NCT01484158: Similarity 0.6799761056900024
Gait Speed for Predicting Cardiovascular Events After Myocardial Infarction
---------
NCT ID NCT01109225: Similarity 0.6051940321922302
Relation Between Aldosterone and Cardiac Remodeling After Myocardial Infarction
---------
NCT ID NCT01874691: Similarity 0.5992917418479919
China Acute Myocardial Infarction Registry
---------
NCT ID NCT04957719: Similarity 0.5670695900917053
Selatogrel Outcome Study in Suspected Acute Myocardial Infarction
---------
NCT ID NCT03015064: Similarity 0.5491398572921753
Post-Myocardial Infarction Patients in Santa Catarina, Brazil - Catarina Heart Study
---------


## Trying other models trained on MS-MARCO

In [42]:
models = ['sentence-transformers/msmarco-bert-base-dot-v5', 'Capreolus/bert-base-msmarco']

for model in models:
    sorted_trials = find_similarity(model, query, trial_criteria)

    # Print the top two trials reported by each model
    print(f"Model: {model}")
    for idx, snippet, similarity in sorted_trials[:2]:
        print(f"NCT ID {idx}: Similarity {similarity}")
        print(trials[idx]['title'])
        print("---------")


Model: sentence-transformers/msmarco-bert-base-dot-v5
NCT ID NCT01484158: Similarity 0.9552577137947083
Gait Speed for Predicting Cardiovascular Events After Myocardial Infarction
---------
NCT ID NCT01874691: Similarity 0.9344902634620667
China Acute Myocardial Infarction Registry
---------


Model: Capreolus/bert-base-msmarco
NCT ID NCT01109225: Similarity 0.9649959206581116
Relation Between Aldosterone and Cardiac Remodeling After Myocardial Infarction
---------
NCT ID NCT03412435: Similarity 0.954321563243866
Asan Medical Center Myocardial Infarction Registry
---------
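
The raw scores differ noticeably between models (about 0.68 for the top hit of the first model versus about 0.95 for the others), and scores from different embedding spaces may not be directly comparable. One hedged way to compare the models is to look at how much their top-ranked trials overlap. The sketch below computes the Jaccard overlap of the top-2 NCT IDs printed above; the helper `top_k_overlap` is an illustrative assumption and not part of the original code.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    # Jaccard overlap between the top-k entries of two rankings
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Top-2 NCT IDs per model, copied from the outputs above
rankings = {
    'sravn/msmarco-clincalbert': ['NCT01484158', 'NCT01109225'],
    'sentence-transformers/msmarco-bert-base-dot-v5': ['NCT01484158', 'NCT01874691'],
    'Capreolus/bert-base-msmarco': ['NCT01109225', 'NCT03412435'],
}

baseline = 'sravn/msmarco-clincalbert'
overlaps = {
    name: top_k_overlap(rankings[baseline], ids, k=2)
    for name, ids in rankings.items()
    if name != baseline
}
for name, overlap in overlaps.items():
    print(f"{name}: top-2 overlap with {baseline} = {overlap:.2f}")
```

With only two results per model this is a very coarse signal; a larger `k` over the full `sorted_trials` lists would give a more meaningful comparison.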
