For both patient and clinician text, we want to find terms that can be mapped by BioBERT but not quickUMLS, since we want know already that BioBERT maps more terms

# QuickUMLS BERT comparison

Loading in QuickUMLS, BERT using WSL or other linux distribution

In [1]:
import pandas as pd
import json
from quickumls import QuickUMLS
import ast

  from .autonotebook import tqdm as notebook_tqdm


Load QuickUMLS and semtype dictionary searcher run NER. Run under WSL or other Linux environment since quickumls installation works better on linux. Takes around 30s

In [2]:
quickumls_db = "/mnt/c/Users/maxji/.data/bio/quickUMLS" #root folder where quickumls is downloaded

matcher = QuickUMLS(
    quickumls_db,
    threshold=0.7,
    similarity_name='jaccard'
)

loading initialization

In [3]:
mrsty_path = "/mnt/c/Users/maxji/.data/bio/umls/2025AA/umls-2025AA-metathesaurus-full/2025AA/META/MRSTY.RRF"
def load_semtype_labels(mrsty_path):
    df = pd.read_csv(
        mrsty_path,
        sep='|',
        header=None,
        dtype=str,
        engine='python'
    )
    # Drop the last empty column if present
    if df.shape[1] > 6:
        df = df.iloc[:, :6]
    df.columns = ['CUI', 'TUI', 'STN', 'STY', 'ATUI', 'CVF']
    df = df[['TUI', 'STY']].drop_duplicates()
    return dict(zip(df['TUI'], df['STY']))
# Example usage
semtype_dict = load_semtype_labels(mrsty_path)
print(semtype_dict['T047'])  # Disease or Syndrome

Disease or Syndrome


Loading BioBERT
Installed locally using commands:

git clone https://huggingface.co/d4data/biomedical-ner-all     
cd biomedical-ner-all   
git lfs pull      

In [4]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("./biomedical-ner-all")
model = AutoModelForTokenClassification.from_pretrained("./biomedical-ner-all")

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu
pipe("""The patient reported no recurrence of palpitations at follow-up 6 months after the ablation.""")

[{'entity_group': 'Sign_symptom',
  'score': 0.9999311,
  'word': 'pal',
  'start': 38,
  'end': 41},
 {'entity_group': 'Sign_symptom',
  'score': 0.9063318,
  'word': '##pitations',
  'start': 41,
  'end': 50},
 {'entity_group': 'Clinical_event',
  'score': 0.99975544,
  'word': 'follow',
  'start': 54,
  'end': 60},
 {'entity_group': 'Date',
  'score': 0.999867,
  'word': '6 months after',
  'start': 64,
  'end': 78}]

# Part 1: reading in our past dataset

In [5]:
df_mini_pairs = pd.read_csv("english-train-paired-conversations.csv")
print("data length:", len(df_mini_pairs)) 
df_mini_pairs.head(n=5)

data length: 600


Unnamed: 0.1,Unnamed: 0,description,utterances,input,output
0,0,throat a bit sore and want to get a good imune...,['patient: throat a bit sore and want to get a...,patient: throat a bit sore and want to get a g...,doctor: during this pandemic. throat pain can ...
1,1,"hey there i have had cold ""symptoms"" for over ...","['patient: hey there i have had cold ""symptoms...","patient: hey there i have had cold ""symptoms"" ...",doctor: yes. protection. it is not enough symp...
2,2,i have a tight and painful chest with a dry co...,['patient: i have a tight and painful chest wi...,patient: i have a tight and painful chest with...,"doctor: possible. top symptoms include fever, ..."
3,3,what will happen after the incubation period f...,['patient: what will happen after the incubati...,patient: what will happen after the incubation...,doctor: in brief: symptoms if you are infected...
4,4,suggest treatment for pneumonia,['patient: just found out i was pregnant. yest...,patient: just found out i was pregnant. yester...,doctor: thanks for your question on healthcare...


# Using the QuickUMLS model on a single data sample, data format

Testing data on a single datapoint.

Split based on last : found

In [6]:
#
index = 0
patient = df_mini_pairs.iloc[index]["input"].replace("patient:","").strip() #raw patient string replacing doctor, patient
clinician = df_mini_pairs.iloc[index]["output"].replace("doctor:","").strip() #raw clinican stinrg
print(patient,"\n",clinician)

throat a bit sore and want to get a good imune booster, especially in light of the virus. please advise. have not been in contact with nyone with the virus. 
 during this pandemic. throat pain can be from a strep throat infection (antibiotics needed), a cold or influenza or other virus, or from some other cause such as allergies or irritants. usually, a person sees the doctor (call first) if the sore throat is bothersome, recurrent, or doesn't go away quickly. covid-19 infections tend to have cough, whereas strep throat usually lacks cough but has more throat pain. (3/21/20)


In [7]:
matches = matcher.match(patient, best_match=True, ignore_syntax=False)
for x in matches:
    #output all matches
    for final in x:
        print("term:", final["term"])
        semtypes_list = list(final["semtypes"])
        print("semtypes:", semtypes_list) #convert 
        #now, convert semtypes to codes using the dictionary matcher
        semtypes_matches = []
        for type in semtypes_list:
            semtypes_matches.append(semtype_dict[type])
        print("semtype matches:", semtypes_matches)
        print(final)

term: booster
semtypes: ['T061']
semtype matches: ['Therapeutic or Preventive Procedure']
{'start': 47, 'end': 54, 'ngram': 'booster', 'term': 'booster', 'cui': 'C0020975', 'similarity': 1.0, 'semtypes': {'T061'}, 'preferred': 1}
term: contact
semtypes: ['T067']
semtype matches: ['Phenomenon or Process']
{'start': 122, 'end': 129, 'ngram': 'contact', 'term': 'contact', 'cui': 'C0392367', 'similarity': 1.0, 'semtypes': {'T067'}, 'preferred': 1}
term: contact
semtypes: ['T170']
semtype matches: ['Intellectual Product']
{'start': 122, 'end': 129, 'ngram': 'contact', 'term': 'contact', 'cui': 'C3245509', 'similarity': 1.0, 'semtypes': {'T170'}, 'preferred': 1}
term: throat
semtypes: ['T023']
semtype matches: ['Body Part, Organ, or Organ Component']
{'start': 0, 'end': 6, 'ngram': 'throat', 'term': 'throat', 'cui': 'C3665375', 'similarity': 1.0, 'semtypes': {'T023'}, 'preferred': 1}
term: throats
semtypes: ['T023']
semtype matches: ['Body Part, Organ, or Organ Component']
{'start': 0, 'end'

Creating a generalized function to help find matches 


In [8]:
#returns a dictionary of terms to a tuple of semtype_list and matched terms_list with the same length

def quickumls_matcher(text):
    matches = matcher.match(text, best_match=True, ignore_syntax=False)
    terms_dictionary = {}
    for x in matches:
        #output all matches
        for final in x:
            semtypes_list = list(final["semtypes"])
            #now, convert semtypes to codes using the dictionary matcher
            semtypes_matches = []
            for type in semtypes_list:
                semtypes_matches.append(semtype_dict[type])
            terms_dictionary[final["term"]] = (semtypes_list, semtypes_matches)
    return terms_dictionary

In [9]:
text = "The patient was given aspirin for pain relief."
clinician_entity_dict = {}
terms_dictionary = quickumls_matcher(clinician)
print(terms_dictionary)
for term in terms_dictionary: #each term is a key
    data = terms_dictionary[term]
    semtypes_list = data[0]
    semtypes_matches = data[1]

{'throat infection': (['T047'], ['Disease or Syndrome']), 'strep throat': (['T047'], ['Disease or Syndrome']), 'throat pain': (['T184'], ['Sign or Symptom']), 'Throat pain': (['T184'], ['Sign or Symptom']), 'sore throat': (['T184'], ['Sign or Symptom']), 'No sore throat': (['T033'], ['Finding']), 'infections': (['T046'], ['Pathologic Function']), 'infection': (['T047'], ['Disease or Syndrome']), 'Reinfections': (['T046'], ['Pathologic Function']), 'Coinfections': (['T047'], ['Disease or Syndrome']), 'Infections': (['T046'], ['Pathologic Function']), 'Re-infections': (['T046'], ['Pathologic Function']), 'Co-infections': (['T047'], ['Disease or Syndrome']), 'infections op': (['T047'], ['Disease or Syndrome']), 'gi infections': (['T047'], ['Disease or Syndrome']), 'Reinfection': (['T046'], ['Pathologic Function']), 'Coinfection': (['T047'], ['Disease or Syndrome']), 'influenza': (['T047'], ['Disease or Syndrome']), 'influenza B': (['T047'], ['Disease or Syndrome']), 'influenza C': (['T047

## Looping to extract a set of terms, not entities

Finding a set of medical terms, not entities, for both QUICKUMLS and BioBERT

In [10]:
#helper function to get list of terms corresponding to a entity name for BioBERT
def getEntityWordsBioBERT(input : list, name: str):
    if input == None:
        return []
    result = []
    for x in input:
        if(x["entity_group"] == name):
            result.append(x["word"])
    return result

In [11]:
#helper function to get list of terms corresponding to a entity name
def getEntityUMLSWords(input : dict, name: str):
    if input == None:
        return []
    result = []
    for term in input:
        data = input[term]
        semtypes_list = data[0]
        semtypes_matches = data[1]
        for entity in semtypes_matches:
            if entity == name:
                result.append(term)

    return result

In [12]:
#find terms mapped by BioBERT but not quickUMLS

patient_entity_terms = []
clinician_entity_terms = []

patient_BERT_mismatch_type = [] #give the name of the mismatched entity that corresponds to the mismatch
clinician_BERT_mismatch_type = []

for index,row in df_mini_pairs.iterrows():
    input = row["input"]
    output = row["output"]
    
    try:  
        #first get list of terms for both BERT and quickUMLS
        BERTterms = []
        quickUMLSterms = []

        patient =  input.replace("patient:","").strip()
        terms_dictionary = quickumls_matcher(patient)
        patient_results = terms_dictionary
        #loop through the entities. Each term is a key
        for term in terms_dictionary:
            data = terms_dictionary[term]
            semtypes_list = data[0]
            semtypes_matches = data[1]
            #add matches to the set
            quickUMLSterms.append(term)
        
        #loop through BERT results data
        patient_results = pipe(patient)
        #loop through the entities
        for diction in patient_results:
            curr_entity = diction["entity_group"]
            BERTterms.append(diction["word"])
        diff = set(BERTterms) - set(quickUMLSterms)
        patient_entity_terms += list(diff)

    except Exception as e:
        print(f"An error occurred in patient data: {e}")
        pass

    try: 
        #first get list of terms for both BERT and quickUMLS
        BERTterms = []
        quickUMLSterms = []

        clinician =  output.replace("doctor:","").strip()
        terms_dictionary = quickumls_matcher(clinician)
        clinician_results = terms_dictionary
        #loop through the entities. Each term is a key
        for term in terms_dictionary:
            data = terms_dictionary[term]
            semtypes_list = data[0]
            semtypes_matches = data[1]
            #add matches to the set
            quickUMLSterms.append(term)
        
        #loop through BERT results data
        clinician_results = pipe(patient)
        #loop through the entities
        for diction in clinician_results:
            curr_entity = diction["entity_group"]
            BERTterms.append(diction["word"])
        diff = set(BERTterms) - set(quickUMLSterms)
        clinician_entity_terms += list(diff)

    except:
        print("no clinican mapping at index:", index)
        pass 




Printing out missing terms list for patients

In [13]:
print(len(patient_entity_terms))
print(len(patient_entity_terms))
patient_entity_terms

4133
4133


['for the past two days',
 'over a week',
 'low',
 'grade',
 'dizzy',
 'fever',
 'dr',
 'headache',
 'tight',
 'dry',
 '19',
 'delivered',
 'mis',
 '##ges',
 '38 weeks',
 'miscarried',
 'ce',
 'every',
 'high',
 'daughter',
 '##one',
 '##rest',
 'gave',
 'pro',
 'talking',
 '36',
 'dirrhea',
 '10 month old',
 'vaccination',
 '##les',
 '##as',
 'son',
 'due',
 'coronavirus pandemic',
 'me',
 'wife',
 'er',
 '##indamycin',
 'evenings',
 'thru',
 'culture',
 '##hr',
 '##day',
 'np',
 '##nis',
 'lev',
 'cl',
 '##yt',
 'hem',
 'blood',
 'pre',
 '##omycin',
 '##month',
 '1',
 'positive',
 '##d',
 'ga',
 'cough',
 'slight',
 'co',
 'dry',
 '##vid 19',
 'fever',
 'moon',
 'loop neus',
 'fr',
 '19',
 'slight',
 'covid',
 'sore',
 'tired',
 'corona',
 'woke',
 'this',
 '18 month old',
 'a week later',
 'cough',
 '2',
 'dry',
 '3',
 'nausea',
 'weeks',
 'chest',
 'advanced',
 'antibotic',
 'odor',
 '1 week',
 '. s',
 'clear',
 'husband',
 'strange',
 'this',
 '10 days',
 'x ray',
 'te',
 'ncov',


Printing out missing terms list for clinicians

In [14]:
print(len(clinician_entity_terms))
print(len(clinician_entity_terms))
clinician_entity_terms

5087
5087


['sore',
 'throat',
 'for the past two days',
 'over a week',
 'grade',
 'cold',
 'dizzy',
 'fever',
 'dr',
 'chest',
 'tight',
 'coronavirus',
 'dry',
 'headache',
 'painful',
 '19',
 'delivered',
 'birth',
 'mis',
 'low',
 'miscarried',
 '38 weeks',
 '##ges',
 'ce',
 'every',
 'high',
 'progesterone',
 'daughter',
 '##one',
 'weak',
 '##rest',
 'gave',
 'pro',
 'coronavirus',
 'talking',
 'virus',
 '36',
 'dirrhea',
 '10 month old',
 'vaccination',
 '##les',
 '##as',
 'son',
 'due',
 'coronavirus pandemic',
 'me',
 'wife',
 'evenings',
 '##omycin',
 'er',
 'thru',
 'np',
 'hem',
 'lev',
 '##d',
 '##hr',
 '##day',
 'cl',
 'medications',
 '1',
 'positive',
 '##indamycin',
 'low',
 'culture',
 '##nis',
 '##yt',
 'blood',
 '##month',
 'pre',
 'ga',
 'cough',
 'slight',
 'co',
 'dry',
 '##vid 19',
 'fever',
 'moon',
 'loop neus',
 'fr',
 '19',
 'slight',
 'covid',
 'sore',
 'tired',
 'nap',
 'woke',
 'this',
 '18 month old',
 'corona',
 'chest',
 'a week later',
 'cough',
 '2',
 'dry',
 '

creating a count dictionary 

In [18]:
from collections import Counter

patient_UMLS_unrecognized = Counter(patient_entity_terms)
patient_UMLS_unrecognized = dict(patient_UMLS_unrecognized)
patient_UMLS_unrecognized = sorted(patient_UMLS_unrecognized.items(), key=lambda  item: item[1], reverse=True)

print(len(patient_UMLS_unrecognized))
patient_UMLS_unrecognized

2227


[('co', 85),
 ('sore', 44),
 ('##vid - 19', 37),
 ('corona', 34),
 ('throat', 33),
 ('19', 32),
 ('hospital', 32),
 ('##vid', 30),
 ('dry', 28),
 ('chest', 27),
 ('son', 22),
 ('mild', 22),
 ('home', 21),
 ('doctor', 20),
 ('headache', 19),
 ('sick', 19),
 ('daughter', 18),
 ('cough', 18),
 ('slight', 18),
 ('tight', 16),
 ('?', 16),
 ('dr', 15),
 ('s', 15),
 ('went', 15),
 ('fever', 14),
 ('2', 14),
 ('corona virus', 14),
 ('run', 14),
 ('covid - 19', 13),
 ('temperature', 13),
 ('hi', 13),
 ('travelled', 12),
 ('##vid19', 12),
 ('high', 11),
 ('##vid 19', 11),
 ('this morning', 11),
 ('ache', 11),
 ('mother', 11),
 ('bro', 11),
 ('night', 11),
 ('it', 11),
 ('3', 10),
 ('husband', 10),
 ('yesterday', 10),
 ('mom', 10),
 ('wife', 9),
 ('1', 9),
 ('cold', 9),
 ('am', 9),
 ('constant', 9),
 ('a week', 9),
 ('pain', 9),
 ('better', 9),
 ('dia', 8),
 ('med', 8),
 ('4', 8),
 ('lung', 8),
 ('a', 8),
 ('mu', 8),
 ('ct', 8),
 ('heavy', 8),
 ('right', 8),
 ('type 1 diabetes', 8),
 ('te', 7),
 

In [17]:

clinician_UMLS_unrecognized = Counter(clinician_entity_terms)
clinician_UMLS_unrecognized = dict(clinician_UMLS_unrecognized)
#sort
clinician_UMLS_unrecognized = sorted(clinician_UMLS_unrecognized.items(), key=lambda  item: item[1], reverse=True)
print(len(clinician_UMLS_unrecognized))
clinician_UMLS_unrecognized

2509


[('co', 85),
 ('throat', 62),
 ('sore', 59),
 ('cough', 57),
 ('fever', 50),
 ('chest', 50),
 ('pneumonia', 47),
 ('coronavirus', 43),
 ('##vid - 19', 37),
 ('headache', 36),
 ('corona', 34),
 ('19', 32),
 ('coughing', 32),
 ('hospital', 32),
 ('##vid', 30),
 ('dry', 28),
 ('son', 22),
 ('mild', 22),
 ('home', 21),
 ('doctor', 20),
 ('pain', 20),
 ('antibiotics', 20),
 ('sick', 19),
 ('back', 19),
 ('daughter', 18),
 ('slight', 18),
 ('symptoms', 18),
 ('cold', 17),
 ('tight', 16),
 ('flu', 16),
 ('?', 16),
 ('dr', 15),
 ('nose', 15),
 ('s', 15),
 ('went', 15),
 ('2', 14),
 ('severe', 14),
 ('corona virus', 14),
 ('run', 14),
 ('ache', 13),
 ('covid - 19', 13),
 ('temperature', 13),
 ('hi', 13),
 ('head', 13),
 ('travelled', 12),
 ('##vid19', 12),
 ('high', 11),
 ('##vid 19', 11),
 ('tired', 11),
 ('this morning', 11),
 ('mother', 11),
 ('lung', 11),
 ('bro', 11),
 ('night', 11),
 ('it', 11),
 ('low', 10),
 ('positive', 10),
 ('3', 10),
 ('husband', 10),
 ('yesterday', 10),
 ('worse', 

save in a short csv form