# MedTrix

This is a project to generate medical records

## Part Two: Input Text and Similarity

## Notebook Settings

#### Path

In [6]:
from pathlib import Path
import os

# Sets base path
b_path = Path.home() / 'Development' / 'medtrix'
env_path = Path.home() / 'anaconda3' / 'envs' / 'conda_medtrix_env'
os.chdir(b_path)
!ls

d_path = b_path / 'dataset'
n_path = b_path / 'notebook'
s_path = b_path / 'scripts'
list_fake_path = d_path / 'lists_fake_data'

dataset  environment.yml  images  logs.log  mlruns  notebook  scripts


#### Install

In [None]:
!which python

In [11]:
!python --version

Python 3.10.4


In [None]:
!pip install requests==2.25.1

In [None]:
!pip install spacy==3.4.0

In [None]:
!python -m spacy info

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
!pip install medspacy

In [None]:
!pip install pyrush

In [None]:
!pip install cycontext

In [None]:
!pip install openpyxl

In [None]:
!pip install date-extractor

In [None]:
!pip install nameparser

In [None]:
%cd {s_path}/genderComputer
!python setup.py install

In [None]:
!pip install nltk==3.7

In [None]:
!pip install https://huggingface.co/kormilitzin/en_core_med7_lg/resolve/main/en_core_med7_lg-any-py3-none-any.whl

In [2]:
import nltk
nltk.download('wordnet', download_dir=env_path / 'nltk_data')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/leobit/anaconda3/envs/conda_medtrix_env/nltk_dat
[nltk_data]     a...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/leobit/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
!pip install datefinder

#### Configuration

In [7]:
%load_ext autoreload
%autoreload 2

In [8]:
# Displacy Options
options = {"distance": 80,"bg": "#09a3d5",
               "compact":True, "collapse_punct":False,
           "color": "white", "font": "Source Sans Pro"}

In [11]:
pd.set_option('display.max_columns', None)

#### Import

In [10]:
import pandas as pd
import re
import pickle
import spacy
import random
import string
import spacy_stanza
import stanza
import medspacy
import datefinder

from medspacy.context import ConTextComponent, ConTextRule
from medspacy.ner import TargetRule
from medspacy.visualization import visualize_dep
from spacy import displacy
from spacy.pipeline import Sentencizer
from spacy.tokens import Span
from collections import defaultdict
from transformers import AutoTokenizer, AutoModel, AutoModelForTokenClassification, pipeline
from scripts.placeholdermapper import PlaceholderMapper
from negspacy.negation import Negex
from genderComputer import GenderComputer
from numpy import dot
from numpy.linalg import norm
from ast import literal_eval
from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk.corpus import wordnet 

  from .autonotebook import tqdm as notebook_tqdm
  def resize(self, image, size, resample=PIL.Image.BILINEAR, default_to_square=True, max_size=None):


In [12]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## System Planning

### 1- Table Formation: [x]

- (Literal) Age, Gender 
- (Chemicals, Allergens) Allergies
- (Problems) Chief Complaint, History Illness, Discharge Diagnosis, Hospital Course
- (Historical Problems) Past Medical History
- (Attention) Social History


### 2 - Text Detection: [x]

- (De-Identification [deid_bert_i2b2]) Admission Date, Patient Name, Age, Doctor Name, Hospital Name  
- (StandfordNLP [spacy-stanza]) Problems  
- (scispaCy & MedspaCy) Allergens, Attention Words, Context Rules - Allergy/Negation/Past Medical History
- (Med7) Medication Allergens, Context Rules


### 3 - Replace: [x]

- Patient Name  
- Patient Age  
- Allergies  
- Chief Complaint  
- Doctor Name  
- Hospital Name  
- Birthdate  
- Admission Date  
- Other Dates  


### 4 - Similarity: [x]

- Match:
Age, Sex
<br>

- Similarity System:  
1- Possible Candidates (Jaccard Similarity)  
2- Best one possible (UMLSBert Similarity)  
<br>

- Weighted Similarity:  
(Problems) Chief Complaint, History of Present Illness, Discharge Diagnosis, Hospital Course 
<br>

- Similarity:  
(Attention Words) Social History  
(Historical Problems - Context Rule) Past Medical History  
<br>


### 5 - AI Generation:  

- History of Present Illness  
- Social History  
- Brief Hospital Course/Hospital Course

## Load MIMIC

### MIMIC-III

MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge).

You can read more about MIMIC-IV from the following resources:

* [The MIMIC-III PhysioNet project page](https://physionet.org/content/mimiciii/1.4/)
* [The MIMIC-III online documentation](https://mimic.mit.edu/)

In [38]:
# Load MIMIC-III
df_mimic_full = pd.read_csv(d_path / "mimic-iii.csv", index_col=0)

# Replace wrong texts
mimic_replace_d = {
    ":[**":": [**",
    "#:":":",
    "\n\nD:":"\n\nDate:"
}
for orig, repl in mimic_replace_d.items():
    df_mimic_full['TEXT'] = df_mimic_full['TEXT'].apply(lambda x: x.replace(orig, repl))

## Age, Patient, Doctor, Dates and Hospital Recognition

#### De-Identification Model

In [87]:
tokenizer = AutoTokenizer.from_pretrained("obi/deid_bert_i2b2")
model = AutoModelForTokenClassification.from_pretrained("obi/deid_bert_i2b2")

In [88]:
deid_nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="max")

In [89]:
text = "Physician Discharge Summary Admit date: 10/12/1982 Discharge date: 10/22/1982 Patient Information Jack Reacher, 54 y.o. male (DOB = 1/21/1928). Home Address: 123 Park Drive, San Diego, CA, 03245. Home Phone: 202-555-0199 (home). Hospital Care Team Service: Orthopedics Inpatient Attending: Roger C Kelly, MD Attending phys phone: (634)743-5135 Discharge Unit: HCS843 Primary Care Physician: Hassan V Kim, MD 512-832-5025."
ents = deid_nlp(text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [462]:
nlp_bl = spacy.blank("en")

def join_ents(text,ents):
    new_ents = []
    for idx, ent in enumerate(ents):
        if not idx:
            new_ents.append(ent)
            continue
        
        ant_ent = new_ents[-1]
        interval = text[ant_ent['end']:ent['start']]
        if len(interval)<=1:
            if ant_ent['entity_group']==ent['entity_group']:
                new_start = ant_ent['start']
                new_end = ent['end']
                ent_text = text[new_start:new_end]
                new_ents[-1]['start'] = new_start
                new_ents[-1]['end'] = new_end
                new_ents[-1]['word'] = ent_text
                continue
        
        new_ents.append(ent)
    
    return new_ents

def get_entity_options(ents_l):
    colors={}
    for ent in ents_l:
        colors[ent]=("#"+''.join([random.choice(string.hexdigits) for i in range(6)])).upper()
    
    options = {"ents": ents_l, "colors": colors}
    return options

def show_transformer_ents(text, ents):
    doc = nlp_bl(text)
    spans_list = []
    ents_l =list(set(ent['entity_group'] for ent in ents))
    options=get_entity_options(ents_l)
    
    for ent in ents:
        start = ent['start']
        end = ent['end']
        tok_start = len(nlp_bl(text[:start]))
        tok_end = tok_start + len(nlp_bl(text[start:end]))
        span = doc[tok_start:tok_end]
        span.label_ = ent['entity_group']
        spans_list.append(span)
    
    doc.ents = spans_list
    
    displacy.render(doc, style="ent", options=options)

In [90]:
ents = join_ents(text, ents)
show_transformer_ents(text, ents)

In [91]:
text = "Anne 35F was attended in Naval Hospital Beaufort in 10/12/1982 by his PCP Anand the patient presenting abdominal pain. The patient has history of gastritis, use of tobbaco and alcohol."
ents = deid_nlp(text)


In [467]:
text = "Anne 35F was attended in Naval Hospital Beaufort in 10/12/1982 by his PCP Roger the patient presenting abdominal pain. The patient has history of gastritis, use of tobbaco and alcohol."
ents = deid_nlp(text)
ents = join_ents(text, ents)

In [468]:
ents = join_ents(text, ents)
show_transformer_ents(text, ents)

## Present or Past Medical/Social Reports

In [6]:
english_nlp = spacy.load('en_core_web_sm')
nlp_spacy_stanza = spacy_stanza.load_pipeline('en', package='mimic', processors={'ner': 'i2b2'}, use_gpu=True, verbose=False)

nlp_dis = spacy.load("en_ner_bc5cdr_md")
sci_scispacy_nlp = spacy.load("en_core_sci_sm")
sci_scispacy_nlp.add_pipe("negex")
meds_nlp = medspacy.load()

# meds_nlp = medspacy.load()
# sectionizer = meds_nlp.add_pipe("medspacy_sectionizer")

In [33]:
def get_context_vis(text, problems):
    options = {"distance": 140,"bg": "#09a3d5",
               "compact":True, "collapse_punct":False,
               "color": "white", "font": "Source Sans Pro"}
    
    rules = [TargetRule(problem, 'CONDITION') for problem in problems]
    meds_nlp.get_pipe('medspacy_target_matcher').add(rules)
    doc = meds_nlp(text)
    visualize_dep(doc, jupyter=True, options=options)

def get_context(text, problems):
    pres_problems = []
    hist_problems = []
    rules = [TargetRule(problem, 'CONDITION') for problem in problems]
    meds_nlp.get_pipe('medspacy_target_matcher').add(rules)
    doc = meds_nlp(text)
    for ent in doc.ents:
        if ent._.is_negated:
            continue
        elif ent._.is_historical:
            hist_problems.append(ent.text)
        else:
            pres_problems.append(ent.text)
    
    return pres_problems, hist_problems

def get_problems(sentence):
    doc_stanza = nlp_spacy_stanza(sentence)
    doc_dis_spacy = nlp_dis(sentence)
    problems = [ent.text for ent in doc_stanza.ents if ent.label_=="PROBLEM"]
    diseases = [ent.text for ent in doc_dis_spacy.ents if ent.label_=="DISEASE"]
    
    for dis in diseases:
        if any((dis.lower() in problem.lower()) for problem in problems):
            continue
        problems.append(dis)
        
    return get_context(sentence, problems)

In [25]:
def get_attention_words(sentence):
    doc = sci_scispacy_nlp(sentence)
    attentions = []
    for ent in doc.ents:
        if not ent._.negex:
            attentions.append(ent.text)
            span = Span(doc, ent.start, ent.end, label="ATTENTION")
            doc.ents = [span if e == ent else e for e in doc.ents]
        else:
            attentions.append(ent.text)
            span = Span(doc, ent.start, ent.end, label="NEGATED")
            doc.ents = [span if e == ent else e for e in doc.ents]
            
    return attentions, doc

In [31]:
text = "Anne 35F was attended in Naval Hospital Beaufort in 10/12/1982 by his PCP Anand the patient presenting abdominal pain. The patient has history of gastritis, use of tobbaco and alcohol. Patient has hx of stroke. Mother diagnosed with diabetes. No evidence of pna. The patient can develop cancer"
problems, hist_problems = get_problems(text)

In [None]:
get_context_vis(text, problems)

In [27]:
text = "Anne 35F was attended in Naval Hospital Beaufort in 10/12/1982 by his PCP Anand the patient presenting abdominal pain. The patient has history of gastritis, however he denies use of tobbaco and alcohol. Patient has hx of stroke. Mother diagnosed with diabetes. No evidence of pna. The patient can develop cancer"
attentions, doc = get_attention_words(text)
displacy.render(doc, style="ent", options=options)

## Allergies

In [143]:
dis_nlp = spacy.load("en_ner_bc5cdr_md")

In [51]:
english_nlp = spacy.load("en_core_web_lg")

In [122]:
all_nlp = medspacy.load()

In [124]:
all_nlp.add_pipe('sentencizer', first=True)

<spacy.pipeline.sentencizer.Sentencizer at 0x7f96aeb90e40>

In [70]:
all_terms = ['allergies', 'allergy', 'allergic', 'hypersensitivity', 'hypersensitive', 'sensitive', 'sensitivity']

In [None]:
for text in df_mimic_full['TEXT'].to_list():
    doc = english_nlp(text)
    for sent in doc.sents:
        if any((allerg in sent.text) for allerg in a_terms):
            print(sent.text)
            break

In [None]:
for text in df_mimic_full['TEXT'].to_list():
    doc = english_nlp(text)
    for sent in doc.sents:
        if "allergic" in sent.text.lower():
            print(sent.text)
            break

#### Allergens Database

In [18]:
df_allergens = pd.read_excel(d_path / 'COMPARE-2022.xlsx')

In [19]:
df_allergens.columns = map(lambda x: x.lower().replace(" ","_"), df_allergens.columns)

In [43]:
df_allergens[df_allergens.common_name.str.contains('Shrimp')]

Unnamed: 0,species,common_name,iuis_name,description,gi,accession,parent_accession,length,reference,year_adopted
1380,Litopenaeus vannamei,Whiteleg Shrimp,Lit v 2,arginine kinase,115492980.0,ABI98020.1,,356,"4539, 19443",2008
1411,Marsupenaeus japonicus,Kuruma Shrimp,,tropomyosin fast isoform [Marsupenaeus japonicus],125995159.0,BAF47263.1,,284,4514,2008
1541,Litopenaeus vannamei,Whiteleg Shrimp,Lit v 1,tropomyosin,170791252.0,ACB38288.1,,284,"9497, 17548, 19443",2010
1545,Litopenaeus vannamei,Whiteleg Shrimp,Lit v 3,myosin light chain,184198734.0,ACC76803.1,,177,"8542, 19443",2010
1633,Litopenaeus vannamei,Whiteleg Shrimp,Lit v 4,"calcium-binding protein, sarcoplasmic calcium-...",223403273.0,ACM89179.1,,193,"9497, 19443",2011
1644,Crangon crangon,Shrimp,Cra c 1,tropomyosin,238477263.0,ACR43473.1,,284,17528,2012
1645,Crangon crangon,Shrimp,Cra c 2,arginine kinase,238477265.0,ACR43474.1,,356,17528,2012
1646,Crangon crangon,Shrimp,Cra c 4,sarcoplasmic calcium binding protein,238477327.0,ACR43475.1,,193,17528,2012
1647,Crangon crangon,Shrimp,Cra c 8,triosephosphate isomerase,238477329.0,ACR43476.1,,249,17528,2012
1648,Crangon crangon,Shrimp,Cra c 5,myosin light chain 1,238477331.0,ACR43477.1,,153,17528,2012


In [44]:
allergens_l = df_allergens.common_name.str.lower().unique()

In [45]:
len(allergens_l)

379

#### Add Target Rules:

In [125]:
all_nlp.pipe_names

['sentencizer',
 'medspacy_pyrush',
 'medspacy_target_matcher',
 'medspacy_context']

In [126]:
rules = [TargetRule(allergen, 'ALLERGEN') for allergen in allergens_l]
all_nlp.get_pipe('medspacy_target_matcher').add(rules)

In [155]:
text = "The patient has allergy to Shrimp"
doc = all_nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Shrimp ALLERGEN


#### Add Context Rules:

In [49]:
help(ConTextRule)

Help on class ConTextRule in module medspacy.context.context_rule:

class ConTextRule(medspacy.common.base_rule.BaseRule)
 |  ConTextRule(literal, category, direction='BIDIRECTIONAL', pattern=None, on_match=None, on_modifies=None, allowed_types=None, excluded_types=None, max_scope=None, max_targets=None, terminated_by=None, metadata=None, filtered_types=None, **kwargs)
 |  
 |  A ConTextRule defines a ConText modifier. ConTextRules are rules which define
 |  which spans are extracted as modifiers and how they behave, such as the phrase to be matched,
 |  the category/semantic class, the direction of the modifier in the text, and what types of target
 |  spans can be modfified.
 |  
 |  Method resolution order:
 |      ConTextRule
 |      medspacy.common.base_rule.BaseRule
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, literal, category, direction='BIDIRECTIONAL', pattern=None, on_match=None, on_modifies=None, allowed_types=None, excluded_types=None, max_

In [128]:
context_rules = [
    ConTextRule("<ALLERGY_TERM>", "ALLERGY", 
                rule="FORWARD",
               pattern=[
                   {"LOWER": {"IN": all_terms}},
               ])
]
all_nlp.get_pipe('medspacy_context').add(context_rules)

In [136]:
text = "The patient has allergy to shrimp, peanut and wheat. There is other substances like banana that could be an allergen."

In [142]:
doc = english_nlp(text)
for sent in doc.sents:
    doc = all_nlp(sent.text)
    visualize_dep(doc, jupyter=True, options=options)

In [153]:
def get_allergens(sentence):
    allergens = []
    doc_dis = dis_nlp(sentence)
    doc_en = english_nlp(sentence)
    chemicals = [ent.text for ent in doc_dis.ents if ent.label_=="CHEMICAL"]
    rules = [TargetRule(chemical, 'ALLERGEN') for chemical in chemicals]
    all_nlp.get_pipe('medspacy_target_matcher').add(rules)
    
    for sent in doc_en.sents:
        doc_all = all_nlp(sent.text)
        for ent in doc_all.ents:
            if ent._.modifiers:
                allergens.append(ent.text)
                    
    return allergens

In [154]:
get_allergens("The patient has allergy to shrimp and penicillin. There is other substances like peanut.")

['shrimp', 'penicillin']

## Replace PHI Labels

In [156]:
# Load MIMIC-III
df_mimic_full = pd.read_csv(d_path / "mimic-iii.csv", index_col=0)

# Replace wrong texts
mimic_replace_d = {
    ":[**":": [**",
    "#:":":",
    "\n\nD:":"\n\nDate:"
}
for orig, repl in mimic_replace_d.items():
    df_mimic_full['TEXT'] = df_mimic_full['TEXT'].apply(lambda x: x.replace(orig, repl))

In [40]:
# Pattern to get Topics
add_topics = ['facility', 'HISTORY  OF  THE  PRESENT  ILLNESS(?=\:)', 'Admission Date(?=\:)', 'Discharge Date(?=\:)', 'Sex(?=\:)', 'Chief Complaint(?=\:)', 'Addendum(?=\:)', '(?i)HISTORY OF PRESENT ILLNESS(?=\:)']
pattern = re.compile(f"((?<=\\n\\n)[\w\s]+(?=\:))|{'|'.join(add_topics)}", flags=0)
hpi_p = re.compile("\[\*\*[^\[]*\*\*\]", flags=0)

In [41]:
def get_topics_text(text):
    topics = []
    positions = []
    sections_text = {}
    for m in pattern.finditer(text):
        s = m.group().replace('\n','')
        s = "_".join(s.lower().split())
        topics.append(s)
        positions.append((m.span()[0], m.span()[1]+2))
    for i, topic in enumerate(topics):
        start = positions[i][1]
        try:
            end = positions[i+1][0]
        except:
            end = len(text)-1
        sections_text[topic]=text[start:end].replace('\n',' ')
        
    return sections_text

In [None]:
topics_for_analysis = ['history_of_present_illness', 'chief_complaint', 'social_history', 'past_medical_history']
for idx, doc_text in enumerate(df_mimic_full['TEXT'].to_list()):
    if idx%1000==0:
        sections_text = get_topics_text(doc_text)
        for topic, text in sections_text.items():
            if not topic in topics_for_analysis:continue
            print(f"DOCUMENT: {idx}")
            print(f"TOPIC: {topic}")
            print(f"TEXT: {text}")
            print("")

#### Name Analysis

- Patient Name:[**Known firstname 77781**][**Known lastname 77782**]
- Doctor Name: Attending
- Hospital: facility or [**Hospital1 18**]

In [None]:
topics_for_analysis = ['attending','addendum', 'chief_complaint', 'history_of_present_illness', 'brief_hospital_course', 'hospital_course', 'social_history', 'past_medical_history']
b_s = "\033[1m"
b_e = "\033[0m"
b_len = len(b_s+b_e)
search_l=['name', 'doctor', 'dr.']

for idx, doc_text in enumerate(df_mimic_full['TEXT'].to_list()):
    if idx%1050==0:
        ## SECTIONS
        sections_text = get_topics_text(doc_text)
        for topic, text in sections_text.items():
            if not topic in topics_for_analysis:continue
            pos_l = []
            for hpi in hpi_p.finditer(text):
                if any((term in hpi.group().lower()) for term in search_l):
                    pos_l.append(hpi.span())
            
            if pos_l:
                temp_text = text
                compes = 0
                for start, end in pos_l:
                    start+=compes
                    end+=compes
                    temp_text = temp_text[:start] + b_s + temp_text[start:end] + b_e + temp_text[end:]
                    compes += b_len
                    
                print(f"DOCUMENT: {idx}")
                print(f"TOPIC: {topic}")
                print(f"TEXT: {temp_text}")
                print("")
                continue

In [None]:
topics_for_analysis = ['attending','addendum', 'chief_complaint', 'history_of_present_illness', 'brief_hospital_course', 'hospital_course', 'social_history', 'past_medical_history']
b_s = "\033[1m"
b_e = "\033[0m"
b_len = len(b_s+b_e)
search_l=['hospital']

for idx, doc_text in enumerate(df_mimic_full['TEXT'].to_list()):
    if idx%1050==0:
        ## SECTIONS
        sections_text = get_topics_text(doc_text)
        for topic, text in sections_text.items():
#             if not topic in topics_for_analysis:continue
            pos_l = []
            for hpi in hpi_p.finditer(text):
                if any((term in hpi.group().lower()) for term in search_l):
                    pos_l.append(hpi.span())
            
            if pos_l:
                temp_text = text
                compes = 0
                for start, end in pos_l:
                    start+=compes
                    end+=compes
                    temp_text = temp_text[:start] + b_s + temp_text[start:end] + b_e + temp_text[end:]
                    compes += b_len
                    
                print(f"DOCUMENT: {idx}")
                print(f"TOPIC: {topic}")
                print(f"TEXT: {temp_text}")
                print("")
                continue

In [31]:
hpi_p = re.compile("\\[\*\*([^\[]*)\*\*\]", flags=0)
firstn_p = re.compile("\[\*\*Known firstname \d+\*\*\]", flags=0)
lastn_p = re.compile("\[\*\*Known lastname \d+\*\*\]", flags=0)
hosp_p = re.compile("\[\*\*Hospital1 18\*\*\]")
date_p = re.compile("\[\*\*(\d+)-(\d+)-(\d+)\*\*\]")
plh = PlaceholderMapper()
year_old_l_1 = ['yo', 'y/o', 'year old', 'year-old', 'year-old', 'y.o', 'year o', 'y old']
year_old_l_2 = ['F','M']
year_old_p_1 = re.compile(f"(\d+)(?=\s*\-*({'|'.join(year_old_l_1)}))")
year_old_p_2 = re.compile(f"(\d+)(?=\s*({'|'.join(year_old_l_2)}))")



def replace_age(sentence, age):
    if sentence:
        target = " ".join(sentence.split()[:50])
        res = re.search(year_old_p_1, target)
        if res:
            sentence = sentence.replace(res.group(), age)
            return sentence
        else:
            res = re.search(year_old_p_2, target)
            if res:
                sentence = sentence.replace(res.group(), age)
                return sentence
            else:
                if ('999' in sentence) and (age>=90):
                    sentence = sentence.replace('999', age)
        
    return sentence

def fake_phi_labels(sections_text, **kwargs):
    
    age = str(kwargs.get('AGE'))
    dr_name = kwargs.get('STAFF')
    patient_name = kwargs.get('PATIENT')
    hosp_name = kwargs.get('HOSP')
    adm_date = kwargs.get('DATE')
    
    patient_name_sections = ['history_of_present_illness','brief_hopsital_course', 'hospital_course']
    
    ## Age
    if age:
        hpi = sections_text.get('history_of_present_illness', '')
        hc = sections_text.get('brief_hopsital_course', '')
        if hpi:
            res = replace_age(hpi, age)
            sections_text['history_of_present_illness'] = res
        if hc:
            res = replace_age(hc, age)
            sections_text['brief_hopsital_course'] = res
        else:
            hc = sections_text.get('hopsital_course', '')
            if hc:
                res = replace_age(hc, age)
                sections_text['hopsital_course'] = res
    
    ## Doctor Name
    if dr_name:
        att_text = sections_text.get('attending')
        if att_text:
            for hpi in hpi_p.finditer(att_text):
                att_text = att_text.replace(str(hpi.group()), dr_name)
                break
            sections_text['attending'] = att_text
    
    ## Patient Name
    if patient_name:
        for _s in patient_name_sections:
            if not sections_text.get(_s):continue
            pat_text = sections_text[_s]
            for lastn in lastn_p.finditer(pat_text):
                pat_text = pat_text.replace(str(lastn.group()), patient_name)
            for firstn in firstn_p.finditer(pat_text):
                pat_text = pat_text.replace(str(firstn.group()), patient_name)
            sections_text[_s]=pat_text
    
    ## Hospital Name
    if hosp_name:
        fac_text = sections_text.get('facility')
        if fac_text:
            for hpi in hpi_p.finditer(fac_text):
                if 'hospital' in hpi.group().lower():
                    fac_text = fac_text.replace(str(hpi.group()), hosp_name)
            sections_text['facility'] = fac_text
        
        for section, text in sections_text.items():
            for hosp in hosp_p.finditer(text):
                text = text.replace(str(hosp.group()), hosp_name)
            sections_text[section]=text
            
    ## Admission Date and Other Dates
    if adm_date:
        if isinstance(adm_date, str):
            matches = datefinder.find_dates(text)
            if matches:
                for date in matches:
                    print(date)
                    day = date.day
                    month = date.month
                    year = date.year
                    break
        
        if isinstance(adm_date, tuple):
            date = adm_date[0]
            day = date.day
            month = date.month
            year = date.year
            
        new_adm_date = f"{year}-{month}-{day}"    
        adm_text = sections_text.get('admission_date')
        if adm_text:
            for hpi in date_p.finditer(adm_text):
                adm_text = adm_text.replace(str(hpi.group()), new_adm_date)
                adm_year_fake = int(hpi.group(1))
                adm_month_fake = int(hpi.group(2))
                adm_day_fake = int(hpi.group(3))
                break

            sections_text['admission_date'] = adm_text

            for section, text in sections_text.items():
                replaces = []
                for hpi in date_p.finditer(text):
                    date_y = int(hpi.group(1))
                    date_m = int(hpi.group(2))
                    date_d = int(hpi.group(3))
                    if year:
                        diff = date_y - adm_year_fake
                        new_y = year + diff
                    else:
                        new_y = 0
                    if month:
                        diff = date_m - adm_month_fake
                        new_m = month + diff
                    else:
                        new_m = 0
                    if day:
                        diff = date_d - adm_day_fake
                        new_d = day + diff
                    else:
                        new_d = 0
                    new_date = f"{new_y}-{new_m}-{new_d}"
                    replaces.append((hpi.group(0), new_date))
                for orig, replace in replaces:
                    text = text.replace(orig, replace)
                sections_text[section] = text
        
    
    ## Other PHI Labels
    for section, text in sections_text.items():
        replaces = []
        for hpi in hpi_p.finditer(text):
            new_text = plh.get_mapping(hpi.group())
            replaces.append((hpi.group(), new_text))
        for orig, replace in replaces:
            text = text.replace(orig, replace)
        sections_text[section] = text
        
    return sections_text

In [50]:
sections_text = fake_phi_labels(df_mimic_full['TEXT'].loc[4356], age='73', patient_name='Anne', dr_name='Drauzio', hosp_name='St. Paul', adm_date='2022-5-4')
new_sentence = ""
for topic, value in sections_text.items():
    topic = "\n\n" + topic.replace("_", " ").title() + ":"
    new_sentence += (topic + " " + value)
print(new_sentence)



Admission Date:  2022-5-4              

Discharge Date:   2022-5-5  

Date Of Birth:  1949-3-1             

Sex:   M  

Service: SURGERY  

Allergies: Morphine Sulfate  

Attending: Drauzio 

Chief Complaint: septic shock toxic c. diff s/p subtotal colectomy  

Major Surgical Or Invasive Procedure: invasive monitoring  

History Of Present Illness: Pt is 73yo male who was recently diagnosed with lyme myelitis and was hospitalized. He was treated with Ceftriaxone and discharged home. At home, he developed watery diarrhea for several weeks and became severely dehydrated. He presented to OSH and was found to have C diff toxic megacolon. On 5-10, he was taken to the OR by an outside surgeon and underwent subtotal colectomy and end ileostomy. Pt's postop condition was moribund, with oliguria, in septic shock, and he was transferred to St. Paul for further management.  

Past Medical History: spinal stenosis CAD, s/p CABG & RCA stent Recurrent 3 vessel coronary disease hypercholesterolem

## Input Text Function

In [321]:
# with open(d_path / 'allergens_list.pkl', 'wb') as f:
#     pickle.dump(allergens_l, f)

In [14]:
en_nlp = spacy.blank('en')
en_nlp.add_pipe('sentencizer')

# De-Identication
tokenizer = AutoTokenizer.from_pretrained("obi/deid_bert_i2b2")
model = AutoModelForTokenClassification.from_pretrained("obi/deid_bert_i2b2")
deid_nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="max")
male_words = ['man', 'manlike', 'male', 'gentleman', 'boy', 'manful', 'masculine', 'dude', 'guy']
male_initial = ['sir', 'mr.', 'mister']
female_words = ['woman', 'feminine', 'female', 'girl', 'gentlewoman']
female_initial = ['miss', 'mrs.', 'madam', 'madame']

# Problems
spacy_stanza_nlp = spacy_stanza.load_pipeline('en', package='mimic', processors={'ner': 'i2b2'}, use_gpu=True, verbose=False)
dis_nlp = spacy.load("en_ner_bc5cdr_md")
meds_nlp = medspacy.load()

# Attetion
scispacy_nlp = spacy.load("en_core_sci_sm")
scispacy_nlp.add_pipe("negex")

# Allergies
med7_nlp = spacy.load("en_core_med7_lg")
with open(d_path / 'allergens_list.pkl', 'rb') as f:
    allergens_l = pickle.load(f)
    
all_nlp = medspacy.load()
all_terms = ['allergies', 'allergy', 'allergic', 'hypersensitivity', 'hypersensitive', 'sensitive', 'sensitivity']
rules = [TargetRule(allergen, 'ALLERGEN') for allergen in allergens_l]
all_nlp.get_pipe('medspacy_target_matcher').add(rules)
context_rules = [
    ConTextRule("<ALLERGY_TERM>", "ALLERGY", 
                rule="FORWARD",
               pattern=[
                   {"LOWER": {"IN": all_terms}},
               ])
]
all_nlp.get_pipe('medspacy_context').add(context_rules)

In [15]:
def join_ents(text,ents):
    new_ents = []
    for idx, ent in enumerate(ents):
        if not idx:
            new_ents.append(ent)
            continue
        
        ant_ent = new_ents[-1]
        interval = text[ant_ent['end']:ent['start']]
        if len(interval)<=1:
            if ant_ent['entity_group']==ent['entity_group']:
                new_start = ant_ent['start']
                new_end = ent['end']
                ent_text = text[new_start:new_end]
                new_ents[-1]['start'] = new_start
                new_ents[-1]['end'] = new_end
                new_ents[-1]['word'] = ent_text
                continue
        
        new_ents.append(ent)
    
    return new_ents

def get_deid(text):
    ents = deid_nlp(text)
    ents = join_ents(text, ents)
    deid_d = {}
    deid_ents = ['DATE', 'PATIENT', 'HOSP', 'STAFF', 'AGE', 'LOC']
    gender = ''
    gc = GenderComputer()
    for i in ents:
        ent_g = i['entity_group']
        value = i['word']
        if ent_g not in deid_ents:continue
        if ent_g=='AGE':
            # Gender
            hypo_gender=re.sub("[^FfMm]", "", value)
            if hypo_gender:
                gender = (hypo_gender, 'F' if hypo_gender in ['f','F'] else 'M')
            
            # Age
            value=int(re.sub("[^0-9]", "", value))
            
        deid_d[ent_g] = value if not deid_d.get(ent_g) else deid_d[ent_g]
    
    
    # Gender
    if not gender:
        if 'PATIENT' in deid_d.keys():
            pat_name = deid_d['PATIENT']
            res = gc.resolveGender(pat_name,None)
            if res:
                res = 'F' if res=='female' else 'M'
                gender = (pat_name, res)
            if not gender:
                patient_start = text.index(pat_name)
                target = text[:patient_start].lower()
                for i in male_initial:
                    if i in target:
                        gender = (i, 'M')
                for i in female_initial:
                    if i in target:
                        gender = (i, 'F')
        else:
            target = text.lower().split()
            for i in male_words:
                if i in target:
                    gender = (i, 'M')
            for i in female_words:
                if i in target:
                    gender = (i, 'F')
    if gender:
        deid_d['GENDER']=gender
    return deid_d

def get_context(text, problems):
    pres_problems = []
    hist_problems = []
    fam_problems = []
    neg_problems = []
    rules = [TargetRule(problem[0], 'CONDITION') for problem in problems]
    meds_nlp.get_pipe('medspacy_target_matcher').add(rules)
    doc = meds_nlp(text)
    for ent in doc.ents:
        if ent._.is_negated:
            neg_problems.append((ent.text, ent.start_char))
        elif ent._.is_historical:
            hist_problems.append((ent.text, ent.start_char))
        elif ent._.is_family:
            fam_problems.append((ent.text, ent.start_char))
        else:
            pres_problems.append((ent.text, ent.start_char))
    
    return pres_problems, hist_problems, fam_problems, neg_problems

def get_problems(sentence):
    doc_stanza = spacy_stanza_nlp(sentence)
    doc_dis_spacy = dis_nlp(sentence)
    problems = [(ent.text, ent.start_char) for ent in doc_stanza.ents if ent.label_=="PROBLEM"]
    diseases = [(ent.text, ent.start_char) for ent in doc_dis_spacy.ents if ent.label_=="DISEASE"]
    
    for dis, idx in diseases:
        if any((dis.lower() in problem[0].lower()) for problem in problems):
            continue
        problems.append((dis, idx))
        
    return get_context(sentence, problems)

def get_attention_words(sentence):
    doc = scispacy_nlp(sentence)
    attentions = []
    negs = []
    for ent in doc.ents:
        if not ent._.negex:
            attentions.append((ent.text, ent.start_char))
        else:
            negs.append((ent.text, ent.start_char))
            
    return attentions, negs

def get_allergens(sentence):
    allergens = []
    negs = []
    doc_en = en_nlp(sentence)
    doc_med = med7_nlp(sentence)
    chemicals = [ent.text for ent in doc_med.ents if ent.label_=="DRUG"]
    rules = [TargetRule(chemical, 'ALLERGEN') for chemical in chemicals]
    all_nlp.get_pipe('medspacy_target_matcher').add(rules)
    
    for sent in doc_en.sents:
        doc_all = all_nlp(sent.text)
        for ent in doc_all.ents:
            if ent._.is_negated:
                negs.append((ent.text, ent.start_char))
            else:
                if ent._.modifiers:
                    if ent._.modifiers[0].category=="ALLERGY":
                        allergens.append((ent.text, ent.start_char))
                    
    return allergens, negs

In [67]:
def get_ents_input_text(text):
    w_detected = []
    neg_problems = []
    
    ## De-identification
    input_d = get_deid(text)
    w_detected+=input_d.values()
    
    ## Allergens
    allergens, negs = get_allergens(text)
    for allergen, idx in allergens:
        w_detected+=[allergen]
    input_d['ALLERGEN'] = allergens + negs
    all_negs = negs
    
    ## Problems
    problems, hist_problems, fam_problems, negs = get_problems(text)
    new_problems = []
    new_hist_problems = []
    new_fam_problems = []
    cur_detected=" ".join(list(map(lambda x: str(x).lower(),w_detected))).split()
    for problem, idx in problems:
        words_l = problem.lower().split()
        if any((i in cur_detected) for i in words_l):
            continue
        else:
            new_problems.append((problem, idx))
    for hist_problem, idx in hist_problems:
        words_l = hist_problem.lower().split()
        if any((i in cur_detected) for i in words_l):
            continue
        else:
            new_hist_problems.append((hist_problem, idx))
    
    for fam_problem, idx in fam_problems:
        words_l = fam_problem.lower().split()
        if any((i in cur_detected) for i in words_l):
            continue
        else:
            new_fam_problems.append((fam_problem, idx))
    
    for problem, idx in new_problems:
        w_detected+=[problem]
    for hist_problem, idx in new_hist_problems:
        w_detected+=[hist_problem]
    for fam_problem, idx in new_fam_problems:
        w_detected+=[fam_problem]
    input_d['PROBLEM'], input_d['HIST_PROBLEM'], input_d['FAM_PROBLEM'] = new_problems, new_hist_problems, new_fam_problems
    
    # Remove negated from allergens
    for neg, idx in negs:
        if not any((neg in _neg[0]) for _neg in all_negs):
            neg_problems.append((neg, idx))
    
    ## Attention
    cur_detected=" ".join(list(map(lambda x: str(x).lower(),w_detected))).split()
    attentions, negs = get_attention_words(text)
    print(negs)
    new_attentions = []
    for attention, idx in attentions:
        words_l = attention.lower().split()
        if any((i in cur_detected) for i in words_l):
            continue
        else:
            new_attentions.append((attention, idx))
    input_d['ATTENTION'] = list(set(new_attentions))
    
    # Remove negated from allergens and problems
    for neg, idx in negs:
        if not any((neg in _neg[0]) for _neg in neg_problems):
            neg_problems.append((neg, idx))
    input_d['NEGATED'] = neg_problems
    
    # Remove negated words from problems
    final_problems = []
    for neg, idx in neg_problems:
        for i, (problem, idx) in enumerate(problems):
            if neg not in problem:
                final_problems.append((problem, idx))
    input_d['PROBLEM'] = final_problems
    
    # Date
    matches = datefinder.find_dates(text, source=True, index=True, strict=True)
    if matches:
        for match in matches:
            date, source, start = match
            start = start[0]+1
            source = source.split()[0]
            input_d['DATE'] = (date, source, start)
            break
    return input_d

In [365]:
# text = "The patient manlike in Naval Hospital Beaufort in 10/12/1982 by his PCP Marcus the patient presenting abdominal pain. The patient has history of gastritis and use of tobbaco, alcohol and another substances. PCP noticed that patient has some allergies to peanut and wheat. Patient may be used to using Penicillin. There are reports of allergy reaction to penicillin."
text = """The patient female 35 y.o was attended in California Hospital in 10/12/1982 by his PCP Roger the patient with some symptoms like headaches, dizziness and confusion.
Patient denies to have any mental disorder. The patient had history of panic attack and also there were episodes of persistent depression."""
res = get_ents_input_text(text)

[{'entity_group': 'AGE', 'score': 0.9995535, 'word': '35', 'start': 19, 'end': 21}, {'entity_group': 'HOSP', 'score': 0.99822944, 'word': 'California', 'start': 42, 'end': 52}, {'entity_group': 'HOSP', 'score': 0.9982157, 'word': 'Hospital', 'start': 53, 'end': 61}, {'entity_group': 'DATE', 'score': 0.99972326, 'word': '10', 'start': 65, 'end': 67}, {'entity_group': 'DATE', 'score': 0.7947258, 'word': '12', 'start': 68, 'end': 70}, {'entity_group': 'DATE', 'score': 0.7268291, 'word': '1982', 'start': 71, 'end': 75}, {'entity_group': 'STAFF', 'score': 0.9817015, 'word': 'Roger', 'start': 87, 'end': 92}]
Loaded dictionary from /home/leobit/anaconda3/envs/conda_medtrix_env/lib/python3.10/site-packages/genderComputer-0.1-py3.10.egg/genderComputer/../nameLists/gender.dict
Finished initialization


In [17]:
def get_entity_options(ents_l):
    colors={}
    for ent in ents_l:
        colors[ent]=("#"+''.join([random.choice(string.hexdigits) for i in range(6)])).upper()
    
    options = {"ents": ents_l, "colors": colors, "distance": 500}
    return options


def get_ents_input_text_vis(res, text, ent_style='ent'):
    _str_type = ['PATIENT', 'HOSP', 'STAFF', 'AGE', 'LOC']
    _str_or_tuple_type = ['DATE']
    _tuple_type = ['GENDER']
    _list_type = ['PROBLEM', 'HIST_PROBLEM', 'FAM_PROBLEM','NEGATED', 'ALLERGEN', 'ATTENTION']
    doc = en_nlp(text)
    spans_list = []
    ents_l = []
    
    for ent_name, value in res.items():
        if ent_name in _str_type:
            value = str(value)
            start = text.index(value)
            end = start + len(value)
            tok_start = len(en_nlp(text[:start]))
            tok_end = tok_start + len(en_nlp(text[start:end]))
            span = doc[tok_start:tok_end]
            span.label_ = ent_name
            ents_l.append(ent_name)
            spans_list.append(span)
        if ent_name in _str_or_tuple_type:
            if isinstance(value, str):
                start = text.index(value)
            if isinstance(value, tuple):
                start = value[2]
                value = value[1]
            end = start + len(value)
            tok_start = len(en_nlp(text[:start]))
            tok_end = tok_start + len(en_nlp(text[start:end]))
            span = doc[tok_start:tok_end]
            span.label_ = ent_name
            ents_l.append(ent_name)
            spans_list.append(span)
        if ent_name in _tuple_type:
            ent_name_t = f"{'FEM' if value[1]=='F' else 'MALE'}"
            start = text.index(value[0])
            end = start + len(value)
            tok_start = len(en_nlp(text[:start]))
            tok_end = tok_start + len(en_nlp(text[start:end]))
            span = doc[tok_start:tok_end]
            span.label_ = ent_name_t
            ents_l.append(ent_name_t)
            spans_list.append(span)
        if ent_name in _list_type:
            for i, idx in value:
                start = idx if ent_name!='ALLERGEN' else text.index(i)
                end = start + len(i)
                tok_start = len(en_nlp(text[:start]))
                tok_end = tok_start + len(en_nlp(text[start:end]))
                span = doc[tok_start:tok_end]
                span.label_ = ent_name
                ents_l.append(ent_name)
                spans_list.append(span)
    
    options = get_entity_options(ents_l)
    if ent_style=='span':
        doc.spans["sc"] = spans_list
    else:
        doc.ents = spans_list
    displacy.render(doc, style=ent_style, options=options)

In [419]:
text = """The patient manlike in Naval Hospital Beaufort in 10/12/1982 by his PCP Victor the patient presenting abdominal pain, vomiting and diarrhea. 
The patient has history of gastritis and Barrett's esophagus, there's history of use of tobbaco, alcohol and another substances.
PCP noticed that patient has some allergies to peanut and wheat. Patient may be used to using Penicillin. There are reports of allergy reaction to penicillin."""
res = get_ents_input_text(text)

Loaded dictionary from /home/leobit/anaconda3/envs/conda_medtrix_env/lib/python3.10/site-packages/genderComputer-0.1-py3.10.egg/genderComputer/../nameLists/gender.dict
Finished initialization


In [417]:
text = "Anne 35F was attended in Naval Hospital Beaufort in 2021 by his Dr. Roger the patient presenting abdominal pain. The patient has history of gastritis, use of tobbaco and alcohol. Patient has hx of stroke. There are reports about a diagnosis of his Mother with diabetes. No evidence of pna. The patient can develop cancer"
res = get_ents_input_text(text)

Loaded dictionary from /home/leobit/anaconda3/envs/conda_medtrix_env/lib/python3.10/site-packages/genderComputer-0.1-py3.10.egg/genderComputer/../nameLists/gender.dict
Finished initialization


In [281]:
get_ents_input_text_vis(res, text)

In [300]:
text = """The patient from LA 35 y.o was attended in California Hospital in 10/5/2022 by his PCP Roger the patient with some symptoms like headaches, dizziness and confusion.
Patient denies to have any mental disorder. The patient had history of panic attack and also there were episodes of persistent depression.
Tony Brat claims to be allergic to amoxicillin. Patient denies to be a drug user. There are reports of allergy reaction to penicillin. Patient have no allergy to peanut.
His mother informed have cancer."""
res = get_ents_input_text(text)

Loaded dictionary from /home/leobit/anaconda3/envs/conda_medtrix_env/lib/python3.10/site-packages/genderComputer-0.1-py3.10.egg/genderComputer/../nameLists/gender.dict
Finished initialization


In [304]:
get_ents_input_text_vis(res, text, 'span')

## SIMILARITY with DOCUMENTS

In [18]:
df_struct = pd.read_csv(d_path / "df_mimic_struct.csv")

In [19]:
df_struct_lemma = pd.read_csv(d_path / "df_struct_lemma.csv")

In [247]:
lemmatizer = WordNetLemmatizer()

In [None]:
def cols_lemma(sentence_l):
    sentence_l = literal_eval(sentence_l)
    new_sentence_l = []
    for sentence in sentence_l:
        word_l = word_tokenize(sentence)
        new_word_l = [lemmatizer.lemmatize(word).lower() for word in word_l]
        new_sentence_l.append(" ".join(new_word_l))
        
    return new_sentence_l

cols = df_struct.iloc[:,3:].columns
for col in cols:
    df_struct[col] = df_struct[col].apply(cols_lemma)

In [119]:
# df_struct.to_csv(d_path / 'df_struct_lemma.csv', index=0)

In [20]:
# Pattern to get Topics
add_topics = ['facility', 'HISTORY  OF  THE  PRESENT  ILLNESS(?=\:)', 'Admission Date(?=\:)', 'Discharge Date(?=\:)', 'Sex(?=\:)', 'Chief Complaint(?=\:)', 'Addendum(?=\:)', '(?i)HISTORY OF PRESENT ILLNESS(?=\:)']
pattern = re.compile(f"((?<=\\n\\n)[\w\s]+(?=\:))|{'|'.join(add_topics)}", flags=0)
hpi_p = re.compile("\[\*\*[^\[]*\*\*\]", flags=0)
lemmatizer = WordNetLemmatizer()

def get_topics_text(text):
    topics = []
    positions = []
    sections_text = {}
    for m in pattern.finditer(text):
        s = m.group().replace('\n','')
        s = "_".join(s.lower().split())
        topics.append(s)
        positions.append((m.span()[0], m.span()[1]+2))
    for i, topic in enumerate(topics):
        start = positions[i][1]
        try:
            end = positions[i+1][0]
        except:
            end = len(text)-1
        sections_text[topic]=text[start:end].replace('\n',' ')
        
    return sections_text

def UMLSBert_similarity(sent1, sent2):
    inputs_1 = coder_tokenizer(sent1, return_tensors='pt')
    inputs_2 = coder_tokenizer(sent2, return_tensors='pt')

    sent_1_embed = np.mean(coder_model(**inputs_1).last_hidden_state[0].detach().numpy(), axis=0)
    sent_2_embed = np.mean(coder_model(**inputs_2).last_hidden_state[0].detach().numpy(), axis=0)
    
    return np.dot(sent_1_embed, sent_2_embed)/(norm(sent_1_embed)* norm(sent_2_embed))

def get_jaccard_sim(words_l_1, words_l_2, prop=0.5):
    words_l_1 = set(words_l_1)
    words_l_2 = set(words_l_2)
    len_1 = len(words_l_1)
    len_2 = len(words_l_2)
    f = (len_1 / len_2) * prop
    a = words_l_1
    b = words_l_2
    c = a.intersection(b)
    res = float(len(c)) / (len(a) + len(b) - len(c))
    res_f = res + (res * f)
    return res_f

def lemmatizer_l(sentence_l):
    new_sentence_l = []
    for sentence in sentence_l:
        word_l = word_tokenize(sentence)
        new_word_l = [lemmatizer.lemmatize(word).lower() for word in word_l]
        new_sentence_l.append(" ".join(new_word_l))
        
    return new_sentence_l

def coeff(exp1, exp2, neg=False):
    exp1 = literal_eval(exp1)
    if (not exp1) or (not exp2):
        return 0
    jacc = get_jaccard_sim(exp1, exp2)
    return jacc if not neg else -jacc

def umls_coeff(exp1, exp2, neg=False):
    exp1_s = " ".join(exp1)
    exp2_s = " ".join(exp2)
    try:
        umls = UMLSBert_similarity(exp1_s, exp2_s)
    except:
        exp1_s = " ".join(exp1[:450])
        exp2_s = " ".join(exp2[:450])
        umls = UMLSBert_similarity(exp1_s, exp2_s)
        
    return umls if not neg else -umls

def get_similar_document(ents_d, df_struct, df_struct_lemma):
    prob_cols = ['chief_complaint', 'history_of_present_illness', 'brief_hospital_course', 'hospital_course', 'discharge_diagnosis']
    att_cols = ['social_history']
    hist_cols = ['past_medical_history']
    
    ## PROBLEM SCORE, ATT SCORE, HIST SCORE, NEG
    cols_d = {}
    cols_d['PROBLEM'] = prob_cols
    cols_d['ATTENTION'] = att_cols
    cols_d['HIST_PROBLEM'] = hist_cols
    cols_d['NEGATED'] = prob_cols + att_cols + hist_cols
    
    # Gender
    gender = ents_d.get('GENDER')
    if gender:
        gender = gender[1]
        df_struct_target = df_struct_lemma[df_struct_lemma.sex==gender]
    else:
        df_struct_target = df_struct_lemma
    
    # Age
    age = ents_d.get('AGE')
    if age:
        df_struct_target = df_struct_target[df_struct_target.age.between(age-5, age+5)]
    else:
        pass
    
    # Problems, Historical Problems, Attention, Negated
    idx_subj = {}
    for subject, cols in cols_d.items():
        exp2 = lemmatizer_l([ent for ent, idx in ents_d[subject]])
        if subject!='NEGATED':
            for col in cols:
                df_struct_target[f'coeff_{subject}_{col}'] = df_struct_target[col].apply(coeff, exp2=exp2, neg=False)
        else:
            for col in cols:
                df_struct_target[f'coeff_{subject}_neg_{col}'] = df_struct_target[col].apply(coeff, exp2=exp2, neg=True)
    for subject in cols_d.keys():
        if subject=='NEGATED':continue
        n_rows = 5 if subject!='PROBLEM' else 10
        target_cols = df_struct_target.columns[df_struct_target.columns.str.startswith(f"coeff_{subject}")].to_list()
        target_cols += df_struct_target.columns[df_struct_target.columns.str.startswith(f"coeff_{subject}_neg")].to_list()
        df_struct_target[f'coeff_total_{subject}'] = df_struct_target[target_cols].sum(axis=1)        
        idx_subj[subject] = df_struct_target.sort_values(by=f'coeff_total_{subject}', ascending=False).head(n_rows).index
    
    idx_subj_f = {}
    for subject, idxs in idx_subj.items():
        df_struct_target = df_struct.loc[idxs]
        exp2 = [ent for ent, idx in ents_d[subject]]
        target_cols = cols_d[subject]
        for col in cols:
            df_struct_target[f'coeff_{subject}_{col}'] = df_struct_target[col].apply(coeff, exp2=exp2, neg=False)
        target_cols = df_struct_target.columns[df_struct_target.columns.str.startswith(f"coeff_{subject}")].to_list()
        df_struct_target[f'coeff_total_{subject}'] = df_struct_target[target_cols].sum(axis=1)
        idx_f = df_struct_target.sort_values(by=f'coeff_total_{subject}', ascending=False).head(1).index
        idx_subj_f[subject] = idx_f
    
    # Replace most Similar Social History and Past Medical History
    text_selected = df_struct.loc[idx_subj_f['PROBLEM'][0]]['text']
    text_selected_att = df_struct.loc[idx_subj_f['ATTENTION'][0]]['text']
    text_selected_hist = df_struct.loc[idx_subj_f['HIST_PROBLEM'][0]]['text']
    
    sections_text = get_topics_text(text_selected)
    sections_text_att = get_topics_text(text_selected_att)
    sections_text_hist = get_topics_text(text_selected_hist)
    
    sections_text['social_history'] = sections_text_att.get('social_history')
    sections_text['past_medical_history'] = sections_text_hist.get('past_medical_history')
    
    # Replace Allergies and Chief Complaint
    allergies = ents_d.get('ALLERGEN')
    if allergies:
        allergies = [ent for ent, idx in allergies]
    problems = ents_d.get('PROBLEM')
    if problems:
        problems = [ent for ent, idx in problems]
    sections_text['allergies'] = ", ".join(allergies) if allergies else sections_text.get('allergies', '')
    sections_text['chief_complaint'] = ", ".join(problems) if problems else sections_text.get('chief_complaint', '')
    
    return sections_text

In [249]:
%%time
doc = get_similar_document(res, df_struct, df_struct_lemma)

CPU times: user 401 ms, sys: 2.16 ms, total: 403 ms
Wall time: 402 ms


## Final Pipeline:
1- Input Text  
2- Simalirity  
3- Replace and Fake PHI Labels  

In [21]:
def get_doc_from_input_text(input_text):
    ## Input Text
    input_text_ents = get_ents_input_text(input_text)
    get_ents_input_text_vis(input_text_ents, input_text, 'span')

    ## Simalirity
    sections_text = get_similar_document(input_text_ents, df_struct, df_struct_lemma)
    
    ## Replace & Fake PHI Labels
    sections_text = fake_phi_labels(sections_text, **input_text_ents)
    
    return sections_text

In [59]:
text = """Anne 35 years old female was attended in Naval Hospital Beaufort in 10/01/2021 by Dr. Straus. The patient was presenting abdominal pain together with vomiting and diarrhea with high fever and chills. The patient has history of gastritis and acute esophagitis.
Patient makes constant use of alcohol and tobacco Patient has hx of gallstones. There are reports about a diagnosis of his parent with diabetes. A possible stroke was not evidencied in your last report."""
res = get_ents_input_text(text)

Loaded dictionary from /home/leobit/anaconda3/envs/conda_medtrix_env/lib/python3.10/site-packages/genderComputer-0.1-py3.10.egg/genderComputer/../nameLists/gender.dict
Finished initialization


In [75]:
res

{'PATIENT': 'Anne',
 'AGE': 35,
 'HOSP': 'Naval Hospital Beaufort',
 'DATE': (datetime.datetime(2021, 10, 1, 0, 0), '10/01/2021', 68),
 'STAFF': 'Straus',
 'GENDER': ('Anne', 'F'),
 'ALLERGEN': [],
 'PROBLEM': [('presenting abdominal pain', 110),
  ('vomiting', 150),
  ('diarrhea', 163),
  ('high fever', 177),
  ('chills', 192)],
 'HIST_PROBLEM': [('gastritis', 227),
  ('acute esophagitis', 241),
  ('gallstones', 328)],
 'FAM_PROBLEM': [('diabetes', 395)],
 'ATTENTION': [('tobacco', 302),
  ('use of', 283),
  ('reports', 350),
  ('female', 18),
  ('patient', 98),
  ('history', 216),
  ('Patient', 260),
  ('patient', 204),
  ('constant', 274),
  ('alcohol', 290),
  ('diagnosis', 366),
  ('Patient', 310),
  ('parent', 383)],
 'NEGATED': [('stroke', 416)]}

In [74]:
text = """Anne 35 years old female was attended in Naval Hospital Beaufort in 10/01/2021 by Dr. Straus. The patient was presenting abdominal pain together with vomiting and diarrhea with high fever and chills. The patient has history of gastritis and acute esophagitis.
Patient makes constant use of alcohol and tobacco Patient has hx of gallstones. There are reports about a diagnosis of his parent with diabetes. A possible stroke was not evidencied in your last report."""
get_doc_from_input_text(text)

Loaded dictionary from /home/leobit/anaconda3/envs/conda_medtrix_env/lib/python3.10/site-packages/genderComputer-0.1-py3.10.egg/genderComputer/../nameLists/gender.dict
Finished initialization
[('stroke', 416)]


{'admission_date': ' 2021-10-1              ',
 'discharge_date': '  2021-10-5  ',
 'date_of_birth': ' 1987-14--8             ',
 'sex': '  F  ',
 'service': 'MEDICINE  ',
 'allergies': 'Iodine  ',
 'attending': 'Straus ',
 'chief_complaint': 'presenting abdominal pain, vomiting, diarrhea, high fever, chills',
 'major_surgical_or_invasive_procedure': 'none  ',
 'history_of_present_illness': '35 year old female with no significant past medical history presents with 3 days of fevers and malaise.  She reports her symptoms started 3 days ago and have been gradually progressing.  She has had headache, fevers, chills, night sweats, myalgias, cough productive of green sputum, nausea/vomiting (non-bloody, up to 7x per day), diarrhea (non-bloody, up to 5x per day).  She notes generalized abdominal pain, worst in the suprapubic region.  Also decreased PO intake.  No chest pain or SOB. No dysuria. No recent travel or sick contacts. . In the ED, initial vs were: T 104.5, P 129, BP 122/74, R 20, O2

In [459]:
text = "cancer"
get_doc_from_input_text(text)

Loaded dictionary from /home/leobit/anaconda3/envs/conda_medtrix_env/lib/python3.10/site-packages/genderComputer-0.1-py3.10.egg/genderComputer/../nameLists/gender.dict
Finished initialization


{'admission_date': ' 2137-12-26              ',
 'discharge_date': '  2138-1-16  ',
 'service': 'MEDICINE  ',
 'allergies': 'Zocor  ',
 'attending': 'KENDRA ',
 'chief_complaint': 'cancer',
 'major_surgical_or_invasive_procedure': '1. Cystoscopy, clot evacuation, bladder fulguration. 2. Repeat cystoscopy.  ',
 'history_of_present_illness': '91 M with history of 91 cancer s/p XRT and brachytherapy in 2129 + salvage radiation therapy presents with hematuria and clot urinary retention.  He had been seen for the past several months with intermittent hematuria.  He has a chronic indwelling urinary catheter and was last seen by Dr. RUMBLE on 12-19 where a cystoscopy was done revealing a edematous bladder consistent with radiation changes, but no active bleeding and no clots within the bladder.  He has an 18Fr Coude catheter in place and he noticed his catheter had stopped draining for 12 hours.  On arrival to the ED his catheter had begun to drain again.  He was hand-irrigated until urine wa