In this work, we develop a pipeline to identify specific terms and entities in medical text. The primary focus is on demonstrating the pipeline and providing code snippets that can be adapted to meet the goals of the studies you are working on. While it is not possible to create a universal pipeline capable of extracting all necessary information from the text, we utilize automated tools from spaCy and offer methods for customization. These approaches allow for modifications tailored to extract most of the medical information you may require.

# We start with an example of data (discharge summary)

In [2]:
discharge_summary = """Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis. She is not receiving any chemo.

Past Medical History:
1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
chemo. Last colonoscopy showed: Last CEA was in the 8 range
(down from 9)
2. Type II Diabetes Mellitus
3. Hypertension

Social History:
Married, former tobacco use. No alcohol or drug use.

Family History:
Mother with stroke at age 82. no early deaths.
2 daughters- healthy


Brief Hospital Course:
Ms. [**Known patient lastname 2004**] was admitted on [**2573-5-30**]. Ultrasound at the time of
admission demonstrated pancreatic duct dilitation and an
edematous gallbladder. She was admitted to the ICU.
Discharge Medications:
1. Miconazole Nitrate 2 % Powder Sig: One (1) Appl Topical  BID
(2 times a day) as needed.
2. Heparin Sodium (Porcine) 5,000 unit/mL Solution Sig: One (1)
Injection TID (3 times a day).
3. Acetaminophen 160 mg/5 mL Elixir Sig: One (1)  PO Q4-6H
(every 4 to 6 hours) as needed.

Discharge Diagnosis:
Type 2 DM
Pancreatitis
HTN
h/o aspiration respiratory distress


Discharge Instructions:
Patient may shower. Please call your surgeon or return to the
emergency room if [**Doctor First Name **] experience fever >101.5, nausea, vomiting,
abdominal pain, shortness of breath, abdominal pain or any
significant  change in your medical condition. A

Completed by: [**First Name11 (Name Pattern1) 2010**] [**Last Name (NamePattern1) 2011**] MD [**MD Number 2012**] [**2573-7-1**] @ 1404
Signed electronically by: DR. [**First Name8 (NamePattern2) **] [**Last Name (NamePattern1) **]
 on: FRI [**2573-7-2**] 8:03 AM
(End of Report)"""

Some aspects of pre-processing require our attention. Below are the tasks we need to perform for this particular data.

In [3]:


import spacy
from spacy.tokens import Span

import medspacy
from medspacy.preprocess import PreprocessingRule, Preprocessor
from medspacy.ner import TargetRule
from medspacy.context import ConTextRule
from medspacy.section_detection import Sectionizer
from medspacy.postprocess import PostprocessingRule, PostprocessingPattern, Postprocessor
from medspacy.postprocess import postprocessing_functions
from medspacy.visualization import visualize_ent, visualize_dep


import re




If the model fails to recognize some entities, we will add terms and rules for them later.

In [4]:
# nlp = medspacy.load()
nlp = medspacy.load('en_ner_bc5cdr_md',medspacy_disable=[ 'parser','medspacy_pyrush','medspacy_target_matcher'])
nlp.pipe_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'ner',
 'medspacy_context']

In [5]:
nlp.add_pipe("medspacy_target_matcher", before="ner")
nlp.pipe_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'medspacy_target_matcher',
 'ner',
 'medspacy_context']

Note that we disabled medspacy_pyrush when loading the model: it does not work well together with the en_ner_bc5cdr_md package, so we add spaCy's sentencizer instead.

In [6]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x27ef43dbb00>

In [7]:
preprocessor = Preprocessor(nlp.tokenizer)
nlp.tokenizer = preprocessor

We need to establish specific rules for processing data, particularly when dealing with dates and other specialized medical information.

We expand some abbreviations and delete markup that carries no useful information.

In [8]:
preprocess_rules = [
    # Expand common abbreviations
    PreprocessingRule(
        "dx'd",
        repl="Diagnosed",
        desc="Replace abbreviation"
    ),
    PreprocessingRule(
        "tx'd",
        repl="Treated",
        desc="Replace abbreviation"
    ),
    # Strip the de-identification markers but keep the text between them.
    # Raw strings avoid invalid escape sequences in the regex.
    PreprocessingRule(
        r"\[\*\*",
        desc="Remove opening placeholder marker"
    ),
    PreprocessingRule(
        r"\*\*\]",
        desc="Remove closing placeholder marker"
    ),
]
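To see what these rules do to the raw text, here is a minimal sketch that uses only Python's re module. It does not show how medspaCy applies the rules internally; the RULES list and the apply_rules helper are illustrative names that simply mirror the four rules above.

```python
import re

# (pattern, replacement) pairs mirroring the preprocessing rules above.
# RULES and apply_rules are illustrative names, not part of medspaCy.
RULES = [
    (r"dx'd", "Diagnosed"),
    (r"tx'd", "Treated"),
    (r"\[\*\*", ""),  # opening de-identification marker
    (r"\*\*\]", ""),  # closing de-identification marker
]


def apply_rules(text: str) -> str:
    """Apply each rule in order, ignoring case."""
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    return text


sample = "1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy"
cleaned = apply_rules(sample)
print(cleaned)
```

The markers are removed while the text between them (here the year) is kept, which is what later lets the date component see the dates.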

In [9]:
from spacy.language import Language
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
import spacy

DICTIONARY = {"dx'd": "diagnosed", "tx'd": "treatment"}
DICTIONARY.update({value: key for key, value in DICTIONARY.items()})

@Language.factory("acronyms", default_config={"case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, case_sensitive: bool):
    return AcronymComponent(nlp, case_sensitive)

class AcronymComponent:
    def __init__(self, nlp: Language, case_sensitive: bool):
        # Create the matcher and match on Token.lower if case-insensitive
        matcher_attr = "TEXT" if case_sensitive else "LOWER"
        self.matcher = PhraseMatcher(nlp.vocab, attr=matcher_attr)
        self.matcher.add("ACRONYMS", [nlp.make_doc(term) for term in DICTIONARY])
        self.case_sensitive = case_sensitive
        # Register custom extension on the Doc
        if not Doc.has_extension("acronyms"):
            Doc.set_extension("acronyms", default=[])

    def __call__(self, doc: Doc) -> Doc:
        # Add the matched spans when doc is processed
        for _, start, end in self.matcher(doc):
            span = doc[start:end]
            acronym = DICTIONARY.get(span.text if self.case_sensitive else span.text.lower())
            doc._.acronyms.append((span, acronym))
        return doc

# Add the component to the pipeline and configure it
# nlp = spacy.blank("en")
# nlp.add_pipe("acronyms", config={"case_sensitive": False})

# # Process a doc and see the results
doc = nlp(discharge_summary)

List the pipeline components currently in use

In [10]:

nlp.pipe_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'medspacy_target_matcher',
 'ner',
 'medspacy_context',
 'sentencizer']

We add the preprocess rules that we created before

In [11]:
preprocessor.add(preprocess_rules)

Get the results and visualizations

In [12]:
pre_process = nlp(text=discharge_summary)

visualize_ent(pre_process)



As demonstrated, the model is capable of identifying certain diseases and medications. However, it has limitations, such as its inability to recognize time, date, and specific sections of the text.


With the preprocessing rules in place, we next look for dates and times in the text.

We use the find_dates component from the date_spacy package to recognize dates automatically.

In [13]:
from date_spacy import find_dates
from spacy.pipeline import EntityRuler
import medspacy
# nlp = medspacy.load()
nlp.add_pipe('find_dates')
# Add EntityRuler to the pipeline
# ruler = nlp.add_pipe("entity_ruler", before="medspacy_target_matcher")

# Process text
# text = "The patient visited on March 3, 2025 and reported fever and headache.On 3/12/2020, patient was admitted to the hospital."
pre_process_out = nlp(pre_process)

# Extract entities
for ent in pre_process_out.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Entity: 2573-5-30, Label: DATE
Entity: 2573-7-1, Label: DATE
Entity: 2498-8-19, Label: DATE
Entity: Hydrochlorothiazide, Label: CHEMICAL
Entity: Abdominal pain, Label: DISEASE
Entity: 6-25, Label: DATE
Entity: 5-31, Label: DATE
Entity: stroke, Label: DISEASE
Entity: abdominal pain, Label: DISEASE
Entity: Colon cancer, Label: DISEASE
Entity: XRT, Label: CHEMICAL
Entity: Type II Diabetes Mellitus, Label: DISEASE
Entity: Hypertension, Label: DISEASE
Entity: alcohol, Label: CHEMICAL
Entity: stroke, Label: DISEASE
Entity: deaths, Label: DISEASE
Entity: 2573-5-30, Label: DATE
Entity: pancreatic duct dilitation, Label: DISEASE
Entity: Miconazole Nitrate, Label: CHEMICAL
Entity: Heparin Sodium, Label: CHEMICAL
Entity: Acetaminophen, Label: CHEMICAL
Entity: Pancreatitis, Label: DISEASE
Entity: HTN, Label: DISEASE
Entity: respiratory distress, Label: DISEASE
Entity: fever, Label: DISEASE
Entity: nausea,, Label: DISEASE
Entity: vomiting, Label: DISEASE
Entity: abdominal pain, Label: DISEASE
Entit

We visualize the results to understand how they are generated.

In [14]:
visualize_ent(pre_process_out)

Next we perform sectionization, that is, we determine which section of the document each piece of text belongs to.

We also add rules to the sections

In [15]:
sectionizer = nlp.add_pipe("medspacy_sectionizer")

In [16]:
from medspacy.section_detection import SectionRule

Since some sections are specific to our data, here we create some rules to recognize sections of the text.

In [17]:
section_rule = [SectionRule("Admission Date:", "admission_date"),
                SectionRule("Discharge Date:", "discharge_date"),
                # SectionRule("Date of Birth:", "date_of_birth"),
                SectionRule("Allergies:", "allergies"),
                SectionRule("Attending:", "attending"),
                SectionRule("Chief Complaint:", "chief_complaint"),
                SectionRule("Major Surgical or Invasive Procedure:", "surgical_procedure"),
                SectionRule("History of Present Illness:", "history_present_illness"),
                SectionRule("Past Medical History:", "past_medical_history"),
                SectionRule("Social History:", "social_history"),
                SectionRule("Family History:", "family_history"),
                SectionRule("Brief Hospital Course:", "hospital_course"),
                SectionRule("Discharge Medications:", "discharge_medications"),
                SectionRule("Discharge Diagnosis:", "discharge_diagnosis"),
                SectionRule("Discharge Instructions:", "discharge_instructions"),
                SectionRule("Completed by:", "completed_by"),
                SectionRule("Signed electronically by:", "signed_by"),
                SectionRule("Signature:", "signature"),
                SectionRule("Previous Medical History:", "past_medical_history"),
                ]

We add the section rules to the pipeline

In [18]:
sectionizer.add(section_rule)

Check the results

In [19]:
pre_process_out.text

'Admission Date:  2573-5-30              Discharge Date:   2573-7-1\n\nDate of Birth:  2498-8-19             Sex:   F\n\nService: SURGERY\n\nAllergies:\nHydrochlorothiazide\n\nAttending:First Name3 (LF) 1893\nChief Complaint:\nAbdominal pain\n\nMajor Surgical or Invasive Procedure:\nPICC line 6-25\nERCP w/ sphincterotomy 5-31\n\n\nHistory of Present Illness:\n74y female with type 2 dm and a recent stroke affecting her\nspeech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis. She is not receiving any chemo.\n\nPast Medical History:\n1. Colon cancer Diagnosed in 2554, Treated with hemicolectomy, XRT,\nchemo. Last colonoscopy showed: Last CEA was in the 8 range\n(down from 9)\n2. Type II Diabetes Mellitus\n3. Hypertension\n\nSocial History:\nMarried, former tobacco use. No alcohol or drug use.\n\nFamily History:\nMother with stroke at age 82. no early deaths.\n2 daughters- healthy\n\n\nBrief Hospital Course:\nMs. Known patient lastname 2004 was admitted

Get the results and do visualization

In [20]:
pre_process_section_out = nlp(pre_process_out.text)
visualize_ent(pre_process_section_out)

As we can see, after adding our section rules the pipeline recognizes the sections and the dates correctly.
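To make the idea behind section rules concrete, the following is a simplified, standard-library sketch of header-based splitting. It is not medspaCy's sectionizer; SECTION_HEADERS and split_sections are hypothetical names, and only three of the headers from our rules are included.

```python
import re
from typing import List, Tuple

# Header literal -> section category, mirroring a few of the rules above.
# The dictionary and function names are illustrative only.
SECTION_HEADERS = {
    "Chief Complaint:": "chief_complaint",
    "Past Medical History:": "past_medical_history",
    "Social History:": "social_history",
}


def split_sections(text: str) -> List[Tuple[str, str]]:
    """Return (category, body) pairs in document order."""
    pattern = "|".join(re.escape(header) for header in SECTION_HEADERS)
    matches = list(re.finditer(pattern, text))
    sections = []
    for i, match in enumerate(matches):
        # A section runs from the end of its header to the next header.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        sections.append((SECTION_HEADERS[match.group()], body))
    return sections


note = (
    "Chief Complaint:\nAbdominal pain\n\n"
    "Past Medical History:\n3. Hypertension\n\n"
    "Social History:\nMarried, former tobacco use."
)
sections = split_sections(note)
for category, body in sections:
    print(category, "->", body)
```

Every entity found inside a section body can then inherit that section's category, which is the role the section_category attribute plays later in this notebook.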

Next we add target rules for additional entities

We can add the age, gender, and missing disease entities that the automatic method did not recognize

In [21]:
age_rule = [
    # Single token such as "74y"
    TargetRule("74y", category="PATIENT_AGE",
               pattern=[{"TEXT": {"REGEX": r"^\d{1,2}y$"}}]),
    # Two tokens such as "74 y"
    TargetRule("74 y", category="PATIENT_AGE",
               pattern=[{"TEXT": {"REGEX": r"^\d{1,2}$"}},
                        {"TEXT": "y"}]),
    # "age" followed by a number, such as "age 82"
    TargetRule("age 82", category="PATIENT_AGE",
               pattern=[{"TEXT": "age"},
                        {"TEXT": {"REGEX": r"^\d{1,2}$"}}]),
    # Hyphenated form such as "72-year-old"
    TargetRule("72-year-old", category="PATIENT_AGE",
               pattern=[{"TEXT": {"REGEX": r"^\d{1,3}$"}},
                        {"TEXT": "-"},
                        {"LOWER": "year"},
                        {"TEXT": "-"},
                        {"LOWER": "old"}]),
    # Spaced form such as "72 year old"
    TargetRule("72 year old", category="PATIENT_AGE",
               pattern=[{"TEXT": {"REGEX": r"^\d{1,3}$"}},
                        {"LOWER": "year"},
                        {"LOWER": "old"}]),
]

gender_rule = [
    TargetRule("male","gender",pattern=[{"LOWER":"male"}]),
    TargetRule("female","gender",pattern=[{"LOWER":"female"}])
]

disease_rule = [
    TargetRule("Type II Diabetes Mellitus","DISEASE",
     pattern=[
                  {"LOWER": "type"},
                  {"LOWER": {"IN": ["2", "ii", "two"]}},
                  {"LOWER": {"IN": ["dm", "diabetes"]}},
                  {"LOWER": "mellitus", "OP": "?"}
              ]),
    TargetRule("Metastasis","DISEASE",
     pattern=[
                  {"LOWER": "metastasis"}
              ]),
    TargetRule("edematous gallbladder","DISEASE",
               pattern=[
                  {"LOWER": "edematous"},
                  {"LOWER": "gallbladder"}
              ]),
]
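The age rules above work on tokens. As a quick way to sanity-check which surface forms we intend to cover, here is a plain-regex sketch over raw text. It is only an approximation of the token patterns, and AGE_RE and find_ages are hypothetical names.

```python
import re

# One regex covering the surface forms targeted by the age rules:
# "74y", "74 y", "age 82", "72-year-old", "72 year old".
# This is a character-level approximation, not the token patterns themselves.
AGE_RE = re.compile(
    r"\b(?:age\s+(?P<after_age>\d{1,3})"
    r"|(?P<before_unit>\d{1,3})\s*(?:y\b|[- ]year[- ]old\b))",
    re.IGNORECASE,
)


def find_ages(text: str) -> list:
    """Return every age mentioned in the text as an integer."""
    return [
        int(m.group("after_age") or m.group("before_unit"))
        for m in AGE_RE.finditer(text)
    ]


text = (
    "74y female with type 2 dm and 2 days of abdominal pain. "
    "Mother with stroke at age 82. "
    "A 72-year-old man and a 72 year old woman."
)
ages = find_ages(text)
print(ages)
```

Numbers that are not ages, such as "type 2" or "2 days", are skipped because they are not followed by one of the age markers.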




We use the target matcher to apply these rules

In [22]:
from medspacy.ner import TargetRule
target_matcher = nlp.get_pipe("medspacy_target_matcher")

We add the rules

In [23]:
# Add the rules to the target matcher
target_matcher.add(age_rule)
target_matcher.add(gender_rule)
target_matcher.add(disease_rule)

Run the pipeline and visualize the result

In [24]:
pre_process_section_target_out = nlp(pre_process_out.text)
visualize_ent(pre_process_section_target_out)

As we can see, some treatments and their doses were not recognized, so we create rules to recognize them here.

In [25]:
from medspacy.target_matcher import TargetRule

# Units are compared with the lowercased token text (LOWER), so every entry
# must be lowercase and a single token.
units = ["mg", "g", "mcg", "μg", "ml", "cc", "tablet", "tab", "tabs", "capsule", "cap", "caps", "puff", "spray", "%", "unit"]

# A whole or decimal number, e.g. "500" or "1.5"
number = {"TEXT": {"REGEX": r"^\d+(\.\d+)?$"}}

dose_rules = [

    # Number followed by a unit: 500 mg, 1.5 g, 1 tablet, 10 mL, 1 puff, 2 %
    # (one rule is enough, since all of these share the same pattern)
    TargetRule(
        "500 mg",
        category="DOSE_AMOUNT",
        pattern=[number, {"LOWER": {"IN": units}}]
    ),

    # Number with a thousands separator and a compound unit: 5,000 unit/mL
    TargetRule(
        "5,000 unit/mL",
        category="DOSE_AMOUNT",
        pattern=[
            {"TEXT": {"REGEX": r"^\d+"}},
            {"TEXT": ","},
            {"TEXT": {"REGEX": r"^\d+"}},
            {"TEXT": "unit"},
            {"TEXT": "/"},
            {"TEXT": "mL"}
        ]
    ),

    # Concentration: 160 mg/ 5 ml
    TargetRule(
        "160 mg/ 5 ml",
        category="DOSE_AMOUNT",
        pattern=[
            {"TEXT": {"REGEX": r"^\d+"}},
            {"TEXT": "mg"},
            {"TEXT": "/"},
            {"TEXT": {"REGEX": r"^\d+"}},
            {"LOWER": "ml"}
        ]
    ),
]


The rule for 5,000 unit/mL shows how to write a rule for an unusual form. The tokenizer does not treat 5,000 as the single number 5000, so the form needs special handling. How do we know which tokens it is split into? We run the tokenizer and check, as below.

In [26]:
# Process the sentence using medspaCy
doc_token = nlp("5,000 unit/mL")

# Extract tokens
tokens_res = [token.text for token in doc_token]

# Print tokens
print("Tokens:", tokens_res)


Tokens: ['5', ',', '000', 'unit', '/', 'mL']


That is why the rule for this form was written differently above:

    TargetRule(
        "5,000 unit/mL",
        category="DOSE_AMOUNT",
        pattern=[
            {"TEXT": {"REGEX": r"^\d+"}},
            {"TEXT":","},
            {"TEXT": {"REGEX": r"^\d+"}},
            {"TEXT": "unit"},
            {"TEXT": "/"},
            {"TEXT": "mL"}

        ]
    ),
We match 5, the comma, 000, and each part of the unit as separate tokens.
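The token-by-token idea can be sketched in plain Python. This is not spaCy's Matcher, only an illustration of sliding a list of per-token checks over a token list; regex, exact, DOSE_PATTERN and find_match are hypothetical names, and the tokens around the dose are assumed for the example.

```python
import re
from typing import Callable, List, Optional, Tuple

Predicate = Callable[[str], bool]


def regex(pattern: str) -> Predicate:
    """Token check: the regex matches somewhere in the token text."""
    compiled = re.compile(pattern)
    return lambda token: compiled.search(token) is not None


def exact(text: str) -> Predicate:
    """Token check: the token text equals the given string."""
    return lambda token: token == text


# Six checks, one per token, mirroring the rule for "5,000 unit/mL".
DOSE_PATTERN: List[Predicate] = [
    regex(r"^\d+"), exact(","), regex(r"^\d+"),
    exact("unit"), exact("/"), exact("mL"),
]


def find_match(tokens: List[str], pattern: List[Predicate]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first window where every check holds."""
    size = len(pattern)
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(check(token) for check, token in zip(pattern, window)):
            return start, start + size
    return None


# The six dose tokens are the ones printed above; the neighbours are assumed.
tokens = ["Heparin", "Sodium", "5", ",", "000", "unit", "/", "mL", "Solution"]
span = find_match(tokens, DOSE_PATTERN)
print(span, tokens[span[0]:span[1]])
```

Because the pattern has one entry per token, a text tokenized differently (for example with 5000 as a single token) would need its own pattern.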

We add the rules to the pipeline

In [27]:
for rule in dose_rules:
    target_matcher.add(rule)


Again, we visualize the result

In [28]:
pre_process_section_target_dose_out = nlp(pre_process_out.text)
visualize_ent(pre_process_section_target_dose_out)

Context rules

Context rules determine how the surrounding text modifies an entity. For example, in "no evidence of cancer" the entity "cancer" should be interpreted as absent. The context component records this in the is_negated attribute of each entity, which is False by default, meaning the entity is not negated.

When a negation rule applies, is_negated is set to True, which tells us that the finding does not exist.
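As a rough illustration of forward-direction negation, the sketch below marks an entity as negated when a trigger phrase appears before it in the same sentence. It is a deliberate simplification and not the ConText algorithm used by medspaCy, which handles more cases (for example, backward-direction rules); NEGATION_TRIGGERS and is_negated are hypothetical names.

```python
import re

# Forward-direction triggers only; an assumed, very small list.
NEGATION_TRIGGERS = ["no evidence of", "no"]


def is_negated(sentence: str, entity: str) -> bool:
    """True if a trigger appears before the entity within the sentence."""
    lowered = sentence.lower()
    position = lowered.find(entity.lower())
    if position == -1:
        return False  # entity not in this sentence
    before = lowered[:position]
    # Word boundaries keep "no" from matching inside words like "diagnosis".
    return any(
        re.search(rf"\b{re.escape(trigger)}\b", before)
        for trigger in NEGATION_TRIGGERS
    )


print(is_negated("Imaging shows no evidence of metastasis.", "metastasis"))
print(is_negated("74y female with type 2 dm and a recent stroke", "stroke"))
```

The first call reports the entity as negated and the second does not, which corresponds to is_negated being True and False respectively.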

In [29]:
from medspacy.context import ConTextRule

We use context rules for this

In [30]:
context = nlp.get_pipe("medspacy_context")

In [31]:
# Add a custom negation rule if needed (optional)

custom_negation_rule = [ConTextRule(
    literal="no evidence of",      # The phrase to match
    category="NEGATED_EXISTENCE",  # The category to assign
    direction="FORWARD",            # The direction of modification (FORWARD or BACKWARD)
    pattern=[{"LOWER": "no"}, {"LOWER": "evidence"}, {"LOWER": "of"}],  # The pattern to match
),
ConTextRule(literal= "is not evident", category= "NEGATED_EXISTENCE", direction= "BACKWARD",
            pattern=None),
]
context.add(custom_negation_rule) 

Run the pipeline

In [32]:
pre_process_section_target_dose_context_out = nlp(pre_process_out.text)
visualize_ent(pre_process_section_target_dose_context_out)

Convert the results into a DataFrame for each subject

In [33]:

def ent_to_dict(ent):
    d = {}
    d["text"] = ent.text
    d["label"] = ent.label_
    d["sent"] = ent.sent.text
    
    # ConText attributes
    d["is_negated"] = ent._.is_negated
    d["is_historical"] = ent._.is_historical
    d["is_uncertain"] = ent._.is_uncertain
    d["is_family"] = ent._.is_family
    d["is_hypothetical"] = ent._.is_hypothetical
    
    # Section
    d["section_catgeory"] = ent._.section_category
    
    return d

In [34]:
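import pandas as pd

# Collect the entities from the most recent run, which includes the dose and context rules
ents_data = []

for ent in pre_process_section_target_dose_context_out.ents:
    ents_data.append(ent_to_dict(ent))

ents_df = pd.DataFrame(ents_data)
ents_df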

|   | text | label | sent | is_negated | is_historical | is_uncertain | is_family | is_hypothetical | section_catgeory |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2573-5-30 | DATE | Admission Date: 2573-5-30 Discha... | False | False | False | False | False | admission_date |
| 1 | 2573-7-1 | DATE | Admission Date: 2573-5-30 Discha... | False | False | False | False | False | discharge_date |
| 2 | 2498-8-19 | DATE | Admission Date: 2573-5-30 Discha... | False | False | False | False | False | discharge_date |
| 3 | Hydrochlorothiazide | CHEMICAL | Admission Date: 2573-5-30 Discha... | False | False | False | False | True | allergy |
| 4 | Abdominal pain | DISEASE | Admission Date: 2573-5-30 Discha... | False | False | False | False | False | chief_complaint |
| 5 | 6-25 | DATE | Admission Date: 2573-5-30 Discha... | False | False | False | False | False | surgical_procedure |
| 6 | 5-31 | DATE | Admission Date: 2573-5-30 Discha... | False | False | False | False | False | surgical_procedure |
| 7 | 74y | PATIENT_AGE | Admission Date: 2573-5-30 Discha... | False | False | False | False | False | history_present_illness |
| 8 | female | gender | Admission Date: 2573-5-30 Discha... | False | False | False | False | False | history_present_illness |
| 9 | type 2 dm | DISEASE | Admission Date: 2573-5-30 Discha... | False | False | False | False | False | history_present_illness |
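Once each entity is a dictionary, the context attributes can be used to filter the results. The sketch below keeps only diseases that are neither negated nor attributed to a family member. The rows are hypothetical values written in the shape produced by ent_to_dict (they are not output of the pipeline), and current_patient_findings is an illustrative name.

```python
# Hypothetical rows in the shape produced by ent_to_dict; values are made up
# for illustration and only the keys used by the filter are included.
rows = [
    {"text": "metastasis", "label": "DISEASE", "is_negated": True, "is_family": False},
    {"text": "stroke", "label": "DISEASE", "is_negated": False, "is_family": True},
    {"text": "Pancreatitis", "label": "DISEASE", "is_negated": False, "is_family": False},
    {"text": "Acetaminophen", "label": "CHEMICAL", "is_negated": False, "is_family": False},
]


def current_patient_findings(rows, label="DISEASE"):
    """Keep entities of one label that are asserted for the patient."""
    return [
        row["text"]
        for row in rows
        if row["label"] == label
        and not row["is_negated"]
        and not row["is_family"]
    ]


findings = current_patient_findings(rows)
print(findings)
```

The same condition can be written as a boolean mask on the DataFrame if you prefer to stay in pandas.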


Test for another type of document

We can use an entity ruler to add entities that the model did not detect. Here, we label therapies as TREATMENT.

We apply the pipeline to a new type of text

In [35]:

discharge_summary_2 = """
Here's an example of a medical discharge summary, including key sections and information:
Patient Information:

    Name: John Doe
    DOB: 01/01/1960
    MRN: 1234567
    Admission Date: 03/01/2025
    Discharge Date: 04/05/2025
    Attending Physician: Dr. Jane Smith
    Primary Care Physician: Dr. David Lee 

Reason for Admission:

    Patient presented with shortness of breath and chest pain. 

Hospital Course:

    Patient was admitted to the cardiac unit for evaluation of chest pain.
    Electrocardiogram (ECG) showed ST-T wave changes.
    Troponin levels were elevated.
    Patient was diagnosed with acute myocardial infarction (heart attack).
    Patient received thrombolytic therapy and was transferred to the cardiac catheterization lab for percutaneous coronary intervention (PCI).
    PCI was successful with stent placement in the left anterior descending artery (LAD).
    Patient's condition improved, and he was transferred back to the cardiac unit.
    Patient tolerated the procedure well and was discharged. 

Diagnostic Results:
ECG: ST-T wave changes, Troponin: Elevated, and Coronary Angiogram: LAD stenosis. 
Medications:

    Discharge Medications:
        Aspirin 81mg daily
        Lisinopril 10mg daily
        Metoprolol 50mg twice daily
        Plavix 75mg daily
        Simvastatin 40mg daily 
    Medications at Admission:
        None 

Discharge Instructions:

    Follow up with Dr. David Lee within 1 week.
    Take all medications as prescribed.
    Avoid strenuous activity for 2 weeks.
    Monitor for chest pain, shortness of breath, or other symptoms.
    Contact your physician or go to the nearest emergency room if symptoms worsen.
    Follow a low sodium, low saturated fat diet.
    Maintain a healthy weight.
    Quit smoking.
    Exercise regularly. 

Follow-up Plans:

    Follow up with cardiologist in 2 weeks.
    Follow up with primary care physician in 1 week.
    Cardiac rehabilitation program."""


We label terms such as treatments and interventions as TREATMENT entities

In [36]:
# Add an EntityRuler to the pipeline, before the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Define patterns for treatments and therapies
interventions_list = ["intervention", "procedure", "surgery", "operation", "therapy", "treatment"]
patterns = [
    # Any alphabetic word followed by an intervention term, e.g. "thrombolytic therapy"
    {"label": "TREATMENT", "pattern": [{"IS_ALPHA": True}, {"LOWER": {"IN": interventions_list}}]},
    # "treated", then an alphabetic token, then a token containing a word character,
    # e.g. "treated with hemicolectomy"
    {"label": "TREATMENT", "pattern": [{"LOWER": "treated"}, {"IS_ALPHA": True}, {"LOWER": {"REGEX": r"\w"}}]},
    {"label": "TREATMENT", "pattern": [{"LOWER": "tx'd"}]},  # Matches "tx'd" when it is a single token
    {"label": "TREATMENT", "pattern": [{"LOWER": "tx"}]},  # Matches "tx"
    {"label": "TREATMENT", "pattern": [{"LOWER": "treatment"}]},  # Matches "treatment"
]


ruler.add_patterns(patterns)



In [53]:
dose_freq = [
    TargetRule(
        "2 times a day",
        category="DOSE_FREQUENCY",
        pattern=[
            {"LOWER": {"REGEX" :r"\d+"}},
            {"LOWER": "times"},
            {"LOWER": "a"},
            {"LOWER": "day"}
        ]
    ),
    TargetRule(
        "every 4 to 6 hours",
        category="DOSE_FREQUENCY",
        pattern=[
            {"LOWER": "every"},
            {"LOWER": {"REGEX" :r"\d+"}},
            {"LOWER": "to"},
            {"LOWER": {"REGEX" :r"\d+"}},
            {"LOWER": "hours"}
        ]
    ),

]

In [54]:
target_matcher.add(dose_freq)

We first rerun the pipeline on the previous discharge summary, not the new data

In [55]:
pre_process_section_target_dose_context_out = nlp(discharge_summary)
visualize_ent(pre_process_section_target_dose_context_out)

And run on the new data

In [37]:
pre_process_section_target_dose_context_out_2 = nlp(discharge_summary_2)
visualize_ent(pre_process_section_target_dose_context_out_2)

As we can see, the model does not recognize the patient's name or the follow-up plan. If we want, we can add section rules and entity rules to recognize them.

## We can add some section rules for those

In [38]:
section_rule_new = [SectionRule("Patient Information:", "patient_info"),
                SectionRule("Name:", "name"),
                SectionRule("Attending Physician:", "attending_physician"),
                SectionRule("Primary Care Physician:", "primary_care_physician"),
                SectionRule("Reason for Admission:", "reason_for_admission"),
                SectionRule("Diagnostic Results:", "diagnostic_results"),
                SectionRule("Medications at Admission:", "medications_at_admission"),
                
                SectionRule("Follow-up Plans:", "follow_up_plans"),
                SectionRule("Discharge Instructions:", "discharge_instructions"),
                ]

In [39]:
sectionizer.add(section_rule_new)

In [40]:
pre_process_section_target_dose_context_new_section_out_2 = nlp(discharge_summary_2)
visualize_ent(pre_process_section_target_dose_context_new_section_out_2)

We are still missing the frequency of the medications

We look for frequency information right after each dose entity we recognized.

To determine the frequency of a dose, we must look at the text following an entity labeled DOSE_AMOUNT; a frequency expression found elsewhere may not refer to the dose. To achieve this, we first find the entities labeled DOSE_AMOUNT by the earlier rules. Next, we inspect the one to three tokens immediately following each of them to check for a dose frequency expression. Finally, we assign the DOSE_FREQUENCY label to that expression.


We need to create a custom pipeline component and add it to the pipeline

In [41]:
# Helper: check whether a span overlaps any existing entity
def is_overlapping(span, existing_spans):
    for ent in existing_spans:
        if span.start < ent.end and ent.start < span.end:
            return True
    return False

@spacy.language.Language.component("dose_frequency_after_amount")
def dose_frequency_after_amount(doc: Doc):
    new_ents = list(doc.ents)
    for ent in list(new_ents):  # Iterate over a copy because we append below
        if ent.label_ != "DOSE_AMOUNT":
            continue
        start = ent.end
        next1 = doc[start] if start < len(doc) else None
        next2 = doc[start + 1] if start + 1 < len(doc) else None
        next3 = doc[start + 2] if start + 2 < len(doc) else None
        # Lowercased text of the following tokens ("" when past the end of the doc)
        word1 = next1.lower_ if next1 is not None else ""
        word2 = next2.lower_ if next2 is not None else ""
        word3 = next3.lower_ if next3 is not None else ""

        # Match a one-token pattern like "daily", unless a longer phrase follows
        if next1 is not None:
            if word1 in ["daily", "once", "twice", "three", "four"] and word2 not in ["daily", "times", "day", "week"]:
                span = doc[next1.i:next1.i + 1]
                if not is_overlapping(span, new_ents):
                    new_ents.append(Span(doc, span.start, span.end, label="DOSE_FREQUENCY"))

        # Match a two-token pattern like "twice daily"
        if next1 is not None and next2 is not None:
            phrase = f"{word1} {word2}"
            if phrase in ["once daily", "twice daily", "one time", "1 time", "2 times", "three times", "four times", "every day", "every week"] and word3 not in ["daily", "time", "times", "day", "days", "week", "weeks", "month", "months"]:
                span = doc[next1.i:next2.i + 1]
                if not is_overlapping(span, new_ents):
                    new_ents.append(Span(doc, span.start, span.end, label="DOSE_FREQUENCY"))

        # Match a three-token pattern like "every 6 hours" or "2 times daily"
        if next1 is not None and next2 is not None and next3 is not None:
            if ((word1 == "every" or re.match(r"^\d+$", word1))
                    and (next2.like_num or word2 in ["times", "time"])
                    and word3 in ["hour", "hours", "days", "weekly", "daily"]):
                span = doc[next1.i:next3.i + 1]
                if not is_overlapping(span, new_ents):
                    new_ents.append(Span(doc, span.start, span.end, label="DOSE_FREQUENCY"))

    doc.ents = new_ents
    return doc




Add the new component to the pipeline

In [42]:
# Step 4: Add to pipeline
nlp.add_pipe("dose_frequency_after_amount", last=True)

<function __main__.dose_frequency_after_amount(doc: spacy.tokens.doc.Doc)>

In [43]:
# nlp.add_pipe('find_dates')

In [44]:
pre_process_section_target_dose_context_new_section_out_2 = nlp(discharge_summary_2)
visualize_ent(pre_process_section_target_dose_context_new_section_out_2)

In [45]:
test = nlp("Do exercise twice daily and take tylenol 500mg daily, and metformin 500mg 2 times daily")
visualize_ent(test)

## As you can see, DOB is not recognized as a date of birth, so we add a target rule for it

In [46]:
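dob_rule = [
    # "DOB" is a single token
    TargetRule("DOB", "DATE_OF_BIRTH", pattern=[{"LOWER": "dob"}]),
    # "Date of Birth" spans three tokens, so each token needs its own entry
    TargetRule("Date of Birth", "DATE_OF_BIRTH",
               pattern=[{"LOWER": "date"}, {"LOWER": "of"}, {"LOWER": "birth"}]),
]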

In [47]:
target_matcher.add(dob_rule)

In [48]:
pre_process_section_target_dose_context_new_section_out_2 = nlp(discharge_summary_2)
visualize_ent(pre_process_section_target_dose_context_new_section_out_2)

In [49]:
ents_data = []

for ent in pre_process_section_target_dose_context_new_section_out_2.ents:
    ents_data.append(ent_to_dict(ent))

ents_df = pd.DataFrame(ents_data)
ents_df

|   | text | label | sent | is_negated | is_historical | is_uncertain | is_family | is_hypothetical | section_catgeory |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DOB | DATE_OF_BIRTH | \nHere's an example of a medical discharge sum... | False | False | False | False | False | name |
| 1 | 01/01/1960 | DATE | \nHere's an example of a medical discharge sum... | False | False | False | False | False | name |
| 2 | 03/01/2025 | DATE | \nHere's an example of a medical discharge sum... | False | False | False | False | False | admission_date |
| 3 | 04/05/2025 | DATE | \nHere's an example of a medical discharge sum... | False | False | False | False | False | discharge_date |
| 4 | shortness of breath | DISEASE | \nHere's an example of a medical discharge sum... | False | False | False | False | False | reason_for_admission |
| 5 | chest pain | DISEASE | \nHere's an example of a medical discharge sum... | False | False | False | False | False | reason_for_admission |
| 6 | chest pain | DISEASE | \n\nHospital Course:\n\n Patient was admitt... | False | False | False | False | False | hospital_course |
| 7 | acute myocardial infarction | DISEASE | \n Patient was diagnosed with acute myocard... | False | False | False | False | False | hospital_course |
| 8 | thrombolytic therapy | TREATMENT | \n Patient received thrombolytic therapy an... | False | False | False | False | False | hospital_course |
| 9 | coronary intervention | TREATMENT | \n Patient received thrombolytic therapy an... | False | False | False | False | False | hospital_course |
