In [1]:
import spacy
import medspacy
from medspacy.visualization import visualize_ent, visualize_dep

# Overview
In this notebook, we'll look at how to extract clinical concepts and attributes from text. We'll cover:
- Target matching
- Context analysis
- Section detection
- Postprocessing

In [2]:
with open("./discharge_summary.txt") as f:
    text = f.read()

In [None]:
nlp = spacy.load("en_core_web_sm", disable=["ner"])

# Target extraction
In this step, we'll write rules to extract the main concepts we're interested in.

In this example, we'll use two utilities provided in `medspacy.ner` for rule-based matching: the `TargetMatcher` and `TargetRule`. However, you can use any spaCy components for adding spans to `doc.ents`, including pre-trained NER models or other [spaCy rule-based matching components](https://spacy.io/usage/rule-based-matching/).
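
To make the idea of rule-based matching concrete before we use medSpaCy, here is a minimal pure-Python sketch. It is not how `TargetMatcher` is implemented: `LiteralRule` and `match_targets` are hypothetical names, and case-insensitive, whole-word matching is an assumption made for this illustration. Each rule pairs a literal phrase with a category, and every occurrence in the text is reported with its label.

```python
import re
from typing import NamedTuple


class LiteralRule(NamedTuple):
    """Hypothetical stand-in for a target rule: a literal phrase and a label."""
    literal: str
    category: str


def match_targets(text, rules):
    """Return (start, end, matched_text, category) tuples sorted by position."""
    matches = []
    for rule in rules:
        # Word boundaries keep "stroke" from matching inside "heatstroke".
        pattern = re.compile(r"\b" + re.escape(rule.literal) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), m.group(0), rule.category))
    return sorted(matches)


rules = [
    LiteralRule("abdominal pain", "PROBLEM"),
    LiteralRule("hemicolectomy", "TREATMENT"),
]
example = "Abdominal pain resolved after hemicolectomy."
matches = match_targets(example, rules)
for start, end, span_text, category in matches:
    print(span_text, category, sep="  |  ")
```

Working on tokens rather than raw characters, as spaCy does, makes it possible to match on token attributes such as the lowercased form, which the pattern-based rules later in this notebook rely on.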

## Target concepts
In our text, we'll extract the following concepts:
- Diagnoses 
- Medications

In addition, we'll show a few examples of how to add a custom spaCy attribute to a target rule to add an ICD-10 diagnosis code as an attribute of an entity.

In [None]:
from medspacy.ner import TargetMatcher, TargetRule

In [None]:
target_matcher = TargetMatcher(nlp)

In [None]:
nlp.add_pipe(target_matcher)

In [None]:
target_rules1 = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("radiotherapy", "TREATMENT",
              pattern=[{"LOWER": "xrt"}]),
    TargetRule("metastasis", "PROBLEM"),
    
]

In [None]:
target_matcher.add(target_rules1)

In [None]:
doc = nlp(text)

In [None]:
visualize_ent(doc)

In [None]:
for ent in doc.ents:
    print(ent, ent.label_, ent._.target_rule.literal, sep="  |  ")
    print()

## Adding custom attributes
One of the most powerful features of spaCy is the ability to add [custom attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes) to spaCy objects (`Doc`, `Span`, or `Token`). These custom attributes are accessed through the `._` namespace, as in `obj._.attribute_name`. As we'll see in later steps, medSpaCy adds several of these attributes in other components like `context` or `sectionizer`.

But it is sometimes useful to include a custom attribute as part of the target matching rule, rather than needing to build a separate component to add it. The `TargetRule` can also include a value for these attributes in the `attributes` argument. 

For example, let's say we want to map certain entities to [ICD-10 diagnosis codes](https://www.cdc.gov/nchs/icd/icd10cm.htm). One way we can do this is to include the diagnosis codes for concepts in our knowledge base. For example, **"Type II Diabetes"** can be mapped to **"E11.9"**. We can add this to entities by first registering the extension for the `Span` class:


In [None]:
from spacy.tokens import Span

In [None]:
Span.set_extension("icd10", default="")

We can now include ICD-10 code values in the target rules. We'll map **"Type II Diabetes Mellitus"** to **"E11.9"** and **"Hypertension"** to **"I10"**:

In [None]:
target_rules2 = [
    TargetRule("Type II Diabetes Mellitus", "PROBLEM", 
              pattern=[
                  {"LOWER": "type"},
                  {"LOWER": {"IN": ["2", "ii", "two"]}},
                  {"LOWER": {"IN": ["dm", "diabetes"]}},
                  {"LOWER": "mellitus", "OP": "?"}
              ],
              attributes={"icd10": "E11.9"}),
    TargetRule("Hypertension", "PROBLEM",
              pattern=[{"LOWER": {"IN": ["htn", "hypertension"]}}],
              attributes={"icd10": "I10"}),
    
    
]

In [None]:
target_matcher.add(target_rules2)

In [None]:
doc = nlp(text)

Now, whenever one of these rules results in a match, the ICD-10 value can be accessed in `ent._.icd10`:

In [None]:
for ent in doc.ents:
    if ent._.icd10 != "":
        print(ent, ent._.icd10)

# Context
Clinical text often contains mentions of concepts which the patient did not actually experience. For example:

- "There is *no evidence of* **pneumonia**"
- "*Mother* with **breast cancer**"
- "Patient presents for *r/o* **COVID-19**"

In all of these instances, we need to use the contextual clues around the entity to assert attributes like negation, experiencer, and uncertainty.

One method for this is the [ConText algorithm](https://www.sciencedirect.com/science/article/pii/S1532046409000744). ConText links target entities like problems with semantic modifiers like those shown above. The medSpaCy implementation of ConText is [cycontext](https://github.com/medspacy/cycontext).
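
The core idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the cycontext implementation: `Modifier`, `find_phrase`, and `link_modifiers` are hypothetical names, the category labels are illustrative, and the sketch ignores parts of the algorithm such as termination points. Each modifier has a direction, and a target is linked to a modifier only if it falls on that side of the modifier within the same sentence.

```python
from typing import NamedTuple


class Modifier(NamedTuple):
    """Hypothetical modifier rule: trigger phrase, category, and direction."""
    phrase: str
    category: str
    direction: str  # "FORWARD": modifies targets to its right; "BACKWARD": to its left


def find_phrase(tokens, phrase):
    """Return (start, end) token indices of the first occurrence, or None."""
    words = phrase.lower().split()
    for i in range(len(tokens) - len(words) + 1):
        if [t.lower() for t in tokens[i:i + len(words)]] == words:
            return i, i + len(words)
    return None


def link_modifiers(sentence, targets, modifiers):
    """Map each target found in the sentence to the categories that modify it."""
    tokens = sentence.rstrip(".").split()
    links = {}
    for target in targets:
        t_span = find_phrase(tokens, target)
        if t_span is None:
            continue
        links[target] = []
        for mod in modifiers:
            m_span = find_phrase(tokens, mod.phrase)
            if m_span is None:
                continue
            in_forward_scope = mod.direction == "FORWARD" and t_span[0] >= m_span[1]
            in_backward_scope = mod.direction == "BACKWARD" and t_span[1] <= m_span[0]
            if in_forward_scope or in_backward_scope:
                links[target].append(mod.category)
    return links


modifiers = [
    Modifier("no evidence of", "NEGATED_EXISTENCE", "FORWARD"),
    Modifier("mother", "FAMILY", "FORWARD"),
]
negated = link_modifiers("There is no evidence of pneumonia.", ["pneumonia"], modifiers)
family = link_modifiers("Mother with breast cancer.", ["breast cancer"], modifiers)
print(negated)
print(family)
```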

Here we'll show the basic usage of ConText. When instantiating ConText, we can use the default rules and then add additional rules as needed. See the [cycontext](https://github.com/medspacy/cycontext) repository for more detailed examples and tutorials.

In [None]:
from medspacy.context import ConTextComponent, ConTextItem

In [None]:
context = ConTextComponent(nlp, rules="default")

In [None]:
nlp.add_pipe(context)

In [None]:
nlp.pipe_names

In [None]:
doc = nlp("Mother with stroke at age 82.")

We can use the medSpaCy visualizers from the module `medspacy.visualization` to show the targets and modifiers in text. `visualize_dep` draws arrows between modifiers and targets to show which concepts are modified by which semantic modifiers:

In [None]:
visualize_ent(doc)

In [None]:
visualize_dep(doc)

In [None]:
short_doc = nlp("Colon cancer diagnosed in 2012")

We can add a new rule using the `ConTextItem` class:

In [None]:
item_data = [
    ConTextItem("diagnosed in <YEAR>", "HISTORICAL",
                rule="BACKWARD",  # Look "backwards" in the text (to the left)
                pattern=[
                    {"LOWER": "diagnosed"},
                    {"LOWER": "in"},
                    # Raw string so that "\d" is not treated as a string escape
                    {"LOWER": {"REGEX": r"^[\d]{4}$"}}
                ])
]

In [None]:
context.add(item_data)

In [None]:
short_doc = nlp("Colon cancer diagnosed in 2012")

In [None]:
visualize_ent(short_doc)

In [None]:
visualize_dep(short_doc)
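
The year expression in the rule above can be sanity-checked on its own with Python's `re` module. This sketch assumes that spaCy's `REGEX` operator behaves like `re.search` on a single token's text. The sample strings are made up for illustration.

```python
import re

# Same expression as in the ConTextItem pattern, written as a raw string
year_pattern = re.compile(r"^[\d]{4}$")

# Candidate token texts: only an exact four-digit string should match
candidates = ["2012", "12", "20125", "2012."]
results = {text: bool(year_pattern.search(text)) for text in candidates}

for text, is_year in results.items():
    print(repr(text), is_year)
```

The `^` and `$` anchors matter here: without them, any token containing four consecutive digits would be treated as a year.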

In addition to linking targets and modifiers, `cycontext` will also set attributes for each entity:

In [None]:
for ent in doc.ents:
    if any([ent._.is_negated, ent._.is_uncertain, ent._.is_historical, ent._.is_family, ent._.is_hypothetical]):
        print("'{0}' modified by {1} in: '{2}'".format(ent, ent._.modifiers, ent.sent))
        print()

# Section detection
We are often interested in which section of a clinical note an entity occurs in. This can be useful for setting attributes like temporality (similar to ConText) or for extracting entities from specific sections of the note.

MedSpaCy includes the `Sectionizer` class from the [clinical_sectionizer](https://github.com/medspacy/sectionizer) package. Similar to `ConTextComponent`, we can instantiate this with default rules and add new ones to fit our specific data. Section detection is especially dependent on your data, as each EHR will use different note formatting.
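
As a rough illustration of what section detection involves, here is a stdlib-only sketch. It is not the `Sectionizer` implementation: `SECTION_PATTERNS` and `split_sections` are hypothetical, and the sample note is invented. Header patterns are mapped to normalized titles, and each section runs from the end of its header to the start of the next one.

```python
import re

# Hypothetical header patterns mapped to normalized section titles
SECTION_PATTERNS = {
    "past_medical_history": r"past medical history:",
    "hospital_course": r"brief hospital course:",
}


def split_sections(note):
    """Return (normalized_title, header_text, body_text) tuples in note order."""
    headers = []
    for title, pattern in SECTION_PATTERNS.items():
        for m in re.finditer(pattern, note, flags=re.IGNORECASE):
            headers.append((m.start(), m.end(), title))
    headers.sort()

    sections = []
    for i, (start, end, title) in enumerate(headers):
        # A section body ends where the next header begins (or at the end of the note)
        next_start = headers[i + 1][0] if i + 1 < len(headers) else len(note)
        sections.append((title, note[start:end], note[end:next_start].strip()))
    return sections


note = (
    "Past Medical History:\nHypertension\n\n"
    "Brief Hospital Course:\nAdmitted to the ICU.\n"
)
sections = split_sections(note)
for title, header, body in sections:
    print(title, header, body, sep="  |  ")
```

Because the patterns are the only note-specific part, adapting this kind of approach to a different note format mostly means editing the pattern table, which is also how we will extend the sectionizer in this notebook.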

In [None]:
from medspacy.section_detection import Sectionizer

In [None]:
sectionizer = Sectionizer(nlp, patterns="default")

In [None]:
nlp.add_pipe(sectionizer)

In [None]:
doc = nlp(text)

`visualize_ent` will now highlight section titles in addition to entities and context modifiers:

In [None]:
visualize_ent(doc)

We can see here that the default rules did not catch the section title **"Brief Hospital Course"**. We can add a pattern to our sectionizer by passing in a dictionary with two key/value pairs:
- **"section_title"**: The normalized section title
- **"pattern"**: A spaCy pattern to match the text (either a string or a list of Token dictionaries, similar to other components)

In [None]:
section_patterns = [
    {"section_title": "hospital_course", "pattern": "Brief Hospital Course:"}
]

In [None]:
sectionizer.add(section_patterns)

In [None]:
visualize_ent(nlp("""
Brief Hospital Course:
Ms. [**Known patient lastname 2004**] was admitted on [**2573-5-30**]. Ultrasound at the time of
admission demonstrated pancreatic duct dilitation and an
edematous gallbladder. She was admitted to the ICU.
"""))

The sectionizer will add attributes to allow us to access section data. The `doc` object will have these 3 attributes:

In [None]:
# Normalized section titles
print(doc._.section_titles)

In [None]:
# The Spans of the doc representing section headers
doc._.section_headers

In [None]:
# Spans of the actual sections of the notes
doc._.section_spans[:5]

These are also available zipped together as tuples in a single attribute:

In [None]:
print(doc._.sections[:5])

For each section detected in the note, we'll print out the **normalized section title**, **section header**, and **the first 25 tokens of the section**:

In [None]:
for (section_title, section_header, section) in doc._.sections:
    print(section_title, section_header)
    print(section[:25])
    print("----------------")

Each entity has similar attributes:

In [None]:
for ent in doc.ents:
    print(ent, ent._.section_title)
    print()

# Postprocessing
The final component we'll introduce is the `postprocessor`. The postprocessor iterates through each entity and checks a series of conditions on each. If all conditions evaluate as `True`, then some action is taken on the entity. Some use cases of this include removing an entity or changing an attribute.

For example, let's say that we want to exclude any entity which comes from the **"patient_instructions"** section, as these are typically not experienced by the patient and are purely hypothetical. We'll write a rule to remove any entity from `doc.ents` if it came from this section. 

The design pattern for a postprocessing rule is as follows:
- A `PostprocessingRule` contains a list of `patterns` and an `action` to take if all of the `patterns` evaluate as `True`
- Each `PostprocessingPattern` takes a `condition`, which evaluates as `True` or `False`. If all patterns return `True`, the action is taken
- Each pattern can take optional `condition_args` to pass into the condition check, and each rule takes optional `action_args`
- The module `postprocessing_functions` offers utility functions for the `condition` and `action` arguments
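
The design pattern above can be sketched without medSpaCy. All names here (`Entity`, `SketchRule`, `remove_entity`, `postprocess`) are hypothetical stand-ins, not the library's classes. Stopping after the first rule that fires is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Entity:
    """Hypothetical stand-in for an entity that knows its section title."""
    text: str
    section_title: str


@dataclass
class SketchRule:
    """Hypothetical rule: run `action` when every condition returns True."""
    conditions: List[Callable[[Entity], bool]]
    action: Callable[[Entity, List[Entity]], None]
    description: str = ""


def remove_entity(ent, ents):
    """Action: drop the entity from the list of kept entities."""
    ents.remove(ent)


def postprocess(ents, rules):
    """Apply the rules to each entity and return the entities that survive."""
    kept = list(ents)
    for ent in ents:
        for rule in rules:
            if all(condition(ent) for condition in rule.conditions):
                rule.action(ent, kept)
                break  # assumption: stop after the first rule that fires
    return kept


rules = [
    SketchRule(
        conditions=[lambda ent: ent.section_title == "patient_instructions"],
        action=remove_entity,
        description="Remove any entities from the instructions section.",
    )
]
ents = [
    Entity("hypertension", "past_medical_history"),
    Entity("abdominal pain", "patient_instructions"),
]
kept = postprocess(ents, rules)
print([ent.text for ent in kept])
```

Keeping conditions and actions as plain callables means a new rule is just a new combination of small functions, with no need to write a new pipeline component.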

In [None]:
from medspacy.postprocess import Postprocessor, PostprocessingRule, PostprocessingPattern
from medspacy.postprocess import postprocessing_functions

In [None]:
postprocessor = Postprocessor(debug=False)

In [None]:
nlp.add_pipe(postprocessor)

In [None]:
postprocess_rules = [
    # Instantiate our rule
    PostprocessingRule(
        # Pass in a list of patterns
        patterns=[
            # The pattern will check if the entity's section is "patient_instructions"
            PostprocessingPattern(condition=lambda ent: ent._.section_title == "patient_instructions"),
        ],
        # If all patterns are True, this entity will be removed.
        action=postprocessing_functions.remove_ent,
        description="Remove any entities from the instructions section."
    ),
    
]

Before adding the postprocessing rules, here are the final 5 entities:

In [None]:
print("Before:")
for ent in doc.ents[-5:]:
    print(ent, ent._.section_title)

In [None]:
postprocessor.add(postprocess_rules)

In [None]:
doc = nlp(text)

Afterwards, the final 2 entities have been removed:

In [None]:
print("After:")
for ent in doc.ents[-5:]:
    print(ent, ent._.section_title)