In [None]:
import sys

In [2]:
sys.path.insert(0, "..")

In [3]:
import spacy
import medspacy
from medspacy.visualization import visualize_ent, visualize_dep

# Overview
In this notebook, we'll look at how to extract clinical concepts and attributes from text.
- Target matching
- Context analysis

In [4]:
with open("./discharge_summary.txt") as f:
    text = f.read()

In [5]:
nlp = medspacy.load(medspacy_enable=["medspacy_pyrush"])

# Target extraction
In this step, we'll write rules to extract the main concepts we're interested in.

In this example, we'll use two utilities provided in `medspacy.ner` for rule-based matching: the `TargetMatcher` and `TargetRule`. However, you can use any spaCy components for adding spans to `doc.ents`, including pre-trained NER models or other [spaCy rule-based matching components](https://spacy.io/usage/rule-based-matching/).

If you want to use a statistical NER model instead of a rule-based component, see [./6-Using-Pretrained-Models.ipynb](./6-Using-Pretrained-Models.ipynb)

## Target concepts
In our text, we'll extract the following concepts:
- Diagnoses 
- Medications

In addition, we'll show a few examples of how to add a custom spaCy attribute to a target rule to add an ICD-10 diagnosis code as an attribute of an entity.

First, we'll add the `TargetMatcher` component to our pipeline. We'll import `TargetRule` afterwards, when we define the rules themselves.

In [6]:
target_matcher = nlp.add_pipe("medspacy_target_matcher")

Now we will define some rules for extracting concepts from the text. Each rule is an instance of the `TargetRule` class, which takes the following arguments:
- `literal`: An exact phrase to match in the text
- `category`: The semantic class of the entity, will correspond to `ent.label_` in a Doc.
- `pattern`: An optional pattern to match rather than `literal`. As we'll see below, this can be either a list of dictionaries defining token attributes or a regular expression string.
- `on_match`: An optional callback function. See https://spacy.io/usage/rule-based-matching#on_match
- `metadata`: An optional dictionary of metadata
- `attributes`: An optional dictionary of custom attributes to set in the resulting entity. We'll see an example below.

Let's define a few simple rules:

In [7]:
from medspacy.ner import TargetRule

In [8]:
target_rules = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("metastasis", "PROBLEM"),
    
]

Then add the rules to the `target_matcher`

In [9]:
target_matcher.add(target_rules)

And process the text

In [10]:
doc = nlp(text)

Now that the text has been processed, we can see that there are a variety of entities in `doc.ents` that were identified with the `target_matcher`.

In [11]:
print(doc.ents)

(Hydrochlorothiazide, Abdominal pain, stroke, abdominal pain, metastasis, Colon cancer, hemicolectomy, stroke, abdominal pain, abdominal pain)


We can also compare each entity in `doc.ents` with the label it was given and the rule that generated it.

In [12]:
for ent in doc.ents:
    print(ent, ent.label_, ent._.target_rule, sep="  |  ")
    print()

Hydrochlorothiazide  |  TREATMENT  |  TargetRule(literal="Hydrochlorothiazide", category="TREATMENT", pattern=None, attributes=None, on_match=None)

Abdominal pain  |  PROBLEM  |  TargetRule(literal="abdominal pain", category="PROBLEM", pattern=None, attributes=None, on_match=None)

stroke  |  PROBLEM  |  TargetRule(literal="stroke", category="PROBLEM", pattern=None, attributes=None, on_match=None)

abdominal pain  |  PROBLEM  |  TargetRule(literal="abdominal pain", category="PROBLEM", pattern=None, attributes=None, on_match=None)

metastasis  |  PROBLEM  |  TargetRule(literal="metastasis", category="PROBLEM", pattern=None, attributes=None, on_match=None)

Colon cancer  |  PROBLEM  |  TargetRule(literal="colon cancer", category="PROBLEM", pattern=None, attributes=None, on_match=None)

hemicolectomy  |  TREATMENT  |  TargetRule(literal="hemicolectomy", category="TREATMENT", pattern=None, attributes=None, on_match=None)

stroke  |  PROBLEM  |  TargetRule(literal="stroke", category="PROBL

## Advanced Pattern Matching
SpaCy has powerful pattern matching which allows you to match on a list of dictionaries which define attributes for each token. See https://spacy.io/usage/rule-based-matching for spaCy's documentation and examples. Additionally, medspaCy allows matching with regular expressions on the underlying text of the doc.

Let's see some examples of `TargetRule`s which utilize pattern matching. The first rule uses token patterns to match any single token where the lower-case text is either `xrt` or `radiotherapy`. The second uses regular expressions to match various forms of Type 1/Type 2 diabetes.

In [13]:
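pattern_rules = [
    # Token pattern: a single token whose lower-cased text is "xrt" or "radiotherapy".
    # Radiotherapy is a treatment, so it is labeled TREATMENT like "hemicolectomy" above.
    TargetRule("radiotherapy", "TREATMENT",
               pattern=[{"LOWER": {"IN": ["xrt", "radiotherapy"]}}]),
    # Regular expression: matched against the underlying text of the doc.
    TargetRule("diabetes", "PROBLEM",
               pattern=r"type (i|ii|1|2|one|two) (dm|diabetes mellitus)"),
]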

In [14]:
target_matcher.add(pattern_rules)

In [15]:
doc = nlp(text)

We can now see several new entities have been added to the text after adding the new rules: "XRT", "type 2 dm", "Type II Diabetes Mellitus", "Type 2 DM".

In [16]:
print(doc.ents)

(Hydrochlorothiazide, Abdominal pain, type 2 dm, stroke, abdominal pain, metastasis, Colon cancer, hemicolectomy, XRT, Type II Diabetes Mellitus, stroke, Type 2 DM, abdominal pain, abdominal pain)
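To check which surface forms the diabetes regular expression covers, it can be tried on its own with Python's `re` module. This is only a sketch: the `re.IGNORECASE` flag is an assumption, suggested by the fact that "Type II Diabetes Mellitus" and "Type 2 DM" were matched above, and the `samples` list is made up for illustration.

```python
import re

# Same pattern string as the "diabetes" rule above.
# Assumption: matching is case-insensitive, as the results above suggest.
diabetes_re = re.compile(
    r"type (i|ii|1|2|one|two) (dm|diabetes mellitus)", re.IGNORECASE
)

# Hypothetical sample phrases; the last one lacks "mellitus".
samples = ["type 2 dm", "Type II Diabetes Mellitus", "Type 2 DM", "type 2 diabetes"]

matched = [s for s in samples if diabetes_re.fullmatch(s)]
print(matched)
```

Note that "type 2 diabetes" is not matched, because the pattern requires either "dm" or the full phrase "diabetes mellitus".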


### A note about regular expressions
Regular-expression matching over the raw text of a doc is not natively supported by spaCy (spaCy's own `REGEX` operator applies to individual tokens), and it can produce unexpected spans when the match boundaries do not align with token boundaries. For example, let's say you have a pattern like this one, which matches a span of text that starts in the middle of one token and ends in the middle of another:

In [17]:
import re

example_text = "SERVICE: Radiology"
pattern_str = r"ICE: Rad"

In this case, medspaCy will find the closest token boundaries around the matched span, meaning that the resulting `ent` won't actually equal the matched span of text. Because of this, it is recommended to use a list of dicts whenever possible.

In [18]:
target_matcher.add([TargetRule("ICE: Rad", "SERVICE", pattern=pattern_str)])

In [19]:
nlp(example_text).ents

(SERVICE: Radiology,)

In [20]:
re.search(pattern_str, example_text)

<re.Match object; span=(4, 12), match='ICE: Rad'>
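The widening from the raw match to the final entity can be sketched in plain Python. This is an illustration of the idea only, not medspaCy's implementation: the helper `expand_to_token_boundaries` is hypothetical, and it assumes whitespace-delimited tokens rather than spaCy's tokenizer.

```python
import re


def expand_to_token_boundaries(text, start, end):
    """Widen the character span [start, end) to cover every token it touches.

    Assumption: tokens are whitespace-delimited, which is simpler than
    spaCy's real tokenization.
    """
    token_spans = [m.span() for m in re.finditer(r"\S+", text)]
    # Keep every token that overlaps the matched character span.
    touched = [(s, e) for s, e in token_spans if s < end and e > start]
    if not touched:
        return start, end
    return touched[0][0], touched[-1][1]


example_text = "SERVICE: Radiology"
match = re.search(r"ICE: Rad", example_text)
start, end = expand_to_token_boundaries(example_text, *match.span())
expanded = example_text[start:end]
print(match.group(), "->", expanded)
```

The regular expression matched only `ICE: Rad`, but the widened span covers both tokens it overlaps, which is why the entity above is `SERVICE: Radiology`.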

## Adding custom attributes
One of the most powerful features of spaCy is the ability to add [custom attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes) to spaCy objects (`Doc`, `Span`, or `Token`). These custom attributes are accessed through `obj._....`. As we'll see in later steps, medspaCy adds several of these attributes in other components like `context` or `sectionizer`.

But it is sometimes useful to include a custom attribute as part of the target matching rule, rather than needing to build a separate component to add it. The `TargetRule` can also include a value for these attributes in the `attributes` argument. 

For example, let's say we want to map certain entities to [ICD-10 diagnosis codes](https://www.cdc.gov/nchs/icd/icd10cm.htm). One way we can do this is to include the diagnosis codes for concepts in our knowledge base. For instance, **"Type II Diabetes"** can be mapped to **"E11.9"**. We can add this to entities by first registering the extension for the `Span` class:


In [21]:
from spacy.tokens import Span

In [22]:
Span.set_extension("icd10", default="")

We can now include ICD-10 code values in the target rules. We'll map **"Type II Diabetes Mellitus"** to **"E11.9"** and **"Hypertension"** to **"I10"**:

In [23]:
target_rules2 = [
    TargetRule("Type II Diabetes Mellitus", "PROBLEM", 
              pattern=[
                  {"LOWER": "type"},
                  {"LOWER": {"IN": ["2", "ii", "two"]}},
                  {"LOWER": {"IN": ["dm", "diabetes"]}},
                  {"LOWER": "mellitus", "OP": "?"}
              ],
              attributes={"icd10": "E11.9"}),
    TargetRule("Hypertension", "PROBLEM",
              pattern=[{"LOWER": {"IN": ["htn", "hypertension"]}}],
              attributes={"icd10": "I10"}),
    
    
]

In [24]:
target_matcher.add(target_rules2)

In [25]:
doc = nlp(text)

Now, whenever one of these rules results in a match, the ICD-10 value can be accessed in `ent._.icd10`:

In [26]:
for ent in doc.ents:
    if ent._.icd10 != "":
        print(ent, ent._.icd10)

type 2 dm E11.9
Type II Diabetes Mellitus E11.9
Hypertension I10
Type 2 DM E11.9
HTN I10
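To see why the diabetes token pattern in `target_rules2` matches both "type 2 dm" and "Type II Diabetes Mellitus", here is a toy matcher in plain Python. It is a simplified sketch, not spaCy's `Matcher`: the helpers `token_matches`, `match_at`, and `find_matches` are hypothetical, tokens are assumed to be whitespace-delimited, and only `LOWER`, `IN`, and a greedy `"OP": "?"` are supported.

```python
def token_matches(token, spec):
    """Check one token against one pattern dict (only LOWER is supported)."""
    expected = spec["LOWER"]
    lowered = token.lower()
    if isinstance(expected, dict):  # e.g. {"IN": ["2", "ii", "two"]}
        return lowered in expected["IN"]
    return lowered == expected


def match_at(tokens, start, pattern):
    """Return the end index of a match beginning at `start`, or None."""
    i = start
    for spec in pattern:
        if i < len(tokens) and token_matches(tokens[i], spec):
            i += 1  # consume the token
        elif spec.get("OP") == "?":
            continue  # optional token is absent: skip this pattern element
        else:
            return None
    return i


def find_matches(text, pattern):
    """Return the matched phrases, assuming whitespace-delimited tokens."""
    tokens = text.split()
    found = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, pattern)
        if end is not None:
            found.append(" ".join(tokens[start:end]))
    return found


# Same token pattern as the "Type II Diabetes Mellitus" rule above.
diabetes_pattern = [
    {"LOWER": "type"},
    {"LOWER": {"IN": ["2", "ii", "two"]}},
    {"LOWER": {"IN": ["dm", "diabetes"]}},
    {"LOWER": "mellitus", "OP": "?"},
]

# Hypothetical sentence for illustration.
print(find_matches("History of Type II Diabetes Mellitus and HTN", diabetes_pattern))
```

Because every element compares the lower-cased token, capitalization does not matter, and the optional final element lets the same rule cover the abbreviated form "type 2 dm".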


# Context
Clinical text often contains mentions of concepts which the patient did not actually experience. For example:

- "There is *no evidence of* **pneumonia**"
- "*Mother* with **breast cancer**"
- "Patient presents for *r/o* **COVID-19**"

In all of these instances, we need to use the contextual clues around the entity to assert attributes like negation, experiencer, and uncertainty.

One method for this is the [ConText algorithm](https://www.sciencedirect.com/science/article/pii/S1532046409000744). ConText links target entities like problems with semantic modifiers like those shown above. The medspaCy implementation of ConText is found in `medspacy.context`.

The ConText component is explained in more detail in the `context/` notebooks.

**NOTE**: ConText requires sentence boundaries or windows to be set. More information will be in the `context/` notebooks and in the ConText documentation. For now, we rely on the PyRuSH sentence splitter that was enabled when we loaded the pipeline above.

In [27]:
context = nlp.add_pipe("medspacy_context")

In [28]:
nlp.pipe_names

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context']

In [29]:
doc = nlp("Mother with stroke at age 82.")

We can use the medspaCy visualizers from the module `medspacy.visualization` to show the targets and modifiers in text. `visualize_ent` highlights them in place, while `visualize_dep` draws arrows between modifiers and targets to show which concepts are modified by which semantic modifiers:

In [30]:
visualize_ent(doc)

In [31]:
visualize_dep(doc)

In [32]:
short_doc = nlp("Colon cancer diagnosed in 2012")

We can add a new rule using the `ConTextRule` class, which behaves similarly to `TargetRule` with some additional arguments defining modifier behavior:

In [33]:
from medspacy.context import ConTextRule

In [34]:
context_rules = [
    ConTextRule(
        "diagnosed in <YEAR>", "HISTORICAL",
        direction="BACKWARD",  # Look "backwards" in the text (to the left)
        pattern=[
            {"LOWER": "diagnosed"},
            {"LOWER": "in"},
            {"LOWER": {"REGEX": r"^\d{4}$"}},  # raw string avoids an invalid "\d" escape
        ],
    )
]

In [35]:
context.add(context_rules)

In [36]:
short_doc = nlp("Colon cancer diagnosed in 2012")

In [37]:
visualize_ent(short_doc)

In [38]:
visualize_dep(short_doc)
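The effect of `direction="BACKWARD"` can be illustrated with a toy sketch in plain Python. It is not the ConText implementation, which also handles sentence boundaries, terminating modifiers, and other details: the helpers `find_modifier` and `targets_in_scope` are hypothetical, tokens are assumed to be whitespace-delimited, and the target span is written by hand.

```python
import re

YEAR = re.compile(r"^\d{4}$")  # same year pattern as in the rule above


def find_modifier(tokens):
    """Return (start, end) token indices of 'diagnosed in <YEAR>', or None."""
    for i in range(len(tokens) - 2):
        if (tokens[i].lower() == "diagnosed"
                and tokens[i + 1].lower() == "in"
                and YEAR.match(tokens[i + 2])):
            return i, i + 3
    return None


def targets_in_scope(targets, modifier, direction):
    """Keep the targets lying on the side of the modifier named by `direction`."""
    mod_start, mod_end = modifier
    if direction == "BACKWARD":  # targets that end before the modifier begins
        return [t for t in targets if t[1] <= mod_start]
    if direction == "FORWARD":  # targets that begin after the modifier ends
        return [t for t in targets if t[0] >= mod_end]
    raise ValueError(f"unsupported direction: {direction}")


tokens = "Colon cancer diagnosed in 2012".split()
targets = [(0, 2)]  # token span of "Colon cancer", set by hand for this sketch
modifier = find_modifier(tokens)

backward = targets_in_scope(targets, modifier, "BACKWARD")
forward = targets_in_scope(targets, modifier, "FORWARD")
print(modifier, backward, forward)
```

With `BACKWARD`, "Colon cancer" falls inside the modifier's scope because it sits to the left of "diagnosed in 2012"; with `FORWARD` the same modifier would find nothing to modify in this sentence.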

In addition to linking targets and modifiers, ConText will also set attributes for each entity:

In [39]:
for ent in doc.ents:
    if any([ent._.is_negated, ent._.is_uncertain, ent._.is_historical, ent._.is_family, ent._.is_hypothetical, ]):
        print("'{0}' modified by {1} in: '{2}'".format(ent, ent._.modifiers, ent.sent))
        print()

'stroke' modified by (<ConTextModifier> [0, 1, FAMILY],) in: 'Mother with stroke at age 82.'

