# NLP with medspaCy

<a target="_blank" href="https://colab.research.google.com/github/medspacy/medspaCy_MedInfo_2023/blob/main/MedInfo-2023-NLP-with-medspaCy.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook offers a brief tutorial on using medspaCy for clinical NLP. It is only a high-level crash course, but plenty of other materials are available for learning how to use medspaCy. For a list of resources, check out the [medspaCy repository](https://github.com/medspacy/medspacy) on GitHub and the [MedInfo 2023 repository](https://github.com/medspacy/medspaCy_MedInfo_2023) that accompanies this tutorial.


## Overview of medspaCy
<img alt="MedSpaCy logo" src="https://github.com/medspacy/medspacy/raw/master/images/medspacy_logo.png">


[`medspaCy`](https://github.com/medspacy/medspacy) is an open-source package maintained by NLP developers at the University of Utah and the US Department of Veterans Affairs. It's built using the popular [spaCy](https://spacy.io/) library and is specifically designed for working with clinical notes. 

The goal of medspaCy is to provide flexible, easy-to-use spaCy components for common clinical NLP tasks, such as:

- Concept extraction
- Negation detection
- Document section splitting

Here are several papers that describe or use medspaCy:

- [Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8861690/)
- [A Natural Language Processing System for National COVID-19 Surveillance in the US Department of Veterans Affairs](https://aclanthology.org/2020.nlpcovid19-acl.10.pdf)
- [ReHouSED: A novel measurement of Veteran housing stability using natural language processing](https://www.sciencedirect.com/science/article/pii/S153204642100232X?via%3Dihub)
- [Assessing mortality prediction through different representation models based on concepts extracted from clinical notes](https://arxiv.org/pdf/2207.10872.pdf)
- [A Study into patient similarity through representation learning from medical records](https://arxiv.org/pdf/2104.14229.pdf)

## Outline
This notebook will provide a crash course for building an NLP model and processing a clinical note using medspaCy. We'll walk through the basic steps of clinical text extraction.


1. Getting started
2. Tokenization and sentence splitting
3. Entity extraction
4. Attribute detection
5. Output

# 1. Getting started
## Installing medspaCy
We'll start by installing medspaCy using `pip`. medspaCy has a lot of requirements, so installation might take a couple of minutes.

In [1]:
!pip install medspacy==1.1.2





In [2]:
import medspacy

## Loading an `nlp` object
The simplest way to create a model in medspaCy is `medspacy.load()`. This returns an instance of spaCy's `English` language class (see spaCy's documentation for more information about language objects).

In [3]:
nlp = medspacy.load()

The `nlp` object is spaCy's implementation of a processing pipeline. It contains several *components* that each perform a different processing step on the document. By default, `medspacy.load()` loads a pipeline with the following four components, which we can then add to later.
![medspacy pipeline](https://github.com/medspacy/medspaCy_MedInfo_2023/blob/main/images/medspacy-pipeline.png?raw=true)

We can check which components are included in a pipeline by looking at the `pipe_names` attribute. spaCy does not list the tokenizer there, so only three names appear:

In [4]:
nlp.pipe_names

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context']
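
To make the idea of a pipeline concrete, here is a toy sketch in plain Python. It is not how spaCy or medspaCy implement pipelines; the `ToyPipeline` class and both component functions are hypothetical names invented for illustration. It only shows the pattern: named components run in order, each receiving and returning the same document.

```python
class ToyPipeline:
    """Hypothetical stand-in for an `nlp` object: an ordered list of named components."""

    def __init__(self):
        self._components = []  # (name, callable) pairs, in processing order

    def add_pipe(self, name, component):
        self._components.append((name, component))

    @property
    def pipe_names(self):
        return [name for name, _ in self._components]

    def __call__(self, text):
        doc = {"text": text, "sents": [], "ents": []}
        for _, component in self._components:
            doc = component(doc)  # each step enriches the same document
        return doc


def toy_sentence_splitter(doc):
    # Naive rule: a sentence ends at ". " (real splitters are far more careful)
    doc["sents"] = [s.strip() for s in doc["text"].split(". ") if s.strip()]
    return doc


def toy_target_matcher(doc):
    # Look for one hard-coded concept and record the index of its sentence
    doc["ents"] = [("pneumonia", i) for i, s in enumerate(doc["sents"])
                   if "pneumonia" in s.lower()]
    return doc


toy_nlp = ToyPipeline()
toy_nlp.add_pipe("toy_sentence_splitter", toy_sentence_splitter)
toy_nlp.add_pipe("toy_target_matcher", toy_target_matcher)

toy_doc = toy_nlp("Presenting with SOB. Chest image shows no evidence of pneumonia.")
print(toy_nlp.pipe_names)
print(toy_doc["ents"])
```

Order matters in this sketch just as it does in a real pipeline: the matcher relies on the sentences produced by the step before it.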

You can also build models in other languages. For example, the next few cells will create a German spaCy model which we can then pass into `medspacy.load()` (although other components will need additional customization):

In [5]:
import spacy
nlp_de = spacy.blank("de")

In [6]:
nlp_de = medspacy.load(nlp_de, load_rules=False) # Disable default rules

In [7]:
nlp_de

<spacy.lang.de.German at 0x7fec39b827f0>

In [8]:
nlp_de.pipe_names

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context']

## Processing a clinical note

Throughout this notebook, our running example will be a chest imaging report for a patient with possible pneumonia due to a recent Covid-19 infection:

In [9]:
text = ("Patient with history of Covid-19 diagnosis; presenting with SOB. "
          "Chest image shows no ground glass opacities or evidence of pneumonia. ")

To process a clinical note, we pass its text to the `nlp` object, which returns a `Doc` object. This `doc` is what we will be working with for the rest of the notebook.

In [10]:
doc = nlp(text)
print(doc)

Patient with history of Covid-19 diagnosis; presenting with SOB. Chest image shows no ground glass opacities or evidence of pneumonia. 


In [11]:
type(doc)

spacy.tokens.doc.Doc

# 2. Tokenization and sentence splitting
So, what actually happens when we create a `Doc` object?

## Tokenization
A `Token` is a single word, symbol, or whitespace in a `doc`. When we create a `doc` object, the text is broken up into individual tokens. This is called **"tokenization"**, and it is the first step in the processing pipeline.

We can access tokens by iterating through the doc or indexing.

In [12]:
token = doc[2]
print(token)

history


In [13]:
for token in doc:
    print(token)

Patient
with
history
of
Covid
-
19
diagnosis
;
presenting
with
SOB
.
Chest
image
shows
no
ground
glass
opacities
or
evidence
of
pneumonia
.
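
The token list above can be approximated with a single regular expression. This is a rough sketch, not spaCy's tokenizer, which uses language-specific rules and exceptions and also tracks whitespace; the pattern below is an assumption chosen to reproduce the splits seen in this example, such as `Covid-19` becoming three tokens.

```python
import re

text = ("Patient with history of Covid-19 diagnosis; presenting with SOB. "
        "Chest image shows no ground glass opacities or evidence of pneumonia. ")

# A run of word characters is one token; any other non-space character
# (hyphen, semicolon, period) is a token on its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

toy_tokens = TOKEN_PATTERN.findall(text)
print(toy_tokens[2])    # same position as doc[2] in the cell above
print(toy_tokens[4:7])  # the pieces of "Covid-19"
```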


## Sentence splitting
While tokenization splits the text up into atomic tokens, we will also need to work with longer units of text. In spaCy, these are called `Span`s. One example of a span is a sentence. After tokenizing, our NLP pipeline next splits the document up into separate sentences.

In [14]:
for sent in doc.sents:
    print(sent)

Patient with history of Covid-19 diagnosis; presenting with SOB.
Chest image shows no ground glass opacities or evidence of pneumonia.


In [15]:
type(sent)

spacy.tokens.span.Span

Each sentence is made up of individual tokens, just like the doc. Here `sent` still refers to the last sentence from the loop above:

In [16]:
sent[0]

Chest
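
Conceptually, a span is just a start and end position over the document's tokens. The sketch below illustrates that idea with plain lists; it is a simplification, not spaCy's `Span` class or the sentence splitter used by medspaCy, and `sentence_spans` is a hypothetical helper that treats every `.`, `!`, or `?` token as a sentence boundary.

```python
import re

text = ("Patient with history of Covid-19 diagnosis; presenting with SOB. "
        "Chest image shows no ground glass opacities or evidence of pneumonia. ")
toy_tokens = re.findall(r"\w+|[^\w\s]", text)


def sentence_spans(tokens):
    """Return (start, end) token offsets, end exclusive, for each sentence."""
    spans, start = [], 0
    for i, token in enumerate(tokens):
        if token in {".", "!", "?"}:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(tokens):  # trailing text without final punctuation
        spans.append((start, len(tokens)))
    return spans


spans = sentence_spans(toy_tokens)
for start, end in spans:
    # Slicing the token list with a span's offsets recovers the sentence's tokens
    print((start, end), toy_tokens[start:end])
```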

# 3. Entity extraction
A common goal in clinical NLP is **entity extraction**, where we identify specific mentions of a concept in a note. Examples of concepts we might be interested in include diseases, signs/symptoms, or treatments.

In our example document, a few concepts we might want to extract could be:

- Clinical diagnoses such as **pneumonia** and **Covid-19**
- Symptoms like **shortness of breath (SOB)**

Common approaches to entity extraction in clinical NLP include machine learning NER and rule-based extraction. We'll focus here on rule-based extraction, although later on in the tutorial we'll see examples of combining rule-based extraction with deep learning models.

In medspaCy, we can specify concepts to extract using the `TargetMatcher` component. We access pipeline components by calling `nlp.get_pipe()` and passing in the name of the component:

In [17]:
target_matcher = nlp.get_pipe("medspacy_target_matcher")
target_matcher

<medspacy.target_matcher.target_matcher.TargetMatcher at 0x7fec28f0b1c0>

We define concepts to extract using the `TargetRule` class. Each target rule takes the following arguments:
- `literal`: The exact phrase we want to extract
- `category`: The semantic category of the concept
- `pattern` (optional): A flexible pattern for matching different strings of text. This can be either a [regular expression](https://www.w3schools.com/python/python_regex.asp) or a [spaCy rule-based matching pattern](https://spacy.io/usage/rule-based-matching)

We'll write the following rules to extract concepts in our doc. (Note that all of the rules below are case insensitive.)

In [18]:
from medspacy.target_matcher import TargetRule

In [19]:
rules = [
    # These will match "pneumonia" and "SOB" exactly
    TargetRule("pneumonia", "PNEUMONIA"),
    TargetRule("SOB", "SYMPTOM"),

    # This uses a regular expression to match "Covid-19" or "covid19"
    # (a raw string keeps the backslash from being read as a string escape)
    TargetRule("Covid-19", "COVID", pattern=r"covid\-?19"),

    # Match different forms of "opacity", with "ground" or "glass" as optional modifiers
    TargetRule("opacity", "PNEUMONIA",
               pattern=[
                   {"LOWER": "ground", "OP": "?"},
                   {"LOWER": "glass", "OP": "?"},
                   {"LOWER": {"IN": ["opacity", "opacities"]}},
               ]),
]
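
To see what the regular expression in the Covid-19 rule accepts, it can be tried on its own with Python's `re` module. This sketch assumes case-insensitive matching, in line with the note above that the rules are case insensitive; how medspaCy compiles the pattern internally is not shown here.

```python
import re

# re.IGNORECASE is an assumption made for this standalone check
covid_pattern = re.compile(r"covid\-?19", re.IGNORECASE)

candidates = ["Covid-19", "covid19", "COVID-19", "Covid 19", "covid"]
matched = {c: bool(covid_pattern.fullmatch(c)) for c in candidates}

for candidate, is_match in matched.items():
    print(candidate, is_match)
```

The `?` makes the hyphen optional, so both spellings match, while the variant with a space and the bare word `covid` do not.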

We now add these rules to the pipeline:

In [20]:
target_matcher.add(rules)

In [21]:
target_matcher.rules

[TargetRule(literal="pneumonia", category="PNEUMONIA", pattern=None, attributes=None, on_match=None),
 TargetRule(literal="SOB", category="SYMPTOM", pattern=None, attributes=None, on_match=None),
 TargetRule(literal="Covid-19", category="COVID", pattern=covid\-?19, attributes=None, on_match=None),
 TargetRule(literal="opacity", category="PNEUMONIA", pattern=[{'LOWER': 'ground', 'OP': '?'}, {'LOWER': 'glass', 'OP': '?'}, {'LOWER': {'IN': ['opacity', 'opacities']}}], attributes=None, on_match=None)]

Next, we'll reprocess the document so that the new rules are applied.

In [22]:
doc = nlp(text)

Extracted concepts are stored under `doc.ents`. Each entity is a spaCy `Span`. The semantic category is stored under `ent.label_`.

In [23]:
for ent in doc.ents:
    print(ent, ent.label_)

Covid-19 COVID
SOB SYMPTOM
ground glass opacities PNEUMONIA
pneumonia PNEUMONIA
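
The three-token entity `ground glass opacities` comes from the token pattern in the last rule, where `"OP": "?"` marks a token as optional. The toy matcher below re-implements just enough of that behavior to show it. It is not spaCy's `Matcher`: it only understands the `LOWER`, `IN`, and `"OP": "?"` keys used in this notebook, and the function names are hypothetical.

```python
def token_fits(token, spec):
    """Check one token against one pattern entry (LOWER only)."""
    wanted = spec["LOWER"]
    if isinstance(wanted, dict):  # the {"IN": [...]} form
        return token.lower() in wanted["IN"]
    return token.lower() == wanted


def match_end(tokens, start, pattern, step=0):
    """Return the end index of a match beginning at `start`, or None if there is none."""
    if step == len(pattern):
        return start
    spec = pattern[step]
    end = None
    if start < len(tokens) and token_fits(tokens[start], spec):
        end = match_end(tokens, start + 1, pattern, step + 1)  # consume the token
    if end is None and spec.get("OP") == "?":
        end = match_end(tokens, start, pattern, step + 1)      # skip the optional entry
    return end


def find_matches(tokens, pattern):
    matches, i = [], 0
    while i < len(tokens):
        end = match_end(tokens, i, pattern)
        if end is not None and end > i:
            matches.append(" ".join(tokens[i:end]))
            i = end
        else:
            i += 1
    return matches


opacity_pattern = [
    {"LOWER": "ground", "OP": "?"},
    {"LOWER": "glass", "OP": "?"},
    {"LOWER": {"IN": ["opacity", "opacities"]}},
]

found_full = find_matches("Chest image shows no ground glass opacities".split(), opacity_pattern)
found_bare = find_matches("Patchy opacity in the left lower lobe".split(), opacity_pattern)
found_none = find_matches("Ground glass appearance".split(), opacity_pattern)
print(found_full, found_bare, found_none)
```

Because the final entry has no `OP`, a phrase must end in "opacity" or "opacities" to match; "ground" and "glass" alone are never enough.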


We can also visualize the entire document using medspaCy's visualization function, which will highlight the entities:

In [24]:
from medspacy.visualization import visualize_ent

In [25]:
visualize_ent(doc, context=False)

# 4. Attribute detection
In the examples above, we can see that medspaCy correctly extracted mentions of clinical concepts. However, just because a concept is mentioned in a note doesn't mean that the patient actually has the condition. For example, the report specifically states that the chest image shows no **"ground glass opacities"** or evidence of **"pneumonia"**. To avoid false positives, we need to identify *attributes* such as negation or temporality.

In medspaCy we do this using the `ConText` component and the `Sectionizer`.

## ConText
The [ConText algorithm](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2757457/) connects entities to nearby linguistic modifiers. We can visualize this using the `visualize_dep` function, which draws arrows from linguistic modifiers to entities.

In [26]:
from medspacy.visualization import visualize_dep

In [27]:
visualize_dep(doc)

In the example above, we can see that **"Covid-19"** is correctly modified by a **"historical"** modifier, and **"ground glass opacities"** and **"pneumonia"** are attached to the negation phrase. However, we can also see that **"SOB"** is incorrectly modified by the historical marker. To fix this, we can customize the context algorithm using `ConTextRule`s. We'll add **"presenting with"** as a *current* modifier and add a *terminator* to block the historical modifier from attaching to **"SOB"**. (See the [medspaCy notebooks](https://github.com/medspacy/medspacy/tree/master/notebooks) for more information on how to customize ConText.)

In [28]:
from medspacy.context import ConTextRule

In [29]:
context_rules = [
    ConTextRule("presenting with", "CURRENT", direction="FORWARD"),
    ConTextRule(";", "TERMINATE", direction="TERMINATE"),
]

In [30]:
context = nlp.get_pipe("medspacy_context")
context.add(context_rules)

In [31]:
doc = nlp(text)

In [32]:
visualize_dep(doc)

In addition to visualizing the algorithm, medspaCy sets attributes like `is_negated` and `is_historical` on each entity.

In [33]:
for ent in doc.ents:
    print(ent)
    print("Negated:", ent._.is_negated)
    print("Historical:", ent._.is_historical)
    print("Uncertain:", ent._.is_uncertain)
    print()

Covid-19
Negated: False
Historical: True
Uncertain: False

SOB
Negated: False
Historical: False
Uncertain: False

ground glass opacities
Negated: True
Historical: False
Uncertain: False

pneumonia
Negated: True
Historical: False
Uncertain: False
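
The effect of the `TERMINATE` rule can be illustrated with a toy version of a forward-looking modifier's scope. This is a deliberately simplified sketch, not the ConText implementation in medspaCy: it assumes that the historical modifier is the phrase `history of`, that a forward modifier reaches every token up to the end of the sentence, and that a terminator cuts that reach short. `forward_scope` is a hypothetical helper.

```python
import re

sentence = "Patient with history of Covid-19 diagnosis; presenting with SOB."
tokens = re.findall(r"\w+|[^\w\s]", sentence)


def forward_scope(tokens, modifier_end, terminators=()):
    """Token range [start, end) that a forward modifier ending at `modifier_end` can reach."""
    end = len(tokens)
    for i in range(modifier_end, len(tokens)):
        if tokens[i] in terminators:
            end = i  # the scope stops just before the terminator
            break
    return modifier_end, end


modifier_end = tokens.index("of") + 1  # "history of" ends after the token "of"
sob_index = tokens.index("SOB")
covid_index = tokens.index("Covid")

for terminators in [(), (";",)]:
    start, end = forward_scope(tokens, modifier_end, terminators)
    print("terminators:", terminators)
    print("  scope:", tokens[start:end])
    print("  SOB in scope:", start <= sob_index < end)
```

Without a terminator the scope runs to the end of the sentence and takes in "SOB"; with `;` as a terminator it stops after "diagnosis", so only "Covid-19" remains inside it.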



## Section detection
In addition to linguistic modifiers, the structure of a note carries information about a patient's disease status. medspaCy offers the `Sectionizer` component for identifying different sections, like the **past medical history**, in a note. The sectionizer isn't loaded with the default pipeline, so we'll need to add it using `nlp.add_pipe()`:

In [34]:
sectionizer = nlp.add_pipe("medspacy_sectionizer")

In [35]:
text2 = """
Past Medical History: Covid-19

HPI: Patient presents with SOB.

Medical decision making: Order chest xray to rule out pneumonia.
"""

In [36]:
doc2 = nlp(text2)

When we visualize the document, we can see section headers are highlighted in gray.

In [37]:
visualize_ent(doc2)

Section headers vary between different types of notes and institutions, so this component will typically need customization. Right now, the section header `medical decision making` wasn't extracted, so we'll add a `SectionRule` to match it:

In [38]:
from medspacy.section_detection import SectionRule
section_rule = SectionRule("medical decision making:", "medical_decision_making")
sectionizer.add(section_rule)

In [39]:
doc2 = nlp(text2)
visualize_ent(doc2)

We can see which section an entity occurred in with the `section_category` attribute:

In [40]:
for ent in doc2.ents:
    print(ent)
    print(ent._.section_category)
    print()

Covid-19
past_medical_history

SOB
history_of_present_illness

pneumonia
medical_decision_making
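
The idea behind section detection can be sketched with a regular expression over known header strings. This is an illustration only, not the `Sectionizer` implementation: the rules are written as a plain dict, `split_sections` is a hypothetical helper, and each section body is assumed to run from the end of its header to the start of the next one.

```python
import re

text2 = """
Past Medical History: Covid-19

HPI: Patient presents with SOB.

Medical decision making: Order chest xray to rule out pneumonia.
"""

# Hypothetical header-to-category rules, keyed by lowercased header text
toy_section_rules = {
    "past medical history:": "past_medical_history",
    "hpi:": "history_of_present_illness",
    "medical decision making:": "medical_decision_making",
}


def split_sections(text, rules):
    """Return (category, title_start, body) for each header found in `text`."""
    header_pattern = re.compile("|".join(re.escape(h) for h in rules), re.IGNORECASE)
    headers = list(header_pattern.finditer(text))
    sections = []
    for i, header in enumerate(headers):
        # A section body runs from the end of its header to the next header
        body_end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        category = rules[header.group().lower()]
        sections.append((category, header.start(), text[header.end():body_end].strip()))
    return sections


toy_sections = split_sections(text2, toy_section_rules)
for category, title_start, body in toy_sections:
    print(category, title_start, repr(body))
```

An entity's section is then simply the section whose character range contains the entity's start offset.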



# 5. Output
Finally, medspaCy offers functionality for writing the output of NLP systems to databases or flat files. We can use the `DocConsumer` component and the `pandas` library to create DataFrames from our second example.

In [41]:
doc_consumer = nlp.add_pipe("medspacy_doc_consumer", config={"dtypes": ["ents", "context", "section"]})

In [42]:
doc2 = nlp(text2)

In [43]:
!pip install pandas



In [44]:
doc2._.to_dataframe("ents")

| | text | start_char | end_char | label_ | is_negated | is_uncertain | is_historical | is_hypothetical | is_family | section_category | section_parent |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Covid-19 | 23 | 31 | COVID | False | False | True | False | False | past_medical_history | |
| 1 | SOB | 60 | 63 | SYMPTOM | False | False | False | False | False | history_of_present_illness | |
| 2 | pneumonia | 120 | 129 | PNEUMONIA | False | True | False | False | False | medical_decision_making | |


In [45]:
doc2._.to_dataframe("context")

| | ent_text | ent_label_ | ent_start_char | ent_end_char | modifier_text | modifier_category | modifier_direction | modifier_start_char | modifier_end_char | modifier_scope_start_char | modifier_scope_end_char |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Covid-19 | COVID | 23 | 31 | Past Medical History | HISTORICAL | FORWARD | 1 | 21 | 21 | 33 |
| 1 | pneumonia | PNEUMONIA | 120 | 129 | rule out | POSSIBLE_EXISTENCE | FORWARD | 111 | 119 | 120 | 131 |


In [46]:
doc2._.to_dataframe("section")

| | section_category | section_title_text | section_title_start_char | section_title_end_char | section_body | section_body_start_char | section_body_end_char | section_parent |
|---|---|---|---|---|---|---|---|---|
| 0 | | | 0 | 0 | `\n` | 0 | 1 | |
| 1 | past_medical_history | Past Medical History: | 1 | 22 | `Covid-19\n\n` | 23 | 33 | |
| 2 | history_of_present_illness | HPI: | 33 | 37 | `Patient presents with SOB.\n\n` | 38 | 66 | |
| 3 | medical_decision_making | Medical decision making: | 66 | 90 | `Order chest xray to rule out pneumonia.\n` | 91 | 131 | |
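
Once the entities are in tabular form, downstream steps are ordinary data handling. As a sketch of the flat-file route using only the standard library, the snippet below writes a few columns of the entity table shown above to CSV text and then keeps only mentions with no negated, uncertain, or historical flag. The rows are copied from the `ents` output; the column subset and the `is_affirmed` helper are choices made for this illustration, not part of medspaCy.

```python
import csv
import io

columns = ["text", "label_", "is_negated", "is_uncertain", "is_historical", "section_category"]
rows = [
    ["Covid-19", "COVID", False, False, True, "past_medical_history"],
    ["SOB", "SYMPTOM", False, False, False, "history_of_present_illness"],
    ["pneumonia", "PNEUMONIA", False, True, False, "medical_decision_making"],
]

# Write to an in-memory buffer; a real file opened with newline="" works the same way
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerows(rows)
csv_text = buffer.getvalue()


def is_affirmed(record):
    """True when no flag is set (values read back from CSV are strings)."""
    flags = ("is_negated", "is_uncertain", "is_historical")
    return all(record[flag] == "False" for flag in flags)


affirmed = [r["text"] for r in csv.DictReader(io.StringIO(csv_text)) if is_affirmed(r)]
print(affirmed)
```

Only "SOB" survives the filter: "Covid-19" is historical and "pneumonia" is uncertain, which is the distinction the attribute detection steps were added to capture.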


# Next steps
Here are some additional training materials for learning to use medspaCy.

- [medspaCy documentation notebooks](https://github.com/medspacy/medspacy/tree/master/notebooks), including detailed examples of each component
- [Melbourne COMP-90089](https://github.com/abchapman93/Melbourne_COMP90089_NLP): A short medspaCy NLP module offered by the University of Melbourne, including YouTube videos and Colab notebooks
- [University of Utah PHS Data Science Workshop](https://github.com/abchapman93/PHS_Data_Science_2022): A week-long workshop offered by the University of Utah Department of Population Health Sciences