# Outline

- Background: Objectives, Projects, uses, publications
- Features for common clinical NLP tasks: ConText, section detection, rule-based pattern matching, UMLS linking, sentence splitting
- Customizable: You can write your own rules for all of the rule-based components, customize for your dataset/research problem
- Flexible: This is more about spaCy generally, but with spaCy you can drop in various trained models, use PyTorch or TF or HuggingFace, etc.
- Next Steps: spaCy v3, documentation, trained models/pipelines, optimization, support for other languages …

# Background
![MedspaCy logo](https://github.com/medspacy/medspacy/blob/master/images/medspacy_logo.png?raw=true)

MedSpaCy is a library of tools for performing clinical NLP and text processing tasks with the popular [spaCy](https://spacy.io) 
framework. The `medspacy` package brings together a number of other packages, each of which implements 
functionality for text processing specific to the clinical domain, such as sentence segmentation, 
contextual analysis and attribute assertion, and section detection.

# Getting started with medspaCy

In [1]:
import medspacy
nlp = medspacy.load()

In [2]:
nlp.pipe_names

['sentencizer', 'target_matcher', 'context']

In [3]:
url = "https://raw.githubusercontent.com/medspacy/medspacy/master/notebooks/discharge_summary.txt"

In [4]:
import urllib.request

with urllib.request.urlopen(url) as f:
    text = f.read().decode()

In [5]:
print(text[:500])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging sh


In [6]:
doc = nlp(text)

# Common NLP Tasks
MedspaCy is built as a modular set of **pipeline components**, each of which handles a specific NLP task. Because of spaCy's flexible framework, you can easily add new components, including trained statistical models.

## Rule-Based Matching
In this step, we'll write rules to extract the main concepts we're interested in.

In this example, we'll use two utilities provided in `medspacy.ner` for rule-based matching: the `TargetMatcher` and `TargetRule`. These expand on spaCy's native [rule-based matching](https://spacy.io/usage/rule-based-matching) and add some additional functionality.

In [7]:
from medspacy.ner import TargetMatcher, TargetRule

In [8]:
target_matcher = nlp.get_pipe("target_matcher")

We define a rule for extracting entities with the `TargetRule` class. The main arguments for `TargetRule` are:
- `literal`: An exact phrase to match in the text
- `category`: The semantic class of the entity (i.e., `ent.label_`)
- `pattern`: An optional pattern to match rather than `literal`. As we'll see below, this can be either a list of dictionaries defining token attributes or a regular expression string

In [9]:
target_rules1 = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("metastasis", "PROBLEM"),
    
]

In [10]:
target_matcher.add(target_rules1)

In [11]:
doc = nlp(text)

In [12]:
for ent in doc.ents:
    print(ent, ent.label_)

Hydrochlorothiazide TREATMENT
Abdominal pain PROBLEM
stroke PROBLEM
abdominal pain PROBLEM
metastasis PROBLEM
Colon cancer PROBLEM
hemicolectomy TREATMENT
stroke PROBLEM
abdominal pain PROBLEM
abdominal pain PROBLEM
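
The output above includes `Abdominal pain` and `Colon cancer` even though the rules were written in lower case, so literals appear to be matched without regard to case. The sketch below imitates that behavior using only Python's `re` module. It is an illustration, not how `TargetMatcher` is implemented, and `toy_literal_match` is a hypothetical helper introduced here.

```python
import re


def toy_literal_match(text, literal, category):
    """Return (matched_text, category) pairs for case-insensitive phrase matches."""
    # Word boundaries keep the literal from matching inside a longer word.
    pattern = re.compile(r"\b" + re.escape(literal) + r"\b", flags=re.IGNORECASE)
    return [(m.group(0), category) for m in pattern.finditer(text)]


snippet = (
    "Chief Complaint:\nAbdominal pain\n\n"
    "74y female who presents with 2 days of abdominal pain."
)
literal_matches = toy_literal_match(snippet, "abdominal pain", "PROBLEM")
print(literal_matches)
```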


In [13]:
from medspacy.visualization import visualize_ent

In [14]:
visualize_ent(doc)

## Advanced Pattern Matching
SpaCy has powerful pattern matching which allows you to match on a list of dictionaries which define attributes for each token. See https://spacy.io/usage/rule-based-matching for spaCy's documentation and examples. Additionally, medspaCy allows matching with regular expressions on the underlying text of the doc.

Let's see some examples of `TargetRule`s which utilize pattern matching. The first rule uses token patterns to match a token whose lower-case text is either `acetaminophen` or `tylenol`, optionally followed by a number and the unit `mg`. The second uses regular expressions to match various forms of Type 1/Type 2 diabetes.

In [15]:
pattern_rules = [
    # Using spaCy's dictionary token patterns
    TargetRule("Acetaminophen", "TREATMENT",
               pattern=[
                   {"LOWER": {"IN": ["acetaminophen", "tylenol"]}},
                   {"LIKE_NUM": True, "OP": "?"},
                   {"LOWER": "mg", "OP": "?"}
               ],
              ),
    
    # Using regular expressions
    TargetRule("diabetes", "PROBLEM",
              pattern=r"type (i|ii|1|2|one|two) (dm|diabetes mellitus)"),
]

In [16]:
target_matcher.add(pattern_rules)



In [17]:
sm_text = """
    Discharge Medications: Acetaminophen 160 mg
    Prescribed tylenol for the pain
    74y female with type 2 dm and a recent stroke.
    Diagnoses: Type II Diabetes Mellitus
"""

In [18]:
visualize_ent(nlp(sm_text))
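
To check which character spans the regular expression in the second rule covers, it can be run directly against the same sample text with Python's `re` module. This sketch assumes matching is case-insensitive (the `re.IGNORECASE` flag below is that assumption). It only shows the raw regex matches, and leaves out how medspaCy turns them into entities.

```python
import re

diabetes_regex = re.compile(
    r"type (i|ii|1|2|one|two) (dm|diabetes mellitus)",
    flags=re.IGNORECASE,  # assumption: case is ignored when matching
)

sample = """
    Discharge Medications: Acetaminophen 160 mg
    Prescribed tylenol for the pain
    74y female with type 2 dm and a recent stroke.
    Diagnoses: Type II Diabetes Mellitus
"""

# Collect (start, end, text) for every match in the sample.
diabetes_spans = [
    (m.start(), m.end(), m.group(0)) for m in diabetes_regex.finditer(sample)
]
for start, end, matched in diabetes_spans:
    print(start, end, repr(matched))
```

Both the abbreviated form (`type 2 dm`) and the spelled-out form (`Type II Diabetes Mellitus`) are found by the single pattern.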

## Statistical NER
In this section, we'll show how to use a pretrained model for target concept extraction instead of defining rules. We'll then add our rule-based components to show how medspaCy can combine statistical NLP with rule-based components.

As an example, we'll download a model pretrained on clinical data. It was trained with data from the i2b2 2012 shared task: [**"Evaluating temporal relations in clinical text"**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756273/). Specifically, it was trained for the first subtask, referred to in the challenge as **"Clinically relevant events"**, which covers the following **clinical concepts**:
- **Problems:** Diagnoses, signs, and symptoms
- **Tests:** Lab and vital measurements
- **Treatments:** Medications, procedures, and therapies

We can install this model with `pip` using this GitHub link:
```bash
pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
```

In [19]:
!pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz

Collecting https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
  Using cached https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz (12.3 MB)
Building wheels for collected packages: en-info-3700-i2b2-2012
  Building wheel for en-info-3700-i2b2-2012 (setup.py) ... done
  Created wheel for en-info-3700-i2b2-2012: filename=en_info_3700_i2b2_2012-0.1.0-py3-none-any.whl size=12270783 sha256=d6f83517631ae53285d0f7e82a47bcd0e1de5f12f4758d8ca44c11211250ed44
  Stored in directory: /Users/alecchapman/Library/Caches/pip/wheels/78/e8/22/863c5e1287f38607d2177f47f31cba9686310ab519d46ba4d9
Successfully built en-info-3700-i2b2-2012


In [21]:
nlp = medspacy.load("en_info_3700_i2b2_2012")



In [22]:
nlp.pipe_names

['sentencizer', 'tagger', 'parser', 'ner', 'target_matcher', 'context']

In [23]:
doc = nlp(text)

In [25]:
visualize_ent(doc)

In [26]:
target_matcher = nlp.get_pipe("target_matcher")
target_matcher.add(target_rules1)
target_matcher.add(pattern_rules)



In [27]:
doc = nlp(text)

In [28]:
print(doc.ents)

(Hydrochlorothiazide, Abdominal pain, Invasive Procedure, PICC line, ERCP, sphincterotomy, type 2 dm, a recent stroke, abdominal pain, Imaging, metastasis, Colon cancer, hemicolectomy, XRT, chemo, colonoscopy, CEA, Type II Diabetes Mellitus, Hypertension, Married, former tobacco use, alcohol or drug use, stroke, Ultrasound, pancreatic duct dilitation, Miconazole, Heparin Sodium, Porcine, Injection, Acetaminophen 160 mg, Type 2 DM, Pancreatitis, HTN, aspiration respiratory distress, fever, nausea, vomiting, abdominal pain, shortness of breath, abdominal pain, NamePattern1)


## ConText
The [ConText algorithm](https://www.sciencedirect.com/science/article/pii/S1532046409000744) is a popular method for asserting attributes of entities in clinical text such as **negation**, **temporality**, and **experiencer**.

In [29]:
from medspacy.visualization import visualize_dep

In [39]:
doc = nlp("There is no evidence of pneumonia.")

In [40]:
visualize_dep(doc)
visualize_ent(doc)

In [47]:
doc = nlp("Mother with stroke at age 82.")
visualize_dep(doc)
visualize_ent(doc)

In [50]:
ent = doc.ents[0]
print(ent, ent._.is_family)

stroke True


In [49]:
print(ent._.any_context_attributes)
print(ent._.context_attributes)

True
{'is_negated': False, 'is_historical': False, 'is_hypothetical': False, 'is_family': True, 'is_uncertain': False}


In [51]:
from medspacy.context import ConTextRule

In [52]:
context = nlp.get_pipe("context")

In [54]:
context_rule = ConTextRule("diagnosed in <YEAR>", "HISTORICAL",
                           direction="BACKWARD",
                          pattern=[
                             {"LOWER": "diagnosed"},
                             {"LOWER": "in"},
                             {"LOWER": {"REGEX": r"^(19|20)[\d]{2}$"}}
                          ]
                          )

In [56]:
context.add(context_rule)

In [57]:
short_doc = nlp("Colon cancer diagnosed in 2012")

In [59]:
visualize_dep(short_doc)
visualize_ent(short_doc)
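
The `direction` argument decides on which side of a modifier phrase ConText looks for entities to modify: `"FORWARD"` for entities after the phrase, `"BACKWARD"` for entities before it. The following is a toy sketch of that idea in plain Python for a single sentence. It is not medspaCy's implementation (which works on tokens and also handles scope termination), and `ToyRule` and `toy_modify` are hypothetical names used only here.

```python
from dataclasses import dataclass


@dataclass
class ToyRule:
    literal: str    # modifier phrase, lower case
    category: str   # e.g. "NEGATED_EXISTENCE"
    direction: str  # "FORWARD" or "BACKWARD"


def toy_modify(sentence, targets, rules):
    """Map each target phrase to the categories of the rules that modify it."""
    lowered = sentence.lower()
    results = {target: set() for target in targets}
    for rule in rules:
        m_start = lowered.find(rule.literal)
        if m_start == -1:
            continue  # modifier phrase not present in this sentence
        m_end = m_start + len(rule.literal)
        for target in targets:
            t_start = lowered.find(target.lower())
            if t_start == -1:
                continue
            t_end = t_start + len(target)
            # FORWARD: target starts after the modifier ends.
            if rule.direction == "FORWARD" and t_start >= m_end:
                results[target].add(rule.category)
            # BACKWARD: target ends before the modifier starts.
            elif rule.direction == "BACKWARD" and t_end <= m_start:
                results[target].add(rule.category)
    return results


toy_rules = [
    ToyRule("no evidence of", "NEGATED_EXISTENCE", "FORWARD"),
    ToyRule("diagnosed in", "HISTORICAL", "BACKWARD"),
]
negated = toy_modify("There is no evidence of pneumonia.", ["pneumonia"], toy_rules)
historical = toy_modify("Colon cancer diagnosed in 2012", ["Colon cancer"], toy_rules)
print(negated)
print(historical)
```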

In [64]:
context.rules[:5]

[ConTextRule(literal='absence of', category='NEGATED_EXISTENCE', pattern=None, direction='FORWARD'),
 ConTextRule(literal='adequate to rule out', category='NEGATED_EXISTENCE', pattern=[{'LOWER': {'IN': ['adequate', 'sufficient']}}, {'LOWER': 'to'}, {'LOWER': 'rule'}, {'LOWER': {'IN': ['him', 'her', 'them', 'patient', 'pt']}, 'OP': '?'}, {'LOWER': 'out'}, {'LOWER': {'IN': ['against', 'for']}, 'OP': '?'}], direction='FORWARD'),
 ConTextRule(literal='adequate to rule the patient out', category='NEGATED_EXISTENCE', pattern=[{'LOWER': {'IN': ['adequate', 'sufficient']}}, {'LOWER': 'to'}, {'LOWER': 'rule'}, {'LOWER': 'the'}, {'LOWER': {'IN': ['patient', 'pt']}}, {'LOWER': 'out'}, {'LOWER': {'IN': ['against', 'for']}, 'OP': '?'}], direction='FORWARD'),
 ConTextRule(literal='any other', category='NEGATED_EXISTENCE', pattern=None, direction='FORWARD'),
 ConTextRule(literal='apart from', category='NEGATED_EXISTENCE', pattern=[{'LOWER': 'apart'}, {'LOWER': {'IN': ['for', 'from']}}], direction='TE

In [66]:
context.categories

{'FAMILY',
 'HISTORICAL',
 'HYPOTHETICAL',
 'NEGATED_EXISTENCE',
 'POSSIBLE_EXISTENCE'}
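
The five categories line up with the five boolean attributes printed earlier by `ent._.context_attributes`. The sketch below spells out that correspondence. The mapping is an assumption inferred from the names in the two outputs, not copied from medspaCy, and `toy_context_attributes` is a hypothetical helper.

```python
# Assumed correspondence between modifier categories and entity attributes,
# inferred from the names shown in the outputs above.
ASSUMED_ATTRIBUTE_FOR_CATEGORY = {
    "NEGATED_EXISTENCE": "is_negated",
    "HISTORICAL": "is_historical",
    "HYPOTHETICAL": "is_hypothetical",
    "FAMILY": "is_family",
    "POSSIBLE_EXISTENCE": "is_uncertain",
}


def toy_context_attributes(modifier_categories):
    """Build an attribute dict from the categories of an entity's modifiers."""
    return {
        attribute: category in modifier_categories
        for category, attribute in ASSUMED_ATTRIBUTE_FOR_CATEGORY.items()
    }


family_attributes = toy_context_attributes({"FAMILY"})
print(family_attributes)
```

Under this assumption, an entity modified only by a `FAMILY` phrase (such as "Mother with") gets the same attribute values as the `stroke` example above.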

## Section Detection
We are often interested in which section of a clinical note an entity occurs in. This can be useful for setting attributes like temporality (similar to ConText) or for extracting entities from specific sections of the note.

MedSpaCy includes the `Sectionizer` class for identifying sections in a note.
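
As a rough picture of what section detection involves, the toy sketch below finds a few header lines with a regular expression, treats each section as running until the next header, and looks up which section a character offset falls in. This is not how `Sectionizer` is implemented. The header list, `toy_sections`, and `section_of` are hypothetical and cover only three headers from the sample note.

```python
import re

# Hypothetical header-to-category table, for illustration only.
TOY_HEADERS = {
    "allergies": "allergies",
    "chief complaint": "chief_complaint",
    "history of present illness": "history_of_present_illness",
}


def toy_sections(text):
    """Return (category, start, end) character spans, one per detected header."""
    pattern = re.compile(
        r"^(" + "|".join(re.escape(h) for h in TOY_HEADERS) + r"):",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # A section runs from its header to the next header (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((TOY_HEADERS[match.group(1).lower()], match.start(), end))
    return sections


def section_of(offset, sections):
    """Return the category of the section containing a character offset."""
    for category, start, end in sections:
        if start <= offset < end:
            return category
    return None


note = (
    "Allergies:\nHydrochlorothiazide\n\n"
    "Chief Complaint:\nAbdominal pain\n\n"
    "History of Present Illness:\n74y female with type 2 dm and a recent stroke\n"
)
sections = toy_sections(note)
print(section_of(note.index("Hydrochlorothiazide"), sections))
print(section_of(note.index("stroke"), sections))
```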

In [61]:
from medspacy.section_detection import Sectionizer, SectionRule

In [62]:
sectionizer = Sectionizer(nlp)

In [67]:
nlp.add_pipe(sectionizer)

In [68]:
doc = nlp(text)

In [73]:
for section in doc._.sections:
    print(section.title_span, section.category)

 None
Service: other
Allergies: allergies
Chief Complaint: chief_complaint
History of Present Illness: history_of_present_illness
Past Medical History: past_medical_history
Social History: social_history
Family History: family_history
Brief Hospital Course: hospital_course
Discharge Medications: medications
Discharge Diagnosis: observation_and_plan
Discharge Instructions: patient_instructions
Signed electronically by: signature


In [69]:
visualize_ent(doc)

In [77]:
for ent in doc.ents:
    print(ent, "-->", ent._.section_category)

Hydrochlorothiazide --> allergies
Abdominal pain --> chief_complaint
Invasive Procedure --> chief_complaint
PICC line --> chief_complaint
ERCP --> chief_complaint
sphincterotomy --> chief_complaint
type 2 dm --> history_of_present_illness
a recent stroke --> history_of_present_illness
abdominal pain --> history_of_present_illness
Imaging --> history_of_present_illness
metastasis --> history_of_present_illness
Colon cancer --> past_medical_history
hemicolectomy --> past_medical_history
XRT --> past_medical_history
chemo --> past_medical_history
colonoscopy --> past_medical_history
CEA --> past_medical_history
Type II Diabetes Mellitus --> past_medical_history
Hypertension --> past_medical_history
Married --> social_history
former tobacco use --> social_history
alcohol or drug use --> social_history
stroke --> family_history
Ultrasound --> hospital_course
pancreatic duct dilitation --> hospital_course
Miconazole --> medications
Heparin Sodium --> medications
Porcine --> medications
Injec