In [1]:
import sys
sys.path.insert(0, "..")

# Overview
In these notebooks, we'll process an example clinical document with medSpaCy. First, we'll perform preprocessing and sentence segmentation. Next, we'll extract entities using rules and assert attributes such as negation and the section in which each entity occurred. We'll then put all of our pieces together to process the entire document. Finally, we'll look at an alternative pipeline that uses a pre-trained statistical model, rather than rules, to extract target entities.

In this first notebook, we'll introduce the medSpaCy library and show how to load a medSpaCy pipeline. In the following notebooks, we'll walk through each of the pipeline steps in more detail and apply a fully built pipeline to clinical text.

These notebooks will give a high-level overview of each component, but the individual packages will typically contain more complete examples and documentation. 

**Disclaimer**: many of the subpackages are in beta, just like medSpaCy!

# Notebooks

## High-Level Notebooks
The notebooks in this root directory will show an overview of medspaCy, how to load a basic pipeline, and the basics of how to use each component. Notebooks #1-3 will show how to use the components loaded in a default medSpaCy model. We'll then show how to add additional medSpaCy components such as section detection and pre/postprocessing. Then we'll show how a full medSpaCy pipeline processes example clinical text, first using custom rules and then using a pre-trained NER model.


## Detailed Component Notebooks
More detailed notebooks are provided for two of the components: `context` and `section_detection`. These will show more advanced functionality and detailed examples:
- `./context/`
- `./section_detection/`

# Loading a medSpaCy model
A medSpaCy model consists of a **base spaCy model** with **medSpaCy components added** to the pipeline. There are two primary ways that we can create a medSpaCy model:

1. Load a full pipeline using `medspacy.load()`
2. Add specific components to an existing model

## 1. Load a full medSpaCy pipeline
We can load a complete pipeline using the `medspacy.load()` function. By default, this will build off of spaCy's **en_core_web_sm** model and will include:
- `Tokenizer`: A spaCy tokenizer with custom rules for handling clinical text
- `Sentencizer`: A sentence splitter based on [PyRuSH](https://github.com/jianlins/PyRuSH)
- `TargetMatcher`: Extended rule-based matching
- `ConText`: Contextual analysis and attribute detection

In [2]:
import medspacy

In [3]:
nlp = medspacy.load()

In [4]:
nlp.pipe_names

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context']
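To see what the default pipeline does with text, the sketch below runs it on a short, made-up two-sentence note and lists the sentences it finds. The example text and the `sentences` variable are illustrative assumptions. The `try`/`except` guard is only there so the snippet can be pasted into an environment where medSpaCy or its base model is missing.

```python
# Sketch: run the default pipeline on a short, made-up note and list its sentences.
try:
    import medspacy
    nlp = medspacy.load()
except (ImportError, OSError):
    # medSpaCy (or the base spaCy model it needs) is not installed
    nlp = None

sentences = None
if nlp is not None:
    doc = nlp("Patient denies chest pain. She has a history of type 2 diabetes.")
    # Sentence boundaries come from the `medspacy_pyrush` component
    sentences = [sent.text for sent in doc.sents]
    print(sentences)
    # No target rules have been added yet, so no entities are expected here
    print(doc.ents)
```

Entities only appear once rules are added to the target matcher or a statistical NER model is included, which later notebooks cover.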

You can also load from an existing model to add medSpaCy pipeline components to your current pipeline. To do this, pass in either the model itself or the name of the model, along with any components from the original model that you want to enable or disable. For example, the cells below load spaCy's `"en_core_web_sm"` model and disable the `"ner"` component, so that we keep the POS tagger and dependency parser (which can be useful but may not work well with clinical text):

In [5]:
import spacy
nlp2 = spacy.load("en_core_web_sm", disable={"ner"})
nlp2 = medspacy.load(nlp2)
nlp2.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'medspacy_pyrush',
 'medspacy_target_matcher',
 'medspacy_context']

In [6]:
nlp2 = medspacy.load("en_core_web_sm", disable={"ner"})
nlp2.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'medspacy_pyrush',
 'medspacy_target_matcher',
 'medspacy_context']

### Default rules
When available, components added by `medspacy.load()` include default rules. `ConText` and `Sectionizer` both ship with default rules:

In [8]:
context = nlp.get_pipe("medspacy_context")

In [9]:
context.rules[:10]

[ConTextRule(literal='absence of', category='NEGATED_EXISTENCE', pattern=None, direction='FORWARD'),
 ConTextRule(literal='adequate to rule out', category='NEGATED_EXISTENCE', pattern=[{'LOWER': {'IN': ['adequate', 'sufficient']}}, {'LOWER': 'to'}, {'LOWER': 'rule'}, {'LOWER': {'IN': ['him', 'her', 'them', 'patient', 'pt']}, 'OP': '?'}, {'LOWER': 'out'}, {'LOWER': {'IN': ['against', 'for']}, 'OP': '?'}], direction='FORWARD'),
 ConTextRule(literal='adequate to rule the patient out', category='NEGATED_EXISTENCE', pattern=[{'LOWER': {'IN': ['adequate', 'sufficient']}}, {'LOWER': 'to'}, {'LOWER': 'rule'}, {'LOWER': 'the'}, {'LOWER': {'IN': ['patient', 'pt']}}, {'LOWER': 'out'}, {'LOWER': {'IN': ['against', 'for']}, 'OP': '?'}], direction='FORWARD'),
 ConTextRule(literal='any other', category='NEGATED_EXISTENCE', pattern=None, direction='FORWARD'),
 ConTextRule(literal='apart from', category='NEGATED_EXISTENCE', pattern=[{'LOWER': 'apart'}, {'LOWER': {'IN': ['for', 'from']}}], direction='TE

You can also set `load_rules` to `False` so that the components are all blank (other than PyRuSH, which requires rules to be instantiated).
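As a rough illustration of the `load_rules` option, the sketch below loads a pipeline with `load_rules=False` and counts the rules on the ConText component. It uses only calls that appear earlier in this notebook (`medspacy.load`, `get_pipe`, and the `rules` attribute). The variable names are assumptions, and the guard around the import is only there so the snippet degrades gracefully when medSpaCy is not installed.

```python
# Sketch: load a pipeline without default rules and inspect the ConText component.
try:
    import medspacy
    nlp_blank = medspacy.load(load_rules=False)
except (ImportError, OSError):
    # medSpaCy (or the base spaCy model it needs) is not installed
    nlp_blank = None

n_rules = None
if nlp_blank is not None:
    context_blank = nlp_blank.get_pipe("medspacy_context")
    # With load_rules=False, this count should reflect a component without default rules
    n_rules = len(context_blank.rules)
    print(n_rules)
```

Starting from an empty component like this can be useful when you want full control over which modifiers are recognized.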

### Using specific models
If you have other models installed, either in English or in other languages, you can load one of them by passing its name as the `model` argument. For example, to load a [sciSpaCy model](https://allenai.github.io/scispacy/) and use it with medSpaCy, first download the model:

```bash
pip install scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
```
and then load it with medSpaCy:

```python
nlp = medspacy.load("en_core_sci_sm", load_rules=False, disable=["target_matcher"])
```

### Specifying components
You can choose which components to include or exclude through the `enable` and `disable` arguments:

In [10]:
# Only load the default components
nlp_default = medspacy.load()
nlp_default.pipe_names

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context']

In [11]:
# All medspaCy components
nlp_full = medspacy.load(enable="all")
print(nlp_full.pipe_names)
print(nlp_full.tokenizer)

['medspacy_target_matcher', 'medspacy_context', 'medspacy_sectionizer', 'medspacy_postprocessor', 'medspacy_doc_consumer']
<medspacy.preprocess.preprocessor.Preprocessor object at 0x7faed5a2d070>


In [13]:
# Only load the target_matcher and custom tokenizer
nlp_matcher_only = medspacy.load(enable=["tokenizer", "target_matcher"])
nlp_matcher_only.pipe_names

['medspacy_target_matcher']

In [14]:
# Disable the custom tokenizer and context
nlp_no_tok_context = medspacy.load(disable=["tokenizer", "context"])
nlp_no_tok_context.pipe_names

['medspacy_pyrush', 'medspacy_target_matcher']

In [15]:
assert nlp_no_tok_context.tokenizer != nlp.tokenizer

## 2. Add specific components to an existing model
You can also import specific classes from medSpaCy, instantiate them yourself, and add them to an existing model. We'll show more examples of how to do this in future notebooks.

In [16]:
import spacy

In [17]:
en = spacy.load("en_core_web_sm")

In [18]:
from medspacy.context import ConTextComponent

In [19]:
en.add_pipe("medspacy_context")

<medspacy.context.context_component.ConTextComponent at 0x7faeb6d9d220>

In [20]:
en.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'ner',
 'attribute_ruler',
 'lemmatizer',
 'medspacy_context']

# Demo Data
For data, we will use this example text derived from the [MIMIC-II](https://mimic.physionet.org/) critical care dataset:

In [21]:
with open("./discharge_summary.txt") as f:
    text = f.read()

In [22]:
print(text[:500])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging sh
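
The bracketed spans in this text, such as `[**2573-5-30**]`, are de-identification placeholders that stand in for dates and names. As a minimal sketch using only Python's standard library (the `PLACEHOLDER` pattern and the `sample` string are illustrative and not part of medSpaCy), they can be located with a regular expression:

```python
import re

# Matches bracketed de-identification placeholders such as [**2573-5-30**]
PLACEHOLDER = re.compile(r"\[\*\*(.*?)\*\*\]")

# A few lines copied from the printed excerpt above
sample = (
    "Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]\n"
    "Attending:[**First Name3 (LF) 1893**]\n"
)

# findall returns the text captured between the [** and **] delimiters
placeholders = PLACEHOLDER.findall(sample)
print(placeholders)
# ['2573-5-30', '2573-7-1', 'First Name3 (LF) 1893']
```

Knowing where these placeholders occur can help when deciding how to preprocess or tokenize the text.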
