In [1]:
import sys
sys.path.insert(0, "..")

# Overview
In these notebooks, we'll process an example clinical document with medSpaCy. First, we'll perform preprocessing and sentence segmentation. Next, we'll extract entities using rules and assert attributes such as negation and the section in which each entity occurred. We'll then put all of the pieces together to process the entire document. Finally, we'll look at an alternative pipeline that uses a pre-trained statistical model, rather than rules, to extract target entities.

In this first notebook, we'll introduce the medSpaCy library and show how to load a medSpaCy pipeline. In the following notebooks, we'll walk through each pipeline step in more detail and apply a fully built pipeline to clinical text.

These notebooks will give a high-level overview of each component, but the individual packages will typically contain more complete examples and documentation. 

# Notebooks

## High-Level Notebooks
The notebooks in this root directory will show an overview of medspaCy, how to load a basic pipeline, and the basics of how to use each component. Notebooks #1-3 will show how to use the components loaded in a default medSpaCy model. We'll then show how to add additional medSpaCy components such as section detection and pre/postprocessing. Then we'll show how a full medSpaCy pipeline processes example clinical text, first using custom rules and then using a pre-trained NER model.


## Detailed Component Notebooks
More detailed notebooks are provided for two of the components: `context` and `section_detection`. These will show more advanced functionality and detailed examples:
- `./context/`
- `./section_detection/`

# Loading a medSpaCy model
A medSpaCy model consists of a **base spaCy model** with **medSpaCy components added** to the pipeline. There are two primary ways that we can create a medSpaCy model:

1. Load a full pipeline using `medspacy.load()`
2. Add specific components to an existing model

## 1. Load a full medSpaCy pipeline
We can load a complete pipeline using the `medspacy.load()` function. By default, this builds on `spacy.blank("en")` and includes:
- `medspacy_tokenizer`: A spaCy tokenizer with more aggressive rules for handling clinical text. This will not be visible in the pipeline but is set to `nlp.tokenizer`.
- `medspacy_pyrush`: The clinical sentence splitter [PyRuSH](https://github.com/jianlins/PyRuSH)
- `medspacy_target_matcher`: For extended rule-based matching
- `medspacy_context`: For contextual analysis and attribute detection, an implementation of the ConText algorithm.

Many more components are available but are not activated by default. Most of them are unnecessary for the simple pipelines that `medspacy.load()` is designed to handle.

In [2]:
import medspacy

In [3]:
nlp = medspacy.load()

In [4]:
nlp.pipe_names

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context']

You can also start from an existing model and add medspaCy pipeline components to it. To do this, pass in either the model itself or the name of the model. The examples below use spaCy's `"en_core_web_sm"` model.

Notice that the default medspaCy components appear at the end of the `pipe_names` result.

In [5]:
import spacy
nlp2 = spacy.load("en_core_web_sm")
nlp2 = medspacy.load(nlp2)
nlp2.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'medspacy_pyrush',
 'medspacy_target_matcher',
 'medspacy_context']

In [6]:
nlp2 = medspacy.load("en_core_web_sm")
nlp2.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'medspacy_pyrush',
 'medspacy_target_matcher',
 'medspacy_context']

### Default rules
When available, components added by `medspacy.load()` include default rules. `medspacy_context` and `medspacy_sectionizer` both contain extensive default rules.

In [7]:
context = nlp.get_pipe("medspacy_context")

In [8]:
for rule in context.rules[:10]:
    print()
    print(rule)


ConTextRule(literal='absence of', category='NEGATED_EXISTENCE', pattern=None, direction='FORWARD')

ConTextRule(literal='adequate to rule out', category='NEGATED_EXISTENCE', pattern=[{'LOWER': {'IN': ['adequate', 'sufficient']}}, {'LOWER': 'to'}, {'LOWER': 'rule'}, {'LOWER': {'IN': ['him', 'her', 'them', 'patient', 'pt']}, 'OP': '?'}, {'LOWER': 'out'}, {'LOWER': {'IN': ['against', 'for']}, 'OP': '?'}], direction='FORWARD')

ConTextRule(literal='adequate to rule the patient out', category='NEGATED_EXISTENCE', pattern=[{'LOWER': {'IN': ['adequate', 'sufficient']}}, {'LOWER': 'to'}, {'LOWER': 'rule'}, {'LOWER': 'the'}, {'LOWER': {'IN': ['patient', 'pt']}}, {'LOWER': 'out'}, {'LOWER': {'IN': ['against', 'for']}, 'OP': '?'}], direction='FORWARD')

ConTextRule(literal='any other', category='NEGATED_EXISTENCE', pattern=None, direction='FORWARD')

ConTextRule(literal='apart from', category='NEGATED_EXISTENCE', pattern=[{'LOWER': 'apart'}, {'LOWER': {'IN': ['for', 'from']}}], direction='TERMIN
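
To make the printed fields easier to read, here is a deliberately simplified sketch of what a `FORWARD` rule expresses: the `literal` phrase triggers the rule, and the `category` applies to targets that come after it. This is not medSpaCy's implementation. `SimpleRule` and `targets_in_scope` are hypothetical names, matching is done on plain lowercase substrings rather than spaCy token patterns, and the `BACKWARD` branch is included only as an assumed mirror case.

```python
from dataclasses import dataclass


@dataclass
class SimpleRule:
    """Hypothetical stand-in holding three of the fields printed above."""
    literal: str    # phrase that triggers the rule
    category: str   # attribute the rule asserts, e.g. NEGATED_EXISTENCE
    direction: str  # "FORWARD" (shown above) or "BACKWARD" (assumed here)


def targets_in_scope(sentence, targets, rule):
    """Return the targets that fall on the side of the trigger the rule points at."""
    lowered = sentence.lower()
    start = lowered.find(rule.literal.lower())
    if start == -1:
        return []  # the trigger phrase does not occur in this sentence
    end = start + len(rule.literal)
    hits = []
    for target in targets:
        pos = lowered.find(target.lower())
        if pos == -1:
            continue
        if rule.direction == "FORWARD" and pos >= end:
            hits.append(target)
        elif rule.direction == "BACKWARD" and pos + len(target) <= start:
            hits.append(target)
    return hits


rule = SimpleRule("absence of", "NEGATED_EXISTENCE", "FORWARD")
print(targets_in_scope("Absence of pneumonia on chest x-ray.", ["pneumonia"], rule))
# ['pneumonia']
print(targets_in_scope("Pneumonia noted; absence of effusion.", ["pneumonia", "effusion"], rule))
# ['effusion']
```

In the second sentence only "effusion" is in scope, because "pneumonia" appears before the trigger and the rule looks forward.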

You can also set `load_rules` to `False` so that the components are all blank (other than PyRuSH, which requires rules to be instantiated).

### Using specific models
If you have other models installed, in English or in other languages, you can load one through the `model` argument. For example, to load a [sciSpaCy model](https://allenai.github.io/scispacy/) and use it with medSpaCy, first download the model:

```bash
pip install scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
```
and then load it with medSpaCy:

```python
nlp = medspacy.load("en_core_sci_sm", load_rules=False, medspacy_disable=["medspacy_target_matcher"])
```

In [9]:
nlp = medspacy.load("en_core_web_sm", load_rules=False, medspacy_disable=["medspacy_target_matcher"])
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'medspacy_pyrush',
 'medspacy_context']

You can also pass additional keyword arguments, which are forwarded to the spaCy model. Here, `disable` removes spaCy's `ner` component.

In [10]:
nlp = medspacy.load("en_core_web_sm", medspacy_enable=["medspacy_target_matcher"], disable=["ner"])
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'medspacy_target_matcher']

### Specifying components
You can choose which components to include or exclude with the `medspacy_enable` and `medspacy_disable` arguments:

In [11]:
# Only load the default components
nlp_default = medspacy.load()
nlp_default.pipe_names

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context']

Load all components:

In [12]:
# All medspaCy components
nlp_full = medspacy.load(medspacy_enable="all")
print(nlp_full.pipe_names)
# With "all", nlp.tokenizer is replaced by a Preprocessor that includes medspaCy's tokenizer
print(nlp_full.tokenizer)

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context', 'medspacy_sectionizer', 'medspacy_postprocessor', 'medspacy_doc_consumer']
<medspacy.preprocess.preprocessor.Preprocessor object at 0x7feae10879d0>


Enable only some components:

In [13]:
# Only load the context and target matcher
nlp_matcher_only = medspacy.load(medspacy_enable=["medspacy_context", "medspacy_target_matcher"])
nlp_matcher_only.pipe_names

['medspacy_target_matcher', 'medspacy_context']

Disable some components while keeping the rest of the selection:

In [14]:
# Disable pyrush and context
nlp_no_pyrush_context = medspacy.load(medspacy_enable="default", medspacy_disable=["medspacy_pyrush", "medspacy_context"])
nlp_no_pyrush_context.pipe_names

['medspacy_target_matcher']
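
The selection behavior seen in the cells above can be summarized with a small pure-Python sketch. It only reproduces the `pipe_names` results shown in this notebook. `resolve_components` is a hypothetical helper, not part of medSpaCy, and the library's real logic may differ (for example, the tokenizer and preprocessor are set on `nlp.tokenizer` and are not modeled here).

```python
# Component names as they appear in the pipe_names outputs in this notebook.
DEFAULT = ["medspacy_pyrush", "medspacy_target_matcher", "medspacy_context"]
ALL = DEFAULT + [
    "medspacy_sectionizer",
    "medspacy_postprocessor",
    "medspacy_doc_consumer",
]


def resolve_components(enable="default", disable=None):
    """Hypothetical helper: which medSpaCy pipe names end up in the pipeline."""
    if enable == "default":
        selected = set(DEFAULT)
    elif enable == "all":
        selected = set(ALL)
    else:
        selected = set(enable)  # an explicit list of component names
    selected -= set(disable or [])
    # Emit names in a fixed order, independent of the order they were requested in.
    return [name for name in ALL if name in selected]


print(resolve_components())
# ['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context']
print(resolve_components(enable=["medspacy_context", "medspacy_target_matcher"]))
# ['medspacy_target_matcher', 'medspacy_context']
print(resolve_components(disable=["medspacy_pyrush", "medspacy_context"]))
# ['medspacy_target_matcher']
```

Note that requesting `medspacy_context` before `medspacy_target_matcher` still yields the matcher first, which matches the output of the "enable only some components" cell.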

## 2. Add specific components to an existing model
You can also add medspaCy components to an existing spaCy pipeline directly with `add_pipe()`, without using `medspacy.load()`.

In [15]:
import spacy

In [16]:
en = spacy.load("en_core_web_sm")

In [17]:
en.add_pipe("medspacy_context")

<medspacy.context.context.ConText at 0x7fead3586a30>

In [18]:
en.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'medspacy_context']