In [1]:
import sys
sys.path.insert(0, "..")

In [2]:
import spacy
import medspacy

# Overview
In this notebook, we'll look at two of the first steps commonly performed on clinical text and see how medSpaCy handles them:
- Tokenization
- Sentence splitting

In [3]:
with open("./discharge_summary.txt") as f:
    text = f.read()

In [4]:
# Use a blank model rather than the full medSpaCy pipeline
nlp = spacy.blank("en")

# Tokenization
Clinical language is very different from general natural language. Abbreviations and punctuation in particular are used irregularly, so tokenizers built for general English sources like Wikipedia perform poorly on clinical text.

To address this, medSpaCy has a custom tokenizer with rules specifically meant to handle clinical text. It is loaded by default by the `medspacy.load()` function, but it can also be created with this utility function:

In [5]:
from medspacy.custom_tokenizer import create_medspacy_tokenizer

In [6]:
medspacy_tokenizer = create_medspacy_tokenizer(nlp)
default_tokenizer = nlp.tokenizer

In [7]:
example_text = r'Pt c\o n;v;d h\o chf+cp'

In [8]:
print("Tokens from default tokenizer:")
print(list(default_tokenizer(example_text)))
print("Tokens from medspacy tokenizer:")
print(list(medspacy_tokenizer(example_text)))

Tokens from default tokenizer:
[Pt, c\o, n;v;d, h\o, chf+cp]
Tokens from medspacy tokenizer:
[Pt, c, \, o, n, ;, v, ;, d, h, \, o, chf, +, cp]
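To make the difference concrete, here is a minimal standard-library sketch of the kind of rule that produces the split above. The regular expression and the `split_on_punctuation` helper are hypothetical and written only for illustration; they are not medSpaCy's actual rule set, which is built on spaCy's tokenizer machinery.

```python
import re

# Hypothetical illustration only: this is NOT medSpaCy's rule set.
# Match either a run of letters/digits, or a single non-space symbol,
# so that every punctuation character becomes its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def split_on_punctuation(text):
    """Return alphanumeric runs and individual punctuation marks as tokens."""
    return TOKEN_PATTERN.findall(text)


example_text = r'Pt c\o n;v;d h\o chf+cp'
tokens = split_on_punctuation(example_text)
print(tokens)
```

On this example the sketch yields the same 15 tokens as the medSpaCy tokenizer, but such a blunt rule would also break apart strings that should stay together (dates or decimal numbers, for instance), which is one reason to prefer a curated rule set.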


Now we'll add our custom tokenizer to our pipeline by replacing the default:

In [9]:
nlp.tokenizer = medspacy_tokenizer

In [10]:
print(list(nlp(example_text)))

[Pt, c, \, o, n, ;, v, ;, d, h, \, o, chf, +, cp]


# Sentence segmentation
Sentence segmentation in medSpaCy is performed in one of two ways: either through the standard POS tagger/dependency parser steps implemented in spaCy's **en_core_web_sm** model (which is not always ideal, since that model isn't trained on clinical data), or through [PyRuSH](https://github.com/jianlins/PyRuSH). PyRuSH applies a series of rules developed on clinical text to find the best sentence boundaries.


PyRuSH's rules are defined in a resource file. PyRuSH is not currently included in the default model returned by `medspacy.load()`, since some integration steps are still needed, but it can be instantiated and added separately.
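As a rough intuition for what rule-based segmentation means, the sketch below splits text with two hand-written rules. The `BOUNDARY` pattern and the `split_sentences` helper are assumptions made up for this illustration; they are not taken from PyRuSH or its rules file, which is considerably more detailed.

```python
import re

# Hypothetical rules for illustration only; NOT PyRuSH's rule set.
# Rule 1: whitespace that follows '.', '?' or '!' ends a sentence.
# Rule 2: a blank line (two newlines, possibly with spaces between) ends a sentence.
BOUNDARY = re.compile(r"(?<=[.?!])\s+|\n\s*\n")


def split_sentences(text):
    """Split text at the boundaries above and drop empty pieces."""
    return [piece.strip() for piece in BOUNDARY.split(text) if piece.strip()]


sample = (
    "Allergies:\n\nHydrochlorothiazide\n\n\n"
    "74y female with type 2 dm and a recent stroke affecting her\n"
    "speech, who presents with 2 days of abdominal pain. Service: SURGERY"
)
sentences = split_sentences(sample)
for sentence in sentences:
    print(repr(sentence))
```

Note that the single newline inside the third sentence is not treated as a boundary, so the wrapped line stays in one sentence, while section headers followed by a blank line are split off on their own.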

## PyRuSH

In [11]:
from medspacy.sentence_splitting import PyRuSHSentencizer

In [12]:
sentencizer = PyRuSHSentencizer(rules_path="../resources/rush_rules.tsv")

In [13]:
sentencizer

<PyRuSH.PyRuSHSentencizer.PyRuSHSentencizer at 0x7ffca5aed180>

In [15]:
# Add PyRuSH to the pipeline by its registered name; spaCy builds the component from this config
nlp.add_pipe("medspacy_pyrush", config={"pyrush_path": "../resources/rush_rules.tsv"})

<PyRuSH.PyRuSHSentencizer.PyRuSHSentencizer at 0x7ffca5db7200>

In [16]:
nlp.pipe_names

['medspacy_pyrush']

In [17]:
doc = nlp(text)

In [18]:
for sent in doc.sents:
    print(sent)
    print("---"*10)

Admission Date:  [**2573-5-30**]              
------------------------------
Discharge Date:   [**2573-7-1**]


------------------------------
Date of Birth:  [**2498-8-19**]             
------------------------------
Sex:   F


------------------------------
Service: SURGERY


------------------------------
Allergies:

------------------------------
Hydrochlorothiazide


------------------------------
Attending:[**First Name3 (LF) 1893**]

------------------------------
Chief Complaint:

------------------------------
Abdominal pain


------------------------------
Major Surgical or Invasive Procedure:

------------------------------
PICC line [**6-25**]

------------------------------
ERCP w/ sphincterotomy [**5-31**]



------------------------------
History of Present Illness:

------------------------------
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain.
------------------------------
Imaging shows no evidence of me