In [7]:
import spacy
import medspacy

# Overview
In this notebook, we'll look at two steps commonly performed on clinical text:
- Preprocessing
- Sentence splitting

In [4]:
with open("./discharge_summary.txt") as f:
    text = f.read()

In [8]:
nlp = spacy.blank("en")

# Preprocessing
In preprocessing, we'll apply a few rules to clean up the text:
- Lowercase the text
- Replace MIMIC-style date brackets with "01-01-2010" and year brackets with "2010", then remove all other MIMIC-style bracketed placeholders
- Replace abbreviations such as "dx'd" and "tx'd" with their expansions to simplify later processing
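
To make the effect of these steps concrete before involving medspaCy, here is a minimal sketch that applies the same substitutions with the standard `re` module only. The helper name `clean_mimic_text` and the sample string are hypothetical and exist only for illustration; the patterns and replacement strings mirror the rules defined later in this notebook.

```python
import re

# Patterns for MIMIC-style placeholders such as [**2573-5-30**] or [**2554**].
DATE_BRACKET = re.compile(r"\[\*\*\d{1,4}-\d{1,2}(-\d{1,2})?\*\*\]")
YEAR_BRACKET = re.compile(r"\[\*\*\d{4}\*\*\]")
OTHER_BRACKET = re.compile(r"\[\*\*[^\]]+\]")


def clean_mimic_text(text):
    """Hypothetical helper: apply the cleanup steps in order."""
    text = text.lower()
    text = DATE_BRACKET.sub("01-01-2010", text)
    text = YEAR_BRACKET.sub("2010", text)
    text = text.replace("dx'd", "Diagnosed").replace("tx'd", "Treated")
    # The catch-all runs last so it only removes brackets not handled above.
    text = OTHER_BRACKET.sub("", text)
    return text


sample = "Colon cancer dx'd in [**2554**], tx'd [**5-31**]. Attending:[**First Name3 (LF) 1893**]"
cleaned = clean_mimic_text(sample)
print(cleaned)
```

Order matters in this sketch: if the catch-all pattern ran first, it would delete the date and year brackets before they could be replaced with generic values.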

In [11]:
from medspacy.preprocess import Preprocessor, PreprocessingRule
import re

In [12]:
preprocessor = Preprocessor(nlp.tokenizer)

In [14]:
# Rules are applied in order, so the catch-all bracket rule must come last.
preprocess_rules = [
    # Any callable that maps str -> str can be used as a rule.
    lambda x: x.lower(),

    PreprocessingRule(
        re.compile(r"\[\*\*\d{1,4}-\d{1,2}(-\d{1,2})?\*\*\]"),
        repl="01-01-2010",
        desc="Replace MIMIC date brackets with a generic date."
    ),

    PreprocessingRule(
        re.compile(r"\[\*\*\d{4}\*\*\]"),
        repl="2010",
        desc="Replace MIMIC year brackets with a generic year."
    ),

    PreprocessingRule(
        re.compile(r"dx'd"),
        repl="Diagnosed",
        desc="Replace abbreviation"
    ),

    PreprocessingRule(
        re.compile(r"tx'd"),
        repl="Treated",
        desc="Replace abbreviation"
    ),

    PreprocessingRule(
        re.compile(r"\[\*\*[^\]]+\]"),
        repl="",
        desc="Remove all other bracketed placeholder text from MIMIC"
    )
]

In [15]:
preprocessor.add(preprocess_rules)

In [20]:
nlp.tokenizer = preprocessor

In [21]:
preprocessed_doc = nlp(text)

In [19]:
print(text[:1000])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with hypertension, old MI and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain.

Past Medical History:
1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
chemo. Last colonoscopy showed: Last CEA was in the 8 range
(down from 9)
2. Lymphedema from XRT, takes a diuretic
3. Hypertension

Social History:
Married, former tobacco use. No alcohol or drug use.

Family History:
Mother with stroke at age 82. no early deaths.
2 daughters- healthy

Ultrasound [**5-30**]: IMPRESSION: 1. Dilated common bile duct with
mild intrahepatic biliary ductal dilatation and dilatai

In [17]:
preprocessed_doc

admission date:  01-01-2010              discharge date:   01-01-2010

date of birth:  01-01-2010             sex:   f

service: surgery

allergies:
hydrochlorothiazide

attending:
chief complaint:
abdominal pain

major surgical or invasive procedure:
picc line 01-01-2010
ercp w/ sphincterotomy 01-01-2010


history of present illness:
74y female with hypertension, old mi and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain.

past medical history:
1. colon cancer Diagnosed in 2010, Treated with hemicolectomy, xrt,
chemo. last colonoscopy showed: last cea was in the 8 range
(down from 9)
2. lymphedema from xrt, takes a diuretic
3. hypertension

social history:
married, former tobacco use. no alcohol or drug use.

family history:
mother with stroke at age 82. no early deaths.
2 daughters- healthy

ultrasound 01-01-2010: impression: 1. dilated common bile duct with
mild intrahepatic biliary ductal dilatation and dilataion of the
pancreatic duct. 2. edematous

# Sentence segmentation
**TODO**: PyRuSH is currently failing on macOS.
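
Until that is resolved, the idea of sentence splitting can be illustrated with a naive, regex-based stand-in. This is not PyRuSH and makes no attempt to match its behavior; the function name `naive_sentences` and the splitting rule (break after `.`, `?`, or `!` followed by whitespace, and on blank lines) are assumptions chosen only for illustration.

```python
import re

# Split on whitespace that follows ., ?, or !, and on blank lines.
_BOUNDARY = re.compile(r"(?<=[.?!])\s+|\n\s*\n")


def naive_sentences(text):
    """Hypothetical helper: return non-empty, stripped sentence candidates."""
    return [piece.strip() for piece in _BOUNDARY.split(text) if piece.strip()]


sample = (
    "no alcohol or drug use.\n\n"
    "family history:\n"
    "mother with stroke at age 82. no early deaths."
)
sentences = naive_sentences(sample)
for sentence in sentences:
    print(repr(sentence))
```

A rule this simple mishandles clinical text in predictable ways: it would split the list marker in "1. colon cancer" into its own sentence and it keeps a section header such as "family history:" attached to the line that follows, which is the kind of case a clinical sentence splitter is meant to handle.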