# Adding to the Sectionizer

By default, `clinical_sectionizer` comes with a number of built-in patterns. However, this list is not exhaustive, and your data will almost certainly contain sections that the default patterns don't capture.

In this notebook, we'll see how to add custom section patterns to our clinical sectionizer to recognize section headers which are not contained in the default knowledge base.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys

In [3]:
sys.path.insert(0, "..")

In [4]:
import spacy
from clinical_sectionizer import Sectionizer
from medspacy.visualization import visualize_ent 


In [5]:
nlp = spacy.load("en_info_3700_i2b2_2012")



In [6]:
sectionizer = Sectionizer(nlp)

In [7]:
nlp.add_pipe(sectionizer)

In [8]:
nlp.pipe_names

['tagger', 'parser', 'ner', 'sectionizer']

## Available default sections
You can see which section titles are available in the `sectionizer` through the `sectionizer.section_titles` property:

In [9]:
sectionizer.section_titles

{'addendum',
 'allergies',
 'chief_complaint',
 'comments',
 'diagnoses',
 'family_history',
 'history_of_present_illness',
 'hospital_course',
 'imaging',
 'labs_and_studies',
 'medications',
 'neurological',
 'observation_and_plan',
 'other',
 'past_medical_history',
 'patient_education',
 'patient_instructions',
 'physical_exam',
 'problem_list',
 'reason_for_examination',
 'signature',
 'social_history'}

You can also view the patterns in `sectionizer.patterns`. The pattern format is explained in more detail below.

In [10]:
sectionizer.patterns[:5]

[{'section_title': 'addendum', 'pattern': 'ADDENDUM:'},
 {'section_title': 'addendum', 'pattern': 'Addendum:'},
 {'section_title': 'allergies', 'pattern': 'ALLERGIC REACTIONS:'},
 {'section_title': 'allergies', 'pattern': 'ALLERGIES:'},
 {'section_title': 'chief_complaint', 'pattern': 'CC:'}]
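Because each pattern is a plain dictionary, ordinary Python is enough to summarize which surface forms map to which section title. The sketch below works on a hard-coded copy of the five patterns printed above; `group_patterns_by_title` is a hypothetical helper written for this illustration, not part of `clinical_sectionizer`. In the notebook you could pass `sectionizer.patterns` instead, assuming every entry has both keys.

```python
from collections import defaultdict

# Copied from the output above; in the notebook, use sectionizer.patterns instead.
patterns = [
    {"section_title": "addendum", "pattern": "ADDENDUM:"},
    {"section_title": "addendum", "pattern": "Addendum:"},
    {"section_title": "allergies", "pattern": "ALLERGIC REACTIONS:"},
    {"section_title": "allergies", "pattern": "ALLERGIES:"},
    {"section_title": "chief_complaint", "pattern": "CC:"},
]


def group_patterns_by_title(patterns):
    """Hypothetical helper: map each section title to the list of its patterns."""
    grouped = defaultdict(list)
    for entry in patterns:
        grouped[entry["section_title"]].append(entry["pattern"])
    return dict(grouped)


grouped = group_patterns_by_title(patterns)
for title, title_patterns in grouped.items():
    print(title, title_patterns)
```

A grouping like this makes it easier to see which titles already have many variants and which ones might need additional patterns.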

In this example, we'll use a shorter excerpt of the note used earlier:

In [11]:
text = """
Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]
 
Date of Birth:  [**2498-8-19**]             Sex:   F
 
Service: SURGERY
 
Allergies: 
Hydrochlorothiazide
 
Attending:[**First Name3 (LF) 1893**] 
Chief Complaint:
Abdominal pain


Pertinent Results:
[**2573-5-30**] 09:10PM BLOOD WBC-19.2*# RBC-4.81 Hgb-15.5 Hct-44.0 
MCV-92 MCH-32.3* MCHC-35.2* RDW-13.3 Plt Ct-230
[**2573-5-30**] 09:10PM BLOOD Neuts-87* Bands-10* Lymphs-3* Monos-0 
"""

In [12]:
doc = nlp(text)

In [13]:
visualize_ent(doc)

In [14]:
doc._.section_titles

[None, 'other', 'allergies', 'chief_complaint', 'labs_and_studies']

The sectionizer correctly recognizes **"Allergies"** and **"Chief Complaint"** as section headers. However, some other titles which might be useful to extract are:
- **"Admission Date"**: Many MIMIC notes start this way, and you could consider this first section to be **visit_information**
- **"Pertinent Results"**: This is a section of lab results, which fits the existing **labs_and_studies** title
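One rough way to find headers like these in your own data is to scan for short, capitalized phrases that start a line and end with a colon. The sketch below is a heuristic written for this illustration only: `HEADER_RE` and `candidate_headers` are hypothetical names, not part of `clinical_sectionizer`, and the sample is an abridged copy of the note above. The expression will miss some headers and flag some non-headers, so treat its output as a list of candidates to review by hand.

```python
import re

# Abridged copy of the note above.
sample = """Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]
Service: SURGERY
Allergies:
Hydrochlorothiazide
Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain
Pertinent Results:
[**2573-5-30**] 09:10PM BLOOD WBC-19.2*# RBC-4.81 Hgb-15.5 Hct-44.0
"""

# A capitalized phrase of letters and spaces at the start of a line, followed by a colon.
HEADER_RE = re.compile(r"^([A-Z][A-Za-z ]{1,40}):", re.MULTILINE)


def candidate_headers(text):
    """Hypothetical helper: return line-initial 'Title:' phrases in order of appearance."""
    return [match.group(1).strip() for match in HEADER_RE.finditer(text)]


headers = candidate_headers(sample)
print(headers)
```

Comparing these candidates with the titles in `doc._.section_titles` shows which headers still need a pattern.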

## Add patterns
To recognize these sections, we can add **patterns** to the sectionizer. These patterns resemble spaCy's [rule-based matching API](https://spacy.io/usage/rule-based-matching). Each pattern is a dictionary with two keys:
- `section_title`: The normalized name of the section, which will be available in `ent._.section_title`
- `pattern`: Either a string (for an exact, case-insensitive match) or a list of dictionaries (for matching on additional token attributes) that defines the text to match
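Since patterns are plain dictionaries, a malformed entry is easy to write by accident. The sketch below checks an entry against the two-key format described above before it is handed to the sectionizer. `pattern_problems` is a hypothetical helper for this illustration, not a `clinical_sectionizer` function, and it only checks the structure stated here; it does not verify that the token attributes are ones spaCy accepts.

```python
def pattern_problems(entry):
    """Hypothetical helper: return a list of structural problems found in one pattern entry."""
    problems = []
    for key in ("section_title", "pattern"):
        if key not in entry:
            problems.append(f"missing key: {key}")
    if "pattern" in entry:
        pattern = entry["pattern"]
        is_phrase = isinstance(pattern, str)
        is_token_list = isinstance(pattern, list) and all(
            isinstance(token, dict) for token in pattern
        )
        if not (is_phrase or is_token_list):
            problems.append("pattern must be a string or a list of dictionaries")
    return problems


good_phrase = {"section_title": "labs_and_studies", "pattern": "Pertinent Results:"}
good_tokens = {"section_title": "visit_information", "pattern": [{"LOWER": "date"}]}
bad_entry = {"title": "labs_and_studies", "pattern": ("Pertinent", "Results:")}

for entry in (good_phrase, good_tokens, bad_entry):
    print(pattern_problems(entry))
```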

In [15]:
new_patterns = [
    {"section_title": "visit_information", "pattern": [{"LOWER": {"REGEX": "admi(t|ssion)"}}, {"LOWER": "date"}, {"LOWER": ":"}]},
    {"section_title": "labs_and_studies", "pattern": "Pertinent Results:"}
]
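The first pattern uses a regular expression on the lowercased token text. The sketch below uses Python's `re` module to show which tokens that expression would accept, assuming the `REGEX` operator behaves like an unanchored `re.search`. The token list is made up for this illustration. Under that assumption, the expression also accepts longer tokens that merely contain "admit" or "admission", and anchoring it with `^` and `$` restricts it to the two intended forms.

```python
import re

# Made-up lowercased tokens for illustration.
tokens = ["admission", "admit", "admitted", "readmission", "discharge"]

# The expression from the pattern above, applied as an unanchored search (assumption).
loose = re.compile(r"admi(t|ssion)")
loose_matches = [token for token in tokens if loose.search(token)]

# An anchored variant that only accepts the whole tokens "admit" and "admission".
strict = re.compile(r"^admi(t|ssion)$")
strict_matches = [token for token in tokens if strict.search(token)]

print(loose_matches)
print(strict_matches)
```

Whether the looser behavior is a problem depends on your notes: a header such as "Readmission Date:" may or may not belong under **visit_information**.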

We add this list of patterns through the `sectionizer.add` method:

In [16]:
sectionizer.add(new_patterns)

Now if we reprocess and visualize our doc, we can see that the new headers have been extracted:

In [17]:
doc = nlp(text)

In [18]:
visualize_ent(doc)

In [19]:
doc._.section_titles

[None,
 'visit_information',
 'other',
 'allergies',
 'chief_complaint',
 'labs_and_studies']

## Loading a blank sectionizer
To use only your own custom patterns, you can create the `sectionizer` without the default patterns by passing `patterns=None`:

In [20]:
blank_sectionizer = Sectionizer(nlp, patterns=None)

In [21]:
blank_sectionizer._patterns

[]

In [22]:
blank_sectionizer._section_titles

set()