# Subsections

The medspacy sectionizer supports adding subsections to your document.

In [1]:
import spacy

import sys
sys.path.insert(0, "..")

from clinical_sectionizer import Sectionizer

Here are four example documents, each a slight variation on a section-subsection structure found in text.

In [2]:
text1 = '''Past Medical History: 
pt has history of medical events
Comments: some comment here

Allergies:
peanuts
'''

text2 = '''Past Medical History: 
pt has history of medical events
Comments: some comment here

Allergies:
peanuts
Comments: pt cannot eat peanuts
'''

text3 = '''Past Medical History: 
pt has history of medical events

Allergies:
peanuts
Comments: pt cannot eat peanuts
'''

text4 = '''Past Medical History: 
pt has history of medical events

Allergies:
peanuts

Medical Assessment: pt has a fever
Comments: fever is 101F
'''

# Parent-Child attachment
A rule can specify a `parents` list, which names every legal parent for that section by `section_title`. The specific parent of each match, if one exists, is determined at runtime. In this example, we define four sections, and the `comment` section has two candidate parents.
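
Because parents are referenced by `section_title`, a misspelled parent name is easy to miss. The following is a minimal sketch of a sanity check over a pattern list; `undefined_parents` is a hypothetical helper written for this notebook, not part of the sectionizer, and the misspelled `"allergy"` entry is deliberate.

```python
def undefined_parents(patterns):
    """Return (section_title, parent) pairs whose parent is not a defined section_title."""
    defined = {p["section_title"] for p in patterns}
    missing = []
    for p in patterns:
        # "parents" is optional, so fall back to an empty list
        for parent in p.get("parents", []):
            if parent not in defined:
                missing.append((p["section_title"], parent))
    return missing


example_patterns = [
    {"section_title": "past_medical_history", "pattern": "Past Medical History:"},
    {"section_title": "allergies", "pattern": "Allergies:"},
    # Deliberate typo: "allergy" is not a defined section_title
    {"section_title": "comment", "pattern": "Comments:",
     "parents": ["past_medical_history", "allergy"]},
]

print(undefined_parents(example_patterns))
```

An empty list means every name in a `parents` list refers to some rule in the same pattern list.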

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
sectionizer = Sectionizer(nlp,patterns=None)

In [5]:
patterns = [{"section_title":"past_medical_history","pattern":"Past Medical History:"},
            {"section_title":"allergies","pattern":"Allergies:"},
            {"section_title":"medical_assessment","pattern":"Medical Assessment:"},
            {"section_title":"comment","pattern":"Comments:","parents":["past_medical_history","allergies"]}]

In [6]:
sectionizer.add(patterns)

In [7]:
nlp.add_pipe(sectionizer)

We can print out the output of the sectionizer on each of these documents and see how they vary.

In the first case, three sections are identified in the text, and the comment section has the parent `past_medical_history`.

In [8]:
doc = nlp(text1)
for title,text,parent,section in doc._.sections:
    print("TITLE................. {0}".format(title))
    print("TEXT.................. {0}".format(text))
    print("PARENT................ {0}".format(parent))
    print("SECTION TEXT..........\n{0}".format(section))
    print("----------------------")

TITLE................. past_medical_history
TEXT.................. Past Medical History:
PARENT................ None
SECTION TEXT..........
Past Medical History: 
pt has history of medical events

----------------------
TITLE................. comment
TEXT.................. Comments:
PARENT................ past_medical_history
SECTION TEXT..........
Comments: some comment here


----------------------
TITLE................. allergies
TEXT.................. Allergies:
PARENT................ None
SECTION TEXT..........
Allergies:
peanuts

----------------------


In this next document, there are two comment sections, each of which attaches to the closest preceding parent section. Subsections cannot jump over other sections to attach to a parent.
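
The adjacency rule can be pictured with a small pure-Python model. This is a simplified sketch that only looks at the section immediately before each match; it is an assumption made for illustration and not the sectionizer's actual implementation. `attach_adjacent` and the `candidate_parents` dictionary are hypothetical names.

```python
def attach_adjacent(titles, candidate_parents):
    """Pair each section title with its parent under a simplified adjacency rule.

    titles: section titles in document order.
    candidate_parents: dict mapping a section title to its list of legal parents.
    """
    result = []
    for i, title in enumerate(titles):
        parent = None
        # Only the section directly before this one is considered
        if i > 0 and titles[i - 1] in candidate_parents.get(title, []):
            parent = titles[i - 1]
        result.append((title, parent))
    return result


candidate_parents = {"comment": ["past_medical_history", "allergies"]}

# Section order found in text2 and text3 above
text2_titles = ["past_medical_history", "comment", "allergies", "comment"]
text3_titles = ["past_medical_history", "allergies", "comment"]

print(attach_adjacent(text2_titles, candidate_parents))
print(attach_adjacent(text3_titles, candidate_parents))
```

In the `text3` ordering, the comment attaches to `allergies` even though `past_medical_history` is also a legal parent, because only the adjacent section is eligible in this model.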

In [9]:
doc = nlp(text2)
for title,text,parent,section in doc._.sections:
    print("TITLE................. {0}".format(title))
    print("TEXT.................. {0}".format(text))
    print("PARENT................ {0}".format(parent))
    print("SECTION TEXT..........\n{0}".format(section))
    print("----------------------")

TITLE................. past_medical_history
TEXT.................. Past Medical History:
PARENT................ None
SECTION TEXT..........
Past Medical History: 
pt has history of medical events

----------------------
TITLE................. comment
TEXT.................. Comments:
PARENT................ past_medical_history
SECTION TEXT..........
Comments: some comment here


----------------------
TITLE................. allergies
TEXT.................. Allergies:
PARENT................ None
SECTION TEXT..........
Allergies:
peanuts

----------------------
TITLE................. comment
TEXT.................. Comments:
PARENT................ allergies
SECTION TEXT..........
Comments: pt cannot eat peanuts

----------------------


This example further illustrates that subsections cannot attach to non-adjacent candidate parents. The subsection in `past_medical_history` has been removed, but the `allergies` subsection matches the same way as before.

In [10]:
doc = nlp(text3)
for title,text,parent,section in doc._.sections:
    print("TITLE................. {0}".format(title))
    print("TEXT.................. {0}".format(text))
    print("PARENT................ {0}".format(parent))
    print("SECTION TEXT..........\n{0}".format(section))
    print("----------------------")

TITLE................. past_medical_history
TEXT.................. Past Medical History:
PARENT................ None
SECTION TEXT..........
Past Medical History: 
pt has history of medical events


----------------------
TITLE................. allergies
TEXT.................. Allergies:
PARENT................ None
SECTION TEXT..........
Allergies:
peanuts

----------------------
TITLE................. comment
TEXT.................. Comments:
PARENT................ allergies
SECTION TEXT..........
Comments: pt cannot eat peanuts

----------------------


This final example shows that if no adjacent parent candidate exists, no attachment is made. `medical_assessment` was not listed as a candidate parent for `comment`, so the comment following that section has no parent.

In [11]:
doc = nlp(text4)
for title,text,parent,section in doc._.sections:
    print("TITLE................. {0}".format(title))
    print("TEXT.................. {0}".format(text))
    print("PARENT................ {0}".format(parent))
    print("SECTION TEXT..........\n{0}".format(section))
    print("--------------------------")

TITLE................. past_medical_history
TEXT.................. Past Medical History:
PARENT................ None
SECTION TEXT..........
Past Medical History: 
pt has history of medical events


--------------------------
TITLE................. allergies
TEXT.................. Allergies:
PARENT................ None
SECTION TEXT..........
Allergies:
peanuts


--------------------------
TITLE................. medical_assessment
TEXT.................. Medical Assessment:
PARENT................ None
SECTION TEXT..........
Medical Assessment: pt has a fever

--------------------------
TITLE................. comment
TEXT.................. Comments:
PARENT................ None
SECTION TEXT..........
Comments: fever is 101F

--------------------------


# Requiring Parents for matched sections

It is possible to require that a section find a valid parent in order to be included in the resulting document. When a pattern sets the optional parameter `parent_required` to `True` and the section finds no parent in the document, the section is removed from the output.

The following text shows a short example where a required parent might be useful. In this document, there are two mentions of the word "color". One might be part of a section, but without further specification, the other might be a false positive. There may be more than one way to solve this ambiguity, such as incorporating punctuation or proximity to line endings for further context.
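
The effect of `parent_required` can be pictured as a filter over sections whose parents have already been resolved. This is a minimal sketch under that assumption, not the library's code; `drop_orphans`, the `(title, parent)` pair representation, and the `matched` list are all hypothetical and written by hand for illustration.

```python
def drop_orphans(sections, patterns):
    """Remove sections that require a parent but did not find one.

    sections: (title, parent) pairs in document order; parent is None if unattached.
    patterns: rule dicts; "parent_required" is optional and treated as False if absent.
    """
    required = {p["section_title"] for p in patterns if p.get("parent_required", False)}
    return [(title, parent) for title, parent in sections
            if not (title in required and parent is None)]


example_patterns = [
    {"section_title": "medical_assessment", "pattern": "medical assessment"},
    {"section_title": "color", "pattern": "color",
     "parents": ["medical_assessment"], "parent_required": True},
]

# Hand-written matches: "color" appears once with no parent, then once under a parent
matched = [("color", None), ("medical_assessment", None), ("color", "medical_assessment")]

print(drop_orphans(matched, example_patterns))
```

Only the parentless `color` match is dropped; `medical_assessment` stays because it does not set `parent_required`.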

In [12]:
text5 = '''Patient is 6 years old and says his favorite color is purple

medical assessment
patient has a bruise from a bicycle accident
color
blue
'''

In [13]:
nlp = spacy.load("en_core_web_sm")

In [14]:
sectionizer = Sectionizer(nlp,patterns=None)

In [15]:
patterns = [{"section_title":"medical_assessment","pattern":"medical assessment"},
            {"section_title":"color","pattern":"color","parents":["medical_assessment"],"parent_required":True}]

In [16]:
sectionizer.add(patterns)

In [17]:
nlp.add_pipe(sectionizer)

In [18]:
doc = nlp(text5)
for title,text,parent,section in doc._.sections:
    print("TITLE................. {0}".format(title))
    print("TEXT.................. {0}".format(text))
    print("PARENT................ {0}".format(parent))
    print("SECTION TEXT..........\n{0}".format(section))
    print("----------------------")

TITLE................. None
TEXT.................. None
PARENT................ None
SECTION TEXT..........
Patient is 6 years old and says his favorite color is purple


----------------------
TITLE................. medical_assessment
TEXT.................. medical assessment
PARENT................ None
SECTION TEXT..........
medical assessment
patient has a bruise from a bicycle accident

----------------------
TITLE................. color
TEXT.................. color
PARENT................ medical_assessment
SECTION TEXT..........
color
blue

----------------------


# Subsection trees and backtracking

Subsections can be chained together, and parent matching traverses the tree structure to find the correct legal parent.

The following two examples show deep subsection structures in a document. The first document is a simple example showing the subsection chaining that might exist in a document. The second example is more complex and shows subsection siblings (sections at the same depth of the subsection tree) and backtracking out of some, but not all subsections.
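
One way to picture the traversal is to start at the section immediately before a match and climb through that section's ancestors until a legal parent is found. The sketch below is a simplified model written for this notebook; `attach_with_backtracking` and `candidate_parents` are hypothetical names, and the sectionizer's real logic may differ in its details.

```python
def attach_with_backtracking(titles, candidate_parents):
    """Return (title, parent_title) pairs, climbing the ancestor chain.

    For each section, start at the section immediately before it and walk up
    through that section's ancestors. The first one whose title is a legal
    parent wins; if the chain runs out, the section has no parent.
    """
    parent_index = []  # parent_index[i] is the index of section i's parent, or None
    for i, title in enumerate(titles):
        legal = candidate_parents.get(title, [])
        j = i - 1 if i > 0 else None
        while j is not None and titles[j] not in legal:
            j = parent_index[j]  # backtrack one level up the tree
        parent_index.append(j)
    return [(t, None if p is None else titles[p])
            for t, p in zip(titles, parent_index)]


candidate_parents = {
    "s1.1": ["s1"],
    "s1.1.1": ["s1.1"],
    "s1.1.1.1": ["s1.1.1"],
    "s1.1.1.2": ["s1.1.1"],
    "s1.2": ["s1"],
}

# Section order with siblings (s1.1.1.1, s1.1.1.2) and backtracking (s1.2, s2)
titles = ["s1", "s1.1", "s1.1.1", "s1.1.1.1", "s1.1.1.2", "s1.2", "s2"]

for title, parent in attach_with_backtracking(titles, candidate_parents):
    print(title, "->", parent)
```

In this model, `s1.1.1.2` climbs one level past its sibling to reach `s1.1.1`, and `s1.2` climbs all the way back to `s1`. A section with no `parents` entry, such as `s2`, always ends up with no parent.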

In [19]:
text6 = '''Section 1: some text
Section 1.1: Some other text
Section 1.1.1: Even more text
Section 1.1.1.1: How deep can sections go?
'''

text7 = '''Section 1: some text
Section 1.1: Some other text
Section 1.1.1: Even more text
Section 1.1.1.1: How deep can sections go?
Section 1.1.1.2: As deep as you want!
Section 1.2: Let's backtrack
Section 2: A whole new section
'''

In [20]:
nlp = spacy.load("en_core_web_sm")

In [21]:
sectionizer = Sectionizer(nlp,patterns=None)

In [22]:
patterns = [{"section_title":"s1","pattern":"Section 1:"},
            {"section_title":"s1.1","pattern":"Section 1.1:", "parents":["s1"]},
            {"section_title":"s1.1.1","pattern":"Section 1.1.1:", "parents":["s1.1"]},
            {"section_title":"s1.1.1.1","pattern":"Section 1.1.1.1:","parents":["s1.1.1"]},
            {"section_title":"s1.1.1.2","pattern":"Section 1.1.1.2:","parents":["s1.1.1"]},
            {"section_title":"s1.2","pattern":"Section 1.2:","parents":["s1"]},
            {"section_title":"s2","pattern":"Section 2:"}]

In [23]:
sectionizer.add(patterns)

In [24]:
nlp.add_pipe(sectionizer)

In [25]:
doc = nlp(text6)
for title,text,parent,section in doc._.sections:
    print("TITLE................. {0}".format(title))
    print("TEXT.................. {0}".format(text))
    print("PARENT................ {0}".format(parent))
    print("SECTION TEXT..........\n{0}".format(section))
    print("----------------------")

TITLE................. s1
TEXT.................. Section 1:
PARENT................ None
SECTION TEXT..........
Section 1: some text

----------------------
TITLE................. s1.1
TEXT.................. Section 1.1:
PARENT................ s1
SECTION TEXT..........
Section 1.1: Some other text

----------------------
TITLE................. s1.1.1
TEXT.................. Section 1.1.1:
PARENT................ s1.1
SECTION TEXT..........
Section 1.1.1: Even more text

----------------------
TITLE................. s1.1.1.1
TEXT.................. Section 1.1.1.1:
PARENT................ s1.1.1
SECTION TEXT..........
Section 1.1.1.1: How deep can sections go?

----------------------


In [26]:
doc = nlp(text7)
for title,text,parent,section in doc._.sections:
    print("TITLE................. {0}".format(title))
    print("TEXT.................. {0}".format(text))
    print("PARENT................ {0}".format(parent))
    print("SECTION TEXT..........\n{0}".format(section))
    print("----------------------")

TITLE................. s1
TEXT.................. Section 1:
PARENT................ None
SECTION TEXT..........
Section 1: some text

----------------------
TITLE................. s1.1
TEXT.................. Section 1.1:
PARENT................ s1
SECTION TEXT..........
Section 1.1: Some other text

----------------------
TITLE................. s1.1.1
TEXT.................. Section 1.1.1:
PARENT................ s1.1
SECTION TEXT..........
Section 1.1.1: Even more text

----------------------
TITLE................. s1.1.1.1
TEXT.................. Section 1.1.1.1:
PARENT................ s1.1.1
SECTION TEXT..........
Section 1.1.1.1: How deep can sections go?

----------------------
TITLE................. s1.1.1.2
TEXT.................. Section 1.1.1.2:
PARENT................ s1.1.1
SECTION TEXT..........
Section 1.1.1.2: As deep as you want!

----------------------
TITLE................. s1.2
TEXT.................. Section 1.2:
PARENT................ s1
SECTION TEXT..........
Section 1.2: 