# Subsections and Parents

The medspacy sectionizer supports adding subsections to your document.

In [1]:
import sys
sys.path.insert(0, "../..")

import medspacy
from medspacy.section_detection import Sectionizer
from medspacy.section_detection import SectionRule

Here are four example documents showing slight permutations of a section-subsection structure found in text.

In [2]:
text1 = '''Past Medical History: 
pt has history of medical events
Comments: some comment here

Allergies:
peanuts
'''

text2 = '''Past Medical History: 
pt has history of medical events
Comments: some comment here

Allergies:
peanuts
Comments: pt cannot eat peanuts
'''

text3 = '''Past Medical History: 
pt has history of medical events

Allergies:
peanuts
Comments: pt cannot eat peanuts
'''

text4 = '''Past Medical History: 
pt has history of medical events

Allergies:
peanuts

Medical Assessment: pt has a fever
Comments: fever is 101F
'''

# Parent-Child attachment
Rules specify a `parents` list. This defines all possible legal parents for this section by their `section_title`. The specific parent (if any exist) of each match is determined at runtime. In this example, we define four sections and the comment section has two candidate parents.

In [3]:
nlp = medspacy.load()

In [4]:
sectionizer = Sectionizer(nlp,rules=None)

In [5]:
pattern_dicts = [{"category":"past_medical_history","literal":"Past Medical History:"},
                {"category":"allergies","literal":"Allergies:"},
                {"category":"medical_assessment","literal":"Medical Assessment:"},
                {"category":"comment","literal":"Comments:","parents":["past_medical_history","allergies"]}]

In [6]:
patterns = [SectionRule.from_dict(pattern) for pattern in pattern_dicts]

In [7]:
sectionizer.add(patterns)

In [8]:
nlp.add_pipe("medspacy_sectionizer")

<medspacy.section_detection.sectionizer.Sectionizer at 0x7f88487ebd00>

We can print out the output of the sectionizer on each of these documents and see how they vary.

In the first case, we see that three sections are identified in the text and the comment section has a parent "past_medical_history"

In [9]:
doc = nlp(text1)
for section in doc._.sections:
    print("CATEGORY.............. {0}".format(section.category))
    print("TITLE................. {0}".format(section.title_span))
    if section.parent:
        print("PARENT................ {0}".format(section.parent.category))
    else:
        print("PARENT................ {0}".format(section.parent))
    print("SECTION TEXT..........\n{0}".format(section.body_span))
    print("----------------------")

CATEGORY.............. past_medical_history
TITLE................. Past Medical History:
PARENT................ None
SECTION TEXT..........

pt has history of medical events

----------------------
CATEGORY.............. comments
TITLE................. Comments:
PARENT................ None
SECTION TEXT..........
some comment here


----------------------
CATEGORY.............. allergies
TITLE................. Allergies:
PARENT................ None
SECTION TEXT..........

peanuts

----------------------


  matches = self.matcher(doc)
  matches = self.matcher(doc)


In this next document, there are two comment sections, each that match to the closest parent sections. Subsections cannot jump over other sections to attach to a parent.

In [10]:
doc = nlp(text2)
for section in doc._.sections:
    print("CATEGORY.............. {0}".format(section.category))
    print("TITLE................. {0}".format(section.title_span))
    if section.parent:
        print("PARENT................ {0}".format(section.parent.category))
    else:
        print("PARENT................ {0}".format(section.parent))
    print("SECTION TEXT..........\n{0}".format(section.body_span))
    print("----------------------")

CATEGORY.............. past_medical_history
TITLE................. Past Medical History:
PARENT................ None
SECTION TEXT..........

pt has history of medical events

----------------------
CATEGORY.............. comments
TITLE................. Comments:
PARENT................ None
SECTION TEXT..........
some comment here


----------------------
CATEGORY.............. allergies
TITLE................. Allergies:
PARENT................ None
SECTION TEXT..........

peanuts

----------------------
CATEGORY.............. comments
TITLE................. Comments:
PARENT................ None
SECTION TEXT..........
pt cannot eat peanuts

----------------------


  matches = self.matcher(doc)
  matches = self.matcher(doc)


This example further illustrates how subsections cannot attach to non-adjacent candidate parents. The subsection in `past_medical_history` has been removed but the `allergies` subsection matches the same as before

In [11]:
doc = nlp(text3)
for section in doc._.sections:
    print("CATEGORY.............. {0}".format(section.category))
    print("TITLE................. {0}".format(section.title_span))
    if section.parent:
        print("PARENT................ {0}".format(section.parent.category))
    else:
        print("PARENT................ {0}".format(section.parent))
    print("SECTION TEXT..........\n{0}".format(section.body_span))
    print("----------------------")

CATEGORY.............. past_medical_history
TITLE................. Past Medical History:
PARENT................ None
SECTION TEXT..........

pt has history of medical events


----------------------
CATEGORY.............. allergies
TITLE................. Allergies:
PARENT................ None
SECTION TEXT..........

peanuts

----------------------
CATEGORY.............. comments
TITLE................. Comments:
PARENT................ None
SECTION TEXT..........
pt cannot eat peanuts

----------------------


  matches = self.matcher(doc)
  matches = self.matcher(doc)


This final examples shows that if no adjacent parent candidates exist, then no match will be made. `medical_assessment` was not listed as a candidate parent for `comment`, so there is no parent attachment made by the comment following this section

In [12]:
doc = nlp(text4)
for section in doc._.sections:
    print("CATEGORY.............. {0}".format(section.category))
    print("TITLE................. {0}".format(section.title_span))
    if section.parent:
        print("PARENT................ {0}".format(section.parent.category))
    else:
        print("PARENT................ {0}".format(section.parent))
    print("SECTION TEXT..........\n{0}".format(section.body_span))
    print("----------------------")

CATEGORY.............. past_medical_history
TITLE................. Past Medical History:
PARENT................ None
SECTION TEXT..........

pt has history of medical events


----------------------
CATEGORY.............. allergies
TITLE................. Allergies:
PARENT................ None
SECTION TEXT..........

peanuts

Medical
----------------------
CATEGORY.............. observation_and_plan
TITLE................. Assessment:
PARENT................ None
SECTION TEXT..........
pt has a fever

----------------------
CATEGORY.............. comments
TITLE................. Comments:
PARENT................ None
SECTION TEXT..........
fever is 101F

----------------------


  matches = self.matcher(doc)
  matches = self.matcher(doc)


# Requiring Parents for matched sections

It is possible to specify that a section is required to find a valid parent in order to be included in the resulting document. When the pattern defines the optional parameter `parent_required` as `True`, if the section finds no parent section in the document, then the section will be removed from the output.

The following text shows a short example where a required parent might be useful. In this document, there are two mentions of the word "color". One might be part of a section, but without further specification, the other might be a false positive. There may be more than one way to solve this ambiguity, such as incorporating punctuation or proximity to line endings for further context.

In [13]:
text5 = '''Patient is 6 years old and says his favorite color is purple

medical assessment
patient has a bruise from a bicycle accident
color
blue
'''

In [14]:
nlp = medspacy.load()

In [15]:
sectionizer = Sectionizer(nlp,rules=None)

In [16]:
pattern_dicts = [{"category":"medical_assessment","literal":"medical assessment"},
                {"category":"color","literal":"color","parents":["medical_assessment"],"parent_required":True}]

In [17]:
rules = [SectionRule.from_dict(pattern) for pattern in pattern_dicts]

In [18]:
sectionizer.add(rules)

In [19]:
nlp.add_pipe("medspacy_sectionizer")

<medspacy.section_detection.sectionizer.Sectionizer at 0x7f8848ddb5e0>

In [20]:
doc = nlp(text5)
for section in doc._.sections:
    print("CATEGORY.............. {0}".format(section.category))
    print("TITLE................. {0}".format(section.title_span))
    if section.parent:
        print("PARENT................ {0}".format(section.parent.category))
    else:
        print("PARENT................ {0}".format(section.parent))
    print("SECTION TEXT..........\n{0}".format(section.body_span))
    print("----------------------")

CATEGORY.............. None
TITLE................. 
PARENT................ None
SECTION TEXT..........
Patient is 6 years old and says his favorite color is purple

medical assessment
patient has a bruise from a bicycle accident
color
blue

----------------------


  matches = self.matcher(doc)
  matches = self.matcher(doc)


# Subsection trees and backtracking

Subsections can be chained together and the parent matching will traverse the tree structure to match to the correct legal parent.

The following two examples show deep subsection structures in a document. The first document is a simple example showing the subsection chaining that might exist in a document. The second example is more complex and shows subsection siblings (sections at the same depth of the subsection tree) and backtracking out of some, but not all subsections.

In [21]:
text6 = '''Section 1: some text
Section 1.1: Some other text
Section 1.1.1: Even more text
Section 1.1.1.1: How deep can sections go?
'''

text7 = '''Section 1: some text
Section 1.1: Some other text
Section 1.1.1: Even more text
Section 1.1.1.1: How deep can sections go?
Section 1.1.1.2: As deep as you want!
Section 1.2: Let's backtrack
Section 2: A whole new section
'''

In [22]:
nlp = medspacy.load()

In [23]:
sectionizer = Sectionizer(nlp,rules=None)

In [24]:
pattern_dicts = [{"category":"s1","literal":"Section 1:"},
                {"category":"s1.1","literal":"Section 1.1:", "parents":["s1"]},
                {"category":"s1.1.1","literal":"Section 1.1.1:", "parents":["s1.1"]},
                {"category":"s1.1.1.1","literal":"Section 1.1.1.1:","parents":["s1.1.1"]},
                {"category":"s1.1.1.2","literal":"Section 1.1.1.2:","parents":["s1.1.1"]},
                {"category":"s1.2","literal":"Section 1.2:","parents":["s1"]},
                {"category":"s2","literal":"Section 2:"}]

In [25]:
rules = [SectionRule.from_dict(pattern) for pattern in pattern_dicts]

In [26]:
sectionizer.add(rules)

In [27]:
nlp.add_pipe("medspacy_sectionizer")

<medspacy.section_detection.sectionizer.Sectionizer at 0x7f884940e610>

In [28]:
rules = [SectionRule.from_dict(pattern) for pattern in pattern_dicts]

In [29]:
doc = nlp(text6)
for section in doc._.sections:
    print("CATEGORY.............. {0}".format(section.category))
    print("TITLE................. {0}".format(section.title_span))
    if section.parent:
        print("PARENT................ {0}".format(section.parent.category))
    else:
        print("PARENT................ {0}".format(section.parent))
    print("SECTION TEXT..........\n{0}".format(section.body_span))
    print("----------------------")

CATEGORY.............. None
TITLE................. 
PARENT................ None
SECTION TEXT..........
Section 1: some text
Section 1.1: Some other text
Section 1.1.1: Even more text
Section 1.1.1.1: How deep can sections go?

----------------------


  matches = self.matcher(doc)
  matches = self.matcher(doc)


In [30]:
doc = nlp(text7)
for section in doc._.sections:
    print("CATEGORY.............. {0}".format(section.category))
    print("TITLE................. {0}".format(section.title_span))
    if section.parent:
        print("PARENT................ {0}".format(section.parent.category))
    else:
        print("PARENT................ {0}".format(section.parent))
    print("SECTION TEXT..........\n{0}".format(section.body_span))
    print("----------------------")

CATEGORY.............. None
TITLE................. 
PARENT................ None
SECTION TEXT..........
Section 1: some text
Section 1.1: Some other text
Section 1.1.1: Even more text
Section 1.1.1.1: How deep can sections go?
Section 1.1.1.2: As deep as you want!
Section 1.2: Let's backtrack
Section 2: A whole new section

----------------------


  matches = self.matcher(doc)
  matches = self.matcher(doc)
