# TextSectionizer
Sometimes you may not want to process an entire document with spaCy. You may instead want to extract specific sections and process them independently. The `TextSectionizer` does this by working directly on a text string. Just like the `Sectionizer`, this class comes with default patterns, which can be modified or extended.

In [1]:
with open("./example_discharge_summary.txt") as f:
    text = f.read()

In [2]:
from clinical_sectionizer import TextSectionizer

In [3]:
sectionizer = TextSectionizer()

In [4]:
sectionizer.section_titles

{'allergy',
 'chief_complaint',
 'ed_course',
 'education',
 'family_history',
 'hiv_screening',
 'imaging',
 'labs_and_studies',
 'medication',
 'observation_and_plan',
 'other',
 'past_medical_history',
 'patient_instructions',
 'physical_exam',
 'present_illness',
 'problem_list',
 'sexual_and_social_history',
 'signature'}

Unlike `Sectionizer` patterns, the `pattern` value here can only be a string, which is interpreted as a case-insensitive regular expression. You can add patterns to the `TextSectionizer` with the same `.add()` method:

In [5]:
sectionizer.patterns[:5]

[{'section_title': 'past_medical_history',
  'pattern': '(past )?medical (history|hx)'},
 {'section_title': 'past_medical_history', 'pattern': 'mhx?'},
 {'section_title': 'past_medical_history', 'pattern': 'mh:'},
 {'section_title': 'past_medical_history', 'pattern': 'pmh:'},
 {'section_title': 'past_medical_history', 'pattern': 'pohx:'}]

In [6]:
new_patterns = [
    {"section_title": "visit_information", "pattern": "admi(t|ssion) date:"},
    {"section_title": "labs_and_studies", "pattern": "pertinent results:"}
]

In [7]:
sectionizer.add(new_patterns)
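
As a rough illustration of what it means for a string pattern to be treated as a case-insensitive regular expression, the sketch below uses Python's built-in `re` module directly. It does not call into `clinical_sectionizer`; the `find_header` helper is hypothetical, and the library's actual matching logic may differ.

```python
import re

# Patterns copied from the cell above.
patterns = [
    {"section_title": "visit_information", "pattern": "admi(t|ssion) date:"},
    {"section_title": "labs_and_studies", "pattern": "pertinent results:"},
]

def find_header(line, patterns):
    """Hypothetical helper: return (section_title, matched_text) for the
    first pattern found in `line`, or (None, None) if nothing matches."""
    for entry in patterns:
        match = re.search(entry["pattern"], line, flags=re.IGNORECASE)
        if match:
            return entry["section_title"], match.group()
    return None, None

print(find_header("Admission Date:  [**2573-5-30**]", patterns))
# ('visit_information', 'Admission Date:')
print(find_header("PERTINENT RESULTS:", patterns))
# ('labs_and_studies', 'PERTINENT RESULTS:')
print(find_header("Service: SURGERY", patterns))
# (None, None)
```

Note that the matched text keeps the casing of the source document, even though the pattern itself is written in lowercase.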

# Using the TextSectionizer
Calling `sectionizer(text)` splits the document into sections. It returns a list of 3-tuples, each containing:
- `section_title`: The string of the section title
- `section_header`: The span of text matched by the pattern
- `section_text`: The span of text contained in the entire section

In [8]:
sections = sectionizer(text)

In [9]:
print(sections[1])

('visit_information', 'Admission Date:', 'Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]\n \nDate of Birth:  [**2498-8-19**]             Sex:   F\n \nService: SURGERY\n \n')


In [10]:
for (section_title, section_header, section_text) in sections[:3]:
    print(section_title)
    print(section_header)
    print()
    print(section_text)
    print("---"*5)

None
None



---------------
visit_information
Admission Date:

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]
 
Date of Birth:  [**2498-8-19**]             Sex:   F
 
Service: SURGERY
 

---------------
allergy
Allergies:

Allergies: 
Hydrochlorothiazide
 
Attending:[**First Name3 (LF) 1893**] 
Chief Complaint:
Abdominal pain
 
Major Surgical or Invasive 
---------------
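
The first tuple in the output above has `None` for both its title and header, which suggests that any text before the first matched header is returned as an untitled section. The sketch below reproduces that tuple shape with a simplified, hypothetical `split_into_sections` function built on `re.finditer`. It is an assumption about the general approach and not the library's implementation; the `allergies:` pattern and the toy text are made up for illustration.

```python
import re

def split_into_sections(text, patterns):
    """Hypothetical stand-in for a text sectionizer.

    Returns a list of (section_title, section_header, section_text) tuples.
    Text before the first matched header gets a title and header of None.
    Overlapping matches are not handled in this sketch.
    """
    # Collect every header match along with its start offset in the text.
    matches = []
    for entry in patterns:
        for m in re.finditer(entry["pattern"], text, flags=re.IGNORECASE):
            matches.append((m.start(), m.group(), entry["section_title"]))
    matches.sort()

    sections = []
    first_start = matches[0][0] if matches else len(text)
    if first_start > 0:
        sections.append((None, None, text[:first_start]))
    # Each section runs from its header to the start of the next header.
    for i, (start, header, title) in enumerate(matches):
        end = matches[i + 1][0] if i + 1 < len(matches) else len(text)
        sections.append((title, header, text[start:end]))
    return sections

toy_patterns = [
    {"section_title": "visit_information", "pattern": "admi(t|ssion) date:"},
    {"section_title": "allergy", "pattern": "allergies:"},  # assumed pattern
]
toy_text = "\nAdmission Date:  [**2573-5-30**]\n\nAllergies:\nHydrochlorothiazide\n"

for section in split_into_sections(toy_text, toy_patterns):
    print(section)
```

In this sketch, the section text includes the header itself, and concatenating all of the section texts gives back the original string.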


You can unpack these tuples into three parallel sequences with Python's `zip(*tuples)` idiom:

In [11]:
section_titles, section_headers, section_texts = zip(*sections)

In [12]:
section_titles

(None,
 'visit_information',
 'allergy',
 'other',
 'present_illness',
 'past_medical_history',
 'sexual_and_social_history',
 'family_history',
 'physical_exam',
 'labs_and_studies',
 'observation_and_plan',
 'medication',
 'medication',
 'observation_and_plan',
 'patient_instructions',
 'patient_instructions',
 'signature')
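
For readers unfamiliar with the idiom, here is a small self-contained sketch of what `zip(*...)` does to a list of 3-tuples. The `toy_sections` list is hand-written to mirror the shape of the sectionizer output; it is not produced by the library.

```python
# Toy (section_title, section_header, section_text) tuples.
toy_sections = [
    (None, None, "\n"),
    ("visit_information", "Admission Date:", "Admission Date:  [**2573-5-30**]\n"),
    ("allergy", "Allergies:", "Allergies: \nHydrochlorothiazide\n"),
]

# zip(*...) transposes the list: one tuple per field instead of one per section.
titles, headers, texts = zip(*toy_sections)
print(titles)   # (None, 'visit_information', 'allergy')
print(headers)  # (None, 'Admission Date:', 'Allergies:')

# Zipping the three columns back together rebuilds the original rows.
assert list(zip(titles, headers, texts)) == toy_sections
```

One caveat: if the list of sections were empty, the three-way unpacking would raise a `ValueError`, so it may be worth checking for that case first.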

In [13]:
section_headers

(None,
 'Admission Date:',
 'Allergies:',
 'Procedure:',
 'History of Present Illness:',
 'Past Medical History:',
 'Social History:',
 'Family History:',
 'Physical Exam:',
 'Pertinent Results:',
 'IMPRESSION:',
 'Medications on Admission:',
 'Discharge Medications:',
 'Discharge Diagnosis:',
 'Discharge Instructions:',
 'Followup Instructions:',
 'Signed electronically by:')

In [14]:
section_texts

('\n',
 'Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]\n \nDate of Birth:  [**2498-8-19**]             Sex:   F\n \nService: SURGERY\n \n',
 'Allergies: \nHydrochlorothiazide\n \nAttending:[**First Name3 (LF) 1893**] \nChief Complaint:\nAbdominal pain\n \nMajor Surgical or Invasive ',
 'Procedure:\nPICC line [**6-25**]\nERCP w/ sphincterotomy [**5-31**]\nTEE [**6-22**]\nTracheostomy [**6-24**]\n\n \n',
 'History of Present Illness:\n74y female with hypertension and a recent stroke affecting her \nspeech, who presents with 2 days of abdominal pain.  She states \nit is constant, and radiates to her back.  It started after \neating a double cheese pizza and hard lemonade.  There is no \nprior history of such an episode.  She had multiple bouts of \nnausea and vomiting, with chills and decreased flatus.\n \n',
 "Past Medical History:\n1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT, \nchemo. Last colonoscopy showed: Last CEA was in the 8 ra

## Limiting sections
Once you have identified the sections in a document, you can exclude any that aren't relevant. You can then process each remaining section separately or combine them into a smaller, more selective document.

In [15]:
relevant_section_titles = ["present_illness", "medication"]
relevant_sections = [section_text for (section_title, section_header, section_text) in sections
                     if section_title in relevant_section_titles]

In [16]:
relevant_section_titles

['present_illness', 'medication']

In [17]:
relevant_text = "\n\n".join(relevant_sections)
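
The two options mentioned above, keeping the relevant sections separate or combining them, can be sketched as follows. The `toy_sections` list and its placeholder texts are made up for illustration. Because a title such as `medication` can occur more than once in a document, the separate-sections variant maps each title to a list of texts.

```python
from collections import defaultdict

# Toy (section_title, section_header, section_text) tuples; texts are placeholders.
toy_sections = [
    (None, None, "\n"),
    ("present_illness", "History of Present Illness:", "History of Present Illness:\n(placeholder)\n"),
    ("allergy", "Allergies:", "Allergies: \nHydrochlorothiazide\n"),
    ("medication", "Medications on Admission:", "Medications on Admission:\n(placeholder)\n"),
    ("medication", "Discharge Medications:", "Discharge Medications:\n(placeholder)\n"),
]
relevant_titles = {"present_illness", "medication"}  # set for fast membership tests

# Option 1: keep the relevant sections separate, grouped by title.
by_title = defaultdict(list)
for title, _header, text in toy_sections:
    if title in relevant_titles:
        by_title[title].append(text)

# Option 2: combine them into one smaller document, preserving original order.
combined = "\n\n".join(
    text for title, _header, text in toy_sections if title in relevant_titles
)

print(dict(by_title))
print(combined)
```

Grouping by title is convenient when each section type needs its own downstream handling, while the combined string suits a single pass through one pipeline.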

In [18]:
import spacy
from cycontext.viz import visualize_ent 

In [19]:
nlp = spacy.load("en_info_3700_i2b2_2012")



In [20]:
nlp

<spacy.lang.en.English at 0x11d77b2d0>

In [21]:
doc = nlp(relevant_text)

In [22]:
visualize_ent(doc)