# Overview
This notebook will show an example of how to use the `ClinicalSectionizer` package from medSpaCy. 

## Prerequisites
This notebook also uses the main medSpaCy package, [medspacy](https://github.com/medspacy/medspacy), which you can install with:


`pip install medspacy`

It also uses a statistical model trained on i2b2 data, which you can install with:

`pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz`

## Example text
We'll process this document below:

In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, "../..")

In [2]:
with open("../discharge_summary.txt") as f:
    text = f.read()

In [3]:
print(text[:500])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging sh


# Getting started
The `Sectionizer` component is used in the same way as any other spaCy component. We'll start by loading a spaCy model, creating a `Sectionizer` object, and then adding it to our pipeline.

In [4]:
import spacy

In [5]:
nlp = spacy.load("en_info_3700_i2b2_2012")



In [6]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [9]:
from medspacy.section_detection import Sectionizer

In [10]:
sectionizer = Sectionizer(nlp)

In [11]:
nlp.add_pipe(sectionizer)

In [12]:
nlp.pipe_names

['tagger', 'parser', 'ner', 'sectionizer']

## Processing a text
As an example, we'll process this text from MIMIC-II:

In [13]:
doc = nlp(text)

Just like that, we've processed our doc with medSpaCy! Let's first take a look at the entities we've extracted, as well as the section headers. To do this, we'll use a visualizer from the `medspacy.visualization` module. This function highlights all of the **entities** found in `doc.ents`, which were extracted by our model's `ner` component, as well as the section headers extracted by `sectionizer`:

In [14]:
from medspacy.visualization import visualize_ent

In [15]:
visualize_ent(doc)

The section titles are highlighted in gray with **<<>>** symbols around the normalized section title. As you can see, there is sometimes overlap between the targets and section headers, which causes duplicate text to be displayed.

# Extracted section information
Let's now see what information was extracted by `sectionizer`. When `sectionizer` processes a `doc`, it adds a number of custom attributes at the following levels:
- `Doc`: The entire document
- `Span`: A slice of a document (like an entity)
- `Token`: A single token

In spaCy, custom attributes are accessed through the `._` attribute of an object (for example, `doc._`).

## Doc
Let's first look at all of the `section_titles` which were found in our text. Note that until a section header is found, the header is considered to be `None`.

In [16]:
doc._.section_titles

[None,
 'other',
 'allergies',
 'chief_complaint',
 'history_of_present_illness',
 'past_medical_history',
 'social_history',
 'family_history',
 'hospital_course',
 'medications',
 'observation_and_plan',
 'patient_instructions',
 'signature']

Next, let's find the spans of the text which were recognized as section headers. Note that the section titles above are normalized forms of these actual spans of text.

In [17]:
doc._.section_headers

[None,
 Service:,
 Allergies:,
 Chief Complaint:,
 History of Present Illness:,
 Past Medical History:,
 Social History:,
 Family History:,
 Brief Hospital Course:,
 Discharge Medications:,
 Discharge Diagnosis:,
 Discharge Instructions:,
 Signed electronically by:]
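
To make the relationship between headers and normalized titles concrete, here is a minimal sketch in plain Python. It is not medSpaCy's implementation: the pattern table `HEADER_PATTERNS` and the function `split_sections` are hypothetical names invented for illustration, and the patterns cover only three of the headers shown above.

```python
import re

# Hypothetical header patterns mapped to normalized titles.
# Illustration only; this is not medSpaCy's actual rule set.
HEADER_PATTERNS = {
    r"Allergies:": "allergies",
    r"Chief Complaint:": "chief_complaint",
    r"History of Present Illness:": "history_of_present_illness",
}


def split_sections(text):
    """Return (title, header, span_text) triples in document order."""
    matches = []
    for pattern, title in HEADER_PATTERNS.items():
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), title))
    matches.sort()

    sections = []
    # Text before the first header has no title, like the leading None above.
    first_start = matches[0][0] if matches else len(text)
    if first_start > 0:
        sections.append((None, None, text[:first_start]))

    # Each section runs from its header to the start of the next header.
    for i, (start, end, title) in enumerate(matches):
        next_start = matches[i + 1][0] if i + 1 < len(matches) else len(text)
        sections.append((title, text[start:end], text[start:next_start]))
    return sections


sample = "Sex: F\n\nAllergies:\nHydrochlorothiazide\n\nChief Complaint:\nAbdominal pain\n"
for title, header, span_text in split_sections(sample):
    print(repr(title), repr(header))
```

In this sketch the header is the literal text matched in the document, while the title is the label attached to the pattern, which is the same distinction as between `doc._.section_headers` and `doc._.section_titles`.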

Now, let's look at a few entire sections:

In [18]:
for section_span in doc._.section_spans[:5]:
    print(section_span)
    print("---"*10)

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F


------------------------------
Service: SURGERY


------------------------------
Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]

------------------------------
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]



------------------------------
History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis.


------------------------------


Finally, a **subsection** may have a parent section, which will be explained in a later notebook:

In [19]:
for parent in doc._.section_parents:
    print(parent)

None
None
None
None
None
None
None
None
None
None
None
None
None


You can iterate through all 4 of these attributes through the `doc._.sections` attribute, which returns a list of namedtuple `Section` objects. The attributes can then be accessed either through indexing or field names:

In [20]:
section_tuple = doc._.sections[1]
section_tuple

Section(section_title='other', section_header=Service:, section_parent=None, section_span=Service: SURGERY

)

In [21]:
print("Title:", section_tuple.section_title)
print("Header:", section_tuple.section_header)
print("Parent:", section_tuple.section_parent)
print("Span:", section_tuple.section_span)

Title: other
Header: Service:
Parent: None
Span: Service: SURGERY
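
If namedtuples are unfamiliar, the access patterns can be tried without spaCy. The sketch below builds a stand-in `Section` with the same four field names as the output above; plain strings take the place of the spaCy `Span` objects, so this is an illustration rather than the library's own class.

```python
from collections import namedtuple

# Stand-in for the Section tuple shown above.
# Plain strings replace the spaCy Span objects.
Section = namedtuple(
    "Section",
    ["section_title", "section_header", "section_parent", "section_span"],
)

section = Section("other", "Service:", None, "Service: SURGERY\n\n")

# Access by field name and by position refers to the same value.
assert section.section_title == section[0]

# Tuple unpacking works the same way as in a loop over the sections.
title, header, parent, span = section
print(title, header, parent)

# _asdict() is convenient when building tables or JSON records.
record = section._asdict()
print(record["section_header"])
```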




In [22]:
for (title, header, parent, span) in doc._.sections[:5]:
    print(title)
    print(header)
    print(parent)
    print()
    print(span)
    print("----"*10)
    print()

None
None
None

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F


----------------------------------------

other
Service:
None

Service: SURGERY


----------------------------------------

allergies
Allergies:
None

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]

----------------------------------------

chief_complaint
Chief Complaint:
None

Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]



----------------------------------------

history_of_present_illness
History of Present Illness:
None

History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis.


----------------------------------------



## Span
Now, for each entity extracted from the text, let's look at the label and the section title containing the entity:

In [23]:
for ent in doc.ents[:10]:
    print(ent, ent.label_, ent._.section_title, ent._.section_header, ent._.section_parent)

Hydrochlorothiazide TREATMENT allergies Allergies: Allergies:
Abdominal pain PROBLEM chief_complaint Chief Complaint: Chief Complaint:
Invasive Procedure TREATMENT chief_complaint Chief Complaint: Chief Complaint:
PICC line TREATMENT chief_complaint Chief Complaint: Chief Complaint:
ERCP TEST chief_complaint Chief Complaint: Chief Complaint:
sphincterotomy TREATMENT chief_complaint Chief Complaint: Chief Complaint:
a recent stroke PROBLEM history_of_present_illness History of Present Illness: History of Present Illness:
abdominal pain PROBLEM history_of_present_illness History of Present Illness: History of Present Illness:
Imaging TEST history_of_present_illness History of Present Illness: History of Present Illness:
metastasis PROBLEM history_of_present_illness History of Present Illness: History of Present Illness:


Similar to `doc`, you can also access the entire section span which contained the ent:

In [24]:
ent = doc.ents[0]
ent._.section_span

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
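
One way to picture how an entity is tied to its section is a lookup by character offset: the entity belongs to the last section that starts at or before it. The sketch below illustrates that idea with a binary search. It is an assumption made for illustration, not a description of medSpaCy's internals; the offsets in `section_starts` are made up, and `title_for_offset` is a hypothetical helper.

```python
from bisect import bisect_right

# Hypothetical (start_char, title) pairs in document order; offsets are made up.
section_starts = [(0, None), (120, "allergies"), (190, "chief_complaint")]


def title_for_offset(char_offset, starts=section_starts):
    """Return the title of the last section starting at or before char_offset."""
    offsets = [start for start, _ in starts]
    idx = bisect_right(offsets, char_offset) - 1
    if idx < 0:
        raise ValueError("offset precedes the first section")
    return starts[idx][1]


# An entity starting at character 131 falls inside the section starting at 120.
print(title_for_offset(131))
```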

# Assertion attributes
In clinical NLP, it's important to account for certain attributes about extracted entities, such as whether a concept is **negated** or **historical**. This is handled at a sentence level by [cycontext](https://github.com/medspacy/cycontext), which looks for linguistic modifiers within the same sentence as an entity. However, the section in which a concept occurs can also inform these attributes.

For instance, in the example below, we know that:
- **"Pneumonia"** is not current because it occurs in the **Past Medical History**
- **"Penicillin"** and **"Allergies"** are not actually experienced; they are just listed in the allergies section. We call this **hypothetical**
- **"Diabetes"** is experienced by someone in the patient's family because it occurs in the **Family History**
- **"Chest pain"** is hypothetical because it occurs in the **Patient Education** section as a hypothetical event



By default, these attributes will not be added by the sectionizer. This functionality can be turned on with the `add_attrs` argument in the constructor, which by default is `False`:

In [25]:
sectionizer = Sectionizer(nlp, add_attrs=True)

In [26]:
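try:
    # Remove the previously added sectionizer so it can be replaced
    nlp.remove_pipe("sectionizer")
except ValueError:
    # remove_pipe raises ValueError if no such component is in the pipeline
    pass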

In [27]:
nlp.add_pipe(sectionizer)

In [28]:
text = """
Past Medical History:
pneumonia

Allergies: 
Penicillin

Family History:
Diabetes

Assessment and Plan:
Warfarin for PE

Patient Education:
You have been prescribed with a medication which is known to cause chest pain.
"""

In [29]:
doc = nlp(text)

In [30]:
visualize_ent(doc)

Each entity will have the following attributes defined under the `ent._` attribute. Each has a default value of `False`, which can be set to `True` by `sectionizer`:

- `is_negated`
- `is_uncertain`
- `is_historical`
- `is_family`
- `is_hypothetical`

Let's iterate through these entities and see which of these attributes are set on each one.

In [31]:
for ent in doc.ents:
    print(ent, ent._.section_title)
    print("Historical:", ent._.is_historical, "\tFamily:", ent._.is_family, "\tHypothetical:", ent._.is_hypothetical,)
    print()

pneumonia past_medical_history
Historical: True 	Family: False 	Hypothetical: False

Allergies allergies
Historical: False 	Family: False 	Hypothetical: False

Penicillin allergies
Historical: False 	Family: False 	Hypothetical: False

Diabetes family_history
Historical: False 	Family: True 	Hypothetical: False

Warfarin family_history
Historical: False 	Family: True 	Hypothetical: False

PE family_history
Historical: False 	Family: True 	Hypothetical: False

a medication patient_education
Historical: False 	Family: False 	Hypothetical: False

chest pain patient_education
Historical: False 	Family: False 	Hypothetical: False



The attributes and sections are defined in a dictionary mapping the **section titles** to the attribute name/value pairs. You can find this in the `assertion_attributes_mapping` attribute:

In [32]:
sectionizer.assertion_attributes_mapping

{'past_medical_history': {'is_historical': True},
 'sexual_and_social_history': {'is_historical': True},
 'family_history': {'is_family': True},
 'patient_instructions': {'is_hypothetical': True},
 'education': {'is_hypothetical': True},
 'allergy': {'is_hypothetical': True}}
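
To see how a mapping like this could drive the entity attributes, here is a small sketch in plain Python. It assumes the section title is looked up as an exact dictionary key, which is an assumption about the behavior rather than something confirmed from the library; `DEFAULTS` and `attributes_for` are hypothetical names.

```python
# Mapping copied from the output above.
assertion_attributes_mapping = {
    "past_medical_history": {"is_historical": True},
    "sexual_and_social_history": {"is_historical": True},
    "family_history": {"is_family": True},
    "patient_instructions": {"is_hypothetical": True},
    "education": {"is_hypothetical": True},
    "allergy": {"is_hypothetical": True},
}

# Default values for the five attributes listed earlier.
DEFAULTS = {
    "is_negated": False,
    "is_uncertain": False,
    "is_historical": False,
    "is_family": False,
    "is_hypothetical": False,
}


def attributes_for(section_title, mapping=assertion_attributes_mapping):
    """Hypothetical helper: start from the defaults, then overlay the section's values."""
    attrs = dict(DEFAULTS)
    # Exact-key lookup (an assumption); unknown titles change nothing.
    attrs.update(mapping.get(section_title, {}))
    return attrs


for title in ["past_medical_history", "family_history", "allergies", "patient_education"]:
    attrs = attributes_for(title)
    print(title, attrs["is_historical"], attrs["is_family"], attrs["is_hypothetical"])
```

Under this exact-match assumption, `allergies` and `patient_education` receive no changes because the mapping keys are `allergy` and `education`. That would be consistent with the `Hypothetical: False` values printed for those sections above, although the sketch does not prove that this is the cause.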

Additionally, you could define your own logic by constructing a dictionary like the one above, registering the `Span` extensions, and passing the dictionary to the `add_attrs` argument:
```python
sectionizer = Sectionizer(nlp, add_attrs={...})
```
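
As a rough sketch of what such a dictionary might look like, the example below keys the rules by section titles that were printed earlier in this notebook. The rules themselves are hypothetical, and `unknown_attributes` is an invented helper that lists any attribute names outside the five built-in ones, since those would need their own `Span` extension. Whether these titles produce the intended flags depends on how the sectionizer matches them, which this sketch does not verify.

```python
# Attribute names listed earlier in this notebook.
KNOWN_ATTRS = {
    "is_negated",
    "is_uncertain",
    "is_historical",
    "is_family",
    "is_hypothetical",
}

# Hypothetical custom rules, keyed by section titles printed in this notebook.
custom_mapping = {
    "past_medical_history": {"is_historical": True},
    "family_history": {"is_family": True},
    "allergies": {"is_hypothetical": True},
    "patient_education": {"is_hypothetical": True},
}


def unknown_attributes(mapping, known=KNOWN_ATTRS):
    """Return attribute names that would need their own Span extension."""
    used = {name for attrs in mapping.values() for name in attrs}
    return sorted(used - known)


print(unknown_attributes(custom_mapping))

# The dictionary would then be passed to the constructor as described above:
# sectionizer = Sectionizer(nlp, add_attrs=custom_mapping)
```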