# IO
Once you've processed a text or corpus with medspaCy, a common next step is to save and analyze the information you've extracted. `medspacy.io` contains utilities for reading documents, converting processed docs into structured data, and writing your results to disk or to a database.

In [2]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
import sys
sys.path.insert(0, "..")

In [4]:
import medspacy

In [15]:
import sqlite3

In [26]:
# If you haven't already, install this pre-trained i2b2 2012 model
# !pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz

In [27]:
enable = ['sentencizer',
 'tagger',
 'parser',
 'ner',
 'target_matcher',
 'context',
 'sectionizer']
nlp = medspacy.load("en_info_3700_i2b2_2012", enable=enable)



In [28]:
nlp.pipe_names

['sentencizer',
 'tagger',
 'parser',
 'ner',
 'target_matcher',
 'context',
 'sectionizer']

As an example, we'll use a very simple sqlite database containing two sample documents.
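
If you don't have `medspacy_demo.db` on hand, a comparable database can be built with the standard library. This is a minimal sketch, not the script used to create the demo file: it assumes the `texts` table has a single `text` column (the only column queried in this notebook), and it uses an in-memory database with placeholder documents so that nothing on disk is overwritten.

```python
import sqlite3

# Placeholder documents standing in for the two sample texts.
sample_texts = [
    "Chief Complaint:\nAbdominal pain\n",
    "There is no evidence of pneumonia.",
]

# ":memory:" keeps the sketch self-contained; pass a file path to persist it.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE texts (text TEXT)")
demo_conn.executemany(
    "INSERT INTO texts (text) VALUES (?)",
    [(t,) for t in sample_texts],
)
demo_conn.commit()

# Confirm that both documents were stored.
n_rows = demo_conn.execute("SELECT COUNT(*) FROM texts").fetchone()[0]
print(n_rows)
```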

In [49]:
import sqlite3 as sqlite

In [72]:
conn = sqlite.connect("medspacy_demo.db")

In [73]:
cursor = conn.cursor()

In [74]:
cursor.execute("SELECT text FROM texts;")

<sqlite3.Cursor at 0x1233cdc00>

In [75]:
text = cursor.fetchone()[0]

In [76]:
print(text[:500])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging sh


In [77]:
conn.close()

In [143]:
doc = nlp(text)

# I. DocConsumer
The `DocConsumer` class takes the attributes extracted by medspaCy and converts them into structured data. There are four different data types that the `DocConsumer` will extract:
- **"ent"**: Extract information about the spans in `doc.ents`. Each row represents a single entity and can include either native spaCy attributes (e.g., `ent.label_`) or custom attributes (e.g., `ent._.is_negated`)
- **"section"**: Each row will represent a section of the document and includes attributes such as the section text and category
- **"context"**: This represents the entity-modifier pairs extracted by ConText
- **"doc"**: A single row for the entire doc. By default this will only include `doc.text`, but you can add other underscore attributes

Let's create a `DocConsumer` with all four of these data types. We'll use the default attributes for now but will show how to customize them later.

In [144]:
from medspacy.io import DocConsumer

In [145]:
doc_consumer = DocConsumer(nlp, dtypes=("ent", "context", "section", "doc"))

`dtype_attrs` maps each data type to its corresponding attributes (columns):

In [146]:
doc_consumer.dtype_attrs

{'ent': ['text',
  'start_char',
  'end_char',
  'label_',
  'is_negated',
  'is_uncertain',
  'is_historical',
  'is_hypothetical',
  'is_family',
  'section_category',
  'section_parent'],
 'context': ['ent_text',
  'ent_label_',
  'ent_start_char',
  'ent_end_char',
  'modifier_text',
  'modifier_category',
  'modifier_direction',
  'modifier_start_char',
  'modifier_end_char',
  'modifier_scope_start_char',
  'modifier_scope_end_char'],
 'section': ['section_category',
  'section_title_text',
  'section_title_start_char',
  'section_title_end_char',
  'section_title_text',
  'section_title_start_char',
  'section_title_end_char',
  'section_text',
  'section_text_start_char',
  'section_text_end_char',
  'section_parent'],
 'doc': ['text']}

Now when we process our doc, the extracted information is available in the `doc._.data` attribute. This is a nested dictionary: the outermost keys are the data types, and each data type maps to an ordered dictionary from attribute name to the column of values for that attribute (for "ent", one value per entity).
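
To make that layout concrete, here is a hand-built dictionary with the same shape. It is plain Python rather than output from medspaCy: only the "ent" data type and three of its columns are included, and the values are the first two entities of the sample document.

```python
from collections import OrderedDict

# Hand-built stand-in for doc._.data, limited to three "ent" columns.
data = {
    "ent": OrderedDict(
        [
            ("text", ["Hydrochlorothiazide", "Abdominal pain"]),
            ("label_", ["TREATMENT", "PROBLEM"]),
            ("is_negated", [False, False]),
        ]
    )
}

# Column-wise access: one list of values per attribute.
labels = data["ent"]["label_"]

# Row-wise view: zip the columns together so each tuple is one entity.
rows = list(zip(*data["ent"].values()))
print(rows[0])
```

Because every column has one value per entity, zipping the columns is enough to move between the two views.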

In [147]:
doc = doc_consumer(doc)

In [148]:
doc._.data

{'ent': OrderedDict([('text',
               ['Hydrochlorothiazide',
                'Abdominal pain',
                'Invasive Procedure',
                'PICC line',
                'ERCP',
                'sphincterotomy',
                'a recent stroke',
                'abdominal pain',
                'Imaging',
                'metastasis',
                'Colon cancer',
                'hemicolectomy',
                'XRT',
                'chemo',
                'colonoscopy',
                'CEA',
                'Hypertension',
                'Married',
                'former tobacco use',
                'alcohol or drug use',
                'stroke',
                'Ultrasound',
                'pancreatic duct dilitation',
                'Miconazole',
                'Heparin Sodium',
                'Porcine',
                'Injection',
                'Type 2 DM',
                'Pancreatitis',
                'HTN',
                'aspiration respirato

If you have pandas installed, you can also now convert your doc to a DataFrame for each of the four levels:

In [149]:
doc._.to_dataframe("ent").head(10)

Unnamed: 0,text,start_char,end_char,label_,is_negated,is_uncertain,is_historical,is_hypothetical,is_family,section_category,section_parent
0,Hydrochlorothiazide,163,182,TREATMENT,False,False,False,False,False,allergies,
1,Abdominal pain,239,253,PROBLEM,False,False,False,False,False,chief_complaint,
2,Invasive Procedure,273,291,TREATMENT,False,False,False,False,False,chief_complaint,
3,PICC line,293,302,TREATMENT,False,False,False,False,False,chief_complaint,
4,ERCP,314,318,TEST,False,False,False,False,False,chief_complaint,
5,sphincterotomy,322,336,TREATMENT,False,False,False,False,False,chief_complaint,
6,a recent stroke,408,423,PROBLEM,False,False,False,False,False,history_of_present_illness,
7,abdominal pain,474,488,PROBLEM,False,False,False,False,False,history_of_present_illness,
8,Imaging,490,497,TEST,False,False,False,False,False,history_of_present_illness,
9,metastasis,519,529,PROBLEM,True,False,False,False,False,history_of_present_illness,


Now let's go through each of the 4 levels individually. You can specify a single level of data by either passing in the dtype to `doc._.get_data(dtype)` or accessing individual attributes.

## Ents data

In [150]:
ent_data = doc._.ent_data

In [151]:
ent_data.keys()

odict_keys(['text', 'start_char', 'end_char', 'label_', 'is_negated', 'is_uncertain', 'is_historical', 'is_hypothetical', 'is_family', 'section_category', 'section_parent'])

You can also access the data row-wise rather than column-wise:

In [152]:
doc._.get_data("ent", as_rows=True)[:2]

[('Hydrochlorothiazide',
  163,
  182,
  'TREATMENT',
  False,
  False,
  False,
  False,
  False,
  'allergies',
  None),
 ('Abdominal pain',
  239,
  253,
  'PROBLEM',
  False,
  False,
  False,
  False,
  False,
  'chief_complaint',
  None)]

In [153]:
ents_df = doc._.to_dataframe("ent")

In [154]:
ents_df.head()

Unnamed: 0,text,start_char,end_char,label_,is_negated,is_uncertain,is_historical,is_hypothetical,is_family,section_category,section_parent
0,Hydrochlorothiazide,163,182,TREATMENT,False,False,False,False,False,allergies,
1,Abdominal pain,239,253,PROBLEM,False,False,False,False,False,chief_complaint,
2,Invasive Procedure,273,291,TREATMENT,False,False,False,False,False,chief_complaint,
3,PICC line,293,302,TREATMENT,False,False,False,False,False,chief_complaint,
4,ERCP,314,318,TEST,False,False,False,False,False,chief_complaint,


In [155]:
ents_df[ents_df["is_negated"] == True]

Unnamed: 0,text,start_char,end_char,label_,is_negated,is_uncertain,is_historical,is_hypothetical,is_family,section_category,section_parent
9,metastasis,519,529,PROBLEM,True,False,False,False,False,history_of_present_illness,
19,alcohol or drug use,788,807,PROBLEM,True,False,False,False,False,social_history,
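
If pandas is not installed, the same filter can be applied to the row-wise tuples. The sketch below uses a few hand-copied rows and three columns so that it runs on its own; in the notebook, the rows would come from `doc._.get_data("ent", as_rows=True)` and the column names from `doc._.ent_data.keys()`.

```python
# Column names and a few row tuples copied from the entity table above.
columns = ["text", "label_", "is_negated"]
rows = [
    ("Hydrochlorothiazide", "TREATMENT", False),
    ("metastasis", "PROBLEM", True),
    ("alcohol or drug use", "PROBLEM", True),
]

# Turn each tuple into a dict so columns can be addressed by name.
records = [dict(zip(columns, row)) for row in rows]

# Keep only the entities that ConText marked as negated.
negated_texts = [rec["text"] for rec in records if rec["is_negated"]]
print(negated_texts)
```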


## Section data

In [156]:
section_data = doc._.section_data

In [157]:
section_data.keys()

odict_keys(['section_category', 'section_title_text', 'section_title_start_char', 'section_title_end_char', 'section_text', 'section_text_start_char', 'section_text_end_char', 'section_parent'])

In [158]:
doc._.get_data("section", as_rows=True)[0]

(None,
 None,
 0,
 0,
 'Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]\n\nDate of Birth:  [**2498-8-19**]             Sex:   F\n\n',
 0,
 134,
 None)

In [159]:
section_df = doc._.to_dataframe("section")

In [160]:
section_df.head()

Unnamed: 0,section_category,section_title_text,section_title_start_char,section_title_end_char,section_text,section_text_start_char,section_text_end_char,section_parent
0,,,0,0,Admission Date: [**2573-5-30**] ...,0,134,
1,other,Service:,134,142,Service: SURGERY\n\n,134,152,
2,allergies,Allergies:,152,162,Allergies:\nHydrochlorothiazide\n\nAttending:[...,152,222,
3,chief_complaint,Chief Complaint:,222,238,Chief Complaint:\nAbdominal pain\n\nMajor Surg...,222,350,
4,history_of_present_illness,History of Present Illness:,350,377,History of Present Illness:\n74y female with t...,350,532,


## Context data

In [161]:
context_data = doc._.context_data

In [162]:
context_data.keys()

odict_keys(['ent_text', 'ent_label_', 'ent_start_char', 'ent_end_char', 'modifier_text', 'modifier_category', 'modifier_direction', 'modifier_start_char', 'modifier_end_char', 'modifier_scope_start_char', 'modifier_scope_end_char'])

In [163]:
doc._.get_data("context", as_rows=True)[:2]

[('metastasis',
  'PROBLEM',
  519,
  529,
  'no evidence of',
  'NEGATED_EXISTENCE',
  'FORWARD',
  504,
  518,
  519,
  518),
 ('alcohol or drug use',
  'PROBLEM',
  788,
  807,
  'No',
  'NEGATED_EXISTENCE',
  'FORWARD',
  785,
  787,
  788,
  787)]

In [164]:
context_df = doc._.to_dataframe("context")

In [165]:
context_df.head()

Unnamed: 0,ent_text,ent_label_,ent_start_char,ent_end_char,modifier_text,modifier_category,modifier_direction,modifier_start_char,modifier_end_char,modifier_scope_start_char,modifier_scope_end_char
0,metastasis,PROBLEM,519,529,no evidence of,NEGATED_EXISTENCE,FORWARD,504,518,519,518
1,alcohol or drug use,PROBLEM,788,807,No,NEGATED_EXISTENCE,FORWARD,785,787,788,787
2,stroke,PROBLEM,838,844,Mother,FAMILY,FORWARD,826,832,833,832
3,aspiration respiratory distress,PROBLEM,1478,1509,h/o,HISTORICAL,FORWARD,1474,1477,1478,1477
4,fever,PROBLEM,1652,1657,if,HYPOTHETICAL,FORWARD,1613,1615,1616,1615


## Doc

In [166]:
doc_data = doc._.doc_data

In [167]:
doc_data.keys()

odict_keys(['text'])

In [168]:
doc_df = doc._.to_dataframe("doc")

In [169]:
doc_df

Unnamed: 0,text
0,Admission Date: [**2573-5-30**] ...


## Customizing attributes
You can customize the values in `dtype_attrs` to modify what attributes are stored. "doc" and "ent" dtypes can take additional attributes which aren't included in the default, but "section" and "context" attributes can only take subsets of the defaults.

You can see the default values by calling the class method below:

In [170]:
DocConsumer.get_default_attrs()

{'ent': ['text',
  'start_char',
  'end_char',
  'label_',
  'is_negated',
  'is_uncertain',
  'is_historical',
  'is_hypothetical',
  'is_family',
  'section_category',
  'section_parent'],
 'section': ['section_category',
  'section_title_text',
  'section_title_start_char',
  'section_title_end_char',
  'section_title_text',
  'section_title_start_char',
  'section_title_end_char',
  'section_text',
  'section_text_start_char',
  'section_text_end_char',
  'section_parent'],
 'context': ['ent_text',
  'ent_label_',
  'ent_start_char',
  'ent_end_char',
  'modifier_text',
  'modifier_category',
  'modifier_direction',
  'modifier_start_char',
  'modifier_end_char',
  'modifier_scope_start_char',
  'modifier_scope_end_char'],
 'doc': ['text']}
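
Since "section" and "context" only accept subsets of their defaults, it can help to check a custom mapping before building a consumer. The helper below is hypothetical and not part of medspaCy; `check_restricted_attrs` and `DEFAULT_ATTRS` are names invented for this sketch, and the default lists are copied by hand from the output above (with the repeated `section_title_*` entries listed once).

```python
# Defaults for the two restricted data types, copied from the output above.
DEFAULT_ATTRS = {
    "section": [
        "section_category", "section_title_text", "section_title_start_char",
        "section_title_end_char", "section_text", "section_text_start_char",
        "section_text_end_char", "section_parent",
    ],
    "context": [
        "ent_text", "ent_label_", "ent_start_char", "ent_end_char",
        "modifier_text", "modifier_category", "modifier_direction",
        "modifier_start_char", "modifier_end_char",
        "modifier_scope_start_char", "modifier_scope_end_char",
    ],
}


def check_restricted_attrs(dtype_attrs):
    """Hypothetical helper: report non-default "section"/"context" attributes."""
    unknown = {}
    for dtype, allowed in DEFAULT_ATTRS.items():
        extra = [a for a in dtype_attrs.get(dtype, []) if a not in allowed]
        if extra:
            unknown[dtype] = extra
    return unknown


problems = check_restricted_attrs({
    # "ent" is unrestricted, so non-default attributes are ignored here.
    "ent": ["lower_", "label_"],
    # "section_body" is a made-up name that is not among the defaults.
    "section": ["section_category", "section_body"],
})
print(problems)
```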

Let's create a second model here and add a second `DocConsumer` with customized attributes. We'll also register a new custom attribute on `Doc`, `report_title`, and include it in the "doc" data.

In [171]:
nlp2 = medspacy.load("en_info_3700_i2b2_2012", enable=enable)



In [172]:
from spacy.tokens import Doc
# force=True lets this cell be re-run without raising E090 (extension already exists)
Doc.set_extension("report_title", default="", force=True)

In [173]:
doc2 = nlp2("There is no evidence of pneumonia.")

In [174]:
doc2._.report_title = "example_document"

In [175]:
doc_consumer2 = DocConsumer(nlp2, dtypes=("ent", "doc"), 
                            dtype_attrs={
                                "ent": [
                                    "lower_",
                                    "label_",
                                    "is_negated",
                                    "section_category",
                                    
                                ],
                                "doc":
                                ["text", "report_title"]
                                
                            }
)

In [176]:
doc_consumer2(doc2)

There is no evidence of pneumonia.

In [177]:
doc2._.to_dataframe("ent")

Unnamed: 0,lower_,label_,is_negated,section_category
0,pneumonia,PROBLEM,True,


In [178]:
doc2._.to_dataframe("doc")

Unnamed: 0,text,report_title
0,There is no evidence of pneumonia.,example_document


# Writer and Reader
The reader and writer classes are utilities for I/O. Here we'll show how to use them to connect to a sqlite database, read in texts, and write the processed results back to a new table.

In [179]:
from medspacy.io.db import DbWriter, DbReader, DbConnect

## DbConnect
DbConnect is a wrapper for either a pyodbc or sqlite3 connection. It can then be passed into the DbReader and DbWriter classes to retrieve/store document data.

You can pass in either information for a pyodbc connection string or directly pass in a sqlite or pyodbc connection object. Here, we'll pass in a connection to our sqlite database.

In [181]:
sq3_conn = sqlite3.connect("./medspacy_demo.db")

In [188]:
cursor = sq3_conn.cursor()

In [182]:
conn = DbConnect(conn=sq3_conn)

Opened connection to None.None


## DbReader
DbReader is a utility for reading docs from a database.

In [196]:
read_query = """
SELECT text
FROM texts
"""

In [197]:
reader = DbReader(conn, read_query)

In [198]:
rslts = reader.read()

Read 2 rows with query: 
SELECT text
FROM texts



In [199]:
texts = [r[0] for r in rslts]

In [201]:
print(texts[0][:100])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2


In [200]:
print(texts[1])

There is no evidence of pneumonia.


#### Now we will process these docs and prepare to write them back:

In [206]:
nlp.add_pipe(doc_consumer)

In [207]:
docs = list(nlp.pipe(texts))

## DbWriter
DbWriter is a utility class for writing structured data back to a database. Here we'll use it to store our processed docs in a new table called `ents`.

Our column names will be the "ent" attributes in our consumer:

In [183]:
doc_consumer.dtype_attrs["ent"]

['text',
 'start_char',
 'end_char',
 'label_',
 'is_negated',
 'is_uncertain',
 'is_historical',
 'is_hypothetical',
 'is_family',
 'section_category',
 'section_parent']

Now we'll define the SQL datatypes for each column:

In [184]:
col_types = [
    "varchar(1000)",
    "int",
    "int",
    "varchar(100)",
    "int",
    "int",
    "int",
    "int",
    "int",
    "varchar(100)",
    "varchar(100)"
]

In [185]:
for (name, typ) in zip(doc_consumer.dtype_attrs["ent"], col_types):
    print(name, typ)

text varchar(1000)
start_char int
end_char int
label_ varchar(100)
is_negated int
is_uncertain int
is_historical int
is_hypothetical int
is_family int
section_category varchar(100)
section_parent varchar(100)


Now we'll instantiate our writer and write each doc's entities to a new table:

In [203]:
writer = DbWriter(conn, "ents", cols=doc_consumer.dtype_attrs["ent"],
                  col_types=col_types,
                  doc_dtype="ent",
                  create_table=True, drop_existing=True)

Created table ents with query: CREATE TABLE ents (text varchar(1000), start_char int, end_char int, label_ varchar(100), is_negated int, is_uncertain int, is_historical int, is_hypothetical int, is_family int, section_category varchar(100), section_parent varchar(100))
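
The logged query pairs each column name with its SQL type in order. As a rough illustration of that pairing (a sketch with the standard library, not `DbWriter`'s actual implementation), the statement can be assembled from two parallel lists. Only the first five columns are used here, and `sketch_conn` is an in-memory database so the demo file is untouched.

```python
import sqlite3

# First five column names and types from the lists defined above.
cols = ["text", "start_char", "end_char", "label_", "is_negated"]
col_types = ["varchar(1000)", "int", "int", "varchar(100)", "int"]

# Pair each name with its type, in order, to build the column definitions.
col_defs = ", ".join(f"{name} {typ}" for name, typ in zip(cols, col_types))
create_query = f"CREATE TABLE ents ({col_defs})"

sketch_conn = sqlite3.connect(":memory:")
sketch_conn.execute(create_query)

# PRAGMA table_info returns one row per column; index 1 is the column name.
created_cols = [row[1] for row in sketch_conn.execute("PRAGMA table_info(ents)")]
print(create_query)
```

Because the two lists are matched by position, `col_types` must follow the same order as the attribute list passed as `cols`.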


In [208]:
for doc in docs:
    writer.write(doc)

Wrote 38 rows with query: INSERT INTO ents (text, start_char, end_char, label_, is_negated, is_uncertain, is_historical, is_hypothetical, is_family, section_category, section_parent) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Wrote 1 rows with query: INSERT INTO ents (text, start_char, end_char, label_, is_negated, is_uncertain, is_historical, is_hypothetical, is_family, section_category, section_parent) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
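
The `?` placeholders in the logged query are sqlite3's parameter markers: each row tuple is bound to them in order. The sketch below reproduces that pattern with the standard library on a reduced three-column table; it is an assumption-labelled illustration and does not show how `DbWriter` is implemented internally.

```python
import sqlite3

sketch_conn = sqlite3.connect(":memory:")
sketch_conn.execute(
    "CREATE TABLE ents (text varchar(1000), label_ varchar(100), is_negated int)"
)

# Row tuples in the same order as the table columns.
rows = [
    ("Hydrochlorothiazide", "TREATMENT", False),
    ("metastasis", "PROBLEM", True),
]

# One "?" per column; sqlite3 binds each tuple's values to them in order.
placeholders = ", ".join("?" for _ in rows[0])
insert_query = f"INSERT INTO ents (text, label_, is_negated) VALUES ({placeholders})"
sketch_conn.executemany(insert_query, rows)
sketch_conn.commit()

# Python booleans bound to an int column come back as integers.
stored = sketch_conn.execute(
    "SELECT text, is_negated FROM ents ORDER BY rowid"
).fetchall()
print(stored)
```

This is why the boolean attributes such as `is_negated` were given the `int` column type earlier.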


Now we can query our table to retrieve the data we wrote:

In [209]:
query = """
SELECT *
FROM ents
LIMIT 2;
"""

In [210]:
cursor.execute(query)

<sqlite3.Cursor at 0x13132ef10>

In [211]:
cursor.fetchall()

[('Hydrochlorothiazide',
  163,
  182,
  'TREATMENT',
  0,
  0,
  0,
  0,
  0,
  'allergies',
  None),
 ('Abdominal pain',
  239,
  253,
  'PROBLEM',
  0,
  0,
  0,
  0,
  0,
  'chief_complaint',
  None)]

In [215]:
query = """
SELECT label_, COUNT(1)
FROM ents
GROUP BY label_
"""

In [216]:
cursor.execute(query)

<sqlite3.Cursor at 0x13132ef10>

In [217]:
cursor.fetchall()

[('PROBLEM', 21), ('TEST', 5), ('TREATMENT', 13)]

## Pipeline
Need to refactor