# MedspaCy OHDSI Tutorial
This notebook introduces [**medspaCy**](https://github.com/medspacy/medspacy). We start with a quick overview of the goals of medspaCy and how it can be used in clinical NLP. We then go step by step through a typical clinical NLP workflow and show how each component of medspaCy can be used to extract information from clinical text.

First, we'll get set up by installing medspaCy and some pre-trained spaCy models. If you don't already have these installed, you may have to restart your kernel before you can load them:

In [1]:
# install medspaCy and dependencies
!pip install medspacy==0.1.0.2



In [2]:
# Install a general English language model
!python -m spacy download "en_core_web_sm"

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [3]:
# Install a pre-trained clinical NER model
!pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz

Collecting https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
  Using cached https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz (12.3 MB)
Building wheels for collected packages: en-info-3700-i2b2-2012
  Building wheel for en-info-3700-i2b2-2012 (setup.py) ... done
  Created wheel for en-info-3700-i2b2-2012: filename=en_info_3700_i2b2_2012-0.1.0-py3-none-any.whl size=12270783 sha256=20cc2c8e6f91898b96b4711e9e0020d7f890442c098127c3d08b46662fa3afb0
  Stored in directory: /Users/alecchapman/Library/Caches/pip/wheels/78/e8/22/863c5e1287f38607d2177f47f31cba9686310ab519d46ba4d9
Successfully built en-info-3700-i2b2-2012


In [4]:
import warnings
warnings.filterwarnings('once')

# Outline

#### **I. Background**
#### **II. Quick Overview of spaCy**
#### **III. Clinical NLP with medspaCy**
#### **IV. Future Work and Additional Info**

# I. Background
![MedspaCy logo](https://github.com/medspacy/medspacy/blob/master/images/medspacy_logo.png?raw=true)

Python has become the dominant programming language for data science and natural language processing (NLP). Most of the largest open-source data science projects are developed in Python, and some of them are among the most active open-source projects ever. TensorFlow had over 11,000 unique contributors in 2020 alone.

Many of the largest clinical natural language processing projects and frameworks are written in Java, so the clinical NLP community needs to adopt Python to benefit from the work of the large Python data science community. However, many Python projects for clinical or biomedical NLP are built for specific projects or research groups, or to redistribute models trained for specific tasks, which keeps widespread code reuse low.

We have developed medspaCy specifically to meet this need. medspaCy is a library of tools for performing clinical NLP and text processing tasks with the popular [spaCy](https://spacy.io) framework. medspaCy is designed to unify many of the most common clinical text processing algorithms (context analysis, section detection, UMLS mapping, etc.) into one API and style.

medspaCy aims to allow for the seamless integration of essential rule-based clinical NLP methods with the growing capabilities of the most popular Python libraries.

## medspaCy is...

### ... a toolkit
Unlike libraries such as [scispaCy](https://allenai.github.io/scispacy/) and [medCAT](https://github.com/CogStack/MedCAT), the main goal of medspaCy is not to implement pre-trained clinical models. Instead, medspaCy is a toolkit for designing user-specific clinical NLP pipelines. medspaCy offers a number of rule-based components that let users easily write rules to extract specific concepts, and these components can be integrated with more sophisticated techniques from other sources.

### ... good for prototypes and rapid development
medspaCy facilitates rapid development by offering default configurations for all components, so everything works out of the box. It also works well with interactive development tools, such as its visualization functions and Jupyter notebooks.

medspaCy is simple to install and requires no admin privileges to get an environment set up on a computer.

### ... customizable
All clinical data differs and no two clinical NLP tasks are the same. medspaCy components can be easily customized with user-defined rules. One of the main advantages of spaCy is its flexible architecture, which allows you to mix and match different models and components. Similarly with medspaCy, you can add components to existing pipelines, including statistical models trained using spaCy or other frameworks.

### ... compatible with other spaCy projects
medspaCy components do not add extra layers to the spaCy API, allowing medspaCy components to be used alongside other components, such as those from libraries like scispaCy or any custom components developed for a specific task.

## How is medspaCy used?

### medspaCy team's projects

medspaCy has been used in many VA projects such as:
- [VA COVID-19 Surveillance](https://www.aclweb.org/anthology/2020.nlpcovid19-acl.10/): An operational pipeline for identifying COVID positive cases in the Department of Veterans Affairs
- Veteran homelessness and housing stability
- Templated document processing

# II. Quick Overview of spaCy

We will *very* briefly go over basic features of spaCy to make sure some medspaCy terminology is established.

More in-depth resources for spaCy usage are available at the [project website](https://spacy.io) and in the [online course](https://course.spacy.io/en/).

## Getting Started

The first step is always importing the library you want to use.

In [5]:
import spacy


Like medspaCy, spaCy is primarily a toolkit and framework. It does not include any models or data immediately after installing and importing it.

The spaCy developers and community distribute a large variety of models for different tasks. Each model is named according to the language, use case, training source, and size.

At the top of this notebook, we installed a spaCy model using this command:
```bash
python -m spacy download "en_core_web_sm"
```

`en_core_web_sm` is one of the basic spaCy models: `en` for English, `core` for core/general use, `web` for training on internet data, and `sm` for the small size.
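
The naming convention can be made concrete with a small helper. This is only an illustrative sketch: `parse_model_name` and `ModelName` are hypothetical names that are not part of spaCy, and the field names are our own labels for the four parts of the name.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """The four parts of a conventionally named spaCy model (hypothetical helper)."""
    language: str
    model_type: str
    genre: str
    size: str


def parse_model_name(name: str) -> ModelName:
    """Split a four-part model name such as 'en_core_web_sm' into its parts."""
    parts = name.split("_")
    if len(parts) != 4:
        # Not every installable model follows the four-part convention.
        raise ValueError(f"expected 4 underscore-separated parts, got {name!r}")
    return ModelName(*parts)


parsed = parse_model_name("en_core_web_sm")
print(parsed.language, parsed.model_type, parsed.genre, parsed.size)
```

Names that do not follow the four-part convention, such as the clinical model `en_info_3700_i2b2_2012` installed above, make this helper raise a `ValueError`.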

## Loading a spaCy model

These distributed spaCy models are simple to load. The `spacy.load` function looks up installed models and loads them by name.

Loading a model involves opening vocabulary files, pre-trained weights, and other resources, and using them to initialize the components of a spaCy pipeline object, which is typically named `nlp`.

In [6]:
nlp = spacy.load("en_core_web_sm")

The `en_core_web_sm` model loads a part-of-speech tagger, a dependency parser (which also performs sentence splitting), and a named entity recognition component with the OntoNotes labels (PERSON, GPE, DATE, CARDINAL, etc.).

Every spaCy pipeline includes a tokenizer, but it is not listed among the pipeline components and is not easily altered, because it is the foundation of all of spaCy's other components.

In [7]:
nlp.pipe_names

['tagger', 'parser', 'ner']

## Using a spaCy pipeline

`nlp` is a callable object and takes in the text to process. It applies each component in `nlp` sequentially. 

In our case, the text passes through the `tokenizer`, then the `tagger`, then the `parser`, and finally `ner`.
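
To make the idea of sequential processing concrete, here is a toy pipeline written in plain Python. It is a sketch under our own assumptions and not how spaCy is implemented: `ToyPipeline`, the dictionary-based "doc", and the placeholder components are all hypothetical.

```python
# Toy stand-ins for pipeline components (hypothetical, not spaCy code).
def toy_tokenizer(text):
    # The tokenizer always runs first and creates the "doc".
    return {"text": text, "tokens": text.split()}


def toy_tagger(doc):
    # Placeholder tagging: treat capitalized tokens as proper nouns.
    doc["tags"] = ["PROPN" if tok[0].isupper() else "X" for tok in doc["tokens"]]
    return doc


def toy_ner(doc):
    # Depends on the tags set by the previous component.
    doc["ents"] = [tok for tok, tag in zip(doc["tokens"], doc["tags"]) if tag == "PROPN"]
    return doc


class ToyPipeline:
    def __init__(self, tokenizer, components):
        self.tokenizer = tokenizer
        self.components = components  # ordered list of (name, callable) pairs

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, text):
        doc = self.tokenizer(text)
        # Apply each component in order; each one receives the previous result.
        for _name, component in self.components:
            doc = component(doc)
        return doc


toy_nlp = ToyPipeline(toy_tokenizer, [("tagger", toy_tagger), ("ner", toy_ner)])
toy_doc = toy_nlp("President Biden announced a plan")
print(toy_nlp.pipe_names)
print(toy_doc["ents"])
```

Because `toy_ner` reads the tags written by `toy_tagger`, swapping their order would fail, which is why the order of components in a pipeline matters.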

Because these spaCy models are designed for general English text, our example will be the first sentence of a recent [New York Times article](https://www.nytimes.com/live/2021/03/02/world/covid-19-coronavirus/biden-says-there-will-be-enough-vaccine-available-for-all-adults-by-the-end-of-may-as-johnson-johnson-makes-a-deal-to-boost-supp).

In [8]:
text = "President Biden announced Tuesday that there would be \
enough doses of the coronavirus vaccine available for the \
entire adult population in the United States by the end of \
May, though he said it will take longer to inoculate everyone \
and he urged people to remain vigilant by wearing masks."

A spaCy `Doc` object is returned when processing is done. A `Doc` is just a container for the results of any spaCy pipeline. These are usually called `doc`.

In [9]:
doc = nlp(text)

## Using the results

`doc` will have certain properties that allow you to see the results of the processing.

In [10]:
doc.ents

(Biden, Tuesday, the United States, the end of May)

Like `Doc`, spaCy also has containers for results at a more specific level: `Token` and `Span`. To access a `Token` or a `Span`, index or slice `doc` as you would a Python list.

For example, we can look at the `Token` at index 1 of `doc`.

In [11]:
doc[1]

Biden

In [12]:
type(doc[1])

spacy.tokens.token.Token

In [13]:
doc[1].pos_

'PROPN'

## Visualizing the results 

spaCy also includes some tools for visualization. `displaCy` is a spaCy module that can display entities and dependency parses.

In [14]:
from spacy import displacy

In [15]:
displacy.render(doc, style='ent')

In [16]:
displacy.render(doc, style='dep')

## More spaCy resources
- [spaCy documentation](https://spacy.io/)
- [A free online course](https://course.spacy.io/en/) from the makers of spaCy

# III. Clinical NLP with medspaCy

## Getting Started with medspaCy

You can install medspaCy using pip:
```bash
pip install medspacy==0.1.0.2
```

To get started with medspaCy, you can load a pipeline by calling `medspacy.load()`. By default, this will load the following three pipeline components:
- `PyRuSHSentencizer`: Uses [PyRuSH](https://github.com/jianlins/PyRuSH) for clinical sentence segmentation
- `TargetMatcher`: A rule-based concept extractor
- `ConTextComponent`: An implementation of the [ConText](https://pubmed.ncbi.nlm.nih.gov/19435614/) algorithm for detecting attributes like negation and temporality

You can also start a medspaCy pipeline by loading `en_core_web_sm` or any other spaCy model, but keep in mind the domain limitations of components like NER using OntoNotes labels for clinical text.

Throughout this notebook, we'll customize these components as well as add new ones for additional processing steps.

In [17]:
import medspacy
nlp = medspacy.load()

In [18]:
nlp.pipe_names

['sentencizer', 'target_matcher', 'context']

In [19]:
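import urllib.request  # importing `urllib` alone does not guarantee `urllib.request` is loaded

def read_discharge_summary():
    """Download the example discharge summary used throughout this notebook."""
    url = "https://raw.githubusercontent.com/medspacy/OHDSI_Tutorial/master/discharge_summary.txt"

    with urllib.request.urlopen(url) as f:
        text = f.read().decode()
    return text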

text = read_discharge_summary()

In [20]:
print(text[:500])

Date of Birth:  [**2498-8-19**]             Sex:   F

History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis.

Past Medical History:
1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
chemo. Last colonoscopy showed: Last CEA was in the 8 range
(down from 9)
2. Type II Diabetes Mellitus
3. Hypertension

Social History:
Married, former tobacco use. No alcohol 


Just like a normal spaCy model, you process a text by calling `nlp(text)`, which returns a `Doc` object:

In [21]:
doc = nlp(text)

## Common Clinical NLP Tasks and medspaCy
medspaCy is built as a modular set of **pipeline components**, each of which handles a specific NLP task. Because of spaCy's flexible framework, you can easily add new components, including pre-trained or custom models.

In this notebook, we'll walk through the following processing steps:
- **Rule-Based Concept Extraction**
- **Statistical NER**
- **Contextual Analysis**
- **Section Detection**
- **Input/Output**

## Rule-Based Concept Extraction
In this step, we'll manually define rules to extract clinical concepts from the text.

In this example, we'll use two classes provided in `medspacy.ner` for rule-based matching: the `TargetMatcher` and `TargetRule`. These expand on spaCy's native [rule-based matching](https://spacy.io/usage/rule-based-matching) and add some additional functionality.

When the `TargetMatcher` processes a doc, it adds each span matched by a `TargetRule` to `doc.ents`, which contains all of the extracted entities for a doc.

In [22]:
from medspacy.ner import TargetMatcher, TargetRule

In [23]:
target_matcher = nlp.get_pipe("target_matcher")

We define a rule for extracting entities with the `TargetRule` class:

In [24]:
target_rules = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("metastasis", "PROBLEM"),
    
]

In [25]:
target_matcher.add(target_rules)

In [26]:
doc = nlp(text)

In [27]:
for ent in doc.ents:
    print(ent, ent.label_)

stroke PROBLEM
abdominal pain PROBLEM
metastasis PROBLEM
Colon cancer PROBLEM
hemicolectomy TREATMENT
stroke PROBLEM
abdominal pain PROBLEM
abdominal pain PROBLEM
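
Conceptually, literal rules like the ones above behave like a case-insensitive phrase search that returns labeled spans in document order. The sketch below imitates that behavior with the standard `re` module. It is not how `TargetMatcher` is implemented (medspaCy matches on spaCy tokens), and `match_literals` is a hypothetical helper.

```python
import re

# (literal, category) pairs mirroring some of the rules defined above.
RULES = [
    ("abdominal pain", "PROBLEM"),
    ("stroke", "PROBLEM"),
    ("hemicolectomy", "TREATMENT"),
    ("colon cancer", "PROBLEM"),
]


def match_literals(text, rules):
    """Return (start, end, matched_text, category) tuples sorted by position."""
    spans = []
    for literal, category in rules:
        # Word boundaries avoid matching inside longer words.
        pattern = r"\b" + re.escape(literal) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), m.group(), category))
    return sorted(spans)


sample = "Colon cancer tx'd with hemicolectomy. Presents with abdominal pain."
found = match_literals(sample, RULES)
for start, end, span_text, category in found:
    print(span_text, category)
```

As in the output above, "Colon cancer" is found even though the rule was written in lowercase.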


In [28]:
from medspacy.visualization import visualize_ent

In [29]:
visualize_ent(doc)

### Advanced Rule-Based Concept Extraction
spaCy has powerful pattern matching, which lets you match on a list of dictionaries that define attributes for each token. See [spaCy's rule-based matching documentation](https://spacy.io/usage/rule-based-matching) for details and examples. Additionally, medspaCy allows matching with regular expressions on the underlying text of the doc.

In [30]:
pattern_rules = [
    # Using spaCy's dictionary token patterns
    TargetRule("Acetaminophen", "TREATMENT",
               pattern=[
                   {"LOWER": {"IN": ["acetaminophen", "tylenol"]}},
                   {"LIKE_NUM": True, "OP": "?"},
                   {"LOWER": "mg", "OP": "?"}
               ],
              ),
    
    # Using regular expressions
    TargetRule("diabetes", "PROBLEM",
              pattern=r"type (i|ii|1|2|one|two) (dm|diabetes mellitus)"),
]

In [31]:
target_matcher.add(pattern_rules)



In [32]:
sm_text = """
    Discharge Medications: Acetaminophen 160 mg
    Prescribed tylenol for the pain
    74y female with type 2 dm and a recent stroke.
    Diagnoses: Type II Diabetes Mellitus
"""

In [33]:
sm_doc = nlp(sm_text)

In [34]:
for ent in sm_doc.ents:
    print(ent, ent.label_)

Acetaminophen 160 mg TREATMENT
tylenol TREATMENT
type 2 dm PROBLEM
stroke PROBLEM
Type II Diabetes Mellitus PROBLEM
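
The regular expression in the diabetes rule can be tried on its own with Python's `re` module before it is added to the matcher. In this sketch the `re.IGNORECASE` flag is our assumption, chosen because the output above shows the lowercase pattern matching "Type II Diabetes Mellitus". `DIABETES_PATTERN` is just an illustrative name.

```python
import re

# Same pattern string as the TargetRule above; the flag is an assumption.
DIABETES_PATTERN = re.compile(
    r"type (i|ii|1|2|one|two) (dm|diabetes mellitus)",
    flags=re.IGNORECASE,
)

sample = (
    "74y female with type 2 dm and a recent stroke.\n"
    "Diagnoses: Type II Diabetes Mellitus"
)

diabetes_matches = [m.group() for m in DIABETES_PATTERN.finditer(sample)]
print(diabetes_matches)

# The pattern requires "dm" or "diabetes mellitus", so this shorter form is missed.
print(DIABETES_PATTERN.search("type 2 diabetes"))
```

Testing a pattern this way makes gaps easy to spot: the phrase "type 2 diabetes" is not matched, so a real rule set might need an additional alternative.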


In [35]:
visualize_ent(sm_doc)

## Statistical NER
While rule-based models are still very useful in clinical NLP, many systems are designed as **statistical models**. In this section, we'll show how to use a pre-trained model for target concept extraction instead of defining rules. We'll then add our additional components to show how medspaCy can be used to combine statistical NLP with other rule-based components.

As an example, we'll load a model that was pre-trained on clinical data. This model was trained using spaCy with data from the i2b2 2012 shared task: [**"Evaluating temporal relations in clinical text"**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756273/).

We installed this model at the beginning of this notebook with `pip`:
```bash
pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
```

In [36]:
nlp = medspacy.load("en_info_3700_i2b2_2012")



Now let's reprocess our text and see what our pre-trained model extracts:

In [37]:
doc = nlp(text)

In [38]:
visualize_ent(doc)

The model extracted many more concepts, but missed some of the spans we defined earlier, like **"type ii diabetes"**. Luckily, we can combine statistical and rule-based models by adding the rules we defined to the `TargetMatcher` component.

In [39]:
target_matcher = nlp.get_pipe("target_matcher")
target_matcher.add(target_rules)
target_matcher.add(pattern_rules)



In [40]:
doc = nlp(text)

In [41]:
visualize_ent(doc)

## ConText

Clinical text often contains mentions of concepts which the patient did not actually experience. For example:

- "There is *no evidence of* **pneumonia**"
- "*Mother* with **breast cancer**"
- "Patient presents for *r/o* **COVID-19**"

In all of these instances, we need to use the contextual clues around the entity to assert attributes like negation, experiencer, and uncertainty.

The [ConText algorithm](https://www.sciencedirect.com/science/article/pii/S1532046409000744) is a popular method for asserting attributes of entities in clinical text such as **negation**, **temporality**, and **experiencer**. ConText is implemented in medspaCy using the `ConTextComponent`, which is loaded as part of a standard model.

We can inspect the modifier-entity relationships using medspaCy's `visualize_dep` function, which draws arrows between modifiers and the entities that they modify.

In [42]:
from medspacy.visualization import visualize_dep

In [43]:
doc = nlp("There is no evidence of pneumonia.")

In [44]:
visualize_dep(doc)
visualize_ent(doc)

In [45]:
doc = nlp("Mother with stroke at age 82.")
visualize_dep(doc)
visualize_ent(doc)

In addition to linking entities and modifiers, ConText also sets a number of boolean attributes indicating whether the entity is negated, experienced by someone else, etc.

In [46]:
ent = doc.ents[0]
print(ent, "is_family", ent._.is_family)
print(ent,  "is_negated", ent._.is_negated)

stroke is_family True
stroke is_negated False
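
The core idea behind these attributes can be sketched in a few lines of plain Python. This is a heavily simplified illustration and not medspaCy's implementation: it only handles modifiers that come before the entity in a single sentence, and the `MODIFIERS` list and `assert_attributes` function are hypothetical.

```python
import re

# Hypothetical, minimal modifier rules: (phrase, category, attribute to set).
MODIFIERS = [
    ("no evidence of", "NEGATED_EXISTENCE", "is_negated"),
    ("mother", "FAMILY", "is_family"),
]


def assert_attributes(sentence, entity):
    """Set boolean attributes for `entity` based on modifiers that precede it."""
    attrs = {"is_negated": False, "is_family": False}
    lowered = sentence.lower()
    ent_start = lowered.find(entity.lower())
    if ent_start == -1:
        raise ValueError(f"{entity!r} not found in sentence")
    for phrase, _category, attr in MODIFIERS:
        mod = re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
        # Forward direction: the modifier must end before the entity begins.
        if mod and mod.end() <= ent_start:
            attrs[attr] = True
    return attrs


print(assert_attributes("There is no evidence of pneumonia.", "pneumonia"))
print(assert_attributes("Mother with stroke at age 82.", "stroke"))
```

The real algorithm goes further than this sketch, for example by supporting modifiers in both directions and limiting how far a modifier's scope extends.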


### Customizing ConText
When you load ConText in medspaCy, it comes with a default set of rules. However, you'll often need to add new rules to match your data or implement new categories.

Custom modifiers can be defined using the `ConTextRule` class:

In [47]:
from medspacy.context import ConTextRule

In [48]:
context = nlp.get_pipe("context")

In [49]:
context_rule = ConTextRule(
    "diagnosed in <YEAR>",
    "HISTORICAL",
    direction="BACKWARD",
    pattern=r"(diagnosed|dx'd) in (19|20)[\d]{2}",
)

In [50]:
context.add(context_rule)



In [51]:
short_doc = nlp("Colon cancer diagnosed in 2012")

In [52]:
visualize_dep(short_doc)
visualize_ent(short_doc)
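
Before relying on a custom rule, it can help to check its regular expression against a few example phrases. The sketch below does this with Python's `re` module. The `re.IGNORECASE` flag and the example sentences are our assumptions, and `YEAR_RULE` is an illustrative name.

```python
import re

# Same pattern string as the ConTextRule above; the flag is an assumption.
YEAR_RULE = re.compile(r"(diagnosed|dx'd) in (19|20)[\d]{2}", flags=re.IGNORECASE)

examples = [
    "Colon cancer diagnosed in 2012",
    "Colon cancer dx'd in 1998",
    "Colon cancer dx'd in [**2554**]",  # de-identified year placeholder
]

rule_fires = {sentence: bool(YEAR_RULE.search(sentence)) for sentence in examples}
for sentence, fired in rule_fires.items():
    print(fired, "-", sentence)
```

The last example follows the de-identified date format seen in the discharge summary. The pattern expects a plain four-digit year starting with 19 or 20, so it does not match there, and notes with such placeholders would need a broader pattern.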

## Section detection
We are often interested in which section of a clinical note an entity occurs in. This can be useful for excluding entities from certain sections, like the past medical history or problem list, setting attributes like temporality (similar to ConText), or for extracting entities from specific sections of the note.

medspaCy includes the `Sectionizer` class for identifying sections in a note.

In [53]:
from medspacy.section_detection import Sectionizer

In [54]:
sectionizer = Sectionizer(nlp)

In [55]:
nlp.add_pipe(sectionizer)

In [56]:
doc = nlp(text)

medspaCy will visualize the sections along with entities and modifiers in gray highlighting with **<\< \>>** tags:

In [57]:
visualize_ent(doc)

We can see the normalized section name for each entity as well:

In [58]:
for ent in doc.ents[:10]:
    print(ent, "-->", ent._.section_category)

type 2 dm --> history_of_present_illness
a recent stroke --> history_of_present_illness
abdominal pain --> history_of_present_illness
Imaging --> history_of_present_illness
metastasis --> history_of_present_illness
Colon cancer --> past_medical_history
hemicolectomy --> past_medical_history
XRT --> past_medical_history
chemo --> past_medical_history
colonoscopy --> past_medical_history
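
The mapping from an entity to a section category can be pictured as a lookup of the entity's character offset within header-delimited spans. The following sketch shows that idea in plain Python. It is not the `Sectionizer` implementation, and `SECTION_HEADERS`, `split_sections`, and `section_of` are hypothetical names.

```python
import re

# Hypothetical header-to-category mapping; medspaCy ships its own default rules.
SECTION_HEADERS = {
    "History of Present Illness:": "history_of_present_illness",
    "Past Medical History:": "past_medical_history",
    "Social History:": "social_history",
}


def split_sections(text):
    """Return (category, start, end) for each section found in `text`."""
    pattern = "|".join(re.escape(header) for header in SECTION_HEADERS)
    matches = list(re.finditer(pattern, text))
    sections = []
    for i, m in enumerate(matches):
        # A section runs from its header to the next header (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((SECTION_HEADERS[m.group()], m.start(), end))
    return sections


def section_of(text, char_index):
    """Return the category of the section containing `char_index`, if any."""
    for category, start, end in split_sections(text):
        if start <= char_index < end:
            return category
    return None


note = (
    "History of Present Illness:\n"
    "74y female with type 2 dm.\n\n"
    "Past Medical History:\n"
    "1. Colon cancer\n"
)
print(section_of(note, note.index("type 2 dm")))
print(section_of(note, note.index("Colon cancer")))
```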


### Custom Section Detection
Note structures vary widely between different EHRs and institutions, so it's important to define sections which match your note structure. The `SectionRule` defines sections to extract, and follows the same API as `TargetRule` and `ConTextRule`.

Here we'll add a rule to create a **patient_demographics** section around the patient DOB:

In [59]:
from medspacy.section_detection import SectionRule

In [60]:
rule = SectionRule("Date of Birth:", "patient_demographics")

In [61]:
sectionizer.add(rule)

In [62]:
visualize_ent(nlp(text[:200]))

## Input/Output
Finally, once we've processed a text or corpus, we'll want to save our extracted data to disk or a database. The `medspacy.io` module has utilities for converting docs to structured data.

### Extracting Structured Data
First, the `DocConsumer` will take various levels of information from a doc and generate structured data.

In [63]:
from medspacy.io import DocConsumer

In [64]:
doc_consumer = DocConsumer(nlp, dtypes=("ent", "context", "section", "doc"))

The `DocConsumer` will add structured data as a dictionary to the `doc._.data` attribute, which contains one key for each level:

In [65]:
nlp.add_pipe(doc_consumer)

In [66]:
doc = nlp(text)

In [67]:
# doc._.data

If you have `pandas` installed, you can then directly convert a doc to a dataframe, which shows the attributes extracted for each entity:

In [68]:
# !pip install pandas

In [69]:
%%capture
import pandas as pd

In [70]:
doc._.to_dataframe("ent").head()

| | text | start_char | end_char | label_ | is_negated | is_uncertain | is_historical | is_hypothetical | is_family | section_category | section_parent |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | type 2 dm | 98 | 107 | PROBLEM | False | False | False | False | False | history_of_present_illness | |
| 1 | a recent stroke | 112 | 127 | PROBLEM | False | False | False | False | False | history_of_present_illness | |
| 2 | abdominal pain | 178 | 192 | PROBLEM | False | False | False | False | False | history_of_present_illness | |
| 3 | Imaging | 194 | 201 | TEST | False | False | False | False | False | history_of_present_illness | |
| 4 | metastasis | 223 | 233 | PROBLEM | True | False | False | False | False | history_of_present_illness | |


In [71]:
doc._.to_dataframe("section").head()

| | section_category | section_title_text | section_title_start_char | section_title_end_char | section_text | section_text_start_char | section_text_end_char | section_parent |
|---|---|---|---|---|---|---|---|---|
| 0 | patient_demographics | Date of Birth: | 0 | 14 | `Date of Birth: [**2498-8-19**] Se...` | 0 | 54 | |
| 1 | history_of_present_illness | History of Present Illness: | 54 | 81 | `History of Present Illness:\n74y female with t...` | 54 | 236 | |
| 2 | past_medical_history | Past Medical History: | 236 | 257 | `Past Medical History:\n1. Colon cancer dx'd in...` | 236 | 444 | |
| 3 | social_history | Social History: | 444 | 459 | `Social History:\nMarried, former tobacco use. ...` | 444 | 514 | |
| 4 | family_history | Family History: | 514 | 529 | `Family History:\nMother with stroke at age 82....` | 514 | 600 | |


In [72]:
doc._.to_dataframe("context").head()

| | ent_text | ent_label_ | ent_start_char | ent_end_char | modifier_text | modifier_category | modifier_direction | modifier_start_char | modifier_end_char | modifier_scope_start_char | modifier_scope_end_char |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | metastasis | PROBLEM | 223 | 233 | no evidence of | NEGATED_EXISTENCE | FORWARD | 208 | 222 | 223 | 222 |
| 1 | alcohol or drug use | PROBLEM | 492 | 511 | No | NEGATED_EXISTENCE | FORWARD | 489 | 491 | 492 | 491 |
| 2 | stroke | PROBLEM | 542 | 548 | Mother | FAMILY | FORWARD | 530 | 536 | 537 | 536 |
| 3 | fever | PROBLEM | 1135 | 1140 | if | HYPOTHETICAL | FORWARD | 1096 | 1098 | 1099 | 1098 |
| 4 | nausea | PROBLEM | 1149 | 1155 | if | HYPOTHETICAL | FORWARD | 1096 | 1098 | 1099 | 1098 |


In [73]:
doc._.to_dataframe("doc")

| | text |
|---|---|
| 0 | `Date of Birth: [**2498-8-19**] Se...` |


### Reading and Writing to a Database
As a final step, we'll write this structured data to a database. The `DbConnect`, `DbReader` and `DbWriter` classes will handle connecting to a database, creating tables, and inserting doc data for us. 

Currently, medspaCy database classes support `sqlite3` or `pyodbc` databases. The function below will create a simple sqlite database which includes our discharge summary and a few additional short texts.

In [74]:
def create_medspacy_demo_db(drop_existing=True):
    import os
    if drop_existing is False and os.path.exists("medspacy_demo.db"):
        print("File medspacy_demo.db already exists")
        return
    
    text = read_discharge_summary()

    import sqlite3 as s3

    texts = [
        "There is no evidence of pneumonia.",
        "Her mother was diagnosed with breast cancer.",
        text,
        
    ]

    conn = s3.connect("medspacy_demo.db")

    cursor = conn.cursor()
    cursor.execute("DROP TABLE IF EXISTS texts;")
    cursor.execute("CREATE TABLE texts (text_id INTEGER PRIMARY KEY, text NOT NULL);")

    for text in texts:
        cursor.execute("INSERT INTO texts (text) VALUES (?)", (text,))
    conn.commit()
    conn.close()
    print("Created file medspacy_demo.db")

In [75]:
create_medspacy_demo_db(drop_existing=True)

Created file medspacy_demo.db


First, we'll create a connection to our database using `sqlite3` and medspaCy's `DbConnect` class:

In [76]:
from medspacy.io import DbConnect

In [77]:
import sqlite3

In [78]:
sq_conn = sqlite3.connect("medspacy_demo.db")

In [79]:
conn = DbConnect(conn=sq_conn)

Opened connection to None.None


Next, we'll define a query to load our texts and pass it into a `DbReader` class:

In [80]:
from medspacy.io import DbReader

In [81]:
# Pass in our connection and a query to read texts:
read_query = """
SELECT text
FROM texts
"""
reader = DbReader(conn, read_query)

In [82]:
texts = [r[0] for r in reader.read()] 

Read 3 rows with query: 
SELECT text
FROM texts



Finally, we'll process our texts, create a `DbWriter` object, and then write the extracted entities back to the database:

In [83]:
docs = list(nlp.pipe(texts))

In [84]:
from medspacy.io import DbWriter

In [85]:
writer = DbWriter(conn, destination_table="ents", create_table=True, drop_existing=True)

Created table ents with query: CREATE TABLE ents (text varchar(50), start_char int, end_char int, label_ varchar(50), is_negated int, is_uncertain int, is_historical int, is_hypothetical int, is_family int, section_category int, section_parent int)


In [86]:
for doc in docs:
    writer.write(doc)

Wrote 1 rows with query: INSERT INTO ents (text, start_char, end_char, label_, is_negated, is_uncertain, is_historical, is_hypothetical, is_family, section_category, section_parent) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Wrote 1 rows with query: INSERT INTO ents (text, start_char, end_char, label_, is_negated, is_uncertain, is_historical, is_hypothetical, is_family, section_category, section_parent) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Wrote 30 rows with query: INSERT INTO ents (text, start_char, end_char, label_, is_negated, is_uncertain, is_historical, is_hypothetical, is_family, section_category, section_parent) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)


Now, we have a structured dataset that we can query and analyze:

In [87]:
cursor = sq_conn.cursor()

In [88]:
cursor.execute("SELECT label_, COUNT(1) FROM ents GROUP BY label_;")
cursor.fetchall()

[('PROBLEM', 22), ('TEST', 4), ('TREATMENT', 6)]

In [89]:
# Find examples of family history
cursor.execute("SELECT text, label_ FROM ents WHERE is_family = 1 LIMIT 5; ")
cursor.fetchall()

[('breast cancer', 'PROBLEM'), ('stroke', 'PROBLEM')]

In [90]:
# Find entities whose text mentions cancer
cursor.execute("SELECT text, label_ FROM ents WHERE text LIKE '%cancer%' LIMIT 5; ")
cursor.fetchall()

[('breast cancer', 'PROBLEM'), ('Colon cancer', 'PROBLEM')]
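
Because the ConText attributes are stored as integer columns, they can be combined in ordinary SQL filters. The sketch below demonstrates this on a small in-memory SQLite table so that it runs without the demo database. The reduced four-column schema and the hand-entered rows are illustrative assumptions, not the table written by `DbWriter`.

```python
import sqlite3

# In-memory database with a reduced, hypothetical version of the ents table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE ents (text TEXT, label_ TEXT, is_negated INTEGER, is_family INTEGER)"
)
demo_rows = [
    ("type 2 dm", "PROBLEM", 0, 0),
    ("Imaging", "TEST", 0, 0),
    ("metastasis", "PROBLEM", 1, 0),
    ("stroke", "PROBLEM", 0, 1),
]
demo_conn.executemany("INSERT INTO ents VALUES (?, ?, ?, ?)", demo_rows)

# Entities asserted for the patient: neither negated nor attributed to family.
affirmed = demo_conn.execute(
    "SELECT text FROM ents WHERE is_negated = 0 AND is_family = 0 ORDER BY text"
).fetchall()

# Number of negated entities per label.
negated_counts = demo_conn.execute(
    "SELECT label_, SUM(is_negated) FROM ents GROUP BY label_ ORDER BY label_"
).fetchall()

demo_conn.close()
print(affirmed)
print(negated_counts)
```

Filtering on these flags is usually what turns raw mentions into clinically meaningful counts, since negated and family-history mentions do not describe the patient's own conditions.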

# IV. Future Work and Additional Information
We are still actively working on medspaCy and are continually making updates. Some of our immediate next steps are:
- Support for spaCy v3
- Better documentation
- Release trained models/pipelines
- More utilities for machine learning
- New features?

## medspaCy resources
- [medspaCy on GitHub](https://github.com/medspacy/medspacy)
- [Detailed notebooks and tutorials](https://github.com/medspacy/medspacy/tree/master/notebooks)
- [A workshop from the University of Melbourne](https://github.com/Melbourne-BMDS/mimic34md2020_materials) on clinical data science including medspaCy
## Publications
* ACL COVID-19 Workshop: [A Natural Language Processing System for National COVID-19 Surveillance in the US Department of Veterans Affairs](https://www.aclweb.org/anthology/2020.nlpcovid19-acl.10/)
* AMIA Poster: [Removing barriers to clinical text processing with MedSpaCy](https://knowledge.amia.org/72332-amia-1.4602255/t005-1.4604904/t005-1.4604905/3414620-1.4605626/3414620-1.4605627?qr=1)
* AMIA Tutorial 2020
* AMIA Paper (submitted)
* AMIA Tutorial 2021 (submitted)