In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys

In [3]:
sys.path.insert(0, "../medspacy",)

# Outline

- Background: Objectives, Projects, uses, publications
- Quick overview of spaCy
- Features for common clinical NLP tasks: ConText, section detection, rule-based pattern matching, UMLS linking, sentence splitting
- Customizable: You can write your own rules for all of the rule-based components, customize for your dataset/research problem
- Flexible: This is more about spaCy generally, but with spaCy you can drop in various trained models, use PyTorch or TF or HuggingFace, etc.

# Background
![MedspaCy logo](https://github.com/medspacy/medspacy/blob/master/images/medspacy_logo.png?raw=true)

medspaCy is a library of tools for performing clinical NLP and text processing tasks with the popular [spaCy](https://spacy.io)
framework. The `medspacy` package brings together a number of other packages, each of which implements
functionality for common text processing tasks specific to the clinical domain, such as sentence segmentation,
contextual analysis and attribute assertion, and section detection.

## Goals of medspaCy
Unlike libraries such as scispaCy and medCAT, the main goal of medspaCy is not to implement pre-trained clinical models. Instead, medspaCy is a toolkit for designing user-specific clinical NLP pipelines. It offers a number of rule-based components that let users easily write rules to extract specific concepts, and these components can be integrated with trained models.

The main design principles of medspaCy are:
- **Customizable**: All clinical data differs and no two clinical NLP tasks are the same. medspaCy offers defaults for most rule-based components, but they can be easily customized with user-defined rules
- **Flexible**: One of the main advantages of spaCy is its flexible architecture, which allows you to mix and match different models and components. Similarly with medspaCy, you can add components to existing pipelines, including statistical models trained using spaCy or other frameworks
- **spaCy-esque**:

# Use Cases
- COVID-19
- Homelessness
- ...

# Quick overview of spaCy
- Doc, Span, Token objects
- nlp pipelines
- Custom attributes
- Links to spaCy course, etc...

# Getting started with medspaCy

You can install medspaCy using pip or from the GitHub repo:

In [4]:
# !pip install medspacy==...

To get started with medspaCy, you can load a pipeline by calling `medspacy.load()`. By default, this will load the following 3 pipeline components:
- `PyRuSHSentencizer`: Uses [PyRuSH] for clinical sentence segmentation
- `TargetMatcher`: A rule-based concept extractor
- `ConTextComponent`: An implementation of the [ConText] algorithm for detecting attributes like negation and temporality

Throughout this notebook, we'll customize these components as well as add new ones for additional processing steps.

In [5]:
import medspacy
nlp = medspacy.load()

In [6]:
nlp.pipe_names

['sentencizer', 'target_matcher', 'context']

In [7]:
url = "https://raw.githubusercontent.com/medspacy/medspacy/master/notebooks/discharge_summary.txt"

In [8]:
# import urllib.request

# with urllib.request.urlopen(url) as f:
#     text = f.read().decode()

In [9]:
with open("../medspacy/notebooks/discharge_summary.txt") as f:
    text = f.read()

In [10]:
print(text[:500])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging sh


Just like a normal spaCy model, you process a text by calling `nlp(text)`, which returns a `Doc` object:

In [11]:
doc = nlp(text)

# II. Common NLP Tasks
medspaCy is built as a modular set of **pipeline components**, each of which handles a specific NLP task. Because of spaCy's flexible framework, you can easily add new components, including trained statistical models.

In this notebook, we'll walk through the following processing steps:
- **Rule-based concept extraction**: Manually define rules to extract concepts from text
- **Statistical NER**: Use a pre-trained model for extracting clinical problems, treatments, and tests
- **Contextual analysis**: Assert entity attributes such as negation, temporality, and experiencer
- **Section detection**: Identify the structure of a note and split into individual sections
- **Input/Output**: Implement a complete processing pipeline by reading texts, processing them, and writing structured data back to a database

## i. Rule-Based Matching
In this step, we'll manually define rules to extract clinical concepts from the text.

In this example, we'll use two classes provided in `medspacy.ner` for rule-based matching: the `TargetMatcher` and `TargetRule`. These expand on spaCy's native [rule-based matching](https://spacy.io/usage/rule-based-matching) and add some additional functionality.

When the `TargetMatcher` processes a doc, it adds each matched span to `doc.ents`, which contains all of the extracted entities for a doc.

In [12]:
from medspacy.ner import TargetMatcher, TargetRule

In [13]:
target_matcher = nlp.get_pipe("target_matcher")

We define a rule for extracting entities with the `TargetRule` class. The main arguments for `TargetRule` are:
- `literal`: An exact phrase to match in the text
- `category`: The semantic class of the entity (i.e., `ent.label_`)
- `pattern`: An optional pattern to match rather than `literal`. As we'll see below, this can be either a list of dictionaries defining token attributes or a regular expression string

In [14]:
target_rules = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("metastasis", "PROBLEM"),
    
]

In [15]:
target_matcher.add(target_rules)

In [16]:
doc = nlp(text)

In [17]:
for ent in doc.ents:
    print(ent, ent.label_)

Hydrochlorothiazide TREATMENT
Abdominal pain PROBLEM
stroke PROBLEM
abdominal pain PROBLEM
metastasis PROBLEM
Colon cancer PROBLEM
hemicolectomy TREATMENT
stroke PROBLEM
abdominal pain PROBLEM
abdominal pain PROBLEM
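
The output above shows the literal `"abdominal pain"` matching both "Abdominal pain" and "abdominal pain". As a rough illustration of that behavior, here is a standard-library sketch of case-insensitive literal matching. It is only an approximation under stated assumptions: `literal_rules` and `match_literals` are hypothetical names, and the sketch matches on raw characters with `re`, whereas `TargetMatcher` works on spaCy tokens.

```python
import re

# Hypothetical stand-ins for TargetRule(literal, category); not medspaCy objects
literal_rules = [
    ("abdominal pain", "PROBLEM"),
    ("stroke", "PROBLEM"),
    ("Hydrochlorothiazide", "TREATMENT"),
]

def match_literals(text, rules):
    """Return (matched text, category) pairs in document order."""
    found = []
    for literal, category in rules:
        # Whole-word, case-insensitive match on the literal phrase
        pattern = re.compile(r"\b" + re.escape(literal) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), category))
    # Sort by start offset so results follow the order of the text
    return [(span_text, category) for _, span_text, category in sorted(found)]

sample = (
    "Allergies:\nHydrochlorothiazide\n\n"
    "Chief Complaint:\nAbdominal pain\n\n"
    "74y female with type 2 dm and a recent stroke affecting her speech, "
    "who presents with 2 days of abdominal pain."
)
results = match_literals(sample, literal_rules)
for span_text, category in results:
    print(span_text, category)
```

The matched text keeps its original capitalization, which is why both spellings appear in the results.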


In [18]:
from medspacy.visualization import visualize_ent

In [19]:
visualize_ent(doc)

## Advanced Pattern Matching
spaCy has powerful pattern matching, which allows you to match on a list of dictionaries that define attributes for each token. See https://spacy.io/usage/rule-based-matching for spaCy's documentation and examples. Additionally, medspaCy allows matching with regular expressions on the underlying text of the doc.

Let's see some examples of `TargetRule` objects which use pattern matching. The first rule uses token patterns to match a token whose lower-case text is either `acetaminophen` or `tylenol`, optionally followed by a number and `mg`. The second uses a regular expression to match various forms of Type 1/Type 2 diabetes.

In [20]:
pattern_rules = [
    # Using spaCy's dictionary token patterns
    TargetRule("Acetaminophen", "TREATMENT",
               pattern=[
                   {"LOWER": {"IN": ["acetaminophen", "tylenol"]}},
                   {"LIKE_NUM": True, "OP": "?"},
                   {"LOWER": "mg", "OP": "?"}
               ],
              ),
    
    # Using regular expressions
    TargetRule("diabetes", "PROBLEM",
              pattern=r"type (i|ii|1|2|one|two) (dm|diabetes mellitus)"),
]

In [21]:
target_matcher.add(pattern_rules)



In [22]:
sm_text = """
    Discharge Medications: Acetaminophen 160 mg
    Prescribed tylenol for the pain
    74y female with type 2 dm and a recent stroke.
    Diagnoses: Type II Diabetes Mellitus
"""

In [23]:
sm_doc = nlp(sm_text)

In [26]:
for ent in sm_doc.ents:
    print(ent, ent.label_)

Acetaminophen 160 mg TREATMENT
tylenol TREATMENT
type 2 dm PROBLEM
stroke PROBLEM
Type II Diabetes Mellitus PROBLEM


In [27]:
visualize_ent(sm_doc)
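
To see what the regular expression in the second rule matches on its own, it can be tried with Python's `re` module. This sketch assumes case-insensitive matching, which is consistent with the output above (where "Type II Diabetes Mellitus" matched) but is not a statement about how medspaCy compiles the pattern internally. The sample strings are taken from `sm_text`, plus one made-up non-matching line.

```python
import re

# Same pattern string as in the TargetRule above; IGNORECASE is an assumption
diabetes_pattern = re.compile(
    r"type (i|ii|1|2|one|two) (dm|diabetes mellitus)",
    flags=re.IGNORECASE,
)

samples = [
    "74y female with type 2 dm and a recent stroke.",
    "Diagnoses: Type II Diabetes Mellitus",
    "Prescribed tylenol for the pain",
    "type 3 diabetes",  # made-up example that should not match
]

matches = []
for sample in samples:
    found = diabetes_pattern.search(sample)
    if found is not None:
        matches.append(found.group(0))

print(matches)
```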

## ii. Statistical NER
While rule-based models are still very useful in clinical NLP, many systems are designed as **statistical models**. In this section, we'll show how to use a pre-trained model for target concept extraction instead of defining rules. We'll then add our additional components to show how medspaCy can be used to combine statistical NLP with other rule-based components.

As an example, we'll download the model below, which was pretrained on clinical data. It was trained using spaCy with data from the i2b2 2012 shared task: [**"Evaluating temporal relations in clinical text"**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756273/). The training data came from the first subtask of the shared task, referred to in the challenge as **"Clinically relevant events"**, and covers the following **clinical concepts**:
- **Problems:** Diagnoses, signs, and symptoms
- **Tests:** Lab and vital measurements
- **Treatments:** Medications, procedures, and therapies

We can install this model with `pip` using this GitHub link:
```bash
pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
```

In [29]:
# !pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz

For more information on how to train a model in spaCy, or how to integrate other models, see [TODO].

Once we've installed the model, we'll load it by passing in the model name, **"en_info_3700_i2b2_2012"**, to medspaCy's `load()` function. Now, in addition to the standard pipeline components loaded above, we also have:
- **tagger**/**parser**: A POS tagger and dependency parser taken from spaCy's standard trained English model
- **ner**: An `EntityRecognizer` component trained to extract our 3 clinical classes.

In [30]:
nlp = medspacy.load("en_info_3700_i2b2_2012")



In [31]:
nlp.pipe_names

['sentencizer', 'tagger', 'parser', 'ner', 'target_matcher', 'context']

In [33]:
# Check what classes will be extracted by our NER
ner = nlp.get_pipe("ner")
ner.labels

('PROBLEM', 'TEST', 'TREATMENT')

Now let's reprocess our text and see what our pre-trained model extracts:

In [34]:
doc = nlp(text)

In [35]:
visualize_ent(doc)

The model extracted many more concepts, but missed some of the spans we defined earlier, like **"type ii diabetes"**. Luckily, we can combine statistical and rule-based models by adding the rules we defined to the `TargetMatcher` component.

In [38]:
target_matcher = nlp.get_pipe("target_matcher")
target_matcher.add(target_rules)
target_matcher.add(pattern_rules)



In [39]:
doc = nlp(text)

In [41]:
visualize_ent(doc)

## ii. ConText

Clinical text often contains mentions of concepts which the patient did not actually experience. For example:

- "There is *no evidence of* **pneumonia**"
- "*Mother* with **breast cancer**"
- "Patient presents for *r/o* **COVID-19**"

In all of these instances, we need to use the contextual clues around the entity to assert attributes like negation, experiencer, and uncertainty.

The [ConText algorithm](https://www.sciencedirect.com/science/article/pii/S1532046409000744) is a popular method for asserting attributes of entities in clinical text such as **negation**, **temporality**, and **experiencer**. ConText is implemented in medspaCy using the `ConTextComponent`, which is loaded as part of a standard model.

We can inspect the modifier-entity relationships using medspaCy's `visualize_dep` function, which draws arrows between modifiers and the entities that they modify.

In [44]:
from medspacy.visualization import visualize_dep

In [45]:
doc = nlp("There is no evidence of pneumonia.")

In [46]:
visualize_dep(doc)
visualize_ent(doc)

In [47]:
doc = nlp("Mother with stroke at age 82.")
visualize_dep(doc)
visualize_ent(doc)

In addition to linking entities and modifiers, ConText also sets a number of boolean attributes indicating whether the entity is negated, experienced by someone else, etc.

In [50]:
ent = doc.ents[0]
print(ent, "is_family", ent._.is_family)
print(ent,  "is_negated", ent._.is_negated)

stroke is_family True
stroke is_negated False


All of these attributes are available in the `context_attributes` attribute, and you can check whether any of them are `True` (often meaning an entity can be excluded) with the `any_context_attributes` attribute:

In [55]:
print(ent._.context_attributes)
print()
print("Any context attributes:", ent._.any_context_attributes)

{'is_negated': False, 'is_historical': False, 'is_hypothetical': False, 'is_family': True, 'is_uncertain': False}

Any context attributes: True
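
Judging from the output above, `any_context_attributes` behaves like Python's `any()` applied to the values of `context_attributes`. A minimal sketch of that relationship, using a plain dictionary copied from the output rather than a real spaCy `Span` (the variable names are illustrative only):

```python
# Dictionary copied from the printed output above
context_attributes = {
    "is_negated": False,
    "is_historical": False,
    "is_hypothetical": False,
    "is_family": True,
    "is_uncertain": False,
}

# True if at least one attribute is set
any_context_attributes = any(context_attributes.values())

# Names of the attributes that are set, useful when deciding why to exclude an entity
active_attributes = [name for name, value in context_attributes.items() if value]

print(any_context_attributes, active_attributes)
```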


### Customizing ConText
When you load ConText in medspaCy, it comes with a default set of rules. However, you'll often need to add new rules to match your data or implement new categories.

Custom modifiers can be defined using the `ConTextRule` class. This behaves very similarly to the `TargetRule` class, but has an additional argument for the **direction** the modifier moves in the sentence. This essentially defines the direction which the arrow will point to find entities to modify:
- **"FORWARD"**: Start at the modifier and go towards the end of the sentence ("There is _**no evidence of**_ ==> **pneumonia**")
- **"BACKWARD"**: Go towards the beginning of the sentence ("**Colon cancer** <== _**diagnosed in 2012**_")
- **"BIDIRECTIONAL"**: Move in both directions ("**Sepsis** <== _**vs**_ ==> **pneumonia**")
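
The three directions can be pictured as selecting which tokens of the sentence a modifier is allowed to reach. The following is a simplified sketch, not medspaCy's implementation: `modifier_scope` is a hypothetical helper, tokens are produced by whitespace splitting, and anything that might cut a scope short is ignored.

```python
def modifier_scope(tokens, mod_start, mod_end, direction):
    """Return indices of tokens a modifier at tokens[mod_start:mod_end] can reach."""
    before = list(range(0, mod_start))
    after = list(range(mod_end, len(tokens)))
    if direction == "FORWARD":
        return after
    if direction == "BACKWARD":
        return before
    if direction == "BIDIRECTIONAL":
        return before + after
    raise ValueError(f"Unknown direction: {direction}")

# "no evidence of" occupies tokens 2-4 and looks forward
forward_tokens = "There is no evidence of pneumonia".split()
forward_scope = modifier_scope(forward_tokens, 2, 5, "FORWARD")

# "diagnosed in 2012" occupies tokens 2-4 and looks backward
backward_tokens = "Colon cancer diagnosed in 2012".split()
backward_scope = modifier_scope(backward_tokens, 2, 5, "BACKWARD")

# "vs" is token 1 and looks both ways
both_tokens = "Sepsis vs pneumonia".split()
both_scope = modifier_scope(both_tokens, 1, 2, "BIDIRECTIONAL")

print([forward_tokens[i] for i in forward_scope])
print([backward_tokens[i] for i in backward_scope])
print([both_tokens[i] for i in both_scope])
```

An entity is then a candidate for modification only if its tokens fall inside the returned scope.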

Here we'll add a **backwards** modifier to match **"diagnosed in <YEAR>"** in order to identify historical diagnoses.

In [61]:
from medspacy.context import ConTextRule

In [57]:
context = nlp.get_pipe("context")

In [58]:
context_rule = ConTextRule(
    "diagnosed in <YEAR>",
    "HISTORICAL",
    direction="BACKWARD",
    pattern=r"(diagnosed|dx'd) in (19|20)[\d]{2}",
)

In [59]:
context.add(context_rule)



In [60]:
short_doc = nlp("Colon cancer diagnosed in 2012")

In [62]:
visualize_dep(short_doc)
visualize_ent(short_doc)
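
The regular expression passed as `pattern` in the rule above can also be checked in isolation with Python's `re` module. This only exercises the pattern string itself; how `ConTextComponent` applies it to a doc is not covered, and the sample phrases other than the first are made up for illustration.

```python
import re

# Same pattern string as in the ConTextRule above
year_pattern = re.compile(r"(diagnosed|dx'd) in (19|20)[\d]{2}")

phrases = [
    "Colon cancer diagnosed in 2012",
    "Hypertension dx'd in 1998",
    "Gout diagnosed in 1850",      # year outside 19xx/20xx
    "Asthma diagnosed in childhood",  # no year at all
]

historical_hits = []
for phrase in phrases:
    found = year_pattern.search(phrase)
    if found is not None:
        historical_hits.append(found.group(0))

print(historical_hits)
```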

We can see all of the rules contained in ConText, as well as the unique categories defined, in the `rules` and `categories` attributes:

In [63]:
context.rules[:5]

[ConTextRule(literal='absence of', category='NEGATED_EXISTENCE', pattern=None, direction='FORWARD'),
 ConTextRule(literal='adequate to rule out', category='NEGATED_EXISTENCE', pattern=[{'LOWER': {'IN': ['adequate', 'sufficient']}}, {'LOWER': 'to'}, {'LOWER': 'rule'}, {'LOWER': {'IN': ['him', 'her', 'them', 'patient', 'pt']}, 'OP': '?'}, {'LOWER': 'out'}, {'LOWER': {'IN': ['against', 'for']}, 'OP': '?'}], direction='FORWARD'),
 ConTextRule(literal='adequate to rule the patient out', category='NEGATED_EXISTENCE', pattern=[{'LOWER': {'IN': ['adequate', 'sufficient']}}, {'LOWER': 'to'}, {'LOWER': 'rule'}, {'LOWER': 'the'}, {'LOWER': {'IN': ['patient', 'pt']}}, {'LOWER': 'out'}, {'LOWER': {'IN': ['against', 'for']}, 'OP': '?'}], direction='FORWARD'),
 ConTextRule(literal='any other', category='NEGATED_EXISTENCE', pattern=None, direction='FORWARD'),
 ConTextRule(literal='apart from', category='NEGATED_EXISTENCE', pattern=[{'LOWER': 'apart'}, {'LOWER': {'IN': ['for', 'from']}}], direction='TE

In [64]:
context.categories

{'FAMILY',
 'HISTORICAL',
 'HYPOTHETICAL',
 'NEGATED_EXISTENCE',
 'POSSIBLE_EXISTENCE'}

## iii. Section detection
We are often interested in which section of a clinical note an entity occurs in. This can be useful for excluding entities from certain sections, like the past medical history or problem list, setting attributes like temporality (similar to ConText), or for extracting entities from specific sections of the note.

medspaCy includes the `Sectionizer` class for identifying sections in a note, which adds the `sections` attribute to a `doc`, as well as similar attributes for spans and tokens. Here, we'll instantiate a `Sectionizer` and add it to our pipeline.

In [65]:
from medspacy.section_detection import Sectionizer, SectionRule

In [66]:
sectionizer = Sectionizer(nlp)

In [67]:
nlp.add_pipe(sectionizer)

In [68]:
doc = nlp(text)

For each section, we can see the actual section title, as well as a normalized section category:

In [72]:
for section in doc._.sections:
    print(section.title_span, section.category)

 None
Service: other
Allergies: allergies
Chief Complaint: chief_complaint
History of Present Illness: history_of_present_illness
Past Medical History: past_medical_history
Social History: social_history
Family History: family_history
Brief Hospital Course: hospital_course
Discharge Medications: medications
Discharge Diagnosis: observation_and_plan
Discharge Instructions: patient_instructions
Signed electronically by: signature


medspaCy will visualize the sections along with entities and modifiers in gray highlighting with **<\< \>>** tags:

In [70]:
visualize_ent(doc)

We can see the normalized section name for each entity as well:

In [74]:
for ent in doc.ents:
    print(ent, "-->", ent._.section_category)

Hydrochlorothiazide --> allergies
Abdominal pain --> chief_complaint
Invasive Procedure --> chief_complaint
PICC line --> chief_complaint
ERCP --> chief_complaint
sphincterotomy --> chief_complaint
type 2 dm --> history_of_present_illness
a recent stroke --> history_of_present_illness
abdominal pain --> history_of_present_illness
Imaging --> history_of_present_illness
metastasis --> history_of_present_illness
Colon cancer --> past_medical_history
hemicolectomy --> past_medical_history
XRT --> past_medical_history
chemo --> past_medical_history
colonoscopy --> past_medical_history
CEA --> past_medical_history
Type II Diabetes Mellitus --> past_medical_history
Hypertension --> past_medical_history
Married --> social_history
former tobacco use --> social_history
alcohol or drug use --> social_history
stroke --> family_history
Ultrasound --> hospital_course
pancreatic duct dilitation --> hospital_course
Miconazole --> medications
Heparin Sodium --> medications
Porcine --> medications
Injec

### Customizing section detection
Note structures vary widely between different EHRs and institutions, so it's important to define sections which match your note structure. The `SectionRule` defines sections to extract, and follows the same API as `TargetRule` and `ConTextRule`.

Here we'll add a rule to create a **patient_demographics** section around the patient DOB:

In [76]:
from medspacy.section_detection import SectionRule

In [77]:
rule = SectionRule("Date of Birth:", "patient_demographics")

In [78]:
sectionizer.add(rule)

In [81]:
visualize_ent(nlp(text[:200]))
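
To make the idea of title-based section rules concrete, here is a standard-library sketch that splits a note wherever a known section title appears. It is an illustration under stated assumptions rather than the `Sectionizer` implementation: `section_rules` and `split_sections` are hypothetical names, titles are matched as exact strings, and the sample note is a shortened version of the discharge summary header.

```python
import re

# Hypothetical mapping of section title literal -> normalized category
section_rules = {
    "Allergies:": "allergies",
    "Chief Complaint:": "chief_complaint",
    "Date of Birth:": "patient_demographics",
}

def split_sections(text, rules):
    """Split text into (category, section_text) pairs at each known title."""
    pattern = re.compile("|".join(re.escape(title) for title in rules))
    matches = list(pattern.finditer(text))
    sections = []
    # Text before the first recognized title has no category
    first_start = matches[0].start() if matches else len(text)
    if first_start > 0:
        sections.append((None, text[:first_start]))
    # Each section runs from its title to the start of the next title
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((rules[match.group(0)], text[match.start():end]))
    return sections

note = (
    "Admission Date:  [**2573-5-30**]\n\n"
    "Date of Birth:  [**2498-8-19**]             Sex:   F\n\n"
    "Allergies:\nHydrochlorothiazide\n\n"
    "Chief Complaint:\nAbdominal pain\n"
)
sections = split_sections(note, section_rules)
categories = [category for category, _ in sections]
print(categories)
```

As in the section listing earlier, text that precedes the first recognized title ends up in a section with no category.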

## Input/Output
Finally, once we've processed a text or corpus, we'll want to save our extracted data to disk or a database. The `medspacy.io` module has utilities for converting docs to structured data.

### Extracting structured data
First, the `DocConsumer` will take various levels of information from a doc and generate structured data. There are four different "levels" allowed by the DocConsumer, and we'll extract all of them here:
- **"ent"**: Extracted entities and span attributes
- **"context"**: Entity-modifier pairs
- **"section"**: Discrete sections of the note
- **"doc"**: The text and optional custom attributes

In [82]:
from medspacy.io import DocConsumer

In [83]:
doc_consumer = DocConsumer(nlp, dtypes=("ent", "context", "section", "doc"))

The `DocConsumer` will add structured data as a dictionary to the `doc._.data` attribute, which contains one key for each level:

In [114]:
nlp.add_pipe(doc_consumer)

In [115]:
doc = nlp(text)

In [116]:
doc._.data

{'ent': OrderedDict([('text',
               ['Hydrochlorothiazide',
                'Abdominal pain',
                'Invasive Procedure',
                'PICC line',
                'ERCP',
                'sphincterotomy',
                'type 2 dm',
                'a recent stroke',
                'abdominal pain',
                'Imaging',
                'metastasis',
                'Colon cancer',
                'hemicolectomy',
                'XRT',
                'chemo',
                'colonoscopy',
                'CEA',
                'Type II Diabetes Mellitus',
                'Hypertension',
                'Married',
                'former tobacco use',
                'alcohol or drug use',
                'stroke',
                'Ultrasound',
                'pancreatic duct dilitation',
                'Miconazole',
                'Heparin Sodium',
                'Porcine',
                'Injection',
                'Acetaminophen 160 mg',
       

If you have `pandas` installed, you can then directly convert a doc to a dataframe, which shows the attributes extracted for each entity:

In [117]:
# !pip install pandas

In [118]:
doc._.to_dataframe("ent").head()

Unnamed: 0,text,start_char,end_char,label_,is_negated,is_uncertain,is_historical,is_hypothetical,is_family,section_category,section_parent
0,Hydrochlorothiazide,163,182,TREATMENT,False,False,False,False,False,allergies,
1,Abdominal pain,239,253,PROBLEM,False,False,False,False,False,chief_complaint,
2,Invasive Procedure,273,291,TREATMENT,False,False,False,False,False,chief_complaint,
3,PICC line,293,302,TREATMENT,False,False,False,False,False,chief_complaint,
4,ERCP,314,318,TEST,False,False,False,False,False,chief_complaint,


In [90]:
doc._.to_dataframe("section").head()

Unnamed: 0,section_category,section_title_text,section_title_start_char,section_title_end_char,section_text,section_text_start_char,section_text_end_char,section_parent
0,,,0,0,Admission Date: [**2573-5-30**] ...,0,134,
1,other,Service:,134,142,Service: SURGERY\n\n,134,152,
2,allergies,Allergies:,152,162,Allergies:\nHydrochlorothiazide\n\nAttending:[...,152,222,
3,chief_complaint,Chief Complaint:,222,238,Chief Complaint:\nAbdominal pain\n\nMajor Surg...,222,350,
4,history_of_present_illness,History of Present Illness:,350,377,History of Present Illness:\n74y female with t...,350,532,


In [91]:
doc._.to_dataframe("context").head()

Unnamed: 0,ent_text,ent_label_,ent_start_char,ent_end_char,modifier_text,modifier_category,modifier_direction,modifier_start_char,modifier_end_char,modifier_scope_start_char,modifier_scope_end_char
0,metastasis,PROBLEM,519,529,no evidence of,NEGATED_EXISTENCE,FORWARD,504,518,519,518
1,alcohol or drug use,PROBLEM,788,807,No,NEGATED_EXISTENCE,FORWARD,785,787,788,787
2,stroke,PROBLEM,838,844,Mother,FAMILY,FORWARD,826,832,833,832
3,aspiration respiratory distress,PROBLEM,1478,1509,h/o,HISTORICAL,FORWARD,1474,1477,1478,1477
4,fever,PROBLEM,1652,1657,if,HYPOTHETICAL,FORWARD,1613,1615,1616,1615


In [92]:
doc._.to_dataframe("doc")

Unnamed: 0,text
0,Admission Date: [**2573-5-30**] ...
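
The `"ent"` level of `doc._.data` shown earlier is column-oriented: each key maps to a list with one value per entity. If pandas is not available, such a structure can be turned into one dictionary per entity with `zip`. This is a sketch using a hand-copied subset of the values from the tables above, not the output of `DocConsumer` itself.

```python
from collections import OrderedDict

# Subset of columns and values copied from the "ent" dataframe above
ent_data = OrderedDict([
    ("text", ["Hydrochlorothiazide", "Abdominal pain"]),
    ("start_char", [163, 239]),
    ("end_char", [182, 253]),
    ("label_", ["TREATMENT", "PROBLEM"]),
])

# zip(*columns) walks the columns in parallel, yielding one tuple per entity
rows = [dict(zip(ent_data.keys(), values)) for values in zip(*ent_data.values())]

for row in rows:
    print(row)
```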


### Reading and writing to a database
As a final step, we'll write this structured data to a database. The `DbConnect`, `DbReader` and `DbWriter` classes will handle connecting to a database, creating tables, and inserting doc data for us. 

Currently, medspaCy database classes support `sqlite3` or `pyodbc` databases. The function below will create a simple sqlite database which includes our discharge summary and a few additional short texts.

In [94]:
def create_medspacy_demo_db(drop_existing=True):
    import os
    import sqlite3 as s3

    # Keep an existing database unless the caller asks to rebuild it
    if not drop_existing and os.path.exists("medspacy_demo.db"):
        print("File medspacy_demo.db already exists")
        return

    with open("./discharge_summary.txt") as f:
        discharge_summary = f.read()

    # Two short examples plus the full discharge summary
    texts = [
        "There is no evidence of pneumonia.",
        "Her mother was diagnosed with breast cancer.",
        discharge_summary,
    ]

    conn = s3.connect("medspacy_demo.db")
    cursor = conn.cursor()
    cursor.execute("DROP TABLE IF EXISTS texts;")
    cursor.execute("CREATE TABLE texts (text_id INTEGER PRIMARY KEY, text NOT NULL);")

    # Parameterized inserts; text_id is assigned automatically
    for text in texts:
        cursor.execute("INSERT INTO texts (text) VALUES (?)", (text,))
    conn.commit()
    conn.close()
    print("Created file medspacy_demo.db")

In [96]:
create_medspacy_demo_db(drop_existing=True)

Created file medspacy_demo.db


First, we'll create a connection to our database using `sqlite3` and medspaCy's `DbConnect` class:

In [97]:
from medspacy.io import DbConnect

In [98]:
import sqlite3

In [99]:
sq_conn = sqlite3.connect("medspacy_demo.db")

In [100]:
conn = DbConnect(conn=sq_conn)

Opened connection to None.None


Next, we'll define a query to load our texts and pass it into a `DbReader` class:

In [103]:
from medspacy.io import DbReader

In [104]:
# Pass in our connection and a query to read texts:
read_query = """
SELECT text
FROM texts
"""
reader = DbReader(conn, read_query)

In [105]:
texts = [r[0] for r in reader.read()] 

Read 3 rows with query: 
SELECT text
FROM texts



Finally, we'll process our texts, create a `DbWriter` object, and then write the extracted entities back to the database:

In [119]:
docs = list(nlp.pipe(texts))

In [108]:
from medspacy.io import DbWriter

In [111]:
writer = DbWriter(conn, destination_table="ents", create_table=True, drop_existing=True)

Created table ents with query: CREATE TABLE ents (text varchar(50), start_char int, end_char int, label_ varchar(50), is_negated int, is_uncertain int, is_historical int, is_hypothetical int, is_family int, section_category int, section_parent int)


In [120]:
for doc in docs:
    writer.write(doc)

Wrote 1 rows with query: INSERT INTO ents (text, start_char, end_char, label_, is_negated, is_uncertain, is_historical, is_hypothetical, is_family, section_category, section_parent) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Wrote 1 rows with query: INSERT INTO ents (text, start_char, end_char, label_, is_negated, is_uncertain, is_historical, is_hypothetical, is_family, section_category, section_parent) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Wrote 41 rows with query: INSERT INTO ents (text, start_char, end_char, label_, is_negated, is_uncertain, is_historical, is_hypothetical, is_family, section_category, section_parent) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)


Now, we have a structured dataset that we can query and analyze:

In [121]:
cursor = sq_conn.cursor()

In [124]:
cursor.execute("SELECT label_, COUNT(1) FROM ents GROUP BY label_;")
cursor.fetchall()

[('PROBLEM', 24), ('TEST', 5), ('TREATMENT', 14)]

In [129]:
# Find examples of family history
cursor.execute("SELECT text, label_ FROM ents WHERE is_family = 1 LIMIT 5; ")
cursor.fetchall()

[('breast cancer', 'PROBLEM'), ('stroke', 'PROBLEM')]

In [130]:
# Find entities whose text mentions cancer
cursor.execute("SELECT text, label_ FROM ents WHERE text LIKE '%cancer%' LIMIT 5; ")
cursor.fetchall()


[('breast cancer', 'PROBLEM'), ('Colon cancer', 'PROBLEM')]
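
The same kinds of queries can be tried without a medspaCy pipeline by using an in-memory SQLite database. This is a sketch under stated assumptions: the table below has only three of the columns that `DbWriter` created, and the rows are hand-written examples rather than real pipeline output.

```python
import sqlite3

# In-memory database so nothing is written to disk
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE ents (text TEXT, label_ TEXT, is_family INTEGER)")

# Hand-written rows for illustration
demo_rows = [
    ("breast cancer", "PROBLEM", 1),
    ("stroke", "PROBLEM", 1),
    ("Colon cancer", "PROBLEM", 0),
    ("hemicolectomy", "TREATMENT", 0),
]
demo_conn.executemany("INSERT INTO ents VALUES (?, ?, ?)", demo_rows)

# Count entities per label
label_counts = demo_conn.execute(
    "SELECT label_, COUNT(1) FROM ents GROUP BY label_ ORDER BY label_"
).fetchall()

# Entities attributed to a family member
family_ents = demo_conn.execute(
    "SELECT text FROM ents WHERE is_family = 1 ORDER BY text"
).fetchall()

demo_conn.close()
print(label_counts)
print(family_ents)
```

Adding `ORDER BY` makes the result order explicit, which is convenient when comparing runs.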

# Future work
- Next Steps: spaCy v3, documentation, trained models/pipelines, optimization, support for other natural languages …