Biomedical NLP with Scispacy + NegExc

Jay Urbain, PhD


## Spacy

Spacy : spaCy is a free open-source library for Natural Language Processing in Python.


## Scispacy

Scispacy : scispaCy is a python package containing spaCy models for processing biomedical, scientific or clinical text.

https://allenai.github.io/scispacy/

## NegEx

NegEx - A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries Chapman, Bridewell, Hanbury, Cooper, Buchanan https://doi.org/10.1006/jbin.2001.1029

Negspacy : spaCy pipeline object for negating concepts in text. Based on the NegEx algorithm.
NegEx : — NegEx locates trigger terms indicating a clinical condition is negated or possible and determines which text falls within the scope of the trigger terms.

Reference:

Clinical Text Negation handling using negspaCy and scispaCy
Importance of Negation Detection in Clinical Records
Mansi Kukreja, Jul 27, 2020
https://medium.com/@MansiKukreja/clinical-text-negation-handling-using-negspacy-and-scispacy-233ce69ab2ac

Env: pytorch-kg

## Installation

In [1]:
#!pip install spacy
# !pip install scispacy
# !pip install negspacy

In [6]:
#!pip install spacy-transformers

## Download models

In [51]:
##!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz    
##!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_scibert-0.5.0.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz

## Libraries

In [52]:
import spacy
import scispacy
from spacy import displacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from negspacy.negation import Negex

## Load scispaCy models

In [53]:
import spacy

nlp = spacy.load("en_core_sci_md") 
doc = nlp('''Alterations in the hypocretin receptor 2 and preprohypocretin 
          genes produce narcolepsy in some animals.''')

In [17]:
doc

Alterations in the hypocretin receptor 2 and preprohypocretin 
          genes produce narcolepsy in some animals.

In [18]:
import spacy

from scispacy.abbreviation import AbbreviationDetector

# nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
	print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")


Abbreviation 	 Definition
SBMA 	 (6, 7) Spinal and bulbar muscular atrophy
SBMA 	 (33, 34) Spinal and bulbar muscular atrophy
AR 	 (29, 30) androgen receptor


In [19]:
import spacy
import scispacy

from scispacy.linking import EntityLinker

#nlp = spacy.load("en_core_sci_sm")

# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). 
# Should be cached after download.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

# Let's look at a random entity!
entity = doc.ents[1]

print("Name: ", entity)

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
	print(linker.kb.cui_to_entity[umls_ent[0]])

Name:  bulbar muscular atrophy
CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the gene encoding the ANDROGEN RECEPTOR.
TUI(s): T047
Aliases (abbreviated, total: 39): 
	 Bulbospinal Muscular Atrophy, X linked, SMAX1, X Linked Spinal and Bulbar Muscular Atrophy, Bulbospinal muscular atrophy, kennedy's syndrome, Atrophy, Muscular, Spinobulbar, X Linked Bulbo Spinal Atrophy, X-Linked Spinal and Bulbar Muscular Atrophy, Bulbo Spinal Atrophy, X Linked, X-Linked Bulbo-Spinal Atrophy
CUI: C0026846, Name: Muscular Atrophy
Definition: Derangement in size and number of muscle fibers occurring with aging, reduction in blood supply, or following immobilization, prolonged weightlessness, malnutrition, and particularly in denervation.
TUI(s): T046
Aliases (abbreviated, total: 32): 
	 Muscle atrophy, NOS, ATROPHY MUSCLE, amyotrophia, Muscle wasting, NOS, Muscle Atrophy, Muscle wasting disorder, Muscular 

In [20]:
#!pip install --no-binary :all: nmslib

In [21]:
doc

Spinal and bulbar muscular atrophy (SBMA) is an            inherited motor neuron disease caused by the expansion            of a polyglutamine tract within the androgen receptor (AR).            SBMA can be caused by this easily.

In [22]:
print(linker.umls.cui_to_entity['C0009378'])

CUI: C0009378, Name: colonoscopy
Definition: Endoscopic examination, therapy or surgery of the luminal surface of the colon.
TUI(s): T060
Aliases (abbreviated, total: 14): 
	 Endoscopic examination of colon, Colonoscopies, colon endoscopy, Lower gastrointestinal tract examination, colonoscopy procedures, colonoscopies, Endoscopy of colon, NOS, Endoscopy of colon, colonoscopy procedure, colonoscopic studies


## Load SciSpaCy models

In [14]:
import spacy
import scispacy
from spacy import displacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from negspacy.negation import Negex

In [13]:
# nlp0 = spacy.load("en_core_sci_sm")
# nlp_en_core_sci_md = spacy.load("en_core_sci_md")
# nlp_en_core_sci_lg = spacy.load("en_core_sci_lg")
# nlp_en_ner_bc5cdr_md = spacy.load("en_ner_bc5cdr_md")
# nlp_en_core_sci_scibert = spacy.load("en_core_sci_scibert")
# nlp1 = spacy.load("en_core_sci_md")

## Sample Input Clinical Note

In [23]:
clinical_note = "Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, \
states chest tightness. No evidence of hypertension.\
Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating associated with pneumonia. \
Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine once used in the last year. Alcoholism unlikely.\
Patient has headache and fever. Patient is not diabetic. \
No signs of diarrhea. Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. \
Patient also has a history of cardiac injury. No kidney injury reported. No abnormal rashes or ulcers. \
Patient might not have liver disease. Confirmed absence of hemoptysis. Although patient has severe pneumonia and fever \
, test reports are negative for COVID-19 infection. COVID-19 viral infection absent."

In [22]:
clinical_note = """Type 2 diabetes is an impairment in the way the body regulates and uses sugar (glucose) as a fuel. This long-term (chronic) condition results in too much sugar circulating in the bloodstream. Eventually, high blood sugar levels can lead to disorders of the circulatory, nervous and immune systems.
In type 2 diabetes, there are primarily two interrelated problems at work. Your pancreas does not produce enough insulin — a hormone that regulates the movement of sugar into your cells — and cells respond poorly to insulin and take in less sugar.

Type 2 diabetes used to be known as adult-onset diabetes, but both type 1 and type 2 diabetes can begin during childhood and adulthood. Type 2 is more common in older adults, but the increase in the number of children with obesity has led to more cases of type 2 diabetes in younger people.

There's no cure for type 2 diabetes, but losing weight, eating well and exercising can help you manage the disease. If diet and exercise aren't enough to manage your blood sugar, you may also need diabetes medications or insulin therapy.
"""

In [60]:
clinical_note = """Insulin is a hormone made by your pancreas that acts like a key to let blood sugar into the cells in your body for use as energy. If you have type 2 diabetes, cells don’t respond normally to insulin; this is called insulin resistance. Your pancreas makes more insulin to try to get cells to respond. Eventually your pancreas can’t keep up, and your blood sugar rises, setting the stage for prediabetes and type 2 diabetes. High blood sugar is damaging to the body and can cause other serious health problems, such as heart disease,  vision loss, and kidney disease."""


## NLP pipeline Implementation

Lemmatizer — model (en_core_sci_sm)

In [24]:
# nlp = nlp_en_ner_bc5cdr_md
nlp.add_pipe("negex", config={"ent_types":["DISEASE", "CHEMICAL", "NEG_ENTITY"]})

<negspacy.negation.Negex at 0x14ca60cd0>

In [25]:
#lemmatizing the notes to capture all forms of negation(e.g., deny: denies, denying)
def lemmatize(note, nlp):
    doc = nlp(note)
    lemNote = [wd.lemma_ for wd in doc]
    return " ".join(lemNote)
lem_clinical_note= lemmatize(clinical_note, nlp)
#creating a doc object using BC5CDR model
doc = nlp(lem_clinical_note)
# doc = nlp_en_core_sci_scibert(lem_clinical_note)

  global_matches = self.global_matcher(doc)


In [26]:
for ent in doc.ents:
    print(ent.text, ent.label_)

patient rest ENTITY
bed ENTITY
patient ENTITY
azithromycin ENTITY
difficulty ENTITY
patient ENTITY
audible wheezing ENTITY
state chest tightness ENTITY
hypertension ENTITY
patient ENTITY
nausea ENTITY
time ENTITY
zofran ENTITY
decline ENTITY
patient ENTITY
intermittent sweating ENTITY
associate with ENTITY
pneumonia ENTITY
patient refuse pain ENTITY
tylenol ENTITY
substance abuse ENTITY
alcohol use ENTITY
cocaine ENTITY
year ENTITY
alcoholism ENTITY
unlikely ENTITY
patient ENTITY
headache ENTITY
fever ENTITY
patient ENTITY
diabetic ENTITY
diarrhea ENTITY
lab report ENTITY
lymphocytopenia ENTITY
cardaic rhythm ENTITY
sinus bradycardia ENTITY
patient ENTITY
history ENTITY
cardiac injury ENTITY
kidney injury ENTITY
report ENTITY
abnormal rash ENTITY
ulcer ENTITY
patient ENTITY
liver disease ENTITY
absence ENTITY
hemoptysis ENTITY
patient ENTITY
severe ENTITY
pneumonia ENTITY
fever ENTITY
test report ENTITY
negative ENTITY
covid-19 infection ENTITY
covid-19 ENTITY
viral infection ENTITY
ab

## Feature Extractor: Named Entities

In [27]:
#function to modify options for displacy NER visualization
def get_entity_options():
    entities = ["DISEASE", "CHEMICAL", "NEG_ENTITY","PERSON","ORG"]
    colors = {'DISEASE': 'linear-gradient(180deg, #66ffcc, #abf763)', 
              'CHEMICAL': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 
              "NEG_ENTITY":'linear-gradient(90deg, #ffff66, #ff6600)',
              'PERSON': 'linear-gradient(180deg, #66ffcc, #abf763)',
              'ORG': 'linear-gradient(90deg, #66ffcc, #abf763)'}
    options = {"ents": entities, "colors": colors}    
    return options
options = get_entity_options()
#visualizing identified Named Entities in clinical input text 
displacy.render(doc, style='ent', options=options)


## Feature Extractor: Negation Detection

In [70]:
import spacy
from negspacy.negation import Negex

In [71]:
# nlp = spacy.load("en_core_web_md")
nlp = nlp_en_ner_bc5cdr_md
nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})

<negspacy.negation.Negex at 0x4b4c7bb20>

In [None]:
# doc = nlp("She does not like Steve Jobs but likes Apple products.")

# for e in doc.ents:
# 	print(e.text, e._.negex)

In [28]:
doc = nlp("No evidence of hypertension.")

for e in doc.ents:
	print(e.text, e._.negex)
    
# doc = nlp1("Does not lik hypertension.")

# for e in doc.ents:
# 	print(e.text, e._.negex)

No evidence False
hypertension False


## NegEx Patterns

- psuedo_negations - phrases that are false triggers, ambiguous negations, or double negatives
- preceding_negations - negation phrases that precede an entity
- following_negations - negation phrases that follow an entity
- termination - phrases that cut a sentence in parts, for purposes of negation detection (.e.g., "but")

## Termsets

Designate termset to use, en_clinical is used by default.

- en = phrases for general english language text
- en_clinical DEFAULT = adds phrases specific to clinical domain to general english
- en_clinical_sensitive = adds additional phrases to help rule out historical and possibly irrelevant entities

In [29]:
# from negspacy.negation import Negex
# from negspacy.termsets import termset

# ts = termset("en_clinical")

# nlp = spacy.load("en_core_web_sm")
# nlp.add_pipe(
#     "negex",
#     config={
#         "neg_termset":ts.get_patterns()
#     }
# )

## Negations in noun chunks

Depending on the Named Entity Recognition model you are using, you may have negations "chunked together" with nouns. For example:

In [30]:
import spacy
import scispacy

# nlp = spacy.load("en_core_sci_sm")
# ts = termset("en_clinical")
# nlp.add_pipe(
#     "negex",
#     config={
#         "chunk_prefix": ["yes"],
#     },
#     last=True,
# )
# nlp.add_pipe("negex")
doc = nlp("No evidence of hypertension.")
for e in doc.ents:
    print(e.text, e._.negex)

No evidence False
hypertension False


In [31]:
import spacy
import scispacy

#adding a new pipeline component to identify negation
def neg_model(nlp_model):
#     nlp = spacy.load(nlp_model, disable = ['parser'])
#     nlp.add_pipe(nlp.create_pipe('sentencizer'))
    nlp = spacy.load(nlp_model)
#     ts = termset("en_clinical")
    nlp.add_pipe("negex")
    return nlp

"""
Negspacy sets a new attribute e._.negex to True if a negative concept is encountered
"""
def negation_handling(nlp_model, note, neg_model):
    results = []
    nlp = neg_model(nlp_model) 
    note = note.split(".") #sentence tokenizing based on delimeter 
    note = [n.strip() for n in note] #removing extra spaces at the begining and end of sentence
    docs = []
    entities = []
    for t in note:
        doc = nlp(t)
        for e in doc.ents:
            rs = str(e._.negex)
#             print('rs', rs, 'e.text', e.text, 'e.start', e.start, 'e.end', e.end, 'e.label_', e.label_)
            if rs == "True": 
                results.append(e.text)
#                 entities.append(Span(doc, e.start, e.end, label="NEG_ENTITY"))
#             else:
#                 entities.append(e)
            entities.append(Span(doc, e.start, e.end, label=e.label_))
#         doc.ents = entities
        docs.append(doc)
    return results, docs

#list of negative concepts from clinical note identified by negspacy
results0, docs = negation_handling("en_ner_bc5cdr_md", lem_clinical_note, neg_model)

for t in docs[1]:
    print(t.ent_type_)
    
for d in docs:
    displacy.render(d, style='ent', options=options)





CHEMICAL







In [32]:
results0

['hypertension',
 'alcoholism',
 'diarrhea',
 'abnormal rash',
 'ulcer',
 'hemoptysis',
 'covid-19 infection']

In [33]:
#function to identify span objects of matched megative phrases from clinical note
def match(nlp,terms,label):
#         print('terms', terms)
        patterns = [nlp.make_doc(text) for text in terms]
#         print('patterns', patterns)
        matcher = PhraseMatcher(nlp.vocab)
        matcher.add(label, None, *patterns)
        return matcher
    
#replacing the labels for identified negative entities
def overwrite_ent_lbl(matcher, doc):
#     print('doc', doc)
    matches = matcher(doc)
#     print('matches', matches)
    seen_tokens = set()
    new_entities = []
    entities = doc.ents
    for match_id, start, end in matches:
#         print('match_id, start, end', match_id, start, end)
        if start not in seen_tokens and end - 1 not in seen_tokens:
            new_entities.append(Span(doc, start, end, label=match_id))
            entities = [
                e for e in entities if not (e.start < end and e.end > start)
            ]
            seen_tokens.update(range(start, end))
            doc.ents = tuple(entities) + tuple(new_entities)
    return doc

doc0 = nlp(lem_clinical_note)

print('results0', results0)
matcher = match(nlp, results0, "NEG_ENTITY")
#doc0: new doc object with added "NEG_ENTITY label"
doc0 = overwrite_ent_lbl(matcher,doc)
#visualizing identified Named Entities in clinical input text 
displacy.render(doc0, style='ent', options=options)

results0 ['hypertension', 'alcoholism', 'diarrhea', 'abnormal rash', 'ulcer', 'hemoptysis', 'covid-19 infection']


## UMLS Entity Linking with NegEx

In [34]:
import spacy
import scispacy
from negspacy.negation import Negex
from scispacy.umls_linking import UmlsEntityLinker

# nlp = spacy.load("en_core_sci_sm")
# nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
# nlp.add_pipe("negex", last=True)

In [35]:
doc = nlp("Patient denies cardiovascular disease but has headaches. No history of smoking. Alcoholism unlikely. Smoking not ruled out.")

for e in doc.ents:
    print(e.text, e._.negex, e._.umls_ents)

Patient False [('C0030705', 1.0), ('C1550655', 1.0), ('C1578481', 1.0), ('C1578483', 1.0), ('C1578484', 1.0)]
cardiovascular disease False [('C0007222', 1.0), ('C1971810', 0.85407555103302), ('C0007226', 0.850929856300354), ('C3887460', 0.850929856300354), ('C0027821', 0.8491292595863342)]
headaches False [('C0018681', 1.0), ('C0270697', 0.8570944666862488), ('C3829986', 0.8479108214378357), ('C4553197', 0.8460302352905273), ('C0033893', 0.8227493166923523)]
No history of smoking False [('C1519384', 0.9371702671051025), ('C4721824', 0.9371702671051025), ('C0037369', 0.7193564772605896), ('C0453996', 0.7193564772605896), ('C1881674', 0.7193564772605896)]
Alcoholism False [('C0001973', 1.0), ('C1145676', 0.8264809250831604), ('C1156315', 0.8172153234481812), ('C1156369', 0.8154693245887756), ('C0687725', 0.8103371858596802)]
Smoking False [('C0037369', 1.0), ('C0453996', 1.0), ('C1548578', 1.0), ('C1881674', 1.0), ('C3696939', 0.8300524950027466)]


In [36]:
txt = """Insulin is a hormone made by your pancreas that acts like a key to let blood sugar into the cells in your body for use as energy. If you have type 2 diabetes, cells don’t respond normally to insulin; this is called insulin resistance. Your pancreas makes more insulin to try to get cells to respond. Eventually your pancreas can’t keep up, and your blood sugar rises, setting the stage for prediabetes and type 2 diabetes. High blood sugar is damaging to the body and can cause other serious health problems, such as heart disease,  vision loss, and kidney disease."""
doc = nlp(txt)

for e in doc.ents:
    print(e.text, e._.negex, e._.umls_ents)


Insulin False [('C0021641', 1.0), ('C0202098', 1.0), ('C0795635', 1.0), ('C1337112', 1.0), ('C1533581', 1.0)]
hormone False [('C0019932', 1.0), ('C4521914', 1.0), ('C1148622', 0.820817768573761), ('C0458083', 0.7973387241363525), ('C0065743', 0.7915337085723877)]
pancreas False [('C0030274', 0.9999999403953552), ('C0771711', 0.9999999403953552), ('C0346647', 0.9047948718070984), ('C1268125', 0.8574094176292419), ('C0235974', 0.8280432820320129)]
blood sugar False [('C0005802', 1.0), ('C0392201', 1.0), ('C0020615', 0.9060095548629761), ('C0020456', 0.8550583720207214), ('C2825162', 0.8487322926521301)]
cells False [('C0007584', 0.9999999403953552), ('C0007634', 0.9999999403953552), ('C0023172', 0.9094010591506958), ('C0524983', 0.9013593196868896), ('C0004561', 0.8828330636024475)]
your body False [('C4683228', 0.7381472587585449)]
energy False [('C0424589', 1.0), ('C0542479', 1.0), ('C1442080', 1.0), ('C1547025', 1.0), ('C4330907', 1.0)]
type 2 diabetes False [('C0011860', 1.0), ('C141

In [37]:
import spacy
import scispacy
# import swifter
import pandas as pd
from spacy import displacy
import en_core_sci_sm
#import en_core_sci_md
# import en_ner_bc5cdr_md
# import en_ner_jnlpba_md
# import en_ner_craft_md
# import en_ner_bionlp13cg_md
from scispacy.abbreviation import AbbreviationDetector
from scispacy.linking import EntityLinker
from collections import OrderedDict,Counter
from pprint import pprint
from tqdm import tqdm
tqdm.pandas()



In [38]:
sample_text= """CORONAVIRUS RESEARCH: KEYS TO DIAGNOSIS, TREATMENT, AND PREVENTION OF SARS
Mark R. Denison, M.D.
For coronavirus investigators, the recognition of a new coronavirus as the cause of severe acute respiratory syndrome (SARS) was certainly remarkable, yet perhaps not surprising (Baric et al., 1995). The cadre of investigators who have worked with this intriguing family of viruses over the past 30 years are familiar with many of the features of coronavirus biology, pathogenesis, and disease that manifested so dramatically in the worldwide SARS epidemic. Advances in the biology of coronaviruses have resulted in greater understanding of their capacity for adaptation to new environments, transspecies infection, and emergence of new diseases. New tools of cell and molecular biology have led to increased understanding of intracellular replication and viral cell biology, and the advent in the past five years of reverse genetic approaches to study coronaviruses has made it possible to begin to define the determinants of viral replication, transpecies adaptation, and human disease. This summary will discuss the basic life cycle and replication of the well-studied coronavirus, mouse hepatitis virus (MHV), identifying the unique characteristics of coronavirus biology and highlighting critical points where research has made significant advances, and which might represent targets for antivirals or vaccines. Areas where rapid progress has been made in SCoV research will be described. Finally, areas of need for research in coronavirus replication, genetics, and pathogenesis will be summarized.
Coronavirus Life Cycle
The best studied model for coronavirus replication and pathogenesis has been the group 2 murine coronavirus, mouse hepatitis virus, and much of what is known of the stages of the coronavirus life cycle has been determined in animals and in culture using this virus. Thus this discussion will focus on MHV with comparisons to SCoV and other coronaviruses. This is appropriate because bioinformatics analyses suggest that SCoV, while a distinct virus, has significant similarities in organization, putative protein functions, and replication to the group II coronaviruses, particularly within the replicase gene (Snijder et al., 2003). Excellent, detailed reviews of MHV and coronavirus replication are available elsewhere (Holmes and Lai, 1996; Lai and Cavanagh, 1997).
The coronavirus virion is an enveloped particle containing the spike (S), membrane (M), and envelope (E) proteins. In addition, some strains of coronaviruses, but not SCoV, express a hemagglutinin protein (HE) that is also incorporated in the virion. The genome of coronaviruses is a linear, single-stranded RNA molecule of positive (mRNA) polarity, and from 28 to 32 kb in length (Bonilla et al., 1994; Drosten et al., 2003; Lee et al., 1991). Within the virion, the genome is encapsidated by multiple copies of the nucleocapsid protein (N), and has the conformation of a helical RNA/nucleocapsid structure. The S protein has been a focus of pathogenesis studies in mice because it appears to be the critical determinant of cell tropism, species specificity, host selection, cell tropism, and disease (Navas and Weiss, 2003; Navas et al., 2001; Rao and Gallagher, 1998).
Virus replication is initiated by binding of the S protein to specific receptors on the host cell surface. For MHV, the primary receptor has been shown to be the carcino-embryonic antigen–cell adhesion molecule (CEACAM) (Dveksler et al., 1991; 1996; Holmes and Lai, 1996; Yokomuri and Lai, 1992), and for the human coronavirus, HCoV-229E, and other group 1 coronaviruses, the receptor is aminopeptidase N (Yeager et al., 1992). The precise mechanisms of entry and uncoating have yet to be defined, but likely occur by either fusion from without or viroplexis through endocytic vesicles. For wildtype MHV, entry and uncoating constitute a pH independent process that is probably direct fusion mediated by a fusion peptide in the S protein (Gallagher et al., 1991). The understanding of the region of the S1 component of coronavirus that binds to receptors was the basis for studies leading to the very recent and very rapid identification of angiotensin converting enzyme 2 (ACE 2) as a receptor for SCoV (Li et al., 2003).
The next discrete stage in the life cycle is translation and proteolytic processing of viral replicase proteins from the input genome RNA, followed by formation of cytoplasmic replication complexes in association with cellular membranes (Denison et al., 1999; Gosert et al., 2002; Shi et al., 1999; van der Meer et al., 1999). Replication complexes are thought to be sites of all stages of viral RNA transcription and replication, and possibly assembly of nascent viral nucleocapsids. Viral assembly occurs both temporally and physically distinct from viral replication complexes in the endoplasmic-reticulum-Golgi-intermediate compartment (ERGIC), a transitional zone between late ER and Golgi (deVries et al., 1997; Klumperman et al., 1994; Krijnse-Locker et al., 1994; Rottier and Rose, 1987). Although the mechanisms by which replication products are delivered to sites of assembly remain to be determined, it has been shown that subpopulations of replicase proteins and the structural nucleocapsid (N) translocate from replication complexes to sites of assembly and may mediate the process in association with cellular membrane/protein trafficking pathways (Bost et al., 2000). Virus assembly in the ERGIC involves interactions of genome RNA, N, the membrane protein (M), and the small membrane protein (E), resulting in budding of virions into the lumen of ER/Golgi virosomes (Opstetten et al., 1995). Further maturation of virus particles occurs during movement through the Golgi, resulting in virosomes filled with mature particles (Salamuera et al., 1999). Trafficking of the virosomes to the cell surface has not been well characterized, but is presumed to occur via normal vesicle maturation and exocytic processes. The outcome is the nonlytic release of the vast majority of mature virions into the extracellular space. For MHV and several other coronaviruses that can directly fuse with cells, there is a characteristic and rapidly detectable cytopathic effect of cell-cell fusion into multinucleated syncytia. Production of infectious virus continues even after the majority of cells are fused. Syncytia were recently reported as a readout of SCoV receptor expression and cell infection (Li et al., 2003).
Viral Replication Complex Formation and Function
Following entry and uncoating, the 5′ most replicase gene of the input positive strand RNA genome is translated into two co-amino terminal replicase polyproteins that are co- and post-translationally processed by viral proteinases to yield 15 to 16 mature replicase proteins, as well as intermediate precursors. The nascent replicase polyproteins and intermediate precursors likely mediate the formation of viral replication complexes in the host cell cytoplasm. Interestingly, coronavirus replication requires continuous replicase gene translation and processing throughout the life cycle to maintain productive infection (Kim et al., 1995; Perlman et al., 1987; Sawicki and Sawicki, 1986). Replication complexes of MHV are associated with double-membrane vesicles (Gosert et al., 2002), and all tested MHV replicase proteins have been shown to colocalize to replication complexes at the earliest time of detection, likely both by membrane integration and by protein-protein and protein-RNA interactions (Bost et al., 2000; Denison et al., 1999; Prentice and Denison, in press; Shi et al., 1999; Sims et al., 2000; van der Meer et al., 1999). Further, replicase proteins likely mediate the process of double-membrane vesicle formation, likely by induction of cellular autophagy pathways (E. Prentice, unpublished results).
Coronavirus replication complexes are sites for replicase gene translation and replicase polyprotein processing, and also for viral RNA synthesis. Replicase gene proteins likely mediate positive-strand, negative-strand, subgenomic, and genomic RNA synthesis, as well as processes of capping, polyadenylation, RNA unwinding, template switching during viral RNA synthesis, and discontinuous transcription and transcription attenuation. The coronavirus replicase polyproteins and mature replicase proteins represent the largest and most diverse repertoire of known and predicted distinct enzymatic functions of any positive-strand RNA virus family. Until recently, of the 15 or more mature replicase proteins, only the proteinase, RNA helicase, and RNA-dependent RNA polymerase activities had been predicted or experimentally confirmed (Brockway et al., 2003; Heusipp et al., 1997; Lee et al., 1991; Ziebuhr et al., 2000). With the advent of SARS, more extensive bioinformatics analyses have resulted in predictions of several additional functions involved in RNA processing, including methyltransferase and exonuclease activities (Snijder et al., 2003; Thiel et al., 2003). Even with inclusion of distant predicted relationships, up to eight of the replicase proteins remain without predicted or confirmed functions. In summary, it is likely that coronaviruses have exploited their genetic capacity to encode proteins in the replicase gene with distinct functions in RNA synthesis and processing, as well as proteins with specific roles in induction or modification in host cellular membrane biogenesis and trafficking, delivery of replication products to sites of assembly, and possibly virus assembly. Thus replicase translation, replicase polyprotein processing, and mature replicase proteins constitute important targets for interference with coronavirus replication, virus-cell interactions, or viral pathology.
Coronavirus Replicase Protein Expression and Processing
The proteinase activities for all coronaviruses include both papain-like proteinase (PLP) and picornavirus 3C-like proteinase activities that are encoded within the replicase polyproteins and mediate both cis and trans cleavage events (Ziebuhr et al., 2000). Because of the parallel evolution of the proteinases, their cleavage sites, and the hierarchical cleavage processes, the proteolytic processing of the coronavirus replicase proteins may serve as distinct regulatory and genetic elements (Ziebuhr et al., 2001). Specifically, there are both conserved and divergent regions of the replicase polyproteins by amino acid identity and similarity, with the sequences and predicted mature proteins beginning with the 3C-like proteinases through the carboxy terminus of the replicase polyprotein retaining higher identity and similarity across the predicted proteins. In contrast, the amino-terminal third of the replicase demonstrates the most variation in proteins, cleavage site locations, and the number of proteinases that mediate maturation processing. SCoV appears to have the general organization of, and similar protein sizes to, the group 2 coronaviruses such as MHV in this part of the genome (Snijder et al., 2003). However, SCoV likely uses only one PLP to mediate the cleavages, similar to the group 3 coronavirus infectious bronchitis virus (IBV). Thus this region of the replicase may experience the most variability, suggesting either the encoding of accessory functions that are flexible and tolerant of changes, or conversely group or host-specific roles that are subject to pressure for more rapid change.
Expression of Structural and Accessory Genes
Only the 5′ most replicase gene is translated from the input positive-strand genome RNA. The genome contains multiple other genes for the known structural proteins S, E, M, and N, as well as other genes for expression of proteins that have been labeled as “nonstructural” or “accessory” because they have been presumed to not be required for replication, and are not thought to be incorporated into virions. MHV encodes six of these genes, while SCoV encodes possibly up to 11 structural and accessory genes, which are expressed from subgenomic mRNAs (Snijder et al., 2003). Subgenomic RNA transcription occurs during minus- strand RNA synthesis by acquisition of the antileader RNA sequences from the 5′ end of the genome via homology to a transcriptional regulatory sequence (TRS, also known as an intergenic sequence), and requiring a discontinuous activity of the nascent minus-strand template and polymerase complex to acquire the leader (Sawicki and Sawicki, 1998). The outcome of transcription is the generation of a “nested set” of subgenomic negative-strand RNAs that all contain the antileader sequences that serve as templates for similar size subgenomic mRNAs. This transcriptional strategy exposes different genes as the 5′ ORF in different mRNAs, all of which also contain the 3′ sequence downstream of the gene, including the 3′ nontranslated region of the genome.
For MHV, genes 3, 5b, 6, and 7 encode S, E, M, and N, respectively. Genes 2, 4, and 5a are not required for replication in culture, and have been mutated to block expression, deleted, or substituted with noncoronavirus genes such as GFP (de Haan et al., 2002; Ortego et al., 2003; Sarma et al., 2002). Because all coronaviruses retain these genes in various combinations in the face of presumed pressure for genetic economy and apparent lack of functions in RNA synthesis, it is presumed that these genes serve roles in modification of host cells, pathogenesis, or interactions with the immune system. SCoV encodes a larger and more complex array of these genes than MHV or other coronaviruses, which may reflect its evolution in its original animal host (Ksiazek et al., 2003; Marra et al., 2003; Snijder et al., 2003; Thiel et al., 2003). In addition, the report of a deletion within one of the accessory genes in human isolates of SCoV suggests that this may be a gene involved in host range or adaptation for replication and transmission in humans (Guan et al., 2003).
Coronavirus Genetics
Until recently, the genetics of coronavirus replication and pathogenesis have largely been studied using natural variants, host range mutants, passaged virus, and mutagenized viruses selected for temperature sensitivity and specific phenotypes. Classical complementation of functions made it possible to define at least eight genetic groups for MHV, with most of the complementation groups localized to the replicase gene (Stalcup et al., 1998). Taking advantage of naturally high rates of homologous RNA-RNA recombination and of host range determinants in the S protein, the development of targeted recombination has allowed more defined and detailed studies of the accessory and structural genes of MHV, transmissible gastroenteritis virus (TGEV), and feline infectious peritonitis virus (FIPV) (Haijema et al., 2003; Kuo et al., 2000; Masters et al., 1994). Studies with natural variants and targeted recombination genetic studies have demonstrated that the S protein is the major determinant of host range, tropism, and pathogenesis; other genetic elements, possibly in the replicase, may influence these characteristics of different coronaviruses (Navas and Weiss, 2003). The capacity of coronaviruses to change host range, transmission, pathogenesis, and disease has been established in the laboratory using cell adaptation and virus passage (Baric et al., 1997, 1999; Chen and Baric, 1995, 1996), and has been demonstrated in nature by natural variants of MHV, TGEV, and bovine coronavirus (BCoV), as well as by studies using heterologous viruses such as canine coronavirus (CcoV) to immunize cats against FIPV (Enjuanes et al., 1995; Tresnan et al., 1996). Further, targeted recombination studies have confirmed the genetic flexibility of the coronavirus genome and the ability of coronaviruses to recover wild-type replication following deletions, mutations, substitutions, and gene order rearrangements in the structural and accessory genes (de Haan et al., 2002).
Challenges for genetic studies using natural variants and mutants, particularly in defining the precise changes responsible for altered phenotypes, has limited progress in genetic studies. Targeted recombination, while a robust system with powerful selection, has been limited to studies of the 3′ 10 kb of the MHV genome, and is limited to selection of viable recombinants. Recently, the establishment of “infectious clone” reverse genetic strategies for the coronaviruses TGEV (Transmissible Gastroenteritis Coronavirus), HCoV-229E, IBV, and MHV has made it theoretically possible to study the genetics of the entire genome and all of the structural, accessory, and replicase genes. Approaches to “infectious cloning” have included full-length cDNA clones of TGEV genome in bacterial artificial chromosomes (Gonzalez et al., 2001), recombinant vaccinia viruses containing full-length cDNA clones of HCoV-229E and IBV genomes (Casais et al., 2001; Thiel et al., 2001), and in vitro assembly strategies for TGEV, MHV, and most recently, SCoV (Yount, 2000; Yount et al., 2002, 2003).
The in vitro assembly approach was developed to overcome the challenge of full-length cDNA cloning of the TGEV and MHV genomes, which contained “toxic” regions in the replicase gene, resulting in unstable or toxic clones in E. coli (Yount, 2000; Yount et al., 2002). Subcloning of the regions required splitting the toxic domains into separate clones. The result of this strategy was the cloning of the MHV genome into seven fragments (A through G). To recover viable virus, the following strategy is pursued: (1) cloned cDNA fragments are excised from plasmid using class 2 restriction enzymes that remove the recognition site and leave overhanging genomic sequence; (2) excised fragments are ligated (assembled) in vitro; (3) transcription of full-length genomic RNA is driven in vitro using a T7 promoter on the 5′ fragment A; (4) full-length genome RNA is electroporated into competent cells that are then plated on a monolayer of naturally permissive cells; and (5) cells are monitored for cytopathic effect or plaques, and virus is recovered from plaques or media supernatant.
In vitro assembly has many advantages for genetic studies of such a large and complex genome RNA. First, genetic changes can be introduced and confirmed in stable small fragments without the need for a ~30kb genomic clone. Second, the cloned fragments make it possible to develop libraries of mutations that can rapidly be tested in different combinations. Furthermore, identification of putative second-site reversion mutations for deleterious introduced changes can be introduced with the original mutation to confirm their reversion potential. In combination with biochemical and cell imaging approaches, it also is possible to study highly defective or lethal mutations in electroporated cells, in order to define critical determinants of replication. The in vitro assembly approach has been used to introduce marker mutations that are silent for replication in culture (Yount et al., 2003). In addition, we have engineered mutations in the MHV replicase gene to define the requirements for polyprotein processing and to determine the role of specific replicase proteins in replication in culture and in pathogenesis in animals. Using this approach we have recovered viruses with mutations at polyprotein cleavage sites and proteinase catalytic residues, all of which have distinct phenotypes in protein processing, viral growth, and viral RNA synthesis (unpublished results). Thus, direct reverse genetic studies of the critical replicase gene functions can be performed using in vitro assembly of infectious clones.
Advances in SCoV Research
The rapid progress in the identification and characterization of SCoV as the etiologic agent of SARS was made possible by the fact that the virus grows well in culture, and by the foundational research in coronaviruses that has been supported by the National Institutes of Health, the Multiple Sclerosis Foundation, the U.S. Department of Agriculture, and other organizations over the past two decades. The application of knowledge concerning virus structure, genetics, receptor binding, virus entry, and viral pathogenesis has made it possible to target the spike protein for studies of SCoV replication, pathogenesis, and immune response (Xiao et al., 2003). The remarkably rapid identification of ACE 2 as a receptor for SARS has demonstrated the foundational importance of studies of other coronaviruses (Li et al., 2003). Similarly, understanding of replicase gene expression, processing, and predicted functions has identified possible targets for structure/function studies and possible therapeutic intervention. The studies of coronavirus proteinase activities, cleavage site, and structures were the basis for studies leading to the rapid determination of SCoV replicase polyprotein cleavage sites and 3CLpro crystal structure (Anand et al., 2003; Campanacci et al., 2003; Snijder et al., 2003; Thiel et al., 2003).
Application of Reverse Genetics to Studies of SCoV
Because of the potential for reemergence of SARS, it is important to move forward with research in diagnostics, vaccines, and therapeutics for SCoV. Experience with the development and use of reverse genetics to study other coronaviruses resulted in establishment of reverse genetics for SCoV within months of the onset of the worldwide epidemic (Yount et al., 2003). How should the understanding of other coronaviruses, the rapid advances in research with SCoV, and the development of reverse genetics for SCoV be harnesssed to achieve these goals and attack these critical questions in SCoV replication, pathogenesis, and disease? Certainly, the use of SCoV reverse genetics, along with robust tissue culture systems and emerging animal models, creates the potential to rapidly answer questions concerning: (1) determinants of virus growth in culture; (2) potential mechanisms of transpecies adaptation; (3) sensitivity to and escape from biochemical and immune interference with replication; (4) determinants of virulence and pathogenesis; (5) mechanisms of genome recombination and mutation; (6) functions of and requirements for replicase, structural, and accessory proteins; and (7) development of stably attenuated viruses for use as seed stocks for inactivated vaccine or testing as live-attenuated vaccines.
How then should these critical issues be investigated while recognizing the potential of SCoV to cause severe disease, as well as the potential for rapid spread? First, there is significant experience with other coronaviruses in attenuation of virus replication and pathogenesis, both using virus passage and by direct engineering of changes. Although coronavirus genome organization, proteins, and replication appear more tolerant of changes then previously thought, all changes of gene order, gene deletion, insertion, or mutagenesis so far reported have led to viruses impaired in replication, pathogenesis, or both. Many of the attenuating changes in MHV and other coronaviruses are conserved in SCoV and thus could be tested for likely attenuation in SCoV culture and animal models. Second, where there is clear conservation of sequences, motifs, proteins, or putative functions between SCoV and model viruses such as MHV, new or untested changes might be most rapidly analyzed under BSL2 conditions in those model viruses, and then directly applied to SARS once their phenotypes are determined. Third, all work with SCoV will be performed only under BSL3 conditions. This would also apply to chimeric viruses, whether engineered by introduction into the SCoV background, or by introducing SCoV proteins or sequences with known or predicted pathogenic consequences into other coronavirus backgrounds. Finally, it is important to develop strains of SCoV that are attenuated and stabilized against reversion and recombination, to be used as the basis for studies of other replication and pathogenesis determinants and construction of virus chimeras. Such attenuated variants would provide additional safeguards while allowing application of powerful genetic tools to the study of SCoV emergence, biology, disease, treatment, and prevention. Overall, newly invigorated programs in other human and animal coronaviruses, combined with the new research in SCoV, will shed important new light on this important virus family and perhaps lead to better understanding of the potential for resurgence of SCoV or the emergence of other coronaviruses into human populations."""

In [39]:
def display_entities(nlp, document):
    """ A function that returns a tuple of displacy image of named or unnamed word entities and
        a set of unique entities recognized based on scispacy model in use
        Args: 
            model: A pretrained model from spaCy or ScispaCy
            document: text data to be analysed"""

    doc = nlp(document)
    displacy_image = displacy.render(doc, jupyter=True,style='ent')
    entity_and_label = set([(X.text, X.label_) for X in doc.ents])
    return  displacy_image, entity_and_label

In [40]:
bc5dr_entities = display_entities(nlp, sample_text)

  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]


In [41]:
bc5dr_entities_dataframe = pd.DataFrame(bc5dr_entities[1],columns=['Entity','Label'])  #save returned values of entities and their labels in a pandas dataframe
bc5dr_entities_dataframe['Ner_model'] = 'bc5dr'  #include a column with constant value of NER model
bc5dr_entities_dataframe

Unnamed: 0,Entity,Label,Ner_model
0,3C-like proteinases,ENTITY,bc5dr
1,silent,ENTITY,bc5dr
2,animals,ENTITY,bc5dr
3,cellular,ENTITY,bc5dr
4,structural genes,ENTITY,bc5dr
...,...,...,...
622,significant,ENTITY,bc5dr
623,infectious clone,ENTITY,bc5dr
624,prevention,ENTITY,bc5dr
625,cell cytoplasm,ENTITY,bc5dr


In [42]:
bionlp13cg_entities = display_entities(nlp,sample_text)

  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]


In [43]:
bionlp13cg_entities_dataframe = pd.DataFrame(bionlp13cg_entities[1],columns=['Entity','Label']) #save returned values of entities and their labels in a pandas dataframe
bionlp13cg_entities_dataframe['Ner_model'] = 'bionlp13cg'  #include a column with constant value of NER model
bionlp13cg_entities_dataframe

Unnamed: 0,Entity,Label,Ner_model
0,3C-like proteinases,ENTITY,bionlp13cg
1,silent,ENTITY,bionlp13cg
2,animals,ENTITY,bionlp13cg
3,cellular,ENTITY,bionlp13cg
4,structural genes,ENTITY,bionlp13cg
...,...,...,...
622,significant,ENTITY,bionlp13cg
623,infectious clone,ENTITY,bionlp13cg
624,prevention,ENTITY,bionlp13cg
625,cell cytoplasm,ENTITY,bionlp13cg


In [44]:
craft_entities = display_entities(nlp,sample_text)

  extended_neighbors[empty_vectors_boolean_flags] = numpy.array(neighbors)[:-1]
  extended_distances[empty_vectors_boolean_flags] = numpy.array(distances)[:-1]


In [45]:
def entity_linker(linker_name,document):
    """ A function that accepts the name of a scispacy knowledge base and documents and returns the entity link details"""
    linker = EntityLinker(k = 10,max_entities_per_mention = 2, name=linker_name)  #parameters are tunable,so it can be set to return more than 2 entity matches
    nlp = en_core_sci_sm.load()
    nlp.add_pipe(linker)
    doc = nlp(document)
    try:
        entity = doc.ents[0]
    except IndexError:
        entity = 'Nan'
    entity_details = []
    entity_details.append(entity)
    try:
        for linker_ent in entity._.kb_ents:
            Concept_Id, Score = linker_ent
            entity_details.append('Entity_Matching_Score :{}'.format(Score))
            entity_details.append(linker.kb.cui_to_entity[linker_ent[0]])
    except AttributeError:
        pass
    return entity_details

In [46]:
clinical_note = "Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, \
states chest tightness. No evidence of hypertension.\
Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating associated with pneumonia. \
Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine once used in the last year. Alcoholism unlikely.\
Patient has headache and fever. Patient is not diabetic. \
No signs of diarrhea. Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. \
Patient also has a history of cardiac injury. No kidney injury reported. No abnormal rashes or ulcers. \
Patient might not have liver disease. Confirmed absence of hemoptysis. Although patient has severe pneumonia and fever \
, test reports are negative for COVID-19 infection. COVID-19 viral infection absent."

In [47]:
test1 = 'fever'
entity_linker('umls',test1)

https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <scispacy.linking.EntityLinker object at 0x1a7632070> (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

In [None]:
import spacy
import scispacy

from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")

# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

doc = nlp(clinical_note)
          
# Let's look at a random entity!
entity = doc.ents[1]

print("Name: ", entity)

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
	print(linker.kb.cui_to_entity[umls_ent[0]])