# Biomedical NLP

## Rule-based TNM Extraction

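This example shows a deliberately simple regular expression for matching TNM expressions. It has several weaknesses: it accepts implausible values such as `T8N9M9`, it rejects whitespace-separated forms such as `T1 N0 M1`, and, because `re.match` anchors only at the start of the string, it ignores any text that follows the match.
A more realistic solution can be found here: https://github.com/hpi-dhc/onco-nlp/blob/master/onconlp/classification/rulebased_tnm.py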

In [1]:
import re

tnm_pattern = r"T\d+[a-zA-Z]*N\d+[a-zA-Z]*M\d+[a-zA-Z]*"

def check_valid(text):
    print("valid" if re.match(tnm_pattern, text) else "not valid")

In [2]:
check_valid('T1N0M1')

valid


In [3]:
check_valid('T1aN2M3')

valid


In [4]:
check_valid('T123')

not valid


In [5]:
check_valid('T8N9M9')

valid


In [6]:
check_valid('T1')

not valid


In [8]:
check_valid('T1 N0 M1')

not valid
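The failure cases above suggest a few ways to tighten the pattern. The following is a minimal sketch, not the approach taken in the linked repository. It assumes a simplified value range (T0-T4, Tis, TX; N0-N3, NX; M0-M1, MX), allows optional whitespace between the components, and uses `re.fullmatch` so that trailing text is rejected. The names `STRICTER_TNM` and `parse_tnm` are hypothetical.

```python
import re

# Assumption: simplified category ranges; real staging manuals define
# additional, site-specific subcategories and prefixes (e.g. "p", "c").
STRICTER_TNM = re.compile(
    r"T(?P<t>is|X|[0-4][a-d]?)\s*"   # tumor: Tis, TX, T0-T4 with optional suffix
    r"N(?P<n>X|[0-3][a-c]?)\s*"      # nodes: NX, N0-N3 with optional suffix
    r"M(?P<m>X|[01][a-c]?)"          # metastasis: MX, M0-M1 with optional suffix
)

def parse_tnm(text):
    """Return the T, N and M components as a dict, or None if the whole string does not match."""
    match = STRICTER_TNM.fullmatch(text.strip())
    return match.groupdict() if match else None

for candidate in ["T1N0M1", "T1 N0 M1", "T1aN2M3", "T8N9M9", "T1N0M1xyz"]:
    print(candidate, "->", parse_tnm(candidate))
```

With this pattern, `T1N0M1` and `T1 N0 M1` are accepted, while `T1aN2M3`, `T8N9M9` and `T1N0M1xyz` are rejected.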


## A more complex NLP Pipeline

Here we use the spaCy library with [scispaCy](https://allenai.github.io/scispacy/) models for domain-specific entity extraction, and scispaCy's entity linker to normalize the extracted entities against the MeSH vocabulary.

In [9]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz (14.8 MB)
  Preparing metadata (setup.py) ... done


In [10]:
import spacy
from scispacy.linking import EntityLinker

nlp = spacy.load('en_core_sci_sm')
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "mesh", "k" : 5})



<scispacy.linking.EntityLinker at 0x13b247a50>

In [11]:
text = "The patient underwent a CT scan in April. It did not reveal any abnormalities."

In [12]:
doc = nlp(text)

### Linguistic Analysis

Boundary detection / sentence splitting

In [13]:
for s in doc.sents:
    print(s)

The patient underwent a CT scan in April. It did not reveal any abnormalities.
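In the output above, both sentences are printed on a single line, meaning they were returned as one span. For comparison, a naive rule-based splitter can be sketched with a regular expression. This is an illustrative assumption, not part of spaCy or scispaCy. The function name `naive_sentence_split` is hypothetical, and the rule (split after `.`, `!` or `?` when followed by whitespace and an uppercase letter) will break on abbreviations such as `Dr. Smith`.

```python
import re

# Split at whitespace that follows sentence-final punctuation
# and precedes an uppercase letter (a deliberately crude heuristic).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def naive_sentence_split(text):
    """Return a list of sentence strings using the heuristic above."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]

sentences = naive_sentence_split(
    "The patient underwent a CT scan in April. It did not reveal any abnormalities."
)
for s in sentences:
    print(s)
```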


In [14]:
sentence = list(doc.sents)[0]

Tokenization

In [15]:
for token in sentence:
    print(token)

The
patient
underwent
a
CT
scan
in
April
.
It
did
not
reveal
any
abnormalities
.


Part-of-speech tagging

In [16]:
for token in sentence:
    print(token, token.pos_)

The DET
patient NOUN
underwent VERB
a DET
CT PROPN
scan NOUN
in ADP
April PROPN
. PUNCT
It PRON
did AUX
not PART
reveal VERB
any DET
abnormalities NOUN
. PUNCT


Noun chunking

In [17]:
for chunk in sentence.noun_chunks:
    print(chunk)

The patient
a CT scan
It
any abnormalities


Dependency parsing

In [18]:
from spacy import displacy

In [19]:
displacy.render(sentence, style="dep", jupyter=True, options={'distance' : 100})

### Information Extraction

Entity extraction

In [20]:
for e in sentence.ents:
    print('Entity:', e)

Entity: patient
Entity: CT scan
Entity: abnormalities


Entity normalization / linking

In [21]:
from IPython.display import display_markdown

In [22]:
linker = nlp.get_pipe("scispacy_linker")

In [23]:
for e in sentence.ents:
    display_markdown(f'__Entity: {e}__', raw=True)
    for entity_id, prob in e._.kb_ents:
        mesh_term = linker.kb.cui_to_entity[entity_id]
        print('Probability:', prob)
        print(mesh_term)

__Entity: patient__

Probability: 1.0
CUI: C0030705, Name: Patients
Definition: Individuals participating in the health care system for the purpose of receiving therapeutic, diagnostic, or preventive procedures.
TUI(s): T101
Aliases: (total: 1): 
	 Patient
Probability: 0.7927387356758118
CUI: C0017313, Name: Patient Care
Definition: The services rendered by members of the health profession and non-professionals under their supervision.
TUI(s): T058
Aliases: (total: 1): 
	 Care, Patient


__Entity: CT scan__

Probability: 0.8133351802825928
CUI: C3472245, Name: Single Photon Emission Computed Tomography Computed Tomography
Definition: An imaging technique using a device which combines TOMOGRAPHY, EMISSION-COMPUTED, SINGLE-PHOTON and TOMOGRAPHY, X-RAY COMPUTED in the same session.
TUI(s): T060
Aliases (abbreviated, total: 17): 
	 SPECT CT Scans, Scans, CT SPECT, Scans, SPECT CT, CT SPECT, Scan, CT SPECT, SPECT CT Scan, SPECT, CT, Scan, SPECT CT, SPECTs, CT, SPECT CT
Probability: 0.8039770722389221
CUI: C1699633, Name: CT Scan, PET
Definition: An imaging technique that utilizes positron emission tomography and computed tomography in a single machine.
TUI(s): T060
Aliases (abbreviated, total: 19): 
	 Scan, PET CT, PET CT Scans, CT PET Scan, Scan, PET-CT, CT Scans, PET, Scans, PET CT, PET Scan, CT, CT PET Scans, CT PET, Scan, CT PET


__Entity: abnormalities__

Probability: 1.0
CUI: C0000769, Name: anomalies
Definition: Used with organs for congenital defects producing changes in the morphology of the organ. It is used also for abnormalities in animals.
TUI(s): T169
Aliases: (total: 1): 
	 abnormalities
Probability: 0.8790650963783264
CUI: C0037268, Name: Skin Abnormalities
Definition: Congenital structural abnormalities of the skin.
TUI(s): T019
Aliases: (total: 3): 
	 Abnormality, Skin, Skin Abnormality, Abnormalities, Skin
Probability: 0.8616816401481628
CUI: C0018798, Name: Defects, Congenital Heart
Definition: Developmental abnormalities involving structures of the heart. These defects are present at birth but may be discovered later in life.
TUI(s): T019
Aliases (abbreviated, total: 15): 
	 Congenital Heart Defects, Congenital Heart Defect, Defect, Congenital Heart, Congenital Heart Diseases, Disease, Congenital Heart, Heart Abnormality, Malformation Of Heart, Heart Disease, Congenital, Heart Defects, Congenital, Abnormality, Heart
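The linker returns several candidates per mention, so a downstream step usually has to pick one or abstain. Below is a minimal sketch, assuming a hypothetical helper `best_candidate` and an arbitrary cutoff of 0.85 (not a scispaCy default). It is applied to `(concept_id, score)` pairs copied from the output above rather than to live `e._.kb_ents` values.

```python
def best_candidate(candidates, threshold=0.85):
    """Return the highest-scoring (concept_id, score) pair, or None if it is below the threshold."""
    if not candidates:
        return None
    concept_id, score = max(candidates, key=lambda pair: pair[1])
    return (concept_id, score) if score >= threshold else None

# Candidate lists copied from the linker output above.
patient = [("C0030705", 1.0), ("C0017313", 0.7927387356758118)]
ct_scan = [("C3472245", 0.8133351802825928), ("C1699633", 0.8039770722389221)]

print(best_candidate(patient))  # ('C0030705', 1.0)
print(best_candidate(ct_scan))  # None
```

Under this cutoff, the ambiguous `CT scan` mention is left unlinked instead of being mapped to a SPECT/CT or PET/CT concept.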

## Gene Named Entity Recognition

In [24]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_bionlp13cg_md-0.5.3.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_bionlp13cg_md-0.5.3.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_bionlp13cg_md-0.5.3.tar.gz (119.8 MB)
  Preparing metadata (setup.py) ... done


In [25]:
text = """Dual MAPK pathway inhibition with BRAF and MEK inhibitors in BRAF(V600E)-mutant NSCLC 
might improve efficacy over BRAF inhibitor monotherapy based on observations in BRAF(V600)-mutant melanoma"""

Specialized model for biological entities

In [26]:
bionlp = spacy.load('en_ner_bionlp13cg_md')
biodoc = bionlp(text)

In [27]:
for e in biodoc.ents:
    print('Entity:', e, ', Label:', e.label_)

Entity: MAPK , Label: GENE_OR_GENE_PRODUCT
Entity: BRAF , Label: GENE_OR_GENE_PRODUCT
Entity: MEK , Label: GENE_OR_GENE_PRODUCT
Entity: BRAF(V600E)-mutant NSCLC , Label: CANCER
Entity: BRAF , Label: GENE_OR_GENE_PRODUCT
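For downstream use, the recognized entities are often grouped by label and de-duplicated. Below is a minimal sketch in plain Python, using `(text, label)` pairs copied from the output above in place of `biodoc.ents`. The helper name `group_by_label` is hypothetical.

```python
from collections import defaultdict

def group_by_label(entities):
    """Map each label to the list of distinct entity texts, in order of first appearance."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

# Pairs copied from the NER output above.
entities = [
    ("MAPK", "GENE_OR_GENE_PRODUCT"),
    ("BRAF", "GENE_OR_GENE_PRODUCT"),
    ("MEK", "GENE_OR_GENE_PRODUCT"),
    ("BRAF(V600E)-mutant NSCLC", "CANCER"),
    ("BRAF", "GENE_OR_GENE_PRODUCT"),
]

grouped = group_by_label(entities)
for label, texts in grouped.items():
    print(label, texts)
```

With a loaded model, the same function could be fed `[(e.text, e.label_) for e in biodoc.ents]`.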


In [28]:
displacy.render(biodoc, style='ent', jupyter=True)