# Biomedical NLP

## Rule-based TNM Extraction

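This example uses a deliberately simple regular expression to match TNM expressions. It has known weaknesses: as the cells below show, it accepts implausible values such as `T8N9M9` and rejects space-separated forms such as `T1 N0 M1`.
A more realistic solution can be found here: https://github.com/hpi-dhc/onco-nlp/blob/master/onconlp/classification/rulebased_tnm.py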

In [1]:
import re

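# T, N and M components: a letter, one or more digits, then optional letter suffixes
tnm_pattern = r"T\d+[a-zA-Z]*N\d+[a-zA-Z]*M\d+[a-zA-Z]*"

def check_valid(text):
    # re.match anchors only at the start of the string, so trailing text is not rejected
    print("valid" if re.match(tnm_pattern, text) else "not valid")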

In [2]:
check_valid('T1N0M1')

valid


In [3]:
check_valid('T1aN2M3')

valid


In [4]:
check_valid('T123')

not valid


In [5]:
check_valid('T8N9M9')

valid


In [6]:
check_valid('T1')

not valid


In [8]:
check_valid('T1 N0 M1')

not valid
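
The cells above suggest two obvious improvements: restricting each component to plausible values and tolerating whitespace between components. The following is a minimal sketch of such a pattern, not a complete TNM grammar. The value ranges (T0 to T4, Tis, TX; N0 to N3, NX; M0 to M1, MX), the optional prefix letters, and the names `TNM_STRICT` and `is_valid_tnm` are simplifying assumptions made for illustration; the linked onco-nlp implementation covers far more cases.

```python
import re

# Simplified, assumed value ranges; real staging manuals define more variants.
TNM_STRICT = re.compile(
    r"""
    [cpyra]{0,2}T(?:X|is|[0-4][a-d]?)   # tumor component with optional prefix
    \s*
    [cpyra]{0,2}N(?:X|[0-3][a-c]?)      # lymph node component
    \s*
    [cpyra]{0,2}M(?:X|[01][a-c]?)       # metastasis component
    """,
    re.VERBOSE,
)

def is_valid_tnm(text: str) -> bool:
    """Return True if the whole string is a TNM expression under the assumed ranges."""
    # fullmatch requires the entire string to match, unlike re.match
    return TNM_STRICT.fullmatch(text.strip()) is not None

for example in ["T1N0M1", "T1 N0 M1", "pT2N1M0", "T1aN2M3", "T8N9M9", "T1"]:
    print(example, "valid" if is_valid_tnm(example) else "not valid")
```

Under these assumptions, `T1 N0 M1` is accepted while `T8N9M9` and `T1aN2M3` are rejected, because 8, 9 and M3 fall outside the allowed ranges.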


## A more complex NLP Pipeline

Here, we use the spaCy library with [scispaCy](https://allenai.github.io/scispacy/) models for domain-specific entity extraction. scispaCy's entity linker then maps the extracted entities to concepts in the MeSH vocabulary for normalization.

In [9]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz


In [10]:
import spacy
from scispacy.linking import EntityLinker

nlp = spacy.load('en_core_sci_sm')
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "mesh", "k" : 5})



<scispacy.linking.EntityLinker at 0x142c8e8d0>

In [11]:
text = "The patient underwent a CT scan in April. It did not reveal any abnormalities."

In [12]:
doc = nlp(text)

### Linguistic Analysis

Boundary detection / sentence splitting

In [13]:
for s in doc.sents:
    print(s)

The patient underwent a CT scan in April.
It did not reveal any abnormalities.


In [14]:
sentence = list(doc.sents)[0]

Tokenization

In [15]:
for token in sentence:
    print(token)

The
patient
underwent
a
CT
scan
in
April
.


Part-of-speech tagging

In [16]:
for token in sentence:
    print(token, token.pos_)

The DET
patient NOUN
underwent VERB
a DET
CT PROPN
scan NOUN
in ADP
April PROPN
. PUNCT


Noun chunking

In [17]:
for chunk in sentence.noun_chunks:
    print(chunk)

The patient
a CT scan


Dependency parsing

In [18]:
from spacy import displacy

In [19]:
displacy.render(sentence, style="dep", jupyter=True, options={'distance' : 100})

### Information Extraction

Entity extraction

In [20]:
for e in sentence.ents:
    print('Entity:', e)

Entity: patient
Entity: CT scan


Entity normalization / linking

In [21]:
from IPython.display import display_markdown

In [22]:
linker = nlp.get_pipe("scispacy_linker")

In [23]:
for e in sentence.ents:
    display_markdown(f'__Entity: {e}__', raw=True)
    for entity_id, prob in e._.kb_ents:
        mesh_term = linker.kb.cui_to_entity[entity_id]
        print('Probability:', prob)
        print(mesh_term)

__Entity: patient__

Probability: 0.8386321067810059
CUI: D019727, Name: Proxy
Definition: A person authorized to decide or act for another person, for example, a person having durable power of attorney.
TUI(s): 
Aliases: (total: 2): 
	 Patient Agent, Proxy
Probability: 0.7973071336746216
CUI: D010361, Name: Patients
Definition: Individuals participating in the health care system for the purpose of receiving therapeutic, diagnostic, or preventive procedures.
TUI(s): 
Aliases: (total: 2): 
	 Patients, Clients
Probability: 0.7851048707962036
CUI: D005791, Name: Patient Care
Definition: Care rendered by non-professionals.
TUI(s): 
Aliases: (total: 2): 
	 Informal care, Patient Care
Probability: 0.7439237833023071
CUI: D000070659, Name: Patient Comfort
Definition: Patient care intended to prevent or relieve suffering in conditions that ensure optimal quality living.
TUI(s): 
Aliases: (total: 2): 
	 Comfort Care, Patient Comfort
Probability: 0.7175934910774231
CUI: D064406, Name: Patient Harm
Definition: A meas

__Entity: CT scan__

Probability: 0.8230447173118591
CUI: D000072098, Name: Single Photon Emission Computed Tomography Computed Tomography
Definition: An imaging technique using a device which combines TOMOGRAPHY, EMISSION-COMPUTED, SINGLE-PHOTON and TOMOGRAPHY, X-RAY COMPUTED in the same session.
TUI(s): 
Aliases: (total: 5): 
	 CT SPECT Scan, Single Photon Emission Computed Tomography Computed Tomography, CT SPECT, SPECT CT Scan, SPECT CT
Probability: 0.8186503648757935
CUI: D000072078, Name: Positron Emission Tomography Computed Tomography
Definition: An imaging technique that combines a POSITRON-EMISSION TOMOGRAPHY (PET) scanner and a CT X RAY scanner. This establishes a precise anatomic localization in the same session.
TUI(s): 
Aliases: (total: 7): 
	 PET-CT Scan, PET-CT, CT PET Scan, Positron Emission Tomography Computed Tomography, PET CT Scan, Positron Emission Tomography-Computed Tomography, CT PET
Probability: 0.7265672087669373
CUI: D056973, Name: Four-Dimensional Computed Tomography
Definition
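
The candidate lists above show why the top-ranked link should not be accepted blindly: for "patient", the highest-scoring concept is Proxy (D019727), while the arguably more fitting Patients (D010361) ranks second. The values printed as "Probability" are best read as similarity scores from the linker's candidate search rather than calibrated probabilities. The sketch below uses only the standard library to filter candidates by a score threshold and to flag mentions whose two best candidates are close together. The helper names `filter_candidates` and `ambiguous_top`, the threshold of 0.75, and the margin of 0.05 are illustrative assumptions, not part of scispaCy; the scores are rounded copies of the output above.

```python
# Candidate (MeSH ID, score) pairs, rounded from the notebook output above.
patient_candidates = [
    ("D019727", 0.8386),     # Proxy
    ("D010361", 0.7973),     # Patients
    ("D005791", 0.7851),     # Patient Care
    ("D000070659", 0.7439),  # Patient Comfort
    ("D064406", 0.7176),     # Patient Harm
]

def filter_candidates(candidates, threshold=0.75):
    """Keep candidates whose score is at least `threshold`, best first."""
    kept = [(cui, score) for cui, score in candidates if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

def ambiguous_top(candidates, margin=0.05):
    """Return True if the two best scores differ by less than `margin`."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return len(ranked) > 1 and (ranked[0][1] - ranked[1][1]) < margin

print(filter_candidates(patient_candidates))
print(ambiguous_top(patient_candidates))
```

A mention flagged as ambiguous could then be routed to manual review or re-ranked with additional context instead of being linked to the first candidate.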

## Gene Named Entity Recognition

In [24]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz


In [25]:
text = """Dual MAPK pathway inhibition with BRAF and MEK inhibitors in BRAF(V600E)-mutant NSCLC 
might improve efficacy over BRAF inhibitor monotherapy based on observations in BRAF(V600)-mutant melanoma"""

Specialized model for biological entities

In [26]:
bionlp = spacy.load('en_ner_bionlp13cg_md')
biodoc = bionlp(text)

In [27]:
for e in biodoc.ents:
    print('Entity:', e, ', Label:', e.label_)

Entity: MAPK , Label: GENE_OR_GENE_PRODUCT
Entity: BRAF , Label: GENE_OR_GENE_PRODUCT
Entity: MEK , Label: GENE_OR_GENE_PRODUCT
Entity: BRAF(V600E)-mutant NSCLC , Label: CANCER
Entity: BRAF , Label: GENE_OR_GENE_PRODUCT
Entity: melanoma , Label: CELL
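
Downstream code often needs the recognized entities grouped by type. The following is a minimal sketch using only the standard library, with the (text, label) pairs copied from the output above; with a loaded pipeline, the same pairs could be built as `[(e.text, e.label_) for e in biodoc.ents]`. The helper name `group_by_label` is an illustrative assumption.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the NER output above.
entities = [
    ("MAPK", "GENE_OR_GENE_PRODUCT"),
    ("BRAF", "GENE_OR_GENE_PRODUCT"),
    ("MEK", "GENE_OR_GENE_PRODUCT"),
    ("BRAF(V600E)-mutant NSCLC", "CANCER"),
    ("BRAF", "GENE_OR_GENE_PRODUCT"),
    ("melanoma", "CELL"),
]

def group_by_label(pairs):
    """Map each label to its distinct entity texts, in order of first appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

for label, texts in group_by_label(entities).items():
    print(label, texts)
```

Grouping like this also makes questionable predictions easier to spot, such as "melanoma" being labeled CELL where CANCER would arguably be expected.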


In [28]:
displacy.render(biodoc, style='ent', jupyter=True)