## Biomedical NLP

### Rule-based TNM Extraction

This example shows a deliberately simple, and in places problematic, regular expression for matching TNM expressions. As seen in the lecture, covering every edge case of TNM extraction with regular expressions is tedious. A more realistic solution can be found here: https://github.com/hpi-dhc/onco-nlp/blob/master/onconlp/classification/rulebased_tnm.py

In [2]:
import re

tnm_pattern = r"T\d+[a-zA-Z]*N\d+[a-zA-Z]*M\d+[a-zA-Z]*"

### RegEx explanation ###
#
# T           Matches the uppercase letter "T"
# \d+         Matches one or more digits
# [a-zA-Z]*   Matches zero or more uppercase or lowercase letters
# N           Matches the uppercase letter "N"
# M           Matches the uppercase letter "M"
#
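# Thus, this RegEx matches strings that begin with the letters 'T', 'N', and 'M'
# (in that order), each followed by one or more digits and optional letters.
# Because re.match only anchors at the start of the string, any text after
# the TNM expression is ignored.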

def check_valid(text):
    print("valid" if re.match(tnm_pattern, text) else "not valid")
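Since `check_valid` relies on `re.match`, it helps to see how the choice of matching function changes the result. The following minimal sketch compares `re.match`, `re.fullmatch`, and `re.search` on two made-up input strings (the strings are illustrative assumptions, not taken from the lecture):

```python
import re

tnm_pattern = r"T\d+[a-zA-Z]*N\d+[a-zA-Z]*M\d+[a-zA-Z]*"

# Hypothetical inputs: TNM followed by other text, and TNM preceded by other text
trailing = "T1N0M1 (stage IV)"
leading = "stage T1N0M1"

# re.match anchors at the start only, so trailing text is accepted
match_trailing = bool(re.match(tnm_pattern, trailing))          # True
# re.fullmatch requires the whole string to be a TNM expression
fullmatch_trailing = bool(re.fullmatch(tnm_pattern, trailing))  # False
# re.match fails when the expression is not at the very beginning
match_leading = bool(re.match(tnm_pattern, leading))            # False
# re.search scans the string and finds the expression anywhere
search_leading = bool(re.search(tnm_pattern, leading))          # True

print(match_trailing, fullmatch_trailing, match_leading, search_leading)
```

For validating a single field, `re.fullmatch` is usually the stricter choice, while `re.search` or `re.finditer` is better suited to locating TNM expressions inside free text.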

Let us check some example strings:

In [3]:
check_valid('T1N0M1')

valid


In [4]:
check_valid('T1aN2M0')

valid


In [5]:
check_valid('T123')

not valid


In [15]:
check_valid('pT1N0M1')

not valid


In [6]:
check_valid('T1')

not valid


In [7]:
check_valid('T8N9M9')

valid


In [8]:
check_valid('T1 N0 M1')

not valid
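The examples above expose three weaknesses of the simple pattern: it rejects the prefixed form `pT1N0M1`, it rejects the spaced form `T1 N0 M1`, and it accepts the implausible `T8N9M9`. As a rough sketch of how these could be addressed, the pattern below allows an optional staging prefix, optional whitespace, and a restricted range of category values. The names `tolerant_tnm` and `find_tnm` are hypothetical, the accepted values are a simplified subset of the TNM notation, and this is not the implementation from the linked repository.

```python
import re

# Hypothetical, more tolerant pattern (a simplified illustration only)
tolerant_tnm = re.compile(
    r"""
    (?P<prefix>[cpyra]{0,3})          # optional staging prefix, e.g. c, p, yp
    T(?P<t>(?:[0-4Xx]|is)[a-d]?)      # T category: 0-4, X, or "is", optional subcategory
    \s*                               # optional whitespace between components
    N(?P<n>[0-3Xx][a-c]?)             # N category: 0-3 or X, optional subcategory
    \s*
    M(?P<m>[01Xx][a-c]?)              # M category: 0, 1, or X, optional subcategory
    """,
    re.VERBOSE,
)

def find_tnm(text):
    """Return the TNM components of the first match in text, or None."""
    match = tolerant_tnm.search(text)
    return match.groupdict() if match else None

for example in ["pT1N0M1", "T1 N0 M1", "T1aN2M0", "T8N9M9", "T123"]:
    print(example, "->", find_tnm(example))
```

Returning the named groups instead of printing "valid" makes the result usable for downstream processing, at the cost of a more complex pattern. Many real-world variants remain uncovered.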


### Natural Language Processing (NLP)

We will give a brief introduction to Natural Language Processing with the spaCy library, which is designed for NLP tasks and workflows. It features a rich collection of models and supports visualization.

In [1]:
# Install spaCy
!pip install -q spacy

# Download the basic English model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [9]:
import spacy
from spacy import displacy
import en_core_web_sm

# Load the English model
nlp = en_core_web_sm.load()

### Named Entity Recognition


The basic pre-trained model can detect a fixed set of entity types. Let us list them together with their descriptions:

In [10]:
# Access the NER component
ner = nlp.get_pipe("ner")

# Print the entity labels and their corresponding description
for label in ner.labels:
    print(f"{label}: {spacy.explain(label)}")

CARDINAL: Numerals that do not fall under another type
DATE: Absolute or relative dates or periods
EVENT: Named hurricanes, battles, wars, sports events, etc.
FAC: Buildings, airports, highways, bridges, etc.
GPE: Countries, cities, states
LANGUAGE: Any named language
LAW: Named documents made into laws.
LOC: Non-GPE locations, mountain ranges, bodies of water
MONEY: Monetary values, including unit
NORP: Nationalities or religious or political groups
ORDINAL: "first", "second", etc.
ORG: Companies, agencies, institutions, etc.
PERCENT: Percentage, including "%"
PERSON: People, including fictional
PRODUCT: Objects, vehicles, foods, etc. (not services)
QUANTITY: Measurements, as of weight or distance
TIME: Times smaller than a day
WORK_OF_ART: Titles of books, songs, etc.


Let us now test this on an example input sentence:

In [11]:
# Sample input
text = "Mr. Brown is spending over 1000$ to travel to the U.S. in February to attend the Super Bowl with two of his family members."

# Process the text
doc = nlp(text)

# Display entities
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

Brown (PERSON)
1000$ (MONEY)
U.S. (GPE)
February (DATE)
the Super Bowl (EVENT)
two (CARDINAL)


We can observe that the pre-trained model identifies several entities in the example text, such as "Brown" as a person and "the Super Bowl" as an event. The model is not perfect: for example, "Mr." could arguably be part of the person entity, and "family members" indirectly refers to persons as well.

Furthermore, we can visualize the entity recognition with spaCy:

In [12]:
displacy.render(doc, style="ent", jupyter=True)
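Beyond printing and visualizing, extracted entities are often grouped by type for further processing. The sketch below does this for plain `(text, label)` pairs copied from the printed output of the example sentence, so it runs without a model. The helper name `group_by_label` is hypothetical. With a processed `doc`, the same pairs could be obtained via `(ent.text, ent.label_) for ent in doc.ents`.

```python
from collections import defaultdict

def group_by_label(entities):
    """Group (text, label) pairs into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# Pairs copied from the entity output of the example sentence
entities = [
    ("Brown", "PERSON"),
    ("1000$", "MONEY"),
    ("U.S.", "GPE"),
    ("February", "DATE"),
    ("the Super Bowl", "EVENT"),
    ("two", "CARDINAL"),
]

grouped = group_by_label(entities)
print(grouped["PERSON"])  # ['Brown']
print(grouped["EVENT"])   # ['the Super Bowl']
```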

### Relationship Extraction

Another common NLP task is extracting relationships between words to better understand how they interact in context. spaCy exposes these relationships through its dependency parser. Let us examine how it handles our input text:

In [13]:
# Sample text
text = "Mr. Brown is spending over 1000$ to travel to the U.S. in February to attend the Super Bowl with two of his family members."

# Process the text
doc = nlp(text)

# Display token dependencies
for token in doc:
    dep_label = token.dep_
    explanation = spacy.explain(dep_label) or ""  # explain() returns None for unknown labels
    print(f"{token.text:<12} {dep_label:<10} {explanation:<30} -->  {token.head.text}")

Mr.          compound   compound                       -->  Brown
Brown        nsubj      nominal subject                -->  spending
is           aux        auxiliary                      -->  spending
spending     ROOT       root                           -->  spending
over         prep       prepositional modifier         -->  spending
1000         nummod     numeric modifier               -->  $
$            pobj       object of preposition          -->  over
to           aux        auxiliary                      -->  travel
travel       advcl      adverbial clause modifier      -->  spending
to           prep       prepositional modifier         -->  travel
the          det        determiner                     -->  U.S.
U.S.         pobj       object of preposition          -->  to
in           prep       prepositional modifier         -->  travel
February     pobj       object of preposition          -->  in
to           aux        auxiliary                      -->  attend
att

For a more convenient display, we can visualize the relationships as well:

In [14]:
displacy.render(doc, style="dep", jupyter=True, options={'distance': 90})
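Dependency labels can also be used programmatically. As a small illustration, the sketch below finds the root verb and its subject, and attaches `compound` modifiers to the subject, which recovers "Mr. Brown" rather than just "Brown". It works on `(token, label, head)` triples copied from the first rows of the printed output. The function name `subject_phrase` is hypothetical, and matching heads by their text is a simplification that breaks when a token text occurs more than once. With a real `doc`, `token.head` and `token.children` would be the more robust choice.

```python
# (token, dependency label, head) triples copied from the printed output
rows = [
    ("Mr.", "compound", "Brown"),
    ("Brown", "nsubj", "spending"),
    ("is", "aux", "spending"),
    ("spending", "ROOT", "spending"),
    ("over", "prep", "spending"),
    ("1000", "nummod", "$"),
    ("$", "pobj", "over"),
]

def subject_phrase(rows):
    """Return (root, subject) where subject includes its compound modifiers."""
    # The root of the sentence is the token labeled ROOT
    root = next(tok for tok, dep, head in rows if dep == "ROOT")
    # The nominal subject is the nsubj token attached to the root
    subject = next(tok for tok, dep, head in rows if dep == "nsubj" and head == root)
    # Compound modifiers (such as titles) attach to the subject
    modifiers = [tok for tok, dep, head in rows if dep == "compound" and head == subject]
    return root, " ".join(modifiers + [subject])

print(subject_phrase(rows))  # ('spending', 'Mr. Brown')
```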

### NLP for clinical texts

The spaCy ecosystem also includes libraries and models fine-tuned for the medical domain and for processing clinical texts. Med7, medspaCy, and scispaCy, among others, are commonly used. We will briefly showcase Med7, a spaCy-based Named Entity Recognition (NER) model tailored to clinical information extraction. It focuses on identifying medication-related entities in clinical texts.

In [None]:
# Install Med7
!pip install "en-core-med7-lg @ https://huggingface.co/kormilitzin/en_core_med7_lg/resolve/main/en_core_med7_lg-any-py3-none-any.whl"

In [16]:
# Load the Med7 model
nlp = spacy.load("en_core_med7_lg")

# Print all covered entity labels
ner = nlp.get_pipe("ner")
print(ner.labels)

  


('DOSAGE', 'DRUG', 'DURATION', 'FORM', 'FREQUENCY', 'ROUTE', 'STRENGTH')


In [17]:
# Sample biomedical text
text = "The patient was prescribed 500mg of Amoxicillin orally twice daily for 7 days."

# Process the text
doc = nlp(text)

# Display entities
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

500mg (STRENGTH)
Amoxicillin (DRUG)
orally (ROUTE)
twice daily (FREQUENCY)
for 7 days (DURATION)


In [18]:
displacy.render(doc, style="ent", jupyter=True)
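A typical next step is to turn the recognized entities into a structured record. The sketch below maps the seven Med7 labels onto the fields of a small dataclass. The names `MedicationOrder` and `to_order` are hypothetical, the input pairs are copied from the printed output above, and the code assumes that the text describes exactly one medication. With a processed `doc`, the pairs could be obtained via `(ent.text, ent.label_) for ent in doc.ents`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MedicationOrder:
    """Hypothetical record with one field per Med7 entity label."""
    dosage: Optional[str] = None
    drug: Optional[str] = None
    duration: Optional[str] = None
    form: Optional[str] = None
    frequency: Optional[str] = None
    route: Optional[str] = None
    strength: Optional[str] = None

def to_order(entities):
    """Fill a MedicationOrder from (text, label) pairs, keeping the first value per label."""
    order = MedicationOrder()
    for text, label in entities:
        field = label.lower()
        # Ignore unknown labels and keep only the first occurrence of each label
        if hasattr(order, field) and getattr(order, field) is None:
            setattr(order, field, text)
    return order

# Pairs copied from the Med7 output of the example sentence
entities = [
    ("500mg", "STRENGTH"),
    ("Amoxicillin", "DRUG"),
    ("orally", "ROUTE"),
    ("twice daily", "FREQUENCY"),
    ("for 7 days", "DURATION"),
]

order = to_order(entities)
print(order.drug, order.strength, order.frequency)  # Amoxicillin 500mg twice daily
```

Texts that mention several medications would need an additional step that assigns each attribute to the correct drug, for example based on token proximity or dependency relations.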