# Clinical NLP

scispaCy is an open-source Python package that builds on spaCy, a library for advanced Natural Language Processing written in Python and Cython and published under the MIT license. scispaCy adds statistical neural network models for processing biomedical, scientific or clinical text.

## Finding the sentences in the text

In [187]:
import scispacy
import spacy
import en_ner_bc5cdr_md   #The model we are going to use
from spacy import displacy
from scispacy.abbreviation import AbbreviationDetector
from scispacy.linking import EntityLinker
nlp = spacy.load("en_ner_bc5cdr_md")

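import pandas as pd
import re
med_transcript = pd.read_csv(r"C:\Users\swanson-jeffrey\Downloads\mtsamples.csv", index_col=0)
med_transcript.dropna(subset=['transcription'], inplace=True)
# Draw the same 100-row sample used in the later cells, so med_transcript_small is defined here
med_transcript_small = med_transcript.sample(n=100, replace=False, random_state=42)

text=med_transcript_small['transcription'].iloc[10]

# Strip punctuation; this also removes the periods that sentence splitting relies on
text=re.sub(r'[^\w\s]', '', text)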

doc = nlp(text)

print(list(doc.sents))

[PREOPERATIVE DIAGNOSES1  Right pelvic pain2  Right ovarian massPOSTOPERATIVE DIAGNOSES1  Right pelvic pain2  Right ovarian mass3  8 cm x 10 cm right ovarian cyst with ovarian torsionPROCEDURE PERFORMED  Laparoscopic right salpingooophorectomyANESTHESIA  General with endotracheal tubeCOMPLICATIONS  NoneESTIMATED BLOOD LOSS  Less than 50 ccTUBES  NoneDRAINS  NonePATHOLOGY  The right tube and ovary sent to pathology for reviewFINDINGS  On exam under anesthesia a normalappearing vulva and vagina and normally palpated cervix a uterus that was normal size and a large right adnexal mass  Laparoscopic findings demonstrated a 8 cm x 10 cm smooth right ovarian cyst that was noted to be torsed twice  Otherwise the uterus left tube and ovary bowel liver margins appendix and gallbladder were noted all to be within normal limits  There was no noted blood in the pelvisINDICATIONS FOR THIS PROCEDURE  The patient is a 26yearold G1 P1 who presented to ABCD General Emergency Room with complaint of right
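The output above contains fused tokens such as `massPOSTOPERATIVE` and `tubeCOMPLICATIONS`, because stripping every punctuation mark also removes the separators between sections and sentences. One possible alternative, sketched below, keeps the punctuation and only inserts a space where a letter directly follows a separator. The helper name `separate_fused_sentences` and the sample string are made up for illustration; the string is not taken from mtsamples.csv.

```python
import re

def separate_fused_sentences(text: str) -> str:
    """Insert a space after . , : ; whenever a letter follows immediately."""
    return re.sub(r"([.,:;])(?=[A-Za-z])", r"\1 ", text)

# Hypothetical raw snippet, invented for this example.
raw = "Right ovarian mass.,POSTOPERATIVE DIAGNOSES:,1.  Right pelvic pain."
cleaned = separate_fused_sentences(raw)
print(cleaned)
```

Decimal values such as `0.25%` are left untouched because a digit, not a letter, follows the period. Abbreviations written like `p.o` would be split into `p. o`, so this is a trade-off rather than a complete fix.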

## Find the Bio-Medical Entities in the given text:

In [174]:
print(doc.ents)

(pelvic pain, pelvic pain, right ovarian cyst, ovarian torsion, normalappearing vulva, vagina, adnexal mass  , right ovarian cyst, torsed, ovary bowel liver margins appendix and gallbladder, pain, pain, pain, nausea, vomiting, bleeding, chills, right ovarian cyst, singletoothed, CO2, LigaSure, 0vicyrl)


### Abbreviations

The AbbreviationDetector is a spaCy component that implements the abbreviation detection algorithm from "A simple algorithm for identifying abbreviation definitions in biomedical text" (Schwartz & Hearst, 2003).

In [175]:
nlp.add_pipe("abbreviation_detector")

<scispacy.abbreviation.AbbreviationDetector at 0x1a6a46ec588>

In [176]:
doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

fmt_str = "{:<6}| {:<30}| {:<6}| {:<6}"
print(fmt_str.format("Short", "Long", "Starts", "Ends"))
for abrv in doc._.abbreviations:
    print(fmt_str.format(abrv.text, str(abrv._.long_form), abrv.start, abrv.end))

Short | Long                          | Starts| Ends  
SBMA  | Spinal and bulbar muscular atrophy| 6     | 7     
SBMA  | Spinal and bulbar muscular atrophy| 33    | 34    
AR    | androgen receptor             | 29    | 30    
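To give a feel for what the detector does, here is a simplified plain-Python sketch of the core matching step of the Schwartz & Hearst algorithm. It is not scispaCy's implementation: it only handles the "long form (SHORT)" layout, omits the validity checks of the published algorithm, and the function names `find_best_long_form` and `find_abbreviations` are chosen for this example.

```python
import re

def find_best_long_form(short_form, candidate):
    """Match short_form right-to-left against candidate; return the long form or None."""
    s = len(short_form) - 1
    c = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (c >= 0 and candidate[c].lower() != ch) or (
            s == 0 and c > 0 and candidate[c - 1].isalnum()
        ):
            c -= 1
        if c < 0:
            return None
        c -= 1
        s -= 1
    start = candidate.rfind(" ", 0, c + 1) + 1
    return candidate[start:]

def find_abbreviations(text):
    """Map each parenthesised short form to the long form found just before it."""
    pairs = {}
    for m in re.finditer(r"\(([A-Za-z0-9-]{2,10})\)", text):
        short = m.group(1)
        # Search window: at most min(|short| + 5, 2 * |short|) words before "(".
        max_words = min(len(short) + 5, len(short) * 2)
        window = " ".join(text[: m.start()].split()[-max_words:])
        long_form = find_best_long_form(short, window)
        if long_form:
            pairs[short] = long_form
    return pairs

sentence = ("Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron "
            "disease caused by the expansion of a polyglutamine tract within the "
            "androgen receptor (AR).")
abbreviations = find_abbreviations(sentence)
print(abbreviations)
```

On the example sentence this recovers the same two pairs as the table above. Unlike the detector, the sketch does not locate later mentions of `SBMA` in the text.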


#### Finding Definitions

In [190]:
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

<scispacy.linking.EntityLinker at 0x19a62361a48>

scispaCy's EntityLinker class is a spaCy pipeline component that links entities identified by the trained pipeline with various clinical ontologies. In spaCy, these ontologies are called Knowledge Bases. The linker performs a string-overlap search based on character 3-grams: it compares each named entity with the concepts in the knowledge base using an approximate nearest neighbours search.

As of writing, scispaCy supports the following: the Unified Medical Language System (UMLS), Medical Subject Headings (MeSH), the RxNorm Ontology, the Gene Ontology, and the Human Phenotype Ontology.
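The idea of matching on character 3-grams can be sketched in a few lines of plain Python. This is a toy illustration, not the library's implementation: it scores every entry exactly with Jaccard similarity instead of querying an approximate nearest-neighbour index, and `TOY_KB` holds only three CUIs and canonical names, borrowed from the output of the linking cell in this section. The names `char_ngrams`, `jaccard` and `link` are made up for the example.

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams, padded so word edges count too."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Overlap between two n-gram sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Toy knowledge base (assumption: canonical names only, no aliases).
TOY_KB = {
    "C1839259": "Bulbo-Spinal Atrophy, X-Linked",
    "C1705240": "AR wt Allele",
    "C0032500": "Polyglutamic Acid",
}

def link(mention, kb=TOY_KB, top_k=2):
    """Return the top_k (cui, score) candidates for a mention, best first."""
    grams = char_ngrams(mention)
    scored = [(cui, jaccard(grams, char_ngrams(name))) for cui, name in kb.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

for mention in ["polyglutamine tract", "Spinal and bulbar muscular atrophy"]:
    print(mention, "->", link(mention))
```

The return value mirrors the shape of `entity._.kb_ents`, a list of `(CUI, score)` pairs. A full knowledge base has far too many entries to scan one by one, which is why an approximate index is used in practice.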

In [191]:
linker = nlp.get_pipe("scispacy_linker")
doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

fmt_str = "{:<20}| {:<10}| {:<32}| {:<20}"
print(fmt_str.format("Entity", "1st CUI", "Canonical Name", "Definition"))
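for entity in doc.ents:
    if not entity._.kb_ents:  # no candidate found for this entity, nothing to look up
        continue
    first_cuid = entity._.kb_ents[0][0]  # kb_ents is a list of (CUI, score) pairs, best first
    kb_entry = linker.kb.cui_to_entity[first_cuid]
    definition = (kb_entry.definition or "")[0:15] + "..."  # some concepts have no definition
    print(fmt_str.format(entity.text, first_cuid, kb_entry.canonical_name, definition))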
#https://www.andrewvillazon.com/clinical-natural-language-processing-python/

Entity              | 1st CUI   | Canonical Name                  | Definition          
Spinal and bulbar muscular atrophy| C1839259  | Bulbo-Spinal Atrophy, X-Linked  | An X-linked rec...  
SBMA                | C1705240  | AR wt Allele                    | Human AR wild-t...  
polyglutamine tract | C0032500  | Polyglutamic Acid               | A peptide that ...  
SBMA                | C1705240  | AR wt Allele                    | Human AR wild-t...  


# Entity Detection

In [179]:
doc = nlp(text)

fmt_str = "{:<15}| {:<6}| {:<7}| {:<8}"
print(fmt_str.format("token", "pos", "label", "parent"))
for token in doc:
    print(fmt_str.format(token.text, token.pos_, token.ent_type_, token.head.text))


token          | pos   | label  | parent  
PREOPERATIVE   | VERB  |        | noted   
DIAGNOSES1     | NOUN  |        | pain    
               | SPACE |        | pain    
Right          | ADJ   |        | pain    
pelvic         | ADJ   | DISEASE| pain    
pain           | NOUN  | DISEASE| POSTOPERATIVE
2              | NUM   |        |         
               | SPACE |        | POSTOPERATIVE
Right          | ADJ   |        | ovarian 
ovarian        | ADJ   |        | mass    
mass           | NOUN  |        | POSTOPERATIVE
POSTOPERATIVE  | NOUN  |        | PREOPERATIVE
DIAGNOSES1     | NOUN  |        | POSTOPERATIVE
               | SPACE |        | pain    
Right          | ADJ   |        | pain    
pelvic         | ADJ   | DISEASE| pain    
pain           | NOUN  | DISEASE| mass    
2              | NUM   |        | mass    
               | SPACE |        | mass    
Right          | ADJ   |        | mass    
ovarian        | ADJ   |        | mass    
mass           | NOUN  |      

Robinson       | PROPN |        | catheter
catheter       | NOUN  |        | drained 
and            | CCONJ |        | drained 
she            | PRON  |        | examined
was            | VERB  |        | examined
examined       | VERB  |        | drained 
under          | ADP   |        | anesthesia
anesthesia     | NOUN  |        | examined
and            | CCONJ |        | examined
was            | VERB  |        | noted   
noted          | VERB  |        | examined
to             | PART  |        | have    
have           | VERB  |        | noted   
the            | DET   |        | findings
findings       | NOUN  |        | have    
as             | ADP   |        |         
above          | ADP   |        | as      
               | SPACE |        | findings
She            | PRON  |        | prepped 
was            | VERB  |        | prepped 
prepped        | VERB  |        | examined
and            | CCONJ |        | prepped 
draped         | VERB  |        | prepped 
in       

the            | DET   |        | pneumoperitoneum
pneumoperitoneum| NOUN  |        | maintained
was            | VERB  |        | maintained
maintained     | VERB  |        | closed  
after          | ADP   |        | placed  
the            | DET   |        | sutures 
sutures        | NOUN  |        | placed  
were           | VERB  |        | placed  
placed         | VERB  |        | maintained
               | SPACE |        | placed  
Therefore      | ADV   |        | noted   
the            | DET   |        | surface 
peritoneal     | ADJ   |        | surface 
surface        | NOUN  |        | noted   
was            | VERB  |        | noted   
noted          | VERB  |        | drained 
to             | PART  |        | hemostatic
be             | VERB  |        | hemostatic
hemostatic     | ADJ   |        | noted   
               | SPACE |        | hemostatic
Therefore      | ADV   |        | removed 
the            | DET   |        | camera  
camera         | NOUN  |        |

# Named Entity Recognition

Named entity recognition (NER) is a task in which the model finds the parts of the input text that correspond to entities such as persons, locations, or organizations. It involves identifying key information in the text and classifying it into a set of predefined categories. An entity is, roughly, a thing that is consistently talked about or referred to in the text.

In [180]:
import pandas as pd
import re
med_transcript = pd.read_csv(r"C:\Users\swanson-jeffrey\Downloads\mtsamples.csv", index_col=0)
med_transcript.dropna(subset=['transcription'], inplace=True)
med_transcript_small = med_transcript.sample(n=100, replace=False, random_state=42)


sample_transcription = med_transcript_small['transcription'].iloc[0]
sample_transcription=re.sub(r'[^\w\s]', '', sample_transcription)
print(sample_transcription[:1000]) # prints just the first 1000 characters

HISTORY OF PRESENT ILLNESS  The patient is well known to me for a history of irondeficiency anemia due to chronic blood loss from colitis  We corrected her hematocrit last year with intravenous IV iron  Ultimately she had a total proctocolectomy done on 03142007 to treat her colitis  Her course has been very complicated since then with needing multiple surgeries for removal of hematoma  This is partly because she was on anticoagulation for a right arm deep venous thrombosis DVT she had early this year complicated by septic phlebitisChart was reviewed and I will not reiterate her complex historyI am asked to see the patient again because of concerns for coagulopathyShe had surgery again last month to evacuate a pelvic hematoma and was found to have vancomycin resistant enterococcus for which she is on multiple antibiotics and followed by infectious disease nowShe is on total parenteral nutrition TPN as wellLABORATORY DATA  Labs today showed a white blood count of 79 hemoglobin 110 hemat

In [181]:
#https://allenai.github.io/scispacy/
import scispacy

import spacy
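#Models are installed with pip from their release URLs (listed at the link above), for example:
#pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz
#The model loaded below, en_ner_bc5cdr_md, has to be installed the same way from its own release URL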
nlp = spacy.load("en_ner_bc5cdr_md")
doc = nlp(sample_transcription)
print("TEXT", "START", "END", "ENTITY TYPE")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

TEXT START END ENTITY TYPE
irondeficiency anemia 77 98 DISEASE
chronic blood loss 106 124 DISEASE
colitis 130 137 DISEASE
iron 197 201 CHEMICAL
colitis 276 283 DISEASE
hematoma 380 388 DISEASE
venous thrombosis DVT 461 482 DISEASE
pelvic hematoma 720 735 DISEASE
vancomycin 758 768 CHEMICAL
infectious disease 849 867 DISEASE
wellLABORATORY 915 929 CHEMICAL
improvedPT 1299 1309 CHEMICAL
vitamin K 1446 1455 CHEMICAL
28LFTs 1659 1665 CHEMICAL
uric acid 1755 1764 CHEMICAL
bilirubin 1776 1785 CHEMICAL
Creatinine 1830 1840 CHEMICAL
creatinine 1868 1878 CHEMICAL
normalB12 1931 1940 CHEMICAL
Folic acid 1989 1999 CHEMICAL
Iron 2012 2016 CHEMICAL
heparin 2220 2227 CHEMICAL
loperamide niacin 2236 2253 CHEMICAL
Diovan Afrin 2267 2279 CHEMICAL
caspofungin daptomycin 2292 2314 CHEMICAL
fentanyl 2325 2333 CHEMICAL
morphine 2337 2345 CHEMICAL
pain 2350 2354 DISEASE
Compazine 2359 2368 CHEMICAL
Zofran 2372 2378 CHEMICAL
epistaxis 2504 2513 DISEASE
notedABDOMEN 2754 2766 CHEMICAL
bleeding 2910 2918 DISEA

The examples here use the NER model trained on the BC5CDR corpus (en_ner_bc5cdr_md), which labels entities as DISEASE or CHEMICAL. This corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions.
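The START and END columns in the entity output above are character offsets into the cleaned text (`ent.start_char`, `ent.end_char`), not token indices. The short check below illustrates this using the opening characters of the printed sample and a few rows copied from that output; the variable names are chosen for this example.

```python
from collections import Counter

# Opening characters of the sample transcription (two spaces follow "ILLNESS").
text = ("HISTORY OF PRESENT ILLNESS  The patient is well known to me for a history of "
        "irondeficiency anemia due to chronic blood loss from colitis")

# (text, start_char, end_char, label) rows copied from the top of the output.
rows = [
    ("irondeficiency anemia", 77, 98, "DISEASE"),
    ("chronic blood loss", 106, 124, "DISEASE"),
    ("colitis", 130, 137, "DISEASE"),
    ("iron", 197, 201, "CHEMICAL"),
    ("vancomycin", 758, 768, "CHEMICAL"),
]

# Slicing the text with the offsets gives back the entity text
# (checked only for rows that fall inside the excerpt).
checked = [text[start:end] == ent_text
           for ent_text, start, end, _ in rows if end <= len(text)]
print(checked)

# Tally the rows by label.
label_counts = Counter(label for *_, label in rows)
print(label_counts)
```

Because the offsets index into the cleaned string, they only line up with the text that was actually passed to `nlp`, not with the raw transcription.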

In [182]:
from spacy import displacy
displacy.render(doc[:200], style='ent', jupyter=True) # here I am rendering just the first 200 tokens

Named entity recognition (NER) ‒ also called entity identification or entity extraction ‒ is a natural language processing (NLP) technique that automatically identifies named entities in a text and classifies them into predefined categories. Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more.

## Rule-based matching

Rule-based matching is similar to regular expressions, but spaCy’s rule-based matcher engines and components give you access to the tokens within the document and their relationships. We can combine this with the NER models to identify some pattern that includes our entities.

Let’s extract from the text the drug names and their reported dosages. This could be of real use to identify possible medication errors by checking if the dosages are in accordance with standards and guidelines.

In [183]:
from spacy.matcher import Matcher

pattern = [{'ENT_TYPE':'CHEMICAL'}, {'LIKE_NUM': True}, {'IS_ASCII': True}]
matcher = Matcher(nlp.vocab)
matcher.add("DRUG_DOSE", [pattern])



The code above creates a pattern to identify a sequence of three tokens:

- A token whose entity type is CHEMICAL (drug name)
- A token that resembles a number (dosage)
- A token that consists of ASCII characters (units, like mg or mL)

Then we initialize the Matcher with a vocabulary. The matcher must always share the same vocab with the documents it will operate on, so we use the nlp object vocab. We then add this pattern to the matcher and give it an ID.

Now we can loop through all transcriptions and extract the text matching this pattern:

In [184]:
for transcription in med_transcript_small['transcription']:
    doc = nlp(transcription)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # get string representation
        span = doc[start:end]  # the matched span
        print(string_id, start, end, span.text)

DRUG_DOSE 129 132 Xylocaine 20 mL
DRUG_DOSE 133 136 Marcaine 0.25%
DRUG_DOSE 204 207 Aspirin 81 mg
DRUG_DOSE 212 215 Spiriva 10 mcg
DRUG_DOSE 376 379 nifedipine 10 mg
DRUG_DOSE 220 223 aspirin one tablet
DRUG_DOSE 239 242 Warfarin 2.5 mg
DRUG_DOSE 57 60 Topamax 100 mg
DRUG_DOSE 63 66 Zoloft 100 mg
DRUG_DOSE 69 72 Abilify 5 mg
DRUG_DOSE 74 77 Motrin 800 mg
DRUG_DOSE 76 79 Xanax 1 mg
DRUG_DOSE 87 90 Colace 100 mg
DRUG_DOSE 120 123 Paxil 10 mg
DRUG_DOSE 125 128 Prednisone 20 mg
DRUG_DOSE 139 142 Metamucil one pack
DRUG_DOSE 149 152 Nexium 40 mg
DRUG_DOSE 89 92 (70-
DRUG_DOSE 91 94 -86)
DRUG_DOSE 1109 1112 Naprosyn one p.o
DRUG_DOSE 260 263 Lidocaine 1%
DRUG_DOSE 35 38 Altrua 60,
DRUG_DOSE 221 224 lidocaine 2%
DRUG_DOSE 91 94 Creatinine 1.3,
DRUG_DOSE 94 97 sodium 141,
DRUG_DOSE 98 101 potassium 4.0.
DRUG_DOSE 102 105 Calcium 8.6.
DRUG_DOSE 434 437 aspirin 81 mg
DRUG_DOSE 441 444 Lipitor 20 mg
DRUG_DOSE 448 451 Klonopin 0.5 mg
DRUG_DOSE 456 459 digoxin 0.125 mg
DRUG_DOSE 463 466 Lexapro 10

# Additional Techniques

The Hugging Face Transformers library provides thousands of pre-trained models for tasks on text such as classification, information extraction, question answering, summarization, translation, text generation, and more, in over 100 languages. Its aim is to make cutting-edge NLP easier to use for everyone. The Model Hub hosts these models for anyone to download and use, along with a large number of datasets for NLP projects. https://huggingface.co/

- fill-mask: Masked Language Modeling is a fill-in-the-blank task, where a model uses the context words surrounding a mask token to try to predict what the masked word should be. For an input that contains one or more mask tokens, the model will generate the most likely substitution for each.
    - Input: "I have watched this [MASK] and it was awesome."
    - Output: "I have watched this movie and it was awesome."
- question-answering: Question answering is a task in information retrieval and Natural Language Processing (NLP) concerned with software that answers questions asked by humans in natural language. In Extractive Question Answering, a context is provided so that the model can refer to it and predict where the answer lies within the passage.
- sentiment-analysis: seeks to identify the writer’s (or speaker’s) mood, opinion, or feelings towards the topic.
- summarization: Text summarization condenses long publications into manageable paragraphs or sentences. The procedure extracts the important information while preserving the meaning of the original. This shortens the time it takes to comprehend long materials like research articles without omitting critical information.
- text-generation: leverages knowledge in computational linguistics and artificial intelligence to automatically generate natural language text that satisfies given communicative requirements.
- translation: a procedure in which computer software translates text from one language to another without human involvement.
- zero-shot-classification: assigns an unlabeled sample to a class that was not among the classes seen during training, i.e. the model has had zero examples ("shots") of that class.