<a href="https://colab.research.google.com/github/osman-mo94/Sarcopenia-NLP-project/blob/main/synthetic_letters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install and import packages

In [117]:
!pip install docx2txt



In [118]:
!pip install spacy




In [119]:
!pip install negspacy



In [120]:
!pip install scispacy



In [121]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz (120.2 MB)


In [122]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import docx2txt
import spacy
from spacy.matcher import PhraseMatcher
from spacy.pipeline import EntityRuler
from negspacy.negation import Negex
from negspacy.termsets import termset
from spacy.tokens import Span
import scispacy
from scispacy.abbreviation import AbbreviationDetector
from spacy import displacy

In [123]:
#Initialize nlp pipeline with scispacy model (for processing biomedical, scientific and clinical text)
nlp = spacy.load("en_ner_bc5cdr_md")
#Add abbreviation detector for medical abbreviations
nlp.add_pipe("abbreviation_detector")

<scispacy.abbreviation.AbbreviationDetector at 0x7f2c9ed8b810>

In [124]:
#View components of nlp pipeline
nlp.component_names

['tok2vec',
 'tagger',
 'attribute_ruler',
 'lemmatizer',
 'parser',
 'ner',
 'abbreviation_detector']

In [125]:
#Mount google drive so that colab can access files in my google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Import letters for analysis

In [126]:
#Import letters (note that these letters do not refer to real patients)
letter_A = docx2txt.process('/content/drive/MyDrive/NLP projects/Dummy letters/dummy letters/Letter A.docx')
letter_B = docx2txt.process('/content/drive/MyDrive/NLP projects/Dummy letters/dummy letters/Letter B.docx')

In [127]:
print(letter_A)

Mr A Smith

567 Ghengis Khan Drive, Newcastle NE4 5XX



Diagnoses:

Poor mobility due to chronic pain, low confidence and previous falls 

Weight loss, anaemia and raised inflammatory markers of unknown aetiology 

Low mood secondary to poor mobility 

Breathlessness and elevated BNP awaiting echo 

Chronic back pain with degenerative changes on MRI 

Urinary frequency and incontinence 



Other diagnoses:

Complex partial epileptic seizures

Hypertension 

Osteoarthritis with bilateral total hip replacements

Atrial fibrillation 

Asthma 

Patent foramen ovale 

Previous cerebellar stroke 



Medications:

Atorvastatin

Docusate

Ferrous fumarate

Vitamin-D3

Furosemide

Gaviscon

Flutiform inhaler

Salbutamol inhaler

Lansoprazole

Losartan 

Paracetamol 

Phenytoin

Tegretol slow release

Codeine

Warfarin



Suggested changes to medication 

Reduce codeine 15 mg dose but try and use regularly 2-3 times per day 



Follow up arrangements

 I will organise ultrasound of the abdomen 

In [128]:
print(letter_B)

Mrs B Smith

Flat 1, Farringdon Road, Newcastle NE2 5DH

Date of Birth: 01/01/1932





Diagnoses: 

Falls due to gait and balance disorder 

Hyperthyroidism due to thyroxine over-replacement

Sarcopenia 

Orthostatic hypotension 



Existing diagnoses: 

Hypertension 

Hypothyroidism 

Vitamin B12 deficiency 

Previous fractured wrist

Visual impairment due to cataracts 



Medications: 

Alendronic acid 

Calcium and vitamin-D 

Bendroflumethiazide 

Vitamin B12 

Simvastatin 

Ramipril 



Medication changes: 

Please reduce thyroxine dose to 75mcg once daily



Follow up arrangements: 

I will write back when I see the results of her 24 hour electrocardiogram



For Primary care 

Could you please forward me a copy of the 24 hour blood pressure monitor that she says she had in your surgery recently? 



I saw Mrs Smith for a face-to-face appointment at the Belsay Clinic today; she was accompanied by her son. She gives a history of two falls over the last few months and feels unstea

In [129]:
#Apply nlp pipeline to letter A
doc_A = nlp(letter_A)

  global_matches = self.global_matcher(doc)


In [130]:
#Apply nlp pipeline to letter B
doc_B = nlp(letter_B)

  global_matches = self.global_matcher(doc)


# Apply sci-spacy NER

In [131]:
#Visualise entities in Letter A
displacy.render(doc_A, style='ent', jupyter = True)

# Build PhraseMatcher

In [132]:
#Define a list of terms indicative of muscle weakness
weakness_list = ["muscle weakness", "weak", "uses a stick", "uses a walking stick", 
                    "uses a frame", "uses a zimmer frame", "uses a walker", "uses a walking aid",
                    "furniture walks", "difficulty mobilising", "difficulty walking", "wheelchair"
                    "difficulty standing", "difficulty climbing stairs", "cannot climb stairs", "housebound",
                    "bedbound", "hoist transfer", "slowed up", "limited mobility", "poor mobility"
                    "needs assistance", "difficulty carrying", "falls", "fallen",
                    "found on floor", "long lie"]

In [133]:
#Initialize matcher
matcher = PhraseMatcher(nlp.vocab)

#Apply spaCy nlp pipeline to list of weakness terms
weakness_terms = [nlp(i) for i in weakness_list]


  global_matches = self.global_matcher(doc)


In [134]:
#Add weakness terms to PhraseMatcher
matcher.add("WEAKNESS TERM", weakness_terms)

In [135]:
#Add pattern for SARC-F score
sarcf_list = ["SARC-F", "SARC F", "SARCF", "sarc-f", "sarc f", "Sarc f", "Sarc F", "Sarc-f", "Sarc-F"]

sarcf_terms = [nlp(i) for i in sarcf_list]

matcher.add("SARC-F", sarcf_terms)


  global_matches = self.global_matcher(doc)


In [136]:
#Add pattern for Sarcopenia diagnosis
sarcopenia_diagnosis = ["Sarcopenia", "sarcopenia"]

sarcopenia_terms = [nlp(i) for i in sarcopenia_diagnosis]

#Add to matcher
matcher.add("Sarcopenia", sarcopenia_terms)


  global_matches = self.global_matcher(doc)


In [137]:
#Apply matcher to letter A
matchesA = matcher(doc_A)

for match_id, start, end in matchesA: 
  span = doc_A[start:end]
  match_id_string = nlp.vocab.strings[match_id]
  print("Match:",match_id_string, "-", span.text, "( Location = ", start, end, ")")

Match: WEAKNESS TERM - falls ( Location =  27 28 )
Match: WEAKNESS TERM - falls ( Location =  249 250 )
Match: WEAKNESS TERM - fallen ( Location =  297 298 )
Match: WEAKNESS TERM - housebound ( Location =  315 316 )
Match: SARC-F - SARC-F ( Location =  976 977 )


In [138]:
#Apply matcher to letter B
matchesB = matcher(doc_B)

for match_id, start, end in matchesB: 
  span = doc_B[start:end]
  match_id_string = nlp.vocab.strings[match_id]
  print("Match:",match_id_string, "-", span.text, "( Location = ", start, end, ")")

Match: Sarcopenia - Sarcopenia ( Location =  37 38 )
Match: WEAKNESS TERM - falls ( Location =  172 173 )
Match: WEAKNESS TERM - falls ( Location =  203 204 )
Match: SARC-F - Sarc-F ( Location =  532 533 )
Match: WEAKNESS TERM - falls ( Location =  664 665 )
Match: Sarcopenia - sarcopenia ( Location =  675 676 )
Match: WEAKNESS TERM - falls ( Location =  720 721 )
Match: Sarcopenia - sarcopenia ( Location =  790 791 )


Limited matches from using the PhraseMatcher, rule-based matcher would likely be more sensitive. An expanded list of terms indicating weakness would also be helpful.

# Try a rule-based matcher

In [139]:
#Import rule-based matcher
from spacy.matcher import Matcher

In [140]:
#Initialize matcher
rb_matcher = Matcher(nlp.vocab)

#Add patterns for weakness
weakness_pattern = [
                    [{"LEMMA": "fall"}], [{"LEMMA": "weak"}], [{"LOWER": "housebound"}], [{"LOWER": "bedbound"}],
                    [{"LEMMA": "use"}, {"LOWER": "a", "OP": "?"}, {"LEMMA": "walk", "OP": "?"}, {"LOWER": "stick"}],
                    [{"LEMMA": "use"}, {"LOWER": "a", "OP": "?"}, {"LEMMA": "walk", "OP": "?"}, {"LOWER": "zimmer", "OP": "?"}, {"LOWER": "frame"}],
                    [{"LEMMA": "use"}, {"LOWER": "a", "OP": "?"}, {"LEMMA": "walk"}, {"LOWER": "aid", "OP": "?"}],
                    [{"LOWER": "furniture"}, {"LEMMA": "walk"}], [{"LEMMA": "difficult"}, {"LEMMA": "walk"}],
                    [{"LEMMA": "difficult"}, {"LEMMA": "mobilise"}], [{"LEMMA": "difficult"}, {"LEMMA": "stand"}],
                    [{"LEMMA": "difficult"}, {"LOWER": "with", "OP": "?"}, {"LEMMA": "climb", "OP": "?"}, {"LEMMA": "stair"}],
                    [{"LOWER": "cannot"}, {"LEMMA": "climb", "OP": "?"}, {"LEMMA": "stair"}],
                    [{"LOWER": "can't"}, {"LEMMA": "climb", "OP": "?"}, {"LEMMA": "stair"}],
                    [{"LEMMA": "hoist"}, {"LEMMA": "transfer"}], [{"LEMMA": "slow"}, {"LOWER": "up"}],
                    [{"LEMMA": "limit"}, {"LOWER": "mobility"}],  [{"LOWER": "poor"}, {"LOWER": "mobility"}],
                    [{"LEMMA": "need"}, {"LEMMA": "assist"}], [{"LEMMA": "require"}, {"LEMMA": "assist"}],
                    [{"LEMMA": "difficult"}, {"LEMMA": "carry"}], [{"LOWER": "found"}, {"LOWER": "on"}, {"LOWER": "floor"}],
                    [{"LOWER": "long"}, {"LOWER": "lie"}], [{"LEMMA": "lack"}, {"LOWER": "of", "OP": "?"}, {"LOWER": "mobility"}],
                    [{"LEMMA": "lack"}, {"LOWER": "of", "OP": "?"}, {"LEMMA": "strength"}]
]


#Add patternS to matcher
rb_matcher.add("WEAKNESS TERM", weakness_pattern)


In [141]:
#Apply rb_matcher to letter A

rb_matchesA = rb_matcher(doc_A)

for match_id, start, end in rb_matchesA: 
  span = doc_A[start:end]
  match_id_string = nlp.vocab.strings[match_id]
  print("Match:",match_id_string, "-", span.text, "( Location = ", start, end, ")")

Match: WEAKNESS TERM - Poor mobility ( Location =  16 18 )
Match: WEAKNESS TERM - falls ( Location =  27 28 )
Match: WEAKNESS TERM - poor mobility ( Location =  45 47 )
Match: WEAKNESS TERM - poor mobility ( Location =  246 248 )
Match: WEAKNESS TERM - falls ( Location =  249 250 )
Match: WEAKNESS TERM - fallen ( Location =  297 298 )
Match: WEAKNESS TERM - lack of mobility ( Location =  305 308 )
Match: WEAKNESS TERM - housebound ( Location =  315 316 )
Match: WEAKNESS TERM - lacks strength ( Location =  357 359 )
Match: WEAKNESS TERM - falling ( Location =  364 365 )
Match: WEAKNESS TERM - fall ( Location =  392 393 )
Match: WEAKNESS TERM - lack of strength ( Location =  1092 1095 )


In [142]:
#Apply rb_matcher to letter B
rb_matchesB = rb_matcher(doc_B)

for match_id, start, end in rb_matchesB: 
  span = doc_B[start:end]
  match_id_string = nlp.vocab.strings[match_id]
  print("Match:",match_id_string, "-", span.text, "( Location = ", start, end, ")")

Match: WEAKNESS TERM - Falls ( Location =  23 24 )
Match: WEAKNESS TERM - falls ( Location =  172 173 )
Match: WEAKNESS TERM - falls ( Location =  203 204 )
Match: WEAKNESS TERM - fell ( Location =  225 226 )
Match: WEAKNESS TERM - fell ( Location =  325 326 )
Match: WEAKNESS TERM - falling ( Location =  584 585 )
Match: WEAKNESS TERM - fell ( Location =  603 604 )
Match: WEAKNESS TERM - falls ( Location =  664 665 )
Match: WEAKNESS TERM - falls ( Location =  720 721 )


The rule-based matcher works better at detecting phrases rather than using a simple PhraseMatcher:

*   12 matches for letter A, vs. 4 matches using Phrasematcher
*   9 matches for letter B, vs. 4 matches using Phrasematcher




# Entity-ruler for visualization

In [143]:
#Initialize NER ruler

ruler = nlp.add_pipe("entity_ruler", before = "ner")

In [144]:
#Add weakness patterns to NER ruler
for item in weakness_pattern:
  ruler.add_patterns([{"label": "WEAKNESS TERM", "pattern": item}])

#Add sarcopenia diagnosis to NER ruler
for i in sarcopenia_diagnosis:
  ruler.add_patterns([{"label": "SARCOPENIA", "pattern": i}])

#Add SARC-F to NER ruler
for i in sarcf_list:
  ruler.add_patterns([{"label": "SARC-F", "pattern": i}])


In [145]:
#Apply nlp pipeline to letter A
doc_A = nlp(letter_A)

  global_matches = self.global_matcher(doc)


In [146]:
#Visualise weakness entities in Letter A

def get_entity_options():
  entities = ["WEAKNESS TERM", "SARCOPENIA", "SARC-F"]
  colors = {"WEAKNESS TERM": 'linear-gradient(90deg, #ffff66, #ff6600)', "SARCOPENIA": 'linear-gradient(90deg, #aa9cfc, #fc9ce7)',
            "SARC-F": 'linear-gradient(180deg, #66ffcc, #abf763)'}
  options = {"ents": entities, "colors": colors}
  return options
options = get_entity_options()

displacy.render(doc_A, style = 'ent', options=options, jupyter = True)

In [147]:
#Apply nlp pipeline to letter B
doc_B = nlp(letter_B)

  global_matches = self.global_matcher(doc)


In [148]:
#Visualise weakness entities in Letter B

def get_entity_options():
  entities = ["WEAKNESS TERM", "SARCOPENIA", "SARC-F"]
  colors = {"WEAKNESS TERM": 'linear-gradient(90deg, #ffff66, #ff6600)', "SARCOPENIA": 'linear-gradient(90deg, #aa9cfc, #fc9ce7)',
            "SARC-F": 'linear-gradient(180deg, #66ffcc, #abf763)'}
  options = {"ents": entities, "colors": colors}
  return options
options = get_entity_options()

displacy.render(doc_B, style = 'ent', options=options, jupyter = True)

# Negation detection

In [149]:
#Define termset as clinical
ts = termset("en_clinical_sensitive")

#Add negex to nlp pipeline
nlp.add_pipe("negex", config={
    "neg_termset":ts.get_patterns()
})

<negspacy.negation.Negex at 0x7f2c0b96d310>

In [150]:
#View termset patterns in use
print(ts.get_patterns())

{'pseudo_negations': ['not able to be', 'not certain if', 'not certain whether', 'not necessarily', 'without any further', 'without difficulty', 'without further', 'might not', 'not only', 'no increase', 'no significant change', 'no change', 'no definite change', 'not extend', 'not cause', 'gram negative', 'not rule out', 'not ruled out', 'not been ruled out', 'not drain', 'no suspicious change', 'no interval change', 'no significant interval change'], 'preceding_negations': ['cannot', 'patient was not', 'rule the patient out', 'without indication of', 'doesnt', 'educate the patient', 'h/o', 'taught the patient', 'couldnt', 'leads to', 'never developed', 'never had', 'wasnt', "aren't", 'no further', 'ruled patient out', 'rule him out', 'concern for', 'tested for', 'no signs of', 'not demonstrate', 'history of', 'evaluate for', 'denied', "couldn't", 'ruled her out', "didn't", 'ruled out', 'no sign of', 'didnt', 'rule her out', 'monitor for', 'arent', 'not', 'denying', 'versus', 'without

In [151]:
#No further does not seem to fit as a pseudo-negation, therefore I will remove it, and add to preceding negation instead
ts.remove_patterns({"pseudo_negations": ["no further"]})
ts.add_patterns({"preceding_negations": ["no further"]})

#Check that termset has been modified
print(ts.get_patterns())

{'pseudo_negations': ['not able to be', 'not certain if', 'not certain whether', 'not necessarily', 'without any further', 'without difficulty', 'without further', 'might not', 'not only', 'no increase', 'no significant change', 'no change', 'no definite change', 'not extend', 'not cause', 'gram negative', 'not rule out', 'not ruled out', 'not been ruled out', 'not drain', 'no suspicious change', 'no interval change', 'no significant interval change'], 'preceding_negations': ['cannot', 'patient was not', 'rule the patient out', 'without indication of', 'doesnt', 'educate the patient', 'h/o', 'taught the patient', 'couldnt', 'leads to', 'never developed', 'never had', 'wasnt', "aren't", 'no further', 'ruled patient out', 'rule him out', 'concern for', 'tested for', 'no signs of', 'not demonstrate', 'history of', 'evaluate for', 'denied', "couldn't", 'ruled her out', "didn't", 'ruled out', 'no sign of', 'didnt', 'rule her out', 'monitor for', 'arent', 'not', 'denying', 'versus', 'without

In [152]:
# View any negations in letter A, True indicates a negation
for e in doc_A.ents:
  print(e.text, e._.negex)

NE4 False
Poor mobility False
chronic pain False
falls False
Weight loss False
anaemia False
Low mood False
poor mobility False
Breathlessness False
Chronic back pain False
incontinence False
epileptic seizures False
Hypertension False
Atrial fibrillation False
Asthma 

Patent foramen ovale False
cerebellar stroke False
Atorvastatin False
Docusate False
fumarate False
Vitamin-D3 False
Furosemide False
Gaviscon False
inhaler False
Salbutamol False
Lansoprazole False
Losartan False
Paracetamol False
Phenytoin False
Tegretol False
Codeine False
Warfarin False
codeine False
poor mobility False
falls False
fallen False
lack of mobility False
housebound False
lacks strength False
falling False
pain False
fall False
vertigo False
pain False
pain False
appetite False
nausea or vomiting False
constipation False
toothache False
breathlessness False
cough False
phlegm False
rash False
rash False
crampy False
abdominal pain False
ideation False
jaundice False
anaemia False
cyanosis False
clubbing 

In [153]:
# View any negations in letter B, True indicates a negation
for e in doc_B.ents:
  print(e.text, e._.negex)

Newcastle NE2 5DH

 False
Falls False
gait and balance disorder False
Hyperthyroidism False
thyroxine False
Sarcopenia False
Orthostatic hypotension False
Hypertension False
Hypothyroidism False
Vitamin B12 False
fractured wrist

Visual impairment False
cataracts False
Alendronic acid False
Calcium False
vitamin-D 

Bendroflumethiazide False
Vitamin B12 False
Simvastatin False
Ramipril False
thyroxine False
falls False
falls False
fell False
lightheadedness False
vertigo False
palpitations False
chest pain False
syncopal False
fell False
appetite False
choking False
dysphagia False
nausea False
abdominal pain False
jaundice False
anaemia False
clubbing False
cyanosis False
lymphadenopathy False
goitre False
ankle oedema False
organomegaly False
bradykinesia False
tremor False
exophthalmos False
Sarc-F False
falling False
174/64 False
fell False
falls False
sarcopenia False
falls False
syncope False
appetite False
weight loss False
hyperthyroid False
thyroxine False
sarcopenia False
wei

No negative entities in letters at present. There will need to ammend letters to include some negative entities for detection. 