# NER of Medical Reports Using Stanza and spaCy

Named entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.
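To make "locate and classify" concrete, here is a minimal sketch of what an NER result looks like as data: each entity is a character span plus a category label. The `EntitySpan` type and `locate` helper are hypothetical names used only for illustration, and the labels are assigned by hand rather than predicted by a model.

```python
from typing import List, NamedTuple


class EntitySpan(NamedTuple):
    """A located (start, end) character span and its category label."""
    start: int
    end: int
    label: str


def locate(text: str, phrase: str, label: str) -> EntitySpan:
    """Find the first occurrence of `phrase` in `text` and attach a label to it."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return EntitySpan(start, start + len(phrase), label)


text = "He would like to discuss this with Dr. John Doe in about a month."

# Hand-labelled spans standing in for model predictions.
spans: List[EntitySpan] = [
    locate(text, "John Doe", "PERSON"),
    locate(text, "about a month", "DATE"),
]

for span in spans:
    print(f"{text[span.start:span.end]!r} -> {span.label}")
```

A real NER model produces the same kind of structure, but predicts the span boundaries and labels itself.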


In [1]:
# Download the spaCy English model (uncomment and run once)
#!python -m spacy download en_core_web_lg

## Importing required libraries

In [1]:
import pprint
import spacy
import spacy_stanza

from spacy import displacy
from spacytextblob.spacytextblob import SpacyTextBlob

## Create a sample record for testing

In [2]:
sample_record = """
HISTORY OF PRESENT ILLNESS:  The patient comes in with 2 concerns. Primarily, he has noted an infection on his right great toe. For about 2 days, he noticed it was a little bit sore and red and it increased over the past 2 days and it woke him up throbbing last night.

He has not had any fevers or chills. He has been cleaning it with alcohol and peroxide and squeezing small amounts of pus from it. He does not have a history of ingrown toenail.

He also would like his throat checked. He said he has had cold about 3 weeks ago and initially had a temperature of 100.6 that lasted a couple of days followed by a runny nose and a sore throat. Most of his symptoms have resolved except for a little residual right sore throat. He has had no trouble swallowing. No more fevers or chills. No chest congestion or shortness of breath or wheezing.

The patient also mentions that his blood pressure always seems to be borderline high with a diastolic in the 90s and he would like that checked as well. He does not have a family history of premature heart disease. He exercises regularly and is very active. Does not smoke. Occasional alcohol.

PAST MEDICAL HISTORY:  Thalassemia trait, Guillain-Barre syndrome, IDA, perforated eardrum, history of right carpal tunnel syndrome, tenosynovitis, impaired fasting glucose, history of chronic low back pain.

MEDICATIONS:  No regular medicines.

ALLERGIES:  No known drug allergies.

PHYSICAL EXAMINATION:  Vital Signs:  Blood pressure by the nursing assistant 148/90, by myself was 140/90, pulse of 66, temperature 97.8, oxygen saturation 98%. The patient is pleasant, in no apparent distress and looks well. His skin is warm and dry. Ears:  Some mild erythema and old scarring. Posterior pharynx without erythema or exudate. Sinuses are nontender. Nodes are negative. Neck is supple. Lungs are clear without rales, rhonchi, or wheezes. Negative egophony. Good inspiratory effort. Cardiac: S1, S2 regular rate and rhythm. Right great toe is erythematous around the nail on both sides, a little bit of scab on the medial corner of the nail. There is no joint line tenderness. Good range of motion. There is no streaking.

ASSESSMENT AND PLAN:

    Ingrown toenail on the right. We will treat with warm soaks at least 4 times a day in either warm soapy water or peroxide and water. We will treat with Keflex 500 mg one p.o. q.i.d. for 7-10 days. He is to keep this covered while at work.
    Resolving upper respiratory infection without sign of bacterial infection at this time.
    Blood pressure at 140/90, borderline high. He would like to discuss this with Dr. John Doe. We will set up an appointment for him to return in about a month. In the meantime, we will have him check his blood pressures at home, about 3 times a week, and bring a record of his blood pressure readings and his cuff so we can check that when he comes in. He will call for any new or worsening symptoms.
"""

## Create NER pipeline

In [3]:
# Load Stanza's tokenizer and three NER models behind a spaCy-compatible pipeline
ner_pipeline = spacy_stanza.load_pipeline(
    "en",
    processors={
        "ner": ["i2b2", "radiology", "ontonotes"],
        "tokenize": "combined",
    },
    package="",
)

2022-12-12 10:48:26 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

2022-12-12 10:48:29 INFO: Loading these models for language: en (English):
| Processor | Package                  |
----------------------------------------
| tokenize  | combined                 |
| ner       | i2b2;radiology;ontonotes |

2022-12-12 10:48:29 INFO: Use device: cpu
2022-12-12 10:48:29 INFO: Loading: tokenize
2022-12-12 10:48:29 INFO: Loading: ner
2022-12-12 10:48:31 INFO: Done loading processors!


## Passing the sample through the Model Pipeline

In [5]:
ner_doc = ner_pipeline(sample_record)

The sample is also passed through the sentiment pipeline defined below to estimate the overall sentiment of the report.

In [4]:
sentiment_pipeline = spacy.load("en_core_web_lg")
sentiment_pipeline.add_pipe("spacytextblob")

<spacytextblob.spacytextblob.SpacyTextBlob at 0x7f4e687a11d0>

In [6]:
sentiment_doc = sentiment_pipeline(sample_record)
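TextBlob reports polarity as a float in the range [-1.0, 1.0], where negative values indicate negative sentiment and positive values indicate positive sentiment. The sketch below shows one way to turn that number into a coarse label. The `polarity_label` helper and its `neutral_band` threshold are assumptions made for illustration and are not part of spacytextblob.

```python
def polarity_label(polarity: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label.

    `neutral_band` is an arbitrary, hypothetical cutoff: scores whose
    absolute value does not exceed it are treated as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Example scores chosen by hand, not produced by the pipeline.
labels = [polarity_label(p) for p in (-0.4, 0.0, 0.25)]
print(labels)
```

With the pipeline above, the input would be `sentiment_doc._.blob.polarity`.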

## Storing Output

In [7]:
# Group entity texts by label, keeping each distinct text once per label
extracted_record = {}
for entity in ner_doc.ents:
    if entity.label_ in extracted_record:
        if str(entity) not in extracted_record[entity.label_]:
            extracted_record[entity.label_].append(str(entity))
    else:
        extracted_record[entity.label_] = [str(entity)]

# Store the overall polarity of the report alongside the entities
extracted_record["SENTIMENT"] = round(sentiment_doc._.blob.polarity, 5)
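The grouping step can be tried without loading any model. The sketch below applies the same "group by label, skip repeated texts" logic to plain `(text, label)` pairs. The `group_entities` helper and the sample pairs are illustrative assumptions; with spaCy, the pairs would come from `(str(ent), ent.label_)` for each `ent` in `ner_doc.ents`.

```python
from typing import Dict, Iterable, List, Tuple


def group_entities(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs by label, keeping first-seen order and dropping repeats."""
    grouped: Dict[str, List[str]] = {}
    for text, label in pairs:
        texts = grouped.setdefault(label, [])  # create the list on first sight of a label
        if text not in texts:
            texts.append(text)
    return grouped


# Hand-written stand-ins for model output; "great toe" appears twice on purpose.
sample_pairs = [
    ("great toe", "ANATOMY"),
    ("about 2 days", "DATE"),
    ("great toe", "ANATOMY"),
    ("nail", "ANATOMY"),
]

grouped = group_entities(sample_pairs)
print(grouped)
# {'ANATOMY': ['great toe', 'nail'], 'DATE': ['about 2 days']}
```

Using `setdefault` folds the `if`/`else` branches of the loop into a single path while producing the same dictionary.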

## Displaying the Extracted Data

In [8]:
pprint.pprint(extracted_record)

{'ANATOMY': ['skin', 'S1', 'S2', 'great toe', 'nail', 'cuff'],
 'ANATOMY_MODIFIER': ['medial', 'corner'],
 'CARDINAL': ['2',
              '100.6',
              '148/90',
              '140/90',
              '66',
              '97.8',
              'at least 4',
              'about 3'],
 'DATE': ['about 2 days',
          'the past 2 days',
          'about 3 weeks ago',
          'a couple of days',
          'the 90s',
          '7-10 days',
          'about a month'],
 'OBSERVATION': ['active',
                 'smoke',
                 'alcohol',
                 'warm',
                 'dry',
                 'negative',
                 'supple',
                 'clear',
                 'inspiratory effort',
                 'rhythm',
                 'range of motion'],
 'OBSERVATION_MODIFIER': ['increased', 'small amounts', 'more', 'very', 'rate'],
 'PERCENT': ['98%'],
 'PERSON': ['John Doe'],
 'PROBLEM': ['an infection on his right great toe',
             'a little bit

## Displaying Extracted Entities Annotated in the Document

In [10]:
displacy.render(ner_doc, style="ent", jupyter=True)

# Performing the Test with Medical Record Sample

We now define a function that reads a medical transcript from a text file and applies the same NER and sentiment pipelines to it.

In [11]:
def extract_info(filename: str, visualize: bool = True) -> None:
    """Run the NER and sentiment pipelines on a transcript file and print the result."""
    # Read the whole transcript into memory
    with open(filename, "r") as inp:
        content = inp.read()

    # Stanza tokenizer and NER models, wrapped as a spaCy pipeline
    ner_pipeline = spacy_stanza.load_pipeline(
        "en",
        processors={
            "ner": ["i2b2", "radiology", "ontonotes"],
            "tokenize": "combined",
        },
        package="",
    )

    # spaCy pipeline extended with the TextBlob sentiment component
    sentiment_pipeline = spacy.load("en_core_web_lg")
    sentiment_pipeline.add_pipe("spacytextblob")

    ner_doc = ner_pipeline(content)
    sentiment_doc = sentiment_pipeline(content)

    # Group entity texts by label, keeping each distinct text once per label
    extracted_record = {}
    for entity in ner_doc.ents:
        if entity.label_ in extracted_record:
            if str(entity) not in extracted_record[entity.label_]:
                extracted_record[entity.label_].append(str(entity))
        else:
            extracted_record[entity.label_] = [str(entity)]

    extracted_record["SENTIMENT"] = round(sentiment_doc._.blob.polarity, 5)

    # Optionally render the annotated transcript inline in the notebook
    if visualize:
        displacy.render(ner_doc, style="ent", jupyter=True)

    pprint.pprint(extracted_record)
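`extract_info` loads both pipelines on every call and only prints its result. As a possible variation, the sketch below takes already-loaded pipelines as arguments and returns the record, so that the models are loaded once and the output can be reused. `build_record` is a hypothetical name, and the two `fake_*` functions are stand-ins that mimic only the attributes used here (`ents`, `text`, `label_`, `_.blob.polarity`), so the example runs without downloading any model.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict


def build_record(
    content: str,
    ner: Callable[[str], Any],
    sentiment: Callable[[str], Any],
) -> Dict[str, Any]:
    """Apply preloaded pipelines to `content` and return the extracted record."""
    ner_doc = ner(content)
    sentiment_doc = sentiment(content)

    record: Dict[str, Any] = {}
    for entity in ner_doc.ents:
        texts = record.setdefault(entity.label_, [])
        if entity.text not in texts:  # keep each distinct text once per label
            texts.append(entity.text)

    record["SENTIMENT"] = round(sentiment_doc._.blob.polarity, 5)
    return record


def fake_ner(text: str) -> SimpleNamespace:
    """Stand-in for the Stanza/spaCy NER pipeline (illustration only)."""
    ents = [
        SimpleNamespace(text="John Doe", label_="PERSON"),
        SimpleNamespace(text="98%", label_="PERCENT"),
        SimpleNamespace(text="John Doe", label_="PERSON"),
    ]
    return SimpleNamespace(ents=ents)


def fake_sentiment(text: str) -> SimpleNamespace:
    """Stand-in for the spacytextblob pipeline (illustration only)."""
    blob = SimpleNamespace(polarity=0.123456)
    return SimpleNamespace(_=SimpleNamespace(blob=blob))


record = build_record("any transcript text", fake_ner, fake_sentiment)
print(record)
```

In the notebook, the `ner_pipeline` and `sentiment_pipeline` objects created earlier could be passed in place of the stand-ins.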

## Applying to a Transcript and displaying Output

Apply the pipeline to `report_15.txt`:

In [12]:
extract_info("../datasets/report_15.txt")

2022-12-12 05:47:10 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

2022-12-12 05:47:12 INFO: Loading these models for language: en (English):
| Processor | Package                  |
----------------------------------------
| tokenize  | combined                 |
| ner       | i2b2;radiology;ontonotes |

2022-12-12 05:47:12 INFO: Use device: cpu
2022-12-12 05:47:12 INFO: Loading: tokenize
2022-12-12 05:47:12 INFO: Loading: ner
2022-12-12 05:47:14 INFO: Done loading processors!


{'ANATOMY': ['subscapularis',
             'supraspinatus',
             'labrum',
             'supraspinatus tendons',
             'Scalene',
             'shoulder',
             'cervical spine',
             'arm',
             'upper extremity',
             'neck',
             'portal',
             'biceps',
             'axillary',
             'pouch',
             'glenohumeral ligament',
             'subacromial',
             'bone',
             'right upper extremity'],
 'ANATOMY_MODIFIER': ['right',
                      'lateral',
                      'base',
                      'posterior',
                      'anterolateral',
                      'superior',
                      'anchor',
                      'anteroinferior',
                      'inferior',
                      'rest',
                      'middle',
                      'anterior',
                      'band',
                      'region'],
 'CARDINAL': ['1', '2', 'Less than 5', '