
In [26]:
from transformers import pipeline

In [27]:
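# Step 1: Load the pre-trained NER model
# (a general-purpose model fine-tuned on CoNLL-03: PER, ORG, LOC and MISC labels, no medical entity types)
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")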

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
# Step 2: Define the medical text
medical_text = """
The patient, Mrs. Lis Smith, was diagnosed with Type 2 diabetes and hypertension.
She was prescribed 10mg of Lisinopril and 500mg of Metformin daily.
The doctor recommended monitoring blood sugar levels regularly and maintaining a healthy diet.
The follow-up appointment is scheduled for next month.
"""


In [29]:
# Step 3: Apply the NER pipeline to extract entities
raw_entities = ner_pipeline(medical_text)

In [30]:
raw_entities

[{'entity': 'I-PER',
  'score': 0.99855644,
  'index': 6,
  'word': 'Li',
  'start': 19,
  'end': 21},
 {'entity': 'I-PER',
  'score': 0.9966264,
  'index': 7,
  'word': '##s',
  'start': 21,
  'end': 22},
 {'entity': 'I-PER',
  'score': 0.99940634,
  'index': 8,
  'word': 'Smith',
  'start': 23,
  'end': 28},
 {'entity': 'I-MISC',
  'score': 0.9652857,
  'index': 13,
  'word': 'Type',
  'start': 49,
  'end': 53},
 {'entity': 'I-MISC',
  'score': 0.89916104,
  'index': 14,
  'word': '2',
  'start': 54,
  'end': 55},
 {'entity': 'I-MISC',
  'score': 0.96145046,
  'index': 29,
  'word': 'Li',
  'start': 111,
  'end': 113},
 {'entity': 'I-MISC',
  'score': 0.5276324,
  'index': 30,
  'word': '##sin',
  'start': 113,
  'end': 116},
 {'entity': 'I-MISC',
  'score': 0.91352385,
  'index': 38,
  'word': 'Met',
  'start': 135,
  'end': 138},
 {'entity': 'I-MISC',
  'score': 0.44034484,
  'index': 39,
  'word': '##form',
  'start': 138,
  'end': 142}]

In [31]:
# Step 4: Post-process entities to merge WordPiece subword tokens
def process_entities(entities):
    """Merge "##"-prefixed WordPiece continuations into the preceding token."""
    processed_entities = []
    current_entity = None

    for entity in entities:
        word = entity['word']

        if word.startswith("##"):
            # Continuation piece: attach it only if it directly follows the
            # entity being built; otherwise it is dropped
            if current_entity and entity['start'] == current_entity['end']:
                current_entity['word'] += word[2:]
                current_entity['end'] = entity['end']  # Extend the span over the subword
                # Keep the highest score among the merged pieces
                current_entity['score'] = max(current_entity['score'], entity['score'])
        else:
            if current_entity:
                processed_entities.append(current_entity)
            current_entity = dict(entity)  # Copy so raw_entities is left unmodified

    if current_entity:
        processed_entities.append(current_entity)

    return processed_entities

entities = process_entities(raw_entities)
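
The merge in Step 4 joins subword pieces, but a multi-token name such as "Lis Smith" is still returned as separate entities. As a rough sketch of one way to group them, the snippet below merges tokens that share a label and are separated by nothing or only whitespace. `merge_adjacent_spans` is a hypothetical helper (not part of `transformers`), the sample tokens are hand-written for the short string shown rather than taken from the notebook run, and keeping the minimum score is an assumption made here to stay conservative.

```python
def merge_adjacent_spans(text, tokens):
    """Merge same-label tokens that touch or are separated only by whitespace.

    Hypothetical helper for illustration; expects dicts with
    'entity', 'score', 'start' and 'end' keys.
    """
    spans = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        last = spans[-1] if spans else None
        if (
            last is not None
            and tok["entity"] == last["entity"]
            and text[last["end"]:tok["start"]].strip() == ""
        ):
            # Same label and no non-whitespace text in between: extend the span
            last["end"] = max(last["end"], tok["end"])
            last["score"] = min(last["score"], tok["score"])  # assumption: keep the weakest score
        else:
            spans.append(
                {
                    "entity": tok["entity"],
                    "score": tok["score"],
                    "start": tok["start"],
                    "end": tok["end"],
                }
            )
    # Read the surface form from the text itself, so no "##" handling is needed
    for span in spans:
        span["word"] = text[span["start"]:span["end"]]
    return spans


# Hand-written sample in the same shape as the pipeline output
sample_text = "Mrs. Lis Smith has Type 2 diabetes."
sample_tokens = [
    {"entity": "I-PER", "score": 0.99855644, "word": "Li", "start": 5, "end": 7},
    {"entity": "I-PER", "score": 0.9966264, "word": "##s", "start": 7, "end": 8},
    {"entity": "I-PER", "score": 0.99940634, "word": "Smith", "start": 9, "end": 14},
    {"entity": "I-MISC", "score": 0.9652857, "word": "Type", "start": 19, "end": 23},
    {"entity": "I-MISC", "score": 0.89916104, "word": "2", "start": 24, "end": 25},
]

merged = merge_adjacent_spans(sample_text, sample_tokens)
print([span["word"] for span in merged])  # ['Lis Smith', 'Type 2']
```

With spans grouped this way, a full name would be replaced by a single `[HIDDEN]` instead of one per token.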

In [32]:
# Step 5: Classify and modify the text
def classify_and_modify_text(text, entities):
    pii_labels = {'I-PER', 'I-LOC', 'I-ORG'}  # Labels treated as PII
    phi_labels = {'I-MISC'}  # Label treated as PHI (a heuristic: MISC is a catch-all, not a medical label)

    modified_text = text
    offsets = 0  # Cumulative length change caused by earlier edits

    # Process entities left to right so the running offset stays valid
    for entity in sorted(entities, key=lambda e: e['start']):
        label = entity['entity']
        start_idx = entity['start'] + offsets
        end_idx = entity['end'] + offsets
        span = modified_text[start_idx:end_idx]

        if label in pii_labels:
            # Replace PII with "[HIDDEN]"
            modified_text = modified_text[:start_idx] + "[HIDDEN]" + modified_text[end_idx:]
            offsets += len("[HIDDEN]") - len(span)
        elif label in phi_labels:
            # Highlight PHI by wrapping it in **
            modified_text = modified_text[:start_idx] + "**" + span + "**" + modified_text[end_idx:]
            offsets += len("**") * 2

    return modified_text
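
The running `offsets` counter in Step 5 is needed because each edit shifts every later position. A possible alternative, sketched here under the assumption that spans do not overlap, is to apply the edits from right to left so that earlier positions never move. `redact_right_to_left` and its sample spans are hypothetical and only illustrate the idea.

```python
PII_LABELS = frozenset({"I-PER", "I-LOC", "I-ORG"})
PHI_LABELS = frozenset({"I-MISC"})


def redact_right_to_left(text, spans):
    """Hide PII and highlight PHI without tracking offsets.

    Hypothetical helper; assumes non-overlapping spans with
    'entity', 'start' and 'end' keys.
    """
    # Editing the rightmost span first leaves all smaller indices untouched
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        start, end = span["start"], span["end"]
        if span["entity"] in PII_LABELS:
            text = text[:start] + "[HIDDEN]" + text[end:]
        elif span["entity"] in PHI_LABELS:
            text = text[:start] + "**" + text[start:end] + "**" + text[end:]
    return text


# Hand-written sample spans for a short string
sample_text = "Mrs. Lis Smith takes Metformin."
sample_spans = [
    {"entity": "I-PER", "start": 5, "end": 8},
    {"entity": "I-PER", "start": 9, "end": 14},
    {"entity": "I-MISC", "start": 21, "end": 30},
]

redacted = redact_right_to_left(sample_text, sample_spans)
print(redacted)  # Mrs. [HIDDEN] [HIDDEN] takes **Metformin**.
```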

In [33]:
# Make each span cover the full merged word, anchored at the start offset
# reported by the pipeline (assumes the merged word matches the source text
# character for character)
for entity in entities:
    entity['end'] = entity['start'] + len(entity['word'])

In [34]:
# Modify the text
modified_text = classify_and_modify_text(medical_text, entities)

# Step 6: Print the modified text
print("Original Text:")
print(medical_text)
print("\nModified Text:")
print(modified_text)


Original Text:

The patient, Mrs. Lis Smith, was diagnosed with Type 2 diabetes and hypertension. 
She was prescribed 10mg of Lisinopril and 500mg of Metformin daily. 
The doctor recommended monitoring blood sugar levels regularly and maintaining a healthy diet. 
The follow-up appointment is scheduled for next month.


Modified Text:

The patient, Mrs. [HIDDEN] [HIDDEN], was diagnosed with **Type** **2** diabetes and hypertension. 
She was prescribed 10mg of **Lisin**opril and 500mg of **Metform**in daily. 
The doctor recommended monitoring blood sugar levels regularly and maintaining a healthy diet. 
The follow-up appointment is scheduled for next month.
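
In the output above only part of each drug name is highlighted (`**Lisin**opril`, `**Metform**in`), because the model tagged only some of the subword pieces. One possible workaround, shown as a minimal sketch, is to widen each span to the surrounding word before modifying the text. `expand_to_word` is a hypothetical helper, and treating a word as a run of alphanumeric characters is an assumption that would not cover hyphenated terms.

```python
def expand_to_word(text, start, end):
    """Widen [start, end) until it is bounded by non-alphanumeric characters.

    Hypothetical helper; a "word" here is simply a run of alphanumerics.
    """
    while start > 0 and text[start - 1].isalnum():
        start -= 1
    while end < len(text) and text[end].isalnum():
        end += 1
    return start, end


sentence = "She was prescribed 10mg of Lisinopril and 500mg of Metformin daily."

expanded = []
for fragment in ("Lisin", "Metform"):
    # Locate the partial span the same way a tagged fragment would be reported
    frag_start = sentence.index(fragment)
    frag_end = frag_start + len(fragment)
    new_start, new_end = expand_to_word(sentence, frag_start, frag_end)
    expanded.append(sentence[new_start:new_end])

print(expanded)  # ['Lisinopril', 'Metformin']
```

Applying this to each entity's `start` and `end` before Step 5 would highlight the whole word, at the cost of trusting a partial tag to stand for the full token.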

