In [1]:
import torch 
import json 
import pickle
import pandas as pd 
import src.clinical_ner as ner
from importlib import reload



### Report Parsing Examples

To demonstrate the robustness of our approach to breast cancer clinical report analysis, we evaluated it on a set of 7 sample reports. The reports were selected to cover the range of structures and layouts commonly encountered in clinical settings, as well as variations in language use, including negation and misspellings.

In [2]:
# basic example
ner.extract_entity_from_text(
    "a. breast, left 11:00 o'clock 5cm from nipple - maligant finding: ductal carcinoma in situ", 
    weights_path ='models/model.all_augmentations.pt'
)

[{'sentence': "a. breast, left 11:00 o'clock 5cm from nipple - maligant finding: ductal carcinoma in situ",
  'entities': [{'label': 'I-position',
    'span': (16, 45),
    'text': "11:00 o'clock 5cm from nipple"},
   {'label': 'I-cancer_subtype',
    'span': (66, 90),
    'text': 'ductal carcinoma in situ'}]}]
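
The output above pairs each entity with a `span` into the sentence. As a minimal sketch, assuming the spans are half-open character offsets (which is what the values above suggest), the check below slices the sentence with each span and compares the result to the reported `text`. The helper name `spans_match` is ours and is not part of `src.clinical_ner`.

```python
# Result copied from the basic example output above.
result = {
    "sentence": "a. breast, left 11:00 o'clock 5cm from nipple - maligant finding: ductal carcinoma in situ",
    "entities": [
        {"label": "I-position", "span": (16, 45), "text": "11:00 o'clock 5cm from nipple"},
        {"label": "I-cancer_subtype", "span": (66, 90), "text": "ductal carcinoma in situ"},
    ],
}


def spans_match(result):
    """Return True if every span slices the sentence back to the entity text."""
    sentence = result["sentence"]
    return all(
        sentence[entity["span"][0]:entity["span"][1]] == entity["text"]
        for entity in result["entities"]
    )


print(spans_match(result))
```

A check like this is a cheap sanity test when the spans are later used to highlight text in the original report.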

In [3]:
# modified with slight misspellings
ner.extract_entity_from_text(
    "a. breast, left 11:00 oclock 5cm fn - maligant finding: ductal caricinoma in situ, high grade", 
    weights_path ='models/model.all_augmentations.pt'
)

[{'sentence': 'a. breast, left 11:00 oclock 5cm fn - maligant finding: ductal caricinoma in situ, high grade',
  'entities': [{'label': 'I-position',
    'span': (16, 35),
    'text': '11:00 oclock 5cm fn'},
   {'label': 'I-cancer_subtype',
    'span': (56, 81),
    'text': 'ductal caricinoma in situ'},
   {'label': 'I-cancer_grade', 'span': (83, 93), 'text': 'high grade'}]}]

In [11]:
# apply negation example
ner.extract_entity_from_text(
    "a. breast, left 11:00 oclock 5cm fn - no evidence of ductal caricinoma in situ",
    weights_path='models/model.all_augmentations.pt'
)

[{'sentence': 'a. breast, left 11:00 oclock 5cm fn - no evidence of ductal caricinoma in situ',
  'entities': [{'label': 'I-position',
    'span': (16, 35),
    'text': '11:00 oclock 5cm fn'}]}]
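
In the negated sentence the model returns the position but no cancer subtype. A small sketch of how that difference could be checked programmatically is given below; `has_label` is a hypothetical helper of ours, and the entity lists are copied from the two outputs above with the sentence text left out for brevity.

```python
def has_label(results, label):
    """Return True if any sentence in the extractor output carries the label."""
    return any(
        entity["label"] == label
        for result in results
        for entity in result["entities"]
    )


# Entities from the misspelled (non-negated) example.
misspelled = [{"entities": [
    {"label": "I-position", "span": (16, 35), "text": "11:00 oclock 5cm fn"},
    {"label": "I-cancer_subtype", "span": (56, 81), "text": "ductal caricinoma in situ"},
    {"label": "I-cancer_grade", "span": (83, 93), "text": "high grade"},
]}]

# Entities from the negated example.
negated = [{"entities": [
    {"label": "I-position", "span": (16, 35), "text": "11:00 oclock 5cm fn"},
]}]

print(has_label(misspelled, "I-cancer_subtype"))
print(has_label(negated, "I-cancer_subtype"))
```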

### Reports with Misspellings

Breast cancer clinical reports can contain errors such as misspellings, which can break strict rule-based systems. Our approach handles such errors without requiring explicit training on misspelled data, and can accurately extract the relevant information even when it is expressed in a non-standard or incorrect way. For example, the model identifies the phrase "invasive ductal carcinoma" even when it is misspelled as "invasive ductal carinoma", using contextual clues and report structure.
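
To make the failure mode of strict rules concrete, the sketch below compares an exact substring lookup with a character-level similarity score from the standard library's `difflib`. This is only an illustration of why exact matching misses the misspelling; it is not how our model works, and `TARGET` and `report_phrase` are names we introduce here.

```python
from difflib import SequenceMatcher

TARGET = "invasive ductal carcinoma"
report_phrase = "invasive ductal carinoma"  # misspelling taken from the example report

# A strict rule: exact substring lookup misses the misspelled phrase.
exact_hit = TARGET in report_phrase

# A character-level similarity score still rates the two phrases as close.
similarity = SequenceMatcher(None, TARGET, report_phrase).ratio()

print(exact_hit, round(similarity, 2))
```

A similarity threshold would recover this particular case, but it needs a known list of target phrases and a hand-tuned cutoff, which is the kind of maintenance burden a learned tagger is meant to avoid.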

In [4]:
file_path = 'data/example_diagnosis.mispelling_1.txt'
with open(file_path, 'rb') as f:
    text = f.read().decode()
    
print(f'file: {file_path}')
print('-'*50)
print(text)

file: data/example_diagnosis.mispelling_1.txt
--------------------------------------------------
A -  Lymph node, left axillary sentinel #1, biopsy: (slide for review, outside#UHS-17-010523, A)  - One lymph node, negative for metastatic carcinoma, (0/1).      B -  Breast, left, needle wire-localized partial mastectomy : (slide for review, outside# xxxx, B)  - Invasive ductal carinoma with tubular features, well differentiated, with associated microcalcifications, measuring 3.0mm in greatest dimension (measurements on slide ""B9""). See note.  - No in situ carcinoma is identified.  - No lymphovascular invasion is present.  - Biopsy site changes.     Note:   Invasive carcinoma is identified 2.5mm from the lateral (green ink) margin (closest margin).  See specimen ""C"" for additional lateral margin.    C -  Additional lateral margin, biopsy: (slide for review, outside# xxxx, C)  - Benign breast tissue.


In [5]:
ner.extract_entity_from_report(
    'data/example_diagnosis.mispelling_1.txt',
    weights_path ='models/model.all_augmentations.pt'
)

[{'sentence': 'a - lymph node, left axillary sentinel #1, biopsy: (slide for review, outside#uhs-17-010523, a) - one lymph node, negative for metastatic carcinoma, (0/1).',
  'entities': [{'label': 'I-position', 'span': (21, 29), 'text': 'axillary'}]},
 {'sentence': 'b - breast, left, needle wire-localized partial mastectomy : (slide for review, outside# xxxx, b) - invasive ductal carinoma with tubular features, well differentiated, with associated microcalcifications, measuring 3.0mm in greatest dimension (measurements on slide ""b9"").',
  'entities': [{'label': 'I-cancer_subtype',
    'span': (100, 124),
    'text': 'invasive ductal carinoma'},
   {'label': 'I-cancer_grade',
    'span': (148, 167),
    'text': 'well differentiated'}]},
 {'sentence': 'see note.', 'entities': []},
 {'sentence': '- no in situ carcinoma is identified.', 'entities': []},
 {'sentence': '- no lymphovascular invasion is present.', 'entities': []},
 {'sentence': '- biopsy site changes.', 'entities': []},
 {'

### Reports with Negation

Negation in breast cancer clinical reports typically occurs when a particular symptom, finding, or diagnosis is specifically excluded or ruled out. For example, if a diagnostic test for breast cancer comes back negative, the report may state "no signs of invasive carcinoma". In the following example, however, the report describes a finding that is similar to cancer but falls short of it. The negation is written in a convoluted and ambiguous way, which caused our previous strict rule-based systems to fail.
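
The sketch below shows how a cue-based negation rule can miss this kind of phrasing. The cue list and the `naive_is_negated` helper are hypothetical and written for illustration only; they are not the rules used in our earlier system. Both test sentences are taken from the example reports in this notebook.

```python
import re

# Hypothetical cue list, for illustration only.
NEGATION_CUES = re.compile(r"\b(no evidence of|negative for|no signs of)\b")


def naive_is_negated(sentence, finding):
    """Flag a finding as negated if a cue appears before it in the sentence."""
    position = sentence.find(finding)
    if position == -1:
        return False
    return NEGATION_CUES.search(sentence[:position]) is not None


simple = "one lymph node, negative for metastatic carcinoma, (0/1)."
convoluted = (
    "there are scattered foci in parts 1 and 2 showing a micropapillary and "
    "cribriform ductal proliferation, which fall short of the criteria for "
    "ductal carcinoma in situ."
)

print(naive_is_negated(simple, "metastatic carcinoma"))
print(naive_is_negated(convoluted, "ductal carcinoma in situ"))
```

The second sentence rules out ductal carcinoma in situ without using any of the listed cues, so the rule treats the finding as present.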

In [6]:
negation_file_path = 'data/example_diagnosis.negation_1.txt'
with open(negation_file_path, 'rb') as f:
    negation_text = f.read().decode()
    
print(f'file: {negation_file_path}')
print('-'*50)
print(negation_text)


In [7]:
ner.extract_entity_from_report(
    negation_file_path,
    weights_path ='models/model.all_augmentations.pt'
)

[{'sentence': '1. breast, right 9:00, segmental excision with needle localization: - fibrocystic change, proliferative type, including atypical ductal hyperplasia, associated with microcalcifications - intraductal papilloma - hyalinized and calcified fibroadenoma - biopsy site change 2. breast, right 1:00, segmental excision with needle localization: - fibrocystic change, proliferative type, including atypical ductal hyperplasia, associated with microcalcifications - intraductal papilloma - biopsy site change comment: there are scattered foci in parts 1 and 2 showing a micropapillary and cribriform ductal proliferation, which fall short of the criteria for ductal carcinoma in situ.',
  'entities': [{'label': 'I-position', 'span': (17, 21), 'text': '9:00'},
   {'label': 'I-position', 'span': (287, 291), 'text': '1:00'}]},
 {'sentence': 'multiple levels have been examined.', 'entities': []},
 {'sentence': 'xxxx has reviewed the case and concurs with the diagnosis of atypical ductal hyper

### Example outputs

In [8]:
import pickle
with open('example_outputs/example_diagnosis.negation_1.entities.pickle', 'rb') as f:
    example_output = pickle.load(f)
    
example_output

[{'sentence': '1. breast, right 9:00, segmental excision with needle localization: - fibrocystic change, proliferative type, including atypical ductal hyperplasia, associated with microcalcifications - intraductal papilloma - hyalinized and calcified fibroadenoma - biopsy site change 2. breast, right 1:00, segmental excision with needle localization: - fibrocystic change, proliferative type, including atypical ductal hyperplasia, associated with microcalcifications - intraductal papilloma - biopsy site change comment: there are scattered foci in parts 1 and 2 showing a micropapillary and cribriform ductal proliferation, which fall short of the criteria for ductal carcinoma in situ.',
  'entities': [{'label': 'position', 'span': (17, 21), 'text': '9:00'},
   {'label': 'position', 'span': (287, 291), 'text': '1:00'}]},
 {'sentence': 'multiple levels have been examined.', 'entities': []},
 {'sentence': 'xxxx has reviewed the case and concurs with the diagnosis of atypical ductal hyperplas
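
For downstream analysis it can be convenient to flatten this nested output into one row per entity. The sketch below is one way to do that in plain Python. `flatten_entities` is a hypothetical helper, and stripping the `I-`/`B-` prefix rests on our assumption that these are BIO tagging prefixes, which would explain why the live output shows `I-position` while the pickled output shows `position`. The sample entities are copied from the misspelling report output, with sentence text left out for brevity.

```python
def flatten_entities(results):
    """Turn extractor output into one row per entity, dropping any BIO prefix."""
    rows = []
    for sentence_index, result in enumerate(results):
        for entity in result["entities"]:
            label = entity["label"]
            # Assumption: "I-" and "B-" are BIO tagging prefixes and safe to strip.
            if label[:2] in ("I-", "B-"):
                label = label[2:]
            start, end = entity["span"]
            rows.append({
                "sentence_index": sentence_index,
                "label": label,
                "start": start,
                "end": end,
                "text": entity["text"],
            })
    return rows


sample = [
    {"entities": [{"label": "I-position", "span": (21, 29), "text": "axillary"}]},
    {"entities": [
        {"label": "I-cancer_subtype", "span": (100, 124), "text": "invasive ductal carinoma"},
        {"label": "I-cancer_grade", "span": (148, 167), "text": "well differentiated"},
    ]},
    {"entities": []},
]

rows = flatten_entities(sample)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` if a tabular view is preferred.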