## 0. Setup
*Import the model, analysis module, and logging. All code is original and uses open-source tools.*

In [1]:
from models_config import stanford_model as model
from log_analysis import first_pass, second_pass

%reload_ext autoreload
%autoreload 2
%matplotlib inline

Device set to use cpu


## 1. OCR
*Handles PDFs, images, and DICOM files using the Tesseract, pydicom, and pdf2image libraries. OCR is a key component of the pipeline because a text model cannot tag text that never reaches its input. The OCR must therefore be highly accurate and versatile, able to handle noisy images and structured data (such as tables).*
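The multi-format handling described above can be sketched as a small dispatcher that turns any supported input into a list of page images before OCR. This is a minimal sketch, not the contents of `tesseract_test.py`: the names `SUFFIX_TO_KIND`, `input_kind`, `load_pages`, and `ocr_file` are hypothetical, the suffix list is an assumption, and the DICOM branch assumes a single-frame grayscale image.

```python
from pathlib import Path

# Hypothetical suffix table; the real script may support a different set.
SUFFIX_TO_KIND = {
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tif": "image", ".tiff": "image",
    ".dcm": "dicom",
}


def input_kind(path):
    """Classify a file as 'pdf', 'image', or 'dicom' from its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUFFIX_TO_KIND:
        raise ValueError(f"Unsupported input type: {suffix!r}")
    return SUFFIX_TO_KIND[suffix]


def load_pages(path):
    """Return a list of PIL images, one per page.

    Third-party imports are done lazily so that only the libraries
    needed for the given file type have to be installed.
    """
    kind = input_kind(path)
    if kind == "pdf":
        from pdf2image import convert_from_path
        return convert_from_path(path, dpi=300)
    if kind == "image":
        from PIL import Image
        return [Image.open(path)]
    # DICOM: assumes a single-frame grayscale image (an assumption of this sketch).
    import pydicom
    from PIL import Image
    arr = pydicom.dcmread(path).pixel_array.astype("float32")
    span = max(float(arr.max() - arr.min()), 1.0)
    arr = (255 * (arr - arr.min()) / span).astype("uint8")  # rescale to 8-bit
    return [Image.fromarray(arr)]


def ocr_file(path):
    """Run Tesseract on every page and return one string per page."""
    import pytesseract
    return [pytesseract.image_to_string(page) for page in load_pages(path)]
```

Converting everything to images first keeps a single OCR code path, at the cost of discarding any embedded text layer a PDF might already have.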

In [6]:
!python tesseract_test.py data/sample_pdf.pdf

Saved JSON to ocr_output/sample_pdf/sample_pdf_page1_ocr.json and text to ocr_output/sample_pdf/sample_pdf_page1.txt
Saved JSON to ocr_output/sample_pdf/sample_pdf_page2_ocr.json and text to ocr_output/sample_pdf/sample_pdf_page2.txt
Saved JSON to ocr_output/sample_pdf/sample_pdf_page3_ocr.json and text to ocr_output/sample_pdf/sample_pdf_page3.txt
Saved JSON to ocr_output/sample_pdf/sample_pdf_page4_ocr.json and text to ocr_output/sample_pdf/sample_pdf_page4.txt
Saved JSON to ocr_output/sample_pdf/sample_pdf_page5_ocr.json and text to ocr_output/sample_pdf/sample_pdf_page5.txt
Saved JSON to ocr_output/sample_pdf/sample_pdf_page6_ocr.json and text to ocr_output/sample_pdf/sample_pdf_page6.txt
Outputs saved in ocr_output/sample_pdf


In [2]:
# Concatenate the per-page OCR text files (pages 1-6) into one string.
text = ""
for i in range(1, 7):
    with open(f"ocr_output/sample_pdf/sample_pdf_page{i}.txt", "r") as f:
        text += f.read() + "\n"

print(text)

UNIVERSITY MEDICAL CTR
Department: Neurology / Internal Med
Document Type: ED NOTE + CONSULT + DISCHARGE
Generated: 2024.03.22 07:14 AM
Printed by: J.Nguyen (unit clerk) ext 4021
PATIENT INFORMATION
Name: Jenniferr K. Lee
AKA: Jenny Lee/ J. Lee
DOB: 5-6-94
Age: 29 y/o
Sex: F
Preferred Language: English
Race/ethnicity (self-reported): Korean-American
MRN: 00077219
Acct#: A-77821
Encounter ID: 44-92-77 1
Home Address:
455 San Mateo Ave
Apt# 3B
Redwood City, CA 94063
CURRENT ADDRESS (per patient, moved recently):
4127 W. Elmst Apt #3B
Springfeld, IL 62704
Phone: 312-555-7712
Alt/cell: (312) 555 77 13
Email: jlee94 @ yahoo.com
Emergency contact:
Marry Smith (spouse) 217.555.0198
relationship: wife
Secondary contact:
Anne McKinly (sister) 773.555.8891
Primary care provider (outside):
Dr. S. Patel, MD
Northside Pulmonology
Fax: 847 555 4021
CHIEF COMPLAINT
"migraine flare" + dizziness + chest tightness intermittently
HISTORY OF PRESENT ILLNESS
Pt is a 29 yo F presenting on 03/21/2024 with he

## 2. PII Detection
*We chose Microsoft Presidio and Stanford Research's Clinical DEID transformer model. We felt these tools best matched the resources John Snow Labs provided, with similar accuracy and greater accessibility. When using our pipeline, the user can review the file to make sure all necessary PHI is removed. We felt that the possibility of noisy images leading to false negatives posed too high a risk to leave the human out of the loop.*

# First Pass
*The first pass through the model attempts to tag every PII entity. If the tagging is perfect, nothing more is needed. If not, the user has a chance to make manual changes.*
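To make the tagging step concrete, here is a minimal sketch of how detected character spans could be turned into `<LABEL>` placeholders. It is not the code inside `first_pass`: the helper names `find_span` and `tag_spans` are hypothetical, and the spans are supplied by hand rather than by Presidio or the transformer model.

```python
def find_span(text, needle, label):
    """Locate the first occurrence of `needle` and return (start, end, label)."""
    start = text.index(needle)
    return (start, start + len(needle), label)


def tag_spans(text, spans):
    """Replace each (start, end, label) span with '<label>'.

    Spans are processed left to right. When two spans overlap, the one that
    starts first (and, on a tie, the longer one) wins and the other is skipped.
    """
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    pieces, cursor = [], 0
    for start, end, label in ordered:
        if start < cursor:
            continue  # overlaps a span that was already replaced
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Name: Jenniferr K. Lee\nDOB: 5-6-94"
spans = [
    find_span(sample, "Jenniferr K. Lee", "PERSON"),
    find_span(sample, "5-6-94", "DATE_TIME"),
]
print(tag_spans(sample, spans))
```

Building the output from slices, rather than calling `str.replace`, keeps every replacement tied to its exact character offsets, so a short string such as `IL` is not replaced everywhere it happens to appear.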

In [5]:
first_pass_result, tagged_p1, pass_idx = first_pass(model, text, doc_id=1, case="sample")
print(first_pass_result, tagged_p1)

{}
<LOCATION>
Department: Neurology / Internal Med
Document Type: ED NOTE + CONSULT + DISCHARGE
Generated: <DATE_TIME> 07:14 AM
Printed by: <PERSON> (unit clerk) ext <PHONE_NUMBER>
PATIENT INFORMATION
Name: <PERSON>
AKA: <PERSON>
DOB: <DATE_TIME>
<AGE>
Sex: F
Preferred Language: English
Race/ethnicity (self-reported): Korean-American
<MRN>
Acct#: <ID>
Encounter ID: <ID>
Home Address:
<LOCATION>
Apt# 3B
<LOCATION> <ZIPCODE>
CURRENT ADDRESS (per patient, moved recently):
<LOCATION> Apt #3B
<LOCATION> <PHONE_NUMBER>
Phone: <PHONE_NUMBER>
Alt/cell: <PHONE_NUMBER>
Email: jlee94 @ <ORGANIZATION>
Emergency contact:
<PERSON> (spouse) <PHONE_NUMBER>
relationship: wife
Secondary contact:
<PERSON> (sister) <PHONE_NUMBER>
Primary care provider (outside):
<PERSON>
<LOCATION> Pulmonology
Fax: <PHONE_NUMBER>
CHIEF COMPLAINT
"migraine flare" + dizziness + chest tightness intermittently
HISTORY OF PRESENT ILLNESS
Pt is a <AGE> F presenting on <DATE_TIME> with headache for 2 days.
Pain begins behind R e

Department: Neurology / Internal Med
Document Type: ED NOTE + CONSULT + DISCH...'


# Second Pass
*The user can now identify the entities the model missed (false negatives) and the entities the model should not have tagged (false positives).*

In [8]:
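# Strings the first pass missed that must still be redacted (false negatives).
deny_list = ["jlee94", "07:14 AM", "22:10"]
# Strings the first pass tagged that should stay in the text (false positives).
allow_list = ["29 y/o", "29 yo"]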

second_pass_results, tagged_p2, pass_idx = second_pass(model, text, case="sample", doc_id=pass_idx, allow_list=allow_list, deny_list=deny_list)

print(second_pass_results)

[{'J. Nguyen PA-C', 'J.Nguyen'}, {'Dr. S. Patel', 'S. Patel, MD', 'Dr. S. Patel, MD'}, {'Jenniferr K. Lee', 'J. Lee', 'Jenny Lee/'}, {'Rosa Ramirez'}, {'Anne McKinly'}, {'Rebecca Wong MD'}, {'Marry Smith'}]
{'217.555.0198': '986.683.3245', 'yahoo.com': '*********', 'jlee94': '******', '847 555 4021': '855.955.7347', 'S. Patel, MD': 'Micah', 'Northside': 'Lauraside', 'Dr. S. Patel': 'Micah', '3-22-24': '05-05-24', 'FMLA': 'Markton', '312.555.7766': '246.381.7870', 'South Loop Immediate': 'Thompsonshire', 'J. Nguyen PA-C': 'Carter', 'Rebecca Wong MD': 'Blake', '03/21/24': '05/04/24', '2024-02-17': '2024-04-01', '2024-03-22': '2024-05-05', '22:10': '*****', '2024.03.21': '01-22-05', 'L-998211': '********', 'Anne McKinly': 'Avery', '773.555.4410': '731.500.9806', 'Rosa Ramirez': 'Rory', 'IL': 'Brewerburgh', 'BlueCross of': '************', 'warehouse': '*********', 'McArthuer Shipping Co.': '**********************', '2010': '03-05-00', '1990s': '01-22-65', '2/28/24': '04/12/24', 'Redwood Ci

Department: Neurology / Internal Med
Document Type: ED NOTE + CONSULT + DISCH...'
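The effect of the allow and deny lists can be sketched as a filter over the first-pass spans: allow-listed strings are dropped from the detections, and every occurrence of a deny-listed string is added. This is an illustration under assumptions, not the body of `second_pass`; the function name `apply_lists` and the `CUSTOM` label are hypothetical.

```python
def apply_lists(text, spans, allow_list=(), deny_list=()):
    """Adjust detected (start, end, label) spans with user corrections.

    - Spans whose exact text is in `allow_list` are removed (false positives).
    - Every occurrence of a `deny_list` string is added (false negatives).
    """
    allowed = set(allow_list)
    kept = [s for s in spans if text[s[0]:s[1]] not in allowed]
    for term in deny_list:
        if not term:
            continue  # an empty string would match everywhere
        start = text.find(term)
        while start != -1:
            kept.append((start, start + len(term), "CUSTOM"))
            start = text.find(term, start + len(term))
    return sorted(kept)


sample = "Age: 29 y/o\nEmail: jlee94 @ yahoo.com"
age_start = sample.index("29 y/o")
detected = [(age_start, age_start + len("29 y/o"), "AGE")]

corrected = apply_lists(
    sample, detected, allow_list=["29 y/o"], deny_list=["jlee94"]
)
print(corrected)
```

Exact string matching is the simplest choice here. It will miss OCR variants of the same entity, which is one reason the human review step stays useful.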


## 3. Output Generation
*Now we need to map the replacement strings back to the entities and burn them onto the original PDF. The burning step works in isolation, but the mapping is still in progress. The main roadblock is that the OCR chunks its output by individual word, while the model identifies entities that can span multiple words. There are also aesthetic issues such as font, font size, and spacing.*
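One possible way around the word-versus-entity mismatch is to give every OCR word a character offset in the page text, then collect the words that overlap an entity's span and merge their boxes. The sketch below is a suggestion, not the pipeline's implementation. It assumes the OCR JSON stores `left`, `top`, `width`, and `height` per word, that the page text is the words joined by single spaces, and that the entity sits on one line. `Word`, `word_offsets`, and `box_for_span` are hypothetical names, and the coordinates are made up for illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    left: int
    top: int
    width: int
    height: int


def word_offsets(words, sep=" "):
    """Return (word, start, end) offsets, assuming text == sep.join(words)."""
    out, pos = [], 0
    for w in words:
        out.append((w, pos, pos + len(w.text)))
        pos += len(w.text) + len(sep)
    return out


def box_for_span(words, start, end):
    """Merge the boxes of all words overlapping [start, end).

    Returns (left, top, right, bottom), or None if no word overlaps.
    """
    hits = [w for w, s, e in word_offsets(words) if s < end and e > start]
    if not hits:
        return None
    return (
        min(w.left for w in hits),
        min(w.top for w in hits),
        max(w.left + w.width for w in hits),
        max(w.top + w.height for w in hits),
    )


# Made-up coordinates for the line "Name: Jenniferr K. Lee".
line = [
    Word("Name:", 10, 10, 50, 12),
    Word("Jenniferr", 70, 10, 80, 12),
    Word("K.", 155, 10, 18, 12),
    Word("Lee", 178, 10, 30, 12),
]
page_text = " ".join(w.text for w in line)
entity_start = page_text.index("Jenniferr K. Lee")
entity_box = box_for_span(line, entity_start, entity_start + len("Jenniferr K. Lee"))
print(entity_box)
```

An entity that wraps across lines would need one box per line instead of a single merged rectangle, so a fuller version would group the overlapping words by line before merging.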