## 0. Setup
*Import the model, the analysis module, and logging. All code is original and built on open-source tools.*

In [None]:
from models_config import stanford_model as model
from log_analysis import first_pass, second_pass

%reload_ext autoreload
%autoreload 2
%matplotlib inline

## 1. OCR
*Handles PDFs, images, and DICOM files using the Tesseract, pydicom, and pdf2image libraries. OCR is a key component of the pipeline because a text model cannot tag text that never reaches its input. The OCR must therefore be highly accurate and versatile, able to handle both noisy images and structured data (like tables).*
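
A minimal sketch of how one entry point could route the three input types to Tesseract. This is an illustration, not the contents of `tesseract_test.py`: the extension map and the names `detect_kind`, `load_pages`, and `ocr_file` are hypothetical, the `pytesseract` wrapper is an assumption, and the DICOM branch assumes a single-frame grayscale image.

```python
from pathlib import Path

# Hypothetical extension map; the pipeline's real dispatch logic is not shown.
KIND_BY_SUFFIX = {
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tif": "image", ".tiff": "image",
    ".dcm": "dicom",
}

def detect_kind(path):
    """Classify a file as 'pdf', 'image', or 'dicom' by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in KIND_BY_SUFFIX:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return KIND_BY_SUFFIX[suffix]

def load_pages(path):
    """Return a list of PIL images, one per page."""
    kind = detect_kind(path)
    if kind == "pdf":
        from pdf2image import convert_from_path
        return convert_from_path(path)  # one PIL image per PDF page
    if kind == "image":
        from PIL import Image
        return [Image.open(path)]
    # DICOM (assumed single-frame grayscale): rescale pixels to 8-bit.
    import pydicom
    from PIL import Image
    pixels = pydicom.dcmread(path).pixel_array.astype("float64")
    span = pixels.max() - pixels.min()
    scaled = (pixels - pixels.min()) / (span if span else 1.0) * 255
    return [Image.fromarray(scaled.astype("uint8"))]

def ocr_file(path):
    """Run Tesseract on every page and join the results with newlines."""
    import pytesseract  # assumed wrapper around the Tesseract binary
    return "\n".join(pytesseract.image_to_string(p) for p in load_pages(path))
```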

In [None]:
# !python tesseract_test.py data/sample_pdf.pdf

In [None]:
# Concatenate the OCR output for pages 1-6 into one string.
text = ""
for i in range(1, 7):
    with open(f"ocr_output/sample_pdf/sample_pdf_page{i}.txt", "r") as f:
        text += f.read() + "\n"

print(text)

## 2. PII Detection
*We chose Microsoft Presidio and Stanford Research's Clinical DEID transformer model. We felt these tools best matched the resources John Snow Labs provides, with similar accuracy and greater accessibility. When using our pipeline, the user can look over the file to make sure all necessary PHI is removed. We felt that the possibility of noisy images leading to false negatives posed too high a risk to leave the human out of the loop.*

### First Pass
*The first pass through the model attempts to tag every PII entity. If every entity is tagged correctly, no further work is needed. If not, the user has a chance to make manual changes.*

In [None]:
first_pass_result, tagged_p1, pass_idx = first_pass(model, text, doc_id=1, case="sample")
print(first_pass_result, tagged_p1)

### Second Pass
*The user can now identify the entities the model missed (false negatives) and the entities the model should not have tagged (false positives).*

In [None]:
deny_list = ["jlee94", "07:14 AM", "22:10"]
allow_list = ["29 y/o", "29 yo"]

second_pass_results, tagged_p2, pass_idx = second_pass(model, text, case="sample", doc_id=pass_idx, allow_list=allow_list, deny_list=deny_list)

print(second_pass_results)
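
To make the role of the two lists concrete, here is a minimal sketch of how corrections like these could be applied to first-pass spans. It is not the implementation of `second_pass`. The function name `apply_lists`, the `(start, end, label)` span format, and the `DENY_LIST` label are all hypothetical. The sketch assumes `deny_list` holds strings that must always be tagged and `allow_list` holds strings that must never be tagged.

```python
import re

def apply_lists(text, spans, allow_list=(), deny_list=()):
    """Adjust (start, end, label) spans using user corrections.

    Spans whose text is in allow_list are dropped (false positives).
    Every occurrence of a deny_list string is added (false negatives),
    unless it overlaps a span that is already kept.
    """
    allowed = set(allow_list)
    kept = [s for s in spans if text[s[0]:s[1]] not in allowed]
    for term in deny_list:
        for m in re.finditer(re.escape(term), text):
            overlaps = any(m.start() < e and s < m.end() for s, e, _ in kept)
            if not overlaps:
                kept.append((m.start(), m.end(), "DENY_LIST"))
    return sorted(kept)

# Toy input: the first pass wrongly tagged the age and missed two entities.
sample = "Pt is 29 y/o, seen by jlee94 at 07:14 AM."
age_start = sample.index("29 y/o")
first_pass_spans = [(age_start, age_start + len("29 y/o"), "AGE")]

corrected = apply_lists(
    sample,
    first_pass_spans,
    allow_list=["29 y/o", "29 yo"],
    deny_list=["jlee94", "07:14 AM", "22:10"],
)
print([sample[s:e] for s, e, _ in corrected])  # ['jlee94', '07:14 AM']
```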

## 3. Output Generation
*Next, we need to map the replacement strings back to the entities and burn them onto the original PDF. The burning step works in isolation, but the mapping is still in progress. The main roadblock is that the OCR chunks text by individual word, while the model identifies entities that can span multiple words. There are also aesthetic issues such as font, font size, and spacing.*
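
One possible way around the word-versus-entity mismatch is sketched below. It assumes the model's input text is rebuilt by joining the OCR words with single spaces, so each word's character offsets are known. The `Word` class and the helpers `index_words`, `boxes_for_span`, and `union_box` are hypothetical names, not part of the pipeline. Merging boxes only makes sense for words on the same line, so an entity that wraps across lines would need one box per line.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    box: tuple  # (left, top, right, bottom) in pixels

def index_words(words):
    """Join words with single spaces and record each word's (start, end) offsets."""
    offsets, pos = [], 0
    for w in words:
        offsets.append((pos, pos + len(w.text)))
        pos += len(w.text) + 1  # +1 for the joining space
    return " ".join(w.text for w in words), offsets

def boxes_for_span(words, offsets, start, end):
    """Return the boxes of all words overlapping the character span [start, end)."""
    return [w.box for w, (s, e) in zip(words, offsets) if s < end and start < e]

def union_box(boxes):
    """Smallest box covering all the given boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))

# Toy OCR output for a single line of text.
ocr_words = [
    Word("Seen", (10, 10, 50, 22)),
    Word("by", (55, 10, 70, 22)),
    Word("John", (75, 10, 110, 22)),
    Word("Smith", (115, 10, 160, 22)),
    Word("today", (165, 10, 205, 22)),
]
joined, word_offsets = index_words(ocr_words)

# A two-word entity, as a model might report it by character offsets.
ent_start = joined.index("John Smith")
ent_end = ent_start + len("John Smith")
entity_box = union_box(boxes_for_span(ocr_words, word_offsets, ent_start, ent_end))
print(entity_box)  # (75, 10, 160, 22)
```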