# Cleaning surgical reports notebook
This notebook explores how best to extract the procedure text from the surgical reports.

In [None]:
import os
cwd = os.getcwd()
surgical_folder_text_path = os.path.join(cwd, "..", "data", "raw_data", "txt", "surgical")
# surgical_folder_text_path already starts from cwd, so it is not joined again
test_file_path = os.path.join(surgical_folder_text_path, "168 O.txt")
# Print a test file
with open(test_file_path, "r") as f:
    print(f.read())

Do all the text files have a procedure section that starts with `OPERATIVE PROCEDURE:` and ends with `Dictated`?

In [None]:
# Do all the text files have a procedure section that starts with "OPERATIVE PROCEDURE"?
# Collect the full path of every .txt report in the surgical folder
surgical_reports_path = [
    os.path.join(surgical_folder_text_path, filename)
    for filename in os.listdir(surgical_folder_text_path)
    if filename.endswith(".txt")
]
number_surgical_reports = len(surgical_reports_path)
count = 0
# The marker is matched case-insensitively and without the trailing colon
start_marker = "OPERATIVE PROCEDURE"
# end_marker = "Dictated"
# i is kept after the loop so the next cell can print the last report read
for i, surgical_report_path in enumerate(surgical_reports_path):
    with open(surgical_report_path, "r") as f:
        surgical_report_text = f.read()
    if start_marker.lower() in surgical_report_text.lower():
        count += 1
# Number of reports containing the marker, then the total number of reports
print(count, number_surgical_reports)

In [None]:
# Print the last report read by the loop above (i holds its final index)
with open(surgical_reports_path[i], "r") as f:
    print(f.read())

In [None]:
print(number_surgical_reports)
print(count)
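
If the markers turn out to be reliable, the procedure text could be pulled out with a regular expression. The following is only a minimal sketch: `extract_procedure` is a hypothetical helper, the `Dictated` end marker is taken from the question above and has not been verified against every report, and the sample string is made up for illustration.

```python
import re

START_MARKER = "OPERATIVE PROCEDURE"
END_MARKER = "Dictated"  # assumption: not yet verified on all reports


def extract_procedure(report_text, start_marker=START_MARKER, end_marker=END_MARKER):
    """Return the text between the two markers, or None if the start marker is missing.

    If the end marker is missing, everything after the start marker is returned.
    """
    pattern = re.compile(
        re.escape(start_marker)
        + r"\s*:?\s*(.*?)(?="
        + re.escape(end_marker)
        + r"|\Z)",
        flags=re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(report_text)
    if match is None:
        return None
    return match.group(1).strip()


# Made-up sample text, not taken from a real report
sample = "Header\nOperative Procedure: Laparoscopy performed.\nDictated by X"
print(extract_procedure(sample))
```

The match is case-insensitive, like the count above, and the colon after the start marker is optional, so reports that omit the colon are still handled.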

## Cropping for easier text extraction?
Can we crop the surgical report images first so that text extraction is easier? *Note: This assumes that the surgical reports are in `NLP_ultrasound_report/data/raw_data/png/surgical`*

In [None]:
import os
cwd = os.getcwd()
surgical_folder_png_path = os.path.join(cwd, "..", "data", "raw_data", "png", "surgical")
# surgical_folder_png_path already starts from cwd, so it is not joined again
test_file_path = os.path.join(surgical_folder_png_path, "180 O_0.png")

Try cropping each image and then extracting the text from the cropped region.

In [None]:
from PIL import Image
import time
import pytesseract

# This goes through each image in the surgical folder, crops it and extracts the text from it
for path in os.listdir(surgical_folder_png_path):
    # Only selects pngs that end with 0.png
    if path.endswith("0.png"):
        print(f"Cropping {path}")
        im = Image.open(os.path.join(surgical_folder_png_path, path))
        # Original image dimensions
        width, height = im.size
        # Cropping dimensions, play around with this
        right = width
        left = 0
        bottom = height - height/10
        top = height/2
        # Crop the image
        im_cropped = im.crop((left, top, right, bottom))
        # Show the cropped image
        im_cropped.show()
        # Get the extracted text
        extracted_text = pytesseract.image_to_string(im_cropped)
        print("Extracted text: ", extracted_text)
        # Wait 5 seconds before continuing
        time.sleep(5)

Cropping would make things much easier *if* we know which report type each id corresponds to. If we do, we can build a mapping from id to cropping rule. We therefore need to find the ideal crop size and a separate cropping rule for each type of report.
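
Such a cropping mapping could be sketched as follows. The type names `first_page` and `second_page` and the helper `crop_box` are hypothetical; the fractions simply restate the crop values tried in this notebook (top at 1/2 or 1/5 of the height, bottom at 9/10) and are not tuned.

```python
# Hypothetical report types mapped to fractional crop boxes:
# (left, top, right, bottom), each a fraction of the image width or height
CROP_RULES = {
    "first_page": (0.0, 0.5, 1.0, 0.9),
    "second_page": (0.0, 0.2, 1.0, 0.9),
}


def crop_box(size, report_type, rules=CROP_RULES):
    """Convert a fractional crop rule into a pixel box for an image of the given size."""
    width, height = size
    left, top, right, bottom = rules[report_type]
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


# Example with a made-up image size of 1000 x 2000 pixels
print(crop_box((1000, 2000), "first_page"))
```

The returned tuple can be passed directly to `im.crop(...)`, with `im.size` as the first argument to `crop_box`. Storing fractions instead of pixel values keeps one rule usable across scans of different resolutions.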

In [None]:
from PIL import Image
import time
import pytesseract

# Same process as above, but for the second page of each report,
# i.e. pngs whose name ends with 1.png
for path in os.listdir(surgical_folder_png_path):
    if path.endswith("1.png"):
        print(f"Cropping {path}")
        im = Image.open(os.path.join(surgical_folder_png_path, path))
        width, height = im.size
        right = width
        left = 0
        bottom = height - height/10
        top = height/5
        im_cropped = im.crop((left, top, right, bottom))
        im_cropped.show()
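        # Keep and print the extracted text instead of discarding it
        extracted_text = pytesseract.image_to_string(im_cropped)
        print("Extracted text: ", extracted_text)
        # Wait 5 seconds before continuing
        time.sleep(5)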

This appears promising. Next, run text extraction on the cropped images to check whether the text is actually easier to extract.