# Reading PDF Documents as Text Files
This notebook compares several packages for extracting text from the PDF reports (ultrasound, surgical, and pathology).

In [22]:
import os
from pdfminer.high_level import extract_text
import pytesseract
import PyPDF2
from pdf2image import convert_from_path
from PIL import Image  # needed below to reopen the saved PNG pages


cwd = os.getcwd()
pathology_file_example_path = os.path.join(cwd, "..", "data", "raw_data", "pdf", "pathology", "51 P.pdf")
surgical_file_example_path = os.path.join(cwd, "..", "data", "raw_data", "pdf", "surgical", "51 O.pdf")
ultrasound_file_example_path = os.path.join(cwd, "..", "data", "raw_data", "pdf", "ultrasound", "51 U.pdf")

## Ultrasound Reports
This case is straightforward: `pdfminer` extracts the text directly.

In [23]:
print(extract_text(ultrasound_file_example_path))

Centre Universitaire de Santé McGill
Imagerie Médicale

McGill University Health Centre
Medical Imaging

m/Name:
Accession #:

Med Ref/Req MD:
Autre Med/Other MD:

Stewart, Jessica,

Dossier/MRN:
Location/Service:

Sexe/Sex:
DDN/DOB:
RAMQ:
Org:
Rapport/Report:

Examen / Exam

 

US US ABDOMEN/PELVIS- APPENDICITIS -AB

Date d'examen / Exam Date
March 12, 2014 16:14

RENSEIGNEMENT CLINIQUE / CLINICAL INFORMATION:

Right lower quadrant pain.  Query appendicitis.

PROTOCOLE RADIOLOGIQUE / RADIOLOGIST'S REPORT:

ULTRASOUND ABDOMEN AND PELVIS

The liver has slightly increased periportal echoes.  Gallbladder and biliary tree within normal
limits.  Visualized part of the pancreas within normal limits.

The spleen has a normal appearance and measures 12.5cm.

The right kidney measures 12.5cm and the left kidney also measures 12.5cm.  No evidence of
hydronephrosis.  The parenchyma of both kidneys is unremarkable.

The bladder is partially filled.

In the right lower quadrant there is a blind end

## Surgical Reports
Here `pdfminer` does not work as well: it returns only the user, page number, and timestamp stamped on each page.

In [24]:
# Using pdfminer
print(extract_text(surgical_file_example_path))

User: 0870530 

 1 

 2021-12-06 13:39:05

User: 0870530 

 2 

 2021-12-06 13:39:05




Try `pytesseract`, which is difficult to install. Passing the PDF to it directly fails, because Tesseract does not support PDF input.

In [25]:
# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr(surgical_file_example_path, extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

TesseractError: (1, 'Tesseract Open Source OCR Engine v4.1.1 with Leptonica Error in pixReadStream: Pdf reading is not supported Error in pixRead: pix not read Error during processing.')

Try `PyPDF2`. The result is the same as with `pdfminer`.

In [26]:
# Note: this is the PyPDF2 < 3.0 API; version 3.0 renamed these calls to
# PdfReader, len(reader.pages), reader.pages[0], and extract_text().
# Open the PDF in binary read mode
pdffileobj = open(surgical_file_example_path, 'rb')
# Create a reader for the open file
pdfreader = PyPDF2.PdfFileReader(pdffileobj)
# Number of pages in the PDF (not used below)
x = pdfreader.numPages
# Select the first page (page indices start at 0)
pageobj = pdfreader.getPage(0)
# Extract the text of that page
text = pageobj.extractText()
print(text)  # does no better than pdfminer's extract_text

User: 0870530  1  2021-12-06 13:39:05



Try the command-line tool `pdftotext`. The result is the same. I think a plain PDF-to-text method cannot recover the text here, because the file appears to contain two overlapping PDFs.

In [27]:
!pdftotext '/home/c_spino/research/NLP_ultrasound_report/data/raw_data/pdf/surgical/51 O.pdf' -

User: 0870530

1

2021-12-06 13:39:05

User: 0870530

2

2021-12-06 13:39:05
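
The `pdfminer` and `pdftotext` outputs above contain nothing but stamp lines (user, page number, timestamp). A minimal sketch of how that situation could be detected automatically is shown here. The function name `is_stamp_only` and the three patterns are assumptions based only on this one file, and the check expects one stamp per line, so it would not match the single-line `PyPDF2` output.

```python
import re

# Patterns for the stamp lines seen in the outputs above. This is an
# assumption drawn from one file; other reports may need other patterns.
STAMP_PATTERNS = [
    re.compile(r"User:\s*\d+"),                          # e.g. "User: 0870530"
    re.compile(r"\d+"),                                  # bare page number
    re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"),  # timestamp
]


def is_stamp_only(text):
    """Return True if every non-empty line of `text` is a stamp line."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines and page breaks
        if not any(p.fullmatch(line) for p in STAMP_PATTERNS):
            return False  # found a line that is real content
    return True
```

A file for which this returns `True` is a candidate for an OCR-based approach instead of direct text extraction.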



Convert the PDF to PNG images with `pdf2image` first, then convert each image to text with `pytesseract`. This seems to work! It is only slightly trickier because `convert_from_path` returns one image per page, so we need to iterate over them.

In [45]:
images = convert_from_path(surgical_file_example_path)  # one PIL image per page
for page_number, image in enumerate(images, start=1):
    # Include the page number in the file name so pages do not overwrite each other
    png_path = f'/home/c_spino/research/NLP_ultrasound_report/data/raw_data/png/surgical/51 O_{page_number}.png'
    image.save(png_path, 'PNG')
    img = Image.open(png_path)
    print(img)
    print(pytesseract.image_to_string(image))  # Tesseract is a Google-backed OCR engine
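
The loop above could be wrapped in a small helper that returns the text of all pages as one string. The following is only a sketch: the name `ocr_pdf` and the two optional hook parameters are hypothetical, added so the helper can be tried with stand-ins on a machine where Tesseract or Poppler is not installed. By default it uses the same two calls as the cell above, `convert_from_path` and `pytesseract.image_to_string`.

```python
def ocr_pdf(pdf_path, to_images=None, image_to_text=None):
    """OCR every page of a PDF and return the page texts joined by newlines.

    `to_images` and `image_to_text` are hypothetical hooks for this sketch.
    When omitted, they default to pdf2image.convert_from_path and
    pytesseract.image_to_string.
    """
    if to_images is None:
        from pdf2image import convert_from_path  # imported lazily
        to_images = convert_from_path
    if image_to_text is None:
        import pytesseract  # imported lazily
        image_to_text = pytesseract.image_to_string

    page_texts = []
    for image in to_images(pdf_path):  # one image per page
        page_texts.append(image_to_text(image))
    return "\n".join(page_texts)
```

With the real libraries installed, `ocr_pdf(surgical_file_example_path)` would OCR the whole report without writing intermediate PNG files to disk.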

## Pathology
Try the same approach as for the surgical reports.
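
The ultrasound report could be read directly, while the surgical report needed OCR. One option for handling a mixed set of files is to try direct extraction first and fall back to OCR when it yields too little text. The sketch below illustrates that idea. The name `extract_with_fallback`, its parameters, and the `min_chars=200` threshold are all assumptions, not something tested in this notebook.

```python
def extract_with_fallback(pdf_path, direct_extract, ocr_extract, min_chars=200):
    """Return (text, method), preferring direct extraction over OCR.

    `direct_extract` and `ocr_extract` are callables that take a PDF path
    and return a string. `min_chars` is an arbitrary threshold (an
    assumption) below which the direct result is treated as unusable.
    """
    text = direct_extract(pdf_path)
    if len(text.strip()) >= min_chars:
        return text, "direct"
    # Too little text came back, so use the slower OCR route instead
    return ocr_extract(pdf_path), "ocr"
```

Here `direct_extract` could be `pdfminer`'s `extract_text`, and `ocr_extract` any function that runs `pdf2image` and `pytesseract` over the pages.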

In [43]:
images = convert_from_path(pathology_file_example_path)  # one PIL image per page
for page_number, image in enumerate(images, start=1):
    # Include the page number in the file name so pages do not overwrite each other
    png_path = f'/home/c_spino/research/NLP_ultrasound_report/data/raw_data/png/pathology/51 P_{page_number}.png'
    image.save(png_path, 'PNG')
    img = Image.open(png_path)
    print(img)
    print(pytesseract.image_to_string(image))  # Tesseract is a Google-backed OCR engine

<PIL.PngImagePlugin.PngImageFile image mode=RGB size=1700x2200 at 0x7F79C68EB1C0>
A

Montreal Children’s Hospital

Centre universitaire de santé McGill

McGill University Health Centre 2300 rue Tupper
> : Montreal Quebec H3H 1P3
Département de Pathologie - Pathology Department HME/MCH (514) 412-4495 HGM/ MGH (514) 934-1934 poste 42819

HRV/ RVH (514) 398-7174

 

Autre nom:
See Dossier CUSM:
Medecin: Dr. Baird, Robert DDN:

RAMQ / Carte santé:
CONFIDENTIEL / CONFIDENTIAL Téléphone:
Copie a:

 

SURGICAL PATHOLOGY REPORT

Collected: 2014-Mar-12 Case Number: =a
Received: 2014-Mar-13 12:28

Reported: 2014-Mar-18 16:40

CLINICAL INFORMATION
Appendicitis, no perforation?
? gangrene at appendix

SPECIMEN
APPENDIX

GROSS DESCRIPTION

Specimen is received in formalin in one container, labelled with the patient's name and designated "APPENDIX". The
specimen consists of a vermiform appendix and mesoappendix, measuring 7.2 cm in length by 0.7 cm at the proximal
end and 1.3 cm in diameter at the d