## A short notebook demonstrating how OCR'd PDFs can be searched for string matches

### Pipeline:
1. Given a folder with many PDFs, convert all pages of all PDFs into PNGs.
2. Given a folder with many PNGs, extract text using Tesseract.
3. Use string matching to find term(s) of interest in collection.

### Tools:
* Ghostscript -- to convert PDFs to images so that they can be OCR'd
* Tesseract -- to extract text from images
* Python -- glue to run Ghostscript and Tesseract, perform image manipulations, and compile results

#### The main features of this notebook (converting PDFs to PNGs and OCRing the PNGs to extract text) are not done by Python. They are done by the command line tools Ghostscript and Tesseract, which need to be installed before this notebook will do anything.

Starting points for installing those tools:
1. Tesseract: https://github.com/UB-Mannheim/tesseract/wiki -- a full installer for Windows, but hosted outside the US. Domestic installers for older versions of Tesseract may be found here: https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows

2. Ghostscript: https://www.ghostscript.com/download/gsdnld.html

#### Ghostscript and Tesseract were both installed on a Windows 10 laptop when creating this demo. The tools were installed in subfolders of this project's folder. When they are called through Python's os module in the code below, the relative file path to each executable is given.
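
Because both tools are called through relative paths, checking that the executables exist up front can save a confusing failure later. The following is a minimal sketch, assuming the same relative paths that the cells below use; `find_missing_tools` is a hypothetical helper name, not part of either tool.

```python
import os

# Relative paths as used later in this notebook; adjust to your own install locations.
TOOL_PATHS = {
    'ghostscript': os.path.join('GhostScript', 'gs9.27', 'bin', 'gswin64c.exe'),
    'tesseract': os.path.join('Tesseract', 'tesseract.exe'),
}

def find_missing_tools(tool_paths):
    """Return the names of tools whose executable file does not exist."""
    return [name for name, path in tool_paths.items() if not os.path.isfile(path)]

missing = find_missing_tools(TOOL_PATHS)
if missing:
    print('Not found (check install location):', ', '.join(missing))
else:
    print('Both tools found.')
```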

In [1]:
import os #for navigating folder structures and running command line tools ghostscript and tesseract
import matplotlib.pyplot as plt #for visualizing steps in this notebook
from PIL import Image #for visualizing results
import tempfile
import numpy as np #for some image manipulation
import pandas as pd #to make an Excel-like dataframe of the results

Because this process can create a lot of intermediate files, create a temp directory for them to make cleanup easier after completion.

In [29]:
tempDir=tempfile.mkdtemp(dir=os.getcwd())
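
If the whole pipeline runs in a single block, `tempfile.TemporaryDirectory` is an alternative that removes the directory and its contents automatically. This is only a sketch and not what the rest of the notebook uses; the file name is a placeholder, and `dir=os.getcwd()` could be passed as in the cell above.

```python
import os
import tempfile

# The directory and everything inside it are deleted when the 'with' block exits.
with tempfile.TemporaryDirectory() as scratch:
    example_file = os.path.join(scratch, 'example.txt')  # hypothetical intermediate file
    with open(example_file, 'w') as f:
        f.write('intermediate result')
    existed_inside = os.path.isfile(example_file)

print(existed_inside)           # True
print(os.path.exists(scratch))  # False
```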

## 1. Convert collection of PDFs into images:

In [30]:
targetFolder='examplePDFs' #change this filepath to target a different collection of PDFs

In [31]:
for pdf in os.listdir(targetFolder):
    print(pdf)

claim-form-cms-1500.pdf
f1040.pdf


In [32]:
#save location:
PDFImages=os.path.join(tempDir, 'PDFImages')
try:
    os.mkdir(PDFImages)
except FileExistsError:
    print('directory already exists. Cleaning out for next data run')
    for file in os.listdir(PDFImages):
        os.remove(os.path.join(PDFImages, file))

In [33]:
for pdf in os.listdir(targetFolder):
    targetPDF=os.path.join(targetFolder, pdf)
    #add page number for multipage PDFs when creating images.
    imageSave=os.path.join(PDFImages, os.path.splitext(pdf)[0]+'-%03d.png')
    p = os.popen("GhostScript\\gs9.27\\bin\\gswin64c.exe -sDEVICE=pngalpha -sOutputFile=%s -dBATCH -dNOPAUSE -r288 %s"%(imageSave, targetPDF))
    print(p.read()) #read() also waits until Ghostscript has finished this PDF

GPL Ghostscript 9.27 (2019-04-04)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 2.
Page 1
Page 2

GPL Ghostscript 9.27 (2019-04-04)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 2.
Page 1
Page 2



In [34]:
os.listdir(PDFImages)

['claim-form-cms-1500-001.png',
 'claim-form-cms-1500-002.png',
 'f1040-001.png',
 'f1040-002.png']
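
The Ghostscript call above builds one command string, which breaks if a path contains spaces. One hedged alternative is to pass the arguments as a list to `subprocess.run`, which also waits for the conversion to finish. The sketch below only builds the argument list; `build_gs_command` is a hypothetical helper, and the executable path and flags are copied from the cell above.

```python
import os

GS_EXE = os.path.join('GhostScript', 'gs9.27', 'bin', 'gswin64c.exe')  # path used in this notebook

def build_gs_command(pdf_path, out_dir, dpi=288, gs_exe=GS_EXE):
    """Build the Ghostscript argument list that renders each page of pdf_path to a PNG."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    # %03d is filled in by Ghostscript with the page number (001, 002, ...)
    out_pattern = os.path.join(out_dir, stem + '-%03d.png')
    return [gs_exe, '-sDEVICE=pngalpha', '-sOutputFile=' + out_pattern,
            '-dBATCH', '-dNOPAUSE', '-r%d' % dpi, pdf_path]

cmd = build_gs_command(os.path.join('examplePDFs', 'f1040.pdf'), 'PDFImages')
print(cmd)
```

Passing `cmd` to `subprocess.run(cmd, capture_output=True, text=True)` would run the conversion and return once Ghostscript exits.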

## 2. OCR all files

In [35]:
#save location:
PDFText=os.path.join(tempDir, 'PDFText')
try:
    os.mkdir(PDFText)
except FileExistsError:
    print('directory already exists. Cleaning out for next data run')
    for file in os.listdir(PDFText):
        os.remove(os.path.join(PDFText, file))

In [36]:
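for image in os.listdir(PDFImages):
    tesseractTarget=os.path.join(PDFImages, image)
    #output base name; tesseract appends .txt itself
    tesseractOutput=os.path.join(PDFText, os.path.splitext(image)[0])
    p = os.popen('Tesseract\\tesseract.exe %s %s -l eng'%(tesseractTarget, tesseractOutput))
    p.read() #wait for tesseract to finish so the text file is complete before the next step
    p.close()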
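
For reference, the Tesseract command line takes an input image, an output base name, and optional flags such as `-l eng`; Tesseract adds the `.txt` extension to the base name on its own. The sketch below builds that command as an argument list and works out which text file to expect. `build_tesseract_command` is a hypothetical helper, and the executable path is the one used in the cell above.

```python
import os

TESSERACT_EXE = os.path.join('Tesseract', 'tesseract.exe')  # path used in this notebook

def build_tesseract_command(image_path, text_dir, lang='eng', exe=TESSERACT_EXE):
    """Return (command, expected_txt_path) for OCRing one image."""
    stem = os.path.splitext(os.path.basename(image_path))[0]
    out_base = os.path.join(text_dir, stem)
    # Tesseract appends '.txt' to the output base on its own.
    return [exe, image_path, out_base, '-l', lang], out_base + '.txt'

cmd, txt_path = build_tesseract_command(os.path.join('PDFImages', 'f1040-001.png'), 'PDFText')
print(cmd)
print(txt_path)
```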

## 3. Find files that contain the term of interest

In [37]:
termOfInterest='medicaid'

In [62]:
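for file in os.listdir(PDFText):
    #Tesseract writes UTF-8; latin-1 never fails to decode but shows non-ASCII characters as mojibake
    with open(os.path.join(PDFText,file),encoding='latin-1') as f:
        text=f.read()
    lowered=text.lower()
    term=termOfInterest.lower()
    if term in lowered:
        print(file)
        i=lowered.find(term)
        while i!=-1:
            stepCount=i+1 #reported position is one past the start of the match
            print(stepCount)
            #max() keeps the window start from going negative, which would wrap around to the end of the text
            context=text[max(stepCount-100,0):stepCount+100]
            print('\n'.join([elm for elm in context.split('\n') if len(elm.strip())>0]))
            print('--------------')
            i=lowered.find(term, stepCount) #continue searching after this match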

claim-form-cms-1500-001.txt
178
FORM CLAIM COMMITTEE (NUCC) 0242
ama PICA | |]
_ MEDICARE MEDICAID TRICARE CHAMPVA OTHER] 1a. INSURED'S LD. NUMBER (For Program in item 4)
HEALTH PLAN Ek
isch
--------------
claim-form-cms-1500-002.txt
7887
rotection Act of 1988â, permits the government to verify information by way of computer matches.
MEDICAID PAYMENTS (PROVIDER CERTIFICATION)
| hereby agree to keep such records as are necessary to d
--------------
8312
Human Services may request.
| further agree to accept, as payment in full, the amount paid by the Medicaid program for those claims submitted for payment under that program, with the exception
of aut
--------------
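
The same search can also be written with the `re` module so that each match becomes a row of data instead of printed text. This is a minimal sketch, not the method used above: `find_term` is a hypothetical helper, positions are zero-based match starts, and the sample string is made up from words in the output above.

```python
import re

def find_term(text, term, window=100):
    """Return a list of dicts, one per case-insensitive match of term in text."""
    rows = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - window, 0)  # clamp so the slice never wraps around
        end = match.end() + window
        rows.append({'position': match.start(), 'context': text[start:end]})
    return rows

sample = 'MEDICARE MEDICAID TRICARE ... paid by the Medicaid program'
for row in find_term(sample, 'medicaid', window=10):
    print(row['position'], repr(row['context']))
```

A list of dicts like this can be passed directly to `pd.DataFrame(rows)` to get a table of results.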


## Cleanup

In [64]:
for folder in ['PDFImages','PDFText']:
    try:
        for file in os.listdir(os.path.join(tempDir, folder)):
            os.remove(os.path.join(tempDir, folder,file))
        os.rmdir(os.path.join(tempDir, folder))
    except OSError: #e.g. folder missing, file still in use, or folder not empty
        print('Something went wrong with ',folder)
os.rmdir(tempDir)

Something went wrong with  PDFText
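
A shorter way to clean up is `shutil.rmtree`, which removes a directory together with its subfolders and files in one call. The sketch below uses a stand-in directory (`demo_dir`, a hypothetical name) rather than the notebook's `tempDir`, so it can be run on its own.

```python
import os
import shutil
import tempfile

# Stand-in for tempDir with the two subfolders used in this notebook.
demo_dir = tempfile.mkdtemp()
for folder in ['PDFImages', 'PDFText']:
    os.mkdir(os.path.join(demo_dir, folder))
    with open(os.path.join(demo_dir, folder, 'example.txt'), 'w') as f:
        f.write('placeholder')

# rmtree removes the directory, its subfolders, and all files in one call.
shutil.rmtree(demo_dir)
print(os.path.exists(demo_dir))  # False
```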
