# textract package POC

In [1]:
!pip3 install textract



##### OS-specific installation instructions can be found at https://textract.readthedocs.io/en/stable/installation.html

### Import the dependencies

In [2]:
import textract, time
from IPython.lib import backgroundjobs as bg

### Use two files to test the library: a "scanned" (image-based) PDF and a "text" PDF

In [3]:
SCANNED_PDF_PATH = '../../patents/US8426363.pdf'

In [4]:
TEXT_PDF_PATH = '../../patents/us8859741.pdf'

### Process both files with the default parser `pdftotext`

In [5]:
scanned_output = textract.process(SCANNED_PDF_PATH)
text_output = textract.process(TEXT_PDF_PATH)

In [6]:
scanned_str = scanned_output.decode("utf-8")
text_str = text_output.decode("utf-8")

### Inspect the parsed text when the pdf file is scanned

In [7]:
scanned_str[:10]

'\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c'

### As shown above, at least in this case, the text extracted from the scanned PDF consists only of form feed characters (`\x0c`), the page separators emitted by `pdftotext`, with no actual text. We can use the check below to switch to the OCR-based parser `tesseract`

In [8]:
if scanned_str.startswith('\x0c\x0c\x0c'):
    print('image-based pdf detected, need to use an ocr-based parser (tesseract)')

image-based pdf detected, need to use an ocr-based parser (tesseract)
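
The `startswith` check above depends on the document having at least three pages and would also fire for a text PDF that merely opens with three empty pages. A slightly more general variant, sketched here under the assumption that an image-based PDF yields nothing but whitespace and form feeds, tests whether any visible character survives. `looks_image_based` is a hypothetical helper name, not part of textract.

```python
def looks_image_based(text: str) -> bool:
    # Hypothetical helper (not a textract API).
    # str.strip() with no arguments removes all leading/trailing whitespace,
    # and Python counts the form feed character as whitespace, so a string
    # made only of page separators strips down to the empty string.
    return text.strip() == ''


# A string shaped like the scanned output above: page separators only
print(looks_image_based('\x0c' * 10))

# A string that contains real text after a page separator
print(looks_image_based('\x0cUnited States Patent'))
```

With this helper the condition in the cell above could be written as `if looks_image_based(scanned_str):`. It is still a heuristic: a PDF that mixes scanned and text pages would pass as text-based.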


### Re-process the file using the `tesseract` parser

In [9]:
jobs = bg.BackgroundJobManager()
status = 'running'
print('This process is going to take some time ...')
start = time.perf_counter()

def print_status(interval):
    # Runs in a background thread: report elapsed time until `status` changes
    while status == 'running':
        time.sleep(interval)
        now = time.perf_counter()
        print('time elapsed.. %d seconds' % (now - start))

    end = time.perf_counter()
    print('completed processing the file in %d seconds' % (end - start))


jobs.new(print_status, 10)
jobs.status()
try:
    # OCR the file with tesseract; this is the slow step
    scanned_output = textract.process(SCANNED_PDF_PATH, method='tesseract')
finally:
    # Always stop the progress reporter, even if the OCR step raises
    status = 'complete'
scanned_str = scanned_output.decode("utf-8")

This process is going to take some time ...
Running jobs:
0 : <function print_status at 0x10e2bb4c0>

time elapsed.. 10 seconds
time elapsed.. 20 seconds
time elapsed.. 30 seconds
time elapsed.. 40 seconds
time elapsed.. 50 seconds
time elapsed.. 60 seconds
time elapsed.. 70 seconds
time elapsed.. 80 seconds
time elapsed.. 90 seconds
time elapsed.. 100 seconds
time elapsed.. 110 seconds
time elapsed.. 120 seconds
time elapsed.. 130 seconds
time elapsed.. 140 seconds
time elapsed.. 150 seconds
time elapsed.. 160 seconds
time elapsed.. 170 seconds
time elapsed.. 180 seconds
time elapsed.. 190 seconds
time elapsed.. 200 seconds
time elapsed.. 210 seconds
time elapsed.. 220 seconds
time elapsed.. 230 seconds
time elapsed.. 240 seconds
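
Since OCR took around four minutes for this file, it seems worth running it only when the default parser returns nothing usable. The two steps above can be combined into one helper, sketched below. `extract_text` and its `process` parameter are hypothetical names introduced for illustration; `process` defaults to `textract.process` and exists only so the logic can be exercised with a stub.

```python
def extract_text(path, process=None):
    """Try the default parser first and fall back to OCR if it finds no text.

    Hypothetical helper, not part of textract. `process` must accept the
    same arguments used in this notebook: a path and an optional `method`.
    """
    if process is None:
        # Imported lazily so the helper can be tested without textract installed
        import textract
        process = textract.process

    text = process(path).decode('utf-8')
    if text.strip():
        # The default parser found visible characters: keep its output
        return text

    # Only whitespace / form feeds came back: assume an image-based PDF
    return process(path, method='tesseract').decode('utf-8')
```

Usage would then be `scanned_str = extract_text(SCANNED_PDF_PATH)` and `text_str = extract_text(TEXT_PDF_PATH)`, with only the scanned file paying the OCR cost.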


### Inspect the first 1000 characters of each string (scanned and text)

In [10]:
print(scanned_str[:1000])

a2) United States Patent
Liang et al.

 

 

 

 

US008426363B2
(10) Patent No.: US 8,426,363 B2
(45) Date of Patent: Apr. 23, 2013

 

(54) METHOD FOR REDUCING A LEVEL OF
LDL-CHOLESTEROL BY AN ANTIBODY
THAT SPECIFICALLY BINDS TO PCSK9

(75) Inventors: Hong Liang, San Francisco, CA (US);
Yasmina Noubia Abdiche, Mountain
View, CA (US); Javier Fernando
Chaparro Riggers, San Mateo, CA
(US); Bruce Charles Gomes,
Ashburnham, MA (US); Julie Jia Li
Hawkins, Old Lyme, CT (US); Jaume
Pons, San Bruno, CA (US); Yuli Wang,
San Diego, CA (US)

(73) Assignees: Rinat Neuroscience Corp., South San
Francisco, CA (US); Pfizer Inc., New
York, NY (US)

(*) Notice: Subject to any disclaimer, the term of this

patent is extended or adjusted under 35
USC, 154(b) by 0 days.

(21) Appl. No.: 13/225,265

(22) Filed: Sep. 2, 2011

(65) Prior Publication Data
US 2012/0014951 Al Jan. 19, 2012

Related U.S. Application Data

(62) Division of application No. 12/558,312, filed on Sep.
11, 2009, now Pat. No. 8,080, 2

In [11]:
print(text_str[:1000])

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

random text here

lI2)

(54)

US008859741B2

United StateS Patent

(Io) Patent No.

Jackson et al.

(45)

ANTIGEN BINDING PROTEINS TO
PROPROTEIN CONVERTASE SUBTILISIN
KEXIN TYPE 9 (PCSK9)

(71) Applicant: Amgen Inc. , Thousand Oaks, CA (US)
(72)

Inventors:

Simon Mark Jackson, San Carlos, CA
(US); Nigel Pelham Clinton Walker,
Burlingame, CA (US); Derek Evan
Piper, Santa Clara, CA (US); Wenyan
Shen, Palo Alto, CA (US); Chadwick
Terence King, North Vancouver (CA);
Randal Robert Ketchem, Snohomish,
WA (US); Christopher Mehlin, Seattle,
WA (US); Teresa Arazas Carabeo, New
York, NY (US)

(73) Assignee:

Amgen Inc. , Thousand Oaks, CA (US)

*
( ) Notice:

Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U. S.C. 154(b) by 0 days.

(21) Appl. No. : 14/261, 0S7
(22)

Apr. 24, 2014

Filed:

Prior Publication Data

(65)

US 2014/0228545 Al

Aug. 14, 2014

:

US 8,859,741 B2

Date of Pa

### We will need to clean up/preprocess the parsed text in subsequent steps to remove the `\n` characters
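
A minimal sketch of what such a cleanup step might look like, assuming the goal is a single line of space-separated text. `clean_text` is a hypothetical helper name, and the sample string is a fragment shaped like the OCR output shown earlier.

```python
import re


def clean_text(text: str) -> str:
    # Hypothetical helper: collapse every run of whitespace into one space.
    # The \s class covers newlines, tabs, spaces and form feeds, so page
    # separators are removed along with the line breaks.
    return re.sub(r'\s+', ' ', text).strip()


sample = (
    '\n\n \n\n(54) METHOD FOR REDUCING A LEVEL OF\n'
    'LDL-CHOLESTEROL BY AN ANTIBODY\n'
    'THAT SPECIFICALLY BINDS TO PCSK9\n'
)
print(clean_text(sample))
```

Flattening everything this way also discards paragraph and field boundaries, so a later step that needs them (for example, to split the numbered patent fields) may prefer to collapse only blank-line runs instead.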