In [1]:
import os
import sys
import site
from sparclur.parsers import MuPDF, Poppler, PDFMiner, XPDF, Ghostscript

### Set the document path...

In [2]:
hello_world = os.path.join(sys.prefix, 'etc', 'sparclur', 'resources', 'hello_world_hand_edit.pdf')
# If the path above does not load, try the one below. Any other path to a PDF can also be used here.
# hello_world = os.path.join(site.USER_BASE, 'etc', 'sparclur', 'resources', 'hello_world_hand_edit.pdf')
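
The manual fallback above can also be automated. The following is a minimal sketch, not part of SPARCLUR: `find_resource` and `hello_world_path` are hypothetical names introduced here for illustration. It uses `site.getuserbase()` rather than `site.USER_BASE`, because the attribute can be `None` in some environments.

```python
import os
import site
import sys


def find_resource(candidates):
    """Return the first candidate path that is an existing file, or None.

    Hypothetical helper for illustration; not part of SPARCLUR.
    """
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Relative location of the sample PDF, as used in the cell above.
resource = os.path.join('etc', 'sparclur', 'resources', 'hello_world_hand_edit.pdf')

# The two install locations mentioned above, tried in order.
candidates = [
    os.path.join(sys.prefix, resource),
    os.path.join(site.getuserbase(), resource),
]

# None if the file is in neither location; any other PDF path works too.
hello_world_path = find_resource(candidates)
```

If `hello_world_path` is `None`, the sample file was not found in either location and a path to some other PDF has to be supplied by hand.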

### ...and load it into MuPDF

In [3]:
mupdf = MuPDF(hello_world)
mupdf

num_pages:	(Property) Returns number of pages in the document
can_reforge:	(Property) Boolean for whether or not reforge capability is present
reforge:	(Property) Returns the raw binary of the reconstructed PDF
reforge_result:	(Property) Message conveying the success or failure of the reforging
save_reforge:	Save the reforge to the specified file location
can_extract_text:	(Property) Boolean for whether or not text extraction is present
get_text:	Return a dictionary of pages and their extracted texts
clear_text:	Clear the cache of text extraction
get_tokens:	Return a dictionary of the parsed text tokens
compare_text:	Return the Jaccard similarity of the shingled tokens between two text extractors
can_render:	Boolean for whether or not rendering capability is present
validate_renderer:	(Property) Determines the PDF validity for rendering process
logs:	(Property) Any logs collected during the rendering process
caching:	(Property) Whether renders are cached or not
clear_renders:	Clears an

<hr>

### Let's try to extract the text using MuPDF

In [4]:
mupdf.get_text()

Deprecation: 'getText' removed from class 'Page' after v1.19 - use 'get_text'.


{0: 'Hello World ...\n'}

In [5]:
mupdf.validate_text

{'valid': True,
 'info': 'expected generation number (0 ? obj)\ntrying to repair broken xref\nrepairing PDF document'}

<hr>

### Now let's try to extract the text using OCR.

In [6]:
mu_ocr = MuPDF(hello_world, ocr=True, dpi=200)

In [7]:
mu_ocr.get_text(0)

'Hello World ...\n'

<hr>

### For parsers that support both rendering and text extraction, we can also directly compare the extracted text with the OCR text.

In [8]:
mu_ocr.compare_ocr(shingle_size=1)

Deprecation: 'getText' removed from class 'Page' after v1.19 - use 'get_text'.


1.0
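
The score is described in the method listing as a Jaccard similarity of shingled tokens. The sketch below shows that general idea, assuming whitespace tokenization. It is not SPARCLUR's implementation: `shingles` and `jaccard` are hypothetical helpers, and the library's actual tokenization may differ.

```python
def shingles(tokens, size):
    """Return the set of contiguous token windows of the given size."""
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# The two strings returned by the cells above, tokenized on whitespace
# (an assumption made for this sketch).
extracted = 'Hello World ...\n'.split()
ocr = 'Hello World ...\n'.split()

# With shingle_size=1 every shingle is a single token.
score = jaccard(shingles(extracted, 1), shingles(ocr, 1))
```

Because both token lists are identical, `score` is `1.0`. Larger shingle sizes make the measure sensitive to token order as well as token content.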

<hr>

### Let's see if Poppler can extract the text.

In [9]:
poppler = Poppler(hello_world)

In [10]:
poppler

num_pages:	(Property) Returns number of pages in the document
can_reforge:	(Property) Boolean for whether or not reforge capability is present
reforge:	(Property) Returns the raw binary of the reconstructed PDF
reforge_result:	(Property) Message conveying the success or failure of the reforging
save_reforge:	Save the reforge to the specified file location
can_extract_image_data:	(Property) Boolean for whether or not image data extraction 
                                                capability is present
contains_jpeg:	(Property) Returns True if jpeg data was extracted from the PDF
contains_images:	(Property) Returns True if image data was extracted from the PDF
images:	(Property) Returns the image data that was extracted from the PDF
validate_image_data:	(Property) Determines the PDF validity for image data extraction
can_extract_font:	(Property) Boolean for whether or not font extraction is present
non_embedded_fonts:	(Property) Returns true if the document is missing non-system f

In [11]:
poppler.get_text(0)

'Hello World ...\n\x0c'
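
The Poppler output ends with `\x0c`, a form feed character that the MuPDF output does not contain. The sketch below shows two ways to normalize it before comparing raw strings. The variable names are hypothetical, and the string literals are copied from the outputs above.

```python
# Raw strings copied from the Poppler and MuPDF cell outputs above.
poppler_text = 'Hello World ...\n\x0c'
mupdf_text = 'Hello World ...\n'

# Option 1: strip trailing form feeds only, keeping other whitespace intact.
cleaned = poppler_text.rstrip('\x0c')

# Option 2: split on whitespace; Python treats the form feed as whitespace,
# so it disappears during tokenization.
poppler_tokens = poppler_text.split()
mupdf_tokens = mupdf_text.split()
```

Either way the two extractions agree once the form feed is removed.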

In [12]:
poppler.validate_text

{'valid': True, 'status': 'Valid'}

<hr>

### Success. Now let's compare this text with the OCR'ed text from MuPDF.

In [13]:
poppler.compare_text(mu_ocr, shingle_size=1)

1.0