Question: Can pdfminer retrieve text & bboxes without layout? #51
Comments
It should be pretty easy, since pdfminer gives access to all entities in a PDF file. pdf2txt and the other tools are just examples of what can be done; you can do much more by overriding the PDFDevice class to handle bbox positions, and possibly PDFPageInterpreter if needed. You may have a look at my Drawing.read_pdf method in https://github.com/goulu/Goulib/blob/master/Goulib/drawing.py (from line 1106), where I do something like that to read vector graphics (but not yet text...).
Yes you can, you have to write your own:

from collections.abc import Iterable

from pdfminer.converter import PDFLayoutAnalyzer
from pdfminer.layout import LTChar
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


class CustomConverter(PDFLayoutAnalyzer):
    def receive_layout(self, ltpage):
        # Walk the page's layout tree with an explicit stack.
        stack = [ltpage]
        while len(stack) > 0:
            item = stack.pop()
            if isinstance(item, LTChar):
                # Print each character together with its bounding box.
                print('"%s"' % item.get_text(), item.bbox)
            if isinstance(item, Iterable):
                # Containers (pages, figures, ...) yield their children.
                stack.extend(list(iter(item)))


rsrcmgr = PDFResourceManager()
device = CustomConverter(rsrcmgr)
interpreter = PDFPageInterpreter(rsrcmgr, device)

with open('/users/pieter/downloads/fontsizes.pdf', 'rb') as fin:
    for page in PDFPage.get_pages(fin):
        interpreter.process_page(page)

device.close()

Note that by using
I'm closing this because I think this question is answered. Feel free to reopen.
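For readers who want the characters back as data rather than printed, the traversal from the converter above can be pulled into a small helper. This is only a sketch: `collect_char_boxes` and its `is_char` parameter are hypothetical names, not part of pdfminer, and the helper relies only on what the code above already uses (containers are iterable, characters expose `.bbox` and `.get_text()`). Passing the character test in as a predicate keeps the helper free of any pdfminer import.

```python
from collections.abc import Iterable


def collect_char_boxes(root, is_char):
    """Return (x0, y0, x1, y1, text) tuples for every character under `root`.

    `root` is any layout object. `is_char` is a caller-supplied predicate
    (hypothetical parameter) that decides which items count as characters;
    those items must expose `.bbox` and `.get_text()`.
    """
    boxes = []
    stack = [root]
    while stack:
        item = stack.pop()
        if is_char(item):
            x0, y0, x1, y1 = item.bbox
            boxes.append((x0, y0, x1, y1, item.get_text()))
        elif isinstance(item, Iterable):
            # Push children in reverse so they are popped in their
            # original order (the plain stack walk visits them backwards).
            stack.extend(reversed(list(item)))
    return boxes
```

Inside `receive_layout` of the converter above, this could be called as `collect_char_boxes(ltpage, lambda item: isinstance(item, LTChar))`, with the result appended to a list kept on the converter instead of being printed.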
Is it possible to just retrieve all the text on the page with each fragment returned with its bounding box, i.e., (x1, y1, x2, y2, text) -- with no layout analysis? Use case: this would be ideal for people who want to do their own layout analysis with minimal overheads.
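As a rough illustration of the "do your own layout analysis" use case, here is a sketch that merges per-character boxes of the shape (x1, y1, x2, y2, text) into fragments. Everything in it is an assumption made for illustration: the function name `group_into_fragments`, the tolerances `y_tol` and `gap_tol`, and the heuristic itself (same line if the bottom edges are within `y_tol`, same fragment if the horizontal gap is at most `gap_tol`). It assumes horizontal, unrotated text and PDF-style coordinates where y grows upward; it is not what pdfminer does internally.

```python
def group_into_fragments(char_boxes, y_tol=2.0, gap_tol=1.0):
    """Merge (x0, y0, x1, y1, text) character boxes into text fragments.

    Hypothetical heuristic: characters whose bottom edge (y0) lies within
    `y_tol` of a line's first character share that line; within a line,
    characters separated by at most `gap_tol` are merged into one fragment.
    """
    # Step 1: cluster characters into lines, top of the page first.
    lines = []  # each entry: [reference_y0, [boxes on that line]]
    for box in sorted(char_boxes, key=lambda b: -b[1]):
        if lines and abs(lines[-1][0] - box[1]) <= y_tol:
            lines[-1][1].append(box)
        else:
            lines.append([box[1], [box]])

    # Step 2: within each line, merge neighbours left to right.
    fragments = []
    for _, boxes in lines:
        boxes.sort(key=lambda b: b[0])
        current = None
        for x0, y0, x1, y1, text in boxes:
            if current is not None and x0 - current[2] <= gap_tol:
                cx0, cy0, cx1, cy1, ctext = current
                # Grow the fragment's box to cover the new character.
                current = (cx0, min(cy0, y0), max(cx1, x1), max(cy1, y1), ctext + text)
            else:
                if current is not None:
                    fragments.append(current)
                current = (x0, y0, x1, y1, text)
        if current is not None:
            fragments.append(current)
    return fragments
```

The tolerances would need tuning per document, since sensible values depend on font size and on how the PDF producer positions glyphs.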