Question: Can pdfminer retrieve text & bboxes without layout? #51

Closed
mark-summerfield opened this issue Mar 24, 2017 · 3 comments
@mark-summerfield

Is it possible to retrieve all the text on a page with each fragment returned alongside its bounding box, i.e., (x1, y1, x2, y2, text), with no layout analysis? Use case: this would be ideal for people who want to do their own layout analysis with minimal overhead.

@goulu
Member

goulu commented Apr 18, 2017

It should be pretty easy, since pdfminer gives access to all entities in a PDF file. pdf2txt and the other tools are just examples of what can be done; you can do much more by overriding the PDFDevice class to handle bbox positions, and possibly PDFPageInterpreter if needed.

You may have a look at my Drawing.read_pdf method in https://github.com/goulu/Goulib/blob/master/Goulib/drawing.py (from line 1106), where I do something like that to read vector graphics (but not yet text...)

@pietermarsman
Member

pietermarsman commented Oct 16, 2019

Yes, you can: write your own PDFLayoutAnalyzer subclass. For example, to print the bounding boxes of all characters, use the following:

from collections.abc import Iterable

from pdfminer.converter import PDFLayoutAnalyzer
from pdfminer.layout import LTChar
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


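class CustomConverter(PDFLayoutAnalyzer):
    def receive_layout(self, ltpage):
        # Walk the page's layout tree depth-first with an explicit stack.
        stack = [ltpage]
        while stack:
            item = stack.pop()

            # Each LTChar carries its text and an (x0, y0, x1, y1) bbox.
            if isinstance(item, LTChar):
                print('"%s"' % item.get_text(), item.bbox)

            # Containers such as LTPage and LTFigure are iterable; descend into them.
            if isinstance(item, Iterable):
                stack.extend(item)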


rsrcmgr = PDFResourceManager()
device = CustomConverter(rsrcmgr)

interpreter = PDFPageInterpreter(rsrcmgr, device)
with open('/users/pieter/downloads/fontsizes.pdf', 'rb') as fin:
    for page in PDFPage.get_pages(fin):
        interpreter.process_page(page)

device.close()

Note that passing laparams=None to the PDFLayoutAnalyzer (the default value) turns layout analysis off. You can also override the PDFLayoutAnalyzer.end_page() method to explicitly remove the call to the .analyze() methods.
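
For the "do your own layout analysis" use case from the question, here is a minimal sketch of what could come next. It assumes you have already collected (x0, y0, x1, y1, text) tuples, for example by appending (*item.bbox, item.get_text()) to a list instead of printing in the converter above. The function name group_into_lines and the y_tolerance parameter are made up for illustration and are not part of pdfminer.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): the shape asked for in the question.
CharBox = Tuple[float, float, float, float, str]


def group_into_lines(chars: List[CharBox], y_tolerance: float = 2.0) -> List[str]:
    """Group character boxes into text lines by their bottom edge (y0).

    Assumes PDF coordinates, where y grows upwards, so a larger y0 is
    higher on the page. y_tolerance is a hypothetical tuning knob.
    """
    lines: List[List[CharBox]] = []
    # Visit characters from the top of the page to the bottom.
    for box in sorted(chars, key=lambda c: -c[1]):
        # Compare against the first character of the current line.
        if lines and abs(lines[-1][0][1] - box[1]) <= y_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within each line, order characters left to right and join the text.
    return ["".join(c[4] for c in sorted(line, key=lambda c: c[0])) for line in lines]


# Hand-written sample data, not taken from a real PDF.
sample = [
    (20.0, 700.0, 26.0, 712.0, "i"),
    (10.0, 700.5, 18.0, 712.5, "H"),
    (10.0, 680.0, 18.0, 692.0, "o"),
    (20.0, 680.0, 26.0, 692.0, "k"),
]
print(group_into_lines(sample))
```

This is deliberately naive: it ignores word spacing, rotated text and multiple columns, which is exactly the part you would tailor to your own documents.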

@pietermarsman
Member

I'm closing this because I think this question is answered. Feel free to reopen.
