Question: Can pdfminer retrieve text & bboxes without layout? #51

mark-summerfield · 2017-03-24T21:43:59Z

Is it possible to just retrieve all the text on the page with each fragment returned with its bounding box, i.e., (x1, y1, x2, y2, text) -- with no layout analysis? Use case: this would be ideal for people who want to do their own layout analysis with minimal overheads.

goulu · 2017-04-18T17:04:53Z

It should be pretty easy, since pdfminer gives access to all entities in a PDF file. pdf2txt and the other tools are just examples of what can be done; you can do much more by overriding the PDFDevice class to handle bbox positions, and possibly PDFPageInterpreter if needed.

You may have a look at my Drawing.read_pdf method in https://github.com/goulu/Goulib/blob/master/Goulib/drawing.py (from line 1106), where I do something like that to read vector graphics (but not yet text).

pietermarsman · 2019-10-16T14:47:33Z

Yes, you can: you have to write your own PDFLayoutAnalyzer subclass. For example, to print the bounding boxes of all characters, use the following:

from collections.abc import Iterable

from pdfminer.converter import PDFLayoutAnalyzer
from pdfminer.layout import LTChar
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


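class CustomConverter(PDFLayoutAnalyzer):
    def receive_layout(self, ltpage):
        # Called once per page; walk the page tree iteratively with a stack.
        stack = [ltpage]
        while stack:
            item = stack.pop()

            if isinstance(item, LTChar):
                # item.bbox is (x0, y0, x1, y1) in PDF page coordinates.
                print('"%s"' % item.get_text(), item.bbox)

            if isinstance(item, Iterable):
                # Containers (pages, figures) are iterable; push children
                # reversed so they are popped in their original order.
                stack.extend(reversed(list(item)))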


rsrcmgr = PDFResourceManager()
device = CustomConverter(rsrcmgr)

interpreter = PDFPageInterpreter(rsrcmgr, device)
with open('/users/pieter/downloads/fontsizes.pdf', 'rb') as fin:
    for page in PDFPage.get_pages(fin):
        interpreter.process_page(page)

device.close()

Note that with laparams=None in the PDFLayoutAnalyzer (the default value), layout analysis is turned off. You can also override the PDFLayoutAnalyzer.end_page() method to explicitly remove the call to the .analyze() method.
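
If you would rather collect the (x1, y1, x2, y2, text) tuples from the original question than print them, something along these lines might work. It is only a sketch, assuming pdfminer.six as in the snippet above. The names collect_boxes, pdf_char_boxes and BoxCollector are made up for illustration and are not part of pdfminer. The walker is duck-typed (it looks for non-iterable objects with both bbox and get_text, which should match LTChar), and pdfminer is imported inside the second function so the walker can be tried without it.

```python
from collections.abc import Iterable


def collect_boxes(root):
    """Return (x1, y1, x2, y2, text) for each character-like leaf under root.

    A leaf is any non-iterable object with both `bbox` and `get_text`.
    That should match LTChar while skipping LTAnno (no bbox) and
    graphics such as LTRect (no get_text).
    """
    boxes = []
    stack = [root]
    while stack:
        item = stack.pop()
        if isinstance(item, Iterable):
            # Push children reversed so they are visited in original order.
            stack.extend(reversed(list(item)))
        elif hasattr(item, "bbox") and hasattr(item, "get_text"):
            x1, y1, x2, y2 = item.bbox
            boxes.append((x1, y1, x2, y2, item.get_text()))
    return boxes


def pdf_char_boxes(path):
    """Return one list of boxes per page of the PDF at `path` (hypothetical helper)."""
    # Imported lazily so collect_boxes() stays usable without pdfminer.
    from pdfminer.converter import PDFLayoutAnalyzer
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    pages = []

    class BoxCollector(PDFLayoutAnalyzer):
        def receive_layout(self, ltpage):
            # Called once per page by end_page().
            pages.append(collect_boxes(ltpage))

    rsrcmgr = PDFResourceManager()
    device = BoxCollector(rsrcmgr)  # laparams defaults to None: no layout analysis
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, "rb") as fin:
        for page in PDFPage.get_pages(fin):
            interpreter.process_page(page)
    device.close()
    return pages
```

Calling pdf_char_boxes('some.pdf') (any path of your own) should then give a list with one entry per page, each a list of tuples you can feed into your own layout analysis.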

pietermarsman · 2019-10-16T14:47:54Z

I'm closing this because I think this question is answered. Feel free to reopen.

goulu added help wanted type: question labels Apr 18, 2017

pietermarsman closed this as completed Oct 16, 2019
