Pass more arguments to pdftotext #66

zufj · 2020-06-25T13:23:50Z

First of all, thanks for the handy module!

I'd be interested in having access to more of the features offered by pdftotext/xpdf to tune the quality of the extracted text.

As far as I know it is not possible to pass arguments freely to pdftotext but there are a few hardcoded parameters (password, raw).

Would that be something you would be open to adding?

I'm not fluent in C++, but it seems that I could take inspiration from the existing code to try to add my arguments.

The parameters/options I'm most interested in are nodiag, lineprinter, linespacing and fixed. The full list can be found here: http://www.xpdfreader.com/pdftotext-man.html

jalan · 2020-07-09T13:45:46Z

This library uses the poppler cpp interface, so you would first have to check if it exposes the functionality you desire.

Ekran · 2020-10-03T09:31:09Z

The parameters/options I'm most interested in are nodiag, lineprinter, linespacing and fixed.
The full list can be found here: http://www.xpdfreader.com/pdftotext-man.html

But the underlying pdftotext is from http://poppler.freedesktop.org! It is a different pdftotext engine!

Options from there:

-layout              : maintain original physical layout
-fixed <fp>          : assume fixed-pitch (or tabular) text
-raw                 : keep strings in content stream order

But: I tested it with some PDFs from my bank, to read old transactions. The -layout option worked for me. I assume its output is the same as what -lineprinter would produce.

You can of course call XpdfReader from Python, but then you would not need https://pypi.org/project/pdftotext/.

As jalan wrote, we have to look at the poppler interface. There we find: https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/cpp/poppler-page.cpp#L282 This could be a starting point.

jeanmonet · 2021-05-25T17:21:27Z

@jalan Hi, I'm using this opened issue to suggest a few additions that would considerably widen the usage scope of this library.

pdftotext (poppler) does seem to expose the following parameters, at least via the command line:

Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -r <fp>              : resolution, in DPI (default is 72)
  -x <int>             : x-coordinate of the crop area top left corner
  -y <int>             : y-coordinate of the crop area top left corner
  -W <int>             : width of crop area in pixels (default is 0)
  -H <int>             : height of crop area in pixels (default is 0)
  -layout              : maintain original physical layout
  -fixed <fp>          : assume fixed-pitch (or tabular) text
  -raw                 : keep strings in content stream order
  -nodiag              : discard diagonal text
  -htmlmeta            : generate a simple HTML file, including the meta information
  -enc <string>        : output text encoding name
  -listenc             : list available encodings
  -eol <string>        : output end-of-line convention (unix, dos, or mac)
  -nopgbrk             : don't insert page breaks between pages
  -bbox                : output bounding box for each word and page size to html.  Sets -htmlmeta
  -bbox-layout         : like -bbox but with extra layout bounding box data.  Sets -htmlmeta
  -cropbox             : use the crop box rather than media box
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)

The interesting ones that are missing but would be very helpful are:

non_raw_non_physical_layout (see here)

I believe this would be the equivalent via command line of NOT setting either -raw or -layout. The current python wrapper seems to use the layout parameter by default, and only deactivate it when raw=True. But there should be the possibility to deactivate layout even if raw=False. It would be cool to have a layout parameter: layout=False.

bbox and bbox-layout

Here is the bbox-layout output:

...
<page width="595.000000" height="841.000000">
...
    <flow>
      <block xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
        <line xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
          <word xMin="17.281000" yMin="276.220000" xMax="65.249000" yMax="285.156000">Blabla</word>
          <word xMin="67.465000" yMin="276.220000" xMax="76.793000" yMax="285.156000">blaa</word>
          <word xMin="79.009000" yMin="276.220000" xMax="127.441000" yMax="285.156000">balbla</word>
        </line>
      </block>
    </flow>

This could certainly be imported in Python via a list of tuples, something like this:

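from typing import NamedTuple

class WordBox(NamedTuple):
    x0: float   # xMin; the coordinates in the output above are floats, not ints
    y0: float   # yMin
    x1: float   # xMax
    y1: float   # yMax
    word: str
    flow: int   # dunno what flow really is however
    block: int  # would index blocks in the order they appear. Each word belongs to a block
    line: int   # same for lines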

Ideally, a page object in this case could contain some meta-info about the page (such as dimensions and page number) and the possibility to extract the list of words and their bounding box.
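To make the idea above concrete, here is a minimal sketch of turning `-bbox-layout` output into such tuples using only the standard library. The names `BBoxLayoutParser` and `parse_bbox_layout` are hypothetical, and the sketch assumes the markup looks like the excerpt above (`flow`, `block`, `line` and `word` elements with `xMin`/`yMin`/`xMax`/`yMax` attributes). It is not part of the pdftotext Python module.

```python
from html.parser import HTMLParser
from typing import NamedTuple


class WordBox(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    word: str
    flow: int
    block: int
    line: int


class BBoxLayoutParser(HTMLParser):
    """Collect one WordBox per <word> element (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.words = []
        self._flow = self._block = self._line = -1
        self._box = None   # coordinates of the <word> currently open, if any
        self._text = []    # text chunks seen inside that <word>

    def handle_starttag(self, tag, attrs):
        if tag == "flow":
            self._flow += 1
        elif tag == "block":
            self._block += 1
        elif tag == "line":
            self._line += 1
        elif tag == "word":
            a = dict(attrs)  # HTMLParser lowercases attribute names
            self._box = tuple(float(a[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
            self._text = []

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "word" and self._box is not None:
            self.words.append(WordBox(*self._box, "".join(self._text),
                                      self._flow, self._block, self._line))
            self._box = None


def parse_bbox_layout(text: str) -> list:
    """Return all words found in `pdftotext -bbox-layout` output, in document order."""
    parser = BBoxLayoutParser()
    parser.feed(text)
    parser.close()
    return parser.words
```

The flow, block and line indices here run over the whole document rather than restarting on each page, which is a design choice of this sketch, not something the tool prescribes.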

I can certainly extract all this info myself by calling pdftotext via the command line and parsing the output file, but it would be neat to have this machinery inside this Python wrapper. (I'm not proficient with C/C++ so can't help there)

jalan · 2021-05-25T20:13:47Z

I have been meaning to fix the layout regarding non_raw_non_physical_layout. When I created this library, poppler-cpp only exposed two different layouts, so that's what I used. But now it has three. I will make the default layout non_raw_non_physical_layout, to match the CLI tool.

jeanmonet · 2021-05-26T18:07:12Z

Awesome! Thanks for your work on this library.

Just to give an example, here is how I parse the output of the -bbox-layout option. It works well for my use case and, unless I'm mistaken, captures all of the information:

from bs4 import BeautifulSoup  # third-party; the "lxml" parser used below must be installed too

def pdftotext_bbox_parse(content_box: str) -> list[PageWordBox]:
    """
    Given output of `pdftotext -bbox-layout`, parse & retrieve positional information.
    Parses the following kind of output from pdftotext:
        <head>
        </head>
        <body>
        <doc>
        <page width="595.000000" height="841.000000">
            <flow>
            <block xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
                <line xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
                <word xMin="277.060000" yMin="1.890400" xMax="361.276000" yMax="17.439040">Blablaa</word>
                <word xMin="365.380000" yMin="1.890400" xMax="392.454400" yMax="17.439040">blaaa</word>
                <word xMin="396.340000" yMin="1.890400" xMax="427.228480" yMax="17.439040">blaa</word>
                <word xMin="431.140000" yMin="1.890400" xMax="520.033120" yMax="17.439040">blaaaahhh</word>
                </line>
            </block>
            <block xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="54.176040">
                <line xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="26.816040">
            ...
    """
    soup = BeautifulSoup(content_box, features="lxml")
    pages: list[PageWordBox] = []
    idx_page, idx_flow, idx_block, idx_line, idx_word = -1, -1, -1, -1, -1
    for cur_page in soup.find_all("page"):
        idx_page += 1
        page = PageWordBox(n=idx_page, dim=cur_page.attrs)
        pages.append(page)
        for cur_flow in cur_page.find_all("flow"):
            idx_flow += 1
            flow = FlowWordBox(n=idx_flow, page=idx_page)
            page.flows.append(flow)
            for cur_block in cur_flow.find_all("block"):
                idx_block += 1
                block = BlockWordBox(n=idx_block,
                                     flow=idx_flow,
                                     box=Rectangle(*(float(n) for n in cur_block.attrs.values())))
                flow.blocks.append(block)
                page.blocks.append(block)
                for cur_line in cur_block.find_all("line"):
                    idx_line += 1
                    line = LineWordBox(n=idx_line,
                                       block=idx_block,
                                       box=Rectangle(*(float(n) for n in cur_line.attrs.values())))
                    block.lines.append(line)
                    flow.lines.append(line)
                    page.lines.append(line)
                    for cur_word in cur_line.find_all("word"):
                        idx_word += 1
                        word = WordBox(
                            *(float(n) for n in cur_word.attrs.values()),
                            s=cur_word.text,
                            flow=idx_flow,
                            block=idx_block,
                            line=idx_line,
                            n=idx_word)
                        line.words.append(word)
                        block.words.append(word)
                        flow.words.append(word)
                        page.words.append(word)
    return pages

PageWordBox, FlowWordBox, BlockWordBox, LineWordBox, WordBox are just some dataclasses I use to conveniently store the data.
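The dataclasses themselves are not shown in the thread. The sketch below is a guess at what they might look like, inferred only from how the parsing function above constructs and fills them; every field name and type here is an assumption, not the author's actual code.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class Rectangle(NamedTuple):
    # Built positionally from xMin, yMin, xMax, yMax in the parser above.
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class WordBox:
    x0: float
    y0: float
    x1: float
    y1: float
    s: str       # the word's text
    flow: int    # index of the enclosing flow
    block: int   # index of the enclosing block
    line: int    # index of the enclosing line
    n: int       # running index of the word


@dataclass
class LineWordBox:
    n: int
    block: int
    box: Rectangle
    words: list[WordBox] = field(default_factory=list)


@dataclass
class BlockWordBox:
    n: int
    flow: int
    box: Rectangle
    lines: list[LineWordBox] = field(default_factory=list)
    words: list[WordBox] = field(default_factory=list)


@dataclass
class FlowWordBox:
    n: int
    page: int
    blocks: list[BlockWordBox] = field(default_factory=list)
    lines: list[LineWordBox] = field(default_factory=list)
    words: list[WordBox] = field(default_factory=list)


@dataclass
class PageWordBox:
    n: int
    dim: dict    # the <page> attributes, e.g. width and height as strings
    flows: list[FlowWordBox] = field(default_factory=list)
    blocks: list[BlockWordBox] = field(default_factory=list)
    lines: list[LineWordBox] = field(default_factory=list)
    words: list[WordBox] = field(default_factory=list)
```

Each list uses `default_factory` so that separate pages, flows, blocks and lines never share the same list object.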

jalan · 2021-05-27T19:00:58Z

I have created #83 to track fixing the layout options. This issue can remain to discuss adding any other options. I am not likely to add more options, as I just want a fast and easy way to get all the text from a PDF, such as for text mining or searching. There are plenty of more featureful PDF libs out there for doing fancier things, like pdfminer, pypdf2, pymupdf, pikepdf, and probably more since last I checked.

jeanmonet · 2021-05-28T20:03:09Z

Understood, thanks. I'm actually already using PyMuPDF and it's great, but it seems to lack the layout-related options in pdftotext, so for me they complement each other. In case anyone needs it, here is how I read pdftotext output when calling it via the command line:

import subprocess
import tempfile
from pathlib import Path


def pdftotext_cli(path: str | Path, page_num: int | None = None, args: list[str] | None = None) -> str:
    """
    Example usage -> read the second page of PDF and return `-bbox-layout` information
        >>> pdftotext_cli(Path("/path/to/file.pdf"), page_num=2, args=["-bbox-layout"])
    """
    if isinstance(path, str):
        path = Path(path)
    # is_file is a method: without the call the check is always truthy and never fails
    if not path.is_file():
        raise RuntimeError(f"Given path not a (pdf) file: {path!r}")
    # -f and -l are the first and last page, so the same number selects a single page
    page_arg = ["-f", str(page_num), "-l", str(page_num)] if page_num else []
    args = args or []
    with tempfile.NamedTemporaryFile() as temp_file:
        # pdftotext writes into the temporary file, which is read back afterwards
        subprocess.run(["pdftotext", str(path.absolute()), temp_file.name,
                        *page_arg,
                        *args],
                       check=True)
        content = temp_file.read().decode()
    return content

ReMiOS · 2021-10-11T18:02:15Z

I would like the "nodiag" option (and "layout", which is already implemented) in the pdftotext library.

Usage: pdftotext [options] <PDF-file> [<text-file>]
-nodiag : discard diagonal text

It seems Poppler already provides this feature:
TextOutputDev.h

    bool discardDiag; // Diagonal text, i.e., text that is not close to one of the
                      // 0, 90, 180, or 270 degree axes, is discarded. This is useful
                      // to skip watermarks drawn on top of body text, etc.

TextOutputDev.cc

    // throw away diagonal chars
    if (discardDiag && diagonal) {
        charPos += nBytes;
        return;
    }
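Until an option like this is available in the Python module, one workaround is to call the poppler command-line tool directly. The following is a minimal sketch under stated assumptions: `build_pdftotext_cmd` and `run_pdftotext` are hypothetical helper names, a poppler `pdftotext` binary is assumed to be on PATH, and `-` as the output file is assumed to send the text to standard output. Only flags listed earlier in this thread are used.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, *, first=None, last=None, layout=False, nodiag=False):
    """Build an argument list for the poppler pdftotext CLI (hypothetical helper)."""
    cmd = ["pdftotext"]
    if first is not None:
        cmd += ["-f", str(first)]   # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]    # last page to convert
    if layout:
        cmd.append("-layout")       # maintain original physical layout
    if nodiag:
        cmd.append("-nodiag")       # discard diagonal text
    cmd += [str(pdf_path), "-"]     # "-" as the text file: write to stdout
    return cmd


def run_pdftotext(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(build_pdftotext_cmd(pdf_path, **options),
                            check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `run_pdftotext("statement.pdf", nodiag=True)` would extract the text while skipping diagonal watermarks, provided the installed poppler version supports `-nodiag`.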

MohammedRakib · 2022-10-18T12:33:47Z

I might be late, but you guys can install pyxpdf (pip install pyxpdf) for all possible arguments provided by xpdf.

stefan6419846 · 2023-02-16T07:56:54Z

I am not sure whether it actually makes sense to have such a large request lying around here asking about lots of options, while it is not really clear which of them are already available. Wouldn't it make more sense to track the relevant and missing parts in dedicated, smaller issues?

YasminaFr · 2023-03-02T10:16:12Z

Hi @jalan, is it possible to retrieve only some pages of the PDF? I don't want to extract everything and then keep only the pages I need; I would like to avoid that extra work. Can you tell me if there is a way to do this, please?
Something like these poppler parameters:
-f <int> : first page to convert
-l <int> : last page to convert
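One possible approach, sketched here under assumptions rather than confirmed by the maintainer: a `pdftotext.PDF` object supports `len()` and indexing, so a page range can be read by index. Whether the text of a page is only extracted when that page is indexed is an assumption worth verifying against the installed version. `pages_in_range` and `read_page_range` are hypothetical helper names.

```python
def pages_in_range(pdf, first: int, last: int) -> list:
    """Return pages first..last (1-based, inclusive), mirroring -f and -l.

    `pdf` can be any object that supports len() and integer indexing.
    """
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {first}..{last}")
    last = min(last, len(pdf))  # clamp to the number of pages available
    return [pdf[i] for i in range(first - 1, last)]


def read_page_range(path, first: int, last: int) -> list:
    """Open a PDF with the pdftotext module and return only the requested pages."""
    import pdftotext  # third-party; imported lazily so the helper above stays standalone

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
        return pages_in_range(pdf, first, last)
```

Page numbers are 1-based here to match the command-line flags, while the underlying indexing is 0-based.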

stefan6419846 · 2023-03-02T12:28:56Z

When already using pymupdf, there should be no need to run pdftotext afterwards in theory (and not even using a temporary file), as pymupdf has native support for this itself: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction

YasminaFr · 2023-03-02T12:49:54Z

Hi, I compared pymupdf and pdftotext years ago and found that the text extracted by pdftotext was better than that from pymupdf. That's why I have only used pdftotext for PDF text extraction since then. But I will try again. Thank you for your help. Yasmina.


SahibYar · 2023-04-27T05:28:27Z

I am not likely to add more options, as I just want a fast and easy way to get all the text from a PDF, such as for text mining or searching. There are plenty of more featureful PDF libs out there for doing fancier things, like pdfminer, pypdf2, pymupdf, pikepdf, and probably more since last I checked

You could add this comment to the README as well, for future reference.

ahmed-bhs · 2023-10-24T15:32:48Z

@Ekran did you find a way to set the layout argument to true? I tried this:
with open(pdf, "rb") as f:
    pdf = pdftotext.PDF(f, layout=True)

Unfortunately, I got TypeError: 'layout' is an invalid keyword argument for this func

benjamin-awd · 2023-10-25T14:50:33Z

@ahmed-bhs
I think you may want:

pdf = pdftotext.PDF(f, physical=True)

https://poppler.freedesktop.org/api/cpp/classpoppler_1_1page.html
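Putting the answers in this thread together, the layout choice could be wrapped as in the sketch below. It assumes a pdftotext version whose `PDF` constructor accepts the `raw` and `physical` keywords mentioned in this thread, and that leaving both unset gives the default layout described earlier by the maintainer. `layout_kwargs` and `extract_text` are hypothetical helper names.

```python
def layout_kwargs(mode: str = "default") -> dict:
    """Map a layout mode name to keyword arguments for pdftotext.PDF."""
    modes = {
        "default": {},                    # neither raw nor physical
        "physical": {"physical": True},   # roughly the CLI's -layout
        "raw": {"raw": True},             # roughly the CLI's -raw
    }
    if mode not in modes:
        raise ValueError(f"unknown layout mode: {mode!r}")
    return dict(modes[mode])  # copy, so callers cannot mutate the table


def extract_text(path, mode: str = "default") -> str:
    """Extract all text from a PDF using the chosen layout mode."""
    import pdftotext  # third-party; imported lazily

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f, **layout_kwargs(mode))
        return "\n\n".join(pdf)
```

Only one of the two flags is ever set at a time, since the modes are alternatives rather than options to combine.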

ahmed-bhs · 2023-10-25T15:00:31Z

Yeah, exactly. Thank you so much @benjamin-awd

jalan mentioned this issue Sep 10, 2021

Bounding boxes #32

Closed

jalan mentioned this issue Jan 20, 2022

Add option to hide clipped text and ignore diagonal text #97

Closed
