Pass more arguments to pdftotext #66

Open
zufj opened this issue Jun 25, 2020 · 17 comments

Comments

@zufj

zufj commented Jun 25, 2020

First of all, thanks for the handy module!

I'd be interested in having access to more of the features offered by pdftotext/xpdf to tune the quality of the extracted text.

As far as I know it is not possible to pass arguments freely to pdftotext but there are a few hardcoded parameters (password, raw).

Would that be something you would be open to add?

I'm not fluent in C++ but it seems that I could get inspiration from the existing code to try to have my arguments in.

The parameters/options I'm most interested in are nodiag, lineprinter, linespacing and fixed. The full list can be found here: http://www.xpdfreader.com/pdftotext-man.html

@jalan
Owner

jalan commented Jul 9, 2020

This library uses the poppler cpp interface, so you would first have to check if it exposes the functionality you desire.

@Ekran

Ekran commented Oct 3, 2020

The parameters/options I'm most interested in are nodiag, lineprinter, linespacing and fixed.
The full list can be found here: http://www.xpdfreader.com/pdftotext-man.html

But the underlying pdftotext is from http://poppler.freedesktop.org! It is a different pdftotext engine!

options from there:

-layout              : maintain original physical layout
-fixed <fp>          : assume fixed-pitch (or tabular) text
-raw                 : keep strings in content stream order

But I tested it with a PDF from my bank to read old transactions, and the -layout option worked for me. I assume it produces the same output as -lineprinter would.

You can of course call XpdfReader from Python, but then you would not need https://pypi.org/project/pdftotext/.
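For anyone who goes the command-line route, here is a minimal sketch of calling a `pdftotext` binary from Python. It assumes an executable named `pdftotext` is on the PATH; the helper names `build_pdftotext_cmd` and `extract_text` are made up for illustration. Passing `-` as the output file asks pdftotext to write to stdout, so no temporary file is needed. The output encoding depends on the binary and its configuration, so decoding as UTF-8 is an assumption here.

```python
import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path, options=()):
    """Return the argv list for a pdftotext call that writes to stdout."""
    # "-" as the text-file argument sends the extracted text to stdout.
    return ["pdftotext", *options, str(Path(pdf_path)), "-"]


def extract_text(pdf_path, options=("-layout",)):
    """Run pdftotext and return its output as a string.

    Assumes UTF-8 output; undecodable bytes are replaced rather than raising.
    """
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, options),
        check=True,            # raise CalledProcessError on a non-zero exit code
        capture_output=True,   # collect stdout/stderr instead of printing them
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Which options are accepted depends on whether the installed binary comes from Xpdf or from Poppler, so check `pdftotext -h` before relying on a flag such as `-lineprinter`.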

As jalan wrote, we have to look at the poppler interface. There we find: https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/cpp/poppler-page.cpp#L282 This could be a starting point.

@jeanmonet

@jalan Hi, I'm using this opened issue to suggest a few additions that would considerably widen the usage scope of this library.

pdftotext (poppler) does seem to expose the following parameters, at least via the command line:

Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -r <fp>              : resolution, in DPI (default is 72)
  -x <int>             : x-coordinate of the crop area top left corner
  -y <int>             : y-coordinate of the crop area top left corner
  -W <int>             : width of crop area in pixels (default is 0)
  -H <int>             : height of crop area in pixels (default is 0)
  -layout              : maintain original physical layout
  -fixed <fp>          : assume fixed-pitch (or tabular) text
  -raw                 : keep strings in content stream order
  -nodiag              : discard diagonal text
  -htmlmeta            : generate a simple HTML file, including the meta information
  -enc <string>        : output text encoding name
  -listenc             : list available encodings
  -eol <string>        : output end-of-line convention (unix, dos, or mac)
  -nopgbrk             : don't insert page breaks between pages
  -bbox                : output bounding box for each word and page size to html.  Sets -htmlmeta
  -bbox-layout         : like -bbox but with extra layout bounding box data.  Sets -htmlmeta
  -cropbox             : use the crop box rather than media box
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)

The interesting ones that are missing but would be very helpful are:

  1. non_raw_non_physical_layout (see here)

I believe this would be the equivalent via command line of NOT setting either -raw or -layout. The current python wrapper seems to use the layout parameter by default, and only deactivate it when raw=True. But there should be the possibility to deactivate layout even if raw=False. It would be cool to have a layout parameter: layout=False.

  2. bbox and bbox-layout

Here is the bbox-layout output:

...
<page width="595.000000" height="841.000000">
...
    <flow>
      <block xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
        <line xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
          <word xMin="17.281000" yMin="276.220000" xMax="65.249000" yMax="285.156000">Blabla</word>
          <word xMin="67.465000" yMin="276.220000" xMax="76.793000" yMax="285.156000">blaa</word>
          <word xMin="79.009000" yMin="276.220000" xMax="127.441000" yMax="285.156000">balbla</word>
        </line>
      </block>
    </flow>

This could certainly be imported in Python via a list of tuples, something like this:

from typing import NamedTuple

class WordBox(NamedTuple):
    x0: float   # coordinates in the bbox output are fractional, so float rather than int
    y0: float
    x1: float
    y1: float
    word: str
    flow: int   # dunno what flow really is however
    block: int  # would index blocks in the order they appear. Each word belongs to a block
    line: int   # same for lines

Ideally, a page object in this case could contain some meta-info about the page (such as dimensions and page number) and the possibility to extract the list of words and their bounding box.

I can certainly extract all this info myself by calling pdftotext via the command line and parsing the output file, but it would be neat to have this machinery inside this Python wrapper. (I'm not proficient with C/C++ so can't help there)

@jalan
Owner

jalan commented May 25, 2021

I have been meaning to fix the layout regarding non_raw_non_physical_layout. When I created this library, poppler-cpp only exposed two different layouts, so that's what I used. But now it has three. I will make the default layout non_raw_non_physical_layout, to match the CLI tool.

@jeanmonet

Awesome! Thanks for your work on this library.

Just to give an example, here is how I parse the output of the -bbox-layout option. It works well for my use case and, unless I'm mistaken, captures all of the information:

from bs4 import BeautifulSoup  # the "lxml" parser used below must be installed as well

def pdftotext_bbox_parse(content_box: str) -> list[PageWordBox]:
    """
    Given output of `pdftotext -bbox-layout`, parse & retrieve positional information.
    Parses the following kind of output from pdftotext:
        <head>
        </head>
        <body>
        <doc>
        <page width="595.000000" height="841.000000">
            <flow>
            <block xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
                <line xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
                <word xMin="277.060000" yMin="1.890400" xMax="361.276000" yMax="17.439040">Blablaa</word>
                <word xMin="365.380000" yMin="1.890400" xMax="392.454400" yMax="17.439040">blaaa</word>
                <word xMin="396.340000" yMin="1.890400" xMax="427.228480" yMax="17.439040">blaa</word>
                <word xMin="431.140000" yMin="1.890400" xMax="520.033120" yMax="17.439040">blaaaahhh</word>
                </line>
            </block>
            <block xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="54.176040">
                <line xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="26.816040">
            ...
    """
    soup = BeautifulSoup(content_box, features="lxml")
    pages: list[PageWordBox] = []
    idx_page, idx_flow, idx_block, idx_line, idx_word = -1, -1, -1, -1, -1
    for cur_page in soup.find_all("page"):
        idx_page += 1
        page = PageWordBox(n=idx_page, dim=cur_page.attrs)
        pages.append(page)
        for cur_flow in cur_page.find_all("flow"):
            idx_flow += 1
            flow = FlowWordBox(n=idx_flow, page=idx_page)
            page.flows.append(flow)
            for cur_block in cur_flow.find_all("block"):
                idx_block += 1
                block = BlockWordBox(n=idx_block,
                                     flow=idx_flow,
                                     box=Rectangle(*(float(n) for n in cur_block.attrs.values())))
                flow.blocks.append(block)
                page.blocks.append(block)
                for cur_line in cur_block.find_all("line"):
                    idx_line += 1
                    line = LineWordBox(n=idx_line,
                                       block=idx_block,
                                       box=Rectangle(*(float(n) for n in cur_line.attrs.values())))
                    block.lines.append(line)
                    flow.lines.append(line)
                    page.lines.append(line)
                    for cur_word in cur_line.find_all("word"):
                        idx_word += 1
                        word = WordBox(
                            *(float(n) for n in cur_word.attrs.values()),
                            s=cur_word.text,
                            flow=idx_flow,
                            block=idx_block,
                            line=idx_line,
                            n=idx_word)
                        line.words.append(word)
                        block.words.append(word)
                        flow.words.append(word)
                        page.words.append(word)
    return pages

PageWordBox, FlowWordBox, BlockWordBox, LineWordBox, WordBox are just some dataclasses I use to conveniently store the data.
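The dataclasses themselves are not shown in the thread. As a rough sketch of what they could look like, the definitions below are reconstructed purely from how the parser above calls them: the field names `n`, `dim`, `page`, `flow`, `block`, `line`, `box` and `s` come from the call sites, while the coordinate field names (`x_min`, `y_min`, `x_max`, `y_max`) and the `Rectangle` layout are assumptions based on the attribute order in the sample output.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class Rectangle(NamedTuple):
    # Assumed order, matching xMin, yMin, xMax, yMax in the bbox output.
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class WordBox:
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    s: str      # the word's text
    flow: int   # index of the enclosing flow
    block: int  # index of the enclosing block
    line: int   # index of the enclosing line
    n: int      # running index of the word


@dataclass
class LineWordBox:
    n: int
    block: int
    box: Rectangle
    words: list[WordBox] = field(default_factory=list)


@dataclass
class BlockWordBox:
    n: int
    flow: int
    box: Rectangle
    lines: list[LineWordBox] = field(default_factory=list)
    words: list[WordBox] = field(default_factory=list)


@dataclass
class FlowWordBox:
    n: int
    page: int
    blocks: list[BlockWordBox] = field(default_factory=list)
    lines: list[LineWordBox] = field(default_factory=list)
    words: list[WordBox] = field(default_factory=list)


@dataclass
class PageWordBox:
    n: int
    dim: dict[str, str]  # raw <page> attributes, e.g. width and height as strings
    flows: list[FlowWordBox] = field(default_factory=list)
    blocks: list[BlockWordBox] = field(default_factory=list)
    lines: list[LineWordBox] = field(default_factory=list)
    words: list[WordBox] = field(default_factory=list)
```

Each container uses `field(default_factory=list)` so that every instance gets its own list, which the parser then fills with `append`.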

@jalan
Owner

jalan commented May 27, 2021

I have created #83 to track fixing the layout options. This issue can remain to discuss adding any other options. I am not likely to add more options, as I just want a fast and easy way to get all the text from a PDF, such as for text mining or searching. There are plenty of more featureful PDF libs out there for doing fancier things, like pdfminer, pypdf2, pymupdf, pikepdf, and probably more since last I checked.

@jeanmonet

Understood, thanks. I'm actually already using PyMuPDF and it's great, but seems to lack the layout-related options in pdftotext, so for me they complement each other. In case anyone needs, here is how I read pdftotext output when handled via command line:

import subprocess
import tempfile
from pathlib import Path


def pdftotext_cli(path: str | Path, page_num: int | None = None, args: list[str] | None = None) -> str:
    """
    Example usage -> read the second page of PDF and return `-bbox-layout` information
        >>> pdftotext_cli(Path("/path/to/file.pdf"), page_num=2, args=["-bbox-layout"])
    """
    path = Path(path)
    # is_file must be called; the bare method object is always truthy
    if not path.is_file():
        raise RuntimeError(f"Given path not a (pdf) file: {path!r}")
    # -f/-l restrict the conversion to a single (1-based) page
    page_arg = ["-f", str(page_num), "-l", str(page_num)] if page_num else []
    args = args or []
    with tempfile.NamedTemporaryFile() as temp_file:
        # pdftotext writes its output to the temporary file, which is read back below
        subprocess.run(["pdftotext", str(path.absolute()), temp_file.name,
                        *page_arg,
                        *args],
                       check=True)
        content = temp_file.read().decode()
    return content

@jalan jalan mentioned this issue Sep 10, 2021
@ReMiOS

ReMiOS commented Oct 11, 2021

I would like the "nodiag" option (and "layout", which is already implemented) in the pdftotext library.

Usage: pdftotext [options] <PDF-file> [<text-file>]
-nodiag : discard diagonal text

It seems Poppler already provides this feature:
TextOutputDev.h
bool discardDiag; // Diagonal text, i.e., text that is not close to one of the
                  // 0, 90, 180, or 270 degree axes, is discarded. This is useful
                  // to skip watermarks drawn on top of body text, etc.

TextOutputDev.cc
// throw away diagonal chars
if (discardDiag && diagonal) {
    charPos += nBytes;
    return;
}
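To make the "not close to one of the 0, 90, 180, or 270 degree axes" wording concrete, here is a small illustrative sketch in Python. It is not Poppler's actual test, which works on the text transformation matrix with its own threshold; the function name `is_diagonal` and the 5 degree tolerance are assumptions chosen only for illustration.

```python
def is_diagonal(angle_deg, tolerance_deg=5.0):
    """Return True if a text rotation is not close to 0, 90, 180 or 270 degrees.

    tolerance_deg is a made-up value, not the threshold Poppler uses.
    """
    # Reduce the angle to its offset within a quarter turn, in [0, 90).
    offset = angle_deg % 90.0
    # Distance to the nearest axis, which may be the next multiple of 90.
    distance = min(offset, 90.0 - offset)
    return distance > tolerance_deg
```

With this definition a watermark rotated by 45 degrees counts as diagonal, while body text at 0 degrees or a sideways label at 270 degrees does not.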

@MohammedRakib

I might be late, but you guys can install pyxpdf (pip install pyxpdf) for all possible arguments provided by xpdf.

@stefan6419846

I am not sure whether it actually makes sense to have such a large request lying around here asking about lots of options, while it is not really clear which of them are already available. Wouldn't it make more sense to track the relevant and missing parts in dedicated, smaller issues?

@YasminaFr

YasminaFr commented Mar 2, 2023

Hi @jalan, is it possible to retrieve only some pages of the PDF? I don't want to retrieve everything and then filter out the pages I need; I would like to avoid that overhead. Can you tell me if there is a way to do this, please?
Something like these parameters (poppler):
-f <int> : first page to convert
-l <int> : last page to convert
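A minimal sketch of a `-f`/`-l` style page range on the Python side, assuming only that the PDF object supports `len()` and integer indexing (the helper name `pages_in_range` is hypothetical). Whether indexing a single page avoids extracting the others depends on the library's internals, which this sketch does not verify.

```python
def pages_in_range(pdf, first=1, last=None):
    """Return the text of pages first..last (1-based, inclusive), like -f/-l."""
    total = len(pdf)
    # Clamp the range to the pages that actually exist.
    last = total if last is None else min(last, total)
    first = max(first, 1)
    # Index page by page instead of iterating over the whole document.
    return [pdf[i] for i in range(first - 1, last)]
```

Because the function only needs `len()` and indexing, it can be tried on a plain list of strings before pointing it at a real PDF object.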

@stefan6419846

When already using pymupdf, there should be no need to run pdftotext afterwards in theory (and not even using a temporary file), as pymupdf has native support for this itself: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction

@YasminaFr

YasminaFr commented Mar 2, 2023 via email

@SahibYar

I am not likely to add more options, as I just want a fast and easy way to get all the text from a PDF, such as for text mining or searching. There are plenty of more featureful PDF libs out there for doing fancier things, like pdfminer, pypdf2, pymupdf, pikepdf, and probably more since last I checked

You can add this comment in ReadMe as well, for future references.

@ahmed-bhs

@Ekran did you find a way to set the layout argument to true? I tried this:
with open(pdf, "rb") as f:
    pdf = pdftotext.PDF(f, layout=True)

Unfortunately, I got TypeError: 'layout' is an invalid keyword argument for this func

@benjamin-awd

@ahmed-bhs
I think you may want:

pdf = pdftotext.PDF(f, physical=True)

https://poppler.freedesktop.org/api/cpp/classpoppler_1_1page.html
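To summarize how the CLI flags discussed in this thread line up with the keyword arguments mentioned here (`raw` and `physical`), a small hypothetical helper could translate one into the other. The name `layout_kwargs` and the lookup table are illustrative only; flags without a keyword mentioned in this thread, such as `-nodiag`, are rejected rather than guessed at.

```python
# Hypothetical mapping from a pdftotext CLI layout flag to keyword arguments.
# Only the keywords that appear in this thread (raw, physical) are covered.
_FLAG_TO_KWARGS = {
    None: {},                         # neither flag: default reading order
    "-raw": {"raw": True},            # content stream order
    "-layout": {"physical": True},    # physical layout
}


def layout_kwargs(cli_flag=None):
    """Return keyword arguments equivalent to the given CLI layout flag."""
    try:
        # Copy so callers cannot mutate the shared table.
        return dict(_FLAG_TO_KWARGS[cli_flag])
    except KeyError:
        raise ValueError(f"no known keyword equivalent for {cli_flag!r}") from None
```

The result would then be passed along as `pdftotext.PDF(f, **layout_kwargs("-layout"))`, which is the same call as the `physical=True` example above.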

@ahmed-bhs

Yeah exactly, thank you so much @benjamin-awd
