
[extracting words from table] #914

Open
BaillySylvain opened this issue Oct 25, 2023 · 1 comment


@BaillySylvain

Issue: Words Extracted Too Closely Together in Tables with PDFMiner

Problem Description
When extracting text from a PDF document containing tables with PDFMiner, it seems that the words inside the table are being extracted incorrectly. The words are all closely spaced together, making the extracted text unusable. This negatively affects our ability to properly process the information within these tables.

Steps to Reproduce

  1. Call extract_text_to_fp on the given PDF:
    document.pdf
  2. Observe that the extracted words run together in the XML output:
    image
  3. These run-together words correspond to a table on the page in which the references are indeed printed very close together:
    image

Expected Behavior
We expect PDFMiner to correctly extract text inside tables, preserving the spacing between words, so that the extracted text is usable.

This issue significantly impacts our ability to correctly extract and use data contained within PDF document tables. Any assistance in resolving this problem would be greatly appreciated.

I thank you for the time spent studying this case and hope that a solution can be put in place :)

@NickFabry

A few suggestions -

1 - Don't use an output type of XML; the comments in high_level.py specifically say only 'text' works properly.

2 - If you are extracting text using extract_text_to_fp as a library call, the exact call signature would be helpful to diagnose what is occurring. In particular, what LAParams are you passing to the call? Typically, you would tweak the LAParams for the nature of your document, specifically the narrow spacing in your case. The kind of parameters you can specify in an LAParams object are:

class LAParams:
    """Parameters for layout analysis

    :param line_overlap: If two characters have more overlap than this they
        are considered to be on the same line. The overlap is specified
        relative to the minimum height of both characters.
    :param char_margin: If two characters are closer together than this
        margin they are considered part of the same line. The margin is
        specified relative to the width of the character.
    :param word_margin: If two characters on the same line are further apart
        than this margin then they are considered to be two separate words, and
        an intermediate space will be added for readability. The margin is
        specified relative to the width of the character.
    :param line_margin: If two lines are close together they are
        considered to be part of the same paragraph. The margin is
        specified relative to the height of a line.
    :param boxes_flow: Specifies how much a horizontal and vertical position
        of a text matters when determining the order of text boxes. The value
        should be within the range of -1.0 (only horizontal position
        matters) to +1.0 (only vertical position matters). You can also pass
        `None` to disable advanced layout analysis, and instead return text
        based on the position of the bottom left corner of the text box.
    :param detect_vertical: If vertical text should be considered during
        layout analysis
    :param all_texts: If layout analysis should be performed on text in
        figures.
    """

3 - If you are uncomfortable with the above, pdfminer comes with a CLI tool, pdf2txt.py, which you can call on the file whose text you wish to extract. It accepts parameters such as --word-margin, which controls when pdfminer interprets a physical gap between characters as a whitespace character.
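
To make suggestion 2 concrete, here is a minimal sketch of a library call that passes explicit LAParams and asks for 'text' output. The helper name extract_with_margins and the margin values 0.05 and 1.0 are assumptions for illustration only (the library defaults are word_margin=0.1 and char_margin=2.0); the right values for your document have to be found by experiment.

```python
from io import StringIO


def extract_with_margins(pdf_path, word_margin=0.05, char_margin=1.0):
    """Return the text of pdf_path, extracted with tighter-than-default margins.

    The default values here are illustrative guesses, not recommendations.
    """
    # Imported inside the function so this module still loads where
    # pdfminer.six is not installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    # A smaller word_margin inserts spaces between characters more readily;
    # a smaller char_margin splits closely spaced table cells into separate lines.
    laparams = LAParams(word_margin=word_margin, char_margin=char_margin)

    buffer = StringIO()
    with open(pdf_path, "rb") as pdf_file:
        # output_type="text" rather than "xml", per suggestion 1.
        extract_text_to_fp(pdf_file, buffer, laparams=laparams, output_type="text")
    return buffer.getvalue()
```

Calling extract_with_margins("document.pdf") and comparing the result against a run with the default margins should show whether the table words separate.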

In general, when dealing with idiosyncratic documents, you simply have to try different input parameters until you find the combination that extracts the text from your particular document in a sensible way.
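
That trial and error can be automated to some degree. The sketch below tries every combination of char_margin and word_margin and ranks them. The function name sweep_margins is hypothetical, and ranking by the number of whitespace-separated tokens is only a rough heuristic I am assuming here: more tokens suggests fewer run-together words, but you still need to read the top results yourself.

```python
from itertools import product


def sweep_margins(extract, char_margins, word_margins):
    """Try every (char_margin, word_margin) pair and rank them by token count.

    `extract` is any callable that accepts char_margin and word_margin as
    keyword arguments and returns the extracted text as a string.
    """
    results = []
    for char_margin, word_margin in product(char_margins, word_margins):
        text = extract(char_margin=char_margin, word_margin=word_margin)
        # Heuristic (an assumption): count whitespace-separated tokens.
        results.append((len(text.split()), char_margin, word_margin))
    # Highest token count first.
    return sorted(results, reverse=True)
```

With pdfminer installed, extract could be something like lambda **kw: extract_text("document.pdf", laparams=LAParams(**kw)), using extract_text from pdfminer.high_level and LAParams from pdfminer.layout.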
