extracted word is broken #964

Closed
jnhyperion opened this issue Aug 10, 2023 · 7 comments · May be fixed by #965

@jnhyperion

Code to reproduce the problem

page.extract_text()

PDF file

example.pdf

Expected behavior

extracted line:

VLHDU8SHRR Homeowner Discount .....

Actual behavior

VLHDU8SHRR H o m e o w ner Discount .....

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

  • pdfplumber version: [0.10.2]
  • Python version: [e.g., 3.8.1]
  • OS: [e.g., Mac]

Additional context

Passing use_text_flow=True avoids this bug, but it causes other problems with the order of the extracted text, such as:

expected: foo: bar
actual: bar foo:
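
One plausible reading of the reordering above (an assumption, not something confirmed in this thread) is that the PDF stores "bar" before "foo:" in its content stream even though "foo:" is drawn further left. The sketch below is a toy model with invented sample data and hypothetical helper names; it is not pdfplumber's implementation. It only contrasts position-based ordering with stream-order reading:

```python
# Toy model: each word has a horizontal position (x0) and the index at which
# it appears in the PDF content stream. All values are invented for illustration.
words = [
    {"text": "bar", "x0": 60.0, "stream_index": 0},
    {"text": "foo:", "x0": 10.0, "stream_index": 1},
]


def join_by_position(words):
    """Order words left to right by their x0 coordinate."""
    return " ".join(w["text"] for w in sorted(words, key=lambda w: w["x0"]))


def join_by_stream(words):
    """Keep the order in which the words appear in the content stream."""
    return " ".join(
        w["text"] for w in sorted(words, key=lambda w: w["stream_index"])
    )


print(join_by_position(words))  # foo: bar
print(join_by_stream(words))    # bar foo:
```

If this is what happens in the affected files, neither ordering is right for every line, which matches the trade-off described above.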

@jsvine
Owner

jsvine commented Aug 10, 2023

Hi @jnhyperion, and thanks for your interest in this library. Have you tried adjusting the x_tolerance parameter? (See this section of the README.md for more detail.)

@jnhyperion
Author

@jsvine I've already tried that (with values ranging from 0.001 to 3000), and it doesn't work in my case.

page.extract_text(x_tolerance=0.001)
page.extract_text(x_tolerance=3000)

@jsvine
Owner

jsvine commented Aug 11, 2023

Thanks for clarifying. It appears that the issue stems from the PDF including extraneous whitespace characters, in particular a long string of them that overlap with the "Homeowner" text:

import pdfplumber

pdf = pdfplumber.open("./example.pdf")
page = pdf.pages[0]
im = page.to_image()

whitespace_chars = [ c for c in page.chars if c["text"] == " " ]
im.reset().draw_rects(whitespace_chars)

[Image: page rendering with the whitespace characters outlined as rectangles]
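
The overlap can also be checked numerically rather than visually. The following is a minimal sketch, assuming character dicts shaped like the entries of page.chars (with "text", "x0", "x1", "top" and "bottom" keys); the helper names and the sample data are hypothetical:

```python
def overlaps(a, b):
    """True if two character boxes intersect both horizontally and vertically."""
    return (
        a["x0"] < b["x1"] and b["x0"] < a["x1"]
        and a["top"] < b["bottom"] and b["top"] < a["bottom"]
    )


def overlapping_whitespace(chars):
    """Return whitespace chars whose box intersects a non-whitespace char."""
    spaces = [c for c in chars if c["text"].isspace()]
    glyphs = [c for c in chars if not c["text"].isspace()]
    return [s for s in spaces if any(overlaps(s, g) for g in glyphs)]


# Invented sample data in the shape of pdfplumber's char dicts.
sample_chars = [
    {"text": "H", "x0": 100.0, "x1": 107.0, "top": 50.0, "bottom": 60.0},
    {"text": "o", "x0": 107.0, "x1": 112.0, "top": 50.0, "bottom": 60.0},
    # This space sits on top of "H" and "o".
    {"text": " ", "x0": 105.0, "x1": 108.0, "top": 50.0, "bottom": 60.0},
    # This space is well clear of any glyph.
    {"text": " ", "x0": 200.0, "x1": 203.0, "top": 50.0, "bottom": 60.0},
]

suspicious = overlapping_whitespace(sample_chars)
print(len(suspicious))  # 1
```

On a real page, calling overlapping_whitespace(page.chars) would list the candidates; the pairwise comparison is quadratic, which is fine for a single page but not tuned for large documents.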

To resolve this, you'll want to filter out those whitespace characters (something that pdfplumber does not do automatically, because many PDFs need them for correct extraction):

filtered = page.filter(lambda obj: obj.get("text") != " ")
print(filtered.extract_text(x_tolerance=1))

Returns what I think you want (although perhaps you also want layout=True?):

Policy Number: VLHDU8SHRR
primary driver
Page 3 of 3
Premium discounts
Policy
VLHDU8SHRR Homeowner Discount, PaperLess Discount, E-Signature Discount, Online Quote
Discount, Continuous Insurance Discount, Automatic Card Payments Discount,
Advance Quote Factor
Failure to pay renewal premium
If you do not pay the minimum amount due on or before the due date, your coverage will end on 02/09/2024.
However, if your payment is received or postmarked by 02/10/2024, we will renew your policy with a lapse in
coverage. Your coverage will be renewed the day after your payment is received or postmarked.
Form WI005 (02/22)
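
To see why no x_tolerance value helped while filtering did, here is a simplified model of position-based word grouping. It is a sketch under stated assumptions, not pdfplumber's actual code: group_words and make_chars are hypothetical helpers, and the coordinates are invented so that stray spaces fall inside the word "Homeowner".

```python
def group_words(chars, x_tolerance=3.0):
    """Toy word grouper: sort chars left to right, then start a new word at
    any whitespace char or at any horizontal gap wider than x_tolerance."""
    words, current, last_x1 = [], "", None
    for c in sorted(chars, key=lambda c: c["x0"]):
        if c["text"].isspace():
            # A whitespace char always ends the current word.
            if current:
                words.append(current)
            current, last_x1 = "", None
            continue
        if current and last_x1 is not None and c["x0"] - last_x1 > x_tolerance:
            words.append(current)
            current = ""
        current += c["text"]
        last_x1 = c["x1"]
    if current:
        words.append(current)
    return " ".join(words)


def make_chars(text, start=100.0, width=5.0):
    """Lay out the characters of `text` side by side with no gaps."""
    return [
        {"text": ch, "x0": start + i * width, "x1": start + (i + 1) * width}
        for i, ch in enumerate(text)
    ]


glyphs = make_chars("Homeowner")
# Six stray spaces whose x0 falls inside the first six letters.
stray = [{"text": " ", "x0": 104.0 + 5 * i, "x1": 105.0 + 5 * i} for i in range(6)]

print(group_words(glyphs + stray, x_tolerance=0.001))  # H o m e o w ner
print(group_words(glyphs + stray, x_tolerance=3000))   # H o m e o w ner

# Dropping the whitespace chars first restores the word.
kept = [c for c in glyphs + stray if c["text"] != " "]
print(group_words(kept, x_tolerance=1))                # Homeowner
```

In this model the tolerance only governs gaps between consecutive glyphs, so it has no effect on a split caused by a whitespace character that sorts between two letters.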

@jnhyperion
Author

jnhyperion commented Aug 14, 2023

After applying this solution to other PDF files, another issue occurs:

test code:

import difflib
import os

import pdfplumber

example_pdf = "example.pdf"
page_texts1 = []  # text extracted after filtering out whitespace chars
page_texts2 = []  # text extracted with the default settings

# Open the PDF once and run both extractions on each page
with pdfplumber.open(example_pdf) as pdf:
    for page in pdf.pages:
        filtered = page.filter(lambda obj: obj.get("text") != " ")
        page_texts1.append(filtered.extract_text(x_tolerance=1))
        page_texts2.append(page.extract_text())

c1 = "\n".join(page_texts1)
c2 = "\n".join(page_texts2)

# Build a side-by-side HTML diff of the two extractions
report = difflib.HtmlDiff().make_file(c1.splitlines(), c2.splitlines())
with open("report.html", "w") as f:
    f.write(report)

# Open the report only after the file has been closed (macOS "open" command)
os.system("open report.html")

the output:

[Images: two screenshots of the generated diff report]

So adjusting x_tolerance probably will not solve this issue completely, because the right x_tolerance can differ from one PDF file to another, and even between lines within the same file.

@jsvine
Owner

jsvine commented Aug 14, 2023

So adjusting x_tolerance probably will not solve this issue completely, because the right x_tolerance can differ from one PDF file to another, and even between lines within the same file.

Yes, this is certainly the case, as PDFs themselves are quite varied and designed in an enormous range of styles, layouts, etc.

The core functions of pdfplumber are not meant to provide a universal solution for all PDFs, but rather to give the user control over extraction of individual PDFs, and to keep the logic as simple as possible. That said, pdfplumber should hopefully enable the construction of more complex extraction logic. For instance, depending on your corpus of PDFs, you may first want to group/analyze text by font size, if particularly small text is a concern.
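
As a rough illustration of the font-size idea, here is a minimal sketch that buckets characters by their "size" value, a key that pdfplumber's char dicts carry. The helper group_chars_by_size, its rounding choice, and the sample data are assumptions made for this example:

```python
from collections import defaultdict


def group_chars_by_size(chars, ndigits=1):
    """Bucket chars by font size, rounded so near-identical sizes group together."""
    groups = defaultdict(list)
    for c in chars:
        groups[round(c["size"], ndigits)].append(c)
    return dict(groups)


# Invented sample data; real input would be page.chars.
sample_chars = [
    {"text": "T", "size": 12.0},
    {"text": "i", "size": 12.0},
    {"text": "x", "size": 4.5},
    {"text": "y", "size": 4.52},
]

by_size = group_chars_by_size(sample_chars)
for size in sorted(by_size, reverse=True):
    print(size, "".join(c["text"] for c in by_size[size]))
```

Inspecting the buckets for a few representative pages would show whether unusually small text exists and which size threshold, if any, is worth handling separately.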

It's also possible that, if you're just looking for a universal text-extractor, another tool may solve this problem more directly.

@jnhyperion
Author

I see, thanks for your explanation. At least my PR #965 resolves this issue without introducing other issues (tested only on a few of our PDF files).

@jsvine
Owner

jsvine commented Aug 16, 2023

Thanks, I'll close this issue, and continue the discussion in the PR.

@jsvine jsvine closed this as completed Aug 16, 2023