extracted word is broken #964

jnhyperion · 2023-08-10T08:04:48Z

Code to reproduce the problem

page.extract_text()

PDF file

example.pdf

Expected behavior

extracted line:

VLHDU8SHRR Homeowner Discount .....

Actual behavior

VLHDU8SHRR H o m e o w ner Discount .....

Environment

pdfplumber version: [0.10.2]
Python version: [e.g., 3.8.1]
OS: [e.g., Mac]

Additional context

Passing use_text_flow=True avoids this bug, but it causes other problems with the extracted text order, for example:

expected: foo: bar
actual: bar foo:

jsvine · 2023-08-10T15:48:45Z

Hi @jnhyperion, and thanks for your interest in this library. Have you tried adjusting the x_tolerance parameter? (See this section of the README.md for more detail.)

jnhyperion · 2023-08-11T06:30:37Z

@jsvine I've already tried that, with values ranging from 0.001 to 3000, and it doesn't work in my case.

page.extract_text(x_tolerance=0.001)
page.extract_text(x_tolerance=3000)

jsvine · 2023-08-11T16:20:39Z

Thanks for clarifying. It appears that the issue stems from the PDF including extraneous whitespace characters, in particular a long string of them that overlap with the "Homeowner" text:

import pdfplumber

pdf = pdfplumber.open("./example.pdf")
page = pdf.pages[0]
im = page.to_image()

whitespace_chars = [c for c in page.chars if c["text"] == " "]
im.reset().draw_rects(whitespace_chars)
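To see why stray space characters can split a word no matter what x_tolerance is, here is a minimal toy model. It is only a sketch: `naive_line_text` and the coordinates are made up for illustration and are not pdfplumber's actual implementation. It assumes characters are read left to right by `x0` and that a space is added only where the gap between neighbours exceeds `x_tolerance`.

```python
def naive_line_text(chars, x_tolerance=3):
    """Toy model: read chars left to right, adding a space at wide gaps."""
    ordered = sorted(chars, key=lambda c: c["x0"])
    pieces = []
    prev = None
    for char in ordered:
        # A gap wider than x_tolerance is treated as a word break.
        if prev is not None and char["x0"] - prev["x1"] > x_tolerance:
            pieces.append(" ")
        pieces.append(char["text"])
        prev = char
    return "".join(pieces)


# Hand-made coordinates: four adjacent letters, plus two stray space
# characters whose boxes overlap the letters.
chars = [
    {"text": "H", "x0": 0, "x1": 6},
    {"text": "o", "x0": 6, "x1": 12},
    {"text": "m", "x0": 12, "x1": 18},
    {"text": "e", "x0": 18, "x1": 24},
    {"text": " ", "x0": 3, "x1": 6},
    {"text": " ", "x0": 9, "x1": 12},
]

print(naive_line_text(chars))                    # H o me
print(naive_line_text(chars, x_tolerance=3000))  # H o me
```

In this model the tolerance only controls where extra spaces are added. It cannot remove space characters that are already present in the character stream, which matches the observation that no x_tolerance value helped.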

To resolve this, you'll want to filter out those whitespace characters (something that pdfplumber does not do automatically, because many PDFs need them for correct extraction):

filtered = page.filter(lambda obj: obj.get("text") != " ")
print(filtered.extract_text(x_tolerance=1))

Returns what I think you want (although perhaps you also want layout=True?):

Policy Number: VLHDU8SHRR
primary driver
Page 3 of 3
Premium discounts
Policy
VLHDU8SHRR Homeowner Discount, PaperLess Discount, E-Signature Discount, Online Quote
Discount, Continuous Insurance Discount, Automatic Card Payments Discount,
Advance Quote Factor
Failure to pay renewal premium
If you do not pay the minimum amount due on or before the due date, your coverage will end on 02/09/2024.
However, if your payment is received or postmarked by 02/10/2024, we will renew your policy with a lapse in
coverage. Your coverage will be renewed the day after your payment is received or postmarked.
Form WI005 (02/22)
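A note on the filter predicate used above: it calls `obj.get("text")` rather than `obj["text"]`, so objects without a "text" key (lines, rects and so on) pass through untouched. The sketch below checks that behaviour on hand-written stand-in dicts, not real pdfplumber objects. `keep_unless_whitespace_char` is a hypothetical stricter variant, not something from the library, that also drops other whitespace such as non-breaking spaces. Whether that is desirable depends on the PDF.

```python
def keep_object(obj):
    """Same predicate as in the page.filter(...) call above."""
    return obj.get("text") != " "


def keep_unless_whitespace_char(obj):
    """Hypothetical variant: drop any object whose text is only whitespace."""
    text = obj.get("text")
    return not (isinstance(text, str) and text.isspace())


# Stand-in objects for illustration only.
objects = [
    {"object_type": "char", "text": "H"},
    {"object_type": "char", "text": " "},
    {"object_type": "char", "text": "\u00a0"},    # non-breaking space
    {"object_type": "line", "x0": 0, "x1": 100},  # no "text" key
]

print([o["object_type"] for o in objects if keep_object(o)])
# ['char', 'char', 'line']
print([o["object_type"] for o in objects if keep_unless_whitespace_char(o)])
# ['char', 'line']
```

Either function could be passed to `page.filter(...)` in place of the lambda.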

jnhyperion · 2023-08-14T02:14:00Z

After applying this solution to other PDF files, another issue occurs.

Test code:

import difflib
import os

import pdfplumber

example_pdf = "example.pdf"
page_texts1 = []  # whitespace chars filtered out, x_tolerance=1
page_texts2 = []  # default extraction

with pdfplumber.open(example_pdf) as pdf:
    for page in pdf.pages:
        filtered = page.filter(lambda obj: obj.get("text") != " ")
        page_texts1.append(filtered.extract_text(x_tolerance=1))
        page_texts2.append(page.extract_text())

c1 = "\n".join(page_texts1)
c2 = "\n".join(page_texts2)

# Build a side-by-side HTML diff of the two extractions
report = difflib.HtmlDiff().make_file(c1.splitlines(), c2.splitlines())
with open("report.html", "w") as f:
    f.write(report)

# Open the report only after the file has been closed (macOS `open` command)
os.system("open report.html")

the output:

So adjusting x_tolerance probably will not solve this issue completely: the right x_tolerance can differ between PDF files, and even between lines in the same file.

jsvine · 2023-08-14T12:59:40Z

> So adjusting x_tolerance probably will not solve this issue completely: the right x_tolerance can differ between PDF files, and even between lines in the same file.

Yes, this is certainly the case, as PDFs themselves are quite varied and designed in an enormous range of styles, layouts, etc.

The core functions of pdfplumber are not meant to provide a universal solution for all PDFs, but rather to give the user control over extraction of individual PDFs, and to keep the logic as simple as possible. That said, pdfplumber should hopefully enable the construction of more complex extraction logic. For instance, depending on your corpus of PDFs, you may first want to group/analyze text by font size, if particularly small text is a concern.
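As a rough sketch of the font-size idea, the helper below groups character dicts by their rounded "size" value. `group_chars_by_size` is a made-up name, and the sample dicts are hand-written stand-ins that carry only the "text" and "size" keys. With a real page, the input would presumably be `page.chars`.

```python
from collections import defaultdict


def group_chars_by_size(chars, ndigits=1):
    """Group char dicts by font size, rounded to `ndigits` decimal places."""
    groups = defaultdict(list)
    for char in chars:
        groups[round(char["size"], ndigits)].append(char)
    return dict(groups)


# Stand-in chars for illustration only.
sample_chars = [
    {"text": "H", "size": 9.0},
    {"text": "o", "size": 9.02},  # tiny size jitter, lands in the 9.0 group
    {"text": " ", "size": 4.5},
    {"text": " ", "size": 4.5},
]

for size, members in sorted(group_chars_by_size(sample_chars).items()):
    print(size, repr("".join(c["text"] for c in members)))
```

Inspecting the groups this way may show, for a given corpus, whether the troublesome characters share a distinctive size that a filter could target.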

It's also possible that, if you're just looking for a universal text-extractor, another tool may solve this problem more directly.

jnhyperion · 2023-08-15T01:42:17Z

I see, thanks for your explanation. At least my PR #965 resolves this issue and does not introduce other issues (though it has only been tested on a few of our PDF files).

jsvine · 2023-08-16T12:30:11Z

Thanks, I'll close this issue, and continue the discussion in the PR.

jnhyperion added the bug label Aug 10, 2023

jnhyperion mentioned this issue Aug 10, 2023 in "fix issue 964" #965 (Open)

jsvine closed this as completed Aug 16, 2023
