Plumbing pdf results in mixed characters of neighbouring words #764

XuShanJiang · 2022-11-22T11:08:42Z

Describe the bug

The text of the PDF is not extracted correctly. Words are incomplete, and characters of neighbouring words are mixed together.

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("woo-besluit-contacten-rabo-pveu.pdf") as pdf:
    # Print the extracted text of every page in the document
    for page in pdf.pages:
        print(page.extract_text())

PDF file

woo-besluit-contacten-rabo-pveu.pdf


Expected behavior

I expected the text to be extracted correctly. I have imported several similar documents in which the words come out normally, as they appear in the PDF file.

Actual behavior

Some similar PDF files (including this one) are extracted very oddly: the characters of neighbouring words are mixed.
For example, "Pagina 7" is read as "Pag7iv naa7n".

Screenshots

Environment

pdfplumber version: 0.7.5
Python version: 3.8
OS: Linux

Additional context

I tried to copy the whole pdf and paste it in a text editor manually, which works totally fine...


jsvine · 2022-11-22T17:56:17Z

Hi @XuShanJiang, have you tried adjusting the y_tolerance setting described in the .extract_text(...) documentation?

jsvine · 2022-12-15T14:57:01Z

Hi @XuShanJiang, just checking back on this.

XuShanJiang · 2023-01-05T12:48:08Z

Hi @jsvine ,
I tried adjusting y_tolerance and experimented with different values. The text does change, but not in the correct way.

jsvine · 2023-01-10T09:04:07Z

Thank you for letting me know. A few observations:

This is a rasterized PDF whose text has been OCR'ed. That is: the text is not the original digital text, but rather another piece of software's attempt to recreate it. These types of PDFs are generally harder to work with in pdfplumber because they lack a lot of the important original information.
Moreover, an examination of the character positioning via pdfplumber's visual debugging indicates that the OCR software has positioned the text in an unusual way, one that creates overlaps that explain the results you're getting.
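The effect of overlapping character positions can be illustrated with a toy model. This is only a sketch, not pdfplumber's actual algorithm: the character dicts, their x0 values, and the two helper functions are made up for illustration. It shows how ordering characters by horizontal position interleaves two words whose boxes overlap, while reading them in file order keeps each word intact.

```python
# Toy model: each character has its text and left edge (x0), listed in the
# order it appears in the file. The coordinates are invented for illustration.
chars_in_file_order = [
    {"text": "a", "x0": 0.0},
    {"text": "b", "x0": 10.0},
    {"text": "c", "x0": 8.0},   # the second word starts before "b" does
    {"text": "d", "x0": 18.0},
]


def join_by_position(chars):
    """Join characters after sorting them by their left edge."""
    return "".join(c["text"] for c in sorted(chars, key=lambda c: c["x0"]))


def join_by_file_order(chars):
    """Join characters in the order they were stored in the file."""
    return "".join(c["text"] for c in chars)


print(join_by_position(chars_in_file_order))    # acbd
print(join_by_file_order(chars_in_file_order))  # abcd
```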

That said, if you use these settings, I believe you'll get what you're looking for: page.extract_text(layout=True, use_text_flow=True). (use_text_flow tells the layout engine to use the characters in the sequence they are provided in the file, rather than ordering them by their x/y position.) Does that work for you?
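Applied to the reproduction code from the original report, the suggested settings might look like the sketch below. The helper names extract_page_text and extract_pdf_text are hypothetical, chosen only for this example; the extract_text arguments are the ones quoted above.

```python
def extract_page_text(page):
    """Extract text using the characters' order in the file, not their position."""
    return page.extract_text(layout=True, use_text_flow=True)


def extract_pdf_text(pdf_path):
    """Return one extracted string per page of the PDF at pdf_path."""
    # Imported here so the helper above can be used with any page-like object.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [extract_page_text(page) for page in pdf.pages]
```

Calling extract_pdf_text("woo-besluit-contacten-rabo-pveu.pdf") would then return the text page by page. Whether the result is fully correct still depends on the order in which the OCR software wrote the characters.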

Rustemhak · 2023-01-27T06:37:17Z

Hello @jsvine, thank you for the solution. Adding these options works for extract_text. How can I use the same options with extract_tables?

jsvine · 2023-02-01T00:08:18Z

Right now, that's not possible with pdfplumber, but adding that feature sounds like a good idea.

For the specific PDF discussed above, however, I don't think it'd work, due to the character-positioning issues. (I.e., many characters that should be inside a particular table cell are not.)

jsvine · 2023-02-16T13:42:35Z

@Rustemhak, in the latest version of pdfplumber (v0.8.0), you can now pass all .extract_text(...) arguments to .extract_tables(...), prefixing them with text_. So { "text_use_text_flow": True }.
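A minimal sketch of that usage, assuming pdfplumber v0.8.0 or later: the setting is passed through the table_settings argument of .extract_tables(...). The helper names extract_page_tables and extract_pdf_tables are hypothetical.

```python
def extract_page_tables(page):
    """Extract tables, reading cell text in the characters' file order."""
    # "text_" + "use_text_flow": the prefixed form of the extract_text argument.
    table_settings = {"text_use_text_flow": True}
    return page.extract_tables(table_settings=table_settings)


def extract_pdf_tables(pdf_path):
    """Return the list of extracted tables for each page of the PDF."""
    # Imported here so the helper above can be used with any page-like object.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [extract_page_tables(page) for page in pdf.pages]
```

As noted earlier in the thread, this may still not help for the specific PDF discussed here, because many characters lie outside the table cells they belong to.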

XuShanJiang added the bug label Nov 22, 2022

jsvine self-assigned this Jan 10, 2023

jsvine closed this as completed Feb 1, 2023

jsvine mentioned this issue Feb 16, 2023

pdfplumber extracting wrong text from pdf #815

Closed

cmdlineluser mentioned this issue Jun 29, 2023

Incorrect extraction in tables with overlapping columns #912

Open
