Plumbing pdf results in mixed characters of neighbouring words #764

Closed
XuShanJiang opened this issue Nov 22, 2022 · 7 comments

@XuShanJiang

Describe the bug

The text of the PDF is not extracted correctly. Words come out incomplete, and characters of neighbouring words are mixed together.

Code to reproduce the problem

    import pdfplumber

    with pdfplumber.open("woo-besluit-contacten-rabo-pveu.pdf") as pdf:
        # Print the extracted text of every page in order
        for page in pdf.pages:
            print(page.extract_text())

PDF file

woo-besluit-contacten-rabo-pveu.pdf

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Expected behavior

I expected the text to be extracted correctly. I have imported several of these documents in which the words come out normally, as they appear in the PDF file.

Actual behavior

Some similar PDF files (including this one) are extracted very strangely: the characters of words are mixed.
For example, Pagina 7 is read as Pag7iv naa7n.

Screenshots

[Screenshot: pdfpu]

Environment

  • pdfplumber version: 0.7.5
  • Python version: 3.8
  • OS: Linux

Additional context

I tried to copy the whole pdf and paste it in a text editor manually, which works totally fine...

@jsvine
Owner

jsvine commented Nov 22, 2022

Hi @XuShanJiang, have you tried adjusting the y_tolerance setting described in the .extract_text(...) documentation?
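
One way to try that suggestion is to extract the same page with several y_tolerance values and compare the output. The following is only a sketch: the helper names and the tolerance values are hypothetical choices for illustration, and `page` is assumed to be a pdfplumber page object.

```python
def extract_with_tolerances(page, tolerances=(1, 3, 5)):
    """Return {y_tolerance: extracted text} for a single page.

    `page` is assumed to be a pdfplumber Page. The helper name and the
    default tolerance values are illustrative, not part of pdfplumber.
    """
    return {tol: page.extract_text(y_tolerance=tol) for tol in tolerances}


def compare_first_page(path, tolerances=(1, 3, 5)):
    """Print the first page's text once per tolerance value."""
    # Third-party dependency, imported here so the helper above has no
    # import-time requirement.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        results = extract_with_tolerances(pdf.pages[0], tolerances)
        for tol, text in results.items():
            print(f"--- y_tolerance={tol} ---")
            print(text)
```

Calling `compare_first_page("woo-besluit-contacten-rabo-pveu.pdf")` would print one block of text per tolerance value, which makes it easier to see whether any setting improves the word boundaries.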

@jsvine
Owner

jsvine commented Dec 15, 2022

Hi @XuShanJiang, just checking back on this.

@XuShanJiang
Author

Hi @jsvine ,
I tried adjusting y_tolerance and experimented with different values. The text does change, but not in the correct way.

jsvine self-assigned this Jan 10, 2023

@jsvine
Owner

jsvine commented Jan 10, 2023

Thank you for letting me know. A few observations:

  • This is a rasterized PDF whose text has been OCR'ed. That is: the text is not the original digital text, but rather another piece of software's attempt to recreate it. These types of PDFs are generally harder to work with in pdfplumber because they lack a lot of the important original information.

  • Moreover, an examination of the character positioning via pdfplumber's visual debugging indicates that the OCR software has positioned the text in an unusual way, one that creates overlaps that explain the results you're getting. E.g.:

[Screenshot: pdfplumber visual-debugging image of the character positions]

  • That said, if you use these settings, I believe you'll get what you're looking for: page.extract_text(layout=True, use_text_flow=True). (use_text_flow tells the layout engine to use the characters in the sequence they are provided in the file, rather than their x/y position.) Does that work for you?
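
Applied to the reproduction script from the original report, those settings might be used as in the sketch below. The two helper names are hypothetical; only the layout=True and use_text_flow=True arguments come from the comment above.

```python
def extract_in_file_order(page):
    """Extract text using the characters' order in the file.

    `page` is assumed to be a pdfplumber Page. use_text_flow=True keeps the
    character sequence stored in the PDF instead of re-sorting by x/y position.
    """
    return page.extract_text(layout=True, use_text_flow=True)


def extract_document(path):
    """Return one extracted string per page of the PDF at `path`."""
    # Third-party dependency, imported here so the helper above has no
    # import-time requirement.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [extract_in_file_order(page) for page in pdf.pages]
```

Whether the result is fully clean will still depend on how the OCR software ordered the characters in the file.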

@Rustemhak

Rustemhak commented Jan 27, 2023

Hello @jsvine, thank you for the solution. Adding these options works for extract_text. How can I use the same options with extract_tables?

@jsvine
Owner

jsvine commented Feb 1, 2023

Right now, that's not possible with pdfplumber, but adding that feature sounds like a good idea.

For the specific PDF discussed above, however, I don't think it'd work, due to the character-positioning issues. (I.e., many characters that should be inside a particular table cell are not.)
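
To check the positioning issue for a particular cell, one could count which characters actually fall inside its bounding box. This is a rough sketch, not how pdfplumber itself assigns characters to cells: the function name is hypothetical and the "center point inside the box" rule is an assumption made for illustration. The character dictionaries are assumed to carry the x0, x1, top and bottom keys found in page.chars.

```python
def chars_inside_bbox(chars, bbox):
    """Return the characters whose center point lies inside `bbox`.

    chars: list of dicts with "x0", "x1", "top", "bottom" keys,
           in the shape of pdfplumber's page.chars entries.
    bbox:  (x0, top, x1, bottom) tuple describing the cell.
    The center-point rule is an assumption made for this sketch.
    """
    x0, top, x1, bottom = bbox
    inside = []
    for ch in chars:
        center_x = (ch["x0"] + ch["x1"]) / 2
        center_y = (ch["top"] + ch["bottom"]) / 2
        if x0 <= center_x <= x1 and top <= center_y <= bottom:
            inside.append(ch)
    return inside
```

Comparing the text of the returned characters with what is visibly printed in the cell would show whether the OCR layer placed characters outside the cell they belong to.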

jsvine closed this as completed Feb 1, 2023

@jsvine
Owner

jsvine commented Feb 16, 2023

@Rustemhak, in the latest version of pdfplumber (v0.8.0), you can now pass all .extract_text(...) arguments to .extract_tables(...), prefixing them with text_. For example: { "text_use_text_flow": True }.
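
A minimal sketch of that usage, assuming pdfplumber v0.8.0 or later and assuming the settings dictionary is passed through the table_settings argument of .extract_tables(...). The constant and helper names are hypothetical.

```python
# Settings dictionary from the comment above; the "text_" prefix forwards
# use_text_flow to the text extraction used inside table cells.
TABLE_SETTINGS = {"text_use_text_flow": True}


def extract_tables_in_file_order(page):
    """Extract all tables from a page with use_text_flow applied to cell text.

    `page` is assumed to be a pdfplumber Page (v0.8.0 or later).
    """
    return page.extract_tables(table_settings=TABLE_SETTINGS)


def extract_all_tables(path):
    """Return a list of tables per page for the PDF at `path`."""
    # Third-party dependency, imported here so the helper above has no
    # import-time requirement.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [extract_tables_in_file_order(page) for page in pdf.pages]
```

As noted earlier in the thread, this is unlikely to fix the specific PDF from the report, since many of its characters are positioned outside the cells they belong to.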
