Space regression by PR 1172 #1362

Open
MartinThoma opened this issue Sep 24, 2022 · 1 comment
Labels
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

MartinThoma commented Sep 24, 2022

I've just noticed that PR #1172 introduced a space regression in text extraction: many spaces that should have been kept are now removed from the extracted text.

Code + PDF

Just standard text extraction:

from PyPDF2 import PdfReader

reader = PdfReader("doc.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

PDFs:

See https://arxiv.org/pdf/2201.00029.pdf (an image was attached here in the original issue).
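To put a number on "a lot of spaces got removed", one option is to save the extracted text once before and once after the change and compare simple whitespace counts. The following is only a minimal sketch under that assumption; `whitespace_stats` and `compare_extractions` are hypothetical helper names, not part of PyPDF2, and the two strings are expected to come from `page.extract_text()` runs like the one above.

```python
def whitespace_stats(text: str) -> dict:
    """Count spaces, newlines, and whitespace-separated words in a string."""
    return {
        "spaces": text.count(" "),
        "newlines": text.count("\n"),
        "words": len(text.split()),
    }


def compare_extractions(before: str, after: str) -> dict:
    """Return the per-metric difference (after minus before).

    Negative "spaces" and "words" values suggest that spaces were lost
    and neighbouring words were merged in the newer extraction.
    """
    stats_before = whitespace_stats(before)
    stats_after = whitespace_stats(after)
    return {key: stats_after[key] - stats_before[key] for key in stats_before}


if __name__ == "__main__":
    # Hypothetical sample strings standing in for two extraction runs.
    old_text = "Hello world foo\nbar"
    new_text = "Helloworld foo\nbar"
    print(compare_extractions(old_text, new_text))
```

A drop in both the space count and the word count between the two runs would be consistent with the regression described here; a drop in spaces alone would only mean that runs of repeated spaces were collapsed.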

MartinThoma added the workflow-text-extraction label on Sep 24, 2022
MartinThoma (Member, Author) commented

@pubpub-zz Would you mind having a look? It's not critical, but you are definitely the expert on that topic :-)

MartinThoma added the is-bug label on Sep 24, 2022
MartinThoma added the whitespace label on Feb 28, 2023