Space regression by PR 1172 #1362

MartinThoma · 2022-09-24T04:23:52Z

I've just noticed that PR #1172 introduced a space regression in text extraction: many spaces that should have been kept are now removed.

Code + PDF

Just standard text extraction:

from PyPDF2 import PdfReader

reader = PdfReader("doc.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

PDFs:

https://arxiv.org/pdf/2201.00029.pdf - here it's very obvious
https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf (German doc) - here it happens mostly with mathematical formulas, which lose the space separating them from the surrounding text. That is a pattern I've seen in many of the other documents as well.
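
A rough way to put a number on this, as a sketch only: compare the share of space characters and the longest whitespace-free run in the text extracted before and after the change. The helper names `space_ratio` and `longest_token` are hypothetical (they are not part of PyPDF2), and the two sample strings are made up to mimic a formula losing its surrounding spaces; in practice they would be the `text` produced by the snippet above under two PyPDF2 versions.

```python
def space_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are plain spaces."""
    if not text:
        return 0.0
    return text.count(" ") / len(text)


def longest_token(text: str) -> int:
    """Return the length of the longest whitespace-free run in `text`."""
    return max((len(token) for token in text.split()), default=0)


# Made-up strings standing in for extract_text() output
# before and after the regression (assumption, not real output).
before = "the set X is open"
after = "the setXis open"

print(space_ratio(before), longest_token(before))
print(space_ratio(after), longest_token(after))
```

A lower space ratio together with longer tokens on the same PDF would point in the direction of the regression described here, though it does not show where the spaces went missing.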


MartinThoma · 2022-09-24T04:24:50Z

@pubpub-zz Would you mind having a look? It's not critical, but you are definitely the expert on that topic :-)

MartinThoma added the workflow-text-extraction label ("From a users perspective, text extraction is the affected feature/workflow") on Sep 24, 2022

MartinThoma added the is-bug label ("From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF") on Sep 24, 2022

MartinThoma added the whitespace label ("While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.") on Feb 28, 2023
