hyphen ignored at end of line #70

Closed
ripspin5 opened this issue Aug 18, 2020 · 10 comments
Comments


ripspin5 commented Aug 18, 2020

I have a PDF file and used the code below to print it to a terminal, but the hyphens at the ends of lines were not included. I created a one-page PDF test file (using qpdf).

My test file is: https://github.com/ripspin5/scripts/blob/master/misc/test1.pdf

Code (Python 3.7):

import pdftotext

# Load your PDF
with open("test1.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

print(pdf[0])

ripspin5 changed the title from "hyphen ignored at end of line" to "hyphen ignored at end of line ("Load your PDF" came after I submitted the issue)" on Aug 18, 2020
Owner

jalan commented Aug 19, 2020

Your test file link gives a 404 for me. Maybe it's in a private repo?

Author

ripspin5 commented Aug 20, 2020 via email

jalan changed the title from "hyphen ignored at end of line ("Load your PDF" came after I submitted the issue)" to "hyphen ignored at end of line" on Aug 21, 2020
Owner

jalan commented Aug 21, 2020

I don't see any attachment.

Author

ripspin5 commented Aug 21, 2020 via email

Owner

jalan commented Aug 23, 2020

There is still no PDF attached. I think you have to attach it directly on GitHub, not via email.

Author

ripspin5 commented Aug 24, 2020 via email

Owner

jalan commented Aug 24, 2020

test1.pdf

Cool, I have now attached it to this GitHub issue. Will investigate later.

Owner

jalan commented Sep 12, 2020

If you use poppler directly to extract the text, it gives the same output as this library does, so I don't think there's anything to fix on my side.

The byte sequence we're outputting for the hyphen and the end of the line is C2 AD 0A, which is UTF-8 for a soft hyphen (U+00AD) followed by a line feed. That seems right, though maybe poppler could be smarter about dealing with soft hyphens at line ends. Whatever you are viewing the text with probably doesn't show soft hyphens. You could add a postprocessing step to convert soft hyphens into regular hyphens, which would then be visible in your output.
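
A minimal sketch of such a postprocessing step, assuming the extracted page text is an ordinary Python string (as pdf[0] is in the code above). The helper names here are hypothetical and are not part of pdftotext or poppler; the sample string only stands in for real page text.

```python
# Hypothetical helpers for postprocessing extracted text; none of these
# names come from pdftotext or poppler.

SOFT_HYPHEN = "\u00ad"  # U+00AD, encoded in UTF-8 as the bytes C2 AD


def count_soft_hyphens(text: str) -> int:
    """Return how many soft hyphens the text contains."""
    return text.count(SOFT_HYPHEN)


def make_hyphens_visible(text: str) -> str:
    """Replace each soft hyphen with a regular, visible hyphen-minus."""
    return text.replace(SOFT_HYPHEN, "-")


def rejoin_broken_words(text: str) -> str:
    """Alternative: drop 'soft hyphen + line feed' so split words are rejoined."""
    return text.replace(SOFT_HYPHEN + "\n", "")


# Stand-in for pdf[0]: a word broken across two lines with a soft hyphen.
page = "the hy\u00ad\nphen at the end"
print(make_hyphens_visible(page))
print(rejoin_broken_words(page))
```

Which of the two replacements fits depends on whether you want to keep the original line layout (make the hyphen visible) or recover the unbroken words (remove the hyphen and the line break).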

Author

ripspin5 commented Sep 15, 2020 via email

Owner

jalan commented Nov 29, 2020

Nothing for me to fix here, so closing.

jalan closed this as completed on Nov 29, 2020