hyphen ignored at end of line #70
Comments
Your test file link gives a 404 for me. Maybe it's in a private repo? |
Yes you are right, sorry.
Here it is attached.
Regards ripspin
On Wed, Aug 19, 2020 at 12:36 PM Jason Alan Palmer wrote:
Your test file link gives a 404 for me. Maybe it's in a private repo?
|
I don't see any attachment. |
Not sure what happened, I may have sent the reply from my gmail account.
Anyway, here is ttest1.pdf attached and sent from my yahoo account.
regards ripspin
On Friday, 21 August 2020, 12:33:03 pm AEST, Jason Alan Palmer <notifications@github.com> wrote:
I don't see any attachment.
|
There is still no PDF attached. I think you have to attach directly in github, not via email. |
I think something must be removing the attachment.
You should be able to download it from this page:
http://users.tpg.com.au/gregoryarthur/
Regards ripspin
On Mon, Aug 24, 2020 at 9:52 AM Jason Alan Palmer wrote:
There is still no PDF attached. I think you have to attach directly in
github, not via email.
|
Cool, I have now attached it to this github issue. Will investigate later. |
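If you use poppler directly to extract the text, it gives the same output as this library does, so I don't think there's anything to fix on my side. The byte sequence we're outputting for the hyphen and the end of the line is C2 AD 0A, which is UTF-8 for "soft hyphen" followed by "line feed", which seems right, but maybe poppler could be smarter about dealing with soft hyphens at line ends. Probably whatever you are viewing the text with doesn't show the soft hyphens. You could add a postprocessing step to convert soft hyphens into regular hyphens, which would then be visible in your output. |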
Thanks for that, I will try out some post processing, sounds like it should fix my particular problem.
Regards ripspin
On Sunday, 13 September 2020, 05:37:32 am AEST, Jason Alan Palmer <notifications@github.com> wrote:
If you use poppler directly to extract the text, it gives the same output as this library does, so I don't think there's anything to fix on my side.
The byte sequence we're outputting for the hyphen and the end of the line is C2 AD 0A, which is UTF8 for "soft hyphen" followed by "line feed", which seems right, but maybe poppler could be smarter about dealing with soft hyphens at line ends. Probably whatever you are viewing the text with doesn't show the soft hyphens. You could add a postprocessing step to convert soft hyphens into regular hyphens, which would then be visible in your output.
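The postprocessing step suggested above could look roughly like the sketch below. It is only an illustration: the function names and the sample string are made up here, and the sample is not taken from the test PDF. It assumes the extracted text is already a Python `str` in which the soft hyphen appears as U+00AD (the character that the bytes C2 AD decode to).

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD; its UTF-8 encoding is the two bytes C2 AD


def show_soft_hyphens(text: str) -> str:
    """Replace every soft hyphen with a visible ASCII hyphen-minus."""
    return text.replace(SOFT_HYPHEN, "-")


def join_soft_hyphenated_lines(text: str) -> str:
    """Alternative: drop a soft hyphen plus line feed so the split word is rejoined."""
    return text.replace(SOFT_HYPHEN + "\n", "")


# Hypothetical extracted text containing the byte sequence C2 AD 0A.
extracted = b"post\xc2\xad\nprocessing step\n".decode("utf-8")

visible = show_soft_hyphens(extracted)
joined = join_soft_hyphenated_lines(extracted)

print(repr(visible))
print(repr(joined))
```

Which variant is appropriate depends on whether the line-end hyphens should be shown as in the PDF or the words should be rejoined for searching.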
|
Nothing for me to fix here, so closing |
I have a PDF file and used the code below to print its text to a terminal, but the hyphens at the ends of lines were not included. I created a one-page PDF test file (using qpdf).
My test file is: https://github.com/ripspin5/scripts/blob/master/misc/test1.pdf
Code: (python3.7)