Running example/node/getinfo.js does not output the correct characters #7712
Comments
Further testing with the xpdf tools:
after editing getinfo.js,
it is still not working.
will output
but
output
It looks like the character encoding information was removed from this PDF: I cannot extract any text using Mac Preview or Adobe Reader either. Nothing can be done here; the PDF must contain proper text encoding information in its fonts. Closing as won't fix.
How did you find out that the character encoding was removed from this file? Can you suggest any workaround I could use to regenerate the PDF?
Mostly by testing the PDF file in other viewers: if even Adobe Reader cannot extract the text, there is a really low chance that any other reader can. Using OCR to recognize glyphs is out of scope for this project.
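For anyone who wants to automate that manual check, here is a minimal sketch, assuming you already have the extracted text as a plain string (for example, the text items that getinfo.js prints, joined together). `suspiciousRatio`, `looksGarbled`, and the 0.3 threshold are hypothetical names and values made up for illustration; they are not part of pdf.js.

```javascript
// Hypothetical helper, not part of pdf.js: estimates the fraction of
// non-whitespace characters that are control characters, private-use
// code points, or the Unicode replacement character (U+FFFD).
function suspiciousRatio(text) {
  let total = 0;
  let suspicious = 0;
  for (const ch of text) {
    // Ordinary whitespace is ignored entirely.
    if (ch === ' ' || ch === '\n' || ch === '\r' || ch === '\t') continue;
    total++;
    const cp = ch.codePointAt(0);
    const isControl = cp < 0x20 || (cp >= 0x7f && cp <= 0x9f);
    const isPrivateUse = cp >= 0xe000 && cp <= 0xf8ff;
    const isReplacement = cp === 0xfffd;
    if (isControl || isPrivateUse || isReplacement) suspicious++;
  }
  return total === 0 ? 0 : suspicious / total;
}

// Assumed threshold: flag the text when more than 30% looks suspicious.
function looksGarbled(text, threshold = 0.3) {
  return suspiciousRatio(text) > threshold;
}
```

This only catches the obvious cases. A PDF with missing encoding information can also produce valid-looking but wrong characters, which this heuristic will not notice, so comparing against another viewer is still the more reliable test.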
You mean right-clicking to copy the text somewhere else and finding that it comes out garbled? So I can say this PDF should be left to an OCR engine to deal with?
@wanghaisheng make sure your PDF documents are published in the PDF/A standard to avoid such issues.
@yurydelendik The world is cruel; all these PDF files come from a client and are out of my control.
I just cloned the latest source and ran gulp dist.
When I run node getinfo.js against the following PDF, it outputs this:
Should I change the encoding afterwards, or is pdf.js simply unable to deal with this PDF?
11.pdf
When debugging with Adobe Reader you see
At first I assumed it was caused by the embedded fonts,
when I tested another PDF with the same content but without embedded fonts
3.pdf
and
广西壮族自治区人民医院检验报告单1.xps.pdf