Chinese PDF font encoding changed on pass-through (--skip-text), breaking OCR layer #99
The issue is that the encoding of the OCR text layer is changed, so copy and paste is broken. I suspect this is due to one of PyPDF2's many open issues with Unicode. You might be able to get this to work with the argument. The included file 11.PDF is an output file, not an input, so I cannot repeat the test. Please provide the input file if possible.
As documented in the README.
I think test.pdf is the input, and so is 11.pdf.
Never mind, I was mistaken. In that case it appears to me that the input file 11.pdf is not properly encoded, at least not in a way any software I have installed can understand. Since Chinese text is hard to get right in PDFs, it's quite possible that the service that produced this file, Online2pdf.com, does not encode Chinese correctly. You can try to get ocrmypdf to ignore the existing OCR layer and redo OCR with this command.
When done that way, I can copy and paste Chinese characters (although there are some OCR errors).
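The exact command was not preserved in this thread. A plausible invocation, assuming Tesseract's Simplified Chinese language pack (chi_sim) is installed and using ocrmypdf's --force-ocr option to discard the broken text layer, would be:

```shell
# Rasterize every page, throw away the existing (mis-encoded) text layer,
# and run OCR from scratch.
#   --force-ocr  re-OCR even pages that already contain text
#   -l chi_sim   use Tesseract's Simplified Chinese model
# The filenames here are assumptions based on the attachments in this thread.
ocrmypdf --force-ocr -l chi_sim 11.pdf 11_redo.pdf
```

Note that --force-ocr replaces the original text layer entirely, which is what you want when the existing layer is unusable.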
Well done. Without manually reading and editing the source file, how could I tell whether a file is properly encoded or not? Any ideas about that?
11_redo.html
The file is a clear image so you're probably running into the limits of tesseract's OCR accuracy. Results can be improved by training it to recognize the exact font being used, but training tesseract with new fonts is hard to do and tedious. This file looks like it was "born digital" rather than a scanned image. In that case, if you can find a version of the file before it was passed through online2pdf.com, perhaps that will work better.
Two pieces of information told me it is not encoded properly. First, pdffonts reports that the font encoding is "WinAnsi", which cannot encode Chinese characters. Second, the text extracted by copy and paste in Acrobat maps Chinese characters to random ASCII characters. So the PDF contains encoding information and text positioning information, but the encoding is incorrect. It is probably not possible to fix this without using OCR or finding the original digital file. By the way, you might want to use "poppler-utils", a fork of "xpdf" that provides many of the same tools. xpdf is no longer maintained, while poppler-utils is still actively developed.
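The WinAnsi point can be verified directly: PDF's WinAnsiEncoding is essentially Windows code page 1252, a single-byte Latin encoding, so it has no code points at all for CJK characters. A minimal Python sketch (the sample string is an illustration, not taken from the file):

```python
# WinAnsiEncoding in PDF corresponds closely to Windows-1252 (cp1252),
# a single-byte encoding that covers only Latin-script characters.
text = "中文测试"  # "Chinese test" -- any CJK string behaves the same way

try:
    text.encode("cp1252")
    winansi_ok = True
except UnicodeEncodeError:
    # cp1252 has no mapping for these characters, so encoding fails.
    winansi_ok = False

print(winansi_ok)  # -> False: Chinese text cannot survive a WinAnsi round trip
```

This is why a Chinese-language PDF whose fonts report WinAnsi encoding in pdffonts is a red flag: the visible glyphs may render, but the extractable text cannot be correct.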
Thanks very much. I have a Docker image with poppler-utils 0.48.0 and xpdf 3.0.4.
@jbarlow83, recently I tried the above test against the tess4 Docker image.
Hello. I was able to get correct OCR from the ocrmypdf-tess4 image, as far as I can tell anyway. I attached a sample based on your original test file. I did the following:
Maybe you're using an older version of that image? The code snippet shows a local installation of ocrmypdf, not the Docker image.
I just inherit from your image.
What file are you using to test?
11.pdf.zip
I tried removing all the existing ocrmypdf containers and images and re-pulling; it works now.
11.PDF.zip
Using xpdf, it seems the original encoding is lost.