Chinese PDF font encoding changed on pass-through (--skip-text), breaking OCR layer #99

Closed
wanghaisheng opened this issue Oct 12, 2016 · 11 comments

@wanghaisheng

➜  test-data git:(master) ✗ docker tag jbarlow83/ocrmypdf-polyglot ocrmypdf
➜  test-data git:(master) ✗ docker run -v "$(pwd):/home/docker"   ocrmypdf  --skip-text 11.pdf 11-output.pdf      
   INFO -    1: page has no images - skipping all processing on this page
GPL Ghostscript 9.19: PDFA doesn't allow images with Interpolate true.
GPL Ghostscript 9.19: PDFA doesn't allow images with Interpolate true.
   INFO - Output file is a PDF/A-2B (as expected)
➜  test-data git:(master) ✗ docker run -v "$(pwd):/home/docker"   ocrmypdf   11.pdf 11-output.pdf 
   INFO -    1: page has no images - skipping all processing on this page
GPL Ghostscript 9.19: PDFA doesn't allow images with Interpolate true.
GPL Ghostscript 9.19: PDFA doesn't allow images with Interpolate true.
   INFO - Output file is a PDF/A-2B (as expected)

11.PDF.zip

Checking with xpdf's pdffonts, it seems the original font encoding is lost in the output file:

root@6334724bdee5:/tmp# pdffonts 11-output.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
GPSULR+SimSun                        CID TrueType      Custom           yes yes yes      9  0
root@6334724bdee5:/tmp# pdffonts 11.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SRPUEP+SimSun                        TrueType          WinAnsi          yes yes yes     13  0
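
The two `pdffonts` listings above differ in the type and encoding columns (TrueType/WinAnsi in the input, CID TrueType/Custom in the output). A minimal sketch for comparing them in a script, assuming the `pdffonts` tool is on the PATH and that font names contain no spaces; `parse_pdffonts` and `font_encodings` are hypothetical helper names for illustration, not part of xpdf or ocrmypdf:

```python
import subprocess


def parse_pdffonts(output):
    """Parse the text table printed by `pdffonts` into a list of dicts.

    Rows are split from the right, because the "type" column may contain
    a space (for example "CID TrueType"). Assumes font names have no spaces.
    """
    lines = output.splitlines()
    # Data rows start after the dashed ruler line under the header.
    start = None
    for i, line in enumerate(lines):
        if line.startswith("----"):
            start = i + 1
            break
    if start is None:
        return []

    fonts = []
    for line in lines[start:]:
        tokens = line.split()
        # name, type (1+ tokens), encoding, emb, sub, uni, object number, generation
        if len(tokens) < 8:
            continue
        if not (tokens[-1].isdigit() and tokens[-2].isdigit()):
            continue
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-6]),
            "encoding": tokens[-6],
            "emb": tokens[-5] == "yes",
            "sub": tokens[-4] == "yes",
            "uni": tokens[-3] == "yes",
            "object_id": (int(tokens[-2]), int(tokens[-1])),
        })
    return fonts


def font_encodings(pdf_path):
    """Run `pdffonts` on a file and return (type, encoding) pairs."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return [(f["type"], f["encoding"]) for f in parse_pdffonts(result.stdout)]
```

Comparing `font_encodings("11.pdf")` with `font_encodings("11-output.pdf")` would show whether a processing step rewrote the fonts. A changed encoding is only a hint that something was re-encoded, not proof that the text layer is broken.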
jbarlow83 changed the title from "does work though" to "Chinese PDF font encoding changed on pass-through (--skip-text), breaking OCR layer" on Oct 12, 2016
@jbarlow83
Collaborator

jbarlow83 commented Oct 12, 2016

The issue is that the encoding of the OCR text layer is changed so copy-paste is broken.

I suspect this is due to one of PyPDF2's many open issues with unicode.

You might be able to get this to work with the argument --pdf-renderer tesseract.

The included file 11.PDF is an output file, not an input, so I cannot repeat the test. Please provide the input file if possible.

@wanghaisheng
Author

wanghaisheng commented Oct 12, 2016

As documented in the README:

docker run -v "$(pwd):/home/docker"   ocrmypdf --skip-text test.pdf output.pdf

I think test.pdf there is the input file, and so is 11.pdf here.

➜  test-data git:(master) ✗ docker run -v "$(pwd):/home/docker"   ocrmypdf  --pdf-renderer tesseract 11.pdf 11-ocr-output.pdf 

   INFO -    1: page has no images - skipping all processing on this page
GPL Ghostscript 9.19: PDFA doesn't allow images with Interpolate true.
GPL Ghostscript 9.19: PDFA doesn't allow images with Interpolate true.
   INFO - Output file is a PDF/A-2B (as expected)
➜  test-data git:(master) ✗ 

@jbarlow83
Collaborator

Never mind, I was mistaken. In that case it appears to me that the input file 11.pdf is not properly encoded, at least not in a way any software I have installed can understand. Since Chinese text is hard to get right in PDFs it's quite possible that the service that produced this file, Online2pdf.com, does not encode Chinese correctly.

You can try to get ocrmypdf to ignore the existing OCR layer and redo OCR with this command:

docker run -v "$(pwd):/home/docker"  ocrmypdf --output-type pdf --pdf-renderer tesseract --force-ocr -l chi_sim 11.pdf 11_redo.pdf

When done that way, I can copy and paste Chinese characters (although there are some OCR errors).

@wanghaisheng
Author

wanghaisheng commented Oct 12, 2016

Well done.
When I use pdfminer to extract text from 11_redo.pdf, the result is quite promising. It seems what I need to do now is make Tesseract work better.
One last question:

"in that case it appears to me that the input file 11.pdf is not properly encoded, at least not in a way any software I have installed can understand."

Without manually reading and editing the source file, how can I tell whether a PDF is properly encoded or not? Any ideas?
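
One possible heuristic, offered as a rough sketch and not a definitive test: extract the text layer first (for example with pdfminer's pdf2txt.py), then measure how much of it looks like debris. The code below assumes that unmapped glyphs show up as `(cid:N)` markers, private-use code points, or U+FFFD, and that a Chinese document should contain a reasonable share of CJK ideographs. The function names and the thresholds (0.05 and 0.2) are assumptions chosen for illustration and would need tuning on real files.

```python
import re
import unicodedata

# pdfminer writes "(cid:N)" when a glyph has no Unicode mapping.
CID_MARKER = re.compile(r"\(cid:\d+\)")


def text_quality(text):
    """Return the share of suspicious and CJK characters in extracted text."""
    cid_hits = len(CID_MARKER.findall(text))
    stripped = CID_MARKER.sub("", text)
    chars = [c for c in stripped if not c.isspace()]
    total = len(chars) + cid_hits
    if total == 0:
        return {"total": 0, "suspicious": 0.0, "cjk": 0.0}

    suspicious = cid_hits
    cjk = 0
    for c in chars:
        # Co = private use, Cn = unassigned, Cc = control characters.
        if c == "\ufffd" or unicodedata.category(c) in ("Co", "Cn", "Cc"):
            suspicious += 1
        elif "\u4e00" <= c <= "\u9fff":  # CJK Unified Ideographs block
            cjk += 1
    return {
        "total": total,
        "suspicious": suspicious / total,
        "cjk": cjk / total,
    }


def looks_misencoded(text, expect_cjk=True, max_suspicious=0.05, min_cjk=0.2):
    """Guess whether an extracted text layer is unusable (heuristic only)."""
    quality = text_quality(text)
    if quality["total"] == 0:
        return True  # no text layer at all
    if quality["suspicious"] > max_suspicious:
        return True
    return expect_cjk and quality["cjk"] < min_cjk
```

Files flagged by `looks_misencoded` could then be sent through the `--force-ocr` command shown earlier, while the rest keep their existing text layer. The check cannot tell valid but wrong characters from correct ones, so spot checks by eye are still needed.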

root@30c7a47055f5:/tmp# pdf2txt.py -o test-data/11_redo.html -Y exact test-data/11_redo.pdf 

11_redo.html

HTML output

<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:594px; height:419px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<span style="position:absolute; color:black; left:41px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:45px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:49px; top:64px; font-size:11px;">|</span>
<span style="position:absolute; color:black; left:53px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:56px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:60px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:64px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:67px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:71px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:75px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:78px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:82px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:86px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:89px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:93px; top:64px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:43px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:61px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:67px; top:80px; font-size:7px;">;</span>
<span style="position:absolute; color:black; left:82px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:94px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:147px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:163px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:180px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:196px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:212px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:229px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:245px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:262px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:278px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:294px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:311px; top:63px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:327px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:343px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:360px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:376px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:393px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:409px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:142px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:155px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:169px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:174px; top:80px; font-size:7px;">=</span>
<span style="position:absolute; color:black; left:179px; top:80px; font-size:7px;">3</span>
<span style="position:absolute; color:black; left:184px; top:80px; font-size:7px;">0</span>
<span style="position:absolute; color:black; left:189px; top:80px; font-size:7px;">3</span>
<span style="position:absolute; color:black; left:194px; top:80px; font-size:7px;">7</span>
<span style="position:absolute; color:black; left:199px; top:80px; font-size:7px;">0</span>
<span style="position:absolute; color:black; left:204px; top:80px; font-size:7px;">8</span>
<span style="position:absolute; color:black; left:209px; top:80px; font-size:7px;">9</span>
<span style="position:absolute; color:black; left:214px; top:80px; font-size:7px;">4</span>
<span style="position:absolute; color:black; left:219px; top:80px; font-size:7px;">6</span>
<span style="position:absolute; color:black; left:224px; top:80px; font-size:7px;">0</span>
<span style="position:absolute; color:black; left:256px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:264px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:271px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:279px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:286px; top:80px; font-size:7px;">=</span>
<span style="position:absolute; color:black; left:302px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:308px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:314px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:320px; top:80px; font-size:7px;">/</span>
<span style="position:absolute; color:black; left:326px; top:80px; font-size:7px;">2</span>
<span style="position:absolute; color:black; left:332px; top:80px; font-size:7px;">9</span>
<span style="position:absolute; color:black; left:339px; top:80px; font-size:7px;">8</span>
<span style="position:absolute; color:black; left:345px; top:80px; font-size:7px;">5</span>
<span style="position:absolute; color:black; left:481px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:488px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:495px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:502px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:508px; top:62px; font-size:11px;">.</span>
<span style="position:absolute; color:black; left:515px; top:62px; font-size:11px;">1</span>
<span style="position:absolute; color:black; left:522px; top:62px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:529px; top:61px; font-size:11px;"></span>
<span style="position:absolute; color:black; left:394px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:401px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:409px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:416px; top:80px; font-size:7px;"></span>
<span style="position:absolute; color:black; left:424px; top:80px; font-size:7px;">=</span>
<span style="position:absolute; color:black; left:439px; top:80px; font-size:7px;">2</span>
<span style="position:absolute; color:black; left:444px; top:80px; font-size:7px;">0</span>
<span style="position:absolute; color:black; left:448px; top:80px; font-size:7px;">1</span>
<span style="position:absolute; color:black; left:453px; top:80px; font-size:7px;">5</span>
<span style="position:absolute; color:black; left:457px; top:80px; font-size:7px;">.</span>
<span style="position:absolute; color:black; left:462px; top:80px; font-size:7px;">1</span>
<span style="position:absolute; color:black; left:466px; top:80px; font-size:7px;">0</span>
<span style="position:absolute; color:black; left:471px; top:80px; font-size:7px;">-</span>
<span style="position:absolute; color:black; left:475px; top:80px; font-size:7px;">1</span>
<span style="position:absolute; color:black; left:479px; top:80px; font-size:7px;">3</span>
<span style="position:absolute; color:black; left:489px; top:80px; font-size:7px;">0</span>
<span style="position:absolute; color:black; left:494px; top:80px; font-size:7px;">9</span>
<span style="position:absolute; color:black; left:498px; top:80px; font-size:7px;">:</span>
<span style="position:absolute; color:black; left:503px; top:80px; font-size:7px;">5</span>
<span style="position:absolute; color:black; left:507px; top:80px; font-size:7px;">9</span>
<span style="position:absolute; color:black; left:43px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:61px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:67px; top:97px; font-size:4px;">;</span>
<span style="position:absolute; color:black; left:82px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:142px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:149px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:157px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:164px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:172px; top:97px; font-size:4px;">=</span>
<span style="position:absolute; color:black; left:185px; top:97px; font-size:4px;">1</span>
<span style="position:absolute; color:black; left:189px; top:97px; font-size:4px;">4</span>
<span style="position:absolute; color:black; left:194px; top:97px; font-size:4px;">3</span>
<span style="position:absolute; color:black; left:198px; top:97px; font-size:4px;">6</span>
<span style="position:absolute; color:black; left:202px; top:97px; font-size:4px;">6</span>
<span style="position:absolute; color:black; left:207px; top:97px; font-size:4px;">2</span>
<span style="position:absolute; color:black; left:211px; top:97px; font-size:4px;">0</span>
<span style="position:absolute; color:black; left:216px; top:97px; font-size:4px;">8</span>
<span style="position:absolute; color:black; left:220px; top:97px; font-size:4px;">0</span>
<span style="position:absolute; color:black; left:225px; top:97px; font-size:4px;">0</span>
<span style="position:absolute; color:black; left:255px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:269px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:282px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:288px; top:97px; font-size:4px;">;</span>
<span style="position:absolute; color:black; left:302px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:308px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:314px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:320px; top:97px; font-size:4px;">/</span>
<span style="position:absolute; color:black; left:326px; top:97px; font-size:4px;">T</span>
<span style="position:absolute; color:black; left:332px; top:97px; font-size:4px;">0</span>
<span style="position:absolute; color:black; left:337px; top:97px; font-size:4px;">0</span>
<span style="position:absolute; color:black; left:343px; top:97px; font-size:4px;">3</span>
<span style="position:absolute; color:black; left:349px; top:97px; font-size:4px;">7</span>
<span style="position:absolute; color:black; left:393px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:400px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:408px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:416px; top:97px; font-size:4px;"></span>
<span style="position:absolute; color:black; left:423px; top:97px; font-size:4px;">=</span>
<span style="position:absolute; color:black; left:439px; top:97px; font-size:4px;">2</span>
<span style="position:absolute; color:black; left:444px; top:97px; font-size:4px;">0</span>
<span style="position:absolute; color:black; left:448px; top:97px; font-size:4px;">1</span>
<span style="position:absolute; color:black; left:453px; top:97px; font-size:4px;">5</span>
<span style="position:absolute; color:black; left:457px; top:97px; font-size:4px;">.</span>
<span style="position:absolute; color:black; left:462px; top:97px; font-size:4px;">1</span>
<span style="position:absolute; color:black; left:466px; top:97px; font-size:4px;">0</span>
<span style="position:absolute; color:black; left:471px; top:97px; font-size:4px;">-</span>
<span style="position:absolute; color:black; left:475px; top:97px; font-size:4px;">1</span>
<span style="position:absolute; color:black; left:479px; top:97px; font-size:4px;">3</span>
<span style="position:absolute; color:black; left:490px; top:97px; font-size:4px;">1</span>
<span style="position:absolute; color:black; left:494px; top:97px; font-size:4px;">0</span>
<span style="position:absolute; color:black; left:498px; top:97px; font-size:4px;">:</span>
<span style="position:absolute; color:black; left:503px; top:97px; font-size:4px;">4</span>
<span style="position:absolute; color:black; left:507px; top:97px; font-size:4px;">5</span>
<span style="position:absolute; color:black; left:43px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:61px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:67px; top:109px; font-size:5px;">:</span>
<span style="position:absolute; color:black; left:82px; top:109px; font-size:5px;">3</span>
<span style="position:absolute; color:black; left:87px; top:109px; font-size:5px;">3</span>
<span style="position:absolute; color:black; left:92px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:141px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:169px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:174px; top:109px; font-size:5px;">=</span>
<span style="position:absolute; color:black; left:255px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:283px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:288px; top:109px; font-size:5px;">=</span>
<span style="position:absolute; color:black; left:302px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:311px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:320px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:329px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:338px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:347px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:393px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:400px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:408px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:416px; top:109px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:423px; top:109px; font-size:5px;">=</span>
<span style="position:absolute; color:black; left:439px; top:109px; font-size:5px;">2</span>
<span style="position:absolute; color:black; left:444px; top:109px; font-size:5px;">0</span>
<span style="position:absolute; color:black; left:449px; top:109px; font-size:5px;">1</span>
<span style="position:absolute; color:black; left:454px; top:109px; font-size:5px;">5</span>
<span style="position:absolute; color:black; left:458px; top:109px; font-size:5px;">-</span>
<span style="position:absolute; color:black; left:463px; top:109px; font-size:5px;">1</span>
<span style="position:absolute; color:black; left:468px; top:109px; font-size:5px;">0</span>
<span style="position:absolute; color:black; left:473px; top:109px; font-size:5px;">-</span>
<span style="position:absolute; color:black; left:477px; top:109px; font-size:5px;">1</span>
<span style="position:absolute; color:black; left:482px; top:109px; font-size:5px;">3</span>
<span style="position:absolute; color:black; left:487px; top:109px; font-size:5px;">1</span>
<span style="position:absolute; color:black; left:492px; top:109px; font-size:5px;">0</span>
<span style="position:absolute; color:black; left:496px; top:109px; font-size:5px;">:</span>
<span style="position:absolute; color:black; left:501px; top:109px; font-size:5px;">5</span>
<span style="position:absolute; color:black; left:506px; top:109px; font-size:5px;">1</span>
<span style="position:absolute; color:black; left:43px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:61px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:67px; top:124px; font-size:5px;">;</span>
<span style="position:absolute; color:black; left:81px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:89px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:141px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:149px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:157px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:164px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:172px; top:124px; font-size:5px;">=</span>
<span style="position:absolute; color:black; left:184px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:193px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:257px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:264px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:272px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:279px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:287px; top:124px; font-size:5px;">=</span>
<span style="position:absolute; color:black; left:302px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:311px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:320px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:329px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:337px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:346px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:355px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:364px; top:124px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:42px; top:136px; font-size:2px;"> </span>
<span style="position:absolute; color:black; left:46px; top:141px; font-size:5px;">N</span>
<span style="position:absolute; color:black; left:50px; top:141px; font-size:5px;">o</span>
<span style="position:absolute; color:black; left:65px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:75px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:223px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:234px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:310px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:319px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:328px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:337px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:384px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:395px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:424px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:435px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:477px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:489px; top:141px; font-size:5px;"></span>
<span style="position:absolute; color:black; left:41px; top:150px; font-size:1px;"> </span>
<span style="position:absolute; color:black; left:49px; top:158px; font-size:6px;">1</span>
<span style="position:absolute; color:black; left:63px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:72px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:82px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:92px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:102px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:112px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:122px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:132px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:142px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:151px; top:158px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:219px; top:157px; font-size:6px;">0</span>
<span style="position:absolute; color:black; left:223px; top:157px; font-size:6px;">.</span>
<span style="position:absolute; color:black; left:228px; top:157px; font-size:6px;">0</span>
<span style="position:absolute; color:black; left:233px; top:157px; font-size:6px;">5</span>
<span style="position:absolute; color:black; left:237px; top:157px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:242px; top:157px; font-size:6px;">_</span>
<span style="position:absolute; color:black; left:247px; top:157px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:306px; top:157px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:311px; top:157px; font-size:6px;">l</span>
<span style="position:absolute; color:black; left:316px; top:157px; font-size:6px;">'</span>
<span style="position:absolute; color:black; left:320px; top:157px; font-size:6px;">0</span>
<span style="position:absolute; color:black; left:325px; top:157px; font-size:6px;">0</span>
<span style="position:absolute; color:black; left:378px; top:157px; font-size:6px;">S</span>
<span style="position:absolute; color:black; left:383px; top:157px; font-size:6px;">/</span>
<span style="position:absolute; color:black; left:387px; top:157px; font-size:6px;">C</span>
<span style="position:absolute; color:black; left:392px; top:157px; font-size:6px;">O</span>
<span style="position:absolute; color:black; left:417px; top:156px; font-size:6px;">I</span>
<span style="position:absolute; color:black; left:422px; top:156px; font-size:6px;">2</span>
<span style="position:absolute; color:black; left:427px; top:156px; font-size:6px;">0</span>
<span style="position:absolute; color:black; left:431px; top:156px; font-size:6px;">0</span>
<span style="position:absolute; color:black; left:436px; top:156px; font-size:6px;">0</span>
<span style="position:absolute; color:black; left:457px; top:156px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:467px; top:156px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:477px; top:156px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:486px; top:156px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:496px; top:156px; font-size:6px;"></span>
<span style="position:absolute; color:black; left:45px; top:405px; font-size:2px;"></span>
<span style="position:absolute; color:black; left:72px; top:405px; font-size:2px;"></span>
<span style="position:absolute; color:black; left:76px; top:405px; font-size:2px;"></span>
<span style="position:absolute; color:black; left:80px; top:405px; font-size:2px;">=</span>
<span style="position:absolute; color:black; left:40px; top:413px; font-size:1px;"> </span>
<span style="position:absolute; color:black; left:41px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:50px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:59px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:67px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:76px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:85px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:93px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:102px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:111px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:119px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:128px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:136px; top:413px; font-size:8px;">,</span>
<span style="position:absolute; color:black; left:152px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:160px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:169px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:178px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:187px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:195px; top:413px; font-size:8px;">3</span>
<span style="position:absolute; color:black; left:204px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:213px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:221px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:230px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:255px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:262px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:269px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:276px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:282px; top:413px; font-size:8px;">=</span>
<span style="position:absolute; color:black; left:289px; top:413px; font-size:8px;">2</span>
<span style="position:absolute; color:black; left:296px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:303px; top:413px; font-size:8px;">5</span>
<span style="position:absolute; color:black; left:310px; top:413px; font-size:8px;">-</span>
<span style="position:absolute; color:black; left:320px; top:413px; font-size:8px;">1</span>
<span style="position:absolute; color:black; left:323px; top:413px; font-size:8px;">0</span>
<span style="position:absolute; color:black; left:327px; top:413px; font-size:8px;">.</span>
<span style="position:absolute; color:black; left:334px; top:413px; font-size:8px;">1</span>
<span style="position:absolute; color:black; left:337px; top:413px; font-size:8px;">3</span>
<span style="position:absolute; color:black; left:347px; top:413px; font-size:8px;">1</span>
<span style="position:absolute; color:black; left:352px; top:413px; font-size:8px;">4</span>
<span style="position:absolute; color:black; left:356px; top:413px; font-size:8px;">:</span>
<span style="position:absolute; color:black; left:360px; top:413px; font-size:8px;">5</span>
<span style="position:absolute; color:black; left:364px; top:413px; font-size:8px;">6</span>
<span style="position:absolute; color:black; left:380px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:388px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:396px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:405px; top:413px; font-size:8px;">=</span>
<span style="position:absolute; color:black; left:413px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:421px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:41px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:50px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:58px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:67px; top:423px; font-size:8px;">*</span>
<span style="position:absolute; color:black; left:75px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:83px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:92px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:100px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:108px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:117px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:125px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:133px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:142px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:150px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:158px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:167px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:175px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:460px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:467px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:474px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:482px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:489px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:496px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:503px; top:413px; font-size:8px;">`</span>
<span style="position:absolute; color:black; left:510px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:517px; top:413px; font-size:8px;">`</span>
<span style="position:absolute; color:black; left:524px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:531px; top:413px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:534px; top:423px; font-size:8px;">`</span>
<span style="position:absolute; color:black; left:505px; top:423px; font-size:8px;"></span>
<span style="position:absolute; color:black; left:524px; top:423px; font-size:8px;">`</span>
<div style="position:absolute; border: figure 1px solid; writing-mode:False; left:0px; top:50px; width:594px; height:419px;"></div><div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>

@jbarlow83
Collaborator

make tesseract work better

The file is a clear image so you're probably running into the limits of tesseract's OCR accuracy. Results can be improved by training it to recognize the exact font being used, but training tesseract with new fonts is hard to do and tedious.

This file looks like it was "born digital" rather than a scanned image. In that case, if you can find a version of the file before it was passed through online2pdf.com, perhaps that will work better.

how could i distinguish those not properly encoded or not. any idea about that?

Two pieces of information told me it is not encoded properly. First, pdffonts reports that the font encoding is "WinAnsi", which cannot encode Chinese characters. Second, the text that Acrobat extracts by copy and paste maps Chinese characters to random ASCII characters. So the PDF contains encoding information and text positioning information, but the encoding is incorrect. It is probably not possible to fix this without using OCR or finding the original digital file.

By the way you might want to use "poppler-utils" which is a fork of "xpdf" that provides many of the same tools. xpdf is not being maintained any more while poppler-utils is still being developed.
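
The first signal above (pdffonts reporting WinAnsi) can be checked automatically. Here is a minimal sketch in Python, assuming `pdffonts` (from poppler-utils or xpdf) is on the PATH and prints the column layout shown earlier in this thread. `parse_pdffonts`, `all_simple_winansi` and `run_pdffonts` are hypothetical helper names, not part of any tool. The check is only meaningful for files known to contain Chinese text, since WinAnsi is perfectly normal for Latin-script PDFs.

```python
import re
import subprocess

# Column names for the table that pdffonts prints (assumed layout).
KEYS = ["name", "type", "encoding", "emb", "sub", "uni", "object_id"]


def parse_pdffonts(text):
    """Parse the table printed by `pdffonts` into a list of dicts."""
    lines = text.splitlines()
    # The row of dashes under the header defines the column boundaries.
    sep_idx = next((i for i, ln in enumerate(lines) if ln.startswith("---")), None)
    if sep_idx is None:
        return []
    spans = [m.span() for m in re.finditer(r"-+", lines[sep_idx])]
    if len(spans) != len(KEYS):
        raise ValueError("unexpected pdffonts column layout")
    fonts = []
    for ln in lines[sep_idx + 1:]:
        if not ln.strip():
            continue
        row = {}
        for i, (key, (start, end)) in enumerate(zip(KEYS, spans)):
            # Let the last column run to the end of the line.
            stop = None if i == len(KEYS) - 1 else end
            row[key] = ln[start:stop].strip()
        fonts.append(row)
    return fonts


def all_simple_winansi(fonts):
    """True if there is at least one font and every font is a non-CID WinAnsi font."""
    return bool(fonts) and all(
        f["encoding"] == "WinAnsi" and not f["type"].startswith("CID")
        for f in fonts
    )


def run_pdffonts(path):
    """Run pdffonts on a file and return the parsed font list."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return parse_pdffonts(result.stdout)
```

Calling `all_simple_winansi(run_pdffonts("11.pdf"))` on a file that is supposed to contain Chinese text and getting `True` would be a hint that the text layer cannot be trusted. Font names longer than the first column would break the fixed-width slicing, so treat this as a starting point only.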

@wanghaisheng
Author

Thanks very much. I now have a Docker image with poppler-utils 0.48.0 and xpdf 3.0.4.
One way is to use pdffonts to detect whether the PDF file has only one font type and its encoding is WinAnsi.
Another way is to use pdf.js: all I get is null, or no valid Chinese characters.
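
The text-extraction check can be sketched in a few lines of Python. This assumes the text has already been extracted by some tool (pdf.js, `pdftotext`, or copy and paste). The helper names `cjk_ratio` and `looks_like_chinese` are hypothetical, and the 0.1 threshold is an arbitrary assumption rather than a tuned value.

```python
def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block or Extension A."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"


def cjk_ratio(text):
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if is_cjk(c)) / len(chars)


def looks_like_chinese(text, threshold=0.1):
    """Heuristic: treat the text as Chinese if enough characters are ideographs."""
    return cjk_ratio(text) >= threshold
```

If a document is expected to be Chinese but its extracted text has a ratio near zero, the text layer is probably not decodable and OCR is the likely fallback.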

@wanghaisheng
Author

@jbarlow83 Sir, I recently ran the above test against the tess4 Docker image.
With the following command, the old tesseract Docker image produces a proper PDF, but the tess4 image does not.
I could not get the correct Chinese characters from the tess4 result. Is --force-ocr deprecated?

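import subprocess
from tempfile import NamedTemporaryFile


def ocr(body, response, language: "The language(s) to use for OCR" = "chi_sim"):
    # Expect exactly one uploaded file in the request body.
    if len(body) != 1:
        raise Exception("Need exactly one file!")

    fn, content = list(body.items()).pop()

    # The output file is returned to the caller, so it is not opened in a `with` block.
    f_out = NamedTemporaryFile(suffix='.pdf')

    with NamedTemporaryFile(suffix='.pdf', mode="wb") as f_in:
        f_in.write(content)
        f_in.flush()  # make sure ocrmypdf sees the complete input on disk
        proc = subprocess.Popen(['ocrmypdf', '--pdf-renderer', 'tesseract', '--force-ocr',
                                 '--output-type', 'pdf', '-l', language, f_in.name, f_out.name])

        code = proc.wait()

        # Report the ocrmypdf exit code to the client.
        response.set_header('X-OCR-Exit-Code', str(code))

        print(f_out.name)

        return f_out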

@wanghaisheng wanghaisheng reopened this Jul 26, 2017
@jbarlow83
Collaborator

Hello.

I was able to get correct OCR from the ocrmypdf-tess4 image, as far as I can tell anyway. I attached a sample based on your original test file. I did the following:

docker run --rm -v (pwd):/home/docker ocrmypdft4 -l chi_sim --output-type pdf --force-ocr 11.pdf 11_.pdf

--force-ocr is not deprecated.

Maybe you're using an older version of that image? The code snippet shows a local installation of ocrmypdf not the docker image.

11_.pdf

@wanghaisheng
Author

wanghaisheng commented Jul 27, 2017

#FROM jbarlow83/ocrmypdf
FROM jbarlow83/ocrmypdf-tess4
USER root

RUN mkdir /app
WORKDIR /app

ADD requirements.txt /app


#RUN add-apt-repository ppa:alex-p/tesseract-ocr

RUN apt-get update \
	&& apt-get autoremove -y \
	&& apt-get install -y --no-install-recommends \
		tesseract-ocr-chi-sim 



RUN . /appenv/bin/activate && pip install -r requirements.txt

ADD server.py index.htm entrypoint.sh /app/
ADD static /app/static/

USER docker

ENTRYPOINT ["/app/entrypoint.sh"]

I just inherit from your image.

@wanghaisheng wanghaisheng reopened this Jul 27, 2017
@jbarlow83
Collaborator

What file are you using to test?

@wanghaisheng
Author

wanghaisheng commented Jul 27, 2017

11.pdf.zip
Simply remove the .zip suffix.

I tried removing all the existing "ocrmypdf" containers and images and re-pulling; it works now.
