text highlighting quirk on PDF files produced by Tesseract #6509

Open
jbreiden opened this issue Oct 6, 2015 · 14 comments

@jbreiden

jbreiden commented Oct 6, 2015

Programs like Tesseract are used to OCR documents. Basically, we take
a photographic image, recognize any symbolic text, and then compose a PDF
consisting of the photograph and an invisible symbolic text layer for copy-paste
and search.

I am the author of the relevant pdf generation code, and similar code in other
programs. We get very good results in many PDF renderers including pdfium
and poppler, but get misaligned highlighting from pdf.js in Firefox.

GitHub is refusing to let me post a simple example PDF here, so I am
providing a URL instead of an attachment. This is a very simple example from
our test suite. I have 100% control over the PDF generation code and
understand everything about it, so if there is any complaint about it, let me
know and we'll work it out.

http://leptonica.org/jbreiden/simple-1.pdf

[Image: screenshot]

Build identifier: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0

@brendandahl
Contributor

There seems to be an issue with how the top value for the text overlay is calculated, but nothing seems obviously wrong to me. The relevant code is at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L164-L180

@jbreiden
Author

jbreiden commented Oct 7, 2015

The relevant PDF objects and the embedded glyphless font say that the Hebrew and English words should highlight identically. I suggest tracing what is causing the difference.

@jbreiden
Author

jbreiden commented Oct 7, 2015

Here you can see that we are placing the Hebrew and English words on the exact same y position.

5 0 obj
<< /Length 197 >>
stream
q 132.686 0 0 47.314 0 0 cm /Im1 Do Q
BT
3 Tr 1 0 0 1 16.457 19.229 Tm /f-0-0 26 Tf 97.582 Tz [ <0061><006C><006F> ] TJ -1 0 0 1 122.4 19.229 Tm 90.212 Tz [ <05D1><05D0><05D7><05E8><200E> ] TJ 
ET
endstream
endobj
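
The shared baseline can also be checked mechanically. The following is a minimal Python sketch, not part of Tesseract or pdf.js. `text_matrices` is a hypothetical helper, and it assumes every operand is separated by whitespace, which holds for this stream but not for PDF content streams in general.

```python
# Content stream of object 5, copied from the comment above.
STREAM = """q 132.686 0 0 47.314 0 0 cm /Im1 Do Q
BT
3 Tr 1 0 0 1 16.457 19.229 Tm /f-0-0 26 Tf 97.582 Tz [ <0061><006C><006F> ] TJ -1 0 0 1 122.4 19.229 Tm 90.212 Tz [ <05D1><05D0><05D7><05E8><200E> ] TJ
ET"""


def text_matrices(content_stream):
    """Return the six operands (a, b, c, d, e, f) of every Tm operator."""
    tokens = content_stream.split()
    return [
        tuple(float(t) for t in tokens[i - 6:i])
        for i, tok in enumerate(tokens)
        if tok == "Tm"
    ]


matrices = text_matrices(STREAM)
for a, b, c, d, e, f in matrices:
    # f is the y translation, i.e. the baseline position of the word.
    print(f"x scale = {a:+.0f}, origin = ({e}, {f})")
```

The two matrices differ only in the sign of the x scale and in the x origin; the y origin is the same for both words.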

Here you can see that the PDF claims there are no descenders.

11 0 obj
<< /Ascent 500 /CapHeight 500 /Descent -1 /Flags 5 /FontBBox [ 0 0 500 500 ] /FontFile2 12 0 R /FontName /GlyphLessFont /ItalicAngle 0 /StemV 80 /Type /FontDescriptor >>
endobj

And most importantly, we are mapping every single character to the same invisible, empty glyph. Here are links to the code/documentation and to the custom designed font.

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp
https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf

My best guess (without reading pdf.js code) is that the dimension information in the font itself and the relevant PDF objects are being ignored. Instead, a heuristic looks at the Unicode mapping of the characters and says "This might be English, English has descenders, move stuff around!" If that is the case, I'd really like to know what I can do to avoid triggering such heuristics.

@brendandahl
Contributor

I missed that they should be on the same line before. It appears the issue is with how we calculate the angle for the text. For the Hebrew word there is a negative x scale component, which seems to be causing issues on our side. Looking into how this should be working...

1 0 0 1 16.457 19.229 Tm
-1 0 0 1 122.4 19.229 Tm

@jbreiden
Author

jbreiden commented Oct 8, 2015

The -1 just means that I am placing characters from right to left (because Hebrew is a right-to-left language). This is not terribly common practice, but it makes sense, especially when working with an invisible glyphless font.
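
To illustrate the effect of the negative x scale, here is a small Python sketch that maps successive glyph origins through the two Tm matrices from the content stream. `apply_tm` is a hypothetical helper and `ADVANCE` is an assumed glyph advance chosen only for illustration; the sketch also ignores the font size (Tf) and horizontal scaling (Tz), which change the step size but not the direction.

```python
def apply_tm(matrix, x, y):
    """Map a text-space point through a PDF matrix [a b c d e f]."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


ENGLISH_TM = (1, 0, 0, 1, 16.457, 19.229)
HEBREW_TM = (-1, 0, 0, 1, 122.4, 19.229)
ADVANCE = 10.0  # hypothetical glyph advance in text-space units

for name, tm in (("English", ENGLISH_TM), ("Hebrew", HEBREW_TM)):
    # Origins of three consecutive glyphs on the baseline (y = 0 in text space).
    origins = [apply_tm(tm, k * ADVANCE, 0.0) for k in range(3)]
    print(name, origins)
```

With the English matrix the x coordinates increase, with the Hebrew matrix they decrease, and in both cases the y coordinate stays at 19.229.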

Please note that I'm claiming the problem is with the English word. The highlight region extends way below the baseline, and it should not be doing that. This problem is 100% reproducible, and occurs for every document produced by Tesseract including pure-English.

@jbreiden
Author

jbreiden commented Oct 8, 2015

I see this amazing image after a successful copy-paste operation. There is ghostlike, white-on-gray symbolic text overlaid on the image. I have no idea what it means, or where the font is coming from. It certainly is not the font embedded in the PDF, because that one is glyphless. The English word is too low, and the Hebrew word has each character rotated 180 degrees. Maybe this provides a clue.

[Image: ghost]

@brendandahl
Contributor

To enable text selection we create an invisible DOM overlay, so that is what you're seeing. The overlay doesn't use the embedded font. PDF.js tries to line up the text layer with the underlying canvas, but as we see above this doesn't always work correctly.

@jbreiden
Author

jbreiden commented Oct 8, 2015

Just in case it helps, this is a dump of the font embedded in the PDF, using ttx.

<?xml version="1.0" encoding="UTF-8"?>
<ttFont sfntVersion="\x00\x01\x00\x00" ttLibVersion="2.5">

  <GlyphOrder>
    <!-- The 'id' attribute is only for humans; it is ignored when parsed. -->
    <GlyphID id="0" name=".notdef"/>
    <GlyphID id="1" name=".null"/>
  </GlyphOrder>

  <head>
    <!-- Most of this table will be recalculated by the compiler -->
    <tableVersion value="1.0"/>
    <fontRevision value="1.0"/>
    <checkSumAdjustment value="0xa737b34c"/>
    <magicNumber value="0x5f0f3cf5"/>
    <flags value="00000100 00000111"/>
    <unitsPerEm value="256"/>
    <created value="Thu May 15 23:21:18 2014"/>
    <modified value="Thu May 15 23:21:18 2014"/>
    <xMin value="0"/>
    <yMin value="-32768"/>
    <xMax value="0"/>
    <yMax value="1"/>
    <macStyle value="00000000 00000000"/>
    <lowestRecPPEM value="16"/>
    <fontDirectionHint value="2"/>
    <indexToLocFormat value="0"/>
    <glyphDataFormat value="0"/>
  </head>

  <hhea>
    <tableVersion value="1.0"/>
    <ascent value="1"/>
    <descent value="-1"/>
    <lineGap value="0"/>
    <advanceWidthMax value="0"/>
    <minLeftSideBearing value="0"/>
    <minRightSideBearing value="0"/>
    <xMaxExtent value="0"/>
    <caretSlopeRise value="1"/>
    <caretSlopeRun value="0"/>
    <caretOffset value="0"/>
    <reserved0 value="0"/>
    <reserved1 value="0"/>
    <reserved2 value="0"/>
    <reserved3 value="0"/>
    <metricDataFormat value="0"/>
    <numberOfHMetrics value="2"/>
  </hhea>

  <maxp>
    <!-- Most of this table will be recalculated by the compiler -->
    <tableVersion value="0x10000"/>
    <numGlyphs value="2"/>
    <maxPoints value="0"/>
    <maxContours value="0"/>
    <maxCompositePoints value="0"/>
    <maxCompositeContours value="0"/>
    <maxZones value="1"/>
    <maxTwilightPoints value="0"/>
    <maxStorage value="0"/>
    <maxFunctionDefs value="0"/>
    <maxInstructionDefs value="0"/>
    <maxStackElements value="0"/>
    <maxSizeOfInstructions value="0"/>
    <maxComponentElements value="0"/>
    <maxComponentDepth value="0"/>
  </maxp>

  <OS_2>
    <!-- The fields 'usFirstCharIndex' and 'usLastCharIndex'
         will be recalculated by the compiler -->
    <version value="3"/>
    <xAvgCharWidth value="0"/>
    <usWeightClass value="400"/>
    <usWidthClass value="5"/>
    <fsType value="00000000 00000000"/>
    <ySubscriptXSize value="0"/>
    <ySubscriptYSize value="0"/>
    <ySubscriptXOffset value="0"/>
    <ySubscriptYOffset value="0"/>
    <ySuperscriptXSize value="0"/>
    <ySuperscriptYSize value="0"/>
    <ySuperscriptXOffset value="0"/>
    <ySuperscriptYOffset value="0"/>
    <yStrikeoutSize value="0"/>
    <yStrikeoutPosition value="0"/>
    <sFamilyClass value="0"/>
    <panose>
      <bFamilyType value="5"/>
      <bSerifStyle value="0"/>
      <bWeight value="1"/>
      <bProportion value="0"/>
      <bContrast value="1"/>
      <bStrokeVariation value="0"/>
      <bArmStyle value="0"/>
      <bLetterForm value="0"/>
      <bMidline value="0"/>
      <bXHeight value="0"/>
    </panose>
    <ulUnicodeRange1 value="00000000 00000000 00000000 00000000"/>
    <ulUnicodeRange2 value="00000000 00000000 00000000 00000000"/>
    <ulUnicodeRange3 value="00000000 00000000 00000000 00000000"/>
    <ulUnicodeRange4 value="00000000 00000000 00000000 00000000"/>
    <achVendID value="GOOG"/>
    <fsSelection value="00000000 01000000"/>
    <usFirstCharIndex value="65535"/>
    <usLastCharIndex value="0"/>
    <sTypoAscender value="1"/>
    <sTypoDescender value="-1"/>
    <sTypoLineGap value="0"/>
    <usWinAscent value="1"/>
    <usWinDescent value="1"/>
    <ulCodePageRange1 value="10000000 00000000 00000000 00000000"/>
    <ulCodePageRange2 value="00000000 00000000 00000000 00000000"/>
    <sxHeight value="0"/>
    <sCapHeight value="0"/>
    <usDefaultChar value="0"/>
    <usBreakChar value="1"/>
    <usMaxContext value="0"/>
  </OS_2>

  <hmtx>
    <mtx name=".notdef" width="0" lsb="0"/>
    <mtx name=".null" width="0" lsb="0"/>
  </hmtx>

  <cmap>
    <tableVersion version="0"/>
    <cmap_format_6 platformID="1" platEncID="0" language="0">
      <map code="0x0" name=".notdef"/>
    </cmap_format_6>
    <cmap_format_6 platformID="3" platEncID="0" language="0">
      <map code="0x0" name=".notdef"/><!-- ???? -->
    </cmap_format_6>
  </cmap>

  <loca>
    <!-- The 'loca' table will be calculated by the compiler -->
  </loca>

  <glyf>

    <!-- The xMin, yMin, xMax and yMax values
         will be recalculated by the compiler. -->

    <TTGlyph name=".notdef"/><!-- contains no outline data -->

    <TTGlyph name=".null"/><!-- contains no outline data -->

  </glyf>

  <name>
    <namerecord nameID="5" platformID="0" platEncID="3" langID="0x0">
      Version 1.0
    </namerecord>
    <namerecord nameID="5" platformID="1" platEncID="0" langID="0x0" unicode="True">
      Version 1.0
    </namerecord>
    <namerecord nameID="5" platformID="3" platEncID="1" langID="0x409">
      Version 1.0
    </namerecord>
  </name>

  <post>
    <formatType value="1.0"/>
    <italicAngle value="0.0"/>
    <underlinePosition value="0"/>
    <underlineThickness value="0"/>
    <isFixedPitch value="1"/>
    <minMemType42 value="0"/>
    <maxMemType42 value="0"/>
    <minMemType1 value="0"/>
    <maxMemType1 value="0"/>
  </post>

</ttFont>
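
For comparison with the FontDescriptor quoted earlier in this thread (/Ascent 500, /Descent -1), the vertical metrics in this dump can be rescaled to thousandths of an em. This is a hedged Python sketch: `TTX_EXCERPT` reproduces only the three values needed from the dump above, `hhea_metrics_per_1000` is a hypothetical helper, and it assumes the FontDescriptor values are expressed in 1/1000 em units.

```python
import xml.etree.ElementTree as ET

# Only the fields needed here, copied from the ttx dump above.
TTX_EXCERPT = """<ttFont>
  <head><unitsPerEm value="256"/></head>
  <hhea><ascent value="1"/><descent value="-1"/></hhea>
</ttFont>"""


def hhea_metrics_per_1000(ttx_xml):
    """Return (ascent, descent) from hhea, rescaled to 1000 units per em."""
    root = ET.fromstring(ttx_xml)
    units_per_em = int(root.find("head/unitsPerEm").get("value"))
    ascent = int(root.find("hhea/ascent").get("value"))
    descent = int(root.find("hhea/descent").get("value"))
    return (ascent * 1000 / units_per_em, descent * 1000 / units_per_em)


font_ascent, font_descent = hhea_metrics_per_1000(TTX_EXCERPT)
print("hhea, per 1000 units:", font_ascent, font_descent)
print("FontDescriptor:      ", 500, -1)  # values quoted earlier in the thread
```

Under that assumption, the hhea ascent in this dump is far smaller than the FontDescriptor's /Ascent 500, so a renderer could arrive at different vertical metrics depending on which source it reads.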

@yurydelendik
Contributor

See https://github.com/mozilla/pdf.js/wiki/Debugging-PDF.js for how to enable the debugging tools. PDF.js uses the browser's fonts to render the text layer, and the text layer on Mac OS X looks different, probably because of the metrics of the browser's fonts.

The font you posted above is somewhat unrelated; however, its metrics do not match the metrics in the PDFs (check http://brendandahl.github.io/pdf.js.utils/browser/).

Checking the angle value at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L174, it looks like it reports an unexpected -π value for the [-1,0,0,1] transform. I think you would expect 0 there. That causes the ascender value to be used during the top coordinate calculation.
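
The following Python sketch shows why an atan2-based angle estimate reads a horizontal mirror as a half turn. It is an illustration under stated assumptions, not the pdf.js implementation: `rotation_estimate` and `is_mirrored` are hypothetical helpers, and the matrices are the raw Tm operands without any viewport transform applied.

```python
import math


def rotation_estimate(matrix):
    """Angle of the transformed x axis for a matrix [a b c d e f]."""
    a, b, c, d, e, f = matrix
    return math.atan2(b, a)


def is_mirrored(matrix):
    """A negative determinant means the matrix contains a reflection."""
    a, b, c, d, e, f = matrix
    return a * d - b * c < 0


ENGLISH_TM = (1, 0, 0, 1, 16.457, 19.229)
HEBREW_TM = (-1, 0, 0, 1, 122.4, 19.229)

for name, tm in (("English", ENGLISH_TM), ("Hebrew", HEBREW_TM)):
    print(name, rotation_estimate(tm), is_mirrored(tm))
```

For the Hebrew matrix the estimate has magnitude π (its sign depends on the sign of the zero passed as the first argument), even though the text is not rotated. Checking the determinant is one possible way to tell a reflection apart from a genuine 180 degree rotation before choosing how to position the overlay.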

@jbreiden
Author

jbreiden commented Oct 9, 2015

Problem occurs without any Hebrew involved.

1 0 0 1 16 18 Tm /f-0-0 25 Tf 98.666 Tz [ <0061><006C><006F> ] TJ 

[Images: ff, gc]

@jbreiden
Author

FYI, millions of digitized books are affected.

@rlucha

rlucha commented Jun 6, 2016

We have the same problem with our PDFs OCR'ed with Tesseract. Is there any plan to fix this in the future?

@jbreiden
Author

Duplicate of #6863

@timvandermeij
Contributor

This changed after PR #12896 in the sense that the "alo" part of the original PDF file is now correct, but the Hebrew part is unfortunately not yet correct.
