OCR less effective in build 583 since build 545. #309

Closed
Markismus opened this Issue Oct 14, 2013 · 11 comments

4 participants

@Markismus
KOReader Community member

This has been an ongoing issue since build 545.

In build 545 I can look up a word set in Fraktur in the German dictionary, using deu-frak:
imag0079_resize

In builds 549 and 561 this no longer works. I can still select words, but most of the time the selection is preceded by white space and I usually get no dictionary entry:
imag0084_resize
imag0083_resize
imag0082_resize

Only for seemingly random words do I get a nonsense entry: the number 44 in the WordNet dictionary. So something is broken!
In roman script the preceding white space appears too, but only about a third of the time and without causing much trouble for the lookup. So although it is something new, it doesn't seem to touch the core of the problem.

I tested with a PDF file (https://copy.com/rNHwENxHUa1uVx0E),
on a Kobo Aura HD,
with the same directory for dict and tessdata. (They are outside the koreader directory and the koreader_kobo.sh script points to them.)
In defaults.lua I changed the OCR lines to:
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"English", "Greek", "D-Fraktur", "Deutsch"}
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "grc", "deu-frak", "deu"}

Any idea what could be causing it?

The following crash.log output was generated.
Build 545:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=deu)
Can not convert anfangfi to utf8.
Invalid byte sequence in conversion input
Can not convert auflöien to utf8.
Invalid byte sequence in conversion input
Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=deu-frak)
Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=deu-frak)
Can not convert «----v-«-,.---«----«-m-.« to utf8.
Invalid byte sequence in conversion input
Can not convert «·-·-i.·--·.--.· to utf8.
Invalid byte sequence in conversion input
Can not convert I-----.-.---u-----o-«--o« to utf8.
Invalid byte sequence in conversion input
Can not convert s-ssssisk-xsss-uiis-ss«s-i«-s-.q-«.-ss to utf8.
Invalid byte sequence in conversion input
Can not convert i--.--«-«--«-i----·--.·--.k-«i« to utf8.
Invalid byte sequence in conversion input
Can not convert «-·---·«-»--s«6«-s to utf8.
Invalid byte sequence in conversion input
Can not convert «----v-«-,.---«----«-m-.« to utf8.
Invalid byte sequence in conversion input
Can not convert «·-·-i.·--·.--.· to utf8.
Invalid byte sequence in conversion input
Can not convert I-----.-.---u-----o-«--o« to utf8.
Invalid byte sequence in conversion input
Can not convert s-ssssisk-xsss-uiis-ss«s-i«-s-.q-«.-ss to utf8.
Invalid byte sequence in conversion input
Segmentation fault

Build 549:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=deu-frak)
Can not convert H« to utf8.
Invalid byte sequence in conversion input
Can not convert »so« to utf8.
Invalid byte sequence in conversion input
Can not convert »so-·· to utf8.
Invalid byte sequence in conversion input
Can not convert pp« to utf8.
Invalid byte sequence in conversion input
Can not convert txt· to utf8.
Invalid byte sequence in conversion input
Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=deu-frak)
Can not convert BEIDE-L«-LILM,-IIFG-G·ELDHLIL to utf8.
Invalid byte sequence in conversion input
Can not convert EIN:III:IILLHYELÆLL to utf8.
Invalid byte sequence in conversion input
Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=deu-frak)
Can not convert « to utf8.
Invalid byte sequence in conversion input
Can not convert »» to utf8.
Invalid byte sequence in conversion input
Can not convert L» to utf8.
Invalid byte sequence in conversion input
Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=deu-frak)
Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=grc)

Build 561:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=eng)
Can not convert £19115'2‘;1J‘LWv_‘:|_"L‘_"L‘_"‘:L‘:3 to utf8.
Invalid byte sequence in conversion input
Can not convert £19115'2‘;1J‘LWv_‘:|_"L‘_"L‘_"‘:L‘:3 to utf8.
Invalid byte sequence in conversion input
Can not convert £19115'2‘;1J‘LWv_‘:|_"L‘_"L‘_"‘:L‘:3 to utf8.
Invalid byte sequence in conversion input
Can not convert 1“££'!L5_"",£|‘,'1vJ£',!!“'F‘7'Lv_"i=-E2 to utf8.
Invalid byte sequence in conversion input
Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=eng)
Can not convert W‘ to utf8.
Invalid byte sequence in conversion input
Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=eng)
@Markismus
KOReader Community member

Another observation:
After the document has been open for a rather long time (>30 min), the dictionary is more responsive. I consistently get the same results when touching words:
Nichts-->m,
allgemeinen-->He,
Folge-->Los,
Anfang-->J,
and nothing at all for many words that should have dictionary entries.

I've pasted the crash.logs in the first comment, but these events basically do not produce entries in crash.log.

@chrox
KOReader Community member

Starting koreader in a terminal shell with this command ./reader.lua -d /path/to/your/documents/dir will give you more debug information.

@Markismus
KOReader Community member

How would I pipe the standard output to a file, too?
I have ./reader.lua -d /mnt/koreader/library 2>crash.`date +%s`.log,
could I add 1>standardoutput.log?

@Markismus
KOReader Community member

In build 566 (same PDF document, same device) with reflow off and zoom set to content width,
an entire line gets selected instead of a word. Standard output:

# hold_release detected in slot 0
# OCRed word: fcbeint,fonnff11¢icbocbmonbizgangeiufaï¬z,menuand;fonï¬nitgenb-'
# lookup word: fcbeint,fonnff11¢icbocbmonbizgangeiufaï¬z,menuand;fonï¬nitgenb-'
# stripped word: fcbeint,fonnff11¢icbocbmonbizgangeiufaï¬z,menuand;fonï¬nitgenb
# in tap state...
# set up hold timer
# in tap state...

or even 2 lines get selected at one touch:

# hold_release detected in slot 0
# OCRed word: ....«.é?..ǤT27...'I..âI:.'..sT{S§l.TsT2-C55.7:22â;
# lookup word: ....«.é?..ǤT27...'I..âI:.'..sT{S§l.TsT2-C55.7:22â;
# stripped word: «.é?..ǤT27...'I..âI:.'..sT{S§l.TsT2-C55.7:22â

With reflow on and English or Fraktur (deu-frak) selected as the language, I can't select anything, although the taps are recognized. (This is the same behaviour I saw yesterday evening with build 561.)

# hinting page 230 in background
reading page:0,0,1922,3189 scale:6.37
# free koptcontext userdata: 0x2c1ef358
# hinting page 231 in background
reading page:0,0,1922,3189 scale:6.37
# free blitbuffer userdata: 0x2c15b3f0
# free koptcontext userdata: 0x2c9416d0
....
# in tap state...
# set up tap timer
# in tap timer true
# single tap detected in slot 0
# goto relative page: 1
# pan by 0 1390
# on pan: page_area {
        ["h"] = 5499,
        ["w"] = 1080
}
# on pan: visible_area {
        ["y"] = 1390,
        ["x"] = 0,
        ["h"] = 1421,
        ["w"] = 1080
}
# set page position 0
# painting {
        ["y"] = 1390,
        ["x"] = 0,
        ["h"] = 1421,
        ["w"] = 1080
} to 0 0
reading page:0,0,1922,3189 scale:6.37
# reflowed page 229 fullwidth: 1080 fullheight: 5499
# free koptcontext userdata: 0x2c2d2208
# hinting page 230 in background
reading page:0,0,1922,3189 scale:6.37
# free koptcontext userdata: 0x2c25f430
# hinting page 231 in background
reading page:0,0,1922,3189 scale:6.37

The crash.log reports:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica (lang=eng)
Can not convert betbod)ans‘fitntarcb311Icrncnbcllnletfdpeibuligbefuieuamiitben to utf8.
Invalid byte sequence in conversion input
Can not convert «.é?..«§T27...'I..‘I:.'..sT{S§l.TsT2-C55.7:22’ to utf8.
Invalid byte sequence in conversion input
Can not convert fcbeint,fonnff11¢icbocbmonbizgangeiufafiz,menuand;fonflnitgenb to utf8.
Invalid byte sequence in conversion input
Can not convert «.é?..«§T27...'I..‘I:.'..sT{S§l.TsT2-C55.7:22’ to utf8.
Invalid byte sequence in conversion input

This seems to be the error message associated with selecting two lines at once.

For completeness' sake, here is the standard output of build 545 when selecting the word allein in Fraktur:

 holdpan position in page {
        ["page"] = 228,
        ["x"] = 573,
        ["y"] = 671,
        ["rotation"] = 0,
        ["zoom"] = 1
}
# painting {
        ["y"] = 0,
        ["x"] = 0,
        ["h"] = 1421,
        ["w"] = 1080
} to 0 0
# in hold state...
....
# hold_release detected in slot 0
start tesseract OCR engine in data for deu-frak language
# OCRed word: allein
# lookup word: allein
# stripped word: allein
# showing quick lookup dictionary window
# painting {
        ["y"] = 0,
        ["x"] = 0,
        ["h"] = 1421,
        ["w"] = 1080
} to 0 0
# update region {
        ["y"] = 346,
        ["x"] = 75,
        ["h"] = 746.36,
        ["w"] = 930
}
@giorgio130
KOReader Community member

I'll try a bisect to find out what is the offending commit. Thanks for the example document!

@giorgio130
KOReader Community member

Oh, I see it could already be fixed by @chrox :
9f42289
Let's hope so ;)

@hchaojie

@Markismus

How would I pipe the standard output to a file, too?

use this:

./reader.lua -d /mnt/koreader/library >crash.log 2>&1
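
To answer the earlier question directly: yes, 1> and 2> can be combined so that the two streams end up in separate files. A minimal sketch, in which run_reader is a hypothetical stand-in for ./reader.lua -d /mnt/koreader/library and the log file names are placeholders:

```shell
#!/bin/sh
# Hypothetical stand-in for ./reader.lua: writes one line to each stream.
run_reader() {
    echo "# debug line on stdout"
    echo "error line on stderr" >&2
}

# Separate files: stdout goes to one log, stderr to another.
run_reader 1>standardoutput.log 2>crash.log

# Single file: redirect stdout first, then duplicate stderr onto it.
# The order matters; "2>&1 >combined.log" would leave stderr on the terminal.
run_reader >combined.log 2>&1
```

The combined form keeps debug lines and error messages in the order they were written, which makes it easier to see which tap produced which Tesseract message.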
@Markismus
KOReader Community member

No improvement in build 570. When trying to select a word, the debug report shows:

# hinting page 228 in background
reading page:0,0,1922,3189 scale:6.37
# free koptcontext userdata: 0x2c681638
# hinting page 229 in background
reading page:0,0,1922,3189 scale:6.37
# free blitbuffer cdata<struct BlitBuffer>: 0x2c681058
# free koptcontext userdata: 0x2c67f1f0
# in tap state...
# set up hold timer
# in tap state...
...
# hold position in page {
        ["page"] = 227,
        ["x"] = 580,
        ["y"] = 3381,
        ["rotation"] = 0,
        ["zoom"] = 1
}
# selected word:
# in hold state...
# holdpan position in page {
        ["page"] = 227,
        ["x"] = 579,
        ["y"] = 3381,
        ["rotation"] = 0,
        ["zoom"] = 1
}
# selected text:
# in hold state...
# holdpan position in page {
        ["page"] = 227,
        ["x"] = 580,
        ["y"] = 3381,
        ["rotation"] = 0,
        ["zoom"] = 1
}
# selected text:
# in hold state...
# in hold state...
# in hold state...
# in hold state...
# hold_release detected in slot 0
@Markismus
KOReader Community member

There is some change compared to build 570. Still, only whole lines are selected in non-reflow mode and no selection is possible in reflow mode. However, the debug output of build 583 shows, over and over, an empty string returned from OCR after selecting a word:

# selected text: {
    ["pos0"] = {
        ["page"] = 227,
        ["x"] = 171.09641097818,
        ["y"] = 276.7663617171,
        ["rotation"] = 0,
        ["zoom"] = 2.8363273453094
    },
    ["pboxes"] = {
        [1] = {
            ["y"] = 272,
            ["x"] = 16,
            ["h"] = 10,
            ["w"] = 274
        }
    },
    ["word"] = "",
    ["pos1"] = {
        ["page"] = 227,
        ["x"] = 170.74384236453,
        ["y"] = 276.7663617171,
        ["rotation"] = 0,
        ["zoom"] = 2.8363273453094
    },
    ["sboxes"] = {
        [1] = {
            ["y"] = 272,
            ["x"] = 16,
            ["h"] = 10,
            ["w"] = 274
        }
    }
}
@Markismus
KOReader Community member

Dictionary lookup is not functioning in high Render quality!!

I found debug log entries similar to those reported above for build 566, so I wondered whether the problem was the same and whether I could pinpoint it.

Log entry in build 545, similar to build 566:

# hold_release detected in slot 0
# OCRed word: ---i-«---·»·-·H-sq-s---s«s--«---oi
# lookup word: ---i-«---·»·-·H-sq-s---s«s--«---oi
# stripped word: i-«---·»·-·H-sq-s---s«s--«---oi
# in tap state...

I found that the render quality matters. With render quality high there is no word selection in reflow for Fraktur; with default and normal there is word selection and some lookups succeed.

high quality:

# OCRed word: -»«-«-·-«--s«-----·««.-.».---.«
# lookup word: -»«-«-·-«--s«-----·««.-.».---.«
# stripped word: »«-«-·-«--s«-----·««.-.».---.«
Can not convert »«-«-·-«--s«-----·««.-.».---.« to utf8.
Invalid byte sequence in conversion input

default and low quality:

    Line 1699: # lookup word: Llllfcms-
    Line 2180: # lookup word: Bewegvstgs
    Line 2346: # lookup word: nicht
    Line 2490: # lookup word: nicht
    Line 2690: # lookup word: nicht
    Line 2872: # lookup word: weil
    Line 3002: # lookup word: weil
    Line 3214: # lookup word: weil
    Line 3381: # lookup word: weil
    Line 3622: # lookup word: -
    Line 3752: # lookup word: Auf-MS
    Line 4105: # lookup word: Llllfcms-
    Line 4254: # lookup word: Anfang
    Line 4393: # lookup word: Anfang
    Line 4478: # lookup word: Anfang
    Line 4603: # lookup word: Anfccnss

Bewegung became Bewegvstgs, and Anfang could not be read when a comma followed it, resulting in Auf-MS, Llllfcms- and Anfccnss.
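
A listing like the one above can be reproduced with standard tools once the debug output has been saved to a file. The sketch below assumes a hypothetical debug.log (filled here with sample lines in the format shown in this thread) and uses a helper named lookup_counts, which is an illustration only and not part of KOReader:

```shell
#!/bin/sh
# Sample debug output; in practice this is the file captured from reader.lua.
cat > debug.log <<'EOF'
# OCRed word: nicht
# lookup word: nicht
# stripped word: nicht
# lookup word: weil
# lookup word: nicht
# lookup word: Anfccnss
EOF

# Print every lookup line together with its line number in the log.
grep -n '^# lookup word: ' debug.log

# Count how often each word was looked up, most frequent first.
lookup_counts() {
    sed -n 's/^# lookup word: //p' "$1" | sort | uniq -c | sort -rn
}
lookup_counts debug.log
```

Comparing the counts between two render quality settings gives a quick impression of how many lookups returned a real word and how many returned OCR noise.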

@Markismus
KOReader Community member

After these results I checked build 583: it works with the render setting on default and on high, once the --utf8-input --utf8-output options are added on line 22 of readerdictionary.lua!
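
One plausible reading of the "Invalid byte sequence in conversion input" messages, which is an assumption and not confirmed in this thread, is that the looked-up word was being converted to UTF-8 from a locale encoding that cannot represent characters such as «. The sketch below only illustrates that failure mode with iconv as a stand-in converter; converts_from is a hypothetical helper, and the sample word H« is taken from the build 549 log:

```shell
#!/bin/sh
# « is U+00AB; its UTF-8 encoding is the two bytes 0xC2 0xAB (octal 302 253).
word=$(printf 'H\302\253')

# Returns 0 if stdin can be converted from the given source encoding to UTF-8.
converts_from() {
    iconv -f "$1" -t UTF-8 >/dev/null 2>&1
}

# Treating the bytes as plain ASCII fails on the first non-ASCII byte.
if printf '%s' "$word" | converts_from ASCII; then
    echo "ASCII: ok"
else
    echo "ASCII: invalid byte sequence"
fi

# Declaring the input as UTF-8 lets the same bytes pass unchanged.
if printf '%s' "$word" | converts_from UTF-8; then
    echo "UTF-8: ok"
else
    echo "UTF-8: invalid byte sequence"
fi
```

If this reading is right, telling the dictionary program that its input is already UTF-8 would avoid the failing conversion, which matches the observation that adding --utf8-input --utf8-output made lookups work.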

@Markismus Markismus closed this Oct 17, 2013
@chrox chrox added a commit to chrox/koreader that referenced this issue Oct 24, 2013
@chrox chrox highlight word from scratch instead of reusing rectmaps in reflowing mode

Totally revert the OCR in reflowed page to build 545.
And this should fix #309.
184a6f5