[UX] Add InfoMessage about OCR #3718

Merged
Frenzie merged 3 commits into koreader:master from Frenzie:ocr-info on Mar 5, 2018

4 participants
@Frenzie
Member

Frenzie commented Mar 4, 2018

screenshot_2018-03-04_21-07-06

@Frenzie Frenzie added the UX label Mar 4, 2018

@Frenzie

Member

Frenzie commented Mar 4, 2018

@poire-z

Contributor

poire-z commented Mar 4, 2018

Looks fine.
I discovered the need for these tessdata files recently.
As we are up to tesseract 3.04, and the language data are a single file for 3.04 (vs a tar.gz with 3.02 with other files), I took the 3.04 file, and it seems to work.
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
Could make your message shorter (by shortening the last part).

@Frenzie

Member

Frenzie commented Mar 4, 2018

Yes, I updated tesseract myself, didn't I. I didn't. I am "working" on updating it to 3.05, see koreader/koreader-base#555. There's this mysterious thing where something works on my computer and not in Travis. :-P

I copied it from the wiki with minor stylistic improvements only. I'll fix up the version number.

https://github.com/koreader/koreader/wiki/Dictionary-support#dictionary-lookups-in-scanned-pages

@Frenzie

Member

Frenzie commented Mar 5, 2018

The alternative to the conveniently packaged 3.02 files is to manually grab the required files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00

Better? Worse?

@poire-z

Contributor

poire-z commented Mar 5, 2018

Are they all really needed?
When I went to https://github.com/tesseract-ocr/tesseract/wiki/Data-Files, in the Data Files for Version 3.04/3.05 section, there was just a single <lang>.traineddata file to download per language, with no mention of the others (cube, bigrams... except further down for Hindi and Arabic).
So I just put eng.traineddata and fra.traineddata in my tessdata dir. I don't actually know whether they are used, or which one, but I did get mostly accurate words when holding on a scanned PDF. I don't know whether the other files are just helpers that make things even more accurate, or whether tesseract even uses them in our use case.
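The setup described above (one `<lang>.traineddata` file per language in a tessdata directory) can be checked with a small script. This is only a sketch in Python, not KOReader code: the function names are hypothetical, and the location of the tessdata directory on a given device is something you have to supply yourself.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a <lang>.traineddata file.

    `tessdata_dir` is whatever directory you copied the files into; this
    sketch makes no assumption about where KOReader keeps it.
    """
    root = Path(tessdata_dir)
    if not root.is_dir():
        return []
    # "eng.traineddata" -> "eng", "chi_sim.traineddata" -> "chi_sim"
    return sorted(p.stem for p in root.glob("*.traineddata") if p.is_file())


def missing_languages(tessdata_dir, wanted):
    """Return the codes in `wanted` that have no traineddata file yet."""
    have = set(installed_languages(tessdata_dir))
    return [lang for lang in wanted if lang not in have]
```

For example, `missing_languages("tessdata", ["eng", "fra", "nld"])` would list the languages whose data file still needs to be downloaded. It only checks that the files exist, not that tesseract can load them.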

@Frenzie

Member

Frenzie commented Mar 5, 2018

Hm, good point. Dutch doesn't even have any of that other stuff. The training data for 3.04 is an order of magnitude bigger than that for 3.02.

KOReader has a built-in OCR engine for recognizing words in scanned PDF and DjVu documents. In order to use OCR in scanned pages, you need to install tesseract trained data for your document language.
You can download language data files from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-304305

@Frenzie

Frenzie Mar 5, 2018

Member

@poire-z I don't really like the anchor (makes it seem like you might have to type in the whole thing) but at the same time it might be a bit unclear without. :-/

@poire-z

poire-z Mar 5, 2018

Contributor

You could just say:
You can download language data files for version 3.04 from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

(We'll have to think about updating this text on next tesseract updates)
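One way to make that later update less error-prone is to keep the version number in a single place and build the message from it. The following is a rough Python sketch of the idea only; KOReader's frontend is written in Lua, and the constant and function names here are hypothetical, not taken from the actual patch.

```python
# Hypothetical names; the wording follows the message discussed in this thread.
TESSERACT_DATA_VERSION = "3.04"
DATA_FILES_URL = "https://github.com/tesseract-ocr/tesseract/wiki/Data-Files"


def ocr_info_message(version=TESSERACT_DATA_VERSION, url=DATA_FILES_URL):
    """Build the OCR help text with the data version filled in."""
    return (
        "KOReader has a built-in OCR engine for recognizing words in "
        "scanned PDF and DjVu documents. In order to use OCR in scanned "
        "pages, you need to install tesseract trained data for your "
        "document language.\n"
        f"You can download language data files for version {version} "
        f"from {url}"
    )
```

With this shape, a tesseract update would only require changing `TESSERACT_DATA_VERSION`, although a translated string would still need the version passed in as a placeholder.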

@Frenzie Frenzie merged commit 076bf40 into koreader:master Mar 5, 2018

1 check passed

ci/circleci Your tests passed on CircleCI!

@Frenzie Frenzie deleted the Frenzie:ocr-info branch Mar 5, 2018

@eheader

eheader commented Mar 18, 2018

When I OCR a Chinese GB version, it does work every time. But after I turn several pages forward, the InfoMessage appears. Why? How can I fix it?

@Frenzie

Member

Frenzie commented Mar 18, 2018

Note that for now it says "No OCR results or no language data."

Is the book something that can be shared? What are the steps to reproduce the issue?

@eheader

eheader commented Mar 18, 2018

My firmware version is 5.4.3, on a Paperwhite.
When I start KOReader with the "start the file manager" option, the dictionary function often does not work. When I start KOReader with the "no framework" version, it works. I used KOReader to read a scanned Chinese book. When I first use the dictionary function, it works, but after I turn two more pages, the "No OCR results or no language data." message pops up.

@Frenzie

Member

Frenzie commented Mar 18, 2018

Is this book freely available by any chance?

@eheader

eheader commented Mar 19, 2018

It has nothing to do with which book it is. Any scanned GB Chinese book will behave the same.

@eheader

eheader commented Mar 25, 2018

Is there any way to solve this problem?

@Frenzie

Member

Frenzie commented Mar 25, 2018

I can't reproduce it. Given #3688 it doesn't sound like @poire-z ran into it either. Could you please open a new issue (including crash.log and all that)?

@poire-z

Contributor

poire-z commented Mar 25, 2018

I haven't really turned many PDF pages in my life :)

But it looks more like a Kindle related issue:

When I start KOReader with the "start the file manager" option, the dictionary function often does not work. When I start KOReader with the "no framework" version, it works.

and I don't really know what that means.

@NiLuJe

Member

NiLuJe commented Jun 30, 2018

FWIW, the files from the table under the 4.00 section of https://github.com/tesseract-ocr/tesseract/wiki/Data-Files appeared to work (as in, the test suite ran fine) on my end.

The slightly more up to date ones from the split repos (-fast) did not, on the other hand (properly failing to load with something resembling a correct warning about an unsupported format, IIRC).
