Tesseract OCR and German Fraktur #306

Closed
Markismus opened this Issue Oct 13, 2013 · 11 comments

@Markismus
Member

I want to enable OCR for German Fraktur (Gothic script). Tesseract supports it through the file "deu-frak.traineddata". Since this filename is clearly not just "eng" or "deu", I guess I should add a new language specifically for German Fraktur.

I wondered whether I could achieve this by adding German to the following lines in default.lua:
-- document languages for OCR
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"English", "Chinese_S", "Chinese_T", "D_Fraktur"}
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "chi_sim", "chi_tra", "deu-frak"} -- ISO 639-3 language string,
DKOPTREADER_CONFIG_DOC_DEFAULT_LANG_CODE = "eng" -- and make sure you have corresponding training data

"deu-frak" is of course not a correct language string, but I do have the corresponding training data.

Or should I use "deu", and would Tesseract know that deu-frak is a subset?

@Markismus
Member

It works with separate settings:
-- document languages for OCR
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"English", "Greek", "D-Fraktur", "Deutsch"}
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "grc", "deu-frak", "deu"} -- ISO 639-3 language string,
DKOPTREADER_CONFIG_DOC_DEFAULT_LANG_CODE = "eng" -- and make sure you have corresponding training data

Selecting "Deutsch" doesn't enable "deu-frak". Selecting "D-Fraktur" works with the normal German dictionary (Duden)!!
There is a pause after pressing a language button, probably while the OCR data for the newly selected language is loaded.
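
For anyone else adding a language, here is a minimal sketch of how I understand the two lists to fit together, assuming labels and codes are matched by position and that Tesseract looks for a file named after the code plus ".traineddata". The helpers `lang_code_for` and `traineddata_name` are hypothetical and only for illustration; they are not KOReader functions.

```lua
-- Settings as posted above (default.lua), made local for this sketch.
local DOC_LANGS_TEXT = {"English", "Greek", "D-Fraktur", "Deutsch"}
local DOC_LANGS_CODE = {"eng", "grc", "deu-frak", "deu"}

-- Hypothetical helper: return the code paired with a button label,
-- assuming labels and codes are matched by position.
local function lang_code_for(label)
    for i, text in ipairs(DOC_LANGS_TEXT) do
        if text == label then
            return DOC_LANGS_CODE[i]
        end
    end
    return nil
end

-- Hypothetical helper: the training data file expected for a code.
local function traineddata_name(code)
    return code .. ".traineddata"
end

print(traineddata_name(lang_code_for("D-Fraktur")))  -- deu-frak.traineddata
print(traineddata_name(lang_code_for("Deutsch")))    -- deu.traineddata
```

Each code names its own file, which would explain why selecting "Deutsch" does not pick up the Fraktur data.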

I really like this!!!

The problem, though, is the lack of space for the buttons. With more languages, we could use two bars for the document language. How could I do this? I will need buttons for Ancient Greek and German Fraktur. Given that four languages are already that cramped, I wonder how to fit English, Chinese (x2), Greek and German Fraktur on the bar.

Could the width of the bar be increased to accommodate all languages?

@chrox
Member
chrox commented Oct 13, 2013

I have removed the traditional Chinese language, since OCR performance for Chinese is so poor that it's practically useless. The ToggleSwitch width can now be specified in KoptOption, as added in 3e94520.

@Markismus
Member

Great!

@chrox
Member
chrox commented Oct 14, 2013

Actually, when you select a new OCR language, the OCR engine is not reinitialized immediately. That happens only when an actual OCR is requested and the requested OCR language differs from the one already initialized. See the fix in chrox/libk2pdfopt@478b08c
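
The behavior described above can be sketched roughly as follows. This is an illustration only, not the code in libk2pdfopt: `engine`, `select_lang` and `ocr_word` are hypothetical names, and the "engine" here just tracks which language it was last initialized with.

```lua
-- Hypothetical stand-in for the real OCR engine; it only tracks state.
local engine = { lang = nil, init_count = 0 }

local selected_lang = "eng"

-- Selecting a language only records the choice; nothing is loaded yet.
local function select_lang(code)
    selected_lang = code
end

-- The engine is (re)initialized lazily: only when OCR is actually
-- requested and the requested language differs from the loaded one.
local function ocr_word(image)
    if engine.lang ~= selected_lang then
        engine.lang = selected_lang
        engine.init_count = engine.init_count + 1
    end
    return ("<text of %s via %s>"):format(image, engine.lang)
end

select_lang("deu")
select_lang("deu-frak")   -- still no initialization
ocr_word("word1")         -- first initialization happens here
ocr_word("word2")         -- same language, no reinitialization
print(engine.init_count)  -- 1
```

So switching back and forth between languages costs nothing until a word is actually sent to OCR.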

@Markismus
Member

So what would then explain the delay you get when changing languages for the first time?

@chrox
Member
chrox commented Oct 15, 2013

You will notice a freeze when you select a new doc language in reflowing mode, because the doc language is a parameter in koptcontext, just like font size. Changing any of these parameters triggers a re-reflow of the current page.
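
One way to picture this, assuming reflowed pages are cached under a key built from the context parameters. The names below (`cache_key`, `get_reflowed_page`, the `ctx` fields) are hypothetical and do not reflect the actual koptcontext API.

```lua
local reflow_count = 0
local cache = {}

-- Hypothetical key built from the page number and the context parameters.
local function cache_key(page, ctx)
    return table.concat({page, ctx.font_size, ctx.doc_lang}, "|")
end

local function get_reflowed_page(page, ctx)
    local key = cache_key(page, ctx)
    if not cache[key] then
        reflow_count = reflow_count + 1  -- the expensive step (the freeze)
        cache[key] = { page = page, lang = ctx.doc_lang }
    end
    return cache[key]
end

local ctx = { font_size = 1.0, doc_lang = "eng" }
get_reflowed_page(1, ctx)   -- reflows
get_reflowed_page(1, ctx)   -- same parameters, served from the cache
ctx.doc_lang = "deu-frak"
get_reflowed_page(1, ctx)   -- language changed, reflows again
print(reflow_count)         -- 2
```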

@Markismus
Member

So that's why the page alters slightly when I change languages!

@Markismus Markismus closed this Oct 17, 2013
@Markismus Markismus reopened this Oct 25, 2013
@Markismus
Member

Build 613 (the second of that name): no dictionary support at all! Major step back! When selecting a word, the log shows this:

# Not implemented yet
# OCRed text:
# painting {
    ["y"] = 2780,
    ["x"] = 0,
    ["h"] = 1421,
    ["w"] = 1080
} to 0 0
@chrox
Member
chrox commented Oct 25, 2013

I hope 2590cc7 will fix this. Will you try changing self:getReflowedTextBoxes into self:getReflowedTextBoxesFromScratch in frontend/document/koptinterface.lua:753?

@Markismus
Member

Dictionary lookup works, but any selection only selects one character. So I am getting entries for ß (as ss), m, d, u, etc.:

# start tesseract OCR engine in data for deu-frak language
# OCRed word: ß
# lookup word: ß
# stripped word: ß
# showing quick lookup dictionary window
@Markismus
Member

Checked against build 613 (the first of that name). Selected the word Wesens and got the right dictionary entry from Duden; the debug log shows:

# OCRed word: Wesenö
# lookup word: Wesenö
# stripped word: Wesenö
# io.popen command: ./sdcv --utf8-input --utf8-output -nj "Wesenö"
# std_out file (0x2c7c40f0)
# showing quick lookup dictionary window
@Markismus Markismus closed this Nov 23, 2013