paalberti/tesseract-dan-fraktur: Tesseract OCR training data for Danish written in Fraktur script and a few other languages

Various training data files for Tesseract OCR (version 3.02)

* dan_frak/: Danish written in Fraktur script (orthography prior to ca. 1867).
* deu_frak/: German written in Fraktur script.
* swe_frak/: Swedish written in Fraktur script. The Swedish wordlists are from
Projekt Runeberg, http://runeberg.org/words/
* dan/: a slightly modified version of the Danish .traineddata shipped with upstream Tesseract,
changed so that it does not constantly output fi and fl ligatures. As of Tesseract version 3.02
this is outdated. If you have the ligature problem, upgrading Tesseract is the better solution.

The *_frak/ directories each contain a primitive script that compiles the data files. It only
works on Unix-like machines. If you aren't interested in training Tesseract yourself, just
find the *.traineddata file for your language, save it to your Tesseract installation's
data directory, and you should be ready for OCR.
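
The install-and-run step described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names install_traineddata and ocr_command are hypothetical, the file name dan_frak.traineddata is assumed from the directory name, and the location of the data directory depends on your installation, so it is passed in explicitly. The command follows the usual "tesseract imagename outputbase -l lang" form, with the language code taken from the file name.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata, tessdata_dir):
    """Copy a .traineddata file into an existing Tesseract data directory.

    Both arguments are pathlib.Path objects. The data directory is not
    guessed here; its location varies between installations.
    """
    if traineddata.suffix != ".traineddata":
        raise ValueError("not a .traineddata file: %s" % traineddata)
    if not tessdata_dir.is_dir():
        raise FileNotFoundError("data directory not found: %s" % tessdata_dir)
    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(str(traineddata), str(tessdata_dir / traineddata.name)))


def ocr_command(image, output_base, traineddata):
    """Build the tesseract command line for one image.

    The language code given to -l is the file name without the
    .traineddata suffix, e.g. dan_frak for dan_frak.traineddata.
    """
    return ["tesseract", str(image), str(output_base), "-l", traineddata.stem]
```

To use it, call install_traineddata with the path to the downloaded file and your own data directory, then pass the list from ocr_command to subprocess.run. Tesseract writes the recognized text to a file named after the output base with a .txt extension.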