# Diacritics restoration demos
This notebook collects usage examples for the basic diacritics restoration tasks. It covers the main functions of the module diacritics_restorer.py and some of those in accents.py.

## 0. Modules import

In [3]:
import diacritics_restorer as dr
import accents
print("ready.")

ready.


## 1. Using the accents module

In [42]:
# Strip accents from a sentence.
stripped_string = accents.strip("Celé spřežení vězelo ve sněhu jako v hrobě.")
print(stripped_string)
# Compose letters with accents.
composed_string = accents.compose([
    ("c", accents.CARON), ("o", accents.ACUTE), ("m", accents.NO_ACCENT),
    ("p", accents.NO_ACCENT), ("o", accents.CIRCUMFLEX), ("s", accents.CEDILLA),
    ("e", accents.UMLAUT),
])
print(composed_string)
# Decompose accented characters into (letter, accent) pairs.
print([accents.decompose(character) for character in composed_string])

Cele sprezeni vezelo ve snehu jako v hrobe.
čómpôşë
[('c', 'ˇ'), ('o', '´'), ('m', ' '), ('p', ' '), ('o', 'ˆ'), ('s', '¸'), ('e', '¨')]
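
The internals of the accents module are not shown in this notebook. As a rough illustration of the same three operations, here is a minimal sketch built only on the standard library's unicodedata module. The function names strip_accents, decompose_char and compose_char are hypothetical, and unlike the module (which prints spacing accent characters such as 'ˇ' and uses ' ' for no accent), this sketch represents accents as Unicode combining marks and uses an empty string for no accent.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def decompose_char(character):
    """Split one character into (base letter, combining marks)."""
    decomposed = unicodedata.normalize("NFD", character)
    return decomposed[0], decomposed[1:]


def compose_char(base, marks=""):
    """Join a base letter and combining marks into one precomposed character."""
    return unicodedata.normalize("NFC", base + marks)


print(strip_accents("Celé spřežení vězelo ve sněhu jako v hrobě."))
print(compose_char("c", "\u030c"))  # letter c plus COMBINING CARON
print(decompose_char("\u00f4"))     # o with circumflex
```

Letters that Unicode does not decompose canonically, such as đ or ł, pass through strip_accents unchanged, so a real implementation would probably need an explicit mapping for them.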


## 2. Using the diacritics_restorer module
The following cells only show how to instantiate the diacritics restorer, train and test it, and save the trained model. The training set (the Project Gutenberg collection of František Omelka) is far too small for the model to restore much beyond the training text itself. For more impressive examples, scroll down to section 3.
### 2.1 Buffering a corpus file as (n-gram, tag) pairs

In [46]:
buf = dr.CorpusNgramBuffer("corpora/simple.txt",3,1)
print(next(buf))

BUFFERED line #0 of corpora/simple.txt - "Valašsko Tichý a zadumaný kraj"
[('V', ' '), ('Va', ' '), ('Val', ' '), ('ala', ' '), ('las', 'ˇ'), ('ass', ' '), ('ssk', ' '), ('sko', ' '), ('ko ', ' '), ('o T', ' '), (' Ti', ' '), ('Tic', ' '), ('ich', ' '), ('chy', '´'), ('hy ', ' '), ('y a', ' '), (' a ', ' '), ('a z', ' '), (' za', ' '), ('zad', ' '), ('adu', ' '), ('dum', ' '), ('uma', ' '), ('man', ' '), ('any', '´'), ('ny ', ' '), ('y k', ' '), (' kr', ' '), ('kra', ' '), ('raj', ' ')]
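
Judging from the output above, each pair holds the stripped n-gram that ends at a given character and the accent carried by that last character. The sketch below reproduces that pattern for a single line. It is an assumption-based illustration, not the CorpusNgramBuffer implementation: ngram_tag_pairs and split_char are hypothetical names, the tag is a combining mark rather than the spacing character printed by the module, and the third constructor argument used above is not modeled.

```python
import unicodedata

NO_ACCENT = " "  # placeholder tag for characters without a diacritic


def split_char(character):
    """Return (base letter, accent tag) for one character."""
    decomposed = unicodedata.normalize("NFD", character)
    return decomposed[0], decomposed[1:] or NO_ACCENT


def ngram_tag_pairs(line, n):
    """Pair each position's trailing stripped n-gram with that character's accent."""
    pairs = []
    bases = []
    for character in unicodedata.normalize("NFC", line):
        base, tag = split_char(character)
        bases.append(base)
        # The first n-1 positions yield shorter n-grams, as in the output above.
        pairs.append(("".join(bases[-n:]), tag))
    return pairs


pairs = ngram_tag_pairs("Valašsko Tichý", 3)
print(pairs[:5])
```

Pairs of this shape can serve as (observation, hidden tag) training examples for a sequence model.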


### 2.2 Training

In [33]:
simple_hmm = dr.HmmNgramRestorer(4).train("corpora/simple.txt")

BUFFERED line #0 of corpora/simple.txt - "Valašsko Tichý a zadumaný kraj"
training done: 7.349462032318115


### 2.3 Accent restoration
Let us restore the stripped sentence from section 1.

In [47]:
simple_hmm.restore_accents(stripped_string) # uses stripped_string from #1 above

'Celé spřežení vezelo ve sněhu jako v hrobe.'

### 2.4 Accuracy testing

In [48]:
simple_hmm.test("corpora/simple_test.txt").as_dict()

BUFFERED line #0 of corpora/simple_test.txt - "Seppala se rozhlédl."


{'accuracy': 0.9200923279841724,
 'correct': 33484,
 'incorrect': 2908,
 'word_accuracy': 0.6385522436947265,
 'words_correct': 3899,
 'words_incorrect': 2207,
 'diaword_accuracy': 0.2735648476257973,
 'diawords_correct': 772,
 'diawords_incorrect': 2050,
 'alphaword_accuracy': 0.6343605036447979,
 'alphawords_correct': 3829,
 'alphawords_incorrect': 2207}
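
Each accuracy value in the report equals correct / (correct + incorrect) for its pair of counts. The sketch below shows one way such character-level and word-level counts could be computed from a gold sentence and a restored one. It is an illustration under assumptions, not the module's test method: ratio, count_matches and evaluate are hypothetical names, words are split on single spaces, and the diaword and alphaword categories are left out because their exact definitions are not given here.

```python
import unicodedata


def ratio(correct, incorrect):
    """Accuracy as correct / (correct + incorrect)."""
    total = correct + incorrect
    return correct / total if total else 0.0


def count_matches(gold_items, predicted_items):
    """Count positions where gold and prediction agree or disagree."""
    correct = sum(g == p for g, p in zip(gold_items, predicted_items))
    return correct, len(gold_items) - correct


def evaluate(gold, predicted):
    """Compare two equally long strings character by character and word by word."""
    gold = unicodedata.normalize("NFC", gold)
    predicted = unicodedata.normalize("NFC", predicted)
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted text must have the same length")
    correct, incorrect = count_matches(gold, predicted)
    words_correct, words_incorrect = count_matches(gold.split(" "), predicted.split(" "))
    return {
        "accuracy": ratio(correct, incorrect),
        "correct": correct,
        "incorrect": incorrect,
        "word_accuracy": ratio(words_correct, words_incorrect),
        "words_correct": words_correct,
        "words_incorrect": words_incorrect,
    }


# Gold sentence from section 1 against the restoration from section 2.3.
report = evaluate("Celé spřežení vězelo ve sněhu jako v hrobě.",
                  "Celé spřežení vezelo ve sněhu jako v hrobe.")
print(report)
print(ratio(33484, 2908))  # character counts reported above, about 0.9201
```

A single wrong character makes the whole word count as incorrect, which is why word accuracy falls well below character accuracy.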

### 2.5 Serialization: saving to and loading from disk

In [50]:
simple_hmm.save("pretrained/simple.pickle")
simple_hmm2 = dr.HmmNgramRestorer.load("pretrained/simple.pickle")
simple_hmm2.restore_accents(stripped_string)

'Celé spřežení vezelo ve sněhu jako v hrobe.'
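
The .pickle extension suggests that models are stored with Python's pickle module, although the notebook does not show how save and load are implemented. The following minimal sketch shows a save/load pair of that kind. TinyRestorer and its attributes are hypothetical stand-ins, and pickling a plain dictionary of state (rather than the instance) is a design choice of this sketch only.

```python
import os
import pickle
import tempfile


class TinyRestorer:
    """Hypothetical stand-in for a trained model; not the module's class."""

    def __init__(self, n, counts=None):
        self.n = n
        self.counts = counts if counts is not None else {}

    def save(self, path):
        # Store plain data rather than the instance itself, so the file
        # stays loadable even if the class is later moved or renamed.
        state = {"n": self.n, "counts": self.counts}
        with open(path, "wb") as handle:
            pickle.dump(state, handle, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as handle:
            state = pickle.load(handle)
        return cls(state["n"], state["counts"])


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tiny.pickle")
    TinyRestorer(3, {("las", "\u030c"): 1}).save(path)
    restored = TinyRestorer.load(path)

print(restored.n, restored.counts)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.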

## 3. Using a pretrained HMM on a custom sentence
Note that, for better restoration quality, we often use n-gram models with $n>4$, which were omitted from the submissions. They can be downloaded from http://herbert.saarland/pretrained.zip
### 3.1 Croatian

In [51]:
hmm_hr = dr.HmmNgramRestorer.load("pretrained/hr/5-gram.pickle")
print("ready.")

ready.


In [52]:
hmm_hr.restore_accents(accents.strip("boškovićev uspjeh slave tri države: hrvatska, italija i srbija."))

'boskovićev uspjeh slave tri države: hrvatska, italija i srbija.'

In [53]:
hmm_hr.restore_accents(accents.strip("američko državljanstvo vraćeno mu je postumno odlukom američkog senata 1975. godine."))

'američko državljanstvo vraćeno mu je postumno odlukom američkog senata 1975. godine.'

### 3.2 Irish Gaelic

In [70]:
hmm_ga = dr.HmmNgramRestorer.load("pretrained/ga/8-gram.pickle")
print("ready.")

ready.


In [71]:
hmm_ga.restore_accents("ba earnail thabhachtach den gheilleagar e an mhianadoireacht.")

'ba éarnail thabhachtach den gheilleagar e an mhianadoireacht.'

In [73]:
hmm_ga.restore_accents(accents.strip("bhí sé go tréan ag cur catha ar a chomharsana san oirthear."))

'bhí sé go tréan ag cur catha ar a chomharsana san oirthear.'

### 3.3 Czech

In [3]:
hmm_cs = dr.HmmNgramRestorer.load("pretrained/cs/6-gram.pickle")
print("ready.")

ready.


In [26]:
hmm_cs.restore_accents(accents.strip('velmi tmavý povrch komet jim dovoluje absorbovat teplo potřebné na jejich odplynování.'))

'velmi tmavý povrch komet jim dovoluje absorbovat teplo potřebné na jejích odplynování.'

In [20]:
hmm_cs.restore_accents("dalsi vyznamne pozorovani kometarniho rozpadu byl dopad komety shoemaker-levy 9, pozorovany roku 1993.")

'další významně pozorovaní kometarního rozpadu byl dopad komety shoemaker-levy 9, pozorovany roku 1993.'

### 3.4 Slovak

In [74]:
hmm_sk = dr.HmmNgramRestorer.load("pretrained/sk/6-gram.pickle")
print("ready.")

ready.


In [75]:
hmm_sk.restore_accents(accents.strip("j. lenoir skonštruoval v roku 1860 dvojtaktný dvojčinný posúvačový motor na svietiplyn."))

'j. lenoir skonštruoval v roku 1860 dvojtaktný dvojčinný posuvačový motor na svietiplyn.'

### 3.5 Hungarian

In [5]:
hmm_hu = dr.HmmNgramRestorer.load("pretrained/hu/6-gram.pickle")
print("ready.")

ready.


In [8]:
hmm_hu.restore_accents(accents.strip("alexiosz ígéretet tett rá, hogy fedezi az egyiptomba induló keresztesek költségeit, ha elűzik bitorló nagybátyját."))

'alexiosz igéretet tett rá, hogy fedezi az egyiptomba induló kéresztésék költségeit, ha eluzik bitorló nagybátyjat.'

### 3.6 French

In [4]:
hmm_fr = dr.HmmNgramRestorer.load("pretrained/fr/4-gram.pickle")
print("ready.")

ready.


In [61]:
hmm_fr.restore_accents(accents.strip("les conditions météorologiques sont très mauvaises entre avril 1315 et avril 1316."))

'les conditions metéorologiques sont très mauvaises entre avril 1315 et avril 1316.'