<a href="https://colab.research.google.com/github/iued-uni-heidelberg/corpustools/blob/main/S105ocrCorrectionV06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Correction of OCR-generated text
- using correction dictionaries
- using rules

### Notes on the development
- Algorithm:
1. We extract rewrite rules from the annotations.
2. If a word is not found in our corpus, it may be an error:
   * We apply the longest-first strategy for rewriting.
   * We apply all the rules, converting the word into several candidates where possible.
   * We check which candidates exist in the Wikipedia dictionary.
3. We print all candidates for annotation in a spreadsheet.
4. We record which rules are most productive ...
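
The steps above can be sketched on toy data. This is only an illustration under stated assumptions: `apply_rules`, `suggest`, `toy_rules` and `toy_vocab` are hypothetical names that do not appear elsewhere in the notebook, and the rules are hand-written here rather than extracted from annotations.

```python
def apply_rules(word, rules):
    """Return candidate corrections, trying rules with longer left-hand sides first."""
    candidates = []
    # Longest-first: a longer left-hand side carries more context and is more specific.
    for lhs, rhs in sorted(rules, key=lambda rule: len(rule[0]), reverse=True):
        if lhs in word:
            candidate = word.replace(lhs, rhs, 1)
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


def suggest(word, rules, known_words):
    """Keep only the candidates that are attested in the reference vocabulary."""
    if word in known_words:
        return []  # an attested word is not treated as an error
    return [c for c in apply_rules(word, rules) if c in known_words]


# Toy data (hypothetical): two single-letter confusions plus one rule with left context.
toy_rules = [('ճ', 'ձ'), ('ն', 'և'), ('թն', 'թև')]
toy_vocab = {'մերձակա', 'առջև', 'թեթևություն'}

print(suggest('մերճակա', toy_rules, toy_vocab))
print(suggest('թեթնություն', toy_rules, toy_vocab))
```

Filtering against the vocabulary is what keeps over-general rules such as `ն > և` from flooding the spreadsheet with non-words.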


## Service functions

In [1]:
import re, os, sys


In [47]:
# critical path function...
# function for Armenian tokenization
def tokenizeTextHY(SFIn):
    # returns one token list per non-empty line; delimiter runs are kept as tokens
    LLParagraphs = []
    with open(SFIn, 'r', encoding='utf-8') as FIn:
        countpara = 0
        for SLine in FIn:
            countpara += 1
            if countpara % 100000 == 0: print(countpara)  # progress indicator
            SLine = SLine.strip()
            if SLine == '': continue
            # raw string avoids invalid-escape warnings; the capturing group keeps the delimiters
            LLine = re.split(r'([ ,.:։;\'"()\-–!?{}\t«»]+)', SLine)
            LLParagraphs.append(LLine)
    return LLParagraphs
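
As a quick check of what the tokenizer returns, here is a minimal in-memory sketch. `DELIMS` and `tokenize_line` are hypothetical names; the delimiter set is copied from `tokenizeTextHY`, and the sample sentence is a shortened version of one from the corpus.

```python
import re

# Same delimiter set as in tokenizeTextHY, written as a raw string.
DELIMS = re.compile(r'([ ,.:։;\'"()\-–!?{}\t«»]+)')


def tokenize_line(line):
    """Split one line into tokens, keeping each delimiter run as its own item."""
    return DELIMS.split(line.strip())


sample = 'Կան մահվան քարոզիչներ, ն երկիրը:'
tokens = tokenize_line(sample)
print(tokens)

# Because of the capturing group, words sit at even positions
# and delimiter runs at odd positions.
words = [t for t in tokens[::2] if t]
print(words)
```

Since the delimiters are kept, joining the tokens reproduces the original line, but it also means that spaces and punctuation are counted in any frequency dictionary built from these lists.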


# This section applies corrections to Armenian texts and records which words have been corrected

## Downloading Armenian corpus, correcting lines, tokenizing

In [None]:
# core starts here - critical

!wget https://heibox.uni-heidelberg.de/f/c977e87cf2b244e6801b/?dl=1
!mv index.html?dl=1 KorpusARM.tgz


In [None]:
!tar xvzf KorpusARM.tgz
!mkdir KorpusARM1
!mkdir KorpusARM1/stage01
# concatenating files
!cat korpusARM/hyFiktion/* >KorpusARM1/stage01/hyFiktion.txt
!cat korpusARM/hyNatur/* >KorpusARM1/stage01/hyNatur.txt
!cat korpusARM/hyRecht/* >KorpusARM1/stage01/hyRecht.txt
!mkdir KorpusARM1/stage02

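# function for Armenian line breaks:
# joins words hyphenated across line ends and merges the lines of a paragraph into one line

def correctLineBreaksHY(FName, FNameOut):
    countHyphens = 0
    # context managers make sure both files are closed (and the output is flushed)
    with open(FName, 'r', encoding='utf-8') as FIn, open(FNameOut, 'w', encoding='utf-8') as FOut:
        for SLine in FIn:
            SLine = SLine.strip()
            if SLine == '':
                FOut.write('\n\n')  # an empty line marks a paragraph boundary
                continue
            if SLine[-1] == '-':
                # drop the line-final hyphen and glue the next line on directly
                FOut.write(SLine[:-1])
                countHyphens += 1
                continue
            FOut.write(SLine + ' ')
    print(str(countHyphens) + ' hyphens corrected')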

correctLineBreaksHY('KorpusARM1/stage01/hyFiktion.txt', 'KorpusARM1/stage02/hyFiktion.txt')
correctLineBreaksHY('KorpusARM1/stage01/hyNatur.txt', 'KorpusARM1/stage02/hyNatur.txt')
correctLineBreaksHY('KorpusARM1/stage01/hyRecht.txt', 'KorpusARM1/stage02/hyRecht.txt')
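
The effect of the line-break correction is easier to see on a few in-memory lines. The sketch below mirrors the logic of `correctLineBreaksHY` without touching files; `join_hyphenated` and the English sample are hypothetical and only for illustration.

```python
def join_hyphenated(lines):
    """Merge lines into paragraphs and glue words split by a line-final hyphen."""
    pieces = []
    n_hyphens = 0
    for line in lines:
        line = line.strip()
        if line == '':
            pieces.append('\n\n')      # paragraph boundary
        elif line.endswith('-'):
            pieces.append(line[:-1])   # drop the hyphen, no space before the next line
            n_hyphens += 1
        else:
            pieces.append(line + ' ')
    return ''.join(pieces), n_hyphens


sample = ['This is a hy-', 'phenated word', '', 'Next paragraph']
text, n = join_hyphenated(sample)
print(repr(text))
print(n, 'hyphens corrected')
```

Note that every line-final hyphen is removed, so a genuine compound that happens to break at its hyphen is also joined into one word.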


In [5]:

!wc KorpusARM1/stage02/hyFiktion.txt
!wc KorpusARM1/stage02/hyNatur.txt
!wc KorpusARM1/stage02/hyRecht.txt

   6208   92131 1081755 KorpusARM1/stage02/hyFiktion.txt
  3642  67142 870081 KorpusARM1/stage02/hyNatur.txt
   8940   86621 1288655 KorpusARM1/stage02/hyRecht.txt


### Now we tokenize the Armenian corpora

In [48]:
try:
    del LLParaHyF
except NameError:
    sys.stderr.write('LLParaHyF doesn\'t need to be deleted, not yet created...\n')

try:
    del LLParaHyN
except NameError:
    sys.stderr.write('LLParaHyN doesn\'t need to be deleted, not yet created...\n')

try:
    del LLParaHyR
except NameError:
    sys.stderr.write('LLParaHyR doesn\'t need to be deleted, not yet created...\n')

LLParaHyF doesn't need to be deleted, not yet created...
LLParaHyN doesn't need to be deleted, not yet created...
LLParaHyR doesn't need to be deleted, not yet created...


In [49]:
LLParaHyF = tokenizeTextHY('/content/KorpusARM1/stage02/hyFiktion.txt')
LLParaHyN = tokenizeTextHY('/content/KorpusARM1/stage02/hyNatur.txt')
LLParaHyR = tokenizeTextHY('/content/KorpusARM1/stage02/hyRecht.txt')

In [50]:
# we check how our specialized corpora were tokenized...
print(LLParaHyF[1])
print(len(LLParaHyF[1]))
print(len(LLParaHyF))

print(LLParaHyN[1])
print(len(LLParaHyN[1]))
print(len(LLParaHyN))

print(LLParaHyR[1])
print(len(LLParaHyR[1]))
print(len(LLParaHyR))

['', '-- ', 'Բայց', ' ', 'դու', ' ', 'ինճ', ' ', 'անմիջապես', ' ', 'ամբողջովին', ' ', 'չպիտի', ' ', 'կուլ', ' ', 'տաս', ',--- ', 'ասաց', ' ', 'նա', ' ', 'մեղմորեն', ':', '']
25
3104
['Կան', ' ', 'մահվան', ' ', 'քարոզիչներ', ', ', 'ն', ' ', 'երկիրը', ' ', 'լիքն', ' ', 'է', ' ', 'նրանցով', ', ', 'ում', ' ', 'ոլետք', ' ', 'է', ' ', 'կյանքից', ' ', 'հեռացում', ' ', 'քարոզվի', ':', '']
29
1821
['ԳԼՈՒԽ', ' ', '1', ' ', 'ՀԻՄՆԱԿԱՆ', ' ', 'ԴՐՈՒՅԹՆԵՐ']
7
4471


## Wikipedia


In [None]:
# downloading the Armenian Wikipedia
!wget https://heibox.uni-heidelberg.de/f/d1f866a61bd545318213/?dl=1
!mv index.html?dl=1 hywiki-20221101-pages-articles.txt.gz
!gunzip hywiki-20221101-pages-articles.txt.gz
# the length of wikipedia



In [None]:
!wc hywiki-20221101-pages-articles.txt

In [16]:
try:
    del LLParaWiki
except NameError:
    sys.stderr.write('LLParaWiki doesn\'t need to be deleted, not yet created...\n')

LLParaWiki doesn't need to be deleted, not yet created...


In [None]:
LLParaWiki = tokenizeTextHY('/content/hywiki-20221101-pages-articles.txt')

In [18]:
print(LLParaWiki[1])

['Հայաստան', ' , ', 'պաշտոնական', ' ', 'անվանումը՝', ' ', 'Հայաստանի', ' ', 'Հանրապետություն', ', ', 'պետություն', ' ', 'Առաջավոր', ' ', 'Ասիայում՝', ' ', 'Հայկական', ' ', 'լեռնաշխարհի', ' ', 'հյուսիսարևելյան', ' ', 'մասում', '։ ', 'Քաղաքական', ' ', 'և', ' ', 'մշակութային', ' ', 'իմաստով', ', ', 'սակայն', ', ', 'գտնվում', ' ', 'է', ' ', 'հարավարևելյան', ' ', 'Եվրոպայի', ' ', 'Կովկասյան', ' ', 'տարածաշրջանում', '։ ', 'Հյուսիսում', ' ', 'սահմանակցում', ' ', 'է', ' ', 'Վրաստանին', ', ', 'արևելքում՝', ' ', 'Ադրբեջանին', ', ', 'հարավում՝', ' ', 'Իրանին', ', ', 'իսկ', ' ', 'արևմուտքում՝', ' ', 'Թուրքիային', '։ ', 'Հարավարևելյան', ' ', 'կողմում', ' ', 'Բերձորի', ' ', 'միջանցքով', ' ', 'կապվում', ' ', 'է', ' ', 'Արցախի', ' ', 'Հանրապետությանը', ', ', 'իսկ', ' ', 'հարավ', '-', 'արևմուտքում', ' ', 'Ադրբեջանի', ' ', 'էքսկլավ', ' ', 'Նախիջևանի', ' ', 'Ինքնավար', ' ', 'Հանրապետությունն', ' ', 'է', '։ ', 'Այժմյան', ' ', 'ՀՀ', '-', 'ն', ' ', 'զբաղեցնում', ' ', 'է', ' ', 'պատմական', ' ', 'Հայաստանի', 

In [19]:
len(LLParaWiki[1])

137

In [20]:
len(LLParaWiki)

2153019

In [21]:
def llPara2dict(LLParagraphs):
    # frequency dictionary over all tokens (delimiter tokens included)
    DFreq = {}
    p = 0
    for LPara in LLParagraphs:
        p += 1
        if p % 200000 == 0:
            print(p)  # progress indicator
        for el in LPara:
            DFreq[el] = DFreq.get(el, 0) + 1
    return DFreq
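
The same counting can be done with the standard library's `collections.Counter`. This is an alternative sketch, not what the notebook uses; `count_tokens` and the toy paragraphs are hypothetical.

```python
from collections import Counter
from itertools import chain


def count_tokens(paragraphs):
    """Count every token (words and delimiter runs alike) across all paragraphs."""
    return Counter(chain.from_iterable(paragraphs))


toy = [['նա', ' ', 'ասաց', ' ', 'նա'], [], ['ասաց']]
freq = count_tokens(toy)
print(freq['նա'], freq['ասաց'], freq[' '])
print(len(freq))
```

As with `llPara2dict`, the space token is counted like any word, which is why delimiters end up among the entries of the frequency dictionary.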


In [22]:
DWiki = llPara2dict(LLParaWiki)
print(len(DWiki))

200000
400000
600000
800000
1000000
1200000
1400000
1600000
1800000
2000000
2071984


In [23]:
len(DWiki)

2071984

In [24]:
# write the frequency dictionary, most frequent tokens first
with open('hywiki-frqDict.txt', 'w', encoding='utf-8') as FOut:
    for key, val in sorted(DWiki.items(), key=lambda item: item[1], reverse=True):
        FOut.write(f'{key}\t{val}\n')

In [25]:
!wc hywiki-frqDict.txt

 2071974  4211029 42482907 hywiki-frqDict.txt


In [None]:
!head --lines=40 hywiki-frqDict.txt

## File with corrections

In [None]:
# Downloading the file with corrections
# !wget https://heibox.uni-heidelberg.de/f/14706c04a4024b2f937d/?dl=1
# without ճեր
!wget https://heibox.uni-heidelberg.de/f/4a24540473564788853d/?dl=1

!mv index.html?dl=1 Pilot-Corrections-all.tsv


In [8]:
!wc Pilot-Corrections-all.tsv

  324  2854 26730 Pilot-Corrections-all.tsv


In [9]:
# critical path function
def readCorrectionsFrq(colNumberOri, colNumberCorrect, colNumberFrq, SFIn, SFOut = None):
    # returns a list of bracketed (wrong, correct, frq) tuples and a dict wrong -> correct
    LTWrongCorrect = []
    DWrongCorrect = {}
    with open(SFIn, 'r', encoding='utf-8') as FIn:
        count = 0
        for SLine in FIn:
            count += 1
            if count == 1: continue  # skip the header row
            SLine = SLine.rstrip('\n')
            LLine = SLine.split('\t')
            SWrong = LLine[colNumberOri]
            SCorrect = LLine[colNumberCorrect]
            SFrq = LLine[colNumberFrq]
            if SWrong != '' and SCorrect != '' and SWrong != SCorrect:
                # square brackets mark the word boundaries
                TWrongCorrect = (f'[{SWrong}]', f'[{SCorrect}]', f'{SFrq}')
                LTWrongCorrect.append(TWrongCorrect)
                if SWrong in DWrongCorrect:
                    SCorrect1 = DWrongCorrect[SWrong]
                    if SCorrect1 != SCorrect:
                        # the same wrong form has two different corrections: report the conflict
                        print(SWrong + '\t' + SCorrect1 + '\t' + SCorrect)
                DWrongCorrect[SWrong] = SCorrect
    if SFOut:
        # open the output only when a file name is given (open(None, 'w') raises a TypeError)
        with open(SFOut, 'w', encoding='utf-8') as FOut:
            for SWrong, SCorrect, SFrq in LTWrongCorrect:
                FOut.write(f'{SWrong}\t{SCorrect}\t{SFrq}\n')
    print(len(DWrongCorrect))

    return LTWrongCorrect, DWrongCorrect
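
The list and the dictionary returned above can differ in size, because the dictionary keeps only the last correction seen for each wrong form. The sketch below shows this on a hypothetical two-column table; `read_pairs`, `toy_tsv` and the column layout are assumptions and do not reflect the real layout of `Pilot-Corrections-all.tsv`.

```python
import csv
import io


def read_pairs(handle, col_wrong, col_correct):
    """Collect (wrong, correct) pairs and report wrong forms with competing corrections."""
    pairs, mapping, conflicts = [], {}, []
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) <= max(col_wrong, col_correct):
            continue  # short or empty row
        wrong, correct = row[col_wrong], row[col_correct]
        if not wrong or not correct or wrong == correct:
            continue
        pairs.append((f'[{wrong}]', f'[{correct}]'))
        if wrong in mapping and mapping[wrong] != correct:
            conflicts.append((wrong, mapping[wrong], correct))
        mapping[wrong] = correct  # the last correction wins
    return pairs, mapping, conflicts


# Hypothetical toy table: header, three usable rows, one unchanged row, one empty cell.
toy_tsv = 'wrong\tcorrect\nառջն\tառջև\nինճ\tինձ\nինճ\tինչ\nնա\tնա\n\tես\n'
pairs, mapping, conflicts = read_pairs(io.StringIO(toy_tsv), 0, 1)
print(len(pairs), len(mapping), conflicts)
```

Here three pairs yield only two dictionary entries, since one wrong form is recorded with two different corrections.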


In [None]:
# reading corrections for word forms, with the purpose of generalizing them
# goal: to display candidates for correction -- based on existing corrections (?)
LTWrongCorrectWordF, DWrongCorrectWordF = readCorrectionsFrq(1, 4, 9, '/content/Pilot-Corrections-all.tsv', SFOut = 'Pilot-Corrections-all-WordForm.tsv')
# LTWrongCorrectLemmaF, DWrongCorrectLemmaF = readCorrectionsFrq(3, 6, 9, '/content/Pilot-Corrections-all.tsv', SFOut = 'Pilot-Corrections-all-Lemma.tsv')
print(LTWrongCorrectWordF)
# print(LTWrongCorrectLemmaF)
# ինձ|ինչ
# առջև|առջևից
# ինչ|ինձ
# ինչ|ես
# գիտենալ|իմանալ


In [11]:
print(len(DWrongCorrectWordF))
# print(len(DWrongCorrectLemmaF))


171


In [None]:
for key, value in sorted(DWrongCorrectWordF.items()):
    print(f'{key}\t{value}')

## Discover candidate rewriting rules systematically

In [27]:
print(len(LTWrongCorrectWordF))

192


In [None]:
for SWrong, SCorrect, Frq in LTWrongCorrectWordF:
    print(SWrong, SCorrect)

In [29]:
def getPrefInfSuf(wrd1, wrd2):
    try:
        cpref = os.path.commonprefix([wrd1, wrd2])
        drw1 = wrd1[::-1]
        drw2 = wrd2[::-1]
        ffusc = os.path.commonprefix([drw1, drw2])
        csuff = ffusc[::-1]
    except:
        sys.stderr.write('error finding pref- and suffix')
        cpref = None
        csuff = None

    try:
        wrd1minpref = wrd1.removeprefix(cpref)
        wrd2minpref = wrd2.removeprefix(cpref)
        wrd1centre = wrd1minpref.removesuffix(csuff)
        wrd2centre = wrd2minpref.removesuffix(csuff)
    except:
        sys.stderr.write('error finding centre 1 and 2')
        wrd1centre = None
        wrd2centre = None

    return cpref, wrd1centre, wrd2centre, csuff


P12, I1, I2, S12 = getPrefInfSuf('[перепливи]', '[перелови]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[розгубився]', '[розгубивсь]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[вловив]', '[зловив]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[переходити]', '[перешкодити]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[переходити]', '[перешкоджати]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[ходити]', '[перешкоджати]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)




P12, I1, I2, S12 = getPrefInfSuf('[մերճակա]', '[մերձակա]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[առջն]', '[առջև]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[թեթնություն]', '[թեթևություն]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[Եթենա]', '[եթե նա]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[ննա]', '[նա]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)

P12, I1, I2, S12 = getPrefInfSuf('[ճեռքերն]', '[ձեռքից]')
print(P12, ' | ', I1, ' > ', I2, ' | ', S12)




[пере  |  пли  >  ло  |  ви]
[розгубивс  |  я  >  ь  |  ]
[  |  в  >  з  |  ловив]
[пере  |  х  >  шк  |  одити]
[пере  |  ходи  >  шкоджа  |  ти]
[  |  ходи  >  перешкоджа  |  ти]
[մեր  |  ճ  >  ձ  |  ակա]
[առջ  |  ն  >  և  |  ]
[թեթ  |  ն  >  և  |  ություն]
[  |  Եթե  >  եթե   |  նա]
[ն  |    >  ա]  |  նա]
[  |  ճեռքերն  >  ձեռքից  |  ]
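
The output line for `'[ննա]'` and `'[նա]'` shows a side effect of computing the prefix and the suffix independently: the common prefix `[ն` and the common suffix `նա]` overlap, so the suffix cannot be stripped from the shorter word and the closing bracket ends up in the infix. One possible way around this, sketched here under the hypothetical name `split_pref_inf_suf`, is to look for the common suffix only in what remains after the prefix has been removed.

```python
import os


def split_pref_inf_suf(w1, w2):
    """Split two words into common prefix, differing infixes and common suffix."""
    pref = os.path.commonprefix([w1, w2])
    rest1, rest2 = w1[len(pref):], w2[len(pref):]
    # The suffix is computed on the remainders only, so it can never overlap the prefix.
    suf = os.path.commonprefix([rest1[::-1], rest2[::-1]])[::-1]
    inf1 = rest1[:len(rest1) - len(suf)]
    inf2 = rest2[:len(rest2) - len(suf)]
    return pref, inf1, inf2, suf


print(split_pref_inf_suf('[ննա]', '[նա]'))
print(split_pref_inf_suf('[перепливи]', '[перелови]'))
```

With this variant the doubled letter is analysed as a deletion (`ն` rewritten to the empty string), while pairs without overlap are split exactly as before.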


In [30]:
def createSufList(a):
    ''' odyty] --> "", o, od, ody, odyt, odyty, odyty]  (right context of growing length) '''
    LSuf = ['']
    for j in range(1, len(a)+1):
        LSuf.append(a[:j])
    return LSuf

createSufList('odyty]')

['', 'o', 'od', 'ody', 'odyt', 'odyty', 'odyty]']

In [31]:
createSufList('ություն]')

['', 'ո', 'ու', 'ութ', 'ությ', 'ությո', 'ությու', 'ություն', 'ություն]']

In [32]:
def createPrefList(s):
    LPref = [s[-i:] for i in range(1, len(s) + 1)]
    LPref.insert(0, '')
    # print(LPref)
    return LPref

createPrefList('[pere')

['', 'e', 're', 'ere', 'pere', '[pere']

In [33]:
createPrefList('[թեթ')

['', 'թ', 'եթ', 'թեթ', '[թեթ']

### We create the list of potential rewrite rules


In [39]:
def twoWords2listOfRules(W1, W2):
    DTRules = {}      # rule -> how often it was generated
    DTRulesLen = {}   # rule -> length of its left-hand side
    LTRules = []
    P12, I1, I2, S12 = getPrefInfSuf(W1, W2)
    LPref12 = createPrefList(P12)
    LSuf12 = createSufList(S12)

    # each rule is the changed infix plus a (possibly empty) slice of left and right context
    for pref in LPref12:
        for suf in LSuf12:
            SlhsRule = pref + I1 + suf
            SrhsRule = pref + I2 + suf
            TRules = (SlhsRule, SrhsRule)
            LTRules.append(TRules)
            DTRules[TRules] = DTRules.get(TRules, 0) + 1
            DTRulesLen[TRules] = len(SlhsRule)
    return LTRules, DTRules, DTRulesLen, P12, I1, I2, S12

def printFrqTDict(DTFrq, FOut = None):
    for key, val in sorted(DTFrq.items(), key=lambda x:x[1], reverse=True):
        LHS, RHS = key
        if FOut:
            FOut.write(f'{LHS}, {RHS}, {str(val)}\n')
        else:
            print(LHS, RHS, str(val))
    if FOut:
        FOut.flush()

def printTList(LTuples, FOut = None):
    for el in LTuples:
        LHS, RHS = el
        if FOut:
            FOut.write(f'{LHS}, {RHS}\n')
        else:
            print(LHS, RHS)
    if FOut:
        FOut.flush()
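
To make the rule enumeration concrete, here is a self-contained sketch for the split `[առջ | ն > և | ]` shown earlier. `enumerate_rules` is a hypothetical stand-in that combines the logic of `createPrefList`, `createSufList` and `twoWords2listOfRules`.

```python
def enumerate_rules(pref, inf1, inf2, suf):
    """Pair every slice of left context with every slice of right context around the infix."""
    left_contexts = [''] + [pref[-i:] for i in range(1, len(pref) + 1)]
    right_contexts = [suf[:j] for j in range(len(suf) + 1)]
    return [(left + inf1 + right, left + inf2 + right)
            for left in left_contexts
            for right in right_contexts]


rules = enumerate_rules('[առջ', 'ն', 'և', ']')
for lhs, rhs in rules:
    print(lhs, '>', rhs)
print(len(rules), 'rules')
```

A prefix of length P and a suffix of length S give (P + 1) × (S + 1) rules, from the bare character substitution up to the full word pair with both brackets.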



In [44]:
LTRules, DTRules, DTRulesLen, P12, I1, I2, S12 = twoWords2listOfRules('[перепливи]', '[перелови]')
# FOut = open('printFrqTDict-test.txt', 'w')
# printFrqTDict(DTRules, FOut)
# the context manager closes the test file once the rules are written
with open('printTRulesLen-test-uk.txt', 'w', encoding='utf-8') as FOut0:
    FOut0.write(f'{P12}|{I1}>{I2}|{S12}\n')
    printFrqTDict(DTRulesLen, FOut0)
# FOut1 = open('printTList-test.txt', 'w')
# printTList(LTRules, FOut1)

In [45]:
LTRules, DTRules, DTRulesLen, P12, I1, I2, S12 = twoWords2listOfRules('[թեթնություն]', '[թեթևություն]')
# the context manager closes the test file once the rules are written
with open('printTRulesLen-test-hy.txt', 'w', encoding='utf-8') as FOut0:
    FOut0.write(f'{P12}|{I1}>{I2}|{S12}\n')
    printFrqTDict(DTRulesLen, FOut0)



In [None]:
# ... todo: check if we need this function (possibly -- substring)
def findLongestMatch(SInput, DTRulesLength):
    SOutput = None
    for key, val in sorted(DTRulesLength.items(), key=lambda x:x[1], reverse=True):
        LHS, RHS = key
        if LHS in SInput:
            SOutput = SInput.replace(LHS, RHS, 1)
            break
    return SOutput

SOutput1 = findLongestMatch('[перепливи]', DTRulesLen)
if SOutput1: print(SOutput1)

SOutput1 = findLongestMatch('[перешкоджав]', DTRulesLen)
if SOutput1: print(SOutput1)

SOutput1 = findLongestMatch('[проходжав]', DTRulesLen)
if SOutput1: print(SOutput1)


[перелови]


### We create a common dictionary for all the rewrite strings, record their frequencies, and then re-sort them by length...