
# Workflow for making your own corrections to your corpus

In [32]:
# Standard-library modules used in the cells below
import sys, os, re


## Files to download:
link: https://heibox.uni-heidelberg.de/d/d588bccab64348a399d4/

## After downloading
- You can change `<unknown>` to the correct lemma, or leave it as it is
- You can change the part-of-speech code if it is incorrect
- Save the changes on your local disk
- Upload the file to this notebook's workspace (Files > Upload to session storage)
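
Before uploading, it can help to check that the edited table still has the layout that the reading function further down expects: tab-separated columns with one header line, the word form in column 2, the part-of-speech in column 3, the lemma in column 4 and an optional corrected part-of-speech in column 5. A minimal sketch, where `check_corrections_table`, the column names in the header and the sample rows are hypothetical and not part of the notebook:

```python
def check_corrections_table(lines, min_columns=4):
    """Return (line_number, text) pairs for rows that look malformed.

    Assumes a tab-separated table with one header line in which
    columns 2-4 hold word, part-of-speech and lemma (hypothetical helper).
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if number == 1:
            continue  # header line
        text = line.rstrip('\r\n')
        fields = text.split('\t')
        # too few columns, or no word form in column 2
        if len(fields) < min_columns or fields[1] == '':
            problems.append((number, text))
    return problems


sample = [
    'frq\tword\tpos\tlemma\tpos_corrected\n',  # header (made-up column names)
    '12\twordA\tN\tlemmaA\t\n',                 # well-formed row
    '7\twordB N <unknown>\n',                   # spaces typed instead of tabs
]
print(check_corrections_table(sample))
```

A typical cause of malformed rows is an editor that silently replaces tabs with spaces on save.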

## Running tagging+correction workflow

In [None]:
%%bash
# installing TreeTagger (en, de, ka)
mkdir treetagger
cd treetagger
# Download the tagger package for your system (PC-Linux, Mac OS-X, ARM64, ARMHF, ARM-Android, PPC64le-Linux).
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2.4.tar.gz
tar -xzvf tree-tagger-linux-3.2.4.tar.gz
# Download the tagging scripts into the same directory.
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/tagger-scripts.tar.gz
gunzip tagger-scripts.tar.gz
# Download the installation script install-tagger.sh.
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/install-tagger.sh
# Download the parameter files for the languages you want to process.
# list of all files (parameter files) https://cis.lmu.de/~schmid/tools/TreeTagger/#parfiles
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/english.par.gz
sh install-tagger.sh
cd ..
sudo pip install treetaggerwrapper
# changing options: no-unknown, sgml, lemma
mv /content/treetagger/cmd/tree-tagger-english /content/tree-tagger-english0
awk '{ if (NR == 9) print "OPTIONS=\"-token -lemma -sgml -no-unknown\""; else print $0}' /content/tree-tagger-english0 > /content/treetagger/cmd/tree-tagger-english
chmod a+x ./treetagger/cmd/tree-tagger-english
# downloading German and Georgian 
wget https://heibox.uni-heidelberg.de/f/ec8226edebb64a359407/?dl=1
mv index.html?dl=1 /content/treetagger/lib/german-utf8.par
wget https://heibox.uni-heidelberg.de/f/9183090d2bdb41e09055/?dl=1
mv index.html?dl=1 /content/treetagger/lib/georgian.par
wget https://heibox.uni-heidelberg.de/f/9cafab0509d64ed1ac4b/?dl=1
mv index.html?dl=1 /content/treetagger/cmd/tree-tagger-georgian2
cp /content/treetagger/cmd/tree-tagger-georgian2 /content/treetagger/cmd/tree-tagger-georgian
# tree-tagger-german2 = German with the -no-unknown option
# Note: tree-tagger-german will not work because its parameter files are not downloaded; use only tree-tagger-german2 (UTF-8 input)
wget https://heibox.uni-heidelberg.de/f/acb9b8a2fa4f40e08f8a/?dl=1
mv index.html?dl=1 /content/treetagger/cmd/tree-tagger-german2
chmod a+x /content/treetagger/cmd/tree-tagger-georgian2
chmod a+x /content/treetagger/cmd/tree-tagger-german2

wget https://heibox.uni-heidelberg.de/f/a6f7f36f175942ccad0a/?dl=1
mv index.html?dl=1 /content/treetagger/cmd/tree-tagger-georgian
chmod a+x /content/treetagger/cmd/tree-tagger-georgian


In [None]:
!wget https://heibox.uni-heidelberg.de/f/dc5bcb4413aa42668130/?dl=1
!mv index.html?dl=1 specialisedCorpus.zip

In [None]:
!unzip specialisedCorpus.zip
!wc specialised-corpora/*

In [5]:
!mkdir specialised-corpora-s01-lines/

In [6]:
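def removeNewLines(FName, FNameOut):
    '''Join hard-wrapped lines into paragraphs and undo end-of-line hyphenation.'''
    # 'with' closes both files even if an error occurs
    with open(FName, 'r', encoding='utf-8') as FIn, open(FNameOut, 'w', encoding='utf-8') as FOut:
        for SLine in FIn:
            SLine = SLine.strip()
            if SLine == '':
                # blank line: paragraph boundary
                FOut.write('\n\n')
                continue
            if SLine[-1] == '-':
                # line ends in a hyphen: glue the word to the next line
                FOut.write(SLine[:-1])
                continue
            # ordinary line: join it to the next one with a space
            FOut.write(SLine + ' ')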


In [7]:
removeNewLines('specialised-corpora/cFiktion.txt', 'specialised-corpora-s01-lines/cFiktion.txt')
removeNewLines('specialised-corpora/cNaturwissenschaft.txt', 'specialised-corpora-s01-lines/cNaturwissenschaft.txt')
removeNewLines('specialised-corpora/cRechtswissenschaft.txt', 'specialised-corpora-s01-lines/cRechtswissenschaft.txt')
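
The effect of `removeNewLines` on a hard-wrapped paragraph can be illustrated on an in-memory string. The sketch below reimplements the same three rules (a blank line becomes a paragraph break, a trailing hyphen joins the word with the next line, any other line gets a trailing space); `join_lines` and the sample text are made up for illustration:

```python
def join_lines(text):
    """Apply the same line-joining rules as removeNewLines to a string."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if line == '':
            parts.append('\n\n')       # blank line: paragraph break
        elif line.endswith('-'):
            parts.append(line[:-1])    # trailing hyphen: glue to the next line
        else:
            parts.append(line + ' ')   # ordinary line: join with a space
    return ''.join(parts)


sample = 'a hyphen-\nated word\n\nnext paragraph\n'
print(repr(join_lines(sample)))
```

Note that a line ending in a genuine hyphen (for example a compound split at its hyphen) also loses that hyphen under this rule.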

In [8]:
!mkdir specialised-corpora-s02-vert/

In [9]:
!./treetagger/cmd/tree-tagger-georgian specialised-corpora-s01-lines/cFiktion.txt >specialised-corpora-s02-vert/cFiktion.vert
!./treetagger/cmd/tree-tagger-georgian specialised-corpora-s01-lines/cNaturwissenschaft.txt >specialised-corpora-s02-vert/cNaturwissenschaft.vert
!./treetagger/cmd/tree-tagger-georgian specialised-corpora-s01-lines/cRechtswissenschaft.txt >specialised-corpora-s02-vert/cRechtswissenschaft.vert


	reading parameters ...
	tagging ...
747000	 finished.
	reading parameters ...
	tagging ...
638000	 finished.
	reading parameters ...
	tagging ...
1181000	 finished.


In [10]:
# number of lines (one token per line), fields and bytes in each tagged corpus
!wc specialised-corpora-s02-vert/*

  747302  2241906 24414488 specialised-corpora-s02-vert/cFiktion.vert
  638588  1915786 23003076 specialised-corpora-s02-vert/cNaturwissenschaft.vert
 1181412  3544236 43986910 specialised-corpora-s02-vert/cRechtswissenschaft.vert
 2567302  7701928 91404474 total
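
Each line of a `.vert` file that has exactly three tab-separated fields is read further down as word form, part-of-speech tag and lemma, and the lemma is `<unknown>` when the tagger does not know the word. Counting those lines gives a quick picture of coverage before any corrections are applied. A small sketch on in-memory lines (`count_unknown` and the sample tokens and tags are hypothetical):

```python
def count_unknown(vert_lines):
    """Return (token_lines, unknown_lemmas) for an iterable of .vert lines."""
    tokens = 0
    unknown = 0
    for line in vert_lines:
        fields = line.rstrip().split('\t')
        if len(fields) != 3:
            continue                   # not a word/PoS/lemma line
        tokens += 1
        if fields[2] == '<unknown>':
            unknown += 1
    return tokens, unknown


sample = ['<s>\n', 'wordA\tN\tlemmaA\n', 'wordB\tV\t<unknown>\n', '</s>\n']
print(count_unknown(sample))   # (2, 1)
```

An open file object can be passed instead of the list, since the function only iterates over lines.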


## Downloading corrections

In [None]:
%%bash
# Downloading a table with corrected forms from the Georgian Random corpus
# cp /content/treetagger/cmd/tree-tagger-georgian2 /content/treetagger/cmd/tree-tagger-georgian
wget https://heibox.uni-heidelberg.de/f/e9010b0f3e7649ef9552/?dl=1
mv index.html?dl=1 georgianrandom--unknown-frq-all.tsv

In [None]:
# Downloading a table with corrected forms from the specialised corpora
!wget https://heibox.uni-heidelberg.de/f/b7c076d7876d4860a554/?dl=1
!mv index.html?dl=1 specialised-corpora-all-union123Unkonw.tsv

In [14]:
!wc georgianrandom--unknown-frq-all.tsv
!wc specialised-corpora-all-union123Unkonw.tsv

  2999  17830 233687 georgianrandom--unknown-frq-all.tsv
  2895  26391 227853 specialised-corpora-all-union123Unkonw.tsv


In [21]:
# removing the line with column names before concatenation
!tail -n +2 specialised-corpora-all-union123Unkonw.tsv >specialised-corpora-all-union123Unkonw2.tsv

In [22]:
!cat georgianrandom--unknown-frq-all.tsv specialised-corpora-all-union123Unkonw2.tsv >ready-corrections-all.txt
!wc ready-corrections-all.txt

  5893  44210 461479 ready-corrections-all.txt
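
Because the two tables are concatenated and then loaded into dictionaries keyed by word form, a word that occurs in both tables keeps only the lemma from its last row. A minimal sketch for spotting such conflicts before loading, assuming the word form is in column 2 and the lemma in column 4, as in the reading function below; `find_conflicts` and the sample rows are hypothetical:

```python
from collections import defaultdict


def find_conflicts(rows):
    """Map each word to its set of lemmas and keep words with more than one."""
    lemmas = defaultdict(set)
    for row in rows:
        fields = row.rstrip().split('\t')
        # use only rows that have both a word form and a lemma
        if len(fields) >= 4 and fields[1] and fields[3]:
            lemmas[fields[1]].add(fields[3])
    return {word: found for word, found in lemmas.items() if len(found) > 1}


rows = [
    '5\twordA\tN\tlemmaA\n',
    '3\twordB\tV\tlemmaB\n',
    '2\twordA\tN\tlemmaA2\n',   # same word, different lemma
]
print(find_conflicts(rows))
```

An empty result means the order of concatenation does not affect the lemma dictionary.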


In [24]:
# reading a dictionary with corrections (function)

def readCorrectionFile(SFInput):
    '''
    input string: name of a tab-separated file with the corrections (first line = column names);
    columns used (0-based): 1 = word form, 3 = lemma, 4 = corrected part-of-speech;
    output: tuple of two dictionaries: corrections lemmas, corrections part-of-speech
    '''
    DCorrectionsPoS = {}
    DCorrectionsLem = {}
    with open(SFInput, encoding='utf-8') as f:
        for counter, sline in enumerate(f, start=1):
            if counter == 1: continue # we skip the first line (column names)

            # rstrip, not strip: an empty first column must not shift the other columns
            LLine = sline.rstrip().split('\t')
            # pad short lines so that missing columns read as empty strings
            LLine += [''] * (5 - len(LLine))
            SWord, SLemma, SPoSCorrected = LLine[1], LLine[3], LLine[4]

            if SWord == '': continue
            # a PoS entry is stored only if the corrected-PoS column is filled in
            if SPoSCorrected != '': DCorrectionsPoS[SWord] = SPoSCorrected
            if SLemma != '': DCorrectionsLem[SWord] = SLemma

    print(len(DCorrectionsPoS))
    print(len(DCorrectionsLem))

    return DCorrectionsLem, DCorrectionsPoS


def readCorrectionFileTestDict(DCorrectionsLem, DCorrectionsPoS, SFOutDictLemmasCorrected, SFOutDdictPoSCorrected):
    '''Write both correction dictionaries to tab-separated files for inspection.'''
    with open(SFOutDictLemmasCorrected, 'w', encoding='utf-8') as FDictCorrected:
        for key, value in sorted(DCorrectionsLem.items()):
            FDictCorrected.write(f'{key}\t{value}\n')

    with open(SFOutDdictPoSCorrected, 'w', encoding='utf-8') as FDictCorrectedPoS:
        for key, value in sorted(DCorrectionsPoS.items()):
            FDictCorrectedPoS.write(f'{key}\t{value}\n')



In [None]:
DCorrectionsLem2, DCorrectionsPoS2 = readCorrectionFile('ready-corrections-all.txt')
readCorrectionFileTestDict(DCorrectionsLem2, DCorrectionsPoS2, 'ready-corrections-all-cor-lem.txt', 'ready-corrections-all-cor-pos.txt')

In [27]:
!mkdir specialised-corpora-s05-cVertSC

In [33]:
def applyCorrections(SFInVertUnknown, SFOutVertCorrected, DCorrectionsLem, DCorrectionsPoS):
    DReplacements = {}
    counter = 0
    unknownFound = 0
    unknownCorrected = 0
    unknownCorrectedPoS = 0
    with open(SFInVertUnknown, encoding='utf-8') as FInVertUnknown, open(SFOutVertCorrected, 'w', encoding='utf-8') as FOutVertCorrected:
        for SLine in FInVertUnknown:
            counter +=1
            if counter % 1000000 == 0:
                # progress report; avoid division by zero if no unknown lemma has been seen yet
                PCorrected = unknownCorrected/unknownFound*100 if unknownFound else 0.0
                sys.stdout.write(f'{counter}, unknownFound={unknownFound}, unknownCorrected={unknownCorrected}({PCorrected}%), unknownTypesCorrected={len(DReplacements)}, unknownCorrectedPoS={unknownCorrectedPoS}\n')
            SLine = SLine.rstrip()
            LLine = SLine.split('\t')
            if len(LLine) != 3:
                # not a word/PoS/lemma line (e.g. markup): copy unchanged
                FOutVertCorrected.write(f'{SLine}\n')
            else:
                SWord, SPoS, SLem = LLine

                if SLem == '<unknown>':
                    unknownFound +=1
                    if SWord in DCorrectionsLem:
                        SLem = DCorrectionsLem[SWord]
                        unknownCorrected +=1
                        # count how often each word -> lemma replacement was made
                        SKey = f'{SWord}\t{SLem}'
                        DReplacements[SKey] = DReplacements.get(SKey, 0) + 1

                    # the PoS is corrected only for words whose lemma was unknown
                    if SWord in DCorrectionsPoS:
                        SPoS = DCorrectionsPoS[SWord]
                        unknownCorrectedPoS += 1
                FOutVertCorrected.write(f'{SWord}\t{SPoS}\t{SLem}\n')

    return (counter, unknownFound, unknownCorrected, len(DReplacements), unknownCorrectedPoS), DReplacements


def reportStatistics(TupleIn1):
    counter, unknownFound, unknownCorrected, ITypesCorrected, unknownCorrectedPoS = TupleIn1
    # shares of all lines; avoid division by zero for an empty file
    UnknownBeforeUpdate = unknownFound / counter if counter else 0.0
    UnknownAfterUpdate = (unknownFound - unknownCorrected) / counter if counter else 0.0
    UnknownChange = UnknownBeforeUpdate - UnknownAfterUpdate
    # share of unknown lemmas that received a correction
    PCorrected = unknownCorrected / unknownFound * 100 if unknownFound else 0.0

    sys.stdout.write(f'\nAll words:{counter}, Unknown:{unknownFound}, UnknownCorrected:{unknownCorrected}({PCorrected}%), UnknownTypesCorrected:{ITypesCorrected}, UnknownPoSCorrected:{unknownCorrectedPoS}\n')
    sys.stdout.write(f'\nUnknown before update:{unknownFound}({UnknownBeforeUpdate * 100})%; Unknown after update:{unknownFound - unknownCorrected}({UnknownAfterUpdate * 100})%; Change:{unknownCorrected}({UnknownChange * 100})%\n')


def printFrqDict(DFrq, SFOut):
    '''Write a frequency dictionary as rank, key, frequency, most frequent first.'''
    with open(SFOut, 'w', encoding='utf-8') as FOut:
        count = 0
        for key, val in sorted(DFrq.items(), key=lambda item: item[1], reverse=True):
            count+=1
            FOut.write(f'{count}\t{key}\t{val}\n')
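
The correction rule in `applyCorrections` can be tried on a handful of lines without touching the corpus files: a line is changed only when its lemma is `<unknown>`, and then the lemma and the part-of-speech tag are looked up independently by word form. A minimal in-memory sketch; `correct_line`, the two dictionaries and the tokens are made up:

```python
def correct_line(line, lemma_fixes, pos_fixes):
    """Return the line with lemma/PoS replaced if its lemma is <unknown>."""
    fields = line.rstrip().split('\t')
    if len(fields) != 3:
        return line.rstrip()           # markup or malformed line: unchanged
    word, pos, lemma = fields
    if lemma == '<unknown>':
        lemma = lemma_fixes.get(word, lemma)
        pos = pos_fixes.get(word, pos)
    return f'{word}\t{pos}\t{lemma}'


lemma_fixes = {'wordB': 'lemmaB'}
pos_fixes = {'wordB': 'N', 'wordA': 'V'}

print(correct_line('wordB\tV\t<unknown>', lemma_fixes, pos_fixes))
print(correct_line('wordA\tN\tlemmaA', lemma_fixes, pos_fixes))
```

The second line stays unchanged even though `wordA` has a part-of-speech entry, because part-of-speech corrections are applied only to tokens whose lemma is unknown.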
    


In [34]:
TupleStatF2, DReplacementsF2 = applyCorrections('/content/specialised-corpora-s02-vert/cFiktion.vert', '/content/specialised-corpora-s05-cVertSC/cFiktion.vert', DCorrectionsLem2, DCorrectionsPoS2)
reportStatistics(TupleStatF2)
printFrqDict(DReplacementsF2, '/content/specialised-corpora-s05-cVertSC/cFiktion-replacements.txt')




All words:747302, Unknown:134416, UnknownCorrected:21150(15.734733960242828%), UnknownTypesCorrected:2934, UnknownPoSCorrected:6559

Unknown before update:134416(17.98683798517868)%; Unknown after update:113266(15.156656880350916)%; Change:21150(2.830181104827767)%


In [35]:
TupleStatN2, DReplacementsN2 = applyCorrections('/content/specialised-corpora-s02-vert/cNaturwissenschaft.vert', '/content/specialised-corpora-s05-cVertSC/cNaturwissenschaft.vert', DCorrectionsLem2, DCorrectionsPoS2)
reportStatistics(TupleStatN2)
printFrqDict(DReplacementsN2, '/content/specialised-corpora-s05-cVertSC/cNaturwissenschaft-replacements.txt')



All words:638588, Unknown:128802, UnknownCorrected:30036(23.31951367214795%), UnknownTypesCorrected:3858, UnknownPoSCorrected:6678

Unknown before update:128802(20.169812148051637)%; Unknown after update:98766(15.46631004654018)%; Change:30036(4.703502101511459)%


In [36]:
TupleStatR2, DReplacementsR2 = applyCorrections('/content/specialised-corpora-s02-vert/cRechtswissenschaft.vert', '/content/specialised-corpora-s05-cVertSC/cRechtswissenschaft.vert', DCorrectionsLem2, DCorrectionsPoS2)
reportStatistics(TupleStatR2)
printFrqDict(DReplacementsR2, '/content/specialised-corpora-s05-cVertSC/cRechtswissenschaft-replacements.txt')


1000000, unknownFound=210522, unknownCorrected=47345(22.489336031388643%), unknownTypesCorrected=2785, unknownCorrectedPoS=13066

All words:1181412, Unknown:251676, UnknownCorrected:55859(22.194806020438975%), UnknownTypesCorrected:3092, UnknownPoSCorrected:14902

Unknown before update:251676(21.302983209921685)%; Unknown after update:195817(16.574827409912885)%; Change:55859(4.7281558000088015)%
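
The percentages in these reports follow directly from the counts. As a check, the figures for the fiction corpus can be recomputed from the numbers printed above (a sketch; `coverage_report` is a hypothetical helper, not part of the notebook):

```python
def coverage_report(all_lines, unknown, corrected):
    """Return unknown rate before, unknown rate after, and share of unknowns corrected (all in %)."""
    before = unknown / all_lines * 100
    after = (unknown - corrected) / all_lines * 100
    corrected_share = corrected / unknown * 100
    return before, after, corrected_share


# counts reported for cFiktion.vert
before, after, share = coverage_report(747302, 134416, 21150)
print(f'before: {before:.2f}%  after: {after:.2f}%  corrected: {share:.2f}%')
# before: 17.99%  after: 15.16%  corrected: 15.73%
```

The two kinds of percentage have different denominators: the unknown rates are relative to all lines in the file, while the corrected share is relative to the unknown lemmas only.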
