<a href="https://colab.research.google.com/github/iued-uni-heidelberg/corpustools/blob/main/S01LemmatizationEnHyV01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Morphological analysis for English and Armenian

This notebook builds a workflow for the morphological analysis of English and Armenian texts.

For English we use TreeTagger.

For Armenian we use the Eastern Armenian morphological analyser from this git repository:
https://github.com/timarkh/uniparser-grammar-eastern-armenian

In [2]:
# importing python libraries
import os, re, sys

## English

In [None]:
# installing TreeTagger

In [None]:
%%bash
mkdir treetagger
cd treetagger
# Download the tagger package for your system (PC-Linux, Mac OS-X, ARM64, ARMHF, ARM-Android, PPC64le-Linux).
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2.4.tar.gz
tar -xzvf tree-tagger-linux-3.2.4.tar.gz
# Download the tagging scripts into the same directory.
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/tagger-scripts.tar.gz
gunzip tagger-scripts.tar.gz
# Download the installation script install-tagger.sh.
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/install-tagger.sh
# Download the parameter files for the languages you want to process.
# list of all files (parameter files) https://cis.lmu.de/~schmid/tools/TreeTagger/#parfiles
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/english.par.gz
sh install-tagger.sh
cd ..
sudo pip install treetaggerwrapper


In [None]:
%%bash
# download the sample texts; -O saves each file directly under the name used in later cells
wget -O humanrights02.txt "https://heibox.uni-heidelberg.de/f/95a3875771c040db959a/?dl=1"

wget -O en1984.txt "https://heibox.uni-heidelberg.de/f/cdf240db84ca4718b718/?dl=1"

In [None]:
!head --lines=20 humanrights02.txt
!wc humanrights02.txt

In [None]:
!./treetagger/cmd/tree-tagger-english en1984.txt >en1984_vert.txt

	reading parameters ...
	tagging ...
121000	 finished.


In [None]:
!head --lines=20 en1984_vert.txt

In [None]:
!./treetagger/cmd/tree-tagger-english humanrights02.txt >humanrights02_vert.txt

In [None]:
!head --lines=20 humanrights02_vert.txt
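The `_vert.txt` files written above use TreeTagger's vertical layout: one token per line, with word form, part-of-speech tag and lemma separated by tabs. A minimal sketch of reading that layout back into Python follows. The names `TaggedToken` and `parse_vertical` are hypothetical helpers introduced here, and the sample string is hand-written to match the layout rather than copied from real tagger output.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_vertical(text: str) -> List[TaggedToken]:
    """Parse lines of the form word<TAB>tag<TAB>lemma into tuples."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a 3-column row
        tokens.append(TaggedToken(*parts))
    return tokens


# Hand-written sample in the three-column layout (an assumption, not real output)
sample = "The\tDT\tthe\ncats\tNNS\tcat\n\n"
tokens = parse_vertical(sample)
lemmas = [t.lemma for t in tokens]
print(lemmas)
```

To apply it to a real file, pass the contents of `en1984_vert.txt` (read with `encoding='utf-8'`) instead of `sample`.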

## Armenian

In [1]:
# installing Armenian morphological analyser
!git clone https://github.com/timarkh/uniparser-grammar-eastern-armenian

Cloning into 'uniparser-grammar-eastern-armenian'...
remote: Enumerating objects: 181, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 181 (delta 12), reused 40 (delta 12), pack-reused 141
Receiving objects: 100% (181/181), 52.66 MiB | 13.50 MiB/s, done.
Resolving deltas: 100% (78/78), done.


In [3]:
# installing the Python package with the analyser classes
!pip3 install uniparser-eastern-armenian

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting uniparser-eastern-armenian
  Downloading uniparser_eastern_armenian-2.1.2-py3-none-any.whl (1.4 MB)
Collecting uniparser-morph>=2.2.0
  Downloading uniparser_morph-2.6.4-py3-none-any.whl (57 kB)
Installing collected packages: uniparser-morph, uniparser-eastern-armenian
Successfully installed uniparser-eastern-armenian-2.1.2 uniparser-morph-2.6.4


In [4]:
# installing cg3 (Constraint Grammar), needed for analysis with disambiguate=True
!sudo apt-get install cg3

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  libcg3-1
The following NEW packages will be installed:
  cg3 libcg3-1
0 upgraded, 2 newly installed, 0 to remove and 20 not upgraded.
Need to get 339 kB of archives.
After this operation, 1,100 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libcg3-1 amd64 1.0.0~r12254-1ubuntu3 [229 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 cg3 amd64 1.0.0~r12254-1ubuntu3 [110 kB]
Fetched 339 kB in 2s (210 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 2.)
debconf: falling b

In [5]:
from uniparser_eastern_armenian import EasternArmenianAnalyzer
a = EasternArmenianAnalyzer()
analyses = a.analyze_words('Ձևաբանություն')

In [6]:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm, ana.gloss, ana.stem, ana.subwords, ana.wfGlossed, ana.otherData)

Ձևաբանություն ձեւաբանություն N,inanim,sg,nom,nonposs morphology ձևաբանություն. [] ձևաբանություն [('trans_en', 'morphology')]


In [7]:
# a non-existent word
analyses2 = a.analyze_words('Ձևաբայու')

In [None]:
for ana2 in analyses2:
    if ana2.lemma:
        print(ana2.wf, ana2.lemma, ana2.gramm, ana2.gloss, ana2.stem, ana2.subwords, ana2.wfGlossed, ana2.otherData)
    else:
        # no lemma found: fall back to the word form as lemma, with placeholder tag "N" and gloss "x"
        print(ana2.wf, ana2.wf, "N", "x", ana2.stem, ana2.subwords, ana2.wfGlossed, ana2.otherData)

In [None]:
# which fields does an analysis object have?

In [None]:
dir(ana)

In [9]:
analyses = a.analyze_words([['և'], ['Ես', 'սիրում', 'եմ', 'քեզ', ':']],
                           format='xml')

In [10]:
for ana in analyses:
    print(str(ana))

['<w><ana lex="եւ" gr="CONJ" parts="և" gloss="and" trans_en="and, too, either"></ana>և</w>']
['<w><ana lex="ես" gr="PRON,S,hum,sg,nom" parts="ես" gloss="me" trans_en="I"></ana><ana lex="է" gr="V,intr,prs,sg,2" parts="ե-ս" gloss="be-PRS.2SG" trans_en="be"></ana>Ես</w>', '<w><ana lex="սիրել" gr="V,tr,cvb,ipfv" parts="սիր-ում" gloss="love-CVB.IPFV" trans_en="love, have a passion/an affection for, like"></ana>սիրում</w>', '<w><ana lex="է" gr="V,intr,prs,sg,1" parts="ե-մ" gloss="be-PRS.1SG" trans_en="be"></ana>եմ</w>', '<w><ana lex="դու" gr="PRON,S,hum,sg,dat" parts="քեզ" gloss="thou" trans_en="you, thou"></ana>քեզ</w>', '<w><ana lex="" gr="" parts="" gloss=""></ana>:</w>']
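Each string in the XML output is a small, well-formed `<w>` element, so it can be read with the standard library. The sketch below assumes exactly the shape printed above: one `<ana>` child per analysis, with the word form as trailing text. `parse_word_xml` is a hypothetical helper name, and the sample string is copied from the output for "Ես".

```python
import xml.etree.ElementTree as ET


def parse_word_xml(xml_word):
    """Return the word form and a list of (lemma, grammar tags) pairs."""
    w = ET.fromstring(xml_word)
    # the word form is the only text inside <w>; the <ana> elements are empty
    wordform = "".join(w.itertext())
    pairs = [(ana.get("lex"), ana.get("gr")) for ana in w.findall("ana")]
    return wordform, pairs


xml_word = (
    '<w><ana lex="ես" gr="PRON,S,hum,sg,nom" parts="ես" gloss="me" trans_en="I"></ana>'
    '<ana lex="է" gr="V,intr,prs,sg,2" parts="ե-ս" gloss="be-PRS.2SG" trans_en="be"></ana>Ես</w>'
)
wordform, word_analyses = parse_word_xml(xml_word)
print(wordform, word_analyses)
```

A word with two `<ana>` children, as here, is ambiguous between a pronoun and a verb form reading.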


In [None]:
analyses = a.analyze_words(['Ձևաբանություն', [['և'], ['Ես', 'սիրում', 'եմ', 'քեզ', ':']]],
                           format='json')

In [None]:
for ana in analyses:
    print(str(ana))

[{'wf': 'Ձևաբանություն', 'lemma': 'ձեւաբանություն', 'gramm': ['N', 'inanim', 'sg', 'nom', 'nonposs'], 'wfGlossed': 'ձևաբանություն', 'gloss': 'morphology', 'trans_en': 'morphology'}]
[[[{'wf': 'և', 'lemma': 'եւ', 'gramm': ['CONJ'], 'wfGlossed': 'և', 'gloss': 'and', 'trans_en': 'and, too, either'}]], [[{'wf': 'Ես', 'lemma': 'ես', 'gramm': ['PRON', 'S', 'hum', 'sg', 'nom'], 'wfGlossed': 'ես', 'gloss': 'me', 'trans_en': 'I'}, {'wf': 'Ես', 'lemma': 'է', 'gramm': ['V', 'intr', 'prs', 'sg', '2'], 'wfGlossed': 'ե-ս', 'gloss': 'be-PRS.2SG', 'trans_en': 'be'}], [{'wf': 'սիրում', 'lemma': 'սիրել', 'gramm': ['V', 'tr', 'cvb', 'ipfv'], 'wfGlossed': 'սիր-ում', 'gloss': 'love-CVB.IPFV', 'trans_en': 'love, have a passion/an affection for, like'}], [{'wf': 'եմ', 'lemma': 'է', 'gramm': ['V', 'intr', 'prs', 'sg', '1'], 'wfGlossed': 'ե-մ', 'gloss': 'be-PRS.1SG', 'trans_en': 'be'}], [{'wf': 'քեզ', 'lemma': 'դու', 'gramm': ['PRON', 'S', 'hum', 'sg', 'dat'], 'wfGlossed': 'քեզ', 'gloss': 'thou', 'trans_en': '
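With `format='json'` the result mirrors the nesting of the input, so the analysis dictionaries sit at different depths. One way to handle this, shown here as a sketch rather than as part of the library, is a small recursive generator that yields every dictionary regardless of depth. `iter_analyses` is a hypothetical helper, and `sample` reuses only the complete entries from the output above.

```python
def iter_analyses(result):
    """Yield every analysis dict found in an arbitrarily nested list."""
    if isinstance(result, dict):
        yield result
    elif isinstance(result, list):
        for item in result:
            yield from iter_analyses(item)


# Entries copied from the printed output; the nesting depth differs on purpose
sample = [
    [{'wf': 'Ձևաբանություն', 'lemma': 'ձեւաբանություն',
      'gramm': ['N', 'inanim', 'sg', 'nom', 'nonposs'],
      'wfGlossed': 'ձևաբանություն', 'gloss': 'morphology', 'trans_en': 'morphology'}],
    [[[{'wf': 'և', 'lemma': 'եւ', 'gramm': ['CONJ'],
        'wfGlossed': 'և', 'gloss': 'and', 'trans_en': 'and, too, either'}]]],
]
pairs = [(d['wf'], d['lemma']) for d in iter_analyses(sample)]
print(pairs)
```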

In [11]:
# analysis with disambiguation
analyses = a.analyze_words(['Ես', 'սիրում', 'եմ', 'քեզ'], disambiguate=True)

In [13]:
for ana in analyses:
    if len(ana) > 1: tab = "  "
    else: tab = ""
    for wfo in ana:
        print(tab, wfo.wf, wfo.lemma, wfo.gramm, wfo.gloss)

   Ես ես PRON,S,hum,sg,nom me
   Ես է V,intr,prs,sg,2 be-PRS.2SG
 սիրում սիրել V,tr,cvb,ipfv love-CVB.IPFV
 եմ է V,intr,prs,sg,1 be-PRS.1SG
 քեզ դու PRON,S,hum,sg,dat thou


In [None]:
print(type(wfo))

In [None]:
dir(wfo)

In [None]:
# downloading and analysing texts

In [None]:
!wget -O hy1984.txt "https://heibox.uni-heidelberg.de/f/e0bfae444a5a4c76957b/?dl=1"

In [26]:
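# open both files explicitly as UTF-8 so the Armenian text is read and written correctly
FInText = open('hy1984.txt', 'r', encoding='utf-8')
FOutText = open('hy1984_vert.txt', 'w', encoding='utf-8')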

In [27]:
for SLine in FInText:
    SLine = SLine.strip()
    # tokenize: split on spaces and punctuation, dropping the empty strings re.split leaves at the edges
    ListOfWords = [w for w in re.split(r'[ ,.:;!()"\[\]՞՝«»\-—։]+', SLine) if w]
    if not ListOfWords: continue  # skip blank lines
    analyses = a.analyze_words(ListOfWords, disambiguate=False)
    FOutText.write('<p>\n')
    for ana in analyses:
        # a word form can have several analyses; write one line per variant
        # TODO: decide how to output all variants together with disambiguation
        for wfo in ana:
            FOutText.write(wfo.wf + '\t' + wfo.gramm + '\t' + wfo.lemma + '\t' + wfo.gloss + '\n')
    FOutText.write('</p>\n')
# close both files so that everything is written to disk
FInText.close()
FOutText.close()
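The tokenization step in the loop above can be checked on its own. The sketch below uses the same set of separator characters as a raw-string pattern. It shows that `re.split` returns an empty string when a line ends in punctuation, which is why filtering is useful before passing tokens to the analyser. `TOKEN_SPLIT` and `tokenize` are hypothetical names introduced for this illustration.

```python
import re

# separators: space, Latin punctuation and Armenian punctuation marks
TOKEN_SPLIT = re.compile(r'[ ,.:;!()"\[\]՞՝«»\-—։]+')


def tokenize(line):
    """Split a line on the separators and drop empty strings."""
    return [tok for tok in TOKEN_SPLIT.split(line.strip()) if tok]


raw = TOKEN_SPLIT.split('Ես սիրում եմ քեզ։')   # unfiltered result
clean = tokenize('Ես սիրում եմ քեզ։')          # filtered result
print(raw)
print(clean)
```

Note that this pattern treats ՞ and ՝ as separators, so a word containing one of these marks is split into two tokens.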