# About

Preparing data for use in a Python lemmatizer package for Belarusian:
- Reading files, exploration
- Understanding the source's markup rules
- Cleaning data and converting it into Python-compatible JSON format

## Sources

- Data source: [GrammarDB (RELEASE-202309) on GitHub [↗︎]](https://github.com/Belarus/GrammarDB/releases)
- Accompanying publication: [Граматычная база беларускай мовы / уклад. Уладзімір Кошчанка, Алесь Булойчык [↗︎]](https://www.academia.edu/60297156/%D0%93%D1%80%D0%B0%D0%BC%D0%B0%D1%82%D1%8B%D1%87%D0%BD%D0%B0%D1%8F_%D0%B1%D0%B0%D0%B7%D0%B0_%D0%B1%D0%B5%D0%BB%D0%B0%D1%80%D1%83%D1%81%D0%BA%D0%B0%D0%B9_%D0%BC%D0%BE%D0%B2%D1%8B_Belarusian_Language_Grammar_Database)

Original data sample:
```
<Wordlist>
    <Paradigm pdgId="1195935" lemma="нарка+маўскі" tag="ARP">
        <Variant id="a" lemma="нарка+маўскі" slouniki="piskunou2012:107411" pravapis="A2008">
            <Form tag="MNS" slouniki="sbm2012,krapivabr2012">нарка+маўскі</Form>
            ...
            <Form tag="MAS" options="inanim">нарка+маўскі</Form>
            <Form tag="MAS" options="anim">нарка+маўскага</Form>
            ...
        </Variant>
        <Variant id="b" lemma="нарко+маўскі" pravapis="A1957">
            <Form tag="MNS" slouniki="tsbm1984">нарко+маўскі</Form>
            ...
        </Variant>
    </Paradigm>
</Wordlist>
```
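
To make the nesting concrete, here is a minimal sketch that walks an abridged copy of the sample with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The function name `iter_forms` and the record keys are illustrative assumptions, not part of GrammarDB or of the notebook.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample above (the "..." lines are dropped).
SAMPLE = """
<Wordlist>
    <Paradigm pdgId="1195935" lemma="нарка+маўскі" tag="ARP">
        <Variant id="a" lemma="нарка+маўскі" slouniki="piskunou2012:107411" pravapis="A2008">
            <Form tag="MNS" slouniki="sbm2012,krapivabr2012">нарка+маўскі</Form>
            <Form tag="MAS" options="inanim">нарка+маўскі</Form>
            <Form tag="MAS" options="anim">нарка+маўскага</Form>
        </Variant>
        <Variant id="b" lemma="нарко+маўскі" pravapis="A1957">
            <Form tag="MNS" slouniki="tsbm1984">нарко+маўскі</Form>
        </Variant>
    </Paradigm>
</Wordlist>
"""


def iter_forms(xml_text):
    """Yield one flat record per <Form>, with its paradigm and variant context."""
    root = ET.fromstring(xml_text)
    for paradigm in root.iter("Paradigm"):
        for variant in paradigm.iter("Variant"):
            for form in variant.iter("Form"):
                yield {
                    "pdgId": paradigm.get("pdgId"),
                    "pos_tag": paradigm.get("tag"),
                    "variant": variant.get("id"),
                    "form_tag": form.get("tag"),
                    "options": form.get("options"),  # None when absent
                    "form": form.text,
                }


records = list(iter_forms(SAMPLE))
for r in records:
    print(r["variant"], r["form_tag"], r["options"], r["form"])
```

Note that the two `MAS` forms of variant `a` share a tag and differ only in `options`, so a structure keyed by form tag alone cannot tell them apart.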

-------

# Import libraries

In [2]:
import os
import bs4
from bs4 import BeautifulSoup
import bs4.builder._lxml

# Reading files


In [3]:
path = "./RELEASE-20230920/"
fl = [] #file list

for root, dirs, files in os.walk(path):
    if root == path:
        fl = [f for f in files if f.endswith('.xml')]

fl.sort()
fl[0]

'A1.xml'

Gather morphosyntactic tags and nest them under the paradigm (PoS) tag, using the following structure: `{posTag: {form: {wordForm: {id:'', lemma:''}}}}`

In [4]:
tags = {}

for f in fl:
    with open(path + f, 'r', encoding='utf-8') as my_file:  # explicit encoding for Cyrillic data
        file = my_file.read()
        
    soup = BeautifulSoup(file, features='xml')
    words = soup.find_all("Paradigm")

    for w in words:
        try:
            if w['tag'] not in tags:
                tags[w['tag']] = {}
                
            forms = w.find_all('Form')
                
            for form in forms:
                if form['tag'] not in tags[w['tag']]:
                    tags[w['tag']][form['tag']] = {form.string: {'id': w.attrs['pdgId'], 'lemma': w.attrs['lemma']}}
                else:
                    tags[w['tag']][form['tag']][form.string] = {'id': w.attrs['pdgId'], 'lemma': w.attrs['lemma']}

        except KeyError:
            # skip paradigms or forms that lack a 'tag', 'pdgId' or 'lemma' attribute
            pass

In [5]:
tags.keys()

dict_keys(['ARP', 'AQP', 'APP', 'AQC', 'A0', 'AXP', 'AQS', 'CKX', 'CSX', 'E', 'I', 'MNKS', 'MAKS', 'MACS', 'MAOC', 'MNCS', 'M0CS', 'MAOS', 'MNCC', 'MXCS', 'NCIINN0', 'NCIINF2', 'NCIINM1', 'NCIINN1', 'NCIINP7', 'NCAPNM1', 'NCIINF3', 'NCAPNF2', 'NCAPNM2', 'NCAINM1', 'NCIIBF3', 'NCAPNS5', 'NCIIBM0', 'NCIIBM1', 'NCIIBN1', 'NCIINS5', 'NCAIBM1', 'NCAPBM1', 'NCIINF0', 'NCAINF2', 'NCIIBF2', 'NCIIBN0', 'NCAPBF2', 'NCIINU5', 'NCAIBF2', 'NCAINU5', 'NCAINS5', 'NCAINF0', 'NCAINP7', 'NCAINN4', 'NCIINM0', 'NCAINM0', 'NCAINN1', 'NCAPNM0', 'NCIIBF0', 'NCAPNN0', 'NCIINM6', 'NCAPNP7', 'NCIINM2', 'NCAPNN1', 'NCAPNF3', 'NCAPNU5', 'N', 'NCAPNF0', 'NCAPNN4', 'NCAINM2', 'NCAINF3', 'NCAPNM6', 'NCIIBP7', 'NCIINN4', 'NCIINP0', 'NCAPBM0', 'NCAINN0', 'NCAPNP0', 'NCAPNF4', 'NCAPNM4', 'NPIIBF0', 'NPAPNM1', 'NPIINP7', 'NPIINM1', 'NPIINN1', 'NPIINF3', 'NPIINF2', 'NPIINM0', 'NPIINF0', 'NPAPNS5', 'NPIINS5', 'NPAPNF2', 'NPAPNM0', 'NPIINM2', 'NPIIBM1', 'NPIINU5', 'NPAPNM2', 'NPAPNF0', 'NPIINN0', 'NPAPNM6', 'NPIINM6', 'NPA

In [6]:
tags['ARP']['MNS']['аага+мны']

{'id': '1162172', 'lemma': 'аага+мны'}
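
The same nested structure can be built without explicit membership checks. The sketch below uses `collections.defaultdict`; the `rows` list is a small illustrative stand-in for the parsed XML, and the helper name `nest` is an assumption made for this example.

```python
from collections import defaultdict

# Illustrative flat records:
# (paradigm tag, form tag, word form, paradigm id, lemma)
rows = [
    ("ARP", "MNS", "аага+мны", "1162172", "аага+мны"),
    ("A0", "", "апа+ш", "1193572", "апа+ш"),
    ("A0", "", "бардо+", "1140055", "бардо+"),
]


def nest(rows):
    """Build {posTag: {formTag: {wordForm: {'id': ..., 'lemma': ...}}}}."""
    index = defaultdict(lambda: defaultdict(dict))
    for pos_tag, form_tag, word, pdg_id, lemma in rows:
        # Missing levels are created on first access.
        index[pos_tag][form_tag][word] = {"id": pdg_id, "lemma": lemma}
    return index


index = nest(rows)
print(index["ARP"]["MNS"]["аага+мны"])
```

As in the original cell, a word form that occurs in several paradigms under the same tags keeps only the last paradigm seen, because the word form is the dictionary key.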

# Exploratory

Various groupings used to explore the data structure; some of them are reused in later sections.

In [10]:
tags['ARP'].keys()

dict_keys(['MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS', 'FLS', 'PNP', 'PGP', 'PDP', 'PAP', 'PIP', 'PLP', 'R'])

In [57]:
m = 'ARP' # examples of all forms of a PoS

for cat in list(tags[m].keys()): 
    ex = [i for i in tags[m][cat].keys()][0]
    print('[\'' + m + '\'][\'' + cat + '\'][\'' + ex + '\']')

['ARP']['MNS']['аага+мны']
['ARP']['MGS']['аага+мнага']
['ARP']['MDS']['аага+мнаму']
['ARP']['MAS']['аага+мны']
['ARP']['MIS']['аага+мным']
['ARP']['MLS']['аага+мным']
['ARP']['NNS']['аага+мнае']
['ARP']['NGS']['аага+мнага']
['ARP']['NDS']['аага+мнаму']
['ARP']['NAS']['аага+мнае']
['ARP']['NIS']['аага+мным']
['ARP']['NLS']['аага+мным']
['ARP']['FNS']['аага+мная']
['ARP']['FGS']['аага+мнай']
['ARP']['FDS']['аага+мнай']
['ARP']['FAS']['аага+мную']
['ARP']['FIS']['аага+мнай']
['ARP']['FLS']['аага+мнай']
['ARP']['PNP']['аага+мныя']
['ARP']['PGP']['аага+мных']
['ARP']['PDP']['аага+мным']
['ARP']['PAP']['аага+мныя']
['ARP']['PIP']['аага+мнымі']
['ARP']['PLP']['аага+мных']
['ARP']['R']['бязму+жне']


In [62]:
pos = {} # getting forms for all paradigm tags

for cat in list(tags.keys()):
    if cat[0] not in pos:
        pos[cat[0]] = {}

    pos[cat[0]][cat] = [key for key in tags[cat]]    

for a in pos['A']: # adjective forms
    print(a, pos['A'][a])

ARP ['MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS', 'FLS', 'PNP', 'PGP', 'PDP', 'PAP', 'PIP', 'PLP', 'R']
AQP ['MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS', 'FLS', 'PNP', 'PGP', 'PDP', 'PAP', 'PIP', 'PLP']
APP ['MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS', 'FLS', 'PNP', 'PGP', 'PDP', 'PAP', 'PIP', 'PLP']
AQC ['MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS', 'FLS', 'PNP', 'PGP', 'PDP', 'PAP', 'PIP', 'PLP']
A0 ['']
AXP ['MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS', 'FLS', 'PNP', 'PGP', 'PDP', 'PAP', 'PIP', 'PLP']
AQS ['MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS'

## PoS paradigms

In [64]:
mtags = {} # grouping paradigm tags that have the same form list, no PoS criteria
cn = 0

for part in pos.keys():
    for morph in pos[part].keys():
        check = {}
        
        for m in mtags.keys():       
           check[m] = mtags[m]['par']
        
        if pos[part][morph] not in check.values():
            cn += 1
            mtags['type'+str(cn)] = {'par': pos[part][morph], 'types':[morph]}

        else:
            for c in check.keys():
                if check[c] == pos[part][morph]:
                    mtags[c]['types'].append(morph)

print(mtags['type10'])

{'par': ['NS', 'GS', 'DS', 'AS', 'IS', 'LS', 'NP', 'GP', 'DP', 'AP', 'IP', 'LP'], 'types': ['NCIINN0', 'NCIINF2', 'NCIINM1', 'NCIINN1', 'NCIINF3', 'NCAPNF2', 'NCAPNM2', 'NCIIBF3', 'NCIIBM0', 'NCIIBM1', 'NCIIBN1', 'NCAIBM1', 'NCAPBM1', 'NCIINF0', 'NCAINF2', 'NCIIBF2', 'NCIIBN0', 'NCAPBF2', 'NCAIBF2', 'NCAINF0', 'NCAINN4', 'NCIINM0', 'NCAINM0', 'NCAINN1', 'NCAPNM0', 'NCIIBF0', 'NCAPNN0', 'NCIINM6', 'NCIINM2', 'NCAPNN1', 'NCAPNF3', 'NCAPNF0', 'NCAPNN4', 'NCAINM2', 'NCAINF3', 'NCAPNM6', 'NCIINN4', 'NCAPBM0', 'NCAINN0', 'NCAPNF4', 'NCAPNM4', 'NPAPNM1', 'NPIINM1', 'NPIINN1', 'NPIINF2', 'NPIINM0', 'NPAPNF2', 'NPAPNM0', 'NPIIBM1', 'NPAPNM2', 'NPAPNF0', 'NPIINN0', 'NPAPNM6', 'NPIIBN0']}
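
The grouping above can also be expressed with the form list itself as a dictionary key. This is a sketch on a toy excerpt of `pos`; `pos_sample` and `group_by_forms` are names invented for the example.

```python
from collections import defaultdict

# Toy excerpt of the `pos` mapping: PoS letter -> paradigm tag -> form tags.
pos_sample = {
    "A": {"AQP": ["MNS", "MGS"], "APP": ["MNS", "MGS"], "A0": [""]},
    "C": {"CKX": [""], "CSX": [""]},
}


def group_by_forms(pos):
    """Map each distinct form-tag sequence to the paradigm tags that use it."""
    groups = defaultdict(list)
    for paradigms in pos.values():
        for tag, forms in paradigms.items():
            # Lists are unhashable, so the sequence is frozen into a tuple.
            groups[tuple(forms)].append(tag)
    return dict(groups)


groups = group_by_forms(pos_sample)
for forms, paradigm_tags in groups.items():
    print(forms, paradigm_tags)
```

Like the original loop, this compares form lists in order, so two paradigms with the same forms listed in a different order end up in different groups.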


## `Tag` attribute disambiguation

All paradigm and form tags within a PoS

In [70]:
alltags = {}

for p in pos.keys():
    alltags[p] = {'partags': [], 'fortags': []}
    
    for k in pos[p].keys():
        if k not in alltags[p]['partags']:
            alltags[p]['partags'].append(k)

        for f in pos[p][k]:
            if f not in alltags[p]['fortags']:
                alltags[p]['fortags'].append(f)

In [71]:
alltags['A']

{'partags': ['ARP', 'AQP', 'APP', 'AQC', 'A0', 'AXP', 'AQS'],
 'fortags': ['MNS',
  'MGS',
  'MDS',
  'MAS',
  'MIS',
  'MLS',
  'NNS',
  'NGS',
  'NDS',
  'NAS',
  'NIS',
  'NLS',
  'FNS',
  'FGS',
  'FDS',
  'FAS',
  'FIS',
  'FLS',
  'PNP',
  'PGP',
  'PDP',
  'PAP',
  'PIP',
  'PLP',
  'R',
  '']}

# Mapping data structures

Collecting all `Paradigm`, `Variant`, `Form` attributes:

In [103]:
attributes = {'Paradigm':[], 'Variant':[], 'Form':[]}

for f in fl:
    with open(path + f, 'r', encoding='utf-8') as my_file:  # explicit encoding for Cyrillic data
        file = my_file.read()
        
    soup = BeautifulSoup(file, features='xml')
    words = soup.find_all("Paradigm")

    for w in words:
        for at in w.attrs.keys():
            if at not in attributes['Paradigm']:
                attributes['Paradigm'].append(at)

        variants = w.find_all('Variant')
        for v in variants:
            for at in v.attrs.keys():
                if at not in attributes['Variant']:
                    attributes['Variant'].append(at)

        forms = w.find_all('Form')
        for form in forms:  # named `form` to avoid shadowing the file-loop variable `f`
            for at in form.attrs.keys():
                if at not in attributes['Form']:
                    attributes['Form'].append(at)
        
print(attributes)

{'Paradigm': ['pdgId', 'lemma', 'tag', 'regulation', 'govern', 'todo', 'meaning', 'type', 'marked', 'comment', 'options', 'theme'], 'Variant': ['id', 'lemma', 'slouniki', 'pravapis', 'rules', 'type', 'regulation', 'tag'], 'Form': ['tag', 'slouniki', 'options', 'pravapis', 'govern', 'type', 'comment', 'todo']}


The following attributes are to be retained:
- `Paradigm`:
    - `pdgId` - original paradigm ID
    - `lemma` 
    - `tag` - morphosyntactic data of the paradigm as a whole, such as part of speech
- `Variant`:
    - `id` - original variant ID (within a paradigm)
    - `lemma`
    - `slouniki` - dictionaries that have this word variant
    - `pravapis` - orthography conventions of the variant
- `Form`:
    - `tag` - morphosyntactic data of the form
    - `slouniki` - dictionaries that have this form
    - `options` - animacy value (animate, inanimate) for forms which change depending on this attribute
    - `pravapis` - orthography conventions of the form
 
__(!) The primary linguistic data is to be retrieved from `tag` values.__

Other attributes are provided inconsistently; they mostly hold the developers' own remarks or data structures reserved for future versions.

Retrieve all `slouniki` and `pravapis` values:

In [115]:
sources = []
orth = []

for f in fl:
    with open(path + f, 'r', encoding='utf-8') as my_file:  # explicit encoding for Cyrillic data
        file = my_file.read()
        
    soup = BeautifulSoup(file, features='xml')
    words = soup.find_all("Paradigm")

    for w in words:
        variants = w.find_all('Variant')
        
        for v in variants:
            for at in v.attrs.keys():
                
                #sources
                if at == 'slouniki':
                    if ':' not in v.attrs[at]:
                        for s in v.attrs[at].split(','):
                            if s.strip() not in sources:
                                sources.append(s.strip())
                        
                    else:
                        for s in v.attrs[at].split(','):
                            if ':' in s:
                                if s.split(':')[0].strip() not in sources:
                                    sources.append(s.split(':')[0].strip())
                        
                #orthography
                if at == 'pravapis':
                    for o in v.attrs[at].split(','):
                        if o.strip() not in orth:
                            orth.append(o.strip())


        forms = w.find_all('Form')
        for form in forms:  # named `form` to avoid shadowing the file-loop variable `f`
            for at in form.attrs.keys():
                
                #sources
                if at == 'slouniki':
                    
                    if ':' not in form.attrs[at]:
                        for s in form.attrs[at].split(','):
                            if s.strip() not in sources:
                                sources.append(s.strip())
                        
                    else:
                        for s in form.attrs[at].split(','):
                            if ':' in s:
                                if s.split(':')[0].strip() not in sources:
                                    sources.append(s.split(':')[0].strip())
                #orthography
                if at == 'pravapis':
                    for o in form.attrs[at].split(','):
                        if o.strip() not in orth:
                            orth.append(o.strip())

print(sources)
print(orth)

['piskunou2012', 'prym2009', 'sbm2012', 'tsblm1996', 'tsbm1984', 'krapivabr2012', 'dzsl2007', 'prym2013', 'biryla1987', 'nazounik2008', 'nazounik2013', '', 'dzsl2013']
['A1957', 'A2008', '']
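
The `slouniki` handling above can be condensed into one helper. This sketch assumes each comma-separated item is either a bare dictionary key or `key:reference`, as in `piskunou2012:107411`; the meaning of the part after the colon is not documented here, so it is kept as an opaque string. Unlike the cell above, it also keeps colon-free items inside a value where other items do contain a colon. The name `parse_slouniki` is hypothetical.

```python
def parse_slouniki(value):
    """Split a `slouniki` attribute into (dictionary key, reference) pairs."""
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty items such as the '' seen in the output above
        name, _, ref = item.partition(":")
        pairs.append((name.strip(), ref.strip() or None))
    return pairs


print(parse_slouniki("piskunou2012:107411"))
print(parse_slouniki("sbm2012,krapivabr2012"))
```

Collecting the distinct dictionary keys then becomes `{name for name, _ in parse_slouniki(value)}`.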


## `slouniki` values
- `tsbm1984` - «Тлумачальны слоўнік беларускай мовы. У 5 т.» (1984)
- `biryla1987` - «Слоўнік беларускай мовы (пад. рэд. М.В. Бірылы)» (1987)
- `nazounik2008` - «Граматычны слоўнік назоўніка» (2008)
- `dzsl2007` - «Граматычны слоўнік дзеяслова» (2008)
- `prym2009` - «Граматычны слоўнік прыметніка, займенніка, лічэбніка, прыслоўя» (2008)
- `krapivabr2012` (?) - «Руска-беларускі слоўнік. У 3 т.» (2011)
- `krapivabr2012` - «Беларуска-рускі слоўнік. У 3 т.» (2012)
- `piskunou2012` - «Вялікі слоўнік беларускай мовы: арфаграфія, акцэнтуацыя, парадыгматыка (каля 223 000 слоў)» (2012)
- `sbm2012` - «Слоўнік беларускай мовы» (2012)
- `nazounik2013` - «Граматычны слоўнік назоўніка» (2013)
- `dzsl2013` - «Граматычны слоўнік дзеяслова» (2013)
- `prym2013` - «Граматычны слоўнік прыметніка, займенніка, лічэбніка, прыслоўя» (2013)
- `tsblm1996` (?) - «Тлумачальны слоўнік беларускай літаратурнай мовы» (2016)

In [119]:
sourcedict = {
    'tsbm1984': '«Тлумачальны слоўнік беларускай мовы. У 5 т.» (1984)',
    'biryla1987': '«Слоўнік беларускай мовы (пад. рэд. М.В. Бірылы)» (1987)',
    'nazounik2008': '«Граматычны слоўнік назоўніка» (2008)',
    'dzsl2007': '«Граматычны слоўнік дзеяслова» (2008)',
    'prym2009': '«Граматычны слоўнік прыметніка, займенніка, лічэбніка, прыслоўя» (2008)',
    # 'krapivabr2012': '«Руска-беларускі слоўнік. У 3 т.» (2011)',  # duplicate key: the next entry would silently overwrite it
    'krapivabr2012': '«Беларуска-рускі слоўнік. У 3 т.» (2012)',
    'piskunou2012': '«Вялікі слоўнік беларускай мовы: арфаграфія, акцэнтуацыя, парадыгматыка (каля 223 000 слоў)» (2012)',
    'sbm2012': '«Слоўнік беларускай мовы» (2012)',
    'nazounik2013': '«Граматычны слоўнік назоўніка» (2013)',
    'dzsl2013': '«Граматычны слоўнік дзеяслова» (2013)',
    'prym2013': '«Граматычны слоўнік прыметніка, займенніка, лічэбніка, прыслоўя» (2013)',
    'tsblm1996': '«Тлумачальны слоўнік беларускай літаратурнай мовы» (2016)'
}

print(sourcedict['tsbm1984'])

«Тлумачальны слоўнік беларускай мовы. У 5 т.» (1984)


## `pravapis` values

- `A1957` - «Правілы беларускай арфаграфіі і пунктуацыі» (1959)
- `A2008` - «Правілы беларускай арфаграфіі і пунктуацыі» (2008)

In [122]:
orthdict = {
    'A1957': '«Правілы беларускай арфаграфіі і пунктуацыі» (1959)',
    'A2008': '«Правілы беларускай арфаграфіі і пунктуацыі» (2008)'
}

orthdict['A2008']

'«Правілы беларускай арфаграфіі і пунктуацыі» (2008)'

## `POS` value

- The part of speech is indicated by the first character of the `tag` value in the `<Paradigm>` container; the following characters may specify its subcategory. For example, the initial __"A"__ in `ARP` of `<Paradigm pdgId="1195935" lemma="нарка+маўскі" tag="ARP">` means "Adjective".
- The original data is split into files by part of speech, named with the same letters: adjectives are stored in `A1.xml` and `A2.xml`, and so on.

In [129]:
pos['A']['ARP'][:10]

['MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'NNS', 'NGS', 'NDS', 'NAS']

In [184]:
posdict = {
    'A': {'pos': 'Adjective', 'postag':'ADJ', 'tags':[]}, 
    'C': {'pos': 'Conjunction', 'postag':'CCONJ', 'tags':[]},  # SCONJ for subordinating conjunctions
    'E': {'pos': 'Particle', 'postag':'PART', 'tags':[]}, 
    'I': {'pos': 'Preposition', 'postag':'ADP', 'tags':[]}, 
    'M': {'pos': 'Numeral', 'postag':'NUM', 'tags':[]}, 
    'N': {'pos': 'Noun', 'postag':'NOUN', 'tags':[]},  # PROPN for proper nouns
    'P': {'pos': 'Participle', 'postag':'VERB', 'VerbForm':'Part', 'tags':[]}, 
    'R': {'pos': 'Adverb', 'postag':'ADV', 'tags':[]}, 
    'S': {'pos': 'Pronoun', 'postag':'PRON', 'tags':[]}, 
    'V': {'pos': 'Verb', 'postag':'VERB', 'tags':[]}, 
    'W': {'pos': 'Predicative', 'postag':None, 'tags':[]},  # прэдыкатыў
    'Y': {'pos': 'Interjection', 'postag':'INTJ', 'tags':[]}, 
    'Z': {'pos': 'Parenthetical word', 'postag':None, 'tags':[]}  # пабочнае слова
          }

for k in pos.keys():
    for t in pos[k].keys():
        posdict[k]['tags'].append(t)
        
for k in posdict.keys():
    print(k, posdict[k])

A {'pos': 'Adjective', 'postag': 'ADJ', 'tags': ['ARP', 'AQP', 'APP', 'AQC', 'A0', 'AXP', 'AQS']}
C {'pos': 'Conjunction', 'postag': 'CCONJ', 'tags': ['CKX', 'CSX']}
E {'pos': 'Particle', 'postag': 'PART', 'tags': ['E']}
I {'pos': 'Preposition', 'postag': 'ADP', 'tags': ['I']}
M {'pos': 'Numeral', 'postag': 'NUM', 'tags': ['MNKS', 'MAKS', 'MACS', 'MAOC', 'MNCS', 'M0CS', 'MAOS', 'MNCC', 'MXCS']}
N {'pos': 'Noun', 'postag': 'NOUN', 'tags': ['NCIINN0', 'NCIINF2', 'NCIINM1', 'NCIINN1', 'NCIINP7', 'NCAPNM1', 'NCIINF3', 'NCAPNF2', 'NCAPNM2', 'NCAINM1', 'NCIIBF3', 'NCAPNS5', 'NCIIBM0', 'NCIIBM1', 'NCIIBN1', 'NCIINS5', 'NCAIBM1', 'NCAPBM1', 'NCIINF0', 'NCAINF2', 'NCIIBF2', 'NCIIBN0', 'NCAPBF2', 'NCIINU5', 'NCAIBF2', 'NCAINU5', 'NCAINS5', 'NCAINF0', 'NCAINP7', 'NCAINN4', 'NCIINM0', 'NCAINM0', 'NCAINN1', 'NCAPNM0', 'NCIIBF0', 'NCAPNN0', 'NCIINM6', 'NCAPNP7', 'NCIINM2', 'NCAPNN1', 'NCAPNF3', 'NCAPNU5', 'N', 'NCAPNF0', 'NCAPNN4', 'NCAINM2', 'NCAINF3', 'NCAPNM6', 'NCIIBP7', 'NCIINN4', 'NCIINP0', 'N

# Accentuation

The original data marks stress with a `+` placed immediately after the stressed vowel, as in `<Form>вало+шка</Form>`. To simplify searching, each form is stored without the mark, and the stress position is kept in a separate attribute. On request, the stress is rendered with the combining acute accent `\u0301`, inserted after the stressed vowel so that it displays over it.

In [163]:
accsample = 'вало+шка'

form = accsample.replace('+','')
accpos = accsample.find('+')
accentuated = form[:accpos] + u'\u0301' + form[accpos:]

print(form, accpos, accentuated)

валошка 4 вало́шка
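
The cell above handles a single `+`. The following sketch generalizes it to forms with no stress mark or with more than one, on the assumption that such forms can occur (hyphenated compounds, for example). The function names are invented for this example.

```python
def split_accent(marked):
    """Return (plain form, stress positions) for a '+'-marked string.

    Each position is the index in the plain form right after a stressed vowel.
    """
    plain = []
    positions = []
    for ch in marked:
        if ch == "+":
            positions.append(len(plain))
        else:
            plain.append(ch)
    return "".join(plain), positions


def render_accent(plain, positions):
    """Insert U+0301 (combining acute accent) after each stressed vowel."""
    out = plain
    # Work right to left so earlier insertions do not shift later indices.
    for p in sorted(positions, reverse=True):
        out = out[:p] + "\u0301" + out[p:]
    return out


plain, positions = split_accent("вало+шка")
print(plain, positions, render_accent(plain, positions))
```

Storing the positions as a list keeps the unmarked form searchable while still allowing the stressed spelling to be rebuilt on demand.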


# `tag` value conversion

Mapping NCorpus notation onto UD markup. Both `<Paradigm>` and `<Form>` containers have `tag` values; see the respective sections under each part of speech. These values are abbreviations of morphosyntactic data, so `AQS` means "Adjective Qualitative Superlative".
 
**Reference**: 
- [GrammarDB documentation [↗︎]](https://www.academia.edu/60297156/%D0%93%D1%80%D0%B0%D0%BC%D0%B0%D1%82%D1%8B%D1%87%D0%BD%D0%B0%D1%8F_%D0%B1%D0%B0%D0%B7%D0%B0_%D0%B1%D0%B5%D0%BB%D0%B0%D1%80%D1%83%D1%81%D0%BA%D0%B0%D0%B9_%D0%BC%D0%BE%D0%B2%D1%8B_Belarusian_Language_Grammar_Database)
- [UD Belarusian-HSE treebank [↗︎]](https://universaldependencies.org/treebanks/be_hse/index.html)

Value lists mostly use ascending alphabetical order.

In [169]:
posdict.keys()

dict_keys(['A', 'C', 'E', 'I', 'M', 'N', 'P', 'R', 'S', 'V', 'W', 'Y', 'Z'])

## Adjective

- POS value: `ADJ` - adjective {Прыметнік}
- Source files: `A1.xml`, `A2.xml`

### `Paradigm` tag

In [132]:
print(alltags['A']['partags'])

['ARP', 'AQP', 'APP', 'AQC', 'A0', 'AXP', 'AQS']


Features:
- `AdjType` {Тып} - adjective type (not found in HSE treebank)
    - `Qlt` - qualitative {якасны}
    - `Rel` - relative {адносны} 
    - `Pssv` - possessive {прыналежны}
- `InflClass` {Словазмяненне}
    - `Ind` - indeclinable {нескланяльны}
- `Degree` {Ступень параўнання}
    - `Pos` - positive {станоўчая}
    - `Cmp` - comparative {вышэйшая}
    - `Sup` - superlative {найвышэйшая}

NCorp Tag | Notes | PoS | AdjType | Degree | InflClass
----- | ----- | ----- | ----- | ----- | -----
AQP | = Adjective Qualitative Positive | ADJ | Qlt | Pos | 
AQC | = Adjective Qualitative Comparative | ADJ | Qlt | Cmp | 
AQS | = Adjective Qualitative Superlative | ADJ | Qlt | Sup | 
ARP | = Adjective Relative Positive | ADJ | Rel | Pos | 
APP | = Adjective Possessive Positive | ADJ | Pssv | Pos | 
AXP | = Adjective (no type tag) Positive | ADJ |  | Pos | 
A0 | = Adjective Indeclinable, one form only | ADJ |  | Pos | Ind
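
The table translates directly into a positional lookup. The sketch below is one possible encoding of it; the function name and the `upos` key are assumptions for illustration, not the package's actual interface.

```python
ADJ_TYPE = {"Q": "Qlt", "R": "Rel", "P": "Pssv"}  # "X" means no type is given
DEGREE = {"P": "Pos", "C": "Cmp", "S": "Sup"}


def decode_adj_paradigm(tag):
    """Decode an adjective paradigm tag such as 'AQS' into UD-style features."""
    feats = {"upos": "ADJ"}
    if tag == "A0":
        # Indeclinable adjectives: a single form, positive degree.
        feats.update({"Degree": "Pos", "InflClass": "Ind"})
        return feats
    _, adj_type, degree = tag  # e.g. 'A', 'Q', 'S'
    if adj_type in ADJ_TYPE:
        feats["AdjType"] = ADJ_TYPE[adj_type]
    feats["Degree"] = DEGREE[degree]
    return feats


for tag in ["ARP", "AQP", "APP", "AQC", "A0", "AXP", "AQS"]:
    print(tag, decode_adj_paradigm(tag))
```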

In [21]:
# examples of each type

for k in tags.keys():
    if k.startswith('A'):
        f1 = [f for f in tags[k].keys()][0] # the first form in the list
        f1count = len([f for f in tags[k][f1].keys()]) # number of occurrences of the first form only
        ff1 = [f for f in tags[k][f1].keys()][0] # first example of the first form
        print(k, f1, f1count, tags[k][f1][ff1])

ARP MNS 51531 {'id': '1162172', 'lemma': 'аага+мны'}
AQP MNS 16588 {'id': '1045057', 'lemma': "аб'е+дзены"}
APP MNS 410 {'id': '1303869', 'lemma': 'аве+чы'}
AQC MNS 243 {'id': '1145622', 'lemma': 'актуа+льнейшы'}
A0  23 {'id': '1193572', 'lemma': 'апа+ш'}
AXP MNS 5 {'id': '1193804', 'lemma': 'блінто+ваны'}
AQS MNS 196 {'id': '1073095', 'lemma': 'найадважны'}


In [98]:
# <Paradigm tag="A0">

for k in tags['A0'].keys():
    print(k)

tags['A0']

{'': {'апа+ш': {'id': '1193572', 'lemma': 'апа+ш'},
  'бардо+': {'id': '1140055', 'lemma': 'бардо+'},
  'бекар': {'id': '1140068', 'lemma': 'бекар'},
  'бемо+ль': {'id': '1140069', 'lemma': 'бемо+ль'},
  'бру+та': {'id': '1193837', 'lemma': 'бру+та'},
  'бу+ф': {'id': '1140070', 'lemma': 'бу+ф'},
  'бэ+ж': {'id': '1192912', 'lemma': 'бэ+ж'},
  'гафрэ': {'id': '1140071', 'lemma': 'гафрэ'},
  'дум-ду+м': {'id': '1194542', 'lemma': 'дум-ду+м'},
  'лю+кс': {'id': '1192913', 'lemma': 'лю+кс'},
  'мадэ+рн': {'id': '1192914', 'lemma': 'мадэ+рн'},
  'мідзі': {'id': '1140056', 'lemma': 'мідзі'},
  'міні': {'id': '1140057', 'lemma': 'міні'},
  'не+та': {'id': '1140059', 'lemma': 'не+та'},
  'но+н-стоп': {'id': '1196120', 'lemma': 'но+н-стоп'},
  'перва+нш': {'id': '1196482', 'lemma': 'перва+нш'},
  'піке+': {'id': '1140060', 'lemma': 'піке+'},
  'суахі+лі': {'id': '1140063', 'lemma': 'суахі+лі'},
  'фрэ+з': {'id': '1140064', 'lemma': 'фрэ+з'},
  'ха+кі': {'id': '1140066', 'lemma': 'ха+кі'},
  'э

### `Form` tag

In [80]:
print(alltags['A']['fortags'])

['MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS', 'FLS', 'PNP', 'PGP', 'PDP', 'PAP', 'PIP', 'PLP', 'R', '']


Features:
- `Gender` {Род}
    - `Fem` - feminine {жаночы}
    - `Masc` - masculine {мужчынскі}
    - `Neut` - neuter {ніякі}
- `Case` {Склон}
    - `Acc` - Accusative {Вінавальны}
    - `Dat` - Dative {Давальны}
    - `Gen` - Genitive {Родны}
    - `Ins` - Instrumental {Творны}
    - `Loc` - Locative {Месны}
    - `Nom` - Nominative {Назоўны}
    - (i) Vocative not provided in NCorpus 
- `Number` {Лік}
    - `Sing` - singular {адзіночны}
    - `Plur` - plural {множны}

The same form tag can appear twice in a paradigm when the form depends on the animacy of the noun that the adjective modifies. In such cases the `options` attribute specifies the animacy (`anim` or `inanim`).

NCorp Tag pattern | Notes | Feature | Value 
----- | ----- | ----- | ----- 
`F..` | = Feminine | Gender | Fem
`M..` | = Masculine | Gender | Masc
`N..` | = Neuter | Gender | Neut
`..S` | = Singular | Number | Sing
`P.P` | = Plural & genderless | Number | Plur
`.A.` | = Accusative | Case | Acc
`.D.` | = Dative | Case | Dat
`.G.` | = Genitive | Case | Gen
`.I.` | = Instrumental | Case | Ins
`.L.` | = Locative | Case | Loc
`.N.` | = Nominative | Case | Nom
`R` | Adjective as adverb, 4 instances | | 
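
A form tag can be decoded position by position: gender (or `P` for genderless plural), case, number. The sketch below assumes three-letter tags follow that layout and returns no features for the empty tag and for `R`, which the table leaves unmapped. The function name is hypothetical.

```python
GENDER = {"F": "Fem", "M": "Masc", "N": "Neut"}
CASE = {"N": "Nom", "G": "Gen", "D": "Dat", "A": "Acc", "I": "Ins", "L": "Loc"}
NUMBER = {"S": "Sing", "P": "Plur"}


def decode_adj_form(tag):
    """Decode an adjective form tag such as 'FAS' into UD-style features."""
    if len(tag) != 3:
        return {}  # '' (indeclinable) and 'R' carry no features in the table
    first, case, number = tag
    feats = {"Case": CASE[case], "Number": NUMBER[number]}
    if first in GENDER:  # 'P' in first position means plural, no gender
        feats["Gender"] = GENDER[first]
    return feats


for tag in ["MNS", "FAS", "NLS", "PNP", "R", ""]:
    print(repr(tag), decode_adj_form(tag))
```

Note that `N` is ambiguous by letter alone: it means neuter in the first position and nominative in the second, which is why the lookup is positional.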

In [31]:
# <Form tag="R">

for k in tags.keys():
    if k.startswith('A'):
        for t in tags[k].keys():
            if t == 'R':               
                print(tags[k][t])

{'бязму+жне': {'id': '1270595', 'lemma': 'бязму+жняя'}, 'наваце+льна': {'id': '1195883', 'lemma': 'наваце+льная'}, 'няце+льна': {'id': '1283274', 'lemma': 'няце+льная'}, 'паўторнаро+дзяча': {'id': '1196402', 'lemma': 'паўторнаро+дзячая'}}


In [51]:
# <Form tag="">

for k in tags.keys():
    if k.startswith('A'):
        for t in tags[k].keys():
            if not t:               
                print(k) # form tags are empty for 'A0' paradigm tag
                print(tags[k][t])

A0
{'апа+ш': {'id': '1193572', 'lemma': 'апа+ш'}, 'бардо+': {'id': '1140055', 'lemma': 'бардо+'}, 'бекар': {'id': '1140068', 'lemma': 'бекар'}, 'бемо+ль': {'id': '1140069', 'lemma': 'бемо+ль'}, 'бру+та': {'id': '1193837', 'lemma': 'бру+та'}, 'бу+ф': {'id': '1140070', 'lemma': 'бу+ф'}, 'бэ+ж': {'id': '1192912', 'lemma': 'бэ+ж'}, 'гафрэ': {'id': '1140071', 'lemma': 'гафрэ'}, 'дум-ду+м': {'id': '1194542', 'lemma': 'дум-ду+м'}, 'лю+кс': {'id': '1192913', 'lemma': 'лю+кс'}, 'мадэ+рн': {'id': '1192914', 'lemma': 'мадэ+рн'}, 'мідзі': {'id': '1140056', 'lemma': 'мідзі'}, 'міні': {'id': '1140057', 'lemma': 'міні'}, 'не+та': {'id': '1140059', 'lemma': 'не+та'}, 'но+н-стоп': {'id': '1196120', 'lemma': 'но+н-стоп'}, 'перва+нш': {'id': '1196482', 'lemma': 'перва+нш'}, 'піке+': {'id': '1140060', 'lemma': 'піке+'}, 'суахі+лі': {'id': '1140063', 'lemma': 'суахі+лі'}, 'фрэ+з': {'id': '1140064', 'lemma': 'фрэ+з'}, 'ха+кі': {'id': '1140066', 'lemma': 'ха+кі'}, 'э+кстра': {'id': '1197979', 'lemma': 'э+кст

## Conjunction

POS values: `CCONJ` - coordinating conjunction, `SCONJ` - subordinating conjunction {злучнік}

In [100]:
print(alltags['C']['partags'])

['CKX', 'CSX']


`Paradigm` tag features:
- `ConjType` {Тып} - conjunction type (not found in HSE treebank)
    - `Crd` - coordinating {злучальны}
    - `Sub` - subordinating {падпарадкавальны}

Subtypes are indicated in the documentation but not provided in XML. UD tags subordinating conjunctions as `SCONJ` rather than `CCONJ`.

NCorp Tag | Notes | POS | ConjType 
-- | -- | -- | --
CKX | = Conjunction Coordinating (no subtype tag)  | CCONJ | Crd
CSX | = Conjunction Subordinating (no subtype tag) | SCONJ | Sub

`Form` tags are absent since the PoS is indeclinable.

In [166]:
print(alltags['C']['fortags'])

['']


## Particle

- POS value: `PART` - particle {часціца}
- Files: `E.xml`

No features

In [165]:
print(alltags['E'])

{'partags': ['E'], 'fortags': ['']}


## Preposition

- POS value: `ADP` - preposition (adposition) {прыназоўнік}
- Files: `I.xml`

No features

In [168]:
print(alltags['I'])

{'partags': ['I'], 'fortags': ['']}


## Numeral

- POS value: `NUM` - numeral {лічэбнік}
- Files: `M.xml`

### `Paradigm` tag

In [181]:
print(alltags['M']['partags'])

['M0CS', 'MACS', 'MAKS', 'MAOC', 'MAOS', 'MNCC', 'MNCS', 'MNKS', 'MXCS']


Features:
- `InflClass` {Словазмяненне} 
    - `Ntype` - noun-like {як у назоўніка} 
    - `Atype` - adjective-like {як у прыметніка}
    - `Ind` - indeclinable {нескланяльны}
- `NumType` {Значэнне} - numeral type by function; extends the HSE treebank values with new types
    - `Card` - cardinal {колькасны}
    - `Ord` - ordinal {парадкавы}
    - `Col` - collective {зборны}
    - `Frac` - fraction {дробавы}
- `NumForm` {Форма} - numeral type by complexity, not in HSE treebank
    - `Sim` - simple {просты}
    - `Com` - complex {складаны}

NCorp Tag | Notes | PoS | InflClass | NumType | NumForm
----- | ----- | ----- | ----- | ----- | ----- 
M0CS | = Numeral indeclinable cardinal simple | NUM | Ind | Card | Sim
MACS | = Numeral adjective-like cardinal simple | NUM | Atype | Card | Sim
MAKS | = Numeral adjective-like collective simple | NUM | Atype | Col | Sim
MAOC | = Numeral adjective-like ordinal complex | NUM | Atype | Ord | Com
MAOS | = Numeral adjective-like ordinal simple | NUM | Atype | Ord | Sim
MNCC | = Numeral noun-like cardinal complex | NUM | Ntype | Card | Com
MNCS | = Numeral noun-like cardinal simple | NUM | Ntype | Card | Sim
MNKS | = Numeral noun-like collective simple | NUM | Ntype | Col | Sim
MXCS | = Numeral (no InflClass tag) cardinal simple, 1 instance | NUM |  | Card | Sim

### `Form` tag

In [182]:
print(alltags['M']['fortags'])

['FNP', 'FGP', 'FDP', 'FAP', 'FIP', 'FLP', 'MNP', 'MGP', 'MDP', 'MAP', 'MIP', 'MLP', 'NNP', 'NGP', 'NDP', 'NAP', 'NIP', 'NLP', 'PNP', 'PGP', 'PDP', 'PAP', 'PIP', 'PLP', 'MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS', 'FLS', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS', '0', 'XXX']


Features:
- `Gender` {Род}
    - `Fem` - feminine {жаночы}
    - `Masc` - masculine {мужчынскі}
    - `Neut` - neuter {ніякі}
- `InflClass` {Словазмяненне} 
    - `Ind` - indeclinable {нескланяльны}
- `Case` {Склон}
    - `Acc` - Accusative {Вінавальны}
    - `Dat` - Dative {Давальны}
    - `Gen` - Genitive {Родны}
    - `Ins` - Instrumental {Творны}
    - `Loc` - Locative {Месны}
    - `Nom` - Nominative {Назоўны}
- `Number` {Лік}
    - `Sing` - singular {адзіночны}
    - `Plur` - plural {множны}

NCorp Tag pattern | Notes | Feature | Value 
----- | ----- | ----- | ----- 
`F..` | = Feminine | Gender | Fem
`M..` | = Masculine | Gender | Masc
`N..` | = Neuter | Gender | Neut
`..S` | = Singular | Number | Sing
`..P` | = Plural (`P.P` is plural & genderless) | Number | Plur
`.A.` | = Accusative | Case | Acc
`.D.` | = Dative | Case | Dat
`.G.` | = Genitive | Case | Gen
`.I.` | = Instrumental | Case | Ins
`.L.` | = Locative | Case | Loc
`.N.` | = Nominative | Case | Nom
`0` | = Indeclinable (all `M0CS`) | InflClass | Ind
`XXX` | Unknown, = `MXCS`| |

## Nouns

POS values: 
- `NOUN` - noun {агульны назоўнік}. Files: `N1.xml`, `N2.xml`, `N3.xml`
- `PROPN` - proper noun {уласны назоўнік}. Files: `NP.xml`

### `Paradigm` tag

In [185]:
print(alltags['N']['partags'])

['NCIINN0', 'NCIINF2', 'NCIINM1', 'NCIINN1', 'NCIINP7', 'NCAPNM1', 'NCIINF3', 'NCAPNF2', 'NCAPNM2', 'NCAINM1', 'NCIIBF3', 'NCAPNS5', 'NCIIBM0', 'NCIIBM1', 'NCIIBN1', 'NCIINS5', 'NCAIBM1', 'NCAPBM1', 'NCIINF0', 'NCAINF2', 'NCIIBF2', 'NCIIBN0', 'NCAPBF2', 'NCIINU5', 'NCAIBF2', 'NCAINU5', 'NCAINS5', 'NCAINF0', 'NCAINP7', 'NCAINN4', 'NCIINM0', 'NCAINM0', 'NCAINN1', 'NCAPNM0', 'NCIIBF0', 'NCAPNN0', 'NCIINM6', 'NCAPNP7', 'NCIINM2', 'NCAPNN1', 'NCAPNF3', 'NCAPNU5', 'N', 'NCAPNF0', 'NCAPNN4', 'NCAINM2', 'NCAINF3', 'NCAPNM6', 'NCIIBP7', 'NCIINN4', 'NCIINP0', 'NCAPBM0', 'NCAINN0', 'NCAPNP0', 'NCAPNF4', 'NCAPNM4', 'NPIIBF0', 'NPAPNM1', 'NPIINP7', 'NPIINM1', 'NPIINN1', 'NPIINF3', 'NPIINF2', 'NPIINM0', 'NPIINF0', 'NPAPNS5', 'NPIINS5', 'NPAPNF2', 'NPAPNM0', 'NPIINM2', 'NPIIBM1', 'NPIINU5', 'NPAPNM2', 'NPAPNF0', 'NPIINN0', 'NPAPNM6', 'NPIINM6', 'NPAPNF3', 'NPIIBN0', 'NPIINF1', 'NPAINF2', 'NPAINM6', 'NPIINP0']


The `POS` value is determined by the 2nd character of the abbreviation:
- `NC_____` is tagged with `NOUN`
- `NP_____` is tagged with `PROPN`
- The rare bare `N` tag is assumed to be a common `NOUN`

Features:
- `Animacy` {Адушаўлёнасць}
    - `Anim` - animate {адушаўлёны}
    - `Inan` - inanimate {неадушаўлёны}
- `Personal` {Асабовасць} - not in HSE features
    - `Per` - personal {асабовы}
    - `Imp` - impersonal {неасабовы}
- `Abbr` {Скарачэнне}
    - `Yes` - abbreviated {скарачэнне}
- `Gender` {Род}
    - `Fem` - feminine {жаночы}
    - `Masc` - masculine {мужчынскі}
    - `Neut` - neuter {ніякі}
    - `Com` - common {агульны}
- `InflClass` {Скланенне}
    - `1d` - 1st substantive declension {1 скланенне}
    - `2d` - 2nd substantive declension {2 скланенне}
    - `3d` - 3rd substantive declension {3 скланенне}
    - `Atype` - adjective-type {ад’ектыўны тып скланення}
    - `Ind` - indeclinable {нескланяльны}
    - `Com` - combined {рознаскланяльны}
    - `Mix` - mixed {змешаны тып скланення}

NCorp Pattern | Notes | Feature | Value 
----- | ----- | ----- | ----- 
`N.A....` | = Animate | Animacy | Anim
`N.I....` | = Inanimate | Animacy | Inan
`N..P...` | = Personal | Personal | Per
`N..I...` | = Impersonal | Personal | Imp
`N...B..` | = Abbreviation | Abbr | Yes
`N....C.` | = Common gender | Gender | Com
`N....F.` | = Feminine | Gender | Fem
`N....M.` | = Masculine | Gender | Masc
`N....N.` | = Neuter gender | Gender | Neut
`N....S5` | = Adjective-type inflection (substantivized), gender indicated in `Form` tag | InflClass | Atype
`N.....1` | 1 declension | InflClass | 1d
`N.....2` | 2 declension | InflClass | 2d
`N.....3` | 3 declension | InflClass | 3d
`N.....0` | indeclinable | InflClass | Ind
`N.....4` | combined declension | InflClass | Com
`N.....6` | mixed declension | InflClass | Mix
`N....P.`, `N....UP.`, `N.....7` | = Plural only, has no gender | | 
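
Noun paradigm tags are positional, so the table can be applied character by character. The sketch below is an illustration under stated assumptions: the bare `N` tag is treated as a common noun (as noted above), a final `5` is mapped to `Atype` whatever the gender slot holds, and unmapped characters (such as `P`, `S` or `U` in the gender slot, or a final `7`) simply add no feature. The function name is hypothetical.

```python
KIND = {"C": "NOUN", "P": "PROPN"}
ANIMACY = {"A": "Anim", "I": "Inan"}
PERSONAL = {"P": "Per", "I": "Imp"}
GENDER = {"C": "Com", "F": "Fem", "M": "Masc", "N": "Neut"}
INFL_CLASS = {"0": "Ind", "1": "1d", "2": "2d", "3": "3d",
              "4": "Com", "5": "Atype", "6": "Mix"}


def decode_noun_paradigm(tag):
    """Decode a noun paradigm tag such as 'NCIINN0' into UD-style features."""
    if len(tag) != 7:
        return {"upos": "NOUN"}  # bare 'N': assumed to be a common noun
    _, kind, animacy, personal, abbr, gender, infl = tag
    feats = {
        "upos": KIND[kind],
        "Animacy": ANIMACY[animacy],
        "Personal": PERSONAL[personal],
    }
    if abbr == "B":
        feats["Abbr"] = "Yes"
    if gender in GENDER:
        feats["Gender"] = GENDER[gender]
    if infl in INFL_CLASS:
        feats["InflClass"] = INFL_CLASS[infl]
    return feats


for tag in ["NCIINN0", "NPAPNM1", "NCIIBF3", "NCAPNP7", "N"]:
    print(tag, decode_noun_paradigm(tag))
```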

### `Form` tag

In [186]:
print(alltags['N']['fortags'])

['NS', 'GS', 'DS', 'AS', 'IS', 'LS', 'NP', 'GP', 'DP', 'AP', 'IP', 'LP', 'VS', 'MNS', 'MGS', 'MDS', 'MAS', 'MIS', 'MLS', 'FNS', 'FGS', 'FDS', 'FAS', 'FIS', 'FLS', 'PNP', 'PGP', 'PDP', 'PAP', 'PIP', 'PLP', 'NNS', 'NGS', 'NDS', 'NAS', 'NIS', 'NLS']


Substantivated nouns indicate gender in the `Form` tag. Some have forms for a single gender only, while others (names of professions, for example) have forms for several genders.

Features:
- `Gender` {Род}
    - `Fem`- feminine {жаночы}
    - `Masc` - masculine {мужчынскі}
    - `Neut` - neutral {ніякі}
- `Case` {Склон}
    - `Acc` - Accusative {Вінавальны}
    - `Dat` - Dative {Давальны}
    - `Gen` - Genitive {Родны}
    - `Ins` - Instrumental {Творны}
    - `Loc` - Locative {Месны}
    - `Nom` - Nominative {Назоўны}
- `Number` {Лік}
    - `Sing` - singular {адзіночны}
    - `Plur` - plural {множны}

NCorp Tag pattern | Notes | Feature | Value 
----- | ----- | ----- | ----- 
`.S`, `..S` | = Singular | Number | Sing
`.P`, `P.P` | = Plural | Number | Plur
`A.`, `.A.`  | = Accusative | Case | Acc
`D.`, `.D.` | = Dative | Case | Dat
`G.`, `.G.` | = Genitive | Case | Gen
`I.`, `.I.` | = Instrumental | Case | Ins
`L.`, `.L.` | = Locative | Case | Loc
`N.`, `.N.` | = Nominative | Case | Nom
`F..` | = Feminine | Gender | Fem
`M..` | = Masculine | Gender | Masc
`N..` | = Neutral | Gender | Neut

## Participle

- POS value: `VERB` - verb {дзеяслоў} with `VerbForm = Part` (participle {дзеепрыметнік})
- Files: `P.xml`

In [194]:
print(alltags['P']['partags'])

['PAPP', 'PARM', 'PPPM', 'PPPP', 'PPRM']


### `Paradigm` tag

Features:
- `Voice` {Стан}
    - `Act` - active {незалежны}
    - `Pass` - passive {залежны}
- `Tense` {Час}
    - `Past` - past {прошлы}
    - `Pres` - present {цяперашні}
- `Aspect` {Трыванне}
    - `Imp` - imperfect {незакончанае}
    - `Perf` - perfect {закончанае}

NCorpus tag | Voice | Tense | Aspect
-|-|-|-
PAPP | Act | Past | Perf
PARM | Act | Pres | Imp
PPPM | Pass | Past | Imp
PPPP | Pass | Past | Perf
PPRM | Pass | Pres | Imp
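
The five rows above are consistent with a positional reading of the tag (2nd character for voice, 3rd for tense, 4th for aspect). That reading is an inference from the table, not something the source states, and the names below are hypothetical.

```python
VOICE = {"A": "Act", "P": "Pass"}
TENSE = {"P": "Past", "R": "Pres"}
ASPECT = {"P": "Perf", "M": "Imp"}


def participle_paradigm_features(tag: str) -> dict:
    """Decode a participle Paradigm tag such as 'PPRM' into features."""
    _, voice, tense, aspect = tag  # expects exactly 4 characters
    return {
        "VerbForm": "Part",
        "Voice": VOICE[voice],
        "Tense": TENSE[tense],
        "Aspect": ASPECT[aspect],
    }


for tag in ["PAPP", "PARM", "PPPM", "PPPP", "PPRM"]:
    print(tag, participle_paradigm_features(tag))
```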

### `Form` tag

In [196]:
alltags['P']['fortags'].sort()
print(alltags['P']['fortags'])

['FAS', 'FDS', 'FGS', 'FHX', 'FIS', 'FLS', 'FNS', 'MAS', 'MDS', 'MGS', 'MHX', 'MIS', 'MLS', 'MNS', 'NAS', 'NDS', 'NGS', 'NHX', 'NIS', 'NLS', 'NNS', 'PAP', 'PDP', 'PGP', 'PIP', 'PLP', 'PNP', 'R', 'XAP', 'XDP', 'XGP', 'XIP', 'XLP', 'XNP']


Participle form tags are similar to adjective form tags, except that the plural also appears with the `X.P` pattern (for example `XNP`) alongside `P.P`.

Features:
- `Gender` {Род}
    - `Fem`- feminine {жаночы}
    - `Masc` - masculine {мужчынскі}
    - `Neut` - neutral {ніякі}
- `Case` {Склон}
    - `Acc` - Accusative {Вінавальны}
    - `Dat` - Dative {Давальны}
    - `Gen` - Genitive {Родны}
    - `Ins` - Instrumental {Творны}
    - `Loc` - Locative {Месны}
    - `Nom` - Nominative {Назоўны}
    - (i) Vocative not provided in NCorpus 
- `Number` {Лік}
    - `Sing` - singular {адзіночны}
    - `Plur` - plural {множны}

Two forms of a word can share the same form tag when the form depends on the animacy of the noun it modifies. In such cases the `options` attribute specifies the animacy.
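
A minimal sketch of reading the `options` attribute, using the adjective forms from the data sample in the introduction. It uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup only to stay self-contained. The mapping of `anim`/`inanim` to `Anim`/`Inan` and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Variant id="a" lemma="нарка+маўскі">
    <Form tag="MAS" options="inanim">нарка+маўскі</Form>
    <Form tag="MAS" options="anim">нарка+маўскага</Form>
    <Form tag="MNS">нарка+маўскі</Form>
</Variant>"""

# Assumed mapping from the source's option values to Animacy values.
OPTIONS_TO_ANIMACY = {"anim": "Anim", "inanim": "Inan"}


def forms_with_animacy(variant_xml: str) -> list:
    """Return (tag, word form, animacy or None) for every Form element."""
    variant = ET.fromstring(variant_xml)
    rows = []
    for form in variant.iter("Form"):
        # Forms without an `options` attribute get None as animacy.
        animacy = OPTIONS_TO_ANIMACY.get(form.get("options"))
        rows.append((form.get("tag"), form.text, animacy))
    return rows


for row in forms_with_animacy(SAMPLE):
    print(row)
```

Keeping animacy next to the tag lets the two `MAS` forms stay distinct when they are stored under the same key.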

NCorp Tag pattern | Notes | Feature | Value 
----- | ----- | ----- | ----- 
`F..` | = Feminine | Gender | Fem
`M..` | = Masculine | Gender | Masc
`N..` | = Neutral | Gender | Neut
`..S` | = Singular | Number | Sing
`X.P` | = Plural & genderless | Number | Plur
`.A.` | = Accusative | Case | Acc
`.D.` | = Dative | Case | Dat
`.G.` | = Genitive | Case | Gen
`.I.` | = Instrumental | Case | Ins
`.L.` | = Locative | Case | Loc
`.N.` | = Nominative | Case | Nom
`R` | Short form | | 

## Adverbs

- POS value: `ADV` - adverb {прыслоўе} 
- Files: `R.xml`

### `Paradigm` tag

In [198]:
alltags['R']['partags'].sort()
print(alltags['R']['partags'])

['RA', 'RG', 'RM', 'RN', 'RS', 'RV', 'RX']


Features:
- `DerFrom` - derivation pattern {Спосаб утварэння} - not found in UD 
    - `Adj` - from adjective {ад прыметнікаў}
    - `Conv`- from converb {ад дзеепрыслоўяў}
    - `Num` - from numeral {ад лічэбнікаў}
    - `Noun` - from noun {ад назоўнікаў}
    - `Pron` - from pronoun {ад займеннікаў}
    - `Verb` - from verb {ад дзеясловаў}

More derivation types are documented (from particles, from prepositions), but they do not occur in the data.

### `Form` tag

In [199]:
alltags['R']['fortags'].sort()
print(alltags['R']['fortags'])

['C', 'P', 'S']


Features:
- `Degree` {Ступень параўнання}
    - `Pos` - positive {станоўчая}, `tag = "P"`
    - `Cmp` - comparative {вышэйшая}, `tag = "C"`
    - `Sup` - superlative {найвышэйшая}, `tag = "S"`
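
Since each adverb form tag is a single letter, the mapping reduces to a lookup. The names below are hypothetical.

```python
DEGREE = {"P": "Pos", "C": "Cmp", "S": "Sup"}


def adverb_form_features(tag: str) -> dict:
    """Decode an adverb Form tag ('P', 'C' or 'S') into a Degree feature."""
    return {"Degree": DEGREE[tag]} if tag in DEGREE else {}


print(adverb_form_features("C"))  # {'Degree': 'Cmp'}
```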

## Pronouns

- POS value: `PRON` - pronoun {займеннік}
- Files: `S.xml`

### `Paradigm` tag

In [204]:
alltags['S']['partags'].sort()
print(alltags['S']['partags'])

['S0E0', 'S0F0', 'S0S0', 'SAD0', 'SAE0', 'SAF0', 'SAF3', 'SAL0', 'SAN0', 'SANX', 'SAS0', 'SNE0', 'SNF0', 'SNL0', 'SNN0', 'SNP1', 'SNP2', 'SNP3', 'SNR0']


Features:
- `InflClass` {Словазмяненне} 
    - `Ntype` - noun-like {як у назоўніка} 
    - `Atype` - adjective-like {як у прыметніка}
    - `Ind` - indeclinable {нескланяльны}, always has `Person = 0`
- `PronType` {Pronoun type} [UD documentation](https://universaldependencies.org/u/feat/PronType.html)
    - `Prs` - personal, reflexive, possessive {асабовы, зваротны, прыналежны}, see `Poss` & `Reflex` features
    - `Rcp` - reciprocal/reflexive {зваротны}
    - `Dem` - demonstrative {указальны}
    - `Emp` - emphatic/intensive {азначальны}
    - `Int` - interrogative-relative {пытальна–адносны}
    - `Neg` - negative {адмоўны}
    - `Ind` - indefinite {няпэўны}
- `Poss` possessive {прыналежны}
    - `Yes`
- `Reflex` reflexive {зваротны}
    - `Yes`
- `Person` {Асоба}
    - `1` - first {першая}
    - `2` - second {другая}
    - `3` - third {трэцяя}
    - `0` - impersonal {безасабовы}
 
NCorp Tag | Notes | POS | InflClass | PronType | Person | Poss | Reflex
-|-|-|-|-|-|-|-
`S0E0` | = Pronoun indeclinable emphatic impersonal | PRON | Ind | Emp | 0 | |
`S0F0` | = Pronoun indeclinable indefinite impersonal | PRON | Ind | Ind | 0 | |
`S0S0` | = Pronoun indeclinable possessive impersonal | PRON | Ind | Prs | 0 | Yes |  
`SAD0` | = Pronoun adjective-like demonstrative impersonal | PRON | Atype | Dem | 0 | |  
`SAE0` | = Pronoun adjective-like emphatic impersonal | PRON | Atype | Emp | 0 | |
`SAF0` | = Pronoun adjective-like indefinite impersonal  | PRON | Atype | Ind | 0 | |
`SAF3` | = Pronoun adjective-like indefinite 3rd person  | PRON | Atype | Ind | 3 | |
`SAL0` | = Pronoun adjective-like interrogative-relative impersonal | PRON | Atype | Int | 0 | |
`SAN0` | = Pronoun adjective-like negative impersonal | PRON | Atype | Neg | 0 | |
`SANX` | = Pronoun adjective-like negative (no person tag) | PRON | Atype | Neg | | | 
`SAS0` | = Pronoun adjective-like possessive impersonal | PRON | Atype | Prs | 0 | Yes | 
`SNE0` | = Pronoun noun-like emphatic impersonal | PRON | Ntype | Emp | 0 | |
`SNF0` | = Pronoun noun-like indefinite impersonal | PRON | Ntype | Ind | 0 | |
`SNL0` | = Pronoun noun-like interrogative-relative impersonal | PRON | Ntype | Int | 0 | |
`SNN0` | = Pronoun noun-like negative impersonal | PRON | Ntype | Neg | 0 | | 
`SNP1` | = Pronoun noun-like personal 1st person | PRON | Ntype | Prs | 1 | |
`SNP2` | = Pronoun noun-like personal 2nd person | PRON | Ntype | Prs | 2 | |
`SNP3` | = Pronoun noun-like personal 3rd person | PRON | Ntype | Prs | 3 | |
`SNR0` | = Pronoun noun-like reflexive impersonal | PRON | Ntype | Prs | 0 | | Yes
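
The table can be reproduced by reading the tag position by position, as in the sketch below. The positional reading is inferred from the rows above, and all names are hypothetical. `Person` is kept as a string, and `X` in the last position is treated as "no person tag".

```python
INFL_CLASS = {"0": "Ind", "A": "Atype", "N": "Ntype"}
# 'S' (possessive), 'P' (personal) and 'R' (reflexive) all map to PronType=Prs.
PRON_TYPE = {"D": "Dem", "E": "Emp", "F": "Ind", "L": "Int",
             "N": "Neg", "P": "Prs", "S": "Prs", "R": "Prs"}


def pronoun_paradigm_features(tag: str) -> dict:
    """Decode a pronoun Paradigm tag such as 'SNP1' into features."""
    _, infl, kind, person = tag  # expects exactly 4 characters
    feats = {"POS": "PRON",
             "InflClass": INFL_CLASS[infl],
             "PronType": PRON_TYPE[kind]}
    if person.isdigit():
        feats["Person"] = person
    if kind == "S":
        feats["Poss"] = "Yes"
    if kind == "R":
        feats["Reflex"] = "Yes"
    return feats


print(pronoun_paradigm_features("SNR0"))
# {'POS': 'PRON', 'InflClass': 'Ntype', 'PronType': 'Prs', 'Person': '0', 'Reflex': 'Yes'}
```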

### `Form` tag

In [205]:
alltags['S']['fortags'].sort()
print(alltags['S']['fortags'])

['0AP', '0AS', '0DP', '0DS', '0GP', '0GS', '0IP', '0IS', '0LP', '0LS', '0NP', '0NS', '1', 'FAS', 'FDS', 'FGS', 'FIS', 'FLS', 'FNS', 'MAS', 'MDS', 'MGS', 'MIS', 'MLS', 'MNS', 'NAS', 'NDS', 'NGS', 'NIS', 'NLS', 'NNS', 'XAP', 'XDP', 'XGP', 'XIP', 'XLP', 'XNP', 'XXX']


Features:
- `Gender` {Род}
    - `Fem`- feminine {жаночы}
    - `Masc` - masculine {мужчынскі}
    - `Neut` - neutral {ніякі}
- `Case` {Склон}
    - `Acc` - Accusative {Вінавальны}
    - `Dat` - Dative {Давальны}
    - `Gen` - Genitive {Родны}
    - `Ins` - Instrumental {Творны}
    - `Loc` - Locative {Месны}
    - `Nom` - Nominative {Назоўны}
    - (i) Vocative not provided in NCorpus 
- `Number` {Лік}
    - `Sing` - singular {адзіночны}
    - `Plur` - plural {множны}
 
NCorp Tag pattern | Notes | Feature | Value 
----- | ----- | ----- | ----- 
`0..` | Genderless | |
`1` | Indeclinable | |
`F..` | = Feminine | Gender | Fem
`M..` | = Masculine | Gender | Masc
`N..` | = Neutral | Gender | Neut
`..S` | = Singular | Number | Sing
`X.P` | = Plural & genderless | Number | Plur
`.A.` | = Accusative | Case | Acc
`.D.` | = Dative | Case | Dat
`.G.` | = Genitive | Case | Gen
`.I.` | = Instrumental | Case | Ins
`.L.` | = Locative | Case | Loc
`.N.` | = Nominative | Case | Nom
`XXX` | Unknown | | 
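
A sketch of applying this table, with a hypothetical helper name. Placeholder characters (`0`, `X`) and the one-character indeclinable tag are assumed to contribute no features.

```python
GENDER = {"F": "Fem", "M": "Masc", "N": "Neut"}
CASE = {"N": "Nom", "G": "Gen", "D": "Dat", "A": "Acc", "I": "Ins", "L": "Loc"}
NUMBER = {"S": "Sing", "P": "Plur"}


def pronoun_form_features(tag: str) -> dict:
    """Decode a pronoun Form tag such as '0AP' or 'FNS' into features."""
    if len(tag) != 3:
        return {}  # e.g. '1', the indeclinable form
    gender, case, number = tag
    feats = {}
    for name, table, char in (("Gender", GENDER, gender),
                              ("Case", CASE, case),
                              ("Number", NUMBER, number)):
        # '0' and 'X' are absent from the tables, so they are skipped.
        if char in table:
            feats[name] = table[char]
    return feats


print(pronoun_form_features("0AP"))  # {'Case': 'Acc', 'Number': 'Plur'}
print(pronoun_form_features("FNS"))  # {'Gender': 'Fem', 'Case': 'Nom', 'Number': 'Sing'}
print(pronoun_form_features("XXX"))  # {}
```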

## Verbs

- POS value: `VERB` - verb {дзеяслоў}
- Files: `V.xml`

### `Paradigm` tag

In [206]:
alltags['V']['partags'].sort()
print(alltags['V']['partags'])

['VDMN1', 'VDMN2', 'VDMN3', 'VDPN1', 'VDPN2', 'VDPR1', 'VIMN1', 'VIMN2', 'VIMN3', 'VIMR1', 'VIMR2', 'VIPN', 'VIPN1', 'VIPN2', 'VIPR1', 'VIPR2', 'VTMN1', 'VTMN2', 'VTMR1', 'VTPN1', 'VTPN2', 'VTPN3', 'VTPR1', 'VTPR2', 'VXMN1', 'VXMN2', 'VXMR1', 'VXMR2', 'VXPN1', 'VXPN2', 'VXPN3', 'VXPR1', 'VXPR2']


Features: 
- `SubCat` - transitivity {Пераходнасць}
    - `Intr` - intransitive {непераходны}
    - `Tran` - transitive {пераходны}
- `Aspect` {Трыванне}
    - `Imp` - imperfect {незакончанае}
    - `Perf` - perfect {закончанае}
- `Reflex` - reflexivity {Зваротнасць}
    - `Yes`
- `InflClass` {Спражэнне}
    - `1c` - 1 conjugation type {першае}
    - `2c` - 2 conjugation type {другое}
    - `Com` - combined type {рознаспрагальны}

NCorpus tag pattern | Notes | Feature | Value
-|-|-|-
`VI...` | = Intransitive | SubCat | Intr
`VT...` | = Transitive | SubCat | Tran
`V.M..` | = Imperfect aspect | Aspect | Imp
`V.P..` | = Perfect aspect | Aspect | Perf
`V..R.` | = Reflexive | Reflex | Yes
`V...1` | = 1 conjugation type | InflClass | 1c
`V...2` | = 2 conjugation type | InflClass | 2c
`V...3` | = combined conjugation type | InflClass | Com
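
A sketch of decoding verb paradigm tags by position, assuming the five-character layout seen in the printed list (for example `VIMR1`: transitivity, aspect, reflexivity, conjugation). The names are hypothetical. The transitivity letters `D` and `X` are not covered by the table, so they yield no `SubCat`, and the short tag `VIPN` yields no `InflClass`.

```python
SUBCAT = {"I": "Intr", "T": "Tran"}
ASPECT = {"M": "Imp", "P": "Perf"}
INFL_CLASS = {"1": "1c", "2": "2c", "3": "Com"}


def verb_paradigm_features(tag: str) -> dict:
    """Decode a verb Paradigm tag such as 'VIMR1' into features."""
    # Pad short tags such as 'VIPN' so unpacking always sees 5 characters.
    _, subcat, aspect, reflex, conj = tag.ljust(5, ".")[:5]
    feats = {}
    if subcat in SUBCAT:
        feats["SubCat"] = SUBCAT[subcat]
    if aspect in ASPECT:
        feats["Aspect"] = ASPECT[aspect]
    if reflex == "R":
        feats["Reflex"] = "Yes"
    if conj in INFL_CLASS:
        feats["InflClass"] = INFL_CLASS[conj]
    return feats


print(verb_paradigm_features("VIMR1"))
# {'SubCat': 'Intr', 'Aspect': 'Imp', 'Reflex': 'Yes', 'InflClass': '1c'}
```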

### `Form` tag

In [207]:
alltags['V']['fortags'].sort()
print(alltags['V']['fortags'])

['0', 'F1P', 'F1S', 'F2P', 'F2S', 'F3P', 'F3S', 'I2P', 'I2S', 'PFS', 'PG', 'PMS', 'PNS', 'PXP', 'R1P', 'R1S', 'R2P', 'R2S', 'R3P', 'R3S', 'RG']


Features:
- `Tense` {Час}
    - `Past` - past {прошлы}
    - `Pres` - present {цяперашні}
    - `Fut` - future {будучы}
- `Mood` {Лад}
    - `Imp` - imperative {загадны}
- `VerbForm` {Форма}
    - `Inf` - infinitive {Інфінітыў}
    - `Conv` - converb/transgressive {дзеепрыслоўе}
- `Gender` {Род}
    - `Fem`- feminine {жаночы}
    - `Masc` - masculine {мужчынскі}
    - `Neut` - neutral {ніякі}
- `Person` {Асоба}
    - `1` - first {першая}
    - `2` - second {другая}
    - `3` - third {трэцяя}
- `Number` {Лік}
    - `Sing` - singular {адзіночны}
    - `Plur` - plural {множны}
 
NCorpus tag pattern | Notes | Feature | Value
-|-|-|-
`0` | = Infinitive | VerbForm | Inf
`I..` | = Imperative | Mood | Imp
`F..` | = Future tense | Tense | Fut
`P..` | = Past tense | Tense | Past
`R..` | = Present tense | Tense | Pres
`.1.` | = 1st person | Person | 1
`.2.` | = 2nd person | Person | 2
`.3.` | = 3rd person | Person | 3
`..P` | = Plural | Number | Plur
`..S` | = Singular | Number | Sing
`.G` | = Converb/Transgressive | VerbForm | Conv
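
The form tags come in three shapes (`0`, two-character converbs, three-character finite forms), which the sketch below handles separately. The names are hypothetical. Reading `F`/`M`/`N` in the middle position of past-tense tags (`PFS`, `PMS`, `PNS`) as gender is an assumption based on the tag list and the `Gender` feature above. Converbs get only `VerbForm`, following the table strictly.

```python
HEAD = {"I": ("Mood", "Imp"), "F": ("Tense", "Fut"),
        "P": ("Tense", "Past"), "R": ("Tense", "Pres")}
GENDER = {"F": "Fem", "M": "Masc", "N": "Neut"}
NUMBER = {"S": "Sing", "P": "Plur"}


def verb_form_features(tag: str) -> dict:
    """Decode a verb Form tag such as '0', 'RG' or 'F1P' into features."""
    if tag == "0":
        return {"VerbForm": "Inf"}
    if len(tag) == 2 and tag[1] == "G":
        return {"VerbForm": "Conv"}
    if len(tag) != 3:
        return {}
    head, middle, number = tag
    feats = {}
    if head in HEAD:
        name, value = HEAD[head]
        feats[name] = value
    if middle in "123":
        feats["Person"] = middle
    elif middle in GENDER:  # assumed: gender of past-tense forms
        feats["Gender"] = GENDER[middle]
    if number in NUMBER:
        feats["Number"] = NUMBER[number]
    return feats


print(verb_form_features("F1P"))  # {'Tense': 'Fut', 'Person': '1', 'Number': 'Plur'}
print(verb_form_features("PFS"))  # {'Tense': 'Past', 'Gender': 'Fem', 'Number': 'Sing'}
```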

## Interjections

- POS value: `INTJ` - interjection {выклічнік}
- Files: `Y.xml`

No features

In [208]:
print(alltags['Y'])

{'partags': ['Y'], 'fortags': ['']}
