# Lab 5 - Morphosyntactic tagging

Morphosyntactic tagging is one of the core tasks in NLP. It assigns morphological
and (in some languages) syntactic tags to the words in a text. For example, it makes it possible
to distinguish between the major grammatical categories, such as nouns and verbs.

In [None]:
import requests as req
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from collections import defaultdict
from typing import Tuple, Dict

## Tasks

1. Download the [Docker image](https://hub.docker.com/r/djstrong/krnnt2) of KRNNT2. It includes the following tools:
   1. Morfeusz2 - morphological dictionary
   1. Corpus2 - corpus access library
   1. Toki - tokenizer for Polish
   1. Maca - morphosyntactic analyzer
   1. KRNNT - Polish tagger
   

1. Use the tool to tag and lemmatize the corpus of bills.

In [None]:
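def krnnt(text, url="http://localhost:9200"):
    """Tag `text` with KRNNT and return a list of (lemma, category) pairs."""
    response = req.post(url, data=text.encode("utf-8"))
    result = []
    for line in response.content.decode("utf-8").split("\n"):
        # Interpretation lines start with a tab; token lines and the blank
        # lines between sentences do not, so they are skipped here.
        if not line.startswith("\t"):
            continue
        fields = line.split("\t")
        if len(fields) >= 3:
            # Keep the downcased lemma and the first tag segment (the category).
            result.append((fields[1].lower(), fields[2].split(":")[0]))
    return result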

In [None]:
krnnt("Będę lematyzował")
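
To see what the function has to parse without a running container, here is a minimal sketch over a hard-coded response. The sample string is an assumption about the plain-text output (a token line followed by a tab-indented interpretation line, with a blank line between sentences), and `parse_krnnt_output` is a hypothetical helper, not part of KRNNT.

```python
# Assumed shape of a KRNNT plain-text response; the content is made up.
SAMPLE_RESPONSE = (
    "Ala\tnone\n"
    "\tAla\tsubst:sg:nom:f\tdisamb\n"
    "ma\tspace\n"
    "\tmieć\tfin:sg:ter:imperf\tdisamb\n"
    "\n"
    "Kot\tnone\n"
    "\tkot\tsubst:sg:nom:m2\tdisamb\n"
)


def parse_krnnt_output(raw):
    """Return (downcased lemma, grammatical class) for every interpretation line."""
    pairs = []
    for line in raw.split("\n"):
        # Only interpretation lines start with a tab.
        if not line.startswith("\t"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        lemma, tag = fields[1], fields[2]
        # The grammatical class is the first colon-separated segment of the tag.
        pairs.append((lemma.lower(), tag.split(":")[0]))
    return pairs


parsed = parse_krnnt_output(SAMPLE_RESPONSE)
print(parsed)
```

Selecting lines by their leading tab, rather than pairing lines two by two, keeps the parser aligned when a blank line separates sentences.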

1. Using the tagged corpus, compute bigram statistics for tokens consisting of:
   1. lemmatized, downcased word
   1. morphosyntactic category of the word (noun, verb, etc.)
   

In [None]:
corpora = []
for file in tqdm(list(Path("../data/").glob("*.txt"))):
    # Read explicitly as UTF-8 (assumed corpus encoding) so Polish diacritics survive.
    with file.open(encoding="utf-8") as f:
        text = f.read()
    lemmatized = krnnt(text)
    corpora.append(lemmatized)

2. Exclude bigrams containing non-words (such as numbers, punctuation, etc.)

In [None]:
word_corpora = [
    [
        c for c in corp if c[0].isalpha()
    ]
    for corp in corpora
]
word_corpora[0][:5]

In [None]:
words = defaultdict(int)
for corp in word_corpora:
    for w in corp:
        words[w] = words[w] + 1

In [None]:
Ngram = Tuple[Tuple[str, str], Tuple[str, str]]

In [None]:
bigrams = defaultdict(int)

for corp in word_corpora:
    corp_bigrams = zip(corp[:-1], corp[1:])
    for b in corp_bigrams:
        bigrams[b] = bigrams[b] + 1

list(bigrams.keys())[0]
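
Filtering tokens first and pairing neighbours afterwards makes two words that were separated by a number or a punctuation mark count as a bigram. One alternative reading of the task is to pair neighbours first and then drop every bigram that contains a non-word. The sketch below illustrates that variant on made-up tokens; the data, the tags and the `is_word` helper are illustrative assumptions.

```python
from collections import Counter

# Hypothetical (lemma, category) tokens in their original order.
tokens = [
    ("ustawa", "subst"), ("z", "prep"), ("dzień", "subst"),
    ("1", "num"), ("styczeń", "subst"), (".", "interp"),
    ("minister", "subst"), ("właściwy", "adj"),
]


def is_word(token):
    """A token counts as a word when its lemma is purely alphabetic."""
    return token[0].isalpha()


# Pair truly adjacent tokens, then keep only pairs made of two words.
adjacent_bigrams = Counter(
    (left, right)
    for left, right in zip(tokens, tokens[1:])
    if is_word(left) and is_word(right)
)
print(adjacent_bigrams)
```

With this ordering, "dzień" and "styczeń" do not form a bigram, because the number between them breaks adjacency.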

3. Compute the LLR statistic for this dataset.

In [None]:
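def H(k: np.ndarray):
    """Sum of p * log(p) over the cells of k, i.e. the negated Shannon entropy."""
    N = k.sum()
    # The small constant keeps log() finite for empty cells.
    return (
        (k / N) * np.log(k / N + 1e-7)
    ).sum()

def LLR(k: np.ndarray):
    """Log-likelihood ratio for a 2x2 contingency table k."""
    return (2 * k.sum()) * (
        H(k) -
        H(k.sum(axis=0)) -
        H(k.sum(axis=1))
    )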
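
As a sanity check of the formula, the following standard-library sketch evaluates it on two small tables. It is not the notebook's implementation: it skips empty cells instead of adding `1e-7`, and the names `entropy_term` and `llr_2x2` are hypothetical.

```python
import math


def entropy_term(counts):
    """Sum of p * log(p) over non-zero counts (the negated entropy)."""
    total = sum(counts)
    return sum((c / total) * math.log(c / total) for c in counts if c > 0)


def llr_2x2(table):
    """Log-likelihood ratio of a 2x2 table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    total = a + b + c + d
    return 2 * total * (
        entropy_term([a, b, c, d])        # all four cells
        - entropy_term([a + b, c + d])    # row sums
        - entropy_term([a + c, b + d])    # column sums
    )


# Rows and columns are independent here, so the score should be about 0.
independent = llr_2x2([[10, 10], [10, 10]])
# The two events always co-occur or are both absent: a strong association.
associated = llr_2x2([[20, 0], [0, 20]])
print(independent, associated)
```

A table with independent rows and columns scores close to zero, while a perfectly associated one scores `80 * log(2)`, roughly 55.45 for these counts.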

In [None]:
all_ngrams = np.sum(list(bigrams.values()))

def incidence(bigram: Ngram, ngrams: Dict[Ngram, int], words: Dict[Tuple[str, str], int]) -> np.ndarray:
    w1, w2 = bigram
    w1_w2 = ngrams.get((w1, w2), 0)
    w1_not_w2 = words[w1] - w1_w2
    w2_not_w1 = words[w2] - w1_w2
    # Everything that is neither (w1, w2), nor w1 without w2, nor w2 without w1.
    not_w1_not_w2 = all_ngrams - w1_not_w2 - w2_not_w1 - w1_w2
    return np.array([
        [w1_w2, w1_not_w2],
        [w2_not_w1, not_w1_not_w2]
    ])
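
`incidence` takes its marginals from unigram counts and its total from bigram counts, so the four cells are only approximately consistent with each other. A possible variant, sketched below on made-up counts, derives every cell from the bigram counts alone so that the table always sums to the number of bigrams. The `contingency` helper and the toy data are assumptions for illustration.

```python
from collections import Counter

# Hypothetical bigram counts over lemmas.
toy_bigrams = Counter({
    ("rada", "minister"): 3,
    ("rada", "gmina"): 1,
    ("prezes", "rada"): 2,
    ("właściwy", "minister"): 4,
})


def contingency(bigram, bigram_counts):
    """Build a 2x2 table for `bigram` using only bigram counts."""
    first, second = bigram
    total = sum(bigram_counts.values())
    # How often `first` opens a bigram and how often `second` closes one.
    first_total = sum(n for (a, _), n in bigram_counts.items() if a == first)
    second_total = sum(n for (_, b), n in bigram_counts.items() if b == second)
    both = bigram_counts.get(bigram, 0)
    only_first = first_total - both
    only_second = second_total - both
    neither = total - both - only_first - only_second
    return [[both, only_first], [only_second, neither]]


table = contingency(("rada", "minister"), toy_bigrams)
print(table)
```

Scanning all bigrams on every call is fine for a toy example; on a real corpus the two marginal counts would be precomputed once.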

In [None]:
bgrams_llrs = {
    ngram: LLR(incidence(ngram, bigrams, words))
    for ngram in bigrams
}
bgrams_llrs;

4. Select the top 50 results with a noun in the first position and a noun or an adjective in the second position.

In [None]:
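noun_phrase_bigrams = [
    ((w1, a1), (w2, a2))
    for ((w1, a1), (w2, a2)) in bgrams_llrs
    # Noun first, then a noun or an adjective, as the task requires.
    if a1 == "subst" and a2 in ("subst", "adj")
]
# Sort by descending LLR and keep the 50 strongest collocations.
{
    k: bgrams_llrs[k]
    for k in sorted(noun_phrase_bigrams, key=lambda k: -bgrams_llrs[k])[:50]
}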
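
The selection step can also be written with `heapq.nlargest`, which avoids sorting the whole dictionary when only the top few entries are needed. The sketch below runs on made-up scores; `toy_scores` and `top_noun_phrases` are hypothetical names and the numbers are not results from the corpus.

```python
import heapq

# Hypothetical LLR scores keyed by ((lemma, category), (lemma, category)).
toy_scores = {
    (("rada", "subst"), ("minister", "subst")): 41.0,
    (("minister", "subst"), ("właściwy", "adj")): 57.5,
    (("mieć", "fin"), ("prawo", "subst")): 80.2,
    (("dzień", "subst"), ("ogłosić", "praet")): 12.3,
    (("skarb", "subst"), ("państwo", "subst")): 66.1,
}


def top_noun_phrases(scores, n=50):
    """Return the n best-scoring bigrams of the form noun + (noun or adjective)."""
    candidates = (
        (bigram, score)
        for bigram, score in scores.items()
        if bigram[0][1] == "subst" and bigram[1][1] in ("subst", "adj")
    )
    return heapq.nlargest(n, candidates, key=lambda item: item[1])


top2 = top_noun_phrases(toy_scores, n=2)
print(top2)
```

The highest-scoring toy bigram starts with a verb, so it is filtered out before ranking.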


## Hints

1. A morphosyntactic analyzer provides the possible values of morphosyntactic tags for the words.
   E.g. for the Polish word "ma" it can produce the following interpretations:
   ``` 
    ma	space
            mieć	fin:sg:ter:imperf
            mój  	adj:sg:nom:f:pos
            mój  	adj:sg:voc:f:pos
   ```
   1. The first interpretation shows that the word can be a verb in the singular, 3rd person.
   1. The second interpretation shows that the word can be an adjective in the singular, nominative, feminine.
   1. The third interpretation shows that the word can be an adjective in the singular, vocative, feminine.
1. The full list of tags is available at [NKJP](http://nkjp.pl/poliqarp/help/ense2.html).
1. A morphosyntactic tagger selects one of the interpretations of a word, taking into account its context.
   It can take the interpretation from a dictionary (like KRNNT), but it can also compute it dynamically (e.g. 
   [COMBO](https://github.com/360er0/COMBO) is a tagger that does not need a morphosyntactic analyzer).
1. The information provided by a tagger can be useful for many applications. You can select words from a particular
   grammatical category or you can submit the data to a downstream task such as text classification.
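
The tags in the first hint are colon-separated: the first segment is the grammatical class and the remaining segments are its attributes. A minimal sketch of splitting them, using the three interpretations of "ma" listed above (`split_tag` is a hypothetical helper):

```python
def split_tag(tag):
    """Split a tag such as 'adj:sg:nom:f:pos' into its class and attributes."""
    grammatical_class, *attributes = tag.split(":")
    return grammatical_class, attributes


# (lemma, tag) interpretations of "ma" taken from the hint above.
interpretations = [
    ("mieć", "fin:sg:ter:imperf"),
    ("mój", "adj:sg:nom:f:pos"),
    ("mój", "adj:sg:voc:f:pos"),
]

# The set of grammatical classes the analyzer considers possible for "ma".
classes = {split_tag(tag)[0] for _, tag in interpretations}
print(classes)
```

A tagger then has to pick one of these interpretations based on the context of the word.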