## Task 3

In this task you have to create a network that looks at the characters of a word and tries to guess whether the word is a noun, a verb, an adjective, and so on. More precisely: the input is a word (without context) and the output is a POS tag (part-of-speech tag). Since some words are ambiguous and we have no context, the network should return the set of possible tags.
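Because the output is a set of tags rather than a single tag, one natural way to frame it is as multi-label classification. The sketch below shows how a label string could be turned into a multi-hot target and how per-tag scores could be turned back into a set. The tag list, the helper names `encode` and `decode`, and the 0.5 threshold are assumptions made for illustration, not part of the task.

```python
# Hypothetical tag inventory, for illustration only; the real one comes from the data.
TAGS = ["NN", "VB", "VBD", "VBN", "JJ"]
TAG_TO_ID = {tag: i for i, tag in enumerate(TAGS)}


def encode(label_string):
    """Turn a label string such as 'VBD_VBN' into a multi-hot vector."""
    vector = [0] * len(TAGS)
    for tag in label_string.split("_"):
        vector[TAG_TO_ID[tag]] = 1
    return vector


def decode(scores, threshold=0.5):
    """Return the set of tags whose score exceeds the (assumed) threshold."""
    return {tag for tag, score in zip(TAGS, scores) if score > threshold}


target = encode("VBD_VBN")
predicted = decode([0.1, 0.2, 0.9, 0.7, 0.3])
```

Here `target` has ones in the `VBD` and `VBN` positions, and `predicted` contains exactly those two tags.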

The data is taken from the Universal Dependencies English corpus, and of course it contains errors, especially because not all possible tags occurred in the data.

Train a network (4p) or two networks (+2p) solving this task. Both networks should look at character n-grams occurring in the word. There are two options:

* **Fixed size:** for instance, take the 2-, 3-, and 4-character suffixes of the word and use them as features (with one-hot encoding). You can also combine prefix and suffix features. A simple, useful trick: when looking at suffixes, add some '_' characters at the beginning of the word to guarantee that shorter words have suffixes of the desired length.

* **Variable size:** take, for instance, 4-grams (or 4-grams and 3-grams) and use a Deep Averaging Network. A simple trick: add an extra character at the beginning and at the end of the word to encode the information that an n-gram occurs at a special position ('ed' at the end has a slightly different meaning than 'ed' in the middle).
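The variable-size option can be sketched as follows. This is a minimal illustration, assuming `$` as the boundary character; the function name `char_ngrams` and the fallback for very short words are choices made here, not prescribed by the task.

```python
def char_ngrams(word, n, boundary="$"):
    """Return all character n-grams of `word`, with a boundary marker on both ends."""
    marked = boundary + word + boundary
    if len(marked) < n:
        # Assumption: a word too short for any n-gram contributes itself as one feature.
        return [marked]
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


trigrams = char_ngrams("walked", 3)
```

For "walked" this yields `$wa`, `wal`, `alk`, `lke`, `ked`, `ed$`, so the word-final `ed$` is a different feature from an `ed` inside a word. The number of n-grams depends on the word length, which is why an averaging step over their embeddings is needed before a fixed-size layer.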


In [1]:
from pathlib import Path
import matplotlib.pyplot as plt
from torchtext import vocab
from scipy import sparse
import pandas as pd

In [4]:
DATA_PATH = Path("data")
TRAIN_FILEPATH = DATA_PATH / "english_tags_dev.txt"
TEST_FILEPATH = DATA_PATH / "english_tags_test.txt"

In [65]:
df = pd.read_csv(TRAIN_FILEPATH, sep=" ", names=["word", "labels"])
df.head()

         word   labels
0  Confidence       NN
1         the    DT_IN
2       pound       NN
3          is  NNS_VBZ
4    expected  VBD_VBN


In [75]:
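# Count how often each individual tag occurs (a label such as "VBD_VBN" holds several tags).
label_counts = df["labels"].str.split("_").explode().value_counts()
# Map integer ids to tags (most frequent first) and back.
ID_TO_LABEL = dict(enumerate(label_counts.index))
LABEL_TO_ID = {label: idx for idx, label in ID_TO_LABEL.items()}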

In [140]:
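def process(word):
    """Return the 2-, 3- and 4-character suffixes and prefixes of `word` as features."""
    result = []
    for i in [2, 3, 4]:
        # Left-pad with "_" so short words still yield a suffix of length i; "$" marks the word end.
        result.append(("____" + word)[-i:] + "$")
        # Right-pad with "_" for prefixes; "$" marks the word start.
        result.append("$" + (word + "____")[:i])
    return result
process("dictionary")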

['ry$', '$di', 'ary$', '$dic', 'nary$', '$dict']

In [139]:
input_vocab = vocab.build_vocab_from_iterator(map(process, df["word"]))
label_vocab = vocab.build_vocab_from_iterator(df["labels"].str.split("_"))

43

In [142]:
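# Sparse multi-hot matrices: one row per word, one column per n-gram feature (X) or tag (y).
X = sparse.lil_array((len(df), len(input_vocab)))
y = sparse.lil_array((len(df), len(label_vocab)))
for word_idx, (subwords, labels) in enumerate(zip(map(process, df["word"]), df["labels"].str.split("_"))):
    X[word_idx, [input_vocab[subword] for subword in subwords]] = 1
    y[word_idx, [label_vocab[label] for label in labels]] = 1
# LIL is convenient to fill incrementally; CSR is better suited to arithmetic and row slicing.
X, y = X.tocsr(), y.tocsr()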
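With multi-hot features and targets in place, a fixed-size network could be trained roughly as sketched below. This is only an illustration under stated assumptions: PyTorch is used, small random dense tensors (`X_toy`, `y_toy`) stand in for the real `X` and `y` (which are sparse and would need converting, for example batch by batch with `.toarray()`), and the layer sizes, learning rate, number of steps and the 0.5 threshold are arbitrary choices.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins for the notebook's X and y (assumption: dense float tensors).
n_words, n_features, n_labels = 8, 20, 5
X_toy = (torch.rand(n_words, n_features) < 0.3).float()
y_toy = (torch.rand(n_words, n_labels) < 0.3).float()

model = nn.Sequential(
    nn.Linear(n_features, 16),
    nn.ReLU(),
    nn.Linear(16, n_labels),  # one logit per tag
)
loss_fn = nn.BCEWithLogitsLoss()  # treats each tag as an independent yes/no decision
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X_toy), y_toy)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# A tag is predicted when its sigmoid probability exceeds the assumed threshold.
with torch.no_grad():
    predicted = torch.sigmoid(model(X_toy)) > 0.5
```

Binary cross-entropy over independent per-tag logits fits the requirement that the network return a set of tags; a softmax with cross-entropy would instead force exactly one tag per word.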