# ANLP Lab 10
## Under-resourced language processing

## Task 1

Consider the following entry from a Welsh-English dictionary:


**cymdogaeth (-au)** *nf* neighbourhood


1. What are the 4 facts that are represented by this entry?
    * cymdogaeth — headword/lemma
    * (-au) — plural
    * nf — noun, feminine
    * neighbourhood — English translation
2. How might you extract these facts automatically?
3. Explain how you would use crowd-sourcing or gamification in the process of extracting data from this dictionary entry.
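
For question 2, one possible approach is a regular expression over the plain-text entry. The sketch below assumes the entry has already been flattened to `headword (plural suffix) pos gloss`, with the bold and italic markup stripped. The pattern and the helper name `parse_entry` are illustrative assumptions, not part of any dictionary toolkit.

```python
import re

# Assumed layout: headword, optional "(-suffix)", a POS code such as "nf", then the gloss.
ENTRY_RE = re.compile(
    r"^(?P<headword>\S+)\s+"
    r"(?:\((?P<plural>-[^)]+)\)\s+)?"
    r"(?P<pos>[a-z]+)\s+"
    r"(?P<gloss>.+)$"
)

def parse_entry(line):
    """Return the facts of an entry as a dict, or None if the line does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    facts = match.groupdict()
    # "nf" = noun, feminine: the first letter is the word class, the rest is the gender
    facts["word_class"] = facts["pos"][0]
    facts["gender"] = facts["pos"][1:] or None
    return facts

print(parse_entry("cymdogaeth (-au) nf neighbourhood"))
```

Real dictionaries contain many entry layouts (multiple senses, irregular plurals, usage notes), so a single pattern like this would only cover the simplest cases.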

## Task 2
### Build a new resource

In this task, we will build a new Welsh-Urdu dictionary using two existing bilingual dictionaries: Welsh-English and Urdu-English.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd drive/MyDrive/Colab Notebooks

/content/drive/MyDrive/Colab Notebooks


In [None]:
# Use cy_en and ur_en to create a new Welsh-Urdu bilingual dictionary

with open("cy-en.txt", "r", encoding='utf-8') as f:
   cy_en = {entry.split("\t")[0].lower(): entry.split("\t")[1].lower() for entry in f.read().split("\n")}

with open("ur-en.txt", "r", encoding='utf-8') as f:
    ur_en = dict()
    for entry in f.read().split("\n"):
        try:
            ur_en[entry.split("\t")[0].lower()] = entry.split("\t")[1].lower()
        except IndexError:
            pass
    
# first, we need to invert one of the dictionaries
en_ur = {v: k for k, v in ur_en.items()}

cy_ur = dict()
for i in cy_en:
    if cy_en[i] in en_ur:
        cy_ur[i] = en_ur[cy_en[i]]


In [None]:
print(len(cy_en))
print(len(ur_en))
print(len(en_ur))
print(len(cy_ur))

9936
1609
1258
816


In [None]:
print(cy_ur['gwyn'])
print(ur_en['safed'])

safed
white


What are the flaws of the dictionary we built? What can we do to improve it?
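
One flaw is visible in the counts above: inverting `ur_en` with a plain dict comprehension keeps only one Urdu word per English gloss, so 1609 entries shrink to 1258. A minimal sketch of a one-to-many pivot is shown below. The function name `pivot` and the toy entries (including the placeholder strings `ur_alt` and `ur_cat`) are assumptions made for illustration and are not taken from the lab files.

```python
from collections import defaultdict

def pivot(src_en, tgt_en):
    """Join two {word: english_gloss} dicts on the gloss, keeping every candidate."""
    en_tgt = defaultdict(set)
    for tgt_word, gloss in tgt_en.items():
        en_tgt[gloss].add(tgt_word)
    return {
        src_word: sorted(en_tgt[gloss])
        for src_word, gloss in src_en.items()
        if gloss in en_tgt
    }

# Toy data: "ur_alt" and "ur_cat" are placeholders, not real dictionary entries
cy_en_toy = {"gwyn": "white", "ci": "dog"}
ur_en_toy = {"safed": "white", "ur_alt": "white", "ur_cat": "cat"}
print(pivot(cy_en_toy, ur_en_toy))
```

Keeping all candidates does not solve the deeper problems (exact string matching on the gloss, polysemous English pivots), but it makes the ambiguity visible instead of silently discarding it.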


## Task 3
### Frequency-based part-of-speech tagger

Let's build a simple POS tagger that annotates each word in isolation with its most frequent tag. The only calculations required are POS-tag counts per word in the training data ([the Irish treebank](https://github.com/UniversalDependencies/UD_Irish-IDT/tree/master) from Universal Dependencies). Once the occurrences are counted, the frequency tagger is ready to annotate sentences.

In [None]:
!wget https://raw.githubusercontent.com/UniversalDependencies/UD_Irish-IDT/master/ga_idt-ud-train.conllu
!wget https://raw.githubusercontent.com/UniversalDependencies/UD_Irish-IDT/master/ga_idt-ud-test.conllu
!wget https://raw.githubusercontent.com/UniversalDependencies/UD_Irish-IDT/master/ga_idt-ud-dev.conllu 

--2021-04-23 03:14:50--  https://raw.githubusercontent.com/UniversalDependencies/UD_Irish-IDT/master/ga_idt-ud-train.conllu
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6137980 (5.9M) [text/plain]
Saving to: ‘ga_idt-ud-train.conllu.3’


2021-04-23 03:14:50 (45.8 MB/s) - ‘ga_idt-ud-train.conllu.3’ saved [6137980/6137980]

--2021-04-23 03:14:50--  https://raw.githubusercontent.com/UniversalDependencies/UD_Irish-IDT/master/ga_idt-ud-test.conllu
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 656899 (642K) [text/plain]
Saving to: ‘

In [None]:
!pip install conllu



In [None]:
import conllu
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

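A file in [CoNLL-U format](https://universaldependencies.org/format.html) contains three types of lines (items 1-3 below); items 4-6 give the rules for sentences and fields: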

1. Word lines containing the annotation of a word/token in 10 fields separated by single tab characters; see below.
2. Blank lines marking sentence boundaries.
3. Comment lines starting with hash (#).
4. Sentences consist of one or more word lines, and word lines contain the following fields:

    * ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
    * FORM: Word form or punctuation symbol.
    * LEMMA: Lemma or stem of word form.
    * UPOS: Universal part-of-speech tag. [The description of tags](https://universaldependencies.org/u/pos/index.html).
    * XPOS: Language-specific part-of-speech tag; underscore if not available.
    * FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
    * HEAD: Head of the current word, which is either a value of ID or zero (0).
    * DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
    * DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
    * MISC: Any other annotation.

5. Fields must not be empty.
6. Fields other than FORM, LEMMA, and MISC must not contain space characters.


Here is an example. 

![](https://raw.githubusercontent.com/ancatmara/data-science-nlp/master/img/dep-annot.png)
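
To make the field layout concrete, here is a minimal sketch that splits a single word line by hand, without any library. The lower-cased key names in `FIELDS` and the helper `parse_word_line` are assumptions chosen for this illustration; the sample line is a word line from the Irish treebank.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

def parse_word_line(line):
    """Split one CoNLL-U word line into a dict keyed by the ten field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} tab-separated fields, got {len(values)}")
    token = dict(zip(FIELDS, values))
    # FEATS is a "|"-separated list of Name=Value pairs, or "_" when empty
    if token["feats"] == "_":
        token["feats"] = {}
    else:
        token["feats"] = dict(pair.split("=", 1) for pair in token["feats"].split("|"))
    return token

line = "2\tlár\tlár\tNOUN\tNoun\tCase=NomAcc|Gender=Masc|Number=Sing\t1\tnmod\t_\t_"
token = parse_word_line(line)
print(token["form"], token["upos"], token["feats"])
```

This ignores comment lines, multiword-token ranges and empty nodes, which a full parser has to handle.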



In [None]:
with open("ga_idt-ud-train.conllu", "r", encoding="utf-8") as f, open("ga_idt-ud-test.conllu", "r", encoding="utf-8") as f1, open("ga_idt-ud-dev.conllu", "r", encoding="utf-8") as f2:
    irish_sents = "\n".join([f.read(), f1.read(), f2.read()])

In [None]:
print(len(irish_sents))
print(irish_sents[:505])

7308418
# sent_id = 906
# text = As lár na tubaiste is ea stadfaidh an ghrian sula rachaidh sí a luí san áigéan thiar ó lonrú anuas; solas ní bheidh ar fáil ach oiread na hoíche nó mar éiclips lán.
1	As	as	ADP	Simp	_	0	root	_	_
2	lár	lár	NOUN	Noun	Case=NomAcc|Gender=Masc|Number=Sing	1	nmod	_	_
3	na	na	DET	Art	Case=Gen|Definite=Def|Gender=Fem|Number=Sing|PronType=Art	4	det	_	_
4	tubaiste	tubaiste	NOUN	Noun	Case=Gen|Definite=Def|Gender=Fem|Number=Sing	2	nmod	_	_
5	is	is	AUX	Cop	Tense=Pres|VerbForm=Cop	1	cop	_	


There is a Python library, `conllu`, for parsing CoNLL-U format. The output is a list of sentences, and each token in a sentence is represented by an `OrderedDict`:

```
OrderedDict([('id', 1),
             ('form', 'Перспективы'),
             ('lemma', 'перспектива'),
             ('upostag', 'NOUN'),
             ('xpostag', None),
             ('feats',
              OrderedDict([('Animacy', 'Inan'),
                           ('Case', 'Nom'),
                           ('Gender', 'Fem'),
                           ('Number', 'Plur')])),
             ('head', 0),
             ('deprel', 'ROOT'),
             ('deps', [('root', 0)]),
             ('misc', None)])
```

In [None]:
sentences = conllu.parse(irish_sents)

# first sentence
s = sentences[0]
# first token
print(s[0])
# first token's pos tag
print(s[0]['upostag'])


As
ADP


Now, let's build a dictionary of POS-tag counts per word.

In [None]:
word_tag = dict()

for sent in sentences:
    for token in sent:
        word = token['form']
        pos = token['upostag']
        # Count how often each word occurs with each tag,
        # keeping the counts of all tags seen so far for this word
        tag_counts = word_tag.setdefault(word, {})
        tag_counts[pos] = tag_counts.get(pos, 0) + 1

We are ready to tag new sentences. Keep in mind that we might encounter new words that aren't in our dictionary and therefore can't be tagged. For now, let's handle this by returning `False` or a special token such as `'UNK'` for unknown words.

In [None]:
sentence = "An tOireachtas is ainm don pharlaimint náisiúnta, agus sin é a bheirtear uirthi de ghnáth sa bhunreacht seo."
sentence_tokenized = nltk.word_tokenize(sentence)

# Tag the sentence
sentence_tags = list()
for token in sentence_tokenized:
    if word_tag.get(token, False):
        sentence_tags.append(max(word_tag[token].keys(), key=(lambda k: word_tag[token][k])))
    else:
        sentence_tags.append(False)

# Pair each token with its predicted tag
list(zip(sentence_tokenized, sentence_tags))

[('An', 'DET'),
 ('tOireachtas', 'NOUN'),
 ('is', 'CCONJ'),
 ('ainm', 'NOUN'),
 ('don', 'ADP'),
 ('pharlaimint', False),
 ('náisiúnta', 'ADJ'),
 (',', 'PUNCT'),
 ('agus', 'CCONJ'),
 ('sin', 'DET'),
 ('é', 'PRON'),
 ('a', 'PART'),
 ('bheirtear', 'VERB'),
 ('uirthi', 'ADP'),
 ('de', 'ADP'),
 ('ghnáth', 'NOUN'),
 ('sa', 'ADP'),
 ('bhunreacht', False)]
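
The counting and lookup steps of the frequency tagger can also be packaged as two small functions. The following is a sketch using `collections.Counter`, with an `'UNK'` fallback for unseen words. The function names and the toy training data are assumptions made for illustration and do not come from the treebank.

```python
from collections import Counter, defaultdict

def train_frequency_tagger(tagged_sentences):
    """Count how often each word form occurs with each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts

def tag_tokens(tokens, counts, unknown="UNK"):
    """Assign each token its most frequent tag, or `unknown` if it was never seen."""
    return [
        (tok, counts[tok].most_common(1)[0][0] if tok in counts else unknown)
        for tok in tokens
    ]

# Toy training data (invented for illustration)
toy = [[("an", "DET"), ("is", "AUX")], [("is", "AUX")], [("is", "CCONJ")]]
counts = train_frequency_tagger(toy)
print(tag_tokens(["an", "is", "bhunreacht"], counts))
```

Because each word keeps a full `Counter`, a second tag for the same word adds to the existing counts instead of replacing them.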

## Task 4
### Improve the tagger
In the previous section, we created a frequency-based POS tagger for Irish. As you may have noticed, some words, such as `bhunreacht` and `pharlaimint`, could not be tagged. What are possible solutions to this problem?

Here are some ideas:

1. Normalise your data before tagging. This may include:
    * orthography standardisation (relevant for many minority languages that don't have a single spelling standard)
    * removing initial mutations (for Celtic languages)
    * lemmatisation (if your goal is only POS-tagging, not full morphological analysis)

2. Use aligned parallel data and the induction technique (lecture slides 23-25). The goal in this case will be to use an existing POS tagger for English to annotate the English side of a parallel corpus, then project the POS-tags to the second language (Irish). Where can we get parallel corpora?
    * https://data.europa.eu/data/datasets/
    * https://www.clarin.eu/resource-families/parallel-corpora
    * Datasets available on Kaggle

3. Use a more sophisticated architecture to train a model that will be able to assign POS-tags to unknown words. POS-tagging is essentially a classification problem, so you can experiment with: 
    * feature engineering 
    * different classifiers using `sklearn` and `keras`
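
As an illustration of the first idea, the sketch below strips lenition (the `h` inserted after an initial consonant) and retries the lookup. It is a deliberately naive heuristic: it ignores eclipsis and prefixed `t-`, `h-` and `n-`, and it will also alter words whose initial `ch` or `th` is not a mutation. The function names and the `known` vocabulary are assumptions for this example.

```python
LENITABLE = set("bcdfgmpst")

def remove_lenition(word):
    """Drop the 'h' that lenition inserts after an initial consonant (naive heuristic)."""
    if len(word) > 2 and word[0].lower() in LENITABLE and word[1] == "h":
        return word[0] + word[2:]
    return word

def lookup_with_backoff(word, known_words):
    """Try the surface form first, then the de-lenited form; None if neither is known."""
    for candidate in (word, remove_lenition(word)):
        if candidate in known_words:
            return candidate
    return None

known = {"parlaimint", "bunreacht"}  # assumed vocabulary, for illustration only
print(lookup_with_backoff("pharlaimint", known))
print(lookup_with_backoff("bhunreacht", known))
```

In a tagger, the returned form would be used as the key into the tag-count dictionary, so that a mutated word inherits the tag of its unmutated form.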





### Building a Multilayer Perceptron POS-tagger

In [None]:
def reformat_data(raw_conllu_data):
    """
    Returns a list of sentences, where
    each sentence is a list of (word, tag) tuples
    """
    parsed = conllu.parse(raw_conllu_data)
    ref_data = []
    for sent in parsed:
        ref_sent = []
        for token in sent:
            ref_sent.append((token['form'], token['upostag']))
        ref_data.append(ref_sent)
    return ref_data


def add_basic_features(sentence_terms, index):
    """ 
    Compute some very basic word features.        
    :param sentence_terms: [w1, w2, ...] 
    :param index: the index of the word 
    :return: dict containing features
    """
    term = sentence_terms[index]
    feature_dict = {
        'nb_terms': len(sentence_terms),
        'term': term,
        'is_first': index == 0,
        'is_last': index == len(sentence_terms) - 1,
        'is_capitalized': term[0].upper() == term[0],
        'is_all_caps': term.upper() == term,
        'is_all_lower': term.lower() == term,
        'prefix-1': term[0],
        'prefix-2': term[:2],
        'prefix-3': term[:3],
        'suffix-1': term[-1],
        'suffix-2': term[-2:],
        'suffix-3': term[-3:],
        'prev_word': '' if index == 0 else sentence_terms[index - 1],
        'next_word': '' if index == len(sentence_terms) - 1 else sentence_terms[index + 1]
        }
    return feature_dict


def untag(tagged_sentence):
    """ 
    Remove the tag for each tagged term.
    :param tagged_sentence: a POS tagged sentence
    :return: a list of words (with the tags removed)
    """
    return [w for w, _ in tagged_sentence]
    
def transform_to_dataset(tagged_sentences):
    """
    Split tagged sentences into X and y datasets and append some basic features.
    :param tagged_sentences: a list of POS tagged sentences
    :type tagged_sentences: list of list of tuples (term_i, tag_i)
    :return: X (list of feature dicts) and y (list of tags)
    """
    X, y = [], []
    for pos_tags in tagged_sentences:
        for index, (term, class_) in enumerate(pos_tags):
            # Add basic NLP features for each sentence term
            X.append(add_basic_features(untag(pos_tags), index))
            y.append(class_)
    return X, y

In [None]:
# Load data
with open("ga_idt-ud-train.conllu", "r", encoding="utf-8") as f, open("ga_idt-ud-test.conllu", "r", encoding="utf-8") as f1, open("ga_idt-ud-dev.conllu", "r", encoding="utf-8") as f2:
    raw_train = f.read()
    raw_test = f1.read()
    raw_val = f2.read()

# Extract words and their POS-tags from CoNLL-U format
training_sentences = reformat_data(raw_train)
testing_sentences = reformat_data(raw_test)
validation_sentences = reformat_data(raw_val)

In [None]:
# Check the number of classes (tags)
all_sentences = training_sentences + testing_sentences + validation_sentences

tags = set([tag for sentence in all_sentences for _, tag in sentence])
print(len(tags))
tags

17


{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB',
 'X'}

In [None]:
# Transform lists of sentences to datasets
X_train, y_train = transform_to_dataset(training_sentences)
X_test, y_test = transform_to_dataset(testing_sentences)
X_val, y_val = transform_to_dataset(validation_sentences)

#### Feature encoding

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

# Fit our DictVectorizer with our set of features
dict_vectorizer = DictVectorizer()
dict_vectorizer.fit(X_train + X_test + X_val)

# Convert dict features to vectors
X_train = dict_vectorizer.transform(X_train)
X_test = dict_vectorizer.transform(X_test)
X_val = dict_vectorizer.transform(X_val)

# Fit LabelEncoder with our list of classes
label_encoder = LabelEncoder()
label_encoder.fit(y_train + y_test + y_val)

# Encode class values as integers
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
y_val = label_encoder.transform(y_val)

# Convert integers to dummy variables (one hot encoded)
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
y_val = np_utils.to_categorical(y_val)
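
To see what the two label-encoding steps produce, here is a plain-Python sketch of the same idea: map each tag to an integer, then turn each integer into a one-hot row. The helper names `encode_labels` and `one_hot` are assumptions for this illustration and are not the scikit-learn or Keras implementations.

```python
def encode_labels(labels):
    """Map each distinct label to an integer index, with classes in sorted order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]

def one_hot(indices, n_classes):
    """Turn integer class indices into one-hot rows."""
    return [[1 if i == idx else 0 for i in range(n_classes)] for idx in indices]

classes, ids = encode_labels(["NOUN", "DET", "NOUN", "VERB"])
print(classes)
print(ids)
print(one_hot(ids, len(classes)))
```

With the 17 tags of the treebank, each row of `y_train` is therefore a vector of length 17 with a single 1 at the position of the gold tag.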

#### Building the model

* A multilayer perceptron is a linear stack of layers, which can easily be built with the `Sequential` model. It will contain an input layer, two hidden layers, and an output layer.

* To reduce overfitting, we'll use dropout regularization. We'll set the dropout rate to 20%, meaning that a randomly selected 20% of the neurons will be ignored at each update during training.

* We'll use Rectified Linear Unit (ReLU) activation for the hidden layers, as it is one of the simplest non-linear activation functions.

* For multi-class classification, we may want to convert the outputs to probabilities, which can be done with the softmax function.

* Finally, we'll use the categorical cross-entropy loss function and the Adam optimizer, as both work well for classification tasks.
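
The activation and loss functions named above are small formulas. The sketch below implements ReLU, softmax and categorical cross-entropy in plain Python for a single example, to show what the output layer and the loss compute. The function names and the input scores are assumptions chosen for illustration; Keras uses its own vectorised implementations.

```python
import math

def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t)

probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1, 0, 0], probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

The loss is small when the model puts most of the probability mass on the gold tag and grows as that probability approaches zero.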

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

def build_model(input_dim, hidden_neurons, output_dim):
    """
    Construct, compile and return a Keras model which will be used to fit/predict
    """
    model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim),
        Activation('relu'),
        Dropout(0.2),
        Dense(hidden_neurons),
        Activation('relu'),
        Dropout(0.2),
        Dense(output_dim, activation='softmax')
        ])
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

#### Creating a wrapper between Keras API and Scikit-Learn

Keras provides a wrapper called `KerasClassifier` which implements the Scikit-Learn classifier interface.

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier

model_params = {
    'build_fn': build_model,
    'input_dim': X_train.shape[1],
    'hidden_neurons': 512,
    'output_dim': y_train.shape[1],
    'epochs': 3,
    'batch_size': 256,
    'verbose': 1,
    'validation_data': (X_val, y_val),
    'shuffle': True}

clf = KerasClassifier(**model_params)

#### Training & evaluation

In [None]:
hist = clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)

score

Epoch 1/3
Epoch 2/3
Epoch 3/3


0.9154219031333923

In [None]:
def transform_for_tagging(sentences):
    """
    Compute basic features for every token of every sentence.
    :param sentences: a list of tokenised sentences
    :type sentences: list of list of str
    :return: a list of feature dicts, one per token
    """
    X = []
    for sent in sentences:
        for index, word in enumerate(sent):
            # Add basic NLP features for each sentence term
            X.append(add_basic_features(sent, index))
    return X

In [None]:
sentence = "An tOireachtas is ainm don pharlaimint náisiúnta, agus sin é a bheirtear uirthi de ghnáth sa bhunreacht seo."
sentence_tokenized = nltk.word_tokenize(sentence)
test = transform_for_tagging([sentence_tokenized])
test = dict_vectorizer.transform(test)

preds = label_encoder.inverse_transform(clf.predict(test))

[(w, t) for w, t in zip(sentence_tokenized, preds)]





[('An', 'DET'),
 ('tOireachtas', 'NOUN'),
 ('is', 'AUX'),
 ('ainm', 'NOUN'),
 ('don', 'ADP'),
 ('pharlaimint', 'NOUN'),
 ('náisiúnta', 'ADJ'),
 (',', 'PUNCT'),
 ('agus', 'CCONJ'),
 ('sin', 'PRON'),
 ('é', 'PRON'),
 ('a', 'PART'),
 ('bheirtear', 'VERB'),
 ('uirthi', 'ADP'),
 ('de', 'ADP'),
 ('ghnáth', 'NOUN'),
 ('sa', 'ADP'),
 ('bhunreacht', 'NOUN'),
 ('seo', 'DET'),
 ('.', 'PUNCT')]

In [None]:
clf.model.save('mlp_tagger.h5')