# Assignment 2: Dependency parsing

In this assignment you will implement and evaluate a **dependency parser** and summarise your experimental findings in a short report. The main goal of the assignment is to give you insights into how to implement a non-trivial neural architecture for structured prediction.

**Submission:** Please email your submission to Marco ([marco.kuhlmann@liu.se](mailto:marco.kuhlmann@liu.se)).

**Due date:** 2 June (extended from 25 May)

## The data set

The data set for the lab is the English Web Treebank from the [Universal Dependencies Project](http://universaldependencies.org), a corpus containing more than 16,000 sentences (254,000&nbsp;tokens) annotated with dependency trees (among other things). The Universal Dependencies Project distributes its data in the [CoNLL-U format](https://universaldependencies.org/format.html). The code in the next cell can be used to read data in this format from a file-like object:

In [None]:
ROOT = ('<root>', '<root>', 0)  # Pseudoword; see comment below

def read_data(source):
    buffer = [ROOT]
    for line in source:
        if not line.startswith('#'):  # Skip lines with comments
            line = line.rstrip()
            if not line:
                yield buffer
                buffer = [ROOT]
            else:
                columns = line.split('\t')
                if columns[0].isdigit():  # Skip range tokens
                    buffer.append((columns[1], columns[3], int(columns[6])))

For this assignment, we will use the training section and the development section of the [English Web Treebank](https://github.com/UniversalDependencies/UD_English-EWT):

In [None]:
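# CoNLL-U files are UTF-8 encoded; state this explicitly so that the
# result does not depend on the platform's default encoding.
with open('en_ewt-ud-train-projectivized.conllu', encoding='utf-8') as source:
    train_data = list(read_data(source))

with open('en_ewt-ud-dev.conllu', encoding='utf-8') as source:
    dev_data = list(read_data(source))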

Both data sets consist of syntactically analyzed sentences. In this assignment, an analyzed sentence is represented as a list of triples, where the first component of each triple represents a word form, the second component represents the word’s tag, and the third component is an integer specifying the position of the word’s syntactic head, i.e., its parent in the dependency tree. Run the following code cell to see an example:

In [None]:
dev_data[100]

In this example the head of the word *I* is the word at position&nbsp;2, *ran*; the dependents of *ran* are *I* (position&nbsp;1), *item* (position&nbsp;5), *Internet* (position&nbsp;8), as well as the final punctuation mark. Note that each sentence is preceded by the pseudoword `<root>` (position&nbsp;0); this turns out to be useful for technical reasons.
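
To try out the reader without downloading the treebank, the following sketch runs a copy of `read_data` on a small in-memory sample. The sample sentence and its annotations are made up for illustration and are not taken from the English Web Treebank; the first row is a range token, which the reader skips.

```python
import io

ROOT = ('<root>', '<root>', 0)  # Pseudoword at position 0

def read_data(source):
    # Copy of the reader shown above, repeated to keep this cell self-contained.
    buffer = [ROOT]
    for line in source:
        if not line.startswith('#'):  # Skip lines with comments
            line = line.rstrip()
            if not line:
                yield buffer
                buffer = [ROOT]
            else:
                columns = line.split('\t')
                if columns[0].isdigit():  # Skip range tokens
                    buffer.append((columns[1], columns[3], int(columns[6])))

# Hypothetical CoNLL-U fragment: ten tab-separated columns per token line.
rows = [
    ['1-2', "Don't", '_', '_', '_', '_', '_', '_', '_', '_'],  # range token
    ['1', 'Do', 'do', 'AUX', 'VBP', '_', '3', 'aux', '_', '_'],
    ['2', "n't", 'not', 'PART', 'RB', '_', '3', 'advmod', '_', '_'],
    ['3', 'go', 'go', 'VERB', 'VB', '_', '0', 'root', '_', '_'],
    ['4', '.', '.', 'PUNCT', '.', '_', '3', 'punct', '_', '_'],
]
# A sentence is terminated by an empty line, hence the trailing '\n\n'.
sample = "# text = Don't go.\n" + '\n'.join('\t'.join(r) for r in rows) + '\n\n'

sentences = list(read_data(io.StringIO(sample)))
print(sentences[0])
```

Each triple holds the word form (column 2), the tag (column 4), and the head position (column 7) of one token.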

## Your tasks

Your task is to implement a dependency parser, train this parser on the training data, and evaluate its performance on the development data. We provide skeleton code for the implementation of a simple transition-based dependency parser (&lsquo;baseline parser&rsquo;) in the vein of [Chen and Manning (2014)](https://www.aclweb.org/anthology/D14-1082/); however, you are free to implement any architecture that you would like to try. For example, you could try to implement the arc-hybrid system and dynamic oracle described in [Goldberg and Nivre (2013)](https://www.aclweb.org/anthology/Q13-1033/), or one of the architectures described in [Kiperwasser and Goldberg (2016)](https://www.aclweb.org/anthology/Q16-1023/).

### Deliverables

Submit a Jupyter notebook with your code as well as a brief report that describes your implementation and your experimental results. The notebook must be easy to run, requiring at most trivial modifications such as pointing it to the location of the data.

### Grading

To pass the assignment, you must submit a complete and working implementation of a dependency parser and a well-written report. Your parser must reach an unlabelled attachment score (UAS) on the development data of at least 70%; see the end of this notebook for details. If you choose to implement other methods, you should be able to obtain UAS scores that are significantly higher than 70%.

## Baseline parser

In this section we will walk you through the implementation of a simple transition-based parser in the style of [Chen and Manning (2014)](https://www.aclweb.org/anthology/D14-1082/). This &lsquo;baseline&rsquo; parser is based on the arc-standard algorithm that was presented in class.

We present the skeleton code for the baseline parser in five steps.

### Step 1: Encoding the data

It is convenient to represent the strings in the data (word forms and tags) as integer indexes. Your first task is to implement a function that constructs the vocabularies mapping strings to indexes, and a second function that uses these vocabularies to encode the data into integer form.

In [None]:
def make_vocabs(gold_sentences):
    raise NotImplementedError

def encode(vocab_words, vocab_tags, gold_sentences):
    raise NotImplementedError

Your implementation should conform to the following specification:

**make_vocabs** (*gold_sentences*)

> Returns a pair of two vocabularies, represented as dictionaries: a *word vocabulary* and a *tag vocabulary*. The word vocabulary maps the unique words in the *gold_sentences* to a contiguous range of integers between $0$ and $W+1$, where $W$ is the total number of unique words. Similarly, the tag vocabulary maps the unique part-of-speech tags in the *gold_sentences* to a range between $0$ and $T$, where $T$ is the total number of unique tags. The special words `<pad>` (used later for padding undefined values) and `<unk>` (used in place of unknown words at prediction time) are mapped to the indexes&nbsp;0 and&nbsp;1, respectively. The special tag `<pad>` is mapped to the index&nbsp;0.

**encode** (*vocab_words*, *vocab_tags*, *gold_sentences*)

> Returns an encoded version of the *gold_sentences* where each word form is replaced by its index in the word vocabulary (*vocab_words*) and where each tag is replaced by its index in the tag vocabulary (*vocab_tags*).

In the following, whenever we say &lsquo;words&rsquo; and &lsquo;tags&rsquo; we really mean the integer versions of words and tags.
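
As a rough illustration of the specification, here is a minimal sketch of one way to build the vocabularies and encode the data. The function names `build_vocabs` and `encode_sentences` and the toy sentences are hypothetical. Indexes are assigned in order of first occurrence, which is one possible choice; the specification only requires a contiguous range with the special symbols at the stated positions.

```python
PAD, UNK = '<pad>', '<unk>'

def build_vocabs(gold_sentences):
    # Special symbols get the fixed indexes required by the specification.
    vocab_words = {PAD: 0, UNK: 1}
    vocab_tags = {PAD: 0}
    for sentence in gold_sentences:
        for word, tag, _ in sentence:
            # setdefault only assigns a new index to strings not seen before.
            vocab_words.setdefault(word, len(vocab_words))
            vocab_tags.setdefault(tag, len(vocab_tags))
    return vocab_words, vocab_tags

def encode_sentences(vocab_words, vocab_tags, gold_sentences):
    unk = vocab_words[UNK]
    # Unknown words fall back to <unk>. Assumption: every tag was seen when
    # the vocabulary was built, so an unknown tag raises a KeyError.
    return [
        [(vocab_words.get(word, unk), vocab_tags[tag], head)
         for word, tag, head in sentence]
        for sentence in gold_sentences
    ]

# Hypothetical toy data in the triple format used in this notebook.
toy_train = [[('<root>', '<root>', 0), ('dogs', 'NOUN', 2), ('bark', 'VERB', 0)]]
toy_dev = [[('<root>', '<root>', 0), ('cats', 'NOUN', 2), ('bark', 'VERB', 0)]]

toy_words, toy_tags = build_vocabs(toy_train)
print(encode_sentences(toy_words, toy_tags, toy_dev))
```

The word *cats* does not occur in the toy training data, so it is encoded as index 1 (`<unk>`); head positions are passed through unchanged.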

### Step 2: Parser, static part

The baseline parser consists of two parts: a static part that implements the logic of the arc-standard transition system, and a non-static part that contains the learning component. In this section we cover the static part; the non-static part is covered in the next section.

In the arc-standard algorithm, the next move (transition) of the parser is predicted based on features extracted from the current parser configuration, with reference to the words and part-of-speech tags of the input sentence. On the Python side of things, we represent parser configurations as triples

$$
(i, \mathit{stack}, \mathit{heads})
$$

where $i$ is an integer specifying the position of the next word in the buffer, $\mathit{stack}$ is a list of integers specifying the positions of the words currently on the stack (with the topmost element last in the list), and $\mathit{heads}$ is a list of integers specifying the positions of the currently assigned head words. To illustrate this representation, the initial configuration for the sample sentence above is

$$
(0, [], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
$$

and a possible final configuration is

$$
(10, [0], [0, 2, 0, 5, 5, 2, 8, 8, 2, 2])
$$

The next cell contains skeleton code for the static part of the baseline parser:

In [None]:
class Parser(object):

    # Parser moves are specified as integers.

    MOVES = tuple(range(3))

    SH, LA, RA = MOVES

    @staticmethod
    def initial_config(num_words):
        raise NotImplementedError

    @staticmethod
    def valid_moves(config):
        raise NotImplementedError

    @staticmethod
    def next_config(config, move):
        raise NotImplementedError

    @staticmethod
    def is_final_config(config):
        raise NotImplementedError

Your task is to implement this interface according to the following specification:

**initial_config** (*num_words*)

> Returns the initial configuration for a sentence with the specified number of words.

**valid_moves** (*config*)

> Returns the list of valid moves for the specified configuration. Note that moves are represented as integers.

**next_config** (*config*, *move*)

> Applies the specified move (an integer) to the specified configuration and returns the new configuration.

**is_final_config** (*config*)

> Tests whether the specified configuration is a final configuration.
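
The following is a minimal sketch of the four operations as plain functions rather than static methods. It is one possible reading of the arc-standard system, not a reference solution. Three details are assumptions inferred from the test cell in this step: parsing starts with an empty stack and the pseudoword still in the buffer, unassigned heads are initialised to 0, and a left arc is only allowed when it would not attach the pseudoword to another word.

```python
SH, LA, RA = 0, 1, 2  # same integer codes as Parser.MOVES

def initial_config(num_words):
    # Assumption: empty stack, buffer starts at position 0, all heads 0.
    return (0, [], [0] * num_words)

def valid_moves(config):
    i, stack, heads = config
    moves = []
    if i < len(heads):   # buffer not yet empty
        moves.append(SH)
    if len(stack) >= 3:  # second-topmost stack element is not the pseudoword
        moves.append(LA)
    if len(stack) >= 2:
        moves.append(RA)
    return moves

def next_config(config, move):
    i, stack, heads = config
    stack, heads = list(stack), list(heads)  # do not mutate the old config
    if move == SH:
        stack.append(i)
        i += 1
    elif move == LA:
        # The second-topmost element becomes a dependent of the topmost one.
        dependent = stack.pop(-2)
        heads[dependent] = stack[-1]
    elif move == RA:
        # The topmost element becomes a dependent of the second-topmost one.
        dependent = stack.pop()
        heads[dependent] = stack[-1]
    return (i, stack, heads)

def is_final_config(config):
    i, stack, heads = config
    return i == len(heads) and len(stack) == 1

# Replay the move sequence from the test cell in this step.
config = initial_config(10)
for move in [0, 0, 0, 1, 0, 0, 0, 1, 1, 2, 0, 0, 0, 1, 1, 2, 0, 2, 2]:
    assert move in valid_moves(config)
    config = next_config(config, move)
print(config)  # (10, [0], [0, 2, 0, 5, 5, 2, 8, 8, 2, 2])
```

Copying the stack and the head list in `next_config` keeps earlier configurations intact, which matters later when configurations are stored as training samples.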

To test your implementation, you can run the code below. This code creates the initial configuration for the example sentence, simulates a sequence of moves, and then checks that the resulting configuration is the expected final configuration.

In [None]:
moves = [0, 0, 0, 1, 0, 0, 0, 1, 1, 2, 0, 0, 0, 1, 1, 2, 0, 2, 2]

parser = Parser()
config = parser.initial_config(len(dev_data[100]))
for move in moves:
    assert move in parser.valid_moves(config)
    config = parser.next_config(config, move)
assert parser.is_final_config(config)

assert config == (10, [0], [0, 2, 0, 5, 5, 2, 8, 8, 2, 2])

### Step 3: Parser, non-static part

The heart of the non-static part of the baseline parser is the *next move classifier*, which is implemented by a feedforward network. The input to this network is a vector of integers representing words and tags, as described in the article by [Chen and Manning (2014)](https://www.aclweb.org/anthology/D14-1082/). For example, a simple feature model would look at the next word in the buffer and the topmost two words on the stack. The network processes this input as follows:

1. embed the words and tags and concatenate the resulting embeddings
2. send the concatenated embeddings through a linear layer followed by a ReLU
3. pass the output of the non-linearity into a final softmax layer
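
The three steps above can be sketched as a small PyTorch module. This is a sketch under stated assumptions rather than the required design: the class name `NextMoveClassifier` is hypothetical, the feature vector is assumed to hold three word indexes followed by three tag indexes, and the default sizes are the ones suggested in the Hyperparameters section of this step. The module returns unnormalised scores (logits); the softmax is left to the loss function, since `torch.nn.functional.cross_entropy` applies a log-softmax internally.

```python
import torch
import torch.nn as nn

class NextMoveClassifier(nn.Module):  # hypothetical name

    def __init__(self, num_words, num_tags, num_moves=3,
                 word_dim=50, tag_dim=10, hidden_dim=180,
                 num_word_feats=3, num_tag_feats=3):
        super().__init__()
        self.num_word_feats = num_word_feats
        self.word_emb = nn.Embedding(num_words, word_dim)
        self.tag_emb = nn.Embedding(num_tags, tag_dim)
        input_dim = num_word_feats * word_dim + num_tag_feats * tag_dim
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, num_moves)

    def forward(self, features):
        # features: integer tensor of shape [batch, num_word_feats + num_tag_feats]
        words = features[:, :self.num_word_feats]
        tags = features[:, self.num_word_feats:]
        # 1. Embed words and tags, then concatenate the embeddings.
        word_vecs = self.word_emb(words).flatten(start_dim=1)
        tag_vecs = self.tag_emb(tags).flatten(start_dim=1)
        embedded = torch.cat([word_vecs, tag_vecs], dim=1)
        # 2. Linear layer followed by a ReLU.
        hidden = torch.relu(self.hidden(embedded))
        # 3. Final layer; returns logits of shape [batch, num_moves].
        return self.output(hidden)

# Toy usage with made-up vocabulary sizes and an all-padding batch.
model = NextMoveClassifier(num_words=100, num_tags=20)
batch = torch.zeros(4, 6, dtype=torch.long)
logits = model(batch)
print(logits.shape)  # torch.Size([4, 3])
```

At prediction time, the scores of moves that are not valid in the current configuration can be ignored before taking the highest-scoring move.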

The next cell contains skeleton code for the non-static part of the parser:

In [None]:
class BaselineParser(Parser):

    def __init__(self, vocab_words, vocab_tags):
        raise NotImplementedError

    def featurize(self, words, tags, config):
        raise NotImplementedError

    def predict(self, words, tags):
        raise NotImplementedError

Your implementation should comply with the following specification:

**__init__** (*self*, *vocab_words*, *vocab_tags*)

> Creates a new parser, including the neural network that implements the next move classifier. The arguments *vocab_words* and *vocab_tags* are the dictionaries that you created in Step&nbsp;1.

**featurize** (*self*, *words*, *tags*, *config*)

> Returns the input vector to the next move classifier for the specified parser configuration. If you implement the simple feature model described above, this will be a vector of length&nbsp;6. The *words* and *tags* are the words and part-of-speech tags for the current input sentence, and *config* is a parser configuration as in Step&nbsp;2.

**predict** (*self*, *words*, *tags*)

> Predicts the list of all heads for the specified input sentence. The input sentence is specified in terms of the list of its *words* and the list of its *tags*. This method runs the arc-standard parsing algorithm, at each step asking the next move classifier for the next transition to take.
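
For the simple feature model, feature extraction might look like the following sketch. The function name `featurize_simple` is hypothetical, as is the ordering of the features (next buffer word, topmost stack word, second-topmost stack word, then the corresponding tags). Positions that do not exist in the current configuration are filled with the padding index 0 from Step&nbsp;1. The sketch returns a plain list; an actual implementation would probably wrap the result in a tensor.

```python
PAD = 0  # index of <pad> in both vocabularies (see Step 1)

def featurize_simple(words, tags, config):
    i, stack, heads = config
    # Positions of interest; None marks an undefined position.
    b0 = i if i < len(words) else None            # next word in the buffer
    s0 = stack[-1] if len(stack) >= 1 else None   # topmost word on the stack
    s1 = stack[-2] if len(stack) >= 2 else None   # second-topmost word on the stack
    positions = [b0, s0, s1]
    word_feats = [words[p] if p is not None else PAD for p in positions]
    tag_feats = [tags[p] if p is not None else PAD for p in positions]
    return word_feats + tag_feats

# Made-up encoded sentence with four positions (including the pseudoword).
toy_words = [2, 3, 4, 5]
toy_tags = [1, 2, 3, 4]

print(featurize_simple(toy_words, toy_tags, (0, [], [0, 0, 0, 0])))
print(featurize_simple(toy_words, toy_tags, (4, [0, 2], [0, 2, 0, 2])))
```

In the first call the stack is empty, so both stack features are padding; in the second call the buffer is exhausted, so the buffer feature is padding.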

#### Hyperparameters

The following choices are reasonable defaults for the hyperparameters of the network architecture used by the parser:

* width of the word embedding: 50
* width of the tag embedding: 10
* size of the hidden layer: 180

### Step 4: Creating the training data for the next move classifier

To train the next move classifier, we need training samples of the form $(\mathbf{x}, m)$, where $\mathbf{x}$ is a feature vector extracted from a given parser configuration&nbsp;$c$, and $m$ is the corresponding gold-standard move. To construct this dataset, we need an **oracle**. As you have learned in class, there are different ways to implement the oracle; here we ask you to implement the static oracle, which generates training data using teacher forcing.

The following cell contains skeleton code for a function that implements the oracle, and for a class `NextMoveDataset` that holds the training samples of the next move classifier.

In [None]:
from torch.utils.data import Dataset

def oracle_moves(gold_heads):
    raise NotImplementedError

class NextMoveDataset(Dataset):
    
    def __init__(self, gold_sentences, parser):
        self.xs = []
        self.ys = []
        # TODO: Insert code here

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, idx):
        return self.xs[idx], self.ys[idx]

Your implementation should conform to the following specification:

**oracle_moves** (*gold_heads*)

> Translates the gold-standard head assignment for a single sentence (`gold_heads`) into the corresponding stream of oracle moves. More specifically, this yields pairs $(c, m)$ where $m$ is a move and $c$ is the configuration in which $m$ was taken.

**NextMoveDataset** (*gold_sentences*, *parser*)

> Constructs a PyTorch dataset for the next move classifier. The dataset is constructed by calling *oracle_moves* on each gold-standard sentence in *gold_sentences*, and applying the `featurize` function to the resulting configurations to obtain feature vectors.
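
One way to realise a static oracle for the arc-standard system is sketched below. It is a sketch under assumptions, not the reference solution: the names `apply_move` and `static_oracle` are hypothetical, the initial configuration (empty stack, all heads 0) is inferred from the test cell in Step&nbsp;2, and the gold tree is assumed to be projective. The rule is to prefer a left arc when the gold tree contains it, to take a right arc only once the dependent has collected all of its own dependents, and to shift otherwise.

```python
SH, LA, RA = 0, 1, 2

def apply_move(config, move):
    # Arc-standard transitions on (i, stack, heads) triples.
    i, stack, heads = config
    stack, heads = list(stack), list(heads)
    if move == SH:
        stack.append(i)
        i += 1
    elif move == LA:
        dependent = stack.pop(-2)
        heads[dependent] = stack[-1]
    else:  # RA
        dependent = stack.pop()
        heads[dependent] = stack[-1]
    return (i, stack, heads)

def static_oracle(gold_heads):
    n = len(gold_heads)
    # Number of gold dependents each word still has to collect.
    remaining = [0] * n
    for dependent in range(1, n):  # position 0 is the pseudoword
        remaining[gold_heads[dependent]] += 1
    config = (0, [], [0] * n)
    while not (config[0] == n and len(config[1]) == 1):
        i, stack, _ = config
        move = None
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            if s1 != 0 and gold_heads[s1] == s0:
                move = LA
                remaining[s0] -= 1
            elif gold_heads[s0] == s1 and remaining[s0] == 0:
                move = RA
                remaining[s1] -= 1
        if move is None:
            if i >= n:
                raise ValueError('no oracle move; is the tree projective?')
            move = SH
        yield config, move
        config = apply_move(config, move)

# Gold heads of the example sentence, as given in the final configuration in Step 2.
example_heads = [0, 2, 0, 5, 5, 2, 8, 8, 2, 2]
example_moves = [move for _, move in static_oracle(example_heads)]
print(example_moves)
```

A dataset can then be filled by featurizing each configuration in the stream and storing the feature vector together with its move.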

You can test your oracle by executing the cell below. This extracts the oracle move sequence from the example sentence and compares it to the gold-standard move sequence `gold_moves`.

In [None]:
gold_heads = [h for w, t, h in dev_data[100]]
gold_moves = [0, 0, 0, 1, 0, 0, 0, 1, 1, 2, 0, 0, 0, 1, 1, 2, 0, 2, 2]

assert list(m for _, m in oracle_moves(gold_heads)) == gold_moves

### Step 5: Training loop

The last piece of the implementation of the baseline parser is the training loop. This should hold no surprises.

In [None]:
def train(train_data, n_epochs=1, batch_size=100):
    raise NotImplementedError
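
As a starting point, here is a generic sketch of a mini-batch training loop for a classifier in PyTorch. It does not implement the `train` function above, which additionally has to build the vocabularies, the parser, and the dataset. The name `train_classifier`, the choice of the Adam optimizer, and the learning rate are assumptions; the toy model and the random data only serve to make the cell runnable.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train_classifier(model, dataset, n_epochs=1, batch_size=100, lr=1e-2):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    epoch_losses = []
    for epoch in range(n_epochs):
        model.train()
        total_loss = 0.0
        for x, y in loader:
            optimizer.zero_grad()
            # cross_entropy expects logits and applies log-softmax itself.
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        epoch_losses.append(total_loss / len(loader))
    return epoch_losses

# Toy usage: a linear model on random data with three classes.
torch.manual_seed(0)
toy_xs = torch.randn(20, 4)
toy_ys = torch.randint(0, 3, (20,))
toy_model = torch.nn.Linear(4, 3)
initial_weight = toy_model.weight.detach().clone()
toy_losses = train_classifier(toy_model, TensorDataset(toy_xs, toy_ys),
                              n_epochs=2, batch_size=5)
print(toy_losses)
```

Returning the mean loss per epoch makes it easy to check during development that the loss is decreasing.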

## Evaluation

For evaluation, we use **unlabelled attachment score (UAS)**, which is defined as the percentage of all tokens to which the parser assigns the correct head (as per the gold standard). Note that the calculation excludes the pseudoword at position&nbsp;0 in each sentence, and that the function below returns the score as a fraction between 0 and 1 rather than as a percentage.

In [None]:
def uas(parser, gold_sentences):
    total = 0
    correct = 0
    for sentence in gold_sentences:
        words, tags, gold_heads = zip(*sentence)
        pred_heads = parser.predict(words, tags)
        for gold, pred in zip(gold_heads[1:], pred_heads[1:]):
            total += 1
            correct += int(gold == pred)
    return correct / total
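
To make the definition concrete, the following small sketch computes the score for a single sentence directly from two head lists. The helper name `uas_from_heads` and the head lists are made up for illustration.

```python
def uas_from_heads(gold_heads, pred_heads):
    # Skip position 0 (the pseudoword) in both lists.
    pairs = list(zip(gold_heads[1:], pred_heads[1:]))
    return sum(gold == pred for gold, pred in pairs) / len(pairs)

toy_gold = [0, 2, 0, 2]
toy_pred = [0, 2, 0, 3]  # the head of the last word is wrong
toy_score = uas_from_heads(toy_gold, toy_pred)
print('{:.4f}'.format(toy_score))  # 0.6667
```

Two of the three real words receive the correct head, so the score is 2/3, or roughly 67%.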

Run the following cell to test your parser. Note that, during development, you may want to restrict the number of epochs.

In [None]:
parser = train(train_data, n_epochs=3)
print('{:.4f}'.format(uas(parser, encode(parser.vocab_words, parser.vocab_tags, dev_data))))

When training for 3&nbsp;epochs with the default parameters, you should achieve a UAS of around 71%.

In case you want to play around with the baseline parser, here are some extensions that you can experiment with to increase performance:

* Increase the dimensions of the embeddings.
* Experiment with different random distributions when initialising embeddings.
* Use pre-trained word embeddings instead of random initialisation.
* Add dropout or other forms of regularisation.
* Invest time into preprocessing, such as normalisation of numbers and URLs.

That&rsquo;s all, folks!