<div class="alert alert-danger">
**Due date:** 2018-02-09
</div>

# L3: Part-of-speech tagging

## Introduction

Part-of-speech (POS) tagging is the task of labelling words (tokens) with parts of speech such as noun, adjective, and verb. In this lab you will implement a POS tagger based on the averaged perceptron and evaluate it on the English treebank from the [Universal Dependencies Project](http://universaldependencies.org), a corpus containing more than 16,000 sentences (254,000&nbsp;tokens) annotated with, among others, parts of speech. The data is divided into two files:

<table align="left">
<tr><td><code>train.txt</code></td><td style="text-align: right">12,543 sentences</td><td style="text-align: right">204,585 tokens</td></tr>
<tr><td><code>dev.txt</code></td><td style="text-align: right">2,002 sentences</td><td style="text-align: right">25,148 tokens</td></tr>
</table>

Start by importing the Python modules that are required for this lab:

In [1]:
import nlp3
import numpy as np

The next cell loads the data:

In [2]:
training_data = nlp3.read_data("/home/TDDE09/labs/l3/data/train.txt")
dev_data = nlp3.read_data("/home/TDDE09/labs/l3/data/dev.txt")

Both data sets consist of tagged sentences. In Python, a tagged sentence is represented as a list of string pairs, where the first component of each pair represents a word and the second component represents a part-of-speech tag. Run the following code cell to see an example:

In [3]:
training_data[42]

[('There', 'PRON'),
 ('has', 'AUX'),
 ('been', 'VERB'),
 ('talk', 'NOUN'),
 ('that', 'SCONJ'),
 ('the', 'DET'),
 ('night', 'NOUN'),
 ('curfew', 'NOUN'),
 ('might', 'AUX'),
 ('be', 'AUX'),
 ('implemented', 'VERB'),
 ('again', 'ADV'),
 ('.', 'PUNCT')]
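
Because a tagged sentence is a list of pairs, it is often convenient to split it into a list of words and a parallel list of tags. The following is a minimal sketch on a hand-copied fragment of the sentence above; the variable names are illustrative and not part of `nlp3`.

```python
# A hand-copied fragment in the same format as training_data[42].
sentence = [('There', 'PRON'), ('has', 'AUX'), ('been', 'VERB'), ('talk', 'NOUN')]

# Split the list of (word, tag) pairs into two parallel lists.
words = [word for word, _ in sentence]
tags = [tag for _, tag in sentence]

print(words)
print(tags)
```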

The tags are explained and exemplified in the [Annotation Guidelines](http://universaldependencies.org/u/pos/all.html) of the Universal Dependencies Project.

Run the next code cell to train the default tagger, tag the sample sentence from above, and evaluate the tagger on the development data.

In [4]:
tagger = nlp3.Tagger()
tagger.train(training_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(tagger, dev_data)))

Progress: 99.99%
Accuracy: 87.63%
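
The implementation of `nlp3.accuracy()` is not shown in this lab. A reasonable assumption is that it computes token-level accuracy, that is, the fraction of tokens whose predicted tag equals the gold-standard tag. The sketch below illustrates that idea; `token_accuracy` and `AlwaysNounTagger` are hypothetical names used only for illustration.

```python
def token_accuracy(tagger, data):
    """Fraction of tokens whose predicted tag equals the gold tag.

    Assumes that `tagger` has a method `tag(words)` and that `data`
    is a list of tagged sentences (lists of (word, tag) pairs).
    """
    correct = total = 0
    for sentence in data:
        words = [word for word, _ in sentence]
        gold_tags = [tag for _, tag in sentence]
        pred_tags = tagger.tag(words)
        correct += sum(p == g for p, g in zip(pred_tags, gold_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0


class AlwaysNounTagger:
    """Hypothetical stand-in tagger, used only to exercise the function."""

    def tag(self, words):
        return ['NOUN'] * len(words)


toy_data = [[('Kim', 'PROPN'), ('reads', 'VERB'), ('books', 'NOUN')]]
print("Accuracy: {:.2%}".format(token_accuracy(AlwaysNounTagger(), toy_data)))
```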


## Implement the tagger

Your main task in this lab is to re-implement the default tagger.

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Implement a part-of-speech tagger based on the averaged perceptron, train it on the training data, and evaluate performance on the development data. Your tagger should get the same results as the default tagger.
</div>
</div>

Starter code for this problem is given in the following code cell. The provided class simply inherits from `nlp3.Tagger` and calls the methods in the superclass. Your task is to replace these calls with your own code. The intended interface of the methods is documented in the docstrings.

<div class="alert alert-danger">
You will not need to touch the method `features()`, unless you want to do the advanced part of this lab (see below).
</div>

You are allowed to use the provided `nlp3.Perceptron` class for the implementation of the multi-class perceptron. This class has the same interface as the class that you implemented in lab&nbsp;L1, except for one additional method `finalize()`. This method implements the last step of the training method, the averaging of the classifier&rsquo;s weight vector. If you feel adventurous, then you may want to try using your own implementation instead of the provided one.

In [5]:
for m in dir(nlp3.Perceptron):
    if not m.startswith('_'):
        print(m)

finalize
predict
update
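
If you want to try your own implementation, the following sketch shows one way an averaged multi-class perceptron with the three methods listed above could look. It is not the code of `nlp3.Perceptron`; the class name `SketchPerceptron`, its attributes, and the tie-breaking rule are assumptions made for illustration. The averaging uses a common bookkeeping trick: each update is also recorded weighted by the step at which it happened, so that `finalize()` can compute the average of the weight vector over all training steps (including the initial zero vector) in a single pass.

```python
from collections import defaultdict


class SketchPerceptron:
    """Hypothetical averaged multi-class perceptron (not nlp3.Perceptron)."""

    def __init__(self):
        self.classes = set()
        self.weights = defaultdict(float)  # (class, feature) -> weight
        self.acc = defaultdict(float)      # (class, feature) -> step-weighted updates
        self.count = 1                     # 1-based index of the current step

    def predict(self, x):
        """Returns the highest-scoring class for the feature list x."""
        best_class, best_score = None, float('-inf')
        for c in sorted(self.classes):  # sorted for deterministic tie-breaking
            score = sum(self.weights.get((c, f), 0.0) for f in x)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    def update(self, x, y):
        """Runs one training step; returns the class predicted before the update."""
        self.classes.add(y)
        p = self.predict(x)
        if p != y:
            for f in x:
                # Reward the gold class, penalise the wrongly predicted class.
                self.weights[(y, f)] += 1.0
                self.acc[(y, f)] += self.count
                self.weights[(p, f)] -= 1.0
                self.acc[(p, f)] -= self.count
        self.count += 1
        return p

    def finalize(self):
        """Replaces each weight by its average over all training steps."""
        for key in self.weights:
            self.weights[key] -= self.acc[key] / self.count


toy = [(['word:the'], 'DET'), (['word:dog'], 'NOUN'), (['word:runs'], 'VERB')]
perceptron = SketchPerceptron()
for _ in range(2):
    for x, y in toy:
        perceptron.update(x, y)
perceptron.finalize()
print(perceptron.predict(['word:dog']), perceptron.predict(['word:runs']))
```

Averaging tends to make the classifier less sensitive to the last few training examples, which is why `finalize()` is called once, after all updates.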


In [6]:
class Tagger():
    """A part-of-speech tagger based on a multi-class perceptron
    classifier.

    This tagger implements a simple, left-to-right tagging algorithm
    where the prediction of the tag for the next word in the sentence
    can be based on the surrounding words and the previously
    predicted tags. The exact features that this prediction is based
    on can be controlled with the `features()` method, which should
    return a feature vector that can be used as an input to the
    multi-class perceptron.

    Attributes:
        classifier: A multi-class perceptron classifier.
    """

    def __init__(self):
        """Initialises a new tagger."""
        self.classifier = nlp3.Perceptron()

    def features(self, words, i, pred_tags):
        """Extracts features for the specified tagger configuration.
        
        Args:
            words: The input sentence, a list of words.
            i: The index of the word that is currently being tagged.
            pred_tags: The list of previously predicted tags.
        
        Returns:
            A feature vector for the specified configuration.
        """
        features = ['word:%s' % words[i].lower()]
        
        # Pair of two previous tags
        prev_tag = 'prevtags:%s' % pred_tags[-2:] if len(pred_tags) >= 2 else 'prevtags:None'
        features.append(prev_tag)
        
        # Previous word (lowercased)
        prev_word = 'prevword:%s' % words[i-1].lower() if i > 0 else 'prevword:None'
        features.append(prev_word)
        
        # Next word (lowercased)
        next_word = 'nextword:%s' % words[i+1].lower() if i < len(words) - 1 else 'nextword:None'
        features.append(next_word)
        
        return features


    def tag(self, words):
        """Tags a sentence with part-of-speech tags.

        Args:
            words: The input sentence, a list of words.

        Returns:
            The list of predicted tags for the input sentence.
        """
        
        pred_tags = []
        for i in range(len(words)):
            features = self.features(words, i, pred_tags)
            pred_tags.append(self.classifier.predict(features))
        
        return pred_tags

    def update(self, words, gold_tags):
        """Updates the tagger with a single training instance.

        Args:
            words: The list of words in the input sentence.
            gold_tags: The list of gold-standard tags for the input
                sentence.

        Returns:
            The list of predicted tags for the input sentence.
        """
        pred_tags = []           
        for i in range(len(words)):
            features = self.features(words, i, pred_tags)
            pred_tags.append(self.classifier.update(features, gold_tags[i]))
                     
        return pred_tags

    def train(self, data):
        """Train a new tagger on training data.

        Args:
            data: Training data, a list of tagged sentences.
        """

        # Extract word and tag samples
        X, y = [], []
        for sample in data:
            X.append([word for word, _ in sample])
            y.append([pos_tag for _, pos_tag in sample])
        
        # Number of passes over the training data (increase to train for more epochs)
        for _ in range(1):
            # Update the classifier on each training sentence
            for words, gold_tags in zip(X, y):
                self.update(words, gold_tags)
            
        # Average weights
        self.finalize()

    def finalize(self):
        """Finalizes the classifier by averaging its weight vectors."""
        self.classifier.finalize()

Run the following cell to test your tagger. At the end of the lab you should get the same results as in the evaluation of the default tagger (assuming that you do not change the feature extraction, see below).

In [7]:
our_tagger = Tagger()
our_tagger.train(training_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(our_tagger, dev_data)))

Accuracy: 88.39%


In what follows, we try to give you an idea of what the two methods `train()` and `tag()` do. We start with the latter.

### Tagging

The default tagger implements the sequence model presented in the lecture. In this model, sentences are tagged from left to right. A **configuration** of the tagger consists of the list of words, the index of the current word, and the list of already predicted tags. For each word in the sentence, the tagger calls the method `features()` to obtain a feature vector for the current configuration. To illustrate how this works, we define a variant of the default tagger that only extracts a single feature, the current word.

In [8]:
class DemoTagger(nlp3.Tagger):
    
    def __init__(self):
        super().__init__()
        self.debug = False
    
    def features(self, words, i, pred_tags):
        features = [words[i]]
        if self.debug:
            print("words: {}".format(" ".join(words)))
            print("i: {} (current word: {})".format(i, words[i]))
            print("pred_tags: {}".format(" ".join(pred_tags)))
            print("features: {}".format(" ".join(features)))
            print()
        return features

We train this tagger and evaluate it:

In [9]:
demo_tagger = DemoTagger()
demo_tagger.train(training_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(demo_tagger, dev_data)))

Progress: 0.00%Progress: 0.01%Progress: 0.02%Progress: 0.02%Progress: 0.03%Progress: 0.04%Progress: 0.05%Progress: 0.06%Progress: 0.06%Progress: 0.07%Progress: 0.08%Progress: 0.09%Progress: 0.10%Progress: 0.10%Progress: 0.11%Progress: 0.12%Progress: 0.13%Progress: 0.14%Progress: 0.14%Progress: 0.15%Progress: 0.16%Progress: 0.17%Progress: 0.18%Progress: 0.18%Progress: 0.19%Progress: 0.20%Progress: 0.21%Progress: 0.22%Progress: 0.22%Progress: 0.23%Progress: 0.24%Progress: 0.25%Progress: 0.26%Progress: 0.26%Progress: 0.27%Progress: 0.28%Progress: 0.29%Progress: 0.29%Progress: 0.30%Progress: 0.31%Progress: 0.32%Progress: 0.33%Progress: 0.33%Progress: 0.34%Progress: 0.35%Progress: 0.36%Progress: 0.37%Progress: 0.37%Progress: 0.38%Progress: 0.39%Progress: 0.40%Progress: 0.41%Progress: 0.41%Progress: 0.42%Progress: 0.43%Progress: 0.44%Progress: 0.45%Progress: 0.45%Progress: 0.46%Progress: 0.47%Progress: 0.48%Progress: 0.49%Progres

Progress: 99.99%
Accuracy: 83.09%


Here are the features that are extracted when the system tags the sentence *Kim reads books*:

In [10]:
demo_tagger.debug = True
demo_tagger.tag("Kim reads books".split())

words: Kim reads books
i: 0 (current word: Kim)
pred_tags: 
features: Kim

words: Kim reads books
i: 1 (current word: reads)
pred_tags: PROPN
features: reads

words: Kim reads books
i: 2 (current word: books)
pred_tags: PROPN VERB
features: books



['PROPN', 'VERB', 'NOUN']

Note that a feature vector is represented as a list of Python values, as in lab&nbsp;L1. With this vector, the tagger then calls the perceptron to predict the next tag, and updates the configuration before moving on to the next word. Finally, `tag()` returns the list of predicted tags.

### Training

Training is based on the learning algorithm for the averaged perceptron. Note that the weight vectors need to be updated for each word, not for each sentence. The tagger maintains a list of already predicted tags as part of its configuration. The tagger trains for a single epoch.

## L3X: Feature engineering for part-of-speech tagging

In the advanced part of this lab, you will practice your skills in **feature engineering**, the task of identifying useful features for a machine learning system.

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
<p>Think about which features could be useful for tagging and re-implement the method `features()` in the class `Tagger` accordingly. Provide a short description of how you came up with your features.</p>
<p>The goal is to create a system whose accuracy on the development data is as high as possible. For a pass grade, you will have to achieve an accuracy of at least 87% on the development data.</p>
</div>
</div>

We suggest that you experiment not only with atomic features but also with different feature combinations (pairs or tuples of features).
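
To make the distinction concrete, the sketch below extracts a few atomic features and one combined feature (a pair). The function name `combined_features`, the feature names, and the choice of features are hypothetical and only meant as a starting point; they are not the features of the reference system.

```python
def combined_features(words, i, pred_tags):
    """Hypothetical feature extractor mixing atomic and combined features."""
    word = words[i].lower()
    # Use a placeholder when there is no previous tag (start of sentence).
    prev_tag = pred_tags[-1] if pred_tags else '<s>'
    return [
        'word:%s' % word,                          # atomic: current word
        'suffix3:%s' % word[-3:],                  # atomic: last three characters
        'prevtag:%s' % prev_tag,                   # atomic: previous tag
        'prevtag+word:%s+%s' % (prev_tag, word),   # pair: previous tag and word
    ]


print(combined_features("Kim reads books".split(), 1, ['PROPN']))
```

A combined feature lets the perceptron assign a weight to a specific co-occurrence, which it cannot express by adding up the weights of the two atomic features.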

<div class="alert alert-danger">
You are not allowed to try to reverse-engineer the reference system!
</div>

Note that the reference implementation uses integers to represent features; this is to make reverse-engineering slightly harder. (Internally, the reference implementation really uses tuples of key values.)

We **lowercase** all words, because the same word should not be treated differently depending on whether it occurs at the beginning of a sentence.

* **Current word** (lowercased): Self-evidently important.
* **Previous two tags**: These give the model important context for predicting the current tag. We tried adding the two previous tags as separate features, but combining them into a single string gave a higher accuracy on the development data. We also tried using only the previous tag, but using the two previous tags gave a higher accuracy.
* **Previous word** (lowercased): As with n-grams, this gives the model important context. We tried adding one more previous word, but this lowered the accuracy.
* **Next word** (lowercased): For the same reason as the previous word, this gives the model additional context for the tag of the current word.

**Accuracy: 88.39%** (we get higher accuracies when we train the model for more epochs).
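
Training for more epochs can be sketched as follows: make several passes over the training data and average the weights once at the end. The helper `train_epochs` and the stand-in class `CountingTagger` are hypothetical; the helper only assumes that the tagger offers the `update()` and `finalize()` methods of the `Tagger` class above.

```python
def train_epochs(tagger, data, epochs=3):
    """Hypothetical helper: several passes over the data, then one averaging step."""
    for _ in range(epochs):
        for sentence in data:
            words = [word for word, _ in sentence]
            gold_tags = [tag for _, tag in sentence]
            tagger.update(words, gold_tags)
    # Average only once, after all updates have been made.
    tagger.finalize()


class CountingTagger:
    """Stand-in with the same update()/finalize() interface, for illustration only."""

    def __init__(self):
        self.updates = 0
        self.finalized = 0

    def update(self, words, gold_tags):
        self.updates += 1
        return list(gold_tags)

    def finalize(self):
        self.finalized += 1


toy_data = [[('Kim', 'PROPN'), ('reads', 'VERB')], [('Books', 'NOUN')]]
stub = CountingTagger()
train_epochs(stub, toy_data, epochs=3)
print(stub.updates, stub.finalized)
```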