<div class="alert alert-danger">
**Due date:** 2017-02-10
</div>

# Lab 3: Part-of-Speech Tagging

**Students:** Ludvig Noring (ludno249), Michael Sörsäter (micso554), Victor Tranell (victr593)

## Introduction

Part-of-speech (POS) tagging is the task of labelling words (tokens) with parts of speech such as noun, adjective, and verb. In this lab you will implement a POS tagger based on the averaged perceptron and evaluate it on the [Stockholm Umeå Corpus (SUC)](http://spraakbanken.gu.se/eng/resources/suc), a Swedish corpus containing more than 74,000 sentences (1.1&nbsp;million tokens), which were manually annotated with, among other things, parts of speech. The corpus is divided into two files:

<table align="left">
<tr><td><code>suc-train.txt</code></td><td style="text-align: right">72,594 sentences</td><td style="text-align: right">1,142,802 tokens</td></tr>
<tr><td><code>suc-test.txt</code></td><td style="text-align: right">1,569 sentences</td><td style="text-align: right">23,319 tokens</td></tr>
</table>

Start by importing the Python module that is required for this lab:

In [4]:
import nlp3

The next cell loads the data:

In [14]:
training_data = nlp3.read_data("/home/TDDE09/labs/nlp3/suc-train.txt")
test_data = nlp3.read_data("/home/TDDE09/labs/nlp3/suc-test.txt")

Both data sets consist of tagged sentences. In Python, a tagged sentence is represented as a list of string pairs, where the first component of each pair represents a word and the second component represents a part-of-speech tag. Run the following code cell to see an example:

In [6]:
training_data[42]

[('Och', 'KN'),
 ('det', 'PN'),
 ('är', 'VB'),
 ('som', 'KN'),
 ('segerherre', 'NN'),
 ('han', 'PN'),
 ('vill', 'VB'),
 ('göra', 'VB'),
 ('politik', 'NN'),
 ('.', 'MAD')]
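
The list-of-pairs representation is easy to take apart and put back together. The following small sketch uses a shortened copy of the sentence above; the variable names are chosen for illustration only.

```python
# A tagged sentence is a list of (word, tag) pairs
tagged_sentence = [("Och", "KN"), ("det", "PN"), ("är", "VB")]

# Split the pairs into two parallel lists
words = [word for word, tag in tagged_sentence]
tags = [tag for word, tag in tagged_sentence]

# zip() pairs the two lists up again
rebuilt = list(zip(words, tags))
print(words)
print(tags)
print(rebuilt == tagged_sentence)
```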

The next cell extracts all unique tags from the training data. The tags are explained and exemplified in Table&nbsp;12 (page&nbsp;20) of the [SUC 2.0 Manual](https://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf).

In [10]:
suc_tags = set()
for tagged_sentence in training_data:
    for word, tag in tagged_sentence:
        suc_tags.add(tag)
suc_tags = sorted(suc_tags)
print(" ".join(suc_tags))

AB DT HA HD HP HS IE IN JJ KN MAD MID NN PAD PC PL PM PN PP PS RG RO SN UO VB


Run the next code cell to train the default tagger, tag the sample sentence from above, and evaluate the tagger on the test data. Note that for reasons of speed, this only uses the first 1,000 sentences of the training data; for higher accuracies you should train on the complete training data.

In [15]:
tagger = nlp3.PerceptronTagger(suc_tags)
tagger.train(training_data[:1000])
print(tagger.tag([word for word, tag in training_data[42]]))
matrix = nlp3.confusion_matrix(tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(matrix)))

Progress: 99.90%
[('Och', 'KN'), ('det', 'PN'), ('är', 'VB'), ('som', 'HP'), ('segerherre', 'JJ'), ('han', 'PN'), ('vill', 'VB'), ('göra', 'VB'), ('politik', 'NN'), ('.', 'MAD')]
Accuracy: 84.22%
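
To make the evaluation step concrete, here is a minimal sketch of how accuracy can be derived from confusion counts. The internal format of the matrix returned by `nlp3.confusion_matrix()` is not documented here, so the `Counter` keyed by (gold tag, predicted tag) and the two function names are assumptions made for this example. The data is the sample sentence with its gold tags and the tags predicted above.

```python
from collections import Counter

def confusion_counts(gold_sentences, predicted_sentences):
    """Count (gold tag, predicted tag) pairs over aligned tagged sentences."""
    counts = Counter()
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            counts[(gold_tag, predicted_tag)] += 1
    return counts

def accuracy_from_counts(counts):
    """Share of tokens whose predicted tag equals the gold tag."""
    total = sum(counts.values())
    correct = sum(n for (gold_tag, predicted_tag), n in counts.items()
                  if gold_tag == predicted_tag)
    return correct / total if total else 0.0

gold = [[("Och", "KN"), ("det", "PN"), ("är", "VB"), ("som", "KN"),
         ("segerherre", "NN"), ("han", "PN"), ("vill", "VB"),
         ("göra", "VB"), ("politik", "NN"), (".", "MAD")]]
predicted = [[("Och", "KN"), ("det", "PN"), ("är", "VB"), ("som", "HP"),
              ("segerherre", "JJ"), ("han", "PN"), ("vill", "VB"),
              ("göra", "VB"), ("politik", "NN"), (".", "MAD")]]

counts = confusion_counts(gold, predicted)
print("Accuracy: {:.2%}".format(accuracy_from_counts(counts)))
```

On this single sentence the tagger makes two errors (*som* and *segerherre*) in ten tokens, so the sketch reports 80.00%.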


## Implement the tagger

Your main task in this lab is to re-implement the two central methods of the default tagger:

* `train()`, which takes a list of tagged sentences and trains the tagger using the averaged perceptron learning algorithm

* `tag()`, which takes a list of words (strings) and returns a tagged sentence

You are of course free to add other methods to your class if you deem it appropriate to do so.

In implementing the tagger you will be able to reuse code from your implementation of the averaged perceptron classifier in lab&nbsp;1. However, for this lab it is crucial that you can handle multiple classes, as the tagger needs one class per POS tag.
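
The multi-class classification rule can be sketched as follows. This is an illustration rather than the `nlp3` implementation: the function name `predict`, the nested-dictionary weight layout, and the toy weights are all assumptions made for the example.

```python
def predict(weights, features, classes):
    """Return the class whose weight vector gives the features the highest score.

    `weights` maps each class to a dict from feature to weight (an assumed
    layout). Ties go to the class that comes first in `classes`.
    """
    best_class, best_score = None, None
    for c in classes:
        # The score of a class is the sum of its weights for the active features
        score = sum(weights[c].get(f, 0.0) for f in features)
        if best_score is None or score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up weights for three tags and two word features
toy_weights = {
    "NN": {"politik": 2.0},
    "VB": {"göra": 3.0, "politik": -1.0},
    "PN": {},
}
toy_classes = ["NN", "PN", "VB"]

print(predict(toy_weights, ["politik"], toy_classes))
print(predict(toy_weights, ["göra"], toy_classes))
```

With one weight vector per tag, the binary decision from lab&nbsp;1 turns into an argmax over all tags.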

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Implement a part-of-speech tagger based on the averaged perceptron, train it on the training data, and evaluate performance on the test data. Your tagger should get the same results as the default tagger.
</div>
</div>

Starter code for this problem is given in the following code cell. The provided class simply inherits from `nlp3.PerceptronTagger` and calls the methods in the superclass. Your task is to replace these calls with your own code. You will note that there is a third method `get_features()`; you do not need to touch this method unless you want to do the advanced part of this lab (see below).

In [106]:
class OurTagger(nlp3.PerceptronTagger):

    def __init__(self, tags):
        """Creates a new tagger that uses the specified tag set."""
        super().__init__(tags)
        self.tags = tags
        self.class_weights = {}
        self.words = []

    def tag(self, words):
        """Tags the specified words, returning a tagged sentence."""
        predicted_tags = []
        for word in words:
            if word in self.words:
                # Look up the position of the word once, not once per tag
                index = self.words.index(word)
                best_score = 0
                predicted_tag = 'nope'
                for tag in self.class_weights:
                    score = self.class_weights[tag][index]
                    # With >=, ties go to the tag that comes last
                    if score >= best_score:
                        best_score = score
                        predicted_tag = tag
                predicted_tags.append(predicted_tag)
            else:
                # Words not seen during training fall back to 'VB'
                predicted_tags.append('VB')
        return list(zip(words, predicted_tags))

    def train(self, tagged_sentences, report_progress=True):
        """Trains this tagger on the specified gold-standard data."""
        # Note: this is a plain (unaveraged) perceptron with one weight per
        # word and tag; it does not call get_features().

        # Collect the vocabulary of the training data
        words = set()
        for sentence in tagged_sentences:
            for word, tag in sentence:
                words.add(word)
        self.words = sorted(words)

        # Map each word to its position so that lookups take constant time
        index = {word: i for i, word in enumerate(self.words)}

        # One weight vector per tag
        for tag in self.tags:
            self.class_weights[tag] = [0] * len(self.words)

        # Loop over the training data, updating after every word
        for sentence in tagged_sentences:
            for word, tag in sentence:
                i = index[word]
                predicted_class = 'nope'
                best_score = 0
                for candidate in self.class_weights:
                    score = self.class_weights[candidate][i]
                    # With >=, ties go to the tag that comes last
                    if score >= best_score:
                        predicted_class = candidate
                        best_score = score
                # On a mistake, shift weight from the predicted tag to the gold tag
                if tag != predicted_class:
                    self.class_weights[predicted_class][i] -= 1
                    self.class_weights[tag][i] += 1

    def get_features(self, tokens, i, pred_tags):
        """Extracts the feature list for the specified configuration."""
        # TODO: For the advanced part, replace the following line with your own code
        return super().get_features(tokens, i, pred_tags)

Run the following cell to test your tagger. At the end of the lab you should get the same results as in the evaluation of the default tagger (assuming that you do not change the feature extraction, see below).

In [108]:
our_tagger = OurTagger(suc_tags)
# Train on the first 120 sentences only
our_tagger.train(training_data[:120])
our_matrix = nlp3.confusion_matrix(our_tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(our_matrix)))

Accuracy: 57.58%


In what follows, we try to give you an idea of what the two methods `train()` and `tag()` do. We start with the latter.

### Tagging

The default tagger implements the sequence model presented in the lecture. In this model, sentences are tagged from left to right. A **configuration** consists of the list of words, the index of the current word, and the list of already predicted tags. For each word in the sentence, the tagger calls the method `get_features()` to obtain a feature vector for the current configuration. To illustrate how this works, we define a variant of the default tagger that only extracts a single feature, the current word.

In [None]:
class DemoTagger(nlp3.PerceptronTagger):
    
    def get_features(self, words, i, pred_tags):
        if self.debug:
            print("words: {}".format(" ".join(words)))
            print("i: {} (current word: {})".format(i, words[i]))
            print("pred_tags: {}".format(" ".join(pred_tags)))
            print()
        return [words[i]]

We train this tagger and evaluate it:

In [None]:
demo_tagger = DemoTagger(suc_tags)
demo_tagger.debug = False
demo_tagger.train(training_data[:1000])
demo_matrix = nlp3.confusion_matrix(demo_tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(demo_matrix)))
demo_tagger.debug = True

Here are the features that are extracted when the system tags the sentence *Anna älskar Kurt*:

In [None]:
demo_tagger.tag("Anna älskar Kurt".split())

Note that a feature vector is represented as a list of features. With this vector, the tagger then predicts the next tag using the classification rule for the perceptron, and updates the configuration before moving on to the next word. Finally, `tag()` returns the tagged sentence.
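
The tagging loop described above can be sketched without the `nlp3` module. The function `tag_greedy`, the lookup-table classifier `toy_predict`, and the tags in `toy_lexicon` are hypothetical stand-ins for the trained perceptron; only the `get_features(words, i, pred_tags)` signature is taken from the lab.

```python
def tag_greedy(words, get_features, predict):
    """Tag words from left to right, one configuration at a time."""
    pred_tags = []
    for i in range(len(words)):
        # The configuration is (words, i, pred_tags)
        features = get_features(words, i, pred_tags)
        # Predict the tag for word i and extend the configuration
        pred_tags.append(predict(features))
    return list(zip(words, pred_tags))

def current_word(words, i, pred_tags):
    """Single-feature extractor, as in the demo tagger: the current word."""
    return [words[i]]

# Made-up lookup table standing in for a trained classifier
toy_lexicon = {"Anna": "PM", "älskar": "VB", "Kurt": "PM"}

def toy_predict(features):
    return toy_lexicon.get(features[0], "NN")

print(tag_greedy("Anna älskar Kurt".split(), current_word, toy_predict))
```

Because `pred_tags` grows by one tag per step, a richer `get_features()` can use the tags already assigned to earlier words.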

### Training

Training is based on the learning algorithm for the averaged perceptron. Note that the weight vectors need to be updated for each word, not for each sentence. The tagger maintains a list of already predicted tags as part of its configuration. The tagger trains for a single epoch.
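
As a rough sketch of this procedure, the function below trains for a single epoch, updates after every word, and feeds the predicted tags back into the configuration. It is not the `nlp3` implementation: the function names, the dictionary-based weights, and the running-total form of averaging (which also counts the initial zero vector) are assumptions made for the example.

```python
from collections import defaultdict

def score(weights, tag, features):
    """Sum the weights of the active features for one tag."""
    return sum(weights[tag].get(f, 0.0) for f in features)

def train_averaged_perceptron(tagged_sentences, tags, get_features):
    """Train for one epoch; return averaged weights as tag -> feature -> weight."""
    weights = {t: defaultdict(float) for t in tags}
    acc = {t: defaultdict(float) for t in tags}  # updates weighted by step count
    count = 1
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        pred_tags = []
        for i, (_, gold) in enumerate(sentence):
            features = get_features(words, i, pred_tags)
            # Ties go to the tag that comes first in `tags`
            pred = max(tags, key=lambda t: score(weights, t, features))
            if pred != gold:
                # Update per word: reward the gold tag, penalise the prediction
                for f in features:
                    weights[gold][f] += 1.0
                    acc[gold][f] += count
                    weights[pred][f] -= 1.0
                    acc[pred][f] -= count
            pred_tags.append(pred)
            count += 1
    # Averaged weight = final weight minus the count-weighted updates / steps
    return {t: {f: w - acc[t][f] / count for f, w in weights[t].items()}
            for t in tags}

toy_tags = ["NN", "PM", "VB"]
toy_data = [[("Anna", "PM"), ("älskar", "VB"), ("Kurt", "PM")]] * 2
averaged = train_averaged_perceptron(toy_data, toy_tags,
                                     lambda words, i, pred_tags: [words[i]])
print(max(toy_tags, key=lambda t: averaged[t].get("älskar", 0.0)))
```

Keeping a count-weighted total of the updates avoids summing the full weight vectors after every word, which would be slow for a large feature set.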

## Advanced: Feature engineering

In the advanced part of this lab, you will practice your skills in **feature engineering**, the task of identifying useful features for a machine learning system.

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Think about which features could be useful for tagging and re-implement the method `get_features()` in the class `OurTagger` accordingly. Experiment not only with atomic features but also with different feature combinations (pairs or tuples of features). The goal is to create a system whose accuracy on the test data is as high as possible. Provide a short description of how you came up with your features.
</div>
</div>
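
As a starting point for experiments, the sketch below shows what atomic features and one feature combination could look like. The feature set is illustrative only and has not been evaluated on the test data; the function name `extract_features`, the feature names, and the `<BOS>`/`<EOS>` boundary markers are assumptions made for the example. The signature mirrors `get_features(tokens, i, pred_tags)`.

```python
def extract_features(words, i, pred_tags):
    """Return a list of string features for the word at position i."""
    word = words[i]
    suffix = word[-3:].lower()
    # Context from the already predicted tags and the neighbouring words
    prev_tag = pred_tags[-1] if pred_tags else "<BOS>"
    prev_word = words[i - 1].lower() if i > 0 else "<BOS>"
    next_word = words[i + 1].lower() if i + 1 < len(words) else "<EOS>"
    return [
        # Atomic features
        "word=" + word.lower(),
        "suffix3=" + suffix,
        "capitalised=" + str(word[:1].isupper()),
        "has_digit=" + str(any(ch.isdigit() for ch in word)),
        "prev_tag=" + prev_tag,
        "prev_word=" + prev_word,
        "next_word=" + next_word,
        # Feature combination: previous tag paired with the current suffix
        "prev_tag+suffix3=" + prev_tag + "+" + suffix,
    ]

features = extract_features(["Anna", "älskar", "Kurt"], 1, ["PM"])
for feature in features:
    print(feature)
```

Prefixing each feature with its name keeps features of different types apart, so that, for example, a word and a suffix with the same spelling get separate weights.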

*TODO: Insert your description of how you came up with your features here*