<div class="alert alert-danger">
**Due date:** 2017-02-10
</div>

# Lab 3: Part-of-Speech Tagging

**Students:** Ludvig Noring (ludno249), Michael Sörsäter (micso554), Victor Tranell (victr593)

## Introduction

Part-of-speech (POS) tagging is the task of labelling words (tokens) with parts of speech such as noun, adjective, and verb. In this lab you will implement a POS tagger based on the averaged perceptron and evaluate it on the [Stockholm Umeå Corpus (SUC)](http://spraakbanken.gu.se/eng/resources/suc), a Swedish corpus containing more than 74,000 sentences (1.1&nbsp;million tokens), which were manually annotated with, among others, parts of speech. The corpus is divided into two files:

<table align="left">
<tr><td><code>suc-train.txt</code></td><td style="text-align: right">72,594 sentences</td><td style="text-align: right">1,142,802 tokens</td></tr>
<tr><td><code>suc-test.txt</code></td><td style="text-align: right">1,569 sentences</td><td style="text-align: right">23,319 tokens</td></tr>
</table>

Start by importing the Python module that is required for this lab:

In [1]:
import nlp3

The next cell loads the data:

In [2]:
training_data = nlp3.read_data("/home/TDDE09/labs/nlp3/suc-train.txt")
test_data = nlp3.read_data("/home/TDDE09/labs/nlp3/suc-test.txt")

Both data sets consist of tagged sentences. In Python, a tagged sentence is represented as a list of string pairs, where the first component of each pair represents a word and the second component represents a part-of-speech tag. Run the following code cell to see an example:

In [3]:
training_data[42]

[('Och', 'KN'),
 ('det', 'PN'),
 ('är', 'VB'),
 ('som', 'KN'),
 ('segerherre', 'NN'),
 ('han', 'PN'),
 ('vill', 'VB'),
 ('göra', 'VB'),
 ('politik', 'NN'),
 ('.', 'MAD')]

The next cell extracts all unique tags from the training data. The tags are explained and exemplified in Table&nbsp;12 (page&nbsp;20) of the [SUC 2.0 Manual](https://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf).

In [6]:
suc_tags = set()
for tagged_sentence in training_data:
    for word, tag in tagged_sentence:
        suc_tags.add(tag)
suc_tags = sorted(suc_tags)
print(" ".join(suc_tags))

AB DT HA HD HP HS IE IN JJ KN MAD MID NN PAD PC PL PM PN PP PS RG RO SN UO VB


Run the next code cell to train the default tagger, tag the sample sentence from above, and evaluate the tagger on the test data. Note that for reasons of speed, this only uses the first 1,000 sentences of the training data; for higher accuracies you should train on the complete training data.

In [7]:
tagger = nlp3.PerceptronTagger(suc_tags)
tagger.train(training_data)
print(tagger.tag([word for word, tag in training_data[42]]))
matrix = nlp3.confusion_matrix(tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(matrix)))

Progress: 15.98%

KeyboardInterrupt: 

## Implement the tagger

Your main task in this lab is to re-implement the two central methods of the default tagger:

* `train()`, which takes a list of tagged sentences and trains the tagger using the averaged perceptron learning algorithm

* `tag()`, which takes a list of words (strings) and returns a tagged sentence

You are of course free to add other methods to your class if you deem it appropriate to do so.

In implementing the tagger you will be able to reuse code from your implementation of the averaged perceptron classifier in lab&nbsp;1. However, for this lab it is crucial that you can handle multiple classes, as the tagger needs one class per POS tag.

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Implement a part-of-speech tagger based on the averaged perceptron, train it on the training data, and evaluate performance on the test data. Your tagger should get the same results as the default tagger.
</div>
</div>

Starter code for this problem is given in the following code cell. The provided class simply inherits from `nlp3.PerceptronTagger` and calls the methods in the superclass. Your task is to replace these calls with your own code. You will note that there is a third method `get_features()`; you do not need to touch this method unless you want to do the advanced part of this lab (see below).

In [31]:
class OurTagger(nlp3.PerceptronTagger):
    def __init__(self, tags):
        """Creates a new tagger that uses the specified tag set."""
        super().__init__(tags)
        self.tags = tags
        self.tag_weights = {}

            
    def tag(self, words):
        """Tags the specified words, returning a tagged sentence."""
        predicted_tags = []
        prev_tag = ''

        for i, word in enumerate(words):
            features = self.get_features(words, i, predicted_tags)
            best_score = -999
            predicted_tag = 'none'

            for current_tag in self.tag_weights:
                tmp_score = 0
                for feature in features:
                    if feature not in self.tag_weights['NN']:
                        continue
                    
                    tmp_score += self.tag_weights[current_tag][feature]
                if tmp_score > best_score:
                    best_score = tmp_score
                    predicted_tag = current_tag
        
            if predicted_tag == 'none':
                print(words)
                print(word)
                predicted_tag = 'NN'
            predicted_tags.append(predicted_tag)

        return list(zip(words, predicted_tags))
    
    def train(self, tagged_sentences, report_progress=True):
        import time
        """Trains this tagger on the specified gold-standard data."""
        #super().train(tagged_sentences, report_progress)
        #return
        t0 = time.time()
        total_count = len(tagged_sentences)

        # weight vectors for the different classes
        acc = {}
        for tag in self.tags:
            self.tag_weights[tag] = {}
            acc[tag] = {}

        acc_cnt = 1
        # loop over training data
        progress_cnt = 0
        for sentence in tagged_sentences:
            progress_cnt += 1
            print("Progress: {0:.2f} %, {1:.1f} seconds".format(100 * progress_cnt / total_count, time.time()-t0), end="\r")

            tokens = [token for token, tag in sentence]
            pred_tags = []
            for i, pair in enumerate(sentence):
                correct_tag = pair[1]
                predicted_tag = 'NN'
                best_score = -999                
                features = self.get_features(tokens, i, pred_tags)
                
                #pred_tags.append(correct_tag)
                
                # Add features too dicts
                for feature in features:
                    if feature not in self.tag_weights['NN']:
                        for tag in self.tag_weights:
                            self.tag_weights[tag][feature] = 0
                            acc[tag][feature] = 0 # add the tag to the acc

                for current_tag in self.tag_weights:
                    tmp_score = 0
                    for feature in features:
                        tmp_score += self.tag_weights[current_tag][feature] 
                    if(tmp_score > best_score):
                        predicted_tag = current_tag
                        best_score = tmp_score

                if correct_tag != predicted_tag:
                    for feature in features:
                        self.tag_weights[predicted_tag][feature] -= 1
                        self.tag_weights[correct_tag][feature] += 1

                        acc[predicted_tag][feature] -= acc_cnt
                        acc[correct_tag][feature] += acc_cnt
                pred_tags.append(predicted_tag)
                acc_cnt += 1

        # Averaging
        #print(self.tag_weights['NN'])
        for current_tag in self.tag_weights:
            for feature in self.tag_weights[current_tag]:
                self.tag_weights[current_tag][feature] -= acc[current_tag][feature] / acc_cnt

        print()
        
        #print(self.tag_weights['NN'])

    def get_features(self, tokens, i, pred_tags):
        """Extracts the feature list for the specified configuration."""
        
        # Tuning parameters
        t_word = 4 #4
        t_prevtag = 2 #2
        t_nextword = 2 #2
        t_prevword = 2 #2
        t_wordlen = 1 #1
        t_prevlen = 1 #1
        t_case = 1 #1
        t_currend = 0 # 0
        
        featurelist = []        
        featurelist += [tokens[i]] * t_word
        

        if i == 0:            
            featurelist += ["prevtag:<BOS>"] * t_prevtag
        else:
            featurelist += ["prevtag:"+str(pred_tags[-1])] * t_prevtag
        
        if i == len(tokens)-1:
            featurelist += ["nextword:<EOS>"] * t_nextword
        else:
            featurelist += ["nextword:"+tokens[i+1]] * t_nextword
            
        if i == 0:
            featurelist += ['prevword:<BOS>'] * t_prevword
            
        else:
            featurelist += ['prevword:'+str(tokens[i-1])] * t_prevword
                
        featurelist += ['wordlen:'+str(len(tokens[i]))] * t_wordlen
        
        if i == 0:
            featurelist += ['prevlen:0'] * t_prevlen
        else:
            featurelist += ['prevlen:'+str(len(tokens[i-1]))] * t_prevlen
            
        if tokens[i][0].isupper():# and i > 0:
            featurelist += ['case:upper'] * t_case
        else:
            featurelist += ['case:lower'] * t_case
        
        if not i == 0:
            featurelist += ["currend:" + tokens[i][-1:-3]] * t_currend
                
        return featurelist
        #return [tokens[i]]
        #return super().get_features(tokens, i, pred_tags)
        
our_tagger = OurTagger(suc_tags)
our_tagger.train(training_data[:])
our_matrix = nlp3.confusion_matrix(our_tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(our_matrix)))

#vanlig
#10 000 - 90.16

#i > 0 på upper





Progress: 100.00 %, 88.5 seconds
Accuracy: 94.41%


Run the following cell to test your tagger. At the end of the lab you should get the same results as in the evaluation of the default tagger (assuming that you do not change the feature extraction, see below).

In [10]:
our_tagger = OurTagger(suc_tags)
our_tagger.train(training_data[:10000])
our_matrix = nlp3.confusion_matrix(our_tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(our_matrix)))

# 75.96 - 1000
# 92.35 - alla
# vår
# 76.52 - 1000
# 92.6 - alla

# FEATURES
# std: 76.52 87.26
# nuvarande ord: 64.22


# bos, längd av ord: 74.48
# föregående ord, föregående klass, ordlängd, föregående ordlängd, stor bokstav, ändelse: 74.51 
# föregående ord, föregående klass, ordlängd, föregående ordlängd, stor bokstav, ord igen: 77.26 86.69



# 10 000
# std: 87.26
# alla: 89.12
# 90.11

# 94.52

Progress: 100.00 %, 12.3 seconds
Accuracy: 90.16%


In what follows, we try to give you an idea of what the two methods `train()` and `tag()` do. We start with the latter.

### Tagging

The default tagger implements the sequence model presented in the lecture. In this model, sentences are tagged from left to right. A **configuration** consists of the list of words, the index of the current word, and the list of already predicted tags. For each word in the sentence, the tagger calls the method `get_features()` to obtain a feature vector for the current configuration. To illustrate how this works, we define a variant of the default tagger that only extracts a single feature, the current word.

In [None]:
class DemoTagger(nlp3.PerceptronTagger):
    
    def get_features(self, words, i, pred_tags):
        features = [words[i]]
        if self.debug:
            print("words: {}".format(" ".join(words)))
            print("i: {} (current word: {})".format(i, words[i]))
            print("pred_tags: {}".format(" ".join(pred_tags)))
            print("features: {}".format(" ".join(features)))
            print()
        return features

We train this tagger and evaluate it:

In [None]:
demo_tagger = DemoTagger(suc_tags)
demo_tagger.debug = False
demo_tagger.train(training_data[:1000])
demo_matrix = nlp3.confusion_matrix(demo_tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(demo_matrix)))
demo_tagger.debug = True

Here are the features that are extracted when the system tags the sentence *Anna älskar Kurt*:

In [None]:
demo_tagger.tag("Anna älskar Kurt".split())

Note that a feature vector is represented as a list of features. With this vector, the tagger then predicts the next tag using the classification rule for the perceptron, and updates the configuration before moving on to the next word. Finally, `tag()` returns the tagged sentence.

### Training

Training is based on the learning algorithm for the averaged perceptron. Note that the weight vectors need to be updated for each word, not for each sentence. The tagger maintains a list of already predicted tags as part of its configuration. The tagger trains for a single epoch.

## Advanced: Feature engineering

In the advanced part of this lab, you will practice your skills in **feature engineering**, the task of identifying useful features for a machine learning system.

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Think about which features could be useful for tagging and re-implement the method `get_features()` in the class `OurTagger` accordingly. Experiment not only with atomic features but also with different feature combinations (pairs or tuples of features). The goal is to create a system whose accuracy on the test data is as high as possible. For full credit you will have to achieve an accuracy of at least 93% on the test data. Provide a short description of how you came up with your features.
</div>
</div>

First of all we tried to copy the feature set shown during the lectures. We got that to work equally as good as the provided tagger with the original super.get_feature(). After that we tested every features we could think of one by one to figure out if they made any imporvement. When we had about 6 features we started tuning them. By adding the same feature multiple times we could put a heavier weight on those features. When we could not think of any more features or could not find any better combination of weights we considered ourselves finished.