# Feature engineering for part-of-speech tagging

In this challenge, you will practice your skills in feature engineering, the task of identifying useful features for a machine learning system.

## The data set

The data for this lab comes from the [Universal Dependencies Project](http://universaldependencies.org). The code in the next cell defines a simple reader for this data set. For more information about the file format, see the [CoNLL-U format documentation](https://universaldependencies.org/format.html).

In [None]:
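class Dataset:
    """Iterate over a CoNLL-U file, yielding sentences as lists of (word, tag) pairs."""

    def __init__(self, filename, max_tokens=100000):
        self.filename = filename
        self.max_tokens = max_tokens

    def __iter__(self):
        # Token budget; it is only checked at sentence boundaries
        todo = self.max_tokens
        tmp = []
        with open(self.filename, 'rt', encoding='utf-8') as lines:
            for line in lines:
                line = line.rstrip()
                if line:
                    # Skip comment lines
                    if not line.startswith('#'):
                        columns = line.split('\t')
                        # Skip multiword token ranges such as "1-2"
                        if '-' not in columns[0]:
                            # Column 2 is the word form, column 4 the universal tag
                            tmp.append((columns[1], columns[3]))
                            todo -= 1
                elif tmp:
                    # A blank line ends the current sentence
                    yield tmp
                    tmp = []
                    if todo <= 0:
                        break
        # Yield the last sentence if the file lacks a trailing blank line
        if tmp:
            yield tmp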
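
To make the reader's behaviour concrete, the following sketch applies the same column logic to a tiny CoNLL-U fragment. The fragment is hand-written for illustration (it is not taken from a treebank), and the helper names `conllu_line` and `read_pairs` are hypothetical and not part of the lab code.

```python
def conllu_line(*columns):
    # CoNLL-U separates its ten columns with tab characters
    return "\t".join(columns)

# Hand-written fragment: one comment line, one multiword token range (2-3),
# four ordinary tokens, and the blank line that ends the sentence
lines = [
    "# text = I don't know",
    conllu_line("1", "I", "I", "PRON", "_", "_", "4", "nsubj", "_", "_"),
    conllu_line("2-3", "don't", "_", "_", "_", "_", "_", "_", "_", "_"),
    conllu_line("2", "do", "do", "AUX", "_", "_", "4", "aux", "_", "_"),
    conllu_line("3", "n't", "not", "PART", "_", "_", "4", "advmod", "_", "_"),
    conllu_line("4", "know", "know", "VERB", "_", "_", "0", "root", "_", "_"),
    "",
]

def read_pairs(lines):
    sentence = []
    for line in lines:
        line = line.rstrip()
        if not line:
            # A blank line closes the current sentence
            if sentence:
                yield sentence
            sentence = []
        elif not line.startswith("#"):
            columns = line.split("\t")
            # Keep ordinary tokens, skip ranges such as "2-3"
            if "-" not in columns[0]:
                sentence.append((columns[1], columns[3]))
    if sentence:
        yield sentence

sentences = list(read_pairs(lines))
print(sentences)
```

The comment line and the `2-3` range are dropped, so the fragment yields a single sentence with four (word, tag) pairs.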

We load the training data and the development data for English. To work with Icelandic instead, uncomment the corresponding lines:

In [None]:
# English
train_data = Dataset('en_ewt-ud-train.conllu')
dev_data = Dataset('en_ewt-ud-dev.conllu')

# Icelandic
# train_data = Dataset('is_icepahc-ud-train.conllu')
# dev_data = Dataset('is_icepahc-ud-dev.conllu')

Both data sets consist of **tagged sentences**. In Python, a tagged sentence is represented as a list of string pairs, where the first component of each pair is a word token and the second component is that word’s tag. The possible tags are listed and exemplified in the [Annotation Guidelines](http://universaldependencies.org/u/pos/all.html) of the Universal Dependencies Project.
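
As a small illustration of this representation, the sketch below builds one tagged sentence by hand and splits it into words and tags with `zip(*...)`, the same idiom the training and evaluation code uses. The sentence itself is made up and does not come from the treebank.

```python
# A made-up tagged sentence: a list of (word, tag) string pairs
tagged_sentence = [("She", "PRON"), ("reads", "VERB"), ("books", "NOUN"), (".", "PUNCT")]

# Unzip the pairs into a tuple of words and a tuple of tags
words, gold_tags = zip(*tagged_sentence)

print(words)
print(gold_tags)
```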

## Baseline tagger

The baseline tagger that you will use is a pure Python implementation of a simple tagger based on a linear classifier.

### Linear model

In [None]:
from collections import defaultdict

class Linear(object):

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weight = {c: defaultdict(float) for c in self.classes}
        self.bias = {c: 0.0 for c in self.classes}

    def forward(self, features):
        scores = {}
        for c in self.classes:
            scores[c] = self.bias[c]
            for f, v in features.items():
                scores[c] += v * self.weight[c][f]
        return scores
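
The score of a class is its bias plus the sum of the weights of the active features, each multiplied by the feature's value. The following sketch computes this by hand for two classes, without using the `Linear` class. All weights, biases, and features are invented for illustration.

```python
from collections import defaultdict

classes = ["NOUN", "VERB"]
weight = {c: defaultdict(float) for c in classes}
bias = {c: 0.0 for c in classes}

# Invented parameters; a trained model would have learned these
weight["NOUN"][(0, "book")] = 2.0
weight["VERB"][(0, "book")] = 1.0
weight["VERB"][(3, "PRON")] = 1.5
bias["NOUN"] = 0.25

# Two active features, each with value 1
features = {(0, "book"): 1, (3, "PRON"): 1}

scores = {
    c: bias[c] + sum(v * weight[c][f] for f, v in features.items())
    for c in classes
}
best = max(classes, key=lambda c: scores[c])

print(scores)
print(best)
```

With these numbers, `NOUN` scores 0.25 + 2.0 = 2.25 and `VERB` scores 1.0 + 1.5 = 2.5, so the highest-scoring class is `VERB`.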

### Perceptron learning algorithm

In [None]:
class PerceptronTrainer(object):

    def __init__(self, model):
        self.model = model
        # Accumulator for the averaging trick: stores each update scaled by
        # the step at which it was made
        self._acc = Linear(model.classes)
        self._counter = 1

    def update(self, features, gold):
        scores = self.model.forward(features)
        pred = max(self.model.classes, key=lambda c: scores[c])
        if pred != gold:
            # On a mistake, move the gold class towards the features and the
            # predicted class away from them
            self.model.bias[gold] += 1
            self.model.bias[pred] -= 1
            self._acc.bias[gold] += self._counter
            self._acc.bias[pred] -= self._counter
            for f, v in features.items():
                self.model.weight[gold][f] += v
                self.model.weight[pred][f] -= v
                self._acc.weight[gold][f] += v * self._counter
                self._acc.weight[pred][f] -= v * self._counter
        self._counter += 1

    def finalize(self):
        # Replace the final parameters by their average over all training steps
        for c in self.model.classes:
            delta_b = self._acc.bias[c] / self._counter
            self.model.bias[c] -= delta_b
            for feat in self.model.weight[c]:
                delta_w = self._acc.weight[c][feat] / self._counter
                self.model.weight[c][feat] -= delta_w
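
The `finalize` method turns the final weights into averaged weights without storing a copy of the weights after every step. The sketch below checks this bookkeeping on a single scalar weight: it compares a naive average over all snapshots (including the initial zero) with the accumulator-based computation used above. The function names and the update sequence are hypothetical, and `Fraction` is used only to keep the comparison exact.

```python
from fractions import Fraction

def naive_average(deltas):
    # Keep a snapshot of the weight before the first step and after each step
    w = Fraction(0)
    snapshots = [w]
    for d in deltas:
        w += d
        snapshots.append(w)
    return sum(snapshots) / len(snapshots)

def accumulator_average(deltas):
    # Same bookkeeping as the trainer: scale each update by the step counter
    w = Fraction(0)
    acc = Fraction(0)
    counter = 1
    for d in deltas:
        w += d
        acc += d * counter
        counter += 1
    return w - acc / counter

# Invented update sequence; 0 stands for a step without a mistake
deltas = [1, 0, -1, 0, 0, 1, 1]

print(naive_average(deltas))
print(accumulator_average(deltas))
```

Both functions return 5/8 for this sequence, which suggests why the trainer only needs one extra set of weights instead of a full history.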

### Perceptron tagger

This is the part of the code that you will have to modify.

In [None]:
class PerceptronTagger(object):

    def __init__(self, tags):
        self.model = Linear(tags)

    # This is the only method you are supposed to change!
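    def featurize(self, words, i, pred_tags):
        feats = []
        # Current word
        feats.append(words[i])
        # Previous word, or a boundary marker at the start of the sentence
        feats.append(words[i-1] if i > 0 else '<bos>')
        # Next word, or a boundary marker at the end of the sentence
        feats.append(words[i+1] if i + 1 < len(words) else '<eos>')
        # Tag assigned to the previous word
        feats.append(pred_tags[i-1] if i > 0 else '<bos>')
        # Key each feature by its template index so that the same string in
        # different templates yields different features
        return {(j, f): 1 for j, f in enumerate(feats)}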

    def predict(self, words):
        pred_tags = []
        for i, _ in enumerate(words):
            features = self.featurize(words, i, pred_tags)
            scores = self.model.forward(features)
            pred_tag = max(self.model.classes, key=lambda c: scores[c])
            pred_tags.append(pred_tag)
        return pred_tags
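
To see what the baseline feature templates produce, the sketch below re-implements them as a standalone function and applies it to a made-up sentence. The function name `baseline_features` is hypothetical; it mirrors `featurize` but is not part of the tagger.

```python
def baseline_features(words, i, pred_tags):
    # Standalone copy of the four baseline templates
    feats = [
        words[i],                                       # current word
        words[i - 1] if i > 0 else '<bos>',             # previous word
        words[i + 1] if i + 1 < len(words) else '<eos>',  # next word
        pred_tags[i - 1] if i > 0 else '<bos>',         # previous tag
    ]
    return {(j, f): 1 for j, f in enumerate(feats)}

words = ("She", "reads", "books")

# First position: no left context and no tag history yet
first = baseline_features(words, 0, [])

# Last position: two tags have already been assigned
last = baseline_features(words, 2, ["PRON", "VERB"])

print(first)
print(last)
```

Because every feature is keyed by its template index, `(0, 'reads')` (the current word is "reads") and `(1, 'reads')` (the previous word is "reads") are separate features with separate weights.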

### Training loop

In [None]:
from tqdm import tqdm

def train_perceptron(train_data, n_epochs=1):
    # Collect the tags in the training data
    tags = set()
    for tagged_sentence in train_data:
        words, gold_tags = zip(*tagged_sentence)
        tags.update(gold_tags)

    # Initialise and train the perceptron tagger
    tagger = PerceptronTagger(tags)
    trainer = PerceptronTrainer(tagger.model)
    for epoch in range(n_epochs):
        with tqdm(total=sum(1 for s in train_data)) as pbar:
            for tagged_sentence in train_data:
                words, gold_tags = zip(*tagged_sentence)
                pred_tags = []
                for i, gold_tag in enumerate(gold_tags):
                    features = tagger.featurize(words, i, pred_tags)
                    trainer.update(features, gold_tag)
                    # The tag history is built from gold tags during training
                    pred_tags.append(gold_tag)
                pbar.update()
    trainer.finalize()

    return tagger

## Evaluation

The following function computes the accuracy of the tagger on gold-standard data.

In [None]:
def accuracy(tagger, gold_data):
    correct = 0
    total = 0
    for tagged_sentence in gold_data:
        words, gold_tags = zip(*tagged_sentence)
        pred_tags = tagger.predict(words)
        for gold_tag, pred_tag in zip(gold_tags, pred_tags):
            correct += int(gold_tag == pred_tag)
            total += 1
    return correct / total
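
A single accuracy figure does not show which tags are confused with which. As a possible complement, the sketch below counts (gold tag, predicted tag) pairs for all mispredicted tokens. The function `confusions`, the stand-in tagger `AlwaysNoun`, and the toy data are all hypothetical; the only assumption about a real tagger is that it offers the same `predict(words)` method as `PerceptronTagger`.

```python
from collections import Counter

def confusions(tagger, gold_data):
    """Count (gold_tag, pred_tag) pairs for all mispredicted tokens."""
    errors = Counter()
    for tagged_sentence in gold_data:
        words, gold_tags = zip(*tagged_sentence)
        pred_tags = tagger.predict(words)
        for gold_tag, pred_tag in zip(gold_tags, pred_tags):
            if gold_tag != pred_tag:
                errors[(gold_tag, pred_tag)] += 1
    return errors

class AlwaysNoun:
    """Stand-in tagger used only to make this sketch runnable."""

    def predict(self, words):
        return ["NOUN"] * len(words)

# Made-up gold data in the tagged-sentence representation
toy_data = [
    [("She", "PRON"), ("reads", "VERB"), ("books", "NOUN")],
    [("He", "PRON"), ("sleeps", "VERB")],
]

errors = confusions(AlwaysNoun(), toy_data)
for (gold, pred), count in errors.most_common():
    print(gold, pred, count)
```

Applied to the trained tagger and `dev_data`, the most frequent pairs can hint at which kinds of features are worth trying.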

## Feature engineering

Your task now is to try to improve the performance of the perceptron tagger by adding new features. The only part of the code you should change is the `featurize` method.
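
As a starting point, the sketch below lists a few word-shape features that are commonly tried for part-of-speech tagging. The helper `extra_features` is hypothetical and not part of the lab code, and whether any of these features actually helps on the development data has to be checked empirically.

```python
def extra_features(words, i):
    """Return a few hypothetical word-shape features for words[i]."""
    word = words[i]
    return [
        word.lower(),                                          # case-normalised form
        word[-3:],                                             # suffix of up to three characters
        word[:2],                                              # prefix of up to two characters
        'cap' if word[:1].isupper() else 'nocap',              # initial capital
        'digit' if any(ch.isdigit() for ch in word) else 'nodigit',
        'hyphen' if '-' in word else 'nohyphen',
    ]

example = extra_features(("Running", "fast"), 0)
print(example)
```

Inside `featurize`, such a list could be added with `feats.extend(extra_features(words, i))`. The helper always returns the same number of items in the same order, which matters because `featurize` keys each feature by its position in `feats`.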

In [None]:
tagger = train_perceptron(train_data, n_epochs=1)
print('{:.4f}'.format(accuracy(tagger, dev_data)))