# Part-of-Speech Tagging with Recurrent Neural Networks

Your task in this assignment is to implement a simple part-of-speech tagger based on recurrent neural networks.

## Part-of-speech tagging

Part-of-speech (POS) tagging is the task of labelling words (tokens) with [parts of speech](https://en.wikipedia.org/wiki/Part_of_speech). To give an example, consider the sentence *Parker hates parsnips*. In this sentence, the word *Parker* should be labelled as a proper noun (here, a noun that is the name of a person), *hates* should be labelled as a verb, and *parsnips* should be labelled as a (common) noun. Part-of-speech tagging is an essential ingredient of many state-of-the-art natural language understanding systems.

Part-of-speech tagging can be cast as a supervised machine learning problem where the gold-standard data consists of sentences whose words have been manually annotated with parts of speech. For the present assignment you will be using a corpus built over the source material of the [English Web Treebank](https://catalog.ldc.upenn.edu/ldc2012t13), consisting of approximately 16,000&nbsp;sentences with 254,000&nbsp;tokens. The corpus has been released by the [Universal Dependencies Project](http://universaldependencies.org).

To make it easier to compare systems, the gold-standard data has been split into three parts: training, development (validation), and test. The following code uses three functions from the helper module `utils` (provided with this assignment) to load the data:

In [1]:
import utils
import keras
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Activation, InputLayer, Bidirectional, TimeDistributed
from keras.layers import Embedding
from keras.layers import LSTM
from keras.optimizers import Adam

training_data = list(utils.read_training_data())
print('Number of sentences in the training data: {}'.format(len(training_data)))

development_data = list(utils.read_development_data())
print('Number of sentences in the development data: {}'.format(len(development_data)))

test_data = list(utils.read_test_data())
print('Number of sentences in the test data: {}'.format(len(test_data)))

Using TensorFlow backend.


Number of sentences in the training data: 12543
Number of sentences in the development data: 2002
Number of sentences in the test data: 2077


From a Python perspective, each of the data sets is a list of what we shall refer to as *tagged sentences*. A tagged sentence, in turn, is a list of pairs $(w,t)$, where $w$ is a word token and $t$ is the word&rsquo;s POS tag. Both elements of a pair are byte strings. Here is an example from the training data to show you what this looks like:

In [2]:
training_data[42]

[(b'There', b'PRON'),
 (b'has', b'AUX'),
 (b'been', b'VERB'),
 (b'talk', b'NOUN'),
 (b'that', b'SCONJ'),
 (b'the', b'DET'),
 (b'night', b'NOUN'),
 (b'curfew', b'NOUN'),
 (b'might', b'AUX'),
 (b'be', b'AUX'),
 (b'implemented', b'VERB'),
 (b'again', b'ADV'),
 (b'.', b'PUNCT')]

You will see part-of-speech tags such as `VERB` for verb, `NOUN` for noun, and `ADV` for adverb. If you are interested in learning more about the tag set used in the gold-standard data, you can have a look at the documentation of the [Universal POS tags](http://universaldependencies.org/u/pos/all.html). However, you do not need to understand the meaning of the POS tags to solve this assignment; you can simply treat them as labels drawn from a finite set of alternatives.
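To get a feel for this data structure, the following sketch splits a tagged sentence into its words and tags and counts how often each tag occurs. The `toy_data` list is a hand-made stand-in (an assumption for illustration, not loaded via `utils`) that mimics the byte-string pairs shown above.

```python
from collections import Counter

# Hypothetical stand-in for a few tagged sentences; the real data comes from utils.
toy_data = [
    [(b'There', b'PRON'), (b'has', b'AUX'), (b'been', b'VERB'),
     (b'talk', b'NOUN'), (b'.', b'PUNCT')],
    [(b'Parker', b'PROPN'), (b'hates', b'VERB'), (b'parsnips', b'NOUN'),
     (b'.', b'PUNCT')],
]

# Split one tagged sentence into parallel tuples of words and tags.
words, tags = zip(*toy_data[0])

# Count how often each tag occurs across all sentences.
tag_counts = Counter(tag for sentence in toy_data for _, tag in sentence)

# Tokens and tags are byte strings; decode them for display.
print([w.decode('utf-8') for w in words])
print(tag_counts.most_common(3))
```

Running the same counting step on `training_data` would presumably show which tags dominate the corpus, which can be useful background when interpreting accuracy figures.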

## Problem specification

Your task in this assignment is to build a part-of-speech tagger based on a recurrent neural network architecture, to train this tagger on the provided training data, and to evaluate its performance on the test data. To tune the hyperparameters of the network, you can use the provided development (validation) data.

### Network architecture

The proposed network architecture for your tagger is a sequential model with three layers, illustrated below: an embedding, a bidirectional LSTM, and a softmax layer. The embedding turns word indexes (integers representing words) into fixed-size dense vectors which are then fed into the bidirectional LSTM. The output of the LSTM at each position of the sentence is passed to a softmax layer which predicts the POS tag for the word at that position.

![System architecture](architecture.png)
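The tensor shapes that flow through this architecture can be traced with a small NumPy sketch. It only illustrates the embedding lookup and the per-position softmax; the recurrent layer is deliberately replaced by a pass-through, and all sizes and random weights are toy assumptions rather than values from the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 2, 5          # sentences, padded sentence length (toy values)
V, d, T = 10, 4, 3   # vocabulary size, embedding size, number of tags (toy values)

X = rng.integers(1, V, size=(N, n))   # word indexes, shape (N, n)
E = rng.normal(size=(V, d))           # embedding matrix, one row per word index

embedded = E[X]                       # embedding lookup, shape (N, n, d)

# Stand-in for the recurrent layer: a real bidirectional LSTM would mix
# information across positions; here the embeddings pass through unchanged.
hidden = embedded

W = rng.normal(size=(d, T))           # softmax layer weights
logits = hidden @ W                   # shape (N, n, T)
logits -= logits.max(axis=-1, keepdims=True)   # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

predicted_tags = probs.argmax(axis=-1)   # one tag index per position, shape (N, n)
print(embedded.shape, probs.shape, predicted_tags.shape)
```

The point of the sketch is that the network produces one probability distribution over tags for every position of every sentence, which is why the gold-standard output needs a third dimension.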

To implement the network architecture, you will use [Keras](https://keras.io), a high-level neural network library for Python. Keras comes with extensive online documentation, and reading the relevant parts of this documentation will be essential when working on this assignment. We suggest starting with the tutorial [Getting started with the Keras Sequential model](https://keras.io/getting-started/sequential-model-guide/). We also suggest having a look at concrete examples, such as [imdb_lstm.py](https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py).

### Pre-processing the data

Before you can start to implement the network architecture as such, you will have to bring the tagged sentences from the gold-standard data into a form that can be used with the network. At its core, this involves encoding each word and each tag as an index into a finite set (a non-negative integer), which can be done for example via a Python dictionary. Here is some code to illustrate the basic idea:

In [4]:
# Construct simple indexes for words and tags

w2i = dict()
t2i = dict()
w2i[b'PAD123'] = 0    # reserved index for the padding symbol
w2i[b'UNW123'] = 1    # reserved index for the unknown word
t2i[b'PAD123'] = 0    # reserved index for the padding tag

for tagged_sentence in training_data:
    for word, tag in tagged_sentence:
        if word not in w2i:
            w2i[word] = len(w2i)    # assign next available index
        if tag not in t2i:
            t2i[tag] = len(t2i)    # assign next available index
# Note: both counts include the reserved entries added above
print('Number of unique words in the training data: {}'.format(len(w2i)))
print('Number of unique tags in the training data: {}'.format(len(t2i)))
print(t2i)

Number of unique words in the training data: 19674
Number of unique tags in the training data: 18
{b'PAD123': 0, b'PROPN': 1, b'PUNCT': 2, b'ADJ': 3, b'NOUN': 4, b'VERB': 5, b'DET': 6, b'ADP': 7, b'AUX': 8, b'PRON': 9, b'PART': 10, b'SCONJ': 11, b'NUM': 12, b'ADV': 13, b'CCONJ': 14, b'X': 15, b'INTJ': 16, b'SYM': 17}


Once you have indexes for the words and the tags, you can construct the input and the gold-standard output tensor required to train the network.

**Constructing the input tensor.** The input tensor should be of shape $(N, n)$ where $N$ is the total number of sentences in the training data and $n$ is the length of the longest sentence. Note that Keras requires all sequences in an input tensor to have the same length, which means that you will have to pad all sequences to that length. You can use the helper function `pad_sequences` for this, which by default will front-pad sequences with the value&nbsp;0. It is essential then that you do not use this special padding value as the index of actual words.
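To make the effect of front-padding concrete, here is a minimal NumPy sketch. The function `front_pad` is a hypothetical, hand-written stand-in used only for illustration; in the assignment itself you would call `pad_sequences` instead.

```python
import numpy as np

def front_pad(sequences, pad_value=0):
    """Left-pad lists of word indexes to the length of the longest one."""
    n = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), n), pad_value, dtype='int32')
    for row, seq in enumerate(sequences):
        if seq:  # an empty sequence stays all padding
            padded[row, -len(seq):] = seq
    return padded

# Three encoded sentences of different lengths (toy word indexes, none equal to 0)
encoded = [[5, 2, 9], [7], [3, 4]]
X = front_pad(encoded)
print(X)
```

Each row of `X` ends with the actual word indexes and starts with as many zeros as are needed to reach the common length, so the result has the shape $(N, n)$ described above.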

**Constructing the gold-standard output tensor.** The gold-standard output tensor should be of shape $(N, n, T)$ where $T$ is the number of unique tags in the training data, plus one to cater for the special padding value. The additional dimension corresponds to the fact that the softmax layer of the network will output one $T$-dimensional vector for each position of an input sentence. To construct the gold-standard version of this vector, you can use the helper function `to_categorical`, which will produce a &lsquo;one-hot vector&rsquo; for a given tag index.
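The step from tag indexes to one-hot vectors can be sketched in plain NumPy as follows. The helper `one_hot` is a hypothetical illustration of what the encoding produces, not the Keras function itself, and the tag indexes are toy values with 0 assumed to be the padding tag.

```python
import numpy as np

def one_hot(tag_ids, num_classes):
    """Turn an (N, n) array of tag indexes into an (N, n, T) one-hot tensor."""
    return np.eye(num_classes, dtype='float32')[tag_ids]

Y_ids = np.array([[0, 0, 2],
                  [1, 3, 2]])      # padded tag indexes, shape (N, n) = (2, 3)
Y = one_hot(Y_ids, num_classes=4)  # T = 3 real tags + 1 padding tag

print(Y.shape)
print(Y[1, 1])   # one-hot vector for tag index 3
```

Taking `Y.argmax(axis=-1)` recovers the original tag indexes, which is also how predicted tag distributions can be turned back into tags.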

### Constructing the network

To implement the network architecture, you need to find and instantiate the relevant building blocks from the Keras library. Note that Keras layers support a large number of optional parameters; use the default values unless you have a good reason not to. Two mandatory parameters that you will have to specify are the dimensionality of the embedding and the dimensionality of the output of the LSTM layer. The following values are reasonable starting points:

* dimensionality of the embedding: 100
* dimensionality of the output of the bidirectional LSTM layer: 100

You will also have to choose an appropriate loss function. For training we recommend the Adam optimiser.
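One loss that fits a softmax output with one-hot targets is categorical cross-entropy, the negative log-probability that the network assigns to the gold tag. The NumPy sketch below shows what such a loss computes when padded positions are left out; the function name, the toy probabilities, and the convention that tag index 0 marks padding are assumptions for illustration, and Keras may average over positions differently.

```python
import numpy as np

def masked_cross_entropy(probs, one_hot_gold, pad_index=0, eps=1e-12):
    """Mean negative log-likelihood of the gold tags, skipping padded positions."""
    # Probability assigned to the gold tag at each position, shape (N, n)
    gold_prob = (probs * one_hot_gold).sum(axis=-1)
    # Positions whose gold tag is the padding index do not count
    mask = one_hot_gold.argmax(axis=-1) != pad_index
    return -np.log(gold_prob[mask] + eps).mean()

# One sentence with three positions and three tags; the first position is padding
probs = np.array([[[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6]]])
gold = np.eye(3)[np.array([[0, 1, 2]])]

loss = masked_cross_entropy(probs, gold)
print(loss)
```

Only the second and third positions contribute here, so the loss is the average of $-\log 0.8$ and $-\log 0.6$.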

### Evaluation

The last problem that you will have to solve is to write code to evaluate the trained tagger. The most widely-used evaluation measure for part-of-speech tagging is per-word accuracy, which is the percentage of words to which the tagger assigns the correct tag (according to the gold standard). Implementing this metric should be straightforward. However, make sure that you remove (or ignore) the special padding value when you compute the tagging accuracy.
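A minimal sketch of per-word accuracy that ignores padding could look as follows. It assumes that gold and predicted tags are available as equally shaped arrays of tag indexes and that index 0 is the padding tag; the function name is hypothetical.

```python
import numpy as np

def tagging_accuracy(gold_ids, predicted_ids, pad_index=0):
    """Fraction of non-padding positions where the predicted tag equals the gold tag."""
    gold_ids = np.asarray(gold_ids)
    predicted_ids = np.asarray(predicted_ids)
    mask = gold_ids != pad_index          # True for real words only
    return float((gold_ids[mask] == predicted_ids[mask]).mean())

gold = [[0, 0, 4, 5],
        [1, 5, 4, 2]]
pred = [[3, 0, 4, 4],     # predictions at padded positions are ignored
        [1, 5, 4, 2]]

accuracy = tagging_accuracy(gold, pred)
print('{:.1%}'.format(accuracy))
```

In this toy example six positions hold real words and five of them are tagged correctly. Counting the two padded positions as well would change the result, which is why they have to be excluded.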

**The performance goal for this assignment is to build a tagger that achieves a development set accuracy of at least 90%.**

**Unknown words.** One problem that you will encounter during evaluation is that the development data contains words that you did not see (and did not add to your index) during training. The simplest solution to this problem is to reserve a special index for &lsquo;the unknown word&rsquo; which the network can use whenever it encounters an unknown word. When you go for this strategy, the size of your index will be equal to the number of unique words in the training data plus&nbsp;2 &ndash; one extra for the unknown word, and one for the padding symbol.
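The reserved-index strategy can be sketched with a plain dictionary. The names `PAD`, `UNK`, `build_index`, and `encode`, as well as the toy sentences, are assumptions made for this illustration.

```python
PAD, UNK = 0, 1   # reserved indexes (assumed convention for this sketch)

def build_index(sentences):
    """Map each training word to an integer, starting after the reserved indexes."""
    index = {}
    for sentence in sentences:
        for word in sentence:
            # Only assigns a new index if the word is not in the index yet
            index.setdefault(word, len(index) + 2)
    return index

def encode(sentence, index):
    """Replace words by their indexes; unseen words get the unknown-word index."""
    return [index.get(word, UNK) for word in sentence]

train_sentences = [[b'Parker', b'hates', b'parsnips'], [b'Parker', b'sleeps']]
word_index = build_index(train_sentences)
vocabulary_size = len(word_index) + 2   # unique words + padding + unknown word

encoded = encode([b'Parker', b'loves', b'parsnips'], word_index)
print(encoded, vocabulary_size)
```

The word *loves* does not occur in the toy training sentences, so it is mapped to the unknown-word index, while the vocabulary size passed to the embedding layer accounts for both reserved entries.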

## Skeleton code

The following skeleton code provides you with a starting point for your implementation:

In [13]:
class Tagger(object):

    def __init__(self):
        self.model = Sequential() 
        self.n = 1

    def train(self, training_data):
        # Pre-process the training data: every sentence is padded to the
        # length of the longest training sentence.
        self.n = len(max(training_data, key=len))
        N = len(training_data)
        X = np.zeros((N, self.n), dtype='int32')    # word indexes, shape (N, n)
        Y = np.zeros((N, self.n), dtype='int32')    # tag indexes, shape (N, n)

        for i, tagged_sentence in enumerate(training_data):
            word_ids, tag_ids = [], []
            for word, tag in tagged_sentence:
                # Words missing from the index map to the unknown-word entry
                word_ids.append(w2i.get(word, w2i[b'UNW123']))
                tag_ids.append(t2i[tag])
            # pad_sequences front-pads with 0 by default and returns shape (1, n)
            X[i, :] = sequence.pad_sequences([word_ids], maxlen=self.n)
            Y[i, :] = sequence.pad_sequences([tag_ids], maxlen=self.n)
        # One-hot encode the tags: (N, n) -> (N, n, T)
        Y = keras.utils.to_categorical(Y, num_classes=len(t2i))

        # Construct the network, add layers, compile, and fit
        self.model.add(InputLayer(input_shape=(self.n, )))
        # mask_zero=True marks index 0 as padding to be masked by later layers
        self.model.add(Embedding(len(w2i), 100, mask_zero=True))
        # 50 units per direction; the concatenated output has 100 dimensions
        self.model.add(Bidirectional(LSTM(50, return_sequences=True)))
        self.model.add(Dense(len(t2i), activation='softmax'))
        self.model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        self.model.summary()
        self.model.fit(X, Y, batch_size=32, epochs=4)

    def evaluate(self, gold_data):
        # Compute the accuracy of the tagger relative to the gold data
        word_id_seqs, tag_id_seqs = [], []
        for tagged_sentence in gold_data:
            word_ids, tag_ids = [], []
            for word, tag in tagged_sentence:
                # Words not seen during training map to the unknown-word entry
                word_ids.append(w2i.get(word, w2i[b'UNW123']))
                tag_ids.append(t2i[tag])
            word_id_seqs.append(word_ids)
            tag_id_seqs.append(tag_ids)
        # Pad to the training length n (pad_sequences truncates longer sentences)
        X = sequence.pad_sequences(word_id_seqs, maxlen=self.n)
        Y = sequence.pad_sequences(tag_id_seqs, maxlen=self.n)
        Y = keras.utils.to_categorical(Y, num_classes=len(t2i))
        # Returns [loss, accuracy], following the metrics passed to compile()
        return self.model.evaluate(X, Y)

And here is how the tagger is supposed to be used:

In [14]:
tagger = Tagger()
tagger.train(training_data)

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 159, 100)          1967400   
_________________________________________________________________
bidirectional_4 (Bidirection (None, 159, 100)          60400     
_________________________________________________________________
dense_4 (Dense)              (None, 159, 18)           1818      
Total params: 2,029,618
Trainable params: 2,029,618
Non-trainable params: 0
_________________________________________________________________


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [17]:
acc = tagger.evaluate(development_data)
print('Accuracy on development data: {:.1%}'.format(acc[1]))
acc_test = tagger.evaluate(test_data)
print('Accuracy on test data: {:.2%}'.format(acc_test[1]))

Accuracy on development data: 91.5%
Accuracy on test data: 91.58%


## Submission

Submit this assignment by emailing this notebook to Marco. Your notebook should include all your code, and should be runnable by Marco without further modification. It should also include a short text with your reflections on this assignment. What did you find particularly surprising or hard? What have you learned from this assignment? You can paste your text into the box below. Good luck!

### Reflections

Dear Marco,

This lab was much easier for me than lab 1, especially after your comment about the `mask_zero` option of the embedding layer, which made everything straightforward. Using Keras was much easier than implementing everything from scratch.

With the number of epochs set to 3, I tried different batch sizes. With a batch size of 100 I got:

Accuracy on development data: 70.83%

Accuracy on test data: 70.67%

When I decreased the batch size to 50, the accuracy improved to:

Accuracy on development data: 90.19%

Accuracy on test data: 90.59%

Finally, to get even better accuracy, I tried a batch size of 32:

Accuracy on development data: 90.53%

Accuracy on test data: 90.60%

Increasing the number of epochs to 4 (still with a batch size of 32) gave the following results:

Accuracy on development data: 91.5%

Accuracy on test data: 91.58%


Best Regards,
Amin