# Introduction
In `1. Part of Speech Tagging - First Atempt` I used a simple DNN to do Part of Speech (POS) tagging based on word embeddings created with Google's word2vec pretrained on their news data set, and some extra features.

In this notebook I will attempt a new approach: a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells.

Inspiration for this notebook was drawn from [Wang et al., 2015](https://arxiv.org/pdf/1510.06168.pdf).
Some notable differences:
* I am not using bidirectional layers, so each word can only influence the predictions for the words that come after it in the sentence, not those before.
* I am not using word embeddings trained by an LSTM network, which could potentially improve performance.

# The data
I will be using the same training data for my tagger as in `1. Part of Speech Tagging - First Atempt`:
[Universal Dependencies - English Web Treebank](http://universaldependencies.org/treebanks/en_ewt/index.html), a CoNLL-U format corpus with 254 830 words and 16 622 sentences in English *taken from various web media including weblogs, newsgroups, emails, reviews, and Yahoo! answers*.

## Load the Data
First let's load the training, development and test data and parse it into Python dictionaries.
I use the [conllu](https://github.com/EmilStenstrom/conllu) Python package to parse the CoNLL-U files.

In [32]:
import numpy as np
import conllu

Read the data.

In [2]:
directory = 'UD/UD_English-EWT'

with open('{}/en_ewt-ud-train.conllu'.format(directory), 'r', encoding='utf-8') as f:
    train_text = f.read()

with open('{}/en_ewt-ud-dev.conllu'.format(directory), 'r', encoding='utf-8') as f:
    dev_text = f.read()

with open('{}/en_ewt-ud-test.conllu'.format(directory), 'r', encoding='utf-8') as f:
    test_text = f.read()

Convert it to a dictionary.

In [3]:
train_dict = conllu.parse(train_text)
dev_dict = conllu.parse(dev_text)
test_dict = conllu.parse(test_text)
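
To make the parsed structure concrete, here is a minimal hand-rolled sketch of how one CoNLL-U sentence maps to word forms and universal POS tags. It is not how the `conllu` package is implemented; the helper name `parse_conllu_sentence` and the sample sentence are made up for illustration, and the sketch simply skips multiword-token ranges and empty nodes. Note that this notebook reads the tag from the key `upostag`, whereas the sketch names the column `upos`.

```python
# Column names defined by the CoNLL-U format (10 tab-separated fields per token).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def parse_conllu_sentence(block):
    """Parse one sentence block into a list of {'form', 'upos'} dicts."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):        # sentence-level comments / metadata
            continue
        fields = dict(zip(CONLLU_FIELDS, line.split("\t")))
        if not fields["id"].isdigit():  # skip ranges such as 1-2 and empty nodes such as 1.1
            continue
        tokens.append({"form": fields["form"], "upos": fields["upos"]})
    return tokens

# Made-up example sentence, not taken from the treebank.
sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)

tokens = parse_conllu_sentence(sample)
forms = [t["form"] for t in tokens]
tags = [t["upos"] for t in tokens]
print(list(zip(forms, tags)))
```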

Count sentences and tokens.

In [4]:
from functools import reduce

n_train_sentences = len(train_dict)
n_train_tokens = reduce(lambda x, y: x + len(y), train_dict, 0)

print("The training set contains {} sentences and {} tokens".format(n_train_sentences, n_train_tokens))

The training set contains 12543 sentences and 204607 tokens


In [7]:
train_sentences = [[token['form'] for token in sentence] for sentence in train_dict]
train_labels = [[token['upostag'] for token in sentence] for sentence in train_dict]

dev_sentences = [[token['form'] for token in sentence] for sentence in dev_dict]
dev_labels = [[token['upostag'] for token in sentence] for sentence in dev_dict]

test_sentences = [[token['form'] for token in sentence] for sentence in test_dict]
test_labels = [[token['upostag'] for token in sentence] for sentence in test_dict]

In [26]:
# Sort the tags so that the tag-to-index mapping is the same on every run
pos_tags = sorted(set(reduce(lambda x, y: x + y, train_labels)))

In [29]:
len(pos_tags)

17

In [30]:
pos_idx = dict(zip(pos_tags, np.arange(len(pos_tags))))

In [35]:
pos_encoding = {}
for pos, i in pos_idx.items():
    pos_encoding[pos] = np.zeros(len(pos_tags))
    pos_encoding[pos][i]=1
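
The one-hot encoding built above can be illustrated on a toy tag set. This is only a sketch: the three tags and the names `toy_tags`, `toy_idx` and `toy_encoding` are assumptions for the example, whereas the notebook derives its 17 tags from the training labels.

```python
import numpy as np

# Toy tag set, sorted so the index of each tag is deterministic.
toy_tags = sorted({"NOUN", "VERB", "DET"})
toy_idx = {tag: i for i, tag in enumerate(toy_tags)}

# Row i of the identity matrix is the one-hot vector for tag i.
identity = np.eye(len(toy_tags))
toy_encoding = {tag: identity[i] for tag, i in toy_idx.items()}

for tag in toy_tags:
    print(tag, toy_encoding[tag])
```

Each vector has a single 1 at the position of its tag, which is the target format that `categorical_crossentropy` expects.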

# Feature Engineering
Wang et al. showed that a bidirectional LSTM network could achieve state-of-the-art performance without using any morphological features. They only used these features:
* Word embedding of the word (cast to lower case)
* Suffix of length two, one-hot encoded
* Whether the word is all caps, lower case, or has an initial capital letter, one-hot encoded

I will opt to only use word embeddings as features initially. Also, since I will be using Google's pre-trained word2vec model, which did not cast words to lower case, I see no point in doing this conversion.
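
For reference, the two extra features from the list above could be sketched as follows. This is not code from Wang et al. or from this notebook: the function names, the additional `other` class for words that fit none of the three categories, and the choice to lower-case the suffix are all assumptions made for the example.

```python
def capitalization_class(word):
    """Return a coarse capitalization class (the 'other' class is an assumption)."""
    if word.isupper():
        return "all_caps"
    if word.islower():
        return "lower"
    if word[:1].isupper():
        return "initial_cap"
    return "other"  # digits, punctuation, mixed case such as "iPhone"

def suffix2(word):
    """Lower-cased suffix of length two (shorter words are returned whole)."""
    return word.lower()[-2:]

def one_hot(value, vocabulary):
    """One-hot encode `value` against an ordered vocabulary list."""
    return [1 if value == item else 0 for item in vocabulary]

CAP_CLASSES = ["all_caps", "lower", "initial_cap", "other"]

for word in ["NASA", "running", "Baqubah", "42"]:
    print(word, suffix2(word), one_hot(capitalization_class(word), CAP_CLASSES))
```

In a real feature pipeline the suffix vocabulary would be collected from the training set before one-hot encoding it in the same way.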

## Word2Vec
Let's start by converting our tokens into word embeddings.

In [5]:
from gensim.models import Word2Vec

from gensim.models import KeyedVectors



In [8]:
news_w2v = KeyedVectors.load_word2vec_format('word2vec/GoogleNews-vectors-negative300.bin', binary=True)

In [12]:
np.random.uniform(-.1, .1, 10)

array([-0.04110769,  0.03943264, -0.0495911 ,  0.08454352, -0.09290703,
        0.08739691, -0.0877254 , -0.07737273, -0.0381479 , -0.04204748])

In [17]:
def encode_word(word):
    try:
        return news_w2v[word]
    except KeyError:
        # As per Wang et al. I initialise unknown words from a uniform distribution ranging from -.1 to .1.
        # Note: re-seeding on every call means that all unknown words share the same vector.
        np.random.seed(0)
        return np.random.uniform(-.1, .1, 300)
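
Because `encode_word` calls `np.random.seed(0)` on every miss, every unknown word ends up with the same vector, which effectively acts as one shared UNK embedding. If a distinct but stable vector per unknown word were wanted instead, one option is to seed a generator once and cache the vectors. The sketch below is an alternative under that assumption, not the notebook's method; `encode_unknown` and `unknown_vectors` are hypothetical names.

```python
import numpy as np

EMBEDDING_DIM = 300
rng = np.random.default_rng(0)   # one generator, seeded once
unknown_vectors = {}             # cache: word -> its random vector

def encode_unknown(word):
    """Return a stable random vector in [-0.1, 0.1) for an out-of-vocabulary word."""
    if word not in unknown_vectors:
        unknown_vectors[word] = rng.uniform(-0.1, 0.1, EMBEDDING_DIM)
    return unknown_vectors[word]

a1 = encode_unknown("Baqubah")
a2 = encode_unknown("Baqubah")   # same word: served from the cache
b = encode_unknown("zzyzx")      # different word: freshly drawn vector
print(np.array_equal(a1, a2), np.array_equal(a1, b))
```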

In [19]:
features = encode_word(train_sentences[0][0])

In [20]:
features.shape

(300,)

In [127]:
def get_feature_vec(X):
    return np.array([np.array([encode_word(word) for word in sentence]) for sentence in X])
def get_label_vec(y):
    return np.array([np.array([pos_encoding[tok] for tok in sentence]) for sentence in y])

In [128]:
X = get_feature_vec(train_sentences)
y = get_label_vec(train_labels)

X_val = get_feature_vec(dev_sentences)
y_val = get_label_vec(dev_labels)

X_test = get_feature_vec(test_sentences)
y_test = get_label_vec(test_labels)

# RNN LSTM

## Model One

In [243]:
from keras.layers import LSTM, Dense, Dropout, BatchNormalization
from keras.models import Sequential

In [110]:
n_labels, n_features = len(pos_tags), 300

In [281]:
def get_model():
    model = Sequential()
    model.add(BatchNormalization(input_shape=(None, n_features)))
    model.add(LSTM(units=100, return_sequences=True))
    model.add(Dropout(rate=.1))
    model.add(BatchNormalization())
    model.add(LSTM(units=100, return_sequences=True))
    model.add(Dropout(rate=.1))
    model.add(BatchNormalization())
    model.add(Dense(units=n_labels, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [266]:
from keras.utils import Sequence
class GenerateData(Sequence):
    def __init__(self, x_set, y_set):
        self.x, self.y = x_set, y_set
    def __len__(self):
        return len(self.x)
    def __getitem__(self, idx):
        batch_x = self.x[idx]
        batch_y = self.y[idx]
        return batch_x.reshape(1, batch_x.shape[0], batch_x.shape[1]), batch_y.reshape(1, batch_y.shape[0], batch_y.shape[1])

In [274]:
class GeneratePaddedData(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))
    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        
        max_len = max(map(lambda x: x.shape[0], batch_x))
        batch_x = np.array(list(map(lambda x: self.pad_input(x, max_len - len(x)), batch_x)))
        batch_y = np.array(list(map(lambda x: self.pad_labels(x, max_len - len(x)), batch_y)))
        
        return batch_x, batch_y
    
    def pad_input(self, seq, pads):
        # Pad with all-zero vectors (these are not UNK embeddings)
        return np.append(seq, np.zeros((pads, seq.shape[1])), axis=0)
    def pad_labels(self, seq, pads):
        # Padded time steps get an all-zero label vector
        return np.append(seq, np.zeros((pads, seq.shape[1])), axis=0)
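
Padding every batch to its longest sentence can add a lot of empty time steps when sentence lengths vary widely. The sketch below estimates that share for a list of sentence lengths; `padding_fraction` is a hypothetical helper and the lengths are made-up values, not statistics from the treebank.

```python
def padding_fraction(lengths, batch_size):
    """Share of time steps that are padding when each batch is padded to its longest sentence."""
    padded_total, real_total = 0, 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padded_total += max(batch) * len(batch)  # every sentence is stretched to the batch maximum
        real_total += sum(batch)                 # time steps that hold real tokens
    return 1 - real_total / padded_total

# Made-up sentence lengths: a single long sentence dominates its batch.
toy_lengths = [3, 20, 5, 100, 7, 8]
print(padding_fraction(toy_lengths, batch_size=3))
```

With these toy lengths more than half of all time steps are padding, which is the effect that limiting the sentence length is meant to reduce.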

In [282]:
model = get_model()

In [278]:
%time model.fit_generator(generator = GeneratePaddedData(X, y, batch_size=64), epochs=2, validation_data=GenerateData(X_val, y_val))

Epoch 1/2
Epoch 2/2
Wall time: 5min 42s


<keras.callbacks.History at 0x2d83e3420b8>

In [285]:
model = get_model()

In [286]:
%time model.fit_generator(generator = GenerateData(X, y), epochs=1, validation_data=GenerateData(X_val, y_val))

Epoch 1/1
Wall time: 15min 41s


<keras.callbacks.History at 0x2d84ac82e48>

### Model One Summary
So far I have constructed an RNN with LSTM cells and trained it on batches of padded data.
Padding the data significantly sped up training compared to training on single-sentence batches (roughly 3 minutes per epoch versus more than 15), but probably has negative effects on accuracy.

I did not train for many epochs due to long training times, so it's hard to tell if I would get better results with some patience. But it really feels like I need to find a better solution for padding.

I will try to mitigate the effects of padding by restricting my model to train on short sentences, no longer than 20 tokens.
I realise this is a significant restriction, but it avoids the problem of short sentences being padded to 100 tokens. It feels like a good idea to restrict my problem in order to get more experience.
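
One way padding can distort the reported numbers is through the accuracy metric itself. The sketch below assumes accuracy is computed by comparing the argmax of the label vector with the argmax of the prediction at every time step, which is my understanding of how categorical accuracy works; the function `token_accuracy` and the toy arrays are made up for illustration. A padded time step has an all-zero label, so its argmax is class 0 regardless of what the token was.

```python
import numpy as np

def token_accuracy(y_true, y_pred, ignore_padding):
    """Accuracy from argmax comparison; optionally skip all-zero label rows."""
    true_ids = y_true.argmax(axis=-1)
    pred_ids = y_pred.argmax(axis=-1)
    correct = true_ids == pred_ids
    if ignore_padding:
        mask = y_true.sum(axis=-1) > 0   # real tokens have exactly one 1 in their label
        return correct[mask].mean()
    return correct.mean()

# One toy sentence: 2 real tokens padded to length 5, with 3 classes.
y_true = np.zeros((1, 5, 3))
y_true[0, 0, 1] = 1
y_true[0, 1, 2] = 1

# Hypothetical predictions: both real tokens correct, padding predicted as class 2.
y_pred = np.zeros((1, 5, 3))
y_pred[0, 0, 1] = 1
y_pred[0, 1:, 2] = 1

print(token_accuracy(y_true, y_pred, ignore_padding=False))
print(token_accuracy(y_true, y_pred, ignore_padding=True))
```

In this toy case the tagger is perfect on the real tokens, yet the unmasked figure is much lower because the three padded positions count as errors.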

## Model Two

### Create short padded sentences

In [301]:
def get_short_instances(sentences, labels, max_len=20):
    
    # Find the indexes of all short sentences
    short_sentences_mask = np.array([len(sentence) <= max_len for sentence in sentences])
    short_sentences_idx = np.where(short_sentences_mask)
    
    short_sentences = np.array(sentences)[short_sentences_idx]
    short_labels = np.array(labels)[short_sentences_idx]
    
    return get_feature_vec(short_sentences), get_label_vec(short_labels)

In [302]:
X_short, y_short = get_short_instances(train_sentences, train_labels)

In [303]:
X_val_short, y_val_short = get_short_instances(dev_sentences, dev_labels)

In [304]:
X_short.shape

(8791,)

In [305]:
X_val_short.shape

(1630,)

I'm down to 8791 training sentences and 1630 validation sentences. This time let's store the padded data to avoid having to re-pad the sentences every epoch.
Also, let's pad them all to be exactly 20 tokens.

In [316]:
def pad_data(X, y, n_features, output_dim, max_seq_len=20):
    X_padded = np.zeros((len(X), max_seq_len, n_features), dtype='float32')
    y_padded = np.zeros((len(y), max_seq_len, output_dim), dtype='float32')
    
    for i, (instance, labels) in enumerate(zip(X, y)):
        X_padded[i,:len(instance)] = instance
        y_padded[i, :len(labels)] = labels
    return X_padded, y_padded

In [317]:
X_short_padded, y_short_padded = pad_data(X_short, y_short, n_features, n_labels)

In [319]:
X_short_padded.shape

(8791, 20, 300)

### Train and evaluate the model
I will try to not pad the validation data.

In [323]:
model2 = get_model()

In [327]:
%time model2.fit(X_short_padded, y_short_padded, batch_size=64, epochs=2)

Epoch 1/2
Epoch 2/2
Wall time: 1min 30s


<keras.callbacks.History at 0x2d84ff5b208>

I significantly reduced training time!

In [328]:
model2.evaluate_generator(GenerateData(X_val_short, y_val_short))

[0.62488046909476758, 0.80963499824419338]

Sweet, accuracy on the validation set is about 81% after 2 epochs, which is a similar result to my previous attempt with padding.
Let's train the model a little longer and see how it performs.

In [329]:
%time model2.fit(X_short_padded, y_short_padded, batch_size=64, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Wall time: 17min 23s


<keras.callbacks.History at 0x2d852af2128>

In [335]:
model2.evaluate_generator(GenerateData(X_val_short, y_val_short))

[0.94173224774462994, 0.81284289407126742]

In [331]:
model2.evaluate_generator(GenerateData(X_val, y_val))

[0.97171295310731165, 0.81377732010556392]

In [363]:
model2.evaluate_generator(GenerateData(X_short, y_short))

[0.12118941652556567, 0.95545017976576985]

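Okay, so no improvement in validation accuracy from further training (81.3% versus 81.0% on the short sentences), even though training accuracy reached 95.5% and validation loss rose from 0.62 to 0.94. That's a bummer.
On the other hand, it's nice to see that I get the same accuracy on the full validation set as I do on the short sentences.

Anyway, let's have a look at the predictions of the model just to see what they look like.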

In [348]:
predictions = []
for sentence in X_val:
    predictions.append(model2.predict_classes(sentence.reshape(1, sentence.shape[0], sentence.shape[1]))[0])

In [351]:
predictions = [[pos_tags[i] for i in sentence] for sentence in predictions]
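
As far as I know, `predict_classes` is no longer available in recent Keras releases. The same class indices can be obtained by taking the argmax over the tag axis of the predicted probabilities. The sketch below shows the decoding step on made-up softmax output; `toy_tags` and `probabilities` are illustrative values, not output from the model above.

```python
import numpy as np

toy_tags = ["ADP", "DET", "NOUN"]   # illustrative tag order, not the notebook's

# Made-up softmax output for one 3-token sentence: shape (1, tokens, tags).
probabilities = np.array([[[0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.2, 0.1, 0.7]]])

class_ids = probabilities.argmax(axis=-1)[0]   # most probable tag index per token
predicted_tags = [toy_tags[i] for i in class_ids]
print(predicted_tags)
```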

In [359]:
import pandas as pd

In [362]:
pd.DataFrame(data = list(zip(dev_sentences[10], predictions[10], dev_labels[10])), columns = ['Token', 'Prediction', 'Target'])

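| | Token | Prediction | Target |
|---|---|---|---|
| 0 | In | ADP | ADP |
| 1 | the | DET | DET |
| 2 | eastern | ADJ | ADJ |
| 3 | city | NOUN | NOUN |
| 4 | of | CCONJ | ADP |
| 5 | Baqubah | PROPN | PROPN |
| 6 | , | PUNCT | PUNCT |
| 7 | guerrillas | NOUN | NOUN |
| 8 | detonated | VERB | VERB |
| 9 | a | PUNCT | DET |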


Looks like what you would expect: a pretty decent tagger making mistakes in about 20% of cases.
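
The 20% figure can be checked directly against the example sentence. The snippet below copies the rows from the table above and counts the mismatches; it is plain arithmetic on the displayed output, not a re-run of the model.

```python
# (token, prediction, target) rows copied from the table above.
rows = [
    ("In", "ADP", "ADP"), ("the", "DET", "DET"), ("eastern", "ADJ", "ADJ"),
    ("city", "NOUN", "NOUN"), ("of", "CCONJ", "ADP"), ("Baqubah", "PROPN", "PROPN"),
    (",", "PUNCT", "PUNCT"), ("guerrillas", "NOUN", "NOUN"),
    ("detonated", "VERB", "VERB"), ("a", "PUNCT", "DET"),
]

errors = [(tok, pred, gold) for tok, pred, gold in rows if pred != gold]
accuracy = 1 - len(errors) / len(rows)

print("accuracy: {:.0%}".format(accuracy))
for tok, pred, gold in errors:
    print("{!r}: predicted {}, expected {}".format(tok, pred, gold))
```

Both errors are on very frequent function words ("of" and "a").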

Let's try using just one LSTM layer, but with more units.

## Model Three

In [366]:
model3 = Sequential()
model3.add(BatchNormalization(input_shape=(None, n_features)))
model3.add(LSTM(units=512, return_sequences=True))
model3.add(Dropout(rate=.1))
model3.add(Dense(units=n_labels, activation='softmax'))

model3.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


In [367]:
%time model3.fit(X_short_padded, y_short_padded, batch_size=64, epochs=2)

Epoch 1/2
Epoch 2/2
Wall time: 6min 36s


<keras.callbacks.History at 0x2d85369aa90>

In [368]:
model3.evaluate_generator(GenerateData(X_val_short, y_val_short))

[0.56462470099781326, 0.82756802132981688]

In [369]:
model3.evaluate_generator(GenerateData(X_val, y_val))

[0.55866566814410568, 0.82720946210560264]

In [370]:
model3.evaluate_generator(GenerateData(X_short, y_short))

[0.35940897561025592, 0.88475357780696828]

Similar validation accuracy to the previous models, but a longer training time than the previous one. Let's evaluate again after two more epochs.

In [371]:
%time model3.fit(X_short_padded, y_short_padded, batch_size=64, epochs=2)

Epoch 1/2
Epoch 2/2
Wall time: 6min 22s


<keras.callbacks.History at 0x2d84d9a1940>

In [372]:
model3.evaluate_generator(GenerateData(X_val, y_val))

[0.58038230885717357, 0.82080563112021565]

In [373]:
model3.evaluate_generator(GenerateData(X_short, y_short))

[0.2106831872414589, 0.93159210102717604]

Validation accuracy did not increase with two more epochs, while training accuracy rose from 88% to 93%.
Right now I am at a loss on how to calibrate my network to generalize better.
I think I would like to learn more about RNNs by trying out some more projects, and then get back to doing POS tagging with LSTM.
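
To put numbers on the generalization problem, the gap between training and validation accuracy can be computed from the results reported above. The values below are rounded from the `evaluate_generator` outputs in this notebook (training accuracy on the short sentences, validation accuracy on the full validation set); the dictionary layout is just for illustration.

```python
# (training accuracy on short sentences, accuracy on the full validation set),
# rounded from the evaluate_generator outputs reported above.
results = {
    "Model Two, 22 epochs": (0.9555, 0.8138),
    "Model Three, 2 epochs": (0.8848, 0.8272),
    "Model Three, 4 epochs": (0.9316, 0.8208),
}

gaps = {name: train - val for name, (train, val) in results.items()}
for name, gap in gaps.items():
    print("{}: gap of {:.1%}".format(name, gap))
```

The gap grows with the number of epochs while validation accuracy stays flat, which is consistent with the networks overfitting the training sentences rather than lacking training time.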

# Summary
In this notebook I have trained RNNs on variable length inputs by using padding and single instance batches.

I did not achieve any good results on the validation set; my best result was an accuracy of about 82.7%.

Training RNNs takes a lot more time than training regular DNNs, so I did not give my networks many epochs to train. However, none of the networks showed any signs of improving validation accuracy with more training.

Right now I feel like the main hurdle for me in improving my model is that I lack experience using RNNs.
I would like to gain some more experience, perhaps by mimicking other people's models.