# Introduction
In `1. Part of Speech Tagging - First Atempt` I used a simple DNN to do Part of Speech (POS) tagging based on word embeddings created with Google's word2vec pretrained on their news data set, and some extra features.

In this notebook I will attempt a new approach: I will use a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells.

Inspiration for this notebook was drawn from [Wang et al., 2015](https://arxiv.org/pdf/1510.06168.pdf).
Some notable differences:
* I am not using Bidirectional layers, meaning that a word can only influence the words that come after it in the sentence, not those before.
* I am not using word embeddings trained by an LSTM network, which could potentially improve performance.

# The data
I will be using the same training data for my tagger as in `1. Part of Speech Tagging - First Atempt`:
[Universal Dependencies - English Web Treebank](http://universaldependencies.org/treebanks/en_ewt/index.html), a CoNLL-U format corpus with 254 830 words and 16 622 sentences in English *taken from various web media including weblogs, newsgroups, emails, reviews, and Yahoo! answers*.

## Load the Data
First let's load the training data and convert it to a Python dictionary and a pandas data frame.
I use the [conllu](https://github.com/EmilStenstrom/conllu) Python package to parse the CoNLL-U files to dictionaries.

In [32]:
import numpy as np
import conllu

Read the data.

In [2]:
directory = 'UD/UD_English-EWT'
with open('{}/en_ewt-ud-train.conllu'.format(directory), 'r', encoding='utf-8') as f:
    train_text = f.read()

with open('{}/en_ewt-ud-dev.conllu'.format(directory), 'r', encoding='utf-8') as f:
    dev_text = f.read()

with open('{}/en_ewt-ud-test.conllu'.format(directory), 'r', encoding='utf-8') as f:
    test_text = f.read()

Parse it into a list of sentences, where each sentence is a list of token dictionaries.

In [3]:
train_dict = conllu.parse(train_text)
dev_dict = conllu.parse(dev_text)
test_dict = conllu.parse(test_text)
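
To make the structure of the parsed data concrete, here is a minimal hand-rolled sketch of what happens to a single CoNLL-U sentence. It is not the `conllu` package's implementation: the helper `parse_conllu_sentence` and the two-token sample are hypothetical, and only the `form` and `upostag` keys used later in this notebook are kept.

```python
# Hypothetical two-token sentence in CoNLL-U format:
# one token per line, ten tab-separated columns per token.
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
)

def parse_conllu_sentence(block):
    """Return a list of token dicts for one sentence (illustrative only)."""
    tokens = []
    for line in block.splitlines():
        # Skip blank lines and comment lines such as '# text = ...'
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # The second column is the word form, the fourth the universal POS tag.
        tokens.append({"form": columns[1], "upostag": columns[3]})
    return tokens

sentence = parse_conllu_sentence(SAMPLE)
forms = [token["form"] for token in sentence]
tags = [token["upostag"] for token in sentence]
```

The two list comprehensions at the end mirror how sentences and labels are extracted from the parsed data further down.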

Count sentences and tokens.

In [4]:
from functools import reduce

n_train_sentences = len(train_dict)
n_train_tokens = reduce(lambda x, y: x + len(y), train_dict, 0)

print("The training set contains {} sentences and {} tokens".format(n_train_sentences, n_train_tokens))

The training set contains 12543 sentences and 204607 tokens


In [7]:
train_sentences = [[token['form'] for token in sentence] for sentence in train_dict]
train_labels = [[token['upostag'] for token in sentence] for sentence in train_dict]

dev_sentences = [[token['form'] for token in sentence] for sentence in dev_dict]
dev_labels = [[token['upostag'] for token in sentence] for sentence in dev_dict]

test_sentences = [[token['form'] for token in sentence] for sentence in test_dict]
test_labels = [[token['upostag'] for token in sentence] for sentence in test_dict]

In [26]:
# Sort the tags so that the tag-to-index mapping is the same on every run
pos_tags = sorted(set(reduce(lambda x, y: x + y, train_labels)))

In [29]:
len(pos_tags)

17

In [30]:
pos_idx = dict(zip(pos_tags, np.arange(len(pos_tags))))

In [35]:
pos_encoding = {}
for pos, i in pos_idx.items():
    pos_encoding[pos] = np.zeros(len(pos_tags))
    pos_encoding[pos][i]=1
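
The loop above builds one one-hot vector per tag. As a small sketch of the same idea, the rows of an identity matrix can serve as the encodings, and `argmax` recovers the tag index. The three-tag list below is a hypothetical stand-in for the 17 tags found in the training data.

```python
import numpy as np

# Hypothetical, shortened tag list; the notebook's real list has 17 entries.
example_tags = ["NOUN", "VERB", "ADJ"]
example_idx = {tag: i for i, tag in enumerate(example_tags)}

# Row i of the identity matrix is the one-hot vector for tag i.
identity = np.eye(len(example_tags))
example_encoding = {tag: identity[i] for tag, i in example_idx.items()}

# Decoding: the position of the 1 gives back the tag.
decoded = example_tags[int(np.argmax(example_encoding["VERB"]))]
```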

# Feature Engineering
Wang et al. showed that a bidirectional LSTM network could achieve state-of-the-art performance without using any morphological features. They only used these features:
* Word embedding of the word (cast to lower case)
* Suffix of length two, one-hot encoded
* Whether the word is all caps, lower case, or has an initial capital letter. One-hot encoded.

I will opt to only use word embeddings as features initially. Also, as I will be using Google's pre-trained word2vec model, which did not cast words to lower case, I see no point in doing this conversion.
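
For reference, the capitalisation feature from the list above could be sketched as follows. This is an illustrative encoding, not code from Wang et al.: the function name `capitalisation_feature`, the category order, and the extra "other" bucket for mixed-case or non-alphabetic tokens are all assumptions.

```python
import numpy as np

# Assumed category order: all caps, all lower case, initial capital, anything else.
CAP_CATEGORIES = ("upper", "lower", "title", "other")

def capitalisation_feature(word):
    """One-hot encode the capitalisation pattern of a word (illustrative)."""
    if word.isupper():
        category = "upper"
    elif word.islower():
        category = "lower"
    elif word[:1].isupper() and word[1:].islower():
        category = "title"
    else:
        category = "other"
    vec = np.zeros(len(CAP_CATEGORIES))
    vec[CAP_CATEGORIES.index(category)] = 1
    return vec
```

Note that a single capital letter such as "I" lands in the "upper" bucket here. Such a vector could be concatenated with the 300-dimensional embedding if I decide to add this feature later.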

## Word2Vec
Let's start by converting our tokens into word embeddings.

In [5]:
from gensim.models import Word2Vec

from gensim.models import KeyedVectors



In [8]:
news_w2v = KeyedVectors.load_word2vec_format('word2vec/GoogleNews-vectors-negative300.bin', binary=True)

In [12]:
np.random.uniform(-.1, .1, 10)

array([-0.04110769,  0.03943264, -0.0495911 ,  0.08454352, -0.09290703,
        0.08739691, -0.0877254 , -0.07737273, -0.0381479 , -0.04204748])

In [17]:
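# As per Wang et al. I initialise unknown words from a uniform distribution ranging from -.1 to .1.
# The fixed seed gives every unknown word the same vector; a local RandomState
# avoids reseeding NumPy's global generator on every lookup.
UNKNOWN_VECTOR = np.random.RandomState(0).uniform(-.1, .1, 300)

def encode_word(word):
    try:
        return news_w2v[word]
    except KeyError:
        return UNKNOWN_VECTOR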

In [19]:
features = encode_word(train_sentences[0][0])

In [20]:
features.shape

(300,)

In [127]:
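def get_feature_vec(X):
    # Sentences differ in length, so keep a list of (n_tokens, 300) arrays
    # rather than forcing them into a ragged NumPy array.
    return [np.array([encode_word(word) for word in sentence]) for sentence in X]
def get_label_vec(y):
    # One (n_tokens, n_labels) array of one-hot rows per sentence
    return [np.array([pos_encoding[tok] for tok in sentence]) for sentence in y]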

In [128]:
X = get_feature_vec(train_sentences)
y = get_label_vec(train_labels)

X_val = get_feature_vec(dev_sentences)
y_val = get_label_vec(dev_labels)

X_test = get_feature_vec(test_sentences)
y_test = get_label_vec(test_labels)

# RNN LSTM

In [243]:
from keras.layers import LSTM, Dense, Dropout, BatchNormalization
from keras.models import Sequential

In [110]:
n_labels, n_features = len(pos_tags), 300

In [248]:
model = Sequential()
model.add(BatchNormalization(input_shape=(None, n_features)))
model.add(LSTM(units=100, return_sequences=True))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())
model.add(LSTM(units=100, return_sequences=True))
model.add(Dropout(rate=.1))
model.add(BatchNormalization())
model.add(Dense(units=n_labels, activation='softmax'))

In [249]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [266]:
from keras.utils import Sequence
class GenerateData(Sequence):
    def __init__(self, x_set, y_set):
        self.x, self.y = x_set, y_set
    def __len__(self):
        return len(self.x)
    def __getitem__(self, idx):
        batch_x = self.x[idx]
        batch_y = self.y[idx]
        return batch_x.reshape(1, batch_x.shape[0], batch_x.shape[1]), batch_y.reshape(1, batch_y.shape[0], batch_y.shape[1])

In [None]:
from keras.preprocessing.sequence import pad_sequences
class GeneratePaddedData(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))
    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        
        max_len = max(map(lambda x: x.shape[0], batch_x))
        batch_x = np.array(list(map(lambda x: self.pad_input(x, max_len - len(x)), batch_x)))
        batch_y = np.array(list(map(lambda x: self.pad_labels(x, max_len - len(x)), batch_y)))
        
        return batch_x, batch_y
        #return batch_x.reshape(1, batch_x.shape[0], batch_x.shape[1]), batch_y.reshape(1, batch_y.shape[0], batch_x.shape[1])
    
    def pad_input(self, seq, pads):
        
        return np.append(seq, np.random.uniform(-.1, .1, pads * 300).reshape(pads, 300), axis=0)
    def pad_labels(self, seq, pads):
        return np.append(seq, np.zeros(17*pads).reshape(pads, 17), axis=0)
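
The generator above pads inputs with random noise and labels with zeros. A common alternative, sketched here with toy data, is to pad with zeros and also return a boolean mask marking the real timesteps, so that padded positions can be ignored later. The helper name `pad_batch` and the toy shapes are hypothetical and not part of the model trained in this notebook.

```python
import numpy as np

def pad_batch(sequences):
    """Zero-pad a list of (length, n_features) arrays to a common length.

    Returns the padded batch and a boolean mask that is True on real timesteps.
    """
    max_len = max(seq.shape[0] for seq in sequences)
    n_features = sequences[0].shape[1]
    batch = np.zeros((len(sequences), max_len, n_features))
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        batch[i, :seq.shape[0]] = seq
        mask[i, :seq.shape[0]] = True
    return batch, mask

# Toy batch: two "sentences" of length 2 and 4 with 3 features per token.
toy = [np.ones((2, 3)), np.ones((4, 3))]
padded, mask = pad_batch(toy)
```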

In [250]:
model.fit_generator(generator = GeneratePaddedData(X, y, batch_size=64), epochs=2, validation_data=GeneratePaddedData(X_val, y_val, batch_size=64))

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x2d83c4d1e10>

Evaluate with padded data.

In [263]:
model.evaluate_generator(generator = GeneratePaddedData(X_val, y_val, batch_size=64))

[0.1525469797474521, 0.29196878282221167]

Evaluate on batches of size 1.

In [267]:
model.evaluate_generator(generator = GenerateData(X_val, y_val))

[0.61217464409400146, 0.81220571657309637]

29% accuracy on the padded validation set and 81% when using batches of size 1. I suspect padding negatively affects the way I evaluate my model and, as a result, the way it can adapt during backpropagation.
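
One way to check this suspicion is to recompute accuracy over real tokens only. The sketch below assumes padded label rows are all zeros, as produced by `pad_labels`, and uses toy arrays rather than model output; `masked_accuracy` and `unmasked_accuracy` are hypothetical helpers. Because `argmax` of an all-zero row is 0, an argmax-based accuracy counts a padded step as correct only when the prediction happens to be class 0, which can pull the score down sharply.

```python
import numpy as np

def masked_accuracy(y_true, y_pred):
    """Accuracy over timesteps whose one-hot label row is not all zeros."""
    real = y_true.sum(axis=-1) > 0  # True on real tokens, False on padding
    hits = y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)
    return hits[real].mean()

def unmasked_accuracy(y_true, y_pred):
    """Accuracy over every timestep, padding included."""
    return (y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)).mean()

# Toy example: one sentence, 2 real tokens followed by 2 padded steps, 3 classes.
y_true = np.array([[[0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]]], dtype=float)
# Predictions are correct on both real tokens and predict class 1 on the padding.
y_pred = np.array([[[.1, .8, .1], [.2, .1, .7], [.1, .8, .1], [.1, .8, .1]]])

acc_masked = masked_accuracy(y_true, y_pred)      # 2 of 2 real tokens correct
acc_unmasked = unmasked_accuracy(y_true, y_pred)  # 2 of 4 timesteps correct
```

In this toy case the tagger is perfect on the real tokens, yet the unmasked score is only half of the masked one.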

# Summary
So far I have constructed an RNN with LSTM cells and trained it on batches of padded data.
Padding the data significantly sped up training compared to training on batches with single sentences, but probably has negative effects on accuracy.

I did not train for many epochs due to long training times, so it's hard to tell if I would get better results with some patience. But it really feels like I need to find a better solution for padding.

I will need to read up on best practices for handling variable length input in RNNs.