### 3- Gated Recurrent Unit (GRU) with Keras: POS(Part-of-Speech) tagging 
Keras provides a GRU implementation, that we will use here to build a network that does POS
tagging. A **POS** is a grammatical category of words that are used in the same way across multiple
sentences. **Examples of POS** are nouns, verbs, adjectives, and so on. For example, **nouns** are typically
used to identify things, **verbs** are typically used to identify what they do, and **adjectives** to describe
some attribute of these things. POS tagging used to be done manually, but nowadays this is done
**automatically using statistical models**. In recent years, deep learning has been applied to this problem
as well (Natural Language Processing from Scratch, by R. Collobert, Journal of Machine Learning Research, Pp. 2493-2537, 2011).

**Dataset: The Penn Treebank**

For our training data, we will need sentences tagged with part of speech tags. The Penn Treebank (http
s://catalog.ldc.upenn.edu/ldc99t42) is one such dataset, it is a human annotated corpus of about 4.5 million
words of American English. However, it is a non-free resource. A 10% sample of the Penn Treebank
is freely available as part of the NLTK (http://www.nltk.org/), which we will use to train our network.
Our model will take in a sequence of words in a sentence and output the corresponding POS tags for
each word. Thus for an input sequence consisting of the words [The, cat, sat, on, the, mat, .], the
output sequence emitted would be the POS symbols [DT, NN, VB, IN, DT, NN].

In [58]:
import numpy as np
np.random.seed(42) # setting seed before importing from keras
from keras.layers.core import Activation, Dense, Dropout, RepeatVector, SpatialDropout1D
from keras.layers.embeddings import Embedding, Bidirectional
from keras.layers.recurrent import GRU, LSTM
from keras.layers.wrappers import TimeDistributed
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
import collections
import nltk

import os
import codecs

**Read Data**

In [46]:
DATA_DIR = "./data"

with open(os.path.join(DATA_DIR, "treebank_sents.txt"), "w") as fedata, \
        open(os.path.join(DATA_DIR, "treebank_poss.txt"), "w") as ffdata:
    sents = nltk.corpus.treebank.tagged_sents()
    for sent in sents:
        words, poss = [], []
        for word, pos in sent:
            if pos == "-NONE-":
                continue
            words.append(word)
            poss.append(pos)
        fedata.write("{:s}\n".format(" ".join(words)))
        ffdata.write("{:s}\n".format(" ".join(poss)))

**Explore the data to find out what vocabulary size to set**

We need to find: 
- the number of unique words in each vocabulary (in file of words and in file of tags);
- the maximum number of words in a sentence in our training corpus; 
- the number of records. 

Because of the one-to-one nature of POS tagging, the last two values are identical for both vocabularies.

In [47]:
def parse_sentences(filename):
    word_freqs = collections.Counter()
    num_recs, maxlen = 0, 0
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().lower().split()
            for word in words:
                word_freqs[word] += 1
            maxlen = max(maxlen, len(words))
            num_recs += 1
    return word_freqs, maxlen, num_recs

In [48]:
s_wordfreqs, s_maxlen, s_numrecs = \
    parse_sentences(os.path.join(DATA_DIR, "treebank_sents.txt"))
t_wordfreqs, t_maxlen, t_numrecs = \
    parse_sentences(os.path.join(DATA_DIR, "treebank_poss.txt"))
print("# records: {:d}".format(s_numrecs))
print("# unique words: {:d}".format(len(s_wordfreqs)))
print("# unique POS tags: {:d}".format(len(t_wordfreqs)))
print("# words/sentence: max: {:d}".format(s_maxlen))

# records: 3914
# unique words: 10947
# unique POS tags: 45
# words/sentence: max: 249


We can observe that:
- 10947 unique words; 
- 45 unique POS tags;
- The maximum sentence size is 249; 
- The number of sentences is 249. 

Using this information, we decide to consider only the top 5000 words for our source vocabulary. Our target vocabulary has 45 unique POS tags, we want to be able to predict all of them, so we will consider all of them in our
vocabulary. Finally, we set 250 to be our maximum sequence length:

In [14]:
MAX_SEQLEN = 250
S_MAX_FEATURES = 5000
T_MAX_FEATURES = 150

**Building Lookup Tables**

Just like our sentiment analysis example:
- each row of the input will be represented as a sequence of word indices.
- the corresponding output will be a sequence of POS tag indices. 

So we need to build lookup tables to translate between the words/POS tags and their corresponding indices. 
On the source side, we build a vocabulary index with two extra slots to hold the PAD
and UNK pseudo-words. On the target side, we don't drop any words so there is no need for the UNK
pseudo-word

In [51]:
s_vocabsize = min(len(s_wordfreqs), S_MAX_FEATURES) + 2
s_word2index = {x[0]: i+2 for i, x in
                enumerate(s_wordfreqs.most_common(S_MAX_FEATURES))}
s_word2index["PAD"] = 0
s_word2index["UNK"] = 1
s_index2word = {v: k for k, v in s_word2index.items()}

t_vocabsize = len(t_wordfreqs) + 1
t_word2index = {x[0]: i for i, x in
                enumerate(t_wordfreqs.most_common(T_MAX_FEATURES))}
t_word2index["PAD"] = 0
t_index2word = {v: k for k, v in t_word2index.items()}

**Building Dataset to our Network**

We will use these lookup tables to convert our input sentences into a word ID sequence of length MAX_SEQLEN (250). 

The labels need to be structured as a sequence of one-hot vectors of size T_MAX_FEATURES + 1 (151), also of length MAX_SEQLEN (250).

The build_tensor function reads the data from the two files and converts them to the input and output
tensors. Additional default parameters are passed in to build the output tensor. This triggers the call to
np_utils.to_categorical() to convert the output sequence of POS tag IDs to one-hot vector representation:

In [52]:
def build_tensor(filename, numrecs, word2index, maxlen):
    data = np.empty((numrecs, ), dtype=list)
    with open(filename, "r") as fin:
        for i, line in enumerate(fin):
            wids = []
            for word in line.strip().lower().split():
                if word in word2index:
                    wids.append(word2index[word])
                else:
                    wids.append(word2index['UNK'])
            data[i] = wids
    pdata = sequence.pad_sequences(data, maxlen=maxlen)
    return pdata

X = build_tensor(os.path.join(DATA_DIR, "treebank_sents.txt"),s_numrecs, s_word2index, MAX_SEQLEN)
Y = build_tensor(os.path.join(DATA_DIR, "treebank_poss.txt"),t_numrecs, t_word2index, MAX_SEQLEN)
Y = np.array([np_utils.to_categorical(d, t_vocabsize) for d in Y])

**Spliting the data:**

Training:80 / Test:20

In [53]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=42)

**Schematic of the Network** 
<img src="gru_pos.JPG">

As previously, assuming that 
- 1: the batch size is as yet undetermined
- 2: the input to the network is a tensor of word IDs of shape (None, MAX_SEQLEN, 1). 
- 3: the input is sent through an embedding layer, which converts each word into a dense vector of shape (EMBED_SIZE)
- 4: so, the output tensor from this layer has the shape(None, MAX_SEQLEN, EMBED_SIZE). 
- 5: the output tensor is fed to the encoder GRU with an output size of HIDDEN_SIZE. The GRU is set to return a single context vector (return_sequences=False) after seeing a sequence of size MAX_SEQLEN, so the output tensor from the GRU layer has shape (None, HIDDEN_SIZE).
- 6: This context vector is then replicated using the RepeatVector layer into a tensor of shape (None, MAX_SEQLEN, HIDDEN_SIZE) and fed into the decoder GRU layer. 
- 7: This is then fed into a dense layer which produces an output tensor of shape (None, MAX_SEQLEN, t_vocab_size). The activation function on the dense layer is a softmax. The argmax of each column of this tensor is the index of the predicted POS tag for the word at that position.

The model definition is shown as follows: EMBED_SIZE, HIDDEN_SIZE, BATCH_SIZE, and NUM_EPOCHS are
hyperparameters which have been assigned these values after experimenting with multiple different
values.

The model is compiled with the categorical_crossentropy loss function since we have multiple
categories of labels, and the optimizer used is the popular adam optimizer:

In [55]:
# Hyperparameters of the model
EMBED_SIZE = 128
HIDDEN_SIZE = 64
BATCH_SIZE = 32
NUM_EPOCHS = 1

model = Sequential()

model.add(Embedding(s_vocabsize,
                    EMBED_SIZE,
                    input_length=MAX_SEQLEN))
model.add(Dropout(0.2))

model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))

model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))

model.add(Activation("softmax"))

model.compile(loss="categorical_crossentropy", 
              optimizer="adam",
              metrics=["accuracy"])

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


We train this model for a single epoch. The model is very rich, with many parameters, and begins to
overfit after the first epoch of training. When fed the same data multiple times in the next epochs, the
model begins to overfit to the training data and does worse on the validation data

In [56]:
model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,
          epochs=NUM_EPOCHS, validation_data=[Xtest, Ytest])
score, acc = model.evaluate(Xtest, Ytest, batch_size=BATCH_SIZE)
print("Test score: {:.3f}, accuracy: {:.3f}".format(score, acc))

Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train on 3131 samples, validate on 783 samples
Epoch 1/1
Test score: 0.545, accuracy: 0.916


**Using LSTM and campare the results with GRU**

In [59]:
model = Sequential()

model.add(Embedding(s_vocabsize, EMBED_SIZE, input_length=MAX_SEQLEN))
model.add(Dropout(0.2))

model.add(LSTM(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))

model.add(LSTM(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))

model.add(Activation("softmax"))

model.compile(loss="categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,
          epochs=NUM_EPOCHS, validation_data=[Xtest, Ytest])

score, acc = model.evaluate(Xtest, Ytest, batch_size=BATCH_SIZE)
print("Test score: {:.3f}, accuracy: {:.3f}".format(score, acc))

Train on 3131 samples, validate on 783 samples
Epoch 1/1
Test score: 0.548, accuracy: 0.916


As you can see from the output, the results of the GRU-based network are quite comparable to our
previous LSTM-based network.

**Sequence-to-sequence models**
- most used in **machine translation**
- **entity recognition**: J. Hammerton, 2003 _Named Entity Recognition with Long Short Term Memory_, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, Association for Computational Linguistics
- **sentence parsing**: O. Vinyals, 2015, Grammar as a Foreign Language, Advances in Neural Information Processing Systems.
- **image captioning**: A. Karpathy, and F. Li, 2015, _Deep Visual-Semantic Alignments for Generating Image Descriptions_, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

### Bidirectional RNNs

At a given time step t, the output of the RNN is dependent on the outputs at all previous time steps.
However, it is entirely possible that the output is also dependent on the future outputs as well. This is
especially true for applications such as NLP, where the attributes **of the word or phrase we are trying
to predict may be dependent on the context given by the entire enclosing sentence, not just the words
that came before it.** Bidirectional RNNs also help a network architecture **place equal emphasis on the
beginning and end of the sequence, and increase the data available for training.**

Bidirectional RNNs **are two RNNs stacked on top of each other, reading the input in opposite
directions.** So in our example, one RNN will read the words left to right and the other RNN will read
the words right to left. The output at each time step will be based on the hidden state of both RNNs.

Keras provides support for bidirectional RNNs through a bidirectional wrapper layer. For example,
for our POS tagging example, we could **make our LSTMs bidirectional simply by wrapping them with
this Bidirectional wrapper**, as shown in the model definition code as follows:

In [61]:
model = Sequential()

model.add(Embedding(s_vocabsize, EMBED_SIZE, input_length=MAX_SEQLEN))
model.add(Dropout(0.2))

model.add(Bidirectional(LSTM(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2)))
model.add(RepeatVector(MAX_SEQLEN))

model.add(Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True)))
model.add(TimeDistributed(Dense(t_vocabsize)))

model.add(Activation("softmax"))

model.compile(loss="categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,
          epochs=NUM_EPOCHS, validation_data=[Xtest, Ytest])

score, acc = model.evaluate(Xtest, Ytest, batch_size=BATCH_SIZE)
print("Test score: {:.3f}, accuracy: {:.3f}".format(score, acc))

Train on 3131 samples, validate on 783 samples
Epoch 1/1
Test score: 0.443, accuracy: 0.916
