# Sequence Labeling with Deep Neural Networks
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

*Recommended Reading*:
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)
- François Chollet (2017) Deep Learning with Python

*Notebook Covers Material of*:
- [SLP](https://web.stanford.edu/~jurafsky/slp3/9.pdf) Chapter 9: Deep Learning Architectures for Sequence Processing

__Requirements__

- [keras](https://keras.io/)
- [tensorflow](https://www.tensorflow.org/)

## Recurrent Neural Networks with Keras

To train an RNN with Keras the following steps are required:

- transformation of the input into tensors
    - preparing the data
- building a model using layers
- compiling the model, specifying
    - loss function
    - optimizer for gradient descent
    - evaluation metric

## Preparing Corpus for Keras

### Loading the Corpus in CoNLL Format

In [1]:
# to import conll
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

from conll import evaluate, read_corpus_conll

In [2]:
trn = read_corpus_conll('../data/NL2SPARQL4NLU/train.txt')
tst = read_corpus_conll('../data/NL2SPARQL4NLU/test.txt')

In [3]:
print(trn[0])

[('who', 'O'), ('plays', 'O'), ('luke', 'B-character.name'), ('on', 'O'), ('star', 'B-movie.name'), ('wars', 'I-movie.name'), ('new', 'I-movie.name'), ('hope', 'I-movie.name')]


### Feature Extraction
Keras, like most machine learning libraries, works with numeric vectors (i.e. it requires vectorization of the data).
We need to:
- get vocabulary of words
- get vocabulary of tags
- create mappings for:
    - word to index
    - index to word
    - tag to index
    - index to tag

Let's define a helper function to get vocabulary from our corpus.

In [4]:
# function to get vocabulary
# idx specifies the index of the column to get (0: words, 1: tags)
def get_vocab(data, idx=None):
    idx = 0 if idx is None else idx
    vocab = set()
    for sent in data:
        for tok in sent:
            vocab.add(tok[idx])
    return sorted(list(vocab))        

In [5]:
words = get_vocab(trn)
labels = get_vocab(trn, idx=-1)

print(len(labels))
print(len(words))

41
1728


#### Creating Index Mappings

By default, a `keras` vocabulary reserves 2 entries, with indices `0` and `1`, for the PADDING token and the UNKNOWN token (i.e. OOV). We need those for words. For tags, we use the `O` tag as padding.

In [6]:
# initial will take a dict mapping for keras default entries
def create_idx(vocab, initial=None):
    idx = {} if initial is None else initial
    inc = len(idx)
    tmp = {w: i + inc for i, w in enumerate(vocab)}
    idx.update(tmp)
    return idx

In [7]:
# word index
word2idx = create_idx(words, initial={"<PAD>":0, "<UNK>":1})
print(word2idx["<UNK>"])
print(word2idx["<PAD>"])

1
0


In [8]:
# label index
label2idx = create_idx(labels)
print(label2idx)

{'B-actor.name': 0, 'B-actor.nationality': 1, 'B-actor.type': 2, 'B-award.category': 3, 'B-award.ceremony': 4, 'B-character.name': 5, 'B-country.name': 6, 'B-director.name': 7, 'B-director.nationality': 8, 'B-movie.description': 9, 'B-movie.genre': 10, 'B-movie.gross_revenue': 11, 'B-movie.language': 12, 'B-movie.location': 13, 'B-movie.name': 14, 'B-movie.release_date': 15, 'B-movie.release_region': 16, 'B-movie.star_rating': 17, 'B-movie.subject': 18, 'B-person.name': 19, 'B-person.nationality': 20, 'B-producer.name': 21, 'B-rating.name': 22, 'I-actor.name': 23, 'I-actor.nationality': 24, 'I-award.ceremony': 25, 'I-character.name': 26, 'I-country.name': 27, 'I-director.name': 28, 'I-movie.genre': 29, 'I-movie.gross_revenue': 30, 'I-movie.language': 31, 'I-movie.location': 32, 'I-movie.name': 33, 'I-movie.release_date': 34, 'I-movie.release_region': 35, 'I-movie.subject': 36, 'I-person.name': 37, 'I-producer.name': 38, 'I-rating.name': 39, 'O': 40}


In [9]:
# index to label
idx2label = {v: k for k, v in label2idx.items()}
print(idx2label[40])

O


In [10]:
# index to word
idx2word = {v: k for k, v in word2idx.items()}
print(idx2word[0])
print(idx2word[1])

<PAD>
<UNK>


### Vectorization of Data
- `keras` accepts integer vectorization, where words are replaced by their indices.
- for batch processing the input sequences must be of the same length; thus we need to identify max sequence length we want to handle (the longest sequence in training)
- all the data needs to be padded to this max length (or truncated)

In [11]:
# vectorization of data
x_trn_int = [[word2idx[w] for w, t in s] for s in trn]
print("Textual: {}".format(list(map(lambda x: x[0], trn[0]))))
print("Encoded: {}".format(x_trn_int[0]))

Textual: ['who', 'plays', 'luke', 'on', 'star', 'wars', 'new', 'hope']
Encoded: [1678, 1149, 911, 1066, 1457, 1650, 1026, 734]


In [12]:
# let's take the length of the longest training sentence, but at least 25
# check that no test sentence is longer; otherwise we would have to deal with evaluation of truncated data
max_len = max(max(map(len, x_trn_int)), 25)
print(max_len)

25


#### Padding & Truncating
(from documentation)
- `pad_sequences` function transforms a list (of length `num_samples`) of sequences (lists of integers) into a 2D Numpy array of shape (`num_samples`, `num_timesteps`). `num_timesteps` is either the `maxlen` argument if provided, or the length of the longest sequence in the list.

- Sequences that are shorter than `num_timesteps` are padded with `value` until they are `num_timesteps` long.

- Sequences longer than `num_timesteps` are truncated so that they fit the desired length.

- The position where padding or truncation happens is determined by the arguments `padding` and `truncating`, respectively. Pre-padding or removing values from the beginning of the sequence is the default.
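
The padding and truncation rules above can be sketched in plain Python. The helper `pad_sequence` below is a hypothetical stand-in written for illustration, not the Keras implementation; it handles a single sequence and mimics only the `maxlen`, `padding`, `truncating`, and `value` arguments.

```python
def pad_sequence(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one list of integers to exactly `maxlen` items."""
    if len(seq) > maxlen:
        # "pre" drops items from the beginning, "post" drops them from the end
        seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    # "pre" puts the padding before the sequence, "post" puts it after
    return fill + seq if padding == "pre" else seq + fill


print(pad_sequence([5, 6, 7], maxlen=5))                     # [0, 0, 5, 6, 7]
print(pad_sequence([5, 6, 7], maxlen=5, padding="post"))     # [5, 6, 7, 0, 0]
print(pad_sequence([5, 6, 7], maxlen=2))                     # [6, 7]
print(pad_sequence([5, 6, 7], maxlen=2, truncating="post"))  # [5, 6]
```

With post-padding the real tokens stay at the start of each row, which makes it easy to cut the padding off again after prediction.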

In [13]:
# let's pad the sentences to max length
from keras.preprocessing.sequence import pad_sequences

x_trn_pad = pad_sequences(maxlen=max_len, sequences=x_trn_int, padding="post", value=word2idx['<PAD>'])

# value is 0, since it is the <PAD>'s index
print(x_trn_pad[0])

[1678 1149  911 1066 1457 1650 1026  734    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0]


#### Vectorization of Labels

- let's vectorize labels the same way as words & pad them
- additionally, we need to do one-hot encoding
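
One-hot encoding turns each label index into a vector that is zero everywhere except at that index. A minimal pure-Python sketch of the idea follows; `one_hot` is a hypothetical helper and the label ids are toy values, while the notebook itself relies on `to_categorical`.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


# toy example: 3 label ids, 4 possible labels
label_ids = [3, 0, 3]
encoded = [one_hot(i, num_classes=4) for i in label_ids]
print(encoded)
```

Each sentence therefore becomes a matrix with one row per position and one column per label.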

In [14]:
y_trn_int = [[label2idx[t] for w, t in s] for s in trn]
y_trn_pad = pad_sequences(maxlen=max_len, sequences=y_trn_int, padding="post", value=label2idx['O'])

# 40 is the id of 'O'
print("Textual: {}".format(list(map(lambda x: x[1], trn[0]))))
print("Encoded & Padded: {}".format(y_trn_pad[0]))

Textual: ['O', 'O', 'B-character.name', 'O', 'B-movie.name', 'I-movie.name', 'I-movie.name', 'I-movie.name']
Encoded & Padded: [40 40  5 40 14 33 33 33 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
 40]


In [15]:
from keras.utils import to_categorical

y_trn_ohv = [to_categorical(i, num_classes=len(labels)) for i in y_trn_pad]

print(y_trn_ohv[0].shape)

(25, 41)


In [16]:
# converting data to numpy array
import numpy as np

x_trn = np.array(x_trn_pad)
y_trn = np.array(y_trn_ohv)
print(x_trn.shape)
print(y_trn.shape)

(3338, 25)
(3338, 25, 41)


### Preparing Test Set
- the test set is created the same way, except that we need to handle:
    - UNKNOWN words
    - MISSING labels (just for safety)

In [17]:
# replace words not in training with <UNK>
x_tst_int = [[word2idx.get(w, word2idx.get('<UNK>')) for w, t in s] for s in tst]
x_tst_pad = pad_sequences(maxlen=max_len, sequences=x_tst_int, padding="post", value=word2idx['<PAD>'])

In [18]:
# replace tags not in training with 'O'
y_tst_int = [[label2idx.get(t, label2idx.get('O')) for w, t in s] for s in tst]
y_tst_pad = pad_sequences(maxlen=max_len, sequences=y_tst_int, padding="post", value=label2idx['O'])
y_tst_ohv = [to_categorical(i, num_classes=len(labels)) for i in y_tst_pad]

In [19]:
# converting to numpy arrays
x_tst = np.array(x_tst_pad)
y_tst = np.array(y_tst_ohv)

In [20]:
print(x_tst.shape)
print(y_tst.shape)

(1084, 25)
(1084, 25, 41)


## Creating a Model

A model consists of a set of layers that transform the input and predict the output.

`keras` provides many built-in [layers](https://keras.io/api/layers/).

Below are lists of the __Core__ and __Recurrent__ layers that are important for us.
The layers are briefly described, together with a few important arguments.

Please consult the documentation for full descriptions.

### [Core Layers](https://keras.io/api/layers/core_layers/)

- [`Input()`](https://keras.io/api/layers/core_layers/input/) is used to instantiate a Keras tensor. 
    - `shape=(N,)` argument indicates that the expected input will be batches of N-dimensional vectors. 
    
- [`Embedding()`](https://keras.io/api/layers/core_layers/embedding/) layer can only be used as the first layer in a model. Turns positive integers (indexes) into dense vectors of fixed size. (Embedding layer for word embeddings.)
    - `input_dim`: Size of the vocabulary
    - `output_dim`: Dimension of the dense embedding.
    - `input_length`: Length of input sequences, when it is constant. Required if you are going to connect `Flatten` and then `Dense` layers upstream.

- [`Dense()`](https://keras.io/api/layers/core_layers/dense/) is a regular densely-connected NN layer that computes `output = activation(dot(input, kernel) + bias)`
    - `units`: dimensionality of the output space.
    - `activation`: Activation function to use. (default: "linear" activation: a(x) = x).
    
- [`Activation()`](https://keras.io/api/layers/core_layers/activation/) layer that applies an activation function to an output.
    - `activation`: Activation function (e.g. tf.nn.relu), or string name of __built-in__ activation function (e.g. "relu").
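
To make the two central core layers concrete, here is a pure-Python sketch of what they compute. All weights are made-up toy values and the function names are hypothetical; Keras performs the same kind of operations on tensors.

```python
import math


def embedding_lookup(table, indices):
    """Embedding: replace every integer index by its row in the table."""
    return [table[i] for i in indices]


def dense(x, kernel, bias, activation=lambda z: z):
    """Dense: output = activation(dot(input, kernel) + bias)."""
    z = [sum(x[i] * kernel[i][j] for i in range(len(x))) + bias[j]
         for j in range(len(bias))]
    return activation(z)


def softmax(z):
    """Turn scores into probabilities that sum to 1."""
    top = max(z)  # subtracted for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# toy embedding table: 3 vocabulary entries, 2 dimensions each
table = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
vectors = embedding_lookup(table, [2, 1])

# toy Dense layer: 2 inputs -> 3 units, applied to every position
kernel = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
probs = [dense(v, kernel, bias, activation=softmax) for v in vectors]

print(vectors)
print(probs)
```

Applying the same `dense` function to every position of the sequence is the idea behind wrapping a `Dense` layer in `TimeDistributed`.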

### [Recurrent Layers](https://keras.io/api/layers/recurrent_layers/)

[Guide on working with RNN](https://keras.io/guides/working_with_rnns/).

- [`SimpleRNN()`](https://keras.io/api/layers/recurrent_layers/simple_rnn/) is a fully-connected RNN where the output is to be fed back to input.
    - `units`: dimensionality of the output space.
    - `activation`: activation function to use. Default: hyperbolic tangent (`tanh`). 
    - `dropout`: Fraction of the units to drop for the linear transformation of the inputs. 
    - `recurrent_dropout`: Fraction of the units to drop for the linear transformation of the recurrent state. 
    - `return_sequences`: Boolean. __Whether to return the last output in the output sequence, or the full sequence__. Default: False.

- [`LSTM()`](https://keras.io/api/layers/recurrent_layers/lstm/) Long Short-Term Memory
    - same as `SimpleRNN`, plus:
    - `recurrent_activation`: Activation function to use for the recurrent step. Default: sigmoid (`sigmoid`)
    
- [`GRU()`](https://keras.io/api/layers/recurrent_layers/gru/) Gated Recurrent Unit 
    - same as `LSTM`
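
The effect of `return_sequences` is easiest to see on a tiny example. The sketch below is a single-unit RNN with scalar, made-up weights (the function `simple_rnn` and its parameters are hypothetical, not the Keras layer); it computes `h_t = tanh(w_in * x_t + w_rec * h_{t-1} + bias)` at every time step.

```python
import math


def simple_rnn(inputs, w_in, w_rec, bias, return_sequences=False):
    """Single-unit RNN over a list of scalar inputs."""
    h = 0.0  # initial hidden state
    outputs = []
    for x in inputs:
        # the previous output is fed back in together with the new input
        h = math.tanh(w_in * x + w_rec * h + bias)
        outputs.append(h)
    # full sequence (one output per step) or only the last output
    return outputs if return_sequences else outputs[-1]


xs = [1.0, 0.5, -1.0]
full = simple_rnn(xs, w_in=0.5, w_rec=0.8, bias=0.0, return_sequences=True)
last = simple_rnn(xs, w_in=0.5, w_rec=0.8, bias=0.0)

print(len(full))          # 3, one output per time step
print(last == full[-1])   # True
```

Sequence labeling needs one prediction per token, so the recurrent layers in this notebook are created with `return_sequences=True`.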

#### Wrapper Layers for RNN

- [`Bidirectional()`](https://keras.io/api/layers/recurrent_layers/bidirectional/) makes RNN bidirectional
    - `layer`: keras.layers.RNN instance, such as keras.layers.LSTM or keras.layers.GRU. 
    - `merge_mode`: Mode by which outputs of the forward and backward RNNs will be combined. One of {'sum', 'mul', 'concat', 'ave', None}. If None, the outputs will not be combined, they will be returned as a list. Default value is 'concat'.
    - `backward_layer`: Optional layer to handle the backward pass. If it is not provided, it is generated automatically from `layer`.

- [`TimeDistributed()`](https://keras.io/api/layers/recurrent_layers/time_distributed/) applies a layer to every temporal slice of an input (e.g. `softmax` to the output of every time step).
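
The `merge_mode` options can be illustrated with plain lists. In this sketch the function `merge` is hypothetical, the values are toy numbers, and the backward outputs are assumed to be already re-aligned with the forward time order.

```python
def merge(forward, backward, merge_mode="concat"):
    """Combine per-step outputs of a forward and a backward RNN."""
    if merge_mode is None:
        return [forward, backward]  # not combined, returned as a list
    merged = []
    for f, b in zip(forward, backward):
        if merge_mode == "concat":
            merged.append(f + b)  # list concatenation: sizes add up
        elif merge_mode == "sum":
            merged.append([x + y for x, y in zip(f, b)])
        elif merge_mode == "mul":
            merged.append([x * y for x, y in zip(f, b)])
        elif merge_mode == "ave":
            merged.append([(x + y) / 2 for x, y in zip(f, b)])
        else:
            raise ValueError("unknown merge_mode: {}".format(merge_mode))
    return merged


# two time steps, 2 units per direction
fwd = [[1.0, 2.0], [3.0, 4.0]]
bwd = [[10.0, 20.0], [30.0, 40.0]]

print(merge(fwd, bwd))         # 4 values per step
print(merge(fwd, bwd, "sum"))  # 2 values per step
print(merge(fwd, bwd, "ave"))  # 2 values per step
```

With the default `'concat'`, the output size per time step is twice the number of units of the wrapped layer.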

### [Loss Functions](https://keras.io/api/losses/)
- Probabilistic cross-entropy losses compute the crossentropy loss between the labels and the predictions
    - `binary_crossentropy`: Use this cross-entropy loss when there are only two label classes (assumed to be 0 and 1)
    - `categorical_crossentropy`: Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided in a one_hot representation.
    - `sparse_categorical_crossentropy`: Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided as integers. 
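
The categorical and sparse variants compute the same quantity and differ only in how the reference label is represented. A minimal sketch for a single prediction, with hypothetical function names and toy probabilities, leaving out details such as batching:

```python
import math


def categorical_crossentropy(one_hot_true, predicted):
    """Loss for a one-hot target: -sum(y_true * log(y_pred))."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_true, predicted) if t > 0)


def sparse_categorical_crossentropy(true_index, predicted):
    """The same loss, with the target given as an integer class index."""
    return -math.log(predicted[true_index])


predicted = [0.1, 0.7, 0.2]  # softmax output over 3 classes
loss_one_hot = categorical_crossentropy([0.0, 1.0, 0.0], predicted)
loss_sparse = sparse_categorical_crossentropy(1, predicted)

print(loss_one_hot, loss_sparse)
```

Both calls return `-log(0.7)`; the sparse variant simply avoids building the one-hot vectors.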

### [Optimizers](https://keras.io/api/optimizers/)
One of the two required arguments for model compilation. Available optimizers are:
- `SGD`: Gradient descent (with momentum) optimizer 
- `Adam`: Optimizer that implements the Adam algorithm. Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.
- Others (read documentation):
    - RMSprop
    - Adadelta
    - Adagrad
    - Adamax
    - Nadam
    - Ftrl
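
To illustrate what "adaptive estimation of first-order and second-order moments" means for Adam, here is a sketch of its update rule for a single parameter. This is an illustration of the published algorithm under simple assumptions (one scalar parameter, a toy objective, a hypothetical function name), not the Keras implementation.

```python
import math


def adam_minimize(grad_fn, x, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function of one variable, given its gradient function."""
    m, v = 0.0, 0.0  # first and second moment estimates
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g      # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# minimize f(x) = x^2, whose gradient is 2x, starting from x = 5
one_step = adam_minimize(lambda x: 2 * x, x=5.0, steps=1)
many_steps = adam_minimize(lambda x: 2 * x, x=5.0, steps=200)

print(one_step)
print(many_steps)
```

Because the gradient is divided by the square root of its second moment, the first step has a size of about `lr` regardless of how large the gradient is.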

### [Metrics](https://keras.io/api/metrics/)
A metric is a function that is used to judge the performance of your model.

Metric functions are similar to loss functions, except that the results from evaluating a metric are not used when training the model. Note that you may use any loss function as a metric.

Available accuracy metrics are:
- accuracy
- binary accuracy
- categorical accuracy

There are many other metrics that are available.

## Building a Simple RNN Model

Let's build a simple RNN for our data.

There are several ways to build a model in Keras.

- The __Sequential model__, which is very straightforward (a simple list of layers), but is limited to single-input, single-output stacks of layers (as the name gives away).

- The __Functional API__, which is an easy-to-use, fully-featured API that supports arbitrary model architectures. For most people and most use cases, this is what you should be using. This is the Keras "industry strength" model.

- __Model subclassing__, where you implement everything from scratch on your own. Use this if you have complex, out-of-the-box research use cases.

We are going to use the __Functional API__ approach.

The main idea is that a deep learning model is usually a directed acyclic graph (DAG) of layers. So the functional API is a way to build graphs of layers.

### Functional API Model Building
To build a model using the functional API:

- start by creating an input node: `inputs = Input(shape=(input_vector_size,))`
- create a new node in the graph of layers by calling a layer on this inputs object: `model = layers.Dense(...)(inputs)`
- create outputs from a layer: `outputs = layers.Dense(10)(model)`
- create a model specifying its inputs and outputs in the graph of layers: `model = keras.Model(inputs=inputs, outputs=outputs, name="mnist_model")`

- model can be inspected using `model.summary()`

In [21]:
from keras.models import Model
from keras.layers import Input, Embedding, TimeDistributed, Dense, SimpleRNN

# input layer
inputs = Input(shape=(max_len,))

# embedding layer; don't forget we added 'UNK' and 'PAD'
# mask_zero=True tells the model that sequence is padded and it should ignore it
model = Embedding(input_dim=len(words)+2, output_dim=50, input_length=max_len, mask_zero=True)(inputs)

# RNN layer
model = SimpleRNN(units=100, return_sequences=True)(model)

# softmax output layer
outputs = TimeDistributed(Dense(len(labels), activation="softmax"))(model)

# defining model
model = Model(inputs, outputs, name="simple_rnn")

In [22]:
# let's inspect the model
model.summary()

Model: "simple_rnn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 25)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 25, 50)            86500     
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 25, 100)           15100     
_________________________________________________________________
time_distributed (TimeDistri (None, 25, 41)            4141      
Total params: 105,741
Trainable params: 105,741
Non-trainable params: 0
_________________________________________________________________
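
The parameter counts in the summary can be verified by hand. The arithmetic below uses the sizes from this notebook (1728 training words plus 2 reserved entries, 50-dimensional embeddings, 100 recurrent units, 41 labels); the variable names are only for illustration.

```python
vocab_size = 1728 + 2  # training words + <PAD> and <UNK>
emb_dim = 50
rnn_units = 100
num_labels = 41

# Embedding: one vector of size emb_dim per vocabulary entry
embedding_params = vocab_size * emb_dim
# SimpleRNN: input weights + recurrent weights + one bias, per unit
rnn_params = rnn_units * (emb_dim + rnn_units + 1)
# Dense (shared across time steps): weights + one bias, per label
dense_params = num_labels * (rnn_units + 1)

print(embedding_params, rnn_params, dense_params)    # 86500 15100 4141
print(embedding_params + rnn_params + dense_params)  # 105741
```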


### Compiling the Model
[Model API](https://keras.io/api/models/model_training_apis/)

The model is compiled by providing an optimizer, a loss, and (optionally) a list of metrics.

In [23]:
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

### Training the Model

In [24]:
rnn_history = model.fit(x_trn, y_trn, 
                        batch_size=64, 
                        epochs=10, 
                        validation_split=0.2,
                        verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Evaluating the Model

__BEST PRACTICE IS TO AVERAGE PERFORMANCES OF SEVERAL RUNS!__

In [25]:
# let's evaluate our model
scores = model.evaluate(x_tst, y_tst)



In [26]:
print(model.metrics_names)  # to get metric names
print(scores)

['loss', 'accuracy']
[0.1205439567565918, 0.8936349749565125]


Results look great!

However, remember that we padded our data with `O` labels; thus, most of this accuracy is due to `O`.
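
The counting effect behind this can be shown on a toy example. The sketch below uses a hypothetical `token_accuracy` helper and made-up labels; how much padding contributes to the number reported by Keras also depends on how masking is applied to the metric, which the sketch does not model.

```python
def token_accuracy(refs, hyps):
    """Fraction of positions where the predicted label equals the reference."""
    pairs = [(r, h) for ref, hyp in zip(refs, hyps) for r, h in zip(ref, hyp)]
    return sum(r == h for r, h in pairs) / len(pairs)


# toy sentence with 3 real tokens, padded with 'O' to length 10
ref = ["O", "B-movie.name", "I-movie.name"]
hyp = ["O", "O", "O"]  # the model missed the whole entity
pad = ["O"] * 7

print(token_accuracy([ref + pad], [hyp + pad]))  # 0.8
print(token_accuracy([ref], [hyp]))              # 0.333...
```

A model that finds no entity at all still scores 80% on the padded sequence, which is why the evaluation later in the notebook is done per entity on unpadded predictions.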

### Getting Predictions
- use `model.predict(x_tst)` for large inputs; it processes the data in batches
- use `model(x_tst)` for small inputs that fit in one batch

In [45]:
all_preds = model.predict(x_tst)
max_preds = all_preds.argmax(axis=-1)

print(all_preds[0])
print(max_preds[0])

[[3.1615471e-07 1.9970980e-08 7.2076460e-08 ... 5.8605654e-09
  2.5333813e-08 9.9939859e-01]
 [2.8339803e-10 3.9239532e-12 2.5805294e-11 ... 4.2919848e-11
  5.9898031e-11 9.9999833e-01]
 [7.5490725e-06 8.7174648e-08 2.2182816e-07 ... 3.8386869e-08
  3.2921129e-08 3.5385638e-06]
 ...
 [2.4051236e-02 2.3626415e-02 2.4015347e-02 ... 2.3585007e-02
  2.4465526e-02 2.9772669e-02]
 [2.4051236e-02 2.3626415e-02 2.4015347e-02 ... 2.3585007e-02
  2.4465526e-02 2.9772669e-02]
 [2.4051236e-02 2.3626415e-02 2.4015347e-02 ... 2.3585007e-02
  2.4465526e-02 2.9772669e-02]]
[40 40 14 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
 40]


In [44]:
preds = model(x_tst)
preds = np.argmax(preds, -1)
print(preds[0])

[40 40 14 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
 40]


### Preparing for CoNLL Evaluation

#### Unpadding

In [29]:
def unpad(preds, refs):
    # todo: doesn't take care of truncated input
    if len(preds) != len(refs):
        raise ValueError
    hyps = []
    for i, sent in enumerate(refs):
        if len(sent) < len(preds[i]):
            hyps.append(preds[i][:len(sent)])
        else:
            if len(sent) > len(preds[i]):
                raise ValueError("truncated input!")
            hyps.append(preds[i])
    return hyps
            

#### Minor format changes

In [30]:
# let's unpad
preds = unpad(preds, tst)
# to make tuples, so that we can use the conll eval
preds_txt = [[('_', idx2label.get(i)) for i in s] for s in preds]

print(preds_txt[0])

[('_', 'B-movie.name'), ('_', 'O'), ('_', 'B-movie.name')]


#### Evaluation

In [31]:
import pandas as pd

results = evaluate(tst, preds_txt)
pd_tbl = pd.DataFrame.from_dict(results, orient='index')
pd_tbl.round(decimals=3)

| label | p | r | f | s |
|---|---|---|---|---|
| movie.language | 0.667 | 0.754 | 0.707 | 69 |
| director.nationality | 1.0 | 0.0 | 0.0 | 1 |
| person.name | 0.364 | 0.353 | 0.358 | 34 |
| award.ceremony | 1.0 | 0.0 | 0.0 | 7 |
| actor.nationality | 1.0 | 0.0 | 0.0 | 1 |
| award.category | 1.0 | 0.0 | 0.0 | 2 |
| movie.genre | 1.0 | 0.417 | 0.588 | 36 |
| producer.name | 0.536 | 0.411 | 0.465 | 73 |
| movie.gross_revenue | 0.0 | 0.0 | 0.0 | 5 |
| movie.location | 1.0 | 0.0 | 0.0 | 7 |

## Model Improvements
Let's:
- Add dropout
- Make model bidirectional
- Change cell type to LSTM
- Explicitly define the optimizer & set the learning rate

In [32]:
from keras.models import Model
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, TimeDistributed, Dense
from keras.optimizers import Adam

inputs = Input(shape=(max_len,))
# don't forget we added 'UNK' and 'PAD'
model = Embedding(input_dim=len(words)+2, output_dim=50, input_length=max_len, mask_zero=True)(inputs)
# adding dropout
model = Dropout(0.5)(model)
# making bidirectional LSTM & adding recurrent dropout
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
outputs = TimeDistributed(Dense(len(labels), activation="softmax"))(model)  # softmax output layer

model = Model(inputs, outputs, name="BiLSTM")

# setting learning rate & decay parameters (read documentation!) 
opt = Adam(learning_rate=0.01, decay=1e-6)  # `lr` is a deprecated alias of `learning_rate`
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"]) 

In [33]:
model.summary()

Model: "BiLSTM"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 25)]              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 25, 50)            86500     
_________________________________________________________________
dropout (Dropout)            (None, 25, 50)            0         
_________________________________________________________________
bidirectional (Bidirectional (None, 25, 200)           120800    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 25, 41)            8241      
Total params: 215,541
Trainable params: 215,541
Non-trainable params: 0
_________________________________________________________________
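
The larger parameter counts of the bidirectional LSTM follow from the same kind of arithmetic as for the simple RNN. The sketch below uses the sizes of this model; the variable names are only for illustration.

```python
emb_dim = 50
lstm_units = 100
num_labels = 41

# one LSTM direction: 4 weight sets (3 gates + cell candidate), each with
# input weights, recurrent weights and one bias per unit
lstm_params = 4 * lstm_units * (emb_dim + lstm_units + 1)
# Bidirectional keeps a separate LSTM for each direction
bilstm_params = 2 * lstm_params
# Dense sees the concatenated output: 2 * lstm_units features per step
dense_params = num_labels * (2 * lstm_units + 1)

print(bilstm_params, dense_params)  # 120800 8241
```

Together with the 86,500 embedding parameters this gives the 215,541 total shown in the summary.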


In [34]:
history = model.fit(x_trn, y_trn, batch_size=64, epochs=10, validation_split=0.2, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [35]:
preds = model(x_tst)
preds = np.argmax(preds, -1)
preds = unpad(preds, tst)
# to make tuples, so that we can use the conll eval
preds = [[('_', idx2label.get(i)) for i in s] for s in preds]
results = evaluate(tst, preds)
pd_tbl = pd.DataFrame.from_dict(results, orient='index')
pd_tbl.round(decimals=3)

| label | p | r | f | s |
|---|---|---|---|---|
| movie.language | 0.726 | 0.652 | 0.687 | 69 |
| director.nationality | 1.0 | 0.0 | 0.0 | 1 |
| person.name | 0.667 | 0.647 | 0.657 | 34 |
| award.ceremony | 1.0 | 0.0 | 0.0 | 7 |
| actor.nationality | 1.0 | 1.0 | 1.0 | 1 |
| award.category | 1.0 | 0.0 | 0.0 | 2 |
| movie.genre | 0.689 | 0.861 | 0.765 | 36 |
| producer.name | 0.77 | 0.644 | 0.701 | 73 |
| movie.gross_revenue | 0.333 | 0.2 | 0.25 | 5 |
| movie.location | 0.667 | 0.286 | 0.4 | 7 |

## Using Pre-Trained Embeddings

Read [documentation](https://keras.io/examples/nlp/pretrained_word_embeddings/) on how to use embeddings and all the parameters.

Here we give a simple illustration using the word vectors that ship with spaCy.

Run `python -m spacy download en_core_web_lg` to get a model with proper vectors.

### Creating Embedding Matrix

In [36]:
import spacy
nlp = spacy.load('en_core_web_lg')

# let's get embedding vector & inspect its properties
vec = nlp.vocab['movie'].vector

emb_dimension = len(vec)
print(emb_dimension)

300


In [37]:
vocab_size = len(words) + 2
print(vocab_size)

1730


In [38]:
# let's initialize embedding matrix with zeros
embedding_matrix = np.zeros((vocab_size, emb_dimension))
hits = 0
misses = 0
for word, i in word2idx.items():
    embedding_vector = nlp.vocab[word].vector
    if embedding_vector is not None:
        # Note: spaCy returns an all-zeros vector (not None) for words it has no
        # embedding for, so this branch is always taken and misses stay at 0.
        # The rows for "padding" and "OOV" are therefore all-zeros as well.
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 1730 words (0 misses)
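
The hit and miss counters are only informative when the lookup can signal a missing word. The sketch below shows the same loop with a plain dictionary as a stand-in for the embedding source; the words, vectors, and variable names are made up for illustration. With spaCy, the `has_vector` attribute of a vocabulary entry could play the role of the `None` check.

```python
# stand-in for a pre-trained embedding source: word -> vector
pretrained = {"star": [0.1, 0.2, 0.3], "wars": [0.4, 0.5, 0.6]}
toy_word2idx = {"<PAD>": 0, "<UNK>": 1, "star": 2, "wars": 3, "luke": 4}
dim = 3

# start from an all-zeros matrix with one row per vocabulary entry
matrix = [[0.0] * dim for _ in range(len(toy_word2idx))]
hits, misses = 0, 0
for word, i in toy_word2idx.items():
    vector = pretrained.get(word)  # None when the word has no vector
    if vector is not None:
        matrix[i] = list(vector)
        hits += 1
    else:
        misses += 1  # the row stays all-zeros

print("Converted %d words (%d misses)" % (hits, misses))
```

Here the two reserved tokens and the unseen word `luke` are counted as misses and keep their all-zeros rows.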


### Model with Pre-Trained Embeddings
This is a copy of the model above, with some parameter changes.

In [39]:
from keras.models import Model
from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, TimeDistributed, Dense
from keras.optimizers import Adam
# new one
from keras.initializers import Constant

inputs = Input(shape=(max_len,))
# note new parameters
model = Embedding(input_dim=vocab_size, 
                  output_dim=emb_dimension, 
                  embeddings_initializer=Constant(embedding_matrix), 
                  trainable=False,  # to keep embeddings frozen
                  mask_zero=True,
                  input_length=max_len)(inputs)
# adding dropout
model = Dropout(0.5)(model)
# making bidirectional LSTM & adding recurrent dropout
model = Bidirectional(LSTM(units=emb_dimension, return_sequences=True, recurrent_dropout=0.1))(model)
outputs = TimeDistributed(Dense(len(labels), activation="softmax"))(model)  # softmax output layer

model = Model(inputs, outputs, name="BiLSTM.emb")

# setting learning rate & decay parameters (read documentation!) 
opt = Adam(learning_rate=0.01, decay=1e-6)  # `lr` is a deprecated alias of `learning_rate`
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"]) 

In [40]:
model.summary()

Model: "BiLSTM.emb"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 25)]              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 25, 300)           519000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 25, 300)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 25, 600)           1442400   
_________________________________________________________________
time_distributed_2 (TimeDist (None, 25, 41)            24641     
Total params: 1,986,041
Trainable params: 1,467,041
Non-trainable params: 519,000
_________________________________________________________________


In [41]:
history = model.fit(x_trn, y_trn, batch_size=64, epochs=10, validation_split=0.2, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [42]:
model.evaluate(x_tst, y_tst)



[0.08656857907772064, 0.9409863948822021]

In [43]:
preds = model(x_tst)
preds = np.argmax(preds, -1)
preds = unpad(preds, tst)
# to make tuples, so that we can use the conll eval
preds = [[('_', idx2label.get(i)) for i in s] for s in preds]
results = evaluate(tst, preds)
pd_tbl = pd.DataFrame.from_dict(results, orient='index')
pd_tbl.round(decimals=3)

| label | p | r | f | s |
|---|---|---|---|---|
| movie.language | 0.8 | 0.58 | 0.672 | 69 |
| director.nationality | 1.0 | 0.0 | 0.0 | 1 |
| person.name | 0.742 | 0.676 | 0.708 | 34 |
| award.ceremony | 1.0 | 0.0 | 0.0 | 7 |
| actor.nationality | 0.5 | 1.0 | 0.667 | 1 |
| award.category | 1.0 | 0.0 | 0.0 | 2 |
| movie.genre | 0.833 | 0.694 | 0.758 | 36 |
| producer.name | 0.905 | 0.781 | 0.838 | 73 |
| movie.gross_revenue | 0.125 | 0.2 | 0.154 | 5 |
| movie.location | 0.667 | 0.286 | 0.4 | 7 |

## Exercises
- Experiment with different model parameters
    - vary hidden layer size
    - learning rate
    - dropout rate
    - batch size
    - cell type: LSTM, GRU, SimpleRNN
    - optimizer: SGD, Adam, others
    
- We have used post-padding, whereas the default is pre-padding.
    - change the padding & unpadding
    - train & evaluate one of the models