# Applying Attention mechanisms to an LSTM network 

We will be working with the CONLL 2003 dataset, annotated for the task of Named Entity Recognition.

The objetive of this notebook is to build a prototype LSTM for sequence labeling, and apply a very simple attention mechanisms before the recurrent layer. The base model is inpired in [this work](https://www.kaggle.com/gagandeep16/ner-using-bidirectional-lstm), by GaganBhatia. Most of the explanations of the code is in the accompaning slides.

You can find two sample datasets directly hosted at UNC, [one](https://cs.famaf.unc.edu.ar/~mteruel/datasets/tensorflowMeetup/ner.csv) used by the original Kaggle notebook (150M) and a [smaller one](https://cs.famaf.unc.edu.ar/~mteruel/datasets/tensorflowMeetup/ner.sample.csv) just to play with (14M).


Once the model is trained, we show the attention score for each word.

In [1]:
import keras
import numpy
import pandas

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

Using TensorFlow backend.


## Reading dataset and extracting sequences

In [2]:
dataset = pandas.read_csv("../ner.csv", encoding = "ISO-8859-1", error_bad_lines=False,
                          usecols=['sentence_idx', 'word', 'pos', 'tag'])

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
dataset.shape

(1050796, 4)

In [4]:
dataset[:20]

Unnamed: 0,pos,sentence_idx,word,tag
0,NNS,1,Thousands,O
1,IN,1,of,O
2,NNS,1,demonstrators,O
3,VBP,1,have,O
4,VBN,1,marched,O
5,IN,1,through,O
6,NNP,1,London,B-geo
7,TO,1,to,O
8,VB,1,protest,O
9,DT,1,the,O


In [5]:
class SentenceFactory(object):
    
    def __init__(self, dataset, tag_preprocess=lambda x: x):
        self.dataset = dataset
        agg_func = lambda s: [
            (w, p, tag_preprocess(t)) 
            for w, p, t in zip(s["word"].values.tolist(), s['pos'].values.tolist(),
                             s["tag"].values.tolist())
        ]
        grouped = self.dataset.groupby("sentence_idx").apply(agg_func)
        self.sentences = [s for s in grouped]

We obtain a list of sentences from the dataset and we replace the BIO tag format for a regular label type

In [6]:
remove_bio = lambda x: x.replace('I-', '').replace('B-', '')

instances = SentenceFactory(dataset, tag_preprocess=remove_bio).sentences

instances[0:1]

[[('Thousands', 'NNS', 'O'),
  ('of', 'IN', 'O'),
  ('demonstrators', 'NNS', 'O'),
  ('have', 'VBP', 'O'),
  ('marched', 'VBN', 'O'),
  ('through', 'IN', 'O'),
  ('London', 'NNP', 'geo'),
  ('to', 'TO', 'O'),
  ('protest', 'VB', 'O'),
  ('the', 'DT', 'O'),
  ('war', 'NN', 'O'),
  ('in', 'IN', 'O'),
  ('Iraq', 'NNP', 'geo'),
  ('and', 'CC', 'O'),
  ('demand', 'VB', 'O'),
  ('the', 'DT', 'O'),
  ('withdrawal', 'NN', 'O'),
  ('of', 'IN', 'O'),
  ('British', 'JJ', 'gpe'),
  ('troops', 'NNS', 'O'),
  ('from', 'IN', 'O'),
  ('that', 'DT', 'O'),
  ('country', 'NN', 'O'),
  ('.', '.', 'O')]]

In [7]:
max_sentence_length = dataset.groupby('sentence_idx').word.count().max()
max_sentence_length

140

In [8]:
unique_words = dataset.word.unique()
unique_words = numpy.append(unique_words, "ENDPAD")
print('Vocabulary size {}'.format(unique_words.shape[0]))

Vocabulary size 30175


In [9]:
labels = dataset.tag.fillna('O').apply(remove_bio).unique()
print(labels)
print('Unique labels {}'.format(labels.shape[0]))

['O' 'geo' 'gpe' 'per' 'org' 'tim' 'art' 'nat' 'eve' 'prev-prev-lemma']
Unique labels 10


## Processing the input sequences

To train more effectively the network, we pad all sequences to have the same lenght. In this case, we choose to use the lenght of the longest sequence.

In [10]:
word2idx = {w: i for i, w in enumerate(unique_words)}
labels2idx = {t: i for i, t in enumerate(labels)}

In [11]:
x_matrix = [[word2idx[w[0]] for w in s] for s in instances]
x_matrix = pad_sequences(maxlen=max_sentence_length, sequences=x_matrix,
                         padding="post", value=unique_words.shape[0] - 1)

In [12]:
x_matrix

array([[    0,     1,     2, ..., 30174, 30174, 30174],
       [   22,     1,    23, ..., 30174, 30174, 30174],
       [   42,     4,    18, ..., 30174, 30174, 30174],
       ..., 
       [   61,   921,   151, ..., 30174, 30174, 30174],
       [  531,   330,     3, ..., 30174, 30174, 30174],
       [18519, 30174, 30174, ..., 30174, 30174, 30174]], dtype=int32)

In [13]:
y = [[labels2idx[w[2]] for w in s] for s in instances]
y = pad_sequences(maxlen=140, sequences=y, padding="post", value=labels2idx["O"])
y = [to_categorical(i, num_classes=labels.shape[0]) for i in y]

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_matrix, y, test_size=0.2)

--- 

# Building the model

We build a model with an object oriented interface so we can add and remove layers in sub-classes.

In [15]:
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional

from sklearn import metrics

In [16]:
class BiLSTM(object):
    def __init__(self, vocabulary_size, max_sentence_length, labels,
                 embedding_size=50):
        self.model = None
        self.vocabulary_size = vocabulary_size
        self.max_sentence_length = max_sentence_length
        self.labels = labels
        self.n_labels = labels.shape[0]
        
    def add_embedding_layer(self, layers):
        layers = Embedding(
            input_dim=self.vocabulary_size,
            output_dim=self.max_sentence_length,
            input_length=self.max_sentence_length)(layers)
        return Dropout(0.1)(layers)
    
    def add_recurrent_layer(self, layers):
        return Bidirectional(
            LSTM(units=100, return_sequences=True,
                 recurrent_dropout=0.1))(layers)
    
    def add_output_layer(self, layers):
        return TimeDistributed(
            Dense(self.n_labels, activation="softmax"))(layers)
    
    def build(self):
        input = Input(shape=(self.max_sentence_length,))
        layers = self.add_embedding_layer(input)
        layers = self.add_recurrent_layer(layers)
        layers = self.add_output_layer(layers)        
        
        self.model = Model(input, layers)
        self.model.compile(
            optimizer="adam", loss="categorical_crossentropy",
            metrics=["accuracy"])
    
    def fit(self, X_train, y_train, epochs, batch_size=32, validation_split=0.2):
        if self.model is None:
            self.build()
        return self.model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,
                              validation_split=validation_split, verbose=1)
    
    def predict(self, X_test):
        return numpy.argmax(self.model.predict(X_test), axis=-1)
    
    def evaluate(self, X_test, y_test):
        predictions = numpy.argmax(self.model.predict(X_test), axis=-1).flatten()
        true_labels = numpy.argmax(y_test, axis=-1).flatten()
        print(metrics.classification_report(true_labels, predictions,
                                            target_names=self.labels))

In [19]:
model = BiLSTM(vocabulary_size=unique_words.shape[0],
               max_sentence_length=max_sentence_length,
               labels=labels)

If we want, we train a new model

In [19]:
size = 100
model.fit(X_train, numpy.array(y_train), epochs=10)

Train on 23478 samples, validate on 5870 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff66a95d4a8>

In [20]:
model.model.save('model_10ep.keras')

Otherwise, we can load a previously trained model

In [20]:
model.build()

In [23]:
model.model.load_weights('../model_10ep.keras')

Finally, we evaluate the performance

In [26]:
size = 100
model.evaluate(X_test[:size], y_test[:size])

                 precision    recall  f1-score   support

              O       1.00      1.00      1.00     13571
            geo       0.95      0.90      0.93        91
            gpe       0.97      0.98      0.98        60
            per       0.93      0.95      0.94        91
            org       0.95      0.90      0.92       101
            tim       0.95      0.91      0.93        82
            art       0.00      0.00      0.00         1
            nat       1.00      0.33      0.50         3

    avg / total       1.00      1.00      1.00     14000



  'precision', 'predicted', average, warn_for)


We can see some of the predictions

In [27]:
i = 0
p = model.predict(numpy.array([X_test[i]]))
print("{:15} : ({:4}) : {}".format("Word", "True", "Pred"))
for w, true, pred in zip(X_test[i], y_test[i], p[i]):
    if w == len(unique_words) - 1:
        break
    print("{:15} : {:6} : {:6}".format(unique_words[w], labels[numpy.argmax(true)], labels[pred]))

Word            : (True) : Pred
The             : O      : O     
United          : geo    : geo   
States          : geo    : geo   
and             : O      : O     
other           : O      : O     
Western         : O      : O     
nations         : O      : O     
are             : O      : O     
trying          : O      : O     
to              : O      : O     
persuade        : O      : O     
the             : O      : O     
U.N.            : org    : org   
Security        : org    : org   
Council         : org    : org   
to              : O      : O     
impose          : O      : O     
sanctions       : O      : O     
on              : O      : O     
Iran            : geo    : geo   
because         : O      : O     
of              : O      : O     
its             : O      : O     
nuclear         : O      : O     
program         : O      : O     
.               : O      : O     


---

# Add an attention mechanism

We implement the first solution given by the slides, calculating a single score per word

In [28]:
from keras import backend as K
from keras.layers import Lambda, Permute, RepeatVector, merge

In [29]:
class AttBiLSTM(BiLSTM):
    
    def add_attention_block(self, layers):
        """Apply an attention block to a partial model layers."""
        feature_vector_size = K.int_shape(layers)[-1]
        att_layer = Dense(feature_vector_size, activation='softmax',
            name='attention_matrix_score')(layers)
        # Calculate a single score for each timestep
        att_layer = Lambda(lambda x: K.mean(x, axis=2),
                           name='attention_vector_score')(att_layer)
        # Reshape to obtain the same shape as input
        att_layer = Permute((2, 1))(
            RepeatVector(feature_vector_size)(att_layer))
        layers = merge([att_layer, layers],  mode='mul')
        return layers 
    
    def add_embedding_layer(self, layers):
        layers = super(AttBiLSTM, self).add_embedding_layer(layers)        
        return self.add_attention_block(layers)
    
    def attention_predict(self, input_sequences):
        """Classifies the input sequences and returns the attention score.

        Args:
            model: a Keras model
            input_: a list of array representation of sentences.

        Returns:
            A tuple where the first element is the attention scores for each
            sentence, and the second is the model predictions.
        """
        layer = self.model.get_layer('attention_vector_score')
        attention_model = Model(
            inputs=self.model.input, outputs=[layer.output, self.model.output])
        # The attention output is (batch_size, timesteps, features)
        return attention_model.predict(input_sequences)

In [30]:
model = AttBiLSTM(vocabulary_size=unique_words.shape[0],
               max_sentence_length=max_sentence_length,
               labels=labels)

Again, we can train the model...

In [None]:
size = 100
model.fit(X_train, numpy.array(y_train), epochs=2)

Train on 23478 samples, validate on 5870 samples
Epoch 1/2
Epoch 2/2
 4896/23478 [=====>........................] - ETA: 4:38 - loss: 0.0272 - acc: 0.9916

In [None]:
model.model.save('model_10ep_att2.keras')

... or we can load it from disk

In [32]:
model.build()
model.model.load_weights('model_10ep_att.keras')

  name=name)


In [33]:
attention, predictions = model.attention_predict(numpy.array(X_test[0:2]))

In [34]:
attention.shape

(2, 140)

## Second attention model

We implement the Philippe Remy model, where we calculate the attention scores weighting all the tie

In [None]:
class AttBiLSTM2(BiLSTM):
    
    def add_attention_block(self, layers):
        """Apply an attention block to a partial model layers."""
        timesteps = K.int_shape(layers)[-2]
        att_layer = Permute((2, 1))(att_layer)
        att_layer = TimeDistributed(
            Dense(timesteps, activation=None),
            name='attention_matrix_score')(att_layer)
        # Calculate a single score for each timestep
        att_layer = Lambda(lambda x: K.mean(x, axis=2),
                           name='attention_vector_score')(att_layer)
        # Reshape to obtain the same shape as input
        att_layer = Permute((2, 1))(
            RepeatVector(feature_vector_size)(att_layer))
        layers = merge([att_layer, layers],  mode='mul')
        return layers 
    
    def add_embedding_layer(self, layers):
        layers = super(AttBiLSTM, self).add_embedding_layer(layers)        
        return self.add_attention_block(layers)
    
    def attention_predict(self, input_sequences):
        """Classifies the input sequences and returns the attention score.

        Args:
            model: a Keras model
            input_: a list of array representation of sentences.

        Returns:
            A tuple where the first element is the attention scores for each
            sentence, and the second is the model predictions.
        """
        layer = self.model.get_layer('attention_vector_score')
        attention_model = Model(
            inputs=self.model.input, outputs=[layer.output, self.model.output])
        # The attention output is (batch_size, timesteps, features)
        return attention_model.predict(input_sequences)

Again, we can train the model...

In [None]:
size = 100
model.fit(X_train, numpy.array(y_train), epochs=10)

In [None]:
model.model.save('model_10ep_att3.keras')

... or we can load it from disk

In [55]:
model.build()
model.model.load_weights('model_10ep_att3.keras')

  name=name)


---

# Visualize the attention

First, we align the attention and labels output from the network, and remove all the padding tokens.

In [56]:
result = []
# This could be done in a much more compact code, but I hope this is more
# understandable
for sentence_idx, (word_idxs, sentence_a_scores, sentence_labels) in enumerate(
        zip(X_test[0:2], attention, numpy.argmax(predictions, axis=-1))):
    for word_idx, a_score, label_idx in zip(word_idxs, sentence_a_scores, sentence_labels):
        word = unique_words[word_idx]
        if word == 'ENDPAD':
            break
        label = labels[label_idx]
        result.append((word, a_score, sentence_idx, label))

In [57]:
result[:5]

[('The', 1.5802292, 0, 'O'),
 ('United', -4.1597695, 0, 'org'),
 ('States', -4.1535788, 0, 'geo'),
 ('and', -1.7216194, 0, 'O'),
 ('other', 2.8763144, 0, 'O')]

As a stand alone service, we first must store the results in a json file

In [58]:
pandas.DataFrame(result, columns=['token', 'attention', 'sentence', 'label']).to_csv('data.csv', index=False)

After saving the file, you need to run a local http server to see the result. Run in the console from the repository directory:

$ python -m http.server

Then, open your browser in localhost:8000, and you should see the visualization

## Visualizing attention in notebook

Another option is to import d3 directly into the notebook, but it is less robust.

In [59]:
from IPython.core.display import display, HTML
from string import Template
import json

In [60]:
HTML('<script src="js/d3.min.js"></script>')

In [61]:
HTML('<script src="js/textChart.js"></script>')

In [62]:
HTML("""<script>
if (d3 === undefined) {
    alert('No d3 library');
}
if (TextChart === undefined) {
    alert('No Chart library');
}
</script>""")

In [63]:
json_data = pandas.DataFrame(
    result, columns=['token', 'attention', 'sentence', 'label']).to_json(orient='records')

In [64]:
js_text_template = Template('''
var nouns = $json_data;  // We are heavily using the similarties
                         // between js and json syntax.
opts = {
  lineHeight: 16,
  width: 900,
  height: 600,
  linePadding: 10
}
chart = new TextChart(nouns, opts);
chart.draw("text-container");
''')

html_template = Template('''
    <div id='text-container'></div>
    <script>$js_text</script>
''')

js_text = js_text_template.substitute({
    'json_data': json_data
})

HTML(html_template.substitute({'js_text': js_text}))