# Applying attention mechanisms to an LSTM network

We will be working with the CONLL 2003 dataset, annotated for the task of Named Entity Recognition.

The objective of this notebook is to build a prototype LSTM for sequence labeling and to apply a very simple attention mechanism before the recurrent layer. The base model is inspired by [this work](https://www.kaggle.com/gagandeep16/ner-using-bidirectional-lstm), by GaganBhatia. Most of the code is explained in the accompanying slides.

Once the model is trained, we show the attention score for each word.

You can find two sample datasets hosted at UNC: [one](https://cs.famaf.unc.edu.ar/~mteruel/datasets/tensorflowMeetup/ner.csv) used by the original Kaggle notebook (150M) and a [smaller one](https://cs.famaf.unc.edu.ar/~mteruel/datasets/tensorflowMeetup/ner.sample.csv) just to play with (14M). If you are running in Colab, run the next cell with the corresponding URL to download the CSV file.

In [2]:
! wget -O data.sample.csv -nc https://cs.famaf.unc.edu.ar/~mteruel/datasets/tensorflowMeetup/ner.sample.csv

--2018-06-25 16:28:15--  https://cs.famaf.unc.edu.ar/~mteruel/datasets/tensorflowMeetup/ner.sample.csv
Resolving cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)... 200.16.17.55
Connecting to cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)|200.16.17.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14760771 (14M) [text/csv]
Saving to: ‘data.sample.csv’


2018-06-25 16:28:28 (1,12 MB/s) - ‘data.sample.csv’ saved [14760771/14760771]



In [316]:
import keras
import numpy
import pandas

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

## Reading dataset and extracting sequences

In [84]:
dataset = pandas.read_csv("../ner.csv", encoding = "ISO-8859-1", error_bad_lines=False,
                          usecols=['sentence_idx', 'word', 'pos', 'tag'])

In [85]:
dataset.shape

(1050796, 4)

In [86]:
dataset[:20]

Unnamed: 0,pos,sentence_idx,word,tag
0,NNS,1,Thousands,O
1,IN,1,of,O
2,NNS,1,demonstrators,O
3,VBP,1,have,O
4,VBN,1,marched,O
5,IN,1,through,O
6,NNP,1,London,B-geo
7,TO,1,to,O
8,VB,1,protest,O
9,DT,1,the,O


In [87]:
class SentenceFactory(object):
    
    def __init__(self, dataset, tag_preprocess=lambda x: x):
        self.dataset = dataset
        agg_func = lambda s: [
            (w, p, tag_preprocess(t)) 
            for w, p, t in zip(s["word"].values.tolist(), s['pos'].values.tolist(),
                             s["tag"].values.tolist())
        ]
        grouped = self.dataset.groupby("sentence_idx").apply(agg_func)
        self.sentences = [s for s in grouped]

We obtain a list of sentences from the dataset and replace the BIO tags with plain entity labels (for example, `B-geo` and `I-geo` both become `geo`).

In [88]:
remove_bio = lambda x: x.replace('I-', '').replace('B-', '')

instances = SentenceFactory(dataset, tag_preprocess=remove_bio).sentences

instances[0:1]

[[('Thousands', 'NNS', 'O'),
  ('of', 'IN', 'O'),
  ('demonstrators', 'NNS', 'O'),
  ('have', 'VBP', 'O'),
  ('marched', 'VBN', 'O'),
  ('through', 'IN', 'O'),
  ('London', 'NNP', 'geo'),
  ('to', 'TO', 'O'),
  ('protest', 'VB', 'O'),
  ('the', 'DT', 'O'),
  ('war', 'NN', 'O'),
  ('in', 'IN', 'O'),
  ('Iraq', 'NNP', 'geo'),
  ('and', 'CC', 'O'),
  ('demand', 'VB', 'O'),
  ('the', 'DT', 'O'),
  ('withdrawal', 'NN', 'O'),
  ('of', 'IN', 'O'),
  ('British', 'JJ', 'gpe'),
  ('troops', 'NNS', 'O'),
  ('from', 'IN', 'O'),
  ('that', 'DT', 'O'),
  ('country', 'NN', 'O'),
  ('.', '.', 'O')]]
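
Dropping the prefixes makes the labels easier to read, but it also discards entity boundaries: two adjacent entities of the same type can no longer be told apart. A minimal sketch of this effect, using a made-up tag sequence (not taken from the dataset) and a hypothetical helper `strip_bio` that mirrors the lambda above:

```python
# Hypothetical BIO tags: two adjacent 'geo' entities followed by an 'O' token.
bio_tags = ['B-geo', 'I-geo', 'B-geo', 'O']


def strip_bio(tag):
    """Strip the B-/I- prefix, mirroring the remove_bio lambda."""
    return tag.replace('I-', '').replace('B-', '')


plain_tags = [strip_bio(t) for t in bio_tags]
# The boundary between the first and the second entity is gone.
print(plain_tags)
```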

In [89]:
max_sentence_length = dataset.groupby('sentence_idx').word.count().max()
max_sentence_length

140

In [90]:
unique_words = dataset.word.unique()
unique_words = numpy.append(unique_words, "ENDPAD")
print('Vocabulary size {}'.format(unique_words.shape[0]))

Vocabulary size 30175


In [91]:
labels = dataset.tag.fillna('O').apply(remove_bio).unique()
print(labels)
print('Unique labels {}'.format(labels.shape[0]))

['O' 'geo' 'gpe' 'per' 'org' 'tim' 'art' 'nat' 'eve' 'prev-prev-lemma']
Unique labels 10


## Processing the input sequences

To train the network more effectively, we pad all sequences to the same length. In this case, we use the length of the longest sequence.

In [92]:
word2idx = {w: i for i, w in enumerate(unique_words)}
labels2idx = {t: i for i, t in enumerate(labels)}

In [93]:
x_matrix = [[word2idx[w[0]] for w in s] for s in instances]
x_matrix = pad_sequences(maxlen=max_sentence_length, sequences=x_matrix,
                         padding="post", value=unique_words.shape[0] - 1)

In [94]:
x_matrix

array([[    0,     1,     2, ..., 30174, 30174, 30174],
       [   22,     1,    23, ..., 30174, 30174, 30174],
       [   42,     4,    18, ..., 30174, 30174, 30174],
       ...,
       [   61,   921,   151, ..., 30174, 30174, 30174],
       [  531,   330,     3, ..., 30174, 30174, 30174],
       [18519, 30174, 30174, ..., 30174, 30174, 30174]], dtype=int32)
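
As a rough illustration of what post-padding does, the following pure-Python sketch pads toy sequences on the right with a filler value. The function name `pad_post` and the toy values are assumptions for this example; it is not the Keras implementation and it does not handle truncation.

```python
def pad_post(sequences, maxlen, value):
    """Right-pad each sequence with `value` up to `maxlen` items.

    Assumes no sequence is longer than maxlen, which holds in this
    notebook because maxlen is the length of the longest sentence.
    """
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]


toy_sequences = [[0, 1, 2], [3]]
padded = pad_post(toy_sequences, maxlen=4, value=9)
print(padded)
```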

In [95]:
y = [[labels2idx[w[2]] for w in s] for s in instances]
y = pad_sequences(maxlen=max_sentence_length, sequences=y, padding="post",
                  value=labels2idx["O"])
y = [to_categorical(i, num_classes=labels.shape[0]) for i in y]
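
The one-hot step can be pictured with a small NumPy sketch. The helper `one_hot` is a hypothetical stand-in that produces the same kind of output as `to_categorical` for a 1-D array of label indices. Keep in mind that padded positions are labelled `O`, so after this step they are indistinguishable from real `O` tokens.

```python
import numpy


def one_hot(label_ids, num_classes):
    """Turn a 1-D array of integer labels into one-hot rows."""
    return numpy.eye(num_classes, dtype='float32')[label_ids]


toy_labels = numpy.array([0, 2, 1])
encoded = one_hot(toy_labels, num_classes=3)
print(encoded)
```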

In [96]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_matrix, y, test_size=0.2)

--- 

# Building the model

We build the model with an object-oriented interface so that sub-classes can add or remove layers.

In [97]:
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional

from sklearn import metrics

In [98]:
class BiLSTM(object):
    def __init__(self, vocabulary_size, max_sentence_length, labels,
                 embedding_size=50):
        self.model = None
        self.vocabulary_size = vocabulary_size
        self.max_sentence_length = max_sentence_length
        self.labels = labels
        self.n_labels = labels.shape[0]
        
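    def add_embedding_layer(self, layers):
        # NOTE: the embedding dimension is set to max_sentence_length; the
        # embedding_size argument of __init__ is currently unused.
        layers = Embedding(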
            input_dim=self.vocabulary_size,
            output_dim=self.max_sentence_length,
            input_length=self.max_sentence_length)(layers)
        return Dropout(0.1)(layers)
    
    def add_recurrent_layer(self, layers):
        return Bidirectional(
            LSTM(units=100, return_sequences=True,
                 recurrent_dropout=0.1))(layers)
    
    def add_output_layer(self, layers):
        return TimeDistributed(
            Dense(self.n_labels, activation="softmax"))(layers)
    
    def build(self):
        input = Input(shape=(self.max_sentence_length,))
        layers = self.add_embedding_layer(input)
        layers = self.add_recurrent_layer(layers)
        layers = self.add_output_layer(layers)        
        
        self.model = Model(input, layers)
        self.model.compile(
            optimizer="adam", loss="categorical_crossentropy",
            metrics=["accuracy"])
    
    def fit(self, X_train, y_train, epochs, batch_size=32, validation_split=0.2):
        if self.model is None:
            self.build()
        return self.model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,
                              validation_split=validation_split, verbose=1)
    
    def predict(self, X_test):
        return numpy.argmax(self.model.predict(X_test), axis=-1)
    
    def evaluate(self, X_test, y_test):
        predictions = numpy.argmax(self.model.predict(X_test), axis=-1).flatten()
        true_labels = numpy.argmax(y_test, axis=-1).flatten()
        print(metrics.classification_report(true_labels, predictions,
                                            target_names=self.labels))
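
To get a feel for the size of this architecture, the parameter counts can be worked out by hand. The sketch below assumes the standard LSTM parameterisation (four gates, each with input weights, recurrent weights and a bias) and the values used in this notebook: a vocabulary of 30175 words, an embedding dimension of 140 (the embedding layer uses `max_sentence_length` as `output_dim`), 100 LSTM units per direction and 10 labels. The helper names are hypothetical.

```python
def embedding_params(vocabulary_size, embedding_dim):
    # One vector of size embedding_dim per vocabulary entry.
    return vocabulary_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit.
    return input_dim * output_dim + output_dim


emb = embedding_params(30175, 140)
bilstm = 2 * lstm_params(140, 100)   # forward and backward directions
out = dense_params(2 * 100, 10)      # both directions are concatenated
total = emb + bilstm + out
print(emb, bilstm, out, total)
```

Under these assumptions, the large majority of the parameters sit in the embedding matrix.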

In [99]:
model = BiLSTM(vocabulary_size=unique_words.shape[0],
               max_sentence_length=max_sentence_length,
               labels=labels)

If we want, we can train a new model:

In [19]:
size = 100
model.fit(X_train, numpy.array(y_train), epochs=10)

Train on 23478 samples, validate on 5870 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff66a95d4a8>

In [20]:
model.model.save('model_10ep.keras')

Otherwise, we can load a previously trained model

In [100]:
model.build()

In [101]:
model.model.load_weights('model_10ep.keras')

Finally, we evaluate the performance

In [103]:
model.evaluate(X_test, y_test)

                 precision    recall  f1-score   support

              O       1.00      1.00      1.00    994779
            geo       0.92      0.94      0.93      9071
            gpe       0.98      0.95      0.96      3350
            per       0.95      0.94      0.94      7014
            org       0.93      0.85      0.89      7410
            tim       0.93      0.96      0.94      5236
            art       0.81      0.69      0.75       134
            nat       0.78      0.73      0.75        55
            eve       0.83      0.91      0.86       130
prev-prev-lemma       0.00      0.00      0.00         1

    avg / total       1.00      1.00      1.00   1027180
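
Note that the support for `O` is dominated by padding positions, which are labelled `O` and are trivial to predict, so the averaged scores look better than the performance on real tokens. One way to check this, sketched here on toy arrays, is to drop the padded positions before computing any metric. The `pad_idx` value and the toy data are assumptions for illustration only.

```python
import numpy

pad_idx = 9  # hypothetical index of the padding word

# Two toy sentences of length 4; the second one has two padded positions.
words = numpy.array([[1, 2, 3, 4],
                     [5, 6, 9, 9]])
true = numpy.array([[0, 1, 1, 0],
                    [2, 0, 0, 0]])
pred = numpy.array([[0, 1, 0, 0],
                    [0, 0, 0, 0]])

mask = words != pad_idx                        # True for real tokens only
acc_all = (true == pred).mean()                # padding counted as correct
acc_real = (true[mask] == pred[mask]).mean()   # padding excluded
print(acc_all, acc_real)
```

In this notebook the equivalent mask would be something like `(X_test != len(unique_words) - 1).flatten()`, applied to the flattened true and predicted labels before calling `classification_report`.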



We can see some of the predictions

In [27]:
i = 0
# predict() receives a batch with a single sentence, so p has only one row
p = model.predict(numpy.array([X_test[i]]))
print("{:15} : ({:4}) : {}".format("Word", "True", "Pred"))
for w, true, pred in zip(X_test[i], y_test[i], p[0]):
    if w == len(unique_words) - 1:
        break
    print("{:15} : {:6} : {:6}".format(unique_words[w], labels[numpy.argmax(true)], labels[pred]))

Word            : (True) : Pred
The             : O      : O     
United          : geo    : geo   
States          : geo    : geo   
and             : O      : O     
other           : O      : O     
Western         : O      : O     
nations         : O      : O     
are             : O      : O     
trying          : O      : O     
to              : O      : O     
persuade        : O      : O     
the             : O      : O     
U.N.            : org    : org   
Security        : org    : org   
Council         : org    : org   
to              : O      : O     
impose          : O      : O     
sanctions       : O      : O     
on              : O      : O     
Iran            : geo    : geo   
because         : O      : O     
of              : O      : O     
its             : O      : O     
nuclear         : O      : O     
program         : O      : O     
.               : O      : O     
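
Token-level labels like these are often post-processed into entity spans. A minimal sketch, assuming plain labels without BIO prefixes as used in this notebook; the helper `extract_entities` is hypothetical, and adjacent entities of the same type will be merged into one span:

```python
from itertools import groupby


def extract_entities(words, tags, outside='O'):
    """Group consecutive tokens that share the same non-'O' tag."""
    entities = []
    position = 0
    for tag, group in groupby(zip(words, tags), key=lambda pair: pair[1]):
        tokens = [word for word, _ in group]
        if tag != outside:
            # Store the entity text, its label and the index of its first token.
            entities.append((' '.join(tokens), tag, position))
        position += len(tokens)
    return entities


words = ['The', 'United', 'States', 'and', 'the', 'U.N.', 'Security', 'Council']
tags = ['O', 'geo', 'geo', 'O', 'O', 'org', 'org', 'org']
print(extract_entities(words, tags))
```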


---

# Add an attention mechanism

We implement the first solution given in the slides, which computes a single score per word.

In [104]:
from keras import backend as K
from keras.layers import Lambda, Permute, RepeatVector, merge

In [306]:
class AttBiLSTM(BiLSTM):
    
    def add_attention_block(self, layers):
        """Apply an attention block to a partial model layers."""
        feature_vector_size = K.int_shape(layers)[-1]
        att_layer = Dense(feature_vector_size, activation='softmax', # activation=None,
                          name='attention_matrix_score')(layers)
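        # Average the scores over axis 1 of the (batch, timesteps, features)
        # tensor, leaving a single vector of scores per sentence
        att_layer = Lambda(lambda x: K.mean(x, axis=1),
                           name='attention_vector_score')(att_layer)
        # Repeat the vector to obtain the same shape as the input
        att_layer = Permute((2, 1))(
            RepeatVector(feature_vector_size)(att_layer))
        # Element-wise product (replaces the deprecated merge(..., mode='mul'))
        layers = keras.layers.multiply([att_layer, layers])
        return layers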
    
    def add_embedding_layer(self, layers):
        layers = super(AttBiLSTM, self).add_embedding_layer(layers)        
        return self.add_attention_block(layers)
    
    def attention_predict(self, input_sequences):
        """Classifies the input sequences and returns the attention score.

        Args:
            input_sequences: a list of array representation of sentences.

        Returns:
            A tuple where the first element is the attention scores for each
            sentence, and the second is the model predictions.
        """
        layer = self.model.get_layer('attention_vector_score')
        attention_model = Model(
            inputs=self.model.input, outputs=[layer.output, self.model.output])
        # The attention output is 2D: one vector of scores per sentence
        return attention_model.predict(input_sequences)
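
To make the tensor shapes in `add_attention_block` concrete, here is a NumPy sketch of the same sequence of operations for a single sentence. The weights are random and the sizes are toy values, so it only illustrates shapes and axes, not the trained model. Note that `K.mean(x, axis=1)` on a batched tensor averages over the timestep axis (axis 0 once the batch dimension is dropped), so the resulting vector has one entry per feature. It lines up with the words only because this notebook uses an embedding size equal to the sentence length.

```python
import numpy


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = numpy.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


rng = numpy.random.RandomState(0)
timesteps = features = 5   # toy size; the notebook uses 140 for both
embedded = rng.normal(size=(timesteps, features))
kernel = rng.normal(size=(features, features))   # stands in for the Dense weights
bias = numpy.zeros(features)

# Dense(..., activation='softmax'): one distribution over features per timestep.
matrix_score = softmax(embedded @ kernel + bias, axis=-1)
# K.mean(x, axis=1) with the batch axis dropped: average over timesteps.
vector_score = matrix_score.mean(axis=0)          # shape (features,)
# RepeatVector + Permute + multiply: entry i of the vector scales row i.
weighted = vector_score[:, None] * embedded       # shape (timesteps, features)

print(vector_score.shape, weighted.shape)
```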

In [307]:
model = AttBiLSTM(vocabulary_size=unique_words.shape[0],
                  max_sentence_length=max_sentence_length,
                  labels=labels)

Again, we can train the model...

In [None]:
size = 100
model.fit(X_train, numpy.array(y_train), epochs=10)

Train on 23478 samples, validate on 5870 samples
Epoch 1/10
Epoch 2/10

In [None]:
model.model.save('model_10ep_att_softmax.keras')

... or we can load it from disk

In [323]:
model.build()
model.model.load_weights('model_10ep_att_softmax.keras')

In [324]:
model.evaluate(X_test, y_test)

                 precision    recall  f1-score   support

              O       1.00      1.00      1.00    994779
            geo       0.83      0.83      0.83      9071
            gpe       0.94      0.85      0.89      3350
            per       0.89      0.77      0.83      7014
            org       0.80      0.65      0.72      7410
            tim       0.87      0.82      0.85      5236
            art       0.00      0.00      0.00       134
            nat       0.00      0.00      0.00        55
            eve       0.10      0.02      0.03       130
prev-prev-lemma       0.00      0.00      0.00         1

    avg / total       0.99      0.99      0.99   1027180



In [325]:
attention, predictions = model.attention_predict(numpy.array(X_test[0:2]))

In [326]:
attention[:2]

array([[0.01000638, 0.00979509, 0.03112146, 0.03014026, 0.01036052,
        0.02828653, 0.04183299, 0.0112968 , 0.00954351, 0.00961488,
        0.00934762, 0.00933148, 0.00927358, 0.00972058, 0.00904836,
        0.00897365, 0.00904961, 0.0096976 , 0.00888072, 0.01068403,
        0.00879953, 0.01101034, 0.00893785, 0.00906931, 0.00969644,
        0.00946349, 0.0089477 , 0.0091248 , 0.00905322, 0.00946682,
        0.00919382, 0.00931458, 0.00925128, 0.00932566, 0.0090926 ,
        0.01046528, 0.00913227, 0.00911073, 0.00931868, 0.00912134,
        0.00904317, 0.00907065, 0.0091409 , 0.00893204, 0.00884265,
        0.00887148, 0.00859397, 0.00918771, 0.00873354, 0.01000163,
        0.00946891, 0.00960682, 0.00965905, 0.01130545, 0.00991268,
        0.00933771, 0.00947753, 0.00958044, 0.00955382, 0.00882774,
        0.00910393, 0.00891715, 0.00858055, 0.00849193, 0.00861425,
        0.00908855, 0.00831471, 0.00830295, 0.00807236, 0.00802214,
        0.00849908, 0.00799347, 0.00770374, 0.00

## Second attention model

We implement the attention block proposed by Philippe Remy, which computes the attention scores by weighting all the timesteps at the same time.

In [327]:
class AttBiLSTM2(BiLSTM):
    
    def add_attention_block(self, layers):
        """Apply an attention block to a partial model layers."""
        timesteps = K.int_shape(layers)[-2]
        att_layer = Permute((2, 1))(layers)
        att_layer = Dense(timesteps, activation='softmax', # activation=None,
                          name='attention_matrix_score')(att_layer)
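        # Calculate a single score for each timestep by averaging over features
        att_layer = Lambda(lambda x: K.mean(x, axis=1),
                           name='attention_vector_score')(att_layer)
        # Repeat the scores once per feature to obtain the same shape as input
        feature_vector_size = K.int_shape(layers)[-1]
        att_layer = Permute((2, 1))(
            RepeatVector(feature_vector_size)(att_layer))
        # Element-wise product (replaces the deprecated merge(..., mode='mul'))
        layers = keras.layers.multiply([att_layer, layers])
        return layers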
    
    def add_embedding_layer(self, layers):
        layers = super(AttBiLSTM2, self).add_embedding_layer(layers)        
        return self.add_attention_block(layers)
    
    def attention_predict(self, input_sequences):
        """Classifies the input sequences and returns the attention score.

        Args:
            input_sequences: a list of array representation of sentences.

        Returns:
            A tuple where the first element is the attention scores for each
            sentence, and the second is the model predictions.
        """
        layer = self.model.get_layer('attention_vector_score')
        attention_model = Model(
            inputs=self.model.input, outputs=[layer.output, self.model.output])
        # The attention output has shape (batch_size, timesteps)
        return attention_model.predict(input_sequences)
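
The same kind of NumPy sketch helps to follow this second block. Again the weights are random and the sizes are toy values chosen for illustration. Here the softmax runs over the timestep axis, and the mean over the feature axis leaves one score per timestep, which is then broadcast across the features of each word.

```python
import numpy


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = numpy.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


rng = numpy.random.RandomState(0)
timesteps, features = 6, 4   # toy sizes, deliberately different
embedded = rng.normal(size=(timesteps, features))
kernel = rng.normal(size=(timesteps, timesteps))  # stands in for the Dense weights

# Permute((2, 1)): put the timesteps on the last axis.
transposed = embedded.T                            # (features, timesteps)
# Dense(timesteps, softmax): each feature row becomes a distribution over timesteps.
matrix_score = softmax(transposed @ kernel, axis=-1)
# Mean over the feature axis: a single score per timestep.
vector_score = matrix_score.mean(axis=0)           # (timesteps,)
# Scale every timestep of the input by its score.
weighted = embedded * vector_score[:, None]        # (timesteps, features)

print(vector_score.shape, weighted.shape)
```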

In [328]:
model = AttBiLSTM2(vocabulary_size=unique_words.shape[0],
                   max_sentence_length=max_sentence_length,
                   labels=labels)

Again, we can train the model...

In [None]:
size = 100
model.fit(X_train, numpy.array(y_train), epochs=10)

In [None]:
model.model.save('model_10ep_att2_softmax.keras')

... or we can load it from disk

In [329]:
model.build()
model.model.load_weights('model_10ep_att2_softmax.keras')

We finally evaluate the model and get some sample attention scores

In [330]:
model.evaluate(X_test, y_test)

                 precision    recall  f1-score   support

              O       1.00      1.00      1.00    994779
            geo       0.82      0.87      0.85      9071
            gpe       0.96      0.89      0.92      3350
            per       0.91      0.79      0.85      7014
            org       0.84      0.68      0.75      7410
            tim       0.91      0.85      0.88      5236
            art       0.00      0.00      0.00       134
            nat       0.00      0.00      0.00        55
            eve       0.50      0.01      0.02       130
prev-prev-lemma       0.00      0.00      0.00         1

    avg / total       0.99      0.99      0.99   1027180



In [331]:
attention, predictions = model.attention_predict(numpy.array(X_test[0:2]))

In [332]:
attention[:2]

array([[0.01988118, 0.01532837, 0.01745828, 0.02029429, 0.01645563,
        0.01321826, 0.01471941, 0.01522901, 0.017341  , 0.01651317,
        0.01266951, 0.01955203, 0.0268642 , 0.01956076, 0.01492754,
        0.01824682, 0.01587212, 0.02207735, 0.01965686, 0.02398645,
        0.02009648, 0.01991243, 0.01306573, 0.01767619, 0.01840059,
        0.01783916, 0.02105205, 0.01719425, 0.01869447, 0.01663652,
        0.01668748, 0.01739269, 0.01590375, 0.01081016, 0.01316883,
        0.00844534, 0.0154878 , 0.01242747, 0.01088647, 0.01504009,
        0.01790265, 0.02157492, 0.02769303, 0.00967434, 0.011865  ,
        0.00807473, 0.01186555, 0.00610112, 0.01052454, 0.00582738,
        0.00867038, 0.00547996, 0.00650602, 0.00207297, 0.0052943 ,
        0.00219738, 0.0062105 , 0.00153253, 0.00273718, 0.00149692,
        0.00245146, 0.00100101, 0.00121036, 0.00101179, 0.00101948,
        0.00054415, 0.00114498, 0.00053637, 0.00109128, 0.00049385,
        0.00059793, 0.00049596, 0.00073868, 0.00

---

# Visualize the attention

First, we align the attention and labels output from the network, and remove all the padding tokens.

In [333]:
result = []
# This could be done in a much more compact code, but I hope this is more
# understandable
for sentence_idx, (word_idxs, sentence_a_scores, sentence_labels) in enumerate(
        zip(X_test[0:2], attention, numpy.argmax(predictions, axis=-1))):
    for word_idx, a_score, label_idx in zip(word_idxs, sentence_a_scores, sentence_labels):
        word = unique_words[word_idx]
        if word == 'ENDPAD':
            break
        label = labels[label_idx]
        result.append((word, abs(a_score), sentence_idx, label))

In [334]:
result[:5]

[('Among', 0.01988118, 0, 'O'),
 ('those', 0.015328373, 0, 'O'),
 ('freed', 0.017458279, 0, 'O'),
 ('earlier', 0.020294288, 0, 'O'),
 ('this', 0.01645563, 0, 'O')]

To use the standalone visualization, we must first store the results in a CSV file.

In [313]:
pandas.DataFrame(result, columns=['token', 'attention', 'sentence', 'label']).to_csv('data.csv', index=False)

After saving the file, you need to run a local HTTP server to see the result. From the repository directory, run in the console:

```
$ python -m http.server
```

Then open localhost:8000 in your browser, and you should see the visualization.

## Visualizing attention in notebook

Another option is to import d3 directly into the notebook, but it is less robust.

In [300]:
from IPython.core.display import display, HTML
from string import Template
import json

In [301]:
HTML('<script src="js/d3.min.js"></script>')

In [302]:
HTML('<script src="js/textChart.js"></script>')

In [303]:
HTML("""<script>
if (typeof d3 === 'undefined') {
    alert('No d3 library');
}
if (typeof TextChart === 'undefined') {
    alert('No Chart library');
}
</script>""")

In [314]:
json_data = pandas.DataFrame(
    result, columns=['token', 'attention', 'sentence', 'label']).to_json(orient='records')

In [315]:
js_text_template = Template('''
var nouns = $json_data;  // We rely heavily on the similarities
                         // between JS and JSON syntax.
opts = {
  lineHeight: 16,
  width: 900,
  height: 600,
  linePadding: 10
}
chart = new TextChart(nouns, opts);
chart.draw("text-container");
''')

html_template = Template('''
    <div id='text-container'></div>
    <script>$js_text</script>
''')

js_text = js_text_template.substitute({
    'json_data': json_data
})

HTML(html_template.substitute({'js_text': js_text}))
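
For reference, `to_json(orient='records')` serialises the frame as a JSON array with one object per row, which is why the string can be substituted into the script as a JavaScript literal. A small standard-library sketch of the same structure, using the first two rows of `result` shown earlier (whitespace and float formatting may differ from what pandas emits):

```python
import json

columns = ['token', 'attention', 'sentence', 'label']
rows = [('Among', 0.01988118, 0, 'O'),
        ('those', 0.015328373, 0, 'O')]

# One dictionary per row, keyed by column name.
records = [dict(zip(columns, row)) for row in rows]
json_text = json.dumps(records)
print(json_text)
```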