# Multi-head Attention Sentiment Analysis

This notebook introduces the Transformer network architecture based solely on attention mechanisms. 

The idea comes from “Attention is All You Need” (https://arxiv.org/abs/1706.03762), an influential paper that changed the field of machine translation. The paper demonstrated that high performance can be achieved __without__ convolutional or recurrent neural networks, which were previously regarded as the go-to architectures for machine translation.

In this notebook, we show that the Multi-head Attention architecture can be implemented with the Zoo Keras API and applied to a sentiment analysis task on the IMDB dataset.

In [17]:
from __future__ import print_function

import numpy as np
from keras.datasets import imdb
from keras.preprocessing import sequence
from zoo.pipeline.api.keras.models import Model
from zoo.pipeline.api.keras.layers import *
from zoo.pipeline.api.autograd import *
from zoo.common.nncontext import *

sc = init_nncontext()

## Data Preparation

Load 25,000 records as training data and another 25,000 records as the validation dataset. Each record in `x_train` and `x_test` represents one IMDB movie review, with its words encoded as integer indices into a vocabulary. Only the `max_features` most frequent words are kept, and every review is padded or truncated to `max_len` tokens.

In [18]:
max_features = 20000
max_len = 200

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 200)
x_test shape: (25000, 200)
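
For readers unfamiliar with `pad_sequences`, the sketch below mimics its default behaviour in plain Python: sequences shorter than `maxlen` are padded with zeros on the left, and longer ones keep only their last `maxlen` tokens. The helper name `pad_pre` is hypothetical and exists only for illustration; the notebook itself relies on the Keras function.

```python
def pad_pre(seqs, maxlen, value=0):
    """Illustrative stand-in for pad_sequences with 'pre' padding
    and 'pre' truncation (hypothetical helper, not a library call)."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                       # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad with `value`
    return padded

toy = [[5, 6], [1, 2, 3, 4, 5, 6]]
print(pad_pre(toy, maxlen=4))  # [[0, 0, 5, 6], [3, 4, 5, 6]]
```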


## Multi-head Attention Layer

The attention mechanism in the Transformer can be interpreted as a way of computing the relevance of a set of values (information) based on some keys and queries. In essence, it lets the model focus on relevant information given what it is currently processing. This is a natural extension of the original attention mechanism: traditionally, the attention weights expressed the relevance of the encoder hidden states (values) when processing the decoder state (query), and were calculated from the encoder hidden states (keys) and the decoder hidden state (query).
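
To make the query/key/value description concrete, here is a minimal plain-Python sketch of the scaled dot-product attention from the paper, softmax(QKᵀ / √d_k)·V, for a single head. The function names and the toy inputs are illustrative assumptions and are not part of the Zoo API.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """queries: [n_q][d_k], keys: [n_k][d_k], values: [n_k][d_v]."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(d_v)])
    return outputs

queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = scaled_dot_product_attention(queries, keys, values)
print(out)  # the first value gets the larger weight: its key matches the query
```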

To improve the expressive ability of the model, the Transformer uses the Multi-Head Attention block. Instead of computing a single attention pass over the values, Multi-Head Attention computes several attention-weighted sums in parallel, hence the name “Multi-Head” Attention.

To learn diverse representations, the Multi-Head Attention applies different linear transformations to the values, keys, and queries for each “head” of attention. This is illustrated in the following code:


In [19]:

def build_multi_head_attention_model(q, k, v, n_head, d_k, d_v, d_out):
    '''
    Build a Multi-Head Attention layer (https://arxiv.org/abs/1706.03762) with existing Keras layers.
    The implementation can use more matrix operations once batch_dot in Zoo supports inputs with more than 3 dimensions.
    :param q: Input Node for query
    :param k: Input Node for key
    :param v: Input Node for value
    :param n_head: number of heads
    :param d_k: size of output for the linear transformations of query and key
    :param d_v: size of output for the linear transformation of value
    :param d_out: size of output for the linear transformation of the concatenated heads
    :return: output Node of the multi-head attention block
    '''
    qs_layers = []
    ks_layers = []
    vs_layers = []
    for _ in range(n_head):
        qs_layers.append(TimeDistributed(Dense(d_k, bias=False)))
        ks_layers.append(TimeDistributed(Dense(d_k, bias=False)))
        vs_layers.append(TimeDistributed(Dense(d_v, bias=False)))

    heads = []
    for i in range(n_head):
        qs = qs_layers[i](q)
        ks = ks_layers[i](k)
        vs = vs_layers[i](v)
        attn = Lambda(lambda x, y: batch_dot(x, y, axes=[2, 2]) / np.sqrt(d_k))([qs, ks])
        attn = Activation('softmax')(attn)
        head = Lambda(lambda x, y: batch_dot(x, y, axes=[2, 1]))([attn, vs])
        heads.append(head)

    head = merge(heads, mode="concat")
    output = TimeDistributed(Dense(d_out, activation = "relu"))(head)
    return output
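
The per-head projection and the final concatenation can also be traced without any framework. The following plain-Python sketch uses self-attention (the same input for queries, keys, and values, as in the model below) with random weights and toy sizes that are assumptions chosen for readability. It omits the final `Dense(d_out)` layer and is only meant to show how the shapes flow, not to reproduce the Zoo implementation.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(x, w_q, w_k, w_v):
    """One head of self-attention over x with shape [seq_len][d_model]."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(w_k[0])
    k_t = [list(col) for col in zip(*k)]                 # transpose keys
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)               # [seq_len][d_v]

def multi_head(x, heads):
    """Concatenate the outputs of all heads along the feature axis."""
    outputs = [attention_head(x, *weights) for weights in heads]
    return [sum((head[t] for head in outputs), []) for t in range(len(x))]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, n_head, d_k, d_v = 5, 8, 4, 3, 2      # toy sizes, not the notebook's
x = random_matrix(seq_len, d_model, rng)
heads = [(random_matrix(d_model, d_k, rng),              # query projection
          random_matrix(d_model, d_k, rng),              # key projection
          random_matrix(d_model, d_v, rng))              # value projection
         for _ in range(n_head)]
merged = multi_head(x, heads)
print(len(merged), len(merged[0]))  # 5 8, i.e. seq_len and n_head * d_v
```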

## Model Definition for Sentiment Analysis

Now we use the Multi-head Attention component to build a text classification model for sentiment analysis.

In [20]:
S_inputs = Input(shape=(max_len,))
embeddings = Embedding(max_features, 128)(S_inputs)
O_seq = build_multi_head_attention_model(embeddings, embeddings, embeddings, 8, 16, 16, 16)
O_seq = GlobalAveragePooling1D()(O_seq)
O_seq = Dropout(0.2)(O_seq)
outputs = Dense(2, activation='softmax')(O_seq)

model = Model(S_inputs, outputs)
model.summary()

creating: createZooKerasInput
creating: createZooKerasEmbedding
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDe
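
As a rough sanity check on the model size, the trainable parameters can be estimated by hand from the hyperparameters above. This back-of-the-envelope sketch assumes the usual counts (one embedding vector per vocabulary index, `inputs * outputs` weights per dense layer, plus a bias where one is enabled). It has not been checked against the truncated summary output, so treat the numbers as an estimate.

```python
max_features, embed_dim = 20000, 128
n_head, d_k, d_v, d_out, n_classes = 8, 16, 16, 16, 2

# Embedding table: one embed_dim vector per vocabulary index.
embedding = max_features * embed_dim
# Three bias-free projections (query, key, value) per head.
projections = n_head * (embed_dim * d_k + embed_dim * d_k + embed_dim * d_v)
# Dense layer applied to the concatenated heads (weights + bias).
merge_dense = (n_head * d_v) * d_out + d_out
# Final softmax classifier (weights + bias).
classifier = d_out * n_classes + n_classes

total = embedding + projections + merge_dense + classifier
print(embedding, projections, merge_dense, classifier, total)
# 2560000 49152 2064 34 2611250
```

Under these assumptions the embedding table accounts for almost all of the parameters; the attention block itself is comparatively small.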

## Model Training and Evaluation

In [21]:
# Users may try different optimizers and optimizer configurations
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

batch_size = 40
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          nb_epoch=1)
print("Train finished.")

creating: createAdam
creating: createZooKerasSparseCategoricalCrossEntropy
creating: createZooKerasAccuracy
Train...
Train finished.


In [22]:
print('Evaluating...')
score = model.evaluate(x_test, y_test)[0]
print(score)

Evaluating...
Evaluated result: 0.871999979019165, total_num: 25000, method: Top1Accuracy
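
The reported metric is top-1 accuracy: a prediction counts as correct when the class with the highest softmax probability equals the integer label. The sketch below illustrates that computation on made-up probabilities. The helper name `top1_accuracy` is hypothetical and is not the Zoo implementation.

```python
def top1_accuracy(probabilities, labels):
    """Fraction of samples whose highest-probability class equals the label."""
    correct = 0
    for probs, label in zip(probabilities, labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels)

# Made-up softmax outputs for four reviews (negative, positive).
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [0, 1, 1, 1]
print(top1_accuracy(toy_probs, toy_labels))  # 0.75 (3 of 4 correct)

# Number of correct predictions implied by the result reported above.
print(round(0.871999979019165 * 25000))  # 21800
```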


## Summary
With the Multi-head Attention architecture, we reach about 87% accuracy on the validation dataset after a single epoch of training. While the Transformer architecture is better suited to seq2seq tasks such as translation, this simple example demonstrates how to build a multi-head attention model with the existing Zoo Keras API and apply it to a real-world problem.