# Attention

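## What is Attention?
A technique for focusing on (= attending to) the important points in the past when handling sequential data.

Using the results from all time steps works better than using only the final time step.

In fact, even for an LSTM, using the average over all time steps reportedly works slightly better than using the final time step alone.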

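## How Attention is computed

- Score each of the hidden states seen so far in some way
- Normalize the scores with the softmax function
- Use the resulting weights to take a weighted average of the hidden states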
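The three steps above can be sketched in NumPy as follows. This is only a minimal illustration: the dot-product score against a `query` vector, the helper names `softmax` and `attention_pool`, and the toy shapes are all assumptions made for this sketch, not something fixed by the text.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(hidden_states, query):
    """Score each hidden state, normalize, and take the weighted average.

    hidden_states: (T, d) array, one row per time step.
    query: (d,) array; scoring by dot product is an assumption of this sketch.
    """
    scores = hidden_states @ query       # step 1: one score per time step, shape (T,)
    weights = softmax(scores)            # step 2: weights are positive and sum to 1
    context = weights @ hidden_states    # step 3: weighted average, shape (d,)
    return weights, context

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 4))  # toy values: 5 time steps, 4 hidden units
query = rng.normal(size=4)
weights, context = attention_pool(hidden_states, query)
```

If every score is equal, the weights become uniform and the result reduces to the plain average over all time steps mentioned above.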

# 0. Machine Translation + Attention



### Global Attention Model
<img src='img/encdecattention.png' width=600>

### $$ \alpha_t(s) = \frac{exp(score(h_t,\bar{h_s}))}{\sum_{s'} exp(score(h_t,\bar{h_{s'}}))} $$

<img src ='img/score.png' width=300>

### $$ c_t = \sum_s \alpha_t(s)\bar{h_s} $$

### $$ \tilde{h_t} = tanh(W_c[c_t;h_t]) $$
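As a rough sketch, one decoder step of global attention might look like this in NumPy. The score function is shown only as an image in this notebook, so the plain dot product used here is an assumption. The function name `global_attention` and the toy shapes are also made up for illustration.

```python
import numpy as np

def global_attention(h_t, h_bar, W_c):
    """One decoder step of global attention.

    h_t:   (d,) current decoder hidden state.
    h_bar: (S, d) all encoder hidden states.
    W_c:   (d, 2d) weight applied to the concatenation [c_t; h_t].
    """
    scores = h_bar @ h_t                       # score(h_t, h_bar_s); dot product is assumed
    e = np.exp(scores - scores.max())          # subtract the max for numerical stability
    alpha = e / e.sum()                        # alpha_t(s): softmax over source positions
    c_t = alpha @ h_bar                        # context vector c_t = sum_s alpha_t(s) h_bar_s
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # attentional hidden state
    return alpha, c_t, h_tilde

rng = np.random.default_rng(0)
S, d = 6, 4                                    # toy sizes: 6 source words, 4 hidden units
h_bar = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))
alpha, c_t, h_tilde = global_attention(h_t, h_bar, W_c)
```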


### Local Attention
Paper: [Effective Approaches to Attention-based Neural Machine Translation](http://aclweb.org/anthology/D15-1166)
<img src ='img/localattention.png' width =600>

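Do we really need to compute a score for every single word?

With Global Attention, the computational cost grows as sentences get longer.

Local Attention narrows down where to look even further, to the window [$p_t-D,p_t+D$]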


### 1. Computing $p_t$
<img src ='img/p_t.png' width =300>
$p_t$ takes a value in [0,S] and decides which word the attention window should be centered on



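### 2. Computing the score

Same as in Global Attention, but scores are computed only for the $D$ positions on each side of $p_t$

In other words, attention is computed over $2D+1$ positions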

<img src ='img/localattention3.png' width=400>


### 3. Assume a normal distribution
<img src ='img/seikibunpu2.png' width =400>

By default, $\sigma=\frac{D}{2}$

### 4. Recomputing the score

<img src ='img/scorerecalc.png' width =400>

### 5. The rest is the same as before...


### $$ c_t = \sum_s \alpha_t(s)\bar{h_s} $$

### $$ \tilde{h_t} = tanh({W_c[c_t;h_t]}) $$
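Steps 1 to 5 can be sketched in NumPy roughly as follows. The formulas for $p_t$ and the Gaussian reweighting follow the paper linked above, since the notebook shows them only as images. The dot-product score, centering the integer window on `floor(p_t)`, and the names `local_attention`, `W_p` and `v_p` are assumptions of this sketch.

```python
import numpy as np

def local_attention(h_t, h_bar, W_p, v_p, D):
    """Local attention for one decoder step.

    h_t: (d,) decoder state; h_bar: (S, d) encoder states;
    W_p: (k, d) and v_p: (k,) are the position-prediction weights.
    """
    S = h_bar.shape[0]
    # 1. Predicted center p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real number in [0, S]
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    # 2. Score only the positions inside the window, clipped to the sentence
    center = int(np.floor(p_t))
    positions = np.arange(max(0, center - D), min(S, center + D + 1))
    scores = h_bar[positions] @ h_t            # dot score (an assumption)
    e = np.exp(scores - scores.max())
    align = e / e.sum()                        # softmax restricted to the window
    # 3-4. Reweight with a Gaussian centered at p_t, using sigma = D / 2
    sigma = D / 2.0
    alpha = align * np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    # 5. Context vector built from the window only
    c_t = alpha @ h_bar[positions]
    return p_t, positions, alpha, c_t

rng = np.random.default_rng(0)
S, d, k, D = 10, 4, 3, 2                       # toy sizes
h_bar = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_p = rng.normal(size=(k, d))
v_p = rng.normal(size=k)
p_t, positions, alpha, c_t = local_attention(h_t, h_bar, W_p, v_p, D)
```

At most $2D+1$ scores are computed per step regardless of the sentence length, which is the cost saving over Global Attention.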


# How can Attention be applied to document classification?

# 1. LSTM or GRU or RNN + Attention

Implemented with reference to [Hierarchical Attention Networks for Document Classification (Yang et al., NAACL 2016)](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf).

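When classifying documents with a plain LSTM, the result at the final time step is used.

With Attention, as before, the context vector is generated as a weighted sum that reflects which time steps' hidden states deserve attention.

But in this setting, what corresponds to the decoder-side $h_t$ we have used so far...?


**What corresponds to $h_t$ is a randomly initialized weight vector that is learned during training.**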


<img src='img/attention.png'>

### 1. Computing the score
$i$ ranges from $0$ to $n$<br>
$W_s,b_s$ are learned weights
### $$ u_i = tanh(W_sh_i + b_s) $$

$u_s$ takes the place of the decoder-side $h_t$ used earlier and is a learned weight vector
### $$ score(u_i,u_s) = u_i^T u_s $$

### 2. Normalizing with softmax
### $$ \alpha_i = \frac{exp(score(u_i,u_s))}{\sum_{i'} exp(score(u_{i'},u_s))} $$

### 3. Weighted sum
### $$ v = \sum_i \alpha_i h_i $$
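The three formulas above translate almost line by line into NumPy. In this sketch the weights are random toy values rather than trained ones, and the function name `context_attention` and the sizes `n`, `d`, `k` are assumptions for illustration.

```python
import numpy as np

def context_attention(h, W_s, b_s, u_s):
    """Attention pooling with a learned context vector u_s.

    h: (n, d) hidden states; W_s: (k, d); b_s: (k,); u_s: (k,).
    """
    u = np.tanh(h @ W_s.T + b_s)       # 1. u_i = tanh(W_s h_i + b_s), shape (n, k)
    scores = u @ u_s                   #    score(u_i, u_s) = u_i^T u_s, shape (n,)
    e = np.exp(scores - scores.max())  # 2. softmax over the time steps
    alpha = e / e.sum()
    v = alpha @ h                      # 3. v = sum_i alpha_i h_i, shape (d,)
    return alpha, v

rng = np.random.default_rng(0)
n, d, k = 7, 5, 3                      # toy sizes
h = rng.normal(size=(n, d))
W_s = rng.normal(size=(k, d))
b_s = np.zeros(k)
u_s = rng.normal(size=k)               # randomly initialized; it would be learned in training
alpha, v = context_attention(h, W_s, b_s, u_s)
```

Unlike the translation setting, nothing here depends on a decoder state: the same `u_s` is used for every input, so it acts as a fixed, learned notion of which hidden states are informative.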

In [1]:
from keras.layers import Input, Embedding, SimpleRNN, LSTM, GRU, Bidirectional
from keras.layers import Dense, Activation, TimeDistributed
from keras.layers import Multiply, Concatenate, RepeatVector, Lambda
from keras.models import Model
import keras.backend as K

Using TensorFlow backend.


In [29]:
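# Implementation using Keras
from keras.layers import Reshape

inputs = Input(shape=(100,))
embedding = Embedding(input_dim=1000, output_dim=300, name='Emb')(inputs)
hs = LSTM(128, return_sequences=True, name='LSTM')(embedding)       # (None, 100, 128)

# Score each time step: u_i = tanh(W_s h_i + b_s), then Dense(1) plays the role of u_s
u = TimeDistributed(Dense(32, activation='tanh'), name='T1')(hs)    # (None, 100, 32)
score = TimeDistributed(Dense(1), name='T2')(u)                     # (None, 100, 1)

# The softmax has to run over the time axis; on the size-1 last axis every weight would be 1
score = Reshape((100,))(score)                                      # (None, 100)
alpha = Activation('softmax')(score)
alpha = Reshape((100, 1))(alpha)                                    # (None, 100, 1)

# Weighted sum of the hidden states over time
alphahs = Multiply(name='attention_mul')([alpha, hs])               # (None, 100, 128)
v = Lambda(lambda x: K.sum(x, axis=1))(alphahs)                     # (None, 128)

model = Model(inputs=inputs, outputs=v)
model.summary()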

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_16 (InputLayer)            (None, 100)           0                                            
____________________________________________________________________________________________________
Emb (Embedding)                  (None, 100, 300)      300000      input_16[0][0]                   
____________________________________________________________________________________________________
LSTM (LSTM)                      (None, 100, 128)      219648      Emb[0][0]                        
____________________________________________________________________________________________________
T1 (TimeDistributed)             (None, 100, 32)       4128        LSTM[0][0]                       
___________________________________________________________________________________________

#### Implementing an Attention layer with Keras
Download from here

https://github.com/richliao/textClassifier

In [None]:
from keras import backend as K
from keras import initializers
from keras.layers import Layer


class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializers.get('random_normal')
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Expects input of shape (batch, time, features)
        assert len(input_shape) == 3
        self.W = self.add_weight(name='W', shape=(input_shape[-1], 1),
                                 initializer=self.init, trainable=True)
        super(AttLayer, self).build(input_shape)  # be sure you call this somewhere!

    def call(self, x, mask=None):
        # One score per time step: (batch, time)
        eij = K.tanh(K.squeeze(K.dot(x, self.W), axis=-1))

        # Softmax over the time axis
        ai = K.exp(eij)
        weights = ai / K.sum(ai, axis=1, keepdims=True)

        # Weighted sum of the inputs over time: (batch, features)
        weighted_input = x * K.expand_dims(weights, axis=-1)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])

# Hierarchical Attention Networks for Document Classification (Yang et al., NAACL 2016)
https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf


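Introduces Attention for document classification (not sentence classification: these are documents containing multiple sentences).

Two levels of Attention
- Word-level Attention (this alone can be used for sentence classification)
- Sentence-level Attention (over the sentences that make up the document)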

## Hierarchical Attention Networks(HAN)
<img src ='img/hierarchicl.png' width =600>

### 1. Encode each sentence with a bidirectional GRU + Attention
### 2. Run another bidirectional GRU + Attention over the encoded sentences, in order
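The two-level structure can be sketched in NumPy as below. To keep the sketch short, the bidirectional GRU is **not** implemented: the hypothetical `encode` function is only a per-step projection standing in for it, and all weights are random toy values. The names `attend` and `encode` and the sizes are assumptions made for illustration.

```python
import numpy as np

def attend(h, W, b, u_ctx):
    """Attention pooling with a learned context vector: (n, d) -> (d,)."""
    u = np.tanh(h @ W.T + b)
    scores = u @ u_ctx
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    return alpha @ h

def encode(x, P):
    """Hypothetical stand-in for the bidirectional GRU (NOT a real GRU)."""
    return np.tanh(x @ P)

rng = np.random.default_rng(0)
emb, d, k = 6, 4, 3
# A toy document: 3 sentences with 5, 2 and 7 word embeddings each
document = [rng.normal(size=(n_words, emb)) for n_words in (5, 2, 7)]

# Separate toy parameters for the word level and the sentence level
P_word, W_word, b_word, u_word = (rng.normal(size=(emb, d)), rng.normal(size=(k, d)),
                                  np.zeros(k), rng.normal(size=k))
P_sent, W_sent, b_sent, u_sent = (rng.normal(size=(d, d)), rng.normal(size=(k, d)),
                                  np.zeros(k), rng.normal(size=k))

# 1. Word level: each sentence (n_words, emb) becomes one sentence vector (d,)
sentence_vectors = np.stack(
    [attend(encode(s, P_word), W_word, b_word, u_word) for s in document])

# 2. Sentence level: the sequence of sentence vectors becomes one document vector (d,)
document_vector = attend(encode(sentence_vectors, P_sent), W_sent, b_sent, u_sent)
```

Sentences of different lengths are pooled independently at the word level, so the sentence level always sees one fixed-size vector per sentence.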


### Visualizing Attention
<img src='img/attentionvisual.png'>

## With this mechanism, Attention can be added to textCNN and QRNN as well

In [28]:
from keras.layers import Input, Dense, Embedding, Conv2D, MaxPool2D
from keras.layers import Reshape, Flatten, Dropout, Concatenate
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.models import Model


#sequence_length = x.shape[1] # 56
#vocabulary_size = len(vocabulary_inv) # 18765
sequence_length = 56
vocabulary_size =  18765

embedding_dim = 256
filter_sizes = [3,4,5]
num_filters = 512
drop = 0.5

epochs = 100
batch_size = 30


inputs = Input(shape=(sequence_length,), dtype='int32')
embedding = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim, input_length=sequence_length)(inputs)
reshape = Reshape((sequence_length,embedding_dim,1))(embedding)

conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)

maxpool_0 = MaxPool2D(pool_size=(sequence_length - filter_sizes[0] + 1, 1), strides=(1,1), padding='valid')(conv_0)
maxpool_1 = MaxPool2D(pool_size=(sequence_length - filter_sizes[1] + 1, 1), strides=(1,1), padding='valid')(conv_1)
maxpool_2 = MaxPool2D(pool_size=(sequence_length - filter_sizes[2] + 1, 1), strides=(1,1), padding='valid')(conv_2)

concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
flatten = Flatten()(concatenated_tensor)
dropout = Dropout(drop)(flatten)
output = Dense(units=2, activation='softmax')(dropout)

# build the model from the input and output tensors
model = Model(inputs=inputs, outputs=output)
model.summary()

Creating Model...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_15 (InputLayer)            (None, 56)            0                                            
____________________________________________________________________________________________________
embedding_3 (Embedding)          (None, 56, 256)       4803840     input_15[0][0]                   
____________________________________________________________________________________________________
reshape_2 (Reshape)              (None, 56, 256, 1)    0           embedding_3[0][0]                
____________________________________________________________________________________________________
conv2d_4 (Conv2D)                (None, 54, 1, 512)    393728      reshape_2[0][0]                  
_________________________________________________________________________

In [None]:
# Exercise: replace the maxpool layers with an Attention layer

# WRITE ME
