# Transformer Practice Based on Self-Attention

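The Transformer is a neural network architecture built on self-attention. Because it processes all positions of a sequence in parallel instead of step by step, it can train roughly an order of magnitude faster than recurrent models. The example in this post uses a simpler relative of that idea: a recurrent encoder-decoder with a Bahdanau-style attention layer.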

This post is adapted from [keras-attention](https://github.com/datalogue/keras-attention), which accompanies the article [How to Visualize Your Recurrent Neural Network with Attention in Keras](https://medium.com/datalogue/attention-in-keras-1892773a4f22)

The task in this example: convert date strings written in a variety of formats into the standard date format, e.g. 26.12.18 into 2018-12-26, and Sat 8 Jun 2017 into 2017-06-08

Schematic of the attention mechanism:
<img src='./image/attention.png' />

To run the example in this post, the following Python packages are needed:
```
numpy
babel>=1.3
tensorflow>=1.1.0
keras>=2.0.4
h5py
Faker
matplotlib>=2.0.2
```

If your machine has a usable GPU, replace tensorflow with: ```tensorflow-gpu>=1.1.0```

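## Creating the data

We use faker and babel to randomly generate the training and validation sets. faker produces random dates, while babel, a Python internationalization toolkit, formats those dates according to locale conventions.

Taking the US locale as an example, typical date strings look like: 30 october 1989, 23.06.11, 3 05 18.

Import the required modules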

In [1]:
import random
import json
import os
from faker import Faker
from babel.dates import format_date

Specify the folder where the generated data files are stored

In [25]:
DATA_FOLDER = './keras_attention/data'
if not os.path.exists(DATA_FOLDER):
    os.makedirs(DATA_FOLDER)

In [2]:
fake = Faker()
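fake.seed(230517)  # recent Faker releases expect the class-level call Faker.seed(230517)
random.seed(230517)
# Define 15 date formats.
# Note: in babel patterns an upper-case 'Y' is the ISO week-numbering year,
# so dates near a year boundary can show a different year than 'y' would.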
FORMATS = ['short',
           'medium',
           'long',
           'full',
           'd MMM YYY',
           'd MMMM YYY',
           'dd MMM YYY',
           'd MMM, YYY',
           'd MMMM, YYY',
           'dd, MMM YYY',
           'd MM YY',
           'd MMMM YYY',
           'MMMM d YYY',
           'MMMM d, YYY',
           'dd.MM.YY',
           ]
# change this if you want it to work with only a single language
LOCALES = ['en_US']

Define a function that generates random dates

In [5]:
def create_date():
    """
        Creates some fake dates 
        :returns: tuple containing 
                  1. human formatted string
                  2. machine formatted string
                  3. date object.
    """
    dt = fake.date_object()

    # wrapping this in a try catch because
    # the locale 'vo' and format 'full' will fail
    try:
        human = format_date(dt,
                            format=random.choice(FORMATS),
                            locale=random.choice(LOCALES))

        case_change = random.randint(0,3) # 1/2 chance of case change
        if case_change == 1:
            human = human.upper()
        elif case_change == 2:
            human = human.lower()

        machine = dt.isoformat()
    except AttributeError as e:
        # print(e)
        return None, None, None

    return human, machine, dt

Now test the function's random output. Each tuple holds the human-written date string, the standard date string, and the date object

In [11]:
for i in range(10):
    print(create_date())

('19 July 1989', '1989-07-19', datetime.date(1989, 7, 19))
('may 18, 1971', '1971-05-18', datetime.date(1971, 5, 18))
('June 12, 2005', '2005-06-12', datetime.date(2005, 6, 12))
('jul 31, 1988', '1988-07-31', datetime.date(1988, 7, 31))
('30 march, 2017', '2017-03-30', datetime.date(2017, 3, 30))
('March 3 1985', '1985-03-03', datetime.date(1985, 3, 3))
('05.09.87', '1987-09-05', datetime.date(1987, 9, 5))
('17 Jul 2005', '2005-07-17', datetime.date(2005, 7, 17))
('MONDAY, JANUARY 4, 2010', '2010-01-04', datetime.date(2010, 1, 4))
('OCTOBER 27 1972', '1972-10-27', datetime.date(1972, 10, 27))
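
The pairing of a human-written string with its ISO form can be sketched with the standard library alone. This is a minimal illustration, not the function used above: `make_pair` is a hypothetical helper, and `strftime` patterns stand in for babel's locale-aware formats. The case rule mirrors `create_date`.

```python
import datetime
import random

def make_pair(dt, rng):
    """Return (human, machine) strings for one date.

    Assumption: strftime patterns replace babel's locale-aware formats.
    """
    patterns = ['%d %B %Y', '%B %d, %Y', '%d.%m.%y', '%d %b %Y']
    human = dt.strftime(rng.choice(patterns))
    # Same rule as create_date: 1 -> upper case, 2 -> lower case,
    # 0 and 3 -> unchanged, i.e. a 1/2 chance of a case change.
    case_change = rng.randint(0, 3)
    if case_change == 1:
        human = human.upper()
    elif case_change == 2:
        human = human.lower()
    return human, dt.isoformat()

rng = random.Random(230517)
human, machine = make_pair(datetime.date(1989, 7, 19), rng)
print(human, '->', machine)
```

Whatever format is drawn, the machine string is always the ISO date, which is what makes it usable as the training target.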


How the dataset is built

- human_vocab, a set: relies on set deduplication to build the character table of the human-written date strings; it corresponds to the Encoder
>For example, passing 12, Sep 2010 through tuple gives: ('1', '2', ',', ' ', 'S', 'e', 'p', ' ', '2', '0', '1', '0')
<br>and human_vocab.update(tuple('12, Sep 2010')) deduplicates it to: {'2', 'e', 'p', ',', ' ', '1', 'S', '0'}
- machine_vocab, a set: relies on set deduplication to build the character table of the standardized date strings; it corresponds to the Decoder
>For example, passing 2010-09-12 through tuple gives: ('2', '0', '1', '0', '-', '0', '9', '-', '1', '2')
<br>and machine_vocab.update(tuple('2010-09-12')) deduplicates it to: {'2', '9', '1', '0', '-'}
- ```int2human = dict(enumerate(human_vocab))```
>enumerate iterates over the set human_vocab and yields (index, character) pairs, which dict turns into a dictionary such as: {0: 'C', 1: 'N', 2: '8', 3: 'R', 4: '2', 5: 'e', 6: 'p', 7: ',', 8: 'M', 9: 'B', 10: 'E', 11: 'J', 12: 'D', 13: 'U', 14: '9', 15: ' ', 16: '1', 17: 'S', 18: '0', 19: '4'}
- ```int2human.update({len(int2human): '<unk>',len(int2human)+1: '<eot>'})``` appends two special entries: `<unk>` for unknown characters (it is also used for padding) and `<eot>` for end of text
>giving ```{0: 'C', 1: 'N', 2: '8', 3: 'R', 4: '2', 5: 'e', 6: 'p', 7: ',', 8: 'M', 9: 'B', 10: 'E', 11: 'J', 12: 'D', 13: 'U', 14: '9', 15: ' ', 16: '1', 17: 'S', 18: '0', 19: '4', 20: '<unk>', 21: '<eot>'}```
- Because we need to look up an index by character, the dictionary has to be inverted:
>```human2int = {v: k for k, v in int2human.items()}```
<br>which gives: ```{'C': 0, 'N': 1, '8': 2, 'R': 3, '2': 4, 'e': 5, 'p': 6, ',': 7, 'M': 8, 'B': 9, 'E': 10, 'J': 11, 'D': 12, 'U': 13, '9': 14, ' ': 15, '1': 16, 'S': 17, '0': 18, '4': 19, '<unk>': 20, '<eot>': 21}```
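
The steps in the list above can be run end to end on a couple of sample strings. This is a small sketch: `build_vocab` is a hypothetical helper, and the `sorted()` call is an addition so the indices are reproducible (a plain set has no guaranteed order).

```python
def build_vocab(strings):
    """Build a character -> index dictionary with <unk> and <eot> appended."""
    chars = set()
    for s in strings:
        chars.update(tuple(s))           # the set removes duplicate characters
    # sorted() is an assumption added here for deterministic indices
    int2char = dict(enumerate(sorted(chars)))
    # both keys are computed before update() runs, so they are n and n + 1
    int2char.update({len(int2char): '<unk>',
                     len(int2char) + 1: '<eot>'})
    return {v: k for k, v in int2char.items()}

human2int = build_vocab(['12, Sep 2010', '9 DEC, 1995'])
machine2int = build_vocab(['2010-09-12', '1995-12-09'])
print(machine2int)
```

The two special tokens always receive the two highest indices, so the vocabulary size is the number of distinct characters plus two.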

In [19]:
def create_dataset(dataset_name, n_examples, vocabulary=False):
    """
        Creates a csv dataset with n_examples and optional vocabulary
        :param dataset_name: name of the file to save as
        :n_examples: the number of examples to generate
        :vocabulary: if true, will also save the vocabulary
    """
    human_vocab = set()
    machine_vocab = set()

    with open(dataset_name, 'w', encoding='utf-8') as f:
        for i in range(n_examples):
            h, m, _ = create_date()
            if h is not None:
                f.write('"'+h + '","' + m + '"\n')
                human_vocab.update(tuple(h))
                machine_vocab.update(tuple(m))
            if (i + 1) % 1000 == 0:
                print('Generated {0} records'.format(i + 1))

    if vocabulary:
        int2human = dict(enumerate(human_vocab))
        int2human.update({len(int2human): '<unk>',
                          len(int2human)+1: '<eot>'})
        int2machine = dict(enumerate(machine_vocab))
        int2machine.update({len(int2machine):'<unk>',
                            len(int2machine)+1:'<eot>'})

        human2int = {v: k for k, v in int2human.items()}
        machine2int = {v: k for k, v in int2machine.items()}

        with open(os.path.join(DATA_FOLDER, 'human_vocab.json'), 'w') as f:
            json.dump(human2int, f)
        with open(os.path.join(DATA_FOLDER, 'machine_vocab.json'), 'w') as f:
            json.dump(machine2int, f)
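
Each field is wrapped in double quotes because many human-written dates contain a comma. A quick round trip through `csv.reader`, using an in-memory buffer in place of the real file (an assumption made to keep the sketch self-contained), shows that such rows still parse into exactly two columns:

```python
import csv
import io

rows = [('may 18, 1971', '1971-05-18'), ('05.09.87', '1987-09-05')]

buffer = io.StringIO()
for h, m in rows:
    # same quoting scheme as create_dataset
    buffer.write('"' + h + '","' + m + '"\n')

buffer.seek(0)
parsed = [tuple(r) for r in csv.reader(buffer)]
print(parsed)
```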

In [20]:
print(DATA_FOLDER)
print(os.path.exists(DATA_FOLDER))

./keras_attention/data
True


Generate the training and validation sets

In [21]:
print('creating training dataset begin')
create_dataset(os.path.join(DATA_FOLDER, 'training.csv'), 500000,
               vocabulary=True)
print('creating training dataset end')
print()
print('creating validation dataset begin')
create_dataset(os.path.join(DATA_FOLDER, 'validation.csv'), 1000)
print('creating validation dataset end')

creating training dataset begin
Generated 1000 records
Generated 2000 records
Generated 3000 records
... (output truncated) ...
Generated 370000 records
Generated 371000 records
... (output truncated) ...


The generated data files:

<img src='./image/attentiondata.png' />

Examples of the generated data:
```
"9/30/97","1997-09-30"
"November 30, 1986","1986-11-30"
"25.05.78","1978-05-25"
"23.05.72","1972-05-23"
"TUESDAY, MAY 20, 1975","1975-05-20"
"27 Jul 1995","1995-07-27"
"Sunday, April 18, 1999","1999-04-18"
"july 15 1986","1986-07-15"
"19 Nov 1979","1979-11-19"
"JULY 12, 2001","2001-07-12"
"February 25 2018","2018-02-25"
"10.02.70","1970-02-10"
"June 5 1975","1975-06-05"
"01 Jun 1977","1977-06-01"
"dec 13, 2006","2006-12-13"
"9 DEC, 1995","1995-12-09"
"august 5, 1972","1972-08-05"
```

Specify the folder for storing model files

In [24]:
WEIGHT_FOLDER = './keras_attention/weights'
if not os.path.exists(WEIGHT_FOLDER):
    os.makedirs(WEIGHT_FOLDER)

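## Preparing the methods and classes for the attention-based model

To make things easier to follow, all the classes and methods involved are placed directly in this post rather than kept in separate Python files.

The directory structure of the original project is as follows: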

<img src='./image/attentionstructure.png' />

Import the required modules and classes

In [26]:
import numpy as np
import os
import tensorflow as tf
from keras import backend as K
from keras import regularizers, constraints, initializers, activations
from keras.engine import InputSpec
from keras.models import Model, load_model
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Embedding, Activation, Permute, Input, Flatten, Dropout
from keras.layers.recurrent import LSTM, Recurrent
from keras.layers.wrappers import TimeDistributed, Bidirectional

Using TensorFlow backend.


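For every temporal slice, multiply the input x by the weight matrix w and add the bias b, much like applying the same linear (dense) layer at each timestep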

In [27]:
def _time_distributed_dense(x, w, b=None, dropout=None,
                            input_dim=None, output_dim=None,
                            timesteps=None, training=None):
    """Apply `x . w + b` for every temporal slice.
    # Arguments
        x: input tensor.
        w: weight matrix.
        b: optional bias vector.
        dropout: whether to apply dropout (same dropout mask
            for every temporal slice of the input).
        input_dim: integer; optional dimensionality of the input.
        output_dim: integer; optional dimensionality of the output.
        timesteps: integer; optional number of timesteps.
        training: training phase tensor or boolean.
    # Returns
        Output tensor.
    """
    if not input_dim:
        input_dim = K.shape(x)[2]
    if not timesteps:
        timesteps = K.shape(x)[1]
    if not output_dim:
        output_dim = K.shape(w)[1]

    if dropout is not None and 0. < dropout < 1.:
        # apply the same dropout pattern at every timestep
        ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
        dropout_matrix = K.dropout(ones, dropout)
        expanded_dropout_matrix = K.repeat(dropout_matrix, timesteps)
        x = K.in_train_phase(x * expanded_dropout_matrix, x, training=training)

    # collapse time dimension and batch dimension together
    x = K.reshape(x, (-1, input_dim))
    x = K.dot(x, w)
    if b is not None:
        x = K.bias_add(x, b)
    # reshape to 3D tensor
    if K.backend() == 'tensorflow':
        x = K.reshape(x, K.stack([-1, timesteps, output_dim]))
        x.set_shape([None, None, output_dim])
    else:
        x = K.reshape(x, (-1, timesteps, output_dim))
    return x
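
To make the shapes concrete, here is the same `x . w + b` per timestep written with plain Python lists. It is only an illustration of the arithmetic; `time_distributed_dense` is a hypothetical stand-in that ignores dropout and the backend reshaping tricks used above.

```python
def time_distributed_dense(x, w, b=None):
    """x: [batch][timesteps][input_dim], w: [input_dim][output_dim]."""
    output_dim = len(w[0])
    out = []
    for sample in x:
        sample_out = []
        for step in sample:              # the same w and b at every timestep
            row = [sum(step[i] * w[i][j] for i in range(len(step)))
                   for j in range(output_dim)]
            if b is not None:
                row = [r + bj for r, bj in zip(row, b)]
            sample_out.append(row)
        out.append(sample_out)
    return out

x = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]   # batch=1, timesteps=3, input_dim=2
w = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]       # input_dim=2, output_dim=3
b = [0.5, 0.5, 0.5]
y = time_distributed_dense(x, w, b)
print(y[0][0])
```

The output keeps the batch and time dimensions and only changes the last one, from `input_dim` to `output_dim`.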

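The attention decoder class, AttentionDecoder, inherits from Recurrent, the RNN base class. Once implemented, it can be used as the attention layer of a neural network.

<b><font color='red'>Key constructor arguments</font></b>
- activation, the activation function: tanh is used here
- kernel_initializer, the initializer used for the kernel weights
>glorot_uniform is used here: the Glorot uniform initializer, also known as Xavier uniform initialization, which aims to keep the output variance of every layer roughly equal. Weights are drawn from a uniform distribution over ```[-limit, limit]```, where limit is ```sqrt(6 / (fan_in + fan_out))```.
<br>A good article on the topic is [深度学习中Xavier初始化](https://www.cnblogs.com/hejunlin1992/p/8723816.html); the weights can be computed as follows:
<br>```scale = np.sqrt(6. / (shape[0] + shape[1])) 
np.random.uniform(low=-scale, high=scale, size=shape)```
For more TensorFlow initializers, see: [tensorflow 1.0 学习：参数初始化（initializer)](http://www.mamicode.com/info-detail-1835147.html)
- recurrent_initializer, the initializer for the recurrent (state-to-state) weights; either the name of a predefined initializer or a function that initializes the weights.
>orthogonal is used here, i.e. orthogonal initialization of the network parameters; see the blog post [Explaining and illustrating orthogonal initialization for recurrent neural networks](https://smerity.com/articles/2016/orthogonal_init.html)
><br><b>Orthogonal initialization:</b>
<br>Ideally, every eigenvalue of the recurrent matrix has absolute value 1, so the gradient neither grows nor shrinks however many timesteps it is propagated through.
<br>An orthogonal matrix W has exactly this property.
<br><b>Initializing the transition matrix as an orthogonal matrix avoids exploding or vanishing gradients at the very start of training; this is called orthogonal initialization.</b>
- bias_initializer, the bias initializer; set to zeros here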
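
Both initializers described above can be illustrated without any deep learning library. The sketch below is an assumption-laden toy: `glorot_uniform` and `rotate` are hypothetical helpers, and a 2-D rotation stands in for a general orthogonal matrix. It checks that Glorot samples stay within the limit and that repeated multiplication by an orthogonal matrix leaves the vector norm unchanged.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, limit

def rotate(vec, theta):
    """Multiply a 2-D vector by a rotation matrix, which is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]

rng = random.Random(0)
weights, limit = glorot_uniform(128, 32, rng)   # e.g. the shape of U_a below

vec = [3.0, 4.0]
for _ in range(1000):                           # 1000 "timesteps"
    vec = rotate(vec, 0.3)
norm = math.hypot(vec[0], vec[1])
print(limit, norm)
```

A matrix whose eigenvalues were slightly above or below 1 in magnitude would instead make the norm explode or vanish over the same 1000 steps.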

In [28]:
tfPrint = lambda d, T: tf.Print(input_=T, data=[T, tf.shape(T)], message=d)
class AttentionDecoder(Recurrent):
    def __init__(self, units, output_dim,
                 activation='tanh',
                 return_probabilities=False,
                 name='AttentionDecoder',
                 kernel_initializer='glorot_uniform',
                 recurrent_initializer='orthogonal',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 **kwargs):
        """
        Implements an AttentionDecoder that takes in a sequence encoded by an
        encoder and outputs the decoded states 
        :param units: dimension of the hidden state and the attention matrices
        :param output_dim: the number of labels in the output space

        references:
            Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 
            "Neural machine translation by jointly learning to align and translate." 
            arXiv preprint arXiv:1409.0473 (2014).
        """
        self.units = units
        self.output_dim = output_dim
        self.return_probabilities = return_probabilities
        self.activation = activations.get(activation)
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.recurrent_initializer = initializers.get(recurrent_initializer)
        self.bias_initializer = initializers.get(bias_initializer)

        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        # there is no separate recurrent_regularizer argument; the kernel one is reused
        self.recurrent_regularizer = regularizers.get(kernel_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)

        self.kernel_constraint = constraints.get(kernel_constraint)
        # likewise, the kernel constraint is reused for the recurrent weights
        self.recurrent_constraint = constraints.get(kernel_constraint)
        self.bias_constraint = constraints.get(bias_constraint)

        super(AttentionDecoder, self).__init__(**kwargs)
        self.name = name
        self.return_sequences = True  # must return sequences

    def build(self, input_shape):
        """
          See Appendix 2 of Bahdanau 2014, arXiv:1409.0473
          for model details that correspond to the matrices here.
        """

        self.batch_size, self.timesteps, self.input_dim = input_shape

        if self.stateful:
            super(AttentionDecoder, self).reset_states()

        self.states = [None, None]  # y, s

        """
            Matrices for creating the context vector
        """

        self.V_a = self.add_weight(shape=(self.units,),
                                   name='V_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.W_a = self.add_weight(shape=(self.units, self.units),
                                   name='W_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_a = self.add_weight(shape=(self.input_dim, self.units),
                                   name='U_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_a = self.add_weight(shape=(self.units,),
                                   name='b_a',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the r (reset) gate
        """
        self.C_r = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_r = self.add_weight(shape=(self.units, self.units),
                                   name='U_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_r = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_r = self.add_weight(shape=(self.units, ),
                                   name='b_r',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        """
            Matrices for the z (update) gate
        """
        self.C_z = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_z = self.add_weight(shape=(self.units, self.units),
                                   name='U_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_z = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_z = self.add_weight(shape=(self.units, ),
                                   name='b_z',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the proposal
        """
        self.C_p = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_p = self.add_weight(shape=(self.units, self.units),
                                   name='U_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_p = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_p = self.add_weight(shape=(self.units, ),
                                   name='b_p',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for making the final prediction vector
        """
        self.C_o = self.add_weight(shape=(self.input_dim, self.output_dim),
                                   name='C_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_o = self.add_weight(shape=(self.units, self.output_dim),
                                   name='U_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim),
                                   name='W_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_o = self.add_weight(shape=(self.output_dim, ),
                                   name='b_o',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        # For creating the initial state:
        self.W_s = self.add_weight(shape=(self.input_dim, self.units),
                                   name='W_s',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)

        self.input_spec = [
            InputSpec(shape=(self.batch_size, self.timesteps, self.input_dim))]
        self.built = True

    def call(self, x):
        # store the whole sequence so we can "attend" to it at each timestep
        self.x_seq = x

        # apply a dense layer over the time dimension of the sequence
        # do it here because it doesn't depend on any previous steps,
        # therefore we can save computation time:
        self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a,
                                             input_dim=self.input_dim,
                                             timesteps=self.timesteps,
                                             output_dim=self.units)

        return super(AttentionDecoder, self).call(x)

    def get_initial_state(self, inputs):
        print('inputs shape:', inputs.get_shape())

        # apply the matrix on the first time step to get the initial s0.
        s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s))

        # from keras.layers.recurrent to initialize a vector of (batchsize,
        # output_dim)
        y0 = K.zeros_like(inputs)  # (samples, timesteps, input_dims)
        y0 = K.sum(y0, axis=(1, 2))  # (samples, )
        y0 = K.expand_dims(y0)  # (samples, 1)
        y0 = K.tile(y0, [1, self.output_dim])

        return [y0, s0]

    def step(self, x, states):

        ytm, stm = states

        # repeat the hidden state to the length of the sequence
        _stm = K.repeat(stm, self.timesteps)

        # now multiply the weight matrix with the repeated hidden state
        _Wxstm = K.dot(_stm, self.W_a)

        # calculate the attention probabilities
        # this relates how much other timesteps contributed to this one.
        et = K.dot(activations.tanh(_Wxstm + self._uxpb),
                   K.expand_dims(self.V_a))
        at = K.exp(et)
        at_sum = K.sum(at, axis=1)
        at_sum_repeated = K.repeat(at_sum, self.timesteps)
        at /= at_sum_repeated  # vector of size (batchsize, timesteps, 1)

        # calculate the context vector
        context = K.squeeze(K.batch_dot(at, self.x_seq, axes=1), axis=1)
        # ~~~> calculate new hidden state
        # first calculate the "r" gate:

        rt = activations.sigmoid(
            K.dot(ytm, self.W_r)
            + K.dot(stm, self.U_r)
            + K.dot(context, self.C_r)
            + self.b_r)

        # now calculate the "z" gate
        zt = activations.sigmoid(
            K.dot(ytm, self.W_z)
            + K.dot(stm, self.U_z)
            + K.dot(context, self.C_z)
            + self.b_z)

        # calculate the proposal hidden state:
        s_tp = activations.tanh(
            K.dot(ytm, self.W_p)
            + K.dot((rt * stm), self.U_p)
            + K.dot(context, self.C_p)
            + self.b_p)

        # new hidden state:
        st = (1-zt)*stm + zt * s_tp

        yt = activations.softmax(
            K.dot(ytm, self.W_o)
            + K.dot(stm, self.U_o)
            + K.dot(context, self.C_o)
            + self.b_o)

        if self.return_probabilities:
            return at, [yt, st]
        else:
            return yt, [yt, st]

    def compute_output_shape(self, input_shape):
        """
            For Keras internal compatibility checking
        """
        if self.return_probabilities:
            return (None, self.timesteps, self.timesteps)
        else:
            return (None, self.timesteps, self.output_dim)

    def get_config(self):
        """
            For rebuilding models on load time.
        """
        config = {
            'output_dim': self.output_dim,
            'units': self.units,
            'return_probabilities': self.return_probabilities
        }
        base_config = super(AttentionDecoder, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
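
The attention part of `step` computes, for each encoder timestep j, a score `V_a . tanh(W_a s_prev + U_a h_j + b_a)`, normalizes the scores with a softmax, and forms the context vector as the weighted sum of the encoder outputs. Below is a single-sample sketch in plain Python. `attention_step` and `vec_mat` are hypothetical helpers and the toy sizes are made up for illustration.

```python
import math

def vec_mat(vec, mat):
    """Multiply a vector of length n by an n x m matrix (list of rows)."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

def attention_step(s_prev, encoded, W_a, U_a, b_a, V_a):
    """Return (weights, context) for one decoder timestep."""
    ws = vec_mat(s_prev, W_a)                       # W_a applied to s_{t-1}
    scores = []
    for h in encoded:                               # one score per encoder step
        uh = vec_mat(h, U_a)                        # U_a applied to h_j
        hidden = [math.tanh(a + b + c) for a, b, c in zip(ws, uh, b_a)]
        scores.append(sum(v * x for v, x in zip(V_a, hidden)))
    exps = [math.exp(e) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]             # softmax over timesteps
    context = [sum(w * h[k] for w, h in zip(weights, encoded))
               for k in range(len(encoded[0]))]     # weighted sum of h_j
    return weights, context

# Toy sizes (hypothetical): units=2, input_dim=3, timesteps=4.
encoded = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, 0.5], [-0.2, 0.1, 0.0]]
s_prev = [0.3, -0.3]
W_a = [[0.5, -0.5], [0.25, 0.75]]
U_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b_a = [0.0, 0.0]
V_a = [1.0, -1.0]

weights, context = attention_step(s_prev, encoded, W_a, U_a, b_a, V_a)
print(weights, context)
```

The weights are what the layer returns when `return_probabilities=True`, which is why they can be plotted as an attention map.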

Next, try building a neural network layer with the AttentionDecoder class

In [29]:
i = Input(shape=(100,104), dtype='float32')
enc = Bidirectional(LSTM(64, return_sequences=True), merge_mode='concat')(i)
dec = AttentionDecoder(32, 4)(enc)
model = Model(inputs=i, outputs=dec)
print(model.summary())

inputs shape: (?, ?, 128)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 100, 104)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 128)          86528     
_________________________________________________________________
AttentionDecoder (AttentionD (None, 100, 4)            25780     
Total params: 112,308
Trainable params: 112,308
Non-trainable params: 0
_________________________________________________________________
None
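
The decoder's parameter count in the summary can be checked by adding up the shapes declared in `build`. The helper below is a hypothetical worked check, not part of the original project.

```python
def attention_decoder_params(input_dim, units, output_dim):
    """Sum the weight shapes declared in AttentionDecoder.build."""
    # V_a, W_a, U_a, b_a
    attention = units + units * units + input_dim * units + units
    # C, U, W, b for each of the r gate, z gate and proposal
    gate = input_dim * units + units * units + output_dim * units + units
    # C_o, U_o, W_o, b_o
    output = (input_dim * output_dim + units * output_dim
              + output_dim * output_dim + output_dim)
    # W_s for the initial state
    initial_state = input_dim * units
    return attention + 3 * gate + output + initial_state

# input_dim is 128 because the bidirectional LSTM concatenates 2 x 64 units
print(attention_decoder_params(input_dim=128, units=32, output_dim=4))
```

With these arguments the result is 25780, matching the AttentionDecoder row of the summary.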


A function that builds a Neural Machine Translator (NMT) with attention. Compared with a plain Bi-LSTM model, the only difference is the AttentionDecoder layer stacked on top

In [30]:
def simpleNMT(pad_length=100,
              n_chars=105,
              n_labels=6,
              embedding_learnable=False,
              encoder_units=256,
              decoder_units=256,
              trainable=True,
              return_probabilities=False):
    """
    Builds a Neural Machine Translator that has alignment attention
    :param pad_length: the size of the input sequence
    :param n_chars: the number of characters in the vocabulary
    :param n_labels: the number of possible labelings for each character
    :param embedding_learnable: decides if the one hot embedding should be refinable.
    :return: keras.models.Model that can be compiled and fit'ed

    *** REFERENCES ***
    Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio.
    "Neural machine translation by jointly learning to align and translate."
    arXiv preprint arXiv:1409.0473 (2014).
    """
    input_ = Input(shape=(pad_length,), dtype='float32')
    input_embed = Embedding(n_chars, n_chars,
                            input_length=pad_length,
                            trainable=embedding_learnable,
                            weights=[np.eye(n_chars)],
                            name='OneHot')(input_)

    rnn_encoded = Bidirectional(LSTM(encoder_units, return_sequences=True),
                                name='bidirectional_1',
                                merge_mode='concat',
                                trainable=trainable)(input_embed)

    y_hat = AttentionDecoder(decoder_units,
                             name='attention_decoder_1',
                             output_dim=n_labels,
                             return_probabilities=return_probabilities,
                             trainable=trainable)(rnn_encoded)

    model = Model(inputs=input_, outputs=y_hat)

    return model

Let's test it and inspect the output, with all hyperparameters left at their defaults:

In [31]:
model = simpleNMT()
print(model.summary())

inputs shape: (?, ?, 512)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 100)               0         
_________________________________________________________________
OneHot (Embedding)           (None, 100, 105)          11025     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 512)          741376    
_________________________________________________________________
attention_decoder_1 (Attenti (None, 100, 6)            928042    
Total params: 1,680,443
Trainable params: 1,669,418
Non-trainable params: 11,025
_________________________________________________________________
None
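
The remaining rows of the summary follow from two standard formulas: an LSTM has four gates, each with input weights, recurrent weights and a bias, and the one-hot embedding is an `n_chars x n_chars` identity matrix that is frozen by default. The helpers below are hypothetical and only reproduce the arithmetic.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    # a forward and a backward LSTM of the same size
    return 2 * lstm_params(input_dim, units)

n_chars, encoder_units = 105, 256
one_hot = n_chars * n_chars                       # frozen identity embedding
encoder = bidirectional_lstm_params(n_chars, encoder_units)
decoder = 928042                                  # copied from the summary above
total = one_hot + encoder + decoder
trainable = encoder + decoder                     # the embedding is not trainable
print(one_hot, encoder, total, trainable)
```

The non-trainable count in the summary is exactly the embedding matrix, because `embedding_learnable` defaults to `False`.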


A function for computing accuracy, used as the metric when the model is compiled

In [34]:
import keras.backend as K

def all_acc(y_true, y_pred):
    """
        All Accuracy
        https://github.com/rasmusbergpalm/normalization/blob/master/train.py#L10
    """
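    # The targets used here are one-hot (see to_categorical below), so the
    # class index is recovered with argmax on both sides before comparing.
    return K.mean(
        K.all(
            K.equal(
                K.argmax(y_true, axis=-1),
                K.argmax(y_pred, axis=-1)
            ),
            axis=1)
    )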
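
This metric is stricter than per-character accuracy: a sequence counts as correct only if every timestep is predicted correctly. A plain-Python sketch of the same idea on one-hot targets follows; `sequence_accuracy` and `argmax` are hypothetical helpers.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def sequence_accuracy(y_true, y_pred):
    """Fraction of sequences whose every timestep is predicted correctly."""
    correct = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        if all(argmax(t) == argmax(p) for t, p in zip(true_seq, pred_seq)):
            correct += 1
    return correct / len(y_true)

# two sequences, two timesteps each, three classes
y_true = [[[1, 0, 0], [0, 1, 0]],
          [[0, 0, 1], [1, 0, 0]]]
y_pred = [[[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],    # both steps right
          [[0.1, 0.2, 0.7], [0.3, 0.6, 0.1]]]    # second step wrong
print(sequence_accuracy(y_true, y_pred))
```

Three of the four characters are right, yet only one of the two sequences counts, so the value is 0.5.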

## Classes and methods for reading the vocabularies (used for model training and result output)

In [35]:
import json
import csv
import random

import numpy as np
from keras.utils.np_utils import to_categorical

In [37]:
random.seed(1984)

INPUT_PADDING = 50
OUTPUT_PADDING = 100

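The Vocabulary class
- vocabulary: a dictionary with characters as keys and indices as values
- reverse_vocabulary: a dictionary with indices as keys and characters as values

It builds the vocabulary from a file and provides methods for converting characters to integer indices and indices back to characters.

Both the input and the output vocabulary can be handled by this class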

In [38]:
class Vocabulary(object):

    def __init__(self, vocabulary_file, padding=None):
        """
            Creates a vocabulary from a file
            :param vocabulary_file: the path to the vocabulary
        """
        self.vocabulary_file = vocabulary_file
        with open(vocabulary_file, 'r', encoding='utf-8') as f:
            self.vocabulary = json.load(f)

        self.padding = padding
        self.reverse_vocabulary = {v: k for k, v in self.vocabulary.items()}

    def size(self):
        """
            Gets the size of the vocabulary
        """
        return len(self.vocabulary.keys())

    def string_to_int(self, text):
        """
            Converts a string into its character integer 
            representation
            :param text: text to convert
        """
        characters = list(text)

        integers = []

        if self.padding and len(characters) >= self.padding:
            # truncate if too long
            characters = characters[:self.padding - 1]

        characters.append('<eot>')

        for c in characters:
            if c in self.vocabulary:
                integers.append(self.vocabulary[c])
            else:
                integers.append(self.vocabulary['<unk>'])


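        # pad up to the fixed length; note that '<unk>' doubles as the padding token
        if self.padding and len(integers) < self.padding:
            integers.extend([self.vocabulary['<unk>']]
                            * (self.padding - len(integers)))

        # sanity check, skipped when no padding length was given
        if self.padding and len(integers) != self.padding:
            print(text)
            raise AttributeError('Length of text was not padding.')
        return integers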

    def int_to_string(self, integers):
        """
            Decodes a list of integers
            into its string representation
        """
        characters = []
        for i in integers:
            characters.append(self.reverse_vocabulary[i])

        return characters
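
The encoding steps (truncate, append `<eot>`, map to indices, pad) can be followed on a toy example. The sketch below mirrors the logic of `string_to_int` as a standalone function. The function name `encode` and the five-entry `toy_vocab` are hypothetical; the real vocabulary is loaded from a JSON file.

```python
def encode(text, vocab, padding):
    """Truncate, append <eot>, map characters to indices, then pad with <unk>."""
    chars = list(text)
    if len(chars) >= padding:
        chars = chars[:padding - 1]          # leave room for <eot>
    chars.append('<eot>')
    ids = [vocab.get(c, vocab['<unk>']) for c in chars]   # unknown chars -> <unk>
    ids.extend([vocab['<unk>']] * (padding - len(ids)))   # pad to fixed length
    return ids

# Hypothetical toy vocabulary and its reverse mapping
toy_vocab = {'<unk>': 0, '<eot>': 1, '1': 2, '2': 3, '-': 4}
reverse = {i: c for c, i in toy_vocab.items()}

short_ids = encode('12-x', toy_vocab, padding=8)         # 'x' is not in the vocabulary
long_ids = encode('12-12-12-12', toy_vocab, padding=8)   # longer than the padding
decoded = [reverse[i] for i in short_ids]
print(short_ids)
print(long_ids)
print(decoded)
```

Because the same `<unk>` index is used both for unknown characters and for padding, the decoded sequence cannot distinguish the two. The `<eot>` marker is what tells where the real text ends.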

The `Data` class creates an object that reads the data from a file.

In [39]:
class Data(object):

    def __init__(self, file_name, input_vocabulary, output_vocabulary):
        """
            Creates an object that gets data from a file
            :param file_name: name of the file to read from
        """

        self.input_vocabulary = input_vocabulary
        self.output_vocabulary = output_vocabulary
        self.file_name = file_name

    def load(self):
        """
            Loads data from a file
        """
        self.inputs = []
        self.targets = []

        with open(self.file_name, 'r', encoding='utf-8') as f:
            reader = csv.reader(f)
            for row in reader:
                self.inputs.append(row[0])
                self.targets.append(row[1])

    def transform(self):
        """
            Transforms the data as necessary
        """
        # @TODO: use `pool.map_async` here?
        self.inputs = np.array(list(
            map(self.input_vocabulary.string_to_int, self.inputs)))
        self.targets = map(self.output_vocabulary.string_to_int, self.targets)
        self.targets = np.array(
            list(map(
                lambda x: to_categorical(
                    x,
                    num_classes=self.output_vocabulary.size()),
                self.targets)))

        assert len(self.inputs.shape) == 2, 'Inputs could not properly be encoded'
        assert len(self.targets.shape) == 3, 'Targets could not properly be encoded'

    def generator(self, batch_size):
        """
            Creates a generator that can be used in `model.fit_generator()`
            Batches are generated randomly.
            :param batch_size: the number of instances to include per batch
        """
        instance_id = range(len(self.inputs))
        while True:
            try:
                batch_ids = random.sample(instance_id, batch_size)
                yield (np.array(self.inputs[batch_ids], dtype=int),
                       np.array(self.targets[batch_ids]))
            except Exception as e:
                print('EXCEPTION OMG')
                print(e)
                yield None, None
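
The random batching used by `generator` can be tried in isolation. The following is a minimal sketch, assuming the inputs and targets are NumPy arrays as they are after `transform`. The function name `random_batches`, the `seed` parameter and the toy arrays are illustrative assumptions, not part of the original class.

```python
import random
import numpy as np

def random_batches(inputs, targets, batch_size, seed=None):
    """Yield (inputs, targets) batches forever, picking rows at random."""
    rng = random.Random(seed)
    n = len(inputs)
    while True:
        # no duplicates inside one batch; successive batches are drawn independently
        batch_ids = rng.sample(range(n), batch_size)
        yield inputs[batch_ids], targets[batch_ids]

# Hypothetical toy data laid out like the tutorial's arrays
X = np.arange(20).reshape(10, 2)             # 10 samples, 2 positions each
Y = np.eye(3)[np.zeros((10, 4), dtype=int)]  # 10 samples, 4 steps, 3 classes

gen = random_batches(X, Y, batch_size=4, seed=0)
xb, yb = next(gen)
print(xb.shape, yb.shape)
```

Since every batch is sampled independently, an "epoch" here is just a fixed number of steps rather than one full pass over the data.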

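Testing the data conversion classes

After the `load` method of the `Data` class has run, the human-written date strings are the members of the `inputs` list, and the standard-format date strings are the members of the `targets` list:
<img src='./image/attentiondata1.png' />
<img src='./image/attentiondata2.png' />


After `transform`, `inputs` becomes a matrix of character indices that can serve as the training X, i.e. the feature data:
<br>It has 50 columns because the padding length is 50, and 1000 rows, one per record.
<img src='./image/attentioninputt.gif' />
After `transform`, `targets` becomes a one-hot encoded array that can serve as the training Y, i.e. the classification labels:
<br>Columns 0~12 are the character indices; rows 0~11 correspond to the padding length of 12.
<img src='./image/attentiondatatarget.png' />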
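
The one-hot step can be reproduced with plain NumPy. This is only a sketch of what `to_categorical` does to each encoded target. The helper name `one_hot` and the two encoded rows are hypothetical, chosen to match a padding of 12 and a 13-symbol output vocabulary.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Turn an integer array of shape (...) into one-hot of shape (..., num_classes)."""
    return np.eye(num_classes, dtype='float32')[np.asarray(indices)]

# Hypothetical encoded targets: 2 dates, 12 positions each, indices in 0..12
encoded = np.array([[3, 1, 2, 9, 0, 1, 6, 0, 2, 7, 12, 11],
                    [3, 1, 2, 8, 0, 2, 3, 0, 1, 3, 12, 11]])
targets = one_hot(encoded, num_classes=13)
print(targets.shape)   # (2, 12, 13)
```

Each of the 12 rows of a target holds exactly one 1.0, placed in the column of the character index for that position.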

In [75]:
DATA_FOLDER = './keras_attention/data'
if not os.path.exists(DATA_FOLDER):
    os.makedirs(DATA_FOLDER)
human_vocab_file = os.path.join(DATA_FOLDER, 'human_vocab.json')
machine_vocab_file = os.path.join(DATA_FOLDER, 'machine_vocab.json')
# Human-written dates are fairly long, e.g. 28 September 2018, so the padding is set to 50
input_vocab = Vocabulary(human_vocab_file, padding=50)
# Standard-format dates are short, e.g. 2018-05-12, so a padding of 12 is enough
output_vocab = Vocabulary(machine_vocab_file, padding=12)
datacsv = os.path.join(DATA_FOLDER, 'validation.csv')
ds = Data(datacsv, input_vocab, output_vocab)
ds.load()
ds.transform()
print('input shape: {0}'.format(ds.inputs.shape))
print('target shape: {0}'.format(ds.targets.shape))
# Use a mini-batch size of 32; the data is returned through a generator
g = ds.generator(32)
# print(ds.inputs[0:3])
# print(ds.targets[0:3])
print(ds.inputs[[5,10, 115]].shape)
print(ds.targets[[5,10,12]].shape)
# for i in range(50):
#     print(next(g)[0].shape)
#     print(next(g)[1].shape)

input shape: (1000, 50)
target shape: (1000, 12, 13)
(3, 50)
(3, 12, 13)


## Code for building and training the model

In [91]:
import os
import argparse

Code for testing the model:

In [92]:
import numpy as np

EXAMPLES = ['26th January 2016', '3 April 1989', '5 Dec 09', 'Sat 8 Jun 2017']

def run_example(model, input_vocabulary, output_vocabulary, text):
    encoded = input_vocabulary.string_to_int(text)
    prediction = model.predict(np.array([encoded]))
    prediction = np.argmax(prediction[0], axis=-1)
    return output_vocabulary.int_to_string(prediction)

def run_examples(model, input_vocabulary, output_vocabulary, examples=EXAMPLES):
    predicted = []
    for example in examples:
        print('~~~~~')
        predicted.append(''.join(run_example(model, input_vocabulary, output_vocabulary, example)))
        print('input:',example)
        print('output:',predicted[-1])
    return predicted
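
The decoded output keeps the `<eot>` marker and the `<unk>` padding that follows it. If only the date is wanted, a small post-processing step can cut the character list at the first `<eot>`. The helper `strip_after_eot` below is a hypothetical addition, not part of the original code.

```python
def strip_after_eot(characters, eot='<eot>'):
    """Join decoded characters, dropping <eot> and everything after it."""
    characters = list(characters)
    if eot in characters:
        characters = characters[:characters.index(eot)]
    return ''.join(characters)

# Hypothetical decoder output: a date, the end marker, then padding
decoded = list('2018-12-24') + ['<eot>'] + ['<unk>'] * 3
clean = strip_after_eot(decoded)
print(clean)   # 2018-12-24
```

It could be applied to the list returned by `run_example` before printing.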

In [93]:
WEIGHT_FOLDER = './keras_attention/weights'
if not os.path.exists(WEIGHT_FOLDER):
    os.makedirs(WEIGHT_FOLDER)

In [99]:
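from keras.callbacks import ModelCheckpoint

# Save the weights only when the validation loss improves
checkpointfile = os.path.join(WEIGHT_FOLDER, 'NMT.{epoch:02d}-{val_loss:.2f}.hdf5')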
cp = ModelCheckpoint(checkpointfile,
                     monitor='val_loss',
                     verbose=0,
                     save_best_only=True,
                     save_weights_only=True,
                     mode='auto')
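
The checkpoint file name is a format template: Keras fills the named placeholders with the epoch number and the logged metrics. The expansion can be checked with plain `str.format`. The values for `epoch` and `val_loss` below are made up.

```python
template = 'NMT.{epoch:02d}-{val_loss:.2f}.hdf5'

# Hypothetical values; during training they come from the callback
name = template.format(epoch=7, val_loss=0.0312)
print(name)   # NMT.07-0.03.hdf5
```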

Because the model uses the custom `AttentionDecoder` layer and the custom `all_acc` accuracy metric, `load_model` must be given:
```
custom_objects={"AttentionDecoder":AttentionDecoder,"all_acc": all_acc}
```

In [103]:
from keras.models import load_model

traincsv = os.path.join(DATA_FOLDER, 'training.csv')
validationcsv = os.path.join(DATA_FOLDER, 'validation.csv')
def trainapply(gpu="0", 
               padding=50, 
               training_data=traincsv, 
               validation_data=validationcsv,
               batch_size=32,
               epoch=50):
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # see issue #152
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu
    # Dataset functions
    input_vocab = Vocabulary(os.path.join(DATA_FOLDER, 'human_vocab.json'), padding=padding)
    output_vocab = Vocabulary(os.path.join(DATA_FOLDER, 'machine_vocab.json'),
                              padding=padding)
    
    modelpath = os.path.join(WEIGHT_FOLDER, 'sample_NMT.h5')
    if os.path.exists(modelpath):
        print('load model: {0} begin'.format(modelpath))
        model = load_model(modelpath,
                           custom_objects={
                               "AttentionDecoder":AttentionDecoder,
                               "all_acc": all_acc})
        print('load model: {0} end'.format(modelpath))
    else:
        print('Loading datasets.')
        training = Data(training_data, input_vocab, output_vocab)
        validation = Data(validation_data, input_vocab, output_vocab)
        training.load()
        validation.load()
        training.transform()
        validation.transform()

        print('Datasets Loaded.')
        print('Compiling Model.')
        model = simpleNMT(pad_length=padding,
                          n_chars=input_vocab.size(),
                          n_labels=output_vocab.size(),
                          embedding_learnable=False,
                          encoder_units=256,
                          decoder_units=256,
                          trainable=True,
                          return_probabilities=False)

        model.summary()
        model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['accuracy', all_acc])
        print('Model Compiled.')
        print('Training. Ctrl+C to end early.')

        try:
            model.fit_generator(generator=training.generator(batch_size),
                                steps_per_epoch=100,
                                validation_data=validation.generator(batch_size),
                                validation_steps=100,
                                callbacks=[cp],
                                workers=1,
                                verbose=1,
                                epochs=epoch)
            model.save(modelpath)
        except KeyboardInterrupt as e:
            print('Model training stopped early.')
        print('Model training complete.')

    run_examples(model, input_vocab, output_vocab,
                 examples=['24th December 2018',
                           '3 April 1989',
                           '5 Dec 09',
                           'Sat 8 Jun 2017',
                           '21 03 16',
                           '21 05 39',
                           '23.8.99',
                           '26.12.18',
                           'TUESDAY, August 23, 1993'])

In [104]:
def main():
    trainapply()

In [107]:
main()

Loading datasets.
Datasets Loaded.
Compiling Model.
inputs shape: (?, ?, 512)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 50)                0         
_________________________________________________________________
OneHot (Embedding)           (None, 50, 60)            3600      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 50, 512)           649216    
_________________________________________________________________
attention_decoder_1 (Attenti (None, 50, 13)            938934    
Total params: 1,591,750
Trainable params: 1,588,150
Non-trainable params: 3,600
_________________________________________________________________
Model Compiled.
Training. Ctrl+C to end early.
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50


Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
Model training complete.
~~~~~
input: 24th December 2018
output: 2018-12-24<eot><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
~~~~~
input: 3 April 1989
output: 1989-04-03<eot><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
~~~~~
input: 5 Dec 09
output: 2009-12-05<eot><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
~~~~~
input: Sat 8 Jun 2017
output: 2017-06-08<eot><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>