- author: Lee Meng
- date: 2019-06-03 09:00
- title: A Brief Introduction to Neural Machine Translation: Building the Tower of Babel with Transformer and TensorFlow
- slug: transformer
- tags: 
- description: 
- summary: 
- image: Tour_de_babel.jpg
- image_credit_url: 
- status: draft

<img src="https://www.tensorflow.org/images/tutorials/transformer/transformer.png" width="600" alt="transformer">

Components and implementation order
- Scaled Dot-Product Attention
- Multi-Head Attention 
- Feed Forward
- Residual Connection & Layer Normalization
- Encoder block
- Decoder block
- Encoder
- Decoder
- Positional Encoding
- Transformer

In [1]:
!pip install tf-nightly-gpu-2.0-preview

Collecting tf-nightly-gpu-2.0-preview
  Downloading https://files.pythonhosted.org/packages/41/df/ef509c275be9cf4a8f973da0c8ab355e74a3e17b0633b359a862c6e82fd5/tf_nightly_gpu_2.0_preview-2.0.0.dev20190522-cp36-cp36m-manylinux1_x86_64.whl (349.0MB)
Collecting wrapt>=1.11.1 (from tf-nightly-gpu-2.0-preview)
  Downloading https://files.pythonhosted.org/packages/67/b2/0f71ca90b0ade7fad27e3d20327c996c6252a2ffe88f50a95bba7434eda9/wrapt-1.11.1.tar.gz
Collecting tensorflow-estimator-2.0-preview (from tf-nightly-gpu-2.0-preview)
  Downloading https://files.pythonhosted.org/packages/b7/58/7f14cd2c2f3baf06b91497118260db1db29b40ad61d106bab3efabc47372/tensorflow_estimator_2.0_preview-1.14.0.dev2019052300-py2.py3-none-any.whl (428kB)
Collecting google-pasta>=0.1.6 (from tf-nightly-gpu-2.0-preview)
  Downloading https://files.pythonhosted.org/packages/f9/68/a14620bfb0

In [2]:
import numpy as np
import tensorflow as tf
tf.__version__

'2.0.0-dev20190522'

In [0]:
np.random.seed(9527)
tf.random.set_seed(9527)

## Scaled Dot-Product Attention

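Before implementing multi-head attention, let's first implement the basic attention mechanism.

The attention mechanism can be thought of as a soft database lookup. Given a query Q, we measure how well Q matches every key K, then use those match scores as weights to take a weighted average of the actual values V. The result is the final representation.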


 
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
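To make the lookup analogy concrete, the formula can be traced on a tiny example. The following is a minimal NumPy sketch, not the TensorFlow implementation used in this post; the helper names `softmax_np` and `attention_np` and the toy values are made up for illustration.

```python
import numpy as np

def softmax_np(x, axis=-1):
    # subtract the row maximum first for numerical stability
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_np(q, k, v):
    # illustrative helper: softmax(Q K^T / sqrt(d_k)) V for 2-D inputs
    d_k = k.shape[-1]
    logits = q @ k.T / np.sqrt(d_k)        # (seq_len_q, seq_len_k)
    weights = softmax_np(logits, axis=-1)  # each row sums to 1
    return weights @ v, weights            # (seq_len_q, depth_v)

# three keys, each paired with one value row
k = np.array([[10., 0., 0.],
              [0., 10., 0.],
              [0., 0., 10.]])
v = np.array([[1., 0.],
              [10., 0.],
              [100., 5.]])

# this query lines up with the second key only
q = np.array([[0., 10., 0.]])

out, weights = attention_np(q, k, v)
print(weights)  # almost all of the weight falls on the second key
print(out)      # so the output is close to the second value row
```

Because the query matches only the second key, the weighted average collapses to (almost exactly) the second row of `v`. A query that matched several keys equally would return the mean of their value rows instead.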

In [0]:
def scaled_dot_product_attention(q, k, v, mask=None):
  """Calculate the attention weights.
  q, k, v must have matching leading dimensions.
  k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
  The mask has different shapes depending on its type(padding or look ahead) 
  but it must be broadcastable for addition.
  
  Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable 
          to (..., seq_len_q, seq_len_k). Defaults to None.
    
  Returns:
    output, attention_weights
  """

  matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
  
  # scale matmul_qk
  dk = tf.cast(tf.shape(k)[-1], tf.float32)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

  # add the mask to the scaled tensor.
  if mask is not None:
    scaled_attention_logits += (mask * -1e9)

  # softmax is normalized on the last axis (seq_len_k) so that the scores
  # add up to 1.
  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

  output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

  return output, attention_weights

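A very large negative value passed into the softmax function comes out as a value close to 0. So if we want to hide the last three positions from the softmax, the mask should be `[..., 0, 1, 1, 1]` (positions to be hidden have a mask value of 1). Multiplying the mask by `-1e9` and adding it to `scaled_attention_logits` makes the softmax output at those three positions effectively 0.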
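The masking step can be checked in isolation. Below is a minimal NumPy sketch under the same convention (mask value 1 means "hide this position"); the logits are arbitrary illustrative numbers and `softmax_np` is a hypothetical helper, not part of TensorFlow.

```python
import numpy as np

def softmax_np(x, axis=-1):
    # subtract the maximum first for numerical stability
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([1.0, 2.0, 3.0, 4.0])  # arbitrary example scores
mask = np.array([0.0, 1.0, 1.0, 1.0])    # 1 = hide, 0 = keep

unmasked = softmax_np(logits)
masked = softmax_np(logits + mask * -1e9)

print(unmasked)  # every position receives some weight
print(masked)    # all weight moves to the only visible position
```

Without the mask the largest logit dominates. With the mask, the three hidden positions are pushed so far down that their exponentials underflow to 0, and the remaining position receives all of the weight.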



In [50]:
tf.nn.softmax(tf.constant([1, -1e9, 3]))

<tf.Tensor: id=129, shape=(3,), dtype=float32, numpy=array([0.11920291, 0.        , 0.880797  ], dtype=float32)>

In [53]:
q = tf.constant([[0, 10, 0, 0], 
                 [0, 0, 10, 10]], dtype=tf.float32)
q

<tf.Tensor: id=135, shape=(2, 4), dtype=float32, numpy=
array([[ 0., 10.,  0.,  0.],
       [ 0.,  0., 10., 10.]], dtype=float32)>

In [54]:
k = tf.constant([[0, 10, 0, 0], 
                 [0, 0, 10, 0], 
                 [0, 0, 10, 10]], dtype=tf.float32)
k

<tf.Tensor: id=137, shape=(3, 4), dtype=float32, numpy=
array([[ 0., 10.,  0.,  0.],
       [ 0.,  0., 10.,  0.],
       [ 0.,  0., 10., 10.]], dtype=float32)>

In [60]:
v = tf.random.uniform((3, 10))
v

<tf.Tensor: id=157, shape=(3, 10), dtype=float32, numpy=
array([[0.5907972 , 0.01128781, 0.92228806, 0.07953656, 0.31918705,
        0.5416858 , 0.57252204, 0.9974569 , 0.17398036, 0.5514989 ],
       [0.32853377, 0.23834121, 0.62532985, 0.0153873 , 0.0709399 ,
        0.13619518, 0.8167461 , 0.5599638 , 0.9179418 , 0.7110497 ],
       [0.35725784, 0.5407543 , 0.46235597, 0.75289536, 0.6780722 ,
        0.6773449 , 0.9228561 , 0.94404805, 0.41801345, 0.00916016]],
      dtype=float32)>

In [0]:
# test
matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
scaled_logits = matmul_qk / tf.math.sqrt(tf.cast(k.shape[-1], tf.float32))
test_attn_weights = tf.nn.softmax(scaled_logits, axis=-1)
test_attn = tf.matmul(test_attn_weights, v)

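# real
attn, attn_weights = scaled_dot_product_attention(q, k, v, mask=None)

# compare with absolute differences so that positive and negative
# errors cannot cancel each other out
assert tf.reduce_max(tf.abs(test_attn - attn)) < 1e-6
assert tf.reduce_max(tf.abs(test_attn_weights - attn_weights)) < 1e-6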
assert attn.shape == (q.shape[-2], v.shape[-1])
assert attn_weights.shape == (q.shape[-2], k.shape[-2])
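The checks above only cover the unmasked path. The docstring also mentions a look-ahead mask, so here is a hedged NumPy sketch of what such a mask could look like under the "1 means hide" convention. The helper names `look_ahead_mask_np` and `masked_softmax_np` are assumptions for illustration and are not taken from this post or from TensorFlow.

```python
import numpy as np

def look_ahead_mask_np(size):
    # hypothetical helper: entry (i, j) is 1 when j > i, i.e. when key
    # position j lies in the future of query position i
    return 1.0 - np.tril(np.ones((size, size)))

def masked_softmax_np(logits, mask):
    # same recipe as in the attention function: add mask * -1e9, then softmax
    logits = logits + mask * -1e9
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

mask = look_ahead_mask_np(3)

# equal logits everywhere, so only the mask shapes the result
weights = masked_softmax_np(np.zeros((3, 3)), mask)

print(mask)
print(weights)
```

With equal logits, each query position spreads its weight uniformly over itself and the earlier positions, and puts no weight on later ones. Since the mask has shape `(seq_len_q, seq_len_k)`, it is broadcastable to the logits as the docstring requires.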