# Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation [1]

## Part 1. Paper Summary and Model Implementation

# Summary

As deep neural networks (DNNs) are adopted across many fields, there is active research on bringing feedforward neural networks into phrase-based statistical machine translation (SMT) systems as well. This paper proposes a neural network architecture that can be used as one component of a conventional phrase-based SMT system.

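The paper calls this architecture the RNN Encoder-Decoder. It consists of two RNNs, one acting as an encoder and the other as a decoder. The encoder RNN maps a variable-length input sequence to a fixed-length vector, and the decoder RNN maps that vector back to a variable-length output sequence. The two RNNs are trained jointly to maximize the conditional probability of the output sequence given the input sequence.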
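
Written out, joint training means maximizing the average conditional log-likelihood over the training pairs. The symbols below are introduced only for this formula: $\theta$ stands for the parameters of both RNNs, $(x_n, y_n)$ for the $n$-th input/output sequence pair, and $N$ for the number of pairs.

$$
\max_{\theta}\ \frac{1}{N}\sum_{n=1}^{N}\log p_{\theta}(y_n \mid x_n)
$$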

The paper also introduces a new recurrent unit that later came to be known as the GRU (Gated Recurrent Unit). It is a simplified design inspired by the LSTM and has two gates. The reset gate controls how much of the previous hidden state is used when computing the new (candidate) hidden state from the current input. The update gate controls the balance between the previous hidden state and the new hidden state when computing the next hidden state.

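The proposed method was applied to the WMT'14 English-to-French machine translation task. Moses with its default settings served as the SMT system; adding the proposed architecture to it gave a BLEU score of 33.87 on the test set, 0.57 higher than the baseline system alone.

(Note that the paper does not train the model on the probability of translating English sentences into French sentences. It trains on the probability of translating English phrases into French phrases.)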

<figure align="center">
  <img src="https://drive.google.com/uc?export=view&id=1lz1k42N9XQSfaAb449bhXpCfPfpZm7nv" width=800 />
  <figcaption>Encoder and Decoder Architecture</figcaption>
</figure>

<figure align="center">
  <img src="https://drive.google.com/uc?export=view&id=1fwFTSzQm0yqZD_rWsISIePdg4UDsUcyo" width=900 />
  <figcaption>Example of Network Flow</figcaption>
</figure>

# Models

In [1]:
import torch
import torch.nn as nn

## GRU Module

<figure align="center">
  <img src="https://drive.google.com/uc?export=view&id=1PL067WUxNtDMZiCBa2cGLWWdFMX-iabM" width=350 />
  <figcaption>GRU Architecture [1]</figcaption>
</figure>

$$
\begin{aligned}
r &= \sigma\big(W_r\cdot[x, h^{\langle t-1\rangle}]+b_r\big), \\
z &= \sigma\big(W_z\cdot[x, h^{\langle t-1\rangle}]+b_z\big), \\
\tilde{h}^{\langle t\rangle} &= \tanh\big(W\cdot[x, r\odot h^{\langle t-1\rangle}]+b_{\tilde{h}}\big), \\
h^{\langle t\rangle} &= z \odot h^{\langle t-1\rangle} + (1-z) \odot \tilde{h}^{\langle t\rangle}.
\end{aligned}
$$
<div align="center">Basic Equations of GRU [2]</div>
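
To make the role of the two gates concrete, here is a minimal pure-Python sketch of one GRU step for a scalar input and a scalar hidden state, following the equations above. The function name `gru_step` and the weight layout (each `w_*` is a pair: weight on `x`, weight on the previous state) are illustrative assumptions and are not part of the paper or of the modules defined later.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x, h_prev, w_r, w_z, w_h, b_r=0.0, b_z=0.0, b_h=0.0):
    """One GRU step with scalar input `x` and scalar state `h_prev`.

    Each of w_r, w_z, w_h is a pair (weight on x, weight on the state).
    Returns (h, r, z, h_tilde).
    """
    r = sigmoid(w_r[0] * x + w_r[1] * h_prev + b_r)              # reset gate
    z = sigmoid(w_z[0] * x + w_z[1] * h_prev + b_z)              # update gate
    h_tilde = math.tanh(w_h[0] * x + w_h[1] * (r * h_prev) + b_h)  # candidate
    h = z * h_prev + (1.0 - z) * h_tilde                         # interpolation
    return h, r, z, h_tilde


x, h_prev = 0.5, 0.8
w_r, w_z, w_h = (0.1, 0.2), (0.3, -0.4), (0.7, 0.9)

# Update gate pushed towards 1: the previous state is carried over.
h_keep, _, _, _ = gru_step(x, h_prev, w_r, w_z, w_h, b_z=50.0)
# Update gate pushed towards 0: the state is replaced by the candidate.
h_new, _, _, h_tilde = gru_step(x, h_prev, w_r, w_z, w_h, b_z=-50.0)
# Reset gate pushed towards 0: the candidate no longer depends on h_prev.
_, _, _, h_tilde_reset = gru_step(x, h_prev, w_r, w_z, w_h, b_r=-50.0)

print(h_keep, h_new, h_tilde, h_tilde_reset)
```

The large bias values are only a device for saturating a gate; in a trained model the gates take intermediate values and differ per hidden unit.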

The GRU used in the decoder, however, takes one additional argument: the summary vector $c$.

$$
\begin{aligned}
r &= \sigma\big(W_r\cdot[x, h^{\langle t-1\rangle}, c]+b_r\big), \\
z &= \sigma\big(W_z\cdot[x, h^{\langle t-1\rangle}, c]+b_z\big), \\
\tilde{h}^{\langle t\rangle} &= \tanh\big(W\cdot[x, r\odot h^{\langle t-1\rangle}, r\odot c]+b_{\tilde{h}}\big), \\
h^{\langle t\rangle} &= z \odot h^{\langle t-1\rangle} + (1-z) \odot \tilde{h}^{\langle t\rangle}.
\end{aligned}
$$
<div align="center">Equations of GRU for Decoder [2]</div>
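
The summary vector and the decoder's initial hidden state are themselves computed from hidden states. A sketch in the same notation, assuming $V$ and $V'$ denote learned weight matrices (bias terms omitted), $h^{\langle N\rangle}$ is the encoder's hidden state after the last of $N$ input tokens, and a prime marks decoder quantities:

$$
\begin{aligned}
c &= \tanh\big(V h^{\langle N\rangle}\big), \\
h'^{\langle 0\rangle} &= \tanh\big(V' c\big).
\end{aligned}
$$

In the code below these two maps correspond to `Encoder.linear_summary` and `Decoder.linear_hidden`, which additionally carry bias terms and are applied to the hidden state of every layer.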

In [50]:
class GRULayer(nn.Module):

  def __init__(self, input_size, hidden_size, is_decoder, dtype=torch.float, device='cpu'):
    super(GRULayer, self).__init__()
    self.input_size = input_size
    self.hidden_size = hidden_size
    self.is_decoder = is_decoder
    self.factory_kwargs = {'dtype': dtype, 'device': device}

    # summary_size == hidden_size
    combined_size = input_size + 2 * hidden_size if is_decoder \
      else input_size + hidden_size
    self.linear_reset = nn.Linear(combined_size, hidden_size,
                                  **self.factory_kwargs)
    self.linear_update = nn.Linear(combined_size, hidden_size,
                                   **self.factory_kwargs)
    self.linear_new = nn.Linear(combined_size, hidden_size,
                                **self.factory_kwargs)

  def forward(self, input, hidden=None, summary=None):
    """Args:
        input: torch.Tensor, [seq_len, input_size] or
          [seq_len, batch_size, input_size]
        hidden: torch.Tensor, [hidden_size] or [batch_size, hidden_size]
        summary: torch.Tensor, [hidden_size] or [batch_size, hidden_size]

    Return:
        output: torch.Tensor, [seq_len, hidden_size] or
            [seq_len, batch_size, hidden_size]
        hidden: torch.Tensor, [hidden_size] or [batch_size, hidden_size]
    """
    assert (2 <= len(input.shape) <= 3) and input.size(-1) == self.input_size, \
      "The shape of the `input` should be [seq_len, input_size] or " \
      "[seq_len, batch_size, input_size]"
    assert (not self.is_decoder and summary is None) or \
      (self.is_decoder and hidden is not None and summary is not None), \
      "The GRU for an encoder should not receive a summary vector and for " \
      "a decoder should receive a hidden state and a summary vector."
    assert (hidden is None) or \
      (len(hidden.shape) == len(input.shape) - 1 and \
       hidden.size(-1) == self.hidden_size), \
      "The shape of the `hidden` should be [hidden_size] or " \
      "[batch_size, hidden_size]"
    assert (summary is None) or \
      (len(summary.shape) == len(input.shape) - 1 and \
       summary.size(-1) == self.hidden_size), \
      "The shape of the `summary` should be [hidden_size] or " \
      "[batch_size, hidden_size]"
    
    is_batched = len(input.shape) == 3
    if is_batched:
      seq_len, batch_size, _ = input.shape
      outputs = torch.zeros(seq_len, batch_size, self.hidden_size,
                            **self.factory_kwargs)
      if hidden is None:
        hidden = torch.zeros(batch_size, self.hidden_size,
                             **self.factory_kwargs)
    else:
      seq_len, _ = input.shape
      outputs = torch.zeros(seq_len, self.hidden_size,
                            **self.factory_kwargs)
      if hidden is None:
        hidden = torch.zeros(self.hidden_size,
                             **self.factory_kwargs)
    
    for i in range(seq_len):
      # Reset gate r and update gate z are computed from the concatenated
      # input, previous hidden state, and (decoder only) summary vector.
      if self.is_decoder:
        combined = torch.cat((input[i], hidden, summary), dim=-1)
      else:
        combined = torch.cat((input[i], hidden), dim=-1)
      reset = torch.sigmoid(self.linear_reset(combined))
      update = torch.sigmoid(self.linear_update(combined))

      # Candidate state: the reset gate scales the previous hidden state
      # (and the summary vector in the decoder) before the tanh layer.
      if self.is_decoder:
        combined = torch.cat((input[i], reset * hidden, reset * summary),
                             dim=-1)
      else:
        combined = torch.cat((input[i], reset * hidden), dim=-1)
      new = torch.tanh(self.linear_new(combined))
      # Interpolate between the previous state and the candidate state.
      hidden = update * hidden + (1 - update) * new

      outputs[i] = hidden
    
    return outputs, hidden

In [25]:
class GRU(nn.Module):

  def __init__(self, input_size, hidden_size, num_layers, is_decoder, dtype=torch.float, device='cpu'):
    super(GRU, self).__init__()
    self.input_size = input_size
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.is_decoder = is_decoder
    self.factory_kwargs = {'dtype': dtype, 'device': device}

    layers = \
      [GRULayer(input_size, hidden_size, is_decoder, **self.factory_kwargs)] + \
      [GRULayer(hidden_size, hidden_size, is_decoder, **self.factory_kwargs)
      for _ in range(num_layers - 1)]
    self.layers = nn.ModuleList(layers)

  def forward(self, input, hiddens=None, summarys=None):
    """Args:
        input: torch.Tensor, [seq_len, input_size] or
          [seq_len, batch_size, input_size]
        hiddens: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]
        summarys: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]

    Return:
        output: torch.Tensor, [seq_len, hidden_size] or
            [seq_len, batch_size, hidden_size]
        hidden: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]
    """
    assert (2 <= len(input.shape) <= 3) and input.size(-1) == self.input_size, \
      "The shape of the `input` should be [seq_len, input_size] or " \
      "[seq_len, batch_size, input_size]"
    assert (not self.is_decoder and summarys is None) or \
      (self.is_decoder and hiddens is not None and summarys is not None), \
      "The GRU for an encoder should not receive a summary vector and for " \
      "a decoder should receive a hidden state and a summary vector."
    assert (hiddens is None) or \
      (len(hiddens.shape) == len(input.shape) and \
       hiddens.size(0) == self.num_layers and \
       hiddens.size(-1) == self.hidden_size), \
      "The shape of the `hidden` should be [num_layers, hidden_size] or " \
      "[num_layers, batch_size, hidden_size]"
    assert (summarys is None) or \
      (len(summarys.shape) == len(input.shape) and \
       summarys.size(0) == self.num_layers and \
       summarys.size(-1) == self.hidden_size), \
      "The shape of the `summary` should be [num_layers, hidden_size] or " \
      "[num_layers, batch_size, hidden_size]"

    is_batched = len(input.shape) == 3
    if is_batched:
      seq_len, batch_size, _ = input.shape
      if hiddens is None:
        hiddens = torch.zeros(self.num_layers, batch_size, self.hidden_size,
                              **self.factory_kwargs)
    else:
      seq_len, _ = input.shape
      if hiddens is None:
        hiddens = torch.zeros(self.num_layers, self.hidden_size,
                              **self.factory_kwargs)

    output = input
    next_hiddens = torch.zeros_like(hiddens)
    for i in range(self.num_layers):
      if self.is_decoder:
        output, hidden = self.layers[i](output, hiddens[i], summarys[i])
      else:
        output, hidden = self.layers[i](output, hiddens[i])
      next_hiddens[i] = hidden

    return output, next_hiddens

## Encoder Module

In [4]:
class Encoder(nn.Module):

  def __init__(self, input_size, embed_size, hidden_size, num_rnn_layers,
               padding_index, dtype=torch.float, device='cpu'):
    super(Encoder, self).__init__()
    self.input_size = input_size
    self.embed_size = embed_size
    self.hidden_size = hidden_size
    self.num_rnn_layers = num_rnn_layers
    self.factory_kwargs = {'dtype': dtype, 'device': device}

    self.embedding = nn.Embedding(input_size, embed_size, padding_index,
                                  **self.factory_kwargs)
    self.rnn = GRU(embed_size, hidden_size, num_rnn_layers, is_decoder=False,
                    **self.factory_kwargs)
    self.linear_summary = nn.Linear(hidden_size, hidden_size,
                                    **self.factory_kwargs)

  def forward(self, input, hidden=None):
    """Args:
        input: torch.Tensor, [seq_len] or [seq_len, batch_size]
        hidden (optional): torch.Tensor, [num_rnn_layers, hidden_size] or
          [num_rnn_layers, batch_size, hidden_size]

    Return:
        output: torch.Tensor, [seq_len, hidden_size] or
            [seq_len, batch_size, hidden_size]
        hidden: torch.Tensor, [num_rnn_layers, hidden_size] or
          [num_rnn_layers, batch_size, hidden_size]
        summary: torch.Tensor, [num_rnn_layers, hidden_size] or
          [num_rnn_layers, batch_size, hidden_size]
    """
    embedded = self.embedding(input)
    output, hidden = self.rnn(embedded, hidden)
    summary = torch.tanh(self.linear_summary(hidden))
    return output, hidden, summary

## Decoder Module

In [46]:
class Decoder(nn.Module):

  def __init__(self, embed_size, hidden_size, output_size, num_rnn_layers,
               padding_index, dtype=torch.float, device='cpu'):
    super(Decoder, self).__init__()
    self.embed_size = embed_size
    self.hidden_size = hidden_size
    self.output_size = output_size
    self.num_rnn_layers = num_rnn_layers
    self.num_maxouts = 500
    self.pool_size = 2
    self.stride = 2
    self.factory_kwargs = {'dtype': dtype, 'device': device}

    input_size = output_size
    self.embedding = nn.Embedding(input_size, embed_size, padding_index,
                                  **self.factory_kwargs)
    self.linear_hidden = nn.Linear(hidden_size, hidden_size,
                                   **self.factory_kwargs)
    self.rnn = GRU(embed_size, hidden_size, num_rnn_layers, is_decoder=True,
                    **self.factory_kwargs)
    # Open question: embed_size is used below instead of input_size; is that OK?
    self.linear_maxout = nn.Linear(embed_size + 2 * hidden_size,
                                    self.num_maxouts * self.pool_size,
                                    **self.factory_kwargs)
    self.linear_output = nn.Linear(self.num_maxouts, output_size,
                                   **self.factory_kwargs)

  def forward(self, input, hidden=None, summary=None, max_len=50,
              teacher_forcing_ratio=0.):
    """Args:
        input: torch.Tensor, [seq_len] or [seq_len, batch_size]
        hidden: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]
        summary: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]
        max_len (optional): a non-negative integer
        teacher_forcing_ratio (optional): a float number between 0 and 1

    Return:
        output: torch.Tensor, [max_len, output_size] or
            [max_len, batch_size, output_size]
        hidden: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]
        summary: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]
    """
    #TODO: sample until all rows have more than one EOS
    # input.size(0) == target length
    if self.training: max_len = input.size(0)
      
    is_batched = len(input.shape) == 2
    if is_batched:
      _, batch_size = input.shape
      outputs = torch.zeros(max_len, batch_size, self.output_size,
                            **self.factory_kwargs)
    else:
      outputs = torch.zeros(max_len, self.output_size, **self.factory_kwargs)

    assert summary is not None, "You should give summary vector into the " \
      "decoder"
    if hidden is None:
      hidden = torch.tanh(self.linear_hidden(summary))

    inputs = input
    input_shape = (1, batch_size) if is_batched else (1,)
    input = inputs[0].view(input_shape) # [1] or [1, batch_size]
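    for i in range(1, max_len):
      embedded = self.embedding(input)
      output, hidden = self.rnn(embedded, hidden, summary)
      # Concatenate the top-layer hidden state, the previous token's
      # embedding, and the top-layer summary vector.
      combined = torch.cat((hidden[-1], embedded[0], summary[-1]), dim=-1)
      # [batch_size, embed_size + 2 * hidden_size]
      # -> [batch_size, self.num_maxouts]
      # Maxout: take the max over each group of `pool_size` adjacent units.
      # Reshaping and reducing works for batched and unbatched inputs alike,
      # whereas max_pool1d rejects the 1-D tensor of the unbatched case.
      maxout = self.linear_maxout(combined).view(
          *combined.shape[:-1], self.num_maxouts, self.pool_size
      ).max(dim=-1).values
      output = self.linear_output(maxout) # [batch_size, output_size]
      outputs[i] = output.view(outputs.shape[1:])
      # torch.rand samples uniformly from [0, 1), so teacher forcing is used
      # with probability `teacher_forcing_ratio`.
      if self.training and torch.rand(1).item() < teacher_forcing_ratio:
        # use teacher forcing
        input = inputs[i].view(input_shape)
      else:
        # do not use teacher forcing
        input = output.argmax(dim=-1).view(input_shape)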
          
    return outputs, hidden, summary
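
The decoder's output layer uses maxout units: a linear layer produces `num_maxouts * pool_size` pre-activations, and each output unit is the maximum over one group of `pool_size` adjacent values. A minimal pure-Python sketch of that reduction, where the helper name `maxout` is an assumption made for illustration:

```python
def maxout(pre_activations, pool_size=2):
    """Reduce a flat list by taking the max of each consecutive group."""
    if len(pre_activations) % pool_size != 0:
        raise ValueError("length must be a multiple of pool_size")
    return [
        max(pre_activations[i:i + pool_size])
        for i in range(0, len(pre_activations), pool_size)
    ]


# Six pre-activations with pool_size=2 give three maxout units.
print(maxout([0.1, -0.3, 2.0, 1.5, -1.0, -0.2]))  # [0.1, 2.0, -0.2]
```

With `pool_size = 2` and `num_maxouts = 500`, as set in `Decoder.__init__`, the 1000 pre-activations are reduced to 500 values before the final projection to the vocabulary.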

## A Whole Seq2Seq Module

In [6]:
class Seq2SeqNetwork(nn.Module):

  def __init__(self, input_size, embed_size, hidden_size, output_size,
               num_rnn_layers, padding_index, dtype=torch.float, device='cpu'):
    super(Seq2SeqNetwork, self).__init__()
    self.input_size = input_size
    self.embed_size = embed_size
    self.hidden_size = hidden_size
    self.output_size = output_size
    self.num_rnn_layers = num_rnn_layers
    self.factory_kwargs = {'dtype': dtype, 'device': device}

    self.encoder = Encoder(input_size, embed_size, hidden_size, num_rnn_layers,
                           padding_index, **self.factory_kwargs)
    self.decoder = Decoder(embed_size, hidden_size, output_size, num_rnn_layers,
                           padding_index, **self.factory_kwargs)

  def forward(self, src, trg, max_len=50, teacher_forcing_ratio=0.):
    """Args:
        src: torch.Tensor, [src_len] or [src_len, batch_size]
        trg: torch.Tensor, [trg_len] or [trg_len, batch_size]
        max_len (optional): a non-negative integer
        teacher_forcing_ratio (optional): a float number between 0 and 1

    Return:
        output: torch.Tensor, [trg_len, output_size] or
            [trg_len, batch_size, output_size]
    """
    _, _, summary = self.encoder(src)
    output, _, _ = self.decoder(trg, summary=summary, max_len=max_len,
                             teacher_forcing_ratio=teacher_forcing_ratio)
    return output

  def encode(self, input, hidden=None):
    """Args:
        input: torch.Tensor, [seq_len] or [seq_len, batch_size]
        hidden: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]

    Return:
        output: torch.Tensor, [seq_len, hidden_size] or
            [seq_len, batch_size, hidden_size]
        hidden: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]
        summary: torch.Tensor, [num_layers, hidden_size] or
          [num_layers, batch_size, hidden_size]
    """
    return self.encoder(input, hidden)

  def decode(self, input, hidden=None, summary=None, beam_size=1, max_len=50,
            teacher_forcing_ratio=0.):
    """Args:
        input: torch.Tensor, [seq_len] or [seq_len, batch_size]
        beam_size (optional): a non-negative integer
        max_len (optional): a non-negative integer
        teacher_forcing_ratio (optional): a float number between 0 and 1

    Return:
        output: torch.Tensor, [max_len, output_size] or
            [max_len, batch_size, output_size]
    """
    # The decoder returns (outputs, hidden, summary), so unpack three values.
    output, _, _ = self.decoder(input, hidden, summary, max_len,
                                teacher_forcing_ratio)
    return output
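
The `teacher_forcing_ratio` argument is meant as a probability: at each decoding step the ground-truth token is fed back with that probability, and the model's own prediction otherwise. A minimal sketch of that choice in pure Python, where `choose_next_input` and the string tokens are hypothetical stand-ins for the tensor indexing done in the decoder:

```python
import random


def choose_next_input(gold_token, predicted_token, teacher_forcing_ratio, rng):
    """Return the token to feed into the decoder at the next step."""
    # rng.random() is uniform on [0, 1), so the gold token is chosen with
    # probability `teacher_forcing_ratio`.
    if rng.random() < teacher_forcing_ratio:
        return gold_token
    return predicted_token


rng = random.Random(0)  # fixed seed, only to make the demo repeatable
for ratio in (0.0, 0.5, 1.0):
    picks = [choose_next_input("gold", "pred", ratio, rng)
             for _ in range(10_000)]
    print(ratio, picks.count("gold") / len(picks))
```

The draw has to come from a uniform distribution for the ratio to behave as a probability. A sample from a standard normal distribution is negative about half the time, so comparing it against a ratio of 0 would still select teacher forcing on roughly every second step.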

# Temp

In [7]:
import torch.optim as optim
import math, time

In [8]:
def format_time(start_time, current_time, progress):
  elapsed = int(current_time - start_time)
  elapsed_time = f'{elapsed // 60:2d}m {elapsed % 60:2d}s'
  total = int(elapsed / progress)
  total_time = f'{total // 60:2d}m {total % 60:2d}s'
  return elapsed_time, total_time

def train(dataloader, model, optimizer, loss_fn, verbose=True, print_every=50):
  avg_loss = 0.
  loss_history = []
  model.train()
  for batch, (src, trg) in enumerate(dataloader):
    src, trg = src.to(device), trg.to(device)
    pred = model(src, trg)
    pred = pred[1:].view(-1, pred.size(2))
    trg = trg[1:].view(-1)
    # pred: [trg_len * batch_size, output_size], trg: [trg_len * batch_size]
    loss = loss_fn(pred, trg)

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()

    avg_loss += loss.item()
    loss_history.append(loss.item())
    # Report after every `print_every` batches so the running sum covers
    # exactly `print_every` losses.
    if verbose and (batch + 1) % print_every == 0:
      avg_loss /= print_every
      print(f'> [{(batch + 1) * src.size(1):5d}/{len(dataloader.dataset):5d}]',
            f'loss={avg_loss:1.4f}, ppl={math.exp(avg_loss):7.3f}')
      avg_loss = 0.
    
  return loss_history

def evaluate(dataloader, model, loss_fn, verbose=True):
  avg_loss = 0.
  loss_history = []
  model.eval()
  with torch.no_grad():
    for batch, (src, trg) in enumerate(dataloader):
      src, trg = src.to(device), trg.to(device)
      pred = model(src, trg)
      pred = pred[1:trg.size(0)].view(-1, pred.size(2))
      trg = trg[1:].view(-1)
      # pred: [trg_len * batch_size, output_size], trg: [trg_len * batch_size]
      loss = loss_fn(pred, trg)

      avg_loss += loss.item()
      loss_history.append(loss.item())
  # Average regardless of `verbose` so the returned value is always the mean.
  avg_loss /= len(dataloader)
  if verbose:
    print(f'> [evaluation]  loss={avg_loss:1.4f}, ',
          f'ppl={math.exp(avg_loss):7.3f}')
  return avg_loss, loss_history
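
The `train` function above calls `torch.nn.utils.clip_grad_norm_` before each optimizer step. The idea is to rescale all gradients by one common factor whenever their global L2 norm exceeds a threshold. A minimal pure-Python sketch of that idea on a flat list of gradient values; the helper `clip_by_global_norm` and its `eps` guard against division by zero are illustrative assumptions, not the library's code:

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale `grads` so that their L2 norm does not exceed `max_norm`.

    Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale is capped at 1.0, so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([6.0, 8.0], max_norm=5.0)
print(norm)     # 10.0
print(clipped)  # approximately [3.0, 4.0]
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its length is limited.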

In [9]:
input_size = 8000
output_size = 6000
padding_index = 1
device = 'cuda' if torch.cuda.is_available() else 'cpu'

BATCH_SIZE = 64

EMBED_SIZE = 512
HIDDEN_SIZE = 512
NUM_LAYERS = 4

NUM_EPOCHS = 1
# LEARNING_RATE = 0.001
MAX_GRAD_NORM = 5

In [51]:
model = Seq2SeqNetwork(input_size, EMBED_SIZE, HIDDEN_SIZE, output_size,
                       NUM_LAYERS, padding_index, device=device)
for param in model.parameters():
  nn.init.normal_(param, std=0.01)
print(f'Number of model parameters: {sum(param.numel() for param in model.parameters()):,}')

Number of model parameters: 27,977,240
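
The reported parameter count can be cross-checked by hand from the layer shapes defined in the modules above. The sketch below redoes that arithmetic in plain Python; the helper names (`linear_params`, `gru_params`, `seq2seq_params`) are made up for this check, and the formulas assume every `nn.Linear` has a bias and every `nn.Embedding` has `vocabulary size x embed_size` weights.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def gru_params(input_size, hidden_size, num_layers, is_decoder):
    """Three gate layers (reset, update, new) per stacked GRU layer."""
    total = 0
    for layer in range(num_layers):
        layer_input = input_size if layer == 0 else hidden_size
        # The decoder also concatenates the summary vector (hidden_size wide).
        combined = layer_input + (2 if is_decoder else 1) * hidden_size
        total += 3 * linear_params(combined, hidden_size)
    return total


def seq2seq_params(input_size, embed_size, hidden_size, output_size,
                   num_layers, num_maxouts=500, pool_size=2):
    encoder = (input_size * embed_size                           # embedding
               + gru_params(embed_size, hidden_size, num_layers, False)
               + linear_params(hidden_size, hidden_size))        # linear_summary
    decoder = (output_size * embed_size                          # embedding
               + linear_params(hidden_size, hidden_size)         # linear_hidden
               + gru_params(embed_size, hidden_size, num_layers, True)
               + linear_params(embed_size + 2 * hidden_size,
                               num_maxouts * pool_size)          # linear_maxout
               + linear_params(num_maxouts, output_size))        # linear_output
    return encoder + decoder


print(f"{seq2seq_params(8000, 512, 512, 6000, 4):,}")  # 27,977,240
```

About a third of the parameters sit in the four decoder GRU layers, since each of their gate layers takes a 1536-wide input rather than the encoder's 1024.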


In [52]:
import torch.optim as optim
# Prepare an optimizer and a loss function
optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(ignore_index=padding_index)
# reduction='mean' (the default) -> the loss is averaged over all target
# tokens that are not ignored

In [54]:
import time

src_train = torch.randint(0, input_size, (30, BATCH_SIZE))
trg_train = torch.randint(0, output_size, (35, BATCH_SIZE))
src_val = torch.randint(0, input_size, (30, BATCH_SIZE))
trg_val = torch.randint(0, output_size, (35, BATCH_SIZE))

best_val_loss = float('inf')
train_loss_history = []
val_loss_history = []
start_time = time.time()
for epoch in range(1, NUM_EPOCHS + 1):
  print('-' * 41)
  print(f'Epoch {epoch}')
  print('-' * 41)

  with torch.autograd.set_detect_anomaly(True):
    train_loss_history += train([(src_train, trg_train)], model, optimizer, loss_fn)
  val_loss, loss_history = evaluate([(src_val, trg_val)], model, loss_fn)
  val_loss_history += loss_history
  
  if epoch > NUM_EPOCHS // 2 and val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), f'seq2seq-{val_loss * pow(10, 5):07.0f}.pth')
  
  elapsed_time, total_time = format_time(start_time, time.time(), epoch / NUM_EPOCHS)
  print('-' * 41)
  print(f'End of epoch {epoch} ({elapsed_time}/{total_time})')
  print('-' * 41)
  print()
print('Done!')

-----------------------------------------
Epoch 1
-----------------------------------------
> [evaluation]  loss=8.7000,  ppl=6003.026
-----------------------------------------
End of epoch 1 ( 0m  5s/ 0m  5s)
-----------------------------------------

Done!
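
The evaluation line above (loss 8.7000, perplexity about 6003) is close to what a model guessing uniformly over the 6000-token target vocabulary would score, which seems plausible for weights drawn with a small standard deviation and random training data. A quick check of that reading, with `perplexity` as a hypothetical helper that mirrors the `math.exp(avg_loss)` call in `evaluate`:

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


vocab_size = 6000  # `output_size` in this notebook

# A uniform prediction assigns probability 1/vocab_size to every token,
# so its cross-entropy is log(vocab_size).
uniform_loss = math.log(vocab_size)
print(f"{uniform_loss:.4f}")              # 8.6995
print(f"{perplexity(uniform_loss):.1f}")  # 6000.0
```

A perplexity near the vocabulary size is therefore a sanity check on the untrained model rather than a sign that anything has been learned.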


# References

[1] Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation [[link]](https://doi.org/10.48550/arXiv.1406.1078)