# Basic Seq2Seq Practice
### Source code comes from: '從零開始的 Sequence to Sequence' (Sequence to Sequence from Scratch)
Article： http://zake7749.github.io/2017/09/28/Sequence-to-Sequence-tutorial/ <br>
Github: https://github.com/zake7749/Sequence-to-Sequence-101

## Seq2Seq Model
![](./image/seq2seq.png)

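##### Seq2Seq Model: composed of two sequential models; both the input and the output can be sequence data. It is also called the Encoder-Decoder framework.
1. Sequential models are good at data with sequential structure (text, speech, time series); the common building blocks are RNN, LSTM, and GRU.
2. Encoder: converts the input text into a context vector that the machine can work with.
3. Decoder: converts the context vector back into text that we can understand.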
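
To make the data flow above concrete, here is a minimal shape-only sketch, assuming a recent PyTorch version. The sizes, the shared `embedding`, and the use of token id 0 as a stand-in for SOS are illustrative assumptions, not the author's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not the tutorial's config values)
vocab_size, embed_size, hidden_size = 30, 8, 16
seq_len, batch_size = 5, 3

embedding = nn.Embedding(vocab_size, embed_size)
encoder_rnn = nn.GRU(embed_size, hidden_size)
decoder_rnn = nn.GRU(embed_size, hidden_size)

# Encoder: a batch of token-id sequences with shape (T, B)
source = torch.randint(0, vocab_size, (seq_len, batch_size))
_, context_vector = encoder_rnn(embedding(source))  # final hidden state, (1, B, H)

# Decoder: one step, with the context vector as the initial hidden state
first_token = torch.zeros(1, batch_size, dtype=torch.long)  # stand-in for SOS
step_output, decoder_hidden = decoder_rnn(embedding(first_token), context_vector)

print(context_vector.shape)  # torch.Size([1, 3, 16])
print(step_output.shape)     # torch.Size([1, 3, 16])
```

The encoder's per-step outputs are discarded here; only its last hidden state is passed on, which is what makes it the context vector.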

In [1]:
import torch
import random
import torch.nn as nn
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

'''Helper packages defined by the original author (see the author's GitHub)'''
from dataset.DataHelper import DataTransformer
from config import config

# Building Seq2Seq Model (Combine encoder & decoder)
1. Take the encoder and decoder models as arguments and store them <br>
2. In forward: take the input data / run the encoder forward pass / run the decoder forward pass
3. Evaluation: take the test data => run the encoder forward pass => run the decoder evaluation

In [5]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        # Store the Encoder and Decoder models.
        self.encoder = encoder
        self.decoder = decoder
        

    def forward(self, inputs, targets):
        # Input training data
        input_vars, input_lengths = inputs
        
        # Run the encoder (call the module itself rather than .forward so hooks still run)
        encoder_outputs, encoder_hidden = self.encoder(input_vars, input_lengths)
        
        # Run the decoder
        decoder_outputs, decoder_hidden = self.decoder(context_vector=encoder_hidden, targets=targets)
        return decoder_outputs, decoder_hidden

    
    
    def evaluation(self, inputs):
        # Input test data
        input_vars, input_lengths = inputs
        
        # Run the encoder, then decode greedily from the context vector
        encoder_outputs, encoder_hidden = self.encoder(input_vars, input_lengths)
        decoded_sentence = self.decoder.evaluation(context_vector=encoder_hidden)
        return decoded_sentence

# Encoder Model
An Embedding layer converts the input sequence into vectors, and the hidden state of the RNN at the last time step serves as the context vector. <br>

* Note on the PackedSequence object:<br>
1. In a recurrent neural network, the sequences in a batch usually differ in length, so they cannot be trained as one batch without padding. PyTorch provides a class called PackedSequence to handle this.<br>
http://www.cnblogs.com/lindaxin/p/8052043.html <br>
![](./image/padd.png)
2. Use torch.nn.utils.rnn.pack_padded_sequence to convert a padded Variable into a PackedSequence; to convert it back into a Variable, use torch.nn.utils.rnn.pad_packed_sequence.
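
The round trip described above can be sketched as follows. This is a minimal example assuming a recent PyTorch version (plain tensors instead of `Variable`); the sequence lengths and layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three sequences of lengths 4, 2, 1, zero-padded to shape (T=4, B=3, D=2)
lengths = [4, 2, 1]  # must be sorted in decreasing order by default
padded = torch.zeros(4, 3, 2)
for b, n in enumerate(lengths):
    padded[:n, b] = torch.randn(n, 2)

gru = nn.GRU(input_size=2, hidden_size=5)

packed = pack_padded_sequence(padded, lengths)                 # padded -> PackedSequence
packed_outputs, hidden = gru(packed)                           # RNN skips padded steps
outputs, output_lengths = pad_packed_sequence(packed_outputs)  # back to padded

print(packed.batch_sizes)  # tensor([3, 2, 1, 1])
print(outputs.shape)       # torch.Size([4, 3, 5])
print(output_lengths)      # tensor([4, 2, 1])
```

With packing, `hidden` holds each sequence's state at its own last valid step (for example, `hidden[0, 1]` equals `outputs[1, 1]`), so padding does not leak into the context vector.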

In [6]:
class VanillaEncoder(nn.Module):

    def __init__(self, vocab_size, embedding_size, output_size):
        """Define layers for a vanilla rnn encoder"""
        super(VanillaEncoder, self).__init__()

        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.gru = nn.GRU(embedding_size, output_size) # GRU: one kind of rnn model

    def forward(self, input_seqs, input_lengths, hidden=None):
        # token ids -> embedding vectors
        embedded = self.embedding(input_seqs)
        # padded batch -> PackedSequence (input_lengths must be sorted in decreasing order)
        packed = pack_padded_sequence(embedded, input_lengths)
        # Run the RNN
        packed_outputs, hidden = self.gru(packed, hidden)
        # PackedSequence -> padded batch
        outputs, output_lengths = pad_packed_sequence(packed_outputs)
        return outputs, hidden

    def forward_a_sentence(self, inputs, hidden=None):
        """Deprecated, forward 'one' sentence at a time which is bad for gpu utilization"""
        embedded = self.embedding(inputs)
        outputs, hidden = self.gru(embedded, hidden)
        return outputs, hidden

# Decoder Model

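##### The Decoder is similar to the Encoder, except that besides the input from the Encoder, its output at each time step also becomes the input of the next time step. Key points:

* Flow1. First input: SOS (start of sentence) <br> 
* Flow2. First hidden state: the context vector passed from the Encoder <br>
* Flow3. The Decoder's output at each time step is used as the input of the next time step. [forward_step] runs the RNN; like the Encoder it is a GRU, except that it also produces an output at every time step.<br>
* Training tip: teacher_forcing_ratio is a constant probability (0.5 in this example) used to randomly replace the Decoder's next-step input with the true label, which helps stabilize training.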
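
The teacher forcing choice can be sketched in isolation. This is a minimal illustration, assuming a recent PyTorch version; `next_decoder_input` is a hypothetical helper, not part of the author's code, and it decides per time step, whereas the decoder in this notebook draws the random number once per forward call.

```python
import random
import torch

def next_decoder_input(step_log_probs, target_step, teacher_forcing_ratio):
    """Pick the decoder input for the next time step (hypothetical helper).

    step_log_probs: (B, V) scores for the current step
    target_step:    (B,) ground-truth token ids for the current step
    Returns a (1, B) LongTensor.
    """
    if random.random() < teacher_forcing_ratio:
        return target_step.unsqueeze(0)               # feed the true label
    return step_log_probs.argmax(dim=1).unsqueeze(0)  # feed the model's own guess

scores = torch.tensor([[0.1, 0.7, 0.2],
                       [0.6, 0.3, 0.1]])
truth = torch.tensor([2, 2])

print(next_decoder_input(scores, truth, teacher_forcing_ratio=1.0))  # tensor([[2, 2]])
print(next_decoder_input(scores, truth, teacher_forcing_ratio=0.0))  # tensor([[1, 0]])
```

A ratio of 1.0 always feeds the ground truth and a ratio of 0.0 always feeds the prediction; values in between mix the two.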

In [7]:
class VanillaDecoder(nn.Module):
    def __init__(self, hidden_size, output_size, max_length, teacher_forcing_ratio, sos_id, use_cuda):
        """Define layers for a vanilla rnn decoder"""
        super(VanillaDecoder, self).__init__()

        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.log_softmax = nn.LogSoftmax(dim=1)  # LogSoftmax over the vocab dimension; LogSoftmax + NLLLoss = CrossEntropyLoss

        self.max_length = max_length
        self.teacher_forcing_ratio = teacher_forcing_ratio
        self.sos_id = sos_id
        self.use_cuda = use_cuda
        
        
    def forward_step(self, inputs, hidden):
        '''Run the GRU for one time step.
           Like the Encoder it is a GRU, but it also produces an output at every time step.'''
        # inputs: (time_steps=1, batch_size)
        batch_size = inputs.size(1)
        embedded = self.embedding(inputs)
        # view() returns a new tensor, so the result has to be assigned
        embedded = embedded.view(1, batch_size, self.hidden_size)  # S = T(1) x B x N
        rnn_output, hidden = self.gru(embedded, hidden)  # S = T(1) x B x H
        rnn_output = rnn_output.squeeze(0)  # squeeze the time dimension
        output = self.log_softmax(self.out(rnn_output))  # S = B x O
        # self.out is nn.Linear, i.e. Ax + b
        # self.log_softmax is nn.LogSoftmax; LogSoftmax followed by NLLLoss equals CrossEntropyLoss
        return output, hidden
    
    
    ### Main flow:
    def forward(self, context_vector, targets):
        # Prepare variable for decoder on time_step_0
        target_vars, target_lengths = targets
        batch_size = context_vector.size(1)
        
        ''' Flow1. 
            First input: SOS (start of sentence) ''' 
        decoder_input = Variable(torch.LongTensor([[self.sos_id] * batch_size]))
        
        
        ''' Flow2.
            First hidden state: the context vector from the Encoder '''
        decoder_hidden = context_vector

        max_target_length = max(target_lengths)
        decoder_outputs = Variable(torch.zeros(
            max_target_length,
            batch_size,
            self.output_size
        ))  # (time_steps, batch_size, vocab_size)

        if self.use_cuda:
            decoder_input = decoder_input.cuda()
            decoder_outputs = decoder_outputs.cuda()

            
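        '''Training tip:
          teacher_forcing_ratio is a constant probability (0.5 in this example) used to randomly
          replace the Decoder's next-step input with the true label, which helps stabilize training.
        ''' 
        # Teacher forcing is applied with probability teacher_forcing_ratio
        use_teacher_forcing = random.random() < self.teacher_forcing_ratio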
        
        
        ''' Flow3.
        The Decoder's output at each time step becomes the input of the next time step. [forward_step]
        runs the RNN; like the Encoder it is a GRU, but it also produces an output at every time step.'''
        for t in range(max_target_length):
            decoder_outputs_on_t, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
            decoder_outputs[t] = decoder_outputs_on_t
            
            # Training tip from above
            if use_teacher_forcing:
                decoder_input = target_vars[t].unsqueeze(0)
                # feed the true label back in
            else:
                decoder_input = self._decode_to_index(decoder_outputs_on_t)
                # feed the decoder's own prediction back in

        return decoder_outputs, decoder_hidden
    
    
    
    

    def evaluation(self, context_vector):
        batch_size = context_vector.size(1) # get the batch size
        decoder_input = Variable(torch.LongTensor([[self.sos_id] * batch_size]))
        decoder_hidden = context_vector

        decoder_outputs = Variable(torch.zeros(
            self.max_length,
            batch_size,
            self.output_size
        ))  # (time_steps, batch_size, vocab_size)

        if self.use_cuda:
            decoder_input = decoder_input.cuda()
            decoder_outputs = decoder_outputs.cuda()

        # Unfold the decoder RNN on the time dimension
        for t in range(self.max_length):
            decoder_outputs_on_t, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
            decoder_outputs[t] = decoder_outputs_on_t
            decoder_input = self._decode_to_index(decoder_outputs_on_t)  # select the former output as input

        return self._decode_to_indices(decoder_outputs)
     
    

    def _decode_to_index(self, decoder_output):
        """
        evaluate on the logits, get the index of top1:
        param decoder_output: S = B x V or T x V
        """
        value, index = torch.topk(decoder_output, 1)
        # Returns the k largest elements of the given input tensor along a given dimension.
        index = index.transpose(0, 1)  # S = 1 x B, 1 is the index of top1 class
        if self.use_cuda:
            index = index.cuda()
        return index
    
    

    def _decode_to_indices(self, decoder_outputs):
        """
        Evaluate on the decoder outputs(logits), find the top 1 indices.
        Please confirm that the model is on evaluation mode if dropout/batch_norm layers have been added
        :param decoder_outputs: the output sequence from decoder, shape = T x B x V 
        """
        decoded_indices = []
        batch_size = decoder_outputs.size(1)
        decoder_outputs = decoder_outputs.transpose(0, 1)  # S = B x T x V

        for b in range(batch_size):
            top_ids = self._decode_to_index(decoder_outputs[b])
            decoded_indices.append(top_ids.data[0])
        return decoded_indices
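
The greedy step inside `_decode_to_index` can be checked on a toy tensor. This is a small sketch assuming a recent PyTorch version; the score values are made up.

```python
import torch

# Toy log-probabilities for a batch of 2 over a vocabulary of 3, shape (B, V)
log_probs = torch.tensor([[-2.3, -0.2, -1.9],
                          [-0.1, -3.0, -2.5]])

values, index = torch.topk(log_probs, 1)  # index has shape (B, 1)
index = index.transpose(0, 1)             # (1, B), ready to feed back as decoder input

print(index)  # tensor([[1, 0]])
# topk with k=1 picks the same ids as argmax over the vocabulary dimension
print(torch.equal(index, log_probs.argmax(dim=1).unsqueeze(0)))  # True
```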

# Building Training Object
1. init: initializing seq2seq model, dataset information, optimizer setting
2. train method: Training seq2seq model [num_epochs] times, with [mini_batches]

In [8]:
class Trainer(object):

    def __init__(self, model, data_transformer, learning_rate, use_cuda,
                 checkpoint_name=config.checkpoint_name,
                 teacher_forcing_ratio=config.teacher_forcing_ratio):

        self.model = model #seq2seq model

        # record some information about dataset
        self.data_transformer = data_transformer
        self.vocab_size = self.data_transformer.vocab_size
        self.PAD_ID = self.data_transformer.PAD_ID
        self.use_cuda = use_cuda

        # optimizer setting
        self.learning_rate = learning_rate
        self.optimizer= torch.optim.Adam(self.model.parameters(), lr=learning_rate)
        self.criterion = torch.nn.NLLLoss(ignore_index=self.PAD_ID, size_average=True)

        self.checkpoint_name = checkpoint_name
        
        

    def train(self, num_epochs, batch_size, pretrained=False):

        if pretrained:
            self.load_model()

        step = 0

        for epoch in range(0, num_epochs):
            mini_batches = self.data_transformer.mini_batches(batch_size=batch_size)
            for input_batch, target_batch in mini_batches:
                self.optimizer.zero_grad()
                
                # Run the seq2seq model on the current batch
                decoder_outputs, decoder_hidden = self.model(input_batch, target_batch)

                # calculate the loss and back prop.
                cur_loss = self.get_loss(decoder_outputs, target_batch[0])

                # logging
                step += 1
                if step % 50 == 0:
                    print("Step:", step, "loss of char: ", cur_loss.data[0])
                    self.save_model()
                cur_loss.backward()

                # optimizing parameter
                # torch.optim.Adam(self.model.parameters(), lr=learning_rate)
                self.optimizer.step()
        self.save_model()

        
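    def masked_nllloss(self):
        # Deprecated in PyTorch 2.0, can be replaced by ignore_index
        # Define a masked NLLLoss: a zero weight on PAD_ID removes padding from the loss
        weight = torch.ones(self.vocab_size)
        weight[self.PAD_ID] = 0
        criterion = torch.nn.NLLLoss(weight=weight)
        # Move the criterion (and its weight) to the GPU only when CUDA is enabled
        if self.use_cuda:
            criterion = criterion.cuda()
        return criterion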
    

    def get_loss(self, decoder_outputs, targets):
        b = decoder_outputs.size(1)
        t = decoder_outputs.size(0)
        targets = targets.contiguous().view(-1)  # S = (B*T)
        decoder_outputs = decoder_outputs.view(b * t, -1)  # S = (B*T) x V
        return self.criterion(decoder_outputs, targets)
    

    def save_model(self):
        torch.save(self.model.state_dict(), self.checkpoint_name)
        print("Model has been saved as %s.\n" % self.checkpoint_name)

    def load_model(self):
        self.model.load_state_dict(torch.load(self.checkpoint_name))
        print("Pretrained model has been loaded.\n")

    def tensorboard_log(self):
        pass
    
    

    def evaluate(self, words):
        # make sure that words is list
        if type(words) is not list:
            words = [words]

        # transform word to index-sequence
        eval_var = self.data_transformer.evaluation_batch(words=words)
        decoded_indices = self.model.evaluation(eval_var)
        results = []
        for indices in decoded_indices:
            results.append(self.data_transformer.vocab.indices_to_sequence(indices))
        return results
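
The loss setup in `Trainer` relies on two facts: LogSoftmax followed by NLLLoss matches CrossEntropyLoss on raw scores, and `ignore_index` averages only over non-padded positions. The sketch below checks both on random data, assuming a recent PyTorch version; the value `PAD_ID = 2` and the tensor sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

PAD_ID = 2         # assumed padding id
T, B, V = 4, 2, 6  # time steps, batch size, vocab size

logits = torch.randn(T, B, V)
targets = torch.tensor([[3, 4],
                        [5, 1],
                        [1, PAD_ID],
                        [PAD_ID, PAD_ID]])  # (T, B)

# Flatten the same way get_loss does: (T*B, V) scores against (T*B,) targets
flat_log_probs = F.log_softmax(logits, dim=-1).view(T * B, V)
flat_targets = targets.view(-1)

loss_nll = nn.NLLLoss(ignore_index=PAD_ID)(flat_log_probs, flat_targets)
loss_ce = nn.CrossEntropyLoss(ignore_index=PAD_ID)(logits.view(T * B, V), flat_targets)

# Manual version: mean negative log-probability over non-padded positions only
mask = flat_targets != PAD_ID
picked = flat_log_probs[mask].gather(1, flat_targets[mask].unsqueeze(1))
loss_manual = -picked.mean()

print(torch.allclose(loss_nll, loss_ce))      # True
print(torch.allclose(loss_nll, loss_manual))  # True
```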

# Function to Train the Model
1. Declare encoder model
2. Declare decoder model
3. Declare seq2seq model (by encoder & decoder model)
4. Declare training object
5. Running training object

In [28]:
def main():
    '''
    Helper packages defined by the original author (see the author's GitHub):
    from dataset.DataHelper import DataTransformer
    from config import config
    '''
    use_cuda=False
    data_transformer = DataTransformer(config.dataset_path, use_cuda=use_cuda) # use_cuda=config.use_cuda

    # 1. Declare encoder model
    vanilla_encoder = VanillaEncoder(vocab_size=data_transformer.vocab_size,
                                     embedding_size=config.encoder_embedding_size,
                                     output_size=config.encoder_output_size)
    # 2. Declare decoder model
    vanilla_decoder = VanillaDecoder(hidden_size=config.decoder_hidden_size,
                                     output_size=data_transformer.vocab_size,
                                     max_length=data_transformer.max_length,
                                     teacher_forcing_ratio=config.teacher_forcing_ratio,
                                     sos_id=data_transformer.SOS_ID,
                                     use_cuda=config.use_cuda)
    if config.use_cuda:
        vanilla_encoder = vanilla_encoder.cuda()
        vanilla_decoder = vanilla_decoder.cuda()
        

    # 3. Declare seq2seq model (by encoder & decoder model)
    seq2seq = Seq2Seq(encoder=vanilla_encoder,
                      decoder=vanilla_decoder)

    # 4. Declare training object
    trainer = Trainer(seq2seq, data_transformer, config.learning_rate, config.use_cuda)
    
    # 5. Running training object
    trainer.train(num_epochs=config.num_epochs, batch_size=config.batch_size, pretrained=False)


In [None]:
# Starting to run    
if __name__ == "__main__":
    main()

# Reference:
* 科技大擂台 (Formosa Grand Challenge), PyTorch Seq2Seq session: 
https://fgc.stpi.narl.org.tw/activity/videoDetail/4b1141305df38a7c015e194f22f8015b

* PyTorch documentation: 
http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.htm

###### ==> Further reading: Combining with Attention

# Data

In [33]:
data_transformer = DataTransformer(config.dataset_path, use_cuda=False)
text_data = data_transformer.mini_batches(batch_size=5)

for input_batch, target_batch in text_data:
    print('input_batch with shape:', input_batch[0].shape)
    print(input_batch)
    break

input_batch with shape: torch.Size([14, 5])
(Variable containing:
    6    19     9    20     8
   26     9    24     9    20
    4    24     6    14     1
   13    12    11     1     2
    9    22     1     2     2
    7     1     2     2     2
   13     2     2     2     2
   11     2     2     2     2
   12     2     2     2     2
   10     2     2     2     2
    9     2     2     2     2
   13     2     2     2     2
   16     2     2     2     2
    1     2     2     2     2
[torch.LongTensor of size 14x5]
, [14, 6, 5, 4, 3])
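
The printed batch is time-major `(T, B)`, sorted by decreasing length, with what looks like an end token `1` followed by padding `2`. A batch of that shape could be built roughly as below. This is a sketch under those assumptions, using a recent PyTorch version; `to_padded_batch`, `PAD_ID`, and `EOS_ID` are hypothetical names, and the author's `DataTransformer` may work differently.

```python
import torch

PAD_ID, EOS_ID = 2, 1  # assumed ids, inferred from the printed batch

def to_padded_batch(sequences):
    """Append EOS, sort longest first, and pad to a (T, B) LongTensor."""
    sequences = sorted((seq + [EOS_ID] for seq in sequences), key=len, reverse=True)
    lengths = [len(seq) for seq in sequences]
    max_len = lengths[0]
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]
    # Transpose from (B, T) to (T, B), the default layout for nn.GRU
    return torch.tensor(padded, dtype=torch.long).t(), lengths

batch, lengths = to_padded_batch([[8, 20], [6, 26, 4, 13], [19, 9, 24]])
print(batch)
print(lengths)  # [5, 4, 3]
```

Sorting by decreasing length is what lets the lengths list be passed straight to `pack_padded_sequence`.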


In [47]:
embedding_size= 8
vocab_size=data_transformer.vocab_size
EM = nn.Embedding(vocab_size, embedding_size)
print('embedding_size: ', embedding_size)
print('vocab_size    : ', vocab_size)

embedding_size:  8
vocab_size    :  30


In [54]:
input_vars, input_lengths = input_batch
embedded = EM(input_vars)

embedded     

Variable containing:
(0 ,.,.) = 
 -2.8272 -0.8495 -0.7722 -1.6469 -1.4244 -1.0290  0.4223  2.0485
 -0.5904  0.5569  0.2864 -0.7210  1.0009 -0.9796 -0.5699  0.1710
  1.2649 -1.5756 -0.1523  1.3332 -0.3693 -0.6410  2.5663  1.0958
 -0.7273 -0.1256 -1.8895  0.4864 -0.3066 -0.0811 -0.0353  0.0683
 -0.8637 -0.4147 -0.9887 -1.0099  0.7571 -0.8994  1.3255  0.9284

(1 ,.,.) = 
  0.1070  0.8086  0.6861 -0.0120  0.4722  1.7426  1.9810  1.5739
  1.2649 -1.5756 -0.1523  1.3332 -0.3693 -0.6410  2.5663  1.0958
 -0.7698 -1.1971 -0.4080 -1.6318  0.3038  1.4950  0.4344 -0.9362
  1.2649 -1.5756 -0.1523  1.3332 -0.3693 -0.6410  2.5663  1.0958
 -0.7273 -0.1256 -1.8895  0.4864 -0.3066 -0.0811 -0.0353  0.0683

(2 ,.,.) = 
 -0.7275 -0.0046 -0.3437 -0.7898  1.7581 -0.9436 -0.2465  0.7758
 -0.7698 -1.1971 -0.4080 -1.6318  0.3038  1.4950  0.4344 -0.9362
 -2.8272 -0.8495 -0.7722 -1.6469 -1.4244 -1.0290  0.4223  2.0485
  1.4335  1.1976 -0.1213  0.2545  0.6740  0.6193  0.5124  0.2051
 -1.4960  1.6445 -0.5717 -0.274

# PackedSequence

In [3]:
x = torch.FloatTensor([[1,2,3],[4,5,6],[7,8,9], [10, 11, 12]]).view(2, 3, 2)
# (time_steps, batch_size, data_dim)
x


(0 ,.,.) = 
   1   2
   3   4
   5   6

(1 ,.,.) = 
   7   8
   9  10
  11  12
[torch.FloatTensor of size 2x3x2]

In [6]:
pack_padded_sequence(x, [2, 2, 2])

PackedSequence(data=
  1   2
  3   4
  5   6
  7   8
  9  10
 11  12
[torch.FloatTensor of size 6x2]
, batch_sizes=[3, 3])
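
With equal lengths nothing is dropped, so the effect of packing is easier to see when the lengths differ. The sketch below reuses the same tensor with lengths `[2, 1, 1]`, assuming a recent PyTorch version (where `batch_sizes` is a tensor rather than a list).

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Same values as x above: shape (T=2, B=3, D=2)
x = torch.arange(1, 13, dtype=torch.float32).view(2, 3, 2)

# Only the first sequence has a second time step; the other two stop after one
packed = pack_padded_sequence(x, [2, 1, 1])

print(packed.data)
# tensor([[1., 2.],
#         [3., 4.],
#         [5., 6.],
#         [7., 8.]])
print(packed.batch_sizes)  # tensor([3, 1])
```

The rows `[9, 10]` and `[11, 12]` are treated as padding and left out, and `batch_sizes` records that three sequences are active at step 0 and one at step 1.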

In [78]:
from torch.nn.utils.rnn import PackedSequence  # needed by the return statement

def pack_padded_sequence(input, lengths, batch_first=False):
    """Packs a Variable containing padded sequences of variable length.

    Input can be of size ``TxBx*`` where T is the length of the longest sequence
    (equal to ``lengths[0]``), B is the batch size, and * is any number of
    dimensions (including 0). If ``batch_first`` is True ``BxTx*`` inputs are expected.

    Arguments:
        input (Variable): padded batch of variable length sequences.
        lengths (list[int]): list of sequences lengths of each batch element.
        batch_first (bool, optional): if True, the input is expected in BxTx*
            format.

    Returns:
        a :class:`PackedSequence` object
    """
    if lengths[-1] <= 0:
        raise ValueError("length of all samples has to be greater than 0, "
                         "but found an element in 'lengths' that is <=0")
    if batch_first:
        input = input.transpose(0, 1)

    steps = []
    batch_sizes = []
    lengths_iter = reversed(lengths)
    current_length = next(lengths_iter)
    batch_size = input.size(1)
    if len(lengths) != batch_size:
        raise ValueError("lengths array has incorrect size")

    for step, step_value in enumerate(input, 1):
        steps.append(step_value[:batch_size])
        batch_sizes.append(batch_size)

        while step == current_length:
            try:
                new_length = next(lengths_iter)
            except StopIteration:
                current_length = None
                break

            if current_length > new_length:  # remember that new_length is the preceding length in the array
                raise ValueError("lengths array has to be sorted in decreasing order")
            batch_size -= 1
            current_length = new_length
        if current_length is None:
            break
    return PackedSequence(torch.cat(steps), batch_sizes)