## The basic sequence-to-sequence model
Here you'll implement the most basic sequence-to-sequence model. It consists of two GRUs and a classification layer, plus one embedding layer each for the source and target vocabularies.

There will be a lot of hyperparameters, so it is convenient to pass a single options variable (which stores all hyperparameters) to each module's constructor, similar to what you've done with the params.json file.
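
One possible way to build such an options object is sketched below. It assumes params.json holds a flat JSON object; the key names (`src_vocab_size`, `tgt_vocab_size`, `D_emb`, `D_enc`) are placeholders chosen for illustration and are not taken from the course files.

```python
import json
from types import SimpleNamespace

# Hypothetical contents of params.json; the keys are illustrative only.
params_text = '{"src_vocab_size": 8000, "tgt_vocab_size": 6000, "D_emb": 64, "D_enc": 128}'

# Turn the JSON object into an options object with attribute access.
hp = SimpleNamespace(**json.loads(params_text))

# Derived values can be attached afterwards, e.g. the decoder width.
hp.D_dec = 2 * hp.D_enc

print(hp.D_enc, hp.D_dec)
```

Each module can then take `hp` as its only constructor argument and read what it needs, e.g. `hp.D_enc`.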

### Encoder 
The encoder transforms the source input sequence into features that are passed to the decoder for generating a sequence.

### Homework
Write a module for the encoder that:
1. Takes in srcBatch == (German word indices, GERMAN_original_lengths) as input.
2. Consists of 
    - Embedding layer
    - Bi-directional, 1 layer GRU, use packed sequence
3. Outputs the GRU output and its last hidden state. 
4. Has an output of dimension ~ [B,S,2*D_enc], where D_enc is the hidden size of the GRU per direction; the factor of two comes from the two directions (PyTorch concatenates them for you, so you don't need to do anything to get 2*D_enc). The last hidden state has dimensions [2,B,D_enc].

Even though for now we only use the GRU's last hidden state and not its output, the encoder still returns the output. We will need it once we start using the attention mechanism.
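
The shapes listed above can be checked on a toy batch. The following is a minimal sketch, not the homework solution: the sizes, the token indices, and the choice of 0 as the padding index are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
B, S, V, D_emb, D_enc = 3, 5, 20, 8, 6   # illustrative sizes

# Padded source batch (0 is assumed to be the padding index) and true lengths.
src = torch.tensor([[4, 7, 9, 3, 2],
                    [5, 6, 2, 0, 0],
                    [8, 2, 0, 0, 0]])
lengths = torch.tensor([5, 3, 2])

embedding = nn.Embedding(V, D_emb, padding_idx=0)
gru = nn.GRU(D_emb, D_enc, num_layers=1, bidirectional=True, batch_first=True)

# Pack the embedded batch so the GRU skips the padded positions.
packed = pack_padded_sequence(embedding(src), lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, h = gru(packed)

# Unpack back to a padded tensor of the original length S.
output, _ = pad_packed_sequence(packed_out, batch_first=True, total_length=S)

print(output.shape)  # [B, S, 2*D_enc]
print(h.shape)       # [2, B, D_enc]
```

After unpacking, the positions beyond each sequence's true length are filled with zeros, which is one way to confirm that the packing took effect.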

In [2]:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

In [3]:
class Att(nn.Module):
    
    def __init__(self):
        super().__init__()
        
        self.p = torch.nn.Parameter(torch.rand(1,4,5))
        


In [4]:
def f(*args):
    return args[0]

In [5]:
x = torch.rand(4, 5, 5)
y = torch.rand(1, 4, 5)
z = torch.rand(10, 1, 1)

In [9]:
a = torch.tensor([[[1,2],[3,4]]])
b = torch.tensor([[10,0],[0,10]])

In [10]:
a*b

tensor([[[10,  0],
         [ 0, 40]]])

In [4]:
L = []
y_1 = y
for i in range(x.shape[1]):
    out, y = g(x[:,i:i+1,:], y)
    L.append(out)
b = torch.cat(L, dim=1)

a, h = g(x, y_1)

In [7]:
a.shape, h.shape

(torch.Size([4, 5, 5]), torch.Size([1, 4, 5]))

In [6]:
b - a

tensor([[[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]]], grad_fn=<SubBackward0>)

In [33]:
a

tensor([[[ 0.1493, -0.3334,  0.2257, -0.0451, -0.0450],
         [ 0.1625, -0.3466,  0.2309, -0.0137, -0.0020],
         [ 0.1940, -0.3031,  0.2657,  0.0224,  0.0411],
         [ 0.1307, -0.3988,  0.2248, -0.0576, -0.0237],
         [ 0.1050, -0.4080,  0.2241, -0.1063, -0.0689]],

        [[ 0.1397, -0.4140,  0.2234,  0.0087,  0.1228],
         [ 0.1276, -0.4129,  0.2084,  0.0831,  0.0670],
         [ 0.1678, -0.3154,  0.2277,  0.0360,  0.0652],
         [ 0.0765, -0.4434,  0.2526, -0.0589,  0.0045],
         [ 0.1137, -0.3908,  0.2221, -0.0983,  0.0361]],

        [[ 0.0646, -0.3752,  0.3508, -0.1250,  0.0646],
         [ 0.0869, -0.3480,  0.3156, -0.0587,  0.0016],
         [ 0.0392, -0.4347,  0.3001, -0.1704,  0.0179],
         [ 0.1090, -0.4263,  0.2620, -0.0398,  0.1505],
         [ 0.0754, -0.4265,  0.3009, -0.0909,  0.1143]],

        [[ 0.1675, -0.3646,  0.3311,  0.0877,  0.2010],
         [ 0.1769, -0.3078,  0.3357,  0.0359,  0.1372],
         [ 0.1533, -0.3725,  0.3445,  0.06

In [62]:
L

[tensor([[[0.3234, 0.2675, 0.3116, 0.3432, 0.5607]],
 
         [[0.1133, 0.1946, 0.6486, 0.4176, 0.2496]]],
        grad_fn=<TransposeBackward0>),
 tensor([[[ 0.2260,  0.0588,  0.2708,  0.3016,  0.2143]],
 
         [[ 0.0861,  0.1145,  0.4533,  0.0516, -0.0155]]],
        grad_fn=<TransposeBackward0>),
 tensor([[[ 0.1294, -0.0170,  0.3525,  0.0330, -0.0804]],
 
         [[-0.0727, -0.0591,  0.3707, -0.1014, -0.2206]]],
        grad_fn=<TransposeBackward0>)]

In [30]:
y_1 = y.reshape(1, -1, 10)
y_2 = y.reshape(1, 2, 10)

In [31]:
y_1 == y_2

tensor([[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]], dtype=torch.uint8)

In [24]:

g(x, y)

(tensor([[[ 0.3013,  0.5212,  0.5297,  0.4256,  0.1757, -0.1173,  0.2165,
           -0.0381,  0.3361,  0.0966]],
 
         [[ 0.6927,  0.3957,  0.2995,  0.3448,  0.4576, -0.1258,  0.1210,
           -0.0139,  0.3930,  0.2677]]], grad_fn=<TransposeBackward0>),
 tensor([[[ 0.3013,  0.5212,  0.5297,  0.4256,  0.1757],
          [ 0.6927,  0.3957,  0.2995,  0.3448,  0.4576]],
 
         [[-0.1173,  0.2165, -0.0381,  0.3361,  0.0966],
          [-0.1258,  0.1210, -0.0139,  0.3930,  0.2677]]],
        grad_fn=<StackBackward>))

In [18]:
g(x, y)

(tensor([[[-0.0172,  0.3889,  0.2774,  0.3145,  0.2928, -0.1933, -0.0536,
           -0.0997,  0.4987,  0.3862]],
 
         [[ 0.0598,  0.5259,  0.2992,  0.1735,  0.2973, -0.0458, -0.1524,
            0.2355,  0.4287,  0.1915]]], grad_fn=<TransposeBackward0>),
 tensor([[[-0.0172,  0.3889,  0.2774,  0.3145,  0.2928],
          [ 0.0598,  0.5259,  0.2992,  0.1735,  0.2973]],
 
         [[-0.1933, -0.0536, -0.0997,  0.4987,  0.3862],
          [-0.0458, -0.1524,  0.2355,  0.4287,  0.1915]]],
        grad_fn=<StackBackward>))

### Decoder
Rather than feeding in the whole sequence as in the encoder, we pick the word indices from tgtBatch at time $t$ and loop over $t$. The output of the decoder is the conditional log-likelihood of each word,

$\log p(y_t | y_{0:t-1}, s_{t-1})$, 

where $y_t$ is the $t$-th word and $s_{t-1}$ is the decoder hidden state from the previous time step. For the first step, the state is initialized with $h_S$, the last hidden state of the **encoder**. In other words, the task is to predict the next word given a partial sentence. 

The steps to implement this are:

1. Write a decoder module just as you did for the encoder (this homework).
2. Once you have a decoder module, pick out the $t$-th word indices from tgtBatch and use them as the decoder input. Keep track of the last decoder hidden state $s_t$ from the GRU (next homework).
3. Loop through $t$ (next homework).

We don't need to use packed sequences here, as we are passing batches of tokens (length 1) step by step. Padding is taken care of by masking in the loss function.
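
Stepping a GRU one token at a time, while carrying the hidden state forward, gives the same outputs as one call on the whole sequence. The sketch below checks this on random tensors; the sizes are arbitrary, and `x` and `s0` merely stand in for embedded target tokens and the encoder state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, D = 4, 5, 5                      # illustrative sizes
gru = nn.GRU(D, D, batch_first=True)   # uni-directional, 1 layer
x = torch.rand(B, T, D)                # stands in for embedded target tokens
s0 = torch.rand(1, B, D)               # stands in for the initial decoder state

# Whole sequence in one call.
full_out, full_state = gru(x, s0)

# One time step at a time, carrying the hidden state forward.
state, steps = s0, []
for t in range(T):
    out_t, state = gru(x[:, t:t+1, :], state)   # out_t ~ [B,1,D]
    steps.append(out_t)
step_out = torch.cat(steps, dim=1)              # [B,T,D]

# The two results agree up to floating-point error.
print(torch.allclose(full_out, step_out, atol=1e-5))
print(torch.allclose(full_state, state, atol=1e-5))
```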

### Homework
Write a module for the decoder, without the classification layer, that:
1. Takes in tgtBatch == (English word indices, ENGLISH_original_lengths) as input
2. Consists of 
    - Embedding layer
    - Uni-directional, 1 layer GRU, **do not** use packed sequence
3. The decoder outputs the GRU output, and its last hidden state. 
4. We will add the classification layer in the loop over $t$, outside of this module.

The decoder output has dimension ~ [B,T,D_dec], where D_dec = 2*D_enc. The last hidden state has dimensions [1,B,D_dec]. The output dimension of the decoder, D_dec, is fixed by the encoder output dimension, because we will be using the encoder's last hidden state as the decoder's initial hidden state at $t=0$.

In [None]:
### Insert code here 

## Putting everything together
Here, you will combine the encoder and decoder modules, run the decoder in a loop over $t$, and add a classification layer on top to form the sequence-to-sequence model.

### The loss function
The loss function is the negative log-joint-likelihood of the sequence,

$L = -\sum_{t=0}^{T-1}  \log p(y_t | y_{0:t-1}, s_{t-1})$,

i.e. the sum of the cross-entropy losses over time. For the implementation, you have to pass the argument ignore_index=0 to the CrossEntropyLoss constructor. It tells PyTorch to ignore the padded indices (presumably 0) by masking. You may also need to set reduction='none' (lowercase) to stop PyTorch from averaging over the batch, which is the default, before you sum over $t$. 

See the options for the CrossEntropyLoss module:
https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss
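
A small sketch of how these two options behave at a single time step is given below. The logits are random, the sizes are arbitrary, and index 0 is assumed to be the padding index as above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, V = 3, 7                                   # illustrative sizes
logits = torch.randn(B, V)                    # classifier output at one time step
target = torch.tensor([4, 0, 2])              # 0 is assumed to be the padding index

loss_layer = nn.CrossEntropyLoss(ignore_index=0, reduction='none')
loss_t = loss_layer(logits, target)           # [B], one value per sequence

# The padded entry (second sequence) contributes exactly zero.
print(loss_t)

# Same numbers from the definition: -log softmax at the target index,
# with the padded positions masked out by hand.
manual = -torch.log_softmax(logits, dim=1)[torch.arange(B), target]
manual[target == 0] = 0.0
print(torch.allclose(loss_t, manual))
```

With reduction='none' the per-sequence losses can be accumulated over $t$ first and averaged over the batch afterwards.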

### Training method - Teacher Forcing
During training we feed the known target word as the decoder input, rather than the previously predicted word. This method is called **teacher forcing**, and it stabilizes training in the early stages. However, it causes *exposure bias*: training no longer matches sequence generation, where the previously generated word is used as input.
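
A rough sketch of a teacher-forced time loop on a toy batch follows. Several things here are assumptions rather than part of the assignment: the token conventions (1 = `<sos>`, 2 = `<eos>`, 0 = padding), the sizes, the zero initial state standing in for the encoder state, and the choice to use the word at $t+1$ as the label for the step that reads the word at $t$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, V, D_emb, D_dec = 2, 4, 10, 6, 8        # illustrative sizes
# Toy target batch: 1 = <sos>, 2 = <eos>, 0 = padding (assumed conventions).
tgt = torch.tensor([[1, 5, 6, 2],
                    [1, 7, 2, 0]])

embedding = nn.Embedding(V, D_emb, padding_idx=0)
decoder = nn.GRU(D_emb, D_dec, batch_first=True)
classifier = nn.Linear(D_dec, V)
loss_layer = nn.CrossEntropyLoss(ignore_index=0, reduction='none')

state = torch.zeros(1, B, D_dec)              # stands in for the encoder state
loss = torch.zeros(B)
for t in range(T - 1):
    dec_in = tgt[:, t:t+1]                    # ground-truth word at t, [B,1]
    dec_out, state = decoder(embedding(dec_in), state)   # dec_out ~ [B,1,D_dec]
    logits = classifier(dec_out.squeeze(1))   # [B,V]
    loss = loss + loss_layer(logits, tgt[:, t+1])        # next word is the label
loss = loss.mean()                            # average over the batch
loss.backward()
print(loss.item())
```

The decoder never sees its own predictions here, which is exactly the mismatch with generation that the exposure-bias remark refers to.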

### Homework
1. Write a Seq2Seq module that takes in a tuple (srcBatch, tgtBatch) and outputs the log-joint-likelihood loss, summed over all $t$ on the target batch. If you are confused about how to do the time loop, look at the code at the end of the notebook for an idea; about half of it is relevant. 


2. The attributes of the Seq2Seq module class include (but are not limited to)
    - self.encoder = ...
    - self.decoder = ...
    - self.lossLayer = ...    


3. The forward function in the module class proceeds in the following order:
    1. Get the output and last hidden state, h, from the encoder. The dimensions should be output ~ [B,S,2*D_enc], h ~ [2,B,D_enc].
       
    2. Set the initial decoder last_state to $h$ with the two directions concatenated. You'll need to rearrange h to [1,B,2*D_enc], keeping both directions of the same batch entry together.
    
    3. Loop through t:
        - Get decoder_input **from tgtBatch** at time t, decoder_input ~ [B,1], a sequence of length 1.
        - Get decoder_output, next_state from self.decoder(decoder_input, last_state)
        - Get loss_t at time t by applying the classification layer and then self.lossLayer (CrossEntropyLoss applies the log-softmax internally).
        - Accumulate loss_t over t
        - Set last_state = next_state

    4. The loss is the sum of loss_t over all t; average this over the batch.
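
The rearrangement of h in step 2 is easy to get subtly wrong, because a plain reshape produces the right shape but mixes different batch entries. The sketch below uses a small tensor of consecutive numbers as a stand-in for h, so the layout can be checked by eye; the sizes are arbitrary.

```python
import torch

B, D_enc = 3, 2                                   # illustrative sizes
# Stand-in for the encoder's last hidden state, h ~ [2,B,D_enc];
# h[0] is the forward direction and h[1] the backward direction.
h = torch.arange(2 * B * D_enc, dtype=torch.float32).reshape(2, B, D_enc)

# Move the batch dimension first, then merge (direction, feature) into one axis.
s0 = h.transpose(0, 1).reshape(B, 2 * D_enc).unsqueeze(0)   # [1,B,2*D_enc]

# Reference: explicitly concatenate the two directions per batch entry.
expected = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)

# A plain reshape keeps the shape right but mixes different batch entries.
naive = h.reshape(1, B, 2 * D_enc)

print(torch.equal(s0, expected))     # True
print(torch.equal(naive, expected))  # False
```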
       

In [None]:
### Insert code here 

### Training
At this stage, you should have a module that outputs the loss tensor given a batch of source and target sequences. Training then proceeds as usual: call backward() on the loss, step the optimizer, and so on. 

### Homework
Train the model, print out the training and validation loss. Note that we're still using 

In [None]:
### Insert code here 

### Bonus: Greedy search, example
The simplest way to generate a sequence is greedy search, which picks only the most likely word at each time step.

We cannot reuse the training code here, because the decoder input is now the previously generated word rather than the ground truth, which we do not have access to at generation time.

Pseudocode for the greedy search method is given below. Beam search follows a similar structure. The translator object is created right after the model declaration, e.g.

model = Net(hyperparameters)

translator = Translator(hyperparameters, model)

The model passed here is a reference to the model (and its parameters), so the translator object can be used to generate sequences during training.

In the code below, I do a few things besides picking the best words:
- Force a sequence to keep generating the EOS token once its previous word is EOS.
- Terminate when every sequence in the batch ends with EOS.

The function translate() returns the generated indices for the batch. The inverse of the word2idx dictionary is used to map them back to real words.

In [41]:
#import beam_search
import torch

# Sketch of greedy translation. It assumes the model exposes encoder, decoder
# and classifier modules as in the homework above; it may still need adapting.
class Translator(object):
    def __init__(self, hp, model):
        self.model = model
        # Note: this switches the shared model to eval mode;
        # call model.train() again before resuming training.
        self.model.eval()
        self.hp = hp  # hyperparameters

    @torch.no_grad()
    def translate(self, srcBatch, tgtBatch, EOS=2):
        # tgtBatch supplies the <sos> tokens; otherwise it is only needed
        # for evaluation scores (e.g. BLEU)
        B = srcBatch[0].shape[0]

        # (1) run the encoder on the src
        x, h = self.model.encoder(srcBatch)  # h ~ [2,B,D_enc]

        # Initialize the translation with the <sos> tokens, [B,1]
        translation = tgtBatch[0][:, 0].unsqueeze(1)
        # Mask of sequences that have already produced EOS, [B]
        finished = torch.zeros(B, dtype=torch.bool, device=translation.device)
        # Initialize the decoder input, all SOS tokens
        dec_in = translation
        # Initialize the decoder state by concatenating the two directions,
        # [2,B,D_enc] -> [1,B,2*D_enc]
        last_state = h.transpose(0, 1).reshape(B, -1).unsqueeze(0).contiguous()

        # (2) loop through the decoder time steps
        for t in range(self.hp.T_dec):

            # Model operations
            dec_out, last_state = self.model.decoder(dec_in, last_state)  # dec_out ~ [B,1,D_dec]
            logit = self.model.classifier(dec_out.squeeze(1))  # logit ~ [B,vocab_size]

            # Greedy search
            next_words = logit.argmax(dim=1)  # picking the best word, [B]

            # Force EOS if the previous word was EOS as well
            next_words[finished] = EOS

            # Append the chosen words along the time axis, [B,t+2] including <sos>
            translation = torch.cat([translation, next_words.unsqueeze(1)], dim=1)

            # Set up for the next time step
            finished = next_words.eq(EOS)  # used to force EOS in the next time step
            dec_in = next_words.unsqueeze(1)  # generated words are the next decoder input, [B,1]

            # Termination condition: every sequence in the batch has emitted EOS
            if finished.all(): break
        return translation