## Building a deep learning calculator

In this homework we will build a calculator with a sequence-to-sequence (seq2seq) model. The input is an equation, and the network generates its solution.

### Data Generation

In this task we will generate our own data! We will use three operators (addition, subtraction and multiplication) and work with non-negative integers in some range. Here are examples of correct inputs and outputs:

    Input: '1+2'
    Output: '3'
    
    Input: '0-99'
    Output: '-99'

*Note that there are no spaces between operators and operands.*
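To make the input/output format concrete, here is a minimal sketch of a reference solver that parses an equation string in this format and returns the solution string. The names `solve`, `OPERATORS` and `EQUATION_RE` are hypothetical helpers introduced for illustration; they are not part of the assignment code.

```python
import operator
import re

# Hypothetical helpers, not part of the assignment template.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
# Two non-negative integers with exactly one operator in between, no spaces.
EQUATION_RE = re.compile(r"(\d+)([+\-*])(\d+)")

def solve(equation):
    """Return the solution of an equation such as '1+2' as a string."""
    match = EQUATION_RE.fullmatch(equation)
    if match is None:
        raise ValueError(f"malformed equation: {equation!r}")
    left, op, right = match.groups()
    return str(OPERATORS[op](int(left), int(right)))

print(solve("1+2"))
print(solve("0-99"))
```

A solver like this can also serve as a ground truth when checking the strings the network produces.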




In [2]:
import random

In [3]:
def generate_equations(allowed_operators, dataset_size, min_value, max_value):
    """Generates pairs of equations and solutions to them.
    
       Each equation has a form of two integers with an operator in between.
       Each solution is an integer with the result of the operation.
    
        allowed_operators: list of strings, allowed operators.
        dataset_size: an integer, number of equations to be generated.
        min_value: an integer, min value of each operand.
        max_value: an integer, max value of each operand.

        result: a list of tuples of strings (equation, solution).
    """
    sample = []
    for _ in range(dataset_size):
        left_operand = str(random.randint(min_value, max_value))
        right_operand = str(random.randint(min_value, max_value))
        operator = random.choice(allowed_operators)
        operation = left_operand+operator+right_operand
        sample.append((operation, str(eval(operation))))
    return sample

In [4]:
from sklearn.model_selection import train_test_split
allowed_operators = ['+', '-','*']
dataset_size = 100000
data = generate_equations(allowed_operators, dataset_size, min_value=0, max_value=10000)

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

In [5]:
train_set[0:10]

[('7687-1741', '5946'),
 ('1576+4576', '6152'),
 ('4145*5128', '21255560'),
 ('4420+8012', '12432'),
 ('7131-2368', '4763'),
 ('5702-1497', '4205'),
 ('5516+6848', '12364'),
 ('5149-4854', '295'),
 ('9690*5204', '50426760'),
 ('4994*4677', '23356938')]

### Building vocabularies and tokenization function

We now need to build vocabularies that map strings to token ids and vice versa. We'll need these when we feed training data into the model and when we convert output matrices back into strings.

Pay close attention to the special tokens you need to add to the vocabulary:


*    End of equation / solution token
*    Beginning of equation / solution token
*    Padding token

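Please note that we do not need the `<UNK>` token in this exercise, because every character that can appear in an equation or a solution (digits and operators) is known in advance.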



In [None]:
#build a vocabulary  string --> tokenId
#build a reverse vocabulary  tokenId --> string
#build a tokenizer (i.e. a function which takes a string and returns tokenids)
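One possible shape for these three pieces is sketched below, assuming character-level tokens. The class name `CharVocab` and the token strings `<s>`, `</s>`, `<pad>` are assumptions made for illustration; the attribute and method names `bos_ix`, `eos_ix`, `to_matrix` and `to_lines` are chosen to match what the model template in the next section calls on its vocabulary. Note that `encode` in that template finds sequence lengths by counting non-EOS tokens, which appears to assume EOS-padded inputs, so with a separate padding token as used here that computation would likely need adjusting.

```python
import torch

class CharVocab:
    """Character-level vocabulary (hypothetical helper, not the official solution)."""

    def __init__(self, chars, bos="<s>", eos="</s>", pad="<pad>"):
        # Special tokens come first, then the sorted set of data characters.
        self.tokens = [pad, bos, eos] + sorted(set(chars))
        self.token_to_ix = {t: i for i, t in enumerate(self.tokens)}  # string --> token id
        self.pad_ix = self.token_to_ix[pad]
        self.bos_ix = self.token_to_ix[bos]
        self.eos_ix = self.token_to_ix[eos]

    def __len__(self):
        return len(self.tokens)

    def tokenize(self, line):
        """Map a string to token ids, wrapped in BOS and EOS."""
        return [self.bos_ix] + [self.token_to_ix[c] for c in line] + [self.eos_ix]

    def to_matrix(self, lines):
        """Convert a list of strings to a padded int64 tensor [batch, time]."""
        ids = [self.tokenize(line) for line in lines]
        max_len = max(map(len, ids))
        matrix = torch.full((len(ids), max_len), self.pad_ix, dtype=torch.int64)
        for row, seq in enumerate(ids):
            matrix[row, :len(seq)] = torch.tensor(seq)
        return matrix

    def to_lines(self, matrix):
        """Convert a matrix of token ids back to strings, stopping at EOS."""
        lines = []
        for row in matrix:
            chars = []
            for ix in row:
                ix = int(ix)
                if ix == self.eos_ix:
                    break
                if ix not in (self.bos_ix, self.pad_ix):
                    chars.append(self.tokens[ix])  # token id --> string
            lines.append("".join(chars))
        return lines

vocab = CharVocab("0123456789+-*")
matrix = vocab.to_matrix(["1+2", "0-99"])
print(matrix.shape)
print(vocab.to_lines(matrix.numpy()))
```

Using a single vocabulary for both equations and solutions works here because both sides are built from the same small character set.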

### Encoder-decoder model

The code below contains a template for a simple encoder-decoder model with a single GRU encoder and a single GRU decoder.
**Please note that some parts must be changed or completed with your own implementation.**

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
class BasicModel(nn.Module):
    def __init__(self, vocab, emb_size=64, hid_size=128):
        """
        A simple encoder-decoder seq2seq model
        """
        super().__init__() # initialize base class to track sub-layers, parameters, etc.

        self.inp_voc = vocab
        self.hid_size = hid_size
        
        # note: `inp_voc` is not defined in this scope, so sizes are taken from the `vocab` argument
        self.emb_inp = nn.Embedding(len(vocab), emb_size)
        self.emb_out = nn.Embedding(len(vocab), emb_size)
        self.enc0 = nn.GRU(emb_size, hid_size, batch_first=True)

        self.dec_start = nn.Linear(hid_size, hid_size)
        self.dec0 = nn.GRUCell(emb_size, hid_size)
        self.logits = nn.Linear(hid_size, len(vocab))
        
    def forward(self, inp, out):
        """ Apply model in training mode """
        initial_state = self.encode(inp)
        return self.decode(initial_state, out)


    def encode(self, inp, **flags):
        """
        Takes symbolic input sequence, computes initial state
        :param inp: matrix of input tokens [batch, time]
        :returns: initial decoder state tensors, one or many
        """
        inp_emb = self.emb_inp(inp)
        batch_size = inp.shape[0]
        
        enc_seq, [last_state_but_not_really] = self.enc0(inp_emb)
        # enc_seq: [batch, time, hid_size], last_state: [batch, hid_size]
        
        # note: last_state is not _actually_ last because of padding, let's find the real last_state
        lengths = (inp != self.inp_voc.eos_ix).to(torch.int64).sum(dim=1).clamp_max(inp.shape[1] - 1)
        last_state = enc_seq[torch.arange(len(enc_seq)), lengths]
        # ^-- shape: [batch_size, hid_size]
        
        dec_start = self.dec_start(last_state)
        return [dec_start]

    def decode_step(self, prev_state, prev_tokens, **flags):
        """
        Takes previous decoder state and tokens, returns new state and logits for next tokens
        :param prev_state: a list of previous decoder state tensors, same as returned by encode(...)
        :param prev_tokens: previous output tokens, an int vector of [batch_size]
        :return: a list of next decoder state tensors, a tensor of logits [batch, len(inp_voc)]
        """
        prev_gru0_state = prev_state[0]
        
        <YOUR CODE HERE>
        
        return new_dec_state, output_logits

    def decode(self, initial_state, out_tokens, **flags):
        """ Iterate over reference tokens (out_tokens) with decode_step """
        batch_size = out_tokens.shape[0]
        state = initial_state
        
        # initial logits: always predict BOS
        onehot_bos = F.one_hot(torch.full([batch_size], self.inp_voc.bos_ix, dtype=torch.int64),
                               num_classes=len(self.inp_voc)).to(device=out_tokens.device)
        first_logits = torch.log(onehot_bos.to(torch.float32) + 1e-9)
        
        logits_sequence = [first_logits]
        for i in range(out_tokens.shape[1] - 1):
            state, logits = self.decode_step(state, out_tokens[:, i])
            logits_sequence.append(logits)
        return torch.stack(logits_sequence, dim=1)

    def decode_inference(self, initial_state, max_len=100, **flags):
        """ Generate solutions from model (greedy version) """
        batch_size, device = len(initial_state[0]), initial_state[0].device
        state = initial_state
        outputs = [torch.full([batch_size], self.inp_voc.bos_ix, dtype=torch.int64, 
                              device=device)]
        all_states = [initial_state]

        for i in range(max_len):
            state, logits = self.decode_step(state, outputs[-1])
            outputs.append(logits.argmax(dim=-1))
            all_states.append(state)
        
        return torch.stack(outputs, dim=1), all_states

    def calculate_lines(self, inp_lines, **kwargs):
        inp = self.inp_voc.to_matrix(inp_lines).to(device)
        initial_state = self.encode(inp)
        out_ids, states = self.decode_inference(initial_state, **kwargs)
        return self.inp_voc.to_lines(out_ids.cpu().numpy()), states
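As a rough illustration of what a single decoder step does according to the `decode_step` docstring, here is a standalone sketch: embed the previous tokens, update the GRU cell state, and project the new state to logits. The function `decode_step_sketch` and the layer sizes are assumptions for illustration only; inside the class the same steps would use `self.emb_out`, `self.dec0` and `self.logits`, and other designs are possible.

```python
import torch
import torch.nn as nn

def decode_step_sketch(emb_out, gru_cell, logits_layer, prev_state, prev_tokens):
    """One greedy-decoder step (hypothetical standalone version of decode_step)."""
    prev_emb = emb_out(prev_tokens)                # [batch, emb_size]
    new_state = gru_cell(prev_emb, prev_state[0])  # [batch, hid_size]
    output_logits = logits_layer(new_state)        # [batch, vocab_size]
    return [new_state], output_logits

# Illustrative sizes, not prescribed by the assignment.
vocab_size, emb_size, hid_size, batch_size = 16, 64, 128, 4
emb_out = nn.Embedding(vocab_size, emb_size)
gru_cell = nn.GRUCell(emb_size, hid_size)
logits_layer = nn.Linear(hid_size, vocab_size)

prev_state = [torch.zeros(batch_size, hid_size)]
prev_tokens = torch.randint(0, vocab_size, (batch_size,))
new_state, step_logits = decode_step_sketch(emb_out, gru_cell, logits_layer,
                                            prev_state, prev_tokens)
print(new_state[0].shape, step_logits.shape)
```

The state is kept as a one-element list so that it has the same structure as the value returned by `encode`.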


### Training loop

Training encoder-decoder models isn't that different from any other models: sample batches, compute loss, backprop and update.

For the training loss we will use ***torch.nn.NLLLoss***. Note that it expects log-probabilities, so apply a log-softmax to the model's logits first. The loss should not be computed on padding tokens; to ignore a specific label, see the `ignore_index` parameter in the ***torch.nn.NLLLoss*** documentation.
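A minimal sketch of such a loss and a single update step is shown below. It assumes the model returns logits of shape `[batch, time, vocab]` that are aligned position by position with the reference tokens, as `decode` in the template does. The names `masked_nll_loss`, `train_step` and `pad_ix` are hypothetical, and `F.nll_loss` is the functional form of `torch.nn.NLLLoss`.

```python
import torch
import torch.nn.functional as F

def masked_nll_loss(logits, target_ids, pad_ix):
    """Mean negative log-likelihood over non-padding positions.

    logits: [batch, time, vocab]; target_ids: [batch, time].
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # nll_loss takes [N, C] inputs and [N] targets, so flatten batch and time.
    return F.nll_loss(log_probs.reshape(-1, log_probs.shape[-1]),
                      target_ids.reshape(-1),
                      ignore_index=pad_ix)

def train_step(model, optimizer, inp, out, pad_ix):
    """One update: forward pass, masked loss, backprop, optimizer step."""
    model.train()
    optimizer.zero_grad()
    logits = model(inp, out)
    loss = masked_nll_loss(logits, out, pad_ix)
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny check on random data: changing logits at padded positions leaves the loss unchanged.
torch.manual_seed(0)
logits = torch.randn(2, 4, 5)
targets = torch.tensor([[1, 2, 3, 0], [1, 2, 0, 0]])  # 0 plays the role of pad_ix here
loss_a = masked_nll_loss(logits, targets, pad_ix=0)
logits_b = logits.clone()
logits_b[0, 3] = torch.randn(5) * 5
loss_b = masked_nll_loss(logits_b, targets, pad_ix=0)
print(loss_a.item(), loss_b.item())
```

A full loop would repeatedly sample a batch from the training set, convert it to matrices with the vocabulary, and call `train_step`.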


In [None]:
#<Implement training loop>

## Adding Attention Layer
Here you will implement a layer that computes simple additive attention.

Given an encoder sequence $h^e_0, h^e_1, h^e_2, \dots, h^e_T$ and a single decoder state $h^d$:

* Compute logits with a 2-layer neural network:
$$a_t = \text{linear}_{out}(\tanh(\text{linear}_{e}(h^e_t) + \text{linear}_{d}(h^d)))$$
* Get probabilities from the logits with a softmax:
$$ p_t = {{e ^ {a_t}} \over { \sum_\tau e^{a_\tau} }} $$

* Add up the encoder states weighted by these probabilities to get the __attention response__:
$$ attn = \sum_t p_t \cdot h^e_t $$
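The three steps above can be sketched as a small PyTorch module. This is one possible implementation under stated assumptions, not the reference solution: the class name `AttentionLayer`, the constructor arguments, and the optional `inp_mask` argument (used to exclude padded encoder positions) are all introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    """Additive attention over encoder states (hypothetical sketch)."""

    def __init__(self, enc_size, dec_size, hid_size):
        super().__init__()
        self.linear_e = nn.Linear(enc_size, hid_size)
        self.linear_d = nn.Linear(dec_size, hid_size)
        self.linear_out = nn.Linear(hid_size, 1)

    def forward(self, enc_seq, dec_state, inp_mask=None):
        """
        enc_seq: [batch, time, enc_size], dec_state: [batch, dec_size]
        inp_mask: optional bool tensor [batch, time], True at real (non-padding) tokens
        returns: attn [batch, enc_size], probs [batch, time]
        """
        # a_t = linear_out(tanh(linear_e(h^e_t) + linear_d(h^d)))
        hidden = torch.tanh(self.linear_e(enc_seq) + self.linear_d(dec_state).unsqueeze(1))
        logits = self.linear_out(hidden).squeeze(-1)  # [batch, time]
        if inp_mask is not None:
            # Padded positions get zero probability after the softmax.
            logits = logits.masked_fill(~inp_mask, float("-inf"))
        probs = F.softmax(logits, dim=-1)                  # p_t
        attn = (probs.unsqueeze(-1) * enc_seq).sum(dim=1)  # sum_t p_t * h^e_t
        return attn, probs

torch.manual_seed(0)
layer = AttentionLayer(enc_size=8, dec_size=6, hid_size=16)
enc = torch.randn(3, 5, 8)
dec = torch.randn(3, 6)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0],
                     [1, 0, 0, 0, 0]], dtype=torch.bool)
attn, probs = layer(enc, dec, mask)
print(attn.shape, probs.shape)
```

One way to use the response in the decoder is to concatenate `attn` with the previous token embedding before the GRU cell, which requires enlarging the cell's input size accordingly; this is a design choice rather than something the assignment prescribes.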



In [None]:
#<implement attention layer>
#<add attention layer for your seq2seq model>
#<train the two models (with and without attention) and compare the results>

## Good Luck!