<a href="https://colab.research.google.com/github/mphirke/1PAW/blob/master/Week1/The_Annotated_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1 PAW Week 1 (22 Sept - 29 Sept, 2019)
Reimplementation of the paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) using (mostly) [The Annotated Transformer](https://github.com/harvardnlp/annotated-transformer/blob/master/The%20Annotated%20Transformer.ipynb), [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/), and the PyTorch docs.

# The Architecture 

![this](https://www.dropbox.com/s/m1sqqlie8y05hz9/arch.png?dl=1)

In [0]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

The Generator is the last step of the architecture: once the input has passed through everything else, it projects the decoder output onto the vocabulary and returns the final (log-)probabilities.

![Linear + Softmax](https://www.dropbox.com/s/ntpqyov5r3thw2h/generator.png?dl=1)



# Generator

In [0]:
class Generator(nn.Module):
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)
    
    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)
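As a quick sanity check of the class above, the sketch below pushes a random batch through a `Generator` and checks that every position ends up with a valid log-probability distribution over the vocabulary. The sizes (`d_model=512`, `vocab=1000`, a batch of 2 sequences of length 5) are illustrative assumptions, not values fixed by this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    "Same definition as above, repeated so this snippet runs on its own."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

# Assumed sizes, chosen only for illustration.
gen = Generator(d_model=512, vocab=1000)
x = torch.randn(2, 5, 512)          # (batch, sequence length, d_model)

log_probs = gen(x)
print(log_probs.shape)              # last dimension is now the vocabulary size

# Exponentiating log-probabilities should give rows that sum to 1.
prob_sums = log_probs.exp().sum(dim=-1)
print(torch.allclose(prob_sums, torch.ones(2, 5), atol=1e-4))
```

Only the last dimension changes (from `d_model` to `vocab`); the batch and sequence dimensions pass through untouched because `nn.Linear` acts on the final axis.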

## Understanding the Generator

### Understanding nn.Linear
nn.Linear(in_features, out_features)

$y = xW^T + b$

**y** = output

**W** = weights

**x** = input


**b** = bias

The weights and the bias are randomly initialized.

Essentially, we are creating a linear map of the form $y = xW^T + b$, where the input has d_model features and the output has vocab features (one score per vocabulary entry).

In [23]:
m = nn.Linear(3,1)
inp = torch.tensor([[1.0, -1.0, 10]])
out = m(inp)
print("The output is", out)
print(m.weight)
print(m.bias)   

The output is tensor([[2.7858]], grad_fn=<AddmmBackward>)
Parameter containing:
tensor([[-0.0718,  0.2321,  0.3375]], requires_grad=True)
Parameter containing:
tensor([-0.2856], requires_grad=True)
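The printed numbers can be checked by hand against $y = xW^T + b$: $1.0 \cdot (-0.0718) + (-1.0) \cdot 0.2321 + 10 \cdot 0.3375 - 0.2856 \approx 2.7855$, which matches the output above up to rounding of the displayed weights. The same check can be sketched in code. The seed here is an arbitrary assumption, used only to make the run repeatable; the weights will differ from the ones printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                # arbitrary seed, only for repeatability

m = nn.Linear(3, 1)
inp = torch.tensor([[1.0, -1.0, 10.0]])

# m.weight has shape (out_features, in_features) = (1, 3),
# so the layer computes inp @ weight.T + bias.
manual = inp @ m.weight.T + m.bias

print(m.weight.shape)
print(torch.allclose(m(inp), manual))
```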


### Understanding log_softmax

from [PyTorch documentation](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.log_softmax)

> Applies a softmax followed by a logarithm. While mathematically equivalent to log(softmax(x)), doing these two operations separately is slower, and numerically unstable. This function uses an alternative formulation to compute the output and gradient correctly.

Softmax is given by:

# $ \mathrm{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $

Let's try it out.

In [69]:
print(F.softmax(torch.tensor([1, 2.0, 3, 4]), dim=-1))
print(F.log_softmax(torch.tensor([1, 2.0, 3, 4]), dim=-1))  # pass dim explicitly; leaving it implicit is deprecated

tensor([0.0321, 0.0871, 0.2369, 0.6439])
tensor([-3.4402, -2.4402, -1.4402, -0.4402])


  


Reimplementing log_softmax in numpy to compare results

In [0]:
def softmax(inps):
    sum_denominator = np.sum(np.exp(inps))
    softmaxes = np.array(np.exp(inps))/sum_denominator
    return softmaxes

def log_sm(inps):
    softmaxes = softmax(inps)
    return np.log(softmaxes)

In [68]:
print(softmax([1, 2.0, 3, 4]))
log_sm([1, 2.0, 3, 4])

[0.0320586  0.08714432 0.23688282 0.64391426]


array([-3.4401897, -2.4401897, -1.4401897, -0.4401897])

We get the same values as PyTorch, which confirms that log_softmax returns
# $ \log\left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right) $
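The documentation quoted earlier notes that computing the softmax and the logarithm separately is numerically unstable. A minimal sketch of why, using NumPy: for large inputs `exp` overflows, while the shifted log-sum-exp form $x_i - \max(x) - \log \sum_j e^{x_j - \max(x)}$ stays finite. The log-sum-exp trick is a common remedy; that PyTorch uses exactly this formulation internally is an assumption, and the function names below are hypothetical.

```python
import numpy as np

def naive_log_softmax(x):
    # Direct translation of log(softmax(x)); overflows for large inputs.
    x = np.asarray(x, dtype=np.float64)
    return np.log(np.exp(x) / np.sum(np.exp(x)))

def stable_log_softmax(x):
    # Log-sum-exp trick: subtracting max(x) leaves the result unchanged
    # mathematically but keeps every exponent <= 0.
    x = np.asarray(x, dtype=np.float64)
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

small = [1, 2.0, 3, 4]
print(np.allclose(naive_log_softmax(small), stable_log_softmax(small)))

large = [1000.0, 1001.0, 1002.0]
with np.errstate(over="ignore", invalid="ignore"):
    print(naive_log_softmax(large))   # exp(1000) is inf, and inf / inf is nan
print(stable_log_softmax(large))      # finite values
```

On small inputs the two versions agree; on large inputs only the shifted version returns usable numbers.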

# Encoder

The encoder consists of a stack of six identical encoder layers.
Why six? No special reason: the paper uses N = 6, but the number of layers is a hyperparameter and is open to experimentation.

In [0]:
import copy

def clones(module, N):
    "Produce N identical layers."
    # `_` is a throwaway loop variable: deep-copy the module N times to build
    # a stack of N layers with the same structure but independent parameters.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
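To see why `clones` uses `copy.deepcopy`, the sketch below compares it with plain list multiplication, which would repeat one and the same layer object. The layer type and sizes (`nn.Linear(4, 4)`, six copies) are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

layers = clones(nn.Linear(4, 4), 6)
print(len(layers))

# The copies start with equal values...
print(torch.equal(layers[0].weight, layers[1].weight))
# ...but they are separate parameters, so training updates them independently.
print(layers[0].weight is layers[1].weight)

# Contrast: list multiplication repeats the same object six times,
# so every "layer" would share one set of weights.
shared = nn.ModuleList([nn.Linear(4, 4)] * 6)
print(shared[0] is shared[1])
```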


In [0]:
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    # Follows an encode -> decode flow: embed and encode the source,
    # then decode the target against the encoder output (the "memory").
    def forward(self, src, tgt, src_mask, tgt_mask):
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)
    
    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)