In [None]:
import torch
import tensorflow as tf
import math

## Aggregation NN for Programmers

"Aggregation NN" refers to the use of several types of neural networks
(CNNs, RNNs, etc.) for *aggregation*: at a high level, combining many
inputs into a smaller set of summary outputs (sometimes just one output).

## Convolutional Layers

Question: What is the difference/relationship between
conv layers and pooling layers?
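
One way to approach this question is to put the two layer types side by side. The snippet below is a minimal sketch with small illustrative sizes (the sizes and variable names are assumptions, not taken from the cells in this notebook). Both layers slide a window over the input, but the convolution has learnable weights and mixes channels, while max pooling has no parameters and works on each channel independently.

```python
import torch

# Illustrative sizes only (assumed for this sketch).
conv = torch.nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3)
pool = torch.nn.MaxPool2d(kernel_size=3)

x = torch.randn(2, 4, 12, 12)

# Conv2d learns a weight of shape (8, 4, 3, 3) plus a bias of shape (8,).
n_conv_params = sum(p.numel() for p in conv.parameters())
# MaxPool2d has nothing to learn.
n_pool_params = sum(p.numel() for p in pool.parameters())

conv_out = conv(x)  # channels 4 -> 8, spatial 12 -> 10 (stride defaults to 1)
pool_out = pool(x)  # channels stay 4, spatial 12 -> 4 (stride defaults to kernel_size)

print(n_conv_params, n_pool_params, conv_out.shape, pool_out.shape)
```

In this sketch the pooling layer shrinks the spatial dimensions faster only because its default stride equals the kernel size; a convolution with `stride=3` would shrink them the same way.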

[Conv2d in PyTorch](https://pytorch.org/docs/master/generated/torch.nn.Conv2d.html#torch.nn.Conv2d):

In [None]:
batch_size = 20
height = 50
width = 100
in_channels = 16
out_channels = 33

# assuming square
kernel_size = 3
stride = 1 # default
padding = 0 # default
dilation = 1 # default

height_out = math.floor((height + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
width_out = math.floor((width + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)

conv2d = torch.nn.Conv2d(in_channels=in_channels,
                         out_channels=out_channels,
                         kernel_size=kernel_size)
I = torch.randn(batch_size, in_channels, height, width)
O = conv2d(I)

In [None]:
assert O.shape == torch.Size((batch_size, out_channels, height_out, width_out))
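
The output-size formula used above can be wrapped in a small helper to see how `padding` and `stride` change the result. `conv_out_size` is a hypothetical name introduced for this sketch; it is not part of PyTorch.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension (hypothetical helper)."""
    # Integer floor division matches math.floor(x / stride + 1) for these inputs.
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


no_padding = conv_out_size(50, kernel_size=3)                      # 48
same_padding = conv_out_size(50, kernel_size=3, padding=1)         # 50
strided = conv_out_size(100, kernel_size=3, stride=2, padding=1)   # 50

print(no_padding, same_padding, strided)
```

With a 3x3 kernel, `padding=1` keeps the spatial size unchanged, and adding `stride=2` roughly halves it.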

## Pooling Layers

[MaxPool2d in PyTorch](https://pytorch.org/docs/stable/nn.html#maxpool2d)

In [None]:
batch_size = 20
height = 50
width = 100
channels = 16

kernel_size = 3
stride = kernel_size # default
padding = 0 # default
dilation = 1 # default

# The same formula as the Conv2d
height_out = math.floor((height + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
width_out = math.floor((width + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)

maxpool2d = torch.nn.MaxPool2d(kernel_size)
I = torch.randn(batch_size, channels, height, width)
O = maxpool2d(I)

In [None]:
assert O.shape == torch.Size((batch_size, channels, height_out, width_out))
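
Pooling can also aggregate each channel down to a single value, which is the "just one output" case mentioned in the introduction. The sketch below uses `torch.nn.AdaptiveAvgPool2d` (average rather than max pooling, chosen here for illustration) and checks it against a plain mean over the spatial dimensions.

```python
import torch

x = torch.randn(20, 16, 50, 100)  # (batch, channels, height, width)

# output_size=1 asks for a 1x1 spatial output regardless of input size.
global_avg = torch.nn.AdaptiveAvgPool2d(output_size=1)
pooled = global_avg(x)

# The same aggregation written as a mean over height and width.
manual = x.mean(dim=(2, 3), keepdim=True)

print(pooled.shape, torch.allclose(pooled, manual, atol=1e-5))
```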

## Recurrent Layers

TODO: use_attention

TODO: compare with LSTM (and the post)

In [None]:
input_size = 10
hidden_size = 20
seq_len, batch_size = 5, 3

rnn = torch.nn.RNN(input_size, hidden_size)
I = torch.randn(seq_len, batch_size, input_size)
O, H = rnn(I)

In [None]:
assert O.shape == torch.Size((seq_len, batch_size, hidden_size))

In [None]:
assert H.shape == torch.Size((1, batch_size, hidden_size))
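
The two return values are related: for a single-layer, unidirectional RNN, the final hidden state is the output at the last time step, so `H` acts as the aggregated summary of the whole sequence. The sketch below checks this and, as a first step toward the LSTM comparison, shows that `torch.nn.LSTM` returns a cell state in addition to the hidden state. The variable names are chosen for this sketch.

```python
import torch

torch.manual_seed(0)
seq_len, batch_size, input_size, hidden_size = 5, 3, 10, 20
x = torch.randn(seq_len, batch_size, input_size)

rnn = torch.nn.RNN(input_size, hidden_size)
outputs, h_n = rnn(x)

# Last time step of the per-step outputs vs. the final hidden state.
rnn_matches = torch.allclose(outputs[-1], h_n[0])

# LSTM returns (output, (hidden state, cell state)).
lstm = torch.nn.LSTM(input_size, hidden_size)
lstm_outputs, (lstm_h, lstm_c) = lstm(x)
lstm_matches = torch.allclose(lstm_outputs[-1], lstm_h[0])

print(rnn_matches, lstm_matches, lstm_c.shape)
```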

## Attention

Attention can be complex, but from a simpler perspective
it is similar to min/max/mean: we use it to aggregate
a sequence of embeddings into one.
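
To make the comparison with min/max/mean concrete, here is a minimal sketch that aggregates the same sequence three ways. The random `scores` stand in for whatever a trained model would compute; they are an assumption of this example, not part of any attention API.

```python
import torch

torch.manual_seed(0)
seq = torch.randn(6, 4)  # 6 embeddings, each of dimension 4

mean_agg = seq.mean(dim=0)
max_agg = seq.max(dim=0).values

# Attention-style aggregation: a weighted sum whose weights sum to 1.
scores = torch.randn(6)                 # placeholder for learned scores
weights = torch.softmax(scores, dim=0)
attn_agg = weights @ seq                # (6,) @ (6, 4) -> (4,)

# With uniform weights, the weighted sum reduces to the mean.
uniform = torch.full((6,), 1.0 / 6)
uniform_agg = uniform @ seq

print(mean_agg.shape, max_agg.shape, attn_agg.shape)
```

All three results have the shape of a single embedding; the difference is that attention weights depend on the data instead of being fixed.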

Attention itself is not a neural network primitive; it can be implemented from simpler ones. Here, we present
a simplified version of the attention layer based on the
[attention decoder in the PyTorch tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#attention-decoder).

Note how the embedding layer is used (see *insert link* for a tutorial
on using the embedding layer). The most important point is that
an embedding layer maps each index to an embedding vector.
This interface is common across implementations, including
BERT's fine-tuning interface.
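
As a quick illustration of the index-to-vector mapping, the sketch below looks up a few indices in `torch.nn.Embedding` and confirms that the result is simply the corresponding rows of the layer's weight matrix. The sizes and indices are arbitrary choices for this example.

```python
import torch

embedding = torch.nn.Embedding(num_embeddings=20, embedding_dim=8)

# Shape (batch_size=1, seq_len=3); index 3 appears twice on purpose.
indices = torch.tensor([[3, 7, 3]])
vectors = embedding(indices)  # shape (1, 3, 8)

# The lookup is equivalent to indexing rows of the weight matrix.
same_as_rows = torch.equal(vectors, embedding.weight[indices])
# The same index always maps to the same vector.
repeated_index_same = torch.equal(vectors[0, 0], vectors[0, 2])

print(vectors.shape, same_as_rows, repeated_index_same)
```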

In [None]:
embedding_dim = 256
hidden_size = embedding_dim

max_length = 10
num_embeddings = 20

# TODO: explain why they are both one in this case
batch_size = 1
seq_len    = 1

min_index = 0
max_index = min_index + num_embeddings

attn = torch.nn.Linear(hidden_size * 2, max_length)
attn_combine = torch.nn.Linear(hidden_size * 2, hidden_size)

I = torch.randint(min_index, max_index, (batch_size, seq_len))

# Prepare inputs 
embedding = torch.nn.Embedding(num_embeddings, hidden_size)
embedded = embedding(I)
assert embedded.shape == torch.Size((batch_size, seq_len, embedding_dim)), embedded.shape
embedded = embedded.view(1, 1, -1)
assert embedded.shape == torch.Size((1, 1, batch_size * seq_len * embedding_dim))
assert embedded[0].shape == torch.Size((1, batch_size * seq_len * embedding_dim))

hidden = torch.zeros(1, 1, hidden_size)
assert hidden[0].shape == torch.Size((1, hidden_size))

embedded_hidden = torch.cat((embedded[0], hidden[0]), dim=1)
assert embedded_hidden.shape == torch.Size((1, hidden_size + batch_size * seq_len * embedding_dim))
assert embedded_hidden.shape == torch.Size((1, 2 * hidden_size))

encoder_outputs = torch.zeros(max_length, hidden_size)

First, we compute the attention weights from `embedded_hidden`:

In [None]:
O = attn(embedded_hidden)
assert O.shape == torch.Size((1, max_length))
attn_weights = torch.nn.functional.softmax(O, dim=1)
assert attn_weights.shape == torch.Size((1, max_length))

Then we can apply the weights on incoming encoder outputs:

In [62]:
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))
assert attn_applied.shape == torch.Size((1, 1, embedding_dim))

Essentially, the attention layer summarizes `max_length` encoder outputs into a single vector. The full PyTorch tutorial gives the context
for this many-to-one mapping.
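
Because `encoder_outputs` is all zeros in the cells above, the aggregated result there is also all zeros. The sketch below repeats the same `torch.bmm` step with random stand-in values (an assumption of this example, in place of real encoder outputs and learned weights) and checks that it equals an explicit weighted sum over the `max_length` rows.

```python
import torch

torch.manual_seed(0)
max_length, hidden_size = 10, 256

# Random stand-ins for real encoder outputs and attention scores.
encoder_outputs = torch.randn(max_length, hidden_size)
attn_weights = torch.softmax(torch.randn(1, max_length), dim=1)

# (1, 1, max_length) x (1, max_length, hidden_size) -> (1, 1, hidden_size)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

# The same result as an explicit weighted sum of the encoder outputs.
weighted_sum = (attn_weights[0].unsqueeze(1) * encoder_outputs).sum(dim=0)

print(attn_applied.shape, torch.allclose(attn_applied[0, 0], weighted_sum, atol=1e-5))
```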

![](https://pytorch.org/tutorials/_images/attention-decoder-network.png)