1. What are the pros and cons of using a stateful RNN versus a stateless RNN?

A stateful RNN can capture patterns that span much longer than a single training sequence, because it carries its final state from one training step over as the initial state of the next. Since each step still trains on a short subsequence, it does not have to backpropagate gradients over long sequences.

The main drawback is that training is trickier to structure, particularly when batching. Because the state is carried over, the RNN must see consecutive subsequences in order, which constrains shuffling: the $i$-th instance of each batch must continue where the $i$-th instance of the previous batch left off. Different sequences can occupy different positions within a batch, so shuffling is possible across sequences but not within a sequence. Because shuffling is constrained, the dataset will somewhat violate the IID assumption of gradient descent (although that assumption is already somewhat violated by taking multiple passes through a dataset).
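The batching constraint can be illustrated with a minimal sketch in plain Python. The helper name `stateful_batches` and its parameters are hypothetical, and the sketch assumes a single long sequence that is cut into `batch_size` contiguous parts which are then read in lockstep.

```python
def stateful_batches(sequence, batch_size, window_length):
    """Yield batches of consecutive windows from one long sequence.

    Row i of each batch starts exactly where row i of the previous
    batch ended, which is what a stateful RNN relies on.
    """
    # Cut the sequence into batch_size equal contiguous parts
    # (any leftover items at the end are dropped).
    part_length = len(sequence) // batch_size
    parts = [sequence[i * part_length:(i + 1) * part_length]
             for i in range(batch_size)]
    # Walk through all parts in lockstep, one window at a time.
    n_windows = part_length // window_length
    for w in range(n_windows):
        start = w * window_length
        yield [part[start:start + window_length] for part in parts]


batches = list(stateful_batches(list(range(24)), batch_size=2, window_length=4))
for batch in batches:
    print(batch)
```

Each printed batch has two rows; reading the first row of every batch in order reproduces the first half of the sequence without gaps, and the second row does the same for the second half.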

If sequences are already short, there may not be a need for a stateful RNN.

2. Why do people use Encoder–Decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?

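One clear advantage of an Encoder-Decoder architecture is that it decouples the lengths of the input and output sequences, so they can differ. The encoder also reads the entire input sentence before the decoder emits anything, whereas a plain sequence-to-sequence RNN would have to start translating immediately after reading the first word.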

It also seems to add a bit of control, since the encoder's output is an explicit latent state. Variational Autoencoders, for example, treat the latent state as a random variable and apply other transformations to it. The latent state can also aid interpretability, by studying the structure of the latent space.

3. How can you deal with variable-length input sequences? What about variable-length output sequences?

If instances within a batch must all have the same length but the sequences vary, the shorter ones can be padded. Masking is then used to ignore the padded items in each sequence. During forward propagation, a recurrent layer skips a masked time step and simply copies its output from the previous time step. When the loss is computed, padded sequence elements are ignored, so padding does not affect backpropagation or the application of gradients.

For variable-length output sequences, the usual approach is to train the model to emit an end-of-sequence token and to stop generating once it appears; anything after that token is ignored when computing the loss.
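As a rough illustration of padding plus a masked loss, here is a small sketch in plain Python. The padding ID `PAD`, the helper names, and the per-step loss values are all made up for the example; a framework such as Keras handles this internally through its masking support.

```python
PAD = 0  # assumed padding token ID (hypothetical)


def pad_batch(sequences, pad_value=PAD):
    """Pad every sequence to the longest length and build a 0/1 mask."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_value] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask


def masked_mean(per_step_losses, mask):
    """Average the per-step losses over the real (unmasked) steps only."""
    total = sum(loss * m
                for loss_row, mask_row in zip(per_step_losses, mask)
                for loss, m in zip(loss_row, mask_row))
    count = sum(sum(row) for row in mask)
    return total / count


padded, mask = pad_batch([[5, 3, 8], [7]])
# Made-up per-step losses; the two 9.0 values sit on padded positions.
losses = [[0.5, 1.0, 1.5], [2.0, 9.0, 9.0]]
loss = masked_mean(losses, mask)
print(padded, mask, loss)
```

The large loss values on the padded positions do not influence the result, because the mask zeroes them out before averaging.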

4. What is beam search and why would you use it? What tool can you use to implement it?

Since Encoder-Decoder architectures model the conditional probability of the output sequence given the input sequence, the model is able to score the likelihood of a partial output sequence given the input sequence.

Beam search keeps track of the K most probable sequences while generating output sequences. At each step, the model extends each of the K sequences with all possible extensions, and then evaluates the probability of all extended sequences, keeping the top K regardless of which sequence they came from (for example, one sequence could produce all of the top K sequences at a particular time step).

Beam search is useful because the most probable complete sequence often does not begin with the most probable length-1 sequence (or, more generally, the most probable length-i prefix). By tracking K sequences, the chances of discovering the more probable sequences are increased.

Beam search can be implemented using `tfa.seq2seq.beam_search_decoder.BeamSearchDecoder` from TensorFlow Addons.
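Independently of any library, the procedure can be sketched in a few lines of plain Python. Everything here is illustrative: `beam_search`, the `TOY` probability table, and `toy_next_log_probs` are hypothetical stand-ins for a real decoder, and the toy model conditions only on the last token to keep the example small.

```python
import math


def beam_search(next_log_probs, start_token, end_token, beam_width, max_len):
    """Return up to beam_width (log_prob, sequence) pairs, best first."""
    beams = [(0.0, [start_token])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:
                # Finished sequences are carried over unchanged.
                candidates.append((score, seq))
                continue
            for token, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + [token]))
        # Keep the top K overall, regardless of which beam they extend.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for _, seq in beams):
            break
    return beams


# Hypothetical toy model: next-token probabilities keyed by the last token.
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}


def toy_next_log_probs(seq):
    return {tok: math.log(p) for tok, p in TOY[seq[-1]].items()}


greedy = beam_search(toy_next_log_probs, "<s>", "</s>", beam_width=1, max_len=5)[0]
best = beam_search(toy_next_log_probs, "<s>", "</s>", beam_width=2, max_len=5)[0]
print(greedy[1], math.exp(greedy[0]))
print(best[1], math.exp(best[0]))
```

With a beam width of 1 (greedy decoding) the search commits to "a" because it is the most probable first token, while a beam width of 2 keeps "b" alive long enough to find the more probable complete sequence.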

5. What is an attention mechanism? How does it help?

An attention mechanism allows a model to focus on any part of the input sequence (often the input sequence is really the output sequence of an encoder) when generating an element in the output sequence. This sidesteps the issue that RNNs face with propagating memories across long sequences.

Attention mechanisms vary both in architecture and in how the attention scores are calculated. The author gives three examples: an Encoder-Decoder RNN with an alignment model using concatenative attention; the same model with simplifications and multiplicative (dot-product) attention; and the Transformer.

concatenative attention: for each encoder time step t, concatenate the encoder's output at t with the decoder's hidden state from the previous decoder step, and pass the result through a time-distributed Dense layer to get a score. A softmax over the scores for all t gives the weights, which scale the encoder outputs; their weighted sum is passed to the decoder's memory cell at the current step.

multiplicative attention: compute a dot product between the encoder output and the decoder hidden state instead of the concatenation and Dense layer in the above scheme, and pass those scores through the softmax layer. Some variants are possible, such as using the decoder's current hidden state instead of the previous one, and then using the attention output to compute the decoder output directly.
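The two scoring schemes differ only in how a score is produced for each encoder output, which a short plain-Python sketch can make concrete. The function names, the example vectors, and the weights `w` and `b` are hypothetical, and `concat_score` follows the simplified single-neuron description above rather than any particular paper's exact formulation.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def concat_score(enc, dec, w, b):
    # Single linear neuron applied to the concatenation [enc; dec].
    return sum(wi * xi for wi, xi in zip(w, enc + dec)) + b


def dot_score(enc, dec):
    # Multiplicative attention: plain dot product, no trainable weights.
    return sum(e * d for e, d in zip(enc, dec))


def attend(encoder_outputs, decoder_state, score_fn):
    """Return attention weights and the weighted sum of encoder outputs."""
    scores = [score_fn(enc, decoder_state) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(wt * enc[c] for wt, enc in zip(weights, encoder_outputs))
               for c in range(dim)]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

dot_weights, dot_context = attend(encoder_outputs, decoder_state, dot_score)

w, b = [0.5, -0.5, 0.25, 0.25], 0.0  # hypothetical trained parameters
concat_weights, concat_context = attend(
    encoder_outputs, decoder_state,
    lambda enc, dec: concat_score(enc, dec, w, b))
print(dot_weights, dot_context)
print(concat_weights, concat_context)
```

In both cases the weights sum to 1 and the context vector is a weighted average of the encoder outputs; only the scoring function changes.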

transformer: Will discuss in the next question.

6. What is the most important layer in the Transformer architecture? What is its purpose?

The multi-headed attention layer. First, to define scaled dot-product attention:

$$
Attention(Q, K, V) = softmax(\frac{QK^\intercal}{\sqrt{d_{keys}}})V
$$

- $Q$ is a matrix of shape $[n_{queries}, d_{keys}]$, where $n_{queries}$ is the number of queries and $d_{keys}$ is the dimensionality of each query and key
- $K$ has shape $[n_{keys}, d_{keys}]$, where $n_{keys}$ is the number of keys and values (the length of the sequence fed into the scaled dot-product attention layer)
- $V$ has shape $[n_{keys}, d_{values}]$, where $d_{values}$ is the dimensionality of each value
- Similar to a temperature, dividing by the square root of the dimensionality of the keys prevents the softmax from saturating, which keeps gradients from vanishing
- In the encoder, $Q$, $K$, and $V$ are all equal to the list of input words (so the layer compares the input embeddings with one another)
- In the masked multi-headed attention layer in the decoder, $Q$, $K$, and $V$ are all equal to the list of target words, but with causal masking
- In the other (unmasked) multi-headed attention layer in the decoder, $K$ and $V$ are the word encodings produced by the encoder, while $Q$ is the list of word encodings produced by the decoder
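The formula and shapes above translate fairly directly into code. This is a minimal plain-Python sketch with matrices stored as lists of rows; the function name and the `causal` flag are choices made for this example, and a real implementation would use a tensor library.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: [n_queries][d_keys], K: [n_keys][d_keys], V: [n_keys][d_values]."""
    d_keys = len(K[0])
    d_values = len(V[0])
    output = []
    for i, q in enumerate(Q):
        # Similarity of this query with every key, scaled by sqrt(d_keys).
        scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d_keys)
                  for k in K]
        if causal:
            # Position i may only attend to positions <= i.
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(d_values)])
    return output


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(Q, K, V)
out_causal = scaled_dot_product_attention(Q, K, V, causal=True)
print(out)
print(out_causal)
```

With `causal=True`, the first output row equals the first value vector, because the first position is not allowed to look at anything after itself.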

A multi-headed attention layer:
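- passes $Q$, $K$, and $V$ through $h$ separate sets of time-distributed Dense layers without bias (one linear projection of each per head)
- feeds each set of projections into one of $h$ scaled dot-product attention layers
- concatenates the $h$ outputs
- passes the concatenation through another time-distributed Dense layer without bias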

Each of the $h$ heads computes a different feature of the input sequence, analogous to feature maps in conv nets.
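To make the data flow concrete, here is a small plain-Python sketch of the steps listed above, using random (untrained) weights. All names and sizes (`d_model`, `d_head`, `random_matrix`, and so on) are assumptions made for this example; in practice a library layer such as Keras's multi-head attention would be used instead.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention (no masking)."""
    d_keys = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_keys) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head_attention(Q, K, V, heads, W_out):
    # One bias-free linear projection of Q, K and V per head.
    head_outputs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the head outputs along the feature dimension.
    concat = [[x for head in head_outputs for x in head[i]]
              for i in range(len(Q))]
    # Final bias-free linear projection.
    return matmul(concat, W_out)


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


d_model, h, d_head, seq_len = 4, 2, 2, 3  # hypothetical sizes
X = random_matrix(seq_len, d_model)
heads = [tuple(random_matrix(d_model, d_head) for _ in range(3))
         for _ in range(h)]
W_out = random_matrix(h * d_head, d_model)
out = multi_head_attention(X, X, X, heads, W_out)  # self-attention
print(len(out), len(out[0]))
```

The output has one row per query position and `d_model` features, the same shape as the input `X`.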

I'm citing the book because I borrowed so heavily here:

Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (p. 559). O'Reilly Media. Kindle Edition. 

7. When would you need to use sampled softmax?

If the output vocabulary is very large, training can be prohibitively expensive because the softmax must be computed over every word in the vocabulary. To avoid this, the softmax can be computed over just the logit of the known true target plus a sampled subset of incorrect words. This speeds up training.

At inference time the actual softmax must be computed over the full vocabulary.
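A minimal sketch of the idea in plain Python follows. The names `sampled_softmax_loss`, `full_softmax_loss`, and `toy_logit` are hypothetical, negatives are drawn uniformly, and the sketch omits the correction for sampling probabilities that library implementations such as `tf.nn.sampled_softmax_loss` typically apply.

```python
import math
import random


def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sampled_softmax_loss(logit_fn, target, vocab_size, num_sampled, rng):
    """Cross-entropy over the true target plus a few sampled negatives."""
    # Draw distinct negative word IDs uniformly, skipping the target.
    # Assumes num_sampled <= vocab_size - 1.
    negatives = set()
    while len(negatives) < num_sampled:
        w = rng.randrange(vocab_size)
        if w != target:
            negatives.add(w)
    # Only num_sampled + 1 logits are ever computed.
    logits = [logit_fn(target)] + [logit_fn(w) for w in sorted(negatives)]
    return log_sum_exp(logits) - logits[0]


def full_softmax_loss(logit_fn, target, vocab_size):
    """Exact cross-entropy over the whole vocabulary, for comparison."""
    logits = [logit_fn(w) for w in range(vocab_size)]
    return log_sum_exp(logits) - logits[target]


def toy_logit(w):
    # Hypothetical stand-in for the model's output logit for word w.
    return math.sin(w)


rng = random.Random(0)
sampled = sampled_softmax_loss(toy_logit, target=7, vocab_size=10_000,
                               num_sampled=20, rng=rng)
full = full_softmax_loss(toy_logit, target=7, vocab_size=10_000)
print(sampled, full)
```

The sampled loss touches 21 logits instead of 10,000, which is where the training speed-up comes from; the full version is what would be needed at inference time.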