# Chapter 16: Natural Language Processing with RNNs and Attention

### 1. What are the pros and cons of using a stateful RNN versus a stateless RNN?

A stateful RNN preserves the final hidden state of its recurrent layers (for example, LSTM or GRU layers) after processing a batch and uses it as the initial state for the next batch, instead of starting from scratch. This lets the RNN retain the information collected from previous batches, which mitigates the limited long-term memory of RNNs trained on short sequences.

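In practice, this gives the RNN a longer context when working with long sequences, even though backpropagation only runs through each short window. A stateless RNN, by contrast, can only learn patterns that are no longer than the sequences it is trained on, but it is much simpler to use.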

To use a stateful RNN, the data must be arranged sequentially: the sequence at a given position in one batch must continue exactly where the sequence at the same position in the previous batch left off. This makes batching for stateful RNNs tricky, and because consecutive batches are no longer independent, the training data cannot be shuffled freely.
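The batching constraint described above can be sketched in plain Python. This is only an illustration under simple assumptions: `stateful_batches` is a hypothetical helper, the data is a single long list of token IDs, and targets (normally the inputs shifted by one step) are left out for brevity.

```python
def stateful_batches(tokens, batch_size, window):
    """Yield batches where row i of each batch continues row i of the previous one."""
    # Split the long sequence into batch_size contiguous streams of equal length.
    stream_len = len(tokens) // batch_size
    streams = [tokens[i * stream_len:(i + 1) * stream_len] for i in range(batch_size)]
    # Walk all streams in lockstep, one non-overlapping window at a time.
    for start in range(0, stream_len - window + 1, window):
        yield [stream[start:start + window] for stream in streams]


batches = list(stateful_batches(list(range(12)), batch_size=2, window=3))
for batch in batches:
    print(batch)
```

Row 0 of the second batch picks up exactly where row 0 of the first batch stopped, which is what allows the hidden state to be carried over from one batch to the next.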

Another inconvenience can be retrieving the decoder's hidden state at the current time step, which requires implementing an additional function. One workaround is to use the decoder output instead.

### 2. Why do people use encoder-decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?

In a plain sequence-to-sequence RNN, the model translates one word at a time, starting as soon as it has read the first word, so it often loses the context that the rest of the sentence provides.

When using an `encoder-decoder` RNN, the encoder first reads the whole sentence and encodes it into a representation, and only then does the decoder translate it, so the context of the full sentence informs the translation.

### 3. How can you deal with variable-length input sequences? What about variable-length output sequences?

There are different ways of dealing with `variable-length` input. The first is to pad the shorter sequences (sentences) with a padding token, typically zeros on the right, so that every sequence in a batch has the same length as the longest one, and then to use masking so that the model ignores the padded positions. In Keras, for example, the embedding layer can generate this mask when created with `mask_zero=True`. Another way is to feed the model ragged tensors.
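As a rough illustration of padding plus masking, independent of any framework, the sketch below pads a batch of token-ID lists and builds the matching mask. The names `PAD`, `pad_batch`, and `masked_mean` are hypothetical, and reserving ID 0 for padding is an assumption.

```python
PAD = 0  # assumed padding ID; real token IDs are assumed to start at 1


def pad_batch(sequences, pad_id=PAD):
    """Right-pad every sequence to the length of the longest one and build a mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # True marks a real token, False marks padding that should be ignored.
    mask = [[token != pad_id for token in row] for row in padded]
    return padded, mask


def masked_mean(values, mask_row):
    """Average only the positions that the mask marks as real tokens."""
    kept = [v for v, keep in zip(values, mask_row) if keep]
    return sum(kept) / len(kept)


padded, mask = pad_batch([[5, 2, 9], [7, 3]])
print(padded)
print(mask)
print(masked_mean(padded[1], mask[1]))
```

Without the mask, the padding zero in the second row would be averaged in as if it were a real token.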

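When dealing with `variable-length` outputs, a common approach is to append an `end-of-sequence` token to every target sequence during training. At inference time, generation stops as soon as the model outputs this token, so it does not keep predicting the next word or character indefinitely.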
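The stopping rule can be sketched as a greedy decoding loop. Everything here is illustrative: `next_token` stands in for a trained model, `toy_next_token` is a canned stub, and the `max_len` cap is an added safeguard in case the token is never produced.

```python
EOS = "<eos>"  # assumed end-of-sequence marker


def greedy_decode(next_token, max_len=20):
    """Generate tokens until the model emits EOS or max_len is reached."""
    output = []
    while len(output) < max_len:
        token = next_token(output)
        if token == EOS:
            break  # the model signalled that the sequence is complete
        output.append(token)
    return output


# Toy stand-in for a model: replays a canned sequence based on the prefix length.
canned = ["hello", "world", EOS, "ignored"]


def toy_next_token(prefix):
    return canned[len(prefix)]


result = greedy_decode(toy_next_token)
print(result)
```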

### 4. What is a beam search, and why would you use it? What tool can you use to implement it?

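Beam search is a technique for improving the quality of the sequences a model generates. Instead of greedily keeping only the single most likely next word, it keeps track of the k most probable partial sentences (k is the beam width). At each step, it extends each of them by one word and keeps only the k most probable results. This lets the model explore the most promising options in parallel before settling on the best sentence.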

We can implement it from scratch or, alternatively, use the `seq2seq` module of TensorFlow Addons, which provides an implementation.
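A from-scratch version might look roughly like the sketch below. It is a minimal illustration, not the TensorFlow Addons implementation: `next_log_probs` is a hypothetical callback returning log-probabilities for the next token, `TABLE` is a made-up toy model, and refinements such as length normalization are omitted.

```python
import math


def beam_search(next_log_probs, beam_width, max_len, eos="<eos>"):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [([], 0.0)]  # each beam is (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished: carry over unchanged
                continue
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the best beam_width candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens and tokens[-1] == eos for tokens, _ in beams):
            break
    return beams


# Toy model: the locally best first token ("a") leads to a worse full sequence.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy_next_log_probs(tokens):
    probs = TABLE.get(tuple(tokens), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy = beam_search(toy_next_log_probs, beam_width=1, max_len=4)
best = beam_search(toy_next_log_probs, beam_width=2, max_len=4)
print(greedy[0][0], math.exp(greedy[0][1]))
print(best[0][0], math.exp(best[0][1]))
```

With a beam width of 1 the search is greedy and commits to "a", ending with probability 0.3. With a beam width of 2 it also keeps "b" alive and finds the sequence with probability 0.36.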

### 5. What is an attention mechanism? How does it help?

An attention mechanism is a technique that allows an encoder-decoder model to deal with really long sequences. At each time step, it compares the decoder's current state with every encoder output and determines which parts of the input sequence are most relevant at that step. The decoder then receives a weighted combination of the encoder outputs rather than a single fixed summary. This allows the model to handle longer sequences, and it also lets us inspect what the model is paying attention to at each time step.
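The comparison step can be sketched with dot-product scoring, which is one common choice among several (concatenative scoring is another). The function names and the toy vectors below are illustrative assumptions, not a reference implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(decoder_state, encoder_outputs):
    """Score each encoder output against the decoder state, then blend them."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    # The context vector is the attention-weighted sum of the encoder outputs.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
               for i in range(dim)]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = dot_product_attention([0.0, 4.0], encoder_outputs)
print(weights)
print(context)
```

The weights are the inspectable part: printing them for each decoding step shows which input positions the model focused on.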

### 6. What is the most important layer in the transformer architecture? What is its purpose? 

The most important layer in the transformer architecture is the `multi-head attention layer`. Its purpose is to project the sequences into several different vector subspaces (one per head) and, in each of them, to identify which parts of the sequences are most aligned. This gives the model a good idea of how different parts of the sequence are connected and which of them matter most for each position, which improves the representation of the sequences.
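For reference, the standard formulation of this layer can be written as follows, where $Q$, $K$, and $V$ are the query, key, and value matrices, $d_k$ is the key dimension, $h$ is the number of heads, and the $W$ matrices are learned projections:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

$$
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^Q,\; K W_i^K,\; V W_i^V\right), \qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O
$$

Each head applies its own projections, so different heads can pick up different kinds of relationships between positions.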

### 7. When would you need to use sampled softmax?

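Sampled softmax is useful when training a classifier with a very large number of output classes (for example, a large vocabulary), where computing the probability of every class at each step would take a long time. Instead, during training we compute the loss using only the logit of the target class and the logits of a random sample of incorrect classes. At prediction time we cannot sample, because the correct class is unknown, so we use the regular softmax function over all classes.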
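The idea can be sketched as follows. This is a deliberately simplified illustration: `logit_of` is a hypothetical callback that computes the logit of a single class, negatives are drawn uniformly, and the correction for sampling probabilities that library implementations such as `tf.nn.sampled_softmax_loss` apply is left out.

```python
import math
import random


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for a single example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def sampled_softmax_loss(logit_of, num_classes, target, num_sampled, rng):
    """Approximate the loss using the target plus a few sampled wrong classes."""
    # Draw distinct class IDs from the classes other than the target.
    draws = rng.sample(range(num_classes - 1), num_sampled)
    negatives = [c if c < target else c + 1 for c in draws]
    # Only num_sampled + 1 logits are ever computed.
    subset = [logit_of(target)] + [logit_of(c) for c in negatives]
    return cross_entropy(subset, 0)


logits = [2.0, 0.5, -1.0, 0.1, 1.5, -0.3]
rng = random.Random(0)
full = cross_entropy(logits, 4)
approx = sampled_softmax_loss(lambda c: logits[c], len(logits), 4, 2, rng)
exact = sampled_softmax_loss(lambda c: logits[c], len(logits), 4, 5, rng)
print(full, approx, exact)
```

With only a few sampled classes the loss is a cheap approximation, and when every incorrect class is sampled it matches the full softmax loss.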