1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN, and a vector-to-sequence RNN?

sequence-to-sequence (assuming this could be direct sequence-to-sequence or encoder/decoder):
- stock prices as mentioned in the book
- weather or climate predictions
- speech to text
- machine translation
- generative video/audio (from a prefix)
- video frame by frame classification
- mover trajectory prediction

sequence-to-vector:
- next-step predictions (many the same as above, just predicting the final frame)
- sentiment score
- any other scoring based on a sequence (reward/utility function in reinforcement learning?)
- genome analysis
- DALL·E (description to image)

vector-to-sequence:
- image captioning
- video/audio generation
- robotic command processing (generate a list of actions from an enumerated command list)

2. How many dimensions must the inputs of an RNN layer have? What does each dimension represent? What about its outputs?

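- An RNN layer must have 3D inputs: `[batch_size, steps, feature_dimensions]`, i.e. the number of sequences in the batch, the number of time steps per sequence, and the number of features observed at each time step.
- With `return_sequences=True`, an RNN layer has 3D outputs: `[batch_size, steps, n_neurons]`, where the last dimension is the number of neurons in the layer. With `return_sequences=False`, only the last time step is returned, so the output is 2D: `[batch_size, n_neurons]`.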
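
To make the shapes concrete, here is a minimal NumPy sketch of a basic tanh RNN layer that returns one output per time step. It is not the Keras implementation; `simple_rnn_forward` and all the sizes below are made up for illustration.

```python
import numpy as np

def simple_rnn_forward(x, w_x, w_h, b):
    """Run a basic tanh RNN layer over a batch of sequences.

    x has shape [batch_size, steps, n_features]; the result has shape
    [batch_size, steps, n_neurons] (one output per time step).
    """
    batch_size, steps, _ = x.shape
    n_neurons = w_h.shape[0]
    h = np.zeros((batch_size, n_neurons))  # initial hidden state
    outputs = []
    for t in range(steps):
        # The same weights are reused at every time step.
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h + b)
        outputs.append(h)
    return np.stack(outputs, axis=1)

rng = np.random.default_rng(0)
batch_size, steps, n_features, n_neurons = 4, 10, 3, 5  # illustrative sizes
x = rng.normal(size=(batch_size, steps, n_features))
w_x = rng.normal(size=(n_features, n_neurons))
w_h = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

y = simple_rnn_forward(x, w_x, w_h, b)
print(x.shape, "->", y.shape)  # prints (4, 10, 3) -> (4, 10, 5)
```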

3. If you want to build a deep sequence-to-sequence RNN, which RNN layers should have `return_sequences=True`? What about a sequence-to-vector RNN?

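- Sequence-to-sequence RNNs should have `return_sequences=True` on all RNN layers.
- Sequence-to-vector RNNs should have `return_sequences=True` on all but the last RNN layer; the last one keeps the default `return_sequences=False` so that it outputs only its final time step.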
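
The reason for this rule is easiest to see by following the shapes through a stack. The sketch below only does shape bookkeeping; `rnn_output_shape` and `stack_output_shape` are hypothetical helpers, not Keras functions, and the layer sizes are arbitrary.

```python
def rnn_output_shape(input_shape, units, return_sequences):
    """Shape produced by a recurrent layer (hypothetical helper, not a Keras API)."""
    if len(input_shape) != 3:
        raise ValueError(f"RNN layers need 3D input, got shape {input_shape}")
    batch_size, steps, _ = input_shape
    if return_sequences:
        return (batch_size, steps, units)  # one output per time step
    return (batch_size, units)             # only the last time step

def stack_output_shape(input_shape, layers):
    """Propagate a shape through a list of (units, return_sequences) specs."""
    shape = input_shape
    for units, return_sequences in layers:
        shape = rnn_output_shape(shape, units, return_sequences)
    return shape

input_shape = (32, 50, 1)  # [batch_size, steps, features], illustrative

seq_to_seq = stack_output_shape(input_shape, [(20, True), (20, True)])
seq_to_vec = stack_output_shape(input_shape, [(20, True), (20, False)])
print(seq_to_seq)  # (32, 50, 20)
print(seq_to_vec)  # (32, 20)

# If a lower layer drops the time axis, the next RNN layer gets 2D input.
try:
    stack_output_shape(input_shape, [(20, False), (20, False)])
except ValueError as err:
    print("broken stack:", err)
```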

4. Suppose you have a daily univariate time series, and you want to forecast the next seven days. Which RNN architecture should you use?

A deep sequence-to-sequence RNN with a `Dense(7, ...)` layer on top, so that at every time step the model forecasts the next seven values. Since the last RNN layer is configured with `return_sequences=True`, the Dense layer receives a 3D tensor of shape `[batch_size, steps, n_neurons]`. `keras.layers.TimeDistributed` is not necessary: when its input has rank higher than 2, `Dense` applies the same weights along the last axis at every time step, which is equivalent to a 1D convolution with kernel size 1 across the time dimension (index 1). In other words, it transforms the last axis from `n_rnn_neurons` to `n_dense_neurons`.
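
The claim about Dense on a 3D tensor can be checked with plain NumPy standing in for a linear `Dense(7)` layer. This is a sketch under that assumption; `kernel`, `bias`, and the sizes are illustrative names and values, not anything from Keras.

```python
import numpy as np

rng = np.random.default_rng(42)
batch_size, steps, n_neurons, horizon = 2, 5, 4, 7  # illustrative sizes

rnn_out = rng.normal(size=(batch_size, steps, n_neurons))
kernel = rng.normal(size=(n_neurons, horizon))  # stand-in for Dense weights
bias = rng.normal(size=horizon)

# Applied to the whole 3D tensor: matmul broadcasts over batch and time.
all_at_once = rnn_out @ kernel + bias

# Applied separately at each time step, then stacked back along the time axis.
per_step = np.stack(
    [rnn_out[:, t, :] @ kernel + bias for t in range(steps)], axis=1
)

print(all_at_once.shape)                   # (2, 5, 7)
print(np.allclose(all_at_once, per_step))  # True
```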

The RNN layers could be LSTM or GRU. According to the author it is best to try both to see which performs best on a case-by-case basis.

If the training data contains a very large number of time steps, training could take random, shorter windows from the time series.
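
One way the random-window idea could look in practice is sketched below. `sample_windows` is a hypothetical helper, and the 30-day window length is an arbitrary assumption; only the 7-day horizon comes from the question.

```python
import random

def sample_windows(series, window_length, horizon, n_windows, seed=None):
    """Draw random (inputs, targets) pairs from one long series.

    Each pair is `window_length` consecutive values followed by the
    `horizon` values that come right after them.
    """
    rng = random.Random(seed)
    last_start = len(series) - window_length - horizon
    if last_start < 0:
        raise ValueError("series is too short for this window and horizon")
    samples = []
    for _ in range(n_windows):
        start = rng.randint(0, last_start)  # both ends inclusive
        split = start + window_length
        samples.append((series[start:split], series[split:split + horizon]))
    return samples

series = list(range(1000))  # toy daily series: the value equals the day index
batch = sample_windows(series, window_length=30, horizon=7, n_windows=4, seed=0)
for inputs, targets in batch:
    print(inputs[0], "...", inputs[-1], "->", targets)
```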

5. What are the main difficulties when training RNNs? How can you handle them?

- Vanishing/exploding gradients: with non-saturating activations, activations can grow or shrink at every time step, eventually vanishing or exploding. Even with saturating activations like tanh, the gradients themselves can still vanish or explode. Mitigations:
  - Gradient clipping
  - Layer normalization, which normalizes activations according to the first and second moments across the feature dimension
  - Dropout
- Short-term memory: simple memory cells forget earlier states after very few time steps (approximately 10). Mitigations:
  - LSTM: passes two hidden states forward in time: a long-term memory and a short-term memory.
    - Learns gates that filter the long-term memory, the inputs, and the outputs based on the current input and the previous hidden state.
    - Also learns a main layer that proposes candidate values from the same inputs; the input gate decides how much of them is written to the long-term memory.
  - GRU: Similar to LSTM but slightly simpler:
    - Only one hidden state
    - chooses between long-term memory and short-term inputs based on input and previous hidden state.
  - 1D convolution in combination with RNN: downsample sequence to reduce sequence length while still providing features with similar information content
  - WaveNet: Main idea is to increase dilation in successive layers so each top level neuron has a large, hierarchical receptive field even with only a branching factor (kernel size) of 2
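
The WaveNet point is easy to check with a little arithmetic. The sketch below assumes stride-1 dilated convolutions; the 10-layer stack with doubling dilation rates is an illustrative choice, and `receptive_field` is a made-up helper.

```python
def receptive_field(kernel_size, dilation_rates):
    """Input time steps seen by one output of a stack of stride-1 dilated
    1D convolutions: each layer adds (kernel_size - 1) * rate steps."""
    field = 1
    for rate in dilation_rates:
        field += (kernel_size - 1) * rate
    return field

doubling = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
print(receptive_field(2, doubling))      # 1024
print(receptive_field(2, [1] * 10))      # 11 (same depth, no dilation)
```

With kernel size 2, doubling the dilation rate makes the receptive field grow exponentially with depth, while the undilated stack only grows linearly.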

6. Can you sketch the LSTM cell's architecture?
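
As a textual stand-in for a drawing, the cell's data flow can be sketched as a single step function. This assumes the standard LSTM formulation (forget, input, and output gates plus a tanh candidate layer); packing the four weight blocks into one matrix, and their order, is just a convenience chosen here, and `lstm_step` is an illustrative name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, w_x, w_h, b):
    """One time step of a standard LSTM cell.

    x: [batch, n_features]; h_prev, c_prev: [batch, n_units].
    w_x: [n_features, 4 * n_units], w_h: [n_units, 4 * n_units], b: [4 * n_units].
    The four column blocks hold the forget gate, input gate, output gate
    and candidate layer, in that (arbitrary) order.
    """
    z = x @ w_x + h_prev @ w_h + b
    f_pre, i_pre, o_pre, g_pre = np.split(z, 4, axis=1)
    f = sigmoid(f_pre)       # forget gate: what to keep of the long-term state
    i = sigmoid(i_pre)       # input gate: how much new content to write
    o = sigmoid(o_pre)       # output gate: what to expose as short-term state
    g = np.tanh(g_pre)       # candidate values proposed from x and h_prev
    c = f * c_prev + i * g   # new long-term state
    h = o * np.tanh(c)       # new short-term state (also the cell's output)
    return h, c

rng = np.random.default_rng(0)
batch, n_features, n_units = 3, 2, 4  # illustrative sizes
x = rng.normal(size=(batch, n_features))
h0 = np.zeros((batch, n_units))
c0 = np.zeros((batch, n_units))
w_x = rng.normal(size=(n_features, 4 * n_units))
w_h = rng.normal(size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

h1, c1 = lstm_step(x, h0, c0, w_x, w_h, b)
print(h1.shape, c1.shape)  # (3, 4) (3, 4)
```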


7. Why would you want to use 1D convolutional layers in an RNN?

I actually answered this in question 5 already:

> 1D convolution in combination with RNN: downsample sequence to reduce sequence length while still providing features with similar information content

8. Which neural network architecture could you use to classify videos?

I'm not positive what this question is asking. I'll assume the simplest interpretation: it wants whole videos grouped into categories such as offensive/not-offensive. Another interpretation would be frame-by-frame object detection/tracking, but I'll assume that is not what is being asked.

Assuming we want to predict whether content contains offensive material, the architecture would be sequence-to-vector.

So we would likely have input data with shape `[batch_size, steps, height, width, channels]`. One possible approach would be to use:
- A 2D convolutional network, applied to each frame, to reduce the spatial dimensionality while increasing feature depth
- Global average pooling to eliminate the spatial dimensions of each feature map
- An RNN that takes the output of the convolutional network, `[batch_size, steps, n_feature_maps]`, with a `Dense(1, activation="sigmoid")` top layer (a softmax over a single unit would always output 1)
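
The shape bookkeeping for the first two steps can be sketched in NumPy. This is only a sketch: a 1x1 convolution (a per-pixel linear map plus ReLU) stands in for the real convolutional network, and all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, steps, height, width, channels = 2, 8, 16, 16, 3  # illustrative
n_feature_maps = 6

video = rng.normal(size=(batch_size, steps, height, width, channels))

# Stand-in for the CNN: a 1x1 convolution applied to every pixel of every frame.
kernel = rng.normal(size=(channels, n_feature_maps))
feature_maps = np.maximum(video @ kernel, 0.0)
# feature_maps: [batch_size, steps, height, width, n_feature_maps]

# Global average pooling: average each feature map over height and width.
frame_features = feature_maps.mean(axis=(2, 3))
# frame_features: [batch_size, steps, n_feature_maps]

print(frame_features.shape)  # (2, 8, 6)
```

The resulting tensor has the 3D shape an RNN layer expects, with one feature vector per frame.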

I looked into this more after sketching out the above architecture, and it looks like it is the most naive architecture described in [Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset](https://arxiv.org/pdf/1705.07750.pdf).

More sophisticated approaches use 3D convolution and 2 streams that combine information from RGB frames and precomputed optical flow frames.