__Cell Support (tf.nn.rnn_cell)__
- BasicRNNCell: The most basic RNN cell.
- RNNCell: Abstract object representing an RNN cell.
- BasicLSTMCell: Basic LSTM recurrent network cell.
- LSTMCell: LSTM recurrent network cell.
- GRUCell: Gated Recurrent Unit cell.

Example:

To construct Cells (tf.nn.rnn_cell)

    cell = tf.nn.rnn_cell.GRUCell(hidden_size)
    
To stack multiple cells

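    # Create one cell object per layer; [cell] * num_layers would reuse a single cell object
    cells = [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)]
    rnn_cells = tf.nn.rnn_cell.MultiRNNCell(cells)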
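
A plain-Python sketch of why the way the list of layers is built matters: `[cell] * num_layers` repeats a reference to one object rather than creating `num_layers` separate cells. Depending on the TensorFlow 1.x release, passing such a list to MultiRNNCell may raise an error or make the layers share weights. `FakeCell` below is a hypothetical stand-in for illustration, not a TensorFlow class.

```python
class FakeCell:
    """Hypothetical stand-in for an RNN cell; not part of TensorFlow."""

    def __init__(self, hidden_size):
        self.hidden_size = hidden_size


num_layers = 3

# List multiplication copies the reference, so every layer is the same object.
shared = [FakeCell(128)] * num_layers

# A list comprehension calls the constructor once per layer.
separate = [FakeCell(128) for _ in range(num_layers)]

num_shared_objects = len({id(c) for c in shared})
num_separate_objects = len({id(c) for c in separate})
print(num_shared_objects, num_separate_objects)  # 1 3
```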

Construct Recurrent Neural Network
- tf.nn.dynamic_rnn: uses tf.while_loop to construct the graph dynamically when it is executed. Graph creation is faster, and you can feed batches of variable size.
- tf.nn.bidirectional_dynamic_rnn: the bidirectional version of dynamic_rnn.

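Stack multiple cells and run them with tf.nn.dynamic_rnn

    cells = [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)]
    rnn_cells = tf.nn.rnn_cell.MultiRNNCell(cells)
    # Pass the stacked cells (rnn_cells), not the single cell
    output, out_state = tf.nn.dynamic_rnn(rnn_cells, seq, length, initial_state)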

The problem with this is that you need to specify the *length*, yet most sequences are not of the same length.

### Dealing with variable sequence length

*The padded labels change the total loss, which affects the gradients*

Approach 1:
  1. Maintain a mask (True for real tokens, False for padded tokens)
  2. Run your model on both the real and the padded tokens (the model will predict labels for the padded tokens as well)
  3. Only take into account the loss caused by the real tokens

Example

    full_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=preds)
    loss = tf.reduce_mean(tf.boolean_mask(full_loss, mask))
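
To make the effect of the mask concrete, here is a minimal plain-Python sketch of the same idea with no TensorFlow involved. The per-token loss values and the helper `masked_mean` are made up for illustration; the point is only that padded positions inflate the average when they are not masked out.

```python
def masked_mean(losses, mask):
    """Average only the losses whose mask entry is True (real tokens)."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    if not kept:
        raise ValueError("mask selects no elements")
    return sum(kept) / len(kept)


# Hypothetical per-token losses: two real tokens followed by two padded ones.
full_loss = [0.5, 1.5, 4.0, 4.0]
mask = [True, True, False, False]

naive_loss = sum(full_loss) / len(full_loss)  # padded tokens included
loss = masked_mean(full_loss, mask)           # padded tokens ignored
print(naive_loss, loss)  # 2.5 1.0
```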

Approach 2: Let your model know the real sequence length so it only predicts the labels for the real tokens

Example

    cells = [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)]
    rnn_cells = tf.nn.rnn_cell.MultiRNNCell(cells)
    # Real length of each sequence, assuming padded time steps are all zeros
    length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
    output, out_state = tf.nn.dynamic_rnn(rnn_cells, seq, length, initial_state)
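
The length expression `tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)` can be traced by hand. The sketch below redoes it in plain Python on a tiny batch of shape [batch, time, features]. It assumes that features are non-negative and that padded time steps are all zeros; with negative features the trick would need an absolute value first. The helper names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1, like tf.sign on a scalar."""
    return (x > 0) - (x < 0)


def sequence_lengths(batch):
    """Count the non-padded time steps of each sequence in the batch."""
    lengths = []
    for sequence in batch:
        # max over the feature axis: 1 if any feature is non-zero, else 0
        used = [max(sign(value) for value in step) for step in sequence]
        # sum over the time axis: number of real time steps
        lengths.append(sum(used))
    return lengths


batch = [
    [[0.3, 0.1], [0.0, 0.7], [0.0, 0.0]],  # 2 real steps, 1 padded
    [[0.5, 0.2], [0.0, 0.0], [0.0, 0.0]],  # 1 real step, 2 padded
    [[0.1, 0.0], [0.2, 0.4], [0.9, 0.0]],  # 3 real steps
]
lengths = sequence_lengths(batch)
print(lengths)  # [2, 1, 3]
```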

## How to deal with common problems when training RNNs

### Vanishing Gradients
Use different activation units:
- tf.nn.relu
- tf.nn.relu6
- tf.nn.crelu
- tf.nn.elu

In addition to:
- tf.nn.softplus
- tf.nn.softsign
- tf.nn.bias_add
- tf.sigmoid
- tf.tanh

### Exploding Gradients
Clip gradients with tf.clip_by_global_norm

    trainables = tf.trainable_variables()
    gradients = tf.gradients(cost, trainables)  # take gradients of cost w.r.t. ALL trainable variables
    clipped_gradients, _ = tf.clip_by_global_norm(gradients, max_grad_norm)  # clip the gradients by a pre-defined max norm
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.apply_gradients(zip(clipped_gradients, trainables))  # apply the clipped gradients
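
As a rough illustration of what global-norm clipping does, the plain-Python sketch below rescales all gradients by one shared factor, `clip_norm / max(global_norm, clip_norm)`, so their relative directions are preserved. The function name and the toy gradients are assumptions for illustration; this is not TensorFlow's implementation.

```python
import math


def clip_by_global_norm_sketch(grads, clip_norm):
    """Scale every gradient by the same factor so the global norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The factor is 1.0 when the gradients are already small enough.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


# Toy gradients for two variables; global norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm_sketch(grads, clip_norm=6.5)
print(norm, clipped)  # 13.0 [[1.5, 2.0], [6.0]]
```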

### Anneal the learning rate
Optimizers accept both scalars and tensors as the learning rate.

    learning_rate = tf.train.exponential_decay(init_lr,
                                               global_step,
                                               decay_steps,
                                               decay_rate,
                                               staircase=True)
    optimizer = tf.train.AdamOptimizer(learning_rate)
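
The decay schedule itself is a one-line formula: `init_lr * decay_rate ** (global_step / decay_steps)`, with integer division of the exponent when `staircase=True`. The plain-Python sketch below mirrors that formula so the effect of `staircase` is easy to see; the function name and the sample values are chosen only for illustration.

```python
def exponential_decay_sketch(init_lr, global_step, decay_steps, decay_rate,
                             staircase=False):
    """Return the decayed learning rate at the given step."""
    if staircase:
        # The rate drops in discrete jumps every decay_steps steps.
        exponent = global_step // decay_steps
    else:
        # The rate decays smoothly at every step.
        exponent = global_step / decay_steps
    return init_lr * decay_rate ** exponent


lr_smooth = exponential_decay_sketch(0.1, 1500, 1000, 0.5)
lr_stair = exponential_decay_sketch(0.1, 1500, 1000, 0.5, staircase=True)
print(lr_smooth, lr_stair)
```

With these values the staircase rate is 0.1 * 0.5 = 0.05, while the smooth rate is lower because it has already decayed for half of the next interval.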

### Overfitting

Use dropout (through tf.nn.dropout or DropoutWrapper for cells) or early stopping
- tf.nn.dropout

      hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

- DropoutWrapper

      cell = tf.nn.rnn_cell.GRUCell(hidden_size)
      cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)

- Early stopping
  - Implement early stopping yourself by evaluating your model’s performance on a validation set every N steps during training, and saving a “winner” snapshot of the model (using a Saver) whenever the model outperforms the previous “winner” snapshot. At the end of training, restore the last “winner” snapshot. Do not stop as soon as performance starts dropping, because it may improve again a few steps later. One good strategy is to count the number of steps since the last “winner” snapshot was saved, and stop when this counter is large enough that you can be confident the network is not going to beat it.

  - Another option is to use TensorFlow’s `ValidationMonitor` class and set its `early_stopping` parameters.
  - Or see http://mckinziebrandon.me/TensorflowNotebooks/2016/11/20/early-stopping.html
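
The do-it-yourself early stopping strategy above can be sketched as a small loop. This is a minimal illustration in plain Python, assuming higher validation scores are better and using a made-up list of scores in place of real evaluations; `patience` is a hypothetical name for the number of evaluations to wait without improvement. In real training code, the line that updates the best score is where you would save the "winner" snapshot with a Saver.

```python
def early_stopping_sketch(validation_scores, patience):
    """Return (index of best evaluation, best score, index where training stopped)."""
    best_score = float("-inf")
    best_eval = None
    stopped_at = None
    evals_since_best = 0
    for eval_index, score in enumerate(validation_scores):
        stopped_at = eval_index
        if score > best_score:
            # New "winner": this is where a snapshot would be saved.
            best_score, best_eval = score, eval_index
            evals_since_best = 0
        else:
            evals_since_best += 1
            if evals_since_best >= patience:
                break  # no improvement for `patience` evaluations in a row
    return best_eval, best_score, stopped_at


# Made-up validation scores, one per evaluation.
scores = [0.60, 0.65, 0.64, 0.70, 0.69, 0.68, 0.67, 0.90]
best_eval, best_score, stopped_at = early_stopping_sketch(scores, patience=3)
print(best_eval, best_score, stopped_at)  # 3 0.7 6
```

The single dip at evaluation 2 does not stop training, which is the reason for counting instead of stopping at the first drop. The late jump to 0.90 is never seen, which shows the trade-off in choosing the counter threshold.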

## Language Modeling
n-grams: predict the next word based on the previous n-1 words
- Huge vocabulary
- Can’t generalize to OOV (out of vocabulary)
- Requires a lot of memory
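
A toy bigram (n = 2) model makes the out-of-vocabulary limitation visible. The sketch below is only an illustration on a made-up corpus: it counts which word follows which, predicts the most frequent follower, and has nothing to say about a word it never saw. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for every word, which words follow it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of prev, or None for an unseen word."""
    if prev not in counts:
        return None  # out of vocabulary: the model cannot generalize
    return counts[prev].most_common(1)[0][0]


corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
seen = predict_next(model, "the")    # "cat" follows "the" twice, "mat" once
unseen = predict_next(model, "dog")  # "dog" never appears in the corpus
print(seen, unseen)  # cat None
```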

Character-level: Both input and output are characters
- Pros:
  - Very small vocabulary
  - Doesn’t require word embeddings
  - Faster to train
- Cons:
  - Low fluency (many words can be gibberish)

Word-level: Both input and output are words

Subword-level: Input and output are subwords
- Keep W most frequent words
- Keep S most frequent syllables
- Split the rest into characters
- Seems to perform better than both word-level and character-level models*
        
      new company dreamworks interactive
      new company dre+ am+ wo+ rks: in+ te+ ra+ cti+ ve:

Mikolov, Tomáš, et al. "Subword language modeling with neural networks." (2012).
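
The splitting rule can be sketched in a few lines of Python. This is not the procedure from the cited paper: real syllabification is replaced here by a greedy longest-match against a small, made-up syllable set, and the `+` (word continues) and `:` (word ends) markers are inferred from the example above. All names and vocabularies are hypothetical.

```python
def segment(word, frequent_words, frequent_syllables):
    """Keep frequent words whole; split others into known syllables, else characters."""
    if word in frequent_words:
        return [word]
    pieces = []
    i = 0
    while i < len(word):
        match = None
        # Try the longest known syllable (2+ characters) starting at position i.
        for j in range(len(word), i + 1, -1):
            if word[i:j] in frequent_syllables:
                match = word[i:j]
                break
        if match is None:
            match = word[i]  # fall back to a single character
        pieces.append(match)
        i += len(match)
    # Mark non-final pieces with "+" and the final piece with ":".
    return [p + "+" for p in pieces[:-1]] + [pieces[-1] + ":"]


frequent_words = {"new", "company"}
frequent_syllables = {"dre", "am", "wo", "rks"}

tokens = []
for w in "new company dreamworks".split():
    tokens.extend(segment(w, frequent_words, frequent_syllables))
sentence = " ".join(tokens)
print(sentence)  # new company dre+ am+ wo+ rks:
```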

Hybrid: Word-level by default, switching to character-level for unknown tokens
