<h1>Intermediate Sequence Modeling for Natural Language Processing<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#The-Problem-with-Vanilla/Elman-RNNs" data-toc-modified-id="The-Problem-with-Vanilla/Elman-RNNs-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The Problem with Vanilla/Elman RNNs</a></span></li><li><span><a href="#Gating-as-a-Solution-to-a-Vanilla-RNNs-Problems" data-toc-modified-id="Gating-as-a-Solution-to-a-Vanilla-RNNs-Problems-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Gating as a Solution to a Vanilla RNNs Problems</a></span></li><li><span><a href="#Tips-and-Tricks-for-training-sequence-models" data-toc-modified-id="Tips-and-Tricks-for-training-sequence-models-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Tips and Tricks for training sequence models</a></span></li></ul></div>

## Introduction

- A _sequence prediction_ task requires labeling each item of a sequence. Examples include language modeling, part-of-speech tagging, and named entity recognition.
- Sequence prediction is also referred to as sequence labeling.

![Figure 7.1](../images/figure_7_1.png)

## The Problem with Vanilla/Elman RNNs

Elman RNNs suffer from two problems:

- Inability to retain information for long-range predictions
    - At each time step the hidden state vector is updated regardless of whether the update makes sense. As a result, the RNN has no control over which values in the hidden state are retained and which are discarded. What is desired is some way for the RNN to decide whether an update happens at all and, if it does, by how much and to which parts of the state vector.
- Gradient Stability
    - Vanilla RNNs also suffer from vanishing or exploding gradients.
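
The gradient stability problem can be illustrated with a toy scalar recurrence. This is a minimal sketch, assuming a linear recurrence $h_t = w \cdot h_{t-1}$ (a simplification introduced here for illustration, not the Elman update itself). The gradient of $h_T$ with respect to $h_0$ is then $w^T$, which shrinks toward zero when $|w| < 1$ and grows without bound when $|w| > 1$.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T w.r.t. h_0 for the toy recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        # Chain rule: each time step multiplies by dh_t/dh_{t-1} = w.
        grad *= w
    return grad


# Hypothetical weights chosen only to show the two failure modes.
vanishing = gradient_through_time(0.5, 50)
exploding = gradient_through_time(1.5, 50)
print(f"w=0.5 after 50 steps: {vanishing:.3e}")
print(f"w=1.5 after 50 steps: {exploding:.3e}")
```

In a real RNN the scalar `w` is replaced by a Jacobian matrix, but the same repeated multiplication is what makes gradients vanish or explode over long sequences.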

Some solutions that can address these problems are:
- ReLUs
- Gradient Clipping
- Careful Initialization
- Gating (most reliable)
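
Of the options above, gradient clipping is the easiest to show in a few lines. The following is a minimal sketch of clipping by L2 norm on a plain list of gradient values; the function name `clip_by_norm` and the threshold are assumptions made for illustration. In PyTorch, the same idea is available as `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the allowed norm: return an unchanged copy.
        return list(grads)
    scale = max_norm / total_norm
    # Scaling every component by the same factor preserves the direction.
    return [g * scale for g in grads]


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0 is scaled down to 1.0
print(clipped)
```

Clipping addresses exploding gradients only; it does nothing for vanishing gradients, which is one reason gating is listed as the most reliable option.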

## Gating as a Solution to a Vanilla RNNs Problems

To understand the gating solution, let's suppose that we are adding two numbers, $a$ and $b$, and we want to control how much of $b$ gets into the sum. We can write this as:

$$ a + \lambda b $$
    
where $\lambda$ is a value between 0 and 1. If $\lambda = 0$, there is no contribution from $b$, and if $\lambda = 1$, $b$ contributes fully.

In the above example, we can interpret $\lambda$ as a _switch_ or a _gate_ that controls the amount of $b$ that gets into the sum. This is the intuition behind the gating mechanism.
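
The gated sum can be sketched directly in code. The function name `gated_sum` and the sample values are assumptions chosen for illustration.

```python
def gated_sum(a, b, gate):
    """Compute a + gate * b, where gate in [0, 1] controls how much of b passes."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return a + gate * b


# Closed gate, half-open gate, fully open gate.
for gate in (0.0, 0.5, 1.0):
    print(gate, gated_sum(2.0, 4.0, gate))
```

With the gate at 0 the result is just `a`, and with the gate at 1 it is the plain sum `a + b`.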

In the case of the Elman RNN, if the previous hidden state is $h_{t-1}$ and the current input is $x_t$, the recurrent update would look something like:

$$ h_t = h_{t-1} + F(h_{t-1}, x_t) $$

where $F$ is the recurrent computation of the RNN. This is an unconditioned sum and has the vanilla RNN problems mentioned above.

This can be updated with a gating function by making $\lambda$ a function of the previous hidden state vector $h_{t-1}$ and the current input $x_t$. The RNN update equation would then look like:

$$ h_t = h_{t-1} + \lambda(h_{t-1}, x_t) F(h_{t-1}, x_t) $$

Now the function $\lambda$ controls how much of the current input gets to update the state $h_{t-1}$, and it is context dependent. In gated networks, $\lambda$ is usually a sigmoid function, which produces values between 0 and 1.
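
A scalar version of this gated update might look as follows. This is a sketch under stated assumptions: the hidden state is a single number, $F$ is taken to be a `tanh` of a weighted sum, and the weights `w_h`, `w_x`, `u_h`, `u_x` are hypothetical parameters that a real network would learn.

```python
import math


def sigmoid(z):
    """Squash z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_step(h_prev, x, w_h=0.5, w_x=1.0, u_h=0.5, u_x=1.0):
    """One step of h_t = h_{t-1} + lambda(h_{t-1}, x_t) * F(h_{t-1}, x_t)."""
    candidate = math.tanh(w_h * h_prev + w_x * x)  # F(h_{t-1}, x_t), assumed form
    gate = sigmoid(u_h * h_prev + u_x * x)         # lambda(h_{t-1}, x_t)
    return h_prev + gate * candidate


# Run the update over a short, made-up input sequence.
h = 0.0
for x in [1.0, -0.5, 2.0]:
    h = gated_step(h, x)
    print(f"x={x:+.1f} -> h={h:.4f}")
```

When the gate's pre-activation is strongly negative, the gate is close to 0 and the state passes through almost unchanged, which is the "optional update" behavior the vanilla RNN lacks.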

In the case of the _long short-term memory network_ (LSTM), the above intuition is extended to incorporate not only conditional updates but also intentional forgetting of values in the previous hidden state $h_{t-1}$. This forgetting happens by multiplying the previous hidden state value $h_{t-1}$ with another function $\mu$ that also produces values between 0 and 1:

$$ h_t = \mu(h_{t-1}, x_t)h_{t-1} + \lambda(h_{t-1}, x_t) F(h_{t-1}, x_t) $$
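
The effect of the two gates on individual components of the state vector can be sketched elementwise. In this illustration the gate values are passed in directly rather than computed from $h_{t-1}$ and $x_t$, and the function name `forget_and_update` is an assumption. Real LSTM cells additionally maintain a separate cell state and an output gate, which this sketch omits.

```python
def forget_and_update(h_prev, candidate, forget_gate, update_gate):
    """Elementwise h_t = mu * h_{t-1} + lambda * F for list-valued states."""
    return [
        mu * h + lam * f
        for h, f, mu, lam in zip(h_prev, candidate, forget_gate, update_gate)
    ]


h_prev = [1.0, -2.0, 0.5]
candidate = [0.3, 0.3, 0.3]       # stands in for F(h_{t-1}, x_t)
forget_gate = [1.0, 0.0, 1.0]     # mu: keep, erase, keep
update_gate = [0.0, 1.0, 1.0]     # lambda: ignore, write, write

h_next = forget_and_update(h_prev, candidate, forget_gate, update_gate)
print(h_next)
```

The first component is carried over untouched, the second is erased and overwritten by the candidate, and the third accumulates the candidate on top of the old value.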

## Tips and Tricks for training sequence models

- When possible, use the gated variants
- When possible, prefer GRUs over LSTMs
- Use Adam as your optimizer
- Use gradient clipping
- Use early stopping
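
Early stopping can be sketched independently of any framework. The following assumes a patience-based rule (stop once the validation loss has failed to improve for `patience` consecutive epochs); the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Patience was never exhausted: training runs to the last epoch.
    return len(val_losses) - 1


losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64]
stop_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch)
```

Note that with a patience of 2 training stops before the later improvement to 0.64 is seen, so the patience value trades training time against the risk of stopping too soon.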