<h1>Advanced Sequence Modeling for Natural Language Processing<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Sequence-to-Sequence-Models,-Encoder-Decoder-Models-and-Conditioned-Generation" data-toc-modified-id="Sequence-to-Sequence-Models,-Encoder-Decoder-Models-and-Conditioned-Generation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sequence-to-Sequence Models, Encoder-Decoder Models and Conditioned Generation</a></span></li><li><span><a href="#Capturing-More-from-a-Sequence:-Bidirectional-Recurrent-Models" data-toc-modified-id="Capturing-More-from-a-Sequence:-Bidirectional-Recurrent-Models-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Capturing More from a Sequence: Bidirectional Recurrent Models</a></span></li><li><span><a href="#Capturing-More-from-a-Sequence:-Attention" data-toc-modified-id="Capturing-More-from-a-Sequence:-Attention-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Capturing More from a Sequence: Attention</a></span><ul class="toc-item"><li><span><a href="#Attention-in-Deep-Neural-Networks" data-toc-modified-id="Attention-in-Deep-Neural-Networks-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Attention in Deep Neural Networks</a></span></li></ul></li><li><span><a href="#Evaluating-Sequence-Generation-Models" data-toc-modified-id="Evaluating-Sequence-Generation-Models-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Evaluating Sequence Generation Models</a></span></li></ul></div>

## Introduction

- _Sequence-to-sequence Modeling_ refers to taking a sequence as input and producing another sequence, possibly of a different length, as output.
- Examples
    - Predict response for a given email
    - Translate text
    - Summarize the given text

## Sequence-to-Sequence Models, Encoder-Decoder Models and Conditioned Generation

- **Sequence-to-Sequence (S2S)** models are a special case of a general family of models called _encoder-decoder models_.
- An encoder-decoder model is a composition of two models, an encoder and a decoder, that are trained jointly.
    - The encoder takes an input and produces an encoding, or representation, $\phi$ of the input, which is usually a vector. The goal of the encoder is to capture the properties of the input that matter for the task at hand.
    - The decoder takes the encoded input and produces the desired output.
- S2S models can therefore be defined as encoder-decoder models in which the encoder and decoder are sequence models and the inputs and outputs are both sequences, possibly of different lengths.
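
The composition described above can be sketched with two toy scalar recurrences. This is only an illustration under stated assumptions: the functions `encode` and `decode`, the weights, and the fixed number of decoding steps are all hypothetical choices made for brevity, and a real decoder would predict tokens and stop at an end-of-sequence symbol.

```python
import math

def encode(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN encoder: return only the final hidden state (phi)."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def decode(phi, steps, w_h=0.8, w_y=0.3):
    """Toy scalar RNN decoder: start from phi and feed each output back in."""
    h, y, outputs = phi, 0.0, []
    for _ in range(steps):
        h = math.tanh(w_h * h + w_y * y)
        y = h  # trivial output layer; a real decoder would predict a token here
        outputs.append(y)
    return outputs

source = [0.2, -0.1, 0.4]       # input sequence of length 3
phi = encode(source)            # single-vector (here: single-number) encoding
target = decode(phi, steps=5)   # output sequence of length 5
print(len(source), len(target))
```

The output length is set independently of the input length, which is the property that distinguishes S2S models from sequence labeling.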

![Figure 8.1](../images/figure_8_1.png)

**Encoder-Decoder Models as Special Case of Conditioned Generation Models**

- In conditioned generation, a general conditioning context $c$, rather than the input representation $\phi$, influences a decoder to produce an output.
- When the conditioning context $c$ comes from an encoder model, conditioned generation is the same as an encoder-decoder model.
- Not all conditioned generation models are encoder-decoder models, because the conditioning context can also be derived from a structured source.
- For example, in weather report generation, the values of temperature, humidity, and wind speed and direction could condition a decoder to generate the textual weather report.
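
One common way to write conditioned generation down, assuming an autoregressive decoder (the notes do not state this explicitly), is to factor the probability of the output sequence over time steps. The symbols $y_t$ for the output at step $t$ and $T$ for the output length are introduced here for illustration only.

$$p(y_1, \dots, y_T \mid c) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, c)$$

Setting $c = \phi$ recovers the encoder-decoder case, while in the weather example $c$ would hold the structured measurements.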

![Figure 8.2](../images/figure_8_2.png)
![Figure 8.3](../images/figure_8_3.png)
![Figure 8.4](../images/figure_8_4.png)

## Capturing More from a Sequence: Bidirectional Recurrent Models

- The goal of a bidirectional recurrent model is to combine information from the past and the future to robustly represent the meaning of a word in a sequence.
- Any model in the recurrent family, such as Elman RNNs, LSTMs, or GRUs, could be used in such a bidirectional formulation.
- Bidirectional models, like unidirectional models, can be used in both classification and sequence labeling settings.
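
A minimal sketch of the bidirectional idea, assuming a toy scalar Elman-style recurrence: one pass reads the sequence left to right, a second pass with its own weights reads it right to left, and the two states at each position are paired. The function names and weight values are hypothetical.

```python
import math

def run_rnn(inputs, w_h, w_x):
    """Return the hidden state at every step: h_t = tanh(w_h*h_{t-1} + w_x*x_t)."""
    states, h = [], 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def bidirectional_states(inputs):
    """Pair each position's forward (past) state with its backward (future) state."""
    forward = run_rnn(inputs, w_h=0.5, w_x=1.0)
    # Read the reversed input, then flip the result so index t matches position t.
    backward = run_rnn(inputs[::-1], w_h=0.3, w_x=0.8)[::-1]
    return list(zip(forward, backward))

inputs = [0.1, -0.4, 0.7, 0.2]
states = bidirectional_states(inputs)
for t, (f, b) in enumerate(states):
    print(t, round(f, 3), round(b, 3))
```

For sequence labeling, each pair would feed a per-position classifier. For classification, one option is to combine the last forward state with the first backward state, since each has seen the whole sequence.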

![Figure 8.5](../images/figure_8_5.png)
![Figure 8.6](../images/figure_8_6.png)

## Capturing More from a Sequence: Attention

**Problems with S2S, encoder-decoder and conditioned generation models**

- These models cram (encode) the entire input sentence into a single vector $\phi$ and use that encoding to generate the output. This might work for very short sentences, but for long sentences it fails to capture the information in the entire input. This is a limitation of using just the final hidden state as the encoding.
- Vanishing gradients can also occur during backpropagation through time, which makes training difficult.

![Figure 8.7](../images/figure_8_7.png)

- **Attention** is the phenomenon in which our minds focus on the relevant parts of the input while producing output.
- **Attention mechanism** is the process in which sequence generation models incorporate attention to different parts of the input and not just the final summary of the input.
- The first models to incorporate the notion of attention for NLP were the machine translation models of Bahdanau et al. (2015).

![Figure 8.8](../images/figure_8_8.png)

### Attention in Deep Neural Networks

- In a typical S2S model, each time step of the encoder produces a hidden state representation, denoted $\phi_w$, specific to that time step.
- To incorporate attention, we consider not only the final hidden state of the encoder but also the hidden states of each intermediate step. These encoder hidden states are, somewhat uninformatively, called _values_.
- Attention also depends on the previous hidden state of the decoder, called the _query_. The query vector for time step $t=0$ is a fixed hyperparameter.
- Attention is represented by a vector whose dimension equals the number of values it attends to. This is called the _attention vector_, the _attention weights_, or sometimes the _alignment_.
- The attention weights are combined with the encoder states (values) to generate a _context vector_, sometimes also known as a _glimpse_. This context vector becomes the input for the decoder instead of the full sentence encoding.
- The attention vector for the next time step is updated using a _compatibility function_. The exact nature of the compatibility function depends on the attention mechanism being used.
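
The steps above can be sketched in a few lines. This sketch assumes a dot product as the compatibility function and a softmax to normalize the scores, which is one common choice rather than the only one. The function names and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, values):
    """Score each value against the query, then average the values by weight."""
    scores = [dot(query, v) for v in values]  # assumed compatibility function
    weights = softmax(scores)                 # attention vector, one weight per value
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # values, one per input step
query = [2.0, 0.0]                                     # previous decoder hidden state
weights, context = attend(query, encoder_states)
print(weights, context)
```

The attention vector has one entry per encoder state, and the context vector has the same dimension as a single encoder state. The decoder would recompute both at every output step using its latest hidden state as the query.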

![Figure 8.9](../images/figure_8_9.png)

**Ways to Implement Attention**
- The simplest and most common is the **Content-aware Mechanism**.
- Another popular mechanism is **Location-aware Attention**, which depends only on the query vector and the key.
- Attention weights are typically floating-point values between 0 and 1. This is called **Soft Attention**.
- It is also possible to learn a binary 0/1 vector for attention, which is called **Hard Attention**.
- When the attention mechanism depends on the encoder states for all the time steps in the input, it is known as **Global Attention**.
- In **Local Attention**, the attention mechanism depends only on a window of the input around the current time step.
- When multiple attention vectors are used to track different regions of the input, the mechanism is known as **Multiheaded Attention**, which is based on the work of Vaswani et al. (2017). That work popularized the concept of **Self Attention**, a mechanism in which the model learns which regions of the input influence one another.
- When the input is multimodal, such as both image and speech, it is possible to design a **Multimodal Attention Mechanism**.
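
To make the soft/hard and global/local distinctions concrete, here is a small sketch that turns the same raw scores into three different attention vectors. It only illustrates the shape of the weights: the helper names, the argmax rule for hard attention, and the fixed window for local attention are assumptions, and learning hard attention in practice needs techniques not shown here.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_attention(scores):
    """Binary 0/1 vector that selects only the highest-scoring position."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def local_attention(scores, center, width):
    """Soft attention restricted to positions within `width` of `center`."""
    lo = max(0, center - width)
    hi = min(len(scores), center + width + 1)
    window = softmax(scores[lo:hi])
    # Positions outside the window receive zero weight.
    return [0.0] * lo + window + [0.0] * (len(scores) - hi)

scores = [0.5, 2.0, 1.0, -1.0, 0.3]  # one raw score per encoder time step
soft = softmax(scores)               # soft and global: every step gets some weight
hard = hard_attention(scores)        # hard: exactly one step is selected
local = local_attention(scores, center=3, width=1)  # local: steps 2..4 only
print(soft, hard, local)
```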

## Evaluating Sequence Generation Models

Sequence generation models are evaluated against an expected output called the **Reference Output**.

There are two kinds of evaluation for sequence generation models:

- Human Evaluation
- Automatic Evaluation
    - n-gram overlap based metrics: BLEU, ROUGE, METEOR
    - Perplexity
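
As a rough illustration of the two automatic metrics, the sketch below computes a clipped n-gram precision, which is the building block of BLEU, and perplexity from per-token probabilities. It is not a full BLEU implementation: BLEU also combines several n-gram orders and applies a brevity penalty. The function names, the example sentences, and the probabilities are made up for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    return overlap / total

def perplexity(token_probs):
    """Exponential of the average negative log-probability assigned to each token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

reference = "the cat sat on the mat".split()
candidate = "the cat sat on mat".split()
p1 = clipped_precision(candidate, reference, 1)  # unigram precision
p2 = clipped_precision(candidate, reference, 2)  # bigram precision
ppl = perplexity([0.25, 0.25, 0.25, 0.25])       # model assigns 0.25 to every token
print(p1, p2, ppl)
```

Higher n-gram precision means more overlap with the reference output, while lower perplexity means the model assigned higher probability to the observed tokens.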