# Survey of Neural Abstractive Text Summarization
Summary of a summary
https://arxiv.org/abs/1812.02303

## Evaluation metrics

### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
* Most common metric for summarization tasks.
* Recall based.
* A few different versions. Commonly reported are:
    * **ROUGE-N** measures overlap of n-grams (usually N=1, or N=2).
    * **ROUGE-L** measures the longest common subsequence (not necessarily consecutive).
* Critique:
    * Only cares about content, not about grammatical correctness, fluency, coherence, etc.
    * Content evaluation is perhaps a bit biased against abstractive approaches. Consider synonyms, for example: a correct paraphrase gets no credit for words that differ from the reference.
    * Sort of designed to be evaluated with multiple reference summaries but common datasets only have one per input text.
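To make the recall-based definitions concrete, here is a minimal sketch of ROUGE-N and ROUGE-L recall against a single reference. It assumes whitespace tokenization and skips stemming, stopword handling and the F-measure variants that real ROUGE implementations offer; the function names and the example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference, candidate, n=1):
    """Clipped n-gram overlap divided by the number of n-grams in the reference."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence (tokens need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """LCS length divided by the reference length."""
    return lcs_length(reference, candidate) / len(reference) if reference else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()

r1 = rouge_n_recall(reference, candidate, 1)  # 4 of 6 reference unigrams matched
r2 = rouge_n_recall(reference, candidate, 2)  # 1 of 5 reference bigrams matched
rl = rouge_l_recall(reference, candidate)     # LCS is "the cat on mat"
```

Note how swapping "sat" for "lay" costs the candidate both a unigram and two bigrams, which is the synonym problem from the critique above.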

### Others
* METEOR
* BLEU
* Perplexity (PPL)
* CIDEr

## Seq2Seq models
* Sequence to Sequence, i.e. a model that takes a sequence (of tokens) as input and outputs another sequence.
* Not necessarily the same length.
* Used in most (all?) modern abstractive summarization tasks.

### Basics
* Modeled as an **encoder** and a **decoder**.
* Encoder takes the input sequence $x = (x_1, \ldots, x_J)$ and outputs hidden states $h^e = (h_1^e, \ldots, h_J^e)$
* Decoder takes the hidden states and outputs output sequence $y = (y_1, \ldots, y_T)$
* In this case the input tokens are the tokens in the text to be summarized and the output tokens are the tokens of the summary text.

<img src="figs/nats-fig-2.png" width="50%">

* Encoders and decoders can be implemented as MLPs, CNNs, RNNs.
* The figure shows a bidirectional LSTM encoder and an LSTM decoder.
* In the LSTM case, usually the final hidden state of each direction is concatenated and used as the initial hidden state of the decoder, $h_0^d$. Same for initial cell state $c_0^d$.

#### Problems
* The encoder is difficult to train: there are many computational steps between the loss on a predicted token and the responsible part of the encoder, which limits propagation of the gradient signal.
* Representational difficulties: everything needs to be remembered in the final hidden state(s).
* Repetition in generated output.
* Factual errors in generated output.

The following slides are some attempts to improve on these shortcomings (from a bunch of different papers).

### Attention
* The goal of attention is to reduce the distance of information flow so that the model can get the relevant information more easily.
    * Alleviates the gradient propagation problem.
    * Less dependent on storing everything in the hidden state / context fed into the decoder.
* With attention, the decoder is allowed to look at all encoder hidden states $h_j^e$.

* Attention is just a distribution over the hidden states corresponding to the source tokens, which is then taken into account when making token predictions and computing the next hidden states.

$$
\begin{align*}
    s_{tj}^e &= s(h_j^e, h_t^d) \\
    \\
    \alpha_{tj}^e &= \frac{exp(s_{tj}^e)}{\sum_{k=1}^J exp(s_{tk}^e)} \\
    \\
    z_t^e &= \sum_{j=1}^J \alpha_{tj}^e h_j^e \\
    \\
    \tilde{h}_t^d &= \mathbf{W}_z(z_t^e \oplus h_t^d) + b_z \\
    \\
    P_{vocab,t} &= softmax(\mathbf{W}_{d2v}\tilde{h}_t^d + b_{d2v})
\end{align*}
$$

* There are several alternatives for the scoring function $s(h_j^e, h_t^d)$, some of which include extra learnable parameters.

* The figure shows the same bidirectional LSTM encoder and an LSTM decoder with attention at a certain decoding step.
<img src="figs/nats-fig-3.png" width="50%">
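As a rough illustration of the attention equations, the sketch below computes the scores, the attention distribution and the context vector for one decoding step. It assumes a plain dot product as the scoring function (one of several possible choices) and uses toy two-dimensional states; the function names are made up for this example, and the projection to $\tilde{h}_t^d$ and the vocabulary softmax are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot-product score; an assumed choice of scoring function."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(encoder_states, decoder_state, score=dot):
    """Return (alpha, z) for one decoding step.

    alpha[j] is the attention weight on encoder state j,
    z is the attention-weighted sum of the encoder states.
    """
    scores = [score(h_j, decoder_state) for h_j in encoder_states]
    alpha = softmax(scores)
    dim = len(encoder_states[0])
    z = [sum(a * h_j[d] for a, h_j in zip(alpha, encoder_states)) for d in range(dim)]
    return alpha, z


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder hidden states
decoder_state = [2.0, 0.0]                             # toy decoder hidden state
alpha, z = attention_context(encoder_states, decoder_state)
```

With these toy values the first and third source positions get equal, larger weights because their states align with the decoder state, and the context vector is pulled towards them.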

#### Transformer
* Network architecture from the "Attention Is All You Need" paper, very common in summarization papers since.
* No RNN or CNN, instead just relying on attention.
* Can be used for both encoder and decoder.
* Encoder is made up of several encoding layers where each such layer can attend to all positions in the current (layerwise) encoded sequence (self-attention).
* Decoder is made up of several decoding layers where each such layer can 
    * attend to all positions in the final encoded sequence.
    * attend to all previously generated positions in the decoded sequence (i.e. future is masked).

### Pointing/Copying mechanism
* The goal of these approaches is to handle out-of-vocabulary words.
    * e.g. a named entity that is not part of the vocabulary and couldn't otherwise be part of the summary.
* The following approaches all seem very similar to each other differing mainly in "hard" or "soft" switches when choosing between generating and copying.

#### Pointer softmax
Three parts in play at each decoding step $t$.

* Softmax prediction over vocabulary.
* Softmax prediction over locations in input sequence based on attention weights $\alpha_{tj}^e$.
* Switching network (MLP+sigmoid) to softly pick token from either predicted vocabulary distribution or input sequence by concatenating the weighted distributions.
$$
\begin{align*}
    p_{gen,t} = \sigma(\mathbf{W}_{s,z}z_t^e + \mathbf{W}_{s,h}h_t^d + b_s)
\end{align*}
$$

#### Switching Generator-Pointer
Seems similar to Pointer softmax but more explicitly choosing argmax from either predicted vocabulary distribution OR input sequence based on switch defined below.
$$
\begin{align*}
    p_{gen,t} = \sigma(\mathbf{W}_{s,z}z_t^e + \mathbf{W}_{s,h}h_t^d + \mathbf{W}_{s,E}E_{y_{t-1}} + b_s)
\end{align*}
$$

#### Copynet
* Extended vocabulary $\mathcal{V}_{ext}$ from union of vocabulary and input sequence.
* One distribution for generating from vocabulary and one distribution for copying from input sequence.

$$
\begin{align*}
    P_{\mathcal{V}_{ext}} &= P_g(y_t) + P_c(y_t) \\
    \\
    P_g(y_t) &= \left\{
        \begin{array}{ll}
            \frac{1}{Z}exp(\psi_g(y_t)) & \quad y_t \in \mathcal{V} \\
            0 & \quad otherwise
        \end{array}
    \right. \\
    \\
    P_c(y_t) &= \left\{
        \begin{array}{ll}
            \frac{1}{Z}\sum_{x_j=y_t} exp(\psi_c(x_j)) & \quad y_t \in \mathcal{X} \\
            0 & \quad otherwise
        \end{array}
    \right. \\
    \\
    \psi_g(y_t) &= \mathbf{W}_{d2v}\tilde{h}_t^d + b_{d2v} \\
    \\
    \psi_c(x_j) & \text{ from score function}
\end{align*}
$$


#### Pointer-Generator network

$$
\begin{align*}
    P_g(y_t) &= \left\{
        \begin{array}{ll}
            P_{vocab,t}(y_t) & \quad y_t \in \mathcal{V} \\
            0 & \quad otherwise
        \end{array}
    \right. \\
    \\
    P_c(y_t) &= \left\{
        \begin{array}{ll}
            \sum_{x_j=y_t} \alpha_{tj}^e & \quad y_t \in \mathcal{X} \\
            0 & \quad otherwise
        \end{array}
    \right. \\
    \\
    P_{\mathcal{V}_{ext}}(y_t) &= p_{gen,t}P_g(y_t) + (1 - p_{gen,t})P_c(y_t)
\end{align*}
$$
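The mixture in the last equation can be sketched as follows. This is only an illustration under assumptions: distributions are plain dictionaries and lists, the switch `p_gen` is passed in as a given number instead of being computed by a network, and the tokens and probabilities are made up.

```python
from collections import defaultdict


def pointer_generator_distribution(p_vocab, attention, source_tokens, p_gen):
    """Mix generation and copy distributions over the extended vocabulary.

    p_vocab:       dict token -> probability over the fixed vocabulary
    attention:     list of attention weights, one per source position
    source_tokens: list of source tokens (may contain out-of-vocabulary words)
    p_gen:         scalar switch in [0, 1]
    """
    mixed = defaultdict(float)
    for token, prob in p_vocab.items():
        mixed[token] += p_gen * prob
    # Copy probability of a token is the summed attention over its occurrences.
    for token, weight in zip(source_tokens, attention):
        mixed[token] += (1.0 - p_gen) * weight
    return dict(mixed)


p_vocab = {"the": 0.5, "visited": 0.3, "<unk>": 0.2}
source_tokens = ["obama", "visited", "obama"]  # "obama" is out of vocabulary here
attention = [0.6, 0.1, 0.3]

mixed = pointer_generator_distribution(p_vocab, attention, source_tokens, p_gen=0.25)
```

In this toy case the out-of-vocabulary token "obama" ends up with the highest probability, which it could never receive from the vocabulary softmax alone.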

### Repetition handling
* The goal here is to reduce repetition in the generated sequence which is a common problem. This occurs both at word level and at sentence level.
* The general idea is to in some way improve how the model can remember what has been generated before.

#### Coverage
* The aim is to remember what has been attended before.
* Define a coverage vector $u_t^e$ as the sum of previous attention distributions.
* Add coverage vector as input to attention score function to allow the current decoding step to know about previous attentions.
* Add a coverage loss $\sum_j min(\alpha_{tj}^e, u_{tj}^e)$ so that attention is not put at the same place repeatedly.
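A minimal sketch of the coverage bookkeeping, assuming the coverage vector starts at zero and the attention distributions are given as plain lists. The function name and the numbers are illustrative only, and feeding the coverage vector into the score function is not shown.

```python
def coverage_step(coverage, attention):
    """Return (loss, new_coverage) for one decoding step.

    loss penalizes attention mass placed where coverage is already high;
    new_coverage adds the current attention to the running sum.
    """
    loss = sum(min(a, u) for a, u in zip(attention, coverage))
    new_coverage = [u + a for a, u in zip(attention, coverage)]
    return loss, new_coverage


attention_per_step = [
    [0.7, 0.2, 0.1],  # step 1 attends mostly to position 0
    [0.6, 0.3, 0.1],  # step 2 attends to position 0 again
]

coverage = [0.0, 0.0, 0.0]
losses = []
for attention in attention_per_step:
    loss, coverage = coverage_step(coverage, attention)
    losses.append(loss)
```

The first step incurs no loss because nothing has been attended yet, while the second step is penalized for returning to the same position.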

#### Temporal attention
Similar to coverage but looks at the attention scores $s_{tj}^e$ from previous decoding steps.

$$
\begin{align*}
    s_{tj}^{temp} &= \frac{exp(s_{tj}^e)}{\sum_{k=1}^{t-1} exp(s_{kj}^e)} \\
    \\
    \alpha_{tj}^{temp} &= \frac{s_{tj}^{temp}}{\sum_{k=1}^J s_{tk}^{temp}} \\
    \\
    z_t^e &= \sum_{j=1}^J \alpha_{tj}^{temp} h_j^e
\end{align*}
$$
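The effect of the temporal normalization can be sketched like this. The sketch assumes the raw scores of all decoding steps so far are kept in a list, and that at the first step (where the sum over previous steps is empty) the plain exponentiated score is used; that first-step handling is an assumption of this example. Names and numbers are illustrative.

```python
import math


def temporal_attention(score_history):
    """Return temporal attention weights for the latest decoding step.

    score_history[k][j] is the raw score of source position j at step k;
    the last entry is the current step.
    """
    *past, current = score_history
    weights = []
    for j, s in enumerate(current):
        numerator = math.exp(s)
        if past:
            # Divide by how strongly position j was scored in earlier steps.
            denominator = sum(math.exp(step[j]) for step in past)
            weights.append(numerator / denominator)
        else:
            # First step: no history to normalize by (assumption).
            weights.append(numerator)
    total = sum(weights)
    return [w / total for w in weights]


# Step 1 scored position 0 highly; step 2 scores both positions equally.
score_history = [[2.0, 0.0], [1.0, 1.0]]
alpha_temp = temporal_attention(score_history)
```

A plain softmax over the current scores would give 0.5 to each position; the temporal version shifts weight to position 1 because position 0 was already strongly attended.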

#### Intra-decoder attention
* Similar to temporal attention (and coverage) but introduces $s_{t\tau}^d$ which is based on attention scores within the decoder only.
* A decoder side context vector $z_t^d$ is computed and is used in addition to the encoder side $z_t^e$ to compute $P_{vocab}$

$$
\begin{align*}
    s_{t\tau}^d &= s(h_\tau^d, h_t^d) \\
    \\
    \alpha_{t\tau}^d &= \frac{exp(s_{t\tau}^d)}{\sum_{k=1}^{t-1} exp(s_{tk}^d)} \\
    \\
    z_t^d &= \sum_{\tau=1}^{t-1} \alpha_{t\tau}^d h_\tau^d
\end{align*}
$$

#### Distraction
* Similar to coverage in that each decoding step should know what has been focused on in previous decoding steps.
* In this case this is based on the previous context vectors.

$$
\begin{align*}
    z_t^{dist} = tanh(\mathbf{W}_{dist,z} z_t^e - \mathbf{W}_{hist,z} \sum_{j=1}^{t-1} z_j^e)
\end{align*}
$$

### Improving encoded representations

#### Selective encoding
TODO

#### Read-again encoding
* Basically pass the sequence through the first encoder once to get the hidden states $(h_1^{e,1}, \ldots, h_J^{e,1})$.
* Then add another encoder whose hidden state is updated according to 

$$
\begin{align*}
    h_{sen}^{e,1} &= h_J^{e,1} \\
    \\
    h_j^{e,2} &= LSTM(h_{j-1}^{e,2}, E_{x_j} \oplus h_j^{e,1} \oplus h_{sen}^{e,1})
\end{align*}
$$

* The hidden states of the second read are then used by the decoder.

### Improving decoder
TODO

### Summarizing long documents
* Want to capture hierarchical representation of longer documents.
* "Chunks" refers to higher level representation.
* Essentially, use one encoder at token level and one at chunk level and use information from both at decoding steps.
* At decoding step $t$ we have both chunk attention weights $\alpha_i^{chk,t}$ and token attention weights $\alpha_{ij}^{wd,t}$, computed from the chunk-level and token-level hidden states respectively.

<img src="figs/nats-fig-7.png" width="50%">

#### Hierarchical attention
Tokens in less important chunks should receive less attention, so the token attention weights are rescaled by the chunk weights before computing the context vector from the token-level hidden states $h_{ij}^{wd}$.
$$
\begin{align*}
    \alpha_{ij}^{scale,t} &= \frac{\alpha_i^{chk,t}\alpha_{ij}^{wd,t}}{\sum_{k,l}\alpha_k^{chk,t}\alpha_{kl}^{wd,t}} \\
    \\
    z_t^e &= \sum_{i,j} \alpha_{ij}^{scale,t} h_{ij}^{wd}
\end{align*}
$$
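A small sketch of the rescaling, assuming the context vector is the weighted sum of token-level hidden states (called `token_states` here) and that chunk and token attention weights are already computed. All names and numbers are made up for illustration.

```python
def hierarchical_context(chunk_weights, token_weights, token_states):
    """Rescale token attention by chunk attention and build the context vector.

    chunk_weights[i]    attention weight of chunk i
    token_weights[i][j] attention weight of token j in chunk i
    token_states[i][j]  hidden state (list of floats) of token j in chunk i
    """
    scaled = [[chunk_weights[i] * w for w in row] for i, row in enumerate(token_weights)]
    norm = sum(sum(row) for row in scaled)
    scaled = [[w / norm for w in row] for row in scaled]

    dim = len(token_states[0][0])
    z = [0.0] * dim
    for row_w, row_h in zip(scaled, token_states):
        for w, h in zip(row_w, row_h):
            for d in range(dim):
                z[d] += w * h[d]
    return scaled, z


chunk_weights = [0.9, 0.1]                  # chunk 0 is considered important
token_weights = [[0.5, 0.5], [0.5, 0.5]]    # tokens look equally relevant locally
token_states = [
    [[1.0, 0.0], [1.0, 0.0]],               # tokens of chunk 0
    [[0.0, 1.0], [0.0, 1.0]],               # tokens of chunk 1
]

scaled, z = hierarchical_context(chunk_weights, token_weights, token_states)
```

Even though all tokens have the same token-level weight, the context vector is dominated by chunk 0 after rescaling.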

#### Discourse-aware attention 
Similar to hierarchical attention, but slightly different computation of scaled attention weights.
$$
\begin{align*}
    \alpha_{ij}^{scale,t} &= \frac{exp(\alpha_i^{chk,t}s_{ij}^{wd,t})}{\sum_{k,l} exp(\alpha_k^{chk,t}s_{kl}^{wd,t})} \\
\end{align*}
$$

#### Coarse-to-fine attention
Also similar to hierarchical attention, but computes the context vector by sampling a chunk $i$ and then using only the token-level hidden states $h_{ij}^{wd}$ of that chunk instead of all chunks.
$$
\begin{align*}
    z_t^e &= \sum_j \alpha_{ij}^{scale,t} h_{ij}^{wd}
\end{align*}
$$

#### Graph-based attention
* Different from hierarchical attention in that, instead of implicitly capturing which chunks are important, this is done explicitly by constructing a graph where chunks are vertices and similarities are edges, and applying PageRank.
    * (Skipping details)
* Intuition: Pagerank score of a vertex increases with more edges pointing to it.
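For the intuition only, here is a sketch of plain PageRank by power iteration on a small weighted graph. It is the textbook algorithm, not necessarily the exact variant used for graph-based attention; the damping factor, iteration count and the toy graph are assumptions of this example.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    adjacency[i][j] is the weight of the edge from vertex i to vertex j.
    Returns one score per vertex; the scores sum to 1.
    """
    n = len(adjacency)
    ranks = [1.0 / n] * n
    out_weight = [sum(row) for row in adjacency]
    for _ in range(iterations):
        new_ranks = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_weight[i] == 0:
                # Dangling vertex: spread its rank uniformly over all vertices.
                for j in range(n):
                    new_ranks[j] += damping * ranks[i] / n
            else:
                for j in range(n):
                    new_ranks[j] += damping * ranks[i] * adjacency[i][j] / out_weight[i]
        ranks = new_ranks
    return ranks


# Toy graph of three chunks: chunks 1 and 2 both point to chunk 0.
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
ranks = pagerank(adjacency)
```

Chunk 0 has the most incoming edges and gets the highest score, while chunk 2 has none and keeps only the baseline share.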

### Extraction + Abstraction
TODO

## Training strategies
TODO

## Summary generation
* Output of the decoder at each decoding step is a distribution over tokens.
* At prediction time we want to pick a strategy for using these distributions to produce a sequence that is close to optimal with regard to likelihood.

### Beam search
* Pick beam size $B$
* We always keep track of the top $B$ states of the decoder and sequence produced so far. This includes internal (LSTM) states etc.
* At each decoding step we pick the top $B$ candidate continuations and "lock them in" as $B$ separate hypotheses.
* With each token locked in and passed through the decoder to get the next hypothetical LSTM state, we consider the likelihood of each potential next token after it.
* For $B=5$ and vocabulary size 10000, we consider $5 \times 10000$ cases and then pick the top 5 of these.
* Repeat this process until done.
* $B=1$ gives greedy search, i.e. just picking most likely at each time step. But usually this is inferior.
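The procedure above can be sketched as below. This is a simplified version under assumptions: the decoder is replaced by a hypothetical lookup table `TABLE` that returns next-token probabilities for a given prefix, scores are summed log-probabilities without length normalization, and finished hypotheses take up beam slots.

```python
import math


def beam_search(next_log_probs, start_token, end_token, beam_size, max_len):
    """Return the best (sequence, log_prob) found by beam search.

    next_log_probs(sequence) -> dict token -> log P(token | sequence)
    """
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        # Extend every hypothesis with every possible next token.
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the top beam_size extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    # Prefer finished hypotheses; fall back to unfinished ones at max_len.
    return max(finished or beams, key=lambda c: c[1])


# Hypothetical toy "decoder": a fixed table of next-token probabilities.
TABLE = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.55, "y": 0.45},
    ("<s>", "b"): {"x": 0.9, "y": 0.1},
}


def toy_model(seq):
    probs = TABLE.get(tuple(seq), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy_seq, greedy_score = beam_search(toy_model, "<s>", "</s>", beam_size=1, max_len=5)
beam_seq, beam_score = beam_search(toy_model, "<s>", "</s>", beam_size=2, max_len=5)
```

In this toy table greedy search commits to "a" because it is the most likely first token, whereas beam size 2 keeps "b" alive and finds the more likely sequence overall.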

### Diverse beam search
TODO

## Recent papers
TODO: Include some produced summaries

## Discussion points
* ROUGE is not so good, could hinder progress. Did anyone look at the others?
* Many approaches to solving some of the issues were essentially the same idea applied at different layers. Pros and cons of some of them?
* Is this useful to us? Right now or can be in the future. Are results good enough to make a product?
* Did anyone look at the reinforcement learning approaches?