# Autoregressive models

## RNNs

- Interpret them as graphical models / Bayes nets
- When generating images, add a positional encoding to the input for better performance


## [MADE](https://arxiv.org/pdf/1502.03509)

Masked Autoencoder for Distribution Estimation

- [How to use multiple masks at once?](https://youtu.be/iyEOk8KCRUw?list=PLwRJQ4m4UJjPiJP3691u-qWwPGVKzSlNP&t=4177)

<img src="https://www.evernote.com/l/AgsYAUqEV21Nwp_UlobVq0wPKbG5zGDfUV8B/image.png" alt="drawing" width="500"/>
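As a rough illustration of the masking idea, here is a minimal NumPy sketch that builds one set of MADE-style masks for a fully connected autoencoder. The function name `made_masks`, the natural input ordering, and the random choice of hidden degrees are assumptions made for this example, not code from the paper.

```python
import numpy as np

def made_masks(n_inputs, hidden_sizes, seed=0):
    """Build one set of MADE-style binary masks (illustrative sketch).

    Every unit gets a degree m. A weight is kept only if it cannot
    leak information about x_d into the output that predicts x_d.
    """
    rng = np.random.default_rng(seed)
    degrees = [np.arange(1, n_inputs + 1)]  # inputs: degrees 1..D
    for h in hidden_sizes:
        # Hidden degrees lie between the previous layer's minimum and D - 1.
        low = min(degrees[-1].min(), n_inputs - 1)
        degrees.append(rng.integers(low, n_inputs, size=h))
    masks = []
    for d_in, d_out in zip(degrees[:-1], degrees[1:]):
        # Hidden unit k may see unit j of the previous layer iff m(k) >= m(j).
        masks.append((d_out[:, None] >= d_in[None, :]).astype(np.float64))
    # Output d may see the last hidden unit k iff d > m(k) (strict inequality).
    masks.append((degrees[0][:, None] > degrees[-1][None, :]).astype(np.float64))
    return masks

masks = made_masks(n_inputs=4, hidden_sizes=[8, 8])

# Multiply the masks to find which inputs can reach which outputs.
connectivity = masks[0]
for m in masks[1:]:
    connectivity = m @ connectivity
```

Entry `[d, j]` of `connectivity` is non-zero only when output `d` can depend on input `j`, so the autoregressive property holds when the matrix is strictly lower triangular. Calling the function with a different `seed` gives a different mask set, which is one way to think about using several masks.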

- How to train? 

Using the maximum likelihood approach:

$$\max_{\theta}\sum_{i}\sum_{k}\log\left(P_{\theta}\left(x_{k}^{(i)} | x_{1:k-1}^{(i)}\right)\right)$$

Since the logarithm is monotonic, this is equivalent to maximizing the likelihood of the whole dataset:


$$\max_{\theta}\prod_i\prod_k P_{\theta}\left(x_k^{(i)} | x_{1:k-1}^{(i)}\right) = \max_{\theta}\prod_i P_{\theta}\left(x_1^{(i)}, x_2^{(i)}, \dots, x_K^{(i)}\right)$$
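A small numeric check of this equivalence, using made-up conditional probabilities (the values in `cond_probs` are purely illustrative and do not come from any model):

```python
import numpy as np

# Hypothetical values of P(x_k | x_{1:k-1}) at the observed data,
# for N = 2 data points (rows) and K = 3 dimensions (columns).
cond_probs = np.array([[0.9, 0.5, 0.8],
                       [0.6, 0.7, 0.4]])

log_likelihood = np.log(cond_probs).sum()  # sum_i sum_k log P(...)
likelihood = cond_probs.prod()             # prod_i prod_k P(...)

# Both forms describe the same quantity.
assert np.isclose(np.exp(log_likelihood), likelihood)

# A common training loss: average negative log-likelihood per data point.
nll = -np.log(cond_probs).sum(axis=1).mean()
```

The log form is generally preferred in practice because a product of many small probabilities underflows quickly.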

- How to sample?

Once you have sampled $x_1, \dots, x_k$, feed those values into the network, evaluate $P(x_{k+1} | x_{1:k})$, and sample $x_{k+1}$ from it.

The distribution of $x_1$ depends on no input, so we can simply run the network and sample it. It follows by induction that we can sample $x_k$ for all $k$.
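The sampling loop can be sketched as follows for binary variables. `toy_conditional` is a hypothetical stand-in for the trained network; only the loop structure is the point here.

```python
import numpy as np

def toy_conditional(prefix):
    """Hypothetical stand-in for the network: P(x_{k+1} = 1 | x_{1:k})."""
    # With an empty prefix this is a constant, so x_1 needs no input.
    logit = 0.5 * sum(prefix) - 0.25 * len(prefix)
    return 1.0 / (1.0 + np.exp(-logit))

def sample(conditional, n_dims, rng):
    x = []
    for _ in range(n_dims):
        p_one = conditional(x)               # one forward pass per dimension
        x.append(int(rng.random() < p_one))  # draw x_{k+1} given x_{1:k}
    return x

rng = np.random.default_rng(0)
x = sample(toy_conditional, n_dims=5, rng=rng)
```

Each dimension costs one forward pass, which is why sampling from these models is sequential.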

## Masked Temporal Convolution

<img src="https://www.evernote.com/l/Ags00m1nWw9DLaGjOYVjeJKLU6GMx3-DAgsB/image.png" alt="drawing" width="500"/>

- Parameters are shared across time steps, i.e. the model uses convolutional filters
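A minimal sketch of a causal 1D convolution in NumPy, assuming a single channel and left zero-padding (the function name `causal_conv1d` is made up for this example). Output `t` depends on inputs up to and including `t`, so it would be used to predict $x_{t+1}$.

```python
import numpy as np

def causal_conv1d(x, w):
    """y[t] = sum_j w[j] * x[t - j], treating x[t - j] as 0 when t - j < 0."""
    x = np.asarray(x, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    k = len(w)
    x_padded = np.concatenate([np.zeros(k - 1), x])  # pad on the left only
    # The same filter w is reused at every time step (parameter sharing).
    return np.array([np.dot(w[::-1], x_padded[t:t + k]) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.3, 0.2])
y = causal_conv1d(x, w)

# Causality check: changing x[3] must not change outputs before t = 3.
x_changed = x.copy()
x_changed[3] += 10.0
y_changed = causal_conv1d(x_changed, w)
```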

## Wavenet

<img src="https://www.evernote.com/l/Agvynh7bkHBD3ZBiESXiF0xu8XQf_laS1jgB/image.png" alt="drawing" width="500"/>

- Gated activation: the outputs of a tanh and a sigmoid are multiplied elementwise
  - The sigmoid gates what comes through, similar to an LSTM
- Residual connections

- When using it on images, add a positional encoding
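The gated activation and the residual connection can be sketched like this. It is a simplification: the dilated convolutions are replaced by plain matrix products over channels, and the names `w_filter` and `w_gate` are assumptions for this example.

```python
import numpy as np

def gated_activation(filter_pre, gate_pre):
    """tanh(filter) * sigmoid(gate), applied elementwise."""
    return np.tanh(filter_pre) * (1.0 / (1.0 + np.exp(-gate_pre)))

def residual_gated_layer(x, w_filter, w_gate):
    """x: (T, C) activations; w_filter, w_gate: (C, C) weights (hypothetical)."""
    z = gated_activation(x @ w_filter, x @ w_gate)
    return x + z  # residual connection

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
out = residual_gated_layer(x, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
```

When the gate pre-activation is very negative the sigmoid is close to 0 and the layer passes `x` through almost unchanged. When it is very positive the output is close to `x + tanh(filter)`.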

## [PixelCNN](https://arxiv.org/pdf/1606.05328.pdf)

"Wavenet/MADE for 2D"


<img src="https://www.evernote.com/l/AgtA7S8jiNZBpa9Kos7GlsxwAukNXzHdkXUB/image.png" alt="drawing" width="200"/>

Sampling works the same way, one pixel at a time, so it is very slow.

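Stacking masked 3x3 filters leaves a blind spot: some earlier pixels never enter the receptive field. To avoid it, use a 3x2 filter instead of a 3x3 one, and then run a 1D filter.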

<img src="https://www.evernote.com/l/AgsFgShAnu1O26EoxIK2XZ8V3w8KBge4WOkB/image.png" alt="drawing" width="500"/>
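To see where the blind spot comes from, the sketch below computes which pixels can influence the centre pixel after stacking masked 3x3 convolutions (one type A layer, then type B layers). It only demonstrates the problem, not the fix, and the helper names are made up for this example.

```python
import numpy as np

def masked_3x3(include_center):
    """PixelCNN-style mask: the row above, plus the left neighbour."""
    mask = np.zeros((3, 3), dtype=bool)
    mask[0, :] = True            # row above
    mask[1, 0] = True            # left neighbour in the same row
    mask[1, 1] = include_center  # type B keeps the centre, type A drops it
    return mask

def receptive_field(size, n_layers):
    """Pixels that can influence the centre pixel of a size x size image."""
    c = size // 2
    reach = np.zeros((size, size), dtype=bool)
    reach[c, c] = True
    for layer in range(n_layers):
        mask = masked_3x3(include_center=layer > 0)  # first layer is type A
        new = np.zeros_like(reach)
        for r, col in zip(*np.nonzero(reach)):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, col + dc
                    inside = 0 <= rr < size and 0 <= cc < size
                    if mask[dr + 1, dc + 1] and inside:
                        new[rr, cc] = True
        reach = new
    return reach

rf = receptive_field(size=9, n_layers=4)
```

With the centre at `(4, 4)`, pixel `(3, 5)` is reachable but `(3, 6)` is not, even though it precedes the centre in raster order. Adding more layers does not help, because every step up one row can move at most one column to the right.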

## PixelCNN++

- Instead of a softmax, parameterize the distribution over each pixel's value with a mixture of logistic distributions.
- Use skip connections for downsampling 
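A rough sketch of how a mixture of logistics assigns probability to discrete pixel values: each value gets the mass the continuous mixture places on its bin. This version works directly in pixel units 0..255, which is a simplification chosen for readability, and the function name is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discretized_logistic_mixture(pi, mu, s, n_values=256):
    """Return P(x = v) for v = 0..n_values-1.

    pi, mu, s: mixture weights (summing to 1), means and scales, shape (M,).
    """
    v = np.arange(n_values)[:, None]      # (V, 1), broadcasts against (M,)
    cdf_hi = sigmoid((v + 0.5 - mu) / s)  # logistic CDF at the upper bin edge
    cdf_lo = sigmoid((v - 0.5 - mu) / s)  # logistic CDF at the lower bin edge
    cdf_hi[-1] = 1.0                      # last bin absorbs the right tail
    cdf_lo[0] = 0.0                       # first bin absorbs the left tail
    return ((cdf_hi - cdf_lo) * pi).sum(axis=1)

p = discretized_logistic_mixture(pi=np.array([0.3, 0.7]),
                                 mu=np.array([50.0, 200.0]),
                                 s=np.array([5.0, 10.0]))
```

Compared with a 256-way softmax, the network only has to output a handful of parameters per pixel, and nearby intensity values automatically receive similar probabilities.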

## Masked Self-Attention

**Traditional Attention**

- $q$: query
- $K$: keys
- $V$: values

$$ A(q, K, V) = \sum_i \frac{e^{q\cdot k_i}}{\sum_j e^{q\cdot k_j}} v_i $$

**Masking**

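Subtract a large number from the exponent of every position that comes after the current pixel in the given ordering. Here $\text{masked}(k_i, q)$ is 1 if $k_i$ comes after $q$ in the ordering and 0 otherwise:

$$ A(q, K, V) = \sum_i \frac{e^{q\cdot k_i - \text{masked}(k_i, q)\cdot 10^{10}} }{\sum_j e^{q\cdot k_j - \text{masked}(k_j, q)\cdot 10^{10}}} v_i $$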

Masked attention is more flexible than masked convolution, since it supports orderings that convolutions cannot express:

<img src="https://www.evernote.com/l/AgvXTF9ycvRIDIKKzg2ewWtLbGy2t6GF9yQB/image.png" alt="drawing" width="300"/>
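A minimal NumPy sketch of masked self-attention with an arbitrary ordering, using the same subtract-a-large-number trick as the formula above. The `order` argument and the choice to let a position attend to itself are assumptions of this example. A strictly autoregressive model would also need to exclude the current position or shift the inputs.

```python
import numpy as np

def masked_self_attention(Q, K, V, order):
    """Q, K, V: arrays of shape (N, d).

    order[i] is the rank of position i in the generation ordering.
    Position i may attend to position j only if order[j] <= order[i].
    """
    scores = Q @ K.T                             # (N, N), entry [i, j] = q_i . k_j
    masked = order[None, :] > order[:, None]     # True where j comes after i
    scores = scores - masked * 1e10              # push masked exponents to -inf
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 3))
order = np.array([2, 0, 4, 1, 3])  # an arbitrary, non-raster ordering
out = masked_self_attention(Q, K, V, order)
```

Because the mask is just an `(N, N)` boolean matrix, any permutation of the pixels can serve as the ordering.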

## Class-Conditional PixelCNN

Feed in the class label as a one-hot encoded vector, which is turned into a class-dependent bias added in each of the filters.
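One way to picture this, as a sketch for a single layer: the one-hot label selects a row of a learned matrix, and that row is added as a per-channel bias at every pixel. The names `add_class_bias` and `class_embedding` are hypothetical.

```python
import numpy as np

def add_class_bias(features, label, class_embedding):
    """features: (C, H, W) pre-activations of one layer.
    label: integer class index.
    class_embedding: (n_classes, C) learned matrix (hypothetical name).
    """
    one_hot = np.zeros(class_embedding.shape[0])
    one_hot[label] = 1.0
    bias = one_hot @ class_embedding       # (C,), picks out one row
    return features + bias[:, None, None]  # same bias at every pixel

features = np.zeros((4, 2, 2))
class_embedding = np.arange(12, dtype=np.float64).reshape(3, 4)
out = add_class_bias(features, label=1, class_embedding=class_embedding)
```

The bias does not depend on the pixel location, so the label says what to draw but not where.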

## Improving Performance

- Break the autoregressive constraints by generating in a hierarchical pattern