Generative models aim to reproduce a real-world distribution given many training examples. One way to do this is to train the neural network so that the least amount of information is needed to reconstruct the training examples. For example, in a generative text model using next-token prediction, the only information needed is, "which token is next?". If the model outputs a distribution $q$ over the next token, then specifying the token $x$ that actually comes next takes $-\log_2 q(x)$ bits; averaged over the training data, this is the cross-entropy loss. While it is possible to train an autoregressive model for images, scanning pixel-by-pixel and line-by-line, next-token prediction is inherently flawed. Sometimes the right next token depends on what will come after it, which means the model must implicitly predict not just the next token but all the ones after it as well, while only being trained on the next token. Text happens to be written mostly one-dimensionally and is relatively small, so reinforcement learning can compensate for these flaws. However, images are much larger and intrinsically two-dimensional, so another approach is needed.
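To make the description-length view concrete, here is a minimal sketch of the next-token cross-entropy loss in PyTorch, with random tensors standing in for a real model and real text; it reports the average number of bits the model needs per token:

```python
import torch
import torch.nn.functional as F

# Toy setup: `logits` would come from any autoregressive model.
# Shapes: logits is (batch, seq_len, vocab), tokens is (batch, seq_len).
batch, seq_len, vocab = 2, 16, 100
logits = torch.randn(batch, seq_len, vocab)          # model predictions (stand-in)
tokens = torch.randint(0, vocab, (batch, seq_len))   # training text (stand-in)

# Predict token t from positions < t: shift predictions and targets by one.
pred = logits[:, :-1, :]          # distribution over token t given tokens < t
target = tokens[:, 1:]            # the token that actually came next

# Cross-entropy is returned in nats; divide by ln(2) to read it as bits per token.
loss = F.cross_entropy(pred.reshape(-1, vocab), target.reshape(-1))
bits_per_token = loss / torch.log(torch.tensor(2.0))
print(f"{bits_per_token.item():.2f} bits needed per token on average")
```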
The model cannot directly output probabilities for every possible image. Even a simple black-and-white MNIST image has $28 \times 28 = 784$ binary pixels, and therefore $2^{784}$ possible values, far too many to enumerate. Instead, the model can learn an invertible map $f$ from images to a known distribution $p_Z$, and the change-of-variables formula gives the log-probability of an image as

$$\log p_X(x) = \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|.$$
This technique is known as normalizing flows, as usually a normal distribution is chosen for the known distribution. The second term can be a little hard to compute, so diffusion models approximate it by using a stochastic differential equation for the mapping. When the mapping is generated by following an ODE,

$$\frac{dx}{dt} = f(x, t),$$

then the log-density changes continuously along the trajectory,

$$\frac{d}{dt}\log p\big(x(t)\big) = -\operatorname{tr}\!\left(\frac{\partial f}{\partial x}\big(x(t), t\big)\right),$$

so the determinant term becomes a time integral of the trace of the Jacobian of the velocity field.
Switching to a stochastic differential equation,

$$dx = f(x, t)\,dt + g(t)\,dW,$$

and tracking the difference the injected noise makes to the drift, projected back onto that same noise direction, gives an estimate of this trace, which is close to Hutchinson's estimator, but weighted a little strangely.
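For reference, here is a minimal sketch of Hutchinson's estimator for the trace of a Jacobian, which is exactly the quantity the divergence term needs; the linear vector field at the bottom is only a stand-in used to check the estimate against the exact trace:

```python
import torch

def divergence_hutchinson(f, x, n_samples=10):
    """Estimate tr(df/dx), i.e. the divergence of f at x, without building
    the full Jacobian: tr(J) = E[eps^T J eps] for eps with zero mean and
    identity covariance."""
    x = x.requires_grad_(True)
    y = f(x)
    total = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(x)                  # random probe direction
        # Vector-Jacobian product eps^T (df/dx) via autograd, then dot with eps.
        (vjp,) = torch.autograd.grad(y, x, grad_outputs=eps, retain_graph=True)
        total = total + (vjp * eps).sum()
    return total / n_samples

# Stand-in vector field: a linear map whose true divergence is its trace.
A = torch.randn(5, 5)
f = lambda x: x @ A.T
x0 = torch.randn(5)
print("estimate:", divergence_hutchinson(f, x0).item())
print("exact   :", torch.trace(A).item())
```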
Flow models are pretty good, but the continuity assumption creates its own problems. Some features of images or videos, such as the number of fingers or the number of dogs, are discrete. While a flow model can push most of its outputs towards correct, discrete values, sometimes it will have to interpolate between them, generating 4.5 fingers or 2.5 dogs. This motivates discrete distribution networks.
In Lei Yang's work of the same name, this is achieved with a fractal structure. A model is trained to output several images slightly different from the one it is fed. At each iteration, the output closest to the target image is chosen and fed back into the model to produce more, hopefully even closer, images, so an initially blank input should slowly become the target. To train the model, the chosen image at each iteration is updated towards the target. Since there are only a finite number of outputs at each level, they specialize into different parts of the target distribution. If they divvy it up perfectly, a sample can be generated by choosing an output at random at every level, and after enough iterations the result is unlikely to be any image the model was ever specifically trained on.
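A minimal sketch of this guided-refinement idea as described above; the network architecture, the number of candidates K, the use of mean squared error as the distance, and starting from an all-zero image are illustrative assumptions rather than details taken from Yang's paper:

```python
import torch
import torch.nn as nn

K, LEVELS, SIZE = 8, 10, 28 * 28  # illustrative sizes

class MultiOutputNet(nn.Module):
    """Maps one image to K candidate refinements of it."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(SIZE, 256), nn.ReLU(),
                                 nn.Linear(256, K * SIZE))

    def forward(self, x):
        return self.net(x).view(K, SIZE)

model = MultiOutputNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(target):
    """One pass: at each level, pick the candidate closest to the target,
    pull it towards the target, and feed it into the next level."""
    x = torch.zeros(SIZE)                    # start from a blank image
    loss = 0.0
    for _ in range(LEVELS):
        candidates = model(x)                # K candidate images
        dists = ((candidates - target) ** 2).mean(dim=1)
        i = torch.argmin(dists)              # most similar candidate
        loss = loss + dists[i]               # update only the chosen output
        x = candidates[i].detach()           # feed it back in
    opt.zero_grad()
    loss.backward()
    opt.step()

def sample():
    """Generation: make a random choice at every level instead."""
    x = torch.zeros(SIZE)
    with torch.no_grad():
        for _ in range(LEVELS):
            x = model(x)[torch.randint(K, (1,)).item()]
    return x
```

Detaching the chosen candidate before feeding it back keeps each level's gradient local to that level, which is one reasonable reading of "the chosen image at each iteration is updated towards the target."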
Unfortunately, many outputs, even at the top level, end up "dead" in training: they are never chosen, so they are never updated, and they sit very far from the target distribution. The issue is with always picking the most similar image; sometimes it is alright to pick a similar image even if it is not the most similar one. Although not mentioned in Yang's paper, we can instead select output $i$ with probability $p_i$ proportional to $2^{-d_i}$, where $d_i$ is the number of bits needed to describe the target given output $i$, and increase the loss by the $-\log_2 p_i$ bits the choice itself takes to describe. This gives the loss

$$L = \mathbb{E}_{i \sim p}\!\left[\, d_i - \log_2 p_i \,\right].$$
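A sketch of how the selection step of the earlier training loop might change under this proposal; treating the mean squared error as the bit count $d_i$, and omitting any temperature, are assumptions on my part:

```python
import torch

def select_index(candidates, target):
    """Sample a candidate with probability proportional to 2^(-d_i),
    instead of always taking the argmin, so that near-misses still get
    chosen (and therefore trained) occasionally."""
    d = ((candidates - target) ** 2).mean(dim=1)   # stand-in for bits to describe the target
    logp = -d * torch.log(torch.tensor(2.0))       # log p_i up to a constant
    p = torch.softmax(logp, dim=0)                 # p_i proportional to 2^(-d_i)
    i = torch.multinomial(p, 1).item()
    # Total description length: bits for the residual plus bits for the choice itself.
    loss_i = d[i] - torch.log2(p[i])
    return i, loss_i
```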
If we want an infinite-depth model, we can choose to sometimes halt, but usually sample another image, continuing at each level with some fixed probability; the depth is then geometrically distributed, so generation still terminates even though no maximum depth is ever set.
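A sketch of what such a sampler could look like, reusing the hypothetical multi-output model from the earlier sketch; the continuation probability is left as a parameter since the text does not fix its value:

```python
import torch

def sample_unbounded(model, size, p_continue=0.95):
    """Sample with no fixed depth: after each refinement, continue with
    probability p_continue, otherwise halt and return the current image."""
    x = torch.zeros(size)
    with torch.no_grad():
        while True:
            candidates = model(x)                    # K candidate refinements
            x = candidates[torch.randint(candidates.shape[0], (1,)).item()]
            if torch.rand(()).item() >= p_continue:  # halt with probability 1 - p_continue
                return x
```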