In [None]:
import torch
from torch.nn import functional as F
from torch.distributions.categorical import Categorical
import numpy as np

# Discrete Case

## Constants
$t \in \mathbb{R}$ - timestep for denoising process in the range [0.0, 1.0]

$n \in \mathbb{N}^+$ - total number of denoising steps

$i \in \mathbb{N}^+$ such that $i \leq n$ - current denoising step

$\beta(t)$ - accuracy at time $t$

$\beta(1)$ - we adjust this to define maximum possible accuracy

$K$ - number of classes in our discrete distribution

$x$ - ground truth

$y'$ - noisy ground truth

$\delta_x$ - Kronecker delta of $x$, also known as the one-hot encoding

$\theta = \text{softmax}(y')$ - input parameters to the network (scaled to be between -1 and 1 using $\theta * 2 - 1$)

### Notes

In these examples the ground truth $x$ is already stored as a one-hot vector, e.g. $[0, 1, 0]$, so $\delta_x$ is the same vector $[0, 1, 0]$

$\text{beta} = \beta(1) * t^2$

Normally $y' \sim 𝒩(\text{beta} * (K * \delta_x - 1), \text{beta} * K)$, where the second argument is the variance, but we do reparameterization to get:

$y' = \text{beta} * (K * \delta_x - 1) + \sqrt{\text{beta} * K} * \epsilon$ where $\epsilon$ is noise drawn from a normal distribution with mean 0 and variance 1
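The reparameterized draw can be wrapped in a small helper. This is a minimal sketch, not a reference implementation: the function name `sample_theta` and its argument names are hypothetical, and it assumes $\delta_x$ is passed as a float one-hot tensor. Since the second argument of the normal distribution is a variance, the noise is scaled by its square root.

```python
import math

import torch
import torch.nn.functional as F


def sample_theta(delta_x: torch.Tensor, t: float, beta1: float) -> torch.Tensor:
    """Draw y' by reparameterization and return theta = softmax(y')."""
    K = delta_x.shape[-1]
    beta = beta1 * t**2                    # accuracy at time t
    mean = beta * (K * delta_x - 1)        # mean of y'
    std = math.sqrt(beta * K)              # standard deviation = sqrt(variance)
    epsilon = torch.randn(delta_x.shape)   # standard normal noise
    y_prime = mean + std * epsilon
    return F.softmax(y_prime, dim=-1)


delta_x = torch.tensor([0.0, 1.0, 0.0])
theta_start = sample_theta(delta_x, t=0.0, beta1=4.0)  # beta = 0, so theta is uniform
theta_late = sample_theta(delta_x, t=0.9, beta1=4.0)   # one noisy draw late in the process
```

The network input would then be `2 * theta - 1`, as described in the constants above.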

#### Discrete simple example

Suppose there is no batch dimension and the ground truth is [0, 1, 0], at the first denoising step ($i = 1$, so $t = 0$)

In [None]:
n = 10
i = 1
beta1 = 4
x = torch.tensor([0,1,0])
delta_x = x
K = len(x)

In [None]:
t = (i - 1) / n

In [None]:
beta = beta1 * (t**2)

In [None]:
epsilon = torch.normal(0, 1, size=delta_x.shape)

In [None]:
y_prime_left_term = beta * (K * delta_x - 1)
y_prime_right_term = (beta * K) ** 0.5 * epsilon  # scale noise by the standard deviation sqrt(beta * K)

In [None]:
y_prime = y_prime_left_term + y_prime_right_term

In [None]:
theta = F.softmax(y_prime, dim=-1)

In [None]:
theta  # uniform, as expected: at t = 0 we have beta = 0, so y' carries no information about x

tensor([0.3333, 0.3333, 0.3333])

In [None]:
theta_scaled = 2 * theta - 1

In [None]:
theta_scaled

tensor([-0.3333, -0.3333, -0.3333])

#### Discrete simple example

Suppose there is no batch dimension and the ground truth is [0, 1, 0], partway through denoising ($i = 5$ of $n = 10$, so $t = 0.4$)

In [None]:
n = 10
i = 5
beta1 = 4
x = torch.tensor([0,1,0])
delta_x = x
K = len(x)

In [None]:
t = (i - 1) / n

In [None]:
beta = beta1 * (t**2)

In [None]:
epsilon = torch.normal(0, 1, size=delta_x.shape)

In [None]:
y_prime_left_term = beta * (K * delta_x - 1)
y_prime_right_term = (beta * K) ** 0.5 * epsilon  # scale noise by the standard deviation sqrt(beta * K)

In [None]:
y_prime = y_prime_left_term + y_prime_right_term

In [None]:
theta = F.softmax(y_prime, dim=-1)

In [None]:
theta  # partway through denoising theta is no longer uniform, but one noisy draw can still favor a wrong class (here index 2)

tensor([0.0922, 0.0198, 0.8880])

In [None]:
theta_scaled = 2 * theta - 1

In [None]:
theta_scaled

tensor([-0.8156, -0.9603,  0.7759])

#### Discrete simple example

Suppose there is no batch dimension and the ground truth is [0, 1, 0], at the last denoising step ($i = 10$ of $n = 10$, so $t = 0.9$)

In [None]:
n = 10
i = 10
beta1 = 4
x = torch.tensor([0,1,0])
delta_x = x
K = len(x)

In [None]:
t = (i - 1) / n

In [None]:
beta = beta1 * (t**2)

In [None]:
epsilon = torch.normal(0, 1, size=delta_x.shape)

In [None]:
y_prime_left_term = beta * (K * delta_x - 1)
y_prime_right_term = (beta * K) ** 0.5 * epsilon  # scale noise by the standard deviation sqrt(beta * K)

In [None]:
y_prime = y_prime_left_term + y_prime_right_term

In [None]:
theta = F.softmax(y_prime, dim=-1)

In [None]:
theta  # late in denoising theta is nearly one-hot; this recorded draw peaks on index 2, not on the true class (index 1)

tensor([3.7434e-08, 2.6485e-03, 9.9735e-01])

In [None]:
theta_scaled = 2 * theta - 1

In [None]:
theta_scaled

tensor([-1.0000, -0.9947,  0.9947])
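Because a single draw is noisy, it can help to average over many draws to see the trend across $t$. The sketch below is illustrative only: `num_draws` and the dictionary `mean_true_prob` are hypothetical names, the seed is arbitrary, and the noise is scaled by the standard deviation $\sqrt{\text{beta} * K}$.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability

delta_x = torch.tensor([0.0, 1.0, 0.0])
K = delta_x.shape[-1]
beta1 = 4.0
num_draws = 10_000  # hypothetical number of Monte Carlo draws

mean_true_prob = {}
for t in (0.0, 0.4, 0.9):
    beta = beta1 * t**2
    epsilon = torch.randn(num_draws, K)
    # One row per draw; delta_x broadcasts across the draw dimension.
    y_prime = beta * (K * delta_x - 1) + math.sqrt(beta * K) * epsilon
    theta = F.softmax(y_prime, dim=-1)
    # Average probability assigned to the true class (index 1).
    mean_true_prob[t] = theta[:, 1].mean().item()
    print(f"t={t:.1f}  mean theta[true class]={mean_true_prob[t]:.3f}")
```

At $t = 0$ the average is exactly $1/K$, and it should grow toward 1 as $t$ increases.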

# Continuous Case

## Constants

$\sigma_1$ - the standard deviation of the noise as $t \to 1$

$\gamma(t) = 1 - \sigma_1^{2t}$

## Notes

The noisy input $\mu$ is drawn from $𝒩(\gamma(t) \cdot x, \gamma(t)(1 - \gamma(t)))$, where the second argument is the variance
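A minimal sketch of this draw, written with the reparameterization $\mu = \gamma(t) \cdot x + \sqrt{\gamma(t)(1 - \gamma(t))} \cdot \epsilon$. The helper name `sample_mu` is hypothetical; the noise is scaled by the standard deviation, i.e. the square root of the variance.

```python
import torch


def sample_mu(x: torch.Tensor, t: float, sigma_1: float) -> torch.Tensor:
    """Draw mu ~ N(gamma * x, gamma * (1 - gamma)) by reparameterization."""
    gamma = 1 - sigma_1 ** (2 * t)
    std = (gamma * (1 - gamma)) ** 0.5   # standard deviation, not variance
    epsilon = torch.randn(x.shape)       # standard normal noise
    return gamma * x + std * epsilon


x = torch.tensor([0.2, 0.8, 0.1, 0.9])
mu_start = sample_mu(x, t=0.0, sigma_1=0.001)  # gamma = 0: mu is all zeros
mu_end = sample_mu(x, t=1.0, sigma_1=0.001)    # gamma close to 1: mu is close to x
```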

### Continuous simple example

Suppose no batch dimension, ground truth is $[0.2, 0.8, 0.1, 0.9]$, and $t = 0$

In [None]:
x = torch.tensor([0.2, 0.8, 0.1, 0.9])
t = 0
sigma_1 = 0.001

In [None]:
gamma = 1 - sigma_1 ** (2 * t)

In [None]:
mean = gamma * x
variance = gamma * (1 - gamma)
epsilon = torch.normal(0, 1, size=x.shape)

In [None]:
mean, variance, epsilon

(tensor([0., 0., 0., 0.]), 0.0, tensor([ 0.5200,  0.1092, -0.6988, -0.4410]))

In [None]:
mu = mean + (variance ** 0.5) * epsilon  # scale noise by the standard deviation

In [None]:
mu

tensor([0., 0., 0., 0.])

### Continuous simple example

Suppose no batch dimension, ground truth is $[0.2, 0.8, 0.1, 0.9]$, and $t = 0.5$

In [None]:
x = torch.tensor([0.2, 0.8, 0.1, 0.9])
t = 0.5
sigma_1 = 0.001

In [None]:
gamma = 1 - sigma_1 ** (2 * t)

In [None]:
mean = gamma * x
variance = gamma * (1 - gamma)
epsilon = torch.normal(0, 1, size=x.shape)

In [None]:
mean, variance, epsilon

(tensor([0.1998, 0.7992, 0.0999, 0.8991]),
 0.000999000000000001,
 tensor([ 1.6145,  0.4570, -0.4408, -1.7838]))

In [None]:
mu = mean + (variance ** 0.5) * epsilon  # scale noise by the standard deviation

In [None]:
mu

tensor([0.2508, 0.8136, 0.0860, 0.8427])

### Continuous simple example

Suppose no batch dimension, ground truth is $[0.2, 0.8, 0.1, 0.9]$, and $t = 1$

In [None]:
x = torch.tensor([0.2, 0.8, 0.1, 0.9])
t = 1
sigma_1 = 0.001

In [None]:
gamma = 1 - sigma_1 ** (2 * t)

In [None]:
mean = gamma * x
variance = gamma * (1 - gamma)
epsilon = torch.normal(0, 1, size=x.shape)

In [None]:
mean, variance, epsilon

(tensor([0.2000, 0.8000, 0.1000, 0.9000]),
 9.999990000287556e-07,
 tensor([-0.3440, -0.1561,  0.3768, -0.0716]))

In [None]:
mu = mean + (variance ** 0.5) * epsilon  # scale noise by the standard deviation

In [None]:
mu

tensor([0.1997, 0.7998, 0.1004, 0.8999])

# Discrete Loss

## Discrete Loss example

Suppose ground truth is $x = [0, 1, 0]$ (this means $K = 3$), $\beta(1) = 4$. With $t$ drawn uniformly from range $[0, 1]$, suppose we get $t = 0.5$

Recall that $\beta(t) = t^2 \cdot \beta(1)$

The input $y'$ is drawn from $𝒩(\beta(t) \cdot ( K \cdot x - 1), \beta(t) \cdot K)$

Then we have $\theta = \text{softmax}(y')$ and then $\theta$ is scaled so values are between $-1$ and $1$ before being passed into the model.

Suppose model output is $\omega = [0.2, 0.6, 0.2]$

The loss is:

$K \cdot \beta(1) \cdot t \cdot || x - \omega||^2$
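The same formula can be applied per example and averaged over a batch. This is a sketch under assumptions: the function name `discrete_loss` is hypothetical, `x_onehot` and `omega` have shape `(batch, K)`, and `t` has shape `(batch,)`.

```python
import torch


def discrete_loss(x_onehot: torch.Tensor, omega: torch.Tensor,
                  t: torch.Tensor, beta1: float) -> torch.Tensor:
    """K * beta(1) * t * ||x - omega||^2, averaged over the batch."""
    K = x_onehot.shape[-1]
    sq_err = ((x_onehot - omega) ** 2).sum(dim=-1)  # squared L2 distance per example
    return (K * beta1 * t * sq_err).mean()


x_onehot = torch.tensor([[0.0, 1.0, 0.0]])
omega = torch.tensor([[0.2, 0.6, 0.2]])
t = torch.tensor([0.5])
loss = discrete_loss(x_onehot, omega, t, beta1=4.0)
print(loss)  # 3 * 4 * 0.5 * (0.04 + 0.16 + 0.04) = 1.44
```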

In [None]:
beta_1 = 4
x = torch.tensor([0, 1, 0])
t = 0.5
K = len(x)

In [None]:
beta = beta_1 * (t**2)

In [None]:
mean = beta * (K * x - 1)  # parenthesized to match the formula beta(t) * (K * x - 1)

In [None]:
variance = beta * K

In [None]:
epsilon = torch.normal(0, 1, x.shape)

In [None]:
mean, variance, epsilon

(tensor([-1.,  2., -1.]), 3.0, tensor([ 0.6564, -0.1692, -0.8507]))

In [None]:
y_prime = mean + (variance ** 0.5) * epsilon  # scale noise by the standard deviation

In [None]:
y_prime

tensor([ 0.1369,  1.7069, -2.4735])

In [None]:
theta = F.softmax(y_prime, dim=-1)

In [None]:
theta

tensor([0.1701, 0.8174, 0.0125])

In [None]:
theta = theta * 2 - 1

In [None]:
theta # we pass this into model

tensor([-0.6599,  0.6349, -0.9750])

In [None]:
model_output = torch.tensor([0.2, 0.6, 0.2])

In [None]:
loss = K * beta_1 * t * torch.sum((x - model_output)**2)

In [None]:
loss

tensor(1.4400)

# Continuous Loss

## Continuous Loss example

Recall that $\sigma_1$ is the standard deviation the noise approaches as $t \to 1$

$\gamma(t) = 1 - \sigma_1^{2t}$

The noisy input $\mu$ is drawn from $𝒩(\gamma(t) \cdot x, \gamma(t)(1 - \gamma(t)))$

We want the ground truth $x$. However, the model does not output a prediction $x'$ of the ground truth directly. Instead, it predicts the noise $\epsilon'$ that was added to the input through the reparameterization trick.

$\mu = \gamma(t) \cdot x + (\sqrt{\gamma(t) \cdot (1-\gamma(t))} \cdot \epsilon)$

We can rearrange this equation to produce the model's effective prediction of $x'$ given its output $\epsilon'$ like so:

$x' = \frac{\mu}{\gamma(t)} - (\sqrt{ \frac{1 - \gamma(t)}{\gamma(t)}} \cdot \epsilon')$
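For reference, the rearrangement is a one-line solve for $x$, using $\frac{\sqrt{\gamma(t) \cdot (1-\gamma(t))}}{\gamma(t)} = \sqrt{\frac{1 - \gamma(t)}{\gamma(t)}}$, which requires $\gamma(t) > 0$ (that is, $t > 0$):

$x = \frac{\mu - \sqrt{\gamma(t) \cdot (1-\gamma(t))} \cdot \epsilon}{\gamma(t)} = \frac{\mu}{\gamma(t)} - (\sqrt{ \frac{1 - \gamma(t)}{\gamma(t)}} \cdot \epsilon)$

Substituting the model's estimate $\epsilon'$ for the unknown $\epsilon$ turns the exact $x$ into the prediction $x'$.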

The loss function is then $-\ln(\sigma_1) \cdot \mathbb{E}[\frac{||x - x'||^2}{\sigma_1^{2t}}]$

The expectation is taken over the sampled $t$ and noise; in practice it is estimated by averaging over a batch. With a single example and no batch dimension, the estimate reduces to the term inside the expectation.
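A batched version of this loss might look like the sketch below. The function name `continuous_loss` is hypothetical, `t` is assumed to have shape `(batch, 1)` so that it broadcasts against `(batch, D)` tensors, and the input values are chosen only for illustration.

```python
import math

import torch


def continuous_loss(x: torch.Tensor, mu: torch.Tensor, epsilon_prime: torch.Tensor,
                    t: torch.Tensor, sigma_1: float) -> torch.Tensor:
    """-ln(sigma_1) * ||x - x'||^2 / sigma_1^(2t), averaged over the batch."""
    gamma = 1 - sigma_1 ** (2 * t)  # shape (batch, 1)
    # Effective prediction of x recovered from the predicted noise.
    x_prime = mu / gamma - torch.sqrt((1 - gamma) / gamma) * epsilon_prime
    sq_err = ((x - x_prime) ** 2).sum(dim=-1, keepdim=True)  # squared L2 norm per example
    return (-math.log(sigma_1) * sq_err / sigma_1 ** (2 * t)).mean()


x = torch.tensor([[0.3, 0.5, 0.2]])
mu = torch.tensor([[0.2959, 0.4435, 0.2119]])
epsilon_prime = torch.tensor([[0.05, -0.02, 0.03]])
t = torch.tensor([[0.5]])
loss = continuous_loss(x, mu, epsilon_prime, t, sigma_1=0.001)
print(loss)
```

Note that the division by $\gamma(t)$ is undefined at $t = 0$, so a real training loop would need to keep $t$ away from zero or handle that case separately.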

### Continuous Loss simple example

Let $\sigma_1 = 0.001$, $t = 0.5$, and ground truth $x$ be $[0.3, 0.5, 0.2]$

Suppose in response to $\mu$ the model outputs $[0.05, -0.02, 0.03] = \epsilon'$

In [None]:
sigma_1 = 0.001
t = 0.5
gamma_t = 1 - sigma_1 ** (2 * t)
x = torch.tensor([0.3, 0.5, 0.2])

In [None]:
mean = gamma_t * x
variance = gamma_t * (1 - gamma_t)
epsilon = torch.normal(0, 1, x.shape)

In [None]:
mu = gamma_t * x + (variance ** 0.5) * epsilon

In [None]:
mu # we feed this into model

tensor([0.2959, 0.4435, 0.2119])

In [None]:
# suppose model output is [0.05, -0.02, 0.03]
epsilon_prime = torch.tensor([0.05, -0.02, 0.03])

In [None]:
x_prime = (mu / gamma_t) - (((1 - gamma_t) / gamma_t) ** 0.5) * epsilon_prime

In [None]:
x_prime

tensor([0.2946, 0.4446, 0.2112])

In [None]:
loss = -np.log(sigma_1) * torch.sum((x - x_prime) ** 2) / (sigma_1 ** (2*t))

In [None]:
loss

tensor(22.2665)

# Bayesian update step

Bayesian update steps are only carried out explicitly during inference. During training we instead sample the result of the updates directly, which mimics their behavior without computing each individual step and makes better use of compute.

## Discrete Bayesian update step

$\beta(1)$ - the highest possible accuracy as $t \to 1$

$n$ - total number of inference steps

$i$ - current step out of the $n$ steps

$K$ - number of classes

We calculate accuracy $\alpha$ with:

$\alpha = \frac{\beta(1) \cdot (2 \cdot i - 1)}{n^2}$
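One way to read this formula, assuming evenly spaced inference steps $t_i = \frac{i}{n}$ and that $\alpha$ is the accuracy gained between consecutive steps under $\beta(t) = t^2 \cdot \beta(1)$:

$\alpha = \beta(t_i) - \beta(t_{i-1}) = \beta(1) \cdot \frac{i^2 - (i - 1)^2}{n^2} = \frac{\beta(1) \cdot (2 \cdot i - 1)}{n^2}$

For instance, with $\beta(1) = 4$, $n = 20$, and $i = 10$ this gives $\alpha = \frac{4 \cdot 19}{400} = 0.19$.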

We feed the current $\text{input}$ into the model, treat the model output as the logits of a categorical distribution, and draw a sample from it. The sample is converted into a one-hot encoding $\nabla$, which parameterizes the distribution of the noisy observation $y$.

$y \sim 𝒩(\alpha \cdot (K \cdot \nabla - 1), \alpha \cdot K)$

After sampling from $y$ we calculate:

$e^y \cdot \text{input} = \text{res}$

We then normalize the result with $\frac{\text{res}}{\text{sum}(\text{res})}$

This normalized result is the input to the next loop

### Discrete Bayesian update step example

Suppose no batch dimension (or rather, batch size of 1)

Suppose $\beta(1) = 4$, $n = 20$, $i = 10$, $\text{input} = [0.3, 0.5, 0.2]$

Suppose the model outputs $[1.0, -0.5, 0.2]$ and when we sample from it we get the first class (index 0), which is encoded as the one-hot vector $[1, 0, 0]$

In [None]:
beta_1 = 4
n = 20
i = 10
alpha = beta_1 * ((2 * i - 1) / (n**2))

In [None]:
model_input = torch.tensor([0.3, 0.5, 0.2])
K = len(model_input)
batch_size = 1

In [None]:
model_output = torch.tensor([1.0, -0.5, 0.2])

In [None]:
output_dist = Categorical(logits=model_output)

In [None]:
output_sampled = output_dist.sample((batch_size,))

In [None]:
sampled_one_hot = torch.tensor([[1, 0, 0]]) # normally we would use `F.one_hot(output_sampled, K)`, but this is a hard-coded example

In [None]:
mean = alpha * (K * sampled_one_hot - 1)
variance = alpha * K
epsilon = torch.normal(0, 1, sampled_one_hot.shape)

In [None]:
y = mean + (variance ** 0.5) * epsilon  # scale noise by the standard deviation

In [None]:
y

tensor([[ 0.6075, -0.5093,  0.0739]])

In [None]:
res = torch.exp(y) * model_input

In [None]:
res

tensor([[0.5507, 0.3005, 0.2153]])

In [None]:
res = res / torch.sum(res)

In [None]:
res # used as the input to the model in next inference step

tensor([[0.5164, 0.2817, 0.2019]])
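The steps of this example can be collected into one function. This is a sketch, not a definitive implementation: the name `discrete_bayesian_update` is hypothetical, the one-hot vector comes from an actual sample rather than being hard-coded, and the noise is scaled by the standard deviation $\sqrt{\alpha \cdot K}$.

```python
import math

import torch
import torch.nn.functional as F
from torch.distributions.categorical import Categorical


def discrete_bayesian_update(theta: torch.Tensor, logits: torch.Tensor,
                             i: int, n: int, beta1: float) -> torch.Tensor:
    """One inference step: sample a class, draw y, and renormalize exp(y) * theta."""
    K = theta.shape[-1]
    alpha = beta1 * (2 * i - 1) / n**2
    sampled = Categorical(logits=logits).sample()    # one class index per batch row
    one_hot = F.one_hot(sampled, K).to(theta.dtype)
    y = alpha * (K * one_hot - 1) + math.sqrt(alpha * K) * torch.randn(theta.shape)
    res = torch.exp(y) * theta
    return res / res.sum(dim=-1, keepdim=True)       # normalize each row to sum to 1


theta = torch.tensor([[0.3, 0.5, 0.2]])
logits = torch.tensor([[1.0, -0.5, 0.2]])  # stand-in for a real model output
theta_next = discrete_bayesian_update(theta, logits, i=10, n=20, beta1=4.0)
```

In a full inference loop, `theta_next` would be fed back into the model for step `i + 1`.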

## Continuous Bayesian update step

Recall that $\sigma_1$ is the standard deviation the noise approaches as $t \to 1$

$\gamma(t) = 1 - \sigma_1^{2t}$

During training the noisy input $\mu$ is drawn from $𝒩(\gamma(t) \cdot x, \gamma(t)(1 - \gamma(t)))$, but during inference it comes from the previous Bayesian update step.

We want the ground truth $x$. However, the model does not output a prediction $x'$ of the ground truth directly. Instead, it predicts the noise $\epsilon'$ that was added to the input through the reparameterization trick.

$\mu = \gamma(t) \cdot x + (\sqrt{\gamma(t) \cdot (1-\gamma(t))} \cdot \epsilon)$

We can rearrange this equation to produce the model's effective prediction of $x'$ given its output $\epsilon'$ like so:

$x' = \frac{\mu}{\gamma(t)} - (\sqrt{ \frac{1 - \gamma(t)}{\gamma(t)}} \cdot \epsilon')$

Let $n$ be the total number of inference steps and $i$ the current inference step. Before the update, the input $\mu$ has precision $p$

The accuracy $\alpha$ is calculated using:

$\alpha = \sigma_1^{\frac{-2i}{n}} \cdot (1 - \sigma_1^{\frac{2}{n}})$
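This expression has the form of an accuracy increment between two steps. As a sketch, assume an accuracy schedule $\beta(t) = \sigma_1^{-2t} - 1$ (an assumption not stated above, chosen because it satisfies $\gamma(t) = \frac{\beta(t)}{1 + \beta(t)} = 1 - \sigma_1^{2t}$) and evenly spaced steps $t_i = \frac{i}{n}$. Then:

$\alpha = \beta(t_i) - \beta(t_{i-1}) = \sigma_1^{\frac{-2i}{n}} - \sigma_1^{\frac{-2(i-1)}{n}} = \sigma_1^{\frac{-2i}{n}} \cdot (1 - \sigma_1^{\frac{2}{n}})$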

We construct the $y$ value from a distribution:

$y \sim 𝒩(x', \alpha^{-1})$

The new precision $p'$ is $p' = p + \alpha$

The updated input becomes $\mu' = \frac{\mu \cdot p + y \cdot \alpha}{p'}$ and this is used in the next inference step

In [None]:
mu = torch.tensor([0.25, 0.55, 0.18])
p = 3
t = 0.49
sigma_1 = 0.001
n = 100
i = 50

In [None]:
# suppose we feed in mu and get epsilon', and the effective x' is [0.31, 0.48, 0.22]
x_prime = torch.tensor([0.31, 0.48, 0.22])
alpha = (sigma_1 ** (-2 * i / n)) * (1 - sigma_1**(2/n))

In [None]:
alpha

129.0364100439194

In [None]:
mean = x_prime
variance = alpha ** (-1)
epsilon = torch.normal(0, 1, x_prime.shape)

In [None]:
y = mean + (variance ** 0.5) * epsilon  # scale noise by the standard deviation alpha ** -0.5

In [None]:
y

tensor([0.3117, 0.4742, 0.2356])

In [None]:
p_prime = p + alpha

In [None]:
mu_prime = (mu * p + y * alpha) / p_prime

In [None]:
mu_prime

tensor([0.3103, 0.4759, 0.2344])
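The update above can be wrapped into a function that returns both the new input and the new precision. This is a minimal sketch: the name `continuous_bayesian_update` is hypothetical, `x_prime` is assumed to have been computed from the model output already, and the noise is scaled by the standard deviation $\alpha^{-1/2}$.

```python
import torch


def continuous_bayesian_update(mu: torch.Tensor, p: float, x_prime: torch.Tensor,
                               i: int, n: int, sigma_1: float):
    """One inference step: draw y ~ N(x', 1 / alpha) and fuse it with mu."""
    alpha = sigma_1 ** (-2 * i / n) * (1 - sigma_1 ** (2 / n))
    y = x_prime + alpha ** -0.5 * torch.randn(x_prime.shape)  # std = alpha^(-1/2)
    p_next = p + alpha                                        # precisions add
    mu_next = (mu * p + y * alpha) / p_next                   # precision-weighted mean
    return mu_next, p_next


mu = torch.tensor([0.25, 0.55, 0.18])
x_prime = torch.tensor([0.31, 0.48, 0.22])
mu_next, p_next = continuous_bayesian_update(mu, p=3, x_prime=x_prime,
                                             i=50, n=100, sigma_1=0.001)
```

Both return values would be carried into step `i + 1`, where the model is queried again with `mu_next`.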