## Cross Entropy between Bernoulli Distributions

The cross entropy between two Bernoulli distributions $p = \text{Bernoulli}(\alpha)$ and $q = \text{Bernoulli}(\sigma(\beta))$, where $\sigma(\cdot)$ is the sigmoid function, is

$$H(p, q) = - \big(\alpha \log(\sigma(\beta)) + (1 - \alpha) \log(1 - \sigma(\beta))\big)$$

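For numerical stability, $\log(1 - \sigma(\beta))$ in the expression above is not obtained by first computing $\sigma(\beta)$ and then taking the logarithm of $1 - \sigma(\beta)$. For large positive $\beta$, $\sigma(\beta)$ rounds to 1 in floating point, so $1 - \sigma(\beta)$ becomes 0 and its logarithm evaluates to $-\infty$. Instead, it is computed from the following identity: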

$$\log(1 - \sigma(\beta)) = \log\left(1 - \frac{1}{1 + \exp(-\beta)}\right) = \log\left(\frac{\exp(-\beta)}{1 + \exp(-\beta)}\right) = -\beta + \log\left(\frac{1}{1 + \exp(-\beta)}\right) = \log(\sigma(\beta)) - \beta$$

Substituting this identity, $H(p, q)$ simplifies to

$$H(p, q) = - \big(\log(\sigma(\beta)) + (\alpha - 1) \beta\big)$$
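For reference, the substitution can be written out step by step. It uses only the identity $\log(1 - \sigma(\beta)) = \log(\sigma(\beta)) - \beta$ derived above:

$$\begin{aligned} H(p, q) &= - \big(\alpha \log(\sigma(\beta)) + (1 - \alpha) \big(\log(\sigma(\beta)) - \beta\big)\big) \\ &= - \big(\log(\sigma(\beta)) - (1 - \alpha) \beta\big) \\ &= - \big(\log(\sigma(\beta)) + (\alpha - 1) \beta\big) \end{aligned}$$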

$\log(\sigma(\beta))$ is computed using the `torch.nn.functional.logsigmoid` function provided by PyTorch.
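A minimal sketch of this computation follows. The function names `bernoulli_cross_entropy` and `naive_cross_entropy` and the sample values are illustrative assumptions, not taken from any particular codebase. The naive version is included only to show where the direct formula breaks down.

```python
import torch
import torch.nn.functional as F


def bernoulli_cross_entropy(alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Elementwise H(p, q) = -(log(sigmoid(beta)) + (alpha - 1) * beta)."""
    return -(F.logsigmoid(beta) + (alpha - 1.0) * beta)


def naive_cross_entropy(alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Direct formula: computes sigmoid first, then takes logarithms."""
    s = torch.sigmoid(beta)
    return -(alpha * torch.log(s) + (1.0 - alpha) * torch.log(1.0 - s))


# Moderate logits: both versions agree.
alpha = torch.tensor([0.0, 0.3, 1.0])
beta = torch.tensor([-2.0, 0.5, 3.0])
stable = bernoulli_cross_entropy(alpha, beta)
naive = naive_cross_entropy(alpha, beta)

# Large logit: sigmoid(beta) rounds to 1 in float32, so log(1 - sigmoid(beta))
# in the naive version is log(0) and the result is infinite.
big_beta = torch.tensor([40.0])
zero_alpha = torch.tensor([0.0])
stable_big = bernoulli_cross_entropy(zero_alpha, big_beta)
naive_big = naive_cross_entropy(zero_alpha, big_beta)

print(stable, naive)
print(stable_big, naive_big)
```

The stable form should also agree elementwise with PyTorch's built-in `F.binary_cross_entropy_with_logits(beta, alpha, reduction="none")`, which takes the logit $\beta$ as input and $\alpha$ as target.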

## Log Conditional Probability and KL Divergence

### 1. Bernoulli Distribution

#### 1.1 Log Conditional Probability

Let $\sigma(\beta)$ be the mean of the Bernoulli conditional distribution and $\alpha$ be the intensity of the pixel. The log conditional probability is given by

$$\log\big(p(\alpha; \sigma(\beta))\big) = \alpha \log(\sigma(\beta)) + (1 - \alpha) \log(1 - \sigma(\beta)) = \log(\sigma(\beta)) + (\alpha - 1) \beta$$

#### 1.2 KL Divergence

The KL divergence between two Bernoulli distributions $p = \text{Bernoulli}(\alpha)$ and $q = \text{Bernoulli}(\sigma(\beta))$ is

$$D_{\text{KL}}(p||q) = \big(\alpha \log(\alpha) + (1 - \alpha) \log(1 - \alpha)\big) - \big(\alpha \log(\sigma(\beta)) + (1 - \alpha) \log(1 - \sigma(\beta))\big) = \text{const} - \big(\log(\sigma(\beta)) + (\alpha - 1) \beta\big)$$

Because $\alpha$ is a constant hyperparameter, the terms that depend only on $\alpha$ (the $\text{const}$ above) do not affect the gradient with respect to $\beta$ and can be omitted from the loss function.
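The sketch below illustrates why the $\alpha$-only terms can be dropped: the full KL divergence and the reduced expression have the same gradient with respect to $\beta$. The helper name `bernoulli_kl` and the sample values are assumptions made for illustration. `torch.xlogy` is used for the entropy term so that $0 \log 0$ is treated as 0.

```python
import torch
import torch.nn.functional as F


def bernoulli_kl(alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Elementwise D_KL(Bernoulli(alpha) || Bernoulli(sigmoid(beta)))."""
    # Terms that depend only on alpha; xlogy(x, y) returns 0 where x == 0.
    alpha_only = torch.xlogy(alpha, alpha) + torch.xlogy(1.0 - alpha, 1.0 - alpha)
    # Terms that depend on beta, in the numerically stable form.
    beta_terms = -(F.logsigmoid(beta) + (alpha - 1.0) * beta)
    return alpha_only + beta_terms


alpha = torch.tensor([0.0, 0.25, 0.9, 1.0])
beta = torch.tensor([-1.0, 0.3, 2.0, 4.0], requires_grad=True)

# Gradient of the full KL divergence with respect to beta.
kl = bernoulli_kl(alpha, beta)
(grad_kl,) = torch.autograd.grad(kl.sum(), beta)

# Gradient of the reduced loss (alpha-only terms dropped).
reduced = -(F.logsigmoid(beta) + (alpha - 1.0) * beta)
(grad_reduced,) = torch.autograd.grad(reduced.sum(), beta)

print(grad_kl)
print(grad_reduced)
```

Both gradients should equal $\sigma(\beta) - \alpha$, which follows from differentiating $-\big(\log(\sigma(\beta)) + (\alpha - 1)\beta\big)$ with respect to $\beta$.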

### 2. Gaussian Distribution

#### 2.1 Log Conditional Probability

Let $\beta$ and $s$ be the mean and variance of the Gaussian conditional distribution, and $\alpha$ be the intensity of the pixel. The log conditional probability is given by

$$\log\big(p(\alpha; \beta, s)\big) = -\frac{1}{2}\left(\log(2 \pi s) + \frac{(\alpha - \beta)^2}{s}\right) = \text{const} - \frac{(\alpha - \beta)^2}{2 s}$$

Because $s$ is a constant hyperparameter, terms that depend only on $s$ can be omitted when computing posterior probabilities and the loss function.
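As a hedged illustration, the sketch below computes the full log density and the reduced form, and checks the full form against `torch.distributions.Normal`. The function names, the variance `s = 0.25`, and the sample values are assumptions chosen for the example. Note that `Normal` is parameterized by the standard deviation, so it receives $\sqrt{s}$.

```python
import math

import torch


def gaussian_log_prob(alpha: torch.Tensor, beta: torch.Tensor, s: float) -> torch.Tensor:
    """Elementwise log N(alpha; mean=beta, variance=s), including the constant."""
    return -0.5 * (math.log(2.0 * math.pi * s) + (alpha - beta) ** 2 / s)


def gaussian_log_prob_reduced(alpha: torch.Tensor, beta: torch.Tensor, s: float) -> torch.Tensor:
    """Same quantity with the term -0.5 * log(2 * pi * s) dropped."""
    return -((alpha - beta) ** 2) / (2.0 * s)


s = 0.25  # hypothetical fixed variance
alpha = torch.tensor([0.1, 0.5, 0.9])  # pixel intensities
beta = torch.tensor([0.0, 0.6, 0.2])   # predicted means

full = gaussian_log_prob(alpha, beta, s)
reduced = gaussian_log_prob_reduced(alpha, beta, s)

# Reference value from PyTorch; scale is the standard deviation sqrt(s).
reference = torch.distributions.Normal(loc=beta, scale=math.sqrt(s)).log_prob(alpha)

# The two forms differ by the same constant for every element.
offset = full - reduced
print(full, reference, offset)
```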

#### 2.2 KL Divergence

The KL divergence between two Gaussian distributions $p = \text{Normal}(\alpha, s)$ and $q = \text{Normal}(\beta, s)$ is

$$D_{\text{KL}}(p||q) = -\frac{1}{2} \big(\log(2 \pi s) + 1\big) + \frac{1}{2} \left(\log(2 \pi s) + \frac{s + (\alpha - \beta)^2}{s}\right) = \frac{(\alpha - \beta)^2}{2 s}$$
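The closed form above can be checked numerically against PyTorch's generic `kl_divergence` for two `Normal` distributions. This is a small sketch under assumed values. The function name `gaussian_kl_equal_variance` and the variance `s = 0.25` are illustrative, and `Normal` again takes the standard deviation $\sqrt{s}$ rather than the variance.

```python
import math

import torch
from torch.distributions import Normal, kl_divergence


def gaussian_kl_equal_variance(alpha: torch.Tensor, beta: torch.Tensor, s: float) -> torch.Tensor:
    """Elementwise D_KL(Normal(alpha, s) || Normal(beta, s)) with shared variance s."""
    return (alpha - beta) ** 2 / (2.0 * s)


s = 0.25  # hypothetical shared variance
alpha = torch.tensor([0.1, 0.5, 0.9])
beta = torch.tensor([0.0, 0.6, 0.2])

closed_form = gaussian_kl_equal_variance(alpha, beta, s)

# Reference value from PyTorch's registered Normal-Normal KL divergence.
std = math.sqrt(s)
reference = kl_divergence(Normal(alpha, std), Normal(beta, std))

print(closed_form, reference)
```

With equal variances the divergence reduces to a scaled squared error, so it is symmetric in $\alpha$ and $\beta$ and differs from the reduced negative log conditional probability of section 2.1 only by terms that do not depend on $\beta$.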