<h1><center>VAE - Directed Graphical Models</center></h1>

We work with ***directed probabilistic models***, also called ***directed probabilistic graphical models (PGMs)*** or ***Bayesian networks***. A directed graphical model is a probabilistic model in which all the variables are topologically organized into a directed acyclic graph. The joint distribution over the variables of such a model factorizes as a product of prior and conditional distributions:

<center>$p_{\theta}(x_1, \ldots, x_M) = \prod_{j=1}^{M} p_{\theta}(x_j | Pa(x_j)) $</center>

where $Pa(x_j)$ is the set of parent variables of node $j$ in the directed graph. For non-root nodes, we condition on the parents. For root nodes, the set of parents is empty, so the distribution is unconditional (a prior).
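
To make the factorization concrete, here is a minimal sketch on a toy network. The three binary variables `a`, `b`, `c`, the graph structure, and every probability value are made up for illustration; only the product-of-conditionals rule comes from the formula above.

```python
from itertools import product

# Hypothetical DAG over binary variables: a -> b, and (a, b) -> c.
# Each node maps to the tuple of its parents; "a" is a root node.
parents = {"a": (), "b": ("a",), "c": ("a", "b")}

# Conditional probability tables: P(node = 1 | parent values).
# All numbers are illustrative assumptions.
p_one = {
    "a": {(): 0.3},
    "b": {(0,): 0.2, (1,): 0.7},
    "c": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}


def conditional(node, value, assignment):
    """Return p(node = value | Pa(node)), reading parent values from `assignment`."""
    parent_values = tuple(assignment[p] for p in parents[node])
    p1 = p_one[node][parent_values]
    return p1 if value == 1 else 1.0 - p1


def joint(assignment):
    """Joint probability as the product of one conditional factor per node."""
    prob = 1.0
    for node in parents:
        prob *= conditional(node, assignment[node], assignment)
    return prob


# Summing the joint over all 2^3 assignments should give 1 (up to rounding).
total = sum(joint(dict(zip("abc", values))) for values in product((0, 1), repeat=3))
```

For the root node `a` the parent tuple is empty, so its factor is just the prior $p(a)$, which matches the remark about root nodes.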

To parameterize a conditional probability distribution $p_{\theta}(x_j | Pa(x_j))$ we can use neural networks. In this case, neural networks take as input the parents of a variable in a directed graph, and produce the distributional parameters $\eta$ over that variable:

<center>$ \eta = \text{NeuralNet}(Pa(x)) $</center>
<center>$ p_{\theta}(x | Pa(x)) = p_{\theta}(x | \eta )$</center>
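
As a rough sketch of this idea, the snippet below uses a tiny one-hidden-layer network to map two parent values to the parameters $\eta$ of a Gaussian over a scalar $x$. The Gaussian choice, the layer sizes, the random initialization, and the names `init_layer`, `dense`, `neural_net`, and `log_prob` are all assumptions made for illustration, not part of the text above.

```python
import math
import random

random.seed(0)


def init_layer(n_in, n_out):
    """Random weights and zero biases for one dense layer."""
    weights = [[random.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, inputs):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# theta collects the network weights: 2 parent inputs, 4 hidden units, 2 outputs.
theta = {"hidden": init_layer(2, 4), "out": init_layer(4, 2)}


def neural_net(theta, parent_values):
    """eta = NeuralNet(Pa(x)); here eta = (mean, log-variance) of a Gaussian."""
    hidden = [math.tanh(h) for h in dense(theta["hidden"], parent_values)]
    mean, log_var = dense(theta["out"], hidden)
    return mean, log_var


def log_prob(theta, x, parent_values):
    """log p_theta(x | Pa(x)) = log N(x; mean, exp(log_var))."""
    mean, log_var = neural_net(theta, parent_values)
    return -0.5 * (math.log(2.0 * math.pi) + log_var
                   + (x - mean) ** 2 / math.exp(log_var))


eta = neural_net(theta, [0.5, -1.0])     # distributional parameters for given parents
lp = log_prob(theta, 0.2, [0.5, -1.0])   # log-density of x = 0.2 under those parameters
```

Predicting the log-variance rather than the variance keeps the variance positive without constraining the network output.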

We often collect a dataset $D$ consisting of $N \geq 1$ datapoints:

<center>$ D = \{ x^{(1)}, x^{(2)}, \ldots, x^{(N)} \} = \{ x^{(i)} \}_{i=1}^N = x^{(1:N)} $</center>

The datapoints are assumed to be independent samples from an unchanging underlying distribution. In other words, the dataset is assumed to consist of distinct, independent measurements from the same (unchanging) system. In this case, the observations $D$ are said to be i.i.d. (independent and identically distributed). Under the i.i.d. assumption, the probability of the datapoints given the parameters factorizes as a product of individual datapoint probabilities. The log-probability assigned to the data by the model is therefore a sum of per-datapoint log-probabilities:

<center>$ \log p_{\theta} (D) = \sum_{x \in D} \log p_{\theta} (x) $</center>
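
The sum above can be checked numerically on a small example. The sketch below assumes a Bernoulli model with a single parameter `theta` and a made-up dataset; neither comes from the text, they are only a convenient stand-in for $p_{\theta}(x)$ and $D$.

```python
import math


def log_p(x, theta):
    """log p_theta(x) for a Bernoulli model with P(x = 1) = theta."""
    return math.log(theta) if x == 1 else math.log(1.0 - theta)


def log_likelihood(dataset, theta):
    """log p_theta(D): sum of per-datapoint log-probabilities (i.i.d. assumption)."""
    return sum(log_p(x, theta) for x in dataset)


D = [1, 0, 1, 1, 0, 1, 1, 1]  # made-up binary observations
theta = 0.6                   # illustrative parameter value

# Under the i.i.d. assumption p(D) is a product of per-datapoint probabilities,
# so its logarithm should match the sum of logs up to rounding.
product_of_probs = math.prod(math.exp(log_p(x, theta)) for x in D)
sum_of_logs = log_likelihood(D, theta)
```

Working with the sum of logs rather than the raw product also avoids numerical underflow when $N$ is large.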

The most common training criterion for probabilistic models is ***maximum log-likelihood (ML)***. As we will explain, maximization of the log-likelihood criterion is equivalent to minimization of a ***Kullback-Leibler divergence*** between the data and model distributions. Under the ML criterion, we attempt to find the parameters $\theta$ that maximize the sum, or equivalently the average, of the log-probabilities assigned to the data by the model. With an i.i.d. dataset $D$ of size $N$, the maximum likelihood objective is to maximize the log-probability given by the equation above.

Using the chain rule of calculus and automatic differentiation tools, we can efficiently compute gradients of this objective, i.e. the first derivatives of the objective w.r.t. its parameters $\theta$. We can use such gradients to iteratively hill-climb to a local optimum of the ML objective.
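
A minimal sketch of this hill-climbing loop follows, again assuming a Bernoulli model and a made-up dataset. To stay dependency-free, the gradient is derived by hand with the chain rule rather than computed by an automatic differentiation tool. The logit parameterization `a`, the learning rate, and the step count are illustrative choices.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def avg_log_likelihood(a, dataset):
    """Average log-probability of the data, with P(x = 1) = sigmoid(a)."""
    p = sigmoid(a)
    return sum(math.log(p) if x == 1 else math.log(1.0 - p)
               for x in dataset) / len(dataset)


def grad_avg_log_likelihood(a, dataset):
    """Derivative w.r.t. a; the chain rule simplifies it to mean(x) - sigmoid(a)."""
    p = sigmoid(a)
    return sum(x - p for x in dataset) / len(dataset)


D = [1, 0, 1, 1, 0, 1, 1, 1]  # made-up binary observations
a = 0.0                       # initial logit, i.e. P(x = 1) = 0.5
learning_rate = 0.5

history = []
for step in range(500):
    history.append(avg_log_likelihood(a, D))
    a += learning_rate * grad_avg_log_likelihood(a, D)  # gradient ascent step

theta_hat = sigmoid(a)  # should approach the fraction of ones in D
```

For this one-parameter model the objective is concave in `a`, so the local optimum found is also the global one; that is not guaranteed for neural-network parameterizations.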