A structured probabilistic model (or graphical model) uses a graph to describe the direct interactions between random variables.  

Using a graph to describe a probability distribution allows us to overcome many challenges, but finding which graph structure best suits a problem is not easy.

# The Challenge of Unstructured Modeling

The goal of DL is to understand high-dimensional data with rich structure, such as photos or speech recordings.  
Classification algorithms take a high-dimensional input and give it a label.  

Probabilistic models need to do more complex tasks:
- Density estimation: Estimate the data-generating distribution $p(x)$.
- Denoising: Given a damaged $\tilde{x}$, find the original correct $x$.
- Missing value imputation: Return estimates or a probability distribution over the missing elements of $x$.
- Sampling: Generate new samples from $p(x)$.

Modeling a rich distribution over millions of random variables is a computationally and statistically challenging task.  
If we wish to model a distribution over $n$ discrete variables, each taking $k$ values, the naive approach of representing $P(x)$ by a lookup table would require $k^n$ parameters. This is not feasible because:
- Memory: We can't store the whole representation.
- Statistical efficiency: Because there is a huge number of parameters, a huge number of training examples is needed.
- Inference cost: If we want to compute a marginal distribution from the learned joint distribution, for example, we need to sum over the whole table.
- Sampling cost: We would need to go through the whole table in the worst case.  

The table-based approach models every interaction between every subset of variables, but in real life most variables influence each other only indirectly.  
Graphical models only model direct interactions. The model has fewer parameters, so it needs less data and computations are faster.

# Using Graphs to Describe Model Structure

Each node represents a random variable, and each edge represents a direct interaction.

## Directed Models

A directed graphical model is also called a belief network or Bayesian network.  
Its edges are directed: an edge from $a$ to $b$ means that the distribution over $b$ depends on $a$.  

A directed acyclic graph $G$ defines a set of local conditional probability distributions:
$$p(x_i | P_{a_G}(x_i))$$

with $P_{a_G}(x_i)$ the parents of node $x_i$ in $G$.  
The joint distribution is given by:
$$p(x) = \prod_{x_i} p(x_i | P_{a_G}(x_i))$$
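
The factorization above can be sketched on a tiny example. The chain $a \to b \to c$ and all the probability values below are hypothetical, chosen only for illustration; the point is that multiplying the local conditionals gives a joint distribution that sums to 1.

```python
from itertools import product

# Hypothetical chain a -> b -> c over binary variables (values are made up).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p_c_given_b[b][c]


def joint(a, b, c):
    """p(a, b, c) as the product of each node's conditional given its parents."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


# Summing the joint over all 2**3 assignments should give 1 (up to rounding).
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
print(total)
```

No normalization step is needed here: each local conditional is already normalized, which is why the product is a valid distribution.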

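Let $m$ be the maximum number of parents of any node. Each conditional table then has at most $k^{m+1}$ entries, so the model needs $O(n k^{m+1})$ parameters instead of the $O(k^n)$ of a full table, a large saving when $m \ll n$.  
Other restrictions (such as requiring the graph to be a tree) ensure faster computations.  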
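
To get a feel for the saving, here is a minimal sketch that counts table entries for a full joint table and for a directed model. The helper names and the chain-structured example are assumptions made for illustration; the count is of raw table entries, not of free parameters.

```python
def full_table_size(n, k):
    """Entries in a lookup table over n variables with k values each."""
    return k ** n


def directed_model_size(parents_per_node, k):
    """Total entries across all conditional tables.

    parents_per_node[i] is the number of parents of node i; its table has
    k entries for each of the k**m joint values of its m parents.
    """
    return sum(k ** (m + 1) for m in parents_per_node)


n, k = 20, 2
chain = [0] + [1] * (n - 1)  # hypothetical chain: every node but the first has one parent

print(full_table_size(n, k))         # 1048576
print(directed_model_size(chain, k)) # 78
```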

Directed models are most applicable when we understand the causality and it flows in only one direction.

## Undirected models

Undirected models are also called Markov random fields (MRF) or Markov networks.  
They are used when the interactions between variables have no clear direction, or operate in both directions.  
The values associated with the edges are not conditional probability distributions.  

An undirected graph $G$ contains several cliques $\mathcal{C}$, each with a non-negative factor $\phi(\mathcal{C})$ called a clique potential, which measures the affinity of the variables $x_i \in \mathcal{C}$ for being in each of their possible joint states.  
Together they define an unnormalized probability distribution:
$$\tilde{p}(x) = \prod_{\mathcal{C} \in G} \phi(\mathcal{C})$$

## The Partition Function

We need to normalize to get a valid probability distribution:
$$p(x) = \frac{1}{Z} \tilde{p}(x)$$

with $Z$ the partition function, a sum or integral over all possible joint assignments of $x$:
$$Z = \int \tilde{p}(x)dx$$

Usually $Z$ is intractable, and in DL we often use approximations.  
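
For a very small discrete model, $Z$ can still be computed by brute force, with the integral replaced by a sum. The sketch below assumes a hypothetical pairwise model over three binary variables with cliques $(x_0, x_1)$ and $(x_1, x_2)$ and a made-up potential that favors equal neighbors. The enumeration visits $2^n$ states, which is exactly what becomes intractable for large $n$.

```python
import math
from itertools import product


def phi(xi, xj):
    """Hypothetical clique potential: higher affinity when neighbors agree."""
    return 2.0 if xi == xj else 1.0


cliques = [(0, 1), (1, 2)]


def p_tilde(x):
    """Unnormalized probability: product of the clique potentials."""
    return math.prod(phi(x[i], x[j]) for i, j in cliques)


states = list(product((0, 1), repeat=3))
Z = sum(p_tilde(x) for x in states)  # partition function by enumeration


def p(x):
    """Normalized probability of a full assignment x."""
    return p_tilde(x) / Z


print(Z)             # 18.0
print(p((0, 0, 0)))  # 4 / 18
```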

The domain of each variable has a huge impact on the resulting probability distribution.  
For example, let $x$ be an $n$-dimensional random vector, with the model parametrized by a vector of biases $b$.  
We have one clique per element, with $\phi^{(i)}(x_i) = \exp (b_i x_i)$.  
If $x \in \mathbb{R}^n$, $Z$ diverges and there is no probability distribution.  
If $x \in \{0, 1\}^{n}$, $p(x)$ factorizes into $n$ independent distributions with $p(x_i = 1) = \text{sigmoid}(b_i)$.  
If $x \in \{ [1, 0, \text{...}, 0], [0, 1, 0, \text{...}, 0], \text{...}, [0, 0, \text{...}, 1] \}$, then $p(x) = \text{softmax}(b)$.
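
The binary and one-hot cases can be checked numerically. This is a small sketch with hypothetical bias values: it normalizes $\tilde{p}(x) = \prod_i \exp(b_i x_i)$ by enumeration over each domain and compares the result with the sigmoid and softmax formulas.

```python
import math
from itertools import product

b = [0.5, -1.0, 2.0]  # hypothetical biases


def p_tilde(x):
    """Product of the per-element potentials exp(b_i * x_i)."""
    return math.exp(sum(bi * xi for bi, xi in zip(b, x)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Case x in {0, 1}^n: enumerate all binary vectors.
binary_states = list(product((0, 1), repeat=len(b)))
Z_binary = sum(p_tilde(x) for x in binary_states)


def marginal_one(i):
    """p(x_i = 1) under the binary-domain model."""
    return sum(p_tilde(x) for x in binary_states if x[i] == 1) / Z_binary


# Case x one-hot: only n states, one per position of the single 1.
one_hot_states = [tuple(int(i == j) for j in range(len(b))) for i in range(len(b))]
Z_one_hot = sum(p_tilde(x) for x in one_hot_states)
p_one_hot = [p_tilde(x) / Z_one_hot for x in one_hot_states]
softmax_b = [math.exp(bi) / sum(math.exp(bj) for bj in b) for bi in b]

for i in range(len(b)):
    print(marginal_one(i), sigmoid(b[i]), p_one_hot[i], softmax_b[i])
```

The same potentials give two different families of distributions; only the set of allowed states changed.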

## Energy-Based Models

An energy-based model is a special kind of undirected model:

$$\tilde{p}(x) = \exp(-E(x))$$

with $E(x)$ the energy function.  
We can learn $E(x)$ freely, and we are always sure to have $\tilde{p}(x) > 0$.

Many energy-based models are called Boltzmann machines.  

Because $\exp(a)\exp(b) = \exp(a + b)$, different cliques in the graph are just different terms of the energy function.  
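
As a short worked step, suppose the energy is a sum of per-clique terms $E_{\mathcal{C}}$ (a notation introduced here only for illustration). Then:

$$\tilde{p}(x) = \exp\Big(-\sum_{\mathcal{C} \in G} E_{\mathcal{C}}(x_{\mathcal{C}})\Big) = \prod_{\mathcal{C} \in G} \exp\big(-E_{\mathcal{C}}(x_{\mathcal{C}})\big)$$

so each clique potential corresponds to $\phi(\mathcal{C}) = \exp(-E_{\mathcal{C}}(x_{\mathcal{C}}))$.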

Energy-based models can be seen as a product of experts. Each term in $E$ is an expert that determines whether a particular soft constraint is satisfied. Multiple experts together enforce higher-dimensional constraints.

## Separation and D-Separation

We can use the graph to know which subsets of variables are independent from each other given the values of other subsets of variables. In undirected models this is called separation.  
A set of variables $A$ is separated from another set $B$ given a third set of variables $S$ if the graph structure implies that $A$ is independent from $B$ given $S$.  

Two variables $a$ and $b$ are separated, and therefore independent, given an observed set $S$ if every path between $a$ and $b$ passes through an observed variable.  
Paths going only through unobserved variables are said to be active; those going through an observed variable are said to be inactive.  
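
The separation test for undirected graphs can be sketched as a graph search that refuses to walk through observed nodes. The function name, the adjacency-dict representation and the example graph are assumptions for illustration; the sketch also assumes that neither endpoint is itself observed.

```python
from collections import deque


def is_separated(adjacency, a, b, observed):
    """True if every path from a to b passes through an observed node.

    adjacency maps each node to a list of its neighbors (undirected graph).
    Assumes a and b are not in the observed set.
    """
    blocked = set(observed)
    visited = {a}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor == b:
                return False  # found an active path
            if neighbor not in visited and neighbor not in blocked:
                visited.add(neighbor)
                queue.append(neighbor)
    return True


# Hypothetical graph with two paths from a to b: a - s - b and a - c - b.
adjacency = {
    "a": ["s", "c"],
    "s": ["a", "b"],
    "c": ["a", "b"],
    "b": ["s", "c"],
}

print(is_separated(adjacency, "a", "b", {"s"}))       # False: a - c - b is still active
print(is_separated(adjacency, "a", "b", {"s", "c"}))  # True: both paths are blocked
```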

Directed graphs have the same concept, called d-separation.  

Separation only tells us about the conditional independences implied by the graph; there is no guarantee that the graph represents all the independences of the distribution.  

Context-specific independences are independences that hold only when a variable takes a specific value; they can't be represented by the graph structure alone.

## Converting between Undirected and Directed Graphs

No probabilistic model is inherently directed or undirected; some are just more easily described as one than as the other.  
Both have pros and cons, and we should choose depending on the task.  

We can choose the approach that captures the most independences (uses the fewest edges).  
We may also switch between the two while using the same model. It's often straightforward to sample from directed models, while undirected models are often useful for deriving approximate inference procedures.  

Any distribution can be represented by either a directed or an undirected model; in the worst case, we can always use a complete graph.  

In a directed graph, when nodes $a$ and $b$ are both parents of $c$ and there is no edge between $a$ and $b$, the structure is called an immorality. Undirected models can't represent it perfectly.  

To convert a directed graph $D$ into an undirected graph $U$, we add an edge between every pair $(x,y)$ such that $D$ contains an edge $x \to y$ or $y \to x$, or such that $x$ and $y$ are both parents of some node $z$.  
We call $U$ a moralized graph.  
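
The moralization rule can be sketched in a few lines. The function name and the parent-list representation of the DAG are assumptions made for this example; undirected edges are stored as `frozenset` pairs so that direction is ignored.

```python
from itertools import combinations


def moralize(parents):
    """Return the undirected edges of the moralized graph.

    parents maps each node to the list of its parents in the DAG.
    """
    edges = set()
    for child, node_parents in parents.items():
        # Keep every original edge, dropping its direction.
        for parent in node_parents:
            edges.add(frozenset((parent, child)))
        # "Marry" every pair of parents that share this child.
        for p, q in combinations(node_parents, 2):
            edges.add(frozenset((p, q)))
    return edges


# Hypothetical immorality: a -> c <- b, with no edge between a and b.
dag = {"a": [], "b": [], "c": ["a", "b"]}
moral_edges = moralize(dag)
print(sorted(sorted(edge) for edge in moral_edges))
# [['a', 'b'], ['a', 'c'], ['b', 'c']]
```

The added edge between `a` and `b` is what loses the marginal independence of the two parents that the directed graph encoded.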

Likewise, some structures of undirected models can't be represented by directed ones.  
If $U$ has a loop of length greater than 3 with no chord, we need to add chords before converting it.  
A chord is an edge between two non-consecutive variables in the loop.  
The graph with added chords is called a chordal or triangulated graph (every loop can be described in terms of triangular loops).  
To convert $U$ to $D$, we must then assign a direction to every edge, in a way that doesn't create any directed cycle.

## Factor Graphs

In an undirected graph, the scope of every $\phi$ function is a subset of some clique in the graph. But a clique may have one factor over all of it, or several factors over different parts of it.  

Factor graphs explicitly represent the scope of each $\phi$ function. Besides the variable nodes (circles), the graph contains factor nodes (squares). There is an edge between a variable and a factor if the variable is an argument of the factor function. Edges between variables are no longer drawn.  
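
The ambiguity that factor graphs remove can be sketched with a three-variable clique. The factor names and helper functions below are hypothetical: the same undirected clique over $a$, $b$, $c$ is compatible with either one factor over all three variables or three pairwise factors, and the factor graph makes the choice explicit.

```python
def factor_graph_edges(factors):
    """Return the (factor, variable) edges of the bipartite factor graph."""
    return {(name, var) for name, scope in factors.items() for var in scope}


def table_entries(factors, k):
    """Total table entries if every variable takes k values."""
    return sum(k ** len(scope) for scope in factors.values())


# Two factorizations compatible with the same clique {a, b, c}.
single_factor = {"f_abc": ("a", "b", "c")}
pairwise_factors = {"f_ab": ("a", "b"), "f_bc": ("b", "c"), "f_ac": ("a", "c")}

print(len(factor_graph_edges(single_factor)), table_entries(single_factor, 10))       # 3 1000
print(len(factor_graph_edges(pairwise_factors)), table_entries(pairwise_factors, 10)) # 6 300
```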

Representation, inference and learning can be cheaper in factor graphs.

# Sampling from Graphical Models

Sampling from directed models is easy with ancestral sampling.  
The variables are sorted in topological order, then sampled in that order.  
It's easy to sample from $p(x)$ as long as it's easy to sample from each conditional distribution $p(x_i | P_{a_G}(x_i))$.
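
A minimal sketch of ancestral sampling, assuming a hypothetical binary chain $a \to b \to c$ with made-up conditional tables. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+) to obtain a topological order, then samples each node given the values already drawn for its parents.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical DAG: node -> list of parents.
parents = {"a": [], "b": ["a"], "c": ["b"]}

# cpt[node][tuple of parent values] = P(node = 1 | parents); values are made up.
cpt = {
    "a": {(): 0.4},
    "b": {(0,): 0.3, (1,): 0.8},
    "c": {(0,): 0.1, (1,): 0.5},
}

# TopologicalSorter expects node -> predecessors, which is what `parents` holds.
order = list(TopologicalSorter(parents).static_order())


def ancestral_sample(rng):
    """Draw one joint sample by visiting nodes in topological order."""
    sample = {}
    for node in order:
        key = tuple(sample[p] for p in parents[node])
        sample[node] = 1 if rng.random() < cpt[node][key] else 0
    return sample


rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(20000)]
freq_a = sum(s["a"] for s in samples) / len(samples)
freq_c = sum(s["c"] for s in samples) / len(samples)
print(freq_a, freq_c)  # should be close to 0.4 and 0.3 for these tables
```

Each sample takes a single pass over the graph, which is what makes directed models cheap to sample from.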

To sample from an undirected model, we could convert it to a directed one, but this would require solving intractable inference problems.  

In an undirected model, variables interact with each other in both directions, so there is no clear starting point for the sampling process.  
Sampling from an undirected model is an expensive multipass process.  

One approach is Gibbs sampling.  
We sample each variable $x_i$ in turn, conditioned on all the other variables (in practice just the neighbors of $x_i$, thanks to the graph structure).  
After one pass over all the $x_i$, we still do not have a fair sample from $p(x)$.  
By repeating the process and resampling with the updated values, the distribution of the sample converges to $p(x)$.  
But it's difficult to know when the sample has reached a sufficiently accurate approximation.
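
Here is a small Gibbs sampling sketch under stated assumptions: a hypothetical three-variable binary model with $\tilde{p}(x) = \exp(\sum_i b_i x_i + \sum_{(i,j)} w_{ij} x_i x_j)$, for which each conditional is $p(x_i = 1 \mid \text{neighbors}) = \text{sigmoid}(b_i + \sum_j w_{ij} x_j)$. The burn-in length and sample count are arbitrary choices, and the model is small enough that the exact marginal can be computed by enumeration for comparison.

```python
import math
import random
from itertools import product

# Hypothetical chain x0 - x1 - x2 with made-up biases and couplings.
b = [0.2, -0.5, 0.3]
w = {(0, 1): 1.0, (1, 2): -0.8}

neighbors = {i: [] for i in range(len(b))}
for (i, j), wij in w.items():
    neighbors[i].append((j, wij))
    neighbors[j].append((i, wij))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gibbs_sweep(x, rng):
    """Resample every variable once, conditioned on its current neighbors."""
    for i in range(len(x)):
        z = b[i] + sum(wij * x[j] for j, wij in neighbors[i])
        x[i] = 1 if rng.random() < sigmoid(z) else 0


def p_tilde(x):
    pair_term = sum(wij * x[i] * x[j] for (i, j), wij in w.items())
    return math.exp(sum(bi * xi for bi, xi in zip(b, x)) + pair_term)


rng = random.Random(0)
x = [0, 0, 0]
burn_in, n_samples = 1000, 20000  # arbitrary choices for this sketch
count = 0
for step in range(burn_in + n_samples):
    gibbs_sweep(x, rng)
    if step >= burn_in:
        count += x[0]
gibbs_marginal = count / n_samples

# Exact marginal p(x0 = 1) by enumeration, feasible only because n is tiny.
states = list(product((0, 1), repeat=len(b)))
Z = sum(p_tilde(s) for s in states)
exact_marginal = sum(p_tilde(s) for s in states if s[0] == 1) / Z
print(gibbs_marginal, exact_marginal)
```

In a realistic model the exact marginal is not available, which is why judging convergence is hard.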

# Advantages of Structured Modeling

The costs of representation, learning, inference and sampling are reduced. This is possible because we decide not to model certain interactions.  

Structured models also allow us to separate the representation of knowledge from learning and inference.  
We can develop algorithms for broad classes of graphs, and we can independently design models that best represent the data.

# Learning about Dependencies

A generative model needs to learn the distribution over the observable variables $v$, which are often highly dependent. The DL solution to capture these dependencies is to introduce hidden (latent) variables $h$. The model can capture dependencies between $v_i$ and $v_j$ indirectly, through direct dependencies between $(v_i, h)$ and $(v_j, h)$.  
This results in a much smaller graph than directly modelling all interactions $(v_i, v_j)$ without $h$.  

Structure learning tries to find a graph structure that only connects tightly coupled variables.  
It's usually a greedy search procedure, which starts with an initial structure, evaluates it, and moves to another structure with a few edges added or removed that should increase the score.  

Using latent variables instead of structure learning avoids doing many rounds of training. With parameter learning alone, we can train a model with a fixed structure to find the right marginal $p(v)$.  

Another advantage is that $h$ provides an alternative representation for $v$. For example, $h$ can be used for classification.  
Many approaches to feature learning use latent variables. $\mathbb{E}[h|v]$ is often a good feature mapping for $v$.

# Inference and Approximate Inference

We often train the model with maximum likelihood. The log-likelihood can be written as:
$$\log p(v) = \mathbb{E}_{h \sim p(h|v)}[\log p(h,v) - \log p(h|v)]$$
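
One way to see why this identity holds, written as a short derivation: by the product rule, $p(h, v) = p(h|v)\,p(v)$, so for any $h$ with $p(h|v) > 0$:

$$\log p(v) = \log p(h,v) - \log p(h|v)$$

The left-hand side does not depend on $h$, so taking the expectation over $h \sim p(h|v)$ leaves it unchanged and gives the expression above.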

We need to compute $p(h|v)$.  

Inference problems try to predict the value of, or the probability distribution over, some variables given the values of other variables.  
For DL, most inference problems are intractable.  
Instead we can use variational inference: approximate $p(h|v)$ with a tractable distribution $q(h|v)$ that is as close to the true one as possible.
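
A standard way to make this closeness precise, given here as a sketch rather than as the notes' own derivation, is the KL divergence $D_{KL}$. For any distribution $q(h|v)$:

$$\log p(v) = \mathbb{E}_{h \sim q(h|v)}[\log p(h,v) - \log q(h|v)] + D_{KL}\big(q(h|v) \,\|\, p(h|v)\big)$$

Since the KL term is non-negative, the expectation is a lower bound on $\log p(v)$, and the bound is tight when $q(h|v) = p(h|v)$.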

# The DL Approach to Structured Probabilistic Models

Most DL models have no latent variables or only one layer of them, but use deep computational graphs to define the conditional distributions inside the model.  
DL models use distributed representations, so they often have a single large layer of latent variables.  

With distributed representations, every $v_i$ is connected to many $h_j$, yielding a graph that is not sparse enough for traditional algorithms.  
Most DL algorithms are designed to make Gibbs sampling or variational inference efficient.

## Example: Restricted Boltzmann Machines (RBM)

