# Probabilistic Graphical Model
A probabilistic graphical model is a machine learning model that combines probability theory with graphs. Examples include the Bayesian network, the MRF (Markov random field), and the DBN (deep belief network).  
The simplest way to assign probabilities to a graph is to specify the joint probability $\Pr(\mathbf{x}) = \Pr(x_1, x_2, \cdots, x_n)$ directly. But if you have $n$ random variables and each takes $q$ values, you must specify a total of $q^n-1$ probability values. The core idea of a probabilistic graphical model is to connect with edges only the variables that interact directly. Constructing a graph with such partial connectivity is called graph decomposition. Estimating probability distributions over small subsets of variables, instead of the full joint, is justified by the Markov property.  


## Bayesian network
Bayesian networks are widely used in applications where the causality between random variables is apparent. The characteristics of the Bayesian network are as follows.
1. Whereas an MRF or a DBN performs probability operations indirectly through an energy function, the Bayesian network is a more rigorous probability model because it directly uses probabilities estimated from the data.
2. The causality between random variables is expressed as conditional probability, so incomplete data can be processed. In other words, when only a few of the random variables are observed, the probability of a variable of interest among the remaining ones can still be estimated.
3. You can mix data and domain knowledge.  


<img src="./img/7_causality_graph.png" width="25%" height="25%">  
The above graph shows the causality between random variables. It is called a causality graph or DAG (directed acyclic graph) because the edges have a direction and cycles are not allowed. In Bayesian networks, root nodes have prior probabilities, and non-root nodes have conditional probabilities. Specifying the joint probability of all random variables directly is almost impossible, so the graph is decomposed to reduce the number of probabilities that must be specified.  


$$
\begin{alignat}{2}
\Pr(\mathbf{x}) & = \Pr(x_1, x_2, \cdots, x_n) \\
& = \Pr(x_1)\Pr(x_2 \vert x_1)\Pr(x_3 \vert x_2, x_1)\Pr(x_4 \vert x_3, x_2, x_1) \cdots \Pr(x_n \vert x_{n-1}, x_{n-2}, \cdots, x_1)
\end{alignat}
$$
This equation decomposes the joint probability into conditional probabilities (the chain rule). Substituting the variables of the graph above into this equation gives the following equation.  


$$
\begin{array}{lcl}
\Pr(smoking, bronchitis, lung\_cancer, fatigue, x-ray) \\
= \Pr(smoking)\Pr(bronchitis \vert smoking)\Pr(lung\_cancer \vert bronchitis, smoking)\Pr(fatigue \vert lung\_cancer, bronchitis, smoking)\Pr(x-ray \vert fatigue, lung\_cancer, bronchitis, smoking)
\end{array}
$$  
A Bayesian network decomposes the joint probability further by applying the Markov condition. The Markov condition states that, given the values of its parent nodes, a node $x$ is conditionally independent of all its non-descendants, that is, of every node that is not one of its descendants. Applying the Markov condition to the above equation gives:  


$$
\begin{array}{lcl}
\Pr(smoking, bronchitis, lung\_cancer, fatigue, x-ray) \\
= \Pr(smoking)\Pr(bronchitis \vert smoking)\Pr(lung\_cancer \vert smoking)\Pr(fatigue \vert lung\_cancer, bronchitis)\Pr(x-ray \vert lung\_cancer)
\end{array}
$$  


Note that the problem of creating a Bayesian network by assigning probabilities to a causal graph, that is, 'probability learning', is local to parents and their children. Prior probabilities are assigned to root nodes, which have no parents, and conditional probabilities are assigned only between parents and children. The joint probability can still be recovered when needed:  
$$
\Pr(\mathbf{x}) = \prod_{i=1}^n \Pr\left(x_i \vert parent(x_i)\right)
$$
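
As a minimal sketch of this factorization, the following Python snippet encodes the smoking example with one table per node and multiplies the local probabilities. All probability values are hypothetical and chosen only for illustration (only $\Pr(fatigue \vert bronchitis, not\_lung\_cancer) = 0.1$ comes from the text); the names `PARENTS`, `CPT`, and `joint` are assumptions of this sketch as well.

```python
from itertools import product

# Parents of each node, following the factorization in the text.
PARENTS = {
    "smoking": [],
    "bronchitis": ["smoking"],
    "lung_cancer": ["smoking"],
    "fatigue": ["bronchitis", "lung_cancer"],
    "xray": ["lung_cancer"],
}

# Hypothetical tables: Pr(node = True | parent values), keyed by the parent values.
CPT = {
    "smoking": {(): 0.2},
    "bronchitis": {(True,): 0.25, (False,): 0.05},
    "lung_cancer": {(True,): 0.03, (False,): 0.005},
    "fatigue": {(True, True): 0.75, (True, False): 0.1,
                (False, True): 0.5, (False, False): 0.05},
    "xray": {(True,): 0.6, (False,): 0.02},
}


def joint(assignment):
    """Pr(x) as the product of Pr(x_i | parent(x_i)) over all nodes."""
    p = 1.0
    for node, parents in PARENTS.items():
        key = tuple(assignment[q] for q in parents)
        p_true = CPT[node][key]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p


names = list(PARENTS)
# Summing the factorized joint over every assignment should give 1.
total = sum(joint(dict(zip(names, values)))
            for values in product([True, False], repeat=len(names)))

n_factored = sum(len(table) for table in CPT.values())  # numbers we had to specify
n_full = 2 ** len(names) - 1                             # q^n - 1 with q = 2, n = 5

print(total, n_factored, n_full)
```

For five binary variables the factorized form needs 11 numbers instead of the 31 required by the full joint table, and the gap widens quickly as $n$ grows.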


### d-separated
Graph decomposition dramatically reduces the number of random variables involved in probability computation by allowing conditional probabilities to be assigned only between parents and children. However, probabilistic inference requires estimating the probability of one random variable when the value of another is given. In this case, finding and removing unrelated random variables can greatly reduce the amount of computation. d-separation is the tool for finding such variables, and conditional independence plays the key role in it.  

<img src="./img/7_conditional_independent.png" width="50%" height="50%">  
A Bayesian network is built from three connection patterns: linear, branched, and confluence. According to how the edges attach to the middle node $b$, the linear pattern is called head-tail, the branched pattern tail-tail, and the confluence pattern head-head.  
These connection patterns make up a chain, which is a path of connected edges in which edge directions are ignored, so an edge may be traversed against its direction. Directions are ignored because information flows both ways when probabilities are computed: an ancestor node affects a descendant node, and vice versa.  
- In the linear structure, if the value of the random variable at node $b$ is known, nodes $a$ and $c$ are conditionally independent by the Markov condition. In probability notation, $\Pr(c \vert b, a) = \Pr(c \vert b)$, and in independence notation, $\mathbf{I}(c, a \vert b)$.
- In the branched structure, node $b$ acts as a common cause of nodes $a$ and $c$. If node $b$ is unknown, nodes $a$ and $c$ are dependent. But knowing node $b$ changes the situation: $a$ and $c$ become conditionally independent. In probability notation, $\Pr(c \vert b, a) = \Pr(c \vert b)$, and in independence notation, $\mathbf{I}(a, c \vert b)$. This is expressed as "the chain $a \leftarrow b \rightarrow c$ is closed when $b$ is known."
- Confluence works opposite to branching. If node $b$ is unknown, nodes $a$ and $c$ are independent. Once $b$ is known, nodes $a$ and $c$ are no longer conditionally independent. In other words, the chain $a \rightarrow b \leftarrow c$ is not closed when $b$ is known. For example, when $a$ and $c$ are both possible causes of $b$ and $b$ is known to be true, finding that $a$ is true lowers the probability of $c$, because $a$ already explains $b$. This situation is called explaining away. It also works in the opposite direction: if $a$ is found not to be true, the probability of $c$ increases. This is expressed as "the chain $a \rightarrow b \leftarrow c$ is open when $b$ is known." In conditional independence notation, not $\mathbf{I}(a, c \vert b)$.  
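
The confluence case can be checked numerically. The sketch below builds a small $a \rightarrow b \leftarrow c$ network and computes $\Pr(c)$ under different evidence by brute-force summation. The probability values and the names `P_A`, `P_C`, `P_B`, and `prob_c` are hypothetical, chosen so that either cause alone makes $b$ likely.

```python
from itertools import product

# Hypothetical collider a -> b <- c with binary variables.
P_A = 0.1  # Pr(a = True), assumed
P_C = 0.1  # Pr(c = True), assumed
# Pr(b = True | a, c), assumed
P_B = {(True, True): 0.95, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.01}


def joint(a, b, c):
    """Pr(a, b, c) = Pr(a) Pr(c) Pr(b | a, c)."""
    pa = P_A if a else 1 - P_A
    pc = P_C if c else 1 - P_C
    pb = P_B[(a, c)] if b else 1 - P_B[(a, c)]
    return pa * pc * pb


def prob_c(**evidence):
    """Pr(c = True | evidence), summing the joint over consistent worlds."""
    num = den = 0.0
    for a, b, c in product([True, False], repeat=3):
        world = {"a": a, "b": b, "c": c}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, c)
        den += p
        if c:
            num += p
    return num / den


print(prob_c(), prob_c(a=True))          # b unknown: a tells us nothing about c
print(prob_c(b=True))                    # b observed
print(prob_c(b=True, a=True))            # a explains b away, so c becomes less likely
print(prob_c(b=True, a=False))           # a ruled out, so c becomes more likely
```

With $b$ unobserved the first two values coincide, which is the independence of $a$ and $c$. With $b$ observed, learning $a$ moves the probability of $c$ in the direction described above.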


**chain closure**  
Given a chain $a \rightsquigarrow c$ connecting $a$ and $c$ and a node set $\mathcal{W}$, the chain is said to be closed by $\mathcal{W}$ if any of the following conditions is true:
1. (linear) A node belonging to $\mathcal{W}$ appears as head-tail in the chain.
2. (branched) A node belonging to $\mathcal{W}$ appears as tail-tail in the chain.
3. (confluence) There is a head-head node in the chain, and neither this node nor any of its descendants belongs to $\mathcal{W}$.  


The above definition determines whether a single chain connecting two nodes is closed. However, since there may be multiple chains between the two nodes, you have to check every chain individually to find out whether the two nodes are completely closed off from each other and therefore conditionally independent. d-separation is used for this verification.  
If all the chains between the two nodes $a$ and $c$ are closed by the node set $\mathcal{W}$, then the two nodes are said to be d-separated and denoted $d - sep(a, c \vert \mathcal{W})$.  
This definition can be extended to a set of nodes. If all the chains between the two node sets $\mathcal{A}$ and $\mathcal{C}$ are closed by the node set $\mathcal{W}$, then the two node sets are said to be d-separated and denoted $d - sep(\mathcal{A}, \mathcal{C} \vert \mathcal{W})$.  
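
The definition can be turned into code almost literally. The sketch below enumerates every chain between two nodes of the smoking graph and applies the three chain-closure rules to each interior node. It is a brute-force illustration under the assumption that the graph is small (path enumeration is exponential in general), and the function names are hypothetical.

```python
# Edges (parent, child) of the causality graph used in the text.
EDGES = [("smoking", "bronchitis"), ("smoking", "lung_cancer"),
         ("bronchitis", "fatigue"), ("lung_cancer", "fatigue"),
         ("lung_cancer", "xray")]


def children(node):
    return {c for p, c in EDGES if p == node}


def parents(node):
    return {p for p, c in EDGES if c == node}


def descendants(node):
    found, stack = set(), [node]
    while stack:
        for child in children(stack.pop()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def chains(a, c, path=None):
    """Yield every simple path from a to c, ignoring edge direction."""
    path = path or [a]
    if path[-1] == c:
        yield path
        return
    for nxt in children(path[-1]) | parents(path[-1]):
        if nxt not in path:
            yield from chains(a, c, path + [nxt])


def chain_closed(path, w):
    """True if some interior node closes the chain according to the three rules."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        head_head = prev in parents(mid) and nxt in parents(mid)
        if head_head:
            # Rule 3: closed if neither the node nor any descendant is in W.
            if mid not in w and not (descendants(mid) & w):
                return True
        elif mid in w:
            # Rules 1 and 2: a head-tail or tail-tail node that belongs to W.
            return True
    return False


def d_sep(a, c, w):
    """True if every chain between a and c is closed by the node set w."""
    w = set(w)
    return all(chain_closed(path, w) for path in chains(a, c))


print(d_sep("smoking", "fatigue", {"bronchitis", "lung_cancer"}))
print(d_sep("bronchitis", "lung_cancer", {"smoking"}))
print(d_sep("bronchitis", "lung_cancer", {"smoking", "fatigue"}))
```

The last call shows the confluence rule at work: adding the head-head node `fatigue` to $\mathcal{W}$ opens the chain through it, so the two nodes are no longer d-separated.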


### Probabilistic inference
It is very easy to find the probability of a random variable when the values of its parents are specified. For example, when a patient is found to have bronchitis but not lung cancer, the probability that the patient feels fatigue is $\Pr(fatigue \vert bronchitis, not\_lung\_cancer) = 0.1$.  
If the patient is also a non-smoker, does the probability that the patient feels fatigue change? In other words, does $\Pr(fatigue \vert bronchitis, not\_lung\_cancer, non\_smoking)$ differ from 0.1? The answer is no: the value is still 0.1, because $d-sep(smoking, fatigue \vert \{bronchitis, lung\_cancer\})$ holds. Probabilistic inference is used to find probabilities other than those specified in the network. In practice, we are usually more interested in backward probabilistic inference, such as $\Pr(lung\_cancer \vert positive)$ or $\Pr(lung\_cancer \vert positive, not\_fatigue)$, where $positive$ denotes a positive x-ray result.  
In the Bayesian network, the nodes below can be observed, and the nodes above are the cause of these observations. Nodes close to observations are called information random variables, and nodes close to causes are called hypothesis random variables. Usually, when we know the value of the information random variable through observation, we are interested in finding the value of the hypothesis random variable.  
The independence between random variables can be found through d-separation, using the definitions above. For example, to estimate $\Pr(fatigue, positive \vert not\_lung\_cancer)$, note that $\mathbf{I}(fatigue, x-ray \vert lung\_cancer)$ holds because $d-sep(fatigue, x-ray \vert lung\_cancer)$. Thus, $\Pr(fatigue, positive \vert not\_lung\_cancer) = \Pr(fatigue \vert not\_lung\_cancer) \cdot \Pr(positive \vert not\_lung\_cancer)$.  
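
These statements can be checked by brute-force enumeration over the joint distribution. The sketch below is not an efficient inference algorithm; it only sums the factorized joint over all consistent assignments. The tables are hypothetical illustration values (only $\Pr(fatigue \vert bronchitis, not\_lung\_cancer) = 0.1$ is taken from the text), and `prob` is an assumed helper name.

```python
from itertools import product

PARENTS = {
    "smoking": [],
    "bronchitis": ["smoking"],
    "lung_cancer": ["smoking"],
    "fatigue": ["bronchitis", "lung_cancer"],
    "xray": ["lung_cancer"],
}

# Hypothetical tables: Pr(node = True | parent values).
CPT = {
    "smoking": {(): 0.2},
    "bronchitis": {(True,): 0.25, (False,): 0.05},
    "lung_cancer": {(True,): 0.03, (False,): 0.005},
    "fatigue": {(True, True): 0.75, (True, False): 0.1,
                (False, True): 0.5, (False, False): 0.05},
    "xray": {(True,): 0.6, (False,): 0.02},
}


def joint(world):
    p = 1.0
    for node, parents in PARENTS.items():
        p_true = CPT[node][tuple(world[q] for q in parents)]
        p *= p_true if world[node] else 1.0 - p_true
    return p


def prob(targets, evidence):
    """Pr(targets | evidence); both arguments are {name: bool} dicts."""
    names = list(PARENTS)
    num = den = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(world)
        den += p
        if all(world[k] == v for k, v in targets.items()):
            num += p
    return num / den


# Forward inference: adding smoking to the evidence does not change the answer.
print(prob({"fatigue": True}, {"bronchitis": True, "lung_cancer": False}))
print(prob({"fatigue": True},
           {"bronchitis": True, "lung_cancer": False, "smoking": False}))

# Backward inference: a positive x-ray raises the probability of lung cancer.
print(prob({"lung_cancer": True}, {}), prob({"lung_cancer": True}, {"xray": True}))

# Factorization implied by d-sep(fatigue, x-ray | lung_cancer).
no_lc = {"lung_cancer": False}
lhs = prob({"fatigue": True, "xray": True}, no_lc)
rhs = prob({"fatigue": True}, no_lc) * prob({"xray": True}, no_lc)
print(lhs, rhs)
```

Enumeration touches all $2^5$ assignments, which is fine here but is exactly the cost that message passing avoids.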


The probabilistic inference algorithm operates by propagating information to neighboring random variables, starting from the random variables with known values. Suppose you have a linear structure with nodes $x$, $y$, $z$, and $w$ in order. When the random variable $x$ is known to have a certain value, this information is passed to its neighbor $y$, then from $y$ to $z$, and from $z$ to $w$. This method is called message passing.
<img src="./img/7_linear_structure.png" width="4%" height="4%">  
Each of the four random variables in this network is a binary variable. For example, random variable $x \in \{\mathcal{x}1, \mathcal{x}2\}$. Assume that the random variable $x$ has the value $\mathcal{x}1$, and compute $\Pr(\mathcal{w}1 \vert \mathcal{x}1)$ by forward probabilistic inference.  
First, the information of $x$ is transferred to $y$, where $\Pr(\mathcal{y}1 \vert \mathcal{x}1)$ is obtained using the conditional probability of the Bayesian network. Second, the information of $y$ is transferred to $z$.  
$$
\begin{align}
\Pr(\mathcal{z}1 \vert \mathcal{x}1)
& = \Pr(\mathcal{z}1 \vert \mathcal{y}1, \mathcal{x}1)\Pr(\mathcal{y}1 \vert \mathcal{x}1) + \Pr(\mathcal{z}1 \vert \mathcal{y}2, \mathcal{x}1)\Pr(\mathcal{y}2 \vert \mathcal{x}1) \\
& = \Pr(\mathcal{z}1 \vert \mathcal{y}1)\Pr(\mathcal{y}1 \vert \mathcal{x}1) + \Pr(\mathcal{z}1 \vert \mathcal{y}2)\Pr(\mathcal{y}2 \vert \mathcal{x}1)
\end{align}
$$  
Finally, the information of $z$ is transferred to $w$.  
$$
\begin{align}
\Pr(\mathcal{w}1 \vert \mathcal{x}1)
& = \Pr(\mathcal{w}1 \vert \mathcal{z}1, \mathcal{x}1)\Pr(\mathcal{z}1 \vert \mathcal{x}1) + \Pr(\mathcal{w}1 \vert \mathcal{z}2, \mathcal{x}1)\Pr(\mathcal{z}2 \vert \mathcal{x}1) \\
& = \Pr(\mathcal{w}1 \vert \mathcal{z}1)\Pr(\mathcal{z}1 \vert \mathcal{x}1) + \Pr(\mathcal{w}1 \vert \mathcal{z}2)\Pr(\mathcal{z}2 \vert \mathcal{x}1)
\end{align}
$$  
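
The three steps above can be sketched in a few lines of Python. The conditional probability tables are hypothetical (the text does not give values for this network), and index 0 stands for the value with suffix 1 and index 1 for the value with suffix 2.

```python
# Hypothetical tables for the chain x -> y -> z -> w.
# P_Y_GIVEN_X[i][j] = Pr(y = j | x = i), and likewise for the others.
P_Y_GIVEN_X = [[0.9, 0.1], [0.3, 0.7]]
P_Z_GIVEN_Y = [[0.8, 0.2], [0.4, 0.6]]
P_W_GIVEN_Z = [[0.7, 0.3], [0.2, 0.8]]


def pass_message(belief, cpt):
    """Turn Pr(parent | x) into Pr(child | x) by summing over the parent's values."""
    n_child = len(cpt[0])
    return [sum(belief[i] * cpt[i][j] for i in range(len(belief)))
            for j in range(n_child)]


def forward(x_index):
    """Propagate the observation x = x_index along the chain, one node at a time."""
    belief_y = P_Y_GIVEN_X[x_index]                    # Pr(y | x), read from the table
    belief_z = pass_message(belief_y, P_Z_GIVEN_Y)     # Pr(z | x)
    belief_w = pass_message(belief_z, P_W_GIVEN_Z)     # Pr(w | x)
    return belief_y, belief_z, belief_w


belief_y, belief_z, belief_w = forward(0)
print(belief_z[0], belief_w[0])
```

With these tables, $\Pr(\mathcal{z}1 \vert \mathcal{x}1) = 0.9 \cdot 0.8 + 0.1 \cdot 0.4 = 0.76$ and $\Pr(\mathcal{w}1 \vert \mathcal{x}1) = 0.76 \cdot 0.7 + 0.24 \cdot 0.2 = 0.58$. Each step only needs the previous message and one table, which is why the cost grows linearly with the length of the chain.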

In a Bayesian network with a linear structure, probabilistic inference takes time linear in the number of nodes. The above network is a singly-connected graph. In a multiply-connected graph, however, the problem is more complicated because messages must be computed over all chains connecting the two nodes. Exact probabilistic inference in a general Bayesian network has been proved to be NP-hard. So, in practice, we often give up the exact solution and settle for an approximate one.  
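
The text does not say which approximation is meant. As one possible illustration, the sketch below estimates $\Pr(\mathcal{w}1 \vert \mathcal{x}1)$ on the four-node chain by forward sampling: fix $x$, draw each node from its conditional table in order, and count how often $w$ takes its first value. The tables and function names are hypothetical; for these tables the exact answer is 0.58, so the estimate can be compared against it.

```python
import random

# Hypothetical tables for the chain x -> y -> z -> w (index 0 = first value).
P_Y_GIVEN_X = [[0.9, 0.1], [0.3, 0.7]]
P_Z_GIVEN_Y = [[0.8, 0.2], [0.4, 0.6]]
P_W_GIVEN_Z = [[0.7, 0.3], [0.2, 0.8]]


def draw(row, rng):
    """Sample index 0 with probability row[0], otherwise index 1."""
    return 0 if rng.random() < row[0] else 1


def estimate_w1_given_x1(n_samples=20000, seed=0):
    """Monte Carlo estimate of Pr(w = first value | x = first value)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_samples):
        y = draw(P_Y_GIVEN_X[0], rng)   # x is clamped to its first value
        z = draw(P_Z_GIVEN_Y[y], rng)
        w = draw(P_W_GIVEN_Z[z], rng)
        hits += (w == 0)
    return hits / n_samples


estimate = estimate_w1_given_x1()
print(estimate)
```

Clamping the evidence works this simply only because $x$ is a root node here. When the evidence sits on a non-root node, sampling needs an extra step such as rejecting inconsistent samples or weighting them.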
