# Bayesian Network: Basics

---

A **Bayesian (belief) network** is a powerful model for representing uncertainty in the world. It captures the **dependencies** between random variables (events) and the corresponding **(conditional) probabilities**.

A Bayesian network has a number of **advantages** for representing knowledge about an uncertain "world".

- Because the model encodes dependencies among all variables, it readily **handles situations where some data entries are missing**.
- A Bayesian network can be used to learn causal relationships, and hence can be used to **gain understanding about a problem** and to **predict the consequences of intervention**. 
- Because the model has both causal and probabilistic semantics, it is an ideal representation for **combining prior knowledge** (which often comes in causal form) and data.
- Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for **avoiding overfitting to data**.

In this tutorial, we introduce the basics of Bayesian networks.

## The "Alarm World"

---

Let us consider the following "*alarm world*": you installed an **alarm** system in your house against **burglary**. However, New Zealand frequently has **earthquakes**, and the alarm system can be occasionally set off by an earthquake as well. In addition, the alarm can be set off by mistake with a very small probability. You have two neighbours, **John** and **Mary**. They might call you if they hear the alarm from your house while you are away. On the other hand, they might still call you for other issues even if they do not hear the alarm. However, they do not know each other, thus they will not communicate with each other about calling you.

<img src="img/alarm.png" width=300></img>

## Random Variables

---

In this "alarm world", there are five binary **random variables**.

- $B$: Whether a **b**urglar breaks into the house or not.
- $E$: Whether there is an **e**arthquake or not.
- $A$: Whether the **a**larm is set off or not.
- $J$: Whether your neighbour **J**ohn calls you or not.
- $M$: Whether your neighbour **M**ary calls you or not.

## Causal Dependencies

---

From domain knowledge, we have the following **causal dependencies** between the random variables.

- The alarm can be set off by a burglar.
- The alarm can be set off by an earthquake.
- Whether a burglar breaks into the house is independent from whether there is an earthquake.
- John might call you if they hear the alarm.
- Mary might call you if they hear the alarm.
- Since John and Mary do not communicate, **given the alarm condition**, whether John calls is independent from whether Mary calls.

## Directed Acyclic Graph

---

Based on the above domain knowledge, we can represent the random variables in the alarm world and their (in)dependencies by the following **Directed Acyclic Graph (DAG)**. 

<img src="img/alarm-dag.png" width=150></img>

We can see that each **node** represents a random variable, and each **directed edge** represents a causal dependency between the variables. For example, the directed edge $(B, A)$ means that the burglary variable is a cause of the alarm variable, and the alarm variable is an effect of the burglary variable.
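
As a minimal sketch, the graph above can be encoded as a dictionary that maps each node to its direct causes. The names `parents` and `topological_order` are illustrative and not part of the original tutorial, and the standard-library `graphlib` module (Python 3.9+) is assumed for checking that the graph really has no cycle.

```python
from graphlib import TopologicalSorter

# Each node maps to its direct causes (incoming neighbours) in the alarm network.
parents = {
    "B": [],
    "E": [],
    "A": ["B", "E"],
    "J": ["A"],
    "M": ["A"],
}

def topological_order(parents):
    """Return the nodes so that every cause comes before its effects.

    graphlib raises CycleError if the graph contains a directed cycle,
    so a successful return also confirms that the graph is a DAG.
    """
    return list(TopologicalSorter(parents).static_order())

order = topological_order(parents)
print(order)
```

Several orders are valid here (for example, `B` and `E` can be swapped), so the exact list printed may vary. What matters is that `A` appears after `B` and `E`, and before `J` and `M`.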

## (Conditional) Probabilities

---

The causal dependencies are qualitative. For quantitative reasoning, we need the (conditional) probabilities for each random variable in the DAG.

Again, from domain knowledge, we have the following probabilities.

- A burglar breaks into the house with probability of 0.1%.
- The probability of an earthquake is 0.2%.
- If there is both a burglar and an earthquake, the alarm is set off with probability of 95%.
- If there is a burglar but no earthquake, the alarm is set off with probability of 94%.
- If there is no burglar but an earthquake, the alarm is set off with probability of 29%.
- If there is no burglar and no earthquake, the alarm is set off by mistake with probability of 0.1%.
- If the alarm is set off, John will hear it and call you with probability of 90%.
- If the alarm is not set off, John will call you for other issues with probability of 5%.
- If the alarm is set off, Mary will hear it and call you with probability of 70%.
- If the alarm is not set off, Mary will call you for other issues with probability of 1%.

Based on the above probabilities, we obtain the following **Bayesian network** for the alarm world.

<img src="img/alarm-bn.png" width=500></img>

From the above example, we can see that to define a Bayesian network, we need to define

1. A **Directed Acyclic Graph (DAG)**, where each **node** represents a **random variable** in the world, and each **directed edge** represents a **causal dependency** between two random variables.
2. A **Conditional Probability Table (CPT)** for each node $X$ in the graph. The conditional probabilities are $P(X\ |\ parents(X))$, where $parents(X)$ are the parents (incoming neighbours) of $X$ in the graph. They are direct causes of $X$.
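
One possible way to hold these two ingredients in code is sketched below. The layout (a `parents` dictionary plus a `cpt` dictionary keyed by parent values) and the helper `prob` are assumptions made for illustration, not a standard API. Only $P(X = true\ |\ parents(X))$ is stored, since the probability of $X = false$ is its complement.

```python
# Direct causes of each variable, in the order used for the CPT keys.
parents = {"B": (), "E": (), "A": ("B", "E"), "J": ("A",), "M": ("A",)}

# cpt[X][parent values] = P(X = True | parents(X)), taken from the alarm world.
cpt = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}

def prob(var, value, assignment):
    """Look up P(var = value | parents(var)); assignment must cover the parents."""
    key = tuple(assignment[p] for p in parents[var])
    p_true = cpt[var][key]
    return p_true if value else 1.0 - p_true

# P(A = true | B = true, E = false) and P(J = false | A = true)
print(prob("A", True, {"B": True, "E": False}))
print(prob("J", False, {"A": True}))
```

A node with $k$ binary parents needs $2^k$ rows in its table, which is why `cpt["A"]` has four entries while `cpt["B"]` has one.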

## (In)dependencies <a name="dependency"></a>

---

An important task in a Bayesian network is to identify the **(in)dependencies** between any pair of random variables in the network. In general, there are four types of common (in)dependencies between variables.

<img src="img/cause-effect.png" width=550></img>

- **Direct Cause**: $A$ is a direct cause of $B$, if $A$ is an incoming neighbour of $B$. Obviously we have
    - $A$ **and $B$ are <span style="color: red;">dependent</span>**.
- **Indirect Cause**: $A$ is an indirect cause of $C$, if there is a directed path from $A$ to $C$ through at least one intermediate node. In the above 3-node example, the directed path is $A \rightarrow B \rightarrow C$. We have
    - $A$ **and $C$ are conditionally <span style="color: blue;">independent</span> given $B$**. Once the direct cause $B$ is known, the indirect cause $A$ provides no further information about $C$.
- **Common Cause**: $B$ and $C$ have the common cause $A$, and neither is a direct cause of the other. We have
    - $B$ **and $C$ are conditionally <span style="color: blue;">independent</span> given $A$**.
    - $B$ **and $C$ are <span style="color: red;">dependent</span> on each other if $A$ is not given**. This is because if $A$ is not given, then observing $B$ (or $C$) changes the probability of $A$, which in turn changes the probability of $C$ (or $B$).
- **Common Effect**: $A$ and $B$ have the common effect $C$, and neither is a direct cause of the other. We have
    - $A$ **and $B$ are <span style="color: blue;">independent</span>** ($C$ is not given).
    - $A$ **and $B$ are conditionally <span style="color: red;">dependent</span> given $C$**. This is also called "explaining away". Since both $A$ and $B$ are causes of $C$, once $C$ is observed, either of them could have caused it, so one cause can "explain away" the other. For example, in the alarm network, both burglary and earthquake are causes of the alarm. Given that the alarm is set off, if we know that there is a burglary, then no earthquake is needed to explain the alarm, and the conditional probability of an earthquake is reduced. On the other hand, if we know that there is no burglary, then an earthquake becomes a more likely explanation of the alarm, and its conditional probability increases.
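
The "explaining away" effect can be checked numerically with the alarm-world probabilities given earlier. The following is a small sketch that only needs $B$, $E$ and $A$; the helper names `joint_bea` and `p_earthquake_given_alarm` are made up for this illustration.

```python
from itertools import product

P_B, P_E = 0.001, 0.002
# P(A = true | B, E), keyed by (burglary, earthquake).
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def joint_bea(b, e, a):
    """P(B=b, E=e, A=a) = P(b) * P(e) * P(a | b, e)."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return pb * pe * pa

def p_earthquake_given_alarm(b=None):
    """P(E = true | A = true), optionally also conditioning on B = b."""
    b_values = [True, False] if b is None else [b]
    num = sum(joint_bea(bv, True, True) for bv in b_values)
    den = sum(joint_bea(bv, ev, True)
              for bv, ev in product(b_values, [True, False]))
    return num / den

print(f"P(E | A)        = {p_earthquake_given_alarm():.4f}")
print(f"P(E | A, B)     = {p_earthquake_given_alarm(True):.4f}")
print(f"P(E | A, not B) = {p_earthquake_given_alarm(False):.4f}")
```

With these numbers, the probability of an earthquake given the alarm is about 0.23. It drops to about 0.002 once a burglary is also known, and rises to about 0.37 once a burglary is ruled out.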

### Independencies in the Alarm Network

<img src="img/alarm-dag.png" width=150></img>

In the above alarm network, we can find the following independencies:

- $B$ and $E$ are independent
- $B$ and $J$ are conditionally independent given $A$ (indirect cause)
- $B$ and $M$ are conditionally independent given $A$ (indirect cause)
- $E$ and $J$ are conditionally independent given $A$ (indirect cause)
- $E$ and $M$ are conditionally independent given $A$ (indirect cause)
- $J$ and $M$ are conditionally independent given $A$ (common cause)

> **NOTE**: $B$ and $E$ will become <span style="color: red;">dependent</span> if $A$ is given, due to common effect $A$ (explaining away).
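
These independencies can be sanity-checked by brute force, since the alarm world has only $2^5 = 32$ possible outcomes. The sketch below assumes the probabilities listed earlier; the helpers `bern`, `joint`, `prob` and `cond` are illustrative names, and enumeration like this is only practical for very small networks.

```python
import math
from itertools import product

VARS = ("B", "E", "A", "J", "M")
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(J = true | A)
P_M = {True: 0.70, False: 0.01}   # P(M = true | A)

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(true)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """Probability of one complete outcome of the alarm world."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

def prob(**fixed):
    """Marginal probability of a partial assignment, e.g. prob(J=True, A=True)."""
    total = 0.0
    for values in product([True, False], repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(*values)
    return total

def cond(query, given):
    """P(query | given), with both arguments given as dictionaries."""
    return prob(**query, **given) / prob(**given)

# B and E are independent: P(B, E) = P(B) * P(E).
print(math.isclose(prob(B=True, E=True), prob(B=True) * prob(E=True)))
# J and M are conditionally independent given A: P(J | M, A) = P(J | A).
print(math.isclose(cond({"J": True}, {"M": True, "A": True}),
                   cond({"J": True}, {"A": True})))
# B and J are conditionally independent given A: P(J | B, A) = P(J | A).
print(math.isclose(cond({"J": True}, {"B": True, "A": True}),
                   cond({"J": True}, {"A": True})))
# Without A, J and M are dependent: P(J | M) differs from P(J).
print(cond({"J": True}, {"M": True}), prob(J=True))
```

The first three checks compare floating-point numbers with `math.isclose` rather than `==`, because the sums are only equal up to rounding.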

## Factorisation <a name="factorisation"></a>

---

The independencies in a Bayesian network can be summarised as follows.

> **THEOREM**: In a Bayesian network, a node is conditionally independent of **all the nodes except its effects (direct and indirect)**, if its **direct causes are all given** and **none of its effects is given**.

**Proof**

1. Given all the direct causes, a node is conditionally independent from all its indirect causes, as well as any other cause/effect node extended from its indirect causes.
2. Given all the direct causes, a node is conditionally independent from all the direct effects of its direct causes (common cause), as well as any other cause/effect node extended from them.
3. If no direct effect is given, then a node is conditionally independent from the direct cause of its direct effect (common effect), as well as any other cause/effect node extended from them.
4. There is no other node than the above categories.

<p style="text-align: right"> $\blacksquare$ </p>

Consider a Bayesian network with the nodes $\{X_1, \dots, X_n\}$. We can have the following theorem.

> **THEOREM**: If the order of $\{X_1, \dots, X_n\}$ is consistent with the edges of the directed acyclic graph, i.e., for each node $X_i$, no direct effect of $X_i$ comes before it, then each node $X_i$ is conditionally independent of all the nodes in $\{X_1, \dots, X_{i-1}\}$ except its direct causes, if all its direct causes are given.

**Proof**

- First, from the order of the nodes, we have that **all the direct causes of $X_i$ are in $\{X_1, \dots, X_{i-1}\}$**. Otherwise, some direct cause $X_k$ of $X_i$ would come after $X_i$ ($k > i$), so $X_i$, a direct effect of $X_k$, would come before $X_k$, which contradicts the order of the nodes.
- Second, the node order guarantees that no effect of $X_i$, direct or indirect, is in $\{X_1, \dots, X_{i-1}\}$, since every node on a directed path leaving $X_i$ comes after the node before it on that path. In other words, **no effect of $X_i$ is given in $\{X_1, \dots, X_{i-1}\}$**.

Therefore, from the first theorem, each node $X_i$ is conditionally independent of all the nodes in $\{X_1, \dots, X_{i-1}\}$ except its direct causes, if all its direct causes are given.

<p style="text-align: right"> $\blacksquare$ </p>

The above theorem can be written as

$$
P(X_i\ |\ X_1, \dots, X_{i-1}) = P(X_i\ |\ parents(X_i)).
$$
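
As a worked instance, take the alarm network with the order $B, E, A, J, M$. This particular order is chosen here for illustration; it is consistent with the graph because every cause comes before its effects. The conditional probability of each node given all of its predecessors then simplifies as follows.

$$
\begin{aligned}
P(E\ |\ B) & = P(E), \\
P(A\ |\ B, E) & = P(A\ |\ B, E), \\
P(J\ |\ B, E, A) & = P(J\ |\ A), \\
P(M\ |\ B, E, A, J) & = P(M\ |\ A).
\end{aligned}
$$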

On the other hand, from the chain (product) rule, we have 

$$
\begin{aligned}
& P(X_1, \dots, X_n) \\
& = P(X_1) * P(X_2\ |\ X_1) * \dots * P(X_n\ |\ X_1, \dots, X_{n-1}).
\end{aligned}
$$

Therefore, the Bayesian network can be written as the following **factorisation** of joint probability distribution.

$$
\begin{aligned}
& P(X_1, \dots, X_n) \\
& = P(X_1\ |\ parents(X_1)) * \dots * P(X_n\ |\ parents(X_n)).
\end{aligned}
$$



For example, in the alarm network, the factorisation is

$$
\begin{aligned}
& P(B, E, A, J, M) \\
& = P(B) * P(E) * P(A\ |\ B, E) * P(J\ |\ A) * P(M\ |\ A).
\end{aligned}
$$

> **NOTE**: The factorisation of the joint probability distribution is equivalent to the earlier representation of the network as a graph plus conditional probability tables.
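
To make the factorisation concrete, the sketch below multiplies one CPT entry per node to obtain the probability of a complete outcome. The dictionary layout and the function name `joint` are assumptions for illustration; the numbers are the alarm-world probabilities given earlier.

```python
from itertools import product

# Direct causes of each variable, in the order used for the CPT keys.
parents = {"B": (), "E": (), "A": ("B", "E"), "J": ("A",), "M": ("A",)}

# cpt[X][parent values] = P(X = True | parents(X)).
cpt = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}

def joint(assignment):
    """P(x_1, ..., x_n) as the product of P(x_i | parents(X_i))."""
    p = 1.0
    for var, value in assignment.items():
        key = tuple(assignment[q] for q in parents[var])
        p_true = cpt[var][key]
        p *= p_true if value else 1.0 - p_true
    return p

# The alarm rings and both neighbours call, with no burglary and no earthquake.
world = {"B": False, "E": False, "A": True, "J": True, "M": True}
print(joint(world))

# Summing the factorised joint over all 2^5 outcomes should give 1.
total = sum(joint(dict(zip(parents, values)))
            for values in product([True, False], repeat=len(parents)))
print(total)

# Numbers to specify: one per CPT row, versus 2^n - 1 for a full joint table.
n_params = sum(len(rows) for rows in cpt.values())
full_table_params = 2 ** len(parents) - 1
print(n_params, full_table_params)
```

For the outcome above, the product is $0.999 * 0.998 * 0.001 * 0.9 * 0.7$, roughly $0.00063$. The last line also shows a practical benefit of the factorisation: the network is specified by 10 numbers, whereas a full joint table over five binary variables would need 31.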

---

- More tutorials can be found [here](https://github.com/meiyi1986/tutorials).
- [Yi Mei's homepage](https://meiyi1986.github.io/)