# Bayesian Network

A **Bayesian (belief) network** is a powerful model for representing uncertainty in the world, including the **dependencies** between different random variables (events) and the corresponding **(conditional) probabilities**.

In this tutorial, we introduce the Bayesian network, show how to build one, and show how to perform inference in it.

# Table of Contents <a name="toc"></a>

1. [Bayesian Network Definition](#definition)
2. [Cause, Effect, (In)dependencies](#dependency)
3. [Factorisation](#factorisation)
4. [Number of Free Parameters](#freepara)
5. [Building Bayesian Network](#building)
6. [Build Bayesian Network through `pgmpy`](#pgmpy)
7. [Inference in Bayesian Network](#inference)

## 1. Bayesian Network Definition <a name="definition"></a>

### The Alarm World

Let us consider the following "*alarm world*": you installed an **alarm** system in your house against **burglary**. However, New Zealand frequently has **earthquakes**, and the alarm system can be occasionally set off by an earthquake as well. In addition, the alarm can be set off by mistake with a very small probability. You have two neighbours, **John** and **Mary**. They might call you if they hear the alarm from your house while you are away. On the other hand, they might still call you for other issues even if they do not hear the alarm. However, they do not know each other, thus they will not communicate with each other about calling you.

<img src="img/alarm.png" width=300></img>

### Random Variables

In this "alarm world", there are five binary **random variables**.

- $B$: Whether a **b**urglar breaks into the house or not.
- $E$: Whether there is an **e**arthquake or not.
- $A$: Whether the **a**larm is set off or not.
- $J$: Whether your neighbour **J**ohn calls you or not.
- $M$: Whether your neighbour **M**ary calls you or not.

### (In)dependencies

From domain knowledge, we have the following **(in)dependencies** between the random variables.

- The alarm can be set off by a burglar.
- The alarm can be set off by an earthquake.
- Whether a burglar breaks into the house is independent from whether there is an earthquake.
- John might call you if they hear the alarm.
- Mary might call you if they hear the alarm.
- Since John and Mary do not communicate, **given the alarm condition**, whether John calls is independent from whether Mary calls.

### Directed Acyclic Graph

Based on the above domain knowledge, we can represent the random variables in the alarm world and their (in)dependencies by the following **Directed Acyclic Graph (DAG)**. 

<img src="img/alarm-dag.png" width=150></img>

We can see that each **node** represents a random variable, and each **directed edge** represents a causal dependency between the variables. For example, the directed edge $(B, A)$ means that the burglary variable is a cause of the alarm variable, and the alarm variable is an effect of the burglary variable.

### (Conditional) Probabilities

The causal dependencies are qualitative. For quantitative reasoning, we need the (conditional) probabilities for each random variable in the DAG.

Again, from domain knowledge, we have the following probabilities.

- A burglar breaks into the house with probability of 0.1%.
- The probability of an earthquake is 0.2%.
- If there were both a burglar and an earthquake, the alarm is set off with probability of 95%.
- If there was a burglar but no earthquake, the alarm is set off with probability of 94%.
- If there was no burglar and an earthquake, the alarm is set off with probability of 29%.
- If there was no burglar and no earthquake, the alarm is set off by mistake with probability of 0.1%.
- If the alarm is set off, John will hear it and call you with probability of 90%.
- If the alarm is not set off, John will call you for other issues with probability of 5%.
- If the alarm is set off, Mary will hear it and call you with probability of 70%.
- If the alarm is not set off, Mary will call you for other issues with probability of 1%.

Based on the above probabilities, we can have the following **Bayesian network** for the alarm world.

<img src="img/alarm-bn.png" width=500></img>

From the above example, we can see that to define a Bayesian network, we need to define

1. A **Directed Acyclic Graph (DAG)**, where each **node** represents a **random variable** in the world, and each **directed edge** represents a **causal dependency** between two random variables
2. A **Conditional Probability Table (CPT)** for each node $X$ in the graph. The conditional probabilities are $P(X\ |\ parents(X))$, where $parents(X)$ are the parents (incoming neighbours) of $X$ in the graph. They are direct causes of $X$.
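
The two ingredients above can be written down directly as data. The following is a minimal sketch in plain Python, assuming the alarm-world numbers listed earlier; the names `parents`, `cpt` and `prob` are illustrative choices, not part of any library.

```python
# Hypothetical plain-Python encoding of the alarm network (names are illustrative).
# parents[X] lists the direct causes of X, i.e. the incoming edges of the DAG.
parents = {
    "B": [],
    "E": [],
    "A": ["B", "E"],
    "J": ["A"],
    "M": ["A"],
}

# cpt[X][parent_values] = P(X = True | parents(X) = parent_values).
# P(X = False | ...) is 1 minus the stored value.
cpt = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {
        (True, True): 0.95,
        (True, False): 0.94,
        (False, True): 0.29,
        (False, False): 0.001,
    },
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}


def prob(var, value, parent_values):
    """Return P(var = value | parents(var) = parent_values)."""
    p_true = cpt[var][tuple(parent_values)]
    return p_true if value else 1.0 - p_true


# Every binary node needs one CPT row per combination of its parents' values.
for var, pars in parents.items():
    assert len(cpt[var]) == 2 ** len(pars)

print(prob("A", True, (True, False)))  # P(A = t | B = t, E = f)
print(prob("J", False, (True,)))       # P(J = f | A = t)
```

Storing only the `True` probability per row is a design choice that works because all five variables are binary.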

---------

*[Back to Table of Content](#toc)*

---------

## 2. Cause, Effect, (In)dependencies <a name="dependency"></a>

An important task in a Bayesian network is to identify the **(in)dependencies** between any pair of random variables in the network. In general, there are four types of common (in)dependencies between variables.

<img src="img/cause-effect.png" width=550></img>

- **Direct Cause**: $A$ is a direct cause of $B$, if $A$ is an incoming neighbour of $B$. Obviously we have
    - $A$ **and $B$ are <span style="color: red;">dependent</span>**.
- **Indirect Cause**: $A$ is an indirect cause of $C$, if there is a directed path from $A$ to $C$. In the above 3-node example, the directed path is $A \rightarrow B \rightarrow C$. We have 
    - $A$ **and $C$ are conditionally <span style="color: blue;">independent</span> given $B$**. This is obvious, since if the direct cause is given, then the indirect cause is not needed.
- **Common Cause**: $B$ and $C$ have the common cause $A$, and they are not direct cause of each other. We have
    - $B$ **and $C$ are conditionally <span style="color: blue;">independent</span> given $A$**.
    - $B$ **and $C$ are <span style="color: red;">dependent</span> on each other if $A$ is not given**. This is because if $A$ is not given, then observing $B$ (or $C$) changes the probability of $A$, which in turn changes the probability of $C$ (or $B$).
- **Common Effect**: $A$ and $B$ have the common effect $C$, and they are not direct cause of each other. We have
    - $A$ **and $B$ are <span style="color: blue;">independent</span>** ($C$ is not given).
    - $A$ **and $B$ are conditionally <span style="color: red;">dependent</span> given $C$**. This is also called "explaining away". Since both $A$ and $B$ are causes of $C$, if $C$ is given, either $A$ or $B$ can be the cause. Thus, one cause can "explain away" the other. For example, in the alarm network, both burglary and earthquake are causes of the alarm. Given that the alarm is set off, if we know that there is a burglary, then the earthquake is no longer needed to explain the alarm, and the conditional probability of an earthquake is reduced. On the other hand, if we know that there is no burglary, then an earthquake becomes a much more likely explanation for the alarm, and its conditional probability increases.
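
The "explaining away" effect can be checked numerically. The sketch below uses the alarm-world probabilities from Section 1 and Bayes' rule; the function name `p_quake_given_alarm` is an illustrative choice.

```python
# Numbers taken from the alarm world: P(B), P(E) and P(A = t | B, E).
p_b = {True: 0.001, False: 0.999}
p_e = {True: 0.002, False: 0.998}
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}


def p_quake_given_alarm(burglary=None):
    """P(E = t | A = t), optionally also conditioning on B = burglary."""
    b_values = [True, False] if burglary is None else [burglary]
    # Unnormalised weight of each earthquake value, summed over the allowed B values.
    weight = {
        e: sum(p_b[b] * p_e[e] * p_a[(b, e)] for b in b_values)
        for e in (True, False)
    }
    return weight[True] / (weight[True] + weight[False])


print(f"P(E=t | A=t)      = {p_quake_given_alarm():.4f}")
print(f"P(E=t | A=t, B=t) = {p_quake_given_alarm(True):.4f}")
print(f"P(E=t | A=t, B=f) = {p_quake_given_alarm(False):.4f}")
```

With these numbers, the probability of an earthquake given the alarm is roughly 0.23. It drops to about 0.002 once a burglary is known, and rises to about 0.37 once a burglary is ruled out.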

### Independencies in the Alarm Network

From the above alarm network, we can find the following independencies:

- $B$ and $E$ are independent
- $B$ and $J$ are conditionally independent given $A$ (indirect cause)
- $B$ and $M$ are conditionally independent given $A$ (indirect cause)
- $E$ and $J$ are conditionally independent given $A$ (indirect cause)
- $E$ and $M$ are conditionally independent given $A$ (indirect cause)
- $J$ and $M$ are conditionally independent given $A$ (common cause)

> **NOTE**: $B$ and $E$ will become <span style="color: red;">dependent</span> if $A$ is given, due to common effect (explaining away).
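
These (in)dependencies can also be verified by brute force. The following sketch builds the full joint distribution of the alarm world from the probabilities in Section 1 and compares conditional probabilities. The helper names (`bern`, `joint`, `prob`) are illustrative, and summing over all 32 worlds is only practical for a network this small.

```python
from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}
NAMES = ("B", "E", "A", "J", "M")
WORLDS = [dict(zip(NAMES, values)) for values in product([True, False], repeat=5)]


def bern(p_true, value):
    """Probability of a binary value when P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(w):
    """Joint probability of one complete world w = {name: value}."""
    return (bern(P_B, w["B"]) * bern(P_E, w["E"])
            * bern(P_A[(w["B"], w["E"])], w["A"])
            * bern(P_J[w["A"]], w["J"]) * bern(P_M[w["A"]], w["M"]))


def prob(event, given=None):
    """P(event | given), both written as {name: value} dictionaries."""
    given = given or {}

    def mass(assignment):
        return sum(joint(w) for w in WORLDS
                   if all(w[k] == v for k, v in assignment.items()))

    return mass({**event, **given}) / mass(given)


# Common cause: J and M are dependent, but independent once A is given.
print(prob({"M": True}), prob({"M": True}, {"J": True}))
print(prob({"M": True}, {"A": True}), prob({"M": True}, {"A": True, "J": True}))
# Common effect: B and E are independent, but dependent once A is given.
print(prob({"B": True}), prob({"B": True}, {"E": True}))
print(prob({"B": True}, {"A": True}), prob({"B": True}, {"A": True, "E": True}))
```

In each pair of printed lines, the first line shows the marginal relationship and the second shows the relationship after conditioning on $A$.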

---------

*[Back to Table of Content](#toc)*

---------

## 3. Factorisation <a name="factorisation"></a>

The independencies in a Bayesian network can be summarised as follows.

> **THEOREM**: In a Bayesian network, a node is conditionally independent from **all the nodes except its direct and indirect effects**, if the **direct causes are all given**, and **no direct or indirect effect is given**.

**Proof**

1. Given all the direct causes, a node is conditionally independent from all its indirect causes, as well as any other cause/effect node extended from its indirect causes.
2. Given all the direct causes, a node is conditionally independent from all the direct effects of its direct causes (common cause), as well as any other cause/effect node extended from them.
3. If no direct effect is given, then a node is conditionally independent from the direct cause of its direct effect (common effect), as well as any other cause/effect node extended from them.
4. There is no other node than the above categories.

<div style="text-align: right"> $\blacksquare$ </div>

Consider a Bayesian network with the nodes $\{X_1, \dots, X_n\}$. We can have the following theorem.

> **THEOREM**: If the order of $\{X_1, \dots, X_n\}$ is consistent with the nodes in the directed acyclic graph, i.e., for each node $X_i$, no direct effect of $X_i$ is before it, then each node $X_i$ is conditionally independent from all the nodes in $\{X_1, \dots, X_{i-1}\}$ except its direct causes, if all its direct causes are given.

**Proof**

First, from the order of the nodes in the network, we have

- **All the direct causes of $X_i$ are in $\{X_1, \dots, X_{i-1}\}$**. Otherwise, if a direct cause $X_j$ of $X_i$ is outside $\{X_1, \dots, X_{i-1}\}$, then $j > i$, and $X_i$ is a direct effect of $X_j$ that appears before $X_j$, which contradicts the order of the nodes.

Second, the node order already tells us that no direct effect of $X_i$ is in $\{X_1, \dots, X_{i-1}\}$ (and hence no indirect effect either). Therefore, from the previous theorem, each node $X_i$ is conditionally independent from all the nodes in $\{X_1, \dots, X_{i-1}\}$ except its direct causes, if all its direct causes are given.

<div style="text-align: right"> $\blacksquare$ </div>

The above theorem can be written as

$$
P(X_i\ |\ X_1, \dots, X_{i-1}) = P(X_i\ |\ parents(X_i)),
$$

where $parents(X_i)$ are the parents (direct causes) of $X_i$.

For example, we can have some equations for the alarm network as follows:

$$
P(J\ |\ A, B, E) = P(J\ |\ A),
$$

$$
P(M\ |\ A, B, J) = P(M\ |\ A),
$$

$$
P(J\ |\ A, B, E, M) = P(J\ |\ A).
$$

On the other hand, from the chain (product) rule, we have 

$$
\begin{aligned}
& P(X_1, \dots, X_n) \\
& = P(X_1) * P(X_2\ |\ X_1) * \dots * P(X_n\ |\ X_1, \dots, X_{n-1}).
\end{aligned}
$$

Therefore, the Bayesian network can be written as the following **factorisation** of joint probability distribution.

$$
\begin{aligned}
& P(X_1, \dots, X_n) \\
& = P(X_1\ |\ parents(X_1)) * \dots * P(X_n\ |\ parents(X_n)).
\end{aligned}
$$



For example, in the above alarm network, the factorisation is

$$
\begin{aligned}
& P(B, E, A, J, M) \\
& = P(B) * P(E) * P(A\ |\ B, E) * P(J\ |\ A) * P(M\ |\ A).
\end{aligned}
$$
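
The factorisation translates directly into a product of table lookups. The sketch below assumes the alarm-world probabilities from Section 1; `bern` and `joint` are illustrative helper names.

```python
from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}


def bern(p_true, value):
    """Probability of a binary value when P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """P(B, E, A, J, M) = P(B) * P(E) * P(A | B, E) * P(J | A) * P(M | A)."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


# The 32 joint probabilities must add up to 1.
total = sum(joint(*world) for world in product([True, False], repeat=5))
print(f"sum over all worlds = {total:.6f}")

# Alarm rings and both neighbours call, but there is no burglary and no earthquake.
print(f"P(B=f, E=f, A=t, J=t, M=t) = {joint(False, False, True, True, True):.6f}")
```

The second value is $0.999 * 0.998 * 0.001 * 0.9 * 0.7 \approx 0.000628$, obtained from five table entries without ever storing the full joint table.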

> **NOTE**: The factorisation of the joint probability distribution is equivalent to the previous representation by graph + probability tables.

---------

*[Back to Table of Content](#toc)*

---------

## 4. Number of Free Parameters <a name="freepara"></a>

To store a Bayesian network, we need to store the graph and the probability tables of each node. Obviously, the probability tables dominate the graph in the memory requirement, thus we focus on the memory requirement of the probability tables.

For the sake of convenience, we only consider **discrete** variables in the network, and the continuous variables will be discretised. Then, for each variable $X$ in the network, we have the following notations.

- $\Omega(X)$: the domain (set of possible values) of $X$
- $|\Omega(X)|$: the number of possible values of $X$
- $parents(X)$: the parents (direct causes) of $X$ in the network

For each variable $X$, the probability table stores the probabilities for $P(X\ |\ parents(X))$ for different $X$ values and $parents(X)$ values. Let's consider the following situations:

1. $X$ does not have any parent. In this case, the table stores $P(X)$. There are $|\Omega(X)|$ probabilities, each for a possible value of $X$. However, due to the [normalisation rule](https://homepages.ecs.vuw.ac.nz/~yimei/tutorials/reasoning-under-uncertainty-basics.html#probrules), all the probabilities add up to 1. Thus, we need to store only $|\Omega(X)|-1$ probabilities, and the last probability can be calculated by ($1-$the sum of the stored probabilities). Therefore, the probability table contains $|\Omega(X)|-1$ rows/probabilities.
2. $X$ has one parent $Y$. In this case, for each condition $y \in \Omega(Y)$, we need to store the conditional probabilities $P(X\ |\ Y = y)$. Again, we need to store $|\Omega(X)|-1$ conditional probabilities for $P(X\ |\ Y = y)$, and can calculate the last conditional probability by the normalisation rule. Therefore, the probability table contains $(|\Omega(X)|-1)*|\Omega(Y)|$ rows/probabilities.
3. $X$ has multiple parents $Y_1, \dots, Y_m$. In this case, there are $|\Omega(Y_1)|*\dots * |\Omega(Y_m)|$ possible conditions $[Y_1 = y_1, \dots, Y_m = y_m]$. For each condition, we need to store $|\Omega(X)|-1$ conditional probabilities for $P(X\ |\ Y_1 = y_1, \dots, Y_m = y_m)$. Therefore, the probability table contains $(|\Omega(X)|-1)*|\Omega(Y_1)|*\dots * |\Omega(Y_m)|$ rows/probabilities.

As shown in the above alarm network, all the variables are binary, i.e. $|\Omega(X)| = 2$. Therefore, $B$ and $E$ have only 1 row in their probability tables, since they have no parent. $A$ has $1 \times 2 \times 2 = 4$ rows in its probability table, since it has two binary parents $B$ and $E$, leading to four possible conditions.

> **DEFINITION**: The **number of free parameters** of a Bayesian network is the number of probabilities we need to estimate and store (can NOT be derived/calculated) in the probability tables.

Consider a Bayesian network with the factorisation

$$
\begin{aligned}
& P(X_1, \dots, X_n) \\
& = P(X_1\ |\ parents(X_1)) * \dots * P(X_n\ |\ parents(X_n)),
\end{aligned}
$$

the number of free parameters is

$$
\begin{aligned}
& (|\Omega(X_1)|-1)*\prod_{Y \in parents(X_1)}|\Omega(Y)| \\
& + (|\Omega(X_2)|-1)*\prod_{Y \in parents(X_2)}|\Omega(Y)| \\
& + \dots \\
& + (|\Omega(X_n)|-1)*\prod_{Y \in parents(X_n)}|\Omega(Y)|. \\
\end{aligned}
$$

Let's calculate the number of free parameters of the following simple networks, assuming that all the variables are binary.

- **Direct cause**: $P(A)$ has 1 free parameter, $P(B\ |\ A)$ has 2 free parameters. The network has $1+2 = 3$ free parameters.
- **Indirect cause**: $P(A)$ has 1 free parameter, $P(B\ |\ A)$ and $P(C\ |\ B)$ have 2 free parameters. The network has $1+2+2 = 5$ free parameters.
- **Common cause**: $P(A)$ has 1 free parameter, $P(B\ |\ A)$ and $P(C\ |\ A)$ have 2 free parameters. The network has $1+2+2 = 5$ free parameters.
- **Common effect**: $P(A)$ and $P(B)$ have 1 free parameter, $P(C\ |\ A, B)$ has $2\times 2 = 4$ free parameters. The network has $1+1+4 = 6$ free parameters.

<img src="img/cause-effect.png" width=550></img>

> **NOTE**: We can see that the common effect dependency causes the most free parameters required for the network. Therefore, when building a Bayesian network, we should try to reduce the number of such dependencies to reduce the number of free parameters of the network.
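
The counting rule above can be sketched as a short function. It assumes the network is given as two dictionaries, `cardinality` (the size of each domain) and `parents` (the direct causes of each node); both names are illustrative.

```python
from math import prod


def free_parameters(cardinality, parents):
    """Sum of (|Omega(X)| - 1) * prod(|Omega(Y)| for Y in parents(X)) over all nodes X."""
    return sum(
        (cardinality[x] - 1) * prod(cardinality[y] for y in parents[x])
        for x in parents
    )


def binary(parents):
    """Helper: give every node in the network a domain of size 2."""
    return {x: 2 for x in parents}


simple_networks = {
    "direct cause": {"A": [], "B": ["A"]},
    "indirect cause": {"A": [], "B": ["A"], "C": ["B"]},
    "common cause": {"A": [], "B": ["A"], "C": ["A"]},
    "common effect": {"A": [], "B": [], "C": ["A", "B"]},
}
for name, parents in simple_networks.items():
    print(name, free_parameters(binary(parents), parents))

alarm = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
print("alarm network", free_parameters(binary(alarm), alarm))
```

For the four simple networks this reproduces the counts 3, 5, 5 and 6 computed by hand above, and the alarm network needs $1 + 1 + 4 + 2 + 2 = 10$ free parameters.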

---------

*[Back to Table of Content](#toc)*

-----------

## 5. Building Bayesian Network <a name="building"></a>

Building a Bayesian network mainly consists of the following three steps:

1. Identify a set of **random variables** that describe the world of reasoning.
2. Build the **directed acyclic graph**, i.e., the **directed links** between the random variables.
3. Build the **conditional probability table** for each variable, by estimating the necessary probabilities.

Here, we introduce Pearl's network construction algorithm, which builds the network based on a **node ordering**.

```Python
# Step 1: identify variables
Identify the random variables that describe the world of reasoning
# Step 2: build the graph, add the links
Sort the random variables by some order
Set bn = []
for var in sorted_vars:
    Find the minimum subset of variables in bn so that P(var | bn) = P(var | subset)
    
    Add var into bn
    for bn_var in subset:
        Add a direct link [bn_var, var]
    # Step 3: estimate the conditional probability table
    Estimate the conditional probabilities P(var | subset)
```

In this algorithm, the **node ordering** is critical to determine the number of links between the nodes, and thus the size of the conditional probability tables. 

We show how the links are added into the network under different node orders, using the alarm network as an example.

----------

#### Order 1: $B \rightarrow E \rightarrow A \rightarrow J \rightarrow M$

- **Step 1**: The node $B$ is added into the network. No edge is added, since there is only one node in the network.
- **Step 2**: The node $E$ is added into the network. No edge from $B$ to $E$ is added, since $B$ and $E$ are independent.
- **Step 3**: The node $A$ is added into the network. Two edges $[B, A]$ and $[E, A]$ are added, since $B$ and $E$ are both direct causes of $A$.
- **Step 4**: The node $J$ is added into the network. The minimum subset $\{A\} \subseteq \{B, E, A\}$ in the network is found to be the parent of $J$, since $J$ is conditionally independent from $B$ and $E$ given $A$, i.e., $P(J\ |\ B, E, A) = P(J\ |\ A)$. An edge $[A, J]$ is added into the network.
- **Step 5**: The node $M$ is added into the network. The minimum subset $\{A\} \subseteq \{B, E, A, J\}$ in the network is found to be the parent of $M$, since $M$ is conditionally independent from $B$, $E$ and $J$ given $A$, i.e., $P(M\ |\ B, E, A, J) = P(M\ |\ A)$. An edge $[A, M]$ is added into the network.

The built network is shown as follows. The number of free parameters in this network is $1 + 1 + 4 + 2 + 2 = 10$.

<img src="img/alarm-dag.png" width=150></img>

----------

#### Order 2: $J \rightarrow M \rightarrow A \rightarrow B \rightarrow E$

- **Step 1**: The node $J$ is added into the network. No edge is added, since there is only one node in the network.
- **Step 2**: The node $M$ is added into the network. $M$ and $J$ are dependent (note that the common cause $A$ is not given at this step), i.e., $P(M\ |\ J) \neq P(M)$. Therefore, an edge $[J, M]$ is added into the network.
- **Step 3**: The node $A$ is added into the network. Two edges $[J, A]$ and $[M, A]$ are added, since $J$ and $M$ are both dependent on $A$.
- **Step 4**: The node $B$ is added into the network. The minimum subset $\{A\} \subseteq \{J, M, A\}$ in the network is found to be the parent of $B$, since $B$ is conditionally independent from $J$ and $M$ given $A$, i.e., $P(B\ |\ J, M, A) = P(B\ |\ A)$. An edge $[A, B]$ is added into the network.
- **Step 5**: The node $E$ is added into the network. The minimum subset $\{A, B\} \subseteq \{J, M, A, B\}$ in the network is found to be the parents of $E$, since $E$ is conditionally independent from $J$ and $M$ given $A$ and $B$, i.e., $P(E\ |\ J, M, A, B) = P(E\ |\ A, B)$ (note that $B$ and $E$ have the common effect $A$, thus when $A$ is given, $B$ and $E$ are conditionally dependent). Two edges $[A, E]$ and $[B, E]$ are added into the network.

The built network is shown as follows. The number of free parameters in this network is $1 + 2 + 4 + 2 + 4 = 13$.

<img src="img/alarm-dag2.png" width=150></img>

----------

#### Order 3: $J \rightarrow M \rightarrow B \rightarrow E \rightarrow A$

- **Step 1**: The node $J$ is added into the network. No edge is added, since there is only one node in the network.
- **Step 2**: The node $M$ is added into the network. $M$ and $J$ are dependent (note that the common cause $A$ is not given at this step), i.e., $P(M\ |\ J) \neq P(M)$. Therefore, an edge $[J, M]$ is added into the network.
- **Step 3**: The node $B$ is added into the network. Two edges $[J, B]$ and $[M, B]$ are added, since $J$ and $M$ are both dependent on $B$ (through $A$, which has not been added yet).
- **Step 4**: The node $E$ is added into the network. There is no conditional independence found among $\{J, M, B, E\}$ without giving $A$. Therefore, three edges $[J, E]$, $[M, E]$, $[B, E]$ are added into the network.
- **Step 5**: The node $A$ is added into the network. First, two edges $[J, A]$ and $[M, A]$ are added, since $J$ and $M$ are both dependent on $A$. Then, another two edges $[B, A]$ and $[E, A]$ are also added, since $B$ and $E$ are both direct causes of $A$.

The built network is shown as follows. The number of free parameters in this network is $1 + 2 + 4 + 8 + 16 = 31$.

<img src="img/alarm-dag3.png" width=200></img>

---------

We can see that different node orders can lead to greatly different graphs and numbers of free parameters. Therefore, we should find the **optimal node order** that leads to the most **compact** network (with the fewest free parameters).
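
The three orders above can be reproduced with a small runnable sketch of the construction algorithm. In practice the independence judgements come from domain knowledge. Here, as an assumption made purely for illustration, they are answered by brute force from the true joint distribution of the alarm world, with a numerical tolerance `tol`. All function names are hypothetical.

```python
from itertools import combinations, product

# CPTs of the alarm world, used here only as an oracle for (in)dependence.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}
NAMES = ["B", "E", "A", "J", "M"]


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


# Full joint distribution as a list of (world, probability) pairs.
JOINT = []
for b, e, a, j, m in product([True, False], repeat=5):
    p = (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
         * bern(P_J[a], j) * bern(P_M[a], m))
    JOINT.append((dict(zip(NAMES, (b, e, a, j, m))), p))


def marginal(assignment):
    """P(assignment), summing the joint over all unmentioned variables."""
    return sum(p for world, p in JOINT
               if all(world[k] == v for k, v in assignment.items()))


def cond_true(var, given):
    """P(var = True | given)."""
    return marginal({**given, var: True}) / marginal(given)


def is_sufficient(var, subset, previous, tol=1e-9):
    """Check P(var | previous) == P(var | subset) for every assignment of previous."""
    for values in product([True, False], repeat=len(previous)):
        full = dict(zip(previous, values))
        part = {k: full[k] for k in subset}
        if abs(cond_true(var, full) - cond_true(var, part)) > tol:
            return False
    return True


def pearl_construction(order):
    """Return the edges obtained by adding the nodes in the given order."""
    edges, added = [], []
    for var in order:
        # Try subsets of the already-added nodes from smallest to largest.
        for size in range(len(added) + 1):
            subset = next((s for s in combinations(added, size)
                           if is_sufficient(var, s, added)), None)
            if subset is not None:
                break
        edges += [(parent, var) for parent in subset]
        added.append(var)
    return edges


for order in ("BEAJM", "JMABE", "JMBEA"):
    edges = pearl_construction(list(order))
    print(order, len(edges), edges)
```

The three orders produce 4, 6 and 10 edges respectively, matching the graphs built by hand above.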

> **QUESTION**: How to find the optimal node order that leads to the most compact Bayesian network?

The node order is mainly determined based on our **domain knowledge** about **cause and effect**. At first, we add the nodes with no cause (i.e., the root causes) into the ordered list. Then, at each step, we find the remaining nodes whose direct causes are all in the current ordered list (i.e., all their direct causes are given) and append them into the end of the ordered list. This way, we only need to add direct links from their direct causes to them.

The pseudocode of the node ordering is shown as follows.

```Python
def node_ordering(all_nodes):
    Set ordered_nodes = [], remaining_nodes = all_nodes
    while remaining_nodes is not empty:
        Select the nodes whose direct causes are all in ordered_nodes
        Append the selected nodes into ordered_nodes
        Remove the selected nodes from remaining_nodes
    return ordered_nodes
```

For the alarm network, first we add two nodes $\{B, E\}$ into the ordered list, since they are the root causes, and have no direct cause. Then, we add $A$ into the ordered list, since it has two direct causes $B$ and $E$, both of which are already in the ordered list. Finally, we add $J$ and $M$ into the list, since their direct cause $A$ is already in the ordered list.
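
A runnable sketch of this node ordering is given below. It assumes the cause-effect knowledge is supplied as a `parents` dictionary (an illustrative representation), and it adds a cycle check that the procedure described above does not need when the input is a valid DAG.

```python
def node_ordering(parents):
    """Order nodes so that every node appears after all of its direct causes."""
    ordered, remaining = [], list(parents)
    while remaining:
        # Nodes whose direct causes have all been placed already.
        ready = [x for x in remaining if all(p in ordered for p in parents[x])]
        if not ready:
            raise ValueError("the cause-effect graph contains a cycle")
        ordered.extend(ready)
        remaining = [x for x in remaining if x not in ready]
    return ordered


# The alarm network, deliberately listed in a "bad" order.
alarm_parents = {"J": ["A"], "M": ["A"], "A": ["B", "E"], "B": [], "E": []}
print(node_ordering(alarm_parents))  # ['B', 'E', 'A', 'J', 'M']
```

This is a layer-by-layer topological sort: root causes come first, then their direct effects, and so on.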

---------

*[Back to Table of Content](#toc)*

---------

## 6. Build Bayesian Network through `pgmpy` <a name="pgmpy"></a>

Here, we show how to build the alarm network through the Python [pgmpy](https://pgmpy.org) library. The alarm network is displayed again below.

<img src="img/alarm-bn.png" width=500></img>

First, we install the library using `pip`.

```Python
pip install pgmpy
```

```
Note: you may need to restart the kernel to use updated packages.
```


Then, we import the necessary modules for the Bayesian network as follows.

```Python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
```

Now, we build the alarm Bayesian network as follows.

1. We define the network structure by specifying the four links.
2. We define (estimate) the discrete conditional probability tables, represented as the `TabularCPD` class.

```Python
# Define the network structure
alarm_model = BayesianNetwork(
    [
        ("Burglary", "Alarm"),
        ("Earthquake", "Alarm"),
        ("Alarm", "JohnCall"),
        ("Alarm", "MaryCall"),
    ]
)

# Define the probability tables by TabularCPD
cpd_burglary = TabularCPD(
    variable="Burglary", variable_card=2, values=[[0.999], [0.001]]
)

cpd_earthquake = TabularCPD(
    variable="Earthquake", variable_card=2, values=[[0.998], [0.002]]
)

cpd_alarm = TabularCPD(
    variable="Alarm",
    variable_card=2,
    values=[[0.999, 0.71, 0.06, 0.05], [0.001, 0.29, 0.94, 0.95]],
    evidence=["Burglary", "Earthquake"],
    evidence_card=[2, 2],
)

cpd_johncall = TabularCPD(
    variable="JohnCall",
    variable_card=2,
    values=[[0.95, 0.1], [0.05, 0.9]],
    evidence=["Alarm"],
    evidence_card=[2],
)

cpd_marycall = TabularCPD(
    variable="MaryCall",
    variable_card=2,
    values=[[0.99, 0.3], [0.01, 0.7]],
    evidence=["Alarm"],
    evidence_card=[2],
)

# Associating the probability tables with the model structure
alarm_model.add_cpds(
    cpd_burglary, cpd_earthquake, cpd_alarm, cpd_johncall, cpd_marycall
)
```

Let's view the nodes of the alarm network.

```Python
# Viewing nodes of the model
alarm_model.nodes()
```

```
NodeView(('Burglary', 'Alarm', 'Earthquake', 'JohnCall', 'MaryCall'))
```

We can also view the edges of the alarm network.

```Python
# Viewing edges of the model
alarm_model.edges()
```

```
OutEdgeView([('Burglary', 'Alarm'), ('Alarm', 'JohnCall'), ('Alarm', 'MaryCall'), ('Earthquake', 'Alarm')])
```

We can show the probability tables using the `print()` function. 

> **NOTE**: the `pgmpy` library stores ALL the probabilities (including the last probability). This requires a bit more memory, but can save time for calculating the last probability by normalisation rule.

Let's print the probability tables for **Alarm** and **MaryCall**. For each variable, the value (0) stands for `False`, while the value (1) stands for `True`.

```Python
# Print the probability table of the Alarm node
print(cpd_alarm)

# Print the probability table of the MaryCall node
print(cpd_marycall)
```

```
+------------+---------------+---------------+---------------+---------------+
| Burglary   | Burglary(0)   | Burglary(0)   | Burglary(1)   | Burglary(1)   |
+------------+---------------+---------------+---------------+---------------+
| Earthquake | Earthquake(0) | Earthquake(1) | Earthquake(0) | Earthquake(1) |
+------------+---------------+---------------+---------------+---------------+
| Alarm(0)   | 0.999         | 0.71          | 0.06          | 0.05          |
+------------+---------------+---------------+---------------+---------------+
| Alarm(1)   | 0.001         | 0.29          | 0.94          | 0.95          |
+------------+---------------+---------------+---------------+---------------+
+-------------+----------+----------+
| Alarm       | Alarm(0) | Alarm(1) |
+-------------+----------+----------+
| MaryCall(0) | 0.99     | 0.3      |
+-------------+----------+----------+
| MaryCall(1) | 0.01     | 0.7      |
+-------------+----------+----------+
```


We can find all the **(conditional) independencies** between the nodes in the network.

```Python
alarm_model.get_independencies()
```

```
(JohnCall ⟂ Earthquake, MaryCall, Burglary | Alarm)
(JohnCall ⟂ MaryCall, Burglary | Earthquake, Alarm)
(JohnCall ⟂ Earthquake, Burglary | MaryCall, Alarm)
(JohnCall ⟂ Earthquake, MaryCall | Alarm, Burglary)
(JohnCall ⟂ Burglary | Earthquake, MaryCall, Alarm)
(JohnCall ⟂ MaryCall | Earthquake, Alarm, Burglary)
(JohnCall ⟂ Earthquake | MaryCall, Alarm, Burglary)
(MaryCall ⟂ JohnCall, Earthquake, Burglary | Alarm)
(MaryCall ⟂ Earthquake, Burglary | JohnCall, Alarm)
(MaryCall ⟂ JohnCall, Burglary | Earthquake, Alarm)
(MaryCall ⟂ JohnCall, Earthquake | Alarm, Burglary)
(MaryCall ⟂ Burglary | JohnCall, Earthquake, Alarm)
(MaryCall ⟂ Earthquake | JohnCall, Alarm, Burglary)
(MaryCall ⟂ JohnCall | Earthquake, Alarm, Burglary)
(Burglary ⟂ Earthquake)
(Burglary ⟂ JohnCall, MaryCall | Alarm)
(Burglary ⟂ MaryCall | JohnCall, Alarm)
(Burglary ⟂ JohnCall, MaryCall | Earthquake, Alarm)
(Burglary ⟂ JohnCall | MaryCall, Alarm)
(Burglary ⟂ MaryCall | JohnCall, Earthquake, Alarm)
(Burglary ⟂ JohnCall | E
```

We can also find the **local (conditional) independencies of a specific node** in the network as follows.

```Python
# Checking independencies of a node
alarm_model.local_independencies("JohnCall")
```

```
(JohnCall ⟂ Earthquake, MaryCall, Burglary | Alarm)
```

---------

*[Back to Table of Content](#toc)*

----------

## 7. Inference in Bayesian Network <a name="inference"></a>

In the alarm network, we might have the following questions:

- If there was an earthquake, how likely is it that Mary will call you?
- If both John and Mary called you, how likely is it that there was a burglary?
- If Mary called you, how likely is it that John will call you as well?

Answering such questions is called **inference** in a Bayesian network. Formally, inference in a Bayesian network is defined as follows.

> **DEFINITION**: Given a set of **query** nodes $[X_1, \dots, X_n]$, and a set of **evidence** nodes with their **observed values** $[Y_1 = y_1, \dots, Y_m = y_m]$, the **inference** is to calculate the conditional probabilities $P(X_1, \dots, X_n\ |\ Y_1 = y_1, \dots, Y_m = y_m)$.

Inference in a Bayesian network is very **flexible**: any node in the network can be a query node or an evidence node. Depending on the query and evidence nodes, some frequently encountered inference/reasoning scenarios are:

- **Causal reasoning**: the evidence nodes are the causes of the query nodes. This is forward reasoning.
- **Diagnostic reasoning**: the evidence nodes are the effects of the query nodes. This is backward reasoning.
- **Inter-causal reasoning**: the evidence and query nodes are effects of some *hidden* common causes.

There are two main types of inference algorithms:

1. **Exact Algorithm (Inference by Enumeration)**: This guarantees the accuracy of the calculated conditional probabilities. However, for large and complex Bayesian networks, exact algorithms become computationally infeasible, in which case approximate algorithms must be used. 
2. **Approximate Algorithm**: This can be much faster than the exact algorithms, although the estimated conditional probabilities are not 100% accurate.
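
To give a feel for the approximate approach, here is a minimal sketch that estimates $P(B = t\ |\ J = t)$ in the alarm network by rejection sampling. Rejection sampling is chosen here only as an illustration, since this tutorial does not prescribe a particular approximate algorithm. The function names, the sample size and the seed are all assumptions.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}


def sample_world(rng):
    """Sample each node after its direct causes and return (b, j).

    Mary's call is not sampled because it does not affect B or J.
    """
    b = rng.random() < P_B
    e = rng.random() < P_E
    a = rng.random() < P_A[(b, e)]
    j = rng.random() < P_J[a]
    return b, j


def estimate_burglary_given_john(n_samples, seed=0):
    """Estimate P(B = t | J = t) as the fraction of burglaries among samples with J = t."""
    rng = random.Random(seed)
    kept = burglaries = 0
    for _ in range(n_samples):
        b, j = sample_world(rng)
        if j:  # keep only the samples that agree with the evidence J = t
            kept += 1
            burglaries += b
    return burglaries / kept


print(estimate_burglary_given_john(200_000))
```

The estimate fluctuates around the exact value (about 0.016) and becomes more accurate as the number of samples grows. Most samples are rejected because $J = t$ is rare, which is the main weakness of this simple scheme.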

### 7.A. Exact Algorithm (Inference by Enumeration) <a name="exact"></a>

Let's consider the following inference scenario in the alarm network.

> **QUESTION**: What is the conditional probability of burglary, given that John calls you? That is, what is $P(B\ |\ J = t)$?

This conditional probability cannot be calculated directly from $B$ and $J$ alone. We need to consider other variables ($E$, $A$ and $M$) in the network as well.

However, not all the other variables are necessary. **Which other nodes in the network are necessary for the inference**?

#### Markov Blanket and Boundary

To find the necessary other nodes for inference, we first introduce the following concepts:

> **DEFINITION**: A **Markov blanket** of a random variable $X$ in a set of variables $\mathcal{H} = \{H_1, \dots, H_k\}$ is any **subset** $\mathcal{B} \subseteq \mathcal{H}$, so that $X$ is conditionally independent from all the other variables in $\mathcal{H} \setminus \mathcal{B}$ if $\mathcal{B}$ is given. That is, $P(X\ |\ \mathcal{H}) = P(X\ |\ \mathcal{B})$.

> **DEFINITION**: A **Markov boundary** $\mathcal{B}^*$ of a random variable $X$ in a set of variables $\mathcal{H} = \{H_1, \dots, H_k\}$ is a **minimal Markov blanket**. In other words, (1) $\mathcal{B}^*$ is a Markov blanket of $X$, and (2) no proper subset of $\mathcal{B}^*$ is a Markov blanket of $X$.

Thus, for any node $X$ in the network, its **Markov boundary** is the minimal subset of nodes that $X$ directly depends on: once the Markov boundary is given, all the other nodes in the network provide no further information about $X$.

For a Bayesian network, we have the following important [theorem](https://en.wikipedia.org/wiki/Markov_blanket).

> **THEOREM**: The **Markov boundary** of a node $X$ in a Bayesian network consists of (1) $X$'s direct causes; (2) $X$'s direct effects; and (3) other direct causes of $X$'s direct effects.

In the inference of $P(B\ |\ J = t)$, the Markov boundary of the query node $B$ in the network contains $A$ (direct effect) and $E$ (the other cause of its direct effect $A$).
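
The theorem gives a direct recipe for computing a Markov boundary from the graph. The sketch below assumes the graph is stored as a `parents` dictionary (an illustrative representation).

```python
def markov_boundary(node, parents):
    """Direct causes, direct effects, and the other direct causes of the direct effects."""
    children = [x for x, ps in parents.items() if node in ps]
    boundary = set(parents[node]) | set(children)
    for child in children:
        boundary |= set(parents[child])
    boundary.discard(node)  # the node itself is a cause of its own effects
    return boundary


alarm_parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
for node in alarm_parents:
    print(node, sorted(markov_boundary(node, alarm_parents)))
```

For $B$ this returns $\{A, E\}$, as stated above, while the boundary of $A$ contains all four other nodes.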

#### Finding Hidden Nodes

From the theorem, we know that to infer a node $X$, we need the nodes in its Markov boundary. Thus, we can find all the necessary hidden nodes to infer $P(X_1, \dots, X_n\ |\ Y_1 = y_1, \dots, Y_m = y_m)$ as follows.

```Python
def find_hidden_nodes(query_nodes, evidence_nodes):
    # Initially, there is no hidden node, and the query nodes are the random nodes to explore
    Set hidden_nodes = [], rand_nodes = query_nodes
    
    while rand_nodes is not empty:
        # At each step, select a random variable in the inference and explore its Markov boundary
        Select and remove a node from rand_nodes
        for var in markov_boundary(node):
            # Skip var if it is an evidence node, a query node, or an already-found hidden node;
            # otherwise, var is a new hidden node, and its Markov boundary needs to be explored
            if var in evidence_nodes or var in query_nodes or var in hidden_nodes:
                continue
            
            Add var into hidden_nodes 
            Add var into rand_nodes
            
    return hidden_nodes

In the inference of $P(B\ |\ J = t)$, 

- In iteration 1, we add $E$ and $A$ into `hidden_nodes`;
- In iteration 2, we add $M$ into `hidden_nodes`, as it is in the Markov boundary of $A$.

In the end, all the other nodes $E$, $A$, and $M$ in the network are hidden nodes for inferring $P(B\ |\ J = t)$.
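
The search can be replayed on the alarm network with a small self-contained sketch. Here the Markov boundaries are written out by hand in a dictionary named `BOUNDARY`, and `trace_hidden_nodes` is a hypothetical helper that prints each discovery; both names are assumptions made for this illustration.

```python
from collections import deque

# Markov boundaries of the alarm network, written out by hand from the theorem.
BOUNDARY = {
    "B": ["E", "A"],
    "E": ["B", "A"],
    "A": ["B", "E", "J", "M"],
    "J": ["A"],
    "M": ["A"],
}


def trace_hidden_nodes(query_nodes, evidence_nodes, boundary=BOUNDARY):
    """Breadth-first search over Markov boundaries, printing each discovery."""
    hidden_nodes = []
    to_explore = deque(query_nodes)
    while to_explore:
        node = to_explore.popleft()
        for var in boundary[node]:
            # Skip evidence nodes, query nodes, and nodes that are already hidden.
            if var in evidence_nodes or var in query_nodes or var in hidden_nodes:
                continue
            print(f"exploring {node}: add {var}")
            hidden_nodes.append(var)
            to_explore.append(var)
    return hidden_nodes


hidden = trace_hidden_nodes(["B"], ["J"])
print(hidden)
```

Exploring $B$ adds $E$ and $A$, exploring $E$ adds nothing new, and exploring $A$ adds $M$, so the sketch ends with the same three hidden nodes.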

#### Probability Calculation by Factorisation

After finding all the hidden nodes $\{E, A, M\}$, we can calculate $P(B\ |\ J = t)$ as follows.

$$
\begin{aligned}
& P(B\ |\ J = t) & \\
& = \frac{P(B, J = t)}{P(J = t)} & \hspace{50pt} \textrm{[product rule]} \\
& = \alpha * \sum_{E \in \{t, f\}}\sum_{A \in \{t, f\}}\sum_{M \in \{t, f\}}P(B, E, A, J = t, M) & \hspace{50pt} \textrm{[sum rule]} \\
& = \alpha * \sum_{E \in \{t, f\}}\sum_{A \in \{t, f\}}\sum_{M \in \{t, f\}}P(B) * P(E) * P(A\ |\ B, E) * P(J = t\ |\ A) * P(M\ |\ A) & \hspace{50pt} \textrm{[Factorisation]}
\end{aligned}
$$

where $\alpha = \frac{1}{P(J = t)}$ is the **normalisation factor**, which does not need to be calculated explicitly (we can simply normalise the conditional probabilities for all possible query variable values so that they add up to 1).

We can see that except $\alpha$, all the probabilities $P(B)$, $P(E)$, $P(A\ |\ B, E)$, $P(J = t\ |\ A)$, $P(M\ |\ A)$ can be directly read from the probability tables of the Bayesian network. Therefore, we have successfully found a way to calculate the probability. 
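
As a sanity check, the factorised sum can be evaluated by brute-force enumeration. The sketch below hard-codes the probability tables of the alarm network; the variable and function names (`P_A_TRUE`, `p_a`, `p_m`, `unnormalised`) are illustrative assumptions, and only the "true" column of each conditional table is stored, with the "false" column obtained as one minus it.

```python
from itertools import product

VALUES = ("t", "f")

# Probability tables of the alarm network.
P_B = {"t": 0.001, "f": 0.999}
P_E = {"t": 0.002, "f": 0.998}
# P(A = t | B, E), keyed by (B, E).
P_A_TRUE = {
    ("t", "t"): 0.95,
    ("t", "f"): 0.94,
    ("f", "t"): 0.29,
    ("f", "f"): 0.001,
}
P_J_TRUE = {"t": 0.9, "f": 0.05}  # P(J = t | A)
P_M_TRUE = {"t": 0.7, "f": 0.01}  # P(M = t | A)


def p_a(a, b, e):
    """P(A = a | B = b, E = e)."""
    return P_A_TRUE[(b, e)] if a == "t" else 1.0 - P_A_TRUE[(b, e)]


def p_m(m, a):
    """P(M = m | A = a)."""
    return P_M_TRUE[a] if m == "t" else 1.0 - P_M_TRUE[a]


def unnormalised(b):
    """Sum the factorised joint over the hidden nodes E, A, M with J = t."""
    return sum(
        P_B[b] * P_E[e] * p_a(a, b, e) * P_J_TRUE[a] * p_m(m, a)
        for e, a, m in product(VALUES, repeat=3)
    )


scores = {b: unnormalised(b) for b in VALUES}
alpha = 1.0 / sum(scores.values())  # equals 1 / P(J = t)
posterior = {b: alpha * s for b, s in scores.items()}
print(posterior)
```

Normalising the two unnormalised scores plays the role of $\alpha$, so $P(J = t)$ never has to be computed separately.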

<!-- In general, given the query nodes $\{X_1, \dots, X_n\}$, evidence nodes $\{Y_1 = y_1, \dots, Y_m = y_m\}$ and hidden nodes $\{H_1, \dots, H_k\}$, the inference can be done as follows.

$$
\begin{aligned}
& P(X_1, \dots, X_n\ |\ Y_1 = y_1, \dots, Y_m = y_m) \\ 
& = \frac{P(X_1, \dots, X_n, Y_1 = y_1, \dots, Y_m = y_m)}{P(Y_1 = y_1, \dots, Y_m = y_m)} & \hspace{50pt} \textrm{[product rule]} \\
& = \alpha * \sum_{[h_1, \dots, h_k] \in \\ \Omega(H_1, \dots, H_k)} P(X_1, \dots, X_n, Y_1 = y_1, \dots, Y_m = y_m, H_1 = h_1, \dots, H_k = h_k) & \hspace{50pt} \textrm{[sum rule]} \\
& = \alpha * \sum_{[h_1, \dots, h_k] \in \\ \Omega(H_1, \dots, H_k)} \prod_{i=1}^{n} P(X_i\ |\ parents(X_i)) * \prod_{i=1}^{m} P(y_i\ |\ parents(Y_i)) * \prod_{i=1}^{k} P(h_i\ |\ parents(H_i)) & \hspace{50pt} \textrm{[Factorisation]}
\end{aligned}
$$

where $\alpha = \frac{1}{P(Y_1 = y_1, \dots, Y_m = y_m)}$ is the **normalisation factor**, and is not needed to calculate (we can simply normalise the conditional probabilities for all possible query variable values so that they add up to 1). -->

More generally, apart from $\alpha$, every probability in such a factorised sum can be read directly from the probability tables of the Bayesian network.

#### Computational Complexity

If we directly calculate $P(B\ |\ J = t)$, how many operations are needed? Consider the last line of the derivation above (ignoring $\alpha$):

$$
\sum_{E \in \{t, f\}}\sum_{A \in \{t, f\}}\sum_{M \in \{t, f\}}P(B) * P(E) * P(A\ |\ B, E) * P(J = t\ |\ A) * P(M\ |\ A)
$$

For each $B \in \{t, f\}$, we have $2 \times 2 \times 2 = 8$ terms to be added, needing 7 additions. In total, there are $2 \times 7 = 14$ additions.

For each combination of $B \in \{t, f\}$, $E \in \{t, f\}$, $A \in \{t, f\}$ and $M \in \{t, f\}$, there are 5 probabilities to be multiplied, needing 4 multiplications. In total, there are $2^4 \times 4 = 64$ multiplications.


<!-- In general, to calculate

$$
\sum_{[h_1, \dots, h_k] \in \\ \Omega(H_1, \dots, H_k)} \prod_{i=1}^{n} P(X_i\ |\ parents(X_i)) * \prod_{i=1}^{m} P(y_i\ |\ parents(Y_i)) * \prod_{i=1}^{k} P(h_i\ |\ parents(H_i))
$$

- For each possible query values, there are $|\Omega(H_1)| * \dots * |\Omega(H_k)|$ terms to be added, which is the number of possible value combinations of the hidden variables. There are $|\Omega(H_1)| * \dots * |\Omega(H_k)| - 1$ number of additions.
- There are $|\Omega(X_1)| * \dots * |\Omega(X_n)|$ possible query values. Therefore, **there are $|\Omega(X_1)| * \dots * |\Omega(X_n)| * (|\Omega(H_1)| * \dots * |\Omega(H_k)| - 1)$ additions in total.**
- Each term has $n+m+k$ probabilities to be multiplied, needing $n+m+k-1$ multiplications.
- There are $|\Omega(X_1)| * \dots * |\Omega(X_n)| * |\Omega(H_1)| * \dots * |\Omega(H_k)|$ terms in total. Therefore, **there are $|\Omega(X_1)| * \dots * |\Omega(X_n)| * |\Omega(H_1)| * \dots * |\Omega(H_k)| *(n+m+k-1)$ multiplications in total.** -->

In large and complex Bayesian networks, this cost grows exponentially with the number of hidden nodes and can become intractable. For example, if all the variables are binary, i.e. $|\Omega(X)| = 2$, and we have 1 query node, 3 evidence nodes and 10 hidden nodes, then the total number of multiplications will be

$$
2^{(1+10)} \times (1+3+10-1) = 26624.
$$
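
The same counting argument can be wrapped in a small helper. This is only a sketch under the assumption that every variable has the same domain size; `naive_cost` is a hypothetical name introduced here.

```python
def naive_cost(n_query, n_evidence, n_hidden, domain_size=2):
    """Count the operations of the direct sum, assuming equal domain sizes."""
    query_values = domain_size ** n_query
    hidden_values = domain_size ** n_hidden
    # For each query value, hidden_values terms are added together.
    additions = query_values * (hidden_values - 1)
    # Each term multiplies one probability per node in the inference.
    factors_per_term = n_query + n_evidence + n_hidden
    multiplications = query_values * hidden_values * (factors_per_term - 1)
    return additions, multiplications


print(naive_cost(1, 1, 3))   # alarm example: (14, 64)
print(naive_cost(1, 3, 10))  # (2046, 26624)
```

Each extra binary hidden node doubles the number of terms, which is why the direct calculation scales poorly.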

#### Speed Up by Variable Elimination

The **Variable Elimination** algorithm is a very important exact inference algorithm that speeds up the above calculation process. The key idea is to **eliminate hidden variables as early as possible**. 

> **DEFINITION**: A **factor** over some random variables is a table that assigns a number to every possible combination of values of these random variables. Here, each number is a probability (or a product of probabilities) involving the random variables.

To calculate

$$
\sum_{E \in \{t, f\}}\sum_{A \in \{t, f\}}\sum_{M \in \{t, f\}}P(B) * P(E) * P(A\ |\ B, E) * P(J = t\ |\ A) * P(M\ |\ A)
$$

we can define five initial **factors**, one for each probability in this expression.

$f_1(B) = P(B)$:

| B | P(B) |
| - | --------------- |
| t |    0.001        |
| f |    0.999        |

$f_2(E) = P(E)$:

| E | P(E) |
| - | --------------- |
| t |    0.002        |
| f |    0.998        |  

$f_3(A, B, E) = P(A\ |\ B, E)$:

| A | B | E | P(A &#124; B, E) |
| - | - | - | --------------- |
| t | t | t |   0.95        |
| f | t | t |   0.05        |
| t | t | f |   0.94        |
| f | t | f |   0.06        |
| t | f | t |   0.29        |
| f | f | t |   0.71        |
| t | f | f |   0.001        |
| f | f | f |   0.999        |    

$f_4(A) = P(J=t\ |\ A)$:

| A | P(J=t &#124; A) |
| - | --------------- |
| t |    0.9        |
| f |    0.05        | 

$f_5(M, A) = P(M\ |\ A)$:

| M | A | P(M &#124; A) |
| - | - | --------------- |
| t | t |   0.7        |
| f | t |   0.3        |  
| t | f |   0.01        |
| f | f |   0.99        |  

> **DEFINITION**: The **join** operation between two factors $f_1$ and $f_2$, denoted as $f_1 \otimes f_2$, produces a factor over the *union* of the variables in $f_1$ and $f_2$, where the value of each row is the product of the values of the matching rows of $f_1$ and $f_2$ (those with the same variable values).

In the above example, $f_1(B) \otimes f_2(E) = P(B) * P(E)$ is shown as follows. It converts two 2-row tables into a 4-row table, leading to 4 multiplications.

| B | E | P(B) * P(E) |
| - | - | --------------- |
| t | t |   0.001 * 0.002 = 0.000002        |
| f | t |   0.999 * 0.002 = 0.001998      |  
| t | f |   0.001 * 0.998 = 0.000998        |
| f | f |   0.999 * 0.998 = 0.997002       | 

On the other hand, $f_4(A) \otimes f_5(M, A) = P(J = t\ |\ A) * P(M\ |\ A)$ is shown as follows.

| M | A | P(J = t &#124; A) * P(M &#124; A) |
| - | - | --------------- |
| t | t |   0.9 * 0.7 = 0.63        |
| f | t |   0.9 * 0.3 = 0.27        |  
| t | f |   0.05 * 0.01 = 0.0005       |
| f | f |   0.05 * 0.99 = 0.0495       | 

Due to the overlap between the variables of the two joined factors, the resultant table still has 4 rows, the same as the original $f_5$. In general, **the complexity of the join operator depends on the size of the joined factors and their overlapping variables**.
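
A join can be sketched in a few lines of plain Python. The representation below, a factor as a pair `(variables, table)` with binary values `"t"` and `"f"`, and the function name `join` are assumptions made for this illustration rather than the `pgmpy` data structures.

```python
from itertools import product

VALUES = ("t", "f")


def join(factor_a, factor_b):
    """Join two factors, each given as (variables, table)."""
    vars_a, table_a = factor_a
    vars_b, table_b = factor_b
    # Union of the variables, keeping first-seen order.
    variables = tuple(dict.fromkeys(vars_a + vars_b))
    table = {}
    for values in product(VALUES, repeat=len(variables)):
        row = dict(zip(variables, values))
        key_a = tuple(row[v] for v in vars_a)
        key_b = tuple(row[v] for v in vars_b)
        # One multiplication per row of the joined table.
        table[values] = table_a[key_a] * table_b[key_b]
    return variables, table


f4 = (("A",), {("t",): 0.9, ("f",): 0.05})
f5 = (("M", "A"), {("t", "t"): 0.7, ("f", "t"): 0.3, ("t", "f"): 0.01, ("f", "f"): 0.99})

variables, table = join(f5, f4)
print(variables)
for values, p in table.items():
    print(values, round(p, 4))
```

The number of rows, and hence multiplications, is $2$ to the power of the number of distinct variables, so joining factors that share variables is cheaper than joining disjoint ones of the same size.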

> **DEFINITION**: The **elimination/sum-out** operation of a factor $f(X, Y)$ on $X$ produces a factor over $Y$, where the value of each row $Y = y$ is the sum of all the rows in $f(X, Y)$ with that $Y = y$ value.

For example, if we **eliminate/sum-out** $M$ in $f_4(A) \otimes f_5(M, A)$, then we can obtain the following factor.

| A | P(J = t &#124; A) * P(M = t &#124; A) + P(J = t &#124; A) * P(M = f &#124; A) |
| - | --------------- |
| t |    0.63 + 0.27 = 0.9        |
| f |    0.0005 + 0.0495 = 0.05        | 

Elimination/Sum-out can reduce the size of the factor.
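
The sum-out operation can be sketched in the same spirit. As before, representing a factor as a pair `(variables, table)` and naming the function `sum_out` are assumptions for illustration only; the input values are copied from the joined table of $f_4(A) \otimes f_5(M, A)$.

```python
def sum_out(factor, variable):
    """Eliminate `variable` from a factor given as (variables, table)."""
    variables, table = factor
    index = variables.index(variable)
    remaining = variables[:index] + variables[index + 1:]
    result = {}
    for values, p in table.items():
        # Drop the eliminated variable's value and accumulate the row.
        key = values[:index] + values[index + 1:]
        result[key] = result.get(key, 0.0) + p
    return remaining, result


# Join of f4(A) and f5(M, A), with rows keyed by (M, A).
joined = (
    ("M", "A"),
    {("t", "t"): 0.63, ("f", "t"): 0.27, ("t", "f"): 0.0005, ("f", "f"): 0.0495},
)

variables, table = sum_out(joined, "M")
print(variables)
for values, p in table.items():
    print(values, round(p, 4))
```

The resulting factor only mentions $A$ and has 2 rows instead of 4.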

With these two operations, the sum

$$
\sum_{E \in \{t, f\}}\sum_{A \in \{t, f\}}\sum_{M \in \{t, f\}}P(B) * P(E) * P(A\ |\ B, E) * P(J = t\ |\ A) * P(M\ |\ A)
$$

can be rewritten as a sequence of factor operations:

$$
\mathtt{Elim}_{E}\mathtt{Elim}_{A}\mathtt{Elim}_{M}f_1(B) \otimes f_2(E) \otimes f_3(A, B, E) \otimes f_4(A) \otimes f_5(M, A)
$$

Note that the join and elimination operations can be reordered, as long as a variable is eliminated only after all the factors containing it have been joined. To save computational cost, we should eliminate variables as early as possible to reduce the size of the tables for later join operations. The **variable elimination** algorithm is designed to this end.

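```Python
from functools import reduce

def variable_elimination(query_nodes, evidence_nodes, observations, hidden_nodes):
    # Assumed helpers (not defined here): initial_factor(node, observations) builds
    # P(node | parents(node)) restricted to the observations; variables(f) returns the
    # variables of factor f; join(f, g) and eliminate(f, node) are the two operations
    # defined above; normalise(f) rescales the values of f so that they add up to 1.
    all_nodes = list(query_nodes) + list(evidence_nodes) + list(hidden_nodes)

    # Initialise the factors, one per node
    factors = [initial_factor(node, observations) for node in all_nodes]

    # Sort hidden_nodes in some way (here, simply keep the given order)
    sorted_hidden_nodes = list(hidden_nodes)

    # At each iteration, eliminate one hidden node
    for node in sorted_hidden_nodes:
        related = [f for f in factors if node in variables(f)]
        factors = [f for f in factors if node not in variables(f)]
        # Join all the factors containing node, then eliminate node from the joined factor
        factors.append(eliminate(reduce(join, related), node))

    # Join all the remaining factors, which only contain query nodes, and normalise
    return normalise(reduce(join, factors))
```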

We now trace the `variable_elimination` algorithm on the calculation of $P(B\ |\ J = t)$. Let the elimination order of the hidden variables be $M \rightarrow A \rightarrow E$.

$$
\begin{align}
& \mathtt{Elim}_{E}\mathtt{Elim}_{A}\mathtt{Elim}_{M}f_1(B) \otimes f_2(E) \otimes f_3(A, B, E) \otimes f_4(A) \otimes f_5(M, A) \\
& = \underbrace{f_1(B) \otimes \underbrace{\mathtt{Elim}_{E} f_2(E) \otimes \underbrace{\mathtt{Elim}_{A} f_3(A, B, E) \otimes f_4(A) \otimes \underbrace{\mathtt{Elim}_{M} f_5(M, A)}_{f_6(A)}}_{f_8(B, E)}}_{f_{10}(B)}}_{f_{11}(B)}
\end{align}
$$

- **Iteration 1**: Eliminate $M$ from $f_5(M, A)$, which is the only factor containing $M$, to get $f_6(A)$. It costs <span style="color: blue;">**2 additions**</span>.

| A | f6(A) |
| - | --------------- |
| t |    0.7 + 0.3 = 1.0        |
| f |    0.01 + 0.99 = 1.0        |

- **Iteration 2**: 
    1. Join all the factors containing $A$, $f_3(A, B, E)$, $f_4(A)$ and $f_6(A)$, to obtain $f_7(A, B, E)$. It costs <span style="color: red;">**16 multiplications**</span>.
    
    | A | B | E | f7(A, B, E) |
    | - | - | - | --------------- |
    | t | t | t |   0.95 * 0.9 * 1.0 = 0.855       |
    | f | t | t |   0.05 * 0.05 * 1.0 = 0.0025       |
    | t | t | f |   0.94 * 0.9 * 1.0 = 0.846      |
    | f | t | f |   0.06 * 0.05 * 1.0 = 0.003      |
    | t | f | t |   0.29 * 0.9 * 1.0 = 0.261      |
    | f | f | t |   0.71 * 0.05 * 1.0 = 0.0355      |
    | t | f | f |   0.001 * 0.9 * 1.0 = 0.0009      |
    | f | f | f |   0.999 * 0.05 * 1.0 = 0.04995      | 
    
    2. Eliminate $A$ from $f_7(A, B, E)$ to obtain $f_8(B, E)$. It costs <span style="color: blue;">**4 additions**</span>.
    
    | B | E | f8(B, E) |
    | - | - | --------------- |
    | t | t |   0.855 + 0.0025 = 0.8575          |
    | f | t |   0.261 + 0.0355 = 0.2965       |  
    | t | f |   0.846 + 0.003 = 0.849       |
    | f | f |   0.0009 + 0.04995 = 0.05085    | 
    
- **Iteration 3**:
    1. Join all the factors containing $E$, $f_2(E)$ and $f_8(B, E)$, to obtain $f_9(B, E)$. It costs <span style="color: red;">**4 multiplications**</span>.
    
    | B | E | f9(B, E) |
    | - | - | --------------- |
    | t | t |   0.002 * 0.8575 = 0.001715        |
    | f | t |   0.002 * 0.2965 = 0.000593      |  
    | t | f |   0.998 * 0.849 = 0.847302        |
    | f | f |   0.998 * 0.05085 = 0.0507483       | 
    
    2. Eliminate $E$ from $f_9(B, E)$ to obtain $f_{10}(B)$. It costs <span style="color: blue;">**2 additions**</span>.
    
    | B | f10(B) |
    | - | --------------- |
    | t |    0.001715 + 0.847302 = 0.849017       |
    | f |    0.000593 + 0.0507483 = 0.0513413        |

- **Iteration 4**: Join all the factors containing $B$, $f_1(B)$ and $f_{10}(B)$, to obtain $f_{11}(B)$. It costs <span style="color: red;">**2 multiplications**</span>.

| B | f11(B) |
| - | --------------- |
| t |    0.001 * 0.849017 = 0.000849017       |
| f |    0.999 * 0.0513413 = 0.0512899587        |

In total, variable elimination costs <span style="color: blue;">**8 additions**</span> and <span style="color: red;">**22 multiplications**</span>, far fewer than the 14 additions and 64 multiplications of the direct calculation.

Finally, we normalise the probabilities in $f_{11}(B)$ to obtain the final factor:

| B | norm f11(B) |
| - | --------------- |
| t |    0.000849017 / (0.000849017 + 0.0512899587) = 0.01628372994      |
| f |    0.0512899587 / (0.000849017 + 0.0512899587) = 0.98371627005        |

Let's verify the results using the `VariableElimination` class in the `pgmpy` library.

```Python
from pgmpy.inference import VariableElimination

alarm_infer = VariableElimination(alarm_model)

q = alarm_infer.query(variables=["Burglary"], evidence={"JohnCall": 1})
print(q)
```

```
+-------------+-----------------+
| Burglary    |   phi(Burglary) |
+=============+=================+
| Burglary(0) |          0.9837 |
+-------------+-----------------+
| Burglary(1) |          0.0163 |
+-------------+-----------------+
```


We can see that the `pgmpy` library gives the same results, which verifies the correctness of our calculation.
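
As a further cross-check that does not depend on `pgmpy`, the four iterations can be replayed in plain Python. This is a hand-unrolled sketch for this specific query and elimination order, not a general implementation: the factors are stored as dictionaries, the names `f1` to `f11` mirror the worked example, and the operation counters are tallied per table row in the same way as in the text.

```python
VALUES = ("t", "f")

f1 = {"t": 0.001, "f": 0.999}  # f1(B) = P(B)
f2 = {"t": 0.002, "f": 0.998}  # f2(E) = P(E)
f3 = {  # f3(A, B, E) = P(A | B, E), keyed by (A, B, E)
    ("t", "t", "t"): 0.95, ("f", "t", "t"): 0.05,
    ("t", "t", "f"): 0.94, ("f", "t", "f"): 0.06,
    ("t", "f", "t"): 0.29, ("f", "f", "t"): 0.71,
    ("t", "f", "f"): 0.001, ("f", "f", "f"): 0.999,
}
f4 = {"t": 0.9, "f": 0.05}  # f4(A) = P(J = t | A)
f5 = {  # f5(M, A) = P(M | A), keyed by (M, A)
    ("t", "t"): 0.7, ("f", "t"): 0.3,
    ("t", "f"): 0.01, ("f", "f"): 0.99,
}

additions = multiplications = 0

# Iteration 1: eliminate M from f5(M, A).
f6 = {a: f5[("t", a)] + f5[("f", a)] for a in VALUES}
additions += len(f6)

# Iteration 2: join f3, f4 and f6 on A, then eliminate A.
f7 = {(a, b, e): f3[(a, b, e)] * f4[a] * f6[a]
      for a in VALUES for b in VALUES for e in VALUES}
multiplications += 2 * len(f7)
f8 = {(b, e): f7[("t", b, e)] + f7[("f", b, e)]
      for b in VALUES for e in VALUES}
additions += len(f8)

# Iteration 3: join f2 and f8 on E, then eliminate E.
f9 = {(b, e): f2[e] * f8[(b, e)] for b in VALUES for e in VALUES}
multiplications += len(f9)
f10 = {b: f9[(b, "t")] + f9[(b, "f")] for b in VALUES}
additions += len(f10)

# Iteration 4: join f1 and f10, then normalise.
f11 = {b: f1[b] * f10[b] for b in VALUES}
multiplications += len(f11)
total = sum(f11.values())
posterior = {b: p / total for b, p in f11.items()}

print(additions, multiplications)                      # 8 22
print({b: round(p, 4) for b, p in posterior.items()})  # {'t': 0.0163, 'f': 0.9837}
```

The counters reproduce the 8 additions and 22 multiplications, and the normalised factor agrees with the values above to four decimal places.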

---------

*[Back to Table of Contents](#toc)*

----------

The original Jupyter Notebook can be downloaded [here](https://homepages.ecs.vuw.ac.nz/~yimei/tutorials/bayesian-network.ipynb).

More tutorials can be found [here](https://meiyi1986.github.io/tutorials/).

[Yi Mei's homepage](https://meiyi1986.github.io/)