# Building a Bayesian Network

---

In this tutorial, we introduce how to build a **Bayesian (belief) network** based on domain knowledge of the problem.

Building a Bayesian network in different ways can produce graphs of different shapes and sizes, which can greatly affect the memory requirement and inference efficiency. To measure the size of a Bayesian network, we first introduce the **number of free parameters**.

## Number of Free Parameters <a name="freepara"></a>

---

The size of a Bayesian network includes the size of the graph and the size of the probability table of each node. The probability tables typically take up far more space than the graph, so we focus on the size of the probability tables.

For convenience, we only consider **discrete** variables in the network; continuous variables are assumed to be discretised. For each variable $X$ in the network, we use the following notation.

- $\Omega(X)$: the domain (set of possible values) of $X$
- $|\Omega(X)|$: the number of possible values of $X$
- $parents(X)$: the parents (direct causes) of $X$ in the network

For each variable $X$, the probability table contains the probabilities $P(X\ |\ parents(X))$ for all possible values of $X$ and of $parents(X)$. Let's consider the following situations:

1. $X$ does not have any parent. In this case, the table stores $P(X)$. There are $|\Omega(X)|$ probabilities, each for a possible value of $X$. However, due to the [normalisation rule](https://github.com/meiyi1986/tutorials/blob/master/notebooks/reasoning-under-uncertainty-basics.ipynb), all the probabilities add up to 1. Thus, we need to store only $|\Omega(X)|-1$ probabilities, and the last probability can be calculated by ($1-$the sum of the stored probabilities). Therefore, the probability table contains $|\Omega(X)|-1$ rows/probabilities.
2. $X$ has one parent $Y$. In this case, for each condition $y \in \Omega(Y)$, we need to store the conditional probabilities $P(X\ |\ Y = y)$. Again, we need to store $|\Omega(X)|-1$ conditional probabilities for $P(X\ |\ Y = y)$, and can calculate the last conditional probability by the normalisation rule. Therefore, the probability table contains $(|\Omega(X)|-1)*|\Omega(Y)|$ rows/probabilities.
3. $X$ has multiple parents $Y_1, \dots, Y_m$. In this case, there are $|\Omega(Y_1)|*\dots * |\Omega(Y_m)|$ possible conditions $[Y_1 = y_1, \dots, Y_m = y_m]$. For each condition, we need to store $|\Omega(X)|-1$ conditional probabilities for $P(X\ |\ Y_1 = y_1, \dots, Y_m = y_m)$. Therefore, the probability table contains $(|\Omega(X)|-1)*|\Omega(Y_1)|*\dots * |\Omega(Y_m)|$ rows/probabilities.

Take the alarm network as an example, with the five variables $B$ (Burglary), $E$ (Earthquake), $A$ (Alarm), $J$ (JohnCall) and $M$ (MaryCall). All the variables are binary, i.e., $|\Omega(X)| = 2$. Therefore, $B$ and $E$ each have only 1 row in their probability tables, since they have no parent. $A$ has $1 \times 2 \times 2 = 4$ rows in its probability table, since it has two binary parents $B$ and $E$, leading to four possible conditions.
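
As a quick check of the third situation with non-binary variables, consider a hypothetical variable $X$ with three values whose parents $Y_1$ and $Y_2$ have two and four values respectively (these sizes are chosen only for illustration):

$$
\begin{aligned}
(|\Omega(X)|-1)*|\Omega(Y_1)|*|\Omega(Y_2)| & = (3-1)*2*4 \\
& = 16.
\end{aligned}
$$

The table therefore stores 16 probabilities, although it describes $3 \times 2 \times 4 = 24$ conditional probabilities in total; the remaining 8 follow from the normalisation rule.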

> **DEFINITION**: The **number of free parameters** of a Bayesian network is the number of probabilities we need to estimate (can NOT be derived/calculated) in the probability tables.

Consider a Bayesian network with the factorisation

$$
\begin{aligned}
& P(X_1, \dots, X_n) \\
& = P(X_1\ |\ parents(X_1)) * \dots * P(X_n\ |\ parents(X_n)),
\end{aligned}
$$

the number of free parameters is

$$
\begin{aligned}
& (|\Omega(X_1)|-1)*\prod_{Y \in parents(X_1)}|\Omega(Y)| \\
& + (|\Omega(X_2)|-1)*\prod_{Y \in parents(X_2)}|\Omega(Y)| \\
& + \dots \\
& + (|\Omega(X_n)|-1)*\prod_{Y \in parents(X_n)}|\Omega(Y)|,
\end{aligned}
$$

where the product over an empty set of parents is taken to be 1.

Let's calculate the number of free parameters of the following simple networks, assuming that all the variables are binary.

<img src="img/cause-effect.png" width=550></img>

- **Direct cause**: $P(A)$ has 1 free parameter, $P(B\ |\ A)$ has 2 free parameters. The network has $1+2 = 3$ free parameters.
- **Indirect cause**: $P(A)$ has 1 free parameter, $P(B\ |\ A)$ and $P(C\ |\ B)$ each have 2 free parameters. The network has $1+2+2 = 5$ free parameters.
- **Common cause**: $P(A)$ has 1 free parameter, $P(B\ |\ A)$ and $P(C\ |\ A)$ each have 2 free parameters. The network has $1+2+2 = 5$ free parameters.
- **Common effect**: $P(A)$ and $P(B)$ each have 1 free parameter, $P(C\ |\ A, B)$ has $2\times 2 = 4$ free parameters. The network has $1+1+4 = 6$ free parameters.
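
The four counts above can be reproduced with a few lines of Python. This is only a minimal sketch of the formula for the number of free parameters; the function name `count_free_parameters` and the dictionary layout are assumptions made for this illustration, not part of any library.

```python
from math import prod


def count_free_parameters(cardinality, parents):
    """Sum (|Omega(X)| - 1) * prod(|Omega(Y)| for Y in parents(X)) over all X."""
    return sum(
        (cardinality[x] - 1) * prod(cardinality[y] for y in parents[x])
        for x in cardinality
    )


# Each network maps a variable to the list of its parents (all variables binary).
networks = {
    "direct cause": {"A": [], "B": ["A"]},
    "indirect cause": {"A": [], "B": ["A"], "C": ["B"]},
    "common cause": {"A": [], "B": ["A"], "C": ["A"]},
    "common effect": {"A": [], "B": [], "C": ["A", "B"]},
}

results = {}
for name, parents in networks.items():
    cardinality = {x: 2 for x in parents}  # every variable has two values
    results[name] = count_free_parameters(cardinality, parents)
    print(name, results[name])
```

Note that `prod` of an empty sequence is 1, which matches the case of a variable with no parents.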

> **NOTE**: We can see that the common effect dependency requires the most free parameters. Therefore, when building a Bayesian network, we should try to reduce the number of such dependencies to reduce the number of free parameters of the network.
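
To see how quickly a common effect grows, take a binary variable $X$ with $m$ binary parents $Y_1, \dots, Y_m$. The number of free parameters of its table is

$$
(|\Omega(X)|-1)*\prod_{i=1}^{m}|\Omega(Y_i)| = (2-1)*2^m = 2^m,
$$

i.e., 4 free parameters for two parents, 8 for three and 16 for four. The table doubles in size with every additional parent.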

## Building a Bayesian Network from Domain Knowledge <a name="building"></a>

---

Building a Bayesian network mainly consists of the following three steps:

1. Identify a set of **random variables** that describe the problem, using domain knowledge.
2. Build the **directed acyclic graph**, i.e., the **directed links** between the random variables based on domain knowledge about the causal relationships between the variables.
3. Build the **conditional probability table** for each variable, by estimating the necessary probabilities using domain knowledge or historical data.

Here, we introduce Pearl's network construction algorithm, which builds the network based on a **node ordering**.

```Python
# Step 1: identify variables
Identify the random variables that describe the world of reasoning
# Step 2: build the graph, add the links
Sort the random variables by some order
Set bn = []
for var in sorted_vars:
    Find the minimum subset of variables in bn so that P(var | bn) = P(var | subset)
    
    Add var into bn
    for bn_var in subset:
        Add a direct link [bn_var, var]
    # Step 3: estimate the conditional probability table
    Estimate the conditional probabilities P(var | subset)
```
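
The pseudocode above can be turned into a small runnable sketch. In practice the independence test comes from domain knowledge; here, purely as an assumption for illustration, it is replaced by a d-separation check on the known alarm network (variables $B$, $E$, $A$, $J$, $M$). All function names (`ancestors_of`, `is_independent`, `build_network`, `free_parameters`) are hypothetical helpers, not library calls.

```python
from itertools import combinations

# Assumed ground truth (the alarm network): each node mapped to its direct causes.
TRUE_PARENTS = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}


def ancestors_of(nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(TRUE_PARENTS[node])
    return seen


def is_independent(var, others, given):
    """d-separation test for var ⟂ others | given (moralised ancestral graph)."""
    keep = ancestors_of({var} | others | given)
    # Moralise: connect each node to its parents and the parents to each other.
    neighbours = {node: set() for node in keep}
    for child in keep:
        family = [child] + TRUE_PARENTS[child]
        for u, v in combinations(family, 2):
            neighbours[u].add(v)
            neighbours[v].add(u)
    # var is independent of others if every path between them hits a given node.
    reached, stack = set(), [var]
    while stack:
        node = stack.pop()
        if node not in reached and node not in given:
            reached.add(node)
            stack.extend(neighbours[node])
    return not (reached & others)


def build_network(order):
    """Pearl's construction: pick the smallest parent set for each new node."""
    parents, bn = {}, []
    for var in order:
        for size in range(len(bn) + 1):
            subsets = (set(s) for s in combinations(bn, size))
            subset = next(
                (s for s in subsets if is_independent(var, set(bn) - s, s)), None
            )
            if subset is not None:
                break
        parents[var] = subset
        bn.append(var)
    return parents


def free_parameters(parents):
    """All variables are binary, so each table has 2 ** (number of parents) rows."""
    return sum(2 ** len(p) for p in parents.values())


for order in ("BEAJM", "JMABE", "JMBEA"):
    network = build_network(order)
    print(order, network, free_parameters(network))
```

Each string lists one node ordering, one letter per variable. The sketch tries parent sets from smallest to largest, so it is only practical for small networks such as this one.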

In this algorithm, the **node ordering** is critical in determining the number of links between the nodes, and thus the size of the conditional probability tables.

We show how the links are added to the network under different node orders, using the alarm network as an example.

----------

#### Order 1: $B \rightarrow E \rightarrow A \rightarrow J \rightarrow M$

- **Step 1**: The node $B$ is added into the network. No edge is added, since there is only one node in the network.
- **Step 2**: The node $E$ is added into the network. No edge from $B$ to $E$ is added, since $B$ and $E$ are <span style="color: blue;">independent</span>.
- **Step 3**: The node $A$ is added into the network. Two edges $[B, A]$ and $[E, A]$ are added. This is because $B$ and $E$ are both direct causes of $A$, and thus $A$ is <span style="color: red;">dependent</span> on $B$ and $E$. 
- **Step 4**: The node $J$ is added into the network. The minimum subset $\{A\} \subseteq \{B, E, A\}$ in the network is found to be the parent set of $J$, since $J$ is <span style="color: blue;">conditionally independent</span> of $B$ and $E$ given $A$, i.e., $P(J\ |\ B, E, A) = P(J\ |\ A)$. An edge $[A, J]$ is added into the network.
- **Step 5**: The node $M$ is added into the network. The minimum subset $\{A\} \subseteq \{B, E, A, J\}$ in the network is found to be the parent set of $M$, since $M$ is <span style="color: blue;">conditionally independent</span> of $B$, $E$ and $J$ given $A$, i.e., $P(M\ |\ B, E, A, J) = P(M\ |\ A)$. An edge $[A, M]$ is added into the network.

The built network is shown below. The number of free parameters in this network is $1 + 1 + 4 + 2 + 2 = 10$.

<img src="img/alarm-dag.png" width=150></img>

----------

#### Order 2: $J \rightarrow M \rightarrow A \rightarrow B \rightarrow E$

- **Step 1**: The node $J$ is added into the network. No edge is added, since there is only one node in the network.
- **Step 2**: The node $M$ is added into the network. $M$ and $J$ are <span style="color: red;">dependent</span> (_note that the common cause $A$ has not been given yet at this step_), i.e., $P(M\ |\ J) \neq P(M)$. Therefore, an edge $[J, M]$ is added into the network.
- **Step 3**: The node $A$ is added into the network. Two edges $[J, A]$ and $[M, A]$ are added, since $J$ and $M$ are both <span style="color: red;">dependent</span> on $A$.
- **Step 4**: The node $B$ is added into the network. The minimum subset $\{A\} \subseteq \{J, M, A\}$ in the network is found to be the parent set of $B$, since $B$ is <span style="color: blue;">conditionally independent</span> of $J$ and $M$ given $A$, i.e., $P(B\ |\ J, M, A) = P(B\ |\ A)$. An edge $[A, B]$ is added into the network.
- **Step 5**: The node $E$ is added into the network. The minimum subset $\{A, B\} \subseteq \{J, M, A, B\}$ in the network is found to be the parent set of $E$, since $E$ is <span style="color: blue;">conditionally independent</span> of $J$ and $M$ given $A$ and $B$, i.e., $P(E\ |\ J, M, A, B) = P(E\ |\ A, B)$ (_note that $B$ and $E$ have the common effect $A$, thus when $A$ is given, $B$ and $E$ are <span style="color: red;">conditionally dependent</span>_). Two edges $[A, E]$ and $[B, E]$ are added into the network.

The built network is shown below. The number of free parameters in this network is $1 + 2 + 4 + 2 + 4 = 13$.

<img src="img/alarm-dag2.png" width=150></img>

----------

#### Order 3: $J \rightarrow M \rightarrow B \rightarrow E \rightarrow A$

- **Step 1**: The node $J$ is added into the network. No edge is added, since there is only one node in the network.
- **Step 2**: The node $M$ is added into the network. $M$ and $J$ are <span style="color: red;">dependent</span> (note that the common cause $A$ is not given at this step), i.e., $P(M\ |\ J) \neq P(M)$. Therefore, an edge $[J, M]$ is added into the network.
- **Step 3**: The node $B$ is added into the network. Two edges $[J, B]$ and $[M, B]$ are added, since $J$ and $M$ are both <span style="color: red;">dependent</span> on $B$ (through $A$, which has not been added yet).
- **Step 4**: The node $E$ is added into the network. No conditional independence can be found among $\{J, M, B, E\}$ while $A$ is not given. Therefore, three edges $[J, E]$, $[M, E]$, $[B, E]$ are added into the network.
- **Step 5**: The node $A$ is added into the network. First, two edges $[J, A]$ and $[M, A]$ are added, since $J$ and $M$ are both <span style="color: red;">dependent</span> on $A$. Then, another two edges $[B, A]$ and $[E, A]$ are also added, since $B$ and $E$ are both direct causes of $A$.

The built network is shown below. The number of free parameters in this network is $1 + 2 + 4 + 8 + 16 = 31$.

<img src="img/alarm-dag3.png" width=200></img>

---------

We can see that different node orders can lead to greatly different graphs and numbers of free parameters. Therefore, we should find the **optimal node order** that leads to the most **compact** network (with the fewest free parameters).

> **QUESTION**: How to find the optimal node order that leads to the most compact Bayesian network?

The node order is mainly determined by our **domain knowledge** about **cause and effect**. First, we add the nodes with no cause (i.e., the root causes) into the ordered list. Then, at each step, we find the remaining nodes whose direct causes are all in the current ordered list (i.e., all their direct causes are given) and append them to the end of the ordered list. This way, we only need to add direct links from their direct causes to them.

The pseudocode of the node ordering is shown as follows.

```Python
def node_ordering(all_nodes):
    Set ordered_nodes = [], remaining_nodes = all_nodes
    while remaining_nodes is not empty:
        Select the nodes whose direct causes are all in ordered_nodes
        Append the selected nodes into ordered_nodes
        Remove the selected nodes from remaining_nodes
    return ordered_nodes
```

For the alarm network, we first add the two nodes $\{B, E\}$ into the ordered list, since they are the root causes and have no direct cause. Then, we add $A$ into the ordered list, since its two direct causes $B$ and $E$ are both already in the ordered list. Finally, we add $J$ and $M$ into the list, since their direct cause $A$ is already in the ordered list.
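
The node ordering procedure can be sketched in runnable Python as follows. The extra argument `direct_causes` (a dictionary mapping each node to its direct causes) is an assumption added for this illustration, since the pseudocode leaves open how the causes are supplied. The cycle check is also an addition.

```python
def node_ordering(all_nodes, direct_causes):
    """Order nodes so that every node appears after all of its direct causes."""
    ordered_nodes, remaining_nodes = [], list(all_nodes)
    while remaining_nodes:
        # Select the nodes whose direct causes are all in ordered_nodes.
        selected = [
            node
            for node in remaining_nodes
            if set(direct_causes[node]) <= set(ordered_nodes)
        ]
        if not selected:
            raise ValueError("the cause-effect relationships contain a cycle")
        # Append the selected nodes and remove them from the remaining ones.
        ordered_nodes.extend(selected)
        remaining_nodes = [n for n in remaining_nodes if n not in selected]
    return ordered_nodes


# Direct causes in the alarm network, deliberately listed in a poor order.
alarm_causes = {"J": ["A"], "M": ["A"], "A": ["B", "E"], "B": [], "E": []}
order = node_ordering(["J", "M", "A", "B", "E"], alarm_causes)
print(order)  # ['B', 'E', 'A', 'J', 'M']
```

The result matches Order 1 above, which gave the most compact of the three networks.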

## Building Alarm Network through `pgmpy` <a name="pgmpy"></a>

---

Here, we show how to build the alarm network through the Python [pgmpy](https://pgmpy.org) library. The alarm network is displayed again below.

<img src="img/alarm-bn.png" width=500></img>

First, we install the library using `pip`.

```bash
pip install pgmpy
```

Note: you may need to restart the kernel to use updated packages.


Then, we import the necessary modules for the Bayesian network as follows.

```Python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
```

Now, we build the alarm Bayesian network as follows.

1. We define the network structure by specifying the four links.
2. We define (estimate) the discrete conditional probability tables, represented as the `TabularCPD` class.

```Python
# Define the network structure
alarm_model = BayesianNetwork(
    [
        ("Burglary", "Alarm"),
        ("Earthquake", "Alarm"),
        ("Alarm", "JohnCall"),
        ("Alarm", "MaryCall"),
    ]
)

# Define the probability tables by TabularCPD
cpd_burglary = TabularCPD(
    variable="Burglary", variable_card=2, values=[[0.999], [0.001]]
)

cpd_earthquake = TabularCPD(
    variable="Earthquake", variable_card=2, values=[[0.998], [0.002]]
)

cpd_alarm = TabularCPD(
    variable="Alarm",
    variable_card=2,
    values=[[0.999, 0.71, 0.06, 0.05], [0.001, 0.29, 0.94, 0.95]],
    evidence=["Burglary", "Earthquake"],
    evidence_card=[2, 2],
)

cpd_johncall = TabularCPD(
    variable="JohnCall",
    variable_card=2,
    values=[[0.95, 0.1], [0.05, 0.9]],
    evidence=["Alarm"],
    evidence_card=[2],
)

cpd_marycall = TabularCPD(
    variable="MaryCall",
    variable_card=2,
    values=[[0.99, 0.3], [0.01, 0.7]],
    evidence=["Alarm"],
    evidence_card=[2],
)

# Associate the probability tables with the model structure
alarm_model.add_cpds(
    cpd_burglary, cpd_earthquake, cpd_alarm, cpd_johncall, cpd_marycall
)
```

We can view the nodes of the alarm network.

```Python
# Viewing nodes of the model
alarm_model.nodes()
```

```
NodeView(('Burglary', 'Alarm', 'Earthquake', 'JohnCall', 'MaryCall'))
```

We can also view the edges of the alarm network.

```Python
# Viewing edges of the model
alarm_model.edges()
```

```
OutEdgeView([('Burglary', 'Alarm'), ('Alarm', 'JohnCall'), ('Alarm', 'MaryCall'), ('Earthquake', 'Alarm')])
```

We can show the probability tables using the `print()` function.

> **NOTE**: the `pgmpy` library stores ALL the probabilities (including the last probability). This requires a bit more memory, but saves the time needed to calculate the last probability by the normalisation rule.

Let's print the probability tables for **Alarm** and **MaryCall**. For each variable, the value (0) stands for `False`, while the value (1) stands for `True`.

```Python
# Print the probability table of the Alarm node
print(cpd_alarm)

# Print the probability table of the MaryCall node
print(cpd_marycall)
```

```
+------------+---------------+---------------+---------------+---------------+
| Burglary   | Burglary(0)   | Burglary(0)   | Burglary(1)   | Burglary(1)   |
+------------+---------------+---------------+---------------+---------------+
| Earthquake | Earthquake(0) | Earthquake(1) | Earthquake(0) | Earthquake(1) |
+------------+---------------+---------------+---------------+---------------+
| Alarm(0)   | 0.999         | 0.71          | 0.06          | 0.05          |
+------------+---------------+---------------+---------------+---------------+
| Alarm(1)   | 0.001         | 0.29          | 0.94          | 0.95          |
+------------+---------------+---------------+---------------+---------------+
+-------------+----------+----------+
| Alarm       | Alarm(0) | Alarm(1) |
+-------------+----------+----------+
| MaryCall(0) | 0.99     | 0.3      |
+-------------+----------+----------+
| MaryCall(1) | 0.01     | 0.7      |
+-------------+----------+----------+
```
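
The note above about stored probabilities can be checked without `pgmpy`. The sketch below copies the `values` of the five tables into plain nested lists (one inner list per value of the variable, one column per condition). It compares the number of stored probabilities with the number of free parameters, and confirms that every column obeys the normalisation rule. The dictionary `cpd_values` is just a container chosen for this illustration.

```python
from math import isclose

# Table values copied from the TabularCPD definitions of the alarm network.
cpd_values = {
    "Burglary": [[0.999], [0.001]],
    "Earthquake": [[0.998], [0.002]],
    "Alarm": [[0.999, 0.71, 0.06, 0.05], [0.001, 0.29, 0.94, 0.95]],
    "JohnCall": [[0.95, 0.1], [0.05, 0.9]],
    "MaryCall": [[0.99, 0.3], [0.01, 0.7]],
}

# Every column (one condition) must add up to 1 by the normalisation rule.
columns_ok = all(
    isclose(sum(column), 1.0)
    for rows in cpd_values.values()
    for column in zip(*rows)
)

# Stored entries: all rows. Free parameters: all rows except the last one.
stored = sum(len(row) for rows in cpd_values.values() for row in rows)
free = sum(len(row) for rows in cpd_values.values() for row in rows[:-1])

print(columns_ok, stored, free)  # True 20 10
```

The 10 free parameters agree with the count obtained for Order 1, while the full tables hold twice as many numbers because all variables are binary.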


We can find all the **(conditional) independencies** between the nodes in the network.

```Python
alarm_model.get_independencies()
```

```
(Burglary ⟂ Earthquake)
(Burglary ⟂ JohnCall, MaryCall | Alarm)
(Burglary ⟂ MaryCall | JohnCall, Alarm)
(Burglary ⟂ JohnCall | MaryCall, Alarm)
(Burglary ⟂ JohnCall, MaryCall | Earthquake, Alarm)
(Burglary ⟂ MaryCall | JohnCall, Earthquake, Alarm)
(Burglary ⟂ JohnCall | Earthquake, MaryCall, Alarm)
(JohnCall ⟂ Burglary, MaryCall, Earthquake | Alarm)
(JohnCall ⟂ MaryCall, Earthquake | Burglary, Alarm)
(JohnCall ⟂ Burglary, Earthquake | MaryCall, Alarm)
(JohnCall ⟂ Burglary, MaryCall | Earthquake, Alarm)
(JohnCall ⟂ Earthquake | Burglary, MaryCall, Alarm)
(JohnCall ⟂ MaryCall | Burglary, Earthquake, Alarm)
(JohnCall ⟂ Burglary | Earthquake, MaryCall, Alarm)
(MaryCall ⟂ Burglary, JohnCall, Earthquake | Alarm)
(MaryCall ⟂ JohnCall, Earthquake | Burglary, Alarm)
(MaryCall ⟂ Burglary, Earthquake | JohnCall, Alarm)
(MaryCall ⟂ Burglary, JohnCall | Earthquake, Alarm)
(MaryCall ⟂ Earthquake | Burglary, JohnCall, Alarm)
(MaryCall ⟂ JohnCall | Burglary, Earthquake, Alarm)
(MaryCall ⟂ Burglary | J
```

We can also find the **local (conditional) independencies of a specific node** in the network as follows.

```Python
# Checking independencies of a node
alarm_model.local_independencies("JohnCall")
```

```
(JohnCall ⟂ Burglary, MaryCall, Earthquake | Alarm)
```
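
The local independency of a node can also be derived by hand from the graph. The sketch below assumes that it corresponds to the local Markov property: a node is independent of its non-descendants given its parents. It uses plain Python on the four links of the alarm network; `local_independency` is a hypothetical helper, not a `pgmpy` function.

```python
# The four links of the alarm network as (cause, effect) pairs.
edges = [
    ("Burglary", "Alarm"),
    ("Earthquake", "Alarm"),
    ("Alarm", "JohnCall"),
    ("Alarm", "MaryCall"),
]


def local_independency(node, edges):
    """Return (non-descendants excluding parents, parents) of the node."""
    nodes = {n for edge in edges for n in edge}
    parents = {cause for cause, effect in edges if effect == node}
    # Collect all descendants by following the links downwards.
    descendants, stack = set(), [node]
    while stack:
        current = stack.pop()
        for cause, effect in edges:
            if cause == current and effect not in descendants:
                descendants.add(effect)
                stack.append(effect)
    independent_of = nodes - descendants - parents - {node}
    return independent_of, parents


independent_of, given = local_independency("JohnCall", edges)
print(sorted(independent_of), "|", sorted(given))
# ['Burglary', 'Earthquake', 'MaryCall'] | ['Alarm']
```

For `"Alarm"` the first set is empty, since every other node is either a parent or a descendant of the alarm.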

---

- More tutorials can be found [here](https://github.com/meiyi1986/tutorials).
- [Yi Mei's homepage](https://meiyi1986.github.io/)