# Intro to bnlearn

## Jacinto Arias @jacintoarias


*Updated 17/10/18*



In [None]:
library(tidyverse)
library(networkD3)

# The bnlearn package

- The `bnlearn` package is the most complete and popular **open source** package for Bayesian networks available to date in R (and almost anywhere else)

- We will start our tutorial by reviewing some of its capabilities

- [Docs](http://www.bnlearn.com/)

In [None]:
library(bnlearn)

# Data Structures

Working with `bnlearn` revolves around two main **data structures** that represent a Bayesian network at different stages (NOTE that these are *S3* classes and their names might overlap with other functions):

* `bn` [[Docs]](http://www.bnlearn.com/documentation/man/bn.class.html). Represents the structural information, variables, graph and learning algorithm if provided.

* `bn.fit` [[Docs]](http://www.bnlearn.com/documentation/man/bn.fit.class.html). Adds the parametric information on top of the previous structure. Contains the distribution of each node according to its type and parent configuration.
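
As a minimal sketch of the two stages, the snippet below builds a structure and then fits parameters on top of it. The two-node network and the `toy` data frame are made up for illustration only.

```r
library(bnlearn)

# Stage 1: a structure-only object (nodes and arcs, no parameters).
dag <- model2network("[A][B|A]")

# A tiny, invented discrete dataset (illustrative values only).
toy <- data.frame(
  A = factor(c("a", "b", "a", "b")),
  B = factor(c("x", "y", "y", "x"))
)

# Stage 2: fitting parameters on top of the structure gives a bn.fit object.
fit <- bn.fit(dag, toy)

inherits(dag, "bn")
inherits(fit, "bn.fit")

# bn.net() goes back from the fitted model to its structure.
bn.net(fit)
```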

# Creating the structure of Bayesian networks

There are different ways to manually initialize and modify the graph of a Bayesian Network.

We can create empty graphs from a set of variables:



In [None]:
vars <- LETTERS[1:6]
dag  <- empty.graph(vars)
dag

## Arcs as tuples

You can specify arcs as a two-column (`from`, `to`) matrix and assign them to an existing network via `arcs`

In [None]:
e <- matrix(
      c("A", "C", "B", "F", "C", "F"),
      ncol = 2, byrow = TRUE,
      dimnames = list(NULL, c("from", "to"))
    )

arcs(dag) <- e
dag
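
Arcs can also be edited one at a time. The following is a small sketch using `set.arc` and `drop.arc` on a separate three-node graph (the graph `g` is hypothetical and not part of the tutorial's running example).

```r
library(bnlearn)

# Start from an empty graph over three illustrative nodes.
g <- empty.graph(c("A", "B", "C"))

# Add two arcs one by one, then remove the first again.
g <- set.arc(g, from = "A", to = "B")
g <- set.arc(g, from = "B", to = "C")
g <- drop.arc(g, from = "A", to = "B")

# Only the arc B -> C should remain.
arcs(g)
```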

## Arcs as an adjacency matrix

We can also use an adjacency matrix and assign it to a DAG with `amat`

In [None]:
adj <- matrix(
        0L, 
        ncol = 6, 
        nrow = 6,
        dimnames = list(vars, vars)
       )

adj["A", "C"] = 1L
adj["B", "F"] = 1L
adj["C", "F"] = 1L
adj["D", "E"] = 1L
adj["A", "E"] = 1L
print(adj)



In [None]:
amat(dag) <- adj
dag

## Model String

The last option is to write a *model string* for a given set of variables. Each node is specified by a pair of brackets `[<var_name>]`. If the node has a parent set we denote it with `|` followed by the list of parents separated by colons `:`. We can convert the string into a `bn` object with `model2network`.

In [None]:
dag <- model2network("[A][C][B|A][D|C][F|A:B:C][E|F]")
dag
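
To check that a model string was read as intended, we can query the resulting graph. This is a short sketch that uses `parents`, `children` and `modelstring` on the same string as above; `rebuilt` is just an illustrative name.

```r
library(bnlearn)

dag <- model2network("[A][C][B|A][D|C][F|A:B:C][E|F]")

# The parents declared after "|" for node F.
parents(dag, "F")

# Nodes that list A among their parents.
children(dag, "A")

# Round trip: serialise to a string and parse it back.
rebuilt <- model2network(modelstring(dag))

# A structural Hamming distance of 0 means both graphs are identical.
shd(dag, rebuilt)
```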

# Plotting graphs

We can plot graphs with the built-in R graphics engine by calling `plot` on a `bn` object

A few aspects of the plot can be customized, as documented in the corresponding help page. Other packages can be used indirectly to plot graphs; `bnlearn` provides connections to some of them, but be aware that some might be outdated.

In [None]:
plot(dag)

## Using D3's force graphs

- Graphs are a common data structure and we can find lots of utilities to work with them

- The D3 library from the *Javascript* domain is one of the most powerful visualization libs

- The `networkD3` package is a nice port of the *D3 force graph*

- The next snippet is just a custom function to transform a `bn` object to the required format for D3.

In [None]:
plotD3bn <- function(bn) {
  varNames <- nodes(bn)

  # Nodes should be zero indexed!
  links <- data.frame(arcs(bn)) %>%
    mutate(from = match(from, varNames)-1, to = match(to, varNames)-1, value = 1)
  
  nodes <- data.frame(name = varNames) %>%
    mutate(group = 1, size = 30)
  
  networkD3::forceNetwork(
    Links = links,  
    Nodes = nodes,
    Source = "from",
    Target = "to",
    Value = "value",
    NodeID = "name",
    Group = "group",
    fontSize = 20,
    zoom = TRUE,
    arrows = TRUE,
    bounded = TRUE,
    opacityNoHover = 1
  )
}

In [None]:
# Use the mouse wheel to zoom!
plotD3bn(dag)

# Loading Bayesian networks from files

- There are different file formats to represent a Bayesian network

- They have originated over the years as efforts to create standards or as part of particular **proprietary systems**

- `bnlearn` provides several ways to load BNs from different formats [[Docs]](http://www.bnlearn.com/documentation/man/foreign.html)

- If you plan to use only `bnlearn`, you can simply save the network to an `rda` file
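
A minimal sketch of the `rda` route, using base R's `save` and `load`. The temporary file path and the two-node network are placeholders for illustration.

```r
library(bnlearn)

dag <- model2network("[A][B|A]")

# Write the object to an .rda file (a temporary path is used here).
path <- tempfile(fileext = ".rda")
save(dag, file = path)

# load() restores objects under their original names; loading into a
# separate environment avoids overwriting anything in the workspace.
restored <- new.env()
load(path, envir = restored)

modelstring(restored$dag)
```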

## The Bayes net repository

The maintainers of `bnlearn` also provide a modern R-focused repository with a series of popular Bayesian networks that have been used extensively for benchmarking in the literature.

http://www.bnlearn.com/bnrepository/

Here you can find networks with different properties that can be used to test algorithms or to explore this or other BN packages. We will now work with the popular `asia` network

> *Asia* is for BNs what *"iris"* is for statistics.

In [None]:
# This downloads the RData file from the repository and loads it.
# The bn is loaded into a bn.fit variable called bn
load(url("http://www.bnlearn.com/bnrepository/asia/asia.rda"))
asia <- bn

In [None]:
bn.net(asia)

In [None]:
plotD3bn(asia)

Now it is time to review the parameters. Printing the fitted network shows each node and its associated probability table. In this case all variables are **discrete**, so the tables are conditional probability tables.

In [None]:
asia

### Accessing nodes

We can access individual nodes of the network with `$`, as we would access the columns of a data.frame:


In [None]:
asia$smoke

### Plotting Parameters

There is also a function to plot the distributions of discrete networks:


In [None]:
bn.fit.barchart(asia$smoke)

In [None]:
bn.fit.barchart(asia$dysp)


# Introducing expert knowledge

- We can manually alter the probability tables of a BN 
- This is useful for overriding parameters learnt from data or for setting those of **unobserved variables** 
- This method allows us to include expert information from the domain of the problem being modelled.

To modify a **conditional probability table**, extract it from the model with `coef`, change its values and assign it back to the node. 

**Be careful to maintain the inherent restrictions of the probability distribution.**

In [None]:
cpt <- coef(asia$smoke)
cpt[] <- c(0.2, 0.8)
asia$smoke <- cpt
asia$smoke
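
The same pattern applies to nodes with parents, where each column of the table (one per parent configuration) has to sum to one. The sketch below uses a small hypothetical network built with `custom.fit`; its node names, levels and probabilities are invented for illustration.

```r
library(bnlearn)

# Hypothetical two-node network with made-up probabilities.
net  <- model2network("[A][B|A]")
cptA <- matrix(c(0.4, 0.6), ncol = 2,
               dimnames = list(NULL, c("LOW", "HIGH")))
cptB <- matrix(c(0.8, 0.2, 0.3, 0.7), ncol = 2,
               dimnames = list(B = c("GOOD", "BAD"), A = c("LOW", "HIGH")))
fit  <- custom.fit(net, dist = list(A = cptA, B = cptB))

# Extract the conditional table and change one parent configuration.
cpt <- coef(fit$B)
cpt[, "HIGH"] <- c(0.5, 0.5)

# Each column is a distribution over B, so it must sum to one.
stopifnot(all(abs(colSums(cpt) - 1) < 1e-8))

fit$B <- cpt
coef(fit$B)
```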

# Sampling data from a Bayesian network

- `bnlearn` provides `rbn`, a function in the style of R's random generators, to sample data from a given fitted model

- We will now sample a dataset from *asia* to test learning from data

In [None]:
# Note that the argument order is reversed with respect to R's sampling functions (model first, then n)
sampleAsia <- rbn(x = asia, n = 10000)

head(sampleAsia)
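
Since sampling is random, fixing the seed before calling `rbn` should make a sample repeatable. The following sketch uses a small hypothetical network (names and probabilities are invented) and checks the sampled frequencies against the specified ones.

```r
library(bnlearn)

# Hypothetical two-node network with made-up probabilities.
net  <- model2network("[A][B|A]")
cptA <- matrix(c(0.4, 0.6), ncol = 2,
               dimnames = list(NULL, c("LOW", "HIGH")))
cptB <- matrix(c(0.8, 0.2, 0.3, 0.7), ncol = 2,
               dimnames = list(B = c("GOOD", "BAD"), A = c("LOW", "HIGH")))
fit  <- custom.fit(net, dist = list(A = cptA, B = cptB))

# Same seed, same call: the two samples are expected to match.
set.seed(42)
s1 <- rbn(fit, n = 1000)
set.seed(42)
s2 <- rbn(fit, n = 1000)
identical(s1, s2)

# Empirical frequencies of A should be close to (0.4, 0.6).
prop.table(table(s1$A))
```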

# Parametric learning from data

- We can induce the parameters of a Bayesian Network from observed data
- `bnlearn` provides different estimators for that; Maximum Likelihood Estimation (MLE) is the most common one

- We can invoke the learning algorithm by using the function `bn.fit`

- For that we need a **DAG** and a **compatible dataset**


In [None]:
net <- bn.net(asia)
asiaInduced <- bn.fit(x = net, data = sampleAsia)

We can now compare the two networks; *there should be some discrepancies in the induced one* 

Notice that events with extremely low probabilities will rarely be simulated and thus will not have a significant presence in the sample.
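
One way to soften the effect of rare or missing configurations is Bayesian parameter estimation, which `bn.fit` offers through `method = "bayes"`. The sketch below uses the `learning.test` dataset shipped with `bnlearn`; the imaginary sample size `iss = 10` is an arbitrary choice for illustration.

```r
library(bnlearn)

data(learning.test)
dag <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")

# Maximum likelihood: plain relative frequencies.
mle <- bn.fit(dag, learning.test, method = "mle")

# Bayesian estimate: frequencies smoothed by a uniform prior whose
# weight is set by the imaginary sample size (iss).
bayes <- bn.fit(dag, learning.test, method = "bayes", iss = 10)

coef(mle$A)
coef(bayes$A)
```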


In [None]:
asia

In [None]:
asiaInduced

# Structural learning

On many occasions the structure of the model is designed by hand, when we know the variables and the causal patterns from expert knowledge

This is what we call an **open box model**, and it provides a powerful framework for many problems, especially when compared to other models that do not provide a clear interpretation of their parameters

However, there are many situations in which we would like to automate the structural learning of a model, such as the discovery of causal patterns or a lack of knowledge of the domain

In other cases we just want to learn about particular dependency relationships between the variables or select the best structure among a particular set


- `bnlearn` specializes in structural learning

- There is a complex taxonomy of such algorithms, organised by the statistical tests, metrics and heuristics they use. Exact learning is an NP-hard problem and thus several approximate approaches have been proposed.

- We focus on the `hc` algorithm to learn a full structure and the `BIC` score metric to measure the fit of a particular network with a dataset

- The `hc` algorithm can be run from a data sample

In [None]:
networkInduced <- hc(x = sampleAsia)
networkInduced

## Network Comparison

Let's compare it with the original network (the **golden model**), as the algorithm may have introduced some differences given that we learnt from a finite data sample.

In [None]:
modelstring(networkInduced)

In [None]:
modelstring(asia)

We can also compute some metrics to compare the network. The **structural Hamming distance** determines the amount of discrepancy between the two graphs.

In [None]:
shd(bn.net(asia), networkInduced)
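
For a finer breakdown than a single distance, `compare` counts the arcs that match and those that differ. This is a sketch on the `learning.test` data shipped with `bnlearn`, with the structure in `true_dag` taken as the reference for that dataset.

```r
library(bnlearn)

data(learning.test)
true_dag <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")
learned  <- hc(learning.test)

# tp: arcs present in both graphs; fp: arcs only in the learned graph;
# fn: arcs only in the reference graph.
unlist(compare(target = true_dag, current = learned))

# The structural Hamming distance summarises the differences in one number.
shd(learned, true_dag)
```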

## Network Scoring

- The BIC metric measures the fit of the structure for a given sample
- It also penalizes the number of parameters to avoid overfitting 

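In this case the two results are almost the same. Note that `bnlearn` computes BIC as the log-likelihood minus the penalty (the classic definition rescaled by -2), so **the higher the value the better**. It seems that the induced model could be biased towards the sample and marginally outperforms the golden model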

In [None]:
print(BIC(object = asia, data = sampleAsia))
print(BIC(object = networkInduced, data = sampleAsia))
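
Scores can also be computed on a structure alone with `score`. The sketch below, again on the bundled `learning.test` data, compares the empty graph with the result of `hc`. Since `hc` starts from the empty graph by default and only accepts changes that improve the score it optimises, the learned network should never score worse than the starting point.

```r
library(bnlearn)

data(learning.test)

# Starting point of the search: no arcs at all.
empty   <- empty.graph(names(learning.test))
learned <- hc(learning.test, score = "bic")

bic_empty   <- score(empty, learning.test, type = "bic")
bic_learned <- score(learned, learning.test, type = "bic")

c(empty = bic_empty, learned = bic_learned)
```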

# Gaussian networks

- Gaussian networks differ in the kind of probability tables that represent them

- If all variables are continuous, nodes without parents are Gaussian nodes and nodes with parents are linear Gaussian nodes

- Gaussian nodes are encoded by the parameters of a normal distribution (mean and sd); linear Gaussian nodes are represented as a linear regression with a coefficient for each parent, an intercept term and the standard deviation of the residuals

- `bnlearn` has some sample gaussian data to test these BNs

In [None]:
data(gaussian.test)
dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
model <- bn.fit(dag, gaussian.test)
model
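
To make the link with linear regression concrete, the sketch below compares the coefficients stored in node `C` with those of an ordinary `lm` fit of `C` on its parents. The two are expected to agree up to numerical precision.

```r
library(bnlearn)

data(gaussian.test)
dag   <- model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
model <- bn.fit(dag, gaussian.test)

# Coefficients stored in the linear Gaussian node C (parents A and B).
bn_coef <- coef(model$C)

# The same regression fitted directly with lm().
lm_coef <- coef(lm(C ~ A + B, data = gaussian.test))

# Side by side, aligned by coefficient name.
cbind(bnlearn = bn_coef, lm = lm_coef[names(bn_coef)])
```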

## Editing Gaussian Nodes

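To modify a node of a Gaussian network we proceed as in the discrete case, replacing its parameters with a list that contains the coefficients (`coef`) and the standard deviation (`sd`)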

In [None]:
model$A <- list(coef = c("(Intercept)" = 10), sd = 0)
model$A

# Hybrid networks

- A hybrid network contains both discrete and continuous variables

- **There is a framework restriction in which a discrete variable cannot have any continuous parent**

- A continuous variable with discrete parents is represented by a conditional Gaussian distribution, that is, a linear Gaussian distribution (over any continuous parents) for each configuration of the discrete parents.

- In the next example we will use `custom.fit` to manually load the parameters into a graph

In [None]:
net <- model2network("[A][B][C|A:B]")

cptA  <- matrix(c(0.4, 0.6), ncol = 2, dimnames = list(NULL, c("LOW", "HIGH")))
distB <- list(coef = c("(Intercept)" = 1), sd = 1.5)
distC <- list(
  coef = 
    matrix(
      c(1.2, 2.3, 3.4, 4.5), 
      ncol = 2,
      dimnames = list(c("(Intercept)", "B"), NULL)
    ),
    sd = c(0.3, 0.6)
)

model = custom.fit(net, dist = list(A = cptA, B = distB, C = distC))
model
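
As a quick check of the hybrid model, we can draw a few samples with `rbn` and look at the column types: the discrete node should come back as a factor and the continuous ones as numeric. The sketch rebuilds the same network so that it runs on its own; the name `hybrid` and the seed are arbitrary.

```r
library(bnlearn)

net <- model2network("[A][B][C|A:B]")

cptA  <- matrix(c(0.4, 0.6), ncol = 2, dimnames = list(NULL, c("LOW", "HIGH")))
distB <- list(coef = c("(Intercept)" = 1), sd = 1.5)
# One column of coefficients and one sd per level of the discrete parent A.
distC <- list(
  coef = matrix(c(1.2, 2.3, 3.4, 4.5), ncol = 2,
                dimnames = list(c("(Intercept)", "B"), NULL)),
  sd = c(0.3, 0.6)
)

hybrid <- custom.fit(net, dist = list(A = cptA, B = distB, C = distC))

set.seed(1)
s <- rbn(hybrid, n = 5)
sapply(s, class)
```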

# Inference and probability queries

- Inference in `bnlearn` is limited, but it can be used to test the networks and to perform basic operations with the models

- `cpquery` asks for the probability of an **event** given a set of **evidence**

- Both of them are boolean expressions involving the variables in the model

- We may ask for a particular combination of configurations in the BN and a set of observed statuses for the variables in the evidence.

- For example we could ask

> *what is the probability of a positive cancer diagnosis for a person who smokes?*, in the asia network.

In [None]:
# (For cpquery I recommend the more powerful `lw` algorithm)

# First we should observe the prior probability to compare
# TRUE is for empty evidence

cpquery(asia, event = lung == "yes", evidence = TRUE)

In [None]:
# Now for the complete query

cpquery(asia, event = lung == "yes", evidence = list(smoke = "yes"), method = "lw", n = 100000)
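
A useful sanity check is to run `cpquery` on a network small enough that the answer can be read off the table. In the hypothetical network below (names and probabilities are invented), P(B = GOOD | A = HIGH) is 0.3 by construction, so the estimate should land close to that value.

```r
library(bnlearn)

# Hypothetical two-node network with made-up probabilities.
net  <- model2network("[A][B|A]")
cptA <- matrix(c(0.4, 0.6), ncol = 2,
               dimnames = list(NULL, c("LOW", "HIGH")))
cptB <- matrix(c(0.8, 0.2, 0.3, 0.7), ncol = 2,
               dimnames = list(B = c("GOOD", "BAD"), A = c("LOW", "HIGH")))
fit  <- custom.fit(net, dist = list(A = cptA, B = cptB))

# Default method (logic sampling): event and evidence are both expressions.
set.seed(1)
est <- cpquery(fit, event = (B == "GOOD"), evidence = (A == "HIGH"), n = 100000)

# Value read directly from the column A = HIGH of the table for B.
exact <- 0.3

c(estimate = est, exact = exact)
```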

### Repeat for stability

Because these queries are estimated by sampling, the result varies between runs; it is useful to repeat the query several times and average the results

In [None]:
query_trials <- replicate(100, cpquery(asia, event = lung == "yes", evidence = TRUE))
query <- mean(query_trials)
print(query)

In [None]:
query_trials <- replicate(100, cpquery(asia, event = lung == "yes", evidence = list(smoke = "yes"), method = "lw", n = 100000))
query <- mean(query_trials)
print(query)

## Sampling data with evidence

The other option is to use `cpdist` to sample cases from the network for a set of nodes in the presence of some evidence. The usage is the same as for `cpquery`, and we can obtain more stable results by increasing the size of the sample.

In [None]:
s <- cpdist(asia, nodes = c("lung"), evidence = TRUE, n=1000)
head(s)

In [None]:
summary(s)

In [None]:
prop.table(table(s))

In [None]:
ggplot(s, aes(x=lung)) + geom_bar()
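
When `cpdist` is run with likelihood weighting, the samples carry weights (attached as the `weights` attribute) that should be used when estimating probabilities. The sketch below does this on a small hypothetical network where P(B = GOOD | A = HIGH) is 0.3 by construction; names and probabilities are invented.

```r
library(bnlearn)

# Hypothetical two-node network with made-up probabilities.
net  <- model2network("[A][B|A]")
cptA <- matrix(c(0.4, 0.6), ncol = 2,
               dimnames = list(NULL, c("LOW", "HIGH")))
cptB <- matrix(c(0.8, 0.2, 0.3, 0.7), ncol = 2,
               dimnames = list(B = c("GOOD", "BAD"), A = c("LOW", "HIGH")))
fit  <- custom.fit(net, dist = list(A = cptA, B = cptB))

# With method = "lw" the evidence is given as a named list.
set.seed(1)
s <- cpdist(fit, nodes = "B", evidence = list(A = "HIGH"),
            method = "lw", n = 10000)

# Weighted relative frequency of B == "GOOD".
w   <- attr(s, "weights")
est <- sum(w[s$B == "GOOD"]) / sum(w)
est
```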

# What have we learnt?

- `bn` and `bn.fit` data structures
- Manual construction of DAGs
- Manual input of CPTs
- Data sampling from a BN
- Estimating CPTs from data
- Estimating and scoring DAGs from data
- Gaussian and Hybrid networks
- Inference with cpqueries and cpdist