# Bayesian networks

In this notebook, we will have a taste of Bayesian networks (BNs). As you know, BNs have two building blocks: a directed acyclic graph structure and a set of conditional probability distributions.


We want to know what is necessary for an athlete to go to the Olympic Games. Our model assumes that, given some predisposition (Genetics) and hard work (Practice), an athlete will be able to show her strength in the trials. Depending on her performance in the trials, she may or may not receive an offer to join the national team.


<img src="images/olympicsTrials.png" style="width:200px" />

The joint probability distribution factorizes as follows:

$$p(G,P,T,O)=p(O|T)p(T|P,G)p(P)p(G)$$

In BNs, one can easily infer the corresponding factorization from the graph as a product of distributions of the form $p($variable${}_i$ | parents${}_i)$.
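
As a small illustration of this rule, the sketch below reads the parent sets off the graph and assembles the factorization as a string. The `parents` dictionary and the `factorization` helper are hypothetical names introduced only for this example.

```python
# Parents of each variable, read off the DAG in the figure.
parents = {
    "G": [],
    "P": [],
    "T": ["P", "G"],
    "O": ["T"],
}

def factorization(parents):
    """Return the BN factorization as a string, one factor per variable."""
    factors = []
    for var, pa in parents.items():
        # A root node contributes a marginal, any other node a conditional.
        factors.append(f"p({var}|{','.join(pa)})" if pa else f"p({var})")
    return "".join(factors)

print(factorization(parents))
```

The factors come out in the order in which the variables are listed, which differs from the equation above only in the order of the product.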

Each factor on the right-hand side of this equation is a CPD that we need to encode. For now, we assume that these values are given by a domain expert. Let us focus on $p(O|T)$, which looks like this:


T O | $p$
----|-------
0 0 | 0.95
0 1 | 0.05
1 0 | 0.80
1 1 | 0.20
2 0 | 0.50
2 1 | 0.50



This can be implemented easily with a 2D array indexed by the values of the variables involved (conveniently, the values of each variable are assumed to range from 0 to $n_v - 1$, where $n_v$ is the number of values of the variable):

In [None]:
import numpy as np

cpd_o = np.array([[0.95, 0.8, 0.5],
                  [0.05, 0.2, 0.5]])
np.sum(cpd_o, axis=0)

Each column of this array sums to 1, that is, it is a probability distribution. Given a value for the parent variable (which indexes the columns of the array), we have a probability distribution over the child variable.

This column-wise sum is the basis of the marginalization operation (here, marginalizing out the child variable $O$). Reduction, in turn, is implemented as a selection. For example, when the parent variable $T$ takes the value 2, the CPD reduces to:

In [None]:
par_value=2
cpd_o[:,par_value]

Things get more complicated when there are several parent variables, as is the case for the CPD of T: $p(T|P,G)$.

For convenience, it is better to keep working with 2D arrays. However, we need a way to transform a multidimensional index (one dimension per parent variable) into a single index that, again, runs over the columns, as follows:

In [None]:
cpd_t = np.array([[.90, .80, .70, .50],
                  [.08, .15, .20, .30],
                  [.02, .05, .10, .20]])
par_g_value=0
par_p_value=1

ind_par = ### YOUR CODE HERE ### How would you obtain a 1D index from a 2D index?

# numpy does it as follows:
ind_par = np.ravel_multi_index([par_g_value,par_p_value],(2,2))
# Order matters!!! And (2,2) means that both variables take up to 2 different values

cpd_t[:,ind_par]
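
To see which parent configuration each column of `cpd_t` corresponds to, the mapping can also be inverted with `np.unravel_index`. This is a minimal sketch that assumes the default row-major (C) order of `np.ravel_multi_index` and the parent order (G, P) used in the cell above; `layout` is a name introduced only for this example.

```python
import numpy as np

card = (2, 2)  # number of values of (G, P), in that order

# Map each column index back to its (g, p) pair.
layout = [tuple(int(v) for v in np.unravel_index(col, card)) for col in range(4)]

for col, (g, p) in enumerate(layout):
    print(f"column {col}: G={g}, P={p}")
```

With this order, the last parent (P) varies fastest across the columns, which is why swapping the two parents would select a different column.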

A product of factors involves creating a larger table over all the variables involved. Each cell is obtained as the product of two cells, one from each input table, both indexed consistently by the values of all the variables.

We will not code it ourselves. Instead, let us use a Python library that does it for us: `pgmpy`.
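
For intuition only, the factor product described above can be sketched with numpy broadcasting. The example multiplies the two tables already defined, $p(O|T)$ and $p(T|P,G)$, and keeps the four parent configurations of T as a single flattened axis. The names `product` and `p_o_given_parents` are introduced here for illustration and are not part of any library.

```python
import numpy as np

cpd_o = np.array([[0.95, 0.8, 0.5],
                  [0.05, 0.2, 0.5]])       # indexed [o, t]
cpd_t = np.array([[.90, .80, .70, .50],
                  [.08, .15, .20, .30],
                  [.02, .05, .10, .20]])   # indexed [t, parent column]

# Factor product: one cell per (o, t, parent column) combination.
product = cpd_o[:, :, None] * cpd_t[None, :, :]   # shape (2, 3, 4)

# Summing out T leaves a table indexed [o, parent column].
p_o_given_parents = product.sum(axis=1)

print(product.shape)
print(p_o_given_parents)
```

Each column of `p_o_given_parents` is again a probability distribution over O, one per configuration of the parents of T.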

## Setting up our model with pgmpy
We need to codify both elements: the DAG structure and the CPDs.

In [None]:
from pgmpy.factors.discrete import TabularCPD # It implements CPDs
from pgmpy.models import BayesianModel        # and a class for Bayesian networks

### Set up the structure

We codify it as a set of edges. With this simple code, we specify that there are four variables and that the directed edges are: 
- <i>Genetics $\rightarrow$ Trials</i>
- <i>Practice $\rightarrow$ Trials</i>, and 
- <i>Trials $\rightarrow$ Offer</i>

In [None]:
olympic_model = BayesianModel([('Genetics', 'Trials'), #### YOUR CODE HERE #### , ]) 

### Set up the conditional probability distributions (CPDs)

Once the structure has been defined, we codify the respective probability distributions. 

Firstly, the <i>Genetics</i> and <i>Practice</i> variables do not have any parent and the corresponding distributions are marginal probability distributions. <i>Genetics</i> takes two possible values with the following probability distribution:

In [None]:
genetics_cpd = TabularCPD(
                variable = 'Genetics',
                variable_card = 2,
                values = [[.2],[.8]])


<i>Practice</i> also takes two possible values with probability $0.7$ and $0.3$, respectively:

In [None]:
practice_cpd = TabularCPD(
                variable = 'Practice',
                variable_card = 2,
                values = [[.7],[.3]])

That is, having favorable genetics is fairly common, at least among the people who face this decision, whereas practicing appropriately is much less common.


The other two variables, <i>Offer</i> and <i>Trials</i>, do have parents in the DAG and their factors are conditional probability distributions. <i>Trials</i> takes three possible values and has both <i>Genetics</i> and <i>Practice</i> as parents:


In [None]:
trials_cpd = TabularCPD(
                        variable = 'Trials', 
                        variable_card = 3,
                        values = [[.90, .80, .70, .50],
                                  [.08, .15, .20, .30],
                                  [.02, .05, .10, .20]],
                        evidence = ['Genetics', 'Practice'],
                        evidence_card = [2,2])


<i>Offer</i> takes two possible values and has <i>Trials</i> as its only parent. The corresponding conditional probability distribution table is:

T O | $p$
----|-------
0 0 | 0.95
0 1 | 0.05
1 0 | 0.80
1 1 | 0.20
2 0 | 0.50
2 1 | 0.50


In [None]:
offer_cpd = TabularCPD(
                    variable = 'Offer',
                    variable_card = 2,
                    values = [[### YOUR CODE ### , , ],  
                              [### GOES HERE ### , , ]],
                    evidence = ['Trials'],
                    evidence_card = [3])


Once the CPDs are defined, we only have to add them to the model:


In [None]:
olympic_model.add_cpds(genetics_cpd, practice_cpd, offer_cpd, trials_cpd)


Let us examine our model:


In [None]:
olympic_model.get_cpds()

## Using our Bayesian network

We have already built our model. It is time to use it!

We can find <b>active trails</b> in the model that show the flows of probabilistic influence. For example, we can see that, when no variable is observed, between <i>Genetics</i> and <i>Practice</i> there is no active trail:

In [None]:
olympic_model.is_active_trail('Genetics', 'Practice')


However, if the variable <i>Offer</i> is observed, the trail between both variables becomes active:


In [None]:
olympic_model.is_active_trail('Genetics', 'Practice', observed=['Offer'])
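
The effect of observing <i>Offer</i> can also be checked numerically, without pgmpy. The sketch below builds the full joint table from the factorization, using the CPD values given earlier, and tests whether Genetics and Practice are independent before and after fixing Offer to 1. The helper `is_independent` is a hypothetical name introduced for this example.

```python
import numpy as np

p_g = np.array([0.2, 0.8])
p_p = np.array([0.7, 0.3])
cpd_t = np.array([[.90, .80, .70, .50],
                  [.08, .15, .20, .30],
                  [.02, .05, .10, .20]]).reshape(3, 2, 2)  # indexed [t, g, p]
cpd_o = np.array([[0.95, 0.8, 0.5],
                  [0.05, 0.2, 0.5]])                       # indexed [o, t]

# Joint table indexed [g, p, t, o], built from p(G)p(P)p(T|P,G)p(O|T).
joint = np.einsum('g,p,tgp,ot->gpto', p_g, p_p, cpd_t, cpd_o)

def is_independent(table):
    """True if a 2D joint table equals the outer product of its marginals."""
    table = table / table.sum()
    return np.allclose(table, np.outer(table.sum(axis=1), table.sum(axis=0)))

print(is_independent(joint.sum(axis=(2, 3))))         # nothing observed
print(is_independent(joint[:, :, :, 1].sum(axis=2)))  # Offer observed (O=1)
```

With these CPDs, the first check is expected to pass and the second to fail, in agreement with the two `is_active_trail` calls.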

We can obtain all the nodes reachable through an active trail from a single variable as follows:

In [None]:
olympic_model.active_trail_nodes('Genetics', observed=['Offer'])


We may also want to find the local <b>independencies</b> in the model associated with the variable <i>Genetics</i>:


In [None]:
olympic_model.local_independencies('Genetics')


Regarding the variable <i>Trials</i>, the list of independencies is empty:


In [None]:
olympic_model.local_independencies('Trials')


We can simply find all the independencies present in our model as follows:


In [None]:
olympic_model.get_independencies()


Note that some of them appear more than once, probably because the method collects the independencies of each variable in turn.


## Asking our Bayesian network

Later in this course, we will study different approaches to inference in PGMs. For now, let us use the approach known as <i>Variable Elimination</i> to observe how the different reasoning patterns work.

We can do probability propagation even when no information is observed:


In [None]:
from pgmpy.inference import VariableElimination
olympic_infer = VariableElimination(olympic_model)


We can get probability distributions that are not explicitly spelled out in our graph, such as the marginal probability distribution of the variable <i>Offer</i>:


In [None]:
prob_offer = olympic_infer.query(variables = ['Offer'])
print(prob_offer)


or the marginal probability distribution of the variable <i>Trials</i>:


In [None]:
prob_trials = olympic_infer.query(variables = ['Trials'])
print(prob_trials)
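
These two marginals can be cross-checked by hand with the 2D-array representation introduced at the beginning of the notebook. This is a minimal sketch in plain numpy that assumes the CPD values given above. It is not how pgmpy computes the result internally.

```python
import numpy as np

p_g = np.array([0.2, 0.8])
p_p = np.array([0.7, 0.3])
cpd_t = np.array([[.90, .80, .70, .50],
                  [.08, .15, .20, .30],
                  [.02, .05, .10, .20]])   # indexed [t, parent column]
cpd_o = np.array([[0.95, 0.8, 0.5],
                  [0.05, 0.2, 0.5]])       # indexed [o, t]

# Distribution over the parent columns of cpd_t: p(G)p(P), flattened with G first.
p_gp = np.outer(p_g, p_p).ravel()

p_t = cpd_t @ p_gp   # marginal p(T), summing out G and P
p_o = cpd_o @ p_t    # marginal p(O), summing out T

print(p_t)
print(p_o)
```

Each matrix-vector product multiplies a CPD by the distribution of its parents and sums the parents out in one step.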


The most common use, however, is to propagate the observation of some variables. We can calculate the probability distribution of the variable <i>Offer</i> given that the observed individual does not have favorable genetics:


In [None]:
prob_offer_bad_genes = olympic_infer.query(
                                        variables = ['Offer'], 
                                        evidence = {'Genetics':0})
print(prob_offer_bad_genes)


The probability of obtaining an offer increases when the individual has good genetics and practices:


In [None]:
prob_offer_good_genes_did_practice = olympic_infer.query(
                                        variables = ['Offer'], 
                                        evidence = {'Genetics':1, 'Practice':1})
print(prob_offer_good_genes_did_practice)


These two queries are examples of <b>causal reasoning</b>.
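
Both causal queries can be reproduced by hand, since information only needs to flow downstream from the observed parents. The sketch below assumes the CPD values given above and the column order (G, P) of the <i>Trials</i> table. The variable names are introduced only for this example.

```python
import numpy as np

p_p = np.array([0.7, 0.3])
cpd_t = np.array([[.90, .80, .70, .50],
                  [.08, .15, .20, .30],
                  [.02, .05, .10, .20]])   # columns: (G0,P0), (G0,P1), (G1,P0), (G1,P1)
cpd_o = np.array([[0.95, 0.8, 0.5],
                  [0.05, 0.2, 0.5]])       # indexed [o, t]

# p(O | G=0): average the two G=0 columns over Practice, then push through p(O|T).
p_t_g0 = cpd_t[:, [0, 1]] @ p_p
p_o_g0 = cpd_o @ p_t_g0

# p(O | G=1, P=1): both parents of T are observed, so select column 3 directly.
p_o_g1_p1 = cpd_o @ cpd_t[:, 3]

print(p_o_g0)
print(p_o_g1_p1)
```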

We can also reason upstream, as in <b>evidential reasoning</b>. For example, evidence about a great performance at the Olympic trials affects the probability distribution of the <i>Genetics</i> variable:


In [None]:
prob_good_genes_if_amazing_olympic_trials = olympic_infer.query(
                                        variables = ['Genetics'], 
                                        evidence = {'Trials':2})
print(prob_good_genes_if_amazing_olympic_trials)
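
Evidential reasoning amounts to applying Bayes' rule against the direction of the edges. As a hand-computed cross-check, the sketch below obtains $p(G|T=2)$ from the CPD values given above. The names `likelihood` and `posterior` are introduced only for this example.

```python
import numpy as np

p_g = np.array([0.2, 0.8])
p_p = np.array([0.7, 0.3])
cpd_t = np.array([[.90, .80, .70, .50],
                  [.08, .15, .20, .30],
                  [.02, .05, .10, .20]]).reshape(3, 2, 2)  # indexed [t, g, p]

# Likelihood p(T=2 | G), obtained by summing Practice out of p(T=2 | G, P).
likelihood = cpd_t[2] @ p_p

# Bayes' rule: posterior is proportional to prior times likelihood.
posterior = p_g * likelihood
posterior = posterior / posterior.sum()

print(posterior)
```

The posterior probability of favorable genetics is higher than the prior of 0.8, since a great performance is more likely under favorable genetics.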


Finally, <b>intercausal reasoning</b> concerns two variables that are parents of a third variable (the v-structure <i>Genetics $\rightarrow$ Trials $\leftarrow$ Practice</i>).

Because the two parents are marginally independent, evidence about only one of them has no effect on the probability distribution of the other.

Once the variable <i>Trials</i> is also observed, both parents become dependent and the evidence about <i>Practice</i> does affect the probability distribution of <i>Genetics</i>:

In [None]:
# On its own, Practice tells us nothing about Genetics (Practice=0 means no practice)
prob_good_genes_if_no_practice = olympic_infer.query(
                                        variables = ['Genetics'], 
                                        evidence = {'Practice':0})
print(prob_good_genes_if_no_practice)

# Practice together with the performance in the trials does tell us something
prob_good_genes_if_no_practice_and_great_perf = olympic_infer.query(
                                        variables = ['Genetics'], 
                                        evidence = {'Practice':0,'Trials':2})
print(prob_good_genes_if_no_practice_and_great_perf)


As one would expect, if someone performs great in the Olympic trials without practice, that person very likely has favorable genetics!

From the point of view of BNs, we are seeing a v-structure in action. P and G are marginally independent. Once we observe T (at the bottom of the v-structure), P and G become dependent. That is why the probability distribution of G is only updated once we observe both P and T.
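
The v-structure can be verified by hand from the joint table. The sketch below assumes the CPD values given above and that value 0 of Practice means no practice. It compares $p(G|P=0)$ with $p(G|P=0,T=2)$. The variable names are introduced only for this example.

```python
import numpy as np

p_g = np.array([0.2, 0.8])
p_p = np.array([0.7, 0.3])
cpd_t = np.array([[.90, .80, .70, .50],
                  [.08, .15, .20, .30],
                  [.02, .05, .10, .20]]).reshape(3, 2, 2)  # indexed [t, g, p]
cpd_o = np.array([[0.95, 0.8, 0.5],
                  [0.05, 0.2, 0.5]])                       # indexed [o, t]

# Joint table indexed [g, p, t, o].
joint = np.einsum('g,p,tgp,ot->gpto', p_g, p_p, cpd_t, cpd_o)

# p(G | P=0): fix P, sum out T and O, normalize.
g_given_p0 = joint[:, 0].sum(axis=(1, 2))
g_given_p0 = g_given_p0 / g_given_p0.sum()

# p(G | P=0, T=2): fix P and T, sum out O, normalize.
g_given_p0_t2 = joint[:, 0, 2].sum(axis=1)
g_given_p0_t2 = g_given_p0_t2 / g_given_p0_t2.sum()

print(g_given_p0)
print(g_given_p0_t2)
```

The first distribution coincides with the prior $p(G)$, while the second shifts most of the mass to favorable genetics.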


<hr />

## Exercises

- What is the probability of a regular-performance trial for someone who practices but does not have favorable genetics?


- What is the probability of receiving an offer given only good genetics? And given bad genetics?



- What is the probability that an athlete practiced, given a great performance in the trials and unfavorable genetics?