## CHAPTER 13 - Probabilistic Reasoning 

### George Tzanetakis, University of Victoria 


# WORKPLAN 

The section numbers refer to the 4th edition of the AIMA textbook and are the suggested
reading for this week. Each list entry gives only the additional sections. For example, the Expected reading includes the sections listed under Basic as well as those listed under Expected. Some additional readings are suggested for Advanced. 

1. Basic: Sections **13.1**, **13.2 (not 13.2.1, 13.2.2, 13.2.3, 13.2.4)**, **13.3 (just exact inference)**, **13.4 (just direct sampling)**, and **Summary**
2. Expected: Same as Basic + in **13.3** (+ variable elimination) + in **13.4** (+ rejection sampling) 
3. Advanced: The whole chapter, including the bibliographical and historical notes 




## History 

Pierre Laplace (1819): "Probability theory is nothing but common sense reduced to calculation" 

James Maxwell (1850): "The true logic for this world is the calculus of probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind." 

Early expert systems of the 1970s focused on logic and ignored uncertainty. The next generation of medical diagnostic systems used probabilistic techniques but ran into scalability issues when using full joint distributions. Probabilistic approaches fell out of fashion from roughly 1975 to 1988. 



## Inference using full joint distribution


Let's consider another example where the full $2 \times 2 \times 2$ joint distribution is given. 



| | toothache and catch | toothache and not catch | not toothache and catch | not toothache and not catch |
|---|-------   | ----------| ------| ----------|
|cavity | 0.108 | 0.012 | 0.072 | 0.008 | 
| not cavity | 0.016 | 0.064 | 0.144 | 0.576 | 


A direct way to evaluate the probability of any proposition: 
* Identify the possible worlds in which a proposition is true and add up their probabilities 
* $P(cavity \lor toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016+ 0.064 = 0.28$ 
* **Marginal probability** of cavity: 
* $P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2$ 
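
The two sums above can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary layout and the helper name `prob` are choices made here for illustration, and the last line applies the same idea to a conditional probability as the ratio of two sums.

```python
# Full joint distribution P(Cavity, Toothache, Catch) taken from the table above.
# Keys are (cavity, toothache, catch) truth values.
joint = {
    (True,  True,  True):  0.108,
    (True,  True,  False): 0.012,
    (True,  False, True):  0.072,
    (True,  False, False): 0.008,
    (False, True,  True):  0.016,
    (False, True,  False): 0.064,
    (False, False, True):  0.144,
    (False, False, False): 0.576,
}


def prob(proposition):
    """Add up the probabilities of the possible worlds where the proposition holds."""
    return sum(p for world, p in joint.items() if proposition(*world))


p_cavity_or_toothache = prob(lambda cavity, toothache, catch: cavity or toothache)
p_cavity = prob(lambda cavity, toothache, catch: cavity)

# A conditional probability is a ratio of two such sums.
p_cavity_given_toothache = (
    prob(lambda cavity, toothache, catch: cavity and toothache)
    / prob(lambda cavity, toothache, catch: toothache)
)

print(round(p_cavity_or_toothache, 3))     # 0.28
print(round(p_cavity, 3))                  # 0.2
print(round(p_cavity_given_toothache, 3))  # 0.6
```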



## The big picture 

We represent problems as sets of variables with discrete and finite domains. The full joint probability distribution specifies a probability for every possible assignment of values to all the variables from their corresponding domains. In a typical scenario we are given the values of some variables (called evidence) and are interested in the probability distribution of some other variables (called query) given the evidence. The remaining variables are called the hidden variables. 


Without going into details, we can solve any inference problem by summation and products using the full joint probability distribution. 

The problem with this approach is that specifying the joint probability distribution becomes 
very difficult as the number of variables increases. For example, if we have 10 binary random variables we would need to provide $2^{10}=1024$ probability values. 

We can take advantage of independence and conditional independence relationships among the random variables specifying our problem to greatly reduce the number of probabilities that need to be specified in order to define the full joint probability distribution. 




##  Bayesian Networks  

A Bayesian Network is a specific case of a Probabilistic Graphical Model. 



Specifying a Bayesian Network 

1. Each node corresponds to a random variable, which may be discrete or continuous 
2. Directed links connect pairs of nodes. If there is an arrow from node X to node Y, X is called the **parent** of Y. The graph has no directed cycles and hence is a directed acyclic graph, or DAG. 
3. Each node $X_{i}$ has associated probability information $\theta(X_i|Parents(X_i))$ that quantifies the effect of the parents on the node using a finite number of parameters. 

(Causes should be parents of effects - this is typically something that a domain expert can easily do) 

The joint distribution over all the variables can be calculated from the topology and the local probability information. 


## Bayesian Network Example 


Let's look at a particular well-known example of a Bayesian Network, originally proposed by Judea Pearl, a well-known computer scientist. 

<img src="images/judea_pearl_bayes_net.png" width="75%"/>



## Semantics of Bayesian Networks 


An entry in the joint probability distribution is: 
* $P(X_{1} = x_{1} \wedge \dots \wedge X_{n} = x_{n})$ or $P(x_1, x_2, \dots, x_{n})$

The Bayesian Network can be used to calculate each entry "on demand" as follows: 

* $P(x_1, x_2, \dots, x_{n}) = \prod_{i=1}^{n} \theta(x_{i} | parents(X_{i}))$

Let's illustrate this with an example based on the network above: 

* $P(j,m,a,\lnot b, \lnot e) = P(John = True \land Mary = True \land \dots ) = P(j|a) P(m|a) P(a| \lnot b, \lnot e) P(\lnot b) P(\lnot e)$
* $= 0.90 \times 0.70 \times 0.001 \times 0.999 \times 0.998 = 0.000628$
* $P(\lnot j, m, \lnot a, b, e) = P(\lnot j | \lnot a) P(m | \lnot a) P(\lnot a | b, e) P(b) P(e) = 0.95 \times 0.01 \times 0.05 \times 0.001 \times 0.002 \approx 9.5 \times 10^{-10}$

Note the above uses short-hand notation. If we wanted to be more precise we should have written the expression as follows: 
$P(JohnCalls = True \wedge MaryCalls = True ... )$
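
The "on demand" product can be sketched in Python as follows. The conditional probability values are assumptions: they are the usual textbook numbers for the burglary network and agree with the factors used in the two worked examples above, but they should be checked against the figure. The helper names `p` and `joint` are made up for this sketch.

```python
# Assumed CPTs for the burglary network (check against the figure).
P_B = 0.001                      # P(Burglary = true)
P_E = 0.002                      # P(Earthquake = true)
P_A = {                          # P(Alarm = true | Burglary, Earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}  # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}  # P(MaryCalls = true | Alarm)


def p(prob_true, value):
    """Probability that a Boolean variable takes `value`, given P(variable = true)."""
    return prob_true if value else 1.0 - prob_true


def joint(b, e, a, j, m):
    """One entry of the full joint distribution: the product of one CPT entry per node."""
    return (p(P_B, b) * p(P_E, e) * p(P_A[(b, e)], a)
            * p(P_J[a], j) * p(P_M[a], m))


print(f"{joint(b=False, e=False, a=True, j=True, m=True):.6f}")   # 0.000628
print(f"{joint(b=True, e=True, a=False, j=False, m=True):.2e}")   # 9.50e-10
```

Summing `joint` over all 32 assignments gives 1, which is a quick sanity check that the conditional probability tables define a proper distribution.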


As the number of variables increases, specifying the full joint probability distribution requires many more numbers than specifying a Bayesian Network. If we have $n$ Boolean variables and each variable has at most $k$ parents, 
the complete network can be specified by at most $2^{k} \cdot n$ numbers. 

For example, suppose that you have 30 variables and each one has 5 parents. The Bayesian Network requires $32 \times 30 = 960$ numbers, but the full joint distribution requires $2^{30}$ entries, which is over a billion. 
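
The comparison is easy to check numerically. The two function names below are made up for this small sketch.

```python
def bayes_net_parameters(n, k):
    """Numbers needed for n Boolean variables when each has k parents: n * 2^k."""
    return n * 2 ** k


def full_joint_entries(n):
    """Number of entries in the full joint table over n Boolean variables: 2^n."""
    return 2 ** n


print(bayes_net_parameters(30, 5))  # 960
print(full_joint_entries(30))       # 1073741824
```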


### Constructing Bayesian Network 


Ideally, the parents of a node are exactly the nodes that directly influence it. 
The same joint probability distribution can be represented by different networks, 
depending on the order in which the nodes are added. A poorly chosen order can produce networks that are clunky and hard to understand. 


Intuitively:
* Start from root causes and expand to their effects (follow causality) 
* For details: Read textbook

<img src="images/different_orders_bayes_net.png" width="75%"/>


## Exact Inference by Enumeration


The basic task of a probabilistic inference system is to compute the posterior probability distribution for a set of **query variables**, given some observed **event**, which is an assignment of values to a set of **evidence variables**. To keep the presentation simple we will consider a single query variable. The remaining variables are called **hidden**. 

**Note that it is possible to express queries over multiple variables in terms of single-variable queries:**
* Simple queries: $P(NoGas | Gauge = empty, Lights = on, Starts = false)$
* Conjunctive queries: $P(X_i, X_j | E = e) = P(X_i | E = e) P(X_j | X_i, E = e)$

For example in the burglary network: 

* $P(Burglary | JohnCalls = true, MaryCalls = true) = \langle 0.284, 0.716 \rangle$

### Inference by enumeration 

$P(X|e) = \alpha P(X,e) = \alpha \sum_{y} P(X,e,y)$

So any query can be answered using a Bayes net by computing sums of products of conditional probabilities from the network. 


For example: 

* $P(b | j,m) = \alpha \sum_{e} \sum_{a} P(b) P(e) P(a|b,e) P(j|a) P(m|a) = \alpha P(b) \sum_e P(e) \sum_a P(a|b,e) P(j|a) P(m|a)$

* Shorthand for $P(B = True | J = True \land M = True)$

Note the use of $\alpha$ for normalization: 

* $P(B|j,m) = \alpha \langle 0.00059224, 0.0014919 \rangle \approx \langle 0.284, 0.716 \rangle$

So the chance of a burglary, given that both John and Mary call, is about 28%. 
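
The enumeration for this query can be sketched in Python as below. The conditional probability values are assumptions (the usual textbook numbers for the burglary network, which reproduce the unnormalized values quoted above), and the function names are made up for illustration. The hidden variables $E$ and $A$ are summed out by brute force, then the result is normalized.

```python
from itertools import product

# Assumed CPTs for the burglary network (check against the figure).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}


def p(prob_true, value):
    """Probability that a Boolean variable takes `value`, given P(variable = true)."""
    return prob_true if value else 1.0 - prob_true


def joint(b, e, a, j, m):
    """Product of one CPT entry per node."""
    return (p(P_B, b) * p(P_E, e) * p(P_A[(b, e)], a)
            * p(P_J[a], j) * p(P_M[a], m))


def burglary_posterior(j, m):
    """P(Burglary | j, m): sum out the hidden variables E and A, then normalize."""
    unnormalized = {
        b: sum(joint(b, e, a, j, m) for e, a in product((True, False), repeat=2))
        for b in (True, False)
    }
    alpha = 1.0 / sum(unnormalized.values())
    posterior = {b: alpha * value for b, value in unnormalized.items()}
    return unnormalized, posterior


unnormalized, posterior = burglary_posterior(j=True, m=True)
print(round(posterior[True], 3), round(posterior[False], 3))  # 0.284 0.716
```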


<img src="images/bayes_inference_enumeration.png" width="75%"/>

## Variable elimination 

Notice that the expression tree for direct enumeration, shown in the figure above, contains several repeated computations. More efficient algorithms can be devised by doing each calculation once and saving the result for later use. 
This is a form of **dynamic programming**. The **variable elimination** algorithm is a simple algorithm that works by evaluating expressions in bottom-up order and storing intermediate results as factors. 
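
The idea of factors can be sketched as follows. This is a simplified illustration under several assumptions, not the textbook's pseudocode: factors are plain `(variables, table)` pairs, all variables are Boolean, the elimination order (first $A$, then $E$) is fixed by hand, and the conditional probability values are the usual textbook numbers for the burglary network. All function names are made up for this sketch.

```python
from itertools import product

# Assumed CPTs for the burglary network (check against the figure).
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}


def p(prob_true, value):
    return prob_true if value else 1.0 - prob_true


def make_factor(variables, fn):
    """Tabulate fn over every Boolean assignment to `variables`."""
    table = {vals: fn(dict(zip(variables, vals)))
             for vals in product((True, False), repeat=len(variables))}
    return variables, table


def multiply(f, g):
    """Pointwise product of two factors."""
    f_vars, f_table = f
    g_vars, g_table = g
    out_vars = f_vars + tuple(v for v in g_vars if v not in f_vars)
    return make_factor(out_vars, lambda s: (
        f_table[tuple(s[v] for v in f_vars)] * g_table[tuple(s[v] for v in g_vars)]))


def sum_out(var, f):
    """Eliminate `var` from factor f by summing over its two values."""
    f_vars, f_table = f
    out_vars = tuple(v for v in f_vars if v != var)
    return make_factor(out_vars, lambda s: sum(
        f_table[tuple({**s, var: x}[v] for v in f_vars)] for x in (True, False)))


# One factor per node; the evidence j = true, m = true is already plugged in.
f_B = make_factor(("B",), lambda s: p(0.001, s["B"]))
f_E = make_factor(("E",), lambda s: p(0.002, s["E"]))
f_A = make_factor(("A", "B", "E"), lambda s: p(P_A[(s["B"], s["E"])], s["A"]))
f_J = make_factor(("A",), lambda s: P_J[s["A"]])
f_M = make_factor(("A",), lambda s: P_M[s["A"]])

g = sum_out("A", multiply(multiply(f_A, f_J), f_M))  # stored factor over (B, E)
h = sum_out("E", multiply(g, f_E))                   # stored factor over (B,)
_, unnormalized = multiply(f_B, h)

total = sum(unnormalized.values())
posterior = {vals[0]: value / total for vals, value in unnormalized.items()}
print(round(posterior[True], 3), round(posterior[False], 3))  # 0.284 0.716
```

Each intermediate factor (`g`, `h`) is computed once and reused, which is where the savings over plain enumeration come from in larger networks.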

## Markov Blanket 

A node is conditionally independent of all other nodes in the network, given its parents, children, and children's parents, that is, given its **Markov Blanket**. 
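
A Markov blanket can be read directly off the graph structure. The sketch below assumes the network is stored as a dictionary from each node to its list of parents (a representation chosen here for illustration) and uses the burglary network's topology as the example.

```python
# Burglary network topology: each node maps to its list of parents.
parents = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}


def markov_blanket(node, parents):
    """Parents, children, and children's other parents of `node`."""
    children = [child for child, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)  # a node is not part of its own blanket
    return blanket


print(sorted(markov_blanket("Burglary", parents)))
# ['Alarm', 'Earthquake']
print(sorted(markov_blanket("Alarm", parents)))
# ['Burglary', 'Earthquake', 'JohnCalls', 'MaryCalls']
```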



## A bigger example of a Bayes net 

<img src="images/car_diagnostics_bayes.png" width="75%"/>


## Additional topics 


* Markov blankets, ancestral graph, moral graph 
* Efficient representation of conditional distributions 
* Bayesian networks with continuous variables 
* Car insurance case study 
* Variable elimination algorithm 
* Complexity of exact inference 
* Clustering algorithms 




## Approximate Inference 

The basic idea is that, instead of calculating the exact probabilities of the events we are interested in, we can stochastically simulate the generation of samples from the network and simply count. Approximate inference can be much more efficient than exact inference for large networks. 


* Draw $N$ samples from a sampling distribution 
* Compute an approximate posterior probability from the samples
* Show that this converges to the true probability 

Examples of approximate inference: 

* Sampling from an empty network 
* Rejection sampling: reject samples disagreeing with evidence 
* Likelihood weighting: use evidence to weight samples 
* Markov Chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior 

### Sampling from a network with no evidence 

Generate events from a network with no evidence associated with it. 

Idea:  
* Sample each variable in turn, in topological order, conditioning each variable on the values already sampled for its parents 
* Count the generated samples that match the event of interest 
* The observed frequency converges as we generate more samples 
* Consistent estimate = converges to the true probability in the large-sample limit
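
A minimal sketch of direct sampling for a sprinkler-style network (Cloudy, Sprinkler, Rain, WetGrass) is shown below. The network structure and all conditional probability values are assumptions made for illustration and should be checked against the figures; the function name `prior_sample` is also a choice made here.

```python
import random

# Assumed CPTs for a sprinkler-style network (illustrative values).
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.1, False: 0.5}        # P(Sprinkler = true | Cloudy)
P_RAIN = {True: 0.8, False: 0.2}             # P(Rain = true | Cloudy)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}  # P(WetGrass = true | Sprinkler, Rain)


def prior_sample(rng):
    """Sample every variable in topological order, conditioning on sampled parents."""
    cloudy = rng.random() < P_CLOUDY
    sprinkler = rng.random() < P_SPRINKLER[cloudy]
    rain = rng.random() < P_RAIN[cloudy]
    wet_grass = rng.random() < P_WET[(sprinkler, rain)]
    return {"Cloudy": cloudy, "Sprinkler": sprinkler,
            "Rain": rain, "WetGrass": wet_grass}


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 100_000
samples = [prior_sample(rng) for _ in range(N)]

# Estimate P(Rain = true) by counting; under the assumed CPTs the exact value is 0.5.
rain_estimate = sum(s["Rain"] for s in samples) / N
print(rain_estimate)
```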


<img src="images/approximate_inference1.png" width="75%"/>
<img src="images/approximate_inference2.png" width="75%"/>
<img src="images/approximate_inference3.png" width="75%"/>
<img src="images/approximate_inference4.png" width="75%"/>
<img src="images/approximate_inference5.png" width="75%"/>


### Rejection Sampling 


* $P'(X | e)$ is estimated from the samples that agree with $e$ 
* Use direct sampling to generate $N$ samples, then keep only the ones that agree with the evidence 
* For example, to estimate $P(Rain | Sprinkler = true)$, generate $100$ samples 
* Let's say 27 of those samples have $Sprinkler = true$, and of those 8 have $Rain = true$ and $19$ have $Rain = false$ 
* $P'(Rain | Sprinkler = true) = Norm(\langle 8, 19 \rangle) = \langle 0.296, 0.704 \rangle$
* Rejection sampling returns consistent posterior estimates 
* Problem: hopelessly expensive if $P(e)$ is small 
    * Why? Only a fraction of roughly $P(e)$ of the samples agrees with the evidence; the rest are discarded 
* $P(e)$ drops exponentially with the number of evidence variables 
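
Rejection sampling for the query $P(Rain | Sprinkler = true)$ can be sketched as below. The sprinkler-style network and its conditional probability values are assumptions made for illustration, as are the function names. Under these assumed values the exact answer is $\langle 0.3, 0.7 \rangle$ and about 30% of the samples survive.

```python
import random

# Assumed CPTs for a sprinkler-style network (illustrative values).
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.1, False: 0.5}
P_RAIN = {True: 0.8, False: 0.2}
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}


def prior_sample(rng):
    """Direct sampling: every variable in topological order."""
    cloudy = rng.random() < P_CLOUDY
    sprinkler = rng.random() < P_SPRINKLER[cloudy]
    rain = rng.random() < P_RAIN[cloudy]
    wet_grass = rng.random() < P_WET[(sprinkler, rain)]
    return {"Cloudy": cloudy, "Sprinkler": sprinkler,
            "Rain": rain, "WetGrass": wet_grass}


def rejection_sampling(query, evidence, n, rng):
    """Estimate P(query | evidence) from the samples that agree with the evidence."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        sample = prior_sample(rng)
        if all(sample[var] == value for var, value in evidence.items()):
            counts[sample[query]] += 1  # keep; all other samples are rejected
    kept = counts[True] + counts[False]
    if kept == 0:
        raise ValueError("no sample agreed with the evidence")
    return {value: count / kept for value, count in counts.items()}, kept


rng = random.Random(0)
estimate, kept = rejection_sampling("Rain", {"Sprinkler": True}, 100_000, rng)
print(estimate, kept)
```

Adding more evidence variables to `evidence` shrinks `kept` quickly, which is the cost problem described in the list above.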



### Likelihood Weighting


* Idea: fix the evidence variables, sample only the nonevidence variables, and weight each sample by the likelihood it accords the evidence 

* Produces consistent estimates (details in book)
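
A minimal sketch of likelihood weighting is given below. The network representation (an ordered list of variables, each with a function returning the probability of `True` given the values sampled so far), the sprinkler-style structure, and the conditional probability values are all assumptions made for illustration.

```python
import random

# Assumed sprinkler-style network, listed in topological order.
# Each entry: (variable, function returning P(variable = true | parents in `s`)).
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}
NETWORK = [
    ("Cloudy", lambda s: 0.5),
    ("Sprinkler", lambda s: 0.1 if s["Cloudy"] else 0.5),
    ("Rain", lambda s: 0.8 if s["Cloudy"] else 0.2),
    ("WetGrass", lambda s: P_WET[(s["Sprinkler"], s["Rain"])]),
]


def weighted_sample(evidence, rng):
    """Fix evidence variables, sample the rest, and accumulate the evidence likelihood."""
    sample, weight = {}, 1.0
    for var, p_true_given in NETWORK:
        p_true = p_true_given(sample)
        if var in evidence:
            sample[var] = evidence[var]
            weight *= p_true if evidence[var] else 1.0 - p_true
        else:
            sample[var] = rng.random() < p_true
    return sample, weight


def likelihood_weighting(query, evidence, n, rng):
    """Estimate P(query | evidence) as a normalized sum of sample weights."""
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        sample, weight = weighted_sample(evidence, rng)
        totals[sample[query]] += weight
    norm = totals[True] + totals[False]
    return {value: total / norm for value, total in totals.items()}


rng = random.Random(0)
lw_estimate = likelihood_weighting("Rain", {"Sprinkler": True}, 100_000, rng)
print(lw_estimate)  # close to <0.3, 0.7> under the assumed CPTs
```

Unlike rejection sampling, every generated sample contributes to the estimate, since no sample can disagree with the evidence.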

## Markov Chain Simulation

MCMC algorithms work quite differently from rejection sampling and likelihood weighting. Instead of generating each sample from scratch, MCMC algorithms generate each sample by making a random change to the preceding sample. It is therefore helpful to think of an MCMC algorithm as being in a particular **current state**, specifying a value for each variable, and generating a **next state** by making random changes to the current state. 
There are several MCMC algorithms; a simple example is **Gibbs sampling**. 
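
The sketch below illustrates Gibbs sampling on a small sprinkler-style network. Several things here are assumptions made for illustration: the network and its conditional probability values, the random choice of which variable to resample at each step, the absence of a burn-in period, and the shortcut of computing each variable's conditional distribution from the ratio of two full joint probabilities (equivalent to conditioning on its Markov blanket, but only practical for tiny networks).

```python
import random

# Assumed sprinkler-style network, listed in topological order.
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}
NETWORK = [
    ("Cloudy", lambda s: 0.5),
    ("Sprinkler", lambda s: 0.1 if s["Cloudy"] else 0.5),
    ("Rain", lambda s: 0.8 if s["Cloudy"] else 0.2),
    ("WetGrass", lambda s: P_WET[(s["Sprinkler"], s["Rain"])]),
]


def joint_probability(state):
    """Probability of a complete assignment: product of one CPT entry per node."""
    prob = 1.0
    for var, p_true_given in NETWORK:
        p_true = p_true_given(state)
        prob *= p_true if state[var] else 1.0 - p_true
    return prob


def gibbs_sampling(query, evidence, n, rng):
    """Estimate P(query | evidence) by repeatedly resampling one nonevidence variable."""
    nonevidence = [var for var, _ in NETWORK if var not in evidence]
    state = dict(evidence)
    for var in nonevidence:
        state[var] = rng.random() < 0.5  # arbitrary initial state
    counts = {True: 0, False: 0}
    for _ in range(n):
        var = rng.choice(nonevidence)
        # P(var | all other variables) is proportional to the joint probability.
        # Assumes at least one of the two values has nonzero probability.
        p_true = joint_probability({**state, var: True})
        p_false = joint_probability({**state, var: False})
        state[var] = rng.random() < p_true / (p_true + p_false)
        counts[state[query]] += 1  # the next state is a small change to the current one
    return {value: count / n for value, count in counts.items()}


rng = random.Random(0)
gibbs_estimate = gibbs_sampling("Rain", {"Sprinkler": True}, 100_000, rng)
print(gibbs_estimate)  # close to <0.3, 0.7> under the assumed CPTs
```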


