# Day 24 notebook

The objectives of this notebook are to practice

* computing the probability of data given the graph (the model evidence)
* counting the number of possible Bayesian network structures

## Modules used for this assignment

In [1]:
# standard library modules
import math             # for log and lgamma

# course modules
import bayesian_network # for the BayesianNetwork class

## PROBLEM 1: Computing the score (model evidence) of a Bayesian network structure

In this notebook we will compute $\log P(D | G)$ (the model evidence) for a particular data set, $D$, and a particular Bayesian network graph, $G$.  This value is the score of the Bayesian network in the structure learning task.  In practice, we would want to find the Bayesian network structure that has the highest score, but for now we will focus simply on computing the score for one particular structure, given a data set.

We will be modeling three binary random variables $X_1, X_2,$ and $X_3$ for which we have data.  Here is the data set we will be using to compute the score of a graph:

In [2]:
# read in the data set as a list of tuples
# (each tuple is one joint observation of the three variables)
with open("data.txt") as data_file:
    data = [tuple(map(int, line.split())) for line in data_file]

# here are the first six observations
num_first = 6
print("The first %d (out of %d total) observations:" % (num_first, len(data)))
print(*data[:num_first], sep="\n")

The first 6 (out of 1000 total) observations:
(0, 0, 0)
(0, 1, 0)
(0, 1, 0)
(0, 0, 1)
(1, 1, 1)
(0, 0, 0)


Each tuple is an observation of the three random variables $(x_1, x_2, x_3)$.

You are to compute $\log P(D | G)$ for the Bayesian network, $G$, defined below.  We will be using flat, i.e., $\mathrm{Beta}(1,1)$, prior distributions for all parameters of the network.  The formula (and its derivation) for this value is given in the Day 24 Structure scoring example.
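
For reference, here is a sketch of the closed form, assuming the standard Beta-Bernoulli marginal likelihood; the notation is ours and may differ from the Day 24 example. Let $N_{ij0}$ and $N_{ij1}$ be the number of observations in which binary variable $X_i$ takes the values 0 and 1 while its parents are in configuration $j$, and let $N_{ij} = N_{ij0} + N_{ij1}$. With independent $\mathrm{Beta}(1,1)$ priors on every parameter,

$$
\log P(D | G) = \sum_{i} \sum_{j} \log \frac{N_{ij0}!\, N_{ij1}!}{(N_{ij} + 1)!}
= -\sum_{i} \sum_{j} \left[ \log (N_{ij} + 1) + \log \binom{N_{ij}}{N_{ij0}} \right]
$$

The second form follows from $\binom{N_{ij}}{N_{ij0}} = N_{ij}! / (N_{ij0}!\, N_{ij1}!)$ and $(N_{ij} + 1)! = (N_{ij} + 1)\, N_{ij}!$. A variable with no parents has a single (empty) parent configuration.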

In the Bayesian network instantiated below, you should ignore the parameter values in the CPDs.  The only important aspect of the network is its structure.

In [3]:
random_variables = ["x1", "x2", "x3"]
g = bayesian_network.BayesianNetwork(random_variables)

g.set_cpd("x1",
          [], [0, 1],
          {(): [0.75, 0.25]})
g.set_cpd("x2",
          [], [0, 1],
          {(): [0.75, 0.25]})
g.set_cpd("x3",
          ["x1", "x2"], [0, 1],
          {(0, 0): [0.9, 0.1],
           (0, 1): [0.3, 0.7],
           (1, 0): [0.2, 0.8],
           (1, 1): [0.1, 0.9]})

g.plot()

To answer this question, assign the value of $\log P(D | G)$ for this graph and the given dataset to the variable `log_prob_data_given_graph` below.  You will likely want to make use of the function `logbinom` provided below, which computes the natural logarithm of a [binomial coefficient](http://mathworld.wolfram.com/BinomialCoefficient.html).

In [4]:
def logbinom(n, k):
    """The natural logarithm of the binomial coefficient (n choose k)"""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
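
As a quick sanity check, the sketch below compares `logbinom` against the exact binomial coefficient for a few small inputs. It assumes Python 3.8 or later (for `math.comb`); `logbinom` is repeated only so that the snippet runs on its own.

```python
import math

def logbinom(n, k):
    """Same definition as above, repeated so this snippet is self-contained."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

# compare against the logarithm of the exact coefficient for small cases
for n, k in [(5, 2), (10, 0), (20, 10)]:
    exact = math.log(math.comb(n, k))
    assert math.isclose(logbinom(n, k), exact, rel_tol=1e-9, abs_tol=1e-9)

# in log space, large arguments are computed directly,
# without ever forming the very large integer coefficient
large = logbinom(1000, 733)
print(large)
```

Working with `lgamma` keeps every intermediate quantity on the log scale, which is why it suits data sets with hundreds or thousands of observations.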

In [5]:
###
### log_prob_data_given_graph=?
import itertools
import pprint

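def sufficient_statistics(bn, data):
    """Count how often each value of each vertex occurs per parent configuration.

    Returns a list with one dict per vertex, mapping a tuple of parent
    values to a list of counts indexed by the vertex's encoded value.
    """
    # initialize all counts to zero so that unobserved configurations still appear
    ss = []
    for i in range(bn.num_vertices()):
        parent_possible_values = [bn.possible_values[j] for j in bn.parents(i)]
        vertex_ss = {parent_vals: [0] * len(bn.possible_values[i])
                     for parent_vals in itertools.product(*parent_possible_values)}
        ss.append(vertex_ss)

    # tally each observation under the parent configuration of every vertex
    for values in data:
        encoded_values = bn.encode_values(values)
        for i, value in enumerate(encoded_values):
            parent_values = tuple(encoded_values[j] for j in bn.parents(i))
            ss[i][parent_values][value] += 1
    return ss

def model_evidence(bn, data):
    """Return log P(D | G) under Beta(1,1) priors; assumes binary random variables."""
    ss = sufficient_statistics(bn, data)
    me = 0.0
    for vertex_ss in ss:
        for count_vector in vertex_ss.values():
            total_counts = sum(count_vector)
            # each (vertex, parent configuration) pair contributes
            # log[N0! N1! / (N + 1)!] = -(log(N + 1) + log C(N, N0))
            me -= (math.log(total_counts + 1) + logbinom(total_counts, count_vector[0]))
    return me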

log_prob_data_given_graph = model_evidence(g, data)
pprint.pprint(sufficient_statistics(g, data), width=20)
###


[{(): [733, 267]},
 {(): [738, 262]},
 {(0, 0): [479, 63],
  (0, 1): [177, 14],
  (1, 0): [56, 140],
  (1, 1): [22, 49]}]


In [9]:
print(log_prob_data_given_graph)

-1579.0685570479313
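
As a cross-check that does not depend on the `bayesian_network` module, the score can be recomputed directly from the sufficient statistics printed above. This is a minimal sketch: the names `count_vectors`, `log_marginal`, and `score_from_counts` are ours, and the counts are copied by hand from the printed output.

```python
import math

# [count of value 0, count of value 1] for each (variable, parent configuration),
# copied from the sufficient statistics printed above
count_vectors = [
    [733, 267],                                   # x1 (no parents)
    [738, 262],                                   # x2 (no parents)
    [479, 63], [177, 14], [56, 140], [22, 49],    # x3 given (x1, x2)
]

def log_marginal(n0, n1):
    """log[n0! n1! / (n0 + n1 + 1)!]: Beta(1,1) marginal likelihood of one count vector."""
    return math.lgamma(n0 + 1) + math.lgamma(n1 + 1) - math.lgamma(n0 + n1 + 2)

score_from_counts = sum(log_marginal(n0, n1) for n0, n1 in count_vectors)
print(score_from_counts)
```

The result should agree with the value printed above up to floating-point rounding, since both expressions are the same quantity written with and without the binomial coefficient.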


In [6]:
# test for log_prob_data_given_graph
assert isinstance(log_prob_data_given_graph, float)
assert -2000 < log_prob_data_given_graph < 0
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## PROBLEM 2: Counting the number of possible Bayesian network structures (1 POINT)
For a Bayesian network of *three* random variables (like the one in Problem 1), how many possible Bayesian network structures, i.e., directed acyclic graphs over the three labeled variables, are there?  Assign your answer to the variable `num_3var_networks` below.

In [7]:
###
num_3var_networks = 25
###
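
The count can be checked by brute force: enumerate every subset of the possible directed edges and keep the subsets that contain no directed cycle. The sketch below does this and compares the result with Robinson's recurrence for the number of labeled DAGs. All function names here are ours, and `math.comb` assumes Python 3.8 or later.

```python
import itertools
from math import comb

def is_acyclic(n, edges):
    """True if the directed graph on vertices 0..n-1 has no directed cycle."""
    remaining = set(range(n))
    while remaining:
        # vertices with no incoming edge from another remaining vertex
        sources = [v for v in remaining
                   if not any(w == v and u in remaining for u, w in edges)]
        if not sources:
            return False  # every remaining vertex has a remaining parent: a cycle
        remaining.difference_update(sources)
    return True

def count_dags_brute_force(n):
    """Count acyclic subsets of the n * (n - 1) possible directed edges."""
    possible_edges = [(u, v) for u in range(n) for v in range(n) if u != v]
    return sum(is_acyclic(n, edges)
               for r in range(len(possible_edges) + 1)
               for edges in itertools.combinations(possible_edges, r))

def count_dags_recurrence(n):
    """Robinson's recurrence for the number of DAGs on n labeled vertices."""
    a = [1]  # a[0] = 1: the empty graph
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

print(count_dags_brute_force(3), count_dags_recurrence(3))
```

Brute force is only practical for a handful of variables, because the number of edge subsets is $2^{n(n-1)}$; the recurrence shows how quickly the number of structures grows beyond that.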


In [8]:
# test for num_3var_networks
assert isinstance(num_3var_networks, int)
assert num_3var_networks > 0
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## BONUS PROBLEM: Find the Bayesian network structure with maximum score
For the dataset in Problem 1, find the Bayesian network structure that gives the maximum score (model evidence).
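
With only three variables, one possible approach is exhaustive search: score every DAG and keep the best. The sketch below is self-contained and does not use the `bayesian_network` module. The function names and the toy data set (in which the third variable is the XOR of the first two) are hypothetical illustrations, not the course data, and the scoring assumes binary variables with $\mathrm{Beta}(1,1)$ priors as in Problem 1.

```python
import itertools
import math

def is_acyclic(n, edges):
    """True if the directed graph on vertices 0..n-1 has no directed cycle."""
    remaining = set(range(n))
    while remaining:
        sources = [v for v in remaining
                   if not any(w == v and u in remaining for u, w in edges)]
        if not sources:
            return False
        remaining.difference_update(sources)
    return True

def all_dags(n):
    """Yield every DAG on n labeled vertices as a tuple of (parent, child) edges."""
    possible_edges = [(u, v) for u in range(n) for v in range(n) if u != v]
    for r in range(len(possible_edges) + 1):
        for edges in itertools.combinations(possible_edges, r):
            if is_acyclic(n, edges):
                yield edges

def log_evidence(n, edges, data):
    """log P(D | G) for binary variables with Beta(1,1) priors."""
    total = 0.0
    for child in range(n):
        parents = [u for u, w in edges if w == child]
        counts = {}  # parent configuration -> [count of 0, count of 1]
        for row in data:
            key = tuple(row[p] for p in parents)
            counts.setdefault(key, [0, 0])[row[child]] += 1
        # unobserved parent configurations contribute log(1) = 0, so skipping them is safe
        for n0, n1 in counts.values():
            total += math.lgamma(n0 + 1) + math.lgamma(n1 + 1) - math.lgamma(n0 + n1 + 2)
    return total

def best_structure(n, data):
    """Return (score, edges) for the highest-scoring DAG (first one found on ties)."""
    return max(((log_evidence(n, edges, data), edges) for edges in all_dags(n)),
               key=lambda pair: pair[0])

# hypothetical toy data: 100 observations with x3 = x1 XOR x2
toy_data = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)] * 25
best_score, best_edges = best_structure(3, toy_data)
empty_score = log_evidence(3, (), toy_data)
print(best_score, best_edges)
```

To apply this to the Problem 1 data, pass `data` in place of `toy_data`. Several structures can share the maximum score, so the returned edge set is one maximizer rather than necessarily the only one.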