# Day 25 notebook

The objectives of this notebook are to practice

* computing the marginal distribution for a subset of variables
* computing the Kullback–Leibler divergence
* selecting candidate parents for variables, as is done in the Sparse Candidate Algorithm

## Modules used for this assignment

In [1]:
# standard library modules
import math             # for log
import random           # for seed
import collections      # for Counter

# course modules
import bayesian_network # for the BayesianNetwork class

## Updates to the `bayesian_network` module

The `BayesianNetwork` class has been updated a bit since last time.  Please note the following new methods:
1. `joint_dist`: Returns the joint probability distribution represented by the network.
2. `estimate_parameters`: Estimates (and sets) the parameters of the model given a data set using maximum likelihood.

In addition, the `bayesian_network` module has a few functions for creating some of the example networks, which we will use in this activity:

In [2]:
lac_operon_network = bayesian_network.make_lac_operon_network()
flight_weather_network = bayesian_network.make_flight_weather_network()

We will use simulated datasets sampled from each of these example networks:

In [3]:
random.seed(5)
lac_operon_dataset = [lac_operon_network.sample() for _ in range(1000)]
random.seed(42)
flight_weather_dataset = [flight_weather_network.sample() for _ in range(100)]

In this activity we will need to work with the joint (and marginal) distributions estimated from these data sets, which we will refer to as the "empirical" distributions.  We will borrow a function from the Day 10 notebook that we used for estimating a joint distribution:

In [4]:
def estimate_joint_dist(observations):
    """Returns a joint distribution for the random variables measured in observations.
    
    Args:
        observations: a list of tuples, where the ith element of each tuple represents the
            observed value of the ith random variable.  
    Returns:
        A joint distribution in the form of a dictionary with random variable configurations
        as keys and probabilities as values.
    """
    counter = collections.Counter(observations)
    return {key: count / len(observations) for key, count in counter.items()}
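
To make the dictionary representation concrete, here is a minimal sketch on a made-up dataset of four observations. The weather and status values are illustrative assumptions, not the course data, and the counting logic is repeated inline so that the snippet runs on its own.

```python
import collections

# Hypothetical toy dataset: each tuple is (weather, status).
toy_observations = [
    ("sun", "on-time"),
    ("sun", "on-time"),
    ("rain", "delayed"),
    ("sun", "delayed"),
]

# Same counting idea as estimate_joint_dist: the relative frequency of each configuration.
counts = collections.Counter(toy_observations)
toy_joint = {config: count / len(toy_observations) for config, count in counts.items()}

print(toy_joint)
```

Each key is a full configuration of the variables, and the values sum to 1. Configurations that never occur in the data are simply absent from the dictionary.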

In [5]:
lac_operon_dataset_joint = estimate_joint_dist(lac_operon_dataset)
flight_weather_dataset_joint = estimate_joint_dist(flight_weather_dataset)

## PROBLEM 1: Computing the marginal distribution for a *subset* of variables (1 POINT)
In the Day 10 notebook you implemented a function for computing the marginal distribution of a single variable given a joint distribution.  In general, one can compute a marginal distribution for any (strict) subset of the variables in a joint distribution by summing out all of the variables that are not in that subset.  This will be important for determining the marginal distribution for a pair of random variables in the Sparse Candidate Algorithm.  Implement the `compute_marginal_dist` function below.
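
As a small worked illustration of summing out variables, the sketch below reduces a made-up joint distribution over three binary variables to the marginal over the variables at indices 0 and 2. The helper name `marginalize` and all of the numbers are assumptions chosen for illustration.

```python
import collections

def marginalize(joint, keep):
    """Sums out every variable whose index is not in `keep` (hypothetical helper)."""
    marginal = collections.defaultdict(float)
    for config, prob in joint.items():
        # Project the full configuration onto the kept indices and accumulate its mass.
        marginal[tuple(config[i] for i in keep)] += prob
    return dict(marginal)

# Made-up joint distribution over three binary variables (A, B, C).
toy_joint = {
    (0, 0, 0): 0.125, (0, 0, 1): 0.125,
    (0, 1, 0): 0.25,  (0, 1, 1): 0.0,
    (1, 0, 0): 0.125, (1, 0, 1): 0.125,
    (1, 1, 0): 0.0,   (1, 1, 1): 0.25,
}

# Keep A (index 0) and C (index 2); B (index 1) is summed out.
a_c_marginal = marginalize(toy_joint, (0, 2))
print(a_c_marginal)
```

For instance, the entry for `(0, 0)` collects the mass of `(0, 0, 0)` and `(0, 1, 0)`, giving 0.125 + 0.25 = 0.375.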

In [6]:
def compute_marginal_dist(joint_distribution, indices):
    """Returns the marginal distribution of a subset of random variables in a joint distribution.
    
    Args:
        joint_distribution: a distribution in the form of a dictionary with random variable
            configurations (tuples) as keys and probabilities as values.
        indices: a tuple or list of indices of the random variables to keep in the marginal distribution.
    Returns:
        The marginal distribution (using the same dictionary-based representation as the input joint distribution).
    """
    ###
    ### YOUR CODE HERE
    dist = collections.defaultdict(float)
    for joint_config, prob in joint_distribution.items():
        dist[tuple(joint_config[i] for i in indices)] += prob
    return dist
    ###


In [7]:
# tests for compute_marginal_dist
def round_dist(dist, digits=5):
    """Returns a new distribution with probabilities rounded to the specified number of digits."""
    return {key: round(value, digits) for key, value in dist.items()}

# marginal distribution of weather (index 1) and status (index 2) variables in flight_weather system
weather_status_dist = {
    ('rain', 'delayed'): 0.159,
    ('rain', 'on-time'): 0.141,
    ('snow', 'delayed'): 0.174,
    ('snow', 'on-time'): 0.026,
    ('sun', 'delayed'): 0.085,
    ('sun', 'on-time'): 0.415}
assert round_dist(compute_marginal_dist(flight_weather_network.joint_dist(), (1, 2))) == weather_status_dist

# marginal distribution of airline (index 0) and status (index 2) variables in flight_weather system
airline_status_dist = {
    ('Delta', 'delayed'): 0.117,
    ('Delta', 'on-time'): 0.183,
    ('United', 'delayed'): 0.301,
    ('United', 'on-time'): 0.399}
assert round_dist(compute_marginal_dist(flight_weather_network.joint_dist(), (0, 2))) == airline_status_dist

# marginal distribution of status (index 2) variable in flight_weather system
status_dist = {('delayed',): 0.418, ('on-time',): 0.582}
assert round_dist(compute_marginal_dist(flight_weather_network.joint_dist(), (2,))) == status_dist

# marginal distribution of L (index 0), G (index 2), and Z (index 6) variables in lac_operon_network system
L_G_Z_dist = {
    ('absent', 'absent', 'absent'): 0.3033,
    ('absent', 'absent', 'high'): 0.09149,
    ('absent', 'absent', 'low'): 0.05521,
    ('absent', 'present', 'absent'): 0.3033,
    ('absent', 'present', 'high'): 0.05577,
    ('absent', 'present', 'low'): 0.09093,
    ('present', 'absent', 'absent'): 0.0085,
    ('present', 'absent', 'high'): 0.03083,
    ('present', 'absent', 'low'): 0.01067,
    ('present', 'present', 'absent'): 0.0085,
    ('present', 'present', 'high'): 0.01099,
    ('present', 'present', 'low'): 0.03052}

assert round_dist(compute_marginal_dist(lac_operon_network.joint_dist(), (0, 2, 6))) == L_G_Z_dist

print("SUCCESS: compute_marginal_dist passed all tests!")

SUCCESS: compute_marginal_dist passed all tests!


## PROBLEM 2: Computing the Kullback–Leibler divergence (1 POINT)
The Sparse Candidate algorithm uses Kullback–Leibler divergence to identify candidate parents for each random variable.  Implement the computation of the Kullback–Leibler divergence in the function `kl_divergence` below.
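
For reference, for discrete distributions $P$ and $Q$ over the same configurations $x$, the divergence is $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$, where terms with $P(x) = 0$ contribute zero. The sketch below illustrates two properties on made-up coin distributions: the divergence is zero for identical distributions, and it is not symmetric in its arguments. The helper name `kl` is hypothetical, and the sketch assumes the natural logarithm and that $Q(x) > 0$ wherever $P(x) > 0$.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats; assumes q[x] > 0 wherever p[x] > 0 (hypothetical helper)."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Made-up distributions over a single coin flip.
fair = {("heads",): 0.5, ("tails",): 0.5}
biased = {("heads",): 0.9, ("tails",): 0.1}

print(kl(fair, fair))      # D_KL(fair || fair)
print(kl(fair, biased))    # D_KL(fair || biased)
print(kl(biased, fair))    # D_KL(biased || fair), a different value
```

Because of this asymmetry, the order of the arguments matters whenever two distributions are compared.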

In [8]:
def kl_divergence(p, q):
    """Computes the Kullback–Leibler divergence from Q to P, i.e., D_KL(P || Q)
    
    Args:
        p and q: distributions over the same set of random variables.  Each distribution is 
                 represented as a dictionary with configurations (tuples) as keys and probabilities
                 as values.
    Returns:
        The Kullback–Leibler divergence as a floating point value.
    """
    ###
    ### YOUR CODE HERE
    # configurations with p[config] == 0 contribute nothing (0 * log 0 is taken to be 0)
    return sum(p[config] * math.log(p[config] / q[config]) for config in p if p[config])

    ###


In [9]:
# tests for kl_divergence

# flight status 
true_status_dist = {('delayed',): 0.418, ('on-time',): 0.582}
empirical_status_dist = {('on-time',): 0.64, ('delayed',): 0.36}
assert round(kl_divergence(true_status_dist, empirical_status_dist), 5) == 0.00715

# weather, flight status
true_weather_status_dist = {
    ('rain', 'delayed'): 0.159,
    ('rain', 'on-time'): 0.141,
    ('snow', 'delayed'): 0.174,
    ('snow', 'on-time'): 0.026,
    ('sun', 'delayed'): 0.085,
    ('sun', 'on-time'): 0.415}
empirical_weather_status_dist = {
    ('rain', 'delayed'): 0.07,
    ('rain', 'on-time'): 0.2,
    ('snow', 'delayed'): 0.19,
    ('snow', 'on-time'): 0.08,
    ('sun', 'delayed'): 0.1,
    ('sun', 'on-time'): 0.36}
assert round(kl_divergence(true_weather_status_dist, empirical_weather_status_dist), 5) == 0.08182

# L, G, Z
true_L_G_Z_dist = {
    ('absent', 'absent', 'absent'): 0.3033,
    ('absent', 'absent', 'high'): 0.09149,
    ('absent', 'absent', 'low'): 0.05521,
    ('absent', 'present', 'absent'): 0.3033,
    ('absent', 'present', 'high'): 0.05577,
    ('absent', 'present', 'low'): 0.09093,
    ('present', 'absent', 'absent'): 0.0085,
    ('present', 'absent', 'high'): 0.03083,
    ('present', 'absent', 'low'): 0.01067,
    ('present', 'present', 'absent'): 0.0085,
    ('present', 'present', 'high'): 0.01099,
    ('present', 'present', 'low'): 0.03052}
empirical_L_G_Z_dist = {
    ('absent', 'absent', 'absent'): 0.321,
    ('absent', 'absent', 'high'): 0.091,
    ('absent', 'absent', 'low'): 0.058,
    ('absent', 'present', 'absent'): 0.287,
    ('absent', 'present', 'high'): 0.069,
    ('absent', 'present', 'low'): 0.072,
    ('present', 'absent', 'absent'): 0.009,
    ('present', 'absent', 'high'): 0.038,
    ('present', 'absent', 'low'): 0.005,
    ('present', 'present', 'absent'): 0.01,
    ('present', 'present', 'high'): 0.012,
    ('present', 'present', 'low'): 0.028}
assert round(kl_divergence(true_L_G_Z_dist, empirical_L_G_Z_dist), 5) == 0.00811

print("SUCCESS: kl_divergence passed all tests!")

SUCCESS: kl_divergence passed all tests!


## PROBLEM 3: Selecting candidate parents for variables (1 POINT)

Using your `compute_marginal_dist` and `kl_divergence` functions, we can now implement the "restrict" step of the Sparse Candidate algorithm.  In this step, the parameters of the current network are estimated from the dataset, and the empirical and network pairwise marginal distributions are then compared using KL-divergence.  For each variable, a set of $k$ candidate parents is selected, consisting of the variable's current parents plus the other variables for which the KL-divergence is highest.  Most of the implementation is provided for you, except for `candidate_parents`, which selects the candidate parents for a single variable.  Fill in the implementation for this function.
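
The selection rule for a single variable can be sketched on made-up numbers as follows. This is only an illustration under assumptions: the helper `pick_candidates`, the divergence values, and the current parent set are all hypothetical. The sketch also skips the variable itself and does nothing special about ties.

```python
import heapq

def pick_candidates(i, divergences, current_parents, k):
    """Keeps the current parents, then fills up to k with the highest-divergence others."""
    candidates = set(current_parents)
    # Variables that could still be added: not variable i itself, not already a parent.
    others = [j for j in range(len(divergences)) if j != i and j not in candidates]
    num_needed = max(0, k - len(candidates))
    candidates.update(heapq.nlargest(num_needed, others, key=lambda j: divergences[j]))
    return candidates

# Made-up divergences between variable 0 and variables 0..4 (entry 0 is the variable itself).
toy_divergences = [0.0, 0.02, 0.30, 0.11, 0.07]

print(pick_candidates(0, toy_divergences, current_parents={1}, k=3))
```

Here variable 1 is kept because it is already a parent, even though its divergence is the smallest. Variables 2 and 3 fill the remaining two slots because they have the largest divergences among the rest.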

In [10]:
def matrix(num_rows, num_cols, value=None):
    """Constructs a matrix (a list of lists)"""
    return [[value] * num_cols for _ in range(num_rows)]

def pair_marginal_divergences(empirical_joint_dist, network_joint_dist):
    """For each pair of variables, computes the KL-divergence between the empirical and network
       derived marginal distributions for that pair of variables.

    Args: 
        empirical_joint_dist: the empirical joint distribution (dictionary representation)
        network_joint_dist:  the network joint distribution (dictionary representation)
    Returns:
        A matrix (list of lists) M with M[i][j] giving the KL-divergence between empirical and network
        derived marginal distributions of variables i and j
    """
    # the length of each configuration is the # of variables 
    # (we'll grab one configuration by taking first key when iterating over the empirical_joint_dist)
    num_variables = len(next(iter(empirical_joint_dist))) 
    M = matrix(num_variables, num_variables, 0)
    for i in range(num_variables):
        for j in range(i + 1, num_variables):
            empirical_marginal = compute_marginal_dist(empirical_joint_dist, (i, j))
            network_marginal = compute_marginal_dist(network_joint_dist, (i, j))
            M[i][j] = M[j][i] = kl_divergence(empirical_marginal, network_marginal)
    return M

def restrict(network, dataset, k):
    """Computes the candidate parents for all variables in the network, given a dataset.
    
    Args:
        network: The Bayesian network
        dataset: a list of observations (tuples)
        k: the number of candidate parents to select for each variable
    Returns:
        A list of sets of the candidate parents for each variable.
    """
    # compute maximum likelihood (ML) estimates for the network for the given data
    network.estimate_parameters(dataset)
    
    # compute the joint distribution represented by the network, given the ML parameters
    network_joint_dist = network.joint_dist()
    
    # compute the empirical joint distribution from the data
    empirical_joint_dist = estimate_joint_dist(dataset)
    
    # For each pair of variables, compute the KL-divergence between the empirical and network
    # marginal distributions for that pair of variables
    M = pair_marginal_divergences(empirical_joint_dist, network_joint_dist)
    
    # Return a list of the candidate parent sets for each variable
    return [candidate_parents(network, i, M[i], k) for i in range(network.num_vertices())]

In [11]:
def candidate_parents(network, i, divergences, k):
    """Computes the k candidate parents for variable i in the network, given the computed divergences.
    
    The candidate parent set will always include the current parents of variable i.  Other variables will
    be added to the candidate set based on the divergences.
    Args:
        network: The Bayesian network
        i: the index of the variable for which to select parents
        divergences: a list of the pairwise marginal divergences between variable i and all other variables
        k: the number of candidate parents to select
    Returns:
        A set of the candidate parent variable indices.
    """
    ###
    ### YOUR CODE HERE
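    parents = set(network.parents(i))
    # rank the remaining variables by divergence, excluding variable i itself
    sorted_divergences = sorted((divergence, j) for j, divergence in enumerate(divergences)
                                if j != i and j not in parents)
    # add the highest-divergence variables until k candidates are selected (or none remain)
    while len(parents) < k and sorted_divergences:
        parents.add(sorted_divergences.pop()[1])
    return parents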
    ###


In [12]:
# test candidate_parents
empty_lac_operon_network = bayesian_network.make_empty_lac_operon_network()
assert (restrict(empty_lac_operon_network, lac_operon_dataset, 1) == 
        [{4}, {4}, {5}, {5}, {6}, {2}, {4}])
assert (restrict(empty_lac_operon_network, lac_operon_dataset, 2) ==
        [{4, 6}, {4, 6}, {5, 6}, {5, 6}, {0, 6}, {2, 3}, {1, 4}])

assert (restrict(lac_operon_network, lac_operon_dataset, 2) ==
        [{5, 6}, {2, 6}, {1, 6}, {0, 6}, {0, 1}, {2, 3}, {4, 5}])
assert (restrict(lac_operon_network, lac_operon_dataset, 3) ==
        [{3, 5, 6}, {2, 5, 6}, {1, 4, 6}, {0, 4, 6}, {0, 1, 5}, {2, 3, 4}, {1, 4, 5}])

print("SUCCESS: candidate_parents passed all tests!")

SUCCESS: candidate_parents passed all tests!
