In [1]:
import numpy as np
import matplotlib.pyplot as plt

from bayesian_network import BayesNet
from utils import sample_forward, get_default_bayes_net

# Parameter Learning

In this problem, we will assume that a fixed dependency graph structure between variables is given and learn the parameters (the complete Conditional Probability Distribution Table (CPDT)) from a set of events. Furthermore, we will use log-likelihood to find a model structure that also generalizes to future data.
    
## ML Estimates for Conditional Distributions

<div class="alert alert-warning">
    Implement the <i>maximum_likelihood_estimate</i> function, which computes the Maximum Likelihood Estimate for the parameters of a discrete (conditional) probability distribution $ P(X \mid \mathit{pa}(X) )$, given a data set. (3 points)
</div>

`maximum_likelihood_estimate` takes  three parameters:
- `data` is a NumPy array of shape `(num_samples, num_variables)`.
- `variable_id` is the column index of the variable to estimate the distribution for.
- `parent_ids` is a tuple, containing the column indices of parent variables.
- `alpha` is a non-negative integer, corresponding to pseudocounts in Laplace smoothing.

`maximum_likelihood_estimate` must return one object:
- A Maximum Likelihood Estimate (MLE) of the parameters in form of a `np.ndarray`. The first dimension (index `0`) of the returned array must correspond to variable `variable_id`, the remaining dimensions must be sorted according to `parent_ids`. Altogether, tuple `(variable_id, ) + parent_ids` gives the mapping of dimensions to variables.

Hint:
- Assume that all variables are boolean.
- To count elements in a Numpy array, you simply loop over the data array.
- The smoothing parameter `alpha` is added to the counts of each possible event represented in the CPDT.

In [3]:
def maximum_likelihood_estimate(data: np.ndarray, variable_id: int, parent_ids: tuple=tuple(), alpha: int=0):
    """
    Estimates the conditional probability distribution of a (discrete) variable from data.
    :param data:    data to estimate distribution from
    :param variable_id:  column index corresponding to the variable we estimate the distribution for
    :param parent_ids: column indices of the variables the distribution is conditioned on
    :param alpha: smoothing parameter, pseudocounts
    :returns: estimated conditional probability distribution table
    """
    
    assert type(variable_id) == int
    assert type(parent_ids) == tuple
    
    # mapping of axis to variable_id,
    # e.g. the variable with id variable_ids[i] is on axis i of the CPDT
    variable_ids = (variable_id,) + parent_ids
    
    # YOUR CODE HERE
    cpdt = np.zeros([2]*(len(variable_ids)), np.float64)
    accum = np.zeros([2]*(len(variable_ids)), np.int64)
    #print("cpdt:",cpdt.shape)
    #print("accum:",accum.shape)
    
    
    for sample in data:
        data_s = [sample[indexer] for indexer in (variable_ids)] 
        accum[tuple(data_s)] += 1
    
    cpdt[0] = (accum[0] + alpha) / (accum.sum(axis=0, keepdims=True)[0] + 2*alpha + 1e-10)
    cpdt[1] = 1 - cpdt[0]
   # print("cpdt:",cpdt)
   # diff = np.abs(expected - actual)
    #print("Differences:", diff)
    #raise NotImplementedError()
    
    return cpdt

In [4]:
# sanity checks
_A_, _B_, _C_, _D_, _E_ = 0, 1, 2, 3, 4
# get the bayes net from the previous problem
bayes_net = get_default_bayes_net()
np.random.seed(0)
# draw 100 samples
data = sample_forward(bayes_net, 100)

# get exact A form bayes net
expected = bayes_net[_A_].pdt[:,0,0,0,0]
# estimate A from the data
actual = maximum_likelihood_estimate(data, _A_)
# estimate should not be far off
assert np.all(np.isclose(expected, actual, atol=0.05))

# get exact B_A form bayes net
expected = bayes_net[_B_].pdt[:,:,0,0,0].T
# estimate B_A from data
actual = maximum_likelihood_estimate(data, _B_, (_A_,))
# estimate should not be far off
assert np.all(np.isclose(expected, actual, atol=0.05))

# test if alpha correctly added
expected = [0.29166667, 0.70833333]
# estimate A from the data with alpha=10
actual = maximum_likelihood_estimate(data, _A_, alpha=10)
# estimate should not be far off
assert np.all(np.isclose(expected, actual, atol=0.0001))


### The Log-Likelihood Function

<div class="alert alert-warning">
    Implement the <i>log_likelihood</i> function, which computes the log-likelihood $\mathcal{L}(\mathcal{M} : \mathcal{D})$ of a model (BayesNet) relative to a data set. (3 points)
</div>

`log_likelihood` takes two parameters:
- `data` is a NumPy array of shape `(num_samples, num_variables)`.
- `bayes_net` a BayesNet object representing the model $\mathcal{M}$ (containing already estimated CPDTs).

`log_likelihood` must return one object:
- The log-likelihood of the model given the data (i.e., a floating point number (<= 0)).

Hint:
- Recall that iterating over the variables in the BayesNet is super easy: `for variable in bayes_net: ...`.
- The probability distribution of variable $X$ given its parents $\mathit{pa}(X)$, $P(X \mid \mathit{pa}(X))$, can be obtained by passing the random event to the variable, i.e., `variable(data[i])`.
- Use the natural logarithm for your computations, i.e. `np.log`.

In [6]:
def log_likelihood(data: np.ndarray, bayes_net: BayesNet):
    """
    Computes the log-likelihood of a given Bayesian network relative to the data.
    :param data: data to compute the log-likelihood relative to.
    :param bayes_net: Bayesian network model.
    :returns: the log-likelihood of the Bayesian network relative to the data.
    """    

    ll = 0
    
    # YOUR CODE HERE
    for sample in data:
        for variable in bayes_net:
            lk = variable(sample)[sample[variable.id]]
            ll = ll + np.log(lk)
  #  raise NotImplementedError()
    
    return ll

In [8]:
# sanity checks
# get the bayes net from the previous problem
bayes_net = get_default_bayes_net()
np.random.seed(0)
# draw 100 samples
data = sample_forward(bayes_net, 100)

# expected log-likelihood
expected = -215.9
# actual log-likelihood
actual = log_likelihood(data, bayes_net)

# must be close
assert np.all(np.isclose(expected, actual, atol=0.1))


# remove unused variables
del data
del bayes_net

## Finding a Model for Strokes   

After watching hours of medical dramas on television, you try to figure out the perfect prediction model for strokes. Some of your computer science colleagues told you about how Bayesian networks can be used for symptom diagnosis, so you decide to model your ideas using this technique. 

Let's assume that you magically know the true underlying Bayes model (structure and parameters); all variables in this example are boolean (false=0 or true=1).  

<img  style='width:100%;  max-width:400px;' src="img/bn_mod2.svg">

The conditional probability tables are given as follows:

<table style="float: left;margin:5px;"><tr><th>P(A)</th><th>$a_0$<br></th><th>$a_1$</th></tr><tr><td>-</td><td>0.01</td><td>0.99</td></tr></table>

<table style="float: left;margin:5px;"><tr><th>P(H | A)</th><th>$a_0$<br></th><th>$a_1$</th></tr><tr><td>$h_0$</td><td>0.9</td><td>0.8</td></tr><tr><td>$h_1$</td><td>0.1</td><td>0.2</td></tr></table>

<table style="float: left;margin:5px;"><tr><th>P(S | H)</th><th>$h_0$<br></th><th>$h_1$</th></tr><tr><td>$s_0$</td><td>0.9</td><td>0.85</td></tr><tr><td>$s_1$</td><td>0.1</td><td>0.15</td></tr></table>


<table style="float: left;margin:5px;"><tr><th rowspan="2">P(C | A, S)</th><th colspan="2">$a_0$<br></th><th colspan="2">$a_1$</th></tr><tr><td>$s_0$</td><td>$s_1$</td><td>$s_0$</td><td>$s_1$</td></tr><tr><td>$c_0$<br></td><td>0.8</td><td>0.7</td><td>0.85</td><td>0.45</td></tr><tr><td>$c_1$</td><td>0.2</td><td>0.3</td><td>0.15</td><td>0.55</td></tr></table>

<table style="float: left;margin:5px;"><tr><th>P(V | S)</th><th>$s_0$</th><th>$s_1$</th></tr><tr><td>$v_0$</td><td>0.1</td><td>0.2</td></tr><tr><td>$v_1$</td><td>0.9</td><td>0.8</td></tr></table>  

In order to find a good model, you would need to collect a lot of training examples.

But since we know the true undelying model, you can instead just sample 5000 events from this network as the training data, and 5000 samples as the test data.

In [11]:
_A_, _H_, _S_, _C_, _V_ = 0, 1, 2, 3, 4
A = np.array([0.01, 0.99])
H_A = np.array([[0.9, 0.8], [0.1, 0.2]])
S_H = np.array([[0.9, 0.85], [0.1, 0.15]])
C_AS = np.array([[[0.8, 0.7], [0.85, 0.45]], [[0.2, 0.3], [0.15, 0.55]]])
V_S = np.array([[0.1, 0.2], [0.9, 0.8]])

# this bayes net represents the true underlying full joint distribution in the medical world 
true_bayes_net = BayesNet(
    (A, (_A_,)),
    (H_A, (_H_,_A_)),
    (S_H, (_S_,_H_)),
    (C_AS, (_C_,_A_,_S_)),
    (V_S, (_V_,_S_))
)
np.random.seed(0)
train = sample_forward(true_bayes_net, 5000)
test = sample_forward(true_bayes_net, 5000)

<div class="alert alert-warning">
    Based on the sampled training data points, estimate the (conditional) probability tables for the true underlying network structure and compute its log-likelihood w.r.t the training data. (1 point)
</div>
  
  
Store the CPDTs into the provided variables. The dimensions of the CPDT must be sorted according to the naming of the variable, e.g., in C_AS, dimension 0 corresponds to C, dimension 1 to A, and dimension 2 to S.

**Hint**:
- Use the two functions you implemented above (`maximum_likelihood_estimate` and `log_likelihood`)!
- The training data is stored in variable `train`. 
- `_A_, _H_, _S_, _C_, _V_` hold the column indices (= IDs) of the variables. 

In [16]:
_A_, _H_, _S_, _C_, _V_ = 0, 1, 2, 3, 4

A, H_A, S_H, C_AS, V_S = None, None, None, None, None

# YOUR CODE HERE
A = maximum_likelihood_estimate(train, _A_)
H_A = maximum_likelihood_estimate(train, _H_, (_A_,))
S_H = maximum_likelihood_estimate(train, _S_, (_A_,))
C_AS = maximum_likelihood_estimate(train, _C_, (_A_, _S_))
V_S = maximum_likelihood_estimate(train, _V_, (_S_,))
#raise NotImplementedError()

# begin sanity check
assert np.all(np.isclose(A.sum(axis=0), 1))
assert np.all(np.isclose(H_A.sum(axis=0), 1))
assert np.all(np.isclose(S_H.sum(axis=0), 1))
assert np.all(np.isclose(C_AS.sum(axis=0), 1))
assert np.all(np.isclose(V_S.sum(axis=0), 1))
# end sanity check

bayes_net_1 = BayesNet(
    (A, (_A_,)),
    (H_A, (_H_,_A_)),
    (S_H, (_S_,_H_)),
    (C_AS, (_C_,_A_,_S_)),
    (V_S, (_V_,_S_))
)

tr_log_likelihood_1 = 0

# YOUR CODE HERe
tr_log_likelihood_1 = log_likelihood(train, bayes_net_1)
#raise NotImplementedError()

In [18]:
# sanity check
assert tr_log_likelihood_1 < -8500
assert tr_log_likelihood_1 > -8800


Back to our strokes model: Having no idea of the true underlying network structure, you decide to try out the following very simple model first:
    
<img style='width:100%;  max-width:400px;' src="img/bn_mod1.svg">

<br>

<div class="alert alert-warning">
    Based on the sampled training data points, estimate the (conditional) probability tables for this model and compute its log-likelihood w.r.t the training data. (1 point)
</div>

Store the CPDTs into the provided variables.

**Hint**:
- Use the two functions you implemented above (`maximum_likelihood_estimate` and `log_likelihood`)!
- The training data is stored in variable `train`. 
- `_A_, _H_, _S_, _C_, _V_` hold the column indices (= IDs) of the variables. 

In [20]:
_A_, _H_, _S_, _C_, _V_ = 0, 1, 2, 3, 4

A, H, S, C, V = None, None, None, None, None

# YOUR CODE HERE
A = maximum_likelihood_estimate(train, _A_)
H = maximum_likelihood_estimate(train, _H_)
S = maximum_likelihood_estimate(train, _S_)
C = maximum_likelihood_estimate(train, _C_)
V = maximum_likelihood_estimate(train, _V_)
#raise NotImplementedError()

# begin sanity check
assert np.all(np.isclose(A.sum(axis=0), 1))
assert np.all(np.isclose(H.sum(axis=0), 1))
assert np.all(np.isclose(S.sum(axis=0), 1))
assert np.all(np.isclose(C.sum(axis=0), 1))
assert np.all(np.isclose(V.sum(axis=0), 1))
# end sanity check

bayes_net_2 = BayesNet(
    (A, (_A_,)),
    (H, (_H_,)),
    (S, (_S_,)),
    (C, (_C_,)),
    (V, (_V_,))
)

tr_log_likelihood_2 = 0

# YOUR CODE HERE
tr_log_likelihood_2 = log_likelihood(train, bayes_net_2)
#raise NotImplementedError()

In [21]:
# sanity check
assert tr_log_likelihood_2 < -8500
assert tr_log_likelihood_2 > -8800



Unhappy with the result, you decide to try out a second, more complex model:

<img  style='width:100%;  max-width:400px;' src="img/bn_mod3.svg">

<div class="alert alert-warning">
    Based on the sampled training data points, estimate the (conditional) probability tables for this model and compute its log-likelihood w.r.t the training data. (1 point)
</div>

Store the CPDTs into the provided variables. The dimensions of the CPDT must be sorted according to the naming of the variable, e.g., in C_AS, dimension 0 corresponds to C, dimension 1 to A, and dimension 2 to S.

**Hint**:
- Use the two functions you implemented above (`maximum_likelihood_estimate` and `log_likelihood`)!
- The training data is stored in variable `train`. 
- `_A_, _H_, _S_, _C_, _V_` hold the column indices (= IDs) of the variables. 

In [23]:
_A_, _H_, _S_, _C_, _V_ = 0, 1, 2, 3, 4

A, H_A, S_AH, C_AS, V_CS = None, None, None, None, None

# YOUR CODE HERE
A = maximum_likelihood_estimate(train, _A_)
H_A = maximum_likelihood_estimate(train, _H_, (_A_,))
S_AH = maximum_likelihood_estimate(train, _S_, (_A_,_H_))
C_AS = maximum_likelihood_estimate(train, _C_, (_A_, _S_))
V_CS = maximum_likelihood_estimate(train, _V_, (_C_,_S_))
#raise NotImplementedError()

# begin sanity check
assert np.all(np.isclose(A.sum(axis=0), 1))
assert np.all(np.isclose(H_A.sum(axis=0), 1))
assert np.all(np.isclose(S_AH.sum(axis=0), 1))
assert np.all(np.isclose(C_AS.sum(axis=0), 1))
assert np.all(np.isclose(V_CS.sum(axis=0), 1))
# end sanity check

bayes_net_3 = BayesNet(
    (A, (_A_,)),
    (H_A, (_H_,_A_)),
    (S_AH, (_S_,_A_,_H_)),
    (C_AS, (_C_,_A_,_S_)),
    (V_CS, (_V_,_C_,_S_))
)

tr_log_likelihood_3 = 0

# YOUR CODE HERE
tr_log_likelihood_3 = log_likelihood(train, bayes_net_3)
#raise NotImplementedError()

In [24]:
# sanity check
assert tr_log_likelihood_3 < -8500
assert tr_log_likelihood_3 > -8800


### Compare Train Log-Likelihoods

Compare the log-likelihoods w.r.t the training data of Model **M1** (having the true underlying structure) to the two new models (**M2** - no edges, **M3** - complex model).

In [26]:
print('logP(train|M1) = {}'.format(tr_log_likelihood_1))
print('logP(train|M2) = {}'.format(tr_log_likelihood_2))
print('logP(train|M3) = {}'.format(tr_log_likelihood_3))

logP(train|M1) = -8613.374837357533
logP(train|M2) = -8740.033479009893
logP(train|M3) = -8504.427454360319


<div class="alert alert-warning">
    Answer the following question in one sentence! (1 point)
</div>

Even though **M1** has the true underlying network structure (it correctly represents all independencies holding in our world), it doesn't have the highest train log-likelihood. How do you explain this?

YOUR ANSWER HERE.

M2 and M3 have higher log-likelihood due to the higher flixibelity and more potentiialy the overfitting to the training dataset.

### Compare Test Log-Likelihoods

Finally, we compute the test log-likelihood of the model **M1** (having the true underlying structure) and the newly created models **M2** and **M3**.

In [30]:
te_log_likelihood_1 = log_likelihood(test, bayes_net_1)
te_log_likelihood_2 = log_likelihood(test, bayes_net_2)
te_log_likelihood_3 = log_likelihood(test, bayes_net_3)

print('logP(test|M1) = {}'.format(te_log_likelihood_1))
print('logP(test|M2) = {}'.format(te_log_likelihood_2))
print('logP(test|M3) = {}'.format(te_log_likelihood_3))

logP(test|M1) = -8492.78914241468
logP(test|M2) = -8598.449646741794
logP(test|M3) = -8379.468463963667


<div class="alert alert-warning">
    Answer the following question! Keep your answers short! (1 point)
</div>

What is the difference compared the the log-likelihoods of the training data? Explain the difference!

YOUR ANSWER HERE

the diffrence mainly is in the complexity of the model also overfitting because some models revise pst data rather than generaliz it so leading to overfitting.

### Laplace Smoothing

<div class="alert alert-warning">
    Answer the following question! Keep your answer short! (1 point)
</div>

Estimate the (conditional) probability tables for your model **M3** again. However, this time you only have a training set consisting of 100 samples. You run into the error shown in the output of the code cell below. Explain the source of the problem and how to avoid it by adapting a parameter when calling the function ```maximum_likelihood_estimate```.

YOUR ANSWER HERE

the main problem is that the data set is too small to cover all the possible combinations leading to a zero value for those remaining combinations so missnormalizatoin, and to solve that we have to ada a smoothing factor (alpha) to make sure that all the missed combination have a value not zero.

In [43]:
np.random.seed(0)
# we generate a new training set consisting of 100 samples
train = sample_forward(true_bayes_net, 100)  

A = maximum_likelihood_estimate(train, _A_)
H_A = maximum_likelihood_estimate(train, _H_, (_A_,))
S_AH = maximum_likelihood_estimate(train, _S_, (_A_, _H_,))
C_AS = maximum_likelihood_estimate(train, _C_, (_A_, _S_))
V_CS = maximum_likelihood_estimate(train, _V_, (_S_,_C_,))

bayes_net_3 = BayesNet(
    (A, (_A_,)),
    (H_A, (_H_,_A_)),
    (S_AH, (_S_,_A_,_H_)),
    (C_AS, (_C_,_A_,_S_)),
    (V_CS, (_V_,_C_,_S_))
)