# Parametric learning

This notebook introduces the basic concepts of parametric learning, that is, estimating the parameters of discrete Bayesian networks from data.

Remember that the joint distribution of a Bayesian network factorizes as the product of conditional probability distributions (CPDs), one for each variable. Our job is, then, to learn the conditional probability table (CPT) of each CPD.

Let's start by loading the necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import itertools as it
import matplotlib.pyplot as plt

from pgmpy.models import BayesianModel
from pgmpy.sampling import BayesianModelSampling
from pgmpy.utils import get_example_model

We will start with a basic BN with only two variables:
$$P(X,Y)=P(X|Y)P(Y)$$
Imagine that we are given the following dataset:

In [None]:
df = pd.DataFrame({'X': ["l", "l", "l", "m", "l", "l", "l", "m", "l", "l", "m", "l", "l", "m", "m"], 
                   'Y': ["y", "n", "y", "y", "d", "y", "y", "d", "y", "d", "y", "y", "d", "n", "n"] })
print(df)

The joint probability distribution $P(X,Y)$ can be estimated directly from this dataset as:

In [None]:
df.groupby(["X", "Y"]).size() / df.shape[0]

But we are interested in the factorization, so let's learn the CPTs of $P(Y)$ and $P(X|Y)$. The marginal probability distribution of $Y$ is just the proportion of examples with each value:

In [None]:
y_pos_vals = np.unique(df["Y"])
print("For these possible values of variable Y:",y_pos_vals)
pY = np.array([#### YOUR CODE HERE ####
               for val in y_pos_vals])/df.shape[0]
print(pY)

This amounts to computing the following estimator:
$$\theta_{Y=y}=\frac{n_y}{n}$$
where $n_y$ is the number of cases where value $Y=y$ is observed.
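
As a quick illustration of this estimator, here is a minimal sketch on a hypothetical toy sample (the function name `mle_marginal` and the data are made up for this example and are not the notebook's dataset). It uses only the standard library, so it does not replace the exercise above.

```python
from collections import Counter

def mle_marginal(values):
    """Return {state: n_state / n} for a sequence of observed states."""
    counts = Counter(values)
    n = len(values)
    return {state: count / n for state, count in sorted(counts.items())}

# Hypothetical toy observations, unrelated to the notebook's dataset
toy = ["a", "b", "a", "a"]
theta = mle_marginal(toy)
print(theta)  # {'a': 0.75, 'b': 0.25}
```

The estimates always sum to 1, because every observation is counted exactly once.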

In order to compute the CPT of $X|Y$, we need to account for the distribution of the values of $X$ for each value of $Y$ separately:

In [None]:
y_pos_vals = np.unique(df["Y"])
x_pos_vals = np.unique(df["X"])
print("For these possible values of variable X:",x_pos_vals)

# Rows index the values of X, columns the values of Y; each column sums to 1
pXY = np.transpose(np.array([np.array([np.sum(np.logical_and(df["X"] == val_x, df["Y"] == val_y))
                                       for val_x in x_pos_vals]) / np.sum(df["Y"] == val_y)
                             for val_y in y_pos_vals]))
print(pXY)

This is indeed the maximum-likelihood estimator (MLE) of a categorical distribution, which is the type of distribution that we are dealing with in this case. It amounts to computing the following estimator:
$$\theta_{X=x|Y=y}=\frac{n_{xy}}{n_y}$$
where $n_{xy}$ is the number of cases where the combination of values $X=x$ and $Y=y$ is observed.
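
The same counting idea can be sketched for the conditional case. The example below is a minimal illustration on hypothetical toy data (`mle_conditional`, `toy_x` and `toy_y` are names introduced here, not part of the notebook), written with the standard library only.

```python
from collections import Counter

def mle_conditional(xs, ys):
    """Return {(x, y): n_xy / n_y} for paired observations of X and Y."""
    n_xy = Counter(zip(xs, ys))   # joint counts
    n_y = Counter(ys)             # counts of the conditioning variable
    x_states = sorted(set(xs))
    # Unseen (x, y) combinations get count 0, hence probability 0
    return {(x, y): n_xy[(x, y)] / n_y[y] for y in sorted(n_y) for x in x_states}

# Hypothetical toy observations, unrelated to the notebook's dataset
toy_x = ["u", "u", "v", "u", "v"]
toy_y = ["a", "a", "a", "b", "b"]
cpt = mle_conditional(toy_x, toy_y)
for (x, y), p in cpt.items():
    print(f"P(X={x} | Y={y}) = {p:.3f}")
```

Note that the probabilities sum to 1 for each value of $Y$ separately, not over the whole table.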

You might have heard of Laplace smoothing, a technique to prevent the numerical problems that maximum-likelihood estimators produce with sparse data or rare events (unobserved combinations get probability 0 in the CPDs). This procedure consists of adding a constant value $l$ to all the counts:

$$\theta_{X=x|Y=y}=\frac{n_{xy}+l}{n_y+l\cdot|\mathcal{X}|}$$
where $|\mathcal{X}|$ is the number of possible values of variable $X$ (in our case, $|\mathcal{X}|=2$), and 
$$\theta_{Y=y}=\frac{n_y+l}{n+l\cdot|\mathcal{Y}|}$$
where $|\mathcal{Y}|=3$ is the number of possible values of $Y$.
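
To see the effect of the smoothing constant, the following minimal sketch applies the formula to a hypothetical vector of counts in which one value was never observed (`laplace_estimate` and `toy_counts` are illustrative names, not part of the notebook).

```python
def laplace_estimate(counts, l=1.0):
    """Smooth a list of counts: (n_i + l) / (n + l * k), with k = len(counts)."""
    total = sum(counts) + l * len(counts)
    return [(c + l) / total for c in counts]

# Hypothetical counts: the first value was never observed
toy_counts = [0, 3]
unsmoothed = laplace_estimate(toy_counts, l=0)  # plain MLE: [0.0, 1.0]
smoothed = laplace_estimate(toy_counts, l=1)    # [0.2, 0.8]
print(unsmoothed, smoothed)
```

With $l=0$ the estimator reduces to the MLE, and the unobserved value gets probability 0. With $l=1$ it receives a small but positive probability.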

In [None]:
l=1 # Laplace smoothing value
y_pos_vals = np.unique(df["Y"])
x_pos_vals = np.unique(df["X"])
print("For these possible values of variable Y:",y_pos_vals)
pY = np.array([#### YOUR CODE HERE ####
               for val in y_pos_vals])/(df.shape[0]+l*len(y_pos_vals))
print("p(Y) with Laplace smoothing (l="+str(l)+"):")
print(pY)

# Same layout as before (rows: X, columns: Y), now with l added to every count
pXY = np.transpose(np.array([np.array([np.sum(np.logical_and(df["X"] == val_x, df["Y"] == val_y)) + l
                                       for val_x in x_pos_vals]) / (np.sum(df["Y"] == val_y) + l*len(x_pos_vals))
                             for val_y in y_pos_vals]))
print("p(X|Y) with Laplace smoothing (l="+str(l)+"):")
print(pXY)

Nice work! We now know how to compute MLE parameters from data. But what happens in more realistic scenarios?

## Playing with a real example

To find out, let's generate a dataset from a real Bayesian network: <a href="https://www.bnlearn.com/bnrepository/discrete-small.html#survey">survey</a>.

<img src="https://www.bnlearn.com/bnrepository/survey/survey.png" width="300" />
We will learn the parameters from the generated data and compare them with the real ones, using pgmpy:

In [None]:
gen_model = get_example_model('survey')
n_samples = 1000

samples = BayesianModelSampling(gen_model).forward_sample(size=n_samples)
samples.head()

Since we are only dealing with parametric learning, let's assume that the real structure is known:

In [None]:
gen_model_struct = BayesianModel(ebunch=gen_model.edges())
gen_model_struct.nodes()

Now, we can use standard pgmpy methods to learn the MLE parameters:

In [None]:
from pgmpy.estimators import MaximumLikelihoodEstimator

mle = MaximumLikelihoodEstimator(model=gen_model_struct, data=samples)

print(gen_model.get_cpds('A'))
print(mle.estimate_cpd(node='A'))
print(gen_model.get_cpds('E'))
print(mle.estimate_cpd(node='E'))

mle.get_parameters()

The `pgmpy` library allows us to carry out all these operations at once by calling the learning function `fit`:

In [None]:
learnt_model = BayesianModel(ebunch=gen_model.edges())
learnt_model.fit(data=samples, estimator=MaximumLikelihoodEstimator)
print(learnt_model.get_cpds('E'))

To test the ability to recover the original parameters, we need to define a measure of error. In this case, we just measure the root-mean-square error (RMSE) between the real and the estimated parameters:

In [None]:
def estimate_error(gen_model, learnt_model):
    """Root-mean-square error between the CPT entries of two models with the same structure."""
    diff = []
    for v in gen_model.nodes():
        real_cpd = gen_model.get_cpds(v)
        est_cpd = learnt_model.get_cpds(v)
        act_states = est_cpd.state_names
        # Visit every combination of states of the variable and its parents
        for tpl in it.product(*act_states.values()):
            assignment = dict(zip(act_states.keys(), tpl))
            diff.append(real_cpd.get_value(**assignment) - est_cpd.get_value(**assignment))
    return np.sqrt(np.mean(np.array(diff)**2))
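
Written as a formula, the function above computes the following quantity, where $\theta_i$ and $\hat{\theta}_i$ denote the real and the estimated value of the $i$-th CPT entry and $m$ is the total number of entries visited (this notation is introduced here for illustration only):

$$\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\theta_i-\hat{\theta}_i\right)^2}$$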

Now, we can generate datasets of different sizes to see how the error decreases as the sample size increases.

In [None]:
errors=[]
sample_sizes = np.arange(100,2001,100)
for n_samples in sample_sizes:
    samples = BayesianModelSampling(gen_model).forward_sample(size=n_samples)
    learnt_model = BayesianModel(ebunch=gen_model.edges())
    learnt_model.fit(data=samples, estimator=MaximumLikelihoodEstimator)
    errors.append(estimate_error(#### YOUR CODE HERE ####))
plt.plot(sample_sizes,errors)
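
The trend in this plot can be reproduced, independently of pgmpy, with a single categorical variable. The sketch below is only an illustration under assumed values: the distribution `true_probs`, the seed and the helper `rmse_of_mle` are hypothetical choices made for this example. It averages the error over 20 repetitions per sample size to reduce noise.

```python
import math
import random
from collections import Counter

def rmse_of_mle(true_probs, n_samples, rng):
    """Sample from a categorical distribution and return the RMSE of its MLE."""
    states = list(range(len(true_probs)))
    draws = rng.choices(states, weights=true_probs, k=n_samples)
    counts = Counter(draws)
    sq = [(p - counts[s] / n_samples) ** 2 for s, p in zip(states, true_probs)]
    return math.sqrt(sum(sq) / len(sq))

rng = random.Random(0)            # fixed seed so the run is repeatable
true_probs = [0.2, 0.5, 0.3]      # hypothetical "real" parameters
mean_errors = {}
for n in (100, 1000, 10000):
    runs = [rmse_of_mle(true_probs, n, rng) for _ in range(20)]
    mean_errors[n] = sum(runs) / len(runs)
    print(n, round(mean_errors[n], 4))
```

Since the standard error of an estimated proportion is $\sqrt{p(1-p)/n}$, the average error is expected to shrink roughly as $1/\sqrt{n}$: multiplying the sample size by 100 should divide the error by about 10.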

## Using Bayesian estimators

Everything that we have seen so far can also be done with a Bayesian estimator (for example, one using a BDeu prior) instead of MLE:

In [None]:
from pgmpy.estimators import BayesianEstimator

n_samples = 1000
samples = BayesianModelSampling(gen_model).forward_sample(size=n_samples)

be = BayesianEstimator(model=gen_model_struct, data=samples)

print(be.estimate_cpd(node='A', prior_type="BDeu", equivalent_sample_size=1000))
be.get_parameters(prior_type="BDeu", equivalent_sample_size=1000)
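
To build some intuition about what the BDeu prior does, here is a minimal sketch of the posterior-mean computation. It assumes that BDeu spreads the equivalent sample size uniformly over all cells of the CPT, so that every cell receives the same pseudo-count; this reflects my understanding of the prior rather than pgmpy's actual code, and `bdeu_posterior_mean` and `toy_counts` are hypothetical names.

```python
def bdeu_posterior_mean(counts, equivalent_sample_size):
    """counts[j][i]: occurrences of child state i under parent configuration j."""
    n_configs = len(counts)
    n_states = len(counts[0])
    # Assumption: the equivalent sample size is split evenly over all CPT cells
    alpha = equivalent_sample_size / (n_states * n_configs)
    cpt = []
    for row in counts:
        total = sum(row) + alpha * n_states
        cpt.append([(c + alpha) / total for c in row])
    return cpt

# Hypothetical counts for a binary child with two parent configurations
toy_counts = [[0, 3], [2, 2]]
cpt = bdeu_posterior_mean(toy_counts, equivalent_sample_size=4)
print(cpt)
```

Under this assumption the pseudo-count plays the same role as the Laplace constant $l$ used earlier: here each cell gets a pseudo-count of 1, so the result coincides with Laplace smoothing with $l=1$. A larger equivalent sample size pulls the estimates more strongly towards the uniform distribution.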

We can also call the `fit` function to learn directly with a Bayesian estimator:

In [None]:
learnt_model = BayesianModel(ebunch=gen_model.edges())
learnt_model.fit(data=samples, estimator=BayesianEstimator, prior_type='BDeu', equivalent_sample_size=100)
print(learnt_model.get_cpds('E'))