# CHAPTER 20 - Learning Probabilistic Models 1

### Instructor: Brandon Haworth

#### Notebook Credit: George Tzanetakis
Jupyter Notebooks you encounter during the course were largely developed by Prof. Tzanetakis from a previous iteration of this course. I've since changed/developed them where necessary for my own iterations of CSC 421.

## WORKPLAN 

The section numbers are based on the 4th edition of the AIMA textbook and are the suggested
reading for this week. Each list entry gives only the sections added at that level. For example, the Expected reading includes the sections listed under Basic as well as the sections listed under Expected. Some additional readings are suggested for Advanced. 

1. Basic: Sections **20.1**, **20.2.1**, **20.2.2**, and **Summary**
2. Expected: Same as Basic plus **20.3**, **20.3.1**, **20.3.3**
3. Advanced: The whole chapter, including the bibliographical and historical notes 


We have covered a variety of probabilistic models that model uncertainty and allow us to do inference in different ways. In this notebook, we describe some of the ways we can estimate probabilistic models from data. 
These techniques provide the connection between statistics, probability, and machine learning. 

The ideas are based on Chapter 20 of the *Artificial Intelligence: A Modern Approach* textbook and specifically Section 20.2, Learning from Complete Data. 

**Density estimation** refers to the task of learning a probability distribution, that is, a probability density function (for continuous models) or a probability mass function (for discrete models), given some data that we assume was generated from that model. **Complete data** means that we have data for all the **variables** in our model. 

The most common type of learning is **parameter learning**, where we assume a particular structure for our model and characterize it by estimating a set of parameters. For example, we might assume a multivariate normal (Gaussian) distribution and estimate the mean vector and the covariance matrix that characterize it. As another example, we might be given the structure of a Bayesian network (in terms of parent/child conditional relationships) and learn the conditional probability tables. We will also briefly discuss the problem of learning structure as well as non-parametric density estimation, in which we don't assume a particular parametric form for the model. 
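
As a rough illustration of parameter learning for the Gaussian case mentioned above, the sketch below generates synthetic two-dimensional data and estimates the mean vector and covariance matrix from it. The "true" parameters, the seed, and the variable names are assumptions made up for this example; they do not come from the textbook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical "true" parameters, used only to generate illustrative data
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[1.0, 0.6],
                     [0.6, 2.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=5000)

# Maximum-likelihood estimates: sample mean and divide-by-N sample covariance
est_mean = X.mean(axis=0)
est_cov = np.cov(X, rowvar=False, bias=True)

print(est_mean)
print(est_cov)
```

With `bias=True`, `np.cov` divides by $N$ rather than $N-1$, which matches the maximum-likelihood estimate; for large samples the difference is negligible.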


# Learning with Complete Data 

## A random variable class 

Define a helper random variable class based on the scipy discrete random variable functionality providing both numeric and symbolic RVs. You don't need to look at the implementation - the usage will be obvious through the examples below. 


In [3]:
%matplotlib inline 
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np 

class Random_Variable: 
    
    def __init__(self, name, values, probability_distribution): 
        self.name = name 
        self.values = values 
        self.probability_distribution = probability_distribution 
        if all(type(item) is np.int64 for item in self.values): 
            self.type = 'numeric'
            self.rv = stats.rv_discrete(name = name, 
                        values = (values, probability_distribution))
        elif all(type(item) is str for item in values): 
            self.type = 'symbolic'
            self.rv = stats.rv_discrete(name = name, 
                        values = (np.arange(len(values)), probability_distribution))
            self.symbolic_values = values 
        else: 
            self.type = 'undefined'
            
    def sample(self,size): 
        if (self.type =='numeric'): 
            return self.rv.rvs(size=size)
        elif (self.type == 'symbolic'): 
            numeric_samples = self.rv.rvs(size=size)
            mapped_samples = [self.values[x] for x in numeric_samples]
            return mapped_samples 
        
    def prob_of_value(self, value): 
        # np.asarray makes the comparison elementwise for plain Python lists too
        indices = np.where(np.asarray(self.values) == value)
        return self.probability_distribution[indices[0][0]]

            

# Likelihood of model given some data 

First, let's review the concept of the likelihood of a model given some data 


Let's start by creating a random variable corresponding to a six-faced die on which each of the numbers 1, 2, and 3 appears on two faces, so each number comes up with equal probability. We can generate random samples from this model. 

In [4]:
import numpy as np
values = np.int64([1, 2, 3])
probabilities = [2/6., 2/6., 2/6.]
dice1 = Random_Variable('dice1', values, probabilities)
samples = dice1.sample(30)
print(samples)

[2 3 1 2 2 2 3 3 3 2 3 2 2 2 1 1 3 2 2 1 2 1 3 3 2 3 2 3 3 3]


Let's also create a random variable where two faces have the number 1, three faces have the number 2, and one face has the number 3. We can also generate random samples from this model.

In [5]:
values = np.int64([1, 2, 3])
probabilities = [2./6, 3./6, 1./6]
dice2 = Random_Variable('dice2', values, probabilities)
samples = dice2.sample(30)
print(samples)

[2 1 1 3 2 2 2 1 3 3 2 1 2 1 3 1 2 1 1 1 2 2 2 2 3 2 2 1 1 2]


The likelihood of a sequence of samples given a model can be obtained by taking the product of the corresponding 
probabilities. We can see that for this particular sequence of data, the likelihood of the model for dice2 is higher. So if we have some data and some specific models we can select the model with the highest likelihood. 

In [6]:
data = [1,2,2,1,1,3,1,2,3,2]
print("Probability of each value in the R.V. for Dice 1")
print("Face with 1: %f" % dice1.prob_of_value(1))
print("Face with 2: %f" % dice1.prob_of_value(2))
print("Face with 3: %f" % dice1.prob_of_value(3))
print("Probability of each value in the R.V. for Dice 2")
print("Face with 1: %f" % dice2.prob_of_value(1))
print("Face with 2: %f" % dice2.prob_of_value(2))
print("Face with 3: %f" % dice2.prob_of_value(3))

def likelihood(data, model):
    likelihood = 1.0 
    for d in data: 
        likelihood *= model.prob_of_value(d)
    return likelihood 

print("\nAssume we have the following observed data:")
print(data)
print("Likelihood for Dice 1: %f" % likelihood(data,dice1))
print("Likelihood for Dice 2: %f" % likelihood(data,dice2))


Probability of each value in the R.V. for Dice 1
Face with 1: 0.333333
Face with 2: 0.333333
Face with 3: 0.333333
Probability of each value in the R.V. for Dice 2
Face with 1: 0.333333
Face with 2: 0.500000
Face with 3: 0.166667

Assume we have the following observed data:
[1, 2, 2, 1, 1, 3, 1, 2, 3, 2]
Likelihood for Dice 1: 0.000017
Likelihood for Dice 2: 0.000021


Notice that even with only 10 values the likelihood is already quite small, and we can expect it to get smaller as the sequences of data get longer. We can use the log-likelihood to avoid this problem.

In [7]:
data = [1,2,2,1,1,3,1,2,3,2]
print(dice1.prob_of_value(1))
print(dice1.prob_of_value(3))
print(dice2.prob_of_value(3))

def log_likelihood(data, model):
    total = 0.0 
    for d in data: 
        total += np.log(model.prob_of_value(d)) # Note the change to summation!
    return total 
    
print("\nAssume we have the following observed data:")
print(data)
print("Log-likelihood for Dice 1: %f" % log_likelihood(data,dice1))
print("Log-likelihood for Dice 2: %f" % log_likelihood(data,dice2))


0.3333333333333333
0.3333333333333333
0.16666666666666666

Assume we have the following observed data:
[1, 2, 2, 1, 1, 3, 1, 2, 3, 2]
Log-likelihood for Dice 1: -10.986123
Log-likelihood for Dice 2: -10.750557
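
To see why the log matters for long sequences, here is a small self-contained sketch. It assumes the same probabilities as dice2 but samples with plain NumPy rather than the `Random_Variable` class; the sequence length of 5000 and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([1, 2, 3])
probs = np.array([2/6, 3/6, 1/6])   # same distribution as dice2

# Draw a long sequence and look up the probability of each observation
long_data = rng.choice(values, size=5000, p=probs)
per_sample_probs = probs[long_data - 1]   # value v is stored at index v - 1

# The raw product underflows double precision; the sum of logs does not
product_likelihood = np.prod(per_sample_probs)
sum_log_likelihood = np.sum(np.log(per_sample_probs))

print(product_likelihood)
print(sum_log_likelihood)
```

Every factor is at most 0.5, so the product of 5000 of them is far below the smallest representable double and collapses to zero, while the log-likelihood remains an ordinary negative number that can still be compared across models.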


In the case above we examined two possible models. One could ask: of all possible models for a particular problem, can we find the one with the highest likelihood? If we have a die with six faces that can only show the numbers 1, 2, and 3, then there is a finite number of models and we can calculate their likelihoods as we did above. However, if we relax the requirement to have a die and simply have the values 1, 2, and 3 with arbitrary associated probabilities, then we have an infinite number of possible models. Without going into the math yet, it turns out that, at least for this particular case, the model with the maximum likelihood can be obtained simply by counting the relative frequencies of the values in the data. This is called maximum likelihood estimation of model parameters.

In [8]:
import collections 

data = [1,2,2,1,1,3,1,2,3,2,2,2,2]
counts = collections.Counter(data)
print(counts)
est_probability_distribution = [counts[1]/float(len(data)), counts[2]/float(len(data)), counts[3]/float(len(data))]
print(est_probability_distribution)

Counter({2: 7, 1: 4, 3: 2})
[0.3076923076923077, 0.5384615384615384, 0.15384615384615385]
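
As a quick sanity check of the claim that counting gives the maximum-likelihood model, the sketch below compares the log-likelihood of the same 13 observations under the two dice models and under the counting estimate. It uses plain dictionaries instead of the `Random_Variable` class so that it runs on its own; the helper name `log_likelihood_from_table` is made up for this example.

```python
import collections
import numpy as np

data = [1, 2, 2, 1, 1, 3, 1, 2, 3, 2, 2, 2, 2]
counts = collections.Counter(data)

# Three candidate models, each a table mapping value -> probability
candidates = {
    'dice1': {1: 2/6, 2: 2/6, 3: 2/6},
    'dice2': {1: 2/6, 2: 3/6, 3: 1/6},
    'counting': {v: counts[v] / len(data) for v in (1, 2, 3)},
}

def log_likelihood_from_table(data, probs):
    # Sum of log-probabilities of the observations (i.i.d. assumption)
    return sum(np.log(probs[d]) for d in data)

scores = {name: log_likelihood_from_table(data, probs)
          for name, probs in candidates.items()}
for name, score in scores.items():
    print(name, score)
```

The counting model should score highest of the three, although on so little data its margin over dice2 is small.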


We can now create a new R.V. using the probability distribution estimated from the frequencies of the observed samples. Samples from it should look like they come from the same source/model.

In [9]:
values = np.int64([1, 2, 3])
probabilities = est_probability_distribution 
model = Random_Variable('model', values, probabilities)
samples = model.sample(30)
print(samples)

[2 2 3 2 3 2 2 2 2 2 2 2 1 1 2 1 1 1 2 2 2 1 2 2 2 1 2 2 1 2]


# Maximum-likelihood parameter learning for Discrete Models  


Let's start by creating a random variable corresponding to a bag of candy with two types, lime and cherry. We can easily generate random samples from this model. For example, in the code below we generate 1000 samples. 

In [10]:
values = ['c', 'l']
probabilities = [0.2, 0.8]
bag1 = Random_Variable('bag1', values, probabilities)
samples = bag1.sample(1000)
print(samples)

['l', 'c', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'l', 'c', 'l', 'l', 'l', 'l', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'l', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'l', 'c', 'l', 'c', 'l', 'l', 'l', 'l', 'c', 'l', 'c', 'l', 'l', 'l', 'c', 'c', 'l', 'c', 'l', 'l', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'l', 'l', 'c', 'c', 'c', 'c', 'c', 'l', 'l', 'l', 'l', 'c', 'c', 'l', 'l', 'l', 'c', 'l', 'l', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'l', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'l', 'c', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'l', 'c', 'l', 'l', 'c', 'l', 'l', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'c', 'l', 'c', 'l', 'l', 'l', 'c', 'l', 'l', 'l', 'l', 'c', 'l', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'l', 'l', 'l', 'l', 'l', 'l', 'c', 'l', 'c', 'l', 'l', 'c', 'l', 'l', 'c', 'l', 'l', 'l', 'l', 'l',

Now imagine that you are given just these samples and told that they came from a bag of candy, but you don't know the percentage of each candy type in the bag and need to estimate it. Let's call the probability that *a candy from the bag is cherry* $\theta$. Then our task of parameter learning is to estimate $\theta$ from the provided samples. Earlier, without much explanation, I stated that the "best" possible model in a maximum likelihood sense can be obtained by simply counting the proportion of each candy type in the samples. 
As the output of the cell below shows, the estimated parameter $\theta$ is close to, but not the same as, the original value of $0.2$. With more samples this estimate would tend to become more accurate.  

We can see that with this simple example, we have the ability to "learn" a model. Once we have a "learned" model from the data we can use it to make predictions or inferences in general as well as generate samples if needed. 


In [11]:
import collections 

counts = collections.Counter(samples)
print(counts)
est_probability_distribution = [counts['c']/float(len(samples)), counts['l']/float(len(samples))]
print(est_probability_distribution)

Counter({'l': 795, 'c': 205})
[0.205, 0.795]


Using the counts seems intuitive and I told you that for the case of discrete random variables, this provides the maximum likelihood estimate but can we prove this assertion? 

Here is how we can do it. Each time we have a candy of a particular type we multiply the associated probability to get the likelihood of the sequence (assuming i.i.d. samples). If there are $c$ cherry candies and $l=N-c$ limes then we can write the likelihood as follows: 

$$ P({\bf d} | h_{\theta}) = \prod_{j=1}^{N} P(d_j | h_{\theta}) = \theta^{c} \cdot (1-\theta)^{l}$$

Note: check how the mathematical expression above is notated. It uses LaTeX notation which can be embedded in markdown cells. It is a useful thing to learn to produce nice-looking equations in both notebooks and papers. 

The maximum-likelihood hypothesis is given by the value of $\theta$ that maximizes the expression above. The same value can be obtained by maximizing the **log-likelihood**. Note that we have used log-likelihood before to avoid small numerical likelihood values when computing over long sequences. Here we use it because it allows us to simplify our expression to prove our approach to maximum likelihood parameter estimation. By taking the log we convert the product to a sum which is easier to maximize. 

$$ L({\bf d}| h_{\theta}) = \log{P({\bf d} | h_{\theta})} = \sum_{j=1}^{N}\log{P(d_j| h_{\theta})} = c \log(\theta) + l \log{(1-\theta)} $$



To find the maximum-likelihood value of $\theta$, we differentiate $L$ with respect to $\theta$ and set the resulting expression to zero: 

$$ 
\frac{d L({\bf d} | h_{\theta})}{d \theta} = \frac{c}{\theta} - \frac{l}{1-\theta} = 0 
$$

Solving for $\theta$ we get: 
$$ 
\theta = \frac{c}{c+l} = \frac{c}{N}
$$ 

This might seem like a lot of work to prove something obvious but now we actually know that of all the infinite possible models of bags we could have - the one we estimate by counting the proportion of candy is the "best" in a maximum likelihood sense. 
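
One step the derivation leaves implicit is that this stationary point is a maximum rather than a minimum. A short check, assuming $0 < \theta < 1$ and at least one candy of each type ($c > 0$, $l > 0$), is to look at the second derivative of the log-likelihood:

$$ 
\frac{d^2 L({\bf d} | h_{\theta})}{d \theta^2} = -\frac{c}{\theta^2} - \frac{l}{(1-\theta)^2} < 0 
$$

Since the second derivative is negative everywhere on the interval, the log-likelihood is concave there, so $\theta = c/N$ is its unique maximum.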


The approach we followed can be used for a variety of probabilistic models. The steps are as follows: 

1. Write down an expression for the likelihood of the data as a function of the parameters and use log to simplify it for step 2 
2. Write down the derivative of the log-likelihood with respect to each parameter 
3. Find the parameter values such that the derivatives are zero. 

Note: If we are lucky, we can perform steps 2 and 3 analytically and derive an exact ML parameter estimate. There are many cases, especially when dealing with continuous models (which we cover below), in which maximizing the likelihood function analytically is not possible and one needs to resort to numerical methods, which give an approximate rather than an exact solution. 
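
To give a feel for the numerical route, here is a minimal sketch that maximizes the candy log-likelihood with a general-purpose optimizer instead of solving for $\theta$ by hand. It reuses the counts from the candy example (205 cherry, 795 lime); the choice of `scipy.optimize.minimize_scalar` with the bounded method is just one reasonable option, not the method used in the textbook.

```python
import numpy as np
from scipy.optimize import minimize_scalar

c, l = 205, 795   # cherry and lime counts from the candy example above

def negative_log_likelihood(theta):
    # Optimizers minimize, so we negate the log-likelihood
    return -(c * np.log(theta) + l * np.log(1.0 - theta))

# Bounded search keeps theta strictly inside (0, 1), where the logs are defined
result = minimize_scalar(negative_log_likelihood,
                         bounds=(1e-6, 1 - 1e-6), method='bounded')

numerical_estimate = result.x
analytical_estimate = c / (c + l)
print(numerical_estimate)
print(analytical_estimate)
```

The two values should agree to several decimal places. For this model the analytical answer is available, so the optimizer is unnecessary, but the same pattern applies when no closed form exists.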

As another example of analytical ML parameter estimation, the book works through a case in which there is an extra random variable, the wrapper, and the model has three parameters $\theta_1, \theta_2, \theta_3$. 

More generally, by filtering the data and counting we can do ML parameter estimation for Naive Bayes models as well as Bayesian networks with discrete random variables. You have already seen to some extent how this can be done during lectures as well as in the assignments. Notice that the structure of the Bayesian network simplifies the problem of ML parameter estimation: the likelihood factors into separate groups of variables according to their conditional structural relationships, so each conditional probability table can be estimated on its own. 
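
The filter-and-count idea can be sketched for a tiny two-node network in which a flavor variable is the parent of a wrapper variable. The eight data points below are invented purely for illustration, and the variable names are assumptions rather than anything taken from the assignments.

```python
import collections

# Hypothetical complete data: (flavor, wrapper) pairs, made up for illustration
data = [('cherry', 'red'), ('cherry', 'red'), ('cherry', 'green'),
        ('lime', 'green'), ('lime', 'green'), ('lime', 'red'),
        ('cherry', 'red'), ('lime', 'green')]

# P(Flavor): relative frequencies of the parent variable
flavor_counts = collections.Counter(flavor for flavor, _ in data)
p_flavor = {f: n / len(data) for f, n in flavor_counts.items()}

# P(Wrapper | Flavor): filter on the parent value, then count the child values
p_wrapper_given_flavor = {}
for f in flavor_counts:
    wrappers = [w for flavor, w in data if flavor == f]
    wrapper_counts = collections.Counter(wrappers)
    p_wrapper_given_flavor[f] = {w: n / len(wrappers)
                                 for w, n in wrapper_counts.items()}

print(p_flavor)
print(p_wrapper_given_flavor)
```

Each row of the conditional table is estimated only from the samples that match its parent value, which is the factoring by structure described above.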


# Maximum-likelihood parameter learning for continuous models 


Continuous probability models are heavily used in real-world applications. In many cases, we need to resort to numerical optimization methods to perform parameter estimation. However, in some cases, we can get the exact answer analytically. Let's consider the simple example of learning the parameters of a Gaussian density function on a single variable. Similarly to what we did in the previous section for learning the parameters of a discrete random variable, we will first generate some data and then estimate the parameters from the data. 


The data will be generated using a Gaussian density function on a single variable. The corresponding equation, which is the density that numpy.random.normal samples from, is: 

$$ P(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The parameters of this model are the mean $\mu$ and the standard deviation $\sigma$. 

In [10]:
mu = 3.0 
sigma = 0.2 
s = np.random.normal(mu, sigma, 10)
print(s)

[2.72624835 3.05506211 2.85027426 3.12181333 3.14033344 3.1204349
 2.95032936 2.61363656 2.98996048 3.20722797]


In [11]:
samples = np.random.normal(mu, sigma, 10000)
print(samples)

[3.18787276 2.83407902 2.84811335 ... 3.30848375 3.01161803 2.99270717]


Let the observed values be $x_1, \dots, x_N$. Then the log-likelihood is: 

$$ 
L = \sum_{j=1}^{N} \log \left({\frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_j-\mu)^2}{2\sigma^2}}}\right) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^{N} (x_j-\mu)^2
$$

Setting the derivatives to zero we obtain: 

$$ 
\frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{j=1}^{N} (x_j-\mu) = 0 
$$
which implies: 
$$ 
\mu = \frac{\sum_j x_j}{N}
$$

So the maximum likelihood value of the mean is the sample average. Similarly, the maximum likelihood value of the standard deviation is the square root of the sample variance, computed by dividing by $N$ (which is what `np.std` does by default). You can check the textbook for the details of the derivation for $\sigma$. 
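
For readers who want the missing step for $\sigma$ spelled out, here is a short worked derivation. It starts from the log of the Gaussian density for one sample, $-\log\sigma - \frac{1}{2}\log(2\pi) - \frac{(x_j-\mu)^2}{2\sigma^2}$, sums over the $N$ samples, and differentiates with respect to $\sigma$:

$$ 
\frac{\partial L}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{j=1}^{N} (x_j-\mu)^2 = 0 
$$

Multiplying through by $\sigma^3 / N$ and solving gives:

$$ 
\sigma = \sqrt{\frac{\sum_{j} (x_j-\mu)^2}{N}}
$$

Here $\mu$ is the ML estimate of the mean obtained above, so the estimate of $\sigma$ is the root of the average squared deviation from the sample mean.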

Let's check how we can calculate these ML parameter estimates for the data that we have. 


In [12]:
estimated_mean1 = np.sum(samples) / len(samples)
print(estimated_mean1)
estimated_mean2 = np.mean(samples)
print(estimated_mean2)
estimated_std = np.std(samples)
print(estimated_std)

2.999387962350083
2.999387962350083
0.19982809234770463


So armed with simple filtering, counting, and calculating the sample mean and sample standard deviation, we have everything we need to estimate the probabilities of a Naive Bayes model that contains a mixture of continuous and discrete variables. For the discrete variables, we count and estimate the probabilities directly. For the continuous variables, we first estimate the ML parameters (sample mean and standard deviation) and then, for a particular value of the feature, we use the single-variable Gaussian density equation to obtain a density value, which plays the role of the probability for that feature value. 
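
A minimal sketch of this recipe for a single continuous feature and two classes might look as follows. The class labels `'a'` and `'b'`, the generating parameters, and the helper name `posterior` are all hypothetical choices for illustration; this is not the assignment solution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical training data: one continuous feature for each of two classes
feature_by_class = {
    'a': rng.normal(3.0, 0.2, 500),
    'b': rng.normal(4.0, 0.5, 500),
}
n_total = sum(len(x) for x in feature_by_class.values())

# ML parameter estimates per class: prior by counting, Gaussian by mean/std
params = {label: (len(x) / n_total, np.mean(x), np.std(x))
          for label, x in feature_by_class.items()}

def posterior(value):
    # Unnormalized score: class prior times Gaussian density at the feature value
    scores = {label: prior * stats.norm.pdf(value, loc=mean, scale=std)
              for label, (prior, mean, std) in params.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

print(posterior(3.1))
print(posterior(4.4))
```

With several features, the Naive Bayes assumption means the per-feature terms (counted probabilities for discrete features, Gaussian densities for continuous ones) are simply multiplied together before normalizing.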

Some notes for further reading - not needed for the "final" assignment for those interested in digging deeper. 

**Note1**: In a Bayesian network with continuous variables, you may have a continuous parent with a continuous child variable. This case can be addressed with linear Gaussian models. More details are in the textbook. 

**Note2**: As with Bayesian learning in discrete models, one can use a hypothesis prior to guide the learning. The textbook shows an example that uses **beta distributions**, which you can check out. 
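
As a small taste of what a beta prior does, the sketch below combines a hypothetical Beta(2, 2) prior on $\theta$ with the candy counts from earlier in the notebook. The prior parameters are an arbitrary assumption chosen for illustration; the conjugate update itself (adding the counts to the prior parameters) is the standard result for this model.

```python
from scipy import stats

c, l = 205, 795    # observed cherry and lime counts from the candy example
a, b = 2.0, 2.0    # hypothetical Beta(a, b) prior on theta, chosen for illustration

# The beta prior is conjugate to this likelihood: the posterior is Beta(a + c, b + l)
posterior = stats.beta(a + c, b + l)

ml_estimate = c / (c + l)
posterior_mean = posterior.mean()          # equals (a + c) / (a + b + c + l)
credible_interval = posterior.interval(0.95)  # central 95% interval for theta

print(ml_estimate)
print(posterior_mean)
print(credible_interval)
```

With 1000 observations the prior has little influence, so the posterior mean stays close to the ML estimate; with only a handful of samples the prior would pull the estimate noticeably toward 0.5.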

**Note3**: If you remember, when we covered Bayesian networks we looked at approximate inference using direct sampling and rejection sampling. You will notice that the approach we followed was similar to statistical learning in the sense that we generated samples and then used counting to estimate probabilities. So at a basic level, inference and learning can be considered the same process: we start with a few things that we know and then use data to update what we know. 

**Note4**: It is also possible to learn the structure of a Bayesian network from data. The basic idea is to search over the space of possible models. To do so we will need some method to determine when a good structure has been found. More details can be found in the book. 
