In [1]:
import torch
from torch.distributions import Normal
import math

Let us revisit the problem of predicting whether a resident of Statsville is female based on height. For this purpose, we have collected a set of height samples from adult female residents of Statsville. Unfortunately, due to unforeseen circumstances, the sample is very small. Armed with our knowledge of Bayesian inference, we do not want to let this deter us from building a model.

From physical considerations, we can assume that the distribution of heights is Gaussian. Our goal is to estimate the parameters ($\mu$, $\sigma$) of this Gaussian.


Let us first create the dataset by sampling 5 points from a Gaussian distribution with $\mu$=152 and $\sigma$=8. In real-life scenarios, we do not know the mean and standard deviation of the true distribution. For the sake of this example, we assume that the mean height is 152 cm and the standard deviation is 8 cm.

In [2]:
torch.random.manual_seed(0)
num_samples = 5
true_dist = Normal(152, 8)
X = true_dist.sample((num_samples, 1))
print('Dataset shape: {}'.format(X.shape))

Dataset shape: torch.Size([5, 1])


## Maximum Likelihood Estimate

If we relied on maximum likelihood estimation (MLE), our approach would simply be to compute the mean and standard deviation of the dataset and use the resulting normal distribution as our model.

$$\mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N}x_i$$
$$\sigma_{MLE}^{2} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{MLE})^2$$

Once we estimate the parameters, we can compute the probability that a sample lies in the range $(a, b]$ using the following formula
$$ p(a < X \leq b) = \int_{a}^b p(X) dX $$

However, when the amount of data is low, the MLE estimates are not reliable: with only a handful of samples, they can land far from the true parameters.

In [3]:
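# Note: X.std() applies Bessel's correction (divides by N - 1) by default;
# the strict MLE would be X.std(unbiased=False).
mle_mu, mle_std = X.mean(), X.std()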
mle_dist = Normal(mle_mu, mle_std)

print(f"MLE: mu {mle_mu:0.2f} std {mle_std:0.2f}")

MLE: mu 149.68 std 11.52
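The MLE formula for $\sigma$ divides by $N$, whereas `torch.Tensor.std` divides by $N - 1$ by default. The following minimal sketch illustrates the difference on a small sample; the `heights` values are hypothetical and chosen only for illustration.

```python
import torch

# Hypothetical toy heights in cm (not the notebook's dataset)
heights = torch.tensor([150.0, 160.0, 145.0, 155.0, 140.0])

mu_mle = heights.mean()
# MLE standard deviation: divide the sum of squared deviations by N
std_mle = heights.std(unbiased=False)
# torch default: divide by N - 1 (Bessel's correction)
std_unbiased = heights.std()

print(f"mean {mu_mle:0.2f}  MLE std {std_mle:0.2f}  default std {std_unbiased:0.2f}")
```

With only 5 samples the two estimates differ noticeably; the gap shrinks as $N$ grows.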


## Bayesian Inference

Can we do better than MLE? 

One potential method is to use Bayesian inference with a good prior. How does one go about selecting a good prior? Let's say that, from another survey, we know the average and the standard deviation of the height of adult female residents in Neighborville, the neighboring town. Additionally, we have no reason to believe that the distribution of heights in Statsville is significantly different. So we can use this information to "initialize" our prior.

Let's say the mean height of adult female residents in Neighborville is 150 cm with a standard deviation of 9 cm.

We can use this information as our prior. The prior distribution encodes our beliefs about the parameter values.

Given that we are dealing with an unknown mean and an unknown variance, we will model the prior as a Normal-Gamma distribution over $\theta = (\mu, \lambda)$, where $\lambda = 1 / \sigma^{2}$ is the precision.

$$p\left( \theta \middle\vert X \right) \propto p \left( X \middle\vert \theta \right) p \left( \theta \right)\\
p\left( \theta \middle\vert X \right) = \text{Normal-Gamma}\left( \mu_{n}, \lambda_{n}, \alpha_{n}, \beta_{n} \right) \\
p \left( X \middle\vert \theta \right)  = \mathcal{N}\left( \mu, \lambda^{ -\frac{1}{2} } \right) \\
p \left( \theta \right) = \text{Normal-Gamma}\left( \mu_{0}, \lambda_{0}, \alpha_{0}, \beta_{0} \right)$$

We will choose a prior, $p \left(\theta \right)$, such that 
$$ \mu_{0} = 150 \\
   \lambda_{0} = 100 \\
   \alpha_{0} = 10.5 \\
   \beta_{0} = 810 $$
   
$$p \left( \theta \right) = \text{Normal-Gamma}\left( 150, 100, 10.5 , 810 \right)$$
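To get a feel for what such a prior says, one can draw parameter pairs from it. The sketch below assumes the standard Normal-Gamma factorization ($\lambda \sim \text{Gamma}(\alpha_0, \beta_0)$ with rate $\beta_0$, then $\mu \mid \lambda \sim \mathcal{N}(\mu_0, (\lambda_0 \lambda)^{-1/2})$) and uses the hyperparameter values that the code in this notebook passes to `NormalGamma` (150, 100, 10.5, 810). The variable names are illustrative only.

```python
import torch
from torch.distributions import Gamma, Normal

torch.manual_seed(0)
mu_0, lambda_0, alpha_0, beta_0 = 150.0, 100.0, 10.5, 810.0

# Precision: lambda ~ Gamma(concentration=alpha_0, rate=beta_0)
precision = Gamma(alpha_0, beta_0).sample((10000,))
# Mean given precision: mu | lambda ~ Normal(mu_0, 1 / sqrt(lambda_0 * lambda))
mu = Normal(mu_0, 1.0 / torch.sqrt(lambda_0 * precision)).sample()
# Convert each sampled precision to a standard deviation
sigma = 1.0 / torch.sqrt(precision)

print(f"sampled mu: mean {mu.mean():0.2f}")
print(f"sampled sigma: mean {sigma.mean():0.2f}")
print(f"sampled precision: mean {precision.mean():0.4f} (alpha_0 / beta_0 = {alpha_0 / beta_0:0.4f})")
```

Each `(mu, sigma)` pair is one candidate Gaussian for the height distribution that the prior considers plausible.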


We will compute the posterior, $p\left( \theta \middle\vert X \right)$, using Bayesian inference. With $\bar{x}$ denoting the sample mean and $s$ the sample variance of the $n$ observations, the posterior hyperparameters are

$$\mu_{n} = \frac{ \left( n \bar{x} + \mu_{0} \lambda_{0} \right) }{ n + \lambda_{0} } \\
\lambda_{n} = n + \lambda_{0} \\
\alpha_{n} = \frac{n}{2} + \alpha_{0} \\
\beta_{n} = \frac{ ns }{ 2 } + \beta_{ 0 } + \frac{ n \lambda_{0} } { 2 \left( n + \lambda_{0} \right) } \left( \bar{x} - \mu_{0} \right)^{ 2 }$$

$$p\left( \theta \middle\vert X \right) = \text{Normal-Gamma}\left( \mu_{n}, \lambda_{n}, \alpha_{n}, \beta_{n} \right)$$
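As a rough worked example, we can plug in the sample statistics printed for the MLE ($n = 5$, $\bar{x} \approx 149.68$, sample standard deviation $\approx 11.52$, hence $s \approx 132.71$) together with the prior hyperparameters that the code in this notebook passes to `NormalGamma` ($\mu_{0} = 150$, $\lambda_{0} = 100$, $\alpha_{0} = 10.5$, $\beta_{0} = 810$). The inputs are rounded, so the last digits are approximate.

$$\mu_{n} \approx \frac{ 5 \times 149.68 + 150 \times 100 }{ 5 + 100 } \approx 149.98 \\
\lambda_{n} = 5 + 100 = 105 \\
\alpha_{n} = \frac{5}{2} + 10.5 = 13 \\
\beta_{n} \approx \frac{ 5 \times 132.71 }{ 2 } + 810 + \frac{ 5 \times 100 } { 2 \times 105 } \left( 149.68 - 150 \right)^{ 2 } \approx 1142.02$$

The posterior mean $\mu_{n}$ stays close to $\mu_{0}$ because $\lambda_{0} = 100$ outweighs the $n = 5$ observations.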

In [4]:
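class NormalGamma():
    """Normal-Gamma distribution over (mean, precision), stored by its four hyperparameters."""
    def __init__(self, mu_, lambda_, alpha_, beta_):
        self.mu_ = mu_
        self.lambda_ = lambda_
        self.alpha_ = alpha_
        self.beta_ = beta_

    @property
    def mean(self):
        # Expected (mean, precision): the precision follows Gamma(alpha, beta)
        return self.mu_, self.alpha_ / self.beta_

    @property
    def mode(self):
        # (mean, precision) at which the joint density is highest
        return self.mu_, (self.alpha_ - 0.5) / self.beta_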

In [5]:
def inference_unknown_mean_variance(X, prior_dist):
    """Return the Normal-Gamma posterior given data X and a Normal-Gamma prior."""
    # Sample statistics of the data
    mu_mle = X.mean()
    sigma_mle = X.std()  # Note: torch's default applies Bessel's correction (N - 1)
    n = X.shape[0]

    # Parameters of the prior
    mu_0 = prior_dist.mu_
    lambda_0 = prior_dist.lambda_
    alpha_0 = prior_dist.alpha_
    beta_0 = prior_dist.beta_

    # Parameters of the posterior
    mu_n = (n * mu_mle + mu_0 * lambda_0) / (lambda_0 + n)
    lambda_n = n + lambda_0
    alpha_n = n / 2 + alpha_0
    beta_n = (
        (n / 2 * sigma_mle ** 2)
        + beta_0
        + (0.5 * n * lambda_0 * (mu_mle - mu_0) ** 2 / (n + lambda_0))
    )
    posterior_dist = NormalGamma(mu_n, lambda_n, alpha_n, beta_n)

    return posterior_dist

In [6]:
# Let us initialize the prior based on our beliefs
prior_dist = NormalGamma(150, 100, 10.5, 810)

# We compute the posterior distribution
posterior_dist = inference_unknown_mean_variance(X, prior_dist)

How do we use the posterior distribution?

Note that the posterior, like the prior, is a distribution over the parameters $\mu$ and $\lambda$, i.e., it lives in parameter space. The likelihood, by contrast, is a distribution over the data space.


Once we learn the posterior distribution, one way to use it is to look at its mode, i.e., the parameter values that have the highest probability density. Using these point estimates leads us to Maximum A Posteriori (MAP) estimation.

As usual, we will obtain the maximum of the posterior probability density function $p\left( \mu, \lambda \middle\vert X \right) = \text{Normal-Gamma}\left(  \mu, \lambda ; \;\; \mu_{n}, \lambda_{n}, \alpha_{n}, \beta_{n} \right) $.

This function attains its maximum when

$$\mu = \mu_{n} \\
\lambda = \frac{ \alpha_{n} - \frac{1}{2} } { \beta_{n} }$$
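A short derivation sketch of this result, assuming the standard form of the Normal-Gamma density with its normalizing constant dropped:

$$p\left( \mu, \lambda \middle\vert X \right) \propto \lambda^{ \alpha_{n} - \frac{1}{2} } \exp\left( -\beta_{n} \lambda \right) \exp\left( -\frac{ \lambda_{n} \lambda \left( \mu - \mu_{n} \right)^{2} }{ 2 } \right)$$

For any $\lambda > 0$, the last factor is largest (equal to 1) at $\mu = \mu_{n}$. Setting the derivative of the logarithm of the remaining factors with respect to $\lambda$ to zero gives

$$\frac{ \partial }{ \partial \lambda } \left[ \left( \alpha_{n} - \frac{1}{2} \right) \log \lambda - \beta_{n} \lambda \right] = \frac{ \alpha_{n} - \frac{1}{2} }{ \lambda } - \beta_{n} = 0 \;\; \Rightarrow \;\; \lambda = \frac{ \alpha_{n} - \frac{1}{2} }{ \beta_{n} }$$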

We notice that the MAP estimates for $\mu$ and $\sigma$ are closer to the true values ($\mu$=152, $\sigma$=8) than the MLE estimates.

In [7]:
# With the Normal Gamma formulation, the unknown parameters are mu and precision
map_mu, map_precision =  posterior_dist.mode

# We can compute the standard deviation using precision.
map_std = math.sqrt(1 / map_precision)

map_dist = Normal(map_mu, map_std)
print(f"MAP: mu {map_mu:0.2f} std {map_std:0.2f}")

MAP: mu 149.98 std 9.56


How did we arrive at the values of the parameters for the prior distribution? 

Let us consider the case when we have 0 data points. In this case, the posterior is equal to the prior. If we use the mode of this posterior for our MAP estimate, we see that the mu and std parameters are the same as the $\mu$ and $\sigma$ of adult female residents in Neighborville.

In [8]:
prior_mu, prior_precision =  prior_dist.mode
prior_std = math.sqrt(1 / prior_precision)
print(f"Prior: mu {prior_mu:0.2f} std {prior_std:0.2f}")

Prior: mu 150.00 std 9.00
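One way to read this is that $\beta_{0}$ can be derived from the reference standard deviation once $\alpha_{0}$ is fixed, by requiring the prior mode of the precision to equal $1 / \sigma^{2}$. The helper below is a hypothetical sketch of that recipe (`prior_from_reference` is not part of the notebook). It treats $\lambda_{0}$ and $\alpha_{0}$ as free choices that control how strongly the prior is trusted.

```python
def prior_from_reference(mean, std, lambda_0, alpha_0):
    """Hypothetical helper: build Normal-Gamma hyperparameters from a reference mean and std."""
    # Require the prior mode of the precision to match the reference:
    # (alpha_0 - 0.5) / beta_0 = 1 / std**2  =>  beta_0 = (alpha_0 - 0.5) * std**2
    beta_0 = (alpha_0 - 0.5) * std ** 2
    return mean, lambda_0, alpha_0, beta_0


# Neighborville survey: mean 150 cm, standard deviation 9 cm
print(prior_from_reference(150.0, 9.0, lambda_0=100.0, alpha_0=10.5))
# (150.0, 100.0, 10.5, 810.0)
```

Larger values of `lambda_0` and `alpha_0` make the prior harder for the data to override.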


## Inference

Let us say we want to find out the probability that the height of an adult female resident lies between 150 cm and 155 cm. We can now use the MAP estimates for $\mu$ and $\sigma$ to compute this value.

Since our prior was good, the MAP estimate comes closer to the true probability than the MLE does at this low value of n.

In [9]:
# Interval bounds in cm
a, b = torch.tensor([150.0]), torch.tensor([155.0])

true_prob = true_dist.cdf(b) - true_dist.cdf(a)
print(f'True probability: {true_prob}')

map_prob = map_dist.cdf(b) - map_dist.cdf(a)
print(f'MAP probability: {map_prob}')

mle_prob = mle_dist.cdf(b) - mle_dist.cdf(a)
print('MLE probability: {}'.format(mle_prob))

True probability: tensor([0.2449])
MAP probability: tensor([0.1995])
MLE probability: tensor([0.1669])
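The CDF difference used above is the closed-form value of the integral $\int_{a}^{b} p(X) dX$. As a sanity check, the same quantity can be approximated by sampling; this is a minimal sketch for the true distribution only, and the sample count of 200000 is an arbitrary choice.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
dist = Normal(152.0, 8.0)
a, b = 150.0, 155.0

# Exact interval probability via the CDF
exact = dist.cdf(torch.tensor(b)) - dist.cdf(torch.tensor(a))

# Monte Carlo estimate: fraction of samples that fall in (a, b]
samples = dist.sample((200000,))
monte_carlo = ((samples > a) & (samples <= b)).float().mean()

print(f"CDF: {exact:0.4f}  Monte Carlo: {monte_carlo:0.4f}")
```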


Suppose we receive more samples. How do we incorporate this information into our model? We can set the prior to our current posterior and run inference again to obtain the new posterior. This process can be repeated iteratively.

$$ p \left( \theta \right)_{n}  = p\left( \theta \middle\vert X \right)_{n-1}$$
$$ p\left( \theta \middle\vert X \right)_{n}=inference\_unknown\_mean\_variance(X_{n}, p \left( \theta \right)_{n})$$

We also notice that as the number of data points increases, the MAP estimates start to converge towards the true values of $\mu$ and $\sigma$.

In [10]:
num_batches, batch_size = 20, 10
for i in range(num_batches):
    X_i = true_dist.sample((batch_size, 1))
    prior_i = posterior_dist
    posterior_dist = inference_unknown_mean_variance(X_i, prior_i)
    map_mu, map_precision =  posterior_dist.mode

    # We can compute the standard deviation using precision.
    map_std = math.sqrt(1 / map_precision)
    map_dist = Normal(map_mu, map_std)
    if i % 5 == 0:
        print(f"MAP at batch {i}: mu {map_mu:0.2f} std {map_std:0.2f}")
print(f"MAP at batch {i}: mu {map_mu:0.2f} std {map_std:0.2f}")

MAP at batch 0: mu 149.98 std 8.84
MAP at batch 5: mu 150.65 std 8.98
MAP at batch 10: mu 150.70 std 8.77
MAP at batch 15: mu 151.15 std 8.79
MAP at batch 19: mu 151.04 std 8.70
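A useful property of conjugate updates is that feeding the data in batches should give the same posterior as using all of it at once. The sketch below checks this numerically. It is an assumption-labeled variant, not the notebook's function: `update` is a hypothetical helper that works on plain tuples and uses the sum of squared deviations (the divide-by-$N$ variance times $N$), for which the equivalence holds exactly.

```python
import torch

def update(prior, x):
    """Hypothetical conjugate Normal-Gamma update; prior is (mu, lambda, alpha, beta)."""
    mu_0, lam_0, alpha_0, beta_0 = prior
    n = x.numel()
    x_bar = x.mean()
    # Sum of squared deviations from the batch mean (n times the MLE variance)
    ss = ((x - x_bar) ** 2).sum()
    mu_n = (n * x_bar + lam_0 * mu_0) / (n + lam_0)
    lam_n = lam_0 + n
    alpha_n = alpha_0 + n / 2
    beta_n = beta_0 + 0.5 * ss + 0.5 * n * lam_0 * (x_bar - mu_0) ** 2 / (n + lam_0)
    return mu_n, lam_n, alpha_n, beta_n

torch.manual_seed(0)
# Double precision keeps rounding differences between the two routes negligible
x = torch.distributions.Normal(152.0, 8.0).sample((20,)).double()
prior = (150.0, 100.0, 10.5, 810.0)

sequential = update(update(prior, x[:10]), x[10:])  # two batches of 10
all_at_once = update(prior, x)                      # one batch of 20

for name, s, a in zip(("mu", "lambda", "alpha", "beta"), sequential, all_at_once):
    print(f"{name}: sequential {float(s):0.4f}  all at once {float(a):0.4f}")
```

With the default `X.std()` (which divides by $N - 1$), the two routes would agree only approximately.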
