$
\newcommand{\real}[1]{\mathbb{#1}}
\newcommand{\expect}{\mathrm{E}}
\newcommand{\prob}{\mathrm{P}}
\newcommand{\v}{\mathrm{var}}
\newcommand{\Comb}[2]{{}^{#1}C_{#2}}
$

# Bayesian Inference

## Prologue
Before starting to read this, read the bear notes about the basics of Probability Theory, namely:
* Probability Axioms, 
* Conditioning, 
* Random Variables, 
* Probability Mass Function, 
* Probability Density Function, 
* Expected Value, 
* Variance etc.  

Then go on to read the Limit Theorems from the textbook. These laws provide a mathematical basis for the loose interpretation of an Expectation $E[X]=\mu$ as the average of a large number of independent samples drawn from the distribution of X. Limit Theorems are useful for several reasons:
1. Conceptually, they provide an interpretation of Expectation (as well as Probability) in terms of a long sequence of identical independent experiments.
2. They allow us to derive an approximate probability distribution for the sum $S_n$ of a large number of i.i.d. random variables $X_1,X_2,\dots,X_n$.
3. They play a major role in inference and statistics, in the presence of large data sets.
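
The first point can be illustrated with a small simulation. This is only a sketch: the fair die, the function name `sample_mean` and the fixed seed are assumptions chosen for illustration, not part of the textbook material.

```python
import random

def sample_mean(n_samples, seed=0):
    """Average of n_samples independent rolls of a fair die (true mean 3.5)."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    total = sum(rng.randint(1, 6) for _ in range(n_samples))
    return total / n_samples

# The sample mean should settle near E[X] = 3.5 as the sample size grows.
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```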


## Statistical Inference

Statistical Inference is the process of extracting information about an unknown variable or an unknown model from available data. Statistics generally involves an element of art. For any particular problem, there may be several reasonable methods, yielding different answers. There is no principled way for selecting the "best" method, unless one makes several assumptions and imposes additional constraints on the inference problem.

Within the field of statistics there are two prominent schools of thought, with opposing views: the Bayesian and the Classical (also called Frequentist). Their fundamental difference relates to the nature of the unknown models or variables. In the Bayesian view, they are treated as random variables with known distributions. In the classical view, they are treated as deterministic quantities that happen to be unknown.

The Bayesian approach essentially tries to move the field of statistics back to the realm of probability theory. The unknown variables are treated as random variables with known prior distributions. We conduct an experiment, make some observation ($X$) and then try to infer the value that the random variable ($\Theta$) took during the experiment. We derive a posterior distribution $P(\Theta \mid X)$ which tells us how likely it is that $\Theta = \theta$, given that the observation $X$ was made.

By contrast, in Classical Inference, the unknown quantity $\theta$ is viewed as a deterministic constant that happens to be unknown. The classical approach then strives to develop an estimate of $\theta$ that has some performance guarantees (for example, a confidence interval).

Suppose that we are trying to measure a physical constant, say the mass of the electron, by means of noisy experiments. The classical statistician will argue that the mass of the electron, while unknown, is just a constant, and that there is no justification for modeling it as a random variable. The Bayesian statistician will counter that a prior distribution simply reflects our state of knowledge. For example, if we already know from past experiments a rough range for this quantity, we can express this knowledge by postulating a prior distribution which is concentrated over that range.

A classical statistician will often object to the arbitrariness of picking a particular prior. A Bayesian statistician will counter that every statistical procedure contains some hidden choices. Furthermore, in some cases, classical methods turn out to be equivalent to Bayesian ones, for a particular choice of a prior. By locating all of the assumptions in one place, in the form of a prior, the Bayesian statistician contends that these assumptions are brought to the surface and are amenable to scrutiny.

### Types of Inference Problems
In **parameter estimation**, we want to generate estimates that are close to the true values of the parameters in some probabilistic sense. In this kind of problem, a model is fully specified (prior and likelihood distributions), except for an unknown, possibly multidimensional, parameter $\theta$, which we wish to estimate. This parameter can be viewed as either a random variable (Bayesian approach) or as an unknown constant (Classical approach). The usual objective is to arrive at an estimate of $\theta$ that is close to the true value in some sense. For example, using polling data, estimate the fraction of a voter population that prefers candidate A over candidate B.

In a **binary hypothesis testing** problem, we start with two hypotheses and use the available data to decide which of the two is true. An example is the Airplane-Radar problem discussed earlier, where we have to infer whether an airplane was present or not, given the radar reading.

## Intro to Bayesian Inference
*Lecture 14*

Here, we study Bayes’ Rule in detail. We learn how to calculate the Posterior distribution and how to summarize the distribution using one number: either the most probable value (where the posterior is highest) or the expected value (mean value).  
  
### Bayesian Framework
In Bayesian Framework, the unknown quantity is modelled as a random variable, denoted by $\Theta$. It has a prior distribution $P_\Theta$. We aim to extract information about $\Theta$, based on another random variable observed during the experiment $X=(X_1, X_2, \dots, X_n)$. This random variable is called **observation, measurement** or an **observation vector**.

Based on $X$, we try to guess the value that $\Theta$ took during the experiment. For this we are given a model $P(X \mid \Theta)$. Once we have made an observation, we make a judgement about $\Theta$ by calculating posterior distribution $P(\Theta \mid X)$ using Bayes' Rule.

$P(\Theta \mid X)$ is called the **Posterior Distribution** because here we have already made the observation and we are trying to go back and guess what value the random variable $\Theta$ took.  

$P(X \mid \Theta)$ is called the **Likelihood Distribution** because it tells us how likely $X$ is to take a value $x$, given the value of $\Theta$.  

$P(\Theta)$ is called the **Prior Distribution**, because it gives the probability of $\Theta = \theta$ before any observation is made. The prior comes from symmetry, known range, earlier studies, subjectivity etc.
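
For the case where both $\Theta$ and $X$ are discrete, the three ingredients combine as in the sketch below. The function name `posterior`, the dictionary layout and the radar numbers are hypothetical, chosen only to make the example concrete.

```python
def posterior(prior, likelihood, x):
    """Discrete Bayes' rule.

    prior:      dict mapping theta -> P(Theta = theta)
    likelihood: dict mapping theta -> dict mapping x -> P(X = x | Theta = theta)
    Returns a dict mapping theta -> P(Theta = theta | X = x).
    """
    # Numerator of Bayes' rule: prior times likelihood, for each theta.
    joint = {theta: prior[theta] * likelihood[theta].get(x, 0.0) for theta in prior}
    evidence = sum(joint.values())  # P(X = x), the normalizing constant
    if evidence == 0:
        raise ValueError("observation has zero probability under the model")
    return {theta: p / evidence for theta, p in joint.items()}

# Hypothetical numbers for an airplane/radar setting.
prior = {"present": 0.05, "absent": 0.95}
likelihood = {
    "present": {"blip": 0.99, "quiet": 0.01},
    "absent": {"blip": 0.10, "quiet": 0.90},
}
print(posterior(prior, likelihood, "blip"))
```

Even with a reliable radar, the small prior keeps the posterior probability of "present" well below one in this made-up setting.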

After this, we study the four possible combinations of continuous/discrete $\Theta$ and continuous/discrete $X$.
<img src="../images/bayesian_inference/bayes_rule.png"/>  

We now study a particular example where unknown is continuous and observation is discrete. We want to find the posterior probability of bias (probability of heads) of a coin, given the number of heads $k$ in $n$ coin tosses. The formula for the posterior distribution is given above (continuous $\Theta$, discrete $X$ scenario). The prior $f_\Theta(\theta)$ is uniform over the interval $[0,1]$.  

$$
\begin{align}
p_{K | \Theta}(k | \theta) &= \Comb{n}{k} \theta^k (1-\theta)^{n-k} \\
f_{\Theta | K}(\theta | k) &= \frac{1 \cdot \Comb{n}{k} \theta^k (1-\theta)^{n-k}}{p_{K}(k)} \\
&= \frac{1}{d(n,k)} \theta^k (1-\theta)^{n-k} \\
\end{align}
$$

$d(n,k)$ is the Normalizing Constant (doesn't vary with $\Theta$). The distribution given by the above formula is called **Beta distribution** with parameters $(k+1, n-k +1)$. Notice that if we start with a Prior that is a Beta Distribution, then the Posterior still belongs to the Beta family. Priors of this sort are known as Conjugate Priors.  
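
As a sanity check, the sketch below evaluates the posterior numerically. It assumes the standard closed form $d(n,k) = \frac{k!\,(n-k)!}{(n+1)!}$ for the Beta normalizing constant (not stated in these notes); the function names and the midpoint-rule integrator are illustrative choices.

```python
from math import factorial

def normalizing_constant(n, k):
    """d(n, k): integral of theta^k (1 - theta)^(n - k) over [0, 1]."""
    return factorial(k) * factorial(n - k) / factorial(n + 1)

def posterior_density(theta, n, k):
    """Posterior of the coin bias under a uniform prior, after k heads in n tosses."""
    return theta**k * (1 - theta) ** (n - k) / normalizing_constant(n, k)

def integrate(f, a=0.0, b=1.0, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

n, k = 10, 7
# A valid density must integrate to (approximately) one.
print(integrate(lambda t: posterior_density(t, n, k)))
```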

### Point Estimates
Often we are not interested in the posterior distribution of $\Theta$ over its entire range. We just need a point estimate for $\Theta$ given the value of $X$. The point estimator takes a specific value for each value of $X$, based on the posterior $\prob(\Theta \mid X)$. Therefore, the point estimator itself is a random variable denoted by $\hat{\Theta}$. Any value that the point estimator takes is called an estimate and is denoted by $\hat{\theta}$.

The value of $\hat{\theta}$ is to be determined by applying some function $g$ to the observation $x$, resulting in $\hat{\theta} = g(x)$. The random variable $\hat{\Theta} = g(X)$ is called an estimator, and its realized value equals $g(x)$ whenever the random variable $X$ takes the value $x$. As explained, the reason that $\hat{\Theta}$ is a random variable is that the outcome of the estimation procedure depends on the random value of the observation.

We can use different functions $g$ to form different estimators; some will be better than others. For an extreme example, consider the function that satisfies $g(x) = 0$ for all $x$. The resulting estimator, $\hat{\Theta} = 0$, makes no use of the data, and is therefore not a good choice. We study two of the most popular estimators:  
<br/>
**Maximum a Posteriori Probability (MAP) Estimator**
<img src="../images/bayesian_inference/MAP.png"/>
<br/>
**Conditional Expectation Estimator**
$$\hat{\theta} = E[\Theta \mid X=x]$$
<br/>
It is also called the least mean squares (LMS) estimator because it has an important property: it minimizes the Mean Squared Error over all estimators.  
$$E[(\Theta - \hat{\theta})^2] = var(\Theta - \hat{\theta}) + (E[\Theta - \hat{\theta}])^2 = var(\Theta) + (E[\Theta - \hat{\theta}])^2$$  

If the posterior distribution is symmetric around its (conditional) mean and unimodal (i.e., has a single maximum), the maximum occurs at the mean. Then, the MAP estimator coincides with the conditional expectation estimator.  
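
The two estimators need not agree when the posterior is skewed. The sketch below computes both on a grid for the coin-bias posterior $\theta^k(1-\theta)^{n-k}$ under a uniform prior; the function name and the grid size are assumptions made for illustration.

```python
def coin_posterior_estimates(n, k, grid=10_001):
    """Grid approximation of the MAP and conditional-expectation estimates."""
    thetas = [i / (grid - 1) for i in range(grid)]
    # Unnormalized posterior; the normalizing constant cancels in both estimates.
    weights = [t**k * (1 - t) ** (n - k) for t in thetas]
    total = sum(weights)
    theta_map = thetas[max(range(grid), key=weights.__getitem__)]  # posterior peak
    theta_lms = sum(t * w for t, w in zip(thetas, weights)) / total  # posterior mean
    return theta_map, theta_lms

theta_map, theta_lms = coin_posterior_estimates(n=10, k=7)
print(theta_map, theta_lms)
```

For $n=10$, $k=7$ the peak sits at $k/n = 0.7$ while the mean is close to $(k+1)/(n+2) = 2/3$, the mean of a Beta$(k+1, n-k+1)$ distribution.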
<br/>
  

### Hypothesis Testing
In a hypothesis testing problem, $\Theta$ takes one of $m$ values, $\theta_1,\theta_2,\dots,\theta_m$, where $m$ is usually a small integer; often $m = 2$, in which case we are dealing with a binary hypothesis testing problem. We refer to the event $\{\Theta = \theta_i\}$ as the $i$th hypothesis and denote it by $H_i$.
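
A minimal sketch of the MAP rule for such a problem: pick the hypothesis whose prior times likelihood is largest. The setting (deciding between candidate coin biases from $k$ heads in $n$ tosses), the function name `map_decision` and the numbers are hypothetical.

```python
from math import comb

def map_decision(priors, biases, n, k):
    """Return (index of the MAP hypothesis, posterior probabilities).

    Hypothesis i says the coin has bias biases[i]; priors[i] is P(H_i).
    """
    # Score of each hypothesis: P(H_i) * P(K = k | H_i).
    scores = [p * comb(n, k) * q**k * (1 - q) ** (n - k)
              for p, q in zip(priors, biases)]
    total = sum(scores)
    posteriors = [s / total for s in scores]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, posteriors

# Two equally likely hypotheses: a fair coin or a coin with bias 0.8.
print(map_decision([0.5, 0.5], [0.5, 0.8], n=10, k=7))
print(map_decision([0.5, 0.5], [0.5, 0.8], n=10, k=6))
```

With these made-up numbers, seven heads favor the biased coin while six heads favor the fair one.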
<img src="../images/bayesian_inference/MAP_Rule.png"/>

<br/>

Watch this [video](https://www.youtube.com/watch?v=_hDfZF64wic) for a nice summary of all the topics discussed in Lecture 14.

## Linear Models with Normal Noise
*Lecture 15*

Here we study the continuous unknown, continuous observation case of Bayesian inference. We consider the case where the unknown and the noise are independent normal random variables: $X = W + \Theta$, where $X$ is the Observation RV, $W$ is the Noise RV and $\Theta$ is the unknown RV to be inferred.

### Recognizing Normal Distribution
<img src="../images/bayesian_inference/normal_fn.png"/>  
All functions of the above form are normal distributions. You don’t really need to remember the mean and variance. Comparing the above equation with $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}$, we know that coefficient of $x^2 = \frac{1}{2\sigma^2} = \alpha$. Also, the mean is where the function takes the biggest value. That is the point where the absolute value of exponent of e takes the smallest value. Differentiating the exponent function you get the mean.
This tells us that a product of normal densities is again of this form, and hence is proportional to a normal density.  
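
The two facts used above can be checked by completing the square. This sketch assumes the form in the figure is $f(x) = c\,e^{-(\alpha x^2 + \beta x + \gamma)}$ with $\alpha > 0$; the symbols $\beta$, $\gamma$ and $c$ are assumptions, since only $\alpha$ is named in the text.

$$
\begin{align}
\alpha x^2 + \beta x + \gamma &= \alpha\left(x + \frac{\beta}{2\alpha}\right)^2 + \gamma - \frac{\beta^2}{4\alpha} \\
\frac{d}{dx}\left(\alpha x^2 + \beta x + \gamma\right) = 0 &\implies x = -\frac{\beta}{2\alpha}
\end{align}
$$

Matching the squared term with $\frac{(x-\mu)^2}{2\sigma^2}$ gives $\mu = -\frac{\beta}{2\alpha}$ and $\sigma^2 = \frac{1}{2\alpha}$.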

### Estimating Normal RV in Normal Noise
**Single Unknown, Single Observation**  
In our experiment, $X = W + \Theta$, where $X$ is the Observation RV, $W$ is the Noise RV and $\Theta$ is the unknown RV to be inferred. If the prior (distribution of $\Theta$) and the likelihood (which is the noise distribution translated by $\theta$ along the $x$ axis) are normal, then the posterior distribution of $\Theta$ is also normal (applying Bayes’ Rule, we can see that the posterior has the form presented above). If $P_\Theta = P_W = N(0,1)$ (both are standard normals) and $X = x$, then the peak of the posterior distribution is at $x/2$ (obtained by differentiating the quadratic function in the exponent with respect to $\theta$ and setting the derivative to zero). Therefore, $\hat{\theta}_{LMS} = \hat{\theta}_{MAP} = E[\Theta \mid X=x] = x/2$. In cases where $P_\Theta$ and $P_W$ are normal but not necessarily standard normal, the posterior is still normal and $\hat{\Theta}_{LMS} = \hat{\Theta}_{MAP} = aX+b$ for some constants $a$ and $b$.
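
For the standard normal case, the $x/2$ result can be worked out in a few lines (a sketch, with constants that do not depend on $\theta$ absorbed into the proportionality sign):

$$
\begin{align}
f_{\Theta | X}(\theta | x) &\propto f_\Theta(\theta)\, f_{X | \Theta}(x | \theta) \propto \exp\left\{-\frac{\theta^2}{2} - \frac{(x-\theta)^2}{2}\right\} \\
\frac{d}{d\theta}\left[\frac{\theta^2}{2} + \frac{(x-\theta)^2}{2}\right] &= \theta - (x - \theta) = 0 \implies \theta = \frac{x}{2} \\
\frac{\theta^2}{2} + \frac{(x-\theta)^2}{2} &= \left(\theta - \frac{x}{2}\right)^2 + \frac{x^2}{4}
\end{align}
$$

The last line shows that the posterior is normal with mean $x/2$ and variance $1/2$.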

**Single Unknown, Multiple Observations**  
Here $X$ is an n-dimensional vector as we make multiple observations $X_1, X_2, \dots, X_n$. All the observations are made at one particular value of $\Theta$. Therefore,  
$X_1 = \Theta + W_1$  
$X_2 = \Theta + W_2$  
$\vdots$  

where $W_i \sim N(0,\sigma_i^2)$ and $P_{X_i | \Theta}(x_i | \theta) = c_i e^{\frac{-(x_i - \theta)^2}{2\sigma_i^2}}$

The noise in each observation is different, but the value of the underlying unknown is the same. Though $\Theta$ can take different values, when we conduct the experiment and record observations $x_1, \dots, x_n$, $\Theta$ has taken some particular value $\theta$, and we are now trying to estimate that value using the posterior distribution.  

This is similar to the number game which we played in CCM Bayesian Modeling. Recall that we made multiple observations for a given hypothesis (concept), and then tried to predict the unknown hypothesis (concept) using the posterior distribution.  

The likelihood distribution $P_{X | \Theta}$ is given as follows:    
$$ P_{X | \Theta} = P_{X_1,\cdots,X_n | \Theta}(x_1,\cdots,x_n|\theta) = \prod_{i=1}^{n}P_{X_i | \Theta}(x_i | \theta)$$  

Plugging the value of $P_{X_i | \Theta}$ into the equation above and then multiplying the resulting likelihood by the prior distribution, we get the equation for the Posterior distribution (up to a normalizing constant). Again, to obtain the MAP Estimator, we differentiate the quadratic exponent of the posterior distribution w.r.t. $\theta$ and set the derivative to zero.  

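$$\hat{\theta}_{LMS} = \hat{\theta}_{MAP} = E[\Theta \mid X=x] = \frac{\sum_{i=0}^{n}\frac{x_i}{\sigma_i^2}}{\sum_{i=0}^{n}\frac{1}{\sigma_i^2}}$$

Here the $i=0$ terms account for the prior: $x_0$ and $\sigma_0^2$ denote the mean and variance of the normal prior on $\Theta$.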
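
The formula above is a weighted average in which each value is weighted by the inverse of its variance. The sketch below assumes that the $i=0$ term stands for the prior mean and prior variance of $\Theta$; the function and argument names are illustrative.

```python
def normal_posterior_mean(prior_mean, prior_var, observations, noise_vars):
    """Posterior mean (= MAP estimate) of Theta for X_i = Theta + W_i.

    Theta ~ N(prior_mean, prior_var) and W_i ~ N(0, noise_vars[i]),
    all independent. Each value is weighted by 1 / variance.
    """
    weights = [1.0 / prior_var] + [1.0 / v for v in noise_vars]
    values = [prior_mean] + list(observations)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Standard normal prior and noise, single observation x = 3: estimate is x / 2.
print(normal_posterior_mean(0.0, 1.0, [3.0], [1.0]))  # 1.5
```

A noisy observation (large $\sigma_i^2$) gets a small weight, so it moves the estimate less than a precise one.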

**Multiple Unknowns, Multiple Observations**  
In this model, $X$ and $\Theta$ are both  multi-dimensional vectors. We study this case using the Trajectory Example. Lectures 15.6, 15.7 and 15.8 go over this in detail.  

The relation between observations $X_i$(s), noise $W_i$(s) and unknown parameters $\Theta_j$(s) for the Trajectory Problem is given by:  
$$X_i = \Theta_0 + \Theta_1t_i + \Theta_2t_i^2 + W_i$$  
In the above problem, $\Theta_0$ is the (initial) height, $\Theta_1$ is the (initial) speed and $\Theta_2$ is the acceleration. All these variables can take any value during an experiment. We would have a prior distribution of these values based on the average height of buildings in an area, the velocity with which people can throw a ball vertically, gravitational acceleration etc. Based on the observations $X_i$, we try to obtain a posterior distribution over all $\Theta$(s).  
$$P_{\Theta|X} \propto P_{X | \Theta}  P_{\Theta} = P_{\Theta} P_{X_1 | \Theta} P_{X_2 | \Theta} \dots$$  
The MAP Estimator for $\Theta$ would be that tuple $(\theta_0,\theta_1,\theta_2)$ for which $\theta_0 + \theta_1t_i + \theta_2t_i^2$ is closest to the observed values (as far as the prior allows), because $W_i = 0$ has the highest probability density and the density decreases as $W_i$ deviates from $0$. Thus, $P_{X_i | \Theta}$ will be greater when $X_i$ falls closer to the estimated trajectory. The MAP estimate has the highest posterior probability, which means that for the MAP trajectory the $X_i$s are as close to the curve as possible.  

A key difference between the Trajectory Problem and the "Single Unknown, Multiple Observations" scenario is that here the $X_i$(s) do not share a common conditional mean. In the previous scenario every $P_{X_i \mid \Theta}$ had the same mean $\theta$. Here the mean of each $P_{X_i \mid \Theta}$ is different, and depends on the time $t_i$ of the observation. 

Here again we derive the posterior distribution for $\Theta = (\Theta_0, \Theta_1, \Theta_2)$. We observe that it has the form of a normal density. To get the MAP estimate (same as LMS) we need to find the peak of the normal curve. We do this by taking partial derivatives with respect to $\theta_0, \theta_1, \theta_2$ and equating them to 0. Thus we have three equations and three unknowns.  
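
The three equations can be set up and solved directly. The sketch below assumes zero-mean normal priors on each $\Theta_j$ and a common noise variance (both assumptions, not stated in these notes); the function names are hypothetical and the small linear solver is written out so that the block has no dependencies.

```python
def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def map_trajectory(times, observations, prior_vars, noise_var):
    """MAP estimate of (theta_0, theta_1, theta_2) for
    X_i = theta_0 + theta_1 t_i + theta_2 t_i^2 + W_i,
    with Theta_j ~ N(0, prior_vars[j]) and W_i ~ N(0, noise_var)."""
    features = [[1.0, t, t * t] for t in times]
    # Setting the partial derivatives of the quadratic exponent to zero gives
    # (diag(1 / prior_vars) + A^T A / noise_var) theta = A^T x / noise_var.
    lhs = [[sum(f[j] * f[k] for f in features) / noise_var for k in range(3)]
           for j in range(3)]
    for j in range(3):
        lhs[j][j] += 1.0 / prior_vars[j]
    rhs = [sum(f[j] * x for f, x in zip(features, observations)) / noise_var
           for j in range(3)]
    return solve_linear_system(lhs, rhs)

# Noise-free points on 1 + 2t - 0.5 t^2, with a very wide prior.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
observations = [1.0 + 2.0 * t - 0.5 * t * t for t in times]
print(map_trajectory(times, observations, prior_vars=[1e6] * 3, noise_var=1.0))
```

With a very wide prior the estimate is close to the ordinary least-squares fit; a tighter prior pulls the coefficients toward zero.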

The Trajectory Estimation Problem gives a glimpse into a large field that deals with Linear Normal Models. Here, for each problem we assume the existence of some underlying independent normal random variables and the observations $X_i$ and the unknown parameters $\Theta_j$ are linear functions of these underlying variables. $X_i$ and $\Theta_j$ are also normal, because linear functions of normal variables are normal too. Problems of this kind occur very frequently in practice. The Posterior is always of the form :
$$ f_{\Theta | X}(\theta | x) = c(x)\mathrm{exp\{-quadratic(\theta_1,\theta_2,\dots)\}}$$  

The MAP estimate maximizes the posterior over $(\theta_1, \theta_2, \dots)$ and thus minimizes the quadratic function in the exponent. Equating the partial derivatives to zero gives us a system of linear equations. Thus the name "*linear* regression".
$$\hat{\Theta}_{MAP,j} : \textrm{Linear Function of }X = (X_1, \dots, X_n)$$  

Inference (MAP estimation) under a Linear Normal Model is the same as finding the parameters of the curve such that the distance between the curve and the points is as small as possible. Since the noise terms are normal, the likelihood of each observation is largest when the observation lies closest to the value predicted by the curve, so the likelihood is maximized when the points are closest to the curve; the MAP estimate additionally weighs this fit against the prior. Carrying out Inference under the Linear Normal Model Assumption is known as Linear Regression. Why Linear? Because we get a system of linear equations for the MAP estimates. See [this video](https://www.youtube.com/watch?v=qinepPxDUcY&list=PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6&index=161) for more explanation. 

## Least Mean Square Estimation 
*Lecture 16*

In this section, we basically just prove that the conditional expectation estimator results in the least possible mean squared error (hence the abbreviation LMS), and we explore some of its other properties.

We start by considering the simpler problem of estimating $\Theta$ with a constant $\hat{\theta}$, in the absence of an observation $X$. The estimation error $\hat{\theta} - \Theta$ is random, but the mean squared error $E[(\hat{\theta} - \Theta)^2]$ is a number that depends on $\hat{\theta}$ and can be minimized over $\hat{\theta}$. With this criterion, it turns out that the best possible estimate is to set $\hat{\theta} = E[\Theta]$, as shown below:  
$$E[(\Theta - \hat{\theta})^2] = var(\Theta - \hat{\theta}) + (E[\Theta - \hat{\theta}])^2 = var(\Theta) + (E[\Theta - \hat{\theta}])^2$$  
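
The decomposition above can be checked numerically on a small discrete distribution. The PMF below is hypothetical; any distribution would do.

```python
def mean_squared_error(pmf, estimate):
    """E[(Theta - estimate)^2] for a discrete Theta given as {value: probability}."""
    return sum(p * (theta - estimate) ** 2 for theta, p in pmf.items())

pmf = {0: 0.2, 1: 0.5, 4: 0.3}  # hypothetical distribution of Theta
mean = sum(theta * p for theta, p in pmf.items())
variance = mean_squared_error(pmf, mean)  # MSE at the mean equals var(Theta)

for candidate in (0.0, 1.0, mean, 2.0, 4.0):
    mse = mean_squared_error(pmf, candidate)
    # MSE = var(Theta) + (E[Theta] - candidate)^2, smallest at the mean.
    print(candidate, mse, variance + (mean - candidate) ** 2)
```

Both printed columns agree for every candidate, and the smallest value occurs at the mean.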

Even for the case where we use an observation $X$ to estimate $\Theta$, the situation is identical to the one considered earlier, except that we are now in a new "universe", where everything is conditioned on $X=x$. We can therefore adapt our earlier conclusion and assert that the conditional expectation $E[\Theta \mid X=x]$ minimizes the conditional mean squared error between estimate and actual random variable $E[(\Theta - \hat{\theta})^2 \mid X=x]$.  
<img src="../images/bayesian_inference/LMS.png">

## Linear Least Mean Square Estimation
*Lecture 17*

In this section, we derive an estimator that minimizes the mean squared error within a restricted class of estimators: those that are linear functions of the observations. While this estimator may result in a higher mean squared error, it has a significant practical advantage: it requires simple calculations, involving only means, variances, and covariances of the parameters and observations. It is thus a useful alternative to the conditional expectation/LMS estimator in cases where the latter is hard to compute.

A linear estimator of a random variable $\Theta$, based on observations $X_1, \dots , X_n$, has the form
$$\hat{\Theta} = a_1X_1 + \dots + a_nX_n + b$$

Given a particular choice of the scalars $a_1, \dots , a_n, b$ the corresponding mean squared error is
$$E[(\Theta - a_1X_1 - \dots - a_nX_n - b)^2]$$

**Single Observation, n=1**  
We are interested in finding $a$ and $b$ that minimize the mean squared estimation error $E[(\Theta - aX -b)^2]$ associated with a linear estimator $aX + b$ of $\Theta$. Suppose that $a$ has already been chosen. How should we choose $b$? This is the same as choosing a constant $b$ to estimate the random variable $\Theta - aX$. The best choice is $b = E[\Theta - aX] = E[\Theta] - aE[X]$.  
<br/>
<img src="../images/bayesian_inference/LLMS.png">
<br/>
The formula for the linear LMS estimator only involves the means, variances, and covariance of $\Theta$ and $X$. Furthermore, it has an intuitive interpretation. The estimator starts with the baseline estimate $E[\Theta]$, which it then adjusts by taking into account the value of $X - E[X]$. Suppose $cov(\Theta, X)$ is positive. This means that the estimator should increase in proportion to $X - E[X]$. The LLMS estimator equation shows that the proportionality constant is $a = \frac{cov(\Theta, X)}{var(X)}$. If the correlation coefficient $\rho$ of $\Theta$ and $X$ is zero (equivalently, $cov(\Theta, X) = 0$), then $X$ and $\Theta$ are uncorrelated, so the observed value of $X$ does not affect the estimate, which reduces to $E[\Theta]$.
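
A small sketch of the computation, using $a = \frac{cov(\Theta, X)}{var(X)}$ and $b = E[\Theta] - aE[X]$ from above. The coin setting ($\Theta$ uniform on $[0,1]$, $X$ the number of heads in $n$ tosses) and its moments $E[X] = n/2$, $var(X) = n(n+2)/12$, $cov(\Theta, X) = n/12$ are worked out here as an illustration; the function names are hypothetical.

```python
from fractions import Fraction

def linear_lms_coefficients(mean_theta, mean_x, var_x, cov_theta_x):
    """Return (a, b) of the linear LMS estimator a * X + b."""
    a = cov_theta_x / var_x
    b = mean_theta - a * mean_x
    return a, b

def coin_moments(n):
    """Moments for Theta ~ Uniform[0, 1] and X | Theta ~ Binomial(n, Theta)."""
    mean_theta = Fraction(1, 2)
    mean_x = Fraction(n, 2)
    var_x = Fraction(n * (n + 2), 12)
    cov_theta_x = Fraction(n, 12)
    return mean_theta, mean_x, var_x, cov_theta_x

a, b = linear_lms_coefficients(*coin_moments(10))
print(a, b)  # 1/12 1/12
```

For general $n$ this gives $\hat{\Theta} = \frac{X+1}{n+2}$, which is also the mean of the Beta$(k+1, n-k+1)$ posterior, so in this setting the linear estimator coincides with the conditional expectation.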
<br/>
The video lectures discuss two examples for LLMSE:  
<ol>
    <li> $\Theta$ is a uniform random variable in the interval $[a,b]$ and observations are values of $\Theta$ in presence of uniform noise.</li>
    <li> $\Theta$ is the bias of a coin and $X$ is the number of heads in $n$ coin tosses.</li>
</ol>
In both the cases LLMS = LMS estimator.  
<br/>

**Multiple Observations, n>1**  
Refer to the textbook. Here we learn that in linear normal models, where $\Theta$ and the observations are linear functions of independent normal random variables, $\hat{\theta}_{LMS} = \hat{\theta}_{LLMS} = \hat{\theta}_{MAP}$. The ease of LLMS computation and the ubiquity of Normal distributions make LLMS estimators very handy.