# Inference

## Detection and Bayes Rule

#### Motivation: Intractability of Bayesian Estimation

In the Bayesian approach, we keep track of the posterior distribution of our model parameter to estimate the distribution of our random variable.

*(e.g.) $X\sim Exp(\Lambda), \Pr(\Lambda = \lambda_1) = \Pr(\Lambda = \lambda_2) = \frac 12$*

$$f_X(x) = \sum_{\lambda \in \{\lambda_1,\lambda_2\}}f_{X|\Lambda}(x|\lambda)\Pr(\Lambda = \lambda)$$
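A minimal numerical sketch of the example above, assuming hypothetical rates $\lambda_1 = 1$ and $\lambda_2 = 3$ (the notes do not fix values). It evaluates the mixture density of $X$ under the equal prior, and the posterior of $\Lambda$ after observing $X = x$ via Bayes' rule.

```python
import math

def exp_pdf(x, lam):
    """Density of Exp(lam) at x (zero for negative x)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def marginal_pdf(x, rates, prior):
    """f_X(x) = sum over lambda of f_{X|Lambda}(x|lambda) * Pr(Lambda = lambda)."""
    return sum(p * exp_pdf(x, lam) for lam, p in zip(rates, prior))

def posterior(x, rates, prior):
    """Pr(Lambda = lambda | X = x) for each candidate rate, via Bayes' rule."""
    joint = [p * exp_pdf(x, lam) for lam, p in zip(rates, prior)]
    total = sum(joint)
    return [j / total for j in joint]

rates = [1.0, 3.0]   # hypothetical lambda_1, lambda_2
prior = [0.5, 0.5]   # equal prior from the example
post = posterior(2.0, rates, prior)
```

With more candidate values of $\Lambda$ (or a continuous prior) the sum becomes an integral, which is where the cost of the full Bayesian approach shows up.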

## Point Estimation

Let $q_i = \Pr[Y|\Theta_i]$ denote the probability of our observation, $Y$, given the $i$-th candidate model, $\Theta_i$. Let $p_i = \Pr[\Theta_i]$ denote our prior: the probability of that model.

<img src="images/bayes.jpg" width="100">

#### MLE Rule

Maximizes the probability of data given model. Implicitly assumes a uniform prior.

$$MLE = \arg\max_{i}q_i$$

#### MAP Rule

Maximizes the posterior: the probability of the model given the data. We can solve for the posterior using Bayes' rule. Because the denominator, $\Pr[Y]$, is the same for all $i$, this is equivalent to maximizing the joint distribution.

$$MAP = \arg\max_{i}P[\Theta_i|Y] = \arg\max_{i}p_iq_i$$

To simplify calculations we can apply any strictly increasing function to this value without changing the arg max. A common choice is the natural log, which turns the product $p_iq_i$ into the sum $\ln p_i + \ln q_i$.
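The two rules can be compared on a small made-up example. This is only a sketch: the priors and likelihoods are hypothetical, and it assumes every $p_i$ and $q_i$ is strictly positive so that the logs are defined.

```python
import math

def mle(likelihoods):
    """Index i maximizing q_i = Pr[Y | Theta_i]."""
    return max(range(len(likelihoods)), key=lambda i: likelihoods[i])

def map_estimate(priors, likelihoods):
    """Index i maximizing log p_i + log q_i (same arg max as p_i * q_i)."""
    scores = [math.log(p) + math.log(q) for p, q in zip(priors, likelihoods)]
    return max(range(len(scores)), key=lambda i: scores[i])

priors = [0.9, 0.1]        # hypothetical p_i
likelihoods = [0.3, 0.8]   # hypothetical q_i for one observed Y
mle_choice = mle(likelihoods)
map_choice = map_estimate(priors, likelihoods)
```

Here the two rules disagree: the likelihood favors the second model, but the strong prior on the first outweighs it. With a uniform prior the two rules always agree.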

#### Gaussian Channel
<font color="red">not sure of role or significance ... see B&T for context? projection is linear ...</font>

## Hypothesis Testing

For indicators of extremely rare events, MLE and MAP have serious flaws.

- When the conditional probability of the indicator given the rare event is high, MLE will always "cry wolf" when the indicator is observed.
- An event with a small prior (i.e. a rare event) may be penalized by MAP more than desired.

We want an estimator that gives us control over the sensitivity of our alarm so that we can tune it to our purposes.

#### Definitions

PFA: probability of false alarm (error), $\Pr[\hat{X} = 1 | X = 0]$

PCD: probability of correct detection, $\Pr[\hat{X} = 1 | X = 1]$

Our goal will be to maximize PCD while keeping PFA below a specified value, $\beta$.

#### ROC Curves

ROC curves let us visualize how our sensitivity changes as we permit more error. 

<img src="images/ROC.jpg" width="200">
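As a rough illustration, the points of an ROC curve can be computed for a toy detection problem. The setup here is an assumption, not from the notes: $Y \sim N(0, 1)$ when $X = 0$, $Y \sim N(\mu, 1)$ when $X = 1$, and we raise the alarm when $Y$ exceeds a threshold $t$.

```python
from statistics import NormalDist

def roc_points(mu, thresholds):
    """Return (PFA, PCD) pairs for the rule 'declare 1 when Y > t'."""
    h0 = NormalDist(0.0, 1.0)   # distribution of Y given X = 0
    h1 = NormalDist(mu, 1.0)    # distribution of Y given X = 1
    return [(1 - h0.cdf(t), 1 - h1.cdf(t)) for t in thresholds]

# Lowering the threshold moves us up and to the right along the curve.
points = roc_points(mu=1.0, thresholds=[3.0, 2.0, 1.0, 0.0, -1.0])
```

Each threshold gives one point on the curve; sweeping it from $+\infty$ to $-\infty$ traces the curve from $(0, 0)$ to $(1, 1)$.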

#### Neyman-Pearson Theorem

The Neyman-Pearson Theorem is the answer to our goal. Intuitively, the estimator $\hat{X}$ that maximizes PCD while meeting our error bound, $\beta$, has a PFA of exactly $\beta$.

Formally, our optimal estimator is as follows:

$$\hat{X}=\left\{
    \begin{array}{ll}
      1, & \text{if } L(\Upsilon) > \lambda\\
      1 \text{ w.p. } \gamma, & \text{if } L(\Upsilon) = \lambda\\
      0, & \text{if } L(\Upsilon) < \lambda
    \end{array}
  \right.
$$

where

$$L(\Upsilon) = \frac{f_{\Upsilon|X}(y|1)}{f_{\Upsilon|X}(y|0)}$$

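Here $L$ is the likelihood ratio, and the threshold $\lambda$ and randomization probability $\gamma$ are chosen so that PFA equals $\beta$. When $L(\Upsilon)$ is strictly increasing or decreasing in $y$, a threshold $\lambda$ on the likelihood ratio is equivalent to a threshold on $y$.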
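A small sketch of the threshold test for one concrete case, assuming (hypothetically) that the observation is $N(0, 1)$ when $X = 0$ and $N(\mu, 1)$ with $\mu > 0$ when $X = 1$. The likelihood ratio is then $e^{\mu y - \mu^2/2}$, which is increasing in $y$, so the test reduces to comparing $y$ against a boundary $y_0$. Because the observation is continuous, the randomization $\gamma$ plays no role here.

```python
import math
from statistics import NormalDist

def neyman_pearson_gaussian(mu, beta):
    """Threshold test for N(0,1) under X=0 versus N(mu,1) under X=1, mu > 0."""
    std = NormalDist()
    y0 = std.inv_cdf(1 - beta)             # boundary with P(Y > y0 | X = 0) = beta
    lam = math.exp(mu * y0 - mu**2 / 2)    # likelihood ratio evaluated at y0
    pcd = 1 - std.cdf(y0 - mu)             # P(Y > y0 | X = 1)
    return y0, lam, pcd

y0, lam, pcd = neyman_pearson_gaussian(mu=1.0, beta=0.05)
```

The function returns both boundaries, $y_0$ on the observation and $\lambda$ on the likelihood ratio, to make the correspondence between them explicit.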

## Point Estimation

What function of our observed values will minimize our expected error?

We will frame this problem as a projection in the following vector space

- $\langle X, Y \rangle = E[XY]$: the inner product of two random variables is the expected value of their product

Our optimal estimate will minimize the error between our actual model $X$ and our estimate, $\hat{X}$, which is simply a projection onto a subspace of functions of $Y$.

<img src="images/estimators.jpg" width="200">

#### LLSE

$$L[X|Y] = E[X] + \frac{\text{cov}(X, Y)}{\text{var}(Y)}(Y - E[Y])$$

*Intuition*

Our best guess, independent of $Y$, is the expected value of $X$. We modify this estimate by $Y$'s deviation from its mean, scaled by how strongly $X$ and $Y$ are correlated.

*Proof by Projection for Zero Mean Random Variable*

$$L[X|Y] = \text{proj}_{f(Y), f \in L}(X) = \frac{\langle X, Y \rangle}{\langle Y, Y \rangle}Y = \frac{E[XY]}{E[Y^2]}Y$$

The penultimate expression follows from the projection formula. For random variables with nonzero mean, we can subtract the means, solve for $L[X|Y]$, and add the means back.

*Proof by Calculus*

We wish to minimize $E[\Delta^2] = E[(X - L[X|Y])^2]$. Recall that any linear estimator can be written as $a + bY$ for some coefficients $a$ and $b$.

$$E[\Delta^2] = E[(X - a - bY)^2]$$

Take the derivatives with respect to $a$ and $b$, set them to zero, and solve to find the optimum.

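$$2a - 2E[X] + 2bE[Y] = 0$$
$$a^* = E[X] - bE[Y]$$

$$2bE[Y^2] + 2aE[Y] - 2E[XY] = 0$$

Substituting $a^*$ into the second equation gives $b(E[Y^2] - E[Y]^2) = E[XY] - E[X]E[Y]$, so

$$b^* = \frac{\text{cov}(X, Y)}{\text{var}(Y)}$$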

<font color="red">*Vector Case*</font>

*(e.g.) For Zero-Mean Independent Variables, LLSE can be written using the SNR*

Assume $E[X]$ and $E[Z]$ are 0 and $X$ and $Z$ are independent. $Y = \alpha X + Z$. Find $L[X|Y]$.

$$\text{cov}(X, Y) = \alpha E[X^2]$$

$$\text{var}(Y) = \alpha^2 E[X^2] + E[Z^2]$$

$$L[X|Y] = \frac{\alpha E[X^2]}{\alpha^2 E[X^2] + E[Z^2]}Y = \frac{\alpha^{-1}Y}{1 + SNR^{-1}}$$

$$SNR = \frac{\alpha^2E[X^2]}{E[Z^2]}$$
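A quick numerical check of this example, with hypothetical values for $\alpha$, $E[X^2]$ and $E[Z^2]$. It computes the coefficient of $Y$ once from the covariance and variance, and once from the SNR form, and the two should agree.

```python
def llse_gain(alpha, ex2, ez2):
    """Coefficient g in L[X|Y] = g * Y for Y = alpha*X + Z, X and Z zero-mean, independent."""
    cov_xy = alpha * ex2
    var_y = alpha**2 * ex2 + ez2
    return cov_xy / var_y

def llse_gain_snr(alpha, ex2, ez2):
    """Same coefficient written with SNR = alpha^2 E[X^2] / E[Z^2]."""
    snr = alpha**2 * ex2 / ez2
    return (1 / alpha) / (1 + 1 / snr)

# Hypothetical values: alpha = 2, E[X^2] = 1.5, E[Z^2] = 0.5
g_direct = llse_gain(2.0, 1.5, 0.5)
g_snr = llse_gain_snr(2.0, 1.5, 0.5)
```

As the SNR grows, the coefficient approaches $\alpha^{-1}$ (simply invert the channel); as it shrinks, the coefficient approaches 0 (ignore the observation and fall back on the mean).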


#### Linear Regression

To perform LLSE, you need to know the joint statistics of $X$ and $Y$. If instead we only observe samples $\{(X_i, Y_i)\}_{i=1}^n$, we can construct a linear function that minimizes the sample error: $\frac1n\sum_i|X_i - a - bY_i|^2$. The solution has the same form as before, but we use the sample covariance, variance, and mean in place of the true ones.
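A minimal sketch of this recipe, using made-up data that lies exactly on a line so the result is easy to verify by hand.

```python
def linear_regression(xs, ys):
    """Fit x ~= a + b*y from paired samples using sample moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    b = cov_xy / var_y          # sample version of cov(X, Y) / var(Y)
    a = mean_x - b * mean_y     # sample version of E[X] - b E[Y]
    return a, b

ys = [0.0, 1.0, 2.0, 3.0]
xs = [1.0, 3.0, 5.0, 7.0]       # hypothetical data with x = 1 + 2*y exactly
a, b = linear_regression(xs, ys)
```

Note that we regress $X$ on $Y$ here, matching the notation above where $Y$ is the observation and $X$ is the quantity being estimated.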

#### Projection of Arbitrary Degree
- rederive using Gram-Schmidt
- or use the orthogonality property $E[(X - \hat{X})\hat{X}] = 0$
- can express as $aY^2 + bY + c$ ...

#### MMSE

The MMSE estimator of $X$ given $Y$ is the conditional expectation: $E[X|Y = y]$. Intuitively, if we are given a joint distribution of $X$ and $Y$, we should pick the point, $x$, that balances the distribution of $X$ at each $y$, which is why the MMSE curve in the estimators figure above bisects the joint distribution.
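For a discrete joint distribution, the conditional expectation can be computed directly. The joint pmf below is hypothetical and only meant to illustrate the "balance point" idea.

```python
from collections import defaultdict

def mmse_estimator(joint):
    """Map each y to E[X | Y = y] for a joint pmf given as {(x, y): probability}."""
    num = defaultdict(float)   # sum over x of x * P(X = x, Y = y)
    den = defaultdict(float)   # P(Y = y)
    for (x, y), p in joint.items():
        num[y] += x * p
        den[y] += p
    return {y: num[y] / den[y] for y in den}

# Hypothetical joint pmf over (x, y) pairs
joint = {(0, 0): 0.2, (1, 0): 0.2, (1, 1): 0.1, (3, 1): 0.5}
estimate = mmse_estimator(joint)
```

For each observed $y$, the estimate is the probability-weighted average of the $x$ values that can occur together with it.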


<font color="red">*Proof that the E[X|Y] is the MMSE*</font>

## Jointly Gaussian Random Variables

#### Definitions

1. Let $Z = [Z_1, \ldots, Z_{\ell}]^T$, where $Z_i \sim N(0, 1)$ and all $Z_i$ are independent. $X$ is JG if $$X = AZ + \mu$$

2. Any linear combination of JG random variables is Gaussian. From every perspective, jointly Gaussian random variables look Gaussian.

3. The joint density in the non-degenerate case (invertible $\Sigma$) is as follows

$$f_X(x) = \frac{1}{\sqrt{|\Sigma|(2\pi)^{n}}}e^{-\frac12(x - \mu)^T\Sigma^{-1}(x - \mu)}$$


*Proof: Density of JG Given Change of Variables (1) -> (3)*

We know $f_{Z_1, \ldots, Z_n}(z_1, \ldots, z_n)$:

$$f_Z(z) = \frac{1}{\sqrt{(2\pi)^n}}e^{-\frac{z_1^2 + \ldots + z_n^2}2} = \frac{1}{\sqrt{(2\pi)^n}}e^{-\frac{z^Tz}2}$$

Now we will perform a change of variables using definition (1): $Z = A^{-1}(X - \mu)$

$$f_Z(z)dz = f_X(x)dx = f_X(x)|\text{det}(A)|dz$$

What is happening with $dx$ and $dz$?
<font color="red">When we perform a change of variables, we need to consider how the matrix $A$ skews the unit hypercube defined by $dz$: its image under $A$ is a parallelepiped $dx$ with volume $|\text{det}(A)|dz$.</font>

$$f_X(x) = \frac{f_Z(z)}{|\text{det}(A)|} = \frac{1}{|\text{det}(A)|\sqrt{(2\pi)^{n}}}e^{-\frac{(A^{-1}(x - \mu))^T(A^{-1}(x - \mu))}2}
= \frac{1}{\sqrt{|\Sigma|(2\pi)^{n}}}e^{-\frac12(x - \mu)^T\Sigma^{-1}(x - \mu)}$$

We use the fact that $\Sigma = AA^T$, so that $|\Sigma| = \text{det}(A)^2$ and $\Sigma^{-1} = (A^{-1})^TA^{-1}$, to get the final expression.

*Proof: $\Sigma = AA^T$*

$$\Sigma = E[(X - \mu)(X - \mu)^T] = E[AZ(AZ)^T] = AE[ZZ^T]A^T = AA^T$$
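The identity can be sanity-checked by simulation. The sketch below draws samples of $X = AZ + \mu$ for a hypothetical $2 \times 2$ matrix $A$ and compares the sample covariance to $AA^T$; the sample size and seed are arbitrary choices.

```python
import random

def sample_covariance(A, mu, n, seed=0):
    """Estimate cov(X) for X = A Z + mu from n draws with Z_i ~ N(0, 1) i.i.d."""
    rng = random.Random(seed)
    rows, cols = len(A), len(A[0])
    draws = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        x = [sum(A[i][k] * z[k] for k in range(cols)) + mu[i] for i in range(rows)]
        draws.append(x)
    mean = [sum(x[i] for x in draws) / n for i in range(rows)]
    return [
        [sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in draws) / n for j in range(rows)]
        for i in range(rows)
    ]

def a_a_transpose(A):
    """Compute A A^T entry by entry."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in A] for row_i in A]

A = [[2.0, 0.0], [1.0, 1.0]]   # hypothetical mixing matrix
mu = [1.0, -1.0]               # hypothetical mean vector
exact = a_a_transpose(A)
approx = sample_covariance(A, mu, n=50_000)
```

The mean $\mu$ drops out of the covariance, which is why only $A$ appears in $\Sigma$.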

#### Properties

- For JG Random Variables, No Correlation implies Independence
- The MMSE of JG Random Variables is the LLSE

<font color="red">*Proof: For JG Random Variables, No Correlation implies Independence*</font>

<font color="red">*Proof: The MMSE of JG Random Variables is the LLSE*</font>

*(e.g.) Expressing JG Random Variables Using Another Independent Gaussian, Z*

One useful property of JG random variables, $U$ and $V$, is that we can express $U$ as a function of $V$ and a random variable $Z$, such that $Z$ is independent of $V$. Now expressions involving $U$ and $V$ can be written in terms of two independent Gaussians.

In notation: $U, V$ are jointly Gaussian. Find $a$ and $Z$, such that $U = aV + Z$ and $Z \perp V$.

<img src="images/JG_forced_independence.jpg" width="200">

We find the answer by expanding out our expression for $U$ and shifting the mean of $Z'$ so we can express $U$ concisely as $aV + Z$.

Notice that we can only use orthogonality to generate an independent random variable because $U$ and $V$ are jointly Gaussian. For other random variables, orthogonality (zero covariance) does not imply independence.
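A small sketch of the decomposition, assuming it is the usual projection of $U$ onto $V$: take $a = \text{cov}(U, V) / \text{var}(V)$ and $Z = U - aV$. The variances and covariance used below are hypothetical.

```python
def decompose(var_u, var_v, cov_uv):
    """Return a, var(Z), and cov(Z, V) for U = a*V + Z with Z = U - a*V."""
    a = cov_uv / var_v
    var_z = var_u - a * cov_uv     # var(U) - 2a cov(U,V) + a^2 var(V), simplified
    cov_zv = cov_uv - a * var_v    # zero by construction
    return a, var_z, cov_zv

# Hypothetical second-order statistics of (U, V)
a, var_z, cov_zv = decompose(var_u=4.0, var_v=2.0, cov_uv=2.0)
```

The construction only guarantees that $Z$ and $V$ are uncorrelated; it is joint Gaussianity that upgrades this to independence. The mean of $Z$ is $E[U] - aE[V]$.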

<font color="red">relationship between independence, correlation, orthogonality, and zero dot product (notice analogous notation for orthogonal and independent that only works for JG case); most important: there's a way to derive projections without assuming zero mean, also true for relationship between covariance and orthogonality</font>

<font color="red">HW 10 problem 5</font>

## Kalman Filter

#### State Space Equations

$$x_n = ax_{n - 1} + v_n$$

$$y_n = cx_{n} + w_n$$

#### Case 1: c = 1

<img src="images/kalman.png" width="400">

Goal: find GE (variance or error of our prediction) and OE (LLSE prediction)

$$OG = x_n \hspace{1.5cm} OM = \hat{x}_{n-1|n-1} \hspace{1.5cm} OF = Y_n$$

$$OC = \tilde{Y}_n \hspace{1.5cm} OB = \hat{x}_{n|n-1} \hspace{1.5cm} OA = L[X_n|\tilde{Y}_n]$$

$$BG = \Delta_{n|n-1} \hspace{1.5cm} GE = \Delta_{n|n} \hspace{1.5cm} GF = w_n$$

Solve for OA using geometry: OA = BE, which we will derive by analyzing the triangle along the back wall. OB is given from the previous step.

$$BF = OC = \tilde{Y}_n$$

Because BGE, BGF are similar triangles: OA = BE = $\frac{\sigma_{n|n-1}^2}{\sigma_{n|n-1}^2 + \sigma_{w}^2}\tilde{Y}_n = K_n\tilde{Y}_n$
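One step of the scalar filter for the $c = 1$ case can be sketched as follows. The gain matches $K_n$ above. The prediction variance $a^2\sigma_{n-1|n-1}^2 + \sigma_v^2$ and the error update $(1 - K_n)\sigma_{n|n-1}^2$ are the standard scalar Kalman recursions and are taken as assumptions here, since the notes have not derived them yet. The function and argument names are hypothetical.

```python
def kalman_step(x_prev, var_prev, y, a, var_v, var_w):
    """One predict/update step for x_n = a x_{n-1} + v_n, y_n = x_n + w_n (c = 1)."""
    x_pred = a * x_prev                      # x-hat_{n|n-1}
    var_pred = a**2 * var_prev + var_v       # sigma^2_{n|n-1} (assumed standard form)
    innovation = y - x_pred                  # Y-tilde_n: part of Y_n not predicted so far
    gain = var_pred / (var_pred + var_w)     # K_n from the similar-triangles argument
    x_new = x_pred + gain * innovation       # x-hat_{n|n} = x-hat_{n|n-1} + K_n * Y-tilde_n
    var_new = (1 - gain) * var_pred          # sigma^2_{n|n} (assumed standard form)
    return x_new, var_new, gain

# Hypothetical numbers: previous estimate 0 with variance 1, new observation y = 2
x_new, var_new, gain = kalman_step(0.0, 1.0, 2.0, a=1.0, var_v=1.0, var_w=2.0)
```

Calling the function repeatedly, feeding each output estimate and variance back in as the next `x_prev` and `var_prev`, runs the filter over a sequence of observations.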

<font color="red">left off: find error of new prediction (GE)</font>

#### Case 2: c > 1