In [11]:
import numpy as np

# MUD Point for Linear Mappings

We want to compute the maximum-updated-density (MUD) point, that is, the maximizer of the updated posterior from data-consistent inversion. We do this for the simple case of linear maps, introducing assumptions as we go to keep the result interpretable.

We start by writing the MUD point as the solution of a maximization problem:

\begin{align}
\max_{\lambda}{\pi^{up}(\lambda)}=\max_{\lambda}\left\{\pi^{init}(\lambda)\cdot\dfrac{\pi^{data}\left(Q(\lambda)\right)}{\pi^{pf}\left(Q(\lambda)\right)}\right\}
\end{align}


## For all one-to-one maps, the updated PDF is exactly the same as the data-generating PDF
First suppose that $\lambda$ is a univariate random variable and $Q(\lambda)$ is a linear map from $\mathbb{R}\rightarrow\mathbb{R}$: 

$$Q(\lambda)=A\lambda+b.$$

For such a linear transformation with $A\neq 0$, we can compute the push-forward of $\pi^{init}$ through the map $Q$ directly from the change-of-variables formula. The absolute value is needed so that the density stays positive when $A<0$.

\begin{align}
\pi^{pf}(q)&=\frac{1}{|A|}\pi^{init}\left(\frac{q-b}{A}\right)=\frac{1}{|A|}\pi^{init}\left(\lambda\right)
\end{align}

What does this mean?

Substituting into the updated density, the initial density cancels:

\begin{align}
\pi^{up}(\lambda)&=\pi^{init}(\lambda)\cdot\dfrac{\pi^{data}\left(Q(\lambda)\right)}{\frac{1}{|A|}\pi^{init}\left(\lambda\right)} \\[2ex]
&=|A|\cdot\pi^{data}\left(Q(\lambda)\right) \\[2ex]
\end{align}

Let $\pi^{gen}(\lambda)$ be the true data-generating pdf of $\lambda$. The data density is the push-forward of $\pi^{gen}$ through $Q$, so the same change of variables gives $\pi^{data}\left(Q(\lambda)\right)=\frac{1}{|A|}\pi^{gen}(\lambda)$, and therefore:

\begin{align}
\pi^{up}(\lambda)&=|A|\cdot\pi^{data}\left(Q(\lambda)\right)=|A|\cdot\frac{1}{|A|}\pi^{gen}\left(\lambda\right)=\pi^{gen}\left(\lambda\right) \\[2ex]
\end{align}

In other words, the updated posterior is exactly the data-generating pdf. All statistical moments therefore agree, and there is nothing further to prove.
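
The cancellation can be checked numerically. The following is a minimal sketch, assuming normal initial and data-generating densities and illustrative values for $A$ and $b$ that are not part of the derivation; `normal_pdf` is a small helper written for this example.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std**2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Illustrative (assumed) values, chosen only for this check.
A, b = 2.0, 1.0                  # linear map Q(lam) = A*lam + b
init_mean, init_std = 0.0, 2.0   # initial density
gen_mean, gen_std = 1.0, 0.5     # data-generating density

lam = np.linspace(-3.0, 5.0, 401)
q = A * lam + b

# The push-forward of a normal through a linear map is normal:
# the mean maps to A*mean + b and the standard deviation scales by |A|.
pf = normal_pdf(q, A * init_mean + b, abs(A) * init_std)
data = normal_pdf(q, A * gen_mean + b, abs(A) * gen_std)

updated = normal_pdf(lam, init_mean, init_std) * data / pf
generating = normal_pdf(lam, gen_mean, gen_std)

# Largest pointwise gap between the updated and data-generating densities.
max_abs_diff = np.max(np.abs(updated - generating))
print(max_abs_diff)
```

The printed gap should be at the level of floating-point round-off, whatever initial mean and standard deviation are chosen.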

Now suppose $Q(\lambda)$ is any one-to-one function with inverse $Q^{-1}(q)$. Then we can compute the push-forward for the random variable $\lambda$ exactly:

\begin{align}
\pi^{pf}(q)&=\pi^{init}\left(Q^{-1}(q)\right)\cdot \left|\dfrac{dQ^{-1}}{dq}\right|=\pi^{init}\left(\lambda\right)\left|\dfrac{dQ^{-1}}{dq}\right|
\end{align}

Then we will have the same situation as for a linear map:

\begin{align}
\pi^{up}(\lambda)&=\pi^{init}(\lambda)\cdot\dfrac{\pi^{data}\left(Q(\lambda)\right)}{\pi^{init}\left(\lambda\right)\left|\dfrac{dQ^{-1}}{dq}\right|} \\[2ex]
&=\left|\dfrac{dQ^{-1}}{dq}\right|^{-1}\cdot\pi^{data}\left(Q(\lambda)\right) \\[2ex]
&=\left|\dfrac{dQ^{-1}}{dq}\right|^{-1}\left|\dfrac{dQ^{-1}}{dq}\right|\cdot\pi^{gen}\left(\lambda\right) \\[2ex]
&=\pi^{gen}\left(\lambda\right) \\[2ex]
\end{align}

This again means that there is nothing to prove.
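
The same check works for a nonlinear one-to-one map. The sketch below assumes $Q(\lambda)=e^{\lambda}$, so that $Q^{-1}(q)=\log q$ and $\frac{dQ^{-1}}{dq}=\frac{1}{q}$; the map, the densities, and all function names are illustrative choices, not part of the derivation.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std**2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Assumed one-to-one map and its inverse (illustrative choice).
def Q(lam):
    return np.exp(lam)

def Q_inv(q):
    return np.log(q)

def dQ_inv(q):
    return 1.0 / q

def push_forward(density, q):
    """Change of variables: density(Q^{-1}(q)) * |dQ^{-1}/dq|."""
    return density(Q_inv(q)) * np.abs(dQ_inv(q))

def init(lam):
    return normal_pdf(lam, 0.0, 1.5)   # initial density (assumed)

def gen(lam):
    return normal_pdf(lam, 0.5, 0.4)   # data-generating density (assumed)

lam = np.linspace(-1.5, 2.5, 201)
q = Q(lam)

# Updated density: initial times data over push-forward, all evaluated at Q(lam).
updated = init(lam) * push_forward(gen, q) / push_forward(init, q)
nonlinear_diff = np.max(np.abs(updated - gen(lam)))
print(nonlinear_diff)
```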

### KEY TAKEAWAY:
For any one-to-one map $Q$, all statistics from the updated pdf will be the same as if drawn from the data-generating pdf, **regardless of the initial distribution**. This means that the only errors that will occur in the calculation of statistics on $\lambda$ will be from approximations of either the data pdf or the push-forward pdf.

## Linear Maps without an Inverse:

Suppose we have a map that is linear but not one-to-one. Let $\lambda$ be a multivariate random variable in $\mathbb{R}^M$ and $Q(\lambda):\mathbb{R}^M\rightarrow\mathbb{R}$, where $Q$ is linear. In particular, suppose $Q(\lambda)$ is:

\begin{align}
Q(\lambda)=a^T\lambda
\end{align}

Let $q\sim N(\nu,\sigma_\nu^2)$, in other words, the data being analyzed is normally distributed.

Since our data is normally distributed, I suspect it will be beneficial to choose an initial distribution from a family of distributions such that the push-forward of our initial distribution will also be normally distributed. Because our map is a weighted sum of the random variables $\lambda_i$, it makes sense to choose a normal initial distribution, since we know that a weighted sum of independent normally distributed random variables is normal.

In particular, if the $\lambda_i\sim N(\mu_i,\sigma_i^2)$ are independent, the push-forward of the initial distribution is:

\begin{align}
q\sim N\left(\sum_{i=1}^Ma_i\mu_i\ ,\ \sum_{i=1}^M{a_i^2\sigma_i^2}\right)
\end{align}

For simplicity, call these new mean and variance terms, $\mu_q$ and $\sigma_q^2$.
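
A quick Monte Carlo check of these two expressions. This is a sketch with assumed values for $a$, $\mu_i$, and $\sigma_i^2$ (chosen only for illustration) and a fixed random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) map weights, initial means, and initial variances.
a = np.array([1.0, 2.0, -1.0])
mu = np.array([0.0, 1.0, 2.0])
var = np.array([1.0, 0.25, 4.0])

# Closed-form push-forward mean and variance.
mu_q = a @ mu
var_q = (a**2) @ var

# Draw independent normal components and push each sample through Q(lam) = a^T lam.
samples = rng.normal(mu, np.sqrt(var), size=(200_000, 3))
q = samples @ a

print(mu_q, q.mean())
print(var_q, q.var())
```

The sample mean and variance should match $\mu_q$ and $\sigma_q^2$ up to Monte Carlo error.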

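Now let us consider the MUD point of our updated distribution. Let $D=\mathrm{diag}(\sigma_1^2,\ldots,\sigma_M^2)$ denote the covariance matrix of the initial distribution and $\mu=(\mu_1,\ldots,\mu_M)^T$ its mean. Because the logarithm is monotone, the maximizer of $\pi^{up}$ is also the maximizer of $\log\pi^{up}$. Taking logarithms, dropping additive constants that do not depend on $\lambda$ as they appear, and rescaling by the common factor of $\frac{1}{2}$ from the Gaussian exponents, the problem becomes: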

\begin{align}
\max_{\lambda}{\pi^{up}(\lambda)}&=\max_{\lambda}\left\{\log\left(\pi^{init}(\lambda)\right)+\log\left(\pi^{data}(q)\right)-\log\left(\pi^{pf}(q)\right)\right\} \\[2ex]
%
&=\max_{\lambda}\left\{-(\lambda-\mu)^TD^{-1}(\lambda-\mu) - \frac{1}{\sigma_\nu^2}(q-\nu)^2+\frac{1}{\sigma_q^2}(q-\mu_q)^2\right\} \\[2ex]
%
&=\min_{\lambda}\left\{\left(\lambda^TD^{-1}\lambda-\mu^T D^{-1}\lambda-\lambda^T D^{-1}\mu +\mu^TD^{-1}\mu \right) + \frac{1}{\sigma_\nu^2}(q-\nu)^2-\frac{1}{\sigma_q^2}(q-\mu_q)^2\right\} \\[2ex]
%
&=\min_{\lambda}\left\{\left(\lambda^TD^{-1}\lambda-2\mu^T D^{-1}\lambda \right) + \frac{1}{\sigma_\nu^2}(q-\nu)^2-\frac{1}{\sigma_q^2}(q-\mu_q)^2\right\} \\[2ex]
%
&=\min_{\lambda}\left\{\left(\lambda^TD^{-1}\lambda-2\mu^T D^{-1}\lambda \right) + \left(\frac{1}{\sigma_\nu^2}-\frac{1}{\sigma_q^2}\right)q^2-2\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)q+\left(\frac{\nu^2}{\sigma_\nu^2}-\frac{\mu_q^2}{\sigma_q^2}\right) \right\} \\[2ex]
%
&=\min_{\lambda}\left\{\left(\lambda^TD^{-1}\lambda-2\mu^T D^{-1}\lambda \right) + \left(\frac{1}{\sigma_\nu^2}-\frac{1}{\sigma_q^2}\right)(a^T\lambda)^2-2\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)(a^T\lambda) \right\} \\[2ex]
%
&=\min_{\lambda}\left\{ \lambda^TD^{-1}\lambda+ \left(\frac{1}{\sigma_\nu^2}-\frac{1}{\sigma_q^2}\right)(a^T\lambda)(a^T\lambda)-2\mu^T D^{-1}\lambda  -2\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)(a^T\lambda) \right\} \\[2ex]
%
&=\min_{\lambda}\left\{ \lambda^TD^{-1}\lambda+ (\lambda^Ta)\left(\frac{1}{\sigma_\nu^2}-\frac{1}{\sigma_q^2}\right)(a^T\lambda)-2\mu^T D^{-1}\lambda  -2\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)(a^T\lambda) \right\} \\[2ex]
%
&=\min_{\lambda}\left\{ \lambda^T\left(D^{-1}+\left(\frac{1}{\sigma_\nu^2}-\frac{1}{\sigma_q^2}\right)aa^T\right)\lambda-2\left[\mu^T D^{-1}+\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)a^T\right]\lambda \right\} \\[2ex]
%
&=\min_{\lambda}\left\{ \lambda^T\left(D^{-1}+\frac{1}{\sigma_\nu^2}\left(1-\frac{\sigma_\nu^2}{\sigma_q^2}\right)aa^T\right)\lambda-2\left[\mu^T D^{-1}+\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)a^T\right]\lambda \right\} \\[2ex]
\end{align}

Taking the derivative with respect to $\lambda$ and setting it to zero, we get the following:

\begin{align}
0&=2\lambda^T\left(D^{-1}+\frac{1}{\sigma_\nu^2}\left(1-\frac{\sigma_\nu^2}{\sigma_q^2}\right)aa^T\right)-2\left[\mu^T D^{-1}+\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)a^T\right] \\[2ex]
\lambda^T\left(D^{-1}+\frac{1}{\sigma_\nu^2}\left(1-\frac{\sigma_\nu^2}{\sigma_q^2}\right)aa^T\right)&= \left[\mu^T D^{-1}+\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)a^T\right] \\[2ex]
\lambda^T&= \left[\mu^T D^{-1}+\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)a^T\right]\left(D^{-1}+\frac{1}{\sigma_\nu^2}\left(1-\frac{\sigma_\nu^2}{\sigma_q^2}\right)aa^T\right)^{-1} \\[2ex]
\lambda&= \left[\mu^T D^{-1}+\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)a^T\right]^T\left(\left(D^{-1}+\frac{1}{\sigma_\nu^2}\left(1-\frac{\sigma_\nu^2}{\sigma_q^2}\right)aa^T\right)^{-1}\right)^T \\[2ex]
\lambda&= \left(\left(D^{-1}+\frac{1}{\sigma_\nu^2}\left(1-\frac{\sigma_\nu^2}{\sigma_q^2}\right)aa^T\right)^{-1}\right)^T\left[(D^{-1})^T\mu+\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)a\right] \\[2ex]
\end{align}

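Let $A$ denote the matrix $aa^T$ (not the scalar $A$ from the univariate case), and let $c=\frac{1}{\sigma_\nu^2}\left(1-\frac{\sigma_\nu^2}{\sigma_q^2}\right)$. The matrix $A$ is symmetric. Since $D^{-1}$ is a diagonal matrix, $D^{-1} + cA$ is symmetric, so its inverse is also symmetric (if its inverse exists). Thus we have: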

\begin{align}
\lambda&= \left(D^{-1}+\frac{1}{\sigma_\nu^2}\left(1-\frac{\sigma_\nu^2}{\sigma_q^2}\right)A\right)^{-1}\left[D^{-1}\mu+\left(\frac{\nu}{\sigma_\nu^2}-\frac{\mu_q}{\sigma_q^2}\right)a\right] \\[2ex]
\end{align}
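
As a sanity check on the derivation, the stationary point can be compared against the objective it is supposed to minimize. The sketch below uses assumed values for $a$, $\mu$, $\sigma_i^2$, $\nu$, and $\sigma_\nu^2$; `objective` is a helper written for this example that evaluates the quadratic from the second line of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative (assumed) problem data.
a = np.array([1.0, 2.0, -1.0])       # linear map Q(lam) = a^T lam
mu = np.array([0.0, 1.0, 2.0])       # initial means
var = np.array([1.0, 0.25, 4.0])     # initial variances sigma_i^2
nu, var_nu = 1.5, 0.5                # observed mean and variance

D_inv = np.diag(1.0 / var)
mu_q, var_q = a @ mu, (a**2) @ var   # push-forward mean and variance

def objective(lam):
    """Negative log updated density, up to a factor of 2 and additive constants."""
    q = a @ lam
    return ((lam - mu) @ D_inv @ (lam - mu)
            + (q - nu) ** 2 / var_nu
            - (q - mu_q) ** 2 / var_q)

# Stationary point from the closed-form linear system.
lhs = D_inv + (1.0 / var_nu - 1.0 / var_q) * np.outer(a, a)
rhs = D_inv @ mu + (nu / var_nu - mu_q / var_q) * a
lam_star = np.linalg.solve(lhs, rhs)

# Every random perturbation should give a strictly larger objective value.
perturbed = lam_star + 0.1 * rng.standard_normal((1000, 3))
is_minimum = all(objective(p) > objective(lam_star) for p in perturbed)
print(lam_star, is_minimum)
```

A perturbation test like this only probes a neighborhood of the solution, but since the objective is quadratic, a local minimum is also the global one.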

Let us consider the matrix being inverted. Written out, it looks like:

\begin{align}
\begin{pmatrix}
\frac{1}{\sigma_1^2} & 0 & \ldots & 0\\
0 & \frac{1}{\sigma_2^2} & \ddots & 0\\
\vdots & \ddots & \ddots & \vdots \\
0 & \ldots & \ldots &  \frac{1}{\sigma_M^2} \\
\end{pmatrix}+c\begin{pmatrix}
a_1^2 & a_1a_2 & \ldots & a_1a_M\\
a_2a_1 & a_2^2 & \ddots & a_2a_M\\
\vdots & \ddots & \ddots & \vdots \\
a_Ma_1 & \ldots & \ldots &  a_M^2 \\
\end{pmatrix}
\end{align}

Multiplying both sides by this matrix and reading the resulting linear system row by row, each component $\lambda_i$ of the MUD point should satisfy the following equation (assuming $a_i\neq 0$):

\begin{align}
\lambda_i&=\mu_i+\dfrac{a_i^2\sigma_i^2}{\sigma_\nu^2}\cdot\dfrac{1}{a_i}\left[\nu-\sum_{j=1}^Ma_j\lambda_j\right]-\dfrac{a_i^2\sigma_i^2}{\sigma_q^2}\cdot\dfrac{1}{a_i}\left[\mu_q-\sum_{j=1}^Ma_j\lambda_j\right]
\end{align}

First note that in this form $\lambda_i$ appears on both sides of the equation, so it is not an explicit solution. Writing it this way does make the expression slightly more interpretable: each $\lambda_i$ is the initial mean $\mu_i$ plus two corrections. Since $\sum_{j}a_j\lambda_j=Q(\lambda)$ and $\mu_q=Q(\mu)$, the same equation reads:

\begin{align}
\lambda_i&=\mu_i+\dfrac{a_i^2\sigma_i^2}{\sigma_\nu^2}\cdot\dfrac{1}{a_i}\left[\nu-Q(\lambda)\right]-\dfrac{a_i^2\sigma_i^2}{\sigma_q^2}\cdot\dfrac{1}{a_i}\left[Q(\mu)-Q(\lambda)\right]
\end{align}


Let's look at a concrete example to help parse this expression. Suppose that $\vec{a}=(3,-2)$ and $\lambda_i\sim N(1,1)$ for $i=1,2$. We can compute the push-forward of the initial distribution through this map, which will be $q\sim N(1,13)$.

Suppose the observed data is distributed as $q\sim N(2,3)$, that is, $\nu=2$ and $\sigma_\nu^2=3$. Let us consider what this means for the MUD point $\lambda$.

In [16]:
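# set up the example; the matrix algebra is checked numerically below
linear_map = np.array([3,-2])  # the vector a
# initial means mu_i and initial variances sigma_i^2 (sigma_mu holds variances)
mu, sigma_mu = np.array([1,1]), np.array([1,1])
# push-forward mean mu_q and variance sigma_q^2
pf_mean, pf_var = np.dot(linear_map,mu), np.dot(linear_map**2,sigma_mu)
# observed mean nu and variance sigma_nu^2
obs_mean, obs_var = 2,3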

In [22]:
# inverse of the diagonal initial covariance matrix, D^-1
cov_inv = np.diag(1/sigma_mu)
# rank-one matrix A = a a^T
matrixA = np.outer(linear_map,linear_map)
print(matrixA)

[[ 9 -6]
 [-6  4]]


In [26]:
# left-hand side of matrix solve for MUD point
left_side = cov_inv + 1/obs_var*(1-obs_var/pf_var)*matrixA

# right-hand side of matrix solve for MUD point
right_side = np.dot(cov_inv,mu) + 1/obs_var*(obs_mean-obs_var/pf_var*pf_mean)*linear_map

In [27]:
lam_MUD = np.linalg.solve(left_side,right_side)

In [29]:
print(lam_MUD)
print(np.dot(linear_map,lam_MUD))

[ 1.23076923  0.84615385]
2.0


The key thing to notice here is that the new point in lambda space has been adjusted in such a way that the push-forward of the MUD point is consistent with the observed maximum density point.

The same matrix algebra can be carried out by hand:

\begin{align}
\lambda&=\left(\begin{pmatrix}
1 & 0 \\
0& 1
\end{pmatrix}+\dfrac{1}{\sigma_\nu^2}\left(1-\frac{\sigma_\nu^2}{\sigma_q^2}\right)\begin{pmatrix}
9 & -6 \\
-6 & 4 \\
\end{pmatrix}\right)^{-1}
\left[\begin{pmatrix}
\frac{\mu_1}{\sigma_1^2} \\
\frac{\mu_2}{\sigma_2^2}
\end{pmatrix}+\dfrac{1}{\sigma_\nu^2}\left(\nu-\frac{\sigma_\nu^2}{\sigma_q^2}\mu_q\right)\begin{pmatrix}
3 \\
-2
\end{pmatrix}\right] \\[2ex]
%
&=\left(\begin{pmatrix}
1 & 0 \\
0& 1
\end{pmatrix}+\dfrac{1}{3}\left(1-\frac{3}{13}\right)\begin{pmatrix}
9 & -6 \\
-6 & 4 \\
\end{pmatrix}\right)^{-1}
\left[\begin{pmatrix}
1 \\
1
\end{pmatrix}+\dfrac{1}{3}\left(2-\frac{3}{13}\cdot1\right)\begin{pmatrix}
3 \\
-2
\end{pmatrix}\right]\\[2ex]
%
&=\left(\left(\frac{1}{13}\right)\begin{pmatrix}
43 & -20 \\
-20 & \frac{79}{3} \\
\end{pmatrix}\right)^{-1}
\left[\dfrac{1}{13}\cdot\left(\begin{pmatrix}
13 \\
13
\end{pmatrix}+\left(\frac{23}{3}\right)\begin{pmatrix}
3 \\
-2
\end{pmatrix}\right)\right]\\[2ex]
%
&=\left(\begin{pmatrix}
43 & -20 \\
-20 & \frac{79}{3} \\
\end{pmatrix}\right)^{-1}
\left[\begin{pmatrix}
36 \\
-\frac{7}{3}
\end{pmatrix}\right]\\[2ex]
%
&=\dfrac{3}{43\cdot79-3\cdot20^2}\cdot\left(\begin{pmatrix}
\frac{79}{3} & 20 \\
20 & 43 \\
\end{pmatrix}\right)
\left[\begin{pmatrix}
36 \\
-\frac{7}{3}
\end{pmatrix}\right]\\[2ex]
%
&=\dfrac{3}{2197}\cdot\frac{1}{3}\cdot
\left[\begin{pmatrix}
79\cdot36-140 \\
3\cdot20\cdot36-43\cdot7
\end{pmatrix}\right]\\[2ex]
%
&=\dfrac{1}{2197}\cdot
\left[\begin{pmatrix}
2704 \\
1859
\end{pmatrix}\right]=\begin{pmatrix}
\frac{16}{13} \\
\frac{11}{13}
\end{pmatrix}\approx \begin{pmatrix}
1.23 \\
0.85
\end{pmatrix}
\end{align}

To summarize this calculation, the hand computation agrees with the numerical solve above. Compared to the initial mean $(1,1)$, the MUD point has shifted up by $\frac{3}{13}\approx0.23$ in the first coordinate and down by $\frac{2}{13}\approx0.15$ in the second coordinate.

Now let's look at what our derived formula for the MUD point means:

\begin{align}
\lambda_i&=\mu_i+\dfrac{a_i^2\sigma_i^2}{\sigma_\nu^2}\cdot\dfrac{1}{a_i}\left[\nu-Q(\lambda)\right]-\dfrac{a_i^2\sigma_i^2}{\sigma_q^2}\cdot\dfrac{1}{a_i}\left[Q(\mu)-Q(\lambda)\right]\\[2ex]
\end{align}



In [32]:
def f(lam_idx):
    # given a component index, evaluate the right-hand side of the formula above
    a_i, var_i = linear_map[lam_idx], sigma_mu[lam_idx]
    # Q(lambda) evaluated at the MUD point
    pf_mud = np.dot(linear_map,lam_MUD)
    # correction toward the observed mean nu
    adjust_one = a_i**2*var_i/obs_var*1/a_i*(obs_mean-pf_mud)
    # correction relative to the push-forward mean Q(mu)
    adjust_two = a_i**2*var_i/pf_var*1/a_i*(pf_mean-pf_mud)

    lam_out = mu[lam_idx]+adjust_one-adjust_two

    return lam_out, adjust_one, adjust_two
    

In [36]:
f(1)

(0.8461538461538467, 5.9211894646675012e-16, 0.15384615384615399)

Empirically, the middle term (`adjust_one`) is zero up to floating-point error. In other words, at the MUD point of the updated distribution, the difference between the observed data mean $\nu$ and the push-forward of the MUD point, $Q(\lambda)$, is zero.
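
This is not specific to the example. A short derivation, using only the componentwise formula above and the definitions $\mu_q=\sum_i a_i\mu_i$ and $\sigma_q^2=\sum_i a_i^2\sigma_i^2$, shows why: multiply the formula for $\lambda_i$ by $a_i$ and sum over $i$.

\begin{align}
Q(\lambda)=\sum_{i=1}^M a_i\lambda_i&=\mu_q+\dfrac{\sigma_q^2}{\sigma_\nu^2}\left[\nu-Q(\lambda)\right]-\dfrac{\sigma_q^2}{\sigma_q^2}\left[\mu_q-Q(\lambda)\right] \\[2ex]
&=Q(\lambda)+\dfrac{\sigma_q^2}{\sigma_\nu^2}\left[\nu-Q(\lambda)\right]
\end{align}

Hence $\nu-Q(\lambda)=0$ at the MUD point, so the middle term vanishes exactly. Substituting $Q(\lambda)=\nu$ and $Q(\mu)=\mu_q$ back into the formula gives an explicit expression:

\begin{align}
\lambda_i&=\mu_i+\dfrac{a_i\sigma_i^2}{\sigma_q^2}\left(\nu-\mu_q\right)
\end{align}

For the example above this gives $\lambda=(1,1)+\frac{2-1}{13}(3,-2)=\left(\frac{16}{13},\frac{11}{13}\right)\approx(1.2308,\ 0.8462)$, which matches the printed value of `lam_MUD`.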

So the adjustment to the initial mean comes from the last term, which compares the push-forward of the initial mean, $Q(\mu)$, with the push-forward of the MUD point, $Q(\lambda)$. The factor $a_i^2\sigma_i^2/\sigma_q^2$ is the share of the total push-forward variance contributed by the $i$-th component, and the factor $1/a_i$ converts the data-space difference $Q(\mu)-Q(\lambda)$ back into units of $\lambda_i$.

Another thought: each term is like a signed measure of the distance between things in the data space.

This makes me wonder if we could come up with an optimization scheme to find the MUD point that would generalize to higher dimensions. It would be a two-step process: first, find a $\lambda$ that decreases the distance between $Q(\lambda)$ and the observed maximum density point (or mean); second, use some variant of the formula above to adjust the proposed MUD point.
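
One possible reading of this two-step idea, restricted to the linear Gaussian example above, is sketched below. It is only an illustration under assumptions: the step size, the iteration count, and the variable names are arbitrary choices, and nothing here shows that the scheme carries over to nonlinear maps.

```python
import numpy as np

# Values from the concrete example above.
a = np.array([3.0, -2.0])      # linear map Q(lam) = a^T lam
mu = np.array([1.0, 1.0])      # initial means
var = np.array([1.0, 1.0])     # initial variances sigma_i^2
nu = 2.0                       # observed mean

mu_q, var_q = a @ mu, (a**2) @ var

# Step 1: gradient descent on (Q(lam) - nu)^2, starting from the initial mean.
lam = mu.copy()
step = 0.25 / (a @ a)          # assumed step size; halves the residual each iteration
for _ in range(100):
    lam = lam - step * 2.0 * (a @ lam - nu) * a

# Step 2: apply the componentwise formula. The middle term is dropped
# because Q(lam) = nu after step 1, leaving only the last correction.
lam_two_step = mu - (a * var / var_q) * (mu_q - a @ lam)

print(lam, lam_two_step)
```

Step 1 alone lands on a point with $Q(\lambda)=\nu$, but generally not on the MUD point, because it moves along $a$ without accounting for the initial variances. Step 2 supplies that weighting.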