# Demonstration on 6x6 images

The code for this demo is in [toysamples1.py](toysamples1.py). I moved it out of the notebook, for now, because displaying the output images here was eating up too much web browser memory.  It might move back at some point: now that the images are saved to files, I could simply run that code from here, and then display just the images I want, as I do now.

I'll paste some of the images output from [toysamples1.py](toysamples1.py) here.  They might or might not be up to date with the real output of the code.  They also might or might not be all generated from the same set of data samples.

*Note that the sampler is broken for now, in the sense that it gives nonsense samples, as you can see below.*

### Example X, generated input data

Here is an example of the samples drawn from the initial generative process:

![](img/toysamples1/samples.png)

These are not at all noisy, because I can't even get the drawn samples to be reasonable on un-noisy data yet :-D  I'll make it noisier once it's working on un-noisy data.

### Corresponding input Z features, ie ground truth Z

![](img/toysamples1/samples_Z.png)

The Z values look ok-ish.  Note that these are the Z values used to generate the input data.  They're not the results of sampling, which are further down below.

### Expectation of A, based on ground truth Z

![](img/A_means.png)

This suggests that the method for calculating the expectation of A works for at least the ground truth data.

### The matrix $(\mathbf{Z}^T\mathbf{Z} + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I})^{-1}\mathbf{Z}^T$, using ground truth Z

![](img/toysamples1/A_from_ground_truth_Z.png_ZTZIinvZT.png)

We can see that this takes a bit of every data example, giving more emphasis to features in examples that contain only one feature, and less emphasis to features in examples that have multiple features.  In any case, it draws from all data points.
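To make that weighting pattern concrete, here is a minimal sketch on a made-up 4-example, 2-feature `Z` (not the demo's real data). The values of `sigma_X` and `sigma_A` are assumptions picked for illustration.

```python
import numpy as np

# Hypothetical toy Z: 4 examples, 2 features (not the demo's real Z).
Z = np.array([
    [1, 0],   # example 0: only feature 0
    [0, 1],   # example 1: only feature 1
    [1, 1],   # example 2: both features
    [1, 1],   # example 3: both features
], dtype=float)

sigma_X, sigma_A = 0.5, 1.0   # assumed noise and prior scales
K = Z.shape[1]

M = Z.T @ Z + (sigma_X ** 2 / sigma_A ** 2) * np.eye(K)
# W = M^{-1} Z^T, obtained by solving M W = Z^T.
W = np.linalg.solve(M, Z.T)   # shape (K, N): one row of weights per feature

print(np.round(W, 3))
```

In row 0 of `W` (the weights for feature 0), the example containing only feature 0 gets a larger positive weight than the examples containing both features, and the example containing only feature 1 gets a negative weight. Every example contributes something.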

### Sampled A, from Gibbs iteration 500

![](img/toysamples1/A_draws_it500.png)

At least it doesn't have zero features, or tons of them, but the number of features is not quite right.  The features themselves are also not very right...

### The matrix $(\mathbf{Z}^T\mathbf{Z} + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I})^{-1}\mathbf{Z}^T$, from Gibbs iteration 500

![](img/toysamples1/A_draws_it500.png_ZTZIinvZT.png)

Unlike the equivalent matrix for the ground truth Z values, almost all data points are ignored.  Only two data points are used, with one feature from each.  This is ... odd, and probably gives some insight into what the bug is currently, whatever/wherever that is.


## Challenges/observations/ideas

### Challenge: sampling from non-normalized distribution?

For sampling from the posterior, per the tutorial, we should use the tutorial's equation 22.  But there's a proportional-to sign, $\propto$.  How to handle this?  Since I'm not trying to do this research from scratch, just reproduce/understand the existing research, I reached out to Mr Google, to look for more explanations.  I found a video from Finale Doshi-Velez, and interluded out to her "accelerated sampling" presentation, slides and paper.  The interlude is at: [accelerated_gibbs_sampling.ipynb](accelerated_gibbs_sampling.ipynb).  So, if you're trying to follow along with my own thought processes, challenges, solutions, etc, you might want to go there now.  Or you can just continue, it's all good :-)

From reaching out to Doshi-Velez's presentation, it looks like we can sample from a distribution that is only given up to proportionality, as long as it has a well-defined shape that we know how to sample from, ie typically a Gaussian.  There are probably other ways of handling other distributions, but a Gaussian should be reasonably straightforward to sample from, if we can get the distribution in that form.  Let's reach right back to the first section, and see what distribution(s) we need to sample from.

Equation 22 from the Griffiths and Ghahramani tutorial states:

$$
P(z_{ik} \mid \mathbf{X}, \mathbf{Z}_{-(i,k)}, \sigma_X, \sigma_A)
\propto
p(\mathbf{X} \mid \mathbf{Z}, \sigma_X, \sigma_A)
\,
P(z_{ik} \mid \mathbf{z}_{-i,k})
$$

The first term of this equation, ie the likelihood of $\mathbf{X}$ given the latent variables and the hyper-parameters, is a Gaussian.  For the finite model, $P(z_{ik} \mid \mathbf{z}_{-i,k})$ is given by equation 17 in the tutorial:

$$
P(z_{ik} = 1 \mid \mathbf{z}_{-i,k})
= \frac{m_{-i,k} + \frac{\alpha}{K}}
  {N + \frac{\alpha}{K}}
$$

This seems not to be a Gaussian.  How to sample from the product of a Gaussian and this term?
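For concreteness, here is a small sketch of the prior term from equation 17 as quoted above. The helper name `prior_zik_is_one`, the toy `Z` and the value of `alpha` are all made up for illustration. It returns a single number, which together with its complement gives a two-point distribution over $z_{ik} \in \{0, 1\}$.

```python
import numpy as np

def prior_zik_is_one(Z, i, k, alpha):
    """P(z_ik = 1 | z_{-i,k}) for the finite model, per equation 17 as quoted above."""
    N, K = Z.shape
    # m_{-i,k}: number of objects other than i that currently have feature k
    m_minus = Z[:, k].sum() - Z[i, k]
    return (m_minus + alpha / K) / (N + alpha / K)

# Hypothetical toy assignment matrix: 4 objects, 2 features.
Z = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 0]])
alpha = 2.0   # assumed value

p1 = prior_zik_is_one(Z, i=0, k=0, alpha=alpha)
prior = np.array([1.0 - p1, p1])   # probabilities for z_ik = 0 and z_ik = 1
print(prior)
```

With these numbers, $m_{-i,k} = 2$ and $\alpha / K = 1$, so the result is $(2 + 1) / (4 + 1) = 0.6$.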

### Interlude: is the product of two normalized distributions also normalized?

Brainstorming a bit, we could sample from the Gaussian, which we could normalize first, and then multiply by $P(z_{ik} = 1 \mid \mathbf{z}_{-i,k})$.  Is it fair to say that the product of two normalized probability functions will be normalized?  Probably not, eg we could have the following two distributions:

$$
f(x) =
\begin{cases}
1 & \mathrm{when\ } 0 \le x \le 1 \\
0 & \mathrm{otherwise}
\end{cases}
$$

(which integrates to 1), and:

$$
g(x) =
\begin{cases}
1 & \mathrm{when\ } 2 \le x \le 3 \\
0 & \mathrm{otherwise}
\end{cases}
$$

... which integrates to 1 too.  But their product integrates to 0.
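A quick numerical check of this counter-example, as a sketch using a plain Riemann sum on a fine grid (the grid range and resolution are arbitrary choices):

```python
import numpy as np

def f(x):
    # uniform density on [0, 1]
    return np.where((x >= 0) & (x <= 1), 1.0, 0.0)

def g(x):
    # uniform density on [2, 3]
    return np.where((x >= 2) & (x <= 3), 1.0, 0.0)

x = np.linspace(-1.0, 4.0, 500001)
dx = x[1] - x[0]

int_f = np.sum(f(x)) * dx           # approximately 1
int_g = np.sum(g(x)) * dx           # approximately 1
int_fg = np.sum(f(x) * g(x)) * dx   # 0, since the supports never overlap

print(int_f, int_g, int_fg)
```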

### Integrate the un-normalized distribution over $z_{ik}$?

Actually, the equation for the probability of $z_{ik} = 1$ is not itself a probability distribution: it's the value of this probability for one specific value of $z_{ik}$, ie $1$.  The probability of $z_{ik} = 0$ is then one minus this.

Let's try integrating $c \cdot p(\mathbf{X} \mid \mathbf{Z}, \sigma_X, \sigma_A) \cdot P(z_{ik} \mid \mathbf{z}_{-i,k})$ over $z_{ik}$, using the whole probability distribution of $z_{ik}$ rather than just one specific value, where $c$ is a normalization constant that makes the integrand integrate to $1$.  Writing $P(z_{ik})$ as shorthand for $P(z_{ik} \mid \mathbf{z}_{-i,k})$:

$$
\int
c
\cdot
P(\mathbf{X} \mid \mathbf{Z}, \sigma_X, \sigma_A)
\cdot
P(z_{ik})
\,
dz_{ik}
$$

Since $z_{ik}$ is discrete, ie $z_{ik} \in \{0, 1\}$, we can rewrite the integral as a sum:

$$
=
c
\sum_{z_{ik}=0}^1
\left(
    P(\mathbf{X} \mid \mathbf{Z}, \sigma_X, \sigma_A)
    \cdot
    P(z_{ik})
\right)
$$
&nbsp;

$$
=
c
\sum_{z_{ik}=0}^1
\left(
    \mathcal{N}(\mathbf{X}; \mu_{\mathbf{Z}, \sigma_A, \sigma_X}, \Sigma_{\mathbf{Z}, \sigma_A, \sigma_X})
    \cdot
    P(z_{ik})
\right)
$$
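
Requiring this sum to equal $1$ pins down $c$.  As a worked step, keeping the same shorthand $P(z_{ik})$ used in the sum above:

$$
c
=
\frac{1}{
\sum_{z_{ik}=0}^1
\mathcal{N}(\mathbf{X}; \mu_{\mathbf{Z}, \sigma_A, \sigma_X}, \Sigma_{\mathbf{Z}, \sigma_A, \sigma_X})
\cdot
P(z_{ik})
}
$$

That is, the normalized probability of each value of $z_{ik}$ is that value's term in the sum, divided by the total of both terms.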



So, it seems like maybe we can simply calculate the value of the Gaussian for each $z_{ik} \in \{0, 1\}$, multiply it by $P(z_{ik} \mid \mathbf{z}_{-i,k})$ each time, and then normalize these two products so that they sum to 1?  Just to imagine this a bit, let's say we have:

In [None]:
import numpy as np

# pretend Gaussian likelihood values p(X | Z) for z_ik = 0 and z_ik = 1; not normalized
p_X_given_Z = np.array([0.03, 0.02])
# prior P(z_ik | z_{-i,k}) for z_ik = 0 and z_ik = 1; normalized, sums to 1.0
p_zik_given_Z_minus = np.array([0.8, 0.2])

# Then the unnormalized posterior is the elementwise product
p_zik_given_X_Z = p_X_given_Z * p_zik_given_Z_minus
print(p_zik_given_X_Z)

# normalize, so that the two posterior probabilities sum to 1.0
p_zik_given_X_Z = p_zik_given_X_Z / np.sum(p_zik_given_X_Z)
print('normalized p_zik_given_X_Z', p_zik_given_X_Z)


So, the normalized values, with this toy data, are influenced by both the likelihood, and by the prior.

Let's run with this.
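
As a sketch of how one such update might look inside a Gibbs sweep: real likelihood values for 36-pixel images can be tiny, so this version works with log-likelihoods and subtracts the maximum before exponentiating. The function name `sample_zik` and its signature are hypothetical, and this is not the code in toysamples1.py.

```python
import numpy as np

def sample_zik(log_lik, prior_one, rng):
    """Draw z_ik in {0, 1}.

    log_lik:   log p(X | Z) evaluated with z_ik = 0 and with z_ik = 1
    prior_one: P(z_ik = 1 | z_{-i,k}), assumed strictly between 0 and 1
    """
    log_prior = np.log(np.array([1.0 - prior_one, prior_one]))
    log_post = np.asarray(log_lik, dtype=float) + log_prior  # unnormalized, log space
    log_post -= log_post.max()    # guard against underflow before exponentiating
    post = np.exp(log_post)
    post /= post.sum()            # now sums to 1
    z = int(rng.random() < post[1])
    return z, post

rng = np.random.default_rng(0)
# Same toy numbers as in the cell above, passed as logs.
z, post = sample_zik(np.log([0.03, 0.02]), prior_one=0.2, rng=rng)
print(z, post)
```

The normalized probabilities here match the toy calculation above; the only new steps are the log-space arithmetic and the final Bernoulli draw.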

### Solving for expected A

We have:

$$
\mathbb{E}[\mathbf{A}] = (\mathbf{Z}^T\mathbf{Z} + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I})^{-1}\mathbf{Z}^T\mathbf{X}
$$

It'd be good to avoid that inverse.  Can we avoid it using a solver?  The numpy solver, [numpy.linalg.solve](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html#numpy.linalg.solve), solves equations of the form:

$$
\mathbf{A} \mathbf{X} = \mathbf{B}
$$

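So, let's put the equation for $\mathbb{E}[\mathbf{A}]$ above into this form.  Note that the $\mathbf{A}$, $\mathbf{X}$ and $\mathbf{B}$ in the solver's form are generic names, not our model's $\mathbf{A}$ and $\mathbf{X}$: here the coefficient matrix is $\mathbf{Z}^T\mathbf{Z} + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}$, the unknown is $\mathbb{E}[\mathbf{A}]$, and the right-hand side is $\mathbf{Z}^T\mathbf{X}$: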

$$
(\mathbf{Z}^T\mathbf{Z} + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I})\mathbb{E}[\mathbf{A}] = \mathbf{Z}^T\mathbf{X}
$$

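So, using $\mathrm{\backslash}$ to denote "solve" (left division, so that $\mathbf{M} \mathrm{\,\backslash\,} \mathbf{B}$ is the solution $\mathbf{Y}$ of $\mathbf{M}\mathbf{Y} = \mathbf{B}$), we have:

$$
\mathbb{E}[\mathbf{A}] = (\mathbf{Z}^T\mathbf{Z} + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}) \mathrm{\,\backslash\,} \mathbf{Z}^T\mathbf{X}
$$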
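
A minimal sketch of this with `numpy.linalg.solve`, checked against the explicit inverse.  The sizes, the random data and the values of `sigma_X` and `sigma_A` are made up for illustration, and are not taken from toysamples1.py.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 20, 3, 36            # 36 = 6x6 pixels; sizes are illustrative
sigma_X, sigma_A = 0.5, 1.0    # assumed values

# Synthetic data from the linear-Gaussian model X = Z A + noise.
Z = rng.integers(0, 2, size=(N, K)).astype(float)
A_true = rng.normal(0.0, sigma_A, size=(K, D))
X = Z @ A_true + rng.normal(0.0, sigma_X, size=(N, D))

M = Z.T @ Z + (sigma_X ** 2 / sigma_A ** 2) * np.eye(K)

E_A_solve = np.linalg.solve(M, Z.T @ X)   # solves M E[A] = Z^T X
E_A_inv = np.linalg.inv(M) @ Z.T @ X      # explicit inverse, for comparison only

print(np.allclose(E_A_solve, E_A_inv))
```

The added $\frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}$ term keeps the coefficient matrix positive definite, so the solve is well-posed even when some column of `Z` is all zeros.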


### Can we use a solver for the exponential trace term?

The exponential trace term is:

$$
\mathbf{X}^T
\left(
    \mathbf{I}
    -
    \mathbf{Z}_+
    \left(
      \mathbf{Z}_+^T\mathbf{Z}_+ + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}_{K_+}
    \right)^{-1}
    \mathbf{Z}_+^T
\right)
\mathbf{X}
$$

Let's call this $\mathbf{R}$, so:

$$
\mathbf{R}
=
\mathbf{X}^T
\left(
    \mathbf{I}
    -
    \mathbf{Z}_+
    \left(
      \mathbf{Z}_+^T\mathbf{Z}_+ + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}_{K_+}
    \right)^{-1}
    \mathbf{Z}_+^T
\right)
\mathbf{X}
$$

So, formally, and pretending for a moment that $\mathbf{X}$ is invertible (in general it is not even square, being $N \times D$):

$$
(\mathbf{X}^T)^{-1}
\mathbf{R}
\mathbf{X}^{-1}
=
\mathbf{I}
-
\mathbf{Z}_+
\left(
  \mathbf{Z}_+^T\mathbf{Z}_+ + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}_{K_+}
\right)^{-1}
\mathbf{Z}_+^T
$$

So, similarly pretending that $\mathbf{Z}_+$ is invertible:
$$
\mathbf{Z}_+^{-1}
\left(
    \mathbf{I}
    -
    (\mathbf{X}^T)^{-1}
    \mathbf{R}
    \mathbf{X}^{-1}
\right)
(\mathbf{Z}_+^T)^{-1}
=
\left(
  \mathbf{Z}_+^T\mathbf{Z}_+ + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}_{K_+}
\right)^{-1}
$$

So:
$$
\left(
  \mathbf{Z}_+^T\mathbf{Z}_+ + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}_{K_+}
\right)
\mathbf{Z}_+^{-1}
\left(
    \mathbf{I}
    -
    (\mathbf{X}^T)^{-1}
    \mathbf{R}
    \mathbf{X}^{-1}
\right)
(\mathbf{Z}_+^T)^{-1}
=
\mathbf{I}
$$
&nbsp;

$$
\left(
  \mathbf{Z}_+^T\mathbf{Z}_+ + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}_{K_+}
\right)
\mathbf{Z}_+^{-1}
\left(
    \mathbf{I}
    -
    (\mathbf{X}^T)^{-1}
    \mathbf{R}
    \mathbf{X}^{-1}
\right)
=
\mathbf{Z}_+^T
$$

There seems to be no obvious way to get this down to a single inverse that we could use a solver against.

So, what if we just try to get one level down for now, ie solve for $\mathbf{S}$, where:

$$
\mathbf{S}
= \mathbf{I} - \mathbf{Z}_+
\left(
  \mathbf{Z}_+^T\mathbf{Z}_+ + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}_{K_+}
\right)^{-1}
\mathbf{Z}_+^T
$$
&nbsp;

$$
\mathbf{I} - \mathbf{S}
= \mathbf{Z}_+
\left(
  \mathbf{Z}_+^T\mathbf{Z}_+ + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}_{K_+}
\right)^{-1}
\mathbf{Z}_+^T
$$

There seems to be no way forward by rearranging, even at this level?  We'd have to 'solve' the inner matrix, ie $\left(
  \mathbf{Z}_+^T\mathbf{Z}_+ + \frac{\sigma_X^2}{\sigma_A^2}\mathbf{I}_{K_+}
\right)$, against $\mathbf{I}$, which is just computing the inverse...  Not a biggie, just an optimization that rearranging does not obviously give us.
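
That said, here is one possible route, offered as a sketch rather than as something the tutorial prescribes.  In $\mathbf{R}$ the inner inverse never stands alone: it is always followed by $\mathbf{Z}_+^T\mathbf{X}$.  So we can solve against $\mathbf{Z}_+^T\mathbf{X}$ instead of against $\mathbf{I}$, the same way as for $\mathbb{E}[\mathbf{A}]$, and then form $\mathbf{R} = \mathbf{X}^T\mathbf{X} - (\mathbf{Z}_+^T\mathbf{X})^T \mathbf{Y}$, where $\mathbf{Y}$ is the solution.  The sizes and random data below are made up, and `Z` stands in for $\mathbf{Z}_+$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, D = 20, 3, 36            # illustrative sizes
sigma_X, sigma_A = 0.5, 1.0    # assumed values

Z = rng.integers(0, 2, size=(N, K)).astype(float)   # plays the role of Z_+
X = rng.normal(size=(N, D))

M = Z.T @ Z + (sigma_X ** 2 / sigma_A ** 2) * np.eye(K)
ZtX = Z.T @ X

# Y = M^{-1} Z^T X, obtained by solving M Y = Z^T X (no explicit inverse).
Y = np.linalg.solve(M, ZtX)
R_solve = X.T @ X - ZtX.T @ Y

# Direct evaluation with the explicit inverse, for comparison only.
R_inv = X.T @ (np.eye(N) - Z @ np.linalg.inv(M) @ Z.T) @ X

print(np.allclose(R_solve, R_inv))
print(np.trace(R_solve))
```

This also avoids building the $N \times N$ matrix inside the brackets, which may matter more than the inverse itself once $N$ gets large.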