# Sampling neural network
Here we will try to develop a sampling neural network.

We have a Multi-Layer Perceptron (MLP) with parameters $W$. We use the notation of [1](http://papers.nips.cc/paper/5269-expectation-backpropagation-parameter-free-training-of-multilayer-neural-networks-with-continuous-or-discrete-weights), and refer to the weight from unit $i$ in layer $l-1$ to unit $j$ in layer $l$ as $W_{ijl}$.

Our MLP defines a deterministic function $f(X, W)$, where $X$ is an array of input data of shape $(N_{samples}, N_{dims})$ and $W$ is a list of weight matrices, with $W_l$ of shape $(N_{units(l-1)}, N_{units(l)})$ (this may later be generalized to conv-nets, etc.).

For 1-of-K classification, we can say that the network has the form

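$$p(Y|X, W) = \prod_{n=1}^{N_{samples}}\mathrm{Categorical}\big(f(X_n, W)\big)(Y_n)$$

where $f(X_n, W)$ is restricted to output a discrete distribution over $\{1, \dots, K\}$, and $\mathrm{Categorical}(p)(y)$ denotes the probability $p_y$ that the distribution $p$ assigns to class $y$.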
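A minimal sketch of such a function $f(X, W)$ in NumPy may make the shapes concrete. The text does not specify the hidden nonlinearity, the use of biases, or the number of classes, so the `tanh` units, the bias-free layers, the softmax output, and `n_classes = 3` below are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(X, W):
    """f(X, W): X has shape (n_samples, n_dims); W is a list of weight matrices.

    Assumption: tanh hidden units and no biases (not specified in the text).
    Returns an (n_samples, n_classes) array whose rows are class distributions.
    """
    h = X
    for W_l in W[:-1]:
        h = np.tanh(h @ W_l)
    return softmax(h @ W[-1])

rng = np.random.default_rng(0)
n_samples, n_dims, n_classes = 10, 50, 3   # n_classes is illustrative
sizes = [n_dims, 100, 100, n_classes]
# W_l has shape (n_units(l-1), n_units(l)), as in the text.
W = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
X = rng.normal(size=(n_samples, n_dims))
P = forward(X, W)   # P[n, y] plays the role of f(X_n, W)_y
```

The softmax output layer is one simple way to guarantee that each row of `P` is a valid distribution over the classes.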

Applying Bayes' rule, and assuming the prior over weights does not depend on the inputs, so that $p(W|X)=p(W)$,
$$
\begin{align}
p(W|X,Y)&=\frac{p(Y|X,W)p(W)}{p(Y|X)} \\
&\propto p(W)p(Y|X,W) \\
&=p(W)\prod_{n=1}^{N_{samples}}\mathrm{Categorical}\big(f(X_n, W)\big)(Y_n) \\
&=p(W)\prod_{n=1}^{N_{samples}}f(X_n, W)_{Y_n} \\
&\equiv L(W|X,Y)
\end{align}
$$

For numerical reasons, we work in terms of the log-likelihood.

$$
\begin{align}
logL(W|X,Y)&\equiv \log\big(L(W|X,Y)\big)\\
&=\log p(W)+\sum_{n=1}^{N_{samples}}\log f(X_n,W)_{Y_n}
\end{align}
$$
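The log form matters in practice because a product of many probabilities can underflow to zero, while a sum of log-probabilities stays representable. The sketch below evaluates $logL(W|X,Y)$ under assumptions not fixed by the text: `tanh` hidden units, no biases, a softmax output computed directly in log space, and a caller-supplied `log_prior` function (here a flat prior, purely for illustration).

```python
import numpy as np

def log_f(X, W):
    """Log class probabilities, shape (n_samples, n_classes).

    Assumption: tanh hidden units, no biases, softmax output layer.
    """
    h = X
    for W_l in W[:-1]:
        h = np.tanh(h @ W_l)
    z = h @ W[-1]
    z = z - z.max(axis=1, keepdims=True)               # stabilise the exponentials
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def log_L(W, X, Y, log_prior):
    """log p(W) + sum_n log f(X_n, W)_{Y_n}; Y holds integer labels in 0..K-1."""
    log_probs = log_f(X, W)
    return log_prior(W) + log_probs[np.arange(len(Y)), Y].sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
Y = rng.integers(0, 3, size=10)
W = [np.zeros((50, 100)), np.zeros((100, 3))]          # all-zero weights for a sanity check
flat_log_prior = lambda W: 0.0                          # hypothetical flat prior
value = log_L(W, X, Y, flat_log_prior)
```

With all-zero weights every class gets probability $1/3$, so `value` should equal $-10\log 3$, which is a convenient check on the indexing.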


Let's make a few assumptions to simplify things.
- Each weight $W_{ijl}$ has an independent prior, so $p(W)=\prod_{ijl}p(W_{ijl})$.
- Weights are discrete: each weight takes one of $K$ possible values. (Here $K$ counts the allowed weight values and need not equal the number of classes above.)
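Under these two assumptions the log-prior is just a sum of per-weight table lookups. The following sketch assumes, for illustration only, three allowed values $\{-1, 0, 1\}$ and a single prior table shared by every weight; neither choice comes from the text.

```python
import numpy as np

# Hypothetical choices: the allowed values c_k and their prior probabilities.
values = np.array([-1.0, 0.0, 1.0])     # must be sorted for searchsorted below
prior = np.array([0.25, 0.5, 0.25])     # p(W_ijl = c_k), shared by all weights

def log_prior(W):
    """log p(W) = sum over all weights of log p(W_ijl)."""
    total = 0.0
    for W_l in W:
        k = np.searchsorted(values, W_l)  # index k of each weight's value c_k
        total += np.log(prior[k]).sum()
    return total

W = [np.zeros((2, 3)), np.ones((3, 2))]
lp = log_prior(W)                        # 6 weights at 0 and 6 weights at 1
```

Storing the weights as integer indices into `values` would avoid the lookup entirely; the float representation is kept here so the matrices can be used directly in a forward pass.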

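Now, we want to find $W$ that maximizes $p(W|X,Y)$. We can use Gibbs sampling, resampling one weight at a time from its conditional distribution given all the others. For a given weight index $\alpha = (i,j,l)$ and a set of possible weight values $c_k, k \in 1 \dots K$, write $W_{\neg\alpha}$ for all weights other than $W_{\alpha}$, and $W_{w_{\alpha}=c_k}$ for the full set of weights with $W_{\alpha}$ set to $c_k$. Then

$$
\begin{align}
p(W_{\alpha}=c_k|W_{\neg\alpha}, X, Y) &= \frac{L(W_{w_{\alpha}=c_k}|X, Y)}{\sum_{k'=1}^{K} L(W_{w_{\alpha}=c_{k'}}|X, Y)} \\
&= \mathrm{softmax}\big(\big[logL(W_{w_{\alpha}=c_{k'}}|X, Y), k' \in 1 \dots K\big]\big)_k \\
&= \mathrm{softmax}\Big(\Big[\log p(W_{w_{\alpha}=c_{k'}})+\sum_{n=1}^{N_{samples}}\log f(X_n,W_{w_{\alpha}=c_{k'}})_{Y_n}, k' \in 1 \dots K\Big]\Big)_k
\end{align}
$$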

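Because the prior factorizes over weights, the terms $\log p(W_{\beta})$ for $\beta \neq \alpha$ are the same for every $k$ and cancel in the softmax. So, we have a Gibbs sampling update:
$$
W_{\alpha} \sim \mathrm{Categorical} \Big(\mathrm{softmax}\big(\big[\log p(W_{\alpha}=c_k)+\sum_{n=1}^{N_{samples}}\log f(X_n,W_{w_{\alpha}=c_k})_{Y_n},k \in 1 \dots K\big]\big)\Big)
$$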
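One way this update could look in code is sketched below. It is a naive version that runs a full forward pass for each candidate value, and it rests on several assumptions not stated in the text: `tanh` hidden units, no biases, a softmax output, three allowed weight values, and a shared per-weight prior table `log_prior_k`. The layer index `l` is a 0-based index into the list `W`.

```python
import numpy as np

def log_f(X, W):
    """Log class probabilities (assumed: tanh hidden units, softmax output)."""
    h = X
    for W_l in W[:-1]:
        h = np.tanh(h @ W_l)
    z = h @ W[-1]
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def gibbs_update(W, alpha, values, log_prior_k, X, Y, rng):
    """Resample W[l][i, j] in place from its conditional given all other weights."""
    i, j, l = alpha
    rows = np.arange(len(Y))
    scores = np.empty(len(values))
    for k, c_k in enumerate(values):
        W[l][i, j] = c_k                                  # try W_alpha = c_k
        scores[k] = log_prior_k[k] + log_f(X, W)[rows, Y].sum()
    probs = np.exp(scores - scores.max())                 # softmax over the K scores
    probs /= probs.sum()
    k_new = rng.choice(len(values), p=probs)              # draw from the Categorical
    W[l][i, j] = values[k_new]
    return probs

rng = np.random.default_rng(0)
values = np.array([-1.0, 0.0, 1.0])                       # hypothetical c_k
log_prior_k = np.log(np.array([0.25, 0.5, 0.25]))         # hypothetical log p(W_alpha = c_k)
sizes = [50, 100, 100, 3]
W = [rng.choice(values, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
X = rng.normal(size=(10, 50))
Y = rng.integers(0, 3, size=10)
probs = gibbs_update(W, (0, 0, 2), values, log_prior_k, X, Y, rng)
```

Each call costs $K$ forward passes over the whole data set, so a full sweep over all weights is expensive in this form; caching the activations of layers below $l$ would be a natural optimization.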

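```python
n_samples = 10  # number of training examples, N_samples
n_dims = 50  # input dimensionality, N_dims
hidden_sizes = [100, 100]  # number of units in each hidden layer
```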