#### Loading and visualizing MNIST

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
import time
%matplotlib inline

In [None]:
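import os
from sklearn.datasets import fetch_openml

# Fetch MNIST dataset and create a local copy.
# fetch_mldata has been removed from scikit-learn; fetch_openml serves the same data.
if os.path.exists('./data/mnist.npz'):
    with np.load('./data/mnist.npz', 'r') as data:
        X = data['X']
        y = data['y']
else:
    mnist = fetch_openml('mnist_784', version=1, data_home='./data/', as_frame=False)
    # OpenML stores the labels as strings, hence the conversion to integers.
    X, y = mnist.data / 255.0, mnist.target.astype(int)
    np.savez('./data/mnist.npz', X=X, y=y)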

In [None]:
# Define helper functions to visualize digits
def hstack_images(images):
    # np.hstack needs a sequence (a list), not a generator
    return np.hstack([images[i,:].reshape((28,28)) for i in range(len(images))])
# Arrange images as an n_row x n_col grid (10x10 by default).
def arrange_images(images, n_row=10, n_col=10):
    return np.vstack([hstack_images(images[i:i+n_col, :]) for i in range(0, n_row * n_col, n_col)])

indices = np.arange(X.shape[0])
np.random.shuffle(indices)
plt.imshow(arrange_images(X[indices]), cmap=plt.cm.gray, interpolation='nearest')
plt.show()

In [None]:
n_train_samples = 5000

indices = np.arange(X.shape[0])
np.random.shuffle(indices)

train_indices = indices[: n_train_samples]
X_train = X[train_indices, :]
y_train = y[train_indices]

del X
del y

#### Nonparametric approach

In [None]:
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

In [None]:
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)

In [None]:
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(X_train_pca)

print("best bandwidth: {0}".format(grid.best_estimator_.bandwidth))

In [None]:
params

In [None]:
kde = grid.best_estimator_
new_data = kde.sample(100)

In [None]:
new_data = pca.inverse_transform(new_data)
print(new_data.shape)

In [None]:
plt.figure( figsize = ( 9, 9 ) )
plt.imshow(arrange_images(new_data), cmap=plt.cm.gray, interpolation='nearest')
plt.show()

# _EM_ and MNIST

### A brief description of the **EM** algorithm

The EM algorithm seeks to maximize the likelihood by means of successive application of two steps: the E-step and the M-step.

For any probability measure $Q$ on the space of latent variables $Z$ with density $q$ the following holds:  
\begin{align*}
\log p(X|\Theta)
    &= \int q(Z) \log p(X|\Theta) dZ
     = \mathbb{E}_q \log p(X|\Theta) \\
   %% &= \Bigl[p(X,Z|\Theta) = p(Z|X,\Theta) p(X|\Theta) \Bigr] \\
    &= \mathbb{E}_{Z\sim q} \log \frac{p(X,Z|\Theta)}{p(Z|X,\Theta)}
     = \mathbb{E}_{Z\sim q} \log \frac{q(Z)}{p(Z|X,\Theta)}
     + \mathbb{E}_{Z\sim q} \log \frac{p(X,Z|\Theta)}{q(Z)} \\ 
    &= KL\bigl(q\|p(\cdot|X,\Theta)\bigr) + \mathcal{L}\bigl(q, \Theta\bigr)
\end{align*}  

since Bayes' theorem posits that $p(X,Z|\Theta) = p(Z|X,\Theta) p(X|\Theta)$. Call this equation the **"master equation"**.

Now note that since the Kullback-Leibler divergence is always non-negative, one has the following inequality:
$$\log p(X|\Theta) \geq \mathcal{L}\bigl(q, \Theta\bigr)$$
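
The decomposition and the resulting bound can be checked numerically on a toy model. The sketch below assumes a single observation $X$ and a latent variable $Z$ with three values; the joint probabilities `p_xz` and the distribution `q` are made-up numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical toy model: p_xz[z] = p(X, Z=z | Theta) for one fixed observation X.
p_xz = np.array([0.10, 0.25, 0.05])
p_x = p_xz.sum()                 # p(X | Theta), with Z marginalized out
posterior = p_xz / p_x           # p(Z | X, Theta)

# An arbitrary distribution q over the latent variable.
q = np.array([0.5, 0.3, 0.2])

kl = np.sum(q * np.log(q / posterior))      # KL(q || p(.|X, Theta))
lower_bound = np.sum(q * np.log(p_xz / q))  # L(q, Theta)

# The bound evaluated at q = posterior, where the KL term vanishes.
tight_bound = np.sum(posterior * np.log(p_xz / posterior))

print(np.log(p_x), kl + lower_bound)        # the two numbers agree
print(lower_bound <= np.log(p_x), np.isclose(tight_bound, np.log(p_x)))
```

Any other valid choice of `q` should give the same sum `kl + lower_bound`, with only the split between the two terms changing.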

Let's try to make the lower bound as large as possible by varying both $\Theta$ and $q$. First note that the left-hand side of the **master equation** does not depend on $q$, whence maximizing $\mathcal{L}$ with respect to $q$ (with $\Theta$ fixed) is equivalent to minimizing $KL\bigl(q\|p(\cdot|X,\Theta)\bigr)$ with respect to $q$. Since $q$ is arbitrary and the divergence vanishes only when its two arguments coincide, the minimizer $q^*_\Theta$ is $q^*(Z|\Theta) = p(Z|X,\Theta)$ for all $Z$.

Now at the optimal distribution $q^*_\Theta$ the **master equation** becomes
$$ \log p(X|\Theta)
= \mathcal{L}\bigl(q^*_\Theta, \Theta\bigr)
= \mathbb{E}_{Z\sim q^*_\Theta} \log \frac{p(X,Z|\Theta)}{q^*(Z|\Theta)}
= \mathbb{E}_{Z\sim q^*_\Theta} \log p(X,Z|\Theta) - \mathbb{E}_{Z\sim q^*_\Theta} \log q^*(Z|\Theta)
$$
for any $\Theta$. Thus the problem of log-likelihood maximization reduces to that of maximizing the sum of expectations on the right-hand side.

This new problem does not seem to be tractable in general, since the optimization parameters $\Theta$ affect both the expected log-likelihood $\log p(X,Z|\Theta)$ under $Z\sim q^*_\Theta$ and the entropy of the optimal distribution of the latent variables $Z$.

Fortunately, an iterative procedure that alternates between computing $q^*_\Theta$ and maximizing over $\Theta$ can be effective. Consider the following:
* **E**-step: considering $\Theta_i$ as fixed and given find $q^*_{\Theta_i} = \mathop{\text{argmin}}_q KL\bigl(q\|p(\cdot|X,\Theta_i)\bigr)$ and set $q_{i+1} = q^*_{\Theta_i}$;
* **M**-step: considering $q_{i+1}$ as given, solve $\mathcal{L}(q_{i+1},\Theta) \to \mathop{\text{max}}_\Theta$, where 
$$ \mathcal{L}(q,\Theta) = \mathbb{E}_{Z\sim q} \log p(X,Z|\Theta) - \mathbb{E}_{Z\sim q} \log q(Z) $$

Since $q_{i+1}$ is held fixed, maximizing $\mathcal{L}(q_{i+1},\Theta)$ is equivalent to maximizing the expected log-likelihood, because the entropy term does not depend on $\Theta$. Therefore the **M**-step becomes:
* given $q_{i+1}$ find $\Theta^*_{i+1} = \mathop{\text{argmax}}_\Theta \mathbb{E}_{Z\sim q_{i+1}} \log p(X,Z|\Theta)$ and put $\Theta_{i+1} = \Theta^*_{i+1}$.

Note that just after the **E**-step the following inequality is true  
$$ \mathcal{L}(q_i,\Theta_i) \leq \log p(X|\Theta_i) = \mathcal{L}(q_{i+1},\Theta_i) $$

because $q_{i+1} = p(Z|X,\Theta_i)$. Next, since $\Theta_{i+1}$ is the maximizer of $\mathcal{L}(q_{i+1},\Theta)$, one has   
$$ \mathcal{L}(q_{i+1},\Theta_i) \leq \mathcal{L}(q_{i+1},\Theta_{i+1}) $$

Therefore the effect of a single round of **EM** on the log-likelihood itself is:
$$ \log p(X|\Theta_i) = \mathcal{L}(q_{i+1},\Theta_i) \leq \mathcal{L}(q_{i+1},\Theta_{i+1}) \leq \mathcal{L}(q_{i+2},\Theta_{i+1}) = \log p(X|\Theta_{i+1}) $$
where equality with the log-likelihood holds right after each **E**-step, that is, between the **E**- and the **M**-step of a round. This implies that **EM** never decreases the log-likelihood from one round to the next.
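
The monotonicity can be observed on a small example. The following is a minimal sketch, assuming a two-component mixture of independent Bernoulli vectors (the model developed later in this notebook) fitted to random binary data; the sizes `n`, `d`, `K`, the seed and the helper names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: n binary vectors of dimension d (not MNIST).
n, d, K = 200, 6, 2
X = (rng.random((n, d)) < 0.5).astype(float)

# Random starting point Theta_0 = (pi, theta), kept away from 0 and 1.
pi = np.full(K, 1.0 / K)
theta = rng.uniform(0.25, 0.75, size=(K, d))

def log_joint(X, pi, theta):
    # log pi_k + log p(x_s | z_s = k, theta) for every pair (s, k); shape n x K
    return np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T

def log_likelihood(X, pi, theta):
    # log p(X | Theta), using a log-sum-exp over the components
    lj = log_joint(X, pi, theta)
    m = lj.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))))

history = [log_likelihood(X, pi, theta)]
for _ in range(20):
    # E-step: q_{i+1} is the posterior of Z given X and Theta_i.
    lj = log_joint(X, pi, theta)
    q = np.exp(lj - lj.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood.
    pi = q.mean(axis=0)
    theta = np.clip(q.T @ X / q.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    history.append(log_likelihood(X, pi, theta))

# Successive log-likelihood values should be non-decreasing up to rounding.
print(np.all(np.diff(history) >= -1e-8))
```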

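A side note: if $q$ factorizes as $q(Z) = \prod_j q^j(z_j)$ and the posterior factorizes as $p(Z|X,\Theta_i) = \prod_j p(z_j|x_j,\Theta_i)$, then the **E**-step decouples into separate problems, one per latent variable:
$$ q^{*j}_{\Theta_i} = \mathop{\text{argmin}}_{q^j} KL\bigl(q^j\|p(\cdot|x_j,\Theta_i)\bigr) $$
since $KL\bigl(q\|p(\cdot|X,\Theta_i)\bigr) = \sum_{j=1}^{|Z|} \mathbb{E}_{z_j\sim q^j} \log\frac{q^j(z_j)}{p(z_j|x_j,\Theta_i)}$.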

### Application of the EM to MNIST data

Each image is a random element in a discrete probability space $\Omega = \{0,1\}^{N\times M}$ with product-measure
$$\mathbb{P}(\omega) = \prod_{i=1}^N\prod_{j=1}^M \theta_{ij}^{\omega_{ij}} (1-\theta_{ij})^{1-\omega_{ij}}$$
for any $\omega\in \Omega$. Here $M=N=28$. In other words, each bit of the image is independent of every other bit and is a Bernoulli random variable with parameter $\theta_{ij}$: $\omega_{ij}\sim \text{Bern}(\theta_{ij})$.
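
As an illustration of this product measure, the sketch below draws one image and evaluates its log-probability. The parameters `theta` are drawn at random here, which is an assumption made purely for the example; they are not estimated from the digits.

```python
import numpy as np

rng = np.random.default_rng(0)
N = M = 28

# Hypothetical pixel parameters theta_ij, drawn at random for illustration.
theta = rng.uniform(0.1, 0.9, size=(N, M))

# Draw one image: each pixel is an independent Bernoulli(theta_ij) variable.
omega = (rng.random((N, M)) < theta).astype(int)

# log P(omega) = sum_ij [ omega_ij log theta_ij + (1 - omega_ij) log(1 - theta_ij) ]
log_prob = np.sum(omega * np.log(theta) + (1 - omega) * np.log(1 - theta))
print(omega.shape, log_prob)
```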

Let's apply the EM algorithm to this dataset. The proposed model is the following.

Consider a mixture model of discrete probability spaces. Suppose there are $K$ components in the mixture. Then each image is distributed according to the following law:
$$p(\omega|\Theta)
= \sum_{k=1}^K \pi_k p_k(\omega|\theta_k)
= \sum_{k=1}^K \pi_k \prod_{i=1}^N \prod_{j=1}^M \theta_{kij}^{\omega_{ij}} (1-\theta_{kij})^{1-\omega_{ij}}$$
where $\theta_{kij}$ is the parameter of the probability distribution of the $(i,j)$-th random variable (pixel) in the $k$-th class, and $\pi_k$ is the probability that the $k$-th mixture component generates a random element, $\sum_{k=1}^K \pi_k= 1$.

Suppose $X=(x_s)_{s=1}^n \in \Omega^n$ is the dataset. The log-likelihood is given by
$$ \log p(X|\Theta) = \sum_{s=1}^n \log \sum_{k=1}^K \pi_k
\prod_{i=1}^N \prod_{j=1}^M \theta_{kij}^{x_{sij}} (1-\theta_{kij})^{1-x_{sij}}$$
where $x_{sij}\in\{0,1\}$ is the value of the $(i,j)$-th pixel of the $s$-th observation.
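
The log-likelihood above can be evaluated in log-space. The following is a minimal sketch, assuming images flattened into rows of length $D = NM$; the function name `mixture_log_likelihood` and the tiny arrays are hypothetical and serve only as an example.

```python
import numpy as np

def mixture_log_likelihood(X, Pi, Theta):
    """X: n x D binary, Pi: length-K weights, Theta: K x D Bernoulli parameters."""
    # log_p[s, k] = log pi_k + log p_k(x_s | theta_k)
    log_p = np.log(Pi) + X @ np.log(Theta).T + (1 - X) @ np.log(1 - Theta).T
    # log-sum-exp over k, shifted by the row maximum for stability
    m = log_p.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))))

# Hypothetical tiny example: n = 3 images of D = 4 pixels, K = 2 components.
X = np.array([[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]], dtype=float)
Pi = np.array([0.4, 0.6])
Theta = np.array([[0.9, 0.2, 0.8, 0.5], [0.1, 0.3, 0.6, 0.2]])
ll = mixture_log_likelihood(X, Pi, Theta)
print(ll)
```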

If the source components $Z=(z_s)_{s=1}^n$ of the mixture at each data point were known, then the complete-data log-likelihood would be
$$ \log p(X,Z|\Theta) = \sum_{s=1}^n \log \prod_{k=1}^K \Bigl[ \pi_k 
\prod_{i=1}^N \prod_{j=1}^M \theta_{kij}^{x_{sij}} (1-\theta_{kij})^{1-x_{sij}} \Bigr]^{1_{z_s = k}}$$
where $1_{z_s = k}$ is the indicator, which takes the value $1$ if $\{z_s = k\}$ and $0$ otherwise ($1_{\{k\}}(z_s)$ is another notation).

The log-likelihood simplifies to
$$ \log p(X,Z|\Theta) = \sum_{s=1}^n \sum_{k=1}^K 1_{z_s = k} \Bigl( \log \pi_k + 
\sum_{i=1}^N \sum_{j=1}^M \bigl( x_{sij} \log \theta_{kij} + (1-x_{sij}) \log (1-\theta_{kij}) \bigr) \Bigr) $$
and further into a more separable form
$$ \log p(X,Z|\Theta)
= \sum_{s=1}^n \sum_{k=1}^K 1_{z_s = k} \log \pi_k
+ \sum_{s=1}^n \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^M 1_{z_s = k} x_{sij} \log \theta_{kij}
+ \sum_{s=1}^n \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^M 1_{z_s = k} (1-x_{sij}) \log (1-\theta_{kij})$$

The expected log-likelihood under $z_s\sim q_s$ with $\mathbb{P}(z_s=k|X) = q_{sk}$, is given by
$$ \mathbb{E}\log p(X,Z|\Theta)
= \sum_{s=1}^n \sum_{k=1}^K q_{sk} \log \pi_k
+ \sum_{s=1}^n \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^M q_{sk} x_{sij} \log \theta_{kij}
+ \sum_{s=1}^n \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^M q_{sk} (1-x_{sij}) \log (1-\theta_{kij}) $$

#### Analytic solution

At the **E**-step one must compute $\hat{q}_{sk} = \mathbb{P}(z_s=k|x_s,\Theta)$ based on the current value of $\Theta = ((\pi_k), (\theta_{kij}))$. By Bayes' theorem
$$\hat{q}_{sk}
= \frac{p(x_s|z_s=k,\Theta) p(z_s=k|\Theta)}{\sum_{l=1}^K p(x_s|z_s=l,\Theta) p(z_s=l|\Theta)}
\propto \pi_k \prod_{i=1}^N \prod_{j=1}^M \theta_{kij}^{x_{sij}} (1-\theta_{kij})^{1-x_{sij}}
$$
and, since the observations are independent,
$$ q^*(Z) = \prod_{s=1}^n \hat{q}_{s z_s} $$

In order to improve numerical stability it is better to use the following formula:
$$ \hat{q}_{sk} \propto \text{exp}\Bigl\{ \log \pi_k + \sum_{i=1}^N \sum_{j=1}^M \bigl( x_{sij} \log \theta_{kij} + (1-x_{sij}) \log (1-\theta_{kij}) \bigr) \Bigr\} $$
which reduces to
$$ \hat{q}_{sk} \propto \text{exp}\Bigl\{ \log \pi_k + \sum_{i=1}^N \sum_{j=1}^M \log (1-\theta_{kij}) + \sum_{i=1}^N \sum_{j=1}^M x_{sij} \bigl( \log \theta_{kij} - \log (1-\theta_{kij}) \bigr) \Bigr\} $$
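
To see why the log-space form matters, consider the sketch below. It assumes a deliberately extreme, made-up case (all 784 pixels switched on and two components with small constant parameters) in which the direct product underflows, while the last formula above, shifted by its maximum before exponentiation, still gives a valid distribution.

```python
import numpy as np

# Hypothetical extreme case: D = 784 pixels, all switched on, two components.
D = 784
x = np.ones((1, D))
Pi = np.array([0.5, 0.5])
Theta = np.vstack([np.full(D, 0.05), np.full(D, 0.10)])

# Naive version: the products of 784 small factors underflow to exactly 0.
with np.errstate(invalid="ignore", divide="ignore", under="ignore"):
    u = Pi * np.prod(Theta ** x[0] * (1 - Theta) ** (1 - x[0]), axis=1)
    q_naive = u / u.sum()          # 0 / 0 gives nan

# Log-space version of the last formula above.
log_u = (np.log(Pi) + np.log(1 - Theta).sum(axis=1)
         + x @ (np.log(Theta) - np.log(1 - Theta)).T)
log_u -= log_u.max(axis=1, keepdims=True)   # shift so that the largest term is exp(0)
q_stable = np.exp(log_u) / np.exp(log_u).sum(axis=1, keepdims=True)

print(q_naive, q_stable)
```

Subtracting the row maximum does not change the normalized result, because the common factor cancels in the ratio.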

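At the **M**-step for some fixed $q(Z)$ one solves $\mathbb{E}\log p(X,Z|\Theta)\to \max_\Theta$ subject to $\sum_{k=1}^K \pi_k = 1$, which is a convex optimization problem with respect to $\Theta$: the expected log-likelihood is a linear combination of concave functions with non-negative weights and is therefore concave. Let $\lambda$ be the Lagrange multiplier of the constraint. The first order condition with respect to $\pi_k$ is $\sum_{s=1}^n \frac{q_{sk}}{\pi_k} - \lambda = 0$ for all $k=1,\ldots,K$, so that $\pi_k = \frac{1}{\lambda}\sum_{s=1}^n q_{sk}$. Summing over $k$ and using the constraint gives $ \lambda = \sum_{s=1}^n \sum_{l=1}^K q_{sl} = n $ and finally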
$$\hat{\pi}_k = \frac{\sum_{s=1}^n q_{sk}}{n}$$
For $\theta_{kij}$, $i=1,\ldots,N$, $j=1,\ldots,M$ and $k=1,\ldots,K$ the FOC is
$$ \sum_{s=1}^n q_{sk} \frac{x_{sij}}{\theta_{kij}} - \sum_{s=1}^n q_{sk} \frac{1-x_{sij}}{1-\theta_{kij}} = 0 $$
whence
$$\hat{\theta}_{kij} =  \frac{\sum_{s=1}^n q_{sk} x_{sij}}{ \sum_{s=1}^n q_{sk} } = \frac{\sum_{s=1}^n q_{sk} x_{sij}}{ n \hat{\pi}_k }$$
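
The closed-form updates can be sanity-checked by comparing them with random feasible alternatives. The following is a minimal sketch, assuming a small synthetic problem with random responsibilities `Q`; the sizes, the seed and the helper `expected_ll` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small problem: n images of D pixels, K components.
n, D, K = 50, 5, 3
X = (rng.random((n, D)) < 0.5).astype(float)
Q = rng.dirichlet(np.ones(K), size=n)       # rows are distributions q_s over k

def expected_ll(Pi, Theta):
    # E_q log p(X, Z | Theta), with the pixel indices (i, j) flattened to one axis
    log_terms = np.log(Pi) + X @ np.log(Theta).T + (1 - X) @ np.log(1 - Theta).T
    return float(np.sum(Q * log_terms))

# Closed-form M-step updates.
Pi_hat = Q.sum(axis=0) / n
Theta_hat = (Q.T @ X) / Q.sum(axis=0)[:, None]
best = expected_ll(Pi_hat, Theta_hat)

# Random feasible alternatives should never score higher.
others = [expected_ll(rng.dirichlet(np.ones(K)), rng.uniform(0.05, 0.95, size=(K, D)))
          for _ in range(200)]
print(best >= max(others))
```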


In [None]:
## Theta -- K x N*M, Pi -- 1 x K , X -- n x N*M, returns Q -- n x K
def e_step( X, Theta, Pi ) :
## Clip Theta away from 0 and 1 so that both logarithms are finite
    Theta = np.clip( Theta, 1e-10, 1 - 1e-10 )
    lTheta, lnTheta = np.log( Theta ), np.log( 1 - Theta )
## Compute the logarithms of the unnormalised probabilities
    log_u_sk = np.log( Pi ) + np.dot( X, np.transpose( lTheta - lnTheta ) ) + np.sum( lnTheta, axis = 1 )
## Subtract the row-wise maximum before exponentiating to avoid underflow
    u_sk = np.exp( log_u_sk - np.max( log_u_sk, axis = 1, keepdims = True ) )
## Normalize
    return u_sk / np.sum( u_sk, axis = 1, keepdims = True )

In [None]:
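## X -- n x N*M, Q -- n x K, returns Theta -- K x N*M, Pi -- 1 x K
def m_step( X, Q ) :
## Sum of the responsibilities of each component over all observations
    Pi = np.sum( Q, axis = ( 0, ) )
## Responsibility-weighted mean image of each component
    Theta = np.dot( Q.T, X ) / np.reshape( Pi, ( Q.shape[ 1 ], 1 ) )
    return Theta, np.reshape( Pi, ( 1, Pi.shape[ 0 ] ) ) / np.sum( Pi )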

<hr/>

In [None]:
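Pi = np.ones( ( 1, 10 ), float ) / 10
## X was deleted above, so the training subsample X_train is used instead.
## Note: identical rows of Theta are a symmetric start, so all components stay equal after the update.
Theta = np.full( ( 10, X_train.shape[ 1 ] ), .5, float )

In [None]:
Q = e_step( X_train, Theta, Pi )
Theta, Pi = m_step( X_train, Q )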

In [None]:
Q

In [None]:
plt.imshow( Theta[0].reshape( (28,28) ))

In [None]:
print(Theta.shape)
print(Pi.shape)
print(Q.shape)

<hr/>

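A random variable $X$ has the Beta distribution, $X\sim \text{Beta}(\alpha,\beta)$, if the law of $X$ has density
$$p(u) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} u^{\alpha-1}(1-u)^{\beta-1} $$
for $u\in(0,1)$.

If each pixel in the $k$-th component is modelled as $\text{Beta}(\alpha_{kij},\beta_{kij})$ instead of Bernoulli, the complete-data log-likelihood becomes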
$$ \log p(X,Z|\Theta) = \sum_{s=1}^n \log \prod_{k=1}^K \Bigl[ \pi_k 
\prod_{i=1}^N \prod_{j=1}^M 
\frac{\Gamma(\alpha_{kij}+\beta_{kij})}{\Gamma(\alpha_{kij})\Gamma(\beta_{kij})} x_{sij}^{\alpha_{kij}-1}(1-x_{sij})^{\beta_{kij}-1} \Bigr]^{1_{z_s = k}}$$

\begin{align*}
\mathbb{E}_q \log p(X,Z|\Theta)
&= \sum_{s=1}^n \sum_{k=1}^K q_{sk} \log \pi_k 
+ \sum_{s=1}^n \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^M q_{sk} \bigl( \log \Gamma(\alpha_{kij}+\beta_{kij}) - \log \Gamma(\alpha_{kij}) - \log \Gamma(\beta_{kij}) \bigr) \\
&+ \sum_{s=1}^n \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^M q_{sk} (\alpha_{kij}-1) \log x_{sij}\\
&+ \sum_{s=1}^n \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^M q_{sk} (\beta_{kij}-1) \log(1-x_{sij})
\end{align*}
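
This expectation can be evaluated directly. The following is a minimal sketch, assuming flattened images and pixel values clipped into the open interval $(0,1)$, since $\log 0$ is undefined; the function `beta_expected_ll`, the clipping constant `eps` and the tiny arrays are hypothetical and not part of the original notebook.

```python
import numpy as np
from math import lgamma

log_gamma = np.vectorize(lgamma)   # elementwise log Gamma, standard library only

def beta_expected_ll(X, Q, Pi, A, B, eps=1e-6):
    """X: n x D in [0, 1], Q: n x K, Pi: length K, A and B: K x D Beta parameters."""
    # Assumption: pixels are clipped into (0, 1) so that both logarithms are finite.
    Xc = np.clip(X, eps, 1 - eps)
    # Sum over pixels of the log normalizing constants; one value per component.
    log_norm = (log_gamma(A + B) - log_gamma(A) - log_gamma(B)).sum(axis=1)
    # log density of every image under every component; shape n x K
    log_dens = log_norm + np.log(Xc) @ (A - 1).T + np.log(1 - Xc) @ (B - 1).T
    return float(np.sum(Q * (np.log(Pi) + log_dens)))

# Hypothetical tiny example with uniform components: Beta(1, 1) has density 1.
X = np.array([[0.2, 0.7], [0.9, 0.4]])
Q = np.array([[0.3, 0.7], [0.6, 0.4]])
Pi = np.array([0.5, 0.5])
A = np.ones((2, 2))
B = np.ones((2, 2))
value = beta_expected_ll(X, Q, Pi, A, B)
print(value)   # only the mixing term remains: 2 * log(0.5)
```

Unlike the Bernoulli case, the maximization over $\alpha_{kij}$ and $\beta_{kij}$ has no closed form, so a function like this would typically be handed to a numerical optimizer.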
