# Expectation Maximization

[This PDF](https://github.com/mtomassoli/information-theory-tutorial) helped me a lot, though mainly with topics other than EM. I also found a great blog post [here](https://nipunbatra.github.io/blog/2014/em.html), and [this explanation](https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf) is more mathematical.

EM is an iterative method that alternates between two steps:
* Expectation (E step) - Builds a function $Q$: the expectation of the log-likelihood over the hidden variables, evaluated using the current estimate of the parameters. What this step actually computes are the fixed, data-dependent quantities that determine $Q$.
* Maximization (M step) - Computes the parameters that maximize the expected log-likelihood $Q$ found in the E step. Once the quantities from the E step are known, $Q$ is fully determined and can be maximized.
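
For reference, the two steps are usually written as below. The notation is an assumption on my part: $\mathbf{X}$ is the observed data, $\mathbf{Z}$ the hidden variables, and $\boldsymbol{\theta}^{(t)}$ the parameter estimate at iteration $t$.

$$
Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) = \operatorname{E}_{\mathbf{Z} \sim p(\cdot \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right]
$$

$$
\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})
$$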

Other methods exist to find maximum likelihood estimates, such as gradient descent, conjugate gradient, or variants of the Gauss–Newton algorithm. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.

1. First, initialize the parameters $\boldsymbol {\theta }$ to some random values.
2. Compute the probability of each possible value of $\mathbf {Z}$, given $\boldsymbol {\theta }$ and the observed data.
3. Then, use the just-computed probabilities of $\mathbf {Z}$ to compute a better estimate for the parameters ${\boldsymbol {\theta }}$.
4. Iterate steps 2 and 3 until convergence.
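
The four steps can be sketched as a small loop that stops once the parameters stop changing. This is only a sketch under assumptions: the names `run_em`, `e_step` and `m_step`, the tolerance, and the toy hit counts are all made up for illustration, and both models are given equal prior weight.

```python
import numpy as np
from scipy import stats

def run_em(e_step, m_step, theta, tol=1e-8, max_iter=1000):
    """Alternate E and M steps until theta changes by less than tol."""
    for iteration in range(1, max_iter + 1):
        responsibilities = e_step(theta)             # step 2: P(Z | data, theta)
        new_theta = m_step(responsibilities)         # step 3: better estimate of theta
        if np.max(np.abs(new_theta - theta)) < tol:  # step 4: stop at convergence
            return new_theta, responsibilities, iteration
        theta = new_theta
    return theta, responsibilities, max_iter

# Toy data (hypothetical): number of ones in each of 4 rows of 10 draws.
hits = np.array([2, 9, 8, 3])
draws = 10

def e_step(theta):
    # Likelihood of each row under each model, normalized over the models.
    likelihood = stats.binom.pmf(hits[None, :], draws, theta[:, None])
    return likelihood / likelihood.sum(axis=0)

def m_step(responsibilities):
    # Responsibility-weighted fraction of ones for each model.
    return responsibilities @ hits / (responsibilities.sum(axis=1) * draws)

# Step 1: an arbitrary initial guess for the two parameters.
theta, responsibilities, iterations = run_em(e_step, m_step, np.array([0.6, 0.5]))
print(theta, iterations)
```

Stopping on the size of the parameter change is one common choice; monitoring the change in log-likelihood would be another.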

## Example
We have binary (0 or 1) samples drawn from two Bernoulli distributions. A Bernoulli variable takes the value 1 with probability $p$ and 0 with probability $1 - p$, where $p$ is the distribution's parameter. We don't know which distribution each sample comes from.

The maximum likelihood estimate of $p$ is the sample mean $\frac{1}{n} \sum_{i=1}^n x_i$.
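
A quick numerical check of that claim: evaluate the Bernoulli log-likelihood on a grid of candidate values of $p$ and compare the maximizer with the sample mean. The helper name `bernoulli_log_likelihood` and the grid resolution are assumptions made for this sketch.

```python
import numpy as np

# One row of 0/1 samples (the same values as the first row of the observations used below).
x = np.array([1, 0, 0, 0, 1, 1, 0, 1, 0, 1])

def bernoulli_log_likelihood(p, x):
    # log L(p) = (#ones) * log(p) + (#zeros) * log(1 - p)
    ones = x.sum()
    return ones * np.log(p) + (len(x) - ones) * np.log(1 - p)

# Candidate values of p, kept away from 0 and 1 to avoid log(0).
grid = np.linspace(0.01, 0.99, 99)
best_p = grid[np.argmax(bernoulli_log_likelihood(grid, x))]

print(best_p, x.mean())
```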

In [12]:
import functools as ft
import itertools as it
import json
import math
import operator as op
import os

from IPython.display import display
from ipywidgets import interact, interact_manual, widgets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import misc, stats
from sklearn import metrics

In [8]:
# Assumption: each row comes from a single distribution. Without it, the solution
# would probably be that all ones come from a dist with p=1 and all zeros from p=0.
observations = np.array([[1,0,0,0,1,1,0,1,0,1],
                         [1,1,1,1,0,1,1,1,1,1],
                         [1,0,1,1,1,1,1,0,1,1],
                         [1,0,1,0,0,0,1,1,0,0],
                         [0,1,1,1,0,1,1,1,0,1]])

In [9]:
# If we knew which distribution each row comes from, we could simply calculate
# the params with the ML estimator (the sample mean).

distribution_ids = np.array([0, 1, 1, 0, 1]) # hidden params

p_0 = observations[distribution_ids == 0].mean()
p_1 = observations[distribution_ids == 1].mean()

print(p_0, p_1)

0.45 0.8


In [127]:
# Start from an arbitrary initial guess; different choices may give different results.
# Shapes: visible params (model_n, visible_size); hidden params (model_n, hidden_size);
# observations (observation_n = hidden_size, observation_size).

visible_params = np.array([[0.6], [0.5]])

def estimate_hidden_params(pmf, observations, visible_params):
    # E step: for every model, the probability that each observation row came from it.
    hits = observations.sum(axis=1)  # number of ones in each row
    observation_n, observation_size = observations.shape

    model_n = visible_params.shape[0]
    model_probs = np.zeros((model_n, observation_n), dtype=np.float64)
    for i, visible_row in enumerate(visible_params):
        # likelihood of each row's hit count under model i
        model_probs[i] = pmf(hits, observation_size, visible_row)

    # normalize over models so each column sums to 1 (all models get equal prior weight)
    total_probs = model_probs.sum(axis=0)
    return model_probs / total_probs
    
hidden_params = estimate_hidden_params(stats.binom.pmf, observations, visible_params)
print(hidden_params)

def maximize_visible_params(estimate, observations, hidden_params):
    # M step: one estimate per observation row; shape (observation_n,) for a scalar estimate
    per_observation_estimates = np.array([estimate(row) for row in observations])

    # Average the per-row estimates, weighted by each model's hidden params:
    # (model_n, observation_n) @ (observation_n,) = (model_n,)
    visible_estimates = hidden_params @ per_observation_estimates / hidden_params.sum(axis=1)
    return visible_estimates
      
visible_params = maximize_visible_params(lambda row: row.mean(), observations, hidden_params)
print(visible_params)

[[ 0.44914893  0.80498552  0.73346716  0.35215613  0.64721512]
 [ 0.55085107  0.19501448  0.26653284  0.64784387  0.35278488]]
[ 0.71301224  0.58133931]


In [71]:
def expectation_maximization(pmf, estimate, observations, initial_visible, iterations):
    visible_params = initial_visible
    for i in range(iterations):
        hidden_params = estimate_hidden_params(pmf, observations, visible_params)
        visible_params = maximize_visible_params(estimate, observations, hidden_params)
    return visible_params, hidden_params

visible_params, hidden_params = expectation_maximization(
    stats.binom.pmf, lambda row: row.mean(), observations, np.array([[0.6], [0.5]]), 1000
)
print(visible_params)
print(hidden_params)

[ 0.79678907  0.51958312]
[[ 0.10300871  0.95201348  0.84549373  0.03070315  0.6014986 ]
 [ 0.89699129  0.04798652  0.15450627  0.96929685  0.3985014 ]]


The real values were 

* visible params [0.45 0.8]
* hidden params [0, 1, 1, 0, 1]

The algorithm swapped the first distribution with the second, but that is perfectly fine: the labels are arbitrary, so the two distributions are just identified in a different order.
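
Because the labels are arbitrary, comparing an EM result with reference values means comparing up to a permutation. Below is a minimal sketch of one way to do that by brute force; the helper name `best_label_permutation` is hypothetical, and trying every permutation is only practical for a handful of models.

```python
import itertools as it
import numpy as np

def best_label_permutation(estimated, reference):
    """Return the ordering of `estimated` that is closest to `reference`."""
    candidates = it.permutations(range(len(estimated)))
    return min(candidates, key=lambda perm: np.abs(estimated[list(perm)] - reference).sum())

estimated = np.array([0.79678907, 0.51958312])  # visible params found by EM above
reference = np.array([0.45, 0.8])               # values computed from the known labels

perm = best_label_permutation(estimated, reference)
aligned = estimated[list(perm)]
print(perm, aligned)
```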

In [93]:
expectation_maximization(
    stats.binom.pmf, lambda row: row.mean(), 
    np.array([[1], [0], [1], [0], [0], 
              [1], [0], [1], [1], [0], 
              [0], [0], [1], [0], [1], 
              [1], [0], [1], [0], [0]]), 
    np.array([[0.90], [0.10]]), 
    1000
)

(array([ 0.83938928,  0.06061072]),
 array([[ 0.93265475,  0.14600975,  0.93265475,  0.14600975,  0.14600975,
          0.93265475,  0.14600975,  0.93265475,  0.93265475,  0.14600975,
          0.14600975,  0.14600975,  0.93265475,  0.14600975,  0.93265475,
          0.93265475,  0.14600975,  0.93265475,  0.14600975,  0.14600975],
        [ 0.06734525,  0.85399025,  0.06734525,  0.85399025,  0.85399025,
          0.06734525,  0.85399025,  0.06734525,  0.06734525,  0.85399025,
          0.85399025,  0.85399025,  0.06734525,  0.85399025,  0.06734525,
          0.06734525,  0.85399025,  0.06734525,  0.85399025,  0.85399025]]))

In [89]:
expectation_maximization(
    stats.binom.pmf, lambda row: row.mean(), 
    np.array([[1, 0, 1, 0, 0, 
               1, 0, 1, 1, 0, 
               0, 0, 1, 0, 1, 
               1, 0, 1, 0, 0]]), 
    np.array([[0.75], [0.25]]), 
    1000
)

(array([ 0.45,  0.45]), array([[ 0.5],
        [ 0.5]]))

In [94]:
expectation_maximization(
    stats.binom.pmf, lambda row: row.mean(), 
    np.array([[1, 0, 1, 0, 0, 
               1, 0, 1, 1, 0, 
               0, 0, 1, 0, 1, 
               1, 0, 1, 0, 0]]), 
    np.array([[0.75]]), 
    1000
)

(array([ 0.45]), array([[ 1.]]))

In [114]:
expectation_maximization(
    stats.binom.pmf, lambda row: row.mean(), 
    np.array([[1, 0, 1, 1, 1], 
              [1, 0, 1, 1, 0], 
              [0, 0, 1, 0, 1], 
              [1, 0, 0, 0, 0]]), 
    np.array([[0.8], [0.6], [0.4], [0.2]]), 
    10000
)

(array([ 0.50237364,  0.50217056,  0.49782944,  0.49762636]),
 array([[ 0.25356136,  0.25118595,  0.2488122 ,  0.24644049],
        [ 0.2532549 ,  0.25108626,  0.24891559,  0.24674325],
        [ 0.24674325,  0.24891559,  0.25108626,  0.2532549 ],
        [ 0.24644049,  0.2488122 ,  0.25118595,  0.25356136]]))

In [125]:
foo = lambda row: stats.norm.pdf(1, row)
foo(4, 10, np.array([0.1, 10]))

array([ 0.39894228,  0.26608525])