
Early termination with truncated samples #11

Closed
philastrophist opened this issue Jan 20, 2019 · 1 comment

Comments

@philastrophist

When running some tests on simple models (2 dimensions, 3 or 4 components), I find that the untruncated fit (i.e. sel_callback=None) performs better on these models than the fit where I give pygmmis the selection function (i.e. sel_callback=selection_function).

I've created a gist with my script here.

The likelihood seems to decrease immediately when moving away from the k-means estimate.
However, if I disable convergence detection and let the runs continue indefinitely, some split-merge runs do converge. In fact, log_L jumps around quite a bit before finally converging (sometimes).
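
For reference, the comparison boils down to roughly this (a condensed sketch rather than the full gist; the exact `pygmmis.GMM`/`pygmmis.fit` signatures and the callback convention are taken from the README, so please check them against the current version):

```python
import numpy as np
import pygmmis

def selection_function(coords):
    # completeness Omega(x): a hard cut, samples with x0 < 0 are never observed
    return (coords[:, 0] > 0).astype(float)

rng = np.random.RandomState(42)
# stand-in for the 3/4-component truth used in the gist
complete = rng.multivariate_normal([0.5, 0.0], np.eye(2), size=2000)
observed = complete[selection_function(complete) > 0]   # truncated sample

# fit ignoring the truncation ...
gmm_plain = pygmmis.GMM(K=3, D=2)
logL_plain, _ = pygmmis.fit(gmm_plain, observed, sel_callback=None)

# ... and with the selection function
gmm_sel = pygmmis.GMM(K=3, D=2)
logL_sel, _ = pygmmis.fit(gmm_sel, observed, sel_callback=selection_function)
```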

The figures below show my problem. The ellipses are 1-sigma; observed data is in green, unobserved data in red.
Red ellipses are the k-means estimate,
black ellipses are the pygmmis estimate,
blue ellipses are the truth.

With tolerance:
[figure: pygmmis]

Without tolerance:
[figure: pygmmis-converged]

Log-likelihood:
[figure: log_like]

We need to detect convergence better, otherwise it'll get stuck in a local minimum. Maybe a t-test for flat stretches of the likelihood, with tolerances? Something along the lines of the sketch below.
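
Just a sketch of the idea, not pygmmis code; `has_converged` is a hypothetical helper operating on the per-iteration log_L history:

```python
import numpy as np
from scipy import stats

def has_converged(logL_history, window=20, alpha=0.05):
    """Declare convergence when a line fitted to the last `window` log-likelihood
    values has a slope that is not significantly positive (one-sided t-test)."""
    if len(logL_history) < window:
        return False
    recent = np.asarray(logL_history[-window:])
    fit = stats.linregress(np.arange(window), recent)
    # linregress' p-value is two-sided for slope == 0; halve it for the one-sided test
    p_positive = fit.pvalue / 2 if fit.slope > 0 else 1.0 - fit.pvalue / 2
    return p_positive > alpha
```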

@pmelchior
Owner

OK, apologies that it took a while to get back to this issue. Life...

Thanks for your clear report; I can confirm it. However, PR #12 is only a workaround, not the actual solution, so let me address the problem here.

The problem, as you have determined yourself, is not in the EM equations themselves (otherwise the solution wouldn't have converged to something meaningful) but in the reported likelihood, which is used for convergence testing.

In a completely observed GMM, the likelihood of a sample x is

p(x | GMM) = 1/Z sum_k alpha_k p_k(x)

with Z = int dx sum_k alpha_k p_k(x). Since the p_k are normalized and the alphas sum to 1, Z = 1 and the normalization constant is dropped. But for incomplete samples, we cannot do that. The process that led to the observed samples is

p(x | GMM, Omega) = 1/Z' Omega(x) sum_k alpha_k p_k(x)

with Z' = int dx Omega(x) sum_k alpha_k p_k(x). As Omega <= 1, Z' <= 1, so dividing by it boosts the value of logL. Z' tells us the fraction of the probability mass that is observed, and it can be computed very efficiently with Monte Carlo integration. My last commit 0e33ac6 does exactly that. With it the drop in logL doesn't arise and #12 is not necessary.
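
For reference, the Monte Carlo estimate just averages the completeness over draws from the current model (a minimal sketch rather than the actual commit; it assumes the GMM object can draw samples, as pygmmis' GMM.draw does):

```python
import numpy as np

def estimate_Z_prime(gmm, sel_callback, n_samples=100_000):
    """Z' = int dx Omega(x) sum_k alpha_k p_k(x) = E_{x ~ GMM}[Omega(x)],
    i.e. the fraction of the model's probability mass that is observable."""
    samples = gmm.draw(n_samples)            # x ~ sum_k alpha_k p_k(x)
    return np.mean(sel_callback(samples))    # average completeness Omega(x)

# per observed sample, the corrected log-likelihood picks up an extra -log(Z') term,
# which is >= 0 because Z' <= 1: this is the boost mentioned above
```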
