# jmschrei/pomegranate

Switch branches/tags
Nothing to show
Fetching contributors…
Cannot retrieve contributors at this time
75 lines (45 sloc) 5.98 KB

# General Mixture Models

IPython Notebook Tutorial

General Mixture models (GMMs) are an unsupervised probabilistic model composed of multiple distributions (commonly referred to as components) and corresponding weights. This allows you to model more complex distributions corresponding to a singular underlying phenomena. For a full tutorial on what a mixture model is and how to use them, see the above tutorial.

## Initialization

General Mixture Models can be initialized in two ways depending on if you know the initial parameters of the model or not: (1) passing in a list of pre-initialized distributions, or (2) running the from_samples class method on data. The initial parameters can be either a pre-specified model that is ready to be used for prediction, or the initialization for expectation-maximization. Otherwise, if the second initialization option is chosen, then k-means is used to initialize the distributions. The distributions passed for each component don't have to be the same type, and if an IndependentComponentDistribution object is passed in, then the dimensions don't need to be modeled by the same distribution.

Here is an example of a traditional multivariate Gaussian mixture where we pass in pre-initialized distributions. We can also pass in the weight of each component, which serves as the prior probability of a sample belonging to that component when doing predictions.

from pomegranate import *
d1 = MultivariateGaussianDistribution([1, 6, 3], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
d2 = MultivariateGaussianDistribution([2, 8, 4], [[1, 0, 0], [0, 1, 0], [0, 0, 2]])
d3 = MultivariateGaussianDistribution([0, 4, 8], [[2, 0, 0], [0, 3, 0], [0, 0, 1]])
model = GeneralMixtureModel([d1, d2, d3], weights=[0.25, 0.60, 0.15])

Alternatively, if we want to model each dimension differently, then we can replace the multivariate Gaussian distributions with IndependentComponentsDistribution objects.

from pomegranate import *
d1 = IndependentComponentsDistributions([NormalDistribution(5, 2), ExponentialDistribution(1), LogNormalDistribution(0.4, 0.1)])
d2 = IndependentComponentsDistributions([NormalDistribution(3, 1), ExponentialDistribution(2), LogNormalDistribution(0.8, 0.2)])
model = GeneralMixtureModel([d1, d2], weights=[0.66, 0.34])

If we do not know the parameters of our distributions beforehand and want to learn them entirely from data, then we can use the from_samples class method. This method will run k-means to initialize the components, using the returned clusters to initialize all parameters of the distributions, i.e. both mean and covariances for multivariate Gaussian distributions. Afterwards, expectation-maximization is used to refine the parameters of the model, iterating until convergence.

from pomegranate import *
model = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, n_components=3, X=X)

If we want to model each dimension using a different distribution, then we can pass in a list of callables and they will be initialized using k-means as well.

from pomegranate import *
model = GeneralMixtureModel.from_samples([NormalDistribution, ExponentialDistribution, LogNormalDistribution], n_components=5, X=X)

## Probability

The probability of a point is the sum of its probability under each of the components, multiplied by the weight of each component c, P = \sum\limits_{i \in M} P(D|M_{i})P(M_{i}). The probability method returns the probability of each sample under the entire mixture, and the log_probability method returns the log of that value.

## Prediction

The common prediction tasks involve predicting which component a new point falls under. This is done using Bayes rule P(M|D) = \frac{P(D|M)P(M)}{P(D)} to determine the posterior probability P(M|D) as opposed to simply the likelihood P(D|M). Bayes rule indicates that it isn't simply the likelihood function which makes this prediction but the likelihood function multiplied by the probability that that distribution generated the sample. For example, if you have a distribution which has 100x as many samples fall under it, you would naively think that there is a ~99% chance that any random point would be drawn from it. Your belief would then be updated based on how well the point fit each distribution, but the proportion of points generated by each sample is important as well.

We can get the component label assignments using model.predict(data), which will return an array of indexes corresponding to the maximally likely component. If what we want is the full matrix of P(M|D), then we can use model.predict_proba(data), which will return a matrix with each row being a sample, each column being a component, and each cell being the probability that that model generated that data. If we want log probabilities instead we can use model.predict_log_proba(data) instead.

## Fitting

Training GMMs faces the classic chicken-and-egg problem that most unsupervised learning algorithms face. If we knew which component a sample belonged to, we could use MLE estimates to update the component. And if we knew the parameters of the components we could predict which sample belonged to which component. This problem is solved using expectation-maximization, which iterates between the two until convergence. In essence, an initialization point is chosen which usually is not a very good start, but through successive iteration steps, the parameters converge to a good ending.

These models are fit using model.fit(data). A maximum number of iterations can be specified as well as a stopping threshold for the improvement ratio. See the API reference for full documentation.

## API Reference

.. automodule:: pomegranate.gmm
:members:
:inherited-members: