{{ message }}

pillowlab / PYMentropy

Discrete entropy estimator using the Pitman-Yor mixture (PYM) prior

Files

Failed to load latest commit information.
Type
Name
Commit time

PYM entropy estimator MATLAB reference implementation

This code lives in github: https://github.com/pillowlab/PYMentropy

This is a reference implementation in MATLAB of the entropy estimator based on a Pitman-Yor mixture (PYM) prior. For the details behind how we derive the estimator, see the following papers:

• Evan Archer, Il Memming Park, Jonathan W. Pillow. Bayesian estimation of discrete entropy with mixtures of stick breaking priors. Neural Information Processing Systems (NIPS) 2012
• Evan Archer, Il Memming Park, Jonathan W. Pillow. Bayesian Entropy Estimation for Countable Discrete Distributions. Journal of Machine Learning Research (JMLR), 15(81):2833−2868, 2014. http://arxiv.org/abs/1302.0328; http://jmlr.org/papers/v15/archer14a.html)

Quick example

Let's estimate entropy from a sequence of natural numbers using the PYM estimator (in practice, the sequence would be samples drawn from some unknown distribution):

``````>> [mm, icts] = multiplicitiesFromSamples([3 2 4 3 1 4 2 4 4]);
>> [Hbls, Hvar] = computeH_PYM(mm, icts)

Hbls =
2.1476

Hvar =
0.2715
``````

Here, `Hbls` is the Bayes least squares estimate of entropy, and `Hvar` is the posterior variance of the estimate. The units of this toolbox is nats (natural logarithm); to convert to bits, divide the result by `log(2) = 0.6931...`.

Requirements and Installation

You must have Optimization toolbox (for `fmincon`). To install, just add the package to your MATLAB path. This package is developed under 7.13.0.564 (R2011b).

If using an older version of MATLAB, you may need lightspeed for fast digamma and polygamma implementations.

Run the unit test script `test_HPYM_randomized.m` to check if your copy is working fine.

Converting data

We represent raw data in three distinct ways, each at a different level of abstraction. The raw data itself are samples of discrete symbols from some unknown distribution. We do not operate directly upon the samples; rather, all entropy estimators provided take a succint representation called multiplicities. The multiplicity representation consists of two vectors of the same length: `mm` contains the number of symbols with the same number of occurrences, and `icts` contains the corresponding number of occurrences. In short, `mm(j)` is number of symbols with `icts(j)` occurrences. For example, the symbol sequence `[a b b c c d d d d]` has a multiplicity representation where `mm = [1 2 1]` and `icts = [1 2 4]`; this tells us that there is only one symbol with a single occurrence (`a`), two symbols with two occurrences (`b` and `c`), and one symbol with 4 occurrences (`d`). The ordering in `mm` and `icts` is not meaningful. We provide the following utility functions to convert data to the multiplicities. For large data, it is recommended that you write your own code that can exploit your data structure to convert it to the multiplicity representation in a memory efficient manner.

Raw samples

Given a vector of symbols with unique numerical representation, use `multiplicitiesFromSamples`.

``````>> [mm, icts] = multiplicitiesFromSamples([1 2 2 3.5 3.5 4 4 4 4])
mm =
1
2
1

icts =
1
2
4
``````

Histogram representation

Discrete entropy does not care about the symbol identity. Since we assume independent and identically distributed samples, the ordering is also irrelvant. Hence, a discrete histogram also contains all information about the data. To convert a histogram to multiplicities, use `multiplicitiesFromCounts`.

``````>> [mm, icts] = multiplicitiesFromCounts([1 2 2 4])
mm =
1
2
1

icts =
1
2
4
``````

Converting back to histogram

If you need the histogram of your multiplicity representation, use `multiplicitiesToCounts`.

``````>> hg = multiplicitiesToCounts(mm, icts)
hg =
4
2
2
1
``````

Spike train

For a specialized case when the data are contained in a cell array of spike timings from simultaneously recorded neurons, the multiplicities can be extracted using `multisptimes2words.m`. See also `fastWords2Counts`, `discreteTimeSeries2Words`, and `words2multiplicities.m`. Alternatively, Nemenman also has a fast C++ implementation for extracting multiplicities from time series.

Entropy estimation

Simply call:

``````[Hbls, Hvar] = computeH_PYM(mm, icts, prior)
``````

where `mm` and `icts` form the multiplicity representation, and `prior` is an optional structure that rougly specifies the range of power-law tail behavior. If omitted, the default value is used. Note that PYM estimator is not finite if there are less than 2 coincidences (1 coincidence = same sample observed twice).

The package also includes a few more entropy estimators. All functions of the form `computeH_*` take the multiplicity representation and return an estimate of `H` as well as a variance (if it is supported).

Controlling the Prior

Theoretically, the PYM estimator has a degree freedom in specifying the prior in the gamma direction (details in the papers). Basically, any bounded, non-negative valued function `q(g = gamma)` defined on [0, Inf] can be used as a prior. The default prior is `q(g) = exp(-10/(1-g))`, which places more prior probability mass on distributions with light-tails, and limits the amount of prior mass placed on extremely heavy-tailed distributions.

Use `prior = pymPriorFactory(priorName, param)` to generate prespecified alternative priors, such as `q(g) = exp(-g)`. To use a custom prior, you must specify function handles for the prior function, `q(g)`, as well as its first and second derivatives. Once these function handles are placed in a structure (see pymPriorFactory.m for details), your custom prior may be used in a manner identical to that of the built-in priors.

Enjoy!

Discrete entropy estimator using the Pitman-Yor mixture (PYM) prior