ADD tutorial
jmschrei committed Jan 3, 2018
1 parent e50d506 commit c14307d
Showing 3 changed files with 980 additions and 31 deletions.
9 changes: 6 additions & 3 deletions docs/nan.rst
@@ -5,13 +5,14 @@ Missing Values

As of version 0.9.0, pomegranate supports missing values for almost all methods. This means that models can be fit to data sets that have missing values in them, inference can be done on samples that have missing values, and even structure learning can be done in the presence of missing values. Currently, this support exists in the form of calculating sufficient statistics with respect to only the variables that are present in a sample and ignoring the missing values, in contrast to imputing the missing values and using those for the estimation.

Missing value support was added in a manner that requires the least user thought. All one has to do is add `numpy.nan` to mark an entry as missing for numeric data sets, or the string `'nan'` for string data sets. pomegranate will automatically handle missing values appropriately. The functions have been written in such a way to minimize the overhead of missing value support, by only acting differently when a missing value is found. However, it may take some models longer to do calculations in the presence of missing values than on dense data. For example, when calculating the log probability of a sample under a multivariate Gaussian distribution one can typically use BLAS or a GPU since a dot product is taken between the data and the inverse covariance matrix. Unfortunately, since missing data can occur in any of the columns, a new inverse covariance matrix has to be calculated for each sample and BLAS cannot be utilized at all.
Missing value support was added in a manner that requires the least user thought. All one has to do is add ``numpy.nan`` to mark an entry as missing for numeric data sets, or the string ``'nan'`` for string data sets. pomegranate will automatically handle missing values appropriately. The functions have been written in such a way to minimize the overhead of missing value support, by only acting differently when a missing value is found. However, it may take some models longer to do calculations in the presence of missing values than on dense data. For example, when calculating the log probability of a sample under a multivariate Gaussian distribution one can typically use BLAS or a GPU since a dot product is taken between the data and the inverse covariance matrix. Unfortunately, since missing data can occur in any of the columns, a new inverse covariance matrix has to be calculated for each sample and BLAS cannot be utilized at all.
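The per-sample cost described above can be illustrated with a numpy-only sketch. It assumes the missing dimensions are marginalized out, so each sample is scored against the sub-mean and sub-covariance of its observed columns. The function name ``mvn_log_probability_observed`` is hypothetical and this is not pomegranate's implementation.

```python
import numpy

def mvn_log_probability_observed(x, mu, cov):
    """Log density of x under N(mu, cov), using only the observed entries.

    Hypothetical helper for illustration; not part of pomegranate.
    """
    observed = ~numpy.isnan(x)
    d = int(observed.sum())
    if d == 0:
        # Nothing observed: probability 1, i.e. log probability 0.
        return 0.0

    diff = x[observed] - mu[observed]
    # The observed pattern can differ per sample, so the sub-covariance
    # (and its inverse) has to be rebuilt for every sample.
    sub_cov = cov[numpy.ix_(observed, observed)]
    _, logdet = numpy.linalg.slogdet(sub_cov)
    mahalanobis = diff @ numpy.linalg.solve(sub_cov, diff)
    return -0.5 * (d * numpy.log(2 * numpy.pi) + logdet + mahalanobis)

mu = numpy.zeros(2)
cov = numpy.eye(2)
x = numpy.array([0.0, numpy.nan])
logp = mvn_log_probability_observed(x, mu, cov)
```

Because the linear solve depends on which columns are present, the work cannot be batched into a single matrix product over the whole data set.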

As an example, when fitting a `NormalDistribution` to a vector of data, the parameters are estimated simply by ignoring the missing values. A data set with 100 observations and 50 missing values would produce the same model as a data set comprised simply of the 100 observations. This comes into play when fitting multivariate models, like an `IndependentComponentsDistribution`, because each distribution is fit to only the observations for its specific feature. This means that samples where some values are missing can still be utilized in the dimensions where they are observed. This can lead to more robust estimates than imputing the missing values using the mean or median of the column.
As an example, when fitting a ``NormalDistribution`` to a vector of data, the parameters are estimated simply by ignoring the missing values. A data set with 100 observations and 50 missing values would produce the same model as a data set comprised simply of the 100 observations. This comes into play when fitting multivariate models, like an ``IndependentComponentsDistribution``, because each distribution is fit to only the observations for its specific feature. This means that samples where some values are missing can still be utilized in the dimensions where they are observed. This can lead to more robust estimates than imputing the missing values using the mean or median of the column.
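The per-feature behavior can be sketched with numpy alone, assuming the estimates are the ordinary mean and standard deviation taken over the observed entries of each column. The toy array ``X`` is made up for illustration, and the sketch does not call pomegranate.

```python
import numpy

# Toy data: two features, with one missing entry in each column.
X = numpy.array([[1.0, 10.0],
                 [2.0, numpy.nan],
                 [numpy.nan, 30.0],
                 [4.0, 20.0]])

# Per-column estimates that skip the missing entries, so every row
# still contributes to the columns where it is observed.
mu = numpy.nanmean(X, axis=0)
sigma = numpy.nanstd(X, axis=0)

# Mean imputation for comparison: filling gaps with the column mean
# leaves the mean unchanged but pulls the spread toward zero.
X_imputed = numpy.where(numpy.isnan(X), mu, X)
mu_imputed = X_imputed.mean(axis=0)
sigma_imputed = X_imputed.std(axis=0)
```

In this toy case the imputed standard deviations come out smaller than the ones computed from the observed values only, which is one way imputation can distort the fit.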

Here is an example of fitting a univariate distribution to data sets with missing values:

.. code-block:: python
>>> import numpy
>>> from pomegranate import *
>>>
@@ -43,13 +44,15 @@ Multivariate Gaussian distributions take a slightly more complex approach. The m

All univariate distributions return a probability of 1 for missing data. This is done to support inference algorithms in more complex models. For example, when running the forward algorithm in a hidden Markov model in the presence of missing data, one would simply ignore the emission probability for the steps where the symbol is missing. This means that when getting to the step when a missing symbol is being aligned to each of the states, the cost is simply the transition probability to that state, instead of the transition probability multiplied by the likelihood of that symbol under that state's distribution (or, equivalently, having a likelihood of 1). Under a Bayesian network, the probability of a sample is just the product of probabilities under distributions where the sample is fully observed.
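The forward-algorithm case can be sketched in log space with numpy. This is a simplified illustration, assuming a discrete HMM with no explicit end state and ``None`` as the missing marker; the function name ``forward_log_probability`` and the parameter values are hypothetical and do not reflect pomegranate's internals.

```python
import numpy

def forward_log_probability(obs, log_start, log_trans, log_emit):
    """Forward algorithm where a missing symbol (None) has emission probability 1."""
    n_states = log_start.shape[0]

    def emission(symbol):
        # A missing symbol contributes log(1) = 0 for every state.
        if symbol is None:
            return numpy.zeros(n_states)
        return log_emit[:, symbol]

    alpha = log_start + emission(obs[0])
    for symbol in obs[1:]:
        # Sum over previous states, then add the (possibly skipped) emission.
        alpha = numpy.logaddexp.reduce(alpha[:, None] + log_trans, axis=0)
        alpha = alpha + emission(symbol)
    return numpy.logaddexp.reduce(alpha)

log_start = numpy.log([0.6, 0.4])
log_trans = numpy.log([[0.7, 0.3], [0.2, 0.8]])   # rows: from-state
log_emit = numpy.log([[0.9, 0.1], [0.3, 0.7]])    # rows: state, cols: symbol

with_missing = forward_log_probability([0, None], log_start, log_trans, log_emit)
summed = numpy.logaddexp(
    forward_log_probability([0, 0], log_start, log_trans, log_emit),
    forward_log_probability([0, 1], log_start, log_trans, log_emit),
)
```

Treating the missing emission as 1 gives the same result as summing the sequence probability over every symbol that could have appeared at that step, which is why ``with_missing`` and ``summed`` agree.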

See the tutorial for more examples of missing value support in pomegranate!


FAQ
---

Q. How do I indicate that a value is missing in a data set?

A. If it is a numeric data set, indicate that a value is missing using `numpy.nan`. If it is strings (such as 'A', 'B', etc...) use the string 'nan'. If your strings are stored in a numpy array, make sure that the full string 'nan' is present. numpy arrays have a tendency to truncate longer strings if they're defined over shorter strings (like an array containing 'A' and 'B' might truncate 'nan' to be 'n').
A. If it is a numeric data set, indicate that a value is missing using ``numpy.nan``. If it is strings (such as 'A', 'B', etc...) use the string ``'nan'``. If your strings are stored in a numpy array, make sure that the full string 'nan' is present. numpy arrays have a tendency to truncate longer strings if they're defined over shorter strings (like an array containing 'A' and 'B' might truncate 'nan' to be 'n').
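The truncation is easy to reproduce with numpy alone. The short sketch below shows the pitfall and one way around it, assuming the array is first created from single-character strings and the missing marker is assigned afterwards; the variable names are illustrative only.

```python
import numpy

# An array built from one-character strings gets a one-character dtype.
X = numpy.array(['A', 'B', 'A'])

# Assigning a longer string is silently truncated to fit that dtype.
X[1] = 'nan'
truncated = X[1]

# Using an object dtype (or a wide enough string dtype) keeps the marker intact.
Y = numpy.array(['A', 'B', 'A'], dtype=object)
Y[1] = 'nan'
preserved = Y[1]
```

Checking ``X.dtype`` before assigning the marker is a quick way to catch this.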


Q. Are all algorithms supported?
38 changes: 10 additions & 28 deletions tutorials/Tutorial_8_Semisupervised_Learning.ipynb
@@ -18,9 +18,7 @@
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -82,9 +80,7 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"data": {
@@ -120,9 +116,7 @@
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -151,9 +145,7 @@
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"data": {
@@ -213,9 +205,7 @@
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -257,9 +247,7 @@
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"data": {
@@ -311,9 +299,7 @@
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -349,9 +335,7 @@
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"data": {
@@ -400,9 +384,7 @@
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -477,7 +459,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
"version": "2.7.14"
}
},
"nbformat": 4,