Merge pull request #70 from jmschrei/mkdocs2
ENH website
jmschrei committed Feb 2, 2016
2 parents ee2192d + be2d925 commit 03725f4
Showing 4 changed files with 87 additions and 60 deletions.
2 changes: 2 additions & 0 deletions docs/bayesnet.md
@@ -2,6 +2,8 @@

[Bayesian networks](http://en.wikipedia.org/wiki/Bayesian_network) are a powerful inference tool in which nodes represent some random variable we care about, edges represent dependencies, and the lack of an edge between two nodes represents conditional independence. A powerful algorithm called the sum-product or forward-backward algorithm allows inference to be done on this network, calculating posteriors on unobserved ("hidden") variables when limited information is given. The more information that is given, the better the inference will be, but there is no requirement on the number of nodes which must be observed. If no information is given, the marginal of the graph is trivially calculated. The hidden and observed variables do not need to be explicitly defined when the network is set up; they simply exist based on what information is given.

An IPython notebook tutorial with visualizations can be [found here](https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_4_Bayesian_Networks.ipynb).

Let's test out the Bayesian network framework on the [Monty Hall problem](http://en.wikipedia.org/wiki/Monty_Hall_problem). The Monty Hall problem arose from the game show <i>Let's Make a Deal</i>, where a guest had to choose which one of three doors had a prize behind it. The twist was that after the guest chose, the host, originally Monty Hall, would open one of the doors the guest did not pick and ask if the guest wanted to switch doors. Initial inspection may lead you to believe that with only two doors left there is a 50-50 chance of having picked the right one, and so there is no advantage to switching. However, it has been proven both through simulation and analytically that there is in fact a 2/3 chance (about 66%) of winning the prize if the guest switches, regardless of the door they initially went with.

We can reproduce this result using Bayesian networks with three nodes, one for the guest, one for the prize, and one for the door Monty chooses to open. The door the guest initially chooses and the door the prize is behind are completely random processes across the three doors, but the door which Monty opens is dependent on both the door the guest chooses (it cannot be the door the guest chooses), and the door the prize is behind (it cannot be the door with the prize behind it).
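
A hedged sketch of how such a network could be assembled in pomegranate follows the style of the linked tutorial; it is an illustration rather than the canonical example, and method names such as `add_transition` (versus `add_edge` in newer releases) have varied between versions.

```python
from pomegranate import *

# The guest's pick and the prize's location are each uniform over the doors.
guest = DiscreteDistribution({'A': 1. / 3, 'B': 1. / 3, 'C': 1. / 3})
prize = DiscreteDistribution({'A': 1. / 3, 'B': 1. / 3, 'C': 1. / 3})

# Monty never opens the guest's door or the prize door; if the guest happened
# to pick the prize door, he chooses between the two remaining doors at random.
rows = []
for g in 'ABC':
    for p in 'ABC':
        for m in 'ABC':
            if m == g or m == p:
                prob = 0.0
            elif g == p:
                prob = 0.5
            else:
                prob = 1.0
            rows.append([g, p, m, prob])
monty = ConditionalProbabilityTable(rows, [guest, prize])

s1 = State(guest, name="guest")
s2 = State(prize, name="prize")
s3 = State(monty, name="monty")

network = BayesianNetwork("Monty Hall Problem")
network.add_states(s1, s2, s3)
network.add_transition(s1, s3)  # Monty's choice depends on the guest's door...
network.add_transition(s2, s3)  # ...and on the prize's door.
network.bake()
```

Conditioning on the guest's and Monty's doors and reading off the posterior over the prize door (covered in the tutorial linked above) then reproduces the 2/3 advantage for switching.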
2 changes: 2 additions & 0 deletions docs/hmm.md
@@ -2,6 +2,8 @@

[Hidden Markov models](http://en.wikipedia.org/wiki/Hidden_Markov_model) are a form of structured learning in which each observation in a sequence is labelled according to the hidden state it belongs to. HMMs can be thought of as non-greedy FSMs, in that the assignment of tags is done in a globally optimal way rather than by simply choosing the best tag at each step. HMMs have been used extensively in speech recognition and bioinformatics, where speech is a sequence of phonemes and DNA is a sequence of nucleotides.

An IPython notebook tutorial with visualizations can be [found here](https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_3_Hidden_Markov_Models.ipynb).
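
As a quick, hedged sketch of the basic API before getting into sequence alignment, here is a toy two-state nucleotide model invented for illustration; the state names, distributions, and transition values are arbitrary, and the exact return format of `viterbi` may vary by version.

```python
from pomegranate import *

# Two made-up hidden states: background sequence and a GC-rich region.
background = State(DiscreteDistribution({'A': 0.30, 'C': 0.20, 'G': 0.20, 'T': 0.30}),
                   name="background")
gc_rich = State(DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10}),
                name="gc_rich")

model = HiddenMarkovModel("toy-gc-detector")
model.add_states(background, gc_rich)
model.add_transition(model.start, background, 0.5)
model.add_transition(model.start, gc_rich, 0.5)
model.add_transition(background, background, 0.9)
model.add_transition(background, gc_rich, 0.1)
model.add_transition(gc_rich, gc_rich, 0.9)
model.add_transition(gc_rich, background, 0.1)
model.bake()

sequence = list('ACGCGCGCTATATA')
print(model.log_probability(sequence))   # log likelihood of the whole sequence

# Globally optimal labelling of each position with its most likely hidden state.
logp, path = model.viterbi(sequence)
print([state.name for _, state in path])
```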

A full tutorial on sequence alignment in bioinformatics can be found [here](http://nbviewer.ipython.org/github/jmschrei/yahmm/blob/master/examples/Global%20Sequence%20Alignment.ipynb). The gist is that you have a graphical structure as follows:

![alt text](http://www.cs.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec06/img106.gif "Three Character Profile HMM")
104 changes: 83 additions & 21 deletions docs/probability.md
@@ -3,55 +3,117 @@ Probability Distributions

The probability distribution is one of the simplest probabilistic models used. While these are frequently used as parts of more complex models such as General Mixture Models or Hidden Markov Models, they can also be used by themselves. Many simple analyses require just calculating the probability of samples under a distribution, or fitting a distribution to data and seeing what the distribution parameters are. pomegranate has a large library of probability distributions, kernel densities, and the ability to combine these to form multivariate or mixture distributions.

An IPython notebook tutorial with visualizations can be [found here](https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_1_Distributions.ipynb).

Here is a full list of currently implemented distributions:

```
UniformDistribution
NormalDistribution
LogNormalDistribution
ExponentialDistribution
BetaDistribution
GammaDistribution
DiscreteDistribution
LambdaDistribution
GaussianKernelDensity
UniformKernelDensity
TriangleKernelDensity
IndependentComponentsDistribution
MultivariateGaussianDistribution
ConditionalProbabilityTable
JointProbabilityTable
```

All distribution objects have the same methods:

```
copy() : Make a deep copy of the distribution
freeze() : Prevent the distribution from updating on training calls
thaw() : Reallow the distribution to update on training calls
log_probability( symbol ) : Return the log probability of the symbol under the distribution
sample() : Return a randomly generated sample from the distribution
fit / train / from_sample( items, weights=None, inertia=None ) : Update the parameters of the distribution using weighted MLE estimates of the data, with inertia as a regularizer
summarize( items, weights=None ) : Store sufficient statistics of a dataset for a future update
from_summaries( inertia=0.0 ) : Update the parameters of the distribution from the stored sufficient statistics
to_json() : Return a JSON formatted string representing the distribution
from_json( s ) : Build an appropriate distribution object from the string
```
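
As a brief, hedged illustration of a few of these methods (the receiver for `from_json` is assumed here to be the `Distribution` base class and may differ by version):

```python
from pomegranate import *

n = NormalDistribution(5, 2)

m = n.copy()        # deep copy with the same parameters
m.freeze()          # 'm' now ignores calls that would update its parameters
m.thaw()            # ...and can be made trainable again

s = n.to_json()     # serialize the distribution to a JSON string
# Assumption: from_json can be called on the Distribution base class and
# rebuilds the appropriate concrete distribution from the string.
n2 = Distribution.from_json(s)
```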

## Initialization

A widely used model is the Normal distribution. We can easily create one, specifying the parameters if we know them.

```python
from pomegranate import *

a = NormalDistribution(5, 2)
print a.log_probability(8)
```

This will return -2.737, which is the log probability of 8 under that Normal Distribution.
If we don't know the parameters of the normal distribution beforehand, we can learn them from data using the class method `from_samples`.

```python
b = NormalDistribution.from_samples([3, 4, 5, 6, 7], weights=[0.5, 1, 1.5, 1, 0.5])
```

We can initialize kernel densities by passing in a list of points and their respective weights (equal weighting if no weights are explicitly passed in) like the following:

```python
c = TriangleKernelDensity([1,5,2,3,4], weights=[4,2,6,3,1])
```

Next, we can try to make a mixture of distributions. We can make a mixture of arbitrary distributions with arbitrary weights. Usually people will make mixtures of the same type of distributions, such as a mixture of Gaussians, but this is more general than that. You do this by just passing in a list of initialized distribution objects and their associated weights. For example, we could create a mixture of a normal distribution and an exponential distribution like below. This probably doesn't make sense, but you can still do it.

```python
d = MixtureDistribution([NormalDistribution(2, 4), ExponentialDistribution(8)], weights=[1, 0.01])
```

## Prediction

The only prediction step which a distribution has is calculating the log probability of a point under the parameters of the distribution. This is done using the `log_probability` method.

```python
a.log_probability(8) # This will return -2.737
b.log_probability(8) # 'b' was fit to the weighted data (mean 5, std ~1.15), so this should return roughly -4.44, not -2.737
```

Since all types of distributions use the same log probability method, we can do the same for the triangle kernel density.

```python
print c.log_probability(8)
```
This will return -inf because the triangle kernel density has no support at 8; none of the points used to initialize it are near that value.

We can then evaluate the mixture distribution:

```python
print d.log_probability(8)
```

This should return -3.44.

## Fitting

We can also update these distributions using maximum likelihood estimates computed from new data. Kernel densities will discard their previous points and store the new points, while a `MixtureDistribution` will perform expectation-maximization to update its component distributions and weights.

```python
c.from_sample([1, 5, 7, 3, 2, 4, 3, 5, 7, 8, 2, 4, 6, 7, 2, 4, 5, 1, 3, 2, 1])
print c
d.from_sample([1, 5, 7, 3, 2, 4, 3, 5, 7, 8, 2, 4, 6, 7, 2, 4, 5, 1, 3, 2, 1])
print d
```

This should result in `MixtureDistribution( [NormalDistribution(3.916, 2.132), ExponentialDistribution(0.99955)], [0.9961, 0.00386] )`.

In addition to training on a batch of data which can be held in memory, all distributions can be trained out-of-core (online) using summary statistics and still get exact updates. This is done using the `summarize` method on a minibatch of the data, and then `from_summaries` when you want to update the parameters of the distribution.

```python
d = MixtureDistribution([NormalDistribution(2, 4), ExponentialDistribution(8)], weights=[1, 0.01])
d.summarize([1, 5, 7, 3, 2, 4, 3])
d.summarize([5, 7, 8])
d.summarize([2, 4, 6, 7, 2, 4, 5, 1, 3, 2, 1])
d.from_summaries()
```

Splitting the data into batches will still give an exact answer, but allows for out-of-core training of distributions on massive amounts of data.

In addition, training can be done on weighted samples by passing an array of weights in along with the data to any of the training functions, such as `d.summarize([5,7,8], weights=[1,2,3])`. Training can also be done with inertia, where the updated parameters are a blend of the old values and the newly estimated values; for example, `d.from_sample([5,7,8], inertia=0.5)` gives a 50-50 split between old and new values.
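
For example, a short sketch combining both options on the mixture defined earlier (the specific values are arbitrary):

```python
from pomegranate import *

d = MixtureDistribution([NormalDistribution(2, 4), ExponentialDistribution(8)], weights=[1, 0.01])

# Weighted update: the sample 8 counts three times as much as the sample 5.
d.from_sample([5, 7, 8], weights=[1, 2, 3])

# Inertia of 0.5: the updated parameters are an even blend of the previous
# parameters and the parameters estimated from this batch alone.
d.from_sample([5, 7, 8], inertia=0.5)
```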
39 changes: 0 additions & 39 deletions docs/probability.md~

This file was deleted.
