Commit
Updated readme and BBC example
joewandy committed Oct 4, 2017
1 parent ff9bc5e commit 8bdc2ca
Showing 4 changed files with 439 additions and 405 deletions.
22 changes: 13 additions & 9 deletions README.md
@@ -1,14 +1,18 @@
 # hlda
-Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model. This is based on the hLDA implementation from Mallet, having a fixed depth on the nCRP tree.
+Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model. This is based on the hLDA implementation from [Mallet](http://mallet.cs.umass.edu/topics.php), having a fixed depth on the nCRP tree.
 
-Files
-------
+Hierarchical Latent Dirichlet Allocation
+----------------------------------------
 
-- hlda/sampler.py is the Gibbs sampler.
-- Example notebook to test with synthetic data and the BBC Insight corpus can be found in the notebooks folder.
+Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non-parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and readily accommodates growing data collections. The hLDA model combines this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation.
 
 References
 -----------
-[Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf)
-
-[The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies](http://cocosci.berkeley.edu/tom/papers/ncrp.pdf)
+- [The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies](http://cocosci.berkeley.edu/tom/papers/ncrp.pdf)
+- [Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf)
+
+Implementation
+--------------
+
+- [hlda/sampler.py](hlda/sampler.py) is the Gibbs sampler.
+- An example notebook that infers the hierarchical topics on the BBC Insight corpus can be found in [notebooks/bbc_test.ipynb](notebooks/bbc_test.ipynb).
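The nested Chinese restaurant process that the updated README describes can be illustrated with a short, self-contained sketch. This is an illustration only, not code from this repository: `crp_choose`, `ncrp_sample_path`, and the `gamma` concentration parameter name are assumptions for the sake of the example. At each level of the fixed-depth tree, a document either follows an existing child with probability proportional to how many documents already chose it, or opens a new child with probability proportional to `gamma`:

```python
import random

def crp_choose(counts, gamma):
    """Chinese restaurant process draw: pick existing table i with
    probability counts[i] / (sum(counts) + gamma), or a new table
    with probability gamma / (sum(counts) + gamma)."""
    r = random.uniform(0, sum(counts) + gamma)
    for i, c in enumerate(counts):
        r -= c
        if r < 0:
            return i
    return len(counts)  # open a new table

def ncrp_sample_path(tree, depth, gamma):
    """Sample one root-to-leaf path of fixed `depth` through an nCRP
    tree. `tree` maps a path prefix (tuple of child indices) to the
    customer counts of its children, and is updated in place."""
    path = ()
    for _ in range(depth):
        counts = tree.setdefault(path, [])
        k = crp_choose(counts, gamma)
        if k == len(counts):   # a brand-new branch was opened
            counts.append(0)
        counts[k] += 1
        path += (k,)
    return path

tree = {}
first = ncrp_sample_path(tree, 3, 1.0)  # the first document always
print(first)                            # opens a fresh path: (0, 0, 0)
```

Subsequent draws reuse popular branches more often as their counts grow; this "rich get richer" dynamic is what lets the tree's branching factor expand with the data while the depth stays fixed, as in this sampler.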
4 changes: 2 additions & 2 deletions hlda/sampler.py
@@ -186,7 +186,7 @@ def __init__(self, corpus, vocab,
 
     def estimate(self, num_samples, display_topics=50, n_words=5, with_weights=True):
 
-        print 'HierarchicalLDA sampling'
+        print 'HierarchicalLDA sampling\n'
         for s in range(num_samples):
 
             sys.stdout.write('.')
@@ -382,7 +382,7 @@ def print_nodes(self, n_words, with_weights):
 
     def print_node(self, node, indent, n_words, with_weights):
         out = ' ' * indent
-        out += 'topic %d (level=%d, total_words=%d, documents=%d): ' % (node.node_id, node.level, node.total_words, node.customers)
+        out += 'topic=%d level=%d (documents=%d): ' % (node.node_id, node.level, node.customers)
         out += node.get_top_words(n_words, with_weights)
         print out
         for child in node.children:
543 changes: 278 additions & 265 deletions notebooks/bbc_test.ipynb

