Commit
Updated readme and BBC example
joewandy committed Oct 4, 2017
1 parent ff9bc5e commit 8bdc2ca
Showing 4 changed files with 439 additions and 405 deletions.
22 changes: 13 additions & 9 deletions README.md
@@ -1,14 +1,18 @@
 # hlda
-Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model. This is based on the hLDA implementation from Mallet, having a fixed depth on the nCRP tree.
+Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model. This is based on the hLDA implementation from [Mallet](http://mallet.cs.umass.edu/topics.php), having a fixed depth on the nCRP tree.
 
-Files
-------
+Hierarchical Latent Dirichlet Allocation
+----------------------------------------
 
-- hlda/sampler.py is the Gibbs sampler.
-- Example notebook to test with synthetic data and the BBC Insight corpus can be found in the notebooks folder.
+Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non-parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and readily accommodates growing data collections. The hLDA model combines this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation.
 
 References
 -----------
-[Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf)
-
-[The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies](http://cocosci.berkeley.edu/tom/papers/ncrp.pdf)
+- [The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies](http://cocosci.berkeley.edu/tom/papers/ncrp.pdf)
+- [Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf)
+
+Implementation
+--------------
+
+- [hlda/sampler.py](hlda/sampler.py) is the Gibbs sampler.
+- An example notebook that infers the hierarchical topics on the BBC Insight corpus can be found in [notebooks/bbc_test.ipynb](notebooks/bbc_test.ipynb).
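The nested Chinese restaurant process that the updated README describes can be illustrated with a short, self-contained sketch. This is an illustration only, not code from this repository: `crp_choose`, `ncrp_sample_path`, and the `gamma` concentration parameter name are assumptions for the sake of the example. At each level of the fixed-depth tree, a document either follows an existing child with probability proportional to how many documents already chose it, or opens a new child with probability proportional to `gamma`:

```python
import random

def crp_choose(counts, gamma):
    """Chinese restaurant process draw: pick existing table i with
    probability counts[i] / (sum(counts) + gamma), or a new table
    with probability gamma / (sum(counts) + gamma)."""
    r = random.uniform(0, sum(counts) + gamma)
    for i, c in enumerate(counts):
        r -= c
        if r < 0:
            return i
    return len(counts)  # open a new table

def ncrp_sample_path(tree, depth, gamma):
    """Sample one root-to-leaf path of fixed `depth` through an nCRP
    tree. `tree` maps a path prefix (tuple of child indices) to the
    customer counts of its children, and is updated in place."""
    path = ()
    for _ in range(depth):
        counts = tree.setdefault(path, [])
        k = crp_choose(counts, gamma)
        if k == len(counts):   # a brand-new branch was opened
            counts.append(0)
        counts[k] += 1
        path += (k,)
    return path

tree = {}
first = ncrp_sample_path(tree, 3, 1.0)  # the first document always
print(first)                            # opens a fresh path: (0, 0, 0)
```

Subsequent draws reuse popular branches more often as their counts grow; this "rich get richer" dynamic is what lets the tree's branching factor expand with the data while the depth stays fixed, as in this sampler.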
4 changes: 2 additions & 2 deletions hlda/sampler.py
@@ -186,7 +186,7 @@ def __init__(self, corpus, vocab,
 
     def estimate(self, num_samples, display_topics=50, n_words=5, with_weights=True):
 
-        print 'HierarchicalLDA sampling'
+        print 'HierarchicalLDA sampling\n'
         for s in range(num_samples):
 
             sys.stdout.write('.')
@@ -382,7 +382,7 @@ def print_nodes(self, n_words, with_weights):
 
     def print_node(self, node, indent, n_words, with_weights):
         out = ' ' * indent
-        out += 'topic %d (level=%d, total_words=%d, documents=%d): ' % (node.node_id, node.level, node.total_words, node.customers)
+        out += 'topic=%d level=%d (documents=%d): ' % (node.node_id, node.level, node.customers)
         out += node.get_top_words(n_words, with_weights)
         print out
         for child in node.children:
543 changes: 278 additions & 265 deletions notebooks/bbc_test.ipynb

