Skip to content

Implementations of various fast parallelized samplers for LDA, including Partially Collapsed LDA, Light LDA, Partially Collapsed Light LDA and a very efficient Polya-Urn LDA

Notifications You must be signed in to change notification settings

lejon/PartiallyCollapsedLDA

Repository files navigation

Build/Test

YourKit

Features

  • Supports priors on words (anchor words) which allows you to anchor words to a topic.
  • supports gracefully stopping the sampler in mid sampling, and later continuing to sample from the aborted position.
  • Supports logging the topic indicators in every iteration
  • Several implemented samplers including a very fast approximate Polya Urn sampler, an exact sparse sampler (Spalias) and an HDP sampler
  • Supports stopword lists, basic text cleaning (lowercasing, removing numbers)
  • Word occurrence threshods (frequency and TF-IDF)
  • Configuration files with sub-configuratons
  • Suppports exporting stats to import in LDAVis
  • Supports setting initial seed to garantee same starting state
  • Can measure timing of basically all steps and export as tabular list
  • Supports thinning
  • Supports relevance words
  • Supports topic diagnostics

PC-LDA

Repo for our Partially Collapsed Parallel LDA implementations described in the articles:

Måns Magnusson, Leif Jonsson, Mattias Villani, and David Broman. (2017). Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models. Journal of Computational and Graphical Statistics.

@article{magnusson2017sparse,
  title={Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models},
  author={Magnusson, M{\aa}ns and Jonsson, Leif and Villani, Mattias and Broman, David},
  journal={Journal of Computational and Graphical Statistics},
  year={2017},
  publisher={Taylor \& Francis}
}

Alexander Terenin, Måns Magnusson, Leif Jonsson, and David Draper. “Polya Urn Latent Dirichlet Allocation: a doubly sparse massively parallel sampler”. In IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017.

@inproceedings{jonsson:2018,
	author={{Terenin}, Alexander and {Magnusson}, M{\aa}ns and {Jonsson}, 
	Leif and {Draper}, David},
	title={Polya Urn Latent Dirichlet Allocation: a doubly sparse massively 
	parallel sampler},
	booktitle={Accepted for publication in IEEE Transactions on Pattern Analysis and 
	Machine Intelligence}
}

The toolkit is Open Source Software, and is released under the Common Public License. You are welcome to use the code under the terms of the license for research or commercial purposes, however please acknowledge its use with a citation: Terenin, Magnusson, Jonsson. "Polya Urn Latent Dirichlet Allocation: a doubly sparse massively parallel sampler"

The dataset (in the datasets folder) and the stopwords file (stopwords.txt, included in the repository) should be in the same folder as you run the sampler.

Example Run command:

java -cp PCPLDA-X.X.X.jar cc.mallet.topics.tui.ParallelLDA --run_cfg=src/main/resources/configuration/PLDAConfig.cfg
java -jar PCPLDA-X.X.X.jar --run_cfg=src/main/resources/configuration/PLDAConfig.cfg

(PCPLDA-X.X.X.jar is created in the 'target' folder by the 'mvn package' command)

For very large datasets you might need to add the -Xmx60g flag to Java

Please remember that this is a research prototype and the standard disclaimers apply. You will see printouts during unit tests, commented out code, old stuff not cleaned out yet etc.

But the basic sampler is tested and evaluated in a scientific manner and we have gone to great pains to ensure that it is correct. The sampler that is referred to in the article as "PC sampler" or "PC-LDA" corresponds to the class 'cc.mallet.topics.SpaliasUncollapsedParallelLDA' in the code for the sparse parallel and 'cc.mallet.topics.PolyaUrnSpaliasLDA' for the Polya Urn version. The variable selection parts are implemented in the 'cc.mallet.topics.NZVSSpaliasUncollapsedParallelLDA' class.

An example of a "main" class is cc.mallet.topics.tui.ParallelLDA

Config File Documentation

Configuration file docs

Minimal config file example

Minimal Configuration file example

Installation

  1. Install Apache Maven
  2. Install the package using maven as follows:

mvn package in bash.

Occasionally some of the "probabilistic" tests fail due to random chance. This is ok in a statistical sense but not for a test suite so this should eventually be tuned. For now if the suite is re-run it should be ok.

To install without running tests use

mvn package -DskipTests in bash.

Example run using binary (the release JAR)

java -cp PCPLDA-8.0.11.jar cc.mallet.topics.tui.ParallelLDA --run_cfg=src/main/resources/configuration/PLDAConfig.cfg

Public DOI

DOI

R-Wrapper

There is an R-wrapper for this sampler at pcldar

Acknowledgements

I'm a very satisfied user of the YourKit profiler. A Great product with great support. It has been sucessfully used for profiling in this project.

YourKit

YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.

About

Implementations of various fast parallelized samplers for LDA, including Partially Collapsed LDA, Light LDA, Partially Collapsed Light LDA and a very efficient Polya-Urn LDA

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages