University of Chicago Master's Thesis.
Crowdsourced Statistics Advice: Topic Modeling stats.stackexchange.com
Topic models are hierarchical Bayesian models used to discover latent semantic structure in collections of documents, reducing a corpus of millions of words to a few dozen interpretable topics. This paper presents three closely related methods: latent Dirichlet allocation, correlated topic models, and structural topic models. I discuss the estimation challenges associated with topic modeling and compare the three methods by analyzing a collection of 182,308 posts contributed by the general public to the statistics and machine learning community website stats.stackexchange.com.
Data are taken from the December 15, 2016 Stack Exchange Data Dump and are licensed under Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0).
01-exploration-and-parsing.nb.html
02-vocabulary-experimentation.nb.html
03-datafile-construction.nb.html
07b-lda-kfold-parallel.nb.html