September 23, 2021
Geoff Ford
https://polsci.github.io/
https://github.com/polsci/
See also: Colab + Gensim + Mallet
This repository is designed for students in DIGI405 at the University of Canterbury to do topic modeling through their browser using Binder. It may also be useful to others who want to do topic modeling through a browser with their own corpus.
Note: The notebook has been updated to enforce Gensim v3.8 (the last version to support running topic models via Mallet) and to specify requirements using an apt.txt file and a requirements.txt file instead of the former Dockerfile.
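If you adapt the environment, a quick sanity check on the installed Gensim version can save a confusing import error later. The following is a minimal sketch, not part of the sample notebook: the helper name supports_mallet_wrapper is hypothetical, and it assumes only that the Mallet wrapper is unavailable from Gensim 4.0 onward.

```python
def supports_mallet_wrapper(gensim_version):
    """Return True if the given Gensim version string is from before 4.0.

    Hypothetical helper: it only inspects the major version number and does
    not check that Mallet itself is installed.
    """
    major = int(gensim_version.split(".")[0])
    return major < 4


# In a notebook cell you could call it like this (requires Gensim):
#   import gensim
#   print(supports_mallet_wrapper(gensim.__version__))
```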
Download your notebook regularly, as Binder sessions time out after 10 minutes of inactivity! Read the Binder FAQ.
- Launch Binder (see link below).
- Using the file browser, upload your corpus zip file (from the Datasets page on Learn).
- Run the topic-modeling-with-gensim-mallet.ipynb notebook.
- Run the first code cell in the notebook to unzip the corpus.
- Use the notebook to create your topic model.
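The unzip step in the list above can be sketched with Python's standard library. This is an illustration only and may differ from what the notebook's first cell actually does; the function name unzip_corpus and the file name corpus.zip are assumptions.

```python
import zipfile
from pathlib import Path


def unzip_corpus(zip_path, dest_dir="."):
    """Extract a corpus zip and return the paths of the .txt files it contained."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Only report files that came from the archive, so unrelated .txt files
    # already in the destination (e.g. requirements.txt) are not counted.
    return sorted(dest / name for name in names if name.lower().endswith(".txt"))


# Example call, assuming an uploaded file named corpus.zip (hypothetical name):
#   documents = unzip_corpus("corpus.zip")
#   print(len(documents), "text files extracted")
```

Printing the number of extracted files is a cheap way to confirm that the upload finished before you start training a model.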
Before launching Binder, please read the Binder FAQ, as it contains information on Binder, its resource limits, and especially how long your session will last!
If you are not from this course, you can still upload your own corpus as a zip. Your corpus should consist of a single directory of txt files (one document per txt file). Be aware, though, that resources are limited (e.g. maximum RAM limits). It has been tested with a corpus of under 40 MB across 3600+ txt files. It is unlikely to work with very large corpora due to the memory requirements. This isn't the fastest way to run topic models, but it allows you to create a topic model through your browser without installing any software.
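If you are preparing your own corpus, you may want to check its size and package it on your own machine before uploading. The sketch below is one way to do that. The function names are hypothetical, it treats 1 MB as 1,000,000 bytes, and the 40 MB figure above is only the size the repository was tested with, not a hard limit.

```python
import zipfile
from pathlib import Path


def summarize_corpus(corpus_dir):
    """Return (number of .txt files, total size in MB) for a corpus directory."""
    files = sorted(Path(corpus_dir).glob("*.txt"))
    total_bytes = sum(f.stat().st_size for f in files)
    return len(files), total_bytes / 1_000_000  # assumes 1 MB = 1,000,000 bytes


def zip_corpus(corpus_dir, zip_path):
    """Zip the .txt files so the archive holds a single top-level directory."""
    corpus_dir = Path(corpus_dir).resolve()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for f in sorted(corpus_dir.glob("*.txt")):
            # Store each file as <directory name>/<file name> inside the zip.
            archive.write(f, arcname=f"{corpus_dir.name}/{f.name}")
    return Path(zip_path)


# Example, assuming a local directory named mycorpus (hypothetical name):
#   n_files, size_mb = summarize_corpus("mycorpus")
#   print(f"{n_files} files, {size_mb:.1f} MB")
#   zip_corpus("mycorpus", "mycorpus.zip")
```

Only files ending in .txt directly inside the directory are counted and zipped, which matches the one-document-per-file layout described above.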
The environment should support pyLDAvis, but it is not used in the sample notebook. To run it, add a cell like this (note: this is sloooowwww and not recommended!):
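# Assumes gensimmodel30, doc_term_matrix and dictionary are defined in earlier notebook cells.
import pyLDAvis
import pyLDAvis.gensim as gensimvis  # newer pyLDAvis releases name this module pyLDAvis.gensim_models

# Build the visualization data, then display it inline in the notebook.
vis_data30 = gensimvis.prepare(gensimmodel30, doc_term_matrix, dictionary)
pyLDAvis.display(vis_data30)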