
Gibbs Sampling Dirichlet Multinomial Model (GSDMM) for Short-Text Clustering


pokarats/gsdmm

Gibbs Sampling Dirichlet Multinomial Mixture Model (GSDMM) in Short-Text Clustering

Computational Linguistics Final Project, Winter Semester 2019, University of Saarland
GSDMM Implementation as described in:
Yin, J. and Wang, J. A Dirichlet multinomial mixture model-based approach for short text clustering. In SIGKDD, 2014.
The experiments try different values of the beta parameter on the Stack Overflow Titles dataset, made available by Kaggle.com, from the paper by:
Xu, J., et al., 2015. Short Text Clustering via Convolutional Neural Networks, NAACL.
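
For orientation, the collapsed Gibbs sampler from Yin and Wang (2014) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in gsdmm.py: the function name `gsdmm`, its defaults, and the toy documents are assumptions made for the example. Each document is removed from its cluster and reassigned with probability proportional to (m_z + alpha) / (D - 1 + K*alpha) times the word-match term that beta smooths.

```python
import math
import random
from collections import Counter


def gsdmm(docs, K=8, alpha=0.1, beta=0.1, n_iters=10, seed=0):
    """Cluster tokenized short texts; returns one cluster index per document.

    Hypothetical helper for illustration; not the repository's gsdmm.py.
    """
    rng = random.Random(seed)
    D = len(docs)
    V = len({w for doc in docs for w in doc})   # vocabulary size
    m = [0] * K                                 # m[z]: documents in cluster z
    n = [0] * K                                 # n[z]: word tokens in cluster z
    nw = [Counter() for _ in range(K)]          # nw[z][w]: count of word w in z

    def move(doc, z, sign):
        """Add (sign=+1) or remove (sign=-1) a document from cluster z."""
        m[z] += sign
        n[z] += sign * len(doc)
        for w in doc:
            nw[z][w] += sign

    def log_score(doc, z):
        """Log of the unnormalized p(z_d = z | rest); doc is already removed."""
        s = math.log(m[z] + alpha) - math.log(D - 1 + K * alpha)
        for w, cnt in Counter(doc).items():
            for j in range(cnt):
                s += math.log(nw[z][w] + beta + j)
        for i in range(len(doc)):
            s -= math.log(n[z] + V * beta + i)
        return s

    # Random initial assignment of every document to one of K clusters.
    labels = [rng.randrange(K) for _ in docs]
    for doc, z in zip(docs, labels):
        move(doc, z, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            move(doc, labels[d], -1)
            logs = [log_score(doc, z) for z in range(K)]
            top = max(logs)                      # subtract max for stability
            weights = [math.exp(l - top) for l in logs]
            labels[d] = rng.choices(range(K), weights=weights)[0]
            move(doc, labels[d], +1)
    return labels


# Toy input (made up for the example): four tokenized titles.
docs = [
    ["python", "list", "sort"],
    ["python", "sort", "dictionary"],
    ["git", "merge", "branch"],
    ["git", "branch", "delete"],
]
labels = gsdmm(docs, K=4, beta=0.1)
print(len(set(labels)), "non-empty clusters out of K = 4")
```

K acts only as an upper bound: clusters that lose all their documents stay empty, so the number of non-empty clusters can shrink over the iterations.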

Project File Structure

  • GSDMM
    • README.md
    • gsdmm_noonPokaratsiriGoldstein.pdf project report
    • data: all corpus and label files are here
      • title_StackOverflow.txt
      • label_StackOverflow.txt
    • logs: logs of run_gsdmm.py execution
      • run_gsdmm_{run_id}.log
    • output: plots of gsdmm performance and representative words in clusters
      • cluster_per_iteration_at_different_beta.png
      • performance_at_different_beta.png
      • gsdmm_clusters_and_representative_words_{run_id}.out
    • pickled: pickle files from run_gsdmm.py
      • predicted_{run_id}_freq_words_by_beta.pickle
      • predicted_{run_id}_labels_by_beta.pickle
      • predicted_{run_id}_num_clusters_by_it_per_beta_list.pickle
      • true_most_frequent_words_by_topic.pickle
    • source_code: config file for default parameters and all source code files
      execute run_gsdmm.py from this directory
      • default_config.cfg default parameters to execute the program are defined here
      • eval.py this module calculates NMI, homogeneity, and completeness and plots graphs
      • gsdmm.py this module implements the GSDMM algorithm
      • preprocess.py this module tokenizes and preprocesses the corpus file
      • run_gsdmm.py this is the main program that runs the experiment
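
The three scores that eval.py reports can be written directly from their information-theoretic definitions. The sketch below is an illustration with the standard library only, not the contents of eval.py; the function names are assumptions. `sklearn.metrics` offers `normalized_mutual_info_score`, `homogeneity_score`, and `completeness_score` for the same quantities, and whether eval.py calls them is not stated here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_info(true, pred):
    """Mutual information between two labelings of the same items."""
    n = len(true)
    joint = Counter(zip(true, pred))
    ct, cp = Counter(true), Counter(pred)
    return sum(
        c / n * math.log(c * n / (ct[t] * cp[p]))
        for (t, p), c in joint.items()
    )


def clustering_scores(true, pred):
    """Homogeneity = I/H(true), completeness = I/H(pred),
    NMI = I / arithmetic mean of the two entropies."""
    h_true, h_pred = entropy(true), entropy(pred)
    mi = mutual_info(true, pred)
    mean_h = (h_true + h_pred) / 2
    return {
        "nmi": mi / mean_h if mean_h > 0 else 1.0,
        "homogeneity": mi / h_true if h_true > 0 else 1.0,
        "completeness": mi / h_pred if h_pred > 0 else 1.0,
    }


# Over-splitting two true topics into four clusters keeps every cluster pure
# (high homogeneity) but scatters each topic (lower completeness).
scores = clustering_scores([0, 0, 1, 1], [0, 1, 2, 3])
print(scores)
```

Reading the three numbers together is what makes the beta experiment interpretable: a run that produces many small clusters scores well on homogeneity and poorly on completeness, and NMI balances the two.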

Requirements

Python 3.7

numpy
scikit-learn (imported as sklearn)
matplotlib
nltk
tqdm

Instructions

  • cd to the source_code directory to execute the program
  • python run_gsdmm.py -h displays all the command-line options
  • command-line options override the options in the default_config.cfg file
  • python run_gsdmm.py runs the GSDMM experiments with the default values in the .cfg file
  • the last run_id was 3; change to a different run_id to execute the full program
  • the program outputs 2 plots (the plot titles are self-explanatory) and an output file showing the number of clusters predicted by GSDMM and the words in each cluster with their frequencies
  • running the program with an existing run_id simply loads the data from the pickled files and re-plots the 2 graphs
  • runtime: for K = 100 (starting with 100 clusters as an upper bound), the program takes approximately 1 hour per beta value in the experiment. For K = 50, each beta value takes approximately 30 minutes.
  • the default setting experiments with 5 beta values; therefore, the total runtime of the entire program is approximately 5-6 hours.
  • see the log file for runtime details; it includes time stamps from the last run
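
Two behaviours in the instructions above, command-line options overriding the .cfg defaults and an existing run_id reloading pickled results, follow a common pattern that can be sketched as below. This is an assumed illustration, not the code in run_gsdmm.py: the option names `--run_id` and `--K`, the config keys, and both function names are hypothetical (use python run_gsdmm.py -h for the real options). Only the pickle file name pattern is taken from the project file structure.

```python
import argparse
import configparser
import pickle
import tempfile
from pathlib import Path

# Hypothetical config content; the real default_config.cfg may differ.
EXAMPLE_CFG = "[DEFAULT]\nrun_id = 3\nK = 100\n"


def resolve_settings(cfg_text, argv):
    """Read defaults from config text, then let command-line flags override."""
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    defaults = cfg["DEFAULT"]
    parser = argparse.ArgumentParser()
    parser.add_argument("--run_id", type=int, default=defaults.getint("run_id"))
    parser.add_argument("--K", type=int, default=defaults.getint("K"))
    return parser.parse_args(argv)


def load_or_compute(run_id, compute, pickled_dir):
    """Reuse the pickle for this run_id if present; otherwise compute and save."""
    path = Path(pickled_dir) / f"predicted_{run_id}_labels_by_beta.pickle"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


# --K on the command line wins; run_id falls back to the config default.
settings = resolve_settings(EXAMPLE_CFG, ["--K", "50"])
print(settings.run_id, settings.K)

with tempfile.TemporaryDirectory() as tmp:
    # First call computes and pickles; the second is served from the pickle,
    # so its compute function is never invoked.
    first = load_or_compute(settings.run_id, lambda: {"0.1": [0, 1, 1]}, tmp)
    second = load_or_compute(settings.run_id, lambda: {}, tmp)
```

Keying the cache on run_id is what makes re-plotting cheap, and it is also why a new run_id is needed to trigger the multi-hour computation again.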
