MixedMembershipWordEmbeddings

Code implementing the algorithms in: J. R. Foulds. Mixed Membership Word Embeddings for Computational Social Science. Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Prerequisites

Python
TensorFlow
Java
MATLAB (optional)

The code should work under Windows and Linux. I haven't tested it on macOS, but the code is written in cross-platform languages, so I expect it to work there as well.

Data format

The input is a single file, with one line per document, and words represented by zero-based dictionary indices. See NIPS.txt under the data folder for an example. This file encodes the NIPS corpus, due to Sam Roweis. The dictionary for NIPS is also provided; it allows the final results to be interpreted, but it is not used by the algorithms.
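As a minimal sketch of producing a file in this format from tokenized text, the following Python builds a zero-based dictionary and emits one line of indices per document. It assumes that the indices on a line are separated by single spaces, which the description above does not state (check data/NIPS.txt before relying on it). The function names are hypothetical and are not part of this repository.

```python
def build_dictionary(documents):
    """Map each distinct word to a zero-based index, in order of first appearance."""
    dictionary = {}
    for doc in documents:
        for word in doc:
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary


def encode_documents(documents, dictionary):
    """Return one string per document, with each word replaced by its index.

    Assumption: indices are separated by single spaces.
    """
    return [" ".join(str(dictionary[word]) for word in doc) for doc in documents]


# Toy corpus of two tokenized documents.
docs = [["neural", "network", "learning"], ["network", "model"]]
dictionary = build_dictionary(docs)
lines = encode_documents(docs, dictionary)
```

Writing `"\n".join(lines)` to a text file would then give one document per line.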

Running the code

The first step is to train the MMSG topic model using annealed Metropolis-Hastings-Walker collapsed Gibbs sampling. This is implemented in Java. First, compile the Java code. From the root directory of the project:

cd java
javac edu/umbc/MMWordEmbeddings/*.java
cd ..

Then, run the code:

java -cp java edu.umbc.MMWordEmbeddings.MMSkipGramTopicModel_MHW_mixtureOfExperts filename numTopics numDocuments numWords numIterations contextSize alpha_k beta_w doAnnealing annealingFinalTemperature

The last two options default to true and 0.0001, respectively. For example, to run this on the NIPS corpus (1740 documents, 13649 words) with 2000 topics, 1000 iterations, a context size of 5, alpha_k = 0.01 and beta_w = 0.001, we can use:

java -cp java edu.umbc.MMWordEmbeddings.MMSkipGramTopicModel_MHW_mixtureOfExperts data/NIPS.txt 2000 1740 13649 1000 5 0.01 0.001 true
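The numDocuments and numWords arguments have to match the input file. The sketch below shows one way to derive them in Python. It rests on assumptions that are not confirmed by the text above: indices are whitespace-separated, blank lines are not counted as documents, and the vocabulary size can be taken as the largest index plus one (the real dictionary may be larger if its last words never occur in the corpus). The function name is hypothetical.

```python
def corpus_dimensions(lines):
    """Return (num_documents, num_words) for an iterable of index lines.

    Assumptions: indices are whitespace-separated, blank lines are skipped,
    and the vocabulary size is the largest index seen plus one.
    """
    num_documents = 0
    max_index = -1
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines (assumption)
        num_documents += 1
        max_index = max(max_index, max(int(t) for t in tokens))
    return num_documents, max_index + 1


# Toy input: three documents using indices 0..4.
sample = ["0 1 2", "1 3", "4 4 0"]
num_documents, num_words = corpus_dimensions(sample)
```

For a real corpus, an open file handle can be passed in place of the list, since iterating over a file yields its lines. The example command above uses 1740 and 13649 for NIPS, which is a useful value to compare against.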

Running this (it may take a while) produces three files:

MMskipGramTopicModel_topicAssignments.txt, in a format similar to the input data, but which contains topic assignments for each word
MMskipGramTopicModel_wordTopicCountsForTopics.txt, which contains the count matrix for the topics (words by topics). Add the smoothing hyperparameter and normalize the columns to sum to one to obtain the topics' probability distributions over words.
MMskipGramTopicModel_wordTopicCountsForWords.txt, which contains the count matrix for the words' distributions over topics (words by topics). Add the smoothing hyperparameter and normalize the rows to sum to one to obtain the words' probability distributions over topics.
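The two normalizations described above can be sketched in plain Python as follows. The matrices are assumed to be already loaded as lists of rows (words by topics); the on-disk layout of the count files is not described here, so loading is left out. Which hyperparameter to pass as `smoothing` for each matrix (presumably one of alpha_k and beta_w from the training command) is also not spelled out above, so it is left as a parameter. The function names are hypothetical.

```python
def normalize_columns(counts, smoothing):
    """Add smoothing to every entry, then scale each column to sum to one.

    For a words-by-topics matrix this gives each topic's distribution over words.
    """
    num_rows, num_cols = len(counts), len(counts[0])
    col_totals = [
        sum(counts[r][c] + smoothing for r in range(num_rows))
        for c in range(num_cols)
    ]
    return [
        [(counts[r][c] + smoothing) / col_totals[c] for c in range(num_cols)]
        for r in range(num_rows)
    ]


def normalize_rows(counts, smoothing):
    """Add smoothing to every entry, then scale each row to sum to one.

    For a words-by-topics matrix this gives each word's distribution over topics.
    """
    result = []
    for row in counts:
        total = sum(row) + smoothing * len(row)
        result.append([(x + smoothing) / total for x in row])
    return result


# Toy counts: 3 words by 2 topics.
counts = [[3, 0], [1, 2], [0, 2]]
topics_over_words = normalize_columns(counts, smoothing=0.5)
words_over_topics = normalize_rows(counts, smoothing=0.5)
```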

Finally, the embeddings are trained via NCE, implemented in Python using TensorFlow. Edit python/mixedMembershipSkipGramPreClusteredNCE.py to select the hyperparameters for the algorithm and its input files (i.e. the file encoding the documents that was used to train the MMSG topic model, and the corresponding MMskipGramTopicModel_topicAssignments.txt that the MMSG topic model produced). Then, run the Python script:

python python/mixedMembershipSkipGramPreClusteredNCE.py

This outputs four files:

MMembeddings.txt, the topic embeddings (topics by dimensions)
MMnce_biases.txt, the bias terms from inside the softmax (one per word, each on its own line)
MMnce_weights.txt, the NCE weight parameters, also known as the output embeddings (words by dimensions)
MMnormalizedEmbeddings.txt, the topic embeddings, normalized to unit length (topics by dimensions).
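To illustrate how the normalized embeddings relate to the raw ones, here is a small Python sketch that scales each topic vector to unit length and then compares topics by cosine similarity. It assumes "unit length" means unit Euclidean (L2) norm, and it works on in-memory lists rather than on the output files, whose exact text layout is not described above. The function names are hypothetical.

```python
import math


def normalize_to_unit_length(vectors):
    """Scale each vector to Euclidean norm 1 (all-zero vectors are left unchanged)."""
    result = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        result.append([x / norm for x in v] if norm > 0 else list(v))
    return result


def cosine_similarity_matrix(unit_vectors):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in unit_vectors]
        for u in unit_vectors
    ]


# Toy embeddings: 3 topics by 2 dimensions.
embeddings = [[3.0, 4.0], [0.0, 2.0], [4.0, -3.0]]
unit = normalize_to_unit_length(embeddings)
similarities = cosine_similarity_matrix(unit)
```

Working from unit-length vectors is convenient for nearest-neighbour queries between topics, because ranking by dot product and ranking by cosine similarity then coincide.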

Example scripts that run the above on the NIPS corpus are provided in NIPS_demo.sh (bash) and NIPS_demo.bat (Windows).

In the matlab folder, scripts are provided for recovering normalized topics and distributions over topics, and for reporting the top words in the topics. These methods could easily be re-implemented in another language if you do not have MATLAB. After running the Java and Python code, try running matlab/results_demo.m (after adding the matlab folder to MATLAB's path).
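As a rough idea of what such a re-implementation might look like, the Python sketch below reports the top words of a topic from a words-by-topics count matrix. It is not a port of the MATLAB scripts. It assumes the matrix and the dictionary are already in memory, with the dictionary given as a list whose position i holds the word with index i. The function name is hypothetical.

```python
def top_words(word_topic_counts, dictionary, topic, n=10):
    """Return the n words with the largest count in the given topic's column.

    Adding a constant smoothing value and normalizing a column does not change
    the order of its entries, so ranking by raw counts gives the same top words
    as ranking by the topic's probability distribution over words.
    """
    ranked = sorted(
        range(len(word_topic_counts)),
        key=lambda w: word_topic_counts[w][topic],
        reverse=True,
    )
    return [dictionary[w] for w in ranked[:n]]


# Toy data: 3 words by 2 topics.
counts = [[3, 0], [1, 2], [0, 5]]
dictionary = ["neural", "network", "model"]
first_topic = top_words(counts, dictionary, topic=0, n=2)
second_topic = top_words(counts, dictionary, topic=1, n=2)
```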

Author

James Foulds

License

Licensed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 .

Acknowledgments

The Python code for training the embeddings was based on tutorial word embedding code by the authors of TensorFlow.
