
Meta-CETM

This is the official implementation for the NeurIPS 2023 paper Context-guided Embedding Adaptation for Effective Topic Modeling in Low-Resource Regimes. We develop an approach that can discover meaningful topics from only a few documents; the core idea is to adaptively generate word embeddings semantically tailored to the given task by fully exploiting the contextual syntactic information.
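For intuition only, the snippet below is a minimal, generic sketch of this kind of embedding adaptation: pretrained word vectors are mixed with the vectors of words they co-occur with in the few given documents. The functions, shapes, and mixing weight are illustrative assumptions, not the repository's actual model.

```python
# Illustrative sketch only -- not the repository's actual model.
# Idea: adapt pretrained word embeddings to a small task by propagating
# information over a word co-occurrence graph built from the task's documents.
import torch

def cooccurrence_graph(bows: torch.Tensor) -> torch.Tensor:
    """bows: (num_docs, vocab_size) bag-of-words counts of the few given documents.
    Returns a row-normalized word-word co-occurrence matrix of shape (vocab, vocab)."""
    presence = (bows > 0).float()
    adj = presence.t() @ presence                 # words that share a document
    adj = adj + torch.eye(adj.size(0))            # add self-loops
    return adj / adj.sum(dim=1, keepdim=True)     # simple row normalization

def adapt_embeddings(pretrained: torch.Tensor, bows: torch.Tensor,
                     mix: float = 0.5) -> torch.Tensor:
    """Mix each word's pretrained vector (e.g. GloVe) with the vectors of the
    words it co-occurs with in this task; `mix` is a hypothetical weight."""
    adj = cooccurrence_graph(bows)
    return (1.0 - mix) * pretrained + mix * (adj @ pretrained)
```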

Get started

The following lists the statistics of the datasets we used.

| Dataset | Source link | N (#docs) | V (#words) | L (#labels) |
| --- | --- | --- | --- | --- |
| 20Newsgroups | 20NG | 11288 | 5968 | 20 |
| Yahoo! Answers | Yahoo | 27069 | 7507 | 10 |
| DBpedia | DB14 | 30183 | 6274 | 14 |
| Web of Science | WOS | 11921 | 4923 | 7 |

We curated the vocabulary for each dataset by removing words with very low and very high frequencies, as well as a list of commonly used stop words. After that, we filtered out documents that contained fewer than 50 vocabulary terms to yield the final available part of each original dataset. The pre-processed version of all four datasets can be downloaded from
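For reference, a minimal sketch of this preprocessing might look like the following. The frequency thresholds and helper names are illustrative assumptions; only the 50-term document filter comes from the description above.

```python
# Hedged sketch of the described preprocessing; thresholds are illustrative.
from collections import Counter

def build_vocab(tokenized_docs, stop_words, min_doc_freq=5, max_doc_ratio=0.7):
    """Keep words that are neither stop words, nor too rare, nor near-ubiquitous."""
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    n_docs = len(tokenized_docs)
    return {w for w, df in doc_freq.items()
            if w not in stop_words and df >= min_doc_freq and df / n_docs <= max_doc_ratio}

def filter_docs(tokenized_docs, vocab, min_terms=50):
    """Drop documents with fewer than 50 vocabulary terms after pruning the vocabulary."""
    kept = []
    for doc in tokenized_docs:
        in_vocab = [w for w in doc if w in vocab]
        if len(in_vocab) >= min_terms:   # counts tokens; distinct terms is another reading
            kept.append(in_vocab)
    return kept
```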

Episodic task construction

Since we adopt an episodic training strategy to train our model, we need to sample a batch of tasks from the original corpus to construct the training, validation, and test sets separately. To do this, unzip the downloaded pre-processed datasets, put the data folder under the root directory, and then execute the following commands.

cd utils
python process_to_task.py

Note that for different datasets, please modify the arguments dataset_name and data_path accordingly.
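For example, assuming the script exposes these as standard command-line flags (the flag names and the path below are illustrative, not taken from the repository), the Yahoo! Answers data might be processed with

python process_to_task.py --dataset_name yahoo --data_path ../data/yahoo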

Experiment: per-holdout-word perplexity (PPL)

To train a Meta-CETM with the best predictive performance from scratch, run the following command

python run_meta_cetm.py --dataset 20ng --data_path ./data/20ng/20ng_8novel.pkl --embed_path ./data/glove.6B/glove.6B.100d.txt --docs_per_task 10 --num_topics 20 --mode train

To train an ETM using the model-agnostic meta-learning (MAML) strategy, run the following command

python run_etm.py --dataset 20ng --data_path ./data/20ng/20ng_8novel.pkl --embed_path ./data/glove.6B/glove.6B.100d.txt --docs_per_task 10 --num_topics 20 --mode train --maml_train True

In the same vein, to train a ProdLDA from scratch using MAML, you can run the command

python run_avitm.py --dataset 20ng --data_path ./data/20ng/20ng_8novel.pkl --docs_per_task 10 --num_topics 20 --mode train --maml_train True
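Per-holdout-word perplexity is typically computed by holding out part of each test document, predicting the held-out words from topic proportions inferred on the observed part, and exponentiating the negative average log-likelihood per held-out token. The snippet below is a generic sketch of that formula under those assumptions, not the repository's evaluation code.

```python
# Generic per-holdout-word perplexity; not the repository's exact evaluation code.
import numpy as np

def per_holdout_word_ppl(holdout_bows, doc_topic, topic_word):
    """holdout_bows: (D, V) counts of held-out words in each test document.
    doc_topic:    (D, K) topic proportions inferred from the observed words.
    topic_word:   (K, V) topic-word probability matrix (rows sum to 1)."""
    word_probs = doc_topic @ topic_word                    # (D, V) predictive probabilities
    log_lik = (holdout_bows * np.log(word_probs + 1e-12)).sum()
    return float(np.exp(-log_lik / holdout_bows.sum()))
```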

Citation

@article{xu2024context,
  title={Context-guided Embedding Adaptation for Effective Topic Modeling in Low-Resource Regimes},
  author={Xu, Yishi and Sun, Jianqiao and Su, Yudi and Liu, Xinyang and Duan, Zhibin and Chen, Bo and Zhou, Mingyuan},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
