
Meta-CETM

This is the official implementation for the NeurIPS 2023 paper Context-guided Embedding Adaptation for Effective Topic Modeling in Low-Resource Regimes. We develop an approach that can discover meaningful topics from only a few documents; the core idea is to adaptively generate word embeddings semantically tailored to the given task by fully exploiting the contextual syntactic information.
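For intuition only, the snippet below is a minimal, generic sketch of this kind of embedding adaptation: pretrained word vectors are mixed with the vectors of words they co-occur with in the few given documents. The functions, shapes, and mixing weight are illustrative assumptions, not the repository's actual model.

```python
# Illustrative sketch only -- not the repository's actual model.
# Idea: adapt pretrained word embeddings to a small task by propagating
# information over a word co-occurrence graph built from the task's documents.
import torch

def cooccurrence_graph(bows: torch.Tensor) -> torch.Tensor:
    """bows: (num_docs, vocab_size) bag-of-words counts of the few given documents.
    Returns a row-normalized word-word co-occurrence matrix of shape (vocab, vocab)."""
    presence = (bows > 0).float()
    adj = presence.t() @ presence                 # words that share a document
    adj = adj + torch.eye(adj.size(0))            # add self-loops
    return adj / adj.sum(dim=1, keepdim=True)     # simple row normalization

def adapt_embeddings(pretrained: torch.Tensor, bows: torch.Tensor,
                     mix: float = 0.5) -> torch.Tensor:
    """Mix each word's pretrained vector (e.g. GloVe) with the vectors of the
    words it co-occurs with in this task; `mix` is a hypothetical weight."""
    adj = cooccurrence_graph(bows)
    return (1.0 - mix) * pretrained + mix * (adj @ pretrained)
```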

Get started

The following lists the statistics of the datasets we used.

| Dataset | Source link | N (#docs) | V (#words) | L (#labels) |
| --- | --- | --- | --- | --- |
| 20Newsgroups | 20NG | 11288 | 5968 | 20 |
| Yahoo! Answers | Yahoo | 27069 | 7507 | 10 |
| DBpedia | DB14 | 30183 | 6274 | 14 |
| Web of Science | WOS | 11921 | 4923 | 7 |

We curated the vocabulary for each dataset by removing words with very low and very high frequencies, as well as a list of commonly used stop words. After that, we filtered out documents that contained fewer than 50 vocabulary terms to yield the final available part of each original dataset. The pre-processed version of all four datasets can be downloaded from
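For reference, a minimal sketch of this preprocessing might look like the following. The frequency thresholds and helper names are illustrative assumptions; only the 50-term document filter comes from the description above.

```python
# Hedged sketch of the described preprocessing; thresholds are illustrative.
from collections import Counter

def build_vocab(tokenized_docs, stop_words, min_doc_freq=5, max_doc_ratio=0.7):
    """Keep words that are neither stop words, nor too rare, nor near-ubiquitous."""
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    n_docs = len(tokenized_docs)
    return {w for w, df in doc_freq.items()
            if w not in stop_words and df >= min_doc_freq and df / n_docs <= max_doc_ratio}

def filter_docs(tokenized_docs, vocab, min_terms=50):
    """Drop documents with fewer than 50 vocabulary terms after pruning the vocabulary."""
    kept = []
    for doc in tokenized_docs:
        in_vocab = [w for w in doc if w in vocab]
        if len(in_vocab) >= min_terms:   # counts tokens; distinct terms is another reading
            kept.append(in_vocab)
    return kept
```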

Episodic task construction

Since we adopt an episodic training strategy to train our model, we need to sample a batch of tasks from the original corpus to construct the training, validation, and test sets separately. To do this, unzip the downloaded pre-processed datasets, put the data folder under the root directory, and then execute the following commands.

cd utils
python process_to_task.py

Note that for different datasets, please modify the arguments dataset_name and data_path accordingly.
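For example, assuming the script exposes these as standard command-line flags (the flag names and the path below are illustrative, not taken from the repository), the Yahoo! Answers data might be processed with

python process_to_task.py --dataset_name yahoo --data_path ../data/yahoo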

Experiment: per-holdout-word perplexity (PPL)

To train a Meta-CETM with the best predictive performance from scratch, run the following command

python run_meta_cetm.py --dataset 20ng --data_path ./data/20ng/20ng_8novel.pkl --embed_path ./data/glove.6B/glove.6B.100d.txt --docs_per_task 10 --num_topics 20 --mode train

To train an ETM using the model-agnostic meta-learning (MAML) strategy, run the following command

python run_etm.py --dataset 20ng --data_path ./data/20ng/20ng_8novel.pkl --embed_path ./data/glove.6B/glove.6B.100d.txt --docs_per_task 10 --num_topics 20 --mode train --maml_train True

In the same vein, to train a ProdLDA from scratch using MAML, you can run the command

python run_avitm.py --dataset 20ng --data_path ./data/20ng/20ng_8novel.pkl --docs_per_task 10 --num_topics 20 --mode train --maml_train True
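Per-holdout-word perplexity is typically computed by holding out part of each test document, predicting the held-out words from topic proportions inferred on the observed part, and exponentiating the negative average log-likelihood per held-out token. The snippet below is a generic sketch of that formula under those assumptions, not the repository's evaluation code.

```python
# Generic per-holdout-word perplexity; not the repository's exact evaluation code.
import numpy as np

def per_holdout_word_ppl(holdout_bows, doc_topic, topic_word):
    """holdout_bows: (D, V) counts of held-out words in each test document.
    doc_topic:    (D, K) topic proportions inferred from the observed words.
    topic_word:   (K, V) topic-word probability matrix (rows sum to 1)."""
    word_probs = doc_topic @ topic_word                    # (D, V) predictive probabilities
    log_lik = (holdout_bows * np.log(word_probs + 1e-12)).sum()
    return float(np.exp(-log_lik / holdout_bows.sum()))
```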

Citation

@article{xu2024context,
  title={Context-guided Embedding Adaptation for Effective Topic Modeling in Low-Resource Regimes},
  author={Xu, Yishi and Sun, Jianqiao and Su, Yudi and Liu, Xinyang and Duan, Zhibin and Chen, Bo and Zhou, Mingyuan},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
