textTOvec

ICLR 2019 paper: "textTOvec: DEEP CONTEXTUALIZED NEURAL AUTOREGRESSIVE TOPIC MODELS OF LANGUAGE WITH DISTRIBUTED COMPOSITIONAL PRIOR"

About

This repository contains the implementation of the models proposed in the ICLR 2019 paper "textTOvec: DEEP CONTEXTUALIZED NEURAL AUTOREGRESSIVE TOPIC MODELS OF LANGUAGE WITH DISTRIBUTED COMPOSITIONAL PRIOR".

Paper: https://arxiv.org/pdf/1810.03947.pdf

Requirements

Requires Python 3 (tested with 3.6.1). The remaining dependencies can then be installed via:

    $ pip install -r requirements.txt
    $ python -c "import nltk; nltk.download('all')"

Data format

Datasets: A directory containing CSV files, one per split, with separate files for training, validation and test. For the DocNADE model, the files must be named training_docnade.csv, validation_docnade.csv and test_docnade.csv. For the LSTM portion of the ctx-DocNADE(e) model, they must be named training_lstm.csv, validation_lstm.csv and test_lstm.csv. Each CSV file (prior to preprocessing) consists of 2 comma-delimited string fields: the first is the label and the second is the document body.
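
As an illustration of this layout, the following minimal sketch writes and reads two-field CSV files and checks that the expected DocNADE file names are present. The helper names and the example rows are made up for this sketch; only the file names and the (label, document body) field order come from the description above.

```python
import csv
import tempfile
from pathlib import Path

# File names as listed in this README; everything else below is illustrative.
DOCNADE_FILES = ["training_docnade.csv", "validation_docnade.csv", "test_docnade.csv"]


def write_split(path, rows):
    """Write (label, document) pairs as a two-field, comma-delimited CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def read_split(path):
    """Read a split back as a list of (label, document) tuples."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(label, doc) for label, doc in csv.reader(f)]


def missing_files(dataset_dir, expected=DOCNADE_FILES):
    """Return the expected file names that are absent from dataset_dir."""
    return [name for name in expected if not (Path(dataset_dir) / name).is_file()]


# Demo in a temporary directory with hypothetical rows.
demo_dir = Path(tempfile.mkdtemp())
demo_rows = [
    ("sci.space", "the shuttle launch was delayed, again"),
    ("rec.autos", "my car needs new brakes"),
]
for name in DOCNADE_FILES:
    write_split(demo_dir / name, demo_rows)

print(missing_files(demo_dir))
print(read_split(demo_dir / "training_docnade.csv"))
```

Using the csv module rather than splitting on commas by hand keeps document bodies that contain commas intact, because the writer quotes such fields.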

Vocabulary files: Plain text files with 1 vocabulary token per line, one file for the DocNADE portion and one for the LSTM portion. These must be created in advance; we do not provide a script for creating vocabularies.
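
Since no vocabulary script is provided, here is one possible way to produce a file in the one-token-per-line format. This is only a sketch: lowercasing, whitespace tokenization, the frequency ordering and the `min_count` parameter are all assumptions, and the preprocessing used for the paper's experiments may differ.

```python
import tempfile
from collections import Counter
from pathlib import Path


def build_vocab(documents, min_count=1):
    """Return tokens seen at least min_count times, most frequent first."""
    # Assumption: lowercase + whitespace tokenization.
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    # Sort by descending count, then alphabetically for a deterministic order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, count in ordered if count >= min_count]


def write_vocab(path, tokens):
    """Write one token per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens) + "\n")


def read_vocab(path):
    """Read the vocabulary back, skipping empty lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


docs = ["the cat sat", "the dog sat down"]
vocab = build_vocab(docs)
vocab_path = Path(tempfile.mkdtemp()) / "vocab_demo.txt"  # hypothetical file name
write_vocab(vocab_path, vocab)
print(read_vocab(vocab_path))
```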

mapping_dict.pkl: A dictionary that maps the indices of words in DocNADE to indices of words in LSTM data format (i.e., CSVs).
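
A mapping of this kind could be built from the two vocabulary lists as sketched below. The exact structure expected by the training scripts is not documented here, so the choices in this sketch (0-based integer keys and values, DocNADE index to LSTM index, tokens missing from the LSTM vocabulary skipped) are assumptions to be checked against the code.

```python
import pickle
import tempfile
from pathlib import Path


def build_mapping(docnade_vocab, lstm_vocab):
    """Map each DocNADE word index to the index of the same word in the LSTM vocabulary."""
    lstm_index = {tok: i for i, tok in enumerate(lstm_vocab)}
    # Assumption: words absent from the LSTM vocabulary are left out of the mapping.
    return {i: lstm_index[tok] for i, tok in enumerate(docnade_vocab) if tok in lstm_index}


docnade_vocab = ["cat", "dog", "sat"]          # hypothetical vocabularies
lstm_vocab = ["the", "cat", "sat", "dog"]
mapping = build_mapping(docnade_vocab, lstm_vocab)

# Save and reload with pickle, matching the .pkl extension.
pkl_path = Path(tempfile.mkdtemp()) / "mapping_dict.pkl"
with open(pkl_path, "wb") as f:
    pickle.dump(mapping, f)
with open(pkl_path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded)
```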

How to use: Train baseline DocNADE model

The script train_DATANAME_docnade_PPL.sh or train_DATANAME_docnade_IR.sh invokes train_model.py to train the baseline DocNADE model, compute PPL/IR and save the best model (see --model). It also logs all information for the PPL and IR models in separate directories. Here's how to use the scripts:

    $ ./train_20NSshort_docnade_PPL.sh
	# to compute PPL
	
	$ ./train_20NSshort_docnade_IR.sh
	# to compute IR
	
	--dataset				is the path to the input dataset. 
	--docnadeVocab 			is the path to the vocabulary of the input dataset in DocNADE portion. 
	--model 				is the path to save the best model.
	--initialize-docnade 	Init DocNADE weights. False for DocNADE; True/False for ctx-DocNADE(e) model
	--bidirectional 		True, if bidirectional settings required in DocNADE. Default: False
	--activation 			*sigmoid* for PPL and *tanh* for IR computations.
	--learning-rate 		0.001
	--batch-size 			training batch size, for instance, 100
	--num-steps 			the number of training steps  
	--log-every 		 
	--validation-bs 		validation batch size, set to 1
	--test-bs 				test batch size, set to 1
	--validation-ppl-freq 	compute PPL of validation set at this frequency
	--validation-ir-freq	compute IR of validation set at this frequency
	--test-ir-freq  		compute IR of test set at this frequency
	--test-ppl-freq  		compute PPL of test set at this frequency
	--num-classes 			number of class labels; not used.
	--patience 				stopping criteria on validation scores
	--supervised 			If training in supervised setting. Set to False.
	--hidden-size 			The number of hidden units in a hidden vector
	--combination-type 		The mode of combining hidden vectors from DocNADE and LSTM portions. Set to 'sum'
	--vocab-size 			Vocabulary size in DocNADE portion.
	--deep 					True, if additional layers on both DocNADE and LSTM portions. Set False for non-deep versions. 
	--deep-hidden-sizes 	List of hidden sizes. For instance, for a two layered network, set: 200 200. Used, if deep = True
	--trainfile 			is the path to the training text file. (required in case of topic coherence)
	--valfile 				is the path to the validation text file. (required in case of topic coherence)
	--testfile 				is the path to the testing text file. (required in case of topic coherence)
	--reload 				True, if reloading of a model required. Set to False otherwise.  
	--reload-model-dir 		Path to the model to reload. Used only if reload set to True. 
	
	

    *** TO DO ***: Improve Documentation. 
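
To see how the options above fit together on one command line, the sketch below assembles an argument list for train_model.py and prints it without running anything. The flag names and the values 0.001, 100, 1, sigmoid and 200 200 are taken from the list above; the paths, the `build_argv` helper and the way boolean values are written (`True`/`False` as plain strings) are assumptions. The provided .sh scripts remain the reference for the exact invocation.

```python
import shlex


def build_argv(script, options):
    """Flatten {flag: value} into an argv list; list values become several tokens."""
    argv = ["python", script]
    for flag, value in options.items():
        argv.append("--" + flag)
        values = value if isinstance(value, (list, tuple)) else [value]
        argv.extend(str(v) for v in values)
    return argv


options = {
    "dataset": "./datasets/DATANAME",           # hypothetical path
    "docnadeVocab": "./datasets/DATANAME/vocab_docnade.txt",  # hypothetical path
    "model": "./model/MODELNAME",               # hypothetical path
    "activation": "sigmoid",                    # sigmoid for PPL, tanh for IR
    "learning-rate": 0.001,
    "batch-size": 100,
    "validation-bs": 1,
    "test-bs": 1,
    "supervised": False,                        # assumption: booleans passed as text
    "deep": True,
    "deep-hidden-sizes": [200, 200],            # two-layer example from the list above
}

command = build_argv("train_model.py", options)
# Quote each token so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```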

How to use: Train ctx-DocNADE or ctx-DocNADEe model

The script train_20NSshort_docnade_lstm_PPL.sh or train_20NSshort_docnade_lstm_IR.sh invokes train_model_lstm.py to train the ctx-DocNADE or ctx-DocNADEe model, compute PPL/IR and save the best model (see --model). It also logs all information for the PPL and IR models in separate directories. Here's how to use the scripts:

    $ ./train_20NSshort_docnade_lstm_PPL.sh
	# to compute PPL using the ctx-DocNADE or ctx-DocNADEe model, i.e., the textTOvec models
	
	$ ./train_20NSshort_docnade_lstm_IR.sh
	# to compute IR using the ctx-DocNADE or ctx-DocNADEe model, i.e., the textTOvec models
	
	
	--dataset				is the path to the input dataset. 
	--mapping-dict			is the path to mapping_dict.pkl file.
	--rnnVocab				is the path to the vocabulary of the input dataset in LSTM portion.
	--docnadeVocab 			is the path to the vocabulary of the input dataset in DocNADE portion. 
	--model 				is the path to save the best model.
	--initialize-docnade 	Init DocNADE weights. False for DocNADE; True/False for ctx-DocNADE(e) model
	--bidirectional 		True, if bidirectional settings required in DocNADE. Default: False
	--activation 			sigmoid for PPL and tanh for IR computations.
	--learning-rate 		0.001
	--batch-size 			training batch size, for instance, 100
	--num-steps 			the number of training steps  
	--log-every 		 
	--validation-bs 		validation batch size, set to 1
	--test-bs 				test batch size, set to 1
	--validation-ppl-freq 	compute PPL of validation set at this frequency
	--validation-ir-freq	compute IR of validation set at this frequency
	--test-ir-freq  		compute IR of test set at this frequency
	--test-ppl-freq  		compute PPL of test set at this frequency
	--num-classes 			number of class labels; not used.
	--patience 				stopping criteria on validation scores
	--supervised 			If training in supervised setting. Set to False.
	--hidden-size 			The number of hidden units in a hidden vector
	--combination-type 		The mode of combining hidden vectors from DocNADE and LSTM portions. Set to 'sum'
	--vocab-size 			Vocabulary size in DocNADE portion.
	--deep 					True, if additional layers on both DocNADE and LSTM portions. Set False for non-deep versions. 
	--deep-hidden-sizes 	List of hidden sizes. For instance, for a two layered network, set: 200 200. Used, if deep = True
	--use-docnade-for-ir 	True, to log IR due to DocNADE portion *only* in the ctx-DocNADE/ctx-DocNADEe model
	--use-lstm-for-ir 		True, to log IR due to LSTM portion *only* in the ctx-DocNADE/ctx-DocNADEe model
	--use-combination-for-ir True, to log IR due to DocNADE+LSTM portions together in the ctx-DocNADE/ctx-DocNADEe model
	--initialize-rnn 		Init LSTM with GloVe embeddings, i.e., ctx-DocNADEe
	--update-docnade-w 		True, to update weights in DocNADE portion.  
	--update-rnn-w 			False, to not update embeddings in LSTM portion. 
	--lambda-hidden-lstm 	mixture weight, lambda in [0.0-1.0]
	--trainfile 			is the path to the training text file. (required in case of topic coherence)
	--valfile 				is the path to the validation text file. (required in case of topic coherence)
	--testfile 				is the path to the testing text file. (required in case of topic coherence)
	--reload 				True, if reloading of a model required. Set to False otherwise.  
	--reload-model-dir 		Path to the model to reload. Used only if reload set to True. 
	--reload-docnade-embeddings True, to init DocNADE portion of the ctx-DocNADE from a pretrained DocNADE model
	--docnade-embeddings-path Path to topic-embedding matrix W from a pre-trained DocNADE model
	
	

    *** TO DO ***: Improve Documentation. 
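
The --combination-type and --lambda-hidden-lstm options describe how the DocNADE and LSTM hidden vectors are mixed. As a rough illustration of a 'sum' combination with a mixture weight, the sketch below adds the LSTM vector, scaled by lambda, to the DocNADE vector and then applies the activation. The function name, the use of plain Python lists and the exact placement of the activation are assumptions; this is not the code in train_model_lstm.py.

```python
import math


def combine_sum(h_docnade, h_lstm, lam, activation=math.tanh):
    """Element-wise activation(h_docnade + lam * h_lstm) for equal-length vectors."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0.0, 1.0]")
    if len(h_docnade) != len(h_lstm):
        raise ValueError("hidden vectors must have the same size")
    return [activation(d + lam * l) for d, l in zip(h_docnade, h_lstm)]


h_dn = [0.0, 1.0]     # hypothetical DocNADE pre-activation
h_lm = [2.0, -1.0]    # hypothetical LSTM hidden state

# With an identity activation the weighted sum is easy to inspect.
print(combine_sum(h_dn, h_lm, 0.5, activation=lambda x: x))
# With lam = 0.0 the LSTM contribution vanishes and only DocNADE remains.
print(combine_sum(h_dn, h_lm, 0.0))
```

Under this reading, lambda = 0.0 reduces the combined vector to the DocNADE part alone, and larger values give the LSTM context more influence.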

Directory structure for results and datasets

Contains dataset folders

Datasets directory: ./datasets/

Contains GloVe pretrained embeddings

Pre-trained embeddings dir: /home/usr/resources/pretrained_embeddings/

Contains results of training

Results directory: ./model/

Saved logs model dir: ./model/MODELNAME/logs/

NOTE: We will improve the readme and code documentation soon.
