Sparse Neural Editor

This repo is the PyTorch implementation of the following paper:

Learning Sparse Prototypes for Text Generation
Junxian He, Taylor Berg-Kirkpatrick, Graham Neubig
NeurIPS 2020

In this repo, we implement a generative model of text that generates sentences by editing non-parametric prototypes. The prototype support set is encouraged to be sparse during training to improve memory and time efficiency at test time.


The code mainly requires PyTorch (>=1.4.0) and fairseq (our experiments are based on the specific commit pinned below).

Install dependencies:

# install fairseq from a specific commit
git clone https://github.com/pytorch/fairseq fairseq_local
cd fairseq_local
git reset --hard b65a85b

# copy in the modified code that uses edit vectors
cp -r ../sparse_prototype fairseq

pip install --editable ./

cd ..

# install additional dependencies
pip install -r requirements.txt

Prepare Data

# download coco data

# download yelp medium data

# download yelp large data

mkdir datasets

# take coco dataset as an example
tar -xvzf coco40k.tar.gz -C datasets

# binarize dataset for fairseq
bash scripts/ coco40k

# generate a mask file which is used to avoid selecting 
# exactly the same example as prototype during training
python scripts/ coco40k
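The mask is conceptually simple: for every training sentence, disallow any prototype whose text is identical to it (including the sentence itself), so the model cannot trivially copy its own input. A minimal sketch of the idea (the helper name is hypothetical and the repo's script additionally writes the result to disk):

```python
import numpy as np

def build_prototype_mask(sentences):
    # mask[i, j] is True when sentence j may serve as a prototype for
    # sentence i; entries with exactly the same text (including j == i)
    # are masked out so an example never retrieves itself
    n = len(sentences)
    mask = np.ones((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if sentences[i] == sentences[j]:
                mask[i, j] = False
    return mask

mask = build_prototype_mask(["a cat sat", "a dog ran", "a cat sat"])
```

Here sentence 0 and sentence 2 are duplicates, so neither may act as the other's prototype, while sentence 1 remains available to both.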


We first pre-compute the sentence embeddings for all data examples offline and save them in memory-mapped files using np.memmap. During training/evaluation, a bilinear transformation is applied between these data embeddings and prototype embeddings to obtain the retrieval distribution. Here we use BERT as the offline encoder:
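The retrieval step above can be sketched as follows; the file name, batch size, and dimensions are illustrative, not the repo's actual code:

```python
import numpy as np

N, P, D = 1000, 50, 768  # num examples, num prototypes, BERT hidden size

# offline: precompute sentence embeddings into a memory-mapped file
emb = np.memmap('sent_emb.dat', dtype='float32', mode='w+', shape=(N, D))
emb[:] = np.random.randn(N, D).astype('float32')
emb.flush()

# training/eval: slices are read lazily; the full array never enters RAM
data_emb = np.memmap('sent_emb.dat', dtype='float32', mode='r', shape=(N, D))
batch = np.asarray(data_emb[:8])                         # (8, D)

proto_emb = np.random.randn(P, D).astype('float32')      # prototype embeddings
W = np.random.randn(D, D).astype('float32') / np.sqrt(D)  # bilinear map

logits = batch @ W @ proto_emb.T                          # (8, P) scores
logits -= logits.max(axis=1, keepdims=True)               # numerical stability
retrieval_dist = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
```

Each row of `retrieval_dist` is a distribution over prototypes for one data example; in the actual model `W` and the prototype embeddings are learned parameters.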

# embeddings are saved into pretrained_sent_embeddings/[dataset name]
CUDA_VISIBLE_DEVICES=xx python scripts/ [dataset name]

GloVe embeddings are used in the paper to initialize word embeddings:

mkdir glove_embeddings
unzip glove.6B.zip -d glove_embeddings

# compress glove embeddings to generate a new embedding file
# that only contains the dictionary of the dataset
python scripts/ \
		--embed-path glove_embeddings/glove.6B.300d.txt \
		--dict-path data-bin/[dataset_name]/dict.txt \
		> glove_embeddings/[dataset_name]_glove.txt
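The compression step simply filters the GloVe text file down to the words present in the fairseq dictionary. A minimal sketch (the function name is hypothetical; the repo's script reads the dict format produced by fairseq):

```python
def compress_embeddings(embed_path, vocab, out_path):
    # keep only rows of the GloVe text file whose word is in the
    # dataset vocabulary; each line is "word v1 v2 ... vD"
    with open(embed_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            if line.split(' ', 1)[0] in vocab:
                fout.write(line)
```

This keeps the embedding file small enough to load quickly at training start, since the full 6B-token GloVe file covers 400K words while a dataset dictionary is typically much smaller.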

Train the model:
# train the sparse neural editor
# [GPUs] can be multiple ids to perform data-parallel training
# some hyperparameters can be specified (e.g. -a [alpha]), see 
# details in the script
bash scripts/ -g [GPUs] -d [dataset name]

# train lm baseline
bash scripts/ -g [GPUs] -c lm_baseline -d [dataset name]


Compute PPL:
# approximate importance-weighted ppl
bash scripts/ -g [GPUs] -d [dataset name] -e iw -p [checkpoint directory]

# pruning prototypes can be performed at eval time
# [prune num] is the number of prototypes kept
bash scripts/ -g [GPUs] -d [dataset name] -u [prune num] -e iw -p [checkpoint directory] 
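The importance-weighted estimate averages K weighted posterior samples per sentence in log space, then converts the per-token log-likelihood into perplexity. A minimal numpy sketch of the estimator (function names and sample count are illustrative):

```python
import numpy as np

def iw_log_likelihood(log_weights):
    # log p(x) is approximated by logsumexp_k(log w_k) - log K, where
    # w_k = p(x, z_k) / q(z_k | x) for K samples z_k from the
    # approximate posterior q over prototypes and edit vectors
    K = len(log_weights)
    m = np.max(log_weights)
    return m + np.log(np.exp(log_weights - m).sum()) - np.log(K)

def perplexity(total_log_likelihood, num_tokens):
    # ppl = exp(-average log-likelihood per token)
    return np.exp(-total_log_likelihood / num_tokens)
```

Subtracting the max before exponentiating keeps the estimator numerically stable when the sampled log-weights are large and negative.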

Template-based Generation

See the notebook generate_demo.ipynb (mainly the sample_from_cluster function) for examples of loading the pretrained model and generating from given templates.


Citation:

@inproceedings{he2020learning,
  title={Learning Sparse Prototypes for Text Generation},
  author={He, Junxian and Berg-Kirkpatrick, Taylor and Neubig, Graham},
  booktitle={Proceedings of NeurIPS},
  year={2020}
}

