A pun generator based on the surprisal principle

Pun Generation with Surprise

This repo contains code and data for the paper Pun Generation with Surprise.


Requirements

  • Python 3.6
  • PyTorch 0.4
conda install pytorch=0.4.0 torchvision -c pytorch
  • Fairseq(-py)
git clone -b pungen
cd fairseq
pip install -r requirements.txt
python build develop
  • Pretrained WikiText-103 model from Fairseq
curl --create-dirs --output models/wikitext/model
tar xjf models/wikitext/model -C models/wikitext
rm models/wikitext/model


Word relatedness model

We approximate the relatedness between a pair of words with a long-distance skip-gram model trained on BookCorpus sentences. The original BookCorpus data is parsed by scripts/; a sample file is available at sample_data/bookcorpus/raw/train.txt.

Preprocess BookCorpus data:

python -m pungen.wordvec.preprocess --data-dir data/bookcorpus/skipgram \
	--corpus data/bookcorpus/raw/train.txt \
	--min-dist 5 --max-dist 10 --threshold 80 \
	--vocab data/bookcorpus/skipgram/dict.txt
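
As a rough illustration of what "long-distance" means here, the sketch below extracts (word, context) pairs whose positions are between 5 and 10 tokens apart, mirroring the --min-dist and --max-dist values above. The function name and the exact pairing rule are assumptions made for illustration; the actual logic in pungen.wordvec.preprocess may differ.

```python
from typing import Iterator, List, Tuple


def long_distance_pairs(
    tokens: List[str], min_dist: int = 5, max_dist: int = 10
) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs that are min_dist..max_dist tokens apart.

    Hypothetical helper: an assumption about how the distance flags are used.
    """
    for i, center in enumerate(tokens):
        for dist in range(min_dist, max_dist + 1):
            # Look both to the left and to the right of the center word.
            for j in (i - dist, i + dist):
                if 0 <= j < len(tokens):
                    yield center, tokens[j]


sentence = "the old man who lived by the sea caught a very large fish".split()
pairs = list(long_distance_pairs(sentence))
```

Unlike a standard skip-gram window, adjacent words such as ("old", "man") are never paired, so the resulting embeddings would capture topical relatedness rather than local syntax.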

Train skip-gram model:

python -m pungen.wordvec.train --weights --cuda --data data/bookcorpus/skipgram/train.bin \
    --save_dir models/bookcorpus/skipgram \
    --mb 3500 --epoch 15 \
    --vocab data/bookcorpus/skipgram/dict.txt

Edit model

The edit model takes a word and a template (masked sentence) and combines the two coherently.

Preprocess data:

for split in train valid; do \
	PYTHONPATH=. python scripts/ -i data/bookcorpus/raw/$split.txt \
        -o data/bookcorpus/edit/$split --delete-frac 0.5 --window-size 2 --random-window-size; \
done

python -m pungen.preprocess --source-lang src --target-lang tgt \
	--destdir data/bookcorpus/edit/bin/data --thresholdtgt 80 --thresholdsrc 80 \
	--validpref data/bookcorpus/edit/valid \
	--trainpref data/bookcorpus/edit/train \
	--workers 8
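
To make the training data for the edit model more concrete, here is a minimal sketch of how a (word, template) source and a full-sentence target could be built from one sentence. The `<mask>` token, the function name, and the reading of --window-size as "number of neighbors removed on each side of the chosen word" are all assumptions; --delete-frac and --random-window-size are not modeled.

```python
from typing import Dict, List

MASK = "<mask>"  # hypothetical placeholder token, not taken from the repo


def make_edit_example(
    tokens: List[str], center: int, window_size: int = 2
) -> Dict[str, object]:
    """Build one training example around the word at position `center`.

    The word and its neighbors within `window_size` are replaced by a single
    MASK to form the template; the target is the original sentence.
    """
    lo = max(0, center - window_size)
    hi = min(len(tokens), center + window_size + 1)
    template = tokens[:lo] + [MASK] + tokens[hi:]
    return {"word": tokens[center], "template": template, "target": list(tokens)}


tokens = "she walked to the bank to deposit money".split()
example = make_edit_example(tokens, center=4, window_size=2)
```

Under these assumptions, the model would learn to regenerate the words around "bank" so that the inserted word fits the rest of the sentence.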


Train the model:

python -m pungen.train data/bookcorpus/edit/bin/data -a lstm \
    --source-lang src --target-lang tgt \
    --task edit --insert deleted --combine token \
    --criterion cross_entropy \
    --encoder lstm --decoder-attention True \
    --optimizer adagrad --lr 0.01 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
    --clip-norm 5 --max-epoch 50 --max-tokens 7000 --no-epoch-checkpoints \
    --save-dir models/bookcorpus/edit/deleted --no-progress-bar --log-interval 5000


Build a sentence retriever based on BookCorpus. The input file should contain one tokenized sentence per line.

python -m pungen.retriever --doc-file data/bookcorpus/raw/sent.tokenized.txt \
    --path models/bookcorpus/retriever.pkl --overwrite
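
For intuition only, the sketch below shows a tiny inverted index over tokenized sentences that returns the sentences containing a given word. The class and method names are hypothetical, and this is not the implementation in pungen.retriever, which may rank sentences differently.

```python
from collections import defaultdict
from typing import Dict, List, Set


class WordIndex:
    """Map each token to the ids of the sentences that contain it."""

    def __init__(self, sentences: List[str]) -> None:
        self.sentences = sentences
        self.postings: Dict[str, Set[int]] = defaultdict(set)
        for sent_id, sent in enumerate(sentences):
            # Sentences are assumed to be pre-tokenized and space-separated.
            for tok in sent.split():
                self.postings[tok].add(sent_id)

    def retrieve(self, word: str, limit: int = 10) -> List[str]:
        """Return up to `limit` sentences containing `word`, in corpus order."""
        ids = sorted(self.postings.get(word, ()))[:limit]
        return [self.sentences[i] for i in ids]


docs = [
    "the man died last winter",
    "she dyed her hair red",
    "nobody died in the fire",
]
index = WordIndex(docs)
hits = index.retrieve("died")
```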

Analyze what makes a pun funny

Compute the correlation between local-global surprise scores and human funniness ratings. We provide our annotated dataset in data/funniness_annotation:

  • analysis_pun_scores.txt: sentences annotated with funniness scores from 1 to 5.
  • analysis_zscored_pun_scores.txt: the same data where scores are standardized for each annotator.
python --human-eval data/funniness_annotation/analysis_zscored_pun_scores.txt \
	--lm-path models/wikitext/ --word-counts-path models/wikitext/dict.txt \
    --skipgram-model data/bookcorpus/skipgram/dict.txt \
                     models/bookcorpus/skipgram/ \
    --outdir results/pun-analysis/analysis_zscored \
    --features grammar ratio --analysis --ignore-cache  
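
The per-annotator standardization mentioned for analysis_zscored_pun_scores.txt can be sketched as follows. The input layout (annotator id, sentence id, score) and the function name are assumptions for illustration and do not reflect the actual file format.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Rating = Tuple[str, str, float]  # (annotator id, sentence id, score)


def zscore_per_annotator(ratings: List[Rating]) -> List[Rating]:
    """Standardize each score with its own annotator's mean and std deviation."""
    by_annotator: Dict[str, List[float]] = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    standardized = []
    for annotator, sent, score in ratings:
        mu, sigma = stats[annotator]
        # An annotator who gave the same score everywhere carries no signal.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        standardized.append((annotator, sent, z))
    return standardized


ratings = [("a", "s1", 1), ("a", "s2", 3), ("a", "s3", 5), ("b", "s1", 4), ("b", "s2", 5)]
zscored = zscore_per_annotator(ratings)
```

Standardizing this way removes differences in how generously each annotator uses the 1 to 5 scale before the ratings are correlated with surprise scores.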

Generate puns

We generate puns given a pun word and its alternative word. The following generation methods are supported and are selected with the --system argument.

  • rule: the SURGEN method described in the paper
  • rule+neural: in the last step of SURGEN, use a neural combiner to edit the topic words
  • retrieve: retrieve a sentence containing the pun word
  • retrieve+swap: retrieve a sentence containing the alternative word and replace it with the pun word

For arguments controlling the neural generator (e.g., --beam, --nbest), see fairseq.options. All results and logs are saved in outdir.
python data/bookcorpus/edit/bin/data \
	--path models/bookcorpus/edit/deleted/ \
	--beam 20 --nbest 1 --unkpen 100 \
	--system rule --task edit \
	--retriever-model models/bookcorpus/retriever.pkl --doc-file data/bookcorpus/raw/sent.tokenized.txt \
	--lm-path models/wikitext/ --word-counts-path models/wikitext/dict.txt \
	--skipgram-model data/bookcorpus/skipgram/dict.txt models/bookcorpus/skipgram/ \
	--num-candidates 500 --num-templates 100 \
	--num-topic-word 100 --type-consistency-threshold 0.3 \
	--pun-words data/semeval/hetero/dev.json \
	--outdir results/semeval/hetero/dev/rule \
	--scorer random \
	--max-num-examples 100
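
The retrieve+swap baseline is simple enough to sketch in a few lines. The version below scans a list of tokenized sentences, takes the first one containing the alternative word, and substitutes the pun word. The function name and the linear scan are illustrative assumptions rather than the repo's actual code, which uses the retriever model.

```python
from typing import List, Optional


def retrieve_and_swap(
    sentences: List[str], pun_word: str, alter_word: str
) -> Optional[str]:
    """Return the first sentence containing alter_word, with pun_word swapped in."""
    for sent in sentences:
        tokens = sent.split()
        if alter_word in tokens:
            swapped = [pun_word if tok == alter_word else tok for tok in tokens]
            return " ".join(swapped)
    return None  # no sentence in the corpus contains the alternative word


corpus = ["she went home early", "the old dog died last winter"]
pun = retrieve_and_swap(corpus, pun_word="dyed", alter_word="died")
```

Because the sentence was written for the alternative word, the swapped-in pun word is locally surprising, but nothing in the rest of the sentence supports it. That is the gap the rule method's topic-word step is meant to address.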


If you use the annotated SemEval pun dataset, please cite our paper:

    title={Pun Generation with Surprise},
    author={He He and Nanyun Peng and Percy Liang},
    booktitle={North American Association for Computational Linguistics (NAACL)},