This repository contains the code for the AAAI-21 paper "Adaptive Beam Search Decoding for Discrete Keyphrase Generation", along with the supplementary materials (supplementary materials.pdf).
Our implementation builds on:
- seq2seq-keyphrase-pytorch
- a related repository
- the beam search code from OpenNMT-py
- keyphrase-generation-rl
- python 3.5+
- pytorch 1.6.0
- numpy 1.19.2
- json 2.0.9
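A minimal environment setup matching the pinned versions above (the `json` module ships with Python, so it needs no separate install; adjust the `torch` wheel for your CUDA version as needed):

```shell
# Install the two pinned third-party dependencies listed above.
pip install torch==1.6.0 numpy==1.19.2
```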
To train AdaGM, follow the steps below, provided that your data is organized as follows:

```
data
├── kp20k_sorted
│   ├── train_*.txt
│   ├── valid_*.txt
│   └── test_*.txt
└── cross_domain_sorted
    ├── word_*_testing_allkeywords.txt
    └── word_*_testing_context.txt
pykp
utils
*.py
*.sh
```
1️⃣ Download the dataset and execute preprocess.sh to build the vocabulary and the training, validation, and test datasets.
2️⃣ Execute train.sh to train the model and select the best checkpoint on the validation dataset. All model checkpoints are saved in the ckpt/ directory.
3️⃣ Execute pred.sh to predict keyphrases for all test datasets. All predictions are saved in the pred/ directory.
4️⃣ Execute eval.sh to evaluate the predictions (duplication ratio, MAE, F1@5, F1@M). All results are saved in the exp/ directory.
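For reference, a minimal sketch of the F1@5 and F1@M metrics named above, assuming simple exact-match between keyphrases. The function and variable names here are illustrative, not the repository's actual evaluation API:

```python
def f1_at_k(predictions, targets, k=None):
    """F1@k: truncate predictions to the top k. F1@M uses all predictions (k=None)."""
    preds = predictions if k is None else predictions[:k]
    if not preds or not targets:
        return 0.0
    num_matches = len(set(preds) & set(targets))
    precision = num_matches / len(preds)
    recall = num_matches / len(targets)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 6 predicted keyphrases, 3 gold keyphrases.
preds = ["beam search", "keyphrase generation", "decoding",
         "neural network", "attention", "seq2seq"]
golds = ["keyphrase generation", "beam search", "adaptive decoding"]
f1_5 = f1_at_k(preds, golds, k=5)  # F1@5: scores only the first 5 predictions
f1_m = f1_at_k(preds, golds)       # F1@M: scores all predictions
```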
We use the same five datasets as keyphrase-generation-rl, which can be downloaded from their repository. We gratefully thank Mr. Wang Chen and Mr. Hou Pong Chan for their help with data preprocessing (sorting present keyphrases and removing duplicate documents).
Preprocessing command:

```shell
python3 preprocess.py -data_dir data/kp20k_sorted -remove_eos
```

Training command:

```shell
python3 train.py -data data/kp20k_sorted/ -vocab data/kp20k_sorted/ -exp_path exp/%s.%s -exp kp20k -epochs 20 -copy_attention -one2many -batch_size 12 -seed 9527 -delimiter_type 1
```

Prediction command:

```shell
python3 predict.py -vocab data/kp20k_sorted/ -src_file [src_file_path] -pred_path pred/%s.%s -copy_attention -one2many -delimiter_type 1 -model [model_path] -max_length 60 -remove_title_eos -n_best -1 -max_eos_per_output_seq 1 -replace_unk -beam_size 20 -batch_size 1
```

Evaluation command:

```shell
python3 evaluate_prediction.py -pred_file_path [path_to_predictions.txt] -src_file_path [path_to_test_set_src_file] -trg_file_path [path_to_test_set_trg_file] -exp kp20k -export_filtered_pred -disable_extra_one_word_filter -invalidate_unk -all_ks 5 M -present_ks 5 M -absent_ks 5 M
```
We train five keyphrase generation models (CopyRNN, CorrRNN, TG-Net, catSeq, catSeqD) and save the best checkpoint and predictions separately. We also collect the predictions of two further models, provided by their authors: ExHiRD and Kp-RL. They can be downloaded from here.
The relevant code is in model.py (lines 137-146) and sequence_generator.py (lines 218-234).
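To illustrate what sequence_generator.py builds on, here is a generic beam-search sketch for sequence decoding. This is not the paper's adaptive beam search; it only shows the standard procedure, with `next_token_scores` as a stand-in for the model's log-probability function (all names here are illustrative):

```python
import math

def beam_search(next_token_scores, start_token, end_token, beam_size=3, max_len=10):
    """Return the highest-scoring sequence as (token_list, cumulative_log_prob)."""
    beams = [([start_token], 0.0)]  # each beam: (sequence so far, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # Expand every beam with every possible next token.
            for token, logp in next_token_scores(seq).items():
                new_seq = seq + [token]
                if token == end_token:
                    finished.append((new_seq, score + logp))
                else:
                    candidates.append((new_seq, score + logp))
        if not candidates:
            break
        # Keep only the top `beam_size` partial hypotheses.
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    finished.extend(beams)  # fall back to partial hypotheses if none finished
    return max(finished, key=lambda x: x[1])

# Toy scorer: after emitting "b", strongly prefers ending the sequence.
def toy_scores(seq):
    if seq[-1] == "b":
        return {"</s>": math.log(0.9), "a": math.log(0.1)}
    return {"a": math.log(0.4), "b": math.log(0.6)}

best_seq, best_score = beam_search(toy_scores, "<s>", "</s>")
```

With this toy scorer the search keeps several partial hypotheses alive at each step and returns the globally best finished sequence rather than the greedy one-step choice.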
If you have any questions, please contact hxlist@163.com or huangxiaolist@163.com.