
About Fine-tuning for Text Summarization #27

Closed
dcn2020 opened this issue Jul 9, 2019 · 9 comments

Comments

@dcn2020

dcn2020 commented Jul 9, 2019

Hi,

Thank you for the great work. Recently I tried fine-tuning from the pre-trained model (https://modelrelease.blob.core.windows.net/mass/mass_summarization_1024.pth).

I followed the instructions in the README (https://github.com/microsoft/MASS#fine-tuning-2) and ran the command on a single-GPU machine. After it finished, I tested the output by running:

```
python translate_ensemble.py --exp_name giga_test --src_lang ar --tgt_lang ti --beam 5 --batch_size 1 --model_path ./dumped/mass_summarization/bvk6g6f9xl/checkpoint.pth --output_path ./dumped/mass_summarization/bvk6g6f9xl/output.txt.beam5 < ./data/processed/giga/test.ar-ti.ar
```

Then I post-processed the output to remove the BPE marker @@ and computed ROUGE scores: ROUGE-1 F1 = 37.2 and ROUGE-2 F1 = 18.8.

I think I have missed something important here. Could you please tell me how to fine-tune the model correctly?
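A side note on the @@ removal step: with subword-nmt-style BPE (what the XLM/MASS pipeline uses), `@@ ` marks a word continuation, so a single `sed` pass restores whole words. A minimal sketch (the `printf` line is just a stand-in for the beam output file named by `--output_path` above):

```shell
# Stand-in for the real beam output produced by translate_ensemble.py.
printf 'the br@@ own fox\n' > output.txt.beam5

# Strip BPE continuation markers: "br@@ own" -> "brown".
sed -r 's/(@@ )|(@@ ?$)//g' output.txt.beam5 > output.detok.txt
cat output.detok.txt   # -> the brown fox
```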

@StillKeepTry
Contributor

@deep-cv-nlp I have some suggestions:

  1. Use more GPUs; larger batches tend to give better performance.
  2. How many steps did you train for? I suggest testing performance on the checkpoints from epochs 5 to 15: I found this dataset overfits easily with too many training steps, and my best result also came from epochs 5 to 15. Setting --save_periodic 1 saves a checkpoint after each epoch.

Some other tips:

  1. Try different dropout values (e.g., 0.1 to 0.3).

Besides, our pre-trained model is still under training; I will update it when I have a better result.

(But in any case, your ROUGE-1 F1 looks low: I can reach nearly 37.8 ROUGE-1 F1 with a ROUGE-2 F1 of 18.5.)

@dcn2020
Author

dcn2020 commented Jul 9, 2019

@StillKeepTry Thank you for your response.

  1. Can you provide instructions on how to use more GPUs? I tried running on a 4-GPU machine, but it seems to use only GPU 0.
  2. I just copied the command from the README file, so I assume it is 20 epochs. I will try your suggestion.

Thanks.

@StillKeepTry
Contributor

@deep-cv-nlp just use:

```
export NGPU=8; CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=$NGPU train.py
```
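For the 4-GPU machine mentioned above, the same launcher scales down; a sketch, assuming GPUs 0-3 are free (the rest of the README's train.py arguments, elided here, are appended as before):

```shell
# Same distributed launch, but four processes pinned to GPUs 0-3.
export NGPU=4
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=$NGPU train.py
```

Note that halving the GPU count halves the effective batch size unless --update-freq is raised to compensate.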

@shivgodhia

> Hi,
>
> Thank you for the great work. Recently I tried fine-tuning based on the pre-trained model (https://modelrelease.blob.core.windows.net/mass/mass_summarization_1024.pth).
>
> I followed the instructions (https://github.com/microsoft/MASS#fine-tuning-2) in the readme file and ran this command on a single GPU machine. After the command finished, I tested the output by running python translate_ensemble.py --exp_name giga_test --src_lang ar --tgt_lang ti --beam 5 --batch_size 1 --model_path ./dumped/mass_summarization/bvk6g6f9xl/checkpoint.pth --output_path ./dumped/mass_summarization/bvk6g6f9xl/output.txt.beam5 < ./data/processed/giga/test.ar-ti.ar. Then I processed the output to remove the BPE mark @@ and tested the ROUGE scores. The ROUGE scores are ROUGE-1 F1=37.2 and ROUGE-2 F1=18.8.
>
> I think I have missed something important here. Could you please instruct me how to correctly fine-tune the model?

Do you happen to know what format the data used for fine-tuning should be in?

@StillKeepTry
Contributor

StillKeepTry commented Sep 27, 2019

@deep-cv-nlp @hivestrung We have now released a pre-trained model (base setting) for the summarization task (including document-level) on fairseq. You can use it from here.

@StillKeepTry
Contributor

A copy of the cnndm fine-tuned model can be downloaded from here.

@jind11

jind11 commented Nov 17, 2019

Hi, could you share the hyper-parameters for training on Gigaword? I cannot reproduce the results reported in the paper; I am around 2 points off. Thanks!

@StillKeepTry
Contributor

A copy of the Gigaword data can be obtained from here. The commands are:

```
wget -c https://modelrelease.blob.core.windows.net/mass/gigaword.tar.gz
mkdir -p processed && tar -xvf gigaword.tar.gz -C processed

fairseq-train processed/ \
    --user-dir mass --task translation_mass --arch transformer_mass_base \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --min-lr 1e-09 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --weight-decay 0.0 --dropout 0.1 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 1 --max-tokens 4096 \
    --ddp-backend=no_c10d --max-epoch 15 \
    --max-source-positions 512 --max-target-positions 512 \
    --skip-invalid-size-inputs-valid-test \
    --load-from-pretrained-model mass-base-uncased.pt
```
--max-tokens and --update-freq need to be adjusted according to your GPUs; the hyperparameters above were set for 8 GPUs.
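As a sanity check when changing the GPU count: the effective batch size in tokens is roughly max_tokens × update_freq × num_gpus, since --update-freq accumulates gradients and so multiplies like extra GPUs. A minimal sketch (the helper name is hypothetical):

```python
# Effective batch size (in tokens) per optimizer step for fairseq-style
# distributed training with gradient accumulation.
def effective_batch_tokens(max_tokens: int, update_freq: int, num_gpus: int) -> int:
    """Approximate number of tokens contributing to each parameter update."""
    return max_tokens * update_freq * num_gpus

# Reference setting from the command above: 8 GPUs, --update-freq 1.
reference = effective_batch_tokens(4096, 1, 8)

# To match it on a single GPU, raise --update-freq until the totals agree.
single_gpu = effective_batch_tokens(4096, 8, 1)
assert single_gpu == reference  # 32768 tokens per update in both cases
```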

Unlike cnndm, Gigaword does not need --min-len set, so the inference script is:

```
fairseq-generate processed \
    --path checkpoint_best.pt \
    --user-dir mass \
    --task translation_mass \
    --batch-size 64 \
    --beam 5 \
    --lenpen 1.0 \
    --no-repeat-ngram-size 3 \
    2>&1 | tee output.txt

grep ^T output.txt | cut -f2- | sed 's/ ##//g' > tgt.txt
grep ^H output.txt | cut -f3- | sed 's/ ##//g' > hypo.txt
files2rouge hypo.txt tgt.txt
```

A fine-tuned model can be downloaded from this link. The provided checkpoint gives 38.80/19.92/36.11 (ROUGE-1/2/L).

@jind11

jind11 commented Dec 15, 2019

Hi, thanks for sharing the command for reproducing the Gigaword results. I can now reach what was reported in the paper; however, I cannot reproduce the 38.80/19.92/36.11 of your recently released trained model. The only difference in settings is that I used a single GPU for training, so I set --update-freq to 4 instead of 1. I wonder whether this could explain the performance discrepancy. Thanks!
By the way, could you also share the command for reproducing the XSum results? My re-run on that dataset is about 1 point lower (on all ROUGE metrics) than what is reported in this repo's README.
