
About Fine-tuning for Text Summarization #27

Closed
dcn2020 opened this issue Jul 9, 2019 · 9 comments

Comments

@dcn2020

dcn2020 commented Jul 9, 2019

Hi,

Thank you for the great work. Recently I tried fine-tuning from the pre-trained model (https://modelrelease.blob.core.windows.net/mass/mass_summarization_1024.pth).

I followed the instructions in the README (https://github.com/microsoft/MASS#fine-tuning-2) and ran the command on a single-GPU machine. After it finished, I tested the output by running:

```
python translate_ensemble.py --exp_name giga_test --src_lang ar --tgt_lang ti --beam 5 --batch_size 1 --model_path ./dumped/mass_summarization/bvk6g6f9xl/checkpoint.pth --output_path ./dumped/mass_summarization/bvk6g6f9xl/output.txt.beam5 < ./data/processed/giga/test.ar-ti.ar
```

Then I post-processed the output to remove the BPE marker @@ and computed ROUGE scores: ROUGE-1 F1 = 37.2 and ROUGE-2 F1 = 18.8.

I think I have missed something important here. Could you please tell me how to fine-tune the model correctly?
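A side note on the @@ removal step: with subword-nmt-style BPE (what the XLM/MASS pipeline uses), `@@ ` marks a word continuation, so a single `sed` pass restores whole words. A minimal sketch (the `printf` line is just a stand-in for the beam output file named by `--output_path` above):

```shell
# Stand-in for the real beam output produced by translate_ensemble.py.
printf 'the br@@ own fox\n' > output.txt.beam5

# Strip BPE continuation markers: "br@@ own" -> "brown".
sed -r 's/(@@ )|(@@ ?$)//g' output.txt.beam5 > output.detok.txt
cat output.detok.txt   # -> the brown fox
```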

@StillKeepTry
Contributor

@deep-cv-nlp I have some suggestions:

  1. Use more GPUs; larger batches tend to give better performance.
  2. How many steps did you train for? I suggest testing performance on the checkpoints from epochs 5 to 15: I found this dataset overfits easily with too many training steps, and my best result also came from epochs 5 to 15. Setting --save_periodic 1 saves a checkpoint after each epoch.

Some other tips:

  1. Try different dropout values (e.g., 0.1 to 0.3).

Besides, our pre-trained model is still under training; I will update it when I have a better result.

(But in any case, your ROUGE-1 F1 looks low: I can reach nearly 37.8 ROUGE-1 F1 with a ROUGE-2 F1 of 18.5.)

@dcn2020
Author

dcn2020 commented Jul 9, 2019

@StillKeepTry Thank you for your response.

  1. Can you provide instructions on how to use more GPUs? I tried running on a 4-GPU machine, but it seems to use only GPU 0.
  2. I just copied the command from the README file, so I assume it is 20 epochs. I will try your suggestion.

Thanks.

@StillKeepTry
Contributor

@deep-cv-nlp just use:

```
export NGPU=8; CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=$NGPU train.py
```
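For the 4-GPU machine mentioned above, the same launcher scales down; a sketch, assuming GPUs 0-3 are free (the rest of the README's train.py arguments, elided here, are appended as before):

```shell
# Same distributed launch, but four processes pinned to GPUs 0-3.
export NGPU=4
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=$NGPU train.py
```

Note that halving the GPU count halves the effective batch size unless --update-freq is raised to compensate.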

@shivgodhia

> Hi,
>
> Thank you for the great work. Recently I tried fine-tuning based on the pre-trained model (https://modelrelease.blob.core.windows.net/mass/mass_summarization_1024.pth).
>
> I followed the instructions (https://github.com/microsoft/MASS#fine-tuning-2) in the readme file and ran this command on a single GPU machine. After the command finished, I tested the output by running python translate_ensemble.py --exp_name giga_test --src_lang ar --tgt_lang ti --beam 5 --batch_size 1 --model_path ./dumped/mass_summarization/bvk6g6f9xl/checkpoint.pth --output_path ./dumped/mass_summarization/bvk6g6f9xl/output.txt.beam5 < ./data/processed/giga/test.ar-ti.ar. Then I processed the output to remove the BPE mark @@ and tested the ROUGE scores. The ROUGE scores are ROUGE-1 F1=37.2 and ROUGE-2 F1=18.8.
>
> I think I have missed something important here. Could you please instruct me how to correctly fine-tune the model?

Do you happen to know what format the data used for fine-tuning should be in?

@StillKeepTry
Contributor

StillKeepTry commented Sep 27, 2019

@deep-cv-nlp @hivestrung We have now released a pre-trained model (base setting) for the summarization task (including document-level) on fairseq. You can use it from here.

@StillKeepTry
Contributor

A copy of the cnndm fine-tuned model can be downloaded from here.

@jind11

jind11 commented Nov 17, 2019

Hi, could you share the hyper-parameters for training on Gigaword? I cannot reproduce the results reported in the paper; I am around 2 points off. Thanks!

@StillKeepTry
Contributor

A copy of the Gigaword data can be obtained from here. The commands are:

```
wget -c https://modelrelease.blob.core.windows.net/mass/gigaword.tar.gz
mkdir -p processed && tar -xvf gigaword.tar.gz -C processed

fairseq-train processed/ \
    --user-dir mass --task translation_mass --arch transformer_mass_base \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --min-lr 1e-09 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --weight-decay 0.0 --dropout 0.1 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 1 --max-tokens 4096 \
    --ddp-backend=no_c10d --max-epoch 15 \
    --max-source-positions 512 --max-target-positions 512 \
    --skip-invalid-size-inputs-valid-test \
    --load-from-pretrained-model mass-base-uncased.pt
```
--max-tokens and --update-freq need to be adjusted according to your GPUs; the hyperparameters above were set for 8 GPUs.
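As a sanity check when changing the GPU count: the effective batch size in tokens is roughly max_tokens × update_freq × num_gpus, since --update-freq accumulates gradients and so multiplies like extra GPUs. A minimal sketch (the helper name is hypothetical):

```python
# Effective batch size (in tokens) per optimizer step for fairseq-style
# distributed training with gradient accumulation.
def effective_batch_tokens(max_tokens: int, update_freq: int, num_gpus: int) -> int:
    """Approximate number of tokens contributing to each parameter update."""
    return max_tokens * update_freq * num_gpus

# Reference setting from the command above: 8 GPUs, --update-freq 1.
reference = effective_batch_tokens(4096, 1, 8)

# To match it on a single GPU, raise --update-freq until the totals agree.
single_gpu = effective_batch_tokens(4096, 8, 1)
assert single_gpu == reference  # 32768 tokens per update in both cases
```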

Unlike cnndm, Gigaword does not need --min-len set, so the inference script is:

```
fairseq-generate processed \
    --path checkpoint_best.pt \
    --user-dir mass \
    --task translation_mass \
    --batch-size 64 \
    --beam 5 \
    --lenpen 1.0 \
    --no-repeat-ngram-size 3 \
    2>&1 | tee output.txt

grep ^T output.txt | cut -f2- | sed 's/ ##//g' > tgt.txt
grep ^H output.txt | cut -f3- | sed 's/ ##//g' > hypo.txt
files2rouge hypo.txt tgt.txt
```

A fine-tuned model can be downloaded from this link. The provided checkpoint gives 38.80/19.92/36.11 (ROUGE-1/2/L).

@jind11

jind11 commented Dec 15, 2019

Hi, thanks for sharing the command for reproducing the Gigaword results. I can now reach what was reported in the paper; however, I cannot reproduce the 38.80/19.92/36.11 of your recently released trained model. The only difference in settings is that I used a single GPU for training, so I set --update-freq to 4 instead of 1. I wonder whether this could explain the performance discrepancy. Thanks!
By the way, could you also share the command for reproducing the XSum results? My re-run on that dataset is about 1 point lower (on all ROUGE metrics) than what is reported in this repo's README.
