
should mBART-large-en-ro have decoder_start_token_id by default? #6156

Closed
sshleifer opened this issue Jul 30, 2020 · 7 comments
Assignees: sshleifer
Labels: Help wanted · translation

Comments

@sshleifer (Contributor)

Hypothesis: since the argument prepend_bos is set to "False" in fairseq/examples/README.md, mbart-large-en-ro does not need decoder_start_token_id.

TODO:

  • create branch that deletes decoder_start_token_id. Setting it to None in the config might not be enough.
  • verify that decoder_start_token_id is in fact not being used by setting a breakpoint in generate (see the sketch after this list).
  • run_eval.py on wmt-en-ro/test and see if BLEU is >= 26.46, the score with decoder_start_token_id=250020.
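A rough sketch of the verification step, assuming the Auto* classes from transformers (the example sentence and generation settings here are illustrative, not part of the original plan):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-en-ro")
tok = AutoTokenizer.from_pretrained("facebook/mbart-large-en-ro")

# Simulate deleting the value from the config (the hypothesis under test).
model.config.decoder_start_token_id = None

batch = tok(["UN Chief Says There Is No Military Solution in Syria"], return_tensors="pt")
# Put a breakpoint inside generate() to see which token actually seeds the decoder.
generated = model.generate(**batch, num_beams=5)
print(tok.batch_decode(generated, skip_special_tokens=True))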
sshleifer added the translation and Help wanted labels, added this to "To do" in Examples/seq2seq via automation, and self-assigned the issue on Jul 30, 2020.
@KMFODA (Contributor) commented Aug 10, 2020

Hi @sshleifer, I'd like to contribute and help out here if still needed. My thinking is to remove decoder_start_token_id from run_eval.py and generation_utils.py and change the following code:

# create empty decoder_input_ids
input_ids = torch.full(
    (effective_batch_size * num_beams, 1),
    decoder_start_token_id,
    dtype=torch.long,
    device=next(self.parameters()).device,
)

to:

input_ids = torch.full(
    (effective_batch_size * num_beams, 1),
    250020,
    dtype=torch.long,
    device=next(self.parameters()).device,
)

@sshleifer (Contributor, Author) commented Aug 11, 2020

I don't think that change will do anything, since decoder_start_token_id is already 250020.

What I would do is change the 250020 to the bos_token_id (0, I think) or the pad_token_id (1) and see what the BLEU score is.
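For example, a quick way to compare candidates without editing generation_utils.py (a sketch, assuming generate accepts decoder_start_token_id as a keyword override):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-en-ro")
tok = AutoTokenizer.from_pretrained("facebook/mbart-large-en-ro")
batch = tok(["UN Chief Says There Is No Military Solution in Syria"], return_tensors="pt")

# Try each candidate start token; 250020 is the current config default.
for name, token_id in [("bos", 0), ("pad", 1), ("eos", 2), ("default", 250020)]:
    out = model.generate(**batch, num_beams=5, decoder_start_token_id=token_id)
    print(name, tok.batch_decode(out, skip_special_tokens=True))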

@KMFODA (Contributor) commented Aug 13, 2020

Ah yes, that makes sense. I tried those two plus the eos_token_id and got the following results:

ID                                 BLEU score
eos_token_id (2)                   28.22
decoder_start_token_id (250020)    28.06
pad_token_id (1)                   26.79
bos_token_id (0)                   26.01

@sshleifer (Contributor, Author)

Super interesting, thanks for running that. It seems like I should change decoder_start_token_id in the mbart-large-en-ro config to 2. Do you have opinions on mbart-large-cc25?
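A sketch of what that config change could look like (hypothetical local copy; the actual fix would go into the hosted facebook/mbart-large-en-ro config):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/mbart-large-en-ro")
config.decoder_start_token_id = 2  # eos_token_id, the best-scoring option in the table above
config.save_pretrained("./mbart-large-en-ro-eos-start")  # hypothetical local directory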

@KMFODA (Contributor) commented Aug 19, 2020

No problem! Yes, I think setting decoder_start_token_id to 2 is a good idea. Unfortunately, I'm getting the same issues you're getting with mbart-large-cc25 (the output is in English rather than Romanian and is missing the first word when I use bos_token_id or 250020, and is gibberish with eos/pad_token_id), and I don't understand why that's the case. I'll investigate and post any useful findings.

@sshleifer (Contributor, Author) commented Aug 21, 2020

I think I fixed this another way in #6526. On master:

python run_eval.py facebook/mbart-large-en-ro $ENRO_DIR/test.source eos_baseline_enro_test_generations.txt \
--reference_path $ENRO_DIR/test.target \
--score_path baseline_test_bleu_eos.json --bs 32 --task translation --fp16

=> {'bleu': 26.81}

python run_eval.py facebook/mbart-large-en-ro $ENRO_DIR/test.source \
eos_baseline_enro_test_generations.txt --reference_path $ENRO_DIR/test.target \
--score_path baseline_test_bleu_eos.json --bs 32 --task translation --fp16  \
--decoder_start_token_id 2

=> {'bleu': 11.57} (and takes 40 minutes!)

In the original fairseq I get 26.83.

@sshleifer (Contributor, Author)
Going to close this since the score is now basically the same as fairseq. Thanks for your help!
