
MBartForConditionalGeneration doesn't seem to be able to complete the task of filling mask. #25425

@5i-wanna-be-the-666

Description


System Info

transformers version: 4.29.2
Platform: Linux ubt-4090 5.15.0-75-generic
Python version: 3.9.5
PyTorch version (GPU?): 1.12.1+cu113 (True)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker @younesbelkada @patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I ran the official mask-filling example from the Hugging Face documentation, I got the expected output.

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# de_DE is the language symbol id <LID> for German
TXT = "</s> Meine Freunde sind <mask> nett aber sie essen zu viel Kuchen. </s> de_DE"

input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()
['nett', 'sehr', 'ganz', 'nicht', 'so']
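The retrieval logic in the snippet above (softmax over the vocabulary at the masked position, then top-5) can be illustrated without the model. Below is a toy sketch with a made-up vocabulary and made-up logit scores, implementing softmax and top-k in pure Python:

```python
import math

# Toy illustration of the retrieval logic above (no model download needed):
# given made-up logits for a tiny vocabulary, take the softmax at the masked
# position and pick the five token ids with the highest probability.
vocab = ["nett", "sehr", "ganz", "nicht", "so", ",", "."]
logits = [4.0, 3.5, 3.0, 2.5, 2.0, -1.0, -2.0]  # made-up scores

exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# indices sorted by probability, descending; keep the top 5
top5 = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)[:5]
print([vocab[i] for i in top5])  # ['nett', 'sehr', 'ganz', 'nicht', 'so']
```

This is exactly what `probs.topk(5)` does in the PyTorch version, just spelled out.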

But when I changed the text to be filled to Chinese, the result was unexpected.

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# zh_ZH is intended as the language symbol id <LID> for Chinese
TXT = "</s> 今天<mask>真好,我准备去公园打羽毛球. </s> zh_ZH"

input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()
[',·:.']
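One likely culprit: `zh_ZH` is not among the 25 language codes that `facebook/mbart-large-cc25` was trained with; the Chinese code for this checkpoint is `zh_CN`. If the <LID> string is not a known special token, the tokenizer splits it into ordinary subwords and the model never sees a valid language id. A quick membership check, with the code list copied from the model card so no download is needed:

```python
# Language codes used by facebook/mbart-large-cc25 (copied from the model
# card / the tokenizer's language code list, so no download is needed).
MBART_CC25_LANG_CODES = [
    "ar_AR", "cs_CZ", "de_DE", "en_XX", "es_XX", "et_EE", "fi_FI", "fr_XX",
    "gu_IN", "hi_IN", "it_IT", "ja_XX", "kk_KZ", "ko_KR", "lt_LT", "lv_LV",
    "my_MM", "ne_NP", "nl_XX", "ro_RO", "ru_RU", "si_LK", "tr_TR", "vi_VN",
    "zh_CN",
]

print("zh_ZH" in MBART_CC25_LANG_CODES)  # False -> not a valid <LID>
print("zh_CN" in MBART_CC25_LANG_CODES)  # True  -> use this for Chinese
```

So replacing `zh_ZH` with `zh_CN` in the prompt is worth trying first, though it may not fully explain the degraded predictions.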

After that, I tried to get mBART to restore a sentence with multiple masks, and the result was even worse.

from transformers import MBartTokenizer, DataCollatorForLanguageModeling, MBartForConditionalGeneration
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

TXT_input = "<s>The weather is so nice today, I am going to play badminton in the park</s>en_xx"

inputs = tokenizer([TXT_input], add_special_tokens=False, return_tensors="pt", max_length=32, padding='max_length')

masked_inputs_and_labels = data_collator([inputs])

input_ids = masked_inputs_and_labels['input_ids'][0]
attention_mask = masked_inputs_and_labels['attention_mask'][0]
labels = masked_inputs_and_labels['labels'][0]

masked_inputs = {key: value[0] for key, value in masked_inputs_and_labels.items()}
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
logits = outputs.logits

print(f'after mask: {tokenizer.decode(masked_inputs["input_ids"][0])}')

predictions = outputs.logits.argmax(dim=-1)

print(f'Predicted sentence: {tokenizer.decode(predictions[0])}')
after mask: <s> The weather is so nice today, I am going tosähkö badminton in the park</s> en_xx<pad><pad><pad><pad><pad><pad><pad><mask><pad><pad><pad>
Predicted sentence: <s>นยยยยยนนนนนน badmintonนนนap<s><s><s><s><s><s><s><s><s><s><s><s><s><s>
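Part of the mismatch here may be that `DataCollatorForLanguageModeling` applies BERT-style per-token masking, while mBART was pretrained with a different denoising objective: text infilling, where a whole contiguous span is replaced by a single `<mask>` token and the decoder target is the clean, uncorrupted sequence. A hedged pure-Python sketch of that span-infilling idea (word-level, for illustration only; the real implementation works on subword tokens and samples span lengths):

```python
# Sketch of mBART-style "text infilling" noise: a contiguous span of tokens
# is replaced by ONE <mask> token; the label is the original sequence.
# This differs from DataCollatorForLanguageModeling, which masks tokens
# individually, BERT-style. Word-level and fixed-span here for simplicity.
def infill(tokens, start, length, mask_token="<mask>"):
    """Replace tokens[start:start+length] with a single mask token."""
    return tokens[:start] + [mask_token] + tokens[start + length:]

tokens = "The weather is so nice today".split()
corrupted = infill(tokens, start=3, length=2)
print(corrupted)  # ['The', 'weather', 'is', '<mask>', 'today']
print(tokens)     # the label stays the full original sequence
```

Feeding BERT-style masked inputs (and `-100`-masked labels) to a seq2seq model that never saw that corruption pattern plausibly explains the garbled decoder output above.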

Excuse me, is there something wrong with my usage? If so, how can I correctly use mBART to fill masks?

Expected behavior

I would expect mBART to include at least one Chinese token among the five highest-probability predictions, or to restore the masked sentence for me.
For example: ['天气', '心情', ...]
Or: Predicted sentence: "The weather is so nice today, I am going to play badminton in the park en_xx"
