Description
System Info
- transformers version: 4.29.2
- Platform: Linux ubt-4090 5.15.0-75-generic
- Python version: 3.9.5
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help?
@ArthurZucker @younesbelkada @patrickvonplaten
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When I followed the official Hugging Face documentation example for mask filling, I got the expected output.
```python
from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# de_DE is the language symbol id <LID> for German
TXT = "</s> Meine Freunde sind <mask> nett aber sie essen zu viel Kuchen. </s> de_DE"
input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)
tokenizer.decode(predictions).split()
```
Output:

```
['nett', 'sehr', 'ganz', 'nicht', 'so']
```
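For reference, a quick way to double-check how the prompt is tokenized (a small sanity check I added; it only uses the standard tokenizer API) is to print the token pieces, including the <LID> token at the end:

```python
# Inspect the exact token pieces the prompt was split into (including <mask> and de_DE).
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))
```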
But when I changed the text to be filled in to Chinese, something unexpected happened.

```python
from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# zh_ZH is intended as the language symbol id <LID> for Chinese
TXT = "</s> 今天<mask>真好,我准备去公园打羽毛球. </s> zh_ZH"
input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)
tokenizer.decode(predictions).split()
```

Output:

```
[',·:.']
```
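While writing this up I noticed one thing worth checking (my own diagnostic, not from the documentation): mbart-large-cc25's language-code list uses zh_CN for Chinese, so zh_ZH may not exist in the vocabulary as a single <LID> token and would then be split into ordinary subword pieces:

```python
# A real <LID> token maps to a single vocabulary id; an unknown code gets split into pieces.
print(tokenizer.convert_tokens_to_ids("de_DE"))  # one id, a real language code
print(tokenizer.convert_tokens_to_ids("zh_CN"))  # one id, the actual Chinese language code
print(tokenizer.tokenize("zh_ZH"))               # presumably several subword pieces
```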
After that, I tried to have mBART restore a sentence with multiple masks for me, and the result was even worse.
```python
from transformers import MBartTokenizer, DataCollatorForLanguageModeling, MBartForConditionalGeneration

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

TXT_input = "<s>The weather is so nice today, I am going to play badminton in the park</s>en_xx"
inputs = tokenizer([TXT_input], add_special_tokens=False, return_tensors="pt", max_length=32, padding="max_length")
masked_inputs_and_labels = data_collator([inputs])
input_ids = masked_inputs_and_labels["input_ids"][0]
attention_mask = masked_inputs_and_labels["attention_mask"][0]
labels = masked_inputs_and_labels["labels"][0]
masked_inputs = {key: value[0] for key, value in masked_inputs_and_labels.items()}

outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
logits = outputs.logits
print(f'after mask: {tokenizer.decode(masked_inputs["input_ids"][0])}')
predictions = outputs.logits.argmax(dim=-1)
print(f"Predicted sentence: {tokenizer.decode(predictions[0])}")
```
Output:

```
after mask: <s> The weather is so nice today, I am going tosähkö badminton in the park</s> en_xx<pad><pad><pad><pad><pad><pad><pad><mask><pad><pad><pad>
Predicted sentence: <s>นยยยยยนนนนนน badmintonนนนap<s><s><s><s><s><s><s><s><s><s><s><s><s><s>
```

Excuse me, is there something wrong with my usage? If so, how can I use mBART correctly to fill in the masks?
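One alternative I considered (a sketch based on my own assumptions, not on documented mask-filling behavior for this checkpoint) is to let the decoder regenerate the corrupted sentence autoregressively, which is closer to mBART's denoising pretraining than taking the argmax of teacher-forced logits. It assumes the proper en_XX language code; note that my script above uses en_xx, which, like zh_ZH, may not be a real <LID> token:

```python
# Sketch: decode autoregressively instead of reading per-position logits.
# Following the mBART convention, generation starts from the target language id.
generated = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    max_length=32,
)
print(f"Generated sentence: {tokenizer.decode(generated[0], skip_special_tokens=True)}")
```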
Expected behavior
I would expect at least one Chinese token among mBART's five highest-probability predictions, or for the masked sentence to be restored for me.

Such as: ['天气', '心情', ...]

Or: Predicted sentence: "The weather is so nice today, I am going to play badminton in the park en_xx"