Testing Mask Filling with T5

In [1]:
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch


In [3]:
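# NOTE: the model comes from "t5-small" and the tokenizer from "t5-base".
# The original T5 checkpoints share one SentencePiece vocabulary, so the pair works together.
model = T5ForConditionalGeneration.from_pretrained("t5-small", low_cpu_mem_usage=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("t5-base")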

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [14]:
input_string = "Mr. Dursley was the director of a firm called <extra_id_0>, which made <extra_id_1>. He was a big, solid man with a bald head. Mrs. Dursley was thin and <extra_id_2> of neck, which came in very useful as she spent so much of her time <extra_id_3>. The Dursleys had a small son called Dudley and <extra_id_4>"    
#input_string = "Hallo. Mein Name ist <extra_id_0> und ich wohne in <extra_id_1>. Das Wetter ist heute schön, deswegen werde ich <extra_id_2>."                                      
#input_string = "Learning a new language like English can be challenging, but <extra_id_0> helps to <extra_id_1> faster. Practice and <extra_id_2> every day will improve <extra_id_3> skills. Don't forget to <extra_id_4> new words regularly!"

In [15]:
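# Tokenize without special tokens, so no closing </s> is appended to the input.
inputs = tokenizer(input_string, return_tensors="pt", add_special_tokens=False).input_ids

# Generate up to 200 tokens; the model answers with sentinel-delimited spans.
outputs = model.generate(inputs, max_length=200)

# Decode with special tokens kept so the <extra_id_N> markers stay visible.
print(tokenizer.decode(outputs[0]))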

<pad><extra_id_0> the Dursleys<extra_id_1> him a great man<extra_id_2> had a bald head<extra_id_3> in the firm<extra_id_4> a son called Dudley. He was a great man with a bald head<extra_id_5> the Dursleys<extra_id_6> him a great man<extra_id_7> the Dursleys<extra_id_8> the Dursleys<extra_id_9> him a great man<extra_id_10> had a bald head<extra_id_11> had a bald head<extra_id_12> in the firm<extra_id_13> in the firm<extra_id_14> a son called Dudley<extra_id_15> the firm<extra_id_16> the firm<extra_id_17> the firm<extra_id_18> it<extra_id_19> the</s>
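
The decoded output above is a single string in which each `<extra_id_N>` marker is followed by the span predicted for that mask. A minimal sketch of how such a string could be split into spans and substituted back into the masked input follows; the helper names `parse_sentinel_output` and `fill_masks` are hypothetical and not part of transformers, and the sketch assumes that the first occurrence of a sentinel carries the intended span.

```python
import re

# Matches T5 sentinel tokens such as <extra_id_0> and captures the index.
SENTINEL = re.compile(r"<extra_id_(\d+)>")

def parse_sentinel_output(decoded):
    """Map each sentinel index to the text generated right after it."""
    # Drop padding and end-of-sequence markers before splitting.
    cleaned = decoded.replace("<pad>", "").replace("</s>", "")
    # With one capture group, split() yields [prefix, idx, span, idx, span, ...].
    parts = SENTINEL.split(cleaned)
    spans = {}
    for idx, span in zip(parts[1::2], parts[2::2]):
        # Assumption: keep only the first span seen for a given index.
        spans.setdefault(int(idx), span.strip())
    return spans

def fill_masks(masked_text, spans):
    """Replace each sentinel in masked_text with its span; unknown ones stay as-is."""
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), masked_text)

decoded = "<pad><extra_id_0> the Dursleys<extra_id_1> him a great man</s>"
spans = parse_sentinel_output(decoded)
print(fill_masks("a firm called <extra_id_0>, which made <extra_id_1>.", spans))
```

Sentinels for which the model produced no span are left untouched, which makes missing predictions easy to spot in the filled text.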


In [21]:
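def get_target_scores(text, targets, t5_tokenizer, t5_model):
  """
  A wrapper function for mask filling with (flan-)t5 that scores the candidates found by beam search.
  The target words are encoded, but the constraint is currently disabled, so they do not affect generation.
  Parameters:
    text (str): The input text with <extra_id_0> as the mask
    targets (list): A list of target words
    t5_tokenizer (T5Tokenizer): The loaded tokenizer
    t5_model (T5ForConditionalGeneration): The loaded T5 model
  Returns:
    dict: Maps each decoded beam to its sequence score
  """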
  target_numbers = len(targets)
  constrain_ids_list = []

  # encode the target words
  for target in targets:
    encoded_target_ids = t5_tokenizer(target, add_special_tokens=False).input_ids
    constrain_ids_list.append(encoded_target_ids)

  # encode the input text
  encoded = t5_tokenizer.encode_plus(text, add_special_tokens=True, return_tensors='pt')
  input_ids = encoded['input_ids']

  # generate the outputs; the target constraint (force_words_ids) is currently disabled
  outputs = t5_model.generate(input_ids=input_ids,
                          #force_words_ids=[constrain_ids_list],
                          num_beams=target_numbers+5, num_return_sequences=target_numbers+5,
                          return_dict_in_generate=True,
                          output_scores=True,
                          max_length=2)
  
  # locate the mask and split the text around it
  _mask = '<extra_id_0>'
  _0_index = text.index(_mask)
  _result_prefix = text[:_0_index]
  _result_suffix = text[_0_index + len(_mask):]

  result_dict = {}
  # decode each beam (skipping the decoder start token) and save it into the result dictionary
  for output_number, output in enumerate(outputs["sequences"]):
    _txt = t5_tokenizer.decode(output[1:], skip_special_tokens=False, clean_up_tokenization_spaces=False)

    # if _txt in targets:
    # save the beam score
    result_dict[_txt] = outputs["sequences_scores"][output_number]
    # print the completed text
    print(_result_prefix + _txt + _result_suffix)

  # return the aggregated result
  return result_dict

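# test the function with this input text
# (German: "My sister lives in Bavaria and <extra_id_0> drives a red car.")
# alternative input: 'India is a <extra_id_0> of the world.'
text = 'Meine Schwester wohnt in Bayern und <extra_id_0> fährt ein rotes Auto.'
# the targets belong to the alternative English input and are not used while the constraint is disabled
scores = get_target_scores(text, ["part", "state", "country", "democracy"], tokenizer, model)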
print(scores)

Meine Schwester wohnt in Bayern und Ich fährt ein rotes Auto.
Meine Schwester wohnt in Bayern und Meine fährt ein rotes Auto.
Meine Schwester wohnt in Bayern und <extra_id_0> fährt ein rotes Auto.
Meine Schwester wohnt in Bayern und Mein fährt ein rotes Auto.
Meine Schwester wohnt in Bayern und Die fährt ein rotes Auto.
Meine Schwester wohnt in Bayern und Das fährt ein rotes Auto.
Meine Schwester wohnt in Bayern und  fährt ein rotes Auto.
Meine Schwester wohnt in Bayern und Der fährt ein rotes Auto.
Meine Schwester wohnt in Bayern und Nach fährt ein rotes Auto.
{'Ich': tensor(-1.8359), 'Meine': tensor(-2.0625), '<extra_id_0>': tensor(-2.3125), 'Mein': tensor(-2.3438), 'Die': tensor(-3.0938), 'Das': tensor(-3.4375), '': tensor(-3.8125), 'Der': tensor(-4.), 'Nach': tensor(-4.2500)}
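
The scores printed above are negative log-scores, which are hard to compare at a glance. One way to read them, sketched below, is to apply a softmax over the returned candidates so that the weights sum to 1. The helper `normalize_scores` and its `keep` parameter are hypothetical names introduced for illustration. The resulting weights are relative to the returned beams only and are not probabilities over the full vocabulary.

```python
import math

def normalize_scores(scores, keep=None):
    """Turn a {candidate: log-score} mapping into relative weights that sum to 1.

    If `keep` is given, only candidates listed there are considered.
    """
    # float() accepts plain numbers as well as 0-dim tensors.
    values = {k: float(v) for k, v in scores.items() if keep is None or k in keep}
    if not values:
        return {}
    top = max(values.values())  # subtract the maximum for numerical stability
    exps = {k: math.exp(v - top) for k, v in values.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Example values copied from the output above, as plain floats.
example = {"Ich": -1.8359, "Meine": -2.0625, "Die": -3.0938}
for word, weight in normalize_scores(example).items():
    print(f"{word}: {weight:.3f}")
```

Passing `keep=targets` would mimic the commented-out `if _txt in targets` filter by restricting the comparison to the target words.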
