In this notebook I fine-tune two masked language models: bert-base-spanish-wwm-cased on a dataset of Tweets from the economic-political domain, and distilbert-base-german-cased on a dataset of Amazon reviews. The intention is to combine these models with the idea presented by Chris Donahue et al. in their paper (https://arxiv.org/abs/2005.05339), which proposes a new approach to the task of infilling. Both models use [MASK] tokens to predict the words most likely to fill a gap, so I adapt the paper's main idea to what these models can do.

# Fine-tuning BERT for Spanish

First of all, I download bert-base-spanish-wwm-cased to see how the model performs and to learn a little about its main features before using it.

In [None]:
! pip install datasets transformers

In [None]:
import transformers

print(transformers.__version__)
#I check the version of the transformers

In [None]:
'''Download the language model'''

from transformers import AutoModelForMaskedLM

model_checkpoint = "dccuchile/bert-base-spanish-wwm-cased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [None]:
'''Check the parameters'''
bert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> BERT number of parameters: {round(bert_num_parameters)}M'")

In [None]:
'''Create an example sentence for later'''
text = "Esto es un [MASK]"

In [None]:
'''Load the tokenizer that belongs to the same checkpoint as the model'''

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
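
The cell above ranks the candidates by their raw logits. As a minimal sketch of what that ranking means, the following toy example (made-up logits and a made-up five-word vocabulary, not output from the model) applies a softmax to turn logits into probabilities and then keeps the top k. The softmax does not change the order of the candidates; it only makes the scores easier to read.

```python
import math


def top_k_probabilities(logits, k):
    """Return the k (index, probability) pairs with the highest softmax probability."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Made-up logits over a five-word toy vocabulary (assumption, for illustration only)
toy_vocab = ["libro", "país", "problema", "error", "día"]
toy_logits = [2.0, 0.5, 3.1, 1.2, -0.7]
for index, prob in top_k_probabilities(toy_logits, k=3):
    print(f"{toy_vocab[index]}: {prob:.3f}")
```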

Now I load the dataset, which can be found in the Hugging Face repository: https://huggingface.co/datasets/jhonparra18/petro-tweets

I have not cleaned the dataset, but the model's performance could probably be improved by removing the web links: quite a few Tweets contain them, and as a result punctuation such as semicolons and commas shows up among the most probable words.
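
A minimal sketch of the cleaning step mentioned above, which I did not apply in this notebook. The helper name `strip_urls` and the regular expression are my own assumptions; the pattern only covers links that start with http://, https:// or www.

```python
import re

# Hypothetical helper (not part of the original notebook): matches http(s) links
# and bare www. links up to the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text):
    """Remove web links and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()


# Made-up Tweet, for illustration only
example = "Reforma tributaria ya https://t.co/abc123 , más info en www.ejemplo.org"
print(strip_urls(example))
```

If wanted, a function like this could be applied to the "Tweet" column with `dataset.map` before tokenizing, for example `dataset.map(lambda row: {"Tweet": strip_urls(row["Tweet"])})`.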

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

dataset = load_dataset('jhonparra18/petro-tweets','es')
dataset

In [None]:
'''I create random samples, to check the data more accurately'''
sample = dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Tweet: {row['Tweet']}'")

In [None]:
'''Now I start with the preprocessing of the data.
First of all, I tokenize the sentences.'''

def tokenize_function(examples):
    result = tokenizer(examples["Tweet"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


#I get rid of the columns, because I don't really need them for later
tokenized_datasets = dataset.map(
    tokenize_function, batched=True, remove_columns=['Tweet','Date','User']
)
tokenized_datasets

#The result should show input_ids, attention_mask, word_ids and in my case token_type_ids but not every Bert has this last one.

In [None]:
#I check the models max_length to know more or less what chunk size to use
tokenizer.model_max_length
#I stick to this size because if not I will have problems with Colab
chunk_size = 128

In [None]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

In [None]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

In [None]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

In [None]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
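
To make the effect of group_texts easier to see, here is a toy version that runs on made-up token ids with a chunk size of 4. It is a sketch rather than the function used for training: `concatenate_and_chunk` is a hypothetical name, it takes the chunk size as an argument instead of reading the global, and it copies each chunk for the labels.

```python
def concatenate_and_chunk(batch, chunk_size):
    """Toy version of group_texts with an explicit chunk_size argument."""
    # Flatten every column (a list of lists) into one long list
    concatenated = {
        key: [tok for seq in seqs for tok in seq] for key, seqs in batch.items()
    }
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so that every chunk has exactly chunk_size items
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        key: [values[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for key, values in concatenated.items()
    }
    # Labels start as a copy of the inputs; masking happens later in the collator
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Made-up token ids for three short "Tweets" (10 tokens in total)
toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
toy_chunks = concatenate_and_chunk(toy_batch, chunk_size=4)
print(toy_chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The last two tokens (9 and 10) do not fill a whole chunk, so they are dropped, exactly as the comment in group_texts says.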

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

In [None]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

When training models for masked language modeling, one technique that can be used is to mask whole words together, not just individual tokens. 

This approach is called whole word masking. If I want to use whole word masking, I'll need to build a data collator.  
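
The key step of whole word masking is knowing which tokens belong to the same word. A minimal sketch of that grouping, using a made-up word_ids list (the real one comes from the fast tokenizer) and a hypothetical helper name `group_tokens_by_word`:

```python
import collections


def group_tokens_by_word(word_ids):
    """Map a running word index to the positions of the tokens that form that word."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        if word_id != current_word:
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)


# Hypothetical tokenization of "La corrupción": [CLS] La corrup ##ción [SEP]
toy_word_ids = [None, 0, 1, 1, None]
print(group_tokens_by_word(toy_word_ids))  # {0: [1], 1: [2, 3]}
```

Masking word 1 then means replacing the tokens at positions 2 and 3 together, instead of masking only one of the two pieces.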

In [None]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

I downsample the dataset before training; otherwise training would take a lot of time and I would also run out of GPU resources.

In [None]:
train_size = 5_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

In [None]:
'''This is not really necessary if you don't want your model to be in Hugging Face'''
from huggingface_hub import notebook_login

notebook_login()

For the training I set the number of epochs to 20 to keep it quick, although this does not give a really good result. I also set remove_unused_columns to False so that the Trainer does not drop the word_ids column, which the whole word masking collator needs.

Apart from this, output_dir can be changed to any path on your own computer.

In [None]:
'''Specify the arguments for the Trainer'''
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-tweets",
    overwrite_output_dir=True,
    # Newer transformers releases rename this argument to `eval_strategy`
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    logging_steps=logging_steps,
    remove_unused_columns=False,
    num_train_epochs=20,
)

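Now I create the Trainer. I use the whole_word_masking_data_collator, but the standard data_collator defined earlier would also work (in that case the word_ids column has to be dropped first, as in the sample cell above).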

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=whole_word_masking_data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

As an evaluation I compute the perplexity, which is the exponential of the evaluation cross-entropy loss. A lower perplexity means a better language model, and the perplexity in this case is not that bad.

In [None]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
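
To get a feeling for the scale of this metric, here is a small sketch with made-up loss values (they are not results from this notebook). The helper name `perplexity_from_loss` is my own. A loss of 0 gives a perplexity of 1, and a model that guesses uniformly among V tokens has a loss of ln(V) and therefore a perplexity of V.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# Made-up losses, only to show how the metric grows with the loss
for loss in (0.0, 1.0, 2.5):
    print(f"loss={loss:.1f} -> perplexity={perplexity_from_loss(loss):.2f}")
```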

In [None]:
trainer.push_to_hub()

In [None]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="mariav/bert-base-spanish-wwm-cased-finetuned-tweets"
)

I then feed the pipeline my sample text from before and look at the top 5 predictions. They should be similar to the earlier ones, but should also include words from the dataset used for fine-tuning.

In [None]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

# Fine-tuning DistilBERT for German

In this case I use distilbert-base-german-cased and the amazon_reviews_multi dataset.

The process is the same as before, just changing what is necessary to include the new dataset.

In [None]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-german-cased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [None]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> Distilbert number of parameters: {round(distilbert_num_parameters)}M'")

In [None]:
text = "Das ist eine gute [MASK]"

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

In [None]:
from datasets import load_dataset

dataset = load_dataset('amazon_reviews_multi','de')
dataset

In [None]:
sample = dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['review_body']}'")

In [None]:
def tokenize_function(examples):
    result = tokenizer(examples["review_body"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


tokenized_datasets = dataset.map(
    tokenize_function, batched=True, remove_columns=["review_body",'review_id','product_id','reviewer_id','stars','review_title','language','product_category']
)
tokenized_datasets

In [None]:
tokenizer.model_max_length

chunk_size = 128

In [None]:
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

In [None]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

In [None]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

In [None]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

In [None]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

In [None]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

In [None]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

In [None]:
train_size = 9_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-reviews",
    overwrite_output_dir=True,
    # Newer transformers releases rename this argument to `eval_strategy`
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    logging_steps=logging_steps,
    remove_unused_columns=False,
    num_train_epochs=10,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=whole_word_masking_data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

In [None]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

In [None]:
trainer.push_to_hub()

In [None]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="mariav/distilbert-base-german-cased-finetuned-amazon-reviews"
)

In [None]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

# ILM inference

In this part I adapt the main idea of the previously mentioned paper by Chris Donahue et al., because their ILM is not directly applicable to my own models. The basic idea is to create my own fill-in-the-blank prompts using my own tokenizer and model. I create a context (such as 'La ___ es azul') related to the domain of my dataset, tokenize it, replace the blank(s) with the special token(s) that my model recognizes as placeholders, and then pass the resulting tokens through the model to predict the missing word(s).

For this I used two different approaches for both models:
- Generating only one [MASK] in the sentence.
- Generating two [MASK] in the sentence.

For the first one I provide a context and for the second one I don't, to show how important the context is: it really changes the results.
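
As a minimal sketch of the fill-in-the-blank idea, the following helper turns a prompt with blanks into a prompt with mask tokens. The name `blanks_to_masks` and the "___" blank marker are assumptions for illustration; in the notebook the mask token would come from `tokenizer.mask_token`.

```python
def blanks_to_masks(prompt, mask_token="[MASK]", blank="___"):
    """Replace every blank marker in the prompt with the model's mask token."""
    return prompt.replace(blank, mask_token)


prompt = "La ___ es azul"
masked_prompt = blanks_to_masks(prompt)
print(masked_prompt)  # La [MASK] es azul
```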

## Spanish-BERT

### Generating one [MASK]
For the generation of only one [MASK] in the sentence I use a pre-trained tokenizer and language model to generate predictions for masked tokens in a list of sentences. 

I load the tokenizer and model using AutoTokenizer and AutoModelForMaskedLM from the Transformers library. I define a list of sentences with masked tokens and a context sentence, and tokenize both using the tokenizer. The context sentence is repeated for each input sentence and the resulting tensors are concatenated along the sequence dimension.

Finally, the model is used to generate predictions for the masked tokens in the concatenated tensor, and the predicted tokens are decoded and printed for each input sentence. 

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("mariav/bert-base-spanish-wwm-cased-finetuned-tweets")
model = AutoModelForMaskedLM.from_pretrained("mariav/bert-base-spanish-wwm-cased-finetuned-tweets")

# Define the sentences with masked tokens and context
sentences = [
    "Hoy en día, [MASK] es un tema muy importante en la sociedad.",
    "La [MASK] es una de las preocupaciones más importantes en la política actual.",
    "La [MASK] en la educación es un problema que se debe abordar.",
    "La [MASK] de género es un tema que necesita más atención en nuestra sociedad.",
]

context = "La corrupción es uno de los principales problemas en la política y la sociedad actual."

# Tokenize the context (the sentences are tokenized further down, right before the context is prepended)
tokenized_context = tokenizer(context, padding=True, truncation=True, return_tensors="pt")

# Repeat the tokenized context for each input sentence
num_sentences = len(sentences)
repeated_context = {}
for k, v in tokenized_context.items():
    repeated_context[k] = v.repeat(num_sentences, 1)
# Tokenize the sentences and prepend the context
tokenized_sentences = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
tokenized_sentences["input_ids"] = torch.cat([repeated_context["input_ids"], tokenized_sentences["input_ids"]], dim=1)
tokenized_sentences["attention_mask"] = torch.cat([repeated_context["attention_mask"], tokenized_sentences["attention_mask"]], dim=1)

# Generate predictions for the masked tokens in the sentences
with torch.no_grad():
    # The tokenizer already returned tensors, so they can be passed to the model directly
    outputs = model(input_ids=tokenized_sentences["input_ids"], attention_mask=tokenized_sentences["attention_mask"])
    predictions = outputs.logits.argmax(dim=-1)

# Print the predicted tokens
for i, sentence in enumerate(sentences):
    mask_index = torch.where(tokenized_sentences["input_ids"][i] == tokenizer.mask_token_id)[0][0]
    token = predictions[i][mask_index].item()
    predicted_token = tokenizer.decode([token])
    completed_sentence = sentence.replace(tokenizer.mask_token, predicted_token)
    print(completed_sentence)
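
Because the context is prepended, the position of [MASK] in the model input is shifted by the length of the tokenized context. That is why the cell above looks for the mask in the concatenated input_ids and not in the original sentence. A toy sketch with made-up token ids (`MASK_ID` and the helper names are hypothetical stand-ins, not values from the tokenizer):

```python
MASK_ID = 0  # hypothetical id standing in for tokenizer.mask_token_id


def prepend_context(context_ids, sentence_ids):
    """Concatenate context and sentence ids, mirroring torch.cat(..., dim=1) for one row."""
    return context_ids + sentence_ids


def first_mask_position(input_ids, mask_id=MASK_ID):
    """Index of the first mask token, or None when there is no mask."""
    for position, token_id in enumerate(input_ids):
        if token_id == mask_id:
            return position
    return None


context_ids = [101, 7, 8, 9, 102]  # made-up ids: [CLS] ... [SEP]
sentence_ids = [101, 5, MASK_ID, 6, 102]  # made-up ids: [CLS] ... [MASK] ... [SEP]
combined = prepend_context(context_ids, sentence_ids)
print(first_mask_position(sentence_ids), first_mask_position(combined))  # 2 7
```

Note that with this way of concatenating, each row contains two [CLS] ... [SEP] blocks, one for the context and one for the sentence.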


### Using two [MASK]
For this part, I create a function (predict_missing_words) that allows two words to be predicted in the same sentence. I don't consider this approach really great, since the masking is not done randomly: you need to specify which words of the sentence you want to replace with [MASK].

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("mariav/bert-base-spanish-wwm-cased-finetuned-tweets")
model = AutoModelForMaskedLM.from_pretrained("mariav/bert-base-spanish-wwm-cased-finetuned-tweets")

def predict_missing_words(sentence, mask_words=("", ""), top_k=5):
    # Encode the sentence with special tokens and recover the token strings
    input_ids = tokenizer.encode(sentence, add_special_tokens=True)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Find indices of words to mask. A word only matches if the tokenizer keeps
    # it as a single token; words split into several word pieces are skipped.
    mask_idx = []
    for mask_word in mask_words:
        mask_idx += [i for i, token in enumerate(tokens) if token == mask_word]

    # Mask words and get new input IDs
    masked_input_ids = input_ids.copy()
    for i in mask_idx:
        masked_input_ids[i] = tokenizer.mask_token_id

    # Convert the masked input IDs to a tensor
    masked_input_ids_tensor = torch.tensor([masked_input_ids])

    # Generate predictions for the masked tokens
    with torch.no_grad():
        predictions = model(masked_input_ids_tensor)[0]

    # Get top-k predicted words for each masked token
    predicted_words = []
    for i in mask_idx:
        predicted_token_ids = predictions[0, i].topk(k=top_k).indices.tolist()
        predicted_words.append([tokenizer.convert_ids_to_tokens([tok_id])[0] for tok_id in predicted_token_ids])

    # Generate all possible sentence combinations
    sentence_combinations = [input_ids]
    for i, predicted_word_set in enumerate(predicted_words):
        new_sentence_combinations = []
        for sentence in sentence_combinations:
            for predicted_word in predicted_word_set:
                new_sentence = sentence.copy()
                new_sentence[mask_idx[i]] = tokenizer.convert_tokens_to_ids(predicted_word)
                new_sentence_combinations.append(new_sentence)
        sentence_combinations = new_sentence_combinations

    # Convert all sentence combinations to strings, dropping [CLS] and [SEP]
    return [tokenizer.decode(sentence, skip_special_tokens=True) for sentence in sentence_combinations]

In [None]:
import torch

sentence = "La república es una falsa democracia."
predicted_sentences = predict_missing_words(sentence, mask_words=["república", "democracia"], top_k=5)
for predicted_sentence in predicted_sentences:
    print(predicted_sentence)
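
The last step of predict_missing_words builds every combination of the candidates for the two masks, so top_k=5 gives 5 x 5 = 25 sentences. The same idea can be sketched with itertools.product on plain strings. The candidates below are made up (the real ones would come from the model's top-k predictions), and `combine_candidates` is a hypothetical helper name.

```python
from itertools import product


def combine_candidates(tokens, mask_positions, candidates_per_mask):
    """Fill each masked position with every combination of its candidate tokens."""
    sentences = []
    for choice in product(*candidates_per_mask):
        filled = list(tokens)
        for position, candidate in zip(mask_positions, choice):
            filled[position] = candidate
        sentences.append(" ".join(filled))
    return sentences


# Made-up candidates, for illustration only
toy_tokens = ["La", "[MASK]", "es", "una", "falsa", "[MASK]", "."]
toy_candidates = [["paz", "ley"], ["promesa", "idea"]]
for combined_sentence in combine_candidates(toy_tokens, [1, 5], toy_candidates):
    print(combined_sentence)
```

Both masks are predicted in the same forward pass, so the candidates for one gap do not take the word chosen for the other gap into account; the combinations are simply enumerated, not scored jointly.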

## German-DistilBert

For this part, I apply the same two approaches to the German model, just with different sentences and a context better suited to the dataset.

### Generating only one [MASK]

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("mariav/distilbert-base-german-cased-finetuned-amazon-reviews")
model = AutoModelForMaskedLM.from_pretrained("mariav/distilbert-base-german-cased-finetuned-amazon-reviews")

# Define the sentences 
sentences = ["Das [MASK] sagt mir nicht zu.", 'Ich empfehle [MASK] allen.','Ich lese immer [MASK].','Die [MASK] haben mir geholfen.']
context = "Ich habe kürzlich ein Produkt auf Amazon gekauft und war mit der Qualität und dem Service sehr zufrieden."

# Tokenize the context (the sentences are tokenized further down, right before the context is prepended)
tokenized_context = tokenizer(context, padding=True, truncation=True, return_tensors="pt")

# Repeat the tokenized context for each input sentence
num_sentences = len(sentences)
repeated_context = {}
for k, v in tokenized_context.items():
    repeated_context[k] = v.repeat(num_sentences, 1)

# Tokenize the sentences and prepend the context
tokenized_sentences = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
tokenized_sentences["input_ids"] = torch.cat([repeated_context["input_ids"], tokenized_sentences["input_ids"]], dim=1)
tokenized_sentences["attention_mask"] = torch.cat([repeated_context["attention_mask"], tokenized_sentences["attention_mask"]], dim=1)

# Generate predictions for the masked tokens in the sentences
with torch.no_grad():
    # The tokenizer already returned tensors, so they can be passed to the model directly
    outputs = model(input_ids=tokenized_sentences["input_ids"], attention_mask=tokenized_sentences["attention_mask"])
    predictions = outputs.logits.argmax(dim=-1)

# Print the predicted tokens
for i, sentence in enumerate(sentences):
    mask_index = torch.where(tokenized_sentences["input_ids"][i] == tokenizer.mask_token_id)[0][0]
    token = predictions[i][mask_index].item()
    predicted_token = tokenizer.decode([token])
    completed_sentence = sentence.replace(tokenizer.mask_token, predicted_token)
    print(completed_sentence)

### Generating two [MASK]

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("mariav/distilbert-base-german-cased-finetuned-amazon-reviews")
model = AutoModelForMaskedLM.from_pretrained("mariav/distilbert-base-german-cased-finetuned-amazon-reviews")

def predict_missing_words(sentence, mask_words=("", ""), top_k=5):
    # Encode the sentence with special tokens and recover the token strings
    input_ids = tokenizer.encode(sentence, add_special_tokens=True)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Find indices of words to mask. A word only matches if the tokenizer keeps
    # it as a single token; words split into several word pieces are skipped.
    mask_idx = []
    for mask_word in mask_words:
        mask_idx += [i for i, token in enumerate(tokens) if token == mask_word]

    # Mask words and get new input IDs
    masked_input_ids = input_ids.copy()
    for i in mask_idx:
        masked_input_ids[i] = tokenizer.mask_token_id

    # Convert the masked input IDs to a tensor
    masked_input_ids_tensor = torch.tensor([masked_input_ids])

    # Generate predictions for the masked tokens
    with torch.no_grad():
        predictions = model(masked_input_ids_tensor)[0]

    # Get top-k predicted words for each masked token
    predicted_words = []
    for i in mask_idx:
        predicted_token_ids = predictions[0, i].topk(k=top_k).indices.tolist()
        predicted_words.append([tokenizer.convert_ids_to_tokens([tok_id])[0] for tok_id in predicted_token_ids])

    # Generate all possible sentence combinations
    sentence_combinations = [input_ids]
    for i, predicted_word_set in enumerate(predicted_words):
        new_sentence_combinations = []
        for sentence in sentence_combinations:
            for predicted_word in predicted_word_set:
                new_sentence = sentence.copy()
                new_sentence[mask_idx[i]] = tokenizer.convert_tokens_to_ids(predicted_word)
                new_sentence_combinations.append(new_sentence)
        sentence_combinations = new_sentence_combinations

    # Convert all sentence combinations to strings, dropping [CLS] and [SEP]
    return [tokenizer.decode(sentence, skip_special_tokens=True) for sentence in sentence_combinations]


In [None]:
import torch

sentence = "Das Produkt hat mir nicht gefallen."
predicted_sentences = predict_missing_words(sentence, mask_words=["Produkt", "gefallen"], top_k=5)
print(predicted_sentences)
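
One limitation of predict_missing_words is that it compares each word in mask_words with single tokens, so a word that the tokenizer splits into several word pieces is never found. A possible way around this, sketched on a made-up token list, is to merge the "##" continuation pieces before comparing. The helper `find_word_spans` and the split of "gefallen" shown here are assumptions; the real split depends on the tokenizer vocabulary.

```python
def find_word_spans(tokens, word, continuation_prefix="##"):
    """Return (start, end) index pairs whose merged word pieces spell `word`."""
    spans = []
    start = 0
    while start < len(tokens):
        # Extend the span over all continuation pieces that follow tokens[start]
        end = start + 1
        while end < len(tokens) and tokens[end].startswith(continuation_prefix):
            end += 1
        merged = tokens[start] + "".join(
            piece[len(continuation_prefix):] for piece in tokens[start + 1 : end]
        )
        if merged == word:
            spans.append((start, end))
        start = end
    return spans


# Hypothetical WordPiece split, for illustration only
toy_tokens = ["[CLS]", "Das", "Produkt", "hat", "mir", "nicht", "gef", "##allen", ".", "[SEP]"]
print(find_word_spans(toy_tokens, "gefallen"))  # [(6, 8)]
print(find_word_spans(toy_tokens, "Produkt"))  # [(2, 3)]
```

Every position inside a returned span would then be replaced by the mask token, which is in the same spirit as the whole word masking used during fine-tuning.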