xlm-mlm-17-1280 model masked word prediction #1842

ceatlinar · 2019-11-15T17:41:37Z

Hi,
I would like some help using the pretrained xlm-mlm-17-1280 model for masked word prediction. I followed http://mayhewsw.github.io/2019/01/16/can-bert-generate-text/ for BERT mask prediction, and it works. Could you help me use the xlm-mlm-17-1280 model the same way? I need predictions for Turkish, which is one of the model's 17 languages.

Bachstelze · 2019-11-17T08:57:52Z

Would it be possible to use XLM-R (#1769)? Its model has a simple description (Masked Language Models, chapter 3) and is similar to BERT-Base apart from tokenization, training configuration, and language embeddings.

ceatlinar · 2019-11-17T14:25:34Z

Hi,
Thanks for the advice, but I don't know whether the model you mentioned has a pretrained version for Turkish, which is the language I need. I also specifically need to use the model I asked about. Any tips on how to get masked word predictions from that model would be great. Thanks in advance.

Bachstelze · 2019-11-17T18:26:04Z

There are also multilingual pretrained BERT models that we could try. Quality usually decreases in large multilingual models that cover very different languages.
But they mostly share the architecture of bert-base, so we could rerun the linked example with the line modelpath = "bert-base-multilingual-cased".

ceatlinar · 2019-11-17T18:38:05Z

I get the following warning and error when trying modelpath = "bert-base-multilingual-cased".
Sorry, I am not familiar with transformers, so it may be an easy error to fix, but I don't know how:
The pre-trained model you are loading is a cased model but you have not set do_lower_case to False. We are setting do_lower_case=False for you but you may want to check this behavior.
Traceback (most recent call last):
File "e.py", line 13, in
masked_index = tokenized_text.index(target)
ValueError: 'hungry' is not in list

Bachstelze · 2019-11-17T22:52:52Z

'hungry' is in the list, but as two tokens since the multilingual model has a different vocabulary. Therefore, we have to tokenize the target word. Check this out:

#!/usr/bin/python3
#
# first Axiom: Aaron Swartz is everything
# second Axiom: The Schwartz Space is his description of physical location
# first conclusion: His linear symmetry is the Fourier transform
# second conclusion: His location is the Montel space
# Third conclusion: His location is the Fréchet space

import torch
from transformers import BertTokenizer, BertForMaskedLM

modelname = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(modelname)
# Load the pre-trained masked-LM model (weights) once, in evaluation mode
model = BertForMaskedLM.from_pretrained(modelname)
model.eval()

def predictMask(maskedText, masked_index):
    # Convert tokens to vocabulary indices
    indexed_tokens = tokenizer.convert_tokens_to_ids(maskedText)
    # Single-sentence input: every token belongs to segment 0 (sentence A, see paper)
    segments_ids = [0] * len(maskedText)

    # Convert inputs to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Predict all tokens; pass the segment ids by keyword so they are not
    # mistaken for the attention mask, and skip gradient tracking for inference
    with torch.no_grad():
        predictions = model(tokens_tensor, token_type_ids=segments_tensors)
    predicted_index = torch.argmax(predictions[0][0][masked_index]).item()
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])

    print("Original:", text)
    print("Masked:", " ".join(maskedText))

    print("Predicted token:", predicted_token)
    maskedText[masked_index] = predicted_token[0]

    # delete this section for faster inference
    print("Other options:")
    # just curious about what the next few options look like.
    for i in range(10):
        predictions[0][0][masked_index][predicted_index] = -11100000
        predicted_index = torch.argmax(predictions[0][0][masked_index]).item()
        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])
        print(predicted_token)


    print("Masked, tokenized text with the prediction:", maskedText)
    return maskedText


text = "let´s go fly a kite!"
target = "kite"
tokenized_text = tokenizer.tokenize(text)
tokenized_target = tokenizer.tokenize(target)
print("tokenized text:", tokenized_text)
print("tokenized target:", tokenized_target)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = tokenized_text.index(tokenized_target[0])
for i in range(len(tokenized_target)):
    tokenized_text[masked_index+i] = '[MASK]'

for i in range(len(tokenized_target)):
    tokenized_text = predictMask(tokenized_text, masked_index+i)
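
One caveat about the lookup above: tokenized_text.index(tokenized_target[0]) only matches the first piece of the target, so it can land on the wrong position if another word starts with the same piece. A minimal sketch of a stricter lookup that matches the whole piece sequence is below. The helper name find_sublist and the token lists are made up for illustration and are not real tokenizer output.

```python
def find_sublist(tokens, target):
    """Return the index where `target` starts inside `tokens`, or -1 if absent."""
    n = len(target)
    if n == 0:
        return -1
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == target:
            return start
    return -1


# Hypothetical WordPiece-style tokens, for illustration only
tokenized_text = ["I", "am", "hu", "##ng", "##ry", "and", "hu", "##man"]
tokenized_target = ["hu", "##man"]

# Matching on the full sequence skips the earlier "hu" that belongs to another word
masked_index = find_sublist(tokenized_text, tokenized_target)

# Mask every piece of the target, leaving the original list untouched
masked_text = list(tokenized_text)
for i in range(len(tokenized_target)):
    masked_text[masked_index + i] = "[MASK]"

print(masked_index, masked_text)
```

With this, a return value of -1 can be checked explicitly instead of relying on the ValueError raised by list.index.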

ceatlinar · 2019-11-18T08:16:16Z

I tried the code, but it gives word-piece suggestions rather than whole words, and the suggestions are poor. Thank you so much for your effort, but this is not useful for me unless I can somehow get whole-word suggestions. I am also still looking for a way to get predictions from the XLM model; if anyone could help, that would be great.

Bachstelze · 2019-11-19T20:59:24Z

Don't the pieces build complete words in the end?
See my first answer about XLM-R; the model mentioned there supports Turkish.
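
To illustrate what I mean by the pieces building words: in BERT's WordPiece output, a piece that continues the previous word starts with "##", so the pieces can be glued back together. This is only a rough sketch. The function name merge_wordpieces and the example pieces are my own assumptions for illustration, not actual model output.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into words: a '##' prefix marks a continuation piece."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word without the prefix
            words[-1] += token[2:]
        else:
            # Start of a new word (or punctuation)
            words.append(token)
    return words


# Hypothetical pieces, as they might look after filling in the masks
pieces = ["let", "'", "s", "go", "fly", "a", "ki", "##te", "!"]
print(merge_wordpieces(pieces))
```

Note that this only reassembles whatever pieces were predicted. Each mask is still filled independently, so the merged result is not guaranteed to be a real word.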

LysandreJik · 2019-11-19T21:58:04Z

Hi, you can predict a masked word with XLM as you would do with any other MLM-based model. Here's an example using the checkpoint xlm-mlm-17-1280 you mentioned:

from transformers import XLMTokenizer, XLMWithLMHeadModel
import torch

# load tokenizer
tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-17-1280")

# encode sentence with a masked token in the middle
sentence = torch.tensor([tokenizer.encode("This was the first time Nicolas ever saw a " + tokenizer.mask_token + ". It was huge.")])

# Identify the masked token position
masked_index = torch.where(sentence == tokenizer.mask_token_id)[1].tolist()[0]

# Load model
model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-17-1280")

# Get the five top answers
result = model(sentence)
result = result[0][:, masked_index].topk(5).indices
result = result.tolist()[0]

print(tokenizer.decode(result))
# monster dragon snake wolf tiger

ceatlinar · 2019-11-21T17:23:39Z

Thank you so much for the replies, guys; they have been very helpful.

Bachstelze mentioned this issue Nov 17, 2019

XLM Masked Word Prediction #1857

Closed

ceatlinar closed this as completed Nov 21, 2019

ceatlinar mentioned this issue Nov 21, 2019

xlm-mlm-17-1280 for masked word prediction facebookresearch/XLM#237

Closed

valdrox mentioned this issue Dec 9, 2019

XLM model masked word prediction Double Language #2112

Closed
