xlm-mlm-17-1280 model masked word prediction #1842

Closed
ceatlinar opened this issue Nov 15, 2019 · 9 comments

Comments

@ceatlinar

Hi
I would like some help with using the pretrained xlm-mlm-17-1280 model for masked word prediction. I followed http://mayhewsw.github.io/2019/01/16/can-bert-generate-text/ for BERT mask prediction and it works. Could you help me do the same with the xlm-mlm-17-1280 model? I need predictions for Turkish, which is one of the model's 17 languages.

@Bachstelze

Would it be possible to use XLM-R #1769? Its model has a simple description (Masked Language Models in chapter 3) and is similar to BERT-Base apart from tokenization, training configuration and language embeddings.

@ceatlinar
Author

Hi
Thanks for the advice, but I don't know whether the model you mentioned has a pretrained version for Turkish, which is the language I need. I also specifically need to use the model I asked about. Any tips on how to get masked word predictions from that model would be great. Thanks in advance.

@Bachstelze

Bachstelze commented Nov 17, 2019

There are also multilingual pretrained BERT models that we could try. Quality usually drops in large multilingual models that cover very different languages.
But they have mostly the same architecture as bert-base, so we could try rerunning the linked example with the line modelpath = "bert-base-multilingual-cased".

@ceatlinar
Author

I get the following warning and error when trying modelpath = "bert-base-multilingual-cased".
Sorry, I am not familiar with transformers, so it may be an easy error to fix, but I don't know how:
The pre-trained model you are loading is a cased model but you have not set do_lower_case to False. We are setting do_lower_case=False for you but you may want to check this behavior.
Traceback (most recent call last):
File "e.py", line 13, in
masked_index = tokenized_text.index(target)
ValueError: 'hungry' is not in list

@Bachstelze

'hungry' is in the list, but as two tokens since the multilingual model has a different vocabulary. Therefore, we have to tokenize the target word. Check this out:

#!/usr/bin/python3
#
# first Axiom: Aaron Swartz is everything
# second Axiom: The Schwartz Space is his description of physical location
# first conclusion: His linear symmetry is the Fourier transform
# second conclusion: His location is the Montel space
# Third conclusion: His location is the Fréchet space

import torch
from transformers import BertTokenizer, BertForMaskedLM

modelname = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(modelname)
# Load the masked-LM model once instead of on every call
model = BertForMaskedLM.from_pretrained(modelname)
model.eval()

def predictMask(maskedText, masked_index):
    # Convert tokens to vocabulary indices
    indexed_tokens = tokenizer.convert_tokens_to_ids(maskedText)
    # Single-sentence input: every token belongs to segment 0 (see paper)
    segments_ids = [0] * len(maskedText)

    # Convert inputs to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Predict all tokens; no gradients are needed for inference
    with torch.no_grad():
        predictions = model(tokens_tensor, token_type_ids=segments_tensors)
    # Vocabulary scores at the masked position
    scores = predictions[0][0][masked_index]
    # Best candidate first, followed by the next ten options
    top_indices = torch.topk(scores, 11).indices.tolist()
    top_tokens = tokenizer.convert_ids_to_tokens(top_indices)
    predicted_token = top_tokens[:1]

    print("Original:", text)
    print("Masked:", " ".join(maskedText))

    print("Predicted token:", predicted_token)
    maskedText[masked_index] = predicted_token[0]

    # delete this section for faster inference
    print("Other options:")
    # just curious about what the next few options look like.
    for token in top_tokens[1:]:
        print([token])

    print("Masked, tokenized text with the prediction:", maskedText)
    return maskedText


text = "let´s go fly a kite!"
target = "kite"
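# BERT was trained on inputs wrapped in [CLS] ... [SEP], so add them here
tokenized_text = [tokenizer.cls_token] + tokenizer.tokenize(text) + [tokenizer.sep_token]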
tokenized_target = tokenizer.tokenize(target)
print("tokenized text:", tokenized_text)
print("tokenized target:", tokenized_target)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = tokenized_text.index(tokenized_target[0])
for i in range(len(tokenized_target)):
    tokenized_text[masked_index+i] = '[MASK]'

for i in range(len(tokenized_target)):
    tokenized_text = predictMask(tokenized_text, masked_index+i)
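
The masking step above looks up only the first piece of the target, so it can land on the wrong position when that piece also appears earlier in the sentence. A minimal sketch of a stricter lookup that matches the whole run of target pieces, assuming the tokens are plain Python lists. `find_span` and `mask_span` are hypothetical helpers, not part of transformers, and the word pieces below are made up for illustration rather than real tokenizer output:

```python
def find_span(tokens, target_tokens):
    """Return the start index of the first contiguous occurrence of
    target_tokens inside tokens, or -1 if there is none."""
    n = len(target_tokens)
    if n == 0:
        return -1
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == target_tokens:
            return start
    return -1


def mask_span(tokens, target_tokens, mask_token="[MASK]"):
    """Return a masked copy of tokens plus the list of masked positions."""
    start = find_span(tokens, target_tokens)
    if start == -1:
        raise ValueError("target tokens not found in the token list")
    positions = list(range(start, start + len(target_tokens)))
    masked = list(tokens)  # copy, so the caller's list is left untouched
    for pos in positions:
        masked[pos] = mask_token
    return masked, positions


# Made-up word pieces for illustration (not real tokenizer output)
tokens = ["I", "am", "hu", "##ng", "##ry", "and", "hu", "##mble"]
target = ["hu", "##mble"]
masked, positions = mask_span(tokens, target)
print(masked)
print(positions)
```

Here `tokens.index("hu")` would return position 2 and mask part of the wrong word, while the span lookup skips it because the following piece does not match.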

@ceatlinar
Author

I tried the code, but it gives word-piece suggestions, not whole words, and the suggestions are poor. Thank you so much for your effort, but this is not useful for me unless I can somehow get whole-word suggestions. Also, I am still looking for a way to get predictions from the XLM model; if anyone could help, that would be great.

@Bachstelze

Don't the pieces build complete words in the end?
See my first answer about XLM-R; the model mentioned there supports Turkish.
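
On the word pieces: BERT's WordPiece tokenizer marks continuation pieces with a `##` prefix, so predicted pieces can be glued back into words afterwards. A minimal sketch, assuming that `##` convention. `merge_wordpieces` is a hypothetical helper, and the sample pieces are invented rather than real model output:

```python
def merge_wordpieces(tokens, prefix="##"):
    """Join WordPiece continuation tokens (marked with prefix) onto the
    preceding token and return the resulting list of whole words."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: append it to the previous word
            words[-1] += token[len(prefix):]
        else:
            # Start of a new word (or punctuation)
            words.append(token)
    return words


# Invented pieces for illustration (not real model output)
pieces = ["go", "fly", "a", "ki", "##te", "!"]
print(merge_wordpieces(pieces))
```

The tokenizer's own `convert_tokens_to_string` should do a similar join. Note that filling several masks one piece at a time does not guarantee that the merged result is a real word.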

@LysandreJik
Member

Hi, you can predict a masked word with XLM as you would do with any other MLM-based model. Here's an example using the checkpoint xlm-mlm-17-1280 you mentioned:

from transformers import XLMTokenizer, XLMWithLMHeadModel
import torch

# load tokenizer
tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-17-1280")

# encode sentence with a masked token in the middle
sentence = torch.tensor([tokenizer.encode("This was the first time Nicolas ever saw a " + tokenizer.mask_token + ". It was huge.")])

# Identify the masked token position
masked_index = torch.where(sentence == tokenizer.mask_token_id)[1].tolist()[0]

# Load model
model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-17-1280")

# Get the five top answers
result = model(sentence)
result = result[0][:, masked_index].topk(5).indices
result = result.tolist()[0]

print(tokenizer.decode(result))
# monster dragon snake wolf tiger

@ceatlinar
Author

Thank you so much for the replies, guys; they have been very helpful.
