
Conversation

kanpuriyanawab
Collaborator

Closes #720

@kanpuriyanawab kanpuriyanawab changed the title Add an XLMRobertaMaskedLM task model WIP Mar 31, 2023
@kanpuriyanawab kanpuriyanawab marked this pull request as draft March 31, 2023 05:22
@kanpuriyanawab kanpuriyanawab changed the title WIP Add an XLMRobertaMaskedLM task model Mar 31, 2023
@kanpuriyanawab kanpuriyanawab marked this pull request as ready for review March 31, 2023 13:50
@kanpuriyanawab
Collaborator Author

Hey @mattdangerw, @chenmoneygithub, this PR is ready for review.

@mattdangerw
Member

/gcbrun

@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Apr 1, 2023

This is weird: my Black is up to date, and even after running ./shell/format.sh the code format CI still fails.

[screenshot: code format CI failure]

For reference: Black 22.6.0, TF 2.12.0 (bumped up).

@chenmoneygithub chenmoneygithub self-assigned this Apr 3, 2023
@abuelnasr0
Contributor

The PR looks great, but I have one comment that might help, if you would like to read it.

Take a look at this table and the implementation:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L171
For the tokenizer, we need to add the `<mask>` token to the end of the vocabulary rather than assign it to id 4, because not all sentencepiece models will look like the one in the test you wrote. Instead, like the original implementation, we should require the user to provide a sentencepiece model trained without any added special tokens, just like this:

import io
import sentencepiece

# vocab_data is a tf.data.Dataset of raw strings, as in the existing tests.
bytes_io = io.BytesIO()
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=vocab_data.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=13,
    model_type="WORD",
)

This will create a sentencepiece tokenizer with only three special tokens, as in the table linked above. Then add the `<mask>` token to the end of the vocabulary, and account for it in all the functions of the XLMRoberta tokenizer as well.
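For concreteness, the resulting proto could then be consumed roughly like this (a sketch; `proto` is the constructor argument used by the existing SentencePiece-based keras_nlp tokenizers, and the printed values assume `<mask>` has been appended as described):

```python
import keras_nlp

# bytes_io holds the serialized sentencepiece model trained above.
tokenizer = keras_nlp.models.XLMRobertaTokenizer(proto=bytes_io.getvalue())

# Once <mask> is appended, it should be the very last entry and the last id.
print(tokenizer.get_vocabulary()[-1])  # expected: "<mask>"
print(tokenizer.mask_token_id)         # expected: tokenizer.vocabulary_size() - 1
```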

thank you.

@chenmoneygithub
Contributor

@shivance Missed your earlier comment. It seems that Black is not working properly. Could you try adding a dummy line long enough to break Black's style and rerun the formatting script? If it doesn't get reformatted, your virtual environment may have a conflict.

@mattdangerw
Member

@abuelnasr0 has a good point! We may want to look through the original code on fairseq and see how they handle the mask token. Ideally we can use the same ID that they used for pre-training; that will give us better out-of-the-box results for this task.

@abheesht17 helped figure out a similar issue for DebertaV3, so I'll tag him to see if he has thoughts!

@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Apr 6, 2023

> @shivance Missed your earlier comment. It seems that Black is not working properly. Could you try adding a dummy line long enough to break Black's style and rerun the formatting script? If it doesn't get reformatted, your virtual environment may have a conflict.

@chenmoneygithub I figured out that the cause of Black's behaviour was the Python 3.10 environment I started using recently; oddly, none of the imports were working either. I've fixed it by creating a new Python 3.9 environment, and the code format CI now passes.

@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Apr 6, 2023

@mattdangerw FairSeq's XLM-RoBERTa uses the same interface as RoBERTa.

> ... We may want to look through the original code on fairseq and see how they handle the mask token. Ideally we can use the same ID that they used for pre-training; that will give us better out-of-the-box results for this task.

I just noticed that none of our other tokenizers hard-code the special token ids, except XLM-RoBERTa:

https://github.com/keras-team/keras-nlp/blob/a3deb52ef6b2dde69d4416a35a9a7f2154523973/keras_nlp/models/deberta_v3/deberta_v3_tokenizer.py#L92-L101

https://github.com/keras-team/keras-nlp/blob/a3deb52ef6b2dde69d4416a35a9a7f2154523973/keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py#L92-L106

And probably this is what @abuelnasr0 wanted to point out in #930?

Following the pattern the other tokenizers use, e.g. `self.pad_token_id = self.token_to_id(pad_token)` (a sketch of that pattern follows below), should fix this:

> For the tokenizer, we need to add the `<mask>` token to the end of the vocabulary rather than assign it to id 4, because not all sentencepiece models will look like the one in the test you wrote.
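For reference, the lookup-based pattern in the linked DebertaV3 tokenizer looks roughly like this (illustrative only; the token names are DeBERTa's, and as discussed below XLM-RoBERTa ends up keeping its hard-coded ids):

```python
# Inside the tokenizer's __init__(), after super().__init__() (sketch).
cls_token = "[CLS]"
sep_token = "[SEP]"
pad_token = "[PAD]"
for token in [cls_token, sep_token, pad_token]:
    if token not in self.get_vocabulary():
        raise ValueError(
            f"Cannot find token '{token}' in the provided vocabulary."
        )
self.cls_token_id = self.token_to_id(cls_token)
self.sep_token_id = self.token_to_id(sep_token)
self.pad_token_id = self.token_to_id(pad_token)
```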

@abheesht17
Collaborator

abheesht17 commented Apr 6, 2023

@shivance, we should keep the hardcoded special tokens.

For the mask token, i.e., <mask>, all you need to do is append this token to the end of the vocabulary, and increase vocabulary size (inside vocabulary_size()) by 1. Spelling this out step-by-step, this is what you have to do:

Inside __init__():

self.mask_token_id = super().vocabulary_size() + 1        # or you can also do self.vocabulary_size() - 1

Inside vocabulary_size():

return super().vocabulary_size() + 2

Inside get_vocabulary():

return self._vocabulary_prefix + vocabulary[3:] + ["<mask>"]

Inside id_to_token():

if id == self.mask_token_id:
    return "<mask>"

Inside token_to_id():

if token == "<mask>":
    return self.mask_token_id

(I've identified an issue with how OOV tokens are handled here, will open a PR, but this shouldn't block your work. Edit: Here is the PR.)

For detokenize(), we can do something similar to DeBERTaV3: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/deberta_v3/deberta_v3_tokenizer.py#L125-L127.
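Putting those steps together, a rough sketch of the full set of overrides (the base class, `_vocabulary_prefix`, and the existing fairseq offset handling are assumptions based on the current keras_nlp XLM-RoBERTa tokenizer; this is illustrative, not the exact diff in this PR):

```python
from keras_nlp.tokenizers import SentencePieceTokenizer


class XLMRobertaTokenizer(SentencePieceTokenizer):
    """Sketch of the <mask> handling described above."""

    def __init__(self, proto, **kwargs):
        super().__init__(proto=proto, **kwargs)
        # Tokens prepended in fairseq order: <s>, <pad>, </s>, <unk>.
        self._vocabulary_prefix = ["<s>", "<pad>", "</s>", "<unk>"]
        self.start_token_id = 0  # <s>
        self.pad_token_id = 1  # <pad>
        self.end_token_id = 2  # </s>
        self.unk_token_id = 3  # <unk>
        # <mask> is appended as the very last id in the vocabulary.
        self.mask_token_id = self.vocabulary_size() - 1

    def vocabulary_size(self):
        # +2 over the underlying sentencepiece size: +1 for the fairseq
        # prefix offset the tokenizer already applies, +1 for <mask>.
        return super().vocabulary_size() + 2

    def get_vocabulary(self):
        vocabulary = super().get_vocabulary()
        return self._vocabulary_prefix + vocabulary[3:] + ["<mask>"]

    def id_to_token(self, id):
        if id == self.mask_token_id:
            return "<mask>"
        # ...existing prefix/offset handling elided...
        return super().id_to_token(id)

    def token_to_id(self, token):
        if token == "<mask>":
            return self.mask_token_id
        # ...existing prefix/offset handling elided...
        return super().token_to_id(token)
```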

@kanpuriyanawab
Collaborator Author

Still WIP

@kanpuriyanawab
Collaborator Author

Hi @abheesht17, could you review it once again, please?

Contributor

@chenmoneygithub chenmoneygithub left a comment


Thanks! Mainly looking good, a few comments.

sequence_length=512,
truncate="round_robin",
mask_selection_rate=0.15,
mask_selection_length=96,
Contributor


I am not sure: is 96 a good default for XLM-R MLM? @abheesht17

Member


Should be good! The max sequence length of the model is 512 and the mask rate above is 0.15, so about 77 masked tokens on average; 96 should cover almost all samples.
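A quick back-of-the-envelope check of that, treating each of the 512 positions as an independent Bernoulli(0.15) draw (only an approximation of how the masking layer actually samples):

```python
from scipy.stats import binom

seq_len, rate, cap = 512, 0.15, 96

print(seq_len * rate)  # 76.8 masked positions expected per sample
# Probability a sample would need more than `cap` mask slots under the
# independence approximation; comes out to roughly 1%.
print(binom.sf(cap, seq_len, rate))
```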

self.pad_token_id = 1 # <pad>
self.end_token_id = 2 # </s>
self.unk_token_id = 3 # <unk>
self.mask_token_id = self.vocabulary_size() - 1 # <mask>
Contributor


I did not carefully read the XLM-R tokenization part, so I will let Matt check this. @mattdangerw

@kanpuriyanawab
Collaborator Author

Addressed all the comments except the ones left for @mattdangerw's and @abheesht17's review.

Member

@mattdangerw mattdangerw left a comment


Thanks! This looks good to me. Have we checked that we are using the same index for the mask token that fairseq uses?

Also, do we have a colab testing if this works end to end? Something like https://gist.github.com/mattdangerw/b16c257973762a0b4ab9a34f6a932cc1
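For a quick local smoke test, something along these lines should exercise the whole pipeline (a sketch; the preset name is the one mentioned later in this thread, and fitting on raw strings assumes the task model bundles its preprocessor):

```python
import keras_nlp

# Raw strings; the attached preprocessor handles masking and label weights.
features = ["The quick brown fox jumped.", "I forgot my homework."]

masked_lm = keras_nlp.models.XLMRobertaMaskedLM.from_preset(
    "xlm_roberta_base_multi"
)
masked_lm.fit(x=features, batch_size=2)
```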

generates label weights.
Examples:
```python
Member


nit: we have this style elsewhere (just add one empty line and the sentence below)...

Examples:

Directly calling the layer on data.
...example code block here...

@kanpuriyanawab
Collaborator Author

> Thanks! This looks good to me. Have we checked that we are using the same index for the mask token that fairseq uses?
>
> Also, do we have a colab testing if this works end to end? Something like https://gist.github.com/mattdangerw/b16c257973762a0b4ab9a34f6a932cc1

[screenshot: fairseq / Hugging Face XLM-RoBERTa vocabulary showing `<mask>` as the last token]

Indeed, both the fairseq and HF implementations of XLM-RoBERTa have `<mask>` as the last token in the vocabulary.
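For anyone who wants to reproduce that check from the Hugging Face side, something like this should show `<mask>` sitting at the last id (a quick sketch; exact numbers depend on the `xlm-roberta-base` checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Expectation: mask_token_id == len(tok) - 1, i.e. <mask> is the last id.
print(tok.mask_token, tok.mask_token_id, len(tok))
```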

@kanpuriyanawab
Collaborator Author

Hey @mattdangerw, regarding the colab: it goes out of memory for the xlm_roberta_base_multi preset; even with 1% of the training set, memory can't hold it.

@mattdangerw
Member

> Hey @mattdangerw, regarding the colab: it goes out of memory for the xlm_roberta_base_multi preset; even with 1% of the training set, memory can't hold it.

Got it! Thanks for letting me know. We probably should run some sort of test here to sanity check before we merge. I will look into doing it on GCP.

@mattdangerw
Member

Actually, I was able to test this out via a "Colab Pro" GPU, and it works great!

https://colab.research.google.com/gist/mattdangerw/ec3cf790c11773ed5f5e335df47480b1/xlmr-mlm.ipynb

@mattdangerw
Member

/gcbrun

@mattdangerw
Member

Actually, the training flow works great, but there is still an error during detokenize(). @shivance, can you take a look?
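One likely direction for the fix, mirroring the DebertaV3 detokenize() linked earlier in the thread, is to strip the mask id before delegating to the base class (a sketch only, assuming the same ragged integer inputs the DebertaV3 tokenizer handles):

```python
import tensorflow as tf

# Inside XLMRobertaTokenizer (sketch):
def detokenize(self, inputs):
    # <mask> has no entry in the underlying sentencepiece model, so drop it
    # before handing the ids to the base detokenize.
    inputs = tf.ragged.boolean_mask(
        inputs, tf.not_equal(inputs, self.mask_token_id)
    )
    return super().detokenize(inputs)
```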

@kanpuriyanawab
Collaborator Author

WIP

@kanpuriyanawab
Collaborator Author

Please review

@mattdangerw
Member

/gcbrun

Member

@mattdangerw mattdangerw left a comment


@shivance thank you very much! Approving!

@mattdangerw mattdangerw merged commit 61f9d9e into keras-team:master Apr 18, 2023
@kanpuriyanawab kanpuriyanawab deleted the xlm-roberta branch June 27, 2023 08:52