
Conversation

kanpuriyanawab
Collaborator

Closes #720

@kanpuriyanawab kanpuriyanawab changed the title Add an XLMRobertaMaskedLM task model WIP Mar 31, 2023
@kanpuriyanawab kanpuriyanawab marked this pull request as draft March 31, 2023 05:22
@kanpuriyanawab kanpuriyanawab changed the title WIP Add an XLMRobertaMaskedLM task model Mar 31, 2023
@kanpuriyanawab kanpuriyanawab marked this pull request as ready for review March 31, 2023 13:50
@kanpuriyanawab
Collaborator Author

Hey @mattdangerw, @chenmoneygithub, this PR is ready for review.

@mattdangerw
Member

/gcbrun

@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Apr 1, 2023

This is weird: my Black is up to date, and even after running ./shell/format.sh the code format CI still fails.

[screenshot: code format CI failure]

For reference: Black 22.6.0, TF 2.12.0 (bumped up).

@chenmoneygithub chenmoneygithub self-assigned this Apr 3, 2023
@abuelnasr0
Contributor

The PR looks great, but I have one comment that might help, if you would like to read it.

Take a look at this table and the implementation:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L171
For the tokenizer, we need to add the `<mask>` token to the end of the vocabulary rather than assign it to id 4, because not all sentencepiece models will look like the one in the test you wrote. Instead, like the original implementation, we should require the user to provide a sentencepiece model trained without any added special tokens, just like this:

import io
import sentencepiece

# vocab_data is a tf.data.Dataset of raw strings, as in the existing tests.
bytes_io = io.BytesIO()
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=vocab_data.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=13,
    model_type="WORD",
)

This will create a sentencepiece tokenizer with only three special tokens, as in the table linked above. Then add the `<mask>` token to the end of the vocabulary, and account for it in all the functions of the XLMRoberta tokenizer as well.
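For concreteness, the resulting proto could then be consumed roughly like this (a sketch; `proto` is the constructor argument used by the existing SentencePiece-based keras_nlp tokenizers, and the printed values assume `<mask>` has been appended as described):

```python
import keras_nlp

# bytes_io holds the serialized sentencepiece model trained above.
tokenizer = keras_nlp.models.XLMRobertaTokenizer(proto=bytes_io.getvalue())

# Once <mask> is appended, it should be the very last entry and the last id.
print(tokenizer.get_vocabulary()[-1])  # expected: "<mask>"
print(tokenizer.mask_token_id)         # expected: tokenizer.vocabulary_size() - 1
```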

thank you.

@chenmoneygithub
Contributor

@shivance Missed your earlier comment. It seems that Black is not working properly. Could you try adding a dummy line long enough to break Black's style and rerun the formatting script? If it doesn't get reformatted, your virtual environment may have a conflict.

@mattdangerw
Member

@abuelnasr0 has a good point! We may want to look through the original code on fairseq and see how they handle the mask token. Ideally we can use the same ID that they used for pre-training; that will give us better out-of-the-box results for this task.

@abheesht17 helped figure out a similar issue for DebertaV3, so I'll tag him to see if he has thoughts!

@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Apr 6, 2023

> @shivance Missed your earlier comment. It seems that Black is not working properly. Could you try adding a dummy line long enough to break Black's style and rerun the formatting script? If it doesn't get reformatted, your virtual environment may have a conflict.

@chenmoneygithub I figured out that the cause of Black's behaviour was the Python 3.10 environment I started using recently; oddly, none of the imports were working either. I've fixed it by creating a new Python 3.9 environment, and the code format CI now passes.

@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Apr 6, 2023

@mattdangerw FairSeq's XLM-RoBERTa uses the same interface as RoBERTa.

> ... We may want to look through the original code on fairseq and see how they handle the mask token. Ideally we can use the same ID that they used for pre-training; that will give us better out-of-the-box results for this task.

I just noticed that none of our other tokenizers hard-code the special token ids, except XLM-RoBERTa:

https://github.com/keras-team/keras-nlp/blob/a3deb52ef6b2dde69d4416a35a9a7f2154523973/keras_nlp/models/deberta_v3/deberta_v3_tokenizer.py#L92-L101

https://github.com/keras-team/keras-nlp/blob/a3deb52ef6b2dde69d4416a35a9a7f2154523973/keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py#L92-L106

And probably this is what @abuelnasr0 wanted to point out in #930?

Following the pattern the other tokenizers use, e.g. `self.pad_token_id = self.token_to_id(pad_token)` (a sketch of that pattern follows below), should fix this:

> For the tokenizer, we need to add the `<mask>` token to the end of the vocabulary rather than assign it to id 4, because not all sentencepiece models will look like the one in the test you wrote.
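For reference, the lookup-based pattern in the linked DebertaV3 tokenizer looks roughly like this (illustrative only; the token names are DeBERTa's, and as discussed below XLM-RoBERTa ends up keeping its hard-coded ids):

```python
# Inside the tokenizer's __init__(), after super().__init__() (sketch).
cls_token = "[CLS]"
sep_token = "[SEP]"
pad_token = "[PAD]"
for token in [cls_token, sep_token, pad_token]:
    if token not in self.get_vocabulary():
        raise ValueError(
            f"Cannot find token '{token}' in the provided vocabulary."
        )
self.cls_token_id = self.token_to_id(cls_token)
self.sep_token_id = self.token_to_id(sep_token)
self.pad_token_id = self.token_to_id(pad_token)
```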

@abheesht17
Collaborator

abheesht17 commented Apr 6, 2023

@shivance, we should keep the hardcoded special tokens.

For the mask token, i.e., <mask>, all you need to do is append this token to the end of the vocabulary, and increase vocabulary size (inside vocabulary_size()) by 1. Spelling this out step-by-step, this is what you have to do:

Inside __init__():

self.mask_token_id = super().vocabulary_size() + 1        # or you can also do self.vocabulary_size() - 1

Inside vocabulary_size():

return super().vocabulary_size() + 2

Inside get_vocabulary():

return self._vocabulary_prefix + vocabulary[3:] + ["<mask>"]

Inside id_to_token():

if id == self.mask_token_id:
    return "<mask>"

Inside token_to_id():

if token == "<mask>":
    return self.mask_token_id

(I've identified an issue with how OOV tokens are handled here, will open a PR, but this shouldn't block your work. Edit: Here is the PR.)

For detokenize(), we can do something similar to DeBERTaV3: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/deberta_v3/deberta_v3_tokenizer.py#L125-L127.
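Putting those steps together, a rough sketch of the full set of overrides (the base class, `_vocabulary_prefix`, and the existing fairseq offset handling are assumptions based on the current keras_nlp XLM-RoBERTa tokenizer; this is illustrative, not the exact diff in this PR):

```python
from keras_nlp.tokenizers import SentencePieceTokenizer


class XLMRobertaTokenizer(SentencePieceTokenizer):
    """Sketch of the <mask> handling described above."""

    def __init__(self, proto, **kwargs):
        super().__init__(proto=proto, **kwargs)
        # Tokens prepended in fairseq order: <s>, <pad>, </s>, <unk>.
        self._vocabulary_prefix = ["<s>", "<pad>", "</s>", "<unk>"]
        self.start_token_id = 0  # <s>
        self.pad_token_id = 1  # <pad>
        self.end_token_id = 2  # </s>
        self.unk_token_id = 3  # <unk>
        # <mask> is appended as the very last id in the vocabulary.
        self.mask_token_id = self.vocabulary_size() - 1

    def vocabulary_size(self):
        # +2 over the underlying sentencepiece size: +1 for the fairseq
        # prefix offset the tokenizer already applies, +1 for <mask>.
        return super().vocabulary_size() + 2

    def get_vocabulary(self):
        vocabulary = super().get_vocabulary()
        return self._vocabulary_prefix + vocabulary[3:] + ["<mask>"]

    def id_to_token(self, id):
        if id == self.mask_token_id:
            return "<mask>"
        # ...existing prefix/offset handling elided...
        return super().id_to_token(id)

    def token_to_id(self, token):
        if token == "<mask>":
            return self.mask_token_id
        # ...existing prefix/offset handling elided...
        return super().token_to_id(token)
```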

@kanpuriyanawab
Collaborator Author

Still WIP

@kanpuriyanawab
Collaborator Author

Hi @abheesht17, could you review it once again, please?

Contributor

@chenmoneygithub chenmoneygithub left a comment


Thanks! Mainly looking good, a few comments.

sequence_length=512,
truncate="round_robin",
mask_selection_rate=0.15,
mask_selection_length=96,
Contributor


I am not sure: is 96 a good default for XLM-R MLM? @abheesht17

Member


Should be good! The max sequence length of the model is 512 and the mask rate above is 0.15, so about 77 masked tokens on average; 96 should cover almost all samples.
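A quick back-of-the-envelope check of that, treating each of the 512 positions as an independent Bernoulli(0.15) draw (only an approximation of how the masking layer actually samples):

```python
from scipy.stats import binom

seq_len, rate, cap = 512, 0.15, 96

print(seq_len * rate)  # 76.8 masked positions expected per sample
# Probability a sample would need more than `cap` mask slots under the
# independence approximation; comes out to roughly 1%.
print(binom.sf(cap, seq_len, rate))
```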

self.pad_token_id = 1 # <pad>
self.end_token_id = 2 # </s>
self.unk_token_id = 3 # <unk>
self.mask_token_id = self.vocabulary_size() - 1 # <mask>
Contributor


I did not carefully read the XLM-R tokenization part, so I will let Matt check this. @mattdangerw

@kanpuriyanawab
Collaborator Author

Addressed all the comments except the ones left for @mattdangerw's and @abheesht17's review.

Member

@mattdangerw mattdangerw left a comment


Thanks! This looks good to me. Have we checked that we are using the same index for the mask token that fairseq uses?

Also, do we have a colab testing if this works end to end? Something like https://gist.github.com/mattdangerw/b16c257973762a0b4ab9a34f6a932cc1
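For a quick local smoke test, something along these lines should exercise the whole pipeline (a sketch; the preset name is the one mentioned later in this thread, and fitting on raw strings assumes the task model bundles its preprocessor):

```python
import keras_nlp

# Raw strings; the attached preprocessor handles masking and label weights.
features = ["The quick brown fox jumped.", "I forgot my homework."]

masked_lm = keras_nlp.models.XLMRobertaMaskedLM.from_preset(
    "xlm_roberta_base_multi"
)
masked_lm.fit(x=features, batch_size=2)
```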

generates label weights.
Examples:
```python
Member


nit: we have this style elsewhere (just add one empty line and the sentence below)...

Examples:

Directly calling the layer on data.
...example code block here...

@kanpuriyanawab
Collaborator Author

> Thanks! This looks good to me. Have we checked that we are using the same index for the mask token that fairseq uses?
>
> Also, do we have a colab testing if this works end to end? Something like https://gist.github.com/mattdangerw/b16c257973762a0b4ab9a34f6a932cc1

[screenshot: fairseq / Hugging Face XLM-RoBERTa vocabulary showing `<mask>` as the last token]

Indeed, both the fairseq and HF implementations of XLM-RoBERTa have `<mask>` as the last token in the vocabulary.
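For anyone who wants to reproduce that check from the Hugging Face side, something like this should show `<mask>` sitting at the last id (a quick sketch; exact numbers depend on the `xlm-roberta-base` checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Expectation: mask_token_id == len(tok) - 1, i.e. <mask> is the last id.
print(tok.mask_token, tok.mask_token_id, len(tok))
```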

@kanpuriyanawab
Collaborator Author

Hey @mattdangerw, regarding the colab: it goes out of memory for the xlm_roberta_base_multi preset; even with 1% of the training set, memory can't hold it.

@mattdangerw
Member

> Hey @mattdangerw, regarding the colab: it goes out of memory for the xlm_roberta_base_multi preset; even with 1% of the training set, memory can't hold it.

Got it! Thanks for letting me know. We probably should run some sort of test here to sanity check before we merge. I will look into doing it on GCP.

@mattdangerw
Member

Actually, I was able to test this out via a "Colab Pro" GPU, and it works great!

https://colab.research.google.com/gist/mattdangerw/ec3cf790c11773ed5f5e335df47480b1/xlmr-mlm.ipynb

@mattdangerw
Member

/gcbrun

@mattdangerw
Member

Actually, the training flow works great, but there is still an error during detokenize(). @shivance, can you take a look?
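One likely direction for the fix, mirroring the DebertaV3 detokenize() linked earlier in the thread, is to strip the mask id before delegating to the base class (a sketch only, assuming the same ragged integer inputs the DebertaV3 tokenizer handles):

```python
import tensorflow as tf

# Inside XLMRobertaTokenizer (sketch):
def detokenize(self, inputs):
    # <mask> has no entry in the underlying sentencepiece model, so drop it
    # before handing the ids to the base detokenize.
    inputs = tf.ragged.boolean_mask(
        inputs, tf.not_equal(inputs, self.mask_token_id)
    )
    return super().detokenize(inputs)
```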

@kanpuriyanawab
Collaborator Author

WIP

@kanpuriyanawab
Collaborator Author

Please review

@mattdangerw
Member

/gcbrun

Member

@mattdangerw mattdangerw left a comment


@shivance thank you very much! Approving!

@mattdangerw mattdangerw merged commit 61f9d9e into keras-team:master Apr 18, 2023
@kanpuriyanawab kanpuriyanawab deleted the xlm-roberta branch June 27, 2023 08:52