
Conversation

abuelnasr0
Contributor

@abuelnasr0 abuelnasr0 commented Mar 20, 2023

This is my first PR in keras_nlp and my first open source contribution.
For issue #867, I reworked the XLMRoberta docstrings in a way similar to #843.
I have a few questions and suggestions:
1. I changed the `max_sequence_length` description in xlm_roberta_backbone.py. Should I restore the old description?
2. In xlm_roberta_tokenizer.py, I suggest deleting lines 38 to 42. What do you think?
3. I suggest adding a training example for the XLMRoberta tokenizer with a custom vocabulary.
4. For training a sentence-pair tokenizer we need to import io and sentencepiece. Should I import them in the example code?
5. Should I delete the example in xlm_roberta_preprocessor.py at lines 43 to 45?
6. I suggest adding a softmax activation to the XLMRoberta classifier.
7. I added Arabic examples because XLMRoberta is a multilingual model, and it would be good to show that. What do you think?

I made my tests here: https://colab.research.google.com/drive/1uyIQetDu9auoMP4ZHVX82zLW3tkEzZx7#scrollTo=X72rpihBvZX7

Waiting for feedback, @mattdangerw.
Thanks.

@google-cla

google-cla bot commented Mar 20, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@mattdangerw
Member

Thanks! This looks great, and I really appreciate the attention to detail here.

1. I changed the `max_sequence_length` description in xlm_roberta_backbone.py. Should I restore the old description?

The new description looks good. We may want to copy it to other models as a follow-up.

2. In xlm_roberta_tokenizer.py, I suggest deleting lines 38 to 42. What do you think?

The note about the weird sentencepiece modifications? I think we can leave it in, but maybe it would be better if the paragraph began: "Note: If you are providing your own custom SentencePiece model, ..."

3. I suggest adding a training example for the XLMRoberta tokenizer with a custom vocabulary.

SGTM! Thank you

4. For training a sentence-pair tokenizer we need to import io and sentencepiece. Should I import them in the example code?

We don't generally do imports in the example code; we just assume a fixed set of symbols is available (including sentencepiece and io for now).
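For reference, a minimal sketch of what such a custom-vocabulary example might look like, assuming the tokenizer accepts a serialized SentencePiece proto via a `proto` argument (the training settings and sample text here are illustrative, not necessarily what ended up in the PR):

```python
import io

import sentencepiece
import tensorflow as tf
import keras_nlp

# Train a tiny SentencePiece model on an in-memory dataset.
bytes_io = io.BytesIO()
ds = tf.data.Dataset.from_tensor_slices(["The quick brown fox jumped."])
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=ds.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=10,
    model_type="WORD",
    unk_id=0,
    bos_id=1,
    eos_id=2,
)

# Pass the serialized proto to the tokenizer and tokenize a sentence.
tokenizer = keras_nlp.models.XLMRobertaTokenizer(
    proto=bytes_io.getvalue(),
)
tokenizer("The quick brown fox jumped.")
```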

5. Should I delete the example in xlm_roberta_preprocessor.py at lines 43 to 45?

We can leave it in, I think.

6. I suggest adding a softmax activation to the XLMRoberta classifier.

Interesting point! We have actually discussed this quite a bit. For now, we are sticking with logit output across the library (it is more the norm, and it allows for manipulation of the logits when doing generation). But we may revisit this! And we appreciate the feedback.
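If probabilities are needed, the expectation is that users apply softmax themselves. A minimal sketch, assuming the `xlm_roberta_base_multi` preset name and that the classifier's `predict` output is logits:

```python
import tensorflow as tf
import keras_nlp

# The classifier outputs logits; apply softmax to turn them into probabilities.
classifier = keras_nlp.models.XLMRobertaClassifier.from_preset(
    "xlm_roberta_base_multi",
    num_classes=2,
)
logits = classifier.predict(["The quick brown fox jumped."])
probabilities = tf.nn.softmax(logits, axis=-1)
```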

7. I added Arabic examples because XLMRoberta is a multilingual model, and it would be good to show that. What do you think?

I like it!

Member

@mattdangerw mattdangerw left a comment


This looks great! Just had a couple comments.

# Preprocess a batch of sentence pairs.
# When handling multiple sequences, always convert to tensors first!
first = tf.constant(["The quick brown fox jumped.", "اسمي اسماعيل"])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
Member


In this example, can we replace `Oh look, a whale.` with a short Arabic sentence that could follow the first?

We want to make it clear how the batches will be handled, e.g. position 0 in both constants is one batch sample and position 1 is another.
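For instance, something along these lines, where the second Arabic sentence is only an illustration of the kind of replacement being suggested (not necessarily the wording used in the final docstring):

```python
# Position 0 in both constants forms one sentence pair; position 1 forms another.
first = tf.constant(["The quick brown fox jumped.", "اسمي اسماعيل"])
second = tf.constant(["The fox tripped.", "هذا اسم جميل"])  # "That is a beautiful name."
```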

Suggested change:
- download a matching vocabulary for a XLM-RoBERTa preset.
+ download a matching vocabulary for an XLM-RoBERTa preset.
The original fairseq implementation of XLM-RoBERTa modifies the indices of
Member


Re your comment, we could rewrite this slightly to be shorter:

Note: If you are providing your own custom SentencePiece model, the original fairseq implementation of XLM-RoBERTa re-maps some token indices from the underlying sentencepiece output. To preserve compatibility, we do the same re-mapping here.
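For anyone curious what that re-mapping amounts to, here is a rough, illustrative sketch (not the library's actual code): fairseq reserves the lowest ids for its special tokens and shifts the underlying SentencePiece ids by a fixed offset so the two vocabularies line up.

```python
import sentencepiece

# Illustrative only: fairseq-style special token ids, with the remaining
# SentencePiece piece ids shifted by a fixed offset to make room for them.
FAIRSEQ_SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
FAIRSEQ_OFFSET = 1

def fairseq_token_id(piece: str, sp_model: sentencepiece.SentencePieceProcessor) -> int:
    # Special tokens use the reserved fairseq ids.
    if piece in FAIRSEQ_SPECIAL_IDS:
        return FAIRSEQ_SPECIAL_IDS[piece]
    # All other pieces are looked up in the SentencePiece model and shifted.
    return sp_model.piece_to_id(piece) + FAIRSEQ_OFFSET
```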

Contributor Author


@mattdangerw The requested changes have been pushed.

@mattdangerw mattdangerw self-assigned this Mar 22, 2023
Member

@mattdangerw mattdangerw left a comment


This LGTM! I really appreciate all the feedback you gave; this is an impressive first contribution!

@mattdangerw mattdangerw merged commit 592bad9 into keras-team:master Mar 22, 2023
@abuelnasr0
Contributor Author

This LGTM! I really appreciate all the feedback you gave; this is an impressive first contribution!
Thank you!

kanpuriyanawab pushed a commit to kanpuriyanawab/keras-nlp that referenced this pull request Mar 26, 2023
* Rework docstring of XLMRoberta

* Fix typo

* Add arabic example and Shorten the comment about sentencepiece tokenizer

* Fix typo