
Conversation

abuelnasr0
Contributor

@abuelnasr0 abuelnasr0 commented Mar 20, 2023

This is my first PR in keras_nlp and my first open source contribution.
For issue #867, I reworked the XLMRoberta docstrings in a way similar to #843.
I have a few questions and suggestions:
1. I changed the `max_sequence_length` description in xlm_roberta_backbone.py. Should I restore the old description?
2. In xlm_roberta_tokenizer.py, I suggest deleting lines 38 to 42. What do you think?
3. I suggest adding a training example for the XLMRoberta tokenizer with a custom vocabulary.
4. For training a sentence-pair tokenizer we need to import io and sentencepiece. Should I import them in the example code?
5. Should I delete the example in xlm_roberta_preprocessor.py at lines 43 to 45?
6. I suggest adding a softmax activation to the XLMRoberta classifier.
7. I added Arabic examples because XLMRoberta is a multilingual model, and it would be good to show that. What do you think?

I made my tests here: https://colab.research.google.com/drive/1uyIQetDu9auoMP4ZHVX82zLW3tkEzZx7#scrollTo=X72rpihBvZX7

Waiting for feedback, @mattdangerw.
Thanks.

@google-cla

google-cla bot commented Mar 20, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@mattdangerw
Member

Thanks! This looks great, and I really appreciate the attention to detail here.

1. I changed the `max_sequence_length` description in xlm_roberta_backbone.py. Should I restore the old description?

The new description looks good. We may want to copy it to other models as a follow-up.

2. In xlm_roberta_tokenizer.py, I suggest deleting lines 38 to 42. What do you think?

The note about the weird sentencepiece modifications? I think we can leave it in, but maybe it would be better if the paragraph began: "Note: If you are providing your own custom SentencePiece model, ..."

3. I suggest adding a training example for the XLMRoberta tokenizer with a custom vocabulary.

SGTM! Thank you

4. For training a sentence-pair tokenizer we need to import io and sentencepiece. Should I import them in the example code?

We don't generally do imports in the example code; we just assume a fixed set of symbols is available (including sentencepiece and io for now).
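For reference, a minimal sketch of what such a custom-vocabulary example might look like, assuming the tokenizer accepts a serialized SentencePiece proto via a `proto` argument (the training settings and sample text here are illustrative, not necessarily what ended up in the PR):

```python
import io

import sentencepiece
import tensorflow as tf
import keras_nlp

# Train a tiny SentencePiece model on an in-memory dataset.
bytes_io = io.BytesIO()
ds = tf.data.Dataset.from_tensor_slices(["The quick brown fox jumped."])
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=ds.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=10,
    model_type="WORD",
    unk_id=0,
    bos_id=1,
    eos_id=2,
)

# Pass the serialized proto to the tokenizer and tokenize a sentence.
tokenizer = keras_nlp.models.XLMRobertaTokenizer(
    proto=bytes_io.getvalue(),
)
tokenizer("The quick brown fox jumped.")
```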

5. Should I delete the example in xlm_roberta_preprocessor.py at lines 43 to 45?

We can leave it in, I think.

6. I suggest adding a softmax activation to the XLMRoberta classifier.

Interesting point! We have actually discussed this quite a bit. For now, we are sticking with logit output across the library (it is more the norm, and it allows for manipulation of the logits when doing generation). But we may revisit this! And we appreciate the feedback.
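If probabilities are needed, the expectation is that users apply softmax themselves. A minimal sketch, assuming the `xlm_roberta_base_multi` preset name and that the classifier's `predict` output is logits:

```python
import tensorflow as tf
import keras_nlp

# The classifier outputs logits; apply softmax to turn them into probabilities.
classifier = keras_nlp.models.XLMRobertaClassifier.from_preset(
    "xlm_roberta_base_multi",
    num_classes=2,
)
logits = classifier.predict(["The quick brown fox jumped."])
probabilities = tf.nn.softmax(logits, axis=-1)
```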

7. I added Arabic examples because XLMRoberta is a multilingual model, and it would be good to show that. What do you think?

I like it!

Member

@mattdangerw mattdangerw left a comment


This looks great! Just had a couple comments.

# Preprocess a batch of sentence pairs.
# When handling multiple sequences, always convert to tensors first!
first = tf.constant(["The quick brown fox jumped.", "اسمي اسماعيل"])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
Member


In this example, can we replace `Oh look, a whale.` with a short Arabic sentence that could follow the first?

We want to make it clear how the batches will be handled, e.g. position 0 in both constants is one batch sample and position 1 is another.
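For instance, something along these lines, where the second Arabic sentence is only an illustration of the kind of replacement being suggested (not necessarily the wording used in the final docstring):

```python
# Position 0 in both constants forms one sentence pair; position 1 forms another.
first = tf.constant(["The quick brown fox jumped.", "اسمي اسماعيل"])
second = tf.constant(["The fox tripped.", "هذا اسم جميل"])  # "That is a beautiful name."
```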

Suggested change:
- download a matching vocabulary for a XLM-RoBERTa preset.
+ download a matching vocabulary for an XLM-RoBERTa preset.
The original fairseq implementation of XLM-RoBERTa modifies the indices of
Member


Re your comment, we could rewrite this slightly to be shorter:

Note: If you are providing your own custom SentencePiece model, the original fairseq implementation of XLM-RoBERTa re-maps some token indices from the underlying sentencepiece output. To preserve compatibility, we do the same re-mapping here.
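For anyone curious what that re-mapping amounts to, here is a rough, illustrative sketch (not the library's actual code): fairseq reserves the lowest ids for its special tokens and shifts the underlying SentencePiece ids by a fixed offset so the two vocabularies line up.

```python
import sentencepiece

# Illustrative only: fairseq-style special token ids, with the remaining
# SentencePiece piece ids shifted by a fixed offset to make room for them.
FAIRSEQ_SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
FAIRSEQ_OFFSET = 1

def fairseq_token_id(piece: str, sp_model: sentencepiece.SentencePieceProcessor) -> int:
    # Special tokens use the reserved fairseq ids.
    if piece in FAIRSEQ_SPECIAL_IDS:
        return FAIRSEQ_SPECIAL_IDS[piece]
    # All other pieces are looked up in the SentencePiece model and shifted.
    return sp_model.piece_to_id(piece) + FAIRSEQ_OFFSET
```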

Contributor Author


@mattdangerw The requested changes have been pushed.

@mattdangerw mattdangerw self-assigned this Mar 22, 2023
Member

@mattdangerw mattdangerw left a comment


This LGTM! I really appreciate all the feedback you gave; this is an impressive first contribution!

@mattdangerw mattdangerw merged commit 592bad9 into keras-team:master Mar 22, 2023
@abuelnasr0
Contributor Author

This LGTM! I really appreciate all the feedback you gave; this is an impressive first contribution!
Thank you!

kanpuriyanawab pushed a commit to kanpuriyanawab/keras-nlp that referenced this pull request Mar 26, 2023
* Rework docstring of XLMRoberta

* Fix typo

* Add arabic example and Shorten the comment about sentencepiece tokenizer

* Fix typo