Rework docstring of XLMRoberta #882
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Thanks! This looks great and we really appreciate the attention to detail here.
New description looks good. We may want to copy it for other models as a follow up.
The note about the weird sentencepiece modifications? I think we can leave it in, but maybe it would be better if the paragraph began differently.
SGTM! Thank you
We don't generally do imports in the example code; we just assume a fixed set of symbols is available (including sentencepiece and io for now).
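For reference, something along these lines is what those docstring examples boil down to once sentencepiece and io are in scope. The two imports are included here only so the snippet runs standalone, and the training sentences and settings are placeholders rather than the values used in the actual docstrings:

```python
import io
import sentencepiece

# Train a toy SentencePiece model entirely in memory. The resulting bytes can
# be handed to a tokenizer as its serialized proto. All settings here are
# illustrative.
bytes_io = io.BytesIO()
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=iter(["The quick brown fox jumped.", "The earth is round."]),
    model_writer=bytes_io,
    vocab_size=10,
    model_type="WORD",
    unk_id=0,
    bos_id=1,
    eos_id=2,
)
proto = bytes_io.getvalue()  # Serialized SentencePiece model.
```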
We can leave it in I think.
Interesting point! We have actually discussed this quite a bit. For now, we are sticking with logit output across the library (it is more the norm, and allows for manipulation of the logits when doing generation). But we may revisit this! And appreciate the feedback.
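As a rough sketch of what that convention means for users: the classifier emits logits, and probabilities are one softmax away. The logit values and two-class shape below are placeholders standing in for real classifier output:

```python
import tensorflow as tf

# Placeholder logits standing in for classifier output (batch of 2, 2 classes).
logits = tf.constant([[1.5, -0.3], [0.2, 0.9]])

# Apply softmax yourself when you want probabilities; argmax gives class ids.
probabilities = tf.nn.softmax(logits, axis=-1)
predicted_classes = tf.argmax(logits, axis=-1)
```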
I like it!
This looks great! Just had a couple comments.
```python
# Preprocess a batch of sentence pairs.
# When handling multiple sequences, always convert to tensors first!
first = tf.constant(["The quick brown fox jumped.", "اسمي اسماعيل"])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
```
In this example, can we replace "Oh look, a whale." with a short Arabic sentence that could follow the first?
We want to make it clear how the batches will be handled, e.g. position 0 in both constants is one batch sample and position 1 is another.
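Something along these lines might work. The Arabic follow-on sentence here is only an illustration, not necessarily the wording the PR ended up using:

```python
import tensorflow as tf

# Position 0 of `first` pairs with position 0 of `second` (one batch sample),
# and position 1 pairs with position 1 (another sample). The second Arabic
# sentence means roughly "Nice to meet you.", so it plausibly follows
# "My name is Ismail."
first = tf.constant(["The quick brown fox jumped.", "اسمي اسماعيل"])
second = tf.constant(["The fox tripped.", "تشرفت بمعرفتك."])
```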
```diff
- download a matching vocabulary for a XLM-RoBERTa preset.
+ download a matching vocabulary for an XLM-RoBERTa preset.
  The original fairseq implementation of XLM-RoBERTa modifies the indices of ...
```
Re your comment, we could rewrite this slightly to be shorter...
Note: If you are providing your own custom SentencePiece model, the original fairseq implementation of XLM-RoBERTa re-maps some token indices from the underlying sentencepiece output. To preserve compatibility, we do the same re-mapping here.
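For anyone unfamiliar with the re-mapping being referenced, here is a rough illustrative sketch of the usual fairseq convention. It is not the library's actual internal code, just the idea:

```python
# In a stock SentencePiece model the special pieces are typically
# <unk>=0, <s>=1, </s>=2. The fairseq XLM-RoBERTa dictionary instead reserves
# <s>=0, <pad>=1, </s>=2, <unk>=3 and shifts every remaining piece up by one.
FAIRSEQ_OFFSET = 1
FAIRSEQ_SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

def spm_id_to_fairseq_id(spm_piece, spm_id):
    """Map a SentencePiece (piece, id) pair to its fairseq-style id."""
    if spm_piece in FAIRSEQ_SPECIAL_IDS:
        return FAIRSEQ_SPECIAL_IDS[spm_piece]
    return spm_id + FAIRSEQ_OFFSET
```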
@mattdangerw The requested changes have been pushed.
This LGTM! Really appreciate all the feedback you gave; this is an impressive first contribution!
* Rework docstring of XLMRoberta
* Fix typo
* Add Arabic example and shorten the comment about the sentencepiece tokenizer
* Fix typo
This is my first PR in keras_nlp and my first open source contribution.
For issue #867, I reworked the docstring of XLMRoberta in a way similar to #843.
I have a few questions and suggestions:
1. I changed the max_sequence_length description in xlm_roberta_backbone.py. Should I restore the old description?
2. In xlm_roberta_tokenizer.py, I suggest deleting lines 38 to 42. What do you think?
3. I suggest adding a training method to the XLMRoberta tokenizer for custom vocabularies.
4. For the SentencePiece tokenizer training example we need to import io and sentencepiece. Should I import them in the example code?
5. Should I delete the example in xlm_roberta_preprocessor.py at lines 43 to 45?
6. I suggest adding a softmax activation to the XLMRoberta classifier.
7. I added Arabic examples because XLMRoberta is a multilingual model and it is good to show that. What do you think?
I ran my tests here: https://colab.research.google.com/drive/1uyIQetDu9auoMP4ZHVX82zLW3tkEzZx7#scrollTo=X72rpihBvZX7
Waiting for feedback.
@mattdangerw
Thanks.