Rework model docstrings for progressive disclosure of complexity for f_net #879
@@ -28,14 +28,14 @@ class FNetMaskedLMPreprocessor(FNetPreprocessor):
     `keras_nlp.models.FNetMaskedLM` task model. Preprocessing will occur in
     multiple steps.

-    - Tokenize any number of input segments using the `tokenizer`.
-    - Pack the inputs together with the appropriate `"<s>"`, `"</s>"` and
+    1. Tokenize any number of input segments using the `tokenizer`.
+    2. Pack the inputs together with the appropriate `"<s>"`, `"</s>"` and
        `"<pad>"` tokens, i.e., adding a single `"<s>"` at the start of the
        entire sequence, `"</s></s>"` between each segment,
        and a `"</s>"` at the end of the entire sequence.
-    - Randomly select non-special tokens to mask, controlled by
+    3. Randomly select non-special tokens to mask, controlled by
        `mask_selection_rate`.
-    - Construct a `(x, y, sample_weight)` tuple suitable for training with a
+    4. Construct a `(x, y, sample_weight)` tuple suitable for training with a
        `keras_nlp.models.FNetMaskedLM` task model.

     Args:
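As an aside (not part of this diff), here is a minimal sketch of what a call to the preprocessor returns after the four steps above; the exact keys inside `x` can vary by keras_nlp version, so treat them as illustrative.

```python
import keras_nlp

# Load the preprocessor from a preset (downloads the FNet vocabulary).
preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor.from_preset(
    "f_net_base_en"
)

# Steps 1-4 all happen inside this single call.
x, y, sample_weight = preprocessor("The quick brown fox jumped.")

# `x` is a dict of model inputs (packed token ids, segment ids, and the
# positions that were masked), `y` holds the original token ids at those
# positions, and `sample_weight` marks which mask predictions count
# toward the loss.
print(x.keys(), y.shape, sample_weight.shape)
```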
@@ -66,54 +66,53 @@ class FNetMaskedLMPreprocessor(FNetPreprocessor):
     out of budget. It supports an arbitrary number of segments.

     Examples:

+    Directly calling the layer on data.
     ```python
     # Load the preprocessor from a preset.
     preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor.from_preset(
         "f_net_base_en"
     )

     # Tokenize and mask a single sentence.
-    sentence = tf.constant("The quick brown fox jumped.")
-    preprocessor(sentence)
+    preprocessor("The quick brown fox jumped.")

-    # Tokenize and mask a batch of sentences.
-    sentences = tf.constant(
-        ["The quick brown fox jumped.", "Call me Ishmael."]
-    )
-    preprocessor(sentences)
+    # Tokenize and mask a batch of single sentences.
+    preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])

-    # Tokenize and mask a dataset of sentences.
-    features = tf.constant(
-        ["The quick brown fox jumped.", "Call me Ishmael."]
+    # Tokenize and mask sentence pairs.
+    # In this case, always convert input to tensors before calling the layer.
+    first = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
+    second = tf.constant(["The fox tripped.", "Oh look, a whale."])
+    preprocessor((first, second))
+    ```

+    Mapping with `tf.data.Dataset`.
Review comment: newline before this heading
+    ```python
+    preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor.from_preset(
+        "f_net_base_en"
+    )
-    ds = tf.data.Dataset.from_tensor_slices((features))

+    first = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
+    second = tf.constant(["The fox tripped.", "Oh look, a whale."])

+    # Map single sentences.
+    ds = tf.data.Dataset.from_tensor_slices(first)
     ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

-    # Alternatively, you can create a preprocessor from your own vocabulary.
-    vocab_data = tf.data.Dataset.from_tensor_slices(
-        ["the quick brown fox", "the earth is round"]
-    )

-    # Creating sentencepiece tokenizer for FNet LM preprocessor
-    bytes_io = io.BytesIO()
-    sentencepiece.SentencePieceTrainer.train(
-        sentence_iterator=vocab_data.as_numpy_iterator(),
-        model_writer=bytes_io,
-        vocab_size=12,
-        model_type="WORD",
-        pad_id=0,
-        bos_id=1,
-        eos_id=2,
-        unk_id=3,
-        pad_piece="<pad>",
-        unk_piece="<unk>",
-        bos_piece="[CLS]",
-        eos_piece="[SEP]",
-        user_defined_symbols="[MASK]",
+    # Map sentence pairs.
+    ds = tf.data.Dataset.from_tensor_slices((first, second))
+    # Watch out for tf.data's default unpacking of tuples here!
Review thread:

- Not by this PR - I think it is worth calling out that `first` and `second` will be concatenated if calling `preprocessor` in this way. Now the comment just says "watch out" without showing the output. Maybe we can add "sentence pairs are automatically packed before tokenization"? @mattdangerw thoughts on this?

- Ah, that is not quite the issue here. The fact that the outputs are concatenated is not that surprising. The fact the […] It stems from the fact that these two calls are handled differently... We can update this comment if we want, but I would not do it on this PR. I would do that on a separate PR, for all the models at once (so we don't forget to update this elsewhere).

- I believe this should be fine if we open a separate PR covering the different models at once.

- Yeah, the comment above was meant as an explainer. Let's stick to the language we have been using in other PRs verbatim for this PR.
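The snippet the reply alludes to is not preserved above. As a rough, standalone sketch (not the original example) of the `tf.data` behavior being discussed: a dataset built from a tuple unpacks its elements into separate positional arguments during `map`, while a direct call receives the pair as a single tuple argument.

```python
import tensorflow as tf

first = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
ds = tf.data.Dataset.from_tensor_slices((first, second))

# Inside `map`, the (first, second) tuple is unpacked, so the mapped
# function is called with two positional arguments.
joined = ds.map(lambda a, b: tf.strings.join([a, b], separator=" "))
for example in joined:
    print(example.numpy())

# A direct call, by contrast, receives the pair as one tuple argument,
# which is why the docstring wraps the preprocessor in a lambda that
# rebuilds the tuple: ds.map(lambda a, b: preprocessor(x=(a, b))).
```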
+    # Best to invoke the `preprocessor` directly in this case.
+    ds = ds.map(
+        lambda first, second: preprocessor(x=(first, second)),
+        num_parallel_calls=tf.data.AUTOTUNE,
+    )
-    proto = bytes_io.getvalue()
-    tokenizer = keras_nlp.models.FNetTokenizer(proto=proto)
-    preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor(tokenizer=tokenizer)
     ```
     """
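For reference, the old "create a preprocessor from your own vocabulary" example is scattered across the lines marked as removed in this hunk. Assuming those removed lines formed a single block, it would reassemble roughly as follows (a sketch, not code from this PR):

```python
import io

import sentencepiece
import tensorflow as tf
import keras_nlp

# Train a small SentencePiece model in memory from toy data.
vocab_data = tf.data.Dataset.from_tensor_slices(
    ["the quick brown fox", "the earth is round"]
)
bytes_io = io.BytesIO()
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=vocab_data.as_numpy_iterator(),
    model_writer=bytes_io,
    vocab_size=12,
    model_type="WORD",
    pad_id=0,
    bos_id=1,
    eos_id=2,
    unk_id=3,
    pad_piece="<pad>",
    unk_piece="<unk>",
    bos_piece="[CLS]",
    eos_piece="[SEP]",
    user_defined_symbols="[MASK]",
)
proto = bytes_io.getvalue()

# Build the tokenizer and preprocessor from the trained proto.
tokenizer = keras_nlp.models.FNetTokenizer(proto=proto)
preprocessor = keras_nlp.models.FNetMaskedLMPreprocessor(tokenizer=tokenizer)
```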
Review comment: This looks like it is missing most of the content on the BERT classifier, may be worth another look.