Add BartPreprocessor #856
Conversation
Thanks for the PR!

A high-level comment about the label: the current version creates a causal LM label, but BART has multiple use cases, so shall we just let `y` pass through? For specific tasks, we can add special label creators, e.g. a `BartCausalLMPreprocessor`, just like `GPT2CausalLMPreprocessor`.
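For context, a minimal sketch of the two behaviors under discussion (the `tokenize_and_pack` helper is hypothetical, standing in for the layer's tokenization and packing logic):

```python
from tensorflow import keras

# "Let y pass through": the general preprocessor leaves any user-provided
# labels untouched and only transforms the features.
def general_call(x, y=None, sample_weight=None):
    x = tokenize_and_pack(x)  # hypothetical helper
    return keras.utils.pack_x_y_sample_weight(x, y, sample_weight)

# Causal LM labeling: derive y by shifting the decoder ids one step left.
def causal_lm_call(x, y=None, sample_weight=None):
    x = tokenize_and_pack(x)  # hypothetical helper
    token_ids = x["decoder_token_ids"]
    x = {**x, "decoder_token_ids": token_ids[..., :-1]}
    y = token_ids[..., 1:]  # each position predicts the next token
    return keras.utils.pack_x_y_sample_weight(x, y, sample_weight)
```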
```python
# TODO: Allow users to pass separate `sequence_length`s for encoder and
# decoder.
# Note: We use `MultiSegmentPacker` instead of `StartEndPacker` because
# we might want to support multiple segments in the future (at least for
```
I think we can drop the "in the future (at least for the encoder)" part; multiple segments are already used for MNLI. From the paper:
"""
The fine-tuned model concatenates the two sentences with an EOS token appended, and passes them to both the BART encoder and decoder. In contrast to BERT, the representation of the EOS token is used to classify the sentence relations.
"""
```python
    and ["encoder_inputs", "decoder_inputs"] == list(x.keys())
):
    raise ValueError(
        f'`x` must be a dictionary, containing the keys `"encoder_inputs"`'
```
nit: the first line doesn't need to be an f-string (nothing is interpolated in it).
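One way to address this (the second line of the message is an assumed continuation, since the quoted diff cuts off after the first):

```python
raise ValueError(
    '`x` must be a dictionary, containing the keys `"encoder_inputs"`'
    f' and `"decoder_inputs"`. Received x={x}.'  # assumed continuation
)
```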
```python
# Get the labels by shifting the decoder inputs one place to the left.
if decoder_token_ids.shape.rank == 1:
    y = decoder_token_ids[1:]
```
This is a causal LM label, but from the BART paper, causal LM is not the only use; IIUC it's really only needed for machine translation. Should we just let `y` pass through by default?
Left some comments; in particular, I'm not sure we are doing the label offset right for seq2seq.
````python
sequence_length: The length of the packed inputs.

Examples:
```python
````
let's rework this to follow #843
```python
super().__init__(**kwargs)
self.tokenizer = tokenizer

# TODO: Allow users to pass separate `sequence_length`s for encoder and
```
should we make an issue for this?
Resolved it in this PR itself.
```python
Args:
    tokenizer: A `keras_nlp.models.BartTokenizer` instance.
    sequence_length: The length of the packed inputs.
```
Probably worth mentioning that this is the length for both encoder and decoder sequences (for now).
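For example, the docstring could read something like:

```python
sequence_length: The length of the packed inputs. This length is used
    for both the encoder and decoder inputs.
```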
```python
# The last token does not have a next token. Hence, we truncate it.
x = {
    **x,
    "decoder_token_ids": decoder_token_ids[..., :-1],
```
Will this actually work as we want? I think this will generate an encoder sequence of length `sequence_length` but a decoder sequence of length `sequence_length - 1`.

We want both feature sequences to have the same length, I think, which means we have to tokenize the encoder sequence to length `sequence_length` and the decoder to length `sequence_length + 1` before the feature/label offsetting.
@mattdangerw - in that case, we'll need to define two `MultiSegmentPacker`s. Might as well work on #904 in this PR itself instead of saving it for later?
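A rough sketch of what that could look like (the constructor arguments are assumptions about `MultiSegmentPacker`'s API, and the special-token attributes on the tokenizer are illustrative):

```python
from keras_nlp.layers import MultiSegmentPacker

# Pack the encoder inputs to `sequence_length` tokens.
encoder_packer = MultiSegmentPacker(
    start_value=tokenizer.start_token_id,
    end_value=tokenizer.end_token_id,
    sequence_length=sequence_length,
)

# Pack the decoder inputs one token longer, so that after the
# feature/label shift both end up with `sequence_length` tokens.
decoder_packer = MultiSegmentPacker(
    start_value=tokenizer.start_token_id,
    end_value=tokenizer.end_token_id,
    sequence_length=sequence_length + 1,
)

decoder_full = decoder_packer(decoder_tokens)[0]  # the packer may also return segment ids
decoder_token_ids = decoder_full[..., :-1]  # feature, length `sequence_length`
y = decoder_full[..., 1:]                   # label, length `sequence_length`
```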
```python
left-to-right manner and fills up the buckets until we run
out of budget. It supports an arbitrary number of segments.

Examples:
```
Let's rework all of these pull requests to match the style here #843
```python
# Tokenize and pack a sentence pair.
inputs = {
    "encoder_inputs": (
```
Looking over this more, let's keep it simple on the first attempt and have no support for multiple segments in the base preprocessor layer for now. This will fit with the GPT2 code.

IMO this is still just too complicated, and I'm not sure of the use case. For classification, we can support multiple segments, but I don't see a huge need for multiple segments with separate encoder and decoder inputs. Do we have a clear use case there we want to support?

If not, let's land this with the simpler feature set.
```python
# Tokenize and pack a single sentence.
inputs = {
    "encoder_inputs": "The fox was sleeping.",
```
Open question: should we call this `"encoder_text"` to better accommodate `"encoder_audio"` for Whisper? Or will it be simpler to have the same names everywhere? I somewhat like the self-documenting property of saying this is text input.
Yep, `"encoder_text"` and `"decoder_text"` sound good to me!
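With that renaming, the docstring example would look something like this (the preset name and sentences are illustrative):

```python
import keras_nlp

preprocessor = keras_nlp.models.BartPreprocessor.from_preset("bart_base_en")

# Tokenize and pack a single sentence for the encoder and decoder.
inputs = {
    "encoder_text": "The fox was sleeping.",
    "decoder_text": "The fox was awake.",
}
x = preprocessor(inputs)
```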
/gcbrun
When should one use `BartPreprocessor` directly? Would we ever make a task-specific preprocessor for it?
@jbischof, overall, the idea for all model preprocessors is to have a general preprocessor, and then task-specific preprocessors which subclass the general preprocessor. We follow the same pattern for the other models we have in the library so far.
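A skeletal sketch of that layering, assuming the library's `Preprocessor` base class (the `BartCausalLMPreprocessor` name is hypothetical, following the `GPT2CausalLMPreprocessor` precedent):

```python
class BartPreprocessor(Preprocessor):
    # General preprocessor: tokenizes and packs the features, and lets
    # any user-provided `y` pass through unchanged.
    ...


class BartCausalLMPreprocessor(BartPreprocessor):
    # Task-specific preprocessor: reuses the parent's tokenization and
    # packing, then derives `y` by shifting the decoder token ids.
    ...
```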
Oops, accidentally closed the PR.
Thanks for the clarification @abheesht17!
/gcbrun
Thanks! Just minor comments
("tf_format", "tf", "model"), | ||
("keras_format", "keras_v3", "model.keras"), | ||
) | ||
def test_saved_model(self, save_format, filename): |
"decoder_text": " kohli is the best", | ||
} | ||
|
||
output = self.preprocessor(input_data) |
This would be much more readable as `x, y, sw = self.preprocessor(input_data)` (and use those below).
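That is, something along these lines (the `expected_*` values are placeholders for the test's actual expectations):

```python
x, y, sw = self.preprocessor(input_data)
self.assertAllEqual(x["encoder_token_ids"], expected_encoder_ids)
self.assertAllEqual(x["decoder_token_ids"], expected_decoder_ids)
self.assertAllEqual(y, expected_labels)
self.assertAllEqual(sw, expected_sample_weights)
```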
```python
    Each value in the dictionary should be a tensor of single string
    sequences. Inputs may be batched or unbatched. Raw python inputs
    will be converted to tensors.
y: Any label data. Any passed value will be ignored since this is
```
```python
model_output = model(input_data)
restored_model_output = restored_model(input_data)

self.assertAllEqual(
```
I think `assertAllClose` will handle a nested structure here, so you could just do `assertAllClose(outputs, restored_outputs)`.
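In other words, the whole comparison collapses to a single assertion (variable names as in the quoted test):

```python
model_output = model(input_data)
restored_model_output = restored_model(input_data)

# `assertAllClose` recurses into nested structures (dicts/lists of tensors).
self.assertAllClose(model_output, restored_model_output)
```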
```python
model_output = model(input_data)
restored_model_output = restored_model(input_data)

self.assertAllEqual(
```
Same here, this could get considerably shorter with `assertAllClose`.
/gcbrun
LGTM! Will pull in as soon as testing is done
Looks like the failure is unrelated, so I will pull this in.
Congrats @abheesht17!
Thanks, @jbischof! Text generation with BART next up!
Resolves #904