
Support Task Saving/Loading #1547

Merged: 39 commits merged into keras-team:master on Apr 15, 2024

Conversation


@SamanehSaadat SamanehSaadat commented Apr 2, 2024

To support saving and loading Task, the following changes have been made:

  • Support saving and loading task.json and task.weights.h5.
  • Support saving and loading Preprocessor (added preprocessor.json).
  • Move preset saving and loading logic to the base classes, e.g. Tokenizer, Backbone, etc., keeping only the low-level preset manipulation in preset_utils.py.

TODO:

  • Add unit tests for the new functions added to preset_utils.py.
  • Add unit tests for saving and loading in each base class.

Future plan: currently the backbone config and weights are called config.json and model.weights.h5. Our plan is to rename these to backbone.json and backbone.weights.h5 in a follow-up PR.
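
As a rough sketch of the end-to-end flow this enables (the preset directory name and task class below are illustrative, not from this PR):

import keras_nlp

# Fine-tune a task, then save everything needed to restore it.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased", num_classes=2
)
# ... fit on downstream data ...
classifier.save_to_preset("./my_classifier_preset")
# The directory now holds config.json, model.weights.h5, task.json,
# task.weights.h5, preprocessor.json, tokenizer.json, and tokenizer assets.

# Restore the task, including any fine-tuned head weights.
restored = keras_nlp.models.BertClassifier.from_preset("./my_classifier_preset")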

@mattdangerw mattdangerw left a comment

Looks good! Just dropping some initial feedback!


def save_to_preset(self, preset):
    """TODO: add docstring."""
    save_to_preset(
Member:

Can we just call self.tokenizer.save_to_preset(preset) here? Also, I wonder if we should update the name of the preset arg to path. That would make it clearer that we are looking for a filesystem path here.

Member Author:

Done!


self.preprocessor.save_to_preset(preset)
self.backbone.save_to_preset(preset)
weights_filename = "task.weights.h5"
Member:

Keep this as a constant? Just for consistency?

Member Author:

Good point! Done!
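
(The constant in question surfaces later in this diff as TASK_WEIGHTS_FILE; presumably the change amounts to something like the following in preset_utils.py, with names taken from elsewhere in the thread.)

TASK_CONFIG_FILE = "task.json"
TASK_WEIGHTS_FILE = "task.weights.h5"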

self.backbone.save_to_preset(preset)
weights_filename = "task.weights.h5"

# TODO: the serialization and saving logic should probably be moved to preset_utils.py
Member:

+1. Not exactly sure where the dividing lines should live, but we should probably clear up the division between the model code and the saving utils a bit.

)
weights_store.close()

# TODO: do we want to have a `save_weights` flag in this public save_to_preset? probably yes!
Member:

To save the architecture without weights? What's the use case?

Member Author:

I don't know what I was thinking when I wrote this :))
I was probably thinking about adding a load_weights flag to anywhere we load weights (similar to what we have in load_from_preset)!
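
(For reference, the existing flag works roughly like this; a sketch using keras-nlp's public from_preset, with an illustrative preset name.)

# Build the architecture from a preset without restoring pretrained weights.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased",
    num_classes=2,
    load_weights=False,  # randomly initialized instead of pretrained
)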

weights_store = keras.src.saving.saving_lib.H5IOStore(
    filepath, mode="r"
)
# Q: when loading task weights, there shouldn't be any backbone layers, why calculate and exclude backbone layers?
Member:

Otherwise I believe _load_weights will fail, because it will try to load the backbone layers' weights and not find them in the file.

Member Author:

I see! Thanks for the explanation.
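
(A sketch of the mechanism under discussion: the ids of every backbone-owned layer are collected and passed as objects_to_skip, as seen later in this diff, so the task weights store never tries to look them up. _flatten_layers is a private Keras API, which the PR itself relies on.)

# Everything owned by the backbone is excluded from task.weights.h5.
backbone_layer_ids = set(
    id(layer) for layer in self.backbone._flatten_layers()
)
# ...then passed along as objects_to_skip=backbone_layer_ids.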

task = keras.saving.deserialize_keras_object(task_config)
load_weights = load_weights and task_config["weights"]
task_weights_path = os.path.join(preset, task_config["weights"])
task.load_task_weights(task_weights_path)
Member:

We need to call task.backbone.load_weights too somewhere, right? Where do the backbone weights get loaded?

Member Author:

The backbone is loaded before the task by calling load_from_preset.

Then, if the preset has a task.json, the task is loaded and we set task.backbone = backbone.

@@ -267,59 +274,162 @@ def from_preset(
"constructor with a `backbone` argument. "
f"Received: backbone={kwargs['backbone']}."
)

# Load backbone from preset.
config_path = os.path.join(preset, CONFIG_FILE)
Member:

I think we need to rework this to a slightly new flow. Something like this...

# Backbone case.
if not exists("task.json") or not issubclass(check_config_class("task.json"), cls):
    # This should be basically what is here already.
    # Load a preprocessor. Load a backbone.
    return cls(backbone=backbone, preprocessor=preprocessor, **kwargs)

# Task case.
task = keras.saving.deserialize_keras_object("task.json")
if load_weights:
    task.backbone.load_weights("model.weights.h5")
    task.load_task_weights("task.weights.h5")

Basically if we don't see a task.json or the task.json is for a different task, we load the low level objects and make a default task object with them. If we find a task object for our class, we load it exactly as it was before.

Let me know if that makes sense. I think the logic here is a bit different.

Member Author:

What I was thinking was to load the backbone and preprocessor in any case (whether we have a task.json or not):

  • If there is no task.json, make a default task object with the backbone and preprocessor.
  • If there is a task.json, load the task-specific things and assign the loaded backbone to task.backbone.

Do you see any issues or disadvantages with this approach? I'm mainly doing this because I thought it would be cleaner to have just one piece of code that loads the backbone.

PS: I made some changes this morning so they may not have been included in the version of code that you reviewed.

Member:

Assigning the backbone to the task would mean we double up on backbone memory until the next GC. We have to avoid that, I think; otherwise we will OOM people very easily.

One case I was getting at in my snippet, but worth calling out explicitly: you saved, say, a BertClassifier but are loading a different task, e.g. a BertMaskedLM. In this case our task.json is from the wrong object, and I think we fall back to the "backbone case" here.
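
(Spelling that case out, with an illustrative preset path:)

# Saved as one task...
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased", num_classes=2
)
classifier.save_to_preset("./preset")  # task.json records a BertClassifier

# ...loaded as another: task.json is for the wrong class, so loading
# falls back to the "backbone case" and builds a fresh task head.
masked_lm = keras_nlp.models.BertMaskedLM.from_preset("./preset")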

Member Author:

Oh, I see! Makes sense! Thanks for explaining this, Matt!

@SamanehSaadat SamanehSaadat left a comment

Thanks for the review, Matt!



@SamanehSaadat SamanehSaadat marked this pull request as ready for review April 11, 2024 01:20
@mattdangerw mattdangerw left a comment

Thanks! Looking good. I think there are two main questions I see: how we structure preprocessor and task loading, and what we save in our json files. They are interrelated.

save_serialized_object(self, preset, config_file=CONFIG_FILE)
save_weights(self, preset, MODEL_WEIGHTS_FILE)
save_metadata(self, preset)
# save_to_preset(self, preset)
Member:

Remove commented-out code?

filter(lambda x: x.backbone_cls == preset_cls, subclasses)

task = None
try:
Member:

I think this could get a lot more readable if we made a check_file_exists or similarly named util. The try/except and wrapping our get_file in preset utils with a FileNotFoundError makes for a weird interface.

Member Author:

Added check_file_exists.
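
(Presumably a thin wrapper like the following, assuming get_file raises FileNotFoundError for missing files, as described above:)

def check_file_exists(preset, path):
    try:
        get_file(preset, path)
    except FileNotFoundError:
        return False
    return True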

objects_to_skip=backbone_layer_ids,
)

def save_weights(self, filepath):
Member:

We should probably call this save_task_weights. It's different from the vanilla Keras save_weights, so we should name and document it differently.

Member Author:

Done!

make_preset_dir(preset)
save_tokenizer_assets(self, preset)
save_serialized_object(self, preset, config_file=TOKENIZER_CONFIG_FILE)
# save_to_preset(self, preset, config_filename=TOKENIZER_CONFIG_FILE)
Member:

remove?

Member Author:

Done!

Args:
    preset: The path to the local model preset directory.
"""
check_keras_version()
Member:

Should we move make_preset_dir(preset) and check_keras_version() down into the other utilities? E.g. into save_serialized_object?

That would make these top-level functions a little less cluttered and easier to read.

Member Author:

Good point! Moved these two to save_serialized_object.

return task
return cls(backbone=backbone, preprocessor=preprocessor, **kwargs)

def load_weights(self, filepath):
Member:

We should probably call this load_task_weights. It's different from the vanilla Keras load_weights, so we should name and document it differently.

Member Author:

Done!

load_weights=load_weights,
config_overrides=kwargs,
config_file=TASK_CONFIG_FILE,
config_to_skip=["preprocessor", "backbone"],
Member:

To discuss, but I'm not sure we should do this.

For weights we are saving the task bits separately, but that's really because weights are huge. We can't afford to save backbone weights and task weights separately in their entirety.

For configs, everything is small. We can duplicate, and can just effectively do keras.saving.serialize_keras_object(task) here. That means duplicated config between tokenizer.json, backbone.json, preprocessor.json, and task.json. But we don't care: it's lightweight, makes our code simpler, and most importantly keeps the assets we put in the format simple. Any user can call keras.saving.deserialize_keras_object(task_json) if they are so inclined (though that won't handle weight loading).

Member Author:

As we discussed offline, we'll allow config duplication, i.e. task.json includes backbone config and preprocessor.json includes tokenizer config.
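
(A sketch of the duplication-friendly approach using only public Keras serialization APIs; the file layout is assumed:)

import json
import os

import keras

# Saving: serialize the whole task. The nested backbone and preprocessor
# configs are simply embedded (duplicated) in task.json.
config = keras.saving.serialize_keras_object(task)
with open(os.path.join(preset, "task.json"), "w") as f:
    json.dump(config, f, indent=4)

# Any user can rebuild the architecture directly (weights not handled here).
with open(os.path.join(preset, "task.json")) as f:
    task = keras.saving.deserialize_keras_object(json.load(f))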

backbone = load_from_preset(
backbone_config = load_config(preset, CONFIG_FILE)
# TODO: this is not really an override! It's an addition! Should I rename this?
config_overrides = {"backbone": backbone_config}
Member:

Just make a comment below. But I think we might want to keep our json objects really simple, so we don't need to patch them like this.

Member Author:

Done!

cls = subclasses[0]
tokenizer = load_from_preset(

# For backward compatibility, if preset doesn't have `preprocessor.json`
@mattdangerw mattdangerw (Member) Apr 11, 2024:

Oops, I had a comment here I must have forgotten to hit save on. To discuss, but I think there are two main cases for task and preprocessor loading. Neither is backward compat.

# Preprocessor load.
if exists("preprocessor.json") and is_class("preprocessor.json", cls):
    preprocessor = load_serialized_object(preset, "preprocessor.json", **kwargs)
    # ...load tokenizer assets...
else:
    # Load from sub objects and create with default config.
    tokenizer = tokenizer_cls.from_preset(preset)
    preprocessor = cls(tokenizer=tokenizer)

# Task load.
if exists("task.json") and is_class("task.json", cls):
    task = load_serialized_object(preset, "task.json", **kwargs)
    # ...load weights...
    # ...load tokenizer assets...
else:
    # Load from sub objects and create with default config.
    backbone = backbone_cls.from_preset(preset)
    preprocessor = preprocessor_cls.from_preset(preset)
    task = cls(backbone=backbone, preprocessor=preprocessor)

@SamanehSaadat SamanehSaadat left a comment

Thanks for the review, Matt!


@mattdangerw mattdangerw left a comment

Looks good! A few more comments.

preset,
PREPROCESSOR_CONFIG_FILE,
)
for asset in preprocessor.tokenizer.file_assets:
Member:

Can't we get rid of the function below and just do

for asset in preprocessor.tokenizer.file_assets:
    filename = get_file(preset, os.path.join(TOKENIZER_ASSET_DIR, asset))
dirname = os.path.dirname(filename)

Seems simpler.

Member Author:

Done!

preset,
config_file="tokenizer.json",
load_weights=load_weights,
config_overrides=config_overrides,
Member:

Below this, we can have

preprocessor = cls.preprocessor_cls.from_preset(preset)
return cls(backbone=backbone, preprocessor=preprocessor, **kwargs)

and then we are done I think. No need for the rest of this function. We can delegate to preprocessor.from_preset which has the logic you have below.

Member Author:

Right! Preprocessor should have all the necessary validation logic to prevent repetition!
Done!

):
    """Validate a preset is being loaded on the correct class."""
    config_path = get_file(preset, config_file)
    with open(config_path) as config_file:
        config = json.load(config_file)
    return keras.saving.get_registered_object(config["registered_name"])


def get_asset_dir(
Member:

I'm not sure this util needs to exist; see comment above.

Member Author:

Removed it!

@mattdangerw mattdangerw left a comment

Looks great!

I think we are missing a couple of edge cases. But this looks really solid and readable overall!

)

save_serialized_object(self, preset_dir, config_file=TASK_CONFIG_FILE)
self.save_task_weights(get_file(preset_dir, TASK_WEIGHTS_FILE))
Member:

I think we need to think more about the case where our task has no new weights. For a lot of language modeling stuff, that will be the norm. I think the high-level behavior we want is:

task = ...  # something where all weights are in the backbone
task.save_to_preset("dir")  # ok! no task.weights.h5 created
task.save_task_weights("task.weights.h5")  # probably good to error like we do now

Not sure how to best handle this in code, just try/catch on this line? Add a self.has_task_weights() method?

We need to handle this on the loading side too. Skip load_task_weights if the file does not exist.

Member Author:

Added a self.has_task_weights() to check that task weights exist before saving. I do this check again in save_task_weights() so task.save_task_weights() is a complete function on its own!

For loading in from_preset, I added a check to only load weights if the file exists.

Let me know what you think.
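
(A sketch of what has_task_weights() could check, using only public attributes; the merged implementation may differ:)

def has_task_weights(self):
    # The task has its own weights iff it owns any variable that the
    # backbone does not.
    backbone_ids = set(id(w) for w in self.backbone.weights)
    return any(id(w) not in backbone_ids for w in self.weights)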

@mattdangerw mattdangerw left a comment

This looks great! Some final nits.

message = str(e)
if message.find("403 Client Error"):
    raise FileNotFoundError(
        f"`{path}` doesn't exist in preset directory `{preset}`.\n"
Member:

I think usually we don't need a trailing \n in error messages. Would format strangely. Only between lines we want to separate.

Member Author:

Right! Removed \n!

message = str(e)
if message.find("403 Client Error"):
    raise FileNotFoundError(
        f"`{path}` doesn't exist in preset directory `{preset}`.\n"
Member:

same here

Member Author:

Done!

local_path = os.path.join(preset, path)
if not os.path.exists(local_path):
    raise FileNotFoundError(
        f"`{path}` doesn't exist in preset directory `{preset}`.\n"
Member:

and here

Member Author:

Done!

weights_filename="model.weights.h5",
):
    """Save a KerasNLP layer to a preset directory."""
def check_keras_version():
Member:

Nit: check_keras_3() might improve readability.

Member Author:

Done!
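
(A sketch of such a guard; the message and version check are assumed, not the merged code:)

import keras


def check_keras_3():
    # Preset saving and loading relies on Keras 3 saving machinery.
    if int(keras.__version__.split(".")[0]) < 3:
        raise RuntimeError(
            "Saving and loading presets requires Keras 3. "
            f"Found Keras version {keras.__version__}."
        )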

@SamanehSaadat SamanehSaadat merged commit ee5263b into keras-team:master Apr 15, 2024
10 checks passed
@SamanehSaadat SamanehSaadat mentioned this pull request Apr 22, 2024