Keras automatic checkpoint #11197

WeichenXu123 · 2024-02-20T12:32:29Z

🛠 DevTools 🛠

Install mlflow from this PR

pip install git+https://github.com/mlflow/mlflow.git@refs/pull/11197/merge

Checkout with GitHub CLI

gh pr checkout 11197

What changes are proposed in this pull request?

Keras automatic checkpoint implementation.

This PR replaces the old PR: #10955

How is this PR tested?

Existing unit/integration tests
New unit/integration tests
Manual tests

Does this PR require documentation update?

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

Adds several automatic checkpointing arguments to Keras autologging.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

chenmoneygithub

Thanks Weichen, great work! Left some comments.

chenmoneygithub · 2024-02-20T22:57:13Z

mlflow/utils/checkpoint_utils.py

+                could reflect as little as 1 batch, since the metrics get reset
+                every epoch). Defaults to `"epoch"`.
+        """
+        self.client = client


Can we use the fluent API instead of the client API?

I prefer the client API because it can set the tracking URI explicitly (we need to set the tracking URI here for the distributed training case). Is there a reason to use the fluent API?

I think the fluent API is our standard user interface, so it is more robust, and it also lets us shorten the argument list. For the distributed training scenario, is mlflow.set_tracking_uri() enough?

But anyway, I don't think this is a blocking issue; I just want to open the discussion to align the API design.

Oh, we can remove the client argument from the MlflowModelCheckpointCallback constructor.

I also removed run_id from the MlflowModelCheckpointCallback constructor and now use the fluent APIs instead.

chenmoneygithub · 2024-02-20T23:37:27Z

mlflow/pytorch/_lightning_autolog.py

@@ -438,15 +359,25 @@ def on_train_batch_end(
        batch,
        batch_idx,
    ) -> None:
+        self.trainer = trainer


I'm a little concerned about this line: we are setting an instance attribute in on_train_batch_end, which is called multiple times. Do we need this line?

I can make it set the attribute only the first time.

The attribute is set so that save_checkpoint can use it; otherwise we need to find a way to pass the trainer object to save_checkpoint. Note that save_checkpoint is called in the base class, but its implementation is in the subclass.

Can we set it in the constructor or in the on_train_start hook?

Moved it into on_fit_start:

@rank_zero_only
def on_fit_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
    self.trainer = trainer

We can't put it in the constructor, because the callbacks argument is itself passed to the Trainer constructor, and we need to support usage like:

mlflow_checkpoint_callback = MlflowModelCheckpointCallback()  # we can't pass the trainer object here.
trainer = Trainer(
    callbacks=[mlflow_checkpoint_callback]
)
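
The ordering constraint discussed above can be illustrated with a minimal, dependency-free sketch. The Trainer and CheckpointCallback classes below are hypothetical stand-ins written for this illustration, not the real pl.Trainer or MlflowModelCheckpointCallback; the sketch only assumes that a fit-start hook runs once before any batch hook, and that save_checkpoint takes no trainer parameter.

```python
class Trainer:
    """Stand-in training loop that owns its callbacks (not the real pl.Trainer)."""

    def __init__(self, callbacks):
        # The callbacks already exist here, so they cannot have received
        # this trainer in their own constructors.
        self.callbacks = callbacks
        self.global_step = 0

    def fit(self, num_batches):
        for cb in self.callbacks:
            cb.on_fit_start(self)
        for batch_idx in range(num_batches):
            self.global_step += 1
            for cb in self.callbacks:
                cb.on_train_batch_end(self, batch_idx)


class CheckpointCallback:
    """Hypothetical callback that needs the trainer inside save_checkpoint()."""

    def __init__(self, save_every_n_steps):
        self.save_every_n_steps = save_every_n_steps
        self.trainer = None  # unknown until fit() starts
        self.saved_steps = []

    def on_fit_start(self, trainer):
        # Runs once per fit() call, so the reference is stored a single time
        # instead of being reassigned after every batch.
        self.trainer = trainer

    def on_train_batch_end(self, trainer, batch_idx):
        if trainer.global_step % self.save_every_n_steps == 0:
            self.save_checkpoint()

    def save_checkpoint(self):
        # No trainer parameter: the reference stored in on_fit_start is used.
        self.saved_steps.append(self.trainer.global_step)


callback = CheckpointCallback(save_every_n_steps=2)
trainer = Trainer(callbacks=[callback])
trainer.fit(num_batches=5)
print(callback.saved_steps)
```

Storing the reference in the fit-start hook keeps the per-batch hook free of side effects on the callback's own state, which is the concern the reviewer raised.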

mlflow/tensorflow/__init__.py

mlflow/tensorflow/_autolog.py

tests/tensorflow/test_tensorflow2_autolog.py

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

chenmoneygithub

Thanks Weichen! We are pretty close, left a few non-blocking comments.

chenmoneygithub · 2024-02-22T01:45:12Z

mlflow/pytorch/_lightning_autolog.py

@@ -286,7 +287,7 @@ def on_test_end(self, trainer, pl_module):
        self.metrics_logger.flush()


-class MlflowModelCheckpointCallback(pl.Callback, metaclass=ExceptionSafeAbstractClass):
+class MlflowModelCheckpointCallback(pl.Callback, MlflowModelCheckpointCallbackBase):


Although we don't currently expect users to use this class directly, it is public, so can we add a code example showing how to use it?

I'd prefer to make it a private class, _MlflowModelCheckpointCallbackBase.

Oh, we cannot make it a private class because we import it in a different file.

mlflow/pytorch/_lightning_autolog.py

mlflow/tensorflow/__init__.py

mlflow/tensorflow/callback.py

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

chenmoneygithub

Thanks Weichen! Approved with a comment

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

harupy

LGTM

harupy · 2024-02-26T08:46:28Z

mlflow/tensorflow/__init__.py

+            run_id=run_id, epoch=epoch, global_step=global_step, dst_path=tmp_dir.path()
+        )
+
+        if os.path.basename(downloaded_checkpoint_filepath).split(".")[-2] == "weights":


This line looks a bit scary. .split(".")[-2] raises an IndexError if the result of os.path.basename(downloaded_checkpoint_filepath) doesn't contain a dot.

Improved error handling for this case:

artifact_name = os.path.basename(downloaded_checkpoint_filepath)
artifact_name_splits = artifact_name.split(".")
if len(artifact_name_splits) < 2:
    raise MlflowException(
        f"The model checkpoint artifact file name '{artifact_name}' is malformed."
    )

If a str doesn't contain a dot, 'xx'.split(".") returns ['xx'].

I'd use endswith or a regex so we don't need os.path.basename or split.

Personally, I prefer split to a regex: if we carelessly write a bad regex, it might cause performance issues when matching a long string, so split seems safer to me.
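
The two approaches debated in this thread can be compared side by side in a short sketch. The function names, the example file names, and the ".weights.h5" suffix are assumptions made for illustration, and ValueError stands in for MlflowException so that the snippet runs without MLflow installed.

```python
import os


def is_weights_checkpoint_split(path: str) -> bool:
    """Split-based check with the length guard added in this thread."""
    name = os.path.basename(path)
    parts = name.split(".")
    if len(parts) < 2:
        # Without this guard, parts[-2] would raise IndexError.
        raise ValueError(
            f"The model checkpoint artifact file name '{name}' is malformed."
        )
    return parts[-2] == "weights"


def is_weights_checkpoint_endswith(path: str, suffix: str = ".weights.h5") -> bool:
    """Suffix-based check; needs neither basename nor split, but assumes a fixed suffix."""
    return path.endswith(suffix)


# Illustrative file names only.
for candidate in ("checkpoints/latest.weights.h5", "checkpoints/latest.h5"):
    print(
        candidate,
        is_weights_checkpoint_split(candidate),
        is_weights_checkpoint_endswith(candidate),
    )
```

The split version accepts any final extension but must guard against names without a dot, while the endswith version never raises but only recognizes the one suffix it is given.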

@WeichenXu123 #11250

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 · 2024-02-26T10:04:25Z

Test failures in tensorflow / dev are not relevant.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Arthur Jenoudet <arthur.jenoudet@databricks.com>

WeichenXu123 added 5 commits February 15, 2024 17:17

init

6f1c1aa

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

a8e5997

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

373e0f3

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

0064903

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

8d4d8f9

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 requested a review from chenmoneygithub February 20, 2024 12:32

github-actions bot added the area/tracking (Tracking service, tracking client APIs, autologging) and rn/feature (Mention under Features in Changelogs) labels Feb 20, 2024

chenmoneygithub requested changes Feb 20, 2024

WeichenXu123 added 2 commits February 21, 2024 11:00

merge-master

f8ac215

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

address comments

e9f735e

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 requested review from chenmoneygithub and mlflow-automation February 21, 2024 03:33

mlflow-automation requested review from B-Step62, BenWilson2, daniellok-db, harupy and serena-ruan and removed request for mlflow-automation February 21, 2024 03:34

address comments

eab18a7

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

harupy added the enable-dev-tests (Enables cross-version tests for dev versions) label Feb 22, 2024

chenmoneygithub requested changes Feb 22, 2024

WeichenXu123 added 7 commits February 22, 2024 11:30

merge master

f5befe2

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

address comments

0951d02

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

address comments

094ef04

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update example

0124d43

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

remove client arg

ff84be0

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

240b743

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Merge branch 'master' into keras-automatic-checkpoint

0bc9a6a

WeichenXu123 requested a review from chenmoneygithub February 22, 2024 08:21