
Processing model checkpointing #256

Merged: 16 commits into master from ohinds-model-checkpointing on Aug 25, 2023

Conversation

@ohinds (Contributor) commented on Aug 21, 2023

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Summary

This PR adds checkpointing to the segmentation processing estimator. It introduces a class that derives from the TensorFlow ModelCheckpoint callback and saves nobrainer-specific information at each checkpoint.
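The thread does not show the class itself, so the sketch below is only an assumption of its general shape: a subclass of `tf.keras.callbacks.ModelCheckpoint` that lets Keras save the model as usual, then writes a small JSON metadata file next to each checkpoint. The name `NobrainerCheckpoint` and the metadata fields are illustrative, not the PR's actual code.

```python
import json

import tensorflow as tf


class NobrainerCheckpoint(tf.keras.callbacks.ModelCheckpoint):
    """ModelCheckpoint that also persists estimator-level metadata.

    Illustrative sketch only; the PR's real class and saved fields
    are not shown in this thread.
    """

    def __init__(self, filepath, estimator_info=None, **kwargs):
        super().__init__(filepath, **kwargs)
        self.estimator_info = estimator_info or {}

    def on_epoch_end(self, epoch, logs=None):
        # Let Keras save the model checkpoint first ...
        super().on_epoch_end(epoch, logs)
        # ... then write nobrainer-specific state next to it. Keras
        # uses 1-based epoch numbers when formatting `filepath`.
        meta_path = self.filepath.format(epoch=epoch + 1) + ".json"
        with open(meta_path, "w") as f:
            json.dump({"epoch": epoch + 1, **self.estimator_info}, f)
```

Because the subclass calls `super().on_epoch_end()` before writing its own file, the standard Keras checkpointing behavior (`save_best_only`, `save_weights_only`, etc.) is untouched.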

Checklist

  • I have added tests to cover my changes
  • I have updated documentation (if necessary)

Acknowledgment

  • I acknowledge that this contribution will be available under the Apache 2 license.

@ohinds ohinds requested a review from satra August 21, 2023 23:29
@satra (Contributor) left a comment


comments on a few design choices that are not clear.

@@ -74,8 +82,13 @@ def _compile():
)

if warm_start:
    if checkpoint_tracker:
        self = checkpoint_tracker.load()
satra (Contributor) commented:

shouldn't this be a static method or classmethod if it overwrites self? perhaps take a look at the save/load method in base estimator.

@ohinds (Author) replied on Aug 22, 2023:

Yes, this was bothering me, too. Let me try reworking the interface a bit. The decisions I'm wrestling with are mostly related to how much to hide behind the scenes while still providing flexibility and consistency.

The way things are now, one decides to warm start and load a checkpoint when calling fit(), which seems logical to me, but then overwriting self with the loaded checkpoint seems like the right thing to do. I will try moving all this to static construction to see how the interface looks.
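For reference, the classmethod-based interface the reviewer points to could look roughly like this (a minimal stand-in, not nobrainer's actual BaseEstimator): loading constructs and returns a new instance, so nothing ever rebinds `self`.

```python
import json


class Estimator:
    """Minimal stand-in showing the save()/load() classmethod pattern."""

    def __init__(self, config=None):
        self.config = config or {}

    def save(self, path):
        # Persist whatever state the estimator needs to resume.
        with open(path, "w") as f:
            json.dump(self.config, f)

    @classmethod
    def load(cls, path):
        # A classmethod builds and returns the restored instance,
        # so callers never need `self = ...` inside fit().
        with open(path) as f:
            return cls(config=json.load(f))
```

With this shape, warm starting becomes `est = Estimator.load(path)` at construction time rather than a mutation inside fit().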

Comment on lines 34 to 38
checkpoint_dir=os.getcwd(),
checkpoint_file_path=None,
satra (Contributor) commented:

isn't checkpoint often a directory? why is this asking for a file path?

@ohinds (Author) replied:

This is the tf/keras approach, which I'm extending here (see https://keras.io/guides/training_with_built_in_methods/#checkpointing-models).

filepath can be a directory, but the relevant epoch or batch variables are also formatted into it. See the test below for using filepath to save into a directory.
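Concretely, Keras substitutes fields like `epoch` (and logged metrics) into the filepath template at save time, so a single template names a per-epoch file inside a directory. The directory and template names here are illustrative:

```python
import os

checkpoint_dir = "checkpoints"  # illustrative directory name

# Keras calls .format() on this template with the current epoch
# (and the logs dict) each time it saves.
template = os.path.join(checkpoint_dir, "ckpt_{epoch:03d}")

print(template.format(epoch=5))  # -> checkpoints/ckpt_005 (on POSIX)
```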

@satra (Contributor) left a comment

a few suggestions for the failing pre-commit test.

i think this looks cleaner than before. it would be good to consider what a version of this script would look like in a slurm setting where the job is cancelled and requeued. how would that do it?

import numpy as np
from numpy.testing import assert_allclose
import os
import pytest
satra (Contributor) commented:

Suggested change
import pytest

@@ -1,13 +1,15 @@
"""Tests for `nobrainer.processing.checkpoint`."""

from nobrainer.processing.segmentation import Segmentation
from nobrainer.models import meshnet
import os
satra (Contributor) commented:

Suggested change
import os

@@ -3,6 +3,7 @@
from glob import glob
import logging
import os

import tensorflow as tf

from .base import BaseEstimator
satra (Contributor) commented:

Suggested change
from .base import BaseEstimator

@ohinds (Author) commented on Aug 24, 2023

> a few suggestions for the failing pre-commit test.
>
> i think this looks cleaner than before. it would be good to consider what a version of this script would look like in a slurm setting where the job is cancelled and requeued. how would that do it?

I added a test test_warm_start_workflow that demonstrates how this would be done. I could wrap this inside the BaseEstimator if you think that would be cleaner.

    for weight_array in layer.get_weights():
        assert np.count_nonzero(weight_array)
except (AssertionError, ValueError):
    bem = Segmentation(meshnet, checkpoint_filepath=checkpoint_filepath)
satra (Contributor) commented:

how about simply:

bem = Segmentation(meshnet, checkpoint_filepath=checkpoint_filepath, warm_start=True)

satra (Contributor) added:

and have the try-except be a function of warm_start inside the baseestimator and checkpoint_filepath

@ohinds (Author) replied:

Because this isn't a warm start? That seems confusing.

@ohinds (Author) replied on Aug 25, 2023:

We could do something like

bem = Segmentation.load_or_init_with_checkpoints(meshnet, checkpoint_filepath=checkpoint_filepath)

which could initialize from zero if no checkpoints are found?

satra (Contributor) replied:

it is a warm start in a way. doing init_with_checkpoints sounds good.

@ohinds (Author) replied:

Great, pushed that change.
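The init-with-checkpoints approach settled on above can be sketched generically. This is a stand-in estimator, not nobrainer's actual implementation: look for existing files matching the checkpoint template, load the latest one if found, otherwise initialize from scratch.

```python
from glob import glob
import json
import os


class Estimator:
    """Stand-in estimator; nobrainer's real class carries more state."""

    def __init__(self, config=None):
        self.config = config or {}

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.config, f)

    @classmethod
    def init_with_checkpoints(cls, checkpoint_filepath, config=None):
        # Replace the epoch placeholder with a wildcard so any saved
        # epoch matches (assumes an {epoch:03d}-style template).
        pattern = checkpoint_filepath.replace("{epoch:03d}", "*")
        existing = sorted(glob(pattern))
        if existing:
            # Warm start: resume from the most recent checkpoint.
            with open(existing[-1]) as f:
                return cls(config=json.load(f))
        # Cold start: no checkpoints found, initialize from zero.
        return cls(config=config)
```

A cancelled-and-requeued slurm job can then call `init_with_checkpoints` unconditionally at the top of the script: the first run initializes fresh, and every requeued run resumes from whatever checkpoint exists.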

@satra (Contributor) commented on Aug 25, 2023

@ohinds - guide notebook still not running.

@ohinds (Author) commented on Aug 25, 2023

> @ohinds - guide notebook still not running.

That was a transient error. I restarted.

@satra satra merged commit 99eaf9d into master Aug 25, 2023
7 checks passed
@hvgazula hvgazula deleted the ohinds-model-checkpointing branch March 8, 2024 15:59