Bug fixes and unit testing for Callbacks #242

dhpitt · 2023-10-18T17:32:03Z

Simple change: the model now loads its checkpoint in Trainer.train() instead of Trainer.__init__().
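To illustrate the change described above, here is a minimal sketch, not the neuralop implementation. The trainer only records where to resume from at construction time and defers the actual load to the start of `train()`. `SketchTrainer`, `resume_from_dir`, and the `load_fn` hook are hypothetical names used for illustration only.

```python
from pathlib import Path
from typing import Callable, Optional


class SketchTrainer:
    """Hypothetical trainer that defers checkpoint loading to train()."""

    def __init__(self, model, resume_from_dir: Optional[str] = None,
                 load_fn: Optional[Callable[[object, Path], None]] = None):
        self.model = model
        # Only remember the location here; nothing is read from disk yet.
        self.resume_from_dir = Path(resume_from_dir) if resume_from_dir else None
        # load_fn is an assumed hook standing in for whatever restores the weights.
        self.load_fn = load_fn
        self.loaded = False

    def train(self, n_epochs: int = 1) -> int:
        # Load the checkpoint once, right before training starts.
        if (self.resume_from_dir is not None
                and self.load_fn is not None
                and not self.loaded):
            self.load_fn(self.model, self.resume_from_dir / "model.pt")
            self.loaded = True
        epochs_run = 0
        for _ in range(n_epochs):
            epochs_run += 1  # placeholder for the real training loop
        return epochs_run
```

One consequence of this ordering is that constructing the trainer stays cheap and free of disk access, which can make it easier to unit test.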

JeanKossaifi

Looks good, left some comments

neuralop/training/callbacks.py

examples/checkpoint_FNO_darcy.py

neuralop/training/__init__.py

neuralop/training/callbacks.py

JeanKossaifi · 2023-10-20T22:53:17Z

neuralop/training/callbacks.py

+            folder from which to resume training state. 
+            Expects saved states in the form: (all but model optional)
+                model.pt, optimizer.pt, scheduler.pt, regularizer.pt
+            All state files present will be loaded. 
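The docstring above can be read as "model.pt is required, the rest are loaded if present". A rough sketch of that discovery step follows, assuming the four file names listed in the docstring. The helper name `find_state_files` is hypothetical, and the actual deserialization (for example with `torch.load`) is deliberately left out.

```python
from pathlib import Path
from typing import Dict

# File names taken from the docstring; only model.pt is mandatory.
STATE_FILES = ("model.pt", "optimizer.pt", "scheduler.pt", "regularizer.pt")


def find_state_files(folder) -> Dict[str, Path]:
    """Return {state name: path} for every expected state file present in folder."""
    folder = Path(folder)
    found = {
        Path(name).stem: folder / name
        for name in STATE_FILES
        if (folder / name).is_file()
    }
    if "model" not in found:
        raise FileNotFoundError(f"model.pt not found in {folder}")
    return found
```

A caller would then iterate over the returned mapping and restore each component that has a saved state.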


I'm wondering whether we should
i) try to load best_model.pt (if monitor_metric is not None this will exist)
ii) if that fails, load model.pt (which will be the model from the latest save_interval)
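The two-step fallback proposed here could look roughly like the sketch below. It is an illustration only: `resolve_model_checkpoint` is a hypothetical helper, and "fails" is interpreted narrowly as "the file does not exist" rather than "loading raised an error".

```python
from pathlib import Path


def resolve_model_checkpoint(folder) -> Path:
    """Prefer best_model.pt, fall back to model.pt, else raise."""
    folder = Path(folder)
    for name in ("best_model.pt", "model.pt"):
        candidate = folder / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no best_model.pt or model.pt in {folder}")
```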

That currently isn't the way I have things set up: at each save_interval I create a new folder that saves the state. Same for best_model. Do you think it would be better to continually overwrite the save in one folder?

I think so - to avoid blowing up memory?
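For reference, overwriting in a single folder might be sketched as below, so that the number of saved files stays constant no matter how many save intervals pass. This is an assumption-laden illustration: `save_states_in_place` is a hypothetical helper, and raw bytes stand in for serialized state dicts.

```python
import os
from pathlib import Path
from typing import List, Mapping


def save_states_in_place(save_dir, states: Mapping[str, bytes]) -> List[Path]:
    """Write each state to <save_dir>/<name>.pt, replacing any earlier save."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, payload in states.items():
        target = save_dir / f"{name}.pt"
        # Write to a temporary file first so an interrupted save
        # does not leave a half-written checkpoint behind.
        tmp = target.with_name(target.name + ".tmp")
        tmp.write_bytes(payload)
        os.replace(tmp, target)
        written.append(target)
    return written
```

The trade-off is that earlier checkpoints are no longer available once they are overwritten, which is why a separate best-model file is still useful.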

neuralop/training/callbacks.py


JeanKossaifi · 2023-10-24T00:32:25Z

Awesome, thank you @dhpitt !

dhpitt added 3 commits October 18, 2023 17:28

load from checkpoint in trainer.Train()

aa4c0bd

change to new checkpoint load syntax

ae5ef65

test model checkpoint callback

583683e

dhpitt changed the title ~~Load model from checkpoint in a trainer~~ Bug fixes and unit testing for Callbacks Oct 18, 2023

dhpitt added 11 commits October 19, 2023 15:15

pause training and save all states on epoch callback

74017c7

pause and resume training callbacks

211645d

callback test with new dir structure

f6f0293

simpler state checkpoint callback

f086d9f

test for new state checkpointer

9aa39ee

test for new state checkpointer

b382c79

polish checkpointer

4bbe561

update example for new callback

e36a90e

verbose flag defaults to false

6724349

verbose flag defaults to false

4093b04

fix typo to merge

44645ac

JeanKossaifi reviewed Oct 20, 2023


dhpitt added 11 commits October 23, 2023 14:13

Merge branch 'main' into load_from_ckpt

efd9d27

resume training in fno darcy

8f6b82c

resume training in fno darcy

92123b9

rename to checkpointcallback

62e2d3d

respond to API comments

7888837

fix for test

4de2de7

update test for new behavior

6defad5

fix imports and new api in test

8faa72e

update example for new feedback

59f5b9d

change directory structure in checkpoints, fix monitoring logic, tests pass

4e3a32b

update scripts for new wandb callback logic

caa20c5

JeanKossaifi merged commit 9b388be into neuraloperator:main Oct 24, 2023
1 check passed

dhpitt deleted the load_from_ckpt branch March 25, 2024 17:48
