
Move saving checkpoints from model to trainer #262

Merged
frostedoyster merged 7 commits into main from trainer-checkpoints on Jun 21, 2024

Conversation

frostedoyster (Collaborator) commented Jun 14, 2024

Closes #214, #89, #203. It involves a refactor where the trainer now saves checkpoints instead of the model.

This solves #203, which asks for the optimizer and scheduler states to be saved so that training can be restarted much more reliably. The slightly changed interface allows PET checkpoints to work both for restarting and for exporting (#214). Finally, we document what a checkpoint (e.g. model.ckpt) should contain (#89).
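
For illustration, here is a minimal sketch of the trainer-side half of this split; the class, attribute, and checkpoint-key names are assumptions for the example, not the actual metatrain API:

```python
# Hypothetical sketch of trainer-side checkpoint saving; names are
# illustrative, not the actual metatrain classes.
from pathlib import Path
from typing import Union

import torch


class SketchTrainer:
    def __init__(self, model, optimizer, lr_scheduler):
        self.model = model
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        self.epoch = 0

    def save_checkpoint(self, path: Union[str, Path]) -> None:
        # Everything needed to restart training goes into one file: model
        # weights plus optimizer and scheduler state (what #203 asks for).
        torch.save(
            {
                "model_state_dict": self.model.state_dict(),
                "optimizer_state_dict": self.optimizer.state_dict(),
                "scheduler_state_dict": self.lr_scheduler.state_dict(),
                "epoch": self.epoch,
            },
            path,
        )
```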

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

📚 Documentation preview 📚: https://metatrain--262.org.readthedocs.build/en/262/

frostedoyster marked this pull request as ready for review on June 14, 2024 at 18:52
PicoCentauri (Contributor) commented

Does this work with the design to support lightning?

frostedoyster (Collaborator, Author) commented

Yes, by inheriting from the lightning classes and making the necessary adaptations (as we were planning to do before).

PicoCentauri (Contributor) left a comment

Good. Some minor changes in the docs should reflect the changes in the code. Still, the question remains whether this will work with a lightning model, where the model itself loads the checkpoint.

But I can imagine it works, because our Trainer knows about the model and can just call lightning_model.load_from_checkpoint.
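
A rough sketch of what that could look like, assuming a pytorch-lightning based architecture (all names here are hypothetical):

```python
# Hypothetical sketch: a lightning-backed architecture behind the interface.
from pathlib import Path
from typing import Union

import pytorch_lightning as pl


class LightningArchitecture(pl.LightningModule):
    """A regular LightningModule; its actual definition is omitted here."""


class LightningBackedTrainer:
    def __init__(self):
        self.model = None

    @classmethod
    def load_checkpoint(cls, path: Union[str, Path]) -> "LightningBackedTrainer":
        trainer = cls()
        # Delegate to lightning: LightningModule.load_from_checkpoint restores
        # the weights and hyperparameters saved in a lightning checkpoint.
        trainer.model = LightningArchitecture.load_from_checkpoint(str(path))
        return trainer
```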

``save_checkpoint()``, ``load_checkpoint()`` as well as a ``restart()`` and
``export()`` method.
The ``ModelInterface`` is the main model class and must implement the
``load_checkpoint()``, ``restart()`` and ``export()`` methods.
Contributor

load_checkpoint is also a trainer method now.

Collaborator (Author)

Yes, and it's documented in the TrainerInterface just below

Comment on lines 73 to 75
@classmethod
def load_checkpoint(cls, path: Union[str, Path]) -> "ModelInterface":
    pass
Contributor

Suggested change
@classmethod
def load_checkpoint(cls, path: Union[str, Path]) -> "ModelInterface":
    pass

Collaborator (Author)

This is still present for all architectures (we need a model.load_checkpoint for export, where there is no Trainer).

Contributor

But then, for consistency, I think we should also provide a save_checkpoint for the model.

Which basically means this PR adds a save_checkpoint() and a load_checkpoint() for the trainer.

Collaborator (Author)

I don't think so. We should only have what's necessary. If people want to have it (and call it inside Trainer.save_checkpoint), then it's up to them.

Contributor

Yeah, that makes sense.

@@ -80,6 +80,9 @@ def export_model(model: Any, output: Union[Path, str] = "exported-model.pt") ->
         torch.jit.save(model, path)
     else:
         extensions_path = "extensions/"
-        logger.info(f"Exporting model to {path} and extensions to {extensions_path}")
+        logger.info(
+            f"Exporting model to `{path}` and extensions to `{extensions_path}`"
Contributor

I think I usually prefer quotes for paths and backticks for variables. But feel free.

Suggested change
-            f"Exporting model to `{path}` and extensions to `{extensions_path}`"
+            f"Exporting model to {path!r} and extensions to {extensions_path!r}"

Collaborator (Author)
Are you sure that !r isn't going to print some other stuff if path and/or extensions_path are Path objects?

Contributor
Ah, you are probably right. But then I would go for single quotes instead of backticks.
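
For reference, a quick standalone check of that concern: !r formats the repr of a pathlib.Path, which includes the class name rather than just a quoted string.

```python
from pathlib import Path

path = Path("exported-model.pt")

# !r uses repr(), so a Path renders as e.g. PosixPath('exported-model.pt')
print(f"Exporting model to {path!r}")

# explicit single quotes around the str() form give 'exported-model.pt'
print(f"Exporting model to '{path}'")
```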

Luthaf (Contributor) commented Jun 17, 2024

So the idea is that we need to save checkpoints in the trainer to be able to also save the optimizer state, but we want to load checkpoints from the model for export/restart?

frostedoyster (Collaborator, Author) commented Jun 17, 2024

@Luthaf yes, that's the idea. The same checkpoint can be read by the trainer (for restarting training, it reads everything: model, optimizer, scheduler, etc.) or by the model (which just loads the model, for exporting; this can also be called from the trainer's loader).
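
As a small sketch of that dual use (the checkpoint keys here are illustrative; each architecture decides what it actually stores):

```python
# Hypothetical sketch: one checkpoint file, two readers.
import torch


def trainer_load_checkpoint(path):
    # Restart: the trainer restores everything needed to resume training.
    checkpoint = torch.load(path)
    return (
        checkpoint["model_state_dict"],
        checkpoint["optimizer_state_dict"],
        checkpoint["scheduler_state_dict"],
    )


def model_load_checkpoint(path):
    # Export: only the model weights are needed; the rest is ignored.
    checkpoint = torch.load(path)
    return checkpoint["model_state_dict"]
```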

PicoCentauri (Contributor) left a comment

I think it is almost ready; just my minor comments about the quotes/backticks.

frostedoyster merged commit 7943c9c into main on Jun 21, 2024
13 checks passed
frostedoyster deleted the trainer-checkpoints branch on June 21, 2024 at 12:41