
Use state_dict for Torchmetrics Serialization #2116

Merged: 54 commits into mosaicml:dev from nik/torchmetrics on Apr 27, 2023

Conversation

@nik-mosaic (Contributor) commented on Apr 4, 2023

What does this PR do?

Our metrics objects were previously serialized using pickle. Pickling includes many fields that are unnecessary and may change between versions, causing mismatches when we stop and restart training with a slightly different configuration or upgrade Torchmetrics.

This PR changes the saving and loading of Composer state objects so that state.train_metrics and state.eval_metrics use Torchmetrics' built-in state_dict() method instead of pickle. The state_dict() method essentially returns a dictionary of <metric_name, metric_value> pairs, without any of the other data.

When loading a state dict from its serialized version, we recreate the metrics as follows (a sketch follows the list):
(1) Get the state.model
(2) Call model.get_metrics(), which creates a default version of the metrics
(3) Loop over the dictionary of <metric_name, metric_value> pairs from the serialization, matching metric names to the metrics from model.get_metrics() and populating values with the saved metric_value. If a saved metric isn't present in the model, we skip it.
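
A minimal sketch of that load path, assuming Composer's model.get_metrics() API; the helper name and the serialized_metrics argument are illustrative, not the actual code in state.py:

```python
def load_metrics_state(state, serialized_metrics):
    """Hypothetical sketch of steps (1)-(3) above."""
    # (1) + (2): get the model and recreate default metric objects.
    model_metrics = state.model.get_metrics(is_train=True)

    # (3): match saved metric names to the fresh metrics and restore values.
    for metric_name, metric_state in serialized_metrics.items():
        if metric_name not in model_metrics:
            continue  # saved metric is not present on the model; skip it
        model_metrics[metric_name].load_state_dict(metric_state)
    return model_metrics
```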

Other changes in this PR:

  • Gate _ensure_metrics_device_and_dtype, which is only necessary for DeepSpeed models, behind an is_model_deepspeed() check (see the sketch below). This should fix CO-1910.
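
A sketch of this gate, assuming Composer's is_model_deepspeed utility; the surrounding call site is illustrative:

```python
from composer.utils import is_model_deepspeed

# The device/dtype fixup is only needed for DeepSpeed-wrapped models,
# so skip it everywhere else.
if is_model_deepspeed(state.model):
    _ensure_metrics_device_and_dtype(metrics)
```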

Testing:

  • We have modified the test_checkpoint.py and test_sharded_checkpoint.py files to include a metric equality check wherever we have a weight equality or optimizer equality check (a sketch of such a check follows this list).
  • Added test_load_remote_checkpoint, a backwards-compatibility checkpoint test. I have uploaded a checkpoint saved with Composer 0.13.5 and default dependencies (torchmetrics 0.11.3, etc.). The test downloads the checkpoint and verifies equivalence with a currently trained version. As Composer development continues, this test will become more useful, because the remote checkpoint stays frozen while the local one picks up all future trainer/model/metrics changes.
  • Locally, all tests pass on torchmetrics 0.11.4, but we do not bump the version in this PR.
  • @coryMosaicML has verified that CO-1910 has been fixed with this PR.
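
A sketch of the kind of metric equality assertion added to those tests; the helper is hypothetical, comparing metrics through the same state_dict() representation the checkpoint now stores:

```python
import torch

def assert_metrics_equal(metrics_a, metrics_b):
    """Hypothetical helper: compare two dicts of torchmetrics by state."""
    assert metrics_a.keys() == metrics_b.keys()
    for name in metrics_a:
        state_a = metrics_a[name].state_dict()
        state_b = metrics_b[name].state_dict()
        assert state_a.keys() == state_b.keys()
        for key in state_a:
            torch.testing.assert_close(state_a[key], state_b[key])
```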

What issue(s) does this change relate to?

CO-1918, CO-1907, CO-1853.

@nik-mosaic marked this pull request as ready for review on April 25, 2023 15:03
@mvpatel2000 requested a review from bcui19 on April 25, 2023 20:39
@mvpatel2000 (Contributor) left a comment:

This adds a backwards compatibility test -- very nice! But is this change itself backwards compatible? The checkpoint appears to be off dev, not 0.13.5.

Also, please show a screenshot of the test passing manually, as it only runs in the daily suite (since it's remote).

(Review thread on tests/trainer/test_checkpoint.py)
@mvpatel2000 requested a review from eracah on April 25, 2023 20:43
@dakinggg (Contributor) left a comment:

Thanks for digging through this @nik-mosaic! Left a few comments and questions, mostly around the usage of private attributes.

(Review threads on composer/core/state.py, composer/trainer/trainer.py, and tests/trainer/test_checkpoint.py)
@nik-mosaic (Contributor, Author) commented on Apr 27, 2023

Here it is passing pytest tests/trainer/test_checkpoint.py -m remote. The remote checkpoint file has changed; it is now one created off the Composer dev branch. To get this test to pass, we added a case to the serialization/deserialization section of state.py to support deserializing a pickled Torchmetrics object: old Composer checkpoints contain Torchmetrics objects, while new ones contain dictionaries of metric tensors. A sketch of this fallback follows the screenshot.
[Screenshot: remote checkpoint test passing]
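
A minimal sketch of that fallback; the function and argument names are hypothetical (the real logic lives in composer/core/state.py):

```python
from torchmetrics import Metric

def deserialize_metric(name, serialized, model_metrics):
    """Hypothetical sketch: accept both old- and new-style checkpoints."""
    if isinstance(serialized, Metric):
        # Old-style checkpoint: the whole Torchmetrics object was pickled.
        return serialized
    # New-style checkpoint: a plain dictionary of metric state tensors.
    metric = model_metrics[name]
    metric.load_state_dict(serialized)
    return metric
```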

@mvpatel2000 (Contributor) left a comment:

Why do we not need the backwards compatibility hacks with this? Won't they still be in the old format?

(Review threads on composer/core/state.py)
@nik-mosaic (Contributor, Author) commented on Apr 27, 2023

Why do we not need the backwards compatibility hacks with this? Won't they still be in the old format?

I have added a comment addressing this directly in the code. The explanation is as follows:

Given the rest of a Composer checkpoint, the state_dict() and _computed attributes of a Torchmetrics object are enough information to recreate it on deserialization. We serialize only this minimal metric information to maximize backwards compatibility: old checkpoints will continue to load even if other Torchmetrics attributes change. A sketch of the saved format follows.
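
A sketch of what gets saved per metric under this scheme; the function name and dictionary layout are illustrative, not Composer's exact checkpoint schema:

```python
def serialize_metric(metric):
    """Hypothetical sketch: persist only the minimal per-metric state."""
    return {
        'state_dict': metric.state_dict(),  # the metric's raw state tensors
        '_computed': metric._computed,      # torchmetrics' cached compute() result
    }
```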

@mvpatel2000 (Contributor) left a comment:

LGTM, thank you for taking this on!

@nik-mosaic merged commit 389255b into mosaicml:dev on Apr 27, 2023
@nik-mosaic deleted the nik/torchmetrics branch on April 27, 2023 04:07
dakinggg pushed a commit that referenced this pull request on Apr 27, 2023:
Serialize and load torchmetrics through state_dict() and load_state_dict() instead of pickle
@mvpatel2000 mentioned this pull request on May 15, 2023