
Skip saving of direct serialization fields #445

Merged: 4 commits into dev on Feb 10, 2022

Conversation

@ravi-mosaicml (Contributor) commented on Feb 9, 2022

When the Trainer is constructed, the user passes in `max_duration`, `grad_accum`, and `precision` via `Trainer.__init__`. These values should take precedence over whatever was in a checkpoint, so they should not be saved in the checkpoint in the first place.

Since these were the only direct serialization attributes, this PR also cleans up the state logic so that only the attributes to be serialized need to be listed; there is no longer a need to list which fields are *not* serialized. `fields` is also renamed to `attrs`, since fields are a dataclass concept and State is no longer a dataclass.

The state serialization is further cleaned up by refactoring the deepspeed logic out of State and into the Trainer, next to where deepspeed is initialized. The state should not need to know whether the model is a deepspeed model.

Closes #441.
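
For readers skimming the diff, here is a minimal sketch of the allow-list approach described above. The attribute names and the `serialized_attrs` tuple are illustrative, not Composer's actual implementation:

```python
from typing import Any, Dict


class State:
    # Only attributes listed here end up in the checkpoint. Values passed to
    # Trainer.__init__ (max_duration, grad_accum, precision) are deliberately
    # omitted, so the constructor arguments always win on resume.
    serialized_attrs = ("model", "optimizers", "schedulers", "timer")

    def state_dict(self) -> Dict[str, Any]:
        state = {}
        for attr in self.serialized_attrs:
            value = getattr(self, attr)
            # Defer to the object's own state_dict() when it has one (e.g. nn.Module).
            state[attr] = value.state_dict() if hasattr(value, "state_dict") else value
        return state

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        for attr, value in state.items():
            target = getattr(self, attr, None)
            if hasattr(target, "load_state_dict"):
                target.load_state_dict(value)
            else:
                setattr(self, attr, value)
```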

@ajaysaini725 (Member) left a review comment

Couple of questions; otherwise LGTM.

Review threads on composer/core/state.py (resolved)
@hanlint (Contributor) commented on Feb 10, 2022

I agree on max_duration since it's a required input to the Trainer, but I'm less convinced about not saving grad_accum and precision. Those seem like ones we would want to persist in the checkpoint. If someone trains in AMP, saves the checkpoint, then loads it up again without specifying the precision, would we default to training in FP32?

An ancillary question -- would we actually want logic where the checkpoint is loaded and then any user-provided arguments to the init override it? Right now, from what I understand, they are silently ignored.

@ravi-mosaicml (Contributor, Author) commented

I think grad_accum is very hardware-specific -- e.g. you might start training on 3080s and then resume on 3090s -- so I don't think it should be persisted in the checkpoint.

Though precision is also hardware-specific (e.g. some hardware supports bf16; other hardware supports only fp16 or amp), I could go either way on it theoretically. However, since precision is passed in on init (perhaps implicitly, through a default), I think whatever is passed to the trainer should be used. This does not affect YAHP usage, since precision is serialized in the hparams (https://github.com/mosaicml/composer/blob/8b625ad167f9838580e907c7b411b63f539080c3/examples/run_composer_trainer.py).

This change also makes it consistent with deepspeed, since we don't save the deepspeed config (including precision) in the checkpoint.

It also ensures that all arguments to the trainer init (except for seed) are used: every other parameter is either used to construct the classes into which the state is loaded, or is not saved.

@ravi-mosaicml merged commit ea95116 into dev on Feb 10, 2022
@ravi-mosaicml deleted the ravi/skip_serialization_of_trainer_init_fields branch on February 10, 2022 at 01:38
@hanlint (Contributor) commented on Feb 10, 2022

Hmm, OK -- it might be good to get the original issue requester and users (cc @siriuslee and maybe @A-Jacobson) to opine on the expected user experience here.

@hanlint (Contributor) commented on Feb 10, 2022

The use case for saving precision especially is that users will want to know which precision the checkpointed model was trained with -- it has implications for what precision they should deploy the model in for inference.

@ravi-mosaicml (Contributor, Author) commented

Sounds good. If precision and/or grad_accum should be persisted, then we need to update the trainer's init signature to `grad_accum: Optional[int] = None`, where `None` means use the value in the checkpoint, or, if there is no checkpoint, the current default. Can implement in a separate PR if desired.
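
A hypothetical sketch of the resolution order that proposal implies (the function and default names below are illustrative, not the actual Trainer API):

```python
from typing import Optional

# Illustrative library default; the real default would live in the Trainer.
DEFAULT_GRAD_ACCUM = 1


def resolve_grad_accum(init_arg: Optional[int], checkpoint_value: Optional[int]) -> int:
    """An explicit init argument wins, then the checkpoint value, then the default."""
    if init_arg is not None:
        return init_arg
    if checkpoint_value is not None:
        return checkpoint_value
    return DEFAULT_GRAD_ACCUM
```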

@ravi-mosaicml (Contributor, Author) commented

Created #451 to track this.

A-Jacobson pushed a commit that referenced this pull request Feb 10, 2022
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
Linked issue: Loading a checkpoint overwrites max_duration (#441)