Make Usage of Params Consistent v2 #568
Conversation
Force-pushed from f03988e to d2e4572
One important random change slipped into a test by accident.
Because this is now a high-risk change, I'd like the XLML tests run in advance. Can you run them all via XLML?
(And separately, can you post in the MaxText room what you're planning to do here and that it is breaking compat?)
MaxText/checkpointing.py (Outdated)
@@ -178,8 +178,11 @@ def map_to_pspec(data):
   p = epath.Path(load_parameters_from_path)
   ckptr = orbax.checkpoint.PyTreeCheckpointer()
   restore_args = orbax.checkpoint.checkpoint_utils.construct_restore_args(abstract_unboxed_pre_state.params)
+  # Orbax quirk -> we save the entire TrainState, which has a field `params` which holds the PyTree
Hmmm this seems weird, should we fix that while we're fixing things?
Ack, I'll double check with Susie, who worked on this last I think. If I had to answer now: I think just saving the state directly is useful because we also get the optimizer state captured in the checkpoint. And we can cleanly capture any other variable collections (like overwrite_with_gradients) without any sort of special logic.
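For context, a minimal sketch of what saving the whole TrainState looks like (the save call is an assumption based on the Orbax API used in the diff above, not the exact MaxText code; `checkpoint_dir` and `state` are hypothetical):

```python
import orbax.checkpoint

# Save the entire TrainState (params, optimizer state, step, plus any
# extra variable collections) as one PyTree. Nothing collection-specific
# is needed: whatever is in the state gets checkpointed.
ckptr = orbax.checkpoint.PyTreeCheckpointer()
ckptr.save(checkpoint_dir, state)  # `checkpoint_dir`/`state` assumed to exist
```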
Update: We can actually just tweak this a little bit to load things using the abstract_state directly, and it works. Cleaner code, but I wonder if this is less efficient/wasteful because we are restoring things we don't need (i.e. all the stuff that isn't params).
And another update: this actually seems to be an optimization. We don't want to restore anything other than the params field here, as that would be wasteful both computationally and for memory usage (we immediately toss the rest of it). So we have this little odd bit here to basically say: only restore the params field. I added a comment clarifying what's really happening.
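Roughly, the trick is to present Orbax with an item that contains only the `params` subtree of the saved TrainState. A sketch under that assumption (kwargs follow the PyTreeCheckpointer API of this era; the exact call in MaxText may differ):

```python
import orbax.checkpoint
from etils import epath

p = epath.Path(load_parameters_from_path)
ckptr = orbax.checkpoint.PyTreeCheckpointer()

# Restore args built from the abstract (shape/sharding-only) params tree.
restore_args = orbax.checkpoint.checkpoint_utils.construct_restore_args(
    abstract_unboxed_pre_state.params
)

# The checkpoint holds the whole TrainState, so we ask for just its
# `params` field; optimizer state etc. is never materialized.
restored = ckptr.restore(
    p,
    item={"params": abstract_unboxed_pre_state.params},
    restore_args={"params": restore_args},
    transforms={},
)
params = restored["params"]
```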
end_to_end/llama_finetuning_test.sh (Outdated)
@@ -13,6 +13,7 @@ DATASET_PATH=gs://maxtext-dataset

 export LOSS_THRESHOLD=2.5

+ echo python3 MaxText/train.py MaxText/configs/base.yml run_name=runner_direct_${idx} base_output_directory=${BASE_OUTPUT_DIRECTORY} load_parameters_path=${base_ckpt_path} model_name='llama2-7b' dataset_path=${DATASET_PATH} async_checkpointing=false model_name='llama2-7b' ici_tensor_parallelism=4 steps=10 per_device_batch_size=.25 metrics_file='metrics.txt'
^^^
Whoops, good catch. Fixed!
Force-pushed from ba8863f to 504c1cd
Force-pushed from 2b5c6f7 to 6498e14
Made some fixes and added some clarifying comments! I've run the XLML tests and they passed, but I'm running them again with the latest gemma test changes. I won't merge until that's finished. I'll also post in the MaxText room before I merge.
#498 showed that we have inconsistent use of params vs {'params': params} within MaxText, which causes problems when we have changes that require multiple collections of variables (like https://github.com/google/flax/blob/main/flax/linen/fp8_ops.py).
These changes streamline this as much as possible to use params directly.
Changes from Last Time:
The last version of this PR caused issues because it didn't update the checkpoint conversion scripts. Those have now been updated to reflect the updated checkpoint structure (which is basically just a 'params' key wrapping the outside of the dict).
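For illustration, a minimal sketch of the structural change the conversion scripts now emit (tree contents and the helper name are hypothetical; only the outer 'params' wrapping is the point):

```python
# Hypothetical example of the checkpoint-structure change: converted
# checkpoints now carry an outer 'params' key around the parameter tree.
old_tree = {"decoder": {"kernel": ...}}              # params tree at top level
new_tree = {"params": {"decoder": {"kernel": ...}}}  # wrapped in 'params'

def wrap_params(tree):
    # What the updated conversion scripts effectively do at the top level
    # (sketch only; the real scripts do much more than this).
    return {"params": tree}
```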