jax.Array must be fully replicated to be saved in aggregate file #353

tatami-galaxy · 2023-06-11T13:47:18Z

I'm trying to save a checkpoint and getting this error message. Saving code:

ckpt = {'state': state, 'config': model.config} 
save_args = orbax_utils.save_args_from_target(ckpt)
checkpoint_manager.save(global_step + 1, ckpt, save_kwargs={'save_args': save_args})

This line in orbax/checkpoint/pytree_checkpoint_handler.py is raising the error:

if isinstance(value, jax.Array) and not value.is_fully_replicated:
    raise ValueError(
        'jax.Array must be fully replicated to be saved in aggregate file.'
    )

state is an instance of flax.training.train_state.TrainState. What could be causing this? I tried disabling jax.Array with jax.config.update('jax_array', False), but that does not work with jax and jaxlib 0.4.7.
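
To narrow down which leaves trip the check quoted above, one option is to walk the checkpoint pytree and apply the same condition to each leaf. This is a minimal sketch, not part of Orbax: the helper name find_unreplicated_leaves and the example_ckpt dict are hypothetical stand-ins for the real state and config.

```python
import jax
import jax.numpy as jnp
import numpy as np


def find_unreplicated_leaves(tree):
    """Return key paths of jax.Array leaves that are not fully replicated."""
    offending = []
    leaves_with_paths, _ = jax.tree_util.tree_flatten_with_path(tree)
    for path, leaf in leaves_with_paths:
        # Same condition as the check in pytree_checkpoint_handler.py.
        if isinstance(leaf, jax.Array) and not leaf.is_fully_replicated:
            offending.append(jax.tree_util.keystr(path))
    return offending


# Hypothetical stand-in for {'state': state, 'config': model.config}.
example_ckpt = {
    'state': {'params': {'w': jnp.ones((4, 4))}, 'step': np.int32(0)},
    'config': {'hidden_size': 4},
}

unreplicated = find_unreplicated_leaves(example_ckpt)
print(unreplicated)
```

Arrays created with plain jnp calls live on a single device and count as fully replicated, so the list is empty here; with the real replicated train state, the printed paths would show which leaves Orbax rejects.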

cpgaffney1 · 2023-06-12T15:33:40Z

To use the aggregate option, every leaf should be a numpy array, a basic scalar type, or a jax.Array that is replicated across all devices. Resharding to a replicated layout is pretty easy to do - just supply a sharding of None instead. The JAX documentation has lots of details.
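
One way to express "replicated across all devices" with the jax.sharding API is an empty PartitionSpec on a mesh of all devices. The following is a sketch under assumptions: the mesh axis name 'data' and the params dict are illustrative, and the real train state would take the place of params.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# One-dimensional mesh over all available devices; the axis name is arbitrary.
mesh = Mesh(np.array(jax.devices()), ('data',))

# An empty PartitionSpec partitions no dimension, so each device holds a full copy.
replicated_sharding = NamedSharding(mesh, PartitionSpec())

# Hypothetical stand-in for the train state.
params = {'w': jnp.ones((8, 4)), 'b': jnp.zeros((4,))}

# A single sharding is applied to every leaf of the pytree.
replicated_params = jax.device_put(params, replicated_sharding)

all_replicated = all(
    leaf.is_fully_replicated
    for leaf in jax.tree_util.tree_leaves(replicated_params)
)
print(all_replicated)
```

Unlike the pmap-style layout, arrays placed this way keep their original shape, with no leading per-device axis.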

tatami-galaxy · 2023-06-13T12:07:53Z

Hi @cpgaffney1, thanks for the response. I am replicating the state prior to training with state = jax_utils.replicate(state). I also shard the batches during training with flax.training.common_utils.shard. The error goes away if I call flax.jax_utils.unreplicate() on the state before saving, like so: ckpt = {'state': unreplicate(state), 'config': model.config}. Is this supposed to happen?
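
For context on why unreplicate changes the outcome: to my understanding, flax.jax_utils.unreplicate takes the first slice along the leading per-device axis of every leaf, which yields an ordinary array with the original parameter shape. The sketch below only imitates the shapes involved, using jnp.stack in place of real per-device placement; it does not call flax, and the params dict is a hypothetical stand-in.

```python
import jax
import jax.numpy as jnp

n_devices = jax.local_device_count()

# Hypothetical stand-in for the train state.
params = {'w': jnp.ones((3, 2))}

# Imitate the shape produced by replicate(): a new leading axis of size n_devices.
# (Shape only; this does not place one copy on each device.)
stacked = jax.tree_util.tree_map(lambda x: jnp.stack([x] * n_devices), params)

# Assumed behavior of unreplicate(): keep the first slice of every leaf.
unstacked = jax.tree_util.tree_map(lambda x: x[0], stacked)

print(stacked['w'].shape, unstacked['w'].shape)
```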

cpgaffney1 · 2023-06-14T22:17:13Z

jax_utils uses jax.device_put_replicated. When I run the following:

import jax
import numpy as np

# Place one copy of the array on every device, as jax_utils.replicate does.
replicated = jax.device_put_replicated(np.arange(32), jax.devices())
replicated.is_fully_replicated

is_fully_replicated is actually False, which is surprising. Checking to see if this is expected behavior.

cpgaffney1 · 2023-06-14T22:41:38Z

I'm told that this is a known bug. In the meantime, you should use a different method to replicate the state - perhaps just use pjit.
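
A sketch of the pjit-style alternative, assuming a JAX version in which jax.jit accepts out_shardings (pjit's functionality has been merged into jax.jit). The function init_params, the mesh axis name 'data', and the parameter shapes are hypothetical; the point is only that the outputs come back with a replicated sharding.

```python
import functools

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

mesh = Mesh(np.array(jax.devices()), ('data',))
replicated = NamedSharding(mesh, PartitionSpec())


# The single sharding is used for every leaf of the returned pytree.
@functools.partial(jax.jit, out_shardings=replicated)
def init_params(key):
    return {'w': jax.random.normal(key, (8, 4)), 'b': jnp.zeros((4,))}


params = init_params(jax.random.PRNGKey(0))

jit_replicated = all(
    leaf.is_fully_replicated for leaf in jax.tree_util.tree_leaves(params)
)
print(jit_replicated)
```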

tatami-galaxy · 2023-06-15T05:15:43Z

Thanks. What would be the difference between saving the replicated state after pjit vs calling unreplicate() and saving as I'm doing now?

cpgaffney1 · 2023-06-15T19:57:44Z

If the arrays are replicated, they can be safely saved into the msgpack file. If you call unreplicate, I believe flax's behavior is to instruct Orbax to save using Tensorstore, which supports sharded arrays.

cpgaffney1 closed this as completed Jun 26, 2023
