We've been using t5x to run some experiments. Until today it loaded previous checkpoints successfully, but with 0.1.14 it started crashing:
Traceback (most recent call last):
  File "/home/thomas/code/t5x/t5x/train.py", line 616, in <module>
    gin_utils.run(main)
  File "/home/thomas/code/t5x/t5x/gin_utils.py", line 103, in run
    app.run(
  File "/home/thomas/.local/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/thomas/.local/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/thomas/code/t5x/t5x/train.py", line 596, in main
    _main(argv)
  File "/home/thomas/code/t5x/t5x/train.py", line 614, in _main
    train_using_gin()
  File "/home/thomas/.local/lib/python3.8/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/thomas/.local/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/thomas/.local/lib/python3.8/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/thomas/code/t5x/t5x/train.py", line 324, in train
    train_state = train_state_initializer.from_checkpoint_or_scratch(
  File "/home/thomas/code/t5x/t5x/utils.py", line 528, in from_checkpoint_or_scratch
    return (self.from_checkpoint(ckpt_cfgs, ds_iter=ds_iter, init_rng=init_rng)
  File "/home/thomas/code/t5x/t5x/utils.py", line 513, in from_checkpoint
    train_states = list(
  File "/home/thomas/code/t5x/t5x/utils.py", line 499, in from_checkpoints
    yield _restore_path(ckpt_path, restore_cfg)
  File "/home/thomas/code/t5x/t5x/utils.py", line 461, in _restore_path
    return restore_checkpointer.restore(
  File "/home/thomas/code/t5x/t5x/checkpoints.py", line 811, in restore
    state_dict = self._read_state_from_tensorstore(
  File "/home/thomas/code/t5x/t5x/checkpoints.py", line 860, in _read_state_from_tensorstore
    state_dict = _run_future_tree(future_state_dict)
  File "/home/thomas/code/t5x/t5x/checkpoints.py", line 160, in _run_future_tree
    leaves = loop.run_until_complete(asyncio.gather(*future_leaves))
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/thomas/code/t5x/t5x/checkpoint_importer.py", line 115, in _get_and_cast
    arr = await self._get_fn()  # pytype: disable=bad-return-type
  File "/home/thomas/code/t5x/t5x/checkpoints.py", line 1187, in _read_ts
    t = await ts.open(tmp_ts_spec_dict, open=True)
ValueError: Error opening "zarr" driver: Metadata at "gs://t5x-dummy-bucket/gs://{EXPERIMENT_NAME}/checkpoint_420000/state.param_states.decoder.layers_0.pre_mlp_layer_norm.scale.v/.zarray" does not exist
Reverting to 0.1.13 fixed it. I'm guessing there was a breaking change?
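One thing that stands out in the error message above is that the path contains `gs://` twice: the bucket `t5x-dummy-bucket` is followed by what looks like a second full `gs://...` URI. I don't know where that comes from, but it might mean a full URI is being joined onto a bucket somewhere. Below is a minimal sketch in plain Python for spotting this kind of doubled scheme in a path. The helper names `split_gcs_uri` and `has_nested_scheme` are hypothetical and are not part of t5x or any other library.

```python
GCS_SCHEME = "gs://"


def split_gcs_uri(uri: str):
    """Split 'gs://bucket/some/path' into ('bucket', 'some/path')."""
    if not uri.startswith(GCS_SCHEME):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    # Everything up to the first '/' after the scheme is the bucket name.
    bucket, _, path = uri[len(GCS_SCHEME):].partition("/")
    return bucket, path


def has_nested_scheme(uri: str) -> bool:
    """Return True if the object path itself contains another 'gs://'."""
    _, path = split_gcs_uri(uri)
    return GCS_SCHEME in path


# Path copied from the ValueError above.
error_path = (
    "gs://t5x-dummy-bucket/gs://{EXPERIMENT_NAME}/checkpoint_420000/"
    "state.param_states.decoder.layers_0.pre_mlp_layer_norm.scale.v/.zarray"
)

bucket, path = split_gcs_uri(error_path)
print("bucket:", bucket)
print("path starts with:", path[:25])
print("nested scheme:", has_nested_scheme(error_path))
```

Running a check like this over the checkpoint paths could help tell whether the doubled prefix is already in the stored paths or is only introduced when the checkpoint is opened.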