t5x checkpoint importer crashes #27

Closed
thomasw21 opened this issue Dec 7, 2021 · 1 comment

thomasw21 commented Dec 7, 2021

We've been using t5x to run some experiments. Until today it loaded previous checkpoints successfully, but with 0.1.14 it started crashing:

Traceback (most recent call last):
  File "/home/thomas/code/t5x/t5x/train.py", line 616, in <module>
    gin_utils.run(main)
  File "/home/thomas/code/t5x/t5x/gin_utils.py", line 103, in run
    app.run(
  File "/home/thomas/.local/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/thomas/.local/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/thomas/code/t5x/t5x/train.py", line 596, in main
    _main(argv)
  File "/home/thomas/code/t5x/t5x/train.py", line 614, in _main
    train_using_gin()
  File "/home/thomas/.local/lib/python3.8/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/thomas/.local/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/thomas/.local/lib/python3.8/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/thomas/code/t5x/t5x/train.py", line 324, in train
    train_state = train_state_initializer.from_checkpoint_or_scratch(
  File "/home/thomas/code/t5x/t5x/utils.py", line 528, in from_checkpoint_or_scratch
    return (self.from_checkpoint(ckpt_cfgs, ds_iter=ds_iter, init_rng=init_rng)
  File "/home/thomas/code/t5x/t5x/utils.py", line 513, in from_checkpoint
    train_states = list(
  File "/home/thomas/code/t5x/t5x/utils.py", line 499, in from_checkpoints
    yield _restore_path(ckpt_path, restore_cfg)
  File "/home/thomas/code/t5x/t5x/utils.py", line 461, in _restore_path
    return restore_checkpointer.restore(
  File "/home/thomas/code/t5x/t5x/checkpoints.py", line 811, in restore
    state_dict = self._read_state_from_tensorstore(
  File "/home/thomas/code/t5x/t5x/checkpoints.py", line 860, in _read_state_from_tensorstore
    state_dict = _run_future_tree(future_state_dict)
  File "/home/thomas/code/t5x/t5x/checkpoints.py", line 160, in _run_future_tree
    leaves = loop.run_until_complete(asyncio.gather(*future_leaves))
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/thomas/code/t5x/t5x/checkpoint_importer.py", line 115, in _get_and_cast
    arr = await self._get_fn()  # pytype: disable=bad-return-type
  File "/home/thomas/code/t5x/t5x/checkpoints.py", line 1187, in _read_ts
    t = await ts.open(tmp_ts_spec_dict, open=True)
ValueError: Error opening "zarr" driver: Metadata at "gs://t5x-dummy-bucket/gs://{EXPERIMENT_NAME}/checkpoint_420000/state.param_states.decoder.layers_0.pre_mlp_layer_norm.scale.v/.zarray" does not exist

Reverting to 0.1.13 fixed it. I'm guessing there was a breaking change?
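
One detail that stands out in the error above is that `gs://` appears twice in the metadata path (`gs://t5x-dummy-bucket/gs://{EXPERIMENT_NAME}/...`). Whether that doubled prefix is the actual cause is only an assumption here, but it is easy to check for. Below is a minimal sketch of such a check; `has_repeated_scheme` is a hypothetical helper written for illustration and is not part of t5x or any other library.

```python
import re

# Hypothetical helper for illustration only; not part of t5x.
# Matches a URL scheme prefix such as "gs://" or "file://".
_SCHEME = re.compile(r"[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def has_repeated_scheme(path: str) -> bool:
    """Return True if a scheme prefix such as 'gs://' occurs more than once."""
    return len(_SCHEME.findall(path)) > 1


# Path as it appears in the error message above (truncated after the checkpoint dir).
reported = "gs://t5x-dummy-bucket/gs://{EXPERIMENT_NAME}/checkpoint_420000"
# Assumed shape of a path with a single scheme prefix, for comparison.
single = "gs://t5x-dummy-bucket/{EXPERIMENT_NAME}/checkpoint_420000"

print(has_repeated_scheme(reported))
print(has_repeated_scheme(single))
```

Running a check like this on the checkpoint path before restoring could help tell a path-construction problem apart from a checkpoint that is genuinely missing.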

jbms (Collaborator) commented Dec 8, 2021

This is now fixed by google-research/t5x@a3510b1.

jbms closed this as completed Dec 8, 2021