Error loading model from checkpoint on Apple M1 #446
Comments
Is this in a notebook?
Yes.
Awesome! That worked. I am now running into the following issue:
'''
File ~/t5x/t5x/checkpoints.py:1594, in load_t5x_checkpoint(path, step, state_transformation_fns, remap, restore_dtype, lazy_parameters)
File ~/t5x/t5x/checkpoints.py:167, in _run_future_tree(future_tree)
File ~/opt/miniconda3/lib/python3.9/site-packages/nest_asyncio.py:89, in _patch_loop.<locals>.run_until_complete(self, future)
File ~/opt/miniconda3/lib/python3.9/asyncio/tasks.py:258, in Task.__step(failed resolving arguments)
File ~/t5x/t5x/checkpoint_importer.py:114, in LazyAwaitableArray.get_async.<locals>._get_and_cast()
File ~/t5x/t5x/checkpoints.py:1422, in _read_ts(param_info, maybe_tspec, ckpt_path, restore_dtype, mesh, axes)
File ~/opt/miniconda3/lib/python3.9/asyncio/futures.py:284, in Future.__await__(self)
File ~/opt/miniconda3/lib/python3.9/asyncio/tasks.py:328, in Task.__wakeup(self, future)
File ~/opt/miniconda3/lib/python3.9/asyncio/futures.py:201, in Future.result(self)
ValueError: Error opening "zarr" driver: Error reading local file "./longt5_base_transient_checkpoint_1000000/target.decoder.layers_0.encoder_decoder_attention.key.kernel/.zarray": Invalid key: "./longt5_base_transient_checkpoint_1000000/target.decoder.layers_0.encoder_decoder_attention.key.kernel/.zarray"
'''
Are you sure that file exists?
Yep, the directory and the files exist. I am not very familiar with the TensorStore library, which I think is what's being used here. Is there a quick test you can suggest that would help isolate the issue?
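One quick check that does not involve TensorStore at all is to confirm that the `.zarray` metadata file can be found and parsed from the notebook's current working directory. The following is only a minimal sketch using the standard library: the helper name `check_zarray` is hypothetical, and it assumes the checkpoint uses the zarr v2 layout, in which `.zarray` is a JSON file.

```python
import json
import os


def check_zarray(param_dir):
    """Report where a parameter directory resolves to and whether its .zarray parses."""
    meta_path = os.path.join(param_dir, ".zarray")
    report = {
        "cwd": os.getcwd(),                        # what relative paths are resolved against
        "resolved": os.path.abspath(meta_path),    # the absolute location actually checked
        "exists": os.path.isfile(meta_path),
        "metadata": None,
    }
    if report["exists"]:
        # Assumption: zarr v2 stores array metadata as JSON in .zarray.
        with open(meta_path) as f:
            report["metadata"] = json.load(f)
    return report
```

Calling it with the directory named in the error message (for example `check_zarray("./longt5_base_transient_checkpoint_1000000/target.decoder.layers_0.encoder_decoder_attention.key.kernel")`) shows whether plain Python can see the file. If it can, the problem is more likely in how the path is handed to TensorStore than in the files themselves.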
Giving the full path instead of a relative path solved the issue!
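For anyone hitting the same "Invalid key" error, the workaround described here can be sketched as resolving the checkpoint directory to an absolute path before loading. The helper name `absolute_checkpoint_path` is hypothetical; the directory name is the one from the error message above, and the final call is left as a comment because it needs t5x installed.

```python
import os


def absolute_checkpoint_path(path):
    """Expand ~ and resolve a relative checkpoint path against the current directory."""
    return os.path.abspath(os.path.expanduser(path))


checkpoint_dir = absolute_checkpoint_path("./longt5_base_transient_checkpoint_1000000")

# Then, as in the original report:
# t5x_checkpoint = t5x.checkpoints.load_t5x_checkpoint(checkpoint_dir)
```

Note that the result depends on the process's current working directory, so in a notebook it is worth printing `checkpoint_dir` once to confirm it points where you expect.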
What version of TensorStore did you use? I ran into a similar problem when saving a checkpoint. My TensorStore version is 0.1.19. At first I used a relative path and got the same error as you mentioned above:
ValueError: Error opening "zarr" driver: Error reading local file "./pretrain_model/checkpoint_5000.tmp-1650694933/state.param_states.decoder.decoder_norm.scale.v/.zarray": Invalid key: "./pretrain_model/checkpoint_5000.tmp-1650694933/state.param_states.decoder.decoder_norm.scale.v/.zarray"
Then I changed the path to an absolute path and this issue was solved, but a new issue occurred. It happened when training ended after 100 steps and the checkpoint was being saved to the absolute path '/gpfsnyu/scratch/kf2395/jukemir_t5/pretrain/'. The error message:
I0426 13:05:36.808195 140074531202880 train.py:516] Epoch 0 of 10000
I am using version 0.1.18.
I am trying to load a LongT5 model from a checkpoint and am getting the following error. Any help is much appreciated.
`
RuntimeError Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 t5x_checkpoint = t5x.checkpoints.load_t5x_checkpoint(checkpoint_dir)
File ~/t5x/t5x/checkpoints.py:1594, in load_t5x_checkpoint(path, step, state_transformation_fns, remap, restore_dtype, lazy_parameters)
1592 if not lazy_parameters:
1593 future_state_dict = jax.tree_map(lambda x: x.get_async(), state_dict)
-> 1594 state_dict = _run_future_tree(future_state_dict)
1596 if restore_dtype is not None:
1597 state_dict['target'] = _cast(state_dict['target'], restore_dtype)
File ~/t5x/t5x/checkpoints.py:167, in _run_future_tree(future_tree)
165 # TODO(adarob): Use asyncio.run in py3.7+.
166 loop = asyncio.get_event_loop()
--> 167 leaves = loop.run_until_complete(asyncio.gather(*future_leaves))
168 return jax.tree_unflatten(treedef, leaves)
File ~/opt/miniconda3/lib/python3.9/asyncio/base_events.py:623, in BaseEventLoop.run_until_complete(self, future)
612 """Run until the Future is done.
613
614 If the argument is a coroutine, it is wrapped in a Task.
(...)
620 Return the Future's result, or raise its exception.
621 """
622 self._check_closed()
--> 623 self._check_running()
625 new_task = not futures.isfuture(future)
626 future = tasks.ensure_future(future, loop=self)
File ~/opt/miniconda3/lib/python3.9/asyncio/base_events.py:583, in BaseEventLoop._check_running(self)
581 def _check_running(self):
582 if self.is_running():
--> 583 raise RuntimeError('This event loop is already running')
584 if events._get_running_loop() is not None:
585 raise RuntimeError(
586 'Cannot run the event loop while another loop is running')
RuntimeError: This event loop is already running
`
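The last frames show `run_until_complete` being called on a loop that is already running, which is the situation inside a Jupyter notebook, where the kernel keeps its own event loop alive. The sketch below reproduces that failure with the standard library only; `load_leaf` and `notebook_cell` are hypothetical stand-ins, not t5x code. The `enable_nested_loops` helper shows one possible workaround with the third-party `nest_asyncio` package. That this is what resolved the error in this thread is an assumption, based only on `nest_asyncio.py` appearing in the later traceback in the comments.

```python
import asyncio


async def load_leaf(value):
    """Stand-in for one asynchronous parameter read (hypothetical)."""
    await asyncio.sleep(0)
    return value


async def notebook_cell():
    """Simulates code that runs inside an already-running loop, as in Jupyter."""
    loop = asyncio.get_running_loop()
    pending = load_leaf(1)
    try:
        # Same pattern as _run_future_tree: block on the loop from sync-style code.
        loop.run_until_complete(pending)
    except RuntimeError as err:
        pending.close()  # never started; close it to avoid a "never awaited" warning
        return str(err)
    return None


def enable_nested_loops():
    """Allow re-entrant run_until_complete if nest_asyncio is installed."""
    try:
        import nest_asyncio  # third-party; assumed workaround, not confirmed in the thread
    except ImportError:
        return False
    nest_asyncio.apply()
    return True


message = asyncio.run(notebook_cell())
print(message)
```

In a notebook, calling `enable_nested_loops()` once before `load_t5x_checkpoint` would be the place to try this; outside a notebook there is no running loop, so the original call should not raise this error.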