The example script below was working fine until today. I believe that it was working in version 4.11.0.dev0. If you can tell me how to check out the source for 4.11.0.dev0 from GitHub, I will confirm that it works.
Exception in device=TPU:7: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:4: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:2: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:1: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:6: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:5: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:3: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:0: zero-dimensional tensor (at position 0) cannot be concatenated
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/content/transformers/src/transformers/trainer.py", line 1383, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/content/transformers/src/transformers/trainer.py", line 1383, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
tensors = nested_xla_mesh_reduce(tensors, name)
File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
tensors = nested_xla_mesh_reduce(tensors, name)
File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
return xm.mesh_reduce(name, tensors, torch.cat)
File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
return xm.mesh_reduce(name, tensors, torch.cat)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
return reduce_fn(xldata) if xldata else cpu_data
File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
return reduce_fn(xldata) if xldata else cpu_data
uted/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
main()
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/content/transformers/src/transformers/trainer.py", line 1383, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
tensors = nested_xla_mesh_reduce(tensors, name)
File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
return xm.mesh_reduce(name, tensors, torch.cat)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
main()
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/content/transformers/src/transformers/trainer.py", line 1383, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
tensors = nested_xla_mesh_reduce(tensors, name)
File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
return xm.mesh_reduce(name, tensors, torch.cat)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
main()
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/content/transformers/src/transformers/trainer.py", line 1383, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
Traceback (most recent call last):
File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
tensors = nested_xla_mesh_reduce(tensors, name)
File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
return xm.mesh_reduce(name, tensors, torch.cat)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
main()
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/content/transformers/src/transformers/trainer.py", line 1383, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
tensors = nested_xla_mesh_reduce(tensors, name)
File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
return xm.mesh_reduce(name, tensors, torch.cat)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/content/transformers/src/transformers/trainer.py", line 1383, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/content/transformers/src/transformers/trainer.py", line 1383, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
tensors = nested_xla_mesh_reduce(tensors, name)
File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
tensors = nested_xla_mesh_reduce(tensors, name)
File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
return xm.mesh_reduce(name, tensors, torch.cat)
File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
return xm.mesh_reduce(name, tensors, torch.cat)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
return reduce_fn(xldata) if xldata else cpu_data
File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
50%|█████████████▌ | 20/40 [08:22<08:22, 25.15s/it]
Traceback (most recent call last):
File "transformers/examples/pytorch/xla_spawn.py", line 85, in <module>
main()
File "transformers/examples/pytorch/xla_spawn.py", line 81, in main
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
start_method=start_method)
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 144, in join
exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 17
Expected behavior
No error.
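For anyone reading the traceback: the last frames show `torch.cat` being used as the reduce function over the per-core `tr_loss` values. The sketch below reproduces that failure on CPU with plain `torch` (no TPU or `torch_xla` needed), assuming, as the error message suggests, that each gathered loss is a zero-dimensional tensor. The helper `gather_scalars` is a hypothetical name used only for illustration, and promoting scalars to shape `(1,)` is one plausible workaround, not necessarily the fix that was pushed.

```python
import torch

# Stand-ins for the per-core training loss: each one is a 0-d (scalar) tensor.
# The values are made up for illustration.
per_core_losses = [torch.tensor(0.5), torch.tensor(0.7)]


def gather_scalars(tensors):
    """Concatenate tensors, promoting any 0-d tensor to shape (1,) first.

    Hypothetical helper, not part of transformers or torch_xla.
    """
    promoted = [t.reshape(1) if t.ndim == 0 else t for t in tensors]
    return torch.cat(promoted)


# Calling torch.cat directly on 0-d tensors raises the RuntimeError from the log.
try:
    torch.cat(per_core_losses)
    raised = False
except RuntimeError as err:
    raised = True
    print(f"torch.cat failed: {err}")

# After promotion the tensors concatenate, and .mean().item() works as in the trainer.
gathered = gather_scalars(per_core_losses)
print(gathered.shape, gathered.mean().item())
```

With the scalars reshaped, the result is a 1-d tensor with one entry per core, so the `.mean().item()` call seen in `_maybe_log_save_evaluate` has something to average.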
I see where the problem comes from. Will push a fix tonight or tomorrow morning, then we will do a patch release.
In the meantime, staying on v4.10 should avoid the error.
I run out of memory using transformers v4.X where X > 10 when training led-large-16384-arxiv with four gradient accumulation steps and a batch size of two, as in this notebook, on an A6000 with 48 GB of RAM. I had to bump gradient accumulation steps and batch size down to 1 each to fit the model + batch on the GPU. Wild. Don't really feel like opening an issue, but I thought I'd chime in here and say that with v4.10.1 I can fit up to 8 samples per batch with four gradient accumulation steps on the A6000.
If you upgrade to 4.11.1 in the colab notebook I shared, it fails, but with 4.10.1 it works just fine.
Environment info
transformers version: 4.12.0.dev0
Who can help
@sgugger
To reproduce
Steps to reproduce the behavior:
On a TPU colab instance with High-RAM, run:
Gives the error: