transformers seems to have recently been "bricked" #13798

Closed
quantitative-technologies opened this issue Sep 29, 2021 · 2 comments · Fixed by #13813

Comments

@quantitative-technologies
Contributor

Environment info

  • transformers version: 4.12.0.dev0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.9.0+cu102 (False)
  • Tensorflow version (GPU?): 2.6.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: Yes

Who can help

@sgugger

Information

The example script below was working fine until today; I believe it was still working in version 4.11.0.dev0. If you can tell me how to check out the source for 4.11.0.dev0 from GitHub, I will confirm that it works there.

To reproduce

Steps to reproduce the behavior:

On a TPU Colab instance with High-RAM, run:

CHECKPOINT=bert-large-uncased
DATASET=rte
EPOCHS=2
BATCH_SIZE=16
LEARNING_RATE=3e-5

python transformers/examples/pytorch/xla_spawn.py --num_cores 8 \
  transformers/examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path $CHECKPOINT \
  --task_name $DATASET \
  --seed 10000 \
  --output_dir results \
  --overwrite_output_dir \
  --num_train_epochs $EPOCHS \
  --evaluation_strategy no \
  --logging_strategy epoch \
  --save_strategy epoch \
  --per_device_train_batch_size $BATCH_SIZE \
  --per_device_eval_batch_size $BATCH_SIZE \
  --learning_rate $LEARNING_RATE \
  --do_train

Gives the error:

Exception in device=TPU:7: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:4: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:2: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:1: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:6: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:5: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:3: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:0: zero-dimensional tensor (at position 0) cannot be concatenated

  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data

uted/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
Traceback (most recent call last):
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated

  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
 50%|█████████████▌             | 20/40 [08:22<08:22, 25.15s/it]
Traceback (most recent call last):
  File "transformers/examples/pytorch/xla_spawn.py", line 85, in <module>
    main()
  File "transformers/examples/pytorch/xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 144, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 17
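
For context, the frame that actually fails is xm.mesh_reduce(name, tensors, torch.cat) applied to the per-step training loss: torch.cat refuses zero-dimensional tensors. Here is a minimal sketch, independent of transformers, that reproduces the same RuntimeError (the reshape at the end only illustrates why the shape matters; it is not the actual fix that went into #13813):

import torch

# Inside Trainer, tr_loss is tracked as a scalar (zero-dimensional) tensor.
tr_loss = torch.tensor(1.2345)

# mesh_reduce effectively applies torch.cat to the per-process scalars:
try:
    torch.cat([tr_loss, tr_loss])
except RuntimeError as e:
    print(e)  # zero-dimensional tensor (at position 0) cannot be concatenated

# Giving each scalar a length-1 dimension first makes the concatenation work:
print(torch.cat([tr_loss.reshape(1), tr_loss.reshape(1)]))  # tensor([1.2345, 1.2345])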

Expected behavior

No error.

@sgugger
Collaborator

sgugger commented Sep 29, 2021

I see where the problem comes from. I will push a fix tonight or tomorrow morning, and then we will do a patch release.
In the meantime, you should see no error if you stay on v4.10.
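For example, pip install "transformers<4.11" will keep you on the v4.10 line until the patch release is out (that exact pin is just one way to do it).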

@odellus
Contributor

odellus commented Sep 30, 2021

I run out of memory with transformers v4.x (x > 10) when training led-large-16384-arxiv with four gradient accumulation steps and a batch size of two, as in this notebook, on an A6000 with 48 GB of VRAM. I had to drop both gradient accumulation steps and batch size to 1 to fit the model plus a batch on the GPU. Wild. I don't really feel like opening an issue, but I thought I'd chime in and say that with v4.10.1 I can fit up to 8 samples per batch with four gradient accumulation steps on the same A6000.

If I upgrade to 4.11.1 in the Colab notebook I shared, it fails, but with 4.10.1 it works just fine.
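
For concreteness, the two memory-relevant settings being compared map onto the standard TrainingArguments fields roughly as follows (a sketch only; everything else from the notebook is omitted and output_dir is a placeholder):

from transformers import TrainingArguments

# Settings I had to fall back to on the 48 GB A6000 with v4.11.x
args_v4_11 = TrainingArguments(
    output_dir="results",           # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
)

# Settings that still fit with v4.10.1
args_v4_10 = TrainingArguments(
    output_dir="results",           # placeholder
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
)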
