transformers seems to have recently been "bricked" #13798

Closed
quantitative-technologies opened this issue Sep 29, 2021 · 2 comments · Fixed by #13813

Comments

@quantitative-technologies
Contributor

Environment info

  • transformers version: 4.12.0.dev0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.9.0+cu102 (False)
  • Tensorflow version (GPU?): 2.6.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: Yes

Who can help

@sgugger

Information

The example script below was working fine until today; I believe it was still working in version 4.11.0.dev0. If you can tell me how to check out the source for 4.11.0.dev0 from GitHub, I will confirm that it works there.

To reproduce

Steps to reproduce the behavior:

On a TPU Colab instance with High-RAM, run:

CHECKPOINT=bert-large-uncased
DATASET=rte
EPOCHS=2
BATCH_SIZE=16
LEARNING_RATE=3e-5

python transformers/examples/pytorch/xla_spawn.py --num_cores 8 \
  transformers/examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path $CHECKPOINT \
  --task_name $DATASET \
  --seed 10000 \
  --output_dir results \
  --overwrite_output_dir \
  --num_train_epochs $EPOCHS \
  --evaluation_strategy no \
  --logging_strategy epoch \
  --save_strategy epoch \
  --per_device_train_batch_size $BATCH_SIZE \
  --per_device_eval_batch_size $BATCH_SIZE \
  --learning_rate $LEARNING_RATE \
  --do_train

Gives the error:

Exception in device=TPU:7: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:4: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:2: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:1: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:6: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:5: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:3: zero-dimensional tensor (at position 0) cannot be concatenated
Exception in device=TPU:0: zero-dimensional tensor (at position 0) cannot be concatenated

  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data

uted/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
Traceback (most recent call last):
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 564, in _mp_fn
    main()
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated

  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/examples/pytorch/text-classification/run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 1467, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer.py", line 2373, in _nested_gather
    tensors = nested_xla_mesh_reduce(tensors, name)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/content/transformers/src/transformers/trainer_pt_utils.py", line 155, in nested_xla_mesh_reduce
    return xm.mesh_reduce(name, tensors, torch.cat)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py", line 916, in mesh_reduce
    return reduce_fn(xldata) if xldata else cpu_data
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated
 50%|█████████████▌             | 20/40 [08:22<08:22, 25.15s/it]
Traceback (most recent call last):
  File "transformers/examples/pytorch/xla_spawn.py", line 85, in <module>
    main()
  File "transformers/examples/pytorch/xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 144, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 17
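
For context, the frame that actually fails is xm.mesh_reduce(name, tensors, torch.cat) applied to the per-step training loss: torch.cat refuses zero-dimensional tensors. Here is a minimal sketch, independent of transformers, that reproduces the same RuntimeError (the reshape at the end only illustrates why the shape matters; it is not the actual fix that went into #13813):

import torch

# Inside Trainer, tr_loss is tracked as a scalar (zero-dimensional) tensor.
tr_loss = torch.tensor(1.2345)

# mesh_reduce effectively applies torch.cat to the per-process scalars:
try:
    torch.cat([tr_loss, tr_loss])
except RuntimeError as e:
    print(e)  # zero-dimensional tensor (at position 0) cannot be concatenated

# Giving each scalar a length-1 dimension first makes the concatenation work:
print(torch.cat([tr_loss.reshape(1), tr_loss.reshape(1)]))  # tensor([1.2345, 1.2345])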

Expected behavior

No error.

@sgugger
Collaborator

sgugger commented Sep 29, 2021

I see where the problem comes from. I will push a fix tonight or tomorrow morning, and then we will do a patch release.
In the meantime, you should see no error if you stay on v4.10.
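For example, pip install "transformers<4.11" will keep you on the v4.10 line until the patch release is out (that exact pin is just one way to do it).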

@odellus
Contributor

odellus commented Sep 30, 2021

I run out of memory with transformers v4.x (x > 10) when training led-large-16384-arxiv with four gradient accumulation steps and a batch size of two, as in this notebook, on an A6000 with 48 GB of VRAM. I had to drop both gradient accumulation steps and batch size to 1 to fit the model plus a batch on the GPU. Wild. I don't really feel like opening an issue, but I thought I'd chime in and say that with v4.10.1 I can fit up to 8 samples per batch with four gradient accumulation steps on the same A6000.

If I upgrade to 4.11.1 in the Colab notebook I shared, it fails, but with 4.10.1 it works just fine.
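
For concreteness, the two memory-relevant settings being compared map onto the standard TrainingArguments fields roughly as follows (a sketch only; everything else from the notebook is omitted and output_dir is a placeholder):

from transformers import TrainingArguments

# Settings I had to fall back to on the 48 GB A6000 with v4.11.x
args_v4_11 = TrainingArguments(
    output_dir="results",           # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
)

# Settings that still fit with v4.10.1
args_v4_10 = TrainingArguments(
    output_dir="results",           # placeholder
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
)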
