
Bug with .from_pretrained in distributed mode on high-ram Colab instances + Accelerate #17659

Closed · 2 of 4 tasks
muellerzr opened this issue Jun 10, 2022 · 2 comments

muellerzr commented Jun 10, 2022

System Info

- `transformers` version: 4.19.3
- `accelerate` version: 0.10.0.dev0
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-biotic
- Python version: 3.713
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.9.0+cu102 (False)
- Tensorflow version (GPU?): 2.8.2 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script? No (notebook_launcher)
- Using distributed or parallel set-up in script? Accelerate

Who can help?

@sgugger + others related to .from_pretrained

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

On either a high-RAM TPU Colab instance or a high-end local machine (e.g. a Titan RTX), load the Simple NLP Example and attempt to run it top to bottom.

I modified the launch part to be:

notebook_launcher(training_function, (model,))

I also changed the training function to accept a model argument so it would work.
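
For reference, a minimal sketch of the modified launch, assuming the usual Simple NLP Example setup (the model name here is a placeholder, not the notebook's exact checkpoint):

```python
from accelerate import notebook_launcher
from transformers import AutoModelForSequenceClassification

# Workaround: instantiate the model once in the main process, before launching.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

def training_function(model):
    # ... the Simple NLP Example training loop, using the model passed in ...
    pass

# Each spawned process receives the already-instantiated model.
notebook_launcher(training_function, (model,))
```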

Expected behavior

It should just run and train fine, but instead it hits a SIGSEGV error. We've pinpointed the cause: model = AutoModel.from_pretrained(...) is called inside the function that gets launched in the distributed processes. If we instead use the already downloaded and instantiated model, the code runs just fine.

from_pretrained should guarantee that only a single process loads the model at a time, but instead we get a SIGSEGV.
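
For contrast, a minimal sketch of the pattern that crashes, under the same assumptions as above (this paraphrases the notebook rather than quoting it):

```python
from accelerate import notebook_launcher
from transformers import AutoModelForSequenceClassification

def training_function():
    # Calling from_pretrained inside the launched function means every
    # spawned process (8 on a TPU runtime) loads the weights on its own;
    # on high-RAM Colab instances this is where the SIGSEGV appears.
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
    # ... rest of the training loop ...

notebook_launcher(training_function)
```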

Here are two open issues in Accelerate with detailed stack traces:

muellerzr commented

cc @pacman100

muellerzr changed the title from "Bug with .from_pretrained in distributed mode on high-ram instances" to "Bug with .from_pretrained in distributed mode on high-ram instances + Accelerate" on Jun 10, 2022
muellerzr commented

The crux of this issue is Google Colab and its inability to clear the old model architecture out of RAM. So instead of loading 8x the model weights, then our new state dict, and then potentially freeing the old copies with a gc.collect(), those other model weights hang around indefinitely in Colab's memory, unable to be freed. Gist here: https://gist.github.com/muellerzr/763feb654fc0446ed4ebf1813e0cb05e
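
A minimal sketch of how one might observe this, assuming psutil is available in the runtime (the model name is again a placeholder; on a TPU runtime each of the 8 processes pays this cost):

```python
import gc
import os

import psutil
from transformers import AutoModelForSequenceClassification

def rss_gb() -> float:
    # Resident set size of the current process, in GB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e9

print(f"before load: {rss_gb():.2f} GB")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
print(f"after load:  {rss_gb():.2f} GB")

# Ideally, deleting the model and collecting garbage would release the host
# RAM again; per the gist above, in the Colab runtime the old weights stay
# resident instead.
del model
gc.collect()
print(f"after gc:    {rss_gb():.2f} GB")
```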

muellerzr changed the title from "Bug with .from_pretrained in distributed mode on high-ram instances + Accelerate" to "Bug with .from_pretrained in distributed mode on high-ram Colab instances + Accelerate" on Jun 10, 2022