
Bug with .from_pretrained in distributed mode on high-ram Colab instances + Accelerate #17659

Closed · 2 of 4 tasks
muellerzr opened this issue Jun 10, 2022 · 2 comments

muellerzr commented Jun 10, 2022

System Info

- `transformers` version: 4.19.3
- `accelerate` version: 0.10.0.dev0
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-biotic
- Python version: 3.713
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.9.0+cu102 (False)
- Tensorflow version (GPU?): 2.8.2 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script? No (notebook_launcher)
- Using distributed or parallel set-up in script? Accelerate

Who can help?

@sgugger + others related to .from_pretrained

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

On either a high-RAM TPU Colab instance or a high-end local machine (e.g. a Titan RTX), load the Simple NLP Example and attempt to run it top to bottom.

I modified the launch part to be:

notebook_launcher(training_function, (model,))

I also changed the training function to accept a model argument so it would work.
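
For reference, a minimal sketch of the modified launch, assuming the usual Simple NLP Example setup (the model name here is a placeholder, not the notebook's exact checkpoint):

```python
from accelerate import notebook_launcher
from transformers import AutoModelForSequenceClassification

# Workaround: instantiate the model once in the main process, before launching.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

def training_function(model):
    # ... the Simple NLP Example training loop, using the model passed in ...
    pass

# Each spawned process receives the already-instantiated model.
notebook_launcher(training_function, (model,))
```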

Expected behavior

It should just run and train fine, but instead it hits a SIGSEGV error. We've pinpointed the cause: model = AutoModel.from_pretrained(...) is called inside the function that gets launched in the distributed processes. If we instead use the already downloaded and instantiated model, the code runs just fine.

from_pretrained should guarantee that only a single process loads the model at a time, but instead we get a SIGSEGV.
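
For contrast, a minimal sketch of the pattern that crashes, under the same assumptions as above (this paraphrases the notebook rather than quoting it):

```python
from accelerate import notebook_launcher
from transformers import AutoModelForSequenceClassification

def training_function():
    # Calling from_pretrained inside the launched function means every
    # spawned process (8 on a TPU runtime) loads the weights on its own;
    # on high-RAM Colab instances this is where the SIGSEGV appears.
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
    # ... rest of the training loop ...

notebook_launcher(training_function)
```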

Here are two open issues in Accelerate with detailed stack traces:

muellerzr commented

cc @pacman100

muellerzr changed the title from "Bug with .from_pretrained in distributed mode on high-ram instances" to "Bug with .from_pretrained in distributed mode on high-ram instances + Accelerate" on Jun 10, 2022
muellerzr commented

The crux of this issue is Google Colab and its inability to clear the old model architecture out of RAM. So instead of loading 8x the model weights, then our new state dict, and then potentially freeing the old copies with a gc.collect(), those other model weights hang around indefinitely in Colab's memory, unable to be freed. Gist here: https://gist.github.com/muellerzr/763feb654fc0446ed4ebf1813e0cb05e
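
A minimal sketch of how one might observe this, assuming psutil is available in the runtime (the model name is again a placeholder; on a TPU runtime each of the 8 processes pays this cost):

```python
import gc
import os

import psutil
from transformers import AutoModelForSequenceClassification

def rss_gb() -> float:
    # Resident set size of the current process, in GB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e9

print(f"before load: {rss_gb():.2f} GB")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
print(f"after load:  {rss_gb():.2f} GB")

# Ideally, deleting the model and collecting garbage would release the host
# RAM again; per the gist above, in the Colab runtime the old weights stay
# resident instead.
del model
gc.collect()
print(f"after gc:    {rss_gb():.2f} GB")
```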

muellerzr changed the title from "Bug with .from_pretrained in distributed mode on high-ram instances + Accelerate" to "Bug with .from_pretrained in distributed mode on high-ram Colab instances + Accelerate" on Jun 10, 2022