- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
On either a high-RAM TPU Colab instance or a Titan RTX/bad** machine, load the Simple NLP Example and attempt to run it top-down.
I modified the launch part to be:
notebook_launcher(training_function, (model,))
and changed the training function to accept a model argument so it would work.
Expected behavior
It should just run and train fine, but instead it hits a SIGSEGV error. We've pinpointed it to model = AutoModelFromPretrained... being inside the function that gets launched in the distributed processes. If we instead use the already downloaded and instantiated model, the code runs just fine.
from_pretrained should guarantee that only a single process loads the model at a time, but instead we get a SIGSEGV.
Here are two open issues in Accelerate with detailed stack traces:
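To make the workaround concrete, here is a minimal stand-alone sketch of it using plain multiprocessing rather than Accelerate's actual API (build_model is a hypothetical stand-in for AutoModel.from_pretrained): the model is built exactly once in the parent process, before launching, and the forked workers simply inherit it instead of each calling from_pretrained themselves. notebook_launcher likewise forks in notebook environments, since notebook code cannot be spawned.

```python
# Sketch of the workaround: build once in the parent, hand it to the
# launched function, never call the loader inside the workers.
import multiprocessing as mp

def build_model():
    # hypothetical stand-in for AutoModel.from_pretrained(...)
    return {"layer_sizes": [128, 64, 10]}

def training_function(model, rank, queue):
    # each worker uses the inherited, already-built model
    queue.put((rank, len(model["layer_sizes"])))

ctx = mp.get_context("fork")   # fork mirrors how launchers behave in notebooks
model = build_model()          # loaded exactly once, before launching
queue = ctx.Queue()
workers = [ctx.Process(target=training_function, args=(model, r, queue))
           for r in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
results = sorted(queue.get() for _ in range(2))
print(results)
```

Under this pattern no worker ever re-runs the expensive (and here crash-prone) load step; the parent pays the cost once.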
muellerzr changed the title from "Bug with .from_pretrained in distributed mode on high-ram instances" to "Bug with .from_pretrained in distributed mode on high-ram instances + Accelerate" on Jun 10, 2022
The crux of this issue is Google Colab and its inability to clear the old model architecture out of RAM. Rather than loading the model weights 8x (once per process), then our new state dict, and freeing the old copies with a gc.collect(), the stale model weights hang around indefinitely in Colab's memory, unable to be freed. Gist here: https://gist.github.com/muellerzr/763feb654fc0446ed4ebf1813e0cb05e
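The del-then-gc.collect() step that Colab apparently never gets to can be illustrated with a tiny stdlib-only sketch (no Transformers involved; the Model class and its byte buffer are stand-ins for real weights): an object caught in a reference cycle survives having its last name deleted, and is only reclaimed once the cycle collector runs.

```python
import gc
import weakref

class Model:
    """Stand-in for a large model whose buffers we want reclaimed."""
    def __init__(self):
        self.weights = bytearray(1_000_000)  # pretend these are model weights
        self.cycle = self  # reference cycle: refcounting alone can't free this

model = Model()
ref = weakref.ref(model)  # observe the object's lifetime without keeping it alive

del model                 # drop our name; the cycle still keeps the object alive
alive_after_del = ref() is not None

gc.collect()              # the cycle collector finally reclaims it
freed_after_collect = ref() is None
print(alive_after_del, freed_after_collect)
```

In a plain script this frees the memory as expected; the report above is that on a high-RAM Colab instance the old weights stay reachable, so even this explicit collection step cannot reclaim them.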
muellerzr changed the title from "Bug with .from_pretrained in distributed mode on high-ram instances + Accelerate" to "Bug with .from_pretrained in distributed mode on high-ram Colab instances + Accelerate" on Jun 10, 2022
System Info
Who can help?
@sgugger + others related to .from_pretrained