[Include example code] mixed_precision="fp16" will break torch.save function. #866
cc @muellerzr This might already be fixed by your recent work. Or it's what has broken it ;-)
Wow, that was fast, thanks for the quick response lol
@BurguerJohn the solution is to save only via the following (and install accelerate from GitHub, as this fix was merged this morning!): `torch.save(accelerator.unwrap_model(model), "model_unwrap.sd")`. @sgugger we could probably make
Is the solution on the main branch? I just installed it on Colab (accelerate-0.15.0.dev0) and it still gives the same error.
@BurguerJohn yes it is. I ran the following to test it:

```python
from accelerate import Accelerator
import torch
import torch.nn as nn

class ExampleModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=1)

model = ExampleModule()
accelerator = Accelerator(
    gradient_accumulation_steps=1,
    mixed_precision="fp16",
    log_with="tensorboard",
    logging_dir=".",
)
model = accelerator.prepare(model)
torch.save(accelerator.unwrap_model(model), "model_unwrap.sd")
```
@muellerzr No, we can't have
Whelp, as Sylvain says, too much magic :) What I presented above would be the official "correct" answer to what you're wanting to do.
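As a side note (a sketch, not from the maintainers' answer above): if you only need the weights, saving the `state_dict` sidesteps pickling the module object entirely, and therefore any wrapper that `prepare()` may have attached to its `forward`:

```python
import io
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=1)

# The state_dict is just a mapping of names to tensors; saving it never
# touches the module's (possibly monkey-patched) forward method.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Reloading: rebuild the architecture, then load the weights into it.
buffer.seek(0)
fresh = nn.Conv2d(3, 3, kernel_size=1)
fresh.load_state_dict(torch.load(buffer))
```

This trades convenience (you must reconstruct the architecture yourself) for robustness against exactly the kind of pickling failure discussed in this thread.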
Could it be some limitation on Colab or Python 3.7? I still need to test it on my computer, but Colab seems to still have trouble with it:
Thanks @BurguerJohn, will look into this as it seems to be Python 3.7 specific!
Cool, thanks! Will try it out with another version later. Again, thanks for all the support.
@BurguerJohn sadly I don't have a solution immediately for you. This whole issue stems from Jupyter specifically. Even calling it through the CLI on Jupyter will have this bug. No clue why, but I'll be working on a different solution to this soon.
@muellerzr Alright, no problem. Just letting you know that it seems this bug also happens without Jupyter, on Windows with Python 3.9.
Can you open up a separate issue for that, please? termios and tty are part of the stdlib for Python.
I'm running more tests; it may be some name conflict with my project. I just tested calling termios in a clean project and it works. I'll do more tests before opening another issue.
Can confirm that the error still happens on Windows with Python 3.9, even without Jupyter.
I'm pretty sure this line is causing the problem:
@BurguerJohn it's actually stemming from ConvertFp32, IIRC, from when I was looking (see the line at the bottom of that section you were looking at)
Yeah, but this line is enough to break torch.save. Unless unwrap should do something to revert this; I didn't have the time to read all the code.
It should be, with that PR Sylvain mentioned earlier (and it shows it works on Ubuntu-based systems that aren't running Jupyter). I'll be looking deeper into this tomorrow.
No problem, thanks for all the help. I already managed to make my code work without the prepare line, so there is no need to rush.
BTW, here's a very good Stack Overflow post explaining what's happening: https://stackoverflow.com/questions/27641859/pickling-decorated-callable-class-wrapper
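The failure mode described there can be reproduced without accelerate or torch at all. In this sketch (all names are hypothetical, not accelerate's actual code), patching an instance's `forward` with a locally defined closure makes the object unpicklable, because pickle serializes functions by their qualified name and cannot resolve a `<locals>` function:

```python
import pickle

class Model:
    def forward(self, x):
        return x

def fp32_wrapper(fn):
    # A closure defined inside another function, similar in spirit to the
    # wrapper that mixed-precision preparation attaches to forward.
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper

m = Model()
pickle.dumps(m)  # works: nothing unusual in m.__dict__ yet

m.forward = fp32_wrapper(m.forward)  # monkey-patch the instance attribute
try:
    pickle.dumps(m)  # pickle can't resolve 'fp32_wrapper.<locals>.wrapper'
    failed = False
except Exception:
    failed = True
print("pickling failed:", failed)  # → pickling failed: True
```

Since `torch.save` uses pickle under the hood, the same lookup failure surfaces there once `prepare()` has patched the model.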
@BurguerJohn found a fix; essentially, it's possible for us to follow the trail of
Wow, that is pretty cool. It's also something new for me. Glad you managed to find a good solution.
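For context, the general pattern from that Stack Overflow answer (again a sketch with hypothetical names, not accelerate's actual implementation) is to replace the local closure with a module-level callable class; pickle can then serialize the wrapper by class reference plus instance state:

```python
import pickle

class ConvertOutputs:
    # Defined at module level, so pickle finds the class by name and only
    # needs to serialize the instance's attributes.
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, *args, **kwargs):
        return self.fn(*args, **kwargs)

def double(x):
    return 2 * x

wrapped = ConvertOutputs(double)
restored = pickle.loads(pickle.dumps(wrapped))  # round-trips cleanly
print(restored(21))  # → 42
```

The wrapped function itself still has to be picklable (a module-level function or a bound method), but the wrapper no longer breaks serialization.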
In a Colab notebook, I tried `accelerator = Accelerator(` … `model = accelerator.prepare(model)` but it gives this error: /usr/local/lib/python3.10/dist-packages/torch/serialization.py:441 in save
@phananh03x can you provide us a full reproducer (what is ExampleModel?) and the version of accelerate you are using?

```python
from diffusers import UNet2DModel

model = UNet2DModel(in_channels=1, out_channels=1, block_out_channels=(32, 64, 128, 128))
```

accelerate-0.19.0
System Info
Information
Tasks
One of the no_trainer scripts in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
It will return this error if mixed_precision="fp16"
Expected behavior