# Torch SD-based models tensor invalid for input size #95

## Comments
Hi @pattontim, thanks for the report. So far I have reproduced the conversion and the files seem to match 1:1. Here is a script I propose:

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import torch

pt_filename = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="sd-v1-4.ckpt")
pt_loaded = torch.load(pt_filename)["state_dict"]

sf_filename = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="model.safetensors", revision="refs/pr/224")
sf_loaded = load_file(sf_filename)

for k, v in pt_loaded.items():
    pt_tensor = v
    sf_tensor = sf_loaded[k]
    if not torch.allclose(pt_tensor, sf_tensor):
        raise Exception("Difference")

print("Seems everything is ok.")
```
The error doesn't have a stack trace, but it's probably linked to some other part of the PyTorch checkpoint being missing or improperly created. What is your hardware? I tried to reproduce with https://github.com/huggingface/safetensors/blob/main/bindings/python/convert.py, even with your PyTorch version, but couldn't.
Also took the liberty of opening a PR against stable-diffusion-webui since I saw you opened an issue there: AUTOMATIC1111/stable-diffusion-webui#4930 (I also like it! :D)
My hardware is a GTX 3060, and I used SAFETENSORS_FAST_GPU when creating the file and when loading the file. I tried it again without this flag at creation time and without it at load time, and it still didn't work. I'm loading the model when swapping from one model with xformers applied to the safetensors file. Edit: it fails when loading the safetensors file even without the webui, which means the state of the webui is not a factor. Or perhaps the error is thrown when there is not enough memory. Is the SHA256 of your safetensors output e57901186bb65c5b7b9fce118dd221bd646fdcc0a8ab34dfdc25ead5bd11fb59? Python 3.10.6
It appears to successfully load and compare the file if I set map_location='cuda' and device='cuda'. My CPU is an AMD FX-6350, and only half of my RAM is used at peak.
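For reference, a minimal sketch of the two load paths being compared here, assuming a local copy of the converted file (the file name is illustrative):

```python
from safetensors.torch import load_file

# Same file, two target devices: in this report the CUDA path
# succeeded while the CPU path raised the shape error.
gpu_tensors = load_file("sd-v1-4.safetensors", device="cuda")
cpu_tensors = load_file("sd-v1-4.safetensors", device="cpu")
```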
I don't know what xformers does, but it's low level enough to screw things up.
I should have reworded: xformers is applied to the model loaded before the safetensors one. However, I eliminated this as a factor by restarting my PC and running your script, but loading the safetensors file I created with the script in the OP instead. It works on GPU and fails on CPU. Reproducing:
Whatever difference there is between the tensors in SD and trinart_60000 steps may explain why trinart loads with the above code but fails with SD. |
Ok. You are on Windows, that's good to know (I don't think it makes a difference for now, but it is good to keep in mind). The file you already created is wrong, and cannot be salvaged, I fear.
This is correct: a tensor of shape [1280, 1280, 3, 3] should have 14,745,600 elements, not 7,290,352. Could you share the file somewhere? I think the header is corrupted somehow (it shouldn't happen). The culprit is most likely this code: https://github.com/huggingface/safetensors/blob/main/bindings/python/py_src/safetensors/torch.py#L175 Could you share:
(It is always easier if I'm able to reproduce the bug.) Other option: could you try loading the file with: (You might need to rename
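The snippet referenced above is not preserved in this copy of the thread. As a rough stand-in, a pure-Python header reader based on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header) might look like this; the file name is illustrative:

```python
import json
import struct

def read_safetensors_header(path):
    # The file starts with an unsigned 64-bit little-endian integer
    # giving the byte length of the JSON header that follows it.
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

header = read_safetensors_header("sd-v1-4.safetensors")
for name, info in header.items():
    if name == "__metadata__":  # optional free-form metadata entry
        continue
    print(name, info["dtype"], info["shape"], info["data_offsets"])
```

Inspecting the reported shapes and data_offsets directly would show whether the header itself is corrupted or whether the loader is misreading it.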
Windows 11 Pro 22621.674 (WSL installed but not used), Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], venv
Deleted a post where I accidentally put I64 for I32. The pure-Python approach loads both the ckpt and the previously-erring .safetensors file fine and outputs the keys. I added the missing dtype to DTYPES.
So your issue is fixed?
Both should already exist: https://github.com/huggingface/safetensors/blob/main/bindings/python/py_src/safetensors/torch.py#L149
As in, I ran your gist https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282 and it worked. Maybe it will work in dev if there have been torch-related fixes since release 0.2.4?
Very good signal, the issue is in the loading, not in the serializing.
I don't think much has changed since. Though could you try to install from source to check? You need Rust: https://rustup.rs/ Then:

```
cd bindings/python
pip install setuptools_rust
pip install -e .
```
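A generic way to confirm which build is actually being imported afterwards (a standard check, not quoted from the thread):

```python
from importlib.metadata import version

# Prints the version of the safetensors distribution visible to this
# interpreter; an editable install from source should show up here.
print(version("safetensors"))
```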
I want to say it's not fixed on the dev build, but I can't say for certain. When that runs, it works as expected until the first torch model is loaded, then I get a Windows error that Python has stopped working. The GPU continues to load the second model over time, and then the program terminates without a message. No "Seems everything is ok.", just a blank terminal when finished.
Can you upload the serialized file somewhere?
Is it necessary? Is the SHA256 of model.safetensors on your sd1.4 PR different from e57901186bb65c5b7b9fce118dd221bd646fdcc0a8ab34dfdc25ead5bd11fb59?
https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/discussions/224/files No, it's not. So this:

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

sf_filename = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="model.safetensors", revision="refs/pr/224")
sf_loaded = load_file(sf_filename)
```

fails on your machine too? That would be extremely weird, since both code paths execute pretty much the same thing.
Multiple potential causes; being out of memory is a classic, though.
All the code I shared is supposed to load things on CPU; if things are loaded on GPU, something is wrong. (The goal here is to identify the bug, so we need to remove as much code as possible; just putting things on CPU should be enough to gauge whether the loaded tensors are correct.)
It fails on my machine. It doesn't fail when using the gist, that's correct.
I meant that it continues to be loaded into RAM, because the device is cpu after all. I was watching the available memory and it wasn't exhausted while running, but it's possible.
Okay, thanks to @mfuntowicz, who has a Windows machine, we were able to figure it out. It turns out it IS a Windows issue. This IMO is a PyO3 issue, which I will report, and I will provide a hotfix soon. Thank you so much for helping with this.
Should be fixed with 0.2.5.
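Upgrading via pip (the usual route, not quoted from the thread) should pick up the fix: `pip install --upgrade safetensors`.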
I can confirm that the issue is now fixed in 0.2.5, thanks for the fix!
## Original issue

There might be a slight discrepancy between the loading and saving process in safetensors. I am loading an SD-based model such as SD 1.4 packaged as a PyTorch checkpoint; we'll call it sd-v1-4.ckpt. We can package its state_dict and discard the torch format.
Packaging as safetensors:
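The original conversion script is not preserved in this copy of the issue. A minimal sketch of the step it describes, assuming the weights live under "state_dict" as in SD 1.4, could look like this:

```python
import torch
from safetensors.torch import save_file

# Hypothetical reconstruction, not the OP's exact script: load the
# checkpoint, keep only the state_dict, and write it as safetensors.
state_dict = torch.load("sd-v1-4.ckpt", map_location="cpu")["state_dict"]

# save_file rejects tensors that share storage, so clone each tensor
# into its own contiguous buffer before saving.
state_dict = {k: v.clone().contiguous() for k, v in state_dict.items()}
save_file(state_dict, "sd-v1-4.safetensors")
```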
Loading the tensors:

```python
load_file('sd-v1-4.safetensors', device='cpu')
```
Results in this error:

```
File "venv\lib\site-packages\safetensors\torch.py", line 99, in load_file
    result[k] = f.get_tensor(k)
RuntimeError: shape '[1280, 1280, 3, 3]' is invalid for input of size 7290352
```
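For reference, a tensor of shape [1280, 1280, 3, 3] should hold 1280 × 1280 × 3 × 3 = 14,745,600 elements, so the 7,290,352 elements actually available are roughly half of what the header promises.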
Expected behaviour: safetensors should either fail while trying to save unexpected tensor data, or create tensors which can be loaded.
Affected versions: safetensors==0.2.4, torch==1.12.1+cu113
- ckpt size: 3.97 GB, 4,265,381,888 bytes (4,265,380,512 bytes)
- safetensors size: 3.97 GB, 4,265,148,416 bytes (4,265,146,304 bytes)
- SHA256: fe4efff1e174c627256e44ec2991ba279b3816e364b49f9be2abc0b3ff3f8556
- Using the pruned version of CompVis/stable-diffusion-v-1-4-original
Apologies if this is already fixed by the addition of more dtypes. I will try to get more info by running checks and gathering debug info for this specific tensor.