
Torch SD-based models tensor invalid for input size #95

Closed
pattontim opened this issue Nov 20, 2022 · 21 comments

@pattontim

There might be a slight discrepancy between the loading and saving processes in safetensors. When loading an SD-based model such as sd-1.4 packaged as a PyTorch checkpoint (call it sd-v1-4.ckpt), we can save its state_dict and discard the torch format.

Packaging as safetensors

    import torch
    from safetensors.torch import save_file
    # shared_pointers and check_file_size are helpers from safetensors'
    # bindings/python/convert.py

    sf_filename = "sd-v1-4.safetensors"
    filename = "sd-v1-4.ckpt"

    loaded = torch.load(filename)
    loaded = loaded['state_dict']

    # appears to pop nothing in this case
    shared = shared_pointers(loaded)
    for shared_weights in shared:
        for name in shared_weights[1:]:
            loaded.pop(name)

    loaded = {k: v.contiguous() for k, v in loaded.items()}

    save_file(loaded, sf_filename, metadata={"format": "pt"})

    check_file_size(sf_filename, filename)
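For context, a minimal sketch of what such a shared-pointer helper does (the real one lives in safetensors' bindings/python/convert.py; this version is illustrative): it groups state_dict keys whose tensors alias the same storage, so duplicates can be dropped before serializing.

    from collections import defaultdict

    def shared_pointers(tensors):
        # group parameter names by the underlying data pointer; any group
        # with more than one name is a set of aliased tensors
        ptrs = defaultdict(list)
        for name, tensor in tensors.items():
            ptrs[tensor.data_ptr()].append(name)
        return [names for names in ptrs.values() if len(names) > 1]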

Loading the tensors

from safetensors.torch import load_file

load_file('sd-v1-4.safetensors', device='cpu')

Results in error:

File "venv\lib\site-packages\safetensors\torch.py", line 99, in load_file
result[k] = f.get_tensor(k)
RuntimeError: shape '[1280, 1280, 3, 3]' is invalid for input of size 7290352

Expected behaviour: safetensors should either fail when trying to save unexpected tensor data, or create tensors that can be loaded.
Affected versions: safetensors==0.2.4, torch==1.12.1+cu113

ckpt size:
3.97 GB 4,265,381,888 bytes (4,265,380,512 bytes)
safetensors size:
3.97 GB 4,265,148,416 bytes (4,265,146,304 bytes)
SHA256 of the ckpt: fe4efff1e174c627256e44ec2991ba279b3816e364b49f9be2abc0b3ff3f8556

Using pruned version of CompVis/stable-diffusion-v-1-4-original

Apologies if this is already fixed with the addition of more dtypes. I will try to get more info by checking the output and debug info for this specific tensor.

@Narsil (Collaborator) commented Nov 21, 2022

Hi @pattontim,

Thanks for the report.

So far I have reproduced the conversion and the files seem to match 1:1.

Here is a script I propose:

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import torch

pt_filename = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="sd-v1-4.ckpt")
pt_loaded = torch.load(pt_filename)["state_dict"]
sf_filename = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="model.safetensors", revision="refs/pr/224")
sf_loaded = load_file(sf_filename)

for k, v in pt_loaded.items():
    pt_tensor = v
    sf_tensor = sf_loaded[k]

    if not torch.allclose(pt_tensor, sf_tensor):
        raise Exception("Difference")

print("Seems everything is ok.")

@Narsil (Collaborator) commented Nov 21, 2022

The error doesn't come with a full stacktrace, but it's probably linked to some other part of the pytorch checkpoint being missing or improperly created.

What is your hardware?

I tried to reproduce with https://github.com/huggingface/safetensors/blob/main/bindings/python/convert.py, even with your pytorch version, but couldn't.

@Narsil (Collaborator) commented Nov 21, 2022

Also took the liberty of opening a PR against stable-diffusion-webui since I saw you opened an issue there: AUTOMATIC1111/stable-diffusion-webui#4930 (I also like it! :D)

@pattontim (Author) commented Nov 21, 2022

My hardware is a GTX 3060, and I used SAFETENSORS_FAST_GPU when creating the file and when loading it. I tried again without this flag at creation time and without it at load time, and it still didn't work. I'm loading the model when swapping from one model with xformers applied to the safetensors file.

Edit: It fails when loading the safetensors file even without the webui, which means the state of the webui is not a factor. Or perhaps it's the error thrown when there is not enough memory. Is the SHA256 of your safetensors output e57901186bb65c5b7b9fce118dd221bd646fdcc0a8ab34dfdc25ead5bd11fb59? Python 3.10.6
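In case it helps with comparing hashes, a minimal sketch for computing a large file's SHA256 in chunks (standard library only; the path is illustrative):

    import hashlib

    h = hashlib.sha256()
    with open("sd-v1-4.safetensors", "rb") as f:
        # 1 MiB chunks so the ~4 GB file never has to fit in memory at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    print(h.hexdigest())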

@pattontim (Author) commented Nov 21, 2022

It appears to successfully load and compare the file if I set map_location='cuda' and device='cuda'. My CPU is an AMD FX-6350, and only half of my RAM is used at peak.

@Narsil (Collaborator) commented Nov 21, 2022

I'm loading the model when swapping from one model with xformers applied to the safetensors file.

I don't know what xformers does, but it's low-level enough to screw things up.
Do you mind sharing a simple reproducing script?
Did you try the script I suggested? Does it work?

@pattontim (Author) commented Nov 21, 2022

I'm loading the model when swapping from one model with xformers applied to the safetensors file.

I don't know what xformers does, but it's low-level enough to screw things up. Do you mind sharing a simple reproducing script? Did you try the script I suggested? Does it work?

I should have reworded: xformers is applied to the model loaded before the safetensors file. However, I ruled xformers out by restarting my PC and running your script, but loading the safetensors file I created with the script in the OP instead. It works on GPU and fails on CPU.

Reproducing:

import os
import torch
from safetensors.torch import save_file

sf_filename = "sd-v1-4.safetensors"
filename = "D:\\test\\sd-v1-4.ckpt"

loaded = torch.load(filename)
loaded = loaded['state_dict']
local = os.path.join("D:\\test\\", sf_filename)

loaded = {k: v.contiguous() for k, v in loaded.items()}

save_file(loaded, local, metadata={"format": "pt"})

Then, loading in a separate script:

import torch
from safetensors.torch import load_file

pt_filename = "D:\\test\\sd-v1-4.ckpt"
pt_loaded = torch.load(pt_filename, map_location='cpu')["state_dict"]
sf_filename = "D:\\test\\sd-v1-4.safetensors"
sf_loaded = load_file(sf_filename, device='cpu')  # <--- fails here

for k, v in pt_loaded.items():
    pt_tensor = v
    sf_tensor = sf_loaded[k]

    if not torch.allclose(pt_tensor, sf_tensor):
        raise Exception("Difference")

# <---- reaches here if map_location and device='cuda', or if the model file
# is swapped with trinart_60k
print("Seems everything is ok.")

Whatever difference there is between the tensors in SD and trinart_60000 steps may explain why trinart loads with the above code while SD fails.
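A hedged diagnostic sketch for chasing that difference (paths are illustrative; the trinart filename is hypothetical): dump the dtype counts and the largest tensor of each state_dict and compare.

    from collections import Counter
    import torch

    for path in ("D:\\test\\sd-v1-4.ckpt", "D:\\test\\trinart_60000.ckpt"):
        sd = torch.load(path, map_location="cpu")["state_dict"]
        # count tensors per dtype; an unexpected dtype would stand out here
        print(path, Counter(str(v.dtype) for v in sd.values()))
        name, tensor = max(sd.items(), key=lambda kv: kv[1].numel())
        print("  largest:", name, tuple(tensor.shape))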

@Narsil (Collaborator) commented Nov 21, 2022

Ok.

You are on Windows, that's good to know (I don't think it makes a difference for now, but good to keep in mind).

The file you already created is wrong now, and cannot be salvaged, I fear.

 RuntimeError: shape '[1280, 1280, 3, 3]' is invalid for input of size 7290352

This is correct: a tensor of that shape should have 14745600 elements, not 7290352. It's almost a 2x difference (but not exactly?).
However, this should have been checked before actually saving the file.

Could you share the file somewhere? I think the header is corrupted somehow (it shouldn't happen).
That, or the loading part is messing up somehow.

The culprit is most likely this code: https://github.com/huggingface/safetensors/blob/main/bindings/python/py_src/safetensors/torch.py#L175
However, the size mismatch is not just an off-by-2x thing, which makes it really weird.
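As a quick sanity check on those numbers (an illustrative aside, not part of the original report): the element count implied by the shape versus the count the loader reported.

    import math

    shape = (1280, 1280, 3, 3)
    expected = math.prod(shape)   # 14_745_600 elements
    reported = 7_290_352
    print(expected / reported)    # ~2.02, so close to 2x but not exactly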

Could you share:

  • Your Windows version (and whether WSL or not)
  • Your Python version
  • Your environment (conda, pyenv, virtualenv, or none)

(It's always easier if I'm able to reproduce the bug.)

Other option:

Could you try loading the file with:
https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282

(You might need to rename untyped() to _untyped() because you are on 1.12.)
We have already seen some pure pytorch bugs before.
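For reference, the layout such a pure-Python loader parses can be inspected by hand. A minimal sketch of reading a safetensors header (the file starts with an 8-byte little-endian header length followed by a JSON table; this is illustrative, not the gist's exact code):

    import json
    import struct

    def read_safetensors_header(path):
        with open(path, "rb") as f:
            # first 8 bytes: little-endian u64 length of the JSON header
            (header_len,) = struct.unpack("<Q", f.read(8))
            return json.loads(f.read(header_len))

    header = read_safetensors_header("sd-v1-4.safetensors")
    for name, info in header.items():
        if name != "__metadata__":
            print(name, info["dtype"], info["shape"], info["data_offsets"])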

@pattontim (Author) commented Nov 21, 2022

Could you share:

* Your Windows version (and whether WSL or not)

* Your Python version

* Your environment (conda, pyenv, virtualenv, or none)

Windows 11 Pro 22621.674 (WSL installed but not used)

Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]

venv

@pattontim (Author) commented Nov 21, 2022

Deleted a post where I accidentally put I64 for I32.

The pure Python approach loads both the gpt2 and the previously erroring .safetensors file fine and outputs the keys. I added "I64": torch.int64, "I32": torch.int32 to DTYPES.

@Narsil (Collaborator) commented Nov 21, 2022

So your issue is fixed?

The pure Python approach loads both the gpt2 and the previously erroring .safetensors file fine and outputs the keys. I added "I64": torch.int64, "I32": torch.int32 to DTYPES.

Both should already exist: https://github.com/huggingface/safetensors/blob/main/bindings/python/py_src/safetensors/torch.py#L149

@pattontim (Author)

So your issue is fixed?

The pure Python approach loads both the gpt2 and the previously erroring .safetensors file fine and outputs the keys. I added "I64": torch.int64, "I32": torch.int32 to DTYPES.

Both should already exist: https://github.com/huggingface/safetensors/blob/main/bindings/python/py_src/safetensors/torch.py#L149

As in, I ran your gist https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282 and it worked. Maybe it will work in dev with the fixes since release 0.2.4?

@Narsil (Collaborator) commented Nov 21, 2022

As in, I ran your gist https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282 and it worked.

Very good signal: the issue is in the loading, not in the serializing.

Maybe it will work in dev with the fixes since release 0.2.4?

I don't think much has changed since. Could you try installing from source to check, though?

You need rust https://rustup.rs/

Then

cd bindings/python
pip install setuptools_rust
pip install -e .
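(From a fresh checkout, assuming the standard repository URL, the full sequence would look something like:)

    git clone https://github.com/huggingface/safetensors
    cd safetensors/bindings/python
    pip install setuptools_rust
    pip install -e .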

@pattontim (Author) commented Nov 21, 2022

I want to say it's not fixed on the dev build, but I can't say for certain. When that runs, it works as expected until the first torch model is loaded; then I get a Windows error that Python has stopped working. The GPU continues to load the second model over time, and then the program terminates without a message. No "Seems everything is ok.", just a blank terminal when finished.

@Narsil (Collaborator) commented Nov 21, 2022

Can you upload the serialized file somewhere?

@pattontim (Author)

Can you upload the serialized file somewhere?

Is it necessary? Is the SHA256 of model.safetensors on your sd1.4 PR different from e57901186bb65c5b7b9fce118dd221bd646fdcc0a8ab34dfdc25ead5bd11fb59?

@Narsil (Collaborator) commented Nov 21, 2022

https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/discussions/224/files

No, it's not.

So this:

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

sf_filename = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="model.safetensors", revision="refs/pr/224")
sf_loaded = load_file(sf_filename)

Fails on your machine too?
If yes, it doesn't fail with https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282 (with gpt2 replaced by the proper repo_id and revision), correct?

That would be extremely weird, since both code paths execute pretty much the same thing.

that python has stopped working

Multiple potential causes; being out of memory is a classic one, though.

The GPU continues to load the second model over time

All the code I shared is supposed to load things on CPU; if things are loaded on GPU, something is wrong. (The goal here is to identify the bug, so we need to remove as much code as possible; just putting things on CPU should be enough to gauge whether the loaded tensors are correct.)

@pattontim (Author) commented Nov 21, 2022

So this

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

sf_filename = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="model.safetensors", revision="refs/pr/224")
sf_loaded = load_file(sf_filename)

Fails on your machine too? If yes, it doesn't fail with https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282 (with gpt2 replaced by the proper repo_id and revision), correct?

It fails on my machine. It doesn't fail when using the gist, that's correct.

That would be extremely weird, since both code paths execute pretty much the same thing.

that python has stopped working

Multiple potential causes; being out of memory is a classic one, though.

The GPU continues to load the second model over time

I meant that it continues to be loaded into RAM, because the device is cpu after all. I was watching the available memory and it wasn't exhausted yet while running, but it's possible.

@Narsil (Collaborator) commented Nov 22, 2022

Okay, thanks to @mfuntowicz, who has a Windows machine, we were able to figure it out.

It turns out it IS because of Windows that there is an issue.
https://doc.rust-lang.org/stable/std/ffi/type.c_long.html c_long is an i32 on Windows, meaning the tensor slice we're taking is overflowing and producing a wrong slice.
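As a hypothetical illustration of that failure mode (not the actual safetensors code): truncating a byte offset past 2 GiB to a 32-bit signed integer yields a garbage value, so any slice bounds derived from it are wrong.

    import ctypes

    byte_offset = 3_758_096_384                    # an offset past 2 GiB in a ~4 GB file
    truncated = ctypes.c_int32(byte_offset).value  # what a Windows c_long (i32) would hold
    print(truncated)                               # -536870912, i.e. wrong slice bounds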

This is IMO a PyO3 issue, which I will report, and I will provide a hotfix soon.

Thank you so much for helping on this.

@Narsil (Collaborator) commented Nov 23, 2022

Should be fixed with 0.2.5, can you confirm @pattontim?

@pattontim (Author) commented Nov 23, 2022

Should be fixed with 0.2.5, can you confirm @pattontim?

I can confirm that the issue is now fixed in 0.2.5, thanks for the fix!
