Make model key assignment deterministic #5792

lstein · 2024-02-24T15:30:44Z

What type of PR is this? (check all applicable)

Have you discussed this change with the InvokeAI team?

Yes
No, because:

Have you updated all relevant documentation?

Yes
No

Description

Previously, when a model was installed, it was assigned a random database key. This caused issues because the key would change if the model was reinstalled. It also made metadata irreproducible. This PR changes the behavior so that keys are assigned deterministically based on the model contents:

- .safetensors, .ckpt and other single file models are hashed with sha1. This is compatible with A1111's model hashes.
- The contents of diffusers directories are hashed using imohash (faster, but nonstandard).

Related Tickets & Documents

See discord: https://discord.com/channels/1020123559063990373/1149513647022948483/1210441942249377792

Related Issue #
Closes #

QA Instructions, Screenshots, Recordings

Try installing and uninstalling a model using the command line: invokeai-model-install --add <url or repoid>, followed by invokeai-model-install --delete <name of model>. Both safetensors URLs and HF diffusers should get the same key each time.

Merge Plan

Can merge when approved.

Added/updated tests?

Yes - changed length of expected hash for embedding files.
No

[optional] Are there any post deployment tasks we need to perform?

psychedelicious

There's a problem with the SHA1 implementation. It uses a block size when reading the file, causing the hash to be incorrect.
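Reading a file in blocks is not a problem in itself. The hash only goes wrong if the loop stops early. A later commit note in this thread says the fast SHA1 variant hashed just the first block, so the sketch below assumes the bug had that shape. The function names and `BLOCK_SIZE` are illustrative and not taken from the PR.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 64 * 1024  # illustrative chunk size, not taken from the PR


def sha1_first_block_only(path: str) -> str:
    """Buggy pattern: reads one block and stops, so the rest of the file is ignored."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        h.update(f.read(BLOCK_SIZE))
    return h.hexdigest()


def sha1_chunked(path: str) -> str:
    """Correct pattern: keeps reading blocks until the file is exhausted."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK_SIZE):
            h.update(chunk)
    return h.hexdigest()


def demo() -> tuple:
    """Hash a temporary file three ways: first block only, chunked, and in one call."""
    data = os.urandom(3 * BLOCK_SIZE + 123)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        whole = hashlib.sha1(data).hexdigest()
        return sha1_first_block_only(path), sha1_chunked(path), whole
    finally:
        os.remove(path)
```

In `demo()`, the chunked digest matches the one-shot digest, while the first-block digest does not, because it never sees most of the file.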

I was curious so I whipped up a benchmark of file hashing in python: https://github.com/psychedelicious/hash_bench

SHA1 with block size is by far the fastest, but BLAKE3 is still way faster than any other correct algorithm. I think we should use this for single-file models.

CivitAI provides BLAKE3 hashes, and you can query for a model on their API using it: https://civitai.com/api/v1/model-versions/by-hash/39d4e86885924311d47475838b8e10edb7a33c780934e9fc9d38b80c27cb658e

(that's DreamShaper XL something or other - used in my tests)

There are a few ways to implement BLAKE3, see my benchmark repo for the fastest method, which parallelizes hashing and uses memory mapping (I don't really understand what this means but it makes it super fast).
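For readers unsure what memory mapping means here: the operating system exposes the file's contents as a region of memory, and the hasher reads from that region directly instead of copying data through repeated `read()` calls. BLAKE3's tree structure additionally lets separate threads hash different regions at once. The sketch below shows only the memory-mapping part, using `hashlib.sha256` from the standard library as a stand-in because BLAKE3 is not in the standard library. It is not the benchmark repo's code and makes no speed claims.

```python
import hashlib
import mmap
import os


def sha256_mmap(path: str) -> str:
    """Hash a file through a read-only memory map instead of explicit read() calls."""
    h = hashlib.sha256()  # stand-in for a BLAKE3 hasher
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return h.hexdigest()  # an empty file cannot be memory-mapped
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)  # mmap objects support the buffer protocol
    return h.hexdigest()
```

The digest is identical to what a plain chunked read would produce. Only the way the bytes reach the hasher changes.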

RyanJDick · 2024-02-26T14:09:28Z

From a general DB design perspective, I'd be inclined to keep the random primary key and add the hash as another column on the table. (Which I think is what we had discussed when we originally decided to use a random key.)

This would be better if we think there's any chance that the definition of the hash key will change in the future, or that its uniqueness constraint will be dropped. Example situations:

- If we ever decide to change the hashing function that we use (e.g. for performance, we discover a bug, to map very similar models to the same 'hash', etc.), the migration will be simpler.
- If we ever decide that we want to track deleted models, having a unique primary key is more practical.
- If, in the future, we want to run optimizations on models (e.g. TensorRT) and store the optimized artifacts, having a single primary hash key probably won't map well to what we are trying to represent.
- And probably many more situations that are hard to anticipate...

Up to you if you think the added effort now is worth it for the future-proofing.
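A minimal sketch of the layout suggested in this comment, using an in-memory SQLite table: a random primary key plus a separate, non-unique hash column. The table name, column names and helper functions are hypothetical and do not reflect InvokeAI's actual schema.

```python
import sqlite3
import uuid


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical models table with a random key and a hash column."""
    conn.execute(
        """
        CREATE TABLE models (
            id   TEXT PRIMARY KEY,  -- random, never derived from the contents
            name TEXT NOT NULL,
            hash TEXT NOT NULL      -- content hash; indexed but not unique
        )
        """
    )
    conn.execute("CREATE INDEX idx_models_hash ON models (hash)")


def add_model(conn: sqlite3.Connection, name: str, content_hash: str) -> str:
    """Insert a model under a fresh random key and return that key."""
    model_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO models (id, name, hash) VALUES (?, ?, ?)",
        (model_id, name, content_hash),
    )
    return model_id


def find_by_hash(conn: sqlite3.Connection, content_hash: str) -> list:
    """Return the keys of all rows whose contents hash to the given value."""
    rows = conn.execute(
        "SELECT id FROM models WHERE hash = ? ORDER BY name", (content_hash,)
    )
    return [row[0] for row in rows]
```

With this shape, changing the hash algorithm later means rewriting one column, and two rows may share a hash without violating the primary key.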

psychedelicious · 2024-02-26T20:22:18Z

The PK can be anything but we need to be able to reference models using a stable, deterministic identifier.

There's no point in having model metadata if it isn't a stable reference to the same model.

Using a cryptographic hash also means metadata is useful between users.

Also, I think we should consider dropping imohash and instead use b3 to hash the diffusers weights. Iterate over all files and update the hash as you go. No need to rely on the imohash library.
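The "iterate over all files and update the hash as you go" idea could look roughly like the sketch below. It uses `hashlib.sha256` as a stand-in for BLAKE3 so that it runs with only the standard library. Two details are assumptions rather than anything stated in the thread: files are visited in sorted order of their relative path so the result does not depend on directory listing order, and each relative path is mixed into the hash.

```python
import hashlib
from pathlib import Path


def hash_directory(root: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash every file under root into a single digest, in a stable order."""
    h = hashlib.sha256()  # stand-in for a BLAKE3 hasher
    root_path = Path(root)
    # Sort by POSIX-style relative path so the order is the same on every run.
    files = sorted(
        (p.relative_to(root_path).as_posix(), p)
        for p in root_path.rglob("*")
        if p.is_file()
    )
    for rel, path in files:
        # Assumption: the relative path is part of the identity, NUL-terminated.
        h.update(rel.encode("utf-8") + b"\0")
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
    return h.hexdigest()
```

Dropping the path from the hash would make renamed-but-identical files hash the same, which is a design choice for the team rather than something this sketch settles.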

lstein · 2024-02-27T15:35:46Z

Huh. I didn't realize that I was hashing incorrectly. The web is filled with misleading info, I guess.

The memoryview() method applied to sha256 gives the same answer as sha256sum on the command line and is twice as fast as the linux command-line tool:

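import hashlib

def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128 * 1024)  # reusable 128 KiB read buffer
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        # readinto() fills the buffer and returns the byte count; 0 means EOF
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])  # hash only the bytes actually read
    return h.hexdigest()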

Python:

time python hash.py ais-stcks-sdxl.safetensors
cf729c0896c2bd69d2a9e5687f5ebe0b44d29879529e2271f8f2d64550485608

real	0m0.591s
user	0m0.554s
sys	0m0.037s

Command line tool:

time sha256sum ais-stcks-sdxl.safetensors 
cf729c0896c2bd69d2a9e5687f5ebe0b44d29879529e2271f8f2d64550485608  ais-stcks-sdxl.safetensors

real	0m1.080s
user	0m1.040s
sys	0m0.041s

The only reason I chose sha1 in the first place is that it produces shorter hashes.

lstein · 2024-02-27T15:53:42Z

Oh, just saw the blake3 benchmark results. That is very fast. Why don't we just go with that and stick with it?

lstein · 2024-02-27T17:37:14Z

> From a general DB design perspective, I'd be inclined to keep the random primary key and add the hash as another column on the table. (Which I think is what we had discussed when we originally decided to use a random key.)

We are trying to satisfy multiple use cases, ordered in decreasing level of priority.

1. Models should have stable identifiers that don't change when they are uninstalled and reinstalled.
2. Users should be able to use the metadata embedded in generated images to reproduce the image.
3. If a model is transformed from .safetensors to diffusers versions, its identifier shouldn't change.
4. We want to be able to easily identify a model and recognize what it is. The UI will hide the model ID almost all the time, but there is a use case in which the user is viewing the raw model metadata and trying to figure out what model was used.
5. We want to check that a downloaded model file or directory has arrived intact.
6. We want to know if model A on one user's computer is the same as model B on another user's.

With respect to (1), we need to adopt deterministic model identifiers rather than randomly-assigned ones. This seems obvious now, but recall that when we originally discussed the MM2 redesign in response to Spencer's RFC last summer we had a consensus that the identifiers should be random.

Thoughts:

Integrity checking:
- Use cases (5) and (6) are independent of how the models are named and shouldn't be conflated with (1) and (2). We hash the models correctly and store that info for integrity checking and comparison. We don't need to use a hash for the ID.
- Note that we already store the file hash as an element in the model's config. We just need to adopt a hash algorithm that is consistent with the checksums provided by Civitai and other repos.
- blake3 is nice and fast. My only reservation is that it is not widely known and I had to search around a bit to find a desktop tool that implemented it.
Identifying the model reproducibly:
- Use cases (1) and (2) are satisfied by any algorithm that deterministically assigns the model ID.
- (4) is hard. We could discuss reverting to using the base/type/name as the identifier, but we know this leads to confusion when the same model is downloaded from different locations under slightly different names. The one advantage this has is that closely related models, such as the fp16 and fp32 versions, will have similar IDs.
- Another option is to use the source path or URL, but this is brittle.
- The hash solution is a good one. The one thing I don't like about it is how ungainly the long length is (for example, it makes it hard to interact with the SQL database). One solution is to shorten the ID by truncating the hex digest to the first 12 characters. That leaves 16^12 (about 2.8 × 10^14) possible IDs, so the risk that two given models collide stays minimal (roughly 1 chance in 280 trillion).
Managing format changes:
- The current code calculates a hash when the model is first installed and then uses this hash as the ID and stores it in the config under current_hash. Later, if the model is converted into diffusers, the ID remains the same, but the hash value is moved to an original_hash field, the model is re-hashed, and the new hash replaces the value in the current_hash field.
- This solves the safetensors->diffusers conversion issue, but doesn't help resolve the issue of the user having the same model in two different original formats - say SDXL-base.safetensors downloaded from Civitai and SDXL-base diffusers downloaded from HuggingFace. Do we want to try to address this? What about different floating point precisions?
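The hash bookkeeping described under "Managing format changes" can be sketched as follows. `ModelRecord` and `record_conversion` are hypothetical names for illustration, not the PR's classes. The rule that `original_hash` is only set on the first conversion is an assumption, since the thread does not say what happens on repeated conversions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRecord:
    """Hypothetical stand-in for the model config fields discussed here."""
    key: str
    current_hash: Optional[str] = None
    original_hash: Optional[str] = None


def record_conversion(record: ModelRecord, new_hash: str) -> None:
    """After a format conversion: keep the key, remember the first hash, store the new one."""
    if record.original_hash is None:
        # Assumption: only the hash from the initial install is preserved.
        record.original_hash = record.current_hash
    record.current_hash = new_hash
```

The key never changes in this flow, which is what keeps image metadata pointing at the same model after a safetensors to diffusers conversion.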

Overall, I think my preference would be to use blake3 for hashing (both safetensors and diffusers directory recursion), to store the hashes in the config for integrity checking, and to use a truncated version of the hash for the model ID. We would also want to provide the user with a UI display element that shows the model ID, its name, its source, its format and its hash, which would help them match models that have been converted or renamed. Finally, maybe the image generation metadata display could be modified to show the model name as well as its ID?
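The collision risk of a truncated ID can be checked with a few lines of arithmetic. The sketch below assumes the hash output is uniformly distributed and uses the standard birthday approximation. The function names are illustrative.

```python
import math


def truncated_id(hex_digest: str, length: int = 12) -> str:
    """Keep only the first `length` hex characters of a digest."""
    return hex_digest[:length]


def pair_collision_probability(hex_chars: int) -> float:
    """Chance that two specific, different models get the same truncated ID."""
    return 1.0 / (16 ** hex_chars)


def any_collision_probability(n_models: int, hex_chars: int) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * 16**hex_chars))."""
    space = 16 ** hex_chars
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n_models * (n_models - 1) / (2 * space))
```

Twelve hex characters are 48 bits, so there are 16^12 = 2^48 (about 2.8 × 10^14) possible IDs. Even with a thousand installed models, the chance of any collision under these assumptions is on the order of 10^-9.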

lstein · 2024-02-27T18:31:23Z

The other thing we should discuss before merging this PR is how hashes are represented in the model config. The pydantic model is currently:

from typing import Optional

from pydantic import BaseModel, Field

class ModelConfigBase(BaseModel):
    name: str = Field(description="model name")
    key: str = Field(description="model key")
    original_hash: Optional[str] = Field(
        description="original fasthash of model contents", default=None
    )
    current_hash: Optional[str] = Field(
        description="current fasthash of model contents", default=None
    )
    # [irrelevant fields removed]

Civitai computes multiple hashes, and maybe we should allow for similar flexibility in the future. One approach is:

original_hashes: Dict[HashAlgorithm, str] = Field(description="dict of hash algorithms and their resulting hashes")

This would let us apply multiple hashes to support other model sources. Downside is that it would necessitate a database migration.

Another approach would simply be to adopt a convention of prefixing the hash with its algorithm name, as in blake3:abcd1234efff0. This wouldn't need a database migration and would give us the flexibility to change the hash algorithm in the future.
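A minimal sketch of the prefix convention, assuming a colon separator as in the blake3:abcd1234efff0 example. The helper names, the set of known algorithms, and the choice to return `None` as the algorithm for legacy unprefixed values are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Illustrative set; a real implementation would list whatever it supports.
KNOWN_ALGORITHMS = {"blake3", "sha1", "sha256", "md5"}


def format_hash(algorithm: str, digest: str) -> str:
    """Build a prefixed hash string such as 'blake3:abcd1234efff0'."""
    if algorithm not in KNOWN_ALGORITHMS:
        raise ValueError(f"unknown hash algorithm: {algorithm}")
    return f"{algorithm}:{digest}"


def parse_hash(value: str) -> Tuple[Optional[str], str]:
    """Split a stored hash into (algorithm, digest).

    Values without a prefix are treated as legacy hashes of unknown
    algorithm and returned as (None, value).
    """
    algorithm, sep, digest = value.partition(":")
    if not sep:
        return None, value
    if algorithm not in KNOWN_ALGORITHMS:
        raise ValueError(f"unknown hash algorithm: {algorithm}")
    return algorithm, digest
```

Because the value stays a single string, the existing `original_hash` and `current_hash` columns could hold it unchanged, which is why this option avoids a migration.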

- When installing, model keys are now calculated from the model contents.
- .safetensors, .ckpt and other single file models are hashed with sha1.
- The contents of diffusers directories are hashed using imohash (faster).
- Fix up the yaml->sql db migration script to assign deterministic keys.
- This commit also detects and assigns the correct image encoder for IP adapter models.

- Use memory view for hashlib algorithms (closer to Python 3.11's `hashlib.file_digest` API)
- Remove `sha1_fast` (realized it doesn't even hash the whole file, it just does the first block)
- Add support for custom file filters
- Update docstrings
- Update tests

psychedelicious · 2024-03-03T03:26:38Z

I've been moving back and forth between this PR and #5846, in which the key will be a UUID. I've retained the improvements to the model hash class in this PR, but changed the key to a UUID.

My recent commits thus change this PR from "Make model key assignment deterministic" to "Improved model hashing".

lstein requested review from blessedcoolant, GreggHelt2, brandonrising, RyanJDick, hipsterusername and psychedelicious as code owners February 24, 2024 15:30

github-actions bot added the python, backend, services and PythonTests labels Feb 24, 2024

lstein force-pushed the feat/deterministic-model-keys branch from 2f9f698 to 225a41d Compare February 24, 2024 16:23

hipsterusername approved these changes Feb 24, 2024

psychedelicious requested changes Feb 26, 2024

github-actions bot added Root PythonDeps labels Feb 27, 2024

psychedelicious force-pushed the feat/deterministic-model-keys branch from 4801979 to ee78412 Compare February 28, 2024 07:50

Lincoln Stein and others added 7 commits March 3, 2024 11:03

feat(mm): use blake3 for hashing

f4f8521

feat(mm): make hash.py a script for testing

de971a1

feat(mm): add hashing algos to ModelHash

a1170e4

- Some algos are slow, so it is now just called ModelHash - Added all hashlib algos, plus BLAKE3 and the fast (but incorrect) SHA1 algo

feat(mm): modularize ModelHash to facilitate testing

08e9b27

feat(mm): make ModelHash instantiatable, taking an algorithm as arg

6823a66

tests(mm): add tests for ModelHash

24ee5f5

psychedelicious changed the base branch from next to main March 3, 2024 00:05

psychedelicious requested review from maryhipp and ebr as code owners March 3, 2024 00:05

psychedelicious force-pushed the feat/deterministic-model-keys branch from ee78412 to 48ca841 Compare March 3, 2024 03:15

github-actions bot added the python-tests and python-deps labels Mar 3, 2024

psychedelicious added 3 commits March 3, 2024 14:22

fix(mm): use UUIDv4 for key

69bf276

This changes the functionality of this PR to only use the updated hashing for model hashes with a UUID for the key.

tests(mm): update tests to reflect using UUID for key

9f0bf65

psychedelicious force-pushed the feat/deterministic-model-keys branch from 48ca841 to 9f0bf65 Compare March 3, 2024 03:23

psychedelicious enabled auto-merge (rebase) March 3, 2024 03:24

psychedelicious self-requested a review March 3, 2024 03:28

psychedelicious approved these changes Mar 3, 2024

psychedelicious removed PythonDeps labels Mar 3, 2024

psychedelicious merged commit 2f372d9 into main Mar 3, 2024
14 checks passed

psychedelicious deleted the feat/deterministic-model-keys branch March 3, 2024 03:32
