
[BUG] ViT-L-14 downloaded in the demo and the one on the Hugging Face model card don't seem to be the same #291

Open
rphlstck opened this issue Feb 26, 2024 · 0 comments
Labels: bug (Something isn't working)

Comments

rphlstck commented Feb 26, 2024

Hey @anas-awadalla and co,

thank you all for your amazing work!
I have a question regarding the image encoder that is used. When I initialize OpenFlamingo with the demo code provided in the README.md:

from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
    cache_dir="PATH/TO/CACHE/DIR"  # Defaults to ~/.cache
)

The ViT seems to be downloaded from here (.../open_clip/pretrained.py):

...
_VITL14 = dict(
    openai=_pcfg(
        "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt"),
    ...
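
To double-check which URL open_clip resolves for this combination, something like the following should work (a sketch; get_pretrained_url lives in open_clip.pretrained in my version, but the helper may have moved or been renamed in others):

from open_clip.pretrained import get_pretrained_url

# With open-clip-torch==2.24.0 this should print the
# openaipublic.azureedge.net URL quoted above.
print(get_pretrained_url("ViT-L-14", "openai"))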

On the Hugging Face model card of OpenFlamingo, this OpenAI ViT model card is linked.
Loading both models and comparing them with the following function returns False:

def models_equal(m1, m2):
    # Element-wise comparison of the parameters in iteration order;
    # returns False as soon as any parameter differs.
    for p1, p2 in zip(m1.parameters(), m2.parameters()):
        p1 = p1.detach().cpu()
        p2 = p2.detach().cpu()
        if p1.data.ne(p2.data).sum() > 0:
            return False
    return True

Visual inspection of the models' state dicts also strengthens the impression that the two models are not the same:

m1.state_dict()
>>> OrderedDict([('positional_embedding', tensor([[ 0.0016,  0...-0.0310]])), ('text_projection', tensor([[-0.0109,  0... 0.0007]])), ('logit_scale', tensor(4.6052)), ('visual.class_embedding', tensor([ 0.0138,  0.... -0.2366])), ('visual.positional_embedding', tensor([[ 0.0019,  0...-0.0360]])), ('visual.proj', tensor([[ 0.0224, -0... 0.0262]])), ('visual.conv1.weight', tensor([[[[ 2.5284e-...8e-02]]]])), ('visual.ln_pre.weight', tensor([0.3311, 0.00..., 0.0039])), ('visual.ln_pre.bias', tensor([-0.0045, -0.... -0.0132])), ('visual.transformer.r...n_1.weight', tensor([6.1186e-04, ...4703e-04])), ('visual.transformer.r....ln_1.bias', tensor([ 1.3605e-04,...4003e-05])), ('visual.transformer.r...roj_weight', tensor([[-7.0632e-05...844e-05]])), ('visual.transformer.r..._proj_bias', tensor([ 1.5674, -1.... -0.0043])), ('visual.transformer.r...roj.weight', tensor([[-6.7596e-03...667e-03]])), ...])
m2.state_dict()
>>> OrderedDict([('positional_embedding', tensor([[-7.6808e-03...221e-04]])), ('text_projection', tensor([[-0.0296, -0... 0.0495]])), ('logit_scale', tensor(2.6593)), ('visual.class_embedding', tensor([ 0.0050, -0....  0.0151])), ('visual.positional_embedding', tensor([[ 0.0064, -0...-0.0201]])), ('visual.proj', tensor([[-0.0386,  0... 0.0059]])), ('visual.conv1.weight', tensor([[[[-2.4814e-...6e-02]]]])), ('visual.ln_pre.weight', tensor([1., 1., 1., ..., 1., 1.])), ('visual.ln_pre.bias', tensor([0., 0., 0., ..., 0., 0.])), ('visual.transformer.r...n_1.weight', tensor([1., 1., 1., ..., 1., 1.])), ('visual.transformer.r....ln_1.bias', tensor([0., 0., 0., ..., 0., 0.])), ('visual.transformer.r...roj_weight', tensor([[-0.0264,  0... 0.0205]])), ('visual.transformer.r..._proj_bias', tensor([0., 0., 0., ..., 0., 0.])), ('visual.transformer.r...roj.weight', tensor([[-0.0115,  0... 0.0120]])), ...])

(Maybe the models are the same and I am just not knowledgeable enough to see it.)

Now I would be interested to know which ViT-L-14 was actually used during the training of OpenFlamingo.

Expected Behavior

Both models m1 and m2 should be the same model.

Current Behavior

m1 and m2 seem to be different.

Steps to Reproduce

import open_clip

def models_equal(m1, m2):
    for p1, p2 in zip(m1.parameters(), m2.parameters()):
        p1 = p1.detach().cpu()
        p2 = p2.detach().cpu()
        if p1.data.ne(p2.data).sum() > 0:
            return False
    return True


def main():
    m1, _, _ = open_clip.create_model_and_transforms(
        "ViT-L-14",
        pretrained="openai"
    )

    m2, _, _ = open_clip.create_model_and_transforms(
        "ViT-L-14",
        # the OpenAI CLIP model linked above, downloaded from the Hugging Face Hub
        cache_dir="/PATH/TO/CACHEDIR/.cache/huggingface/hub/models--openai--clip-vit-large-patch14/snapshots/32bd64288804d66eefd0ccbe215aa642df71cc41"
    )

    models = [m1, m2]
    models_names = ["m1", "m2"]

    for i, m1 in enumerate(models):
        for j, m2 in enumerate(models):
            if i < j:
                print(f"Comparing {models_names[i]} and {models_names[j]}")
                print(f"Models are equal: {models_equal(m1, m2)}")
    # print(m1.state_dict())
    # print(m2.state_dict())


if __name__ == '__main__':
    main()
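
As an additional sketch on top of the repro (not part of the original run), a key-wise comparison of the state dicts rules out parameter-ordering effects; torch.equal is used per tensor:

import torch

def state_dicts_equal(m1, m2):
    # Compare tensors by parameter name instead of relying on the
    # iteration order of .parameters(); returns the mismatching keys.
    sd1, sd2 = m1.state_dict(), m2.state_dict()
    if sd1.keys() != sd2.keys():
        return False, sorted(set(sd1) ^ set(sd2))
    mismatched = [k for k in sd1 if not torch.equal(sd1[k].cpu(), sd2[k].cpu())]
    return len(mismatched) == 0, mismatched

If the two checkpoints were really identical, this should return (True, []).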

Environment

OS: Ubuntu 22.04.3 LTS
Python: 3.11.8
open-clip-torch==2.24.0
torch==2.2.0
torchvision==0.17.0

Edit

Using both ViTs to encode the same image results in different embeddings as well.
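
The check was roughly the following (a sketch rather than the exact script; test.jpg stands in for the image I used, and the preprocessing transforms come from create_model_and_transforms):

import torch
from PIL import Image
import open_clip

# m1: open_clip's "openai" checkpoint; m2: the locally cached HF snapshot
# (same cache_dir as in the repro script above).
m1, _, preprocess1 = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
m2, _, preprocess2 = open_clip.create_model_and_transforms(
    "ViT-L-14",
    cache_dir="/PATH/TO/CACHEDIR/.cache/huggingface/hub/models--openai--clip-vit-large-patch14/snapshots/32bd64288804d66eefd0ccbe215aa642df71cc41",
)

image = Image.open("test.jpg")  # placeholder image path
with torch.no_grad():
    e1 = m1.encode_image(preprocess1(image).unsqueeze(0))
    e2 = m2.encode_image(preprocess2(image).unsqueeze(0))

# Prints False here: the two encoders produce different embeddings for the same image.
print(torch.allclose(e1, e2, atol=1e-5))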

rphlstck added the bug label Feb 26, 2024