
[BUG] ViT-L-14 downloaded in the demo and the one on the Hugging Face model card don't seem to be the same #291

Open
rphlstck opened this issue Feb 26, 2024 · 0 comments
Labels: bug (Something isn't working)

Comments

rphlstck commented Feb 26, 2024

Hey @anas-awadalla and co,

thank you all for your amazing work!
I have a question regarding the image encoder that is used. When I initialize OpenFlamingo with the demo code provided in the README.md:

from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
    cache_dir="PATH/TO/CACHE/DIR"  # Defaults to ~/.cache
)

The ViT seems to be downloaded from here (.../open_clip/pretrained.py):

...
_VITL14 = dict(
    openai=_pcfg(
        "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt"),
    ...
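
To double-check which URL open_clip resolves for this combination, something like the following should work (a sketch; get_pretrained_url lives in open_clip.pretrained in my version, but the helper may have moved or been renamed in others):

from open_clip.pretrained import get_pretrained_url

# With open-clip-torch==2.24.0 this should print the
# openaipublic.azureedge.net URL quoted above.
print(get_pretrained_url("ViT-L-14", "openai"))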

On the Hugging Face model card of OpenFlamingo, this OpenAI ViT model card is linked.
Loading both models and comparing them with the following function returns False:

def models_equal(m1, m2):
    # Element-wise comparison of the parameters in iteration order;
    # returns False as soon as any parameter differs.
    for p1, p2 in zip(m1.parameters(), m2.parameters()):
        p1 = p1.detach().cpu()
        p2 = p2.detach().cpu()
        if p1.data.ne(p2.data).sum() > 0:
            return False
    return True

Visual inspection of the models' state dicts also strengthens the impression that the two models are not the same:

m1.state_dict()
>>> OrderedDict([('positional_embedding', tensor([[ 0.0016,  0...-0.0310]])), ('text_projection', tensor([[-0.0109,  0... 0.0007]])), ('logit_scale', tensor(4.6052)), ('visual.class_embedding', tensor([ 0.0138,  0.... -0.2366])), ('visual.positional_embedding', tensor([[ 0.0019,  0...-0.0360]])), ('visual.proj', tensor([[ 0.0224, -0... 0.0262]])), ('visual.conv1.weight', tensor([[[[ 2.5284e-...8e-02]]]])), ('visual.ln_pre.weight', tensor([0.3311, 0.00..., 0.0039])), ('visual.ln_pre.bias', tensor([-0.0045, -0.... -0.0132])), ('visual.transformer.r...n_1.weight', tensor([6.1186e-04, ...4703e-04])), ('visual.transformer.r....ln_1.bias', tensor([ 1.3605e-04,...4003e-05])), ('visual.transformer.r...roj_weight', tensor([[-7.0632e-05...844e-05]])), ('visual.transformer.r..._proj_bias', tensor([ 1.5674, -1.... -0.0043])), ('visual.transformer.r...roj.weight', tensor([[-6.7596e-03...667e-03]])), ...])
m2.state_dict()
>>> OrderedDict([('positional_embedding', tensor([[-7.6808e-03...221e-04]])), ('text_projection', tensor([[-0.0296, -0... 0.0495]])), ('logit_scale', tensor(2.6593)), ('visual.class_embedding', tensor([ 0.0050, -0....  0.0151])), ('visual.positional_embedding', tensor([[ 0.0064, -0...-0.0201]])), ('visual.proj', tensor([[-0.0386,  0... 0.0059]])), ('visual.conv1.weight', tensor([[[[-2.4814e-...6e-02]]]])), ('visual.ln_pre.weight', tensor([1., 1., 1., ..., 1., 1.])), ('visual.ln_pre.bias', tensor([0., 0., 0., ..., 0., 0.])), ('visual.transformer.r...n_1.weight', tensor([1., 1., 1., ..., 1., 1.])), ('visual.transformer.r....ln_1.bias', tensor([0., 0., 0., ..., 0., 0.])), ('visual.transformer.r...roj_weight', tensor([[-0.0264,  0... 0.0205]])), ('visual.transformer.r..._proj_bias', tensor([0., 0., 0., ..., 0., 0.])), ('visual.transformer.r...roj.weight', tensor([[-0.0115,  0... 0.0120]])), ...])

(Maybe the models are the same and I am just not knowledgeable enough to see it.)

Now I would be interested to know which ViT-L-14 was actually used during the training of OpenFlamingo.

Expected Behavior

Both models m1 and m2 should be the same model.

Current Behavior

m1 and m2 seem to be different.

Steps to Reproduce

import open_clip

def models_equal(m1, m2):
    for p1, p2 in zip(m1.parameters(), m2.parameters()):
        p1 = p1.detach().cpu()
        p2 = p2.detach().cpu()
        if p1.data.ne(p2.data).sum() > 0:
            return False
    return True


def main():
    m1, _, _ = open_clip.create_model_and_transforms(
        "ViT-L-14",
        pretrained="openai"
    )

    m2, _, _ = open_clip.create_model_and_transforms(
        "ViT-L-14",
        # the OpenAI CLIP model linked above, downloaded from the Hugging Face Hub
        cache_dir="/PATH/TO/CACHEDIR/.cache/huggingface/hub/models--openai--clip-vit-large-patch14/snapshots/32bd64288804d66eefd0ccbe215aa642df71cc41"
    )

    models = [m1, m2]
    models_names = ["m1", "m2"]

    for i, m1 in enumerate(models):
        for j, m2 in enumerate(models):
            if i < j:
                print(f"Comparing {models_names[i]} and {models_names[j]}")
                print(f"Models are equal: {models_equal(m1, m2)}")
    # print(m1.state_dict())
    # print(m2.state_dict())


if __name__ == '__main__':
    main()
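
As an additional sketch on top of the repro (not part of the original run), a key-wise comparison of the state dicts rules out parameter-ordering effects; torch.equal is used per tensor:

import torch

def state_dicts_equal(m1, m2):
    # Compare tensors by parameter name instead of relying on the
    # iteration order of .parameters(); returns the mismatching keys.
    sd1, sd2 = m1.state_dict(), m2.state_dict()
    if sd1.keys() != sd2.keys():
        return False, sorted(set(sd1) ^ set(sd2))
    mismatched = [k for k in sd1 if not torch.equal(sd1[k].cpu(), sd2[k].cpu())]
    return len(mismatched) == 0, mismatched

If the two checkpoints were really identical, this should return (True, []).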

Environment

OS: Ubuntu 22.04.3 LTS
Python: 3.11.8
open-clip-torch==2.24.0
torch==2.2.0
torchvision==0.17.0

Edit

Using both ViTs to encode the same image results in different embeddings as well.
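
The check was roughly the following (a sketch rather than the exact script; test.jpg stands in for the image I used, and the preprocessing transforms come from create_model_and_transforms):

import torch
from PIL import Image
import open_clip

# m1: open_clip's "openai" checkpoint; m2: the locally cached HF snapshot
# (same cache_dir as in the repro script above).
m1, _, preprocess1 = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
m2, _, preprocess2 = open_clip.create_model_and_transforms(
    "ViT-L-14",
    cache_dir="/PATH/TO/CACHEDIR/.cache/huggingface/hub/models--openai--clip-vit-large-patch14/snapshots/32bd64288804d66eefd0ccbe215aa642df71cc41",
)

image = Image.open("test.jpg")  # placeholder image path
with torch.no_grad():
    e1 = m1.encode_image(preprocess1(image).unsqueeze(0))
    e2 = m2.encode_image(preprocess2(image).unsqueeze(0))

# Prints False here: the two encoders produce different embeddings for the same image.
print(torch.allclose(e1, e2, atol=1e-5))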

rphlstck added the bug label Feb 26, 2024