# Alibaba-NLP/gme-Qwen2-VL
* General Multimodal Embedding (GME) models from Alibaba [[2B]](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct), [[7B]](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct)
    * Built on the Qwen2-VL base model
    * Uses a separate `GmeQwen2VL` model class [[code]](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct/blob/main/gme_inference.py)

In [1]:
import os

from src.config import settings
from src.gme_inference import GmeQwen2VL




# Load Model
Set `attn_implementation="eager"` to prevent the following error:
```
attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```
* Related issue: https://github.com/hiyouga/LLaMA-Factory/issues/6838

In [None]:
model = GmeQwen2VL(
    model_path=os.path.join(settings.model_dir, "embedding/gme-Qwen2-VL-2B-Instruct"),
    device="mps",  # Apple Silicon GPU backend
    max_length=8192,
    attn_implementation="eager",  # avoids the scaled_dot_product_attention IndexError shown above
)


In [None]:
## Inference Example
texts = [
    "What kind of car is this?",
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023."
]
images = [
    'file://./resources/Tesla_Cybertruck_damaged_window.jpg',
    'file://./resources/2024_Tesla_Cybertruck_Foundation_Series,_front_left_(Greenwich).jpg',
    # 'https://en.wikipedia.org/wiki/File:Tesla_Cybertruck_damaged_window.jpg',
    # 'https://en.wikipedia.org/wiki/File:2024_Tesla_Cybertruck_Foundation_Series,_front_left_(Greenwich).jpg',
]

e_text = model.get_text_embeddings(texts=texts)
e_image = model.get_image_embeddings(images=images)
print((e_text * e_image).sum(-1))
# Expected value (from the Hugging Face model page): tensor([0.2281, 0.6001], dtype=torch.float16)
# Value observed on macOS: tensor([0.3860, 0.5542], dtype=torch.float16)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
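The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set at the top of the notebook before `tokenizers` (or anything that imports it) is loaded; choosing `"false"` here is an assumption that simply matches what the library falls back to after the fork:

```python
import os

# Set this before importing tokenizers/transformers so the library
# does not have to disable parallelism itself after a fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```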

tensor([0.3860, 0.5542], dtype=torch.float16)
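The expression `(e_text * e_image).sum(-1)` is a row-wise dot product: it scores text `i` against image `i` only, not every text against every image. If the embeddings are L2-normalized (an assumption here, not something the output above confirms), each score is also the cosine similarity of the pair. The sketch below illustrates this with made-up toy vectors; plain Python lists stand in for tensors so it runs without PyTorch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rowwise_dot(a, b):
    """Dot product of row i of `a` with row i of `b`, like (a * b).sum(-1)."""
    return [sum(x * y for x, y in zip(row_a, row_b)) for row_a, row_b in zip(a, b)]


def cosine(u, v):
    """Cosine similarity computed directly from unnormalized vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Toy vectors, invented for illustration (not real model embeddings).
text_vecs = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]]
image_vecs = [[2.0, 1.0, 2.0], [4.0, 3.0, 0.0]]

e_text_toy = [l2_normalize(v) for v in text_vecs]
e_image_toy = [l2_normalize(v) for v in image_vecs]

# One score per (text_i, image_i) pair.
scores = rowwise_dot(e_text_toy, e_image_toy)
print(scores)
```

To score every text against every image, the usual alternative is a matrix product such as `e_text @ e_image.T`, which yields a 2x2 matrix for the example above.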


# Prepare Data

In [8]:
from io import BytesIO

from PIL import Image, ImageFile
from torch.utils.data import DataLoader

# Allow PIL to load truncated image files instead of raising an error.
ImageFile.LOAD_TRUNCATED_IMAGES = True

from src.dataset import (
    load_webqa_data,
    WebQAQueryDataset,
    WebQATCandidateDataset,
    WebQATICandidateDataset
)

In [9]:
queries, candidates = load_webqa_data(
    os.path.join(settings.webqa_data_dir, "WebQA_test.json"),
    task="t2ti",
    text_template="{title} {fact}",
    image_text_template="{title} {caption}",
)

query_ds = WebQAQueryDataset(data=queries)
candidates_ds = WebQATICandidateDataset(
    data=candidates,
    lineidx_fpath=os.path.join(settings.webqa_data_dir, "imgs.lineidx"),
    images_fpath=os.path.join(settings.webqa_data_dir, "images/imgs.tsv"),
)
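The candidate dataset takes both a `.lineidx` file and a `.tsv` file. A plausible reading, stated here as an assumption about `src.dataset` rather than a description of it, is the usual WebQA layout: each TSV line holds an image id and a base64-encoded image separated by a tab, and the line index stores the byte offset of each line so a single image can be read without scanning the whole file. The helper names below are hypothetical, and the payloads are placeholder bytes rather than real images.

```python
import base64
import os
import tempfile


def load_offsets(lineidx_fpath):
    """Read one byte offset per line from the index file."""
    with open(lineidx_fpath) as fp:
        return [int(line) for line in fp if line.strip()]


def read_image_bytes(images_fpath, offsets, index):
    """Seek to the indexed line and decode its base64 payload."""
    with open(images_fpath, "rb") as fp:  # binary mode so offsets are exact bytes
        fp.seek(offsets[index])
        line = fp.readline().decode("ascii").rstrip("\n")
    image_id, image_b64 = line.split("\t")
    return image_id, base64.b64decode(image_b64)


# Build a tiny stand-in for imgs.tsv / imgs.lineidx with fake payloads.
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = os.path.join(tmp, "imgs.tsv")
    idx_path = os.path.join(tmp, "imgs.lineidx")
    rows = [("30000001", b"first-payload"), ("30000002", b"second-payload")]

    offsets = []
    with open(tsv_path, "wb") as fp:
        for image_id, payload in rows:
            offsets.append(fp.tell())  # byte offset where this line starts
            fp.write(image_id.encode("ascii") + b"\t" + base64.b64encode(payload) + b"\n")
    with open(idx_path, "w") as fp:
        fp.write("\n".join(str(o) for o in offsets) + "\n")

    demo_id, demo_bytes = read_image_bytes(tsv_path, load_offsets(idx_path), 1)

print(demo_id, demo_bytes)
```

With real data, the decoded bytes would typically be wrapped as `Image.open(BytesIO(image_bytes))`, which would explain the `PIL` and `BytesIO` imports in the cell above.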

In [10]:
query_dl = DataLoader(query_ds, batch_size=4)
for x in query_dl:
    print(x)
    break

{'text': ['"Are both the Original Playboy Mansion and Gage Park High School made of brick?"', '"Are there bears in the background of the painting "Greek Landscape"?"', '"Are there flowering trees in front of both the Georgia Tech Library and the Newman Library at Virginia Tech?"', '"Is the surface of the egg next to the handrail at the Big Egg Hunt  in Covent Garden London shiny or dull?"']}
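The batch printed above is a single dict whose `text` field holds four strings, because the `DataLoader` groups per-sample dicts field by field. The sketch below imitates that grouping in plain Python for string fields only; it is a simplified stand-in for illustration, not PyTorch's actual collate function, and the helper names and toy queries are made up.

```python
def collate_dicts(samples):
    """Group a list of per-sample dicts into one dict of lists, field by field."""
    keys = samples[0].keys()
    return {key: [sample[key] for sample in samples] for key in keys}


def iter_batches(samples, batch_size):
    """Yield collated batches; the last batch may be smaller than batch_size."""
    for start in range(0, len(samples), batch_size):
        yield collate_dicts(samples[start:start + batch_size])


# Toy samples shaped like the query dataset items: one "text" field each.
toy_queries = [{"text": f"query {i}"} for i in range(6)]
batches = list(iter_batches(toy_queries, batch_size=4))
print(batches[0])
```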
