Multi-Image or Multi-Video Inference Example #97

Open
chancharikmitra opened this issue Jul 20, 2024 · 2 comments

Comments

chancharikmitra commented Jul 20, 2024

Hello, and thanks for such a great contribution to the field of interleaved LMMs! This is really great work. Is there an example of the input format for multi-image or multi-video inference (similar to what is shown in the in-context learning examples)? Does it involve inserting multiple <image> tokens at the desired locations, with the images and videos then supplied in the same order?

From my understanding of the run_vila.py script, the way to construct an ICL input for images (and the corresponding structure for videos, of course) would be as follows:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b \
    --conv-mode llama_3 \
    --query "<image>\n ICL text 1 <image>\n ICL text 2 <image>\n" \
    --image-file "img1.png,img2.png,img3.png"

However, I am not sure whether the positions of the <image> tokens are actually respected by the model during generation: looking at llava_llama.py, the method for preparing the multimodal inputs is inherited from LLaVA, which I believe simply concatenates the image features rather than embedding them at the locations of the <image> tokens.

I may have missed something as I am still new to the codebase and exploring the model more deeply. Would appreciate any clarification on the point about multi-image and multi-video inputs. Thanks!
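For what it's worth, here is the minimal check I have been using to see where the <image> placeholders end up after tokenization. This is only a sketch: it assumes the LLaVA-style helpers that VILA appears to inherit (llava.constants, llava.mm_utils) and an already-loaded tokenizer, so the exact module paths may differ in this repo.

# Sketch only: assumes LLaVA-style helpers and a loaded `tokenizer`; not verified against this repo.
from llava.constants import IMAGE_TOKEN_INDEX      # placeholder id for <image> (-200 in LLaVA)
from llava.mm_utils import tokenizer_image_token

query = "<image>\n ICL text 1 <image>\n ICL text 2 <image>\n"
input_ids = tokenizer_image_token(query, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
# Every position holding IMAGE_TOKEN_INDEX is where a block of image features would be spliced in.
print((input_ids == IMAGE_TOKEN_INDEX).nonzero())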

Edit: After looking more deeply, it seems to me that the way I have formatted the prompt (with '\n' included) aligns with your code. However, I see in your paper that the image tokens are enumerated:
[screenshot from the paper showing the prompt format with enumerated image tokens]

Edit 2:

  1. As a side note, I do get this warning a lot.

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:128001 for open-end generation.

I think the pad token is fine, since it is automatically set to the eos token, but what about the attention mask? I see no mention of it when I evaluate on datasets like SEEDBench, and I do seem to get uncharacteristically low accuracy on these benchmarks; I am trying to find out why. (A sketch of how I would pass the mask explicitly follows this list.)

  2. I also noticed that the run_vila.py script does not list 'llama_3' as a conv_mode option. Is it possible that VILA-1.5 used a different conv_mode?
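
For the attention-mask warning, this is the minimal sketch I would use to pass the mask and pad token explicitly. It assumes a standard Hugging Face tokenizer/model pair; tokenizer, model, and prompt are placeholder names, not the actual variables in run_vila.py.

# Sketch only: standard Hugging Face generate() call with an explicit mask and pad token.
# `tokenizer`, `model`, and `prompt` are placeholders, not names taken from run_vila.py.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,   # silences the "attention mask was not set" warning
    pad_token_id=tokenizer.eos_token_id,    # makes the eos-as-pad fallback explicit
    max_new_tokens=256,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

With a single unpadded sequence the mask is all ones, so this mostly just suppresses the warning rather than changing the generated tokens.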

DtYXs commented Jul 26, 2024

Hello, I think that in the VILA code the image features are embedded specifically at the locations of the <image> tokens:

# Interleave text and image embeddings: text chunk i, then the features of image i.
for i in range(num_images + 1):
    cur_new_input_embeds.append(cur_input_embeds_no_im[i])
    cur_new_labels.append(cur_labels_noim[i])
    if i < num_images:
        cur_image_features = image_features[cur_image_idx]
        cur_image_idx += 1
        cur_new_input_embeds.append(cur_image_features)
        # Image positions are masked out of the loss with IGNORE_INDEX labels.
        cur_new_labels.append(
            torch.full(
                (cur_image_features.shape[0],),
                IGNORE_INDEX,
                device=cur_labels.device,
                dtype=cur_labels.dtype,
            )
        )
cur_new_input_embeds = torch.cat(cur_new_input_embeds)
cur_new_labels = torch.cat(cur_new_labels)
new_input_embeds.append(cur_new_input_embeds)
new_labels.append(cur_new_labels)

chancharikmitra (Author) commented

Thank you @DtYXs for the clarification about the <image> token placement! Given that, do you have any insight into why zero-shot performance of VILA-1.5-8b might be lower than what is reported? The few-shot improvements are as strong as advertised. Perhaps it is related to my concerns about the attention mask and the conv_mode formatting; however, looking more closely at the eval scripts, I see that conv_mode is passed directly, so 'llama_3' would indeed have been used.
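
For what it's worth, the quick check I have been using to see which conversation templates are registered (this assumes VILA keeps LLaVA's llava.conversation module and that a 'llama_3' template exists there; both are assumptions on my part):

# Sketch only: assumes a LLaVA-style llava.conversation module with a conv_templates dict.
from llava.conversation import conv_templates

print(sorted(conv_templates.keys()))        # is 'llama_3' registered at all?
conv = conv_templates["llama_3"].copy()     # hypothetical key; only works if it is in the dict
conv.append_message(conv.roles[0], "<image>\n What is shown in this image?")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())                    # the exact prompt string the model would receive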
