python inference demo #57
Do I have to use a browser to run a demo when serving a large model locally? Is there a Python demo that feeds images and language directly to the model?
Hi, thank you for your interest in our work. This is a great suggestion! We have added an example script for CLI inference (a single-turn Q&A session). An interactive CLI interface is WIP. Please see the instructions here: https://github.com/haotian-liu/LLaVA#cli-inference.
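For reference, the core of that single-turn flow looks roughly like the sketch below. This is a condensed, hedged approximation, not the repo's script verbatim: the token constant, `image_token_len`, and the `images` kwarg on `generate` are assumptions based on the LLaVA codebase at the time and may have changed since.

```python
# Condensed sketch of single-turn LLaVA inference (not verbatim from the repo).
# Assumes the LLaVA model's custom `generate` accepts an `images` kwarg and that
# DEFAULT_IMAGE_PATCH_TOKEN / image_token_len match the loaded checkpoint.
import torch
from PIL import Image

DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"

def answer(model, tokenizer, image_processor, image_path, query, image_token_len):
    image = Image.open(image_path).convert("RGB")
    image_tensor = image_processor.preprocess(image, return_tensors="pt")["pixel_values"]
    # Append one placeholder token per visual patch to the text prompt.
    prompt = query + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor.half().cuda(),
            do_sample=True,
            temperature=0.2,
            max_new_tokens=256,
        )
    # Strip the prompt tokens and decode only the newly generated answer.
    return tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
```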
When I run this shell command, I get an error:
Why can't the model downloaded directly from here be used as-is? https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0
Cool, thank you very much @haotian-liu! Do you have plans to provide a CLI that allows feeding multiple images and text prompts turn by turn anytime soon? This would be super useful for applying your model to new downstream tasks.
Yes, I agree with @MaxFBurg. Are there any such implementation plans?
Yes, that's a great suggestion, and as mentioned in my previous reply, interactive CLI support is planned. We are planning to upgrade to Vicuna v1.1 soon, as it has better support for this. Stay tuned! And if you are interested in contributing, please let me know!
We are not allowed to share the full model weights due to the LLaMA license; please see here for weight conversion.
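For anyone unfamiliar with the delta-weight step: the conversion amounts to adding the released delta on top of the original LLaMA weights. The snippet below is a minimal sketch of that idea, not the repo's actual converter. The paths and `AutoModelForCausalLM` usage are assumptions, and the official script also handles details this sketch ignores, such as the enlarged vocabulary for the extra image tokens.

```python
# Minimal sketch of delta-weight merging: full = base (LLaMA) + delta.
# Paths and AutoModel usage are assumptions; the official script also
# resizes embeddings for the extra image tokens, which is skipped here.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("/path/to/llama-13b", torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained("liuhaotian/LLaVA-13b-delta-v0", torch_dtype=torch.float16)

base_state = base.state_dict()
for name, param in delta.state_dict().items():
    if name in base_state and param.shape == base_state[name].shape:
        param.data += base_state[name]  # add base into delta in place -> usable weights

delta.save_pretrained("/path/to/llava-13b-v0")
```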
Thanks for your response @haotian-liu. To extend the script to two images, I replaced

```python
qs = args.query
if mm_use_im_start_end:
    qs = qs + '\n' + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
else:
    qs = qs + '\n' + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
```

with

```python
qs = args.query
if mm_use_im_start_end:
    qs = (qs + "\n" + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
             + "\n" + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN)
else:
    qs = (qs + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
             + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len)
```

I think this is a naive extension of the current single-image, single-turn inference procedure to a single-turn procedure that takes two images as input in the prompt. Do you think something as straightforward as this will work out of the box?

In practice, however, it doesn't seem to work well for multi-image comparisons; a few examples follow. For all the examples below, I used the following prompt with the modified code above: *[prompt, example images, and generations omitted]*

As you can see, the generations completely ignore the first image and give a detailed description of the second image. The model does, however, understand that there are two images in the third case, given that its description contains "In the second image". For comparison, these are the model's responses when prompting with the same images and prompt on the web demo: *[responses omitted]*

This response is more coherent and describes the difference between the two images fairly reasonably. I am wondering whether this is an inherent limitation of the single-turn, multi-image prompting style I've used, since it could be out-of-distribution for the model (your visual instruction tuning dataset contains only a single image per sample). Do you have any suggestions for a better evaluation strategy for this multi-image comparison, through either single-turn or multi-turn prompting?
Did you also change the image-loading part?
Yes, this is the code I updated. I replaced

```python
image = load_image(args.image_file)
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
```

with

```python
image_tensor = torch.stack(
    [
        image_processor.preprocess(load_image(image_file), return_tensors="pt")["pixel_values"][0]
        for image_file in args.image_file.split(",")
    ]
)
input_ids = torch.as_tensor(inputs.input_ids).cuda()
```

I just pass the input image files as a comma-separated list. Please let me know whether there is an issue with this implementation.
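One caveat worth flagging here (it is relevant to the dimensionality errors reported later in this thread): `torch.stack` adds a leading dimension, so whether this works depends on how the model's forward pass consumes the `images` argument. A hedged sketch of the two plausible calling conventions follows; the argument names are assumptions carried over from the snippet above, and which variant a given version of the code accepts is not verified here.

```python
# Variant 1: stacked tensor of shape (num_images, C, H, W). Works only if the
# vision path iterates over this leading dimension rather than treating it as batch.
output_ids = model.generate(input_ids, images=image_tensor.half().cuda(), max_new_tokens=256)

# Variant 2: a plain Python list of (C, H, W) tensors. Some versions of the
# multimodal preprocessing branch on list inputs and encode each image separately.
images_list = [t.half().cuda() for t in image_tensor]
output_ids = model.generate(input_ids, images=images_list, max_new_tokens=256)
```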
@penghe2021 Due to the current way of training, we do not observe the model having very good capability for referring to or comparing multiple images. We are working on improving this aspect as well, stay tuned!
Thanks @haotian-liu. So I assume the above implementation for single-turn multi-image inference is correct, but it's an out-of-distribution problem due to the current training setup of the model (it sees only one image per sample during visual instruction tuning). However, I still see the model performing well on multiple images in the multi-turn setup, so I'm looking forward to your demo implementation of that. Do you have a plan for when it can be released?
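In the meantime, a rough sketch of how a turn-by-turn loop could be built on the repo's conversation templates is shown below. `conv_templates`, `append_message`, and `get_prompt` exist in `llava.conversation`, but the template key and the `generate_reply` helper are assumptions, so treat this as a sketch rather than the project's implementation.

```python
# Hedged sketch of a multi-turn chat loop on top of llava.conversation.
# generate_reply is a hypothetical helper wrapping tokenization + model.generate.
from llava.conversation import conv_templates

conv = conv_templates["v1"].copy()  # template key is an assumption

while True:
    user_msg = input("USER: ")
    if user_msg.strip().lower() in {"exit", "quit"}:
        break
    conv.append_message(conv.roles[0], user_msg)
    conv.append_message(conv.roles[1], None)      # open slot for the model's reply
    prompt = conv.get_prompt()                    # serialize the conversation so far
    reply = generate_reply(prompt)                # hypothetical: run the model
    conv.messages[-1][-1] = reply                 # record the assistant turn
    print("ASSISTANT:", reply)
```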
It might be a superfluous or large request, but if the model could be integrated into a Hugging Face AutoModel or pipeline setup, I think it would be very accessible, especially for experimenting with different use cases.
Hi @Marcusntnu, thank you for your interest in our work, and thank you for the great suggestion. This is WIP, and our first step was to move the LLaVA model implementation into this repo, which has been completed. It should be implemented very soon, thanks and stay tuned!
@haotian-liu Have you considered releasing the multi-turn inference code?
@wjjlisa, do you mean multi-turn conversation in the CLI, as in our Gradio demo? This is planned for release by the end of this month. I was busy working on NeurIPS recently...
@vishaal27 I would like to know what the structure of the data input looks like. I am trying to do a similar thing.
@cyril-mino Sorry, I don't get your question. What do you mean by the structure of the data input? I just pass two images into the model as a list of tensors (with the updated code above), along with a prompt that asks it to compare the two images.
@vishaal27 Apologies, I thought you were fine-tuning.
Can I check whether the multi-turn framework will be added to the repo anytime soon? Thanks for the great work.
Hi Vishaal, by stacking the tensors we create a new input dimension for the model, which throws an exception. How did you overcome this issue?
Hi Adriel, as far as I recall (I must admit I haven't looked at this script in over a month) the ...
Hi Vishaal, thanks for sharing the code. Indeed the fork has changed quite a bit: it seems the mm-projector has been removed, and the pretrained model as well. I can confirm that the current fork with the modified image-tensor input does not work, due to a dimensionality error in one of the nn.Modules during the forward pass. Can I quickly check with you: did you use the llama-13b model or the facebook/opt model for your testing back then?
Great, thanks for letting me know. I will, however, need to get back to this script at some point and get it to work, so I can let you know if I figure something out for this use case. Please do let me know if you are able to as well :)
@adrielkuek @vishaal27
Interested in multiple-image input as well. We're wondering whether we could perform multimodal few-shot classification on the fly, without fine-tuning.
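As a thought experiment, such a few-shot prompt would presumably interleave (image, label) demonstrations before the query image, along the lines of the sketch below. The `<image>` placeholder, the file names, and the assumption that the model attends to several image slots are all hypothetical, and the latter is exactly what this thread suggests the released checkpoints handle poorly.

```python
# Speculative few-shot multimodal prompt: interleaved demonstrations, then a query.
# Whether the model actually uses all image slots is the open question above.
demos = [("examples/cat.jpg", "cat"), ("examples/dog.jpg", "dog")]  # hypothetical files

prompt_parts = []
for _path, label in demos:
    prompt_parts.append(f"<image>\nThis is a {label}.")
prompt_parts.append("<image>\nWhat is this? Answer with a single word.")
prompt = "\n".join(prompt_parts)

# Images would be supplied in the same order as the <image> slots.
image_paths = [path for path, _ in demos] + ["examples/query.jpg"]
```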
Hi everyone. I've been browsing the code base of LLaVA for a while and find it hard to locate the exact ...
Something like #432? Would appreciate any suggestions.
Hello, I am also interested in inputting more than one image for some experiments. I am trying to find the right template for this, considering that the base template is ...
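In case it helps, a hypothetical two-image variant of a v1-style template might look like the string below. The `<image>` placeholder and the surrounding system prompt are extrapolated from the single-image templates; nothing in this thread confirms the model was trained on such a layout.

```python
# Hypothetical two-image prompt, extrapolated from the single-image v1 template.
prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions. "
    "USER: <image>\n<image>\nWhat are the differences between these two images? ASSISTANT:"
)
```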
I am in great need of a multi-turn dialogue feature for batch inference with SGLang.
Hi cyril-mino, did you manage to get LLaVA fine-tuning with multiple images and text working? I wonder if there are extra steps beyond the code mentioned by @vishaal27. Thanks!
Is there any way we can embed other modalities, such as bounding boxes, class labels, ...?
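Not something the repo supports natively, but one common workaround (and the way LLaVA's own data-generation pipeline exposed boxes to GPT-4 when building the instruction data) is to serialize boxes as normalized coordinates in the text prompt. A small illustrative helper with made-up numbers:

```python
# Illustrative only: encode a bounding box as normalized text coordinates,
# mirroring how LLaVA's instruction data presented boxes to GPT-4 as text.
def bbox_to_text(label: str, box: tuple, width: int, height: int) -> str:
    x1, y1, x2, y2 = box
    return f"{label}: [{x1 / width:.3f}, {y1 / height:.3f}, {x2 / width:.3f}, {y2 / height:.3f}]"

prompt = (
    "In the image, " + bbox_to_text("dog", (50, 40, 200, 180), 640, 480)
    + ". What is the dog doing?"
)
```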