
python inference demo #57

Open
sssssshf opened this issue Apr 25, 2023 · 36 comments
Labels: enhancement (New feature or request)

@sssssshf

When running the large model locally, do I have to use a browser for the demo?
Is there a Python demo that feeds images and text directly into the model?

haotian-liu self-assigned this Apr 25, 2023
haotian-liu added the enhancement (New feature or request) label Apr 25, 2023
@haotian-liu
Owner

Hi, thank you for your interest in our work.

This is a great suggestion! We have added an example script for CLI inference (single-turn Q-A session). An interactive CLI interface is WIP.

Please see the instructions here: https://github.com/haotian-liu/LLaVA#cli-inference.

@sssssshf
Author

When I run this command:

python -m llava.eval.run_llava \
    --model-name /LLaVA-13B-v0 \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --query "What are the things I should be cautious about when I visit here?"

I get this error:
HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '/LLaVA-13B-v0'.

@sssssshf
Author

Why can't the model downloaded directly from here be used as-is? https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0

@MaxFBurg

MaxFBurg commented Apr 26, 2023

Cool, thank you very much @haotian-liu! Do you have plans to provide a CLI that allows feeding multiple images and text prompts turn by turn anytime soon? It would be super cool to be able to use your model for new downstream tasks this way.

@vishaal27

Yes, I agree with @MaxFBurg -- are there any plans to implement this?

@haotian-liu
Owner

@MaxFBurg @vishaal27

Yes, that's a great suggestion, and as mentioned in my previous reply, interactive CLI support is planned. We are planning to upgrade to Vicuna v1.1 soon, as it has better support for this. Stay tuned! And if you are interested in contributing, please let me know!

@haotian-liu
Owner

Why can't the model downloaded directly from here be used as-is? https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0

We are not allowed to share the full model weights due to the LLaMA license; please see here for the weight conversion instructions.
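
For reference, the delta-weight conversion described in the repo's README at the time looked roughly like the sketch below; the local paths are placeholders and the exact module name and flags may have changed between releases, so treat this as an assumption rather than the definitive command:

python3 -m llava.model.apply_delta \
    --base /path/to/llama-13b \
    --target /path/to/output/LLaVA-13B-v0 \
    --delta liuhaotian/LLaVA-13b-delta-v0

After conversion, --model-name should point at the resulting local directory (or a valid Hugging Face repo id), which avoids the HFValidationError reported above.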

@vishaal27

Thanks for your response @haotian-liu
I tried replacing these lines in your eval script llava/eval/run_llava.py:

    qs = args.query
    if mm_use_im_start_end:
        qs = qs + '\n' + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
    else:
        qs = qs + '\n' + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len

with

    qs = args.query
    if mm_use_im_start_end:
        qs = (qs + "\n"
              + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
              + "\n"
              + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN)
    else:
        qs = (qs + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
              + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len)

I think this is a naive extension of the current single-image, single-turn inference procedure to a single-turn procedure that takes two images as input in the prompt. Do you think something this straightforward will work out of the box?
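
As an aside, the same idea generalized to an arbitrary number of comma-separated images might look roughly like the sketch below; this is an illustrative assumption built on the variable and constant names in the snippet above, not code from the repo:

    # Hypothetical generalization: append one image placeholder block per input image, in order.
    qs = args.query
    num_images = len(args.image_file.split(","))
    if mm_use_im_start_end:
        image_block = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
    else:
        image_block = DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
    for _ in range(num_images):
        qs = qs + "\n" + image_block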

However, the two-image approach doesn't seem to work well in practice for multi-image comparisons; a few examples follow:

For all the below examples, I used the following prompt with the modified code above: {<img_1> <img_2> <"Describe the change applied to the first image to get to the second image">}.

[Three screenshots of the generated outputs, 2023-04-29]

As you can see, the generations completely ignore the first image and give a detailed description of the second image. However, the model does understand that there are two images in the third case, given that its description contains "In the second image".

For comparison, this is the model's response when prompted with the same images and prompt on the web demo:

[Screenshot of the web demo response, 2023-04-29]

This response is more coherent, and describes the difference between the two images fairly reasonably.

I am wondering if this is an inherent limitation of the single-turn multi-image prompting style I've used above, since it could be out-of-distribution for the model (your visual instruction tuning dataset only contains a single image per sample). Do you have any suggestions for a better evaluation strategy for this multi-image comparison, either through single-turn or multi-turn prompting?

@penghe2021

penghe2021 commented Apr 30, 2023

[Quotes @vishaal27's comment above in full.]

Did you also change the image-loading part?

@vishaal27

Yes, this is the code I updated:

    image = load_image(args.image_file)
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]

with

    # Build a batch with one preprocessed tensor per comma-separated image path.
    image_tensor = torch.stack(
        [
            image_processor.preprocess(load_image(image_file), return_tensors="pt")["pixel_values"][0]
            for image_file in args.image_file.split(",")
        ]
    )

    input_ids = torch.as_tensor(inputs.input_ids).cuda()

I just pass in comma-separated image files. Please let me know whether there is an issue with this implementation.
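
For reference, the corresponding invocation with this modification would presumably look something like the sketch below; the model path and image file names are placeholders, and the comma-separated --image-file convention is the ad-hoc one described above, not an official flag:

python -m llava.eval.run_llava \
    --model-name /path/to/LLaVA-13B-v0 \
    --image-file "first_image.jpg,second_image.jpg" \
    --query "Describe the change applied to the first image to get to the second image"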

@haotian-liu
Owner

@penghe2021 Due to the current way of training, we do not observe the model having very good capability at referring to or comparing multiple images. We are working on improving this aspect as well, stay tuned!

@vishaal27

Thanks @haotian-liu, so I assume the above implementation for single-turn multi-image inference is correct, but it's an OOD problem due to the current training set-up of the model (it sees only one image per sample during visual instruction tuning). However, I still see the model performing well on multiple images in the multi-turn set-up, so I'm looking forward to your demo implementation of that. Do you have a plan for when it can be released?

@Marcusntnu

It might be a superfluous or large request, but if the model could be integrated into a Hugging Face AutoModel or pipeline setup, I think it would be much more accessible, especially for experimenting with different use cases.

@haotian-liu
Owner

Hi @Marcusntnu, thank you for your interest in our work, and thank you for the great suggestion. This is WIP, and our first step is to move the LLaVA model implementation into this repo, which has been completed. It should be implemented very soon; thanks, and stay tuned!

@wjjlisa

wjjlisa commented May 17, 2023

@haotian-liu Have you considered releasing the multi-turn inference code?

@haotian-liu
Owner

@wjjlisa, do you mean the multi-turn conversation in the CLI, as in our Gradio demo? This is planned for release by the end of this month. I've been busy with NeurIPS recently...

@SeungyounShin

SeungyounShin commented May 29, 2023

[Screenshot of prompt-tuning experiment outputs, 2023-05-29]

These are my experiments with prompt tuning. Not perfect, but pretty amazing.

It seems like the img1, img2, text ordering performs better.

@cyril-mino

@vishaal27 I would like to know what the structure of the data input looks like. I am trying to do a similar thing.

@vishaal27

@cyril-mino Sorry, I don't get your question -- what do you mean by the structure of the data input? I just pass two images to the model as a list of tensors (with the updated code above), along with a prompt that asks it to compare the two images.

@cyril-mino

@vishaal27 apologies, I thought you were finetuning.

@adrielkuek

[Quotes @SeungyounShin's prompt-tuning results above.]

Hi, is it possible to share the input query prompts for this output? Thanks!

@adrielkuek

@wjjlisa, do you mean the multi-turn conversation in CLI, as in our Gradio demo? This is planned for release by the end of this month. Was busy working on the NeurIPS recently...

Can I check whether the multi-turn framework will be added to the repo anytime soon? Thanks for the great work.

@adrielkuek

[Quotes @vishaal27's multi-image code comment above.]

Hi Vishaal, by stacking the tensors we create a new input dimension for the model, which throws an exception. How did you overcome this issue?

@vishaal27

Hi Adriel, as far as I recall (I must admit I haven't looked at this script in over a month), the model.generate function was able to take multiple input images as a concatenated tensor. For full clarity, here is the script I used; hope it helps (disclaimer: this script uses a fork of the repository that is quite old, so it is possible a few things have changed since then): https://github.com/MaxFBurg/LLaVA/blob/main/llava/eval/run_llava_two_images.py#L51

@adrielkuek

Hi Vishaal, thanks for sharing the code. Indeed, the fork has changed quite a bit: it seems the mm-projector is removed, and the pretrained model as well. I can confirm that the current fork with the modified image tensor input does not work, due to a dimensionality error in one of the nn.Modules during the forward pass. Can I do a quick check with you: did you use the llama-13b model or the facebook/opt model for your testing back then?

@vishaal27

Great, thanks for letting me know -- I will, however, need to get back to this script at some point and get it to work, so I can let you know if I figure something out for this use case. Please do let me know if you are able to as well :)
Re. your question -- we used the llama-13b model back then; I think the opt model was not available at that stage, if I recall correctly.

@codybum

codybum commented Jul 2, 2023

Hi Vishaal, thanks for sharing the code. Indeed the fork has changed quite a fair bit. Seems like the mm-projector is removed, and the pretrained model as well. I can confirm that the current fork with the modified image tensor input is unable to work due to the dimensionality error in one of the nn.modules during forward pass. Can I do a quick check with you, did you use the llama-13b model or the facebook/opt model for your testing back then?

@adrielkuek @vishaal27
We are also very interested in multi-image input. Our interest is less in comparison and more in using multiple images to represent the same thing, as described here: #197 (comment)

@HireTheHero
Contributor

Interested in multi-image input as well. We're wondering whether we could perform multimodal few-shot classification on the fly, without fine-tuning.
I will test Vishaal's solution and maybe create a PR when I have time.

@LumenYoung

Hi, everyone. I've been browsing the LLaVA code base for a while and find it hard to locate the exact generate() implementation for the LLaMA-based LLaVA. It would be helpful, since I want to find a way to run in multi-image mode. Any help would be appreciated!

@LumenYoung

[Quotes @SeungyounShin's prompt-tuning results above.]

Hi @SeungyounShin, would you mind sharing how you managed to embed both images into one query? It would be really helpful, as I am currently not able to find a way to do this.

@HireTheHero
Contributor

Something like #432 ? Would appreciate any suggestions.

@CreativeBuilds

[Quotes @SeungyounShin's prompt-tuning results above.]

Would you be able to upload this model to Hugging Face or share it some other way? I'm very interested in getting this to run with image comparison.

@aldoz-mila

Hello, I am also interested in inputting more than one image for some experiments. I am trying to find the right template for this, considering that the base template is USER: <image>\n<prompt>\nASSISTANT:.

  • (i) What would be the prompt template when feeding more than one image? Should I use something like USER: <image1><image2>\n<prompt>\nASSISTANT:?
  • (ii) How do you input the multiple images? Do you concatenate them along the first dimension? For example, assuming this code snippet (from the Hugging Face integration of LLaVA): output = pipe(image, prompt=text_prompt, generate_kwargs={"max_new_tokens": 200})
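
For what it's worth, one way to try two images through the transformers LLaVA integration (bypassing the pipeline wrapper) might look roughly like the sketch below; the llava-hf/llava-1.5-7b-hf checkpoint, the repeated <image> placeholders, and the image paths are assumptions for illustration, not something confirmed in this thread:

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # One <image> placeholder per image; the processor pairs them with the image list in order.
    prompt = ("USER: <image>\n<image>\n"
              "Describe the change applied to the first image to get to the second image. ASSISTANT:")
    images = [Image.open("first_image.jpg"), Image.open("second_image.jpg")]  # placeholder paths

    inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output_ids[0], skip_special_tokens=True))

Whether this yields meaningful comparisons is a separate question from whether it runs; as discussed above, the model was not instruction-tuned on multi-image samples.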

@fisher75

I am in great need of a multi-turn dialogue feature for batch inference with SGLang.

@Sprinter1999

@vishaal27 apologies, I thought you were finetuning.

Hi @cyril-mino, did you manage to get fine-tuning of LLaVA with multiple images & text working? I wonder if there are extra steps besides the code mentioned by @vishaal27. Thanks~

@yc-cui

yc-cui commented Apr 22, 2024

Is there any way we can embed other modalities, such as bounding boxes, class labels, etc.?
