
python inference demo #57

Open
sssssshf opened this issue Apr 25, 2023 · 36 comments
Labels: enhancement (New feature or request)

@sssssshf

When running the large model locally, do I have to use a browser for the demo?
Is there a Python demo that feeds images and text directly into the model?

haotian-liu self-assigned this Apr 25, 2023
haotian-liu added the enhancement (New feature or request) label Apr 25, 2023
@haotian-liu
Owner

Hi, thank you for your interest in our work.

This is a great suggestion! We have added an example script for CLI inference (single-turn Q-A session). An interactive CLI interface is WIP.

Please see the instructions here: https://github.com/haotian-liu/LLaVA#cli-inference.

@sssssshf
Author

When I run this command:

python -m llava.eval.run_llava \
    --model-name /LLaVA-13B-v0 \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --query "What are the things I should be cautious about when I visit here?"

I get this error:
HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '/LLaVA-13B-v0'.

@sssssshf
Author

Why can't the model downloaded directly from here be used as-is? https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0

@MaxFBurg

MaxFBurg commented Apr 26, 2023

Cool, thank you very much @haotian-liu! Do you have plans to provide a CLI that allows feeding multiple images and text prompts turn by turn anytime soon? It would be super cool to be able to use your model for new downstream tasks this way.

@vishaal27

Yes, I agree with @MaxFBurg -- are there any plans to implement this?

@haotian-liu
Owner

@MaxFBurg @vishaal27

Yes, that's a great suggestion, and as mentioned in my previous reply, interactive CLI support is planned. We are planning to upgrade to Vicuna v1.1 soon, as it has better support for this. Stay tuned! And if you are interested in contributing, please let me know!

@haotian-liu
Owner

Why can't the model downloaded directly from here be used as-is? https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0

We are not allowed to share the full model weights due to the LLaMA license; please see here for the weight conversion instructions.
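
For reference, the delta-weight conversion described in the repo's README at the time looked roughly like the sketch below; the local paths are placeholders and the exact module name and flags may have changed between releases, so treat this as an assumption rather than the definitive command:

python3 -m llava.model.apply_delta \
    --base /path/to/llama-13b \
    --target /path/to/output/LLaVA-13B-v0 \
    --delta liuhaotian/LLaVA-13b-delta-v0

After conversion, --model-name should point at the resulting local directory (or a valid Hugging Face repo id), which avoids the HFValidationError reported above.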

@vishaal27

Thanks for your response @haotian-liu
I tried replacing these lines in your eval script llava/eval/run_llava.py:

    qs = args.query
    if mm_use_im_start_end:
        qs = qs + '\n' + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
    else:
        qs = qs + '\n' + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len

with

    qs = args.query
    if mm_use_im_start_end:
        qs = (qs + "\n"
              + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
              + "\n"
              + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN)
    else:
        qs = (qs + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
              + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len)

I think this is a naive extension of the current single-image, single-turn inference procedure to a single-turn procedure that takes two images as input in the prompt. Do you think something this straightforward will work out of the box?
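
As an aside, the same idea generalized to an arbitrary number of comma-separated images might look roughly like the sketch below; this is an illustrative assumption built on the variable and constant names in the snippet above, not code from the repo:

    # Hypothetical generalization: append one image placeholder block per input image, in order.
    qs = args.query
    num_images = len(args.image_file.split(","))
    if mm_use_im_start_end:
        image_block = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
    else:
        image_block = DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
    for _ in range(num_images):
        qs = qs + "\n" + image_block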

However, the two-image approach doesn't seem to work well in practice for multi-image comparisons; a few examples follow:

For all the below examples, I used the following prompt with the modified code above: {<img_1> <img_2> <"Describe the change applied to the first image to get to the second image">}.

[Three screenshots of the generated outputs, 2023-04-29]

As you can see, the generations completely ignore the first image and give a detailed description of the second image. However, the model does understand that there are two images in the third case, given that its description contains "In the second image".

For comparison, this is the model's response when prompted with the same images and prompt on the web demo:

[Screenshot of the web demo response, 2023-04-29]

This response is more coherent, and describes the difference between the two images fairly reasonably.

I am wondering if this is an inherent limitation of the single-turn multi-image prompting style I've used above, since it could be out-of-distribution for the model (your visual instruction tuning dataset only contains a single image per sample). Do you have any suggestions for a better evaluation strategy for this multi-image comparison, either through single-turn or multi-turn prompting?

@penghe2021

penghe2021 commented Apr 30, 2023

[Quotes @vishaal27's comment above in full.]

Did you also change the image-loading part?

@vishaal27

Yes, this is the code I updated:

    image = load_image(args.image_file)
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]

with

    # Build a batch with one preprocessed tensor per comma-separated image path.
    image_tensor = torch.stack(
        [
            image_processor.preprocess(load_image(image_file), return_tensors="pt")["pixel_values"][0]
            for image_file in args.image_file.split(",")
        ]
    )

    input_ids = torch.as_tensor(inputs.input_ids).cuda()

I just pass in comma-separated image files. Please let me know whether there is an issue with this implementation.
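
For reference, the corresponding invocation with this modification would presumably look something like the sketch below; the model path and image file names are placeholders, and the comma-separated --image-file convention is the ad-hoc one described above, not an official flag:

python -m llava.eval.run_llava \
    --model-name /path/to/LLaVA-13B-v0 \
    --image-file "first_image.jpg,second_image.jpg" \
    --query "Describe the change applied to the first image to get to the second image"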

@haotian-liu
Owner

@penghe2021 Due to the current way of training, we do not observe the model having very good capability at referring to or comparing multiple images. We are working on improving this aspect as well, stay tuned!

@vishaal27

Thanks @haotian-liu, so I assume the above implementation for single-turn multi-image inference is correct, but it's an OOD problem due to the current training set-up of the model (it sees only one image per sample during visual instruction tuning). However, I still see the model performing well on multiple images in the multi-turn set-up, so I'm looking forward to your demo implementation of that. Do you have a plan for when it can be released?

@Marcusntnu

It might be a superfluous or large request, but if the model could be integrated into a Hugging Face AutoModel or pipeline setup, I think it would be much more accessible, especially for experimenting with different use cases.

@haotian-liu
Owner

Hi @Marcusntnu, thank you for your interest in our work, and thank you for the great suggestion. This is WIP, and our first step is to move the LLaVA model implementation into this repo, which has been completed. It should be implemented very soon; thanks, and stay tuned!

@wjjlisa

wjjlisa commented May 17, 2023

@haotian-liu Have you considered releasing the multi-turn inference code?

@haotian-liu
Owner

@wjjlisa, do you mean the multi-turn conversation in the CLI, as in our Gradio demo? This is planned for release by the end of this month. I've been busy with NeurIPS recently...

@SeungyounShin

SeungyounShin commented May 29, 2023

[Screenshot of prompt-tuning experiment outputs, 2023-05-29]

These are my experiments with prompt tuning. Not perfect, but pretty amazing.

It seems like the img1, img2, text ordering performs better.

@cyril-mino

@vishaal27 I would like to know what the structure of the data input looks like. I am trying to do a similar thing.

@vishaal27

@cyril-mino Sorry, I don't get your question -- what do you mean by the structure of the data input? I just pass two images to the model as a list of tensors (with the updated code above), along with a prompt that asks it to compare the two images.

@cyril-mino

@vishaal27 apologies, I thought you were finetuning.

@adrielkuek

[Quotes @SeungyounShin's prompt-tuning results above.]

Hi, is it possible to share the input query prompts for this output? Thanks!

@adrielkuek

@wjjlisa, do you mean the multi-turn conversation in CLI, as in our Gradio demo? This is planned for release by the end of this month. Was busy working on the NeurIPS recently...

Can I check whether the multi-turn framework will be added to the repo anytime soon? Thanks for the great work.

@adrielkuek

[Quotes @vishaal27's multi-image code comment above.]

Hi Vishaal, by stacking the tensors we create a new input dimension for the model, which throws an exception. How did you overcome this issue?

@vishaal27

Hi Adriel, as far as I recall (I must admit I haven't looked at this script in over a month), the model.generate function was able to take multiple input images as a concatenated tensor. For full clarity, here is the script I used; hope it helps (disclaimer: this script uses a fork of the repository that is quite old, so it is possible a few things have changed since then): https://github.com/MaxFBurg/LLaVA/blob/main/llava/eval/run_llava_two_images.py#L51

@adrielkuek

Hi Vishaal, thanks for sharing the code. Indeed, the fork has changed quite a bit: it seems the mm-projector is removed, and the pretrained model as well. I can confirm that the current fork with the modified image tensor input does not work, due to a dimensionality error in one of the nn.Modules during the forward pass. Can I do a quick check with you: did you use the llama-13b model or the facebook/opt model for your testing back then?

@vishaal27

Great, thanks for letting me know -- I will, however, need to get back to this script at some point and get it to work, so I can let you know if I figure something out for this use case. Please do let me know if you are able to as well :)
Re. your question -- we used the llama-13b model back then; I think the opt model was not available at that stage, if I recall correctly.

@codybum

codybum commented Jul 2, 2023

Hi Vishaal, thanks for sharing the code. Indeed the fork has changed quite a fair bit. Seems like the mm-projector is removed, and the pretrained model as well. I can confirm that the current fork with the modified image tensor input is unable to work due to the dimensionality error in one of the nn.modules during forward pass. Can I do a quick check with you, did you use the llama-13b model or the facebook/opt model for your testing back then?

@adrielkuek @vishaal27
We are also very interested in multi-image input. Our interest is less in comparison and more in using multiple images to represent the same thing, as described here: #197 (comment)

@HireTheHero
Contributor

Interested in multi-image input as well. We're wondering whether we could perform multimodal few-shot classification on the fly, without fine-tuning.
I will test Vishaal's solution and maybe create a PR when I have time.

@LumenYoung

Hi, everyone. I've been browsing the LLaVA code base for a while and find it hard to locate the exact generate() implementation for the LLaMA-based LLaVA. It would be helpful, since I want to find a way to run in multi-image mode. Any help would be appreciated!

@LumenYoung

[Quotes @SeungyounShin's prompt-tuning results above.]

Hi @SeungyounShin, would you mind sharing how you managed to embed both images into one query? It would be really helpful, as I am currently not able to find a way to do this.

@HireTheHero
Contributor

Something like #432 ? Would appreciate any suggestions.

@CreativeBuilds

[Quotes @SeungyounShin's prompt-tuning results above.]

Would you be able to upload this model to Hugging Face or share it some other way? I'm very interested in getting this to run with image comparison.

@aldoz-mila

Hello, I am also interested in inputting more than one image for some experiments. I am trying to find the right template for this, considering that the base template is USER: <image>\n<prompt>\nASSISTANT:.

  • (i) What would be the prompt template when feeding more than one image? Should I use something like USER: <image1><image2>\n<prompt>\nASSISTANT:?
  • (ii) How do you input the multiple images? Do you concatenate them along the first dimension? For example, assuming this code snippet (from the Hugging Face integration of LLaVA): output = pipe(image, prompt=text_prompt, generate_kwargs={"max_new_tokens": 200})
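
For what it's worth, one way to try two images through the transformers LLaVA integration (bypassing the pipeline wrapper) might look roughly like the sketch below; the llava-hf/llava-1.5-7b-hf checkpoint, the repeated <image> placeholders, and the image paths are assumptions for illustration, not something confirmed in this thread:

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # One <image> placeholder per image; the processor pairs them with the image list in order.
    prompt = ("USER: <image>\n<image>\n"
              "Describe the change applied to the first image to get to the second image. ASSISTANT:")
    images = [Image.open("first_image.jpg"), Image.open("second_image.jpg")]  # placeholder paths

    inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output_ids[0], skip_special_tokens=True))

Whether this yields meaningful comparisons is a separate question from whether it runs; as discussed above, the model was not instruction-tuned on multi-image samples.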

@fisher75

I am in great need of a multi-turn dialogue feature for batch inference with SGLang.

@Sprinter1999

@vishaal27 apologies, I thought you were finetuning.

Hi @cyril-mino, did you manage to get fine-tuning of LLaVA with multiple images & text working? I wonder if there are extra steps besides the code mentioned by @vishaal27. Thanks~

@yc-cui

yc-cui commented Apr 22, 2024

Is there any way we can embed other modalities, such as bounding boxes, class labels, etc.?
