
Add visual-question-answering / multimodal support to gradio notebook tasks #1392

Open
Bedrovelsen opened this issue Mar 3, 2024 · 4 comments · Fixed by #1396

Comments

@Bedrovelsen

Enjoying the recent gradio notebook stuff!

Was curious whether (and when) support for an additional Hugging Face task option of "visual question answering" is planned?

If it's not currently planned, could you give a quick overview of how to add a new task category to the gradio notebook codebase? I can of course read over the current gradio notebook code and figure it out on my own, but guidance from the team on best practices for contributing would be preferred.

@saqadri
Contributor

saqadri commented Mar 3, 2024

Thanks @Bedrovelsen! Would love your help adding that, and I've messaged you on Discord so our team can work with you to make sure you can get this set up!

@Bedrovelsen
Author

Sounds good

@rholinshead
Contributor

Just copying over the quick implementation overview from discord here:

  1. Add a new HuggingFaceVisualQuestionAnsweringRemoteInference ModelParser under the https://github.com/lastmile-ai/aiconfig/tree/main/extensions/HuggingFace/python/src/aiconfig_extension_hugging_face/remote_inference_client folder.
    This parser should look very similar to the existing HuggingFaceImage2TextRemoteInference model parser, with the following changes:
  • serialize handles the image/attachment data the same way, but the constructed PromptInput will also need a data string holding the 'question' value passed to serialize
  • refine_completion_params can be the same, but should have a comment pointing to the visual_question_answering API code: https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/inference/_client.py#L1785
  • deserialize can be mostly the same, except we need to add 'question' to the completion_data from the prompt data: completion_data["question"] = prompt["data"]
  • run is similar as well; it just needs to call client.visual_question_answering with the completion_data and handle the response as desired. The response appears to be a list of VisualQuestionAnsweringOutputElement objects; we'll want to serialize those as ExecuteResult outputs in whatever format seems best. For example, data could hold the answer and the score could be stored in metadata

I believe the helpers about validating/retrieving the image from attachments can just be kept the same.
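To make the deserialize/run changes above concrete, here is a minimal, self-contained sketch. It is not the real aiconfig or huggingface_hub API: VQAOutputElement stands in for huggingface_hub's VisualQuestionAnsweringOutputElement, and the two helper functions are hypothetical names illustrating the 'question' wiring and the ExecuteResult-style output mapping.

```python
# Hypothetical sketch of the deserialize/run output handling described above.
# VQAOutputElement is a stand-in for huggingface_hub's
# VisualQuestionAnsweringOutputElement; both helpers are invented names.
from dataclasses import dataclass


@dataclass
class VQAOutputElement:
    answer: str
    score: float


def build_completion_data(prompt: dict, settings: dict) -> dict:
    """Mirror the deserialize step: start from the model settings, then
    add the 'question' string from the prompt's data field."""
    completion_data = dict(settings)
    completion_data["question"] = prompt["data"]
    return completion_data


def to_execute_results(elements: list[VQAOutputElement]) -> list[dict]:
    """Serialize response elements as ExecuteResult-style outputs:
    data holds the answer, and the score is kept in metadata."""
    return [
        {
            "output_type": "execute_result",
            "execution_count": i,
            "data": el.answer,
            "metadata": {"score": el.score},
        }
        for i, el in enumerate(elements)
    ]


# Example: a fake response shaped like client.visual_question_answering's.
response = [VQAOutputElement(answer="a cat", score=0.98)]
outputs = to_execute_results(response)
print(outputs[0]["data"])  # a cat
```

The mapping keeps the answer as the primary output payload while preserving the model's confidence score, which the UI can surface or ignore as needed.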

With the parser implemented, we can expose it in the extension here: https://github.com/lastmile-ai/aiconfig/blob/main/extensions/HuggingFace/python/src/aiconfig_extension_hugging_face/__init__.py

For testing the extension, please see README instructions - https://github.com/lastmile-ai/aiconfig/blob/main/extensions/HuggingFace/python/README.md

Then, I would recommend importing and registering the new parser in https://github.com/lastmile-ai/aiconfig/blob/main/cookbooks/Gradio/aiconfig_model_registry.py with id "Visual Question Answering", and then following the Getting Started instructions in https://github.com/lastmile-ai/aiconfig/blob/main/cookbooks/Gradio/README.md to open the huggingface.aiconfig.json file with the new parser registered.
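As a rough, self-contained illustration of the registration step: the dict registry and register_model_parser function below are stand-ins invented for this example; the real aiconfig_model_registry.py uses aiconfig's own registration APIs.

```python
# Hypothetical sketch of registering the new parser under its task id.
# MODEL_PARSERS and register_model_parser are stand-ins for aiconfig's
# real registry machinery, shown only to illustrate the id mapping.

class HuggingFaceVisualQuestionAnsweringRemoteInference:
    """Placeholder for the parser sketched earlier in this thread."""

    def id(self) -> str:
        # The id the Gradio notebook UI would surface for this task.
        return "Visual Question Answering"


MODEL_PARSERS: dict[str, object] = {}


def register_model_parser(parser) -> None:
    """Register a parser under its id, mirroring how the existing
    HuggingFace remote-inference parsers are wired up."""
    MODEL_PARSERS[parser.id()] = parser


register_model_parser(HuggingFaceVisualQuestionAnsweringRemoteInference())
print(sorted(MODEL_PARSERS))  # ['Visual Question Answering']
```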

On the UI side, we will need to add a new PromptSchema to the client for rendering the parser's input and settings nicely. I can implement that shortly.

rholinshead pushed a commit that referenced this issue Mar 4, 2024
# Implement HuggingFaceVisualQuestionAnsweringRemoteInferencePromptSchema

For #1392

This will add the prompt schema so that visual question answering prompts have the nice UI for input and settings
rholinshead added a commit that referenced this issue Mar 4, 2024
…ma (#1396)

Implement HuggingFaceVisualQuestionAnsweringRemoteInferencePromptSchema

# Implement HuggingFaceVisualQuestionAnsweringRemoteInferencePromptSchema

For #1392

This will add the prompt schema so that visual question answering
prompts have the nice UI for input and settings

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/lastmile-ai/aiconfig/pull/1396).
* #1397
* __->__ #1396
@rholinshead rholinshead reopened this Mar 4, 2024
@rholinshead
Contributor

Whoops, I linked #1396, which has the schema changes, and it auto-closed this issue. This issue is still open.
