Llava onevision: output align for tests and add image_sizes input param (#43678)
vasqu merged 9 commits into huggingface:main
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@vasqu pls help review, thx!
Here I revert the changes from #43403, since the `image_sizes` input param turns out to be a common param for VLM models: both lighton_ocr and llava_onevision need it.
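For reference, a minimal sketch of why both models expose this input: their processors return an `image_sizes` entry alongside `pixel_values`, so a shared helper that forwards processor outputs has to accept it. The checkpoint name and prompt below are assumptions for illustration, not taken from this PR.

```python
from PIL import Image
from transformers import AutoProcessor

# Hedged sketch: the llava_onevision processor is expected to return an
# `image_sizes` entry alongside `pixel_values`. The checkpoint name is an
# assumption for illustration only.
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

image = Image.new("RGB", (640, 480))
prompt = "<image>\nWhat do you see in this image?"
inputs = processor(images=image, text=prompt, return_tensors="pt")

print(sorted(inputs.keys()))  # expected to include "image_sizes" and "pixel_values"
```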
Gotcha, fair enough then if it's not unique anymore. Just see my comment about adding examples to that comment.
)
EXPECTED_DECODED_TEXT = EXPECTED_DECODED_TEXTS.get_expectation()
# fmt: on
EXPECTED_DECODED_TEXT = [
Interesting, so the results are the same again 👀
| ("xpu", 3): 'user\n\nWhat do you see in this image?\nassistant\nThe image is a radar chart that compares the performance of different models in a specific task, likely related to natural language processing or machine learning. The chart is divided into several axes, each representing a different model or method. The models are color-coded and labeled with their respective names. The axes are labeled with terms such as "VQA," "GQA," "MQA," "VIZ," "TextVQA," "SQA-IMG," and "MQE." The radar chart shows', | ||
| ("cuda", 7): 'user\n\nWhat do you see in this image?\nassistant\nThe image is a radar chart that compares the performance of different models in a specific task, likely related to natural language processing or machine learning. The chart is divided into several axes, each representing a different model or method. The models are color-coded and labeled with their respective names. The axes are labeled with terms such as "VQA," "GQA," "MQA," "VQAv2," "MM-Vet," "LLaVA-Bench," "LLaVA-1', |
It looks like xpu and cuda (8) are the same again. Only cuda 7 differs, but iirc that's for the old T4 GPUs, so IMO it could be merged as well since T4 GPUs are no longer used on our CI.
cc @ydshieh wdyt?
Yes, I upgraded PyTorch to 2.10, and the outputs for XPU and A100 are the same now. But I do not have a cuda 7 device, so I keep the existing result for cuda 7 here.
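For context, a minimal self-contained sketch of the device-keyed expectation pattern shown in the diff above. The lookup/fallback rules here are an assumption for illustration, not the actual `Expectations` helper used in the test utilities.

```python
# Hedged sketch of how EXPECTED_DECODED_TEXTS.get_expectation() can resolve a
# device-specific expected output. This is NOT the transformers implementation;
# the fallback rules are assumptions for illustration only.
class Expectations(dict):
    """Maps (device_type, major_version) -> expected decoded text."""

    def get_expectation(self, device: str = "cuda", major: int = 8):
        # Prefer an exact (device, major) match, then any entry for the same
        # device, then an arbitrary remaining entry as a last resort.
        if (device, major) in self:
            return self[(device, major)]
        for (dev, _), value in self.items():
            if dev == device:
                return value
        return next(iter(self.values()))


EXPECTED_DECODED_TEXTS = Expectations(
    {
        # xpu 3 and cuda 8 (A100) now produce the same text; cuda 7 (old T4
        # GPUs) stays separate because it was not re-verified in this PR.
        ("xpu", 3): "shared decoded text",
        ("cuda", 8): "shared decoded text",
        ("cuda", 7): "T4-specific decoded text",
    }
)

print(EXPECTED_DECODED_TEXTS.get_expectation(device="cuda", major=7))  # T4-specific decoded text
```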
[For maintainers] Suggested jobs to run (before merge): run-slow: lighton_ocr, llava_onevision
vasqu left a comment:
LGTM, just running the slow test to be sure
run-slow: lighton_ocr, llava_onevision

This comment contains models: ["models/lighton_ocr", "models/llava_onevision"]

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
In this PR, we do several things for the llava_onevision model:
- Align the expected outputs in the tests.
- Add the `image_sizes` param in the `flash_attn_inference_equivalence` func to support VLM models like lighton_ocr and llava_onevision (a sketch follows below).
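A minimal sketch of what forwarding an optional `image_sizes` input through a shared attention-equivalence check could look like. The helper name, argument handling, and tolerances below are assumptions for illustration, not the actual transformers test code.

```python
import torch


# Hedged sketch: thread an optional `image_sizes` input through an
# eager-vs-flash-attention equivalence check. Names and tolerances are
# assumptions for illustration, not the real flash_attn_inference_equivalence helper.
def check_inference_equivalence(model_eager, model_fa2, inputs: dict):
    model_kwargs = {
        k: v
        for k, v in inputs.items()
        if k in ("input_ids", "attention_mask", "pixel_values")
    }
    # VLMs such as lighton_ocr and llava_onevision also need `image_sizes`;
    # models that don't use it simply never receive the kwarg.
    if "image_sizes" in inputs:
        model_kwargs["image_sizes"] = inputs["image_sizes"]

    with torch.no_grad():
        logits_eager = model_eager(**model_kwargs).logits
        logits_fa2 = model_fa2(**model_kwargs).logits

    # The two attention implementations should produce (nearly) identical outputs.
    torch.testing.assert_close(logits_eager, logits_fa2, rtol=1e-3, atol=1e-3)
```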