
Conversation

@Kotomi-Du commented Oct 10, 2025

GQA is natively supported by OV starting from the 2025.1 release. This PR aligns with that OV support.

"beam_idx",
"past_key_values",
"present",
"total_seq_len",


@Kotomi-Du Does the stateful model, post-translation into OVIR, always include the total_seq_len input? Is this now a general case for all LLMs (and since which OV toolkit version was this added)?

Author


It is the input name from the Microsoft generic model (specifically the Phi Silica model), not from the EPCtx OVIR model generated by the OV toolkit.

{"Atanh", V_2020_4, {"CPU"}},
{"Atanh", V_2022_1, {"GPU"}},
{"Attention", V_2023_0, {"CPU", "GPU"}},
{"GroupQueryAttention", V_2025_1, {"CPU", "GPU"}},


Please add to the PR description the JIRA that enables the GQA op for the CPU and GPU plugins in the OV ONNX frontend. Please make sure this change doesn't conflict with GQA support for NPU in your validation process (FYI, we are not currently targeting GQA support for NPU).

Author


Will do.


If GQA is enabled for CPU, it will also be marked as supported for NPU. Can you revert the change?


@MayureshV1 commented Oct 17, 2025


GQA should NOT be marked as supported on CPU. NPU uses CPU's capability list. As @preetha-intel mentioned, if it is marked as supported on CPU, the op would be targeted to run on NPU, and in case of a compilation failure it would fall back to OV CPU instead of MLAS, which is currently a CP+ production requirement.
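For illustration, if the intent is to keep GQA off the CPU (and therefore NPU) path while still enabling it for GPU, the capability entry would presumably be narrowed to GPU only, along these lines (a sketch of one possible change, not the final diff):

{"GroupQueryAttention", V_2025_1, {"GPU"}},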


@ankitm3k left a comment


LGTM
