[Attention] Attention head quantization strategy #481
Conversation
How are we expecting users to use QuantizationStrategy.ATTN_HEAD in a recipe? If I'm understanding correctly, it would look something like this?

quant_stage:
  quant_modifiers:
    QuantizationModifier:
      config_groups:
        group0:
          targets: ["re:.*self_attn$"]
          weights:
            strategy: attn_head
            ...
        group1:
          targets: ["re:.*(q|k|v)_proj$"]
          weights:
            strategy: group
            ...
@brian-dellabetta I've decided that giving per-attention-head quantization its own strategy (rather than reusing group) makes more sense:

quant_stage:
  quant_modifiers:
    QuantizationModifier:
      config_groups:
        group0:
          targets: ["re:.*self_attn$"]
          input_activations:
            strategy: attn_head
            ...
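For reference, a hedged sketch of how the group0 scheme above might be constructed programmatically with compressed-tensors' existing config classes; QuantizationStrategy.ATTN_HEAD is the enum value proposed in this PR, and the num_bits value is a placeholder assumption:

from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
)

# per-attention-head quantization of the attention states flowing into self_attn
attn_scheme = QuantizationScheme(
    targets=["re:.*self_attn$"],
    input_activations=QuantizationArgs(
        num_bits=8,                               # placeholder assumption
        strategy=QuantizationStrategy.ATTN_HEAD,  # value proposed in this PR
    ),
)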
Overall the format LGTM, but I'm struggling to understand how we're arriving at some of these expected_shapes.
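For context, a minimal sketch of where a per-head scale shape like (num_heads, 1, 1) can come from: reduce the attention state over every dimension except the head dimension. The reduction and the int8-style divisor here are illustrative assumptions, not the PR's implementation:

import torch

batch_size, num_heads, seq_len, head_dim = 2, 8, 16, 64
state = torch.randn(batch_size, num_heads, seq_len, head_dim)

# reduce over batch, seq_len, and head_dim, keeping only the head dimension
per_head_max = state.abs().amax(dim=(0, 2, 3))            # shape: (num_heads,)
scales = (per_head_max / 127.0).reshape(num_heads, 1, 1)  # shape: (num_heads, 1, 1)
print(scales.shape)  # torch.Size([8, 1, 1])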
Thanks for updating!
Looks good! Do we need logic somewhere to reverse flatten_attention_for_quantization? It seems important to make sure the unflattening process is implemented in parallel with the flattening function.
@fynnsu The inverse function would require extra metadata (for example, unflattening (batch_size * seq_len) requires knowing either batch_size or seq_len). Calibration only requires the forward function. Implementing the backward function would allow us to share the util across calibration and the quantization forward pass. This might be nice for standardization and potentially faster runtime, but it isn't a high priority right now.
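To illustrate the metadata point, here is a hypothetical flatten/unflatten pair (not the PR's flatten_attention_for_quantization implementation); the exact flattened layout is an assumption, but it shows why the inverse needs batch_size or seq_len passed back in:

import torch

def flatten_attention(value: torch.Tensor) -> torch.Tensor:
    # (batch_size, num_heads, seq_len, head_dim) -> (batch_size * seq_len, num_heads, 1, head_dim)
    batch_size, num_heads, seq_len, head_dim = value.shape
    return value.transpose(1, 2).reshape(batch_size * seq_len, num_heads, 1, head_dim)

def unflatten_attention(value: torch.Tensor, batch_size: int) -> torch.Tensor:
    # (batch_size * seq_len) cannot be split without knowing batch_size or seq_len,
    # so one of them must be provided as extra metadata
    flattened, num_heads, _, head_dim = value.shape
    seq_len = flattened // batch_size
    return value.reshape(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)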
Purpose
Given an attention state of shape (batch_size, num_heads, seq_len, head_dim), the attention head strategy will generate scales of shape (num_heads, 1, 1).
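As a quick illustration (assumed values, not code from this PR), scales of shape (num_heads, 1, 1) broadcast cleanly against an attention state of shape (batch_size, num_heads, seq_len, head_dim), so each head is quantized with its own scale:

import torch

batch_size, num_heads, seq_len, head_dim = 2, 8, 16, 64
state = torch.randn(batch_size, num_heads, seq_len, head_dim)
scales = torch.rand(num_heads, 1, 1) + 0.1  # one scale per head

# (batch, heads, seq, dim) / (heads, 1, 1) broadcasts the per-head scales
quantized = torch.clamp(torch.round(state / scales), -128, 127)
dequantized = quantized * scales
print(dequantized.shape)  # torch.Size([2, 8, 16, 64])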
Prerequisites
Changes
Testing