
Llama: allow custom 4d masks #29618

Merged
merged 1 commit into huggingface:main from fix_29525 on Mar 13, 2024

Conversation

@gante (Member) commented Mar 12, 2024:

What does this PR do?

Fixes #29525

Reintroduces the ability to pass custom 4D attention masks, which was removed in the static cache transition. The following tests are now passing:

RUN_SLOW=1 python -m pytest -v ./tests/test_modeling_utils.py::Mask4DTestFP32
RUN_SLOW=1 python -m pytest -v ./tests/test_modeling_utils.py::Mask4DTestFP16
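For context, an illustrative sketch of what a "custom 4D mask" means here (not code from this PR or its tests; all names and tensor values below are made up): a user-supplied mask of shape (batch, 1, query_length, kv_length), for example a block-diagonal causal mask that packs two sequences into a single batch row, together with position_ids that restart for each packed sequence.

import torch

q_len = kv_len = 6  # two packed sequences of 3 tokens each

# Shape (batch, 1, q_len, kv_len); 1 = may attend, 0 = masked.
# Whether a consumer expects this 0/1 form or the additive 0/-inf form depends
# on the entry point; see the review comments on the diff below.
mask_4d = torch.zeros(1, 1, q_len, kv_len)
mask_4d[0, 0, :3, :3] = torch.tril(torch.ones(3, 3))  # first packed sequence, causal
mask_4d[0, 0, 3:, 3:] = torch.tril(torch.ones(3, 3))  # second packed sequence, causal

position_ids = torch.tensor([[0, 1, 2, 0, 1, 2]])  # positions restart per packed sequence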

cc @ArthurZucker after you come back from holidays, have a look at this PR :)

@gante requested a review from amyeroberts on March 12, 2024 at 17:42

  hid_0 = self.model.model.embed_tokens(input_0)
- outs_0 = self.model.model.layers[0].self_attn.forward(hid_0)[0]
+ outs_0 = self.model.model.layers[0].self_attn.forward(hid_0, position_ids=position_ids_0)[0]
gante (Member Author):

position_ids is now a "mandatory" input to the attention layer forward

# outs_0.shape == torch.Size([3, 4, 768])

  hid_1 = self.model.model.embed_tokens(input_1)
  outs_1 = self.model.model.layers[0].self_attn.forward(
-     hid_1, attention_mask=mask_1.bool(), position_ids=position_ids_1
+     hid_1, attention_mask=causal_mask_1, position_ids=position_ids_1
gante (Member Author):

the attention layer forward now expects numerical 4D causal masks (as opposed to 2D boolean masks)

)[0]
# outs_1.shape == torch.Size([1, 6, 768])

outs_0_last_tokens = outs_0[:, -1, :] # last tokens in each batch line
outs_1_last_tokens = outs_1[0, -3:, :] # last three tokens
assert torch.allclose(outs_0_last_tokens, outs_1_last_tokens)
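As the review comments above note, the attention layer forward now takes explicit position_ids and a numerical (additive) 4D causal mask instead of a 2D boolean mask. Below is a hedged sketch of that conversion, not code from this PR: the helper name to_numerical_4d and the choice of the dtype minimum as the fill value are assumptions; the idea is that allowed positions contribute 0.0 to the attention scores and masked positions a large negative value.

import torch

def to_numerical_4d(bool_mask: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # bool_mask: (batch, 1, q_len, kv_len), True (or 1) where attention is allowed.
    numerical = torch.zeros_like(bool_mask, dtype=dtype)
    return numerical.masked_fill(~bool_mask.bool(), torch.finfo(dtype).min)

# e.g. causal_mask_1 = to_numerical_4d(mask_1), then passed as attention_mask above

The same applies to position_ids: for the 3x4 batch in the first hunk, positions 0 through 3 in every row (for example torch.arange(4).unsqueeze(0).expand(3, 4)) would be a natural fit, though the exact construction used by the test is not shown in this diff.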

def test_inner_model(self):
gante (Member Author):

This test was a copy of the test below 🤔

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment:

Thanks for reenabling this!

Only question before merge is how come this is only needed for the gemma and llama models?

@gante (Member Author) commented Mar 13, 2024:

> Only question before merge is how come this is only needed for the gemma and llama models?

@amyeroberts They are the only models that have received the static cache treatment. The static cache transition did not foresee this case in the original diff :)

We are finalizing support on the generate side before we propagate this pattern across the library! (#29374)

@gante merged commit 1e21c4f into huggingface:main on Mar 13, 2024
19 checks passed
@gante deleted the fix_29525 branch on March 13, 2024 at 15:07
Development

Successfully merging this pull request may close these issues.

custom 4d attention masks broken by #28937