
Fix torch.compile with fullgraph=True when attention_mask input is used #29211

Merged
3 commits merged into huggingface:main on Feb 22, 2024

Conversation

fxmarty (Collaborator) commented on Feb 22, 2024

As per title.

Fixes #29190

fxmarty (Collaborator, Author) commented on Feb 22, 2024

Let's consider using pytorch/pytorch#120400 if this is accepted and released.

ArthurZucker (Collaborator) left a comment

I guess we don't have a choice?

    # Attend to all tokens in masked rows from the causal_mask, for example the relevant first rows when
    # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
    # Details: https://github.com/pytorch/pytorch/issues/110213
    causal_mask = causal_mask.mul(~torch.all(causal_mask == causal_mask.min(), dim=-1, keepdim=True)).to(dtype)
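For illustration (not part of the PR), here is a minimal sketch of what this adjustment does, assuming an additive float mask where a fully masked row comes from left padding; the tensor values are made up:

    import torch

    min_dtype = torch.finfo(torch.float32).min
    causal_mask = torch.tensor(
        [
            [0.0, min_dtype, min_dtype],        # ordinary causal row
            [min_dtype, min_dtype, min_dtype],  # fully masked row, e.g. pure left padding
        ]
    )
    # Rows where every position equals the mask's minimum are fully masked...
    fully_masked_rows = torch.all(causal_mask == causal_mask.min(), dim=-1, keepdim=True)
    # ...and get reset to all-zeros ("attend to everything"), which avoids NaNs
    # in the memory-efficient scaled_dot_product_attention path.
    causal_mask = causal_mask.mul(~fully_masked_rows)
    # causal_mask[1] is now all zeros; causal_mask[0] is unchanged.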
Collaborator:
Kind of related to #29210

Collaborator Author:

@ArthurZucker this would conflict but is unrelated

Comment on lines +1081 to +1085
    is_tracing = (
        torch.jit.is_tracing()
        or isinstance(input_tensor, torch.fx.Proxy)
        or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
    )
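For context, a hedged sketch of how such a guard is typically used around the data-dependent mask adjustment (function and argument names here are illustrative, not the exact modeling code):

    import torch

    def _adjust_causal_mask(causal_mask, attention_mask, is_tracing):
        # The branch below depends on tensor *values*, which would graph-break
        # under torch.compile(fullgraph=True); it is therefore skipped whenever
        # we are tracing or compiling.
        if not is_tracing and torch.any(attention_mask != 1):
            padding_rows = torch.all(causal_mask == causal_mask.min(), dim=-1, keepdim=True)
            causal_mask = causal_mask.mul(~padding_rows)
        return causal_mask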
Collaborator:
🤢

Collaborator Author:

agree

fxmarty (Collaborator, Author) commented on Feb 22, 2024

> I guess we don't have a choice?

The other choice would be:

    is_tracing = (
        torch.jit.is_tracing()
        or isinstance(input_tensor, torch.fx.Proxy)
        or torch._dynamo.is_fullgraph_tracing()
    )

but torch._dynamo.is_fullgraph_tracing does not exist in PyTorch.

Other possibilities:

  • Always apply causal_mask = causal_mask.mul(~torch.all(causal_mask == causal_mask.min(), dim=-1, keepdim=True)), no matter what.
  • Drop that adjustment entirely, but that means we drop support for the memory-efficient attention backend in Transformers cc @drisspg.
  • Move the causal mask logic outside of the modeling code.

HuggingFaceDocBuilderDev (bot):

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker (Collaborator) left a comment

Alright, let's make sure CIs are green and benchmarks are not slower!

fxmarty merged commit 2cc8cf6 into huggingface:main on Feb 22, 2024
19 checks passed
fxmarty mentioned this pull request on Feb 22, 2024
kwen2501 (Contributor) commented:

Thanks for the fix!
Just wanted to share new API offerings from PyTorch (as of a couple days ago):
pytorch/pytorch#119602

Summary:

  • A more general flag, torch.compiler.is_compiling(), indicates whether a graph is traced/compiled via torch.export() or torch.compile(). It works even in non-strict mode (i.e. even without TorchDynamo).
  • A more specific flag, torch.compiler.is_dynamo_compiling(), is stricter: it is only set to True when TorchDynamo is used, so in non-strict mode it is False.

Cc: @mreso @khabinov
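As a hedged illustration (assuming a PyTorch version that ships these flags; the fallback branch is for older releases), the ad-hoc check above could then shrink to something like:

    import torch

    def is_tracing_or_compiling(input_tensor) -> bool:
        if hasattr(torch, "compiler") and hasattr(torch.compiler, "is_compiling"):
            # Covers torch.compile() and torch.export(), even in non-strict mode.
            compiling = torch.compiler.is_compiling()
        else:
            # Fallback for older PyTorch versions.
            compiling = hasattr(torch, "_dynamo") and torch._dynamo.is_compiling()
        return torch.jit.is_tracing() or isinstance(input_tensor, torch.fx.Proxy) or compiling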

fxmarty (Collaborator, Author) commented on Feb 27, 2024

@kwen2501 Thank you! So torch.export does not always use Dynamo? Reading https://pytorch.org/docs/stable/export.html, I thought it did!

Successfully merging this pull request may close these issues.

torch.export fails for llama model