attn_mask #8

Open
pldlgb opened this issue Jul 23, 2022 · 3 comments

pldlgb commented Jul 23, 2022

cls_mask = rearrange(text != self.pad_id, 'b j -> b 1 j')       # (b, 1, seq): True where the token is not PAD
attn_mask = F.pad(cls_mask, (0, 1, seq, 0), value=True)         # (b, seq + 1, seq + 1)

attn_mask = rearrange(attn_mask, 'b i j -> b 1 i j')            # broadcast over attention heads
sim = sim.masked_fill(~attn_mask, -torch.finfo(sim.dtype).max)

Hello, I am confused by the implementation of attn_mask. I think this padding call can only mask the last row of sim. Could you please explain it? Perhaps it's a very foolish question. Thank you so much.

@skyerhxx

I have the same question. It seems like attn_mask = F.pad(cls_mask, (0, 1, seq, 0), value=True) is not right.
Based on the original paper, the attn_mask here should be a lower-triangular (causal) mask, to prevent the feature at the current timestep from attending to features at future timesteps.

I'd welcome a discussion.


gshaikov-paige commented Sep 1, 2023

@skyerhxx This is not the causal mask; it is a mask that prevents the CLS token from attending to PAD tokens in the batch.

We add PAD tokens to the text batch since text examples have different lengths but the tensor has a fixed dimension, so to concatenate them into a single batch tensor one must pad the end of each sequence with a dummy token, i.e. a PAD token. However, since we append the CLS token to the very end, it would attend to the entire sequence, including the PAD tokens, which we don't want. So we mask them out.
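
For concreteness, here is a minimal sketch with made-up toy values (pad_id = 0, a batch of one, seq = 4 text tokens); only torch and einops are assumed, as in the repo:

import torch
import torch.nn.functional as F
from einops import rearrange

pad_id = 0
seq = 4                                                   # number of text tokens before the appended CLS token
text = torch.tensor([[5, 7, 2, 0]])                       # batch of one; the last position is PAD

cls_mask = rearrange(text != pad_id, 'b j -> b 1 j')      # (1, 1, 4): True where the token is not PAD
# pad the key dim by (0, 1) to add a True column for the CLS key itself, and the
# query dim by (seq, 0) to prepend seq all-True rows for the text-token queries
attn_mask = F.pad(cls_mask, (0, 1, seq, 0), value=True)   # (1, 5, 5)

print(attn_mask[0, -1])                                   # tensor([ True,  True,  True, False,  True])

Every row except the last is all True, so this mask only constrains the CLS query; the text queries are handled by the causal mask instead.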


gshaikov-paige commented Sep 1, 2023

@pldlgb We only mask the last row of sim because that row corresponds to the CLS token query. Without this mask it would attend to all the keys before it, including PAD keys.

We don't need to mask the other queries because we don't care what the PAD queries attend to: they are masked out when we compute the CE loss. We also don't need to mask the text queries, since they are already constrained by the causal mask and can only look backwards at other text queries.
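
Continuing the toy values from the sketch above (still an illustration, not the exact repo code), this is how the mask plays out when applied to a dummy score tensor standing in for sim:

import torch
import torch.nn.functional as F
from einops import rearrange

pad_id, seq = 0, 4
text = torch.tensor([[5, 7, 2, 0]])                       # last token is PAD

cls_mask = rearrange(text != pad_id, 'b j -> b 1 j')
attn_mask = F.pad(cls_mask, (0, 1, seq, 0), value=True)
attn_mask = rearrange(attn_mask, 'b i j -> b 1 i j')      # (1, 1, 5, 5), broadcasts over heads

sim = torch.zeros(1, 2, 5, 5)                             # dummy (batch, heads, queries, keys) scores
sim = sim.masked_fill(~attn_mask, -torch.finfo(sim.dtype).max)

print(sim[0, 0])
# only the last row (the CLS query) receives a large negative value, and only in the
# PAD key column; the text-query rows are left untouched for the causal mask to handle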
