attn_mask #8

Open
pldlgb opened this issue Jul 23, 2022 · 3 comments

pldlgb commented Jul 23, 2022

cls_mask = rearrange(text != self.pad_id, 'b j -> b 1 j')       # (b, 1, seq): True where the token is not PAD
attn_mask = F.pad(cls_mask, (0, 1, seq, 0), value=True)         # (b, seq + 1, seq + 1)

attn_mask = rearrange(attn_mask, 'b i j -> b 1 i j')            # broadcast over attention heads
sim = sim.masked_fill(~attn_mask, -torch.finfo(sim.dtype).max)

Hello, I am confused by the implementation of attn_mask. I think this padding call can only mask the last row of sim. Could you please explain it? Perhaps it's a very foolish question. Thank you so much.

@skyerhxx

I have the same question. It seems like attn_mask = F.pad(cls_mask, (0, 1, seq, 0), value=True) is not right.
Based on the original paper, the attn_mask here should be a lower-triangular (causal) mask, to prevent the feature at the current timestep from attending to features at future timesteps.

I'd welcome a discussion.


gshaikov-paige commented Sep 1, 2023

@skyerhxx This is not the causal mask; it is a mask that prevents the CLS token from attending to PAD tokens in the batch.

We add PAD tokens to the text batch since text examples have different lengths but the tensor has a fixed dimension, so to concatenate them into a single batch tensor one must pad the end of each sequence with a dummy token, i.e. a PAD token. However, since we append the CLS token to the very end, it would attend to the entire sequence, including the PAD tokens, which we don't want. So we mask them out.
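
For concreteness, here is a minimal sketch with made-up toy values (pad_id = 0, a batch of one, seq = 4 text tokens); only torch and einops are assumed, as in the repo:

import torch
import torch.nn.functional as F
from einops import rearrange

pad_id = 0
seq = 4                                                   # number of text tokens before the appended CLS token
text = torch.tensor([[5, 7, 2, 0]])                       # batch of one; the last position is PAD

cls_mask = rearrange(text != pad_id, 'b j -> b 1 j')      # (1, 1, 4): True where the token is not PAD
# pad the key dim by (0, 1) to add a True column for the CLS key itself, and the
# query dim by (seq, 0) to prepend seq all-True rows for the text-token queries
attn_mask = F.pad(cls_mask, (0, 1, seq, 0), value=True)   # (1, 5, 5)

print(attn_mask[0, -1])                                   # tensor([ True,  True,  True, False,  True])

Every row except the last is all True, so this mask only constrains the CLS query; the text queries are handled by the causal mask instead.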


gshaikov-paige commented Sep 1, 2023

@pldlgb We only mask the last row of sim because that row corresponds to the CLS token query. Without this mask it would attend to all the keys before it, including PAD keys.

We don't need to mask the other queries because we don't care what the PAD queries attend to: they are masked out when we compute the CE loss. We also don't need to mask the text queries, since they are already constrained by the causal mask and can only look backwards at other text queries.
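
Continuing the toy values from the sketch above (still an illustration, not the exact repo code), this is how the mask plays out when applied to a dummy score tensor standing in for sim:

import torch
import torch.nn.functional as F
from einops import rearrange

pad_id, seq = 0, 4
text = torch.tensor([[5, 7, 2, 0]])                       # last token is PAD

cls_mask = rearrange(text != pad_id, 'b j -> b 1 j')
attn_mask = F.pad(cls_mask, (0, 1, seq, 0), value=True)
attn_mask = rearrange(attn_mask, 'b i j -> b 1 i j')      # (1, 1, 5, 5), broadcasts over heads

sim = torch.zeros(1, 2, 5, 5)                             # dummy (batch, heads, queries, keys) scores
sim = sim.masked_fill(~attn_mask, -torch.finfo(sim.dtype).max)

print(sim[0, 0])
# only the last row (the CLS query) receives a large negative value, and only in the
# PAD key column; the text-query rows are left untouched for the causal mask to handle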
