for i in range(len(self.attention_layers)):
    seqs = torch.transpose(seqs, 0, 1)            # (batch, seq_len, hidden) -> (seq_len, batch, hidden)
    Q = self.attention_layernorms[i](seqs)        # layer norm applied to the queries only
    mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs,
                                              attn_mask=attention_mask)
                                              # key_padding_mask=timeline_mask
                                              # need_weights=False) this arg does not work?
    seqs = Q + mha_outputs                        # residual connection
    seqs = torch.transpose(seqs, 0, 1)            # back to (batch, seq_len, hidden)
In the SASRec paper, Section III (Methodology), part B (Self-Attention Block), the formula feeds the same embedding object as queries, keys, and values, converting it through linear projections. Why are the queries normalized in the code, while the keys and values are not?
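For comparison, here is a minimal sketch of how I read the paper's formulation: the same normalized embedding is passed as queries, keys, and values, and the projections W^Q, W^K, W^V happen inside nn.MultiheadAttention. The sizes and variable names below are only placeholders (dropout is omitted), not the repo's actual configuration:

import torch
import torch.nn as nn

# Illustrative sizes only, not the repo's real hyperparameters
hidden_units, num_heads, maxlen, batch_size = 50, 1, 200, 128

layernorm = nn.LayerNorm(hidden_units)
attn = nn.MultiheadAttention(hidden_units, num_heads)

seqs = torch.rand(maxlen, batch_size, hidden_units)  # (seq_len, batch, hidden)

x = layernorm(seqs)      # normalize the whole input once
out, _ = attn(x, x, x)   # Q = K = V = normalized input; W^Q, W^K, W^V applied inside
seqs = seqs + out        # residual from the un-normalized input, i.e. x + SA(LayerNorm(x))

The repo's code instead normalizes only Q and leaves seqs un-normalized as keys and values, which is what I am asking about.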