Bug Report for https://neetcode.io/problems/multi-headed-self-attention
Please describe the bug below and include any steps to reproduce the bug or screenshots if possible.
Your multi-head attention implementation is incorrect. In the standard transformer, you first compute the complete q, k, and v projections from the embeddings using w_q, w_k, and w_v, and only then split those projections into heads and compute attention per head.
In your solution, the embeddings are split into heads first and the weights are applied afterwards, which is not the standard transformer implementation: it effectively restricts each head to a slice of the embedding instead of letting every head project from the full embedding.
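For reference, here is a minimal sketch of the standard ordering (project the whole embedding first, then split into heads). It assumes PyTorch and illustrative names (MultiHeadSelfAttention, d_model, num_heads), not your exact starter-code signature:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Full-width projections over the whole embedding (w_q, w_k, w_v).
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        # 1) Compute the complete q, k, v from the embeddings first.
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)

        # 2) Only now split the projected tensors into heads.
        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(B, T, self.num_heads, self.d_head).transpose(1, 2)  # (B, heads, T, d_head)
        q, k, v = split_heads(q), split_heads(k), split_heads(v)

        # 3) Scaled dot-product attention per head.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (B, heads, T, d_head)

        # 4) Merge the heads back into d_model.
        return out.transpose(1, 2).contiguous().view(B, T, self.num_heads * self.d_head)

The key point is that the view/transpose into heads happens after the q/k/v projections, so each head sees a slice of a projection of the full embedding rather than a projection of a slice of the embedding.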