-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CellFormat in AVX2 kernel incorrect? Question for clarification #159
Comments
Come to think more about this, I realize that I was mislead by one of the asm instructions. I think it's not I am not completely clear yet how the interaction with all the other operations is, but at least this doesn't violate my assumption about which memory is being accessed (in particular, reading past the cell as defined in the template). Not completely clear to me are also the “ |
Pinging @rrwinterton who introduced the kernel: #122 |
I'll let @rrwinterton comment as I'm not familiar with this code. Generic info: the general meaning of the kernel format concepts here is explained in |
I came to realize that I was trying to understand the kernel devoid of the packing mechanism. I now realize that that doesn't work. I have to wrap my head around how exactly the matrices are packed, but I am pretty sure the answer to my problem is there. |
Closing for now, let us know if any new question comes up. |
Thanks bjacob I didn't get the email of the previous questions in December. If there are any questions you may have after reviewing the packing mechanism and the concepts documents bjacob provided let me know. I will watch the emails more closely. |
I am studying the kernels implemented in gemmlowp, and noticed a possible discrepancy in the
KernelFormat
of the AVX2 kernel, i.e. here:gemmlowp/internal/kernel_avx.h
Line 33 in 1490d29
So if I understand
CellFormat<4,2>
correctly, width is 4 (corresponding the the number of columns on the RHS), and depth is 2 (rows in the RHS), meaning we have a 2x4 matrix. Further, `CellOrder::WidthMajor for the RHS implies column order storage, so consecutive increments should be placed in consecutive rows:The comment in
kernel_avx.h
says:So here is already one discrepancy: the format described by the template is 2x4, not 2x8.
We see the next discrepancy in the inline asm code in
kernel_avx.h
contains this:gemmlowp/internal/kernel_avx.h
Line 113 in 1490d29
The command
vpmovzxbw
corresponds to_mm256_cvtepu8_epi16 (__m128i a)
, which reads 16 bytes (16 * 8 = 128), and sign extends them to 16i16
(16 * 16 = 256, the amount of bits fitting in aymm*
register). But the template only allows for 8 bytes! So my intuition would be that the comment should correct, given that it defines a matrix of 16 elements. Yet I don't understand how the code apparently seems to work if we only have 8 bytes for the RHS?Finally, we have:
gemmlowp/internal/kernel_avx.h
Line 180 in 1490d29
Here we jump to the 8th element in the kernel. But according to the template this should be outside of the defined region of RHS?
I tried to correct the template to work according to my expectations and according to the comments, i.e. I set
KernelSideFormat<CellFormat<8, 2, CellOrder::WidthMajor>, 1>>
, but that lead to the tests erroring.Can somebody comment on my observation and rectify my misunderstanding of the code? I guess it seems to work, but I don't understand how. Specifically, how is the code not reading past the defined memory regions?
On the other hand, given that we have 16
ymm*
registers on AVX machines, I also understand what the code itself tries to accomplish: by having 4 columns of on the RHS, we can broadcast two rows at a time to a packed vector, and combine it with one cell of the LHS. This way, we only need 4 registers to accumulate one cell block of the LHS with all of RHS, which only has one cell and will thus always stay inymm1
. This allows the entire operation to be had in the 16ymm*
registers without a single register spill.I just don't understand how an AVX load works with 8 bytes only.
The text was updated successfully, but these errors were encountered: