Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CellFormat in AVX2 kernel incorrect? Question for clarification #159

Closed
SuperFluffy opened this issue Dec 11, 2018 · 6 comments
Closed

CellFormat in AVX2 kernel incorrect? Question for clarification #159

SuperFluffy opened this issue Dec 11, 2018 · 6 comments

Comments

@SuperFluffy
Copy link

SuperFluffy commented Dec 11, 2018

I am studying the kernels implemented in gemmlowp, and noticed a possible discrepancy in the KernelFormat of the AVX2 kernel, i.e. here:

KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, 1>>

So if I understand CellFormat<4,2> correctly, width is 4 (corresponding the the number of columns on the RHS), and depth is 2 (rows in the RHS), meaning we have a 2x4 matrix. Further, `CellOrder::WidthMajor for the RHS implies column order storage, so consecutive increments should be placed in consecutive rows:

1 3 5 7
2 4 6 8

The comment in kernel_avx.h says:

A 2x8 cell of Rhs is stored in 16bit in ymm1 ,

So here is already one discrepancy: the format described by the template is 2x4, not 2x8.

We see the next discrepancy in the inline asm code in kernel_avx.h contains this:

"vpmovzxbw (%[rhs_ptr]), %%ymm1 \n\t" // mov rhs to ymm1

The command vpmovzxbw corresponds to _mm256_cvtepu8_epi16 (__m128i a), which reads 16 bytes (16 * 8 = 128), and sign extends them to 16 i16 (16 * 16 = 256, the amount of bits fitting in a ymm* register). But the template only allows for 8 bytes! So my intuition would be that the comment should correct, given that it defines a matrix of 16 elements. Yet I don't understand how the code apparently seems to work if we only have 8 bytes for the RHS?

Finally, we have:

"vpmovzxbw 0x08(%[rhs_ptr]), %%ymm1 \n\t" // mov rhs to ymm1

Here we jump to the 8th element in the kernel. But according to the template this should be outside of the defined region of RHS?

I tried to correct the template to work according to my expectations and according to the comments, i.e. I set KernelSideFormat<CellFormat<8, 2, CellOrder::WidthMajor>, 1>>, but that lead to the tests erroring.

Can somebody comment on my observation and rectify my misunderstanding of the code? I guess it seems to work, but I don't understand how. Specifically, how is the code not reading past the defined memory regions?

On the other hand, given that we have 16 ymm* registers on AVX machines, I also understand what the code itself tries to accomplish: by having 4 columns of on the RHS, we can broadcast two rows at a time to a packed vector, and combine it with one cell of the LHS. This way, we only need 4 registers to accumulate one cell block of the LHS with all of RHS, which only has one cell and will thus always stay in ymm1. This allows the entire operation to be had in the 16 ymm* registers without a single register spill.

I just don't understand how an AVX load works with 8 bytes only.

@SuperFluffy
Copy link
Author

Come to think more about this, I realize that I was mislead by one of the asm instructions. I think it's not vpmovzxbw that is required, but rather vpmovzxbd. vpmovzxbd indeed loads 8 u8 into one 256 bit vector, zero extending them to 8 i32.

I am not completely clear yet how the interaction with all the other operations is, but at least this doesn't violate my assumption about which memory is being accessed (in particular, reading past the cell as defined in the template).

Not completely clear to me are also the “lower half” comments in above the loads of rhs_ptr. What about the upper half?

@SuperFluffy
Copy link
Author

Pinging @rrwinterton who introduced the kernel: #122

@bjacob
Copy link
Contributor

bjacob commented Dec 11, 2018

I'll let @rrwinterton comment as I'm not familiar with this code.

Generic info: the general meaning of the kernel format concepts here is explained in
https://github.com/google/gemmlowp/blob/master/internal/kernel.h
and a reference kernel implementation for an arbitrary format is given in
https://github.com/google/gemmlowp/blob/master/internal/kernel_reference.h

@SuperFluffy
Copy link
Author

I came to realize that I was trying to understand the kernel devoid of the packing mechanism. I now realize that that doesn't work. I have to wrap my head around how exactly the matrices are packed, but I am pretty sure the answer to my problem is there.

@bjacob
Copy link
Contributor

bjacob commented Feb 26, 2019

Closing for now, let us know if any new question comes up.

@bjacob bjacob closed this as completed Feb 26, 2019
@rrwinterton
Copy link
Contributor

Thanks bjacob I didn't get the email of the previous questions in December. If there are any questions you may have after reviewing the packing mechanism and the concepts documents bjacob provided let me know. I will watch the emails more closely.
Rich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants