QS8 / QU8 HSwish Microkernels #8255
swamipreksha
commented
Apr 13, 2025
- Implementations for various ISAs:
- x86 AVX2
- Scalar ISA
- Unit tests
Signed-off-by: Ravi Kumar Soni <ravi.kumar.soni@intel.com>
Signed-off-by: Swami, Preksha <preksha.swami@intel.com>
558d05d to 83b449d
@dsharlet Kindly review our hardswish implementation for qs8 and qu8 datatypes.
This is a unary op, where we should be generating a LUT. I'm not sure that this implementation is going to be that much better. It's vectorized, but it needs quite a few instructions, and many of those instructions are 4x wider (floats) than the input/output data (int8/uint8). Can you please benchmark to see if this is the case? What does the following command report on master and with your branch?
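For reference, the LUT approach mentioned above could look roughly like the following. This is a minimal scalar sketch, not XNNPACK's actual implementation: the `hswish_qparams` struct, the function names, and the round-half-away-from-zero rounding are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical quantization parameters (names are illustrative, not XNNPACK's).
typedef struct {
  float input_scale;
  int32_t input_zero_point;
  float output_scale;
  int32_t output_zero_point;
} hswish_qparams;

// Reference float hardswish: x * clamp(x + 3, 0, 6) / 6.
static float hswish_f32(float x) {
  float r = x + 3.0f;
  if (r < 0.0f) r = 0.0f;
  if (r > 6.0f) r = 6.0f;
  return x * r / 6.0f;
}

// Fill a 256-entry table indexed by the raw input byte (int8 viewed as uint8).
void build_qs8_hswish_lut(const hswish_qparams* p, int8_t lut[256]) {
  for (int i = 0; i < 256; i++) {
    const int32_t q = i < 128 ? i : i - 256;  // byte pattern -> signed value
    const float x = (float)(q - p->input_zero_point) * p->input_scale;
    float y = hswish_f32(x) / p->output_scale + (float)p->output_zero_point;
    if (y < -128.0f) y = -128.0f;  // clamp to the int8 range before rounding
    if (y > 127.0f) y = 127.0f;
    // Assumed rounding mode: half away from zero.
    lut[i] = (int8_t)(int32_t)(y >= 0.0f ? y + 0.5f : y - 0.5f);
  }
}

// Applying the op is then one table lookup per element, with no float math.
void apply_qs8_lut(const int8_t* in, int8_t* out, size_t n,
                   const int8_t lut[256]) {
  for (size_t i = 0; i < n; i++) {
    out[i] = lut[(uint8_t)in[i]];
  }
}
```

The table is built once per set of quantization parameters, so the per-element cost no longer depends on how expensive the float formula is.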
__m256 va_sum = _mm256_add_ps(va_mul, vthree);
__m256 va_clamped = _mm256_min_ps(_mm256_max_ps(va_sum, vzero), vsix);
__m256 vacc = _mm256_mul_ps(_mm256_mul_ps(va_mul, vsixth), va_clamped);
vacc = _mm256_div_ps(vacc, voutput_scale);
Can this be a mul_ps by the reciprocal instead?
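A scalar sketch of what this suggestion amounts to: compute the reciprocal of the output scale once, outside the loop, and multiply by it per element. The function name `scale_by_reciprocal` is hypothetical. Note that multiplying by a rounded reciprocal can differ from a true division in the last bit, so results are not guaranteed to be bit-identical.

```c
#include <stddef.h>

// Hypothetical helper: replaces acc[i] / output_scale with acc[i] * (1 / output_scale).
void scale_by_reciprocal(const float* acc, float* out, size_t n,
                         float output_scale) {
  const float inv_output_scale = 1.0f / output_scale;  // one division, hoisted
  for (size_t i = 0; i < n; i++) {
    out[i] = acc[i] * inv_output_scale;  // one multiply per element
  }
}
```

In the vector kernel the same idea would mean broadcasting the reciprocal once and using `_mm256_mul_ps` where `_mm256_div_ps` is used now.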
input += 8;
const __m128i vout_low = _mm256_castsi256_si128(vout);
const __m128i vout_high = _mm256_extracti128_si256(vout, 1);
const __m128i vout_packed = _mm_packs_epi32(vout_low, vout_high);
Instead of 2 packs, can you do a cvtepi32_epi8?
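One caveat worth checking before making this swap: to my knowledge, `_mm256_cvtepi32_epi8` requires AVX-512 (F plus VL) rather than AVX2, and it truncates each 32-bit lane to its low byte, whereas the pack instructions saturate. The saturating counterpart is `_mm256_cvtsepi32_epi8`. The scalar sketch below models the two narrowing behaviours per lane. The function names are hypothetical, and this is a model of the semantics, not the intrinsics themselves.

```c
#include <stdint.h>

// Per-lane model of signed saturating narrowing (pack-style behaviour).
int8_t narrow_saturate_i32_to_i8(int32_t v) {
  if (v > 127) return 127;
  if (v < -128) return -128;
  return (int8_t)v;
}

// Per-lane model of truncating narrowing: keep only the low 8 bits.
int8_t narrow_truncate_i32_to_i8(int32_t v) {
  const uint32_t low = (uint32_t)v & 0xFFu;
  return (int8_t)(low < 128u ? (int32_t)low : (int32_t)low - 256);
}
```

The two agree for values already in [-128, 127], so truncation is only safe if the accumulator has been clamped to that range beforehand.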
fbarchard
left a comment
Thanks for the PR! There is an f32-vhswish/avx.c.in that is a bit simpler than this.
Is the math the same?
And yes to Dillon's suggestion of a LUT. An example of that is softmax for 8 bit.
As it is, the cost of converting bytes to float and back is a bottleneck, and only 8 values can be processed at a time.
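On the question of whether the math is the same: reading only the excerpt quoted in this review, and assuming `va_mul` holds the dequantized input $x$ and `voutput_scale` holds the output scale $s_{\text{out}}$, the accumulator works out to the standard hardswish divided by the output scale. The contents of f32-vhswish/avx.c.in are not shown here, so the comparison is against the textbook definition only.

```latex
\begin{aligned}
\operatorname{hswish}(x) &= x \cdot \frac{\min(\max(x + 3,\, 0),\, 6)}{6} \\
\texttt{vacc} &= \frac{\left(x \cdot \tfrac{1}{6}\right) \cdot \min(\max(x + 3,\, 0),\, 6)}{s_{\text{out}}}
             = \frac{\operatorname{hswish}(x)}{s_{\text{out}}}
\end{aligned}
```

The output zero point and the final rounding and clamping are outside the quoted lines, so they are not covered by this check.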