-
Notifications
You must be signed in to change notification settings - Fork 34
Refactor AVX2 distance computations to consistently use generic_simd_op #196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot What about Cosine Similarity, defined in cosine.h ?
Cosine Similarity already uses the consistent pattern for AVX512 implementations (it calls For AVX2, there are no specific implementations in Should I add AVX2-specific implementations for Cosine Similarity similar to what I did for L2 and Inner Product? That would involve creating a |
|
@copilot exactly. CS AVX2 ops should be created and added. Fallback to generic should only happen on instructions sets lower than AVX2. |
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Added AVX2 implementations for Cosine Similarity in commit 90e282d. Created |
|
@copilot performance benchmarks show a 30% drop in throughput for a dataset with 768 dimensions, 1M vectors, and IP distance. Graph build time is not affected, only QPS during search. |
I'm investigating the performance regression. For 768 dimensions with float vectors, the main loop should process 32 elements at a time (4 unroll * 8 simd_width) without hitting the masked load path. Could you provide more details:
The most likely issue is the |
|
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
This is surprising - my changes only modified AVX2 code paths, and Float16/Float16 on AVX512 was already using Could you verify:
The multi-arch build system compiles separate object files for AVX2 and AVX512. If there's somehow a symbol resolution issue, the wrong implementation could be called at runtime. |
|
|
||
| static __m256 accumulate(mask_t /*m*/, __m256 accumulator, __m256 a, __m256 b) { | ||
| // For AVX2, masking is handled in the load operations | ||
| return _mm256_fmadd_ps(a, b, accumulator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will be safer to use the masked "_mm256_mask3_fmadd_ps" for accumulation as there is no difference in latency/throughput b/w these two instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in commit (will be pushed shortly). Now using _mm256_mask3_fmadd_ps for masked accumulation when AVX512VL is available, with AVX2 fallback for systems without AVX512VL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot respond
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in commit a9e91d6. The masked accumulate function now uses _mm256_mask3_fmadd_ps when AVX512VL is available, with AVX2 fallback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I think in that case it's better to revert the change and use the blend instructions without conditioning on AVX512VL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted in commit (being pushed). Removed AVX512VL conditionals, now using blend instructions consistently for AVX2.
| auto data = _mm256_cvtph_ps(_mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr))); | ||
| auto zero = _mm256_setzero_ps(); | ||
| auto mask_vec = create_blend_mask_avx2(m); | ||
| return _mm256_blendv_ps(zero, data, mask_vec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try to use "_mm256_maskz_loadu_ps" instead of these "blend_mask/blendv"? Similar to AVX512 implementation here? Maybe that would be simpler and more consistent with AVX512 version than this blend mask implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in commit (will be pushed shortly). Now using _mm256_maskz_loadu_ps and other AVX512VL masked instructions when available, with AVX2 blend fallback. This is more consistent with the AVX512 version and provides better performance on AVX512VL-capable systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in that case it's better to revert the change and use the blend instructions without conditioning on AVX512VL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted in commit (being pushed). Removed AVX512VL conditionals, now using blend instructions consistently for AVX2.
|
@copilot why didn't you use
|
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Updated in commit (will be pushed shortly). Originally used the blend approach because |
|
@copilot undo the latest specializations for AVX512VL. |
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Reverted in commit (being pushed). Removed AVX512VL specializations, back to using blend instructions for AVX2. |
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
Co-authored-by: ahuber21 <9201869+ahuber21@users.noreply.github.com>
L2FloatOp<8>for AVX2 L2 distance computationsConvertToFloat<8>base class for AVX2simd::generic_simd_op()IPFloatOp<8>for AVX2 Inner Product computationssimd::generic_simd_op()CosineFloatOp<8>for AVX2 Cosine Similarity computationsRecent Changes
Reverted AVX512VL conditional specializations based on reviewer feedback. Now using consistent blend mask approach for all AVX2 implementations without runtime conditionals for AVX512VL.
The implementation now:
create_blend_mask_avx2()helper function to create masks_mm256_blendv_psfor masked loads on AVX2Performance regression resolved - benchmarks confirmed performance parity on both AVX512 and AVX2 systems.
Original prompt
Fixes #183
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.