Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added BMI1, BMI2, F16C, MOVEBE, and aligned-allocation/free plus some minor additions. #25

Merged
merged 21 commits into from
Jul 26, 2024

Conversation

Asc2011
Copy link
Contributor

@Asc2011 Asc2011 commented Jul 18, 2024

Hello Guzba,

this pull-request adds some new Tags :

  • BMI1
  • BMI2
  • F16C
  • MOVEBE

Moste of the functionality of BMI 1+2 is already in std/bitops. But for demo-purposes it might be ok, to have them explicit here.
More important to me is the addition of aligned-allocations and prefetching to nimsimd/sse2.nim :

  • mm_alloc()
  • mm_free()
  • mm_prefetch()
  • mm_sfence() # whatever this does - not tested..

I've used the aligned-allocations in two projects of mine. So consider them tested and usable, except the sfence()-thing.
Some functions were missing from nimsimd/avx2.nim. According to the Intel-Intrinsics-Guide, they should be avail.

The CacheLineWidth-detection works for me. I have not found a similar way to inquire this info for Arm or AppleSilicon or RISC-V cpus. Its just a starting point. Since i believe the CacheLineWidth to be the crucial-information when dealing with SIMD, i'd love to see this information availiable for all commonly cpu-architecture used by nim ? What do you think ?
Same could be said for the L1,L2,L3-sizes.

best regards, Andreas (a.k.a. bosinski)

Following the sources by Felix Cloutier, i tried to read the Cacheline-size from EAX leg-1. It's giving me a value of 8, that needs to be multiplied by 8. And that gives me 64-byte - but it may be just a coincidence ;) So pls. take a look - it' rather q&d...
Just a little test to how the package is structured. This supports  F16C a.k.a Float16 or 'half precision floats'. Not to mix up with 'brain-floats' - thats also 16-bit, but comes with a different encoding.
needs some some testing...
added the {.pop.} at the end..
just begun to look into it...
Some older prbly. not so important tags. Though the CompareExchange16-Tag is interesting. And some of the BMI2 operations, as well.
Three functions from SSE1-times - i guess they fit in nice here. 
I just started to use malloc and free and had no issues so far.
It's seem a nice way to get aligned allocations, which had been discussed on the nim-forum..
finished Tags BM11 and BMI2
MOVBE fns in movbe.nim
added Tags BMI1, BMI2, F16C, MOVBE
added older AVX2 broadcasts `mm_broadcast<x>_<type>`and movbe.nim into the right folder.
movbe.nim belongs in /src/nimsimd/
missing 'mm_blend_epi32' in AVX2
@guzba
Copy link
Owner

guzba commented Jul 19, 2024

Hi and thanks for the PR. Taking a quick look, one thing I noticed is the use of static pointer. I think that is not the same thing as void const * ptr in the Intrinsics Guide https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=4243,5873,4037&othertechs=MOVBE

A static pointer would need to be known at compile time which I do not see as compatible with heap memory and the use of these operations.

changed `static pointer` into `pointer`.
Copy link
Contributor Author

@Asc2011 Asc2011 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, thx

  • changed from ":static pointer" to ":pointer"

MOVEBE was smth. i considered using, but never did. As you already found out, its not tested.

@Asc2011 Asc2011 changed the title Added BM1, BM2, F16C, MOVEBE, and aligned-allocation & free plus some minor additions. Added BMI1, BMI2, F16C, MOVEBE, and aligned-allocation/free plus some minor additions. Jul 19, 2024
fixed return vector(s) from 'M256d/M256' in : 
- `mm_permutevar_pd() :M128d`
- `mm_permutevar_ps()  :M128`
fixed signatures in :
- storebe_i16()
- storebe_i32()
- storebe_i64()
@guzba
Copy link
Owner

guzba commented Jul 19, 2024

I took a closer look this time and did notice a couple things to check on:

  • _bextr_u64 should have start and len as unsigned int in the intrinsics docs, as opposed to 64-bit parameters
  • _bzhi_u32 and _bzhi_u64 index parameter looks like uint32
  • mulx_32 intrinsic and name should be _u32, same for 64 _u64, these also have the 3rd parameter as a pointer in the intrinsic

It will be great to have a bunch more instruction sets available when we've got them looking good.

edit signature :
- bextr_u64()
edited signatures : 
- bzhi_u32
- bzhi_u64
- mulx_u32
- mulx_u64
@Asc2011
Copy link
Contributor Author

Asc2011 commented Jul 19, 2024

ok, BMI1 & BMI2 should be fine now. thx for the review.

greets and have a nice weekend :)

@Asc2011
Copy link
Contributor Author

Asc2011 commented Jul 24, 2024

Hi guzba,

i've noticed that the something did not pass the GH-pipeline ?
Is there anything i can do ?

Greets Andreas

@guzba
Copy link
Owner

guzba commented Jul 26, 2024

Hello and thanks again for this PR. There were still some issues here so instead of continuing to tweak this over time, which may be time consuming, I chose to extend these commits in a new PR and do the remaining fixes here #26

Note that I did remove the queryCacheLineSize call due to the implementation not looking correct to me (should be an unsigned shift, I think incorrect register used, need to check a feature flag to see if the data is present and valid, etc). More info here https://www.felixcloutier.com/x86/clflush if you'd like to work on that in a separate PR.

@guzba guzba merged commit d6f8163 into guzba:master Jul 26, 2024
@Asc2011
Copy link
Contributor Author

Asc2011 commented Jul 27, 2024

Very nice and thx for including the important tags.
Regarding the detection of the CacheLineWidth in queryCacheLineSize(). I somehow remember that the leg from the Felix Cloutiers' documentation gave me unreasonable results. I then looked at a CpuId-implementation in Rust and found that they used a different leg. So i tried that one, but only had one machine to test against. So that may well be not correct. If i understood it right, then on Intel-Cpus its supposed to be always 64-Byte - thats easy. On AppleSilicon (M1 up to M-2024) its supposed to be 128-Byte - thats seems easy, too. To do the detection on any other CPU (Arm, RISC-V, Power, etc.) might get hairy.
Maybe it makes more sense to make this a compile-time-option, which defaults to 64, but can be set via -d:cacheline=128 ? Since i've never heard of a CacheLineWidth >128 or <64.
Anyways, i'll retry in pullreq-26.

@guzba
Copy link
Owner

guzba commented Jul 28, 2024

If you have a different source for how to get the cache flush size, that'd be totally fine with me. I don't mean to imply the website I linked is the only authority. If you have a solid one, please just point me to it so I can properly ensure I learn and agree the implementation is correct and ready to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants