Added BMI1, BMI2, F16C, MOVEBE, and aligned-allocation/free plus some minor additions. #25

Asc2011 · 2024-07-18T14:52:59Z

Hello Guzba,

this pull-request adds some new Tags :

BMI1
BMI2
F16C
MOVEBE

Moste of the functionality of BMI 1+2 is already in std/bitops. But for demo-purposes it might be ok, to have them explicit here.
More important to me is the addition of aligned-allocations and prefetching to nimsimd/sse2.nim :

mm_alloc()
mm_free()
mm_prefetch()
mm_sfence() # whatever this does - not tested..

I've used the aligned-allocations in two projects of mine. So consider them tested and usable, except the sfence()-thing.
Some functions were missing from nimsimd/avx2.nim. According to the Intel-Intrinsics-Guide, they should be avail.

The CacheLineWidth-detection works for me. I have not found a similar way to inquire this info for Arm or AppleSilicon or RISC-V cpus. Its just a starting point. Since i believe the CacheLineWidth to be the crucial-information when dealing with SIMD, i'd love to see this information availiable for all commonly cpu-architecture used by nim ? What do you think ?
Same could be said for the L1,L2,L3-sizes.

best regards, Andreas (a.k.a. bosinski)

Following the sources by Felix Cloutier, i tried to read the Cacheline-size from EAX leg-1. It's giving me a value of 8, that needs to be multiplied by 8. And that gives me 64-byte - but it may be just a coincidence ;) So pls. take a look - it' rather q&d...

Just a little test to how the package is structured. This supports F16C a.k.a Float16 or 'half precision floats'. Not to mix up with 'brain-floats' - thats also 16-bit, but comes with a different encoding.

needs some some testing...

added the {.pop.} at the end..

just begun to look into it...

Some older prbly. not so important tags. Though the CompareExchange16-Tag is interesting. And some of the BMI2 operations, as well.

the missing bmi2.nim

Three functions from SSE1-times - i guess they fit in nice here. I just started to use malloc and free and had no issues so far. It's seem a nice way to get aligned allocations, which had been discussed on the nim-forum..

finished Tags BM11 and BMI2

MOVBE fns in movbe.nim

added Tags BMI1, BMI2, F16C, MOVBE

typos

added older AVX2 broadcasts `mm_broadcast<x>_<type>`and movbe.nim into the right folder.

movbe.nim belongs in /src/nimsimd/

missing 'mm_blend_epi32' in AVX2

guzba · 2024-07-19T02:00:51Z

Hi and thanks for the PR. Taking a quick look, one thing I noticed is the use of static pointer. I think that is not the same thing as void const * ptr in the Intrinsics Guide https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=4243,5873,4037&othertechs=MOVBE

A static pointer would need to be known at compile time which I do not see as compatible with heap memory and the use of these operations.

changed `static pointer` into `pointer`.

Asc2011

good catch, thx

changed from ":static pointer" to ":pointer"

MOVEBE was smth. i considered using, but never did. As you already found out, its not tested.

fixed return vector(s) from 'M256d/M256' in : - `mm_permutevar_pd() :M128d` - `mm_permutevar_ps() :M128`

fixed signatures in : - storebe_i16() - storebe_i32() - storebe_i64()

guzba · 2024-07-19T17:31:20Z

I took a closer look this time and did notice a couple things to check on:

_bextr_u64 should have start and len as unsigned int in the intrinsics docs, as opposed to 64-bit parameters
_bzhi_u32 and _bzhi_u64 index parameter looks like uint32
mulx_32 intrinsic and name should be _u32, same for 64 _u64, these also have the 3rd parameter as a pointer in the intrinsic

It will be great to have a bunch more instruction sets available when we've got them looking good.

edit signature : - bextr_u64()

edited signatures : - bzhi_u32 - bzhi_u64 - mulx_u32 - mulx_u64

Asc2011 · 2024-07-19T18:33:02Z

ok, BMI1 & BMI2 should be fine now. thx for the review.

greets and have a nice weekend :)

Asc2011 · 2024-07-24T21:47:06Z

Hi guzba,

i've noticed that the something did not pass the GH-pipeline ?
Is there anything i can do ?

Greets Andreas

guzba · 2024-07-26T18:43:16Z

Hello and thanks again for this PR. There were still some issues here so instead of continuing to tweak this over time, which may be time consuming, I chose to extend these commits in a new PR and do the remaining fixes here #26

Note that I did remove the queryCacheLineSize call due to the implementation not looking correct to me (should be an unsigned shift, I think incorrect register used, need to check a feature flag to see if the data is present and valid, etc). More info here https://www.felixcloutier.com/x86/clflush if you'd like to work on that in a separate PR.

Asc2011 · 2024-07-27T19:51:06Z

Very nice and thx for including the important tags.
Regarding the detection of the CacheLineWidth in queryCacheLineSize(). I somehow remember that the leg from the Felix Cloutiers' documentation gave me unreasonable results. I then looked at a CpuId-implementation in Rust and found that they used a different leg. So i tried that one, but only had one machine to test against. So that may well be not correct. If i understood it right, then on Intel-Cpus its supposed to be always 64-Byte - thats easy. On AppleSilicon (M1 up to M-2024) its supposed to be 128-Byte - thats seems easy, too. To do the detection on any other CPU (Arm, RISC-V, Power, etc.) might get hairy.
Maybe it makes more sense to make this a compile-time-option, which defaults to 64, but can be set via -d:cacheline=128 ? Since i've never heard of a CacheLineWidth >128 or <64.
Anyways, i'll retry in pullreq-26.

guzba · 2024-07-28T04:52:46Z

If you have a different source for how to get the cache flush size, that'd be totally fine with me. I don't mean to imply the website I linked is the only authority. If you have a solid one, please just point me to it so I can properly ensure I learn and agree the implementation is correct and ready to go.

Asc2011 added 16 commits February 25, 2024 23:33

half-precision floting-points float15 by IEEE

d6465fc

Just a little test to how the package is structured. This supports F16C a.k.a Float16 or 'half precision floats'. Not to mix up with 'brain-floats' - thats also 16-bit, but comes with a different encoding.

Update f16c.nim cleanen up F16C

1cc16db

needs some some testing...

Update f16c.nim forgot the {.pop.}

fd9a64a

added the {.pop.} at the end..

For TAG BM1 - WIP

7794f23

just begun to look into it...

runtimecheck.nim added BM1, BMI2, CMPXCHG16

8ae513b

Some older prbly. not so important tags. Though the CompareExchange16-Tag is interesting. And some of the BMI2 operations, as well.

bmi2.nim BMI2 is work-in-progress

3f6001e

the missing bmi2.nim

now into the right branch-ASC Update sse2.nim free/malloc/sfence

b204469

Three functions from SSE1-times - i guess they fit in nice here. I just started to use malloc and free and had no issues so far. It's seem a nice way to get aligned allocations, which had been discussed on the nim-forum..

bmi1.nim and bmi2.nim done

a9ba796

finished Tags BM11 and BMI2

new Tag MOVBE

0b6265b

MOVBE fns in movbe.nim

Update README.md

efd6680

added Tags BMI1, BMI2, F16C, MOVBE

Update README.md syntax

9a2180f

typos

added older AVX2 broadcasts and movbe.nim into the right folder

43bf727

added older AVX2 broadcasts `mm_broadcast<x>_<type>`and movbe.nim into the right folder.

Delete src/movbe.nim from wrong folder

742cad7

movbe.nim belongs in /src/nimsimd/

Update avx2.nim, added mm_blend_epi32

ceb2c43

missing 'mm_blend_epi32' in AVX2

Merge branch 'guzba:master' into asc

66dd118

Update movbe.nim

520c944

changed `static pointer` into `pointer`.

Asc2011 commented Jul 19, 2024

View reviewed changes

Asc2011 changed the title ~~Added BM1, BM2, F16C, MOVEBE, and aligned-allocation & free plus some minor additions.~~ Added BMI1, BMI2, F16C, MOVEBE, and aligned-allocation/free plus some minor additions. Jul 19, 2024

Update avx.nim

d82a82b

fixed return vector(s) from 'M256d/M256' in : - `mm_permutevar_pd() :M128d` - `mm_permutevar_ps() :M128`

This was referenced Jul 19, 2024

Two more glitches in avx.nim #22

Open

Aligned alloc _mm_malloc/free() would be nice to have #23

Open

Update movbe.nim

3518f67

fixed signatures in : - storebe_i16() - storebe_i32() - storebe_i64()

Asc2011 added 2 commits July 19, 2024 20:21

Update bmi1.nim

411a857

edit signature : - bextr_u64()

Update bmi2.nim

d6f8163

edited signatures : - bzhi_u32 - bzhi_u64 - mulx_u32 - mulx_u64

guzba merged commit d6f8163 into guzba:master Jul 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added BMI1, BMI2, F16C, MOVEBE, and aligned-allocation/free plus some minor additions. #25

Added BMI1, BMI2, F16C, MOVEBE, and aligned-allocation/free plus some minor additions. #25

Asc2011 commented Jul 18, 2024

guzba commented Jul 19, 2024 •

edited

Loading

Asc2011 left a comment •

edited

Loading

guzba commented Jul 19, 2024 •

edited

Loading

Asc2011 commented Jul 19, 2024

Asc2011 commented Jul 24, 2024

guzba commented Jul 26, 2024

Asc2011 commented Jul 27, 2024

guzba commented Jul 28, 2024

Added BMI1, BMI2, F16C, MOVEBE, and aligned-allocation/free plus some minor additions. #25

Added BMI1, BMI2, F16C, MOVEBE, and aligned-allocation/free plus some minor additions. #25

Conversation

Asc2011 commented Jul 18, 2024

guzba commented Jul 19, 2024 • edited Loading

Asc2011 left a comment • edited Loading

Choose a reason for hiding this comment

guzba commented Jul 19, 2024 • edited Loading

Asc2011 commented Jul 19, 2024

Asc2011 commented Jul 24, 2024

guzba commented Jul 26, 2024

Asc2011 commented Jul 27, 2024

guzba commented Jul 28, 2024

guzba commented Jul 19, 2024 •

edited

Loading

Asc2011 left a comment •

edited

Loading

guzba commented Jul 19, 2024 •

edited

Loading