Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BLAKE3 freebsd aarch64 #14728

Merged
merged 1 commit into from Apr 26, 2023
Merged

Conversation

mcmilk
Copy link
Contributor

@mcmilk mcmilk commented Apr 9, 2023

Motivation and Context

See pull request #14715

How Has This Been Tested?

Compile + benchmarkingon same cpu: APM eMAG 8180 r3p2

Debian 11
[debian@debian-11-2 ~/src]$  cat /proc/spl/kstat/zfs/chksum_bench
implementation               1k      4k     16k     64k    256k      1m      4m     16m
edonr-generic               815    1016    1047    1028    1067    1031    1012     984
skein-generic               291     316     308     313     320     316     307     318
sha256-generic              127     136     138     140     138     139     138     140
sha256-armv7                147     163     162     166     163     164     165     164
sha256-neon                 180     205     192     209     209     209     208     207
sha256-armv8-ce             553     622     653     673     660     667     648     654
sha512-generic              188     211     219     221     218     220     221     219
sha512-armv7                221     254     253     263     264     264     260     263
blake3-generic              140     135     135     135     135     135     135     134
blake3-sse2                 178     286     284     304     292     287     284     284
blake3-sse41                179     298     305     308     304     300     294     293
FreeBSD 13.1
[freebsd@freebsd-13 ~/src]$ sysctl kstat.zfs.misc.chksum_bench
kstat.zfs.misc.chksum_bench: 
implementation               1k      4k     16k     64k    256k      1m      4m     16m
edonr-generic               702     826     881     879     858     870     821     835
skein-generic               122     133     135     136     136     136     134     134
sha256-generic              135     145     145     147     147     147     144     144
sha256-armv7                153     161     167     167     167     167     164     164
sha256-neon                 186     201     207     208     207     208     204     205
sha256-armv8-ce             555     638     663     669     661     668     647     630
sha512-generic              198     221     230     229     227     225     223     225
sha512-armv7                223     253     259     264     263     264     259     257
blake3-generic              235     235     230     230     227     226     223     226
blake3-sse2                 174     299     304     307     302     297     277     275
blake3-sse41                178     302     307     307     300     298     276     277

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Contributor

@kevans91 kevans91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this looks good to me. Thanks!

@rincebrain
Copy link
Contributor

I don't think it matters given the fact that it was broken before, but I'm curious if there's any noticable performance impact on Linux or if it's just noise.

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 9, 2023

I don't think it matters given the fact that it was broken before, but I'm curious if there's any noticable performance impact on Linux or if it's just noise.

Same machine, but with Debian 11:

[debian@debian-11-2 ~]$ lscpu 
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       APM
Model:                           2
Model name:                      X-Gene
Stepping:                        0x3
BogoMIPS:                        80.00
NUMA node0 CPU(s):               0-3
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
[debian@debian-11-2 ~]$ cat /proc/spl/kstat/zfs/chksum_bench
16 0 0x01 -1 0 69832221893625 69881496832800
implementation               1k      4k     16k     64k    256k      1m      4m     16m
edonr-generic               850     991    1069    1054    1051    1063    1057    1055
skein-generic               293     312     320     322     323     323     322     323
sha256-generic              130     137     141     142     142     141     141     142
sha256-armv7                149     164     167     168     167     168     167     168
sha256-neon                 183     205     208     210     210     209     209     209
sha256-armv8-ce             554     642     660     675     672     671     667     669
sha512-generic              191     212     221     223     222     223     223     223
sha512-armv7                218     256     263     265     265     265     265     264
blake3-generic              137     137     136     136     135     136     135     135
blake3-sse2                 174     292     299     300     298     297     289     291
blake3-sse41                166     294     302     304     302     299     292     294

So everything seems fine :)

@lundman
Copy link
Contributor

lundman commented Apr 9, 2023

Can we have the asm_linkage.h changes in here, or do I need to do them in a future PR?

@kevans91
Copy link
Contributor

kevans91 commented Apr 10, 2023

I've uncovered another problem in my local integration of the fpu changes... it turns out that we have one point that breaks the build that I haven't thought of how to workaround.

https://github.com/openzfs/zfs/blob/master/module/icp/algs/blake3/blake3_impl.c#L33

Here we have an escape hatch for x86_64 to avoid the SIMD implementations with HAVE_SSE2, but there's no equivalent for arm64. This ends up breaking our bootloader build, where fpu_kern(9) isn't a thing and we cannot at all do float. Our amd64 loader avoids it because it just doesn't define HAVE_SSE2.

At first I thought perhaps __ARM_NEON was the solution, but I can't actually figure out how to get clang to not offer it up (even with -mfpu=none, for example). We already have one instance of #undef __aarch64__, but I'd really rather not increase that number...

(edit to add: -mgeneral-regs-only seems to turn off __ARM_NEON, which makes it a bad choice)

@lundman
Copy link
Contributor

lundman commented Apr 10, 2023

macOS results - unsure if not using x18 changed anything, but then I can crash it using generic as well.

This adds asm_linkage,h for aarch64 f6fa7f5a584d4bbf889974e3bacb237a60dd7e95

These are the changes to this PRs .S files to make them assemble:
openzfsonosx@72c67e5

And here is blake3.c peppered with VERIFY:
openzfsonosx@fc96fcf

I tend to get one of the following:

  1. While rsync'ing data to blake3 dataset:
kernel.release.t8112`handle_kernel_abort(state=0xfffffe3a32eb33c0, esr=2516582404,
 fault_addr=0xaa80e95d44a42713, fault_code=FSC_TRANSLATION_FAULT_L0, fault_type=1, 
expected_fault_handler=<unavailable>) at sleh.c:2464:2 [opt]

Something has certainly gone wrong there, sure is an interesting address 0xaa80e95d44a42713

  1. rsync data will go ok, but zpool scrub and/or zpool export/import
    /* This can go negative, then while below goes forever */
    ssize_t cvs_remaining;

    if (chunk_state_len(&ctx->chunk) > 0) {
        cvs_remaining = ctx->cv_stack_len;
        output = chunk_state_output(&ctx->chunk);
    } else {
        /* There are always at least 2 CVs in the stack in this case. */
        VERIFY3U(ctx->cv_stack_len, !=, 1 /* B */);
        VERIFY3U(ctx->cv_stack_len, >=, 2 /* B */);
        cvs_remaining = ctx->cv_stack_len - 2;
        output = parent_output(&ctx->cv_stack[cvs_remaining * 32],
            ctx->key, ctx->chunk.flags);
    }

VERIFY3(ctx->cv_stack_len != 1) failed (1 != 1)

Which is where cvs_remaining would go negative for macOS.

@lundman
Copy link
Contributor

lundman commented Apr 10, 2023

Oh I see, ok I can fix macOS with:


#if defined(_KERNEL) && !defined(__APPLE__)
    BLAKE3_CTX *ctx = blake3_per_cpu_ctx[CPU_SEQID_UNSTABLE];
#else
    BLAKE3_CTX *ctx = kmem_alloc(sizeof (*ctx), KM_SLEEP);
#endif

Ie, allocate it instead of using blake3_per_cpu_ctx. We don't have cpu_number() on macOS cos Apple are mean.

We implemented it with

openzfsonosx@6aeb73c

but clearly that isn't good enough. I thought mpidr_el1 would work.

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 10, 2023

Can we have the asm_linkage.h changes in here, or do I need to do them in a future PR?

Oh no, I forget them again - when I had done the aproval :/

These are the changes to this PRs .S files to make them assemble:
openzfsonosx/openzfs-fork@72c67e5

Squashed together.

And here is blake3.c peppered with VERIFY:
openzfsonosx/openzfs-fork@fc96fcf

No need for this. You have to ensure, that the ctx isn't shared between the cpus. The part below can't be changed to !APPLE within kernel space... the KM_SLEEP may sleep and some state is vanished then, because entering another cpu.
This was the reason for the blake3_per_cpu_ctx variable :-/

#if defined(_KERNEL)
#ifndef APPLE
    BLAKE3_CTX *ctx = blake3_per_cpu_ctx[CPU_SEQID_UNSTABLE];
#else
    YOUR WORKAROUND HERE - when the per-cpu-context-int-there-in-apple
#endif
#else
    BLAKE3_CTX *ctx = kmem_alloc(sizeof (*ctx), KM_SLEEP);
#endif

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 10, 2023

We should include the PAGE and PAGEOFF defines for aarch64 in FreeBSD an Linux within this patchset also ...

Edit: no, we should not .... it's a bit more todo then... so it should be done with a seperate PR.

@lundman
Copy link
Contributor

lundman commented Apr 10, 2023

The part below can't be changed to !APPLE within kernel space... the KM_SLEEP may sleep and some state is vanished then, because entering another cpu.

Ah interesting. kmem_alloc() worked long enough to confirm the problem at least. I'll think about how to solve this then. mac/M1 certainly is quite happy to preempt to different cores. Does this code run with preemption enabled on Linux? Or does it work because it relies on kfpu_begin() ?

@lundman
Copy link
Contributor

lundman commented Apr 10, 2023

We should include the PAGE and PAGEOFF defines for aarch64 in FreeBSD an Linux within this patchset also ...

Ah I missed that, they are just empty on those platforms.

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 10, 2023

The part below can't be changed to !APPLE within kernel space... the KM_SLEEP may sleep and some state is vanished then, because entering another cpu.

Ah interesting. kmem_alloc() worked long enough to confirm the problem at least. I'll think about how to solve this then. mac/M1 certainly is quite happy to preempt to different cores. Does this code run with preemption enabled on Linux? Or does it work because it relies on kfpu_begin() ?

The kfpu_begin() and kfpu_end() calls are done later, so preemption should be okay.

@mcmilk mcmilk force-pushed the fix-freebsd-aarch64 branch 3 times, most recently from eb8fbe8 to b9f7987 Compare April 10, 2023 06:57
@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 10, 2023

We should include the PAGE and PAGEOFF defines for aarch64 in FreeBSD an Linux within this patchset also ...

Ah I missed that, they are just empty on those platforms.

Ui, ENTRY_ALIGN() and friends are also not defined in master currently. Should we add them also within this patchset?

Edit: currently: aarach64 is broken with these changes in the assembly.
Edit2: ENTRY_ALIGN() should be added for aarch64 in some own PR ... so it's working again.

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 10, 2023

I've uncovered another problem in my local integration of the fpu changes... it turns out that we have one point that breaks the build that I haven't thought of how to workaround.

https://github.com/openzfs/zfs/blob/master/module/icp/algs/blake3/blake3_impl.c#L33

Here we have an escape hatch for x86_64 to avoid the SIMD implementations with HAVE_SSE2, but there's no equivalent for arm64. This ends up breaking our bootloader build, where fpu_kern(9) isn't a thing and we cannot at all do float. Our amd64 loader avoids it because it just doesn't define HAVE_SSE2.

At first I thought perhaps __ARM_NEON was the solution, but I can't actually figure out how to get clang to not offer it up (even with -mfpu=none, for example). We already have one instance of #undef __aarch64__, but I'd really rather not increase that number...

(edit to add: -mgeneral-regs-only seems to turn off __ARM_NEON, which makes it a bad choice)

The attempt to use the c files was done by me also ... the highly optimized (clang 13 -O3 assembly) isn't reached by older versions and gcc ... so we will have a big performance degradation then. Also some compiler will complain about usage of xyz - because they don't or don't want understand some intrinsics ... see #14624 for an example.
Using the assembly is fine and will result in constant high performance for some specific algo... it's only used when needed (encryption,compression,checksums). We could try to use an even newer clang / gcc version for even more performance.

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 10, 2023

@lundman - Do we really all these headers with nearly the same ENTRY_ALIGN() SETSIZE() and so on definitions?

lib/libspl/include/os/freebsd/sys/ia32/asm_linkage.h
lib/libspl/include/os/linux/sys/ia32/asm_linkage.h
lib/libspl/include/os/macos/sys/ia32/asm_linkage.h

lib/libspl/include/os/freebsd/sys/aarch64/asm_linkage.h
lib/libspl/include/os/linux/sys/aarch64/asm_linkage.h
lib/libspl/include/os/macos/sys/aarch64/asm_linkage.h

include/os/freebsd/spl/sys/aarch64/asm_linkage.h
include/os/linux/spl/sys/aarch64/asm_linkage.h
include/os/macos/spl/sys/aarch64/asm_linkage.h

include/os/freebsd/spl/sys/ia32/asm_linkage.h
include/os/linux/spl/sys/ia32/asm_linkage.h
include/os/macos/spl/sys/ia32/asm_linkage.h

In your openzfs-fork repo the definitions for aarch64 (linux/freebsd) are not there currently.
For fixing FreeBSD it's maybe better to leave the b3_aarch64_sse(2|41).S in old style for now... and chnage it to unify assembly afterwards.

@lundman
Copy link
Contributor

lundman commented Apr 10, 2023

I only went that way, as we (Solaris) started with ia32/asm_linkage.h - so I assumed it was the ZFS style.

But then, the SIMD files are done with simd_aarch64.h so maybe that is the preferred style? Can someone confirm? No directory needed, so that is a bonus.

I've fixed up the asm files like 18 times to be able to compile on macOS, each time upstream changes them, so it certainly is tiring. The hope is that can end soon though, but it's deflating when even more code goes in that needs fixing.

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 10, 2023

We should include the PAGE and PAGEOFF defines for aarch64 in FreeBSD an Linux within this patchset also ...

Ah I missed that, they are just empty on those platforms.

Ui, ENTRY_ALIGN() and friends are also not defined in master currently. Should we add them also within this patchset?

Edit: currently: aarach64 is broken with these changes in the assembly.

I personally see the asm_linkage.h a bit special ... so maybe one of it to rule them all (arch/systems + kernel/user space) would be okay. But we need to define that all special cpu stuff and so on goes into simd_x86.h, simd_aarch64.h and so on.

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 10, 2023

Rebased and testings with different clang versions (11..15) - the best result was with clang-14 -O3.

Edit: Speedup on Apple M1: 1575->1600 ... only around 2% - but constant

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Apr 10, 2023
@lundman
Copy link
Contributor

lundman commented Apr 11, 2023

Starting to get a hang of the higher level of the blake3 work here, with the "stack" and pushing out work. Isn't this something that in ZFS would have traditionally been done with a taskq, just throw the work at the taskq, and it will use available threads as needed, with the usual tuning knobs? Why the special case for blake3?

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 11, 2023

Starting to get a hang of the higher level of the blake3 work here, with the "stack" and pushing out work. Isn't this something that in ZFS would have traditionally been done with a taskq, just throw the work at the taskq, and it will use available threads as needed, with the usual tuning knobs? Why the special case for blake3?

@lundman - I think fixing the macOS implementation of blake3_per_cpu_ctx[CPU_SEQID_UNSTABLE]; was not my motivation for this pull request ;-) Could we do this with some other commit?

This one fixes the x18 issue of the aarch64 assembly - I also added a link within the comment, how this assembly gets generated. The C files and also my work on this are in public domain.

@lundman
Copy link
Contributor

lundman commented Apr 11, 2023

Yep, skip over that extra, I'll handle that separately.

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 14, 2023

@lundman - please test again in macOS - I think it should work now (fix: x29 must be used as frame pointer register)

The x18 register isn't useable within FreeBSD kernel space, so we
have to fix the BLAKE3 aarch64 assembly for not using it.

The source files are here: https://github.com/mcmilk/BLAKE3-tests

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
@mjguzik
Copy link
Contributor

mjguzik commented Apr 26, 2023

what's the status here?

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 26, 2023

what's the status here?

I tested this on Linux and FreeBSD 13 in kernel space, but not FreeBSD 14 currently.

@behlendorf
Copy link
Contributor

I tested this on Linux and FreeBSD 13 in kernel space, but not FreeBSD 14 currently.

Thanks for the update. Then I suggest we get this integrated unless you'd prefer we wait until it can also be tested on FreeBSD 14.

@mcmilk
Copy link
Contributor Author

mcmilk commented Apr 26, 2023

I tested this on Linux and FreeBSD 13 in kernel space, but not FreeBSD 14 currently.

Thanks for the update. Then I suggest we get this integrated unless you'd prefer we wait until it can also be tested on FreeBSD 14.

It should work their also, I wanted to check this first... but my time says no this.

@kevans91
Copy link
Contributor

This boots fine on main (14) w/ kfpu_allowed() flipped back to 1.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Apr 26, 2023
@behlendorf
Copy link
Contributor

@kevans91 thanks for test booting this!

@kevans91
Copy link
Contributor

Yup! JFYI- I committed https://cgit.freebsd.org/src/commit/?id=ce5a210997da3c4 which is a little different than what we upstreamed, and I don't recall if we've mentioned the problem or not -- we have to special-case _STANDALONE on aarch64 because blake3_impl.c offers a way to avoid compiling it for x86 (HAVE_SSE2) but not for aarch64.

I'm hoping that we can fix that so that the bootloader doesn't need any of this code that will never run there, then we can just revert back to always providing kfpu_begin()/kfpu_end() to match simd_x86.h. @bsdimp has taken a bit of a look at it, but I don't think he has a patch to propose at the moment.

@behlendorf behlendorf merged commit ee72800 into openzfs:master Apr 26, 2023
22 of 25 checks passed
@kevans91
Copy link
Contributor

kevans91 commented Apr 27, 2023

Thanks for merging this! Somewhat unrelated, can someone help me understand the format of the vfs.zfs.blake3_impl sysctl?

vfs.zfs.blake3_impl: cycle [fastest] generic sse2 sse41 

I don't understand what "cycle" and "[fastest]" are trying to convey, they're always there and implementations are listed in the same order. My understanding is that one of them should be the fastest, but I can't tell that from this sysctl (ditto for the sha*_impl sysctls, it seems).

@kevans91
Copy link
Contributor

Looking at the code for that, I suspect that's actually a bug and that one of generic, sse2, sse41 should have been [bracketed] while the rest appear with no styling and the words "cycle" / "[fastest]" not appearing at all.

@rincebrain
Copy link
Contributor

It works the same as fletcher_4_impl, or zfs_vdev_raidz_impl.

cycle cycles through them all, fastest picks whichever one microbenchmarked fastest on module load (see chksum_bench), and explicitly setting one of the others will force it to use that one instead. It defaults to fastest for reasons which may be obvious.

@mcmilk mcmilk deleted the fix-freebsd-aarch64 branch April 27, 2023 09:15
@mjguzik
Copy link
Contributor

mjguzik commented Apr 27, 2023

I think the complaint is that this does not explain which one benchmarked as the fastest, so you don't know which one you are using.

Personally I can't be arsed to do anything here and I don't think it warrants work modulo terminal boredom, but a patched report would be nice. For example:
vfs.zfs.blake3_impl.backends: // list sorted by speed comes here, perhaps with ratios in brackets to the slowest variant, say: foo (2x) bar (1.3x) baz (1x)
vfs.zfs.blake3_impl.selected: // self-explanatory
vfs.zfs.blake3_impl.policy: // normally reports "fastest", but autochanges to "custom" or similar if algo is hand-picked

also the _impl suffix really should not be there and blake3 should probably be under vfs.zfs.crypto. or similar to avoid namespace clashes.

@kevans91
Copy link
Contributor

It works the same as fletcher_4_impl, or zfs_vdev_raidz_impl.

cycle cycles through them all, fastest picks whichever one microbenchmarked fastest on module load (see chksum_bench), and explicitly setting one of the others will force it to use that one instead. It defaults to fastest for reasons which may be obvious.

ahh, ok, thanks! yeah, as mjg noted my hope was that this sysctl would show me which one had benchmarked the fastest; information I don't really know how to get otherwise. The described behavior does make sense in hindsight.

@behlendorf
Copy link
Contributor

On Linux you can view the micro benchmark results under /proc/spl/kstat/zfs/fletcher_4_bench including what was determined to be the fastest. One significant complication is that a "fastest" implementation is selected for both the native and byteswap code paths and these may be different.

# cat /proc/spl/kstat/zfs/fletcher_4_bench 
0 0 0x01 -1 0 7499279922389279 7499293276686735
implementation   native         byteswap       
scalar           6388458351     5009553771     
superscalar      8100951319     6130898497     
superscalar4     7625799296     5727904645     
sse2             10511346368    6149915690     
ssse3            10324161068    9369312698     
avx2             17409212184    14667001202    
fastest          avx2           avx2      

The same is true for the raidz benchmark results which measure each of the parity generation and reconstruction implementations. In this case they're all the same, but that's not always the case.

# cat /proc/spl/kstat/zfs/vdev_raidz_bench 
15 0 0x01 -1 0 7499280026994935 7499309300788810
implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr         
original         426453231       244379597       97729470        1277542228      268227858       39766321        95675800        23387251        23229562        15075042        
scalar           1514673865      356636345       164221321       1509497478      484860760       331000932       202724277       166963142       105276313       93926265        
sse2             2752273969      1045242466      513820091       2707823574      860589100       679029086       438582380       403552477       245874854       117310136       
ssse3            2745654092      1044979690      512685717       2699201410      1415780849      1070443009      840369754       690772874       507568163       385384470       
avx2             4443652484      1419984749      850036931       4276440548      2581496504      2075772283      1245326145      1185238021      906576713       680949928       
fastest          avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2   

andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request May 1, 2023
The x18 register isn't useable within FreeBSD kernel space, so we
have to fix the BLAKE3 aarch64 assembly for not using it.

The source files are here: https://github.com/mcmilk/BLAKE3-tests

Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes openzfs#14728
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants