sha256 x86_64 optimization v2 #2351

Closed

tuxoko
Contributor

@tuxoko tuxoko commented May 30, 2014

This is a revision of #2332

Currently, the optimization only applies to kernel space,
because I haven't figured out how to do it properly in user space.

@kernelOfTruth
Contributor

In file included from /var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module/zfs/../../module/zfs/sha256.c:29:0:
/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/include/sys/sha256.h:5:24: fatal error: asm/sha256.h: No such file or directory
#include <asm/sha256.h>
^
compilation terminated.

probably should be

#include <asm-generic/sha256.h>
#include <asm-x86_64/sha256.h>

instead of

#include <asm/sha256.h>

edit:

not sure why github is swallowing the text and not displaying it:

http://pastebin.com/bk7b4H4N

@chrisrd
Contributor

chrisrd commented Jul 14, 2014

@kernelOfTruth Re: "not sure why github is swallowing the text and not displaying it": in GitHub markup, surround a literal block with three backquotes (```) on separate lines, e.g.:

In file included from /var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module/zfs/../../module/zfs/sha256.c:29:0:
/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/include/sys/sha256.h:5:24: fatal error: asm/sha256.h: No such file or directory
 #include <asm/sha256.h>
                        ^
compilation terminated.


probably should be

#include <asm-generic/sha256.h>
#include <asm-x86_64/sha256.h>

instead of

#include <asm/sha256.h>

@kernelOfTruth
Contributor

@chrisrd thank you very much for this information 👍

@tuxoko
Contributor Author

tuxoko commented Jul 14, 2014

@kernelOfTruth
How do you build your package?
Do you have the full configure and build logs?
The <asm/sha256.h> should be copied from the target arch's directory during configure.
I'm not sure why it didn't happen for you.

@kernelOfTruth
Contributor

@tuxoko sorry for the delay

building via the Gentoo package manager

manually unpacking (ebuild zfs-kmod-9999.ebuild unpack / ebuild zfs-9999.ebuild unpack)

patching it in

then

ebuild zfs-kmod-9999.ebuild compile install qmerge

will post the log later

@kernelOfTruth
Contributor

oops, sorry,

wrong log caused by permissions,

will post the correct one later - mea culpa :(

@kernelOfTruth
Contributor


patch -p1 < /usr/src/sources/zfs/current/27.07.2014_sha256\ x86_64\ optimization\ v2_2351/27.07.2014_sha256\ x86_64\ optimization\ v2_2351.diff 
patching file config/user-arch.m4
patching file configure.ac
patching file include/.gitignore
patching file include/Makefile.am
patching file include/asm-generic/Makefile.am
patching file include/asm-generic/sha256.h
patching file include/asm-x86_64/Makefile.am
patching file include/asm-x86_64/sha256.h
patching file include/sys/Makefile.am
Hunk #1 succeeded at 37 with fuzz 1 (offset 1 line).
patching file include/sys/sha256.h
patching file include/sys/zio_checksum.h
patching file lib/libzpool/Makefile.am
Hunk #1 succeeded at 125 (offset 2 lines).
patching file module/.gitignore
patching file module/Makefile.in
patching file module/zfs/Makefile.in
Hunk #1 succeeded at 99 (offset 2 lines).
patching file module/zfs/asm-x86_64/Makefile.in
patching file module/zfs/asm-x86_64/sha256-avx-asm.S
patching file module/zfs/asm-x86_64/sha256-avx2-asm.S
patching file module/zfs/asm-x86_64/sha256-ssse3-asm.S
patching file module/zfs/asm-x86_64/sha256_x86_64.c
patching file module/zfs/sha256.c
patching file module/zfs/spa_misc.c
Hunk #1 succeeded at 1662 with fuzz 2 (offset 3 lines).
patching file module/zfs/zio_checksum.c

applying the patch:


In file included from /var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module/zfs/../../module/zfs/sha256.c:29:0:
/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/include/sys/sha256.h:5:24: fatal error: asm/sha256.h: No such file or directory
 #include <asm/sha256.h>
                        ^
compilation terminated.
/usr/src/linux-3.14.14_btrfs_test29/scripts/Makefile.build:308: recipe for target '/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module/zfs/../../module/zfs/sha256.o' failed
make[6]: *** [/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module/zfs/../../module/zfs/sha256.o] Error 1
make[6]: *** Waiting for unfinished jobs....
/usr/src/linux-3.14.14_btrfs_test29/scripts/Makefile.build:455: recipe for target '/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module/zfs' failed
make[5]: *** [/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module/zfs] Error 2
/usr/src/linux-3.14.14_btrfs_test29/Makefile:1277: recipe for target '_module_/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module' failed
make[4]: *** [_module_/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module] Error 2
Makefile:133: recipe for target 'sub-make' failed
make[3]: *** [sub-make] Error 2
make[3]: Leaving directory '/usr/src/linux-3.14.14_btrfs_test29'
Makefile:19: recipe for target 'modules' failed
make[2]: *** [modules] Error 2
make[2]: Leaving directory '/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module'
Makefile:675: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999'
Makefile:543: recipe for target 'all' failed
make: *** [all] Error 2
 * ERROR: sys-fs/zfs-kmod-9999::gentoo failed (compile phase):
 *   emake failed
 * 
 * If you need support, post the output of `emerge --info '=sys-fs/zfs-kmod-9999::gentoo'`,
 * the complete build log and the output of `emerge -pqv '=sys-fs/zfs-kmod-9999::gentoo'`.
 * The complete build log is located at '/var/log/portage/sys-fs:zfs-kmod-9999:20140727-211202.log'.
 * For convenience, a symlink to the build log is located at '/var/tmp/portage/sys-fs/zfs-kmod-9999/temp/build.log'.
 * The ebuild environment file is located at '/var/tmp/portage/sys-fs/zfs-kmod-9999/temp/environment'.
 * Working directory: '/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999'
 * S: '/var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999'

full build log:

http://pastebin.com/RxA81VhW

@kernelOfTruth
Contributor

out of /var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/ :

grep -iR "asm/sha256.h" .
./include/sys/sha256.h:#include <asm/sha256.h>
find . | grep sha256.h
./include/asm-generic/sha256.h
./include/asm-x86_64/sha256.h
./include/sys/sha256.h

after replacing

#include <asm/sha256.h>

in

include/sys/sha256.h

with

#include <asm-generic/sha256.h>
#include <asm-x86_64/sha256.h>

it compiles fine

is there a way to check if this optimization is in use ?

@ryao
Contributor

ryao commented Oct 26, 2014

@kernelOfTruth A simple way is to profile using perf and see what symbols are in use.

http://wiki.gentoo.org/wiki/ZFSOnLinux_Development_Guide#Generating_a_Flame_Graph_with_Perf

Another way is to attach gdb to your kernel and check the value of sha256_transform_asm against the various routines.

@tuxoko I am in a position to test the AVX2 routine, although I might not find time to do that this week. Also, this would benefit from further revision when Broadwell debuts the new sha256 instructions:

https://software.intel.com/en-us/articles/intel-sha-extensions

@@ -125,3 +132,51 @@ zio_checksum_SHA256(const void *buf, uint64_t size, zio_cksum_t *zcp)
(uint64_t)H[4] << 32 | H[5],
(uint64_t)H[6] << 32 | H[7]);
}

void (*sha256_transform)(const void *, uint32_t *, uint64_t);
Contributor

Using a writeable function pointer here will break PaX builds. The PaX plugin was designed to break builds when writeable function pointers like these are used because they can be written with the address of arbitrary code. It would be better to have an enum that is set based on the test results and then used to select the correct function via a switch statement.
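
The enum-plus-switch idea might look roughly like the sketch below. Every name in it (`sha256_impl_t`, `sha256_select_impl`, the stub transforms) is hypothetical, and the stubs only record which routine ran; it is meant to illustrate the dispatch shape, not the code in this pull request.

```c
#include <stdint.h>

/* Hypothetical identifiers for the available implementations. */
typedef enum {
	SHA256_IMPL_GENERIC = 0,
	SHA256_IMPL_SSSE3,
	SHA256_IMPL_AVX,
	SHA256_IMPL_AVX2
} sha256_impl_t;

/* Plain data, not a code address: set once from the CPU feature tests. */
static sha256_impl_t sha256_impl = SHA256_IMPL_GENERIC;

/*
 * Stubs standing in for the real transform routines. Each one only records
 * its own identifier in H[0] so that the dispatch can be observed.
 */
static void
sha256_transform_generic(const void *buf, uint32_t *H, uint64_t blks)
{
	(void) buf; (void) blks;
	H[0] = SHA256_IMPL_GENERIC;
}

static void
sha256_transform_ssse3(const void *buf, uint32_t *H, uint64_t blks)
{
	(void) buf; (void) blks;
	H[0] = SHA256_IMPL_SSSE3;
}

static void
sha256_transform_avx(const void *buf, uint32_t *H, uint64_t blks)
{
	(void) buf; (void) blks;
	H[0] = SHA256_IMPL_AVX;
}

static void
sha256_transform_avx2(const void *buf, uint32_t *H, uint64_t blks)
{
	(void) buf; (void) blks;
	H[0] = SHA256_IMPL_AVX2;
}

/* Pick the most capable implementation the (assumed) feature flags allow. */
void
sha256_select_impl(int has_ssse3, int has_avx, int has_avx2)
{
	if (has_avx2)
		sha256_impl = SHA256_IMPL_AVX2;
	else if (has_avx)
		sha256_impl = SHA256_IMPL_AVX;
	else if (has_ssse3)
		sha256_impl = SHA256_IMPL_SSSE3;
	else
		sha256_impl = SHA256_IMPL_GENERIC;
}

/* Every call is a direct call chosen by the switch; no pointer is stored. */
void
sha256_transform(const void *buf, uint32_t *H, uint64_t blks)
{
	switch (sha256_impl) {
	case SHA256_IMPL_AVX2:
		sha256_transform_avx2(buf, H, blks);
		break;
	case SHA256_IMPL_AVX:
		sha256_transform_avx(buf, H, blks);
		break;
	case SHA256_IMPL_SSSE3:
		sha256_transform_ssse3(buf, H, blks);
		break;
	case SHA256_IMPL_GENERIC:
	default:
		sha256_transform_generic(buf, H, blks);
		break;
	}
}
```

The selector is still writable data, but overwriting it can only pick among the routines that were compiled in, not redirect the call to an arbitrary address.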

@ryao
Contributor

ryao commented Oct 26, 2014

The sha256 checksums are calculated in such a way that we generate big endian versions of them. Do the Intel routines provided also do that? If not, we will need to do byte swapping to fix that. Otherwise, we risk introducing a disk format change.
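
To make the byte-order concern concrete, here is a small sketch. It assumes the standard SHA-256 convention that message words are read big-endian; `load_be32` and `pack_digest` are hypothetical helper names, and the packing mirrors the `(uint64_t)H[4] << 32 | H[5]` pattern visible in the diff context. Any optimized routine would have to leave the same values in `H[]` for the stored checksum to stay unchanged.

```c
#include <stdint.h>

/* Read a 32-bit big-endian word from a byte stream, on any host. */
static inline uint32_t
load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	    (uint32_t)p[2] << 8 | (uint32_t)p[3]);
}

/* Combine the eight state words pairwise into four 64-bit words. */
static inline void
pack_digest(const uint32_t H[8], uint64_t out[4])
{
	for (int i = 0; i < 4; i++)
		out[i] = (uint64_t)H[2 * i] << 32 | H[2 * i + 1];
}
```

Because both helpers work on values rather than on memory layout, they give the same result on little-endian and big-endian hosts.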

Also, it might be worth considering whether we could have the compiler generate "optimized" versions of this routine for different CPUs. I modified our current sha256.c so that GCC can generate assembly from a single file, added the static inline suggestion that I made, and built it with gcc -S -O3 -fno-stack-protector -march=core-avx2 sha256.c. The resulting sha256.s is not perfect, but it is rather good:

http://dpaste.com/24NZ4K6
http://dpaste.com/28EPAEP

An alternative way of achieving what this pull request aims to do, without handwritten assembly, would be to split sha256.c into two files, sha256-base.c and sha256-generic.c. The former would contain the logic for switching between implementations, while the latter would be built with different compiler invocations to obtain the same routine for different CPUs. We would change the name of the function in each build via a CPP switch (e.g. -DSHA256_NAME=sha256_transform_avx2). Then we could link it all together and get a similar effect to assembly, with the benefit that we can include custom versions for as many CPUs as we want.

It would be interesting to do benchmarks to see if the handwritten assembly is noticeably faster than the GCC output. If it is not, then we could avoid adding hand written assembly, yet receive the benefits in a way that could be adapted to other ISAs without the need for one of us to understand the ISA.
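
A rough sketch of what the sha256-generic.c side of that split could look like follows. The default name `sha256_transform_generic`, the object file names, and the loop body are assumptions for illustration; the body is a placeholder that folds bytes into the state words and is not SHA-256.

```c
/*
 * sha256-generic.c (sketch): one source file, compiled once per target CPU.
 * Example invocations (object names are assumptions):
 *   gcc -O3 -c sha256-generic.c -o sha256-generic.o
 *   gcc -O3 -march=core-avx2 -DSHA256_NAME=sha256_transform_avx2
 *       -c sha256-generic.c -o sha256-avx2.o
 */
#include <stdint.h>

#ifndef SHA256_NAME
#define	SHA256_NAME sha256_transform_generic
#endif

/*
 * Placeholder body, NOT SHA-256: it folds each 64-byte block into the
 * eight state words so that the naming mechanism can be shown compactly.
 */
void
SHA256_NAME(const void *buf, uint32_t *H, uint64_t blks)
{
	const uint8_t *p = buf;

	for (uint64_t b = 0; b < blks; b++)
		for (int i = 0; i < 64; i++)
			H[i & 7] += p[b * 64 + i];
}
```

Each object file then exports a differently named symbol built from identical source, and sha256-base.c would only need prototypes for the names it wants to choose between.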

@ryao
Contributor

ryao commented Oct 26, 2014

It occurs to me that we could tell the compiler to build the existing SHA256 routine with SSE2 instructions on amd64. The kernel's build system explicitly tells the compiler not to do this because we need to use kernel_fpu_begin()/kernel_fpu_end() to make it safe. SSE2 is always available on amd64 processors and we are already talking about using kernel_fpu_begin()/kernel_fpu_end(), so there is little reason not to use it to accelerate the common case.

#endif

if (sha256_transform_asm)
sha256_transform = arch_sha256_transform;
Contributor

I realize that we currently do not support realtime kernels, but that support will come as soon as someone writes patches for it. kernel_fpu_begin()/kernel_fpu_end() turn off interrupts inside the critical section. Turning off interrupts for any appreciable amount of time is undesirable on realtime systems. We likely should have a module option to allow optimizations to be disabled on such systems. It might even be better to disable them by default on realtime kernels.

@ryao
Contributor

ryao commented Oct 26, 2014

@tuxoko The following documentation should be useful for implementing these routines in userspace:

Section 6.4:
http://www.agner.org/optimize/calling_conventions.pdf

Section 15.1:
http://www.agner.org/optimize/optimizing_assembly.pdf

I have Haswell hardware that I can use for testing, although it should also be possible to do testing with QEMU.

0x8d5651e46d3cdb76, 0x2d02d0bf37c9e592 }},
};

static void sha256_test(void)
Contributor

Introducing a self-test routine is an excellent idea. However, it is important to understand that the generic routine is restricted to a subset of instructions that operate on the normal integer registers, such that the likelihood of a defect that only affects checksums is small. This allowed us to avoid introducing a self check in the past, yet still be relatively safe.

Introducing optimized assembly changes things because we begin exercising transistors that often go unused during normal operation. This dramatically increases the risk of a CPU defect affecting the checksum routines. Having a self test occur only during debug builds means that the vast majority of ZoL installations will have nothing to guard against such defects. At the same time, using preprogrammed data to do a comparison risks passing CPUs with hypothetical defects that affect only some byte sequences and not others. We could be running on a system where all but 1 CPU core is good, so only checking 1 core like we do here would miss it. We could even begin the test on a bad CPU core, but fail to detect it because we are rescheduled to a good CPU core.

With those things in mind, I would like to see some changes:

  1. A self test should be done whenever we use optimized assembly, including in non-debug builds. Debug builds should also keep running it when the non-optimized routine is in use, as you do here.
  2. When we detect that we can use the optimized assembly routine, we should perform a second self-check routine that initializes a buffer with random data, calculates the hash with the generic routine, calculates the hash with the optimized routine, and compares the results. This is intended to provide some protection against defects that would get past a static input.
  3. We need to do an Illumos-style xcall to run this check on all available CPU cores. We would want to implement the xcall infrastructure in the SPL using Linux's on_each_cpu() routine. Code operating in that context will operate with interrupts disabled, so there is no risk of being rescheduled (and failing to test CPU cores) like we have here.
  4. There exist Linux systems that support hotpluggable CPUs, so we should detect the addition of CPU cores to a system so that we can test them. I do not know how to do this offhand, so it needs investigation.
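
Item 2 above could be sketched in userspace as follows. This is only an illustration under stated assumptions: `cksum_selftest` and the three toy routines are hypothetical names, the toy checksum is a Fletcher-like running sum rather than SHA-256, and `rand()` stands in for whatever random source the kernel code would use. Running the check on every CPU core and handling hotplugged CPUs (items 3 and 4) are not covered.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void (*cksum_fn_t)(const void *buf, uint64_t size, uint64_t out[4]);

/* Toy reference checksum (Fletcher-like running sums), not SHA-256. */
static void
toy_reference(const void *buf, uint64_t size, uint64_t out[4])
{
	const uint8_t *p = buf;
	uint64_t a = 0, b = 0, c = 0, d = 0;

	for (uint64_t i = 0; i + 4 <= size; i += 4) {
		uint32_t w;
		memcpy(&w, p + i, sizeof (w));
		a += w; b += a; c += b; d += c;
	}
	out[0] = a; out[1] = b; out[2] = c; out[3] = d;
}

/* Toy "optimized" variant: same sums, two words per loop iteration. */
static void
toy_optimized(const void *buf, uint64_t size, uint64_t out[4])
{
	const uint8_t *p = buf;
	uint64_t a = 0, b = 0, c = 0, d = 0;
	uint64_t i = 0;
	uint32_t w0, w1;

	for (; i + 8 <= size; i += 8) {
		memcpy(&w0, p + i, sizeof (w0));
		memcpy(&w1, p + i + 4, sizeof (w1));
		a += w0; b += a; c += b; d += c;
		a += w1; b += a; c += b; d += c;
	}
	for (; i + 4 <= size; i += 4) {
		memcpy(&w0, p + i, sizeof (w0));
		a += w0; b += a; c += b; d += c;
	}
	out[0] = a; out[1] = b; out[2] = c; out[3] = d;
}

/* Deliberately defective variant, used only to show that the test fails. */
static void
toy_defective(const void *buf, uint64_t size, uint64_t out[4])
{
	toy_reference(buf, size, out);
	out[0] ^= 1;
}

/* Returns 0 if both routines agree on every random buffer, -1 otherwise. */
int
cksum_selftest(cksum_fn_t ref, cksum_fn_t opt, unsigned rounds, unsigned seed)
{
	static uint8_t buf[4096];
	uint64_t want[4], got[4];

	srand(seed);
	for (unsigned r = 0; r < rounds; r++) {
		for (size_t i = 0; i < sizeof (buf); i++)
			buf[i] = (uint8_t)(rand() & 0xff);
		ref(buf, sizeof (buf), want);
		opt(buf, sizeof (buf), got);
		if (memcmp(want, got, sizeof (want)) != 0)
			return (-1);
	}
	return (0);
}
```

A caller would fall back to the generic routine whenever `cksum_selftest()` returns a nonzero value for a candidate; `toy_defective` exists only to demonstrate that a mismatch is reported.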

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Dec 13, 2014
This is a revision of openzfs#2332

Currently, the optimization only applies to kernel space,
because I haven't figured out how to do it properly in user space.

AVX2 is untested because I don't have such a CPU.
So use it at your own discretion.
@pavel-odintsov

Hello, folks!

Any progress in this issue?

@behlendorf behlendorf added the Type: Feature Feature request or new feature label Jan 16, 2015
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Feb 14, 2015
This is a revision of openzfs#2332

Currently, the optimization only applies to kernel space,
because I haven't figured out how to do it properly in user space.

AVX2 is untested because I don't have such a CPU.
So use it at your own discretion.
kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Feb 15, 2015
… v2)

In file included from /var/tmp/portage/sys-fs/zfs-kmod-9999-r1/work/zfs-kmod-9999/module/zfs/../../module/zfs/sha256.c:29:0:
/var/tmp/portage/sys-fs/zfs-kmod-9999-r1/work/zfs-kmod-9999/include/sys/sha256.h:5:24: fatal error: asm/sha256.h: No such file or directory
 #include <asm/sha256.h>
@sempervictus
Contributor

@tuxoko: is there any chance you have a version of this which is compatible with the abd_next branch?
I also got an error about bool being an undefined type which required that i

#include <stdbool.h>

in order to bypass it - not sure how "allowed" that is, or why i'm seeing it on my end.

@tuxoko
Contributor Author

tuxoko commented Mar 10, 2016

@sempervictus
Which file? I think the correct way in kernel is to #include <linux/types.h>

@kernelOfTruth
Contributor

zfs module import is failing due to:

zfs: Unknown symbol arch_sha256_init (err 0)

@tuxoko
Contributor Author

tuxoko commented Mar 10, 2016

@kernelOfTruth
Building kmod would fail. I have no idea why, but the Makefile.in inside asm-x86_64 doesn't get transformed into a Makefile.

@tuxoko
Contributor Author

tuxoko commented Mar 10, 2016

Fixed kmod build. But in-tree still fails.

@tuxoko tuxoko force-pushed the asm2 branch 2 times, most recently from fd01b7e to 8baa4a5 Compare March 10, 2016 22:25
@tuxoko
Contributor Author

tuxoko commented Mar 10, 2016

Fix typo in kernel_fpu_end and fix build error in linux 4.5

@behlendorf
Contributor

@tuxoko it would be great if you could rebase this patch on the work @ironMann has done in #4381. That would help give us a good idea if the proposed generic interfaces are going to meet our needs.

@tuxoko
Contributor Author

tuxoko commented Mar 21, 2016

Rebase to master.

XTMP3 = %ymm3
XTMP4 = %ymm8
XFER = %ymm9
XTMP5 = %ymm11
Contributor

This will run only in 64-bit mode (registers ymm8 through ymm15 are used). Such code should be
protected with:

#include <sys/isa_defs.h>
#if defined(HAVE_AVX2) && defined(__x86_64)

Contributor Author

It is only built in x86_64, see module/zfs/Makefile.in

Contributor

Right, my case for the ifdefs was incomplete.
Having ifdef(HAVE_AVX2) around the code will keep old compilers and binutils (gcc older than 4.7) from trying to assemble it and choking on unknown instructions.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
@tuxoko
Contributor Author

tuxoko commented Jun 3, 2016

Update: rebase to master, cleanup code, add module parameter to choose algo like in fletcher, add benchmark to select fastest during init.
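
The "benchmark to select fastest during init" step might be shaped like the userspace sketch below. The thread does not show the actual benchmark code, so everything here is an assumption: the table and function names are hypothetical, `clock()` stands in for a kernel timer, and the two toy routines are placeholders rather than SHA-256 implementations.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*transform_fn_t)(const void *, uint32_t *, uint64_t);

typedef struct {
	const char *name;
	transform_fn_t fn;
} sha256_impl_desc_t;

/* Two toy routines standing in for real implementations; not SHA-256. */
static void
toy_bytewise(const void *buf, uint32_t *H, uint64_t blks)
{
	const uint8_t *p = buf;

	for (uint64_t i = 0; i < blks * 64; i++)
		H[i & 7] += p[i];
}

static void
toy_unrolled(const void *buf, uint32_t *H, uint64_t blks)
{
	const uint8_t *p = buf;

	for (uint64_t i = 0; i < blks * 64; i += 8)
		for (int j = 0; j < 8; j++)
			H[j] += p[i + j];
}

/* Read-only table: the candidates are fixed at build time. */
static const sha256_impl_desc_t sha256_impls[] = {
	{ "bytewise", toy_bytewise },
	{ "unrolled", toy_unrolled },
};
#define	SHA256_NIMPLS (sizeof (sha256_impls) / sizeof (sha256_impls[0]))

/* Time each candidate over the same buffer; return the fastest one's index. */
size_t
sha256_pick_fastest(void)
{
	static uint8_t buf[64 * 1024];
	volatile uint32_t sink;	/* keeps the work from being optimized away */
	size_t best = 0;
	clock_t best_time = 0;

	for (size_t i = 0; i < SHA256_NIMPLS; i++) {
		uint32_t H[8] = { 0 };
		clock_t start = clock();

		for (int round = 0; round < 32; round++)
			sha256_impls[i].fn(buf, H, sizeof (buf) / 64);
		clock_t elapsed = clock() - start;
		sink = H[0];

		if (i == 0 || elapsed < best_time) {
			best = i;
			best_time = elapsed;
		}
	}
	(void) sink;
	return (best);
}
```

A real version would list only the implementations the running CPU supports and would presumably let a module parameter override the measured choice, as the update describes.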

Add ssse3, avx, avx2 optimized sha256. During module init, the fastest
available version will be selected.

Currently, we only support optimization in kernel space. User programs will
use generic code.

Note: The sha256-{ssse3,avx,avx2}-asm.S files are from linux-3.14.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
@tcaputi
Contributor

tcaputi commented Jun 6, 2016

Hello. I know I am coming into this PR very, very late, but I just noticed it. I just wanted to make sure you guys were aware of PR #4329 for ZFS encryption. The first commit in that PR ports the crypto API from Illumos to a ZoL kernel module. This code includes a sha256 implementation with x86_64 assembly that compiles in both userspace and kernel space. In the current PR, I have not replaced the existing sha256 code in an effort to limit the scope of the PR (which is already quite sizable). However, this would be very easy to add (probably about an hour's worth of time, 45 minutes of which would just be verifying I didn't break anything on big endian systems). I certainly don't mean to step on anybody's toes, but would it make sense to look at doing this? It might not considering that the encryption patch might take a while to get merged (considering its size).

@behlendorf behlendorf mentioned this pull request Jun 6, 2016
@behlendorf
Contributor

@tcaputi thanks for commenting, I've posted a more detailed comment in #4329 about this. The short version is as a first step toward ZFS encryption let's get the crypto framework merged and a few smaller changes which leverage it. That'll help us shake out any issues. I suspect we'll want to use the vectorized sha256 version implemented here when available. @tuxoko do you have any benchmark results for this?

@tuxoko
Contributor Author

tuxoko commented Jun 7, 2016

@behlendorf
IIRC, the benchmark is something like this on a Haswell i7:
generic ~240MiB/s
ssse3, avx ~390MiB/s
avx2 ~460MiB/s

@behlendorf
Contributor

@tuxoko now that vectorized fletcher, raidz, and the crypto framework are all in master, I think it would be a good time to rebase this so we can get it finalized and merged. The straightforward thing to do is probably just to extend your existing patch to include the sha256 implementation from the icp module as an option. It would be good to fix it up so it builds in user space as well, like the similar fletcher code.

@behlendorf
Contributor

Closing for now to minimize the number of open action PRs. It can be reopened when someone has time to work on this.

@sempervictus
Contributor

@behlendorf: any chance of revisiting this, or implementing something newer than the rather old OpenSSL derived functions?

@behlendorf
Contributor

@sempervictus I'd love to see this implemented if someone has the time to work on it.

@jumbi77 jumbi77 mentioned this pull request Sep 18, 2021
13 tasks