sha256 x86_64 optimization v2 #2351
Conversation
In file included from /var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/module/zfs/../../module/zfs/sha256.c:29:0: probably should be #include <asm-generic/sha256.h> instead of #include <asm/sha256.h>

edit: not sure why github is swallowing the text and not displaying it:
@kernelOfTruth "not sure why github is swallowing the text and not displaying it" GitHub markup: for literal block text, surround the block with 3 back-quotes (```) on separate lines.

@chrisrd thank you very much for this information 👍

@kernelOfTruth
@tuxoko sorry for the delay. Building via the Gentoo package manager: manually unpacking (ebuild zfs-kmod-9999.ebuild unpack / ebuild zfs-9999.ebuild unpack), patching it in, then ebuild zfs-kmod-9999.ebuild compile install qmerge. Will post the log later.

oops, sorry, wrong log caused by permissions, will post the correct one later - mea culpa :(
applying the patch:

full build log:

out of /var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/ :

after replacing

in include/sys/sha256.h with

it compiles fine. Is there a way to check if this optimization is in use?
@kernelOfTruth A simple way is to profile using perf and see what symbols are in use: http://wiki.gentoo.org/wiki/ZFSOnLinux_Development_Guide#Generating_a_Flame_Graph_with_Perf Another way is to attach gdb to your kernel and check the value of sha256_transform_asm against the various routines.

@tuxoko I am in a position to test the AVX2 routine, although I might not find time to do that this week. Also, this would benefit from further revision when Broadwell debuts the new sha256 instructions: https://software.intel.com/en-us/articles/intel-sha-extensions
@@ -125,3 +132,51 @@ zio_checksum_SHA256(const void *buf, uint64_t size, zio_cksum_t *zcp)
	    (uint64_t)H[4] << 32 | H[5],
	    (uint64_t)H[6] << 32 | H[7]);
}

void (*sha256_transform)(const void *, uint32_t *, uint64_t);
Using a writeable function pointer here will break PaX builds. The PaX plugin was designed to break builds when writeable function pointers like these are used because they can be written with the address of arbitrary code. It would be better to have an enum that is set based on the test results and then used to select the correct function via a switch statement.
The sha256 checksums are calculated in such a way that we generate big endian versions of them. Do the Intel routines provided also do that? If not, we will need to do byte swapping to fix that; otherwise, we risk introducing a disk format change.

Also, it might be worth considering whether we could have the compiler generate "optimized" versions of this routine against different CPUs for us. I modified our current http://dpaste.com/24NZ4K6

An alternative way of achieving what this pull request aims to do without using handwritten assembly would be to split sha256.c into two files, sha256-base.c and sha256-generic.c. The former would contain the logic for switching between implementations, while the latter would be used with different compiler invocations to obtain the same routine built for different CPUs. We would change the name of the function on each via a CPP switch (e.g. -DSHA256_NAME=sha256_transform_avx2). Then we could link it all together and get a similar effect to assembly, with the benefit that we can include custom versions for as many CPUs as we want.

It would be interesting to do benchmarks to see if the handwritten assembly is noticeably faster than the GCC output. If it is not, then we could avoid adding handwritten assembly, yet receive the benefits in a way that could be adapted to other ISAs without the need for one of us to understand the ISA.
It occurs to me that we could tell the compiler to build the existing SHA256 routine with SSE2 instructions on amd64. The kernel's build system explicitly tells the compiler not to do this because we need to use
#endif

if (sha256_transform_asm)
	sha256_transform = arch_sha256_transform;
I realize that we currently do not support realtime kernels, but that support will come as soon as someone writes patches for it. kernel_fpu_begin()/kernel_fpu_end() turn off interrupts inside the critical section. Turning off interrupts for any appreciable amount of time is undesirable on realtime systems. We likely should have a module option to allow optimizations to be disabled on such systems. It might even be better to disable them by default on realtime kernels.
@tuxoko The following documentation should be useful for implementing these routines in userspace:

Section 6.4:
Section 15.1:

I have Haswell hardware that I can use for testing, although it should also be possible to do testing with QEMU.
	0x8d5651e46d3cdb76, 0x2d02d0bf37c9e592 }},
};

static void sha256_test(void)
Introducing a self-test routine is an excellent idea. However, it is important to understand that the generic routine is restricted to a subset of instructions that operate on the normal integer registers, such that the likelihood of a defect that only affects checksums is small. This allowed us to avoid introducing a self check in the past, yet still be relatively safe.
Introducing optimized assembly changes things because we begin exercising transistors that often go unused during normal operation. This dramatically increases the risk of a CPU defect affecting the checksum routines. Having a self test occur only during debug builds means that the vast majority of ZoL installations will have nothing to guard against such defects. At the same time, using preprogrammed data to do a comparison risks passing CPUs with hypothetical defects that affect only some byte sequences and not others. We could be running on a system where all but 1 CPU core is good, so only checking 1 core like we do here would miss it. We could even begin the test on a bad CPU core, but fail to detect it because we are rescheduled to a good CPU core.
With those things in mind, I would like to see some changes:
- A self test should be done whenever we use optimized assembly, including in non-debug builds. Running it in debug builds when we use the non-optimized routine, as you do here, should also be done.
- When we detect that we can use the optimized assembly routine, we should perform a second self-check routine that initializes a buffer with random data, calculates the hash with the generic routine, calculates the hash with the optimized routine, and compares the results. This is intended to provide some protection against defects that would get past a static input.
- We need to do an Illumos-style xcall to run this check on all available CPU cores. We would want to implement the xcall infrastructure in the SPL using Linux's on_each_cpu() routine. Code operating in that context runs with interrupts disabled, so there is no risk of being rescheduled (and failing to test CPU cores) like we have here.
- There exist Linux systems that support hotpluggable CPUs, so we should detect the addition of CPU cores to a system so that we can test them. I do not know how to do this offhand, so it needs investigation.
This is a revision of openzfs#2332. Currently, the optimization only applies to kernel space, because I haven't figured out how to do it properly in user space. AVX2 is untested because I don't have such a CPU, so use it at your own discretion.
Hello, folks! Any progress on this issue?
In file included from /var/tmp/portage/sys-fs/zfs-kmod-9999-r1/work/zfs-kmod-9999/module/zfs/../../module/zfs/sha256.c:29:0:
/var/tmp/portage/sys-fs/zfs-kmod-9999-r1/work/zfs-kmod-9999/include/sys/sha256.h:5:24: fatal error: asm/sha256.h: No such file or directory
#include <asm/sha256.h>
@tuxoko: is there any chance you have a version of this which is compatible with the abd_next branch?
in order to bypass it - not sure how "allowed" that is, or why i'm seeing it on my end.

@sempervictus

zfs module import is failing due to:

@kernelOfTruth

Fixed kmod build. But in-tree still fails.
Force-pushed from fd01b7e to 8baa4a5
Fix typo in kernel_fpu_end and fix build error in linux 4.5

Rebase to master.
XTMP3 = %ymm3
XTMP4 = %ymm8
XFER = %ymm9
XTMP5 = %ymm11
This will run only in 64-bit mode (registers %ymm8-%ymm15 are used). Such code should be protected with
#include <sys/isa_defs.h>
#if defined(HAVE_AVX2) && defined(__x86_64)
It is only built on x86_64; see module/zfs/Makefile.in.
Right, I made an incomplete case for ifdefs.
Having #ifdef HAVE_AVX2 around the code will prevent old compilers and binutils (gcc older than 4.7) from going in and choking on unknown instructions.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Update: rebase to master, cleanup code, add module parameter to choose algo like in fletcher, add benchmark to select fastest during init.
Add ssse3, avx, avx2 optimized sha256. During module init, the fastest available version will be selected. Currently, we only support optimization in kernel space. User programs will use generic code. Note: The sha256-{ssse3,avx,avx2}-asm.S files are from linux-3.14. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Hello. I know I am coming into this PR very, very late, but I just noticed it. I just wanted to make sure you guys were aware of PR #4329 for ZFS encryption. The first commit in that PR ports the crypto API from Illumos to a ZoL kernel module. This code includes a sha256 implementation with x86_64 assembly that compiles in both userspace and kernel space.

In the current PR, I have not replaced the existing sha256 code in an effort to limit the scope of the PR (which is already quite sizable). However, this would be very easy to add (probably about an hour's worth of time, 45 minutes of which would just be verifying I didn't break anything on big endian systems). I certainly don't mean to step on anybody's toes, but would it make sense to look at doing this? It might not, considering that the encryption patch might take a while to get merged (considering its size).
@tcaputi thanks for commenting, I've posted a more detailed comment in #4329 about this. The short version is: as a first step toward ZFS encryption, let's get the crypto framework merged along with a few smaller changes which leverage it. That'll help us shake out any issues. I suspect we'll want to use the vectorized sha256 version implemented here when available. @tuxoko do you have any benchmark results for this?

@behlendorf

@tuxoko now that vectorized fletcher, raidz, and crypto framework are all in master, I think this would be a good time to rebase this so we can get it finalized and merged. The straightforward thing to do is probably just to extend your existing patch to include the sha256 implementation from the icp module as an option. It would be good to fix it up so it builds in user space as well, like the similar fletcher code.

Closing for now to minimize the number of open action PRs. It can be reopened when someone has time to work on this.

@behlendorf: any chance of revisiting this, or implementing something newer than the rather old OpenSSL derived functions?

@sempervictus I'd love to see this implemented if someone has the time to work on it.