Stack optimizations and refactoring of NTT-based Saber #181

mvanbeirendonck · 2021-02-10T10:38:40Z

This pull request adds stack optimizations for the NTT-based Saber and thoroughly refactors the Saber Cortex-M4 code. Detailed info is in the first commit message (f390aea). Stack optimizations include some help from @vincentvbh.

Stack usage should drop by ~5x (a little less for LightSaber), and performance improves slightly. The benchmarks still need to be regenerated on your setup; I haven't added them yet.

Let me know if you have any questions or need help merging!

Cheers,
Michiel

…TT-based Saber.

Firstly, this commit merges improvements between different Saber implementations.

1) For round 3, the Saber reference code was thoroughly refactored and the codebase reduced [https://github.com/KULeuven-COSIC/SABER]. These changes are now integrated into the m4 code.
2) All unnecessary modular reductions have been removed. The only modular reductions are now in the packing functions.
3) Packing/unpacking functions are simplified [PQClean, commit f8503cb].
4) The secret key is stored in compressed format [ia.cr/2020/268, Section 4.1]. This reduces the secret-key size and makes the packing/unpacking functions faster. (This requires a fix in pqm4's testvectors.c, as the secret key is checked against the one produced by PQClean.)
5) During re-encryption, the verification of the ciphertext is performed in place [ia.cr/2020/268, Section 4.2].
6) Symlinks are used for Light/FireSaber to make the (minimal) differences with Saber clearer.

Secondly, this commit implements some optimizations and reduces the memory footprint of the NTT-based multiplication.

1) Saber does not require any modular reduction apart from bitstream packing. Elements can be kept in int16_t (central-reduced) format.
   1.a) The secret key is sign-extended from 4-bit to 16-bit when unpacked.
   1.b) The vectors b and b' are sign-extended from 10-bit to 16-bit when unpacked.
   1.c) Items 1.a and 1.b make it possible to remove NTT_pk (with central reduction) and use NTT (without central reduction) uniformly.
   1.d) NTT_inv and NTT_inv_inner included a final step that converts from int16_t back to mod_p or mod_q. This is not necessary and has been removed.
2) During encryption, the NTT of s' is computed only once and reused between A*s' and b*s'.
3) Some just-in-time memory optimizations of [ia.cr/2018/682, Section 2.2] are implemented for the NTT-based multiplication. Polynomial vectors are generated from their seed just-in-time, converted to the NTT domain, and pointwise multiplied. The next polynomial vectors can reuse all the buffers. The idea is to extend this from polynomial vectors to individual polynomials. This still requires a new my_mul function.

For {Fire,Light}Saber (keygen/encaps/decaps) the resulting implementation is approximately (2.3-2.6%/4.7-5.5%/7.4-9.5%) faster and uses (27-36%/47-61%/49-62%) less dynamic memory than the current version in pqm4.

and comment out non-stack-optimized (very slightly faster) functions

shake_out was SABER_POLYVECBYTES instead of only SABER_POLYBYTES. Introduced a few unions to overlap memory.
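The union trick mentioned here can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: all `sketch_`/`SKETCH_` names are hypothetical, the 416-byte size assumes 256 coefficients packed at 13 bits each, and the loop that fills `shake_out` is only a stand-in for real XOF output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_N         256                  /* coefficients per polynomial */
#define SKETCH_POLYBYTES (13 * SKETCH_N / 8)  /* ASSUMED 13-bit packing: 416 bytes */

/* Two buffers whose lifetimes do not overlap share the same stack memory. */
typedef union {
    uint8_t shake_out[SKETCH_POLYBYTES];  /* phase 1: raw bytes for ONE polynomial */
    int16_t acc[SKETCH_N];                /* phase 2: coefficients, after phase 1 is done */
} sketch_scratch;

/* Stack cost if the two arrays were declared separately. */
#define SKETCH_SEPARATE_BYTES (SKETCH_POLYBYTES + SKETCH_N * sizeof(int16_t))

static uint32_t sketch_two_phase(uint8_t seed)
{
    sketch_scratch s;
    uint32_t sum = 0;

    /* Phase 1: fill the byte view (stand-in for XOF output) and consume it. */
    for (size_t i = 0; i < SKETCH_POLYBYTES; i++) {
        s.shake_out[i] = (uint8_t)(seed + i);
        sum += s.shake_out[i];
    }

    /* Phase 2: shake_out is no longer needed, so the same memory is
     * reused through the coefficient view. */
    for (size_t i = 0; i < SKETCH_N; i++) {
        s.acc[i] = (int16_t)(sum & 0x3FF);
    }
    return (uint32_t)s.acc[0] + (uint32_t)s.acc[SKETCH_N - 1];
}
```

The union occupies only as much memory as its largest member, so the scratch area costs 512 bytes here instead of 928. Sizing the byte buffer for one polynomial rather than a whole vector shrinks it further, which is the other half of the fix described above.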

mkannwischer · 2021-02-18T05:56:37Z

LGTM! Thank you very much @mvanbeirendonck and also @vincentvbh!
I've added the benchmarks.
I heard rumours that @vincentvbh has something even smaller by now, but it will take a little longer until that is ready, so I'll merge this already.

mvanbeirendonck and others added 13 commits December 8, 2020 16:01

Add central reduction for matrix A

1d746d0

Add benchmarks

3fa0a90

WIP : more memory-efficient NTT implementation

61da84b

Make secret key compression optional

107bc3c

and comment out non-stack-optimized (very slightly faster) functions

Reclaim ~1kB more stack space

429105e

shake_out was SABER_POLYVECBYTES instead of only SABER_POLYBYTES. Introduced a few unions to overlap memory.

rm redundant files

8f24ae2

clean ups; add soft links

b05a41e

Reclaim ~1kB more stack space

86ec2be

shake_out was SABER_POLYVECBYTES instead of only SABER_POLYBYTES. Introduced a few unions to overlap memory.

Merge branch 'vincentvbh-master'

6ab664a

typo

1638b65

Noinline no longer needed without fast funcs

f27a39b

Merge remote-tracking branch 'upstream/master'

08a462d

mvanbeirendonck mentioned this pull request Feb 10, 2021

[WIP] Add stack optimizations. ntt-polymul/ntt-polymul#3

Draft

add benchmarks

04459e8

mkannwischer merged commit 992f0f2 into mupq:master Feb 18, 2021
