
Optimize CKKS square #353

Merged: 1 commit merged into microsoft:contrib on Jun 15, 2021

Conversation

@fboemer (Contributor) commented on Jun 12, 2021

Optimizes CKKS square by avoiding an unnecessary allocation and copy.
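For context, here is roughly what the untiled, in-place version of this computation looks like. It is a sketch mirroring the helpers used in the tiled fragment further down (it assumes the same surrounding locals: `context_`, `context_data`, `dest_size`, `coeff_modulus`, `coeff_modulus_size`, `coeff_count`), not the exact merged diff. The essential points are that the ciphertext is resized and squared in place rather than into a freshly allocated temporary, and that the three products are ordered so that no input polynomial is read after it has been overwritten:

```cpp
// Untiled sketch (assumes the same surrounding locals as the tiled fragment
// below; not the exact merged code).

// Resize the input ciphertext in place: no temporary ciphertext is allocated
// and no polynomial data is copied out and back.
encrypted.resize(context_, context_data.parms_id(), dest_size);

// Iterators over the three result polynomials; the first two also hold the inputs.
PolyIter encrypted_iter = iter(encrypted);
RNSIter c0_iter(*encrypted_iter[0], coeff_count);
RNSIter c1_iter(*encrypted_iter[1], coeff_count);
RNSIter c2_iter(*encrypted_iter[2], coeff_count);

SEAL_ITERATE(coeff_modulus, coeff_modulus_size, [&](auto I) {
    // Order matters: no input is read after it has been overwritten.
    // c[2] = c[1] * c[1]
    dyadic_product_coeffmod(*c1_iter, *c1_iter, coeff_count, I, *c2_iter);
    // c[1] = 2 * c[0] * c[1]
    dyadic_product_coeffmod(*c1_iter, *c0_iter, coeff_count, I, *c1_iter);
    add_poly_coeffmod(*c1_iter, *c1_iter, coeff_count, I, *c1_iter);
    // c[0] = c[0] * c[0]
    dyadic_product_coeffmod(*c0_iter, *c0_iter, coeff_count, I, *c0_iter);

    // Advance each iterator to the next RNS factor
    ++c0_iter;
    ++c1_iter;
    ++c2_iter;
});
```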

On ICX with clang-12, running sealbench CKKS EvaluateSquare for 1000 iterations yields:

| N | HEXL | Time before (us) | Time after (us) | Speedup |
|-------|-----|------------------|-----------------|---------|
| 1024  | OFF | 12.3 | 11.7 | 1.05x |
| 1024  | ON  | 2.39 | 1.68 | 1.42x |
| 2048  | OFF | 24.4 | 22.5 | 1.08x |
| 2048  | ON  | 8.68 | 6.52 | 1.33x |
| 4096  | OFF | 107  | 89.3 | 1.22x |
| 4096  | ON  | 38.4 | 31.1 | 1.23x |
| 8192  | OFF | 429  | 375  | 1.14x |
| 8192  | ON  | 183  | 106  | 1.75x |
| 16384 | OFF | 1878 | 1520 | 1.23x |
| 16384 | ON  | 774  | 401  | 1.93x |
| 32768 | OFF | 6972 | 5798 | 1.20x |
| 32768 | ON  | 3205 | 1990 | 1.61x |

I didn't see significant additional speedup with a tiling approach similar to #346.
In case you'd like to try it out, I've pasted the code for a tiled version of this implementation below.

        // Prepare destination
        encrypted.resize(context_, context_data.parms_id(), dest_size);

        size_t tile_size = min<size_t>(coeff_count, size_t(1024));
        size_t num_tiles = coeff_count / tile_size;
#ifdef SEAL_DEBUG
        if (coeff_count % tile_size != 0)
        {
            throw invalid_argument("tile_size does not divide coeff_count");
        }
#endif

        // Set up iterator for the input ciphertext
        PolyIter encrypted_iter = iter(encrypted);

        // Semantic misuse of RNSIter; each really points to the data of one RNS factor at a time
        RNSIter encrypted1_0_iter(*encrypted_iter[0], tile_size);
        RNSIter encrypted1_1_iter(*encrypted_iter[1], tile_size);
        RNSIter encrypted1_2_iter(*encrypted_iter[2], tile_size);

        // Compute the output tile_size coefficients at a time.
        // Given the input tuple of polynomials x = (x[0], x[1], x[2]), computes
        // x = (x[0] * x[0], 2 * x[0] * x[1], x[1] * x[1])
        // with appropriate modular reduction.
        SEAL_ITERATE(coeff_modulus, coeff_modulus_size, [&](auto I) {
            SEAL_ITERATE(iter(size_t(0)), num_tiles, [&](auto J) {
                // Compute third output polynomial, overwriting input
                // x[2] = x[1] * x[1]
                dyadic_product_coeffmod(encrypted1_1_iter[0], encrypted1_1_iter[0], tile_size, I, encrypted1_2_iter[0]);

                // Compute second output polynomial, overwriting input
                // x[1] = x[1] * x[0]
                dyadic_product_coeffmod(encrypted1_1_iter[0], encrypted1_0_iter[0], tile_size, I, encrypted1_1_iter[0]);
                // x[1] += x[1]
                add_poly_coeffmod(encrypted1_1_iter[0], encrypted1_1_iter[0], tile_size, I, encrypted1_1_iter[0]);

                // Compute first output polynomial, overwriting input
                // x[0] = x[0] * x[0]
                dyadic_product_coeffmod(encrypted1_0_iter[0], encrypted1_0_iter[0], tile_size, I, encrypted1_0_iter[0]);

                // Manually increment iterators
                ++encrypted1_0_iter;
                ++encrypted1_1_iter;
                ++encrypted1_2_iter;
            });
        });

@WeiDaiWD (Contributor) commented:
Testing with clang-10: this PR is much faster than the original SEAL, and the tiled version is not (much) faster than this PR.

Testing with gcc-9: the tiled version is faster than this PR, which in turn is much faster than the original SEAL, except for N = 1024.
At N = 1024, the original SEAL built with gcc-9 is as fast as with clang-10, costing about 22 us, while both new versions built with gcc-9 take 60+ us, which is even slower than N = 2048. I disassembled Evaluator::ckks_square in the original SEAL and in this PR, and they are almost identical. The issue is likely caused by poor/weird performance of SEAL_ITERATE with GCC, which is called in dyadic_product_coeffmod.
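As far as I know, SEAL_ITERATE expands to std::for_each_n (or an equivalent hand-rolled loop), so the gcc-9 behavior can probably be reproduced outside SEAL. Below is a hypothetical, self-contained micro-benchmark sketch, not SEAL code: the plain `%` reduction stands in for the Barrett reduction that dyadic_product_coeffmod really performs, and the point is only to compare how the two loop shapes compile at N = 1024:

```cpp
// Hypothetical standalone repro (not SEAL code). Compile with, e.g.,
//   g++ -O3 -std=c++17 repro.cpp    vs    clang++ -O3 -std=c++17 repro.cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    constexpr std::size_t n = 1024;                    // degree of the slow case
    constexpr std::uint64_t modulus = (1ULL << 50) - 27; // stand-in modulus
    std::vector<std::uint64_t> a(n, 3), b(n, 5), c(n, 0);

    auto bench = [&](const char *name, auto &&body) {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100000; rep++)
        {
            body();
        }
        auto t1 = std::chrono::steady_clock::now();
        std::cout << name << ": "
                  << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us total (c[0] = " << c[0] << ")\n";
    };

    // The SEAL_ITERATE shape: for_each_n over one pointer, advancing the
    // other operand/result pointers inside the lambda.
    bench("for_each_n-style", [&]() {
        const std::uint64_t *pb = b.data();
        std::uint64_t *pc = c.data();
        std::for_each_n(a.data(), n, [&](std::uint64_t x) { *pc++ = (x * *pb++) % modulus; });
    });

    // The same work as a plain indexed loop.
    bench("indexed loop", [&]() {
        for (std::size_t i = 0; i < n; i++)
        {
            c[i] = (a[i] * b[i]) % modulus;
        }
    });
}
```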

I'll merge this PR. For the gcc-9, N = 1024 case, I'll look into a fix in a future release; I don't think it affects too many users, and the speedup we get from this PR is more valuable. Thank you so much for this!

@WeiDaiWD merged commit 16f40b4 into microsoft:contrib on Jun 15, 2021
@fboemer deleted the fboemer/faster-ckks-square branch on November 3, 2021