
Optimize CKKS square #353

Merged: 1 commit merged into microsoft:contrib on Jun 15, 2021

Conversation

@fboemer (Contributor) commented on Jun 12, 2021

Optimizes CKKS square by avoiding an unnecessary allocation and copy.
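For context, here is roughly what the untiled, in-place version of this computation looks like. It is a sketch mirroring the helpers used in the tiled fragment further down (it assumes the same surrounding locals: `context_`, `context_data`, `dest_size`, `coeff_modulus`, `coeff_modulus_size`, `coeff_count`), not the exact merged diff. The essential points are that the ciphertext is resized and squared in place rather than into a freshly allocated temporary, and that the three products are ordered so that no input polynomial is read after it has been overwritten:

```cpp
// Untiled sketch (assumes the same surrounding locals as the tiled fragment
// below; not the exact merged code).

// Resize the input ciphertext in place: no temporary ciphertext is allocated
// and no polynomial data is copied out and back.
encrypted.resize(context_, context_data.parms_id(), dest_size);

// Iterators over the three result polynomials; the first two also hold the inputs.
PolyIter encrypted_iter = iter(encrypted);
RNSIter c0_iter(*encrypted_iter[0], coeff_count);
RNSIter c1_iter(*encrypted_iter[1], coeff_count);
RNSIter c2_iter(*encrypted_iter[2], coeff_count);

SEAL_ITERATE(coeff_modulus, coeff_modulus_size, [&](auto I) {
    // Order matters: no input is read after it has been overwritten.
    // c[2] = c[1] * c[1]
    dyadic_product_coeffmod(*c1_iter, *c1_iter, coeff_count, I, *c2_iter);
    // c[1] = 2 * c[0] * c[1]
    dyadic_product_coeffmod(*c1_iter, *c0_iter, coeff_count, I, *c1_iter);
    add_poly_coeffmod(*c1_iter, *c1_iter, coeff_count, I, *c1_iter);
    // c[0] = c[0] * c[0]
    dyadic_product_coeffmod(*c0_iter, *c0_iter, coeff_count, I, *c0_iter);

    // Advance each iterator to the next RNS factor
    ++c0_iter;
    ++c1_iter;
    ++c2_iter;
});
```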

On ICX with clang-12, running sealbench CKKS EvaluateSquare for 1000 iterations yields:

| N | HEXL | Time before (us) | Time after (us) | Speedup |
|-------|-----|------------------|-----------------|---------|
| 1024  | OFF | 12.3 | 11.7 | 1.05x |
| 1024  | ON  | 2.39 | 1.68 | 1.42x |
| 2048  | OFF | 24.4 | 22.5 | 1.08x |
| 2048  | ON  | 8.68 | 6.52 | 1.33x |
| 4096  | OFF | 107  | 89.3 | 1.22x |
| 4096  | ON  | 38.4 | 31.1 | 1.23x |
| 8192  | OFF | 429  | 375  | 1.14x |
| 8192  | ON  | 183  | 106  | 1.75x |
| 16384 | OFF | 1878 | 1520 | 1.23x |
| 16384 | ON  | 774  | 401  | 1.93x |
| 32768 | OFF | 6972 | 5798 | 1.20x |
| 32768 | ON  | 3205 | 1990 | 1.61x |

I didn't see significant additional speedup with a tiling approach similar to #346.
In case you'd like to try it out, I've pasted the code for a tiled version of this implementation below.

        // Prepare destination
        encrypted.resize(context_, context_data.parms_id(), dest_size);

        size_t tile_size = min<size_t>(coeff_count, size_t(1024));
        size_t num_tiles = coeff_count / tile_size;
#ifdef SEAL_DEBUG
        if (coeff_count % tile_size != 0)
        {
            throw invalid_argument("tile_size does not divide coeff_count");
        }
#endif

        // Set up iterator for the input ciphertext
        PolyIter encrypted_iter = iter(encrypted);

        // Semantic misuse of RNSIter; each really points to the data of one RNS factor at a time
        RNSIter encrypted1_0_iter(*encrypted_iter[0], tile_size);
        RNSIter encrypted1_1_iter(*encrypted_iter[1], tile_size);
        RNSIter encrypted1_2_iter(*encrypted_iter[2], tile_size);

        // Compute the output tile_size coefficients at a time.
        // Given the input tuple of polynomials x = (x[0], x[1], x[2]), computes
        // x = (x[0] * x[0], 2 * x[0] * x[1], x[1] * x[1])
        // with appropriate modular reduction.
        SEAL_ITERATE(coeff_modulus, coeff_modulus_size, [&](auto I) {
            SEAL_ITERATE(iter(size_t(0)), num_tiles, [&](auto J) {
                // Compute third output polynomial, overwriting input
                // x[2] = x[1] * x[1]
                dyadic_product_coeffmod(encrypted1_1_iter[0], encrypted1_1_iter[0], tile_size, I, encrypted1_2_iter[0]);

                // Compute second output polynomial, overwriting input
                // x[1] = x[1] * x[0]
                dyadic_product_coeffmod(encrypted1_1_iter[0], encrypted1_0_iter[0], tile_size, I, encrypted1_1_iter[0]);
                // x[1] += x[1]
                add_poly_coeffmod(encrypted1_1_iter[0], encrypted1_1_iter[0], tile_size, I, encrypted1_1_iter[0]);

                // Compute first output polynomial, overwriting input
                // x[0] = x[0] * x[0]
                dyadic_product_coeffmod(encrypted1_0_iter[0], encrypted1_0_iter[0], tile_size, I, encrypted1_0_iter[0]);

                // Manually increment iterators
                ++encrypted1_0_iter;
                ++encrypted1_1_iter;
                ++encrypted1_2_iter;
            });
        });

@WeiDaiWD (Contributor) commented:
Testing with clang-10: this PR is much faster than the original SEAL, and the tiled version is not (much) faster than this PR.

Testing with gcc-9: the tiled version is faster than this PR, which in turn is much faster than the original SEAL, except for N = 1024.
At N = 1024, the original SEAL built with gcc-9 is as fast as with clang-10, costing about 22 us, while both new versions built with gcc-9 take 60+ us, which is even slower than N = 2048. I disassembled Evaluator::ckks_square in the original SEAL and in this PR, and they are almost identical. The issue is likely caused by poor/weird performance of SEAL_ITERATE with GCC, which is called in dyadic_product_coeffmod.
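As far as I know, SEAL_ITERATE expands to std::for_each_n (or an equivalent hand-rolled loop), so the gcc-9 behavior can probably be reproduced outside SEAL. Below is a hypothetical, self-contained micro-benchmark sketch, not SEAL code: the plain `%` reduction stands in for the Barrett reduction that dyadic_product_coeffmod really performs, and the point is only to compare how the two loop shapes compile at N = 1024:

```cpp
// Hypothetical standalone repro (not SEAL code). Compile with, e.g.,
//   g++ -O3 -std=c++17 repro.cpp    vs    clang++ -O3 -std=c++17 repro.cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    constexpr std::size_t n = 1024;                    // degree of the slow case
    constexpr std::uint64_t modulus = (1ULL << 50) - 27; // stand-in modulus
    std::vector<std::uint64_t> a(n, 3), b(n, 5), c(n, 0);

    auto bench = [&](const char *name, auto &&body) {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100000; rep++)
        {
            body();
        }
        auto t1 = std::chrono::steady_clock::now();
        std::cout << name << ": "
                  << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us total (c[0] = " << c[0] << ")\n";
    };

    // The SEAL_ITERATE shape: for_each_n over one pointer, advancing the
    // other operand/result pointers inside the lambda.
    bench("for_each_n-style", [&]() {
        const std::uint64_t *pb = b.data();
        std::uint64_t *pc = c.data();
        std::for_each_n(a.data(), n, [&](std::uint64_t x) { *pc++ = (x * *pb++) % modulus; });
    });

    // The same work as a plain indexed loop.
    bench("indexed loop", [&]() {
        for (std::size_t i = 0; i < n; i++)
        {
            c[i] = (a[i] * b[i]) % modulus;
        }
    });
}
```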

I'll merge this PR. For the gcc-9, N = 1024 case, I'll look into a fix in a future release; I don't think it affects too many users, and the speedup we get from this PR is more valuable. Thank you so much for this!

@WeiDaiWD merged commit 16f40b4 into microsoft:contrib on Jun 15, 2021
@fboemer deleted the fboemer/faster-ckks-square branch on November 3, 2021