# Parallelizing HE with tf-shell

There are two forms of parallelism in `tf-shell`.

The first is op-level parallelism: many `tf-shell` operations parallelize
internally over the elements of their input tensors. For example, multiplying
two [slots, 3] ciphertexts (i.e. vectors of 3 ciphertexts) may run the three
element-wise multiplications in parallel. This work is scheduled on TensorFlow's
thread pool.

The second is graph-level parallelism, where multiple independent operations run
in parallel. Because `tf-shell` is an extension of TensorFlow, it supports this
as well.
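
As a rough illustration of graph-level parallelism in plain TensorFlow (no `tf-shell` types involved), the sketch below traces a function containing two independent additions and prints the thread-pool settings. The function name `two_independent_ops` is made up for this example, and it is an assumption, not something this notebook verifies, that `tf-shell` ops are scheduled on these same pools.

```python
import tensorflow as tf

# A value of 0 means TensorFlow chooses the pool size itself.
inter_op = tf.config.threading.get_inter_op_parallelism_threads()
intra_op = tf.config.threading.get_intra_op_parallelism_threads()
print(f"inter-op threads: {inter_op}, intra-op threads: {intra_op}")

@tf.function
def two_independent_ops(x):
    # Neither result depends on the other, so the graph executor is free
    # (but not obliged) to run the two additions concurrently.
    return x + 1.0, x + 2.0

a, b = two_independent_ops(tf.ones([4, 4]))
print(a.shape, b.shape)
```

In eager mode the same two additions would be dispatched one after the other from Python, so only the parallelism inside each op would be available.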

In [1]:
import tf_shell
import tensorflow as tf
import timeit

context = tf_shell.create_context64(
    log_n=10,
    main_moduli=[8556589057, 8388812801],
    plaintext_modulus=40961,
    scaling_factor=3,
    seed="test_seed",
)

secret_key = tf_shell.create_key64(context)
rotation_key = tf_shell.create_rotation_key64(context, secret_key)

single_pt = tf.random.uniform([context.num_slots, 1], dtype=tf.float32, maxval=10)
single_ct = tf_shell.to_encrypted(single_pt, secret_key, context)

vector_pt = tf.random.uniform([context.num_slots, 8], dtype=tf.float32, maxval=10)
vector_ct = tf_shell.to_encrypted(vector_pt, secret_key, context)



INFO: Generating key
INFO: Generating rotation key
INFO: Generating rotation key


## Op Level Parallelism

Benchmark the time taken to multiply two ciphertexts. The first test multiplies
two individual ciphertexts (each has shape [slots, 1]). The second test measures
element-wise multiplication between two [slots, 8] ciphertexts, i.e. vectors of
8 ciphertexts.

Without parallelism, the element-wise multiplication would be expected to take
about 8 times as long as the individual multiplication. The timings below show
that it does not: in this run it takes only about 1.35 times as long.

In [2]:
def mul_single_ct_ct():
    return single_ct * single_ct

def mul_vector_ct_ct():
    return vector_ct * vector_ct

single_ct_ct_time = min(timeit.Timer(mul_single_ct_ct).repeat(repeat=10, number=100))
print(f"Multiply single ct * single ct: {single_ct_ct_time}")

vector_ct_ct_time = min(timeit.Timer(mul_vector_ct_ct).repeat(repeat=10, number=100))
print(f"Multiply vector ct * vector ct: {vector_ct_ct_time}")

Multiply single ct * single ct: 0.028574070000104257
Multiply vector ct * vector ct: 0.038458762001027935
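
To make the comparison concrete, the short calculation below converts the two totals printed above into per-call times and compares their ratio with the 8x a purely serial implementation would suggest. It only re-uses the numbers from this run; `calls` and `vector_len` mirror `number=100` and the vector length of 8 used in the benchmark.

```python
# Totals printed by the benchmark above: best of 10 repeats, 100 calls each.
single_total = 0.028574070000104257
vector_total = 0.038458762001027935
calls = 100
vector_len = 8

# timeit reports the total for all calls, so divide to get per-call time.
single_per_call_ms = single_total / calls * 1e3
vector_per_call_ms = vector_total / calls * 1e3
ratio = vector_total / single_total

print(f"single: {single_per_call_ms:.3f} ms per call")
print(f"vector: {vector_per_call_ms:.3f} ms per call")
print(f"vector / single: {ratio:.2f}x (serial estimate: {vector_len}x)")
```

For this run, that is roughly 0.29 ms versus 0.38 ms per call, a ratio of about 1.35x rather than 8x.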


## Graph Level Parallelism

Benchmark the time taken to perform two additions on a large ciphertext. The
first test runs in TensorFlow's eager mode, meaning the two additions are run
sequentially. The second test runs in graph mode, where TensorFlow may run the
two additions in parallel.

In [3]:
large_pt = tf.random.uniform([context.num_slots, 10000], dtype=tf.float32, maxval=10)
large_ct = tf_shell.to_encrypted(large_pt, secret_key, context)

def fn():
    # The two operations may be run in parallel when in graph mode.
    return [large_ct + 1, large_ct + 2]

def eager():
    return fn()

@tf.function
def deferred():
    return fn()

eager_time = min(timeit.Timer(eager).repeat(repeat=1, number=100))
print(f"Imperative execution: {eager_time}")

# The first call to deferred() also traces the graph, so this total includes
# that one-time cost.
graph_time = min(timeit.Timer(deferred).repeat(repeat=1, number=100))
print(f"Graph-based execution: {graph_time}")

Imperative execution: 21.420444982000845
Graph-based execution: 9.077498362999904
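
The two totals above can be turned into a speedup figure with a few lines of arithmetic. The sketch below uses only the numbers from this run; the 2x `overlap_bound` is an illustrative assumption (two equally expensive additions, perfectly overlapped), not a measured quantity.

```python
# Totals printed by the benchmark above (one repeat of 100 calls each).
eager_total = 21.420444982000845
graph_total = 9.077498362999904
calls = 100

eager_per_call = eager_total / calls
graph_per_call = graph_total / calls
speedup = eager_total / graph_total

# Upper bound if the only gain came from perfectly overlapping the two
# additions, assuming both take the same time.
overlap_bound = 2.0

print(f"eager: {eager_per_call * 1e3:.1f} ms per call")
print(f"graph: {graph_per_call * 1e3:.1f} ms per call")
print(f"speedup: {speedup:.2f}x (overlap-only bound: {overlap_bound:.0f}x)")
```

For this run the speedup is about 2.36x, slightly above 2x. One possible reading, not verified here, is that graph mode also lowers per-call overhead in addition to overlapping the two operations.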
