Multi-core compiler evaluation #8
Hi! -DUSE_GALOIS=ON is indeed the right way to enable the multi-core support. I was able to reproduce the behavior on my end (also did not see multiple cores used in a program with ample concurrency), so it seems we have a regression. I'll investigate, thanks for filing the issue!
Turns out Galois was defaulting to 1 thread and we were missing a call to [...]. This issue should be fixed in [...].
Thank you very much for the support @olsaarik! I cloned EVA just now and installed it with -DUSE_GALOIS=ON, but when I run my code I don't see any parallelism in place when looking at htop. My code is this:

```python
from eva import EvaProgram, Input, Output, evaluate
from eva import evaluate
from eva.ckks import CKKSCompiler
from eva.seal import generate_keys
from eva.metric import valuation_mse
from common import *
from random import uniform
import numpy as np
import unittest
import math

batch = 2
dim = 32
dim1 = 32

def linear(x, y):
    return np.matmul(x, y)  # + bias

def attention(x):
    temp = np.array([i for i in range(dim*dim1)]).reshape(dim, dim1) / 1000
    query = linear(x, temp)
    key = linear(x, temp)
    value = linear(x, temp)
    return query, key, value

def self_attention(query, key, value):
    temp = linear(query, key)
    temp = linear(temp, np.arange(dim*dim).reshape(dim, dim))
    temp = linear(temp, value)
    return temp

def norm(x):
    return linear(x, np.arange(dim*dim).reshape(dim, dim))

def ff(x):
    return linear(linear(x, np.arange(dim*dim).reshape(dim, dim)/1000), np.arange(dim*dim).reshape(dim, dim)/1000)

def classifier(x):
    v = np.arange(dim*2).reshape(dim, 2)/1000
    w = np.arange(2*2).reshape(2, 2)/1000
    return np.dot(np.dot(x, v), w)

transformer = EvaProgram('transformer', vec_size=dim*dim1)
with transformer:
    x = np.array([Input(f'x{i}') for i in range(batch*dim*dim1)]).reshape(batch, dim, dim1)
    print(x.shape)
    query, key, value = attention(x)
    attn = self_attention(query, key, value)
    out = norm(attn)
    out = ff(out)
    out = out ** 2
    out = norm(out)
    out = out ** 2
    out = classifier(out)
    print(out.shape)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                Output(f'y{i+j}', out[i][j][k])

transformer.set_input_scales(32)
transformer.set_output_ranges(32)

if __name__ == "__main__":
    compiler = CKKSCompiler(config={'security_level': '128', 'warn_vec_size': 'false'})
    compiled, params, signature = compiler.compile(transformer)
    print("poly degree: ", params.poly_modulus_degree)
    print("prime bits: ", params.prime_bits)
    print("rotations: ", params.rotations)
```

I also tried to encapsulate the logic written inside the EvaProgram in a loop, as suggested in #6, but only one core was being used. Is there anything wrong? Do you have any working example on multiple cores? It would be very good to have an example of multi-core usage for EVA when trying to obtain the required CKKS parameters.
Thanks for clarifying your use case. Multi-core evaluation is currently only used for encrypting inputs and the actual evaluation of the program, i.e., [...].

I took a look at your program (nice re-use of NumPy by the way, I didn't know you could do that!) and it is larger than the programs we've used with EVA before. I exported it into DOT by writing out [...]. I tried playing with the dimensions, and on my own laptop with 8 GB of RAM [...].

I'll dig into this more to figure out what is slowing down the compilation for this program. Thanks for the very interesting example!
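For readers following along, here is a minimal sketch of where that parallelism would apply, based on the usage pattern in EVA's README (compile, generate_keys, encrypt, execute, decrypt). The toy polynomial program, its name, vector size, and scales are placeholders, not part of this thread:

```python
from eva import EvaProgram, Input, Output
from eva.ckks import CKKSCompiler
from eva.seal import generate_keys

# Toy program; name, vec_size, and scales are arbitrary for this sketch.
poly = EvaProgram('poly', vec_size=1024)
with poly:
    x = Input('x')
    Output('y', 3*x**2 + 5*x - 2)

poly.set_input_scales(30)
poly.set_output_ranges(30)

# Compilation and parameter selection: per the comment above, this step
# does not use multiple cores.
compiler = CKKSCompiler(config={'security_level': '128'})
compiled, params, signature = compiler.compile(poly)

public_ctx, secret_ctx = generate_keys(params)

inputs = {'x': [i / 1024 for i in range(compiled.vec_size)]}

# Encrypting inputs and executing the program are the steps that can use
# multiple cores when EVA is built with -DUSE_GALOIS=ON.
enc_inputs = public_ctx.encrypt(inputs, signature)
enc_outputs = public_ctx.execute(compiled, enc_inputs)
outputs = secret_ctx.decrypt(enc_outputs, signature)
```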
Good morning/evening @olsaarik , thank you so much for the detailed explanation! Yes indeed I tried to run the code with lower sizes like 4 or 8 exactly and it was fast, I got the parameters but then of course they are not good for way larger sizes like 32x32 or more. Unfortunately I don't have much experience / didn't find much documentation with how CKKS parameters are chosen, or better: compared to the other HE scheme provided in SEAL, BFV, I find it not very straightforward on how should I search/apply some kind of criteria for the right prime coefficients' bit sizes (and the number of them) given a certain polynomial modulus degree. So, that's the reason I was leveraging on EVA, which is indeed really so great! Ok that's great, I thank you very much! |
I am also hitting performance limits at the compilation step. With a multiplicative depth of only 1, compile time seems to go up quadratically with the number of operations in the program. Since the compilation is just a constant number of traversals through the program, that is unexpected, right?

EDIT: To clarify, the multiplications are all plaintext. There is also a similar number of ciphertext additions; additive depth is 2.
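For reference, a rough sketch of how one might measure that scaling, reusing the CKKSCompiler API already shown in this thread. The benchmark program shape (a chain of plaintext multiplications and ciphertext additions at multiplicative depth 1), the sizes, and the names are invented for illustration, and timings will of course vary by machine:

```python
import time
from eva import EvaProgram, Input, Output
from eva.ckks import CKKSCompiler

def build_program(n_ops, vec_size=4096):
    # Chain of plaintext multiplications and ciphertext additions,
    # keeping multiplicative depth at 1 as described above.
    prog = EvaProgram(f'bench_{n_ops}', vec_size=vec_size)
    with prog:
        xs = [Input(f'x{i}') for i in range(n_ops)]
        acc = 2 * xs[0]                      # plaintext multiply (constant * ciphertext)
        for i, x in enumerate(xs[1:], start=3):
            acc = acc + i * x                # plaintext multiply, then ciphertext add
        Output('y', acc)
    prog.set_input_scales(30)
    prog.set_output_ranges(30)
    return prog

compiler = CKKSCompiler(config={'security_level': '128', 'warn_vec_size': 'false'})
for n_ops in (100, 200, 400, 800):
    prog = build_program(n_ops)
    start = time.perf_counter()
    compiler.compile(prog)
    print(n_ops, 'ops:', round(time.perf_counter() - start, 2), 's')
```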
It seems I am able to work around this issue by converting all Python-land constants into EVA Inputs with is_encrypted=False. For example, if x is an encrypted Input, instead of doing 2*x, I do p*x with p = Input("p", is_encrypted=False) and then set p to 2 at evaluation time.
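A minimal sketch of that workaround, reusing the same API as the examples above. The program name, vector size, scales, and input values are placeholders, and it assumes (following EVA's README flow) that unencrypted inputs are also passed through public_ctx.encrypt, which encodes them according to the signature:

```python
from eva import EvaProgram, Input, Output
from eva.ckks import CKKSCompiler
from eva.seal import generate_keys

prog = EvaProgram('scale_workaround', vec_size=8)
with prog:
    x = Input('x')                          # encrypted input (the default)
    p = Input('p', is_encrypted=False)      # plaintext input bound at evaluation time
    Output('y', p * x)                      # instead of the literal 2 * x

prog.set_input_scales(30)
prog.set_output_ranges(30)

compiler = CKKSCompiler(config={'security_level': '128', 'warn_vec_size': 'false'})
compiled, params, signature = compiler.compile(prog)
public_ctx, secret_ctx = generate_keys(params)

# The "constant" 2 is supplied here instead of being baked into the program.
inputs = {'x': [float(i) for i in range(8)], 'p': [2.0] * 8}
enc_inputs = public_ctx.encrypt(inputs, signature)
enc_outputs = public_ctx.execute(compiled, enc_inputs)
outputs = secret_ctx.decrypt(enc_outputs, signature)
```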
Hello!
Really great work with EVA!
I was trying to use it but I was unable to scale it efficiently in time. I checked via htop on Ubuntu 20 that only one core was being used.
Now, I am on a VM, so if that is the problem it's okay; but otherwise, would it be possible to let the code being compiled have multi-core support?
I tried to build EVA with the
cmake -DUSE_GALOIS=ON .
command, but nothing changed. Thank you for your time!