<h1> Encrypted Inference-Linear Regression</h1>

Author:
<ul>
    <li>Hrishikesh Kamath - <a href="https://twitter.com/kamathhrishi">Twitter</a> - <a href="https://github.com/kamathhrishi">Github</a>
</ul>
Encrypted Inference is the process of performing inference with machine learning models such that model owner cannot observe the true input data nor can the data owner see the true model weights. The weights and data are encrypted by splitting them into shares and performiming computations according to a protocol. The general class of methods know as <b>Secure Multi Party Computation (SMPC)</b>. 

Below figure depicts MPC for ML models for 2 parties. 

<img height="600px" width="600px" src="Images/smpc_illustration.png"></img>

In this example we use Virtual Machine to demonstrate performing inference using SMPC. That is the workers are present on the same PC. If you want to understand how to perform remotely check out duet tutorials. 
In this example, we train a Linear regression model in plaintext on Boston Housing Dataset. Then we use the model for performing encrypted inference on test data. This tutorial uses protocol Falcon for 3 parties and SPDZ for 3 and 5 parties. 

In SyMPC the computation between parties occurs using a orchestrator which describes how computations should take place.

In [1]:
#running loss is nan
#batch size
#Define Linear regression model (figure out)
#criterion = torch.nn.MSELoss(reduction='mean') 
#optimizer = torch.optim.SGD(model.parameters(), lr=lr)

In [2]:
#External libraries
import pandas as pd
import numpy as np
import time

In [3]:
#Import torch
import torch
import torch.nn as nn
import torch.utils.data as data_utils

In [4]:
#Set a manual seed to maintain consistency
torch.manual_seed(0)

<torch._C.Generator at 0x7f1244151990>

<h2>Data Loading and Processing</h2>

In [5]:
!apt-get update
!apt-get install wget

Hit:1 http://deb.debian.org/debian buster InRelease
Get:2 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:4 http://security.debian.org/debian-security buster/updates/main amd64 Packages [308 kB]
Fetched 426 kB in 1s (346 kB/s)   
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  wget
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 902 kB of archives.
After this operation, 3335 kB of additional disk space will be used.
Get:1 http://deb.debian.org/debian buster/main amd64 wget amd64 1.20.1-1.1 [902 kB]
Fetched 902 kB in 0s (2588 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package wget.
(Reading database ... 20003 files and directories currently installed.)
Preparing to u

In [6]:
#Download Boston housing dataset
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data

--2022-01-11 10:42:14--  https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49082 (48K) [application/x-httpd-php]
Saving to: ‘housing.data’


2022-01-11 10:42:16 (107 KB/s) - ‘housing.data’ saved [49082/49082]



In [7]:
#Import dataset and add headers
dataset=pd.read_csv("housing.data",delim_whitespace=True,
                    names=["crim","zn","indus",
                           "chas","nox","rm",
                           "age","dis","rad",
                           "tax","ptratio","black",
                           "lstat","medv"])

In [8]:
#Visualize and look at columns and rows of dataset
dataset.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [9]:
#Split data into features and target variables
features = dataset.drop("medv",axis=1)
targets = dataset["medv"]

In [10]:
#Normalize features
features = features.apply(
    lambda x: (x - x.mean()) / x.std()
)

In [11]:
features

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat
0,-0.419367,0.284548,-1.286636,-0.272329,-0.144075,0.413263,-0.119895,0.140075,-0.981871,-0.665949,-1.457558,0.440616,-1.074499
1,-0.416927,-0.487240,-0.592794,-0.272329,-0.739530,0.194082,0.366803,0.556609,-0.867024,-0.986353,-0.302794,0.440616,-0.491953
2,-0.416929,-0.487240,-0.592794,-0.272329,-0.739530,1.281446,-0.265549,0.556609,-0.867024,-0.986353,-0.302794,0.396035,-1.207532
3,-0.416338,-0.487240,-1.305586,-0.272329,-0.834458,1.015298,-0.809088,1.076671,-0.752178,-1.105022,0.112920,0.415751,-1.360171
4,-0.412074,-0.487240,-1.305586,-0.272329,-0.834458,1.227362,-0.510674,1.076671,-0.752178,-1.105022,0.112920,0.440616,-1.025487
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,-0.412820,-0.487240,0.115624,-0.272329,0.157968,0.438881,0.018654,-0.625178,-0.981871,-0.802418,1.175303,0.386834,-0.417734
502,-0.414839,-0.487240,0.115624,-0.272329,0.157968,-0.234316,0.288648,-0.715931,-0.981871,-0.802418,1.175303,0.440616,-0.500355
503,-0.413038,-0.487240,0.115624,-0.272329,0.157968,0.983986,0.796661,-0.772919,-0.981871,-0.802418,1.175303,0.440616,-0.982076
504,-0.407361,-0.487240,0.115624,-0.272329,0.157968,0.724955,0.736268,-0.667776,-0.981871,-0.802418,1.175303,0.402826,-0.864446


In [12]:
targets

0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Name: medv, Length: 506, dtype: float64

In [13]:
#Convert features and targets into torch tensors
features = torch.tensor(features.values.astype(np.float32)) 
targets = torch.tensor(targets.values.astype(np.float32))

In [14]:
# Arguments for training
batch_size = 16
epochs = 300
train_test_split = 0.8
lr = 0.001

In [15]:
#Split dataset into train and test
train_indices=int(len(features)*train_test_split)

train_x = features[:train_indices]
train_y = targets[:train_indices]

test_x = features[train_indices+1:]
test_y = targets[train_indices+1:]

In [16]:
nom = int(input("Enter number of models to b created: "))
m=[]

Enter number of models to b created: 5


In [17]:
xtrain=[]
xtrain.append(np.array_split(train_x, nom))

ytrain=[]
ytrain.append(np.array_split(train_y, nom))

In [18]:
#Divide dataset into batches
def get_batches(X, y):
    batches = []
    for index in range(0,len(train_x)+1,batch_size):
        batches.append((X[index:index+batch_size],y[index:index+batch_size]))
    
    return batches

<h1>Plaintext Training</h1>

In [19]:
#Import syft
import syft as sy
sy.logger.remove()

In [20]:
#Define Linear regression model
class LinearSyNet(sy.Module):
    def __init__(self, torch_ref):
        super(LinearSyNet, self).__init__(torch_ref=torch_ref)
        self.fc1 = self.torch_ref.nn.Linear(13,1)

    def forward(self, x):
        x = self.fc1(x)
        return x

In [21]:
#Define model, loss function and optimizer
for i in range(nom):
    model = LinearSyNet(torch)
    m.append(model)
criterion = torch.nn.MSELoss(reduction='mean') 
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

In [22]:
xtest=[]
xtest.append(np.array_split(test_x, nom))

ytest=[]
ytest.append(np.array_split(test_y, nom))

In [23]:
print(xtest[0][1])

tensor([[ 1.0037, -0.4872,  1.0150, -0.2723,  0.2529, -0.6371, -0.3153, -0.8536,
          1.6596,  1.5294,  0.8058, -3.6368,  0.4253],
        [ 3.9584, -0.4872,  1.0150, -0.2723,  1.0727, -0.1176,  0.3597, -0.9176,
          1.6596,  1.5294,  0.8058, -3.7007,  0.2614],
        [ 0.4364, -0.4872,  1.0150, -0.2723,  1.0727, -0.1304,  0.3384, -0.8830,
          1.6596,  1.5294,  0.8058, -2.8473,  1.2417],
        [ 0.6656, -0.4872,  1.0150, -0.2723,  1.0727,  0.1357,  0.9601, -0.8676,
          1.6596,  1.5294,  0.8058, -3.2417,  1.6002],
        [ 0.5672, -0.4872,  1.0150, -0.2723,  0.2529,  0.0902,  0.6226, -0.8274,
          1.6596,  1.5294,  0.8058, -2.9928,  0.6983],
        [ 0.7497, -0.4872,  1.0150, -0.2723,  0.2529,  0.7805,  0.9139, -0.8106,
          1.6596,  1.5294,  0.8058, -3.0160,  0.9854],
        [ 0.3291, -0.4872,  1.0150, -0.2723,  0.2529,  0.1998,  0.2211, -0.7573,
          1.6596,  1.5294,  0.8058, -2.8339, -0.0873],
        [ 0.2287, -0.4872,  1.0150, -0.2723,  1.

In [41]:
#Training Loop
for i in range(nom):
    train_batches=get_batches(xtrain[0][i],ytrain[0][i])
    print("model: ", i)
    for epoch in range(epochs):
      running_loss = 0.0
      for index in range(0,len(train_batches)):
        # Clear gradient buffers because we don't want any gradient from previous epoch to carry forward, dont want to cummulate gradients
        optimizer.zero_grad()

        # get output from the model, given the inputs
        outputs = m[i](train_batches[index][0]).reshape([-1])

        # get loss for the predicted output
        loss = criterion(outputs,train_batches[index][1])
        running_loss += loss
        # get gradients w.r.t to parameters
        loss.backward()

        # update parameters
        optimizer.step()

      test_accuracy = criterion(m[i](xtest[0][i]).reshape([-1]),ytest[0][i])#use xtest and ytest
      if((epoch%50)==0):
         print(f"Epoch {epoch}/{epochs}  Running Loss : {running_loss.item()/batch_size} and test loss: {test_accuracy.item()}")

model:  0
Epoch 0/300  Running Loss : nan and test loss: 257.4114990234375
Epoch 50/300  Running Loss : nan and test loss: 257.4114990234375
Epoch 100/300  Running Loss : nan and test loss: 257.4114990234375
Epoch 150/300  Running Loss : nan and test loss: 257.4114990234375
Epoch 200/300  Running Loss : nan and test loss: 257.4114990234375
Epoch 250/300  Running Loss : nan and test loss: 257.4114990234375
model:  1
Epoch 0/300  Running Loss : nan and test loss: 169.77113342285156
Epoch 50/300  Running Loss : nan and test loss: 169.77113342285156
Epoch 100/300  Running Loss : nan and test loss: 169.77113342285156
Epoch 150/300  Running Loss : nan and test loss: 169.77113342285156
Epoch 200/300  Running Loss : nan and test loss: 169.77113342285156
Epoch 250/300  Running Loss : nan and test loss: 169.77113342285156
model:  2
Epoch 0/300  Running Loss : nan and test loss: 282.78448486328125
Epoch 50/300  Running Loss : nan and test loss: 282.78448486328125
Epoch 100/300  Running Loss : nan

<h1>Encrypted Inference</h1>

In [25]:
#SyMPC imports required for encrypted inference
import sympc
from sympc.session import Session
from sympc.session import SessionManager
from sympc.tensor import MPCTensor
from sympc.protocol import Falcon,FSS

In [26]:
def get_clients(n_parties):
  #Generate required number of syft clients and return them.

  parties=[]
  for index in range(n_parties): 
      parties.append(sy.VirtualMachine(name = "worker"+str(index)).get_root_client())

  return parties

In [27]:
def split_send(data,session):
    """Splits data into number of chunks equal to number of parties and distributes it to respective 
       parties.
    """
    data_pointers = []
    
    split_size = int(len(data)/len(session.parties))+1
    for index in range(0,len(session.parties)):
        ptr=data[index*split_size:index*split_size+split_size].share(session=session)
        data_pointers.append(ptr)
        
    return data_pointers

In [28]:
def inference(n_clients,protocol=None):
    
  # Get VM clients 
  parties=get_clients(n_clients)

  # Setup the session for the computation
  if(protocol):
     session = Session(parties = parties,protocol = protocol)
  else:
     session = Session(parties = parties)
        
  SessionManager.setup_mpc(session)

  for i in range(nom):
        #Split data and send data to clients
        pointers = split_send(xtest[0][i],session)

        #Encrypt model 
        mpc_model = m[i].share(session)

        #Encrypt test data
        #test_data=MPCTensor(secret=test_x, session = session)

        #Perform inference and measure time taken
        start_time = time.time()

        results = []

        for ptr in pointers:
            encrypted_results = mpc_model(ptr)
            plaintext_results = encrypted_results.reconstruct()
            results.append(plaintext_results)

        end_time = time.time()

        print(f"Time for inference: {end_time-start_time}s")

        predictions = torch.cat(results).reshape([-1])

        #Calculate Loss
        print("MSE Loss: ",criterion(predictions,ytest[0][i]).item())

        return predictions

In [29]:
predictions=inference(3,Falcon("semi-honest"))

Time for inference: 0.1719961166381836s
MSE Loss:  257.4112243652344


We can see that the prediction values and mean squared error values are almost the same as final model. Small differences are due to precision loss.

In [30]:
for index in range(0,10):
    print(f"Index {index}")
    print(f"Encrypted Prediction Output {predictions[index].item()}")
    #print(f"Plaintext Prediction Output {plaintext_predictions[index].item()}")
    print(f"Expected Prediction: {test_y[index]}")
    print("\n")

Index 0
Encrypted Prediction Output -1.28875732421875
Expected Prediction: 5.0


Index 1
Encrypted Prediction Output -1.43994140625
Expected Prediction: 11.899999618530273


Index 2
Encrypted Prediction Output -0.8323516845703125
Expected Prediction: 27.899999618530273


Index 3
Encrypted Prediction Output -1.2740936279296875
Expected Prediction: 17.200000762939453


Index 4
Encrypted Prediction Output -0.817962646484375
Expected Prediction: 27.5


Index 5
Encrypted Prediction Output -0.48388671875
Expected Prediction: 15.0


Index 6
Encrypted Prediction Output -0.7993011474609375
Expected Prediction: 17.200000762939453


Index 7
Encrypted Prediction Output -1.4966278076171875
Expected Prediction: 17.899999618530273


Index 8
Encrypted Prediction Output -1.01806640625
Expected Prediction: 16.299999237060547


Index 9
Encrypted Prediction Output -1.724609375
Expected Prediction: 7.0




<h1> Conclusion </h1>

Falcon can also provide a malicious security guarantee for an honest majority at the cost of higher inference time. Malicious security ensures that all the parties compute according to the protocol and do not deviate from protocol or tamper with shares. 

In [31]:
predictions=inference(3,Falcon("malicious"))

Time for inference: 0.536308765411377s
MSE Loss:  257.4112854003906


When we do not pass any protocol to session, SyMPC uses SPDZ and FSS protocol with semi-honest security type. 

SPDZ is used for multiplication and related operations (convolution,matmul,etc).
Functional Secret Sharing (FSS) for other operations such as comparison, equality, maxpool, etc. 

FSS works only for 2 parties while SPDZ could extend to N parties. 

Linear regression uses only matmul which utilizes SPDZ protocol allowing us to run linear regression with several parties in this tutorial. 

In [32]:
predictions=inference(3)

Time for inference: 0.3889961242675781s
MSE Loss:  257.4106750488281


In [33]:
predictions=inference(4)

Time for inference: 0.7254760265350342s
MSE Loss:  257.4106750488281


In [34]:
predictions=inference(5)

Time for inference: 1.1546521186828613s
MSE Loss:  257.4105224609375


<center><h3> Comparison </h3></center>

| Protocol | Security Type| Parties | Inference Time (s) |
| --- | --- | --- | --- |
| Plaintext | |  | 0.000534|
| Falcon | Semi-honest | 3 | 0.118 |
| Falcon | Malicious | 3 | 0.6659 |
| SPDZ| Semi-honest | 3 | 0.4993 |
| SPDZ | Semi-honest | 5 | 1.3192|

<b>Note:</b> The above table is only for comparison. The inference time varies for different PC specs and CPU load.



Falcon provides faster inference for a 3 parties setting. While, SPDZ allows inference for N number of parties. Both allow inference with almost same accuracy as plaintext. 

Although Falcon is much more faster, it isn't scalable because it is applicable for only 3 parties. Further, it is less secure. Since it uses 2-out-of-3 sharing where each party recieves two shares allowing 2 parties to reconstruct a secret without other party knowing. Falcon also assumes that majority of parties are honest (2 in this case). 

While, SPDZ and FSS distributes a single share to every party requiring shares from all parties for reconstruction ensuring no parties could collude. Currently SyMPC provides support for SPDZ and FSS with semi-honest security guarantee. This allows parties to tamper with shares leading to incorrect results. 

<h3>What's next?</h3>

SyMPC is still under development! We will add here more features as soon they are stable enough, stay tuned! 🕺

If you enjoyed this tutorial, show your support by Starring SyMPC! 🙏