### Description of the code

\texttt{Sherpa.FL} allows to easily convert a generic dataset to interact with the platform:

In [1]:
import shfl
from shfl.data_base.data_base import LabeledDatabase
import sklearn.datasets
import numpy as np

all_data = sklearn.datasets.fetch_california_housing()
n_features = 2
data = all_data["data"][:,0:n_features]
labels = all_data["target"]    

# Retain part for DP sensitivity sampling:
size = 2000
sampling_data = data[:-size, ]
sampling_labels = labels[:-size, ]

# Create database:
database = LabeledDatabase(data[0:-size, ], labels[0:-size])
np.random.seed(123)     # Reproducibility 
train_data, train_labels, test_data, test_labels = database.load_data()

Using TensorFlow backend.


In [2]:
print("Shape of train and test data: " + str(train_data.shape) + str(test_data.shape))
print("Total: " + str(train_data.shape[0] + test_data.shape[0]))

Shape of train and test data: (14912, 2)(3728, 2)
Total: 18640


We will simulate a FL scenario by distributing the train data over a collection of clients, assuming an IID setting:

In [3]:
np.random.seed(132)     # Reproducibility
iid_distribution = shfl.data_distribution.IidDataDistribution(database)
federated_data, test_data, test_labels = iid_distribution.get_federated_data(num_nodes=5, percent=100)

At this stage, we need to define the linear regression model, and we choose the federated aggregator to be the average of the clients' models: 

In [4]:
from shfl.model import LinearRegressionModel

def model_builder():
    model = LinearRegressionModel(n_features=n_features, n_targets=1)
    return model

aggregator = shfl.federated_aggregator.FedAvgAggregator()

### Running the model in a Federated configuration
We're now ready to run the FL model. 
Note that in this case, we set the number of rounds `n=1` since no iterations are needed in the case of linear regression. 
The performance metrics used are the Root Mean Squared Error (RMSE) and the $R^2$ score.
It can be observed that the performance of the *Global model* (i.e. the aggregated model) is in general superior with respect to the performance of each node, thus the federated learning approach proves to be beneficial:

In [5]:
federated_government = shfl.learning_approach.FederatedGovernment(model_builder, federated_data, aggregator)
federated_government.run_rounds(n=1, test_data=test_data, test_label=test_labels)

Accuracy round 0
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd5f8>: (0.8459915023737836, 0.4745474503504118)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd630>: (0.8444041146505223, 0.47651748052103493)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd860>: (0.8452828590455053, 0.47542737051482586)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd940>: (0.8447408958734675, 0.4760998268466655)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dda58>: (0.8441217945411079, 0.4768674668212892)
Global model test performance : (0.8446385610343933, 0.47622675332252484)





In [6]:
# Comparison with centralized model:
centralized_model = LinearRegressionModel(n_features=n_features, n_targets=1)
centralized_model.train(data=train_data, labels=train_labels)
print(centralized_model.evaluate(data=test_data, labels=test_labels))

(0.8446335485225723, 0.47623296997405207)


### Differential Privacy: sampling the model's sensitivity
In the case of applying the Laplace privacy mechanism (see Section \ref{sec:dp_key_elements}), the noise added has to be of the order of the sensitivity of the model's output, i.e. the model parameters of our linear regression. 
In the general case, the model's sensitivity might be difficult to compute analytically. 
An alternative approach is to attain *random* differential privacy through a sampling over the data \citep{Rubinstein2017}. 
That is, instead of computing analytically the *global* sensitivity $\Delta f$, we compute an *empirical estimation* of it by sampling over the dataset.
This approach is convenient since allows for the sensitivity estimation of an arbitrary model or a black-box computer function.
The \texttt{Sherpa.FL} framework provides this functionality in the class `SensitivitySampler`.

In order to carry out this approach, we need to specify a distribution of the data to sample from. 
This in general requires previous knowledge and/or model assumptions. 
In order not make any specific assumption on the distribution of the dataset, we can choose a *uniform* distribution. 
To the end, we define our class of `ProbabilityDistribution` that uniformly samples over a data-frame.
We use the previously retained part of the dataset for sampling:

In [7]:
class UniformDistribution(shfl.differential_privacy.ProbabilityDistribution):
    """
    Implement Uniform sampling over the data
    """
    def __init__(self, sample_data):
        self._sample_data = sample_data

    def sample(self, sample_size):
        row_indices = np.random.randint(low=0, high=self._sample_data.shape[0], size=sample_size, dtype='l')
        
        return self._sample_data[row_indices, :]
    
sample_data = np.hstack((sampling_data, sampling_labels.reshape(-1,1)))

The class `SensitivitySampler` implements the sampling given a *query*, i.e. the learning model itself in this case.
We only need to add the method `get` to our model since it is required by the class `SensitivitySampler`. 
We choose the sensitivity norm to be the $L_1$ norm and we apply the sampling. 
The value of the sensitivity depends on the number of samples `n`: the more samples we perform, the more accurate the sensitivity. 
Indeed, increasing the number of samples `n`, the sensitivity gets more accurate and typically decreases.

In [8]:
from shfl.differential_privacy import SensitivitySampler
from shfl.differential_privacy import L1SensitivityNorm

class LinearRegressionSample(LinearRegressionModel):
    
    def get(self, data_array):
        data = data_array[:, 0:-1]
        labels = data_array[:, -1]
        train_model = self.train(data, labels)
      
        return self.get_model_params()

distribution = UniformDistribution(sample_data)
sampler = SensitivitySampler()
np.random.seed(456)     # Reproducibility
n_samples = 2500
max_sensitivity, mean_sensitivity = sampler.sample_sensitivity(
    LinearRegressionSample(n_features=n_features, n_targets=1), 
    L1SensitivityNorm(), distribution, n=n_samples, gamma=0.05)
print("Max sensitivity from sampling: " + str(max_sensitivity))
print("Mean sensitivity from sampling: " + str(mean_sensitivity))

Max sensitivity from sampling: 0.02264994843712804
Mean sensitivity from sampling: 0.0012347735748639995


Unfortunately, sampling over a dataset involves, at each sample, the training of the model on two datasets differing in one entry \citep{Rubinstein2017}.
Thus in general this procedure might be computationally expensive (e.g. in the case of training a deep neuronal network).

### Running the model in a Federated configuration with Differential Privacy
At this stage we are ready to add a layer of DP to our federated learning model. 
Specifically, we will apply the Laplace mechanism from Section \ref{sec:dp_key_elements}, assuming the sensitivity of our model is the mean obtained from the previous sampling, namely $\Delta f \approx 0.001$. 
The Laplace mechanism provided by the \texttt{Sherpa.FL} Framework is then assigned as the \textit{private} access type to the model's parameters of each client in a new `FederatedGovernment` object. 
This results into an $\epsilon$-\textit{differentially private FL model}.
For example, picking the value $\epsilon = 0.5$, we can run the FL experiment with DP:

In [9]:
from shfl.differential_privacy import LaplaceMechanism

params_access_definition = LaplaceMechanism(sensitivity=mean_sensitivity, epsilon=0.5)
federated_governmentDP = shfl.learning_approach.FederatedGovernment(
    model_builder, federated_data, aggregator, model_params_access=params_access_definition)
np.random.seed(789)     # Reproducibility
federated_governmentDP.run_rounds(n=1, test_data=test_data, test_label=test_labels)

Accuracy round 0
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd5f8>: (0.8459915023737836, 0.4745474503504118)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd630>: (0.8444041146505223, 0.47651748052103493)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd860>: (0.8452828590455053, 0.47542737051482586)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd940>: (0.8447408958734675, 0.4760998268466655)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dda58>: (0.8441217945411079, 0.4768674668212892)
Global model test performance : (0.8460469476619272, 0.474478573009117)





In the above example we observed that the performance of the model has slightly deteriorated due to the addition of DP.
In general, the privacy increases at expenses of accuracy (i.e. for smaller values of $\epsilon$).

In [10]:
# Run several runs with different levels of privacy: for fixed Delta_f, we change epsilon
from shfl.differential_privacy.composition_dp import AdaptiveDifferentialPrivacy
from shfl.differential_privacy.composition_dp import ExceededPrivacyBudgetError

global_epsilon_delta = (4, 0) 
epsilon_range = np.array([0.2,0.5,0.8])
gl_evaluationDP = np.zeros((epsilon_range.size, 2))

np.random.seed(1234)
for i_epsilon in range(epsilon_range.size):
    print("---------------------------\n")
    print("epsilon = " + str(epsilon_range[i_epsilon]))
    
    dpm = LaplaceMechanism(sensitivity=mean_sensitivity, epsilon=epsilon_range[i_epsilon])
    
    params_access_definition = AdaptiveDifferentialPrivacy(global_epsilon_delta, differentially_private_mechanism=dpm)
    federated_governmentDP = shfl.learning_approach.FederatedGovernment(
        model_builder, federated_data, aggregator, model_params_access=params_access_definition)
    i_run = 0
    while True:
        try:
            # Queries are performed using the Laplace mechanism
            print("i_run = " + str(i_run))
            federated_governmentDP.run_rounds(n=1, test_data=test_data, test_label=test_labels)
            print("Executed i_run = " + str(i_run))
            gl_evaluationDP[i_epsilon,:] += np.asarray(federated_governmentDP._model.evaluate(data=test_data, labels=test_labels))
            print("Sum i_run = " + str(i_run))
            i_run += 1
        except ExceededPrivacyBudgetError:
            # At this point we have spent all our privacy budget
            print("Reached privacy budget at i_run = " + str(i_run))
            gl_evaluationDP[i_epsilon,:] = np.divide(gl_evaluationDP[i_epsilon,:], i_run)
            break 
        
print(gl_evaluationDP)

---------------------------

epsilon = 0.2
i_run = 0
Accuracy round 0
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd5f8>: (0.8459915023737836, 0.4745474503504118)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd630>: (0.8444041146505223, 0.47651748052103493)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd860>: (0.8452828590455053, 0.47542737051482586)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dd940>: (0.8447408958734675, 0.4760998268466655)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7fe28c5dda58>: (0.8441217945411079, 0.4768674668212892)
Global model test performance : (0.8453108924030469, 0.4753925755919852)



Executed i_run = 0
Sum i_run = 0
i_run = 1
Accuracy round 0
Test performance client <shfl.private.federated_operation.FederatedDataN