# Federated models: linear regression
Here, we explain how to set up a Linear Regression experiment in the Federated setting using the Sherpa.ai Federated Learning and Differential Privacy Framework. 
Results from federated learning are compared to (non-federated) centralized learning. 
Moreover, we also show how the addition of differential privacy affects the performance of the federated model. 
Finally, an application of the composition theorems for adaptive differential privacy is given. 

## The data
In the present example, we will use the [California Housing dataset from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). 
We only make use of two features, in order to reduce the variance in the prediction. 
The Sherpa.ai Federated Learning and Differential Privacy Framework allows a generic dataset to be easily converted for use with the platform:

In [None]:
import shfl
from shfl.data_base.data_base import LabeledDatabase
import sklearn.datasets
import numpy as np
from shfl.private.reproducibility import Reproducibility

# Comment to turn off reproducibility:
Reproducibility(1234)

all_data = sklearn.datasets.fetch_california_housing()
n_features = 2
data = all_data["data"][:,0:n_features]
labels = all_data["target"]    

# Retain part for DP sensitivity sampling:
size = 2000
sampling_data = data[-size:, ]
sampling_labels = labels[-size:, ]

# Create database:
database = LabeledDatabase(data[0:-size, ], labels[0:-size])
train_data, train_labels, test_data, test_labels = database.load_data()

In [None]:
print("Shape of training and test data: " + str(train_data.shape) + " " + str(test_data.shape))
print("Total: " + str(train_data.shape[0] + test_data.shape[0]))

We will simulate a FL scenario by distributing the training data over a collection of clients, assuming an IID setting:

In [None]:
iid_distribution = shfl.data_distribution.IidDataDistribution(database)
federated_data, test_data, test_labels = iid_distribution.get_federated_data(num_nodes=5)
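
To make the IID assumption concrete, the following is a minimal sketch of how such a split could be done by hand. It does not reproduce the internals of `IidDataDistribution`; the helper `split_iid` and the synthetic arrays are hypothetical and use NumPy only.

```python
import numpy as np

def split_iid(data, labels, num_nodes, seed=0):
    """Shuffle the rows once, then cut them into num_nodes nearly equal parts."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(data.shape[0])
    index_chunks = np.array_split(permutation, num_nodes)
    return [(data[idx], labels[idx]) for idx in index_chunks]

# Synthetic stand-in for the training set (two features, one target).
rng = np.random.default_rng(1234)
toy_data = rng.normal(size=(1003, 2))
toy_labels = toy_data @ np.array([0.5, -0.2]) + 2.0

nodes = split_iid(toy_data, toy_labels, num_nodes=5)
node_sizes = [node_data.shape[0] for node_data, _ in nodes]
print(node_sizes)
```

Because every node receives a random subset of the same pool, each local dataset follows approximately the same distribution as the full training set.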

## The model
At this stage, we need to define the linear regression model. The linear regression model is encapsulated in the Sherpa.ai framework and thus readily usable. We choose the federated aggregator to be the average of the client models:

In [None]:
from shfl.model.linear_regression_model import LinearRegressionModel

def model_builder():
    model = LinearRegressionModel(n_features=n_features, n_targets=1)
    return model

aggregator = shfl.federated_aggregator.FedAvgAggregator()
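
As an illustration of what averaging the client models means for linear regression, here is a minimal NumPy sketch. It assumes a plain unweighted mean of the coefficients and intercepts; the helpers `fit_linear_regression` and `average_params` are hypothetical and are not part of the framework.

```python
import numpy as np

def fit_linear_regression(data, labels):
    """Ordinary least squares with an intercept; returns (coefficients, intercept)."""
    design = np.hstack([data, np.ones((data.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return solution[:-1], solution[-1]

def average_params(client_params):
    """Unweighted mean of each parameter across clients."""
    coefs = np.mean([coef for coef, _ in client_params], axis=0)
    intercept = np.mean([b for _, b in client_params])
    return coefs, intercept

# Five synthetic clients drawing from the same linear relation plus noise.
rng = np.random.default_rng(0)
true_coef, true_intercept = np.array([0.4, -0.1]), 1.5
client_params = []
for _ in range(5):
    x = rng.normal(size=(400, 2))
    y = x @ true_coef + true_intercept + rng.normal(scale=0.1, size=400)
    client_params.append(fit_linear_regression(x, y))

global_coef, global_intercept = average_params(client_params)
print(global_coef, global_intercept)
```

In this toy setting, the averaged parameters land close to the values used to generate the data, which is the behavior one would hope for from the aggregated model.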

## Run the federated learning experiment
We're now ready to run the FL model. 
The Sherpa.ai Federated Learning and Differential Privacy Framework offers support for the Linear Regression model from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). 
The user must specify the number of features and targets in advance.
Note that in this case, we set the number of rounds `n=1` since no iterations are needed in the case of linear regression. 
The performance metrics used are the Root Mean Squared Error (RMSE) and the $R^2$ score.
It can be observed that the global model (i.e., the aggregated model) generally outperforms each individual node, so the federated learning approach proves to be beneficial:

In [None]:
federated_government = shfl.federated_government.FederatedGovernment(model_builder, federated_data, aggregator)
federated_government.run_rounds(n=1, test_data=test_data, test_label=test_labels)

We can also observe that the performance is comparable to that of the centralized learning model:

In [None]:
# Comparison with centralized model:
centralized_model = LinearRegressionModel(n_features=n_features, n_targets=1)
centralized_model.train(data=train_data, labels=train_labels)
print(centralized_model.evaluate(data=test_data, labels=test_labels))
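
For reference, the two metrics used in this comparison, RMSE and $R^2$, follow their standard definitions and can be computed by hand as sketched below. The function names are illustrative; how the framework's `evaluate` method computes them internally is not shown here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])
print(rmse(y_true, y_pred))      # 0.5
print(r2_score(y_true, y_pred))  # approximately 0.8
```

A lower RMSE and an $R^2$ closer to 1 both indicate a better fit.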

## Add differential privacy
We want to assess the impact of differential privacy (see [The Algorithmic Foundations of Differential Privacy](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf), Section 3.3) on the federated model's performance.
### Model's sensitivity
In the case of applying the Laplace privacy mechanism (see [Laplace mechanism notebook](../differential_privacy/differential_privacy_laplace.ipynb)), the noise added has to be of the same order as the sensitivity of the model's output, i.e., the model parameters of our linear regression. 
In the general case, the model's sensitivity might be difficult to compute analytically. 
An alternative approach is to attain random differential privacy through a sampling over the data (e.g., see [Rubinstein 2017](https://arxiv.org/abs/1706.02562)). 
That is, instead of computing the global sensitivity $\Delta f$ analytically, we compute an empirical estimation of it by sampling over the dataset.
However, be advised that this will guarantee the weaker property of random differential privacy.
This approach is convenient, since it allows for the sensitivity estimation of an arbitrary model or a black-box computer function.
The Sherpa.ai Federated Learning and Differential Privacy Framework provides this functionality in the class `SensitivitySampler`.

We need to specify a distribution of the data to sample from. 
Generally, this requires previous knowledge and/or model assumptions. 
In order not to make any specific assumptions about the distribution of the dataset, we can choose a uniform distribution. 
We define our own `ProbabilityDistribution` subclass that samples uniformly over the rows of a data array.
We use the previously retained part of the dataset for sampling:

In [None]:
class UniformDistribution(shfl.differential_privacy.ProbabilityDistribution):
    """
    Implement Uniform sampling over the data
    """
    def __init__(self, sample_data):
        self._sample_data = sample_data

    def sample(self, sample_size):
        row_indices = np.random.randint(low=0, high=self._sample_data.shape[0], size=sample_size, dtype='l')
        
        return self._sample_data[row_indices, :]
    
sample_data = np.hstack((sampling_data, sampling_labels.reshape(-1,1)))

The class `SensitivitySampler` implements the sampling, given a query, i.e., the learning model itself, in this case.
We only need to add the `get` method to our model since it is required by the class `SensitivitySampler`. 
We choose the sensitivity norm to be the $L_1$ norm and we apply the sampling. 
The estimated sensitivity depends on the number of samples `n`: the more samples we draw, the more accurate the estimate becomes, and it typically decreases.

In [None]:
from shfl.differential_privacy import SensitivitySampler
from shfl.differential_privacy import L1SensitivityNorm

class LinearRegressionSample(LinearRegressionModel):
    
    def get(self, data_array):
        data = data_array[:, 0:-1]
        labels = data_array[:, -1]
        self.train(data, labels)
      
        return self.get_model_params()

distribution = UniformDistribution(sample_data)
sampler = SensitivitySampler()
n_samples = 4000
max_sensitivity, mean_sensitivity = sampler.sample_sensitivity(
    LinearRegressionSample(n_features=n_features, n_targets=1), 
    L1SensitivityNorm(), distribution, n=n_samples, gamma=0.05)
print("Max sensitivity from sampling: " + str(max_sensitivity))
print("Mean sensitivity from sampling: " + str(mean_sensitivity))

Unfortunately, each sample requires training the model on two datasets that differ in one entry.
Thus, in general, this procedure might be computationally expensive (e.g., when training a deep neural network).
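
The idea of training on two neighboring datasets per sample can be sketched as follows. This is a simplified illustration and not the `SensitivitySampler` implementation: it uses NumPy least squares as the query, skips the choice of sample count and order statistic driven by `gamma`, and the names `fit_params` and `sample_l1_sensitivity` are hypothetical.

```python
import numpy as np

def fit_params(data_array):
    """Query: least-squares parameters (coefficients, then intercept).
    The last column of data_array holds the labels."""
    features, labels = data_array[:, :-1], data_array[:, -1]
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return solution

def sample_l1_sensitivity(query, pool, dataset_size, n_pairs, seed=0):
    """Draw n_pairs of datasets differing in one entry and record the
    L1 norm of the difference between the two query outputs."""
    rng = np.random.default_rng(seed)
    norms = np.empty(n_pairs)
    for i in range(n_pairs):
        rows = rng.integers(0, pool.shape[0], size=dataset_size + 1)
        db1 = pool[rows[:-1]]                        # first dataset
        db2 = pool[np.append(rows[:-2], rows[-1])]   # last entry replaced
        norms[i] = np.sum(np.abs(query(db1) - query(db2)))
    return norms.max(), norms.mean()

# Synthetic pool: two features plus a label column.
rng = np.random.default_rng(1234)
features = rng.normal(size=(2000, 2))
labels = features @ np.array([0.4, -0.1]) + 1.5 + rng.normal(scale=0.5, size=2000)
pool = np.hstack([features, labels.reshape(-1, 1)])

max_s, mean_s = sample_l1_sensitivity(fit_params, pool, dataset_size=200, n_pairs=100)
print(max_s, mean_s)
```

Each iteration fits the model twice, which is why the cost grows quickly for models that are expensive to train.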

### Run the federated learning experiment with differential privacy
At this stage, we are ready to add a layer of DP to our federated learning model. 
We will apply the Laplace mechanism, employing the sensitivity obtained from the previous sampling. 
The Laplace mechanism provided by the Sherpa.ai Federated Learning and Differential Privacy Framework is then assigned as the private access type to the model parameters of each client in a new `FederatedGovernment` object. 
This results in an $\epsilon$-differentially private FL model.
For example, by choosing the value $\epsilon = 0.5$, we can run the FL experiment with DP:

In [None]:
from shfl.differential_privacy import LaplaceMechanism

params_access_definition = LaplaceMechanism(sensitivity=max_sensitivity, epsilon=0.5)
federated_governmentDP = shfl.federated_government.FederatedGovernment(
    model_builder, federated_data, aggregator, model_params_access=params_access_definition)
federated_governmentDP.run_rounds(n=1, test_data=test_data, test_label=test_labels)

In the above example, we saw that the performance of the model deteriorated slightly due to the addition of differential privacy. 
It must be noted that each run involves a different random noise added by the differential privacy mechanism.
In general, however, stronger privacy (i.e., smaller values of $\epsilon$) comes at the expense of accuracy.
This can be observed by calculating a mean of several runs, as explained below.
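
The role of $\epsilon$ can also be seen in isolation. The sketch below applies the standard Laplace mechanism, which adds independent noise with scale $\Delta f / \epsilon$ to each parameter. The parameter vector and the sensitivity value are hypothetical, and the helper `laplace_mechanism` is not the framework's `LaplaceMechanism` class.

```python
import numpy as np

def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Add i.i.d. Laplace noise with scale sensitivity / epsilon to each value."""
    scale = sensitivity / epsilon
    return values + rng.laplace(loc=0.0, scale=scale, size=np.shape(values))

rng = np.random.default_rng(0)
params = np.array([0.4, -0.1, 1.5])  # hypothetical model parameters
sensitivity = 0.05                   # hypothetical L1 sensitivity

mean_abs_noise_by_epsilon = {}
for epsilon in (0.2, 0.5, 0.8):
    noisy = np.array([laplace_mechanism(params, sensitivity, epsilon, rng)
                      for _ in range(5000)])
    mean_abs_noise_by_epsilon[epsilon] = float(np.mean(np.abs(noisy - params)))
    print(epsilon, mean_abs_noise_by_epsilon[epsilon])
```

Since the expected absolute value of Laplace noise equals its scale, the average perturbation is roughly `sensitivity / epsilon`, so halving $\epsilon$ doubles the typical noise added to each parameter.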

### Multiple queries: composition of differentially private mechanisms using adaptivity
Now, we will explain the application of the composition theorems using adaptivity, as implemented in the Sherpa.ai Federated Learning and Differential Privacy Framework (see the [Composition concepts notebook](../differential_privacy/differential_privacy_composition_concepts.ipynb)).
The idea is to stop querying once the privacy budget is expended.
The budget is consumed every time a query is executed on a client's dataset, since repeated queries might disclose sensitive information (see [The Algorithmic Foundations of Differential Privacy](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf), Section 3.5.2).
Note that, when applying the composition theorems for privacy filters in the present example, we are assuming that the estimated sensitivity is a good enough approximation of the analytic sensitivity (see [Rogers 2016](https://papers.nips.cc/paper/6170-privacy-odometers-and-filters-pay-as-you-go-composition.pdf)).

In the following experiment, we set a privacy budget (variable `global_epsilon_delta = (4, 0)`), and we consider different values of $\epsilon$ for the query (variable `epsilon_range = np.array([0.2,0.5,0.8])`). 
In each case, the execution automatically exits when the privacy budget is expended. 
Taking the average of the performance metrics, we can verify that the accuracy increases for larger values of $\epsilon$, which is associated with lower privacy.  

In [None]:
# Run several runs with different levels of privacy: for fixed sensitivity, we use different values of epsilon
from shfl.differential_privacy.composition_dp import AdaptiveDifferentialPrivacy
from shfl.differential_privacy.composition_dp import ExceededPrivacyBudgetError

global_epsilon_delta = (4, 0) 
epsilon_range = np.array([0.2,0.5,0.8])
gl_evaluationDP = np.zeros((epsilon_range.size, 2))

for i_epsilon in range(epsilon_range.size):
    print("---------------------------\n")
    print("epsilon = " + str(epsilon_range[i_epsilon]))
    
    dpm = LaplaceMechanism(sensitivity=max_sensitivity, epsilon=epsilon_range[i_epsilon])
    
    params_access_definition = AdaptiveDifferentialPrivacy(global_epsilon_delta, differentially_private_mechanism=dpm)
    federated_governmentDP = shfl.federated_government.FederatedGovernment(
        model_builder, federated_data, aggregator, model_params_access=params_access_definition)
    i_run = 0
    while True:
        try:
            # Queries are performed using the Laplace mechanism
            #print("i_run = " + str(i_run))
            federated_governmentDP.run_rounds(n=1, test_data=test_data, test_label=test_labels)
            print("Executed i_run = " + str(i_run))
            gl_evaluationDP[i_epsilon,:] += np.asarray(federated_governmentDP._model.evaluate(data=test_data, labels=test_labels))
            print("\n")
            i_run += 1
        except ExceededPrivacyBudgetError:
            # At this point we have spent all our privacy budget
            print("Reached privacy budget at i_run = " + str(i_run))
            print("\n")
            gl_evaluationDP[i_epsilon,:] = np.divide(gl_evaluationDP[i_epsilon,:], i_run)
            break 
        
print("Mean performance: \n" + str(gl_evaluationDP))
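
To give a rough sense of how a privacy budget limits the number of queries, here is a simplified sketch that uses basic composition only, where the $\epsilon$ values of successive queries are summed. The class `BasicCompositionFilter` is hypothetical. The framework's `AdaptiveDifferentialPrivacy` relies on the privacy filters of Rogers 2016, so the number of rounds it allows may differ from the counts below.

```python
class BasicCompositionFilter:
    """Hypothetical helper: refuses a query once the summed epsilons
    would exceed the global budget (basic composition, delta = 0)."""

    def __init__(self, global_epsilon, tolerance=1e-9):
        self.global_epsilon = global_epsilon
        self.spent = 0.0
        # Small tolerance so that floating-point rounding in the running
        # sum does not reject a query that exactly exhausts the budget.
        self.tolerance = tolerance

    def try_spend(self, epsilon):
        if self.spent + epsilon > self.global_epsilon + self.tolerance:
            return False
        self.spent += epsilon
        return True

allowed_queries = {}
for epsilon in (0.2, 0.5, 0.8):
    budget = BasicCompositionFilter(global_epsilon=4.0)
    n_queries = 0
    while budget.try_spend(epsilon):
        n_queries += 1
    allowed_queries[epsilon] = n_queries

print(allowed_queries)  # {0.2: 20, 0.5: 8, 0.8: 5}
```

Under this simplified accounting, a smaller $\epsilon$ per query buys more queries from the same budget, at the price of more noise in each one.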