
In [17]:
!pip install tensorflow_privacy

from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy
from tensorflow_privacy.privacy.analysis.rdp_accountant import compute_rdp, get_privacy_spent
import numpy as np
import math
import pandas as pd



"the privacy cost of this mechanism is fully specified by $q$ (sampling probability) together with the privacy tuple $(S, \sigma)$" ('A General Approach ...', TF Privacy paper)

$S = l_2$-norm clip

$\sigma =$ standard deviation of noise added
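
To make the roles of $S$ and $\sigma$ concrete, here is a minimal sketch of clipping one gradient vector to $l_2$ norm $S$ and then adding Gaussian noise with standard deviation $\sigma$. The helper names `clip_l2` and `add_gaussian_noise` are hypothetical and the code is plain Python for illustration only; it is not the TF Privacy implementation.

```python
import math
import random

def clip_l2(vector, l2_norm_clip):
    """Scale `vector` down so that its l2 norm is at most `l2_norm_clip`."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= l2_norm_clip:
        return list(vector)  # already within the clip, leave unchanged
    scale = l2_norm_clip / norm
    return [x * scale for x in vector]

def add_gaussian_noise(vector, stddev, rng):
    """Add independent zero-mean Gaussian noise to every coordinate."""
    return [x + rng.gauss(0.0, stddev) for x in vector]

rng = random.Random(0)   # seeded only to make the sketch repeatable
gradient = [3.0, 4.0]    # l2 norm is 5.0
clipped = clip_l2(gradient, l2_norm_clip=1.0)
noisy = add_gaussian_noise(clipped, stddev=0.1, rng=rng)
print(clipped, noisy)
```

Clipping bounds how much any single example can move the result, and the noise is then calibrated against that bound.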

---

$\epsilon = 0$ is perfect privacy but no utility

$\epsilon = 1$ is thought of as a good amount of privacy

$\epsilon = \infty$ is no privacy
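
One way to read these values: under pure $\epsilon$-differential privacy (taking $\delta = 0$), changing one record can change the probability of any output by a factor of at most $e^\epsilon$. The sketch below just evaluates that factor; `max_likelihood_ratio` is a hypothetical helper name.

```python
import math

def max_likelihood_ratio(epsilon):
    """Largest factor by which one record can change an output probability."""
    return math.exp(epsilon)

for epsilon in (0.0, 1.0, 10.0, float("inf")):
    print(epsilon, max_likelihood_ratio(epsilon))
```

A factor of 1 means the output reveals nothing about any individual record, while an unbounded factor gives no guarantee at all.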

---
These tools provide an upper bound for $\epsilon$; the true privacy loss could be much lower. Recording a ledger during training allows a tighter bound to be calculated later.


# Privacy parameter selection

### Estimate Procedure (no ledger)

Most of this procedure is handled by the modified version of `apply_dp_sgd_analysis` defined in this notebook.


1. Fix the number of epochs, batch size, number of microbatches and training set size $n$ (or ranges of each to try), and a `target_delta` less than $1/n$.
2. Fix a value for `noise_multiplier`.
3. For a range of orders, call `compute_rdp` (using `q = batch_size/n` and `steps = epochs*n/batch_size`, the number of batches processed) to get the RDP at each order.
4. Convert each RDP value to its equivalent epsilon. `get_privacy_spent` does this and returns the smallest epsilon together with the order that achieves it.
5. If needed, repeat steps 3-4 until the range of orders covers the range $$(1+\frac{\ln(1/\delta)}{\epsilon_{max}},1+\frac{\ln(1/\delta)}{\epsilon_{min}})$$
This usually takes only one or two attempts, as epsilon is not very sensitive to the orders, but it helps make sure the optimal order is not the min/max of the range tried. Expanding the range of orders doesn't slow the analysis much.
6. Repeat for each stage of training and sum the results. We're expecting very large values of epsilon, so reporting them more precisely than to an order of magnitude doesn't add much value.
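
Steps 3-4 can be sketched without TF Privacy for the special case $q = 1$ (no subsampling), where the Gaussian mechanism with noise multiplier $\sigma$ has RDP $\alpha / (2\sigma^2)$ per step at order $\alpha$ and RDP adds up across steps. The conversion used is the classic $\epsilon = \text{rdp} + \ln(1/\delta)/(\alpha - 1)$. This is a simplified illustration under those assumptions: `gaussian_rdp` and `rdp_to_epsilon` are hypothetical names, the parameter values are illustrative rather than the notebook's, and `compute_rdp` / `get_privacy_spent` additionally account for the sampling probability and may use a different conversion.

```python
import math

def gaussian_rdp(sigma, steps, order):
    """RDP at `order` of `steps` composed Gaussian mechanisms (q = 1 only)."""
    return steps * order / (2.0 * sigma ** 2)

def rdp_to_epsilon(orders, rdps, delta):
    """Return (smallest epsilon, order achieving it) over the given orders."""
    candidates = [rdp + math.log(1.0 / delta) / (order - 1.0)
                  for order, rdp in zip(orders, rdps)]
    best = min(range(len(orders)), key=lambda i: candidates[i])
    return candidates[best], orders[best]

sigma, steps, delta = 4.0, 100, 1e-5             # illustrative values
orders = [1.0 + 0.1 * k for k in range(1, 200)]  # orders from 1.1 to 20.9
rdps = [gaussian_rdp(sigma, steps, order) for order in orders]
epsilon, opt_order = rdp_to_epsilon(orders, rdps, delta)

# Step 5: if the optimum sits on the edge of the range, widen it and retry.
if opt_order in (min(orders), max(orders)):
    print("Widen the range of orders and retry")
print(epsilon, opt_order)
```

Searching a grid of orders rather than a single order is what makes the edge check in step 5 meaningful: an optimum on the boundary suggests a better order lies outside the range.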

---
Adjust the step 1 parameters for training speed.

Adjust the step 2 parameter by choosing the noise standard deviation and the $l_2$-norm clip to trade utility for privacy:

noise_multiplier = standard_dev/l2_norm_clip

The larger the noise multiplier, the lower the epsilon (more private) but the worse the utility. A utility evaluation is needed to inform the choice of noise multiplier.
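
The relation between the noise multiplier, the noise standard deviation and the clip can be checked numerically. The sketch below uses hypothetical helper names, and the per-step cost it prints is the Gaussian-mechanism RDP $\alpha/(2z^2)$ for noise multiplier $z$ without subsampling, used here only to illustrate the direction of the trade-off.

```python
def noise_multiplier_from(stddev, l2_norm_clip):
    """noise_multiplier = standard_dev / l2_norm_clip."""
    return stddev / l2_norm_clip

def per_step_gaussian_rdp(noise_multiplier, order):
    """Per-step RDP of the Gaussian mechanism, assuming no subsampling."""
    return order / (2.0 * noise_multiplier ** 2)

# Scaling the standard deviation and the clip together leaves the multiplier unchanged.
same_a = noise_multiplier_from(stddev=0.1, l2_norm_clip=1.0)
same_b = noise_multiplier_from(stddev=0.2, l2_norm_clip=2.0)
print(same_a, same_b)

# A larger multiplier gives a smaller per-step privacy cost at a fixed order.
costs = [per_step_gaussian_rdp(z, order=2.0) for z in (0.1, 0.5, 1.0, 2.0)]
print(costs)
```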

In [41]:
epochs_list = list(range(5000, 11000, 1000)) # rough values supplied by Evonne
batch_size_list = [32,64,128] # should be integer multiple of number of microbatches

noise_multiplier = 0.1
num_microbatches = 1 # increasing may increase utility but slows training
n = 1421 # Training pop size
target_delta = 1e-4 # should be less than 1/n, since it can be interpreted as a probability of failure for each individual
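
As a quick check of the quantities derived from these settings, the following sketch works out the sampling ratio and step count for one combination (5000 epochs, batch size 32) using the formulas from the procedure above. It is plain arithmetic and does not call TF Privacy.

```python
import math

n = 1421             # training set size, as in the notebook
batch_size = 32
epochs = 5000
target_delta = 1e-4

q = batch_size / n                               # sampling ratio
steps = int(math.ceil(epochs * n / batch_size))  # number of batches processed
delta_ok = target_delta < 1 / n                  # delta should be below 1/n

print(round(q, 6), steps, delta_ok)
```

These values agree with the `q` and `steps` columns reported for this combination in the results table.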

In [47]:
# here sigma = noise_multiplier
def apply_dp_sgd_analysis(q, sigma, steps, orders, delta):

  rdp = compute_rdp(q, sigma, steps, orders)
  eps, _, opt_order = get_privacy_spent(orders, rdp, target_delta=delta)

  # print('DP-SGD with sampling rate = {:.3g}% and noise_multiplier = {} iterated'
  #       ' over {} steps satisfies'.format(100 * q, sigma, steps), end=' ')
  # print('differential privacy with eps = {:.3g} and delta = {}.'.format(
  #     eps, delta))
  # print('The optimal RDP order is {}.'.format(opt_order))

  if opt_order == max(orders):
    print('The privacy estimate is likely to be improved by increasing the max order')
  if opt_order == min(orders):
    print('The privacy estimate is likely to be improved by decreasing the min order')

  return [q, sigma, steps, eps, opt_order, delta]

def suggest_order_range(target_delta, eps_max, eps_min):
  # print(str(1+(math.log(1/target_delta)/eps_max)))
  # print(str(1+(math.log(1/target_delta)/eps_min)))
  return 1+(math.log(1/target_delta)/eps_max), 1+(math.log(1/target_delta)/eps_min)
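
For reference, the sketch below evaluates the suggested range of orders for the values used in this notebook ($\delta = 10^{-4}$, epsilon between $10^3$ and $10^5$). `order_range` is a hypothetical restatement of the same formula that returns the lower bound first, and the `reported` list copies the `opt_order` values from the results table.

```python
import math

def order_range(delta, eps_min, eps_max):
    """Return (low, high) orders: 1 + ln(1/delta)/eps_max and 1 + ln(1/delta)/eps_min."""
    log_inv_delta = math.log(1.0 / delta)
    return 1.0 + log_inv_delta / eps_max, 1.0 + log_inv_delta / eps_min

low, high = order_range(1e-4, eps_min=1e3, eps_max=1e5)
print(low, high)

# Optimal orders reported in the results table of this notebook.
reported = [1.000836, 1.001023, 1.001209]
inside = all(low < order < high for order in reported)
print(inside)
```

Because the reported optimal orders fall strictly inside the range, neither of the warnings about widening the range should be triggered for these settings.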

In [50]:
start, stop = suggest_order_range(target_delta, eps_max=10**5, eps_min=10**3) # expected range of epsilon
orders = np.linspace(start, stop) # adjust this in step 5

analysis = []
for batch_size in batch_size_list:
  for epochs in epochs_list:
    q = batch_size/n # sampling ratio
    if q>1:
      # ValueError is a builtin; the original app.UsageError needs absl, which is not imported
      raise ValueError('n must be larger than batch size')
    steps = int(math.ceil(epochs*n/batch_size))
    analysis.append([epochs, batch_size] + apply_dp_sgd_analysis(q, noise_multiplier, steps, orders, target_delta))

df = pd.DataFrame(analysis, columns=['epochs', 'batch_size', 'q', 'noise_multiplier', 'steps', 'epsilon', 'opt_order', 'delta'])
df

| | epochs | batch_size | q | noise_multiplier | steps | epsilon | opt_order | delta |
|---|---|---|---|---|---|---|---|---|
| 0 | 5000 | 32 | 0.022519 | 0.1 | 222032 | 240456.134111 | 1.001209 | 0.0001 |
| 1 | 6000 | 32 | 0.022519 | 0.1 | 266438 | 287022.835716 | 1.001209 | 0.0001 |
| 2 | 7000 | 32 | 0.022519 | 0.1 | 310844 | 333498.349849 | 1.001023 | 0.0001 |
| 3 | 8000 | 32 | 0.022519 | 0.1 | 355250 | 379853.908292 | 1.001023 | 0.0001 |
| 4 | 9000 | 32 | 0.022519 | 0.1 | 399657 | 426210.510638 | 1.001023 | 0.0001 |
| 5 | 10000 | 32 | 0.022519 | 0.1 | 444063 | 472471.17653 | 1.000836 | 0.0001 |
| 6 | 5000 | 64 | 0.045039 | 0.1 | 111016 | 244023.531842 | 1.001209 | 0.0001 |
| 7 | 6000 | 64 | 0.045039 | 0.1 | 133219 | 291303.706567 | 1.001209 | 0.0001 |
| 8 | 7000 | 64 | 0.045039 | 0.1 | 155422 | 338483.796731 | 1.001023 | 0.0001 |
| 9 | 8000 | 64 | 0.045039 | 0.1 | 177625 | 385551.557289 | 1.001023 | 0.0001 |


This estimate calculates the privacy loss of a single step and assumes that every step incurs the same loss.