# Permanent randomized response

The basic idea of differential privacy is to make a query stochastic so that the underlying data stays private. An average attack consists of repeating the same query many times and averaging the answers in order to reliably estimate the underlying data. To prevent this, we should either limit the number of queries or design algorithms that are not vulnerable to this kind of attack.

In this notebook we present a simple example of this phenomenon based on a single node that contains one binary number $n$ that encodes whether the node is guilty ($n=1$) or innocent ($n=0$). A randomized algorithm is used to query the node in order to preserve the privacy of $n$.

[Check reference for this]

### Average attack

In [1]:
import numpy as np
import shfl

Using TensorFlow backend.


We start by creating a single node that contains a binary number. In this case, we set this number to 1 (guilty). By setting f1=0.8, we make sure that 80% of the time we query the node, we get an answer of 1. In the remaining 20% of cases we obtain an answer of 0.

In [2]:
from shfl.private.node import DataNode
from shfl.differential_privacy.dp_mechanism import RandomizedResponseBinary
from shfl.private.data import DataAccessDefinition

n = 1  # the node is guilty

node_single = DataNode()
node_single.set_private_data(name="guilty", data=np.array([n]))
dp_mechanism = RandomizedResponseBinary(f1=0.8, f2=0.8)
data_access_definition = DataAccessDefinition(dp_mechanism=dp_mechanism)
node_single.configure_private_data_access("guilty", data_access_definition)
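To make the mechanism concrete, here is a minimal NumPy sketch of a binary randomized response. It is not the shfl implementation: the function `randomized_response_binary` and its parameters `p_keep_1` and `p_keep_0` are hypothetical names, and the sketch assumes that a true 1 is reported as 1 with probability `p_keep_1` and a true 0 is reported as 0 with probability `p_keep_0`, which matches the 80%/20% behaviour described above.

```python
import numpy as np

def randomized_response_binary(true_bit, p_keep_1=0.8, p_keep_0=0.8, rng=None):
    """Hypothetical stand-in for a binary randomized response mechanism.

    Assumption: a true 1 is reported as 1 with probability p_keep_1,
    and a true 0 is reported as 0 with probability p_keep_0.
    """
    rng = np.random.default_rng() if rng is None else rng
    p_keep = p_keep_1 if true_bit == 1 else p_keep_0
    # Report the true bit with probability p_keep, otherwise flip it.
    return int(true_bit) if rng.random() < p_keep else 1 - int(true_bit)

rng = np.random.default_rng(0)
answers = [randomized_response_binary(1, rng=rng) for _ in range(10)]
print(answers)
```

Each call draws fresh randomness, so two queries on the same bit can disagree.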

If we perform the query just once, we cannot be sure that the result matches the true data.

In [3]:
result = node_single.query_private_data(private_property="guilty")
print("The result of one query is: " + str(int(result)))

The result of one query is: 1


If we perform the query $N$ times and take the average, we can estimate the true data with an error that goes to zero as $N$ increases.

In [4]:
N = 500
result_query = np.empty(N)
for i in range(N):
    result_query[i] = node_single.query_private_data(private_property="guilty")
print(np.mean(result_query))

0.792


We see that the average result of the query is close to 0.8. This allows us to conclude that the true value is most probably 1. Otherwise, the average would have been close to 0.2.
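The effect of $N$ on the attack can be illustrated without shfl. The sketch below simulates the same kind of query with plain NumPy, under the assumption that each answer equals the true bit with probability 0.8. The helper `average_attack` is a hypothetical name introduced only for this illustration. For each $N$ it prints the sample mean, the standard error $\sqrt{p(1-p)/N}$ of that mean, and the attacker's guess obtained by thresholding the mean at 0.5.

```python
import numpy as np

rng = np.random.default_rng(42)
p_keep = 0.8   # assumed probability of reporting the true bit
true_bit = 1

def average_attack(true_bit, n_queries, rng):
    # Each query reports the true bit with probability p_keep, else its flip.
    keep = rng.random(n_queries) < p_keep
    answers = np.where(keep, true_bit, 1 - true_bit)
    return answers.mean()

for n_queries in (10, 100, 1000, 10000):
    estimate = average_attack(true_bit, n_queries, rng)
    std_error = np.sqrt(p_keep * (1 - p_keep) / n_queries)
    guess = int(estimate > 0.5)
    print(f"N={n_queries:>5}  mean={estimate:.3f}  std_error={std_error:.4f}  guess={guess}")
```

Since the two hypotheses predict means of 0.8 and 0.2, the guess becomes reliable once the standard error is small compared with the gap of 0.3 between each mean and the 0.5 threshold.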

### Permanent randomized response

A possible solution to this problem is to create a node that contains two pieces of information: the true data and a **permanent randomized response** (PRR). The latter is initialized to None and, once the node receives the first query, it creates a random binary number following the algorithm described above, which is saved as the PRR. The result of every query is then a randomized response that uses the PRR as input. This way, even if the query is performed a large number of times, the attacker can at best determine the PRR with near certainty, but not the true data.

In [14]:
node_single_prr = DataNode()
data_binary = np.array([n])  # the node is guilty
node_single_prr.set_private_data(name="guilty", data=data_binary)
node_single_prr.configure_private_data_access("guilty", data_access_definition)

permanent_response = node_single_prr.query_private_data(private_property="guilty")
print("The PRR is: " + str(int(permanent_response)))

# we save the prr as a new piece of information
node_single_prr.set_private_data(name="guilty", data=np.append(data_binary, permanent_response))

The PRR is: 0


From now on, all external queries are answered from the permanent response, so repeating the query reveals nothing more about the true data than the PRR itself does.

In [15]:
result_query = np.empty(N)
for i in range(N):
    # we only access the result of the query done to the PRR
    result_query[i] = node_single_prr.query_private_data(private_property="guilty")[1]
print(np.mean(result_query))

0.19


The average is no longer always close to 0.8: when the permanent response is 0, as in this run, it is close to 0.2 instead. The average attack may, at best, identify the permanent randomized response, but not the raw data.
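To see this over many nodes at once, the following NumPy sketch simulates the two-step scheme independently of shfl. It assumes both randomization steps keep their input bit with probability 0.8. The helper `randomize` and the variables `n_nodes`, `true_bits` and `prr` are hypothetical names used only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
p_keep = 0.8   # assumed probability that a response equals its input bit

def randomize(bits, rng):
    # Keep each bit with probability p_keep, flip it otherwise.
    keep = rng.random(bits.shape) < p_keep
    return np.where(keep, bits, 1 - bits)

n_nodes, n_queries = 2000, 500
true_bits = rng.integers(0, 2, size=n_nodes)

# Step 1: each node draws its PRR once and stores it.
prr = randomize(true_bits, rng)

# Step 2: every query is a fresh randomization of the stored PRR.
answers = randomize(np.tile(prr, (n_queries, 1)), rng)

# The attacker averages the answers per node and thresholds at 0.5.
guesses = (answers.mean(axis=0) > 0.5).astype(int)

print("fraction of PRRs recovered:     ", np.mean(guesses == prr))
print("fraction of true bits recovered:", np.mean(guesses == true_bits))
```

Under these assumptions the attacker should recover almost every PRR, while the guesses match the true bit only about as often as the PRR does, that is, roughly 80% of the time, no matter how many queries are made.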