# Simple Attack

In this notebook, we will examine perhaps the simplest possible attack on an individual's private data and what the OpenDP library can do to mitigate it.

## Loading the data

The code in the OpenDP Library is currently undergoing a vetting process.
Constructors that have not yet been vetted can still be accessed if you opt in to "contrib".

In [105]:
import numpy as np
from opendp.mod import enable_features
enable_features('contrib')

We begin by loading the data.

In [106]:
import os
data_path = os.path.join('.', 'data', 'pums_10000.csv')

with open(data_path) as input_file:
    col_names = input_file.readline().strip().split(',')
    data = input_file.read()

print(col_names)
print('\n'.join(data.split('\n')[:6]))

['sex', 'age', 'educ', 'income', 'married', 'race']
0,45,6,6000,1,1
1,41,8,13000,1,2
0,63,14,17810,1,1
1,71,15,3600,1,4
0,44,5,10000,0,1
1,49,1,0,0,4


The following code parses the data to extract a single vector containing all the incomes.
More details on it can be found [here](https://github.com/opendp/opendp/blob/main/python/example/basic_data_analysis.ipynb).

In [107]:
from opendp.trans import make_split_dataframe, make_select_column, make_cast, make_impute_constant

income_preprocessor = (
    # Convert data into a dataframe where columns are of type Vec<str>
    make_split_dataframe(separator=",", col_names=col_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="income", TOA=str)
)

# make a transformation that casts from a vector of strings to a vector of floats
cast_str_float = (
    # Cast Vec<str> to Vec<Option<float>>
    make_cast(TIA=str, TOA=float) >>
    # Replace any elements that failed to parse with 0., emitting a Vec<float>
    make_impute_constant(0.)
)

# replace the previous preprocessor: extend it with the caster
income_preprocessor = income_preprocessor >> cast_str_float
incomes = income_preprocessor(data)

print(incomes[:7])

[6000.0, 13000.0, 17810.0, 3600.0, 10000.0, 0.0, 30530.0]
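
For intuition, the preprocessing chain above does roughly what the following plain-Python sketch does: split each line on commas, pick out the `income` field, parse it as a float, and substitute `0.0` for anything that fails to parse. The helper name `parse_income_column` and the tiny inline CSV are illustrative assumptions, not part of OpenDP, and the sketch carries none of the stability bookkeeping that the OpenDP transformations provide.

```python
def parse_income_column(raw_text, col_names, key="income", fill=0.0):
    """Hypothetical helper: extract one CSV column as floats, imputing `fill`."""
    idx = col_names.index(key)
    values = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(",")
        try:
            values.append(float(fields[idx]))
        except (ValueError, IndexError):
            # unparseable or missing field: impute the constant
            values.append(fill)
    return values


col_names = ["sex", "age", "educ", "income", "married", "race"]
# made-up sample; the third row has a deliberately broken income field
sample = "0,45,6,6000,1,1\n1,41,8,13000,1,2\n0,63,14,not_a_number,1,1\n"
sample_incomes = parse_income_column(sample, col_names)
print(sample_incomes)
```

The broken field in the third row is replaced by the fill value, mirroring the role of `make_impute_constant(0.)`.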


## A simple attack

Say there's an attacker whose target is the first person in our data (i.e. the first row of the CSV),
and who intends to learn that person's income.

In [108]:
person_of_interest = incomes[0]
print('person of interest:\n\n{0}'.format(person_of_interest))

person of interest:

6000.0


Now consider the case where the attacker knows everything about the data except the person of interest's (POI) income, which is considered private.
They can back out the individual's income very easily, just by asking for the overall mean income.

In [109]:
# attacker information: they already know everyone else's income, so they can certainly compute the following
known_mean = np.mean(incomes[1:])
known_obs = len(incomes) - 1

# assume the attacker legitimately learns the overall mean and the number of people in the data...
overall_mean = np.mean(incomes)
n_obs = len(incomes)

# back out POI's income
poi_income = overall_mean * n_obs - known_obs * known_mean
print('poi_income: {0}'.format(poi_income))

poi_income: 6000.0


The attacker now knows with certainty that the POI has an income of \$6,000.
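
The same differencing step can be checked by hand on a tiny made-up dataset. The function name `difference_attack` and the four toy incomes below are assumptions for illustration only; the arithmetic is the same as in the cell above, written with a sum of the known values instead of `known_obs * known_mean`.

```python
def difference_attack(overall_mean, n_obs, known_values):
    """Recover the single unknown value from an exact mean and all other values."""
    return overall_mean * n_obs - sum(known_values)


toy_incomes = [6000.0, 13000.0, 17810.0, 3600.0]  # made-up data; POI is first
released_mean = sum(toy_incomes) / len(toy_incomes)

# the attacker knows every value except the first one
recovered = difference_attack(released_mean, len(toy_incomes), toy_incomes[1:])
print(recovered)
```

Because the released mean is exact, the recovered value matches the POI's income up to floating-point rounding.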


## Using OpenDP
Let's see what happens if the attacker were made to interact with the data through OpenDP and was given a privacy budget of $\epsilon = 1$.
We will assume that the attacker is reasonably familiar with differential privacy and believes that they should use tighter data bounds than they know are actually in the data in order to get a less noisy estimate.
They will need to update their `known_mean` accordingly.
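
To illustrate why `known_mean` has to be updated, here is a minimal sketch on made-up data, assuming a noise-free clamped mean so that only the effect of clamping is visible. The `clamp` helper and the toy incomes are hypothetical. If the attacker subtracts the unclamped incomes of everyone else from a mean that was computed on clamped data, the difference no longer isolates the POI.

```python
def clamp(values, lower, upper):
    """Clip each value into [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


lower, upper = 0.0, 100_000.0  # bounds chosen by the attacker
toy_incomes = [6000.0, 13000.0, 250_000.0, 3600.0]  # made-up data; POI is first
n_obs = len(toy_incomes)

# what an exact (noise-free) mean over the clamped data would report
clamped_mean = sum(clamp(toy_incomes, lower, upper)) / n_obs

others = toy_incomes[1:]
# mismatched: subtract the raw values of the other records
naive_estimate = clamped_mean * n_obs - sum(others)
# matched: clamp the other records the same way the release did
matched_estimate = clamped_mean * n_obs - sum(clamp(others, lower, upper))
print(naive_estimate, matched_estimate)
```

Only the matched version recovers the POI's income; the naive version is off by exactly the amount removed by clamping the other records.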

In [111]:
from opendp.mod import binary_search_chain
from opendp.trans import make_clamp, make_bounded_sum, make_sized_bounded_mean, make_bounded_resize
from opendp.meas import make_base_geometric, make_base_laplace

enable_features("floating-point")

count_release = 100

income_bounds = (0.0, 100_000.0)
mean_preprocessor = (
    make_clamp(bounds=income_bounds) >>
    make_bounded_resize(size=count_release, bounds=income_bounds, constant=10_000.0) >>
    make_sized_bounded_mean(size=count_release, bounds=income_bounds) >>
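    # NOTE: scale=1.0 is hard-coded, not derived from the epsilon = 1 budget (e.g. via binary_search_chain)
    make_base_laplace(scale=1.0)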
)

print("DP mean:", mean_preprocessor(incomes))
# TODO: it's not clear how make_bounded_resize resizes the data (e.g. does it truncate the last entries of its input?). Knowing the algorithm, it would be possible to compute known_mean accordingly.
print("Known mean:", np.mean(np.clip(incomes, 0, 100_000)))

DP mean: 29805.466099517937
Known mean: 26883.913


We will be using `n_sims` to simulate the process a number of times to get a sense for various possible outcomes for the attacker.
In practice, they would see the result of only one simulation.
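
As a library-free companion to the simulation, the sketch below repeats the attack against a hand-rolled noisy mean using only the standard library. It rests on several assumptions that are not taken from the notebook: the other 99 incomes are synthetic and already within bounds, the attacker knows them exactly, neighbouring datasets differ by changing one record (so the mean's sensitivity is `(upper - lower) / n_obs`), and the Laplace scale is set to that sensitivity divided by `epsilon`. It is not a substitute for OpenDP's vetted mechanisms.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def laplace(scale):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


lower, upper = 0.0, 100_000.0
n_obs = 100
epsilon = 1.0
scale = (upper - lower) / n_obs / epsilon  # assumed calibration, see lead-in

poi_income = 6000.0
others = [rng.uniform(lower, upper) for _ in range(n_obs - 1)]  # synthetic data
true_mean = (poi_income + sum(others)) / n_obs

# each simulation is one noisy release followed by the differencing step
n_sims = 10_000
estimates = [
    (true_mean + laplace(scale)) * n_obs - sum(others) for _ in range(n_sims)
]

avg_estimate = statistics.mean(estimates)
spread = statistics.stdev(estimates)
print(avg_estimate, spread)
```

Under these assumptions the average over many simulations drifts toward the true income, but the spread of individual estimates is larger than the entire income range, and an attacker limited to one release sees only a single draw.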

In [112]:
# known_mean
# TODO: it's not clear how make_bounded_resize resizes the data (e.g. does it truncate the last entries of its input?). Knowing the algorithm, it would be possible to compute known_mean accordingly.
# NOTE: until then, known_mean keeps the unclamped value computed in the earlier cell.
#known_mean = np.mean(np.clip(incomes[1:], 0, 100_000))

# initialize vector to store estimated overall means
n_sims = 10_000
n_queries = 1
poi_income_ests = []
estimated_means = []

# get estimates of overall means
for i in range(n_sims):
    query_means = []
    for j in range(n_queries):
        query_means.append(mean_preprocessor(incomes))

    # get estimates of POI income
    estimated_means.append(np.mean(query_means))
    poi_income_ests.append(estimated_means[i] * n_obs - known_obs * known_mean)


# get mean of estimates
print('Known Mean Income (after truncation): {0}'.format(known_mean))
print('Observed Mean Income: {0}'.format(np.mean(estimated_means)))
print('Estimated POI Income: {0}'.format(np.mean(poi_income_ests)))
print('True POI Income: {0}'.format(person_of_interest))

Known Mean Income (after truncation): 30945.951195119513
Observed Mean Income: 29806.075146931886
Estimated POI Income: -11367814.530681115
True POI Income: 6000.0


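In principle, if the attacker's `known_mean` were computed with the same clamping and dataset size as the release, their estimate would be centered on the POI's income, but any single estimate would rarely (if ever) be exact, and they would have no way of knowing if it were. In the run above, the average estimate (about -11.4 million) is nowhere near the true \$6,000: `known_mean` is still the unclamped mean of the other 9,999 incomes from the earlier cell, while the DP mean is computed on data clamped to `income_bounds` and resized to `count_release = 100` records, so the two quantities do not line up.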
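
A short derivation shows how noise in the released mean propagates into the attacker's estimate. It assumes that the release is the true mean $\bar{x}$ of $n$ clamped records plus Laplace noise $Z$ with scale $b$, and that the attacker knows the clamped mean $\bar{x}_{-1}$ of the other $n-1$ records exactly. The symbols are introduced here for illustration; $x_1$ is the POI's clamped income and $\hat{x}_1$ is the attacker's estimate of it.

$$
\hat{x}_1 = n(\bar{x} + Z) - (n-1)\,\bar{x}_{-1} = x_1 + nZ, \qquad \mathbb{E}[\hat{x}_1] = x_1, \qquad \operatorname{Var}(\hat{x}_1) = 2 n^2 b^2 .
$$

Under these assumptions the estimate is unbiased, but the noise added to the mean is amplified by a factor of $n$ when a single record is backed out.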

Below is a plot showing an empirical distribution of estimates of POI income.

In [114]:
import warnings
import seaborn as sns

# hide warning created by outstanding scipy.stats issue
warnings.simplefilter(action='ignore', category=FutureWarning)

# distribution of POI income
# (distplot is deprecated in recent seaborn releases; histplot is its replacement for histograms)
ax = sns.histplot(poi_income_ests, kde=False, edgecolor='black', linewidth=1)
ax.set(xlabel='Estimated POI income')

ModuleNotFoundError: No module named 'seaborn'