# Simple Attack

In this notebook, we will examine perhaps the simplest possible attack on an individual's private data and what the OpenDP library can do to mitigate it.

## Loading the data

The vetting process is currently underway for the code in the OpenDP Library.
Any constructors that have not been vetted may still be accessed if you opt in to "contrib".

In [13]:
from opendp.mod import enable_features
enable_features('contrib')

We begin with loading up the data.

In [11]:
import os
data_path = os.path.join('.', 'data', 'pums_10000.csv')

with open(data_path) as input_file:
    col_names = input_file.readline().strip().split(',')
    data = input_file.read()

print(col_names)
print('\n'.join(data.split('\n')[:6]))

['sex', 'age', 'educ', 'income', 'married', 'race']
0,45,6,6000,1,1
1,41,8,13000,1,2
0,63,14,17810,1,1
1,71,15,3600,1,4
0,44,5,10000,0,1
1,49,1,0,0,4


The following code parses the data to extract a single vector of all the incomes.
More details on it can be found at XXXX.

In [19]:
from opendp.trans import make_split_dataframe, make_select_column, make_cast, make_impute_constant

income_preprocessor = (
    # Convert data into a dataframe where columns are of type Vec<str>
    make_split_dataframe(separator=",", col_names=col_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="income", TOA=str)
)

# make a transformation that casts from a vector of strings to a vector of ints
cast_str_int = (
    # Cast Vec<str> to Vec<Option<int>>
    make_cast(TIA=str, TOA=int) >>
    # Replace any elements that failed to parse with 0, emitting a Vec<int>
    make_impute_constant(0)
)

# replace the previous preprocessor: extend it with the caster
income_preprocessor = income_preprocessor >> cast_str_int
incomes = income_preprocessor(data)

print(incomes[:7])

[6000, 13000, 17810, 3600, 10000, 0, 30530]
<class 'list'>
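For intuition, the preprocessing chain above can be mimicked in plain Python. This is only a sketch of the same three steps (split, select, cast with a default of 0); the helper name `parse_column` and the sample rows are hypothetical and not part of OpenDP.

```python
def parse_column(csv_text, col_names, key, default=0):
    """Split CSV text into rows, select one column, and cast it to int."""
    index = col_names.index(key)
    values = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        fields = line.split(',')
        try:
            values.append(int(fields[index]))
        except (ValueError, IndexError):
            # mirror the impute step: anything that fails to parse becomes `default`
            values.append(default)
    return values

# Made-up sample with one unparseable income
sample = "0,45,6,6000,1,1\n1,41,8,13000,1,2\n0,63,14,n/a,1,1\n"
cols = ['sex', 'age', 'educ', 'income', 'married', 'race']
sample_incomes = parse_column(sample, cols, 'income')
print(sample_incomes)
```

The unparseable third income is replaced by 0, which is the role `make_impute_constant(0)` plays in the chain.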


## A simple attack

Say there's an attacker whose target is the first person in our data (i.e. the first row of the csv),
and who intends to learn that person's income.

In [None]:
person_of_interest = incomes[0]
print('person of interest:\n\n{0}'.format(person_of_interest))

Now consider the case that the attacker knows everything about the data except for the person of interest's (POI) income, which is considered private.
They can back out the individual's income very easily, just by asking for the overall mean income.

In [20]:
import numpy as np

# assume the attacker legitimately knows the overall mean and the number of people in the data...
overall_mean = np.mean(incomes)
n_obs = len(incomes)

# attacker information: they already know everyone else's income, so they can certainly compute the following
known_mean = np.mean(incomes[1:])
known_obs = n_obs - 1

# back out POI's income
poi_income = overall_mean * n_obs - known_obs * known_mean
print('poi_income: {0}'.format(poi_income))

poi_income: 6000.0


The attacker now knows with certainty that the POI has an income of 6,000.
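
Written out, the attack is a one-line identity. The symbols are introduced here for illustration: $n$ is the number of records, $x_1$ the POI's income, $\bar{x}$ the overall mean, and $\bar{x}_{-1}$ the mean of the other $n - 1$ incomes.

$$
n\bar{x} - (n - 1)\bar{x}_{-1} = \sum_{i=1}^{n} x_i - \sum_{i=2}^{n} x_i = x_1
$$

Because both terms on the left are exact, the difference is exact too.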


## Using OpenDP
Let's see what happens if the attacker were made to interact with the data through OpenDP and was given a privacy budget of $\epsilon = 1$.
We will assume that the attacker is reasonably familiar with differential privacy and chooses data bounds tighter than the range they know the data actually spans, in order to get a less noisy estimate.
They will need to update their `known_mean` accordingly.

We will also assume that the attacker will spend all of their privacy budget on a single query.
This assumption can be changed by changing the `n_queries` variable below.
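
Whether splitting the budget helps can be checked on paper. This is a sketch that assumes Laplace noise and basic composition, so each of $k$ queries gets $\epsilon / k$; the symbols $k$, $\Delta$ (the sensitivity of the mean), $b$, and $Z_j$ (the noise in query $j$) are notation introduced here. A single query uses scale $b = \Delta / \epsilon$ and has noise variance $2b^2$. Each of the $k$ split queries uses scale $kb$, so their average has variance

$$
\mathrm{Var}\left[\frac{1}{k}\sum_{j=1}^{k} Z_j\right] = \frac{1}{k^2} \cdot k \cdot 2(kb)^2 = 2kb^2
$$

Under these assumptions, averaging $k$ smaller queries is $k$ times noisier (in variance) than spending the whole budget on one query.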

We will be using `n_sims` to simulate the process a number of times to get a sense for various possible outcomes for the attacker.
In practice, they would see the result of only one simulation.

In this example, instead of passing a hand-picked scale to the noise mechanism,
let's say we want whatever scale makes the measurement $\epsilon = 1$ differentially private.
Again, we can use a search utility to find such a scale.
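
To make the idea of searching for a scale concrete, here is a minimal sketch in plain Python. It is not OpenDP's implementation: the helper names `epsilon_of_laplace` and `search_scale` are hypothetical, the mechanism is assumed to be Laplace noise, and the bounds and row count are illustrative numbers.

```python
def epsilon_of_laplace(scale, sensitivity):
    """Privacy loss of adding Laplace(scale) noise to a query with the given sensitivity."""
    return sensitivity / scale

def search_scale(sensitivity, epsilon, lo=1e-9, hi=1e9, tol=1e-6):
    """Bisect for (approximately) the smallest scale whose privacy loss is at most epsilon."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if epsilon_of_laplace(mid, sensitivity) <= epsilon:
            hi = mid  # mid is private enough; try less noise
        else:
            lo = mid  # mid is too revealing; more noise is needed
    return hi

# Illustrative numbers: incomes clamped to [0, 1_000_000], mean over 10_000 rows
sensitivity = (1_000_000 - 0) / 10_000
scale = search_scale(sensitivity, epsilon=1.0)
print(round(scale, 3))
```

With these numbers the search settles on a scale of about 100, which is the sensitivity divided by $\epsilon$. The search only needs the privacy loss to decrease as the scale grows, so the same loop works without a closed-form answer.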

In [41]:
from opendp.mod import binary_search_chain
from opendp.trans import make_clamp, make_bounded_sum, make_sized_bounded_mean, make_bounded_resize
from opendp.meas import make_base_geometric


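# In a fixed-size dataset, changing one person's record counts as one removal plus one addition
max_influence = 2
income_bounds = (0, 1000000)
# The attacker knows the number of rows, so the mean is released over the full dataset
count_release = len(incomes)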

# bounded_income_sum = (
#     income_preprocessor >>
#     # Clamp income values
#     make_clamp(bounds=income_bounds) >>
#     # These bounds must be identical to the clamp bounds, otherwise chaining will fail
#     make_bounded_sum(bounds=income_bounds)
# )
#
#
# dp_sum = binary_search_chain(
#     lambda s: bounded_income_sum >> make_base_geometric(scale=s),
#     d_in=max_influence,
#     d_out=1.)
#
# # ...and make our 1-epsilon DP release
# print("DP sum:", dp_sum(data))

from opendp.meas import make_base_laplace

# The mean constructor only accepts floating-point data, so use float bounds and cast the incomes
float_income_bounds = tuple(map(float, income_bounds))

mean_preprocessor = (
    # Cast Vec<int> to Vec<Option<float>>, then replace any failed casts with 0.
    make_cast(TIA=int, TOA=float) >>
    make_impute_constant(0.) >>
    # Clamp income values
    make_clamp(bounds=float_income_bounds) >>
    # Resize the dataset to length `count_release`.
    #     If there are fewer than `count_release` rows in the data, fill with a constant of 10_000.
    #     If there are more than `count_release` rows in the data, only keep `count_release` rows
    make_bounded_resize(size=count_release, bounds=float_income_bounds, constant=10_000.) >>
    # Compute the mean
    make_sized_bounded_mean(size=count_release, bounds=float_income_bounds)
)

# Search for the Laplace noise scale that makes the release 1-epsilon DP
dp_mean = binary_search_chain(
    lambda s: mean_preprocessor >> make_base_laplace(scale=s),
    d_in=max_influence,
    d_out=1.)

print(mean_preprocessor(incomes))


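Now we simulate the attack `n_sims` times, each time backing out the POI's income from a fresh differentially private mean.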

In [None]:
import numpy as np

# update known_mean: the attacker clamps the incomes they know to the bounds used in the release
known_mean = np.mean(np.clip(incomes[1:], *income_bounds))
n_obs = len(incomes)
known_obs = n_obs - 1

# initialize vectors to store estimated overall means and POI incomes
n_sims = 10_000
n_queries = 1
poi_income_ests = []
estimated_means = []

# get estimates of overall means
for i in range(n_sims):
    query_means = []
    for j in range(n_queries):
        # NOTE: dp_mean is assumed to spend the full budget (epsilon = 1) per call;
        # for n_queries > 1 it would need to be rebuilt with epsilon = 1 / n_queries
        query_means.append(dp_mean(incomes))

    # get estimates of POI income
    estimated_means.append(np.mean(query_means))
    poi_income_ests.append(estimated_means[i] * n_obs - known_obs * known_mean)

In [None]:
# get mean of estimates
print('Known Mean Income (after truncation): {0}'.format(known_mean))
print('Observed Mean Income: {0}'.format(np.mean(estimated_means)))
print('Estimated POI Income: {0}'.format(np.mean(poi_income_ests)))
print('True POI Income: {0}'.format(person_of_interest))

We see empirically that, in expectation, the attacker can get a reasonably good estimate of POI's income. However, they will rarely (if ever) get it exactly and would have no way of knowing if they did.
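
The same behaviour can be reproduced without the library. The sketch below is a stand-in, not the OpenDP measurement: it adds plain NumPy Laplace noise to the mean of a made-up dataset, and the dataset, bounds, and seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy data (made up): 1_000 incomes within [0, 100_000]; index 0 plays the POI
lower, upper = 0.0, 100_000.0
toy_incomes = rng.uniform(lower, upper, size=1_000)
n = len(toy_incomes)

epsilon = 1.0
sensitivity = (upper - lower) / n  # replace-one sensitivity of a clamped mean
scale = sensitivity / epsilon      # Laplace scale for an epsilon-DP release

# What the attacker already knows: the sum of everyone else's income
known_sum = toy_incomes[1:].sum()

# Repeat the single noisy query many times to see the spread of the attacker's guess
noisy_means = toy_incomes.mean() + rng.laplace(0.0, scale, size=20_000)
guesses = noisy_means * n - known_sum

print('true POI income:   ', round(toy_incomes[0], 2))
print('mean of guesses:   ', round(guesses.mean(), 2))
print('std dev of guesses:', round(guesses.std(), 2))
```

The guesses centre on the true value, but their standard deviation is about $\sqrt{2}\,(\text{upper} - \text{lower})/\epsilon$, which is wider than the entire income range. Multiplying the noisy mean by `n` scales the noise back up, so any single guess tells the attacker very little.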

Below is a plot showing an empirical distribution of estimates of POI income.

In [None]:
import matplotlib.pyplot as plt

# distribution of POI income estimates
fig, ax = plt.subplots()
ax.hist(poi_income_ests, bins=50, edgecolor='black', linewidth=1)
ax.set(xlabel='Estimated POI income')