# Experiment One

In this experiment, we first demonstrate how a simple aggregate query can be exploited to gain information about a single individual.

In [4]:
import psycopg2 as pg
import pandas as pd

# Database setup
host = "localhost"
database = "cdm"
user = "postgres"
password = %env PGPASSWORD
connection_string = "host={} dbname={} user={} password={}".format(host, database, user, password)

db = pg.connect(connection_string)

## Ground Truth

Here we see the true value of the HIV count, which is the number of `condition_occurrence` records carrying the HIV concept id.

In [5]:
# 4241530 is the concept id for HIV

simple_query = """
SELECT COUNT(*) FROM condition_occurrence WHERE condition_concept_id = '4241530';
"""

pd.read_sql(simple_query, con=db)

Unnamed: 0,count
0,3280


## Attack

If an adversary knows the id of a specific patient but is limited to running aggregate queries, they can still determine that patient's HIV status as follows:

In [6]:
simple_query_attack = """
SELECT COUNT(*) FROM condition_occurrence WHERE condition_concept_id = '4241530'
    AND person_id != 68;
"""

pd.read_sql(simple_query_attack, con=db)

Unnamed: 0,count
0,3278


Since the count drops from 3280 to 3278 when this patient is excluded, the attacker can infer that the patient with `person_id = 68` has HIV records, and therefore their HIV status.
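
The differencing step can be reproduced without a database. The sketch below is illustrative only: the `records` list and the helpers `count_condition` and `differencing_attack` are hypothetical stand-ins for the `condition_occurrence` table and the two `COUNT(*)` queries, and the rows are made-up toy data.

```python
# Toy stand-in for condition_occurrence: (person_id, condition_concept_id).
# The rows are invented for illustration; only the HIV concept id comes
# from the queries above.
HIV_CONCEPT_ID = 4241530
OTHER_CONCEPT_ID = 1  # arbitrary placeholder for some unrelated condition

records = [
    (12, HIV_CONCEPT_ID),
    (68, HIV_CONCEPT_ID),
    (68, HIV_CONCEPT_ID),  # one person can have several occurrence rows
    (68, OTHER_CONCEPT_ID),
    (91, OTHER_CONCEPT_ID),
]


def count_condition(rows, concept_id, exclude_person_id=None):
    """Mimic SELECT COUNT(*) ... WHERE concept matches [AND person_id != ...]."""
    return sum(
        1
        for person_id, cid in rows
        if cid == concept_id and person_id != exclude_person_id
    )


def differencing_attack(rows, concept_id, target_person_id):
    """Return how many matching rows belong to the target, using only counts."""
    everyone = count_condition(rows, concept_id)
    without_target = count_condition(
        rows, concept_id, exclude_person_id=target_person_id
    )
    return everyone - without_target


print(differencing_attack(records, HIV_CONCEPT_ID, 68))  # 2: patient 68 has the condition
print(differencing_attack(records, HIV_CONCEPT_ID, 91))  # 0: patient 91 does not
```

Any non-zero difference reveals that the target has the condition, even though neither query returns row-level data.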

## Privacy Mechanism

To obfuscate the true value and protect against the above attack, we apply a privacy mechanism. In this case we add noise drawn from the Laplace distribution with scale $\Delta f/\epsilon$, where $\Delta f$ is the _sensitivity_ of the query, i.e. the largest amount by which adding or removing one record can change the result. For a count that returns a single value, $\Delta f = 1$.
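
As a rough, database-free sketch of this mechanism, the function below adds Laplace noise with scale $\Delta f/\epsilon$ to a count. The name `laplace_mechanism` and the seeded generator are assumptions for illustration. To stay within the standard library, it samples the noise as the difference of two exponential draws, which follows the same distribution as `np.random.laplace(scale=...)`.

```python
import random


def laplace_mechanism(true_value, epsilon, sensitivity=1.0, rng=random):
    """Return true_value plus Laplace(0, sensitivity / epsilon) noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for epsilon in (1.0, 0.1, 0.01):
    noisy = laplace_mechanism(3280, epsilon, rng=rng)
    print(f"epsilon={epsilon}: noisy count = {round(noisy)}")
```

Smaller values of epsilon give a larger noise scale, so the reported count strays further from 3280 on average.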

In [7]:
from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import numpy as np

def single_value_differential_privacy(query=simple_query, epsilon=1):
    
    # Run the query
    results = pd.read_sql(query, con=db)
    count = results['count'][0]
    
    # Add Laplace noise with scale $\lambda = \frac{\Delta f}{\epsilon} = \frac{1}{\epsilon}$
    # see https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.laplace.html
    noise = np.random.laplace(scale=1.0/epsilon)
    
    # Round the number since having a floating point count doesn't really make sense
    return np.round(count + noise)

def run(button):
    result = single_value_differential_privacy(simple_query, slider.value)
    box.children = [widgets.Label("Result: "), widgets.Label(value=str(result))]

button = widgets.Button(description="Run Query")
button.on_click(run)

box = widgets.Box()

slider = widgets.FloatSlider(min=0.001, max=1, value=1, step=0.001, description='Epsilon')

display(slider)
display(button)

box

## Repeat Query Attack

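Even with this privacy mechanism applied, it is still possible to reconstruct the original value by running the query multiple times: the Laplace noise has zero mean, so the average of many noisy answers approaches the true count unless the total privacy budget spent across all queries is capped.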
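
The averaging idea can be simulated against a known count instead of the database. In the sketch below, `noisy_count` and `averaging_attack` are hypothetical helpers, and it assumes every query is answered with the same per-query epsilon (that is, no overall budget is enforced).

```python
import random
import statistics

TRUE_COUNT = 3280  # the ground-truth value from the query above


def noisy_count(true_count, epsilon, rng):
    """One Laplace-noised answer with sensitivity 1 (scale = 1 / epsilon)."""
    scale = 1.0 / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


def averaging_attack(n_queries, epsilon, rng):
    """Average n_queries independent noisy answers to estimate the true count."""
    return statistics.fmean(
        noisy_count(TRUE_COUNT, epsilon, rng) for _ in range(n_queries)
    )


rng = random.Random(42)  # fixed seed so the sketch is repeatable
for n_queries in (1, 100, 10_000):
    estimate = averaging_attack(n_queries, epsilon=0.1, rng=rng)
    error = abs(estimate - TRUE_COUNT)
    print(f"n={n_queries:>6}: estimate = {estimate:.1f}, error = {error:.1f}")
```

Under these assumptions the error of the averaged estimate shrinks roughly in proportion to $1/\sqrt{n}$.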

In [8]:
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)

def run(button):
    results = []
    epsilon_i = budget.value / n.value
    
    for i in range(0, n.value):
        results.append(single_value_differential_privacy(query=simple_query, epsilon=epsilon_i))
        progress.value = (i + 1) / n.value * 100
        
    data = [go.Histogram(x=results)]
    layout = go.Layout(
        title='HIV Attack Results (n = {}, budget = {})'.format(n.value, budget.value), 
        xaxis={'title':'Patient Count', 'tickangle': 300, 'exponentformat': 'none'}, 
        yaxis={'title':'Occurrences'},
        bargap=0.1)
    
    iplot({"data": data, "layout": layout})  

# Number of query runs (n)
n = widgets.BoundedIntText(value=1000, min=1, max=10000, description='n:')

# Privacy budget (\epsilon_{total})
budget = widgets.BoundedFloatText(value=0.1, min=0.001, description=r'$\epsilon_{total}$:')

# Fancy progress bar 🎩
progress = widgets.FloatProgress(min=0,max=100, step=1, description='Progress:')

button = widgets.Button(description="Run Attack")
button.on_click(run)

display(n)
display(budget)
display(progress)
display(button)
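
The cell above divides the total budget evenly, so each of the `n` queries runs with $\epsilon_i = \epsilon_{total}/n$. A quick calculation, assuming independent Laplace noise with sensitivity 1, suggests how this affects an attacker who averages the answers: each answer has variance $2/\epsilon_i^2$, so the mean of `n` answers has standard deviation $\sqrt{2n}/\epsilon_{total}$. The helper names below are hypothetical.

```python
import math


def std_of_mean_fixed_epsilon(n, epsilon):
    """Std. dev. of the average when every query uses the same epsilon."""
    return math.sqrt(2.0 / n) / epsilon


def std_of_mean_split_budget(n, total_budget):
    """Std. dev. of the average when each query uses total_budget / n."""
    epsilon_i = total_budget / n
    return math.sqrt(2.0 / n) / epsilon_i  # equals sqrt(2 * n) / total_budget


for n in (1, 100, 10_000):
    fixed = std_of_mean_fixed_epsilon(n, epsilon=0.1)
    split = std_of_mean_split_budget(n, total_budget=0.1)
    print(f"n={n:>6}: fixed epsilon -> {fixed:8.2f}, split budget -> {split:10.2f}")
```

With a fixed per-query epsilon the spread of the averaged estimate shrinks as `n` grows, whereas with an evenly split budget it grows, which should show up as a wider histogram for larger `n`.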