# Single Value Privacy

This notebook explores how we can implement differential privacy when the query response is a single numeric value. We will try to protect the genders of the members of the dataset. A lot of the concepts implemented here come from [this article](https://research.neustar.biz/2014/09/08/differential-privacy-the-basics/).

In [8]:
import psycopg2 as pg
import pandas as pd

# Database setup
host = "localhost"
database = "cdm"
user = "postgres"
password = %env PGPASSWORD
connection_string = "host={} dbname={} user={} password={}".format(host, database, user, password)

db = pg.connect(connection_string)

In [19]:
# The subset of SynPUF data we use only has these two concepts for gender
female_concept_id = 8532
male_concept_id = 8507

gender_count_query = "SELECT COUNT(*) FROM person WHERE gender_concept_id = {};".format(female_concept_id)

pd.read_sql(gender_count_query, con=db)

Unnamed: 0,count
0,64347


So we have exactly **64347** women in our dataset, and this is the number we need to protect: someone could run the same query on a copy of the dataset with one individual removed, compare the two counts, and thus determine that individual's gender (as shown in the [data exploration notebook](./data-exploration.ipynb)). To start, we need to understand what the _sensitivity_ of the query is. Determining this requires knowing your schema and your data, but the general formula is:

$$
\Delta f = \max_{D, D'} \|f(D) - f(D')\|_{1}
$$

where $\|x\|_{1}$ is the [L1 norm](https://en.wikipedia.org/wiki/Taxicab_geometry) of $x$. Intuitively, this is just the maximum difference in the values that a query $f$ can return on a pair of databases ($D$ and $D'$) that differ in only one row. In our case this means one person is added or removed from the database.
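As a minimal sketch of this definition, the snippet below removes each row in turn from a tiny made-up database and records how much a counting query changes. The rows and the helper names (`count_matching`, `empirical_sensitivity`) are hypothetical and only serve as illustration. Note that this checks the neighbours of one particular database, whereas the true sensitivity is the maximum over all possible databases.

```python
def count_matching(rows, predicate):
    """Counting query f(D): number of rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row))


def is_female(row):
    # 8532 is the female concept id used in this notebook
    return row["gender_concept_id"] == 8532


def empirical_sensitivity(rows, predicate):
    """Largest change in the count when any single row is removed."""
    full = count_matching(rows, predicate)
    diffs = []
    for i in range(len(rows)):
        neighbour = rows[:i] + rows[i + 1:]  # D' differs from D by one row
        diffs.append(abs(full - count_matching(neighbour, predicate)))
    return max(diffs)


# Hypothetical toy database, not taken from the SynPUF data
toy_db = [
    {"person_id": 1, "gender_concept_id": 8532},
    {"person_id": 2, "gender_concept_id": 8507},
    {"person_id": 3, "gender_concept_id": 8532},
]

sensitivity = empirical_sensitivity(toy_db, is_female)
print(sensitivity)
```

Removing a matching row changes the count by one and removing any other row changes it by zero, so a counting query never moves by more than one.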

The gender count query only returns one value (the count), and this value can only change by a maximum of 1 if we remove or add a person to the database. Therefore, the sensitivity $\Delta f$ of this query is 1. [Differential privacy theory](https://cacm.acm.org/magazines/2011/1/103226-a-firm-foundation-for-private-data-analysis/fulltext) tells us that by adding zero-centered noise drawn from the $\mathrm{Laplace}\left(\Delta f/\epsilon\right)$ distribution (that is, with scale $\Delta f/\epsilon$) to the query result, we are guaranteed $\epsilon$-differential privacy.
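For readers who want to see where this guarantee comes from, here is a short sketch of the standard argument for a single-valued query. Write $b = \Delta f/\epsilon$ for the scale and let $p(y \mid D)$ denote the density of the noisy answer $y$ on database $D$ (this notation is introduced here for illustration only). For neighbouring databases $D$ and $D'$:

$$
\frac{p(y \mid D)}{p(y \mid D')} = \frac{\exp\left(-|y - f(D)|/b\right)}{\exp\left(-|y - f(D')|/b\right)} = \exp\left(\frac{|y - f(D')| - |y - f(D)|}{b}\right) \le \exp\left(\frac{|f(D) - f(D')|}{b}\right) \le \exp\left(\frac{\Delta f}{b}\right) = e^{\epsilon}
$$

The first inequality is the triangle inequality and the second uses the definition of $\Delta f$. So no output becomes more than a factor of $e^{\epsilon}$ more likely because one person was added or removed.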

There is one more thing to consider, which is that running the query multiple times will always reveal a bit more about the distribution (or value, if only a single number is returned) of the underlying data. The privacy loss accumulates across runs, so we need to think about $\epsilon$ as $\epsilon_{total}$ instead. If we run the query $q$ times, spending $\epsilon_i$ on the $i$-th run, we get:

$$
\epsilon_{total} = \sum^{q}_{i=1}{\epsilon_i}
$$

We usually refer to $\epsilon_{total}$ as the _privacy budget_, and each time we run a query, we are using up $\epsilon_i$ of the budget.
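The bookkeeping this implies can be sketched with a small helper that adds up the $\epsilon_i$ spent so far and refuses further queries once $\epsilon_{total}$ is used up. The `PrivacyBudget` class is a hypothetical illustration, not part of this notebook or of any library.

```python
class PrivacyBudget:
    """Hypothetical tracker for a privacy budget under sequential composition."""

    def __init__(self, total):
        self.total = total
        self.spent = 0.0

    def spend(self, epsilon):
        """Record a query that uses `epsilon` and return the remaining budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating point round-off does not reject the last query
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total - self.spent


tracker = PrivacyBudget(total=1.0)
remaining = [tracker.spend(0.25) for _ in range(4)]
print(remaining)
```

After four queries at $\epsilon_i = 0.25$ the budget of $1.0$ is exhausted, and a fifth call to `spend` raises an error instead of answering.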

Below, we consider the case of a single run of the gender count query. Since we know that the _sensitivity_ of this query is $1$, we can use [NumPy](https://docs.scipy.org/doc/numpy/index.html) to generate random noise based on the value of $\epsilon$ (controlled by the slider).

In [131]:
from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import numpy as np

def run(button):
    result = single_value_differential_privacy(gender_count_query, slider.value)
    box.children = [widgets.Label("Result: "), widgets.Label(value=str(result))]

button = widgets.Button(description="Run Query")
button.on_click(run)

box = widgets.Box()

def single_value_differential_privacy(query=gender_count_query, epsilon=1):
    
    # Run the query
    results = pd.read_sql(query, con=db)
    count = results['count'][0]
    
    # Apply Laplacian randomness with scale $\lambda = \frac{1}{\epsilon}$ (the sensitivity is 1)
    # see https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.laplace.html
    noise = np.random.laplace(scale=1.0 / epsilon)
    
    # Round the number since having a floating point count doesn't really make sense
    return np.round(count + noise)

slider = widgets.FloatSlider(min=0.001, max=10, value=1, step=0.001, description='Epsilon')

display(slider)
display(button)

box

The above value is the result of a single query with $\epsilon_{total} = \epsilon = 1$. Notice that reducing the value of $\epsilon$ increases the noise, while making it too large increases the likelihood that the true value (64347) is returned.
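To put rough numbers on this trade-off: after rounding, the true count comes back whenever the noise lies within $\pm 0.5$, which for Laplace noise with scale $1/\epsilon$ happens with probability $1 - e^{-\epsilon/2}$. The sketch below compares that formula with a simulation. It uses only the standard library (rather than NumPy and the database) so that it runs on its own, and the helper names are hypothetical.

```python
import math
import random


def prob_true_value(epsilon, sensitivity=1.0):
    """P(round(count + noise) == count) for Laplace noise with scale sensitivity/epsilon."""
    return 1.0 - math.exp(-0.5 * epsilon / sensitivity)


def laplace_sample(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


rng = random.Random(0)
trials = 100000
simulated = {}
for epsilon in (0.1, 1.0, 10.0):
    hits = sum(1 for _ in range(trials)
               if round(laplace_sample(rng, 1.0 / epsilon)) == 0)
    simulated[epsilon] = hits / trials
    print("epsilon={}: formula={:.4f}, simulated={:.4f}".format(
        epsilon, prob_true_value(epsilon), simulated[epsilon]))
```

With $\epsilon = 0.1$ the exact count is returned only about 5% of the time, while with $\epsilon = 10$ it is returned in over 99% of runs, which offers essentially no protection.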

## Privacy Budget

In practice, we need to be able to protect database members from an adversary when the query can be run multiple times. We will only consider the case when the exact same query is run at most $n$ times, and will not consider the problem of similar queries being used to extract the same data in a slightly different form.

We define values for $\epsilon_{total}$ and $n$, spend $\epsilon_i = \frac{\epsilon_{total}}{n}$ of the budget on each run, and display a histogram of the returned values.

In [207]:
from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import numpy as np
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import plotly.graph_objs as go

# This is to use plotly offline
init_notebook_mode(connected=True)

iframe = None

# Number of query runs (n)
n = widgets.BoundedIntText(value=10, min=1, max=1000, description='n:')

# Privacy budget (\epsilon_{total})
budget = widgets.BoundedFloatText(value=10, min=1, description=r'$\epsilon_{total}$:')

# Fancy progress bar 🎩
progress = widgets.FloatProgress(min=0,max=100, step=1, description='Progress:')


def run(button):
    results = []
    epsilon_i = budget.value / n.value
    
    for i in range(0, n.value):
        results.append(single_value_differential_privacy(epsilon=epsilon_i))
        progress.value = (i + 1) / n.value * 100
        
    data = [go.Histogram(x=results, name="n = {}, budget = {}".format(n.value, budget.value))]
    layout = go.Layout(
        title='Gender Count Attack Results', 
        xaxis={'title':'Female Count', 'tickangle': 300, 'exponentformat': 'none'}, 
        yaxis={'title':'Occurrences'},
        showlegend=True,
        bargap=0.2)
    
    iplot({"data": data, "layout": layout})        

button = widgets.Button(description="Run Attack")
button.on_click(run)

display(n)
display(budget)
display(progress)
display(button)
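One way to reason about what this attack can achieve is to look at the average of the $n$ noisy answers. Laplace noise with scale $b$ has variance $2b^2$, so the standard deviation of the mean of $n$ independent answers is $\sqrt{2b^2/n}$. The sketch below evaluates this for two cases: the budget split evenly across runs ($b = n/\epsilon_{total}$), and, for contrast, every run using the full $\epsilon$ with no budget enforced ($b = 1/\epsilon$). It ignores the rounding step, and the function names are hypothetical.

```python
import math


def std_of_mean_split(epsilon_total, n, sensitivity=1.0):
    """Std. dev. of the averaged answer when each run gets epsilon_total / n."""
    scale = sensitivity * n / epsilon_total
    return math.sqrt(2.0 * scale ** 2 / n)


def std_of_mean_unsplit(epsilon_each, n, sensitivity=1.0):
    """Std. dev. of the averaged answer when every run uses the full epsilon."""
    scale = sensitivity / epsilon_each
    return math.sqrt(2.0 * scale ** 2 / n)


for runs in (1, 10, 100, 1000):
    print("n={:4d}  split={:.3f}  unsplit={:.4f}".format(
        runs, std_of_mean_split(10.0, runs), std_of_mean_unsplit(10.0, runs)))
```

Under these assumptions, averaging only helps the attacker when the budget is not enforced: with an even split the error of the averaged answer grows like $\sqrt{2n}/\epsilon_{total}$, so repeating the query more often does not sharpen the estimate.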