# First Look at DP

Differential privacy (DP) is a technique used to release information about a population
in a way that limits the exposure of any one individual's personal information.

In this notebook, we'll conduct a differentially private analysis on a teacher survey (a tabular dataset).

## 1. Why Differential Privacy?

Protecting the privacy of individuals while still sharing information is nontrivial.
For example, if I naively "anonymized" the teacher survey by removing any identifiers 
(like the person's name and social security number), 
it would still be very easy to re-identify individuals via quasi-identifiers.

In [1]:
import pandas as pd
df = pd.read_csv("../data/teacher_survey/teacher_survey.csv", header=None)
df.columns = ['sex',
              'age',
              'maritalStatus',
              'hasChildren',
              'highestEducationLevel',
              'sourceOfStress',
              'smoker',
              'optimism',
              'lifeSatisfaction',
              'selfEsteem']


Say I was curious about my non-binary co-worker, and I knew their age (27).

In [2]:
df.loc[(df['sex'] == 3) & (df['age'] == 27)]


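|      | sex | age | maritalStatus | hasChildren | highestEducationLevel | sourceOfStress | smoker | optimism | lifeSatisfaction | selfEsteem |
|------|-----|-----|---------------|-------------|-----------------------|----------------|--------|----------|------------------|------------|
| 2252 | 3   | 27  | 3             | 2           | 3                     | 8              | 1      | 26       | 29               | 38         |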


In a dataset of 7,000 records, these two quasi-identifiers alone uniquely identified my co-worker, allowing me to see their responses: they are unmarried but living with their partner.
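
To gauge how exposed a dataset is to this kind of lookup, one can count how many records are unique on a given set of quasi-identifiers. The sketch below does this on a tiny made-up list of `(sex, age)` pairs. The records and the helper name `count_unique_records` are illustrative assumptions, not part of the teacher survey.

```python
from collections import Counter

def count_unique_records(records):
    """Return how many records have a combination of values shared by no other record."""
    combo_counts = Counter(records)
    return sum(1 for combo in records if combo_counts[combo] == 1)

# Hypothetical (sex, age) pairs, made up for illustration.
toy_records = [(1, 27), (1, 27), (2, 34), (2, 34), (3, 27), (1, 45)]

n_unique = count_unique_records(toy_records)
print(f"{n_unique} of {len(toy_records)} records are unique on (sex, age)")
```

Every record counted here could be singled out by anyone who knows just those two attributes.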

In other situations, more nuanced techniques can be used to determine whether an individual is present in a dataset (a membership inference attack),
or to reconstruct part or all of the dataset from summary statistics.
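
One simple way summary statistics can leak individual data is a differencing attack: two exact aggregates that differ by one person reveal that person's value. The sketch below uses made-up names and scores, not the survey data.

```python
# Hypothetical life-satisfaction scores, made up for illustration.
scores = {"alice": 31, "bob": 24, "carol": 29, "dan": 18}

# Two exact (non-private) releases: the total with and without one person.
total_everyone = sum(scores.values())
total_without_dan = sum(value for name, value in scores.items() if name != "dan")

# Subtracting the two releases recovers the individual's exact response.
recovered = total_everyone - total_without_dan
print(recovered)
```

Neither total looks sensitive on its own, which is why reasoning about releases one at a time is not enough.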

Giving strong formal privacy guarantees requires the rigorous mathematical grounding that differential privacy provides.


## 2. Unit of Privacy
The first step of a DP analysis is to identify the unit of privacy.
In the teacher survey, each teacher contributes one row to the dataset, so our unit of privacy is a single row.

> Broadly speaking, differential privacy can be applied to any medium of data for which you can define a unit of privacy.
> In other contexts, the unit of privacy may be multiple rows, a user ID, or a node or edge in a graph.

The unit of privacy frames the privacy guarantee provided by $\epsilon$.
An analysis is only $\epsilon$-DP if the methods can conceal any change made to the input dataset as coarse as the unit of privacy.

Since each teacher contributes at most one row, 
we will tune our methods so that the outputs are at most $\epsilon$-distinguishable 
upon the addition or removal of any one row from any input dataset.
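
Stated formally, this is the standard definition of pure differential privacy. The symbols $M$, $x$, $x'$ and $S$ are notation introduced here for illustration. A measurement $M$ is $\epsilon$-DP if, for every pair of datasets $x$ and $x'$ that differ by the addition or removal of one row, and for every set of possible outputs $S$:

$$\Pr[M(x) \in S] \le e^{\epsilon} \cdot \Pr[M(x') \in S]$$

A smaller $\epsilon$ forces the two output distributions closer together, so an observer learns less about whether any one row was present.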


## 3. Public Information
The next step is to identify public information about the dataset. This includes:

* information that is invariant across all potential input datasets (may include column names and per-column categories)
* information that is publicly available from other sources
* information from other DP releases

For convenience, I've collected metadata from the codebook [into a JSON file](../data/teacher_survey/public_metadata.json).

In [3]:
import json
# use a context manager so the file handle is closed after loading
with open("../data/teacher_survey/public_metadata.json") as f:
    metadata = json.load(f)
metadata["column_names"]

['sex',
 'age',
 'maritalStatus',
 'hasChildren',
 'highestEducationLevel',
 'sourceOfStress',
 'smoker',
 'optimism',
 'lifeSatisfaction',
 'selfEsteem']

In this case (and in most cases), we consider column names public because they weren't picked in response to the data: they were "fixed" before the data was collected.

By similar reasoning, the unique keys in the "maritalStatus" column are considered part of the public metadata because they were chosen independently from survey responses.

In [4]:
metadata["column_metadata"]["maritalStatus"]

{'keys': ['1', '2', '3', '4', '5', '6', '7', '8'],
 'key_labels': ['Single',
  'Steady Relationship',
  'Living with partner',
  'Married first time',
  'Remarried',
  'Separated',
  'Divorced',
  'Widowed']}

Both of these examples involve data invariants.
A data invariant is information about your dataset that you are explicitly choosing not to protect,
on the basis that it does not contain sensitive information. 
Be careful: if an invariant does in fact contain sensitive information,
you expose individuals in the dataset to unbounded privacy loss.

Having this public metadata will significantly improve the utility of our analysis.
Now we can move on to making differentially private releases.

## 4. Construct Measurement

A measurement is a randomized function that takes a dataset and returns a differentially private release.
The OpenDP Library provides building blocks that can be functionally composed (called "chaining").

In OpenDP, building blocks that have not yet completed the vetting process are kept behind the "contrib" flag.
We enable this flag here:

In [5]:
from opendp.mod import enable_features
enable_features("contrib")

Building blocks that do not yet privatize the output are called "transformations" instead of "measurements".
The following transformation parses CSV data and selects the `maritalStatus` column:

In [6]:
from opendp.transformations import make_split_dataframe, make_select_column
# the `>>` operator denotes functional composition
marital_status_trans = (
    make_split_dataframe(separator=",", col_names=metadata["column_names"]) >>
    make_select_column("maritalStatus", str)
)
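
To make the idea of chaining concrete, here is a minimal pure-Python sketch of how a `>>` operator can compose functions. The `Step` class and the shortened column list are hypothetical and are not how OpenDP implements it. OpenDP's transformations also carry domains, metrics and stability maps, which this sketch omits.

```python
class Step:
    """A hypothetical stand-in for a building block: wraps a plain function."""

    def __init__(self, function):
        self.function = function

    def __call__(self, data):
        return self.function(data)

    def __rshift__(self, other):
        # `self >> other` feeds the output of `self` into `other`.
        return Step(lambda data: other(self(data)))

column_names = ["sex", "age", "maritalStatus"]  # shortened for illustration

# Split each CSV line into a dict keyed by column name.
split_rows = Step(
    lambda text: [dict(zip(column_names, line.split(","))) for line in text.splitlines()]
)
# Keep a single column, leaving the values as strings.
select_marital = Step(lambda rows: [row["maritalStatus"] for row in rows])

pipeline = split_rows >> select_marital
print(pipeline("3,27,3\n1,40,4"))
```

The composed `pipeline` is itself a `Step`, so it can be chained further in the same way.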

We chain this preprocessing transformation with a count transformation, and then privatize the count with a measurement.
All differentially private measurements involve sampling from a carefully-calibrated probability distribution that is concentrated around the quantity of interest.
The privacy guarantees are based on how noisy this distribution is.

In this case we add noise from a discretized version of the Laplace distribution.

In [7]:
from opendp.transformations import make_count
from opendp.measurements import make_base_discrete_laplace

count_meas = marital_status_trans >> make_count(str) >> make_base_discrete_laplace(scale=2.)
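
For intuition about what discretized Laplace noise looks like, the sketch below samples it as the difference of two geometric random variables. This is an illustration only, and the function names are mine. It relies on floating-point arithmetic, which is known to be a source of privacy vulnerabilities, so it should not be used in place of the library's own sampler.

```python
import math
import random

def sample_discrete_laplace(scale, rng=random):
    """Draw an integer k with probability proportional to exp(-|k| / scale)."""
    def geometric():
        # Number of failures before the first success, with failure probability exp(-1 / scale).
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        return int(math.floor(-scale * math.log(u)))
    # The difference of two i.i.d. geometric draws follows a discrete Laplace distribution.
    return geometric() - geometric()

def discrete_laplace_pmf(k, scale):
    """Probability that the noise equals the integer k."""
    q = math.exp(-1.0 / scale)
    return (1.0 - q) / (1.0 + q) * q ** abs(k)

rng = random.Random(0)
true_count = 7000  # hypothetical exact count, not the survey's
noisy_count = true_count + sample_discrete_laplace(2.0, rng)
print(noisy_count)
```

The noise is symmetric around zero and always an integer, so a noisy count remains a whole number.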

Each invocation of a measurement incurs some privacy loss, and we can use the privacy map to tell us how much.
Recalling the "Unit of Privacy" section, we know that each teacher contributed at most one row to the survey.


In [8]:
max_contributions = 1
count_meas.map(d_in=max_contributions)

0.5

The privacy map tells us that passing data through this measurement will incur a privacy spend of at most $\epsilon = 0.5$ for any individual in the dataset.
$\epsilon$ is a commonly used proxy to quantify the worst-case risk to any individual.
It is customary to refer to a data release with such bounded risk as $\epsilon$-DP.

In this case, the privacy spend is based on two factors: the noise scale, and the max contributions. 
In other cases, this number may also vary based on other parameters used to make any of the other transformations in the chain.
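
For this particular chain, the relationship between the two factors is simple enough to sketch by hand: the privacy spend is the sensitivity of the count divided by the noise scale. The helper `count_epsilon` below is a hypothetical illustration of that arithmetic, assuming each contributed row changes the count by at most one. It does not replace the privacy map.

```python
def count_epsilon(max_contributions, scale):
    """Epsilon for a Laplace-noised count, assuming each row shifts the count by at most 1."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    sensitivity = max_contributions  # one row changes a count by at most 1
    return sensitivity / scale

print(count_epsilon(max_contributions=1, scale=2.0))  # 0.5
print(count_epsilon(max_contributions=1, scale=4.0))  # 0.25
```

Doubling the noise scale halves the privacy spend, at the cost of a noisier release.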

## 5. Differentially Private Release

Notice that all the work we've done thus far has never involved looking at the sensitive dataset we're analyzing.
In order to obtain accurate privacy guarantees, the OpenDP Library should mediate all access to the sensitive dataset.

We now invoke the measurement on our dataset, and consume $\epsilon = 0.5$.

In [9]:
# read the raw CSV text; the context manager closes the file afterwards
with open("../data/teacher_survey/teacher_survey.csv") as f:
    csv_data = f.read()
count_meas(csv_data)

7004

The result is a random draw from the discrete Laplace distribution, centered at the true number of records in the underlying dataset.
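
To get a feel for how far such a draw can land from the true count, the sketch below computes the tail probability of discrete Laplace noise and searches for an error bound that holds with a chosen confidence. The function names are mine. It assumes the noise has probability proportional to $e^{-|k|/\text{scale}}$ at each integer $k$.

```python
import math

def tail_probability(a, scale):
    """P(|noise| > a) for an integer a >= 0 under discrete Laplace noise."""
    q = math.exp(-1.0 / scale)
    return 2.0 * q ** (a + 1) / (1.0 + q)

def error_bound(scale, alpha):
    """Smallest integer a >= 0 such that P(|noise| > a) <= alpha."""
    a = 0
    while tail_probability(a, scale) > alpha:
        a += 1
    return a

bound = error_bound(scale=2.0, alpha=0.05)
print(f"With probability at least 95%, the release is within {bound} of the true count")
```

The bound depends only on the noise scale, not on the data, so it can be computed before the release is made.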

## Grouped Counts and Composition
You can also make multiple releases on the same dataset.
In this example, we construct a grouped count query, and add vector-valued noise to the vector of counts:

In [10]:
from opendp.transformations import make_count_by_categories

histogram_meas = (
    marital_status_trans >>
    make_count_by_categories(categories=metadata["column_metadata"]["maritalStatus"]["keys"]) >>
    make_base_discrete_laplace(scale=2., D="VectorDomain<AllDomain<int>>")
)
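
The role of the public category list can be sketched in plain Python: counts are produced for a fixed, data-independent set of keys, so the shape of the output reveals nothing about which values occur in the data. The helper below is a hypothetical illustration. Tallying unknown values in an extra trailing bucket is a choice made for this sketch, and the library's transformation may handle them differently.

```python
def count_by_categories(values, categories):
    """Count occurrences of each public category, in a fixed order.

    Values outside `categories` are tallied in one extra trailing bucket
    (a choice made for this sketch).
    """
    index = {category: i for i, category in enumerate(categories)}
    counts = [0] * (len(categories) + 1)
    for value in values:
        counts[index.get(value, len(categories))] += 1
    return counts

public_keys = ["1", "2", "3", "4"]           # shortened key list for illustration
toy_column = ["1", "3", "3", "4", "9", "1"]  # hypothetical responses; "9" is not a known key

print(count_by_categories(toy_column, public_keys))
```

Because adding or removing one row changes exactly one bucket by one, the vector of counts is no more sensitive than a single count.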

Just like before, the measurement has a privacy map from which we can determine the epsilon spend of invoking this measurement:

In [11]:
histogram_meas.map(1)

0.5

The total epsilon spend is the sum of the epsilon spends of each measurement we make on the sensitive dataset (here, 0.5 + 0.5 = 1.0).
Recommendations for the total epsilon expenditure vary depending on the use case 
and the level of risk posed to individuals should an adversary be able to identify an individual in the sensitive dataset.
However, it is generally considered good practice to choose your noise scale and other parameters so that the overall privacy spend is around 1.0.


In [12]:
# make the 0.5-DP release
dp_counts = histogram_meas(csv_data)

# pair the noisy counts with their labels:
dict(zip(metadata["column_metadata"]["maritalStatus"]["key_labels"], dp_counts))

{'Single': 1680,
 'Steady Relationship': 589,
 'Living with partner': 587,
 'Married first time': 3016,
 'Remarried': 473,
 'Separated': 158,
 'Divorced': 385,
 'Widowed': 111}

With this release, the overall privacy expenditure becomes $\epsilon = 1$.
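
The bookkeeping behind this total can be sketched as a small budget tracker that adds up the epsilon of each release and refuses any release that would exceed a chosen limit. The `BudgetTracker` class is a hypothetical helper written for this illustration, not part of the library.

```python
class BudgetTracker:
    """Hypothetical helper that sums epsilon over releases (basic sequential composition)."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Record a release and return the remaining budget."""
        if self.spent + epsilon > self.total_epsilon:
            raise ValueError("release would exceed the privacy budget")
        self.spent += epsilon
        return self.total_epsilon - self.spent

tracker = BudgetTracker(total_epsilon=1.0)
tracker.spend(0.5)              # the count release
remaining = tracker.spend(0.5)  # the histogram release
print(tracker.spent, remaining)
```

Once the budget is exhausted, any further query on the same dataset raises an error instead of silently increasing the privacy loss.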