# SQL Queries with Differential Privacy

In [8]:
import pandas as pd
from opendp.whitenoise.sql import PandasReader, PrivateReader
from opendp.whitenoise.metadata import CollectionMetadata

pums = pd.read_csv('readers/PUMS.csv')
meta = CollectionMetadata.from_file('readers/PUMS.yaml')

query = 'SELECT married, COUNT(*) AS n FROM PUMS.PUMS GROUP BY married'

reader = PandasReader(meta, pums)
private_reader = PrivateReader(meta, reader)

result = private_reader.execute_typed(query)
print(result)

 married  |n      
 ---------|-------
  False   | 434   
  True    | 552   


In the example above, we query the PUMS microdata for the count of individuals by marital status.  If you run the cell repeatedly, you will see that the answer changes slightly between runs, because differentially private results include random noise.
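To illustrate why repeated runs differ, here is a minimal sketch of one common approach, the Laplace mechanism, applied to the exact counts shown later in this notebook. This is not necessarily what `PrivateReader` does internally; the helper names `laplace_noise` and `noisy_count` and the epsilon value of 1.0 are assumptions made for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random()           # unseeded, so every run differs
exact = {0: 451, 1: 549}        # exact counts by `married`
for run in range(3):
    print({k: round(noisy_count(v, 1.0, rng)) for k, v in exact.items()})
```

Each printed dictionary should land near the exact counts without matching them consistently, which mirrors the behavior of the private query above.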

The `PrivateReader` class works by wrapping any SQL data source that returns typed tuples.  In this sample, we wrap a `PandasReader`, which runs SQL queries against Pandas dataframes:

In [9]:
reader = PandasReader(meta, pums)
result = reader.execute_typed(query)
print(result)

 married|n      
 -------|-------
  0     | 451   
  1     | 549   


Calling the underlying `Reader` directly will give the exact result.  The `Reader` implementations do not know anything about differential privacy, and simply return SQL query results.  In addition to the `PandasReader`, we provide built-in `SqlServerReader`, `PostgresReader`, `SparkReader`, and `PrestoReader`.  The `Reader` interface is extensible, so developers can wrap existing DB-API drivers to provide access to other popular database engines.
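As a rough sketch of what wrapping a DB-API driver might look like, the class below runs queries through Python's built-in `sqlite3` module. `SqliteReaderSketch` is hypothetical and is not part of the library; it does not subclass the library's actual `Reader` base class, and the return shape (a header row followed by data tuples) is an assumption.

```python
import sqlite3

class SqliteReaderSketch:
    """Hypothetical reader over a DB-API connection (illustration only)."""

    def __init__(self, conn):
        self.conn = conn

    def execute_typed(self, query):
        cur = self.conn.cursor()
        cur.execute(query)
        # DB-API cursors expose column names via cursor.description.
        names = tuple(col[0] for col in cur.description)
        # Assumed shape: header row first, then one tuple per result row.
        return [names] + [tuple(row) for row in cur.fetchall()]

# Tiny in-memory table with made-up rows, just to exercise the reader.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pums (married INTEGER, age INTEGER)')
conn.executemany('INSERT INTO pums VALUES (?, ?)', [(0, 30), (1, 45), (1, 52)])

reader = SqliteReaderSketch(conn)
rows = reader.execute_typed(
    'SELECT married, COUNT(*) AS n FROM pums GROUP BY married ORDER BY married'
)
print(rows)  # [('married', 'n'), (0, 1), (1, 2)]
```

Like the built-in readers, this sketch returns exact results and knows nothing about differential privacy.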

The `PrivateReader` exposes the same interface as any other reader, allowing you to swap in differentially private results wherever exact results are currently used.  The `PrivateReader` accepts some additional parameters to control the privacy/accuracy tradeoff:

In [12]:
private_reader = PrivateReader(meta, reader, 4.0)  # large epsilon, less privacy
result = private_reader.execute_typed(query)
print(result)
print()

private_reader = PrivateReader(meta, reader, 0.1)  # smaller epsilon, more privacy
result = private_reader.execute_typed(query)
print(result)


 married  |n      
 ---------|-------
  False   | 451   
  True    | 548   

 married  |n      
 ---------|-------
  False   | 409   
  True    | 533   
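The effect of epsilon on accuracy can be sketched with plain Laplace noise on a count of sensitivity 1, where the noise scale is `1 / epsilon`. This is an illustration under that assumption; the noise `PrivateReader` actually adds may be calibrated differently, and `mean_abs_error` is a hypothetical helper.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def mean_abs_error(epsilon, trials, rng):
    # Average size of the noise added to a sensitivity-1 count.
    scale = 1.0 / epsilon
    return sum(abs(laplace_noise(scale, rng)) for _ in range(trials)) / trials

rng = random.Random(0)
for epsilon in (4.0, 0.1):
    err = mean_abs_error(epsilon, 10000, rng)
    print(f"epsilon={epsilon}: mean absolute error ~ {err:.2f}")
```

Under this model the expected absolute error equals `1 / epsilon`, so moving from epsilon 4.0 to 0.1 increases the typical error by a factor of about 40.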


## Metadata

The `PrivateReader` needs metadata that describes the data source.  Differentially private processing needs to know which columns can be used in numeric computations, how sensitive the data is (for example, the allowed range of each numeric column), and which column is the private identifier.  Metadata should be provided by the data owner, and should not be data-dependent.  For example, the acceptable range for the `age` column should come from domain knowledge, and should not be the actual minimum and maximum values found in the data:

In [14]:
meta = CollectionMetadata.from_file('readers/PUMS.yaml')
print(meta)

PUMS.PUMS [1000 rows]
	age [int] (0,100)
	sex (card: 0)
	educ [int] (unbounded)
	race (card: 0)
	income [int] (0,500000)
	married (boolean)
	*pid [int] (unbounded)
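To show why the bounds matter, here is a minimal sketch of how a declared range can bound the sensitivity of a sum. The `age` range of (0, 100) comes from the metadata above; the sample ages, the helper names, and the assumption that out-of-range values are clamped are made up for illustration and may not match what `PrivateReader` does.

```python
def clamp(value, lower, upper):
    # Force a value into the declared range.
    return max(lower, min(upper, value))

def sum_sensitivity(lower, upper):
    # One person can change a sum of clamped values by at most
    # the largest magnitude allowed by the bounds.
    return max(abs(lower), abs(upper))

AGE_BOUNDS = (0, 100)        # declared range from the metadata
ages = [23, 41, 67, 130]     # made-up values; 130 falls outside the range

clamped = [clamp(a, *AGE_BOUNDS) for a in ages]
print(clamped)                       # [23, 41, 67, 100]
print(sum_sensitivity(*AGE_BOUNDS))  # 100
```

Because the bounds are fixed in advance, the sensitivity does not depend on any individual's value. Deriving the range from the observed minimum and maximum would leak information about the records that set those extremes.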
