# SQL Queries with Differential Privacy

## Read data
SmartNoise supports issuing SQL queries against CSV files, database engines, and Spark clusters.

In [3]:
import pandas as pd

pums = pd.read_csv('../../datasets/PUMS.csv')

print(pums)

     age  sex  educ  race   income  married
0     59    1     9     1      0.0        1
1     31    0     1     3  17000.0        0
2     36    1    11     1      0.0        1
3     54    1    11     1   9100.0        1
4     39    0     5     3  37000.0        0
..   ...  ...   ...   ...      ...      ...
995   73    0     3     3  24200.0        0
996   38    1     2     3      0.0        0
997   50    0    13     1  22000.0        1
998   44    1    14     4    500.0        1
999   29    1    11     1  66400.0        0

[1000 rows x 6 columns]


## Execute DP query

Open a private SQL connection

In [7]:
from snsql import Privacy, from_connection

metadata = '../../datasets/PUMS.yaml'


private_reader = from_connection(
    pums, metadata=metadata, 
    privacy=Privacy(epsilon=1.0, delta=1/1000)
)

query = 'SELECT married, COUNT(*) AS n FROM PUMS.PUMS GROUP BY married'

result_dp = private_reader.execute_df(query)
print(result_dp)

  married    n
0       0  454
1       1  548


**Note:** The example above queries the PUMS microdata for the count of individuals by marital status. Because noise is drawn anew each time the private query runs, repeated executions return slightly different answers.

In [8]:
result_dp = private_reader.execute_df(query)
print(result_dp)

  married    n
0       0  449
1       1  549


The `PrivateReader` lets you swap in differentially private results wherever exact results are currently used. It accepts additional parameters that control the privacy/accuracy tradeoff: a smaller epsilon gives more privacy but less accuracy.

In [12]:
for epsilon in [4.0, 0.1]:
    private_reader = from_connection(
        pums, metadata=metadata, 
        privacy=Privacy(epsilon=epsilon, delta=1/1000)
    )
    print(f"epsilon is: {epsilon}")
    result = private_reader.execute_df(query)
    print(result)
    print()

epsilon is: 4.0
  married    n
0       0  451
1       1  548

epsilon is: 0.1
  married    n
0       0  453
1       1  534
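To make the effect of epsilon more concrete, here is a minimal sketch of a noisy count using the textbook Laplace mechanism, where the noise scale is `sensitivity / epsilon`. This is an illustration only: the helper names (`laplace_noise`, `noisy_count`, `mean_abs_error`) are hypothetical, and the mechanism and budget accounting that `snsql` actually applies may differ.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    # Textbook Laplace mechanism: one row changes a count by at most 1,
    # so the noise scale is sensitivity / epsilon.
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


def mean_abs_error(true_count, epsilon, trials=2000, seed=0):
    # Average distance between the noisy and the true count over many draws.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return total / trials


for epsilon in [4.0, 0.1]:
    error = mean_abs_error(451, epsilon)
    print(f"epsilon={epsilon}: mean absolute error ~ {error:.2f}")
```

Under these assumptions the expected absolute error equals the noise scale, so moving from epsilon 4.0 to epsilon 0.1 increases the typical error by a factor of about 40.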



## Execute normal query 
Calling the underlying `Reader` directly returns the exact result, with no noise added.

In [13]:
result = private_reader.reader.execute_df(query)

print(result)

   married    n
0        0  451
1        1  549
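With the exact counts available, it can help to quantify how far a private answer landed from the truth. The sketch below copies the counts printed above (the first differentially private run at epsilon 1.0 and the exact result) into plain dictionaries and computes the absolute error per group. The variable names are illustrative and not part of `snsql`.

```python
# Counts keyed by the value of `married`, copied from the printed outputs.
exact_counts = {0: 451, 1: 549}
dp_counts = {0: 454, 1: 548}

# Absolute error of the private answer for each group.
abs_error = {
    group: abs(dp_counts[group] - exact_counts[group])
    for group in exact_counts
}

print(abs_error)  # {0: 3, 1: 1}
```

Comparisons like this are only possible for someone who has access to the exact data, such as the data owner evaluating accuracy before release.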


## Metadata file

The `PrivateReader` needs metadata that describes the data source. Differentially private processing needs to know which columns can be used in numeric computations, the sensitivity of the data, and which column is the private identifier. Metadata should be provided by the data owner and should not be data-dependent. For example, the acceptable range for the `age` column should come from domain knowledge, not from the actual minimum and maximum values in the data:

In [16]:
import snsql
meta = snsql.metadata.Metadata.from_file('../../datasets/PUMS.yaml')
print(meta)

PUMS.PUMS [1000 rows]
	age [int] (0,100)
	sex (card: 0)
	educ (card: 0)
	race (card: 0)
	income [int] (0,500000)
	married (card: 0)
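The bounds in the metadata matter because they cap how much a single row can influence a numeric aggregate. The sketch below shows the general idea, assuming values are clamped to the declared range before summing; `clamp` and `sum_sensitivity` are hypothetical helpers written for illustration, not `snsql` functions, and the library's internal handling may differ.

```python
def clamp(values, lower, upper):
    # Force every value into the declared [lower, upper] range.
    return [min(max(v, lower), upper) for v in values]


def sum_sensitivity(lower, upper):
    # Adding or removing one clamped row changes a sum by at most
    # the largest absolute value allowed by the bounds.
    return max(abs(lower), abs(upper))


# Bounds taken from the metadata printed above: age [int] (0,100).
ages = [59, 31, 36, 130, -5]  # made-up values, two of them out of range
print(clamp(ages, 0, 100))    # [59, 31, 36, 100, 0]
print(sum_sensitivity(0, 100))  # 100
```

Because the bounds are fixed in advance, the sensitivity (and therefore the noise scale) does not reveal anything about the actual data. Deriving bounds from the observed minimum and maximum would leak information about individual rows.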
